Logwriter here! How are you?
Evaluating storage efficiency isn’t an easy task. In my opinion, the hardest part is choosing a data set that best represents a real system.
For example, I wouldn’t say that measuring storage savings from a YCSB (Yahoo! Cloud Serving Benchmark) data set is the best approach, unless your data set looks like a YCSB data set.
The Data Set
For my testing I decided to use a public data set from Reddit: public comments from May and June 2016. It is a single collection with 131,079,747 documents. Figure 1 shows a sample document and its fields.
The data was ingested into MongoDB using the mongoimport tool.
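A minimal sketch of the import, assuming the dump is newline-delimited JSON; the file name and the database/collection names (reddit, comments) are my own choices for illustration, not from the original setup:

```shell
# Import newline-delimited JSON into a local MongoDB instance.
# Host, database, collection, and file names are illustrative.
mongoimport --host localhost:27017 \
  --db reddit --collection comments \
  --file reddit_comments.json
```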
MongoDB Cluster Topology
I have a MongoDB cluster made of a single Replica Set with 3 members: 1 Primary, 1 Secondary, and 1 Arbiter.
The Primary holds a copy of the database and handles all read and write requests to it.
The Secondary also holds a copy of the database, kept in sync with the Primary through replication.
The Arbiter holds no data; it only takes part in elections to choose a new Primary if the current Primary fails.
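A sketch of initiating such a Replica Set from the mongo shell; the hostnames and the set name "rs0" are illustrative assumptions:

```shell
# Initiate a 3-member replica set: two data-bearing nodes plus an arbiter.
# Hostnames and the set name "rs0" are placeholders.
mongo --eval '
rs.initiate({
  _id: "rs0",
  members: [
    { _id: 0, host: "node1:27017" },                    // becomes Primary
    { _id: 1, host: "node2:27017" },                    // Secondary
    { _id: 2, host: "node3:27017", arbiterOnly: true }  // Arbiter, no data
  ]
})'
```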
WiredTiger (WT) Compression
WiredTiger (WT) is the default storage engine for MongoDB 3.2. It’s completely different from MMAPv1: it delivers document-level concurrency and stores data more efficiently than MMAPv1 because it supports compression.
You can choose between two different compression libraries: snappy or zlib. The default is snappy.
Snappy is a data compression/decompression library written by Google; it favors speed over compression ratio.
Zlib is a data compression/decompression library written by Jean-loup Gailly and Mark Adler; it compresses better but costs more CPU.
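To get a feel for what a compression library does with repetitive document data, here is a rough sketch using gzip, which implements DEFLATE, the algorithm zlib provides (snappy doesn’t ship a standard CLI, so it’s left out; file names are arbitrary):

```shell
# Generate ~1MB of repetitive JSON-like text, then compress it with
# gzip (DEFLATE, the algorithm implemented by zlib).
yes '{"author":"user1","body":"some comment text","score":1}' \
  | head -c 1048576 > sample.json
gzip -c sample.json > sample.json.gz
wc -c sample.json sample.json.gz
```

On highly repetitive data like this, the compressed file ends up a tiny fraction of the original; real comment data compresses less dramatically, which is why measuring on a representative data set matters.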
All-Flash FAS (AFF) Storage Efficiency
ONTAP 9.0 data-reduction technologies, including inline compression, inline deduplication, and inline data compaction, can provide significant space savings. Savings can be further increased by using NetApp Snapshot® and NetApp FlexClone® technologies.
ONTAP 9.0 introduced a new storage efficiency technique known as “inline data compaction”. It takes more than one logical block and, when they fit, stores them together in a single physical 4KB block.
Let’s say your NetApp AFF running ONTAP 9.0 gets the following write requests:
How will these write requests be handled by inline compression and inline data compaction? Take a look at Figure 4.
In this example, after the requests are broken into 4KB WAFL blocks, 11 blocks would be needed to store the data. Applying inline adaptive compression brings the required number of blocks down to 8, and after inline data compaction the data is stored in just 4 physical disk blocks.
Which Storage Efficiency Technology Should I Use in My Environment?
I hate to answer a question like that with “It depends,” but unfortunately that is the real answer. Let me walk you through the process that led me to conclude that “it depends.”
I’ve done 24 different tests using the Reddit public data set.
WiredTiger has a parameter, leaf_page_max, that controls the maximum on-disk page size. The larger the value, the better your read performance in a sequential workload. The smaller the value, the better your read performance for a random workload. By default, leaf_page_max is set to 32KB.
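You can check the value in effect for a collection from its WiredTiger creation string; the database and collection names below are the ones I assumed earlier for the import:

```shell
# Print the WiredTiger creation options for a collection;
# look for leaf_page_max in the output.
mongo reddit --quiet --eval '
print(db.comments.stats().wiredTiger.creationString)'
```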
If we turn off all the efficiencies (in MongoDB and in ONTAP), the Reddit data set used 171GB of space.
Keeping leaf_page_max at the default, turning compression on in MongoDB, and keeping efficiency off in ONTAP, the Reddit data set used 88GB.
WiredTiger compression is an on-disk feature: it keeps data compressed on disk, but not in memory. So you spend CPU time compressing the data before it is sent to disk, and when you read data from disk to populate your cache, you must decompress the block before making it available in cache.
ONTAP has been built to provide storage efficiency without impacting the performance of your database. So, if you want to let ONTAP do the space saving for you, you need to change leaf_page_max from 32KB to 4KB.
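A sketch of creating a collection with 4KB leaf pages and no block compressor, leaving the space savings to the array; the names are the same illustrative ones as before, and note that these options are fixed at creation time, so an existing collection would need to be rebuilt (e.g. via dump and restore):

```shell
# Create a collection with 4KB WiredTiger leaf pages and
# compression disabled (empty block_compressor).
mongo reddit --eval '
db.createCollection("comments", {
  storageEngine: {
    wiredTiger: {
      configString: "leaf_page_max=4KB,block_compressor="
    }
  }
})'
```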
With that setting, the Reddit data set used 130GB. That is somewhat less savings than MongoDB’s compression delivers, but your application will experience consistent and predictable low latency.
Please let me know if you have any questions, and see you next post!
Thanks for reading!