Use Fast Data Algorithms

2021/03/22

As an engineer who primarily works with data and databases, I spend a lot of time moving data around, hashing it, compressing it, decompressing it, and generally trying to shovel it between VMs and blob stores over TLS. I am constantly surprised by how many systems only support slow, inefficient, and expensive ways of doing these operations.

In my experience, these poor algorithm choices are orders of magnitude slower than modern alternatives. Indeed, using a fast algorithm can often be the difference between actually doing compression/hashing/encryption and “Eh, I’ll skip it”. In the interest of using more of our CPU time to do useful work instead of melting ice-caps and giving programmers excessively long coffee breaks, we will explore some faster alternatives in this post.

TLDR

| Application | Common Bad Performance Choices | Better Performance Choices | Expected Performance Gain |
| --- | --- | --- | --- |
| Trusted data hashing | md5, sha2, crc32 | xxhash | ~10x |
| Untrusted data hashing | md5, sha2, sha1 | blake3 | ~10x |
| Fast compression | snappy, gzip (zlib) | lz4 | 10x over gzip, ~2x over snappy |
| Good compression | gzip (zlib) | zstd | ~2-10x |
| Best compression | xz (lzma) | zstd -10+ | ~2-10x |
| Java crypto (md5, aes-gcm, etc …) | Built-in JVM crypto | Amazon Corretto Crypto Provider (ACCP) | ~4-10x |

Disclaimer: There are lies, damn lies, and benchmarks from some random person on the internet.

If you are considering taking some of the advice in this post, please remember to test your specific workloads, which might have different bottlenecks. Also, the implementation quality in your particular software stack for your particular hardware matters a lot.

For this post I’ll be playing with a ~5 GiB real-world JSON dataset on my laptop’s Intel Core i7-8565U pinned to 4GHz. Since I want to benchmark the algorithms instead of disks I’ll be pre-touching the file into RAM with vmtouch. Remember that on most modern cloud servers with fast NVMe storage (multiple GiBps) and good page-caching algorithms your disks are likely not your bottleneck.

$ du -shc yelp_academic_dataset_review.json
6.5G    yelp_academic_dataset_review.json
$ vmtouch -t yelp_academic_dataset_review.json
# ...
$ vmtouch yelp_academic_dataset_review.json
           Files: 1
     Directories: 0
  Resident Pages: 1693525/1693525  6G/6G  100%
         Elapsed: 0.1256 seconds

Hashing

I would like to check that this blob of data over here is the same as that data over there.

Trusted Data Hashes

These hash or checksum functions are used to ensure data integrity and usually defend against bugs/bitrot/cosmic rays rather than malicious attackers. The poor choices I typically see are the same ones from the TL;DR table: md5, sha2, and crc32.

Faster Choice

Try xxHash. It is blazing fast, high quality, and the XXH64 variant is usually sufficient for most data integrity checks. It even performs well on small data inputs, where XXH3 is particularly impressive. If you find yourself needing 128 bits of entropy (e.g. if you’re hashing data for a DHT) you can either use the 128-bit variant or XXH3. If you only have an old version available that doesn’t support the new 128-bit variant, two XXH64 runs with different seeds are usually still faster than any other choice.
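
As a concrete example, here is a minimal sketch of a transfer check with the xxHash command line tools (assuming the xxhash package is installed; file names are illustrative):

# Record XXH64 checksums before the transfer ...
$ xxh64sum yelp_academic_dataset_review.json > checksums.xxh64
# ... and verify them on the other side after the transfer
$ xxh64sum -c checksums.xxh64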

Untrusted Data Hashes

While most of the time the threat model for data transfer is “bugs / cosmic rays”, some of the time people want to defend against bad actors. That’s where cryptographic hashes come in, most commonly md5, sha1, and sha2 (usually sha256).

Faster Choice

Try BLAKE3. Yes, it is a new (2020) algorithm and there are concerns about its security margin, but I’m just so tired of waiting for sha256. In practice, this is probably a much better choice than the known-to-be-completely-broken md5, so if you’re reaching for md5 over xxHash because you need a cryptographic alternative, consider blake3 instead. Also, blake3 uses hash trees (merkle trees), which are wonderful when implemented correctly and I wish more systems used them.

The one major downside of blake3, in my opinion, is that at this time (2021) I only know of really good cli, Rust, Go, C, and Python implementations, and I can’t personally vouch for the Java JNI bindings. I’ve only needed it so far for streaming pipeline verification or artifact verification, so the cli and Python implementations are good enough for me.
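
For cli use, the b3sum tool covers the same verify-a-transfer workflow; a minimal sketch (file names are illustrative):

# Hash with BLAKE3 (b3sum uses multiple threads by default) and verify later
$ b3sum yelp_academic_dataset_review.json > checksums.b3
$ b3sum --check checksums.b3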

A quick reminder that there are lots of security-sensitive hashing situations where you do not want a fast hash. For example, when dealing with passwords you want an intentionally slow hash like argon2, bcrypt, PBKDF2, or even just a high number of rounds of SHA-512.

Show me the Data

Expect xxHash to net about a ~10x improvement on MD5 and a ~5-10x improvement on CRC32 depending on your CRC32 implementation (e.g. Java’s is truly terrible). Expect BLAKE3 to be slightly slower than xxHash with a single thread, so only use it if you actually need a cryptographic hash. A simple performance test hashing a 6616 MiB file confirms that we do indeed have 10x performance on the table (note I’m reporting user CPU time since the system time is mostly not attributable to the hash):

| Algo | Hash Time (seconds) | Hash Speed (MiBps) | Cryptographic? | Bits of Entropy |
| --- | --- | --- | --- | --- |
| MD5 | 9.1s | 727 | Not really | 128 |
| CRC32 | 4.8s | 1378 | No | 32 |
| XXH64 | 0.5s | 13232 | No | 64 |
| SHA256 | 27.5s | 240 | Yes | 256 |
| SHA3-256 | 16.8s | 393 | Yes | 256 |
| SHA1 | 10.7s | 618 | Yes* | 160 |
| BLAKE3 | 1.8s | 3675 | Yes | 256 |

Yes that’s right, ~0.5 seconds user CPU time for xxh64 versus ~27 for sha256 and ~9s for md5. If all you need to do is verify a file transfer you could be doing that 10x faster with xxHash.
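
If you want to reproduce these numbers on your own hardware, the shell’s time built-in is enough; a minimal sketch, assuming the coreutils, xxhash, and b3sum CLIs are installed:

# Compare user CPU time for a few hashes over the same pre-touched file
$ time md5sum yelp_academic_dataset_review.json
$ time sha256sum yelp_academic_dataset_review.json
$ time xxh64sum yelp_academic_dataset_review.json
$ time b3sum yelp_academic_dataset_review.json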

The implementation you use in a given language often makes a big difference, e.g. JNI versions that link to fast native code will often significantly out-perform pure Java versions. “But Joey”, you say, “I have to use XYZ algorithm from the early ’00s because of the specification!”. That is unfortunate, but at least make sure you’re using a fast implementation; for example, ACCP will speed up MD5 on most Java VMs by a factor of ~4, and AES-GCM by ~10x while it is at it. ACCP achieves this by … linking in fast native implementations of crypto.

Compression

I like my data transfers fast and don’t like paying for lots of disk or network I don’t need. I heard data compression is a thing.

Most data compresses, especially text (e.g. JSON). The three cases where data probably does not compress are if your data is random, the data was already compressed, or the data was encrypted. Often in databases the metadata around the data (e.g. write times, schemas, etc …) probably compresses even if the data doesn’t. There are three primary measures of a compression algorithm:

  1. Compression ratio: How much smaller is the result than the input?
  2. Compression speed: How quickly can I compress data?
  3. Decompression speed: How quickly can I decompress data?

Depending on the use case, developers usually make some tradeoff between these three metrics. For example, databases doing page compression care most about decompression speed, file transfers care most about compression speed, archival storage cares most about ratio, etc …
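
If you want to see all three numbers for your own data, zstd ships a built-in benchmark mode; a minimal sketch (the level range and file are illustrative):

# Report ratio, compression speed, and decompression speed at levels 1 through 19
$ zstd -b1 -e19 yelp_academic_dataset_review.json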

Fast compression that gives great ratio can significantly improve most workloads but slow compression with bad ratio is painful and makes me sad.

Common Poor Choices

The usual suspects from the TL;DR table: gzip (zlib) and snappy when speed matters, and xz (lzma) when chasing the best possible ratio.

Better Choice - I care about ratio

Try zstd. To spend more compression CPU time for a better compression ratio, increase the compression level or the block size. I find that in most database workloads the default level (3) or even level 1 is a good choice for write-heavy datasets (getting closer to lz4) and level 10 is good for read-heavy datasets (surpassing gzip in every dimension). Note that zstd strictly dominates gzip: it is faster and gets a better ratio.
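
A minimal sketch of trading CPU for ratio with the zstd CLI (levels and file names are illustrative):

# Low level: cheap compression for write-heavy data
$ zstd -3 yelp_academic_dataset_review.json -o reviews.l3.zst
# Higher level: spend CPU once for read-heavy data
$ zstd -10 yelp_academic_dataset_review.json -o reviews.l10.zst
# Decompression stays fast regardless of the level used to compress
$ zstd -d reviews.l10.zst -o reviews.json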

Even better: zstd supports training dictionaries which can really come in handy if you have lots of individually small but collectively large JSON data (looking at you tracing systems).
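
A minimal sketch of dictionary training with the zstd CLI (the directory layout and names are illustrative):

# Train a shared dictionary from many small JSON samples
$ zstd --train traces/*.json -o traces.dict
# Compress and decompress individual small files using that dictionary
$ zstd -D traces.dict traces/span-1234.json -o span-1234.json.zst
$ zstd -D traces.dict -d span-1234.json.zst -o span-1234.json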

Better Choice - I only care about speed

Try lz4. With near memory speeds and decent ratio this algorithm is almost always a safe choice over not compressing at all. It has excellent language support and is exceptionally good for real-time compression/decompression as it is so cheap.
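
A minimal sketch with the lz4 CLI (file names are illustrative):

# Compress and decompress at close to memory bandwidth
$ lz4 yelp_academic_dataset_review.json reviews.json.lz4
$ lz4 -d reviews.json.lz4 reviews.json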

Better Choice - I want to shovel data from here to there

Try zstd --adapt. This feature automatically adapts the compression ratio as the IO conditions change to make the current optimal tradeoff between CPU and “keeping the pipe fed”.

For example, if you have very little free CPU on your system but a fast network (looking at you, i3en instances), zstd --adapt will automatically compress at a lower level to minimize total transfer time. If you have a slow network and spare CPU, it will automatically compress at a higher level.
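
A minimal sketch of streaming a file to another machine with --adapt (the host name and paths are illustrative):

# Let zstd pick the compression level based on how fast the pipe is draining
$ zstd --adapt --stdout yelp_academic_dataset_review.json \
    | ssh dest-host 'zstd -d --stdout > /data/yelp_academic_dataset_review.json'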

Show me the Data

Compression is a bit trickier to measure because the read-to-write ratio matters a lot: if you can get a better ratio, it may be worth paying for a more expensive compression step in exchange for cheaper decompression.

Historically we had to make tradeoffs between ratio, compression speed and decompression speed, but as we see with this quick benchmark we no longer need to make tradeoffs. These days (2021), I just reach for zstd with an appropriate level or lz4 if I really need to minimize CPU cost.

First let’s look at the results for the 6.5GiB review dataset.

| Algo | Ratio | Compression Speed (MiBps) | Decompression Speed (MiBps) | Transfer Time |
| --- | --- | --- | --- | --- |
| gzip | 0.41 | 21 | 168 | 295s |
| lz4 | 0.65 | 361 | 1760 | 36s |
| zstd | 0.38 | 134 | 730 | 47s |
| xz | ?? | ?? | ?? | ?? |

As xz was estimating 1.5 hours to compress and I didn’t have that kind of time, I ran it on a smaller 380 MiB dataset:

| Algo | Ratio | Compression Speed (MiBps) | Decompression Speed (MiBps) | Transfer Time |
| --- | --- | --- | --- | --- |
| xz | 0.18 | 1.34 | 92 | 299s |
| zstd -10 | 0.26 | 15.4 | 713 | 24s |
| zstd -19 | 0.21 | 1.18 | 404 | 24s |

As expected, lz4 is the fastest choice by a lot while still cutting the dataset in half, followed by zstd. One of the really useful things about zstd is that I no longer reach for specialty compressors depending on the job; I just change the level/block sizes and get the trade-off I want.

Pipeline

Now that we have fast algorithms, how we wire them together matters. One of the most common performance mistakes I see is doing a single step of a data movement at a time, for example decrypting a file to disk, then decompressing it, then checksumming it. Since the intermediate products must hit disk and the steps run sequentially, this necessarily slows down your data transfer.

When data transfer is slow, there are usually either slow disks or slow Java heap allocations in the way. I’ve found that if I can structure my pipelines as unix pipelines of fast C programs (or Python with native extensions), with the output from one stage always flowing to the input of the next, I can upload and download data much more efficiently by doing everything (IO, decrypt, decompress, checksum) in parallel.

For example, something like the following will outperform almost any Java S3 upload at putting 100 files in S3 along with whole-object checksums:

# Assumption: You have xargs, xxh64sum, zstd and awscli installed

# Simple version
$ DIR=.

# Checksum each file in parallel and upload a single sorted checksum manifest
$ find $DIR -maxdepth 1 -type f | xargs -IX -P 8 xxh64sum X | sort -k 2 | aws s3 cp - s3://bucket/prefix/checksums
# Compress each file in parallel and upload it
$ find $DIR -maxdepth 1 -type f | xargs -IX -P 8 bash -c 'zstd --adapt X --stdout | aws s3 cp - s3://bucket/prefix/data/X.zst'

# Or the full pipeline option (full object hash and compress at the same time)
$ find $DIR -maxdepth 1 -type f \
| xargs -IX -P 8 \
bash -c 'export URL=s3://bucket/prefix; cat X | tee >(xxh64sum - | aws s3 cp - ${URL}/X.zst.xxh) | zstd --adapt - --stdout | aws s3 cp - ${URL}/X.zstd'

Why do fast data algorithms matter?

It matters because most systems still choose the slow options, making routine development activities take longer than they need to on modern cloud infrastructure (fast networks). For example:

A note about implementation availability

A major disadvantage of using good algorithms is that they may not always show up in your language or OS by default. I’ve had good experiences with the following implementations: