As an engineer who primarily works with data and databases I spend a lot of time moving data around, hashing it, compressing it, decompressing it and generally trying to shovel it between VMs and blob stores over TLS. I am constantly surprised by how many systems only support slow, inefficient, and expensive ways of doing these operations.
In my experience, these poor algorithm choices are orders of magnitude slower than modern alternatives. Indeed, using a fast algorithm can often be the difference between actually doing compression/hashing/encryption and “Eh, I’ll skip it”. In the interest of using more of our CPU time to do useful work instead of melting ice-caps and giving programmers excessively long coffee breaks, we will explore some faster alternatives in this post.
TLDR
| Application | Common Bad Performance Choices | Better Performance Choices | Expected Performance Gain |
|---|---|---|---|
| Trusted data hashing | `md5`, `sha2`, `crc32` | `xxhash` | ~10x |
| Untrusted data hashing | `md5`, `sha2`, `sha1` | `blake3` | ~10x |
| Fast compression | `snappy`, `gzip` (zlib) | `lz4` | 10x over `gzip`, ~2x over `snappy` |
| Good compression | `gzip` (zlib) | `zstd` | ~2-10x |
| Best compression | `xz` (lzma) | `zstd -10+` | ~2-10x |
| Java crypto (`md5`, `aes-gcm`, etc ...) | Built-in JVM crypto | Amazon Corretto Crypto Provider (ACCP) | ~4-10x |
Disclaimer: There are lies, damn lies, and benchmarks from some random person on the internet.
If you are considering taking some of the advice in this post please remember to test your specific workloads, which might have different bottlenecks. Also the implementation quality in your particular software stack for your particular hardware matters a lot.
For this post I’ll be playing with a ~6.5 GiB real-world JSON
dataset on my laptop’s
Intel Core i7-8565U
pinned to
4GHz.
Since I want to benchmark the algorithms instead of disks I’ll be pre-touching
the file into RAM with vmtouch
. Remember that
on most modern cloud servers with fast NVMe
storage (multiple GiBps) and
good page-caching algorithms your disks are likely not your bottleneck.
```
$ du -shc yelp_academic_dataset_review.json
6.5G    yelp_academic_dataset_review.json
$ vmtouch -t yelp_academic_dataset_review.json
# ...
$ vmtouch yelp_academic_dataset_review.json
  Files: 1
  Directories: 0
  Resident Pages: 1693525/1693525  6G/6G  100%
  Elapsed: 0.1256 seconds
```
Hashing
I would like to check that this blob of data over here is the same as that data over there.
Trusted Data Hashes
These hash or checksum functions are used to ensure data integrity and usually are defending against bugs/bitrot/cosmic rays instead of malicious attackers. I typically see the following poor choices:
- `md5`: This hash is both weak and slow. It does have the advantage of being one of the fastest slow choices in standard libraries and therefore it is somewhat common. Two of my favorite examples are slow Cassandra Quorum Reads and slow S3 upload/download (seriously, just try disabling `md5` on parts and see how much faster S3 is).
- `crc32` or `adler32`: I often hear “well crc32 is fast”, but to be honest I haven’t in practice come across particularly fast implementations in real-world deployments. Sure, theoretically there are hardware implementations, but at least in most Java or Python programs I profile running on most servers I encounter, these checksums are not particularly fast and only generate 32 bits of output, where I absolutely have to care about collisions.
- `sha2`: An oddly common choice for trusted data hashes. Odd because it is slow and you don’t need a cryptographic hash for syncing a file between two hosts within an internal network with an authorized transfer protocol (e.g. `rsync` or backups to `S3` with proper AWS IAM policies). If attackers have access to your backups or artifact store you have a bigger problem than them forging a hash.
Faster Choice
Try xxHash. It is blazing fast, high quality, and the `XXH64` variant is usually sufficient for most data integrity checks. It even performs well on small data inputs, where `XXH3` is particularly impressive. If you find yourself needing 128 bits of entropy (e.g. if you’re hashing data for a DHT) you can use the 128 bit variant of `XXH3` (`XXH128`). If you only have an old version available that doesn’t support the new 128 bit variant, two `XXH64` runs with different seeds is usually still faster than any other choice.
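
As a concrete sketch (assuming the `xxhsum` CLI that ships with xxHash is installed; the `-H` variant selector is only in reasonably recent releases):

```
# xxh64sum typically ships as a symlink to xxhsum and computes the 64 bit variant
$ xxh64sum yelp_academic_dataset_review.json

# The same binary exposes the other variants via -H (0 = XXH32, 1 = XXH64, 2 = XXH128)
$ xxhsum -H2 yelp_academic_dataset_review.json
```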
Untrusted Data Hashes
While most of the time the threat model for data transfer is “bugs / cosmic rays”, some of the time people want to defend against bad actors. That’s where cryptographic hashes come in, most commonly:
- `md5`: It’s slow and not resistant to collisions ... everything I love in a hash function.
- `sha1`: This hash is at least somewhat performant and isn’t as completely broken as `md5`. We should probably still consider it on shaky ground though.
- `sha2` (specifically `sha256`): A ubiquitous hash and, unlike `md5` and `sha1`, it is still considered secure, so that is nice. Unfortunately this hash is noticeably slow when used in performance critical applications.
- `sha3`: The latest hash that NIST standardized (because they had concerns about `SHA-2` that appear at this time somewhat unfounded). I was not able to find it packaged outside of `openssl` and it doesn’t appear to have great `cli` or language support at this time. It’s also still pretty slow.
Faster Choice
Try BLAKE3. Yes, it is a new (2020) algorithm and there are concerns about security margin, but I’m just so tired of waiting for `sha256`. In practice, this is probably a much better choice than the known-to-be-completely-broken `md5`, so if you’re reaching for `md5` over `xxHash` because you need a cryptographic alternative, consider `blake3` instead. Also `blake3` uses hash trees (merkle trees), which are wonderful when implemented correctly and I wish more systems used them.

The one major downside of `blake3` in my opinion is that at this time (2021) I only know of really good `cli`, `Rust`, `Go`, `C`, and `Python` implementations, and I can’t personally vouch for the Java JNI bindings. I’ve only needed it so far for streaming pipeline verification or artifact verification, so the `cli` and `python` implementations are good enough for me.
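
For the CLI/artifact-verification case, a minimal sketch (assuming the official `b3sum` tool is installed):

```
# b3sum hashes with BLAKE3 and uses multiple cores by default
$ b3sum yelp_academic_dataset_review.json

# Record checksums now, verify them later (same workflow as sha256sum -c)
$ b3sum yelp_academic_dataset_review.json > checksums.b3
$ b3sum --check checksums.b3
```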
A quick reminder that there are lots of security-sensitive hashing situations where you don’t want a fast hash. For example, one situation where you want an intentionally slow algorithm is when dealing with passwords. In such cases you want a very slow hash like argon2, bcrypt, PBKDF2, or even just a high number of rounds of SHA-512.
Show me the Data
Expect `xxHash` to net about a ~10x improvement over `MD5` and a ~5-10x improvement over `CRC32`, depending on your `CRC32` implementation (e.g. Java’s is truly terrible). Expect `BLAKE3` to be slightly slower than `xxHash` with a single thread, so only use it if you actually care about cryptographic hashes. A simple performance test on hashing a 6616 MiB file confirms that we indeed have 10x performance on the table (note I’m reporting user CPU time since the system time is not really something the hash algorithm controls):
| Algo | Hash Time (seconds) | Hash Speed (MiBps) | Cryptographic? | Bits of Entropy |
|---|---|---|---|---|
| MD5 | 9.1s | 727 | Not really | 128 |
| CRC32 | 4.8s | 1378 | No | 32 |
| XXH64 | 0.5s | 13232 | No | 64 |
| SHA256 | 27.5s | 240 | Yes | 256 |
| SHA3-256 | 16.8s | 393 | Yes | 256 |
| SHA1 | 10.7s | 618 | Yes* | 160 |
| BLAKE3 | 1.8s | 3675 | Yes | 256 |
Yes that’s right, ~0.5 seconds of user CPU time for `xxh64` versus ~27s for `sha256` and ~9s for `md5`. If all you need to do is verify a file transfer you could be doing that 10x faster with `xxHash`.
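
If you want to sanity-check numbers like these on your own hardware, a rough sketch (assuming the usual coreutils digests plus `xxhsum` and `b3sum` are installed and the file is already page-cached) is simply:

```
# Compare user CPU time across digests on the same page-cached file
$ time md5sum yelp_academic_dataset_review.json
$ time sha256sum yelp_academic_dataset_review.json
$ time xxh64sum yelp_academic_dataset_review.json
$ time b3sum yelp_academic_dataset_review.json
```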
The implementation language often does matter a great deal, e.g. JNI versions that link to fast native code in Java will often significantly out-perform pure Java versions. “But Joey”, you say, “I have to use XYZ algorithm from the early ’00s because of the specification!”. That is unfortunate, but at least make sure you’re using fast implementations: for example, ACCP will speed up `MD5` on most Java VMs by a factor of ~4, and `AES-GCM` by ~10x while it is at it. ACCP achieves this by … linking in fast native implementations of crypto.
Compression
I like my data transfers fast and don’t like paying for lots of disk or network I don’t need. I heard data compression is a thing.
Most data compresses, especially text (e.g. JSON
). The three cases where data
probably does not compress are if your data is random, the data was already
compressed, or the data was encrypted. Often in databases the metadata around
the data (e.g. write times, schemas, etc …) probably compresses even if the
data doesn’t. There are three primary measures of a compression algorithm:
- Compression ratio: How much smaller is the result than the input?
- Compression speed: How quickly can I compress data?
- Decompression speed: How quickly can I decompress data?
Depending on the use case, developers usually make some tradeoff between these three metrics. For example, databases doing page compression care most about decompression speed, file transfers care most about compression speed, archival storage cares most about ratio, etc ...
Fast compression that gives great ratio can significantly improve most workloads but slow compression with bad ratio is painful and makes me sad.
Common Poor Choices
- `gzip` (`zlib`): A trusty workhorse but also a very slow algorithm. In many cases where your network is fast (1 Gbps+), compressing with gzip can actually hurt your system’s performance (I have seen this numerous times).
- `xz` (`lzma`): An algorithm that has pretty good ratios, but is so slow to compress that in practice the only potential use cases are single-write many-read use cases.
- `snappy`: One of the first “I’ll trade off ratio for speed” algorithms, snappy was good for its time. These days it is almost always outperformed by other choices.
Better Choice - I care about ratio
Try `zstd`. To spend more compression CPU time for a better compression ratio, increase the compression level or increase the block size. I find that in most database workloads the default level (`3`) or even level `1` is a good choice for write heavy datasets (getting closer to `lz4`) and level `10` is good for read heavy datasets (surpassing `gzip` in every dimension). Note that `zstd` strictly dominates `gzip` as it is faster and gets better ratio.

Even better: `zstd` supports training dictionaries, which can really come in handy if you have lots of individually small but collectively large `JSON` data (looking at you, tracing systems).
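
As a rough sketch of what both of those look like with the `zstd` CLI (the trace file paths are just placeholders):

```
# Write heavy data: stay close to lz4 speeds
$ zstd -1 yelp_academic_dataset_review.json -o reviews.fast.zst

# Read heavy data: spend CPU once for a better ratio
$ zstd -10 yelp_academic_dataset_review.json -o reviews.small.zst

# Train a shared dictionary on many small JSON documents, then reuse it
$ zstd --train traces/*.json -o traces.dict
$ zstd -D traces.dict traces/some-trace.json -o some-trace.json.zst
$ zstd -D traces.dict -d some-trace.json.zst -o some-trace.decoded.json
```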
Better Choice - I only care about speed
Try `lz4`. With near memory speeds and decent ratio this algorithm is almost always a safe choice over not compressing at all. It has excellent language support and is exceptionally good for real-time compression/decompression as it is so cheap.
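
A minimal sketch with the `lz4` CLI (the directory and host names are placeholders):

```
# Compress and decompress a file at near memory speed
$ lz4 yelp_academic_dataset_review.json reviews.json.lz4
$ lz4 -d reviews.json.lz4 reviews.decoded.json

# Or keep it streaming, which is where lz4 really shines
$ tar cf - mydir | lz4 | ssh other-host 'lz4 -d | tar xf -'
```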
Better Choice - I want to shovel data from here to there
Try `zstd --adapt`. This feature automatically adapts the compression level as the IO conditions change to make the current optimal tradeoff between CPU and “keeping the pipe fed”. For example, if you have very little free CPU on your system but a fast network (looking at you, `i3en` instances) `zstd --adapt` will automatically compress with a lower level to minimize total transfer time. If you have a slow network and extra CPU it will automatically compress at a higher level.
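
For example, shoveling a directory to another host while letting `zstd` pick the level might look like this sketch (host and path names are placeholders):

```
# --adapt watches how quickly the output drains and raises or lowers the
# compression level to keep the pipe full without starving on CPU
$ tar cf - /var/lib/mydata \
    | zstd --adapt --stdout \
    | ssh other-host 'zstd -d --stdout | tar xf - -C /var/lib/'
```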
Show me the Data
Compression is a bit trickier to measure because the read to write ratio matters a lot, and if you can get a better ratio it might be worth paying for a more expensive compression step in exchange for cheaper decompression.
Historically we had to make tradeoffs between ratio, compression speed and
decompression speed, but as we see with this quick
benchmark
we no longer need to make tradeoffs. These days (2021), I just reach for zstd
with an appropriate level or lz4
if I really need to minimize CPU cost.
First let’s look at the results for the 6.5 GiB review dataset.
| Algo | Ratio | Compression Speed (MiBps) | Decompression Speed (MiBps) | Transfer Time |
|---|---|---|---|---|
| gzip | 0.41 | 21 | 168 | 295s |
| lz4 | 0.65 | 361 | 1760 | 36s |
| zstd | 0.38 | 134 | 730 | 47s |
| xz | ?? | ?? | ?? | ?? |
As `xz` was estimating 1.5 hours to compress and I didn’t have that kind of time, I ran it on a smaller 380 MiB dataset:
| Algo | Ratio | Compression Speed (MiBps) | Decompression Speed (MiBps) | Transfer Time |
|---|---|---|---|---|
| xz | 0.18 | 1.34 | 92 | 299s |
| zstd -10 | 0.26 | 15.4 | 713 | 24s |
| zstd -19 | 0.21 | 1.18 | 404 | 24s |
As expected `lz4` is the fastest choice by a lot while still cutting the dataset in half, followed by `zstd`. One of the really useful things about `zstd` is that I am no longer reaching for specialty compressors depending on the job; I just change the level/block sizes and I can get the trade-off I want.
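
If you want to reproduce this kind of comparison on your own data, a rough sketch (writing to `/dev/null` so disks stay out of the measurement) is:

```
# Rough compression speed comparison on a page-cached file
$ time gzip -c yelp_academic_dataset_review.json > /dev/null
$ time lz4 -c yelp_academic_dataset_review.json > /dev/null
$ time zstd -3 -c yelp_academic_dataset_review.json > /dev/null
$ time zstd -10 -c yelp_academic_dataset_review.json > /dev/null
```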
Pipeline
Now that we have fast algorithms, it matters how we wire them together. One of the most common performance mistakes I see is doing a single step of a data movement at a time, for example decrypting a file to disk, then decompressing it, and then checksumming it. As the intermediate products must hit disk and are produced sequentially, this necessarily slows down your data transfer.
When data transfer is slow there are usually either slow disks or slow Java
heap allocations in the way. I’ve found that if I can structure my pipelines as
unix pipelines of fast C
programs (or Python with native extensions) with the
output from one stage always flowing to the input of the other I can much more
efficiently upload and download data by doing everything (IO, decrypt, decompress, checksum)
in parallel.
For example, something like the following will outperform almost any Java S3 upload at putting 100 files in S3 along with whole object checksums:
```
# Assumption: You have xargs, xxh64sum, zstd and awscli installed

# Simple version: checksum everything, then compress and upload each file
$ DIR=.
$ find $DIR -maxdepth 1 -type f | xargs -IX -P 8 xxh64sum X | sort -k 2 | aws s3 cp - s3://bucket/prefix/checksums
$ find $DIR -maxdepth 1 -type f | xargs -IX -P 8 bash -c 'zstd --adapt X --stdout | aws s3 cp - s3://bucket/prefix/data/X.zst'

# Or the full pipeline option (full object hash and compress at the same time)
$ find $DIR -maxdepth 1 -type f \
    | xargs -IX -P 8 \
    bash -c 'export URL=s3://bucket/prefix; cat X | tee >(xxh64sum - | aws s3 cp - ${URL}/X.zst.xxh) | zstd --adapt - --stdout | aws s3 cp - ${URL}/X.zst'
```
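
The download direction composes the same way; here is a rough sketch under the same assumptions (bucket, prefix and file names are placeholders):

```
# Stream the object out of S3, decompressing and hashing in one pass so no
# intermediate product ever hits disk
$ aws s3 cp s3://bucket/prefix/data/file.json.zst - \
    | zstd -d --stdout \
    | tee >(xxh64sum - > file.json.xxh) \
    > file.json
# Then compare file.json.xxh against the checksum recorded at upload time
```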
Why do fast data algorithms matter?
It matters because most systems choose the slow option and make routine development activities take longer in modern cloud infrastructure (fast networks). For example:
- Backing up data. Backing up 1 TiB of data to a fast blob store (which can sink multiple GiBps) using `gzip` and `sha256` would take around 15 hours. Doing it with `zstd` and `xxhash` takes 2.2 hours (see the rough arithmetic after this list). Timely backups are one of the most important properties of a data system.
- Software packages. By default debian packages are either not compressed or compressed with `xz`. Installing 500 MiB of uncompressed debian packages over a one gig network takes ~5s; with `xz` compression it actually slows down and takes ~1s + ~6s = ~7s, and with `zstd -19` compression it takes ~1s + ~1s = ~2s. If the `sha256` is checked that would add another ~20s for a total of ~30s, versus checking with `blake3`, which adds ~0.1s for a total of ~3s. I’d take the 3s over 30s any day. This matters every time you build container images, bake cloud VM images or bootstrap servers with configuration management (puppet/chef).
- Container Images. By default `docker` uses `gzip` compression with `sha256` checksums, which means decompressing and checksumming 4 GiB of containers takes about 60s of CPU time instead of ~6s with `zstd` and `xxhash` (or ~6.5s with `blake3`). This matters when your docker pull is in the startup path of your application.
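
The backup estimate above is just throughput arithmetic; here is the rough math, assuming the single-threaded speeds measured earlier in this post and a sequential hash-then-compress pipeline:

```
# 1 TiB = 1,048,576 MiB
# gzip   : 1,048,576 MiB /     21 MiBps ≈ 49,900 s ≈ 13.9 h
# sha256 : 1,048,576 MiB /    240 MiBps ≈  4,400 s ≈  1.2 h   -> ~15 h total
# zstd   : 1,048,576 MiB /    134 MiBps ≈  7,800 s ≈  2.2 h
# xxh64  : 1,048,576 MiB / 13,232 MiBps ≈     80 s            -> ~2.2 h total
```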
A note about implementation availability
A major disadvantage of using good algorithms is that they may not always show up in your language or OS by default. I’ve had good experiences with the following implementations: