Add Intel vs. Arm blog post (#1753)

## Description

This uses Corso/Kopia microbenchmarks to compare the performance of
Intel vs. ARM architectures.

---
slug: benchmarking-intel-vs-arm-for-backup-applications
title: "Benchmarking Intel vs. ARM (Graviton) for Splitting, Compressing, Hashing, and Encrypting Data"
description: "Throughput and cost-effectiveness comparison of Intel vs. Arm using compute-intensive data-processing microbenchmarks."
authors: ntolia
tags: [corso, intel, arm, benchmarks]
date: 2022-12-12
image: ./images/benchmarking/intel_vs_arm_heatmap.png
---

This article benchmarks the performance of the compute-intensive
components found in modern backup applications on Intel and ARM
(Graviton) architectures. The performance and cost-efficiency results
are illuminating, as we expected most third-party hashing and
encryption libraries to be optimized primarily for Intel rather than
ARM.

<!-- truncate -->

## Why we care

To improve security and efficiency (storage used, network transfers)
as well as to reduce cost, Corso, our free open-source backup tool for
Microsoft 365, uses encryption and deduplication under the hood. From
a security perspective, this ensures that no third party, including
your cloud provider, can ever read your data. From an efficiency
perspective, no duplicate copies of your data are stored or even
transferred to the cloud provider. Therefore, if you accidentally have
duplicated files in OneDrive, or the same file or email shows up in
multiple backups, only one copy of the data will be stored in the
cloud, and it will only be transferred once.

However, given the large amounts of data that
[Corso](http://corsobackup.io) might transfer, we wanted to look at
both the performance and cost-efficiency of running Corso on ARM
(Graviton) vs. Intel, and the results are quite striking!

The takeaways from the benchmarking include:

- Overall, ARM outperforms Intel for data-intensive tasks required by
backup applications, and the difference can be as high as **350%**.

- ARM is a much better fit for data-intensive processing that can be
parallelized than for single-threaded workloads.

- The lower cost of ARM further amplifies its advantages if you are
focused on delivering more throughput for every dollar spent.

- While ARM might not be ideal for some parts of the data-intensive
deduplication pipeline, the pipeline as a whole still comes out ahead
on ARM.

*While our benchmarking strongly suggests that ARM is more efficient
for these specific tasks, if compute time is a significant part of
your spend, we encourage you to do your own benchmarking. If you get
different results, we’d love to hear about it! [Join our
Discord](https://discord.gg/63DTTSnuhT) to continue the conversation.*

## Experimental setup

To ensure that our results would be as comparable as possible, we
picked an Intel VM and an ARM VM running in AWS us-east-1. These are
the configurations of the systems:

| Architecture | AWS Instance Type | vCPUs | Memory (GiB) | Cost ($/hour) |
| --- | --- | --- | --- | --- |
| Intel | m6i.xlarge | 4 | 16 | $0.192 |
| ARM | m6g.xlarge | 4 | 16 | $0.154 |

## Experiment

In Corso, we use [Kopia](https://kopia.io/) as our data-processing
engine. It first splits the data, dividing larger files into smaller
chunks for efficient deduplication. The split data is then
deduplicated through the use of cryptographically strong hashing
algorithms, compressed, and finally encrypted.

Based on this data pipeline, we split our measurements into five
parts to microbenchmark the different stages of the above process.
For each of the below microbenchmarks, we also varied internal
benchmark parallelism between 1, 4, and 8 parallel goroutines (a
minimal sketch of this measurement pattern follows the list below).

- Splitting: Buzhash vs. Rabin-Karp algorithms
- Compression: Different variants of deflate, S2, lz4, zstd, gzip, and pgzip
- Hashing: Different BLAKE and HMAC-SHA variants
- Encryption: AES256-GCM and ChaCha20-Poly1305
- Hashing + Encryption: Combination of the above Encryption and Hashing algorithms
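
To make the measurement pattern concrete, here is a minimal Go sketch
of the fan-out: each goroutine repeatedly runs one pipeline stage over
a fixed buffer, and aggregate throughput is reported in MiB/s. This is
not the actual Kopia benchmark code; the 4MiB buffer, the iteration
count, and SHA-256 as the stand-in stage are illustrative assumptions.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"sync"
	"time"
)

// run fans out `parallelism` goroutines that each hash a fixed buffer
// `iters` times, then reports aggregate throughput in MiB/s.
func run(parallelism, iters int, buf []byte) float64 {
	var wg sync.WaitGroup
	start := time.Now()
	for w := 0; w < parallelism; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < iters; i++ {
				sha256.Sum256(buf) // stand-in for one pipeline stage
			}
		}()
	}
	wg.Wait()
	totalMiB := float64(parallelism) * float64(iters) * float64(len(buf)) / (1 << 20)
	return totalMiB / time.Since(start).Seconds()
}

func main() {
	buf := make([]byte, 4<<20) // one 4MiB block per iteration
	for _, p := range []int{1, 4, 8} {
		fmt.Printf("parallelism=%d: %.0f MiB/s\n", p, run(p, 64, buf))
	}
}
```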

## ARM vs. Intel performance results

Given the large number of results collected (there are 372 distinct
performance data points!), we will only present a subset of the
results for the splitting, compression, and combined hashing and
encryption microbenchmarks.

Within these results, we focus on the numbers with parallelism set to
8 and on throughput (higher is better) measured in MiB/s. For easier
visualization, we also present the relative performance, where we
compare the throughput of ARM vs. Intel. For a more price-accurate
comparison, we also present the price-adjusted relative performance,
where we see how many more MiB/s we can process for the same dollar.
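
To make the price-adjusted metric concrete, the sketch below divides
each platform's throughput by its hourly instance price from the setup
table above; the raw throughput numbers here are made up purely for
illustration.

```go
package main

import "fmt"

func main() {
	// Hourly instance prices from the setup table above ($/hour).
	const intelPrice, armPrice = 0.192, 0.154

	// Hypothetical raw throughputs (MiB/s), for illustration only.
	intelMiBps, armMiBps := 1000.0, 1500.0

	relative := armMiBps / intelMiBps // raw relative performance: 150%
	// MiB/s per dollar on each platform, expressed as a ratio (~187%).
	priceAdjusted := (armMiBps / armPrice) / (intelMiBps / intelPrice)

	fmt.Printf("relative: %.0f%%, price-adjusted: %.0f%%\n",
		relative*100, priceAdjusted*100)
}
```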

While the results shown here are truncated for brevity, feel free to
give us a shout on [Discord](https://discord.gg/63DTTSnuhT) if you
want to see the raw data.

### Splitter

Let’s first look at splitting, a critical part of data deduplication
that uses rolling-hash functions. While there is a lot more
information available elsewhere, a rolling hash is a type of hash
function that operates on a sliding window of data, allowing it to
efficiently compute the hash value for a given chunk of data without
having to process the entire dataset. This is useful in deduplication
because it allows Corso and Kopia to quickly compare the hash values
of chunks and detect identical data, even when that data isn't the
same size or doesn't start at the same point in the dataset.
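
For intuition, here is a toy rolling hash in the Rabin-Karp style. It
is not Kopia's splitter implementation; the window size, modulus, and
boundary mask are arbitrary choices to keep the sketch short, but the
O(1) sliding-window update is the property that matters.

```go
package main

import "fmt"

// A toy Rabin-Karp-style rolling hash over a fixed 16-byte window.
// Sliding the window forward by one byte updates the hash in O(1)
// instead of rehashing the whole window.
const (
	base   = 257
	mod    = 1_000_000_007
	window = 16
)

func main() {
	data := []byte("an example stream of bytes to be chunked for deduplication")

	// pow = base^window mod mod: the coefficient of the byte that
	// falls out of the window at each step.
	pow := uint64(1)
	for i := 0; i < window; i++ {
		pow = pow * base % mod
	}

	var h uint64
	for i := range data {
		h = (h*base + uint64(data[i])) % mod
		if i >= window {
			out := uint64(data[i-window]) * pow % mod
			h = (h + mod - out) % mod
		}
		// A content-defined chunker cuts wherever the hash matches a
		// boundary condition; the mask here is purely illustrative.
		if i >= window-1 && h&0x0F == 0 {
			fmt.Printf("possible chunk boundary after byte %d\n", i)
		}
	}
}
```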

The size of the blocks and the algorithm used to divide the dataset
are important considerations in deduplication, as they can affect
both the speed and the efficiency of deduplication. For these results,
we compare the performance of Buzhash vs. the Rabin-Karp algorithm
with block sizes of 1MiB, 2MiB, 4MiB, and 8MiB.

![Splitter throughput](./images/benchmarking/splitter_througput.png)

![Splitter relative performance](./images/benchmarking/splitter_relative.png)

As we can see in the results, Buzhash is faster than Rabin-Karp for
all block sizes. Further, for Buzhash, ARM is 150% faster than Intel
on an absolute basis and 190% faster on a per-dollar, price-adjusted
basis.

While not presented here, it should be noted that with parallelism of
1, splitting on ARM was only 60% and 80% of Intel’s performance for
Buzhash and Rabin-Karp, respectively. This shows that ARM might be a
poor choice for single-threaded applications.

### Compression

For backups, we're limited to using lossless compression algorithms,
as there is no room for data degradation. Compression algorithm
selection is driven by various tradeoffs such as effectiveness at
reducing data size, computational efficiency, and speed. Some
algorithms may be effective at reducing data size but may require a
lot of computational resources, while others may be less effective at
reducing the size of a file but much faster to execute. For this
experiment, the compression algorithms were run on a large file
(100MiB) that was generated from random data and was therefore not
compressible.
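
As a rough sketch of this kind of measurement, the following program
times Go's standard-library gzip over 100MiB of random (incompressible)
bytes. The real benchmarks exercise many more algorithms and settings;
this only shows the shape of the experiment, not the harness we used.

```go
package main

import (
	"compress/gzip"
	"crypto/rand"
	"fmt"
	"io"
	"time"
)

func main() {
	// 100MiB of random bytes: effectively incompressible, so this
	// measures raw compressor speed rather than compression ratio.
	buf := make([]byte, 100<<20)
	if _, err := rand.Read(buf); err != nil {
		panic(err)
	}

	start := time.Now()
	zw := gzip.NewWriter(io.Discard)
	if _, err := zw.Write(buf); err != nil {
		panic(err)
	}
	if err := zw.Close(); err != nil {
		panic(err)
	}
	elapsed := time.Since(start)

	fmt.Printf("gzip: %.1f MiB/s\n", 100/elapsed.Seconds())
}
```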

![Compression throughput](./images/benchmarking/compression_througput.png)

![Compression relative performance](./images/benchmarking/compression_relative.png)

When we look at the absolute and relative performance results above,
we see that the absolute performance of various compression algorithms
on ARM is almost equivalent (90%+) to Intel's in the worst case
(`gzip`) and up to 300% better (`s2-parallel-4`). The difference is
even starker on a price-adjusted basis, where ARM delivers a 380%
improvement over Intel per dollar.

While not presented here, it should again be noted that with
parallelism of 1, compression on ARM was slower for some compression
algorithms (`gzip`, `zstd`) and some variants
(`deflate-best-compression`, `pgzip-best-compression`). This once
again highlights the power of ARM when used in multi-threaded
applications and the need to benchmark for your workload.

### Hashing + encryption

Two critical components are left to complete the deduplication data
path in Corso and Kopia. A unique key (hash) for the data needs to be
generated for comparing identical data chunks, and this is
accomplished via either the `BLAKE` hashing algorithm or a `SHA`
hashing algorithm with an HMAC. HMAC, or Keyed-Hash Message
Authentication Code, uses a cryptographic hash function in combination
with a secret key. This key is used to authenticate the data being
hashed and can be used later to ensure that it hasn't been tampered
with. The data is then also encrypted using an encryption algorithm
(for example, `AES256-GCM` or `ChaCha20-Poly1305`).
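
To make the two steps concrete, here is a minimal Go sketch that
computes an HMAC-SHA-256 content ID for a chunk and then encrypts the
chunk with AES-256-GCM. Key derivation, nonce handling, and data
layout are simplified assumptions and do not mirror Kopia's actual
repository format.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/hmac"
	"crypto/rand"
	"crypto/sha256"
	"fmt"
)

func main() {
	chunk := []byte("a deduplicated chunk of backup data")

	// Secret keys; a real system would derive and persist these.
	hmacKey := make([]byte, 32)
	encKey := make([]byte, 32) // 32 bytes selects AES-256
	if _, err := rand.Read(hmacKey); err != nil {
		panic(err)
	}
	if _, err := rand.Read(encKey); err != nil {
		panic(err)
	}

	// Step 1: keyed hash of the chunk. Identical chunks produce the
	// same ID, enabling deduplication without exposing plain hashes.
	mac := hmac.New(sha256.New, hmacKey)
	mac.Write(chunk)
	contentID := mac.Sum(nil)

	// Step 2: authenticated encryption of the chunk before upload.
	block, err := aes.NewCipher(encKey)
	if err != nil {
		panic(err)
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		panic(err)
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		panic(err)
	}
	sealed := gcm.Seal(nil, nonce, chunk, nil)

	fmt.Printf("content ID: %x\nciphertext: %d bytes\n", contentID[:8], len(sealed))
}
```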

For this experiment, we benchmarked combinations of different hashing
and encryption algorithms but only present the `SHA`-variant HMAC
results here. All of the `BLAKE` variants on ARM significantly
underperformed Intel (between 20% and 80% of Intel’s performance), and
we believe that this is because the [underlying
library](https://github.com/zeebo/blake3) being used has no
ARM-specific optimizations, while it does have optimizations for Intel
platforms with AVX2/SSE4.1.

For the below results, the first part of the chart label is the
hashing algorithm used (HMAC is used with every SHA variant) while the
second part is the encryption algorithm.

![Hashing and encryption throughput](./images/benchmarking/hash_througput.png)

![Hashing and encryption relative performance](./images/benchmarking/hash_relative.png)

As seen above, with a parallelism of 8, hashing and encryption on ARM
never underperforms Intel on an absolute basis and always outperforms
on a price-adjusted basis. In particular, ARM can outperform Intel
processors by up to 350% on an absolute basis and by up to 440% on a
price-adjusted basis.

When we teased apart the hashing and encryption results, we noticed
that hashing for the `HMAC-SHA-256` variants was up to 700% (absolute)
and 870% (price-adjusted) faster on ARM. In contrast, `AES-256`
encryption throughput on ARM is only 90% that of Intel’s at a
parallelism of 8 but, price adjusted, it can process 120% more data
for the same cost, while `ChaCha20-Poly1305`'s performance is between
40% (absolute) and 50% (price-adjusted) of Intel’s. However, given
that relatively more work is performed in hashing, the difference
disappears when the combined work is taken into account.

While not presented here, it should again be noted that with
parallelism of 1, hashing and encryption on ARM was slower when using
`SHA3` HMAC variants for hashing, but ARM starts to outperform with
parallelism of 4 and above. This once again highlights the
applicability of ARM to multi-threaded applications.

## Conclusion

The above results show how exciting ARM is as an architecture for us
data-processing nerds because of the performance and cost
optimizations possible. The improvement is a result of the hardware
acceleration available on the ARM platform used for benchmarking (AWS
Graviton) and the additional optimizations in the libraries used in
our software. While these things may constantly shift, we're excited
about the advances and the benefits for users of backup software.

If you have questions, want access to raw data, or want to talk about
future benchmarking, join us on
[Discord](https://discord.gg/63DTTSnuhT) or follow us on
[Twitter](https://twitter.com/CorsoBackup) to get the conversation
started. We look forward to hearing from you!

ntolia:
  name: Niraj Tolia
  title: Founder and CEO
  url: https://github.com/ntolia
  image_url: https://github.com/ntolia.png