Rust - parallel topfew on ARM

Lately I have been trying to improve the Rust implementation of Tim Bray's topfew tool. The tool is small enough to grasp easily but still interesting enough to study.

Topfew finds the most frequent values of a field in a file. You can do this with standard Unix tools, but for fun you can try to write an optimized implementation. The Unix command-line version is:

awk '{print $1}' access_log | sort | uniq -c | sort -rn | head -12

Luckily, Dirkjan had already written a Rust implementation, so I could start from that instead of translating Tim's implementation into Rust myself.

The work topfew does is:

read lines of a file
regex split the lines into fields
count the most frequent values.
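The three steps can be sketched in a few lines of Rust. This is only an illustration, not the actual topfew code: the function names `count_field` and `top_n` are made up for this example, and a plain character split stands in for the regex so that the sketch needs nothing beyond the standard library.

```rust
use std::collections::HashMap;
use std::io::{self, BufRead};

/// Count how often each value of the given 1-based field occurs.
/// Fields are separated by a single space or tab. `field` must be >= 1.
fn count_field<R: BufRead>(reader: R, field: usize) -> io::Result<HashMap<String, u64>> {
    let mut counts = HashMap::new();
    for line in reader.lines() {
        let line = line?;
        // `nth` is 0-based, hence `field - 1`.
        if let Some(value) = line.split(|c: char| c == ' ' || c == '\t').nth(field - 1) {
            *counts.entry(value.to_string()).or_insert(0) += 1;
        }
    }
    Ok(counts)
}

/// Return the `n` most frequent values, most frequent first.
fn top_n(counts: &HashMap<String, u64>, n: usize) -> Vec<(&str, u64)> {
    let mut entries: Vec<(&str, u64)> =
        counts.iter().map(|(k, v)| (k.as_str(), *v)).collect();
    // Sort by descending count; ties are broken by key for a stable result.
    entries.sort_by(|a, b| b.1.cmp(&a.1).then(a.0.cmp(b.0)));
    entries.truncate(n);
    entries
}

fn main() -> io::Result<()> {
    // A tiny in-memory stand-in for an access log.
    let log = "10.0.0.1 GET /a\n10.0.0.2 GET /b\n10.0.0.1 GET /c\n";
    let counts = count_field(log.as_bytes(), 1)?;
    for (value, count) in top_n(&counts, 12) {
        println!("{} {}", count, value);
    }
    Ok(())
}
```

This version allocates a new String for every line and for every key lookup, which is exactly the kind of overhead the improvements below go after.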

I was able to make some improvements to the Rust implementation:

Tim's implementation used \s+ as the field separator, but the expression is now [\t ]. This is faster to match and is correct for reading TSV files.
Memory allocation can be avoided in several places: a) while reading and matching input lines, and b) while creating the compound key.
The counting uses a hash map. For this workload the ahash hash map performs better than the one in the Rust standard library.
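To illustrate the allocation point, here is a minimal sketch under a few assumptions: it counts only the first field, it uses the standard library HashMap rather than ahash, and the function name `count_first_field` is invented for this example. The idea is to reuse one line buffer for the whole input and to allocate an owned key only the first time a value is seen.

```rust
use std::collections::HashMap;
use std::io::{self, BufRead};

/// Count the first field of every line with as few allocations as possible.
fn count_first_field<R: BufRead>(mut reader: R) -> io::Result<HashMap<String, u64>> {
    let mut counts: HashMap<String, u64> = HashMap::new();
    let mut line = String::new(); // one buffer, reused for every line
    loop {
        line.clear();
        if reader.read_line(&mut line)? == 0 {
            break; // end of input
        }
        // Borrow the key out of the line buffer; no copy is made here.
        let key = line
            .trim_end_matches(|c: char| c == '\n' || c == '\r')
            .split(|c: char| c == ' ' || c == '\t')
            .next()
            .unwrap_or("");
        // Look up by `&str` first; allocate an owned key only for new values.
        if let Some(count) = counts.get_mut(key) {
            *count += 1;
        } else {
            counts.insert(key.to_owned(), 1);
        }
    }
    Ok(counts)
}

fn main() -> io::Result<()> {
    let log = "10.0.0.1 GET /a\n10.0.0.2 GET /b\n10.0.0.1 GET /c\n";
    let counts = count_first_field(log.as_bytes())?;
    println!("{} distinct values", counts.len());
    Ok(())
}
```

With the ahash crate added as a dependency, the map type could be swapped for its drop-in replacement without changing the counting loop. That swap is not shown here so the sketch stays free of external crates.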

My improvement ideas came from looking at memory allocation (and guessing) and from running cargo flamegraph. If you're familiar with sampling profilers, flamegraph is super simple to use on Linux.

Topfew's work consists of two phases: 1) count the values and 2) find the top values. Rayon can run the first phase in parallel over chunks of the input file. The second phase reduces the chunk results into the total count and returns the most frequent values.
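The two phases can be sketched as follows. To keep the example free of external crates, scoped threads from the standard library (Rust 1.63 or later) stand in for Rayon, and the input is an in-memory string rather than a file. The helper names `split_chunks`, `count_chunk` and `top_n` are hypothetical, so treat this as an outline of the approach and not as the real implementation.

```rust
use std::collections::HashMap;
use std::thread;

/// Split `input` into at most `parts` chunks that end on line boundaries.
fn split_chunks(input: &str, parts: usize) -> Vec<&str> {
    let target = input.len() / parts.max(1) + 1;
    let mut chunks = Vec::new();
    let mut rest = input;
    while !rest.is_empty() {
        // Extend the cut to the next newline so no line is split in two.
        let cut = if rest.len() <= target {
            rest.len()
        } else {
            match rest.as_bytes()[target..].iter().position(|&b| b == b'\n') {
                Some(pos) => target + pos + 1,
                None => rest.len(),
            }
        };
        let (head, tail) = rest.split_at(cut);
        chunks.push(head);
        rest = tail;
    }
    chunks
}

/// Phase 1: count the first field of every line in one chunk.
fn count_chunk(chunk: &str) -> HashMap<&str, u64> {
    let mut counts = HashMap::new();
    for line in chunk.lines() {
        let key = line.split(|c: char| c == ' ' || c == '\t').next().unwrap_or("");
        *counts.entry(key).or_insert(0) += 1;
    }
    counts
}

/// Run phase 1 on one thread per chunk, then reduce and pick the top `n`.
fn top_n(input: &str, threads: usize, n: usize) -> Vec<(&str, u64)> {
    let chunks = split_chunks(input, threads);
    let partials: Vec<HashMap<&str, u64>> = thread::scope(|s| {
        let handles: Vec<_> = chunks
            .iter()
            .map(|&chunk| s.spawn(move || count_chunk(chunk)))
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().expect("worker thread panicked"))
            .collect()
    });
    // Phase 2: merge the per-chunk counts into one total.
    let mut total: HashMap<&str, u64> = HashMap::new();
    for partial in partials {
        for (key, count) in partial {
            *total.entry(key).or_insert(0) += count;
        }
    }
    // Sort by descending count; ties are broken by key.
    let mut entries: Vec<(&str, u64)> = total.into_iter().collect();
    entries.sort_by(|a, b| b.1.cmp(&a.1).then(a.0.cmp(b.0)));
    entries.truncate(n);
    entries
}

fn main() {
    let log = "10.0.0.1 GET /a\n10.0.0.2 GET /b\n10.0.0.1 GET /c\n10.0.0.3 GET /d\n";
    for (value, count) in top_n(log, 4, 2) {
        println!("{} {}", count, value);
    }
}
```

Each thread only touches its own map, so the counting phase needs no locking. The only sequential work left is the merge, which is proportional to the number of distinct values and not to the size of the input.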

Flamegraph still works after introducing parallelism. This is an added benefit of sampling profilers: they don't care how you run your workload as long as they get a stack trace.

I ran the implementation on a 64 core Graviton ARM m6g.16xlarge EC2 instance and a 96 core Intel Xeon r5dn.24xlarge EC2 instance. The performance of both is comparable at roughly 12 GB/sec. This is with a 20 GB input file read from tmpfs. For comparison, that's roughly 5x faster than the fastest available SSD today. A 16 core Graviton machine would be more than capable of saturating an SSD.

The interesting bit is that while the Graviton ARM CPU is $3/hr, the Xeon is $9.5/hr. If you have similar performance for a workload and a 3x price difference, things become interesting. It looks like, given all the competition (server ARM, AMD and GPUs), Intel will be in trouble unless they can innovate quickly.

Thomas Andreas Jung's Blog

20200602