Rust - parallel topfew on ARM

Lately I've been trying to improve the Rust implementation of Tim Bray's topfew tool. The tool is small enough to grasp easily but still interesting enough to study.

You can do this job with standard Unix tools, but for fun you can try an optimized implementation. The Unix command line version is:
awk '{print $1}' access_log | sort | uniq -c | sort -rn | head -12
Luckily, Dirkjan had already done a Rust implementation, so I could use that instead of translating Tim's implementation into Rust myself.

The work topfew is doing is:
  • read lines of a file
  • regex split the lines into fields
  • count the most frequent values.
I could make a few improvements to the Rust implementation:
  1. Tim's implementation used \s+, but the expression is now [\t ]. This is faster to match and still correct for the TSV file format.
  2. There are multiple ways to avoid memory allocations: a) while reading and matching input lines, b) while creating the compound key.
  3. The counting uses a hashmap. The ahash hashmap performs better than the Rust standard library implementation for this case.
My improvement ideas came from looking at memory allocation (and guessing) and from running cargo flamegraph. If you're familiar with sampling profilers, flamegraph is super simple to use on Linux.

Topfew's work consists of two phases: 1) count the values, and 2) find the top values. Rayon can run the first phase in parallel over chunks of the input file. The second phase reduces the chunk results into the total count and returns the most frequent values.
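The two phases can be sketched in Python (a toy model with made-up data, not Dirkjan's actual Rust code; in the Rust version Rayon's parallel iterator plays the role of map here):

```python
from collections import Counter

def count_chunk(lines):
    # Phase 1: count key occurrences within one chunk
    # (each chunk can be processed by a separate worker)
    return Counter(line.split("\t")[0] for line in lines)

def top_few(chunks, n):
    # Phase 2: merge the per-chunk counts and pick the top n
    total = Counter()
    for partial in map(count_chunk, chunks):  # a parallel map in the real thing
        total += partial
    return total.most_common(n)

log = ["a\tx", "b\ty", "a\tz", "c\tx", "a\ty", "b\tz"]
print(top_few([log[:3], log[3:]], 2))  # [('a', 3), ('b', 2)]
```

Because counting is associative, the per-chunk results can be merged in any order, which is what makes the first phase embarrassingly parallel.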

Flamegraph still works after introducing parallelism. This is an added benefit of sampling profilers: they don't care how you run your workload as long as they get a stack trace.

I ran the implementation on a 64-core ARM Graviton m6g.16xlarge EC2 instance and a 96-core Intel Xeon r5dn.24xlarge EC2 instance. The performance of both is comparable at roughly 12 GB/sec, measured with a 20 GB input file read from tmpfs. For comparison, that's roughly 5x faster than the fastest SSD available today. A 16-core Graviton machine would be more than capable of saturating an SSD.

The interesting bit is that while the Graviton ARM instance costs $3/hr, the Xeon costs $9.5/hr. If you get similar performance for a workload at a 3x price difference, things become interesting. Given all the competition from server ARM, AMD and GPUs, it looks like Intel will be in trouble unless it can innovate quickly.


Leaving Flickr

It is that time of the year: renewing my 2-year Flickr Pro account. The sticker price has increased by 100% for long-term Flickr accounts. This is a good time to reflect on whether I want to stay on Flickr.

I’ve been a paying Flickr user for a long time with more than seven thousand uploaded photos. Not very surprisingly, I use the site to discover photos and to share photos with friends and family.

I get the greatest joy from:
a) going through my old photos, reminiscing about where I was in my life when I took them. As public online photos are carefully curated, this is a 99% positive experience.
b) looking at other peoples’ photos. I did this after leaving South Africa: I’ve seen a lot of historic Cape Town and African wilderness photos. Flickr is a good quiet place for that, with no additional social network noise and distractions.
c) knowing that I can download my photos in case I lose some or all of them.

There are also negative aspects:
a) There is limited trust in Flickr as a company. It has changed hands multiple times. How good is the data protection in the long term? Somebody will monetize my data in obscene ways in the future.
b) Flickr’s owner will go bankrupt one day. Then I’ll have to leave; right now I can still choose whether or not to.
c) When I joined Flickr it had an active community. Now most people have left and gone somewhere else. There isn’t much of a network effect; it’s low interaction and low traffic.

There are basic questions to answer:
a) Somebody will be the marginal Flickr user whose departure pushes the company running Flickr to close shop. If users leave, Flickr dies. This would be sad, as Flickr is a de facto photo archive for the Internet. Do I care enough that this archive stays up?
b) Opportunity cost: pretend I have exactly these $50 to spend. Is this the best way to spend them?
c) Do I have to see this as a pure business transaction? One company bought Flickr from another company. It was sheer luck that Yahoo didn’t close shop already. Companies have their own motivations to sell, run or close a site. Why should I pay money to make this business transaction a success?

a) Yes, but I expect the Internet Archive would take over. It makes more sense to donate to the Internet Archive than to a for-profit company.
b) No, there are more interesting things to spend money on. I could buy 10 magazines, 10 movies or x GB of photo storage for the same amount.
c) Yes, this is a business transaction. At the end of the day they’ll do anything to make ends meet for Flickr, including decisions that will finally destroy it.

The also-good-fallacy

Institutions, including companies, claim that we should support them because they do good. If we follow this thinking, we fall victim to a fallacy. Any large enough organization will do good; this is unavoidable. A large organization makes a lot of decisions, and by chance the outcomes of some of those decisions have to overlap with the general good.

The fallacy is to think that this is the reason the institution exists, that it makes up a significant portion of its actions, and that it is a rational reason to support the institution. All the other decisions are selfish and not aligned with the common good. To state it differently: “I support an organization because it’s mostly selfish and doesn’t align its actions with the general good.” Changing the point of view to look at all of its actions shows how ludicrous the argument is.

To be also good is as profound a statement as to exist. As long as anything exists it will do positive and negative deeds.

It’s sad to stop my Flickr subscription. Somehow the optimism of the internet went away. Openness gave way to monetization in walled gardens. At least now I’m going to have my own walled garden where all the weed I want can grow. I’ll also give $50 to Wikipedia in 2020.


You can do basic math in Python, right?

Let's do a small quiz. I've picked an example of basic math to demonstrate how limited our knowledge of the types we use is. If you are a floating point expert, you can feel good about scoring 100%. The remaining 99.99% of developers can enjoy the ride.
Along the way I point out heuristics I find important to consider while developing code. I think they make success more likely.
def div(a, b):
    return a / b
assert div(1,1) == 1
This is the code I would like to test. Looks pretty simple: we divide a by b. We also have a test.
Branch coverage says we're done.
All looks good, or did we miss something?
Yes, there is more than one float value.

How big is the problem space?

1<<(64 + 64) # I think, haven't read the IEEE standard to give the correct number
Heuristic: Reading documentation is better than not reading documentation.
If we want to tackle a problem we have to know how big it is. If the surface area is small it's easier to know all valid states. For example testing a function that takes a boolean value is easier than a function that takes an integer value.
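As a toy illustration of a small surface area (a hypothetical xor function, not from the text): two boolean arguments give only 2 * 2 = 4 states, so brute force covers them all.

```python
from itertools import product

def xor(a: bool, b: bool) -> bool:
    # hypothetical example function with a tiny input space
    return a != b

# Enumerate all four input combinations exhaustively.
table = {(a, b): xor(a, b) for a, b in product([False, True], repeat=2)}
assert table == {(False, False): False, (False, True): True,
                 (True, False): True, (True, True): False}
```

No such exhaustive table exists for a function over two 64-bit floats, which is the whole problem.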

How bad is that?

from datetime import timedelta
pflop = 1<<50  # comparison: IBM Summit supercomputer, Nov 2018, 200 PF at 9.7 MW
problem = 1<<(64+64)
billion_years_with_1_pflop = problem / (timedelta(days=365).total_seconds() * pflop) / (10**9)
#Pretty bad
Heuristic: Use brute-force if possible.
In this instance we cannot test all values with brute force. Testing exhaustively should always be the first approach; if it doesn't work, we have to use our knowledge of floating point arithmetic to produce software that is good enough for the context it's used in.

Domain Knowledge

If you are doing anything other than scientific computation, float is the wrong type. It won't work for your currency, interest rate and bookkeeping calculations; for those you would use a decimal type. Other applications call for other types (date, complex, etc.).
Heuristic: Know your domain.
Depending on your domain, you may already be doomed.
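A minimal illustration of why decimal beats float for money-style math (standard library only):

```python
from decimal import Decimal

# Binary floats cannot represent 0.1 exactly, so sums drift:
assert 0.1 + 0.2 != 0.3
# Decimal works in base 10 and keeps bookkeeping arithmetic exact:
assert Decimal("0.1") + Decimal("0.2") == Decimal("0.3")
```

Note that Decimal should be constructed from strings, not floats, or the rounding error sneaks back in.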
Now let's start the quiz to see how good our float knowledge is.

This should be true, right?

In the quiz we test the div function from before with an assert_div function. The test succeeds if dividing the first parameter a by b and then multiplying by b gives back a.
def assert_div(a, b):
    assert div(a,b) * b == a
This is true for a lot of float values but not all.
Can you come up with 1-2 examples where this is false?

The Quiz

It's not such a hard quiz, really. I'll give you examples and we'll try to explain why they don't work.
assert_div(1,0) #Let's start simple
ZeroDivisionError                         Traceback (most recent call last)
 in ()
----> 1 assert_div(1,0) #Let's start simple
 in assert_div(a, b)
      1 def assert_div(a, b):
----> 2     assert div(a,b) * b == a
 in div(a, b)
      1 def div(a, b):
----> 2     return a / b
      4 assert div(1,1) == 1
ZeroDivisionError: division by zero
This example is simple: division by zero. This is the same as in normal math and not specific to floating point. 

Not a number is fun

Floating point introduces NaN (not a number) to represent results that cannot be represented by any other value.
NaN is pretty special. You cannot treat it like any other value: almost any arithmetic operation involving NaN returns NaN.
assert_div(float(1), float("nan"))
AssertionError                            Traceback (most recent call last)
 in ()
----> 1 assert_div(float(1), float("nan"))
 in assert_div(a, b)
      1 def assert_div(a, b):
----> 2     assert div(a,b) * b == a
float(1) / float("nan")
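The expression above evaluates to nan, and the propagation extends to the other arithmetic operations as well (math.isnan is the reliable way to detect it):

```python
import math

n = float("nan")
assert math.isnan(1.0 / n)    # dividing by NaN yields NaN
assert math.isnan(n + 1.0)    # so does adding...
assert math.isnan(n * 0.0)    # ...and even multiplying by zero
```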

Not a number is even funnier

assert_div(float("nan"), 1.0)
AssertionError                            Traceback (most recent call last)
 in ()
----> 1 assert_div(float("nan"), 1.0)
 in assert_div(a, b)
      1 def assert_div(a, b):
----> 2     assert div(a,b) * b == a
Given what we saw in the last test, this test looks like it could work. It doesn't, because NaN is not equal to itself.
n = float("nan")
(n == n, n is n)
(False, True)
This is actually pretty odd: https://docs.python.org/3.7/reference/expressions.html
Equality comparison should be reflexive. In other words, identical objects should compare equal: x is y implies x == y
By flipping the arguments to div we get NaN as the result. NaN is weird: n is n is true (it's the very same object), yet it still doesn't compare equal to itself.

Infinity is also fun

Infinity and negative infinity exist to represent values larger and smaller than any regular float value.
assert_div(float("inf"), float("inf"))
AssertionError                            Traceback (most recent call last)
 in ()
----> 1 assert_div(float("inf"), float("inf"))
 in assert_div(a, b)
      1 def assert_div(a, b):
----> 2     assert div(a,b) * b == a
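The assertion fails because inf / inf is mathematically undefined, so the division yields NaN, which never compares equal:

```python
import math

# inf / inf has no meaningful value; IEEE 754 defines the result as NaN,
# and NaN fails every equality comparison, so assert_div raises.
assert math.isnan(float("inf") / float("inf"))
```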

Losing it

assert_div(3e-5, 7)
AssertionError                            Traceback (most recent call last)
 in ()
----> 1 assert_div(3e-5, 7)
 in assert_div(a, b)
      1 def assert_div(a, b):
----> 2     assert div(a,b) * b == a
Here we lose precision, which again leads to the result not matching the input value.
3e-5 / 7 * 7 

Big and small

import sys
assert_div(sys.float_info.max, sys.float_info.min)
AssertionError                            Traceback (most recent call last)
 in ()
      1 import sys
----> 2 assert_div(sys.float_info.max, sys.float_info.min)
 in assert_div(a, b)
      1 def assert_div(a, b):
----> 2     assert div(a,b) * b == a
In this instance the division overflows to infinity, and there is no way to get back from infinity to a.
sys.float_info.max / sys.float_info.min
inf

Small and big

assert_div(sys.float_info.min, sys.float_info.max)
AssertionError                            Traceback (most recent call last)
 in ()
----> 1 assert_div(sys.float_info.min, sys.float_info.max)
 in assert_div(a, b)
      1 def assert_div(a, b):
----> 2     assert div(a,b) * b == a
Something similar happens if we flip the input values: the division underflows to 0 and we cannot get back to a.
sys.float_info.min / sys.float_info.max
0.0

Precisely wrong

assert_div((1<<64) - 1, 1)
AssertionError                            Traceback (most recent call last)
 in ()
----> 1 assert_div((1<<64) - 1, 1)
 in assert_div(a, b)
      1 def assert_div(a, b):
----> 2     assert div(a,b) * b == a
This is a more interesting problem. Float's internal representation cannot hold every integer in a 64-bit float value. In this example (1<<64) - 1 rounds to the same float as 1<<64.
((1<<64) - 1, int(float((1<<64) - 1)))
(18446744073709551615, 18446744073709551616)

Bonus round of weirdness

If this wasn't bad enough, there is additional weirdness that is implementation specific. You won't find it by reading the documentation.
Heuristic: Test small values exhaustively. Test with random input.
ma = sys.float_info.max * 1.1
mi = sys.float_info.min / 10
(ma, mi)
(inf, 2.225073858507203e-309)
mi < sys.float_info.min
True
The upper and lower bounds are not treated symmetrically: you can have a value that is smaller than the minimum reported by float_info. These below-minimum values are subnormal numbers.
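Values below float_info.min are called subnormal numbers; a quick illustration of how they behave and eventually underflow to zero:

```python
import sys

tiny = 5e-324                     # the smallest positive subnormal float
assert tiny > 0.0
assert tiny < sys.float_info.min  # smaller than the documented minimum
assert tiny / 2 == 0.0            # one more halving underflows to zero
```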

Wrap up

How did you do in the quiz?
If you can use an int, don't use a float. You will avoid a whole class of problems.
Heuristic: Simpler software is better than complex software.
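A tiny example of the class of problems that ints avoid (money counted in cents, a made-up scenario):

```python
# Ten dimes summed as floats drift away from 1.0;
# counting in integer cents stays exact.
assert sum([0.1] * 10) != 1.0     # 0.9999999999999999
assert sum([10] * 10) == 100
```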
I'm pretty sure most developers struggle with this quiz; I'm one of them. We work casually with complexity. We use concepts without thoroughly understanding them. This is true for nearly everything in software, as we're building layer upon layer of software.
In general this is a good thing; it's the only way we know of to use the hardware we have. On the other hand, we should be conservative in estimating our software quality: it's probably far worse than we think. The only way out is to make the implementation as simple as possible.