A faster spelling corrector

The compact spelling corrector implementation showed some of the features of Scala: for expressions, higher-order functions (fold and map), tuple support and the infix method syntax. The current implementation has terrible performance, so we try to develop a faster corrector implementation.

Let's have a look at it at run-time with the VisualVM memory profiler.

Not surprisingly, it creates a lot of objects. To correct 20 words of edit distance 0 to 3 randomly picked from the big.txt file, it allocated 15 million char[] arrays and 9 million String instances. A single allocation may be cheap, but allocating millions of objects is not.

One way to improve the performance of the spelling corrector is to allocate fewer objects. The spelling corrector is based on a brute force approach: calculate all possible candidates for a given edit distance. The most expensive operation is to create candidate string instances. The java.lang.String implementation is immutable. Every single change to a String will force the creation of a new instance.

But we can do better than that. We can capitalize on the knowledge that the Strings won't be random. The candidates can be easily deduced from the base string. A candidate string has some additional private state depending on the operation (insert, transpose, replace, delete) performed. Phrasing it this way leads directly to the flyweight pattern. A flyweight can be implemented for every operation as a class that implements the CharSequence interface, which is already part of the JDK. No additional char[] arrays will be allocated, which should improve performance considerably.

I'll skip the flyweight implementation and leave it as an exercise. There is a better approach.

The flyweight implementation will still create one object per candidate. Even this can be avoided. If we picked only promising candidates, even fewer objects would be created and the performance would improve further. By a happy coincidence, a data structure can help here: the Bloom Filter.

In addition to the dictionary of word frequencies that is already used in the base implementation of the spelling corrector, a Bloom Filter is initialized with all words.
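To make the idea concrete, here is a minimal sketch of such a filter. It is not the implementation this series will end up with: the class name `IntBloomFilter`, the double hashing scheme, the mixing constant and the use of `String.hashCode` as the word hash are all assumptions made for illustration.

```scala
import java.util.BitSet

// Minimal Bloom Filter over Int hash values (illustrative sketch only).
// m = number of bits, k = number of hash functions.
final class IntBloomFilter(m: Int, k: Int) {
  private val bits = new BitSet(m)
  private val Golden = -1640531527 // 0x9E3779B9 as a signed Int (assumed mixing constant)

  // i-th bit index via double hashing: (h1 + i * h2) mod m, kept non-negative.
  private def index(hash: Int, i: Int): Int = {
    val h2 = ((hash * Golden) >>> 15) | 1
    java.lang.Math.floorMod(hash + i * h2, m)
  }

  def add(hash: Int): Unit = {
    var i = 0
    while (i < k) { bits.set(index(hash, i)); i += 1 }
  }

  // Returns false as soon as one bit is missing, so a miss can stop early.
  def mightContain(hash: Int): Boolean = {
    var i = 0
    while (i < k) {
      if (!bits.get(index(hash, i))) return false
      i += 1
    }
    true
  }
}

object BloomDemo {
  // Fills a filter with the hash of every dictionary word.
  def build(words: Iterable[String], bitsPerWord: Int, k: Int): IntBloomFilter = {
    val filter = new IntBloomFilter(math.max(1, words.size * bitsPerWord), k)
    words.foreach(w => filter.add(w.hashCode))
    filter
  }
}
```

A filter built this way never reports a dictionary word as missing; it can only report some non-words as possibly present.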

The candidate generation will then be separated into multiple steps: candidate hashing, hash filtering with the Bloom Filter, candidate instantiation and dictionary filtering.

With the Bloom Filter, most of the candidates will not survive the hash generation and hash filtering steps. The hashed candidates can be calculated without instantiating any objects, just using primitive int values that live on the stack. This performs very well: neither heap allocation nor garbage collection is needed.

For every type of candidate (deletion, replacement, insertion and transposition) an Iterator can be implemented that produces the candidate hash values. The iterator implementations are based on a base string and the additional state (positions in the base string and positions in the alphabet) necessary to produce the candidate hash values.
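As a rough sketch of what hashing without instantiation could look like, the following functions compute the hash of an insertion and a deletion candidate directly from the base string. I assume the polynomial hash of `java.lang.String.hashCode` here only because it is easy to verify; the object and method names are made up for this example.

```scala
object CandidateHash {
  // Hash of `base` with `c` inserted at `pos`. The result equals the
  // String.hashCode of that candidate, but only Int arithmetic is used:
  // no String and no char[] is created.
  def insertionHash(base: CharSequence, pos: Int, c: Char): Int = {
    var h = 0
    var i = 0
    while (i < pos) { h = 31 * h + base.charAt(i); i += 1 }
    h = 31 * h + c
    while (i < base.length) { h = 31 * h + base.charAt(i); i += 1 }
    h
  }

  // Same idea for a deletion candidate: skip the char at `pos`.
  def deletionHash(base: CharSequence, pos: Int): Int = {
    var h = 0
    var i = 0
    while (i < base.length) {
      if (i != pos) h = 31 * h + base.charAt(i)
      i += 1
    }
    h
  }
}
```

An iterator per candidate type would step through the positions (and the alphabet) and call a function like these for each step.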

Every candidate hash value is checked against the Bloom Filter. If the filter contains it, the candidate string is created and checked against the dictionary. The second check is necessary because the Bloom Filter produces false positives.
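The steps above can be sketched as follows for replacement candidates. This is only an illustration under a few assumptions: the Bloom Filter is represented by a plain `Int => Boolean` function, the candidate hash is again compatible with `String.hashCode`, the alphabet is 'a' to 'z', and all names are hypothetical.

```scala
object CandidatePipeline {
  private val alphabet = 'a' to 'z'

  // String.hashCode-compatible hash of `base` with the char at `pos` replaced by `c`.
  private def replacementHash(base: String, pos: Int, c: Char): Int = {
    var h = 0
    var i = 0
    while (i < base.length) {
      h = 31 * h + (if (i == pos) c else base.charAt(i))
      i += 1
    }
    h
  }

  // `mightContain` stands in for the Bloom Filter lookup.
  def replacements(base: String,
                   mightContain: Int => Boolean,
                   dictionary: Map[String, Int]): List[String] = {
    val result = List.newBuilder[String]
    for (pos <- 0 until base.length; c <- alphabet if c != base.charAt(pos)) {
      // Steps 1 and 2: hash the candidate and filter on the hash.
      // No candidate string is needed yet.
      if (mightContain(replacementHash(base, pos, c))) {
        // Step 3: only now instantiate the candidate string.
        val candidate = base.substring(0, pos) + c + base.substring(pos + 1)
        // Step 4: the dictionary lookup removes Bloom Filter false positives.
        if (dictionary.contains(candidate)) result += candidate
      }
    }
    result.result()
  }
}
```

Passing `_ => true` as the filter simulates a false positive rate of 100%: the result stays the same, only more candidate strings get created.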

The actual false positive rate can be freely chosen. It will have no impact on the functionality. It could be 100%. The false positive rate can be changed to tune the performance depending on memory consumption requirements, cost of hash functions used in the Bloom Filter implementation and object instantiation cost for candidate strings.

To check the basic assumption, we can measure the performance characteristics of the basic string operations needed and of the Bloom Filter. The limited performance test compares the run-time of the character array copy needed to create the char[] for one insertion candidate string (creating a String candidate won't be faster than this)

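import java.lang.System.arraycopy
val s = new Array[Char](stringValue.length + 1)         // stringValue is the base char[]; one extra slot
arraycopy(stringValue, 0, s, 0, 1)                      // copy the prefix before the insertion point
s(1) = 'c'                                              // insert the new character at index 1
arraycopy(stringValue, 1, s, 2, stringValue.length - 1) // copy the rest, shifted by one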

and the Bloom Filter contains method performance.

All presented durations lack a unit of measurement because all measurements are based on the same scale. This is sufficient to compare the values.

The time needed to create the insertion candidate string grows linearly with the base string length.

char[] length	character copy time
2	11,69
4	11,70
8	12,57
16	15,95
32	21,57
64	30,21
128	51,22

The Bloom Filter time grows linearly with the number of hash functions (k) used. The optimal number of hash functions is determined by the ratio between the Bloom Filter size (m) and the number of expected elements (n).

$k = \frac{m}{n} \ln 2$
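As a small sketch, the formula can be turned into a function. Since k has to be an integer, I assume rounding up here, which reproduces the k column of the table further down; the names are hypothetical.

```scala
object BloomParams {
  // Number of hash functions for a given m/n ratio (bits per element),
  // rounded up to the next integer and never less than 1.
  def hashFunctions(bitsPerElement: Double): Int =
    math.max(1, math.ceil(bitsPerElement * math.log(2)).toInt)
}
```

For example, `BloomParams.hashFunctions(10)` evaluates 10 * ln 2 ≈ 6.93 and returns 7.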

The measured time is the worst case: an entry (or a false positive) is in the Bloom Filter. In this scenario all k hash functions are executed.

m/n ratio	k	false positive rate	time filter
1	1	0,62	3,73
2	2	0,39	5,70
3	3	0,24	7,70
4	3	0,14	7,71
5	4	0,09	9,38
6	5	0,05	11,59
7	5	0,03	11,59
8	6	0,02	13,80
9	7	0,01	18,68
10	7	0,01	18,68

In a real-world scenario the average run-time of the Bloom Filter will be better, as the execution can be aborted early for values that are not in the Bloom Filter. The false positive probability is another important figure of the Bloom Filter:

$p_f =(1-e^{-kn/m})^k$
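A direct translation of this formula, expressed in terms of the ratio m/n, could look like the sketch below (names are my own). The rates listed in the table above appear to be rounded slightly differently, so small deviations from them are to be expected.

```scala
object FalsePositives {
  // p_f = (1 - e^(-k n / m))^k, with bitsPerElement = m / n.
  def rate(k: Int, bitsPerElement: Double): Double =
    math.pow(1.0 - math.exp(-k / bitsPerElement), k)
}
```

With one hash function and one bit per element, for instance, the rate is 1 - 1/e, roughly 0.63.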

A false positive causes additional work, because the candidate string is created on the assumption that it is a valid value. This run-time is wasted, as the subsequent check against the dictionary will eliminate the false positive.

To get the expected resulting performance of the spelling corrector we construct a small model. The expected run-time is determined by the cost of the filter t_f, the false positive probability p_f, the probability p_w that a candidate is a word in the dictionary and the time t_s to construct a candidate string. The time spent to construct strings which are false positives is p_f * t_s. The time spent to construct strings which are words is p_w * t_s. (To simplify the formula a constant string length is assumed.)

The values computed are based on candidate strings with a length of 8 chars (t_s=t_s8) and the assumption that 5% of the candidates are words contained in the dictionary (p_w=0.05).
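Written out, the model described above amounts to the following formula. It is reconstructed from the description; the symbol t for the expected time is taken from the table header.

$t = t_f + (p_f + p_w) \, t_s$

For the table row with a ratio of 4 (t_f = 7,71 and p_f t_s = 1,76) and t_s8 = 12,57 this gives:

$t = 7.71 + 1.76 + 0.05 \cdot 12.57 \approx 10.10$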

m/n ratio	k	false positive rate (p_f)	time filter (t_f)	false positive time (p_f t_s)	expected time (t)
1	1	0,62	3,73	7,84	12,20
2	2	0,39	5,70	4,88	11,21
3	3	0,24	7,70	3,04	11,37
4	3	0,14	7,71	1,76	10,10
5	4	0,09	9,38	1,09	11,09
6	5	0,05	11,59	0,66	12,88
7	5	0,03	11,59	0,39	12,61
8	6	0,02	13,80	0,25	14,68
9	7	0,01	18,68	0,15	19,46
10	7	0,01	18,68	0,09	19,40

By increasing the m/n ratio of the Bloom Filter (and with it the number k of hash functions executed) we see an increase in the time spent to check if the filter contains a candidate string (time filter). On the other hand, the time spent to produce false positives (false positive time) declines with a growing m/n ratio.

For the given model the minimum of the expected run-time is at an m/n ratio of 4 (false positive rate of 0,14). The expected worst case run-time of the Bloom Filter based implementation is better than the plain array copying run-time (with a minimum of 11,69), and the performance benefit increases dramatically with the string length.
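The minimum can be double-checked with a few lines of Scala. The sketch below copies the filter time and false positive time columns from the table and assumes the expected time is t_f + p_f t_s + p_w t_s with t_s = 12,57 and p_w = 0,05; the object and field names are made up.

```scala
object ExpectedTime {
  val stringTime = 12.57      // t_s for 8 chars, from the character copy table
  val wordProbability = 0.05  // p_w

  // (ratio, filter time t_f, false positive time p_f * t_s), copied from the table
  val rows: Seq[(Int, Double, Double)] = Seq(
    (1, 3.73, 7.84), (2, 5.70, 4.88), (3, 7.70, 3.04), (4, 7.71, 1.76),
    (5, 9.38, 1.09), (6, 11.59, 0.66), (7, 11.59, 0.39), (8, 13.80, 0.25),
    (9, 18.68, 0.15), (10, 18.68, 0.09)
  )

  // Expected time per candidate: filter cost plus string construction cost
  // for false positives and for real words.
  def expected(filterTime: Double, falsePositiveTime: Double): Double =
    filterTime + falsePositiveTime + wordProbability * stringTime

  // Ratio with the smallest expected time, together with that time.
  val best: (Int, Double) =
    rows.map { case (ratio, tf, fp) => (ratio, expected(tf, fp)) }.minBy(_._2)
}
```

With these inputs `ExpectedTime.best` picks the row with ratio 4 and an expected time of about 10,10.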

In the next blog entries I'll implement this Bloom Filter-based spelling corrector to see if one can capitalize on the theoretical advantages. The next blog entry will be about the hash function support needed.

Thomas Andreas Jung's Blog

20100202