Tuning the Bloom-Filter-based Spelling Corrector

With the candidate iterators and filter in place the Bloom Filter-based spelling corrector can be implemented and the performance measured.

The base implementation is here. The basic idea is described in the first blog post about the Bloom Filter-based spelling corrector so I won't get into detail about the implementation.

My benchmark is the implementation by David Pollack. As far as I can tell he thought a bit about performance so the comparison is not totally unfair.

The numbers aren't to bad. The Bloom Filter-based implementation is 2.5 times faster. (The performance comparison numbers are based on correcting words of edit distance 0 to 3 derived from the dictionary. The JVM is running with -Xms256m -Xmx256m -server.) My implementation corrects 115 words per seconds.

bloom 182.04358 (115 Hz)
benchmark 466.560114 (45 Hz)
benchmark/bloom 2.562903421257701

But there is still room for improvement. Executing the application with a memory profiler reveals that a lot of java.lang.Integer objects are allocated. The filter function Int => Boolean is translated by the Scala compiler to the generic trait Function1[Integer, Boolean]. This causes boxing of int values to Integer objects. (The boxing of Boolean values is not a problem. There are only two values as long as Boolean.valueOf is used.) Replacing the filter function with a filter trait based on primitive types solves this problem.

trait Filter{
  def contains(x : Int) : Boolean
}

The other source of boxed integers is the partial application of the StringHasher.hashRange method. When the string parameter is applied to the hashRange method a new Function3[Int,Int,Int,Int] is returned. This causes boxing as the previous Function1 object does. The partial application made the code a bit more readable. We have to compromise here on readability and replace the partial application with a normal method that fits well with the JVM.

Implementing these improvements resulted in worse performance at the beginning. The problem was that I used s(idx) instead of s.charAt(idx). The s(idx) converts the String to a RichString and then calls the apply method on it. Both methods are functionally identical but the first will cause conversions from String to RichString whereas the charAt method is a direct method call on the given String. This is a subtle problem the best solution would be able to disable implicit conversions from String to RichString defined in Predef for this file.

With both improvements in place the implementation is 3.7 times faster.

bloom 113.69269 (184 Hz)
benchmark 424.767132 (49Hz)
benchmark/bloom 3.736098881994964

The next step is a bit controversial: basing the candidate production on char Arrays replacing Strings. This yields good performance improvements the final number is 4.9 times faster than the competing implementation but it has clearly a number of drawbacks. The code becomes quite low-level using System.arrayCopy instead of RichString methods. The char[] arrays are mutable so a collaborator may manipulate the array content.

bloom 89.082697 (235 Hz)
benchmark 437.214067 (48 Hz)
benchmark/bloom 4.907957232143522

Some further improvements are still possible. With enough effort the BloomSpeller, Dictionary and CandidateIterators could be based on a CharSequence flyweight implementation. Having this interface in place candidate iterators can return implementations of CharSequence containing only the base string as common state and the individual state of the candidate (index to manipulate, candidate hash value and optionally the alphabet character to use). This would save memory allocation, garbage collection for the candidate char[] array instance and redundant hash value computation. (The candidate hash value is computed a second time when the Speller implementation checks whether the candidate is in the Dictionary.)

The final step is to tune the Bloom Filter configuration. In the first blog entry of this series I calculated expected performance values based on the Bloom Filter worst case (all candidates are contained in the Bloom Filter). The actual numbers differ from the expected results as most candidate checks against the Bloom Filter will abort earlier before k hash functions were executed. Another factor not present in the model is the optimization to compute only hash values for about half of the string length.

The performance is surprisingly stable for an n/m ratio of 4 up to 256. For an n/m up to 4 the production of false candidates dominates the performance characteristics. For n/m ratio greater than 128 the execution of hash functions starts to degrade performance. Over all the configuration of an n/m ratio = 4 is good enough. Only small improvements can be attained with changing this number.

n/m ratio	performance ratio
1	1,85
2	3,07
4	4,05
8	5,07
16	4,75
32	4,83
64	4,93
128	4,69
256	3,89
512	1,78
1024	0,53

Conclusion

Scala is a flexible language. You can write high-level and low-level code as needed. The basic approach to start a high-level implementation and optimizing the hot spots worked out quite well for this example. The byte code emitted by the Scala compiler is good enough in most cases. Some abstractions of the language do not fit to well with the Java platform. This is not a problem for regular code and can be fixed with low-level implementations for hot spots as we've seen. The new @specialized annotation may help here in the future to lower the amount of low-level code.

There are still some areas worth to investigate. How to distribute the candidate production and filtering over multiple cores? Using a join-fork framework or Scala's Future construct seem to fit quite well here. Another idea is to compare the Bloom Filter-based spell checker implementation against a BK-Tree-based implementation.

Thomas Andreas Jung's Blog

20100222