20100222

Tuning the Bloom-Filter-based Spelling Corrector

With the candidate iterators and filter in place the Bloom Filter-based spelling corrector can be implemented and the performance measured.

The base implementation is here. The basic idea is described in the first blog post about the Bloom Filter-based spelling corrector so I won't get into detail about the implementation.

My benchmark is the implementation by David Pollack. As far as I can tell he thought a bit about performance so the comparison is not totally unfair.

The numbers aren't too bad. The Bloom Filter-based implementation is 2.5 times faster. (The performance comparison numbers are based on correcting words of edit distance 0 to 3 derived from the dictionary. The JVM is running with -Xms256m -Xmx256m -server.) My implementation corrects 115 words per second.

bloom 182.04358 (115 Hz)
benchmark 466.560114 (45 Hz)
benchmark/bloom 2.562903421257701

But there is still room for improvement. Executing the application with a memory profiler reveals that a lot of java.lang.Integer objects are allocated. The filter function Int => Boolean is translated by the Scala compiler to the generic trait Function1[Integer, Boolean]. This causes boxing of int values to Integer objects. (The boxing of Boolean values is not a problem. There are only two values as long as Boolean.valueOf is used.) Replacing the filter function with a filter trait based on primitive types solves this problem.
trait Filter{
  def contains(x : Int) : Boolean
}
The other source of boxed integers is the partial application of the StringHasher.hashRange method. When the string parameter is applied to the hashRange method a new Function3[Int,Int,Int,Int] is returned. This causes boxing as the previous Function1 object does. The partial application made the code a bit more readable. We have to compromise here on readability and replace the partial application with a normal method that fits well with the JVM.
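A sketch of what this plain-method variant could look like (the exact signature is an assumption, not the original code): the base string becomes a regular parameter and the loop works on primitives only, so no Function3 instance is created and no Int value is boxed.

def hashFor(h : Int, c : Char) : Int = h * 31 + c

// plain method instead of a partially applied function; hashes base
// from index i (inclusive) to end (exclusive) starting with hash h
def hashRange(base : String, h : Int, i : Int, end : Int) : Int = {
  var hash = h
  var idx = i
  while(idx < end){
    hash = hashFor(hash, base.charAt(idx))
    idx += 1
  }
  hash
}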

Implementing these improvements resulted in worse performance at first. The problem was that I used s(idx) instead of s.charAt(idx). The s(idx) call converts the String to a RichString and then calls the apply method on it. Both methods are functionally identical, but the first causes conversions from String to RichString whereas charAt is a direct method call on the given String. This is a subtle problem; the best solution would be to be able to disable the implicit conversion from String to RichString defined in Predef for this file.
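The difference is easy to miss because both call sites look almost identical:

val s = "spelling"
val idx = 3
val viaRichString = s(idx)   // goes through the implicit String -> RichString conversion in Predef
val direct = s.charAt(idx)   // direct method call on java.lang.String, no conversion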

With both improvements in place the implementation is 3.7 times faster.

bloom 113.69269 (184 Hz)
benchmark 424.767132 (49 Hz)
benchmark/bloom 3.736098881994964

The next step is a bit controversial: basing the candidate production on char arrays instead of Strings. This yields a good performance improvement (the final number is 4.9 times faster than the competing implementation) but it clearly has a number of drawbacks. The code becomes quite low-level, using System.arraycopy instead of RichString methods, and the char[] arrays are mutable, so a collaborator may manipulate the array content.

bloom 89.082697 (235 Hz)
benchmark 437.214067 (48 Hz)
benchmark/bloom 4.907957232143522
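To illustrate the low-level style, here is a sketch of how a replacement candidate can be built on char arrays (illustrative names, not the original code):

// copy the base characters and overwrite the character at idx;
// the resulting array is mutable, which is one of the drawbacks mentioned above
def replacementCandidate(base : Array[Char], idx : Int, c : Char) : Array[Char] = {
  val candidate = new Array[Char](base.length)
  System.arraycopy(base, 0, candidate, 0, base.length)
  candidate(idx) = c
  candidate
}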

Some further improvements are still possible. With enough effort the BloomSpeller, Dictionary and CandidateIterators could be based on a CharSequence flyweight implementation. With this interface in place the candidate iterators can return CharSequence implementations that share the base string as common state and carry only the individual state of the candidate (the index to manipulate, the candidate hash value and optionally the alphabet character to use). This would save the allocation and garbage collection of the candidate char[] arrays as well as redundant hash value computation. (The candidate hash value is computed a second time when the Speller implementation checks whether the candidate is in the Dictionary.)
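A minimal sketch of such a flyweight for a replacement candidate (hypothetical names, only to illustrate the idea):

// shares the base string and stores only the individual candidate state;
// length and charAt are answered without copying a char[]
class ReplacementCandidate(base : String, idx : Int, c : Char) extends CharSequence {
  def length = base.length
  def charAt(i : Int) = if(i == idx) c else base.charAt(i)
  def subSequence(start : Int, end : Int) : CharSequence = toString.subSequence(start, end)
  override def toString = (base take idx) + c + (base drop idx + 1)
}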

The final step is to tune the Bloom Filter configuration. In the first blog entry of this series I calculated expected performance values based on the Bloom Filter worst case (all candidates are contained in the Bloom Filter). The actual numbers differ from the expected results as most candidate checks against the Bloom Filter abort early, before all k hash functions are executed. Another factor not present in the model is the optimization to compute hash values for only about half of the string length.

The performance is surprisingly stable for n/m ratios from 4 up to 256. For ratios below 4 the production of false candidates dominates the performance characteristics. For ratios greater than 128 the execution of hash functions starts to degrade performance. Overall an n/m ratio of 4 is good enough; only small improvements can be attained by changing this number.

n/m ratio performance ratio
1 1,85
2 3,07
4 4,05
8 5,07
16 4,75
32 4,83
64 4,93
128 4,69
256 3,89
512 1,78
1024 0,53

Conclusion


Scala is a flexible language. You can write high-level and low-level code as needed. The basic approach of starting with a high-level implementation and optimizing the hot spots worked out quite well for this example. The byte code emitted by the Scala compiler is good enough in most cases. Some abstractions of the language do not fit too well with the Java platform. This is not a problem for regular code and can be fixed with low-level implementations for hot spots, as we've seen. The new @specialized annotation may help here in the future to lower the amount of low-level code.

There are still some areas worth investigating. How can the candidate production and filtering be distributed over multiple cores? A fork-join framework or Scala's Future construct seems to fit quite well here. Another idea is to compare the Bloom Filter-based spell checker implementation against a BK-Tree-based implementation.

20100216

Producing candidates for a spelling corrector

The next step to implement the Bloom Filter-based spelling corrector is to create and check hash values for candidates against the filter without actually creating candidate instances. If the filter step succeeds the candidate String instances are created in the next step.

All candidate iterators (for insertion, deletion, transposition and replacement) are based on the CandidateIterator trait. The CandidateIterator can return a hash value for a candidate with the hashValue method and a candidate string with the candidate method. The next method scrolls to the next value.
trait CandidateIterator{
  def next : Boolean
  def candidate : String
  def hashValue : Int
}
An instructive implementation is the replacement CandidateIterator. It replaces each character of the base string with a character from the alphabet to create a candidate.

The candidate method does exactly this for candidate strings. It takes the base string and replaces the character at the idx position with the character from the alphabet at the alphabetIdx position (the alphabet is defined as val alphabet = 'a' to 'z' toArray).
def candidate = (base take idx) + Dictionary.alphabet(alphabetIdx) + (base drop idx + 1)
The hashValue method is not as easily readable as the candidate method. The implementation is based on the insight that the hash function for Strings is defined in a way that lets us reuse a base hash value for the substring base.substring(0, idx). With that in place we can save about half of the hash computations. This has huge performance benefits over an implementation that works directly with Strings, where hash values have to be computed from scratch for every candidate. Analogous to the String copy costs, the hashing costs increase nearly linearly with the String length. The performance benefits are especially relevant for long Strings, which produce a lot of candidate strings.
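Written out with hashFor(h, c) = h * 31 + c this is just the recurrence behind java.lang.String's hash function:

h_i = 31 * h_{i-1} + s_i

so the hash of the prefix base.substring(0, idx) can be computed once per idx and extended character by character, instead of rehashing every candidate from index 0.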

The next method takes care that idx and alphabetIdx point to valid positions. The baseHash value is calculated iteratively in the next method from the baseHash of the last round whenever the idx value changes.

The hashValue implementation is now based on the baseHash value. The hash is computed for the character taken from the alphabet and the remaining string base.substring(idx + 1, base.length).
class Replacement(base : String) extends BaseCandidateIterator(base){
  private var idx = 0
  private var alphabetIdx = -1
  private var baseHash = 0
  def next = {
    alphabetIdx += 1
    if(alphabetIdx == Dictionary.alphabet.length){
      idx += 1
      alphabetIdx = 0
      baseHash = hashAt(baseHash, idx - 1)
    }
    idx < base.length
  }
  def candidate = (base take idx) + Dictionary.alphabet(alphabetIdx) + (base drop idx + 1)
  def hashValue = {
    val pivot = hashFor(baseHash, Dictionary.alphabet(alphabetIdx))
    val right = hashRange(pivot, idx + 1, base.length)
    right
  }
}
The CandidateIterator lets us iterate over candidate hash values without creating objects. Outside of the CandidateIterators the filtering takes place; the candidate method is called to produce an instance only when the hash value was in the filter. This work is done by the CandidateFilter.
class CandidateFilter(base : CandidateIterator, filter : Int => Boolean) extends Iterator[String]{
  private var more : Boolean = false
  def hasNext : Boolean = {
    def hasMore : Boolean = {
      while(base.next) if(filter(base.hashValue)) return true
      false
    }
    more ||= hasMore
    more
  }

  def next : String = {
    more = false
    base.candidate
  }
}
The class takes a CandidateIterator and a filter. It implements the Iterator trait. Values returned by the CandidateFilter are candidate strings that passed the Bloom Filter test.

Users of the CandidateFilter and CandidateIterator can now use the Iterator convenience methods and for expressions to work with candidate strings as usual.
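As a usage sketch (Dictionary.contains and the bloomFilter function are assumptions about the surrounding code, not part of the snippets above):

// wire a replacement iterator to the Bloom Filter check and keep only
// candidates that are real dictionary words
def corrections(word : String, bloomFilter : Int => Boolean) : List[String] = {
  val candidates = new CandidateFilter(new Replacement(word), bloomFilter)
  (for(c <- candidates if Dictionary.contains(c)) yield c).toList
}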

With the CandidateIterator implementations and the CandidateFilter in place the spelling corrector can be implemented in the final step.

20100208

Implementing hash functions in functional style

The first ability needed to enable filtering with the Bloom Filter-based spell checker is to produce hash values for candidates.

The basic functionality is to hash a character given a base hash value from a preceding invocation. The hashFor function implements this.

def hashFor(h: Int, c: Char) : Int = h * 31 + c

Two further functions help with building candidate hashes: hashAt and hashRange.

hashAt builds the hash for a character of the base string at an index position, given a base hash value.

def hashAt(base : String)(h: Int, i: Int) : Int= hashFor(h, base.charAt(i))

The hashAt method is defined in a way that allows partial application of parameters. (It's not clean currying. The Scala language allows defining functions with mixed arity: unary functions and functions with arity greater than 1.)

The motivation to define the function in this way is to be able to use it with fewer parameters than the method definition def hashAt(base : String, h: Int, i: Int) when it is used repeatedly with the same base string parameter.

To get the function signature you can access the function object in the Scala console:

scala> hashAt _
(String) => (Int, Int) => Int

Alternatively the signature can be written as (String) => ((Int, Int) => Int) for better readability. The function takes one parameter and will return a function of type  (Int, Int) => Int.

To make any sense of this let's apply the parameters one by one:

scala> val f = hashAt _
f: (String) => (Int, Int) => Int = <function>

scala> val g = f("s")
g: (Int, Int) => Int =  <function>

scala> val h = g(0,0)
h: Int = 115

The parameters are applied left to right. Each intermediate function definition has one more parameter bound until there are no parameters left to bind and the function is evaluated. The sequence is equivalent to calling hashAt("s")(0,0).

hashRange hashes the characters of a string from a start index to an end index (exclusive), given a base hash value.

def hashRange(base : String)(h: Int, i: Int, end : Int) : Int =
    if(i < end) hashRange(base)(hashAt(base)(h, i), i + 1, end) else h

Using the hashRange method you can produce the same hash values as the java.lang.String implementation.
public int hashCode() {
    int h = hash;
    if (h == 0) {
        int off = offset;
        char val[] = value;
        int len = count;
        for (int i = 0; i < len; i++) {
            h = 31*h + val[off++];
        }
        hash = h;
    }
    return h;
}
The equivalent definition based on the hashRange function is:

def hash(s : String) = StringHasher.hashRange(s)(0, 0, s.length)

The function is defined as a tail recursive method. A tail recursive function is a function that only returns values from recursive calls and does not manipulate these return values further. With a tail recursive definition the Scala compiler is able to translate the recursive call into a loop. Defining a tail recursive function is a bit tricky: you may inadvertently change a tail recursive version into a normal recursive one, or an optimized tail recursive function into one that is not optimized. To avoid the pitfalls of tail recursive functions Scala 2.8 introduces the @tailrec annotation. It produces a compiler failure if a function cannot be optimized.
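Applied to the hashRange definition above this looks as follows (a sketch; Scala 2.8 is assumed for the annotation):

import scala.annotation.tailrec

// identical to the definition above, but the compiler now fails the build
// if the recursive call cannot be rewritten into a loop
@tailrec
def hashRange(base : String)(h : Int, i : Int, end : Int) : Int =
  if(i < end) hashRange(base)(hashAt(base)(h, i), i + 1, end) else h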

With the hash support methods hashAt, hashFor and hashRange in place we can implement hash values for candidates. I'll implement the insertion candidate production to demonstrate the application of the methods. The insertion adds a character from an alphabet at a given index.

It is always good to write some tests (besides the fact that they verify the functionality) to have a look at the API from the user's perspective. Testing the individual methods is not enough to get a feel for the API. Users will always use a combination of the three methods. One concern is that the hash code is calculated incrementally, so the user has to define the hash functions in an incremental manner as well, using the output of a preceding execution as an input parameter for succeeding method calls, which may be error prone.

The law for the insert candidate hash can be defined with the equivalent hash function of the java.lang.String implementation. It can be expressed as a specification-based test:

@Test def userSample {
  for{s <- strings
      index <- integers(0, s.length - 1)
      c <- basicLatinCharacters}{
    val insertCandidate = (s take index) + c + (s drop index)
    assertEquals(insertCandidate.hashCode, insertCandidateHash(s, index, c))
  }
}
Three methods of the specification-based test framework Quickcheck are used in the test:
- strings returns an arbitrary string value
- integers(min, max) returns an integer value from min to max (inclusive)
- basicLatinCharacters returns a sample character from the Unicode Basic Latin code page.
All three methods return values through the Generator interface, which is converted with an implicit type conversion to a Scala Iterator to be able to use it in a for expression.

Given the input string, the insertion index and the character to insert, the actual law is quite simple. We construct a string that is an equivalent insertion candidate and get its hash value. The hash value returned by the insertCandidateHash method under test has to be the same. (This is a testing pattern in the context of specification-based testing: testing against an analogous function. Stay tuned for an in-depth description of the testing patterns.) This example shows that tests which are unpleasant with classical scenario-based TDD can be expressed in a compact way with specification-based testing. The last piece of the puzzle is the implementation of the insertCandidateHash function.
def insertCandidateHash(s : String, i : Int, c : Char) = {
  val hashRange = StringHasher.hashRange(s) _
  val left = hashRange(0, 0, i)
  val pivot = StringHasher.hashFor(left, c)
  val right = hashRange(pivot, i, s.length)
  right
}
The sample code demonstrates the use of partial application of function parameters in the first line. The string parameter is applied to the function, returning a function with the signature (Int, Int, Int) => Int. This function object can then be used for all hash values of the same base string. The hash value is calculated for the left, pivot and right parts of the string. The characters for the left and right hash values are unchanged from the base string. The hash value for the pivot element is computed for the inserted character, as one would expect for an insertion.

In the next blog I'll implement the candidate iterators for insertion, replacement, transposition and delete based on the StringHasher.

20100207

The moment electronic paper died

Perhaps you did not notice it, but last week electronic paper followed the Maglev and became part of the museum of failed technical innovations. Like the Maglev, it started with a brilliant theoretical idea that nobody cared about once it was there.

The Maglev promised faster train travel with better efficiency. The problem with the Maglev is that it does not fit well into the current infrastructure. New separate tracks have to be built. In the meantime wheeled electric trains have become capable of traveling at high speed while using and interfacing better with the existing infrastructure.

Electronic paper has the advantage that it does not consume electricity while displaying a single page and works without a backlight. In theory an electronic paper-based device can be used much longer without recharging. Readability for text is supposed to be better than on an LCD as well. But it cannot display charts, tables and photos well, everything users are used to from computers and smart phones. Additionally, such devices switch slowly from one page to another and the display even flickers annoyingly while changing the page content.

Until the advent of the iPad I thought about buying such a device and fantasized about carrying all my IT books, classical literature from Project Gutenberg and blogs around on it while commuting. It was just a matter of the right price. But now I do not see the point anymore. The iPad has proven that it is possible to build a general purpose computing device that includes book reading software. Electronic paper may be better for novels but I don't care. The ability to use all kinds of software and browse the web on such a device definitely buys me more.

That said, the iPad is just the proof that it can be done. It is not a platform I would like to develop on.

What's the point of having a general purpose computing device that is controlled by one company? The platform for such a device has to be reasonably open. And I think that will not only be in the interest of application developers but of the provider of such a platform as well. The development and deployment model imposed on developers will stifle innovation. As innovative as the iPad may be, a single company won't outperform a whole industry in the long run.

20100202

A faster spelling corrector

The compact spelling corrector implementation showed some of the features of Scala: for expressions, higher order functions (fold and map), tuple support and the infix method syntax. The current implementation has terrible performance, so we will try to develop a faster corrector implementation.

Let's have a look at it at run-time with the VisualVM memory profiler.


Not very surprisingly, it creates a lot of objects. To correct 20 words of edit distance 0 to 3 randomly picked from the big.txt file it allocated 15 million char[] arrays and 9 million String instances. Allocation may be cheap but allocating millions of objects is not.

One way to improve the performance of the spelling corrector is to allocate fewer objects. The spelling corrector is based on a brute force approach: calculate all possible candidates for a given edit distance. The most expensive operation is to create candidate string instances. The java.lang.String implementation is immutable. Every single change to a String will force the creation of a new instance.

But we can do better than that. We can capitalize on the knowledge that the Strings won't be random. The candidates can be easily deduced from the base string. A candidate string has some additional private state depending on the operation (insert, transpose, replace, delete) performed. Phrasing it this way directly leads to the flyweight pattern. Flyweights can be implemented for every operation by implementing the CharSequence interface that is already part of the JDK. This will yield quite good performance improvements: no additional char[] arrays will be allocated.

I'll skip the flyweight implementation and leave it as an exercise. There is a better approach.

The flyweight implementation will still create one object per candidate. Even this can be avoided. If we picked only promising candidates, the creation of objects would be further reduced and the performance increased. By happy coincidence a data structure can help here: the Bloom Filter.

In addition to the dictionary of word frequencies that is already used in the base implementation of the spelling corrector, a Bloom Filter is initialized with all words.

The candidate generation will then be separated into multiple steps: candidate hashing, hash filtering with the Bloom Filter, candidate instantiation and dictionary filtering.

Using the Bloom Filter, most of the candidates will not survive the hash generation and hash filtering steps. The candidate hashes can be calculated without instantiating any objects, just using primitive int values that live on the stack. This has superb performance characteristics: no memory allocation and no garbage collection needed.

For every type of candidate (deletion, replacement, insertion and transposition) an Iterator can be implemented that produces the candidate hash values. The iterator implementations are based on a base string and the additional state (positions in the base string and positions in the alphabet) necessary to produce the candidate hash values.

All candidate hash values will be checked for containment in the Bloom Filter. If a hash value is contained, the candidate string will be created and checked against the dictionary. The second check is necessary as the Bloom Filter produces false positives.

The actual false positive rate can be freely chosen. It will have no impact on the functionality. It could be 100%. The false positive rate can be changed to tune the performance depending on memory consumption requirements, cost of hash functions used in the Bloom Filter implementation and object instantiation cost for candidate strings.

To check on the basic assumption we can measure the performance characteristics of the basic string operations needed and the Bloom Filter. The limited performance test compares the run-time of a character array copy needed to create the base char[] for one insertion candidate string (creating a String candidate won't be faster than this)

arraycopy(stringValue, 0, s, 0, 1) //insertion
s(1) = 'c'
arraycopy(stringValue, 1, s, 2, stringValue.length - 1)

and the Bloom Filter contains method performance.

All presented durations lack a unit of measurement as all measurements are based on the same scale. This is sufficient to compare the values.

The time needed to create the insertion candidate string grows linearly with the base string length.


char[] length character copy time
2 11,69
4 11,70
8 12,57
16 15,95
32 21,57
64 30,21
128 51,22


The Bloom Filter time grows linearly with the number of hash functions (k) used. The optimal number of hash functions is determined by the ratio between the number of expected elements (n) and the Bloom Filter size (m):

k = \frac{m}{n} \ln(2)
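For example, an n/m ratio of 4 gives k = 4 * ln(2) ≈ 2.77, which is rounded up to the 3 hash functions used in the table below.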


The measured time is the worst case: an entry (or a false positive) is in the Bloom Filter. In this scenario all k hash functions are executed.




n/m ratio k false positive rate time filter
1 1 0,62 3,73
2 2 0,39 5,70
3 3 0,24 7,70
4 3 0,14 7,71
5 4 0,09 9,38
6 5 0,05 11,59
7 5 0,03 11,59
8 6 0,02 13,80
9 7 0,01 18,68
10 7 0,01 18,68

In a real world scenario the average run-time of the Bloom Filter will be better as the execution can be aborted early for values that are not in the Bloom Filter. The false positive probability is another important figure of the Bloom Filter:

p_f =(1-e^{-kn/m})^k
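For the row with an n/m ratio of 5 and k = 4 this gives p_f = (1 - e^{-4/5})^4 ≈ 0.09, matching the table above.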

A false positive causes additional work based on the assumption that we are working with a valid value. This run-time is wasted as the subsequent check against the dictionary will eliminate the false positives.

To get the expected resulting performance of the spelling corrector we construct a small model. The expected run-time is determined by the cost of the filter t_f, the false positive probability p_f, the probability p_w that a candidate is a word in the dictionary and the time t_s to construct a candidate string. The time spent constructing strings which are false positives is p_f * t_s. The time spent constructing strings which are words is p_w * t_s. (To simplify the formula a constant string length is assumed.)
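Putting the three terms together, the expected time per candidate is

t = t_f + p_f * t_s + p_w * t_s

which is the sum shown in the expected time column of the table below.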




The values computed are based on candidate strings with a length of 8 chars (t_s = t_s8, the copy time for 8 characters from the table above) and the probability that 5% of the candidates are words contained in the dictionary (p_w = 0.05).
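For the first row this works out to t = 3.73 + 7.84 + 0.05 * 12.57 ≈ 12.20; the other rows follow the same pattern.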


n/m ratio k false positive rate (p_f) time filter (t_f) false positive time (p_f * t_s) expected time (t)
1 1 0,62 3,73 7,84 12,20
2 2 0,39 5,70 4,88 11,21
3 3 0,24 7,70 3,04 11,37
4 3 0,14 7,71 1,76 10,10
5 4 0,09 9,38 1,09 11,09
6 5 0,05 11,59 0,66 12,88
7 5 0,03 11,59 0,39 12,61
8 6 0,02 13,80 0,25 14,68
9 7 0,01 18,68 0,15 19,46
10 7 0,01 18,68 0,09 19,40

By increasing the n/m ratio (and the number of hash functions k executed, respectively) of the Bloom Filter we see an increase in the time spent checking whether the filter contains a candidate string (time filter). On the other hand the time spent producing false positives (false positive time) declines with a growing n/m ratio.

For the given model the local minimum of the expected run-time is at an n/m ratio of 4 (false positive rate of 0,14). The expected worst case run-time of the Bloom Filter-based implementation is better than the plain array copying run-time (with a minimum of 11,69) and the performance benefit increases dramatically with the string length.

In the next blog entries I'll implement this Bloom Filter-based spelling corrector to see if one can capitalize on the theoretical advantages. The next blog entry will be about the hash function support needed.