Implementing hash functions in functional style

The first ability needed to enable filtering with the Bloom Filter-based spell checker is to produce hash values for candidates.

The basic functionality is to hash a character given a base hash value from a preceding invocation. The hashFor function implements this.

def hashFor(h: Int, c: Char) : Int = h * 31 + c

Two further functions help with building candidate hashes: hashAt and hashRange.

HashAt builds the hash for the a character of base string at a index position given a base hash value.

def hashAt(base : String)(h: Int, i: Int) : Int= hashFor(h, base.charAt(i))

The hashAt method is defined in a way to be able to use partial application of parameters. (It's not using clean currying. The Scala language allows to define functions with mixed arity: unary functions and functions with arity greater than 1.)

The motivation to define the function in this way is to be able to use it with fewer parameters than the method definition def hashAt(base : String, h: Int, i: Int) when it is used repeatedly with the same base string parameter.

To get the function signature you can access the function object in the Scala console:

scala> hashAt _
(String) => (Int, Int) => Int

Alternatively the signature can be written as (String) => ((Int, Int) => Int) for better readability. The function takes one parameter and will return a function of type (Int, Int) => Int.

To make any sense of this let's apply the parameters one by one:

scala> val f = hashAt _
f: (String) => (Int, Int) => Int = <function>

scala> val g = f("s")
g: (Int, Int) => Int =  <function>

scala> val h = g(0,0)
h: Int = 115

The parameters are applied left to right. The intermediary function definitions has each parameters bound until there are no parameter left to bind and the function is evaluated. The sequence is equivalent to calling hashAt("s")(0,0).

HashRange hashes characters of a string from a start index to the end index (exclusively) given a base hash value.

def hashRange(base : String)(h: Int, i: Int, end : Int) : Int =
    if(i < end) hashRange(base)(hashAt(base)(h, i), i + 1, end) else h

Using the hashRange method you can produce the same hashValues as the java.lang.String implementation.

public int hashCode() {
    int h = hash;
    if (h == 0) {
        int off = offset;
        char val[] = value;
        int len = count;
            for (int i = 0; i < len; i++) {
                h = 31*h + val[off++];
            }
            hash = h;
        }
    return h;
}

The equivalent definition based on the hashRange function is:

def hash(s : String) = StringHasher.hashRange(s)(0, 0, s.length)

The function is defined as a tail recursive method. A tail recursive function is a function that does only return values from recursive calls and won't manipulate these return values. With a tail recursive definition the Scala compiler is able to translate the recursive call into a loop. Defining a tail recursive function is a bit tricky and you may inadvertently change a tail recursive version into a normal recursive one and even an optimized tail recursive function into a not optimized one. To omit the pitfalls of tail recursive functions Scala 2.8 introduces a @tailrec annotation. This produces a compiler failure if a function cannot be optimized.

With the hash support methods hashAt, hashFor, hashRange in place we can implement hash values for candidates. I'll implement the insertion candidate production to demonstrate the application of the methods. The insert adds a characters from an alphabet at a given index.

It is always good to write some tests (besides the fact that it tests the functionality) to be able to have a look at the API from the user's perspective. Testing the individual methods is not enough to get a feel of the API. Users will always use a combination of the three methods. One concern is that the hash code is calculated incrementally and the user has to define the hash functions in a incremental manner as well. Using the output of preceding execution as an input parameter for succeeding method calls which may be error prone.

The law for the insert candidate hash can be defined with the equivalent hash function of java.lang.String implementation. It can be expressed in specification-based test:

@Test def userSample {
  for{s <- strings
    index <- integers(0, s.length - 1)
    c <- basicLatinCharacters}{
    val insertCandidate = (s take index) + c + (s drop index)
    assertEquals(insertCandidate.hashCode, insertCandidateHash(s, index,c))
  }
}

Three methods of the specification-based test framework Quickcheck are used in the test:
- strings returns an arbitrary string value
- integer(min,max) returns an integer value from min to max (inclusive)
- basicLatinCharacters returns a sample character from the from the Unicode Latin-1 code page.
All three methods return values through the Generator interface that is converted with a implicit type conversion to an Scala Iterator to be able to use it in a for expression.

With the input string, insertion index and character to insert given the actual law is quite simple. We construct a string that is an equivalent insertion candidate and get its hash value. The hash value returned by the insertCandidateHash method under test has to be produce the same value. (This is a testing pattern in the context of specification-based test: testing based on a analogous functions. Stay tuned for a in depth description of the testing patterns.) This example shows that unpleasant tests with classical scenario-based TDD can be expressed in a compact way with specification-based testing. The last piece of the puzzle is the implementation of the insertCandidateHash function.

def insertCandidateHash(s : String, i : Int, c : Char) = {
  val hashRange = StringHasher.hashRange(s) _
  val left = hashRange(0, 0, i)
  val pivot = StringHasher.hashFor(left, c)
  val right = hashRange(pivot, i, s.length)
  right
}

The sample code demonstrates the use partial application of function parameters in the first line. The string parameter is applied to the function returning a function with the signature (Int, Int, Int) => Int. This function object can than be used for all hash values of the same base string. The hash value is calculated for the left, pivot and right characters of the string. The characters for the left and right hash values are unchanged from the base string. The hash value for the pivot element is one from the alphabet as one would expect for an insertion.

In the next blog I'll implement the candidate iterators for insertion, replacement, transposition and delete based on the StringHasher.

2 comments:

Artur BiesiadowskiFebruary 12, 2010
I wonder what is the performance difference between java hashcode and equivalent scala hash(String) using the tailcalls. Is it in range of 10-20% (which would be just effect of accessing values from outside of class through accessors) or rather 2-5 times slower ?
Thomas JungFebruary 12, 2010
The performance should be comparable. When the tailcall optimzation applies all access is to local variables (on the stack) and method calls will be inlined by hotspot. The partial application has an impact on performance, though. I'll blog about this soon.

Thomas Andreas Jung's Blog

20100208