
20100222

Tuning the Bloom-Filter-based Spelling Corrector

With the candidate iterators and filter in place the Bloom Filter-based spelling corrector can be implemented and the performance measured.

The base implementation is here. The basic idea is described in the first blog post about the Bloom Filter-based spelling corrector, so I won't go into detail about the implementation.

My benchmark is the implementation by David Pollack. As far as I can tell he thought a bit about performance so the comparison is not totally unfair.

The numbers aren't too bad. The Bloom Filter-based implementation is 2.5 times faster. (The performance comparison numbers are based on correcting words of edit distance 0 to 3 derived from the dictionary. The JVM is running with -Xms256m -Xmx256m -server.) My implementation corrects 115 words per second.

bloom 182.04358 (115 Hz)
benchmark 466.560114 (45 Hz)
benchmark/bloom 2.562903421257701

But there is still room for improvement. Executing the application with a memory profiler reveals that a lot of java.lang.Integer objects are allocated. The filter function Int => Boolean is translated by the Scala compiler to the generic trait Function1[Integer, Boolean]. This causes boxing of int values to Integer objects. (The boxing of Boolean values is not a problem. There are only two values as long as Boolean.valueOf is used.) Replacing the filter function with a filter trait based on primitive types solves this problem.
trait Filter{
  def contains(x : Int) : Boolean
}
The other source of boxed integers is the partial application of the StringHasher.hashRange method. When the string parameter is applied to the hashRange method a new Function3[Int,Int,Int,Int] is returned. This causes boxing just as the Function1 object above does. The partial application made the code a bit more readable; we have to compromise on readability here and replace the partial application with a normal method that fits well with the JVM.
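A minimal sketch of what such a method could look like (assumed shape, not the actual code of the tuned implementation): the base string is passed on every call, so no intermediate function object and no boxed Int values are created.
object StringHasher{
  def hashFor(h : Int, c : Char) : Int = h * 31 + c

  // plain loop over primitive locals, no closures involved
  def hashRange(base : String, h : Int, i : Int, end : Int) : Int = {
    var hash = h
    var idx = i
    while(idx < end){
      hash = hashFor(hash, base.charAt(idx))
      idx += 1
    }
    hash
  }
}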

Implementing these improvements resulted in worse performance at the beginning. The problem was that I used s(idx) instead of s.charAt(idx). The s(idx) call converts the String to a RichString and then calls the apply method on it. Both methods are functionally identical, but the first causes conversions from String to RichString whereas charAt is a direct method call on the given String. This is a subtle problem; the best solution would be to disable the implicit conversion from String to RichString defined in Predef for this file.

With both improvements in place the implementation is 3.7 times faster.

bloom 113.69269 (184 Hz)
benchmark 424.767132 (49 Hz)
benchmark/bloom 3.736098881994964

The next step is a bit controversial: basing the candidate production on char arrays instead of Strings. This yields a good performance improvement, the final number being 4.9 times faster than the competing implementation, but it clearly has a number of drawbacks. The code becomes quite low-level, using System.arraycopy instead of RichString methods, and the char[] arrays are mutable, so a collaborator may manipulate the array content.

bloom 89.082697 (235 Hz)
benchmark 437.214067 (48 Hz)
benchmark/bloom 4.907957232143522

Some further improvements are still possible. With enough effort the BloomSpeller, Dictionary and CandidateIterators could be based on a CharSequence flyweight implementation. With this interface in place, candidate iterators can return implementations of CharSequence containing only the base string as common state plus the individual state of the candidate (index to manipulate, candidate hash value and optionally the alphabet character to use). This would save the memory allocation and garbage collection for the candidate char[] array instance as well as redundant hash value computation. (The candidate hash value is computed a second time when the Speller implementation checks whether the candidate is in the Dictionary.)
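A hypothetical sketch of such a flyweight for a replacement candidate (the name and wiring are my assumptions, not the project's code): it shares the base string and stores only the replacement position and character.
class ReplacementCandidate(base : String, idx : Int, c : Char) extends CharSequence{
  def length : Int = base.length
  def charAt(i : Int) : Char = if(i == idx) c else base.charAt(i)
  def subSequence(start : Int, end : Int) : CharSequence = toString.substring(start, end)
  // a String is materialized only when really needed
  override def toString : String = {
    val chars = base.toCharArray
    chars(idx) = c
    new String(chars)
  }
}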

The final step is to tune the Bloom Filter configuration. In the first blog entry of this series I calculated expected performance values based on the Bloom Filter worst case (all candidates are contained in the Bloom Filter). The actual numbers differ from the expected results as most candidate checks against the Bloom Filter abort early, before all k hash functions have been executed. Another factor not present in the model is the optimization to compute hash values for only about half of the string length.

The performance is surprisingly stable for an n/m ratio of 4 up to 256. For an n/m ratio below 4 the production of false candidates dominates the performance characteristics. For an n/m ratio greater than 128 the execution of hash functions starts to degrade performance. Overall, a configuration with an n/m ratio of 4 is good enough; only small improvements can be attained by changing this number.

n/m ratio   performance ratio
1           1,85
2           3,07
4           4,05
8           5,07
16          4,75
32          4,83
64          4,93
128         4,69
256         3,89
512         1,78
1024        0,53

Conclusion


Scala is a flexible language. You can write high-level and low-level code as needed. The basic approach of starting with a high-level implementation and optimizing the hot spots worked out quite well for this example. The byte code emitted by the Scala compiler is good enough in most cases. Some abstractions of the language do not fit too well with the Java platform. This is not a problem for regular code and can be fixed with low-level implementations for hot spots, as we've seen. The new @specialized annotation may help here in the future to lower the amount of low-level code.

There are still some areas worth investigating. How can the candidate production and filtering be distributed over multiple cores? Using a fork-join framework or Scala's Future construct seems to fit quite well here. Another idea is to compare the Bloom Filter-based spell checker implementation against a BK-Tree-based implementation.
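A rough sketch of the Future idea (my own code based on the scala.actors Futures available in Scala 2.7/2.8; the names are made up): each candidate iterator is filtered in its own task and the results are joined at the end.
import scala.actors.Futures.future

def filterParallel(iterators : List[Iterator[String]], inDictionary : String => Boolean) = {
  // start one task per candidate iterator
  val tasks = iterators.map(it => future { it.filter(inDictionary).toList })
  // applying a future blocks until its result is available
  tasks.flatMap(task => task())
}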

20100216

Producing candidates for a spelling corrector

The next step to implement the Bloom Filter-based spelling corrector is to create and check hash values for candidates against the filter without actually creating candidate instances. If the filter step succeeds the candidate String instances are created in the next step.

All candidate iterators (for insertion, deletion, transposition and replacement) are based on the CandidateIterator trait. The CandidateIterator can return a hash value for a candidate with the hashValue method and a candidate string with the candidate method. The next method scrolls to the next value.
trait CandidateIterator{
  def next : Boolean
  def candidate : String
  def hashValue : Int
}
An instructive implementation is the replacement CandidateIterator. It replaces each character of the base string with a character from the alphabet to create a candidate.

The candidate method does exactly this for candidate strings. It takes the base string and replaces the character at the idx position with the character from the alphabet at the alphabetIdx position (the alphabet is defined as val alphabet = 'a' to 'z' toArray).
def candidate = (base take idx) + Dictionary.alphabet(alphabetIdx) + (base drop idx + 1)
The hashValue method is not as easily readable as the candidate method. The implementation is based on the insight that the hash function for Strings is defined in a way that lets us reuse a base hash value for the substring base.substring(0, idx). With that in place we can save about half of the hash computations. This has huge performance benefits over an implementation that works directly with Strings, where hash values have to be computed from scratch for every candidate. Analogous to the String copy costs, the hashing costs increase nearly linearly with the String length. The performance benefits will be especially relevant for long Strings, which produce a lot of candidate strings.
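A small illustration of this reuse (my own example): the hash of a prefix can serve as the base hash for any candidate that keeps the prefix and only changes later characters.
val base = "sca".hashCode       // 31 * (31 * 's' + 'c') + 'a'
val full = base * 31 + 'x'      // hash of "scax" without rehashing the prefix
assert(full == "scax".hashCode)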

The next method takes care that idx and alphabetIdx point to valid positions. The baseHash value is calculated iteratively in the next method from the baseHash of the last round whenever the idx value changes.

The hashValue implementation is now based on the baseHash value. The hash is computed for the character taken from the alphabet and the remaining string base.substring(idx + 1, base.length).
class Replacement(base : String) extends BaseCandidateIterator(base){
  private var idx = 0
  private var alphabetIdx = -1
  private var baseHash = 0
  def next = {
    alphabetIdx += 1
    if(alphabetIdx == Dictionary.alphabet.length){
      idx += 1
      alphabetIdx = 0
      baseHash = hashAt(baseHash, idx - 1)
    }
    idx < base.length
  }
  def candidate = (base take idx) + Dictionary.alphabet(alphabetIdx) + (base drop idx + 1)
  def hashValue = {
    val pivot = hashFor(baseHash, Dictionary.alphabet(alphabetIdx))
    val right = hashRange(pivot, idx + 1, base.length)
    right
  }
}
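A quick sanity check I find useful (assuming BaseCandidateIterator binds hashAt, hashFor and hashRange to the base string as described in the previous post): the hash produced without allocating the candidate should equal the candidate's own hashCode.
val it = new Replacement("scala")
while(it.next) assert(it.hashValue == it.candidate.hashCode)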
The CandidateIterator lets us iterate over candidate hash values without creating objects. Outside of the CandidateIterators, the filtering takes place by calling the candidate method to produce an instance only when the hash value is in the filter. This work is done by the CandidateFilter.
class CandidateFilter(base : CandidateIterator, filter : Int => Boolean) extends Iterator[String]{
  private var more : Boolean = false
  def hasNext : Boolean = {
    def hasMore : Boolean = {
      while(base.next) if(filter(base.hashValue)) return true
      false
    }
    more ||= hasMore
    more
  }

  def next : String = {
    more = false
    base.candidate
  }
}
The class takes a CandidateIterator and a filter. It implements the Iterator trait. Values returned by the CandidateFilter are candidate strings that passed the Bloom Filter test.

Users of the CandidateFilter and CandidateIterator can now use the Iterator convenience methods and for expressions to work with candidate strings as usual.
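For example, a hypothetical wiring (the BloomFilter and its sizing are taken from the Bloom Filter post below and are just an illustration):
val bloom = new BloomFilter(1 << 20, 100000)
val candidates = new CandidateFilter(new Replacement("scala"), bloom.contains _)
for(word <- candidates) println(word)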

With the CandidateIterator implementations and the CandidateFilter in place, the spelling corrector can be implemented in the final step.

20100208

Implementing hash functions in functional style

The first ability needed to enable filtering with the Bloom Filter-based spell checker is to produce hash values for candidates.

The basic functionality is to hash a character given a base hash value from a preceding invocation. The hashFor function implements this.

def hashFor(h: Int, c: Char) : Int = h * 31 + c
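A quick sanity check (my own example): for a single character the result matches the java.lang.String hash.

assert(hashFor(0, 'a') == "a".hashCode)   // both are 97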

Two further functions help with building candidate hashes: hashAt and hashRange.

HashAt builds the hash for a character of the base string at an index position, given a base hash value.

def hashAt(base : String)(h: Int, i: Int) : Int= hashFor(h, base.charAt(i))

The hashAt method is defined in a way that allows partial application of parameters. (It's not clean currying: the Scala language allows defining functions with mixed arity, unary functions as well as functions with arity greater than 1.)

The motivation to define the function in this way is to be able to use it with fewer parameters than the method definition def hashAt(base : String, h: Int, i: Int) when it is used repeatedly with the same base string parameter.

To get the function signature you can access the function object in the Scala console:

scala> hashAt _
(String) => (Int, Int) => Int

Alternatively the signature can be written as (String) => ((Int, Int) => Int) for better readability. The function takes one parameter and will return a function of type  (Int, Int) => Int.

To make any sense of this let's apply the parameters one by one:

scala> val f = hashAt _
f: (String) => (Int, Int) => Int = <function>

scala> val g = f("s")
g: (Int, Int) => Int =  <function>

scala> val h = g(0,0)
h: Int = 115

The parameters are applied left to right. The intermediate functions have their parameters bound step by step until there are no parameters left to bind and the function is evaluated. The sequence is equivalent to calling hashAt("s")(0,0).

HashRange hashes the characters of a string from a start index to an end index (exclusive), given a base hash value.

def hashRange(base : String)(h: Int, i: Int, end : Int) : Int =
    if(i < end) hashRange(base)(hashAt(base)(h, i), i + 1, end) else h

Using the hashRange method you can produce the same hashValues as the java.lang.String implementation.
public int hashCode() {
    int h = hash;
    if (h == 0) {
        int off = offset;
        char val[] = value;
        int len = count;
        for (int i = 0; i < len; i++) {
            h = 31*h + val[off++];
        }
        hash = h;
    }
    return h;
}
The equivalent definition based on the hashRange function is:

def hash(s : String) = StringHasher.hashRange(s)(0, 0, s.length)
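A quick check that both definitions agree (my own example):

assert(hash("scala") == "scala".hashCode)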

The function is defined as a tail recursive method. A tail recursive function only returns values from recursive calls and does not manipulate these return values further. With a tail recursive definition the Scala compiler is able to translate the recursive call into a loop. Defining a tail recursive function is a bit tricky: you may inadvertently change a tail recursive version into a normal recursive one, or an optimized tail recursive function into an unoptimized one. To avoid the pitfalls of tail recursive functions Scala 2.8 introduces the @tailrec annotation. It produces a compiler error if a function cannot be optimized.

With the hash support methods hashAt, hashFor and hashRange in place we can implement hash values for candidates. I'll implement the insertion candidate production to demonstrate the application of the methods. The insertion adds a character from an alphabet at a given index.

It is always good to write some tests (besides the fact that they test the functionality) to be able to look at the API from the user's perspective. Testing the individual methods is not enough to get a feel for the API; users will always use a combination of the three methods. One concern is that the hash code is calculated incrementally, so the user has to define the hash functions in an incremental manner as well, using the output of a preceding execution as an input parameter for succeeding method calls, which may be error prone.

The law for the insert candidate hash can be defined with the equivalent hash function of the java.lang.String implementation. It can be expressed as a specification-based test:

@Test def userSample {
  for{s <- strings
    index <- integers(0, s.length - 1)
    c <- basicLatinCharacters}{
    val insertCandidate = (s take index) + c + (s drop index)
    assertEquals(insertCandidate.hashCode, insertCandidateHash(s, index,c))
  }
}
Three methods of the specification-based test framework Quickcheck are used in the test:
- strings returns an arbitrary string value
- integers(min, max) returns an integer value from min to max (inclusive)
- basicLatinCharacters returns a sample character from the Unicode Latin-1 code page.
All three methods return values through the Generator interface, which is converted with an implicit type conversion to a Scala Iterator to be able to use it in a for expression.

With the input string, insertion index and character to insert given, the actual law is quite simple. We construct a string that is an equivalent insertion candidate and get its hash value. The hash value returned by the insertCandidateHash method under test has to be the same value. (This is a testing pattern in the context of specification-based tests: testing based on an analogous function. Stay tuned for an in-depth description of the testing patterns.) This example shows that tests which are unpleasant to express with classical scenario-based TDD can be expressed in a compact way with specification-based testing. The last piece of the puzzle is the implementation of the insertCandidateHash function.
def insertCandidateHash(s : String, i : Int, c : Char) = {
  val hashRange = StringHasher.hashRange(s) _
  val left = hashRange(0, 0, i)
  val pivot = StringHasher.hashFor(left, c)
  val right = hashRange(pivot, i, s.length)
  right
}
The sample code demonstrates the use of partial application of function parameters in the first line. The string parameter is applied to the function, returning a function with the signature (Int, Int, Int) => Int. This function object can then be used for all hash values of the same base string. The hash value is calculated for the left, pivot and right characters of the string. The characters for the left and right hash values are unchanged from the base string. The hash value for the pivot element is one from the alphabet, as one would expect for an insertion.

In the next blog I'll implement the candidate iterators for insertion, replacement, transposition and delete based on the StringHasher.

20100202

A faster spelling corrector

The compact spelling corrector implementation showed some of the features of Scala: for expressions, higher order functions (fold and map), tuple support and the infix method syntax. The current implementation has terrible performance, so we try to develop a faster corrector implementation.

Let's have a look at it at run-time with the VisualVM memory profiler.


Not very surprisingly, it creates a lot of objects. To correct 20 words of edit distance 0 to 3, randomly picked from the big.txt file, it allocated 15 million char[] arrays and 9 million String instances. Allocation may be cheap, but allocating millions of objects is not.

One way to improve the performance of the spelling corrector is to allocate fewer objects. The spelling corrector is based on a brute force approach: calculate all possible candidates for a given edit distance. The most expensive operation is to create candidate string instances. The java.lang.String implementation is immutable. Every single change to a String will force the creation of a new instance.

But we can do better than that. We can capitalize on the knowledge that the Strings won't be random: the candidates can easily be deduced from the base string. A candidate string has some additional private state depending on the operation (insert, transpose, replace, delete) performed. Phrasing it this way leads directly to the flyweight pattern. Flyweights can be implemented for every operation, implementing the CharSequence interface that is already part of the JDK. This yields quite good performance improvements: no additional char[] arrays will be allocated.

I'll skip the flyweight implementation and leave it as an exercise. There is a better approach.

The flyweight implementation will still create one object per candidate. Even this can be omitted. If we picked only promising candidates, the creation of objects would be reduced further and the performance increased. By happy coincidence, a data structure can help here: the Bloom Filter.

In addition to the dictionary of word frequencies that is already used in the base implementation of the spelling corrector, a Bloom Filter is initialized with all words.

The candidate generation is then separated into multiple steps: candidate hashing, hash filtering with the Bloom Filter, candidate instantiation and dictionary filtering.

Using the Bloom Filter, most of the candidates will not survive the hash generation and hash filtering steps. The hashed candidates can be calculated without instantiating any objects, just using primitive int values that live on the stack. This has super performance characteristics: no memory allocation and no garbage collection needed.

For every type of candidate (deletion, replacement, insertion and transposition) an Iterator can be implemented that produces the candidate hash values. The iterator implementations are based on a base string and the additional state (positions in the base string and positions in the alphabet) necessary for the candidate hash value production.

All candidate hash values are checked against the Bloom Filter. If a hash value is contained, the candidate string is created and checked against the dictionary. The second check is necessary as the Bloom Filter produces false positives.

The actual false positive rate can be freely chosen. It will have no impact on the functionality. It could be 100%. The false positive rate can be changed to tune the performance depending on memory consumption requirements, cost of hash functions used in the Bloom Filter implementation and object instantiation cost for candidate strings.

To check on the basic assumption we can measure the performance characteristics of the basic string operations needed and the Bloom Filter. The limited performance test compares the run-time of a character array copy needed to create the base char[] for one insertion candidate string (creating a String candidate won't be faster than this)

// stringValue is the char[] of the base word, s a char[] one element longer
System.arraycopy(stringValue, 0, s, 0, 1)                       // copy the prefix
s(1) = 'c'                                                      // the inserted character
System.arraycopy(stringValue, 1, s, 2, stringValue.length - 1)  // copy the suffix

and the Bloom Filter contains method performance.

All presented durations lack a unit of measurement, as all measurements are based on the same scale. This is sufficient to compare the values.

The time needed to create the insertion candidate string grows linearly with the base string length.


char[] length   character copy time
2               11,69
4               11,70
8               12,57
16              15,95
32              21,57
64              30,21
128             51,22


The Bloom Filter time grows linearly with the number of hash functions (k) used. The optimal number of hash functions is determined according to the ratio between the expected elements (n) and the Bloom Filter size (m).

k = \frac{m}{n} \ln 2
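As a worked example (my own arithmetic): for the n/m ratio of 4 used below this gives k = ceil(4 * ln 2) = ceil(2.77) = 3 hash functions, which matches the table.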


The measured time is the worst case: an entry (or a false positive) is in the Bloom Filter. In this scenario all k-hash functions are executed.




n/m ratio   k   false positive rate   time filter
1           1   0,62                  3,73
2           2   0,39                  5,70
3           3   0,24                  7,70
4           3   0,14                  7,71
5           4   0,09                  9,38
6           5   0,05                  11,59
7           5   0,03                  11,59
8           6   0,02                  13,80
9           7   0,01                  18,68
10          7   0,01                  18,68

In a real world scenario the average run-time of the Bloom Filter will be better, as the execution can be aborted for values that are not in the Bloom Filter. The false positive probability is another important figure of the Bloom Filter:

p_f =(1-e^{-kn/m})^k

A false positive causes additional work based on the assumption that we are working with a valid value. This run-time is wasted, as the subsequent check against the dictionary will eliminate the false positives.

To get the expected resulting performance of the spelling corrector we construct a small model. The expected run-time is determined by the cost of the filter t_f, the false positive probability p_f, the probability p_w that a candidate is a word in the dictionary, and the time t_s to construct a candidate string. The time spent to construct strings which are false positives is p_f * t_s. The time spent to construct strings which are words is p_w * t_s. (To simplify the formula a constant string length is assumed.) The expected run-time is then:

t = t_f + p_f \cdot t_s + p_w \cdot t_s

The values computed are based on candidate strings with a length of 8 chars (so t_s is the copy time measured for length 8 above) and the probability that 5% of the candidates are words contained in the dictionary (p_w = 0.05).


n/m ratio   k   false positive rate (p_f)   time filter (t_f)   false positive time (p_f * t_s)   expected time (t)
1           1   0,62                        3,73                7,84                              12,20
2           2   0,39                        5,70                4,88                              11,21
3           3   0,24                        7,70                3,04                              11,37
4           3   0,14                        7,71                1,76                              10,10
5           4   0,09                        9,38                1,09                              11,09
6           5   0,05                        11,59               0,66                              12,88
7           5   0,03                        11,59               0,39                              12,61
8           6   0,02                        13,80               0,25                              14,68
9           7   0,01                        18,68               0,15                              19,46
10          7   0,01                        18,68               0,09                              19,40

By increasing the n/m ratio of the Bloom Filter (and the number of hash functions k executed, respectively) we see an increase in the time spent to check whether the filter contains a candidate string (time filter). On the other hand, the time spent to produce false positives (false positive time) declines with a growing n/m ratio.

For the given model the local minimum of the expected run-time is at an n/m ratio of 4 (false positive rate of 0,14). The expected worst case run-time of the Bloom Filter-based implementation is better than the plain array copying run-time (with a minimum of 11,69), and the performance benefit increases dramatically with the string length.

In the next blog entries I'll implement this Bloom Filter-based spelling corrector to see if one can capitalize on the theoretical advantages. The next blog entry will be about the hash function support needed.

20100118

Why overrideable tail recursive methods must not be optimized

Suppose you define a tail recursive method x in an object A. You can use the @tailrec annotation to mark methods where you expect the optimization of the tail recursive code to happen.
import annotation._
object A{
    @tailrec def x(i : Int, s : String) : String = if(i > 0) x(i - 1, s + "a:" + i) else s
}
Now you move the method x to a trait or class and suddenly the compilation fails.
import annotation._
trait A{
    @tailrec def x(i : Int, s : String) : String = if(i > 0) x(i - 1, s + "a:" + i) else s
}
:8: error: could not optimize @tailrec annotated method
            @tailrec def x(i : Int, s : String) : String = if(i > 0) x(i - 1, s + "a:" + i) else s
I had exactly this problem. I defined a perfectly valid tail recursive function but the compiler would not optimize it to a loop. This was in Scala 2.7, which does not support the @tailrec annotation, so you have the additional issue of spotting the problem in the first place.

The problem is that overrideable tail recursive methods won't be optimized. If you change the code to
import annotation._
trait A{
    @tailrec final def x(i : Int, s : String) : String = if(i > 0) x(i - 1, s + "a:" + i) else s
}
it will compile again. This is the rule. But why?

Performance optimizations must not change the behavior at run-time, but overriding an optimized tail recursive method would behave differently from overriding an unoptimized one.

Let's define A in a way that it won't be optimized and override the tail recursive method x in the subclass B.
class A{
    def x(i : Int, s : String) : String = if(i > 0) x(i - 1, s + "a:" + i) else s
}

class B extends A{
    override def x(i : Int, s: String) : String = super.x(i, s + "b:")
}
This will produce the output:
scala> new A().x(3, "")
res0: String = a:3a:2a:1

scala> new B().x(3, "")
res1: String = b:a:3b:a:2b:a:1b:
The implementation of the method x in B invokes the method x in A. When the method x defined in A invokes x, the call is dispatched to the method x in B (for an instance of B). This behavior is due to late binding of calls to x.

Suppose it were possible to optimize the method x in A. The compiler would then generate a class similar to the following definition of A, where the recursive invocation of x is optimized to a loop. With this definition of x there is no possibility to apply late binding to x; the optimization would hard-wire the definition of x.
class A{
    def x(i : Int, s : String) : String = {
        var j = i
        var t = s
        while(j > 0){
            t = t + "a:" + j
            j = j - 1
        }
        t
    }
}

class B extends A{
    override def x(i : Int, s: String) : String = super.x(i, "b:" + s)
}
scala> new A().x(3, "")
res0: String = a:3a:2a:1

scala> new B().x(3, "")
res1: String = b:a:3a:2a:1
The early binding of x in A changes the behavior of B. Once the method x in A is called it will never come back to B. This is the reason why the Scala compiler can't optimize overrideable tail recursive methods.

Following this explanation, it should be clear that the @tailrec annotation is worthwhile and should be used in all your Scala 2.8 code where you expect the optimization to happen. For Scala 2.7 it's a bit unfortunate that moving a tail recursive method from an object to a trait will change the behavior significantly without a compiler error. So be warned.

20100107

How to setup a Maven Scala project with IntelliJ IDEA support

Setting up a Maven and Scala project with IDEA is simple.

Install IntelliJ IDEA Community Edition and the Scala plugin (File -> Settings -> Plugins -> Available).

Generate your maven project using the simple Scala archetype:

mvn org.apache.maven.plugins:maven-archetype-plugin:1.0-alpha-7:create 
    -DarchetypeGroupId=org.scala-tools.archetypes  
    -DarchetypeArtifactId=scala-archetype-simple
    -DarchetypeVersion=1.2
    -DremoteRepositories=http://scala-tools.org/repo-releases
    -DgroupId=group
    -DartifactId=artifact

Add the Scala compiler lib to your project dependencies in the pom.xml file:
<dependencies>
    <dependency>
        <groupId>org.scala-lang</groupId>
        <artifactId>scala-compiler</artifactId>
        <version>${scala.version}</version>
        <scope>compile</scope>
    </dependency>
</dependencies>

Update the Scala version configuration to a recent release:
<properties>
    <scala.version>2.7.7</scala.version>
</properties>
Open a new project. Select the pom.xml in the file dialog.

A notification will pop up. Click "Create Scala Facet".

Done.

You can now compile and execute as usual and run the Maven build mvn install with the Maven Tool Window. (The M2_HOME environment variable or the Maven home setting (File -> Settings -> Maven -> Maven home directory) has to be present.)

20100106

The best Scala IDE: IntelliJ IDEA

After years and years of using Eclipse (after NetBeans, JBuilder, Visual Cafe, Emacs and other spooky Java "editors"), IntelliJ IDEA is a powerful, convenient and simple-to-use IDE for Scala development. Oh, and yes, it's free: IntelliJ IDEA Community Edition.

I tried IDEA because the Scala plug-in for Eclipse is so horrible. It works so badly that using good old jEdit with some macros is way better. On a small project I had to clean the whole project for every compile. It's so bad. After several tries (starting sometime in 2008) I've given up on this thing.

Before IDEA I also tried to use NetBeans, but it failed the dumb-developer-that-does-not-read-the-manual test. Basic functionality should be usable without a detailed study of the documentation. That worked for me with IDEA, but I could not get the Scala plugin to run in NetBeans. It's probably me and everything is fine with NetBeans. Still, I've invested less time to get things done with IDEA. It simply worked.

The simple summary: every basic IDE feature works: syntax highlighting, compiling, debugging, navigation and outline.

If you worried about productivity of Scala development because it was missing IDE support: stop worrying. The support is here.

I'm a heavy shortcut user, so using a new IDE naturally slows me down. Using the Eclipse key map was not an option for me: it works only 80% of the time. The main problem is that functionality with the same shortcuts behaves slightly differently, which is maddening in the long run. So it's better to learn a new set of shortcuts. It's like driving in Great Britain: it's easier when you are using a car with the steering wheel on the right. Driving a car with the steering wheel on the left is dangerous. There are the countryside hedges where you'll see practically nothing in a left turn, and chances are that you're going to start your trip on the right-hand side of a calm country road. So better relearn.

Interestingly, IDEA has a code analysis tool that does spell checking in comments, method and variable names. This already works for Scala. Eclipse does spell checking only in comments. This is a boon for code quality: you do not want to read incorrectly spelled source code. It is plain annoying to read, fix and, frankly, to get right (camel case does not make it easier). (The code analysis tool for Java is very good. It finds tons and tons of problems after FindBugs. I have not heard about the analysis tool before. It would be nice if it could be integrated into build tools and used in continuous integration environments.)

The Maven support seems to work out of the box. It reads a pom.xml and builds IDEA project files from it. It still uses its internal build infrastructure. With Eclipse you have to generate the .project and .classpath files manually, which works but is an extra step, and the Eclipse-specific configuration in pom files can get quite clunky.

The essential things work for Scala. You have code completion, though without implicit type conversion information. I suppose the support for implicit type conversions is quite hard to implement: there are a lot of callable members reachable through type conversions for a given object. I suppose the best way to implement this would be a 2-step process: first show all directly reachable members and, after hitting Ctrl + Shift again, show all reachable members. Method parameters are missing for decompiled Scala classes and as code completion information.

Besides these minor problems IntelliJ IDEA is a wonderful tool to develop Scala with and it does not crash. Go and try it yourself.

20091231

Peter Norvig's Spelling Corrector in Scala

This is a Scala version of Peter Norvig's spelling corrector. Reading his essay I realized that Python is quite a nice language and similar to Scala due to the support of for-expressions in both languages. Before that I've seen Python as a strange language that made you write self in your classes all the time.

I tried to write the shortest version possible without introducing redundancy. The code is definitely not very readable.

import util.matching.Regex.MatchIterator

val alphabet = 'a' to 'z' toArray
def train(features : MatchIterator) = (Map[String, Int]() /: features)((m, f) => m + ((f, m.getOrElse(f, 0) + 1)))
def words(text : String) = ("[%s]+" format alphabet.mkString).r.findAllIn(text.toLowerCase)
val dict = train(words(io.Source.fromFile("big.txt").mkString))

def edits(s : Seq[(String, String)]) = (for((a,b) <- s; if b.length > 0) yield a + b.substring(1)) ++
  (for((a,b) <- s; if b.length > 1) yield a + b(1) + b(0) + b.substring(2)) ++
  (for((a,b) <- s; c <- alphabet if b.length > 0) yield a + c + b.substring(1)) ++
  (for((a,b) <- s; c <- alphabet) yield a + c + b)

def edits1(word : String) = edits(for(i <- 0 to word.length) yield (word take i, word drop i))
def edits2(word : String) = for(e1 <- edits1(word); e2 <-edits1(e1)) yield e2
def known(words : Seq[String]) = for(w <- words; found <- dict.get(w)) yield w
def or[T](candidates : Seq[T], other : => Seq[T]) = if(candidates.isEmpty) other else candidates

def candidates(word: String) = or(known(List(word)), or(known(edits1(word)), known(edits2(word))))

def correct(word : String) = ((-1, word) /: candidates(word))(
  (max, word) => if(dict(word) > max._1) (dict(word), word) else max)._2

List("osters", "musters", "mixters") map correct foreach println 

You can run the script directly with the Scala interpreter:
$ scala test.scala
posters
masters
matters

20091220

A decent Bloom Filter in Scala

The Bloom Filter is an interesting probabilistic data structure. Its main advantage is that the Bloom Filter drastically reduces the memory consumption.

It produces false positives, which means that it may falsely state that a value is contained. On the other hand it ensures that it produces no false negatives. This renders it useful for space-efficient in-process filtering.

The memory needed is not determined by the size of the objects stored but by the number of elements and the quality of the Bloom Filter (the false positive rate). Whole cars, houses and star ships fit into this little Bloom Filter.

My Scala implementation is based on a decent stand-alone Java Bloom Filter by Ian Clarke.
import java.lang.Math._

class BloomFilter(val size : Int, val expectedElements : Int ){
  require(size > 0)
  require(expectedElements > 0)

  val bitArray = new BitArray(size)
  val k = ceil((bitArray.size / expectedElements) * log(2.0)).toInt
  val expectedFalsePositiveProbability = pow(1 - exp(-k * 1.0 * expectedElements / bitArray.size), k)

  def add(hash : Int) {
    def add(i : Int, seed : Int) {
      if(i == k) return
      val next = xorRandom(seed)
      bitArray.set(next)
      add(i + 1, next)
    }
    add(0, hash)
  }

  def contains(hash : Int) : Boolean = {
    def contains(i : Int, seed : Int) : Boolean = {
      if(i == k) return true
      val next = xorRandom(seed)
      if (!bitArray.get(next)) return false
      return contains(i + 1, next)
    }
    contains(0, hash)
  }

  private def xorRandom(i : Int) = {
    var y = i
    y ^= y << 13
    y ^= y >> 17
    y ^ y << 5
  }
}
It has some advantages over the base version:
- My implementation replaces java.util.Random with an in-lined Xorshift random number generator that has sufficient randomness for a Bloom Filter. Additionally, java.util.Random is thread-safe, which decreases the performance unnecessarily.
- The BitSet implementation is replaced by a simplified bit array implementation, an idea I've borrowed from a Bloom Filter implementation by Kevin Bourrillion. The size of the BitArray is always a power of 2. This allows some performance optimizations: the modulo operation used to transform a hash function result into a bit set index can be replaced with bit masking: x % size = x & (size - 1) (when size is a power of 2).
class BitArray(bits : Int){
  val size = nextPow2(bits)
  require(isPowerOf2(size))
  private val data = new Array[Long](size >> 6)

  def set(index : Int) = data(idx(index)) |= (1L << index)
  def get(index : Int) = (data(idx(index)) & (1L << index)) != 0

  private val mask = size - 1
  private def idx(index : Int) = (index & mask) >> 6
  private def isPowerOf2(i : Int) = ((i - 1) & i) == 0
  private def nextPow2(i : Int) = {
    def highestBit(remainder : Int, c : Int) : Int = 
       if(remainder > 0) highestBit(remainder >> 1, c + 1) else c
    require(i <= (1 << 30))
    val n = if(isPowerOf2(i)) i else 1 << highestBit(i, 0)
    assert(n >= i && i * 2 > n && isPowerOf2(n))
    n
  }
}
- The Bloom Filter works directly with hash values. I think that the Set semantics are overstretched when applied to Bloom Filters: a lot of the methods are not supported and the return values of add (and addAll) are misleading. Using the hash values directly is useful if you do not want to create objects just to check whether they are contained in the filter. The assumption here is that object instantiation is far more expensive than calculating hash values, so objects are created only for the values contained in the filter. Normal uses are of course still possible with this interface.

All the changes add up to an increase in performance of at least 300% compared to the base implementation with a similar false positive rate.
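A short usage sketch (my own example; the sizing values are arbitrary):
val filter = new BloomFilter(1 << 20, 100000)
List("scala", "bloom", "filter").foreach(w => filter.add(w.hashCode))
println(filter.contains("scala".hashCode))    // true
println(filter.contains("haskell".hashCode))  // false, or a false positive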