## 20101207

### Inverse Function Test Pattern in Quickcheck

I already mentioned the inverse function test pattern in the introduction to specification-based testing using Quickcheck. Now I would like to present the pattern in more detail with an example.

The basic idea of testing with inverse functions is simple. We have two functions f and its inverse function f-1. If we apply f to an input value and then take the result and apply it to f-1 this should result in the input value again: f-1(f(x)) = x.

This test pattern is applicable for all kinds of functions for example compression and encryption. In the business application domain create and delete are examples of operations that can be tested this way. Another example are do and undo operations of command objects. In both examples, doing and undoing an action leaves the state of the world unchanged.

There are some constraints on the applicability of the test pattern. The function f has to have a inversion function so it is a bijective function and one of the functions f or f-1 have to be tested. Otherwise the intermediary result could be invalid. For example, if you test a create and a delete operation together, the inverse function test passes if both operations do nothing.

The example used to illustrate the inverse function testing is a simple square function.

public double square(double a) {
double square = a * a;
return square;
}


The inverse function test can be implemented with concrete values. We use the Math.sqrt() implementation as the inverse function.

@Test public void squareConcreteValues() {
double a = 2;
assertEquals(square(Math.sqrt(a)), a, precission);
}


This is okay but defining input values manually is not very productive, not readable and has not sufficient test coverage. You can instead employ some computing power to generate the values using Quickcheck.

Firstly, the square function is not bijective as square(-x) = square(x). This is a fact we did not express in the example with the concrete values. It simply omitted this fact. To fix this the result is compared to the absolute value of x. Secondly, the function will overflow and test output like this:
java.lang.AssertionError: expected:<1.6482368012418589E307> but was:<Infinity>
is typical.

@Test public void squareWithOverflows() {
for(double a : someDoubles()) {
assertEquals(abs(a), Math.sqrt(square(a)), a * precission);
}
}


Again, this aspect was not expressed in the concrete test. Even if you were not aware of the problem the failing test points to the overflow problem. This is a nice example how Quickcheck can help you to find bugs you did not anticipate. I admit that this is a simple example but give it a try. You'll see that you run into all kinds of problems you did not think of. You have to break your software to know it.

Now we have to fix the overflow problem. Depending on the situation it can be easier to find and test a valid implementation that is more restrictive than theoretically possible but suffices your requirements. This is the trade-off between effort now and potential applicability later.

For this example it is easy to find all legal input arguments. It is the largest double value that can be squared.

@Test public void squareWithBounds() {
double max = Math.nextAfter(Math.sqrt(Double.MAX_VALUE), 0.0);
for (double a : someDoubles(-max, max)) {
assertEquals(abs(a), Math.sqrt(square(a)), a * precission);
}
}


To finish this example let’s write the test for the illegal input arguments as well. All square arguments that cause an overflow are invalid and should cause an IllegalArgumentException. The invalidArguments double generator defines all invalid values. These are the values greater than the largest valid value max and smaller than the smallest allowed value -max.

@Test public void squareInvalidArguments() {
double max = Math.sqrt(Double.MAX_VALUE);
double firstInvalid = Math.nextUp(max);
Generator<Double> invalidArguments =
oneOf(doubles(-Double.MAX_VALUE, -firstInvalid))

for (double a : someEnsureValues(asList(firstInvalid, -firstInvalid), invalidArguments)) {
try{
square(a);
fail();
}catch(IllegalArgumentException e){ }
}
}


The implemenation passing the tests is:

public double square(double a) {
double square = a * a;
if (Double.isInfinite(square)) { throw new IllegalArgumentException() };
return square;
}


Testing with inversion functions can solve the dilemma of writing tests without repeating the production code. If you repeat the code in the test you have an additional check that you wrote the code down correctly. You have to repeat an error in both places to introduce a bug. This is all you can get out of such tests. If you test with an inverse function the test and implementation do not share code. This kind of test has the potential to find conceptual problems in your code. (Actually, the reality is not black and white. You can have a test that implements the same operation with a simpler test implementation. This implementation verifies that the more complex production version works. This is the idea of the analogous function test pattern. I will come to later.)

If the inverse function pattern is applicable it can save you a lot of effort. For example the result of an operation can be very costly to verify like encryption functions. The exact representation may not be of interest for the given domain or change frequently leading to high maintenance costs with concrete test values. If you can define valid input values and have an tested inverse function the test effort is to setup the input value generator. The nice side-effect is that you test that the functions you think are inverse functions really are inverse.

## 20101110

### The easiest way to get started with Quickcheck

You may download the Quickcheck zip file from bitbucket. This zip contains the full Quickcheck distribution. Alternatively, the easiest way to get started is Maven.

<dependency>
<groupId>net.java.quickcheck</groupId>
<artifactId>quickcheck</artifactId>
<version>0.6</version>
<scope>test</scope>
</dependency>


Have fun.

## 20101108

### Alternative test approach: Quickcheck

With this article I would like to shed some light on the question: Why should I implement tests with the specification-based testing approach proposed by Quickcheck? I will show that the approach yield tests with better readability, more test coverage, helps to omit mock tests and manually maintained test fixtures.

I'll illustrate the approach with an simple example taken from the Guava library to be able to discuss the pros and cons. The example I picked is the test of the com.google.common.base.Splitter class. I present three implementations of the Splitter tests. The test implementation as it can be found in Guava today (translated to TestNG), a test implementation with TestNG @DataProvider and an implementation with Quickcheck for Java.

The following code are tests as they are today. The tests check the behavior of simple split operations for different input values. These tests have a lot of code but are structurally very similar.

@Test public void testCharacterSimpleSplit() {
String simple = "a,b,c";
Iterable<String> letters = Splitter.on(',').split(simple);
assertContentsInOrder(letters, "a", "b", "c");
}

@Test public void testCharacterSplitWithDoubleDelimiter() {
String doubled = "a,,b,c";
Iterable<String> letters = Splitter.on(',').split(doubled);
assertContentsInOrder(letters, "a", "", "b", "c");
}

@Test public void testCharacterSplitWithDoubleDelimiterAndSpace() {
String doubled = "a,, b,c";
Iterable<String> letters = Splitter.on(',').split(doubled);
assertContentsInOrder(letters, "a", "", " b", "c");
}

@Test public void testCharacterSplitWithTrailingDelimiter() {
String trailing = "a,b,c,";
Iterable<String> letters = Splitter.on(',').split(trailing);
assertContentsInOrder(letters, "a", "b", "c", "");
}

assertContentsInOrder(letters, "", "a", "b", "c");
}

@Test public void testCharacterSplitWithMulitpleLetters() {
Iterable<String> testCharacteringMotto =
Splitter.on('-').split("Testing-rocks-Debugging-sucks");
assertContentsInOrder(
testCharacteringMotto, "Testing", "rocks", "Debugging", "sucks");
}

private void assertContentsInOrder(Iterable<String> actual,
String... expected) {
assertEquals(Arrays.asList(expected), Lists.newArrayList(actual));
}


From the readability perspective the problem with tests specifying concrete values is that the reader has to infer which part of the information is significant and which part is not. For example taking this test in isolation:

@Test public void testCharacterSimpleSplit() {
String simple = "a,b,c";
Iterable<String> letters = Splitter.on(',').split(simple);
assertContentsInOrder(letters, "a", "b", "c");
}


One could come to the result that:
- split works only with one, alphanumeric character (“aa,bb” would not be a legal input string)
- the separator may not be an alphanumeric character
- splits works only for successive char values as the input is a - b - c
- all other characters are undefined

The reader of the tests has to digest all tests to be able to infer the actual properties of the Splitter. While reading he has also to remove the insignificant information and has to eliminate the invalid assumptions. (Actually, the tests say nothing about alphanumeric separators. So this assumption is still open.) As we know what split does this is trivial but for unknown functionality this is much harder.

The test can be implemented with the TestNG DataProviders in a more concise way:

static final String SPLIT = "split";

@DataProvider(name = SPLIT)
public Object[][] split() {
Object[] simpleSplit = {"a,b,c", ",", new String[] { "a", "b", "c" } };
Object[] splitWithDoubleDelimiter = {
"a,,b,c", ",", new String[] { "a", "", "b", "c" } };
Object[] splitWithDoubleDelimiterAndSpace = {
"a,, b,c", ",", new String[] { "a", "", " b", "c" } };
Object[] splitWithTrailingDelimiter = {
"a,b,c,", ",", new String[] { "a", "b", "c", "" } };
",a,b,c", ",", new String[] { "", "a", "b", "c" } };
Object[] splitWithMulitpleLetters = {
"Testing-rocks-Debugging-sucks", "-", new String[] { "Testing", "rocks", "Debugging", "sucks" } };
return new Object[][] {
simpleSplit, splitWithDoubleDelimiter, splitWithDoubleDelimiterAndSpace,
}

@Test(dataProvider = SPLIT)
public void simpleSplit(String simple, String separator, String[] expected) {
Iterable<String> letters = Splitter.on(separator).split(simple);
assertEquals(Arrays.asList(expected), Lists.newArrayList(letters));
}


This test removes a lot of the boiler-plate in the original version. It tests the split method repeatedly with a set of input values. The more compact representation of the test helps as long as we are keeping test input data and the test in one place. (If you organize your source file in a way that the test data and test is separated the readability decreases significantly.) This test approach still suffers from the readability problems the initial test has: the reader has to derive the property of the Splitter implementation from the input and expected return values. This is like inferring the properties of multiplication from a set of input and return values.

An completely different approach can be implemented using Quickcheck for Java for API. Quickcheck is build on the basic abstraction of the Generator<T>. The only method of the Generator<T> interface is next(). It returns a new value of type T.

The Splitter test uses Quickcheck functionality to create values:
- PrimitiveGenerators.strings() is a Generator<String> for undefined string values
- CombinedGeneratorIterators.someNonEmptyLists() is similar to a Generator<List<T>> but it is adapted to Iterable<List<T>> to be able to use it in a for-each loop. The lists returned have at least one element.

Basic idea is to specify a test as follows: If you take any non empty list of words, join them with a distinct separator and then split the joined string the result will be the input list of words again.

@Test public void simpleSplit() {
for (List<String> words : someNonEmptyLists(strings())) {
char separator = anyDistinctCharacter(words);
String input = Joiner.on(separator).join(words);
Iterable<String> letters = Splitter.on(separator).split(input);
assertEquals(words, Lists.newArrayList(letters));
}
}


The test uses the anyDistinctCharacter helper method to create separators. Separators have to be distinct from the characters used in words otherwise the splitter splits words. The helper uses the PrimitiveGenerators.characters() factory method to create a Generator<Character> for arbitrary characters. To create the separator values CombinedGenerator.excludeValues generator is used. It is build on a input generator but skips all generated values that are equal to the excluded values. (It skips all characters that are in words.)

private char anyDistinctCharacter(List<String> words) {
char[] notAllowedAsSeperator = Joiner.on("").join(words).toCharArray();
return excludeValues(characters(), asList(notAllowedAsSeperator)).next();
}


This specification-based test describes the contract and does not enumerate the specific test values. Compared to the two other implementations the way the test is written is more rigid. This is a positive thing, as I already tried to explain that when composing a test of concrete values there is place for misinterpretations. The other effect is that six tests boil down to one test.

This test approach has of course it's trade-offs: it depends on join and is not deterministic.
- Join has to be tested so that this test is working. I think the join test is much easier than the split test. Once join is tested we can depend on it. By testing join/split together we can check that these two are symmetric (f(f-1(x)) = x). Testing symmetry comes for free with this implementation approach.
- The real concern is that we will run the test with randomly generated test data and it is not guaranteed to contain all of the enumerated test cases in every test run.

If non-determinism is your concern there are some ways to address this.
- You can use a different distributions function in your generators. Using different distributions will make certain values more or less likely to occur.
- You can make the values used deterministic by using the CombinedGenerators.ensureValues(ensuredValues). This is still better than the tests with concrete values as you only have to specify the input values and not the expected result values.
- You can combine deterministic and random values. The generator created with CombinedGenerators.ensureValues(ensuredValues, otherValues) uses the enumerated values first and then the otherValues generator after the explicit values are exhausted.
The random nature of the generated values has of course an advantage: you may find bugs that you did not anticipate while writing the test.

Primitive type, enumerations and collection generators alone are not very useful. To test a real world application you have to be able to create values of your custom types. In the Splitter example we already used generators provided by the Quickcheck implementation.

Here is a simple example to get you started.

class Name {
private final String first;
private final String last;

public Name(String first, String last) {
super();
this.first = first;
this.last = last;
}
public String getLast() { return last; }
public String getFirst() { return first; }
}

class NameGenerator implements net.java.quickcheck.Generator<Name>{
Generator<String> first = PrimitiveGenerators.strings();
Generator<String> last = PrimitiveGenerators.strings();

@Override public Name next() {
return new Name(first.next(), last.next());
}
}


The code defines the custom type Name and a Generator<Name> using generator implementations provided by the Quickcheck framework. This is analogous to the definition of types in Java you define your own types based on the primitive and other types provided by the JDK.

The test then checks one property of the Name type. This test fails with the current implementation of the Name but this is easily fixed.

public class NameTest {
@Test public void equals(){
for(Name name : Iterables.toIterable(new NameGenerator())){
assertEquals(name, new Name(name.getFirst(), name.getLast()));
}
}
}


The test uses an adapter to the Iterable interface to be able to use the generator in a for expression. This is one way to run the test. You can also use an inner class with a more capable runner (QuickCheck.forAll(generator, characteristic)). Using the for expression is the trade-off given that the Java language has no closures yet.

Analogous to your domain object hierarchy you can now build a generator hierarchy that is used to create the domain objects. This clearly adds some overhead as you now have an additional generator for every domain object. But this pays off as you can:
- use generators in unit tests.
- use generated values to replace hard coded fixtures in integration tests (for example tests that you dbunit).
- omit mock tests that are only in place because the object graph to start a test is to hard to setup manually
- use generated values to write inverse functions (f-1(f(x)) = x) and analogous functions (f(x) = g(x) where f or g is simpler) tests (for example zip, encryption, read/write, sorting, searching, etc.).

I hope I could convince you that the approach helps you with writing more expressive tests by leveraging some computing power. You can use specification-based testing approach to test a range of software from the boring business software to more interesting areas like data-structures and algorithms. Maybe you could start by writing a test for the join method using Quickcheck?

## 20100708

### Java closures get type inference, hooray!

According to the latest State of the Lambda document closures in Java will get type inference.

The return type and exception types of a lambda expression are inferred by the compiler; the parameter types may be explicitly specified or they may be inferred from the assignment context.

So this expression will be valid without explicit type annotation:
CallbackHandler cb = { c -> System.out.println("pippo") };

Finally, they seem to have mercy with the humble Java developer.

## 20100629

### Talk to me - have fun with your favorite podcasts

Driving home the other day I listened to a German radio show that made fun of chancellor Angela Merkel. They cut together some of her speeches to let her say the truth about here coalition with the FDP.

I thought: how hard can this be to create? It seems a simple version is not to hard to build using the Java sound API and my favorite podcast.

The idea is simple: take input audio files and create a index of words contained in these files. The index entry is the word, start time and end time. When a new sentence is created it will be stitched together with single words taken from the index.

The file index implementation is straight forward. It reads the index information from a CSV file that is on the classpath. The whole application configuration is done with Guice so the index file name is injected. (I created only a small index from the Java Posse Podcast using Audible.)

The AudioInputStream is the main class to interact with the Java Sound API. You read audio data from it. If you create audio data you do this by creating a AudioInputStream the AudioSystem can read from. The actual encoding is done by the AudioSystem implementation depending on the output audio format.

The Butcher class is the one concerned with audio files. It can read and write audio files and create AudioInputStreams from an input byte array. The other interesting think the Butcher can is cutting samples from a AudioInputStream. The AudioInputStream consists of frames that represent the samples of the PCM signal. Frames have a length of multiple bytes. To cut a valid range of frames from the AudioInputStream one has to take the frame size into account. The start and end time in milliseconds have to be translated to start byte and end bytes of the start frame and end frame. (The start and end data is stored as timestamps to keep them independent from the underlying encoding of the file used.)

The Butcher implementation is simplified. It only supports one WAV file AudioFormat and does no stream processing.

The Composer creates the output file. For a given sentence it takes the audio data for each word from the input files, concatenates the audio data and writes the result to disk. The Composer is currently not very sophisticated and takes the first word from the index it can find.

After building with mvn assembly:assembly the Java application can be run with
java -jar  target/talk-to-me-1.0-SNAPSHOT-jar-with-dependencies.jar [output file] [sentence]


There is still plenty of interesting material to play around with. The current version can be improved in different ways:
• Indexing an audio file is quite cumbersome. If the start and end timestamp of a word could be detected from the silence between words indexing would be much easier.
• The amplitude of words and the length of silence should be normalized.
• Indexing could be even simpler if some speech recognition on the words could be performed.
• The output quality could be improved by finding longest sequences of words in an input audio file that match the target sentence (longest common_substring_problem and longest common subsequence problem).

## 20100222

### Tuning the Bloom-Filter-based Spelling Corrector

With the candidate iterators and filter in place the Bloom Filter-based spelling corrector can be implemented and the performance measured.

The base implementation is here. The basic idea is described in the first blog post about the Bloom Filter-based spelling corrector so I won't get into detail about the implementation.

My benchmark is the implementation by David Pollack. As far as I can tell he thought a bit about performance so the comparison is not totally unfair.

The numbers aren't to bad. The Bloom Filter-based implementation is 2.5 times faster. (The performance comparison numbers are based on correcting words of edit distance 0 to 3 derived from the dictionary. The JVM is running with -Xms256m -Xmx256m -server.) My implementation corrects 115 words per seconds.

bloom 182.04358 (115 Hz)
benchmark 466.560114 (45 Hz)
benchmark/bloom 2.562903421257701

But there is still room for improvement. Executing the application with a memory profiler reveals that a lot of java.lang.Integer objects are allocated. The filter function Int => Boolean is translated by the Scala compiler to the generic trait Function1[Integer, Boolean]. This causes boxing of int values to Integer objects. (The boxing of Boolean values is not a problem. There are only two values as long as Boolean.valueOf is used.) Replacing the filter function with a filter trait based on primitive types solves this problem.
trait Filter{
def contains(x : Int) : Boolean
}

The other source of boxed integers is the partial application of the StringHasher.hashRange method. When the string parameter is applied to the hashRange method a new Function3[Int,Int,Int,Int] is returned. This causes boxing as the previous Function1 object does. The partial application made the code a bit more readable. We have to compromise here on readability and replace the partial application with a normal method that fits well with the JVM.

Implementing these improvements resulted in worse performance at the beginning. The problem was that I used s(idx) instead of s.charAt(idx). The s(idx) converts the String to a RichString and then calls the apply method on it. Both methods are functionally identical but the first will cause conversions from String to RichString whereas the charAt method is a direct method call on the given String. This is a subtle problem the best solution would be able to disable implicit conversions from String to RichString defined in Predef for this file.

With both improvements in place the implementation is 3.7 times faster.

bloom 113.69269 (184 Hz)
benchmark 424.767132 (49Hz)
benchmark/bloom 3.736098881994964

The next step is a bit controversial: basing the candidate production on char Arrays replacing Strings. This yields good performance improvements the final number is 4.9 times faster than the competing implementation but it has clearly a number of drawbacks. The code becomes quite low-level using System.arrayCopy instead of RichString methods. The char[] arrays are mutable so a collaborator may manipulate the array content.

bloom 89.082697 (235 Hz)
benchmark 437.214067 (48 Hz)
benchmark/bloom 4.907957232143522

Some further improvements are still possible. With enough effort the BloomSpeller, Dictionary and CandidateIterators could be based on a CharSequence flyweight implementation. Having this interface in place candidate iterators can return implementations of CharSequence containing only the base string as common state and the individual state of the candidate (index to manipulate, candidate hash value and optionally the alphabet character to use). This would save memory allocation, garbage collection for the candidate char[] array instance and redundant hash value computation. (The candidate hash value is computed a second time when the Speller implementation checks whether the candidate is in the Dictionary.)

The final step is to tune the Bloom Filter configuration. In the first blog entry of this series I calculated expected performance values based on the Bloom Filter worst case (all candidates are contained in the Bloom Filter). The actual numbers differ from the expected results as most candidate checks against the Bloom Filter will abort earlier before k hash functions were executed. Another factor not present in the model is the optimization to compute only hash values for about half of the string length.

The performance is surprisingly stable for an n/m ratio of 4 up to 256. For an n/m up to 4 the production of false candidates dominates the performance characteristics. For n/m ratio greater than 128 the execution of hash functions starts to degrade performance. Over all the configuration of an n/m ratio = 4 is good enough. Only small improvements can be attained with changing this number.

 n/m ratio performance ratio 1 1,85 2 3,07 4 4,05 8 5,07 16 4,75 32 4,83 64 4,93 128 4,69 256 3,89 512 1,78 1024 0,53

## Conclusion

Scala is a flexible language. You can write high-level and low-level code as needed. The basic approach to start a high-level implementation and optimizing the hot spots worked out quite well for this example. The byte code emitted by the Scala compiler is good enough in most cases. Some abstractions of the language do not fit to well with the Java platform. This is not a problem for regular code and can be fixed with low-level implementations for hot spots as we've seen. The new @specialized annotation may help here in the future to lower the amount of low-level code.

There are still some areas worth to investigate. How to distribute the candidate production and filtering over multiple cores? Using a join-fork framework or Scala's Future construct seem to fit quite well here. Another idea is to compare the Bloom Filter-based spell checker implementation against a BK-Tree-based implementation.

## 20100216

### Producing candidates for a spelling corrector

The next step to implement the Bloom Filter-based spelling corrector is to create and check hash values for candidates against the filter without actually creating candidate instances. If the filter step succeeds the candidate String instances are created in the next step.

All candidate iterators (for insertion, deletion, transposition and replacement) are based on the CandidateIterator trait. The CandidateIterator can return a hash value for a candidate with the hashValue method and a candidate string with the candidate method. The next method scrolls to the next value.
trait CandidateIterator{
def next : Boolean
def candidate : String
def hashValue : Int
}
An instructive implementation is the replacement CandidateIterator. It replaces each character of the base string with a character from the alphabet to create a candidate.

The candidate method does exactly this for candidate strings. It takes the base string and replaces the character at the idx position with the character from the alphabet at the alphabetIdx position (the alphabet is defined as val alphabet = 'a' to 'z' toArray).
def candidate = (base take idx) + Dictionary.alphabet(alphabetIdx) + (base drop idx + 1)

The hashValue method is not as easily readable as the candidate method. The implementation is based on the insight that the hash function for Strings is the defined in a way that we can reused a base hash value for the a substring base.substring(0, idx). With that in place we can save about half of the hash computations. This has huge performance benefits over an implementation that works directly with Strings where hash values have to be computed from the scratch for every candidate. Analogous to the String copy costs the hashing costs increase nearly linearly with the String length. The performance benefits will be especially relevant for long Strings that will then produce a lot of candidate strings.

The next method takes care of the idx, alphabetIdx to point to valid positions. The baseHash value is calculated iteratively in the next method from the baseHash of the last round if the idx value changes.

The hashValue implementation is now based on the baseHash value. The hash is computed for the character taken from the alphabet and the remaining string base.substring(idx + 1, base.length).
class Replacement(base : String) extends BaseCandidateIterator(base){
private var idx = 0
private var alphabetIdx = -1
private var baseHash = 0
def next = {
alphabetIdx += 1
if(alphabetIdx == Dictionary.alphabet.length){
idx += 1
alphabetIdx = 0
baseHash = hashAt(baseHash, idx - 1)
}
idx < base.length
}
def candidate = (base take idx) + Dictionary.alphabet(alphabetIdx) + (base drop idx + 1)
def hashValue = {
val pivot = hashFor(baseHash, Dictionary.alphabet(alphabetIdx))
val right = hashRange(pivot, idx + 1, base.length)
right
}
}
The CandidateIterator let's us iterate over candidate hash values without creating objects. Externally to the CandidateIterators the filtering takes places calling the candidate to produce an instance method only when the hash value was in the filter. This work is done by the CandidateFilter.
class CandidateFilter(base : CandidateIterator, filter : Int => Boolean) extends Iterator[String]{
private var more : Boolean = false
def hasNext : Boolean = {
def hasMore : Boolean = {
while(base.next) if(filter(base.hashValue)) return true
false
}
more ||= hasMore
more
}

def next : String = {
more = false
base.candidate
}
}

The class takes a CandidateIterator and a filter. It implements the Iterator trait. Values returned by the CandidateFilter are candidate strings that passed the Bloom Filter test.

Users of the CandidateFilter and CandidateIterator can now use the Iterator convenience methods and for expressions to work with candidate strings as usual.

With the CandidateIterator implementation and the CandidateFilter in place the spelling corrector can be implementation in the final step.

## 20100208

### Implementing hash functions in functional style

The first ability needed to enable filtering with the Bloom Filter-based spell checker is to produce hash values for candidates.

The basic functionality is to hash a character given a base hash value from a preceding invocation. The hashFor function implements this.

def hashFor(h: Int, c: Char) : Int = h * 31 + c


Two further functions help with building candidate hashes: hashAt and hashRange.

HashAt builds the hash for the a character of base string at a index position given a base hash value.

def hashAt(base : String)(h: Int, i: Int) : Int= hashFor(h, base.charAt(i))

The hashAt method is defined in a way to be able to use partial application of parameters. (It's not using clean currying. The Scala language allows to define functions with mixed arity: unary functions and functions with arity greater than 1.)

The motivation to define the function in this way is to be able to use it with fewer parameters than the method definition def hashAt(base : String, h: Int, i: Int) when it is used repeatedly with the same base string parameter.

To get the function signature you can access the function object in the Scala console:

scala> hashAt _
(String) => (Int, Int) => Int

Alternatively the signature can be written as (String) => ((Int, Int) => Int) for better readability. The function takes one parameter and will return a function of type  (Int, Int) => Int.

To make any sense of this let's apply the parameters one by one:

scala> val f = hashAt _
f: (String) => (Int, Int) => Int = <function>

scala> val g = f("s")
g: (Int, Int) => Int =  <function>

scala> val h = g(0,0)
h: Int = 115


The parameters are applied left to right. The intermediary function definitions has each parameters bound until there are no parameter left to bind and the function is evaluated. The sequence is equivalent to calling hashAt("s")(0,0).

HashRange hashes characters of a string from a start index to the end index (exclusively) given a base hash value.

def hashRange(base : String)(h: Int, i: Int, end : Int) : Int =
if(i < end) hashRange(base)(hashAt(base)(h, i), i + 1, end) else h

Using the hashRange method you can produce the same hashValues as the java.lang.String implementation.
public int hashCode() {
int h = hash;
if (h == 0) {
int off = offset;
char val[] = value;
int len = count;
for (int i = 0; i < len; i++) {
h = 31*h + val[off++];
}
hash = h;
}
return h;
}

The equivalent definition based on the hashRange function is:

def hash(s : String) = StringHasher.hashRange(s)(0, 0, s.length)

The function is defined as a tail recursive method. A tail recursive function is a function that does only return values from recursive calls and won't manipulate these return values. With a tail recursive definition the Scala compiler is able to translate the recursive call into a loop. Defining a tail recursive function is a bit tricky and you may inadvertently change a tail recursive version into a normal recursive one and even an optimized tail recursive function into a not optimized one. To omit the pitfalls of tail recursive functions Scala 2.8 introduces a @tailrec annotation. This produces a compiler failure if a function cannot be optimized.

With the hash support methods hashAt, hashFor, hashRange in place we can implement hash values for candidates. I'll implement the insertion candidate production to demonstrate the application of the methods. The insert adds a characters from an alphabet at a given index.

It is always good to write some tests (besides the fact that it tests the functionality) to be able to have a look at the API from the user's perspective. Testing the individual methods is not enough to get a feel of the API. Users will always use a combination of the three methods. One concern is that the hash code is calculated incrementally and the user has to define the hash functions in a incremental manner as well. Using the output of preceding execution as an input parameter for succeeding method calls which may be error prone.

The law for the insert candidate hash can be defined with the equivalent hash function of java.lang.String implementation. It can be expressed in specification-based test:

@Test def userSample {
for{s <- strings
index <- integers(0, s.length - 1)
c <- basicLatinCharacters}{
val insertCandidate = (s take index) + c + (s drop index)
assertEquals(insertCandidate.hashCode, insertCandidateHash(s, index,c))
}
}

Three methods of the specification-based test framework Quickcheck are used in the test:
- strings returns an arbitrary string value
- integer(min,max) returns  an integer value from min to max (inclusive)
- basicLatinCharacters returns a sample character from the from the Unicode Latin-1 code page.
All three methods return values through the Generator interface that is converted with a implicit type conversion to an Scala Iterator to be able to use it in a for expression.

With the input string, insertion index and character to insert given the actual law is quite simple. We construct a string that is an equivalent insertion candidate and get its hash value. The hash value returned by the insertCandidateHash method under test has to be produce the same value. (This is a testing pattern in the context of specification-based test: testing based on a analogous functions. Stay tuned for a in depth description of the testing patterns.) This example shows that unpleasant tests with classical scenario-based TDD can be expressed in a compact way with specification-based testing. The last piece of the puzzle is the implementation of the insertCandidateHash function.
def insertCandidateHash(s : String, i : Int, c : Char) = {
val hashRange = StringHasher.hashRange(s) _
val left = hashRange(0, 0, i)
val pivot = StringHasher.hashFor(left, c)
val right = hashRange(pivot, i, s.length)
right
}

The sample code demonstrates the use partial application of function parameters in the first line. The string parameter is applied to the function returning a function with the signature (Int, Int, Int) => Int. This function object can than be used for all hash values of the same base string. The hash value is calculated for the left, pivot and right characters of the string. The characters for the left and right hash values are unchanged from the base string. The hash value for the pivot element is one from the alphabet as one would expect for an insertion.

In the next blog I'll implement the candidate iterators for insertion, replacement, transposition and delete based on the StringHasher.

## 20100207

### The moment electronic paper died

Perhaps you did not noticed it but last week the electronic paper followed the Maglev and became part of the museum of failed technical innovations. As the Maglev, it started with a brilliant theoretical idea that nobody cared about once it was there.

The Maglev promised faster train travel with a better efficiency. The problem of the Maglev is that is does not fit well into the current infrastructure. New separate tracks have to be build. In the meantime wheeled electric trains are capable to travel at high speeds using and interfacing better with the current infrastructure.

Electronic paper has the advantage that it does not consume electricity while displaying a single page and works without backlight. In theory an electronic paper-based device can be used much longer with recharging. Readability for text is supposed to be better than a LCD as well. But it cannot display charts, tables and photos well. Everything users are used to know on the computer and smart phones. Additionally, they switch slowly from one page to another and the display even flickers annoyingly while changing the page content.

Until the advent of the iPad I thought of buying such a device and fantasized about caring around all my IT-books, classical literature from Project Gutenberg and blogs on this device around while commuting. It was just a matter of the right price. But now I do not see a point anymore. The iPad has proven that it is possible to build a general purpose computing device that includes book reading software. The electronic paper may be better for novels but I don't care. The ability to use all kinds of software and browse the web on such a device is definitely buying me more.

That said the iPad is just the prove that it can be done. It is not platform I would like to develop on.

What's the point off having a general purpose computing device that is controlled by one company? The platform for such a device has to be reasonably open. And I think that will not only be in the interest of application developers but of the provider of such a platform as well. The development and deployment model imposed on developers will stifle innovation. As innovative as the iPad may be a single company won't outperform a whole industry in the long run.

## 20100202

### A faster spelling corrector

The compact spelling corrector implemention showed some of the features of Scala: for expression, higher order functions (fold and map), tuple support and the infix method syntax. The current implementation has terrible performance, so we try to develop a faster corrector implementation.

Let's have a look at it at run-time with the VisualVM memory profiler.

Not very surprising, it creates a lot of objects. To correct 20 words of edit distance 0 to 3 randomly picked from the big.txt file it allocated 15 million char[] arrays and 9 million String instances. Allocation may be cheap but allocating millions of objects is not.

One way to improve the performance of the spelling corrector is to allocate fewer objects. The spelling corrector is based on a brute force approach: calculate all possible candidates for a given edit distance. The most expensive operation is to create candidate string instances. The java.lang.String implementation is immutable. Every single change to a String will force the creation of a new instance.

But we can do better than that. We can capitalize on the knowledge that the Strings won't be random. The candidates can be easily deduced from the base string. A candidate string has some additional private state depending on the operation (insert, transpose, replace, delete) performed. Phrasing it in this way directly leads to the flyweight pattern. Flyweights can be implemented for every operation implementing the CharSequence interface that is already part of the JDK. This will gain quite good performance improvements. No additional char[] arrays will be allocated.

I'll skip the flyweight implementation and leave it as an exercise. There is a better approach.

The flyweight implementation will still create one object per candidate. Even this can be omitted. If we would pick only candidates that are promising the creation of objects would be further reduced and the performance increased. And happy coincidence a data structure can help here: the Bloom Filter.

Additionally to dictionary of word frequencies that is already used in the base implementation of the spelling corrector a Bloom Filter is initialized with all words.

The candidate generation will than be separated in multiple steps: candidate hashing, hash filtering with the Bloom Filter, candidate instantiation and dictionary filtering.

Using the Bloom Filter most of the candidates will not survive the hash generation and hash filtering steps. The hashed candidates can be calculated without instantiating any objects, just using primitive int values that live on the stack. This has super performance characteristics: no memory allocation and garbage collecting needed.

For every type of candidate: deletion, replacement, insertion and transposition an Iterator can be implemented that produces the candidate hash values. The iterator implementations are based on a base string and additional state (positions in the base string and positions in the alphabet) necessary for the candidate hash value production.

All candidate hash values will be checked if they are contained in the Bloom Filter. If so the candidate string will be created and checked against the dictionary. The second check is necessary as the bloom filter produces false positives.

The actual false positive rate can be freely chosen. It will have no impact on the functionality. It could be 100%. The false positive rate can be changed to tune the performance depending on memory consumption requirements, cost of hash functions used in the Bloom Filter implementation and object instantiation cost for candidate strings.

To check on the basic assumption we can measure the performance characteristics of the basic string operations needed and the Bloom Filter. The limited performance test compares the run-time of a character array copy needed to create the base char[] for one insertion candidate string (creating a String candidate won't be faster than this)

arraycopy(stringValue, 0, s, 0, 1) //insertion
s(1) = 'c'
arraycopy(stringValue, 1, s, 2, stringValue.length - 1)


and the Bloom Filter contains method performance.

All presented durations lack an unit of measurement as all measurements are based on the same scale. This is sufficient to compare the values.

The time needed to create the insertion candidate string grows linearly with the base string length.

 char[] length character copy time 2 11,69 4 11,70 8 12,57 16 15,95 32 21,57 64 30,21 128 51,22

The Bloom Filter time grows linearly with the number of hash functions (k) used. The optimal number of hash functions used is determined according to the ration between expected elements (n) and bloom filter size (m).

The measured time is the worst case: an entry (or a false positive) is in the Bloom Filter. In this scenario all k-hash functions are executed.

 n/m ratio k false positive rate time filter 1 1 0,62 3,73 2 2 0,39 5,70 3 3 0,24 7,70 4 3 0,14 7,71 5 4 0,09 9,38 6 5 0,05 11,59 7 5 0,03 11,59 8 6 0,02 13,80 9 7 0,01 18,68 10 7 0,01 18,68

In a real world scenario the average run-time of the Bloom Filter will be better as the execution can be aborted for values that are not in the Bloom Filter. The false positive probability another important figure of the Bloom Filter:

False positive causes additional work based on the assumption to work with a valid value. This run-time is wasted as the preceding check against the dictionary will eliminate the false positives.

To get the expected resulting performance of the spelling corrector we construct a small model. The expected run-time is determined by the cost of the filter t_f, the false positives probability p_f the probability p_w that a candidate is a word in the dictionary and the time t_s to construct a candidate string. The time spend to construct strings which are false positives is p_f * t_s. The time spend to construct strings which are words is p_w*t_s. (To simplify the formula a constant string length is assumed.)

The values computed are based on candidate strings with a length of 8 chars (t_s=t_s8) and the probability that 5% of the candidates are words contained in the dictionary (p_w=0.05).

 n/m ratio k false positive rate (p_f) time filter (t_f) false positive time (p_f * t_s) expected time (t) 1 1 0,62 3,73 7,84 12,20 2 2 0,39 5,70 4,88 11,21 3 3 0,24 7,70 3,04 11,37 4 3 0,14 7,71 1,76 10,10 5 4 0,09 9,38 1,09 11,09 6 5 0,05 11,59 0,66 12,88 7 5 0,03 11,59 0,39 12,61 8 6 0,02 13,80 0,25 14,68 9 7 0,01 18,68 0,15 19,46 10 7 0,01 18,68 0,09 19,40

By increasing the n/m ratio (and hash functions k executed respectively) of the Bloom Filter we see an increase in the time spent to check if the filter contains a candidates string (time filter). On the other hand the time spent to produce false positives (false positive time) declines with a growing n/m ratio.

For the given model the local minimum of the expected run-time is with a n/m ratio of 4 (false positive rate of 0,14). The expected worst case run-time of the Bloom Filter based implementation is better than the plain array copying run-time (with a of minimum 11,69) and the performance benefits increases dramatically with the string length.

In the next blog entries I'll implement this Bloom Filter-based spelling corrector to see if one can capitalize on the theoretical advantages. The next blog entry will be about the hash function support needed.

## 20100118

### Why overrideable tail recursive methods must not be optimized

Suppose you define a tail recursive method x in a object A. You can use the @tailrec annotation to mark methods you expect optimization of the tail recursive code to happen.
import annotation._
object A{
@tailrec def x(i : Int, s : String) : String = if(i > 0) x(i - 1, s + "a:" + i) else s
}

Now you move the method x to a trait or class and suddenly the compilation fails.
import annotation._
trait A{
@tailrec def x(i : Int, s : String) : String = if(i > 0) x(i - 1, s + "a:" + i) else s
}
:8: error: could not optimize @tailrec annotated method
@tailrec def x(i : Int, s : String) : String = if(i > 0) x(i - 1, s + "a:" + i) else s

I had exactly this problem. I defined a perfectly valid tail recursive function but the compiler would not optimize it to a loop. This was in Scala 2.7 that does not support the @tailrec annotation. You have the additional issue of spotting the problem in the first place.

The problem is that overrideable tail recursive methods won't be optimized. If you change the code to
import annotation._
trait A{
@tailrec final def x(i : Int, s : String) : String = if(i > 0) x(i - 1, s + "a:" + i) else s
}

it will compile again. This is the rule. But why?

Performance optimization must not change the behavior at run-time. So overriding an optimized tail recursive method will behave differently from an unoptimized tail recursive method.

Let's define A in a way that it won't be optimized and override the tail recursive method x in the subclass B.
class A{
def x(i : Int, s : String) : String = if(i > 0) x(i - 1, s + "a:" + i) else s
}

class B extends A{
override def x(i : Int, s: String) : String = super.x(i, s + "b:")
}

This will produce the output:
scala> new A().x(3, "")
res0: String = a:3a:2a:1

scala> new B().x(3, "")
res1: String = b:a:3b:a:2b:a:1b:

The implementation of the method x in B invokes the method x in A. When the method x defined in A invokes the method x it calls the method x in B (for an instance of B). This behavior is due to late binding of calls to x.

Suppose it was possible to optimize the method x in A. The compiler would define a class similar to this definition of A. The recursive invocation of x is optimized to a loop. With this definition of x there is no possibility to apply late binding to x. The optimization would hard wire the definition of x.
class A{
def x(i : Int, s : String) : String = {
var j = i
var t = s
while(j > 0){
t = t + "a:" + j
j = j - 1
}
t
}
}

class B extends A{
override def x(i : Int, s: String) : String = super.x(i, "b:" + s)
}

scala> new A().x(3, "")
res0: String = a:3a:2a:1

scala> new B().x(3, "")
res1: String = b:a:3a:2a:1

The early binding of x in A changes the behavior of B. Once the method x in A is called it will never come back to B. This is the reason why the Scala compiler can't optimize overrideable tail recursive methods.

Following this explanation, it should be clear that the @tailrec annotation is worthwhile and should be used in all your Scala 2.8 code you expect the optimization to happen. For Scala 2.7 it's a bit unfortunate that moving a tail recursive method from an object to trait will change the behavior significantly without a compiler error. So be warned.

## 20100111

### Lessons from Stackoverflow: missing education should not stop you from being a good developer

To put this blog into perspective: I've the weakest of all possible computer science degrees you can imagine. It's formally a degree in computer science but half of my curriculum was in communication engineering which is an interesting field (if you want to build a digital hearing aid for your grandma) but does not really help with software development.

I do appreciate the work of Jeff Atwood at Stackoverflow. It's a great site. It is a viable programming resource. Judging based on his results he is a good developer.

Ironically you can build a good product without been a topnotch computer scientist. This is really encouraging for me.

I'd been reading Jeff Atwood's blog for more than two years because it was funny, entertaining, presented a down-to-earth view of IT and provided insights of a different community (.net and Windows) from my Java community. But around the time he started Stackoverflow I quit reading his blog. I cannot say exactly why, but I lost its appeal a bit. I liked to read more computer science related stuff (mostly from the functional programming world). Roughly one year later I'm back again. I really like Stackoverflow.

But time and again listening to the Stackoverflow podcasts makes me wondering how's that possible. How could he write Stackoverlow? In the latest Stackoverflow podcast #79 Joel Spolsky tried to explain that you can parse only regular languages with regexps parser. (Actually, most implementations allow more than that. For example back referencing is more powerful than regular expressions.) This explanation goes on and on. This was like a déjà vu when he tried to explain Jeff and Scott Hanselman Hindley-Milner type inference algorithm used by the Haskell compiler. (He tried to explain that you can deduce the return type of a method from the parameters and the functions called. Let's say he did not get too far.)

Jeff explains why it's great to have his first open-source project. How many bugs he fixed and that he is now at the point where the hairy problems start. Few people seem to be able to help fixing the problems in the C# port of the PHP and Perl implementations. Joel explains again and again that you need a lexer and parser generator to parse Markdown and every student with a parser course can tell you that this is right tool for this kind of problem. That a transformation from an AST is a piece of cake. I never had a parser course so don't expect a scientific explanation here, but I was at least able to parse a subset of SQL with Parser Combinators. The moment Jeff started to explain his open source project. That the parsing is done with regular expressions I though: "Hell, why didn't he use a lexer/parser for that."

But you can learn something from Jeff Atwood's example: Your meager education should not stop you from being a good developer. We can't all have ivy league education. (They wouldn't be ivy league if everyone would get a degree. The system does only work because it discriminates between students.) Let's get over it.  Put your enthusiasm in your work and learn the bits and pieces you have to learn to get the jobs done. Try to constantly improve your theoretical knowledge, your tool knowledge, quality of your code and your communication skills. Maintain your critical thinking to spot ways you can improve. You can't known everything but you should known what you don't known:

There are known knowns. These are things we know that we know. There are known unknowns. That is to say, there are things that we know we don't know. But there are also unknown unknowns. There are things we don't know we don't know.

The known unknowns will slow you down, but only the unknown unknowns can hurt you. These are the places where you put to much effort into a problem that is already solved like parsing. But a star maverick developer can only develop so much. There will be times when a Duct Tape Developer is needed and the pure theory boys won't get the job done. This is the one thing you can learn from the Stackoverflow story: It's the motivation, stupid.

By the way, if you have doubts about your abilities, this is probably a good sign.

## 20100107

### How to setup a Maven Scala project with IntelliJ IDEA support

Setting up a Maven and Scala project with IDEA is simple.

Install IntelliJ IDEA Community Edition and the Scala plugin (File -> Settings -> Plugins -> Available).

Generate your maven project using the simple Scala archetype:

mvn org.apache.maven.plugins:maven-archetype-plugin:1.0-alpha-7:create
-DarchetypeGroupId=org.scala-tools.archetypes
-DarchetypeArtifactId=scala-archetype-simple
-DarchetypeVersion=1.2
-DremoteRepositories=http://scala-tools.org/repo-releases
-DgroupId=group
-DartifactId=artifact


Add the Scala compiler lib to your project dependencies in the pom.xml file:
<dependencies>
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-compiler</artifactId>
<version>\${scala.version}</version>
<scope>compile</scope>
</dependency>
</dependencies>


Update the Scala version configuration to a recent release:
<properties>
<scala.version>2.7.7</scala.version>
</properties>

Open the a new project. Select the pom.xml in the file dialog.

An notification will pop up. Click "Create Scala Facet".

Done.

You can now compile and execute as usual and run the maven build mvn install with the Maven Tool Window. (The M2_HOME environment variable or Maven home settings (File -> Settings -> Maven -> Maven home directory) have to be present.)

## 20100106

### The best Scala IDE: IntelliJ IDEA

After years and years of using Eclipse (after Netbeans, JBuilder, Visual Cafe, Emacs and other spooky Java "editors") IntelliJ IDEA is powerful, convenient and simple to use IDE for Scala development. Oh, yes it's free: IntelliJ IDEA Community edition.

I tried IDEA because the Scala plug-in for Eclipse is so horrible. I works so badly that using good old jEdit with some macros is way better. On a small project I had to clean the whole project for every compile. It's so bad. After several tries (starting sometime in 2008) I've given up on this thing.

Before IDEA I also tried to use Netbeans, but it failed the dump-developer-that-does-not-read-the-manual-test. Basic functionality should be usable without a detailed study of the documentation. That worked for me with IDEA but I could not get the plugin Scala plugin to run in Netbeans. It's probably me and everything is fine with Netbeans. Still I've invested less time to get things done with IDEA. It simply worked.

The simple summary: every basic IDE feature works: syntax high lighting, compiling, debugging, navigation and outline.

If you worried about productivity of Scala development because it was missing IDE support: stop worrying. The support is here.

I'm a heavy short cut user so using a new IDE slows me down naturally. Using the Eclipse key map was not an option for me. This works only 80% of the time. The main problem is that functionality with the same shortcuts behaves slightly different. That's maddening at the long run. So it's better to learn a new set of short cuts. It's like driving in Great Britain. It's easier when you are using a car with the steering wheel on the right. Driving a car with the steering wheel on the left is dangerous. There are the country side hedges where you'll see practically nothing in a left turn and chances are that you're going to start your trip on the right hand side on a calm country road. So better relearn.

Interestingly IDEA has a code analysis tool that does spell checking in comments, methods and variables names. This is already works for Scala. Eclipse does spell checking only in comments. This is a boon for code quality. You do not want to read incorrectly spelled source code. This is plain annoying to read, fix and frankly to get right (camel case does not make it easier). (The code analysis tool for Java is very good. It finds tons and tons of problems after FindBugs. I have not heard about the analysis tool before. It would be nice if it could be integrated in build tools and used in a continuous integration environments.)

The maven support seems to work out of the box. It reads a pom.xml and builds IDEA project files from it. It still uses its internal build infrastructure. With Eclipse you have to manually generate the .project and .classpath files which works but is an extra step and the eclipse specific configuration in pom files can get quite clunky.

The essential things work for Scala. You have code completion without implicit type conversion information. I suppose the support for implicit type conversion is quite hard to implement. There are a lot of callable members with type conversion for a given object. I suppose the best way to implement this would be a 2-step process. First show all directly reachable members and after hitting Ctrl + Shift again show all reachable members. Methods parameters are missing for decompiled Scala classes and as code completion information.

Besides these minor problems IntelliJ IDEA is a wonderful tool to develop Scala with and it does not crash. Go and try it yourself.

## 20100102

### Quickcheck 0.5 release

It's done: Quickcheck version 0.5 is ready. Again one of the nice little releases of Quickcheck.

The 0.5 release added the following features mainly to provide better readability:
• Support for an adapter from Generator to Iterable to allow the use of Java's for-expression in tests (replacing the inner classes).
• Generated source code for @Iterables and @Samples annotations on generator factories. The @Iterables annotation was added to support for-expression directly with Generators without additional code. The @Samples annotation can be used to generate a factory for sample values from a Generator class.

Minor changes are:

The version 0.6 of Quickcheck will be a major rework after more than 2 years of experience with the specification-based test approach and development. Mainly to support further development for example shrinking (simplification) of failed test values. A major feature still missing in Quickcheck version 0.5.

When I started Quickcheck I would not have thought that it is possible to develop such a tiny little thing for more than 2 years. Still more ideas and things that could be implemented and written about and so little time.

Recently, I've used the 0.5 version in my Scala projects I blogged about. I did not mention that I tested it in any way, though. It was a pleasant experience given that QuickCheck was not designed with Scala's language features in mind. It works with just a small implicit type conversion and adaption layer (10 lines or so).

Distribution function implementations based on the XorShift random number generator I've already worked with would be nice to have. When I'm at adding new distributions anyway, there are still the distribution functions based on heuristics to implement. For example a distribution function based on a boundary heuristic that starts production of values with minimum value, -epsilon, 0, +epsilon and maximum value before random generation kicks in. Another item on my ever growing list is the description of patterns related to specification-based testing and QuickCheck. Obviously there is still plenty of work left.