20091231

Peter Norvig's Spelling Corrector in Scala

This is a Scala version of Peter Norvig's spelling corrector. Reading his essay I realized that Python is quite a nice language and similar to Scala due to the support of for-expressions in both languages. Before that I've seen Python as a strange language that made you write self in your classes all the time.

I tried to write the shortest version possible without introducing redundancy. The code is definitely not very readable.

import util.matching.Regex.MatchIterator

val alphabet = 'a' to 'z' toArray
def train(features : MatchIterator) = (Map[String, Int]() /: features)((m, f) => m + ((f, m.getOrElse(f, 0) + 1)))
def words(text : String) = ("[%s]+" format alphabet.mkString).r.findAllIn(text.toLowerCase)
val dict = train(words(io.Source.fromFile("big.txt").mkString))

def edits(s : Seq[(String, String)]) = (for((a,b) <- s; if b.length > 0) yield a + b.substring(1)) ++
  (for((a,b) <- s; if b.length > 1) yield a + b(1) + b(0) + b.substring(2)) ++
  (for((a,b) <- s; c <- alphabet if b.length > 0) yield a + c + b.substring(1)) ++
  (for((a,b) <- s; c <- alphabet) yield a + c + b)

def edits1(word : String) = edits(for(i <- 0 to word.length) yield (word take i, word drop i))
def edits2(word : String) = for(e1 <- edits1(word); e2 <-edits1(e1)) yield e2
def known(words : Seq[String]) = for(w <- words; found <- dict.get(w)) yield w
def or[T](candidates : Seq[T], other : => Seq[T]) = if(candidates.isEmpty) other else candidates

def candidates(word: String) = or(known(List(word)), or(known(edits1(word)), known(edits2(word))))

def correct(word : String) = ((-1, word) /: candidates(word))(
  (max, word) => if(dict(word) > max._1) (dict(word), word) else max)._2

List("osters", "musters", "mixters") map correct foreach println 

You can run the script directly with the Scala interpreter:
$ scala test.scala
posters
masters
matters

20091220

A decent Bloom Filter in Scala

The Bloom Filter is an interesting probabilistic data structure. Its main advantage is that the Bloom Filter drastically reduces the memory consumption.

It produces false positives. Which means that it falsely states that a value is contained. On the other hand it ensures that it produces no false negatives. This renders it useful for space efficient in-process filtering.

The memory needed is not determined by the size of the object stored but by the number of elements and quality of the Bloom Filter (the false positive rate). Whole cars, houses and star ships fit into this little bloom filter.

My Scala implementation is based on a decent stand-alone Java Bloom Filter by Ian Clarke.
class BloomFilter(val size : Int, val expectedElements : Int ){
  require(size > 0)
  require(expectedElements > 0)

  val bitArray = new BitArray(size)
  val k = ceil((bitArray.size / expectedElements) * log(2.0)).toInt
  val expectedFalsePositiveProbability = pow(1 - exp(-k * 1.0 * expectedElements / bitArray.size), k)

  def add(hash : Int) {
    def add(i : Int, seed : Int) {
      if(i == k) return
      val next = xorRandom(seed)
      bitArray.set(next)
      add(i + 1, next)
    }
    add(0, hash)
  }

  def contains(hash : Int) : Boolean = {
    def contains(i : Int, seed : Int) : Boolean = {
      if(i == k) return true
      val next = xorRandom(seed)
      if (!bitArray.get(next)) return false
      return contains(i + 1, next)
    }
    contains(0, hash)
  }

  private def xorRandom(i : Int) = {
    var y = i
    y ^= y << 13
    y ^= y >> 17
    y ^ y << 5
  }
}
It has some advantages over the base version:
- My implementation replaces java.util.Random with am in-lined Xorshift Random Number Generator that has sufficient randomness for a Bloom Filter. Additionally the java.util.Random is thread-safe which decreases the performance unnecessarily.
- The BitSet implementation is replaced by a simplified bit array implementation. An idea I've borrowed from a Bloom Filter implementation by Kevin Bourrillion. The size of the BitArray is always a power of 2. This allows some performance optimizations. The modulo operations used to transform a hash function result into a bit set index can be replaced with bit shifting: x % size = x & (size – 1) (when size is a power of 2).
class BitArray(bits : Int){
  val size = nextPow2(bits)
  require(isPowerOf2(size))
  private val data = new Array[Long](size >> 6)

  def set(index : Int) = data(idx(index)) |= (1L << index)
  def get(index : Int) = (data(idx(index)) & (1L << index)) != 0

  private val mask = size - 1
  private def idx(index : Int) = (index & mask) >> 6
  private def isPowerOf2(i : Int) = ((i - 1) & i) == 0
  private def nextPow2(i : Int) = {
    def highestBit(remainder : Int, c : Int) : Int = 
       if(remainder > 0) highestBit(remainder >> 1, c + 1) else c
    require(i <= (1 << 30))
    val n = if(isPowerOf2(i)) i else 1 << highestBit(i, 0)
    assert(n >= i && i * 2 > n && isPowerOf2(n))
    n
  }
}
- The bloom filter works directly with hash values. I think that the Set semantics are overstretched when applied to Bloom Filters. A lot of the methods are not supported and the return values of add (and addAll) are misleading. Using the hash values directly is useful if you do not want to create objects to check if they are contained in the filter. The assumption here is that the object instantiation is far more expensive than calculating hash values and create only objects for the values contained in the filter. Normal uses are of course still possible with this interface.

All the changes add up to an increase in performance of at least 300% compared to the base implementation having a similar false positive rate.

20091019

Support for generator type conversion in Quickcheck

A new utility method is now part of Quickcheck that lets you convert a Generator<A> to a Generator<B> given that A extends B.

This is useful when you have a Generator<Integer> and would like to treat it as a generator of type Generator<Object>.

This can now be done using the Generators.cast method:
Generator<Integer> integers = PrimitiveGenerators.integers();
Generator<Object> objects = Generators.<Object> cast(integers);
This is valid as the type Generator is covariant. (It does only produce values but the state of the Generator<T> is not altered.)

The cast is needed as the Java type does only support invariant generic types. Java cannot support the valid implicit type conversion from Generator<Integer> to Generator<Object>. This information is not present in the Java type system. The cast has to be performed explicitly. (If the generic type Generator could be annotated that it is covariant this sort of conversion could be supported. Scala supports the covariant and contravariant type conversion this way.)

20091013

Using source code generation to simplify test development with Quickcheck

With Quickcheck version 0.5 it is possible to use generated source code to remove repetitive test code.

The first generator tries to remove some redundancy from tests that use only a sample from a generator. For example a common source code pattern is:
import static net.java.quickcheck.generator.PrimitiveGenerators.strings;
String sample  = strings().next();
If you're only ever going to use this one string the next() call on the generator is annoying. You're doing it over and over again.

With the @Samples annotation on the PrimitiveGenerators class the PrimitiveGeneratorSamples class will be generated that contains exactly this code. So instead of the code above you can now write:

import static net.java.quickcheck.generator.PrimitiveGeneratorSamples.anyString;
String sample = anyString();

The same @Samples annotation can be used for your own generators. The setup is simple with javac. As long as you have the quickcheck-src-generator.jar in your classpath and did not disable annotation processing in your project source generation should work out of the box.

Only static methods returning a subtype of Generator<T> are supported. The generated class will be named <YourClass>Samples and the methods some<YourMethod in singular>. Your generator method names should be in plural as in the PrimitiveGenerators and CombinedGenerators classes. (The algorithm to create a singular is simple. It will remove the last character from the method name. So people() won't work. It will create somePeopl.)

A new way to create state characteristics is part of Quickcheck 0.5. The Iterables.toIterable factory method will create an adapter the Iterable interface. Iterable instances can then be used in for expressions.
for (Integer any : Iterables.toIterable(PrimitiveGenerators.integers())) {
  assertEquals(any * 2, any + any);
}
With the @Iterables annotation the source code to adapt to the Iterables interface gets generated. The generated adapted factory method can then be directly used in your tests:
import static net.java.quickcheck.generator.PrimitiveGeneratrsIterables.*;
for (Integer any : someIntegers()) assertEquals(any * 2, any + any);
This generator can be used with your generators, too. The same restrictions apply as for the @Samples annotation. The generated class is named <YourClass>Iterables and the methods some<YourMethodName>.

At the moment the two factory method classes CombinedGenerators and PrimitiveGenerators are annotated with @Samples and @Iterables, so all factory methods support now samples and for expressions.

The combination of source code generation and the Iterables adapter removes some pain with the missing closure support in Java. I've tried to circumvent this problem. The JUnit runner support is such a trial. But now after using it I do see that it does clutter the test code unnecessarily, it is harder to navigate and to refactor. So this feature will be probably removed from the Quickcheck release 0.6 as expressing the characteristic in an inner class and the new Iterable adapter are better alternatives.

20090702

Generators

Generators are at the very heart of Quickcheck. They provide the test data the runner executes characteristics with.

Distributions

A distribution is a function that returns values between -1.0 and 1.0. The concrete frequency of values is determined by the distribution function. All generators provided by Quickcheck use an implementation of Distribution to derive their respective values from.

From a technical point of view all generators could be run with the uniform distribution. Other distributions can be used to increase the likelihood of certain values. This is useful to increase the significance of generators.

The supported distributions are: uniform, inverted normal, positive normal and negative normal.



The diagram shows the different distribution functions supported by Quickcheck. It is an example output of an integer generator with values ranging from 0 to 10.

All values of the uniform distribution have the same probability to occur. The inverted normal distribution is a distribution function whose border values are the most likely. The function is defined based on the normal distribution. This function is inverted and cut. This is done to provide a pragmatic distribution function to test border cases where the lowest and highest values are of the most interest. For the positive normal distribution the values near 0.0 are the most probable and values near 1.0 respectively for the negative normal distribution.

Primitive generators

Primitive generator are the basic building blocks for custom generators. The static factory methods in the net.java.quickcheck.generator.PrimitiveGenerators class can be used to create primitive generators (Strings, Integer, Long, Double, Byte, Character, Date, Enum). Typically the factory provides methods to create a generator with minimum value, maximum value and distribution for each primitive type. Additionally there are convenience methods that use default values.

There are two special factory methods that do not denote a primitive type: fixed value and null. Fixed value generators will always return the given values. Null generator always returns null.

Combined Generators

Combined generators use other generators to build more complex generators. As with primitive generators there is the net.java.quickcheck.generator.CombinedGenerators class with static factory methods.

There are number of generators that create subtypes of java.util.Collection and arrays (List, Set, object Arrays, int arrays, byte array).

Two generators generate tuples: the pair and triple generators. (The Java generics somewhat discourages the creation of lager tuples).

The ensured values generator can be used to create a deterministic generator. This can be combined with a random generator that kicks in when the deterministic generator runs out of values.

The nullsAnd generator can be used to create a random mixture of null values and values returned by an other generator. (The generator has this funny name to form readable statements in test code. For example: nullsAnd(integers()).)

The unique values generator can be used to decorate another generator to ensure that a value is not created repeatedly. When the base generator returns a redundant value it will be retried. (At some point the unique generator will give up and throw an exception.)

The frequency generator can be used to combine multiple generators to one generator. Every added generator will have a weight and values are taken from of the generators according to their respective weight. The oneOf generator factory method is a convenience method to create a frequency generator from a set of generators with the same weight.

Building custom generators

Armed with this knowledge about primitive and combined generators you can now start to create your own custom generators.

The following example defines a generator for java.io.File objects.
class FileGenerator implements Generator<File> {

  Generator<File> roots = fixedValues(File.listRoots());
  Generator<List<String>> paths = nonEmptyLists(letterStrings());

  @Override
  public File next() {
    File f = roots.next();

    for (String p : paths.next()) f = new File(f, p);
    return f;
   }
}

An implementation of the Generator interface has to define the next method. This method returns the next generated value. It will be called by the Quickcheck runner when a test executes.

It uses the fixedValues generator to create the root node. The number of file system roots depends on the OS. Windows systems can have multiple roots whereas Unix systems have only one. The path of the file after the root is created from a list of strings. The nonEmptyList generator will create lists with at least one entry.

The next method implementation creates the file from a file representing the root node and adds to it the relative paths generated with the paths generator. The relative paths are repeatedly added in the for loop to create the absolute path of the resulting file.

It is always a good practice to keep the base generators as members and reuse these for multiple runs of the next method.

20090625

Iterable adapter - Ways to define characteristics

The current development version of Quickcheck for Java contains new Iterable adapter. This adapter lets you convert a Generator implementation into an Iterable. This Iteraterable can then be used in Java for loops. This makes definitions of tests much shorter as no inner class has to be used to define a characteristic.

When I started porting Quickcheck from Haskell I thought that there will be some closure-like facility in the Java language in 2 years. (After all more and more developers seemed to get exposed to the concept of functions and higher order functions. Most recent languages have some form of language construct to build new control structures in libraries.) So I did not bother myself too much with the usability problems that arise from the use of inner classes. In the meantime time it is obvious that Java will not have closures soon and maybe never will. So as long as we are using Java we have to get around this.

Quickcheck has now three different ways to express a characteristic: inner classes, annotations and for-loop expressions. Each of these solutions has its advantages and disadvantages. The example shows a characteristic defined for all three approaches.


@RunWith(QuickCheckRunner.class)
public class IterableExample {

  @Test
  public void classic() {
    forAll(PrimitiveGenerators.integers(), new AbstractCharacteristic<Integer>() {

      @Override
      protected void doSpecify(Integer any) throws Throwable {
        assertEquals(any * 2, any + any);
      }
    });
  }

  @ForAll(generatorMethod = "net.java.quickcheck.generator.PrimitiveGenerators.integers")
  public void annotation(Integer any) {
    assertEquals(any * 2, any + any);
  }

  @Test
  public void iterable() {
    for (Integer any : Iterables.toIterable(PrimitiveGenerators.integers())) {
      assertEquals(any * 2, any + any);
    }
  }
}

Inner classes are the standard implementation and fine if one can ignore the boilerplate code that is used to express a characteristic as an inner class.
Annotations are not type-safe (and therefore refactoring safe), they cannot be navigated in IDEs and the generator definition and use is separated.
The Iterable adapter is type-safe and has concise definition of the characteristic but it lacks all features of the Quickcheck runner. So the Iterator<T> created by the Iterable<T> adapter will produce a fixed number of values. As the Java for loop construct is used the context information for the test run is missing. Setup and tear down have to be called explicitly.

20090618

Defining the Collection.add contract in Quickcheck for Java

The Collection add method does define a contract that is expressed in javadoc.
It states:
  • A collection implementation can throw an exception. The collection will stay unchanged and contains() will return false.
  • The add method return value indicates if the collection changed as a result of the call. The collection will contain the element (based on equals()) but only if the collection changed the very instance will be part of the collection.
This textual form can be expressed as a Quickcheck characteristic.

@Test
public void collectionAdd() {
 for (Pair<Integer, List<Integer>> any : somePairs(integers(), lists(integers()))) {
  Integer element = any.getFirst();
  Collection<Integer> collection = any.getSecond();

  boolean changedCollection = false;
  boolean exceptionThrown = false;
  try {
   changedCollection = collection.add(element);
  } catch (Exception e) {
   assertException(e);
   exceptionThrown = true;
  }
  assertTrue(collection.contains(element) != exceptionThrown);
  assertTrue(changedCollection == containsInstance(collection,
    element));
 }
}

private void assertException(Exception e) {
 assertTrue(e instanceof UnsupportedOperationException
   || e instanceof ClassCastException
   || e instanceof IllegalArgumentException
   || e instanceof IllegalStateException);
}

private boolean containsInstance(Collection<?> collection, Object element) {
 for (Object e : collection) if (e == element) return true;
 return false;
}

So in this way a specification can be provided of an interface some third party has to implement. It's then up to the implementer to provide a generator for the implementation besides the working code.

This has obviously some overlap with contract-driven development and the assertions facility. One interesting reuse of the characteristic could be to take an implementation that states it fulfills a given contract and monitor the behaviour of the implementation at run time (i.e. not while test time) and to report violations of the contract.