20091231

Peter Norvig's Spelling Corrector in Scala

This is a Scala version of Peter Norvig's spelling corrector. Reading his essay I realized that Python is quite a nice language and similar to Scala due to the support of for-expressions in both languages. Before that I've seen Python as a strange language that made you write self in your classes all the time.

I tried to write the shortest version possible without introducing redundancy. The code is definitely not very readable.

import util.matching.Regex.MatchIterator

val alphabet = 'a' to 'z' toArray
def train(features : MatchIterator) = (Map[String, Int]() /: features)((m, f) => m + ((f, m.getOrElse(f, 0) + 1)))
def words(text : String) = ("[%s]+" format alphabet.mkString).r.findAllIn(text.toLowerCase)
val dict = train(words(io.Source.fromFile("big.txt").mkString))

def edits(s : Seq[(String, String)]) = (for((a,b) <- s; if b.length > 0) yield a + b.substring(1)) ++
  (for((a,b) <- s; if b.length > 1) yield a + b(1) + b(0) + b.substring(2)) ++
  (for((a,b) <- s; c <- alphabet if b.length > 0) yield a + c + b.substring(1)) ++
  (for((a,b) <- s; c <- alphabet) yield a + c + b)

def edits1(word : String) = edits(for(i <- 0 to word.length) yield (word take i, word drop i))
def edits2(word : String) = for(e1 <- edits1(word); e2 <-edits1(e1)) yield e2
def known(words : Seq[String]) = for(w <- words; found <- dict.get(w)) yield w
def or[T](candidates : Seq[T], other : => Seq[T]) = if(candidates.isEmpty) other else candidates

def candidates(word: String) = or(known(List(word)), or(known(edits1(word)), known(edits2(word))))

def correct(word : String) = ((-1, word) /: candidates(word))(
  (max, word) => if(dict(word) > max._1) (dict(word), word) else max)._2

List("osters", "musters", "mixters") map correct foreach println 

You can run the script directly with the Scala interpreter:
$ scala test.scala
posters
masters
matters

20091220

A decent Bloom Filter in Scala

The Bloom Filter is an interesting probabilistic data structure. Its main advantage is that the Bloom Filter drastically reduces the memory consumption.

It produces false positives. Which means that it falsely states that a value is contained. On the other hand it ensures that it produces no false negatives. This renders it useful for space efficient in-process filtering.

The memory needed is not determined by the size of the object stored but by the number of elements and quality of the Bloom Filter (the false positive rate). Whole cars, houses and star ships fit into this little bloom filter.

My Scala implementation is based on a decent stand-alone Java Bloom Filter by Ian Clarke.
class BloomFilter(val size : Int, val expectedElements : Int ){
  require(size > 0)
  require(expectedElements > 0)

  val bitArray = new BitArray(size)
  val k = ceil((bitArray.size / expectedElements) * log(2.0)).toInt
  val expectedFalsePositiveProbability = pow(1 - exp(-k * 1.0 * expectedElements / bitArray.size), k)

  def add(hash : Int) {
    def add(i : Int, seed : Int) {
      if(i == k) return
      val next = xorRandom(seed)
      bitArray.set(next)
      add(i + 1, next)
    }
    add(0, hash)
  }

  def contains(hash : Int) : Boolean = {
    def contains(i : Int, seed : Int) : Boolean = {
      if(i == k) return true
      val next = xorRandom(seed)
      if (!bitArray.get(next)) return false
      return contains(i + 1, next)
    }
    contains(0, hash)
  }

  private def xorRandom(i : Int) = {
    var y = i
    y ^= y << 13
    y ^= y >> 17
    y ^ y << 5
  }
}
It has some advantages over the base version:
- My implementation replaces java.util.Random with am in-lined Xorshift Random Number Generator that has sufficient randomness for a Bloom Filter. Additionally the java.util.Random is thread-safe which decreases the performance unnecessarily.
- The BitSet implementation is replaced by a simplified bit array implementation. An idea I've borrowed from a Bloom Filter implementation by Kevin Bourrillion. The size of the BitArray is always a power of 2. This allows some performance optimizations. The modulo operations used to transform a hash function result into a bit set index can be replaced with bit shifting: x % size = x & (size – 1) (when size is a power of 2).
class BitArray(bits : Int){
  val size = nextPow2(bits)
  require(isPowerOf2(size))
  private val data = new Array[Long](size >> 6)

  def set(index : Int) = data(idx(index)) |= (1L << index)
  def get(index : Int) = (data(idx(index)) & (1L << index)) != 0

  private val mask = size - 1
  private def idx(index : Int) = (index & mask) >> 6
  private def isPowerOf2(i : Int) = ((i - 1) & i) == 0
  private def nextPow2(i : Int) = {
    def highestBit(remainder : Int, c : Int) : Int = 
       if(remainder > 0) highestBit(remainder >> 1, c + 1) else c
    require(i <= (1 << 30))
    val n = if(isPowerOf2(i)) i else 1 << highestBit(i, 0)
    assert(n >= i && i * 2 > n && isPowerOf2(n))
    n
  }
}
- The bloom filter works directly with hash values. I think that the Set semantics are overstretched when applied to Bloom Filters. A lot of the methods are not supported and the return values of add (and addAll) are misleading. Using the hash values directly is useful if you do not want to create objects to check if they are contained in the filter. The assumption here is that the object instantiation is far more expensive than calculating hash values and create only objects for the values contained in the filter. Normal uses are of course still possible with this interface.

All the changes add up to an increase in performance of at least 300% compared to the base implementation having a similar false positive rate.