20100629

Talk to me - have fun with your favorite podcasts

Driving home the other day I listened to a German radio show that made fun of Chancellor Angela Merkel. They had cut together some of her speeches to make her tell the truth about her coalition with the FDP.

I thought: how hard can this be to create? It seemed a simple version would not be too hard to build using the Java Sound API and my favorite podcast.

The idea is simple: take input audio files and build an index of the words they contain. Each index entry consists of the word, its start time, and its end time. A new sentence is then stitched together from single words taken from the index.

The file index implementation is straightforward. It reads the index information from a CSV file on the classpath. The whole application is configured with Guice, so the index file name is injected. (I created only a small index from the Java Posse Podcast using Audible.)
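A minimal sketch of what such a file index could look like. The class name, the CSV column layout (word, source file, start and end in milliseconds), and the lookup method are my assumptions, not the actual implementation:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical word index: each CSV line maps a word to its source
// file and its start/end timestamps in milliseconds, e.g.
//   hello,episode1.wav,1200,1580
public class WordIndex {

    // One index entry: where a word lives inside an input file.
    public static final class Entry {
        public final String word;
        public final String file;
        public final long startMillis;
        public final long endMillis;

        Entry(String word, String file, long startMillis, long endMillis) {
            this.word = word;
            this.file = file;
            this.startMillis = startMillis;
            this.endMillis = endMillis;
        }
    }

    private final Map<String, Entry> entries = new HashMap<>();

    // Parses one CSV line into an index entry.
    public static Entry parseLine(String line) {
        String[] parts = line.split(",");
        return new Entry(parts[0], parts[1],
                Long.parseLong(parts[2]), Long.parseLong(parts[3]));
    }

    // Reads the CSV from the classpath; in the real application the
    // resource name would be injected via Guice.
    public WordIndex(String resourceName) throws Exception {
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                WordIndex.class.getResourceAsStream(resourceName),
                StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                Entry entry = parseLine(line);
                entries.put(entry.word, entry);
            }
        }
    }

    public Entry lookup(String word) {
        return entries.get(word);
    }
}
```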

The AudioInputStream is the central class for interacting with the Java Sound API: you read audio data from it. To create audio data, you construct an AudioInputStream that the AudioSystem can read from. The actual encoding is done by the AudioSystem implementation, depending on the output audio format.
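To illustrate the pattern, here is a small sketch that wraps raw PCM bytes in an AudioInputStream and lets the AudioSystem encode them as a WAV file. The format parameters (44.1 kHz, 16-bit mono) and the file name are assumptions for the example:

```java
import javax.sound.sampled.AudioFileFormat;
import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.AudioInputStream;
import javax.sound.sampled.AudioSystem;
import java.io.ByteArrayInputStream;
import java.io.File;

public class WriteExample {

    // Wraps a raw PCM byte array in an AudioInputStream the
    // AudioSystem can read from; the frame length is derived from
    // the format's frame size.
    static AudioInputStream toStream(byte[] pcm, AudioFormat format) {
        return new AudioInputStream(new ByteArrayInputStream(pcm),
                format, pcm.length / format.getFrameSize());
    }

    public static void main(String[] args) throws Exception {
        // 44.1 kHz, 16 bit, mono, signed, little endian (assumed format)
        AudioFormat format = new AudioFormat(44100f, 16, 1, true, false);
        byte[] pcm = new byte[88200]; // one second of silence, 2 bytes per frame
        // The AudioSystem does the actual WAV encoding.
        AudioSystem.write(toStream(pcm, format),
                AudioFileFormat.Type.WAVE, new File("out.wav"));
    }
}
```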

The Butcher class is the one concerned with audio files. It can read and write audio files and create AudioInputStreams from an input byte array. The other interesting thing the Butcher can do is cut samples from an AudioInputStream. An AudioInputStream consists of frames that represent the samples of the PCM signal. Frames are multiple bytes long, so to cut a valid range of frames from the stream one has to take the frame size into account: the start and end times in milliseconds have to be translated into the start byte of the start frame and the end byte of the end frame. (Start and end are stored as timestamps to keep them independent of the underlying encoding of the file.)
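The timestamp-to-byte translation described above could be sketched like this; the class and method names are mine, not the Butcher's actual API:

```java
import javax.sound.sampled.AudioFormat;

public class FrameMath {

    // Returns the byte offset of the frame containing the given
    // timestamp. Rounding to a whole frame first keeps the cut on a
    // sample boundary; a cut in the middle of a frame would corrupt
    // the PCM data.
    static long millisToByteOffset(AudioFormat format, long millis) {
        long frame = (long) (millis * (double) format.getFrameRate() / 1000.0);
        return frame * format.getFrameSize();
    }

    public static void main(String[] args) {
        // 44.1 kHz, 16 bit, stereo -> 4 bytes per frame (assumed format)
        AudioFormat format = new AudioFormat(44100f, 16, 2, true, false);
        // 1000 ms -> frame 44100 -> byte offset 176400
        System.out.println(millisToByteOffset(format, 1000));
    }
}
```

A cut is then the byte range from the start word's offset to the end word's offset, read frame by frame from the stream.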

The Butcher implementation is simplified: it supports only a single WAV AudioFormat and does no stream processing.

The Composer creates the output file. For a given sentence it takes the audio data for each word from the input files, concatenates the audio data, and writes the result to disk. The Composer is currently not very sophisticated and simply takes the first matching word it finds in the index.
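The concatenation step could be sketched as follows, assuming all word snippets share the same AudioFormat: two AudioInputStreams can be chained with a SequenceInputStream, summing their frame lengths. This is my illustration, not the Composer's actual code:

```java
import javax.sound.sampled.AudioInputStream;
import java.io.SequenceInputStream;

public class Concat {

    // Chains two audio streams into one. Both streams must have the
    // same AudioFormat; the combined frame length is simply the sum.
    static AudioInputStream concat(AudioInputStream a, AudioInputStream b) {
        return new AudioInputStream(
                new SequenceInputStream(a, b),
                a.getFormat(),
                a.getFrameLength() + b.getFrameLength());
    }
}
```

Applying this pairwise over the word snippets of a sentence yields one stream that can be handed to AudioSystem.write.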

After building with mvn assembly:assembly, the Java application can be run with
java -jar target/talk-to-me-1.0-SNAPSHOT-jar-with-dependencies.jar [output file] [sentence]

There is still plenty of interesting material to play around with. The current version can be improved in different ways:
  • Indexing an audio file is quite cumbersome. If the start and end timestamps of a word could be detected from the silence between words, indexing would be much easier.
  • The amplitude of words and the length of silence should be normalized.
  • Indexing could be even simpler if some speech recognition could be performed on the words.
  • The output quality could be improved by finding the longest sequences of words in an input audio file that match the target sentence (longest common substring problem and longest common subsequence problem).