May 14, 2008

Predictive text input with Soothsayer

Author: Ben Martin

Soothsayer is a predictive text input system. Many folks reading that
sentence will think of the word completion offered by mobile phones.
Soothsayer is different from such mobile phone systems in that it tries
to use context and other statistical information to offer predictions
instead of just presenting a list of words that might match the first
few letters you type.
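
To make that distinction concrete, here is a small sketch that contrasts
plain prefix matching with a ranking that also looks at the previous
word. It is a toy illustration only: the names prefixMatches,
contextMatches, and BigramCounts are made up for this example and are
not part of Soothsayer, and real predictors combine more statistics
than a single table of word-pair counts.

```cpp
// Toy contrast between prefix matching and context-aware ranking.
// All names here are hypothetical; this is not the Soothsayer API.
#include <algorithm>
#include <map>
#include <string>
#include <utility>
#include <vector>

// (previous word, next word) -> how often that pair was seen in training text
using BigramCounts = std::map<std::pair<std::string, std::string>, int>;

// Phone-style completion: every known word that starts with the prefix,
// in alphabetical order, with no regard for what was typed before.
std::vector<std::string> prefixMatches(const std::vector<std::string>& words,
                                       const std::string& prefix)
{
    std::vector<std::string> out;
    for (const std::string& w : words)
        if (w.compare(0, prefix.size(), prefix) == 0)
            out.push_back(w);
    std::sort(out.begin(), out.end());
    return out;
}

// Context-aware completion: the same candidates, reordered so that words
// seen more often after `previous` come first.
std::vector<std::string> contextMatches(const std::vector<std::string>& words,
                                        const BigramCounts& counts,
                                        const std::string& previous,
                                        const std::string& prefix)
{
    std::vector<std::string> out = prefixMatches(words, prefix);
    auto count = [&](const std::string& w) {
        auto it = counts.find(std::make_pair(previous, w));
        return it == counts.end() ? 0 : it->second;
    };
    // stable_sort keeps alphabetical order among words with equal counts.
    std::stable_sort(out.begin(), out.end(),
                     [&](const std::string& a, const std::string& b) {
                         return count(a) > count(b);
                     });
    return out;
}
```

With a table in which "there" follows "hi" more often than "the" does,
contextMatches puts "there" first for the prefix "th", while
prefixMatches would still list "that" first simply because of
alphabetical order.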

Soothsayer is a library with many plugins which
can be configured to create a predictive system tailored to your text
entry task. If you are entering text in a language Soothsayer does not
know or you are planning to use Soothsayer in a specialized domain
containing special words or an abnormal distribution of the more
common words, then text2ngram
can be used to produce a custom n-gram database that is
tailored to your needs.

There are no Soothsayer binary packages for Ubuntu, Fedora or
openSUSE in the standard repositories. For this article I'll use version
0.6.1 and build from source on a Fedora 8 machine. Building Soothsayer
requires the SQLite development
packages to be installed. Soothsayer uses autotools to build with the
standard ./configure; make; sudo make install process.

The Soothsayer distribution includes a few demonstration programs.
Shown below is soothsayerDemo after typing "hi t" and pressing F3 to
complete the word "there". The vertical bars represent word
separations. As Soothsayer offers new predictions, the older ones are
moved to the right. When you first start soothsayerDemo, a bar and an
initial prediction are shown; these are the two rightmost rectangles in
the screenshot below. Typing "hi " accounts for the next two rectangles
and the space rectangle shown in the middle of the figure. Note that
once I pressed space, Soothsayer attempted to guess what the next word
might be and presented its top six predictions. Typing the "t" gave
Soothsayer enough information to suggest the word "there", so I pressed
F3 to complete it. Soothsayer then inserted that text together with a
space and again offered predictions for the word that might follow.

/--------------------------------------------------------------------------------\
|hi there                                                                        |
|                                                                                |
|                                                                                |
|                                                                                |
|                                                                                |
\--------------------------------------------------------------------------------/

/--\ /----\ /-\ /-----\ /---\ /-\ /-------\ /----\ /---\ /-\
|F1| |was | ||| |that | |the| ||| |himself| |he  | |the| |||
|F2| |is  | ||| |t    | |and| ||| |hideous| |his | |and| |||
|F3| |are | ||| |there| |of | ||| |hidden | |had | |of | |||
|F4| |were| ||| |they | |to | ||| |history| |him | |to | |||
|F5| |had | ||| |them | |a  | ||| |high   | |have| |a  | |||
|F6| |and | ||| |this | |i  | ||| |hide   | |her | |i  | |||
\--/ \----/ \-/ \-----/ \---/ \-/ \-------/ \----/ \---/ \-/
Last selected word: there

Notice that the words Soothsayer offers after the space character are
different in the last (leftmost) rectangle. This is because many of the
common words that were offered after the first space do not make sense
after the text "hi there". With Soothsayer's default setup, "was" is
the most likely word at this point in the input.
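
The effect of pressing a function key can be sketched in a few lines.
The article does not show how soothsayerDemo does this internally, so
the helper below, acceptPrediction, is a hypothetical name and only
models the behavior visible in the screenshot: the partially typed word
is replaced by the chosen prediction, followed by a space.

```cpp
// Hypothetical helper, not taken from soothsayerDemo: models what the
// reader sees when a prediction is accepted with a function key.
#include <string>

// Replace the partially typed last word of `buffer` with `word` and
// append a space so that the next prediction starts a new word.
std::string acceptPrediction(const std::string& buffer, const std::string& word)
{
    // Everything up to and including the last space is kept as typed.
    std::string::size_type lastSpace = buffer.find_last_of(' ');
    std::string::size_type keep =
        (lastSpace == std::string::npos) ? 0 : lastSpace + 1;
    return buffer.substr(0, keep) + word + ' ';
}
```

Calling acceptPrediction("hi t", "there") yields "hi there " with a
trailing space, which is the state of the text box in the screenshot.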

Programmers' interface

Soothsayer offers both C++ and Python programming interfaces. A fairly
concise example of usage from C++ is given in the
doc/getting_started.txt file of the distribution. Below is a similar
program with a leaner while loop to make the core processing more
obvious. For each new character that you type, soothconsole.cpp will
try to complete the next word for you and show you what it thinks you
are after.

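// soothconsole.cpp
#include "soothsayer.h"

#include <algorithm>
#include <iostream>
#include <iterator>
#include <sstream>
#include <string>
#include <vector>

#include <termios.h>
#include <unistd.h>

using namespace std;

int main( int, char** )
{
    char c;
    stringstream ss;
    Soothsayer soothsayer;

    // Put the terminal into unbuffered, no-echo mode so that each
    // keystroke arrives immediately; remember the old settings.
    struct termios saved, tio;
    tcgetattr( STDIN_FILENO, &saved );
    tio = saved;
    tio.c_lflag &= ~(ICANON | ECHO);
    tcsetattr( STDIN_FILENO, TCSANOW, &tio );

    // noskipws keeps spaces and newlines in the input stream.
    while( cin >> noskipws >> c )
    {
        // Treat Enter as a word separator.
        if( c == '\n' ) ss << ' ';
        else            ss << c;

        // Ask for predictions based on everything typed so far.
        vector<string> p = soothsayer.predict( ss.str() );
        cerr << "Predictions for:" << ss.str() << endl;
        copy( p.begin(), p.end(), ostream_iterator<string>(cerr,"\n"));
    }

    // Restore the original terminal settings once input ends.
    tcsetattr( STDIN_FILENO, TCSANOW, &saved );
    return 0;
}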

As mentioned above, you can create custom prediction models using the
text2ngram program. Soothsayer is configured using a default XML file,
which make install will have placed on your machine. If you copy that
XML file into your home directory, you can change the n-gram prediction
model that Soothsayer uses for you. In the example below I create a new
model from the text of Alice in Wonderland. As you can see, the
predictions offered strongly reflect the text used to create the
prediction model. Normally a single "a" would not complete to "alice,"
but in the data used to generate this particular prediction model that
word occurs often enough to rank among the top completions for the
letter "a."

$ cp /usr/local/etc/soothsayer.xml ~/.soothsayer.xml
$ vi ~/.soothsayer.xml
...
<Plugins>
<SmoothedNgramPlugin>
<LOGGER>ERROR</LOGGER>
<DBFILENAME>/home/ben/alice-ngram.db</DBFILENAME>
...

$ for i in `seq 1 4`
do
text2ngram -n $i -l -f sqlite -o ~/alice-ngram.db alice13a.txt
done

$ soothconsole
Predictions for:a
and
a
alice
as
at
all
...
Predictions for:alice in w
which
with
without
waiting
wonderland
was
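
The loop above runs text2ngram once for each n from 1 to 4, so the
database ends up holding counts for single words as well as for runs of
two, three, and four words. The sketch below shows the general idea of
n-gram counting in memory. It is not text2ngram's implementation: the
function names tokenize and countNgrams are made up for this example,
and the tokenizing rules (letters only, everything lowercased) are
simplifying assumptions.

```cpp
// Conceptual sketch of n-gram counting; not how text2ngram is written.
// tokenize and countNgrams are hypothetical names for this example.
#include <cctype>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Split text into lowercase words, treating anything that is not a
// letter as a separator (a deliberately crude assumption).
std::vector<std::string> tokenize(const std::string& text)
{
    std::vector<std::string> tokens;
    std::string current;
    for (unsigned char ch : text) {
        if (std::isalpha(ch)) {
            current += static_cast<char>(std::tolower(ch));
        } else if (!current.empty()) {
            tokens.push_back(current);
            current.clear();
        }
    }
    if (!current.empty())
        tokens.push_back(current);
    return tokens;
}

// Count every run of n consecutive tokens, keyed by the words joined
// with single spaces.
std::map<std::string, int> countNgrams(const std::vector<std::string>& tokens,
                                       std::size_t n)
{
    std::map<std::string, int> counts;
    if (n == 0 || tokens.size() < n)
        return counts;
    for (std::size_t i = 0; i + n <= tokens.size(); ++i) {
        std::string key = tokens[i];
        for (std::size_t j = 1; j < n; ++j)
            key += ' ' + tokens[i + j];
        ++counts[key];
    }
    return counts;
}
```

Counting with n set to 1 records how common each word is on its own,
while the higher orders record which words tend to follow which. A text
in which "alice" appears very often therefore pushes that word up the
list for the letter "a", which matches the console output above.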

The Soothsayer plugin system gives you the flexibility to have
Soothsayer make predictions using different methods. The ability to
train Soothsayer with text from the domain in which you want to use it
should give you more effective completion offerings, especially for
words that occur frequently in your domain but not in common English.
As the Alice in Wonderland example shows, a prediction model tailored
to the text you intend to type greatly affects the predictions that
Soothsayer makes, and can lead to more efficient completions.
