Linux.com

Feature: Tools & Utilities

Predictive text input with Soothsayer

By Ben Martin on May 14, 2008 (9:00:00 AM)


Soothsayer is a predictive text input system. Many folks reading that sentence will think of the word completion offered by mobile phones. Soothsayer is different from such mobile phone systems in that it tries to use context and other statistical information to offer predictions instead of just presenting a list of words that might match the first few letters you type.

Soothsayer is a library with many plugins which can be configured to create a predictive system tailored to your text entry task. If you are entering text in a language Soothsayer does not know or you are planning to use Soothsayer in a specialized domain containing special words or an abnormal distribution of the more common words, then text2ngram can be used to produce a custom n-gram database that is tailored to your needs.

There are no Soothsayer binary packages for Ubuntu, Fedora or openSUSE in the standard repositories. For this article I'll use version 0.6.1 and build from source on a Fedora 8 machine. Building Soothsayer requires the SQLite development packages to be installed. Soothsayer uses autotools to build with the standard ./configure; make; sudo make install process.

The Soothsayer distribution includes a few demonstration programs. Shown below is soothsayerDemo after typing "hi t" and pressing F3 to complete the word "there". The vertical bars represent word separations. As Soothsayer offers new predictions, the older ones are moved to the right. When you first start soothsayerDemo, a bar and an initial prediction, made before any input, are shown. These are the two rightmost rectangles in the screenshot below. Typing "hi " accounts for the next two rectangles and the space rectangle shown in the middle of the figure. Note that once I pressed space, Soothsayer attempted to guess the next word and presented its top six predictions. Typing the "t" gave Soothsayer enough information to suggest the word "there", so I pressed F3 to complete it. Soothsayer then inserted that text together with a space and again offered predictions for the word that might follow.

    /--------------------------------------------------------------------------------\
    |hi there                                                                        |
    |                                                                                |
    |                                                                                |
    \--------------------------------------------------------------------------------/
    /--\ /----\ /-\ /-----\ /---\ /-\ /-------\ /----\ /---\ /-\
    |F1| |was | ||| |that | |the| ||| |himself| |he  | |the| |||
    |F2| |is  | ||| |t    | |and| ||| |hideous| |his | |and| |||
    |F3| |are | ||| |there| |of | ||| |hidden | |had | |of | |||
    |F4| |were| ||| |they | |to | ||| |history| |him | |to | |||
    |F5| |had | ||| |them | |a  | ||| |high   | |have| |a  | |||
    |F6| |and | ||| |this | |i  | ||| |hide   | |her | |i  | |||
    \--/ \----/ \-/ \-----/ \---/ \-/ \-------/ \----/ \---/ \-/
    Last selected word: there

Notice that the words Soothsayer offers after the space character in the last (leftmost) rectangle are different from those in the earlier rectangles. This is because many of the common words that were offered after the first space do not make sense after the text "hi there". With the configuration that Soothsayer ships with, "was" is the most likely word at this point in the input.
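To make the effect of context concrete, here is a minimal sketch of a bigram predictor that combines the previous word with the letters typed so far. It is a toy illustration and not Soothsayer's implementation: the names BigramCounts, train_bigrams, and predict_next are hypothetical, and the model keeps simple in-memory counts rather than a database.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Toy model (hypothetical, not Soothsayer's API):
// counts[previous][next] = number of times `next` followed `previous`.
typedef std::map<std::string, std::map<std::string, int> > BigramCounts;

// Build bigram counts from whitespace-separated training text.
BigramCounts train_bigrams(const std::string& text) {
    std::istringstream in(text);
    BigramCounts counts;
    std::string previous, word;
    if (!(in >> previous)) return counts;  // empty training text
    while (in >> word) {
        ++counts[previous][word];
        previous = word;
    }
    return counts;
}

// Return up to `limit` words that followed `previous` in the training text
// and start with `prefix`, most frequent first.
std::vector<std::string> predict_next(const BigramCounts& counts,
                                      const std::string& previous,
                                      const std::string& prefix,
                                      std::size_t limit) {
    std::vector<std::pair<std::string, int> > candidates;
    BigramCounts::const_iterator it = counts.find(previous);
    if (it != counts.end()) {
        std::map<std::string, int>::const_iterator w;
        for (w = it->second.begin(); w != it->second.end(); ++w) {
            // Keep only words that begin with the typed prefix.
            if (w->first.compare(0, prefix.size(), prefix) == 0)
                candidates.push_back(std::make_pair(w->first, w->second));
        }
    }
    // Sort by descending count; ties fall back to alphabetical order.
    std::sort(candidates.begin(), candidates.end(),
              [](const std::pair<std::string, int>& a,
                 const std::pair<std::string, int>& b) {
                  if (a.second != b.second) return a.second > b.second;
                  return a.first < b.first;
              });
    std::vector<std::string> result;
    for (std::size_t i = 0; i < candidates.size() && i < limit; ++i)
        result.push_back(candidates[i].first);
    return result;
}
```

Calling predict_next with an empty prefix corresponds to the situation right after a space, where only the preceding word narrows the candidates; each typed letter then filters that list further. A prefix-only completer would ignore the `previous` argument entirely.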

Programmers' interface

Soothsayer offers both C++ and Python programming interfaces. A fairly concise example of usage from C++ is given in the doc/getting_started.txt file of the distribution. Below is a similar program with a leaner while loop to make the core processing more obvious. For each new character that you type, soothconsole.cpp will try to complete the next word for you and show you what it thinks you are after.

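    // soothconsole.cpp
    #include "soothsayer.h"

    #include <algorithm>
    #include <iostream>
    #include <iterator>
    #include <sstream>
    #include <string>
    #include <vector>

    #include <termios.h>
    #include <unistd.h>

    using namespace std;

    int main( int, char** )
    {
        char c;
        stringstream ss;
        Soothsayer soothsayer;

        // Turn off line buffering and echo so each keystroke arrives at once.
        struct termios tio;
        tcgetattr( STDIN_FILENO, &tio );
        struct termios saved = tio;
        tio.c_lflag &= ~(ICANON | ECHO);
        tcsetattr( STDIN_FILENO, TCSANOW, &tio );

        while( cin >> noskipws >> c )
        {
            // Treat Enter as a word separator.
            if( c == '\n' )
                ss << ' ';
            else
                ss << c;

            // Ask for predictions based on everything typed so far.
            vector<string> p = soothsayer.predict( ss.str() );
            cerr << "Predictions for:" << ss.str() << endl;
            copy( p.begin(), p.end(), ostream_iterator<string>(cerr,"\n") );
        }

        // Put the terminal back the way we found it once input ends.
        tcsetattr( STDIN_FILENO, TCSANOW, &saved );
        return 0;
    }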

As mentioned above, you can create custom prediction models using the text2ngram program. Soothsayer is configured using a default XML file that make install will have placed on your machine. If you copy that XML file into your home directory, you can change the n-gram prediction model that Soothsayer uses for you. In the example below I create a new model from the text of Alice in Wonderland. As you can see, the predictions offered strongly reflect the text used to create the prediction model. Normally a single "a" would not complete to "alice," but in the data used to generate this particular model that word is one of the most frequent words starting with the letter "a."

    $ cp /usr/local/etc/soothsayer.xml ~/.soothsayer.xml
    $ vi ~/.soothsayer.xml
    ...
    <Plugins>
      <SmoothedNgramPlugin>
        <LOGGER>ERROR</LOGGER>
        <DBFILENAME>/home/ben/alice-ngram.db</DBFILENAME>
    ...
    $ for i in `seq 1 4`
    do
      text2ngram -n $i -l -f sqlite -o ~/alice-ngram.db alice13a.txt
    done
    $ soothconsole
    Predictions for:a
    and
    a
    alice
    as
    at
    all
    ...
    Predictions for:alice in w
    which
    with
    without
    waiting
    wonderland
    was
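The loop above runs text2ngram once for each n from 1 to 4. As a rough idea of what counting n-grams involves, the following sketch tallies every run of n consecutive words in a piece of text. It is an illustration only: tokenize and count_ngrams are hypothetical helpers, the counts live in a std::map rather than an SQLite database, and text2ngram's actual tokenization and storage format are not shown here.

```cpp
#include <cstddef>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Split text into whitespace-separated tokens. No case folding or
// punctuation handling is done; a real tool has to decide on both.
std::vector<std::string> tokenize(const std::string& text) {
    std::istringstream in(text);
    std::vector<std::string> tokens;
    std::string word;
    while (in >> word) tokens.push_back(word);
    return tokens;
}

// Count every run of n consecutive tokens. Each key is the n tokens
// joined by single spaces. Hypothetical helper, not part of Soothsayer.
std::map<std::string, int> count_ngrams(const std::vector<std::string>& tokens,
                                        std::size_t n) {
    std::map<std::string, int> counts;
    if (n == 0 || tokens.size() < n) return counts;
    for (std::size_t i = 0; i + n <= tokens.size(); ++i) {
        std::string key = tokens[i];
        for (std::size_t j = 1; j < n; ++j) key += " " + tokens[i + j];
        ++counts[key];
    }
    return counts;
}
```

Calling count_ngrams with n set to 1, 2, 3, and 4 on the same token list mirrors the four passes of the shell loop. The higher-order counts are what let a model prefer "wonderland" after "alice in w" instead of whichever word starting with "w" is most common overall.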

The Soothsayer plugin system gives you the flexibility to change the methods Soothsayer uses to make predictions. The ability to train Soothsayer with text from the domain in which you want to use it should give you more effective completion offerings, especially for words that occur frequently in your domain but not in common English. As you can see in the Alice in Wonderland example, using a prediction model that is tailored to the text you intend to type greatly affects the predictions that Soothsayer makes, and can lead to more efficient completions.

Ben Martin has been working on filesystems for more than 10 years. He completed his Ph.D. and now offers consulting services focused on libferris, filesystems, and search solutions.
