May 29, 2009

# Deferred--fat-fingers strike again..

Recently at work I have noticed an abundance of customers leaving comments with mistyped--"fat-fingered"--email addresses. We have all seen it "yahooo.com", "alo.com","gmale.com". The problem with these lost souls, beside suffering from fat-fingeritis, is the are paying customers and deseve to be heard; "Forgive them for they know not what they do".   The web interface is not maintained by me and asking "those people" to add more logic to the front-end would take an act of congress. I do have control over the email gateways that send these ill fated emails out. After the magick of Google, I stumbled upon the Levenshtein distance theory.  According to wikipedia, "The Levenshtein distance between two strings is given by the minimum number of operations needed to transform one string into the other, where an operation is an insertion, deletion, or substitution of a single character". ;this might work.  I proceeded to look for some code examples and put this together.(I pieced together a couple of examples to suit my needs)

<code>

#include <stdlib.h>
#include <malloc.h>
#include <string.h>
#include <stdio.h>
int levenshtein_distance(char *s,char*t);
int minimum(int a,int b,int c);

int levenshtein_distance(char *s,char*t)

{
//Step 1
int k,i,j,n,m,cost,*d,distance;
n=strlen(s);
m=strlen(t);
if(n!=0&&m!=0)
{
d=malloc((sizeof(int))*(m+1)*(n+1));
m++;
n++;
//Step 2
for(k=0;k<n;k++)
d[k]=k;
for(k=0;k<m;k++)
d[k*n]=k;
//Step 3 and 4
for(i=1;i<n;i++)
for(j=1;j<m;j++)
{
//Step 5
if(s[i-1]==t[j-1])
cost=0;
else
cost=1;
//Step 6
d[j*n+i]=minimum(d[(j-1)*n+i]+1,d[j*n+i-1]+1,d[(j-1)*n+i-1]+cost);
}
distance=d[n*m-1];
free(d);
return distance;
}
else
return -1; //a negative return value means that one or both strings are empty.
}

int minimum(int a,int b,int c)
/*Gets the minimum of three values*/
{
int min=a;
if(b<min)
min=b;
if(c<min)
min=c;
return min;
}
int main (int argc, char *argv[]) {
int result;

//used just three top domains for testing
char *domains[3] = {"aol.com", "gmail.com", "comcast.net"};
int SIZE = 3;
int x;
if(argc != 2) {
}
char *first, *second;
second = argv[1];
//add loop to go through array
for(x=0; x < SIZE; x++) {
result = levenshtein_distance(domains[x], second);
//if the required steps to match is less than two--I feel confident we have it correct       if (result < 2 ) {
printf("did you mean %s\n", domains[x]);
} else {
printf("could not find a match\n");
}
}
return 0;
}

</code>

Obviously this is not a finished concept, but one can see how this would  be useful. One could use this  to give a list of "possibly-correctable" email addresses in a database table. Anyway found this interesting and wanted to share.

Take care.