Friday, June 12, 2009

Soundex Extension Method for C#

British AirwaysI've been pretty swamped at work lately so I've not been blogging much. Worse than that, I haven't really been able to do much creative programming and virtually nothing academic.

A few days ago, I was working on some basic name searching and recognized an opportunity to squeeze in some play time. It occurred to me that the users of our application will be looking people up by name and will often not know how to spell the name correctly.

I wanted to allow users to search for people by name without regard for different spellings. For example, if you're looking for Geoff McArther or Jeff MacArther, both names will be returned by the same search term.

To do this, I used a technology invented back in the 1920s for the US Census called soundex. The following is a set of extension methods that implement the soundex phonetic algorithm:
public static string Soundex(this string s)
{
// by default, a soundex is the first letter
// followed by three numbers
return Soundex(s, 4);
}

public static string Soundex(this string s, int length)
{
return FullSoundex(s)
.PadRight(length, '0') // soundex is no shorter than
.Substring(0, length); // and no longer than length
}

public static string FullSoundex(this string s)
{
// the encoding information
const string chars = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
const string codes = "0123012D02245501262301D202";

// some helpful regexes
Regex hwBeginString = new Regex("^D+");
Regex simplify = new Regex(@"(\d)\1*D?\1+");
Regex cleanup = new Regex("[D0]");

// i need a capitalized string
s = s.ToUpper();

// i'm building the coded string using a string builder
// because i think this is probably the fastest and least
// intensive way
StringBuilder coded = new StringBuilder();

// do the encoding
for (int i = 0; i < s.Length; i++)
{
int index = chars.IndexOf(s[i]);
if (index >= 0)
coded.Append(codes[index]);
}

// okay, so here's how this goes . . .
// the first thing I do is assign the coded string
// so that i can regex replace on it
string result = coded.ToString();

// then i remove repeating characters
//result = repeating.Replace(result, "$1");
result = simplify.Replace(result, "$1").Substring(1);

// now i need to remove any characters coded as D from
// the front of the string because they're not really
// valid as the first code because they don't have an
// actual soundex code value
result = hwBeginString.Replace(result, string.Empty);

// i used the char D to indicate that an h or w existed
// so that if to similar sounds were separated by an h or
// a w that I could remove one of them. if the h or w does
// not separate two similar sounds, then i need to remove
// it now
result = cleanup.Replace(result, string.Empty);

// return the first character followed by the coded
// string
return string.Format("{0}{1}", s[0], result);
}
Now, it may not yet be perfect as I had a pretty hard time finding the exact specs for the American soundex, but it should be pretty close. If you notice something wrong, please leave me a comment and I'll correct it.

No comments:

Post a Comment