Dev Blog #2: LISA's fuzzy logic

Obviously the big news is that I've finally come up with an acronym for Davenport's Live Interactive Search Assistant. Now I can relax and get on with actually coding it.

 Lisa is an integral part of Davenport's functionality. As you type or edit notes, Lisa is constantly searching for related elements in your project and bringing shortcuts to them to hand based on the text around the edit caret. You tag things up for Lisa to find in a fairly intuitive way:

#Bulbo: A legally safe knockoff character.

But you shouldn't have to #tag Bulbo in text for Lisa to spot the match and bring related materials to hand. You just type Bulbo.

As it turns out, that kind of pattern-matching is an interesting problem.

I started out using C# Dictionaries. They're great; you can store and retrieve whatever you like by keyword, and they're fast. However, they do have some limitations, which I encountered when I tried to move from single word keys to whole phrases with fuzzy matching.

Dictionaries do support fuzzy matching via custom comparators, but they really don't like finding fuzzy matches that are sub-strings of a longer piece of text.

For a start, standard dictionaries use a hash-table to achieve ~O(1) lookup-times (Hash tables are nothing to do with hash-tags, in case you're wondering). But they require the hash of the data you're searching for to exactly match the hash of the matching entry in the dictionary. And since you don't know which part of the text is going to match in advance, you can't calculate a hash for it. Chicken, meet egg.

SortedDictionary seems at first glance to solve this problem. Rather than a hash table it uses a binary tree, which means you can compare entries in the dictionary directly with the text. Unfortunately it falls down badly if there are multiple possible matches of different lengths.

For instance, if both "Hello" and "Hello Mother" are keys in the dictionary, and the line you are editing contains the phrase "Hello Mother", it's anyone's guess which match will actually fire first - it will depend on the order in which entries were added or removed in the past.

You can work around this by keeping track of a 'best match so far' and pushing on past the initial match, but this a) incurs a performance hit and b) doesn't work. If, for example, the line contained "Hello Moths" instead, it's possible for the "Hello" match not to fire at all. Without going into too much detail, if the dictionary's tree looks like this:

          Hello Mother

 Hello                   Sausages

with "Hello Mother" at the root and "Hello" and "Sausages" as its left and right branches, the fact that "Hello Moths" is alphabetically 'later' than Hello Mother tricks the dictionary into looking down the 'Sausages' branch rather than the 'Hello' branch, and thus wrongly concluding there is no match.

The solution I've come up with works, but I can definitely see myself returning to this problem again later.

Comments

Popular Posts