Dev Blog #4: There is nothing new under the sun

I said in a previous post that I'd be revisiting my solution to fuzzy-matching substring searches. Well, insipration finally struck.

To recap, I was getting missed lookups in a binary search tree when the tree was of the form:

Hello

Hell Help


If you try to find the word "Hells" in there it should match with 'Hell', because that's a sub-string of 'Hells'. But what would actually happen is that the binary walker would see that 'Hells' comes alphabetically after 'Hello' and look down the right hand branch rather than the left. Looking down both branches would defeat the point of even using a tree.

The first solution I came up with was a Trie. I didn't know that at the time, but after coming up with the idea and implementing it I discovered that, as usual, someone else had got there first.

The downside of Tries is that they're bulky - and when a datastructure is bulky you risk cache misses, and cache misses are so expensive these days they can make nonsense of any theoretical O() performance calculations. You can slim a Trie down by implementing the internal arrays as expanding lists but that just makes searches even less memory-coherent.

The solution I devised is a compromise between a binary tree and a Trie: instead of just a 'left' and 'right' there is also a 'down' child. When searching the tree you compare the first letter of the string with the root. If it's lower, you go left. If it's higher, you go right. If it's the same, you go down and advance to the next character in the string. The fixed size of each element in the tree means cache coherency is also a realistic goal.

Naturally someone had already come up with this idea, too: it's called a Ternary search tree.

Very often the problem with coding is that until you've figured out a solution, you wouldn't know what to go looking for to find one...

Comments

Popular Posts