Intelligent String Matching for all Universities in DBpedia

When Quicksilver came out, it changed the way I used my computer. Although it's not as essential to me these days as it used to be, it's the first implementation of fuzzy string matching that I encountered and fell in love with, a feature that I now take completely for granted most of the time but couldn't live without. I wouldn't be surprised if the main reason TextMate was so successful for so long was it's implementation of this algorithm for it's “Open file in project” dialog. In fact, the main reason I switched from TextMate to Sublime Text is the brilliant use of this algorithm in its autocomplete UI.

If you haven't seen this in action before, the idea is incredibly simple but brilliant. Instead of simply searching for a substring in a list of items, the string you type is scored against each item in the list. If you simply type the full string, you will get a high score the same as a naaive search implementation. However if you type the first character of each word, e.g. “uoo” for “University of Oxford”, then that will be one of the top results. If this isn't making much sense, a live demo might help.

The problem at hand

So, now I need to provide a way to select from a large number of university names in a web application, it would be remiss of me not to give my users the same great user experience of this search algorithm. In fact, it is an especially appropriate use case because universities are frequently referred to by abbreviations, for example UCL instead of University College London. If you just tested if the name contained the search string, a user typing “ucl” would get no results, however with the right fuzzy matching algorithm University College London will be the first result.

I started off by using LiquidMetal, which has been working brilliantly for the roughly 300 universities in our database. Late last week, I discovered the number of organisations is about to rise to 15,000 - surely this approach simply won't be able to scale to that number of strings, and we'll have to revert back to simple string matching, perhaps even on the server to keep clients responsive?

Time to break out the benchmarks

First I needed a large list of university names. DBpedia to the rescue! I used the following query to get me a nice list of 15639 university names, which was perfectly serendipitous.

PREFIX dbo: <http://dbpedia.org/ontology/>

SELECT DISTINCT ?name WHERE {  
     ?university rdf:type dbo:University.
     ?university dbpprop:name ?name
}
ORDER BY ?university

Then I scoured the interwebs looking for various implementations of this fuzzy matching algorithm, and set up some benchmarking code. I wanted to rank all of the strings against a set of likely queries, hopefully some of which would be pathological, i.e. “This string is definitely not in the dataset at all and is also quite long”. I also compared against more simple matching using String#indexOf and String#match.

Algorithm	uoo	uoc	uo…	ca…	ox…	Un…	un…	Un…	xx…	yy…	Th…	Mean	Max
quicksilver.js	67	61	67	78	50	55	53	43	35	35	68	55.64	78
LiquidMetal	71	60	61	47	39	196	144	142	6	20	17	73.00	196
string_score.js	72	74	68	51	44	89	63	61	20	31	40	55.73	89
String#indexOf	2	3	3	3	3	2	3	3	2	2	2	2.55	3
String#match	11	12	11	13	12	14	13	13	11	11	26	13.36	26

All times in milliseconds. Hover over the strings in the header row to see the full string tested

Conclusions

It turns out that the times are pretty fast even for the intelligent scoring algorithms. LiquidMetal seems to be vulnerable to a few pathological cases but the difference isn't really that big. I think from these results I will be using string_score.js, due to the more advanced featureset and better matching it provides.

I was really struck by how short these times are though - less than 100 milliseconds to rank 16,000 strings? Amazing. Even more amazing is the average of less than 3 milliseconds for the fastest method, with a simple indexOf. This suggests that even for a far larger dataset, searching on the client side should be relatively responsive and probably limited by memory more than anything else, even if you do have to sacrifice the fuzzy matching.