Monday, May 2, 2011

Searching again (400ms -> 2ms)

The problem statement: you want to search for a file using any arbitrary substring contained in the file url. This is deceptively hard, if you think in a relational db way, you want something "like %foo%" where the leading percent will force indexes to be considered redundant and a sequential scan to ensue. If you try instead to break up the file's URL into words and use a full text indexing solution you will find your old friend the leading percent again, like here for Lucene.

One easy way to use full text indexing style and allow substrings is to cheat. First disable word stemming (ie, dont turn diving and dive into a stemmed "dive" word). Then for each word to be added to the full text index, shift it left from 1..10 times. So wonderful, onderful, nderful, derful, erful etc are all added as words for the same URL. This is most simply accomplished using a postgresql function to do the shifting.

Then when you search for a substring foo you want to find to_tsquery('simple,'foo:*') as a fulltext query on your url column. As prefix searches are allowed in fulltext index engines like Lucene and postgresql's TSearch2 engine, this evaluates quite quickly.

For 200,000 URLs when using the raw regex match "~" in postgresql you might see 400ms evaluation times, the same data using a gin index on to_tsvector('simple',fnshiftstring(url)) might return in 2-3ms. Of course the index can't always handle your regular expression, but if you can tease out substrings which must be present from your regular expression, a shifted gin index could drop evaluation times for you. eg, .*[Ff]oob([0-9]... could use a shifted search for "oob" as a prefilter to full regular expression evaluation.

Getting down to a few ms evaluation allows GUIs to return some sample results as you type in your regular expression. This speed up will be available in 1.5.6+ of libferris.

A while ago I released an engine for maemo with statistical spatial indexing for regular expression evaluation. It's an interesting problem IMHO, and its also a very common one.