Thursday, May 2, 2013

Indexing on limited hardware... what to do

Libferris supports many indexing libraries and technologies through its plugin interface. Larger systems can use a PostgreSQL plugin which is tailored explicitly to get the most out of that RDBMs for larger file server indexes. For smaller end, there are memory mapped files, clucene, soprano, or SQLite. I've been doing some tinkering trying to milk extra performance out of the indexing plugins for ARM machines lately. Note that if you are using debian, the CLucene you'll want is the 2.x series, currently only packaged for experimental.

For testing purposes I built a fairly tiny index of only 130k files. An interesting test case is looking for specific files which have paths that match against a regular expression and returns a fairly small chunk of results. For this case, about 115 resulting files using a four character substring search as the regex. These are a common query for looking for files when you don't recall the exact ordering of the directory names or where a directory was. Small number of results, regex to pick them.

The memory mapped index implementation (boostmmap) uses boost IPC and multi indexed collections created in memory mapped files to maintain the index. The index has also a digram index for each URL allowing regular expressions to resolve through index rather than needing evaluation against full URLs.

The SQLite index is fairly vanilla and doesn't include many customizations for sqlite. Whereas the PostgreSQL index implementation does use many of the features specific to that database. Neither the SQLite or boostmmap indexes in the public libferris repo attempt to do any compression on URL strings or the like.

A fairly basic index on 130k files is about 80mb using either memory mapped files or SQLite. Caches are cleared by echo 3 > drop_caches. Using an odroid-u2 with emmc flash, on a cold cache the SQLite index comes out about 10% faster than the boostmmap for a query finding 115 files. Turning off the regex prefilter index in the boostmmap makes it 10% slower again. This is a trade off, a very fast CPU and a disk with great file location and single extents will show less or no difference with the prefilter as reading 80mb from disk will take less time and the CPU can run 130k regexes very quickly. The prefilter requited only 124 regex evaluations, without the prefilter all 130611 URLs needed a regex evaluation.

The interesting part is with a warm cache the boostmmap is about twice as fast overall as the SQLite index. This is a big difference as the timing is for overall complete run time from the command line, and there is some overhead in starting up the index query itself. As usual, things vary depending on if you are expecting frequent queries (warm cache), have a very fast CPU (regex eval is relatively less costly), or need multiple updaters (SQLite allows it, my memory mapped doesn't).

To then experiment a little further, I brought the ferris clucene plugin into the mix. I disabled the explicit prefilter index on regex code for initial testing, the index became about 70mb and could resolve the query on a cold cache in about 65% the time of the SQLite plugin. On warm cache the clucene was slowest, which is mainly due to the prefilter being disabled and the fallback code making the URL query a WildcardQuery with no pre or postfix to anchor the query on.

Next time around I'll see how speed effective the prefilter index is on clucene. I know it slows down adding documents (you are indexing more), and is larger (I haven't optimized for index size), but it will be interesting to see the performance on the eMMC device for the prefilter.



No comments: