When building websites, you’ll often want users to be able to search for something by name. On LinerNotes, users can search for bands, albums, genres etc from a search bar that appears on the homepage and in the omnipresent nav bar. And we need a way to match those queries to entities in our Postgres database.
At first, this might seem like a simple problem with a simple solution, especially if you’re using the ORM; just jam the user input into an ORM filter and retrieve every matching string. But there’s a problem: if you do
Bands.objects.filter(name="beatles")
You’ll probably get nothing back, because the name column in your “bands” table probably says “The Beatles” and as far as Postgres is concerned if it’s not exactly the same string, it’s not a match.
Users are naturally terrible at spelling, and even if they weren’t they’d be bad at guessing exactly how the name is formatted in your database. Of course you can use the LIKE keyword in SQL (or the equivalent ’__contains’ suffix in the ORM) to give yourself a little flexibility and make sure that “Beatles” returns “The Beatles”. But 1) the LIKE keyword requires you to evaluate a regex against every row in your table, or hope that you’ve configured your indices to support LIKE (a quick Google doesn’t tell me whether Django does that by default in the ORM) and 2) what if the user types “Beetles”?
Well, then you’ve got a bit of a problem. No matter how obvious it is to human you that “beatles” is close to “beetles”[1], to the computer they’re just two non-identical byte sequences. If you want the computer to understand them as similar you’re going to have to give it a metric for similarity and a method to make the comparison.
There are a few ways to do that. You can do what I did initially and whip out the power tools, i.e. a dedicated search system like Solr or ElasticSearch. These guys have notions of fuzziness built right in (Solr more automatically than ES). But they’re designed for full-text indexing of documents (e.g. full web pages) and they’re rather complex to set up and administer. ES has been enough of a hassle to keep running smoothly that I took the time to see if I could push the search workload to Postgres, and hence this article.
Unless you need to do something real fancy, it’s probably overkill to use them for just matching names.
Instead, we’re going to follow Starr Horne’s advice and use a Postgres EXTENSION that lets us build fuzziness into our query in a fast and fairly simple way. Specifically, we’re going to use an extension called pg_trgm (i.e. “Postgres Trigram”) which gives Postgres a “similarity” function that can evaluate how many three-character subsequences (i.e. “trigrams”) two strings share. This is actually a pretty good metric for fuzzy matching short strings like names.
To use pg_trgm, you’ll need to install the “Postgres Contrib” package. On ubuntu:
sudo apt-get install postgres-contrib
**WARNING: THIS WILL TRY TO RESTART YOUR DATABASE**
then pop open psql and install pg_trgm (NB: this only works on Postgres 9.1+; Google for the instructions if you’re on a lower version.)
psql
CREATE EXTENSION pg_trgm;
dx # to check it's installed
Now you can do
SELECT *
FROM types_and_labels_view
WHERE label % 'Mountain Goats'
ORDER BY similarity(label, 'Mountain Goats')
DESC LIMIT 100;
And out will pop the 100 most similar names. This will still take a long time if your table is large, but we can improve that with a special type of index provided by pg_trgm:
CREATE INDEX labels_trigram_index ON types_and_labels_table USING gist (label gist_trgm_ops);
or
CREATE INDEX labels_trigram_index ON types_and_labels_table USING gin (label gin_trgm_ops);
(GIN is slower than GIST to build, but answers queries faster.
That’ll take a while to build (possibly quite a while), but once it does you should be able to fuzzy search with ease and speed. If you’re using Django, you will have to drop into writing SQL to use this (until someone, maybe you, writes a Django extension to do this in the ORM.)
And as a frustrating finishing note, my attempt to implement this on LinerNotes was not ultimately succesful. It seems that that index query performance is at least O(n) and with 50 million entities in my database queries take at least 10 seconds. I’ve read that performance is great up to about 100k records then drops off sharply from there. There are some apparently additional options for improving query performance, but I’ll be sticking with ElasticSearch for now.
[1] Sorry, Googlebot! Not sorry, Bingbot.