Serving autocomplete suggestions fast!

AutocompleteAutocomplete (also known as live suggestions or search suggestions) is very popular with Search applications. It is generally used to return either query suggestion (à la Google Autocomplete) or to propose existing search results (à la Facebook).

Open source search platforms like Solr and Elasticsearch support this feature. Both allow for implementing autocomplete using edge n-grams. N-grams are subsets of words. Edge n-grams are subsets from one edge of a word (generally the beginning). For example, the edge n-grams of the word search are s, se, sea, sear, searc and search. The general principle when using them to support autocomplete is to index all those n-grams in the search index with the original word stored as is. When the user starts typing a word, we send what is typed so far as a query to the index containing the n-grams. For example, if we send the query sea, the index should return the word search as sea is an n-gram of this stored word.

This technique is very flexible as it allows for easy implementation of interesting functionalities, such as fuzzy search and infix matching. You can already find detailed instructions on how to implement the edge n-grams technique around the Web. For example, you can read this popular tutorial by Jay Hill to implement it in Solr, and this one by Jon Tai for Elasticsearch.

The best we can do?

The main drawback of this technique is that it is using an index to hold the n-grams. A Lucene index is not ideal for this task as it will have to look for a lot of terms before retrieving the n-grams. As the index grows bigger (all those n-grams add up!), we might experience performance issues. Users expect autocomplete to be fast!

A better approach is to have a dedicated and optimized structure to provide those autocomplete suggestions from a given input. This is what Lucene provides in the suggest module. The Lookup abstract class is a simple one that has a lookup method to return suggestions from an input. In this module, Lucene provides several classes extending Lookup. Those classes use one of two structures to hold the suggestions: a ternary search tree (TST) or a finite state automata (FST). But all those classes share an interesting characteristic: the structure is held entirely in memory, making the lookup very fast!

To build the structure, we have to set a source for suggestions. In general, this will either be a simple text file or the indexed terms of a field from a Lucene index. We then call the build method to clear the in-memory structure and populate it with all the items from the source. One shortcoming of this design is that there is no way to add a new item in the structure without rebuilding it completely. But there is a mechanism to store the in-memory structure to disk for fast reloading.

As said previously, there are several Lookup classes available using either a TST or a FST structure. One of these classes is very interesting: AnalyzingSuggester. This one uses an FST structure. Unlike the other Lookup classes, which only store the suggestions as is, AnalyzingSuggester allows us to use an analyzer to modify the suggestions before they are stored in the structure (the un-analyzed form is stored as well). The same analyzer is used when doing a lookup in the structure. Also, it returns suggestions in their original (un-analyzed) form. This gives us a lot of flexibility, similar to what we have when storing the suggestions in an index, but with potentially better performance. You can read more about AnalyzingSuggester in this post that describes some interesting usage.

Since AnalyzingSuggester was created, FuzzySuggester was added, which extends AnalyzingSuggester and adds the possibility of doing fuzzy matching of suggestions.

But most of the time, we don’t deal directly with Lucene classes. Let’s examine how these classes are used with Solr and Elasticsearch.

Solr

To use those Lookup classes for autocompletion, Solr offers the Suggester component. The Suggester is based on the SpellCheckComponent, but the Suggester can be used for more than spellchecking. It allows us to specify a Lookup class to provide the suggestions. Let’s see how to use the Suggester with the Lookup class AnalyzingSuggester.

First, we must configure a search component to use the Suggester:


 
  suggest
  org.apache.solr.spelling.suggest.Suggester

  
  
   org.apache.solr.spelling.suggest.fst.AnalyzingLookupFactory
  

  
  lowercase

  
  dict_ac

  
  string

  
  /path/to/my/dict.txt
 

Here, we defined a Suggester component to use AnalyzingSuggester. Notice that we don’t directly specify AnalyzingSuggester, but instead a factory that Solr offers to instantiate it. In the early days of the Suggester, we could directly specify some Lookup classes, but for the latest ones, like AnalyzingSuggester, we must use the factory.

Because we are using AnalyzingSuggester, we must specify an analyzer that will be used when storing and doing lookup in the structure. We can define this analyzer using suggestAnalyzerFieldType. In my case, I use the lowercase analyzer that simply applies lowercasing for each suggestion. But as explained, AnalyzingSuggester is really interesting when used with analyzers tailored to provide the best autocomplete experience to your users.

The storeDir setting allows you to specify a directory (within the data directory of the index), where the Lookup structure will be persisted every time it is built using spellcheck.build=true. When we reload the Solr index, the structure can be loaded back in memory from there. This is not mandatory. If we don’t use this setting, we will have to manually build the structure every time we reload the index (still with spellcheck.build=true).

As we will see later, we use spellcheck.q when querying Solr for suggestions. Whatever Lookup class is being used, the Suggester component will apply the analyzer defined for fieldType to the input provided before passing it to the Lookup structure. When using a file-based dictionary (as we are in this example), if fieldType is not set, Solr will default to a whitespace delimiter, which might not be what you want. For example, if we send the input new y, the default behavior would be to split the input into the tokens new and y and send each of those separately to the Lookup structure, thus returning suggestions for both tokens. This might not be what you expect. If you would rather have suggestions for the whole input, you must configure the Suggester to use the analyser of a field that does not split it, as we did in the example by setting the string field type.

In the example, we use a file-based dictionary. The format is really simple. Each line of the file should contain a suggestion and a weight for this suggestion (separated by a tab character). The weight will be used by the Lookup classes to sort the suggestions that match the input. Note that not all the Lookup classes can use the weight, but AnalyzingSuggester can.

Building from an index field

Instead of a file-based dictionary, the source of the suggestions can also be a field of the Solr index. Here is another example using a field:


 
  suggest
  org.apache.solr.spelling.suggest.Suggester

  
  
   org.apache.solr.spelling.suggest.fst.AnalyzingLookupFactory
  

  
  lowercase

  
  dict_ac

  
  string

  
  suggestion

  
  0.005
  true
 

Notice now that we don’t use sourceLocation anymore. Instead we use field. The source of suggestions will thus be all the indexed values of this field. We still specified fieldType to indicate how to parse the input. But this time, if we had not set this value, the default analyzer to split the input would have been the one from field (specifically, its query analyzer). This differs from the whitespace splitting that is applied by default when using the file-based dictionary.

Also, be aware of the analyzer that you use to index the values. When using AnalyzingSuggester, it might be best to set a field that does not do any analyzing on the values (like string). Between the analyzer used to parse the input, the analyzer of the field and the analyzer used by AnalyzingSuggester, things can get confusing pretty quickly!

With a field-based source, we have two other settings to consider. First, threshold, according to the Solr wiki, represents “the minimum fraction of documents (of the total) where a term should appear, in order to be added to the lookup dictionary.” This is to prevent infrequent terms from being added to the list of suggestions. Set it to 0 to keep them all. This setting is not used with a file-based dictionary because it is expected that the list only contains suggestions that we want, which makes sense.

Finally, buildOnCommit indicates that we want to rebuild the Lookup structure each time there is a commit. This will add the newer terms from the field into the list of suggestions at each commit. In the case of the file-based dictionary, we did not set this value because it makes no sense to rebuild it at each commit since the suggestions from the file don’t change.

Now that we have configured the search component, we can configure the request handler that will be used to serve the suggestions. Here is how:


 
  explicit
  true
  suggest
  true
  5
 
 
  suggest
 

This will activate the Suggester for the /suggest handler. The handler will return at maximum 5 suggestions. They will be sorted by “popularity.” In the case of a file-based dictionary, the popularity will be the weight of the suggestions. For a field-based dictionary, it is the frequency of the terms that will make them more popular.

We can now query our handler to get suggestions. For example, using this query:

http://solrserver:8983/solr/myindex/suggest?spellcheck.q=new+y

You will then receive suggestions to complete this query, for example, “new york” or “new york times”, depending on what you have in your list of suggestions.

Elasticsearch

Recently, Elasticsearch added a new feature called completion suggester. This new feature uses AnalyzingSuggester to provide in-memory suggestions lookup like Solr’s Suggester. But Elasticsearch’s completion suggester uses an interesting approach to persist the suggestions. Instead of storing them only in memory, they are also stored directly in an index using a custom posting format. The structure is built at the same time the index is built and each Lucene segment will hold part of it, making it much easier to scale across multiple servers. The in-memory structure is automatically refreshed during indexing, so it’s always up to date with the latest suggestions. You can use your main index to store the suggestions or use a dedicated one.

Elasticsearch’s completion suggester supports fuzzy search, weights and the possibility of analyzing the suggestions (since it uses AnalyzingSuggester), but it also adds the possibility to have multiple inputs for the same suggestion and support for payloads. For some examples on how to use it, refer to the documentation. This feature is pretty new, so expect some changes happening in the near future.

Conclusion

In-memory structures allow for fast autocomplete implementation! Even if the edge n-grams technique is good enough in many cases, depending on multiple of factors (the number of suggestions, how often new suggestions are added, server’s memory, etc.), using an in-memory structure might be better. As always, test with your data and your environment before making a decision!