Modern substring search

Nowadays many applications need a good search functionality. They manage large amounts of content in sometimes complex structures so looking for it manually quickly becomes unfeasible and annoying.

ElasticSearch is a powerful tool for implementing a fast and scalable search functionality for your applications. Many useful features like scoring and prefix search are available out-of-the-box.

One often requested feature needs a bit of thought and special implementation: A fulltext search for substrings.

Wildcard search

An easy way is to use an wildcard query. It allows using wildcard characters like * and ? but is not recommended due to low performance, especially if you start you search patterns with wildcards. For the sake of completeness I mention the link to the official documentation here.

Aside from performance it requires using the wildcard characters, either by the user or your code and perhaps needs to be combined with other queries like the match or term queries. Therefore I do not advise usage of wildcard queries.

Using n-grams for indexing

The trick here is to break up the tokens in your texts into even smaller parts – called n-grams – for indexing only. A word like “search” would be split into the following terms using 3-grams: sea, ear, arc, rch.

So if the user searches for “ear” a document/field containing “search” will be matched. You can configure the analyzer to use for individual fields an the minimum and maximum length of the n-grams to work best for your requirements.

The trick here is to use the n-gram analyzer only for indexing and not for searching because that would also break up the search term and lead to many false positives.

See this example configuration using the C# ElasticSearch API NEST:

var client = new ElasticClient(settings);
var response = client.Indices.Create("device-index", creator => creator
  .Settings(s => s
		.Setting("index.max_ngram_diff", 10)
		.Analysis(analysis => analysis
			.Analyzers(analyzers => analyzers
				.Custom("ngram_analyzer", analyzerDescriptor => analyzerDescriptor
					.Tokenizer("ngram_tokenizer")
					.Filters("lowercase")
				)
			)
			.Tokenizers(tokenizers => tokenizers
				.NGram("ngram_tokenizer", ngram => ngram
					.MinGram(3)
					.MaxGram(10)
				)
			)
		)
	)
	.Map<SearchableDevice>(device => device
		.AutoMap()
		.Properties(props => props
			.Text(t => t
				.Name(n => n.SerialNumber)
				.Analyzer("ngram_analyzer")
				.SearchAnalyzer("standard")
			)
			.Text(t => t
				.Name(n => n.InventoryNumber)
				.Analyzer("ngram_analyzer")
				.SearchAnalyzer("standard")
			)
			.Text(t => t
				.Name(n => n.Model)
				.Analyzer("ngram_analyzer")
				.SearchAnalyzer("standard")
			)
		)
	)
));

Using the wildcard field

Starting with ElasticSearch 7.9 there is a new field type called “wildcard”. Usage is in general straight forward: You simply exchange the field type “text” or “keyword” with this new type “wildcard”. ElasticSearch essentially uses n-grams in combination with a so called “binary doc value” to provide seemless performant substring search. See this official blog post for details and guidance when to prefer wildcard over the traditional field types.

Conclusion

Generally, search is hard. In the old days many may have used SQL like queries with wildcards etc. to implement search. With Lucene and ElasticSearch modern, highly scalable and performant indexing and search solutions are available for developers. Unfortunately, this great power comes with a bunch of pitfalls where you have to adapt your solution to fit you use-case.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.