Improving search for special content

Nowadays many applications are so complex or handle so much data that users need and expect fast and powerful search functionality. Popular search engines you can leverage are ElasticSearch and Solr. Both use Lucene for index management under the hood and provide similar functionality for indexing and text search.

In previous posts I already gave some advice on using and improving search in your applications with these engines, for example a how-to for .NET (Core) with ElasticSearch and a post on using n-grams or wildcard fields for substring search.

Even though these frameworks work really well, implementing great search functionality can be quite hard, especially when handling special data like DOIs, IP addresses, hostnames and other text containing special characters.

What’s the deal with special characters?

The standard analyzers are tuned for natural-language text. They split text into words on punctuation, whitespace and certain special characters, and these separators are usually filtered out and not indexed. So if you index something like my-cool-hostname, neither the dashes nor the complete string will land in the index, leaving only the separate parts my, cool and hostname.

The problem with that is that neither an exact match nor a substring like my-cool will yield any results. That is not what your users will expect…
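To see what an analyzer actually puts into the index, you can ask ElasticSearch directly through its analyze API. The following is a minimal sketch using NEST; the class and method names are made up for illustration, and it assumes an already configured IElasticClient and a reachable cluster.

```csharp
using System.Linq;
using Nest;

public static class AnalyzerProbe
{
    // Returns the terms that the named built-in analyzer produces for the text.
    // "AnalyzerProbe" and "Tokens" are hypothetical names for this sketch.
    public static string[] Tokens(IElasticClient client, string analyzer, string text)
    {
        var response = client.Indices.Analyze(a => a
            .Analyzer(analyzer)
            .Text(text)
        );
        return response.Tokens.Select(t => t.Token).ToArray();
    }
}
```

Calling Tokens(client, "standard", "my-cool-hostname") should show the separate parts described above, while the built-in "keyword" analyzer keeps the input as one single term.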

Making your search work with special data

There are several ways to improve your search functionality for fields or texts containing different kinds of non-language strings. Here are some simple options that improve or fix the handling of problematic cases and make your search work the way your users expect.

My examples use ElasticSearch features and wording, so they may be called differently in other search engines.

Using a keyword field

If you only need an exact match on some unusual data field, such as a DOI like 10.1000/182 that contains punctuation and slashes and that users normally just copy and paste from somewhere, using a keyword field for indexing instead of the text type might be the better option. Usually this is easy to implement and fast for indexing and searching.
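A keyword mapping and the matching exact lookup could look roughly like the following NEST sketch. The Publication class, the index name "publications" and the helper names are assumptions made up for this example.

```csharp
using Nest;

// Hypothetical document type, used only for this sketch.
public class Publication
{
    public string Id { get; set; }
    public string Doi { get; set; }
    public string Title { get; set; }
}

public static class DoiSearch
{
    // Maps Doi as keyword so the value is indexed as one unmodified term.
    public static CreateIndexResponse CreateIndex(IElasticClient client) =>
        client.Indices.Create("publications", c => c
            .Map<Publication>(m => m
                .AutoMap()
                .Properties(p => p
                    .Keyword(k => k.Name(n => n.Doi))
                )
            )
        );

    // A term query is not analyzed, so it only matches the exact stored value.
    public static ISearchResponse<Publication> FindByDoi(IElasticClient client, string doi) =>
        client.Search<Publication>(s => s
            .Index("publications")
            .Query(q => q
                .Term(t => t.Field(f => f.Doi).Value(doi))
            )
        );
}
```

Keep in mind that keyword values are compared as-is, so upper and lower case matter unless you normalize the values yourself.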

Using multiple index fields for one data field

ElasticSearch also offers the possibility to index the same data using multiple index fields. That way you can keep substring features while adding exact-match or special-character support through additional index fields, like the keyword field above or an index field using a different analyzer (see the next section). ElasticSearch calls this multi-fields. Multi-fields can also be used to improve sorting and the scoring of matches.
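As a rough sketch, a multi-field mapping in NEST might look like this. The Host class, the index name "hosts" and the sub-field name "raw" are assumptions chosen for illustration.

```csharp
using Nest;

// Hypothetical document type, used only for this sketch.
public class Host
{
    public string Id { get; set; }
    public string Hostname { get; set; }
}

public static class HostIndex
{
    public static CreateIndexResponse Create(IElasticClient client) =>
        client.Indices.Create("hosts", c => c
            .Map<Host>(m => m
                .Properties(p => p
                    // Main field: analyzed text for word-based matching.
                    .Text(t => t
                        .Name(n => n.Hostname)
                        // Sub-field "hostname.raw": the untouched value,
                        // usable for exact matches and sorting.
                        .Fields(f => f
                            .Keyword(k => k.Name("raw"))
                        )
                    )
                )
            )
        );

    // Exact lookup against the keyword sub-field.
    public static ISearchResponse<Host> FindExact(IElasticClient client, string hostname) =>
        client.Search<Host>(s => s
            .Index("hosts")
            .Query(q => q
                .Term(t => t
                    .Field(f => f.Hostname.Suffix("raw"))
                    .Value(hostname)
                )
            )
        );
}
```

The data is sent only once; ElasticSearch indexes it in both ways, and each query decides which of the two fields it targets.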

Using different analyzers

As I mentioned before, the standard analyzers for text fields use tokenization rules and character filters that are useful for natural language. Sometimes you want to keep words together or preserve special characters. To implement an appropriate search for hostnames or IP addresses you could, for example, use a custom analyzer with a whitespace or pattern tokenizer.
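A custom analyzer based on the whitespace tokenizer could be set up along these lines. This is only a sketch: the NetworkNode class, the index name and the analyzer name "hostname_analyzer" are made up for the example.

```csharp
using Nest;

// Hypothetical document type, used only for this sketch.
public class NetworkNode
{
    public string Id { get; set; }
    public string Hostname { get; set; }
}

public static class NetworkNodeIndex
{
    public static CreateIndexResponse Create(IElasticClient client) =>
        client.Indices.Create("network-nodes", c => c
            .Settings(s => s
                .Analysis(a => a
                    .Analyzers(an => an
                        // Splits on whitespace only, so dashes and dots stay
                        // inside the token; the lowercase filter makes
                        // matching case-insensitive.
                        .Custom("hostname_analyzer", ca => ca
                            .Tokenizer("whitespace")
                            .Filters("lowercase")
                        )
                    )
                )
            )
            .Map<NetworkNode>(m => m
                .Properties(p => p
                    .Text(t => t
                        .Name(n => n.Hostname)
                        .Analyzer("hostname_analyzer")
                    )
                )
            )
        );
}
```

With this mapping a value like my-cool-hostname is indexed as one term, so a match query for the complete hostname finds it. Substring search still needs one of the other techniques.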

Conclusion

Default full-text search works great out of the box in many cases. However, there are many kinds of special, structured data where you need to fine-tune the way the index gets populated.

Many approaches can be combined by using different analyzers and indexing the same data in several ways.

There is a lot you can do to provide awesome search capabilities to your users, but that requires quite some knowledge of the way the search engines work and of the data you want to be searchable.

Modern substring search

Nowadays many applications need good search functionality. They manage large amounts of content in sometimes complex structures, so looking for it manually quickly becomes infeasible and annoying.

ElasticSearch is a powerful tool for implementing a fast and scalable search functionality for your applications. Many useful features like scoring and prefix search are available out-of-the-box.

One often requested feature needs a bit of thought and special implementation: a full-text search for substrings.

Wildcard search

An easy way is to use a wildcard query. It allows using wildcard characters like * and ? but is not recommended due to low performance, especially if you start your search patterns with wildcards. I mention it here for the sake of completeness; see the official documentation for details.

Aside from performance, it requires wildcard characters to be added, either by the user or by your code, and it may need to be combined with other queries like the match or term queries. Therefore I do not advise using wildcard queries.
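For completeness, this is roughly what such a query looks like in NEST. The stand-in SearchableDevice class (its real shape is not shown in this post) and the helper name are assumptions for this sketch; the index name matches the configuration example further down.

```csharp
using Nest;

// Minimal stand-in for the SearchableDevice class; its shape is an assumption.
public class SearchableDevice
{
    public string Id { get; set; }
    public string SerialNumber { get; set; }
}

public static class WildcardQueryExample
{
    public static ISearchResponse<SearchableDevice> FindBySerialPart(
        IElasticClient client, string part) =>
        client.Search<SearchableDevice>(s => s
            .Index("device-index")
            .Query(q => q
                .Wildcard(w => w
                    .Field(f => f.SerialNumber)
                    // The code has to add the wildcards itself. The leading *
                    // is the slow case mentioned above.
                    .Value($"*{part}*")
                )
            )
        );
}
```

Note that any * or ? typed by the user ends up in the pattern as a wildcard as well.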

Using n-grams for indexing

The trick here is to break up the tokens in your texts into even smaller parts – called n-grams – for indexing only. A word like “search” would be split into the following terms using 3-grams: sea, ear, arc, rch.

So if the user searches for “ear”, a document/field containing “search” will be matched. You can configure the analyzer to use for individual fields and the minimum and maximum length of the n-grams to work best for your requirements.

It is important to use the n-gram analyzer only for indexing and not for searching, because that would also break up the search term and lead to many false positives.
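To get a feeling for what ends up in the index, here is a small stand-alone sketch that generates n-grams in plain C#. It only illustrates the idea and is not the tokenizer implementation used by Lucene; the class and method names are made up.

```csharp
using System.Collections.Generic;

public static class NGrams
{
    // Returns all substrings of the token with a length between minGram and
    // maxGram, ordered by their start position.
    public static List<string> Of(string token, int minGram, int maxGram)
    {
        var grams = new List<string>();
        for (var start = 0; start < token.Length; start++)
        {
            for (var length = minGram;
                 length <= maxGram && start + length <= token.Length;
                 length++)
            {
                grams.Add(token.Substring(start, length));
            }
        }
        return grams;
    }
}
```

NGrams.Of("search", 3, 3) yields sea, ear, arc and rch. It also shows why the search term should not be split: NGrams.Of("seat", 3, 3) yields sea and eat, and the term sea alone would already match a document containing “search”.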

See this example configuration using the C# ElasticSearch API NEST:

// requires: using Nest;
// "settings" is a ConnectionSettings instance pointing to your cluster
var client = new ElasticClient(settings);
var response = client.Indices.Create("device-index", creator => creator
    .Settings(s => s
        // the default allows a difference of only 1 between min and max gram
        .Setting("index.max_ngram_diff", 10)
        .Analysis(analysis => analysis
            .Analyzers(analyzers => analyzers
                .Custom("ngram_analyzer", analyzerDescriptor => analyzerDescriptor
                    .Tokenizer("ngram_tokenizer")
                    .Filters("lowercase")
                )
            )
            .Tokenizers(tokenizers => tokenizers
                .NGram("ngram_tokenizer", ngram => ngram
                    .MinGram(3)
                    .MaxGram(10)
                )
            )
        )
    )
    .Map<SearchableDevice>(device => device
        .AutoMap()
        .Properties(props => props
            // n-grams are used for indexing only; the search term stays whole
            .Text(t => t
                .Name(n => n.SerialNumber)
                .Analyzer("ngram_analyzer")
                .SearchAnalyzer("standard")
            )
            .Text(t => t
                .Name(n => n.InventoryNumber)
                .Analyzer("ngram_analyzer")
                .SearchAnalyzer("standard")
            )
            .Text(t => t
                .Name(n => n.Model)
                .Analyzer("ngram_analyzer")
                .SearchAnalyzer("standard")
            )
        )
    )
);

Using the wildcard field

Starting with ElasticSearch 7.9 there is a new field type called “wildcard”. Usage is generally straightforward: you simply replace the field type “text” or “keyword” with the new type “wildcard”. ElasticSearch essentially uses n-grams in combination with a so-called “binary doc value” to provide seamless, performant substring search. See the official blog post for details and guidance on when to prefer wildcard over the traditional field types.
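In NEST the switch could look roughly like the following sketch. It assumes a NEST version that already knows the wildcard field type (7.9 or later) and uses a minimal stand-in for the SearchableDevice class, whose real shape is not shown in this post; the helper names are made up.

```csharp
using Nest;

// Minimal stand-in for the SearchableDevice class; its shape is an assumption.
public class SearchableDevice
{
    public string Id { get; set; }
    public string SerialNumber { get; set; }
}

public static class WildcardFieldExample
{
    // Maps SerialNumber with the wildcard field type instead of text/keyword.
    public static CreateIndexResponse CreateIndex(IElasticClient client) =>
        client.Indices.Create("device-index", c => c
            .Map<SearchableDevice>(m => m
                .AutoMap()
                .Properties(p => p
                    .Wildcard(w => w.Name(n => n.SerialNumber))
                )
            )
        );

    // Wildcard fields are queried with the regular wildcard query.
    public static ISearchResponse<SearchableDevice> FindBySerialPart(
        IElasticClient client, string part) =>
        client.Search<SearchableDevice>(s => s
            .Index("device-index")
            .Query(q => q
                .Wildcard(w => w
                    .Field(f => f.SerialNumber)
                    .Value($"*{part}*")
                )
            )
        );
}
```

Compared to the n-gram setup, no custom analyzer or tokenizer settings are needed here.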

Conclusion

Generally, search is hard. In the old days many may have used SQL LIKE queries with wildcards etc. to implement search. With Lucene and ElasticSearch, modern, highly scalable and performant indexing and search solutions are available to developers. Unfortunately, this great power comes with a bunch of pitfalls, and you have to adapt your solution to fit your use case.

Using (elastic)search with .NET Core

Many modern applications require powerful search mechanisms to become useful and make their users more productive. That is in large part due to the amount of data available to work with. Thankfully there are already powerful tools to index your data and make it searchable.

One of the most well-known state-of-the-art solutions is ElasticSearch, and it has an API for .NET called NEST. While the documentation is OK, I want to give a quick rundown on how to add search capabilities to your .NET Core application. Some ideas are borrowed from the great post “Using Elasticsearch with ASP.NET Core and Docker”.

Getting ElasticSearch running

The easiest way to get up and running with ElasticSearch is to use their docker images and just run the container on your development machine. I like to use a docker compose file like the following to get ElasticSearch and its tooling application Kibana up and running fast:

version: '3.8'
services:
    elasticsearch:
        image: docker.elastic.co/elasticsearch/elasticsearch:7.8.0
        container_name: elastic
        environment:
            - node.name=elastic
            - cluster.initial_master_nodes=elastic
        ports:
            - "9200:9200"
            - "9300:9300"
        volumes:
            - type: bind
              source: ./esdata
              target: /usr/share/elasticsearch/data
        networks:
            - esnetwork
    kibana:
        image: docker.elastic.co/kibana/kibana:7.8.0
        ports:
            - "5601:5601"
        networks:
            - esnetwork
        depends_on:
            - elasticsearch
volumes:
    esdata:
networks:
    esnetwork:
        driver: bridge

After you run it with docker compose you can talk to the search service on http://localhost:9200/ and to the Kibana management GUI on http://localhost:5601/. In the Kibana UI, the Dev Tools and their console are especially interesting for experimenting with search queries.

Access ElasticSearch from your .NET Core app

I find it quite elegant to write an extension method for the IServiceCollection that configures the ElasticClient and registers it as a singleton with the dependency injection framework of .NET Core, like so:

    public static class ElasticSearchExtension
    {
        public static void AddElasticsearch(
            this IServiceCollection services, IConfiguration configuration)
        {
            var url = configuration["elasticsearch:url"];
            var settings = new ConnectionSettings(new Uri(url))
                    .DefaultMappingFor<SearchableDevice>(deviceMapping => deviceMapping
                        .IndexName("devices")
                        .IdProperty(dev => dev.Id)
                    )
                ;
            var client = new ElasticClient(settings);
            var response = client.Indices.Create("devices", creator => creator
                .Map<SearchableDevice>(device => device
                    .AutoMap()
                )
            );
            // maybe check response to be safe...
            services.AddSingleton<IElasticClient>(client);
        }
    }

The configuration block looks like the following

  "ElasticSearch": {
    "url": "http://localhost:9200/"
  },

and allows for future extension.
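The “maybe check response” comment in the extension method deserves a bit more attention: creating an index that already exists fails, and by default NEST reports that through the response instead of throwing. A possible way to handle this is sketched below; the helper name is made up, and throwing on failure is just one option, logging may suit your application better.

```csharp
using System;
using Nest;

public static class ElasticIndexSetup
{
    // Creates the index with an auto-generated mapping for T, but only if it
    // does not exist yet. "EnsureIndex" is a hypothetical helper name.
    public static void EnsureIndex<T>(IElasticClient client, string indexName)
        where T : class
    {
        if (client.Indices.Exists(indexName).Exists)
        {
            return; // nothing to do, a second Create call would fail
        }

        var response = client.Indices.Create(indexName, creator => creator
            .Map<T>(m => m.AutoMap())
        );

        if (!response.IsValid)
        {
            // DebugInformation describes the request, the response and the error.
            throw new InvalidOperationException(
                $"Could not create index '{indexName}': {response.DebugInformation}");
        }
    }
}
```

In the extension method this could replace the direct Create call, for example as ElasticIndexSetup.EnsureIndex<SearchableDevice>(client, "devices").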

Our search service has to be registered in Startup of course:

public void ConfigureServices(IServiceCollection services)
{
    ...
    services.AddElasticsearch(Configuration);
}

Indexing our objects

In the above example we used a SearchableDevice class whose public properties are going to be indexed by ElasticSearch. The API allows much more fine-grained control over what and how to index, but we want to keep things simple without having to worry about excluding properties and so on. If you have set it up that way, indexing a SearchableDevice is merely one simple call:

// SearchClient is the injected IElasticClient
// mySearchableDevice is an instance of SearchableDevice
SearchClient.IndexDocument(mySearchableDevice);
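If you have many objects to index, sending them in one bulk request is usually the better choice. The following sketch uses the IndexMany helper of NEST and reports failures on the console; the class and method names are made up, and it relies on the default index mapping configured for the document type as shown above.

```csharp
using System;
using System.Collections.Generic;
using Nest;

public static class BulkIndexer
{
    // Indexes all documents with one bulk request and reports failures.
    public static bool IndexAll<T>(IElasticClient client, IEnumerable<T> documents)
        where T : class
    {
        var response = client.IndexMany(documents);

        // Individual documents can fail even if the request itself succeeded.
        foreach (var item in response.ItemsWithErrors)
        {
            Console.Error.WriteLine($"Failed to index {item.Id}: {item.Error?.Reason}");
        }

        // IsValid is also false if the whole request failed, e.g. no connection.
        return response.IsValid;
    }
}
```

The response of IndexDocument offers the same IsValid property, so single calls can be checked in the same way.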

Searching for objects

When developing a search query I like to try it in the Kibana Dev Tools and then transform it into a NEST call. A simple query that checks whether any device property starts with “needle” looks like this:

{
  "query": {
    "multi_match": {
      "type": "phrase_prefix", 
      "fields": ["*"],
      "query": "needle"
    }
  }
}

The nice thing about the Kibana Dev Tools console is that it can display and complete the possible values for fields like “type” in your multi_match query.

That search query can then be translated to a NEST call in a straightforward way:

var response = SearchClient.Search<SearchableDevice>(sd => sd
    .Query(q => q
        .MultiMatch(query => query
            .Type(TextQueryType.PhrasePrefix)
            .Fields("*")
            .Query("needle")
        )
    )
);

The search response contains the hits with their source objects and some metadata like the score and result count.

Bear in mind that ElasticSearch only returns the first/best 10 matches by default, so you will often want to specify the result size explicitly.
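Setting the result size and reading the hits could look like the following sketch. The helper name and the page size of 50 are arbitrary choices for illustration, and the query is the same phrase prefix query as above.

```csharp
using System;
using Nest;

public static class DeviceSearch
{
    // Runs the phrase prefix query with an explicit result size and prints
    // the hits. "PrintMatches" is a hypothetical helper name.
    public static void PrintMatches<T>(
        IElasticClient client, string needle, int pageSize = 50)
        where T : class
    {
        var response = client.Search<T>(s => s
            .From(0)          // offset of the first hit, useful for paging
            .Size(pageSize)   // overrides the default of 10 hits
            .Query(q => q
                .MultiMatch(mm => mm
                    .Type(TextQueryType.PhrasePrefix)
                    .Fields("*")
                    .Query(needle)
                )
            )
        );

        Console.WriteLine(
            $"{response.Total} matches in total, showing {response.Hits.Count}");
        foreach (var hit in response.Hits)
        {
            // hit.Source holds the deserialized document itself.
            Console.WriteLine($"{hit.Id} (score {hit.Score})");
        }
    }
}
```

Combining From and Size gives you simple paging through larger result sets.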

Wrapping it up

Getting started with ElasticSearch in .NET Core does not require too much boilerplate and setup work if you use tools like docker and the NEST library. Making it usable and tuning the indexing and querying may require a lot of work to achieve the best results. On the other hand, smaller applications can start off with a simple search setup as shown above and simply evolve it when need be.