Improving search for special content

Nowadays many applications are very complex or handle so much data that users need and expect a fast and powerful search functionality. Popular search engine you can leverage are ElasticSearch and Solr. They both use Lucene for index management under the hood and provide a similar functionality regarding indexing and text search.

In previous posts I already gave some advice on using and improving search in your applications using these engines. For example a how-to for .NET (core) with ElasticSearch and using n-grams or wildcard fields for substring search.

Even if these frameworks work really great implementing a great search functionality can be quite hard, especially when handling special data like DOIs, IP-Adresses, Hostnames and other text containing special characters.

What’s the deal with special characters?

The standard analyzers are tuned for dealing with natural language based texts. So they split words on punctuation, whitespace and certain special characters. These are usually filtered out and not indexed. So if you index something like my-cool-hostname the dashes and the complete string will not land in the index only leaving the separate parts my, cool and hostname.

The problem with that is, that neither an exact match nor a substring like my-cool will yield any results. That is not what your users will expect…

Making your search work with special data

There are several ways to improve your search functionality for fields or texts containing different kinds of non-language strings. Here are some simple options that will improve or fix handling problematic cases and make your search work as expected by your users.

My examples use ElasticSearch features and wording, so they may be called differently for other search engines.

Using a keyword field

If you only need an exact match on some weird data field like a DOI 10.1000/182 containing punctuation and slashes, where a user normally just copy & pastes the search string from somewhere using a keyword field for indexing instead of the text type might be the better option. Usually this is easy to implement and fast for indexing and searching.

Using multiple index fields for one data field

ElasticSearch also offers the possibility to index the same data using multiple index fields. So you can keep sub-string features while adding exact match or special character support by adding different index fields like keyword above or an index field using a different analyzer (see this option below). This is called multi-field in ElasticSearch. Using multi-fields can also be used to improve sorting and scoring matches.

Using different analyzers

As I mentioned before the standard analyzers for text fields use tokenization rules and character filters useful for natural language. Sometimes you want to keep words together or preserve special characters. To implement an appropriate search for hostnames or IP addresses you could for example use a custom analyzer with a whitespace or pattern tokenizer.

Conclusion

Default full text search works great out-of-the-box in many cases. However, there are many cases of special, structured data where you need to fine-tune the way the index gets populated.

Many approaches can be combined using different analyzers and indexing a data in several ways.

There is a lot you can do to provide awesome search capabilities to your users but that requires quite some knowledge of the way the search engines work and about the data you want to be searchable.

Many Algorithms Benefit From a Partition in Two Phases

When I program code that solves a specific problem, I often design the algorithm in a way that mirrors my approach to solving the problem in the real world. That’s not a bad idea – the resulting algorithm can be thought through in a straightforward manner. But it lacks in one area: The separation of specification and execution. And for a computer, separating these two things has immediate advantages.

Let me explain the concept on a minimal coding challenge. The original challenge can be found here (by the way, codewars.com, despite the militaristic theming, is an abundant source of fun coding exercises):

Write a function that takes in a string of one or more words, and returns the same string, but with all five or more letter words reversed

Stop gninnipS My sdroW!

If you don’t want to be spoiled for this specific kata, please don’t read any further. The solution I use to explain the concept isn’t very elegant, though:

public static String reverseLongWords(String sentence) {
String result = "";
for (String each : sentence.split(" ")) {
if (each.length() >= 5) {
result += new StringBuilder(each).reverse().toString();
} else {
result += each;
}
result += " ";
}
return result.trim();
}

Please don’t mind the use of simple string concatenation or the clumsy way of string reversal. My point arises in the last two lines. Because we collect our words, reversed or not, directly in the result, we have to awkwardly remove the unnecessary blank that we just appended from it.

One reason for this subsequent correction is the missing separation in the two phases (specification and execution). Instead, we determine what strings the result should contain and build the result in one step.

Let’s separate the two phases, first in theory and then in code. We know how a specification of a sentence can look like because we already use one in the for loop. If we want to specify the resulting sentence without really building it, we need to store the words without their separators. The result of our specification phase would be a list of words, reversed or not. The execution phase takes our specification and transforms it into the required result. In our case, we just put blanks between the words:

public static String reverseLongWords2(String sentence) {
// building the render model (specification phase)
List<String> renderModel = new ArrayList<String>();
for (String each : sentence.split(" ")) {
if (each.length() >= 5) {
renderModel.add(new StringBuilder(each).reverse().toString());
} else {
renderModel.add(each);
}
}

// rendering the model (execution phase)
return String.join(" ", renderModel);
}

The resulting code is very similar, with one crucial difference: The first phase doesn’t create the result, but a model of it. We can call this model a “render model” and the execution phase the “render stage”. It sounds a little bit excessive for such a small task, but this is really the heart of the idea. When you separate your algorithm into the two phases, you’ll get a render model between them.

This render model has some advantages: You can test it independently from the actual representation. You can add transformation steps more easily before you commit it to the target format. If you need to build the render model iteratively, it can provide helpful methods that would be missing in the target format.

Another advantage: Your execution/render phase is more independent from the previous work. Imagine that we would want our words comma separated, not blank separated. The first algorithms relies on a hidden dependency between the blank character and the trim() method. In the second algorithm, you only need to change the rendering part of the code. It can work independently from the previous logic.

This way to partition an algorithm varies from the straightforward way that humans use when we perform the task in the real world. We tend to keep at least a partial render model in our head alongside the rendered result. If we would do the word spinning ourselves, but with the commas, we would recognize or “just know” that we write the last word and not include the trailing comma. This “just knowing” is information from our mental render model.

In my experience, it pays off to refactor an algorithm into a variation that uses the two phases or design it like this from the start. Revealing the hidden dependencies in the logic is a beneficial influence on the defect rate. Making the rendering step independent promotes testability and evolvability. It seems more work at first, but in my case, that was just adaptation effort caused by my problem solving habits.

Naming table columns in user interface

A few days ago, I had a conversation with a customer regarding naming. They had created a file containing definitions for numerous tables and their corresponding column names for a new user interface. Some names consisted of entire sentences, with words like “from” and “to“.

I found myself dissatisfied with some of these names and thought about why I don’t like them and how to make them nicer.

Guidelines for simplifying names

To me, column names should resemble key points rather than complete sentences. So I’ve compiled a few guidelines that can help you in transforming sentences to key points.

  1. Eliminate filler words: Remove words like verbs and adjectives if they don’t carry relevant information.
  2. Remove articles.
  3. Remove words without additional information. For example, if the information is already included in another word.
  4. Remove information already included in the table name.
  5. Sometimes it makes sense to change the order of the remaining words.

An example

In our example, the table name is “CW 39” and the column name is “The only day of Monday of CW 39 before CW 40“.

1. Remove all filler words: 
The information about there being only one Monday per week is widely known and unnecessary here. Therefore, the new column name is:
The day of Monday of CW 39 before CW 40

2. Remove articles:
The article “The” can be removed. So the new column name is: 
Day of Monday CW 39 before CW 40

3. Remove words without additional information:
The information that it’s about a day is already part of the word ‘Monday’. It’s also obvious, that CW 39 comes before CW 40. So the new column name is: 
Monday CW 39

4. Remove information already included in the table name:
The table name “CW 39” already tells the reader that all columns contain information on this CW. So the new column name is:
Monday

It is much better to read, isn’t it?

Conclusion

After the reformatting, the table became significantly smaller. The names are easier to read, and the understanding is faster as there’s no need to decipher the true meaning within a lengthy sentence.

Thus, it’s a huge advantage to keep the names as short as possible without losing essential information.

JavaScript – some less known Gems

I guess we can all agree that the most fun part of anything related to JavaScript is reading foreign JavaScript code 🙂

But while most of any hardness in understanding foreign (especially older) code lies in the every-year-fluctuations in common style – I also can’t really tell why I have this urge to change each “function” to a “const” declaration nowadays – once in a while you stumble across some feature, that is just too arcane.

Which means that it’s at least worth knowing about – after that, you can decide for yourself whether these enter your active vocabulary.

So these are some of my recent findings. Feel free to add.

Labelled Loops

JavaScript allows you to label any statement with a unique identifier. This is probably not a surprise for any Svelte developer (the $:… syntax is exactly that), but it is most useful in loops, because you can break or continue an outer loop using this:

// "outer" is the label here

outer: for (...) {
  for (...) {
    ...
    if (weAreDone()) break outer;
  }
}

It might have some old-school “GOTO” vibes indeed, and one can argue that in most cases there might be a more concise solution right around the corner, but especially if you have a tricky lookup algorithm, this might come handy one day.

Comma Operator

While I wondered for some years who on earth actually uses the Comma operator in C/C++, just a few weeks ago I found out that JavaScript actually has the same thing.

It allows you to execute some expression, ignoring it’s return value and directly execute the next statement in that same expression.

// (expr1, expr2) does evaluate expr1 and expr2 and return expr2.

const a = (b = 5, 3);
// a is now 3, but b = 5;

let a, b;
for (a = 0, b = 0; a + b < 3; a++, b++) { console.log(a, b) }
// output:
// 0 0
// 1 1

However, I would not advise using this thing ever. I mean, if you have some complicated expression where you decide just to do some step before evaluating some crucial other step – maybe it’s time to refactor the whole method.

void Operator

The void operator is like a special case of the comma operator in that it evaluates something and then returns undefined.

const a = calculateStuff(); // a might be whatever
const b = void calcaluteStuff(); // b is undefined

const c = (calculateStuff(), undefined); // c is identical to b, see above

This is e.g. an expression that I found while having to read some minified React code. So it might be of a certain use if you want to minify the number of letters in your code, but a more readable way would be just defining a function evaluating calculateStuff(), then returning without a return value.

However, the MDN web docs (referenced again here) give some real use cases of that operator, so if you are into that cryptic knowledge, go right ahead.

Bitwise NOT as a “found in List” check

Recently, I was weirded out by having to look at a list inclusion check in the likes of:

const list = ["a", "b"];
if (~list.indexOf("a")) {
  alert("found");
}

And this piece of code will actualy reach the alert, while it might leave you in wonders what the “~” is doing here. Unless you are used to very low-level bit arithmetic, in which case you’d directly recognize the bitwise NOT. It’s the operator that inverts every bit of a binary representation of a number to it’s opposite.

And the whole magic here lies in that for normal numbers (or BigInt), this is mathematically the same as

~a = -a - 1;

especially:
~-1 = 0
~0 = -1

and that for JavaScript, 0 is a falsy value while any other number is a truthy one. So this code above is used just to explicitly distinguish whether indexOf() returned “-1”, which it does if the object in question was not found.

So there you have it, but I’d rather use the way more readable

if (list.includes("a")) {...}
Bonus: console.log() styling

Now this does not really fit to the other operations above, but nevertheless I hear of people who did not know that before, so I’ll just drop it.

If there is any argument of console.log() starting with “%c”, it will take the next argument as a styling instruction instead of normally printing it. That is, the next argument needs to be a valid string that could also appear in a HTML stlye=”…” attribute, as

console.log("%cSo Big!", "font-size: 100pt; color: magenta");

Now, considering that console.log() is really only the most rudimentary way to output some statements for debugging (and one usually neglects the other ones like console.time(), console.timeEnd(), console.table()), this is not the next biggest thing that your imaginary Crypto Blockchain AI SaaS startup just needed, but it’s neverless good to know if you need some distinction in your logs.

Conclusion: Do whatever you like with that knowledge

While there are many things that one might love or hate about JavaScript – e.g. you might like the boolean-coercion via !! or you might hate the ??= or you might, with a mission, peddle generation expressions with function*/yield to your team – or or or… – there are always some more things that even after some years just look weird to my trained eye.

I guess it’s not completely wrong to thing that such expressions are occult for a specific reason in that there are only a handful of legit use cases for these, and forcing occult expression in a code that exceeds script size might cause some serious pain in the future.

But nevertheless, knowledge is power, so have a nice powerful evening.