String Representation and Comparisons

Strings are a fundamental data type in programming, and their internal representation has a significant impact on performance, memory usage, and the behavior of comparisons. This article delves into the representation of strings in different programming languages and explains the mechanics of string comparison.

String Representation

In many programming languages, such as Java and Python, strings are immutable. Because an immutable string can safely be shared, runtimes can use techniques like string pools to optimize string handling. Let’s explore this concept further.

String Pool

A string pool is a memory management technique that reduces redundancy and saves memory by reusing immutable string instances. Java is a well-known language that employs a string pool for string literals.

In Java, string literals are automatically “interned” and stored in a string pool managed by the JVM. When a string literal is created, the JVM checks the pool for an existing equivalent string:

  • If found, the existing reference is reused.
  • If not, a new string is added to the pool.

This ensures that identical string literals share the same memory location, reducing memory usage and enhancing performance.
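The lookup described above can be sketched conceptually in JavaScript. This is only an illustration of the idea, not how the JVM implements its pool; the names pool and intern are hypothetical, and a frozen wrapper object stands in for an immutable string instance.

```javascript
// Hypothetical pool: maps string content to one canonical, immutable instance.
const pool = new Map();

function intern(text) {
  // If no equivalent instance exists yet, add a new one to the pool.
  if (!pool.has(text)) {
    pool.set(text, Object.freeze({ value: text }));
  }
  // In both cases, the pooled instance is returned and shared.
  return pool.get(text);
}

const a = intern("hello");
const b = intern("hello");
const c = intern("world");

const sameInstance = a === b;      // both references point to the pooled object
const differentInstance = a === c; // different content, different object
console.log(sameInstance, differentInstance, pool.size);
```

Two requests for the same content yield the same object, so the pool holds only one instance per distinct string.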

Python also supports string interning, but unlike Java, it does not guarantee that every string literal is interned. Which strings are interned automatically is an implementation detail: CPython typically interns strings that look like identifiers (ASCII letters, digits, and underscores), and any string can be interned explicitly with sys.intern().

String Comparisons

Let’s take a closer look at how string comparisons work in Java and other languages.

Comparisons in Java

In this example, we compare three strings with the content “hello”. While the first comparison returns true, the second returns false. What’s happening here?

String s1 = "hello";
String s2 = "hello";
String s3 = new String("hello");

System.out.println(s1 == s2); // true
System.out.println(s1 == s3); // false

In Java, the == operator compares references, not content.

First Comparison (s1 == s2): Both s1 and s2 reference the same object in the string pool, so the comparison returns true.

Second Comparison (s1 == s3): s3 is created using new String(), which allocates a new object on the heap. This object is distinct from the pooled instance, so the references differ and the comparison returns false.

You can explicitly obtain the pooled instance of a string using the intern() method, which adds the string to the pool if it is not there yet:

String s1 = "hello";
String s2 = new String("hello").intern();

System.out.println(s1 == s2); // true

To compare the content of strings in Java, use the equals() method:

String s1 = "hello";
String s2 = "hello";
String s3 = new String("hello");

System.out.println(s1.equals(s2)); // true
System.out.println(s1.equals(s3)); // true

Comparisons in Other Languages

Some languages, such as Python and JavaScript, use == to compare the content of strings, but this is not universal. Developers should always verify how string comparison works in the language they are using. The following Python example compares content with == and object identity with is:

s1 = "hello"
s2 = "hello"
s3 = "".join(["h", "e", "l", "l", "o"])

print(s1 == s2)  # True
print(s1 == s3)  # True

print(s1 is s2)  # True
print(s1 is s3)  # False

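In Python, the is operator compares object identity (references). In the example, s1 is s3 returns False because the join() method creates a new string object at runtime. That s1 is s2 returns True is a result of how CPython handles identical literals and should not be relied upon; use == to compare content.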
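A similar distinction can be sketched in JavaScript. Primitive strings are compared by content, while String wrapper objects created with new String() are compared by reference. The variable names are chosen for illustration only.

```javascript
// Primitive strings: === compares content, regardless of how they were built.
const s1 = "hello";
const s2 = ["h", "e", "l", "l", "o"].join("");
const contentEqual = s1 === s2;

// String wrapper objects: === compares object references.
const o1 = new String("hello");
const o2 = new String("hello");
const objectsEqual = o1 === o2;

// Unwrapping the objects compares the content again.
const valuesEqual = o1.valueOf() === o2.valueOf();

console.log(contentEqual, objectsEqual, valuesEqual);
```

Since wrapper objects are rarely needed, string comparison in JavaScript is in practice almost always a content comparison.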

Conclusion

Different approaches to string representation reflect trade-offs between simplicity, performance, and memory efficiency. Each programming language implements string comparison differently, requiring developers to understand the specific behavior before relying on it. For example, some languages differentiate between reference and content comparison, while others abstract these details for simplicity. Languages like Rust, which lack a default string pool, emphasize explicit memory management through ownership and borrowing mechanisms. Languages with string pools (e.g., Java) prioritize runtime optimizations. Being aware of these nuances is essential for writing efficient, bug-free code and making informed design choices.

Regular expressions in JavaScript

In one of our applications, users can maintain the texts of info buttons themselves. To do so, they enter the desired text in a text field while editing. The end user then sees the text as an HTML element.

For better structure, the customer now wants to create lists inside the text field. Lines beginning with a hyphen therefore had to be wrapped in <li></li> HTML tags.

I used JavaScript to implement this. This was my first use of regular expressions in JavaScript, so I had to learn their language-specific peculiarities. In the following article, I explain the general syntax and my solution.

General syntax

For the replacement, you can either specify a string to search for or a regular expression. To indicate that it is a regular expression, the expression is enclosed in slashes.

let searchString = "Test";
let searchRegex = /Test/;

It is also possible to put individual parts of the regular expression in parentheses (capture groups) and then reference them in the replacement part with $1, $2, etc.

let hello = "Hello Tom";
let simpleBye = hello.replace(/Hello/, "Bye");    
//Bye Tom
let bye = hello.replace(/Hello (.*)/, "Bye $1!"); 
//Bye Tom!
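As a small additional sketch, two capture groups can be referenced independently, which also allows reordering them in the replacement. The input value is made up for illustration.

```javascript
// Two capture groups: $1 is the first word, $2 the second.
const fullName = "Tom Miller";
const swapped = fullName.replace(/(\w+) (\w+)/, "$2, $1");
console.log(swapped);
//Miller, Tom
```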

In general, replace replaces only the first match, and replaceAll replaces all occurrences. However, this rule only applies when searching for a string. With regular expressions, the modifiers decide whether only the first match or all matches are replaced. To find and replace all of them, you must add the modifier g to the expression; replaceAll even throws a TypeError if it is called with a regular expression that lacks this modifier.
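For string search patterns, the difference between the two methods can be sketched as follows. The example values are made up; note that a string pattern is matched literally, so characters like + have no special meaning.

```javascript
const greeting = "Hello Tom, Hello Anna";

// String pattern with replace: only the first occurrence is replaced.
const firstOnly = greeting.replace("Hello", "Bye");
//Bye Tom, Hello Anna

// String pattern with replaceAll: every occurrence is replaced.
const all = greeting.replaceAll("Hello", "Bye");
//Bye Tom, Bye Anna

// String patterns are taken literally, so "+" needs no escaping here.
const literal = "1+1+1".replaceAll("+", " plus ");
//1 plus 1 plus 1

console.log(firstOnly, all, literal);
```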

Modifiers

Modifiers are placed at the end of a regular expression and define how the search is performed. In the following, I present just a few of the modifiers.

The modifier i is used for case-insensitive searching.

let hello = "hello Tom";
// replace is used here because replaceAll requires the g modifier
let notFound = hello.replace(/Hello/, "Bye");
//hello Tom
let found = hello.replace(/Hello/i, "Bye");
//Bye Tom

To replace all occurrences with a regular expression, the modifier g must be set, regardless of whether replace or replaceAll is called. Without g, replace only replaces the first match, and replaceAll throws a TypeError.

let hello = "Hello Tom, Hello Anna";
let first = hello.replace(/Hello/, "Bye");
//Bye Tom, Hello Anna
let replaceAll = hello.replaceAll(/Hello/g, "Bye");
//Bye Tom, Bye Anna
let replace = hello.replace(/Hello/g, "Bye");
//Bye Tom, Bye Anna
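The behavior of replaceAll with a regular expression that has no g modifier can be checked with a short sketch. The variable names are illustrative only.

```javascript
const text = "Hello Tom, Hello Anna";

// replaceAll rejects a regular expression without the g modifier.
let errorName = null;
try {
  text.replaceAll(/Hello/, "Bye");
} catch (error) {
  errorName = error.name;
}
console.log(errorName);
//TypeError

// With the g modifier, the call succeeds and replaces every match.
const replaced = text.replaceAll(/Hello/g, "Bye");
console.log(replaced);
//Bye Tom, Bye Anna
```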

Another modifier can be used for searching in multi-line texts. Normally, the characters ^ and $ match only at the start and end of the whole text. With the modifier m, they also match at the start and end of each line.

let hello = `Hello Tom,
hello Anna,
hello Paul`;
let byeAtBegin = hello.replaceAll(/^Hello/gi, "Bye");     
//Bye Tom, 
//hello Anna,
//hello Paul
let byeAtLineBegin = hello.replaceAll(/^Hello/gim, "Bye");     
//Bye Tom, 
//Bye Anna,
//Bye Paul
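The same applies to $ at the end of a line. As a sketch with the same text, the trailing commas are only found when the modifier m is set, because without it $ matches only at the very end of the text, where there is no comma.

```javascript
const lines = `Hello Tom,
hello Anna,
hello Paul`;

// Without m, $ matches only at the end of the whole text: nothing is replaced.
const withoutM = lines.replace(/,$/g, ";");

// With m, $ also matches at the end of each line: both commas are replaced.
const withM = lines.replace(/,$/gm, ";");
//Hello Tom;
//hello Anna;
//hello Paul

console.log(withoutM === lines, withM);
```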

Solution

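With this toolkit, I can now convert the lines beginning with a hyphen into HTML <li></li> elements. The pattern also consumes the line break at the end of each list line because, in the real code, line breaks are replaced with <br/> in the next step, and I do not want empty lines between the list items. The line break is optional (\n?) so that a list item on the last line, which has no trailing line break, is converted as well. The optional space after the hyphen keeps it out of the captured text.

let infoText = `This is an important field. You can input:
- right: At the right side
- left: At the left side`;
let htmlInfo = infoText.replaceAll(/^- *(.*)\n?/gm, "<li>$1</li>");
//This is an important field. You can input:
//<li>right: At the right side</li><li>left: At the left side</li>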
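Both steps together could look roughly like the following sketch. The function name toInfoHtml is hypothetical, and the real code may differ. The pattern here allows an optional space after the hyphen and an optional trailing line break, so a list item on the last line is converted too.

```javascript
// Hypothetical helper combining both steps described above.
function toInfoHtml(infoText) {
  return infoText
    // Step 1: wrap lines starting with a hyphen in <li></li>
    // and consume their trailing line break, if there is one.
    .replace(/^- *(.*)\n?/gm, "<li>$1</li>")
    // Step 2: turn the remaining line breaks into <br/>.
    .replace(/\n/g, "<br/>");
}

const html = toInfoHtml(`This is an important field. You can input:
- right: At the right side
- left: At the left side`);
console.log(html);
//This is an important field. You can input:<br/><li>right: At the right side</li><li>left: At the left side</li>
```

Because the list lines lose their line breaks in the first step, only the line break after the introductory sentence is turned into <br/>.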

Once you are familiar with the syntax, JavaScript offers useful features for working with regular expressions, such as reusing captured parts of the match in the replacement.