Decoding non-utf8 server responses using the Fetch API

The JavaScript Fetch API is a really nice addition to the platform and my preferred, in fact the only bearable, way to make server requests.
The Promise-based API is a lot nicer than the older, purely callback-based approaches.

The usual approach to get a text response from a server using the Fetch API looks like this:

let request = fetch(url)
  .then(response => response.text())
  .then(handleText);

But this has one subtle problem:

I was building a client application that reads weather data from a small embedded device. We could not change the functionality of that device directly, but we could upload static pages to it and use its existing HTML API to query the amount of registered rainfall and lightning strikes.

Using the Fetch API, I quickly got the data and extracted it from the HTML, but some of the identifiers contained screwed-up characters that looked like decoding problems. So I checked whether the HTTP Content-Type header was set correctly in the response. To my surprise, it was: Content-Type: text/html; charset=iso-8859-1.

So why did my JavaScript application not pick that up? After some digging, it turned out that the Response's text() function always decodes the payload as UTF-8. The mismatch between that and the declared charset explained the problem!

Obviously, I had to do the decoding myself. The solution I picked was the TextDecoder class, which can decode an ArrayBuffer with a given encoding. Luckily, an ArrayBuffer is easy to get from the response:

let request = fetch(url)
  .then(response => response.arrayBuffer())
  .then(buffer => {
    let decoder = new TextDecoder("iso-8859-1");
    let text = decoder.decode(buffer);
    handleText(text);
  });

Since I only had to support that single encoding, this worked well for me. Keep in mind that TextDecoder is still experimental technology. However, we had a specific browser as a target and it works there. Lucky us!
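
If you need to be more defensive, you can feature-detect TextDecoder and fall back to the default UTF-8 decoding. A minimal sketch, not part of the original solution:

function fetchTextWithCharset(url, charset) {
  if (typeof TextDecoder === "undefined") {
    // Fallback: text() always decodes as UTF-8, so non-ASCII
    // characters may still come out garbled for other charsets.
    return fetch(url).then(response => response.text());
  }
  return fetch(url)
    .then(response => response.arrayBuffer())
    .then(buffer => new TextDecoder(charset).decode(buffer));
}

fetchTextWithCharset(url, "iso-8859-1").then(handleText);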

Selecting all columns of a database table with an SQL GROUP BY expression

Suppose we have an SQL database table named “temperatures” with the following contents:

LOCATION  TIME        CELSIUS
inside    2018-08-01  24
inside    2018-08-02  28
inside    2018-08-03  21
inside    2018-08-04  28
outside   2018-08-01  29
outside   2018-08-02  31
outside   2018-08-03  25
outside   2018-08-04  30
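
If you want to follow along, the table can be set up like this (a sketch; the column types are assumptions):

CREATE TABLE temperatures (
  location VARCHAR(10),
  time     DATE,
  celsius  INTEGER
);

INSERT INTO temperatures VALUES ('inside',  DATE '2018-08-01', 24);
INSERT INTO temperatures VALUES ('inside',  DATE '2018-08-02', 28);
INSERT INTO temperatures VALUES ('inside',  DATE '2018-08-03', 21);
INSERT INTO temperatures VALUES ('inside',  DATE '2018-08-04', 28);
INSERT INTO temperatures VALUES ('outside', DATE '2018-08-01', 29);
INSERT INTO temperatures VALUES ('outside', DATE '2018-08-02', 31);
INSERT INTO temperatures VALUES ('outside', DATE '2018-08-03', 25);
INSERT INTO temperatures VALUES ('outside', DATE '2018-08-04', 30);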

We want to find the highest temperature for each location. We use the MAX aggregate function and a GROUP BY expression:

SELECT location, MAX(celsius) celsius
FROM temperatures
GROUP BY location;

As expected, the result is:

LOCATION  CELSIUS
outside   31
inside    28

Now suppose we also want to know when each of these extreme temperatures occurred. Naively, we try the following query:

SELECT location, time, MAX(celsius) celsius
FROM temperatures
GROUP BY location;

The response is an error: “not a GROUP BY expression”. In a query with a GROUP BY clause, every selected column must either be part of the GROUP BY clause or be wrapped in an aggregate function.

To achieve what we want we can use a JOIN:

SELECT
  t.location, t.time, t.celsius
FROM
  temperatures t
JOIN (SELECT location, MAX(celsius) celsius
      FROM temperatures
      GROUP BY location) tmax
ON
  t.location=tmax.location AND t.celsius=tmax.celsius;

This query results in multiple rows per location if the maximum temperature was recorded at different times:

LOCATION  TIME        CELSIUS
outside   2018-08-02  31
inside    2018-08-04  28
inside    2018-08-02  28

If we are only interested in the first occurrence of the maximum temperature per location, we can use the following query:

SELECT
  location,
  MIN(time) KEEP (DENSE_RANK LAST ORDER BY celsius) time,
  MAX(celsius) celsius
FROM
  temperatures
GROUP BY
  location;

The result is:

LOCATION  TIME        CELSIUS
inside    2018-08-02  28
outside   2018-08-02  31

Here we don’t need a JOIN anymore, because the select expression for the time column is an aggregate as well.
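
Note that KEEP (DENSE_RANK LAST …) is Oracle-specific syntax. On databases that support standard window functions, the same result can be obtained with ROW_NUMBER(); a sketch, not from the original post:

SELECT location, time, celsius
FROM (SELECT location, time, celsius,
             ROW_NUMBER() OVER (PARTITION BY location
                                ORDER BY celsius DESC, time) rn
      FROM temperatures) ranked
WHERE rn = 1;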

For what the javascript!

The setting

We are developing and maintaining an important web application for one of our clients. Our application scrapes a web page and embeds our own content into that page frame.

One day our client told us about an additional block of elements at the bottom of each page. The block had the heading “Image Credits” and a broken image link strangely labeled “inArray”. We had not changed anything on our side, and the new blocks were not part of the HTML code of the pages.

OK, so some new JavaScript code had to be the source of these strange elements on our pages.

The investigation

I started the investigation using the browser’s development tools (F12). A search for the string “Image Credits” instantly brought me to the right place: a JavaScript function called on document.ready(). The code basically collected all images with a copyright attribute and put the findings into an array, with the text as the key and the image URL as the value. Then it iterated over that array and added the copyright information at the bottom of each page.

But wait! Our array was empty and we had no images with copyright attributes. Still, the block was put out. I verified all this using the debugger in the browser and was a bit puzzled at first, especially by the strange name “inArray”, which sounded more like code than like copyright information.

The cause

Then I looked at the iteration and it struck me like lightning: the code used for (name in copyrightArray) to iterate over the elements. Sounds correct, but it is not! Now we have to elaborate a bit, especially for all you folks without a special degree in JavaScript coding:

In JavaScript there is no distinct notion of associative arrays, but you can access all enumerable properties of an object using square brackets (example taken from Mozilla’s JavaScript docs):

var string1 = "";
var object1 = {a: 1, b: 2, c: 3};

for (var property1 in object1) {
  string1 = string1 + object1[property1];
}

console.log(string1);
// expected output: "123"

In the case of an array the indices are “just enumerable properties with integer names and are otherwise identical to general object properties”.

So in our case we had an array object with a length of 0 and a property called inArray. Where did that come from? Digging further revealed that one of our third-party libraries added a function to the Array prototype like so:

Array.prototype.inArray = function (value) {
  var i;
  for (i = 0; i < this.length; i++) {
    if (this[i] === value) {
      return true;
    }
  }
  return false;
};
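
With that extension in place, for…in over any array, even an empty one, also yields the inherited property. A minimal demonstration, assuming the prototype extension above:

var copyrightArray = [];
for (var name in copyrightArray) {
  console.log(name); // logs "inArray", although the array itself is empty
}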

The solution

Usually you would iterate over an array using the integer index (old school) or, better, using the more modern and readable for…of (which also works on other iterable types). In this case that does not work, because we do not use integer indices but string properties. So you have to use Object.keys().forEach(), or check with hasOwnProperty() in your for…in loop whether the property is inherited, to avoid picking up unwanted properties from prototypes, as sketched below.
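
A minimal sketch of both alternatives, using a hypothetical copyright entry:

var copyrights = {};
copyrights["Photo: J. Doe"] = "/images/rain.jpg"; // hypothetical entry

// Object.keys() only returns own, enumerable properties:
Object.keys(copyrights).forEach(function (name) {
  console.log(name + " -> " + copyrights[name]);
});

// Or guard the for...in loop against inherited properties:
for (var name in copyrights) {
  if (copyrights.hasOwnProperty(name)) {
    console.log(name + " -> " + copyrights[name]);
  }
}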

The takeaways

Iteration in JavaScript is hard! See this lengthy discussion… The different constructs are named quite similarly and all have subtle differences in behaviour. In addition, libraries can mess with the objects you think you know. So finally, some advice from me:

  • Arrays are only true arrays with non-negative integer indices/property names!
  • Do not mess with the prototypes of well-known objects, like our third-party library did…
  • Use for…of to iterate over true arrays or other iterables.
  • Do not use associative arrays if other options are available. If you do, make sure to check whether the properties are own properties and enumerable.

UX tips: charts

Continuing our series of UX tips for small areas, this time we take a look at charts.

  • always label your axes – this one should be obvious: the reader needs to know what data they are looking at
  • use units – are these meters or yards? Data can only be understood if the scale is clear
  • use distinctive colors – different data, different color; and don’t forget the color-blind
  • use color with purpose – same data, same color; e.g. different values from the same source should use the same color or tint of color
  • for bar charts, always start at zero – the distortion often used to exaggerate differences is immense
  • leave out decoration – decoration distracts from the data
  • no 3D effects – the data speaks for itself, please do not try to make it prettier
  • use them for lots of data – if you just have a few numbers, a table might be more appropriate
  • use a title – what aspect or slice of the data are you showing me? A summary helps to understand perspective and intention
  • use a big enough font – this and appropriate contrast should be obvious for all UI work
  • label tick marks – use them at meaningful intervals, do not just slice the axis into equidistant intervals
  • include time – even if it isn’t a time series chart, tell the reader when the data is from
  • size 2D shapes by area – do not use diameter or radius for sizing 2D shapes (circles, bubbles…); unless you use bar charts (where length is compared), the proportions get way out of hand
  • use consistent rounding – round your numbers first and do it consistently
  • for comparisons, use the same scale – do not distort your data by presenting it at different scales

The Pure Fabrication Tax

Last week, I attended the Maexle event of the C++ user group in Karlsruhe. The Maexle event is basically a programming contest where your program plays the Mia dice game against other programs. You have to implement a simple network protocol to join games, announce rolls and call bluffs. Your program earns points for every game it has participated in and not lost. So there is a strong emphasis on starting early and staying in the game, even if your program doesn’t perform best.

Since it was an event of the C++ user group, the programming language to be used was C++. I’m certainly no C++ hero and knew I couldn’t compete, so I joined the fun with an espionage role and programmed an observing bot that doesn’t play, but gathers data on the players. I chose Java for the task. My observer was online after two minutes, the first real player joined the server after 20 minutes. It turned out to be written in Python. The first real C++ bot was online after 35 minutes, the last one played its first round after two hours.

I listened closely to the problems the teams around me tackled and noticed something strange: Nobody talked about the actual game (Maexle/Mia). Every task was a technical one. Let’s talk about why that’s a problem.

Three Definitions

Before I dive into the subject, I want to define some terms that I’ll use to help you understand my point. It’s entirely possible to look at the story above and see a bunch of engineers having fun with some engineering tasks.

  • First, I value the economics of my customer. In this case, the customer is a lonely server on the LAN that wants to host some games of Maexle for bots. Like, lots of games. Thousands of games. The customer gives points for early market entry, so time to market is an economic factor (or a key performance indicator, some might say). You can roughly say that being online early means you can make bucks longer. The second key performance indicator is uptime. You want to stay in the game as long as you don’t lose all the time. There are some more KPIs, but the two I’ve listed should have a major impact on your programming approach – if you value the economics.
  • Second, I don’t care about tools. A programming language is a tool. A compiler is a tool. Your IDE or text editor is a tool. Use your preferred tooling as long as it suits your needs. That means explicitly: as long as it doesn’t actively work against your other values, like the customer’s economics. This blog post is definitely not about Java or Python being “better” or “better suited” than C++. They aren’t. The first two bots (observer and player) were programmed by participants who had prior experience with the event. It wasn’t the tool that made them fast, it was the absence of rookie errors in the domain and its technical structure.
  • Third, I will explain my point with the concept of “pure fabrication”. Pure fabrication is everything that is not specified by the customer, but necessary to fulfill the specification. It’s the code you write because your customer wants to persist some data. He never ordered you to write SQL statements or “open a connection to the database”; maybe he didn’t even know what a database was. Your customer wanted the data stored somehow. The code that enables you to actually program the storage is “pure fabrication” in terms of the domain. Think of it as a scaffolding holding your domain code in place. If you hire a painter to color your house, he will scaffold the walls to reach every spot with ease. You didn’t hire him to set those structures up, they are just necessary for the task. The difference to most of our code is that the painter removes the scaffolding afterwards.

Pure Fabrication vs. Domain

So, if I had been a customer at the Maexle event, paying for a competitive Maexle bot, I would have been very surprised by the actual construction process. Up to two hours into a three-hour event, my programmer would be solving apparently hard and important problems, but not my problems. In fact, I wouldn’t even understand the relation between the attempted problems and my required solution. And I would have to have blind faith, for more than half my money, that something usable would come out of this.

This is the effect of too much pure fabrication in the programming approach. I’m all for solving hard programming problems, but I’m not interested in solving them over and over again. After some iterations, they become solved problems or, essentially, tools. And I don’t care about tools as long as they get their job done. If your domain problem requires a better tool, then we can put the programming problem on our todo list again. Otherwise, we are not valuing our customer’s economics, we are showing off to our peers.

If you program a simple game of Maexle with a heavy emphasis on time to market and even after the initial ramp-up aren’t able to reason about your code using language from the domain (like game, dice, roll, bluff, double and, of course, mia), you are staying in pure fabrication land. That’s the level of programming where it matters if you used an integer or freed that memory. That’s when you pay the Pure Fabrication Tax to the fullest. Because your code now does something valuable in the domain, but the distance between your customer’s language and your code’s language is a hindrance. And this distance will demand its tax with every new feature, every change request and every bug.

Bugs are another area where the distance is measurable. If you can’t explain your bugs to the customer, you’ve made them in the pure fabrication part of your code. If you can never explain your bugs, your domain code is hidden between lines and lines of source code with lots of special characters, brackets and magic numbers. Just imagine your hired painter trying to tell you why your house is now pink instead of white or yellow: “It was a small mishap in the way we constructed the scaffolding, we used an E5 steel beam instead of a rail clamp and forgot to perform a hammer check on it”. The last part is totally made up, but I’m sure that’s how we sound to non-programmers.

Exemptions from the Tax?

What solution would I suggest? I don’t think there is a definite solution to the problem. You can’t go full Domain-Driven Design on a three-hour Maexle event. By the time you’ve built your fancy Domain Specific Language to write code with the customer beside you, everybody else has gathered their game points and gone home. If you switch to a language that has a string tokenizer in its standard library, you can speed up your programming, but maybe just produce a bigger heap of slightly less low-level pure fabrication code.

I don’t want to advocate a solution. My attempt is to highlight the problem: The Pure Fabrication Tax. Given the right (or wrong) amount of extrinsic (or intrinsic) motivation, we are able to produce a mess in just a few hours without really connecting to the domain we produce the mess for. If we didn’t program a Maexle bot that night, but a poker bot or a chat bot, most if not all of the problems and bugs would have been the same. This is not a domain-specific problem. It’s our problem. We probably just like to pay the tax.

What are your thoughts on the Pure Fabrication Tax? Can you see it? Do you have an idea for a solution approach? Leave your comment below!

Disclaimer

And to counter everybody who thinks I’m just bashing the other participants of the event: I was the first one online on the server, with a task that requires virtually no effort and doesn’t even participate directly in the competition, with tools that solved nearly all my pure fabrication problems, and I still managed to create a program that contained fewer than five domain terms and was useless for its intended purpose. I said I value the economics of my customer (even if there was none), so I know that I failed hardest at the event. And I had prior knowledge. There was just nobody to compare my mess to.

Game Optimization Resolved

In my last blog post, I explained a performance problem in my game abstractanks, but not how I solved it.

I had not done any optimization work in a while, so the first thing I did turned out to be an error. And not only in hindsight – I actually knew how to tackle a problem like that, I had just temporarily forgotten it at that point.

Going down the rabbit hole

Where we left off, my profiler showed FriendlyUnitOccupies as the culprit. That function basically does circle/circle collision detection using a quad-tree as the spatial acceleration structure. Looking at the samples from my profiler, I could see that it was descending into the tree quite deeply. Like all tree structures, a quad-tree does pointer chasing, which is very bad for modern CPUs. So I figured I should look at how to optimize that. The data structure had been implemented in a hurry, so there seemed to be plenty to do:

  • Instead of recursing into each node, use tail-call optimization and early culling to speed up traversal.
  • Pre-cache the query with the max-search radius and the other requirements on the units, e.g. not dead, same team, etc., and then use that to build a new tree for the actual queries.

Because the data structure was pretty non-generic, I basically started to rewrite it for this scenario. When I was about halfway through with that, it dawned on me that I was barking up the wrong tree.

Taking a step back

The excellent book Video Game Optimization has some great advice on which level to attack an optimization problem.

  1. System-level. Can you change the system to do something differently and still solve your problem?
  2. Algorithm-level. Are you using the right, most efficient algorithm for the data you have?
  3. Micro-level. Are you not wasting any processing power on the lower levels?

I was already on the algorithm level. So I went back to the system level: what if the AI did not try to change the target position that often, maybe just every few seconds? That effectively meant lowering the AI’s APM. It’s not a bad solution, especially since it makes the AI behave more human. But on the other hand, real-time games, as the name implies, have a soft real-time requirement. So you generally want to avoid huge workloads that go over your frame budget. With how slow the algorithm was, that could easily be the case. The solution is then to do the work concurrently, either by splitting it up or by doing it in the background. Both seemed difficult, since the AI code currently does not allow for easy concurrency. So that idea was out.

What if the parking positions were cached? Subsequent calls to get parking positions could probably reuse a lot of the positions computed in previous frames, given that the target point only moves a little bit each frame. I figured that might work, but it requires more housekeeping and data dependencies – the result of the previous query needs to be used for the next. That seemed complex and therefore brittle.

A Solution?

Temporal coherency was a pretty good idea, but the scale was too big this time. What if I exploited it within a single frame? The original code obscured this, but maybe it gets a little clearer if I write it like this:

optional<v2> GameWorld::FindFreePosition(v2 Center, std::vector<v2> const& Occupied)
{
  auto CheckPosition = [&](v2 Candidate)
  {
    if (!IsPassable(Candidate))
      return false;

    // Assuming OverlapsWith checks the candidate against the occupied list
    if (OverlapsWith(Candidate, Occupied))
      return false;

    return !FriendlyUnitOccupies(Candidate);
  };
  auto Samples = SampledPositions(Center, SomeRandomness());
  auto Found = find_if(Samples.begin(), Samples.end(), CheckPosition);

  if (Found != Samples.end())
    return *Found;
  return none;
}

Now as I explained in the previous post, this was called in a loop for each unit to be parked.

std::vector<v2> GameWorld::FindParkingPositions(v2 Center, std::size_t N)
{
  std::vector<v2> Results;
  for (std::size_t i = 0; i < N; ++i)
  {
    auto MaybePosition = FindFreePosition(Center, Results);
    if (!MaybePosition) // No more free space?
      break;
    Results.push_back(*MaybePosition);
  }
  return Results;
}

Easy to see: counting the number of CheckPosition calls, this algorithm is O(n) in the number of sampled positions. The number of sampled positions depends linearly on the number of units to be parked, because more units obviously need more parking positions, essentially making this O(n²) in the unit count! But the positions get resampled for each unit, with the only change being the little bit of randomness that is injected every time. In other words, each call just tests false for the sampled positions roughly corresponding to the units that were already placed.

So what I did was a very small change: only inject the randomness once and merge the loops:

auto Samples = SampledPositions(Center, SomeRandomness());
std::vector<v2> Results;

for (auto const& Sample : Samples)
{
  if (CheckPosition(Sample))
    Results.push_back(Sample);

  if (Results.size() >= N)
    break;
}
return Results;

And this did the trick! The algorithm’s run-time went below the 1 ms range, and the smaller variation in randomness is not really visible.

Conclusions

I was thrown off track by the false conclusion that CheckPosition was too slow, when it was in fact just called too often. Context is key! Always approach these things outside-in.
Using less-than-optimal abstractions hid the opportunity to hoist the sample generation out of the loop. Iteration is always a separate concern, even when it is not on containers!

Entity Framework migrations with multiple database contexts

The .NET Entity Framework provides functionality for automatic database migrations. Every time your application code requires a change of the database schema, you should create a migration so that the existing database schema is updated when a new version of the application is deployed. Examples of such changes are new entity classes or the addition and removal of properties of existing entity classes. The Entity Framework functionality for database migrations is called Code First Migrations.

Code First Migrations are managed via the so-called Package Manager Console. That’s what it’s called in Visual Studio, because it’s usually used for package management, but it’s basically a general PowerShell command-line interface. After you have created the database context class and the entity model classes for your application, you create an initial migration (usually called InitialCreate), which captures the original state of the database schema for your application:

Add-Migration InitialCreate

This will create a new migration class called InitialCreate in the Migrations folder. The filename is prefixed with a timestamp: 201810702207458_InitialCreate.cs. Each migration class has an Up() method, which applies the migration, and a Down() method, which rolls the migration back.
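
The generated class might look roughly like this (a sketch; the actual content depends on your entity model, and dbo.Users is an assumed table name):

namespace Migrations
{
  using System.Data.Entity.Migrations;

  public partial class InitialCreate : DbMigration
  {
    public override void Up()
    {
      // Creates the initial table for an assumed User entity
      CreateTable(
        "dbo.Users",
        c => new
        {
          Id = c.Int(nullable: false, identity: true),
          Name = c.String(),
        })
        .PrimaryKey(t => t.Id);
    }

    public override void Down()
    {
      DropTable("dbo.Users");
    }
  }
}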

Each subsequent migration only describes the difference to its predecessor migration. For example, you add a new string property Email to your User entity class and add a new migration:

Add-Migration AddUserEmail

The tool will scan your entity classes, compare the current state to the state of the previous migration, calculate the difference and create a new migration class, which adds a new column to the database schema.
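
The generated migration then contains just the column change; roughly like this (again a sketch, with dbo.Users as an assumed table name):

public partial class AddUserEmail : DbMigration
{
  public override void Up()
  {
    AddColumn("dbo.Users", "Email", c => c.String());
  }

  public override void Down()
  {
    DropColumn("dbo.Users", "Email");
  }
}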

When the migrations are run on the target system they are tracked in a special database table called __MigrationHistory.
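
You can inspect that table directly; a sketch, assuming SQL Server and Entity Framework 6, whose history table includes a ContextKey column that becomes relevant with multiple contexts:

SELECT MigrationId, ContextKey, ProductVersion
FROM __MigrationHistory
ORDER BY MigrationId;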

Multiple database contexts

The above usage of Code First Migrations is well documented. Here I want to describe a feature that is documented in less detail, because it’s less commonly used: migrations with multiple database contexts.

Let’s assume you have two database contexts: CoreDataContext and MeasurementDataContext. In this case you have to create two migration configuration classes, which inherit from DbMigrationsConfiguration. You create two subdirectories under the Migrations directory, one for each database context, and in each subdirectory a migrations Configuration class:

namespace Migrations.CoreData
{
  using System.Data.Entity.Migrations;

  internal sealed class Configuration : DbMigrationsConfiguration<CoreDataContext>
  {
    public Configuration()
    {
      MigrationsDirectory = "Migrations\\CoreData";
      AutomaticMigrationsEnabled = true;
    }
  }
}

and

namespace Migrations.MeasurementData
{
  using System.Data.Entity.Migrations;

  internal sealed class Configuration : DbMigrationsConfiguration<MeasurementDataContext>
  {
    public Configuration()
    {
      MigrationsDirectory = "Migrations\\MeasurementData";
      AutomaticMigrationsEnabled = true;
    }
  }
}

For each configuration you have to set the MigrationsDirectory property accordingly. The AutomaticMigrationsEnabled property is optional. If it’s set, the migrations will be applied automatically at the start of the application.
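
One way to wire that up (a sketch, assuming Entity Framework 6) is a MigrateDatabaseToLatestVersion initializer per context:

using System.Data.Entity;

// e.g. in your application's startup code:
Database.SetInitializer(
  new MigrateDatabaseToLatestVersion<CoreDataContext, Migrations.CoreData.Configuration>());
Database.SetInitializer(
  new MigrateDatabaseToLatestVersion<MeasurementDataContext, Migrations.MeasurementData.Configuration>());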

Now, if you run a migration command like Add-Migration, you have to add the -ConfigurationTypeName option, which specifies the Configuration class for the desired database context:

Add-Migration InitialCreate -ConfigurationTypeName Migrations.CoreData.Configuration

Add-Migration InitialCreate -ConfigurationTypeName Migrations.MeasurementData.Configuration

Add-Migration AddUserEmail -ConfigurationTypeName Migrations.CoreData.Configuration

Add-Migration AddMeasurementTimestamp -ConfigurationTypeName Migrations.MeasurementData.Configuration

The migration classes will now be created in the correct subdirectories.