Handling database warnings with JDBC

Database administrators have the possibility to set lifetimes for user passwords. This can be considered a security feature, so that passwords get updated regularly. But if one of your software services logs into the database with such an account, you want to know when the password expires in good time before this happens, so that you can update the password. Otherwise your service will stop working unexpectedly.

Of course, you can mark the date in your calendar in order to be reminded beforehand, and you probably should. But there is an additional measure you can take. The database administrator can not only set the lifetime of a password, but also a “grace period”. For example:

ALTER PROFILE app_user LIMIT PASSWORD_LIFE_TIME 180 PASSWORD_GRACE_TIME 14;

This SQL command sets the password life time to 180 days (roughly six months) and the grace period to 14 days (two weeks). If you log into the database with this user you will see a warning two weeks before the password will expire. For Oracle databases the warning looks like this:

ORA-28002: the password will expire within 14 days

But your service logs in automatically, without any user interaction. Is it possible to programmatically detect a warning like this? Yes, it is. For example, with JDBC the following code detects warnings after a connection was established:

// Error codes for ORA-nnnnn warnings
static final int passwordWillExpireSoon = 28002;
static final int accountWillExpireSoon = 28011;

void handleWarnings(Connection connection) throws SQLException {
    SQLWarning warning = connection.getWarnings();
    while (null != warning) {
        String message = warning.getMessage();
        log.warn(message);

        int code = warning.getErrorCode();
        if (code == passwordWillExpireSoon) {
            System.out.println("ORA-28002 warning detected");
            // handle appropriately
        }
        if (code == accountWillExpireSoon) {
            System.out.println("ORA-28011 warning detected");
            // handle appropriately
        }
        warning = warning.getNextWarning();
    }
}

Instead of just logging the warnings, you can use this code to send an email to your address, so that you will get notified about a soon-to-be-expired password in advance. The error code depends on your database system.

With this in place you should not be unpleasantly surprised by an expired password. Of course, this only works if the administrator sets a grace period, so you should agree on this approach with your administrator.

Using PostgreSQL with Entity Framework

The most widespread O/R (object-relational) mapper for the .NET platform is the Entity Framework. It is most often used in combination with Microsoft SQL Server as database. But the architecture of the Entity Framework allows to use it with other databases as well. A popular and reliable is open-source SQL database is PostgreSQL. This article shows how to use a PostgreSQL database with the Entity Framework.

Installing the Data Provider

First you need an Entity Framework data provider for PostgreSQL. It is called Npgsql. You can install it via NuGet. If you use Entity Framework 6 the package is called EntityFramework6.Npgsql:

> Install-Package EntityFramework6.Npgsql

If you use Entity Framework Core for the new .NET Core platform, you have to install a different package:

> Install-Package Npgsql.EntityFrameworkCore.PostgreSQL

Configuring the Data Provider

The next step is to configure the data provider and the database connection string in the App.config file of your project, for example:

<configuration>
  <!-- ... -->

  <entityFramework>
    <providers>
      <provider invariantName="Npgsql"
         type="Npgsql.NpgsqlServices, EntityFramework6.Npgsql" />
    </providers>
  </entityFramework>

  <system.data>
    <DbProviderFactories>
      <add name="Npgsql Data Provider"
           invariant="Npgsql"
           description="Data Provider for PostgreSQL"
           type="Npgsql.NpgsqlFactory, Npgsql"
           support="FF" />
    </DbProviderFactories>
  </system.data>

  <connectionStrings>
    <add name="AppDatabaseConnectionString"
         connectionString="Server=localhost;Database=postgres"
         providerName="Npgsql" />
  </connectionStrings>

</configuration>

Possible parameters in the connection string are Server, Port, Database, User Id and Password. Here’s an example connection string using all parameters:

Server=192.168.0.42;Port=5432;Database=mydatabase;User Id=postgres;Password=topsecret

The database context class

To use the configured database you create a database context class in the application code:

class AppDatabase : DbContext
{
  private readonly string schema;

  public AppDatabase(string schema)
    : base("AppDatabaseConnectionString")
  {
    this.schema = schema;
  }

  public DbSet<User> Users { get; set; }

  protected override void OnModelCreating(DbModelBuilder builder)
  {
    builder.HasDefaultSchema(this.schema);
    base.OnModelCreating(builder);
  }
}

The parameter to the super constructor call is the name of the configured connection string in App.config. In this example the method OnModelCreating is overridden to set the name of the used schema. Here the schema name is injected via constructor. For PostgreSQL the default schema is called “public”:

using (var db = new AppDatabase("public"))
{
  var admin = db.Users.First(user => user.UserName == "admin")
  // ...
}

The Entity Framework mapping of entity names and properties are case sensitive. To make the mapping work you have to preserve the case when creating the tables by putting the table and column names in double quotes:

create table public."Users" ("Id" bigserial primary key, "UserName" text not null);

With these basics you’re now set up to use PostgreSQL in combination with the Entity Framework.

 

Monitoring long running operations in Oracle databases

We regularly work with database tables with hundreds of millions of entries. Some operations on these table can take a while. Not necessarily queries, but operations in preparation to make queries fast, for example the creation of materialized views or indexes.

The problem with most SQL tools is: once you run your SQL statement you have no indication of how long it will take to complete the operation. No progress bar and no display of the remaining time. Will it take minutes or hours?

Oracle databases have a nice feature I learned about recently that can answer these questions. Operations that take longer than 6 seconds to complete are considered “long operations” and get an entry in a special view called V$SESSION_LONGOPS.

This view does not only contain the currently running long operations but also the history of completed long operations. You can query the status of the current long operations like this:

SELECT * FROM V$SESSION_LONGOPS 
  WHERE time_remaining > 0;

This view contains columns like

  • TARGET (table or view on which the operation is carried out)
  • SOFAR (units of work done so far)
  • TOTALWORK (total units of work)
  • ELAPSED_SECONDS (number of elapsed seconds from the start of the operation)

Based on these values the view offers another column, which contains the estimated remaining time in seconds: TIME_REMAINING.

This remaining time is really just an estimate, because it assumes long running operations to be linear, which is not necessarily true. Also some SQL statements can spawn multiple consecutive operations, e.g. first a “Table Scan” operation and then a “Sort Output” operation, which will only become visible after the first operation has finished. Nevertheless I found this feature quite helpful to get a rough idea of how long I will have to wait or to inform decisions such as whether I really want to perform an operation until completion or if I want to cancel it.

Modern developer Issue 4: My SQL toolbox

Thinking in SQL

SQL is such a basic and useful language but the underlying thinking is non-intuitive when you come from imperative languages like Java, Ruby and similar.
SQL is centered around sets and operations on them. The straight forward solution might not be the best one.

Limit

Let’s say we need the maximum value in a certain set. Easy:

select max(value) from table

But what if we need the row with the maximum value? Just adding the other columns won’t work since aggregations only work with other aggregations and group bys. Joining with the same table may be straight forward but better is to not do any joins:

select * from (select * from table order by value desc) where rownum<=1

Group by and having

Even duplicate values can be found without joining:

select value from table group by value having count(*) > 1

Grouping is a powerful operation in SQL land:

select max(value), TO_CHAR(time, 'YYYY-MM') from table group by TO_CHAR(time, 'YYYY-MM')

Finding us the maximum value in each month.

Mapping with outer joins

SQL is also good for calculations. Say we have one table with values and one with a mapping like a precalculated log table. Joining both gets the log of each of your values:

select t.value, log.y from table t left outer join log_table log on t.value=log.x

Simple calculations

We can even use a linear interpolation between two values. Say we have only the function values stored for integers but we values between them and these values between them can be interpolated linearly.

select t.value, (t.value-floor(t.value))*f.y + (ceil(t.value)-t.value)*g.y from table t left outer join function_table f on floor(t.value)=f.x left outer join function_table g on ceil(t.value)=g.x

When you need to calculate for large sets of values and insert them into another table it might be better to calculate in SQL and insert in one step without all the conversion and wrapping stuff present in programming languages.

Conditions

Another often overlooked feature is to use a condition:

select case when MOD(t.value, 2) = 0 then 'divisible by 2' else 'not divisible by 2' end from table t

These handful operations are my basic toolbox when working with SQL, almost all queries I need can be formulated with them.

Dates and timestamps

One last reminder: when you work with time always specify the wanted time zone in your query.

Monitoring data integrity with health checks

An important aspect for systems, which are backed by a database storage, is to maintain data integrity. Most relational databases offer the possibility to define constraints in order to maintain data integrity, usually referential integrity and entity integrity. Typical constraints are foreign key constraints, not-null constraints, unique constraints and primary key constraints.

SQL also provides the CHECK constraint, which allows you to specify a condition on each row in a table:

ALTER TABLE table_name ADD CONSTRAINT
   constraint_name CHECK ( predicate )

For example:

CHECK (AGE >= 18)

However, these check constraints are limited. They can’t be defined on views, they can’t refer to columns in other tables and they can’t include subqueries.

Health checks

In order to monitor data integrity on a higher level that is closer to the business rules of the domain, we have deployed a technique that we call health checks in some of our applications.

These health checks are database queries, which check that certain constraints are met in accordance with the business rules. The queries are usually designed to return an empty result set on success and to return the faulty data records otherwise.

The health checks are run periodically. For example, we use a Jenkins job to trigger the health checks of one of our web applications every couple of hours. In this case we don’t directly query the database, but the application does and returns the success or failure states of the health checks in the response of a HTTP GET request.

This way we can detect problems in the stored data in a timely manner and take countermeasures. Of course, if the application is bug free these health checks should never fail, and in fact they rarely do. We mostly use the health checks as an addition to regression tests after a bug fix, to ensure and monitor that the unwanted state in the data will never happen again in the future.

IS NULL or IS NOT NULL, that is the question

Today I’ll demonstrate a curiosity of SQL regarding the NOT IN operator in combination with a subquery and NULL values.

Let’s assume we have two database tables, users and profiles:

 users              profiles
+--------------+  +-------------+
| id  username |  | id  user_id |
| 0   'joe'    |  | 0   2       |
| 1   'kate'   |  | 1   0       |
| 2   'john'   |  | 2   NULL    |
| 3   'maria'  |  +-------------+
+--------------+

We want to find all users, which have no associated profile. The intuitive solution would be a negated membership test (“NOT IN”) on the result set of a subquery:

SELECT * FROM users WHERE id NOT IN (SELECT user_id FROM profiles);

The anticipated result is:

+---------------+
| id  username	|
| 1   'kate'    |
| 3   'maria'   |
+---------------+

However, the actual result is an empty set:

+--------------+
| id  username |
+--------------+

This is irritating, especially since the non-negated form produces a sensible result:

SELECT * FROM users WHERE id IN (SELECT user_id FROM profiles);

+--------------+
| id  username	|
| 0   'joe'    |
| 2   'john'   |
+--------------+

So why does the NOT IN operator produce this strange result?

To understand what happens we replace the result of the subquery with a set literal:

SELECT * FROM users WHERE id NOT IN (2, 0, NULL);

This statement is internally translated to:

SELECT * FROM users WHERE id<>2 AND id<>0 AND id<>NULL;

And here comes the twist: a field<>NULL clause evaluates to UNKNOWN in SQL, which is treated like FALSE in a boolean expression. The desired clause would be id IS NOT NULL, but this is not what is used by SQL. As a consequence the result set is empty.

The result for the non-negated membership test (“IN”) can be explained as well. The IN clause is internally translated to:

SELECT * FROM users WHERE id=2 OR id=0 OR id=NULL;

A field=NULL clause evaluates to UNKNOWN as well. But in this case it is of no consequence, since the clause is joined via OR.

Now that we know what’s going on, how can we fix it? There are two possibilities:

One is to use an outer join:

SELECT u.id FROM users u LEFT OUTER JOIN profiles p ON u.id=p.user_id WHERE p.id IS NULL;

The other option is to filter out all NULL values in the subquery:

SELECT id FROM users WHERE id NOT IN (SELECT user_id FROM profiles WHERE user_id IS NOT NULL);

Conclusion

Both field=NULL and field<>NULL evaluate to UNKNOWN in SQL. Unfortunately, SQL uses these clauses for IN and NOT IN set operations. The solution is to work around it.

Thinking in immutability

The way I learned programming is dictated by objects and states. Besides advantages and on going efforts in the industry I couldn’t help but thinking: immutability is nice. I can use it in some cases and keep it quietly stored in the corner. But it didn’t remain silent.

The way I learned programming is dictated by objects and states. According to my thinking data is packed into objects which are later modified to reflect the changes over time. State and modification are a central modelling technique. For me programming and OOP in particular resolved around this common theme. Mutating objects pervade my thinking even beyond the code into the database and even the architecture of the whole system.
Besides advantages and on going efforts in the industry I couldn’t help but thinking: immutability is nice. I can use it in some cases and keep it quietly stored in the corner.
But it didn’t remain silent.
So I asked myself: How do you construct programs that build upon immutability? How do you (mostly) avoid mutable objects? How do you think in immutability?
The first step was to unlearn. No updates. No modifications. Read, create, copy. That’s about it. No more CRUD only CR. No more SQL updates, only inserts.

Events and logs

To illustrate I use a simple example. Creating, moving, translating and deleting a point. In the traditional OO way it looks like this:

Point p = new Point(40, 30);
p.translateXBy(5);
p.moveTo(10, 20);
p.delete();

Or using SQL might be something like this (omitting primary keys and where clauses here):

insert into points (x, y) values (40, 30)
update points p set p.x = p.x + 5
update points p set p.x = 10, p.y = 20
delete from points

In our memory (or database if we use one) every line updates our point:

Point p = new Point(40, 30); // p = {x: 40, y: 30}
p.translateXBy(5); // p = {x: 45, y: 30}
p.moveTo(10, 20); // p = {x: 10, y: 20}
p.delete(); // p = ?

But what if we do not store the results of the operations but the operations themselves? The events.
Imagine your state changes as a series of events. Just imagine.

new PointCreated(40, 30); // pointEvents = [{created[x: 40, y:30]}]
new PointTranslatedXBy(5); // pointEvents = [{created[x: 40, y:30]}, {translated[x: 5]}]
new PointMovedTo(10, 20); // pointEvents = [{created[x: 40, y:30]}, {translated[x: 5]}, {moved:[x: 10, y:20]}]
new PointDeleted(); // pointEvents = [{created[x: 40, y:30]}, {translated[x: 5]}, {moved:[x: 10, y:20]}, {deleted}]

Even in the database we would just use inserts, no more updates and no more deletes. The events are stored in a log (ironically the database does this the same way). A log is a fully ordered, append only queue. Once we use and store events we have some extras besides immutability: an audit trail, an undo stack, recovering, …
We could externalize the event stream in a message queue and could monitor it, replay it to reproduce bugs, distribute it. The possibilities are endless.

But. That’s all nice and fine. I have one more question: what’s the current state? A user should see the current state and other parts of the system also (not mentioning that I – coming from a mutable state kind of thinking – would also feel better seeing it).

So what’s the current state?

All events applied in order.

OK. But isn’t this expensive doing this all the time?

Yes!

Here another concept from databases helps us: materialized views. We can easily translate in our mind between the new immutable event driven way and the old in place update way. It is just the same data in different representations (if we only are interested in the current state). If we store the current state as a materialized view (or cache) besides the event log we can have both.
Every part of the program which needs the current state gets an immutable copy of it. If this part needs to know when something changes, it can observe the events and act accordingly. This way mutability is pushed to the borders, to the parts where the current state is shown (like the UI layer).

What’s your time, database?

Time is a difficult subject. Especially time zones and daylight saving time. Adding layers makes things worse. Ask your database.

Time is a difficult subject. Especially time zones and daylight saving time. Sounds easy? Well, take a look.
Adding layers in software development complicates the issue and every layer has its own view of time. Let’s start with an example: we write a simple application which stores time based data in a SQL database, e.g. Oracle. The table has a column named ‘at’. Since we don’t want to mess around with timezones, we use a column type without timezone information, in Oracle this would be ‘Date’ if we do not need milliseconds and ‘Timestamp’ if we need them. In Java with plain JDBC we can extract it with a call to ‘getTimestamp’:

Date timestamp = resultSet.getTimestamp("at");

The problem is now we have a timestamp in our local timezone. Where is it converted? Oracle itself has two timezone settings: for the database and for the session. We can query them with:

select DBTIMEZONE from dual;

and

select SESSIONTIMEZONE from dual;

First Oracle uses the time zone set in the session, then the database one. The results from those queries are interesting though: some return a named timezone like ‘Europe/Berlin’, the other return an offset ‘+01:00’. Here a first subtle detail is important: the named timezone uses the offset and the daylight saving time from the respective timezone, the offset setting only uses the offset and no daylight saving. So ‘+01:00’ would just add 1 hour to UTC regardless of the date.
In our example changing these two settings does not change our time conversion. The timezone settings are for another column type: timestamp with (local) timezone.
Going up one layer the JDBC API reveals an interesting tidbit:

Timestamp getTimestamp(int columnIndex)
throws SQLException

Retrieves the value of the designated column in the current row of this ResultSet object as a java.sql.Timestamp object in the Java programming language.

Sounds about right, but wait there’s another method:

Timestamp getTimestamp(int columnIndex,
Calendar cal)
throws SQLException


Retrieves the value of the designated column in the current row of this ResultSet object as a java.sql.Timestamp object in the Java programming language. This method uses the given calendar to construct an appropriate millisecond value for the timestamp if the underlying database does not store timezone information.

Just as in Oracle we can use a named timezone or an offset:

Date timestamp = resultSet.getTimestamp("at", Calendar.getInstance(TimeZone.getTimeZone("GMT+1:00")));

This way we have control over what and how the time is extracted from the database. The next time you work with time based information take a close look. And if you work with Java use Joda Time.

Database Migration Categories

Most long-running projects need to manage changes to the database schema of the system and data migrations in some way. As the system evolves new datatypes/tables and properties/columns are added, some are removed and others are changed. Relationships between objects also change in unpredictable ways so that you have to deal with these changes in some way. Not all changes are equal in nature, so we handle them differently!

One tool we use to manage our database is liquibase. The units of change are called migrations and are logged in the database itself (table “databasemigrations”) so that you can actually see in which state the schema is. Our experience using such tools for several years is very positive because there is no manual work on the database of some production system needed and new installations automatically create the database schema matching the running software. There are however a few situations when you want to do things manually. So we identified three types of changes and defined how to handle them:

1. structural changes

Structural changes modify the database schema but no data. In some cases you have to care about the default values for not null columns. These changes are handled by database management/versioning tools. They are relevant for all instances and specific for each deployed version of the system. The changes are stored with the source code under version control. Most of the time they are needed when extending the functionality of the system and implementing new features. In SQL the typical commands are CREATE, DROP and ALTER.

2. data rule changes

Changes to the way how the data is stored we call data rule changes. Examples for this are changing the representation of an enum from integer to string or a relation from one-to-many to many-to-many. In such a case the schema and importantly existing data has to be changed. For these migrations you do not need explicit ids of an object in the database but you change all entries in the same way according to the new rules. The changes can be applied to each instance of the system that is update to the new database (and software) version. Like structural changes they are executed using the database migration tool and stored under version control. The typical SQL command after the involveld structural changes needed is UPDATE with an where clause and sometimes CASE WHEN statements.

3. data modifications

Sometimes you have to change individual data sets of one instance of a system. That may be because of a bug in the software or corrupted/wrong entries that cannot by fixed using the system itself, e.g. as super user. Here you fix the entries of one instance of the system manually or with a SQL script. You will usually name specific object ids of the database and perform these exact changes only on this instance. It may be necessary to perfom similar tasks on other instances using different object ids. Because of this one-time and instance-specific nature of the changes we do not use a migration tool but some kind of SQL shell. Such manual changes have to be performed with extra caution and need to be thoroughly documented, e.g. in your issue tracker and wiki. If possible use a non-destructive approach and make backups of the data before executing the changes. Typical SQL statements are UPDATE or DELETE containing ids or business keys.

Conclusion

With categories and guidelines above developers can easily figure out how to deal with changes to the database. They can keep the software, database schema and customer data up-to-date, nice and clean over many years while improving and evolving the system and managing several instances running possibly different versions of the system.

How to speed up your ORM queries of n x m associations

What solution (no cache) causes a 45x speedup? Learn the different approaches and how they compare

What causes a speedup like this? (all numbers are in ms)

Disclaimer: the absolute benchmark numbers are for illustration purposes, the relationship and the speedup between the different approaches are important (just for the curious: I measured 500 entries per table in a PostgreSQL database with both Rails 4.1.0 and Grails 2.3.8 running on Java 7 on a recent MBP running OSX 10.8)

Say you have the model classes Book and (Book)Writer which are connected via a n x m table named Authorship:

A typical query would be to list all books with its authors like:

Fowler, Martin: Refactoring

A straight forward way is to query all authorships:

In Rails:

# 1500 ms
Authorship.all.map {|authorship| "#{authorship.writer.lastname}, #{authorship.writer.firstname}: #{authorship.book.title}"}

In Grails:

// 585 ms
Authorship.list().collect {"${it.writer.lastname}, ${it.writer.firstname}: ${it.book.title}"}

This is unsurprisingly not very fast. The problem with this approach is that it causes the famous n+1 select problem. The first option we have is to use eager fetching. In Rails we can use ‘includes’ or ‘joins’. ‘Includes’ loads the associated objects via additional queries, one for authorship, one for writer and one for book.

# 2300 ms
Authorship.includes(:book, :writer).all

‘Joins’ uses SQL inner joins to load the associated objects.

# 1000 ms
Authorship.joins(:book, :writer).all
# returns only the first element
Authorship.joins(:book, :writer).includes(:book, :writer).all

Additional queries with ‘includes’ in our case slows down the whole request but with joins we can more than halve our time. The combination of both directives causes Rails to return just one record and is therefore ruled out.

In Grails using ‘belongsTo’ on the associations speeds up the request considerably.

class Authorship {
    static belongsTo = [book:Book, writer:BookWriter]

    Book book
    BookWriter writer
}

// 430 ms
Authorship.list()

Also we can implement eager loading with specifying ‘lazy: false’ in our mapping which boosts a mild performance increase.

class Authorship {
    static mapping = {
        book lazy: false
        writer lazy: false
    }
}

// 416 ms
Authorship.list()

Can we do better? The normal approach is to use ‘has_many’ associations and query from one side of the n x m association. Since we use more properties from the writer we start from here.

class Writer < ActiveRecord::Base
  has_many :authors
  has_many :books, through: :authors
end

Testing the different combinations of ‘includes’ and ‘joins’ yields interesting results.

# 1525 ms
Writer.all.joins(:books)

# 2300 ms
Writer.all.includes(:books)

# 196 ms
Writer.all.joins(:books).includes(:books)

With both options our request is now faster than ever (196 ms), a speedup of 7.
What about Grails? Adding ‘hasMany’ and the authorship table as a join table is easy.

class BookWriter {
    static mapping = {
        books joinTable:[name: 'authorships', key: 'writer_id']
    }

    static hasMany = [books:Book]
}
// 313 ms, adding lazy: false results in 295 ms
BookWriter.list().collect {"${it.lastname}, ${it.firstname}: ${it.books*.title}"}

The result is rather disappointing. Only a mild speedup (2x) and even slower than Rails.

Is this the most we can get out of our queries?
Looking at the benchmark results and the detailed numbers Rails shows us hints that the query per se is not the problem anymore but the deserialization. What if we try to limit our created object graph and use a model class backed by a database view? We can create a view containing all the attributes we need even with associations to the books and writers.

create view author_views as (SELECT "authorships"."writer_id" AS writer_id, "authorships"."book_id" AS book_id, "books"."title" AS book_title, "writers"."firstname" AS writer_firstname, "writers"."lastname" AS writer_lastname FROM "authorships" INNER JOIN "books" ON "books"."id" = "authorships"."book_id" INNER JOIN "writers" ON "writers"."id" = "authorships"."writer_id")

Let’s take a look at our request time:

# 15 ms
 AuthorView.select(:writer_lastname, :writer_firstname, :book_title).all.map { |author| "#{author.writer_lastname}, #{author.writer_firstname}: #{author.book_title}" }
// 13 ms
AuthorView.list().collect {"${it.writerLastname}, ${it.writerFirstname}: ${it.bookTitle}"}

13 ms and 15 ms. This surprised me a lot. Seeing this in comparison shows how much this impacts performance of our request.

The lesson here is that sometimes the performance can be improved outside of our code and that mapping database results to objects is a costly operation.