Performance – Page 2 – Schneide Blog

How to speed up your ORM queries of n x m associations

What solution (no cache) causes a 45x speedup? Learn the different approaches and how they compare

What causes a speedup like this? (all numbers are in ms)

Disclaimer: the absolute benchmark numbers are for illustration purposes, the relationship and the speedup between the different approaches are important (just for the curious: I measured 500 entries per table in a PostgreSQL database with both Rails 4.1.0 and Grails 2.3.8 running on Java 7 on a recent MBP running OSX 10.8)

Say you have the model classes Book and (Book)Writer which are connected via a n x m table named Authorship:

A typical query would be to list all books with its authors like:

Fowler, Martin: Refactoring

A straight forward way is to query all authorships:

In Rails:

# 1500 ms
Authorship.all.map {|authorship| "#{authorship.writer.lastname}, #{authorship.writer.firstname}: #{authorship.book.title}"}

In Grails:

// 585 ms
Authorship.list().collect {"${it.writer.lastname}, ${it.writer.firstname}: ${it.book.title}"}

This is unsurprisingly not very fast. The problem with this approach is that it causes the famous n+1 select problem. The first option we have is to use eager fetching. In Rails we can use ‘includes’ or ‘joins’. ‘Includes’ loads the associated objects via additional queries, one for authorship, one for writer and one for book.

# 2300 ms
Authorship.includes(:book, :writer).all

‘Joins’ uses SQL inner joins to load the associated objects.

# 1000 ms
Authorship.joins(:book, :writer).all

# returns only the first element
Authorship.joins(:book, :writer).includes(:book, :writer).all

Additional queries with ‘includes’ in our case slows down the whole request but with joins we can more than halve our time. The combination of both directives causes Rails to return just one record and is therefore ruled out.

In Grails using ‘belongsTo’ on the associations speeds up the request considerably.

class Authorship {
    static belongsTo = [book:Book, writer:BookWriter]

    Book book
    BookWriter writer
}

// 430 ms
Authorship.list()

Also we can implement eager loading with specifying ‘lazy: false’ in our mapping which boosts a mild performance increase.

class Authorship {
    static mapping = {
        book lazy: false
        writer lazy: false
    }
}

// 416 ms
Authorship.list()

Can we do better? The normal approach is to use ‘has_many’ associations and query from one side of the n x m association. Since we use more properties from the writer we start from here.

class Writer < ActiveRecord::Base
  has_many :authors
  has_many :books, through: :authors
end

Testing the different combinations of ‘includes’ and ‘joins’ yields interesting results.

# 1525 ms
Writer.all.joins(:books)

# 2300 ms
Writer.all.includes(:books)

# 196 ms
Writer.all.joins(:books).includes(:books)

With both options our request is now faster than ever (196 ms), a speedup of 7.
What about Grails? Adding ‘hasMany’ and the authorship table as a join table is easy.

class BookWriter {
    static mapping = {
        books joinTable:[name: 'authorships', key: 'writer_id']
    }

    static hasMany = [books:Book]
}

// 313 ms, adding lazy: false results in 295 ms
BookWriter.list().collect {"${it.lastname}, ${it.firstname}: ${it.books*.title}"}

The result is rather disappointing. Only a mild speedup (2x) and even slower than Rails.

Is this the most we can get out of our queries?
Looking at the benchmark results and the detailed numbers Rails shows us hints that the query per se is not the problem anymore but the deserialization. What if we try to limit our created object graph and use a model class backed by a database view? We can create a view containing all the attributes we need even with associations to the books and writers.

create view author_views as (SELECT "authorships"."writer_id" AS writer_id, "authorships"."book_id" AS book_id, "books"."title" AS book_title, "writers"."firstname" AS writer_firstname, "writers"."lastname" AS writer_lastname FROM "authorships" INNER JOIN "books" ON "books"."id" = "authorships"."book_id" INNER JOIN "writers" ON "writers"."id" = "authorships"."writer_id")

Let’s take a look at our request time:

# 15 ms
 AuthorView.select(:writer_lastname, :writer_firstname, :book_title).all.map { |author| "#{author.writer_lastname}, #{author.writer_firstname}: #{author.book_title}" }

// 13 ms
AuthorView.list().collect {"${it.writerLastname}, ${it.writerFirstname}: ${it.bookTitle}"}

13 ms and 15 ms. This surprised me a lot. Seeing this in comparison shows how much this impacts performance of our request.

The lesson here is that sometimes the performance can be improved outside of our code and that mapping database results to objects is a costly operation.

Performance Hogs Sometimes Live in Most Unexpected Places

Surprises when measuring performance are common – but sometimes you just can’t believe it.

When we develop software we always apply the best practice of not optimizing prematurely. This plays together with other best practices like writing the most readable code, or YAGNI.

‘Premature’ means different things in different situations. If you don’t have performance problems it means that there is absolutely no point in optimizing code. And if you do have performance problems it means that Thou Shalt Never Guess which code to optimize because software developers are very bad at this. The keyword here is profiling.

Since we don’t like to be “very bad” at something we always try to improve our skills in this field. The skill of guessing which code has to be optimized, or “profiling in your head” is no different in this regard.

So most of the times in profiling sessions, I have a few unspoken guesses at which parts of the code the profiler will point me to. Unfortunately, I have to say that I am very often very surprised by the outcome.

Surprises in performance fixing sessions are common but they are of different quality. One rather BIG surprise was to find out that std::string::find of the C++ standard library is significantly slower (by factor > 10) than its C library counterpart strstr (discovered with gcc-4.4.6 on CentOS 6, verified with eglibc-2.13 and gcc-4.7).

Yes, you read right and you may not believe it. That was my reaction, too, so I wrote a little test program containing only two strings and calls to std::string::find and std::strstr, respectively. The results were – and I’ve no problem repeating myself here – a BIG surprise.

The reason for that is that std::strstr uses a highly optimized string matching algorithm version whereas std::string::find works with straight-forward memory comparison.

So when doing profiling sessions, always be prepared for shaking-your-world-view kind of surprises. They can even come from your beloved and highly regarded standard library.

UPDATE: See this stackoverflow question for more information.

The Grails performance switch: flush.mode=commit

Some default configuration options of Grails are not optimal for all projects.

— Disclaimer —
This optimization requires more manual work and is error prone but isn’t this with most (big) performance improvements?
For it to really work you have to structure your code accordingly and flush explicitly.
—
Recently in our performance measurements of a medium sized Grails project we noticed a strange behavior: every time we executed the same query the time it took increased. It started with 40ms and every time it took 1 ms more. The query was simple like Child.findAllByParent(parent)
The first thought: indexes! We looked at the database (a postgresql db) and we had indexes on the parent column.
Next: maybe the session cache got too large. But session.flush() and session.clear() did not solve that problem.
Another post suggested using a HQL query. Changing to

Child.executeQuery("select new Child(c.name, c.parent) from Child c where parent=:parent", [parent: parent])

had no effect.
Finally after countless more attempts we tried:

session.setFlushMode(FlushMode.COMMIT)

And not even the query executed in constant time it was also 10x faster?!
Hmmm…why?
The default flush mode in Grails is set to AUTO
Which means that before every query made the session is flushed. Every query regardless of the classes effected. The problem is known for hibernate but after 4! years it is still unresolved.
So my question here is: why did Grails chose AUTO as default?

SSD? Don’t think! Just Buy!

SSDs makes everything blazingly fast – even Grails + IDEA development

My personal experience with SSDs began with an Intel X25M that I built into a Lenovo Thinkpad R61. It replaced a Seagate 160 GB 5400rpm which in combination with Windows Vista … well, let’s just say, it wasn’t that fast.

The SSD changed everything. It was not just faster, it was downright awesome! As if I had a completely new computer.

With that in mind I thought about my desktop PC. It’s a little more than 2 year old Windows XP box, Intel Core2Duo 2.7 GHz, 4GB RAM, with a not so slow Samsung HDD. I use it mainly for programming, which is most of the time Grails programming under IntelliJ IDEA.

And let me tell you, the Grails + IDEA combination can get dog slow at times. The start-up time of IDEA alone gives you time to skim over the first three pages of Hacker News and read the latest XKCD.

So the plan was to put an extra SSD into the Windows box and put only programming related stuff on it. This would save me the potential hassle of moving my whole system but would still give me development speed-up.

I had to be a little careful because the standard settings for IDEA’s so-called “system path” and “config path” is in the user’s home directory. (Btw, this settings can be changed in file “idea.properties” which resides in “IDEA_INSTALLATION_DIR\bin”, e.g.: c:\Progam Files\JetBrains\IntelliJ IDEA 9.0.4\bin)

I think you already guessed the result. Three words: fast, faster, SSD. It’s just amazing! IDEA start-up is so fast now, I barely have time for a quick look at the newest headlines on InfoQ.

The next step is of course to put the whole system on SSD but that will probably have to wait until we upgrade the whole company to Win7. Can’t wait… 🙂

Diving into Hibernate’s Query Cache behaviour

Hibernate is a very sophisticated OR-Mapper and as such has some overhead for certain usage patterns or raw queries. Through proper usage of caches (hibernates featured a L1, L2 cache and a query cache) you can get both performance and convenience if everything fits together. When trying to get more of our persistence layer we performed some tests with the query cache to be able to decide if it is worth using for us. We were puzzled by the behaviour in our test case: Despite everything configured properly we never had any cache hits into our query cache using the following query-sequence:

Transaction start
Execute query
Update a table touched by query
Execute query
Execute query
Transaction end

We would expect that step 5 would be a cache hit but in our case it was not. So we dived into the source of the used hibernate release (the 3.3.1 bundled with grails 1.3.5) and browsed the hibernate issue tracker. We found the issue and correlated it to the issues HHH-3339 and HHH-5210. Since the fix was simpler than upgrading grails to a new hibernate release we fixed the issue and replaced the jar in our environment. So far, so good, but in our test step 5 still refused to produce a cache hit. Using the debugger strangely enough provided us a cache hit when analyzing the state of the cache and everything. After some more brooding and some println()'s and sleep()‘s we found the reason for the observed behaviour in the UpdateTimestampsCache (yes, yet another cache!):

	public synchronized void preinvalidate(Serializable[] spaces) throws CacheException {
		//TODO: to handle concurrent writes correctly, this should return a Lock to the client
		Long ts = new Long( region.nextTimestamp() + region.getTimeout() );
		for ( int i=0; i
			if ( log.isDebugEnabled() ) {
				log.debug( "Pre-invalidating space [" + spaces[i] + "]" );
			}
			//put() has nowait semantics, is this really appropriate?
			//note that it needs to be async replication, never local or sync
			region.put( spaces[i], ts );
		}
		//TODO: return new Lock(ts);
	}

The innocently looking statement region.nextTimestamp() + region.getTimeout() essentially means that the query cache for a certain “region” (e.g. a table in simple cases) is “invalid” (read: disabled) for some “timeout” period or until the end of the transaction. This period defaults to 60 seconds (yes, one minute!) and renders the query cache useless within a transaction. For many use cases this may not be a problem but our write heavy application really suffers because it works on very few different tables and thus query caching has no effect. We are still discussing ways to leverage hibernates caches to improve the performance of our app.

Speed up your buildbox, Part IV: Beyond the box

This is the fourth and last part of a series on how to boost your build box without much effort. This episode talks about possible measures to increase the build performance when a single box isn’t enough.

In the first three parts of our effort to speed up our buildbox, we replaced the harddisk with a RAM disk, upgraded the CPU to the top-notch model and installed plenty of fast RAM. This brought the build time down from 03:30 minutes to around 02:00 minutes. The CPU frequency was the biggest time saving factor in our case study. Two minutes is as fast as the build can get for our project without fiddling with the actual build process. It’s sufficient for our case, but it may not for yours.

Even top speed is too slow

Lets assume we maxed out the hardware and still have a build duration far beyond the magical ten minutes mark. What can we do now? There are two viable options at hand if you can exclude the possibility that your build process is really inefficient and needs optimization. In the latter case, it would be better to revise the process instead of the build infrastructure.

Two ways to speed up your build infrastructure

You can go down one or both of two general paths to speed up your build process. To understand the examples, lets assume the build takes 20 minutes to run on your top-notch build box.

Add more build boxes. This is the classical “parallelize it!” approach. It won’t speed up the individual build process, but enable more processes to run at the same time. This approach wont change anything if your team does seldom check-ins, which in itself is an anti-pattern to continuous integration. But if your team commits changes every ten minutes, having at least two build boxes will prevent the second committer from waiting 30 minutes on the CI results. Instead, the results will always be there after 20 minutes. You haven’t exactly sped up your build process, but the maximal waiting time of your committers. For details on the implementation, see below at “Growing a build park“.
Chop up your build process. This is known as “staging” or “pipelining” your build. This won’t speed up the individual build process, either, but deliver certain partial results of your build earlier. Lets assume you can split your build process into four distinct stages: compile, unit test, integration test, package. Whenever a stage yields a result, the comitter gets feedback immediately. In our example, this might be every 5 minutes. This has several disadvantages, as for example discussed in the article “The pipeline of doom” by Julian Simpson, but can lower the waiting time for specific aspects of your build drastically. You haven’t exactly sped up your build process, but the response time for partial results and therefore the average waiting time of your committer. For details on the implementation, see below at “Installing a build pipeline“.

Growing a build park

If you want to reduce the initial waiting delay of a build before it gets processed or increase the throughput of builds, the build farm pattern is your way to go. By adding slave build machines to your build master, you can distribute the workload on more shoulders. The best way to set up your infrastructure is to introduce a dedicated master box that only delegates actual builds to its slaves. The master box handles the archivation of build artifacts and deals with the web server requests, while the slaves only perform build tasks. The master box can be of average power, with increased storage size, while the slaves should be ultra-fast, without the need of big disks. Solid state disks or even RAM disks of the slaves can be tuned to actual workspace sizes, as it is all that needs to be stored there.

Distributed builds with Hudson

The Hudson continuous integration server has a strength in setting up these master/slave scenarios. It’s ridiculously easy to set up a build slave. You basically only need to click on a link to start the slave process. If you happen to have a standard build, everything needed gets downloaded automatically. If you want your slaves to operate automatically, you can install a windows daemon, provide a SSH account or write your own script. Usually, slaves are set up in a matter of minutes without hassle. A great idea is to turn powerful collegue boxes into build slaves (aka CI zombies) by booting an USB stick. The best way to start with master/slave builds is to turn your current PC into a hudson slave right now by using the Java Web Start method.

Installing a build pipeline

If you are interested in early but incomplete feedback from your build box, staging your build will help you out. If partitioned right, you’ll receive a series of answers on specific questions from your build process. The questions may be like:

Will it compile?
Will it pass the unit tests?
Will it function (pass the integration tests)?
Will it blend?

Ok, the last question is rather unlikely to be answered by your build box. The overall build process will not be any faster, but basic safety test results are reported earlier. If you combine this approach with distributed builds, there is the possibility to designate specifically tuned machines to different stages. The Hudson continuous integration server has the ability to tag a slave with different labels. You can then configure your build to only run on slaves with the desired label assigned.

Staged builds with Hudson

Staging with the Hudson continuous integration server isn’t as easy as the master/slave feature, but there are some plugins that allow for more complex setups. You might experience some functionality that’s still under development, but basic staging is possible even today. In combination with specialized slave build boxes, this approach can lower your build duration. It is a a complex endeavour, though.

Conclusion

Once your single build box is maxed out but still not fast enough, you enter a different realm of continuous integration infrastructure setups. Speeding up a build process beyond the single box isn’t as easy as installing more RAM. But with a fair amount of planning, you have a fair chance to improve the situation. Note that you won’t primarily lower build duration, but increase throughput and utilize partitioning and specialization. These are different measures and might not affect the wall clock time of your build. The combination of staging and distribution is the most powerful setup, but will result in the most complex infrastructure to maintain. Before entering this realm, be sure to apply any possible optimization to your build process. Because you’ll not leave that realm again soon.

What’s your story on build optimization beyond the box? Drop us a comment.

Speed up your buildbox, Part III: Memory

This is the third part of a series on how to boost your build box without much effort. This episode talks about the effects of faster and more RAM.

In the first and second part of our effort to speed up our buildbox, we replaced the harddisk with a RAM disk and swapped in a bigger CPU. This brought the build time down from 03:30 minutes to 02:00 minutes.

Boosting the memory

When we began the journey, we wanted to undercut the 02:00 minutes threshold. The last component that directly impacts performance of our box was the memory. We started out with 4 GB of DDR2-800 modules. To have a feeling for the effects, we upgraded to 4 GB of DDR2-1066 first and then added another 4 GB, resulting in 8 GB of RAM. We expected the performance gain to be small, but noticeable. The RAM disk, for example, is directly affected by memory speed.

As much, but faster

The first upgrade brought the first surprise: Upgrading from DDR2-800 to DDR2-1066 modules didn’t change anything. It’s not that the mainboard or CPU doesn’t support the faster RAM, it just seems to be fast enough, despite the data bus clock rate. Our build process still took 02:00 minutes, reproducible and without exception.

Filling all the banks

The mainboard can load up to 16 GB of RAM, but our budget just allowed to buy 8 GB of DDR2-1066 RAM. We installed it and ran the same 32 bit Ubuntu Linux as before. The build process took 02:00 minutes, which was expected now.

Changing to 64bit

We changed to boot harddisk, installed a 64 bit Ubuntu Linux and ran the build again. Still 02:00 minutes. The switch to 64 bit wasn’t a big deal with Java, but some of the included native libraries complained about the change. Recompiling them solved the issue.

Finally reaching the target

As a last measure, we increased the maximum memory of the build JVM to the biggest value it would accept. This was -Xmx2600m, a surplus of 600 MB to the original setting. This sped up the build process by five seconds, it took 01:55 minutes now.

Conclusion and perspective

We’ve reached our anticipated target of less than two minutes build time. We exceeded our original budget of 500 EUR, but bought some parts that finally weren’t used in the build box, but elsewhere. The two parts that made the whole difference were the CPU and some more memory to spend it on the RAM disk.

If you want to speed up your single build box, aim for the CPU/RAM combo and try to install a RAM disk to perform all the work on.

This leads me to the perspective of the next part of the series: If you plugged in the most expensive CPU and enormous amounts of RAM to speed up your buildbox, you still aren’t done. You should invest some time to look into distributed builds. Hudson as our continuous integration server provides nearly instant “build slave” support. With this feature, you can set up a whole build farm to further increase your build throughput.

Stay tuned for “Part IV: Beyond the box”

Speed up your buildbox, Part II: Processor

This is the second part of a series on how to boost your build box without much effort. This episode talks about the effects of different processors.

In the first part of our effort to speed up our buildbox, we replaced the spindle harddisk with a Solid State Disk (SSD) and finally, a RAM disk. This brought the build time down from 03:30 minutes to 02:50 minutes.

The Central Performance Unit

The next step on our journey to a faster buildbox was to replace the processor. Our initial processor was an Intel Core2 Duo E6750 with 2.67 GHz. To our pleasure, the processor socket, namely the LGA775 socket, is extremely versatile in supporting different processors. We had no problem in plugging in faster dual or even quad core processors, except upgrading the BIOS.

Taking the 3 GHz mark

The next processor to try out was an Intel Core2 Duo E8500 with 3.17 GHz operating frequency. The L2 cache went up from 4 MB to 6 MB.

The build time went down immediately from 02:50 minutes to 02:20 minutes. That’s nearly 20 percent less build time. And it’s perfectly linear with the CPU speed increase (also nearly 20 percent).

As a result: Investing in CPU clock power seems to pay off. The higher the frequency, the lower the build time.

Doubling the cores

Fortunately, the LGA775 socket supports quad core processors, too. We plugged in a Core2 Quad Q9550 with 2.8 GHz and ran the build again.

The result was astonishing: Despite the lower frequency, the build time dropped from 02:20 minutes to 02:00 minutes. We can’t really explain this one with basic math like the frequency coupling of the dual cores.

If your build is perfectly multithreaded, something javac isn’t, you’ll notice an even bigger speedup.

To sum it up: you can’t have enough GHz or processor cores when running a build.

Reviewing the result

We replaced the harddisk with RAM and upgraded the processor to meet the current performance threshold. This brought us from a starting 03:30 minutes build time to 02:00 minutes now. The CPU is the major player in this game, so upgrade it first.

Outlook on the third part

But what about the RAM? We really wanted to know what happens when we replace the RAM with bigger and faster one. Read more about this experiment in the third part of the series, coming soon.

Speed up your buildbox, Part I: Introduction & Harddisk

This is the first part of a series on how to boost your build box without much effort. This episode talks about the effects of different harddisks.

We actively use Hudson as our continuous integration server software. It has a nice little feature called “build history trend” that shows the duration of all archived builds. One of our major projects started out small and fast with a build duration of 01:20 minutes. One and a half year later, it reached for the 04:00 minute hurdle. It wasn’t a surprise to us, as the build has more than four times the work now and the hardware staid the same.

But a question emerged: How can we speed up our build?

Applying optimization: The basic maths

We did a quick review of our ant build scripts to ensure there’s nothing fundamentally wrong with them and then decided which road to follow first: Optimizing the build scripts or boosting the hardware? There is only one pragmatic answer to us: boost the hardware as long as it stays reasonable in price. Every optimization in the build script would need its time (which isn’t cheap) and possibly increase the script complexity (which is very expensive later on).

Optimizing the hardware

So we went on the journey to make a fast buildbox even faster. We started out with a dual core processor (2.6 GHz), a decent-but-standard harddisk and 4 GB of memory. We replaced every part on its own to see the effect. The journey includes:

Our goal is to cut our build time down by 50 percent, to a little less than 02:00 minutes. We don’t want to spend more than 500 EUR for new hardware. So now, after this introduction:

Part I: Replacing the harddisk

Our buildbox starts with a more or less normal harddisk (0.5 TB), certified for continuous usage. We could have bought just another normal harddisk of a newer generation, but that doesn’t cut it in our experience (we didn’t verify specifically, though).

Calling the carnivores

If you need to upgrade your harddisk, you can buy yourself a VelociRaptor drive and be pretty much assured that you’ll notice the difference. We had pleasant experiences with this kind of fast-spinning drives before, but this time, we wanted to go a step further and try a fast Solid State Disk (SSD). As you only need to relocate the working directories (called workspaces in hudson terminology) of your projects to the new disk, the capacity isn’t important as long as it’s greater than your project sizes. You can just plug your new disk in the buildbox and format it with a high performance file system. As our buildbox runs on Linux, relocating the workspace is just setting a symbolic link. You do not even tell hudson about it. If you happen to run on Windows, check out the “use custom workspace” setting on your job’s configuration page.

An investment of about 200 EUR and 15 minutes of installation later, we had the result: The build average before was 03:30 minutes and now 03:10 minutes. That’s not a big leap forward, as others found out, too. It’s not that the SSD was bad, it performed exceptionally well in the benchmarks, but the harddisk wasn’t the bottleneck. To further proof our assumption, we installed the fastest harddrive you can get: the RAM disk.

Only pretend to use the disk

Linux (like other unixoid systems) has the great feature of an emulated harddisk right in your memory. On Debian/Ubuntu systems, this emulated drive is mounted at /dev/shm and has a capacity of half your total physical memory. It grows dynamically, so you don’t have to worry about its initial size. But you have to check if your workspace fits into it. Our buildbox had 4 GB of RAM and 2 GB were enough to contain the hudson workspace. We configured hudson to build there (you can use symbolic links or the “custom workspace” setting as shown in the picture) and got the result: The build average went down to 02:50 minutes.

Review on the results

That’s as far as we could speed up our buildbox by just replacing the harddisk. Down from 03:30 minutes to 02:50 minutes, a reduction of 40 seconds or 20 percent. In fact, we even cheated as the buildbox doesn’t use an harddisk anymore for building. With Linux, it’s incredibly easy to utilize a RAM disk as long as you have enough RAM to loan. For Windows systems, there are several software products that can do the same. If you don’t want to loan your RAM, you can look into HyperDrives, but for a price!

So we conclude that the fastest harddisk is an emulated one and even then, its effect on the build time is limited.

Stay tuned for the next episode of our journey to a faster buildbox, when we apply a faster CPU.

	Writing Integration… on Every Unit Test Is a Stage Pla…
	mariuselvert on C# is very strict about modify…
	Anonymous on C# is very strict about modify…
	Anonymous on Cache configuration with WildF…
	Miq on Nested queries like N+1 in pra…