Lazy Initialization/evaluation can help you performance and memory-consumption-wise

One way to improve performance when working with many objects or large data structures is lazy initialization or evaluation. The concept is to defer expensive operations to the latest moment possible – ideally never. I want to show some examples how to use lazy techniques in java and give you pointers to other languages where it is even easier and part of the core language.

One use case was a large JTable showing hundreds of domain objects consisting of meta-data and measured values. Initially our domain objects held both types of data in memory even though only part of the meta data was displayed in the table. Building the table took several seconds and we were limited to showing a few hundred entries at once. After some analysis we changed our implementation to roughly look like this:

public class DomainObject {
  private final DataParser parser;
  private final Map<String, String> header = new HashMap<>();
  private final List<Data> data = new ArrayList<>();

  public DomainObject(DataParser aParser) {
    parser = aParser;
  }

  public String getHeaderField(String name) {
    // Here we lazily parse and fill the header map
    if (header.isEmpty()) {
      header.addAll(parser.header());
    }
    return header.get(name);
  }
  public Iterable<Data> getMeasurementValues() {
    // again lazy-load and parse the data
    if (data.isEmpty()) {
      data.addAll(parser.measurements());
    }
    return data;
  }
}

This improved both the time to display the entries and the number of entries we could handle significantly. All the data loading and parsing was only done when someone wanted the details of a measurement and double-clicked an entry.

A situation where you get lazy evaluation in Java out-of-the-box is in conditional statements:

// lazy and fast because the expensive operation will only execute when needed
if (aCondition() && expensiveOperation()) { ... }

// slow order (still lazy evaluated!)
if (expensiveOperation() && aCondition()) { ... }

Persistence frameworks like Hibernate often times default to lazy-loading because database access and data transmission is quite costly in general.

Most functional languages are built around lazy evaluation of and their concept of functions as first class citizens and isolated/minimized side-effects support lazyness very well. Scala as a hybrid OO/functional language introduces the lazy keyword to simplify typical java-style lazy initialization code like above to something like this:

public class DomainObject(parser: DataParser) {
  // evaluated on first access
  private lazy val header = { parser.header() }

  def getHeaderField(name : String) : String = {
    header.get(name).getOrElse("")
  }

  // evaluated on first access
  lazy val measurementValues : Iterable[Data] = {
    parser.measurements()
  }
}

Conclusion
Lazy evaluation is nothing new or revolutionary but a very useful tool when dealing with large datasets or slow resources. There may be many situations where you can use it to improve performance or user experience using it.
The downsides are a bit of implementation cost if language support is poor (like in Java) and some cases where the application can feel more responsive with precomputed/prefetched values when the user wants to see the details.

Checking preconditions in advance vs. on demand vs. exceptions

Usually, it is good practice to check certain preconditions before applying operations to input data. This is often referred to as defensive programming. Many people are used to lines like:

public void preformOn(String foo) {
  if (!myMap.containsKey(foo)) {
    // handle it correctly
    return;
  }
  // do something with the entry
  myMap.get(foo).performOperation();
}

While there is nothing wrong with such kind of “in advance checking” it may have performance implications – especially when IO is involved.

We had a problem some time ago when working with some thousand wrappers for File objects. The wrappers checked if the given File object actually is a file using the innocent isFile()-method in the constructor which caused hard disk access each time. So building our collection of wrapped files took quite some time (dozens of seconds) and our client complained (rightfully so!) about the performance. Once the collection was built the operations were fast because no checking was needed anymore.

Our first optimization step was deferring the check to the point where the file was actually used. This sped up the creation of the wrappers so it was barely noticeable but processing a bunch of elements took longer because of additional disk accesses. Even though this approach may work for a plethora of situations for our typical use cases the effect of this optimization was not enough.

So we looked at our problem from another perspective: The vast majority of file handles were actually existing and readable files and directories and foreign/unknown files were the exception. Because of this fact we chose to simply leave out any kind of checks and handle the exceptions! Exception handling is often referred to as slow but if exceptions are rare it can make a difference in some orders of magnitude. Our speed up using this approach was enourmous and the client was happy about sub-second responsiveness for his typical operations. In addition we think that the code now expresses more cleary that irregular files really are the exception and not the rule for this particular code.

Conclusion

There are different approaches to handling of parameters and input data. Depending on the cost of the check and the frequency of special input different strategies may prove beneficial both in expressing your intent and the perceived performance of your application.

Basic Image Processing Tasks with OpenCV

2D detectors and scientific CCD cameras produce many megabytes of image data. Open source library OpenCV is highly recommended as your work horse for all kinds of image processing tasks.

For one of our customers in the scientific domain we do a lot of integration of pieces of hardware into the existing measurement- and control network. A good part of these are 2D detectors and scientific CCD cameras, which have all sorts of interfaces like ethernet, firewire and frame grabber cards. Our task is then to write some glue software that makes the camera available and controllable for the scientists.

One standard requirement for us is to do some basic image processing and analytics. Typically, this entails flipping the image horizontally and/or vertically, rotating the image around some multiple of 90 degrees, and calculcating some statistics like standard deviation.

The starting point there is always some image data in memory that has been acquired from the camera. Most of the time the image data is either gray values (8, or 16 bit), or RGB(A).

As we are generally not falling victim to the NIH syndrom we use open source image processing librarys. The first one we tried was CImg, which is a header-only (!) C++ library for image processing. The header-only part is very cool and handy, since you just have to #include <CImg.h> and you are done. No further dependencies. The immediate downside, of course, is long compile times. We are talking about > 40000 lines of C++ template code!

The bigger issue we had with CImg was that for multi-channel images the memory layout is like this: R1R2R3R4…..G1G2G3G4….B1B2B3B4. And since the images from the camera usually come interlaced like R1G1B1R2G2B2… we always had to do tricks to use CImg on these images correctly. These tricks killed us eventually in terms of performance, since some of these 2D detectors produce lots of megabytes of image data that have to be processed in real time.

So OpenCV. Their headline was already very promising:

OpenCV (Open Source Computer Vision) is a library of programming functions for real time computer vision.

Especially the words “real time” look good in there. But let’s see.

Image data in OpenCV is represented by instances of class cv::Mat, which is, of course, short for Matrix. From the documentation:

The class Mat represents an n-dimensional dense numerical single-channel or multi-channel array. It can be used to store real or complex-valued vectors and matrices, grayscale or color images, voxel volumes, vector fields, point clouds, tensors, histograms.

Our standard requirements stated above can then be implemented like this (gray scale, 8 bit image):

void processGrayScale8bitImage(uint16_t width, uint16_t height,
                               const double& rotationAngle,
                               uint8_t* pixelData)
{
  // create cv::Mat instance
  // pixel data is not copied!
  cv::Mat img(height, width, CV_8UC1, pixelData);

  // flip vertically
  // third parameter of cv::flip is the so-called flip-code
  // flip-code == 0 means vertical flipping
  cv::Mat verticallyFlippedImg(height, width, CV_8UC1);
  cv::flip(img, verticallyFlippedImg, 0);

  // flip horizontally
  // flip-code > 0 means horizontal flipping
  cv::Mat horizontallyFlippedImg(height, width, CV_8UC1);
  cv::flip(img, horizontallyFlippedImg, 1);

  // rotation (a bit trickier)
  // 1. calculate center point
  cv::Point2f center(img.cols/2.0F, img.rows/2.0F);
  // 2. create rotation matrix
  cv::Mat rotationMatrix =
    cv::getRotationMatrix2D(center, rotationAngle, 1.0);
  // 3. create cv::Mat that will hold the rotated image.
  // For some rotationAngles width and height are switched
  cv::Mat rotatedImg;
  if ( (rotationAngle / 90.0) % 2 != 0) {
    // switch width and height for rotations like 90, 270 degrees
    rotatedImg =
      cv::Mat(cv::Size(img.size().height, img.size().width),
              img.type());
  } else {
    rotatedImg =
      cv::Mat(cv::Size(img.size().width, img.size().height),
              img.type());
  }
  // 4. actual rotation
  cv::warpAffine(img, rotatedImg,
                 rotationMatrix, rotatedImg.size());

  // save into TIFF file
  cv::imwrite("myimage.tiff", gray);
}

The cool thing is that almost the same code can be used for our other image types, too. The only difference is the image type for the cv::Mat constructor:


8-bit gray scale: CV_U8C1
16bit gray scale: CV_U16C1
RGB : CV_U8C3
RGBA: CV_U8C4

Additionally, the whole thing is blazingly fast! All performance problems gone. Yay!

Getting basic statistical values is also a breeze:

void calculateStatistics(const cv::Mat& img)
{
  // minimum, maximum, sum
  double min = 0.0;
  double max = 0.0;
  cv::minMaxLoc(img, &min, &max);
  double sum = cv::sum(img)[0];

  // mean and standard deviation
  cv::Scalar cvMean;
  cv::Scalar cvStddev;
  cv::meanStdDev(img, cvMean, cvStddev);
}

All in all, the OpenCV experience was very positive, so far. They even support CMake. Highly recommended!

Use Boost’s Multi Index Container!

Boost’s multi index container is a very cool and useful piece of code. Make it a part of your toolbox. You can start slowly by replacing uses of std::set and std::multiset with simple boost::multi_index_containers.

Sometimes, after you have used a special library or other special programming tool for a job, you forget about it because you don’t have that specific use case anymore. Boost’s multi_index container could fall in this category because you don’t have to hold data in memory with the need to access it by different keys all the time.

Therefore, this post is intended to be a reminder for c++ programmers that there exists this pretty cool thing called boost::multi_index_container and that you can use it in more situations than you would think at first.

(If you’re already using it on a regular basis you may stop here, jump directly to the comments and tell us about your typical use cases.)

I remember when I discovered boost::multi_index_container I found it quite intimidating at first sight. All those templates that are used in sometimes weird ways can trigger that feeling if you are not a template metaprogramming specialist (i.e. haven’t yet read Andrei Alexandrescu’s book “Modern C++ Design” ).

But if you look at it after you fought your way through the documentation and after your unit test is green that tests your first example, it doesn’t look that complicated anymore.

My latest use case for boost::multi_index_container was data objects that should be sorted by two different date-times. (For dates and times we use boost::date_time, of course). At first, the requirement was to store the objects sorted by one date time. I used a std::set for that with a custom comparator. Everything was fine.

With changing requirements it became necessary to retrieve objects by another date time, too. I started to use another std::set with a different comparator but then I remembered that there was some cool container somewhere in boost for which you can define multiple indices ….

After I had set it up with the two date time indices, the code also looked much cleaner because in order to update one object with a new time stamp I could just call container->replace(…) instead of fiddling around with the std::set.

Furthermore, I noticed that setting up a boost::multi_index_container with a specific key makes it much clearer what you intend with this data structure than using a std::set with a custom comparator. It is not that much more typing effort, and you can practice template metaprogramming a little bit 🙂

Let’s compare the two implementations:

#include <boost/shared_ptr.hpp>
#include <boost/date_time/posix_time/posix_time.hpp>
using boost::posix_time::ptime;

// objects of this class should be stored
class MyDataClass
{
  public:
    const ptime& getUpdateTime() const;
    const ptime& getDataChangedTime() const;

  private:
    ptime _updateTimestamp;
    ptime _dataChangedTimestamp;
};
typedef boost::shared_ptr<MyDataClass> MyDataClassPtr;

Now the definition of a multi index container:

#include <boost/multi_index_container.hpp>
#include <boost/multi_index/ordered_index.hpp>
#include <boost/multi_index/mem_fun.hpp>
using namespace boost::multi_index;

typedef multi_index_container
<
  MyDataClassPtr,
  indexed_by
  <
    ordered_non_unique
    <
      const_mem_fun<MyDataClass, 
        const ptime&, 
        &MyDataClass::getUpdateTime>
    >
  >
> MyDataClassContainer;

compared to std::set:

#include <set>

// we need a comparator first
struct MyDataClassComparatorByUpdateTime
{
  bool operator() (const MyDataClassPtr& lhs, 
                   const MyDataClassPtr& rhs) const
  {
    return lhs->getUpdateTime() < rhs->getUpdateTime();
  }
};
typedef std::multiset<MyDataClassPtr, 
                      MyDataClassComparatorByUpdateTime> 
   MyDataClassSetByUpdateTime;

What I like is that the typedef for the multi index container reads almost like a sentence. Besides, it is purely declarative (as long as you get away without custom key extractors), whereas with std::multiset you have to implement the comparator.

In addition to being a reminder, I hope this post also serves as motivation to get to know boost::multi_index_container and to make it a part of your toolbox. If you still have fears of contact, start small by replacing usages of std::set/multiset.

Performance Hogs Sometimes Live in Most Unexpected Places

Surprises when measuring performance are common – but sometimes you just can’t believe it.

When we develop software we always apply the best practice of not optimizing prematurely. This plays together with other best practices like writing the most readable code, or YAGNI.

‘Premature’ means different things in different situations. If you don’t have performance problems it means that there is absolutely no point in optimizing code. And if you do have performance problems it means that Thou Shalt Never Guess which code to optimize because software developers are very bad at this. The keyword here is profiling.

Since we don’t like to be “very bad” at something we always try to improve our skills in this field. The skill of guessing which code has to be optimized, or “profiling in your head” is no different in this regard.

So most of the times in profiling sessions, I have a few unspoken guesses at which parts of the code the profiler will point me to. Unfortunately, I have to say that  I am very often very surprised by the outcome.

Surprises in performance fixing sessions are common but they are of different quality. One rather BIG surprise was to find out that std::string::find of the C++ standard library is significantly slower (by factor > 10) than its C library counterpart strstr (discovered with gcc-4.4.6 on CentOS 6, verified with eglibc-2.13 and gcc-4.7).

Yes, you read right and you may not believe it. That was my reaction, too, so I wrote a little test program containing only two strings and calls to std::string::find and std::strstr, respectively. The results were – and I’ve no problem repeating myself here – a BIG surprise.

The reason for that is that std::strstr uses a highly optimized string matching algorithm version whereas std::string::find works with straight-forward memory comparison.

So when doing profiling sessions, always be prepared for shaking-your-world-view kind of surprises. They can even come from your beloved and highly regarded standard library.

UPDATE: See this stackoverflow question for more information.

GORM-Performance with Collections

The other day I was looking to improve the performance of specific parts of our Grails application. I quickly found the typical bottleneck in database centric Grails apps: Too many queries were executed because GORM hides away database queries by its built-in persistence methods for domain objects and the extremely nice dynamic finders. In search for improvements and places to use GORM/Hibernate caching I stumbled upon a very good and helpful presentation on GORM-performance in general and especially collection usage. Burt Beckwith presents some common problems and good patterns to overcome them in his SpringOne 2GX talk. I highly recommend having a thorough look at his presentation.

Nevertheless, I want to summarize his bottom line here: GORM does provide a nice abstraction from relational databases but this abstraction is leaky at times. So you have to know exactly how the stuff in your domain classes is mapped. Be especially careful it collections tend to become “large” because performance will suffer extremely. We already observed a significant performance degradation for some dozen elements; your mileage may vary. For many simple modifications on a collection all its elements have to be loaded from the database!
Instead of using hasMany/belongsTo just add a back reference to the domain object your object belongs to. With the collection you lose cascading delete and some GORM functionality but you can still use dynamic finders and put the functionality to manage associations yourself into respective classes. This may be a large gain in specific cases!

The Grails performance switch: flush.mode=commit

Some default configuration options of Grails are not optimal for all projects.

— Disclaimer —
This optimization requires more manual work and is error prone but isn’t this with most (big) performance improvements?
For it to really work you have to structure your code accordingly and flush explicitly.

Recently in our performance measurements of a medium sized Grails project we noticed a strange behavior: every time we executed the same query the time it took increased. It started with 40ms and every time it took 1 ms more. The query was simple like Child.findAllByParent(parent)
The first thought: indexes! We looked at the database (a postgresql db) and we had indexes on the parent column.
Next: maybe the session cache got too large. But session.flush() and session.clear() did not solve that problem.
Another post suggested using a HQL query. Changing to

Child.executeQuery("select new Child(c.name, c.parent) from Child c where parent=:parent", [parent: parent])

had no effect.
Finally after countless more attempts we tried:

session.setFlushMode(FlushMode.COMMIT)

And not even the query executed in constant time it was also 10x faster?!
Hmmm…why?
The default flush mode in Grails is set to AUTO
Which means that before every query made the session is flushed. Every query regardless of the classes effected. The problem is known for hibernate but after 4! years it is still unresolved.
So my question here is: why did Grails chose AUTO as default?

SSD? Don’t think! Just Buy!

SSDs makes everything blazingly fast – even Grails + IDEA development

My personal experience with SSDs began with an Intel X25M that I built into a Lenovo Thinkpad R61. It replaced a Seagate 160 GB 5400rpm which in combination with Windows Vista … well, let’s just say, it wasn’t that fast.

The SSD changed everything. It was not just faster, it was downright awesome! As if I had a completely new computer.

With that in mind I thought about my desktop PC. It’s a little more than 2 year old Windows XP box, Intel Core2Duo 2.7 GHz, 4GB RAM, with a not so slow Samsung HDD. I use it mainly for programming, which is most of the time Grails programming under IntelliJ IDEA.

And let me tell you, the Grails + IDEA combination can get dog slow at times. The start-up time of IDEA alone gives you time to skim over the first three pages of Hacker News and read the latest XKCD.

So the plan was to put an extra SSD into the Windows box and put only programming related stuff on it. This would save me the potential hassle of moving my whole system but would still give me development speed-up.

I had to be a little careful because the standard settings for IDEA’s so-called “system path” and “config path” is in the user’s home directory. (Btw, this settings can be changed in file “idea.properties” which resides in “IDEA_INSTALLATION_DIR\bin”, e.g.: c:\Progam Files\JetBrains\IntelliJ IDEA 9.0.4\bin)

I think you already guessed the result. Three words: fast, faster, SSD. It’s just amazing! IDEA start-up is so fast now, I barely have time for a quick look at the newest headlines on InfoQ.

The next step is of course to put the whole system on SSD but that will probably have to wait until we upgrade the whole company to Win7. Can’t wait… 🙂

Is Groovy++ already doomed?

<disclaimer>I really like Groovy and other cool languages like Scala, Fantom, Gosu or Clojure targetting the JVM.</disclaimer>

I know the title is a bit provocative but I want to express a concern regarding Groovy++. In my perception most people think of Groovy++ as an extension for Groovy which trades dynamic dispatching for static typing and dispatching yielding performance. So you can just look for hot spots in your code and resolve them with some annotations. Sounds nice, doesn’t it?.

That seems to be the promise of Groovy++ but it isn’t. Alex Tkachman, the founder of the Groovy++ project states this clearly in this comment to an issue with Groovy++: “100% compatibility with regular Groovy is nice when possible and we do our best to keep it but it is not a must.”.

Imho the mentioned issue together with this statement reduces the target audience to a few people who think of Groovy++ as a better Java, not a faster and type-safe Groovy where needed. I do not think there are too many people thinking that way. I think wide adoption of such a Groovy++ will not happen given the alternatives mentioned in the disclaimer above and Groovy itself. I hope they will strive for 100% compatibility with Groovy…

Speed up your buildbox, Part IV: Beyond the box

This is the fourth and last part of a series on how to boost your build box without much effort. This episode talks about possible measures to increase the build performance when a single box isn’t enough.

© Friedberg - Fotolia.comIn the first three parts of our effort to speed up our buildbox, we replaced the harddisk with a RAM disk, upgraded the CPU to the top-notch model and installed plenty of fast RAM. This brought the build time down from 03:30 minutes to around 02:00 minutes. The CPU frequency was the biggest time saving factor in our case study. Two minutes is as fast as the build can get for our project without fiddling with the actual build process. It’s sufficient for our case, but it may not for yours.

Even top speed is too slow

Lets assume we maxed out the hardware and still have a build duration far beyond the magical ten minutes mark. What can we do now? There are two viable options at hand if you can exclude the possibility that your build process is really inefficient and needs optimization. In the latter case, it would be better to revise the process instead of the build infrastructure.

Two ways to speed up your build infrastructure

You can go down one or both of two general paths to speed up your build process. To understand the examples, lets assume the build takes 20 minutes to run on your top-notch build box.

  • Add more build boxes. This is the classical “parallelize it!” approach. It won’t speed up the individual build process, but enable more processes to run at the same time. This approach wont change anything if your team does seldom check-ins, which in itself is an anti-pattern to continuous integration. But if your team commits changes every ten minutes, having at least two build boxes will prevent the second committer from waiting 30 minutes on the CI results. Instead, the results will always be there after 20 minutes. You haven’t exactly sped up your build process, but the maximal waiting time of your committers. For details on the implementation, see below at “Growing a build park“.
  • Chop up your build process. This is known as “staging” or “pipelining” your build. This won’t speed up the individual build process, either, but deliver certain partial results of your build earlier. Lets assume you can split your build process into four distinct stages: compile, unit test, integration test, package. Whenever a stage yields a result, the comitter gets feedback immediately. In our example, this might be every 5 minutes. This has several disadvantages, as for example discussed in the article “The pipeline of doom” by Julian Simpson, but can lower the waiting time for specific aspects of your build drastically. You haven’t exactly sped up your build process, but the response time for partial results and therefore the average waiting time of your committer. For details on the implementation, see below at “Installing a build pipeline“.

Growing a build park

If you want to reduce the initial waiting delay of a build before it gets processed or increase the throughput of builds, the build farm pattern is your way to go. By adding slave build machines to your build master, you can distribute the workload on more shoulders. The best way to set up your infrastructure is to introduce a dedicated master box that only delegates actual builds to its slaves. The master box handles the archivation of build artifacts and deals with the web server requests, while the slaves only perform build tasks. The master box can be of average power, with increased storage size, while the slaves should be ultra-fast, without the need of big disks. Solid state disks or even RAM disks of the slaves can be tuned to actual workspace sizes, as it is all that needs to be stored there.

Distributed builds with Hudson

The Hudson continuous integration server has a strength in setting up these master/slave scenarios. It’s ridiculously easy to set up a build slave. You basically only need to click on a link to start the slave process. If you happen to have a standard build, everything needed gets downloaded automatically. If you want your slaves to operate automatically, you can install a windows daemon, provide a SSH account or write your own script. Usually, slaves are set up in a matter of minutes without hassle. A great idea is to turn powerful collegue boxes into build slaves (aka CI zombies) by booting an USB stick. The best way to start with master/slave builds is to turn your current PC into a hudson slave right now by using the Java Web Start method.

Installing a build pipeline

If you are interested in early but incomplete feedback from your build box, staging your build will help you out. If partitioned right, you’ll receive a series of answers on specific questions from your build process. The questions may be like:

  1. Will it compile?
  2. Will it pass the unit tests?
  3. Will it function (pass the integration tests)?
  4. Will it blend?

Ok, the last question is rather unlikely to be answered by your build box. The overall build process will not be any faster, but basic safety test results are reported earlier. If you combine this approach with distributed builds, there is the possibility to designate specifically tuned machines to different stages. The Hudson continuous integration server has the ability to tag a slave with different labels. You can then configure your build to only run on slaves with the desired label assigned.

Staged builds with Hudson

Staging with the Hudson continuous integration server isn’t as easy as the master/slave feature, but there are some plugins that allow for more complex setups. You might experience some functionality that’s still under development, but basic staging is possible even today. In combination with specialized slave build boxes, this approach can lower your build duration. It is a a complex endeavour, though.

Conclusion

Once your single build box is maxed out but still not fast enough, you enter a different realm of continuous integration infrastructure setups. Speeding up a build process beyond the single box isn’t as easy as installing more RAM. But with a fair amount of planning, you have a fair chance to improve the situation. Note that you won’t primarily lower build duration, but increase throughput and utilize partitioning and specialization. These are different measures and might not affect the wall clock time of your build. The combination of staging and distribution is the most powerful setup, but will result in the most complex infrastructure to maintain. Before entering this realm, be sure to apply any possible optimization to your build process. Because you’ll not leave that realm again soon.

What’s your story on build optimization beyond the box? Drop us a comment.