A Small XML Builder in Ruby

From a C++ point of view, i.e. the statically typed world with no “dynamic” features that deserved the name, I guess you would all agree that languages like Groovy or Ruby are truly something completely different. Having strong C++ roots myself, my first Grails project gave me lots of eye openers on some nice “dynamic” possibilities. One of the pretty cool things I encountered there was the MarkupBuilder. With it you can just write XML as if it where normal Groovy Code. Simple and just downright awesome.

The other day in yet another C++ project I was again faced with the task to generate some XML from text file. And, sure enough, my thoughts wandered to the good days in the Grails project where I could just instantiate the MarkupBuilder… But wait! I remembered that a colleague had already done some scripting stuff with Ruby, so the language was already kind of introduced into the project. And despite the fact that it was a new language for him he did some heavy lifting with it in just no time (That sure does not come as a big surprise all you Ruby folks out there).

So if Ruby is such a cool language there must be something like a markup builder in it, right? Yes there is, well, sort of. Unfortunately, it’s not part of the language package and you first have to install a thing called gems to even install the XML builder package. Being in a project with tight guidelines when it comes to external dependencies and counting in the fact that we had no patience to first having to learn what Ruby gems even are, my colleague and I decided to hack our own small XML builder (and of course, just for the fun of it). I mean hey, it’s Ruby, everything is supposed to be easy in Ruby.

Damn right it is! Here is what we came up with in what was maybe an hour or so:

class XmlGen
   def initialize
      @xmlString = ""
      @indentStack = Array.new
   end

   def method_missing(tagId, attr = {})
      argList = attr.map { |key, value|
         "#{key}=\"#{value}\""
      }.reverse.join(' ')

      @xmlString << @indentStack.join('') 
      @xmlString << "<" << tagId.to_s << " " << argList
      if block_given?
         @xmlString << ">\n"
         @indentStack.push "\t"
         yield
         @indentStack.pop
         @xmlString << @indentStack.join('') << "</" << tagId.to_s << ">\n"
      else
         @xmlString << "/>\n"
      end
      self
   end

   def to_s
      @xmlString
   end
end

And here is how you can use it:

xml = XmlGen.new
xml.FirstXmlTag {
   xml.SubTagOne( {'attribute1' => 'value1'} ) {
      someCollection.each { |item|
         xml.CollectionTag( {'itemId' => item.id} )
      }
   }
}

It’s not perfect, it’s not optimized in any way and it may not even be the Ruby way. But hey, it served our needs perfectly, it was a pretty cool Ruby experience, and it sure is not the last piece of Ruby code in this project.

Always be aware of the charset encoding hell

Most developers already struggled with textual data from some third party system and getting garbage special characters and the like because of wrong character encodings.  Some days ago we encountered an obscure problem when it was possible to login into one of our apps from the computer with the password database running but not from other machines using the same db.  After diving into the problem we found out that they SHA-1 hashes generated from our app were slightly different. Looking at the code revealed that platform encoding was used and that lead to different results:platform-encoding

The apps were running on Windows XP and Windows 2k3 Server respectively and you would expect that it would not make much of a difference but in fact it did!

Lesson:

Always specify the encoding explicitly, when exchanging character data with any other system. Here are some examples:

  • String.getBytes(“utf-8”), new Printwriter(file, “ascii”) in Java
  • HTML-Forms with attribute accept-charset="ISO-8859-1"
  • In XML headers <?xml version="1.0" encoding="ISO-8859-15"?>
  • In your Database and/or JDBC driver
  • In your file format documentation
  • In LaTeX documents
  • everywhere where you can provide that info easily (e.g. as a comment in a config file)

Problems with character encodings seem to appear every once in a while either as end user, when your umlauts get garbled or as a programmer that has to deal with third party input like web forms or text files.

The text file rant

After stumbling over an encoding problem *again* I thought a bit about the whole issue and some of my thought manifested in this rant about text files. I do not want to blame our computer science predecessors for inventing and using restricted charsets like ASCII or iso8859. Nobody has forseen the rapid development of computers and their worldwide adoption and use in everyday life and thus need for an extensible charset (think of the addition of new symbols like the €), let aside performance and memory considerations. The problem I see with text files is that there is no standard way to describe the used encoding. Most text files just leave it to the user to guess what the encoding might be whereas almost all binary file formats feature some kind of defined header with metadata about the content, e.g. bit depth and compression method in image files. For text files you usually have to use heuristical tools which work  more or less depending on the input.

A standardized header for text files right from the start would have helped to indicate the encoding and possibly language or encoding version information of the text and many problems we have today would not exist. The encoding attribute in the XML header or the byte order mark in UTF-8 are workarounds for the fundamental problem of a missing text file header.

Give open source some love back!

Like many others our work is enabled by open source software. We make a heavy use of the several great open source projects out there. Since they help us doing our business and doing it in a productive way, we want to give some love aehmm work back. So we decided to dedicate one day per month to open source contributions. These can be bug fixes, new features, even documentation or bug reports. I believe that every contribution helps an open source project and many projects need help.
The whole development team will work on projects they like. One day per month does not sound much but I think even starting small helps. And maybe you can suggest a similar day in your company, too ?
Besides the obvious boost in developer motivation (and therefore productivity) there are several things your company will benefit from:

  • help in your own projects: fixing bugs in the open source projects you use is like fixing bugs in your own project
  • image for your company: being active in open source gives a better image regarding potential future employees and also shows responsibility in the field they work in
  • PR for your company and an edge over your competition: writing about your contributions and your insights in your company blog, remember: out teach your competition

So get your company to spend just one day per month or so for open source. It may not be much but every little bit helps!

Evil operator overloading of the day

The other day we encountered a strange stack overflow when cout-ing an instance of a custom class. The stream output operator << was overloaded for the class to get a nice output, but since the class had only two std::string attributes the implementation was very simple:

using namespace std;

class MyClass
{
   public:
   ...
   private:
      string stringA_;
      string stringB_;

   friend ostream& operator << (ostream& out, const MyClass& myClass);
};

ostream& operator << (ostream& out, const MyClass& myClass)
{
   return out << "MyClass (A: " << myClass.stringA_ 
              <<", B: " << myClass.stringB_ << ")"  << std::endl;
}

Because the debugger pointed us to a completely separate code part, our first thought was that maybe some old libraries had been accidently linked or some memory got corrupted somehow. Unfortunately, all efforts in that direction lead to nothing.

That was the time when we noticed that using old-style printf instead of std::cout did work just fine. Hm..

So back to that completely separate code part. Is it really so separate? And what does it do anyway?

We looked closer and after a few minutes we discovered the following code parts. Just look a little while before you read on, it’s not that difficult:

// some .h file somewhere in the code base that somehow got included where our stack overflow occurred:

...
typedef std::string MySpecialName;
...
ostream& operator << (ostream& out, const MySpecialName& name);

// and in some .cpp file nearby

...
ostream& operator << (ostream& out, const MySpecialName& name)
{
   out << "MySpecialName: " << name  << std::endl;
}
...

Got it? Yes, right! That overloaded out-stream operator << for MySpecialName together with that innocent looking typedef above put your program right into death by segmentation fault.  Overloading the out-stream operator for a given type can be a good idea – as long as that type is not a typedef of std::string. The code above not only leads to the operator << recursively calling itself but also sucks every other part of the code into its black hole which happens to include the .h file and wants to << a std::string variable.

You just have to love C++…

Branching support missing in JIRA

JIRA is a really great tool when it comes to issue tracking. It is powerful, extensible, usable and widespread. We and many of our clients use it for years and are quite satisfied coming from other tools like Bugzilla.

One thing we find is missing though is the concept of branching. In JIRA you have projects and can define a roadmap consisting of versions that follow each other. Many software products require sooner or later a development line which is only maintained getting bug and security fixes while feature development happens on a separate branch.

Tracking issues for a software with different branches is a bit more demanding because you have to identify the branches one issue affects and possibly separately fix them on each branch. One and the same issue could have different resolutions per branch, i.e. invalid if a broken functionality does not exist anymore. Nevertheless, all branches have to be checked, likely in different time frames.

Unfortunately JIRA does not support the notion of branches. You have to emulate the behaviour using different schemes like:

  • One issue per version that represents a branch
  • Multiple fix versions for an issue
  • Subtasks for an issue or issue links to ckeck and fix the issue for different versions

To me all those are workarounds which lack polished usability and add overhead to your issue management. Real branching support could help you check on which branch an issue is done, where it has still to be resolved and so on without adding more and more (perhaps loosely connected) issues to your system or forgetting to fix the issue on some branch (no question/warning when resolving an issue with multiple fix versions).

[Update:]

I created a new feature request for JIRA where you can vote or track progress on this issue.

Grails Web Application Security: XSS prevention

XSS (Cross Site Scripting) became a favored attack method in the last years. Several things are possible using an XSS vulnerability ranging from small annoyances to a complete desaster.
The XSS prevention cheat sheet states 6 rules to prevent XSS attacks. For a complete solution output encoding is needed in addition to input validation.
Here I take a further look on how to use the built in encoding methods in grails applications to prevent XSS.

Take 1: The global option

There exists a global option that specifies how all output is encoded when using ${}. See grails-app/conf/Config.groovy:

// The default codec used to encode data with ${}
grails.views.default.codec="html" // none, html, base64

So every input inside ${} is encoded but beware of the standard scaffolds where fieldValue is used inside ${}. Since fieldValue uses encoding you get a double escaped output – not a security problem, but the output is garbage.
This leaves the tags from the tag libraries to be reviewed for XSS vulnerability. The standard grails tags use all HTML encoding. If you use older versions than grails 1.1: beware of a bug in the renderErrors tag. Default encoding ${} does not help you when you use your custom tags. In this case you should nevertheless encode the output!
But problems arise with other tags like radioGroup like others found out.
So the global option does not result in much protection (only ${}), double escaping and problems with grails tags.

Take 2: Tainted strings

Other languages/frameworks (like Perl, Ruby, PHP,…) use a taint mode. There are some research works for Java.
Generally speaking in gsps three different outputs have to be escaped: ${}, <%%> and the ones from tags/taglibs. If a tainted String appears you can issue a warning and disallow or escape it. The problem in Java/Groovy is that Strings are value objects and since get copied in every operation so the tainted flag needs to be transferred, too. The same tainted flag must also be introduced for GStrings.
Since there isn’t any implementation or plugin for groovy/grails yet, right now you have to take the classic route:

Take 3: Test suites and reviews

Having a decent test suite in e.g. Selenium and reviewing your code for XSS vulnerabilities is still the best option in your grails apps. Maybe the tainted flags can help you in the future to spot places which you didn’t catch in a review.

P.S. A short overview for Java frameworks and their handling of XSS can be found here

How much boost does a C++ newbie need?

The other day, I talked to a C++ developer, who is relatively new in the language, about the C++ training they just had at his company. The training topics were already somewhat advanced and contained e.g. STL containers and their peculiarities, STL algorithms and some boost stuff like binders and smart pointers. That got me thinking about how much of STL and boost does a C++ developer just has to know in order to survive their C++ projects.

There is also another angle to this. There are certain corners of the C++ language, e.g. template metaprogramming, which are just hard to get, even for more experienced developers. And because of that, in my opinion, they have no place in a standard industry C++ project. But where do you draw the line? With template meta-programming it is obvious that it probably will never be in every day usage by Joe Developer. But what about e.g. boost’s multi-index container or their functional programming stuff? One could say that it depends on the skills of team whether more advanced stuff can be used or not. But suppose your team consist largely of C++ beginners and does not have much experience in the language, would you want to pass on using Boost.Spirit when you had to do some serious parsing? Or would you want to use error codes instead of decent exceptions, because they add a lot more potentially “invisible” code paths? Probably not, but those are certainly no easy decisions.

One of the problems with STL and boost for a C++ beginner can be illustrated with the following easy problem: How do you convert an int into a std::string and back? Having already internalized the stream classes the beginner might come up with something like this:

 int i = 5;
 std::ostringstream out;
 out << i;
 std::string i_string = out.str();  

 int j=0;
 std::istringstream in(i_string);
 in >> j;
 assert(i == j);

But if he just had learned a little boost he would know that, in fact, it is as easy as this:

 int i=5;
 std::string i_string = boost::lexical_cast<std::string>(i);

 int j = boost::lexical_cast<int>(i_string);

So you just have to know some basic boost stuff in order to write fairly decent C++ code. Besides boost::lexical_cast, which is part of the Boost Conversion Library, here is my personal list of mandatory boost knowledge:

Boost.Assign: Why still bother with std::map::push_back and the likes, if there is a much easier and concise syntax to initialize containers?

Boost.Bind (If you use functional programming): No one should be forced to wade through the mud of STL binders any longer. Boost::bind is just so much easier.

Boost.Foreach: Every for-loop becomes a code-smell after your first use of BOOST_FOREACH.

Boost.Member Function: see Boost.Bind

Boost.Smart Pointers: No comment is needed on that one.

As you can see, these are only the most basic libraries. Other extremely useful things for day-to-day programming are e.g. Boost.FileSystem, Boost.DateTime, Boost.Exceptions, Boost.Format, Boost.Unordered and Boost.Utilities.

Of course, you don’t have to memorize every part of the boost libraries, but boost.org should in any case be the first address to look for a solution to your daily  C++ challenges.

Don’t trust micro versions

Normally you would think, that upgrading a third party dependency where its micro version (after the second dot, like x in 2.3.x) changes should make your software work (even) better and not break it. Sadly enough it can easily happen. Some time ago we stumbled over a subtle change in the JNDI implementation of the Jetty webserver and servlet container: In version 6.1.11 you specified (or at least could specify) JNDI resources in jetty-env.xml with URLs like jdbc/myDatabase. After the update to 6.1.12 the specified resource could not be found anymore. Digging through code changelogs and the like provided a solution that finally worked with 6.1.12: java:comp/env/jdbc/myDatabase. The bad thing is that the latter does not work with 6.1.11 so that our configuration became micro-version-dependent on Jetty.

It seems that a new feature around JETTY-725 in the update from 6.1.11 to 6.1.12 broke our software.

Conclusion

Always make sure that your dependencies are fixed for your software releases and test your software everytime when upgrading a dependency. Do not trust some automatic dependency update system or the version numbers of a project. In the end they are just numbers and should indicate the impact of the changes but you never can be sure the changes do not break something for you.

Small gaps in the grails docs

Just for reference, if you come across one of the following problems:

Validation only datasource

Looking at the options of dbCreate in Datasource.groovy I only found 3 values: create-drop, create or update. But there is a fourth one: validate!
This one helps a lot when you use schema generation with Autobase or doing your schema updates external.

Redirect

Controller.redirect has two options for passing an id to the action id and params, but if you specify both which one will be used?

controller.redirect(id:1, params:[id:2])

Trying this out I found the id supersedes the params.id.

Update:
Thanks to Burt and Alvaro for their hints. I submitted a JIRA issue

A guide through the swamp – The CrapMap

Locate your crappy methods quickly with this treemap visualization tool for Crap4J report data.

One of the most useful metrics to us in the Softwareschneiderei is “CRAP”. For java, it is calculated by the Crap4J tool and provided as an HTML report. The report gives you a rough idea whats going on in your project, but to really know what’s up, you need to look closer.

A closer look on crap

The Crap4J tool spits out lots of numbers, especially for larger projects. But from these numbers, you can’t easily tell some important questions like:

  • Are there regions (packages, classes) with lots more crap than others?
  • What are those regions?

So we thought about the problem and found it to be solvable by data visualization.

Enter CrapMap

If you need to use advanced data visualization techniques, there is a very helpful project called prefuse (which has a successor named flare for web applications). It provides an exhaustive API to visualize nearly everything the way you want to. We wanted our crap statistics drawn in a treemap. A treemap is a bunch of boxes, crammed together by a clever layouting strategy, each one representing data, for example by its size or color.

The CrapMap is a treemap where every box represents a method. The size gives you a hint of the method’s complexity, the color indicates its crappyness. Method boxes reside inside their classes’ boxes which reside in package boxes. That way, the treemap represents your code structure.

A picture worth a thousand numbers

crapmap1

This is a screenshot of the CrapMap in action. You see a medium sized project with few crap methods (less than one percent). Each red rectangle is a crappy method, each green one is an acceptable method regarding its complexity.

Adding interaction

You can quickly identify your biggest problem (in terms of complexity) by selecting it with your mouse. All necessary data about this method is shown in the bottom section of the window. The overall data of the project is shown in the top section.

If you want to answer some more obscure questions about your methods, try the search box in the lower right corner. The CrapMap comes with a search engine using your methods’ names.

Using CrapMap on your project

CrapMap is a java swing application, meant for desktop usage. To visualize your own project, you need the report.xml data file of it from Crap4J. Start the CrapMap application and load the report.xml using the “open file” dialog that shows up. That’s all.

In the near future, CrapMap will be hosted on dev.java.net (crapmap.dev.java.net). Right now, it’s only available as a binary executable from our download server (1MB download size). When you unzip the archive, double-click the crapmap.jar to start the application. CrapMap requires Java6 to be installed.

Show your project

We would be pleased to see your CrapMap. Make a screenshot, upload it and leave a comment containing the link to the image.