Lately, we had multiple occasions where a certain software tool was missing from our arsenal. Let me describe the situations and then derive the requirements for the tool.
We work on a fairly large project that resembles a web application for our customer. Because the customer is part of a larger organization, the project was also needed for a second, rather independent customer within the organization. Now we had two customers with distinct requirements. We forked the code base and developed both branches independently. But often, there is a bug fix or a new feature that is needed in both branches. And while both customers have different requirements, it’s still the same application in the core. Technically speaking, both branches are part of a product family. We use atomic commits and cherry-picks to keep the code bases of the branches in sync if needed.
Another customer has custom hardware with individual control software written by us. The hardware was built several times, with the same software running on all instances. After a while, one hardware instance got an additional module that was only needed there. We coped by introducing an optional software module that can control the real hardware on this special instance or act as an empty placeholder for the other instances. Soon, we had to introduce another module. The software is now heavily modularized. Then the hardware defects began. The customer replaced every failing hardware component with a new type of hardware, using its new capabilities to improve the software features, too. But every hardware instance was replaced differently and there is no plan to consolidate the hardware platforms again. Essentially, this left us with a specific version of the software for each hardware instance. Currently, we see no possibility to unify the different hardware platforms with one general interface. What we did was to fork the code base and develop on each branch independently. But often, there is a bug fix or a new feature that is needed in several branches. Technically speaking, all branches are part of a product family. We use atomic commits and cherry-picks to keep the code bases of the branches in sync if needed.
In both cases, we needed a list that helped us keep track of which commits were already cherry-picked, never need to be picked, or have not been reviewed in that regard yet. Our version control system of choice, git, supports this requirement by providing globally unique commit IDs. Maintaining this list manually is a cumbersome task, so we developed a little tool that helps us with it.
Meet the diffibrillator
First thing we always do for our projects is to come up with a witty name for it. In this case, because it is a “diff tracker” really, we came up with the name “diffibrillator”. The diffibrillator is a diff tracker on the granularity level of commits. For each new commit in either repository of a product family, somebody has to review the commit and decide about its category:
- Undecided: This is the initial category for each commit. It means that there is no decision made yet whether to cherry-pick the commit to one or several other branches or to define it as “unique” to this branch.
- Unported: If a reviewer chooses this category for a commit, there is no need to port the content of the commit to other branches. The commit is regarded as part of the unique differences of this branch to all other ones in the product family.
- Ported: If there are other branches in the product family that require the same changes as are made in the commit, the reviewer has to do two things: cherry-pick the commit to the required branches (port the functionality) and mark the commit and the new cherry-pick commits as “ported”. This takes the commits out of the pending list and indicates that the changes in the commit are included in several branches.
In short, the diffibrillator helps us keep track of every commit made on every branch in the product family and shows us where we forgot to port functionality (like a bug fix) to the other members of the family.
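To make the three categories more concrete, here is a minimal sketch of how such a categorization could be modeled. It is not the actual diffibrillator code: the class `CommitCategories` and all of its method names are hypothetical, and the in-memory map merely stands in for whatever storage a real tool would use.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of commit categorization; not the actual diffibrillator code. */
final class CommitCategories {

    /** The three categories described above. */
    enum Category { UNDECIDED, UNPORTED, PORTED }

    // Maps a git commit ID to its category; insertion order is preserved.
    private final Map<String, Category> categories = new LinkedHashMap<>();

    /** Registers a newly seen commit; every commit starts out as UNDECIDED. */
    void register(String commitId) {
        categories.putIfAbsent(commitId, Category.UNDECIDED);
    }

    /** Marks a commit as unique to its branch. */
    void markUnported(String commitId) {
        categorize(commitId, Category.UNPORTED);
    }

    /** Marks the original commit and all of its cherry-pick commits as ported. */
    void markPorted(String originalId, List<String> cherryPickIds) {
        categorize(originalId, Category.PORTED);
        for (String id : cherryPickIds) {
            register(id);
            categorize(id, Category.PORTED);
        }
    }

    /** Returns the commits that still await a review decision. */
    List<String> pending() {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Category> entry : categories.entrySet()) {
            if (entry.getValue() == Category.UNDECIDED) {
                result.add(entry.getKey());
            }
        }
        return result;
    }

    /** Returns the category of a commit, or null if the commit is unknown. */
    Category categoryOf(String commitId) {
        return categories.get(commitId);
    }

    private void categorize(String commitId, Category category) {
        if (!categories.containsKey(commitId)) {
            throw new IllegalArgumentException("Unknown commit: " + commitId);
        }
        categories.put(commitId, category);
    }
}
```

In this sketch, the pending list is simply every commit that is still undecided, which matches the idea that marking a commit as ported or unported takes it out of the list.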
Here is a typical screenshot of the desktop GUI. Some information is blurred to keep things ambiguous and to protect the innocent.
You see a (very long) table with several columns. The first column denotes the commit date of the commit in each row. The commits of all projects are sorted together in reverse chronological order, but each commit is placed in the column of its own project. In this screenshot, you can see that the third project hasn’t been changed for quite some time. Some commits are categorized, but the latest commits still need some work in this regard.
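The table layout described above could be built roughly like this. This is only a sketch under assumptions: the `CommitTable` and `Commit` types, the use of epoch seconds as the commit date, and the plain string cells are all made up for illustration and say nothing about how the real GUI is implemented.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch of the table layout; not the actual GUI code. */
final class CommitTable {

    /** Minimal commit description, assumed for this example. */
    static final class Commit {
        final String id;
        final String project;
        final long epochSeconds;

        Commit(String id, String project, long epochSeconds) {
            this.id = id;
            this.project = project;
            this.epochSeconds = epochSeconds;
        }
    }

    /**
     * Builds one row per commit, newest first. Column 0 holds the commit date,
     * the column of the commit's project holds its ID, all other cells are empty.
     */
    static List<String[]> rows(List<Commit> commits, List<String> projects) {
        List<Commit> sorted = new ArrayList<>(commits);
        sorted.sort(Comparator.comparingLong((Commit c) -> c.epochSeconds).reversed());

        List<String[]> rows = new ArrayList<>();
        for (Commit commit : sorted) {
            int column = projects.indexOf(commit.project);
            if (column < 0) {
                continue; // skip commits of projects that are not displayed
            }
            String[] row = new String[projects.size() + 1];
            Arrays.fill(row, "");
            row[0] = Long.toString(commit.epochSeconds);
            row[column + 1] = commit.id;
            rows.add(row);
        }
        return rows;
    }
}
```

A project that has not changed for a while simply shows a long run of empty cells in its column, which is the effect visible for the third project in the screenshot.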
Foundation for the diffibrillator
The diffibrillator in its current state relies heavily on the atomic nature of our commits. As soon as two functionalities are included in one commit, both the cherry-pick and the categorization would lose precision. Luckily, all of our developers adhere to the commit-early-commit-often principle. We had plans for a diff tracker with the granularity of individual changes, but an analysis of our real requirements revealed that we wouldn’t benefit from the higher change resolution and would lose the trackability on the commit level. And that is the level we want to think and act upon.
Technicalities of the diffibrillator
Technically, the diffibrillator is very boring. It’s a Java-based server application that uses directory structures and flat files as data storage. The interaction is done through a custom REST interface that can be used with a Swing-based desktop GUI or a JavaScript-based web GUI (or any other client that is compatible with the REST interface). As there is only one server instance for all of us, the content of its data storage is “the truth” about our product family’s commits.
The biggest problem was to design the REST API to be orthogonal enough to make sense while staying pragmatic enough to be fast. This led to a query that should return only the commits’ IDs but instead returns all information about them, to avoid several thousand subsequent HTTP requests for the commits’ data. As a result, this query’s answer grew very big, leading to timeout errors on low-bandwidth connections. To counter this problem, we had to introduce result paging, where the client can specify the start index and result length of its query.
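The paging idea can be illustrated with a small sketch. The helper below is an assumption made for this example, not the server’s real code: `ResultPaging.page` and its parameter names are hypothetical, and the actual REST API may name or validate the start index and result length differently.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical sketch of result paging; not the actual server code. */
final class ResultPaging {

    private ResultPaging() {
        // static helper only
    }

    /**
     * Returns at most {@code length} items, beginning at index {@code start}.
     * A start index beyond the end of the list yields an empty page.
     */
    static <T> List<T> page(List<T> all, int start, int length) {
        if (start < 0 || length < 0) {
            throw new IllegalArgumentException("start and length must not be negative");
        }
        if (start >= all.size()) {
            return Collections.emptyList();
        }
        // Compute the end index as a long to avoid integer overflow for huge lengths.
        int end = (int) Math.min((long) start + length, all.size());
        return new ArrayList<>(all.subList(start, end));
    }
}
```

A client would then request one page after another, for example `page(commits, 0, 100)` followed by `page(commits, 100, 100)`, until it receives a page that is shorter than the requested length.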
Why should you care?
We are certain that the task of keeping several members of a product family in sync isn’t all that rare. And while there are many different possible solutions to this problem, the two most prominent approaches seem to be “modularization” and “diff tracking”. We chose diff tracking as the approach with lower costs for us, but lacked tool support. The diffibrillator is a tool to keep track of all your product family’s commits and to categorize them. It relies on atomic commits, but is otherwise relatively low-tech and easy to understand.
If you happen to have the same problem of a product family consisting of several independent projects, drop us a line. We’d love to hear from you about your experience and solutions. And if you think that the diffibrillator can help you with that task, let us know! We are not holding anything back.
nice touch, “forking” the text passages at the beginning and keeping “the core” 🙂
I’m curious about your choice of diff-tracking over modularization.
I went through the workflow in my mind: I assume that diff-tracking is initially easier and safer than modularization. But if at one point the branches differ too much or the number of branches grows, it gets to be the other way round.
Did you choose diff tracking because you knew that the divergence and/or number of branches would be fairly small, or did the requirement come so late that you couldn’t introduce modularization at that point?
With regard to applicability, how well does your tool scale if the number of branches or the divergence grows extraordinarily?
Hi Martin,
first of all: you are right with your assumptions. We know that the number of branches is limited, and diff-tracking is cheaper than modularization for a small number of projects. Another good point is that we really were surprised both times by the requirement, and adding modularization after the fact would have been risky and a lot of effort (for a small effect regarding the branch count). Diff-tracking was the “cheapest way to go” in both projects.
We don’t have hard numbers on your last question yet. I guess that with a large number of branches, the categorization needs an overhaul (like “ported to branch X and Y, but not Z”) and generally, diff-tracking gets expensive while the effort for modularization stays the same. If the divergence grows, there is nothing left to call a product family. You have different products now 🙂
To summarize: This diff-tracker is a “good enough” tool for small product families and small development teams. If any aspect of the family scales up (number of branches, number of commits, diversity), we need to reevaluate our approach and adjust. It works for 2-4 branches with non-lazy commit/difference management.