Find cruft in your source code repository

Michael Feathers wrote in his blog post “The Carrying-Cost of Code: Taking Lean Seriously” that it is necessary to remove old code from your product to be able to add new features. His argument is that you get a better understanding of your production code this way. Constantly rewriting your code leads to more readable and compact code.

"There are many places in the industry where existing mountains of code are a drag on progress.
[..]
Younger organizations without as much software infrastructure often have a competitive advantage provided they can ramp up to a base feature set quickly and provide value that more encumbered software-based companies can't. It's a scenario that plays out over and over again, but people don't really talk about it.
[..]
I'd like to have code base where every line of code written disappears exactly three months after it is written.
[..]
I have the suspicion that a company could actually do better over the long term doing that, and the reason is because the costs of carrying code are real, but no one accounts for them."

His reasoning goes so far as to ask product owners to remove features that are not needed. Software size seems to increase strictly monotonically, which makes maintenance harder and more costly. I’m not sure you have to follow the advice strictly to improve your situation. Before you start arguing with your boss about removing features, it is a good idea to look for low-hanging fruit first: the oldest lines.

Metric

The heuristic comes from the observation that a) software has bugs and b) if the software is actually used, bugs will be found and fixed. Fixing the bugs leads to new code, as do changes in coding style, new APIs etc. Old unchanged code is therefore either bug-free, feature-complete and state-of-the-art, or something nobody cares about. I’d say the metric is not too bad for finding some victims. (A metric like this one should be a tool to find problems, not an absolute measurement. Metrics should not be taken too seriously, and nobody should be tempted to cheat.)

To put the idea into practice I’ve hacked together some scripts to find suspects in a Subversion repository. The scripts:

find the oldest lines in your repository
find the biggest change sets in your repository considering the n oldest lines
find the files that are changed the most by a change set considering the n oldest lines

To get some data I’ll use the legacy Subversion repository of the Quickcheck project. (Quickcheck moved to Mercurial some time ago.) It’s a test to see whether something significant can be found with this metric.

The readme contains instructions on how to run the scripts against your own Subversion repository. The scripts work on a local repository mirror to speed up the analysis. The analysis can be executed on any subtree of the repository.

Oldest lines

Finding the oldest lines is quite simple: first get all file names with svn list, then use svn blame to find out when every line was last changed. The output is sorted by the revision of each line (descending), so the oldest lines come last.
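The two steps can be sketched roughly like this. This is not the original oldest_lines.sh. The function names blame_all and sort_by_revision_desc are made up for illustration, and the "File: ... Line: ..." suffix is my assumption about how the annotation seen in the Quickcheck output is produced. Only svn list -R and svn blame are real Subversion commands.

```shell
#!/usr/bin/env bash
# Hypothetical sketch, not the original oldest_lines.sh.

# Print every line of every file below a repository URL or working copy path.
# Each line keeps the default "svn blame" prefix (revision, author) and gets
# an assumed " File: <path> Line: <n>" suffix appended.
blame_all() {
  local root="$1"
  svn list -R "$root" | grep -v '/$' | while IFS= read -r file; do
    # grep above dropped directory entries (they end with a slash).
    svn blame "$root/$file" </dev/null |
      awk -v f="$file" '{ print $0 " File: " f " Line: " NR }'
  done
}

# Sort blame lines by their leading revision number, newest first,
# so that the oldest lines end up at the bottom of the output.
sort_by_revision_desc() {
  sort -s -k1,1nr
}
```

A run could look like blame_all "$REPO_URL" | sort_by_revision_desc > blame_sorted.txt, where REPO_URL and blame_sorted.txt are placeholders. Expect one svn blame call per file, which is why a local mirror pays off.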

The output of oldest_lines.sh is unfiltered. To extract useful information it has to be filtered. filter.sh does this for Java source code by removing empty lines, single closing braces, package declarations, imports and comments.
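A filter along these lines might look as follows. It is a simplified sketch, not the original filter.sh. The function name filter_java is invented, and the input format (revision, author, code, then a "File: ... Line: ..." suffix) is assumed from the Quickcheck output shown in this section. Comment detection is deliberately naive: it only looks at how a line starts.

```shell
#!/usr/bin/env bash
# Hypothetical sketch, not the original filter.sh.
# Reads lines of the assumed form
#   "<rev> <author> <code> File: <path> Line: <n>"
# and drops the ones whose code part carries no information.
filter_java() {
  awk '
    {
      code = $0
      sub(/^[ \t]*[0-9]+[ \t]+[^ \t]+[ \t]?/, "", code)  # strip revision and author
      sub(/ File: .* Line: [0-9]+$/, "", code)           # strip the annotation suffix
      gsub(/^[ \t]+|[ \t]+$/, "", code)                  # trim surrounding blanks

      if (code == "") next                               # empty line
      if (code == "}") next                              # single closing brace
      if (code ~ /^(package|import) /) next              # package declaration, import
      if (substr(code, 1, 2) == "//") next               # line comment
      if (substr(code, 1, 2) == "/*") next               # start of a block comment
      if (substr(code, 1, 1) == "*") next                # continuation of a block comment
      print $0
    }'
}
```

The filter keeps the original line intact and only decides whether to print it, so revision, author and file information survive for the later steps.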

These are the last lines of the filtered output for Quickcheck:

$ ./filter.sh | tail -5

6     blob79    public int compare(Pair<Object, Double> o1, Pair<Object, Double> o2) { File: characteristic/Classification.java Line: 162
6     blob79    next.add(gen.next()); File: generator/support/TupleGenerator.java Line: 34
6     blob79    ArrayList<Pair<Object, Double>> toSort) { File: characteristic/Classification.java Line: 150
6     blob79    @SuppressWarnings("unchecked") File: generator/CombinedGenerators.java Line: 126
6     blob79     Object[] next = generator.next(); File: generator/CombinedGenerators.java Line: 128

A potential victim here is the Classification class. It’s a rudiment from the original Quickcheck implementation that was never used heavily. Statistical testing is a nice idea, but Classification could be removed from Quickcheck without losing a significant feature.

Biggest change sets

The second script top_change_sets.sh finds the biggest change sets considering only the n oldest lines. This results in an interesting output for the code base (oldest 1500 lines, top 5 change sets):

$ ./top_change_sets.sh 1500 5

r182 | blob79 | 2007-12-19 19:15:24 +0100 (Wed, 19 Dec 2007) | 1 line
basic failed test instances serialization feature implementation
136 changes

r270 | blob79 | 2009-06-03 18:52:52 +0200 (Wed, 03 Jun 2009) | 1 line
added pojo (a.k.a object) generator for interfaces
104 changes

r6 | blob79 | 2007-07-07 07:29:14 +0200 (Sat, 07 Jul 2007) | 1 line
initial check in
68 changes

r204 | blob79 | 2008-03-23 19:29:28 +0100 (Sun, 23 Mar 2008) | 1 line
added svn keyword id
52 changes

r198 | blob79 | 2008-03-23 18:29:26 +0100 (Sun, 23 Mar 2008) | 3 lines
fixed logging for serializing and deserializing runner
mandate a user set characteristic name (for serialization of test values)
added system property for number of runs
48 changes

Revisions 182 and 198 were commits related to the obscure test data serialization and deserialization scheme, something I’ve already removed in the latest release. The two changes account for 184 lines (136 + 48) still present in the current source.

Revision 270 is no less obscure. It’s a declarative POJO object generator. The revision is so high in the list because it forced a lot of changes. This is not a good sign: an obscure feature and lots of changes. That’s something worth investigating.

Revision 6 is the initial check in. So this should be okay.

The last open issue, revision 204, is the attack of the code formatters. They should be used with prudence as long as the downstream tools can’t handle the changes properly. (Source control systems should understand the AST of the source language.)
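The ranking used in this section boils down to counting how many of the n oldest lines each revision contributed. A minimal sketch, assuming the blame lines arrive on stdin sorted by revision with the newest first (so the oldest are at the end). The function name top_revisions is invented and this is not the original top_change_sets.sh.

```shell
#!/usr/bin/env bash
# Hypothetical sketch, not the original top_change_sets.sh.
# Usage: top_revisions <n oldest lines> <k top revisions> < sorted blame lines
# Prints "<count> <revision>" for the k revisions that own most of the
# n oldest lines. Ties are broken by the lower revision number.
top_revisions() {
  local n="$1" k="$2"
  tail -n "$n" |
    awk '{ count[$1]++ } END { for (rev in count) print count[rev], rev }' |
    LC_ALL=C sort -k1,1nr -k2,2n |
    head -n "$k"
}
```

With a prepared file this would be top_revisions 1500 5 < filtered_blame.txt (the file name is a placeholder). Feeding each resulting revision number to svn log -r gives the commit message that belongs to the count.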

File changes

Now we can take a look at the files with the most changes from a single revision. If you run top_changes_in_file.sh (500 oldest lines, top 5) on the Quickcheck source code you’ll see:

$ ./top_changes_in_file.sh 500 5
r182 | blob79 | 2007-12-19 19:15:24 +0100 (Wed, 19 Dec 2007) | 1 line
basic failed test instances serialization feature implementation
34 changes | file: RunnerImpl.java

r180 | blob79 | 2007-12-07 18:59:59 +0100 (Fri, 07 Dec 2007) | 3 lines
MutationGenerator, CloningMutationGenerator and CloningGenerator added
26 changes | file: generator/support/CloningGenerator.java

r182 | blob79 | 2007-12-19 19:15:24 +0100 (Wed, 19 Dec 2007) | 1 line
basic failed test instances serialization feature implementation
24 changes | file: SerializingRunnerDecorator.java

r6 | blob79 | 2007-07-07 07:29:14 +0200 (Sat, 07 Jul 2007) | 1 line
initial check in
22 changes | file: characteristic/Classification.java

r179 | blob79 | 2007-10-13 09:14:30 +0200 (Sat, 13 Oct 2007) | 1 line
added tree generator
22 changes | file: generator/support/AbstractTreeGenerator.java

Besides the usual suspects, serialization support and the Classification class, two new suspects emerge: the mutation generator and the tree generator. In their favor, both might be useful, but they aren’t widely used, so they are worth a look.
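The per-file view is the same counting idea with a finer key: revision plus file instead of revision alone. A rough sketch, again assuming sorted blame lines on stdin that end in " File: <path> Line: <n>" as in the Quickcheck output above. The function name top_file_changes is invented and this is not the original top_changes_in_file.sh.

```shell
#!/usr/bin/env bash
# Hypothetical sketch, not the original top_changes_in_file.sh.
# Usage: top_file_changes <n oldest lines> <k top entries> < sorted blame lines
# Prints "<count> <revision> <file>" for the k (revision, file) pairs that
# own most of the n oldest lines.
top_file_changes() {
  local n="$1" k="$2"
  tail -n "$n" |
    awk '{
      file = $0
      sub(/^.* File: /, "", file)      # keep what follows the last " File: "
      sub(/ Line: [0-9]+$/, "", file)  # drop the trailing line number
      count[$1 " " file]++
    }
    END { for (key in count) print count[key], key }' |
    LC_ALL=C sort -k1,1nr -k2,2n -k3 |
    head -n "$k"
}
```

Called as top_file_changes 500 5 < filtered_blame.txt (placeholder file name), it yields the raw numbers for a listing like the one in this section.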

Conclusion

The metrics found multiple source files that are worth investigating: one feature that is already removed (serialization support), one likely victim (classification) and multiple places that are worth checking (mutation generator, tree generator, declarative POJO generator). The metrics seem to find unloved children in the code that are good candidates for removal or implementation improvements.

I always like to remove code. Fewer lines of code mean fewer spots where problems may emerge. Nobody can argue that it’s better to keep unused code when you can remove it - even if it’s tested and production-quality. That’s something like a reverse YAGNI. If you really care: the code never disappears. You can find it in your source code management system. You should be okay with the fact that the old code will lose its relevance as the production system changes. It can still be an inspiration for how it could be done if the world hadn’t changed. The burden of these changes is also the reason why it’s better to remove the code in the first place. Dragging it along without any gain is plain waste.

Thomas Andreas Jung's Blog

2011-05-30