CVS Branch Tag Creation - Performance

This applies to cvs revision 1.11.17 - the most recent stable release at the time of writing. It should work on other cvs 1.11's too. The development branch exhibits the same problem, but the code organisation is quite different.

If you manage large projects with CVS you may have noticed that creating a branch tag can take a long time - in our extreme cases, upto to an hour. Whilst all other big operations are acceptably a couple of minutes or less. But this is bad because people have to queue up behind the branch tag creation, so their quick operations appear go just as slowly.

Our project repository was just over 1Gbyte of text source, with about 6.5 thousand files and some 40-50 engineers working on it. The branch tagging time is a cause of significant difficulty. Fortunately, as CVS is open ("Free") it was easy to fix.

I'll describe what is happening and how this works. But if you just want the fix then skip to the end.

The first time branch tagging became a real issue for us they were taking over 30minutes to create, 45 or more in some cases. Our fix was to write some perl to cull all history older than three months, but keeping the trunk and some essential development branches. This fixed the problem, but it reappeared quite quickly, as you can see on the graph between days 0 and 60.

The graph shows our cvs performance over a 376 day period. The times for most operations grow linearly with the size of the repository, and the number of symbols in it. Branch tagging (the red line) however has a more interesting characteristic.

At day 0, branch tag was taking about 19 minutes, which was felt to be too painful, so we again culled all nonessential history, taking branch tagging down to three minutes. But as you can see, it climbed inexorably upward, getting serious again with two months (day 60). The actual characteristic is approximately n**2 against the number of symbols. Time to roll up the sleeves and get into the cvs code - thankful for Free Software. The fix we have gives a more linear characteristic which kept us going to day 355, where we had about 7000 files and approx 25 million symbols in the repository. At that point we culled back again.

Here's what's going on.

The common case for a branch is that most files are never changed on the branch. So overheads are minimized by not really doing anything when a branch tag is created for the project. CVS creates a 'magic' (virtual) branch identified by a branch number of zero, and setting the revision number component to what the branch identifier would be if it really existed. The presence of such an item for a file is saying that the branch has been created, but that no actual changes have been committed to this file for that branch as yet. This is storage, and operation, efficient.

So, when we do a tag -b for a project, CVS creates a magic branch for every file in the project. Unfortunately the algorithm used for that is extremely inefficient. The bulk of its time is taken looking for a free magic branch to use in each file. CVS is nicely modularised, and its lists are kept in hashes maintained by the 'hash' module. You walk the lists by calling 'walklist', supplying a function pointer to apply to each element in the list. For tag -b CVS starts with the lowest revision number, and tries *every* entry in the revision list for the file to see if that has already been allocated. It then bumps the number up, and tries again, until it succeeds. Even if a match is found early on a particular pass, walklist doesn't have an escape mechanism to terminate the walk, so it has to press on with the rest on that pass. That is where the n**2 performance against number of revisions - (revisions roughly proportional to symbols) comes from.

Solutions

There are two fixes we have done locally to overcome this.

Firstly to inline the walklist functionality within the code responsible for tag -b, and then manipulate that. This isn't neat in the sense that we are breaking up the nice modularity the hash/walklist is providing. It does allow us to terminate the search early on success.
Second to start the search for a free magic branch number at some more hopeful starting point than zero. I start at 5000, and try 1/2k intervals backwards until I hit an allocated slot, then go forwards from there. This way I get a number within or near the currently allocated numbers, but with far fewer lookups. I don't get exactly the same numbers because I am not reusing the very first slot I could come across if I started from the bottom - where say an earlier branch tag was deleted. It's not important to get a number within the existing ballpark, but if people are used to looking at the numbers then this will make it a little more familiar. There are better strategies for this, such as keeping a record of free slots, but I wanted to change as little as possible.

Neither of these changes affect functionality - we have been using this intensively on the 'large' project for about four months now. But as you can see from the graph, it brings branch tagging costs in line with the other cvs operations, so we have happy engineers (well, not actually happy as such, but that is quite a different story outside of the realm of economic and social theories, as conventionally constituted, to address).

The patch

I have not submitted this patch to the CVS crew because (a) its not very neat as noted above, and (b) I have seen few other people mention they are having a problem - are large projects rare?

This is for 1.11.17. It's only the server that needs the patch, though it does no harm on clients. The only file changed is src/rcs.c

Copyrights, cautions asto lack of warranty and so forth are, or course, as per the GPL under which CVS is released.

rcs.c_patch

Credits

To the Free Software community, and the CVS maintainers in particular.

Adrian West