This document summarizes the initial research for adding distributed version control as an option for Google Code. Based on popularity, two distributed version control systems were considered: Git and Mercurial. This document describes the features of the two systems, and provides an overview of the work required to integrate them with Google Code.
In traditional version control systems, there is a central repository that maintains all history. Clients must interact with this repository to examine file history, look at other branches, or commit changes. Typically, clients have a local copy of the versions of files they are working on, but no local storage of previous versions or alternate branches.
Distributed Version Control Systems (DVCS) use a different structure. With DVCS, every user has their own local repository, complete with project history, branches, etc. Switching to an alternate branch, examining file history, and even committing changes are all local operations. Individual repositories can then exchange information via push and pull operations. A push transfers some local information to a remote repository, and a pull copies remote information to the local repository. Note that neither repository is necessarily "authoritative" with respect to the other. Both repositories may have some local history that the other does not have yet. One key feature of any DVCS system is to make it easy for repositories to unambiguously describe the history they have (and the history they are requesting). Both Git and Mercurial do this by using SHA1 hashes to identify data (files, trees, changesets, etc).
DVCS's provide a lot of flexibility in developer workflows. They can be used in a manner similar to traditional VCS's, with a central "authoritative" repository with which each developer synchronizes. For larger projects, it is also possible to have a hierarchy of server repositories, with maintainers for each repository accepting changes from downstream developers and then forwarding them upstream. DVCS's also allow developers to share work with each other directly. For example, two developers working on a new feature could work on a common branch and share work with each other independent of an "authoritative" server. Once their work was stable, it could then be pushed to a public repository for a larger audience.
Because there is no central repository, the terms client and server don't necessarily apply. When talking about two repositories, they are typically referred to as local and remote rather than client and server. However, in the context of implementing a DVCS for Google Code, the repository hosted at Google will be considered the server, and a user's repository will be called the client.
There is actually a great deal of similarity between Git and Mercurial. Instead of providing a long list of features that are equivalently supported in both system, this section attempts to highlight areas of significant difference between the systems.
Both Git and Mercurial internally work with very similar data: revisions of files along with a small amount of meta information (parents, author, etc). They both have objects that represent a project-wide commit, and these are also versioned. They both have objects that associate a commit with a set of file versions. In Git, this is a tree object (a tree structure with tree objects for directories and references to file revisions as the leaves). In Mercurial, there is a manifest (a flat list mapping pathnames to file revision objects). Aside from the manifest/tree difference, both are very similar in terms of how objects are searched and walked.
Git uses a combination of storing objects directly in the file system (indexed by SHA1 hash) and packing multiple objects into larger compressed files, while Mercurial uses a revlog structure (basically a concatenation of revision diffs with periodic snapshots of a complete revision). In both cases, the native storage will not be used and the objects will be stored in Bigtable instead. Due to the similarity of the basic Git and Mercurial data objects, the effort to solve such problems should be the same regardless of which DVCS is being used.
The only major difference for the data storage layer is the implementation language. If a significant amount of Git/Mercurial code is to be reused, then a Git implementation would be in C, and a Mercurial one would be in Python (or perhaps C++ with SWIG bindings).
Mercurial has very good support for HTTP based stateless pushing and pulling of remote repositories. A reasonable amount of effort has been made to reduce the number of round trips between client and server in determining what data needs to be exchanged, and once this determination has been made all of the relevant information is bundled into a single large transfer. This is a good match for Google's infrastructure, so no modifications will be required on the client side.
Git includes support for HTTP pulls (and WebDAV pushes), but the implementation assumes that the server knows nothing about Git. It is designed such that you can have a Apache simply serve the Git repository as static content. This method requires numerous synchronous round trip requests, and is unsuitable for use in Google Code (1).
Git also has a custom stateful protocol that supports much faster exchanges of information, but this is not a good match for Google infrastructure. Specifically, it is very desirable to use a stateless HTTP protocol since there is already significant infrastructure in place to make such transactions reliable and performant.
1 As a benchmark, Git and Mercurial repositories were seeded with approximately 1500 files totaling 35 M of data. The servers were running in Chicago and the clients in Mountain View (51 ms ping time). The operation of cloning the remote repository (similar to a initial checkout in traditional version control systems) averaged 8.1 seconds for Mercurial and 178 seconds for Git (22 times slower). A single file in the repository was then changed 50 times and the clients pulled the updates. In this case, Mercurial took 1.5 seconds and Git required 18 seconds (12 times slower). When the Git protocol was used instead of HTTP, Git's performance was similar to Mercurial (8.7 seconds for cloning, 2.8 seconds for the pull).