If you've paid any attention, you'll have noticed there's a ton of new version control systems. It's a little odd, really -- the community languished with CVS for such a long time. CVS got lots of really important things right, but as we all know it also was a mess. Then Subversion and Arch came along with two different models -- one based closely on CVS (Subversion), and one based on a new distributed model. Arch seems crufty and difficult, or maybe it's just that its designer has a somewhat crufty and difficult personality... but clearly that branch of design has had an explosion of implementations, where Subversion remains alone (though far more successful).
People treat the benefit of distributed development as self-evident, but I don't agree. The best real justification I've seen of the centralized model is Greg Hudson's Why Bitkeeper Isn't Right For Free Software, which is still relevant since it wasn't about the BK license or proprietary software. All the arguments against BK on the basis of license have been clearly proven, but the model remains relevant. The basic argument Greg makes is that the Linux development process is an anomaly and doesn't apply to most projects.
But I think this can be taken further. There's nothing wrong with the centralized model. There is something wrong with the way we are using Subversion (and CVS before it). The wrongness isn't that you need a server, or a network connection, or disk space. It's that you need commit privileges.
I see these issues as the important ones that source control can solve for open source:
The distributed systems do some work on (1), usually by not needing a "server" (except maybe for rsync and any web server). But frankly the "server-less" systems they set up are usually much more complex in practice than a single well-maintained server. Now that Subversion has fixed many of its server problems (with fsfs among other things), server maintenance is really not a problem. And we share the work around; there are far fewer servers than developers, and that works fine.
But more practically, I think distributed systems enable private work in a way that is bad for the community. I think the private workflow so touted by distributed systems is a total non-feature, even an anti-feature. Open source development should happen in the open; that's what people usually want to do, and that's what we should encourage at every opportunity.
The distributed systems offer nothing for (2). Centralized systems allow you to list the files and branches and whatnot. Subversion made an important improvement on CVS by making the branching and tagging very transparent, where it was somewhat invisible and mysterious in CVS. That makes a real and practical difference in the usability of branching. Distributed systems are a step back in this respect.
Honestly I don't know how distributed systems compare on (3) and (4). Subversion could definitely be better, but I don't think that has anything to do with centralization. I find handling patches very difficult, but I think merging branches in Subversion is generally easier, and with far more room for improvement. As an aside, I don't see why the exchange of patches in email is even relevant in a usable and complete system -- emailing files around is a crappy interface for everyone. Making it less crappy is missing the point; email is not a good file transfer protocol. But because Linus does everything in email... sigh.
But centralized version control does need to become more open. The difference between someone with commit access and some random contributor should be reduced. Anonymous commits should be allowed (or commits with a very low registration threshold -- strict anonymity isn't really important here). The tools should be usable enough that we can say "we don't accept patches, we only accept pointers to branches in our repository". Maybe we could even say "we accept bug reports, but we prefer bug reports in the form of commits into our bug_example/ directory". We should stop using Wikis, and just use web frontends to our version control.
There's great potential in version control -- but it's all in usability and tools, security and scaling, not the mathematical appeal of patch management algorithms.
People like to talk about the benefit of open source's distributed development, but I think the communal aspect is just as important. We all already know that a successful open source developer must play well with others, must both follow and lead in different projects, must be able to handle and resolve personal and technical conflicts. We succeed when we work in public; so why should we be so drawn to version control that encourages isolation, where everything is a fork?
I wish you had gone into this level of detail a month or so ago when we were talking about setting up a public repository for patterns. You're making a ton of sense here. I think you could have a workable model w/ subversion if you could figure out how to:
- Self-register or distribute identity. Anonymous write is a non-starter...
- Tier access such that newly registered users can create new branches in a sandbox area with full write access but not be able to write outside of their own branches until they were granted higher access.
I wonder if a simple web front end might not help accomplish a lot of this. A form takes a username/password plus a source branch, creates a new "free branch" somewhere, and spits that out as a result. This could also have a web/machine interface that you could probably wrap with five lines of bash/curl to have a nice little CLI:

    $ svn_share -u username http://example.com/reposhare/trunk
    Hello username, we don't know you. Enter a password for jailed access...
    Password: *********
    Your branch is http://example.com/repo/username/01/
The command might be best as a wrapper around svn co that would have basic logic for establishing the account and creating one of these "free branches" and then checking it out.
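To make the "free branch" idea a little more concrete, here is a minimal in-memory sketch in Python of what the server side might track. Everything in it is an assumption for illustration: the `SandboxRegistry` class, the `/repo/<username>/<nn>/` layout (borrowed from the example output above), and the `trusted` set are hypothetical, and a real version would have to sit in front of the repository and normalize paths before comparing them.

```python
from dataclasses import dataclass, field


@dataclass
class SandboxRegistry:
    """Hypothetical model of self-registration plus jailed 'free branches'."""

    root: str = "/repo"
    # Maps username -> number of free branches handed out so far.
    branch_counts: dict = field(default_factory=dict)
    # Users who have been granted write access outside their own sandbox.
    trusted: set = field(default_factory=set)

    def create_free_branch(self, username: str) -> str:
        """Register the user on first use and return a new sandbox branch path."""
        count = self.branch_counts.get(username, 0) + 1
        self.branch_counts[username] = count
        return f"{self.root}/{username}/{count:02d}/"

    def can_write(self, username: str, path: str) -> bool:
        """Trusted users write anywhere; others only under their own prefix."""
        if username in self.trusted:
            return True
        own_prefix = f"{self.root}/{username}/"
        return username in self.branch_counts and path.startswith(own_prefix)
```

A newly registered user gets write access under their own prefix only; adding them to `trusted` stands in for the "granted higher access" step.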
This leads to some problems:
- Branch bombs - The branching mechanism is cheap but you'd have to put some kind of basic constraints around creation.
- Stale branch detection - It would be nice to have some heuristic for determining if a branch was actually valuable and to clean up those that aren't.
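Both problems could start with very simple rules. The sketch below is only one possible heuristic, not something proposed in this thread: the per-user quota and the 90-day idle threshold are made-up numbers, and "no commits for a while" is a crude stand-in for "not valuable".

```python
from datetime import datetime, timedelta

MAX_BRANCHES_PER_USER = 5         # assumed quota to limit branch bombs
STALE_AFTER = timedelta(days=90)  # assumed idle period before a branch counts as stale


def may_create_branch(existing_branches: int) -> bool:
    """Allow a new branch only while the user is under the quota."""
    return existing_branches < MAX_BRANCHES_PER_USER


def stale_branches(last_commit_by_branch: dict, now: datetime) -> list:
    """Return the names of branches idle for longer than STALE_AFTER, sorted."""
    return sorted(
        name
        for name, last_commit in last_commit_by_branch.items()
        if now - last_commit > STALE_AFTER
    )
```

A cleanup job would probably want to notify the branch owner before deleting anything this flags.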
At any rate, I'm interested in hearing more on this if you've thought about implementation at all.
I was trying to figure out how to set up Apache for this sort of situation in Apache auth -- I haven't had a chance to revisit it since then (sysadmining, blech), but I think there's a special tool for svn permissions which is pretty granular.
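One candidate for that "special tool" is Subversion's path-based authorization file (the one mod_authz_svn reads), which grants `r` or `rw` per path section. The Python sketch below just renders such a file from two lists; the `render_authz` helper, the `/sandbox` layout, and the `committers` group name are assumptions for illustration, not a tested server setup.

```python
def render_authz(committers, sandbox_users, sandbox_root="/sandbox"):
    """Render a path-based authz file: read for all, write by tier."""
    lines = ["[groups]", "committers = " + ", ".join(sorted(committers)), ""]
    # Everyone may read the whole repository; only committers write everywhere.
    lines += ["[/]", "* = r", "@committers = rw", ""]
    # Each sandbox user additionally gets write access to their own subtree.
    for user in sorted(sandbox_users):
        lines += [f"[{sandbox_root}/{user}]", f"{user} = rw", ""]
    return "\n".join(lines)


print(render_authz(["ian"], ["alice"]))
```

A self-registration webapp could regenerate this file whenever someone signs up.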
For self-registration it might be sufficient simply to use some simple webapp that manages .htaccess files, maybe with a little something for forgotten passwords and whatnot (a parallel record of email addresses would have to be kept).
I am worried about security though, especially since Subversion is written in C. C does not encourage confidence in security. Not a big issue when you trust everyone you authorize, but if authorization is opened up...
Wouldn't the Wikipedia model be helpful here as a model for thinking about security? It seems to me that some of the security concerns might not be _that_ big a deal since you can just roll back to a previous version if someone pollutes the tree.
I'm thinking of security like buffer overflows, or people adding gigantic files to the repository which cause the database size to explode (since deleted items remain in the repository database).
Have you checked out Mercurial (http://www.selenic.com/mercurial/)? I wrote a short post about it recently.
I'll have to throw my approval in on this one as well. I have been using GNU Arch for some time now, and although it is a powerful SCM, it's definitely not the easiest thing to use. Bazaar has addressed some of these issues while remaining compatible with GNU Arch. Tom Lord has pre-released GNU Arch 2.0, which is another complete rewrite that borrows ideas from Linus' Git. It's not a feature-complete SCM yet, but it appears to be picking up steam. Hopefully, he'll incorporate the usability features in Bazaar into the new version.
I have not jumped on the Subversion bandwagon, partially because it had such a rocky start right about the time I was looking for a new SCM. I have it installed, but only to pull your source code, at which point I slough it off to whatever SCM I might be playing with at the time.
I have not used monotone because of its dependence upon OpenSSL libraries and the inability to incorporate GPG/OpenPGP utilities (though that's such a little nit that I should probably reconsider). I'm also not entirely comfortable with having to cut-n-paste SHA-1 strings just to get a branch I want. (Git suffers from this, too.) Mercurial incorporates a versioned tagging system that allows you to set arbitrary names to revisions and travels with the archive. You can also set local tags that do not get transferred, which allows you to make less "official" notes about what a branch or revision represents. +1 for Mercurial.
Codeville has some interesting merging algorithms, but ease of use is by far the more important thing to consider. Darcs has gotten some bad press regarding its memory consumption and long operations when merging with large histories, though these might be fixed by now. (Plus, it's written in Haskell!? Who uses that?!) Canonical is sponsoring the development of Bazaar-ng, another python-based SCM that borrows ideas from GNU Arch and Bazaar. I don't really know where it is in a feature-wise comparison, but it's a relatively new project. Now that GNU Arch 2.0 is in alpha, I'm not sure if bazaar-ng is going to take the new direction (I haven't been following that closely lately).
That brings us back to Mercurial. It's a mouthful to say, but the commandline is a terse hg COMMAND style. It's very cvs-like in the way it feels, yet shares a lot of properties with monotone and codeville. It is self-hosting and can publish over HTTP. A CGI front-end is provided so you can incorporate it with your favorite web server. The CGI/web front-end looks a lot like gitweb, the Perl git front-end on kernel.org, though there have been recent patches to the list that really improve on the presentation. Plus, it's FAST! (And written in Python, nonetheless.)
Hi; yes, I actually saw that on del.icio.us, and it's what got me thinking about version control again.
It sounds like the developer on Mercurial has the technical parts down well. I don't know about usability, especially since it seems like it's a contender for Linux development, and I think Linux developers (or maybe just Linus) have distorted ideas about usable version control.
I don't know what the technical foundation of Bazaar-NG is, but they seem the most concerned with basic usability issues. Other ones can be better or worse in terms of usability -- Arch is pretty horrible from what I can tell, and Darcs is pretty straightforward. Many of them put so much emphasis on the distributed part that continuous integration (the norm in Subversion or CVS) is an Exercise Left Up To The Reader.
It is interesting that this latest generation of version control systems is very heavy on Python (Mercurial, Bazaar-NG, Codeville, and I think I'm forgetting something else...)
From a usability perspective, Mercurial is quite similar to both CVS and SVN. Commands that "mean the same" in CVS and Mercurial have the same names, flags, and so on.
Linus definitely has extremely peculiar notions about version control, but those are all firewalled in git/cogito, and have not bled elsewhere.
Darcs is yet another one -- http://darcs.net/DarcsWiki
I've just started playing with it for personal projects. CVS is too crufty, and Subversion has too many requirements for my personal stuff (though we moved from VSS to Subversion at the office and have never been happier).
Back to darcs -- the commands are easy. It seems to be distributed when you want it to be and centralized when you want. The fact that you can immediately mail patches back and forth, and pull any source tree off the web and into your own copy, are cool features.
"But centralized version control does need to become more open. The difference between someone with commit access and some random contributor should be reduced"
True to a very large extent, and it focuses on a very important issue. Subversion has a very specific problem when used with free/open source projects. It pretends that patch management is none of its business, since all it promises to be is a better CVS. But remember that CVS itself is probably two decades old and should not be setting the bar for any version control software beyond a certain extent. Now we know, after a decade+ of distributed software development in which not everyone has commit privileges, that there are problems with the way patches are handled. But it's probably best not to depend on the SVN project itself to do this. As you note in another excellent point: "There's great potential in version control -- but it's all in usability and tools, security and scaling, not the mathematical appeal of patch management algorithms." Tortoise SVN and SVK are the only two SVN-based projects that have had any real impact. I am not sure, though, about that bit about "patch management algorithms". It's a prominent shortcoming/defect in SVN that it does not have a real handle on objects through object IDs - there is no way to clearly answer the question "Is file A on branchA the same as the file A on trunk?"; also lacking are true renames. Merge Tracking will not really be possible unless these are sorted out.
Hmmm... Your idea of anonymous commits seems like a possible source of unintended consequences. I'll throw out the term 'wikification' to describe it. I love wikis, but there is a significant maintenance cost, due to flaws in human nature. You'd almost have to automatically build on every commit, and roll back changes that no worky-worky, and then black list bogus committers somehow. The haters of open source would likely DOS a lot of projects. Ultimately, there is a huge learning curve before stepping into any project. Even if I know python (and I consider myself a medium-weight), I probably can't do useful work until I've interacted significantly with the team, beyond trivial bugfixing. So, I'm luke-warm on the idea.
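The build-then-roll-back-then-blacklist loop described above might look roughly like this Python sketch. It is only an illustration under stated assumptions: `process_commits`, the `builds_ok` callback, and the failure limit are hypothetical, and "rolling back" is modelled simply as never accepting the change.

```python
def process_commits(commits, builds_ok, max_failures=3):
    """Accept commits that build; block authors after repeated failures.

    commits is an iterable of (author, change) pairs and builds_ok is a
    callable that returns True when the tree still builds with that change.
    """
    accepted = []
    failures = {}        # author -> number of commits that broke the build
    blacklisted = set()
    for author, change in commits:
        if author in blacklisted:
            continue                      # ignore known-bad committers
        if builds_ok(change):
            accepted.append(change)
        else:
            failures[author] = failures.get(author, 0) + 1
            if failures[author] >= max_failures:
                blacklisted.add(author)
    return accepted, blacklisted
```

Even in this toy form the maintenance cost is visible: someone has to choose the threshold and decide what happens to honest contributors who trip it.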
I think new people shouldn't be able to commit to the trunk, but they should be able to make their own branches. There remain some DoS possibilities, no doubt, but those are all simply bugs. There might be a lot of those bugs, of course...
I work as a team leader in a small/medium-sized developer team, and we are using CVS/subversion successfully, but after playing around with darcs for a toy project, we are also thinking about moving to an SCM without a central server. Maybe we are "too agile" or whatever, but sometimes it would be just nice to sync with the repository of my coworker without submitting it to a central server or using branches. And approaching a release, all my team members start to merge with a central repository. Maybe I should say that I highly appreciate the way Guido (GvR) and Linus manage their projects. And I guess this can be applied to a lot of open source software too. Well, and email is not too bad for patch "movement", is it? I think the way darcs does it is the best. Some people can commit directly; the rest have to send their patches by mail.
The ability to submit a patch directly via the version control system so that the project leader(s) can review and merge it as desired would be a great boon. You'd certainly think that SVN could be given explicit support for this feature, and I think that SVN is the best candidate given how much acceptance it has gained.
One thing that should be kept in mind with regards to accepting anonymous or nearly anonymous patches is the whole Linux/SCO nonsense. While only a project as successful as Linux is likely to gain that level of attention, it's always good to have an idea of who is vouching for a piece of code (and whether they wrote it themselves) when you're adding it to your repository.
One of my favorite features of Darcs is the "darcs send" command, which e-mails your changes to the owner of the upstream repository.
In some of my thoughts about building systems on top of Subversion, I've thought about expressing workflow through custom properties, and acting on those in hooks. So instead of patches you'd have branches, and a submission would be a flag/property on the branch that said you were ready for someone to look at it. The "ready for someone to look at it" is almost an indexing operation as much as anything.
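As a rough illustration of the "indexing operation": if each branch carried a custom property, building the review queue would amount to a filter over branch properties. The Python sketch below works on a plain dictionary rather than a real repository, and the property name `review:status` and its `ready` value are invented for the example.

```python
REVIEW_PROP = "review:status"  # hypothetical custom property name


def review_queue(branch_props: dict) -> list:
    """Index step: list the branches whose property marks them ready for review."""
    return sorted(
        branch
        for branch, props in branch_props.items()
        if props.get(REVIEW_PROP) == "ready"
    )


branches = {
    "/sandbox/alice/01": {REVIEW_PROP: "ready"},
    "/sandbox/bob/01": {REVIEW_PROP: "in-progress"},
    "/sandbox/carol/02": {},
}
print(review_queue(branches))
```

A post-commit hook could rebuild this index whenever the property changes, which is where the hook machinery would come in.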
Though another possibility might be special conventions. For instance, it would be nice to encapsulate discussions of patches in the version control system as well. Maybe a file (discussion.txt or something) could store that information. This would be like pushing the Trac database into Subversion. I'm not sure if this would work with Subversion in practice, but that's the kind of feature that could make me switch.
That reminds me of Subversion hooks -- there's a lot of power there. I don't think that's transferable to a distributed system.
I believe projects that use distributed models have been using email for that type of message. i.e. "Subject: [MERGE] ab327e678c8"
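If a project did want to automate that convention, picking the revision identifier out of the subject line is about all it takes. A small Python sketch follows; the exact subject format (a `[MERGE]` tag followed by a hex identifier) is assumed from the example above rather than taken from any particular project's rules.

```python
import re

# Optional "Subject:" prefix, the [MERGE] tag, then a hexadecimal identifier.
MERGE_SUBJECT = re.compile(r"^(?:Subject:\s*)?\[MERGE\]\s+([0-9a-f]+)\s*$")


def merge_request_id(subject_line: str):
    """Return the identifier from a '[MERGE] <id>' subject, or None."""
    match = MERGE_SUBJECT.match(subject_line.strip())
    return match.group(1) if match else None
```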
Distributed version control software is not incompatible with communal, centralized development. In practice, most projects that use distributed software still have one central repository that everyone syncs and submits to. They're not using the distributed software to split development off into isolated forks; they're using it precisely to allow work to be shared more easily without such hard boundaries between established developers and new contributors.
Another agreement to this.
I find that distributed systems have many more features than the centralized ones and make it easy to set up a repository. I set up all of the ones on my site with an rsync command. Also, not everyone has access to a public Subversion repository. Or people don't want to use SourceForge or BerliOS.
The key point I see missing here is the fact that centralized development is a subset of what a distributed model can do. At Summersault we switched from CVS to darcs, but kept the centralized development model for the most part.
darcs offers us many benefits, including spontaneous branches, interactive commits and updates, a smart patch algebra, and working on private branches when we need to.
You can read my full write up of our switch from CVS to darcs.
For open source work I also use darcs. We start off using 'darcs send' to exchange patches by e-mail, which works great. For small projects my website works as a central public repo for read-only access, and I filter commits through my e-mail.
For larger projects, we can set up a central server for darcs (which is just copying files to a web server), and give a group of committers ssh access to commit there, while others contribute by e-mail.
I use svn for one project and am definitely slowed down because it doesn't offer interactive commits, updates or easy cherrypicking of patches.
"The key point I see missing here is the fact that centralized development is a subset of what a distributed model can do."
I think there's a good possibility that one of the distributed systems will focus on making centralized development work well. I don't think they've done that yet. They are ignoring something important as a result. Since there's so much competition at the moment, hopefully someone will try to differentiate in this way. I can even imagine a centralized fork; just like svk is like a fork of svn, the opposite might happen. At least, I hope it does -- I personally am not tied to Subversion.
"For larger projects, we can set up a central server for darcs"
I think that should be the first thing to happen for a new project. Personally I "set up" a central server for all of my projects now, because I happen to have an svn repository at my disposal. Keeping 1 project or 100 projects in one svn server isn't substantially more difficult. Managing permissions is a pain... but eh. Easier than ssh accounts at least. I'm also generally fine giving people access based on informal agreements about permission.
"I use svn for one project and am definitely slowed down because it doesn't offer interactive commits, updates or easy cherrypicking of patches."
I don't think these are problems with centralized servers. There's hard problems to be solved, and it's clear the new batch of version control systems is addressing many of those problems better than Subversion has. (And it's not surprising they are doing it with high-level languages.) Subversion hooks have a tremendous amount of potential, though, and I'd hate to lose that feature.
"I think there's a good possibility that one of the distributed systems will focus on making centralized development work well."
For darcs, there's the "darcshive" centralized server project:
Thanks for the response, Ian.
"I think there's a good possibility that one of the distributed systems will focus on making centralized development work well. I don't think they've done that yet. They are ignoring something important as a result."
Working with a centralized server is arguably easier with darcs than svn.
Svn recommends you think about your branches, tags, and overall layout when you create the project. There is no need for this in darcs.
Svn has a special command, 'svn import', for starting a project. darcs has a syntax that is as easy to use, but more general purpose, as it can be used whenever you want to create a new branch:
darcs put firstname.lastname@example.org:/home/repos/my_new_project
There is nothing special to learn about users and permissions, as you can use unix permissions and ssh, or gpg-signed e-mail to transport patches.
It's not clear to me why this wouldn't be considered "working well". It has the added benefit that casual contributors can still commit repeatedly offline on their laptops before sync'ing, or easily create their own branch.
As far as I can tell darcs cannot be run on anything but i386 and amd64. Most of the servers where I work are Linux or BSD running on sparc64.
I'm running it under Mac OSX, so I don't think you're necessarily correct there.
I don't think current centralized SCMs can accommodate large numbers of development branches. The problem comes when mainline development activity is merged into those branches to produce a more up-to-date branch. In CVS and svn, such a merge duplicates the mainline changes--even changes to unmodified files--into the back-end storage system. So, a hundred actively maintained development branches means your mainline development is consuming a hundred times as much space. I've brought up cleverer ways of representing big merges in Subversion, but there hasn't been a whole lot of interest so far. For the most part, projects seem happy storing patches in an issue-tracking system to represent small changes--not a very elegant solution, but one which eliminates 95% of the efficiency problem with branches.
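The arithmetic behind that claim can be written out as a back-of-the-envelope model. The Python sketch below assumes, purely for illustration, that every active branch merges in all mainline changes and that each merge stores a full copy of those deltas; the function name and the megabyte figures are made up.

```python
def stored_mainline_mb(mainline_delta_mb: float, active_branches: int) -> float:
    """Space used by mainline deltas: once on trunk, plus once per merging branch."""
    return mainline_delta_mb * (1 + active_branches)


# 10 MB of mainline changes merged into 100 active branches.
print(stored_mainline_mb(10, 100))
# The same changes merged into a single branch: stored twice.
print(stored_mainline_mb(10, 1))
```

Under these assumptions the cost grows linearly with the number of branches that keep up with trunk, so a single short-lived branch roughly doubles the storage for the changes it merges.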
The distributed SCM answer is to use a hundred times as many hard drives to hold the data. That works fine, but the perception of distributed functionality as an anti-feature is not unique to you: other projects, like gcc, would much rather see development go on "in the open" in the central repository, rather than off to the side where it's harder to find.
I just stumbled across this thread again, and noticed that something appears to have gone uncorrected here. Subversion does do branching cheaply. The whole tree is not copied into the branch: subversion internally stores what amounts to a symbolic link.
It may be the case that if you check out from the top of the tree, you'll get multiple copies of the files. I'm just pointing out how the repository itself works. (This is all described in the svn book.)
Changeset-based "distributed" version control systems (like my favorite, darcs) make patch management and integration possible in a way I haven't seen in subversion, and I see patch management as central to defining the trunk of a project. Say you want to merge the changes I've made over the past three days, because I just made my branch of SQLObject self-aware. (Bouyeah.) With a tree-based VCS you must calculate a diff between my tree three days ago and my tree now and apply that patch to your tree. With a changeset-based VCS, you just apply my patches. Svn's merging is rudimentary at best for this reason. I make mistakes with it all the time in my repositories - and I'm the only committer!
About hooks, I see svn like a car that you can put a different spoiler on, but darcs like an engine you can hook up to a different transmission if you like, and put in a car of your choice. I can see my way much more clearly to building more complex and wonderful things with darcs than I can with svn (e.g., making some sort of dynamic branch-maker-and-authorizer with apache and python and sane security and permissions and discussion, that deletes old stale branches and doesn't take too much space).
In using a distributed version control system, I found the ability to create what you might call micro branches quite useful. In an environment where we were juggling lots of small changes to a large system, being able to create a local staging environment and repository for each small feature is extremely helpful.
The problem with putting all of that in the central repository is that, aside from Subversion's branch merging difficulty, it would clutter up the repository with branches. From the POV of the main repository, having each of these minor fixes come through as a normal commit was quite nice, even though on our desktops they amounted to micro branches.
How big of a problem is a microbranch in Subversion? Greg pointed out that they aren't stored efficiently, but if you make a modification in a branch and then integrate it into the trunk, it only doubles the storage of diffs, no more (at least if I read his comment correctly) -- that doesn't seem so bad. I know I underuse short-lived branches, but Subversion doesn't disallow them; you can delete a branch as soon as you are done with it. The only real issue is that its merging is rather poor, so it discourages branching.
That said, I do see the usefulness of "spontaneous branching" as Mark termed it. There is a certain oddity to the special place the checkout has in Subversion, where in distributed systems there isn't anything quite like a checkout, and there's an opportunity to be somewhat more cohesive as a result.
Well - centralized version control is a subset of distributed version control, since you can still upload (or "push") stuff from your branches to the project's central server. What distributed version control adds is that this action is voluntary, plus the freedom to fork the project independently while keeping the history and the possibility of merging back easily in the future. (And some other bonuses like the ability to work offline.)
So the question is whether this kind of freedom is a good thing. For the forking issue, if you want to fork, you will do that anyway and I doubt you will care that much about how hard that would be. But at least you will retain the full history, and it will be possible to easily merge the projects back later, which I'd say makes the situation at least somewhat better than with centralized VCS.
The "voluntary publication" is probably less clear-cut, but I think it's better this way as well. How does this stuff usually go? If you are new developer, you decide "I need this thing to do X" or "this Y thing is bothering me". So you can either:
- first ask for a branch on the central server. No matter how low the threshold is, this forces you into some kind of a commitment, at least social-like. "Yet sworn word may strengthen quaking heart" - "Or break it". At any rate, this will make you reconsider if you are really sure you will finish this and likely discourage you from trying.
- first hack. If you are successful, you can upload this (and you will do that as likely as you would with the centralized version control), but you had no version control while hacking! After all, this is one of the main points of distributed version control - you can version control your private development so that there is some history when you actually publish it. If you are NOT successful hacking, no one will likely ever see your work, and that's probably the core of your argument against it. The question is whether it outweighs all the other disadvantages, and whether that unsuccessful work is actually worth seeing - I think it isn't, apart from some rather rare exceptions.
The bottom line is, I argue that the additional freedom distributed VCS gives you is a good thing and actually significantly lowers the barrier for new hacks by making it easier to do them and potentially reducing the bureaucracy.
Petr, it's what happens after you've worked on your modification in private and you decide to publish it to the world that's the interesting question.
You can make a patch and post it to Bugzilla, or you can publish your repo on your own server where the maintainers can pull from. Bugzilla isn't as attractive as some kind of anonymous micro-branch, which after all is basically what your patch is. Publishing your own repo is not stable over time. In reality, patches often wait many months for integration, and private repos are likely to move or go down, and you still have the problem of tracking the links to all the private repos.
It seems that the problem with turning anonymous patches into anonymous branches is that svn can't merge between repos (e.g. main and pending-patches), and people are uncomfortable adding anon branches to the main svn repo. Right?
Well, I argue that centralized version control systems ain't any better than decentralized ones here. You can make server-side anonymous branch with sensible decentralized VCSes - at least git and monotone can do that, I'm not that familiar with the others; but you can always just have parallel repositories at the server.
So this is NOT a problem with distributed VCSes.