Git/Conversion

This page discusses efforts to convert our current Subversion repository to Git. There's no official plan to do so, just a couple of interested developers working on and off to make it happen. Tim Starling has given it a "maybe". There's no guarantee that this'll ever happen.

In the meantime, here are some notes and TODOs on what we need for a working conversion to Git.

Plan of attack
To convert the repository we need to:


 * ✅ Get a dump of the repository. There's now a Pushmi mirror running on
 * ❌ Split up and convert the MediaWiki repository.
 * ✅ I tried git-svn, which has all sorts of CPU (conversion from a dump takes 3 weeks), memory (I keep having to kill and restart it) and reliability problems.
 * Trying snerp-vortex which is much more promising. Working with the author to solve some bugs related to the MediaWiki dump. Still a few outstanding. This is the current blocker for further progress.
 * ❌ Get a copy of the old CVS status to reimport its history properly. From an IRC talk between Avar and Brion on 20101026, the current svn repository suffers from cvs2svn bugs.
 * ✅ sourceforge CVS repository enabled by brion (2010-11-24) http://wikipedia.cvs.sourceforge.net/viewvc/wikipedia/
 * ❌ get repository with rsync : rsync -av USER@wikipedia.cvs.sourceforge.net::cvsroot/wikipedia/*
 * ❌ cvs to git conversion
 * ❌ svn to git conversion
 * ❌ Write some documentation about git usage for our developers. A list of useful links might be a good start.
 * ❌ Have developers start using git-svn to learn about git usage.
 * ❌ Convert MediaWiki's infrastructure to Git
 * ❌ Special:CodeReview needs to work with it. Shouldn't be too hard, relatively speaking. It just shells out to SVN, so we just need to find the equivalent Git commands.
 * There might be some more complications actually; I think the current code assumes integer IDs and ID-based ordering. We'd need to change it to accept the longer hex commit ID hashes, and to understand the commit history tree structure. It's not rocket science, but I'd very strongly recommend giving it some more smarts there, as linear squashes of git trees can get real confusing around merges. --brion 20:30, 25 October 2010 (UTC)
 * ❌ Commits via IRC: Should be easy with CIA and *insert hundreds of IRC bots here*
 * ❌ Commits via E-Mail: ditto
 * The two above should be no problem if running our own primary git repo. I would recommend also having automatic mirrors syncing to a live backup mirror on github or gitorious. (Note that gitorious.org does not allow adding your own post-commit hooks etc.) --brion 20:30, 25 October 2010 (UTC)
 * ❌ Convert the Bugzilla code to recognize the new SHA-1 commits.
 * ❌ Create a database of SVN revision IDs -> Git SHA-1s. Needed for redirecting CodeReview links and anything else that uses rXXXX to the new commit IDs.
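
As a rough illustration of the last item, here's a minimal sketch of building the revision map and rewriting rXXXX references. It assumes the converted commits carry git-svn style trailers ("git-svn-id: URL@REVISION UUID") in their messages; a conversion done with another tool may not write these, in which case the map would have to come from that tool's own output. The function names and the NUL-separated log format are made up for this example.

```python
import re

# Assumption: converted commits end with a git-svn style trailer, e.g.
#   git-svn-id: http://svn.example.org/repo/trunk@12345 <repository-uuid>
GIT_SVN_ID = re.compile(r"^\s*git-svn-id:\s+\S+@(\d+)\s+\S+\s*$", re.MULTILINE)

# Matches references such as r12345 when they stand alone as a word.
REV_REF = re.compile(r"\br(\d+)\b")


def build_revision_map(log_text):
    """Map SVN revision numbers to Git SHA-1s.

    `log_text` is expected to be the output of something like
    `git log --format=%H%n%B%x00`: one record per commit, the SHA-1 on the
    first line, then the message, with records separated by NUL bytes.
    """
    mapping = {}
    for record in log_text.split("\0"):
        record = record.strip()
        if not record:
            continue
        sha, _, body = record.partition("\n")
        match = GIT_SVN_ID.search(body)
        if match:
            mapping[int(match.group(1))] = sha
    return mapping


def resolve_references(text, mapping, abbrev=10):
    """Replace rNNNN with an abbreviated SHA-1 when the revision is known."""
    def swap(match):
        sha = mapping.get(int(match.group(1)))
        return sha[:abbrev] if sha else match.group(0)
    return REV_REF.sub(swap, text)
```

Unknown revisions are left untouched so nothing gets silently mangled. Note that once the repository is split up, a single SVN revision can correspond to commits in several Git repositories, so a real table would probably need a repository column as well.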

Split up and convert
A naïve conversion of the entire repository (with branches) weighs in at around 650MB (early 2010). It makes no sense to have one Git MediaWiki repository; it should be split up.

In Subversion, everything gets squashed into one giant repository. In Git, repositories are split at the boundaries that code does not cross.

Splitting

 * ❌ Everything in trunk/* gets its own repository
 * ❌ Further, everything in extensions/* gets its own repository
 * ❌ Maybe other bits too, like tools/*, get split up

All of these bits get the branches/* and tags/* history relevant to them integrated into their repository.
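
To make the splitting rule a bit more concrete, here's a small sketch of how a path in the SVN tree could be routed to a target repository and ref. The layout it assumes (trunk/PROJECT/..., trunk/extensions/NAME/..., and the same structure one level down under branches/NAME/ and tags/NAME/) is an assumption for the example, as are the function name and the project, branch and file names used in the comments.

```python
# Directories whose children each become a repository of their own.
# "extensions" follows the list above; "tools" could be added if it gets
# split up as well.
NESTED = {"extensions"}


def route(svn_path):
    """Return (repository, ref, path_in_repository), or None if not routable.

    Assumed examples:
      trunk/phase3/index.php               -> ("phase3", "trunk", "index.php")
      branches/REL1_16/extensions/Cite/x   -> ("extensions/Cite",
                                               "branches/REL1_16", "x")
    """
    parts = [p for p in svn_path.split("/") if p]
    if not parts:
        return None

    if parts[0] == "trunk":
        ref, rest = "trunk", parts[1:]
    elif parts[0] in ("branches", "tags") and len(parts) >= 2:
        ref, rest = parts[0] + "/" + parts[1], parts[2:]
    else:
        return None

    if not rest:
        return None

    # extensions/NAME is a repository; anything else is one level deep.
    depth = 2 if rest[0] in NESTED else 1
    if len(rest) < depth:
        return None
    return "/".join(rest[:depth]), ref, "/".join(rest[depth:])
```

A conversion script could use something like this to decide which repository each changed path in a revision belongs to, and to spot revisions that touch several repositories at once.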

Git submodule repositories can be created to track various aggregates people are interested in. There could be a repository with:


 * MediaWiki core + Wikimedia extensions
 * All extensions (like checking out extensions/ now)

There would need to be on-commit hooks to update these submodules. This has some disadvantages. For example, if some function gets deprecated in core, one would need to commit in core and in all the extension repos. This is easy to script but a bit harder than with SVN today.
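
A hook along those lines could boil down to a handful of git commands run in a checkout of the aggregate repository. The sketch below only builds the command list; the function names, directory layout and commit message are assumptions for the example, and a real hook would also need locking, pushing and error handling.

```python
import subprocess


def submodule_bump_commands(aggregate_dir, submodule_path, sha, message=None):
    """Commands that pin one submodule of an aggregate checkout to `sha`."""
    sub_dir = f"{aggregate_dir}/{submodule_path}"
    if message is None:
        message = f"Update {submodule_path} to {sha[:10]}"
    return [
        # Fetch the new commits into the submodule checkout.
        ["git", "-C", sub_dir, "fetch", "origin"],
        # Move the submodule to the exact commit that was just pushed.
        ["git", "-C", sub_dir, "checkout", "--detach", sha],
        # Record the new submodule pointer in the aggregate repository.
        ["git", "-C", aggregate_dir, "add", "--", submodule_path],
        ["git", "-C", aggregate_dir, "commit", "-m", message],
    ]


def run_all(commands):
    """Run each command in order, stopping at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)
```

Pinning to the pushed SHA-1, rather than to whatever the branch tip happens to be when the hook runs, keeps the aggregate history in step with the individual repositories.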

On the other hand, keeping it all in one repository would mean a much larger repository. Anyone wanting to hack on core would need the full history of all extensions.

Converting

 * ✅ Every commit needs to be rewritten to give name/email pairs to SVN users. There's a tool for this already that works with git-filter-branch.
 * ❌ Have people populate the USERINFO/ directory. The information in it is incomplete, and so is the conversion as a result.
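
For illustration, here's a minimal sketch of the lookup side of that rewrite. It is not the existing tool. It assumes the name/email pairs have been collected into a flat file in the git-svn authors-file format ("login = Full Name <email>"); how the USERINFO/ files would be turned into that is not covered. The function names, the placeholder domain and the skipping of "#" comment lines are all choices made for this example.

```python
import re

# One entry per line, git-svn authors-file style:
#   jdoe = Jane Doe <jane@example.org>
AUTHOR_LINE = re.compile(r"^\s*(\S+)\s*=\s*(.+?)\s*<([^<>]*)>\s*$")

# Placeholder for users without an entry (assumption, not a real address).
PLACEHOLDER_DOMAIN = "users.invalid"


def parse_authors(text):
    """Return {svn_login: (name, email)} from authors-file text."""
    authors = {}
    for line in text.splitlines():
        # Skipping blank and "#" lines is a convenience of this sketch.
        if not line.strip() or line.lstrip().startswith("#"):
            continue
        match = AUTHOR_LINE.match(line)
        if match:
            authors[match.group(1)] = (match.group(2), match.group(3))
    return authors


def identity_for(login, authors):
    """Name/email pair for an SVN login, with a visible placeholder fallback."""
    if login in authors:
        return authors[login]
    return (login, f"{login}@{PLACEHOLDER_DOMAIN}")


def missing_logins(logins, authors):
    """Logins seen in the history that still lack an entry."""
    return sorted(set(logins) - set(authors))
```

Running missing_logins over every committer in the dump would give a list of people to chase before the final conversion.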

History

 * See history of MediaWiki version control

Working on the conversion

 * User:Ævar Arnfjörð Bjarmason
 * User:^demon

Would like to see it happen

 * Aryeh Gregor
 * Ashar Voultoiz (already uses Git locally)