User:Bawolff/review/Git2Pages

context: http://lists.wikimedia.org/pipermail/wikitech-l/2013-March/067924.html

Git2Pages.php
// Options default values $wgGit2PagesDataDir = sys_get_temp_dir;

Should be sys_get_temp_dir() instead: sys_get_temp_dir is a function, so it needs to be called; as written, the bare name is treated as an undefined constant.

Git2Pages.body.php
include 'GitRepository.php'; Should be using the AutoLoader.

Git2PagesHooks::PullContentFromRepo
$url = $options['repository']; This line comes before your check that $options['repository'] is actually set.

if( !isset( $options['repository'] ) || !isset( $options['filename'] ) ) { return 'repository and/or filename not defined.'; } This one is missing the error formatting around it.

More generally, for the error handling: if it's an error that can be triggered by the user, it should be in the i18n file and retrieved with wfMessage(). (Errors that would need sysadmin attention, like "Disk space full", don't necessarily need to be translatable, although sometimes people do make them translatable.)
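
For instance, something along these lines (a sketch; the message key git2pages-missing-params is hypothetical and would need a corresponding entry in the extension's i18n file):

```php
// Hypothetical message key; it would need an entry in the extension's
// i18n file. ->text() returns the translated plain-text message.
if ( !isset( $options['repository'] ) || !isset( $options['filename'] ) ) {
	return wfMessage( 'git2pages-missing-params' )->text();
}
```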

Additionally, to be super picky, the error messages should start with a capital (I know that seems quite nitpicky, but users do notice those types of things).

} catch( Exception $ex ) { $output = ' ' . $ex->getMessage() . ' '; } Should be HTML-escaped. I know that none of the possible exceptions have any HTML in their messages, but escaping here would prevent bad things from happening if someone later added an exception that did have HTML in the message, without realizing the message would be output.
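
A minimal sketch of the escaping, using plain htmlspecialchars() (any surrounding error markup is left out here; the helper name is mine, not the extension's):

```php
// Escape the exception message so any markup in it is rendered inert
// instead of being interpreted as HTML in the page output.
function formatError( Exception $ex ) {
	return htmlspecialchars( $ex->getMessage() );
}
```

For example, formatError( new Exception( '<script>' ) ) comes back with the angle brackets entity-encoded.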

$gitFolder = $wgGit2PagesDataDir . DIRECTORY_SEPARATOR . md5( $url . $branch ); Should do a check to make sure $wgGit2PagesDataDir is writable.
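
A sketch of such a check, failing early before any git commands run (the function name is mine, not the extension's):

```php
// Fail early, before any git commands run, if the data directory is unusable.
function assertWritableDir( $dir ) {
	if ( !is_dir( $dir ) || !is_writable( $dir ) ) {
		throw new Exception( "Data directory is missing or not writable: $dir" );
	}
}
```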

$branch = wfEscapeShellArg( $options['branch'] ); It's better to escape output closer to where it is used (so in this case, escape the option just before you plug it into wfShellExec). Escaping near the output helps people see at a quick glance that everything that needs to be escaped is escaped, instead of having to go back through the code to find where the relevant variable is escaped.
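
To illustrate the pattern, with plain escapeshellarg() standing in for MediaWiki's wfEscapeShellArg():

```php
// Keep the option raw until the command string is built, then escape it
// right where it is interpolated, so the shell call can be audited at a glance.
$options = array( 'branch' => 'master' );
$cmd = 'git pull origin ' . escapeshellarg( $options['branch'] );
```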

Git2PagesHooks::extractOptions
$pair = explode( '=', $option ); I would recommend passing a limit of 2 here, so that option values containing an equals sign are not split further.
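
With a limit of 2, a value containing '=' survives the split intact:

```php
// explode() with no limit splits at every '='; a limit of 2 keeps the
// remainder of the value intact after the first '='.
$option = 'filename=docs/a=b.txt';
$pair = explode( '=', $option, 2 );
// $pair is array( 'filename', 'docs/a=b.txt' )
```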

GitRepository::__construct
function __construct( $gitUrl ) { $this->gitUrl = $gitUrl; }

Should do a sanity check that the url is valid, either here, or during parameter validation in Git2PagesHooks.

Unclear question: would a file:// url be a valid thing? Potentially a disclosure of information about the server, but there could be valid cases where someone wants to expose a git repo on the current server. However, on any sort of production server the git repo would probably be on a different web server than the mediawiki install (at least for a big wiki). Personally I would lean towards not allowing file:// urls at all, but having them disabled by default and only allowed if whitelisted could be ok too.
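
The scheme check could be sketched with plain parse_url(); the allowed-scheme list here is an assumption, not something the extension defines (and ssh-style git@host:path addresses would need separate handling):

```php
// Reject urls whose scheme is not on a known-safe list; file:// is
// deliberately absent from the list.
function isAllowedGitUrl( $url ) {
	$allowedSchemes = array( 'http', 'https', 'git', 'ssh' );
	$scheme = parse_url( $url, PHP_URL_SCHEME );
	return in_array( strtolower( (string)$scheme ), $allowedSchemes, true );
}
```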

GitRepository::SparseCheckoutNewRepo
static function SparseCheckoutNewRepo( $url, $gitFolder, $checkoutItem, $branch ) The construction of the object involves passing the url. Why does this method also take a url? Either it should use the url stored in the object, or the object's constructor should not take a url parameter.

$sparseCheckoutFile = '.git/info/sparse-checkout';
wfShellExec( 'git init' );
wfShellExec( 'git remote add -f origin ' . $url );
wfShellExec( 'git config core.sparsecheckout true' );
wfShellExec( 'touch ' . $sparseCheckoutFile );
wfShellExec( 'echo ' . $checkoutItem . ' >> ' . $sparseCheckoutFile );
wfShellExec( 'git pull ' . $url . ' ' . $branch );
wfDebug( 'GitRepository: Sparse checkout subdirectory' );

For a long list of commands like this, it might make more sense to write a shell script and execute that shell script. (OTOH that wouldn't be portable to Windows, but I don't think that's a concern. If it is, one could probably make a batch file too.)

Additionally, $url needs to be escaped. $sparseCheckoutFile as well; however, that's not as critical since it's an md5sum, but nonetheless it should still be escaped.

GitRepository::AddToSparseCheckout
if( $file = file_get_contents( $gitFolder . DIRECTORY_SEPARATOR . $sparseCheckoutFile ) ) {
	if( strpos( $file, $checkoutItem ) === false ) {
		wfShellExec( 'echo ' . $checkoutItem . ' >> ' . $sparseCheckoutFile );
	}
} else {
	wfShellExec( 'touch ' . $sparseCheckoutFile );
	wfShellExec( 'echo ' . $checkoutItem . ' >> ' . $sparseCheckoutFile );
}

Seems like it would be more natural to directly write to the file with php functions instead of shelling out to echo.
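A sketch of the direct-write approach (the function name is mine); this also sidesteps the shell-escaping concerns entirely and makes write failures detectable:

```php
// Append an item to the sparse-checkout file with PHP file functions
// instead of shelling out to touch/echo. Creates the file if missing.
function addCheckoutItem( $path, $item ) {
	$current = is_readable( $path ) ? file_get_contents( $path ) : '';
	if ( strpos( $current, $item ) === false ) {
		if ( file_put_contents( $path, $item . "\n", FILE_APPEND ) === false ) {
			throw new Exception( "Could not write sparse-checkout file." );
		}
	}
}
```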

GitRepository::FindAndReadFile
/**
 * Finds and reads the file.
 *
 * @param string $gitFolder contains the path to git repo folder
 * @param array $options contains user inputs
 */
function FindAndReadFile( $filename, $gitFolder, $startLine = 1, $endLine = -1 ) {

Docs should be kept up to date as functions change :P ($options is no longer there).

if( $fileArray = file( $filePath ) ) { Coding convention nitpick - I'm personally not a fan of assignment in an if conditional.

if( $fileArray = file( $filePath ) ) {
	if( $endLine == -1 ) {
		$lineBlock = array_slice( $fileArray, $startLine - 1 );
	} else {
		$offset = $endLine - $startLine;
		$lineBlock = array_slice( $fileArray, $startLine - 1, $offset + 1 );
	}
	return implode( $lineBlock );
}

This reads in the entire file in one go. It should probably read a line at a time (in case someone tries to display a 1GB file; wfShellExec should prevent the creation of tremendously large files, but it's still probably best to read a line at a time and only read what is necessary). There should also probably be a sanity max-number-of-lines check somewhere (probably in Git2PagesHooks as part of validation of user input; the sanity check should just catch insane values, like displaying over 20000 lines).

throw new Exception( "File does not exist or is unreadable." );

Use the class MWException for any exceptions thrown (it doesn't overly matter here since you catch all thrown exceptions, but it's still good practice). Even better would be to create a subclass of MWException (for example Git2PagesException) and use that for your exceptions. Then in Git2PagesHooks::PullContentFromRepo, only catch your Git2PagesException subclass. This would let any other exception types pass through (for example, if you call some MediaWiki function that throws an exception), so you only do error handling on the exceptions made by your code.
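
The line-at-a-time reading could look roughly like this (a sketch, keeping the 1-based inclusive $startLine/$endLine convention with -1 meaning "to the end"; the function name is mine):

```php
// Read only the requested line range instead of slurping the whole file.
function readLineRange( $filePath, $startLine = 1, $endLine = -1 ) {
	if ( !is_readable( $filePath ) ) {
		throw new Exception( "File does not exist or is unreadable." );
	}
	$handle = fopen( $filePath, 'r' );
	$out = '';
	$lineNo = 0;
	while ( ( $line = fgets( $handle ) ) !== false ) {
		$lineNo++;
		if ( $endLine !== -1 && $lineNo > $endLine ) {
			break; // past the requested range; stop reading
		}
		if ( $lineNo >= $startLine ) {
			$out .= $line;
		}
	}
	fclose( $handle );
	return $out;
}
```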

Concurrency
What if two different people try to check out the same git repo at the same time? (This is actually quite likely to happen - say two people go to the same page at the same time; if nothing is in the parser cache, they will both try to render it at the same time.) Should probably have some sort of lock to prevent the extension from stepping over itself.
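
A minimal sketch of such a lock using PHP's advisory flock(); the lock-file naming and helper name are assumptions, and note that if the callback throws, the lock is only released when the process exits:

```php
// Serialize work on one checkout directory across concurrent requests.
function withRepoLock( $gitFolder, $callback ) {
	$lockFile = $gitFolder . '.lock';
	$handle = fopen( $lockFile, 'c' ); // create if missing, don't truncate
	if ( $handle === false ) {
		throw new Exception( "Could not open lock file." );
	}
	flock( $handle, LOCK_EX ); // blocks until any concurrent checkout finishes
	$result = $callback();
	flock( $handle, LOCK_UN );
	fclose( $handle );
	return $result;
}
```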

Security

 * Should have the option of only allowing a whitelist of sites. I don't know for sure, but I imagine the extension is much more likely to be deployed to mw.org if it is limited to only Wikimedia's git repo. This would prevent someone from using the extension as a proxy for a DOS attack (i.e. someone puts a page through action=parse of the API and hits refresh five billion times a second; the extension would then send out a lot of git clone requests).
 * Can a git checkout involve creating files with the execute bit set? If so, explicitly removing the execute bit on checked-out files might make sense. (Obviously this is a bit of paranoia, as if people can execute arbitrary files on your server you are already screwed, but paranoia doesn't hurt.)
 * Should warn the user in the install docs not to make $wgGit2PagesDataDir anywhere web accessible.
 * Directories/files created by this script should be writable only by the web server. That seems to already be the case.
 * Using predictable directory names in the publicly writable system tmp directory is probably bad on a shared system. Another user could create a git checkout in the directory your extension would use, and add a post-merge hook (which would be executed during git pull). Your extension would then be executing this other person's code with the web server's permissions. This could be combated by adding some random numbers to the data that gets md5'd to make the directory name, so that the directory used is not predictable. That would of course make a different directory be used every time. The other way to prevent this issue is for the person setting up the extension to point $wgGit2PagesDataDir to somewhere only writable by the webserver.
 * Alternatively, perhaps adding $wgSecretKey to the data hashed together would be a good way to prevent this attack.
 * Prefixing these directory names with the extension's name so that they are easily recognizable might be good. (For example, in case the directories start taking up significant disk space, a sysadmin could easily tell what program is generating them.)
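
The last two points together could be sketched like this ($secretKey stands in for $wgSecretKey, and the git2pages- prefix is my suggestion, not existing behavior):

```php
// Mix a server-side secret into the hash so another local user cannot
// pre-create the checkout directory, and prefix it recognizably.
function checkoutDirName( $dataDir, $url, $branch, $secretKey ) {
	return $dataDir . DIRECTORY_SEPARATOR
		. 'git2pages-' . md5( $secretKey . $url . $branch );
}
```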

Freshness of results

 * Things should definitely be re-pulled from the git repo the next time the page is parsed.
 * The results of getting the git file are stored in the parser cache (which is good). However, given that the data can change, we may want to reduce the expiry of the parser cache (I'd recommend a value of 3 days by default, with it being configurable as a global). This can be accomplished via ParserOutput::updateCacheExpiry().
 * This would be a lot more work, but the extension could potentially store in the db a links table recording which file from which repo is shown on which pages. Then have a post-commit hook in the git repo (if we're assuming most snippets are from one specific repo) that adds HTMLCacheUpdate jobs to the job queue to purge the relevant pages every time someone pushes a new version of the file being shown to git.
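
A hedged sketch of what the reduced expiry could look like inside the parser function; $wgGit2PagesCacheExpiry is an assumed config global, not an existing one:

```php
// Assumed config global: parser cache lifetime for pages using this
// extension, in seconds (3 days here).
$wgGit2PagesCacheExpiry = 3 * 24 * 60 * 60;

// Inside the parser function, cap the cache expiry of the rendered page.
$parser->getOutput()->updateCacheExpiry( $wgGit2PagesCacheExpiry );
```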

Other

 * [If we ignore the security issue with guessable tmp directory names for a moment] I believe it's usually assumed that scripts clean up files placed in the temp directory after they're done with them. At the same time, it would be good to keep the checkouts around for efficiency reasons. It might be good to have an option to delete the git checkout directory afterwards.
 * The exit codes of the various wfShellExec calls are not checked. They should be checked in case something goes wrong (which could be anything from the disk being full to git not being installed, etc.). If something returns a non-zero exit status, the attempt to check out the repo should be aborted, and the user should get some sort of error message (it doesn't necessarily have to be detailed).
 * Might be good to have a syntax for short aliases of git repos: being able to give a short, preconfigured name for a well known repo instead of the entire url. If it was installed on mw.org, I imagine that most people would use it to fetch files from wmf's repo. If another group used it, I imagine they too would mostly want to fetch their own code, so it would be good to not require typing the entire long url every time.
 * Most mediawiki parser functions do not have their first argument (the one directly after the colon) as a named argument. Perhaps optionally allowing that first argument to be the filename would make sense.
 * In actual testing: when I tested this on mediawiki/core, the git remote add -f origin step took forever (aka longer than would be considered acceptable; probably an order of magnitude longer than php's execution time limit). The -f part of this command causes it to download the entire history of the git repo. This seems unnecessary, and not ok performance wise. (I think you can do shallow checkouts or something to fix this. I don't know very much about git internals.)
 * Could be cool to have an option to checkout a specific version of a file (aka not the most recent version).
 * As noted in your email, as people change the files, line numbers change. One way to deal with this is to have an option where, instead of line numbers, the snippet starts at the first line starting with some given text (say, foo). We probably don't want to let users specify arbitrary regexes, but simple patterns could probably be allowed.
 * What would be really cool (but possibly rather difficult, and would involve making the extension understand code) is a user being able to ask for the function named Bar, and have just that function shown.
 * Ideally this extension would respect $wgHTTPProxy. Maybe passing an environment variable to git would be enough to ensure that. (Or maybe it already magically works. I didn't test that.)
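
For the exit-code point above, wfShellExec() already reports the status through its second (by-reference) parameter; a sketch, assuming the MediaWiki shell wrapper:

```php
// $retval receives the command's exit status via wfShellExec's
// second, by-reference parameter.
$retval = null;
$output = wfShellExec(
	'git pull ' . wfEscapeShellArg( $url ) . ' ' . wfEscapeShellArg( $branch ),
	$retval
);
if ( $retval !== 0 ) {
	// Abort the checkout; the user-visible message need not be detailed.
	throw new Exception( 'Git command failed with status ' . $retval );
}
```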