Requests for comment/Streamlining Composer usage

Proposal
Most of the problems are the result of dealing with a build step (running composer) in the same process step as patch validation/merge. Instead of a separate step before or during deployment.

The master branches (core, extensions and so on) will be changed to run composer during CI like what is done currently for Wikidata.

The following is likely contentious, but at least somewhat unclear. Some proposed solutions:


 * 1) Automatically build and commit mediawiki/vendor (see ).
 * 2) Run Composer during deployment, i.e. during scap.
 * 3) Manually update mediawiki/vendor during branch cut. Do not use that repository for anything else. Not for master nor for any other installations of Mediawiki. This does not solve the circular dependency problem for wmf deployment branches,  we would just bypass CI and fix it in the next step (which causes problems with zuul, see ).

To solving the package integrity problem fully there is currently not enough support from our developer community. The following might be a small step in the right direction: For each version of a package add a trusted list of hashes. Then make composer check these and refuse to continue after downloading but before saving to vendor something that doesn't match. The next step might be to replace the trusted list of hashes with one of public GPG keys and use signatures on tags and make Packagist distribute a signed list of hashes for each version.

To improve availability Satis may be of use (see ).

Review of dependencies needs to happen before a commit that adds them to composer.json is merged.

Open questions
How do we update mediawiki/vendor.git automatically? Do we instead do this during scap? Would we then stop using mediawiki/vendor.git?

What do we do for Mediawiki release branches?

The person creating the wmf branches and deploying the train is likely affected the most by this. What do they think about this?

Do we think we should come up with threat models? What is the minimum that we want to have addressed over all software in production, on deployers machines, on developers machines? Is there enough support to improve the Composer package integrity for all our used dependencies?

What about the problems that are not directly solved by only automating the vendor update?

Background
MediaWiki core and its extensions depend on libraries that are managed via composer. This RFC intends to continue from where Requests for comment/Composer managed libraries for use on WMF cluster and Requests for comment/Extensions continuous integration left off. Library infrastructure for MediaWiki will increase the use of Wikimedia maintained libraries hugely. To not hinder this effort we need to streamline the process for adding and upgrading composer dependencies and for building and deploying with composer.

Besides library dependencies composer can be used to build an autoloader for parts of core and parts of extensions. To properly make use of that we need to ensure that our build and deployment process works with that.

For development purposes, most people currently run composer in the root of Mediawiki. This loads the composer-merge-plugin which merges all dependencies from other specified composer.json files, usually from extensions. For Wikimedia production deployment from them wmf branches, we do not use composer directly but instead an intermediate repository mediawiki/vendor which is manually updated. In between we have the master development branches, continuous integration jobs and the beta cluster environment which all currently use mediawiki/vendor. Each of the branches might need to use a different strategy for continuous integration in the future.

Neither Composer nor its autoloader should be used to register an extension, that should be done by Mediawiki extension registration which gets triggered by something in LocalSettings.php. See Requests for comment/Improving extension management for improving installation of extensions and dependencies of extensions on other extensions.

Wikidata (Wikibase, related extensions and dependencies)
Wikidata will bring in 19 more components, only maintained by people trusted with Wikimedia merge rights (i.e. not components from outside Wikimedia). Its CI uses one composer run like during development instead of mediawiki/vendor. Once per day a virtual machine builds the Wikidata "extension". It contains all extensions needed for Wikidata.org and dependencies. The build output is proposed as a patch to Gerrit for the mediawiki/extensions/Wikidata repository. It is then +2ed by a human. The Composer generated autoloader is in use in these extensions and libraries.

Wikidata dependencies (already outdated, there are now more):

 * 1) composer/installers @v1.0.21 (already in core)
 * 2) data-values/common @0.2.3
 * 3) data-values/data-types @0.4.1
 * 4) data-values/data-values @1.0.0
 * 5) data-values/geo @1.1.4
 * 6) data-values/interfaces @0.1.5
 * 7) data-values/javascript @0.7.0
 * 8) data-values/number @0.4.1
 * 9) data-values/serialization @1.0.2
 * 10) data-values/time @0.7.0
 * 11) data-values/validators @0.1.2
 * 12) data-values/value-view @0.14.5
 * 13) diff/diff @2.0.0
 * 14) serialization/serialization @3.2.1
 * 15) wikibase/data-model @3.0.0
 * 16) wikibase/data-model-javascript @1.0.2
 * 17) wikibase/data-model-serialization @1.4.0
 * 18) wikibase/internal-serialization @1.4.0
 * 19) wikibase/javascript-api @1.0.3
 * 20) wikibase/serialization-javascript @2.0.3
 * 21) propertysuggester/property-suggester @2.2.0 (MediaWiki extension, would become submodule of core)
 * 22) wikibase/wikibase @dev-master (MediaWiki extension, would become submodule of core)
 * 23) wikibase/Wikidata.org @dev-master (MediaWiki extension, would become submodule of core)
 * 24) wikibase/wikimedia-badges @dev-master (MediaWiki extension, would become submodule of core)

wmf deployment branches
The wmf branches of extensions and so on are added as submodules to the wmf branch of mediawiki/core when a new wmf branch is created. Merges to a wmf branch in an extension automatically result in a commit in mediawiki/core that updates the respective submodule. (Introducing automatic submodule updates was a recent changes.)

When a new wmf branch is created care is being taken to trigger the CI (see ). Example from 1.26wmf9: * 20c7219 - Submitting branch for review so that it gets tested by jenkins. refs T101551 (7 days ago)  * 11015b2 - Creating new WMF 1.26wmf9 branch (7 days ago)  * 6521b36 - Creating new WMF 1.26wmf9 branch (7 days ago) 

Double Review
Upgrading a dependency of e.g. the extension Wikibase if it were included in mediawiki/vendor.git: This is work that could be automated. Now a human might not notice when something doesn't match even though they are the magic prevention mechanism for problems that are not specified in enough detail to know what automatic mechanisms could prevent them instead. This is extra manual review work while Wikimedia can't even keep up with the normal influx of reviews :-(.
 * 1) A patch to the dependency (e.g. wikibase/data-model) is proposed.
 * 2) It is reviewed and merged by a Wikimedian.
 * 3) A release for the dependency is done.
 * 4) A patch that updates the requirement in mediawiki/extensions/Wikibase.git is proposed, reviewed and merged.
 * 5) An update of mediawiki/vendor is proposed, causing a second review!

mediawiki/vendor creates a circular dependency
The update of a source composer.json and mediawiki/vendor would need to happen at the same time.

If the CI uses medawiki/vendor it fails because vendor was not updated. Bypassing CI breaks beta.

If the CI uses composer, beta may fail because mediawiki/vendor was not updated.

If mediawiki/vendor is changed first, beta breaks because the rest is not prepared for the new versions in vendor.

(A broken beta also means any development system updated at that time is broken)

This not only applies to the master branches but also to the wmf deployment branches as the submodule is updated automatically, so there is no chance to prepare mediawiki/vendor ahead of time and update composer.json and the vendor submodule in one commit.

Currently this is dealt with by overriding the CI and causing a temporary breakage that is then fixed by updating mediawiki/vendor or the other way around. (Overriding the CI causes problems with zuul, see .)

Generated autoloader needs frequent updates
mediawiki/vendor.git needs the updated class map for the Composer generated autoloader. We use the optimized variant and thus adding a class means updating the autoloader. So adding a class to core and/or an extension may need an update to vendor.

Usage of libraries without depending on them
Extensions start to use classes from libraries that are pulled in via composer without declaring their dependency in composer.json (see for this problem related to Wikidata).

version check in update.php is not sufficient
It currently only handles a very narrow case of libraries with an exact version specifier to be bundled with core via mediawiki/vendor. We can use the new composer/semver library to make this less terrible.

operations/mediawiki-config
It uses composer. Currently only to pull in wikimedia/cdb. wikimedia/cdb uses an composer generated autoloader. The dependencies are embedded in the git same repo under multiversion/vendor. In theory a namespace/version conflict with mediawiki/core and/or mediawiki/vendor could happen.

Usage of github
Some of the parts here might be on github. Manual:Developing_libraries suggests it is ok. The merge-plugin is hosted there.

Availability
Packagist and whatever other parts involved in downloading all packages might be down or slow. (See related about our CI not being able to run composer because of throttling.)

Package integrity
Having a cryptographic chain of trust from a developer to the software that is run on my machines is what I personally would want. But I don't think I can convince everyone publishing free software that is used for Wikimedia to do that within the next year. - Jan Zerebecki

Composer does not check if the downloaded package is what we expected it to be. There are multiple ways how this could be compromised: Solving some of the above items may give others for free. This is a very rough list without proper threat models.
 * 1) Transfer integrity: is usually ensured by HTTPS with all its downsides. AFAIK nothing ensures that only HTTPS URLs are used, nor is the certificate checked, nor are CA compromises mitigated.
 * 2) Collection integrity: in Composers case Packagist and who may define what a package name resolves to.
 * 3) Download integrity: there is nothing that connects a tar/zip archive download to the git source except that Packagist gives out the URL. Also there is no way to verify that the archive received is actually the one that someone you chose intended you to receive.
 * 4) Source hosting or update integrity: git repository modification because the host where it is located is compromised or because anyone who has access is compromised or gone hostile. This is usually mitigated by signing every commit. See http://mikegerwitz.com/papers/git-horror-story for details.

The following list was is from T105638 by User:CSteipp_(WMF) and needs to be merged with the above.
 * 1) MITM between composer and packagist - in my estimation, this has both High likelihood and High impact. It's a trivial technical attack to pull off without composer checking certificates, and allows the attacker to link their own, modified version of any library into the composed codebase.
 * 2) * Risk: (using high = 3, med = 2, low = 1, and risk = likelihood x impact ), then 3 x 3 = 9
 * 3) * Mitigation: Currently none (adding https certificate checking in composer and requiring all connections to be https shouldn't be too difficult to add and upstream)
 * 4) MITM between composer and packagist with valid packagist certificate - In the event that composer start validating the https connection's certificate, is that enough? In my estimation, this is a fairly difficult attack to pull off, for a someone attacking the WMF's infrastructure, so I put the likelihood at Low. For normal developers running `composer update` on their laptop, this is still moderate, since buying an SSL MITM proxy that contains a code-signing certificate is fairly expensive still, so I'm going to say likelihood is Low to Medium. This has the same impact as #1.
 * 5) * Risk: Low-Medium (1.5) x High (3) = 4.5
 * 6) * Mitigation: Currently none (adding certificate pinning to composer would me moderately expensive, since it doesn't seem that composer would maintain that upstream)
 * 7) Github is compromised (the entire organization) - in my estimation, compromising github would be a difficult. They have a security team that seems to be making reasonable choices currently. So I'd guess the likelihood is Low. The impact, aiui, would be High for normal composer usage (where just the tarball of a repo at a certain commit is downloaded, and does not appear to be integrity checked). If composer does get a checksum that it checks, or composer is setup to clone the repo and checkout a sha1-hash (which seems to be hard to forge), the the impact would be reduced.
 * 8) * Risk: Low (1) x High? (3) = 3?
 * 9) * Mitigation: Currently none (adding certificate pinning to composer would me moderately expensive, since it doesn't seem that composer would maintain that upstream)
 * 10) Packagist is compromised (the entire organization) - Packagist concerns me a little more, since I don't know anything about their operational security. I'll do some quick research on that, or if anyone has a citation, happy to evaluate. The impact would again be High, because a attacker could add any code to any library by pointing to their own version of it.
 * 11) * Risk: ? x High (3) = ?
 * 12) * Mitigation: unknown
 * 13) Github repo is "compromised" by owner accepting hostile pull request - The likelihood seems Low to me, since if we've determined the library is of sufficient quality to include in our codebase (A MediaWiki developer has decided this is a good library to include, and the library has passed security review by my team), then I (hope) it's unlikely they would accept a malicious pull request. If it did happen, impact would be High, however.
 * 14) * Risk: Low (1) x High (3) = 3
 * 15) * Mitigation: Vetting of libraries included in MediaWiki (developer review, security review), updates to vendor are code reviewed
 * 16) Github repo is compromised by owner having their password stolen - Github mitigates this with things like mandatory https with HSTS, using OAuth to interact with other services, optional 2-factor authentication for accounts, review and expiration of potentially unsafe ssh keys. However, assuming may library owners are not going to take advantage of the optional security features, we can probably call the likelihood Medium, and the impact High, since it would allow the attacker to add any code to the repo.
 * 17) * Risk: Medium (2) x High (3) = 6
 * 18) * Mitigation: Updates to vendor are code reviewed

Most other software package managers also do not ensure against compromises of all of these. Notable exceptions are Debian and RedHat based distributions. I heard maven optionally has support for some (or was that some other Java package manager?), but I didn't look in detail. Having an overview of these and which threat they mitigate by what technical means would be interesting but out of scope for this RFC.

Blocking creation and use of PHP libraries for this reason, at the same time as not blocking other languages that have the same package manager issues and not blocking further development unless these are solved does not actually achieve security as the weakest chain is your overall security.