Requests for comment/Streamlining Composer usage

Proposal
Most of the problems are the result of dealing with a build step (running composer) in the same process step as patch validation/merge. Instead of a separate step before or during deployment.

The master branches (core, extensions and so on) will be changed to run composer during CI like what is done currently for Wikidata.

The following is likely contentious, but at least somewhat unclear. Some proposed solutions:


 * 1) Automatically build and commit mediawiki/vendor (see ).
 * 2) Run Composer during deployment, i.e. during scap.
 * 3) Manually update mediawiki/vendor during branch cut. Do not use that repository for anything else. Not for master nor for any other installations of Mediawiki. This does not solve the circular dependency problem.

To solving the package integrity problem fully there is currently not enough support from our developer community. The following might be a small step in the right direction: For each version of a package add a trusted list of hashes. Then make composer check these and refuse to continue after downloading but before saving to vendor something that doesn't match. The next step might be to replace the trusted list of hashes with one of public GPG keys and use signatures on tags and make Packagist distribute a signed list of hashes for each version.

Review of dependencies needs to happen before a commit that adds them to composer.json is merged.

Open questions
How do we update mediawiki/vendor.git automatically? Do we instead do this during scap? Would we then stop using mediawiki/vendor.git?

What do we do for Mediawiki release branches?

The person creating the wmf branches and deploying the train is likely affected the most by this. What do they think about this?

Do we think we should come up with threat models? What is the minimum that we want to have addressed over all software in production, on deployers machines, on developers machines? Is there enough support to improve the Composer package integrity for all our used dependencies?

What about the problems that are not directly solved by only automating the vendor update?

TODO, reference:

https://www.mediawiki.org/wiki/Requests_for_comment/Extensions_continuous_integration

https://www.mediawiki.org/wiki/Requests_for_comment/Extension_registration

https://www.mediawiki.org/wiki/Requests_for_comment/Improving_extension_management

Background
MediaWiki core and its extensions depend on libraries that are managed via composer. This RFC intends to continue from where Requests for comment/Composer managed libraries for use on WMF cluster left off. Library infrastructure for MediaWiki will increase the use of Wikimedia maintained libraries hugely. To not hinder this effort we need to streamline the process for adding and upgrading composer dependencies and for building and deploying with composer.

Besides library dependencies composer can be used to build an autoloader for parts of core and parts of extensions. To properly make use of that we need to ensure that our build and deployment process works with that.

For development purposes, most people currently run composer in the root of Mediawiki. This loads the composer-merge-plugin which merges all dependencies from other specified composer.json files, usually from extensions. For Wikimedia production deployment from them wmf branches, we do not use composer directly but instead an intermediate repository mediawiki/vendor which is manually updated. In between we have the master development branches, continuous integration jobs and the beta cluster environment which all currently use mediawiki/vendor. Each of the branches might need to use a different strategy for continuous integration in the future.

Wikidata (Wikibase, related extensions and dependencies)
Wikidata will bring in 19 more components, only maintained by people trusted with Wikimedia merge rights (i.e. not components from outside Wikimedia). Its CI uses one composer run like during development instead of mediawiki/vendor. Once per day a virtual machine builds the Wikidata "extension". It contains all extensions needed for Wikidata.org and dependencies. The build output is proposed as a patch to Gerrit for the mediawiki/extensions/Wikidata repository. It is then +2ed by a human. The Composer generated autoloader is in use in these extensions and libraries.

Wikidata dependencies (already outdated, there are now more):

 * 1) composer/installers @v1.0.21 (already in core)
 * 2) data-values/common @0.2.3
 * 3) data-values/data-types @0.4.1
 * 4) data-values/data-values @1.0.0
 * 5) data-values/geo @1.1.4
 * 6) data-values/interfaces @0.1.5
 * 7) data-values/javascript @0.7.0
 * 8) data-values/number @0.4.1
 * 9) data-values/serialization @1.0.2
 * 10) data-values/time @0.7.0
 * 11) data-values/validators @0.1.2
 * 12) data-values/value-view @0.14.5
 * 13) diff/diff @2.0.0
 * 14) serialization/serialization @3.2.1
 * 15) wikibase/data-model @3.0.0
 * 16) wikibase/data-model-javascript @1.0.2
 * 17) wikibase/data-model-serialization @1.4.0
 * 18) wikibase/internal-serialization @1.4.0
 * 19) wikibase/javascript-api @1.0.3
 * 20) wikibase/serialization-javascript @2.0.3
 * 21) propertysuggester/property-suggester @2.2.0 (extension, would become submodule of core)
 * 22) wikibase/wikibase @dev-master (extension, would become submodule of core)
 * 23) wikibase/Wikidata.org @dev-master (extension, would become submodule of core)
 * 24) wikibase/wikimedia-badges @dev-master (extension, would become submodule of core)

wmf deployment branches
The wmf branches of extensions and so on are added as submodules to the wmf branch of mediawiki/core when a new wmf branch is created. Merges to a wmf branch in an extension automatically result in a commit in mediawiki/core that updates the respective submodule. (Introducing automatic submodule updates was a recent changes.)

When a new wmf branch is created care is being taken to trigger the CI (see ). Example from 1.26wmf9: * 20c7219 - Submitting branch for review so that it gets tested by jenkins. refs T101551 (7 days ago)  * 11015b2 - Creating new WMF 1.26wmf9 branch (7 days ago)  * 6521b36 - Creating new WMF 1.26wmf9 branch (7 days ago) 

Double Review
Upgrading a dependency of e.g. the extension Wikibase if it were included in mediawiki/vendor.git: This is work that could be automated. Now a human might not notice when something doesn't match even though they are the magic prevention mechanism for problems that are not specified in enough detail to know what automatic mechanisms could prevent them instead. This is extra manual review work while Wikimedia can't even keep up with the normal influx of reviews :-(.
 * 1) A patch to the dependency (e.g. wikibase/data-model) is proposed.
 * 2) It is reviewed and merged by a Wikimedian.
 * 3) A release for the dependency is done.
 * 4) A patch that updates the requirement in mediawiki/extensions/Wikibase.git is proposed, reviewed and merged.
 * 5) An update of mediawiki/vendor is proposed, causing a second review!

mediawiki/vendor creates a circular dependency
The update of a source composer.json and mediawiki/vendor would need to happen at the same time.

If the CI uses medawiki/vendor it fails because vendor was not updated. Bypassing CI breaks beta.

If the CI uses composer, beta may fail because mediawiki/vendor was not updated.

If mediawiki/vendor is changed first, beta breaks because the rest is not prepared for the new versions in vendor.

(A broken beta also means and development system updated at that time is broken)

This not only applies to the master branches but also to the wmf deployment branches as the submodule is updated automatically, so there is no chance to prepare mediawiki/vendor ahead of time and update composer.json and the vendor submodule in one commit.

Currently this is dealt with by overriding the CI and causing a temporary breakage that is then fixed by updating mediawiki/vendor or the other way around.

Generated autoloader needs frequent updates
mediawiki/vendor.git needs the updated class map for the Composer generated autoloader. We use the optimized variant and thus adding a class means updating the autoloader. So adding a class to core and/or an extension may need an update to vendor.

version check in update.php is not sufficient
It currently only handles a very narrow case of libraries with an exact version specifier to be bundled with core via mediawiki/vendor. We can use the new composer/semver library to make this less terrible.

operations/mediawiki-config
It uses composer. Currently only to pull in wikimedia/cdb. wikimedia/cdb uses an composer generated autoloader. The dependencies are embedded in the git same repo under multiversion/vendor. In theory a namespace/version conflict with mediawiki/core and/or mediawiki/vendor could happen.

Usage of github
Some of the parts here might be on github. Manual:Developing_libraries suggests it is ok. The merge-plugin is hosted there.

Availability
Packagist and whatever other parts involved in downloading all packages might be down or slow.

Package integrity
Having a cryptographic chain of trust from a developer to the software that is run on my machines is what I personally would want. But I don't think I can convince everyone publishing free software that is used for Wikimedia to do that within the next year. - Jan Zerebecki

Composer does not check if the downloaded package is what we expected it to be. There are multiple ways how this could be compromised: Solving some of the above items may give others for free. This is a very rough list without proper threat models. Most other software package managers also do not ensure against compromises of all of these. Notable exceptions are Debian and RedHat based distributions. I heard maven optionally has support for some (or was that some other Java package manager?), but I didn't look in detail. Having an overview of these and which threat they mitigate by what technical means would be interesting but out of scope for this RFC.
 * 1) Transfer integrity: is usually ensured by HTTPS with all its downsides. AFAIK nothing ensures that actually HTTPS URLs are used nor are CA compromises mitigated.
 * 2) Collection integrity: in Composers case Packagist and who may define what a package name resolves to.
 * 3) Download integrity: there is nothing that connects a tar/zip archive download to the git source except that Packagist gives out the URL. Also there is no way to verify that the archive received is actually the one that someone you chose intended you to receive.
 * 4) Source hosting or update integrity: git repository modification because the host where it is located is compromised or because anyone who has access is compromised or gone hostile. This is usually mitigated by signing every commit. See http://mikegerwitz.com/papers/git-horror-story for details.

Blocking creation and use of PHP libraries for this reason, at the same time as not blocking other languages that have the same package manager issues and not blocking further development unless these are solved does not actually achieve security as the weakest chain is your overall security.