Requests for comment/Linker refactor

From mediawiki.org
Request for comment (RFC)
Linker refactor
Component General
Creation date
Author(s) User:Aude
Document status accepted
See Phabricator.

Static methods are widely used throughout the MediaWiki code base. This introduces procedural code, global state and makes it quite a challenge to do good unit tests of the MediaWiki code. It is difficult to isolate pieces of code and test each piece, each class, each method and is not possible to mock the static methods of another class.

One of the most egregious examples is in the Linker class, whose static methods are used nearly 500 times throughout MediaWiki. Especially heavy use occurs in special pages, ChangesList / Enhanced Changes, Difference Engine, etc., making them difficult to test fully.

MediaWiki should move away from using static functions, and a first step or trial of this would be to replace the Linker utility class with service objects. Then other classes using Linker can mock out those methods and isolate their own methods for testing.

Another issue with Linker is that it mixes many different functionalities that are not necessarily directly about "linking". Linker includes code that handles formatting comments, such as in recent changes pages. This should be separated into distinct value and service class (Comment/Summary + formatters). Linker also contains code for formatting the table of contents in pages. This could be split into its own formatter class.

Any refactoring of the Linker class needs to keep performance in mind and how the code interacts with the LinkCache. Methods such as Linker::link() can be quite inefficient, especially if called numerous times. (e.g. on watchlist) In refactoring this code, may also be able to look into how to improve that.

Justification[edit]

  • Improve testability and code quality of MediaWiki
  • More bugs / regressions caught via tests and prevented from making their way into production (although tests alone, do not prevent all bugs)
  • More confidence that MediaWiki code works correctly, without regressions, under hhvm vs. zend. HHVM prides itself on 100% passing tests on dozens of frameworks and php projects.[1] For MediaWiki, that is misleading and not assuring.

As of September 2014, ~6-7% of MediaWiki is covered by unit tests.[2] In the September 2014 quarterly review of the Release/QA engineering team, concern about this was raised by Lila ("we need all three (integration, browser, unit tests) ", "at Sugar, we enforced "no commit without tests") and others ("some of MW's tech debts make it actively hostile to unit tests" --robla, meeting notes) The Linker class is a useful place to start tackling this issues since it is used so widely across the codebase, and new link formatting code can tie into the TitleValue code.

Implementation[edit]

Steps[edit]

  1. Add more tests for Linker class to cover stuff.
  2. Introduce new LinkFormatter (or several, with nice interface) that does not use static methods. (example)
    1. LinkFormatter (and related) only formats links, integrating with TitleValue and formatters
    2. introduce comment / summary value classes and formatters to eventually replace code in Linker
    3. split table of contents formatting code
    4. split out image linker code into separate class
  3. Phase out / deprecate the static methods in the current Linker class and delegate to new code.

Performance[edit]

As measure of performance (more investigation and data needed), here is some profiling data for Special:Recentchanges with the OldChangesList format. Linker::link is called 182 times for outputting 22 changes.

With Linker:

  • OldChangesList::formatChangeLine - called 22 times, takes 140.75 ms/req and 676.32 kb/req.

With LinkFormatter, https://gerrit.wikimedia.org/r/#/c/160980 (incomplete, WIP), and caches cleared:

  • OldChangesList::formatChangeLine - called 22 times, takes 46.6 ms/req and 226.93 kb/req.

More data: User:Aude/ChangesList-Linker performance

Architecture[edit]

Summary classes[edit]

Summaries (or "comments") are stored in the recentchanges and revision tables (and maybe elsewhere also). The comments are stored in a format such as "/* Feature films */ Linking [[Ice Age 5]]". The first part is the "autocomment". In Wikibase, to make autocomments more suitable for multilingual environment, they get stored in a "cryptic" format like "/* wbsetsitelink-add:1|frwiki */Japon". In the recent changes table, comments are stored in a field with varchar(255) while they are stored in revision as tinyblob. (255 bytes)

As these are stored in the database, we will always need a way to parse and work with this format. Then we need a value object to represent the summary, and then a service for formatting them.

  • Summary - value object
  • SummaryParser - to transform what is stored in the database into an object
  • SummaryFormatter - to format in places such as recent changes

In the future, it may be useful to store comments in a more consistent manner and perhaps accommodate storing them in a structured way (e.g. json).

Formatters[edit]

  • LinkFormatter - creates and formats various types of links (might be useful to split this up further)
  • TableOfContentsFormatter - handle formatting for table of contents
  • ImageLinker - makeImageLink[3], makeThumbLink2[4] and related code factored out

Notes[edit]

  1. http://hhvm.com/frameworks/
  2. https://integration.wikimedia.org/cover/mediawiki-core/master/php/
  3. The method makeImageLink() has an NPath complexity of 4448286720 (recommended is 200 or less)
  4. The method makeThumbLink2() has an NPath complexity of 7507200

Further reading[edit]