Architecture meetings/Wikimania 2013

From mediawiki.org

Agenda[edit]

We are developing this document:


Initial notes[edit]

About 2/3 of room was at the arch meeting in Amsterdam (May 2013) <- most of the +2s We're trying to figure out the best role of architects & other folks doing core work, and how we should best go about making big changes to the codebase.

  • lots of things have been done as extensions, or other simpler things get done, because folks are often afraid to dive deep :)
  • planned to build a document (the arch guidelines, link above) .... Tim has cleaned it up a lot, so go reread it! https://www.mediawiki.org/wiki/Architecture_guidelines
  • plan to get much more serious about the RFC process -- the way to suggest big changes -- https://www.mediawiki.org/wiki/RFC All have been at least read and some markup info added, but we still need to go through them and find what's actionable.




Best current practices[edit]

  • Tim has written about the following and would appreciate review:
    • Interface stability - Link: https://www.mediawiki.org/wiki/Architecture_guidelines#Interface_changes
      • a big issue for third-party extensions and other extra-wikimedia-sphere-folks. Communication is key: deprecations etc.
      • LTS (long term service) concept... should everything from the last LTS stay until the next one? Opposite extreme: whenever we feel like :)
      • dependency injection may simplify use of back-ocmpat deprecated interfaces?
      • static analysis: do some basic searching of active extensions for usages of things that get changed/removed (but this doesn't cover things not in our source control)
        • maintenance status of extensions may be ....... unknown. Something may sit in source control but be utterly unmaintained. How do we handle these?
          • activity statistics from source control -> show on the extension page. May help to clarify what's active and what might have problems.
      • introduction of interfaces: classes, hooks, etc... many hooks have poor encapsulation (pass $this over, hook now depends on EVERYTHING in that class)
        • so try to keep the interface _clean_ when you add an interface. Don't expose unnecessary stuff.
      • What about user-perceived interfaces? Is it within the scope of this document / discussion to articulate expectations about the stability and performance of the UI?
    • Traditional (encapsulated) OOP - Link: https://www.mediawiki.org/wiki/Architecture_guidelines#Separation_of_concerns_.E2.80.94_encapsulation_versus_value_objects
    • YAGNI - background: http://c2.com/cgi/wiki?YouArentGonnaNeedIt , link to Tim's text: https://www.mediawiki.org/wiki/Architecture_guidelines#You_aren.27t_gonna_need_it
  • The following have not be written about, would appreciate ideas:
    • Testing: Do we recommend testing at three levels of integration? Method/class, semi-integrated (e.g. parser tests) and UI automation? Or just some subset of those? Should test coverage be 100% at the narrowest integration level?
      • we know we need it but don't yet know exactly what's the best thing to do
      • I think the answer to this question depends on the utility of testing infrastructure (Jenkins, Beta Labs) that are still in their early stages. It's difficult to know exactly what class of errors they help us catch and how reliably they help us catch them. So it might make sense to defer this for now.

Changes[edit]

The following changes were proposed in Amsterdam, but were controversial, unresolved, and were not formulated into RFCs:

  • Dependency injection. Separate "service bundles" like RequestContext into many small classes and pass each object instance to each service object constructor. https://www.mediawiki.org/wiki/Architecture_guidelines#Use_dependency_injection_for_external_resources
    • Testing is a big concern here; lots of classes/functions require global structures (databases, users, etc) and setup/teardown is really painful (slow, difficult, or impossible) in unit testing
      • Detach the logic module from the things it relies on, so you can swap in mock objects etc
      • often complaint is 'don't want ot have to pass 20 things on a constructor!" but if you're doing that you're doing it wrong... you should be getting objects from factories that inject the deps for you, usually
      • backwards compat between service modules and existing static methods etc -- in some instances should be doable
      • what code will benefit most from more testing? security/permissions-based, data integrity code, etc? "everything"? :)
        • Collecting data about this is even easier to do than collecting data about dead / deprecated code paths because we are already emitting stack traces that reference objects / files / lines. This is a Wikimedia infrastructure project rather than a MediaWiki architecture project, but I think empirical evidence is crucial to answering these questions well, so perhaps it's something we should commit to as part of this process.
        • mock objects should usually be high-level -- if you're doing a mock of an sql query and you're not testing sql-specific lowlevel code you're doing it wrong
        • c - it's easy to do DI wrong.... be careful :) Get the wikibase people to describe how they do it?
        • adds some complexity: more classes, more files. but the classes are simpler, which should be a win?
    • pretty sure we'll have to do this incrementally......
      • tim warns that some of the funnest bugs come from integration -- so don't lose the integration testing in a rush towards isolated units
        • There's quite a lot happening on that front, so I'm not too worried: beta labs and vagrant make testing your code against a production-like environment substantially easier than before, and there's quite a lot of momentum behind browser testing.
    • NEED A SPECIFIC BABYSTEP TARGET FOR DEP INJECTION -> do that, if we're happy then keep migrating out
      • should perhaps do that _along with_ writing new tests. use existing mock framework from phpunit, etc.... something concrete?
      • api modules? special pages? (but neither of these lets you control instantiation)
    • -> need to build an RFC to move forward on this, get a better idea what to start on...
      • doesn't have to be right away?
      • -> abstraction layer for storage -> refactor the bits out of revision, page, title, and friends. needs permissions service? needs storage service?
      • ^ gabriel & daniel will hack something out

Daniel misses having an abstraction layer for storage

Gabriel & Daniel to hash out an RfC for storage front


Tim proposed the following change which also has not been fully developed:

  • Have an instance manager for request-lifetime singleton objects, which will be used instead of class-static singletons. This would improve testability.
    • If I understand correctly, the precise manner in which this would improve testability, esp. in comparison to injecting individual resources, is that the instance manager would be something we could get right in one place and then enjoy the benefits (faster tests that are simpler to set up) everywhere, vs. fighting this fight every time a new class is added or an existing on refactored. But unless this instance manager is trivial to construct in an interactive shell or draft code, it will make it harder to write small functions and test them as you go along, building up code in small pieces.
      • No, it is just a refinement on the current configure/setup/test/shutdown cycle, especially the shutdown part. The idea is that you can reconfigure MediaWiki without having to destroy every singleton individually, but you would still have to have a complete environment for testing. The instance manager would be trivial to construct, but it would not contain factory functions, those would be elsewhere. But it would provide a standard method for setting singleton instances -- replacing boilerplate like RepoGroup::destroySingleton() and ::setSingleton().

static function singleton() {

   global $wgFooConf;
   if ( !InstanceManager::getInstance( __CLASS__ ) ) {
       InstanceManager::setInstance( __CLASS__, new self( $wgFooConf ) );
   }
   return InstanceManager::getInstance( __CLASS__ );

}

I endorse this message. Or is it a kludge? Kill globals! (that'll take longer though) ^ make sure we don't confuse it with a true solution. but it should help for certain things like testing. churn it on the mailing lists!


Pass factories in sometimes to allow lazy instantiaion... Remember in PHP execution, startup time is very important, it's one of your most frequently executed parts of the code.



best practices, type hinting....

  • benefit 1 - help locality of errors
  • benefit 2 - potential perf improvements under HipHop (citation needed)
  • benefit 3 - documentation
    • also a big help for mock objects for testing!
      • DEFINE NARROW INTERFACES AND USE THEM IN YOUR TYPE HINTING


hard to test in isolation - login,

Other issues with the document:[edit]

  • What is missing?
    • The document puts too much emphasis on deflecting bad contributions and on strategies for dealing with technical debt. I think we could improve it by stating the direction we want to see MediaWiki head in.
      • +1 - but it is easier to agree on what not to do. -- Gabriel
    • The document is heavily biased toward the back-end; there's virtually no mention of MediaWiki's substantial JavaScript code base and no discussion of best practices for building user interfaces. Example issue: HTMLForm provides a way to implement forms that is pleasingly concise and abstract and appeals to programmer instincts, but it's often hard to twist high-level generalized abstractions so that they implement a design spec precisely. On the other hand, implementing design specs precisely often means writing verbose code that is not reusable.
    • Related to dependency injection but deserving of some attention: private / public / protected. (Let's get rid of private / protected anywhere we can -- which is most places, IMO.)
    • Performance expectations for new code, ideally with some automated enforcement.
    • Exception handling. When should you throw errors? Under what conditions should you skip over errors and continue execution?
    • Type hinting. Say that it is good. Because it is.
  • How should it be organised?
    • Ideally as a single document that one could plausibly read from beginning to end. The thing to avoid is having a set of prescriptive and isolated recommendations that people use to bludgeon one another in code review and IRC.

RFCs to discuss (not resolve):[edit]

Configuration database[edit]

  • consider moving defaults to a serialized format?
    • nasty case -- default is dynamic based on other configuration (especially something like $wgServer, which is initialized rather late during setup)
    • default installs should be workable in the core database only (no FS access), but we'll want CDB files or something for high-scale rollouts like Wikimedia!
    • nasty case -- the default might be a list, to which the actual config wants to add (or replace it). This requires the defaults to actually be applied in LocalSettings.php
  • kill the global namespace vars? or keep them for back-compat? or....
  • ..... no cookie-licking! let's kick the RFC into shape to where anybody who has the time can implement it (even if outside wmf)
  • should we start on new config bits only? or include the existing globals so we have a migration path for them, and then deprecate/kill later?
  • we should be opportunistic... maybe dovetail with some project that can use the config db and push it into core starting with that, then move things in?
  • startup performance & access performance are REALLY IMPORTANT ON THIS! memory usage may be important as well
    • ^- pluggable storage, not just arrays. Need to hack out some specs for thre quirements on this. <- may need 'hybrid' storage, such as CDB on filesystem for most Wikimedia stuff but a few settings in local database
    • what config can we safely expose via api? (parsoid can use this etc)
    • note that we have maint scripts for getting config info (eg for admin scripts to use), this sort of model should still work
      • avoid reading the config databases directly from other languages (like a python script), read it through MediaWiki so its pluggable storage always works
      • note: avoid turning into a general key-value store. try to keep it configuration-related!
      • consider the config _store_ as a prereq for config _ui_. (though for example wikia kinda cheated, and there's the old Configure extension... but these are scary :D)
    • EventLogging (and several other extensions) use memcached for managing JSON blobs of configuration across wikis.
  • avoid access to settings as global state, instead pass individual settings or a task-specifc settings object. Any code accessing a global setting/singleton essentially becomes impossible to unit-test.

Page deletion[edit]

  • we generally seem to think this is a good idea
  • check if platonides wants to update it or someone else should catch it up

HTML storage / Parsoid[edit]

  • note: make sure that sanitation of stored stuff happens when retrieving (general html security paranoia!)

Other change ideas to discuss (add yours here):[edit]

  • MongoDB (it's webscale!)
    • Or generic key-value store interface?


General notes[edit]

  • make sure we dedicate a proportional amount of effort to planning on frontend/javascript code as well
    • this is a growing and important part of our codebase
    • we're accumulating expertise in various projects, need to talk about it more
    • get a front-end architecture strategy!
  • be more explicit about criteria that aren't defined/evaluated by backend behavior
  • provide better support for frontend people on scary things like caching (css updates are often problematic, various teams have had to learn how to deal with this separately)
  • regular checkins on RFC process?
  • assign things to people on bugzilla!