Wikimedia Engineering/2014-15 Goals/Q3

Early notes for consideration in December.

We distinguish between:


 * Above waterline - top 5 priorities
 * Below waterline - high priority projects/needs, close competitors for top spots
 * Below lava line - not a contender for a top priority

Right now we're not in the sorting stage yet, but we can collect some candidates already identified in the Q2 planning. As we get closer to Q3, we'll collect input from various team leads, the architecture team, and other stakeholders.

Strong Candidates

 * Editing performance cont'd from Q2, as anticipated, unless the project is just not working.
 * Why: We're only kicking this off in November, so we need some time to dig in.
 * A/B and multivariate testing infrastructure. Create better foundations for testing, comparing, and validating user experience changes.
 * Why: Must-have to increase product development velocity (though should be driven by concrete product needs for Q3).
 * Fundraising tech refactor. Make fr-tech less of an island internally and ensure the team can add and rotate team members.
 * Why: Fr-tech needs to staff up to support new integrations (e.g. mobile). This refactor has long been identified as a must-have for enabling that and for making the team more sustainable.
 * Product Instrumentation and Visualization. Solidify the instrumentation and dashboarding pipeline and apply instrumentation consistently across all teams working on user-facing features.
 * Why: Becoming more data-driven in our goals and objectives requires consistent and robust instrumentation.
 * SOA Auth: Provide authentication/authorization as a service for MediaWiki and other consumers
 * Why: As more standalone services are being devised (SOA) it becomes increasingly important that they have a shared basis for authentication/authorization.
 * Goals: Build scalable standalone authn/z service. Integrate this service with MediaWiki for WMF auth needs, providing a future model for service injection in MediaWiki core.
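
For the A/B and multivariate testing candidate above, one foundation such infrastructure typically needs is stable bucket assignment, so that a given user sees the same variant on every visit. The following is a minimal sketch of hash-based bucketing, not a description of any existing MediaWiki component; the function name `assign_bucket`, the experiment name, and the token format are all hypothetical.

```python
import hashlib
from typing import Sequence


def assign_bucket(user_token: str, experiment: str, variants: Sequence[str]) -> str:
    """Deterministically map a user token to one variant of an experiment.

    Hashing the experiment name together with the token means the same user
    can land in different buckets for different experiments, while always
    getting the same bucket within one experiment.
    """
    if not variants:
        raise ValueError("at least one variant is required")
    digest = hashlib.sha256(f"{experiment}:{user_token}".encode("utf-8")).digest()
    # Interpret the first 8 bytes as an unsigned integer and reduce it
    # modulo the number of variants.
    index = int.from_bytes(digest[:8], "big") % len(variants)
    return variants[index]


# Hypothetical usage: a two-way test on an imaginary "new-save-button" change.
bucket = assign_bucket("user-12345", "new-save-button", ["control", "treatment"])
```

Because assignment depends only on the inputs, no per-user state has to be stored to keep the experience consistent; a real system would still need logging of exposures and a way to end experiments.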

Why

 * Must-have to improve product quality.
 * Make deployments less painful
 * Improve our QA.
 * Deploying is still hard, and Beta Cluster is part of that

What

 * Overarching goal: Have real-world metrics and monitoring in place
 * To judge effectiveness of the below goals
 * Reconcile HHVM and Beta Cluster
 * See: HHVM fcgi restart during scap runs cause 503s (and failed tests)
 * Required crossover: MW Core, Ops
 * Dev code pipeline
 * A "nightly" build tested on Beta Cluster would give us a higher degree of certainty (and a firmer commitment to test against) before deploying
 * see also: True code pipeline
 * Required crossover: MW Core (minimal, hopefully)
 * Beta Cluster/Prod reconciliation
 * Identify the rough edges between Prod and BC (we already know of many) and address them
 * The definition of "done" is purposefully left open at this point, but we would timebox the investigation portion to make it deterministic/do-able
 * Required crossover: Ops, MW Core
 * Puppet code pipeline ("stretch-goal")
 * Need a method to allow ops/others to test their code changes on Beta Cluster before prod deployment
 * Why:
 * Will ensure "upstream" puppet changes don't kill Beta Cluster accidentally (and unknowingly)
 * Will allow for fuller testing of puppet changes
 * Required crossover: Ops

Why

 * Biggest organic growth & green field opportunity.
 * The work in Q2 was preparatory for a longer-term plan for new mobile contributor and reader engagement

We are working toward a single, integrated mobile system that feeds into and off of structured data. Our goal is to give mobile readers and contributors more engaging, device- and context-appropriate ways to consume and grow the sum of all human knowledge, both on the mobile devices that exist today and on those that are coming in the near future. This is a complex, multi-stage project, not a one-off experiment.

Apps mid-term (Q3)
More work in some of the following feature areas, focusing on the broad theme of reader engagement:


 * Deeper in-page engagement (e.g., related articles/content)
 * Social sharing (e.g., Tweet-a-fact)
 * Browsing/discovery (e.g., curated collections)

Mobile Web mid-term (Q3)

 * New contribution types
 * test WikiGrok with readers in stable
 * new types of contribution via WikiGrok: ranking claims, using WikiGrok to generate Wikidata descriptors for items, etc.
 * build system to aggregate contributions and ship to Wikidata
 * gamified WikiGrok experience


 * Reader engagement
 * explore social sharing and/or browsing/discovery features
 * page styling overhaul for app-mobile web reader experience parity
 * continue experiments with generating mobile infoboxes from Wikidata to fuel use-cases for Wikidata querying

Apps and Web long-term (Q4 and beyond)

 * Drive traffic from mobile Web to the app
 * Create self-sustained contribution and curation workflows on mobile that lessen the burden of quality control for the existing community (e.g., aggregating, patrol queues, mobile-only flows)
 * Replace more unstructured elements (e.g., infoboxes, tables, references) on apps and Web with structured data that can be restyled natively and used in novel reader and contributor features on apps
 * Continue experimenting to stay ahead of mobile trends

Why
The two mobile teams (apps and web) are highly dependent on having the ability to serve content in a more modular way to readers and editors. Having an easy-to-query central repository of structured data that supplements Wikipedia articles (e.g. Wikidata) makes it possible to create entirely new ways of presenting content to users, but Wikidata currently lacks the infrastructure to fetch anything but very simple information at scale. In order to be able to continue building features like WikiGrok, create easy-to-edit mobile infoboxes from Wikidata, and continue to explore new ways of breaking up content for users on small screens, we need to build this service.

Why
Management wants us to be more data-driven in our feature development and assessments. We have a basic logging and visualization pipeline (Event Logging + Limn) that is functional, but knowledge of how to use it is inconsistent across the organization. We need to collaborate with the engineering teams to make sure they understand how to use this pipeline. In addition, we need to ensure that instrumentation is consistent across the organization and that the tools have the desired features and the necessary capacity, operability, and reliability.

What

 * Set up training, documentation, consultation and office hours for Event Logging (Jan)
 * Review and harmonize schemas
 * Performance test and address system throughput, SPOFs, monitoring, etc.
 * Create/enhance beta system and event QA environment/best practices
 * Identify/implement new features (e.g., data primitives)
 * Establish visualization roadmap
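
As one illustration of what "review and harmonize schemas" could involve in practice, the sketch below scans a set of event schemas and reports fields that are declared with different types in different schemas. It is a simplified model, not the Event Logging schema format: schemas are represented here as plain dictionaries of field name to type name, and the schema and field names are hypothetical.

```python
from collections import defaultdict


def find_conflicting_fields(schemas):
    """Return fields that are declared with more than one type.

    `schemas` maps a schema name to a dict of {field name: type name}.
    The result maps each conflicting field to {type name: [schema names]}.
    """
    seen = defaultdict(lambda: defaultdict(list))
    for schema_name, fields in schemas.items():
        for field, field_type in fields.items():
            seen[field][field_type].append(schema_name)
    return {
        field: {type_name: sorted(names) for type_name, names in by_type.items()}
        for field, by_type in seen.items()
        if len(by_type) > 1
    }


# Hypothetical schemas: "userId" is an integer in one and a string in the other.
example_schemas = {
    "HypotheticalEditEvent": {"userId": "integer", "wiki": "string"},
    "HypotheticalReadEvent": {"userId": "string", "wiki": "string"},
}
conflicts = find_conflicting_fields(example_schemas)
```

A report like this only catches type mismatches; differences in naming (for example `userId` versus `user_id`) would need a separate, more manual review.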

Candidates

 * Phabricator for code review, phase out Gerrit.
 * Why maybe not: This may be premature as we'll still be in the early days of using Phabricator as a PM tool, and may not yet fully understand the requirements. It may also contend for resources with critical test infrastructure work.
 * Also, the team has been working full time at full speed with a high pressure for delivering, and now it's time to take it a bit easier, work on other things, and let Phabricator consolidate. The Code Review project may start, but slower, without being a top priority.--Qgil-WMF (talk) 07:33, 4 November 2014 (UTC)
 * Front-end standardization / UX standardization cont'd as a top priority.
 * Why: We'll have only converted one or at best a few interfaces; if we don't want the interfaces to remain disjoint and instead get the benefit, we'll need to make a concerted push to roll it out to all of core and major extensions.
 * Why maybe not: We're now establishing a lot of technical foundations and working parameters for the team; we may not need to continue it as a top priority to keep the momentum going.
 * Library infrastructure work cont'd.
 * Why: So it doesn't immediately fall by the wayside after making some initial efforts.
 * UX Testing Environment (REFLEX)
 * Why: Our production environment is not set up for running 10/100/1000 users through a battery of tasks without one user invalidating the tasks of subsequent users. A sandboxed environment that reflects the current production state of the sites, with or without modification, plus usability tracking software, is necessary for quantifiably measured qualitative analysis of our prototypes and production features. Ops, platform, analytics, and design research would need to work together on this. Technical details
 * Logging the user environment (FINCH)
 * Why: Wikimedia Product and Design cannot currently make informed decisions about some aspects of our users' experience of our sites, such as access device, platform, and screen size. Critical information about connection speed, geolocation, and technology availability is either unavailable or not easily accessible to decision makers in the Product group.
 * MicroSurveys
 * Why: Ability to quickly identify problems or positive sentiment around existing systems (whether static or in development). These would be brief, few-question surveys to gain quantifiable insight into the usability of current systems for contributors. Testing this out on the mobile version has also been discussed. Good for getting information from both readers and contributors. The two main approaches are:
 * Overall satisfaction (aka net promoter scores): One-click method for users to tell you whether they are generally satisfied.  Good for gathering baseline data, good for comparing different products (e.g., people reading on Mobile apps vs Mobile Web), good for identifying the existence of "skunk in the doorway" problems (problems you didn't know to ask about, because you couldn't see the software from the user's perspective, i.e., that there was a serious problem in the user's path).
 * Structured feedback: Click here to make a suggestion, click here to file a bug report, click here to say that it's working okay for you.  This is still simple from the user's perspective, while being more informative from the product manager's perspective.
 * Localization cache do-over
 * Why: It's currently the largest bottleneck to quick deployments.
 * Beta Popups and/or Echo Notifications for Beta Features
 * Why: We need to build out better notification tools to ensure users are properly informed of upcoming changes. This could be used to notify when a feature will be launched into Beta as well as when it will be launched into Production as a quick way to notify everyone that a change is imminent


 * Testing overhaul 
 * Why: The state of our testing is shameful: it slows us down and prevents even our smartest developers from getting their changes merged. A few examples: operations/puppet has barely any tests, which puts a review burden on ops' shoulders. mediawiki/core tests require you to install MediaWiki and set up the backends, and they are painfully slow. operations/mediawiki-config doesn't validate any site configurations; luckily we have a staging area to confirm them. A first step would be to have true unit tests for the most important projects (site config, puppet, mediawiki/core) and to aim for good coverage over the course of 2015.


 * Progressive rollout of features
 * Why: To deploy a feature to a reduced subset of our user base without having to expose it as a beta feature or fully productionize it for all wikis. Example: we want ContentTranslation deployed only to Catalan Wikipedia, or only to logged-in users of Catalan Wikipedia, making sure no other users can enable the feature (as we know it is not ready for them).
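
The ContentTranslation example can be pictured as a small rule check evaluated per request. This is only a sketch under stated assumptions: `RolloutRule` and `is_enabled` are hypothetical names, not existing MediaWiki configuration, and the wiki identifier `cawiki` is assumed here to stand for Catalan Wikipedia.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class RolloutRule:
    """Hypothetical description of who may see a feature."""
    feature: str
    wikis: FrozenSet[str]          # wikis where the feature is deployed
    logged_in_only: bool = False   # restrict to logged-in users if True


def is_enabled(rule: RolloutRule, feature: str, wiki: str, logged_in: bool) -> bool:
    """Return True only if the request falls inside the rollout rule."""
    if feature != rule.feature:
        return False
    if wiki not in rule.wikis:
        return False
    if rule.logged_in_only and not logged_in:
        return False
    return True


# Assumed identifiers for illustration only.
rule = RolloutRule(
    feature="ContentTranslation",
    wikis=frozenset({"cawiki"}),
    logged_in_only=True,
)
```

The point of the design is that users outside the rule have no switch to flip: the check is made server-side from deployment configuration rather than from a user preference, which is what distinguishes this from a beta feature.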


 * Browser reports
 * Why: to know what our users use to browse the desktop and mobile sites and plan features accordingly


 * Power user tooling
 * Why: Our power users create a disproportionately large share of our content. We should spend some time refining and improving the tools they use so that they can be as effective as possible.
 * Draft proposal: User:Deskana (WMF)/Power user tools development team
 * Skin unification
 * Why: Our users are interacting with two completely separate user interfaces when moving between mobile and desktop devices, and features designed for each of these platforms are not compatible with each other. By improving the skin system to be more modular while targeting both mobile and desktop, a single responsive skin can be converged on and more features can be made available to both targets.

General thoughts

 * Are some of these still too specific, too project-focused? E.g. could fr-tech refactor and library infrastructure work be collapsed into cross-organizational efforts to reduce technical debt, to break up monolithic code, to increase test coverage, etc.?--Erik Moeller (WMF) (talk) 08:02, 17 October 2014 (UTC)
 * To me fr-tech is one of the few projects that absolutely needs to happen and always gets shuffled to the background. Let's give them the resources they need to succeed, given the critical need of those systems supporting everything else that we do. Tfinc (talk) 18:45, 17 October 2014 (UTC)
 * We should be careful about repaying technical debt purely for the sake of repaying technical debt. IMO it is preferable to tie new stuff to the paydown of debt, or at least figure out how to tie the paydown of technical debt to some concrete deliverable - this a) helps us make sure that we're paying off debt in some kind of priority order [eg focusing on the stuff that matters first] while b) not totally sacrificing creating things of value and c) helps ensure that we're paying off the debt in the right way [eg not doing a massive refactor, only to discover when the next new feature comes up we should have refactored a little differently] Awjrichards (WMF) (talk) 18:50, 31 October 2014 (UTC)
 * I wonder if we could tie repaying the tech debt into the work of a new integration, which is also high priority for fr-tech next quarter in order to meet the fundraising goal this FY. AGomez (WMF) (talk) 19:31, 12 December 2014 (UTC)