Onboarding new Wikipedians/Recommender system

This document describes a simple recommender system for new Wikipedians, to be delivered via Extension:GettingStarted.

Rationale and background
Our previous method for selecting tasks delivered in the landing page depended on SuggestBot, a recommendation engine originally designed to deliver recommended articles based largely on a user's past contribution history. This dependency on a bot and its edits to an associated template provided us with a minimum viable product to release as an experiment. Now that A/B testing data suggests the approach is worth pursuing in earnest, we built a simple task recommendation engine within the extension, in order to:


 * 1) give us greater control over the type of task delivered, the frequency, and the user interface
 * 2) deploy Extension:GettingStarted outside English Wikipedia with full i18n/l10n support

User experience
Our end goal is to deliver compelling tasks to users that, when completed, improve Wikipedia. The primary interface through which users will experience this is the "Getting started" page, but to create a compelling task list within that page, we will ultimately need to discover what makes a good task for newcomers to Wikipedia.

We propose that, at a high level, great tasks for beginners on Wikipedia...


 * have a clear beginning and end
 * feel rewarding to do, even if they're small
 * don't require extensive knowledge of community rules and norms

One further assumption we're making at this stage is that, because we're beginning from a cold start with users who have no editing history, the tasks we'll be delivering will not require interest or expertise in the subject. As time goes on, we may use completed tasks to filter the recommendations, but for now we're not trying to personalize the recommendations upfront.

Architecture

 * Task recommendation process:
 * 1) Generation
   * What sources we derive tasks from
   * What attributes we filter tasks on
 * 2) Queueing
   * How we store tasks so they are ready for delivery
   * How often we refresh the queue of tasks
   * How long the task queue is
 * 3) Delivery
   * How tasks get delivered to extensions or other user interfaces
 * 4) Optimization
   * How the system learns from data about which tasks are chosen or completed to improve its recommendations

Task generation
Possible sources of tasks include:
 * Categories (such as those in Wikipedia:Backlog)
 * RecentChanges events
 * Extensions and the feeds related to them, such as NewPagesFeed/Page Curation or Echo.
 * Wikitext parsing, such as to find spelling errors

Possible attributes we can filter tasks by include:
 * Length
 * Markup complexity, e.g. the presence of infoboxes or references
 * Categories
 * Media, e.g. pages which lack images
 * Pageviews

CategoryRoulette currently attempts (up to 100 times) to select $$n$$ (currently 12) distinct articles from a given category. The filtering described above is applied after the list of pages has been generated, in the delivery extension. Filtering during generation of the task list instead might improve efficiency and reliability, but this remains to be decided.
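To make the bounded-retry selection concrete, here is a minimal sketch in Python. It assumes the category members are already available as a list of page IDs and draws one at random per attempt; the function name `pick_distinct` and this draw-and-retry strategy are illustrative assumptions, not a description of how CategoryRoulette is actually implemented.

```python
import random

MAX_ATTEMPTS = 100  # from the text: up to 100 attempts
TARGET_COUNT = 12   # from the text: n is currently 12


def pick_distinct(category_members, n=TARGET_COUNT, max_attempts=MAX_ATTEMPTS, rng=None):
    """Draw random members one at a time until n distinct ones are found.

    Gives up after max_attempts draws, so the result may hold fewer than n
    page IDs (for example when the category is small).
    """
    rng = rng or random.Random()
    chosen = []
    seen = set()
    if not category_members:
        return chosen
    for _ in range(max_attempts):
        page_id = rng.choice(category_members)
        if page_id in seen:
            continue  # duplicate draw: this attempt is spent
        seen.add(page_id)
        chosen.append(page_id)
        if len(chosen) == n:
            break
    return chosen
```

Capping the number of attempts keeps the cost of a request bounded even for very small categories, at the price of sometimes returning a shorter list.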

Queueing
We plan to avoid creating new database tables to keep a queue of tasks, and will first attempt to implement the queue as a Redis collection of page IDs. Keeping a queue of pages, as opposed to generating the list of tasks with a fresh database query each time, is required to compensate for the complexity of the queries. Our current plan is to keep a queue of all articles appropriate to the list, and to refresh it whenever a relevant event occurs (such as a category being removed from a page or a page being deleted).
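The queue behaviour described above can be sketched as follows. This is an in-memory stand-in so that the example runs without a Redis server; the class and method names are hypothetical. If the queue were kept as a Redis set, the three operations would correspond roughly to the SADD, SREM and SRANDMEMBER commands, but the choice of a set over another Redis data type is our assumption here.

```python
import random
from collections import defaultdict


class InMemoryTaskQueue:
    """Stand-in for one collection of page IDs per task type (hypothetical)."""

    def __init__(self, rng=None):
        self._queues = defaultdict(set)
        self._rng = rng or random.Random()

    def add(self, task_type, page_id):
        # A page gained the relevant category (SADD in a Redis set).
        self._queues[task_type].add(page_id)

    def remove(self, task_type, page_id):
        # The category was removed from the page (SREM in a Redis set).
        self._queues[task_type].discard(page_id)

    def remove_everywhere(self, page_id):
        # The page was deleted, so drop it from every task type.
        for members in self._queues.values():
            members.discard(page_id)

    def sample(self, task_type, count):
        # Return up to `count` distinct page IDs (SRANDMEMBER with a count).
        members = sorted(self._queues.get(task_type, ()))
        return self._rng.sample(members, min(count, len(members)))
```

Updating the collection from events, rather than rebuilding it on a timer, keeps the expensive category queries out of the request path.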

Delivery
This recommender system will not itself deliver an interface to users. The first release will be embedded inside Extension:GettingStarted, and thus will use the Special page created via that extension to deliver the pages. In the future, we may deliver recommended tasks via other interfaces, such as guided tours or notifications. For now the delivery mechanism will not be a public API tied to a repository, but will rather be a generic interface to call the Redis queue for a list of articles.

Optimization and machine learning
We will defer this stage of the task recommendation process until a working implementation of the other steps (generation, queueing, and delivery) is complete.

The potential attributes we could collect and filter tasks on include:
 * Type, e.g. copyediting, adding an image, adding a reference, etc.
 * Difficulty rating
 * Topic (of the article)
 * Time (estimated time to complete)
 * "Freshness"
 * Popularity (in pageviews or in frequency of completion)

While we may not be storing the queue of tasks in database tables, the final step in the process (post-delivery) will be logging the delivered tasks to a permanent archive. Without logs or tables describing the above attributes, we would be throwing away valuable information for the analysis of tasks.
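As an illustration of what one archived record could look like, the sketch below serializes a delivery event as a single line of JSON. The field names (`timestamp`, `user_id`, `task_type`, `page_ids`) and the JSON-lines format are assumptions made for this example; the actual logging schema has not been decided.

```python
import json
import time


def delivery_log_line(user_id, task_type, page_ids, timestamp=None):
    """Serialize one delivery event as a single JSON line (hypothetical schema)."""
    record = {
        "timestamp": int(time.time()) if timestamp is None else timestamp,
        "user_id": user_id,
        "task_type": task_type,
        "page_ids": list(page_ids),
    }
    # sort_keys keeps the output stable, which makes the log easier to diff.
    return json.dumps(record, sort_keys=True)
```

Further attributes from the list above, such as difficulty or topic, could be added as extra keys once they are collected.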

Error handling
Regardless of the interface delivering tasks, we should define a standard way to handle a failure to provide an array of page_ids. Unless the Redis cache is empty, the natural default, which works regardless of task type, is to simply deliver the previous cache of articles and log an error. The other solution to this problem is simple redundancy in Redis servers, in case of a restart, database crash, etc.
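The fallback behaviour can be sketched as a thin wrapper around whatever function fetches page IDs. The names `TaskDelivery` and `fetch` are hypothetical, and treating an empty result as a failure is an assumption of this sketch.

```python
import logging

logger = logging.getLogger("task_recommender")


class TaskDelivery:
    """Serve the last successful list when fetching fresh page IDs fails."""

    def __init__(self, fetch):
        # fetch: callable(task_type, count) -> list of page IDs; may raise.
        self._fetch = fetch
        self._last_good = {}

    def get_tasks(self, task_type, count):
        try:
            page_ids = self._fetch(task_type, count)
            if not page_ids:
                raise LookupError("empty result")
        except Exception as exc:
            # Log the error and fall back to the previous list, if there is one.
            logger.error("Task fetch failed for %s: %s", task_type, exc)
            return list(self._last_good.get(task_type, []))
        self._last_good[task_type] = list(page_ids)
        return list(page_ids)
```

If no previous list exists, the caller receives an empty list and has to decide what to show instead; that case is the one redundancy in Redis servers would address.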

GettingStarted use case and requirements
The following types of recommendations will be delivered in the Special:GettingStarted landing page, and are the primary use case around which we will build the first version of this recommender system.

The GettingStarted landing page will deliver a random article from one of the following task categories, depending on which task the user selects:


 * Category:All articles needing copy edit
 * Category:All Wikipedia articles needing clarification
 * Category:All articles with too few wikilinks

In addition to coming from the appropriate category above, tasks should be filtered based on the following criteria:


 * Namespace 0 only (this is generally true of articles in the above categories, but should be included as a safety check)
 * Length no greater than 10 kB
 * Must be editable. (No protected pages.)
 * Exclude articles in Category:Living people
 * Do not include redirects or deleted pages (this latter requirement may seem nonsensical, but is explicitly included to account for cases where there may be a time discrepancy between generation of the task list and delivery to users long enough for these events to occur).
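The criteria above can be expressed as a single predicate. The sketch below uses a hypothetical `Page` record with the fields needed for the check; it does not reflect MediaWiki's actual page schema, and reading the length limit as 10 × 1024 bytes (rather than 10,000) is an assumption.

```python
from dataclasses import dataclass

MAX_LENGTH_BYTES = 10 * 1024  # assumption: the 10 kbyte limit means 10,240 bytes


@dataclass
class Page:
    """Hypothetical record holding just the attributes the filter needs."""
    page_id: int
    namespace: int
    length: int
    is_redirect: bool = False
    is_protected: bool = False
    is_deleted: bool = False
    categories: frozenset = frozenset()


def is_eligible(page):
    """Return True if the page passes every GettingStarted filter criterion."""
    return (
        page.namespace == 0                          # main namespace only
        and page.length <= MAX_LENGTH_BYTES          # short articles only
        and not page.is_protected                    # must be editable
        and "Living people" not in page.categories   # exclude living people
        and not page.is_redirect
        and not page.is_deleted
    )
```

Because redirects and deletions can occur between generation and delivery, a check like this would need to run at delivery time as well as when the list is built.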