ORES/Feature injection

Feature injection is an ORES capability that lets you supply your own feature values when requesting a score, so that you can explore how the prediction models the system supports respond to hypothetical changes.

Features: how machines see the world


Machines like ORES operate on numbers. In order for ORES to make predictions about edits, articles, or other wiki entities, those entities must first be reduced to a set of numbers. We call these numbers features. For example, features of an edit such as "words added" and "curse words added" are measured because, under some circumstances, they correlate with damaging edits. A machine learning algorithm can use features like these to learn the patterns that correlate with vandalism and other types of damage in order to do something useful, like helping with counter-vandalism work.

Features are how machines see the world. For better or worse, if some characteristic of an edit suggests that the edit is vandalism, but that characteristic is not captured in any of the features that are measured and provided to a machine learning algorithm, then the algorithm cannot learn any patterns related to it. Similarly, by manipulating the features that are provided for a prediction, we can explore how a model views the world.
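To make the idea of "reducing an edit to numbers" concrete, here is a minimal sketch in Python. It is not how ORES extracts features: the function names, the tokenizer, and the two-word curse list are all hypothetical and exist only for illustration.

```python
import re
from collections import Counter

# Hypothetical word list, for illustration only; not ORES's actual lexicon.
CURSE_WORDS = {"damn", "crap"}


def tokenize(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def edit_features(old_text, new_text):
    """Reduce an edit (old text -> new text) to a small dictionary of numbers."""
    old_counts = Counter(tokenize(old_text))
    new_counts = Counter(tokenize(new_text))
    # Counter subtraction keeps only words that occur more often in new_text.
    added = new_counts - old_counts
    return {
        "words_added": sum(added.values()),
        "curse_words_added": sum(
            count for word, count in added.items() if word in CURSE_WORDS
        ),
    }


features = edit_features("Paris is a city.", "Paris is a damn big city.")
print(features)  # {'words_added': 2, 'curse_words_added': 1}
```

Anything about the edit that is not captured in the returned dictionary (for example, who made the edit) is invisible to a model trained on these two numbers.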

Asking ORES for features
For example, let's consider the ORES article quality model. We can ask ORES to tell us which features it uses to predict the quality level of [//en.wikipedia.org/wiki/?oldid=21312312 French Renaissance as of revision 21312312] by adding ?features to the end of an ordinary scoring request. See the example output below.

 https://ores.wikimedia.org/v3/scores/enwiki/21312312/articlequality?features
 ...
 "score": {
   "prediction": "Start",
   "probability": {
     "B": 0.23655260760880706,
     "C": 0.12453739014424042,
     "FA": 0.003878024838323291,
     "GA": 0.00736070100748482,
     "Start": 0.5951630249866495,
     "Stub": 0.032508251414494885
   }
 },
 "features": {
   "feature.english.stemmed.revision.stems_length": 3926,
   "feature.enwiki.main_article_templates": 0.0,
   "feature.enwiki.revision.category_links": 3.0,
   "feature.enwiki.revision.cite_templates": 0.0,
   "feature.enwiki.revision.cn_templates": 0.0,
   "feature.enwiki.revision.image_links": 0.0,
   "feature.enwiki.revision.infobox_templates": 0.0,
   "feature.enwiki.revision.paragraphs_without_refs_total_length": 5863.0,
   "feature.enwiki.revision.who_templates": 0.0,
   "feature.wikitext.revision.chars": 5894.0,
   "feature.wikitext.revision.content_chars": 5208.0,
   "feature.wikitext.revision.external_links": 0.0,
   "feature.wikitext.revision.headings_by_level(2)": 3.0,
   "feature.wikitext.revision.headings_by_level(3)": 2.0,
   "feature.wikitext.revision.ref_tags": 0.0,
   "feature.wikitext.revision.templates": 5.0,
   "feature.wikitext.revision.wikilinks": 73.0
 }
 ...

This relatively narrow set of features shows the limited view that the "articlequality" model has of a Wikipedia page. Given what it can see, ORES predicts that this revision is at "Start class" with relatively high confidence (59.5%).
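The reading of the output above can be reproduced with a few lines of Python. This is a sketch that works only on the excerpt shown: the parts of the response elided with "..." are not reconstructed, only a handful of the features are copied in, and the helper name top_class is made up for this example.

```python
# Excerpt of the response shown above. The elided parts ("...") are left out,
# and only a subset of the features is copied here.
excerpt = {
    "score": {
        "prediction": "Start",
        "probability": {
            "B": 0.23655260760880706,
            "C": 0.12453739014424042,
            "FA": 0.003878024838323291,
            "GA": 0.00736070100748482,
            "Start": 0.5951630249866495,
            "Stub": 0.032508251414494885,
        },
    },
    "features": {
        "feature.enwiki.revision.cite_templates": 0.0,
        "feature.enwiki.revision.infobox_templates": 0.0,
        "feature.wikitext.revision.ref_tags": 0.0,
        "feature.wikitext.revision.templates": 5.0,
        "feature.wikitext.revision.wikilinks": 73.0,
    },
}


def top_class(score):
    """Return the most probable class and its probability (hypothetical helper)."""
    probability = score["probability"]
    label = max(probability, key=probability.get)
    return label, probability[label]


label, confidence = top_class(excerpt["score"])
print(f"{label}: {confidence:.1%}")  # Start: 59.5%

# Features the model sees as zero for this revision.
zero_features = sorted(k for k, v in excerpt["features"].items() if v == 0.0)
print(zero_features)
```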

Feature injection: Playing with what ORES sees
In this section, we'll go through a few different case studies that show what we can achieve with feature injection.

Article quality predictors
Revision 21312312 from the previous example was saved in 2005—long before English Wikipedians had standardized references and citation templates, so we see that the relevant feature (feature.enwiki.revision.cite_templates) reports 0.0. What if this article's citations were updated with a set of references and templates? What would ORES think of its quality level then? Here's where "feature injection" comes in handy. There are a few features that are directly relevant to references and citation templates:
 * feature.enwiki.revision.cite_templates
 * feature.enwiki.revision.paragraphs_without_refs_total_length
 * feature.wikitext.revision.ref_tags
 * feature.wikitext.revision.templates

The article has 13 paragraphs in total. Let's ask ORES what it would predict if each of those paragraphs had exactly one templated reference. That means 13 citation templates and 13 ref tags, no text left in unreferenced paragraphs, and 18 templates overall (the original 5 plus 13). To do that, we'll add each feature with its new value to the URL after ?features:

 curl --get https://ores.wikimedia.org/v3/scores/enwiki/21312312/articlequality \
   -d features \
   -d feature.enwiki.revision.cite_templates=13.0 \
   -d feature.enwiki.revision.paragraphs_without_refs_total_length=0.0 \
   -d feature.wikitext.revision.ref_tags=13.0 \
   -d feature.wikitext.revision.templates=18.0
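The same request URL can be assembled programmatically. The sketch below only builds the URL string and does not contact ORES; the function name injection_url and its parameters are hypothetical. It keeps features as a bare flag followed by name=value pairs, mirroring the request shown above.

```python
from urllib.parse import urlencode

BASE = "https://ores.wikimedia.org/v3/scores"


def injection_url(context, rev_id, model, overrides):
    """Build a scoring URL with injected feature values (hypothetical helper)."""
    path = f"{BASE}/{context}/{rev_id}/{model}"
    # "features" is sent as a bare flag; each override follows as name=value.
    parts = ["features"]
    if overrides:
        parts.append(urlencode(overrides))
    return f"{path}?{'&'.join(parts)}"


url = injection_url(
    "enwiki",
    21312312,
    "articlequality",
    {
        "feature.enwiki.revision.cite_templates": 13.0,
        "feature.enwiki.revision.paragraphs_without_refs_total_length": 0.0,
        "feature.wikitext.revision.ref_tags": 13.0,
        "feature.wikitext.revision.templates": 18.0,
    },
)
print(url)
```

Note that urlencode percent-encodes characters such as parentheses, so a feature name like feature.wikitext.revision.headings_by_level(2) would appear in encoded form in the resulting URL.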

From this, we get a noticeably different score:

 "score": {
   "prediction": "Start",
   "probability": {
     "B": 0.21101537545347274,
     "C": 0.34027980158201354,
     "FA": 0.006190763486682799,
     "GA": 0.06921489299078662,
     "Start": 0.3618188435415371,
     "Stub": 0.011480322945507269
   }
 }

If the article were properly referenced, it would *almost* qualify for C class over Start class (34.0% vs 36.2%). What if we went a bit further and added an infobox with an image? Would that push it over the line? Let's add that to our URL:

 https://ores.wikimedia.org/v3/scores/enwiki/21312312/articlequality
   ?features
   &feature.enwiki.revision.cite_templates=13.0
   &feature.enwiki.revision.paragraphs_without_refs_total_length=0.0
   &feature.wikitext.revision.ref_tags=13.0
   &feature.wikitext.revision.templates=18.0
   &feature.enwiki.revision.infobox_templates=1.0
   &feature.enwiki.revision.image_links=1.0

From this, we finally cross the threshold of "C class":

 "score": {
   "prediction": "C",
   "probability": {
     "B": 0.22491750422853965,
     "C": 0.3269388189203612,
     "FA": 0.011275242234045054,
     "GA": 0.21184437347163956,
     "Start": 0.21762683702433538,
     "Stub": 0.007397224121079145
   }
 }
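To see how the prediction moved across the three requests in this case study, the probabilities reported above can be compared side by side. This is a small sketch over the numbers already shown on this page; the scenario labels and the helper name shift are made up for the example.

```python
# Probabilities copied from the three responses shown on this page.
scenarios = {
    "original": {
        "B": 0.23655260760880706, "C": 0.12453739014424042,
        "FA": 0.003878024838323291, "GA": 0.00736070100748482,
        "Start": 0.5951630249866495, "Stub": 0.032508251414494885,
    },
    "with references": {
        "B": 0.21101537545347274, "C": 0.34027980158201354,
        "FA": 0.006190763486682799, "GA": 0.06921489299078662,
        "Start": 0.3618188435415371, "Stub": 0.011480322945507269,
    },
    "with references, infobox and image": {
        "B": 0.22491750422853965, "C": 0.3269388189203612,
        "FA": 0.011275242234045054, "GA": 0.21184437347163956,
        "Start": 0.21762683702433538, "Stub": 0.007397224121079145,
    },
}


def shift(before, after):
    """Per-class change in probability between two scenarios (hypothetical helper)."""
    return {label: after[label] - before[label] for label in before}


for name, probability in scenarios.items():
    top = max(probability, key=probability.get)
    print(f"{name}: {top} ({probability[top]:.1%})")
# original: Start (59.5%)
# with references: Start (36.2%)
# with references, infobox and image: C (32.7%)

delta = shift(scenarios["original"], scenarios["with references, infobox and image"])
print({label: round(change, 3) for label, change in delta.items()})
```

The per-class shift makes it visible that the injected features mostly move probability away from "Start" and toward "C" and "GA".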