Talk:ORES/Draft topic

About this board

Discussing the WikiProject Directory abstraction

EpochFail (talkcontribs)
A conceptual diagram of the WikiProject directory next to a somewhat random list of WikiProjects and a sample of mid-level categories.

In order to develop a high-level topic labeling of Wikipedia, we explored many options. Generally, we wanted to predict a relatively small set of topics (fewer than 100) in order to make sure our model was small enough to work in practice. Further, these topics should cover the space of Wikipedia subjects uniformly. We briefly considered using the category system built into Wikipedia, but given the amount of past work attempting to make sense of the looping/overloaded/non-hierarchical nature of category usage (cite all the things), we abandoned that idea early. Another option seemed to provide great potential. WikiProjects are subject-focused working groups -- the kind of subject-interested working groups we wanted to target with our topic modeling. These working groups tag articles that fall within the scope of their projects. E.g., WikiProject Birds tags the article for Eagle as within their content space. Regretfully, there are hundreds of WikiProjects, and their coverage is not very uniform: some projects are extremely broad (e.g., WikiProject Asia) while others are more focused (e.g., Chinese military history).

The WikiProject Directory[1] provides a convenient intermediary ontology of WikiProjects that starts with four broad topics: Culture; Geography; History & Society; and Science, Technology and Mathematics (STEM). From there, the directory drills down into sub-topics and eventually specific WikiProjects. For example, WikiProject Birds exists underneath the path STEM/Science/Animals.

By taking advantage of this directory structure and using the mid-level categories and the WikiProject taggings already available on Wikipedia, we can (1) have a small, uniform set of target classes and (2) enable editors to control the class output. Imagine a scenario where something isn't quite right with the WikiProject Directory structure and so it doesn't capture the right level of detail in order to be useful. E.g., editors want to target biographies about women and yet "women" does not show up at any level of the directory. An editor could add a new level under Culture/Biography called "Women" and include WikiProject Women Writers, WikiProject Women Scientists, and other women-focused WikiProjects underneath this new branch of the tree. We can then run the same automated scripts for extracting structure on the repaired tree to re-label articles using these new mid-level categories without wasting any human effort re-labeling articles.
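As a rough sketch of that re-labeling idea (the nested-dict representation, the helper name, and the `_projects` key are all illustrative, not the real extraction scripts):

```python
# Sketch: a toy WikiProject directory as a nested dict, where "_projects"
# lists the WikiProjects filed under each branch. Structure is illustrative.
directory = {
    "Culture": {
        "Biography": {
            "_projects": ["WikiProject Biography"],
        },
    },
}

def project_to_label(directory):
    """Map each WikiProject to the dotted path of its directory branch."""
    mapping = {}

    def walk(node, path):
        for key, child in node.items():
            if key == "_projects":
                for project in child:
                    mapping[project] = ".".join(path)
            else:
                walk(child, path + [key])

    walk(directory, [])
    return mapping

# An editor adds a "Women" branch under Culture/Biography...
directory["Culture"]["Biography"]["Women"] = {
    "_projects": ["WikiProject Women Writers", "WikiProject Women Scientists"],
}

# ...and re-running the same extraction picks up the new labels automatically.
mapping = project_to_label(directory)
print(mapping["WikiProject Women Scientists"])  # Culture.Biography.Women
```

Articles tagged by those projects would then be re-labeled in bulk from this mapping, with no human re-labeling effort.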

In this way, the directory acts as a layer of ontological abstraction between the WikiProject taggings and the uniform set of categories that we want to use in order to model topics on Wikipedia. By allowing volunteer editors to easily modify this ontology and implementing automated systems for re-interpreting the ontology, we enable a straightforward process for adjusting our prediction model to match their expectations. We suspect that our prediction model will highlight limitations in the directory structure and that editors will be able to work with us in a tight loop of iterations where we converge on an optimal set of target labels for the model.

Pginer-WMF (talkcontribs)

Is this exposed in a way that can be queried for (a) getting the available topics, and (b) getting some articles for a given topic?

In Content translation we provide suggestions of articles that users may want to create. Based on user research, it would be helpful for users to get suggestions in a particular area of their interest (culture, history, science, etc.). Being able to list such categories and obtain articles from those (to be used as seed articles for the recommendations) would be useful.

EpochFail (talkcontribs)

You can get the full list of topics with a prediction. E.g. we can ask for the topic of the most recent version of en:Ann Bishop (biologist):
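The query itself would be an HTTP request to the ORES scores endpoint; a minimal sketch of constructing one, assuming the v3 API shape and using a placeholder revision ID rather than the article's real revid:

```python
from urllib.parse import urlencode

# Sketch: building an ORES drafttopic query URL (assumes the v3 scores API).
# The revision ID passed in below is a placeholder, not a real revid.
def drafttopic_url(wiki, revid):
    params = urlencode({"models": "drafttopic", "revids": revid})
    return f"https://ores.wikimedia.org/v3/scores/{wiki}/?{params}"

url = drafttopic_url("enwiki", 123456789)
print(url)
```

Fetching that URL returns a JSON document containing the per-topic probabilities shown below.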

This gives a list of possible topics with probabilities. Generally, if a probability is above 0.05, the topic is probably relevant. In the output, you can see:

"Culture.Arts": 0.0003592029168249067,
"Culture.Broadcasting": 0.0012930479068612793,
"Culture.Crafts and hobbies": 0.007179908944091317,
"Culture.Entertainment": 0.0015836478753997376,
"Culture.Food and drink": 0.003449284189374042,
"Culture.Internet culture": 0.0007987469417527358,
"Culture.Language and literature": 0.8707451941795378,
"Culture.Media": 0.003323758305107651,
"Culture.Performing arts": 0.001155100605793028,
"Culture.Philosophy and religion": 0.07478498612340995,

Note that "Culture" is a top-level category and that biographies are sorted under "Culture.Language and literature". If we look at the top predicted categories for this article, we get:

History_And_Society.History and society (0.4)
Culture.Philosophy and religion (0.07)
Culture.Language and literature (0.87)
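That 0.05 rule of thumb is easy to apply mechanically. A small sketch using a subset of the probabilities shown above:

```python
# Sketch: picking "probably relevant" topics from a drafttopic score,
# using the 0.05 rule of thumb mentioned above. Values are taken from
# the example output for Ann Bishop (biologist).
probabilities = {
    "Culture.Language and literature": 0.8707451941795378,
    "Culture.Philosophy and religion": 0.07478498612340995,
    "History_And_Society.History and society": 0.4,
    "Culture.Arts": 0.0003592029168249067,
}

def relevant_topics(probabilities, threshold=0.05):
    """Return (topic, probability) pairs above the threshold, best first."""
    return sorted(
        ((topic, p) for topic, p in probabilities.items() if p > threshold),
        key=lambda pair: pair[1],
        reverse=True,
    )

for topic, p in relevant_topics(probabilities):
    print(f"{topic} ({p:.2f})")
```

This reproduces the shortlist above: "Culture.Arts" falls below the threshold and is dropped.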

As you might expect, this prediction strongly suggests that Ann Bishop is an article about a person (Culture.Language and literature) from Europe who is historically important and contributed to the fields of Biology and Medicine -- and maybe Philosophy/religion as well.

Oh! I just realized that you could get the possible classes directly from "model_info" as well. Here's the query I would use:

Note that "params.labels" contains the full list of labels. Disregard the "Assessment" labels as those have nothing to do with topical interest.
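Filtering the non-topic labels out of "params.labels" is straightforward. A sketch, assuming (as in the scores above) that topic labels follow the "TopLevel.Mid-level" dotted pattern; the "Assessment.Stub" entry below is a made-up placeholder standing in for whatever assessment labels the real list contains:

```python
# Sketch: keeping only the topic labels from a model_info "params.labels"
# list. The list below is illustrative; "Assessment.Stub" is a placeholder
# for the assessment labels that should be disregarded.
labels = [
    "Culture.Arts",
    "Culture.Language and literature",
    "History_And_Society.History and society",
    "Assessment.Stub",  # placeholder non-topic label
]

def topic_labels(labels):
    """Drop labels that belong to the assessment family."""
    return [l for l in labels if not l.startswith("Assessment")]

print(topic_labels(labels))
```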

KHarlan (WMF) (talkcontribs)

@EpochFail are there plans to roll out drafttopic to more wikis?

EpochFail (talkcontribs)

Not at the moment. It hasn't been set as a priority for any product development so we have focused on other work in our FY20 plan. But that could change with discussion.

KHarlan (WMF) (talkcontribs)

@EpochFail thanks. Like @Pginer-WMF, we (Growth) would like to be able to know with some degree of certainty what topic a particular article belongs to, for our Growth/Personalized first day/Newcomer tasks project. We're currently experimenting with approximating topics with morelike searches and traversing the category tree, but both are imperfect. (See task T231506#5495917)

Speaking of categories, I know you mentioned that you evaluated them, but did you look specifically at the "Main topic classifications" category?

cc @MMiller (WMF)

EpochFail (talkcontribs)

If y'all are looking to classify newcomer drafts in a specific language, I'd like to explore getting y'all some support for that. Let's talk more about that. Are you still targeting Czech and Korean?

I didn't look at this "Main topic classifications" taxonomy though I do think it provides a nice high level structure -- unlike most of the category tree. I followed a few of the sitelinks and it looks like this is under-developed in many other wikis, but I have some ideas for making it cross-lingual anyway.

KHarlan (WMF) (talkcontribs)

@EpochFail yes at the moment we're working with Arabic, Czech, Korean, and Vietnamese wikis, plus Basque wiki is going to get our features soon but not as a primary target wiki.

If y'all are looking to classify newcomer drafts in a specific language

To be more particular, we want to classify articles (not newcomer drafts) in a specific language. Quick summary is that we will use a mapping of maintenance templates to task types -- e.g., in Czech the template Upravit belongs to a "copy editing" task type -- to query for a list of articles. Then we want to know, of the ~20,000 articles that have maintenance templates on Czech wiki, which ones belong to which topics, so that the user can filter the list of available tasks by topic and see ones that are relevant to their interests.
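A sketch of that filtering step, with illustrative placeholder articles and topics (only "Upravit" and its "copy editing" task type come from the description above):

```python
# Sketch: map maintenance templates to task types, then filter candidate
# articles by predicted topic. Article titles and topics are placeholders;
# "Upravit" -> copy editing is the Czech example mentioned above.
TEMPLATE_TO_TASK = {
    "Upravit": "copy editing",
}

articles = [
    {"title": "Article A", "templates": ["Upravit"], "topic": "STEM.Biology"},
    {"title": "Article B", "templates": ["Upravit"], "topic": "Culture.Arts"},
    {"title": "Article C", "templates": [], "topic": "STEM.Biology"},
]

def tasks_for(articles, task_type, topic):
    """Articles whose templates map to task_type and whose topic matches."""
    return [
        a["title"]
        for a in articles
        if a["topic"] == topic
        and any(TEMPLATE_TO_TASK.get(t) == task_type for t in a["templates"])
    ]

print(tasks_for(articles, "copy editing", "STEM.Biology"))  # ['Article A']
```

In production the topic field would come from drafttopic predictions rather than being stored inline, but the filtering logic is the same.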

EpochFail (talkcontribs)

Gotcha. We could possibly start work on this next quarter. If you're not targeting drafts, then using any sitelinks available to enwiki and running the classifier on that should work alright in the short term. In the medium term, we'll need to develop embeddings for languages beyond English. This is a low-level component of the topic prediction strategy. I was hoping to pick this up next quarter to boost our capacity to work with topics generally, so it wouldn't be unreasonable to start moving some of it towards production too.

What sort of timescale are y'all working with? When do you need to have this ready to use in production?

Reply to "Discussing the WikiProject Directory abstraction"

New article review funnel dynamics

EpochFail (talkcontribs)

Currently, the processes that support reviewing new article creations in Wikipedia include the New Page Patrol (supported by the Page Curation Tool) (NPP) and the Articles for Creation (AfC) working group. The only substantial difference between these two processes from a workflow perspective is where they operate and the implications of that space. NPP operates in the main article space and thus is motivated to make sure problematic new page creations don't last long because they show up as part of the encyclopedia. AfC operates in the Draft namespace where new articles are not linked from the main encyclopedia and are not indexed by search engines.

While AfC and NPP are somewhat independent, they largely work in parallel as a single line of defense against the introduction of spam, vandalism, and other types of non-encyclopedic content. In both AfC and NPP, a small group of volunteers is responsible for making a wide range of different types of judgment calls about a new article. Some judgments are straightforward and require little expertise: e.g., is this spam or vandalism? Others require a nuanced understanding of a topic area: e.g., is Ann Bishop a notable biologist or not?

Historically, new article review has been an overwhelming problem. Both AfC and NPP maintain backlogs with tens of thousands of articles.[1] Recently, NPP has become so overwhelmed by the backlog, and by its potentially negative effects on the quality of the encyclopedia, that they proposed that Wikipedia change policy and restrict the creation of new articles by new editors.[2] This has largely had the effect of re-routing new article review work from NPP to AfC,[3] but it has not addressed the backlog at all.

English Wikipedia's multi-stage review funnel for edits to current articles.
English Wikipedia's single-stage review funnel for new article creations.

But not all review processes result in such backlogs. It's interesting to compare the review process for edits to current articles (edit review) with the review funnel for new articles. In the case of edit review, a dynamic multi-stage filter implements a distributed cognition process.[4][5] Geiger and Halfaker describe the multi-stage process where AI-augmented bots form the first line of defense, then AI-augmented human-computation tools catch less obvious damage, and finally, editors from across the encyclopedia review changes to articles that they're interested in via watchlists. Through this multi-stage process, all edits are reviewed and no backlog forms.

So why doesn't the new article review process look more like this? Before our work, there was no AI support for detecting problematic new articles (for bots or human-computation) and there were no effective routing mechanisms for distributing the workload across subject matter experts.

  1. E.g. see discussions here:
  2. en:Wikipedia:Autoconfirmed_article_creation_trial
  4. Geiger, R. S., & Halfaker, A. (2013, August). When the levee breaks: without bots, what happens to Wikipedia's quality control processes?. In Proceedings of the 9th International Symposium on Open Collaboration (p. 6). ACM.
  5. Geiger, R. S., & Ribes, D. (2010, February). The work of sustaining order in wikipedia: the banning of a vandal. In Proceedings of the 2010 ACM conference on Computer supported cooperative work (pp. 117-126). ACM.
EpochFail (talkcontribs)

Ping: User:Sumit.iitp. Not my best writing. But it's some progress. I'll have more tomorrow.

Sumit.iitp (talkcontribs)

Benefits of learning from the WikiProjects directory?

The directory itself is the result of hand-crafted categorization of topics. Our model uses this categorization to generate a machine-readable topic tree, which forms the basis of our predictions. Because the model depends on the structure of the topic labels in this way, it can easily be rebuilt and retrained whenever a change to the topic hierarchy is desired. Easy, because users need not change labels in any database: they can directly update the directory page, with which they are already familiar. While rebuilding, our model will generate the new topic tree from the updated directory page, extract labels, and train on those labels, all in a single pipeline.
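The single-pipeline shape described here can be sketched as three chained stages; everything below (function names, the stubbed parser, the toy tree) is illustrative, not the real implementation:

```python
# Sketch of the rebuild pipeline: parse the directory page, extract labels
# for tagged articles, then train. The parser is stubbed with a toy tree.
def build_topic_tree(directory_wikitext):
    """Parse the WikiProject directory page into a topic tree (stubbed)."""
    return {"Culture": {"Biography": ["WikiProject Biography"]}}

def extract_labels(topic_tree, article_taggings):
    """Label each article with the mid-level topics of its WikiProjects."""
    project_to_topic = {
        project: f"{top}.{mid}"
        for top, mids in topic_tree.items()
        for mid, projects in mids.items()
        for project in projects
    }
    return {
        title: sorted({project_to_topic[p] for p in projects
                       if p in project_to_topic})
        for title, projects in article_taggings.items()
    }

def rebuild(directory_wikitext, article_taggings):
    """Whenever the directory page changes, re-run the whole pipeline."""
    tree = build_topic_tree(directory_wikitext)
    labeled = extract_labels(tree, article_taggings)
    # train(labeled)  # model training would run here
    return labeled

labels = rebuild("...", {"Ann Bishop": ["WikiProject Biography"]})
print(labels)  # {'Ann Bishop': ['Culture.Biography']}
```

Because everything downstream of the directory page is automated, an editor's change to the page flows through to new training labels with no manual relabeling.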

Reply to "New article review funnel dynamics"