Topic on Talk:ORES/Draft topic

Discussing the WikiProject Directory abstraction

EpochFail (talkcontribs)
A conceptual diagram of the WikiProject directory next to a somewhat random list of WikiProjects and a sample of mid-level categories.

In order to develop a high-level topic labeling of Wikipedia, we explored many options. Generally, we wanted to predict a relatively small set of topics (fewer than 100) so that our model would be small enough to work in practice. Further, these topics should cover the space of Wikipedia subjects uniformly. We briefly considered using the category system built into Wikipedia, but given the amount of past work attempting to make sense of the looping/overloaded/non-hierarchical nature of category usage (cite all the things), we abandoned that idea early. Another option seemed to offer great potential. WikiProjects are subject-focused working groups -- the kind of subject-interested working groups we wanted to target with our topic modeling. These working groups tag articles that fall within the scope of their projects. For example, WikiProject Birds tags the article Eagle as within its content space. Unfortunately, there are hundreds of WikiProjects and their coverage is not very uniform: some projects are extremely broad (e.g. WikiProject Asia) while others are more focused (e.g. Chinese military history).

The WikiProject Directory[1] provides a convenient intermediary ontology of WikiProjects that starts with four broad topics: Culture; Geography; History & Society; and Science, Technology and Mathematics (STEM). From there, the directory drills down into sub-topics and eventually specific WikiProjects. For example, WikiProject Birds exists underneath the path STEM/Science/Animals.

By taking advantage of this directory structure and using the mid-level categories and the WikiProject taggings already available on Wikipedia, we can (1) have a small, uniform set of target classes and (2) enable editors to control the class output. Imagine a scenario where something isn't quite right with the WikiProject Directory structure, so it doesn't capture the right level of detail to be useful. For example, suppose editors want to target biographies about women and yet "women" does not show up at any level of the directory. An editor could add a new level under Culture/Biography called "Women" and include WikiProject Women Writers, WikiProject Women Scientists, and other women-focused WikiProjects underneath this new branch of the tree. We can then run the same automated structure-extraction scripts on the repaired tree to re-label articles using these new mid-level categories, without spending any human effort on re-labeling.

In this way, the directory acts as a layer of ontological abstraction between the WikiProject taggings and the uniform set of categories that we want to use in order to model topics on Wikipedia. By allowing volunteer editors to easily modify this ontology and implementing automated systems for re-interpreting the ontology, we enable a straightforward process for adjusting our prediction model to match their expectations. We suspect that our prediction model will highlight limitations in the directory structure and that editors will be able to work with us in a tight loop of iterations where we converge on an optimal set of target labels for the model.
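
The re-labeling step described above can be sketched roughly as follows. This is only an illustration, not the actual extraction scripts: the nested-dict representation of the directory, the function names `midlevel_labels` and `label_articles`, the `depth` parameter, the "Example biography" article, and the placement of the women-focused WikiProjects directly under Culture/Biography before the edit are all assumptions made for the example.

```python
def midlevel_labels(tree, depth, path=()):
    """Map each WikiProject to a label built from the first `depth` levels of its path."""
    mapping = {}
    for name, child in tree.items():
        here = path + (name,)
        if isinstance(child, dict):   # sub-topic: keep descending
            mapping.update(midlevel_labels(child, depth, here))
        else:                         # list of WikiProjects at this node
            for project in child:
                mapping[project] = ".".join(here[:depth])
    return mapping


def label_articles(taggings, project_labels):
    """Turn article -> [WikiProject, ...] taggings into article -> [label, ...]."""
    return {
        article: sorted({project_labels[p] for p in projects if p in project_labels})
        for article, projects in taggings.items()
    }


# Toy directory (hypothetical layout, loosely based on paths mentioned in the post).
directory = {
    "STEM": {"Science": {"Animals": ["WikiProject Birds"]}},
    "Culture": {"Biography": ["WikiProject Women Writers", "WikiProject Women Scientists"]},
}
taggings = {
    "Eagle": ["WikiProject Birds"],
    "Example biography": ["WikiProject Women Writers"],  # hypothetical article
}

before = label_articles(taggings, midlevel_labels(directory, depth=3))

# An editor adds a "Women" level under Culture/Biography; only the tree changes.
directory["Culture"]["Biography"] = {
    "Women": ["WikiProject Women Writers", "WikiProject Women Scientists"]
}
after = label_articles(taggings, midlevel_labels(directory, depth=3))

print(before["Example biography"])  # ['Culture.Biography']
print(after["Example biography"])   # ['Culture.Biography.Women']
```

The WikiProject taggings themselves are never touched; only the tree is edited and the labels are recomputed from it.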

Pginer-WMF (talkcontribs)

Is this exposed in a way that can be queried for (a) getting the available topics, and (b) getting some articles for a given topic?

In Content translation we provide suggestions of articles that users may want to create. Based on user research, it would be helpful for users to get suggestions in a particular area of their interest (culture, history, science, etc.). Being able to list such categories and obtain articles from those (to be used as seed articles for the recommendations) would be useful.

EpochFail (talkcontribs)

You can get the full list of topics with a prediction. E.g. we can ask for the topic of the most recent version of en:Ann Bishop (biologist):

This gives a list of possible topics with probabilities. Generally, if a probability is above 0.05, it's probably relevant. In the output, you can see:

"Culture.Arts": 0.0003592029168249067,
"Culture.Broadcasting": 0.0012930479068612793,
"Culture.Crafts and hobbies": 0.007179908944091317,
"Culture.Entertainment": 0.0015836478753997376,
"Culture.Food and drink": 0.003449284189374042,
"Culture.Internet culture": 0.0007987469417527358,
"Culture.Language and literature": 0.8707451941795378,
"Culture.Media": 0.003323758305107651,
"Culture.Performing arts": 0.001155100605793028,
"Culture.Philosophy and religion": 0.07478498612340995,

Note that "Culture" is a top-level category and that biographies are sorted under "Culture.Language and literature". If we look at the top predicted categories for this article, we get:

History_And_Society.History and society (0.4)
Culture.Philosophy and religion (0.07)
Culture.Language and literature (0.87)

As you might expect, this prediction strongly suggests that Ann Bishop is an article about a person (Culture.Language and literature) from Europe who is historically important and contributed to fields of Biology and Medicine -- and maybe Philosophy/religion as well.
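
If it helps, the 0.05 rule of thumb can be applied to the probability map with a few lines of Python. This is a minimal sketch that assumes the probabilities have already been parsed into a dict; the function name `relevant_topics` is made up for the example, and the values are copied from the excerpt above.

```python
def relevant_topics(probabilities, threshold=0.05):
    """Return (topic, probability) pairs above `threshold`, most probable first."""
    hits = [(topic, p) for topic, p in probabilities.items() if p > threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Values copied from the excerpt of the prediction output above.
probabilities = {
    "Culture.Arts": 0.0003592029168249067,
    "Culture.Crafts and hobbies": 0.007179908944091317,
    "Culture.Language and literature": 0.8707451941795378,
    "Culture.Media": 0.003323758305107651,
    "Culture.Philosophy and religion": 0.07478498612340995,
}

for topic, p in relevant_topics(probabilities):
    print(f"{topic} ({p:.2f})")
# Culture.Language and literature (0.87)
# Culture.Philosophy and religion (0.07)
```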

Oh! I just realized that you could get the possible classes directly from "model_info" as well. Here's the query I would use:

Note that "params.labels" contains the full list of labels. Disregard the "Assessment" labels as those have nothing to do with topical interest.
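
Once the model information has been fetched and parsed, pulling out the topical labels could look something like this. It is a sketch under assumptions: the function name `topic_labels` is hypothetical, `model_info` is assumed to be the parsed object that holds the "params" key, and the sample "Assessment.Example" label is invented purely to show the filter.

```python
def topic_labels(model_info):
    """Return the labels under params.labels, skipping the "Assessment" ones."""
    labels = model_info["params"]["labels"]
    return [label for label in labels if not label.startswith("Assessment")]


# Hand-written stand-in for a parsed model_info object (not real API output).
sample_model_info = {
    "params": {
        "labels": [
            "Assessment.Example",  # invented label, only to demonstrate filtering
            "Culture.Arts",
            "Culture.Language and literature",
        ]
    }
}

print(topic_labels(sample_model_info))
# ['Culture.Arts', 'Culture.Language and literature']
```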

KHarlan (WMF) (talkcontribs)

@EpochFail are there plans to roll out drafttopic to more wikis?

EpochFail (talkcontribs)

Not at the moment. It hasn't been set as a priority for any product development so we have focused on other work in our FY20 plan. But that could change with discussion.

KHarlan (WMF) (talkcontribs)

@EpochFail thanks. Like @Pginer-WMF, we (Growth) would like to be able to know with some degree of certainty what topic a particular article belongs to, for our Growth/Personalized first day/Newcomer tasks project. We're currently experimenting with approximating topics with morelike searches and traversing the category tree, but both are imperfect. (See task T231506#5495917)

Speaking of categories, I know you mentioned that you evaluated them, did you look specifically at

cc @MMiller (WMF)

EpochFail (talkcontribs)

If y'all are looking to classify newcomer drafts in a specific language, I'd like to explore getting y'all some support for that. Let's talk more about that. Are you still targeting Czech and Korean?

I didn't look at this "Main topic classifications" taxonomy, though I do think it provides a nice high-level structure -- unlike most of the category tree. I followed a few of the sitelinks and it looks like this is under-developed in many other wikis, but I have some ideas for making it cross-lingual anyway.

KHarlan (WMF) (talkcontribs)

@EpochFail yes at the moment we're working with Arabic, Czech, Korean, and Vietnamese wikis, plus Basque wiki is going to get our features soon but not as a primary target wiki.

"If y'all are looking to classify newcomer drafts in a specific language"

To be more specific, we want to classify articles (not newcomer drafts) in a specific language. The quick summary is that we will use a mapping of maintenance templates to task types (e.g. in Czech, the template Upravit belongs to a "copy editing" task type) to query for a list of articles. Then we want to know, of the ~20,000 articles that have maintenance templates on Czech wiki, which ones belong to which topics, so that the user can filter the list of available tasks by topic and see the ones that are relevant to their interests.

EpochFail (talkcontribs)

Gotcha. We could possibly start work on this next quarter. If you're not targeting drafts, then using any sitelinks available to enwiki and running the classifier on that should work alright in the short term. In the medium term, we'll need to develop embeddings for languages beyond English. This is a low-level component of the topic prediction strategy. I was hoping to pick this up next quarter to boost our capacity to work with topics generally, so it wouldn't be unreasonable to start moving some of it towards production too.

What sort of timescale are y'all working with? When do you need to have this ready to use in production?
