To develop a high-level topic labeling of Wikipedia, we explored many options. In general, we wanted to predict a relatively small set of topics (fewer than 100) so that our model would be small enough to work in practice. Further, these topics should cover the space of Wikipedia subjects uniformly. We briefly considered using the category system built into Wikipedia, but given the amount of past work attempting to make sense of the looping/overloaded/non-hierarchical nature of category usage (cite all the things), we abandoned that idea early. Another option showed greater potential. WikiProjects are subject-focused working groups -- the kind of subject-interested working groups we wanted to target with our topic modeling. These working groups tag articles that fall within the scope of their projects. For example, WikiProject Birds tags the article for Eagle as within its content space. Unfortunately, there are hundreds of WikiProjects, and their coverage is not very uniform: some projects are extremely broad (e.g. WikiProject Asia) while others are more focused (e.g. Chinese military history).
The WikiProject Directory provides a convenient intermediary ontology of WikiProjects that starts with four broad topics: Culture; Geography; History & Society; and Science, Technology and Mathematics (STEM). From there, the directory drills down into sub-topics and eventually specific WikiProjects. For example, WikiProject Birds exists underneath the path STEM/Science/Animals.
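One way to picture this hierarchy is as a nested mapping from directory levels down to lists of WikiProjects. The sketch below is illustrative only: the `DIRECTORY` dictionary is a hypothetical, heavily abridged slice containing just the Birds path named above, and `find_path` is a helper name assumed here for the example, not part of any existing tool.

```python
# Hypothetical, abridged slice of the directory. Only the Birds path
# (STEM/Science/Animals) is taken from the text; the real directory is far larger.
DIRECTORY = {
    "STEM": {
        "Science": {
            "Animals": ["WikiProject Birds"],
        },
    },
}


def find_path(tree, project, prefix=()):
    """Return the tuple of directory levels above `project`, or None if absent."""
    for name, child in tree.items():
        path = prefix + (name,)
        if isinstance(child, dict):
            # Interior node: keep descending through sub-topics.
            found = find_path(child, project, path)
            if found is not None:
                return found
        elif project in child:
            # Leaf node: a list of WikiProjects filed under this path.
            return path
    return None


birds_path = find_path(DIRECTORY, "WikiProject Birds")
print("/".join(birds_path))  # prints "STEM/Science/Animals"
```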
By taking advantage of this directory structure, using its mid-level categories together with the WikiProject taggings already available on Wikipedia, we can (1) obtain a small, uniform set of target classes and (2) enable editors to control the class output. Imagine a scenario where something isn't quite right with the WikiProject Directory structure, so that it doesn't capture the level of detail needed to be useful. For example, editors want to target biographies about women, and yet "women" does not show up at any level of the directory. An editor could add a new level under Culture/Biography called "Women" and include WikiProject Women Writers, WikiProject Women Scientists, and other women-focused WikiProjects underneath this new branch of the tree. We can then run the same automated structure-extraction scripts on the repaired tree to re-label articles using the new mid-level categories, without spending any human effort re-labeling articles.
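The re-labeling step can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual extraction scripts: the function names, the nested-dictionary representation, the choice of three levels as the "mid-level" cut-off, the placement of the two women-focused WikiProjects directly under Culture/Biography before the edit, and the second article title are all hypothetical.

```python
def mid_level_labels(directory, depth):
    """Map each WikiProject to its directory path, truncated to `depth` levels."""
    labels = {}

    def walk(node, path):
        if isinstance(node, dict):
            for name, child in node.items():
                walk(child, path + [name])
        else:
            # Leaf: a list of WikiProjects filed under `path`.
            for project in node:
                labels[project] = "/".join(path[:depth])

    walk(directory, [])
    return labels


def label_articles(taggings, labels):
    """Derive each article's label set from its existing WikiProject tags."""
    return {
        article: sorted({labels[p] for p in projects if p in labels})
        for article, projects in taggings.items()
    }


WOMEN_PROJECTS = ["WikiProject Women Writers", "WikiProject Women Scientists"]
STEM = {"Science": {"Animals": ["WikiProject Birds"]}}

# Assumed directory before and after an editor adds the "Women" level.
before = {"Culture": {"Biography": WOMEN_PROJECTS}, "STEM": STEM}
after = {"Culture": {"Biography": {"Women": WOMEN_PROJECTS}}, "STEM": STEM}

# Existing taggings are untouched; only the directory changed.
taggings = {
    "Eagle": ["WikiProject Birds"],
    "Hypothetical author biography": ["WikiProject Women Writers"],
}

old_labels = label_articles(taggings, mid_level_labels(before, depth=3))
new_labels = label_articles(taggings, mid_level_labels(after, depth=3))
print(old_labels["Hypothetical author biography"])  # ['Culture/Biography']
print(new_labels["Hypothetical author biography"])  # ['Culture/Biography/Women']
```

Only the directory is edited between the two runs. The taggings are reused unchanged, which is why no manual re-labeling is needed.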
In this way, the directory acts as a layer of ontological abstraction between the WikiProject taggings and the uniform set of categories that we want to use to model topics on Wikipedia. By allowing volunteer editors to easily modify this ontology, and by implementing automated systems for re-interpreting it, we enable a straightforward process for adjusting our prediction model to match editors' expectations. We suspect that our prediction model will highlight limitations in the directory structure and that editors will be able to work with us in a tight loop of iterations in which we converge on an optimal set of target labels for the model.