User:TBurmeister (WMF)/Sandbox/Collections

From mediawiki.org

This page gathers ideas and strategies around how to group technical documentation into meaningful units that facilitate browsing, discovery, and maintenance. In Q1 of FY22-23 we are investigating: "What strategies or tools have people used, considered or suggested to identify sets of related docs or divide up the content space?" This is part of work on phab:T313037.

How do we define the collections?

Different organizing principles currently exist, and others have been suggested in brainstorming sessions or other feedback channels. Documents are related to each other in multiple, varied ways:

Theme Examples
General type of technology Templates



Data dumps

As part of this research, I made an experimental attempt (summarized below) to organize and create consistency between a small set of categories and collections we had listed for Key Docs as part of KR1 in FY 21/22.

General technical area (technology + processes) Deployment

Code review[1]




...some ideas people have proposed for identifying these areas are: Bugzilla/Phab tags or projects, Gerrit tags, Tech Blog categories[2]; Category:MediaWiki components, list of components at Developers/Maintainers[3].

Specific technology, app, or component Lua






...some ideas people have proposed for identifying these areas are: Bugzilla/Phab tags or projects, Gerrit tags, Tech Blog categories[2]; Category:MediaWiki components, list of components at Developers/Maintainers, MediaWiki modules[3]

Workflow or user journey "create workflow oriented documentation: problem -> find the code -> get into the code -> how to gerrit -> how to get review -> how review works -> have clear workflow"[4]

"Documentation collections are often centered around a specific technology. In some cases, linking from an entry point to a [technology-specific] landing page fails to provide readers with enough context to know what to choose. For example, an entry point that links to the Gerrit documentation collection and the Phabricator documentation collection doesn’t tell the reader why they might want to use those technologies. To help with this, we can use a different type of landing page: one that links to documentation collections in context as they relate to a theme. Instead of an entry point linking directly to the Gerrit and Phabricator docs, the entry point links to a cross-collection landing page that contextualize these technologies as they relate to the theme of technical contribution."[5]

Content type Tutorials[6]


API docs

...other examples[3]

Content format On-wiki docs

Markdown files



PAWS notebooks

Maintainers or owners Cloud Services team

Pywikibot enthusiasts?

"Collections are centered around documentation ownership and maintenance. A collection represents a set of docs that is maintained by a group of people with a similar interest. For example, if we look at Wikimedia Apps, there is no group of people interested in maintaining docs for all Wikimedia apps. Maintainer interest is grouped around a specific app, so we can look at each app and identify the docs for that app as a collection." [5]

Repo structure Code within a single repo is documented by a "collection". This is likely too low-level.
Audience Developers, system administrators, users (for MediaWiki docs)[3] (Note: audience is also somewhat embedded in our use of different wikis, such as Wikitech for infrastructure developers and Meta for non-technical audiences. As always, there are exceptions, such as some developer-focused content on and about en:Wikipedia (example).)

Case study #1: Continuous integration docs

Most documentation collections contain sub-collections that can be organized around one or more of the above themes. To observe this in practice, consider the corpus of docs listed at Continuous integration#Documents. This "collection" is generated by Special:PrefixIndex/Continuous_integration/. These documents form multiple collections, some overlapping and some disjoint, based on the following themes:

  • Maintainers: a sub-collection of the docs owned / maintained by the Release Engineering team
  • General technical area: continuous integration infrastructure / project (however, this is also itself a sub-collection of an even larger topic: "Testing".)
  • Specific technologies or components (listing just a few here; are these sub-collections, or collections of their own? When and why does it matter?):
  • Content types:
    • Tutorials
    • Entry points
    • Troubleshooting
  • Workflows or user journeys:
    • Regenerate Jenkins jobs
    • Control which tests Zuul runs on a patchset
    • [also most of the Tutorials]
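The prefix-based listing that produces this collection can also be retrieved programmatically. The following is a minimal sketch, assuming the standard MediaWiki Action API (`list=allpages` with `apprefix`) and the mediawiki.org endpoint; the function names are hypothetical and nothing here is taken from the original research. It only builds the request URL and parses a response, so fetching is left to whatever HTTP client is preferred.

```python
from urllib.parse import urlencode

# Assumption: the standard Action API endpoint for mediawiki.org.
API_ENDPOINT = "https://www.mediawiki.org/w/api.php"


def prefix_query_url(prefix: str, namespace: int = 0, limit: int = 50) -> str:
    """Build an Action API URL that lists pages whose title starts with `prefix`."""
    params = {
        "action": "query",
        "list": "allpages",
        "apprefix": prefix,        # title prefix, without the namespace
        "apnamespace": namespace,  # 0 is the main namespace
        "aplimit": limit,
        "format": "json",
    }
    return f"{API_ENDPOINT}?{urlencode(params)}"


def titles_from_response(response: dict) -> list[str]:
    """Extract page titles from a decoded `list=allpages` JSON response."""
    pages = response.get("query", {}).get("allpages", [])
    return [page["title"] for page in pages]


if __name__ == "__main__":
    # Prints the URL to request; fetching and decoding it is left to the reader.
    print(prefix_query_url("Continuous integration/"))
```

A list obtained this way has the same limitation as the on-wiki one: it reflects page hierarchy only, so related docs that live outside the prefix will not appear.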

Page hierarchy is being used to create this collection of docs. However, there is other metadata that reflects other collections or sub-collections for the same content, along with content that is missing in the collection based solely on Special:PrefixIndex/Continuous_integration/:

Reflections about doc improvement process based on this case study:

  • The ambiguity around where to draw the boundaries of a collection makes it easy to get lost in the ocean of content. We need to help people draw boundaries in a way that is meaningful but doesn't require them to have an ontology of the entire tech stack. Outcome: Documentation/Toolkit/Collection audit/Survey.
  • Without some sort of thematic anchor for a set of docs, it's easy for content to be scattered across multiple repos, which leads to duplication and to findability and maintenance problems. Because the concept of continuous integration overlaps with the technologies used to do CI, and the infra components most impacted by it, there are some gaps in the documentation of each of those collections. This could be mostly solved with more intentional cross-referencing or tagging, or focusing on one topic per doc and using transclusion if necessary.
  • For the benefit of readers and maintainers, it's useful to establish one repo as the canonical location for docs about a given topic, technology, or product. The most challenging question, and what leads to the patchwork of overlapping and disjoint content, is what should those topics be? That is essentially the same question as "what are the collections?"
  • I don't think we should try to build an ontology, and I think focusing first on topic governance in this context is a trap (shocking, right?). Doc creators and maintainers need to be able to ask ourselves meaningful questions about what we want our readers to be able to understand and accomplish, then create the set of docs that enable that, and add some consistent metadata to those docs and others that are related...but without creating walls of links or decontextualized "see alsos", and without creating an overly-controlled set of doc categories or landing pages.

Case study #2: Research:Data and Data_dumps

TODO transfer content/findings

Experiment: listing categories and collections

As part of KR1:The Docs in FY 21-22, we had identified collections and general categories for the first set of key docs for the Developer Portal. To understand more about the challenge of defining collections, I took each of those categories and collection names and attempted to mold them into a standardized list. Note how the same collections appear on multiple rows:

General category | Example collections
Cloud services | cloud vps, toolforge, PAWS
Communication | TechBlog, phabricator, communication channels
Community | hackathons, TechBlog
Contributing | commons contributing, MediaWiki coding conventions, phabricator, Gitlab, Gerrit
Data and machine learning | data dumps, PAWS, Quarry, ML platform(s), analytics
Development tools & processes | gerrit, gitlab, phabricator, tech decision forum
Enterprise | autowikibrowser (tool)
MediaWiki | APIs, Extensions, Wikibase, coding conventions, Parsoid, Quibble
Monitoring & SRE |
Networking & Servers |
Outreach & Community Support | Grants, growth, education, small wiki toolkits, TechBlog
Performance | analytics
Platforms? (WMF infra?) | RESTbase, APIs, Wikibase, ML platform(s), Cloud services, Continuous integration, Parsoid
Product design & UX/UI | Design style guide
Releases and deployment | Continuous integration, MediaWiki
Research | Data dumps, ML platform(s), Grants, Research team
System architecture and design | Accessibility, system architecture, code health
Testing | Continuous integration, Unit testing, Jenkins, Quibble
Tools & apps | kaios app, Commons mobile app, huggle (tool), Toolhub, Toolforge, bots, upload tools, wpcleaner (tool)
Wikidata | Wikibase
WMF organisation and teams | abstract wikipedia, research team, platform eng team...

See the above table in a spreadsheet
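To make the overlap in the table concrete, here is a small sketch that inverts a subset of its rows and reports which collections are listed under more than one general category. The data is copied from the table; the function names and the case-insensitive matching (so that "Gerrit" and "gerrit" count as the same collection) are assumptions made for illustration, not part of the original experiment.

```python
from collections import defaultdict

# Subset of rows copied from the table above: general category -> example collections.
CATEGORY_TABLE = {
    "Cloud services": ["cloud vps", "toolforge", "PAWS"],
    "Communication": ["TechBlog", "phabricator", "communication channels"],
    "Community": ["hackathons", "TechBlog"],
    "Contributing": ["commons contributing", "MediaWiki coding conventions",
                     "phabricator", "Gitlab", "Gerrit"],
    "Data and machine learning": ["data dumps", "PAWS", "Quarry",
                                  "ML platform(s)", "analytics"],
    "Development tools & processes": ["gerrit", "gitlab", "phabricator",
                                      "tech decision forum"],
    "Releases and deployment": ["Continuous integration", "MediaWiki"],
    "Testing": ["Continuous integration", "Unit testing", "Jenkins", "Quibble"],
}


def categories_by_collection(table):
    """Invert the table: collection name (case-folded) -> set of categories."""
    index = defaultdict(set)
    for category, collections in table.items():
        for name in collections:
            index[name.casefold()].add(category)
    return index


def shared_collections(table):
    """Return only the collections that are listed under more than one category."""
    index = categories_by_collection(table)
    return {name: sorted(cats) for name, cats in index.items() if len(cats) > 1}


if __name__ == "__main__":
    for name, cats in sorted(shared_collections(CATEGORY_TABLE).items()):
        print(f"{name}: {', '.join(cats)}")
```

Even with this subset, several collections land in two or three categories, which is the one-to-many relationship that a single page hierarchy cannot express.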

How do we indicate and track which content is in which collections?

The doc review checklist for KR1:The Docs (FY 21/22) included the question: "How is the page linked to related pages? What categories does the page belong to? Is it part of a clearly defined collection of pages?" This question can be complex to answer, as documented above: collections of pages are not often "clearly defined". To help people scope the set of docs they attempt to improve, it's important to help them understand and navigate all the different ways that pages can be related or form ad-hoc "collections".

What follows is a survey of various current practices that people use to create "collections" of technical content.

Document metadata

Metadata can be embedded in a page name or added to a page in a variety of ways. People often use these different types of metadata to create connections between related content:

Type of metadata Examples
Page hierarchy VisualEditor



The EngProd team moved generic Manual pages to team subpages to indicate stewardship (example), and redirected Talk pages to the team Talk page to make responses easier/possible. They emphasized that subpages under a common parent make it easier for maintainers to find and review docs.[7]

Transcluding the pages automatically listed by Special:PrefixIndex makes it possible to auto-generate a list of subpages, but that can get noisy and doesn't list the content in a meaningful way. Example: Continuous integration#Documents

Namespace (note also: Extension default namespaces)
// Excerpt: custom namespace IDs and names per wiki (truncated)
'wikitech' => [
	110 => 'Obsolete',
	111 => 'Obsolete_talk',
	112 => 'OfficeIT', // T123383
	113 => 'OfficeIT_talk',
	// NS 114/115 reserved for 'Translation'
	116 => 'Tool', // T122865
	117 => 'Tool_talk', // T122865
	// ...
],
'mediawikiwiki' => [
	100 => 'Manual',
	101 => 'Manual_talk',
	102 => 'Extension',
	103 => 'Extension_talk',
	104 => 'API',
	105 => 'API_talk',
	106 => 'Skin',
	107 => 'Skin_talk',
	// ...
],
(see others in the config)

Past proposals have suggested creating namespaces for specific documentation types (todo: can i find this again in my browser history? :-( )

Namespace + page hierarchy Extension:PhpTags
Extension:PhpTags/Magic expressions
Extension:PhpTags/Quick start guide
Extension:PhpTags/For developers
Extension:PhpTags/Performance

"Trying to have relevant docs together, e.g. Help:Extension:Translate* and have all linked from the main page of that documentation."[7]
Page name masquerading as namespace https://wikitech.wikimedia.org/wiki/Category:Portals (Portal is not a namespace on Wikitech)
Category Category:Upload variables

Category:Stable extensions Category:WMF Projects

Category:MediaWiki components



Add Tags to docs - like labels[4]

Content model (not really page metadata, though we could use it as such to identify these special types of content) GadgetDefinition, NewsletterContent
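The namespace and page-hierarchy metadata described above can be read directly from page titles. The sketch below splits a title into namespace, root page, and subpage, then groups titles by namespace and root. It is a deliberately simplified illustration: the namespace set is limited to names from the 'mediawikiwiki' excerpt plus the built-in Help namespace, the helper names are hypothetical, and real MediaWiki title parsing handles many more cases.

```python
from collections import defaultdict
from typing import NamedTuple

# Namespace names from the 'mediawikiwiki' excerpt above, plus the built-in
# Help namespace. Assumption: a real wiki defines many more than these.
KNOWN_NAMESPACES = {"Manual", "Extension", "API", "Skin", "Help"}


class TitleParts(NamedTuple):
    namespace: str  # "" means the main namespace
    root: str       # top-level page name
    subpage: str    # "" for the root page itself


def split_title(title: str) -> TitleParts:
    """Split 'Namespace:Root/Sub/page' into its three parts."""
    prefix, sep, rest = title.partition(":")
    if sep and prefix in KNOWN_NAMESPACES:
        namespace, page = prefix, rest
    else:
        namespace, page = "", title
    root, _, subpage = page.partition("/")
    return TitleParts(namespace, root, subpage)


def group_by_root(titles):
    """Group titles into candidate collections keyed by (namespace, root page)."""
    groups = defaultdict(list)
    for title in titles:
        parts = split_title(title)
        groups[(parts.namespace, parts.root)].append(title)
    return dict(groups)


if __name__ == "__main__":
    titles = [
        "Extension:PhpTags",
        "Extension:PhpTags/Magic expressions",
        "Extension:PhpTags/Quick start guide",
        "Help:VisualEditor/User_guide",
        "VisualEditor",
    ]
    for key, members in group_by_root(titles).items():
        print(key, members)
```

Grouping this way keeps the Extension:PhpTags pages together, but it places Help:VisualEditor/User_guide and VisualEditor in separate groups even though they cover the same topic. Metadata derived from titles alone therefore misses relationships that cross namespaces.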

Navigation templates or navboxes


Some navigation templates group together pages that *mostly* share a hierarchy or namespace, but navigation templates usually include at least some pages that reside outside of the canonical page hierarchy or namespace for a given topic. Especially because namespaces exist to cover different facets of the same topic (like Help:, API:, and Manual:), navigation templates often serve to provide paths between these namespaces. For example: the VisualEditor navigation template includes links to pages following the page hierarchy pattern /VisualEditor/page, but also Help:VisualEditor/User_guide.

Link hubs and lists of docs

This approach is similar to a navigation template, but the lists of docs are usually longer or specially formatted as a link hub or doc portal, and not transcluded anywhere -- they only live in this one location. Transcluding the pages automatically listed by Special:PrefixIndex makes it possible to auto-generate a list of subpages, but that can get noisy and doesn't list the content in a meaningful way. Long lists of links don't really help anyone.

Manually-curated examples:

Examples using Special:PrefixIndex:

Web domains

  • "We do separate doc locations a bit on audience, wikitech used for services, wikitech more internal in terms of audience, mw.org very broad scope"[7]
  • MediaWiki vs. Wikitech vs. Meta...
    • "1 wiki for promoting and supporting MediaWiki software and its API to sysadmins, developers and end users.
    • 1 wiki for promoting and coordinating the open source development of MediaWiki based software projects and Wikimedia technical initiatives."[9]
  • "Following is a list of the main places to find Wikimedia technical documentation.
    • MediaWiki - MediaWiki software documentation and technical documentation for many other Wikimedia technology projects. This is the default space for publishing technical documents about Wikimedia technology.
    • Wikitech - Technical documentation for Wikimedia Foundation infrastructure and services. This includes production clusters, Wikimedia Cloud Services, Toolforge hosting, the Beta Cluster, and other data services.
    • Wikidata - Technical documentation related to the Wikidata project, particularly the Tools page.
    • PAWS - Documentation about PAWS, Wikimedia's hosted Jupyter notebooks instance. Notebooks are frequently used to create tutorials and documentation for Wikimedia technology.
    • Phabricator - Phabricator is a collaboration tool that is used by the Wikimedia technical community for task and bug management. You can find many issues and software projects documented here. Use best practices for software documentation when filing tasks and interacting in this space."[10]

Conclusions and outcomes

Defining collections

  • It's possible to define a set of high-level categories that cover the range of Wikimedia technical documentation, but the utility of doing so is dubious. The categories are so broad that their main utility would be to provide a landing page that guides readers to more specific collections. They could also be useful to enable Tech leads to understand the broad scope of all the documentation their departments should (perhaps) be maintaining.
  • Collections can belong to more than one category.
  • Collections have different organizing themes, but most of them are organized around a specific technology, platform, process, or team. In practice, it's hard to figure out if a collection should be organized around a specific technology (like "Jenkins"), or the system/process that technology is part of ("Continuous integration", "MediaWiki testing"), or the team that owns or maintains those technologies and processes ("Release engineering").
    • Whatever we decide about collections, this is the key hurdle that we need to remove in order for people to be able to scope their doc improvements and assess docs as a corpus.
    • Therefore: focusing on improving stewardship of the large mass of collections is more useful than focusing on high-level categories.
  • Individual documents can be part of more than one collection. It follows that few, if any, collections can ever be clearly defined. We have limited mechanisms on-wiki to represent these complex one-to-many relationships between documents and collections, and collections and high-level categories. This is part of why we end up with long lists of see-also and nav boxes that bring together scattered docs.
    • Therefore: the best way to improve stewardship of "collections" is to work with teams and project owners to help them identify the docs that they should consider "in scope" for their work. User:KBach-WMF/Sandbox/PywikibotCollection is one example of this sort of sense-making, doc-gathering effort guided by tech writer expertise. The collection assessment guide that came out of this quarter's work includes some techniques for how to identify docs that are part of a collection, but these strategies are likely to be difficult for teams to apply because so much of the process changes based on context.

Indicating and tracking collection membership

Ranking the methods of indicating collection membership, in order of decreasing reliability, consistency, and sustainability based on my research:

  • Doc metadata: Page hierarchy: The most ubiquitous tactic people use to relate pages to each other -- though perhaps they do this unintentionally ("this is just where we always put a new doc") and create collections that aren't really related by any clear or useful concept.
  • Navigation templates: Useful in that they're an identifiable part of a page or set of pages, unlike a plain-wikitext list of links that is indistinguishable from other page content. Easy to maintain, so often more up to date than other methods of relating pages to each other.
  • Categories: Underutilized on our technical wikis, probably because the only way to browse categories is alphabetically, and that severely limits the utility of category landing pages in a non-encyclopedic context. There's potential here to use categories to provide better guided paths into more specific content collections, but figuring that out requires more content strategy work and analysis. It is more important to work on improving the individual pieces of content and the collections that direct people to them, because that is where the most essential (and incorrect) information resides.
  • Doc metadata: Namespaces: Troubling inconsistency across wikis. Different namespaces exist on different wikis, and even the same namespaces are used inconsistently across and within wikis. Overall, namespaces don't seem to correspond to an organizational concept that is universally very useful.
  • Web domains: Would be useful to separate major documentation by audience (i.e. Wikitech is for x, MediaWiki is for y), but we should enforce that more officially and more clearly. Right now there's a lot of content that lives on one wiki but could be relevant for readers who expect to find it on a different wiki. I also think we should avoid putting documentation in PAWS notebooks because they're not easily searchable and thus are harder to maintain.
  • Link hubs / lists of docs: Avoid! These are prime candidates for asking hard questions about the scope of a collection, and then putting the useful content in the appropriate page hierarchy, adding it to a navbox, or archiving it.
  • Doc metadata: content model: This is like a special wiki flavor of doc types. Not super relevant for this investigation.

Conclusions & recommendations

  • Use Categories to create collections of docs that are related through very broad concepts that include multiple teams and technologies, like "Testing".
    • This is similar to the concept of "cross-collection landing pages" from KR1:The Docs.
    • All categories should have overview docs that orient the reader to the technical landscape at a high level, with guidance to get readers to sub-collection landing pages.
    • Individual teams and project owners should not feel required to try to maintain docs at this level -- this is cross-team work that should be coordinated by technical writers in partnership with team leads, directors, and/or working groups.
  • Use page hierarchy to create collections of docs that are related through more specific concepts, like software projects, and complex technical systems with multiple sub-components (like "Continuous integration"). These pages should link to code repos (and docs within them) where applicable.
    • This aligns most closely with the concept of "collections" from KR1:The Docs.
    • These collections should have overview docs and task-focused docs.
    • This is the level at which individual teams and project owners should manage, maintain, and improve sets of docs. Doing this collection-level work usually requires some tech writer guidance on strategy and tactics, because these vary significantly by the context, history, and unique challenges of each set of docs.
    • Everyone should be encouraged to improve and help maintain docs at this level, but doing so requires more communication and collaboration to align contributions with the overall structure and content strategy for the collection.
  • Put docs that relate to specific technical components in the repos where the code lives. Docs that live with code should contain pointers to on-wiki collection landing pages, which provide the higher-level context for the code, along with connections to other relevant content.
    • These collections should have reference docs and task-focused docs. They may have overview docs, but that depends on the topic and on what is already covered on-wiki (and linked to from these docs-with-the-code).
    • Everyone should be encouraged and empowered to improve and help maintain docs at this level.
  • Technical writers should engage directly with developers and subject matter experts to help identify the boundaries of their collections and provide guidance on applying useful page structure and document metadata to relate docs to each other and to other collections in useful and meaningful ways.

Key artifacts