User:TBurmeister (WMF)/Sandbox/Collections

This page gathers ideas and strategies around how to group technical documentation into meaningful units that facilitate browsing, discovery, and maintenance. In Q1 of FY22-23 we are investigating: "What strategies or tools have people used, considered or suggested to identify sets of related docs or divide up the content space?" This is part of work on phab:T313037.

How do we define the collections?
Different organizing principles currently exist, and others have been suggested in brainstorming sessions or other feedback channels. Documents are related to each other in multiple, varied ways:

Case study #1: Continuous integration docs
Most documentation collections contain sub-collections that can be organized around one or more of the above themes. To observe this in practice, consider the corpus of docs listed at Continuous integration. This "collection" is created by the Special:PrefixIndex/Continuous_integration/. These documents form multiple, overlapping but also disjoint collections based on the following themes:


 * Maintainers: a sub-collection of the docs owned / maintained by the Release Engineering team
 * General technical area: continuous integration infrastructure / project (however, this is also itself a sub-collection of an even larger topic: "Testing".)
 * Specific technologies or components (listing just a few here. are these sub-collections, or collections of their own? when and why does it matter?):
 * Jenkins (but there is also Parsoid/Jenkins Testing and https://wikitech.wikimedia.org/wiki/Jenkins)
 * Phan
 * Parsoid (but there is also Parsoid)
 * Quibble
 * Content types:
 * Tutorials
 * Entry points
 * Troubleshooting
 * Workflows or user journeys:
 * Regenerate Jenkins jobs
 * Control which tests Zuul runs on a patchset
 * [also most of the Tutorials]

Page hierarchy is being used to create this collection of docs. However, there is other metadata that reflects other collections or sub-collections for the same content, along with content that is missing in the collection based solely on Special:PrefixIndex/Continuous_integration/:

Reflections about doc improvement process based on this case study:
 * Some, but not all, of the docs are in Category:Testing
 * Some, but not all, of the docs are in Category:Continuous integration
 * Based on the code paths mentioned in Continuous integration/Parsoid, https://doc.wikimedia.org/Parsoid/master/tutorial-devsetup.html#!/guide/setup could be related documentation, along with Parsoid and https://wikitech.wikimedia.org/wiki/Parsoid.
 * Quibble documentation lives at https://doc.wikimedia.org/quibble/ and Continuous integration/Quibble. Hooray, there's a redirect from mw:Quibble to the CI doc, and the CI doc has a link to doc.wikimedia.org.  Gold star!


 * The ambiguity around where to draw the boundaries of a collection makes it easy to get lost in the ocean of content. We need to help people draw boundaries in a way that is meaningful but doesn't require them to have an ontology of the entire tech stack.
 * Without some sort of thematic anchor for a set of docs, its easy for content to be scattered across multiple repos, which leads to duplication and findability + maintenance trouble. Because the concept of continuous integration overlaps with the technologies used to do CI, and the infra components most impacted by it, there are some gaps in the documentation of each of those collections. This could be mostly solved with more intentional cross-referencing or tagging, or focusing on one topic per doc and using transclusion if necessary.
 * For the benefit of readers and maintainers, it's useful to establish one repo as the canonical location for docs about a given topic, technology, or product. The most challenging question, and what leads to the patchwork of overlapping and disjoint content, is what should those topics be?  That is essentially the same question as "what are the collections?"
 * I don't think we should try to build an ontology, and I think focusing first on topic governance in this context is a trap (shocking, right?). Doc creators and maintainers need to be able to ask ourselves meaningful questions about what we want our readers to be able to understand and accomplish, then create the set of docs that enable that, and add some consistent metadata to those docs and others that are related...but without creating walls of links or decontextualized "see alsos", and without creating an overly-controlled set of doc categories or landing pages.

Case study #2: Research:Data and Data_dumps
TODO transfer content/findings

Experiment: listing categories and collections
As part of KR1:The Docs in FY 21-22, we had identified collections and general categories for the first set of key docs for the Developer Portal. To understand more about the challenge of defining collections, I took each of those categories and collection names and attempted to mold them into a standardized list. Note how the same collections appear on multiple rows: See the above table in a spreadsheet

Conclusions:


 * It's possible to define a set of high-level categories that cover the range of Wikimedia technical documentation, but the utility of doing so is dubious. The categories are so broad that their main utility would be to provide a landing page that guides readers to more specific collections.  They could also be useful to enable Tech leads to understand the broad scope of all the documentation their departments should (perhaps) be maintaining.
 * Collections belong in more than one category.
 * Collections have different organizing themes, but most of them are organized around a specific technology, platform, process, or team. In practice, it's hard to figure out if a Collection should be organized around a specific technology (like "Jenkins"), or the system/process that technology is part of ("Continuous integration", "MediaWiki testing"), or the team that owns or maintains those technologies and processes ("Release engineering").
 * Whatever we decide about Collections, this is the key hurdle that we need to remove in order for people to be able to scope their doc improvements and assess docs as a corpus.
 * Therefore: focusing on improving stewardship of the large mass of Collections is more useful than focusing on high-level categories.
 * Individual documents can be part of more than one Collection, but we have limited mechanisms on-wiki to represent that. This is part of why we end up with long lists of see-also and nav boxes that bring together scattered docs.

How do we indicate and track which content is in which collections?
This is a survey of current practices, not (yet) a recommendation for what we should do.

Document metadata
Metadata can be embedded in a page name or added to a page in a variety of ways. People often use these different types of metadata to create connections between related content:

Navigation templates or navboxes
Examples:


 * https://wikitech.wikimedia.org/wiki/Wikimedia_infrastructure
 * Template:Huggle/DocHeader
 * Template:Installation Guides
 * Template:OOUI
 * ..More examples

Some navigation templates group together pages that *mostly* share a hierarchy or namespace, but navigation templates usually include at least some pages that reside outside of the canonical page hierarchy or namespace for a given topic. Especially because namespaces exist to cover different facets of the same topic (like Help:, API:, and Manual:), navigation templates often serve to provide paths between these namespaces. For example: the VisualEditor navigation template includes links to pages following the page hierarchy pattern, but also.

Link hubs and lists of docs
This approach is similar to a navigation template, but the lists of docs are usually longer or specially formatted as a link hub or doc portal, and not transcluded anywhere -- they only live in this one location. Transcluding the pages automatically listed by Special:PrefixIndex makes it possible to auto-generate a list of subpages, but that can get noisy and doesn't list the content in a meaningful way. Long lists of links don't really help anyone.

Manually-curated examples:


 * https://meta.wikimedia.org/wiki/Small_wiki_toolkits
 * Developer hub
 * Developer portal - consider each site section as a "collection"
 * https://meta.wikimedia.org/wiki/Data_dumps
 * Wikibase
 * Product Analytics

Examples using Special:PrefixIndex:


 * Continuous integration

Web domains

 * "We do separate doc locations a bit on audience, wikitech used for services, wikitech more internal in terms of audience, mw.org very broad scope"
 * MediaWiki vs. Wikitech vs. Meta...
 * "1 wiki for promoting and supporting MediaWiki software and its API to sysadmins, developers and end users.
 * 1 wiki for promoting and coordinating the open source development of MediaWiki based software projects and Wikimedia technical initiatives."
 * "Following is a list of the main places to find Wikimedia technical documentation.
 * MediaWiki - MediaWiki software documentation and technical documentation for many other Wikimedia technology projects. This is the default space for publishing technical documents about Wikimedia technology.
 * Wikitech - Technical documentation for Wikimedia Foundation infrastructure and services. This includes production clusters, Wikimedia Cloud Services, Toolforge hosting, the Beta Cluster, and other data services.
 * Wikidata - Technical documentation related to the Wikidata project, particularly the Tools page.
 * PAWS - Documentation about PAWS, Wikimedia's hosted Jupyter notebooks instance. Notebooks are frequently used to create tutorials and documentation for Wikimedia technology.
 * Phabricator - Phabricator is a collaboration tool that is used by the Wikimedia technical community for task and bug management. You can find many issues and software projects documented here. Use best practices for software documentation when filing tasks and interacting in this space."

Community suggestions and prior art not covered above

 * Add Tags to docs - like labels


 * Idea to ponder further: can we use the patterns in how people already deploy metadata to improve the connections between related docs along multiple axes? What about something like:
 * Use Categories to create collections of docs that are related through broad concepts that include multiple teams and technologies, like "Testing".
 * This is similar to the concept of "cross-collection landing pages" from KR1:The Docs.
 * All categories should have overview docs that orient the reader to the technical landscape at a high level, with guidance to get readers to sub-collection landing pages.
 * Use page hierarchy to create collections of docs that are related through more specific concepts, like software projects, and complex technical systems with multiple sub-components (like "Continuous integration"). These pages should link to code repos (and docs within them) where applicable.
 * This aligns most closely with the concept of "collections" from KR1:The Docs.
 * These collections should have overview docs and task-focused docs.
 * Put docs that relate to specific technical components in the repos where the code lives. So, docs that are tutorials about Parsoid, or Quibble how-to docs, would live at doc.wikimedia.org.
 * These collections should have reference docs and task-focused docs. They may have overview docs, but that depends on the topic and on what is already covered on-wiki (and linked to from these docs-with-the-code).