User:TBurmeister (WMF)/Sandbox/Collections

This page gathers ideas and strategies around how to group technical documentation into meaningful units that facilitate browsing, discovery, and maintenance. In Q1 of FY22-23 we are investigating: "What strategies or tools have people used, considered or suggested to identify sets of related docs or divide up the content space?" This is part of work on phab:T313037.

How do we define the collections?
Different organizing principles currently exist, and others have been suggested in brainstorming sessions or other feedback channels. Documents are related to each other in multiple, varied ways:

Case study #1: Continuous integration docs
Most documentation collections contain sub-collections that can be organized around one or more of the above themes. To observe this in practice, consider the corpus of docs listed at Continuous integration. This "collection" is generated by Special:PrefixIndex/Continuous_integration/. These documents form multiple collections, some overlapping and some disjoint, based on the following themes:


 * Maintainers: a sub-collection of the docs owned / maintained by the Release Engineering team
 * General technical area: continuous integration infrastructure / project (however, this is itself a sub-collection of an even larger topic: "Testing")
 * Specific technologies or components (listing just a few here; are these sub-collections, or collections of their own? When and why does it matter?):
   * Jenkins (but there is also Parsoid/Jenkins Testing and https://wikitech.wikimedia.org/wiki/Jenkins)
   * Phan
   * Parsoid (but there is also Parsoid)
   * Quibble
 * Content types:
   * Tutorials
   * Entry points
   * Troubleshooting
 * Workflows or user journeys:
   * Regenerate Jenkins jobs
   * Control which tests Zuul runs on a patchset
   * [also most of the Tutorials]
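
The prefix-based collection described above can also be listed programmatically. The following is a minimal sketch, assuming the MediaWiki Action API's `list=allpages` module with its `apprefix` parameter as the equivalent of Special:PrefixIndex, and assuming the docs live in the main namespace of mediawiki.org. The endpoint URL, the User-Agent string, and the function names are illustrative choices, not something this page prescribes.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumption: the Continuous integration docs live on mediawiki.org.
API_URL = "https://www.mediawiki.org/w/api.php"
# Placeholder User-Agent; replace with one that identifies your script.
USER_AGENT = "collections-sketch/0.1 (example only)"


def prefix_query_params(prefix, namespace=0, limit=50):
    """Build Action API parameters that list pages whose title starts with prefix."""
    return {
        "action": "query",
        "list": "allpages",
        "apprefix": prefix,
        "apnamespace": namespace,
        "aplimit": limit,
        "format": "json",
    }


def titles_from_response(data):
    """Extract page titles from one decoded allpages response."""
    return [page["title"] for page in data.get("query", {}).get("allpages", [])]


def fetch_prefix_collection(prefix):
    """Follow API continuation until every matching title is collected (needs network)."""
    params = prefix_query_params(prefix)
    titles = []
    while True:
        request = Request(f"{API_URL}?{urlencode(params)}",
                          headers={"User-Agent": USER_AGENT})
        with urlopen(request) as response:
            data = json.load(response)
        titles.extend(titles_from_response(data))
        if "continue" not in data:
            return titles
        # The API returns the parameters needed to request the next batch.
        params.update(data["continue"])
```

Calling `fetch_prefix_collection("Continuous integration/")` would return the flat list of subpage titles. Like Special:PrefixIndex itself, it says nothing about the themes listed above.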

Page hierarchy is used to create this collection of docs. However, other metadata reflects other collections or sub-collections for the same content, and some related content is missing from a collection based solely on Special:PrefixIndex/Continuous_integration/:

 * Some, but not all, of the docs are in Category:Testing
 * Some, but not all, of the docs are in Category:Continuous integration
 * Based on the code paths mentioned in Continuous integration/Parsoid, https://doc.wikimedia.org/Parsoid/master/tutorial-devsetup.html#!/guide/setup could be related documentation, along with Parsoid and https://wikitech.wikimedia.org/wiki/Parsoid.
 * A light in the darkness: Quibble documentation lives at https://doc.wikimedia.org/quibble/ and Continuous integration/Quibble. Hooray, there's a redirect from mw:Quibble to the CI doc, and the CI doc has a link to doc.wikimedia.org. Gold star!

Reflections about doc improvement process based on this case study:

 * The ambiguity around where to draw the boundaries of a collection makes it easy to get lost in the ocean of content.
 * Without some sort of thematic anchor for a set of docs, it's easy for content to be scattered across multiple repos, which leads to duplication and to findability and maintenance problems. Because the concept of continuous integration overlaps with the technologies used to do CI and with the infra components most impacted by it, there are gaps in the documentation of each of those collections. This could be mostly solved with more intentional cross-referencing or tagging. However, for the benefit of readers and maintainers, it's useful to establish one repo as the canonical location for docs about a given topic, technology, or product. The most challenging question, and the one that leads to the patchwork of overlapping and disjoint content, is what those topics should be. That is essentially the same question as "what are the collections?" So I'm yet again going in circles towards an ontology or something.
 * I don't think we should try to build an ontology (shocking, right?). That has been tried. What we need is to be able to ask ourselves meaningful questions about what we want our readers to be able to understand and accomplish, then create the set of docs that enables that, and add some consistent metadata to those docs and to others that are related, but without creating walls of links or decontextualized "see alsos".
 * Topic governance is a trap. Let's try another case study.
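
The mismatch noted in this case study, where some but not all of the docs under the page hierarchy are in Category:Testing, can be checked mechanically once both lists of titles are available. This is a minimal sketch: the function name and the sample category membership below are made up for illustration, and only the page titles themselves come from this page.

```python
def membership_gaps(prefix_titles, category_titles):
    """Compare a page-hierarchy collection with a category-based collection."""
    prefix_set = set(prefix_titles)
    category_set = set(category_titles)
    return {
        # In the page hierarchy but not tagged with the category.
        "missing_category": sorted(prefix_set - category_set),
        # Tagged with the category but living outside the page hierarchy.
        "outside_hierarchy": sorted(category_set - prefix_set),
    }


# Illustrative input only: the category membership here is hypothetical.
prefix_titles = ["Continuous integration/Parsoid", "Continuous integration/Quibble"]
category_titles = ["Continuous integration/Quibble", "Parsoid/Jenkins Testing"]

gaps = membership_gaps(prefix_titles, category_titles)
print(gaps["missing_category"])
print(gaps["outside_hierarchy"])
```

Pages in the first list might be candidates for tagging, and pages in the second might be candidates for a cross-reference from the hub page. Whether either action is appropriate is still a judgment call about where the collection's boundaries should be.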

Case study #2: Research:Data and Data_dumps
TODO transfer content/findings

How do we indicate and track which content is in which collections?
This is a survey of current practices, not (yet) a recommendation for what we should do.

Document metadata
Metadata can be embedded in a page name or added to a page in a variety of ways. People often use these different types of metadata to create connections between related content:

Navigation templates or navboxes
Examples:


 * https://wikitech.wikimedia.org/wiki/Wikimedia_infrastructure
 * Template:Huggle/DocHeader
 * Template:Installation Guides
 * Template:OOUI
 * More examples

Some navigation templates group together pages that *mostly* share a hierarchy or namespace, but they usually include at least some pages that reside outside of the canonical page hierarchy or namespace for a given topic. Because namespaces such as Help:, API:, and Manual: cover different facets of the same topic, navigation templates often provide paths between these namespaces. For example: the VisualEditor navigation template includes links to pages following the page hierarchy pattern, but also.

Link hubs and lists of docs
This approach is similar to a navigation template, but the lists of docs are usually longer or specially formatted as a link hub or doc portal, and they are not transcluded anywhere: they live only in this one location. Transcluding Special:PrefixIndex makes it possible to auto-generate a list of subpages, but that list can get noisy and doesn't present the content in a meaningful way. Long lists of links don't really help anyone.

Manually-curated examples:


 * https://meta.wikimedia.org/wiki/Small_wiki_toolkits
 * Developer hub
 * Developer portal - consider each site section as a "collection"
 * https://meta.wikimedia.org/wiki/Data_dumps
 * Wikibase
 * Product Analytics

Examples using Special:PrefixIndex:


 * Continuous integration

Web domains

 * "We do separate doc locations a bit on audience, wikitech used for services, wikitech more internal in terms of audience, mw.org very broad scope"
 * MediaWiki vs. Wikitech vs. Meta...
 * "1 wiki for promoting and supporting MediaWiki software and its API to sysadmins, developers and end users.
 * 1 wiki for promoting and coordinating the open source development of MediaWiki based software projects and Wikimedia technical initiatives."
 * "Following is a list of the main places to find Wikimedia technical documentation.
   * MediaWiki - MediaWiki software documentation and technical documentation for many other Wikimedia technology projects. This is the default space for publishing technical documents about Wikimedia technology.
   * Wikitech - Technical documentation for Wikimedia Foundation infrastructure and services. This includes production clusters, Wikimedia Cloud Services, Toolforge hosting, the Beta Cluster, and other data services.
   * Wikidata - Technical documentation related to the Wikidata project, particularly the Tools page.
   * PAWS - Documentation about PAWS, Wikimedia's hosted Jupyter notebooks instance. Notebooks are frequently used to create tutorials and documentation for Wikimedia technology.
   * Phabricator - Phabricator is a collaboration tool that is used by the Wikimedia technical community for task and bug management. You can find many issues and software projects documented here. Use best practices for software documentation when filing tasks and interacting in this space."

Community suggestions and prior art not covered above

 * Add tags to docs, like labels