User:TBurmeister (WMF)/Sandbox/Collections

This page gathers ideas and strategies around how to group technical documentation into meaningful units that facilitate browsing, discovery, and maintenance. In Q1 of FY22-23 we are investigating: "What strategies or tools have people used, considered or suggested to identify sets of related docs or divide up the content space?" This is part of work on phab:T313037.

How do we define the collections?
Different organizing principles currently exist, and others have been suggested in brainstorming sessions or other feedback channels. Documents are related to each other in multiple, varied ways:

Case study #1: Continuous integration docs
Most documentation collections contain sub-collections that can be organized around one or more of the above themes. To observe this in practice, consider the corpus of docs listed at Continuous integration. This "collection" is created by Special:PrefixIndex/Continuous_integration/. These documents form multiple collections, some overlapping and some disjoint, based on the following themes:


 * Maintainers: a sub-collection of the docs owned / maintained by the Release Engineering team
 * General technical area: continuous integration infrastructure / project (however, this is also itself a sub-collection of an even larger topic: "Testing".)
 * Specific technologies or components (listing just a few here; are these sub-collections, or collections of their own? When and why does it matter?):
 * Jenkins (but there is also Parsoid/Jenkins Testing and https://wikitech.wikimedia.org/wiki/Jenkins)
 * Phan
 * Parsoid (but there is also Parsoid)
 * Quibble
 * Content types:
 * Tutorials
 * Entry points
 * Troubleshooting
 * Workflows or user journeys:
 * Regenerate Jenkins jobs
 * Control which tests Zuul runs on a patchset
 * [also most of the Tutorials]
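
The prefix-based collection described above can also be retrieved programmatically. The following is a minimal sketch in Python, assuming the MediaWiki Action API's list=allpages module with its apprefix parameter, and assuming the docs live in the main namespace of mediawiki.org. The function names are illustrative, and the sketch only builds the request URL and parses a decoded response; it does not perform the HTTP request or handle continuation.

```python
from urllib.parse import urlencode

# Assumption: the collection lives on mediawiki.org; adjust for other wikis.
API_ENDPOINT = "https://www.mediawiki.org/w/api.php"


def prefix_query_url(prefix, namespace=0, limit=50):
    """Build an Action API URL listing pages whose titles start with `prefix`."""
    params = {
        "action": "query",
        "list": "allpages",
        "apprefix": prefix,        # title prefix, without the namespace name
        "apnamespace": namespace,  # 0 is the main namespace
        "aplimit": limit,
        "format": "json",
    }
    return f"{API_ENDPOINT}?{urlencode(params)}"


def titles_from_response(data):
    """Extract page titles from a decoded list=allpages JSON response."""
    return [page["title"] for page in data.get("query", {}).get("allpages", [])]
```

Calling prefix_query_url("Continuous integration/") yields a URL that should return roughly the same set of pages as Special:PrefixIndex/Continuous_integration/. Like the special page, it captures only the page-hierarchy view of the collection.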

Page hierarchy is being used to create this collection of docs. However, other metadata reflects other collections or sub-collections for the same content, and some related content is missing from a collection based solely on Special:PrefixIndex/Continuous_integration/:

 * Some, but not all, of the docs are in Category:Testing
 * Some, but not all, of the docs are in Category:Continuous integration
 * Based on the code paths mentioned in Continuous integration/Parsoid, https://doc.wikimedia.org/Parsoid/master/tutorial-devsetup.html#!/guide/setup could be related documentation, along with Parsoid and https://wikitech.wikimedia.org/wiki/Parsoid.
 * Quibble documentation lives at https://doc.wikimedia.org/quibble/ and Continuous integration/Quibble. Hooray, there's a redirect from mw:Quibble to the CI doc, and the CI doc has a link to doc.wikimedia.org. Gold star!

Reflections about doc improvement process based on this case study:

 * The ambiguity around where to draw the boundaries of a collection makes it easy to get lost in the ocean of content. We need to help people draw boundaries in a way that is meaningful but doesn't require them to have an ontology of the entire tech stack. Outcome: Documentation/Toolkit/Collection audit/Survey.
 * Without some sort of thematic anchor for a set of docs, it's easy for content to be scattered across multiple repos, which leads to duplication and to findability and maintenance trouble. Because the concept of continuous integration overlaps with the technologies used to do CI, and with the infra components most impacted by it, there are some gaps in the documentation of each of those collections. This could be mostly solved with more intentional cross-referencing or tagging, or by focusing on one topic per doc and using transclusion if necessary.
 * For the benefit of readers and maintainers, it's useful to establish one repo as the canonical location for docs about a given topic, technology, or product. The most challenging question, and the one that leads to the patchwork of overlapping and disjoint content, is: what should those topics be? That is essentially the same question as "what are the collections?"
 * I don't think we should try to build an ontology, and I think focusing first on topic governance in this context is a trap (shocking, right?). As doc creators and maintainers, we need to be able to ask ourselves meaningful questions about what we want our readers to be able to understand and accomplish, then create the set of docs that enables that, and add some consistent metadata to those docs and to others that are related. We should do this without creating walls of links or decontextualized "see alsos", and without creating an overly controlled set of doc categories or landing pages.
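
The partial category coverage observed in this case study ("some, but not all, of the docs are in Category:Testing") could be checked with a small script. The following is a minimal sketch in Python, assuming the page titles and category members have already been retrieved. The function name is illustrative, and the category membership in the usage example is hypothetical sample data, not the actual state of the wiki.

```python
def membership_gaps(collection, categories):
    """For each category, list the collection pages that are not in it.

    collection: set of page titles in the prefix-based collection.
    categories: dict mapping a category name to the set of its member titles.
    """
    return {
        name: sorted(collection - members)
        for name, members in categories.items()
    }


# Hypothetical sample data for illustration only.
ci_docs = {
    "Continuous integration/Parsoid",
    "Continuous integration/Quibble",
}
category_members = {
    "Category:Testing": {"Continuous integration/Quibble"},
    "Category:Continuous integration": set(ci_docs),
}
gaps = membership_gaps(ci_docs, category_members)
```

With this sample data, gaps reports that Continuous integration/Parsoid is missing from Category:Testing and that nothing is missing from Category:Continuous integration. A report like this shows where tagging is inconsistent, but it cannot decide which pages belong together; that still requires human judgment about the scope of the collection.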

Case study #2: Research:Data and Data_dumps
TODO transfer content/findings

Experiment: listing categories and collections
As part of KR1:The Docs in FY 21-22, we identified collections and general categories for the first set of key docs for the Developer Portal. To understand more about the challenge of defining collections, I took each of those category and collection names and attempted to mold them into a standardized list. Note how the same collections appear on multiple rows: see the table in a spreadsheet.

How do we indicate and track which content is in which collections?
The doc review checklist for KR1:The Docs (FY 21/22) included the question: "How is the page linked to related pages? What categories does the page belong to? Is it part of a clearly defined collection of pages?" As documented above, this question can be complex to answer, because collections of pages are often not "clearly defined". To help people scope the set of docs they attempt to improve, it's important to help them understand and navigate all the different ways that pages can be related or form ad-hoc "collections".

What follows is a survey of various current practices that people use to create "collections" of technical content.

Document metadata
Metadata can be embedded in a page name or added to a page in a variety of ways. People often use these different types of metadata to create connections between related content:

Navigation templates or navboxes
Examples:


 * https://wikitech.wikimedia.org/wiki/Wikimedia_infrastructure
 * Template:Huggle/DocHeader
 * Template:Installation Guides
 * Template:OOUI
 * More examples

Some navigation templates group together pages that *mostly* share a hierarchy or namespace, but navigation templates usually include at least some pages that reside outside of the canonical page hierarchy or namespace for a given topic. Because namespaces exist to cover different facets of the same topic (like Help:, API:, and Manual:), navigation templates often serve to provide paths between these namespaces. For example, the VisualEditor navigation template includes links to pages that follow the page hierarchy pattern, but also links to pages outside that hierarchy.

Link hubs and lists of docs
This approach is similar to a navigation template, but the lists of docs are usually longer or specially formatted as a link hub or doc portal, and not transcluded anywhere -- they only live in this one location. Transcluding the pages automatically listed by Special:PrefixIndex makes it possible to auto-generate a list of subpages, but that can get noisy and doesn't list the content in a meaningful way. Long lists of links don't really help anyone.

Manually-curated examples:


 * https://meta.wikimedia.org/wiki/Small_wiki_toolkits
 * Developer hub
 * Developer portal - consider each site section as a "collection"
 * https://meta.wikimedia.org/wiki/Data_dumps
 * Wikibase
 * Product Analytics

Examples using Special:PrefixIndex:


 * Continuous integration
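
One possible way to make an auto-generated subpage list less noisy is to group the titles by their first path segment before displaying them. The following is a minimal sketch in Python, assuming a flat list of titles such as Special:PrefixIndex would produce. The function name and the sample subpage titles are hypothetical.

```python
from collections import defaultdict


def group_by_subtopic(titles, prefix):
    """Group subpage titles by the first path segment after `prefix`."""
    groups = defaultdict(list)
    for title in sorted(titles):
        if not title.startswith(prefix):
            continue  # ignore pages outside the prefix-based collection
        remainder = title[len(prefix):]
        subtopic = remainder.split("/", 1)[0]
        groups[subtopic].append(title)
    return dict(groups)


# Hypothetical titles for illustration only.
sample_titles = [
    "Continuous integration/Tutorials/Example A",
    "Continuous integration/Tutorials/Example B",
    "Continuous integration/Quibble",
    "Unrelated page",
]
grouped = group_by_subtopic(sample_titles, "Continuous integration/")
```

Grouping by path segment only reflects the page hierarchy, so it inherits whatever inconsistency exists in how subpages were named.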

Web domains

 * "We do separate doc locations a bit on audience, wikitech used for services, wikitech more internal in terms of audience, mw.org very broad scope"
 * MediaWiki vs. Wikitech vs. Meta...
 * "1 wiki for promoting and supporting MediaWiki software and its API to sysadmins, developers and end users.
 * 1 wiki for promoting and coordinating the open source development of MediaWiki based software projects and Wikimedia technical initiatives."
 * "Following is a list of the main places to find Wikimedia technical documentation.
 * MediaWiki - MediaWiki software documentation and technical documentation for many other Wikimedia technology projects. This is the default space for publishing technical documents about Wikimedia technology.
 * Wikitech - Technical documentation for Wikimedia Foundation infrastructure and services. This includes production clusters, Wikimedia Cloud Services, Toolforge hosting, the Beta Cluster, and other data services.
 * Wikidata - Technical documentation related to the Wikidata project, particularly the Tools page.
 * PAWS - Documentation about PAWS, Wikimedia's hosted Jupyter notebooks instance. Notebooks are frequently used to create tutorials and documentation for Wikimedia technology.
 * Phabricator - Phabricator is a collaboration tool that is used by the Wikimedia technical community for task and bug management. You can find many issues and software projects documented here. Use best practices for software documentation when filing tasks and interacting in this space."

Defining collections

 * It's possible to define a set of high-level categories that cover the range of Wikimedia technical documentation, but the utility of doing so is dubious. The categories are so broad that their main utility would be to provide a landing page that guides readers to more specific collections.  They could also be useful to enable Tech leads to understand the broad scope of all the documentation their departments should (perhaps) be maintaining.
 * Collections can belong to more than one category.
 * Collections have different organizing themes, but most of them are organized around a specific technology, platform, process, or team. In practice, it's hard to figure out if a collection should be organized around a specific technology (like "Jenkins"), or the system/process that technology is part of ("Continuous integration", "MediaWiki testing"), or the team that owns or maintains those technologies and processes ("Release engineering").
 * Whatever we decide about collections, this is the key hurdle that we need to remove in order for people to be able to scope their doc improvements and assess docs as a corpus.
 * Therefore: focusing on improving stewardship of the large mass of collections is more useful than focusing on high-level categories.
 * Individual documents can be part of more than one collection. It follows that few, if any, collections can ever be clearly defined. We have limited mechanisms on-wiki to represent these complex one-to-many relationships between documents and collections, and collections and high-level categories. This is part of why we end up with long lists of see-also and nav boxes that bring together scattered docs.
 * Therefore: the best way to improve stewardship of "collections" is to work with teams and project owners to help them identify the docs that they should consider "in scope" for their work. User:KBach-WMF/Sandbox/PywikibotCollection is one example of this sort of sense-making, doc-gathering effort guided by tech writer expertise. The collection assessment guide that came out of this quarter's work includes some techniques for how to identify docs that are part of a collection, but these strategies are likely to be difficult for teams to apply because so much of the process changes based on context.
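
The relationships between documents and collections described above, where one document can belong to several collections, can be modeled as a simple two-way index. The following is a minimal sketch in Python. The function names are illustrative, and the assignment of pages to collections in the usage example is hypothetical, chosen only to mirror the "Jenkins" versus "Continuous integration" versus "Release engineering" ambiguity discussed in this section.

```python
from collections import defaultdict


def invert(doc_collections):
    """Turn a doc -> collections mapping into a collection -> docs mapping."""
    index = defaultdict(set)
    for doc, collections in doc_collections.items():
        for name in collections:
            index[name].add(doc)
    return dict(index)


def overlap(index, first, second):
    """Return the docs that two collections have in common."""
    return index.get(first, set()) & index.get(second, set())


# Hypothetical assignments for illustration only.
doc_collections = {
    "Continuous integration/Jenkins": {
        "Continuous integration", "Jenkins", "Release engineering",
    },
    "Continuous integration/Quibble": {"Continuous integration", "Quibble"},
}
index = invert(doc_collections)
shared = overlap(index, "Continuous integration", "Jenkins")
```

On-wiki mechanisms such as categories, navboxes, and page hierarchy each store only part of this index, which is one reason the same docs end up gathered again in see-also lists.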

Indicating and tracking collection membership
Ranking the methods of indicating collection membership, in order of decreasing reliability, consistency, and sustainability based on my research:


 * Doc metadata: Page hierarchy: The most ubiquitous tactic people use to relate pages to each other -- though perhaps they do this unintentionally ("this is just where we always put a new doc") and create collections that aren't really related by any clear or useful concept.
 * Navigation templates: Useful in that they're an identifiable part of a page or set of pages, unlike a plain-wikitext list of links that is indistinguishable from other page content. Easily maintained, so often more up to date than other methods of relating pages to each other.
 * Categories: Underutilized on our technical wikis, probably because the only way to browse categories is alphabetically, and that severely limits the utility of category landing pages in a non-encyclopedic context. There's potential here to use categories to provide better guided paths into more specific content collections, but figuring that out requires more content strategy work and analysis. It is more important to work on improving the individual pieces of content and the collections that direct people to them, because that is where the most essential (and incorrect) information resides.
 * Doc metadata: Namespaces: Troubling inconsistency across wikis. Different namespaces exist on different wikis, and even the same namespaces are used inconsistently across and within wikis.  Overall, namespaces don't seem to correspond to an organizational concept that is universally very useful.
 * Web domains: It would be useful to separate major documentation by audience (e.g., Wikitech is for x, MediaWiki is for y), but we should enforce that more officially and more clearly. Right now there's a lot of content that lives on one wiki but could be relevant for readers who expect to find it on a different wiki. I also think we should avoid putting documentation in PAWS notebooks because they're not easily searchable and thus are harder to maintain.
 * Link hubs / lists of docs: Avoid! These are prime candidates for asking hard questions about the scope of a collection, and then putting the useful content in the appropriate page hierarchy, adding it to a navbox, or archiving it.
 * Doc metadata: content model: This is like a special wiki flavor of doc types. Not super relevant for this investigation.

Conclusions & recommendations

 * Use Categories to create collections of docs that are related through very broad concepts that include multiple teams and technologies, like "Testing".
 * This is similar to the concept of "cross-collection landing pages" from KR1:The Docs.
 * All categories should have overview docs that orient the reader to the technical landscape at a high level, with guidance to get readers to sub-collection landing pages.
 * Individual teams and project owners should not feel required to try to maintain docs at this level -- this is cross-team work that should be coordinated by technical writers in partnership with team leads, directors, and/or working groups.
 * Use page hierarchy to create collections of docs that are related through more specific concepts, like software projects, and complex technical systems with multiple sub-components (like "Continuous integration"). These pages should link to code repos (and docs within them) where applicable.
 * This aligns most closely with the concept of "collections" from KR1:The Docs.
 * These collections should have overview docs and task-focused docs.
 * This is the level at which individual teams and project owners should manage, maintain, and improve sets of docs. Doing this collection-level work usually requires some tech writer guidance on strategy and tactics, because these vary significantly by the context, history, and unique challenges of each set of docs.
 * Everyone should be encouraged to improve and help maintain docs at this level, but doing so requires more communication and collaboration to align contributions with the overall structure and content strategy for the collection.
 * Put docs that relate to specific technical components in the repos where the code lives. Docs that live with code should contain pointers to on-wiki collection landing pages, which provide the higher-level context for the code, along with connections to other relevant content.
 * These collections should have reference docs and task-focused docs. They may have overview docs, but that depends on the topic and on what is already covered on-wiki (and linked to from these docs-with-the-code).
 * Everyone should be encouraged and empowered to improve and help maintain docs at this level.


 * Technical writers should engage directly with developers and subject matter experts to help identify the boundaries of their collections and provide guidance on applying useful page structure and document metadata to relate docs to each other and to other collections in useful and meaningful ways.

Key artifacts

 * Experimental draft of high-level categories that we could use to curate technical documentation collections into major topic areas. Recommendations about how / whether to apply these to doc work will follow, based on discussion of these conclusions and recommendations.
 * New process and general guidance for how to identify, assess, and improve docs at the collection level: Documentation/Toolkit/Collection_audit.
 * Q2 tech writer strategy focused on targeted collaboration with maintainers of specific collections (currently a Google doc, will be on-wiki later).