Parsing/Notes/Wikitext 2.0/Typed Templates

From mediawiki.org

Introduction

This is a more specialized proposal that follows on from the wikitext 2.0 note, which proposes typed wikitext constructs.

In the wikitext 2.0 note, templates provide a type that specifies how a transclusion should be processed / handled in the context of the page within which it is transcluded. Some possible types could be string, number, CSS, attribute, array of attributes, inline HTML, block HTML. In a fully fleshed out proposal, we would probably have a richer set of types. The wikitext 2.0 note doesn't propose a specific mechanism for declaring these types. The balanced templates proposal identifies parser functions as one such mechanism.

In this document, we extend the notion of template types to something richer. At the 2017 Dev Summit, we talked about the future of wikitext and typed wikitext was one of the proposals. In there, we briefly alluded to but didn't really expand on this richer notion of types for templates. That elaboration is the goal of this document.

Problem Statement

Over time, editors on Wikimedia wikis have built up a rich template system. There are templates for infoboxes (with specialized infoboxes for different domains), navboxes, portals, charts, data tables, maintenance templates, administration templates, templates for citations, templates that identify stubs, and so on. There is some benefit in thinking of how templates represent domain-specific types in the context of an encyclopedia.

Multiple in-template mechanisms

In recent times, there have been different mechanisms explored for different purposes and projects.

  • There is the TemplateData extension, which lets editors tell VisualEditor, Parsoid, and other tools how transclusions of a template should be edited and how they should be serialized. As an auxiliary benefit, it serves as machine-readable / processable documentation of a template's parameters. HOW: Add a <templatedata> tag inside the template source. STATUS: Implemented
  • There is the TemplateStyles extension, which lets editors specify how the output of a template should be styled. HOW: Add a <templatestyles src="page-title-here"> tag inside the template source. STATUS: Implemented
  • There is the Balanced Templates proposal which lets a template opt into specific kinds of balanced output mechanisms. HOW: Add a {{#balance:<type>}} parser function inside the template source. STATUS: Proposal

It is possible to imagine a future application where there might be benefits to a template specifying custom JavaScript resources or other sources that govern the behavior of that template (ex: animations?). That would require its own mechanism.

Need for cross-wiki meta information encoded in template names

Besides the above, there are related information analysis problems that require understanding template semantics and what they represent.

  • Content Translation lets editors translate a page from one language wiki to another language wiki. This requires that transclusions from the source wiki be adapted to use similar templates on the target wiki. As can be imagined, there is no ready-made solution for doing this. Content Translation provides its own set of mechanisms to do this. But, in an ideal world, there would be a mechanism where knowledge domain concepts from the source wiki could be mapped to identical concepts on the target wiki. This would require both wikis to reference an underlying global concept domain.
  • Mobile Content Service would like to extract conceptual information from wikis that corresponds to media / information on those pages that might have customized presentations / renderings on the mobile platform. For example, information about pronunciations, co-ordinates, hatnotes, etc. Typically, this kind of information is represented via templates. As with Content Translation, there is no standardized mechanism to extract these specialized concepts from all wikis. They are likely to use a bunch of heuristics (and hacks) to extract this information for their application. They would likewise benefit if these templates had a mapping to a concept database / taxonomy in an underlying global concept domain.
  • Research projects like stub expansion have to grapple with questions about identifying stubs reliably across wikis. Once again, like the two previous analysis projects, they would benefit from having a mapping to a concept database / taxonomy in an underlying global concept domain.
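The shared need in these three cases can be illustrated with a small sketch. It assumes a hypothetical registry that maps (wiki, template name) pairs to identifiers in a shared concept vocabulary. The wiki names, template names, and concept identifiers below are all made up for illustration; none of them refer to real templates, Wikidata items, or schema.org terms.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical registry: (wiki, template name) -> shared concept identifier.
# Every entry is a placeholder invented for this sketch.
CONCEPT_OF: Dict[Tuple[str, str], str] = {
    ("wiki-a", "Coordinates"): "concept:coordinates",
    ("wiki-b", "Koordinaten"): "concept:coordinates",
    ("wiki-a", "Stub"): "concept:stub",
    ("wiki-b", "Entwurf"): "concept:stub",
    ("wiki-a", "Pronunciation"): "concept:pronunciation",
}


def concept_for(wiki: str, template: str) -> Optional[str]:
    """Look up the shared concept a template represents, if one is registered."""
    return CONCEPT_OF.get((wiki, template))


def templates_for_concept(wiki: str, concept: str) -> List[str]:
    """List the templates on one wiki that map to a given shared concept."""
    return sorted(
        name for (w, name), c in CONCEPT_OF.items() if w == wiki and c == concept
    )


def equivalent_templates(wiki: str, template: str, target_wiki: str) -> List[str]:
    """Find templates on the target wiki that share the source template's concept."""
    concept = concept_for(wiki, template)
    if concept is None:
        return []  # no mapping registered: callers fall back to heuristics
    return templates_for_concept(target_wiki, concept)
```

With such a registry, a translation tool could call `equivalent_templates("wiki-a", "Coordinates", "wiki-b")`, a content service could call `concept_for` on each transclusion it encounters, and a stub-analysis project could call `templates_for_concept(wiki, "concept:stub")` for every wiki it studies.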

Potential partial solutions

In all these analysis scenarios (and possibly others that I am not aware of), the common thread is the ability to map templates to some shared / global taxonomy or concept map or what-have-you. For example, enwiki relies on category pages (this and this) for representing this meta-level information. But, this is definitely not a global / shared taxonomy. Besides this, one can imagine using wikidata for representing and leveraging this shared taxonomy. The other suggested option is to use schema.org (with extensions specific to our wikis). Of course, if editors choose to, they could add pointers to both schemas. In any case, the point is to pick a global concept vocabulary, whether wikidata, schema.org, or something else.

Besides these two ideas, there is also the proposal for shadow namespaces that can partially alleviate this problem in a limited manner by using the same template across wikis.

Typing Proposal

Preamble

In light of the previous discussion, the high-level conceptual proposal is to think of templates as having rich type definitions that unify the disparate mechanisms and needs in a centralized place. Much like in programming languages, where type definitions encapsulate and communicate information about the underlying domain, it is useful to think of templates providing type definitions. These type definitions provide rich information to existing tools and to other humans, and let us build newer tools that leverage that information.

This is simply a generalization of existing proposals and ideas from two different directions. On one hand, this is a generalization of existing mechanisms and existing work as elaborated in the problem statement. On the other hand, this is also a generalization of the typed wikitext idea that originated from thinking about wikitext as a typed documentation specification language that enables efficient processing for tools and enables error-free writing and improved reasoning abilities for humans.

Strawman sketch

A template type would probably be some representation of this form.

{
    /* required */
    name: "...",

    /* ------ how should output be processed? ------ */
    output: enumerated type from { number, string, css value, key=value, array of key=value, inline html, block html, .. },

    /* ------ "semantic" HTML output (if required) ------ */
    output_wrapper_elt: "..", /* ex: infobox */ /* TODO: Link to phab task */

    /* ------ existing use (VE, Parsoid, etc) ------ */
    template_data: <pointer-to-template-data-info>, /* could be inlined later on perhaps */

    /* ------ template styles ------ */
    styles: {
        desktop: <namespace-qualified-page-title>,
        mobile: <namespace-qualified-page-title>,
        ...
    },

    /* ------ future use case? ------ */
    js: {
        desktop: ...,
        mobile: ...,
    },

    /* ------ Semantic Info ------ */
    domain_info: { /* pointers to a shared space as long as all wikis reference that shared taxonomy */
        category: ..., /* could be an enumerated type or could be a wikidata id */
        concept_id: ... /* assuming shared domain information is captured in wikidata id */
    }
}
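
One way to make the strawman concrete is to write it down as a typed record. The following is a minimal sketch in Python, not a proposed storage format: the class names (`OutputKind`, `DomainInfo`, `TemplateType`), the choice of optional fields, and the example values are assumptions made for illustration. Only the field names and the enumerated output kinds are taken from the sketch above.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, Optional


class OutputKind(Enum):
    """Enumerated output kinds listed in the strawman (the list is open-ended)."""
    NUMBER = "number"
    STRING = "string"
    CSS_VALUE = "css value"
    KEY_VALUE = "key=value"
    KEY_VALUE_ARRAY = "array of key=value"
    INLINE_HTML = "inline html"
    BLOCK_HTML = "block html"


@dataclass
class DomainInfo:
    """Pointers into a shared taxonomy; both fields are optional in this sketch."""
    category: Optional[str] = None
    concept_id: Optional[str] = None


@dataclass
class TemplateType:
    name: str                                   # the only required field
    output: Optional[OutputKind] = None         # how the output is processed
    output_wrapper_elt: Optional[str] = None    # "semantic" HTML wrapper
    template_data: Optional[str] = None         # pointer to TemplateData info
    styles: Dict[str, str] = field(default_factory=dict)  # platform -> page title
    js: Dict[str, str] = field(default_factory=dict)      # possible future use
    domain_info: DomainInfo = field(default_factory=DomainInfo)

    def __post_init__(self) -> None:
        if not self.name:
            raise ValueError("a template type needs a non-empty name")


# Hypothetical example; the template name and page title are made up.
infobox = TemplateType(
    name="Infobox example",
    output=OutputKind.BLOCK_HTML,
    output_wrapper_elt="infobox",
    styles={"desktop": "Template:Infobox example/styles.css"},
)
```

Everything except `name` defaults to empty, which mirrors the opt-in character of the existing mechanisms: a template can declare as much or as little as its editors know.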

Open questions

  • Where is this type information stored? Maybe in-template-source (like templatedata) or it could be part of its own namespace with some way of associating a type with a template. Maybe the TemplateData format itself could evolve to incorporate this?
  • How will the shared concept vocabulary be developed?
  • ...
  • ...

Discussion

Types in programming languages

User-defined types in high-level languages provide these (among potentially other) benefits:

  1. Abstraction: Types help programmers abstract domain-specific information and manipulate that information conceptually, structurally, or however they wish.
  2. Reasoning: In turn, abstraction through types enables programmers to reason about what the code is doing, and also serves as a way to communicate with others who might work on the same piece of code.
  3. Tooling: Types also communicate information to automated tools. For example, in the traditional programming language domain, types aid IDEs, interpreters, compilers and other code analysis tools. Performance optimizations benefit from type information.

The same benefits accrue to wikitext as a language if we consolidate relevant template meta information in central type definition structures.

Typed templates

This typing system is meant to be a somewhat transparent system on top of templates and wikitext. Just like templatedata and templatestyles, this is meant to be an opt-in mechanism. But, the more information there is, the more benefits. Eventually, for top-level templates (those that are used in non-template namespaces), the expectation is that types will be provided for all templates with the output field becoming required. This corresponds to the migration of wikitext towards well-balanced templates being the default mechanism.
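
The eventual requirement described above could be checked by a simple audit. The sketch below assumes type definitions are available as plain dictionaries keyed by template name, and that the set of top-level templates (those used outside template namespaces) is already known. The function name and both data shapes are assumptions for illustration.

```python
from typing import Dict, List, Set


def missing_output_types(type_defs: Dict[str, dict], top_level: Set[str]) -> List[str]:
    """Return top-level templates that have no type, or a type without `output`."""
    return sorted(
        name
        for name in top_level
        if not type_defs.get(name, {}).get("output")
    )


# Hypothetical data: template names and definitions are made up.
type_defs = {
    "Infobox example": {"name": "Infobox example", "output": "block html"},
    "Citation example": {"name": "Citation example"},   # opted in, no output yet
    "Helper example": {"name": "Helper example"},       # only used by other templates
}
top_level = {"Infobox example", "Citation example", "Untyped example"}

report = missing_output_types(type_defs, top_level)
```

Templates that are only used inside other templates never appear in the report, so the check stays opt-in for helpers while flagging the top-level templates that still need an output type.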

One of the biggest advantages of this consolidated typing mechanism is that the type definitions are extensible without requiring code changes. So, types can be extended / evolved in sync with the tools and utilities that need that information, without requiring changes to the core code. More importantly, this lets template editors and tooling communities develop them independently of the software.

In programming languages, some languages require explicit types. Others provide type inference mechanisms without explicit types. In the context of wikitext and templates, we will probably have explicit type declarations. So, the proposal is for this to be a system with explicit types, but where automated tools analyzing the corpus could seed some aspects of the type definition (similar to type-feedback-based / profile-based recompilation strategies for traditional programming languages). So, this system allows for a collaboration between human-curated type information and automated "type inference" tools.
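
As a rough illustration of that collaboration, the sketch below seeds a missing `output` field from sample expansions of a template while never overriding what a human has declared. The inference rule is deliberately naive, and the function names and the `output_source` marker are assumptions introduced for this example.

```python
from typing import Dict, List, Optional


def _is_number(text: str) -> bool:
    """True if the text parses as a float (Python's float also accepts 'nan', 'inf')."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def infer_output(samples: List[str]) -> Optional[str]:
    """Guess an output kind from sample expansions; return None when unsure."""
    if not samples:
        return None
    if all(_is_number(s.strip()) for s in samples):
        return "number"
    if any("<" in s for s in samples):
        # Markup is present, but inline vs block HTML needs a human decision.
        return None
    return "string"


def seed_type(curated: Dict[str, str], samples: List[str]) -> Dict[str, str]:
    """Fill in `output` from samples only if the curated definition lacks it."""
    seeded = dict(curated)
    if "output" not in seeded:
        guess = infer_output(samples)
        if guess is not None:
            seeded["output"] = guess
            seeded["output_source"] = "inferred"  # flag the guess for human review
    return seeded
```

Marking inferred fields keeps the two sources distinguishable, so editors can confirm or correct a seeded value later.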

Related documents