Parsoid/Internals/data-parsoid

From mediawiki.org

Temporarily in data-parsoid, but not in final DOM output

tsr: token source range, which gives the tag widths for all tokens (set by the tokenizer)

extTagWidths: widths of the opening and closing tags of extension tags. Ex: <ref ...>...</ref>, <gallery ...>...</gallery>

Proposal: Make these temporary properties used till we lint the HTML (instead of emitting to final DOM output)

autoInsertedStart: whether this start HTML tag has no corresponding wikitext and was auto-inserted to generate well-formed HTML. This usually happens when the treebuilder fixes up badly nested HTML.

autoInsertedEnd: whether this end HTML tag has no corresponding wikitext and was auto-inserted to generate well-formed HTML. Ex: <tr>, <th>, <td>, <li>, etc. that have no explicit closing markup, or HTML tags that were never closed.

Proposal: Remove from data-parsoid and rely on selser to preserve syntax variations

selfClose: whether a void tag was written self-closed (Ex: <br> vs <br />)

noClose: void tags that are not self-closed (ex: <br>)

brokenHTMLTag: used to round-trip tags such as </br>, <br/  >, or <hr/  >

srcTagName: source tag name (records case variations) for HTML tags. Ex: <div> vs <DiV> vs <DIV>

startTagSrc, endTagSrc, attrSepSrc: source text of the start tag, end tag, and attribute-text separator (used in table wikitext). Ex:

  • |foo || bar
  • |foo {{!}}{{!}}bar
  • {{!}} foo
  • |style='color:red;'{{!}}foo || bar

pipetrick: true if the link used the pipe trick, as in [[Foo|]]. (NOTE: This will likely be removed soon. The pipe trick is a pre-save transformation, so it should not show up in saved wikitext.)

Proposal: Maybe move to data-mw?

stx_v: "row" - set for td/th cells that show up on the same line. Ex: |foo ||bar ||baz (Maybe use stx: for this as well)

stx:

  • "html" - set for HTML tags. Ex: <div>foo</div>
  • "row" - set for dt/dd that show up on the same line. Ex: ";a:b" vs ";a\n:b"
  • "piped" - set for piped wikilinks with explicit content. Ex: [[Foo|bar]] vs [[Foo]]
  • "magiclink" - set for magic links (RFC/PMID/ISBN). Ex: RFC 1234, ISBN 1234567890 (Not needed anymore?)
  • "url" - set for URL links. Ex: http://google.com (Not needed anymore?)
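
As a rough illustration of how a serializer could use the stx hint, the sketch below chooses between the two definition-list forms from the "row" example above. The function name serialize_definition and the idea of passing the dd node's data-parsoid as a plain dict are assumptions made for illustration; this is not Parsoid's actual API.

```python
def serialize_definition(term, description, data_parsoid):
    """Serialize a dt/dd pair to wikitext, honoring the stx hint.

    Hypothetical helper: `data_parsoid` stands in for the parsed
    data-parsoid object of the dd node.
    """
    if data_parsoid.get("stx") == "row":
        # Term and description were on the same line in the source.
        return ";" + term + ":" + description
    # Otherwise emit the description on its own line.
    return ";" + term + "\n:" + description
```

With {"stx": "row"} this yields ;a:b, and without the hint it yields the two-line form.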

CSA: possibly for the future: add an "stx_v" option to list items with an intervening double newline (useful for talk page comments)

Required properties

dsr: DOM source range. The wikitext source range that generated this DOM node, given as [start-offset, end-offset, start-tag-width, end-tag-width].

Consider the input wikitext abcdef ''foo'' something else. Let us look at the ''foo'' part of the input. It generates <i data-parsoid='{"dsr":[7,14,2,2]}'>foo</i>. The dsr property of this i-tag's data-parsoid attribute tells us the following: the HTML node maps to the input wikitext substring from offset 7 up to (but not including) offset 14; the opening tag <i> was 2 characters wide in wikitext, and the closing tag </i> was also 2 characters wide.
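
The offsets in this example can be checked with a few lines of Python. This is only a sketch: the helper name split_by_dsr is made up for illustration, and it assumes offsets count characters of the wikitext string with an exclusive end offset.

```python
def split_by_dsr(wikitext, dsr):
    """Return (source, open_tag, content, close_tag) for a dsr array.

    Hypothetical helper; dsr is [start, end, open_width, close_width].
    """
    start, end, open_width, close_width = dsr
    source = wikitext[start:end]
    if open_width is None or close_width is None:
        # Tag widths may be null; only the full range is known then.
        return source, None, None, None
    open_tag = wikitext[start:start + open_width]
    content = wikitext[start + open_width:end - close_width]
    close_tag = wikitext[end - close_width:end]
    return source, open_tag, content, close_tag


parts = split_by_dsr("abcdef ''foo'' something else", [7, 14, 2, 2])
```

Here parts holds the full source ''foo'', the opening '', the content foo, and the closing ''.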

src: used to emit original wikitext in some scenarios (entities, placeholder spans)

tail: link trail source (Ex: the "l" in [[Foo]]l)

prefix: link prefix source

Other properties

a and sa: used when an attribute's source and rendering differ, in which case "a" contains the rendered attribute and "sa" contains the source attribute. When transforming HTML back to wikitext, "a" is used to check whether the content of the attribute has been modified; if it has not, "sa" is used to reserialize it exactly as it was in the original wikitext (avoiding a dirty diff).

Example: [[%23%3c]][[%23%3e]] gets rendered to

<p><a rel="mw:WikiLink" href="./Main_Page#&lt;" title="Main Page" data-parsoid='{"stx":"simple","a":{"href":"./Main_Page#&lt;"},"sa":{"href":"%23%3c"}}'>#&lt;</a><a rel="mw:WikiLink" href="./Main_Page#>" title="Main Page" data-parsoid='{"stx":"simple","a":{"href":"./Main_Page#>"},"sa":{"href":"%23%3e"}}'>#></a></p>
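
The check described above can be sketched as follows. The function name original_attribute_source and the convention of returning None for "modified, serialize from scratch" are assumptions for illustration, and the values are assumed to be already HTML-decoded (so &lt; appears as <).

```python
def original_attribute_source(name, current_value, data_parsoid):
    """Return the original wikitext source for an attribute, or None.

    Hypothetical helper: None means the attribute was modified (or no
    a/sa pair was recorded), so the caller must serialize the current
    value itself.
    """
    rendered = data_parsoid.get("a", {})
    source = data_parsoid.get("sa", {})
    if name in rendered and name in source and rendered[name] == current_value:
        # Unmodified: reuse the original source to avoid a dirty diff.
        return source[name]
    return None


dp = {"stx": "simple", "a": {"href": "./Main_Page#<"}, "sa": {"href": "%23%3c"}}
unchanged = original_attribute_source("href", "./Main_Page#<", dp)
edited = original_attribute_source("href", "./Other_Page", dp)
```

For the first link in the example, unchanged is the original %23%3c, while edited is None because the href no longer matches "a".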


pi: stands for "parameter info". When processing a template, this property records the names of the parameters and whether each one is named; it also records the whitespace surrounding each parameter name and value, if any. This is used when transforming HTML back to wikitext: we want to keep the parameter order, names and spacing to avoid dirty diffs. This works together with the params property in data-mw: pi stores the non-semantic information used to choose a specific formatting of the parameters, and params stores the semantic information needed for editing parameters.

Example:

{{1x|bar | 3 =bat|baz=quux }}

gets rendered to

<p data-parsoid='{"dsr":[0,29,0,0]}'><span about="#mwt1" typeof="mw:Transclusion" data-parsoid='{"pi":[[{"k":"1"},{"k":"3","named":true,"spc":[" "," ","",""]},{"k":"baz","named":true,"spc":["","",""," "]}]],"dsr":[0,29,null,null]}' data-mw='{"parts":[{"template":{"target":{"wt":"1x","href":"./Template:1x"},"params":{"1":{"wt":"bar "},"3":{"wt":"bat"},"baz":{"wt":"quux"}},"i":0}}]}'>bar </span></p>

The spc property of each pi array element is a 4-element array that captures the whitespace seen around the key=value pair in the transclusion. Elements 0 and 1 are the spaces before and after the key. Elements 2 and 3 are the spaces before and after the value.
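
To see how pi and params fit together, the sketch below rebuilds the example transclusion from the two attributes. It is a simplified illustration, not Parsoid's serializer: the function name serialize_transclusion is hypothetical, and it assumes that positional parameters are emitted verbatim, that a named parameter without spc gets no extra whitespace, and that the outer pi array holds one list per template part.

```python
def serialize_transclusion(target, params, param_infos):
    """Rebuild template wikitext from data-mw params and one pi list.

    Hypothetical helper. `params` maps keys to {"wt": ...} and
    `param_infos` is the pi list for the same template part.
    """
    parts = [target]
    for info in param_infos:
        key = info["k"]
        value = params[key]["wt"]
        if info.get("named"):
            # spc: [before key, after key, before value, after value]
            before_key, after_key, before_val, after_val = info.get(
                "spc", ["", "", "", ""]
            )
            parts.append(
                before_key + key + after_key + "=" + before_val + value + after_val
            )
        else:
            # Positional parameter: the value is used as-is.
            parts.append(value)
    return "{{" + "|".join(parts) + "}}"


pi = [[
    {"k": "1"},
    {"k": "3", "named": True, "spc": [" ", " ", "", ""]},
    {"k": "baz", "named": True, "spc": ["", "", "", " "]},
]]
params = {"1": {"wt": "bar "}, "3": {"wt": "bat"}, "baz": {"wt": "quux"}}
wikitext = serialize_transclusion("1x", params, pi[0])
```

For the example data this reproduces the original source, {{1x|bar | 3 =bat|baz=quux }}, which is 29 characters long and so agrees with the dsr of [0,29,...] shown above.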