RemexHtml

Introduction
RemexHtml is a parser for HTML5, written in PHP.

RemexHtml aims to be:


 * Modular and flexible.
 * Fast, as opposed to elegant. For example, we sometimes use direct member access instead of accessors, and hand-code performance-sensitive paths.
 * Robust: the aim is to rule out pathological worst-case performance.

RemexHtml contains the following modules:


 * A compliant preprocessor and tokenizer. This generates a token event stream.
 * Compliant tree construction, including error recovery. This generates a tree mutation event stream.
 * A fast integrated HTML serializer, compliant with the HTML fragment serialization algorithm.
 * DOMDocument construction.

RemexHtml currently lacks:


 * Encoding support. The input is assumed to be valid UTF-8.
 * Scripting.
 * XHTML serialization.
 * Precise compliance with the parse errors that the specification says should be generated.

RemexHtml aims to be compliant with the W3C recommendation HTML 5.1, with a few exceptions for minor backported bug fixes. We chose to implement the W3C standard rather than the latest WHATWG draft because we value stability over feature completeness.

RemexHtml passes all html5lib tests, except for parse error counts and tests that reference a future version of the standard.

In MediaWiki
RemexHtml has been available in MediaWiki as a core composer dependency since MediaWiki 1.29. Its initial use case was as a replacement for HTML Tidy. Output from the wikitext parser is fed into RemexHtml's HTML parser and cleaned up per the HTML 5 tag soup specification. The Tokenizer component is now also used for tag stripping in Sanitizer.

It is also used for HTML postprocessing in some MediaWiki extensions.



Outside MediaWiki
Install the wikimedia/remex-html package from Packagist:

composer require wikimedia/remex-html

Semantic versioning is used. The major version number will be incremented for every change that breaks backwards compatibility.



Architecture overview
For full reference documentation, please see the documentation generated from the source (or the source itself)


 * Generated API documentation

RemexHtml uses a pipeline model. Each event producer calls the attached callback object when it has an event ready to produce. The pipeline stages are:


 * Tokenizer: Produces a stream of tokens from HTML. Performs tokenization, as described by the tokenization chapter in the HTML specification.
 * Dispatcher: Tracks the insertion mode, and relays token events to the handler specific to the current insertion mode. Each insertion mode has its own class, with methods for each of the token types.
 * TreeBuilder: A helper class for the insertion modes. It tracks the state of the tree construction process, receives requests for tree mutation from the insertion mode classes, and dispatches tree mutation events.

In the HTML specification, the tree construction algorithm is imagined as being tightly integrated with creation of a DOM data structure. A major innovation of RemexHtml is to separate tree construction into a phase which generates a tree mutation event stream, and a phase which actually produces the data structure. RemexHtml is able to directly serialize the tree mutation event stream, without needing to store the whole DOM in memory.


 * Serializer: Produces HTML from a tree mutation event stream.
 * DOMBuilder: Produces a native PHP DOMDocument from a tree mutation event stream.

When Serializer is used, there is a final pipeline stage:


 * Formatter: The Formatter interface converts SerializerNode objects to strings. It is a helper for Serializer which allows details of the produced HTML to be easily customised. Serializer is complex and stateful, whereas Formatter subclasses are generally stateless, except for configuration.

RemexHtml also provides:


 * DOMSerializer: A utility class to serialize a DOM contained within a DOMBuilder, with an interface similar to Serializer.
 * PropGuard: Many RemexHtml classes use the PropGuard trait, which prevents accidental assignment of undeclared properties. This helps detect developer confusion over class types. If there is a pressing need to use undeclared properties in your application, PropGuard can be globally disabled using PropGuard::$armed = false.
 * TokenGenerator: A class which provides a token stream via a generator interface, instead of an event stream. It constructs its own Tokenizer. Consuming token events in this way is less efficient, but may be more convenient for some use cases.

There are optional pipeline stages providing debugging facilities:


 * DispatchTracer: This class sits between Tokenizer and Dispatcher. It reports all token events, and reports insertion mode transitions within Dispatcher. Log messages are sent to a callback function.
 * TreeMutationTracer: This forwards tree mutation events coming from TreeBuilder, and reports such events to a callback.
 * DestructTracer: This class forwards tree mutation events, and reports when the Element object emitted by TreeBuilder is destroyed. This helps to identify memory leaks.

RemexHtml's model of a configurable pipeline provides a great deal of flexibility. Applications may subclass pipeline classes provided by RemexHtml, or write their own from scratch, implementing the relevant event receiver interface. Or they may interpose custom pipeline stages in between RemexHtml's standard stages.

However, for simple use cases, there is a fair amount of boilerplate. T217850 proposes to add a simplified method for constructing a standard pipeline, but this has not yet been implemented.

Examples


Constructing a DOM from input text
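The following is a minimal sketch of a pipeline that ends in a DOMBuilder. It assumes the wikimedia/remex-html package is installed, that Composer's autoloader is already loaded, and that the classes live under the Wikimedia\RemexHtml namespace (older releases may use a different namespace). The function name parseHtmlToDom is illustrative only.

```php
<?php
// Assumes Composer's autoloader has already been loaded,
// e.g. via require 'vendor/autoload.php';
use Wikimedia\RemexHtml\DOM\DOMBuilder;
use Wikimedia\RemexHtml\Tokenizer\Tokenizer;
use Wikimedia\RemexHtml\TreeBuilder\Dispatcher;
use Wikimedia\RemexHtml\TreeBuilder\TreeBuilder;

// Hypothetical helper: parse an HTML string and return the resulting DOM node.
function parseHtmlToDom( $text ) {
	// Last stage first: the DOMBuilder consumes tree mutation events.
	$domBuilder = new DOMBuilder;
	// The TreeBuilder emits tree mutation events to the DOMBuilder.
	$treeBuilder = new TreeBuilder( $domBuilder );
	// The Dispatcher relays token events to the current insertion mode.
	$dispatcher = new Dispatcher( $treeBuilder );
	// The Tokenizer turns the input text into token events.
	$tokenizer = new Tokenizer( $dispatcher, $text );
	// Run the whole input through the pipeline.
	$tokenizer->execute();
	return $domBuilder->getFragment();
}
```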
In the above code sample, the pipeline is constructed backwards, from end to start. The constructor of each pipeline stage receives the following pipeline stage. Then, with the pipeline fully constructed, $tokenizer->execute() causes the whole input text to be parsed and emitted through the pipeline, eventually reaching the DOMBuilder. After execution, the constructed document is available via $domBuilder->getFragment().



Modifying link targets
This example modifies an HTML document on the fly, altering href attributes inside <a> tags and returning an HTML string. It does this by subclassing HtmlFormatter, which is a relatively easy hook point into reserialization. It clones the SerializerNode and Attributes objects to avoid altering the document as seen by Serializer, since this function may be called more than once on each node, and we don't want to prefix the domain name more than once.
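A sketch of that approach might look like the following. It assumes the Wikimedia\RemexHtml namespace and a loaded Composer autoloader; the function name changeLinks and the http://example.com/ prefix are placeholders chosen for illustration.

```php
<?php
// Assumes Composer's autoloader has already been loaded.
use Wikimedia\RemexHtml\HTMLData;
use Wikimedia\RemexHtml\Serializer\HtmlFormatter;
use Wikimedia\RemexHtml\Serializer\Serializer;
use Wikimedia\RemexHtml\Serializer\SerializerNode;
use Wikimedia\RemexHtml\Tokenizer\Tokenizer;
use Wikimedia\RemexHtml\TreeBuilder\Dispatcher;
use Wikimedia\RemexHtml\TreeBuilder\TreeBuilder;

// Hypothetical helper: prefix every <a href> value and return the HTML.
function changeLinks( $text ) {
	$formatter = new class extends HtmlFormatter {
		public function element( SerializerNode $parent, SerializerNode $node, $contents ) {
			if ( $node->namespace === HTMLData::NS_HTML
				&& $node->name === 'a'
				&& isset( $node->attrs['href'] )
			) {
				// Work on copies so the Serializer's own node is untouched
				// if element() is called again for the same node.
				$node = clone $node;
				$node->attrs = clone $node->attrs;
				$node->attrs['href'] = 'http://example.com/' . $node->attrs['href'];
			}
			return parent::element( $parent, $node, $contents );
		}
	};

	// Build the pipeline from end to start, then run it.
	$serializer = new Serializer( $formatter );
	$treeBuilder = new TreeBuilder( $serializer );
	$dispatcher = new Dispatcher( $treeBuilder );
	$tokenizer = new Tokenizer( $dispatcher, $text );
	$tokenizer->execute();
	return $serializer->getResult();
}
```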

Alternatively we could have used SerializerNode::$snData as a flag, to avoid double-prefixing:
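A sketch of the flag-based variant, under the same assumptions as the previous example (Wikimedia\RemexHtml namespace, Composer autoloader loaded, placeholder prefix). The function name changeLinksWithFlag is hypothetical. Here the node is modified in place, and snData records that it has already been handled.

```php
<?php
// Assumes Composer's autoloader has already been loaded.
use Wikimedia\RemexHtml\HTMLData;
use Wikimedia\RemexHtml\Serializer\HtmlFormatter;
use Wikimedia\RemexHtml\Serializer\Serializer;
use Wikimedia\RemexHtml\Serializer\SerializerNode;
use Wikimedia\RemexHtml\Tokenizer\Tokenizer;
use Wikimedia\RemexHtml\TreeBuilder\Dispatcher;
use Wikimedia\RemexHtml\TreeBuilder\TreeBuilder;

// Hypothetical helper: same result as cloning, but guarded by a flag.
function changeLinksWithFlag( $text ) {
	$formatter = new class extends HtmlFormatter {
		public function element( SerializerNode $parent, SerializerNode $node, $contents ) {
			if ( $node->namespace === HTMLData::NS_HTML
				&& $node->name === 'a'
				&& isset( $node->attrs['href'] )
				&& !$node->snData
			) {
				// Mark the node so a repeated call does not prefix it again.
				$node->snData = true;
				$node->attrs['href'] = 'http://example.com/' . $node->attrs['href'];
			}
			return parent::element( $parent, $node, $contents );
		}
	};

	$serializer = new Serializer( $formatter );
	$treeBuilder = new TreeBuilder( $serializer );
	$dispatcher = new Dispatcher( $treeBuilder );
	$tokenizer = new Tokenizer( $dispatcher, $text );
	$tokenizer->execute();
	return $serializer->getResult();
}
```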

Performance
Various options can be enabled which improve performance, potentially at the expense of correctness:


 * Tokenizer
   * ignoreErrors - This does not simply discard parse errors as they are generated. In some cases it chooses a more efficient algorithm which implicitly ignores errors. If parse errors are not required, this should always be set.
   * skipPreprocess - The HTML specification requires that the input be preprocessed to normalize line endings and strip control characters. If line endings are already normalized in your application, and if you don't mind control characters being propagated through to the output, this option can be enabled, for a small improvement to performance.
   * ignoreNulls - Enabling this option causes any null characters to be passed through to the output. The HTML specification requires complex, context-dependent handling of null characters whenever they appear in the input. So if the application simply strips null characters from the input and enables this option, the result will not be standards-compliant, but performance will be slightly improved.
   * ignoreCharRefs - This is an aggressive and rarely-useful optimisation option which ignores character references, passing them through unmodified. It needs to be paired with a special serializer that will emit bare ampersands from text nodes instead of escaping them.
 * TreeBuilder
   * ignoreNulls, ignoreErrors - Same as the corresponding Tokenizer options
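As a rough sketch, the options can be passed as an associative array in the last constructor argument of Tokenizer and TreeBuilder. The option names come from the list above; the function name reserializeFast and the input cleanup are assumptions made for illustration. The input is stripped of null characters and its line endings are normalized first, which is what makes ignoreNulls and skipPreprocess reasonable here.

```php
<?php
// Assumes Composer's autoloader has already been loaded.
use Wikimedia\RemexHtml\Serializer\HtmlFormatter;
use Wikimedia\RemexHtml\Serializer\Serializer;
use Wikimedia\RemexHtml\Tokenizer\Tokenizer;
use Wikimedia\RemexHtml\TreeBuilder\Dispatcher;
use Wikimedia\RemexHtml\TreeBuilder\TreeBuilder;

// Hypothetical helper: parse and reserialize with the faster options enabled.
function reserializeFast( $html ) {
	// Normalize line endings and strip nulls ourselves, since the
	// tokenizer is told to skip these steps.
	$html = str_replace( [ "\r\n", "\r" ], "\n", $html );
	$html = str_replace( "\0", '', $html );

	$serializer = new Serializer( new HtmlFormatter );
	$treeBuilder = new TreeBuilder( $serializer, [
		'ignoreErrors' => true,
		'ignoreNulls' => true,
	] );
	$dispatcher = new Dispatcher( $treeBuilder );
	$tokenizer = new Tokenizer( $dispatcher, $html, [
		'ignoreErrors' => true,
		'skipPreprocess' => true,
		'ignoreNulls' => true,
	] );
	$tokenizer->execute();
	return $serializer->getResult();
}
```

Control characters other than null are still propagated to the output with skipPreprocess, so this trade-off only suits input that is already known to be clean.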



About TokenizerError exceptions
If RemexHtml throws a TokenizerError exception, for example "pcre.backtrack_limit exhausted", this is usually not a bug in RemexHtml. Either the relevant configuration setting should be increased, or the input size should be limited. The pcre.backtrack_limit INI setting should be at least double the input size.
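One way to apply that rule of thumb is to raise the limit before parsing, based on the input length. This sketch uses only PHP's built-in ini_get and ini_set; the function name ensureBacktrackLimit is hypothetical, and whether raising the limit or capping the input size is preferable depends on the application.

```php
<?php
// Hypothetical helper: make sure pcre.backtrack_limit is at least
// double the input size, and return the limit now in effect.
function ensureBacktrackLimit( $html ) {
	$required = 2 * strlen( $html );
	$current = (int)ini_get( 'pcre.backtrack_limit' );
	if ( $current < $required ) {
		// Only ever raise the limit, never lower it.
		ini_set( 'pcre.backtrack_limit', (string)$required );
	}
	return (int)ini_get( 'pcre.backtrack_limit' );
}
```

Calling ensureBacktrackLimit( $html ) just before constructing the Tokenizer keeps the setting in step with the largest input seen so far.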



See also

 * T89331 – Replace Tidy in MW parser with HTML 5 parse/reserialize
 * Html5Depurate (earlier option)
 * Parsing/Replacing Tidy

