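RemexHtml
Introduction
RemexHtml is an HTML 5 parser written in PHP.
RemexHtml aims to be:
- Modular and flexible.
- Fast, as opposed to elegant. For example, we sometimes use direct member access instead of going through accessors, and manually inline some performance-sensitive code.
- Robust, aiming for O(N) worst-case performance.
RemexHtml contains the following modules:
- A compliant preprocessor and tokenizer. This generates a token event stream.
- Compliant tree construction, including error recovery. This generates a tree mutation event stream.
- A fast integrated HTML serializer, compliant with the HTML fragment serialization algorithm.
- DOMDocument construction.
RemexHtml presently lacks:
- Encoding support. The input is expected to be valid UTF-8.
- Scripting.
- XHTML serialization.
- Precise compliance with specified parse error generation.
RemexHtml aims to be compliant with the W3C recommendation HTML 5.1, except for minor backported bug fixes. We chose to implement the W3C standard rather than the latest WHATWG draft because our application needs stability more than feature completeness.
RemexHtml passes all html5lib tests, except for parse error counts and tests which reference a future version of the standard.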
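Installation
Within MediaWiki
- RemexHtml has been available in MediaWiki as a core Composer dependency since MediaWiki 1.29.
- Its initial use case was as a replacement for HTML Tidy.
The output of the wikitext parser is fed into RemexHtml's HTML parser and cleaned up according to the HTML5 tag soup specification.
- The Tokenizer component is now also used for tag stripping in Sanitizer.
- It is also used for HTML post-processing in the Collection, TEI and Wikibase extensions.
Elsewhere
Install the wikimedia/remex-html package from Packagist: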
composer require wikimedia/remex-html
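The project follows semantic versioning. The major version number is incremented whenever a change breaks backwards compatibility.
Feature overview
For complete reference documentation, see the documentation generated from the source code (or the source code itself).
RemexHtml uses a pipeline model. Each event producer calls its attached callback object whenever it has an event ready to emit. The pipeline stages are: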
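- Tokenizer
- Generates a token stream from HTML. Tokenization is performed as described in the tokenization chapter of the HTML specification.
- Dispatcher
- Tracks the insertion mode and relays token events to handlers specific to the current insertion mode. Each insertion mode has its own class, with a method for each token type.
- TreeBuilder
- A helper class for the insertion modes. It tracks the state of the tree construction process, receives tree mutation requests from the insertion mode classes, and dispatches tree mutation events.
In the HTML specification, the tree construction algorithm is conceived as being tightly integrated with the creation of a DOM data structure. A major innovation of RemexHtml is to split tree construction into two stages: one which generates a tree mutation event stream, and another which actually produces the data structure. RemexHtml can serialize the tree mutation event stream directly, without ever holding the whole DOM in memory.
- Serializer
- Generates HTML from a tree mutation event stream.
- DOMBuilder
- Generates a native PHP DOMDocument from a tree mutation event stream.
When Serializer is used, there is one final pipeline stage:
- Formatter
- The Formatter interface converts SerializerNode objects to strings. It is a helper for Serializer which allows easy customisation of the details of the generated HTML. Serializer is complex and stateful, whereas Formatter subclasses are generally stateless (aside from configuration).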
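RemexHtml also provides:
- DOMSerializer
- A utility class which serializes the DOM held by a DOMBuilder, with an interface similar to that of Serializer.
- PropGuard
- Many RemexHtml classes use the PropGuard trait to prevent accidental assignment to undeclared properties. This helps to detect developer confusion about class types. If your application really does need undeclared properties, PropGuard can be disabled globally with PropGuard::$armed = false.
- TokenGenerator
- A class which provides the token stream through a generator interface instead of an event stream. It constructs its own Tokenizer. Consuming token events this way is less efficient, but may be more convenient for some use cases.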
There are optional pipeline stages providing debugging facilities:
- DispatchTracer
- This class sits between Tokenizer and Dispatcher. It reports all token events, and reports insertion mode transitions within Dispatcher. Log messages are sent to a callback function.
- TreeMutationTracer
- This forwards tree mutation events coming from TreeBuilder, and reports such events to a callback.
- DestructTracer
- This class forwards tree mutation events, and reports when the Element object emitted by TreeBuilder is destroyed. This helps to identify memory leaks.
RemexHtml's model of a configurable pipeline provides a great deal of flexibility. Applications may subclass pipeline classes provided by RemexHtml, or write their own from scratch, implementing the relevant event receiver interface. Or they may interpose custom pipeline stages in between RemexHtml's standard stages.
However, for simple use cases, there is a fair amount of boilerplate. T217850 proposes to add a simplified method for constructing a standard pipeline, but this has not yet been implemented.
Examples
Construct a DOM from input text
use Wikimedia\RemexHtml\DOM\DOMBuilder;
use Wikimedia\RemexHtml\TreeBuilder\TreeBuilder;
use Wikimedia\RemexHtml\TreeBuilder\Dispatcher;
use Wikimedia\RemexHtml\Tokenizer\Tokenizer;
function parseHtmlToDom( $input ) {
    // Build the pipeline from the last stage back to the first.
    $domBuilder = new DOMBuilder();
    $treeBuilder = new TreeBuilder( $domBuilder );
    $dispatcher = new Dispatcher( $treeBuilder );
    $tokenizer = new Tokenizer( $dispatcher, $input );
    // Parse the whole input and push it through the pipeline.
    $tokenizer->execute();
    return $domBuilder->getFragment();
}
In the above code sample, the pipeline is constructed backwards, from end to start. The constructor of each pipeline stage receives the following pipeline stage. Then with the pipeline fully constructed, $tokenizer->execute() causes the whole input text to be parsed and emitted through the pipeline, eventually reaching the DOMBuilder. After execution, the constructed document is available via $domBuilder->getFragment().
Modify link targets
use Wikimedia\RemexHtml\HTMLData;
use Wikimedia\RemexHtml\Serializer\HtmlFormatter;
use Wikimedia\RemexHtml\Serializer\Serializer;
use Wikimedia\RemexHtml\Serializer\SerializerNode;
use Wikimedia\RemexHtml\Tokenizer\Tokenizer;
use Wikimedia\RemexHtml\TreeBuilder\Dispatcher;
use Wikimedia\RemexHtml\TreeBuilder\TreeBuilder;
function changeLinks( $html ) {
    $formatter = new class extends HtmlFormatter {
        public function element( SerializerNode $parent, SerializerNode $node, $contents ) {
            if ( $node->namespace === HTMLData::NS_HTML
                && $node->name === 'a'
                && isset( $node->attrs['href'] )
            ) {
                // Work on copies so the Serializer's own view of the node is untouched.
                $node = clone $node;
                $node->attrs = clone $node->attrs;
                $node->attrs['href'] = 'http://example.com/' . $node->attrs['href'];
            }
            return parent::element( $parent, $node, $contents );
        }
    };
    // Build the pipeline from the last stage back to the first.
    $serializer = new Serializer( $formatter );
    $treeBuilder = new TreeBuilder( $serializer );
    $dispatcher = new Dispatcher( $treeBuilder );
    $tokenizer = new Tokenizer( $dispatcher, $html );
    $tokenizer->execute();
    return $serializer->getResult();
}
This example modifies an HTML document on the fly, altering href attributes inside <a> tags and returning an HTML string.
It does this by subclassing HtmlFormatter, which is a relatively easy hook point into reserialization.
It clones the SerializerNode and Attributes objects to avoid altering the document as seen by Serializer, since it is possible this function may be called more than once on each node, and we don't want to prefix the domain name more than once.
Alternatively we could have used SerializerNode::$snData as a flag, to avoid double-prefixing:
if ( !$node->snData ) {
    $node->snData = true;
    $node->attrs['href'] = 'http://example.com/' . $node->attrs['href'];
}
Performance
Various options can be enabled which improve performance, potentially at the expense of correctness:
- Tokenizer
- ignoreErrors - This does not simply discard parse errors as they are generated. In some cases it chooses a more efficient algorithm which implicitly ignores errors. If parse errors are not required, this should always be set.
- skipPreprocess - The HTML specification requires that the input be preprocessed to normalize line endings and strip control characters. If line endings are already normalized in your application, and if you don't mind control characters being propagated through to the output, this option can be enabled, for a small improvement to performance.
- ignoreNulls - Enabling this option causes any null characters to be passed through to the output. The HTML specification requires complex, context-dependent handling of null characters whenever they appear in the input. So if the application simply strips null characters from the input and enables this option, the result will not be standards-compliant, but performance will be slightly improved.
- ignoreCharRefs - This is an aggressive and rarely-useful optimisation option which ignores character references, passing them through unmodified. It needs to be paired with a special serializer that will emit bare ampersands from text nodes instead of escaping them.
- TreeBuilder
- ignoreNulls, ignoreErrors - Same as the corresponding Tokenizer options
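A minimal sketch of how these options might be passed, assuming that Tokenizer and TreeBuilder each accept an options array as their last constructor argument and that Composer's autoloader is already loaded. reserializeFast() is a hypothetical helper name. The option set shown trades strict standards compliance for speed, as described above.

```php
<?php
use Wikimedia\RemexHtml\Serializer\HtmlFormatter;
use Wikimedia\RemexHtml\Serializer\Serializer;
use Wikimedia\RemexHtml\Tokenizer\Tokenizer;
use Wikimedia\RemexHtml\TreeBuilder\Dispatcher;
use Wikimedia\RemexHtml\TreeBuilder\TreeBuilder;

/**
 * Parse and reserialize HTML with the performance options enabled.
 * Hypothetical helper; assumes line endings in $html are already normalized.
 */
function reserializeFast( string $html ): string {
    // ignoreNulls passes null characters through, so strip them first.
    $html = str_replace( "\0", '', $html );

    $serializer = new Serializer( new HtmlFormatter() );
    $treeBuilder = new TreeBuilder( $serializer, [
        'ignoreErrors' => true,
        'ignoreNulls' => true,
    ] );
    $dispatcher = new Dispatcher( $treeBuilder );
    $tokenizer = new Tokenizer( $dispatcher, $html, [
        'ignoreErrors' => true,
        'skipPreprocess' => true,
        'ignoreNulls' => true,
    ] );
    $tokenizer->execute();
    return $serializer->getResult();
}
```

ignoreCharRefs is left out here because, as noted above, it needs a special serializer that emits bare ampersands.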
About TokenizerError exceptions
If RemexHtml throws a TokenizerError exception, for example "pcre.backtrack_limit exhausted", this is usually not a bug in RemexHtml. Either the relevant configuration setting should be increased, or the input size should be limited. The pcre.backtrack_limit INI setting should be at least double the input size.
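One way to apply the "at least double the input size" guideline is to raise the setting at runtime before parsing. The sketch below uses only PHP's built-in ini_get() and ini_set(). ensureBacktrackLimit() is a hypothetical helper name, and it assumes the host configuration allows the setting to be changed at runtime.

```php
<?php
/**
 * Raise pcre.backtrack_limit to at least twice the input size.
 * Hypothetical helper; returns the limit in effect afterwards.
 */
function ensureBacktrackLimit( string $input ): int {
    $needed = 2 * strlen( $input );
    $current = (int)ini_get( 'pcre.backtrack_limit' );
    if ( $current < $needed ) {
        // ini_set() returns false if the value could not be changed.
        ini_set( 'pcre.backtrack_limit', (string)$needed );
    }
    // Re-read the setting so the caller sees what is actually in effect.
    return (int)ini_get( 'pcre.backtrack_limit' );
}
```

If the returned value is still below twice the input size, the application may prefer to reject or truncate the input rather than risk a TokenizerError.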
See also