Parsing/Replacing Tidy/FAQ/ja

What is Tidy?
Tidy is a library that MediaWiki currently uses to fix some HTML errors found on wiki pages. Badly formed markup is common on wiki pages when editors use HTML tags in templates and on the page itself (for example, unclosed HTML tags, where an opening tag has no matching closing tag, are common). In some cases, MediaWiki can generate erroneous HTML by itself. Tidy fixes these markup errors, but also does other "cleanup" on its own that is not required for correctness. For example, it removes empty elements and adds whitespace between HTML tags, which can sometimes change rendering. Since Tidy is based on HTML4 semantics and the web has moved to HTML5, it also makes some incorrect changes to HTML to 'fix' things that used to not work; for example, Tidy will unexpectedly move a bullet list out of a table caption even though that is allowed.

Why are you replacing it, and with what?
Tidy's technology dates from the 1990s, when browsers were not yet standardized. Its behaviour is loosely based on HTML4 semantics, but it does not match any modern browser. After years without active maintenance, Tidy was revived as "tidy-html5", which behaves quite differently from the original. The older Tidy is no longer being packaged. As noted earlier, Tidy does HTML cleanup unrelated to fixing errors. Together, these issues have led to many bugs being filed against it on Phabricator, and a replacement has been asked for since at least 2013.

HTML5 is the standard today, and the parsing algorithm for HTML5 is clearly specified, which has led to compatible implementations across browsers and other libraries. This algorithm also clearly specifies how broken markup should be fixed up. In this new technological landscape, Tidy should really be replaced with an HTML5 parser that fixes up the broken markup and generates valid, well-formed HTML markup in the standard way.

However, Wikimedia wikis have a huge corpus of pages whose markup relies on Tidy's fixups. An immediate and straightforward replacement of Tidy with a third-party HTML5-based tool is not feasible, since an HTML5-based tool would repair some markup differently, and this can break how pages look. So we are replacing Tidy with our own tool, which is based on the HTML5 specification but also adds a few Tidy-compatibility workarounds to minimize the impact of the replacement. After experimenting with three different solutions, we have settled on RemexHTML, a PHP-based HTML5 parser on top of which we have written Tidy-compatibility passes to replicate some Tidy behaviour that we need to provide for now. In the future, RemexHTML could also be used to enable new core MediaWiki features, such as more robust section editing, balanced template support, and more efficient page updates after templates have been edited.

For those who are wondering, note that using tidy-html5 would not spare us from having to deal with fixing markup errors, since some of the required cleanup is due to the change from HTML4 to HTML5 semantics. There are other change-management reasons for preferring an in-house tool, including the ability to enable other features as mentioned above.

If you are interested in further technical details, see https://phabricator.wikimedia.org/T89331 or Replacing Tidy.

Which tests have you performed so far?
To identify the impact of replacing Tidy with an HTML5-based tool, we used a testing strategy (built on a tool called "VisualDiff") that compares, pixel by pixel, an image of MediaWiki's output with Tidy enabled against an image of the output with its replacement. Early on, we found that a common difference was minor vertical whitespace changes. In the belief that these would either not be noticeable or would be tolerable, we wrote a tool called "UprightDiff", which identifies vertical motion within an image and discounts such motion for the purposes of automated testing. This also let us assign a numeric score to differences and readily identify the most egregious ones. We exported a subset of about 64,000 articles (some from the recent changes stream, and the rest selected randomly) from various Wikimedia wikis (40 wikis from Wikipedia, Wikisource, Wiktionary, and Wikivoyage), rendered them with Tidy and with RemexHTML, and then used "UprightDiff" to analyse the results. This takes a lot of CPU cycles, memory, and disk space, and one round of testing takes 2 days to complete. This limits the size of the testing corpus, but we believe 64,000 articles is a sizeable enough sample to work out the kinds of fixes that are necessary.
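The idea of discounting vertical motion can be illustrated with a toy model. The sketch below is not how UprightDiff works internally; it assumes each rendering is given as a list of pixel rows (strings), uses a hypothetical `.` character for background pixels, and relies on Python's standard `difflib` to align rows. Inserted or removed blank rows count as a vertical shift, while any changed row with content adds to a numeric score.

```python
from difflib import SequenceMatcher

BLANK = "."  # hypothetical marker for a background pixel in this toy model


def is_blank(row):
    """A row is blank when it contains only background pixels."""
    return set(row) <= {BLANK}


def classify(before, after):
    """Compare two renderings given as lists of pixel rows (strings).

    Returns (label, score); score counts changed rows that carry content.
    """
    matcher = SequenceMatcher(a=before, b=after, autojunk=False)
    score = 0
    shifted = False
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        changed = before[i1:i2] + after[j1:j2]
        content_rows = [row for row in changed if not is_blank(row)]
        if content_rows:
            score += len(content_rows)  # a real rendering difference
        else:
            shifted = True  # only blank rows were added or removed
    if score:
        return "different", score
    return ("vertical shift only" if shifted else "identical"), 0


before = ["....", ".##.", "....", ".#.."]
shifted_down = ["....", ".##.", "....", "....", ".#.."]
altered = ["....", ".##.", "....", "..#."]
```

In this model, `classify(before, shifted_down)` reports a vertical shift with score 0, while `classify(before, altered)` reports a real difference.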

To minimize the differences and reduce the number of fixes that editors would need to make, we added some additional Tidy-compatibility fixups. Since we found that self-closing tags were extremely common on Wikimedia wikis, we added a compatibility fix to treat them as empty tags (i.e. a self-closed tag is treated as an opening tag immediately followed by its closing tag). We added some other compatibility passes as well. After all these fixes were in place, we repeated our tests and found that 93.4% of pages had no changes in rendering. A further 3.5% showed only insignificant vertical whitespace shifts, so 96.9% of pages had either no pixel differences or only such shifts. The remaining 3.1% of pages showed pixel differences that had other causes.

Based on these tests, we identified several classes of markup errors that will render differently between the two. For one class of markup errors (self-closing tags that aren't valid in HTML5), we added a maintenance category that editors have already been using to fix up templates and pages. But, the other classes of markup errors are not easy to detect automatically at this time and editors' assistance is necessary to identify and fix them up.

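What exactly is changing, and when?
As already announced, we will not replace Tidy with an HTML5-based tool in a single step. For one class of markup errors, we added a maintenance category to make it easier for editors to fix them. To help editors identify and fix errors in other kinds of markup as well, we built the ParserMigration extension, which lets editors compare the rendering differences and fix markup errors while editing. Separately, we built the Linter extension to detect errors that need to be fixed.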

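The ParserMigration extension has been deployed on all wikis as of the end of March 2017. Linter has been deployed on all large wikis as of 20 June 2017. We had hoped that these extensions would help editors fix pages and lead to Tidy being replaced during 2017. We will replace Tidy once enough fixes have accumulated and we are confident that the impact on editors and readers is minimal and tolerable. However, we do not want to drag this out indefinitely. Ideally, therefore, editors would prioritize the issues that Linter identifies as high priority.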

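Separately, the Tidy-compatibility fixes (described above) are intended only as a transitional measure until Tidy has been replaced. After that stage, we will gradually phase out the compatibility fixes, following a similar process of testing and tool support.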

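What should I do?
The Linter extension has been deployed on all wikis. As described on the help page, please fix the wikitext patterns and templates listed in the high-priority categories on your wiki's Special:LintErrors page. Each of those categories has a corresponding help page with examples of what needs fixing. Simplified instructions are given below.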

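To help editors migrate wikitext and verify their fixes, we have deployed the ParserMigration extension. If you enable the "ParserMigration tool" under Preferences → Editing, a link is added to the editing toolbox of every article that shows the current Tidy output and the expected (RemexHTML) output side by side. With it enabled, you can view the before and after renderings next to each other on the same screen and check how your edit has changed or fixed the rendering.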

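How is my wiki affected?
For this information, see the wikitext deprecation tool. The page counts give the number of affected pages, but in some cases the problem originates in a template. In other words, once a particular template is fixed, every page that includes it drops off the list. So even if thousands or millions of pages are shown as needing fixes, it is likely that only a handful of templates actually need to be fixed. We are working on making this clearer. You can follow progress at Parsing/Replacing Tidy/Linter/Stats, and the visual diff test results for a sample of up to 73,000 articles drawn from 60 wikis are available at http://mw-expt-tests.wmflabs.org/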
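The relationship between page counts and templates can be made concrete with a small sketch. It assumes a hypothetical list of `(page, template)` pairs, where `template` is `None` when the error sits in the page's own wikitext; neither this data shape nor the sample titles come from the Linter extension itself.

```python
from collections import Counter


def summarize(lint_rows):
    """lint_rows: list of (page_title, template_title_or_None) pairs."""
    affected_pages = {page for page, _ in lint_rows}
    # How many affected pages would clear once each template is fixed.
    by_template = Counter(tpl for _, tpl in lint_rows if tpl is not None)
    # Pages whose own wikitext has to be edited.
    direct = sorted({page for page, tpl in lint_rows if tpl is None})
    return len(affected_pages), by_template, direct


# Hypothetical sample: three pages share one broken template.
rows = [
    ("Page A", "Template:X"),
    ("Page B", "Template:X"),
    ("Page C", "Template:X"),
    ("Page D", None),
]
page_count, by_template, direct = summarize(rows)
```

Here four pages are listed as affected, but fixing the single template clears three of them, leaving one page to edit by hand.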

Simplified instructions: how to fix pages
This section briefly summarizes the steps for all of the high-priority linter categories. The linked help pages may also explain how to use assistive tools such as WPCleaner.

Improperly nested tables: delete or fix
In this example, Tidy will delete Table 2 above. But, RemexHTML will not delete that table. This can change how pages look. To prevent this, editors should fix the wikitext and remove Table 2. While the following row-tag need not be removed, we recommend removing it. Since the closing table tag is no longer needed, it should be removed as well.

Alternatively, add an explicit  cell on the row started by the previous line before the start of Table 2 if you need nested tables. What is the correct fix depends on the page. But, in most cases deletion as above is going to be the right fix.


 * Help:Extension:Linter/deletable-table-tag

Work around a parser bug for paragraph wrapping
On most wikis, it looks like the biggest generators of these linter warnings are the nowrap or nowrap begin templates. The simplest fix, which will handle the vast majority of these linter cases, is to add a newline before the opening <span> tag in the source of these templates.

In all other cases, when wikitext has a span next to a div/td (and other such "block" tags) and has a  CSS property, please add a newline after the div tag.
 * Note that this has the effect of enclosing the whole span in a paragraph element. Here the bug comes from newlines within the div element that cause a new paragraph to be generated for "foo".
 * The alternative, if you don't want MediaWiki to automatically insert a paragraph element (with its additional margins) around the "span" inside the "div", is to not use any newline at all in the content of the "div" element, or to "hide" these newlines within HTML comments:


 * Help:Extension:Linter/pwrap-bug-workaround

Fix invalid self-closing tags
Self-closing tags like <div/>, <span/>, <b/>, etc. are not valid in HTML5. They need to be fixed according to what the editor's intent might have been. In some cases, the tag is a typo where a </b> was intended. In other cases, it needs to be deleted. In still other cases, it needs to be replaced with a <nowiki/>. Please see the detailed help page for this category.


 * Help:Extension:Linter/self-closed-tag
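For editors who want to scan wikitext offline, the following is a minimal sketch of how such tags could be spotted. It is not how the Linter extension detects them; the `CHECKED_HTML_TAGS` set is an illustrative, incomplete assumption, and the regular expression ignores edge cases such as `/>` appearing inside an attribute value.

```python
import re

# Assumption: a small, incomplete sample of non-void HTML tags to check.
# Void elements such as <br/> and non-HTML tags such as <nowiki/> are
# deliberately left out, so they are never reported.
CHECKED_HTML_TAGS = {"b", "i", "u", "s", "div", "span", "p", "small", "big", "center", "font"}

SELF_CLOSED = re.compile(r"<([A-Za-z][A-Za-z0-9]*)\b[^<>]*/>")


def find_invalid_self_closed(wikitext):
    """Return (offset, tag_text) for each self-closed tag in the checked set."""
    hits = []
    for match in SELF_CLOSED.finditer(wikitext):
        if match.group(1).lower() in CHECKED_HTML_TAGS:
            hits.append((match.start(), match.group(0)))
    return hits


sample = "a<b/>bold <br/> <nowiki/> <span class='x'/>"
found = [tag for _, tag in find_invalid_self_closed(sample)]
```

Each hit still needs a human decision, since the right fix (closing tag, deletion, or `<nowiki/>`) depends on what the editor intended.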

Fix pages affected by a Tidy whitespace bug

 * Help:Extension:Linter/tidy-whitespace-bug

Fix HTML5 vs Tidy misnested tag problems
Here is an example to illustrate the problem.

That was just one example to demonstrate the problem. There are other instances -  foobar  (an enwiki talk page) or  \n*x   (many itwiki pages via the use of citazione necessaria template) are other instances.


 * Help:Extension:Linter/html5-misnesting

font tag with color attribute wrapping wikilinks
Here is an example to illustrate the problem.

(Table with the columns Wikitext, Tidy, Remex, and Proposed Fix; its rows are not available here.)
 * or better yet,  and similar tags if they are used in self-closed form?'''

A: As noted in T134423, the only HTML tags that are valid in self-closed form are those listed in that task (the HTML void elements). Non-HTML self-closing tags (like <nowiki/>) are not affected by this change. One HTML tag is a special case: while it is an HTML tag, MediaWiki treats it like an extension tag, so it remains unaffected. All other self-closing HTML tags should be fixed (and are already being fixed by editors at this time).

Since our testing found this usage on a lot of pages, we added a fix to the parser that converts such tags to empty tags (a self-closed tag becomes an opening tag immediately followed by its closing tag). This prevents unexpected rendering effects, e.g. a self-closed bold tag being treated as an opening bold tag and causing more text to be bolded than intended. But we don't intend to retain this fix indefinitely, so we would like editors to continue fixing this deprecated usage.
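The conversion described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the code in MediaWiki's parser: the `NON_VOID_HTML` set is a small hypothetical sample, and the regular expression does not handle `/>` inside attribute values.

```python
import re

# Assumption: an illustrative, incomplete set of non-void HTML tag names.
NON_VOID_HTML = {"b", "i", "u", "s", "div", "span", "p", "small", "big", "center", "font"}

# Group 1 is the tag name, group 2 any attributes before the closing "/>".
SELF_CLOSED = re.compile(r"<([A-Za-z][A-Za-z0-9]*)(\b[^<>]*?)\s*/>")


def expand_self_closed(html):
    """Rewrite self-closed non-void HTML tags as empty open/close pairs."""
    def replace(match):
        name, attrs = match.group(1), match.group(2)
        if name.lower() not in NON_VOID_HTML:
            return match.group(0)  # void elements and non-HTML tags stay as-is
        return f"<{name}{attrs}></{name}>"

    return SELF_CLOSED.sub(replace, html)


converted = expand_self_closed("x<b/>y<br/>z")
with_attrs = expand_self_closed('<span class="x" />')
```

With this rewrite, the text after a self-closed bold tag is no longer swallowed by an unclosed bold element, because the tag is closed immediately.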

Q: What were the results of tests on languages other than English, or on sister projects?

A: Nothing that Tidy or RemexHTML does is specific to Wikipedia or to English. This project is primarily about a change from HTML4 to HTML5 semantics and about getting rid of some Tidy cleanups of HTML. These changes affect all projects and languages equally, except where some projects or languages tend to have more markup errors or use more self-closed tags than others.

Q: What other changes are editors likely to see, after this replacement?

A: This replacement will primarily affect readers, who may notice that a page doesn't look right (for example, excessively wide navboxes without line breaks or wrapping). However, if anything, it should make the rendering seen in VisualEditor match the rendering seen outside it more closely than before, since Parsoid's output has been HTML5-compliant from the beginning, and we are now moving the read output to HTML5. We do not expect any impact on VisualEditor edits, but we will promptly address any bugs reported with respect to dirty diffs. In addition, we do not plan to display any error messages or warnings on pages whose markup errors have not been fixed.

Q: How does the replacement relate to other projects you are working on?

A: By enabling the move to HTML5 semantics, this is one of the steps in evolving the markup in our corpus to keep up with web standards. We also expect to leverage this tool to support well-balanced template output. Separately, but relatedly, this will make the output of the PHP parser (used for reads) and the output of Parsoid (used for edits in VisualEditor and Content Translation) more consistent, since Parsoid already uses HTML5 semantics. One of our goals is to make the two outputs fully consistent with each other and to use one parser for both reads and edits.

--The Parsing Team

Volunteers available to support this effort
''Community Liaisons invite interested Wikimedians to please add their name in the sections below and support their community engagement efforts. Thank you. As with similar past initiatives, signing up is optional.'' Please see Parsing/Get_involved, and add it to your bookmarks, as future requests for assistance will go through that page!
 * I am available to test with the ParserMigration tool.
 ** (add your signature here)
 * I am available to fix templates.
 ** Jonesey95 (talk) 16:10, 14 November 2016 (UTC)
 ** Samuele2002 (talk)
 ** TheDragonFire (talk)
 ** Stryn (talk) 14:22, 17 July 2017 (UTC)
 ** Already did some in fawiki Ladsgroup (talk) 15:40, 18 September 2017 (UTC)
 * I am available to study and discuss fixes to templates.
 ** Jonesey95 (talk) 16:10, 14 November 2016 (UTC)
 * I am available to spread the word among my community.
 ** (See this page) --Sannita (talk) 18:02, 8 July 2017 (UTC)
 ** See this page. Stryn (talk) 14:22, 17 July 2017 (UTC)