Parsing/Replacing Tidy/FAQ/uk

What is Tidy?
Tidy is a library that MediaWiki currently uses to fix some HTML errors found in wiki pages. Badly formed markup is common on wiki pages when editors use HTML tags in templates and on the pages themselves (for example, unclosed HTML tags, such as an opening tag without its matching closing tag, are common). In some cases, MediaWiki can generate erroneous HTML by itself. Tidy fixes these markup errors, but it also performs other "cleanup" at its own discretion that is not required for correctness. For example, it removes empty elements and adds whitespace between HTML tags, which can sometimes change rendering. Because Tidy is based on HTML4 semantics while the Web is moving to HTML5, it also makes some incorrect changes to HTML in order to "fix" things that used to not work; for example, Tidy will unexpectedly move a bulleted list out of a table caption even though it is allowed there.

Why are you replacing it? And with what?
Tidy's technology dates from the 1990s, when browsers were not standardized. Tidy's behavior is loosely based on HTML4 semantics, but it does not match modern browsers. After years without active maintenance, Tidy has now been revived as "tidy-html5" with very different behavior. The older Tidy is no longer being packaged. As noted earlier, Tidy performs HTML cleanup that is unrelated to fixing errors. Together, these problems have led to many bugs being filed against it on Phabricator, and its replacement has been under discussion since at least 2013.

HTML5 is today's standard, and the HTML5 parsing algorithm is clearly specified, which leads to compatible implementations across browsers and other libraries. This algorithm also clearly specifies how broken markup should be fixed. In this new technological landscape, Tidy really should be replaced by an HTML5 parser that fixes broken markup and generates valid, well-formed HTML in a standard way.

However, Wikimedia wikis have a huge corpus of pages whose markup relies on Tidy's fixes. An immediate and straightforward replacement of Tidy with a third-party HTML5-based tool is not feasible, because an HTML5-based tool will repair some markup differently, and this can break the current appearance of pages. So we are replacing Tidy with our own tool, which is based on the HTML5 specification but also adds a few Tidy-compatible workarounds to minimize the impact of replacing Tidy. After experimenting with three different solutions, we settled on RemexHTML, a PHP-based HTML5 parser, on top of which we wrote Tidy-compatibility passes to replicate some of the Tidy behavior that we need to provide for now. In the future, RemexHTML may also be used to enable new core MediaWiki features, such as more robust section editing, balanced template support, and more efficient page updates after template edits.

For those wondering, using tidy-html5 would not preclude us from having to deal with fixing markup errors, since some of the necessary cleanup is due to the change from HTML4 to HTML5 semantics. There are also other change-management reasons to prefer an in-house tool, including the ability to enable the other features mentioned above.

If you are interested in further technical details, please see https://phabricator.wikimedia.org/T89331 or the Replacing Tidy page.

Which tests have you performed so far?
To identify the impact of replacing Tidy with an HTML5-based tool, we used a testing strategy (based on a tool called "VisualDiff") that compares, pixel by pixel, the rendered image of a page with Tidy enabled against the rendered image of the same page with its replacement. Early on, we found that a common difference was minor vertical whitespace changes. In the belief that these would either not be noticeable or would be tolerable, we wrote a tool called "UprightDiff", which is able to identify vertical motion within an image and to discount such motion for the purposes of automated testing. This also let us assign a numeric score to differences and readily identify the most egregious ones. We exported a subset of about 64,000 articles (some from the recent changes stream, and the rest selected randomly) from various Wikimedia wikis (40 wikis from Wikipedia, Wikisource, Wiktionary, and Wikivoyage), rendered them with Tidy and with RemexHTML, and then used "UprightDiff" to analyse the results. This takes a lot of CPU cycles, memory, and disk space, and one round of testing takes 2 days to complete. This limits the size of the testing corpus, but we believe 64K pages is a sizeable enough sample to work out the kinds of fixes that are necessary.
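The idea of discounting vertical motion can be illustrated with a small sketch. This is not UprightDiff's actual algorithm. It is a simplified stand-in that assumes each rendering is available as a list of pixel rows and uses Python's standard difflib module to align the rows. The function name classify_render_diff and the blank pixel value are assumptions made purely for illustration.

```python
from difflib import SequenceMatcher


def classify_render_diff(before, after, blank=0):
    """Classify the difference between two renderings.

    `before` and `after` are lists of pixel rows (each row a list of
    pixel values). Rows made up only of `blank` pixels count as
    vertical whitespace.
    """
    a = [tuple(row) for row in before]
    b = [tuple(row) for row in after]
    if a == b:
        return "identical"

    # Align the two images row by row; inserted or deleted rows show up
    # as non-"equal" opcodes.
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        changed_rows = a[i1:i2] + b[j1:j2]
        # Any changed row that contains real content is a significant diff.
        if any(px != blank for row in changed_rows for px in row):
            return "significant"
    # Only blank rows were added or removed: content merely moved vertically.
    return "vertical-shift-only"
```

With a classification like this, pages can be bucketed into "no change", "vertical whitespace only", and "needs a closer look", which mirrors the way the results are reported in this FAQ.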

To minimize the differences and reduce the number of fixes that would be needed from editors, we added some additional Tidy-compatibility fixups. Since we found that self-closing tags were extremely common on Wikimedia wikis, we added a compatibility fix to treat them as empty tags (i.e. a self-closed tag such as &lt;b/> is treated as &lt;b>&lt;/b>). We added some other compatibility passes as well. After all these fixes were in place and we repeated our tests, we found that 93.4% of pages had no changes in rendering, and that 96.9% of pages had either no pixel differences (93.4%) or only insignificant vertical whitespace shifts (3.5% = 96.9 - 93.4). The remaining 3.1% of pages (100 - 96.9) showed pixel differences that had other causes.

Based on these tests, we identified several classes of markup errors that will render differently between the two. For one class of markup errors (self-closing tags that aren't valid in HTML5), we added a maintenance category that editors have already been using to fix up templates and pages. However, the other classes of markup errors are not easy to detect automatically at this time, and editors' assistance is necessary to identify and fix them.

What will happen? When will these changes happen?
As noted earlier, we cannot yet do a drop-in replacement of Tidy with an HTML5-based tool. We have added a maintenance category for one class of markup errors, which will help editors fix them up. To help editors identify and fix up other markup errors, we also built a ParserMigration extension that helps them compare output in production and fix markup errors. Separately, we have also built the Linter extension to identify some of the fixes that are needed.

As of the end of March 2017, we have deployed the ParserMigration extension to all wikis. As of June 20, 2017, Linter has been deployed to all large wikis. Via these extensions, we hope to enable editors to fix up pages so that Tidy can actually be replaced in 2017. Once enough fixes are made and we are satisfied that the impact on editors and readers will be minor and tolerable, we will go ahead and replace Tidy. However, we would also not like to drag this on indefinitely, so it would be ideal if editors prioritized the high-priority issues identified by the Linter extension.

Separately, the Tidy-compatibility fixups (mentioned in the previous section) are meant to be in place till we replace Tidy. After that, we will start replacing these fixups gradually while relying on similar testing and tooling support.

What will editors need to do?
The Linter extension has been deployed to all wikis. As indicated on the help page, please fix wikitext patterns and templates identified in the high-priority categories on the Special:LintErrors page of your wiki. Every item in those categories has a help page with examples indicating what needs fixing. See the simplified instructions below.

To assist editors in migrating wikitext and verifying the fixes they make, we have deployed the ParserMigration extension. If you enable "ParserMigration" in your preferences, a link will be added to the toolbox of all articles, which shows the current (Tidy) and expected (RemexHTML) output side by side. You can preview changes to the article text in the same side-by-side view and see how your edits change or fix the rendering.

Simplified instructions for fixing pages
Here are some simplified instructions for handling all the high-priority linter categories.

Delete OR fix badly nested tables
In the example for this category, a second table (Table 2) starts inside another table without being placed in a cell. Tidy will delete Table 2, but RemexHTML will not delete that table. This can change how pages look. To prevent this, editors should fix the wikitext and remove Table 2. While the row tag that follows it need not be removed, we recommend removing it. Since Table 2's closing table tag is no longer needed, it should be removed as well.

Alternatively, if you need nested tables, add an explicit table cell on the row started by the previous line, before the start of Table 2. The correct fix depends on the page, but in most cases deletion as described above is going to be the right fix.
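As a rough illustration of the pattern described in this section, the sketch below scans wikitext for a table that starts inside another table but outside any cell. It is a simplified, hypothetical checker, not the Linter extension's logic: the function name find_misnested_tables is invented here, and it only understands table markup that begins at the start of a line.

```python
def find_misnested_tables(wikitext):
    """Return 1-based line numbers of tables opened inside another
    table's row before any cell has been started."""
    flagged = []
    # One entry per open table: has a cell been opened in its current row?
    cell_open = []
    for lineno, raw in enumerate(wikitext.splitlines(), start=1):
        line = raw.strip()
        if line.startswith("{|"):
            if cell_open and not cell_open[-1]:
                flagged.append(lineno)  # nested table with no enclosing cell
            cell_open.append(False)
        elif not cell_open:
            continue  # not inside any table
        elif line.startswith("|}"):
            cell_open.pop()
        elif line.startswith("|-"):
            cell_open[-1] = False  # a new row starts with no cell yet
        elif line.startswith("|") or line.startswith("!"):
            cell_open[-1] = True  # a data or header cell was opened
    return flagged


# Hypothetical wikitext, made up for this sketch.
broken = "{|\n|-\n| Table 1\n|-\n{|\n| Table 2\n|}\n|}"
nested_properly = "{|\n|-\n| Table 1\n|-\n|\n{|\n| Table 2\n|}\n|}"
print(find_misnested_tables(broken))
print(find_misnested_tables(nested_properly))
```

The second sample adds an explicit cell before the inner table, which is the alternative fix mentioned above, so nothing is flagged for it.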

Work around a parser bug for paragraph wrapping
On most wikis, it looks like the biggest generators of these linter warnings are the nowrap or nowrap begin templates. The simplest fix, which will handle the vast majority of these linter cases, is to add a newline before the opening &lt;span&gt; tag in the source of these templates.

In all other cases, wikitext of this form, where a div tag is directly followed by a span that has a no-wrap CSS property, has to be fixed by adding a newline after the div tag.

Fix invalid self-closing tags
Self-closing tags like &lt;div/>, &lt;span/>, &lt;b/>, etc. are not valid in HTML5. They need to be fixed according to what the editor's intent might have been. In some cases, it is a typo where a &lt;/b> was intended. In other cases, they need to be deleted. In some other cases, they need to be replaced with a &lt;nowiki/>. Please see the detailed help page for this category.
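For editors who want to scan a page or template offline, a minimal sketch of a checker might look like the following. It is not how the Linter extension works; the function name find_invalid_self_closed and the short HTML_TAGS list are assumptions chosen for illustration, and a real check would need the full list of affected tags.

```python
import re

# Illustrative subset of HTML tags that must not be self-closed.
# Void elements (such as br) and extension tags are deliberately absent.
HTML_TAGS = frozenset({"b", "i", "div", "span", "p", "small", "center", "font"})

# A tag name, optional attributes, then "/>".
SELF_CLOSED = re.compile(r"<([A-Za-z][A-Za-z0-9]*)(?:\s[^<>]*?)?\s*/>")


def find_invalid_self_closed(wikitext):
    """Return (line number, tag text) pairs for self-closed HTML tags."""
    hits = []
    for lineno, line in enumerate(wikitext.splitlines(), start=1):
        for match in SELF_CLOSED.finditer(line):
            if match.group(1).lower() in HTML_TAGS:
                hits.append((lineno, match.group(0)))
    return hits


sample = 'Some <b/>bold\n<br/> and <references/>\n<div class="x" />'
for lineno, tag in find_invalid_self_closed(sample):
    print(lineno, tag)
```

The checker only reports locations. Deciding whether a hit is a typo, should be deleted, or should become a &lt;nowiki/> remains an editorial judgment.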

Caveats

 * We don't yet know how to automatically determine all affected pages and templates, but we expect that the list of templates and suggestions will get more complete as we learn about problems that were not caught before. Community Liaisons will reach out to template experts, regulars at technical village pumps, etc. to make sure they become aware of the necessary changes and of resources that may help them.
 * We are thinking about other easy, sustainable ways in which we may support "gnomes" and all the Wikimedians willing to help with this effort.

Other FAQs
Q: What happens to tags that are used in self-closed form?

A: As noted in T134423, the only valid self-closed HTML tags are the void elements defined by HTML5 (for example &lt;br/>, &lt;hr/> and &lt;img/>). Non-HTML self-closing tags (such as MediaWiki extension tags) are not affected by this change. One tag is a special case: although it is an HTML tag, MediaWiki treats it like an extension tag, and hence it remains unaffected. All other self-closing HTML tags should be fixed (and are already being fixed by editors at this time).

Since this usage was found in a lot of pages in our testing, in order to prevent unexpected rendering effects (e.g. &lt;b/> being treated as &lt;b> and causing more text to be bolded than intended), we added a fix to the parser to convert them to an empty tag (e.g. &lt;b/> will be converted to &lt;b>&lt;/b>). However, we don't intend to retain this fix indefinitely, so we would like editors to continue fixing this deprecated usage.
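The effect of such a compatibility fix can be sketched in a few lines. This is only an illustration of the idea, not the code used in the MediaWiki parser: the function name expand_self_closed and the FIXABLE_TAGS allowlist are hypothetical, and tags outside the allowlist (void elements and extension tags) are left untouched.

```python
import re

# Hypothetical allowlist of HTML tags to expand; not MediaWiki's actual list.
FIXABLE_TAGS = frozenset({"b", "i", "div", "span", "p", "small"})

# Group 1 is the tag name, group 2 the (possibly empty) attribute text.
SELF_CLOSED = re.compile(r"<([A-Za-z][A-Za-z0-9]*)((?:\s[^<>]*?)?)\s*/>")


def expand_self_closed(html):
    """Rewrite self-closed tags from the allowlist as empty elements."""

    def replace(match):
        name, attrs = match.group(1), match.group(2).rstrip()
        if name.lower() not in FIXABLE_TAGS:
            return match.group(0)  # leave void and non-HTML tags alone
        return f"<{name}{attrs}></{name}>"

    return SELF_CLOSED.sub(replace, html)


print(expand_self_closed("before <b/> after"))
print(expand_self_closed("<br/> and <references/>"))
```

Expanding to an explicitly closed empty element is what keeps the following text from being swallowed by the tag, which is the unexpected bolding effect described above.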

Q: What were the results of tests on languages other than English, or on sister projects?

A: There is nothing in what Tidy or RemexHTML does that is specific to Wikipedia or English. This project is primarily about a change from HTML4 to HTML5 semantics and getting rid of some Tidy cleanups of HTML. These changes affect all projects and languages equally, except where some projects and languages tend to have more markup errors or use more self-closed tags than others.

Q: What other changes are editors likely to see, after this replacement?

A: This replacement is primarily going to affect readers, as they may notice that a page doesn't look right (for example, excessively wide navboxes without line breaks or wrapping). However, if anything, this might lead to the rendering seen in VisualEditor matching the rendering seen outside it much more closely than before, since Parsoid's output has been HTML5-compliant since the beginning, and we are now moving the read output to HTML5. We do not expect any impact on VisualEditor edits, but we will promptly address any bugs reported with respect to dirty diffs. In addition, we do not plan to display any error messages or warnings on pages if the markup errors are not fixed.

Q: How does the replacement relate to other projects you are working on?

A: By enabling the move to HTML5 semantics, this is one of the steps in evolving the markup in our corpus to keep up with web standards. We also expect to leverage this tool to support well-balanced template output. Separately, but relatedly, this will also make the output of the PHP parser (used for reads) and the output of Parsoid (used for edits in VE and Content Translation) more consistent, since Parsoid already uses HTML5 semantics. One of our goals is to make the two outputs fully consistent with each other and to use one parser for both reads and edits.

--The Parsing Team

Volunteers available to support this effort
''Community Liaisons invite interested Wikimedians to please add their name in the sections below and support their community engagement efforts. Thank you. As with similar past initiatives, signing up is optional.''
 * I am available to test with the ParserMigration tool.
   * (add your signature here)
 * I am available to fix templates.
   * Jonesey95 (talk) 16:10, 14 November 2016 (UTC)
   * Samuele2002 (talk)
 * I am available to study and discuss fixes to templates.
   * Jonesey95 (talk) 16:10, 14 November 2016 (UTC)
 * I am available to spread the word among my community.
   * (See this page) --Sannita (talk) 18:02, 8 July 2017 (UTC)