Parsing/Replacing Tidy/FAQ/fr

Qu'est-ce que Tidy ?
Tidy is a library currently used by MediaWiki to fix some HTML errors found in wiki pages. Trouver du texte mal formaté est commun dans les pages wiki quand les éditeurs utilisent les balises HTML dans les modèles ou dans la page elle-même. (Exemple: les balises HTML non fermées, comme  sans , sont fréquents). Dans certains cas, MediaWiki peut aussi générer du code HTML érronné par lui-même. Tidy corrige ces erreurs de balises, mais fait aussi du nettoyage de son côté et qui n'est pas nécessaire pour la correction. Par exemple, il enlève les éléments vides et ajoute des espaces entre les balises HTML ce qui peut quelquefois modifier la présentation. Comme Tidy s'appuie sur la sémantique de HTML4 et que le web est passé à HTML5, il apporte aussi des modifications incorrectes au texte HTML pour 'corriger' les choses qui ne marchaient pas; par exemple, Tidy va, sans qu'on le prévoit, déplacer une liste à puces en dehors d'une zone de texte de table bien que cela ne soit pas autorisé.

Pourquoi le remplacez-vous ? Et par quoi ?
La technologie de Tidy date des années 1990, quand les explorateurs pour l'Internet étaient standardisés. Le comportement de Tidy est largement basé sur la sémantique HTML4 mais ne correspond plus aux explorateurs modernes. Après des années passées sans réelle maintenance, Tidy a maintenant été remis à jour sous le nom de "tidy-html5" avec un comportement très différent. L'ancien Tidy n'est plus regénéré. Comme expliqué précédemment, Tidy nettoie le HTML quelquesoit les erreurs à corriger. Conjugués, tous ces problèmes ont conduits à créer beaucoup de bogues qui ont donné lieu à des tickets dans Phabricator, et une demande de produit en remplacement a été faite depuis au moins 2013. HTML5 est le standard aujourd'hui, et l'algorithme d'analyse pour HTML5 est clairement spécifié, ce qui a conduit à des implementations compatibles à travers les navigateurs et autres librairies. Cet algorithme spécifie aussi clairement comment un balisage cassé devrait être corrigé. Dans ce nouveau paysage technologique, Tidy devrait réellement être remplacé par un analyseur HTML5 qui corrigerait le texte déffectueux et en génèrerait un texte HTML valide et bien formaté d'une manière standard. Mais les wikis Wikimedia ont un énorme corpus de pages dont le codage dépend des corrections de Tidy. Faire le remplacement intermédiaire et directement de Tidy par outils tiers basé sur HTML5 n'est pas faisable, car un outil basé sur HTML5 réparerait certains textes différemment et ceci pourrait casser l'aspect des pages. So, we are replacing Tidy with our own tool based on the HTML5 specification, but which also adds a few Tidy-compatibility workarounds to minimize the impact of replacing Tidy. After experimenting with 3 different solutions, we have settled on RemexHTML, a PHP-based HTML5 parser on top of which we have written Tidy-compatibility passes to replicate some Tidy behaviour which we need to provide for now. In the future, RemexHTML could also be used to enable new core MediaWiki features, such as more-robust section editing, balanced template support, and more efficient page updates after templates have been edited. For those who are wondering, note that using tidy-html5 would not preclude us from having to deal with fixing markup errors, since some of the required cleanup is due to changing from HTML4 to HTML5 semantics. There are other change-management reasons for preferring a in-house tool, including the ability to enable other features as mentioned above.

Si vous êtes intéressé par d'autres détails techniques, veuillez voir https://phabricator.wikimedia.org/T89331 ou Remplacer Tidy.

Which tests have you performed so far?
To identify the impact of replacing Tidy with a HTML5 based tool, we have utilized a testing strategy (using a tool called "VisualDiff") that compares the pixel-by-pixel output image of MediaWiki with Tidy enabled, with the pixel-by-pixel output image of its replacement. Early on, we found that a common difference was minor vertical whitespace changes. In the belief that these would either not be noticeable or would be tolerable, we wrote a tool called "UprightDiff" which is able to identify vertical motion within an image and to discount such motion for the purposes of automated testing. This also let us assign a numeric score to differences and readily identify the most egregious differences. We exported a subset of about 64,000 articles (some from the recent changes stream, and rest selected randomly) from various Wikimedia wikis (40 wikis from Wikipedia, Wikisource, Wiktionary, and Wikivoyage), and rendered them with Tidy and with RemexHTML, then used "UprightDiff" to analyse the result. This takes a lot of cpu cycles, memory, and disk space, and it takes 2 days for one round of testing to complete. This limits the size of the testing corpus but we believe 64K is a sizeable sample to figure out the kind of fixes necessary.

To minimize the differences and reduce the impact of fixes that would be needed from editors, we added some additional Tidy-compatibility fixups. Since we found that self-closing tags were extremely common on wikimedia wikis, we added a compatibility fix to treat them as empty tags (i.e.  is treated as  ). We added some other compatibility passes as well. After all these fixes were in place and we repeated our tests, we found that 93.4% of pages had no changes in rendering. And, 96.9% of pages had either no pixel diffs (93.4%) or insignificant vertical whitespace shifts only (3.5% = 96.9 - 93.4). The remaining 3.1% pages (100 - 96.9) showed pixel differences that had other reasons.

Based on these tests, we identified several classes of markup errors that will render differently between the two. For one class of markup errors (self-closing tags that aren't valid in HTML5), we added a maintenance category that editors have already been using to fix up templates and pages. But, the other classes of markup errors are not easy to detect automatically at this time and editors' assistance is necessary to identify and fix them up.

Qu'arrivera-t-il ? Quand ces modifications surviendront-elles ?
Comme indiqué antérieurement, nous ne pouvons cependant pas faire un remplacement ponctuel de Tidy avec un outil basé sur HTML5. We have added a maintenance category for one class of markup errors, which will help editors fix them up. To help editors identify and fix up other markup errors, we also built a ParserMigration extension that helps them compare output in production and fix markup errors. De plus, avons aussi réalisé l'extension Linter pour identifier quelques-unes des corrections nécessaires.

As of end March 2017, we deployed the ParserMigration extensions to all wikis. As of June 20, 2017, Linter has been deployed to all large wikis. Via these extensions, we hope to enable editors fix up pages for actually replacing Tidy in 2017. Once enough fixes are made and we are satisfied that the impact on editors and readers will be minor and tolerable, we will go ahead and replace Tidy Mais aussi, nous n'aimerions pas tirer ceci indéfiniment. Il serait donc idéal si les problèmes de priorité haute identifiés par l'extension Linter soient traités en premier lieu par les éditeurs are prioritized by editors.

Separately, the Tidy-compatibility fixups (mentioned in the previous section) are meant to be in place till we replace Tidy. After that, we will start replacing these fixups gradually while relying on similar testing and tooling support.

Que devront faire les éditeurs ?
L'extension Linter a été déployée sur tous les wikis. Comme il est dit sur la page d'aide, veuillez corriger les patterns wikitext et les modèles identifiés parmi les  catégories de haute priorité sur la page Special:LintErrors de votre wiki. Chaque élément dans cette catégorie a une page d'aide avec des exemples qui indiquent ce qui doit être corrigé. Voir les instructions simplifiées ci-dessous.

Pour aider les éditeurs dans la migration de leur texte wiki et vérifier les corrections qu'ils ont faites, nous avons déployé l'extension ParserMigration. Si vous activez "ParserMigration" dans vos préférences, un liens sera ajouté à la boîte à outils de tous les articles, qui pourra montrer côte à côte les sorties courantes (Tidy) et attendues (RemexHTML). Vous pouvez prévisualiser les modifications du texte des articles dans la même vue côte à côte et voir comment vos mises à jour modifient / corrigent l'affichage.

Quel est l'impact sur mon wiki ?
Please see the wikitext deprecation tool for this information. Note that in several cases the count refers to pages affected, but that includes template-generated issues. In other words, once a given template is fixed, all the pages featuring that template will disappear from the list. So it's highly likely that you won't have to fix thousands or millions pages, but a handful of templates. We are working to make this even more clear. You can also see some progress on Parsing/Replacing Tidy/Linter/Stats/July31, and the results of visual differences tests on a sample of ~73K articles from 60 wikis at http://mw-expt-tests.wmflabs.org/.

Instructions simplifiées pour corriger les pages
Here are some simplified instructions for handling all the high-priority linter categories. Some of the linked help pages may contain instructions to use assisting tools such as WPCleaner.

Supprimer OU corriger les tables mal imbriquées
In this example, Tidy will delete Table 2 above. But, RemexHTML will not delete that table. This can change how pages look. To prevent this, editors should fix the wikitext and remove Table 2. While the following row-tag need not be removed, we recommend removing it. Since the closing table tag is no longer needed, it should be removed as well.

Alternatively, add an explicit  cell on the row started by the previous line before the start of Table 2 if you need nested tables. What is the correct fix depends on the page. But, in most cases deletion as above is going to be the right fix.


 * Help:Extension:Linter/deletable-table-tag

Palliatif pour corriger un bogue de l'analyseur dans les cas de repli de paragraphe
On most wikis, it looks like the biggest generator of these linter warnings are the nowrap or nowrap begin templates. The simplest fix that will handle the vast majority of these linter cases is to add a newline before the opening &lt;span&gt; tag in the template source of these templates.

In all other cases, when wikitext has a span next to a div/td (and other such "block" tags) and has a  CSS property, please add a newline after the div tag.


 * Help:Extension:Linter/pwrap-bug-workaround

Corriger les balises auto-fermantes invalides
Self-closing tags like &lt;div/>, &lt;span/>, &lt;b/>, etc are not valid in HTML5. They need to be fixed according to what the editor intent might have been. In some cases, it is a typo where a &lt;/b> is intended. In other cases, they need to be deleted. In some other cases, they need to be replaced with a &lt;nowiki/>. Please see the detailed help page for this category.


 * Help:Extension:Linter/self-closed-tag

Caveats

 * We don't yet know how to automatically determine all affected pages and templates, but we expect that the list of templates and suggestions will get more complete as we learn about problems that were not caught before. Community Liaisons will reach out to template experts, regulars at technical village pumps, etc. to make sure they become aware of the necessary changes and of resources that may help them.
 * We are thinking about other easy, sustainable ways in which we may support "gnomes" and all the Wikimedians willing to help with this effort.

autres FAQs
Q: Qu'arrive-t-il à  et aux balises similaires si elles sont utilisées dans une forme auto-fermante ?

A: As noted in T134423, the only valid self-closed HTML tags are:,  ,  ,  ,  ,  ,  ,  ,  ,  ,  ,  ,  ,  ,. Non-HTML self-closing tags (like  and  ) are not affected by this change. is a special case because while it is a HTML tag, MediaWiki treats it like an extension tag and hence remains unaffected. All other self-closing HTML tags should be fixed (and are already being fixed by editors at this time).

Since this usage is found in a lot of pages in our testing, in order to prevent unexpected rendering effects (e.g.  being treated as    and causing more text being bolded than intended), we added a fix to the parser to convert them to an empty tag (eg.   will be converted to  ). But, we don't intend to retain this fix indefinitely. So, we would like for editors to continue fixing this deprecated usage.

Q: Quels sont les résultats des tests pour les langues différentes de l'anglais, ou sur les projets frères ?

A: There is nothing in what Tidy, RemexHTML do that is specific to Wikipedia or English. This project is primarily about a change from HTML4 to HTML5 semantics and getting rid of some Tidy cleanups of HTML. These changes affects all projects and languages equally, except if some projects and languages tend to have more markup errors or use more self-closed tags than others.

'''Q: Quels sont les autres changements que les éditeurs pourraient voir, après ce remplacement ? '''

A: The effect of this replacement is primarily going to affect readers, as they may notice that the page doesn't look right (for example, excessively wide navboxes without line breaks or wrapping). However, if anything, this might lead to the rendering seen in VisualEditor to match the rendering seen outside it much more than before, since Parsoid's output has been HTML5-compliant since the beginning, and we are now moving the read output to HTML5. We do not expect any impact on VisualEditor edits, but we will promptly address any bugs reported with respect to dirty diffs. In addition, we do not plan to add any error messages or warnings displayed on pages if the markup errors are not fixed.

Q: How does the replacement relate to other projects you are working on?

A: By enabling the move to HTML5 semantics, this is one of the steps evolving markup in our corpus to keep up with web standard. We also expect to leverage this tool to support well-balanced template output. Separately, but relatedly, this will also make the output of the PHP parser (used for reads) and the output of Parsoid (used for edits in VE, Content Translation) more consistent since Parsoid already uses HTML5 semantics. One of our goals is to make the two outputs fully consistent with each other and use one parser for both reads and edits.

--L'Equipe de l'Analyseur

Volunteers available to support this effort
''Community Liaisons invite interested Wikimedians to please add their name in the sections below and support their community engagement efforts. Thank you. As with similar past initiatives, signing up is optional.'' Please see Parsing/Get_involved, and add it to your bookmarks, as future requests for assistance will go through that page!
 * 1) I am available to test with the ParserMigration tool.
 * 2) (ajoutez votre signature ici)
 * 3) Je suis disponible pour corriger les modèles.
 * 4) Jonesey95 (talk) 16:10, 14 November 2016 (UTC)
 * 5) Samuele2002 (talk)
 * 6) TheDragonFire (talk)
 * 7) Stryn (talk) 14:22, 17 July 2017 (UTC)
 * 8) Je suis disponible pour étudier et discuter des corrections concernant les modèles.
 * 9) Jonesey95 (talk) 16:10, 14 November 2016 (UTC)
 * 10) Je suis disponible pour diffuser les informations à ma communauté.
 * 11) (See this page) --Sannita (talk) 18:02, 8 July 2017 (UTC)
 * 12) See this page. Stryn (talk) 14:22, 17 July 2017 (UTC)
 * 1) (See this page) --Sannita (talk) 18:02, 8 July 2017 (UTC)
 * 2) See this page. Stryn (talk) 14:22, 17 July 2017 (UTC)