Html5Depurate

Html5Depurate is a web service which takes potentially invalid HTML as input, parses it using the HTML5 parsing algorithm, and outputs the resulting document using an XHTML serialization.

It was proposed as a replacement for Tidy in MediaWiki and was developed as part of task T89331.

It is written in Java, so that it can use the excellent validator.nu parser by Henri Sivonen and the Mozilla Foundation.

In consideration of third-party users, RemexHTML, a PHP-only HTML5 parsing library, has been used as the basis for a Tidy replacement instead of Html5Depurate.

Package installation
Packages for Ubuntu Trusty and Debian Jessie are available from apt.wikimedia.org. These can be installed as follows. For Jessie:

apt-get install apt-transport-https
echo deb https://apt.wikimedia.org/wikimedia/ jessie-wikimedia main >> /etc/apt/sources.list.d/wikimedia.list
apt-get update
apt-get install html5depurate

For Trusty:

apt-get install apt-transport-https
echo deb https://apt.wikimedia.org/wikimedia/ trusty-wikimedia main >> /etc/apt/sources.list.d/wikimedia.list
apt-get update
apt-get install html5depurate

The service will automatically start on localhost:4339. The package is reasonably secure, since it sets up a new unprivileged user for the daemon, and uses a very restrictive Java security policy.
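To confirm that the daemon came up after installation, a TCP probe of the listening port is usually enough. The following is a minimal sketch, not part of the package: the function name is_listening and the two-second timeout are illustrative choices, and the host and port are the defaults stated above.

```python
import socket


def is_listening(host: str = "localhost", port: int = 4339, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        # create_connection resolves the host and attempts a full TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


if __name__ == "__main__":
    print("html5depurate reachable:", is_listening())
```

A successful probe only shows that something accepts connections on the port; it does not verify that the process behind it is Html5Depurate.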

Note that the package uses Maven Central during its build process, so the source package does not contain all the relevant source files.

Compilation
The source can be obtained with

git clone https://gerrit.wikimedia.org/r/mediawiki/services/html5depurate

Install Maven, JDK 7 and jsvc. Compile using:

mvn package

This will download all dependencies from Maven Central, compile, test, and generate a single .jar file which bundles all dependencies. The jar file will appear in the target directory, with a filename that depends on the current version. For testing as a foreground process, you can use something like:

java -cp target/html5depurate-1.1-SNAPSHOT.jar \
    org.wikimedia.html5depurate.DepurateDaemon
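Because the jar filename changes with the version, a launcher script may prefer to locate the jar rather than hard-code 1.1-SNAPSHOT. The sketch below assumes the jar matches the pattern html5depurate-*.jar in the target directory and picks the most recently modified match; the helper name foreground_command is hypothetical, while the main class is the one given above.

```python
import glob
import os
from typing import List

MAIN_CLASS = "org.wikimedia.html5depurate.DepurateDaemon"


def foreground_command(target_dir: str = "target") -> List[str]:
    """Build the `java -cp <jar> <main class>` argument list for the newest jar."""
    pattern = os.path.join(target_dir, "html5depurate-*.jar")
    # Sort by modification time so the last element is the latest build.
    jars = sorted(glob.glob(pattern), key=os.path.getmtime)
    if not jars:
        raise FileNotFoundError(
            f"no html5depurate jar in {target_dir!r}; run `mvn package` first"
        )
    return ["java", "-cp", jars[-1], MAIN_CLASS]
```

The returned list can be passed directly to subprocess.run(foreground_command(), check=True) to start the daemon in the foreground.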

To run it in the background, you can use jsvc, for example:

/usr/bin/jsvc \
    -cp $PWD/target/html5depurate-1.1-SNAPSHOT.jar \
    -pidfile /tmp/html5depurate.pid \
    -errfile /tmp/html5depurate.err \
    -outfile /tmp/html5depurate.out \
    -procname html5depurate \
    org.wikimedia.html5depurate.DepurateDaemon

Or check out the  branch for fully baked SysV init scripts.

Configuration
Configuration options may be specified in /etc/html5depurate/html5depurate.conf. Possible configuration options and their default values are documented below:

# Max POST size, in bytes.
maxPostSize = 100000000

# Host or IP and port on which Html5depurate will listen.
host = localhost
port = 4339
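Tooling that needs the effective settings (for example, a monitoring script that must know which port to probe) can overlay the file on the documented defaults. This is a rough sketch that assumes one key = value pair per line and # for comments; it is not the service's own parser, which may treat edge cases differently. The names DEFAULTS and load_config are illustrative.

```python
from typing import Dict

# Default values as documented above, kept as strings exactly as written in the file.
DEFAULTS: Dict[str, str] = {
    "maxPostSize": "100000000",
    "host": "localhost",
    "port": "4339",
}


def load_config(text: str) -> Dict[str, str]:
    """Overlay `key = value` lines from a config file's text onto the defaults."""
    config = dict(DEFAULTS)
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, sep, value = line.partition("=")
        if not sep:
            continue  # ignore lines that are not key = value pairs
        config[key.strip()] = value.strip()
    return config
```

For instance, load_config(open("/etc/html5depurate/html5depurate.conf").read())["port"] yields the configured port, or 4339 if the file does not set one.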

It's advisable to also configure Java's logging service. For example, the Debian package uses the following logging.properties file:

handlers = java.util.logging.FileHandler
.level = INFO

java.util.logging.FileHandler.pattern = /var/log/html5depurate/html5depurate.log
java.util.logging.FileHandler.formatter = java.util.logging.SimpleFormatter
java.util.logging.FileHandler.append = true
java.util.logging.SimpleFormatter.format = %1$tF %1$tT %4$s: %5$s %6$s%n

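Then run Java with the java.util.logging.config.file system property set to the path of this file (that is, -Djava.util.logging.config.file= followed by the path).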

Client configuration
MediaWiki can be configured to use this service by putting the following in LocalSettings.php:

$wgUseTidy = false;
$wgTidyConfig = array(
	'driver' => 'Html5Depurate'
);

To instruct Html5Depurate to provide backwards compatibility with Tidy as far as is possible, use the compat/document API endpoint:

$wgTidyConfig['url'] = 'http://localhost:4339/compat/document';

Maintainer notes

 * /Updating apt.wikimedia.org
 * https://github.com/wikimedia/html5depurate