Toolserver:~arnomane/WP-autoreviewer

Under construction!

The Wikipedia Autoreviewer is a web-based service that finds errors (such as typos and wrong formatting) in Wikipedia articles and proposes improvements. Most features currently work only for the German WP, but some are also available for English articles. It was originally written by Arnim Rupp, then published under the GPL3 license and transferred to the WP-Tool-Server. Further development is done by User:arnomane and the community.

There's also a version called Autoreview-spider which crawls all wiki-links from one page and creates one big table with all the results (example: http://de.wikipedia.org/wiki/Benutzer:Rupp.de/Autoreview_Top100 (German)). It's in beta stage but more or less works; there wasn't much demand, so no more time was invested in it.

It's written in Perl 5.8.6 on Linux. The source code contains UTF-8 characters, so developers have to make sure their text editor handles them correctly.

Unsolved problems
I had to change usr/lib/perl5/5.8.6/unicore/CaseFolding.txt and SpecialCasing.txt to prevent Perl from matching ß to ss, so that I can look for "daß" without finding "dass" (as far as I could find out, this matching is not a bug but the official pattern matching for UTF-8). In both files I commented out the line starting with 00DF and ran "perl mktables" in the same directory. This change of course affects the whole server! I hope somebody finds a better way.
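
One possible alternative that leaves the server-wide Unicode tables alone is to keep the ß out of the case-insensitive part of the pattern. The following is only a minimal sketch and has not been tried inside the autoreviewer; the helper name contains_sharp_s_dass is made up for this example.

```perl
use strict;
use warnings;
use utf8;    # this source file itself contains the UTF-8 character ß

# Hypothetical helper, not part of the autoreviewer: returns 1 if $text
# contains the old spelling "daß" (with any capitalisation of "da"), else 0.
sub contains_sharp_s_dass {
    my ($text) = @_;
    # Only "da" sits inside the case-insensitive group (?i:...). The ß is
    # outside it and is therefore compared literally, without case folding.
    return $text =~ /(?i:da)ß/ ? 1 : 0;
}
```

Whether this is practical depends on how the check patterns are built in the real code, since every pattern containing ß would need the same treatment.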

Source Code Quality
The quality of the source code might look OK at a quick glance, but the underlying structure, i.e. how things are parsed, is quite horrible. The reason is that I started this out as a tool to remove the wiki-links to years and dates. Then came the next idea, and the next and the next, and now ~45 checks are done. Also, I mostly had only 1-2 hours per day for coding, so rewriting larger parts was quite hard.

Someday someone should do a complete rewrite using lexical analysis, which would overcome many of the shortcomings of this version. I guess there are libraries available to parse MediaWiki source.

WP-autoreview.pl
Main program and web-service.

config.pm
Name says it all.

autoreview.pm
Subroutines, also used by Autoreview-spider

Bad-Word-Files
These contain the "bad words" to be marked. There's one file per language; the language is encoded in the last two letters of the file name before the extension:
 * abbreviations_de.txt
 * avoid_words_de.txt
 * avoid_words_en.txt
 * fill_words_de.txt
 * typo.list.de.txt
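
As a minimal sketch of that naming convention, the language code can be pulled out of a file name like this. The helper language_of_file and the hash %files_by_lang are made up for this example and are not taken from the autoreviewer source.

```perl
use strict;
use warnings;

# Hypothetical helper: extract the two-letter language code that precedes
# the ".txt" extension, e.g. "avoid_words_de.txt" -> "de".
sub language_of_file {
    my ($filename) = @_;
    # The code is separated from the rest of the name by "_" or ".".
    return $filename =~ /[_.]([a-z]{2})\.txt\z/ ? $1 : undef;
}

# Group some of the file names listed above by language.
my %files_by_lang;
for my $file (qw(abbreviations_de.txt avoid_words_de.txt avoid_words_en.txt
                 fill_words_de.txt typo.list.de.txt)) {
    my $lang = language_of_file($file);
    push @{ $files_by_lang{$lang} }, $file if defined $lang;
}
```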

Disambiguation / Begriffsklärungsseiten (BKL)
To be able to mark wikilinks to disambiguation pages, the autoreviewer needs a regularly updated list of these pages. The download is started with the script download_bkl.sh, which creates the file BKL_neu.txt. The script convert_BKL.pl then converts it into BKL_converted.txt, which is the file the autoreviewer actually uses. The downloading and converting should be done regularly to avoid wrong error messages.
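
How such a list might be used can be sketched as follows. This is not the autoreviewer's actual code: the sub names are made up, and the sketch assumes that BKL_converted.txt holds one page title per line, which the real file format may not match.

```perl
use strict;
use warnings;

# Read page titles from an open filehandle into a lookup hash.
# ASSUMPTION: one title per line (the real BKL_converted.txt may differ).
sub load_bkl_titles {
    my ($fh) = @_;
    my %bkl;
    while ( my $line = <$fh> ) {
        chomp $line;
        next if $line eq '';
        $bkl{$line} = 1;
    }
    return \%bkl;
}

# Return the targets of all [[wikilinks]] in $wikitext that are listed
# as disambiguation pages.
sub disambiguation_links {
    my ( $wikitext, $bkl ) = @_;
    my @hits;
    # Capture the link target up to "|", "#" or "]]".
    while ( $wikitext =~ /\[\[([^\]|#]+)/g ) {
        push @hits, $1 if $bkl->{$1};
    }
    return @hits;
}
```

A hash lookup per link keeps the check cheap even when the list of disambiguation pages is long.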

test.html
See Testing below.

Testing and Debugging
An HTML page called test.html is included which is used for self-testing the autoreviewer. It contains MediaWiki source and four different labels:

 * 1) GOOD:  line is ok and no error should be reported
 * 2) BAD:   line is bad and an error should be marked directly in the wiki-source
 * 3) SAINT: line is ok and no error should be shown in the general comments above the wiki-source
 * 4) EVIL:  line is bad and an error should be shown in the general comments above the wiki-source
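
The meaning of the four labels can be summarised as a small lookup table. This is only a sketch: the names %expectation and expectation_for_line are made up, and it assumes that a label appears in a test line followed by a colon, which may not be how test.html is really laid out.

```perl
use strict;
use warnings;

# What each label promises: is an error expected, and where is it reported?
my %expectation = (
    GOOD  => { error => 0, where => 'source'   },
    BAD   => { error => 1, where => 'source'   },
    SAINT => { error => 0, where => 'comments' },
    EVIL  => { error => 1, where => 'comments' },
);

# ASSUMPTION: the label appears in the line followed by a colon,
# e.g. "BAD: ...". Returns the expectation hash, or undef for unlabeled lines.
sub expectation_for_line {
    my ($line) = @_;
    return $line =~ /\b(GOOD|BAD|SAINT|EVIL):/ ? $expectation{$1} : undef;
}
```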

The test can be run using the web interface by ticking the checkbox "Testpage", or on the command line using "./WP-autoreview.pl test" (which is very useful to check whether recent changes broke something, which happens quite easily; see Source Code Quality above).

Debugging is controlled by some variables:
 * $debug
 * $debugcat
 * $debugdry
 * $developer

Performance
I had to invest quite some time in performance tuning, but long articles still take quite a while to get parsed. It's very important to compile the numerous regular expressions for the "bad words" once, when loading them, instead of on every use, e.g.:

push @avoid_words, qr/(\b$_\b)/;
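
A slightly fuller sketch of the same idea: compile once, then reuse the compiled objects for every line. The sub names compile_words and find_words are made up for this example and do not come from the autoreviewer source.

```perl
use strict;
use warnings;

# Compile each word once into a regex object. The pattern is the one shown
# above; words are interpolated unescaped, so they may themselves be regexes.
sub compile_words {
    my @words = @_;
    my @compiled;
    for (@words) {
        next if $_ eq '';
        push @compiled, qr/(\b$_\b)/;
    }
    return @compiled;
}

# Reuse the precompiled regexes for every line; returns all matched words.
sub find_words {
    my ( $line, @regexes ) = @_;
    my @found;
    for my $re (@regexes) {
        push @found, $1 while $line =~ /$re/g;
    }
    return @found;
}
```

With qr// the pattern is built when the word list is loaded, so the per-line loop does not have to interpolate and recompile it each time.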

I don't know how much speed could be gained by using mod_perl or mod_fcgi, but I guess it's not worth the effort.

For benchmarking parts of the autoreviewer the CPAN module "Benchmark::Timer" is used. To get output, set both $developer and $debug:

if ( $developer && $debug ) {
    print "#####".$bench->report('total')." ";
    print "#####".$bench->report('readfiles')." ";
}

Shortcomings

 * Lots of features are only available for German (one reason is that I couldn't find pages describing how articles should look that are as detailed as those for the German WP).
 * Some messages and variables in the code are in German (e.g. [BKL] is the German abbreviation for disambiguation pages).