Thread:Talk:How to contribute/Missing pieces/reply (6)

Parsing wikitext yourself is almost impossible, but the XML dumps at dumps.wikimedia.org contain only wikitext. So if you need HTML instead of wikitext, you have three options:


 * 1) use the HTML dumps (though, inexplicably, there haven't been new ones in years);
 * 2) scrape the live site (served from the Squid HTML cache layer); or
 * 3) use the API's action=parse (often quite slow for a single page, and even worse across millions of requests).
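
To make option 3 concrete, here is a rough sketch of building an action=parse request and pulling the rendered HTML out of the reply. The en.wikipedia.org endpoint and the helper names (build_parse_url, extract_html) are just my assumptions for illustration; the JSON shape is what the default format=json output looks like as far as I know.

```python
from urllib.parse import urlencode


def build_parse_url(title, api_url="https://en.wikipedia.org/w/api.php"):
    """Return an action=parse URL asking for the rendered HTML of one page.

    api_url is an assumed example endpoint; point it at whichever wiki you need.
    """
    params = {
        "action": "parse",  # render wikitext to HTML on the server
        "page": title,      # page title to render
        "prop": "text",     # only return the HTML body
        "format": "json",
    }
    return api_url + "?" + urlencode(params)


def extract_html(reply):
    """Pull the HTML string out of a decoded action=parse JSON reply.

    Assumes the default JSON layout, where the HTML sits under parse.text["*"].
    """
    return reply["parse"]["text"]["*"]
```

Each call renders one page, so this only really makes sense for small batches.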

I think we should be clearer about how to do option 2: for example, by specifying an appropriately descriptive User-Agent and by not slamming the site with too many requests in too short a period of time (this is how bots and spiders get blocked). We should also discuss iframing, hot-linking, etc. This quickly gets into Meta-Wiki territory, though.
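
As a starting point for that kind of guidance, here is a minimal sketch of a "polite" fetcher: it sends a descriptive User-Agent and enforces a minimum gap between requests. The class name, the example User-Agent string, and the one-second default are all my own placeholders, not official limits; whatever numbers we document should come from the actual policy.

```python
import time
import urllib.request


class PoliteFetcher:
    """Fetch pages with an identifying User-Agent and a minimum delay between requests."""

    def __init__(self, user_agent, min_interval=1.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.user_agent = user_agent      # should say who you are and how to reach you
        self.min_interval = min_interval  # seconds between requests (assumed value)
        self._clock = clock               # injectable so the throttle can be tested
        self._sleep = sleep
        self._last = None                 # time of the previous request, if any

    def build_request(self, url):
        """Build a request carrying the descriptive User-Agent header."""
        return urllib.request.Request(url, headers={"User-Agent": self.user_agent})

    def throttle(self):
        """Sleep just long enough to keep min_interval between consecutive requests."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now

    def fetch(self, url, timeout=30):
        """Wait if needed, then download the page body as bytes."""
        self.throttle()
        with urllib.request.urlopen(self.build_request(url), timeout=timeout) as resp:
            return resp.read()
```

Usage would be something like PoliteFetcher("ExampleBot/0.1 (https://example.org/bot; bot@example.org)").fetch(url) in a plain sequential loop, which also avoids hitting the site with parallel requests.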