Extension:ArchiveLinks/Project/status

From mediawiki.org

This page is designed as a status update/work log for ArchiveLinks so anyone who is interested can easily track the status of the project.

February 2012

  • After much stalling, contact has been made with Sumanah and the Internet Archive.

Projected Deadlines

    • February 9th, 2012 - Data feed of live new links made available to the Internet Archive
  • Feb 10th: Currently having problems accessing the toolserver, and the ts admins are swamped with outage problems. Putting the feed up live is delayed.
    • Feb 29th: By tomorrow this *will* be deployed.

March 2012

From March 12:

  • Also note that the link I provided returns an XML result, so you need to click "view source" to see anything.

From March 28:

  • The good news is I think the script actually works. I’m getting the select queries working and am at the point where I should be inserting the data.
  • The bad news is I can’t insert the data at all on the toolserver. I keep getting MySQL Error 1290, which Google tells me is a permissions-related error.
  • I’m not sure whether they disabled user queries because of replag or whether something else is going on. Apparently they did a schema change on the main cluster, and s1 (the English Wikipedia Toolserver cluster) is replagged by a week and a few days. It’s also kind of difficult to find help at 4 in the morning. The problem persists even in phpMyAdmin, so I know it’s not just my code....

(From later on the 28th):

  • I think I figured out a workaround (using a different cluster), but I didn't have time to implement it. I'll fix the lingering problems and have it ready by Monday morning...

April 2012

So apparently about 6 hours later I did get an answer as to why the original toolserver cluster I was using didn't work:
[10:37] <Dispenser> [#wikimedia-toolserver] kevin_brown: 1290 mean the database is in read-only mode. The WMF is adding a column that'll be done sometime next month.
Anyhow, I have switched to sql.toolserver.org, and the feed (with actual data in the database!) is now up at:
http://toolserver.org/~nn123645/toolserver-feed/index.php
It uses the same query parameters as the ArchiveLinks extension. I'm also thinking of renaming the folder to something more descriptive, such as "en-wiki-link-feed" or something along those lines. Any ideas?

//I do have a few names, "WikiDataFeed" or "En-Wiki-Data-Flow". Or something like that. Cheers!
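As a side sketch of the error discussed above: MySQL server error 1290 is ER_OPTION_PREVENTS_STATEMENT, which the server raises when an option such as --read-only blocks a statement. A script could tell that case apart from a real permissions failure before giving up on an insert. The snippet below is in Python rather than the feed script's PHP, and the function name and category strings are hypothetical, not taken from the actual code.

```python
# Hypothetical helper; nothing here is taken from the real feed script.
ER_OPTION_PREVENTS_STATEMENT = 1290  # MySQL server error code seen in the log


def classify_mysql_error(errno, message=""):
    """Return a rough category for a failed write, based on the error code."""
    if errno == ER_OPTION_PREVENTS_STATEMENT:
        # The server text names the blocking option, e.g. "--read-only".
        if "read-only" in message.lower():
            return "read-only"
        return "option-prevents-statement"
    return "other"


msg = ("The MySQL server is running with the --read-only option "
       "so it cannot execute this statement")
print(classify_mysql_error(1290, msg))
```

With a check like this, a "read-only" result could trigger a fallback to a writable host (as was done here by moving to sql.toolserver.org) instead of being treated as a bug in the script.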

I will set the script up as a cron job to run every 20 seconds and pull 100 pages at a time....
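One thing to keep in mind for the 20-second plan: standard cron schedules at one-minute granularity, so a sub-minute interval usually needs a small wrapper that cron starts once a minute and that spaces the runs itself. The following is only a sketch of that idea in Python (not the project's PHP); the function names and the `job` callback are assumptions made for illustration.

```python
import time


def offsets_within_minute(interval_seconds):
    """Start offsets (in seconds) for the runs inside one 60-second cron slot."""
    if interval_seconds <= 0 or 60 % interval_seconds != 0:
        raise ValueError("interval must be a positive divisor of 60")
    return list(range(0, 60, interval_seconds))


def run_slot(job, interval_seconds=20, sleep=time.sleep, clock=time.monotonic):
    """Call job() at each offset within the slot; return the number of runs.

    sleep and clock are injectable so the spacing can be checked without waiting.
    """
    start = clock()
    runs = 0
    for offset in offsets_within_minute(interval_seconds):
        delay = start + offset - clock()
        if delay > 0:
            sleep(delay)
        job()  # hypothetical: pull one batch of 100 pages
        runs += 1
    return runs
```

A crontab entry of `* * * * *` pointing at such a wrapper would give runs at roughly :00, :20 and :40 of every minute. If a single batch takes longer than the interval, the later runs simply start late rather than in parallel within the same slot.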

Later on 1 April:

As a side note, I'm still working on cron; I'm having some issues with the scheduling.
For the time being, if you hit http://toolserver.org/~nn123645/toolserver-feed/cronscript.php it will pull 100 pages and insert all the new links on those pages into the db. (NOTE: This will take around a minute or more, so don't close the connection or navigate away from that page. Also be careful about hitting this link: it is not set up to prevent concurrent instances, and there could be issues arising from that. As you might imagine, this is resource intensive, so please don't open it in more than one tab/connection.)
There are currently 11k rows in the db...
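Regarding the concurrency warning above, one common way to keep two instances from running at once is a non-blocking advisory file lock: the second instance sees the lock is held and exits immediately. This is a minimal sketch in Python (not the PHP of cronscript.php), assuming a Unix host where `fcntl.flock` is available; the lock file name and the `single_instance` helper are hypothetical.

```python
import fcntl
import os
import tempfile
from contextlib import contextmanager

# Hypothetical lock location; any path writable by the script would do.
LOCK_PATH = os.path.join(tempfile.gettempdir(), "toolserver-feed.lock")


@contextmanager
def single_instance(lock_path=LOCK_PATH):
    """Yield True if this process got the lock, False if another holder has it."""
    handle = open(lock_path, "w")
    try:
        try:
            # LOCK_NB makes flock fail at once instead of waiting for the holder.
            fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            yield False
            return
        try:
            yield True
        finally:
            fcntl.flock(handle, fcntl.LOCK_UN)
    finally:
        handle.close()
```

Usage would look like `with single_instance() as got_lock:` followed by an early exit when `got_lock` is false, so a second tab or an overlapping cron run does no work. The lock is released automatically if the process dies, which avoids stale lock files blocking later runs.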