User:BDavis (WMF)/Notes/Finding Files To Purge

Trying to help with 54647.

Some unspecified network error caused htcp purge messages to be lost for an extended period of time. Once the network issue was resolved we needed to purge a lot of potentially stale pages from the varnish cache layer.

Figuring out what to do

 * Reedy knew that the purgeList.php maintenance script could be used to send the purges
 * Nik knew where to find a query that would grab all pages changed in a rate range
 * I volunteered to try and bring the two together.

Finding the pages
Nik's sql query came in the form of php code from CirrusSearch. I took that and put it into phpsh to get the generated sql. $minUpdate = new MWTimestamp( '2013-09-22T00:00:00Z' ); $maxUpdate = new MWTimestamp( '2013-09-26T00:00:00Z' ); $dbr = wfGetDB( DB_SLAVE ); $minId = $dbr->addQuotes( '-1' ); $minUpdate = $dbr->addQuotes( $dbr->timestamp( $minUpdate ) ); $maxUpdate = $dbr->addQuotes( $dbr->timestamp( $maxUpdate ) ); $res = $dbr->selectSQLText(              array( 'page', 'revision' ),               array_merge( array( 'page_touched', 'page_counter', 'page_restrictions' ), Revision::selectPageFields, Revision::selectFields ),              'page_id = rev_page'               . ' AND rev_id = page_latest'               . " AND ( ( $minUpdate = rev_timestamp AND $minId < page_id ) OR $minUpdate < rev_timestamp )"               . " AND rev_timestamp <= $maxUpdate",               // Note that redirects are allowed here so we can pick up redirects made during search downtime               'INTERACTIVE',               array( 'ORDER BY' => 'rev_timestamp, rev_page')           );

The resulting sql was: SELECT page_touched ,page_counter ,page_restrictions ,page_namespace ,page_title ,page_id ,page_latest ,page_is_redirect ,page_len ,rev_id ,rev_page ,rev_text_id ,rev_timestamp ,rev_comment ,rev_user_text ,rev_user ,rev_minor_edit ,rev_deleted ,rev_len ,rev_parent_id ,rev_sha1 ,rev_content_format ,rev_content_model FROM `page` ,`revision` WHERE page_id = rev_page AND rev_id = page_latest AND (   ( '20130922000000' = rev_timestamp AND '-1' < page_id )    OR '20130922000000' < rev_timestamp  ) AND rev_timestamp <= '20130926000000' ORDER BY rev_timestamp, rev_page

This was TMI but led me to formulate a simpler version that was "good enough for now". I ran that on tools-lab against the enwiki and dewiki mirrors:

ssh tools-login.wmflabs.org become bd808-test

mysql --defaults-file="${HOME}/replica.my.cnf" -h enwiki.labsdb enwiki_p -e "SELECT page_title FROM page, revision WHERE page_id = rev_page AND rev_id = page_latest   AND page_namespace = 0  AND (    ( '20130922000000' = rev_timestamp AND '-1' < page_id )    OR '20130922000000' < rev_timestamp  )  AND rev_timestamp <= '20130926000000'
 * " > enwiki-misses.txt

mysql --defaults-file="${HOME}/replica.my.cnf" -h dewiki.labsdb dewiki_p -e "SELECT page_title FROM page, revision WHERE page_id = rev_page AND rev_id = page_latest   AND page_namespace = 0  AND (    ( '20130922000000' = rev_timestamp AND '-1' < page_id )    OR '20130922000000' < rev_timestamp  )  AND rev_timestamp <= '20130926000000'
 * " > dewiki-misses.txt

Massaging the data
Now that I had dumps of page titles I needed to turn them into full urls for Reedy to use to do the purges:

First, URLEncode the titles: perl -pi.raw -e 'chomp;s/([^A-Za-z0-9])/sprintf("%%%02X", ord($1))/seg;print "\n"' enwiki-misses.txt dewiki-misses.txt

Then turn them into full URLs and make sure there aren't any dupes: perl -pi.enc -e 's@^@http://de.wikipedia.org/@' dewiki-misses.txt mv dewiki-misses.txt dewiki-misses.txt.url sort -u dewiki-misses.txt.url > dewiki-misses.txt

perl -pi.enc -e 's@^@http://en.wikipedia.org/@' enwiki-misses.txt mv enwiki-misses.txt enwiki-misses.txt.url sort -u enwiki-misses.txt.url > enwiki-misses.txt

Now What?
After a lot of futzing around I scp'd the files to a web accessible location where Reedy picked them up and applied them.

Heres the start of a script that may help with this the next time it happens.
 * 1) !/usr/bin/env bash
 * 2) Find all wiki pages changed between a given start and end date.
 * 3) TODO:
 * 4) - Add command line flags
 * 5) - Add namespace support

LANG=${1:-en} WIKI=${LANG}wiki START_TS=20130922000000 END_TS=20130926000000

mysql --defaults-file="${HOME}/replica.my.cnf" -h ${WIKI}.labsdb \ ${WIKI}_p --skip-column-names -e " SELECT page_title FROM page, revision WHERE page_id = rev_page AND rev_id = page_latest  AND page_namespace = 0  AND (    ( '${START_TS}' = rev_timestamp AND '-1' < page_id )      OR '${START_TS}' < rev_timestamp  )  AND rev_timestamp <= '${END_TS}' perl -e 'while () {  next if (/^\s+$/);  chomp;  s/([^A-Za-z0-9])/sprintf("%%%02X", ord($1))/seg;  print $_, "\n"; }'| perl -p -e "s@^@http://${LANG}.wikipedia.org/wiki/@" | sort -u > ${WIKI}-changed.txt
 * 1) NOTE: only gets changes to namespace 0 (Main)


 * 1) vim:sw=2 ts=2 sts=2 et: