User:BDavis (WMF)/Notes/Finding Files To Purge

Trying to help with 54647.

Some unspecified network error caused htcp purge messages to be lost for an extended period of time. Once the network issue was resolved we needed to purge a lot of potentially stale pages from the varnish cache layer.

Figuring out what to do

 * Reedy knew that the purgeList.php maintenance script could be used to send the purges
 * Nik knew where to find a query that would grab all pages changed in a rate range
 * I volunteered to try and bring the two together.

Finding the pages
Nik's sql query came in the form of php code from CirrusSearch. I took that and put it into phpsh to get the generated sql. $minUpdate = new MWTimestamp( '2013-09-22T00:00:00Z' ); $maxUpdate = new MWTimestamp( '2013-09-26T00:00:00Z' ); $dbr = wfGetDB( DB_SLAVE ); $minId = $dbr->addQuotes( '-1' ); $minUpdate = $dbr->addQuotes( $dbr->timestamp( $minUpdate ) ); $maxUpdate = $dbr->addQuotes( $dbr->timestamp( $maxUpdate ) ); $res = $dbr->selectSQLText(              array( 'page', 'revision' ),               array_merge( array( 'page_touched', 'page_counter', 'page_restrictions' ), Revision::selectPageFields, Revision::selectFields ),              'page_id = rev_page'               . ' AND rev_id = page_latest'               . " AND ( ( $minUpdate = rev_timestamp AND $minId < page_id ) OR $minUpdate < rev_timestamp )"               . " AND rev_timestamp <= $maxUpdate",               // Note that redirects are allowed here so we can pick up redirects made during search downtime               'INTERACTIVE',               array( 'ORDER BY' => 'rev_timestamp, rev_page')           );

The resulting sql was: SELECT page_touched ,page_counter ,page_restrictions ,page_namespace ,page_title ,page_id ,page_latest ,page_is_redirect ,page_len ,rev_id ,rev_page ,rev_text_id ,rev_timestamp ,rev_comment ,rev_user_text ,rev_user ,rev_minor_edit ,rev_deleted ,rev_len ,rev_parent_id ,rev_sha1 ,rev_content_format ,rev_content_model FROM `page` ,`revision` WHERE page_id = rev_page AND rev_id = page_latest AND (   ( '20130922000000' = rev_timestamp AND '-1' < page_id )    OR '20130922000000' < rev_timestamp  ) AND rev_timestamp <= '20130926000000' ORDER BY rev_timestamp, rev_page

This was TMI but led me to formulate a simpler version that was "good enough for now". I ran that on tools-lab against the enwiki and dewiki mirrors:

mysql --defaults-file="${HOME}/replica.my.cnf" -h enwiki.labsdb enwiki_p -e "SELECT page_title FROM page, revision WHERE page_id = rev_page AND rev_id = page_latest   AND page_namespace = 0  AND (    ( '20130922000000' = rev_timestamp AND '-1' < page_id )    OR '20130922000000' < rev_timestamp  )  AND rev_timestamp <= '20130926000000'
 * " > enwiki-misses.txt

mysql --defaults-file="${HOME}/replica.my.cnf" -h dewiki.labsdb dewiki_p -e "SELECT page_title FROM page, revision WHERE page_id = rev_page AND rev_id = page_latest   AND page_namespace = 0  AND (    ( '20130922000000' = rev_timestamp AND '-1' < page_id )    OR '20130922000000' < rev_timestamp  )  AND rev_timestamp <= '20130926000000'
 * " > dewiki-misses.txt

Massaging the data
Now that I had dumps of page titles I needed to turn them into full urls for Reedy to use to do the purges:

First, URLEncode the titles: perl -pi.raw -e 'chomp;s/([^A-Za-z0-9])/sprintf("%%%02X", ord($1))/seg;print "\n"' enwiki-misses.txt dewiki-misses.txt

Then turn them into full URLs and make sure there aren't any dupes: perl -pi.enc -e 's@^@http://de.wikipedia.org/@' dewiki-misses.txt mv dewiki-misses.txt dewiki-misses.txt.url sort -u dewiki-misses.txt.url > dewiki-misses.txt

perl -pi.enc -e 's@^@http://en.wikipedia.org/@' enwiki-misses.txt mv enwiki-misses.txt enwiki-misses.txt.url sort -u enwiki-misses.txt.url > enwiki-misses.txt