Topic on Talk:Quarry

Need help with finding most transcluded unprotected pages

7
Jarekt (talkcontribs)

I am trying to make sure some of the most transcluded pages (hopefully templates and modules) on Commons are protected. The first step would be to find some number (lets say 100) most transcluded pages and than look up their protection level, So I started with


use commonswiki_p;
SELECT tl_namespace, tl_title, count(*) as tot
FROM templatelinks
group by tl_namespace, tl_title
order by tot desc
limit 100

Which was killed after half an hour (see https://quarry.wmflabs.org/query/8653). Anybody knows what is wrong with this query and how to fix it? The results should mirror c:Special:MostTranscludedPages.

BDavis (WMF) (talkcontribs)

Trying the query out in the sql-optimizer tool shows that there are 1,605,698,124 (yes, 1.6 billion!) rows to process. The group by and order by clauses force the database server to copy all of the data into temporary storage and then perform a file sort on that storage. The limit clause will only change the data returned to your client, so all the rows have to be grouped and then the counts that were computed for each group all have to be sorted and finally only the first 100 results will be returned.

Your best best for getting this sort of data is to use the Action API's query action to get the same list you see on Special:MostTranscludedPages in a machine readable format: https://commons.wikimedia.org/wiki/Special:ApiSandbox#action=query&format=json&list=querypage&utf8=1&formatversion=2&qppage=Mostlinkedtemplates

Jarekt (talkcontribs)

Yes sorting 1.6 billion pages is a big job. My plan was to use that query as a piece of larger query where I than look up protection level of each page. Maybe I should be counting transclusions of unprotected templates (and latter modules) and show the ones with over some number (1M or 500k). So limit the scope first and than count.

Zhuyifei1999 (talkcontribs)
Zhuyifei1999 (talkcontribs)

quarry:query/31088 would be a query that checks the transclusion count from all templates, using a dependent subquery on templatelinks table. Don't think it'll work though.

Jarekt (talkcontribs)
Zhuyifei1999 (talkcontribs)

Wow, 31131 is so much faster than I would have anticipated. Nice. I guess most of the time it's counting the transclusions of protected templates, and the time of joining and filtering on page_restrictions is minimal.

Reply to "Need help with finding most transcluded unprotected pages"