Topic on Talk:Quarry

Retrieving a list of pages and the users who created them

3
Mike Christie (talkcontribs)

I'd like to be able to retrieve a list of users who created pages in the Talk namespace that end in "/GAn". For GA5, for example, I wrote this:

select p.page_title,p.page_id,r.rev_timestamp, u.user_name

from page p

inner join revision r

  on p.page_id = r.rev_page

inner join actor a

  on r.rev_actor = a.actor_id

inner join user u

  on a.actor_user = u.user_id

where p.page_title  LIKE '%a/GA5'

  and r.rev_parent_id = 0

  and p.page_namespace = 1

but it executes extremely slowly. Eliminating the joins to actor and user makes it run in a few seconds, so why is this slow? I would have thought it would pick up the indexes on actor and user on those fields.

BDavis (WMF) (talkcontribs)

You can use the tool at https://sql-optimizer.toolforge.org/ to get an explain plan for any particular query. When I put this query into that system it does look like the database engine is trying to use indexes, but I also see that joining with the user table makes the database consider 43 million indexed rows as part of it's analysis.

One optimization that might help is using the special "revision_userindex" table rather than directly using the revision table.

The p.page_title LIKE '%a/GA5' clause will always result in a linear scan of some collection of rows. MySQL/MariaDB's indexes just don't work with leading wildcard searches. There are techniques that can be used to work around this, but they are not in use in the Wiki Replicas databases as they are not needed for MediaWiki's normal operations.

Mike Christie (talkcontribs)

Thanks; that's very helpful, and I didn't know about the explain plan tool. I'll have a look at the revision_userindex table.

Reply to "Retrieving a list of pages and the users who created them"