Topic on Talk:Quarry

My query gets stopped for some reason

6
Bunyk (talkcontribs)

Before it was working fine, providing results in half an hour, but last 3 versions just stop: quarry.wmcloud.org/query/68565


Is it because it hits some timeout and needs to be optimized somehow? Or I'm running it too often? Or there is some bug in Quarry?

Matěj Suchánek (talkcontribs)

Toolforge SQL Optimizer tells:

id select_type table type possible_keys key key_len ref rows Extra
1.1 PRIMARY ALL distinct_key 1367979 Using temporary; Using filesort
1.2 PRIMARY pagelinks ref pl_namespace pl_namespace 261 ukwiki.archive.ar_namespace,ukwiki.archive.ar_title 21 Using where; Using index
2.3 MATERIALIZED archive index ar_name_title_timestamp ar_name_title_timestamp 275 1367979 Using where; Using index
3.4 MATERIALIZED page index PRIMARY, page_name_title, page_random, page_len, page_redirect_namespace_len page_name_title 261 3640238 Using index
3.5 MATERIALIZED revision ref rev_page_actor_timestamp, rev_page_timestamp rev_page_timestamp 4 ukwiki.page.page_id 4 Using index

Basically, the query scans too many rows and does filesort, which is slow.

Also you don't need to query the revision table. When the page does not exist, there will be no revisions. I'd probably rewrite the query using a join like:

SELECT ... FROM pagelinks
LEFT JOIN page ON page_namespace = pl_namespace AND page_title = pl_title
WHERE EXISTS (SELECT 1 FROM archive WHERE ...)
AND page_id IS NULL
[...]

etc. and try it again.

Bunyk (talkcontribs)

Oh, thank you for your explanation.


So yes, first I also thought that I don't need to look into revisions table, but when I do not check that there are no revisions for the page it also includes pages that have deleted revisions (rows in archived), but are not currently deleted.


Maybe I'll try to detect deleted pages by checking if they have entry in Manual:Logging table. But what is the fastest way to check if link is currently red or not?


What I'm trying to do here is to create a list of pages that were deleted, but are still very wanted, to consider undeletion if they were notable, or just remove links to them if they were not notable.

Matěj Suchánek (talkcontribs)

when I do not check that there are no revisions for the page it also includes pages that have deleted revisions (rows in archived), but are not currently deleted. Page and revision existence is equivalent. There is no existing page with no revisions and a revision cannot exist without an existing page. What I meant is you don't have to include revision in the query, checking just page is enough for you.

Maybe I'll try to detect deleted pages by checking if they have entry in Manual:Logging table. That's possible but there is not any index on log_type, log_action, log_namespace, log_title. So it's potentially very slow, too.

But what is the fastest way to check if link is currently red or not? It's the query I have provided you in my reply. Or an equivalent query with NOT EXISTS.

Bunyk (talkcontribs)

And because of you mentioning filesort, which is slow, I tried to remove <code>ORDER BY</code>. And you know what? It executed in 419.00 returning 333 rows, which I could sort now in browser. So thank you for the hint.


Crazy, how sorting 3 hundreds rows would make it slow?

Bunyk (talkcontribs)

And after applying advice to check that only page is absent, not revision for it, same query got executed in 68.86 seconds. 6x speedup, thank you again.

Reply to "My query gets stopped for some reason"