Talk:Quarry


About this board

Discussion area for Quarry itself and for help with individual queries.

RoySmith (talkcontribs)

If I have a rev_content_id, how do I get the actual content? The only way I see to join the revision and content tables is on content_sha1, but that's not indexed.

BDavis (WMF) (talkcontribs)

The Wiki Replica databases do not store contents of revisions or pages. Page content must be retrieved from the MediaWiki Action API which is not integrated into Quarry.
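
For what it's worth, fetching a revision's text from the Action API is straightforward. A minimal sketch (the revision ID and wiki endpoint are placeholders; `revision_query_params` and `fetch_revision_content` are just illustrative helper names):

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

def revision_query_params(rev_id):
    """Action API parameters for fetching one revision's content."""
    return {
        "action": "query",
        "prop": "revisions",
        "revids": rev_id,
        "rvprop": "content",
        "rvslots": "main",
        "format": "json",
    }

def fetch_revision_content(rev_id, api="https://en.wikipedia.org/w/api.php"):
    """Return the wikitext of a single revision (hits the live API)."""
    with urlopen(api + "?" + urlencode(revision_query_params(rev_id))) as resp:
        data = json.load(resp)
    page = next(iter(data["query"]["pages"].values()))
    return page["revisions"][0]["slots"]["main"]["*"]
```

A common pattern is to collect rev_ids with Quarry and then fetch content in batches, since revids accepts multiple pipe-separated IDs (up to 50 per request for normal clients).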

Reply to "How to get revision content?"
RoySmith (talkcontribs)

I'm running select count(*) from comment limit 10; (https://quarry.wmflabs.org/query/39582), which is taking forever. I would think that would be instantaneous.

Quarry claims, "Query status: running", but when I click the Explain button, I get, "Hmm... Is the SQL actually running?!"


Zhuyifei1999 (talkcontribs)

"Hmm... Is the SQL actually running?!" Is a bug. phab:T205214

Why is it show? Because the explain is like:

{
  "rows": [
    [
      1,
      "PRIMARY",
      "comment",
      "index",
      null,
      "comment_hash",
      "4",
      null,
      266953357,
      "Using where; Using index"
    ],
    [
      11,
      "DEPENDENT SUBQUERY",
      "logging",
      "ref",
      "type_time,log_comment_type,log_type_action",
      "log_comment_type",
      "8",
      "enwiki.comment.comment_id",
      185,
      "Using where; Using index"
    ],
    [
      10,
      "DEPENDENT SUBQUERY",
      "revision_comment_temp",
      "ref",
      "PRIMARY,revcomment_rev,revcomment_comment_id",
      "revcomment_comment_id",
      "8",
      "enwiki.comment.comment_id",
      1,
      "Using index"
    ],
    [
      10,
      "DEPENDENT SUBQUERY",
      "revision",
      "eq_ref",
      "PRIMARY",
      "PRIMARY",
      "4",
      "enwiki.revision_comment_temp.revcomment_rev",
      1,
      "Using where"
    ],
    [
      9,
      "DEPENDENT SUBQUERY",
      "recentchanges",
      "ref",
      "rc_comment_deleted",
      "rc_comment_deleted",
      "8",
      "enwiki.comment.comment_id",
      1,
      "Using where; Using index"
    ],
    [
      8,
      "DEPENDENT SUBQUERY",
      "protected_titles",
      "ref",
      "pt_reason_id",
      "pt_reason_id",
      "8",
      "enwiki.comment.comment_id",
      4,
      "Using index"
    ],
    [
      7,
      "DEPENDENT SUBQUERY",
      "oldimage",
      "ref",
      "oi_description_deleted",
      "oi_description_deleted",
      "8",
      "enwiki.comment.comment_id",
      28,
      "Using where; Using index"
    ],
    [
      6,
      "DEPENDENT SUBQUERY",
      "ipblocks",
      "ref",
      "ipb_reason_deleted",
      "ipb_reason_deleted",
      "9",
      "enwiki.comment.comment_id,const",
      32,
      "Using index"
    ],
    [
      5,
      "DEPENDENT SUBQUERY",
      "filearchive",
      "ref",
      "fa_description_deleted",
      "fa_description_deleted",
      "8",
      "enwiki.comment.comment_id",
      124,
      "Using where; Using index"
    ],
    [
      4,
      "DEPENDENT SUBQUERY",
      "filearchive",
      "ref",
      "fa_reason",
      "fa_reason",
      "8",
      "enwiki.comment.comment_id",
      405,
      "Using index"
    ],
    [
      3,
      "DEPENDENT SUBQUERY",
      "image",
      "ref",
      "img_description_id",
      "img_description_id",
      "8",
      "enwiki.comment.comment_id",
      26,
      "Using index"
    ]
  ],
  "headers": [
    "id",
    "select_type",
    "table",
    "type",
    "possible_keys",
    "key",
    "key_len",
    "ref",
    "rows",
    "Extra"
  ]
}
RoySmith (talkcontribs)

How did you get the explain output?

Zhuyifei1999 (talkcontribs)

As for why it is joining other tables, it wants to prevent people from gaining any extra knowledge of comments that may not be publicly viewable from the web interface, such as revision-deleted comments.

enwiki_p.comment is a view against the database enwiki, with the following definition (SHOW CREATE VIEW comment;):

CREATE ALGORITHM=UNDEFINED DEFINER=`viewmaster`@`%` SQL SECURITY DEFINER VIEW `comment` AS
select `enwiki`.`comment`.`comment_id` AS `comment_id`,
       `enwiki`.`comment`.`comment_hash` AS `comment_hash`,
       `enwiki`.`comment`.`comment_text` AS `comment_text`,
       `enwiki`.`comment`.`comment_data` AS `comment_data`
from `enwiki`.`comment`
where (exists(select 1 from `enwiki`.`image` where (`enwiki`.`image`.`img_description_id` = `enwiki`.`comment`.`comment_id`) limit 1)
    or exists(select 1 from `enwiki`.`filearchive` where (`enwiki`.`filearchive`.`fa_deleted_reason_id` = `enwiki`.`comment`.`comment_id`) limit 1)
    or exists(select 1 from `enwiki`.`filearchive` where ((`enwiki`.`filearchive`.`fa_description_id` = `enwiki`.`comment`.`comment_id`) and ((`enwiki`.`filearchive`.`fa_deleted` & 2) = 0)) limit 1)
    or exists(select 1 from `enwiki`.`ipblocks` where ((`enwiki`.`ipblocks`.`ipb_reason_id` = `enwiki`.`comment`.`comment_id`) and (`enwiki`.`ipblocks`.`ipb_deleted` = 0)) limit 1)
    or exists(select 1 from `enwiki`.`oldimage` where ((`enwiki`.`oldimage`.`oi_description_id` = `enwiki`.`comment`.`comment_id`) and ((`enwiki`.`oldimage`.`oi_deleted` & 2) = 0)) limit 1)
    or exists(select 1 from `enwiki`.`protected_titles` where (`enwiki`.`protected_titles`.`pt_reason_id` = `enwiki`.`comment`.`comment_id`) limit 1)
    or exists(select 1 from `enwiki`.`recentchanges` where ((`enwiki`.`recentchanges`.`rc_comment_id` = `enwiki`.`comment`.`comment_id`) and ((`enwiki`.`recentchanges`.`rc_deleted` & 2) = 0)) limit 1)
    or exists(select 1 from (`enwiki`.`revision` join `enwiki`.`revision_comment_temp` on((`enwiki`.`revision_comment_temp`.`revcomment_rev` = `enwiki`.`revision`.`rev_id`))) where ((`enwiki`.`revision_comment_temp`.`revcomment_comment_id` = `enwiki`.`comment`.`comment_id`) and ((`enwiki`.`revision`.`rev_deleted` & 2) = 0)) limit 1)
    or exists(select 1 from `enwiki`.`logging` where ((`enwiki`.`logging`.`log_comment_id` = `enwiki`.`comment`.`comment_id`) and ((`enwiki`.`logging`.`log_deleted` & 2) = 0) and (`enwiki`.`logging`.`log_type` in ('abusefilter','articlefeedbackv5','block','campus','close','contentmodel','course','create','delete','eparticle','gather','gblblock','gblrename','gblrights','globalauth','gwtoolset','import','institution','instructor','interwiki','liquidthreads','lock','managetags','massmessage','merge','moodbar','move','mwoauthconsumer','newsletter','newusers','notifytranslators','online','pagelang','pagetranslation','pagetriage-curation','pagetriage-deletion','patrol','protect','renameuser','review','rights','spamblacklist','stable','student','tag','thanks','timedmediahandler','translationreview','upload','usermerge'))) limit 1))
Zhuyifei1999 (talkcontribs)

> How did you get the explain output?

You just have to try a few times, refreshing the page each time. Hacky, I know.

BDavis (WMF) (talkcontribs)
RoySmith (talkcontribs)

Hmm. I actually tried that, but apparently with too small a value of "few" :-(

The view thing is ugly, but I get it. Is there any way to cancel a running query? This is still running, but I really don't want to trash the DB over this.

Maybe I should just back up and ask what I'm really trying to do. I want to find all the revisions which have a certain string in the edit comment. I'm assuming I want something along the lines of:


select rev_id from revision, comment
where rev_comment_id = comment_id
and comment_text like '%my certain string%';


That's obviously an expensive query, but I don't have any feel for how long it would take. Are we talking minutes, hours, or days? Will I do bad things to the DB if I try to run it?

BDavis (WMF) (talkcontribs)

Using the comment_revision view instead of the comment view will help a bit, but comment_text like '%my certain string%' will be really slow and expensive no matter what. Any LIKE match with a leading wildcard (%) requires a full table scan. See 8.3.1 How MySQL Uses Indexes in the MySQL documentation for a bit more detail; fundamentally, MySQL/MariaDB cannot use an index for a pattern that begins with a wildcard.
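
To illustrate, a sketch of the reworked query (assuming the comment_revision view mentioned above; the search string is a placeholder, and as far as I know the replica comment table has no index on comment_text):

```sql
-- Leading wildcard: every comment row must be examined (full scan),
-- even though the revision-scoped view skips the other security subqueries.
SELECT rev_id
FROM revision
JOIN comment_revision ON rev_comment_id = comment_id
WHERE comment_text LIKE '%my certain string%';

-- By contrast, an anchored prefix could use an index on the text column,
-- if such an index existed:
-- WHERE comment_text LIKE 'my certain string%';
```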

Reply to "Why is count(*) so slow?"

Use of a selected db inside of a query

Achim55 (talkcontribs)

If a db is selected with USE enwiki_p;, is there a way to check the selected db in IF or CASE statements? Pseudocode: if db == enwiki_p then etc. Background: I'm running similar queries against different dbs using (nearly) the same code, so that one only has to change the USE statement.
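
One possibility (an untested sketch): MySQL's DATABASE() function returns the currently selected schema, so it can drive a CASE expression. The page titles here are only illustrative values:

```sql
USE enwiki_p;
-- DATABASE() returns the schema selected by USE, so one query text
-- can branch per wiki without editing anything but the USE line.
SELECT page_id, page_title
FROM page
WHERE page_title = CASE DATABASE()
                     WHEN 'enwiki_p' THEN 'Main_Page'   -- illustrative titles
                     WHEN 'dewiki_p' THEN 'Hauptseite'
                     ELSE 'Main_Page'
                   END
  AND page_namespace = 0;
```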

Edgars2007 (talkcontribs)

Not an answer to first question, but to (probably) your problem. You can use something like this:

use enwiki_p;
select * from some_en_wiki_table;

use dewiki_p;
select * from some_de_wiki_table;

use frwiki_p;
select * from some_fr_wiki_table;

It should work.

Achim55 (talkcontribs)

Not really. I found a way (see quarry:query/16164), but that's not sufficient, because other projects like Wikisource or Wikispecies have the same names for their main pages.

Edit: Just had an idea: the page_id could be used additionally; I think that will work.

Zhuyifei1999 (talkcontribs)
Achim55 (talkcontribs)

Great, thank you very much!

Wargo (talkcontribs)

Commons images only in hidden cat and in nonexistent cat

Estopedist1 (talkcontribs)
Wargo (talkcontribs)

By what properties?

Estopedist1 (talkcontribs)
Wargo (talkcontribs)

The categorylinks table also stores assignments to red (nonexistent) categories, so you can add something like

WHERE cl_to = 'Uncategorized media with description in Azerbaijani language'
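
Putting it together, a sketch of a general red-category query on commonswiki_p (note that cl_to stores titles with underscores rather than spaces; a red category is simply a cl_to with no matching category page):

```sql
-- Files assigned to a category that has no category page ("red" category).
SELECT p.page_title AS file, cl.cl_to AS red_category
FROM categorylinks cl
JOIN page p
  ON p.page_id = cl.cl_from
 AND p.page_namespace = 6          -- File:
LEFT JOIN page cat
  ON cat.page_namespace = 14       -- Category:
 AND cat.page_title = cl.cl_to
WHERE cat.page_id IS NULL
LIMIT 100;
```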
Reply to "Commons images only in hidden cat and in nonexistent cat"
Alex Blokha (talkcontribs)

Any example of quarry, which I can pass to pagepile to create new?

Framawiki (talkcontribs)
Alex Blokha (talkcontribs)

Yes, but pagepile says it can create pagepile from quarry. I need an example of such quarry.

Wargo (talkcontribs)

Quarry is a tool for running SQL queries against the Wikimedia databases. You need to enter the query ID in that field; you can get it from the URL while you have the query open.

Reply to "Pagepile from quarry"
ԱշոտՏՆՂ (talkcontribs)

Hi! Is there any way to know whether a revision changed the proofread status of a Wikisource page?

Wargo (talkcontribs)

Yes, notice that the edit summary contains the new proofread level.

Reply to "Proofread status"
Achim55 (talkcontribs)

Some weeks ago I saw a queryable way to check whether a wiki is closed or not. Unfortunately I can't find it anymore (I should have written down the link, due to my age...). Any hints?


Wargo (talkcontribs)
Matěj Suchánek (talkcontribs)

Renaming Wikisource pages to new titles.

Ineuw (talkcontribs)

Is there someone who has the authority to execute update queries on four misnamed Wikisource volumes? It would entail renaming the 4 djvu files, renaming about 820 pages in each volume, and replacing the same in the main namespace pages.


Wargo (talkcontribs)

Use a bot to move these pages.

Ineuw (talkcontribs)
Ankry (talkcontribs)

The list of changes is incomplete, and the suggested changes are probably incorrect.

  1. I think subpages of file pages are unsupported (as "/" is not a legal character in a filename, AFAIK); I see you decided to change "/" to "-".
  2. Moving an index/file page requires moving the associated pages in the Page namespace, and it is a good idea to prepare the list of required moves in advance, as after the move you lose the list of pages.
  3. There are sometimes problems with redirects in the Index namespace (they may need manual fixing of the page content model by an admin, and they seem to be not fully supported by ProofreadPage). I would suggest avoiding such redirects, but this requires fixing all pages linking to them.

Point (2) suggests that the move be performed by an admin who is also a bot operator, or who has the online support of a bot operator (to synchronise the changes and keep the mess as short in time as possible). (Admin rights should not be necessary, but they may be needed for fixing (3) or any accidental mistakes.)

Making such a move on an index that is being worked on is always a mess for the people working on it.

I have made some such moves on pl.ws, but not such massive ones. My bot does not have the bot flag here, so I cannot help at the moment.

Ineuw (talkcontribs)

Ankry, thanks for the advice. I forgot about the "/" (slash) in the index file name; it was converted to a "-" (hyphen) automatically by the move process. I will correct my list accordingly. I have already posted a bot request on en.WS.

Ineuw (talkcontribs)

Replying to my own comment. I began the move on Wikimedia Commons, where the originals and images are kept, before renaming the .djvu file on Wikisource. The Wikisource move failed and I am stuck with a self-made mess. Would it be possible to restore EVERYTHING to its original state on Commons and Wikisource? The reason being that a forward-moving process of correcting my mistakes will leave behind a big mess.


Wargo (talkcontribs)
Reply to "Renaming Wikisource pages to new titles."

Categories with exactly 1 category in them

Bennylin (talkcontribs)

Hi, can anybody help me create a query for a list of "Categories with exactly 1 category in them"? Thanks
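
A possible starting point (a sketch only; in categorylinks, cl_type marks subcategory memberships, so grouping on the parent category gives its subcategory count):

```sql
-- Categories that contain exactly one subcategory
SELECT cl_to AS category, COUNT(*) AS subcats
FROM categorylinks
WHERE cl_type = 'subcat'
GROUP BY cl_to
HAVING COUNT(*) = 1
LIMIT 100;
```

The category table also keeps a cat_subcats counter (WHERE cat_subcats = 1), which is cheaper to query but, as I understand it, can occasionally be stale.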

Bennylin (talkcontribs)
Mike Peel (talkcontribs)

I've been trying to run https://quarry.wmflabs.org/query/28096 for the last few days, but it keeps timing out. It normally works (or works after timing out once or twice). Are things running slow on the back-end? If so, would it be possible to increase the time-out from 30 minutes to 60 minutes or so? Thanks!

Zhuyifei1999 (talkcontribs)

It says it's executed in 894.39 seconds. Is this still an issue?

Mike Peel (talkcontribs)

I just reran it again today, and it worked first time with an execution time of 735.61 seconds - so maybe this is resolved. However, let's see how it goes over the next week...

Framawiki (talkcontribs)

Note that Quarry is "simply" an interface to the Wiki Replica databases. A timeout almost certainly comes from those databases.

Mike Peel (talkcontribs)

It's back to timing out again today. :-( It must somehow be linked to server loads, as I'm not changing the query between runs, and the amount of data to analyse and the number of results are roughly similar.

Mike Peel (talkcontribs)

It's still timing out. :-(

Mike Peel (talkcontribs)

... and today it completed in 484.18 seconds (~8 minutes). Very weird - the run time seems to be very random.

Mike Peel (talkcontribs)

It's been back to timing out for most of this last week.

Frietjes (talkcontribs)

I have been having the same problem with query/36685 :(

Mike Peel (talkcontribs)

The query has been timing out for the last fortnight now. :-(

Mike Peel (talkcontribs)
Ahecht (talkcontribs)

Same issue with quarry:query/36666, which had been taking about 15 minutes but is now timing out after 30.

Dvorapa (talkcontribs)

I'm experiencing the same thing. It seems the analytics cluster is overloaded, or there may be some other issue. All my simple queries now take 20+ minutes or time out; a year ago they ran in under 2 minutes.

For Toolforge users the easy workaround is to use the web cluster (sql --cluster=web enwiki), but the issue should be fixed anyway.

Dvorapa (talkcontribs)

Per the IRC thread, everything should be back to normal soon.

Mike Peel (talkcontribs)

IRC thread?

Frietjes (talkcontribs)

I was able to run quarry:query/36685 about 20 minutes ago, and it didn't time out this time!

Mike Peel (talkcontribs)
Dvorapa (talkcontribs)

The error 500 should be fixed now, based on what people have said on IRC (#wikimedia-cloud).

Mike Peel (talkcontribs)

It seems to be working now, thanks! :-)

Reply to "Queries running slow?"