Talk:Quarry

Jump to navigation Jump to search

About this board

Discussion area for discussion about Quarry itself and help about individual queries.

Ineuw (talkcontribs)

Am I correct to assume that the text layer of a Wikisource page is encoded as UTF-8? Does this apply to all Wikipedia sites that contain text?

Achim55 (talkcontribs)

Yes and yes as far as I know. Citing Manual:$wgDBTableOptions: "MediaWiki internally always produces data in UTF-8 encoding to be saved in the database."


Reply to "Wikisource text layer encoding"
RoySmith (talkcontribs)

I want to find all revisions where a given user added a specific word to a page. Is this possible?

The best I can see is to find all revisions a user has made, then grovel over them to generate the diffs for each one and test if the word is in the diff.

Matěj Suchánek (talkcontribs)
Reply to "Search by revision text?"
202.63.103.210 (talkcontribs)

Can someone help me to write a query which gives the article content from Wikipedia articles in Telugu language

Matěj Suchánek (talkcontribs)

It is not possible to extract content from articles via Quarry (relevant tables are not publicly available).

Framawiki (talkcontribs)

But you can use the Special:Export page on your local wiki to get an w:XML version of the content of the pages.

Reply to "article extraction"
RhinosF1 (talkcontribs)
Milimetric (WMF) (talkcontribs)

I see you already have a few examples like https://quarry.wmflabs.org/query/35057, and I'm wondering what else you'd like help with. One thing I can suggest is that you try to format your queries so they can be more easily read. For example, try this style (which I'm a little too obsessive about, I admit, but it's good)

SELECT COUNT(DISTINCT rc_user)
  FROM recentchanges_userindex
           JOIN
       user                       ON (rc_user = user_id)
           LEFT JOIN
       ipblocks                   ON (rc_user = ipb_user)
 WHERE rc_user != 0
   AND rc_timestamp > DATE_FORMAT(DATE_SUB(NOW(), INTERVAL 30 DAY), '%Y%m%d%H%i%S')
   AND rc_type = 0
   AND rc_bot = 0
   AND ipb_id IS NULL
   AND (user_registration IS NULL OR user_registration < DATE_FORMAT(DATE_SUB(NOW(), INTERVAL 1 DAY), '%Y%m%d%H%i%S'))
   AND user_editcount >= 1

And btw, you can write that user registration clause as:

(coalesce(user_registration, '2000') < DATE_FORMAT(DATE_SUB(NOW(), INTERVAL 1 DAY), '%Y%m%d%H%i%S'))
RhinosF1 (talkcontribs)

Ah, It wasn't me that made them! Can't remember who! - I was looking for one that just did a number of editors with a logged edit or action in last 30 days @Milimetric (WMF)

Reply to "Need an active user query"
Saeidpourbabak (talkcontribs)

Hi,

This simple quarry should return some pages in enwiki which have no fa interwiki. It works for articles, but not for templates (the results for templates have fa interwiki). Can somebody help?

Matěj Suchánek (talkcontribs)

I couldn't find an example in the results. Note that there may be a problem in the database, like a deferred update.

Saeidpourbabak (talkcontribs)
Matěj Suchánek (talkcontribs)

Looks like Wikidata doesn't do database updates properly. I did a null edit on that page and now the entry is in the database. Unfortunately, you will need to do this on every page if you want accurate results.

Saeidpourbabak (talkcontribs)

Thank you, but isn't there any more efficient and standard way to fix the problem? The quarry I referred to was just an example I made to show the issue. The quarry that I used and encountered problem with has more than 13k results, which performing null edit on that number of pages is impractical.

Matěj Suchánek (talkcontribs)

Well, to "fix the problem" you need to somehow trigger a new database update. And as a casual user, the only way (if you're lucky) is really performing a null edit. I will make a bug report on this shortly.

Matěj Suchánek (talkcontribs)
Reply to "template tables not updated?"
RhinosF1 (talkcontribs)
Milimetric (WMF) (talkcontribs)

Just to clarify, do you mean:

Let the list of people on that page be L. You want to find this set:

{user | user <- L and user has not made any edits on this wiki in 5 years}

What about other wikis? Also, what do you mean by "remove them", like remove them from the list?

To find users that haven't made any edits in one wiki you could do:

 select user_id
   from user
  where user_id in (... your list ...) 
    and user_id not in
        (select distinct rev_user
           from revision
          where rev_timestamp > '20140301000000'
        )
;

But that subquery would be impossibly slow on enwiki, so this is a heavy query. Not sure how to optimize it off the top of my head. Might just be better to make a temporary table somewhere out of the users that did edit in the past 5 years and query that.

RhinosF1 (talkcontribs)

Editors in last 5 years would also work. That query failed (https://quarry.wmflabs.org/query/34487). Just enwiki is what we need. We need to end up with a list of users who have/have not edited in the past 5 years and are on that page.

Milimetric (WMF) (talkcontribs)

you substituted "https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Apple_Inc./Subscribe" where I have (... your list ...) in the query, but that's not how Quarry works. Quarry just executes SQL, it doesn't now how to read the wiki and parse the user ids out of it. You'd have to do that manually, or build a query to get a list of user_ids based on user names. You can also make a list of user names that looks like this:

('user name 1', 'user name 2', ...) and pass that in the query above as `user_name in ('user name 1', 'user name 2')`.

RhinosF1 (talkcontribs)

Sorry for the late reply, The query was done in the end via production for us

Reply to "Quarry Help Needed"
Землеройкин (talkcontribs)
Zhuyifei1999 (talkcontribs)

It'd probably dead already. No status being updated is a consequence of quarry running out of memory while trying to store results.

Reply to "How to kill a query?"

Error Hmm... Is the SQL actually running?!

2
Kerry Raymond (talkcontribs)

What does this mean?

Zhuyifei1999 (talkcontribs)
Reply to "Error Hmm... Is the SQL actually running?!"

Search for genus & species items missing descriptions

6
Tom.Reding (talkcontribs)

I'd like to port this working WD Query Service query into Quarry (it works when there are ≲ 5000 possible results, but genus & species are timing out the server), but I'm having trouble, could someone help?

Matěj Suchánek (talkcontribs)
USE wikidatawiki_p;
SELECT page_title FROM page
JOIN pagelinks ON page_id = pl_from
LEFT JOIN wb_terms ON term_full_entity_id = page_title AND page_namespace = 0 AND term_language = 'en' AND term_type = 'description'
WHERE pl_title = 'Q34740' AND pl_namespace = 0;

Since the SQL database doesn't hold statements, we need to use standard MediaWiki links.

Tom.Reding (talkcontribs)

How does one search for empty description term_text though? I tried adding "AND NOT EXISTS(term_text)" to the WHERE, but got an error; I also tried adding term_text to the SELECT but it times out, even with LIMIT 100.

Matěj Suchánek (talkcontribs)

Oops, I forgot AND term_text IS NULL.

Tom.Reding (talkcontribs)

Killed again after 30 minutes :( I have gotten the original WDQS query to not timeout/error somehow, by, counter intuitively, removing the artificially imposed limit... Thanks for your help though.

Matěj Suchánek (talkcontribs)

That is interesting...

Mahir256 (talkcontribs)

It is clear that the results of past runs of a query are stored on Quarry's servers; I was able to find those for my label statistics run on the 25th (compare those from the 26th) after some crawling based on a suspected range of run IDs within which the 25th's results would have fallen.

Is there a way to obtain a list of the results of past runs of a query so that such crawling won't be necessary?

Zhuyifei1999 (talkcontribs)

You are correct. It is indeed stored. It is however not exposed in the interface. And I just checked the code and found there are no uses of Query.revs. So unfortunately we don't currently have the list of revisions for a particular query exposed. Would you like to contribute a patch to add this?

Reply to "Listing past runs of a query"