Talk:Quarry


About this board


Discussion area for Quarry itself and for help with individual queries.


Index pages not linked with authors page.

Jayantanth (talkcontribs)

I need a query: BNWS has a few Index pages that are not linked from the authors' pages. We currently add these manually with the Template:Small scan link template on the Author pages.

Reply to "Index pages not linked with authors page."

Queries for orphaned pages don't work

Mess (talkcontribs)

Since this week I can't run queries for orphaned pages on it.wiki, neither with Quarry (see https://quarry.wmflabs.org/query/18648) nor with Lists (see http://tools.wmflabs.org/lists/itwiki/Voci/Voci_orfane_per_dimensione). The syntax of the queries always worked until this issue came up, but I can't figure out why they no longer complete the task.

Achim55 (talkcontribs)

You are right, my queries are killed as well. It's not the first time Quarry has slowed down and - surprise!! - there are again a lot of queries that have been "running" or queued for some weeks. It needs a clean restart. @Yuvipanda or someone else: please reset it, thanks.

Milimetric (WMF) (talkcontribs)

Hm, @Yuvipanda isn't maintaining this anymore sadly, but I'm watching these pages. I'll go try and figure out how to reset it.

Achim55 (talkcontribs)

User:Milimetric (WMF), thank you! It's an old problem, see Topic:T8ip24a3iragdqrh, Topic:T6sxf1vjyj5shf9e and several more. See also here:

Mess (talkcontribs)

User:Milimetric (WMF), may I suggest implementing a maximum time limit of (say) 1 hour for queued/hung queries in order to prevent this overload? I can't figure out why some queries can evade the regular limit of 30 minutes and can't be killed like any other query when the time is up.

Milimetric (WMF) (talkcontribs)

Interesting, yeah, I thought there was a solid time limit in place. I'll double check that code if I can but sadly this is very low on my priority list. So far I found out I don't have rights to the Quarry box so I pinged some people to get that. I'll try and help if I can.

Mess (talkcontribs)

OK, thank you for your interest, @Milimetric (WMF). Meanwhile I've read the 2016 Community Wishlist Survey request for Quarry maintenance (linked by @Achim55) and I've found this old Phabricator task (https://phabricator.wikimedia.org/T139162), where Yuvipanda had published the SQL code to reset the queue.

Milimetric (WMF) (talkcontribs)

Ok, I got access to the project and restarted the celery workers (the things that run the queries). Let me know if things don't improve and I can try restarting more stuff.

Achim55 (talkcontribs)

User:Milimetric (WMF), it isn't better than before. Even my roughly shortened quarry:query/7250 is killed. Btw. I hate this idiotic editor.

Mess (talkcontribs)

@Milimetric (WMF), maybe a complete cleanup would be better: some queries are still running!

Milimetric (WMF) (talkcontribs)

I will now reboot the boxes, maybe that will help. Which editor? The quarry code editor?

Achim55 (talkcontribs)

Oh no, sorry, I meant the editing of this page. It differs a lot from the wikis' usual editor, and adding links isn't great because it isn't made for intuitive use. Dan, thank you for your efforts here!

Milimetric (WMF) (talkcontribs)

Ok, I rebooted all the boxes and verified that some queries are running ok. I still see the status of the queries that @Mess pointed out as "Running", but that can't be right because the box was rebooted. So that must just be invalid state in the db; those aren't running anymore.

Milimetric (WMF) (talkcontribs)

@Achim55, yeah, Flow is a work in progress. For me, as a newcomer to wikitext, it's much, much easier to work with Flow. If you're interested in working on it, I know the people on the team; they're super nice and some of the smartest people I know. Here are some links if you're interested:

browse the code: https://github.com/wikimedia/mediawiki-extensions-Flow

contribute: https://gerrit.wikimedia.org/r/#/admin/projects/mediawiki/extensions/Flow

The team hangs out in IRC under #wikimedia-collaboration

Mess (talkcontribs)

@Milimetric (WMF), I tried to re-run [my query] several times, but nothing happened - and it's the same for the [other one] re-executed by @Achim55. I'm starting to doubt the code of our queries, but it can't be faulty, as it worked reliably in the past.

Achim55 (talkcontribs)

Confirmed. Doesn't work yet.

Mess (talkcontribs)

I'm also trying to use the "LIMIT" clause to reduce the output in [this query test], but it's useless. It seems that in some particular cases statements like "NOT IN" (or maybe what follows inside the parentheses) negatively affect the whole query (if I remove that part, the query test runs without problems). Anyway, I don't know where to bang my head to solve this crazy puzzle.

Milimetric (WMF) (talkcontribs)

@Mess & @Achim55, I should have looked at your queries earlier. Yes, those NOT IN clauses are very intense to compute over categorylinks and pagelinks tables, those tables have a huge number of records. I would suggest re-writing. Sorry I don't have much time now to help. What I would try to do is re-write the NOT INs as JOINs. This might give the optimizer a better chance to make a plan that runs faster. If that doesn't work, do me a favor and explain in plain English what you're looking for and I can think about it when I have some more time.

Mess (talkcontribs)

@Milimetric (WMF), you're absolutely right. I've recently created two separate queries (20065 and 20068) as a temporary solution for my task, even though I'd already realized there was something "strange" in the query code, given its excessively time-consuming extraction (which we didn't care about in the past simply because the queries didn't get stopped like they do now). I'll work again on the original code to find a definitive solution, but I can probably carry on with this alternative too.

Mess (talkcontribs)

@Milimetric (WMF): HOORAY! I DID IT! I've found what was wrong in my query: the part where it searched for orphaned redirects was the real "point of failure", as it was badly implemented, so I fixed it with better code and now query 18648 not only works again, but also completes its task in only 3 minutes (a big improvement compared with the 15 minutes it took previously). So, @Achim55, follow the advice above and try to find the part of the code that takes too much time to run (I suggest splitting up the query and executing each piece separately).

Achim55 (talkcontribs)

Pardon?? "Write better code so you will no longer notice the malfunction of Quarry." Fine. I think one should tighten the brake a bit more so that some action becomes necessary...

Milimetric (WMF) (talkcontribs)

Well, no matter how amazing Quarry is, a statement like select * from revision will always break it on large enough wikis. No tool is a substitute for writing good code. Think of writing bad code as throwing lots of kilowatts of energy in the trash. You wouldn't leave a 100W bulb on all day for no reason, so don't write inefficient SQL :)
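To make the contrast concrete, an illustrative sketch (not taken from any query in this thread; the page id is made up):

-- unbounded: scans the whole revision table and will be killed on large wikis
SELECT * FROM revision;

-- bounded: restrict the rows and columns so an index can be used
SELECT rev_id, rev_timestamp
  FROM revision
 WHERE rev_page = 12345   -- hypothetical page id, for illustration only
 LIMIT 100;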

Achim55 (talkcontribs)

Sorry for being unclear: as I pointed out many months ago, queries like "SELECT * FROM page" are regularly not killed but stay "running". That's the point. The "killing mechanism" is broken. In the near future we can celebrate an anniversary: I found a query that has been "running" for nearly a year now. I don't know what happens in the background, but in our experience Quarry gets slower the more queries stay "running" for more than an hour instead of being killed.

Reply to "Queries for orphaned pages don't work"
Zache (talkcontribs)

Hi, is there any way to create queries whose results would be rendered as links in Quarry?

E.g. if I make a query like this one, I need to copy-paste the output to a wiki to render the wikitext as hyperlinks. That's an extra step that could be avoided if the same thing could be done directly inside Quarry.

- https://quarry.wmflabs.org/query/19995

XXN (talkcontribs)

Currently no. https://phabricator.wikimedia.org/T74874

Reply to "hyperlinks in Quarry results"

How can i separate the labels without bracket, comma, etc.

Info-farmer (talkcontribs)

This query returns labels with brackets, commas, etc. How can I separate the labels with brackets or commas from the labels without any special characters? I'm asking because I want to follow the Wikidata rules here.

Mess (talkcontribs)

Maybe I've understood what you need. You want to separate "Q",ips_item_id and ips_site_page into two different columns, right? If so, write this SELECT command:

SELECT CONCAT("Q",ips_item_id), ips_site_page

and you'll split them without using tildes or other special characters.

If you also want to get rid of the commas and brackets in the column headings, then write it like this:

SELECT CONCAT("Q",ips_item_id) AS ips_item_id, ips_site_page

(as I've done over there: https://quarry.wmflabs.org/query/19715)
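Put together, a minimal runnable sketch of that shape (table and column names as in the other queries in this thread; the limit is arbitrary):

use wikidatawiki_p;

-- the alias keeps the column heading free of commas and parentheses
SELECT CONCAT("Q", ips_item_id) AS ips_item_id, ips_site_page
  FROM wb_items_per_site
 WHERE ips_site_id = 'tawiki'
 LIMIT 100;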

Info-farmer (talkcontribs)

Excuse me, sorry for my English.

Let me state my need clearly.

I want the ips_site_page values without commas and also without brackets.

For example, the following pages should not appear in my result:

Q28235 ~ தூரமேற்கு வளர்ச்சி பிராந்தியம், நேபாளம்

Q41559 ~ வெள்ளையன்கள் (பட்டாம்பூச்சிக்குடும்பம்)

If possible, I want the results separated as follows:

ips_site_page values without commas and without brackets, etc.

ips_site_page values with commas.

ips_site_page values with brackets.

Milimetric (WMF) (talkcontribs)

I see. So you want to detect which ips_site_pages values have:

1. no commas or parentheses (you call them brackets, but those are usually [ or {; these: ( ) are parentheses)
2. only commas but no parentheses
3. only parentheses but no commas

You can use the "if" function in SQL and order your results by that; that's one way. Taking your original query and adding that logic would look like this:

use wikidatawiki_p;

 select CONCAT("Q",ips_item_id,'~',ips_site_page),
        -- classify each title: 1 = no commas/parentheses, 2 = commas only, 3 = parentheses only, 4 = both
        if(ips_site_page not like '%,%' and ips_site_page not like '%(%' and ips_site_page not like '%)%',
            1, if(ips_site_page like '%,%' and ips_site_page not like '%(%' and ips_site_page not like '%)%',
                  2, if(ips_site_page not like '%,%' and (ips_site_page like '%(%' or ips_site_page like '%)%'),
                      3, 4
                  )
            )
        ) as ips_item_id_type

   from wb_items_per_site i

  where i.ips_site_id='tawiki'
    and not exists
        (select term_entity_id
           from wb_terms
          where i.ips_item_id = term_entity_id
            and term_type = "label"
            and term_language='ta')

  group by ips_item_id
  order by ips_item_id_type
  limit 3000;

You can play with it here: https://quarry.wmflabs.org/query/19793. The results are sorted by the type we identified in the if statement, so you have to look at page 30 or so to find results with type 3 for example.

There are other ways to write the string comparisons, but the "like" and "not like" operators are really straightforward so I went with that.

Info-farmer (talkcontribs)

Thanks indeed. How can I get all the labels of ta.wikipedia or te.wikipedia? Is it possible to get all labels in batches? For example, 1 to 10,000, then 10,000 to 20,000, and so on.

Milimetric (WMF) (talkcontribs)

Well, your where clause has "where i.ips_site_id='tawiki'", so that's limiting results to ta.wikipedia. Just change it to tewiki for te.wikipedia, or use "where i.ips_site_id in ('tawiki', 'tewiki')" to get both, but then be careful with your term_language = 'ta' clause; you might want to do one wiki at a time to keep it simple.
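For instance, here is a hedged sketch of the one-wiki-at-a-time variant for te.wikipedia (same shape as the tawiki query above, with the site id and label language changed together):

use wikidatawiki_p;

-- te.wikipedia pages whose Wikidata item has no Telugu label
SELECT CONCAT("Q", ips_item_id, '~', ips_site_page)
  FROM wb_items_per_site i
 WHERE i.ips_site_id = 'tewiki'
   AND NOT EXISTS
       (SELECT 1
          FROM wb_terms
         WHERE wb_terms.term_entity_id = i.ips_item_id
           AND term_type = 'label'
           AND term_language = 'te')
 LIMIT 3000;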

It would be easier for the system to just return all the results at once rather than get batches of 1-10000, 10000-20000, and so on. Is there a problem with just using the pagination in the results? You can figure out what page to go to based on the results count per page.

Reply to "How can i separate the labels without bracket, comma, etc."
קיפודנחש (talkcontribs)

is there an API interface to query Quarry, or is the UI the only way to use it? peace.

Edgars2007 (talkcontribs)

AFAIK, only UI

Reguyla (talkcontribs)

Same here, I think you would need to request a Labs account to do API-type stuff.

XXN (talkcontribs)

You can use the MediaWiki API directly.

Or, you can consider posting a feature request for a Quarry API in phabricator.

Edoderoo (talkcontribs)

It would be wonderful to be able to use the Quarry output in a Python script, as the generator in Quarry is often a thousand times faster....

Edgars2007 (talkcontribs)

Well, that is possible. If you use quarry page id (not query id), then the url is: https://quarry.wmflabs.org/query/${id}/result/latest/0/json

I think you'll be able to move forward yourself; if not, say.

Edoderoo (talkcontribs)

splendid!

Reply to "does quarry have API interface?"

LabsDB replica databases can drift from production originals

Summary by BDavis (WMF)

There is some discussion of causes of replica drift on wikitech. There is also a tracking task in phabricator where you can report specific problems you find.

The TL;DR of why this happens is that MediaWiki performs some database changes which are not deterministic (they are based on the current state of the database at the time the change is made) and the wiki replicas hosted by Cloud Services are not identical copies of production. This combination makes for drift problems that must be resolved manually or by a very slow re-import process. The new cluster that is being prepared is using a different replication engine that we hope will make drift less common. That work is tracked in phab:T140788. The only thing that will eliminate drift is changing how MediaWiki does the problematic database operations.

Le Deluge (talkcontribs)

Quarry is using a copy of the database that doesn't match the production database - there's clearly been some kind of corruption or replication problem. It first reared its head 13 months ago, and has now come back and is active at the moment. Please can Quarry be given a "clean" version of the db to feed on?

Email issues mean I can't get on Phabricator at the moment, but hopefully someone here can get something done. I do a lot of work with red-link and other problem categories on en.wiki, particularly with these three reports: Uncategorized categories (Quarry), Categories categorized in red-linked categories (Quarry) and Red-linked categories with incoming links (Quarry).

Obviously you would expect every category in the first query to exist - but it became clear that there were four "stuck" in the query that had been deleted either on 22 April 2016 or 30 April 2016. The second query has fourteen "zombies" - some cats that were also deleted on 22 April 2016 (and so shouldn't be in the report), and some cats that do exist but have a parent category that exists (which should disqualify them from the report) - in those cases the parent category was moved to its current name on 22 April 2016. I've even tried null edits, and recreating some of these zombies and then deleting them, but it doesn't affect what happens in Quarry.

It gets worse on the third query, which now has 31 zombie cats, all of which are empty and so shouldn't be in the query. A couple overlap with query 2, there are some birth and date ones whose only thing in common is the removal of a CfD tag on 1 May 2016, and there are maintenance categories - one deleted on 13 May 2016, one on 4 June 2016 and one deleted on both 8 June 2016 and 14 August 2016. So far, a coherent story seems to be emerging - a big problem on 22 April 2016, which trickled on for a few weeks afterwards but then was fixed.

Now it's back. If you look at the third query you'll see a bunch of maintenance categories from the last week, which were mostly deleted on 2 June 2017 but there's at least one which was deleted just 13 hours ago. So whatever this corruption/replication problem is, it has clearly come back.

From my point of view, all I want is Quarry to be working on a copy of the database that actually matches the production database. However, there's also a wider issue of what is the underlying cause of all this - and does it only affect replication to the Labs copy of the database, or is there a wider problem? Worms, meet can.... Le Deluge (talk) 13:43, 4 June 2017 (UTC)

BDavis (WMF) (talkcontribs)

The problem of deletions not replicating properly is known. The short answer is that MediaWiki does things in the database that the current replication strategy does not always deal with well. Work has been underway for several months to build a new database cluster that uses a different replication strategy that should miss fewer deletes and undeletes. I won't say that it will get rid of all of them because there are some database operations that MediaWiki does that are just very difficult to sync to the filtered replicas in the Cloud Services environment. As soon as the new database servers are fully populated for all shards we expect to change the configuration of Quarry to use them.

Le Deluge (talkcontribs)

Thanks for the reply (and I assume you had something to do with Quarry getting so much quicker recently; I now have queries taking 10-20% of the time - and more importantly not timing out!)

If what you're saying is that Quarry will be working off a "clean" copy of the database within a few months, then that's good enough for me; I was more worried that I'd turned up something that had wider implications for the integrity of the database. And it seemed to have got a whole lot worse recently, although that may just be a particular thing to do with query 3, which has a lot of maintenance cats passing through it for a day or two; lately they've been coming in but not managing to get out again.

So just my curiosity remains - what's actually been happening with the secondary copies of e.g. en:category:Law_about_religion_by_country? Is it some kind of locking issue or something deeper?

BDavis (WMF) (talkcontribs)

There is some discussion of causes of replica drift on wikitech. There is also a tracking task in phabricator where you can report specific problems you find.

The TL;DR of why this happens is that MediaWiki performs some database changes which are not deterministic (they are based on the current state of the database at the time the change is made) and the wiki replicas hosted by Cloud Services are not identical copies of production. This combination makes for drift problems that must be resolved manually or by a very slow re-import process. The new cluster that is being prepared is using a different replication engine that we hope will make drift less common. That work is tracked in phab:T140788. The only thing that will eliminate drift is changing how MediaWiki does the problematic database operations.

Le Deluge (talkcontribs)

And the replication problem is ongoing - another 7 cats that were deleted today are still there, such as here

Achim55 (talkcontribs)

Confirmed. See also [https://tools.wmflabs.org/supercount/?project=commons.wikimedia&user=Le_Deluge]: a lag of 30 hours; yesterday evening it was 14 hours. :((

Le Deluge (talkcontribs)

And now it's a bit over 3 hours. Thanks, it's a sign of database stress, but I'm not talking about something that's "just" a bit of replication lag. These queries seem to reflect what's happening in the production database pretty closely - a minute or two lag at most. What I'm talking about is individual entries getting "stuck" - for over a year in some cases.

It feels like these entries have ended up getting permanently locked on the Quarry db, apparently as a result of being moved or deleted on the production database (presumably at a time of database stress). The locks have been cleared on the production db but not on the Quarry db and as a result the entries on the Quarry db can't be updated.

At least, that's my guess.

Achim55 (talkcontribs)

The increasing lag mentioned above is now at 37 hours; subtracting yesterday's 14, there is no replication at all. Yesterday at 12:32 (UTC) I saved a page on Commons whose content still isn't queryable via Quarry.

Le Deluge (talkcontribs)

Ahem, it does sound less impressive if you accidentally divide by 10 like I did.... Again, we are at slight cross-purposes: I'm talking about the main en database, which is not showing significant lag, even if the Commons database is struggling. But the fact that a major database is under such stress must mean that Wikimedia as a whole has a problem.

BDavis (WMF) (talkcontribs)

There is ongoing production maintenance of the "s4" shard that hosts the commons database. The latest at this writing is phab:T166206. This query of the server admin logs shows others happening recently.

You can check on the lag of particular database replicas using https://tools.wmflabs.org/replag/.

The DBA team has been working on building a new replica cluster for Labs/Cloud Services that we hope will be ready for everyone to use in the next month or two. This new cluster uses a different backend technology for replication (row-based replication rather than binlogs) that should stay in better sync with the production servers. The process of bringing up a new cluster from scratch is pretty slow because we can't just copy the data directly from production. The wiki replicas are filtered to remove data that has been suppressed, and that filtering needs to happen as the data is copied across, rather than as a separate operation after a full copy.

We don't like it when the replicas are out of sync by more than a second or two, but sometimes there isn't much that can be done to prevent it. You can take some amount of solace in knowing that back in the olden days of the toolserver replica lag of more than a week was not uncommon for all of the tables in all of the wikis. We are generally doing a lot better as time goes on.

BDavis (WMF) (talkcontribs)

The s4 shard is back in sync, but there is another planned maintenance starting sometime on 2017-06-12 that will cause replication to lag for a few days (phab:T166206#3331928).

How can insert Tilde(~) symbol as a delimiter after QNumber?

Info-farmer (talkcontribs)

Kindly have a look at https://quarry.wmflabs.org/query/19420. How can I insert a ~ symbol in between the Q-number and the label?

XXN (talkcontribs)

You can just prepend tildes to the second column:
CONCAT("Q",ips_item_id),CONCAT('~ ',ips_site_page)

or append them to the first column:
CONCAT("Q",ips_item_id,'~')

but if you need 3 columns, you can use:
CONCAT("Q",ips_item_id), CONCAT('~'), ips_site_page

Edgars2007 (talkcontribs)

Note that in XXN's example with CONCAT('~'), you can simply use '~' instead.
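For context, a minimal sketch of the three-column form as a full query (same wb_items_per_site table as elsewhere in this thread; a plain '~' is used in place of CONCAT('~'), as noted above):

use wikidatawiki_p;

-- Q-number, a literal tilde column, and the page title as three separate columns
SELECT CONCAT("Q", ips_item_id), '~', ips_site_page
  FROM wb_items_per_site
 WHERE ips_site_id = 'tawiki'
 LIMIT 100;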

Reply to "How can insert Tilde(~) symbol as a delimiter after QNumber?"

Is it possible to cancel a running quarry?

Summary by BDavis (WMF)

"The only way to cancel a query is to run another one in the same edit pane. However, you shouldn't worry about canceling a query. Quarry has a query canceler that will automatically take care of a long running query."

Tdcan (talkcontribs)

is it possible to cancel a quarry?

Halfak (WMF) (talkcontribs)

The only way to cancel a query is to run another one in the same edit pane. However, you shouldn't worry about canceling a query. Quarry has a query canceler that will automatically take care of a long running query.

Select bigger articles among two wikipedias

Alex Blokha (talkcontribs)

Hello,

How can I select all articles from the German wiki that are bigger in size than the corresponding articles in enwiki?

Edgars2007 (talkcontribs)

Not in an easy way. You could do two separate queries: 1) dewiki articles that are in enwiki, 2) enwiki articles that are in dewiki, and then process the results off-wiki with some program.

Alex Blokha (talkcontribs)

Omg. I didn't expect it to be so complex.

Alex Blokha (talkcontribs)

Is there any ability to join them by Wikidata id?

XXN (talkcontribs)

https://quarry.wmflabs.org/query/14329

-- pair each en article with its de interlanguage-linked article and compare sizes
SELECT 
 enp.page_title as en_title,
 enp.page_len as en_len,
 dep.page_title as de_title,
 dep.page_len as de_len       
FROM enwiki_p.page enp
 JOIN enwiki_p.langlinks enll ON enll.ll_from = enp.page_id and ll_lang="de"
 JOIN dewiki_p.page dep on enll.ll_title = REPLACE(dep.page_title, '_', ' ')
WHERE enp.page_namespace = 0 and enp.page_is_redirect = 0 and dep.page_namespace = 0 and dep.page_is_redirect = 0
GROUP BY enp.page_id  
HAVING dep.page_len > enp.page_len  
LIMIT 50

Its runtime is long and it comes very close to timing out on Quarry.

Alex Blokha (talkcontribs)

You can post your answer here.

https://stackoverflow.com/questions/44167755/select-bigger-articles-among-two-wikipedias

I will mark it as answered.

XXN (talkcontribs)

Ok :)

find a roman letter and greek letter next to it?

Naraht (talkcontribs)

Is there a way to search through Quarry for [A-Z][Α-Ω]? A search on the English-language Wikipedia for insource:/[A-Z][Α-Ω]/ doesn't appear to return any pages at all.

Edgars2007 (talkcontribs)

No, you can't search wikitext via Quarry.

Naraht (talkcontribs)

Thank you. Still looking for a way to do this without having another editor pull it from a database dump.

XXN (talkcontribs)

@Naraht: a search for insource:/[A-Za-z]+[Α-Ωα-ω]+/ on en.wp returns 859 pages.

Naraht (talkcontribs)

Thanx, over the last two months I've gotten more adept at using insource. Found some really odd usages...

Naraht (talkcontribs)

@XXN Thank you.