Talk:Quarry


About this board

Discussion area for Quarry itself and for help with individual queries.

Jon Harald Søby (WMNO) (talkcontribs)

Hi! We're planning on running a survey on the Northern Sami Wikipedia, and I think the most efficient way to get a list of who to get in touch with would be to run a query on Quarry. But I don't know how to construct such a query (never used SQL…), so I'm wondering if someone could help me. What I want is basically just a list of all registered users who have made more than 10 edits ever. Can someone help me with that?

Edgars2007 (talkcontribs)

https://quarry.wmflabs.org/query/23113

Edgars2007 (talkcontribs)

Note that user_editcount isn't entirely precise; it can be off from the real numbers by a few edits. But for this purpose it will probably be fine.
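For reference, such a query can be sketched roughly as follows (a sketch only, assuming the Northern Sami Wikipedia replica database is sewiki_p and accepting the imprecision of user_editcount noted above):

USE sewiki_p;
-- minimal sketch: registered users with at least 10 recorded edits
SELECT user_name, user_editcount
FROM user
WHERE user_editcount >= 10
ORDER BY user_editcount DESC;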

Jon Harald Søby (WMNO) (talkcontribs)

Thank you very much, Edgars2007!

Reply to "List of users with at least 10 edits"
Mmaarrkkooss (talkcontribs)

I submitted a query to Quarry for all articles and their categories, so it's natural that it will need some time. The query status is "running", but when I press "explain", it says "Error: Hmm... Is the SQL actually running?!". Which is it, then?

Zhuyifei1999 (talkcontribs)

Which query? Could you give a link?

Mmaarrkkooss (talkcontribs)

It says running, but here you go: https://quarry.wmflabs.org/query/22865

Mmaarrkkooss (talkcontribs)

I might have been doing this wrong, however.

Zhuyifei1999 (talkcontribs)

From the logs, the query most likely generated too many results for Quarry to store. You hit phab:T172086.

Mmaarrkkooss (talkcontribs)

Where can I see the logs? How could I run a query that big?

Zhuyifei1999 (talkcontribs)
MariaDB [enwiki_p]> SELECT COUNT(1)
    -> FROM categorylinks
    -> INNER JOIN page p1 ON (p1.page_id = cl_from AND p1.page_namespace = 0)
    -> LEFT JOIN page p2 ON (p2.page_namespace = '14' AND p2.page_title = cl_to)
    -> LEFT JOIN page_props ON (pp_page = p2.page_id AND pp_propname = 'hiddencat')
    -> WHERE pp_propname IS NULL;
+----------+
| COUNT(1) |
+----------+
| 28917820 |
+----------+
1 row in set (9 min 7.29 sec)

MariaDB [enwiki_p]> Select COUNT(1) from page;
+----------+
| COUNT(1) |
+----------+
| 43592182 |
+----------+
1 row in set (2 min 1.31 sec)

Quarry will not and should not be able to store result sets this large. As for the logs, they are stored as files on the runner instances, and I do not know of an easy way to expose them.

Mmaarrkkooss (talkcontribs)

Thanks for the insights. Do you know how I could obtain such a large data set from that query?

Mmaarrkkooss (talkcontribs)

Also, is the above shell Quarry? Because it doesn't look like it.

Zhuyifei1999 (talkcontribs)

For large amounts of offline data, you can use https://dumps.wikimedia.org/enwiki/. The above shell is directly querying the Wiki Replica servers on Toolforge, but Quarry will produce the same results if the same query is run at the same time, since it connects to the exact same servers.

Mmaarrkkooss (talkcontribs)

I have in fact downloaded these SQL tables from the dumps and I am loading them into a database (which is a slow process), but I fear that this query won't work either.

Mmaarrkkooss (talkcontribs)

Hello again.

I am thinking of using pagination and limiting the results. How many records do you think I could get away with? 1 million? More? Less?

Mmaarrkkooss (talkcontribs)

500k is doable.

Mmaarrkkooss (talkcontribs)

Even though 500k was doable with LIMIT 5000000 (I don't have the link because I queried from the same page), using LIMIT 500000,500000 in https://quarry.wmflabs.org/query/22970 is not working. Damn, this is frustrating.

Edit: first 500k results: https://quarry.wmflabs.org/query/22969

Edit 2: never mind.
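For reference, paginating a large result set is typically done with LIMIT offset, count (or LIMIT ... OFFSET ...), fetching one slice per run. A sketch only, with placeholder table and columns:

-- second slice of 500000 rows; ORDER BY keeps slices from overlapping between runs
SELECT page_id, page_title
FROM page
WHERE page_namespace = 0
ORDER BY page_id
LIMIT 500000 OFFSET 500000;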

Reply to "Is the query actually running or not?"
Gfdgss (talkcontribs)

Hello,

I need to get the text from many pages. To start understanding the schema, I wanted to get the text from one specific page.

However, when I go to the revision table, the entire rev_text_id column is filled with zeros.

Also, an exception saying the table doesn't exist is thrown when I try to use the text table.

Do you know why this happens? Are we not allowed to mine that data? Does anyone know how to solve this (via Quarry or by alternative means)?

Thanks for any help!

Stefan2 (talkcontribs)

It is my understanding that the text table contains the wikitext of all revisions of all pages, including deleted pages. Deleted content is meant to be available only to admins, so I'd imagine that the restrictions you are facing are in place for that reason.

Gfdgss (talkcontribs)

Thanks, Stefan2.

Meaning there isn't any way to access the text of a page?

I know Wikipedia lets you download dumps. Don't these files include all the page text? If so, and we can get the text from those files, why can't we do it from here as well?

Moreover, the revision table has a field saying whether the edited page was deleted, so deleted pages could easily be screened out while still giving us access to the text table.

Stefan2 (talkcontribs)

I don't know what the database dumps contain or how they differ from the database copy available on Toollabs.

The revision table has a field called rev_deleted which tells if the revision has been deleted using revision deletion. If the entire page is deleted, the revisions end up in the archive table instead. In both cases, the text seems to reside in the text table (provided that the content wasn't deleted before upgrading the servers to MediaWiki 1.5, in 2005 or something).

I think I tried to look up something in the text table some time ago but failed, although I don't remember exactly what happened. Note that there is no key on the old_text field, so the execution time of a query will be proportional to the number of entries in the table (very slow) as opposed to the logarithm of the number of entries (faster). Also, the documentation says that the text can be difficult to get (for example, it may be gzipped). I don't know if WMF stores page content in compressed form or not.
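For reference, the revision-level deletion flag mentioned above can be inspected directly. A minimal sketch, with a placeholder page title:

SELECT rev_id, rev_timestamp, rev_deleted
FROM revision
JOIN page ON rev_page = page_id
WHERE page_namespace = 0
  AND page_title = 'Example'  -- placeholder title
ORDER BY rev_timestamp DESC
LIMIT 20;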

Gfdgss (talkcontribs)

I highly appreciate your in-depth reply.

Do you know of any way I can use the current schema to find which users have Barnstars or Service Awards on their user page?

If so, is there any way of knowing exactly what type it was and when it was given?

Thanks

Stefan2 (talkcontribs)

The linked page says that the barnstar templates normally are substituted, so you can't search for template transclusions. However, you could use the imagelinks table to find all user pages which transclude File:Original Barnstar.png (or other barnstar images). If the page contains a barnstar image, then there's a fair chance that the person has been granted a barnstar.

For service awards, you could similarly use the templatelinks table to find all pages which transclude w:Template:Registered Editor and similar templates.

Once you have a list of user pages, then I suppose that you could write a computer program which downloads specific information (such as the wikitext) from the API if there's more information you need.
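A sketch of the imagelinks approach described above, assuming the enwiki replica and restricting to the User namespace (namespace 2):

USE enwiki_p;
-- user pages that embed the Original Barnstar image
SELECT page_title
FROM imagelinks
JOIN page ON il_from = page_id
WHERE page_namespace = 2
  AND il_to = 'Original_Barnstar.png';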

Reply to "text table"

Looking for templates' names table

Summary by BDavis (WMF)

Templates are also pages in the page table. The tl_namespace and tl_title columns are pointers back to the page_namespace and page_title columns of the page table. The "Template" namespace is namespace 10. So if you wanted to find pages in the main namespace that transclude the Template:Table_style template:

USE enwikisource_p;
SELECT page_title
FROM templatelinks
LEFT JOIN page ON (tl_from = page_id AND tl_from_namespace = page_namespace)
WHERE tl_namespace = 10
  AND tl_title = 'Table_style'
  AND tl_from_namespace = 0;
Ineuw (talkcontribs)

I am trying to extract pages in Wikisource which contain a particular template. I have the page table, the templatelinks table, but can't find the table which stores the template names.

BDavis (WMF) (talkcontribs)

Templates are also pages in the page table. The tl_namespace and tl_title columns are pointers back to the page_namespace and page_title columns of the page table. The "Template" namespace is namespace 10. So if you wanted to find pages in the main namespace that transclude the Template:Table_style template:

USE enwikisource_p;
SELECT page_title
FROM templatelinks
LEFT JOIN page ON (tl_from = page_id AND tl_from_namespace = page_namespace)
WHERE tl_namespace = 10
  AND tl_title = 'Table_style'
  AND tl_from_namespace = 0;
Ineuw (talkcontribs)

Much thanks, I guess you saw me struggling with it. :-)

Summary by BDavis (WMF)

Most visited page data is not available in Quarry.

Alex Blokha (talkcontribs)

Hello,

1) Can I get the 3000 most visited pages in the English Wikipedia with Quarry? If yes, how?

2) Can I get the 100 most visited pages in the English Wikipedia which are not present in the Ukrainian Wikipedia?

Edgars2007 (talkcontribs)

1) No

2) Not in an easy way

JAllemandou (WMF) (talkcontribs)

Most visited pages are not available in Quarry for now.

1) You can access the top 1000 viewed pages per project using the analytics pageview API (https://wikitech.wikimedia.org/wiki/Analytics/AQS/Pageviews#Most_viewed_articles)

2) This is not available as of now.

Alex Blokha (talkcontribs)

1) Yes, I know about this tool, but I need more :)

Milimetric (WMF) (talkcontribs)

If you need more than 1000 pages, the only reasonable way to get that right now is to download the pageviews dumps and crunch the numbers yourself: https://dumps.wikimedia.org/other/pagecounts-ez/

IKhitron (talkcontribs)

Hello. I have a huge problem. I unintentionally deleted a very important query a couple of minutes ago. Is there a way to restore it? Some kind of archive? It was there for months, maybe years. Is this possible? I really do not want to rebuild it from scratch. Please help me. Thank you very much.

Zhuyifei1999 (talkcontribs)

Which query? Do you want the results or the SQL? I may be able to check the Quarry database and see if they are archived somewhere.

IKhitron (talkcontribs)

Thank you very much for your answer, Zhuyifei1999. 4441. It's SQL only. Crossing fingers.

Zhuyifei1999 (talkcontribs)

The second to last revision (id 206077, timestamp 2017-09-05 18:38:01) SQL was, AFAICT:

use hewiki_p;
select replace(page_title, "_", " ") as article, el_to as address from page join externallinks on page_id = el_from
where (el_to like "%action=edit%" or el_to like "%org/w/index.php%" or (el_to like "%wik%org/w%" and not el_to like "%upload.wikimedia.org%")
       or el_to like "%upload.wikimedia.org/wikipedia/he%"or el_to like "%upload.wikimedia.org/wikipedia/commons%"
       or el_to like "%stats.wikimed%" or el_to like "https://www.google.co.il/url?%")
and page_namespace = 0 and not el_to like "%rationalwiki.%" and not el_to like "%action=history%"
and not el_to like "%jugglewiki.%" and not el_to like "%openoffice%" and not el_to like "%wikileaks.%" and not el_to like "%strategywiki%"
and not el_to like "%osx86project.%" and not el_to like "%postgresql.%" and not el_to like "%wikigallery.%" and not el_to like "%wikisage.%"
and not el_to like "%wikiberal.%" and not el_to like "http://www.yeshiva.org.il/wiki/%" and not el_to like "http://clinfowiki.org/%"
and not el_to like "https://www.wikidata.org/wiki/Q%?uselang=he#P%" and not el_to like "%stswiki.%" and not el_to like "%inkscape.%"
and not el_to = "http://la.wikisource.org/wiki/Disputatio_pro_declaratione_virtutis_indulgentiarum?match=en" and not el_to like "%antwiki.%"
and not el_to like "%whatwg.%" and not el_to like "%ejwiki%" and not page_title in ("אולואיד", "ויקיפדיה_הבולגרית")
and not el_to like "%translate.google.com%" and not el_to like "%uselang=he%"  and not el_to like "%page=%" order by article # 21263
IKhitron (talkcontribs)

Zhuyifei1999, Thank you!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

Summary by BDavis (WMF)

Excel has a "feature" that limits cells of type "link" (URLs) to 255 characters. Quarry has been updated to check for errors from the Excel writing library and instead write the cell out as a plain string.

IKhitron (talkcontribs)

Hi again, Zhuyifei1999. Maybe you remember that a couple of months ago you created a new download option for me, Excel. I now see it has a bug. I prefer to ask you here rather than in Phab, because maybe you are aware of this. When I download the same 4441 query you saved, the result has all the article names, but only some of the URLs. What do you think about this? Thank you.

Zhuyifei1999 (talkcontribs)

Uh, works for me (LibreOffice Calc). I don't have Microsoft Excel so I have no idea why it is not working on Excel.

IKhitron (talkcontribs)

So what can we do, Zhuyifei1999? Nobody can work when half the data is missing. Thank you.

Zhuyifei1999 (talkcontribs)

I'll ask some friends of mine who have Excel to check it; it might take some time though. If I forget (or take too long), please file a ticket in Phabricator.

IKhitron (talkcontribs)

Very well, Zhuyifei1999. Can I suggest a possible cause? It's from my work experience as a programmer; it has nothing to do with the wiki.

Zhuyifei1999 (talkcontribs)

Cannot reproduce for https://quarry.wmflabs.org/run/202117/output/0/xlsx?download=true on Microsoft Excel for Mac, Version 15.37

IKhitron (talkcontribs)

Well, Zhuyifei1999, that makes it more likely that I'm right. If I am, the problem is with fields that are more than 256 bytes long. Thank you.

Zhuyifei1999 (talkcontribs)

Can you test that out? E.g. with SELECT "SOMETEXT"; varying the length of the text. Since there is no way for me to tell which text is missing, I cannot easily help you, unfortunately.
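A minimal test along those lines, using REPEAT to generate strings of a chosen length and comparing them in the exported file:

-- one short and one long string to compare in the .xlsx export
SELECT REPEAT('x', 100) AS short_text, REPEAT('x', 300) AS long_text;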

IKhitron (talkcontribs)

Try 21263, please. I have there a lot of long links. Thank you.

Zhuyifei1999 (talkcontribs)

Ok I can reproduce this on LibreOffice Calc, Version 5.1.6.2. Filed phab:T175285 to track this.

IKhitron (talkcontribs)

Thank you.

جار الله (talkcontribs)

Hello. In https://quarry.wmflabs.org/query/21866 I need the log_action or log_type for patrol

subtype=patrol

and review

subtype=accept

subtype=unaccept

like this https://en.wikipedia.org/w/index.php?title=Special%3ALog&type=patrol&user=&page=&year=&month=-1&tagfilter=&subtype=patrol

and this https://en.wikipedia.org/w/index.php?title=Special%3ALog&type=review&user=&page=&year=&month=-1&tagfilter=&subtype=accept

https://en.wikipedia.org/w/index.php?title=Special%3ALog&type=review&user=&page=&year=&month=-1&tagfilter=&subtype=unaccept

Framawiki (talkcontribs)

Hello, how can we help you? :)

جار الله (talkcontribs)

@Framawiki I wonder if there is a log_action or log_type for reviews that accept and unaccept edits?
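One way to answer this directly is to enumerate which log_action values actually occur under the patrol and review log types. A sketch, assuming the enwiki replica:

USE enwiki_p;
-- list the log_action values used for patrol and review log entries
SELECT log_type, log_action, COUNT(*) AS entries
FROM logging
WHERE log_type IN ('patrol', 'review')
GROUP BY log_type, log_action;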

Reply to "QSL approve"
Mess (talkcontribs)

Since this week I can't perform queries for orphaned pages on it.wiki, neither with Quarry (see https://quarry.wmflabs.org/query/18648) nor with Lists (see http://tools.wmflabs.org/lists/itwiki/Voci/Voci_orfane_per_dimensione). The syntax of the queries always worked until this issue came up, and I don't get why they don't complete the task now.

Achim55 (talkcontribs)

You are right, my queries are killed as well. It's not the first time Quarry has slowed down, and, surprise, there are again a lot of queries that have been "running" or queued for some weeks. It needs a clean restart. @Yuvipanda or someone else: please reset it, thanks.

Milimetric (WMF) (talkcontribs)

Hm, @Yuvipanda isn't maintaining this anymore sadly, but I'm watching these pages. I'll go try and figure out how to reset it.

Achim55 (talkcontribs)

User:Milimetric (WMF), thank you! It's an old problem, see Topic:T8ip24a3iragdqrh, Topic:T6sxf1vjyj5shf9e and several more. See also here:

Mess (talkcontribs)

User:Milimetric (WMF), may I suggest implementing a maximum time limit of (say) 1 hour for queued/hung queries in order to prevent this overload? I can't figure out why some queries can avoid the regular limit of 30 minutes and aren't killed like any other query when the time period is over.

Milimetric (WMF) (talkcontribs)

Interesting, yeah, I thought there was a solid time limit in place. I'll double check that code if I can but sadly this is very low on my priority list. So far I found out I don't have rights to the Quarry box so I pinged some people to get that. I'll try and help if I can.

Mess (talkcontribs)

OK, thank you for your interest, @Milimetric (WMF). Meanwhile I've read the 2016 Community Wishlist Survey request for Quarry maintenance (linked by @Achim55) and I've found this old Phabricator task (https://phabricator.wikimedia.org/T139162), where Yuvipanda had published the SQL code to reset the queue.

Milimetric (WMF) (talkcontribs)

Ok, I got access to the project and restarted the celery workers (the things that run the queries). Let me know if things don't improve and I can try restarting more stuff.

Achim55 (talkcontribs)

User:Milimetric (WMF), it isn't better than before. Even my roughly shortened quarry:query/7250 is killed. Btw. I hate this idiotic editor.

Mess (talkcontribs)

@Milimetric (WMF), maybe a complete cleanup would be better: queries like , or are still running!

Milimetric (WMF) (talkcontribs)

I will now reboot the boxes, maybe that will help. Which editor? The quarry code editor?

Achim55 (talkcontribs)

Oh no, sorry, I had the editing of this page in mind. It differs a lot from the wikis', and adding links is not optimal, as it isn't made for intuitive use. Dan, thank you for your efforts here!

Milimetric (WMF) (talkcontribs)

Ok, I rebooted all the boxes and verified that some queries are running ok. I still see the status of the queries that @Mess pointed out as "Running", but that can't be right because the box was rebooted. So that must just be stale state in the db; those aren't running anymore.

Milimetric (WMF) (talkcontribs)

@Achim55, yeah, Flow is a work in progress. For me, as a newcomer to wikitext, it's much, much easier to work with Flow. If you're interested in working on it, I know the people on the team and they're super nice and some of the smartest people I know. Here are some links if you're interested:

browse the code: https://github.com/wikimedia/mediawiki-extensions-Flow

contribute: https://gerrit.wikimedia.org/r/#/admin/projects/mediawiki/extensions/Flow

The team hangs out in IRC under #wikimedia-collaboration

Mess (talkcontribs)

@Milimetric (WMF), I tried to re-run [my query] several times, but nothing happened, and the same is true for the [other one] re-executed by @Achim55. I'm starting to doubt the code of our queries, but it can't be faulty, as it worked regularly in the past.

Achim55 (talkcontribs)

Confirmed. Doesn't work yet.

Mess (talkcontribs)

I'm also trying to use the "LIMIT" clause to reduce the output in [this query test], but it's useless. It's as if, in some particular cases, statements like "NOT IN" (or maybe what follows inside the parentheses) negatively affect the whole query (if I remove that part, the query test runs without problems). Anyway, I don't know where to bang my head to solve this crazy puzzle.

Milimetric (WMF) (talkcontribs)

@Mess & @Achim55, I should have looked at your queries earlier. Yes, those NOT IN clauses are very expensive to compute over the categorylinks and pagelinks tables; those tables have a huge number of records. I would suggest re-writing. Sorry I don't have much time now to help. What I would try to do is re-write the NOT INs as JOINs. This might give the optimizer a better chance to make a plan that runs faster. If that doesn't work, do me a favor and explain in plain English what you're looking for, and I can think about it when I have some more time.
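A sketch of the kind of rewrite meant here: an anti-join (LEFT JOIN ... IS NULL) in place of NOT IN, with placeholder tables and conditions:

-- NOT IN form (can be slow on large tables):
--   SELECT p.page_title FROM page p
--   WHERE p.page_id NOT IN (SELECT cl_from FROM categorylinks);
-- equivalent anti-join form, which usually gives the optimizer a better plan:
SELECT p.page_title
FROM page p
LEFT JOIN categorylinks cl ON cl.cl_from = p.page_id
WHERE cl.cl_from IS NULL;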

Mess (talkcontribs)

@Milimetric (WMF), you're absolutely right. I've recently created two separate queries (20065 and 20068) as a temporary solution for my task, even though I'd already realized there was something "strange" in the query code given its excessively time-consuming extraction (which we didn't care about in the past simply because the queries didn't stop themselves like they do now). I'll work on the original code again to find a definitive solution, but I can probably also carry on with this alternative.

Mess (talkcontribs)

@Milimetric (WMF): HOORAY! I DID IT! I've found what was wrong in my query: the part where it searched for orphaned redirects was the real "point of failure", as it was badly implemented, so I just fixed it with better code, and now query 18648 not only works again but also accomplishes its task in only 3 minutes (a great improvement compared with the 15 minutes it took previously). So, @Achim55, follow the advice above and try to find the part of the code that takes too much time to run (I suggest you split up the query and execute every piece separately).

Achim55 (talkcontribs)

Pardon?? "Write better code so you will no longer notice Quarry's malfunction." Fine. I think one should tighten the brake a bit more so that some action becomes necessary...

Milimetric (WMF) (talkcontribs)

Well, no matter how amazing Quarry is, a statement like select * from revision will always break it on large enough wikis. No tool is a substitute for writing good code. Think of writing bad code as throwing lots of kilowatts of energy in the trash. You wouldn't leave a 100W bulb on all day for no reason, so don't write inefficient SQL :)

Achim55 (talkcontribs)

Sorry for having been misunderstood: as I pointed out many months ago, queries like "SELECT * FROM page" are regularly not killed but stay "running". That's the point; the "killing mechanism" is broken. In the near future we can celebrate an anniversary: I found a query that has been "running" for nearly a year now. I don't know what happens in the background, but in our experience Quarry gets slower the more queries stay "running" for more than an hour instead of being killed.

Milimetric (WMF) (talkcontribs)

Oh, as far as I can tell those queries are killed; the interface just doesn't show the status as "Killed". So there are some bugs, but they're not affecting system performance as far as I can tell. There's a discussion going on about replacing Quarry with a tool developed independently of the WMF, so we can take advantage of regular updates. Here's that thread: https://phabricator.wikimedia.org/T169452

Zhuyifei1999 (talkcontribs)

Queries are killed, but the status-updating system may break for various reasons. See the "Bugs (Endless Query Run)" column at phab:project/board/800/. Unfortunately many of the old logs are lost, so I'm unable to debug them. Please file a ticket if you see any such behaviour.

Achim55 (talkcontribs)

@Zhuyifei1999, that's fine so far. So let me ask in a different way: half a year ago it needed 10-12 minutes for one run. Today it needs 25-30 minutes. Why is Quarry working so slowly now?

Zhuyifei1999 (talkcontribs)

This is not Quarry's problem. The query runs on the Wiki Replicas, and they may get slower due to heavy use and more data processing; for Wiki Replica issues it's best to consult the DBAs. However, I may try to optimize the query a bit.

Zhuyifei1999 (talkcontribs)

Never mind. I'd suggest asking the DBAs about query optimization, or waiting until Quarry switches to the new Wiki Replica servers some time soon (in the next few weeks/months).

Zhuyifei1999 (talkcontribs)

Quarry switched to the new Replica servers yesterday, which are said to be 5x faster, so you may want to try the query again. Note that 'USE stopthisrun' won't work on the new servers (and I don't even know what 'stopthisrun' is).

Achim55 (talkcontribs)

Thank you very much for the notification. I use the non-existent db 'stopthisrun' to quickly kill my query if I see an error after I've already started it.

Reply to "Queries for orphaned pages don't work"

Excluding blue links that come from a template?

Estopedist1 (talkcontribs)

I wonder whether it is possible to exclude blue links that come from a template. The query is here: https://quarry.wmflabs.org/query/21396

Reply to "Excluding blue links come from a template?"