Talk:Quarry

About this board

Previous discussion was archived at Talk:Quarry/Archive 1 on 2015-04-17. This is the discussion area for Quarry itself and for help with individual queries.

Dnshitobu (talkcontribs)

I think we should add a search button to the Recent Queries page to make it easier to search among old queries.

BDavis (WMF) (talkcontribs)
Reply to "Search button"

from wikiquote how to get wikidata_id

Djm-leighpark (talkcontribs)

I've got a query on the English Wikiquote and I'd like to add an additional column for the wikidata item page/QID if possible (The associated wikidata page may or may not exist).

Another way of looking at this, for a specific case: the page information can be got from enwikiquote_p, and the associated Wikidata item page (Q68584846) exists and is properly linked back to Wikiquote.

The key question is which pages in the set I am considering do not have an associated Wikidata item, so this can be investigated.

I've had a glance round the schema tables but have not spotted how to do this.

Has anyone got any ideas? Thank you. Djm-leighpark (talk) 00:39, 20 November 2022 (UTC)

Matěj Suchánek (talkcontribs)

If you add

LEFT JOIN page_props ON pp_page = page_id AND pp_propname = 'wikibase_item'

the pp_value field will contain the Wikidata item ID (or NULL if the page isn't connected). If you want to filter for unconnected pages, add pp_value IS NULL.
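
For example, a minimal sketch of the complete query (the namespace filter and LIMIT are illustrative assumptions, not part of the suggestion above):

SELECT page_title, pp_value AS wikidata_item
FROM page
LEFT JOIN page_props
  ON pp_page = page_id
  AND pp_propname = 'wikibase_item'
WHERE page_namespace = 0
  -- AND pp_value IS NULL  -- uncomment to list only unconnected pages
LIMIT 100;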

Djm-leighpark (talkcontribs)

@Matěj Suchánek Sweet. Just what I was looking for, and I have put it in my query. I might have spent days looking for that and not found it, or given up and gone off to do something else. Thank you very much. -- Djm-leighpark (talk) 11:30, 20 November 2022 (UTC)

Reply to "from wikiquote how to get wikidata_id"

How to search for range of dates?

RoySmith (talkcontribs)

I want to search ipblocks for a range of expiry dates. I'm doing:


select ipb_address, ipb_expiry from ipblocks
where ipb_user = 0
  and ipb_expiry != 'infinity'
  and ipb_expiry > '20221001000000'
  and ipb_expiry < '20221008000000'
limit 10;


which returns 0 rows. I'm assuming it's got something to do with ipb_expiry being a varbinary?

This is https://quarry.wmcloud.org/query/68543

RoySmith (talkcontribs)

It turned out the query was working perfectly. There just weren't any blocks expiring in that date range. I widened the range and it works fine. D'oh!
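
For reference, a sketch with a wider, hypothetical range. ipb_expiry stores timestamps as fixed-width YYYYMMDDHHMMSS binary strings, so plain string comparison orders them chronologically:

select ipb_address, ipb_expiry
from ipblocks
where ipb_user = 0
  and ipb_expiry != 'infinity'
  and ipb_expiry > '20221001000000'
  and ipb_expiry < '20221101000000'  -- widened to a full month
limit 10;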

My query gets stopped for some reason

Bunyk (talkcontribs)

It was working fine before, providing results in half an hour, but the last 3 versions just stop: quarry.wmcloud.org/query/68565


Is it because it hits some timeout and needs to be optimized somehow? Or am I running it too often? Or is there some bug in Quarry?

Matěj Suchánek (talkcontribs)

Toolforge SQL Optimizer says:

id select_type table type possible_keys key key_len ref rows Extra
1.1 PRIMARY ALL distinct_key 1367979 Using temporary; Using filesort
1.2 PRIMARY pagelinks ref pl_namespace pl_namespace 261 ukwiki.archive.ar_namespace,ukwiki.archive.ar_title 21 Using where; Using index
2.3 MATERIALIZED archive index ar_name_title_timestamp ar_name_title_timestamp 275 1367979 Using where; Using index
3.4 MATERIALIZED page index PRIMARY, page_name_title, page_random, page_len, page_redirect_namespace_len page_name_title 261 3640238 Using index
3.5 MATERIALIZED revision ref rev_page_actor_timestamp, rev_page_timestamp rev_page_timestamp 4 ukwiki.page.page_id 4 Using index

Basically, the query scans too many rows and does filesort, which is slow.

Also, you don't need to query the revision table: when the page does not exist, there will be no revisions. I'd probably rewrite the query using a join like:

SELECT ... FROM pagelinks
LEFT JOIN page ON page_namespace = pl_namespace AND page_title = pl_title
WHERE EXISTS (SELECT 1 FROM archive WHERE ...)
AND page_id IS NULL
[...]

etc. and try it again.
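
For illustration, a fuller sketch along those lines, assuming the goal is red-linked mainspace pages that have deleted revisions (the namespace filter and the grouping by link count are my assumptions, not part of the skeleton above):

SELECT pl_title, COUNT(*) AS incoming_links
FROM pagelinks
LEFT JOIN page
  ON page_namespace = pl_namespace
  AND page_title = pl_title
WHERE pl_namespace = 0
  AND page_id IS NULL          -- the link is currently red
  AND EXISTS (SELECT 1 FROM archive
              WHERE ar_namespace = pl_namespace
                AND ar_title = pl_title)  -- the page has deleted revisions
GROUP BY pl_title;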

Bunyk (talkcontribs)

Oh, thank you for your explanation.


So yes, at first I also thought that I don't need to look into the revision table, but when I don't check that there are no revisions for the page, it also includes pages that have deleted revisions (rows in archive) but are not currently deleted.


Maybe I'll try to detect deleted pages by checking whether they have an entry in the Manual:Logging table. But what is the fastest way to check whether a link is currently red or not?


What I'm trying to do here is create a list of pages that were deleted but are still very much wanted, to consider undeleting them if they were notable, or just removing links to them if they were not.

Matěj Suchánek (talkcontribs)

"when I do not check that there are no revisions for the page it also includes pages that have deleted revisions (rows in archive), but are not currently deleted." Page and revision existence are equivalent: there is no existing page with no revisions, and a revision cannot exist without an existing page. What I meant is that you don't have to include revision in the query; checking just page is enough for you.

"Maybe I'll try to detect deleted pages by checking whether they have an entry in the Manual:Logging table." That's possible, but there is no index on (log_type, log_action, log_namespace, log_title), so it's potentially very slow, too.

"But what is the fastest way to check whether a link is currently red or not?" It's the query I provided in my reply, or an equivalent query with NOT EXISTS.
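
For illustration, a sketch of that NOT EXISTS variant (the mainspace filter is assumed):

SELECT pl_namespace, pl_title
FROM pagelinks
WHERE pl_namespace = 0
  AND NOT EXISTS (SELECT 1 FROM page
                  WHERE page_namespace = pl_namespace
                    AND page_title = pl_title);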

Bunyk (talkcontribs)

And because you mentioned that filesort is slow, I tried removing ORDER BY. And you know what? It executed in 419.00 seconds, returning 333 rows, which I can now sort in the browser. So thank you for the hint.


Crazy, though: how can sorting three hundred rows make it that slow?

Bunyk (talkcontribs)

And after applying the advice to check only that the page is absent, rather than its revisions, the same query executed in 68.86 seconds. A 6x speedup, thank you again.

Reply to "My query gets stopped for some reason"

How to overcome stale query results?

Djm-leighpark (talkcontribs)

The #SheSaid main page Wikiquote:Wikiquote:SheSaid#New_articles currently has a Query link to track the hashtag, which unfortunately gives a stale query result from 18 December 2022, while up-to-date information is really required. Pressing Fork & Submit will give current query results, but is there a way of getting the query re-executed at the current time by adding parameters to the URL? Alternatively, could the result screen indicate to novice Quarry users who have been pointed at a stale result set that Fork & Submit will give a refreshed result? Thank you. -- ~~~~

Djm-leighpark (talkcontribs)

I have since noticed that replacing "/query/" with "/fork/" in the URL goes directly to a screen where an updated query may be executed to get a fresh snapshot (quarry.wmcloud.org/fork/67962, for example). Obviously it would not be advisable to do that for heavy queries, and it depends on whether the underlying data remains fairly static; in the case here I would argue it does not. Any alternative tips remain welcome. Thank you. -- Djm-leighpark (talk) 14:45, 31 October 2022 (UTC)

Reply to "How to overcome stale query results?"

Query all companies

2405:201:D021:7F:6113:A239:C147:6538 (talkcontribs)

Is there an easy way to query all company (organisation) pages without having to go through various categories and their subcategories to filter only companies?

Reply to "Query all companies"

Retrieving a list of pages and the hidden/tracking categories they are in

Invisigoth67 (talkcontribs)

Hello, I'd like to be able to retrieve a list of pages created by a specific user and the hidden/tracking categories these pages are in. Using the pagelinks or categorylinks tables didn't work for me. If it's a matter of query run time, both input lists (pages by a specific user and a selection of hidden categories) could be provided manually.

BDavis (WMF) (talkcontribs)

Can you be more specific on how categorylinks failed to answer your question? A query to find the hidden categories of a main namespace page could look like:

SELECT p.page_title, cl_to AS hidden_category
FROM page AS p
INNER JOIN categorylinks ON p.page_id = cl_from
INNER JOIN page AS catpage ON catpage.page_namespace = 14
  AND catpage.page_title = cl_to
INNER JOIN page_props ON pp_page = catpage.page_id AND pp_propname = 'hiddencat'
WHERE p.page_namespace = 0
  AND p.page_title = 'NOFX';
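
A possible extension toward the original ask, i.e. all mainspace pages created by a given user together with their hidden categories. The user name is a placeholder, and the revision_userindex join with rev_parent_id = 0 to find page-creating edits is my assumption, not part of the reply above:

SELECT p.page_title, cl_to AS hidden_category
FROM page AS p
INNER JOIN revision_userindex AS r ON r.rev_page = p.page_id
  AND r.rev_parent_id = 0  -- the creating revision has no parent
INNER JOIN actor AS a ON a.actor_id = r.rev_actor
INNER JOIN categorylinks ON p.page_id = cl_from
INNER JOIN page AS catpage ON catpage.page_namespace = 14
  AND catpage.page_title = cl_to
INNER JOIN page_props ON pp_page = catpage.page_id AND pp_propname = 'hiddencat'
WHERE p.page_namespace = 0
  AND a.actor_name = 'ExampleUser';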

Invisigoth67 (talkcontribs)

Thank you very much, this is exactly what I was looking for! It seems I was close but used a wrong column name at one point.

Reply to "Retrieving a list of pages and the hidden/tracking categories they are in"

Retrieving a list of pages and the users who created them

Mike Christie (talkcontribs)

I'd like to be able to retrieve a list of users who created pages in the Talk namespace that end in "/GAn". For GA5, for example, I wrote this:

select p.page_title, p.page_id, r.rev_timestamp, u.user_name
from page p
inner join revision r
  on p.page_id = r.rev_page
inner join actor a
  on r.rev_actor = a.actor_id
inner join user u
  on a.actor_user = u.user_id
where p.page_title LIKE '%a/GA5'
  and r.rev_parent_id = 0
  and p.page_namespace = 1

but it executes extremely slowly. Eliminating the joins to actor and user makes it run in a few seconds, so why is this slow? I would have thought it would pick up the indexes on actor and user on those fields.

BDavis (WMF) (talkcontribs)

You can use the tool at https://sql-optimizer.toolforge.org/ to get an explain plan for any particular query. When I put this query into that system, it does look like the database engine is trying to use indexes, but I also see that joining with the user table makes the database consider 43 million indexed rows as part of its analysis.

One optimization that might help is using the special "revision_userindex" table rather than directly using the revision table.
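
A sketch of the same query with only that substitution (revision_userindex is the Wiki Replicas view of revision that keeps the user/actor indexes; everything else is unchanged from the query above):

select p.page_title, p.page_id, r.rev_timestamp, u.user_name
from page p
inner join revision_userindex r
  on p.page_id = r.rev_page
inner join actor a
  on r.rev_actor = a.actor_id
inner join user u
  on a.actor_user = u.user_id
where p.page_title LIKE '%a/GA5'
  and r.rev_parent_id = 0
  and p.page_namespace = 1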

The p.page_title LIKE '%a/GA5' clause will always result in a linear scan of some collection of rows; MySQL/MariaDB indexes simply don't work with leading-wildcard searches. There are techniques that can work around this, but they are not in use in the Wiki Replicas databases, as they are not needed for MediaWiki's normal operations.

Mike Christie (talkcontribs)

Thanks; that's very helpful, and I didn't know about the explain plan tool. I'll have a look at the revision_userindex table.

Reply to "Retrieving a list of pages and the users who created them"

Fetching from Flow database

Summary by Msz2001

There's no flowdb available in Quarry

Msz2001 (talkcontribs)

I am interested in querying the flowdb database (Flow / Structured Discussions). Is it possible to do with Quarry? Currently I'm getting "access denied" after submitting something like this: SELECT * FROM flowdb.flow_revision.

BDavis (WMF) (talkcontribs)

Flowdb lives in something called the x1 cluster for the Wikimedia production wikis. This cluster is not included in the Wiki Replicas service and thus is not exposed to Quarry.

How to fetch revision text content

Summary by EpicPupper

Quarry does not have access to revision text. There are XML dumps which may have the information, and generally using the Action API to fetch revision text is a reasonable thing to do as well.

Ywats0ns (talkcontribs)

Hello, I'd like to do some analysis of all recent revisions that contain a certain keyword / match a regex. I have been able to fetch the recent revisions up to 30 days back with the Action API, but I guess this requires a lot of resources, as I have to call the API for every revision. Is there a way to access this through Quarry or through a dump of the revisions?

Thanks a lot

BDavis (WMF) (talkcontribs)

Quarry does not have access to revision text, as this part of the production database is not included in the wikitech:Wiki Replicas service. There are XML dumps which may have the information you are interested in; see meta:Data dumps for more information about that service. Generally, using the Action API to fetch revision text is a reasonable thing to do as well. API:Etiquette has some advice on making your Action API requests friendlier to the servers, such as setting a custom user-agent that gives contact information and using the maxlag parameter to avoid piling on when the database is under greater than normal replication pressure.