How do I map from page.page_namespace to a string ("Draft", "Talk", etc)?
About this board
Discussion area for Quarry itself and for help with individual queries.
I would like to know as well whether there is a shortcut for this, but meanwhile I use this:
SELECT page_namespace, (case
when page_namespace = 0 then 'Main'
when page_namespace = 1 then 'Talk'
when page_namespace = 2 then 'User'
when page_namespace = 3 then 'User talk'
when page_namespace = 4 then 'Wikipedia'
when page_namespace = 5 then 'Wikipedia talk'
when page_namespace = 6 then 'File'
when page_namespace = 7 then 'File talk'
when page_namespace = 8 then 'MediaWiki'
when page_namespace = 9 then 'MediaWiki talk'
when page_namespace = 10 then 'Template'
when page_namespace = 11 then 'Template talk'
when page_namespace = 12 then 'Help'
when page_namespace = 13 then 'Help talk'
when page_namespace = 14 then 'Category'
when page_namespace = 15 then 'Category talk'
when page_namespace = 100 then 'Portal'
when page_namespace = 101 then 'Portal talk'
when page_namespace = 828 then 'Module'
when page_namespace = 829 then 'Module talk'
end) as namespace_name
FROM page;
I've been playing around with this, but can't get it to work. Unfortunately, my SQL is kind of weak. When I run https://quarry.wmflabs.org/query/30504, I get Unknown column 'page_namespace' in 'field list'.
Can you give an example of how this would be used in a real query?
The namespace label varies by wiki (depending on installed extensions) and by UI language. See phab:T180558 for some discussion of a user-provided table on ToolsDB that was once available for this sort of lookup. Unfortunately, even if that tool were fixed, there is currently no way to connect to ToolsDB from Quarry, per phab:T151158.
We download this periodically to make use of it in Analytics; this is the script we use:
You can store the results in a table and join to that if you like.
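If you do store those names in a table, the long CASE above collapses to a join. A minimal sketch, assuming a hypothetical lookup table namespace_names(ns_id, ns_name) that you have populated yourself:

```sql
-- namespace_names is a hypothetical user-created lookup table,
-- e.g. rows (0, 'Main'), (1, 'Talk'), (2, 'User'), ...
SELECT p.page_title,
       n.ns_name AS namespace_name
FROM page AS p
JOIN namespace_names AS n
  ON n.ns_id = p.page_namespace
LIMIT 10;
```

A LEFT JOIN with COALESCE(n.ns_name, 'Unknown') is safer if the lookup table might be missing some namespaces.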
Is there some way to automagically update/re-execute quarries from the Tools servers?
For a few of my quarries it might be a good idea to refresh the output from time to time (e.g. weekly), since they could be used as a list to fix some inconsistencies. Is there any way of doing this with a cronjob or script?
user conversations data
What would be the best way to extract conversations made on article or user talk spaces?
If someone could point me towards the XML dump to be used for this or a table in the database that can give me the text...
I'm interested in analyzing user conversations and linking that to the probability of a user getting blocked in the future...
The dump you need is, for example for simplewiki, https://dumps.wikimedia.org/simplewiki/20181001/. You want the sets that have the full revision history to recreate conversations. When you go through it you can filter down to just the Talk and User_talk namespaces.
Yes, I was able to figure out that I need the XML dump of pages-revision-meta-history.
Thanks a ton for your replies @Milimetric (WMF) !
user activity data
I'm looking for a query that can give me a user id/name along with all activity associated with that user, such as comments/interactions with other users, as well as the text of edits made on pages.
Even just the table names that need to be joined would be good to know; I'll write the query myself.
The text has to be fetched from the API, it's not queryable. But the metadata associated with users is in the user table. Their contributions are in the revision table, where "rev_user" is the user id. Take a look at the schema documentation: https://www.mediawiki.org/wiki/Manual:Database_layout
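As a rough sketch of the join described above (the user name is a placeholder; on the replicas of that era, rev_user held the user id):

```sql
-- Metadata only: edit summaries and timestamps for one user's
-- contributions. The page text itself is not in these tables.
SELECT u.user_id,
       u.user_name,
       r.rev_id,
       r.rev_timestamp,
       r.rev_comment          -- the edit summary, not the revision text
FROM user AS u
JOIN revision AS r ON r.rev_user = u.user_id
WHERE u.user_name = 'ExampleUser'   -- hypothetical user name
ORDER BY r.rev_timestamp DESC
LIMIT 20;
```

On the wiki replicas, filtering on rev_user is usually faster through the revision_userindex view, where the indexes on that column are usable.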
I'm not sure what you're working on, but it sounds like the Detox research project might be of interest: https://meta.wikimedia.org/wiki/Research:Detox
Thanks a ton for your reply! Yes, I am trying to do something along the lines of the Detox research. I have a couple of doubts; it would be great if you have any insight into this:
- The database layout page says that there's a table called text which contains the "text content" pertaining to the revisions made. But when I try to access that table using Quarry, I get an error saying that the table doesn't exist. Manual:Text table does suggest that the text should exist in table format, so I wonder why it says the table doesn't exist. Any idea? It would be ideal to get the data from an SQL-like table format; my last resort would be to parse through the XML dump...
- What would be the best way to extract conversations made on article or user talk spaces? I'm interested in analyzing user conversations and linking that to the probability of a user getting blocked in the future...
@Cr29uva, yes, the text table isn't replicated to these databases because it's too big and hard to strip private parts out of it. The XML dumps are your best bet for now. We're close to releasing a new dataset that's more analytics-oriented, that may be of interest to you in the future. It will have user blocks as applied over the history of each user (so, for example, if someone's blocked five times, it will have those periods with a start and end). It won't have text initially either, but we're working on adding that over the next year.
Important lack/bug discovered?
After these major updates to Quarry, I think I have found a limitation/bug.
If you use this: AND page_title IN ('phrase X', 'phrase Y', 'phrase ...') (test template here: https://quarry.wmflabs.org/query/29938), then you can only put in fewer than 750 phrases. Before these major updates the maximum was about 2200 phrases. Any suggestions?
I tested a little. It seems that what matters is the total number of characters, not the number of lines. The limit appears to be about 13,000 characters; if your query consists of, for example, 14,000 characters, it cannot be run.
If I am right, then why was the limit earlier much higher than 13,000 characters?
For some reason max_allowed_packet is 1024. It's supposed to be 16M...
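You can check the effective value yourself from any query session; a sketch:

```sql
-- Shows the server-side limit on a single query/packet size, in bytes.
SHOW VARIABLES LIKE 'max_allowed_packet';
```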
Arturo made a patch for this: gerrit:462769, thanks.
@Estopedist1 The patch has now been merged and applied. Could you confirm it works now?
Yes, it is more powerful now: about 36,000 characters. I think it was even more powerful earlier, but that's no problem for me. I guess it probably consumes a lot of resources (I mean servers, hardware) to raise max_allowed_packet.
How does one request a username change for quarry?
@Vexations: You should get renamed after your next login to Quarry after your SUL rename.
@Vexations: can you confirm that the problem is solved?
Yes, it has been solved, thanks.
Is externallinks not available for the English Wikipedia?
I was hoping to do a query to find Flickr images on enwiki (these need to be moved to Commons). I get an error. Thanks.
View 'enwiki_p.externallinks' references invalid table(s) or column(s) or function(s) or definer/invoker of view lack rights to use them
Not related. The underlying wiki replicas seem to deny access.
MariaDB [enwiki_p]> SELECT el_from FROM externallinks limit 5;
ERROR 1356 (HY000): View 'enwiki_p.externallinks' references invalid table(s) or column(s) or function(s) or definer/invoker of view lack rights to use them
Could you file a bug under #data-services?
Seeking list of Wikimedia user accounts with blocks
Thanks for any help anyone can provide. If it is easy then I would appreciate a query. If it is more complicated then I would appreciate general instructions.
I need to learn how to list all English Wikipedia user accounts which are blocked.
I am looking at Manual:Database layout, trying to understand which property in this schema corresponds to a user account block. After I learn how to query for blocks, I want to query for accounts with other user rights, but I am not seeing how this is mapped. If anyone has advice, I would appreciate it. Thanks.
USE enwiki_p;

SELECT ipb_address FROM ipblocks WHERE ipb_user <> 0;

SELECT ipb_address
FROM ipblocks
JOIN user_groups ON ug_user = ipb_user
WHERE ipb_user <> 0 AND ug_group = 'user_group_identifier';
No index for revision table?
I'm running what looks like a very simple query:
select count(*) from revision where rev_user = 130326;
I don't see any keys listed.
MariaDB [enwiki_p]> select count(*) from revision_userindex where rev_user = 130326;
+----------+
| count(*) |
+----------+
|    23549 |
+----------+
1 row in set (3.81 sec)
Even quicker (the counting is a bit different; it takes into account things like log events):
MariaDB [enwiki_p]> select user_editcount from user where user_id = 130326;
+----------------+
| user_editcount |
+----------------+
|          24197 |
+----------------+
1 row in set (0.01 sec)
Interesting. There's no table "revision_userindex" listed in Category:MediaWiki database tables. Is that just not up to date? I do see it when I run "show tables", and the query you suggest does indeed work.
So, the next obvious question is, why do we have two copies of the table, one with the index and another that's missing it?
Because Category:MediaWiki database tables lists the tables that are built into MediaWiki. These tables can contain highly sensitive data, such as hashes of passwords, tokens, IP addresses, revision-deleted or oversighted revision comments, etc. Therefore, when these databases are replicated to the wiki replicas, which Quarry queries, some tables like revision must have certain cells redacted from a row when certain criteria are met.

We implement some of the redaction by not replicating some tables at all, and other redaction by using views. As for your query, the relevant redaction in the revision view (not table) is

if((`enwiki`.`revision`.`rev_deleted` & 4),NULL,`enwiki`.`revision`.`rev_user`) AS `rev_user`

which makes rev_user NULL when rev_deleted has bit 2^2 (for a revdel-ed username) set. As a side effect of this if, indexes on the rev_user column cannot be used.

Since unusable indexes are a serious performance problem, the DBAs made alternative views that redact such rows entirely. This is the case for the view revision_userindex, whose definition includes

`enwiki`.`revision`.`rev_user` AS `rev_user`

and

where ((`enwiki`.`revision`.`rev_deleted` & 4) = 0)

Here rev_user is a simple passthrough, allowing the use of indexes on that column, and the bit condition on rev_deleted has become an implicit but enforced WHERE condition on the underlying revision table, which we cannot query directly.
So no, we don't have two copies of the table, but two views of the same table.
Ah, cool. Thanks. That all makes perfect sense.
Maybe the "Database tables" tab in Quarry could include this information? I imagine I'm not the first person to stub my toe on this.
Good point. Though, what link text and target do you suggest? cc @Framawiki:
See also phab:T139289
Since a table was deleted, how does one get the page_id for a qid efficiently?
The page properties table doesn't seem to work well for this.
JOIN page ON page_namespace = 0 AND page_title = ...
Thanks. Way too easy :)
Perhaps I shouldn't have been so brief.
wb_items_per_site still contains bare numbers, so you need to do either:

JOIN page ON page_namespace = 0 AND page_title = CONCAT('Q', ips_item_id)

or:

JOIN wb_items_per_site ON REPLACE(page_title, 'Q', '') = ips_item_id

This will let the query use indexes and therefore be fast.
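Put together, a sketch of the CONCAT variant for a hypothetical item id (42 here is just an example):

```sql
-- Map a Wikidata item id to its page_id via the indexed
-- (page_namespace, page_title) lookup on the page table.
SELECT p.page_id,
       ips.ips_site_id,
       ips.ips_site_page
FROM wb_items_per_site AS ips
JOIN page AS p
  ON p.page_namespace = 0
 AND p.page_title = CONCAT('Q', ips.ips_item_id)
WHERE ips.ips_item_id = 42;
```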