How do I map from page.page_namespace to a string ("Draft", "Talk", etc)?
About this board
Discussion area for Quarry itself and for help with individual queries.
I would like to know as well whether there is a shortcut for this, but meanwhile I use this:
SELECT page_namespace, (case
when page_namespace = 0 then 'Main'
when page_namespace = 1 then 'Talk'
when page_namespace = 2 then 'User'
when page_namespace = 3 then 'User talk'
when page_namespace = 4 then 'Wikipedia'
when page_namespace = 5 then 'Wikipedia talk'
when page_namespace = 6 then 'File'
when page_namespace = 7 then 'File talk'
when page_namespace = 8 then 'MediaWiki'
when page_namespace = 9 then 'MediaWiki talk'
when page_namespace = 10 then 'Template'
when page_namespace = 11 then 'Template talk'
when page_namespace = 12 then 'Help'
when page_namespace = 13 then 'Help talk'
when page_namespace = 14 then 'Category'
when page_namespace = 15 then 'Category talk'
when page_namespace = 100 then 'Portal'
when page_namespace = 101 then 'Portal talk'
when page_namespace = 828 then 'Module'
when page_namespace = 829 then 'Module talk'
end) as namespace_name
FROM page;
I've been playing around with this, but can't get it to work. Unfortunately, my SQL is kind of weak. When I run https://quarry.wmflabs.org/query/30504, I get Unknown column 'page_namespace' in 'field list'.
Can you give an example of how this would be used in a real query?
The namespace label varies by wiki (depending on installed extensions) and by UI language. See phab:T180558 for some discussion of a user-provided table on ToolsDB that was once available for this sort of lookup. Unfortunately, even if that tool were fixed, there is currently no way to connect to ToolsDB from Quarry, per phab:T151158.
We download this periodically to make use of it in Analytics; this is the script we use:
You can store the results in a table and join to that if you like.
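If you do store those names in a table, the long CASE above collapses to a join. A minimal sketch, assuming a hypothetical lookup table namespace_names(ns_id, ns_name) that you have populated yourself:

```sql
-- namespace_names is a hypothetical user-created lookup table,
-- e.g. rows (0, 'Main'), (1, 'Talk'), (2, 'User'), ...
SELECT p.page_title,
       n.ns_name AS namespace_name
FROM page AS p
JOIN namespace_names AS n
  ON n.ns_id = p.page_namespace
LIMIT 10;
```

A LEFT JOIN with COALESCE(n.ns_name, 'Unknown') is safer if the lookup table might be missing some namespaces.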
Is there some way to automagically update/re-execute quarries from the Tools servers?
For a few of my quarries it might be a good idea to refresh the output from time to time (e.g. weekly), since they could be used as a list to fix some inconsistencies. Is there any way of doing this with a cronjob or script?
user conversations data
What would be the best way to extract conversations made on article or user talk spaces?
If someone could point me towards the XML dump to be used for this or a table in the database that can give me the text...
I'm interested in analyzing user conversations and linking that to the probability of a user getting blocked in the future...
The dump you need is, for example for simplewiki, https://dumps.wikimedia.org/simplewiki/20181001/. You want the sets that have the full revision history to recreate conversations. When you go through it you can filter down to just the Talk and User_talk namespaces.
Yes, I was able to figure out that I need the XML dump of pages-revision-meta-history.
Thanks a ton for your replies @Milimetric (WMF) !
user activity data
I'm looking for a query that can give me a user id/name along with all activity associated with that user, such as comments/interactions with other users, as well as the text of edits made on pages.
Even just the table names that need to be joined would be good to know; I'll write the query myself.
The text has to be fetched from the API, it's not queryable. But the metadata associated with users is in the user table. Their contributions are in the revision table, where "rev_user" is the user id. Take a look at the schema documentation: https://www.mediawiki.org/wiki/Manual:Database_layout
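As a rough sketch of the join described above (the user name is a placeholder; on the replicas of that era, rev_user held the user id):

```sql
-- Metadata only: edit summaries and timestamps for one user's
-- contributions. The page text itself is not in these tables.
SELECT u.user_id,
       u.user_name,
       r.rev_id,
       r.rev_timestamp,
       r.rev_comment          -- the edit summary, not the revision text
FROM user AS u
JOIN revision AS r ON r.rev_user = u.user_id
WHERE u.user_name = 'ExampleUser'   -- hypothetical user name
ORDER BY r.rev_timestamp DESC
LIMIT 20;
```

On the wiki replicas, filtering on rev_user is usually faster through the revision_userindex view, where the indexes on that column are usable.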
I'm not sure what you're working on, but it sounds like the Detox research project might be of interest: https://meta.wikimedia.org/wiki/Research:Detox
Thanks a ton for your reply! Yes, I am trying to do something along the lines of the Detox research. I have a couple of doubts; it would be great if you have any insight into this:
- The database layout page says that there's a table called text which contains the "text content" pertaining to the revisions made. But when I try to access that table using Quarry, I get an error saying that the table doesn't exist. Manual:Text table does suggest that the text should exist in table format, so I wonder why it says the table doesn't exist. Any idea? It would be ideal to get the data from an SQL-like table format; my last resort would be to parse through the XML dump...
- What would be the best way to extract conversations made on article or user talk spaces? I'm interested in analyzing user conversations and linking that to the probability of a user getting blocked in the future...
@Cr29uva, yes, the text table isn't replicated to these databases because it's too big and hard to strip private parts out of it. The XML dumps are your best bet for now. We're close to releasing a new dataset that's more analytics-oriented, that may be of interest to you in the future. It will have user blocks as applied over the history of each user (so, for example, if someone's blocked five times, it will have those periods with a start and end). It won't have text initially either, but we're working on adding that over the next year.
Important lack/bug discovered?
After these major updates to Quarry, I think I have found a limitation/bug.
If you use this: AND page_title IN ('phrase X', 'phrase Y', 'phrase ...') (test template here: https://quarry.wmflabs.org/query/29938), then you can only put in fewer than 750 phrases. Before these major updates the maximum was about 2200 phrases. Any suggestions?
I tested a little. It seems that what matters is the total number of characters, not the number of lines. The limit appears to be about 13,000 characters; if your query consists of, for example, 14,000 characters, it cannot be run.
If I am right, then why was the limit earlier much higher than 13,000 characters?
For some reason max_allowed_packet is 1024. It's supposed to be 16M...
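You can check the effective value yourself from any query session; a sketch:

```sql
-- Shows the server-side limit on a single query/packet size, in bytes.
SHOW VARIABLES LIKE 'max_allowed_packet';
```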
Arturo made a patch for this: gerrit:462769, thanks.
@Estopedist1 The patch has now been merged and applied. Could you confirm it works now?
Yes, it is more powerful now: about 36,000 characters. I think it was even more powerful earlier, but that's no problem for me. I guess it probably consumes a lot of resources (I mean servers, hardware) to raise max_allowed_packet.
How does one request a username change for quarry?
@Vexations: You should get renamed after your next login to Quarry after your SUL rename.
@Vexations: can you confirm that the problem is solved?
Yes, it has been solved, thanks.
Is externallinks not available for the English Wikipedia?
I was hoping to do a query to find Flickr images on enwiki (these need to be moved to Commons). I get an error. Thanks.
View 'enwiki_p.externallinks' references invalid table(s) or column(s) or function(s) or definer/invoker of view lack rights to use them
Not related. The underlying wiki replicas seem to deny access.
MariaDB [enwiki_p]> SELECT el_from FROM externallinks limit 5;
ERROR 1356 (HY000): View 'enwiki_p.externallinks' references invalid table(s) or column(s) or function(s) or definer/invoker of view lack rights to use them
Could you file a bug under #data-services?
Seeking list of Wikimedia user accounts with blocks
Thanks for any help anyone can provide. If it is easy then I would appreciate a query. If it is more complicated then I would appreciate general instructions.
I need to learn how to list all English Wikipedia user accounts which are blocked.
I am looking at Manual:Database layout, trying to understand which property in this schema corresponds to a user account block. After I learn how to query for blocks, I want to query for accounts with other user rights, but I am not seeing how this is mapped. If anyone has advice, I would appreciate it. Thanks.
USE enwiki_p;

SELECT ipb_address FROM ipblocks WHERE ipb_user <> 0;

SELECT ipb_address
FROM ipblocks
JOIN user_groups ON ug_user = ipb_user
WHERE ipb_user <> 0 AND ug_group = 'user_group_identifier';
No index for revision table?
I'm running what looks like a very simple query:
select count(*) from revision where rev_user = 130326;
I don't see any keys listed.
MariaDB [enwiki_p]> select count(*) from revision_userindex where rev_user = 130326;
+----------+
| count(*) |
+----------+
|    23549 |
+----------+
1 row in set (3.81 sec)
Even quicker (the counting is a bit different; it takes into account things like log events):
MariaDB [enwiki_p]> select user_editcount from user where user_id = 130326;
+----------------+
| user_editcount |
+----------------+
|          24197 |
+----------------+
1 row in set (0.01 sec)
Interesting. There's no table "revision_userindex" listed in Category:MediaWiki database tables. Is that just not up to date? I do see it when I run "show tables", and the query you suggest does indeed work.
So, the next obvious question is, why do we have two copies of the table, one with the index and another that's missing it?
Because Category:MediaWiki database tables lists the tables that are built into MediaWiki. These tables can contain highly sensitive data, such as hashes of passwords, tokens, IP addresses, revision-deleted or oversighted revision comments, etc. Therefore, when these databases are replicated to the wiki replicas, which Quarry queries, some tables like revision must have certain cells redacted from a row when certain criteria are met.

We implement some of the redaction by not replicating some tables at all, and other redaction by using views. As for your query, the relevant redaction in the revision view (not table) is

if((`enwiki`.`revision`.`rev_deleted` & 4),NULL,`enwiki`.`revision`.`rev_user`) AS `rev_user`

which makes rev_user NULL when rev_deleted has bit 2^2 (for a revdel-ed username) set. As a side effect of this if, indexes on the rev_user column cannot be used.

Since unusable indexes are a serious performance problem, the DBAs made alternative views that redact such rows entirely. This is the case for the view revision_userindex, whose definition includes

`enwiki`.`revision`.`rev_user` AS `rev_user`

and

where ((`enwiki`.`revision`.`rev_deleted` & 4) = 0)

Here rev_user is a simple passthrough, allowing the use of indexes on that column, and the bit condition on rev_deleted has become an implicit but enforced WHERE condition on the underlying revision table, which we cannot query directly.
So no, we don't have two copies of the table, but two views of the same table.
Ah, cool. Thanks. That all makes perfect sense.
Maybe the "Database tables" tab in Quarry could include this information? I imagine I'm not the first person to stub my toe on this.
Good point. Though, what link text and target do you suggest? cc @Framawiki:
See also phab:T139289
Since a table was deleted, how does one get the page_id for a qid efficiently?
The page properties table doesn't seem to work well for this.
JOIN page ON page_namespace = 0 AND page_title = ...
Thanks. Way too easy :)
Perhaps I shouldn't have been so brief.
wb_items_per_site still contains bare numbers, so you need to do either:

JOIN page ON page_namespace = 0 AND page_title = CONCAT('Q', ips_item_id)

or:

JOIN wb_items_per_site ON REPLACE(page_title, 'Q', '') = ips_item_id

This will let the query use indexes and therefore be fast.
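Put together, a sketch of the CONCAT variant for a hypothetical item id (42 here is just an example):

```sql
-- Map a Wikidata item id to its page_id via the indexed
-- (page_namespace, page_title) lookup on the page table.
SELECT p.page_id,
       ips.ips_site_id,
       ips.ips_site_page
FROM wb_items_per_site AS ips
JOIN page AS p
  ON p.page_namespace = 0
 AND p.page_title = CONCAT('Q', ips.ips_item_id)
WHERE ips.ips_item_id = 42;
```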