Talk:Quarry


About this board

Discussion area for Quarry itself and for help with individual queries.

RoySmith (talkcontribs)

How do I map from page.page_namespace to a string ("Draft", "Talk", etc)?

Tohaomg (talkcontribs)

I would like to know as well whether there is a shortcut for this, but meanwhile I use this:

SELECT page_namespace, (CASE
    WHEN page_namespace = 0 THEN 'Main'
    WHEN page_namespace = 1 THEN 'Talk'
    WHEN page_namespace = 2 THEN 'User'
    WHEN page_namespace = 3 THEN 'User talk'
    WHEN page_namespace = 4 THEN 'Wikipedia'
    WHEN page_namespace = 5 THEN 'Wikipedia talk'
    WHEN page_namespace = 6 THEN 'File'
    WHEN page_namespace = 7 THEN 'File talk'
    WHEN page_namespace = 8 THEN 'MediaWiki'
    WHEN page_namespace = 9 THEN 'MediaWiki talk'
    WHEN page_namespace = 10 THEN 'Template'
    WHEN page_namespace = 11 THEN 'Template talk'
    WHEN page_namespace = 12 THEN 'Help'
    WHEN page_namespace = 13 THEN 'Help talk'
    WHEN page_namespace = 14 THEN 'Category'
    WHEN page_namespace = 15 THEN 'Category talk'
    WHEN page_namespace = 100 THEN 'Portal'
    WHEN page_namespace = 101 THEN 'Portal talk'
    WHEN page_namespace = 828 THEN 'Module'
    WHEN page_namespace = 829 THEN 'Module talk'
END) AS namespace_name
FROM page
LIMIT 100;
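
For example, the expression can be used in a complete query; a minimal sketch (assuming enwiki_p, and using the shorter simple-CASE form for brevity) that counts pages per namespace:

USE enwiki_p;
SELECT page_namespace,
       (CASE page_namespace
            WHEN 0 THEN 'Main'
            WHEN 1 THEN 'Talk'
            ELSE 'Other'
        END) AS namespace_name,
       COUNT(*) AS pages
FROM page
GROUP BY page_namespace;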

RoySmith (talkcontribs)

I've been playing around with this, but can't get it to work. Unfortunately, my SQL is kind of weak. When I run https://quarry.wmflabs.org/query/30504, I get Unknown column 'page_namespace' in 'field list'.

Can you give an example of how this would be used in a real query?

BDavis (WMF) (talkcontribs)

The namespace label varies by wiki (depending on installed extensions) and UI language. See phab:T180558 for some discussion of a user-provided table on ToolsDB that was once available for this sort of lookup. Unfortunately, even if that tool were fixed, there is currently no way to connect to ToolsDB from Quarry, per phab:T151158.

Milimetric (WMF) (talkcontribs)
Reply to "Namespace string?"

Is there some way of automagic update/re-execute quarries from tools-Servers?

Wurgl (talkcontribs)

For a few of my queries it might be a good idea to refresh the output from time to time (e.g. weekly), since they could be used as a list for fixing some inconsistencies. Is there any way of doing this with a cron job or script?

Reply to "Is there some way of automagic update/re-execute quarries from tools-Servers?"
Cr29uva (talkcontribs)

What would be the best way to extract conversations made in article or user talk namespaces?

If someone could point me towards the XML dump to be used for this, or a table in the database that can give me the text, that would be great.

I'm interested in analyzing user conversations and linking them to the probability of a user getting blocked in the future.

Milimetric (WMF) (talkcontribs)

The dump you need is, for example for simplewiki, https://dumps.wikimedia.org/simplewiki/20181001/. You want the sets that have the full revision history to recreate conversations. When you go through it you can filter down to just the Talk and User_talk namespaces.

Cr29uva (talkcontribs)

Yes, I was able to figure out that I need the XML dump of pages-revision-meta-history.

Thanks a ton for your replies, @Milimetric (WMF)!

Reply to "user conversations data"
199.111.226.142 (talkcontribs)

I'm looking for a query that can give me a user id/name along with all activity associated with the user, such as comments and interactions with other users, as well as the text of edits made to pages.

199.111.226.142 (talkcontribs)

Even the table names that need to be joined would be good to know. I'll write the query myself.

Milimetric (WMF) (talkcontribs)
Cr29uva (talkcontribs)

Hey @Milimetric

Thanks a ton for your reply! Yes, I am trying to do something along the lines of the Detox research. I have a couple of questions; it would be great if you have any insight into this:

  1. The database layout page says that there is a table called text which contains the text content of revisions. But when I try to access that table using Quarry, I get an error saying that the table doesn't exist. Per Manual:Text table the text should be available in table form, so I wonder why it says the table doesn't exist. Any idea? It would be ideal to get the data from an SQL-like table; my last resort would be parsing the XML dump.
  2. What would be the best way to extract conversations made in article or user talk namespaces? I'm interested in analyzing user conversations and linking them to the probability of a user getting blocked in the future.
Milimetric (WMF) (talkcontribs)

@Cr29uva, yes, the text table isn't replicated to these databases because it's too big and hard to strip the private parts out of. The XML dumps are your best bet for now. We're close to releasing a new dataset that's more analytics-oriented, which may be of interest to you in the future. It will have user blocks as applied over the history of each user (so, for example, if someone has been blocked five times, it will have those periods with a start and end). It won't have text initially either, but we're working on adding that over the next year.
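
You can confirm the missing table from Quarry itself; a quick check (a sketch, assuming enwiki_p):

USE enwiki_p;
SHOW TABLES LIKE 'text'; -- empty result: the text table is not replicated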

Reply to "user activity data"
Estopedist1 (talkcontribs)

After these major updates to Quarry, I may have found a limitation/bug.

If you use this: AND page_title IN ('phrase X', 'phrase Y', 'phrase ...') (test template here: https://quarry.wmflabs.org/query/29938), then you can only put in fewer than about 750 phrases. Before these major updates the maximum was about 2,200 phrases. Any suggestions?

Estopedist1 (talkcontribs)

I tested a little. It seems that what matters is the total number of characters, not the number of lines, and that the limit is about 13,000 characters: a query of, say, 14,000 characters cannot be run.

If I am right, why was the limit previously much higher than 13,000 characters?

Zhuyifei1999 (talkcontribs)

For some reason max_allowed_packet is 1024. It's supposed to be 16M...
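
You should be able to check the value in effect for your own session from Quarry, since this is standard MariaDB:

SHOW VARIABLES LIKE 'max_allowed_packet';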

Zhuyifei1999 (talkcontribs)

@Estopedist1 The patch has now been merged and applied. Could you confirm it works now?

Estopedist1 (talkcontribs)

Yes, it is more capable now: about 36,000 characters. I think it allowed more before, but that's no problem for me. I guess raising max_allowed_packet consumes a lot of resources (I mean servers, hardware).

Reply to "Important lack/bug discovered?"
Vexations (talkcontribs)

How does one request a username change for quarry?

Zhuyifei1999 (talkcontribs)

@Vexations: You should be renamed on your next login to Quarry after your SUL rename.

Framawiki (talkcontribs)

@Vexations: Can you confirm that the problem is solved?

Vexations (talkcontribs)

Yes, it has been solved. Thanks!

Reply to "user rename"

Is externallinks not available for the English Wikipedia?

B (talkcontribs)

I was hoping to run a query to find Flickr images on enwiki (these need to be moved to Commons), but I get an error. Thanks.

View 'enwiki_p.externallinks' references invalid table(s) or column(s) or function(s) or definer/invoker of view lack rights to use them
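
For context, the kind of query involved is roughly this (a sketch assuming the el_from and el_to columns; the unanchored LIKE would be slow even where the view works):

USE enwiki_p;
SELECT el_from, el_to
FROM externallinks
WHERE el_to LIKE '%flickr.com%'
LIMIT 100;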

Edgars2007 (talkcontribs)
Zhuyifei1999 (talkcontribs)

Not related. The underlying wiki replicas seem to deny access.

MariaDB [enwiki_p]> SELECT el_from FROM externallinks limit 5;
ERROR 1356 (HY000): View 'enwiki_p.externallinks' references invalid table(s) or column(s) or function(s) or definer/invoker of view lack rights to use them

Could you file a bug under #data-services?

B (talkcontribs)

Any news?

Zhuyifei1999 (talkcontribs)

Could you file a bug under #data-services?

Edgars2007 (talkcontribs)
B (talkcontribs)

Thanks

Reply to "Is externallinks not available for the English Wikipedia?"

Seeking list of Wikimedia user accounts with blocks

Bluerasberry (talkcontribs)

Thanks for any help anyone can provide. If it is easy, I would appreciate a query; if it is more complicated, I would appreciate general instructions.

I need to learn how to list all English Wikipedia user accounts which are blocked.

I am looking at Manual:Database layout, trying to understand which property in this schema corresponds to a user account block. After I learn how to query for blocks, I want to query for accounts with other user rights, but I am not seeing how this is mapped. If anyone has advice, I would appreciate it. Thanks.

Matěj Suchánek (talkcontribs)
USE enwiki_p;
SELECT ipb_address FROM ipblocks WHERE ipb_user <> 0;

SELECT ipb_address FROM ipblocks
JOIN user_groups ON ug_user = ipb_user
WHERE ipb_user <> 0
AND ug_group = 'user_group_identifier';
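
For example, substituting the standard 'sysop' group lists blocked administrators (a sketch):

SELECT ipb_address FROM ipblocks
JOIN user_groups ON ug_user = ipb_user
WHERE ipb_user <> 0
AND ug_group = 'sysop';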
Reply to "Seeking list of Wikimedia user accounts with blocks"
RoySmith (talkcontribs)

I'm running what looks like a very simple query:

select count(*) from revision where rev_user = 130326;

It's taking forever to complete. According to Manual:Revision table, rev_user should be the first part of the user_timestamp key, so this should be a fast indexed query. But, if I describe the table

describe revision;

I don't see any keys listed.
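
Running EXPLAIN on the query (assuming the replicas permit it) should also show whether any key can be used:

EXPLAIN select count(*) from revision where rev_user = 130326;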

Zhuyifei1999 (talkcontribs)

Use revision_userindex instead; it keeps the index on rev_user usable:

select count(*) from revision_userindex where rev_user = 130326;

Framawiki (talkcontribs)

Even quicker (the count is computed differently and takes into account things like log events):

MariaDB [enwiki_p]> select user_editcount from user where user_id = 130326;
+----------------+
| user_editcount |
+----------------+
|          24197 |
+----------------+
1 row in set (0.01 sec)
RoySmith (talkcontribs)

Interesting. There's no table "revision_userindex" listed in Category:MediaWiki database tables. Is that just not up to date? I do see it when I run "show tables", and the query you suggest does indeed work.

So, the next obvious question is, why do we have two copies of the table, one with the index and another that's missing it?

Zhuyifei1999 (talkcontribs)

Because Category:MediaWiki database tables lists the tables that are built into MediaWiki. These tables can contain highly sensitive data, such as password hashes, tokens, IP addresses, revision-deleted or oversighted revision comments, etc. Therefore, when these databases are replicated to the wiki replicas, which Quarry queries, tables like revision must have certain cells redacted from a row when certain criteria are met.

We implement some of the redaction by not replicating some tables at all, and the rest by using views. As for your query, the relevant redaction in the view (not table) revision is if((`enwiki`.`revision`.`rev_deleted` & 4),NULL,`enwiki`.`revision`.`rev_user`) AS `rev_user`, which makes rev_user NULL when rev_deleted has bit 2^2 (for a revdel-ed username) set. As a side effect of using this if, indexes on rev_user in the table revision cannot be used.

Since unusable indexes are a serious performance problem, the DBAs made alternative views that redact the rows entirely. This is the case with the view revision_userindex, whose definition includes `enwiki`.`revision`.`rev_user` AS `rev_user` and where ((`enwiki`.`revision`.`rev_deleted` & 4) = 0). rev_user is now a simple passthrough, allowing indexes on that column to be used; the bit condition on rev_deleted is now an implicit but enforced WHERE condition on the underlying revision table, which we cannot query directly.

So no, we don't have two copies of the table, but two views of the same table.
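
If your replica account has the SHOW VIEW privilege, you can inspect the definitions yourself, e.g.:

SHOW CREATE VIEW enwiki_p.revision_userindex;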

RoySmith (talkcontribs)

Ah, cool. Thanks. That all makes perfect sense.

Maybe the "Database tables" tab in Quarry could include this information? I imagine I'm not the first person to stub my toe on this.

Zhuyifei1999 (talkcontribs)

Good point. Though, what link text and target do you suggest? cc @Framawiki:

Framawiki (talkcontribs)

It would be great to have a wikitech page that maps the replica tables, like the MediaWiki one. As far as I know, such a page doesn't currently exist.

See also phab:T139289

Reply to "No index for revision table?"
Jura1 (talkcontribs)

Since a table was deleted, how does one get the page_id for a qid efficiently?

Page properties don't seem to work well.

Matěj Suchánek (talkcontribs)
JOIN page ON page_namespace = 0 AND page_title = ...
Jura1 (talkcontribs)

Thanks. Way too easy :)

Matěj Suchánek (talkcontribs)

Perhaps I shouldn't have been so brief. wb_items_per_site still contains numbers, so you need to do either:

JOIN page ON page_namespace = 0 AND page_title = CONCAT('Q', ips_item_id)

or

JOIN wb_items_per_site ON REPLACE(page_title, 'Q', '') = ips_item_id

This lets the query use indexes and therefore makes it fast.
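
Put together, a complete lookup might read as follows (a minimal sketch, assuming wikidatawiki_p and an enwiki sitelink, 'Oxygen', purely as an example):

USE wikidatawiki_p;
SELECT ips_item_id, page_id
FROM wb_items_per_site
JOIN page ON page_namespace = 0 AND page_title = CONCAT('Q', ips_item_id)
WHERE ips_site_id = 'enwiki'
AND ips_site_page = 'Oxygen';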

Reply to "Wikidata"