Talk:Quarry

Jump to navigation Jump to search

About this board

Discussion area for discussion about Quarry itself and help about individual queries.

Not able to perform query on itwiki_p

3
Lucarosty (talkcontribs)

Until yesterday evening I used "https://quarry.wmflabs.org/query/26129" and "https://quarry.wmflabs.org/query/31383" to perform some query for lint errors. This morning I'm unable to perform that query on itwiki (i try the same query on enwiki_t without problem). The page return this error "View 'itwiki_p.page' references invalid table(s) or column(s) or function(s) or definer/invoker of view lack rights to use them"

Zhuyifei1999 (talkcontribs)
Lucarosty (talkcontribs)

Perfect now is working! Thanks

Reply to "Not able to perform query on itwiki_p"
Saeidpourbabak (talkcontribs)

Hi, I have THIS quarry with 20422 rows results. When I try to download the output, I get only 1070 rows in the CSV file, and for the XLSX I get an empty (0 KB) file.

Any idea what is wrong?

Zhuyifei1999 (talkcontribs)

It's something about having surrogates (CC @Framawiki:)

Nov 06 18:09:26 quarry-web-01 uwsgi-quarry-web[3030]: Traceback (most recent call last):
Nov 06 18:09:26 quarry-web-01 uwsgi-quarry-web[3030]:   File "/srv/quarry/venv/lib/python3.5/site-packages/werkzeug/wsgi.py", line 870, in __next__
Nov 06 18:09:26 quarry-web-01 uwsgi-quarry-web[3030]:     return self._next()
Nov 06 18:09:26 quarry-web-01 uwsgi-quarry-web[3030]:   File "/srv/quarry/venv/lib/python3.5/site-packages/werkzeug/wrappers.py", line 84, in _iter_encoded
Nov 06 18:09:26 quarry-web-01 uwsgi-quarry-web[3030]:     yield item.encode(charset)
Nov 06 18:09:26 quarry-web-01 uwsgi-quarry-web[3030]: UnicodeEncodeError: 'utf-8' codec can't encode character '\udcd9' in position 385: surrogates not allowed
Nov 06 18:09:46 quarry-web-01 uwsgi-quarry-web[3030]: Traceback (most recent call last):
Nov 06 18:09:46 quarry-web-01 uwsgi-quarry-web[3030]:   File "/srv/quarry/venv/lib/python3.5/site-packages/werkzeug/wsgi.py", line 870, in __next__
Nov 06 18:09:46 quarry-web-01 uwsgi-quarry-web[3030]:     return self._next()
Nov 06 18:09:46 quarry-web-01 uwsgi-quarry-web[3030]:   File "/srv/quarry/venv/lib/python3.5/site-packages/werkzeug/wrappers.py", line 82, in _iter_encoded
Nov 06 18:09:46 quarry-web-01 uwsgi-quarry-web[3030]:     for item in iterable:
Nov 06 18:09:46 quarry-web-01 uwsgi-quarry-web[3030]:   File "/srv/quarry/venv/lib/python3.5/site-packages/werkzeug/contrib/iterio.py", line 183, in __new__
Nov 06 18:09:46 quarry-web-01 uwsgi-quarry-web[3030]:     rv = g.switch()
Nov 06 18:09:46 quarry-web-01 uwsgi-quarry-web[3030]:   File "/srv/quarry/venv/lib/python3.5/site-packages/werkzeug/contrib/iterio.py", line 178, in run
Nov 06 18:09:46 quarry-web-01 uwsgi-quarry-web[3030]:     func(stream)
Nov 06 18:09:46 quarry-web-01 uwsgi-quarry-web[3030]:   File "./quarry/web/output.py", line 152, in respond
Nov 06 18:09:46 quarry-web-01 uwsgi-quarry-web[3030]:     if (worksheet.write(row_num, col_num, cell) < -1 and
Nov 06 18:09:46 quarry-web-01 uwsgi-quarry-web[3030]:   File "/srv/quarry/venv/lib/python3.5/site-packages/xlsxwriter/worksheet.py", line 67, in cell_wrapper
Nov 06 18:09:46 quarry-web-01 uwsgi-quarry-web[3030]:     return method(self, *args, **kwargs)
Nov 06 18:09:46 quarry-web-01 uwsgi-quarry-web[3030]:   File "/srv/quarry/venv/lib/python3.5/site-packages/xlsxwriter/worksheet.py", line 406, in write
Nov 06 18:09:46 quarry-web-01 uwsgi-quarry-web[3030]:     return self._write(row, col, *args)
Nov 06 18:09:46 quarry-web-01 uwsgi-quarry-web[3030]:   File "/srv/quarry/venv/lib/python3.5/site-packages/xlsxwriter/worksheet.py", line 431, in _write
Nov 06 18:09:46 quarry-web-01 uwsgi-quarry-web[3030]:     return self._write_token_as_string(token, row, col, *args)
Nov 06 18:09:46 quarry-web-01 uwsgi-quarry-web[3030]:   File "/srv/quarry/venv/lib/python3.5/site-packages/xlsxwriter/worksheet.py", line 387, in _write_token_as_string
Nov 06 18:09:46 quarry-web-01 uwsgi-quarry-web[3030]:     return self._write_string(row, col, *args)
Nov 06 18:09:46 quarry-web-01 uwsgi-quarry-web[3030]:   File "/srv/quarry/venv/lib/python3.5/site-packages/xlsxwriter/worksheet.py", line 522, in _write_string
Nov 06 18:09:46 quarry-web-01 uwsgi-quarry-web[3030]:     self._write_single_row(row)
Nov 06 18:09:46 quarry-web-01 uwsgi-quarry-web[3030]:   File "/srv/quarry/venv/lib/python3.5/site-packages/xlsxwriter/worksheet.py", line 5432, in _write_single_row
Nov 06 18:09:46 quarry-web-01 uwsgi-quarry-web[3030]:     self._write_cell(row_num, col_num, col_ref)
Nov 06 18:09:46 quarry-web-01 uwsgi-quarry-web[3030]:   File "/srv/quarry/venv/lib/python3.5/site-packages/xlsxwriter/worksheet.py", line 5610, in _write_cell
Nov 06 18:09:46 quarry-web-01 uwsgi-quarry-web[3030]:     self._xml_inline_string(string, preserve, attributes)
Nov 06 18:09:46 quarry-web-01 uwsgi-quarry-web[3030]:   File "/srv/quarry/venv/lib/python3.5/site-packages/xlsxwriter/xmlwriter.py", line 167, in _xml_inline_string
Nov 06 18:09:46 quarry-web-01 uwsgi-quarry-web[3030]:     (attr, t_attr, string))
Nov 06 18:09:46 quarry-web-01 uwsgi-quarry-web[3030]:   File "/usr/lib/python3.5/codecs.py", line 718, in write
Nov 06 18:09:46 quarry-web-01 uwsgi-quarry-web[3030]:     return self.writer.write(data)
Nov 06 18:09:46 quarry-web-01 uwsgi-quarry-web[3030]:   File "/usr/lib/python3.5/codecs.py", line 376, in write
Nov 06 18:09:46 quarry-web-01 uwsgi-quarry-web[3030]:     data, consumed = self.encode(object, self.errors)
Nov 06 18:09:46 quarry-web-01 uwsgi-quarry-web[3030]: UnicodeEncodeError: 'utf-8' codec can't encode character '\udcd9' in position 39: surrogates not allowed
Zhuyifei1999 (talkcontribs)
>>> surrogate = b'\xc3'.decode('utf8','surrogateescape')
>>> surrogate
'\udcc3'
>>> surrogate.encode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcc3' in position 0: surrogates not allowed
>>> surrogate.encode('utf-8', 'surrogateescape')
b'\xc3'
Zhuyifei1999 (talkcontribs)

This is bad, but I don't know of any other way to get control of the error handler used in _iter_encoded:

>>> import codecs
>>> codecs.register_error('strict', codecs.lookup_error('surrogateescape'))
>>> 
>>> surrogate = b'\xc3'.decode('utf8','surrogateescape')
>>> surrogate.encode('utf-8')
b'\xc3'
>>> codecs.lookup_error('surrogateescape')
<built-in function surrogateescape>

@Framawiki: Any objections to applying this 'hack'?

Zhuyifei1999 (talkcontribs)

I also broke the worker while playing with surrogates:

Nov 06 18:33:00 quarry-worker-02 celery-quarry-worker[2512]: [2018-11-06 18:33:00,947: ERROR/ForkPoolWorker-25] Task worker.run_query[6a7ebdd0-fa08-4732-aa2a-2deafb5d7892] raised unexpected: UnicodeDecodeError('utf-8', b'\xed\xb3\x83', 0, 1, 'invalid continuation byte')
Nov 06 18:33:00 quarry-worker-02 celery-quarry-worker[2512]: Traceback (most recent call last):
Nov 06 18:33:00 quarry-worker-02 celery-quarry-worker[2512]:   File "/srv/quarry/venv/lib/python3.5/site-packages/celery/app/trace.py", line 382, in trace_task
Nov 06 18:33:00 quarry-worker-02 celery-quarry-worker[2512]:     R = retval = fun(*args, **kwargs)
Nov 06 18:33:00 quarry-worker-02 celery-quarry-worker[2512]:   File "/srv/quarry/venv/lib/python3.5/site-packages/celery/app/trace.py", line 641, in __protected_call__
Nov 06 18:33:00 quarry-worker-02 celery-quarry-worker[2512]:     return self.run(*args, **kwargs)
Nov 06 18:33:00 quarry-worker-02 celery-quarry-worker[2512]:   File "/srv/quarry/quarry/web/worker.py", line 74, in run_query
Nov 06 18:33:00 quarry-worker-02 celery-quarry-worker[2512]:     while cur.nextset():
Nov 06 18:33:00 quarry-worker-02 celery-quarry-worker[2512]:   File "/srv/quarry/venv/lib/python3.5/site-packages/pymysql/cursors.py", line 107, in nextset
Nov 06 18:33:00 quarry-worker-02 celery-quarry-worker[2512]:     return self._nextset(False)
Nov 06 18:33:00 quarry-worker-02 celery-quarry-worker[2512]:   File "/srv/quarry/venv/lib/python3.5/site-packages/pymysql/cursors.py", line 102, in _nextset
Nov 06 18:33:00 quarry-worker-02 celery-quarry-worker[2512]:     conn.next_result(unbuffered=unbuffered)
Nov 06 18:33:00 quarry-worker-02 celery-quarry-worker[2512]:   File "/srv/quarry/venv/lib/python3.5/site-packages/pymysql/connections.py", line 520, in next_result
Nov 06 18:33:00 quarry-worker-02 celery-quarry-worker[2512]:     self._affected_rows = self._read_query_result(unbuffered=unbuffered)
Nov 06 18:33:00 quarry-worker-02 celery-quarry-worker[2512]:   File "/srv/quarry/venv/lib/python3.5/site-packages/pymysql/connections.py", line 727, in _read_query_result
Nov 06 18:33:00 quarry-worker-02 celery-quarry-worker[2512]:     result.read()
Nov 06 18:33:00 quarry-worker-02 celery-quarry-worker[2512]:   File "/srv/quarry/venv/lib/python3.5/site-packages/pymysql/connections.py", line 1073, in read
Nov 06 18:33:00 quarry-worker-02 celery-quarry-worker[2512]:     self._read_result_packet(first_packet)
Nov 06 18:33:00 quarry-worker-02 celery-quarry-worker[2512]:   File "/srv/quarry/venv/lib/python3.5/site-packages/pymysql/connections.py", line 1143, in _read_result_packet
Nov 06 18:33:00 quarry-worker-02 celery-quarry-worker[2512]:     self._read_rowdata_packet()
Nov 06 18:33:00 quarry-worker-02 celery-quarry-worker[2512]:   File "/srv/quarry/venv/lib/python3.5/site-packages/pymysql/connections.py", line 1181, in _read_rowdata_packet
Nov 06 18:33:00 quarry-worker-02 celery-quarry-worker[2512]:     rows.append(self._read_row_from_packet(packet))
Nov 06 18:33:00 quarry-worker-02 celery-quarry-worker[2512]:   File "/srv/quarry/venv/lib/python3.5/site-packages/pymysql/connections.py", line 1197, in _read_row_from_packet
Nov 06 18:33:00 quarry-worker-02 celery-quarry-worker[2512]:     data = data.decode(encoding)
Nov 06 18:33:00 quarry-worker-02 celery-quarry-worker[2512]: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 0: invalid continuation byte
Zhuyifei1999 (talkcontribs)
Saeidpourbabak (talkcontribs)

Hello and thanks for taking action.

Do I have to do something or I just need to wait? I'm afraid that your replies and gerrit link is too technical for me.

Framawiki (talkcontribs)

Hi Saeidpourbabak and thanks for the ping. We'll try to fix the site soon and will contact you when it's done.

Zhuyifei1999 (talkcontribs)

This should be fixed now:

$ curl -s https://quarry.wmflabs.org/run/310305/output/0/csv | wc -l
20423
$ wget -O /dev/null https://quarry.wmflabs.org/run/310305/output/0/xlsx
--2018-12-02 15:40:35--  https://quarry.wmflabs.org/run/310305/output/0/xlsx
Resolving quarry.wmflabs.org... 208.80.155.156
Connecting to quarry.wmflabs.org|208.80.155.156|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/vnd.openxmlformats-officedocument.spreadsheetml.sheet]
Saving to: ‘/dev/null’

/dev/null                               [   <=>                                                            ] 454.11K  1.05MB/s    in 0.4s    

2018-12-02 15:40:38 (1.05 MB/s) - ‘/dev/null’ saved [465013]
Reply to "Not able to download output"

revision table rev_comment missing

2
Summary by 197.218.84.150
Saeidpourbabak (talkcontribs)

Hi, It seems to me that the revision table is not working properly, namely rev_comment data is missing. I know I have tons of edits with a specific summary, but now I find none. For example see these two quarries, one from 3 months ago and one from today. Thanks.

Framawiki (talkcontribs)

Hi, no, that's not a bug :) Since recently (phab:T166733) comments are available in a new specific table. See Manual:Revision table#rev comment for the depreciation message. You'll need to use a join between revision and comment tables.

Summary by BDavis (WMF)

Watchlist data is not available in the Wiki Replica databases due to privacy concerns.

RoySmith (talkcontribs)

Can I query my watchlist in quarry? I get:

Table 'enwiki_p.watchlist' doesn't exist

Edgars2007 (talkcontribs)

No, you can't. Personal information etc.

Is site_stats protected now?

3
Summary by BDavis (WMF)

The site_stats view was broken by a column change in the backing table, but this has been fixed. See phab:T210481 for more details.

Andreyyshore (talkcontribs)

Since my statistics bot failed to run today, I noticed that I can't access the site_stats table on any wiki.

Here's a simple query that doesn't work anymore as of today:

USE rowiki_p;

DESCRIBE site_stats;

This yields the following error: View 'rowiki_p.site_stats' references invalid table(s) or column(s) or function(s) or definer/invoker of view lack rights to use them

If I use SHOW TABLES FROM rowiki_p;, I can see that the table still exists. I was using it to get the total number of edits on rowiki.

Any idea why it's suddenly inaccessible?

Framawiki (talkcontribs)

Hi, that is not limited to Quarry but to the whole wmf labs replicas, and the restriction was not intentional. See phab:T210481.

Andreyyshore (talkcontribs)

Thanks for clarifying where the problem was.

Like the other commenters, I can also confirm that the problem is gone now.

Need help with finding most transcluded unprotected pages

7
Jarekt (talkcontribs)

I am trying to make sure some of the most transcluded pages (hopefully templates and modules) on Commons are protected. The first step would be to find some number (lets say 100) most transcluded pages and than look up their protection level, So I started with


use commonswiki_p;
SELECT tl_namespace, tl_title, count(*) as tot
FROM templatelinks
group by tl_namespace, tl_title
order by tot desc
limit 100

Which was killed after half an hour (see https://quarry.wmflabs.org/query/8653). Anybody knows what is wrong with this query and how to fix it? The results should mirror c:Special:MostTranscludedPages.

BDavis (WMF) (talkcontribs)

Trying the query out in the sql-optimizer tool shows that there are 1,605,698,124 (yes, 1.6 billion!) rows to process. The group by and order by clauses force the database server to copy all of the data into temporary storage and then perform a file sort on that storage. The limit clause will only change the data returned to your client, so all the rows have to be grouped and then the counts that were computed for each group all have to be sorted and finally only the first 100 results will be returned.

Your best best for getting this sort of data is to use the Action API's query action to get the same list you see on Special:MostTranscludedPages in a machine readable format: https://commons.wikimedia.org/wiki/Special:ApiSandbox#action=query&format=json&list=querypage&utf8=1&formatversion=2&qppage=Mostlinkedtemplates

Jarekt (talkcontribs)

Yes sorting 1.6 billion pages is a big job. My plan was to use that query as a piece of larger query where I than look up protection level of each page. Maybe I should be counting transclusions of unprotected templates (and latter modules) and show the ones with over some number (1M or 500k). So limit the scope first and than count.

Zhuyifei1999 (talkcontribs)
Zhuyifei1999 (talkcontribs)

quarry:query/31088 would be a query that checks the transclusion count from all templates, using a dependent subquery on templatelinks table. Don't think it'll work though.

Jarekt (talkcontribs)
Zhuyifei1999 (talkcontribs)

Wow, 31131 is so much faster than I would have anticipated. Nice. I guess most of the time it's counting the transclusions of protected templates, and the time of joining and filtering on page_restrictions is minimal.

Reply to "Need help with finding most transcluded unprotected pages"
Summary by BDavis (WMF)

Logging out of Quarry and then logging back in should cause an changes to your Wikimedia username to be seen by Quarry.

Vexations (talkcontribs)

How does one request a username change for quarry?

Zhuyifei1999 (talkcontribs)
Zhuyifei1999 (talkcontribs)

@Vexations: You should get renamed after your next login to Quarry after your SUL rename.

Framawiki (talkcontribs)

@Vexations: can you confirm that the problem is solved ?

Vexations (talkcontribs)

Yes, it has been solved, thanks,

Is externallinks not available for the English Wikipedia?

8
Summary by BDavis (WMF)
B (talkcontribs)

I was hoping to do a query to find Flickr images on enwiki (these need to be moved to Commons). I get an error. Thanks.

View 'enwiki_p.externallinks' references invalid table(s) or column(s) or function(s) or definer/invoker of view lack rights to use them

Edgars2007 (talkcontribs)
Edgars2007 (talkcontribs)
Zhuyifei1999 (talkcontribs)

Not related. The underlying wiki replicas seems to deny the access.

MariaDB [enwiki_p]> SELECT el_from FROM externallinks limit 5;
ERROR 1356 (HY000): View 'enwiki_p.externallinks' references invalid table(s) or column(s) or function(s) or definer/invoker of view lack rights to use them

Could you file a bug under #data-services?

B (talkcontribs)

Any news?

Zhuyifei1999 (talkcontribs)

Could you file a bug under #data-services?

Edgars2007 (talkcontribs)
B (talkcontribs)

Thanks

Summary by BDavis (WMF)

The text table is not present in the Wiki Replica databases that Quarry uses for search. Full text search of wikitext can be performed using the search API's insource operator instead.

Jarekt (talkcontribs)
Zhuyifei1999 (talkcontribs)

No. As far as I understand, the table, if it were available, would just be pointers to the text in 'external storage' and be pretty much meaningless anyways without access to this 'external storage'. You can however query the text with Special:Search, or use indirect queries like with categorylinks or externallinks to get the features of the text.

Stefan2 (talkcontribs)

Doesn't the text table also contain information about deleted revisions which shouldn't be publicly available?

The documentation states that the text may be in various formats (gzipped, stored elsewhere etc.) so a simple search query might not work. Besides, with the LIKE operator used in quarry:31044, the execution time would be proportional to the total number of text revisions on Commons, which would maybe be too slow.

Zhuyifei1999 (talkcontribs)
Jarekt (talkcontribs)

Thank you both. I never fully trusted regular search, but it seems like in this case it is better than SQL query.

Is there some way of automagic update/re-execute quarries from tools-Servers?

4
Summary last edited by BDavis (WMF) 21:29, 9 November 2018 1 month ago

Now waiting … thanks!

Wurgl (talkcontribs)

For a few of my quarries it might be a good idea to refresh the output from time to time (eg. weekly) since they could be used a list to fix some inconsistencies. Is there any way of doing this by a cronjob or script or so?

Framawiki (talkcontribs)
Wurgl (talkcontribs)

A ticket is fine! Maybe you mean T203791 ?

Framawiki (talkcontribs)