Talk:Quarry

About this board

Discussion area for Quarry itself and for help with individual queries.

"select *" returns different rows from "select col_name"?

2
RoySmith (talkcontribs)

When I run:

select rd_title from redirect where rd_title like 'The\_%' limit 5;


I get:

The_"5"_Royales
The_"5"_Royales
The_"5"_Royales
The_"5"_Royales
The_"5"_Royales

If I replace the explicit column name "rd_title" with "*", I get different rows:

select * from redirect where rd_title like 'The\_%' limit 5;

2751 0 The_Angry_Brigade
2903 0 The_Diary_of_a_Young_Girl
3317 0 The_Birth_of_a_Nation
3920 0 The_Beatles
4238 0 The_Bronx

What's going on here? I expected to get the same rows, just more columns.
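Presumably this is because, without an ORDER BY, the server is free to return matching rows in any order, and the two queries can use different access paths: the rd_title-only query can likely be answered entirely from an index on rd_title, where many redirects share the same target title. A sketch of a version that should return the same rows in both forms, assuming rd_from is the table's primary key:

select rd_from, rd_title from redirect where rd_title like 'The\_%' order by rd_from limit 5;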

Matěj Suchánek (talkcontribs)
Reply to ""select *" returns different rows from "select col_name"?"

Not able to download output

9
Summary by BDavis (WMF)

Fixed by gerrit:472377

Saeidpourbabak (talkcontribs)

Hi, I have THIS quarry, whose results have 20422 rows. When I try to download the output, I get only 1070 rows in the CSV file, and for the XLSX I get an empty (0 KB) file.

Any idea what is wrong?

Zhuyifei1999 (talkcontribs)

It's something about having surrogates (CC @Framawiki:)

Nov 06 18:09:26 quarry-web-01 uwsgi-quarry-web[3030]: Traceback (most recent call last):
Nov 06 18:09:26 quarry-web-01 uwsgi-quarry-web[3030]:   File "/srv/quarry/venv/lib/python3.5/site-packages/werkzeug/wsgi.py", line 870, in __next__
Nov 06 18:09:26 quarry-web-01 uwsgi-quarry-web[3030]:     return self._next()
Nov 06 18:09:26 quarry-web-01 uwsgi-quarry-web[3030]:   File "/srv/quarry/venv/lib/python3.5/site-packages/werkzeug/wrappers.py", line 84, in _iter_encoded
Nov 06 18:09:26 quarry-web-01 uwsgi-quarry-web[3030]:     yield item.encode(charset)
Nov 06 18:09:26 quarry-web-01 uwsgi-quarry-web[3030]: UnicodeEncodeError: 'utf-8' codec can't encode character '\udcd9' in position 385: surrogates not allowed
Nov 06 18:09:46 quarry-web-01 uwsgi-quarry-web[3030]: Traceback (most recent call last):
Nov 06 18:09:46 quarry-web-01 uwsgi-quarry-web[3030]:   File "/srv/quarry/venv/lib/python3.5/site-packages/werkzeug/wsgi.py", line 870, in __next__
Nov 06 18:09:46 quarry-web-01 uwsgi-quarry-web[3030]:     return self._next()
Nov 06 18:09:46 quarry-web-01 uwsgi-quarry-web[3030]:   File "/srv/quarry/venv/lib/python3.5/site-packages/werkzeug/wrappers.py", line 82, in _iter_encoded
Nov 06 18:09:46 quarry-web-01 uwsgi-quarry-web[3030]:     for item in iterable:
Nov 06 18:09:46 quarry-web-01 uwsgi-quarry-web[3030]:   File "/srv/quarry/venv/lib/python3.5/site-packages/werkzeug/contrib/iterio.py", line 183, in __new__
Nov 06 18:09:46 quarry-web-01 uwsgi-quarry-web[3030]:     rv = g.switch()
Nov 06 18:09:46 quarry-web-01 uwsgi-quarry-web[3030]:   File "/srv/quarry/venv/lib/python3.5/site-packages/werkzeug/contrib/iterio.py", line 178, in run
Nov 06 18:09:46 quarry-web-01 uwsgi-quarry-web[3030]:     func(stream)
Nov 06 18:09:46 quarry-web-01 uwsgi-quarry-web[3030]:   File "./quarry/web/output.py", line 152, in respond
Nov 06 18:09:46 quarry-web-01 uwsgi-quarry-web[3030]:     if (worksheet.write(row_num, col_num, cell) < -1 and
Nov 06 18:09:46 quarry-web-01 uwsgi-quarry-web[3030]:   File "/srv/quarry/venv/lib/python3.5/site-packages/xlsxwriter/worksheet.py", line 67, in cell_wrapper
Nov 06 18:09:46 quarry-web-01 uwsgi-quarry-web[3030]:     return method(self, *args, **kwargs)
Nov 06 18:09:46 quarry-web-01 uwsgi-quarry-web[3030]:   File "/srv/quarry/venv/lib/python3.5/site-packages/xlsxwriter/worksheet.py", line 406, in write
Nov 06 18:09:46 quarry-web-01 uwsgi-quarry-web[3030]:     return self._write(row, col, *args)
Nov 06 18:09:46 quarry-web-01 uwsgi-quarry-web[3030]:   File "/srv/quarry/venv/lib/python3.5/site-packages/xlsxwriter/worksheet.py", line 431, in _write
Nov 06 18:09:46 quarry-web-01 uwsgi-quarry-web[3030]:     return self._write_token_as_string(token, row, col, *args)
Nov 06 18:09:46 quarry-web-01 uwsgi-quarry-web[3030]:   File "/srv/quarry/venv/lib/python3.5/site-packages/xlsxwriter/worksheet.py", line 387, in _write_token_as_string
Nov 06 18:09:46 quarry-web-01 uwsgi-quarry-web[3030]:     return self._write_string(row, col, *args)
Nov 06 18:09:46 quarry-web-01 uwsgi-quarry-web[3030]:   File "/srv/quarry/venv/lib/python3.5/site-packages/xlsxwriter/worksheet.py", line 522, in _write_string
Nov 06 18:09:46 quarry-web-01 uwsgi-quarry-web[3030]:     self._write_single_row(row)
Nov 06 18:09:46 quarry-web-01 uwsgi-quarry-web[3030]:   File "/srv/quarry/venv/lib/python3.5/site-packages/xlsxwriter/worksheet.py", line 5432, in _write_single_row
Nov 06 18:09:46 quarry-web-01 uwsgi-quarry-web[3030]:     self._write_cell(row_num, col_num, col_ref)
Nov 06 18:09:46 quarry-web-01 uwsgi-quarry-web[3030]:   File "/srv/quarry/venv/lib/python3.5/site-packages/xlsxwriter/worksheet.py", line 5610, in _write_cell
Nov 06 18:09:46 quarry-web-01 uwsgi-quarry-web[3030]:     self._xml_inline_string(string, preserve, attributes)
Nov 06 18:09:46 quarry-web-01 uwsgi-quarry-web[3030]:   File "/srv/quarry/venv/lib/python3.5/site-packages/xlsxwriter/xmlwriter.py", line 167, in _xml_inline_string
Nov 06 18:09:46 quarry-web-01 uwsgi-quarry-web[3030]:     (attr, t_attr, string))
Nov 06 18:09:46 quarry-web-01 uwsgi-quarry-web[3030]:   File "/usr/lib/python3.5/codecs.py", line 718, in write
Nov 06 18:09:46 quarry-web-01 uwsgi-quarry-web[3030]:     return self.writer.write(data)
Nov 06 18:09:46 quarry-web-01 uwsgi-quarry-web[3030]:   File "/usr/lib/python3.5/codecs.py", line 376, in write
Nov 06 18:09:46 quarry-web-01 uwsgi-quarry-web[3030]:     data, consumed = self.encode(object, self.errors)
Nov 06 18:09:46 quarry-web-01 uwsgi-quarry-web[3030]: UnicodeEncodeError: 'utf-8' codec can't encode character '\udcd9' in position 39: surrogates not allowed
Zhuyifei1999 (talkcontribs)
>>> surrogate = b'\xc3'.decode('utf8','surrogateescape')
>>> surrogate
'\udcc3'
>>> surrogate.encode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcc3' in position 0: surrogates not allowed
>>> surrogate.encode('utf-8', 'surrogateescape')
b'\xc3'
Zhuyifei1999 (talkcontribs)

This is bad, but I don't know of any other way to get control of the error handler used in _iter_encoded:

>>> import codecs
>>> codecs.register_error('strict', codecs.lookup_error('surrogateescape'))
>>> 
>>> surrogate = b'\xc3'.decode('utf8','surrogateescape')
>>> surrogate.encode('utf-8')
b'\xc3'
>>> codecs.lookup_error('surrogateescape')
<built-in function surrogateescape>

@Framawiki: Any objections to applying this 'hack'?

Zhuyifei1999 (talkcontribs)

I also broke the worker while playing with surrogates:

Nov 06 18:33:00 quarry-worker-02 celery-quarry-worker[2512]: [2018-11-06 18:33:00,947: ERROR/ForkPoolWorker-25] Task worker.run_query[6a7ebdd0-fa08-4732-aa2a-2deafb5d7892] raised unexpected: UnicodeDecodeError('utf-8', b'\xed\xb3\x83', 0, 1, 'invalid continuation byte')
Nov 06 18:33:00 quarry-worker-02 celery-quarry-worker[2512]: Traceback (most recent call last):
Nov 06 18:33:00 quarry-worker-02 celery-quarry-worker[2512]:   File "/srv/quarry/venv/lib/python3.5/site-packages/celery/app/trace.py", line 382, in trace_task
Nov 06 18:33:00 quarry-worker-02 celery-quarry-worker[2512]:     R = retval = fun(*args, **kwargs)
Nov 06 18:33:00 quarry-worker-02 celery-quarry-worker[2512]:   File "/srv/quarry/venv/lib/python3.5/site-packages/celery/app/trace.py", line 641, in __protected_call__
Nov 06 18:33:00 quarry-worker-02 celery-quarry-worker[2512]:     return self.run(*args, **kwargs)
Nov 06 18:33:00 quarry-worker-02 celery-quarry-worker[2512]:   File "/srv/quarry/quarry/web/worker.py", line 74, in run_query
Nov 06 18:33:00 quarry-worker-02 celery-quarry-worker[2512]:     while cur.nextset():
Nov 06 18:33:00 quarry-worker-02 celery-quarry-worker[2512]:   File "/srv/quarry/venv/lib/python3.5/site-packages/pymysql/cursors.py", line 107, in nextset
Nov 06 18:33:00 quarry-worker-02 celery-quarry-worker[2512]:     return self._nextset(False)
Nov 06 18:33:00 quarry-worker-02 celery-quarry-worker[2512]:   File "/srv/quarry/venv/lib/python3.5/site-packages/pymysql/cursors.py", line 102, in _nextset
Nov 06 18:33:00 quarry-worker-02 celery-quarry-worker[2512]:     conn.next_result(unbuffered=unbuffered)
Nov 06 18:33:00 quarry-worker-02 celery-quarry-worker[2512]:   File "/srv/quarry/venv/lib/python3.5/site-packages/pymysql/connections.py", line 520, in next_result
Nov 06 18:33:00 quarry-worker-02 celery-quarry-worker[2512]:     self._affected_rows = self._read_query_result(unbuffered=unbuffered)
Nov 06 18:33:00 quarry-worker-02 celery-quarry-worker[2512]:   File "/srv/quarry/venv/lib/python3.5/site-packages/pymysql/connections.py", line 727, in _read_query_result
Nov 06 18:33:00 quarry-worker-02 celery-quarry-worker[2512]:     result.read()
Nov 06 18:33:00 quarry-worker-02 celery-quarry-worker[2512]:   File "/srv/quarry/venv/lib/python3.5/site-packages/pymysql/connections.py", line 1073, in read
Nov 06 18:33:00 quarry-worker-02 celery-quarry-worker[2512]:     self._read_result_packet(first_packet)
Nov 06 18:33:00 quarry-worker-02 celery-quarry-worker[2512]:   File "/srv/quarry/venv/lib/python3.5/site-packages/pymysql/connections.py", line 1143, in _read_result_packet
Nov 06 18:33:00 quarry-worker-02 celery-quarry-worker[2512]:     self._read_rowdata_packet()
Nov 06 18:33:00 quarry-worker-02 celery-quarry-worker[2512]:   File "/srv/quarry/venv/lib/python3.5/site-packages/pymysql/connections.py", line 1181, in _read_rowdata_packet
Nov 06 18:33:00 quarry-worker-02 celery-quarry-worker[2512]:     rows.append(self._read_row_from_packet(packet))
Nov 06 18:33:00 quarry-worker-02 celery-quarry-worker[2512]:   File "/srv/quarry/venv/lib/python3.5/site-packages/pymysql/connections.py", line 1197, in _read_row_from_packet
Nov 06 18:33:00 quarry-worker-02 celery-quarry-worker[2512]:     data = data.decode(encoding)
Nov 06 18:33:00 quarry-worker-02 celery-quarry-worker[2512]: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 0: invalid continuation byte
Zhuyifei1999 (talkcontribs)
Saeidpourbabak (talkcontribs)

Hello and thanks for taking action.

Do I have to do something, or do I just need to wait? I'm afraid that your replies and the gerrit link are too technical for me.

Framawiki (talkcontribs)

Hi Saeidpourbabak and thanks for the ping. We'll try to fix the site soon and will contact you when it's done.

Zhuyifei1999 (talkcontribs)

This should be fixed now:

$ curl -s https://quarry.wmflabs.org/run/310305/output/0/csv | wc -l
20423
$ wget -O /dev/null https://quarry.wmflabs.org/run/310305/output/0/xlsx
--2018-12-02 15:40:35--  https://quarry.wmflabs.org/run/310305/output/0/xlsx
Resolving quarry.wmflabs.org... 208.80.155.156
Connecting to quarry.wmflabs.org|208.80.155.156|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/vnd.openxmlformats-officedocument.spreadsheetml.sheet]
Saving to: ‘/dev/null’

/dev/null                               [   <=>                                                            ] 454.11K  1.05MB/s    in 0.4s    

2018-12-02 15:40:38 (1.05 MB/s) - ‘/dev/null’ saved [465013]

Not able to perform query on itwiki_p

3
Summary by BDavis (WMF)

Fixed by phab:T210978

Lucarosty (talkcontribs)

Until yesterday evening I used "https://quarry.wmflabs.org/query/26129" and "https://quarry.wmflabs.org/query/31383" to perform some queries for lint errors. This morning I'm unable to perform those queries on itwiki (I tried the same query on enwiki_p without problems). The page returns this error: "View 'itwiki_p.page' references invalid table(s) or column(s) or function(s) or definer/invoker of view lack rights to use them"

Zhuyifei1999 (talkcontribs)
Lucarosty (talkcontribs)

Perfect, now it is working! Thanks

Dynamic Quarry

Neechalkaran (talkcontribs)

I have a custom query to get the required data for a given page_title. Since one of the parameters, the title, is dynamic, I can't download the results as JSON from the static link given by Quarry. Is there any workaround to pass the query dynamically and get the result in JSON?

Framawiki (talkcontribs)

Hi, if by "dynamic" you mean executing queries with a parameter that changes, no, that's not possible. It would probably be better to create a tool with a webpage: wikitech:Portal:Toolforge.

Neechalkaran (talkcontribs)
Framawiki (talkcontribs)

The link you provided allows you to download the result of one query run (number 317046). Those results are static; if you want updated ones, you have to execute the query again. You should really take a look at Toolforge, which offers direct access to the database servers, and so the ability to execute your query as needed.
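For illustration, such a parameterized query could take roughly this shape inside a Toolforge tool; the ? placeholder is only a sketch, and the actual placeholder syntax depends on the client library you use:

SELECT page_id, page_namespace, page_len
FROM page
WHERE page_namespace = 0
  AND page_title = ?;  -- the dynamic title, bound safely at execution time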

Reply to "Dynamic Quarry"

Which file in the Wikipedia dumps corresponds to the "revision" table?

2
Gatherinfer (talkcontribs)

I need to download the entire revision history of English Wikipedia (the metadata for each edit, not the content itself). On Quarry, that's the "revision" table. However, I can't download it from Quarry because it's too big. I looked on the Wikipedia dumps site:


https://dumps.wikimedia.org/enwiki/20181201/


And I wasn't sure which file or set of files I need. I can see files corresponding to some of the tables I can query (e.g. user_groups.sql, redirect.sql, etc.), but I can't find one named revision.sql or similar.

There are two collections of files, both of which have "These files contain no page text, only revision metadata." in the subheading. Is one of these the one I need? If so, which one? I don't want to download many gigabytes of files only to find I chose the wrong one...

Thank you.

BDavis (WMF) (talkcontribs)
Reply to "Which file in the Wikipedia dumps corresponds to the "revision" table?"
Number of pages in DjVu files

Jarekt (talkcontribs)

To count the number of one-page PDF files, one can use select count(*) from image where img_minor_mime='pdf' and img_metadata like '%"Pages";s:1:"1";%'. I am looking for something similar for DjVu files. Are there better ways of counting pages in multipage files?

TheDJ (talkcontribs)

It seems that the DjVu handler retrieves the page count from img_metadata as well. It does so by parsing the XML, finding the metatree element, and then counting the number of OBJECT elements contained within it.

$count = count( $metatree->xpath( '//OBJECT' ) );

https://github.com/wikimedia/mediawiki/blob/master/includes/media/DjVuHandler.php#L292

A better way might be to use the API, which has defined accessors for this (it's in the dimensions property of images/media):

https://commons.wikimedia.org/w/api.php?action=query&titles=File:H.M.S.%20Pinafore.djvu&prop=imageinfo&iiprop=timestamp%7Cuser%7Curl%7Cdimensions


Jarekt (talkcontribs)

I know I can access the number of pages from Lua, and I see I can also use the API, but I was looking for a way to build an SQL query to detect single-page DjVu files. I think I got it: quarry:query/32028 seems to work.
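For anyone curious, one possible shape for such a query is sketched below; it assumes img_metadata stores the DjVu XML with one OBJECT element per page, which may not be exactly what quarry:query/32028 does:

USE commonswiki_p;
-- single-page DjVu files: metadata contains one OBJECT but not a second one
SELECT COUNT(*)
FROM image
WHERE img_minor_mime = 'vnd.djvu'
  AND img_metadata LIKE '%<OBJECT%'
  AND img_metadata NOT LIKE '%<OBJECT%<OBJECT%';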

Reply to "Number of pages in DjVu files"

revision table rev_comment missing

2
Summary by 197.218.84.150
Saeidpourbabak (talkcontribs)

Hi, it seems to me that the revision table is not working properly: namely, rev_comment data is missing. I know I have tons of edits with a specific summary, but now I find none. For example, see these two quarries, one from 3 months ago and one from today. Thanks.

Framawiki (talkcontribs)

Hi, no, that's not a bug :) Since recently (phab:T166733), comments are available in a new dedicated table. See Manual:Revision table#rev comment for the deprecation message. You'll need to use a join between the revision and comment tables.
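A minimal sketch of that join, using the column names from Manual:Revision table (the summary text is just a placeholder):

SELECT rev_id, rev_timestamp, comment_text
FROM revision
JOIN comment ON rev_comment_id = comment_id
WHERE comment_text LIKE '%your summary here%'
LIMIT 10;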

Summary by BDavis (WMF)

Watchlist data is not available in the Wiki Replica databases due to privacy concerns.

RoySmith (talkcontribs)

Can I query my watchlist in quarry? I get:

Table 'enwiki_p.watchlist' doesn't exist

Edgars2007 (talkcontribs)

No, you can't. Watchlists contain personal information, etc.

Is site_stats protected now?

3
Summary by BDavis (WMF)

The site_stats view was broken by a column change in the backing table, but this has been fixed. See phab:T210481 for more details.

Andreyyshore (talkcontribs)

Since my statistics bot failed to run today, I noticed that I can't access the site_stats table on any wiki.

Here's a simple query that doesn't work anymore as of today:

USE rowiki_p;

DESCRIBE site_stats;

This yields the following error: View 'rowiki_p.site_stats' references invalid table(s) or column(s) or function(s) or definer/invoker of view lack rights to use them

If I use SHOW TABLES FROM rowiki_p;, I can see that the table still exists. I was using it to get the total number of edits on rowiki.
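For reference, a query of roughly this shape reproduces that use, assuming the standard ss_total_edits column:

USE rowiki_p;
SELECT ss_total_edits FROM site_stats;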

Any idea why it's suddenly inaccessible?

Framawiki (talkcontribs)

Hi, that is not limited to Quarry but affects the whole WMF Labs replicas, and the restriction was not intentional. See phab:T210481.

Andreyyshore (talkcontribs)

Thanks for clarifying where the problem was.

Like the other commenters, I can also confirm that the problem is gone now.

Need help with finding most transcluded unprotected pages

7
Jarekt (talkcontribs)

I am trying to make sure some of the most transcluded pages (hopefully templates and modules) on Commons are protected. The first step would be to find some number (let's say 100) of the most transcluded pages and then look up their protection level. So I started with:


use commonswiki_p;
SELECT tl_namespace, tl_title, count(*) as tot
FROM templatelinks
group by tl_namespace, tl_title
order by tot desc
limit 100

This was killed after half an hour (see https://quarry.wmflabs.org/query/8653). Does anybody know what is wrong with this query and how to fix it? The results should mirror c:Special:MostTranscludedPages.

BDavis (WMF) (talkcontribs)

Trying the query out in the sql-optimizer tool shows that there are 1,605,698,124 (yes, 1.6 billion!) rows to process. The group by and order by clauses force the database server to copy all of the data into temporary storage and then perform a file sort on that storage. The limit clause will only change the data returned to your client, so all the rows have to be grouped and then the counts that were computed for each group all have to be sorted and finally only the first 100 results will be returned.
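You can check such estimates yourself by prefixing a query with EXPLAIN; the rows column of the resulting plan shows how many rows the server expects to scan:

EXPLAIN
SELECT tl_namespace, tl_title, count(*) AS tot
FROM templatelinks
GROUP BY tl_namespace, tl_title
ORDER BY tot DESC
LIMIT 100;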

Your best bet for getting this sort of data is to use the Action API's query action to get the same list you see on Special:MostTranscludedPages in a machine-readable format: https://commons.wikimedia.org/wiki/Special:ApiSandbox#action=query&format=json&list=querypage&utf8=1&formatversion=2&qppage=Mostlinkedtemplates

Jarekt (talkcontribs)

Yes, sorting 1.6 billion rows is a big job. My plan was to use that query as a piece of a larger query where I then look up the protection level of each page. Maybe I should instead count transclusions of unprotected templates (and later modules) and show the ones with over some number (1M or 500k). So: limit the scope first, and then count.

Zhuyifei1999 (talkcontribs)
Zhuyifei1999 (talkcontribs)

quarry:query/31088 would be a query that checks the transclusion count of all templates, using a dependent subquery on the templatelinks table. I don't think it'll work, though.

Jarekt (talkcontribs)
Zhuyifei1999 (talkcontribs)

Wow, 31131 is so much faster than I would have anticipated. Nice. I guess most of the time goes into counting the transclusions of the templates, and the time spent joining and filtering on page_restrictions is minimal.
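For reference, the general shape of that approach is sketched below; the real quarry:query/31131 may differ, and the 500000 threshold is only illustrative:

USE commonswiki_p;
SELECT p.page_title, COUNT(*) AS transclusions
FROM page p
JOIN templatelinks tl
  ON tl.tl_namespace = p.page_namespace
 AND tl.tl_title = p.page_title
LEFT JOIN page_restrictions pr
  ON pr.pr_page = p.page_id
 AND pr.pr_type = 'edit'
WHERE p.page_namespace = 10  -- Template namespace
  AND pr.pr_page IS NULL     -- no edit restriction
GROUP BY p.page_title
HAVING COUNT(*) > 500000;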

Reply to "Need help with finding most transcluded unprotected pages"