Talk:Quarry


About this board

Discussion area for Quarry itself and for help with individual queries.

Землеройкин (talkcontribs)
Zhuyifei1999 (talkcontribs)

It's probably dead already. The status not being updated is a consequence of Quarry running out of memory while trying to store the results.

Reply to "How to kill a query?"
RhinosF1 (talkcontribs)
Milimetric (WMF) (talkcontribs)

Just to clarify, do you mean:

Let the list of people on that page be L. You want to find this set:

{user | user <- L and user has not made any edits on this wiki in 5 years}

What about other wikis? Also, what do you mean by "remove them", like remove them from the list?

To find users that haven't made any edits in one wiki you could do:

 select user_id
   from user
  where user_id in (... your list ...) 
    and user_id not in
        (select distinct rev_user
           from revision
          where rev_timestamp > '20140301000000'
        )
;

But that subquery would be impossibly slow on enwiki, so this is a heavy query. Not sure how to optimize it off the top of my head. Might just be better to make a temporary table somewhere out of the users that did edit in the past 5 years and query that.
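
As a rough, untested sketch (assuming the same revision.rev_user and rev_timestamp columns as above, and the same inlined list), an anti-join might behave better, since the optimizer can use an index on (rev_user, rev_timestamp) if one is available instead of materializing the subquery:

 select u.user_id, u.user_name
   from user u
   left join revision r
     on r.rev_user = u.user_id
    and r.rev_timestamp > '20140301000000'  -- only consider edits in the last 5 years
  where u.user_id in (... your list ...)
    and r.rev_id is null                    -- no matching revision = no edits in that window
 ;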

RhinosF1 (talkcontribs)

Editors in the last 5 years would also work. That query failed (https://quarry.wmflabs.org/query/34487). Just enwiki is what we need. We need to end up with a list of the users on that page who have (or have not) edited in the past 5 years.

Reply to "Quarry Help Needed"

Error Hmm... Is the SQL actually running?!

2
Kerry Raymond (talkcontribs)

What does this mean?

Zhuyifei1999 (talkcontribs)
Reply to "Error Hmm... Is the SQL actually running?!"

Search for genus & species items missing descriptions

6
Tom.Reding (talkcontribs)

I'd like to port this working WD Query Service query into Quarry (it works when there are ≲ 5000 possible results, but genus & species time out the server), but I'm having trouble; could someone help?

Matěj Suchánek (talkcontribs)
USE wikidatawiki_p;
SELECT page_title FROM page
JOIN pagelinks ON page_id = pl_from
LEFT JOIN wb_terms ON term_full_entity_id = page_title AND page_namespace = 0 AND term_language = 'en' AND term_type = 'description'
WHERE pl_title = 'Q34740' AND pl_namespace = 0;

Since the SQL database doesn't hold statements, we need to use standard MediaWiki links.

Tom.Reding (talkcontribs)

How does one search for empty description term_text though? I tried adding "AND NOT EXISTS(term_text)" to the WHERE, but got an error; I also tried adding term_text to the SELECT but it times out, even with LIMIT 100.

Matěj Suchánek (talkcontribs)

Oops, I forgot AND term_text IS NULL.
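
Put together, the corrected query would read roughly like this (same tables and conditions as above, untested; page_namespace = 0 kept in the join condition as in the original):

USE wikidatawiki_p;
SELECT page_title FROM page
JOIN pagelinks ON page_id = pl_from
LEFT JOIN wb_terms ON term_full_entity_id = page_title AND page_namespace = 0 AND term_language = 'en' AND term_type = 'description'
WHERE pl_title = 'Q34740' AND pl_namespace = 0
AND term_text IS NULL;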

Tom.Reding (talkcontribs)

Killed again after 30 minutes :( I have somehow gotten the original WDQS query to stop timing out or erroring by, counter-intuitively, removing the artificially imposed limit... Thanks for your help though.

Matěj Suchánek (talkcontribs)

That is interesting...

Mahir256 (talkcontribs)

It is clear that the results of past runs of a query are stored on Quarry's servers; I was able to find those for my label statistics run on the 25th (compare those from the 26th) after some crawling based on a suspected range of run IDs within which the 25th's results would have fallen.

Is there a way to obtain a list of the results of past runs of a query so that such crawling won't be necessary?

Zhuyifei1999 (talkcontribs)

You are correct. It is indeed stored. It is however not exposed in the interface. And I just checked the code and found there are no uses of Query.revs. So unfortunately we don't currently have the list of revisions for a particular query exposed. Would you like to contribute a patch to add this?

Reply to "Listing past runs of a query"

"select *" returns different rows from "select col_name"?

2
RoySmith (talkcontribs)

When I run:

select rd_title from redirect where rd_title like 'The\_%' limit 5;


I get:

The_"5"_Royales
The_"5"_Royales
The_"5"_Royales
The_"5"_Royales
The_"5"_Royales

If I replace the explicit column name "rd_title" with "*", I get different rows:

select * from redirect where rd_title like 'The\_%' limit 5;

2751 0 The_Angry_Brigade
2903 0 The_Diary_of_a_Young_Girl
3317 0 The_Birth_of_a_Nation
3920 0 The_Beatles
4238 0 The_Bronx

What's going on here? I expected to get the same rows, just more columns.
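
If it helps narrow things down: without an ORDER BY, LIMIT just takes whichever matching rows the chosen access path yields first, and the two statements are free to use different indexes, so they can legitimately return different rows. An explicit order should make them agree (untested sketch; rd_from is the redirect table's primary key, so the ordering is deterministic):

select rd_title from redirect where rd_title like 'The\_%' order by rd_from limit 5;
select * from redirect where rd_title like 'The\_%' order by rd_from limit 5;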

Matěj Suchánek (talkcontribs)
Reply to ""select *" returns different rows from "select col_name"?"

Not able to download output

9
Summary by BDavis (WMF)

Fixed by gerrit:472377

Saeidpourbabak (talkcontribs)

Hi, I have THIS quarry with 20422 rows of results. When I try to download the output, I get only 1070 rows in the CSV file, and for the XLSX I get an empty (0 KB) file.

Any idea what is wrong?

Zhuyifei1999 (talkcontribs)

It's something about having surrogates (CC @Framawiki:)

Nov 06 18:09:26 quarry-web-01 uwsgi-quarry-web[3030]: Traceback (most recent call last):
Nov 06 18:09:26 quarry-web-01 uwsgi-quarry-web[3030]:   File "/srv/quarry/venv/lib/python3.5/site-packages/werkzeug/wsgi.py", line 870, in __next__
Nov 06 18:09:26 quarry-web-01 uwsgi-quarry-web[3030]:     return self._next()
Nov 06 18:09:26 quarry-web-01 uwsgi-quarry-web[3030]:   File "/srv/quarry/venv/lib/python3.5/site-packages/werkzeug/wrappers.py", line 84, in _iter_encoded
Nov 06 18:09:26 quarry-web-01 uwsgi-quarry-web[3030]:     yield item.encode(charset)
Nov 06 18:09:26 quarry-web-01 uwsgi-quarry-web[3030]: UnicodeEncodeError: 'utf-8' codec can't encode character '\udcd9' in position 385: surrogates not allowed
Nov 06 18:09:46 quarry-web-01 uwsgi-quarry-web[3030]: Traceback (most recent call last):
Nov 06 18:09:46 quarry-web-01 uwsgi-quarry-web[3030]:   File "/srv/quarry/venv/lib/python3.5/site-packages/werkzeug/wsgi.py", line 870, in __next__
Nov 06 18:09:46 quarry-web-01 uwsgi-quarry-web[3030]:     return self._next()
Nov 06 18:09:46 quarry-web-01 uwsgi-quarry-web[3030]:   File "/srv/quarry/venv/lib/python3.5/site-packages/werkzeug/wrappers.py", line 82, in _iter_encoded
Nov 06 18:09:46 quarry-web-01 uwsgi-quarry-web[3030]:     for item in iterable:
Nov 06 18:09:46 quarry-web-01 uwsgi-quarry-web[3030]:   File "/srv/quarry/venv/lib/python3.5/site-packages/werkzeug/contrib/iterio.py", line 183, in __new__
Nov 06 18:09:46 quarry-web-01 uwsgi-quarry-web[3030]:     rv = g.switch()
Nov 06 18:09:46 quarry-web-01 uwsgi-quarry-web[3030]:   File "/srv/quarry/venv/lib/python3.5/site-packages/werkzeug/contrib/iterio.py", line 178, in run
Nov 06 18:09:46 quarry-web-01 uwsgi-quarry-web[3030]:     func(stream)
Nov 06 18:09:46 quarry-web-01 uwsgi-quarry-web[3030]:   File "./quarry/web/output.py", line 152, in respond
Nov 06 18:09:46 quarry-web-01 uwsgi-quarry-web[3030]:     if (worksheet.write(row_num, col_num, cell) < -1 and
Nov 06 18:09:46 quarry-web-01 uwsgi-quarry-web[3030]:   File "/srv/quarry/venv/lib/python3.5/site-packages/xlsxwriter/worksheet.py", line 67, in cell_wrapper
Nov 06 18:09:46 quarry-web-01 uwsgi-quarry-web[3030]:     return method(self, *args, **kwargs)
Nov 06 18:09:46 quarry-web-01 uwsgi-quarry-web[3030]:   File "/srv/quarry/venv/lib/python3.5/site-packages/xlsxwriter/worksheet.py", line 406, in write
Nov 06 18:09:46 quarry-web-01 uwsgi-quarry-web[3030]:     return self._write(row, col, *args)
Nov 06 18:09:46 quarry-web-01 uwsgi-quarry-web[3030]:   File "/srv/quarry/venv/lib/python3.5/site-packages/xlsxwriter/worksheet.py", line 431, in _write
Nov 06 18:09:46 quarry-web-01 uwsgi-quarry-web[3030]:     return self._write_token_as_string(token, row, col, *args)
Nov 06 18:09:46 quarry-web-01 uwsgi-quarry-web[3030]:   File "/srv/quarry/venv/lib/python3.5/site-packages/xlsxwriter/worksheet.py", line 387, in _write_token_as_string
Nov 06 18:09:46 quarry-web-01 uwsgi-quarry-web[3030]:     return self._write_string(row, col, *args)
Nov 06 18:09:46 quarry-web-01 uwsgi-quarry-web[3030]:   File "/srv/quarry/venv/lib/python3.5/site-packages/xlsxwriter/worksheet.py", line 522, in _write_string
Nov 06 18:09:46 quarry-web-01 uwsgi-quarry-web[3030]:     self._write_single_row(row)
Nov 06 18:09:46 quarry-web-01 uwsgi-quarry-web[3030]:   File "/srv/quarry/venv/lib/python3.5/site-packages/xlsxwriter/worksheet.py", line 5432, in _write_single_row
Nov 06 18:09:46 quarry-web-01 uwsgi-quarry-web[3030]:     self._write_cell(row_num, col_num, col_ref)
Nov 06 18:09:46 quarry-web-01 uwsgi-quarry-web[3030]:   File "/srv/quarry/venv/lib/python3.5/site-packages/xlsxwriter/worksheet.py", line 5610, in _write_cell
Nov 06 18:09:46 quarry-web-01 uwsgi-quarry-web[3030]:     self._xml_inline_string(string, preserve, attributes)
Nov 06 18:09:46 quarry-web-01 uwsgi-quarry-web[3030]:   File "/srv/quarry/venv/lib/python3.5/site-packages/xlsxwriter/xmlwriter.py", line 167, in _xml_inline_string
Nov 06 18:09:46 quarry-web-01 uwsgi-quarry-web[3030]:     (attr, t_attr, string))
Nov 06 18:09:46 quarry-web-01 uwsgi-quarry-web[3030]:   File "/usr/lib/python3.5/codecs.py", line 718, in write
Nov 06 18:09:46 quarry-web-01 uwsgi-quarry-web[3030]:     return self.writer.write(data)
Nov 06 18:09:46 quarry-web-01 uwsgi-quarry-web[3030]:   File "/usr/lib/python3.5/codecs.py", line 376, in write
Nov 06 18:09:46 quarry-web-01 uwsgi-quarry-web[3030]:     data, consumed = self.encode(object, self.errors)
Nov 06 18:09:46 quarry-web-01 uwsgi-quarry-web[3030]: UnicodeEncodeError: 'utf-8' codec can't encode character '\udcd9' in position 39: surrogates not allowed
Zhuyifei1999 (talkcontribs)
>>> surrogate = b'\xc3'.decode('utf8','surrogateescape')
>>> surrogate
'\udcc3'
>>> surrogate.encode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcc3' in position 0: surrogates not allowed
>>> surrogate.encode('utf-8', 'surrogateescape')
b'\xc3'
Zhuyifei1999 (talkcontribs)

This is bad, but I don't know of any other way to get control of the error handler used in _iter_encoded:

>>> import codecs
>>> codecs.register_error('strict', codecs.lookup_error('surrogateescape'))
>>> 
>>> surrogate = b'\xc3'.decode('utf8','surrogateescape')
>>> surrogate.encode('utf-8')
b'\xc3'
>>> codecs.lookup_error('surrogateescape')
<built-in function surrogateescape>

@Framawiki: Any objections to applying this 'hack'?

Zhuyifei1999 (talkcontribs)

I also broke the worker while playing with surrogates:

Nov 06 18:33:00 quarry-worker-02 celery-quarry-worker[2512]: [2018-11-06 18:33:00,947: ERROR/ForkPoolWorker-25] Task worker.run_query[6a7ebdd0-fa08-4732-aa2a-2deafb5d7892] raised unexpected: UnicodeDecodeError('utf-8', b'\xed\xb3\x83', 0, 1, 'invalid continuation byte')
Nov 06 18:33:00 quarry-worker-02 celery-quarry-worker[2512]: Traceback (most recent call last):
Nov 06 18:33:00 quarry-worker-02 celery-quarry-worker[2512]:   File "/srv/quarry/venv/lib/python3.5/site-packages/celery/app/trace.py", line 382, in trace_task
Nov 06 18:33:00 quarry-worker-02 celery-quarry-worker[2512]:     R = retval = fun(*args, **kwargs)
Nov 06 18:33:00 quarry-worker-02 celery-quarry-worker[2512]:   File "/srv/quarry/venv/lib/python3.5/site-packages/celery/app/trace.py", line 641, in __protected_call__
Nov 06 18:33:00 quarry-worker-02 celery-quarry-worker[2512]:     return self.run(*args, **kwargs)
Nov 06 18:33:00 quarry-worker-02 celery-quarry-worker[2512]:   File "/srv/quarry/quarry/web/worker.py", line 74, in run_query
Nov 06 18:33:00 quarry-worker-02 celery-quarry-worker[2512]:     while cur.nextset():
Nov 06 18:33:00 quarry-worker-02 celery-quarry-worker[2512]:   File "/srv/quarry/venv/lib/python3.5/site-packages/pymysql/cursors.py", line 107, in nextset
Nov 06 18:33:00 quarry-worker-02 celery-quarry-worker[2512]:     return self._nextset(False)
Nov 06 18:33:00 quarry-worker-02 celery-quarry-worker[2512]:   File "/srv/quarry/venv/lib/python3.5/site-packages/pymysql/cursors.py", line 102, in _nextset
Nov 06 18:33:00 quarry-worker-02 celery-quarry-worker[2512]:     conn.next_result(unbuffered=unbuffered)
Nov 06 18:33:00 quarry-worker-02 celery-quarry-worker[2512]:   File "/srv/quarry/venv/lib/python3.5/site-packages/pymysql/connections.py", line 520, in next_result
Nov 06 18:33:00 quarry-worker-02 celery-quarry-worker[2512]:     self._affected_rows = self._read_query_result(unbuffered=unbuffered)
Nov 06 18:33:00 quarry-worker-02 celery-quarry-worker[2512]:   File "/srv/quarry/venv/lib/python3.5/site-packages/pymysql/connections.py", line 727, in _read_query_result
Nov 06 18:33:00 quarry-worker-02 celery-quarry-worker[2512]:     result.read()
Nov 06 18:33:00 quarry-worker-02 celery-quarry-worker[2512]:   File "/srv/quarry/venv/lib/python3.5/site-packages/pymysql/connections.py", line 1073, in read
Nov 06 18:33:00 quarry-worker-02 celery-quarry-worker[2512]:     self._read_result_packet(first_packet)
Nov 06 18:33:00 quarry-worker-02 celery-quarry-worker[2512]:   File "/srv/quarry/venv/lib/python3.5/site-packages/pymysql/connections.py", line 1143, in _read_result_packet
Nov 06 18:33:00 quarry-worker-02 celery-quarry-worker[2512]:     self._read_rowdata_packet()
Nov 06 18:33:00 quarry-worker-02 celery-quarry-worker[2512]:   File "/srv/quarry/venv/lib/python3.5/site-packages/pymysql/connections.py", line 1181, in _read_rowdata_packet
Nov 06 18:33:00 quarry-worker-02 celery-quarry-worker[2512]:     rows.append(self._read_row_from_packet(packet))
Nov 06 18:33:00 quarry-worker-02 celery-quarry-worker[2512]:   File "/srv/quarry/venv/lib/python3.5/site-packages/pymysql/connections.py", line 1197, in _read_row_from_packet
Nov 06 18:33:00 quarry-worker-02 celery-quarry-worker[2512]:     data = data.decode(encoding)
Nov 06 18:33:00 quarry-worker-02 celery-quarry-worker[2512]: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 0: invalid continuation byte
Zhuyifei1999 (talkcontribs)
Saeidpourbabak (talkcontribs)

Hello and thanks for taking action.

Do I have to do something, or do I just need to wait? I'm afraid your replies and the gerrit link are too technical for me.

Framawiki (talkcontribs)

Hi Saeidpourbabak and thanks for the ping. We'll try to fix the site soon and will contact you when it's done.

Zhuyifei1999 (talkcontribs)

This should be fixed now:

$ curl -s https://quarry.wmflabs.org/run/310305/output/0/csv | wc -l
20423
$ wget -O /dev/null https://quarry.wmflabs.org/run/310305/output/0/xlsx
--2018-12-02 15:40:35--  https://quarry.wmflabs.org/run/310305/output/0/xlsx
Resolving quarry.wmflabs.org... 208.80.155.156
Connecting to quarry.wmflabs.org|208.80.155.156|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/vnd.openxmlformats-officedocument.spreadsheetml.sheet]
Saving to: ‘/dev/null’

/dev/null                               [   <=>                                                            ] 454.11K  1.05MB/s    in 0.4s    

2018-12-02 15:40:38 (1.05 MB/s) - ‘/dev/null’ saved [465013]

Not able to perform query on itwiki_p

3
Summary by BDavis (WMF)

Fixed by phab:T210978

Lucarosty (talkcontribs)

Until yesterday evening I used "https://quarry.wmflabs.org/query/26129" and "https://quarry.wmflabs.org/query/31383" to perform some queries for lint errors. This morning I'm unable to run those queries on itwiki (I tried the same query on enwiki_t without problems). The page returns this error: "View 'itwiki_p.page' references invalid table(s) or column(s) or function(s) or definer/invoker of view lack rights to use them"

Zhuyifei1999 (talkcontribs)
Lucarosty (talkcontribs)

Perfect, now it's working! Thanks

Neechalkaran (talkcontribs)

I have a custom query to get the required data for a given page_title. Since one of the parameters, the title, is dynamic, I can't download the results in JSON from the static link given by Quarry. Is there any workaround to pass the query dynamically and get the result in JSON?

Framawiki (talkcontribs)

Hi, if by "dynamic" you mean executing queries with a parameter that changes, no, that's not possible. It would probably be better to create a tool with a webpage: wikitech:Portal:Toolforge.

Neechalkaran (talkcontribs)
Framawiki (talkcontribs)

The link you provided allows downloading the results of one query run (number 317046). Those results are static; if you want updated ones, you have to execute the query again. You should really take a look at Toolforge, which offers direct access to the database servers and thus the ability to execute your query as needed.

Reply to "Dynamic Quarry"

Which file in the Wikipedia dumps corresponds to the "revision" table?

2
Gatherinfer (talkcontribs)

I need to download the entire revision history of English Wikipedia (the metadata for each edit, not the content itself). On Quarry, that's the "revision" table. However, I can't download it from Quarry because it's too big. I looked on the Wikipedia dumps site:


https://dumps.wikimedia.org/enwiki/20181201/


I wasn't sure which file or set of files I need. I can see files corresponding to some of the tables I can query (e.g. user_groups.sql, redirect.sql, etc.), but I can't find one named revision.sql or similar.

There are two collections of files, both of which have "These files contain no page text, only revision metadata." in the subheading. Is one of these the one I need? If so, which one? I don't want to download many gigabytes of files only to find I chose the wrong one...

Thank you.

BDavis (WMF) (talkcontribs)
Reply to "Which file in the Wikipedia dumps corresponds to the "revision" table?"