Topic on Project:Support desk

What encoding is the category links SQL dump (Python cannot parse)?

4 comments • 01:53, 11 March 2021 3 years ago

Summary by MarkAHershberger

Tracked in Phabricator
Task T250517

Aaronshenhao (talkcontribs)

I'm trying to get all Wikipedia category links for a project. I've successfully managed to load the enwiki-latest-page.sql SQL dump using UTF-8. However, I get the following error when trying to parse enwiki-latest-categorylinks.sql in Python:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xdc in position 1957: invalid continuation byte

If I ignore the errors, or decode the file byte by byte, the SQL file seems to contain non-Unicode characters after line 45, which garbles the output. Can anyone shed some light on the issue? Is this expected? I couldn't find anything on the help page. Since I was able to easily open the page database using UTF-8, I did not expect this error. The code I'm using is very simple:

with open(filepath, "r", encoding="utf-8") as f:
  for _ in range(80):
    # Peek first 80 lines
    # outputFile.write(f.readline())
    print(f.readline())

Wikidump link: https://dumps.wikimedia.org/enwiki/latest/

Relevant help page: Manual:Categorylinks table, https://meta.wikimedia.org/wiki/Data_dumps/What%27s_available_for_download#Database_tables

Reply Edited by MarkAHershberger 19:02, 17 April 2020 4 years ago

MarkAHershberger (talkcontribs)

This sounds like an issue that should be filed on Phabricator.

Reply 18:55, 17 April 2020 4 years ago

Bawolff (talkcontribs)

cl_sortkey is arbitrary binary data. Everything else should be valid UTF-8. (It's always possible that there is some really old junk in the DB from around 10 years ago that isn't. I don't think there is, but it wouldn't be shocking either.)
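
If cl_sortkey is binary, one way to peek at the dump without the UnicodeDecodeError is to open the file in binary mode and decode each line leniently. The following is only a minimal sketch, not a full SQL parser: `peek_lines` is a hypothetical helper name, and the sample bytes stand in for the real dump, which would be opened with `open(filepath, "rb")`.

```python
import io

def peek_lines(stream, count=80):
    """Return up to `count` lines from a binary stream, decoded leniently."""
    lines = []
    for _ in range(count):
        raw = stream.readline()
        if not raw:
            break
        # Bytes that are not valid UTF-8 become \xNN escapes instead of raising.
        lines.append(raw.decode("utf-8", errors="backslashreplace"))
    return lines

# Stand-in for the dump (assumption): valid UTF-8 text plus one stray 0xdc byte.
sample = io.BytesIO(b"INSERT INTO `categorylinks` VALUES (1,'caf\xc3\xa9','A\xdcB');\n")
for line in peek_lines(sample):
    print(line, end="")
```

The valid UTF-8 fields decode normally, and the undecodable sort key bytes stay visible as escapes, so the rest of each line remains readable.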

Reply 23:24, 18 April 2020 4 years ago

2001:4898:80E8:0:C245:C97F:6E8F:173C (talkcontribs)

import mysql.connector

...connection

cursor.execute("select cl_sortkey from categorylinks LIMIT 100")

result = cursor.fetchone()

UnicodeDecodeError Traceback (most recent call last)

<ipython-input-147-11704e00bfab> in <module>

14 print(result)'''

15 cursor.execute("select cl_sortkey from categorylinks LIMIT 100")

---> 16 result = cursor.fetchone()

...

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xdc in position 31: invalid continuation byte
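
The same error appears here because the client tries to decode cl_sortkey as text. A possible workaround, assuming the value can be fetched as raw bytes (for example by selecting `HEX(cl_sortkey)` in SQL, or through a raw cursor if the connector offers one), is to decode it only when it happens to be valid UTF-8. This is a sketch under that assumption; `sortkey_to_text` is a hypothetical helper and it does not touch a database.

```python
def sortkey_to_text(raw):
    """Decode a cl_sortkey value if it is valid UTF-8, else return a hex string."""
    data = bytes(raw)  # accepts bytes or bytearray
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # Binary sort keys are shown as hex rather than raising an error.
        return "0x" + data.hex()

print(sortkey_to_text(b"EXAMPLE PAGE"))
print(sortkey_to_text(bytearray(b"A\xdcB")))
```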

Reply 01:53, 11 March 2021 3 years ago
