Topic on Talk:Quarry

Gfdgss (talkcontribs)

Hello,

I need to get the text from many pages. To start understanding the schema, I wanted to get the text from one specific page.

However, when I look at the revision table, the rev_text_id column is filled with zeros.

Also, an exception is thrown when I try to use the text table, saying that it doesn't exist.

Do you know why this happens? Are we not allowed to mine that data? Does anyone know how to solve this (via Quarry or in some other way)?

Thanks for any help!

Stefan2 (talkcontribs)

It is my understanding that the text table contains the wikitext of all revisions of all pages, including deleted pages. Deleted content is meant to be available only to admins, so I'd imagine that the restrictions you are facing are in place for that reason.

Gfdgss (talkcontribs)

Thanks, Stefan2.

Does that mean there isn't any way to access the text of a page?

I know Wikipedia lets you download dumps. Don't those files include all of the page text? If so, and we can get the text from those files, why can't we get it from here as well?

Moreover, the revision table has a field that says whether the edit deleted the page, so they could easily filter out the deleted pages and still give us access to the text table.

Stefan2 (talkcontribs)

I don't know what the database dumps contain or how they differ from the database copy available on Toollabs.

The revision table has a field called rev_deleted which tells if the revision has been deleted using revision deletion. If the entire page is deleted, the revisions end up in the archive table instead. In both cases, the text seems to reside in the text table (provided that the content wasn't deleted before upgrading the servers to MediaWiki 1.5, in 2005 or something).

I think I tried to look up something in the text table some time ago but failed, although I don't remember exactly what happened. Note that there is no key on the old_text field, so the execution time of a query will be proportional to the number of entries in the table (very slow) as opposed to the logarithm of the number of entries (faster). Also, the documentation says that the text can be difficult to get (for example, it may be gzipped). I don't know if WMF stores page content in compressed form or not.
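To make the point about compressed text concrete, here is a minimal sketch of decoding one text-table blob in Python. It assumes a stock MediaWiki row where old_flags is a comma-separated list and a gzip flag means the blob was compressed with PHP's gzdeflate (raw DEFLATE). The function name decode_old_text is made up for illustration, and whether the WMF replicas expose anything like this is a separate question.

```python
import zlib


def decode_old_text(blob: bytes, flags: str) -> str:
    """Decode a text-table blob according to its comma-separated old_flags."""
    flag_set = {f.strip() for f in flags.split(",") if f.strip()}
    if "external" in flag_set:
        # The blob is only an address in external storage, not the content.
        raise ValueError("content is stored externally; blob holds an address")
    if "gzip" in flag_set:
        # gzdeflate() writes raw DEFLATE data without a zlib or gzip header,
        # which is what the negative window size tells zlib to expect.
        blob = zlib.decompress(blob, -zlib.MAX_WBITS)
    # Assumption: the text is UTF-8 (rows without the utf-8 flag may use a
    # legacy encoding on old installations).
    return blob.decode("utf-8")
```

Rows flagged as external hold only a pointer, so the sketch refuses them rather than guessing where the content lives.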

Gfdgss (talkcontribs)

I highly appreciate your in-depth reply.

Do you know of any way I can use the current schema to find which users have Barnstars or Service Awards on their user page?

If so, is there any way of knowing exactly what type it is and when it was given?

Thanks

Stefan2 (talkcontribs)

The linked page says that the barnstar templates normally are substituted, so you can't search for template transclusions. However, you could use the imagelinks table to find all user pages which transclude File:Original Barnstar.png (or other barnstar images). If the page contains a barnstar image, then there's a fair chance that the person has been granted a barnstar.

For service awards, you could similarly use the templatelinks table to find all pages which transclude w:Template:Registered Editor and similar templates.
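The two lookups above can be sketched as SQL joins. The example below runs them from Python against a tiny in-memory SQLite stand-in, so the table contents and the names make_toy_db, users_with_image and users_with_template are invented for illustration. The column names (il_from, il_to, tl_from, tl_namespace, tl_title) follow the older MediaWiki layout; newer versions move link targets into a separate linktarget table, so check the current schema before running anything like this on Quarry.

```python
import sqlite3

USER_NS = 2       # namespace number of "User:" pages
TEMPLATE_NS = 10  # namespace number of "Template:" pages

# Titles are stored with underscores instead of spaces.
IMAGE_QUERY = """
SELECT page_title FROM imagelinks
JOIN page ON page_id = il_from
WHERE il_to = ? AND page_namespace = ?
ORDER BY page_title
"""

TEMPLATE_QUERY = """
SELECT page_title FROM templatelinks
JOIN page ON page_id = tl_from
WHERE tl_namespace = ? AND tl_title = ? AND page_namespace = ?
ORDER BY page_title
"""


def make_toy_db():
    """Build a made-up miniature of the page and link tables."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE page (page_id INTEGER PRIMARY KEY,
                           page_namespace INTEGER, page_title TEXT);
        CREATE TABLE imagelinks (il_from INTEGER, il_to TEXT);
        CREATE TABLE templatelinks (tl_from INTEGER,
                                    tl_namespace INTEGER, tl_title TEXT);
        INSERT INTO page VALUES (1, 2, 'Alice'), (2, 2, 'Bob'),
                                (3, 0, 'Barnstar');
        INSERT INTO imagelinks VALUES (1, 'Original_Barnstar.png'),
                                      (3, 'Original_Barnstar.png');
        INSERT INTO templatelinks VALUES (2, 10, 'Registered_Editor');
    """)
    return db


def users_with_image(db, file_title):
    """User pages that display the given file."""
    rows = db.execute(IMAGE_QUERY, (file_title, USER_NS))
    return [title for (title,) in rows]


def users_with_template(db, template_title):
    """User pages that transclude the given template."""
    rows = db.execute(TEMPLATE_QUERY, (TEMPLATE_NS, template_title, USER_NS))
    return [title for (title,) in rows]
```

On Quarry you would paste the SQL itself with the values written in place of the ? placeholders. The namespace filter matters because articles and project pages can use the same images.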

Once you have a list of user pages, then I suppose that you could write a computer program which downloads specific information (such as the wikitext) from the API if there's more information you need.
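A rough sketch of such a program, using only the Python standard library and the MediaWiki action API (action=query with prop=revisions). The endpoint for English Wikipedia, the User-Agent string and the function names are assumptions for illustration; a real script should also batch titles and respect the API etiquette guidelines.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://en.wikipedia.org/w/api.php"  # assumption: English Wikipedia


def build_query_url(title, api_url=API_URL):
    """Build a request URL for the latest wikitext of one page."""
    params = {
        "action": "query",
        "prop": "revisions",
        "rvprop": "content",
        "rvslots": "main",
        "titles": title,
        "format": "json",
        "formatversion": "2",
    }
    return api_url + "?" + urllib.parse.urlencode(params)


def extract_wikitext(response):
    """Pull the wikitext out of a decoded formatversion=2 response."""
    page = response["query"]["pages"][0]
    if page.get("missing"):
        return None  # the page does not exist
    return page["revisions"][0]["slots"]["main"]["content"]


def fetch_wikitext(title):
    """Download and return the current wikitext of a page (needs network)."""
    request = urllib.request.Request(
        build_query_url(title),
        # Hypothetical client name; use one that identifies your own tool.
        headers={"User-Agent": "barnstar-survey-sketch/0.1 (example)"},
    )
    with urllib.request.urlopen(request) as resp:
        return extract_wikitext(json.load(resp))
```

Calling fetch_wikitext("User:Example") would return that page's wikitext, which you could then search for barnstar or service award markup.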
