Manual:Pywikibot/Cookbook/Working with page histories

Revisions
Processing page histories may be frightening due to the amount of data, but is easy because we have plenty of methods. Some of them extract a particular information such as a user name, while others return an object called revision. A revision represents one line of a page history with all its data that are more than you see in the browser and more than you usually need. Before we look into these methods, let's have a look at revisions. We have to keep in mind that
 * some revisions may be deleted
 * otherwise the text, comment or the contributor's name may be hidden by admins so that non-admins won't see (this may cause errors to be handled)
 * oversighters may hide revisions so deeply that even admins won't see them

Furthermore, bots are not directly marked in page histories. You see in recent changes if an edit is made by bot because this property is stored in the recent changes table of the database and is available there for a few weeks. If you want to know if an edit was made by a bot, you may
 * guess it from the bot name and the comment (not yet implemented, but we will try below)
 * follow through a lot of database tables which contributor had a bot flag in the time of the edit, and consider that registered bots can switch off their flag temporarily and admins can revert changes in bot mode (good luck!)
 * retrieve from the wiki the current bot flag owners and suppose that same users were bots in the time of the edit (that's what Pywikibot does)

API:Revisions gives a deeper insight, while Manual:Slot says something about slots and roles (for most of us this is not too interesting).

Methods returning a single revision also return the content of the page, so it is a good idea to choose a short page for experiments (see Special:ShortPages on your home wiki). by default does not contain the text unless you force it.

If you want to know what  is good for, see here. Tl;dr: for comparing revisions and recognizing reverts.

For now I choose a page which is short, but has some page history: hu:8. évezred (8th millennium). Well, we have really few to say about it, and we suffer from lack of reliable sources. Let's see the last (current) revision!

As we look into the code, we don't get too much information about how to print it in more readably, but we notice that  is a subclass off , which is described here. So we can try :

While a revision may look like a dictionary on the screen, however it is not a dictionary, a in the above loop would show that all these tuple-like pairs are real tuples.

Non-existing pages raise  if you get their revisions.

keeps the oldid of the previous edit and will be zero for a newly created page. For any revision as :

Extract data from revison
We don't have to transform a  object into a dictionary to use it. The above experiment was just for overview. Now we know what to search for, and we can directly get the required data. Better, this structure is more comfortable to use than a common directory, because you have two ways to get a value: As you see, they are identical. But keep in mind that both solutions may cause problems if some parts of the revision were hidden by an admin. Let's see what happens upon hiding:

You may say this is not quite consequent, but this is the experimental result. You have to handle hidden properties, but for a general code you should know whether the bot runs as admin. A possible solution: If you are not an admin but need admin rights for testing, you may get one on https://test.wikipedia.org.

For example will never raise an, but is not very useful in most cases. On the other hand, will do something if the content is not hidden for you and not empty. means here either an empty page or a hidden content. If it counts for you, will make the difference.

Example was found when an oversighter suppressed the content of the revision. and  were   both for admin and non-admin bot, and an additional   key appeared with the value    (empty string).

Is it a bot edit?
Have a look at this page history. It has a lot of bots, some of which is no more or was never registered. Pywikibot has a  method which takes user name (not an object) and checks if it has a bot flag. This won't detect all these bots. We may improve with regarding the user name and the comment. This method is far not sure, may have false positives as well as negatives, but – as shown in the 3rd column – gives a better result then  which is in the second column.

hu:Kategória:Bottal létrehozott olasz vasútállomás cikkek contains articles created by bot. Here is a page generator that yields pages which were possibly never edited by any human: Test with:

Timestamp
is a  object which is well documented here. It is subclass of ]. Most importantly, MediaWiki always stores times in UTC, regardless of your time zone and daylight saving time.

The documentation suggests to use  for the current time; it is also a   in UTC.

Elapsed time since last edit: Pretty much, isn't it? :-) The result is a  object.

In the shell timestamps are human-readable. But when you print them from a script, they get a machine-readable format. If you want to restore the easily readable format, use the  function: For the above subtraction  is nicer, because   gives days and seconds, without converting them to hours and minutes.

Useful methods
Methods discussed in this section belng to  class with one exception, so may be used for practically any page.

Page history in general

 * will create a wikitable form the page history. The order may be reverted and number of rows may be limited. Useful e.g. when the page history gets unavailable and you want to save it to a talk page.
 * returns a small statistics: contributors with the number of their edits in the form of a dictionary, sorted by the decreasing number. Hidden names appear as empty string both for admin and non-admin bots.
 * will iterate through the revisions of a page beginning from the latest. As detailed above, this differs from one revision in that by default it does not retrieve the content of the revision. To get a certain revision turn the iterator into list.
 * use:
 * for beginning from the oldest version
 * for retrieving the page contents
 * for limiting the iteration to 5 entries
 * with a  to limit the iteration in time

For example to get a difflink between the first and the second version of a page without knowing its oldid (works for every language version)

And this one is a working piece of code from hu:User:DhanakBot/szubcsonk.py. This bot administers substubs. Before we mark a substub for deletion, we wonder if it has been vandalized. Maybe it was a longer article, but someone has truncated it, and a good faith user marked it as substub not regarding the page history. So the bot places a warning if the page was 1000 bytes longer or twice as long at any point of its history as now.
 * may also be interesting; this is the underlying method that is called by, but it has some extra features. You may specify the user whose contributions you want or don't want to have.

Last version of the page
Last version got a special attention from developers and is very comfortable to use. For example to get a difflink between the newest and the second newest version of a page without knowing its oldid (works for every language version)
 * (property): returns the current revision for this page. It's a Revision object as detailed above.
 * But some properties are available directly (they are equivalent to retrieve values from ):


 * (property): returns oldid of the current (latest) version.
 * : returns name or IP address of last user to edit page.
 * : returns name or IP address of last human/non-bot user to edit page. This is explained above. Our  function is an alternative possibility.
 * : returns True if last editor was unregistered.

Oldest version of a page

 * (property) is very similar to, but returns the first version rather than last.

Determine how many times is the current version longer then the first version (beware of division by zero which is unlikely but possible):

Determine which user how many articles has created in a given category, not including its subcategories: (Use  if you are interested in subcategories, too, but that will be slightly slower.)

Knowing that Main Page is a valid alias for the main page in every language and family, sort several Wikipedias by creation date: For some reason it gives a false date for dewiki where Main Page is redirected to  namespace, but looks nice anyway. :-)

You want to know when did the original creator edit in your wiki last time. In some cases it is a question whether it's worth to contact him/her. The result is a timestamp as described above, so you can subtract it from the current date to get the elapsed time. See also Working with users section.

Other

 * This will return a permalink to the given version. The oldid argument may be got as the revid value of that revision. If you omit it, the latest id will automatically be assigned. For getting permalinks for all versions of the page:
 * This will return a permalink to the given version. The oldid argument may be got as the revid value of that revision. If you omit it, the latest id will automatically be assigned. For getting permalinks for all versions of the page:

However, besides this URL-format during the years MediaWiki invented a nicer form for inner use. You may use in any language This will result in such permalinks that you can use on your wiki:.


 * takes an oldid and returns the text of that version (not a  object!). May be useful if you know the version id from somewhere.

Deleted revisions
When a page is deleted and recreated, it will get a new id. Thus the only way of mining in the deleted revisions is to identify the page by the title. On the other hand, when a page is moved (renamed), it takes the old id to the new title and a new redirect page is created with a new id and the old title. Taking everything into account, investigation may be complicated as deleted versions may be under the old title and deleted versions under the same title may belong to another page. It may be easier without bot if it is about one page. Now we take a simple case where the page was never renamed.


 * does not need admin rigths, and simply says a yes or no to the question if the page has any deleted revisions. Don't ask me for a use case.

The following methods need amin rights, otherwise they will raise. The main use case is to get timestamps for.
 * iterates through the timestamps of deleted revisions and yields them. Meanwhile it caches other data in a private variable for later use. Iterators may be processed with a  loop or transformed into lists. For example to see the number of deleted revisions:

This method takes a timestamp which is most easily got from the above, and returns a dictionary. Don't be mislead by the name; this is not a   object. Its keys are: dict_keys(['revid', 'user', 'timestamp', 'slots', 'comment']) Theoretically a argument should return the text of the revision (otherwise text is returned only if it had previously been retrieved). Currently the documentation does not exactly cover the happenings, see T331422. Instead, revision text may be got (with an example timestamp) as

Underlying method for both above methods is  which makes possible to get the deleted revisions of several pages together and to get only or rather exclude the revisions by a given user.