User:Bináris/Pywikibot cookbook

Pywikibot is the ninth wonder of the world, the eighth being MediaWiki itself.

Pywikibot is a very flexible and powerful tool to edit Wikipedia or another MediaWiki instance. However, there comes the moment when you feel that something is missing from it, and the Universe calls you to write your own scripts. Don't be afraid, this is not a disease, this is the natural way of personal evolution. Pywikibot is waiting for you: you will find the  directory, which is ready to host your scripts.

This book is for you, if you
 * already have some experience with Pywikibot and have some vision about its capabilities
 * have some basic knowledge of Python and object-oriented programming
 * want to hack your own scripts
 * are already familiar with Manual:Pywikibot and especially Manual:Pywikibot/Create your own script (Library usage is also very useful, but more complex).

For general help see the bottom right template. In this book we go into coding examples with some deeper explanation.

(A personal confession from the creator of this page: I just wanted to use Pywikipedia, as we called it in the old times, then I wanted to slightly modify some of the scripts to better fit to my needs, then I went to the book store and bought my first Python book. So it goes.)

Creating a script

 * Encoding and environment: All Python 3 source files MUST be UTF-8 without a BOM. Therefore it is a good idea to forget the bare Notepad of Windows forever, because it has the habit of soiling files with a BOM. The minimal suggested editor is Notepad++, which is developed for programming purposes. It has an Encoding menu where you see what I am speaking about, and you may set UTF-8 without BOM as default encoding. Any real programming IDE will do the job properly, e.g. Visual Studio Code is quite popular nowadays. Python has an integrated editor called IDLE, which uses proper encoding, but for some mysterious reason does not show line numbers, so you will suffer a lot from error messages when you keep trying to find the 148th line of your code.


 * Where to put:  directory is designed to host your scripts. This is a great idea, because this directory will be untouched when you update Pywikibot, and you can easily backup your own work, regarding just this directory.
 * You may also create your own directory structure. If you would like to use other than the default, search for  in , and you will see the solution.
 * See also https://doc.wikimedia.org/pywikibot/master/utilities/scripts_ref.html#module-pywikibot.scripts.wrapper.

Running a script
You have basically two ways. The recommended one is to call your script through. Your prompt should be in Pywikibot root directory where  is and use: However, if you don't need these features, especially if you don't use global options and don't want  to handle command line arguments, you are free to run the script directly from   directory.
 * Global options: https://doc.wikimedia.org/pywikibot/master/global_options.html#global-options
 * Advantages of this method: https://doc.wikimedia.org/pywikibot/master/scripts_ref/scripts.html#module-scripts

Coding style
Of course, we have PEP 8, Manual:Coding conventions/Python and Manual:Pywikibot/Development/Guidelines. But sometimes we feel like just hacking a small piece of code for ourselves and not bothering the style.

Several times a small piece of temporary code begins to grow beyond our initial expectations, and we have to clean it.

If you'll take my advice, do what you want, but my experience is that it is always worth coding for myself as if I were coding for the world.

On the other side, when you use Pywikibot interactively (see below), it is normal to be lazy and use abbreviations and aliases. For example Note that the  alias cannot be used in the second import. It will be useful later, e.g. for.

However, in this cookbook we won't use these abbreviations for better readability.

Beginning and ending
In most cases you see something like this in the very first line of Pywikibot scripts:

or

This is a shebang. If you use a Unix-like system, you know what it is for. If you run your scripts on Windows, you may just omit this line, as it does not do anything there. But it can be a good idea to use it anyway in case others want to use your script someday.

The very last two lines of the scripts also follow a pattern. They usually look like this: This is a good practice in Python. When you run the script directly from command line (that's what we call directory mode), the condition will be true, and the  function will be called. That's where you handle arguments and start the process. On the other side, if you import the script (that is the library mode), the condition evaluates to false, and nothing happens (just the lines on the main level of your script will be executed). Thus you may directly call the function or method you need.
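The pattern described above can be sketched as follows. This is only a minimal skeleton: the shebang variant, the body of the function and the return value are my assumptions for illustration, not a copy of any existing script.

```python
#!/usr/bin/env python3
"""Hypothetical skeleton of a user script."""
import sys


def main(*args: str) -> int:
    """Handle the arguments and start the work.

    Returns the number of arguments, just to have something to show.
    """
    print(f'Got {len(args)} argument(s): {args}')
    return len(args)


if __name__ == '__main__':
    # True only when the file is run directly, not when it is imported.
    main(*sys.argv[1:])
```

When the file is imported instead, nothing is started automatically, and the function may be called with whatever arguments you like.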

Scripting vs interactive use
For proper work we use scripts. But there is an interesting way of creating a sandbox. Just go to your Pywikibot root directory (where  is), type python and at the Python prompt type Now you are in the world of Pywikibot (if  is properly set). This is great for trying, experimenting, even for small and rapid tasks. For example to change several occurrences of Pywikipedia to Pywikibot on an outdated community page just type: Throughout this document  prompt indicates that we are in the interactive shell. You are encouraged to play with this toy. Where this prompt is not present, the code lines have to be saved into a Python source file. Of course, when you use, it goes live on your wiki, so be careful. You may also set the testwiki as your site to avoid problems.

A big advantage of shell is that you may omit the  function. In most cases equals to section shows a rare exception when these are not equivalent, and we can take advantage of the difference for understanding what happens.

Documentation and help
We have three levels of documentation. As you go forward into understanding Pywikibot, you will become more and more familiar with these levels.
 * 1) Manual:Pywikibot – written by humans for humans. This is recommended for beginners. It also has a "Get help" box.
 * 2) https://doc.wikimedia.org/pywikibot – mostly autogenerated technical documentation with all the fine details you are looking for. Click on   if you use the latest deployed stable version of Pywikibot (this is recommended unless you want to develop the framework itself), and on   if you use the actual version that is still under development. Differences are usually small.
 * 3) The code itself. It is useful if you don't find something in the documentation or you want to find working solutions and good practices. You may reach it from the above docs (most classes and methods have a   link) or from your computer.

Testing the framework
Let's try this at Python prompt: Of course, you will have the name of your own bot there if you have set the  properly. Now, what does it mean? This does not mean this is a valid username, even less it is logged in. This does not mean you have reached Wikipedia, neither you have Internet connection. This means that Python is working, Pywikibot is working, and you have set your home wiki and username in. Any string may be there by this point.

If you save the above code to a file called test.py: and run it with, you will get Brghhwsf.

Now try This is already a real contacting your wiki; the result is the name of your bot if you have logged in, otherwise. For advanced use it is important that although  is similar to a   object instance, here it is just a string. See https://doc.wikimedia.org/pywikibot/master/api_ref/pywikibot.site.html#pywikibot.site._basesite.BaseSite.user.

Creating a Page object from title
In the further part of this cookbook, unless otherwise stated, we always assume that you have already used these two basic statements:

You want to get the article about Budapest in your wiki. While it is in the article namespace, it is as simple as

Note that Python is case sensitive, and in its world  and   mean classes,   and   class instances, while lowercase   and   should be variables.

For such simple experiments interactive Python shell is comfortable, as you can easily see the results without using, saving and running your code.

Getting the type of an object is often useful when you want to discover the capabilities of Pywikibot. It seems to be strange, but the main thing is that you got a Page. Now let's see the user page of your bot. Either you prefix it with the namespace ( and other English names work everywhere, while the localized names only in your very wiki) or you give the namespace number as the third argument. So

and

will give the same result. is the localized version of  in Hungarian; Pywikibot won't respect that I used the English name for the namespace in my command, the result is always localized.

Getting the title of the page
On the other hand, if you already have a  object, and you need its title as a string,   method will do the job:

Possible errors
While getting pages causes far fewer errors than saving them, a few types are worth mentioning; some of them are technical, while others are possible contradictions between our expectations and reality. Let's speak about them before actually getting the page.
 * 1) The page does not exist.
 * 2) The page is a redirect.
 * 3) You may have been misled regarding the content in some namespaces. If your page is in Category namespace, the content is the descriptor page. If it is in User namespace, the content is the user page. The trickiest is the File namespace: the content is the file descriptor page, not the file itself; however if the file comes from Commons, the page may not exist in your wiki at all, while you still see the picture.
 * 4) The expected piece of text is not in the page content because it is transcluded from a template. You see the text on the page, but cannot replace it directly by bot.
 * 5) Sometimes a badly formed code may work well. For example  with two spaces will behave as  . While the page is in the category and you will get it from a page generator (see below), you won't find the desired string in it.
 * 6) And, unfortunately, Wikipedia servers sometimes face errors. If you get a 500 error, go and read a book until server comes back.
 * 7) InvalidTitleError is raised in very rare cases. A possible reason is that you wanted to get a page title that contains illegal characters.

Getting the content of the page
Important: by this time we don't have any knowledge about the existence of the page. We have not contacted live wiki yet. We just created an object. It is just like a street number: you may write it on a document, but there may or may not be a house there.

There are two main approaches of getting the content. It is important to understand the difference.

Page.text
You may notice that  does not have parentheses. Looking into the code we discover that it is not a method, rather a property. This means  is ready to use without calling it, may be assigned a value, and is present upon saving the page.
 * https://doc.wikimedia.org/pywikibot/master/api_ref/pywikibot.page.html#page.BasePage.text

will write the whole text on your screen. Of course, this is for experiment.

You may write  if you need a copy of the text, but usually this is unnecessary. is not a method, so referring to it several times does not slow down your bot. Just manipulate  or assign it a new value, then save.

If you want to know details on how a property works, search for "Python decorators". For using it in your scripts it is enough to know the behaviour. Click on the above link and go through the right-hand menu. You will find some other properties without parentheses.
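The behaviour of a property can be tried on a toy class. This is only a sketch with hypothetical names (ToyPage, loads); it is not how Pywikibot implements its page classes, it just shows why a property needs no parentheses, may be assigned a value, and costs nothing on repeated access.

```python
class ToyPage:
    """A toy stand-in for a page; NOT Pywikibot's implementation."""

    def __init__(self, title: str) -> None:
        self.title = title
        self._text = None  # nothing loaded yet
        self.loads = 0     # counts the simulated server requests

    @property
    def text(self) -> str:
        """Load the text on first access, then reuse the cached value."""
        if self._text is None:
            self.loads += 1
            self._text = f'Content of {self.title}'
        return self._text

    @text.setter
    def text(self, value: str) -> None:
        """Assigning a value replaces the cached text."""
        self._text = value


page = ToyPage('Budapest')
print(page.text)   # no parentheses: text is a property
print(page.text)   # the second access uses the cached value
page.text = page.text.replace('Content', 'New content')
print(page.text, page.loads)
```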

will never raise an error. If the page is a redirect, you will get the redirect link instead of the content of the target page. If the page does not exist, you will get an empty string which is just what happens if the page does exist, but is empty (it is usual at talk pages). Try this:

is comfortable if you don't have to deal with the existence of the page, otherwise it is your responsibility to make the difference. An easy way is.

While page creation does not contact the live wiki, referring to text for the first time and  usually does. For several pages it will take a while. If it is too slow for you, go to the section. shows if it is necessary; if it returns True, the bot will not retrieve the page again. Therefore it returns  for non-existing pages as it is senseless to reload them. Although this is a public method, you are unlikely to have to use it directly.
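Since an empty text may mean several things, a small helper can make the difference explicit. The sketch below assumes a page-like object with exists(), isRedirectPage() and the text property, which Pywikibot pages offer; the function name page_state and the stand-in objects are hypothetical and serve only the demonstration.

```python
from types import SimpleNamespace


def page_state(page) -> str:
    """Tell apart the cases that page.text alone cannot distinguish.

    `page` is expected to behave like a Pywikibot page: it needs
    exists(), isRedirectPage() and the text property.
    """
    if not page.exists():
        return 'missing'
    if page.isRedirectPage():
        return 'redirect'
    if not page.text:
        return 'empty'
    return 'has content'


# Stand-in objects, so the function can be tried without contacting a wiki.
missing = SimpleNamespace(
    exists=lambda: False, isRedirectPage=lambda: False, text='')
empty = SimpleNamespace(
    exists=lambda: True, isRedirectPage=lambda: False, text='')
print(page_state(missing), page_state(empty))
```

Both stand-ins have an empty text, yet the helper reports them differently.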

Page.get
The traditional way is  which forces you to handle the errors. In this case we store the value in a variable.
 * https://doc.wikimedia.org/pywikibot/master/api_ref/pywikibot.page.html#page.BasePage.get

A non-existing page causes a NoPageError:

A redirect page causes an IsRedirectPageError:

If you don't want to handle redirects, just make the difference between existing and non-existing pages,  will make its behaviour more similar to that of  :

Here is a piece of code to handle the cases. It is already too long for prompt, so I saved it. Which results in: While  is simple, it gives only the text of the redirect page. With  we got another Page instance without parsing the text. Of course, the target page may also not exist or be another redirect. Scripts/redirect.py gives a deeper insight.
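The cases above may be handled as in the following sketch. It is not the saved script mentioned above, just a reconstruction of the idea under stated assumptions: the function name read_page is hypothetical, and the two placeholder classes exist only so that the sketch can also be run where Pywikibot is not installed.

```python
try:
    from pywikibot.exceptions import IsRedirectPageError, NoPageError
except ImportError:
    # Placeholders, used only when Pywikibot is not installed.
    class NoPageError(Exception):
        """Placeholder for pywikibot.exceptions.NoPageError."""

    class IsRedirectPageError(Exception):
        """Placeholder for pywikibot.exceptions.IsRedirectPageError."""


def read_page(page):
    """Return a (kind, value) pair for a page-like object.

    kind is 'text', 'redirect' or 'missing'. For a redirect the value
    is the target page, obtained without parsing the wikitext.
    """
    try:
        return 'text', page.get()
    except NoPageError:
        return 'missing', None
    except IsRedirectPageError:
        return 'redirect', page.getRedirectTarget()
```

Keep in mind that the returned target is just another page object, so it may need the same treatment again.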

For a practical application see.

Reloading
If your bot runs slowly and you are in doubt whether the page text is still current, use. The experiment shows that it does not update, which is good on one side, as you don't lose your data, but on the other side you need to be conscious of it.

currently does not reflect to forced reload, see T330980.

Overview
Page generators form one of the most powerful tools of Pywikibot. A page generator iterates the desired pages.

Why use page generators? A possible reason to write your own page generator is mentioned in Follow your bot section.
 * You may separate finding pages to work on and the actual processing, so the code becomes cleaner and more readable.
 * Thus you create separate namespaces which are one honking great idea.
 * They implement reusable code for typical tasks.
 * Pywikibot team writes page generators and can follow the changes of MediaWiki API, so you can write your code on a higher level.

Most page generators are available via command line arguments for end users. See https://doc.wikimedia.org/pywikibot/master/api_ref/pywikibot.pagegenerators.html for details. If you write your own script, you may use these arguments, but if they are permanent for the task, you may want to directly invoke the appropriate generator instead of handling command line arguments.

Life is too short to list them all here, but the most used generators are listed under the above link. You may also discover them in the  directory of your Pywikibot installation. They may be divided into three main groups:
 * 1) High-level generators for direct use, mostly (but not exclusively) based on MediaWiki API. Usually they have long and hard-to-remember names, but the names may always be cheated from docs or code. They are connected to command line arguments.
 * 2) Filters. They wrap around another generator (take the original generator as argument), and filter the results, for example for namespace. This means they won't run too fast...
 * 3) Low-level API-based generators may be obtained as methods of ,  ,  ,  ,   objects. Most of them are wrapped in a high-level generator function, which is the preferred use (we may say, the public interface of Pywikibot); however, nothing forbids the direct use. Sometimes they yield structures rather than page objects, but may be turned into a real page generator, as we will see in an example.

Pagegenerators package (group 1 and 2)
Looking into  directory you discover scripts whose names begin with an underscore. This means they are not for direct import, however they can be useful for discovering the features. The incorporated generators may be used as which is almost equivalent to:

To interpret this directory which appears for us as  package in code:
 * primarily holds the documentation, but there are also some basic generators in it.
 * holds the primary generators.
 * holds the wrapping filter generators.
 * is responsible for interpreting the command line arguments and choosing the appropriate generator function.

API generators (group 3)
MediaWiki offers a lot of low-level page generators, which are implemented in  class. is child of, so we may use these methods for our   instance. While the above mentioned pagelike objects have their own methods that may easily be found in the documentation of the class, they usually use an underlying method which is implemented in, and sometimes offers more features.

Usage
Generators may be used in  loops as shown above, but also may be transformed to lists: But be careful: while loops continuously process pages, the list comprehension may take a while because it has to read all the items from the generator. This statement is very fast for total=10, takes noticeable time for total=1000, and is definitely slow for total=100000. Of course, it will consume a lot of memory for big numbers, so usually it is better to use generators in a loop.
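The difference between looping over a generator and turning it into a list can be felt without any wiki. In this sketch slow_titles is a hypothetical toy generator standing in for a page generator; nothing here is Pywikibot code.

```python
from itertools import islice


def slow_titles(total: int):
    """A toy generator standing in for a page generator."""
    for number in range(1, total + 1):
        # A real page generator would ask the server for data around here.
        yield f'Page {number}'


# Lazy: the loop starts working as soon as the first item arrives,
# and the remaining items are never produced.
for title in islice(slow_titles(100000), 3):
    print(title)

# Eager: list() has to exhaust the generator before anything else happens.
titles = list(slow_titles(10))
print(len(titles), titles[-1])
```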

List Pywikibot user scripts with
You want to collect users' home made Pywikibot scripts from all over Wikipedia. Supposing that they are in user namespace and have a title with  ending, a possible solution is to create your own page generator using. Rather slow than quick. :-) This will not search in other projects. And test with: If you want a reduced version for testing, you may use This will limit the number of Wikipedias to 3, and excludes the biggest, enwiki. You may also use a  argument in.

Sample result:

Titles beginning with a given word –
While writing this section we had a question on the language/spelling village pump about the spelling of the name of József Degenfeld. A search showed that we have several existing articles about the Degenfeld family. To quickly compose a list of them, the technique from the Creating and reading lists chapter was copied: For such rapid tasks shell is very suitable.

Reading linked pages
reads from a file, while  from a wikipage. See Creating and reading lists chapter for the usage.

Pages created by a user with a site iterator
You want to list the pages created by a given user, for example yourself. How do we know that an edit was creation of a new page? The response is the  value, which is the oldid of the previous edit. If it is zero, there is no previous version, that means it was either creation or recreation after a deletion. Where do we get a parentid from? Either from a  (see Working with users chapter) or a   (see Working with page histories).

Of course, we begin with high-level page generators, just because this is the title of the chapter. We have one that is promising: Its description is: Yield unique pages edited by user:username. Now, this is good for beginning, but we have only pages, so we have to get the first revision, and see if the username equals to the desired user, which will not be the case in the vast majority, but will be very slow.

So we look into the source and notice that this function calls , which is a method of a  object, and has the description Yield tuples describing this user edits. This is promising again, but looking into it we see that the tuple does not contain, but we find an underlying method again, which is. This looks good, and has a link to API:Usercontribs, which is the fourth step of our investigation. Finally, this tells what we want to hear: yes, it has.

Technically,  is not a page generator, but we will turn it into one. It takes the username as string, iterates s, and we may create the pages from their titles. The simplest version just to show the essence: Introduction was long, solution short. :-)

But it was not only short, but also fast, because we did not create unnecessary objects, and, what is more important, did not get unnecessary data. We got dictionaries, read the  and the   from them, and created only the desired   objects – but did not retrieve the pages, which is the slow part of the work.
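The essence may be sketched like this. I assume here that the low-level iterator meant above is the usercontribs() method of the site object and that its dictionaries hold the title and parentid keys; the function names created_titles and created_pages are hypothetical. The filter is kept separate so that it can be tried on plain dictionaries.

```python
def created_titles(contribs):
    """Yield titles of contributions that have no parent revision.

    `contribs` is any iterable of dictionaries holding at least the
    'title' and 'parentid' keys.
    """
    for contrib in contribs:
        if contrib['parentid'] == 0:
            yield contrib['title']


def created_pages(site, username: str):
    """Turn the filtered titles into page objects (needs Pywikibot)."""
    import pywikibot  # imported here so the filter above works without it

    # Assumption: site.usercontribs() is the iterator discussed above.
    for title in created_titles(site.usercontribs(user=username)):
        yield pywikibot.Page(site, title)


sample = [
    {'title': 'A', 'parentid': 0},
    {'title': 'B', 'parentid': 1234},
    {'title': 'C', 'parentid': 0},
]
print(list(created_titles(sample)))
```

Note that created_pages only creates the objects; no page content is retrieved until you ask for it.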

Summary
High-level page generators are really various and flexible and are often useful when we do some common task, especially if we want the pwb wrapper to handle our command-line arguments. But for some specific tasks we have to go deeper. On the next level there are the generator methods of pagelike objects, such as,  ,   etc., while on the lowest level page generators and other iterators of the   object, which are directly based on MediaWiki API. Going deeper is possible through the documentation and the code itself.

On the other hand, iterating pages through an API iterator with the namespace given as an argument may be faster than using a high-level generator from the pagegenerators package and then filtering it with a wrapping. At least we may suppose so (no benchmark has been made).

In some rare cases this is still not enough, if some features offered by API are not implemented in Pywikibot. You may either implement them and contribute to the common code base, or make a copy of them and enhance with the missing parameter according to API documentation.

Revisions
Processing page histories may be frightening due to the amount of data, but is easy because we have plenty of methods. Some of them extract a particular information such as a user name, while others return an object called revision. A revision represents one line of a page history with all its data that are more than you see in the browser and more than you usually need. Before we look into these methods, let's have a look at revisions. We have to keep in mind that
 * some revisions may be deleted
 * otherwise the text, comment or the contributor's name may be hidden by admins so that non-admins won't see (this may cause errors to be handled)
 * oversighters may hide revisions so deeply that even admins won't see them

Furthermore, bots are not directly marked in page histories. You see in recent changes if an edit is made by bot because this property is stored in the recent changes table of the database and is available there for a few weeks. If you want to know if an edit was made by a bot, you may
 * guess it from the bot name and the comment (not yet implemented, but we will try below)
 * follow through a lot of database tables which contributor had a bot flag in the time of the edit, and consider that registered bots can switch off their flag temporarily and admins can revert changes in bot mode (good luck!)
 * retrieve from the wiki the current bot flag owners and suppose that same users were bots in the time of the edit (that's what Pywikibot does)

API:Revisions gives a deeper insight, while Manual:Slot says something about slots and roles (for most of us this is not too interesting).

Methods returning a single revision also return the content of the page, so it is a good idea to choose a short page for experiments (see Special:ShortPages on your home wiki). by default does not contain the text unless you force it.

For now I choose a page which is short, but has some page history: hu:8. évezred (8th millennium). Well, we have really little to say about it, and we suffer from lack of reliable sources. Let's see the last (current) revision!

As we look into the code, we don't get too much information about how to print it more readably, but we notice that  is a subclass of , which is described here. So we can try :

Although a revision may look like a dictionary on the screen, it is not one; a in the above loop would show that all these tuple-like pairs are real tuples.

Non-existing pages raise  if you get their revisions.

keeps the oldid of the previous edit and will be zero for a newly created page. For any revision as :

Extract data from revision
We don't have to transform a  object into a dictionary to use it. The above experiment was just for overview. Now we know what to search for, and we can directly get the required data. Better, this structure is more comfortable to use than a common dictionary, because you have two ways to get a value: As you see, they are identical. But keep in mind that both solutions may cause problems if some parts of the revision were hidden by an admin. Let's see what happens upon hiding:

You may say this is not quite consistent, but this is the experimental result. You have to handle hidden properties, but for a general code you should know whether the bot runs as admin. A possible solution: If you are not an admin but need admin rights for testing, you may get one on https://test.wikipedia.org.

For example will never raise an, but is not very useful in most cases. On the other hand, will do something if the content is not hidden for you and not empty. means here either an empty page or a hidden content. If it counts for you, will make the difference.

Example was found when an oversighter suppressed the content of the revision. and  were   both for admin and non-admin bot, and an additional   key appeared with the value    (empty string).

Is it a bot edit?
Have a look at this page history. It has a lot of bots, some of which are no longer registered or never were. Pywikibot has a  method which takes user name (not an object) and checks if it has a bot flag. This won't detect all these bots. We may improve on it by considering the user name and the comment. This method is far from certain and may have false positives as well as negatives, but – as shown in the 3rd column – gives a better result than  which is in the second column.
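Such a guess from the user name and the comment might look like the sketch below. The rules (a name ending in "bot", or the word "bot" or "robot" in the comment) are my assumptions for illustration, not the function used for the table, and the name maybe_bot is hypothetical.

```python
import re

# Assumption: a comment mentioning "bot" or "robot" as a whole word is suspicious.
BOT_WORD = re.compile(r'\b(ro)?bot\b', re.IGNORECASE)


def maybe_bot(username: str, comment: str = '') -> bool:
    """Guess from the user name and the comment whether an edit is a bot edit."""
    if username.lower().endswith('bot'):
        return True
    return BOT_WORD.search(comment) is not None


print(maybe_bot('BinBot'))                          # name ends with "bot"
print(maybe_bot('Bináris', 'Robot: fixing links'))  # comment gives it away
print(maybe_bot('Abbot'))                           # a false positive
print(maybe_bot('Bináris', 'bottom line fixed'))    # no whole word "bot"
```

The third call shows why such a heuristic may only complement the bot flag check, never replace it.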

hu:Kategória:Bottal létrehozott olasz vasútállomás cikkek contains articles created by bot. Here is a page generator that yields pages which were possibly never edited by any human: Test with:

Timestamp
is a  object which is well documented here. It is subclass of ]. Most importantly, MediaWiki always stores times in UTC, regardless of your time zone and daylight saving time.

The documentation suggests to use  for the current time; it is also a   in UTC.

Elapsed time since last edit: Pretty much, isn't it? :-) The result is a  object.

In the shell timestamps are human-readable. But when you print them from a script, they get a machine-readable format. If you want to restore the easily readable format, use the  function: For the above subtraction  is nicer, because   gives days and seconds, without converting them to hours and minutes.
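The arithmetic can be tried with plain datetime objects, which behave the same way for subtraction. The two dates below are made up for the example; with a real page you would use the timestamp of a revision and the current UTC time instead.

```python
from datetime import datetime, timedelta

# Made-up stand-ins for the timestamp of the last edit and the current time.
last_edit = datetime(2023, 1, 1, 12, 0, 0)
now = datetime(2023, 3, 5, 15, 30, 10)

elapsed = now - last_edit   # the result is a timedelta object
print(repr(elapsed))        # days and seconds only
print(str(elapsed))         # also converted to hours and minutes
```

The first print gives `datetime.timedelta(days=63, seconds=12610)`, while the second gives `63 days, 3:30:10`.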

Useful methods
Methods discussed in this section belong to  class with one exception, so may be used for practically any page.

Page history in general

 * will create a wikitable from the page history. The order may be reversed and number of rows may be limited. Useful e.g. when the page history gets unavailable and you want to save it to a talk page.
 * returns a small statistics: contributors with the number of their edits in the form of a dictionary, sorted by the decreasing number. Hidden names appear as empty string both for admin and non-admin bots.
 * will iterate through the revisions of a page beginning from the latest. As detailed above, this differs from one revision in that by default it does not retrieve the content of the revision. To get a certain revision turn the iterator into list.
 * use:
 * for beginning from the oldest version
 * for retrieving the page contents
 * for limiting the iteration to 5 entries
 * with a  to limit the iteration in time

For example to get a difflink between the first and the second version of a page without knowing its oldid (works for every language version)
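A possible way to compose such a difflink is sketched below. The wikilink form with Special:Diff is my choice for the illustration and not necessarily the form used originally; the function names and the stand-in page are hypothetical. I assume that the revisions iterator of the page accepts the reverse and total arguments and that revisions have a revid.

```python
from types import SimpleNamespace


def difflink(older, newer) -> str:
    """Return a wikilink to the diff between two revisions.

    Both arguments need a `revid` attribute. Special:Diff is a MediaWiki
    special page that is available under its English name on every wiki.
    """
    return f'[[Special:Diff/{older.revid}/{newer.revid}]]'


def first_difflink(page) -> str:
    """Difflink between the two oldest revisions of a page-like object."""
    first, second = page.revisions(reverse=True, total=2)
    return difflink(first, second)


# A stand-in page, so the functions can be tried without contacting a wiki.
fake_page = SimpleNamespace(
    revisions=lambda reverse=False, total=None: iter(
        [SimpleNamespace(revid=101), SimpleNamespace(revid=205)]))
print(first_difflink(fake_page))
```

A page with only one revision will raise a ValueError at the unpacking, which has to be handled in real use.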

And this one is a working piece of code from hu:User:DhanakBot/szubcsonk.py. This bot administers substubs. Before we mark a substub for deletion, we wonder if it has been vandalized. Maybe it was a longer article, but someone has truncated it, and a good faith user marked it as substub not regarding the page history. So the bot places a warning if the page was 1000 bytes longer or twice as long at any point of its history as now.
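The condition itself may be expressed as in this sketch. It is not the code of the bot mentioned above, only my reading of the rule; the function name was_truncated is hypothetical. The past sizes could come from the size values of the revisions, assuming the revisions of your wiki carry them.

```python
def was_truncated(current_size: int, past_sizes) -> bool:
    """Return True if the page was ever much longer than it is now.

    "Much longer" means at least 1000 bytes longer or at least
    twice as long as the current size.
    """
    return any(
        size > current_size
        and (size - current_size >= 1000 or size >= 2 * current_size)
        for size in past_sizes
    )


print(was_truncated(300, [250, 1400, 300]))  # once 1100 bytes longer
print(was_truncated(300, [250, 650]))        # once more than twice as long
print(was_truncated(300, [250, 500]))        # never much longer
```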
 * may also be interesting; this is the underlying method that is called by, but it has some extra features. You may specify the user whose contributions you want or don't want to have.

Last version of the page
Last version got a special attention from developers and is very comfortable to use. For example to get a difflink between the newest and the second newest version of a page without knowing its oldid (works for every language version)
 * (property): returns the current revision for this page. It's a Revision object as detailed above.
 * But some properties are available directly (they are equivalent to retrieve values from ):


 * (property): returns oldid of the current (latest) version.
 * : returns name or IP address of last user to edit page.
 * : returns name or IP address of last human/non-bot user to edit page. This is explained above. Our  function is an alternative possibility.
 * : returns True if last editor was unregistered.

Oldest version of a page

 * (property) is very similar to, but returns the first version rather than last.

Determine how many times longer the current version is than the first version (beware of division by zero which is unlikely but possible):
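A cautious version of this calculation is sketched here. The function name growth_ratio is hypothetical, and I assume the two sizes are taken from the size values of the oldest and the latest revision of the page.

```python
def growth_ratio(first_size: int, current_size: int):
    """Return current_size / first_size, or None if the first version was empty."""
    if first_size == 0:
        return None  # avoid division by zero
    return current_size / first_size


print(growth_ratio(200, 1000))
print(growth_ratio(0, 1000))
```

Returning None for an empty first version is only one possible choice; raising an error or returning infinity would also do, depending on what you do with the result.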

Determine which user how many articles has created in a given category, not including its subcategories: (Use  if you are interested in subcategories, too, but that will be slightly slower.)
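The counting may be sketched with a Counter. The function name count_creators and the stand-in pages are hypothetical; I assume each page has an oldest_revision with a user value, as described above, and that the pages come from the articles of a category.

```python
from collections import Counter
from types import SimpleNamespace


def count_creators(pages) -> Counter:
    """Count how many of the given pages each user created.

    Each page needs an `oldest_revision` with a `user` attribute.
    """
    return Counter(page.oldest_revision.user for page in pages)


# Stand-in pages, so the function can be tried without contacting a wiki.
demo = [
    SimpleNamespace(oldest_revision=SimpleNamespace(user=name))
    for name in ('A', 'B', 'A')
]
print(count_creators(demo).most_common())
```

For a real category the argument would be the iterator of its articles, and most_common() gives the users sorted by the decreasing number of created pages.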

Knowing that Main Page is a valid alias for the main page in every language and family, sort several Wikipedias by creation date: For some reason it gives a false date for dewiki where Main Page is redirected to  namespace, but looks nice anyway. :-)

You want to know when the original creator last edited in your wiki. In some cases it is a question whether it's worth contacting him/her. The result is a timestamp as described above, so you can subtract it from the current date to get the elapsed time. See also Working with users section.

Other

 * This will return a permalink to the given version. The oldid argument may be got as the revid value of that revision. If you omit it, the latest id will automatically be assigned. For getting permalinks for all versions of the page:

However, besides this URL-format during the years MediaWiki invented a nicer form for inner use. You may use in any language This will result in such permalinks that you can use on your wiki:.


 * takes an oldid and returns the text of that version (not a  object!). May be useful if you know the version id from somewhere.

Deleted revisions
When a page is deleted and recreated, it will get a new id. Thus the only way of mining in the deleted revisions is to identify the page by the title. On the other hand, when a page is moved (renamed), it takes the old id to the new title and a new redirect page is created with a new id and the old title. Taking everything into account, investigation may be complicated as deleted versions may be under the old title and deleted versions under the same title may belong to another page. It may be easier without bot if it is about one page. Now we take a simple case where the page was never renamed.


 * does not need admin rights, and simply says a yes or no to the question if the page has any deleted revisions. Don't ask me for a use case.

The following methods need admin rights, otherwise they will raise. The main use case is to get timestamps for.
 * iterates through the timestamps of deleted revisions and yields them. Meanwhile it caches other data in a private variable for later use. Iterators may be processed with a  loop or transformed into lists. For example to see the number of deleted revisions:

This method takes a timestamp, which is most easily got from the above, and returns a dictionary. Don't be misled by the name; this is not a   object. Its keys are: dict_keys(['revid', 'user', 'timestamp', 'slots', 'comment']) Theoretically a  argument should return the text of the revision (otherwise text is returned only if it had previously been retrieved). Currently the documentation does not exactly describe what happens, see T331422. Instead, revision text may be got (with an example timestamp) as

The underlying method for both of the above is  which makes it possible to get the deleted revisions of several pages together, and to get only, or rather exclude, the revisions by a given user.
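To put the pieces together, here is a hedged sketch that lists who deleted-revision edits belong to. It assumes the page object has the `loadDeletedRevisions()` iterator and the `getDeletedRevision(timestamp)` method, and relies only on the `user` and `comment` keys listed above. The function name is made up, and on a real wiki this needs admin rights.

```python
def deleted_revision_overview(page):
    """Return (timestamp, user, comment) for each deleted revision."""
    # Exhaust the iterator first, so the count is known before details are read.
    timestamps = list(page.loadDeletedRevisions())
    overview = []
    for timestamp in timestamps:
        rev = page.getDeletedRevision(timestamp)
        overview.append((timestamp, rev['user'], rev['comment']))
    return overview


# Usage with a live wiki and an admin account (not executed here):
#   rows = deleted_revision_overview(page)
#   print(len(rows), 'deleted revisions')
```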

File pages
is a subclass of, so you can use all the above methods, but it has some special methods. Keep in mind that a  represents a file description page in the   namespace. Files (images, sounds) themselves are in the  pseudo namespace.
 * https://doc.wikimedia.org/pywikibot/master/api_ref/pywikibot.page.html#page.FilePage.get_file_history
 * https://doc.wikimedia.org/pywikibot/master/api_ref/pywikibot.page.html#page.FilePage.get_file_historyversion
 * https://doc.wikimedia.org/pywikibot/master/api_ref/pywikibot.page.html#page.FilePage.getFileVersionHistoryTable
 * https://doc.wikimedia.org/pywikibot/master/api_ref/pywikibot.page.html#page.FilePage.oldest_file_info
 * https://doc.wikimedia.org/pywikibot/master/api_ref/pywikibot.page.html#page.FilePage.latest_file_info





Walking the namespaces
Does your wiki have an article about Wikidata? A Category:Wikidata? A template named Wikidata or a Lua module? This piece of code answers the question: In the Page.text section we got to know the properties that work without parentheses;  is of the same kind. Creating a page object with the namespace number is familiar from the Creating a Page object from title section. This loop goes over the namespaces available in your wiki, beginning from the negative indices marking pseudo namespaces such as Media and Special.
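The loop described above can be sketched like this. The function name `titles_found` and the `make_page` parameter are made up for the example; with Pywikibot you would pass `pywikibot.Page`, which takes the site, the title and the namespace number.

```python
def titles_found(site, name, make_page):
    """Return the titles of existing pages called `name` in any namespace."""
    found = []
    for ns in site.namespaces:  # yields the numerical namespace indices
        page = make_page(site, name, ns)
        if page.exists():
            found.append(page.title())
    return found


# Usage with a live wiki (not executed here):
#   import pywikibot
#   print(titles_found(pywikibot.Site(), 'Wikidata', pywikibot.Page))
```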

Our documentation says that  is a kind of dictionary, however, this is hard to discover. The above loop went over the numerical indices of namespaces. We may use them to discover the values by means of a little trick. Only the Python prompt shows that they have a different string representation when used with  or without it –   may not be omitted in a script. This loop: will write the namespace indices and the canonical names. The latter are usually English names, but for example the #100 namespace in Hungarian Wikipedia has a Hungarian canonical name because the English Wikipedia does not have the counterpart any more. Namespaces may vary from wiki to wiki; for Wikipedia, WMF sysadmins set them in the config files, but for your private wiki you may set them yourself following the documentation. If you run the above loop, you may notice that File and Category appear as  and , respectively, so this code gives you a name that is ready to link, and will appear as a normal link rather than displaying an image or inserting the page into a category.

Now we have to dig into the code to see that namespace objects have absolutely different  and   methods inside. While  writes , the object name at the prompt without   uses the   method. We have to use it explicitly:

The first few lines of the result from Hungarian Wikipedia are:
 Namespace(id=-2, custom_name='Média', canonical_name='Media', aliases=[], case='first-letter', content=False, nonincludable=False, subpages=False)
 Namespace(id=-1, custom_name='Speciális', canonical_name='Special', aliases=[], case='first-letter', content=False, nonincludable=False, subpages=False)
 Namespace(id=0, custom_name='', canonical_name='', aliases=[], case='first-letter', content=True, nonincludable=False, subpages=False)
 Namespace(id=1, custom_name='Vita', canonical_name='Talk', aliases=[], case='first-letter', content=False, nonincludable=False, subpages=True)
 Namespace(id=2, custom_name='Szerkesztő', canonical_name='User', aliases=[], case='first-letter', content=False, nonincludable=False, subpages=True)
 Namespace(id=3, custom_name='Szerkesztővita', canonical_name='User talk', aliases=['User vita'], case='first-letter', content=False, nonincludable=False, subpages=True)
 Namespace(id=4, custom_name='Wikipédia', canonical_name='Project', aliases=['WP', 'Wikipedia'], case='first-letter', content=False, nonincludable=False, subpages=True)
So custom names mean the localized names in your language, while aliases are usually abbreviations such as WP for Wikipedia or old names for backward compatibility. Now we know what we are looking for. But how to get it properly? The top-level documentation suggests using  to get the local names. But if we try to give the canonical name to the function, it will raise errors at the main namespace. After some experiments the following form works: This will write the indices, canonical (English) names and localized names side by side. There is another way that gives a nicer result, but we have to guess it from the code of namespace objects. This keeps the colons:
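The side-by-side listing described above can also be sketched from the attribute names visible in the printed result (`canonical_name` and `custom_name`). This is not necessarily the form the text refers to. The function name `namespace_table` is made up, and it assumes the namespaces container can be indexed by the numerical indices it yields.

```python
def namespace_table(namespaces):
    """Return (index, canonical name, local name) rows for each namespace."""
    rows = []
    for index in namespaces:
        ns = namespaces[index]
        rows.append((index, ns.canonical_name, ns.custom_name))
    return rows


# Usage with a live wiki (not executed here):
#   import pywikibot
#   for row in namespace_table(pywikibot.Site().namespaces):
#       print(*row)
```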

Determine the namespace of a page
Leaning on the above results we can determine the namespace of a page in any form. We investigate an article and a user talk page. Although the namespace object we get is said to be a dictionary in the documentation, it is quite unique, and its behaviour and even the apparent type depend on what we ask. It can be equal to an integer and to several strings at the same time. The reason for this strange personality is that the default methods are overridden. If you want to deeply understand what happens here, open  (the underscore marks that it is not intended for public use) and look for the ,   and   methods. The starred command will give  on any non-Hungarian site, but the value will be    again, if you write user talk in your language.

Any of the values may be got with the dotted syntax as shown in the last three lines.

It is common to get unknown pages from a page generator or a similar iterator, and it may be important to know what kind of page we got. In this example we walk through all the pages that link to an article. will show the canonical (mostly English) names,  the local names and   (without parentheses!) the numerical index of the namespace. To improve the example, we assume that main, file, template, category and portal are in the direct scope of readers, while the others are only important for Wikipedians, and we mark this difference.
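A sketch of marking this difference. It assumes each page has a `namespace()` method returning an object with `id` and `canonical_name`, and a `title()` method. The function name, the markers and the set of namespace numbers (0, 6, 10, 14 and 100 for portal) are my assumptions, so check them against your wiki.

```python
# Assumed ids: main, file, template, category, portal.
READER_NAMESPACES = frozenset({0, 6, 10, 14, 100})


def mark_audience(pages, reader_namespaces=READER_NAMESPACES):
    """Yield (title, canonical namespace name, marker) for each page."""
    for page in pages:
        ns = page.namespace()
        marker = 'reader' if ns.id in reader_namespaces else 'editor'
        yield page.title(), ns.canonical_name, marker


# Usage with a live wiki (not executed here):
#   for row in mark_audience(page.backlinks()):
#       print(*row)
```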

Content pages and talk pages
Let's have a rest with a much easier exercise! Another frequent task is to switch from a content page to its talk page or vice versa. We have a method to toggle and another to decide if it is in a talk namespace: Note that pseudo namespaces (such as  and , with negative index) cannot be toggled. will always return a  object except when the original page is in these namespaces. In this case it will return. So if there is any chance that your page may be in a pseudo namespace, be prepared to handle errors.
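A small sketch of handling the pseudo namespace case without surprises. It assumes the two methods meant above are `toggleTalkPage()` and `isTalkPage()`, and that toggling a page in a pseudo namespace gives `None`. The helper name is made up.

```python
def content_and_talk(page):
    """Return (content page, talk page), or None for pseudo namespaces."""
    other = page.toggleTalkPage()
    if other is None:
        # Media and Special pages have no talk page to pair with.
        return None
    return (other, page) if page.isTalkPage() else (page, other)
```

This way the caller always gets the pair in the same order, whichever side it started from.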

The next example shows how to work with content and talk pages together. Many wikis place a template on the talk pages of living persons' biographies. This template collects the talk pages into a category. We wonder if there are talk pages without the content page having a "Category:Living persons". These pages need attention from users. The first experience is that separate listing of blue pages (articles), green pages (redirects) and red (missing) pages is useful as they need a different approach.

We walk the category (see ), get the articles and search them for the living persons' categories by means of a regex (it is specific to Hungarian Wikipedia, not important here). As the purpose is to separate pages by colour, we decide to use the old approach of getting the content (see ). Note that running into errors on purpose and preferring them to s is part of the philosophy of Python.

Working with categories


Task: hu:Kategória:A Naprendszer kisbolygói (minor planets of the Solar System) has several subcategories (one level is enough) with redundantly categorized articles. Let's remove the parent category from the articles in the subcategories, except for stubs! Chances are that it could be solved by category.py after reading the documentation carefully, but for me this time it was faster to hack:

Creating and reading lists
Creating a list of pages is a frequent task. For example:
 * 1) You collect titles to work on because collecting is slow and can be done while you are sleeping.
 * 2) You want to review the list and make further discussions before you begin the task with your bot.
 * 3) You want to know the extent of a problem before you begin to write a bot for it.
 * 4) Listing is the purpose itself. It may be a maintenance list that requires attention from human users. It may be a community task list etc.
 * 5) Someone asked you to create a list on which he or she wants to work.

A list may be saved to a file or to a wikipage. listpages.py does something like this, but the input is restricted to builtin page generators and the output has a lot of options. If you write your own script, you may want a simple solution in place. Suppose that you have any iterable (list, tuple or generator) called  that contains your collection.

Something like this: will give an appropriate list that is suitable both for wikipage and file. It looks like this:
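A minimal sketch of building such a list, assuming the items are page objects whose `title(as_link=True)` gives the linked form of the title. The function name `as_wikilist` is made up.

```python
def as_wikilist(pages):
    """Return one bulleted, linked title per line."""
    return '\n'.join('* ' + page.title(as_link=True) for page in pages)


# Usage with a live wiki (not executed here):
#   text = as_wikilist(my_pages)  # my_pages is your own collection
#   print(text)
```

The same string can go into a file or into the text of a wikipage.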

On Windows sometimes you get a  when you try to save page names containing non-ASCII characters. In this case  will help: Of course, imports should be on top of your script, this is just a sample. While a file does not require the linked form, it is useful to keep them in the same form so that a list can be copied from a file to a wikipage at any time.

To retrieve your list from page Special:MyPage/Mylist use:

If you want to read the pages from the file to a list, do: If you are not familiar with regular expressions, just copy, it will work. :-)
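For a rough idea of the round trip, here is a sketch using only the standard library. The file handling and the regex are my own choices, not necessarily the ones meant above. Passing `encoding='utf-8'` explicitly avoids depending on the Windows default encoding.

```python
import re
from pathlib import Path

# Captures the title inside [[...]]; a leading colon (as in [[:Category:X]])
# is skipped. Piped links are not handled, the saved lists do not contain them.
LINK = re.compile(r'\[\[:?(.*?)\]\]')


def save_list(path, wikitext):
    """Write the list to a UTF-8 text file."""
    Path(path).write_text(wikitext, encoding='utf-8')


def read_titles(path):
    """Return the titles found in the linked list saved in the file."""
    return LINK.findall(Path(path).read_text(encoding='utf-8'))
```

The titles returned by `read_titles` are plain strings, so they can be turned into page objects again when needed.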

Where to save the files?
While introducing the  directory is a great idea to separate your own scripts, when you use pwb.py your prompt is in the Pywikibot root directory. Once this structure is created so nicely, you may not want to mix your files into Pywikibot system files. Saving them to  requires giving the path every time, and is an unwanted mix again, because that directory holds scripts rather than data.

A possible solution is to create a directory directly under Pywikibot root such as, which is short for "texts", is one letter long and very unlikely to appear at any time as a Pywikibot system directory:

Now instead of  you may use   (use   both on Linux and on Windows!) when you save and open files. This is not a big pain, and your saved data will be in a separate directory.

Working with your watchlist
We have a watchlist.py among the scripts which deals with the watchlist of the bot. This does not sound too exciting. But if you have several thousand pages on your watchlist, handling it by bot may sound good. To do this you have to run the bot with your own username rather than that of the bot. Either you overwrite it in user-config.py or save the commands in a script and run python pwb.py -user:MyUserAccount myscript

Loading your watchlist
The first tool we need is. This is a page generator, so you may process pages with a for loop or change it to a list. When you edit your watchlist on Wikipedia, talk pages will not appear. This is not the case for ! It will double the number by listing content pages as well as talk pages. You may want to cut it.

Print the number of watched pages: This may take a while as it goes over all the pages. For me it is 18235. Hmm, it's odd, in both meanings. :-) How can a doubled number be odd? This reveals a technical error: decades ago I watched  which was a redirect to a project page, but technically a page in article  namespace back then, having the talk page  . Meanwhile   was turned into an alias for Wikipedia namespace, getting a new talk page, and the old one remained there stuck and abandoned, causing this oddity.

If you don't have such a problem, then will do both tasks: turn the generator into a real list and throw away talk pages, as dealing with them separately is senseless. Note that  takes a   argument if you want only the first , but we want to process the whole watchlist now. You should get half of the previous number if everything goes well. For me it raises an error because of the above stuck talk page. (See T331005.) I show it because it is rare and interesting: pywikibot.exceptions.InvalidTitleError: The (non-)talk page of 'Vita:WP:AZ' is a valid title in another namespace. So I have to write a loop for the same purpose: Well, the previous one looked nicer. For the number I get 9109 which is not exactly  to the previous one, but I won't bother with it for now.
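Such a loop might look like the sketch below. It assumes the watchlist items offer `isTalkPage()`, and it never toggles pages, so a stuck talk page like the one above cannot raise an error. The function name is made up.

```python
def without_talk_pages(pages):
    """Return a list of the pages that are not talk pages."""
    kept = []
    for page in pages:
        if page.isTalkPage():
            continue  # skip talk pages instead of toggling them
        kept.append(page)
    return kept


# Usage with a live wiki (not executed here):
#   watchlist = without_talk_pages(site.watched_pages())
#   print(len(watchlist))
```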

A basic statistics
In one way or another, we have at last a list with our watched pages. The first task is to create some statistics. I wonder how many pages are on my list by namespace. I wish I had the data in sqlite, but I don't. So a possible solution:

There is another way if we borrow the technique from  (discussed in the Useful methods section). We just generate the namespace numbers and create a collections.Counter object from them: This is a subclass of dict, so it may be used as a dict. The difference compared to the previous solution is that the printed form of a  lists the items in decreasing order of count.
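A sketch of the Counter approach, assuming each page has a `namespace()` method whose result carries the numerical index as `id`. The function name is made up.

```python
from collections import Counter


def namespace_counts(pages):
    """Count the pages by the numerical index of their namespace."""
    return Counter(page.namespace().id for page in pages)


# Usage with a live wiki (not executed here):
#   stats = namespace_counts(watchlist)
#   for ns, number in stats.most_common():
#       print(ns, number)
```

`most_common()` gives the pairs in decreasing order of count, which is handy when you loop over the result rather than just print it.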

Selecting anon pages and unwatch according to a pattern
The above statistics show that almost half of my watchlist consists of user pages because I patrol recent changes, welcome people and warn them if necessary. And it is often necessary. Now I focus on anons: I could use the own method of a User instance to determine if they are anons without importing, but for that I would have to convert pages to Users: Anyway,  shows that there are over 2000 of them.

IPv4 addresses starting with  belong to schools in Hungary. Earlier most of them were static, but nowadays they are dynamic and may belong to another school every time, so there is no point in keeping them. For unwatching I will use : With  in the last line I will also see a   for each successful unwatching. Without it, it only unwatches. This loop will be slower than the previous one. On a repeated run it will write these s again because the watchlist is cached. To avoid this and refresh the watchlist use  which will always reload it.
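The selection step can be tried without any wiki. This sketch uses the standard `ipaddress` module in place of the Pywikibot tools, and the prefix `192.0.2.` is only a documentation range standing in for the real one. The function names are made up.

```python
import ipaddress


def is_ipv4(name):
    """Tell whether a user name is a valid IPv4 address."""
    try:
        ipaddress.IPv4Address(name)
    except ValueError:
        return False
    return True


def anons_with_prefix(usernames, prefix):
    """Return the IPv4 user names that start with the given prefix."""
    return [name for name in usernames
            if is_ipv4(name) and name.startswith(prefix)]


# Hypothetical sample; real names would come from the titles of user pages.
print(anons_with_prefix(['192.0.2.7', 'Example', '198.51.100.1'], '192.0.2.'))
```

The user pages selected this way are the ones to unwatch in the loop.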

Watching and unwatching a list of pages
So far we have dealt with pages one by one with  which is a method of the   object. But if we look into the code, we may discover that this method uses a method of : Even more exciting, this method can handle complete lists at once, and even better, the list items may be strings – this means you don't have to create Page objects of them, just provide titles. Furthermore, it supports other sequence types like a generator function, so page generators may be used directly.
 * https://doc.wikimedia.org/pywikibot/master/api_ref/pywikibot.site.html#pywikibot.site._apisite.APISite.watch

To watch a lot of pages if you have the titles, just do this: To unwatch a lot of pages if you already have Page objects: For use of page generators see the second example under.

Further ideas for existing pages
With Pywikibot you may watch or unwatch any quantity of pages easily if you can create a list or generator for them. Let your brain storm! Some patterns:
 * Pages in a category
 * Subpages of a page
 * Pages following a title pattern
 * Pages got from logs
 * Pages created by a user
 * Pages from a list page
 * Pages often visited by a returning vandal whose known socks are in a category
 * Pages based on Wikidata queries

API:Watch shows that the MediaWiki API may have further parameters such as expiry and builtin page generators. At the time of writing this article Pywikibot does not support them yet. Please hold on.

Red pages
Non-existing pages differ from existing ones in that we have to know the exact titles in advance to watch them.

Watch the yearly death articles in English Wikipedia for next decade so that you see when they are created:
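A sketch of generating such titles. The pattern `Deaths in {year}` and the starting year are hypothetical, so look up the real title pattern on the wiki first. The function name is made up.

```python
def yearly_titles(pattern, first_year, years=10):
    """Return one title per year, built from a pattern with a {year} field."""
    return [pattern.format(year=year)
            for year in range(first_year, first_year + years)]


# Hypothetical pattern and starting year.
titles = yearly_titles('Deaths in {year}', 2030)
print(titles[0], '...', titles[-1])
```

As the titles are plain strings, the whole list can be watched at once with the site method discussed in the previous section.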

hu:Wikipédia:Érdekességek has "Did you know?" subpages by the two hundreds. It has other subpages as well, and you want to watch all these tables until 10000, half of which are blue and half red. So follow the name pattern:

While English Wikipedia tends to list existing articles, in other Wikipedias list articles are to show all the relevant titles either blue or red. So the example is from Hungarian Wikipedia. Let's suppose you are interested in history of Umayyad rulers. hu:Omajjád uralkodók listája lists them but the articles of Córdoba branch are not yet written. You want to watch all of them and know when a new article is created. You notice that portals are linked from the page, but you want to watch only the articles, so you use a wrapper generator to filter the links.

List of ancient Greek rulers differs from the previous: many year numbers are linked which are not to be watched. You exclude them by title pattern. Or just to watch the red pages in the list:

In the first two examples we used standalone pages in a loop, then a page generator, then lists. They all work.

Summary

 * For walking your watchlist use  generator function. Don't forget to use the   global parameter if the   contains your bot's name.
 * For watching and unwatching a single page use  and.
 * For watching and unwatching several pages at once, given as a list of titles, a list of page objects or a page generator, use.

Working with logs

 * https://doc.wikimedia.org/pywikibot/master/api_ref/pywikibot.site.html#pywikibot.site._generators.GeneratorsMixin.logevents

Using predefined bots as parent classes
See https://doc.wikimedia.org/pywikibot/master/library_usage.html.



Working with textlib
Example: https://www.mediawiki.org/wiki/Manual:Pywikibot/Cookbook/Creating_pages_based_on_a_pattern

Creating pages based on a pattern
Pywikibot is your friend when you want to create a lot of pages that follow some pattern. In the first task we create more than 250 pages in a loop. Then we go on to categories. We prepare a lot of them, but create only as many in one run that we want to fill with articles, in order to avoid a lot of empty categories.

Rules of orthography
Rules of Hungarian orthography have 300 points, several of which have a lot of subpoints marked with letters. There is no letter a without b, and the last letter is l. We have templates pointing to these in an external source. Templates cannot be used in an edit summary, but internal links can, so we create a lot of pages with short internal links that hold these templates. Of course, the bigger part is bot work, but first we have to list the letters. Each letter from b to l gets a list with the numbers of points of which this is the last letter (lines 5–12). For example, 120 is in the list of e, so we create the pages for the 120, 120 a) ... 120 e) points. The idea is to build a title generator (from line 14). (It also could be a page generator, but title was more comfortable.)
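The title generator idea can be sketched independently of the original script (so the line numbers mentioned above do not refer to this block). The function name, the shape of the input dictionary and the exact title format are my assumptions.

```python
from string import ascii_lowercase


def point_titles(last_letters, total):
    """Yield 'N' and then 'N a)', 'N b)', ... up to the last letter of N.

    last_letters maps a letter to the numbers of the points
    whose last subpoint is marked with that letter.
    """
    last_of = {number: letter
               for letter, numbers in last_letters.items()
               for number in numbers}
    for number in range(1, total + 1):
        yield str(number)
        if number in last_of:
            stop = ascii_lowercase.index(last_of[number]) + 1
            for letter in ascii_lowercase[:stop]:
                yield f'{number} {letter})'


# Hypothetical data: point 2 ends with b, point 4 with c.
print(list(point_titles({'b': [2], 'c': [4]}, 4)))
```

Points that appear in no list get only their bare number, which matches the rule that there is no letter a without b.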

The result is at hu:Wikipédia:AKH. marks the actual subpage which has, while   with   the main page. As we get the titles from the iterator (line 41–), we create the text with the appropriate template and a standard part, we create the page, and add its link to the text of the main page. At the end we save the main page (line 63–).

Categories of notable pupils and teachers
We want to create categories for famous pupils and teachers of Budapest schools based on a pattern. Of course, this is not relevant for each school; first we want to see which articles have a "famous pupils" and a "famous teachers" section, which may occur in several forms, so the best thing is to review them by eye. We also check if the section contains enough notable people to have a category.

In this task we don't bother creating Wikidata items; these categories are huwiki-specific, and creating items in Wikidata by bot needs an approval.


 * Step 1 – list the section titles of the schools onto a personal sandbox page
 * We use the  function from   to get the titles. This returns a NamedTuple in which   holds the sections as   tuples, from which element   is the desired title with its   signs.
 * Note that  is not a method of a class, just a function, thus it is not aware of , and must explicitly get it.
 * The result is here.
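If you just want to experiment with section titles without a site object, a plain regex can stand in for the Pywikibot function used in Step 1. This is a rough substitute, not the method of the step above: it knows nothing about comments or nowiki parts. The names are made up.

```python
import re

# A heading line: the same run of = signs on both sides of the title.
HEADING = re.compile(r'^(=+)[ \t]*(.+?)[ \t]*\1[ \t]*$', re.MULTILINE)


def section_headings(text):
    """Return the heading lines of a wikitext together with their = signs."""
    return [match.group(0).strip() for match in HEADING.finditer(text)]


sample = 'Intro\n== History ==\ntext\n=== Famous pupils ===\n* A\n==Teachers==\n'
print(section_headings(sample))
```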


 * Step 2 – manual work
 * We go through the schools, remove the unwanted ones and the subtitles, and mark with  after the title if we want to create categories both for pupils and teachers, and with , if only for pupils. There could also be a  , but there isn't. This is the result.
 * We don't want to create a few dozens of empty categories at once because the community may not like it. Rather, we mark the schools we want to work on soon with the beginning of the desired category name and the sortkey, as shown here, and the bot will create the categories if the name is present and the category does not exist yet.


 * If you don't like the syntax used here, never mind, it's up to you. This is just an example; you can create and parse any syntax with any delimiters.


 * Step 3 – creating the categories
 * We read the patterns from the page with a regex, parse them, and create the name and content of the category page (including sortkey within the parent category).
 * The script creates a common category, then one for the pupils, and then another for the teachers only if necessary.
 * Next time we can add new names to schools on which we want to work that day; the existing categories will not be changed or recreated.