User:Bináris/Pywikibot cookbook

__NEWSECTIONLINK__

Pywikibot is the ninth wonder of the world, the eighth being MediaWiki itself.

This page is for you, if you
 * already have some experience with Pywikibot and have some vision about its capabilities
 * have some basic knowledge of Python and object-oriented programming
 * want to hack your own scripts
 * are already familiar with Manual:Pywikibot and especially Manual:Pywikibot/Create your own script.

Pywikibot is very flexible and powerful tool to edit Wikipedia or another MediaWiki instance. However, there comes the moment when you feel that something is missing from it, and the Universe calls you to write your own scripts. Don't be afraid, this is not a disease, this is the natural way of personal evolution. Pywikibot is waiting for you: you will find the  directory with a bare , which is ready to host your scripts.

(A personal confession from the creator of this page: I just wanted to use Pywikipedia, as we called it in the old times, then I wanted to slightly modify some of the scripts to better fit to my needs, then I went to the book store and bought my first Python book. So it goes.)

Creating a script

 * Encoding and environment: It is vital that all Python 3 source files MUST be UTF-8 without a BOM. Therefore it is a good idea to forget the bare Notepad of Windows forever, because it has the habit to soil files with BOM. The minimal suggested editor is Notepad++, which is developed for programming purposes and is cross-platform. It has an Encoding menu where you see what I am speaking about, and you may set UTF-8 without BOM as default encoding. Any real programming IDE will do the job properly, e.g. Visual Studio Code is quite popular nowadays. Python has an integrated editor called IDLE, which uses proper encoding, but for some mysterious reason does not show line numbers, so you will suffer a lot from error messages, when you keep trying to find the 148th line of your code.


 * Where to put:  directory is designed to host your scripts. This is a great idea, because this directory will be untouched when you update Pywikibot, and you can easily backup your own work, regarding just this directory.
 * You may also create your own directory structure. If you would like to use other than the default, search for  in , and you will see the solution.
 * See also https://doc.wikimedia.org/pywikibot/master/utilities/scripts_ref.html#module-pywikibot.scripts.wrapper.

Running a script
You have basically to ways. The recommended one is to call your script through. Your prompt should be in Pywikibot root directory where  is and use: However, if you don't need these features, especially if you don't use global options and don't want  to handle command line arguments, you are free to run the script directly from   directory.
 * Global options: https://doc.wikimedia.org/pywikibot/master/global_options.html#global-options
 * Advantages of this method: https://doc.wikimedia.org/pywikibot/master/scripts_ref/scripts.html#module-scripts

Coding style
Of course, we have PEP 8, Manual:Coding conventions/Python and Manual:Pywikibot/Development/Guidelines. But sometimes we feel like just hacking a small piece of code for ourselves and not bothering the style.

Several times a small piece of temporary code begins to grow beyond our initial expectations, and we have to clean it.

If you'll take my advice, do what you want, but my experience is that it is always worth to code for myself as if I coded for the world.

On the other side, when you use Pywikibot interactively (see below), it is normal to be lazy and use abbreviations and aliases. For example Note that the  alias cannot be used in the second import. It will be useful later, e.g. for.

Beginning and ending
In most cases you see something like this in the very first line of Pywkibot scripts:

or

This is a shebang. If you use a Unix-like system, you know what it is for. If you run your scripts on Windows, you may just omit this line, it does not do anything. But it can be a good idea to use anyway in order someday others want to use your script.

The very last two lines of the scripts also follow a pattern. They usually look like this: This is a good practice in Python. When you run the script directly from command line (that's what we call directory mode), the condition will be true, and the  function will be called. That's where you handle arguments and start the process. On the other side, if you import the script (that is the library mode), the condition evaluates to false, and nothing happens (just the lines on the main level of your script will be executed). Thus you may directly call the function or method you need.

Documentation and help
We have three levels of documentation. As you go forward into understanding Pywikibot, you will become more and more familiar with these levels.
 * 1) Manual:Pywikibot – written by humans for humans. This is recommended for beginners. It also has a "Get help" box.
 * 2) https://doc.wikimedia.org/pywikibot – mostly autogenerated technical documentation with all the fine details you are looking for. Click on   if you use the latest deployed stable version of Pywikibot (this is recommended unless you want to develop the framework itself), and on   if you use the actual version that is still under development. Differences are usually small.
 * 3) The code itself. It is useful if you don't find something in the documentation or you want to find working solutions and good practices. You may reach it from the above docs (most classes and methods have a   link) or from your computer.

Basic concepts
Throughout the manual and the documentation we speak about MediaWiki rather than Wikipedia and Wikibase rather than Wikidata beacuse these are the underlying software. You may use Pywikibot on other projects of WikiMedia Foundation, any non-WMF wiki and repository on Internet, even on a MediaWiki or Wikibase instance on your home computer. See the right-hand menu for help.


 * Site
 * A site is a wiki that you will contact. Usually you work with one site at a time. If you have set your family and language in, getting your site (e.g. one language version of Wikipedia) is as simple as.
 * You will find more options at https://doc.wikimedia.org/pywikibot/master/api_ref/pywikibot.html#pywikibot.Site
 * Site is a bit tricky because you don't find its methods directly in the documentation. This is because Site is not a class, rather a factory method that returns class instances. Digging into it you will find that its methods are under https://doc.wikimedia.org/pywikibot/master/api_ref/pywikibot.site.html#pywikibot.site._basesite.BaseSite as well as https://doc.wikimedia.org/pywikibot/master/api_ref/pywikibot.site.html#module-pywikibot.site._apisite.
 * Even more strange is that some of these methods manipulate pages. This is a technical decision of the developers, but you don't have to deal with them, as they are not for direct use. It is interesting to browse the site methods as you may get ideas from them how to use the framework.


 * This shows that your site is an, but it also inherits methods from.


 * Repository
 * A data repository is a Wikibase instance; when you work with WMF wikis, it is Wikidata itself. While you can directly work with Wikidata, in which case it will be a site, often you want to connect Wikipedia and Wikidata. So your Wikipedia will be the site and Wikidata the connected data repository. The site knows about its repository, therefore you don't have to write it in your user-config, rather get it as . (See https://doc.wikimedia.org/pywikibot/master/api_ref/pywikibot.site.html#pywikibot.site._apisite.APISite.data_repository.) However, articles have some direct methods to their Wikidata page, so in most cases you may not have to use the repository at all.
 * Image repository basically means Commons, at least for WMF wikis.
 * Continuing the previous code will show that the repository is also a special subclass of BaseSite:


 * Commons and Wikidata don't have language versions like Wikipedia, so their language and family is the same.


 * Page
 * Page is the most sophisticated entity in Pywikibot as we usually work with pages, and they have several kinds, properties, operations. See:
 * https://doc.wikimedia.org/pywikibot/master/api_ref/pywikibot.html#pywikibot.Page
 * Page is a subclass of BasePage, a general concept to represent all kinds of wikipages such as Page, WikibasePage and FlowPage. See:
 * https://doc.wikimedia.org/pywikibot/master/api_ref/pywikibot.page.html#page.BasePage
 * In most of your time you are likely to work with Page instances (unless your primary target is Wikidata itself). You won't instantiate BasePage directly, but if you wonder what you can do with a page (such as an article, a talk page or a community noticeboard), you should read both documentations above.
 * We have two main approach to pages: ready-to-use page objects may be obtained from pagegenerators or single pages may be created one by one from the titles. Below we look into both.


 * Category
 * Category is a subclass of Page. That means it represents a category with its special methods as well as the category page itself.


 * User
 * User is a subclass of Page. That means it represents a user with its special methods as well as the user page itself.

All the above concepts are classes or classlike factories; in the scripts we instantiate them. E.g.  will create a site instance.


 * Directory mode and library mode of Pywikibot: See in section.

Testing the framework
Let's try this at Python prompt: Of course, you will have the name of your own bot there if you have set the  properly. Now, what does it mean? This does not mean this is a valid username, even less it is logged in. This does not mean you have reached Wikipedia, neither you have Internet connection. This means that Python is working, Pywikibot is working, and you have set your home wiki and username in. Any string may be there by this point.

If you save the above code to a file called test.py: and run it with, you will get Brghhwsf.

Now try This is already a real contacting your wiki; the result is the name of your bot if you have logged in, otherwise. For advanced use it is important that although  is similar to a   object instance, here it is just a string. See https://doc.wikimedia.org/pywikibot/master/api_ref/pywikibot.site.html#pywikibot.site._basesite.BaseSite.user.

Creating a Page object from title
In the further part of this cookbook, unless otherwise stated, we always assume that you have already used these two basic statements:

You want to get the article about Budapest in your wiki. While it is in the article namespace, it is as simple as

Note that Python is case sensitive, and in its world  and   mean classes,   and   class instances, while lowercase   and   should be variables.

For such simple experiments interactive Python shell is comfortable, as you can easily see the results without using, saving and running your code.

Getting the type of an object is often useful when you want to discover the capabilities of Pywikibot. It seems to be strange, but the main thing is that you got a Page. Now let's see the user page of your bot. Either you prefix it with the namespace ( and other English names work everywhere, while the localized names only in your very wiki) or you give the namespace number as the third argument. So

and

will give the same result. is the localized version of  in Hungarian; Pywikibot won't respect that I used the English name for the namespace in my command, the result is always localized.

Getting the title of the page
On the other hand, if you already have a  object, and you need its title as a string,   method will do the job:

Possible errors
While getting pages may cause much less errors than saving them, a few types are worth to mention, some of them being technical, while others possible contradictions between our expectations and reality. Let's speak about them before actually getting the page.
 * 1) The page does not exist.
 * 2) The page is a redirect.
 * 3) You may have been mislead regarding the content in some namespaces. If your page is in Category namespace, the content is the descriptor page. If it is in User namespace, the content is the user page. The trickiest is the File namespace: the content is the file descriptor page, not the file itself; however if the file comes from Commons, the page may not exist in your wiki at all, while you still see the picture.
 * 4) The expected piece of text is not in the page content because it is transcluded from a template. You see the text on the page, but cannot replace it directly by bot.
 * 5) Sometimes a badly formed code may work well. For example  with two spaces will behave as  . While the page is in the category and you will get it from a pagegenerator (see below), you won't find the desired string in it.
 * 6) And, unfortunately, Wikipedia servers sometimes face errors. If you get a 500 error, go and read a book until server comes back.
 * 7) InvalidTitleError is raised in very rare cases. A server error is much more likely. :-)

Getting the content of the page
Important: by this time we don't have any knowledge about the existence of the page. We have not contacted live wiki yet. We just created an object. It is just as a street number: you may write it on a document, but either there is a house there or not.

There are two main approaches of getting the content. It is important to understand the difference.

Page.text
You may notice that  does not have parentheses. Looking into the code we discover that it is not a method, rather a property. This means  is ready to use without calling it, may be assigned a value, and is present upon saving the page.
 * https://doc.wikimedia.org/pywikibot/master/api_ref/pywikibot.page.html#page.BasePage.text

will write the whole text on your screen. Of course, this is for experiment.

You may write  if you need a copy of the text, but usually this is unneccessary. is not a method, so referring to it several times does not slow down your bot. Just manipulate  or assign it a new value, then save.

If you want to know details on how a property works, search for "Python decorators". For using it in your scripts it is enough to know the behaviour. Click on the above link and go through the right-hand menu. You will find some other properties without parentheses.

will never raise an error. If the page is a redirect, you will get the redirect link instead of the content of the target page. If the page does not exist, you will get an empty string which is just what happens if the page does exist, but is empty (it is usual at talk pages). Try this:

is comfortable if you don't have to deal with the existence of the page, otherwise it is your responsibility to make the difference. An easy way is.

While page creation does not contact the live wiki, refering to text for the first time and  does. For several pages it will take a while. If it is too slow for you, go to the section.

Page.get
The traditional way is  which forces you to handle the errors. In this case we store the value in a variable.
 * https://doc.wikimedia.org/pywikibot/master/api_ref/pywikibot.page.html#page.BasePage.get

A non-existing page causes a NoPageError:

A redirect page causes an IsRedirectPageError:

If you don't want to handle redirects, just make the difference between existing and non-existing pages,  will make its behaviour more similar to that of  :

Here is a piece of code to handle the cases. It is already too long for prompt, so I saved it. Which results in: While  is simple, it gives only the text of the redirect page. With  we got another Page instance without parsing the text. Of course, the target page may also not exist or be another redirect. Scripts/redirect.py gives a deeper insight.

For a practical application see.

Pagegenerators – working with plenty of pages
https://doc.wikimedia.org/pywikibot/master/api_ref/pywikibot.pagegenerators.html

Working with namespaces
TODO: https://doc.wikimedia.org/pywikibot/master/api_ref/pywikibot.site.html#pywikibot.site._namespace.Namespace.custom_prefix

Walking the namespaces
Does your wiki have an article on Wikidata? A Category:Wikidata? A template named Wikidata or a Lua module? This piece of code answers the question: In section we got know the properties that work without parentheses;   is of the same kind. Creating page object with the namespace number is familiar from  section. This loop goes over the namespaces available in your wiki beginning from the negative indices marking pseudo namespaces such as Media and Special.

Our documentation says that  is a kind of dictionary, however it is hard to discover. The above loop went over the numerical indices of namespaces. We may use them to discover the values by means of a little trick. Only the Python prompt shows that they have a different string representation when used with  or without it –   may not be omitted in a script. This loop: will write the namespace indices and the canonical names. The latters are usually English names, but for example the #100 namespace in Hungarian Wikipedia has a Hungarian canonical name because the English Wikipedia does not have the counterpart any more. Namespaces may vary from wiki to wiki; for Wikipedia WMF sysadmins set them in the config files, but for your private wiki you may set them yourself following the documentation. If you run the above loop, you may notice that File and Category appears as  and , respectively, so this code gives you a name that is ready to link, and will appear as normal link rather than displaying an image or inserting the page into a category.

Now we have to dig into the code to see that namespace objects have absoutely different  and   methods inside. While  writes , the object name at the prompt without   uses the   method. We have to use it explicitely:

The first few lines of the result from Hungarian Wikipedia are: Namespace(id=-2, custom_name='Média', canonical_name='Media', aliases=[], case='first-letter', content=False, nonincludable=False, subpages=False) Namespace(id=-1, custom_name='Speciális', canonical_name='Special', aliases=[], case='first-letter', content=False, nonincludable=False, subpages=False) Namespace(id=0, custom_name=, canonical_name=, aliases=[], case='first-letter', content=True, nonincludable=False, subpages=False) Namespace(id=1, custom_name='Vita', canonical_name='Talk', aliases=[], case='first-letter', content=False, nonincludable=False, subpages=True) Namespace(id=2, custom_name='Szerkesztő', canonical_name='User', aliases=[], case='first-letter', content=False, nonincludable=False, subpages=True) Namespace(id=3, custom_name='Szerkesztővita', canonical_name='User talk', aliases=['User vita'], case='first-letter', content=False, nonincludable=False, subpages=True) Namespace(id=4, custom_name='Wikipédia', canonical_name='Project', aliases=['WP', 'Wikipedia'], case='first-letter', content=False, nonincludable=False, subpages=True) So custom names mean the localized names in your language, while aliases are usually abbreviations such as WP for Wikipedia or old names for backward compatibility. Now we know what we are looking for. But how to get it properly? On top level documentation suggests to use  to get the local names. But if we try to give the canonical name to the function, it will raise errors at the main namespace. After some experiments the following form works: This will write the indices, canonical (English) names and localized names side by side. There is another way that gives nicer result, but we have to guess it from the code of namespace objects. This keeps the colons:

Determine the namespace of a page
Leaning on the above results we can determine the namespace of a page in any form. We investigate an article and a user talk page. Although the namespace object we get is told to be a dictionary in the documentation, it is quite unique, and its behaviour and even the apparent type depends on what we ask. It can be equal to an integer and to several strings at the same time. This reason of this strange personality is that the default methods are overwritten. If you want to deeply understand what happens here, open  (the underscore marks that it is not intended for public use) and look for ,   and   methods. The last command will give  on any non-Hungarian site, but the value will be    again, if you write user talk in your language.

It is common to get unknown pages from a pagegenerator or a similar iterator and it may be important to know what kind of page we got. Int this example we walk through all the pages that link to an article. will show the canonical (mostly English) names,  the local names and   (without parentheses!) the numerical index of the namespace. To improve the example we think that main, file, template, category and portal are in direct scope of readers, while the others are only important for Wikipedians, and we mark this difference.

Content pages and talk pages
Let's have a rest with a much easier exercise! Another frequent task is to switch from a content page to its talk page or vice versa. We have a method to toggle and another to decide if it is in a talk namespace: Note that pseudo namespaces (such as  and , with negative index) cannot be toggled. will always return a  object except when original page is in these namespaces. In this case it will return. So if there is any chance your page may be in a pseudo namespace, be prepared to handle errors.

The next example shows how to work with content and talk pages together. Many wikis place a template on the talk pages of living persons' biographies. This template collects the talk pages into a category. We wonder if there are talk pages without the content page having a "Category:Living persons". These pages need attention from users. The first experience is that separate listing of blue pages (articles), green pages (redirects) and red (missing) pages is useful as they need a different approach.

We walk the category (see ), get the articles and search them for the living persons' categories by means of a regex (it is specific for Hungarian Wikipedia, not important here). As the purpose is to dissolve pages by colour, we decide to use the old approach of getting the content (see ). Note that running on errors on purpose and preferring them to s is part of the philosophy of Python.

Working with your watchlist
We have a watchlist.py among scripts which deals with the watchlist of the bot. This does not sound too exciting. But if you have several thousand pages on your watchlist, handling it by bot may sound well. To do this you have to run the bot with your own username rather than that of the bot. Either you overwrite it in user-config.py or save the commands in a script and run python pwb.py -user:MyUserAccount myscript

The first tool we need is. This is a pagegenerator, so you may process pages with a for loop or change it to a list. When you edit your watchlist on Wikipedia, talk pages will not appear. This is not the case for ! It will double the number by listing content pages as well as talk pages. You may want to cut it.
 * Loading your watchlist and mke some statistics

Print the number of watched pages: This may take a while as it goes over all the pages. For me it is 18235. Hmm, it's odd, in both meaning. :-) How can a doubled number be odd? This reveals a technical error: decades ago I watched  which was a redirect to a project page, but technically a page in article  namespace back then, having the talk page  . Meanwhile   was turned into an alias for Wikipedia namespace, getting a new talk page, and the old one remained there stuck and abandoned, causing this oddity.

If you don't have such a problem, then will do both tasks: turn the generator into a real list and throw away talk pages as dealing with them separately is senseless. Note that  takes a   argument if you want only the first , but we want to process the whole watchlist now. You should get half of the previous number if everything goes well. For me it raises an error because of the above stuck talk page. I show it bbecause it is rare and interesting: pywikibot.exceptions.InvalidTitleError: The (non-)talk page of 'Vita:WP:AZ' is a valid title in another namespace. So I have to write a loop for the same purpose: Well, the previous one looked nicer. For the number I get 9109 which is not exactly  to the previous one, but I won't bother it for now.

In one way or other, we have at last a list with our watched pages. The first task is to create a statistics. I wonder how many pages are on my list by namespace. I wish I had the data in sqlite, but I don't. So a possible solution: This shows that almost the half of my watchlist consists of user pages because I watch recent changes, welcome people and warn if neccessary. And it is neccessary often. Now I focus on anons:

I could use the own method of a User instance to determine if they are anons without imprting, but for that I would have to convert pages to Users: Anyway,  shows that they are over 2000.
 * Selecting anon pages and unwatch according to a pattern

IPv4 addresses starting with  belong to schools in Hungary. Earlier most of them was static, but nowadays they are dynamic, and may belong to other school every time, so there is no point in keeping them. For unwatching I will use : With  in last line I will also see a   for each successful unwatching. Without it only unwatches. This loop will be slower than the previous. For repeated run it will write these s again because watchlist is cached. To avoid this and refresh watchlist use  which will always reload it.