Manual:Pywikibot/Cookbook/Page generators

Overview
Page generators form one of the most powerful tools of Pywikibot. A page generator iterates over the desired pages.

Why use page generators? One possible reason to write your own page generator is mentioned in the Follow your bot section.
 * You may separate finding the pages to work on from the actual processing, so the code becomes cleaner and more readable.
 * Thus you create separate namespaces, which are one honking great idea.
 * They provide reusable code for typical tasks.
 * The Pywikibot team writes the page generators and follows the changes of the MediaWiki API, so you may write your code on a higher level.

Most page generators are available via command line arguments for end users. See https://doc.wikimedia.org/pywikibot/master/api_ref/pywikibot.pagegenerators.html for details. If you write your own script, you may use these arguments; but if they are permanent for the task, you may want to invoke the appropriate generator directly instead of handling command line arguments.

Life is too short to list them all here, but the most used generators are listed under the above link. You may also discover them in the pagegenerators directory of your Pywikibot installation. They may be divided into three main groups:
 * 1) High-level generators for direct use, mostly (but not exclusively) based on the MediaWiki API. They usually have long and hard-to-remember names, but the names may always be looked up in the docs or the code. They are connected to command line arguments.
 * 2) Filters. They wrap around another generator (taking the original generator as an argument) and filter the results, for example by namespace. This means they won't run too fast...
 * 3) Low-level API-based generators, which may be obtained as methods of the site and other page-like objects. Most of them are wrapped into a high-level generator function, which is the preferred use (we may say, the public interface of Pywikibot); however, nothing forbids their direct use. Sometimes they yield structures rather than page objects, but they may be turned into a real page generator, as we will see in an example.

Pagegenerators package (group 1 and 2)
Looking into the pagegenerators directory you discover scripts whose names begin with an underscore. This means they are not for direct import; however, they can be useful for discovering the features. The incorporated generators may be used through the package itself, which is almost equivalent to importing them directly from the private modules.

To interpret this directory, which appears in code as the pagegenerators package:
 * __init__.py primarily holds the documentation, but there are also some wrapper generators in it.
 * _generators.py holds the primary generators.
 * _filters.py holds the wrapping filter generators.
 * A further module is responsible for interpreting the command line arguments and choosing the appropriate generator function.

API generators (group 3)
MediaWiki offers a lot of low-level page generators, which are implemented as methods of the site object, so we may use them on our own site instance. While the above-mentioned page-like objects have their own methods that may easily be found in the documentation of their class, they usually use an underlying site method, which sometimes offers more features.

Usage
Generators may be used in for loops as shown above, but may also be transformed into lists. But be careful: while a loop processes pages continuously, a list comprehension may take a while because it has to read all the items from the generator first. Such a statement is very fast for total=10, takes noticeable time for total=1000, and is definitely slow for total=100000. Of course, it will also consume a lot of memory for big numbers, so it is usually better to use generators in a loop.

A few interesting generators
A non-exhaustive list of useful generators. All of these may be imported from the pagegenerators package.

Autonomous generators (_generators.py)
Most of them correspond to a special page on the wiki, and there are many more...
 * : Yields all the pages of the wiki in alphabetical order, like one long queue along the road. You may specify the start, the namespace, a limit to avoid endless queues, and whether redirects should be included, excluded or exclusively yielded. See an example below.
 * : Pages whose title begins with a given string. See an example below.
 * : Events from logs
 * : Pages from a given category.
 * : Pages that are linked from another page. See an example in chapter Creating and reading lists.
 * : Reads from file or URL. See an example in chapter Creating and reading lists.
 * : Generates pages from their titles.
 * : Generates pages that a given user worked on.
 * : Reads from a downloaded dump on your device. In the dump pages are usually sorted by pageid (creation time). See in Working with dumps chapter.

An odd one out

 * Looks for pages in a dump that are subject to a text replacement. It is defined within another module and may be imported from there.

Filtering generators (_filters.py)

 * : Only lets through pages from the given namespace(s).
 * : Lets you define an ignore list whose pages won't be yielded.
 * : Yields either only redirect pages or only non-redirects.
 * : A generator which filters out subpages based on depth.
 * : This is not a function but rather a class. It makes it possible to filter titles with a regular expression.
 * : Lets through only pages which are in all of the given categories.

Other wrappers (__init__.py)

 * : You have page objects from another generator. This wrapper examines them, and whichever represents a user, a category or a file description page is turned into the appropriate subclass so that you may use more methods; the others remain untouched.
 * : Takes a page generator and yields the content pages and the corresponding talk pages one after the other (or just the talk pages).
 * : This one is exciting: it makes it possible to follow the events on your wiki live, taking pages from the recent changes or some log.

List Pywikibot user scripts
You want to collect users' home-made Pywikibot scripts from all over Wikipedia. Supposing that they are in the user namespace and have a title ending in .py, a possible solution is to create your own page generator. It is rather slow than quick. :-) It will not search other projects. If you want a reduced version for testing, you may limit the number of Wikipedias to 3 and exclude the biggest, enwiki. You may also use a limiting argument in the generator.

Sample result: they are not guaranteed to be Pywikibot scripts, as other Python programs are published there too. You may retrieve the pages and check them for a characteristic line.

Titles beginning with a given word
While writing this section we had a question on the village pump about the spelling of the name of József Degenfeld. A search showed that we have several existing articles about the Degenfeld family. To quickly compose a list of them, the technique from the Creating and reading lists chapter was copied. For such rapid tasks the shell is very suitable.

Pages created by a user with a site iterator
You want to list the pages created by a given user, for example yourself. How do we know that an edit was the creation of a new page? The answer is the parentid value, which is the oldid of the previous edit. If it is zero, there is no previous version, which means the edit was either a creation or a recreation after a deletion. Where do we get a parentid from? Either from a user object (see the Working with users chapter) or from a page history (see Working with page histories).

Of course, we begin with the high-level page generators, just because this is the title of the chapter. We have one that is promising. Its description is: Yield unique pages edited by user:username. Now, this is a good start, but we only get pages, so we would have to fetch the first revision of each and check whether its username equals the desired user, which will not be the case in the vast majority of cases; this would be very slow.

So we look into the source and notice that this function calls a method of the user object, with the description Yield tuples describing this user's edits. This is promising again, but looking into it we see that the tuple does not contain the parentid. We find an underlying method again, the site's usercontribs(), which looks good and has a link to API:Usercontribs, the fourth step of our investigation. Finally, this tells us what we want to hear: yes, it has the parentid.

Technically, usercontribs() is not a page generator, but we will turn it into one. It takes the username as a string and iterates dictionaries, and we may create the pages from their titles. The simplest version just shows the essence. The introduction was long, the solution is short. :-)

But it was not only short but also fast, because we did not create unnecessary objects and, what is more important, did not get unnecessary data. We got dictionaries, read the title and the timestamp from them, and created only the desired page objects – but did not retrieve the pages, which is the slow part of the work. Handle the result with care, as some false positives may occur, e.g. upon page history merges.

Based on this simple solution we create an advanced application that
 * gets only pages from selected namespaces (this is not a post-generation filtering as with Pywikibot's wrapper; the MediaWiki API will do the filtering on the fly)
 * separates pages by namespace
 * separates disambiguation pages from articles by title and creates a fictive namespace for them
 * filters out redirect pages from articles and templates (in the other namespaces and among disambiguation pages the ratio of these is very low, so we don't bother; this is a decision)
 * saves the list to separate subpages of the user by namespace
 * writes the creation date next to the titles.
Of these tasks, only recognizing the redirects makes it necessary to retrieve the pages, which is slow and loads the server and bandwidth. While the algorithm would be simpler if we did the filtering within the loop, it is more efficient to do this filtering afterwards, for the selected pages only.


 * Line 4: The username as a constant (not a user object). Of course, you may get it from the command line or a web interface.
 * Line 23: A dictionary for the results. Keys are namespace numbers, values are empty lists.
 * Line 26: This is the core of the script. We get the contributions with 0 parentid, check them for being a disambiguation page, prefix the category names with a colon, and store the titles together with the timestamps as tuples. We don't retrieve any page content up to this point.
 * Line 35: Removal of redirects. Now we have to retrieve the selected pages. Note the slicing in the loop head; this is necessary when you loop over a list and meanwhile remove items from it. The slicing creates a copy to loop over, preventing a conflict.
 * Line 43: We save the lists to subpages, unless they are empty.

A sample result is at hu:Szerkesztő:Bináris/Létrehozott lapok.

Summary
High-level page generators are varied and flexible, and are often useful when we do some common task, especially if we want the pwb wrapper to handle our command-line arguments. But for some specific tasks we have to go deeper. On the next level there are the generator methods of the page-like objects, while on the lowest level there are the page generators and other iterators of the site object, which are directly based on the MediaWiki API. Going deeper is possible through the documentation and the code itself.

On the other hand, iterating pages through an API iterator, given the namespace as an argument, may be faster than using a high-level generator from the pagegenerators package and then filtering it with a wrapping generator. At least we may suppose so (no benchmark has been made).

In some rare cases even this is not enough, when some features offered by the API are not implemented in Pywikibot. You may either implement them and contribute to the common code base, or make a copy of the relevant code and enhance it with the missing parameters according to the API documentation.