Manual:Pywikibot/Cookbook/Working with namespaces

From mediawiki.org

Walking the namespaces

Does your wiki have an article about Wikidata? A Category:Wikidata? A template named Wikidata or a Lua module? This piece of code answers the question:

for ns in site.namespaces:
    page = pywikibot.Page(site, 'Wikidata', ns)
    print(page.title(), page.exists())

In the Page.text section we met properties that are used without parentheses; site.namespaces is one of them. Creating a Page object with a namespace number is familiar from the section Creating a Page object from title. The loop goes over the namespaces available in your wiki, beginning with the negative indices that mark pseudo namespaces such as Media and Special.

Our documentation says that site.namespaces is a kind of dictionary, although this is hard to discover for yourself. The loop above went over the numerical indices of the namespaces, and with a little trick we can use them to discover the values as well. Only the Python prompt reveals that the values have one string representation with print() and another without it; in a script print() may not be omitted. This loop:

for ns in site.namespaces:
    print(ns, site.namespaces[ns])

will write the namespace indices and the canonical names. The latter are usually English names, but, for example, namespace 100 in Hungarian Wikipedia has a Hungarian canonical name because English Wikipedia no longer has a counterpart. Namespaces may vary from wiki to wiki; for Wikipedia, WMF sysadmins set them in the config files, but on your private wiki you may set them yourself following the documentation. If you run the above loop, you may notice that File and Category appear as :File: and :Category:, respectively. The code thus gives you a name that is ready to link: it will appear as a normal link rather than displaying an image or inserting the page into a category.
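The behaviour of this loop can be tried without a wiki. The sketch below uses a plain dict as a stand-in for site.namespaces (an assumption for brevity: the real object is a richer mapping that holds Namespace objects, and the sample values are only illustrative). It shows that iterating over a mapping yields its keys and that the values are reached by indexing:

```python
# Stand-in for site.namespaces: a plain dict from index to canonical name.
# Assumption: the real object holds Namespace objects, not bare strings.
sample_namespaces = {-2: 'Media', -1: 'Special', 0: '', 1: 'Talk', 2: 'User'}

# Iterating over a mapping yields its keys, here the numerical indices.
indices = [ns for ns in sample_namespaces]

# Indexing with a key gives the value that belongs to it.
pairs = [(ns, sample_namespaces[ns]) for ns in sample_namespaces]

for ns, name in pairs:
    print(ns, name)
```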

Now we have to dig into the code to see that namespace objects have completely different __str__() and __repr__() methods. While print() writes the result of __str__(), typing the object name at the prompt without print() uses __repr__(). In a script we have to ask for it explicitly (Python gives us the repr() function, a friendlier way to reach __repr__()):

for ns in site.namespaces:
    print(repr(site.namespaces[ns]))

The first few lines of the result from Hungarian Wikipedia are:

Namespace(id=-2, custom_name='Média', canonical_name='Media', aliases=[], case='first-letter', content=False, nonincludable=False, subpages=False)
Namespace(id=-1, custom_name='Speciális', canonical_name='Special', aliases=[], case='first-letter', content=False, nonincludable=False, subpages=False)
Namespace(id=0, custom_name='', canonical_name='', aliases=[], case='first-letter', content=True, nonincludable=False, subpages=False)
Namespace(id=1, custom_name='Vita', canonical_name='Talk', aliases=[], case='first-letter', content=False, nonincludable=False, subpages=True)
Namespace(id=2, custom_name='Szerkesztő', canonical_name='User', aliases=[], case='first-letter', content=False, nonincludable=False, subpages=True)
Namespace(id=3, custom_name='Szerkesztővita', canonical_name='User talk', aliases=['User vita'], case='first-letter', content=False, nonincludable=False, subpages=True)
Namespace(id=4, custom_name='Wikipédia', canonical_name='Project', aliases=['WP', 'Wikipedia'], case='first-letter', content=False, nonincludable=False, subpages=True)
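The difference between the two representations can be reproduced with a few lines of plain Python. The class below, DemoNamespace, is a made-up, simplified stand-in and not the real pywikibot class; it only illustrates how one object can answer print() and repr() differently:

```python
class DemoNamespace:
    """Simplified stand-in, not the real pywikibot Namespace class."""

    def __init__(self, id, custom_name, canonical_name):
        self.id = id
        self.custom_name = custom_name
        self.canonical_name = canonical_name

    def __str__(self):
        # Used by print() and str(): a short prefix with a colon.
        return f'{self.canonical_name}:'

    def __repr__(self):
        # Used by the interactive prompt and repr(): a detailed description.
        return (f'DemoNamespace(id={self.id!r}, '
                f'custom_name={self.custom_name!r}, '
                f'canonical_name={self.canonical_name!r})')


talk = DemoNamespace(1, 'Vita', 'Talk')
print(talk)        # goes through __str__()
print(repr(talk))  # goes through __repr__()
```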

So custom names are the localized names in your language, while aliases are usually abbreviations, such as WP for Wikipedia, or old names kept for backward compatibility. Now we know what we are looking for. But how do we get it properly? The top-level documentation suggests using ns_normalize() to get the local names. But if we pass the canonical name to the function, it raises an error at the main namespace. After some experiments the following form works:

for ns in site.namespaces:
    print(ns, 
          site.namespaces[ns], 
          site.ns_normalize(str(site.namespaces[ns])) if ns else '')

-2 Media: Média
-1 Special: Speciális
0 :
1 Talk: Vita
2 User: Szerkesztő
3 User talk: Szerkesztővita
4 Project: Wikipédia
5 Project talk: Wikipédia-vita
6 :File: Fájl
etc.

This will write the indices, the canonical (English) names and the localized names side by side. There is another way that gives a nicer result, but we have to guess it from the code of the namespace objects. This one keeps the colons:

for ns in site.namespaces:
    print(ns,
          site.namespaces[ns],
          site.namespaces[ns].custom_prefix())

-2 Media: Média:
-1 Special: Speciális:
0 : :
1 Talk: Vita:
2 User: Szerkesztő:
3 User talk: Szerkesztővita:
4 Project: Wikipédia:
5 Project talk: Wikipédia-vita:
6 :File: :Fájl:
etc.

Determine the namespace of a page

Leaning on the above results, we can determine the namespace of a page in any form. We investigate an article and a user talk page. Although the documentation describes the namespace object we get as a dictionary, it is quite unique: its behaviour and even its apparent type depend on what we ask. It can be equal to an integer and to several strings at the same time. The reason for this strange personality is that the default methods are overridden. If you want to understand in depth what happens here, open pywikibot/site/_namespace.py (the underscore marks that it is not intended for public use) and look for the __str__(), __repr__() and __eq__() methods.

>>> page = pywikibot.Page(site, 'Budapest')
>>> page.namespace()
Namespace(id=0, custom_name='', canonical_name='', aliases=[], case='first-letter', content=True, nonincludable=False, subpages=False)
>>> print(page.namespace())
:
>>> page.namespace() == 0
True
>>> page.namespace().id
0

>>> page = pywikibot.Page(site, 'user talk:BinBot')
>>> page.namespace()
Namespace(id=3, custom_name='Szerkesztővita', canonical_name='User talk', aliases=['User vita'], case='first-letter', content=False, nonincludable=False, subpages=True)
>>> print(page.namespace())
User talk:
>>> page.namespace() == 0
False
>>> page.namespace() == 3
True
>>> page.namespace().custom_prefix()
'Szerkesztővita:'
>>> page.namespace() == 'User talk:'
True
>>> page.namespace() == 'Szerkesztővita:'  # *
True
>>> page.namespace().id
3
>>> page.namespace().custom_name
'Szerkesztővita'
>>> page.namespace().aliases
['User vita']

The starred command gives False on any non-Hungarian site, but the value becomes True again if you write the name of the user talk namespace in the language of your wiki.

Any of these values can be retrieved with dotted attribute syntax, as the last three commands show.
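How can one object be equal to an integer and to several strings at once? The sketch below shows the general technique of overriding __eq__(). It is a simplified illustration with a made-up class name, not a copy of pywikibot's implementation, whose real comparison rules may differ in detail:

```python
class DemoNamespace:
    """Simplified stand-in; the real logic lives in pywikibot itself."""

    def __init__(self, id, custom_name, canonical_name, aliases=()):
        self.id = id
        # Every name under which this namespace is known.
        self.names = {custom_name, canonical_name, *aliases}

    def __eq__(self, other):
        if isinstance(other, int):
            # Compare with a numerical index.
            return self.id == other
        if isinstance(other, str):
            # Accept a known name with or without the trailing colon.
            return other.rstrip(':') in self.names
        if isinstance(other, DemoNamespace):
            return self.id == other.id
        return NotImplemented


user_talk = DemoNamespace(3, 'Szerkesztővita', 'User talk', ['User vita'])
print(user_talk == 3)
print(user_talk == 'User talk:')
print(user_talk == 'Szerkesztővita:')
print(user_talk == 0)
```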

It is common to get unknown pages from a page generator or a similar iterator, and it may be important to know what kind of page we got. In this example we walk through all the pages that link to an article. page.namespace() shows the canonical (mostly English) name, page.namespace().custom_prefix() the local name, and page.namespace().id (without parentheses!) the numerical index of the namespace. To improve the example, we assume that main, file, template, category and portal pages are of direct interest to readers, while the others matter only to Wikipedians, and we mark this difference.

def for_readers(ns: int) -> bool:
    return ns in (0, 6, 10, 14, 100)
    # (main, file, template, category and portal)

basepage = pywikibot.Page(site, 'Budapest')
for page in basepage.getReferences(total=110):
    print(page.title(),
          page.namespace(),
          page.namespace().custom_prefix(),
          page.namespace().id,
          for_readers(page.namespace().id)
         )
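To summarise such a listing instead of printing every row, the namespace indices can be tallied. This is a minimal sketch, assuming the indices have already been collected into a list (for example from page.namespace().id in a loop like the one above). The helper name tally_by_audience and the sample numbers are invented for illustration:

```python
from collections import Counter

# main, file, template, category and portal
READER_NAMESPACES = (0, 6, 10, 14, 100)


def for_readers(ns: int) -> bool:
    return ns in READER_NAMESPACES


def tally_by_audience(ns_ids):
    """Count reader-facing and Wikipedian-only namespace indices."""
    counts = Counter('readers' if for_readers(ns) else 'wikipedians'
                     for ns in ns_ids)
    # A Counter returns 0 for keys it has never seen.
    return counts['readers'], counts['wikipedians']


# Invented sample: three articles, one template, two user talk pages.
sample_ids = [0, 0, 0, 10, 3, 3]
print(tally_by_audience(sample_ids))
```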

Content pages and talk pages

Let's have a rest with a much easier exercise! Another frequent task is to switch from a content page to its talk page or vice versa. We have a method to toggle and another to decide if it is in a talk namespace:

>>> page = pywikibot.Page(site, 'Budapest')
>>> page
Page('Budapest')
>>> page.isTalkPage()
False
>>> talk = page.toggleTalkPage()
>>> talk
Page('Vita:Budapest')
>>> talk.isTalkPage()
True
>>> talk.toggleTalkPage()
Page('Budapest')

Note that pages in pseudo namespaces (such as Special and Media, with negative indices) cannot be toggled. toggleTalkPage() always returns a Page object, except when the original page is in one of these namespaces; in that case it returns None. So if there is any chance that your page is in a pseudo namespace, be prepared to handle the None value.

The next example shows how to work with content and talk pages together. Many wikis place a template on the talk pages of living persons' biographies. This template collects the talk pages into a category. We wonder if there are talk pages whose content page lacks "Category:Living persons". These pages need attention from users. The first lesson is that listing blue pages (articles), green pages[1] (redirects) and red (missing) pages separately is useful, as they need different approaches.
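One way to guard against the None result is an explicit check before using the returned object. The sketch below relies only on the toggleTalkPage() and title() methods; FakePage is a hypothetical stand-in so that the snippet runs without a wiki, and counterpart_title is an invented helper name:

```python
class FakePage:
    """Hypothetical stand-in offering the two methods the helper needs."""

    def __init__(self, title, counterpart=None):
        self._title = title
        self._counterpart = counterpart

    def title(self):
        return self._title

    def toggleTalkPage(self):
        # Mimics the documented behaviour: a page object or None.
        return self._counterpart


def counterpart_title(page):
    """Return the title of the toggled page, or None if there is none."""
    other = page.toggleTalkPage()
    if other is None:
        # Pages in pseudo namespaces have no talk counterpart.
        return None
    return other.title()


article = FakePage('Budapest', FakePage('Vita:Budapest'))
special = FakePage('Speciális:Keresés')
print(counterpart_title(article))
print(counterpart_title(special))
```

With real Page objects the same helper should work unchanged, since it calls nothing beyond these two methods.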

We walk the category (see Working with categories), get the articles and search them for the living persons' categories by means of a regex (it is specific for Hungarian Wikipedia, not important here). As the purpose is to separate pages by colour, we decide to use the old approach of getting the content (see Page.get()).

import re
import pywikibot

site = pywikibot.Site()
cat = pywikibot.Category(site, 'Kategória:Élő személyek életrajzai')
regex = re.compile(
    r'(?i)\[\[kategória:(feltehetően )?élő személyek(\||\]\])')
blues = []
greens = []
reds = []

for talk in cat.members():
    page = talk.toggleTalkPage()
    try:
        if not regex.search(page.get()):
            blues.append(page)
    except pywikibot.exceptions.NoPageError:
        reds.append(page)
    except pywikibot.exceptions.IsRedirectPageError:
        greens.append(page)

Note that deliberately running into exceptions and preferring them to if checks is part of the philosophy of Python.
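The two styles can be compared on a neutral example. The following sketch uses a plain dict instead of wiki pages (the function names and the sample data are invented for illustration); both functions return the same results, but the second one simply tries the operation and handles the failure:

```python
def length_with_check(texts, title):
    # Check first, then act.
    if title in texts:
        return len(texts[title])
    return None


def length_with_exception(texts, title):
    # Act first and handle the error if the key is missing.
    try:
        return len(texts[title])
    except KeyError:
        return None


texts = {'Budapest': 'Capital of Hungary'}
print(length_with_check(texts, 'Budapest'),
      length_with_exception(texts, 'Budapest'))
print(length_with_check(texts, 'Missing'),
      length_with_exception(texts, 'Missing'))
```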

Notes

  1. Appropriate if your wiki offers a way to mark redirects in green.