Manual:Creating a bot

From mediawiki.org

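A MediaWiki bot, or simply a bot, is an automated program that interacts with Wikipedia (and other Wikimedia projects) as though it were a human editor. This page tries to explain how to develop bots for use on Wikimedia projects; much of it also applies to other wikis based on MediaWiki. The explanation is aimed mainly at people who have some programming experience but are unsure how to apply that knowledge to creating a MediaWiki bot.

Why would I need to create a bot?

Bots can automate tasks and perform them much faster than humans. If you have a simple task that must be performed many times (for example, adding a template to every page in a category that contains 1000 pages), then it is a task better suited to a bot than to a human.

Considerations before creating a bot

Reuse existing bots

It is always much simpler to ask an existing bot to do the job. If your need is only periodic, or you are not comfortable with programming, this is usually the best solution. Some wikis have a dedicated page for such requests. In addition, many tools are available for anyone to use. Most of them take the form of enhanced web browsers with MediaWiki-specific functionality. The most popular of these is AutoWikiBrowser (AWB), a browser specifically designed to assist with editing on Wikipedia and other Wikimedia projects. A more complete list of tools can be found at w:Wikipedia:Tools/Editing tools on English Wikipedia. Many tools, such as AWB, can be operated with little or no programming knowledge.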

Use Toolhub to explore available tools for Wikimedia wikis.

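Reusing codebases

If you decide that you need a bot of your own, because of the frequency or novelty of your requirements, you do not need to write one from scratch. Many bots publish their source code, which can sometimes be reused with little additional development time. There are also a number of standard bot frameworks available for download. These frameworks make up the vast majority of a bot's code. Because these frameworks are in common use, and the complex coding has already been done by others and heavily tested, bots based on them are more likely to be approved for use. The most popular and common of these frameworks is Pywikibot (PWB), a bot framework written in Python. It is thoroughly documented and tested, and many standardized Pywikibot scripts (bot instructions) are already available. Other examples of bot frameworks can be found below. For some frameworks, such as PWB, familiarity with scripts is all that is needed to run a bot successfully (it is important to update these frameworks regularly).

Important questions

Writing a new bot requires significant programming ability. A completely new bot must undergo substantial testing before it is approved for regular operation. Planning is crucial to obtaining an error-free, efficient, and effective program. The following initial questions are important:

  • Will the bot be manually assisted or fully automated?
  • Will you create the bot alone or with the help of other programmers?
  • Which language will be used to implement the bot?
  • Will the bot's requests, edits, or other actions be logged? If so, will the logs be stored on local media or on wiki pages?
  • Will the bot run inside a web browser (for example, written in JavaScript) or as a standalone program?
  • If the bot is a standalone program, will it run on your local computer, or on a remote server such as Toolforge?
  • If the bot runs on a remote server, will other editors be able to operate the bot or start it running?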

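How does a MediaWiki bot work?

Overview of operation

Like a human editor, a MediaWiki bot reads wiki pages and makes changes where it thinks they are needed. The difference is that, although bots are faster and less prone to fatigue than humans, they are nowhere near as bright as we are. Bots are good at repetitive tasks that have easily defined patterns and require few decisions.

In the most typical case, a bot logs in to its own account and requests pages from the wiki in much the same way as a browser does (although it does not display the page on screen, but works on it in memory), and then programmatically examines the page code to see whether any changes are needed. It then makes and submits whatever edits it was designed to make, again in much the same way as a browser would.

Because bots access pages the same way people do, they can run into the same difficulties as human users. While requesting pages or making edits, they may get caught in edit conflicts, hit page timeouts, or encounter other unexpected complications. Because the volume of work done by a bot is larger than that done by a person, a bot is more likely to encounter these issues. It is therefore important to consider these situations when writing a bot.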

APIs for bots

In order to change a wiki page, a bot must retrieve the page from the wiki and then send the edited content back. Several application programming interfaces (APIs) are available for this purpose.

  • MediaWiki Action API (api.php). This web service was written specifically to let automated programs such as bots make queries and post changes. Data is returned in JSON format (see output formats for details).
    Status: Built-in feature of MediaWiki, available on all Wikimedia servers. Other non-Wikimedia wikis may disable or restrict write access.
    There is also an API sandbox for those who want to test api.php features.
  • Special:Export can be used to obtain bulk export of page content in XML form. See Manual:Parameters to Special:Export for arguments.
    Status: Built-in feature of MediaWiki, available on all Wikimedia servers.
  • Raw (Wikitext) page processing: sending an action=raw or an action=raw&templates=expand GET request to index.php will give the unprocessed wikitext source code of a page. For example: https://en.wikipedia.org/w/index.php?title=Help:Creating_a_bot&action=raw. An API query with action=query&prop=revisions&rvprop=content or action=query&prop=revisions&rvprop=content&rvexpandtemplates=1 is roughly equivalent, and allows for retrieving additional information.
    Status: Built-in feature of MediaWiki, available on all Wikimedia servers.
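As a rough illustration of the two ways of reading wikitext listed above, the following Python sketch builds the request URLs with the standard library only. The function names are made up for this example, and the titles and format parameters are additions that a complete API query needs; the remaining parameters are the ones named in the list.

```python
from urllib.parse import urlencode

# Base URLs for English Wikipedia, as used in the example URL above.
INDEX_PHP = "https://en.wikipedia.org/w/index.php"
API_PHP = "https://en.wikipedia.org/w/api.php"


def raw_wikitext_url(title: str, expand_templates: bool = False) -> str:
    """Build an index.php URL that returns a page's unprocessed wikitext."""
    params = {"title": title, "action": "raw"}
    if expand_templates:
        params["templates"] = "expand"
    return f"{INDEX_PHP}?{urlencode(params)}"


def api_content_url(title: str) -> str:
    """Build the roughly equivalent Action API query for the page content."""
    params = {
        "action": "query",
        "prop": "revisions",
        "rvprop": "content",
        "titles": title,    # assumption: page selected by title
        "format": "json",   # assumption: JSON output requested explicitly
    }
    return f"{API_PHP}?{urlencode(params)}"


if __name__ == "__main__":
    print(raw_wikitext_url("Help:Creating_a_bot"))
    print(api_content_url("Help:Creating_a_bot"))
```

Building URLs with urlencode rather than by string concatenation keeps titles that contain spaces, colons, or ampersands correctly escaped.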

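Some web servers are configured to allow requests for compressed (gzip) content. This can be done by including the line "Accept-Encoding: gzip" in the HTTP request header; if the HTTP reply header contains "Content-Encoding: gzip", the document is in gzip form, otherwise it is in the regular uncompressed form. Note that this is specific to the web server and not to the MediaWiki software. Other sites that use MediaWiki may not have this feature. If you are using an existing bot framework, it should handle low-level operations like this.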
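For bots written without a framework, the header exchange described above might look like the following sketch, which uses only the Python standard library. The function names and the User-Agent value are placeholders chosen for this example.

```python
import gzip
import urllib.request
from typing import Optional


def build_request(url: str, user_agent: str) -> urllib.request.Request:
    """Create a request that tells the server gzip-compressed replies are acceptable."""
    headers = {"Accept-Encoding": "gzip", "User-Agent": user_agent}
    return urllib.request.Request(url, headers=headers)


def decode_body(body: bytes, content_encoding: Optional[str]) -> bytes:
    """Decompress the body only if the reply header says it is gzip-encoded."""
    if content_encoding and content_encoding.strip().lower() == "gzip":
        return gzip.decompress(body)
    return body  # regular uncompressed reply


def fetch(url: str, user_agent: str) -> bytes:
    """Perform the request and return the (decompressed) reply body."""
    with urllib.request.urlopen(build_request(url, user_agent)) as reply:
        return decode_body(reply.read(), reply.headers.get("Content-Encoding"))
```

Checking the Content-Encoding reply header, instead of assuming compression, keeps the same code working against servers that ignore the request header.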

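Logging in

Approved bots need to be logged in to make edits. Although a bot can make read requests without logging in, bots that have completed testing should log in for all activities. Bots logged in from an account with the bot flag (see #Bot Flag below) can obtain more results per query from the MediaWiki API (api.php). Most bot frameworks handle login and cookies automatically, but if you are not using an existing framework, you will need to follow these steps.

For security, login data must be passed using the HTTP POST method. Because the parameters of an HTTP GET request are easily visible in the URL, logins via GET are disabled.

To log in a bot using the MediaWiki API, two requests are needed: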

Request 1 – this is a GET request to obtain a login token

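This will return a "logintoken" parameter in JSON form, as documented at API:Login. Other output formats are available. It will also return HTTP cookies, as described below.

Request 2 – this is a POST to complete the login

where TOKEN is the token from the previous result. The HTTP cookies from the previous request must also be passed with the second request.

A successful login causes the Wikimedia server to set several HTTP cookies. The bot must save these cookies and send them back with every request (this is particularly important for editing). On English Wikipedia, the following cookies should be used: enwikiUserID, enwikiToken, and enwikiUserName. The enwikisession cookie is required to actually send an edit or commit some change, otherwise the MediaWiki:Session fail preview error message will be returned.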

Main-account login via action=login is deprecated and may stop working without warning. To continue using bot code which logs in with action=login, see Special:BotPasswords.
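The two login requests can be sketched in Python as follows. This is an illustration under stated assumptions, not a reference implementation: api_get and api_post are hypothetical callables supplied by the caller that send the parameters to api.php, share one cookie store between them, and return the decoded JSON reply. The parameter names (meta=tokens with type=login, then lgname, lgpassword, lgtoken) follow API:Login, and the credentials are assumed to come from Special:BotPasswords.

```python
from typing import Callable, Dict

Params = Dict[str, str]


def login_token_params() -> Params:
    """Request 1 (GET): ask the API for a login token."""
    return {"action": "query", "meta": "tokens", "type": "login", "format": "json"}


def login_params(username: str, password: str, token: str) -> Params:
    """Request 2 (POST): send the credentials together with the token."""
    return {
        "action": "login",
        "lgname": username,
        "lgpassword": password,
        "lgtoken": token,
        "format": "json",
    }


def log_in(
    api_get: Callable[[Params], dict],   # hypothetical transport: GET to api.php
    api_post: Callable[[Params], dict],  # hypothetical transport: POST to api.php
    username: str,
    password: str,
) -> None:
    """Run both requests; both callables must reuse the same cookies."""
    token = api_get(login_token_params())["query"]["tokens"]["logintoken"]
    reply = api_post(login_params(username, password, token))
    result = reply.get("login", {}).get("result")
    if result != "Success":
        raise RuntimeError(f"login failed: {result!r}")
```

Passing the transport in as callables keeps the login logic independent of any particular HTTP library, and makes it easy to exercise with canned replies before pointing it at a live wiki.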

Editing; edit tokens

MediaWiki uses a system of edit tokens for making edits to MediaWiki pages, as well as other operations that modify existing content such as rollback. The token looks like a long hexadecimal number followed by '+\', for example:

d41d8cd98f00b204e9800998ecf8427e+\

The role of edit tokens is to prevent "edit hijacking", where users are tricked into making an edit by clicking a single link.

The editing process involves two HTTP requests. First, a request for an edit token must be made. Then, a second HTTP request must be made that sends the new content of the page along with the edit token just obtained. It is not possible to make an edit in a single HTTP request. An edit token remains the same for the duration of a logged-in session, so the edit token needs to be retrieved only once and can be used for all subsequent edits.

To obtain an edit token, follow these steps:

  • MediaWiki API (api.php). Make a request with the following parameters (see API:Edit - Create&Edit pages).
    • action=query
    • meta=tokens

    The token will be returned in the csrftoken attribute of the response.

If the edit token the bot receives does not have the hexadecimal string (i.e., the edit token is just '+\') then the bot most likely is not logged in. This might be due to a number of factors: failure in authentication with the server, a dropped connection, a timeout of some sort, or an error in storing or returning the correct cookies. If it is not because of a programming error, just log in again to refresh the login cookies. Bots should use the API's assert parameter to make sure that they are logged in.
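The check described above can be sketched as follows. The helper names are made up for this example; the sketch assumes a current MediaWiki version, where a meta=tokens query returns the edit token under query.tokens.csrftoken, and it adds assert=user so that the API reports an error if the session is no longer logged in.

```python
ANONYMOUS_TOKEN = "+\\"  # the two characters +\ that a logged-out session receives


def token_request_params() -> dict:
    """Parameters for the token request, with a logged-in assertion added."""
    return {"action": "query", "meta": "tokens", "assert": "user", "format": "json"}


def extract_csrf_token(response: dict) -> str:
    """Pull the edit token out of a decoded JSON reply."""
    return response["query"]["tokens"]["csrftoken"]


def is_logged_in_token(token: str) -> bool:
    """A usable token has a hexadecimal part in front of the trailing +\\."""
    return token.endswith(ANONYMOUS_TOKEN) and token != ANONYMOUS_TOKEN
```

A bot would typically call is_logged_in_token once after fetching the token and log in again, rather than attempt an edit, when it returns False.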

Edit conflicts

Edit conflicts occur when multiple, overlapping edit attempts are made on the same page. Almost every bot will eventually get caught in an edit conflict of one sort or another, and should include some mechanism to test for and accommodate these issues.

Bots that use the MediaWiki API (api.php) should retrieve the edit token, along with the starttimestamp and the last revision "base" timestamp, before loading the page text in preparation for the edit; prop=info|revisions can be used to retrieve both the token and page contents in one query (example). When submitting the edit, set the starttimestamp and basetimestamp attributes, and check the server responses for indications of errors. For more details, see API:Edit - Create&Edit pages.

Generally speaking, if an edit fails to complete the bot should check the page again before trying to make a new edit, to make sure the edit is still appropriate. Further, if a bot rechecks a page to resubmit a change, it should be careful to avoid any behavior that could lead to an infinite loop and any behavior that could even resemble edit warring.
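One possible shape for this logic is sketched below in Python. It is an illustration, not a prescribed implementation: load_page, transform, and submit are hypothetical callables supplied by the bot author, and the page dictionary keys are invented for the example. The basetimestamp and starttimestamp parameters and the editconflict error code are those of the API's edit module.

```python
from typing import Callable, Dict

EDIT_CONFLICT = "editconflict"  # error code the API reports for an edit conflict


def build_edit_params(page: Dict[str, str], new_text: str, summary: str) -> Dict[str, str]:
    """Assemble action=edit parameters, including both conflict-detection timestamps."""
    return {
        "action": "edit",
        "title": page["title"],
        "text": new_text,
        "summary": summary,
        "basetimestamp": page["basetimestamp"],    # timestamp of the revision that was read
        "starttimestamp": page["starttimestamp"],  # when preparation of the edit began
        "format": "json",
        "token": page["token"],
    }


def save_with_conflict_check(
    load_page: Callable[[], Dict[str, str]],    # hypothetical: fetch text, timestamps, token
    transform: Callable[[str], str],            # hypothetical: compute the new page text
    submit: Callable[[Dict[str, str]], dict],   # hypothetical: POST to api.php, return JSON
    summary: str = "Bot edit",
    max_attempts: int = 3,
) -> str:
    """Reload and re-evaluate the page after each conflict, with a bounded retry count."""
    for _ in range(max_attempts):
        page = load_page()
        new_text = transform(page["text"])
        if new_text == page["text"]:
            return "nochange"  # the edit is no longer needed
        reply = submit(build_edit_params(page, new_text, summary))
        code = reply.get("error", {}).get("code")
        if code is None:
            return "saved"
        if code != EDIT_CONFLICT:
            raise RuntimeError(f"edit failed: {code}")
    return "gave-up"  # never loop forever on a contested page
```

Re-running transform on the freshly loaded text is what makes the bot check that its change is still appropriate, and the attempt limit guards against both infinite loops and anything resembling edit warring.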

Overview of the process of developing a bot

Actually coding or writing a bot is only one part of developing a bot.

The development cycle below is a recommendation from English Wikipedia.

Overview of English Wikipedia bot development cycle

On Wikimedia wikis, ensure that your bot follows any potential bot policies of the wiki.

Idea

  • The first task in creating a MediaWiki bot is extracting the requirements or coming up with an idea.
  • Make sure an existing bot isn't already doing what you think your bot should do.

Specification

  • Specification is the task of precisely describing the software to be written, possibly in a rigorous way. You should come up with a detailed proposal of what you want it to do. Try to discuss this proposal with some editors and refine it based on feedback. Even a great idea can be made better by incorporating ideas from other editors.
  • In the most basic form, your specified bot must meet the following criteria:
    • The bot is harmless (it must not make edits that could be considered disruptive to the smooth running of the wiki)
    • The bot is useful (it provides a useful service more effectively than a human editor could)
    • The bot does not waste server resources.

Software architecture

  • Think about how you might create it and which programming language(s) and tools you would use. Architecture is concerned with making sure the software system will meet the requirements of the product as well as ensuring that future requirements can be addressed. Certain programming languages are better suited to some tasks than others, for more details see the section on programming languages below.

Implementation

Implementation (or coding) involves turning design and planning into code. It may be the most obvious part of the software engineering job, but it is not necessarily the largest portion. In the implementation stage you should:

  • Create an account for your bot. Go to the sign up page when logged in to create the account, linking it to yours. (If you do not create the bot account while logged in, it might be blocked on some wikis according to their policies)
  • Create a user page for your bot. Your bot's edits must not be made under your own account. Your bot will need its own account with its own username and password.
  • Add the same information to the user page of the bot. It would be a good idea to add a link to the approval page (whether approved or not) for each function.
  • Code your bot in your chosen programming language.

Testing

A good way of testing your bot as you are developing is to have it show the changes (if any) it would have made to a page, rather than actually editing the live wiki. Some bot frameworks (such as pywikibot) have pre-coded methods for showing diffs.
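If your framework has no such method, a dry run can be approximated with Python's standard difflib module, as in the sketch below. The function name and the file labels are chosen for this example.

```python
import difflib


def preview_edit(title: str, old_text: str, new_text: str) -> str:
    """Return a unified diff of the proposed change instead of saving it."""
    diff = difflib.unified_diff(
        old_text.splitlines(),
        new_text.splitlines(),
        fromfile=f"{title} (current)",
        tofile=f"{title} (proposed)",
        lineterm="",  # input lines carry no newline, so the headers should not either
    )
    return "\n".join(diff)


if __name__ == "__main__":
    print(preview_edit("Sandbox", "Hello\nwrold", "Hello\nworld"))
```

An empty return value means the bot would not have changed the page, which is itself worth logging during trial runs.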

Documentation

An important (and often overlooked) task is documenting the internal design of your bot for the purpose of future maintenance and enhancement. This is especially important if you are going to allow clones of your bot. Ideally, you should post the source code of your bot on its userpage or in a revision control system (see #Open-source bots) if you want others to be able to run clones of it. This code should be well documented (usually using comments) for ease of use.

Queries/Complaints

You should be ready to respond to queries about or objections to your bot on your user talk page, especially if it is operating in a potentially sensitive area.

Maintenance

Maintaining and enhancing your bot to cope with newly discovered bugs or new requirements can take far more time than the initial development of the software. Not only may it be necessary to add code that does not fit the original design, but just determining how software works at some point after it is completed may require significant effort (this is another reason to document your code as you go along).

General guidelines for running a bot

The official bot policy covers the main points to consider when developing your bot. In addition, there are a number of more general advisory points to keep in mind.

Bot best practices

  • Set a custom User-Agent header for your bot (per the Wikimedia User-Agent policy, if your bot will be operating on Wikimedia wikis). If you don't, your bot may encounter errors and may end up blocked at the server level.
  • Use the maxlag parameter with a maximum lag of 5 seconds. This will enable the bot to run quickly when server load is low, and throttle the bot when server load is high.
    • If writing a bot in a framework that does not support maxlag, limit the total requests (read and write requests together) to no more than 10/minute.
  • Use the MediaWiki API whenever possible, and set the query limits to the largest values that the server permits, to minimize the total number of requests that must be made.
  • Edit (write) requests are more expensive in server time than read requests. Be edit-light and design your code to keep edits to a minimum.
    • Try to consolidate edits. One single large edit is better than 10 smaller ones.
  • Enable HTTP persistent connections and compression in your HTTP client library, if possible.
  • Do not make multi-threaded requests. Wait for one server request to complete before beginning another.
  • Back off upon receiving errors from the server. Errors such as timeouts are often an indication of heavy server load. Use a sequence of increasingly longer delays between repeated requests.
  • Make use of assertion to ensure your bot is logged in.
  • Test your code thoroughly before making large automated runs. Individually examine all edits on trial runs to verify they are perfect.
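Several of these practices (maxlag, assertion, and backing off on errors) can be combined in a small wrapper, sketched below in Python. The send callable is hypothetical and stands for whatever function posts parameters to api.php and returns the decoded JSON; the delay values are arbitrary example choices, and the sketch assumes the API signals lag with the error code maxlag.

```python
import time
from typing import Callable, Dict, Iterator


def polite_params(params: Dict[str, str], maxlag: int = 5) -> Dict[str, str]:
    """Add maxlag and a logged-in assertion to any API request."""
    merged = dict(params)
    merged.update({"maxlag": str(maxlag), "assert": "user", "format": "json"})
    return merged


def backoff_delays(first: float = 1.0, factor: float = 2.0, attempts: int = 5) -> Iterator[float]:
    """Yield increasingly longer delays: first, first*factor, first*factor**2, ..."""
    delay = first
    for _ in range(attempts):
        yield delay
        delay *= factor


def call_with_backoff(
    send: Callable[[Dict[str, str]], dict],  # hypothetical transport to api.php
    params: Dict[str, str],
    sleep: Callable[[float], None] = time.sleep,
    attempts: int = 5,
) -> dict:
    """Repeat a request while the server reports lag, waiting longer each time."""
    for delay in backoff_delays(attempts=attempts):
        reply = send(polite_params(params))
        if reply.get("error", {}).get("code") != "maxlag":
            return reply
        sleep(delay)  # one request at a time; wait before trying again
    raise RuntimeError("server still lagged after all retries")
```

The requests are issued strictly one after another, which also satisfies the advice against multi-threaded requests.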

Common bot features you should consider implementing

Manual assistance

If your bot is doing anything that requires judgment or evaluation of context (e.g., correcting spelling) then you should consider making your bot manually-assisted, which means that a human verifies all edits before they are saved. This significantly reduces the bot's speed, but it also significantly reduces errors.

Disabling the bot

It is good bot policy to have a feature to disable the bot's operation if it is requested. Remember that if your bot goes bad, it is your responsibility to clean up after it! You could have the bot refuse to run if a message has been left on its talk page, on the assumption that the message may be a complaint against its activities; this can be checked using the API meta=userinfo query (example on English Wikipedia). Or you could have a page that will turn the bot off if text on the page is changed (e.g. require the page be empty, contain only the word "True", or something similar); this can be checked by loading the page contents before each edit.
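Both checks can be sketched in Python as below. The function names are invented for this example. The sketch assumes that a meta=userinfo query with uiprop=hasmsg places a messages entry under query.userinfo when the account has unread talk page messages, and it uses the "page contains only the word True" convention mentioned above for the off switch.

```python
def userinfo_params() -> dict:
    """Parameters for asking whether the bot account has new talk page messages."""
    return {"action": "query", "meta": "userinfo", "uiprop": "hasmsg", "format": "json"}


def has_new_messages(response: dict) -> bool:
    """Assumption: a 'messages' entry that is present and not False means unread messages."""
    userinfo = response.get("query", {}).get("userinfo", {})
    return userinfo.get("messages") not in (None, False)


def switch_allows_run(page_text: str) -> bool:
    """The off-switch page must contain only the word True for the bot to continue."""
    return page_text.strip() == "True"


def may_edit(switch_text: str, userinfo_response: dict) -> bool:
    """Combine both checks; call this before every edit."""
    return switch_allows_run(switch_text) and not has_new_messages(userinfo_response)
```

Requiring an exact match on the switch page means that any change to it, including blanking or vandalism, stops the bot, which errs on the side of caution.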

Signature

Just like a human, if your bot makes edits to a talk page in MediaWiki, it should sign its post with four tildes (~~~~). Signatures usually belong only on talk namespaces.

Bot Flag

A bot's edits will be visible at Special:RecentChanges, unless the edits are set to indicate a bot. Once the bot has been approved and given its bot flag permission, one can add "bot=True" to the API call (see API:Edit#Parameters) in order to hide the bot's edits in Special:RecentChanges.

In Python, with either mwclient or wikitools, adding bot=True to the edit/save command marks the edit as a bot edit, e.g.

PageObject.edit(text=pagetext, bot=True, summary=pagesummary)

Monitoring the bot status

If the bot is fully automated and performs regular edits, you should periodically check it runs as specified, and its behavior has not been altered by software changes.

Open-source bots

Many bot operators choose to make their code open source, and occasionally it may be required before approval for particularly complex bots. Making your code open source has several advantages:

  • It allows others to review your code for potential bugs. As with prose, it is often difficult for the author of code to adequately review it.
  • Others can use your code to build their own bots. A user new to bot writing may be able to use your code as an example or a template for their own bots.
  • It encourages good security practices, rather than security through obscurity.
  • If you leave the project, it allows other users to run your bot tasks without having to write new code.

Open-source code, while rarely required, is typically encouraged in keeping with the open and transparent nature of wikis, though there are some cases when code should not be made public. For example, the open proxy-finding code of ProcseeBot could be used for malicious purposes on other sites.

Making code open source can add some extra work to coding. One has to make sure that sensitive information such as passwords is separated into a file that isn't made public.
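One common way to do this, sketched below in Python, is to read the credentials from a small configuration file that is excluded from the published repository. The file layout, section name, and function names are assumptions made for this example.

```python
import configparser
from pathlib import Path
from typing import Tuple

# Assumed layout of the private file (never commit it):
#
#   [bot]
#   username = ExampleBot@task
#   password = secret-from-Special-BotPasswords


def parse_credentials(text: str) -> Tuple[str, str]:
    """Extract the username and password from INI-formatted text."""
    parser = configparser.ConfigParser(interpolation=None)  # allow % in passwords
    parser.read_string(text)
    section = parser["bot"]
    return section["username"], section["password"]


def load_credentials(path: str) -> Tuple[str, str]:
    """Read the private credentials file kept outside version control."""
    return parse_credentials(Path(path).read_text(encoding="utf-8"))
```

Listing the credentials file in the repository's ignore file (for example .gitignore) helps prevent it from being published by accident.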

There are several options available for users wishing to make their code open. Some users choose to put the code in a subpage of the bot's userspace, although this can be a hassle to maintain if not automated and results in the code being multi-licensed under the wiki's licensing terms in addition to any other terms you may specify. Another solution is to use a revision control system such as SVN, Git, or Mercurial. Wikipedia has articles comparing the different software options and websites for code hosting, many of which have no cost. Wikimedia also offers Git code repository hosting for its users and running Wikimedia related software tools via Wikimedia Cloud Services.

Programming languages and libraries

See also: API:Client code

Bots can be written in almost any programming language. The choice of a language often depends on the experience of the bot writer (which languages are familiar) or on the availability of pre-developed libraries to perform the desired task. The following list includes some languages that have libraries to assist with bot tasks.

Awk

Perl

If your program is located on a web server, you can start it and interact with it from your browser while it is running, via the w:Common Gateway Interface. If your internet service provider gives you web space, chances are good that the web server has a Perl build with which you can run your Perl programs.

Libraries:

  • MediaWiki::API – Basic interface to the API, allowing scripts to automate editing and extraction of data from MediaWiki driven sites.
  • MediaWiki::Bot – A fairly complete MediaWiki bot framework written in Perl. Provides a higher level of abstraction than MediaWiki::API. Plugins provide administrator and steward functionality. Currently unsupported.

PHP

PHP can also be used for programming bots. MediaWiki developers are already familiar with PHP, since that is the language MediaWiki and its extensions are written in. PHP is an especially good choice if you wish to provide a webform-based interface to your bot. For example, suppose you wanted to create a bot for renaming categories. You could create an HTML form into which you will type the current and desired names of a category. When the form is submitted, your bot could read these inputs, then edit all the articles in the current category and move them to the desired category. (Obviously, any bot with a form interface would need to be secured somehow from random web surfers.)

The PHP bot functions table on English Wikipedia may provide some insight into the capabilities of the major bot frameworks.

PHP Bot frameworks
Key people[php 1] Name PHP Version last update Uses API[php 2] Exclusion compliant Admin functions Plugins Repository Notes
Cyberpower678, Addshore, and Jarry1250 Peachy 5.2.1 2015 GitHub Large framework, currently undergoing rewrite. Documentation currently non-existent, so poke w:User:Cyberpower678 for help.
Addshore mediawiki-api-base 7.4+ 2021 N/A N/A extra libs GitHub Base library for interaction with the MediaWiki API; provides ways to handle logging in and out, handling tokens, and easily getting and posting requests.
Addshore mediawiki-api 7.4+ 2021 No some extra libs GitHub Built on top of mediawiki-api-base, this adds more advanced services for the API such as RevisionGetter, UserGetter, PageDeleter, RevisionPatroller, RevisionSaver etc.
Kaspo Phpwikibot Unknown 2009 Partial No No No Google Code Uses a single class.
Jarry1250 Wikibot 5 2009 No No enwiki Used solely by LivingBot. A fork of Phpwikibot. Uses a single class.
Foxy Loxy PHPediaWiki 5 2009 No No SourceForge Fork of SxWiki
Nzhamstar, Xymph, Waldyrious Wikimate 5.3-5.6, 7.x, 8.x 2023 No No No GitHub Supports main article and file stuff. Authentication, checking if pages exist, reading and editing pages/sections. Getting file information, downloading and uploading files. Aims to be easy to use.
Kaleb Heitzman MediaWIkiBot 5 2012 No No No GitHub Supports the entire API including uploading and importing. Also supports Semantic MediaWiki. Single Class that creates dynamic methods to work with any of the API calls.
Edward Z. Yang Wikipedia Bot in PHP Unknown 2005 No No No No enwiki "Probably stale" source code
Cobi wikibot.classes 5 2010 No No enwiki Used by multiple large bots (e.g. ClueBot and SoxBot). Uses several classes.
Valerio Bozzolan boz-mw 5.6 2019 N/A extra libs GitHub Object-oriented. 80+ classes also to handle Wikidata. Inline documentation. Support for file uploading.
  1. Does not include those who worked on frameworks forked to create listed framework.
  2. Where possible. Excludes uploading images and other such tasks which are not currently supported by the API.

Python

Python is a popular interpreted language with object-oriented features.

Libraries
Python Bot frameworks
Key people[py 1] Name Python Version last update Uses API[py 2] Exclusion compliant Admin functions Plugins Repository Notes
xqt Pywikibot Python 3.7 or higher or PyPy 2023 The most used Python bot framework. Includes ready to use scripts.
Myst WikibaseIntegrator Python 3.7 or higher 2022 Not applicable No GitHub Only to interact with Wikibase instances like Wikidata
Mr.Z-man wikitools 2 2016 GitHub Incompatible with Python 3. (downloads)
Bryan mwclient 2021 GitHub An API-based framework
The Earwig mwparserfromhell 2021 GitHub A Python parser for MediaWiki text
  1. Does not include those who worked on frameworks forked to create listed framework.
  2. Where possible. Excludes uploading images and other such tasks which are not currently supported by the API.

Microsoft .NET

Microsoft .NET is a set of languages including C#, C++/CLI, Visual Basic .NET, J#, JScript .NET, IronPython, and Windows PowerShell. The Microsoft Visual Studio integrated development environment is often used, or the free Microsoft Visual Studio Express versions. Using Mono Project, .NET programs can also run on Linux, Unix, BSD, Solaris and Mac OS X as well as under Windows.

Libraries:

  • DotNetWikiBot Framework – a full-featured client API on .NET that makes it easy to build programs and web robots for managing information on MediaWiki-powered sites. Now translated to several languages. Detailed compiled documentation is available in English.
  • WikiFunctions .NET library – Bundled with AWB, is a library of stuff useful for bots, such as generating lists, loading/editing articles, connecting to the recent changes IRC channel and more.

Java

Java programs are generally developed with an IDE, such as Eclipse or NetBeans; development using a command line console (with the javac and java programs) is also an option.

Libraries:

  • Java Wiki Bot Framework – A Java wiki bot framework
  • wiki-java – A Java wiki bot framework that is only one file
  • WPCleaner – The library used by the WPCleaner tool
  • jwiki – A simple and easy-to-use Java wiki bot framework

JavaScript (Node.js)

JavaScript is a scripting language used mainly on web pages, such as for user scripts added to your vector.js or your monobook.js pages. Using Node.js it is possible to use JavaScript server-side, such as for developing bots.

NodeJS Bot frameworks
Key people[js 1] Name Nodejs Version last update Uses API[js 2] Exclusion compliant Admin functions Package Repository Notes
SD0001 mwn 10+ 2021 npm GitHub Large library with classes for working with page titles and wikitext. Works with TypeScript also. Promise-based API (async/await). Limited wikitext parsing capabilities.
kanashimi wikiapi 0.10–15.x 2021 Partial npm GitHub JavaScript MediaWiki API for node.js with modern ECMAScript 2016 async/await and wikitext parser.
MediaWiki module 2014 GitHub Provides a framework of standard requests (e.g. log in, log out, etc.) as well as a general wrapper method for the MediaWiki API and includes throttling. The module can also be added to your Wikimedia .js page and used as library for on-wiki JS calls.
  1. Does not include those who worked on frameworks forked to create listed framework.
  2. Where possible. Excludes uploading images and other such tasks which are not currently supported by the API.

Ruby

Ruby is a popular dynamic, object-oriented programming language.

Libraries:

  • MediaWiki::Butt - Ruby framework for the API in active development. Tested with versions as up-to-date as CurseGamepedia is.
  • mediawiki/ruby/api, Ruby API client library. Last updated December 2017, no longer maintained, but still works.
  • MediaWiki::Gateway – Ruby framework for the API. Last updated January 2016. No longer in active development, tested up to MediaWiki 1.22, compatible with Wikimedia wikis. Unknown if still works.
  • wikipedia-client - Ruby framework using the API. Last updated March 2018. Unknown if still works.

Common Lisp

  • CL-MediaWiki implements the MediaWiki API as a Common Lisp package. JSON is planned as the query data format. Supports maxlag and assertion.

Haskell

VBScript

VBScript is a scripting language based on the Visual Basic programming language. There are no published bot frameworks for VBScript, but some examples of bots that use it can be seen below:

Examples:

Bash

Bash is a Unix shell.