Incremental dumps/File format/XML output
The XML output from incremental dumps should be exactly the same as the current XML dumps, with the following exceptions. Any difference not listed here is most likely a bug and should be reported.
The exceptions (from most serious to least):
- Revisions of a page are ordered by their id in history dumps. The current XML dumps don't specify any order.
- The <restrictions> tag is omitted.
  The page_restrictions field in the database is not used anymore, so the <restrictions> tag doesn't provide accurate information about the restrictions of a page.
- The id attribute is missing for the <text> tag in stub dumps.
  This attribute is currently used in the dump infrastructure for creating pages dumps, but is not useful to users.
- Comments that are 255 bytes long and end in an invalid UTF-8 sequence are shortened.
  In the current dumps, the invalid sequence is replaced with U+FFFD REPLACEMENT CHARACTER. In the XML produced by incremental dumps, the invalid sequence is removed.
  This applies only to the last character of full-length comments. In other cases, incremental dumps use U+FFFD REPLACEMENT CHARACTER, just like current dumps.
- Anonymous IPv6 contributors whose address is not in full form (i.e. it contains ::) will be normalized to full form. This should be very rare; the addresses should almost always be in full form already.
- The minor tag is consistently written as <minor /> (with a space).
  In current dumps, this is inconsistent: pages dumps use <minor />, while stub dumps use <minor/>.
  This could affect users who read the dumps using regular expressions or similar methods; it doesn't make any difference for those who use XML parsers.
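
The last three exceptions matter mostly to tools that compare dump text directly rather than through an XML parser. The following Python sketch shows one way such a tool might tolerate them. It is only an illustration, not part of the dump software: the helper names (has_minor_flag, same_ipv6, comparable_comment) are hypothetical, and the comment helper assumes that a truncated UTF-8 character leaves between 1 and 3 stray bytes at the end of a 255-byte comment.

```python
import ipaddress
import re

# Matches both spellings of the empty element: <minor/> and <minor />.
MINOR_RE = re.compile(r"<minor\s*/>")

REPLACEMENT_CHAR = "\ufffd"   # U+FFFD REPLACEMENT CHARACTER
MAX_COMMENT_BYTES = 255       # comment length limit mentioned above


def has_minor_flag(revision_xml: str) -> bool:
    """Return True if the raw revision text contains a minor tag,
    regardless of whether it is written with or without the space."""
    return MINOR_RE.search(revision_xml) is not None


def same_ipv6(a: str, b: str) -> bool:
    """Compare two IPv6 addresses by value, so that a compressed form
    (containing ::) and a full form of the same address are equal."""
    return ipaddress.IPv6Address(a) == ipaddress.IPv6Address(b)


def comparable_comment(comment: str) -> str:
    """Drop a trailing U+FFFD from a comment that looks like a truncated
    full-length comment, so that text from current dumps compares equal
    to text from incremental dumps.

    Assumption: the character cut off at the 255-byte limit left 1 to 3
    stray bytes, so the valid prefix is 252 to 254 bytes long. Shorter
    comments ending in U+FFFD are left unchanged.
    """
    if comment.endswith(REPLACEMENT_CHAR):
        prefix = comment[:-1]
        stray = MAX_COMMENT_BYTES - len(prefix.encode("utf-8"))
        if 1 <= stray <= 3:
            return prefix
    return comment
```

With these helpers, has_minor_flag accepts either spelling of the tag, same_ipv6 treats 2001:db8::1 and 2001:DB8:0:0:0:0:0:1 as the same contributor, and comparable_comment can be applied to comments from both kinds of dumps before comparing them. Tools that already use an XML parser do not need the first helper.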