Requests for comment/Structured data push notification support for recent changes

Structured data push notification support for recent changes: a long title for a goal that has been named and proposed in various forms. Related buzzwords include:
 * Structured data
 * JSON and/or XML
 * Jabber / XMPP
 * WebSockets (HTML5 / AJAX)
 * PubSubHubbub
 * Push Notification Service
 * Socket.io: http://socket.io/

Specification

 * Recent changes packages should be easily readable by machine (JSON)
 * Should not be influenced by local wiki modifications (e.g. Interface messages)
 * Should have a way for a client to present a localized sentence describing the event (i.e. which i18n messages to use, which variables to replace with what)
   * This could probably be done by an API module that returns a map of log types/actions and message keys. With the new logging framework as of 1.18, the order of variables is more logical, making this easier to implement.
 * Properties (depending on the implementation, some could be made optional / toggleable):
   * timestamp
   * user (name, id)
   * user rights (array)
   * user groups (array)
   * page (current pageid, fullpagename, is_redirect)
   * page_namespace (canonical, localized, id)
   * page_title
   * revision size (bytes before, bytes after, bytes diff: "-100" or "+12")
   * revision ids (revision oldid, revision diffid)
   * revision comment (raw comment, parsed comment)
   * url to diff (page edit), page (log event) or oldid (page creation)
   * rc id
   * rc type (new, edit, log)
   * rc flags (rc_minor: m, rc_bot: b, rc_patrolled: !) (should these be in the revision table instead?)
   * tags (revision tags)
   * log-specific data
 * Push order must match order in which events occurred
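
As an illustration, an event with the properties above might serialize to a single JSON string like this (the field names and nesting are hypothetical, not a settled schema):

```javascript
// Hypothetical structured recent-changes event; key names are
// illustrative only, not a settled schema.
const event = {
  timestamp: '2011-12-01T14:05:12Z',
  user: {
    name: 'ExampleUser',
    id: 12345,
    rights: ['edit', 'move'],       // user rights (array)
    groups: ['user', 'autoconfirmed'] // user groups (array)
  },
  page: {
    id: 9980,
    namespace: { id: 0, canonical: '', localized: '' },
    title: 'Example_page',
    fullpagename: 'Example page',
    is_redirect: false
  },
  revision: {
    old_id: 501,
    diff_id: 502,
    size: { before: 1200, after: 1212, diff: '+12' },
    comment: { raw: '/* Intro */ fix typo', parsed: 'Intro: fix typo' }
  },
  url: 'https://example.org/w/index.php?diff=502&oldid=501',
  rc: { id: 777, type: 'edit', flags: { minor: true, bot: false, patrolled: false } },
  tags: []
};

// Subscribers would receive one such JSON string per event:
const packet = JSON.stringify(event);
```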

Current: UDP / netcat / ircd

 * MediaWiki emits a UDP packet to a specified server (see $wgRC2UDPAddress etc.)
 * This packet contains a single localized string, similar to the text of the list item on Special:RecentChanges, though flattened so that it contains no HTML
 * The UDP receiver (netcat) pipes the message as-is to a channel on a known IRC server (ircd running in the background)
 * Clients join the channel through an irc socket

Problems

 * No machine-readable structure (a single localized string instead of key/value pairs)
 * Hard to parse, unstable/variable output:
   * Color-coded IRC markup
   * Requires periodic downloading of interface messages from the target wiki (which can change at any time, either due to software updates or when a user on the local wiki changes the message in the MediaWiki namespace)
   * Messages can be cut off (because IRC has a limited message length). Right now this usually doesn't break the notification, because the last part of the string is the edit summary and there is no closing tag after it, so the receiver simply reads it as if the edit summary were shorter. It only becomes a problem when the message is cut off before the edit summary starts, because then it no longer matches the expected pattern.
 * Not flexible / extensible
 * UDP is (apparently?) unreliable in that packets can go missing or arrive in the wrong order.
   * Is this inherent to how UDP works, or a consequence of the way we use it?
   * Can we fix it, or do we have to use UDP this way to stay performant enough (since the packet is emitted from within the web request response on a large number of different web servers)?
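
The cut-off behaviour described above can be demonstrated with a toy parser (the pattern matches the simplified line shape sketched earlier, not the real colour-coded feed):

```javascript
// Toy pattern for a flattened line of the form:
//   [[Title]] url * user * (+12) edit summary
const RC_PATTERN = /^\[\[(.+)\]\] (\S+) \* (\S+) \* \(([+-]\d+)\) (.*)$/;

const line =
  '[[Example page]] https://example.org/?diff=2 * Alice * (+12) revert long-standing vandalism';

// IRC limits message length, so long lines arrive truncated.
const cutInComment = line.slice(0, line.length - 10);      // loses the tail of the summary
const cutBeforeComment = line.slice(0, line.indexOf('(')); // loses the size field onward

// Truncation inside the edit summary still matches; the summary just looks shorter.
const okMatch = RC_PATTERN.exec(cutInComment);
// Truncation before the summary breaks the expected pattern entirely.
const brokenMatch = RC_PATTERN.exec(cutBeforeComment);
```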

Proposal 1) UDP / nodejs / socket-io

 * MediaWiki emits a UDP packet
 * This packet contains a JSON string with stable (localization independent) keys (it would look much like the JSON response of API action=recentchanges)
 * The UDP receiver (nodejs) forwards the JSON string to a topic in the socket.io-powered socket (running in the same Node process)
 * Clients subscribe to topic(s) through the socket, and parse the JSON.

Proposal 2) UDP / { Listener / Cache / Dispatcher } / Client

 * MediaWiki sends a UDP packet in any format (even the current format is fine) to a listener process
 * The listener process parses it and stores it in a cache daemon
 * The cache daemon keeps the information in operating memory for a defined period of time (for example, the last 10,000 structures)
 * Clients connect to the dispatcher (the dispatcher forks itself for security reasons, so each client gets its own process?), specify which format they want to receive data in (XML, JSON, raw structures), and can optionally request a backlog of changes
 * The dispatcher retrieves data from the cache daemon and delivers it to clients in the requested format for as long as the client stays connected
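
The cache daemon and dispatcher roles can be sketched as a fixed-capacity ring buffer plus a format-selecting serializer (all names hypothetical; a real deployment might delegate the storage to an existing product such as memcached):

```javascript
// Fixed-capacity in-memory store for the most recent change structures,
// e.g. the "last 10 000 structures" mentioned above.
class ChangeCache {
  constructor(capacity) {
    this.capacity = capacity;
    this.events = [];
  }
  add(event) {
    this.events.push(event);
    if (this.events.length > this.capacity) this.events.shift(); // evict oldest
  }
  // Return up to n most recent events, oldest first, for backlog requests.
  backlog(n) {
    return this.events.slice(-n);
  }
}

// Dispatcher side: serialize an event in the format the client asked for.
// (XML support elided in this sketch; 'raw' returns the structure untouched.)
function serialize(event, format) {
  switch (format) {
    case 'json': return JSON.stringify(event);
    case 'raw': return event;
    default: throw new Error(`unsupported format: ${format}`);
  }
}
```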

Pros:
 * Needs no modification of MediaWiki at all
 * Very fast and very stable
 * Offers multiple formats to clients

Cons:
 * Keeps all of the current problems (if the same format is used), except that the burden is on us instead of on the subscribers. Most significantly:
   * Localisation
   * Cut-off messages
 * Requires writing some code in C++ :) (we can use some existing products, for example memcached, so that we don't have to write so much of it)