My name is Subramanya Sastry (Subbu) and I have been working at the Wikimedia Foundation since May 2012 as a senior software engineer and a member of the Parsoid team.
I work for or provide services to the Wikimedia Foundation, and this is the account I intend to use for edits or statements I make in that role. However, the Foundation does not vet all my activity, and edits, statements, or other contributions made by this account may not reflect views of the Foundation.
Wiki pages with wikitext use cases/tests
Other useful wiki pages to test against
- MediaWiki Formatting Help page: Help:Formatting
- Big page that can be a stressor: en:Wikipedia:Village_pump_(technical)
Notes I am making as I work through the code, algorithms, and strategies for parsing wikitext in the context of the Visual Editor project. These notes may reflect a partial understanding, or even a misunderstanding, of the issues involved, and are more notes to myself than anything else.
While the specific newline issues that led to the formulation of the note have mostly been addressed, the broad idea contained in the above note is applicable and possibly useful in a more general sense, not just for whitespace: use the original wikitext to serialize most of the unmodified text. This has an added benefit: for minor edits, there is no need to serialize a humongous DOM. For example, if someone corrects a typo on the Barack Obama page, does it make sense to really re-serialize everything? Is it simpler to issue a patch request to the PHP service to string-replace specific sections of the original wikitext?
More generally, it may be useful to think of serialization as a diff-and-patch operation in certain contexts, where applicable. I am not sure how easy this will be to do, but it is something to consider for large pages, where, progressively, changes will mostly be minor relative to the size of the page. Serialization has to be complete in and of itself to support all use cases and cannot rely on modification hints for correctness. But modification hints from the visual editor could help the serializer optimize performance: focus on the modified bits and patch the source wikitext string rather than regenerate it from scratch.
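As a rough illustration of this diff-and-patch idea (a hypothetical sketch, not Parsoid's actual implementation): if each DOM node records the source offsets of the wikitext it was parsed from, a serializer could re-serialize only the modified nodes and splice their new wikitext into the original string. The function and data shapes below are invented for illustration.

```python
# Hypothetical sketch: patch original wikitext using source offsets
# recorded on DOM nodes. The offsets and edit tuples here are
# illustrative; they are not Parsoid's actual data model.

def patch_wikitext(original, edits):
    """Apply (start, end, new_text) replacements to the original
    wikitext string. Offsets refer to the original string, so edits
    are applied right-to-left to keep earlier offsets valid."""
    result = original
    for start, end, new_text in sorted(edits, reverse=True):
        result = result[:start] + new_text + result[end:]
    return result

original = "'''Barack Obama''' is the 44th president of the US."
# Suppose the only edit replaced the text "US" (source offsets 48..50)
# with "United States"; everything else is emitted verbatim.
edits = [(48, 50, "United States")]
print(patch_wikitext(original, edits))
# prints: '''Barack Obama''' is the 44th president of the United States.
```

The key property is that untouched wikitext survives byte-for-byte, so formatting quirks the parser normalizes away are preserved on round-trip.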
Once a page has attained a certain size, a lot of edits on the page are likely going to be "minor" relative to the size of the page -- this is especially true for humongous pages. So, here is another idea for supporting high-performance edits.
Consider a page P and let r be its latest revision; denote this revision of the page P_r. Then, let us consider n edits of the page. These edits would produce page revisions P_{r+1}, ..., P_{r+n}. In regular operation, if you fetch a revision P_{r+i}, you would fetch the wikitext for that revision, parse it, and send it to the VE. The new edit would then have to be serialized back to wikitext before being stored in the DB. But, given the earlier observation about minor edits, here is one way to improve on this to deliver better performance.
Let D_r be the DOM produced by Parsoid when it parses P_r. If this version is cached on disk, then to produce the DOM for page revision P_{r+k}, you would process the edit transactions E_{r+1}, ..., E_{r+k} and transform D_r into D_{r+k}. If these edit transactions can be supported easily and efficiently, then there is no need to parse wikitext on every fetch. You might also decide to always cache the latest revision. So, the basic operation to support is: apply an edit transaction E onto an existing DOM D. You need a representation for E and an efficient way to apply it. This is all a pie-in-the-sky idea at this time; I am documenting it here for consideration in the future.
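To make the basic operation concrete, here is a minimal sketch of applying an edit transaction E to a cached DOM D. The `Node` class and the transaction format (a path of child indices plus an operation) are invented for illustration; a real system would operate on Parsoid's HTML DOM and whatever transaction format the VE emits.

```python
# Hypothetical sketch of "apply edit transaction E onto cached DOM D".
# The tree and transaction representations below are illustrative only.

class Node:
    def __init__(self, name, children=None, text=""):
        self.name = name
        self.children = children if children is not None else []
        self.text = text

def apply_transaction(dom, txn):
    """txn = {"path": [child indices...], "op": "set-text", "text": ...}.
    Walk from the root along the path to the target node and apply
    the operation in place; return the (mutated) DOM."""
    node = dom
    for idx in txn["path"]:
        node = node.children[idx]
    if txn["op"] == "set-text":
        node.text = txn["text"]
    else:
        raise ValueError("unsupported op: %s" % txn["op"])
    return dom

# D_r: cached DOM for revision r; one transaction produces D_{r+1}
# without re-parsing any wikitext.
dom = Node("body", [Node("p", text="Helo world")])
apply_transaction(dom, {"path": [0], "op": "set-text", "text": "Hello world"})
print(dom.children[0].text)  # prints: Hello world
```

The attraction is that applying k small transactions is O(size of the edits), whereas re-parsing is O(size of the page), which is exactly the gap that grows on humongous pages.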