Extension:Quotation

Highly experimental, might trigger a nuclear implosion! ;D

Extension:Quotation is a small extension that checks whether a quoted string is contained within a referenced page. Its purpose is to make it simple to verify that the quote is indeed on that page and, if it is not, to mark the quotes that do not check out, both visually on the page itself and in a list on a special page.

Usage
There are both a tag function and a parser function, with nearly the same functionality. Both take a number of parameters that describe the function and some content that describes what is quoted. The parser function can take several content fragments, while the tag function takes only one.

The extension provides a simple markup scheme in which text fragments ("quotes") can be located in text from pages at external sites. There are two forms of wildcards: an anonymous pattern (an inserted ellipsis) and a replacement pattern (bracketed text). The extension's configuration may limit one or both. By default, the anonymous pattern may extend over any number of characters but only a single punctuation character, and the replacement pattern may cover only a single character.

Parameters

 * refines : A ref-tag to extend with the result of the verification.
 * inline : Valueless marker to tell the function to render an inline quote (q) after the processing. This is the default for the parser function.
 * block : Valueless marker to tell the function to render a block quote (blockquote) after the processing. This is the default for the tag function.
 * format : One of inline or block.
 * src : The actual source to query for a page, and the signature of the external page. This page will then be loaded and parsed, and the actual content matching the quote will be filtered out. The quotes must match according to the built-in rules, and the quotes will be marked accordingly. Failing to match will also mark the page for later retrieval by the special page. If src is missing, the quote will be marked accordingly.
 * href : A link to a user-readable variant of the actual page requested. The pages can be different, and no provision is made to check whether the linked version is in fact equal to the analyzed source page. Changes to href have no consequences for the src parameter. As the link is readily available in the text, it should not be a big problem to weed out any attempt to falsify its destination. The default is to set href to the same value as the src parameter.
 * xpath : What to extract from the source page, given as an XPath description. If the description extracts several fragments, the first one to match is used as the reported one. Reported matches will not extend outside the fragments; that is, if paragraphs are extracted as fragments, the matches will be limited to those fragments and will not be joined.
 * initial : Maximum number of characters in front of the matched quote. This will include any initial wildcard from the quote itself.
 * middle : Maximum number of characters matched by a wildcard in the middle of the quote.
 * final : Maximum number of characters after the matched quote. This will include any final wildcard from the quote itself and also any punctuation character.
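
The fragment behaviour described for the xpath parameter can be illustrated with a minimal sketch. It is not the extension's code: the page content and the function name first_matching_fragment are made up for illustration, and Python's xml.etree.ElementTree only supports a limited XPath subset and needs well-formed input.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed source page; a real page would need an HTML parser.
PAGE = """<html><body>
<p>This is a test foo</p>
<p>bar baz and more</p>
</body></html>"""


def first_matching_fragment(page, path, quote):
    """Return the first fragment selected by `path` that contains `quote`, else None."""
    root = ET.fromstring(page)
    for element in root.findall(path):
        # Each fragment is checked on its own; fragments are never joined.
        text = " ".join("".join(element.itertext()).split())
        if quote in text:
            return text
    return None


print(first_matching_fragment(PAGE, ".//p", "test foo"))  # This is a test foo
print(first_matching_fragment(PAGE, ".//p", "foo bar"))   # None
```

The second call returns None because "foo bar" only exists across the boundary between two extracted paragraphs, which is the case the xpath description rules out.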

Example 1
Assume a source text

This is a test foo bar baz…

Then there are two formats, with two different results (this is only shown for the tag function)

Both forms give the same text encapsulated in a q or blockquote element, in one of several forms. The first, third and fifth results passed validation, while the second, fourth and sixth did not.

Example 2
Assume we do something like this
 * <quote src="url" initial="10" final="10">Duis aute irure dolor in reprehenderit in … fugiat nulla pariatur.</quote>
 * <quote src="url" initial="10" final="10">Duis aute irure dolor in reprehenderit in [some replacement text] fugiat nulla pariatur.</quote>

with a src link to a text like this
 * Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.

The two quotes would then produce the inline HTML code
 * <q class="quote-valid" title="">Duis aute irure dolor in reprehenderit in … fugiat nulla pariatur.</q><ref>…onsequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteu…</ref>
 * <q class="quote-valid" title="">Duis aute irure dolor in reprehenderit in [some replacement text] fugiat nulla pariatur.</q><ref>…onsequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteu…</ref>
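
How Example 2 could arrive at its result can be sketched as follows. This is an approximation under stated assumptions, not the extension's implementation: the names quote_to_regex and verify are hypothetical, the set of punctuation characters is a guess, and the trailing full stop of the quote is counted against final, which is one reading of the final parameter that happens to reproduce the snippet above.

```python
import re

# The two wildcard forms: an inserted ellipsis or bracketed replacement text.
WILDCARD = re.compile(r"…|\[[^\]]*\]")
TRAILING_PUNCTUATION = ".,;:!?"  # assumption: which characters count as punctuation

SOURCE = (
    "Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi "
    "ut aliquip ex ea commodo consequat. Duis aute irure dolor in "
    "reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla "
    "pariatur. Excepteur sint occaecat cupidatat non proident."
)


def quote_to_regex(quote, middle=None):
    """Turn a quote into a regex; each wildcard matches at most `middle` characters."""
    limit = "" if middle is None else str(middle)
    gap = ".{0,%s}?" % limit
    literals = [re.escape(part) for part in WILDCARD.split(quote)]
    return re.compile(gap.join(literals), re.DOTALL)


def verify(source, quote, initial, final, middle=None):
    """Return the matched text with surrounding context, or None if there is no match."""
    match = quote_to_regex(quote, middle).search(source)
    if match is None:
        return None
    # Assumption: trailing punctuation of the quote is counted as part of `final`.
    trailing = len(quote) - len(quote.rstrip(TRAILING_PUNCTUATION))
    start = max(match.start() - initial, 0)
    end = min(match.end() + max(final - trailing, 0), len(source))
    return "…" + source[start:end] + "…"


print(verify(SOURCE, "Duis aute irure dolor in reprehenderit in … fugiat nulla pariatur.", 10, 10))
print(verify(SOURCE, "Duis aute irure dolor in reprehenderit in [some replacement text] fugiat nulla pariatur.", 10, 10))
```

Both calls print the same context string as the ref elements in the example. Passing a small value for middle (for instance 5) makes both quotes fail, because the wildcard would then have to cover more characters than allowed.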

Algorithm
The quoted (marked) text will be transformed into a single string before the equivalence check, with whitespace squashed, characters transformed into Unicode Normalization Form C, and any inserted ellipsis or bracketed text replaced with a wildcard. The src will then be used to download the source text. The source will be stripped of some elements (tags and their content) that could otherwise create problems (like <script>), the remaining text stripped of all remaining markup, the whitespace squashed, and the characters transformed into Normalization Form C. If the quote is contained within the processed source text, the quote is marked with a class for valid quotes (quote-valid in Example 2); if not, it is marked with a class for invalid quotes. If there is no src, the quote will be neither valid nor invalid. If the quote is verified from a live site, it will also be marked as such.
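
The preprocessing of the source text can be sketched with the Python standard library. This is only an illustration of the steps named above: the class TextExtractor, the functions normalize and source_text, and the choice to drop style elements in addition to script are assumptions, not taken from the extension.

```python
import unicodedata
from html.parser import HTMLParser

SKIPPED = {"script", "style"}  # assumption: the text above only names <script>


class TextExtractor(HTMLParser):
    """Collect text content, dropping skipped elements together with their content."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)


def normalize(text):
    """Transform to Normalization Form C and squash runs of whitespace."""
    return " ".join(unicodedata.normalize("NFC", text).split())


def source_text(html):
    """Strip problematic elements and all remaining markup, then normalize."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return normalize("".join(parser.parts))


PAGE = "<html><head><script>var x = 'This is a test';</script></head><body><p>Cafe\u0301   au\n lait</p></body></html>"
print(source_text(PAGE))
print(normalize("Caf\u00e9  au lait") in source_text(PAGE))
```

The quote and the source go through the same normalize step, so a decomposed "e" plus combining accent in the source still matches a precomposed "é" in the quote, and differences in whitespace do not matter.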

The result of the processing will be saved as a page property for later reuse, and the downloaded page will be saved to disk. No further download will take place before the page is rebuilt. When the page is rebuilt, the previous result stored in the page property will be reused as the default state until the external server replies, possibly with another result. If the page is edited, the previous state is lost, but the quote will then be verified against the downloaded page. If the site or page goes away, the page property will still be used as the default state, but the quote will then not be marked as verified from a live site. When the quote is in an archived state, it will link to the previously downloaded page.
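
The reuse of a stored result can be pictured as a small state resolution step. The sketch below is a guess at the logic, not the extension's code: QuoteState, resolve_state and the field names are hypothetical, and a missing reply stands for both "no reply yet" and "site gone".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class QuoteState:
    valid: Optional[bool]  # None: neither valid nor invalid
    live: bool             # True only when verified against a reply from the live site


def resolve_state(stored: Optional[QuoteState], reply: Optional[bool]) -> QuoteState:
    """Pick the state to render: a fresh reply wins, otherwise fall back to the stored one."""
    if reply is not None:
        # The external server replied; its result replaces the stored default.
        return QuoteState(valid=reply, live=True)
    if stored is not None:
        # No reply: reuse the stored result, but not as a live verification.
        return QuoteState(valid=stored.valid, live=False)
    # Nothing stored and no reply: the quote is neither valid nor invalid.
    return QuoteState(valid=None, live=False)


previous = QuoteState(valid=True, live=True)
print(resolve_state(previous, None))
print(resolve_state(previous, False))
```

Keeping the live flag separate from the verdict lets a quote stay marked as valid from an archived copy without claiming that the external site still carries the text.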

Quotes with src attributes will trigger tracking categories. The page will first go into a category for pending verification, and then into either a category for valid or invalid quotes. Quotes without a src attribute will not be categorized.

Quotes with src attributes will trigger logging. The log entry will be created after a reply from the external site, or on timeout.

Stuff

 * Manual:Job queue/For developers
 * Manual:Purge
 * memcached