Extension:Quotation

Extension:Quotation is a small extension that checks whether a quoted string is actually contained on the referenced page. Its purpose is to make it simple to verify that a quote appears on the page it is attributed to and, if it does not, to mark the quote that fails the check, both visually on the page itself and in a list on a special page.

Usage
There is both a tag function and a parser function, with nearly the same functionality. Both take a number of parameters that describe the processing and some content that describes what is quoted. The parser function can take several content fragments, while the tag function takes only one.

Parameters

 * inline : Valueless marker to tell the function to render an inline quote (q) after the processing. This is the default for the parser function.
 * block : Valueless marker to tell the function to render a block quote (blockquote) after the processing. This is the default for the tag function.
 * format
 * src : The source to query for the page. The page is loaded and parsed, and the content matching the quote is extracted. The quote must match the source according to the built-in rules and is marked accordingly. A quote that fails to match also marks the page for later retrieval by the special page. If src is missing, the quote is marked accordingly.
 * href : A link to a user-readable variant of the requested page. The two pages can differ, and no provision is made to check whether the linked version is in fact equal to the analyzed source page. Changing href has no effect on the src parameter. As the link is readily visible in the text, it should not be a big problem to weed out attempts to falsify the destination. By default, href is set to the same value as src.
 * xpath : What to extract from the source page, given as an XPath expression. If the expression extracts several fragments, the first one to match is the one reported. Reported matches do not extend outside a fragment: if paragraphs are extracted as fragments, a match is limited to a single paragraph and will not join up across paragraphs.
 * initial : Maximum number of characters in front of the matched quote. This will include any initial wildcard from the quote itself.
 * middle : Maximum number of characters matched by a wildcard in the middle of the quote.
 * final : Maximum number of characters after the matched quote. This will include any final wildcard from the quote itself and also any punctuation character.
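
The fragment rule described for the xpath parameter can be sketched as follows. This is only an illustration in Python, not the extension's code: the function name `first_matching_fragment` is hypothetical, the sample page is made up, and the standard library's ElementTree is used, which supports only a subset of XPath and needs well-formed markup.

```python
import xml.etree.ElementTree as ET


def first_matching_fragment(source_xml, xpath, quote):
    """Return the text of the first fragment selected by `xpath` that
    contains `quote`, or None if no single fragment contains it."""
    root = ET.fromstring(source_xml)
    for element in root.findall(xpath):
        # Text of the fragment, including nested elements, whitespace squashed.
        text = " ".join("".join(element.itertext()).split())
        if quote in text:
            return text
    return None


# Hypothetical, well-formed sample page.
page = """<html><body>
<p>Lorem ipsum dolor sit amet.</p>
<p>Duis aute irure dolor in reprehenderit.</p>
<p>Duis aute irure dolor, repeated later.</p>
</body></html>"""

# The first paragraph that contains the quote is the one reported.
print(first_matching_fragment(page, ".//p", "Duis aute irure dolor"))
# A quote that spans two paragraphs is not matched, because fragments do not join up.
print(first_matching_fragment(page, ".//p", "sit amet. Duis aute"))
```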

Example
<quote src="url" format="inline">some text that is quoted</quote>
"some text that is quoted"

Both forms give the same text encapsulated in a cite element, in one of several forms.
<q class="quote-valid" title="">some text that is quoted</q>
<blockquote class="quote-valid" title="">some text that is quoted</blockquote>

The first form has passed validation, the second has not, and the third could not complete validation for some reason.

Assume the src link points to a text like this:
 * Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.

<quote src="url" initial="10" final="10" format="inline">Duis aute irure dolor in reprehenderit in [...] fugiat nulla pariatur.</quote>
"Duis aute irure dolor in reprehenderit in [some replacement text] fugiat nulla pariatur."

The two quotes would then produce the HTML code
<q class="quote-valid" title="">some text that is quoted</q><ref>…onsequat. Duis aute irure dolor in reprehenderit in [...] fugiat nulla pariatur. Excepteu</ref>
<q class="quote-valid" title="">some text that is quoted</q><ref>…onsequat. Duis aute irure dolor in reprehenderit in [some replacement text] fugiat nulla pariatur. Excepteu</ref>
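
How the initial and final limits bound the reported context can be sketched as follows. This is a minimal illustration in Python, not the extension's code: the function name `quote_with_context` is hypothetical, wildcards are left out to keep it short, and counting the trailing punctuation of the quote against the final limit is an assumption drawn from the parameter description.

```python
import string

SOURCE = (
    "Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris "
    "nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in "
    "reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla "
    "pariatur. Excepteur sint occaecat cupidatat non proident."
)


def quote_with_context(source, quote, initial, final):
    """Return the quote with at most `initial` characters in front of it
    and at most `final` characters after it, or None if it is not found."""
    # Assumption: trailing punctuation is counted against `final`,
    # so it is left out of the part that must match.
    core = quote.rstrip(string.punctuation)
    start = source.find(core)
    if start < 0:
        return None
    end = start + len(core)
    before = source[max(0, start - initial):start]
    after = source[end:end + final]
    # Mark the cut at the front with an ellipsis, as in the example output.
    prefix = "…" if start > initial else ""
    return prefix + before + core + after


quote = (
    "Duis aute irure dolor in reprehenderit in voluptate velit esse "
    "cillum dolore eu fugiat nulla pariatur."
)
print(quote_with_context(SOURCE, quote, initial=10, final=10))
```

With these limits the context begins at "…onsequat. " and ends at "pariatur. Excepteu", the same boundaries as in the reference text shown above.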

Algorithm
The quoted (marked) text is transformed into a single string before the equivalence check: whitespace is squashed, characters are converted to Unicode Normalization Form C, and inserted bracketed text is removed. The src is then used to download the source text. The source is first stripped of elements (tags and their content) that could otherwise cause problems, such as <script>; the remaining text is then stripped of all remaining tags, its whitespace is squashed, and its characters are converted to Normalization Form C. If the quote is contained within the processed source text, the quote is marked as valid; if not, it is marked as invalid. This is done by setting a class on the containing cite element. If there is no src, the quote is neither valid nor invalid. A quote that is verified against a live site is also marked as alive.
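
The normalization and containment check can be sketched roughly as follows. This is an illustration in Python under stated assumptions, not the extension's implementation: the names `TextExtractor`, `source_text`, `quote_pattern` and `check_quote` are hypothetical, the set of skipped elements is a guess, and turning each bracketed insertion into a wildcard of at most `middle` characters (with an arbitrary default of 50) is one possible reading of how removed bracketed text and the middle parameter interact.

```python
import re
import unicodedata
from html.parser import HTMLParser

SKIPPED = {"script", "style"}  # assumption: elements dropped with their content
BRACKETED = re.compile(r"\s*\[[^\]]*\]\s*")  # inserted text such as [...]


class TextExtractor(HTMLParser):
    """Collect text outside the skipped elements, dropping all tags."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)


def normalize(text):
    """Squash whitespace and convert to Normalization Form C."""
    return unicodedata.normalize("NFC", " ".join(text.split()))


def source_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return normalize("".join(parser.parts))


def quote_pattern(quote, middle=50):
    """Build a regex from the quote; bracketed text becomes a bounded gap."""
    pieces = [normalize(p) for p in BRACKETED.split(quote)]
    gap = r".{0,%d}?" % middle
    return re.compile(gap.join(re.escape(p) for p in pieces if p))


def check_quote(quote, html, middle=50):
    """True if valid, False if invalid, None if there is no source."""
    if html is None:
        return None
    return quote_pattern(quote, middle).search(source_text(html)) is not None


# Hypothetical sample source.
html = (
    '<html><head><script>var x = "Duis aute";</script></head><body>'
    "<p>Ut enim ad minim veniam.</p>\n"
    "<p>Duis aute irure dolor in <em>reprehenderit</em> in voluptate velit "
    "esse cillum dolore eu fugiat nulla   pariatur.</p></body></html>"
)
quote = "Duis aute irure dolor in reprehenderit in [...] fugiat nulla pariatur."
print(check_quote(quote, html))             # gap fits within 50 characters
print(check_quote(quote, html, middle=10))  # gap is longer than 10 characters
```

Returning None for a missing source keeps the "neither valid nor invalid" state separate from a failed check, so the caller can choose a different class for each of the three outcomes.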

The result of the processing is stored in memcached for later reuse, with a timeout long enough to cover continuous editing, and is logged for later reference. When the page is cached, the result of the processing is cached with it, so no further processing takes place until the page is next rebuilt. When the page is rebuilt, the result is stored as a page property and becomes the default state for the next run. Normally this state can last from days to weeks or even months. If the site or page goes away, the logged information can still be used, but the quote is then marked as archived according to its last known state.
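
The reuse of results within a timeout can be sketched with a small in-memory stand-in for memcached. Everything here is hypothetical and only illustrates the idea: the class `ResultCache`, the helper `cached_check`, the key built from src and quote, and the 60-second timeout are assumptions, and the real extension would talk to memcached rather than keep a Python dictionary.

```python
import hashlib
import time

_MISS = object()  # distinguishes "not cached" from a cached False or None


class ResultCache:
    """In-memory stand-in for memcached with a fixed timeout per entry."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self.entries = {}

    @staticmethod
    def key(src, quote):
        # Assumption: a result is identified by the source and the quote.
        return hashlib.sha256(f"{src}\n{quote}".encode("utf-8")).hexdigest()

    def get(self, src, quote):
        entry = self.entries.get(self.key(src, quote))
        if entry is None:
            return _MISS
        expires, result = entry
        return result if self.clock() < expires else _MISS

    def set(self, src, quote, result):
        self.entries[self.key(src, quote)] = (self.clock() + self.ttl, result)


def cached_check(cache, src, quote, verify):
    """Run `verify(src, quote)` only when no unexpired result is stored."""
    result = cache.get(src, quote)
    if result is _MISS:
        result = verify(src, quote)
        cache.set(src, quote, result)
    return result


# Usage with a fake clock and a fake verifier that records its calls.
now = [0.0]
calls = []


def fake_verify(src, quote):
    calls.append((src, quote))
    return True


cache = ResultCache(ttl_seconds=60, clock=lambda: now[0])
cached_check(cache, "url", "some text", fake_verify)  # processed
cached_check(cache, "url", "some text", fake_verify)  # served from the cache
now[0] = 61.0
cached_check(cache, "url", "some text", fake_verify)  # expired, processed again
print(len(calls))
```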

A special page will be available for looking up the stored page properties, making it possible to find pages with invalid or outdated quotes.

The logged information could include additional information, like the quote itself with additional context.

Stuff

 * Manual:Job queue/For developers
 * Manual:Purge
 * memcached