User:Leucosticte/AutoBlurb

This is a proposal for an extension to:
 * 1) Automatically generate blurbs (e.g. featured article blurbs) from article text;
 * 2) Evaluate articles to determine whether they are suitable for featuring; and
 * 3) Randomly select articles that meet the criteria in #2, above.

The purpose of this page is to start a dialog on what exact functionality is desired and how it should be implemented.

Purpose
The main purpose of this extension would be to automate the rotation of randomly selected featured articles. Some wikis, e.g. RationalWiki, currently do this by means of Extension:RandomSelection. It would be possible to create an extension that randomly features every article in, say, mainspace that meets certain criteria. However, each such article would need a featured article blurb (such as what you see at, say, w:Wikipedia:Today's featured article/January 1, 2012 or rationalwiki:Template:Cover abstract/Peer review). Such blurbs are labor-intensive to create, and they go out of date when the article changes. Therefore, it would be desirable to come up with a way to auto-generate them.

Lead length measurement
Any article with an acceptable lead can be featured. The lead is measured as the portion of the article text that ends at the first ==, which marks the beginning of the first section. It could also end at a [[Category link, as that would mark the end of the article. The lead must meet a certain minimum length to be accepted, because we don't want a substub showing up as a featured article: perhaps 1,024 characters, or some other round number like that, as an absolute minimum cutoff. There might be an absolute maximum too, say, 3,072 characters.
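
As a rough illustration of the measurement described above, here is a minimal Python sketch (not the extension's PHP). The function names and the two constants are hypothetical; the constants simply reuse the 1,024 and 3,072 figures floated above. It assumes the first == or [[Category in the raw wikitext really does mark the end of the lead, which will not hold if either appears inside a template or a nowiki block.

```python
import re

# Hypothetical defaults, taken from the figures suggested in the text.
MIN_LEAD_CHARS = 1024
MAX_LEAD_CHARS = 3072

# The lead ends at the first section heading ("==") or the first category link.
LEAD_END = re.compile(r"==|\[\[Category", re.IGNORECASE)


def extract_lead(wikitext):
    """Return the text that precedes the first '==' or '[[Category' marker."""
    match = LEAD_END.search(wikitext)
    end = match.start() if match else len(wikitext)
    return wikitext[:end].strip()


def lead_is_acceptable(lead, minimum=MIN_LEAD_CHARS, maximum=MAX_LEAD_CHARS):
    """True if the lead length falls inside the allowed character range."""
    return minimum <= len(lead) <= maximum
```

An article with no section headings and no categories is treated as being all lead, which may or may not be the desired behavior.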

The blurb will be automatically extracted from the lead, taking as many paragraphs as necessary to land in the proper character range, or to hit as close to a target number as possible. Thus, we may try to get as close to, say, 2,048 characters as possible. So, if two paragraphs would be 1,500 characters (548 short of the target) and three paragraphs would be 2,500 characters (452 over it), it'll go with the three paragraphs.
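
The paragraph selection could be sketched as follows, in Python rather than the extension's PHP. The name choose_paragraphs is hypothetical, paragraphs are assumed to be separated by blank lines, and only the closest-to-target rule is implemented (the minimum and maximum cutoffs are left out).

```python
def choose_paragraphs(lead, target=2048):
    """Keep the leading paragraphs whose combined length is closest to target."""
    paragraphs = [p.strip() for p in lead.split("\n\n") if p.strip()]
    best_count = 0
    best_distance = None
    running_length = 0
    for count, paragraph in enumerate(paragraphs, start=1):
        running_length += len(paragraph)
        distance = abs(running_length - target)
        # Strict comparison: on a tie, the shorter blurb wins.
        if best_distance is None or distance < best_distance:
            best_count = count
            best_distance = distance
    return "\n\n".join(paragraphs[:best_count])
```

With paragraphs of 750, 750 and 1,000 characters, the running totals are 750, 1,500 and 2,500, so the sketch keeps all three, matching the example above. Whether a tie should favor the shorter or the longer blurb is an open design choice.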

Those 2,048 characters include brackets but exclude any tags (e.g. references enclosed in <ref> tags) and templates (e.g. infoboxes), except maybe a few whitelisted ones, such as Template:W. They also exclude any image files. However, the extension will still use the first image as the blurb's illustration. There might even be ways to detect the caption, even for infoboxes, and use it, when available, as the text you see when you hover over the picture in the blurb.
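
To make the counting rule concrete, here is a hedged Python sketch (again, not the extension's PHP). All names are hypothetical, the whitelist contains only the W template mentioned above, and the tag handling is a plain regular expression rather than a real parser, so unusual markup may slip through.

```python
import re

# Hypothetical whitelist of template names, compared in lowercase.
WHITELISTED_TEMPLATES = {"w"}

# <ref ... /> and <ref ...>...</ref>, removed together with their contents.
REF_BLOCK = re.compile(r"<ref\b[^>]*?/>|<ref\b[^>]*>.*?</ref>",
                       re.IGNORECASE | re.DOTALL)
# Any other tag; only the tag itself is removed.
ANY_TAG = re.compile(r"</?[a-zA-Z][^>]*>")


def drop_balanced(text, opener, closer, keep):
    """Remove balanced opener...closer spans unless keep(span) is true."""
    out = []
    i = 0
    while i < len(text):
        if not text.startswith(opener, i):
            out.append(text[i])
            i += 1
            continue
        depth = 0
        j = i
        while j < len(text):
            if text.startswith(opener, j):
                depth += 1
                j += len(opener)
            elif text.startswith(closer, j):
                depth -= 1
                j += len(closer)
                if depth == 0:
                    break
            else:
                j += 1
        span = text[i:j]  # an unclosed span runs to the end of the text
        if keep(span):
            out.append(span)
        i = j
    return "".join(out)


def is_whitelisted_template(span):
    name = span[2:].split("|", 1)[0].split("}}", 1)[0].strip().lower()
    return name in WHITELISTED_TEMPLATES


def is_not_file_link(span):
    return not span[2:].lstrip().lower().startswith(("file:", "image:"))


def countable_text(lead):
    """Return the lead with tags, non-whitelisted templates and images removed."""
    text = REF_BLOCK.sub("", lead)
    text = ANY_TAG.sub("", text)
    text = drop_balanced(text, "{{", "}}", is_whitelisted_template)
    return drop_balanced(text, "[[", "]]", is_not_file_link)


def blurb_length(lead):
    return len(countable_text(lead))
```

Ordinary wikilinks are kept with their brackets, so they count toward the total, as described above.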

Lead processing

 * Detect the first content in an article that's bolded. In the blurb, it would bold that and also wikilink it to the article.
 * Remove all tags (with, perhaps, a few exceptions) and all templates except those whitelisted as okay for inclusion in the blurb.
 * Find the first image in the article, whether as part of an infobox or otherwise, and use that as the illustration for the blurb. Make the caption the same as the image caption. (E.g. if it comes from an infobox, search the infobox for its caption; or, preferably, parse the article and then see what the parsed article uses as the image caption.)
 * Include as many paragraphs, but no more, from the lead as is necessary to get as close to the target blurb length as possible.

Implementation
Fortunately, we already have functions for Special:RandomPage, Special:NewPages and Special:RecentChanges, which could be good starting places; we might be able to cannibalize some of that code. Whitelisting of templates (i.e. allowing their inclusion in the blurb) could be done by adding, say, a magic word to the template.

Configuration

 * A configuration variable could be used to specify the minimum lead length (in characters).
 * A configuration variable could be used to specify the maximum lead length (in characters).
 * A configuration variable could be used to specify the target lead length (in characters); this determines how many paragraphs from the lead should be included.

Code before it was abandoned
Please note that this code was only in a semi-working state. It contains some references to template levels: it assigned a template level to each character in the string (i.e. the revision text), representing how deeply nested inside templates that character was. It then searched the areas at template level 0 (i.e. outside of any templates) for bolded text. That text would presumably be the article title, e.g. in "MediaWiki is a free web-based wiki software application", the bolded "MediaWiki". The "MediaWiki" in that sentence would then be wikilinked to the MediaWiki article. I got all that stuff to work properly.
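
The template-level idea can be reconstructed roughly as below. This is a Python sketch written from the description above, not the original PHP, and the function names are hypothetical. It assumes bold text is marked with three apostrophes and ignores edge cases such as bold-italic (five apostrophes) or bold text spanning several lines.

```python
import re

BOLD = re.compile(r"'''(.+?)'''")


def template_levels(text):
    """Return one nesting depth per character; 0 means outside any template."""
    levels = []
    depth = 0
    i = 0
    while i < len(text):
        if text.startswith("{{", i):
            depth += 1
            levels += [depth, depth]  # the braces belong to the template
            i += 2
        elif text.startswith("}}", i):
            levels += [depth, depth]
            depth = max(depth - 1, 0)  # tolerate stray closing braces
            i += 2
        else:
            levels.append(depth)
            i += 1
    return levels


def link_first_bold(text, title):
    """Wikilink the first bolded phrase that lies entirely at level 0."""
    levels = template_levels(text)
    for match in BOLD.finditer(text):
        if any(levels[match.start():match.end()]):
            continue  # at least part of this bold text is inside a template
        phrase = match.group(1)
        target = title if phrase == title else title + "|" + phrase
        linked = "'''[[" + target + "]]'''"
        return text[:match.start()] + linked + text[match.end():]
    return text
```

Storing a level per character costs memory proportional to the text length, but it makes the "is this span outside all templates?" check a simple slice.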

Some challenges that remained included extracting the images and so on. It would have been necessary to parse the code and find strings with .jpg, .png, etc. extensions. The extension would pull the first such file, at whatever template level, and get rid of all other images and such from that opening snippet. Delimiters that would indicate a likely beginning of an image file name would be, say, an = sign (e.g. image=example.png, in an infobox template) or double left brackets ( [[ ).
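
One way the image search might work is sketched below in Python. It is not a port of the abandoned PHP: it takes the earliest file ending in the text and then the nearest delimiter before it, which is one reading of the description above. The names are hypothetical, and the list of endings beyond .jpg and .png is an assumption.

```python
# Assumed list of endings; only .jpg and .png are named in the text.
FILE_ENDINGS = (".jpg", ".jpeg", ".png", ".gif", ".svg")
# Delimiters that likely precede a file name, per the description.
BEGINNING_INDICATORS = ("=", "[[")


def first_image(wikitext):
    """Return the first image file name found in the wikitext, or None."""
    lowered = wikitext.lower()
    end = None
    for ending in FILE_ENDINGS:
        pos = lowered.find(ending)
        if pos != -1 and (end is None or pos + len(ending) < end):
            end = pos + len(ending)
    if end is None:
        return None
    head = wikitext[:end]
    start = 0
    for indicator in BEGINNING_INDICATORS:
        loc = head.rfind(indicator)
        if loc != -1:
            # Keep the delimiter closest to the file ending.
            start = max(start, loc + len(indicator))
    name = head[start:].strip()
    if name.lower().startswith(("file:", "image:")):
        name = name.split(":", 1)[1].strip()
    return name
```

A file ending that appears in ordinary prose or in a URL would confuse this sketch, which is one reason parsing the article first may be preferable.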

The wiki on which I was going to put this content went non-public, so I felt a less compelling need to finish work on the extension.

AutoBlurb.php
';
#$output = $parser->recursivePreprocess ( $output, $title, $options);
#$output = $thisRevisionText;
/*$lowestWhereIsBeginning = strlen ( $output );
foreach ( $wgAutoBlurbFileEndings as $thisFileEnding ) {
    $whereIsEnding[$thisFileEnding] = strpos ( $output, $thisFileEnding );
    $lowestIndicatorLoc = $whereIsEnding;
    foreach ( $wgAutoBlurbFileBeginningIndicators as $thisIndicator ) {
        $indicatorLoc = strrpos ( $output, $thisIndicator );
        if ( $indicatorLoc < $lowestIndicatorLoc ) {
            #$lowestIndictator = $thisIndicator;
            $lowestIndicatorLoc = $indicatorLoc;
        }
    }
    $whereIsBeginning[$thisFileEnding] = $lowestIndicatorLoc;
    if ( $whereIsBeginning[$thisFileEnding] === false ) {
        $whereIsBeginning[$thisFileEnding] = 0;
    }
    if ( $whereIsBeginning[$thisFileEnding] < $lowestWhereIsBeginning ) {
        $lowestWhereIsBeginning = $whereIsBeginning[ $thisFileEnding ];
        $lowestBeginningFileEnding = $thisFileEnding;
    }
}
return substr ( $output, $lowestWhereIsBeginning, $whereIsEnding[$lowestBeginningFileEnding]);
*/
#return ' '.$output.' ';
return $output;
}
}