User:Frostly/save

Based on initial community consultations, a draft architecture has been prepared. It is subject to change based on further discussion and research. Fortuna is implemented as a MediaWiki extension. Its design is inspired by the MachineVision extension, written by Wikimedia Foundation developers and deployed on Wikimedia production.

Overview
When a new image is uploaded to a wiki, the Fortuna extension schedules a delayed job that checks whether the image is still present (i.e., not deleted) and, if so, requests and stores web matches for the image from one or more API providers. These matches are then filtered and served to reviewers on the Special:SuspiciousFiles page. Images with accepted web matches are tagged for speedy deletion, nominated for a deletion process, or deleted directly (by administrators).

In addition to new uploads, lists of image file page titles may be passed to the maintenance script  to have web matches retrieved and stored on demand.

When web matches are received for an image, an Echo event is fired to notify the uploader and the patroller (if applicable) that web matches are available for review, according to each user's notification preferences.

The extension is designed to support arbitrary API providers (including issuing requests to multiple providers simultaneously). Initial providers planned for implementation are the Google Cloud Vision API, the Bing Visual Search API, the TinEye API and Pixsy.

Image
Images are stored by their SHA1 hash in the  table. This means that if an image file is uploaded that is identical to one for which a record exists in the database, it is the same image for the Fortuna extension's purposes, and matches will not be requested again.
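The deduplication rule above can be illustrated with a minimal sketch. It is not the extension's code: `ImageRegistry` and `sha1_hex` are hypothetical names, and an in-memory set stands in for the database table.

```python
import hashlib


def sha1_hex(data: bytes) -> str:
    """Return the hexadecimal SHA-1 digest of the file contents."""
    return hashlib.sha1(data).hexdigest()


class ImageRegistry:
    """Hypothetical stand-in for the image table, keyed by SHA-1 hash."""

    def __init__(self) -> None:
        self._known = set()

    def register(self, data: bytes) -> bool:
        """Record an image; return True only if it was not seen before.

        A False result means an identical file already has a record,
        so matches would not be requested again.
        """
        digest = sha1_hex(data)
        if digest in self._known:
            return False
        self._known.add(digest)
        return True
```

Two uploads with byte-identical contents produce the same digest, so the second call to `register` returns `False` regardless of the file title.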

The extension only handles bitmap and vector images and disregards all other file types. Bitmap thumbnails are requested for vector files when encountered.

Match
A match is stored as a URL string and an integer match percentage in the  table. Human-readable domains associated with matches are fetched at the point of presentation to the end-user. A match will be associated with an image no more than once, even if the match is subsequently suggested by a different API provider.
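As a rough sketch of the "no more than once per image" rule, the following keeps matches in a dictionary keyed by image hash and URL. `Match`, `MatchStore` and `display_domain` are hypothetical names, and deriving the domain from the URL at display time is only one possible reading of how human-readable domains are obtained.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit


@dataclass(frozen=True)
class Match:
    url: str      # URL of the page or file that matched
    percent: int  # match percentage, 0 to 100


class MatchStore:
    """Hypothetical in-memory stand-in for the match table."""

    def __init__(self) -> None:
        self._by_image = {}  # image SHA-1 -> {url: Match}

    def add(self, image_sha1: str, match: Match) -> bool:
        """Store a match; return False if this URL is already recorded
        for the image, whichever provider suggested it first."""
        matches = self._by_image.setdefault(image_sha1, {})
        if match.url in matches:
            return False
        matches[match.url] = match
        return True

    def matches_for(self, image_sha1: str) -> list:
        return list(self._by_image.get(image_sha1, {}).values())


def display_domain(url: str) -> str:
    """Derive a human-readable domain at presentation time (assumption)."""
    return urlsplit(url).hostname or ""
```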

Waiting period
A waiting period is enforced between upload time and the submission of an image to an API provider for web matches. This is to reduce the likelihood of making a search request for an image that is soon to be deleted. The waiting period is planned to be 48 hours by default. This value is configured in.
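The timing rule can be sketched as follows, assuming the planned 48-hour default. The function and constant names are illustrative only; the real configuration key is not shown here.

```python
from datetime import datetime, timedelta, timezone

# Planned default from the text; the actual setting name is not assumed.
DEFAULT_WAITING_PERIOD = timedelta(hours=48)


def release_time(uploaded_at: datetime,
                 waiting_period: timedelta = DEFAULT_WAITING_PERIOD) -> datetime:
    """Earliest time at which the image may be sent to a provider."""
    return uploaded_at + waiting_period


def should_request_matches(now: datetime,
                           uploaded_at: datetime,
                           file_exists: bool,
                           waiting_period: timedelta = DEFAULT_WAITING_PERIOD) -> bool:
    """Request matches only once the waiting period has elapsed
    and the file has not been deleted in the meantime."""
    return file_exists and now >= release_time(uploaded_at, waiting_period)


uploaded = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(release_time(uploaded).isoformat())
```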

Review state
Review state is a critical concept in the Fortuna extension, because it governs which images are presented on Special:SuspiciousFiles, and to which audiences. The review states are represented as integers, with a default state of 0 (unreviewed). Possible states include the following:


 * Unreviewed (0): The default match review state. The match may be presented in either the "popular" or "user uploads" tab on Special:SuspiciousFiles.
 * Accepted (1): The match was accepted by a contributor. A deletion request, tag or direct deletion should have been performed, and afterwards it should no longer appear on Special:SuspiciousFiles.
 * Rejected (-1): The match was rejected by a contributor. It should no longer appear on Special:SuspiciousFiles.
 * Withhold from "popular" (-2): The initial review state for a match which is unreviewed but should be withheld from the "popular" tab and only shown on manual searches in the "user uploads" tab. A match may receive this review state based on the SafeSearch ratings of the image to which it pertains.
 * Withhold from all (-3): The review state for a match pertaining to an image which should be withheld completely from Special:SuspiciousFiles. Files can be put into this review state through a maintenance script,.
 * Not displayed (-4): A special review state assigned to matches when an attempt to display them fails because a match could not be found. This results in the match no longer being shown on Special:SuspiciousFiles.
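The states listed above map naturally onto an integer enumeration. The sketch below uses the integer values from the list; the member names and the `visible_tabs` helper are hypothetical and only summarize which tab each state may appear in.

```python
from enum import IntEnum


class ReviewState(IntEnum):
    UNREVIEWED = 0
    ACCEPTED = 1
    REJECTED = -1
    WITHHOLD_POPULAR = -2
    WITHHOLD_ALL = -3
    NOT_DISPLAYED = -4


def visible_tabs(state: ReviewState) -> frozenset:
    """Tabs on Special:SuspiciousFiles in which a match may be presented."""
    if state is ReviewState.UNREVIEWED:
        return frozenset({"popular", "user uploads"})
    if state is ReviewState.WITHHOLD_POPULAR:
        return frozenset({"user uploads"})
    # Accepted, rejected, withheld-from-all and not-displayed matches
    # are no longer shown anywhere.
    return frozenset()
```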

Feed filtering
Feed filters are distinct from system-wide filters (see "Image and system-wide filtering" below). They can be applied by reviewers on Special:SuspiciousFiles. Planned filters include uploader edit count, uploader rights, and whether an image is patrolled. More filters will be added with community consultation.
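One way the planned feed filters could combine is sketched below. The field names, parameter names and the "exclude if the uploader holds any listed right" semantics are all assumptions made for illustration; the actual filter behaviour is still open to consultation.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List, Optional


@dataclass(frozen=True)
class FeedItem:
    title: str
    uploader_edit_count: int
    uploader_rights: FrozenSet[str]
    patrolled: bool


def apply_feed_filters(items: Iterable[FeedItem],
                       max_edit_count: Optional[int] = None,
                       exclude_rights: FrozenSet[str] = frozenset(),
                       patrolled: Optional[bool] = None) -> List[FeedItem]:
    """Keep only items passing every filter that is set (None = not set)."""
    kept = []
    for item in items:
        if max_edit_count is not None and item.uploader_edit_count > max_edit_count:
            continue
        if item.uploader_rights & exclude_rights:
            continue
        if patrolled is not None and item.patrolled != patrolled:
            continue
        kept.append(item)
    return kept
```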

New uploads
In a handler for the UploadComplete hook, the Fortuna extension checks whether the uploaded file is a bitmap or vector image. If so, and if the extension is configured to request matches for new uploads, the extension creates a new  and enqueues it on the job queue. If a waiting period is configured, the job is created with a  value of the current time plus the configured waiting period. When the job is executed, if the file still exists (i.e., has not been deleted), a request for matches is created and sent to Google Cloud Vision via. SafeSearch annotations are requested from an internal WMF service utilizing Google Cloud Vision.

When a response is received, system-wide filters are applied (see "Image and system-wide filtering" below). If any matches remain after filtering, they are stored in the database, and an Echo event is fired to trigger a notification to the uploader and patroller that matches are available for review. Matches are eventually served on Special:SuspiciousFiles and updated with reviewers' votes.

If no matches are found from Cloud Vision, the extension queues requests to additional providers. Providers are tried in the following order, from least expensive to most expensive, in order to minimize costs:
 * Google Cloud Vision (through GoogleCloudVisionClient)
 * Bing Visual Search
 * Pixsy
 * TinEye API (through TinEyeAPI.php)
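The fallback order can be sketched as a loop that stops at the first provider returning any matches. This is a simplified, synchronous illustration; `fetch_matches` and the provider callables are hypothetical, whereas the extension itself would queue jobs rather than call providers inline.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# A provider is a (name, fetch) pair; fetch takes an image SHA-1 and
# returns a list of matching URLs. Both are illustrative assumptions.
Provider = Tuple[str, Callable[[str], List[str]]]


def fetch_matches(image_sha1: str,
                  providers: Sequence[Provider]) -> Tuple[Optional[str], List[str]]:
    """Try providers in order (cheapest first) and stop at the first
    one that returns at least one match."""
    for name, fetch in providers:
        results = fetch(image_sha1)
        if results:
            return name, results
    return None, []
```

Passing the providers in cost order means more expensive services are only contacted when every cheaper one has come back empty.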

Custom image lists
The match lifecycle for matches fetched through  for custom image lists is similar to that for new uploads. The main difference is that instead of scheduling match-fetching jobs,  directly invokes   for each image on the list.

Image and system-wide filtering
Label suggestions have multiple filters applied in  before storage, and each operates differently from the others.

The first filtering pass, based on, is intended to withhold images completely from being shown on Special:SuggestedTags. If a label in  is among the suggested labels returned for an image, the initial review state for all suggested labels is set to WITHHOLD_ALL, which has the effect of excluding it completely. The image is not shown in either the "popular" or "personal uploads" tab on Special:SuggestedTags. The labels are, however, retained in the database.

The second filtering pass, based on, conditionally withholds images from the "popular" tab. If an image receives a SafeSearch rating that exceeds the allowed value on any of the configured dimensions, it is withheld from the "popular" tab but still available to the uploader in the "personal uploads" tab on Special:SuggestedTags. All suggested labels are retained in the database.

The third and final pass, based on, is intended to discard specific label suggestions judged not to be useful to the projects. Suggestions corresponding to labels in  are simply discarded before the remaining suggested labels are stored.
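The three passes described above could be combined roughly as follows. All names here are hypothetical, the configuration lists are passed in as plain arguments because their real names are not given, SafeSearch ratings are modelled as integers, and the precedence of the first pass over the second is an assumption.

```python
from typing import Dict, Iterable, List, Set, Tuple

# Integer review states, matching the values used earlier in this page.
UNREVIEWED = 0
WITHHOLD_POPULAR = -2
WITHHOLD_ALL = -3


def filter_suggestions(labels: Iterable[str],
                       safesearch: Dict[str, int],
                       withhold_all_labels: Set[str],
                       safesearch_limits: Dict[str, int],
                       discard_labels: Set[str]) -> Tuple[int, List[str]]:
    """Return (initial review state, suggestions to store)."""
    labels = list(labels)
    state = UNREVIEWED

    # Pass 1: any listed label withholds the image completely.
    if any(label in withhold_all_labels for label in labels):
        state = WITHHOLD_ALL
    # Pass 2: a SafeSearch rating above the allowed value on any
    # configured dimension withholds the image from the "popular" tab.
    elif any(safesearch.get(dim, 0) > limit
             for dim, limit in safesearch_limits.items()):
        state = WITHHOLD_POPULAR

    # Pass 3: drop suggestions judged not useful; the rest are stored.
    kept = [label for label in labels if label not in discard_labels]
    return state, kept
```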

Review state and data model
There is a conceptual mismatch between the extension's data model and its presentation layer. Because the extension was written to support multiple providers, review state is a property of a label rather than an image. In practice, however, all labels for an image are reviewed at once on Special:SuggestedTags on an image-by-image basis, and there is only one labeling provider (Google). This means that in practice, the data model is unnecessarily complicated; an image's eligibility for presentation on Special:SuggestedTags must be derived from the review states of its various suggested labels rather than being stored as a property of the image itself. Besides being needlessly confusing, this created early problems with query performance.

Redirects and deletions
A common source of bugs is that values in the Freebase-Wikidata mappings may refer to a Wikidata item which has been redirected or deleted. The code attempts to resolve redirects as needed to mitigate the effects of outdated mappings, but it is not perfect.