Wikimedia Product/NSFW image classifier/Storage notes

(Summary of offline discussion, still being updated)

Each scoring service (there may be multiple in the near future) will provide a floating-point score between 0 and 1, which will be saved and exposed through various means; most notably, it needs to be available to AbuseFilter.

Data table
Primary storage would be in a local database table provided by an extension. Something like:

 * table
 * varchar(255) -> foreign key to the image table (alternately, key to img_sha1, which is smaller and doesn't require rename handling)
 * varchar(32) -> key identifying the scoring source, e.g. 'google-vision'. Potentially this could be an enum to save space, but we know those are maintenance nightmares. If the upstream model changes, the key should probably change.
 * float -> the score value, between 0 and 1
 * varchar(14) -> timestamp of when the scoring was done (worth considering)

These rows would live on the local wiki -- for instance, a Commons image's scores would be directly visible only on Commons.
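The fields above might be sketched as a schema along these lines. All table and column names here are placeholders, not decided names; the timestamp column follows the MediaWiki varchar(14) convention noted in the list:

```sql
-- Hypothetical sketch only; table/column names are placeholders.
CREATE TABLE /*_*/image_score (
  -- Key to the image (or alternately img_sha1, per the note above).
  iscore_image VARCHAR(255) NOT NULL,
  -- Scoring source / model version, e.g. 'google-vision'.
  iscore_source VARCHAR(32) NOT NULL,
  -- The score, between 0 and 1.
  iscore_value FLOAT NOT NULL,
  -- MediaWiki-style timestamp of when scoring was done.
  iscore_timestamp VARCHAR(14) NOT NULL,
  PRIMARY KEY (iscore_image, iscore_source)
) /*$wgDBTableOptions*/;
```

One row per (image, source) pair lets multiple scoring services coexist, and re-scoring with a new model version just adds rows under a new source key.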

page_props
Secondary storage would be to mirror these scores into the page_props table for the file page, which should make them available to AbuseFilter and many other tools that can pull from page_props via the local database or the API.
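Once mirrored, a score would be an ordinary property row, readable the same way other page_props consumers already work. A sketch of such a lookup, assuming a hypothetical property name (the real naming scheme is undecided):

```sql
-- 'nsfw-score-google-vision' is a hypothetical property name.
SELECT pp_value
FROM page_props
JOIN page ON pp_page = page_id
WHERE page_namespace = 6  -- NS_FILE
  AND page_title = 'Example.jpg'
  AND pp_propname = 'nsfw-score-google-vision';
```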

FileRepo
It may also be useful to expose scoring info as extended metadata through FileRepo, which should make the data queryable on wikis using the images. We should check whether AbuseFilter can already make use of this, or whether it would need to be extended.

ElasticSearch
It may be nice to have scoring info available in search databases for easy filtering, but filtering search results is a complex topic probably to be looked at later.

Timing and real-time concerns
Scoring may take several seconds after upload, so scores won't be available immediately when the upload's revision is created.

A second revision could be made carrying the page_props update, which could trigger AbuseFilter; alternatively, AbuseFilter rules could be applied only on uses of the image. We need a better idea of the exact rulesets planned here.