Wikimedia Product/NSFW image classifier/Storage notes

From mediawiki.org

(Summary of offline discussion, still being updated)

Each scoring service (there may be several in the near future) will provide a floating-point score between 0 and 1. The score will be saved and exposed through various means; most notably, it needs to be available to AbuseFilter.

Data table

Primary storage would be in a local database table provided by an extension. Something like:

  • image_filter_scoring table
    • ifs_image varchar(255) -> foreign key to img_name (alternatively, a key to img_sha1, which is smaller and doesn't require rename handling)
    • ifs_source varchar(32) -> key identifying the source, e.g. 'google-vision'. This could potentially be an enum to save space, but enums are known to be maintenance nightmares. If the upstream model changes, the key should probably change too.
    • ifs_score float -> the score value, between 0 and 1
    • ifs_timestamp varchar(14) -> consider storing when the scoring was done?

These rows would live on the wiki that hosts the file; for instance, the scores for a Commons image would be directly visible only on Commons.
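
As a rough illustration of the table described above, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for the wiki database. The column names come from the list above; the composite primary key (one row per image per source), the range check on the score, and the helper names mw_timestamp, save_score and get_scores are assumptions made for the example, not part of any existing extension. The foreign key to img_name is omitted because the sketch has no image table.

```python
import sqlite3
import time

# Stand-in schema for the proposed table. The primary key and CHECK
# constraint are assumptions for this sketch, not decided design.
SCHEMA = """
CREATE TABLE image_filter_scoring (
    ifs_image     VARCHAR(255) NOT NULL,  -- img_name of the scored file
    ifs_source    VARCHAR(32)  NOT NULL,  -- e.g. 'google-vision'
    ifs_score     REAL         NOT NULL CHECK (ifs_score BETWEEN 0 AND 1),
    ifs_timestamp VARCHAR(14)  NOT NULL,  -- YYYYMMDDHHMMSS, in UTC
    PRIMARY KEY (ifs_image, ifs_source)
);
"""


def mw_timestamp(epoch=None):
    """Format a Unix time (default: now) as a 14-character UTC timestamp."""
    return time.strftime("%Y%m%d%H%M%S", time.gmtime(epoch))


def save_score(conn, image, source, score, epoch=None):
    """Insert or overwrite the score one source gave one image."""
    conn.execute(
        "INSERT OR REPLACE INTO image_filter_scoring VALUES (?, ?, ?, ?)",
        (image, source, score, mw_timestamp(epoch)),
    )


def get_scores(conn, image):
    """Return {source: score} for every source that has scored the image."""
    rows = conn.execute(
        "SELECT ifs_source, ifs_score FROM image_filter_scoring"
        " WHERE ifs_image = ?",
        (image,),
    )
    return dict(rows.fetchall())
```

Keying on (ifs_image, ifs_source) means a re-score from the same source overwrites the old row, while a changed model that gets a new source key would add a second row alongside it.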

Secondary storage

page_props

Secondary storage would be to mirror these scores into the page_props for the file page, which should make them available to AbuseFilter and many other tools which can pull from page_props via local database or API.
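
A minimal sketch of what the mirroring step might produce, assuming one page_props row per scoring source. The tuples follow the page_props column order (pp_page, pp_propname, pp_value, pp_sortkey); the property-name prefix image_filter_score_ and the function name scores_to_page_props are hypothetical and chosen only for illustration.

```python
# Hypothetical prefix for the mirrored property names.
PROP_PREFIX = "image_filter_score_"


def scores_to_page_props(page_id, scores):
    """Build (pp_page, pp_propname, pp_value, pp_sortkey) rows.

    `scores` maps a source key (e.g. 'google-vision') to a score in [0, 1].
    The value is stored as text and repeated as a float sort key so that
    tools can filter or sort on it numerically.
    """
    rows = []
    for source, score in sorted(scores.items()):
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score out of range for {source}: {score}")
        rows.append((page_id, PROP_PREFIX + source, repr(score), float(score)))
    return rows
```

Using one property per source keeps each score independently queryable; packing all sources into a single serialized property would be an alternative, at the cost of easy filtering.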

FileRepo

It may also be useful to expose scoring info as extended metadata through FileRepo, which should make the data queryable on wikis using the images. Check whether AbuseFilter can make use of this already or if it would need to be extended.

ElasticSearch

It may be nice to have scoring info available in search databases for easy filtering, but filtering search results is a complex topic that should probably be looked at later.

Timing and real-time concerns

Scoring may take several seconds after upload, so the score won't be available _immediately_ when the upload's revision is created.

A second revision could be made with the page_props update, which could trigger AbuseFilter, or the AbuseFilter rules could be applied only on uses of the image. We need to get a better idea of the exact rulesets planned here.
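
Whichever option is chosen, a rule has to behave sensibly while no score exists yet. The following is a minimal sketch of that three-way outcome, not AbuseFilter's actual rule language; the Decision names, the evaluate function, the use of the highest score across sources, and the 0.8 threshold are all assumptions for illustration.

```python
from enum import Enum


class Decision(Enum):
    PENDING = "pending"  # scoring has not finished yet
    ALLOW = "allow"
    FLAG = "flag"


def evaluate(scores, threshold=0.8):
    """Decide what to do with an image given its {source: score} mapping.

    An empty mapping means no scoring service has reported yet, which is
    the expected state right after upload.
    """
    if not scores:
        return Decision.PENDING
    if max(scores.values()) >= threshold:
        return Decision.FLAG
    return Decision.ALLOW
```

Treating the missing score as its own state, rather than as 0, avoids silently allowing an image during the seconds between upload and scoring.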