User:Dantman/SpamDB


Idea: Write a service that accepts submissions of spam and indexes them in a large database. In the future this service could be used to build a filter that judges the spamminess of a wiki edit.

Suggested stack of technology:

  • Python, with a gevent-based server so the service can handle many concurrent requests (see the sketch after this list).
  • Riak for storing the data blobs.
  • A second database engine (still undecided) for indexes and anything we need to iterate over.
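
As a rough illustration of the gevent piece, here is a minimal sketch of a submission endpoint using gevent's built-in WSGI server. The /submit path, the port, the JSON payload shape, and the choice of SHA-1 are assumptions for the example, not a settled API.

import json
import hashlib
from gevent.pywsgi import WSGIServer

def application(environ, start_response):
    # Accept POSTed spam submissions as JSON; anything else gets a 404.
    if environ["REQUEST_METHOD"] == "POST" and environ["PATH_INFO"] == "/submit":
        length = int(environ.get("CONTENT_LENGTH") or 0)
        entry = json.loads(environ["wsgi.input"].read(length))
        # Hash the submitted text so it can later be stored as a blob keyed by its shasum.
        digest = hashlib.sha1(entry["text"].encode("utf-8")).hexdigest()
        start_response("200 OK", [("Content-Type", "application/json")])
        return [json.dumps({"text": digest}).encode("utf-8")]
    start_response("404 Not Found", [("Content-Type", "text/plain")])
    return [b"not found"]

# gevent's WSGIServer handles each request in a greenlet, so many
# concurrent submissions can be served from a single process.
WSGIServer(("0.0.0.0", 8080), application).serve_forever()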

Every spam entry will be a short document or row. The typical fields are title and text. Each of these is mapped to a SHA hash that is used as the key for blob storage, so looking the hash up in Riak returns the original content.

{
  "title": "[...shasum...]",
  "text": "[...shasum...]"
}
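
A minimal sketch of how such an entry might be built, assuming SHA-1 as the "sha hash" (the exact digest is not settled here) and a plain dict standing in for the Riak blob store; the sample spam strings are made up for illustration.

import hashlib

def sha_key(data):
    # The shasum of the content doubles as its blob-storage key.
    return hashlib.sha1(data.encode("utf-8")).hexdigest()

def build_entry(title, text, blob_store):
    # blob_store stands in for Riak: the content goes in keyed by its
    # shasum, and the entry document only carries the keys.
    entry = {}
    for field, value in (("title", title), ("text", text)):
        key = sha_key(value)
        blob_store[key] = value
        entry[field] = key
    return entry

blobs = {}
entry = build_entry("Cheap pills!!", "Buy now, limited offer", blob_store=blobs)
# entry == {"title": "<shasum>", "text": "<shasum>"}
# blobs[entry["text"]] recovers the original spam text.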

That said, since we will want to keep the data stored in Riak separate without losing flexibility in our document keys, we may want to store the values as something like "riak:title/[...shasum...]".
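
A small sketch of that key scheme, assuming the "store:field/shasum" layout shown above; the helper names are just for illustration.

def blob_ref(field, shasum, store="riak"):
    # Prefix the key with the backing store and field so the document
    # itself records where the blob lives, e.g. "riak:title/<shasum>".
    return "%s:%s/%s" % (store, field, shasum)

def parse_blob_ref(ref):
    # Split "riak:title/<shasum>" back into its parts.
    store, rest = ref.split(":", 1)
    field, shasum = rest.split("/", 1)
    return store, field, shasum

Because each key names its backing store, a different storage engine could be introduced later without rewriting existing entry documents.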