User:Dantman/SpamDB


Idea: Write a service that accepts submissions of spam and indexes them in a large database. In the future this service could be used to build a filter that judges the spamminess of a wiki edit.

Suggested stack of technology:

  • Python, with a gevent-based server so the service can handle many concurrent requests (see the sketch after this list).
  • Riak for storing the data blobs.
  • A second database engine (still undecided) for indexes and anything we need to iterate over.
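
As a rough illustration of the gevent piece, here is a minimal sketch of a submission endpoint using gevent's built-in WSGI server. The /submit path, the port, the JSON payload shape, and the choice of SHA-1 are assumptions for the example, not a settled API.

import json
import hashlib
from gevent.pywsgi import WSGIServer

def application(environ, start_response):
    # Accept POSTed spam submissions as JSON; anything else gets a 404.
    if environ["REQUEST_METHOD"] == "POST" and environ["PATH_INFO"] == "/submit":
        length = int(environ.get("CONTENT_LENGTH") or 0)
        entry = json.loads(environ["wsgi.input"].read(length))
        # Hash the submitted text so it can later be stored as a blob keyed by its shasum.
        digest = hashlib.sha1(entry["text"].encode("utf-8")).hexdigest()
        start_response("200 OK", [("Content-Type", "application/json")])
        return [json.dumps({"text": digest}).encode("utf-8")]
    start_response("404 Not Found", [("Content-Type", "text/plain")])
    return [b"not found"]

# gevent's WSGIServer handles each request in a greenlet, so many
# concurrent submissions can be served from a single process.
WSGIServer(("0.0.0.0", 8080), application).serve_forever()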

Every spam entry will be a short document or row. The typical fields are title and text. Each of these is mapped to a SHA hash that is used as the key for blob storage, so looking the hash up in Riak returns the original content.

{
  "title": "[...shasum...]",
  "text": "[...shasum...]"
}
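
A minimal sketch of how such an entry might be built, assuming SHA-1 as the "sha hash" (the exact digest is not settled here) and a plain dict standing in for the Riak blob store; the sample spam strings are made up for illustration.

import hashlib

def sha_key(data):
    # The shasum of the content doubles as its blob-storage key.
    return hashlib.sha1(data.encode("utf-8")).hexdigest()

def build_entry(title, text, blob_store):
    # blob_store stands in for Riak: the content goes in keyed by its
    # shasum, and the entry document only carries the keys.
    entry = {}
    for field, value in (("title", title), ("text", text)):
        key = sha_key(value)
        blob_store[key] = value
        entry[field] = key
    return entry

blobs = {}
entry = build_entry("Cheap pills!!", "Buy now, limited offer", blob_store=blobs)
# entry == {"title": "<shasum>", "text": "<shasum>"}
# blobs[entry["text"]] recovers the original spam text.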

That said, since we will want to keep the data stored in Riak separate without losing flexibility in our document keys, we may want to store the values as something like "riak:title/[...shasum...]".
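
A small sketch of that key scheme, assuming the "store:field/shasum" layout shown above; the helper names are just for illustration.

def blob_ref(field, shasum, store="riak"):
    # Prefix the key with the backing store and field so the document
    # itself records where the blob lives, e.g. "riak:title/<shasum>".
    return "%s:%s/%s" % (store, field, shasum)

def parse_blob_ref(ref):
    # Split "riak:title/<shasum>" back into its parts.
    store, rest = ref.split(":", 1)
    field, shasum = rest.split("/", 1)
    return store, field, shasum

Because each key names its backing store, a different storage engine could be introduced later without rewriting existing entry documents.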