Notifications/Risks

This page documents potential engineering risks associated with the Echo deployment and rollout.

E-mail related risks

 * Risks associated with receiving (marking e-mail as spam)
 * Echo email notifications will come from the wiki's own domain (instead of wiki@wikimedia.org). Is spam reputation tied to the combination of sending IP and sender domain?
 * There was some discussion (unsettled) about adding a new mail server with a fresh IP and sending a lot of mail from it on day one. New mailservers generally get warmed up slowly so the IP can build reputation.
 * Our current request to operations is for two email addresses: noreply-notifications@wikipedia.org and noreply-notifications@mediawiki.org (modified from the original ticket, which asked for four new email addresses).
 * Risks associated with e-mail volume
 * Current projections suggest a relatively modest volume of outgoing email.
 * enwiki estimate (topline from this spreadsheet calculation): ~300 emails/hour, ~8000 emails/day, ~250k emails/month
 * Email rate for the current tool: no estimate yet. Can anyone provide current rates of emails sent for enwiki, broken down by talk page message, watchlist and other notifications?
 * Opting existing users into new types of emails. If somebody has talk page emails or watchlist emails enabled, I think it's reasonable to assume they want similar notifications, but if they get more email than they got last month, some number of people will complain.
 * The current plan is for all new users to get email notifications automatically, without having to opt in, as outlined in this feature requirements page.
 * We would also continue to send talk page message notifications to current users (as we do now) and have the current tool continue to send watchlist notifications, but all other notifications would be web-only and require users to opt in for email.
 * Requiring users to log in to unsubscribe from mail types.
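The enwiki volume figures above can be sanity-checked with simple arithmetic. This is a minimal sketch, assuming a steady hourly rate and a 30-day month (both are assumptions, and the function name is hypothetical); the spreadsheet itself may use a different model.

```python
HOURS_PER_DAY = 24
DAYS_PER_MONTH = 30  # assumption: a 30-day month


def project(emails_per_hour):
    """Return (emails/day, emails/month) for a steady hourly rate."""
    per_day = emails_per_hour * HOURS_PER_DAY
    per_month = per_day * DAYS_PER_MONTH
    return per_day, per_month


per_day, per_month = project(300)
print(per_day, per_month)  # 7200 216000
```

Under these assumptions, 300 emails/hour gives 7,200/day and 216,000/month, so the quoted ~8000/day and ~250k/month appear to be rounded up from a somewhat higher hourly rate. The three figures are still of the same order of magnitude.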

Longer term risks:
 * How is VERP currently used to handle e-mail bounces in the system?

Risks associated with JobQueue
Echo uses the JobQueue to queue events for delivery as notifications. In addition, Echo introduces a new JobQueue patch to allow delayed sending for e-mail bundling, and will probably (for the enwiki release) use Aaron's JobQueue patch, which allows the queue to be implemented in Redis.


 * If the JobQueue is backed up, timely notification delivery will be impacted. This should be mitigated by using a RAM-based queue in Redis.
 * JobQueue delay implementations are not tested under high load?
 * Redis as a job queue implementation is currently untested:
 * for Echo specifically (we should get Redis up on ee-prototype to confirm it works)
 * in production
 * Redis-related issues (LRU?)
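To make the "delayed sending" idea concrete, here is a minimal in-memory sketch of a delay-capable queue. It is not the MediaWiki JobQueue API or the actual patch; the class and method names are hypothetical. Each job carries a release time and workers only take jobs whose release time has passed. In a Redis-backed version, a sorted set scored by release time could play the role of the heap (an assumption about one possible design).

```python
import heapq
import itertools


class DelayedQueue:
    """In-memory stand-in for a delay-capable job queue (hypothetical)."""

    def __init__(self):
        self._heap = []                # entries: (release_time, seq, job)
        self._seq = itertools.count()  # tie-breaker keeps FIFO order

    def push(self, job, now, delay=0):
        """Queue a job that becomes runnable `delay` seconds after `now`."""
        heapq.heappush(self._heap, (now + delay, next(self._seq), job))

    def pop_ready(self, now):
        """Remove and return all jobs whose release time is <= now."""
        ready = []
        while self._heap and self._heap[0][0] <= now:
            ready.append(heapq.heappop(self._heap)[2])
        return ready

    def backlog(self):
        """Number of jobs still waiting (ready or delayed)."""
        return len(self._heap)


q = DelayedQueue()
q.push("web-notification:1", now=0)              # deliver immediately
q.push("email-bundle:user42", now=0, delay=900)  # hold 15 min for bundling
first = q.pop_ready(now=1)
later = q.pop_ready(now=900)
```

Monitoring something like `backlog()` is one way to notice the "backed up" condition described above before notification delivery is visibly delayed.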

Risks associated with the Database
Echo still uses some database tables (and memcache) for handling notification registration, bundling, and the "mailbox" of read notifications. These tables have JOINs internally but are not joined against the enwiki db, to ensure better partitioning. This data will live on the extension1 db cluster instead of the respective wiki db (e.g. the enwiki db). A post-release optimization would replace this dependency with a Redis-based system.

Per the Echo design meeting, we will use GROUP BY to get the bundle hash for the email digest, which will reduce the number of indexes on this table. The decision was made mainly because the email digest is generated by a cron job running in the background.


 * Risk with GROUP BY: it performs a full index scan regardless of LIMIT, then builds a temp table and does a filesort, which is very bad on large volumes of data.


 * Mitigation:
 * Clear processed events from the queue when processing email for each user, so the data volume does not grow at a constant rate
 * Apply a user-hash-priority index so the scan is limited to a single user's rows
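The two mitigations above can be sketched as follows. The table and column names are hypothetical, not the real Echo schema, and the example uses SQLite purely for illustration. The production MySQL optimizer may choose different plans, so treat this as a sketch of the query shape rather than a performance test.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE email_batch (            -- hypothetical table and columns
    id          INTEGER PRIMARY KEY,
    user_id     INTEGER NOT NULL,
    bundle_hash TEXT    NOT NULL,
    priority    INTEGER NOT NULL
);
-- Index led by user_id, so a per-user GROUP BY only touches that user's rows.
CREATE INDEX user_hash_priority
    ON email_batch (user_id, bundle_hash, priority);
""")
conn.executemany(
    "INSERT INTO email_batch (user_id, bundle_hash, priority) VALUES (?, ?, ?)",
    [(1, "a", 1), (1, "a", 1), (1, "b", 2), (2, "a", 1)],
)

BUNDLE_SQL = (
    "SELECT bundle_hash, COUNT(*) FROM email_batch "
    "WHERE user_id = ? GROUP BY bundle_hash ORDER BY bundle_hash"
)


def bundles_for_user(conn, user_id):
    """Return (bundle_hash, event_count) pairs for one user's digest."""
    return conn.execute(BUNDLE_SQL, (user_id,)).fetchall()


def query_plan(conn, user_id):
    """Return SQLite's plan for the digest query, for manual inspection."""
    return conn.execute("EXPLAIN QUERY PLAN " + BUNDLE_SQL, (user_id,)).fetchall()


def clear_user(conn, user_id):
    """Delete a user's processed rows so the table does not keep growing."""
    cur = conn.execute("DELETE FROM email_batch WHERE user_id = ?", (user_id,))
    return cur.rowcount


digest = bundles_for_user(conn, 1)
removed = clear_user(conn, 1)
print(digest, removed)  # [('a', 2), ('b', 1)] 3
```

On the real cluster, running MySQL's `EXPLAIN` on the equivalent query would be the way to confirm that the temp table and filesort are actually avoided.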


 * echo_event and echo_notification: these would grow at a constant rate, and more data means slower performance (for both inserts and lookups). Notification lookup is mainly on the echo_notification table, with a second lookup on echo_event for a valid event_type. If the number of invalid events is greater than the number of valid events, the optimizer will do the lookup mainly on echo_event, which has no efficient index to support it.


 * Mitigation
 * We can delete or archive notifications that are more than 30 days old. For important notifications such as talk-page-notification, we should delete only those that have been read.
 * Perform regular cleanup on these two tables to delete invalid notifications.
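The retention rule above (delete after 30 days, but keep unread talk page notifications) could look roughly like this. The schema, the `purge_old` function and the type names are hypothetical and use SQLite for illustration only. The real tables are echo_event and echo_notification, with their own columns.

```python
import sqlite3

DAY = 86400  # seconds

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE notification (           -- hypothetical, simplified schema
    id      INTEGER PRIMARY KEY,
    type    TEXT    NOT NULL,
    created INTEGER NOT NULL,         -- Unix timestamp
    is_read INTEGER NOT NULL          -- 0 = unread, 1 = read
)""")


def purge_old(conn, now, max_age_days=30, protected_type="talk-page"):
    """Delete notifications older than the cutoff.

    Rows of `protected_type` are deleted only if they have been read.
    Returns the number of rows deleted.
    """
    cutoff = now - max_age_days * DAY
    cur = conn.execute(
        "DELETE FROM notification "
        "WHERE created < ? AND (is_read = 1 OR type != ?)",
        (cutoff, protected_type),
    )
    return cur.rowcount


now = 100 * DAY
conn.executemany(
    "INSERT INTO notification (id, type, created, is_read) VALUES (?, ?, ?, ?)",
    [
        (1, "talk-page", now - 31 * DAY, 0),  # old, unread, protected: kept
        (2, "talk-page", now - 31 * DAY, 1),  # old, read: deleted
        (3, "mention",   now - 31 * DAY, 0),  # old, unprotected: deleted
        (4, "mention",   now - 1 * DAY,  0),  # recent: kept
    ],
)
deleted = purge_old(conn, now)
remaining = [row[0] for row in conn.execute("SELECT id FROM notification ORDER BY id")]
print(deleted, remaining)  # 2 [1, 4]
```

On a large table, a job like this would probably need to delete in small batches to avoid long-running transactions. That detail is left out of the sketch.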

Risks to existing workflows
Echo replaces the current user talk page notification system, so existing talk page notifications will go through it. The existing user talk page notifications are robust and are built into many users' habits right now.


 * If Echo were to go down or get backed up, the quality of service of these notifications would be impacted
 * Is the flow of these notifications different? For instance, e-mail throttling/bundling occurs after a certain point (specify)

Overall UX risks

 * Some extension could introduce a new Echo notification that is too "chatty" on the system