Analytics/Archive/Editor Engagement Vital Signs/Backfilling

Backfilling
Some benchmarks of how log did it take to backfill data for Rolling Active Editor EEVS metric. The point is to have a ballpark estimate for our backfilling and make sure future code changes do not blow up these numbers.

Numbers will fluctuate depending on labs DB access but they should not diverge much from these.

We use as baseline our master branch on 2014-06-12 versus changes on this patchset: https://gerrit.wikimedia.org/r/#/c/150475/

Labs db infrstructure ( labsdb1002 dewiki, commons, etc) was upgraded to maria db about the last week of July. All data is on an SSD now.

Config for celery was: BROKER_URL                         : redis://localhost:6379/0 CELERY_RESULT_BACKEND              : redis://localhost:6379/0 CELERY_TASK_RESULT_EXPIRES         : 2592000 CELERY_DISABLE_RATE_LIMITS         : True CELERY_STORE_ERRORS_EVEN_IF_IGNORED : True CELERYD_CONCURRENCY                : 10 CELERYD_TASK_TIME_LIMIT            : 3630 CELERYD_TASK_SOFT_TIME_LIMIT       : 3600 DEBUG                              : False LOG_LEVEL                          : INFO MAX_PARALLEL_PER_RUN               : 10 MAX_INSTANCES_PER_RECURRENT_REPORT : 365 CELERY_BEAT_DATAFILE               : /var/run/wikimetrics/celerybeat_scheduled_tasks CELERY_BEAT_PIDFILE                : /var/run/wikimetrics/celerybeat.pid CELERYBEAT_SCHEDULE                : 'update-daily-recurring-reports': 'task'     : 'wikimetrics.schedules.daily.recurring_reports' # The schedule can be set to 'daily' for a crontab-like daily recurrence 'schedule' : debug

rowiki

 * Backfilling of 3 months of data takes about 3 minutes
 * Backfilling of 1 year of data takes about 10 minutes

eswiki

 * Backfilling of 3 months of data took 8 minutes.
 * Backfilling of 5 months of data took 10 minutes
 * Backfilling of 1 year of data took 30 minutes

frwiki

 * backfilling 3 months took 12 mins

rowiki

 * Backfilling of 3 months of data takes about 3 minutes
 * Backfilling of 1 year of data takes about 6 minutes

frwiki

 * Backfilling 3 months: 7 mins

rowiki RollingNewActiveEditor

 * Backfilling 3 months: 42 seconds

rowiki RollingSurvivingNewActiveEditor

 * Backfilling 3 months: 44 seconds

frwiki RollingNewActiveEditor

 * Backfilling 3 months: 3.5 minutes

frwiki RollingSurvivingNewActiveEditor

 * Backfilling 3 months: 4.5 minutes

ruwiki

 * Backfilling 1 day - 4 minutes
 * Backfilling 1 week - 4 minutes
 * Backfilling a month - 7minutes

frwiki

 * Backfilling 2 months: 12 minutes

Rolling Recurrent old active editor https://gerrit.wikimedia.org/r/#/c/161521/
Select as is did not run (as in it run forever)

SELECT anon_1.user_id AS anon_1_user_id, IF(SUM(anon_1.count_one) >= %s AND SUM(anon_1.count_two) >= %s, %s, %s) AS `IF_1` FROM (SELECT anon_2.user_id AS user_id, anon_2.count_one AS count_one, anon_2.count_two AS count_two FROM (SELECT revision_userindex.rev_user AS user_id, SUM(IF(revision_userindex.rev_timestamp <= %s, %s, %s)) AS count_one, SUM(IF(revision_userindex.rev_timestamp > %s, %s, %s)) AS count_two FROM revision_userindex INNER JOIN user ON user.user_id = revision_userindex.rev_user INNER JOIN logging ON user.user_id = logging.log_user WHERE logging.log_type = %s AND logging.log_action = %s AND logging.log_timestamp < %s AND revision_userindex.rev_timestamp BETWEEN %s AND %s GROUP BY revision_userindex.rev_user

UNION ALL

SELECT archive.ar_user AS user_id, SUM(IF(archive.ar_timestamp <= %s, %s, %s)) AS count_one, SUM(IF(archive.ar_timestamp > %s, %s, %s)) AS count_two FROM archive INNER JOIN user ON user.user_id = archive.ar_user INNER JOIN logging ON user.user_id = logging.log_user WHERE logging.log_type = %s AND logging.log_action = %s AND logging.log_timestamp < %s AND archive.ar_timestamp BETWEEN %s AND %s GROUP BY archive.ar_user) AS anon_2) AS anon_1 GROUP BY anon_1.user_id