Admin tools development/SUL Audit

In order to effectively plan for the final Grand Unification (required for Echo and Flow), we need to determine the current state of SUL.

Status
2013-04-11 - Unifying non-clashing accounts.

Goals
As a first step, we want to determine the following metrics: Once we have access to these numbers, we need to determine a strategy for unification which will involve breaking down the above by
 * 1) Number of global users
 * 2) Number of local users without a global account (can be easily unified)
 * 3) Number of local users that clash with another local user (no global account involved)
 * 4) Number of local users that clash with a global user
 * Users who have edited in the last 3/6/12/24 months
 * Users with logins in the last 3/6/12/24 months
 * Users with a confirmed email address
 * Users without a confirmed email address
 * (anything else?)

Metrics
The following numbers are based on static dumps of all wikis from 2013/05/06, and slaved centralauth; consequently the numbers won't add up perfectly.

The following data is using the user_touched field in the user tables. This is not a particularly accurate number, but is conservative when it exists.

Historical 2013-04-17

Historical 2013-03-15

Historical 2013-02-23

HW Needs
Doing this will require some fairly intensive DB querying. Because of the cross db-host nature of querying local.user against centralauth.*, we have developed the following plan to consolidate all user tables on the same server for the purpose of the audit. Each of these options allows us to run very intensive queries without having a detrimental effect on site performance as well as giving us the ability to execute the queries in SQL without having to connect to multiple hosts programmatically.

Option 1 - Replicate all clusters to a single slave
Unfortunately, it would appear that it is not possible to slave s[1-7] to a single slave without running multiple instances of MySQL, which would negate the benefit. If, however, some method of accomplishing this is discovered, this plan would solve almost all temporal issues (replag being the only remaining issue).

✅ - Option 2 - Replicate s7/centralauth and script the rest
Because some of the required tables in centralauth do no use an auto_increment as the primary key, we would slave at least the  db from s7 to the audit db host. For each individual wiki we would batch insert programmatically, using the auto_increment ids in the user table to periodically update the tables. We would also write a script to even less frequently update all fields in the tables so that we grab updates to fields such as,  , and.

This option does suffer from the fact that all databases are not constantly updated and that they are not all in sync for a given time since the scripts take time to execute. At least for now, these concerns are not huge as we are looking for ballpark statistics.