User:AKlapper (WMF)/Bitergia data quality queries

The data behind wikimedia.biterg.io regularly needs updates to make our metrics reliable.

For convenience this page lists SQL queries and bash scripts that I occasionally run.

The usual workflow is to use sortinghat to update the database, create an updated database dump file, push that file back into the repository, and get the changes deployed on the production server.

Find accounts which likely should have an affiliation / enrollment
Note that by default the enrollments array of a user is empty. First run SELECT * FROM organizations to get the correct organization_id.


 * SELECT uuid, id, name, username, source FROM identities WHERE username LIKE '%WMF%' AND source != "bugzilla" AND username NOT LIKE '%WMFr' AND uuid NOT IN (SELECT uuid FROM enrollments WHERE organization_id = 17) ORDER BY UUID;
 * SELECT identities.uuid, identities.id, identities.name, identities.username, identities.source FROM identities JOIN profiles WHERE identities.uuid = profiles.uuid AND identities.email LIKE '%@wikimedia.org' AND identities.uuid NOT IN (SELECT uuid FROM enrollments WHERE organization_id = 17) AND identities.email NOT LIKE "ttijhof@%" AND is_bot=0 ORDER BY identities.uuid;
 * SELECT uuid, id, name, username, source FROM identities WHERE username LIKE '%WMDE%' AND uuid NOT IN (SELECT uuid FROM enrollments WHERE organization_id = 16) ORDER BY UUID;
 * SELECT uuid, id, name, username, source FROM identities WHERE email LIKE '%@wikimedia.de' AND uuid NOT IN (SELECT uuid FROM enrollments WHERE organization_id = 16) ORDER BY uuid;
 * SELECT uuid, id, name, username, source FROM identities WHERE username LIKE '%WMSE%' AND uuid NOT IN (SELECT uuid FROM enrollments WHERE organization_id = 18) ORDER BY UUID;
 * SELECT uuid, id, name, username, source FROM identities WHERE email LIKE '%@wikimedia.se' AND uuid NOT IN (SELECT uuid FROM enrollments WHERE organization_id = 18) ORDER BY uuid;

Find detached accounts with same email addresses to merge

 * SELECT i1.email, i1.source, i2.source, i1.uuid, i2.uuid FROM identities i1 JOIN identities i2 WHERE i1.email IS NOT NULL AND i1.email != "gerritpatchuploader@gmail.com" AND i1.email != "" AND i1.email NOT LIKE "ttijhof@%" AND i2.email != "" AND i2.email IS NOT NULL AND i2.email != "gerritpatchuploader@gmail.com" AND i2.email NOT LIKE "ttijhof@%" AND i1.email = i2.email AND i1.source != i2.source AND i1.uuid != i2.uuid ORDER BY i1.email;</tt>
 * Running this query should not be needed anymore.
 * (For the records, there is a worse approach in T149327. Excluding ttijhof</tt> because of T123643.)
 * Special case: For (now defunct) Bugzilla, the email address is stored in the username column: SELECT DISTINCT(i1.uuid), i1.username, i2.uuid, i2.email FROM identities i1 JOIN identities i2 WHERE i1.source = "bugzilla" AND i2.source != "bugzilla" AND i1.username = i2.email AND i1.uuid != i2.uuid ORDER BY i1.uuid;</tt>

Check detached accounts with same mw and phab usernames if they are connected to merge
See the script and DB commands.

Check detached accounts with same username and same source

 * SELECT p1.username,p1.uuid,profiles.name FROM identities p1 JOIN profiles JOIN identities p2 WHERE p1.source = "phabricator" AND p2.source = "phabricator" AND p1.uuid != p2.uuid AND p1.username = p2.username AND p1.uuid = profiles.uuid ORDER BY p1.username;</tt>

Check duplicated enrollments
...until https://github.com/grimoirelab/sortinghat/issues/91 is fixed:


 * SELECT uuid,start,end,organization_id FROM enrollments WHERE uuid IN (SELECT uuid FROM enrollments GROUP BY uuid HAVING COUNT(uuid) > 1) ORDER BY uuid,start;</tt>

Query all existing Phab accounts about their connected MediaWiki.org accounts
This obviously takes a long time as we talk about >10000 accounts. See the script and DB commands. (For a cheaper version that requires more manual checking, see T170091.)

Find detached accounts with exact same usernames across sources
This is fuzzy and always needs manual investigation. Less expensive by declaring a source to compare with; still takes several minutes:

SELECT i1.id, i2.id, i1.username, i2.username, i1.source, i2.source, i1.uuid, i2.uuid FROM identities i1 JOIN identities i2 WHERE i1.username IS NOT NULL AND i2.username IS NOT NULL AND i1.username = i2.username AND i1.source = "whatever" AND i1.source != i2.source AND i1.uuid != i2.uuid;</tt>

List detached accounts with same usernames in two different sources
This is fuzzy and always needs manual investigation. You must ignore common names. Declare two sources to make the query less expensive.


 * General (sources: bugzilla, pipermail, gerrit): SELECT a.name, a.uuid, b.name, b.uuid FROM identities a JOIN identities b WHERE a.source = "bugzilla" AND b.source = "pipermail" AND b.email NOT LIKE "no-reply%" AND a.uuid != b.uuid AND a.name = b.name ORDER BY a.name;</tt>
 * Bugzilla and Phabricator: SELECT b.username,b.uuid FROM identities a JOIN identities b WHERE a.source = "bugzilla" AND b.source = "phabricator" AND a.uuid != b.uuid AND a.username = b.username AND a.username IS NOT NULL AND b.username IS NOT NULL ORDER BY b.username;</tt>

One-time
Only listing here for reference.


 * Eliminate duplicated «"source": "wikimedia:its"» identities in korma identities DB