User:AKlapper (WMF)/Bitergia data quality queries

From mediawiki.org

The data behind wikimedia.biterg.io regularly needs updates to make our metrics reliable.

For convenience this page lists SQL queries and bash scripts that I occasionally run.

The usual workflow is to use sortinghat to update the database, create an updated database dump file, and update user profiles via Hatstall to get the changes deployed on the production server.

Find accounts which likely should have an affiliation / enrollment[edit]

Note that by default the enrollments array of a user is empty.

  • SELECT CONCAT("https://wikimedia.biterg.io/identities/hatstall/", uuid), id, name, username, source FROM identities WHERE username LIKE '%WMF%' AND source != "bugzilla" AND username NOT LIKE '%WMFr' AND username != "Wmftestfore3" AND uuid NOT IN (SELECT uuid FROM enrollments e JOIN organizations o WHERE o.name = "Wikimedia Foundation" AND e.organization_id = o.id) AND source != "supybot" ORDER BY UUID;
  • SELECT CONCAT("https://wikimedia.biterg.io/identities/hatstall/", identities.uuid), identities.id, identities.name, identities.username, identities.source FROM identities JOIN profiles WHERE identities.uuid = profiles.uuid AND identities.email LIKE '%@wikimedia.org' AND identities.uuid NOT IN (SELECT uuid FROM enrollments e JOIN organizations o WHERE o.name = "Wikimedia Foundation" AND e.organization_id = o.id) AND identities.email NOT LIKE "ttijhof@%" AND identities.email != "mediawiki-l@Wikimedia.org" AND identities.name != "Amazon Sec. Team messages-noreply@amazon.com" AND is_bot=0 ORDER BY identities.uuid;
  • SELECT CONCAT("https://wikimedia.biterg.io/identities/hatstall/", uuid), id, name, username, source FROM identities WHERE (email LIKE '%@wikimedia.de' OR username LIKE '%WMDE%') AND uuid NOT IN (SELECT uuid FROM enrollments e JOIN organizations o WHERE o.name = "Wikimedia Deutschland" AND e.organization_id = o.id) ORDER BY UUID;
  • SELECT CONCAT("https://wikimedia.biterg.io/identities/hatstall/", uuid), id, name, username, source FROM identities WHERE (((email LIKE '%@wikimedia.se' OR username LIKE '%WMSE%') AND uuid NOT IN (SELECT uuid FROM enrollments e JOIN organizations o WHERE o.name = "Wikimedia Sverige" AND e.organization_id = o.id)) OR (email LIKE '%@thisdot.co' AND uuid NOT IN (SELECT uuid FROM enrollments e JOIN organizations o WHERE o.name = "ThisDot" AND e.organization_id = o.id)) OR (email LIKE '%@speedandfunction.com' AND uuid NOT IN (SELECT uuid FROM enrollments e JOIN organizations o WHERE o.name = "Speed & Function" AND e.organization_id = o.id)) OR (email LIKE '%@hallowelt%' AND uuid NOT IN (SELECT uuid FROM enrollments e JOIN organizations o WHERE o.name = "Hallo Welt!" AND e.organization_id = o.id))) ORDER BY UUID;

Look also at GitLab accounts and if some of them should have an affiliation; see phab:T306769:

Use the email addresses in the Phabricator database to identify folks that should have an affiliation set:

  • Run in Phab DB to query based on contractorcompany email addresses: SELECT u.userName FROM phabricator_user.user u JOIN phabricator_user.user_email ue WHERE ue.userPHID = u.phid AND ue.isPrimary = 1 AND ue.address LIKE "%@contractorcompany.xyz" AND u.isDisabled = 0;
  • Copy resulting usernames
  • Run in Bitergia DB: SELECT * FROM organizations; (to get name of contractorcompany)
  • Run in Bitergia DB to find folks with no affiliation yet set (replace contractorcompany and usernames): SELECT p1.username,CONCAT("https://wikimedia.biterg.io/identities/hatstall/", p1.uuid) FROM identities p1 WHERE p1.source = "phabricator" AND (p1.username = "username1" OR p1.username = "username2") AND p1.uuid NOT IN (SELECT uuid FROM enrollments e JOIN organizations o WHERE o.name = "contractorcompany" AND e.organization_id = o.id);

Find detached Phabricator accounts which changed their Also Known As[edit]

As long as phab:T305230 is unresolved:

  • SELECT DISTINCT(t1.username) FROM identities t1 JOIN identities t2 ON t1.username = t2.username WHERE t1.uuid != t2.uuid AND t1.id != t2.id AND t1.source = "phabricator" AND t2.source = "phabricator";

Find detached accounts with same email addresses to merge[edit]

  • SELECT i1.email, i1.source, i2.source, i1.uuid, i2.uuid FROM identities i1 JOIN identities i2 WHERE i1.email IS NOT NULL AND i1.email != "gerritpatchuploader@gmail.com" AND i1.email != "" AND i1.email NOT LIKE "ttijhof@%" AND i1.email != "root@localhost" AND i1.email != "no-reply@phabricator.wikimedia.org" AND i2.email != "" AND i2.email IS NOT NULL AND i2.email != "gerritpatchuploader@gmail.com" AND i2.email NOT LIKE "ttijhof@%" AND i1.email = i2.email AND i1.source != i2.source AND i1.uuid != i2.uuid ORDER BY i1.email;
  • Running this query should not be needed anymore.
  • (For the records, there is a worse approach in phab:T149327. Excluding ttijhof because of phab:T123643.)
  • Special case: For (now defunct) Bugzilla, the email address is stored in the username column: SELECT DISTINCT(i1.uuid), i1.username, i2.uuid, i2.email FROM identities i1 JOIN identities i2 WHERE i1.source = "bugzilla" AND i2.source != "bugzilla" AND i1.username = i2.email AND i1.uuid != i2.uuid ORDER BY i1.uuid;

Check detached accounts with same mw and phab usernames if they are connected to merge[edit]

See the script and DB commands.

Check detached accounts with same username and same source[edit]

  • SELECT p1.username,CONCAT("https://wikimedia.biterg.io/identities/hatstall/", p1.uuid), p1.source, profiles.name FROM identities p1 JOIN profiles JOIN identities p2 WHERE p1.source = p2.source AND p1.uuid != p2.uuid AND p1.username = p2.username AND p1.uuid = profiles.uuid ORDER BY p1.username

Check duplicated enrollments[edit]

...until https://github.com/chaoss/grimoirelab-sortinghat/issues/91 is fixed:

  • SELECT CONCAT("https://wikimedia.biterg.io/identities/hatstall/", e.uuid), p.name, e.start, e.end, e.organization_id FROM enrollments e INNER JOIN profiles p ON e.uuid = p.uuid WHERE e.uuid IN (SELECT uuid FROM enrollments GROUP BY uuid HAVING COUNT(uuid) > 1) ORDER BY e.uuid,e.start;

Not regularly / more expensive[edit]

Query all existing Phab accounts about their connected MediaWiki.org accounts[edit]

This obviously takes a long time as we talk about >10000 accounts. See the script and DB commands. (For a cheaper version that requires more manual checking, see phab:T170091.)

Find detached accounts with exact same usernames across sources[edit]

This is fuzzy and always needs manual investigation. Very expensive query; takes several minutes:

SELECT i1.id, i2.id, i1.username, i2.username, i1.source, i2.source, i1.uuid, i2.uuid FROM identities i1 JOIN identities i2 WHERE i1.username IS NOT NULL AND i2.username IS NOT NULL AND i1.username = i2.username AND i1.source = "whatever" AND i1.source != i2.source AND i1.uuid != i2.uuid;

List detached accounts with same usernames in two different sources[edit]

This is fuzzy and always needs manual investigation. You must ignore common names. Declare two sources to make the query less expensive.

  • General (sources: bugzilla, pipermail, gerrit): SELECT a.name, a.uuid, b.name, b.uuid FROM identities a JOIN identities b WHERE a.source = "bugzilla" AND b.source = "pipermail" AND b.email NOT LIKE "no-reply%" AND a.uuid != b.uuid AND a.name = b.name ORDER BY a.name;
  • Bugzilla and Phabricator: SELECT b.username,b.uuid FROM identities a JOIN identities b WHERE a.source = "bugzilla" AND b.source = "phabricator" AND a.uuid != b.uuid AND a.username = b.username AND a.username IS NOT NULL AND b.username IS NOT NULL ORDER BY b.username;

One-time[edit]

Only listing here for reference.