User:AKlapper (WMF)/Bitergia data quality queries

From mediawiki.org
Jump to navigation Jump to search

The data behind wikimedia.biterg.io regularly needs updates to make our metrics reliable.

For convenience this page lists SQL queries and bash scripts that I occasionally run.

The usual workflow is to use sortinghat to update the database, create an updated database dump file, push that file back into the repository, and get the changes deployed on the production server.

Find accounts which likely should have an affiliation / enrollment[edit]

Note that by default the enrollments array of a user is empty. First run SELECT * FROM organizations to get the correct organization_id.

  • SELECT CONCAT("https://wikimedia.biterg.io/identities/hatstall/", uuid), id, name, username, source FROM identities WHERE username LIKE '%WMF%' AND source != "bugzilla" AND username NOT LIKE '%WMFr' AND username != "Wmftestfore3" AND uuid NOT IN (SELECT uuid FROM enrollments WHERE organization_id = 18) AND source != "supybot" ORDER BY UUID;
  • SELECT CONCAT("https://wikimedia.biterg.io/identities/hatstall/", identities.uuid), identities.id, identities.name, identities.username, identities.source FROM identities JOIN profiles WHERE identities.uuid = profiles.uuid AND identities.email LIKE '%@wikimedia.org' AND identities.uuid NOT IN (SELECT uuid FROM enrollments WHERE organization_id = 18) AND identities.email NOT LIKE "ttijhof@%" AND identities.email != "mediawiki-l@Wikimedia.org" AND identities.name != "Amazon Sec. Team messages-noreply@amazon.com" AND is_bot=0 ORDER BY identities.uuid;
  • SELECT CONCAT("https://wikimedia.biterg.io/identities/hatstall/", uuid), id, name, username, source FROM identities WHERE (email LIKE '%@wikimedia.de' OR username LIKE '%WMDE%') AND uuid NOT IN (SELECT uuid FROM enrollments WHERE organization_id = 17) ORDER BY UUID;
  • SELECT CONCAT("https://wikimedia.biterg.io/identities/hatstall/", uuid), id, name, username, source FROM identities WHERE (email LIKE '%@wikimedia.se' OR username LIKE '%WMSE%') AND uuid NOT IN (SELECT uuid FROM enrollments WHERE organization_id = 19) ORDER BY UUID;

Find detached accounts with same email addresses to merge[edit]

  • SELECT i1.email, i1.source, i2.source, i1.uuid, i2.uuid FROM identities i1 JOIN identities i2 WHERE i1.email IS NOT NULL AND i1.email != "gerritpatchuploader@gmail.com" AND i1.email != "" AND i1.email NOT LIKE "ttijhof@%" AND i1.email != "root@localhost" AND i1.email != "no-reply@phabricator.wikimedia.org" AND i2.email != "" AND i2.email IS NOT NULL AND i2.email != "gerritpatchuploader@gmail.com" AND i2.email NOT LIKE "ttijhof@%" AND i1.email = i2.email AND i1.source != i2.source AND i1.uuid != i2.uuid ORDER BY i1.email;
  • Running this query should not be needed anymore.
  • (For the records, there is a worse approach in phab:T149327. Excluding ttijhof because of phab:T123643.)
  • Special case: For (now defunct) Bugzilla, the email address is stored in the username column: SELECT DISTINCT(i1.uuid), i1.username, i2.uuid, i2.email FROM identities i1 JOIN identities i2 WHERE i1.source = "bugzilla" AND i2.source != "bugzilla" AND i1.username = i2.email AND i1.uuid != i2.uuid ORDER BY i1.uuid;

Check detached accounts with same mw and phab usernames if they are connected to merge[edit]

See the script and DB commands.

Check detached accounts with same username and same source[edit]

  • SELECT p1.username,CONCAT("https://wikimedia.biterg.io/identities/hatstall/", p1.uuid), profiles.name FROM identities p1 JOIN profiles JOIN identities p2 WHERE p1.source = "phabricator" AND p2.source = "phabricator" AND p1.uuid != p2.uuid AND p1.username = p2.username AND p1.uuid = profiles.uuid ORDER BY p1.username;

Check duplicated enrollments[edit]

...until https://github.com/chaoss/grimoirelab-sortinghat/issues/91 is fixed:

Not regularly / more expensive[edit]

Query all existing Phab accounts about their connected MediaWiki.org accounts[edit]

This obviously takes a long time as we talk about >10000 accounts. See the script and DB commands. (For a cheaper version that requires more manual checking, see phab:T170091.)

Find detached accounts with exact same usernames across sources[edit]

This is fuzzy and always needs manual investigation. Less expensive by declaring a source to compare with; still takes several minutes:

SELECT i1.id, i2.id, i1.username, i2.username, i1.source, i2.source, i1.uuid, i2.uuid FROM identities i1 JOIN identities i2 WHERE i1.username IS NOT NULL AND i2.username IS NOT NULL AND i1.username = i2.username AND i1.source = "whatever" AND i1.source != i2.source AND i1.uuid != i2.uuid;

List detached accounts with same usernames in two different sources[edit]

This is fuzzy and always needs manual investigation. You must ignore common names. Declare two sources to make the query less expensive.

  • General (sources: bugzilla, pipermail, gerrit): SELECT a.name, a.uuid, b.name, b.uuid FROM identities a JOIN identities b WHERE a.source = "bugzilla" AND b.source = "pipermail" AND b.email NOT LIKE "no-reply%" AND a.uuid != b.uuid AND a.name = b.name ORDER BY a.name;
  • Bugzilla and Phabricator: SELECT b.username,b.uuid FROM identities a JOIN identities b WHERE a.source = "bugzilla" AND b.source = "phabricator" AND a.uuid != b.uuid AND a.username = b.username AND a.username IS NOT NULL AND b.username IS NOT NULL ORDER BY b.username;

One-time[edit]

Only listing here for reference.