User:AKlapper (WMF)/Bitergia data quality queries

The data behind wikimedia.biterg.io regularly needs updates to make our metrics reliable.

For convenience this page lists SQL queries and bash scripts that I occasionally run.

The usual workflow is to use sortinghat to update the database, create an updated database dump file, push that file back into the repository, and get the changes deployed on the production server.

Find accounts which likely should have an affiliation / enrollment
Note that by default the enrollments array of a user is empty.



Look also at GitLab accounts and if some of them should have an affiliation; see T306769:
 * This allows at least manual checking of email addresses and/or group membership on https://ldap.toolforge.org/user/username (replace ) which does not scale in the long run (see T306770).

Use the email addresses in the Phabricator database to identify folks that should have an affiliation set:
 * Run in Phab DB to query based on  email addresses:
 * Copy resulting s
 * Run in Bitergia DB:  (to get name of
 * Run in Bitergia DB to find folks with no affiliation yet set (replace  and  s):

Find detached accounts with same email addresses to merge

 * Running this query should not be needed anymore.
 * (For the records, there is a worse approach in T149327. Excluding  because of T123643.)
 * Special case: For (now defunct) Bugzilla, the email address is stored in the username column:
 * Special case: For (now defunct) Bugzilla, the email address is stored in the username column:

Check detached accounts with same mw and phab usernames if they are connected to merge
See the script and DB commands.

Check duplicated enrollments
...until https://github.com/chaoss/grimoirelab-sortinghat/issues/91 is fixed:



Query all existing Phab accounts about their connected MediaWiki.org accounts
This obviously takes a long time as we talk about >10000 accounts. See the script and DB commands. (For a cheaper version that requires more manual checking, see T170091.)

Find detached accounts with exact same usernames across sources
This is fuzzy and always needs manual investigation. Less expensive by declaring a source to compare with; still takes several minutes:

List detached accounts with same usernames in two different sources
This is fuzzy and always needs manual investigation. You must ignore common names. Declare two sources to make the query less expensive.


 * General (sources: bugzilla, pipermail, gerrit):
 * Bugzilla and Phabricator:

One-time
Only listing here for reference.


 * Eliminate duplicated «"source": "wikimedia:its"» identities in korma identities DB