Fundraising tech/Database schema

This page describes the WMF Fundraising database schemas and their fields.

The databases are only available in the private fundraising cluster. Please request access through Phabricator.

See also Fun SQL Queries.

= pgehres database =
This database is named in honor of our dear friend Peter Gehres, and holds a slightly aggregated cache of banner impression statistics. It is written to by a command-line Django script, run as a cron job, which processes the series of logged web requests to `beacon/impression`. To get the source: `git clone https://gerrit.wikimedia.org/r/wikimedia/fundraising/tools/DjangoBannerStats`.

One day, we hope to replace this database.

pgehres.bannerimpressions
Banner request counts, aggregated from the weblogs.

Banner impressions are now available in the `bannerimpressions` table. They are aggregated in 5-minute chunks and grouped by banner, campaign, project, language and country. The timestamp reflects the "middle" of the period (e.g. 1:00:00 to 1:05:00 is recorded as 1:02:30). The count is found in the `count` column and is corrected for any sampling.

P.S. - Raw impressions can be found in `bannerimpressions_raw`, but I would strongly recommend against querying it due to the massive number of rows. Queries on the aggregate table will be orders of magnitude faster.

-- pgehres

MariaDB [(none)]> use pgehres;
MariaDB [pgehres]> describe bannerimpressions;
+-------------+----------------------+------+-----+-------------------+----------------+
| Field       | Type                 | Null | Key | Default           | Extra          |
+-------------+----------------------+------+-----+-------------------+----------------+
| id          | int(11) unsigned     | NO   | PRI | NULL              | auto_increment |
| timestamp   | timestamp            | NO   | MUL | CURRENT_TIMESTAMP |                |
| banner      | varchar(255)         | NO   | MUL |                   |                |
| campaign    | varchar(255)         | NO   | MUL |                   |                |
| project_id  | smallint(3) unsigned | YES  | MUL | NULL              |                |
| language_id | smallint(3) unsigned | YES  | MUL | NULL              |                |
| country_id  | smallint(3) unsigned | YES  | MUL | NULL              |                |
| count       | mediumint(11)        | YES  |     | 0                 |                |
+-------------+----------------------+------+-----+-------------------+----------------+

`project_id`, `language_id` and `country_id` map to `project`, `language` and `country` tables in the same database.

One gotcha is that, for performance reasons, only the top 20-ish languages get a real `language_id`. Everything else gets a generic one. See https://gerrit.wikimedia.org/r/#/c/119740/ for how to add to that list. (TODO: Split all languages.)
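
As an illustration of querying the aggregate table, here is a minimal sketch that rolls the 5-minute buckets up into hourly totals per campaign and banner. The date range and the campaign name `example_campaign` are placeholders, not real values. Because `count` is already corrected for sampling, it is summed as-is.

```sql
-- Hourly impression totals per campaign and banner for one day.
-- The date range and campaign name are placeholders; adjust to taste.
SELECT
    DATE_FORMAT(`timestamp`, '%Y-%m-%d %H:00') AS hour_start,
    campaign,
    banner,
    SUM(`count`) AS impressions          -- already corrected for sampling
FROM pgehres.bannerimpressions
WHERE `timestamp` >= '2015-12-01 00:00:00'
  AND `timestamp` <  '2015-12-02 00:00:00'
  AND campaign = 'example_campaign'      -- hypothetical campaign name
GROUP BY hour_start, campaign, banner
ORDER BY hour_start, impressions DESC;
```

Since each row's timestamp is the middle of its 5-minute period, every bucket falls cleanly inside one hour and no bucket is split across two groups.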

pgehres.landingpageimpression_raw
Sorry, there is no built-in aggregation. This table logs the landing page and UTM data for various URLs. We're making this data available for future landing page A/B testing, but there are no active consumers.

= drupal database =
Our Drupal modules add a few tables of interest.

drupal.contribution_source
Unpacked normalization of `contribution_tracking.utm_source` into its three components: banner, landing page, and payment method.

Join against `contribution_tracking` like: select * from contribution_tracking t left join contribution_source s on s.contribution_tracking_id = t.id; (TODO: understand or fix why some rows are missing: T98643)

banner - Name of the banner

drupal.contribution_tracking
For every person that lands on a payments.wiki (BROKEN: or donate.wiki) page, a row is created in the contribution_tracking table. This is what we have historically always used to track landing page impressions. The record is updated with a contribution ID if it results in a successful donation.

Description of fields

 * id: an autonumber
 * contribution_id: Joins to the id column in civicrm_contribution. NULL for contributions that were not actually completed.
 * form_amount: The currency code and amount that the user had initially selected.
 * usd_amount: Apparently broken
 * note: No longer in use
 * referrer: The page that got the user to our pipeline. Usually a wiki project page. Sometimes something totally different.
 * anonymous: True if the user has selected an option indicating that they wish to remain anonymous. THIS OPTION IS NOT PRESENT ON ALL FORMS
 * utm_source: A string that builds when a user moves through our donation pipeline. Typically includes a banner name/email code, any landing page info, and a payment method
 * utm_medium: A general indication of the group of places that this user came from (common ones are 'sitenotice', 'sidebar' or 'email')
 * utm_campaign: The specific campaign that this person came from
 * utm_key: For recent-ish banners, how many times the person saw a banner (cookieCount) before they started this contribution. No longer in use.
 * payments_form: Also apparently broken
 * optout: True if the user has selected an option indicating that they wish to opt out of all bulk emails. This does not apply to the Thank You email, which we are legally obligated to send (at least in the US). THIS OPTION IS NOT PRESENT ON ALL FORMS
 * language: The user's language preferences
 * country: The user's country of web origin.
 * ts: Timestamp
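
Because a row exists for every landing page visit and `contribution_id` is only filled in for successful donations, a rough conversion rate can be read straight from this table. The following is a sketch only; the campaign name `example_campaign` is a placeholder.

```sql
-- Landing page visits versus completed donations for one campaign.
-- COUNT(contribution_id) counts only rows where contribution_id IS NOT NULL.
SELECT
    utm_campaign,
    utm_medium,
    COUNT(*)                            AS landing_page_visits,
    COUNT(contribution_id)              AS completed_donations,
    COUNT(contribution_id) / COUNT(*)   AS conversion_rate
FROM drupal.contribution_tracking
WHERE utm_campaign = 'example_campaign'  -- hypothetical campaign name
GROUP BY utm_campaign, utm_medium;
```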

drupal.banner_history_contribution_associations
Links contribution_tracking ids with banner history log ids.

drupal.exchange_rates
Current and historical foreign exchange rates.

drupal.large_donation_notification
Donation amount thresholds that trigger an email, maintained by the `large_donation` module.

drupal.wmf_campaigns_campaign
Campaign names which will trigger an email upon matching donations. Maintained by the `wmf_campaigns` module.

= civicrm database =
This is the database that drives civi. As such, all information about completed transactions will be in there somewhere.

Note that we add many custom fields which are managed by CiviCRM and should not be queried directly due to dynamically generated table names, e.g. `civicrm_value_1_stock_information_10`.

civicrm.civicrm_address
Billing address given by the donor. You must always restrict to `civicrm_address.is_primary = 1` when querying.

civicrm.civicrm_contact
Main record for a donor. We create a new contact for every donation, and deduping happens after the fact, if ever.

civicrm.civicrm_contribution
This table contains all the financial information about every donation we have received.

Field Descriptions

 * id - Primary key. Joins to drupal.contribution_tracking.contribution_id.
 * contact_id - Joins to civicrm_contact.id
 * financial_type_id - Joins to civicrm_financial_type.id. We're inconsistent about how we assign financial types.
 * payment_instrument_id - Joins to civicrm_option_value where option_group_id = 10 ("payment_instrument"). This encodes the full payment method.
 * receive_date - The date that the transaction was initiated on the payments system.
 * total_amount - The donation amount in USD.
 * trxn_id - A unique transaction identifier, not necessarily the same as the gateway's transaction ID. Usually starts with the gateway in all-caps, followed by the gateway's transaction id for this donation.
 * thankyou_date - The date we sent the Thank You letter to the donor.
 * source - Original currency and gross.
 * contribution_recur_id - If this is a recurring payment, this will join to civicrm_contribution_recur.id, otherwise will be NULL.
 * contribution_status_id - Joins to civicrm_option_value where option_group_id = 11 ("contribution_status").
 * check_number - If it's a check, this should be a number.
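
To make the `civicrm_option_value` lookups above concrete, here is a hedged sketch that resolves both ids to human-readable labels. It assumes the ids match `civicrm_option_value.value` (the usual CiviCRM convention) and that the display text lives in a `label` column; neither detail is stated on this page, so check before relying on it.

```sql
-- Most recent contributions with payment instrument and status labels.
-- Assumption: *_id columns match civicrm_option_value.value, not its id.
SELECT
    c.id,
    c.receive_date,
    c.total_amount,
    pi.label AS payment_instrument,
    cs.label AS contribution_status
FROM civicrm.civicrm_contribution c
LEFT JOIN civicrm.civicrm_option_value pi
    ON pi.option_group_id = 10           -- "payment_instrument"
   AND pi.value = c.payment_instrument_id
LEFT JOIN civicrm.civicrm_option_value cs
    ON cs.option_group_id = 11           -- "contribution_status"
   AND cs.value = c.contribution_status_id
ORDER BY c.receive_date DESC
LIMIT 20;
```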

Joining civicrm_contribution and contribution_tracking
select * from drupal.contribution_tracking t left join civicrm.civicrm_contribution c on c.id = t.contribution_id;
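
Building on that join, a sketch like the following would total USD donations by `utm_medium` for a date range. The dates are placeholders. An inner join is used so that only tracking rows with a matching contribution are counted. No filter on `contribution_status_id` is applied here, so refunded or failed contributions would be included unless you add one.

```sql
-- Donation count and USD total per utm_medium for one month.
SELECT
    t.utm_medium,
    COUNT(c.id)          AS donations,
    SUM(c.total_amount)  AS total_usd
FROM drupal.contribution_tracking t
JOIN civicrm.civicrm_contribution c
    ON c.id = t.contribution_id
WHERE c.receive_date >= '2015-12-01'     -- placeholder range
  AND c.receive_date <  '2016-01-01'
GROUP BY t.utm_medium
ORDER BY total_usd DESC;
```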

civicrm.civicrm_email
Always restrict to civicrm_email.is_primary = 1 unless you're doing something crazy.
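
A minimal sketch of applying the `is_primary` restriction to both email and address when looking up a donor. The columns `display_name`, `email`, `city`, `postal_code` and `contact_id` come from the stock CiviCRM schema and are not documented on this page, and the contact id is a placeholder.

```sql
-- One contact with its primary email and primary billing address.
-- is_primary is checked in the ON clause so that contacts lacking an
-- email or address still come back, with NULLs in those columns.
SELECT
    ct.id AS contact_id,
    ct.display_name,
    e.email,
    a.city,
    a.postal_code
FROM civicrm.civicrm_contact ct
LEFT JOIN civicrm.civicrm_email e
    ON e.contact_id = ct.id
   AND e.is_primary = 1
LEFT JOIN civicrm.civicrm_address a
    ON a.contact_id = ct.id
   AND a.is_primary = 1
WHERE ct.id = 12345;                     -- hypothetical contact id
```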

civicrm.wmf_contribution_extra
The `wmf_civicrm` module adds its own schema to `civicrm_contribution`, stored in the `wmf_contribution_extra` table. Perhaps it should be in the Drupal database, but CiviCRM core doesn't know about custom tables in another database.

Fields

 * id - primary key
 * entity_id - Joins to civicrm_contribution.id.
 * gateway - Which payment processor handled this transaction.
 * gateway_account - Account name for processors with multiple accounts.
 * gateway_txn_id - Order ID at the processor. Note that this is not necessarily unique: processors have funny ways of recording refunds, recurring payments and so on.
 * original_amount - Gross in the native currency.
 * original_currency - Native currency code.
 * parent_contribution_id - Link to civicrm_contribution.id, for refunds only. This should be deprecated with Civi 4.6.
 * finance_only - We're hiding this record from most reports. (TODO: document why)
 * source_name - Specific system responsible for creating this donation record.
 * source_type - Class of source system.
 * source_host - Originating machine.
 * source_version - Revision of the code that produced this record.
 * source_enqueued_time - Time at which this message was first sent to the completed donation queue.
 * no_thank_you - A string explaining why we aren't sending an automatic thank-you letter. Usually NULL.  If there is content in this field, the `thank_you` job will not send an automatic letter.

Joining wmf_contribution_extra and civicrm_contribution
select * from civicrm_contribution c join wmf_contribution_extra e on e.entity_id = c.id;
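
Since `gateway_txn_id` is not guaranteed to be unique, it can be useful to see where it repeats before joining on it. This is a sketch under that assumption; the date range is a placeholder.

```sql
-- Gateway transaction ids that appear on more than one contribution,
-- together with the CiviCRM trxn_id values that share them.
SELECT
    e.gateway,
    e.gateway_txn_id,
    COUNT(*) AS contribution_rows,
    GROUP_CONCAT(c.trxn_id SEPARATOR ', ') AS trxn_ids
FROM civicrm.civicrm_contribution c
JOIN civicrm.wmf_contribution_extra e
    ON e.entity_id = c.id
WHERE c.receive_date >= '2015-12-01'     -- placeholder range
  AND c.receive_date <  '2015-12-02'
GROUP BY e.gateway, e.gateway_txn_id
HAVING COUNT(*) > 1;
```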

= fredge database =

fredge.payments_fraud
Summary of risk score and validation outcome for each donation attempt.

fredge.payments_fraud_breakdown
Individual components of `payments_fraud.risk_score`. Join against that table like: select * from payments_fraud f join payments_fraud_breakdown b on b.payments_fraud_id = f.id;

fredge.payments_initial
Information about donation outcome, measured when the initial donation workflow is completed.

Join to `contribution_tracking` like: select * from fredge.payments_initial i join drupal.contribution_tracking t on t.id = i.contribution_tracking_id;