User:AGreen (WMF)/Draft:Mapping of EventLogging properties to FundraisingImpressions database

From mediawiki.org

This page sets out how data will be input into the pgehres database by new scripts that will replace DjangoBannerStats.

Tables to deprecate[edit]

Table name Summary No e-com queries No other uses reported
landingpageimpressions Not populated by existing scripts, table is empty. Yes
globalcollect_orderids Table doesn't exist on production, only in create.sql. Yes
globalcollect_orderids_unused Table doesn't exist on production, only in create.sql. Yes
squidhost Lookup table of Varnish (formerly squid) hosts. Referenced by landingpageimpression_raw and bannerimpression_raw, but those columns are not used in queries. Yes
squidrecord Not populated by existing scripts, table is empty. Yes

Table for tracking file ingestion, to update[edit]

In legacy, the squidlog table records which files that have been ingressed. We'll rename it to files, and for backwards compatibility, create a view.

Also, we'll add several columns:

New column Description
status Enum column for the file's processing status. Can be processing or consumed. (Unprocessed files are not included in the table.)
consumed_events Total events consumed from the file.
ignored_events Total events ignored in the file (i.e., read and validated but not stored).
invalid_events Total invalid events (or lines) in the file.

Tables to maintain[edit]

Table Description
language Lookup table for languages.
country Lookup table for countries.
project Lookup table for projects.
bannerimpressions Aggregates CentralNotice banner impressions on specific dimensions.
landingpageimpression_raw Individual landing page pageview events (only certain properties).
donatewiki_unique Individual landing page pageview events (only certain properties).

Mapping from LandingPageImpression events[edit]

Legacy could process events from two sites, wikimediafoundation.org and and donate.wikimedia.org, but only put events from the second site in donatewiki_unique. Since effectively we're no longer receiving events from the first site, the new script will check all LandingPageImpression events for 'donatewiki' in the wiki property.

Input Source Validation Transformation DB location
dt event metadata Parseable by datetime.strptime() using configured format. Parsed from JSON and provided to mysql.connector as datetime.datetime object. landingpageimpression_raw.timestamp,

donatewiki_unique.timestamp

country event ^[a-zA-Z]{2,2}$ Reference in lookup table. landingpageimpression_raw.country_id
language event ^[a-z]+([_-][a-zA-Z0-9]+)?$ Reference in lookup table. Default to 'en' if value is missing or is an empty string. landingpageimpression_raw.language_id
utm_source event Default validation regex (from config) landingpageimpression_raw.utm_source,

donatewiki_unique.utm_source

utm_campaign event Default validation regex (from config) landingpageimpression_raw.utm_campaign,

donatewiki_unique.utm_campaign

utm_medium event Default validation regex (from config) landingpageimpression_raw.utm_medium
utm_key event Default validation regex (from config) landingpageimpression_raw.utm_key
contact_id event Default validation regex (from config) donatewiki_unique.contact_id
link_id event Default validation regex (from config) donatewiki_unique.link_id
template event Default validation regex (from config) Use only if landingpage is 'Special:LandingPage''. In that case, join all event properties with '~'; for any that are absent, use the string 'default'. landingpageimpression_raw.landingpage
appeal event Default validation regex (from config) landingpageimpression_raw.landingpage
appeal_template event Default validation regex (from config) landingpageimpression_raw.landingpage
form_template event Default validation regex (from config) landingpageimpression_raw.landingpage
form_countryspecific event Default validation regex (from config) landingpageimpression_raw.landingpage
landingpage event Default validation regex (from config) If this value is not 'Special:LandingPage', remove the last two parts separated by ('/'); otherwise, ignore. landingpageimpression_raw.landingpage
sample_rate event (none)
(none) landingpageimpression_raw.squid_id,

landingpageimpression_raw.squid_sequence

(none) landingpageimpression_raw.processed

Mapping from CentralNoticeImpression events[edit]

Input Source Validation Transformation Output Notes
dt event metadata Parseable by datetime.strptime() using configured format. Parsed from JSON and provided to mysql.connector as datetime.datetime object. bannerimpressions.timestamp
banner event ^[A-Za-z0-9_]+$ bannerimpressions.banner
campaign event Default validation regex (from config) bannerimpressions.campaign
project event ^[a-z0-9\-_\.]+$ Reference in lookup table. bannerimpressions.project_id
uselang event ^[a-z]+([_-][a-zA-Z0-9]+)?$ Only certain languages processed separately, following this legacy setting. Remaining languages are aggregated. Reference in lookup table. bannerimpressions.language_id
country event ^[a-zA-Z]{2,2}$ Reference in lookup table. bannerimpressions.country_id
impressionEventSampleRate event (not used, as per legacy)
(aggregate of all of the above) Total number of hits with the same values for banner, campaign, project, language (see above) and country, received in a given log file, for a given minute, scaled according to combined (server- and client-side) sample rate. bannerimpressions.count