User:SSethi (WMF)/Sandbox/Eventlogging data access

From mediawiki.org

(Borrowing instructions from Niharika's user page on office wiki)

Hive/Hadoop[edit]

Host bast1002.wikimedia.org
    # Direct connection for the bastion host
    ProxyCommand none
    ControlMaster auto

Host *.wikimedia.org *.wmnet !gerrit.wikimedia.org !git-ssh.wikimedia.org
    User enter_your_username
    # Everything else goes via bastion acting as a proxy
    ProxyCommand ssh -a -W %h:%p bast1002.wikimedia.org
    # Do not offer other identities loaded in ssh-agent
    IdentitiesOnly yes
    IdentityFile ~/.ssh/enter_your_key
  • ssh stat1007.eqiad.wmnet
  • hive
  • use event;
  • Find your table and its partition information by using describe UserFeedback;
  • Example queries:
    • select count(*) as cnt from UserFeedback where year=2019;
    • select event.page_name, count(*) as cnt from UserFeedback where year=2019 and event.vote="yes" group by event.page_name;
    • select event.page_name, count(*) as cnt from UserFeedback where year=2019 and event.vote="no" group by event.page_name;
    • select event.page_name as pagename,count(*) as total,sum(case when event.vote = 'yes' then 1 else 0 end) as yes, sum(case when event.vote = 'no' then 1 else 0 end) as no from userfeedback where year=2019 group by event.page_name;
  • To obtain quarterly stats for Toolforge docs in Help namespace on Wikitech wiki & Action API docs in API namespace on MediaWiki.org:
    • hive -e "use event; describe UserFeedback; select event.page_name as pagename, count(*) as total,sum(case when event.vote = 'yes' then 1 else 0 end) as yes, sum(case when event.vote = 'no' then 1 else 0 end) as no from userfeedback where month>=6 AND month<9 group by event.page_name;" > /home/srishakatux/temp.tsv
    • sed 's/\t/,/g' temp.tsv > temp.csv
    • scp srishakatux@stat1007.eqiad.wmnet:/home/srishakatux/temp.csv /Users/ssethi/Documents/temp.csv