Analytics/Reports/Clients without JavaScript

Goal
The goal of this project is to get a rough estimate of the percentage of our page requests that come from browsers with partial (or no) JavaScript support. The methodology provides a rough, not a detailed, estimate. The idea is to know whether the share of such clients is below 10%, 1% or 0.1%.

Methodology
Every request to all Wikimedia projects is stored in Hadoop for about 30 days. Requests are segregated into "text" (requests to desktop websites), "mobile" (requests to apps and the mobile website) and "bits" (requests to our static domain, from which JavaScript and CSS are served). Moreover, for every request to "text" and "mobile" we store whether the request is a pageview or not according to the new pageview definition.

Our method to obtain a rough estimate goes as follows:

 * 1. For a time period T, get all requests to text and mobile.
 * 2. Calculate browser percentages for all those requests (this would be 'set#1').

At the end of step 2 we know, for example, that "1.2% of our pageviews come from IE10". We do not take the device into account.

 * 3. For time period T, get all requests for JavaScript files from bits.

We exclude the startup module from JavaScript requests, as this module is the one used to restrict JS support on MediaWiki and thus it is served to all browsers.

 * 4. Calculate browser percentages on the JavaScript bits data (this would be 'set#2').
 * 5. Compare set#2 with set#1; set#1 should be a superset of set#2. Browsers that are in set#1 but do not appear in set#2 represent the set of browsers without strong JavaScript support.
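The steps above can be sketched in a few lines of Python. This is only an illustration of the set comparison, not the code used for this report: the function names, the `min_pct` threshold parameter and the toy browser labels are all assumptions made for the example.

```python
from collections import Counter


def browser_percentages(browser_labels):
    """Return {browser: percentage of requests} for an iterable of browser labels."""
    counts = Counter(browser_labels)
    total = sum(counts.values())
    return {browser: 100.0 * n / total for browser, n in counts.items()}


def browsers_without_js(pageview_pct, bits_pct, min_pct=0.001):
    """Browsers present in set#1 (pageviews) but absent from set#2 (JS on bits).

    min_pct is a hypothetical cutoff that drops browsers with a negligible
    share of pageviews.
    """
    return {
        browser: pct
        for browser, pct in pageview_pct.items()
        if browser not in bits_pct and pct >= min_pct
    }


# Toy data (made up): one label per request.
pageviews = ["Chrome 39"] * 6 + ["IE 10"] * 3 + ["Lynx 2"]
bits_js = ["Chrome 39"] * 4 + ["IE 10"] * 2

set1 = browser_percentages(pageviews)   # step 2
set2 = browser_percentages(bits_js)     # step 4
missing = browsers_without_js(set1, set2)  # step 5
print(missing)
```

In this toy data only "Lynx 2" appears in the pageviews without ever requesting JavaScript from bits, so it is the only browser reported.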

The time period we chose was the first week of January. At a later point we ran the same method over a smaller dataset of two days in February, obtaining similar results.

Caveats
This methodology will not detect users who navigate with a modern browser (say Chrome 39) but with JavaScript turned off. Detecting those requires a very specific experiment. Our report will include bot requests and count those as clients without JavaScript.

We do not expect browser percentages on text and mobile to match exactly those on bits, as requests for static files are subject to different cache ratios than requests for main content, which in the case of our projects is never cached on the client side. But the ratios between browser percentages should match. For example: if 0.6% of our pageview requests come from Chrome 39 on Mac OS X 10.6 and 0.4% come from Chrome 39 on Mac OS X 10.9, the ratio 0.6/0.4 should be about the same for the browser percentages on bits.
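The ratio check described above could look roughly like the following. The helper name `ratio_matches`, the 25% relative tolerance and the bits percentages are hypothetical, chosen only to make the example concrete; only the 0.6 and 0.4 pageview figures come from the text.

```python
def ratio_matches(pv_a, pv_b, bits_a, bits_b, tolerance=0.25):
    """True if pv_a/pv_b and bits_a/bits_b agree within a relative tolerance.

    The tolerance is an assumed value, not one taken from the report.
    """
    pv_ratio = pv_a / pv_b
    bits_ratio = bits_a / bits_b
    return abs(pv_ratio - bits_ratio) <= tolerance * pv_ratio


# Pageview shares from the example in the text: 0.6% and 0.4% (ratio 1.5).
# The bits shares below are made up for illustration.
consistent = ratio_matches(0.6, 0.4, bits_a=0.9, bits_b=0.6)    # bits ratio 1.5
inconsistent = ratio_matches(0.6, 0.4, bits_a=0.9, bits_b=0.2)  # bits ratio 4.5
print(consistent, inconsistent)
```

Absolute percentages differ between the two datasets because of caching, so the check compares ratios only.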

Phones that do little or no caching of JavaScript are overrepresented in the bits data. The most extreme case is Windows Phone 7, which is responsible for a disproportionately large share of the requests to the static domains.

Preliminary Results
Results with data from the 2nd week of January. Pageview data was sampled 1/1000, which amounts to 4 million requests; bits data was raw (200 million). The approximate share of pageview requests without JavaScript enabled is about 10%, but note that this includes bot requests. If we remove the main bots we see (Bingbot, YandexBot, Googlebot, TwitterBot and others self-labeled "Python Requests"), the percentage is much lower, about 3%.
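As an illustration of the bot removal, the sketch below sums the pageview percentages of the rows whose browser family is not a known bot. The function name and the bot list are assumptions: the list contains the bots named above plus Slurp (Yahoo's crawler), which appears in the detailed listings. The three sample rows are copied from the listing without OS info; the sum over this sample is not a result of the report.

```python
# Assumed bot families: those named in the text, plus Slurp.
BOT_FAMILIES = {
    "bingbot", "YandexBot", "Googlebot", "TwitterBot", "Slurp", "Python Requests",
}


def total_without_bots(rows, bot_families=BOT_FAMILIES):
    """Sum the pageview percentage of all rows whose browser family is not a bot.

    Each row is a (percentage, user_agent_dict) pair.
    """
    return sum(
        pct for pct, ua in rows if ua["browser_family"] not in bot_families
    )


# Three rows copied from the listing without OS info.
sample = [
    (0.0035, {"browser_major": "2", "browser_family": "Lynx"}),
    (0.1377, {"browser_major": "5", "browser_family": "IE"}),
    (6.0593, {"browser_major": "2", "browser_family": "bingbot"}),
]

human_total = total_without_bots(sample)
print(round(human_total, 4))
```

Matching on the exact family string is the simplest choice; bots that do not identify themselves in the user agent would still be counted as browsers.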

Details
Percentage of pageview totals for browsers that do not request javascript files on bits.

Without OS info
The percentage of browsers without JavaScript enabled, with bots removed, is about 2.5%. Note that the list below reports browsers responsible for at least 0.001% of pageviews at the OS level, which makes Opera Mini visible.

Note that IE7 is not on this list, because we do see a fair amount of requests on bits for IE7 if we aggregate all requests with a user agent of this type: {"browser_major": "7", "browser_family": "IE"}

 0.0012 {"browser_major": "15", "browser_family": "Chrome Frame"}
 0.0012 {"browser_major": "21", "browser_family": "Opera Mini"}
 0.0013 {"browser_major": "2", "browser_family": "Opera Mini"}
 0.0013 {"browser_major": "22", "browser_family": "Opera Mini"}
 0.0014 {"browser_major": "0", "browser_family": "Maxthon"}
 0.0014 {"browser_major": "25", "browser_family": "Opera Mini"}
 0.0014 {"browser_major": "7", "browser_family": "Opera"}
 0.0014 {"browser_major": "8530", "browser_family": "BlackBerry"}
 0.0017 {"browser_major": "5", "browser_family": "Baidu Browser"}
 0.0017 {"browser_major": "9700", "browser_family": "BlackBerry"}
 0.0019 {"browser_major": "0", "browser_family": "Kazehakase"}
 0.0019 {"browser_major": "11", "browser_family": "Opera Mobile"}
 0.0019 {"browser_major": "14", "browser_family": "Opera Mobile"}
 0.0019 {"browser_major": "2", "browser_family": "iBrowser"}
 0.002 {"browser_major": "4", "browser_family": "SEMC-Browser"}
 0.002 {"browser_major": "537", "browser_family": "WebKit Nightly"}
 0.0021 {"browser_major": "0", "browser_family": "Python Requests"}
 0.0021 {"browser_major": "2010", "browser_family": "Outlook"}
 0.0021 {"browser_major": "720", "browser_family": "CFNetwork"}
 0.0023 {"browser_major": "1", "browser_family": "K-Meleon"}
 0.0024 {"browser_major": "10", "browser_family": "Opera Mobile"}
 0.0027 {"browser_major": "3", "browser_family": "Nokia OSS Browser"}
 0.0029 {"browser_major": "24", "browser_family": "Thunderbird"}
 0.0031 {"browser_major": "0", "browser_family": "K-Meleon"}
 0.0031 {"browser_major": "548", "browser_family": "CFNetwork"}
 0.0035 {"browser_major": "2", "browser_family": "Lynx"}
 0.0035 {"browser_major": "9", "browser_family": "Opera Mini"}
 0.0036 {"browser_major": "2007", "browser_family": "Outlook"}
 0.0037 {"browser_major": "-", "browser_family": "CFNetwork"}
 0.0046 {"browser_major": "31", "browser_family": "Thunderbird"}
 0.0053 {"browser_major": "4", "browser_family": "Ovi Browser"}
 0.0078 {"browser_major": "3", "browser_family": "NetFront"}
 0.0084 {"browser_major": "18", "browser_family": "Chromium"}
 0.0101 {"browser_major": "454", "browser_family": "CFNetwork"}
 0.0164 {"browser_major": "609", "browser_family": "CFNetwork"}
 0.0207 {"browser_major": "3", "browser_family": "Ovi Browser"}
 0.0221 {"browser_major": "1", "browser_family": "TwitterBot"}
 0.0339 {"browser_major": "7", "browser_family": "Nokia Browser"}
 0.0368 {"browser_major": "2", "browser_family": "Ovi Browser"}
 0.0551 {"browser_major": "6", "browser_family": "UP.Browser"}
 0.1104 {"browser_major": "8", "browser_family": "Opera"}
 0.124 {"browser_major": "2", "browser_family": "Python Requests"}
 0.1377 {"browser_major": "5", "browser_family": "IE"}
 0.1387 {"browser_major": "4", "browser_family": "IE"}
 0.2476 {"browser_major": "711", "browser_family": "CFNetwork"}
 0.5503 {"browser_major": "-", "browser_family": "YandexBot"}
 0.882 {"browser_major": "-", "browser_family": "Slurp"}
 6.0593 {"browser_major": "2", "browser_family": "bingbot"}

With OS Info
Note that browsers responsible for less than 0.01% of pageviews are not on this list, which is why IE6 and Opera Mini (for which there is a lot of fragmentation across OSes) do not appear. The total for browsers that do not support JavaScript (bots removed) according to this list is about 3%.

 0.0101 {"os_minor": "6", "os_major": "10", "os_family": "Mac OS X", "browser_major": "454", "browser_family": "CFNetwork"}
 0.0112 {"os_minor": "3", "os_major": "9", "os_family": "Symbian OS", "browser_major": "7", "browser_family": "Nokia Browser"}
 0.0115 {"os_minor": "0", "os_major": "8", "os_family": "iOS", "browser_major": "4", "browser_family": "Sleipnir"}
 0.0115 {"os_minor": "11", "os_major": "3", "os_family": "Linux", "browser_major": "2", "browser_family": "Python Requests"}
 0.0124 {"os_minor": "-", "os_major": "-", "os_family": "Mac OS X", "browser_major": "-", "browser_family": "Safari"}
 0.0138 {"os_minor": "-", "os_major": "-", "os_family": "Other", "browser_major": "7", "browser_family": "IE"}
 0.0145 {"os_minor": "4", "os_major": "9", "os_family": "Symbian OS", "browser_major": "7", "browser_family": "Nokia Browser"}
 0.0164 {"os_minor": "1", "os_major": "6", "os_family": "iOS", "browser_major": "609", "browser_family": "CFNetwork"}
 0.0167 {"os_minor": "-", "os_major": "-", "os_family": "Nokia Series 40", "browser_major": "3", "browser_family": "Ovi Browser"}
 0.0176 {"os_minor": "-", "os_major": "-", "os_family": "Windows", "browser_major": "5", "browser_family": "IE"}
 0.0215 {"os_minor": "-", "os_major": "-", "os_family": "Windows", "browser_major": "4", "browser_family": "IE"}
 0.0221 {"os_minor": "-", "os_major": "-", "os_family": "Other", "browser_major": "1", "browser_family": "TwitterBot"}
 0.0287 {"os_minor": "-", "os_major": "-", "os_family": "Mac OS X", "browser_major": "-", "browser_family": "Other"}
 0.0323 {"os_minor": "-", "os_major": "-", "os_family": "Windows 95", "browser_major": "-", "browser_family": "Other"}
 0.0368 {"os_minor": "-", "os_major": "-", "os_family": "Nokia Series 40", "browser_major": "2", "browser_family": "Ovi Browser"}
 0.0408 {"os_minor": "0", "os_major": "7", "os_family": "iOS", "browser_major": "672", "browser_family": "CFNetwork"}
 0.0425 {"os_minor": "-", "os_major": "-", "os_family": "Red Hat", "browser_major": "-", "browser_family": "Other"}
 0.0432 {"os_minor": "-", "os_major": "-", "os_family": "Windows XP", "browser_major": "-", "browser_family": "Safari"}
 0.0454 {"os_minor": "0", "os_major": "7", "os_family": "iOS", "browser_major": "2", "browser_family": "bingbot"}
 0.0551 {"os_minor": "-", "os_major": "-", "os_family": "Other", "browser_major": "6", "browser_family": "UP.Browser"}
 0.0943 {"os_minor": "13", "os_major": "3", "os_family": "Linux", "browser_major": "2", "browser_family": "Python Requests"}
 0.1101 {"os_minor": "-", "os_major": "-", "os_family": "Windows XP", "browser_major": "8", "browser_family": "Opera"}
 0.1122 {"os_minor": "-", "os_major": "-", "os_family": "Windows CE", "browser_major": "4", "browser_family": "IE"}
 0.1182 {"os_minor": "-", "os_major": "-", "os_family": "Windows 2000", "browser_major": "5", "browser_family": "IE"}
 0.2476 {"os_minor": "0", "os_major": "8", "os_family": "iOS", "browser_major": "711", "browser_family": "CFNetwork"}
 0.2718 {"os_minor": "-", "os_major": "-", "os_family": "Other", "browser_major": "9", "browser_family": "Chrome"}
 0.3111 {"os_minor": "-", "os_major": "-", "os_family": "Windows 98", "browser_major": "-", "browser_family": "Other"}
 0.5503 {"os_minor": "-", "os_major": "-", "os_family": "Other", "browser_major": "-", "browser_family": "YandexBot"}
 0.6775 {"os_minor": "0", "os_major": "6", "os_family": "iOS", "browser_major": "2", "browser_family": "Googlebot"}
 0.882 {"os_minor": "-", "os_major": "-", "os_family": "Other", "browser_major": "-", "browser_family": "Slurp"}
 6.0139 {"os_minor": "-", "os_major": "-", "os_family": "Other", "browser_major": "2", "browser_family": "bingbot"}

What about IE6/IE7?
According to our code base we should not be serving any JavaScript to IE6/IE7 browsers. However, we do see some IE6/IE7 user agents on bits. For example, the user agent below is responsible for 1% of total pageviews (across all Wikipedia projects) and for a large number of requests on bits: {"browser_major":"6","os_family":"Windows XP","os_major":"-","device_family":"Other","browser_family":"IE","os_minor":"-"}

Likely this browser is not identified by our code as IE6 and thus is being served JavaScript (this is a bug). We need to do a little more research here to see which JavaScript requests are being served.

Queries
Browser percentages for global pageviews:

 CREATE TEMPORARY FUNCTION ua as 'org.wikimedia.analytics.refinery.hive.UAParserUDF';
 -- https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hive/Queries#Avoiding_overgreedy_scans_.2F_Operator_precedence
 use wmf;
 -- Count pageviews per parsed user agent for a single day.
 SELECT a.useragent, count(*)
 FROM (
     SELECT ua(user_agent) AS useragent
     FROM webrequest
     WHERE year=2015 AND month=02 AND day=25
       AND is_pageview=true
       AND webrequest_source IN ("mobile","text")
 ) a
 GROUP BY a.useragent;

Browser percentages for bits requests:

 CREATE TEMPORARY FUNCTION ua as 'org.wikimedia.analytics.refinery.hive.UAParserUDF';
 -- https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hive/Queries#Avoiding_overgreedy_scans_.2F_Operator_precedence
 use wmf_raw;
 -- Count JavaScript requests on bits per parsed user agent, excluding the startup module.
 SELECT a.useragent, count(*)
 FROM (
     SELECT ua(user_agent) AS useragent
     FROM webrequest
     WHERE year=2015 AND month=02 AND day=25
       AND uri_path NOT LIKE '%startup%'
       AND webrequest_source IN ("bits")
       AND content_type="text/javascript"
 ) a
 GROUP BY a.useragent;