User:Bawolff/Reflections on graphs

From mediawiki.org
These are just some idle thoughts. I have no intention of doing anything about any of this at this time.

The Graph extension has been around since about 2014. It was part of an attempt by Yurik to bring his bold vision of a more interactive Wikipedia to fruition.

I think Graph is an important attempt, and I commend everyone who worked on it for attempting to realize this vision. However, looking back, I can't help but be a little disappointed with the uptake of the Graph extension. I am beginning to feel that it might not be the right approach to bring interactivity to Wikipedia.

How is the Graph extension used currently

This analysis is limited to English Wikipedia, largely because I speak English and this already took me a lot of time.

Currently, on English Wikipedia, the graph extension is used on 26,238 pages. This of course is a small percentage of the total pages, but nonetheless looks quite successful at first glance. However, most of these are in non-content namespaces, from a template that generates a graph of page views for a specific page (w:Template:PageViews graph).

There are 4,140 pages on en.wikipedia.org in the main namespace that use graphs. The breakdown by quality level is:

+----------+----------+
| class    | count(*) |
+----------+----------+
| FA       |       16 |
| GA       |       64 |
| B        |      246 |
| C        |      593 |
| Start    |     1624 |
| Stub     |     1007 |
| FL       |        5 |
| List     |      215 |
| Redirect |        3 |
| Template |        1 |
| Future   |        1 |
|          |      261 |
+----------+----------+

Query:
 select pc, count(*)
 from (
   select max(pa_class) 'pc', page_id
   from categorylinks
   inner join page on cl_from = page_id
   inner join page_assessments on pa_page_id = page_id
   where cl_to = 'Pages_using_the_Graph_extension' and page_namespace = 0
   group by page_id
 ) `a`
 group by 1;

As a percentage, that's 0.07% of articles overall, 0.2% of "Good Articles", and 0.3% of Featured Articles. Creating good interactive content is always going to be hard. However, if the extension had been successfully adopted, I would expect to see the best Wikipedia content have interactive elements where they suit the subject matter.

Furthermore, where the extension is used, it is mostly used to make simple graphs via a few specific templates:

+------------------------------------------+----------+
| tl_title                                 | count(*) |
+------------------------------------------+----------+
| Graph:Street_map_with_marks              |     2501 |
| Graph:Chart                              |     1075 |
| Graph:Map                                |      436 |
+------------------------------------------+----------+

Query:
 select tl_title, count(*)
 from templatelinks
 where tl_title like 'Graph:%'
   and tl_from_namespace = 0
   and tl_namespace = 10
   and tl_title not like '%/styles.css'
 group by 1
 order by 2 desc
 limit 40;

w:Template:Graph:Street_map_with_marks is essentially only used by w:Template:OSM_Location_map, where the Graph extension is used to make thumbnails for Kartographer. While this is interactive content, I don't think it really counts as using the Graph extension if it's just to make a thumbnail of Kartographer content. w:Template:Graph:Map is used to make a map with certain countries highlighted. While that's cool and all, it's not exactly what I'm thinking of when I think of interactive data visualization.

w:Template:Graph:Chart is used to make various types of standard static plots (line plots, bar graphs, pie charts). These types of charts are of course really important for any system named "Graph". However, they are still the most basic form of graph. I can't help but think that these types of graphs aren't really fulfilling the interactive data visualization vision.

So after that, there are 154 other main-space examples (some pages may be unfairly excluded if they have both a simple graph and a complex graph). Many of these are Vega specs placed directly on the page (perhaps due to VisualEditor integration). The list is: Raglan, Onga, Epidemiology of obesity, LA Galaxy, Santa Rosa, Nesploy, Ranhat, Montgai, Dril, Bogotá, List of Moscow Metro stations, Living Building Challenge, Tiburon, Healdsburg, Calistoga, Disneyland, Gliwice, Rohnert Park, Vancouver Whitecaps FC, Erlangen, Kahramanmaraş Airport, List of vetoed United Nations Security Council resolutions, Falkensee, HIV/AIDS in New York City, Triacanthidae, Sebastopol, Cloverdale, Sonoma, Scorzè, Eatontown, Annetta North, Águas de São Pedro, Rappler, Public transport in Hamilton and Waikato, 2017 Census of Pakistan, 2018 Cypriot presidential election, Pelendri, Portland Timbers, Other White, List of most expensive cars sold at auction, Expansion timeline of the Moscow Metro, George Rogers Park, Government spending in the United Kingdom, Kamenica nad Cirochou, Macquarie College, Renewable energy in Spain, Romanian Intelligence Service, Social earnings ratio, Taxation in Iceland, University Rover Challenge, Vrtište, Wałbrzych Special Economic Zone "INVEST-PARK", Arcadia Fund, Community Consolidated School District 89, United Nations Security Council veto power, Disability in South Korea, Overwatch, Steam, Smoking in Australia, Childbirth in South Korea, Kalamassery, Ano Doliana, Renewable energy in Italy, Krasnodar Krai, Doliana, Child mortality, Health care, HIV/AIDS, Cancer, List of countries by carbon dioxide emissions, Banana, Seattle Sounders FC, List of Seattle Sounders FC seasons, Mobile phone, Landline, Homicide, List of U.S. states and territories by GDP, Climate of Chicago, Suicide in Russia, Climate of Los Angeles, Burn, Thermal burn, Geography of Washington, North Thames Gas Board, Geography of New York City, Web search engine, Politics of global warming, List of countries by carbon dioxide emissions per capita, New York City, MATLAB, Kapellen, Tilbury power stations, West Thurrock Power Station, Northfleet Power Station, Grain Power Station, Kingsnorth power station, Belvedere Power Station, Blackwall Point Power Station, Brunswick Wharf Power Station, Barking Power Station, Fulham Power Station, Richborough Power Station, Bradwell nuclear power station, Sizewell nuclear power stations, Dungeness Nuclear Power Station, Woolwich Power Station, Stepney Power Station, Deptford Power Station, Littlebrook Power Station, West Ham Power Station, Hackney Power Station, Acton Lane Power Station, Taylors Lane Power Station, Brimsdown Power Station, Croydon power stations, Kingston Power Station, Battersea Power Station, List of most expensive paintings, Grove Road Power Station, Little Barford Power Station, Great Yarmouth Power Station, Cliff Quay Power Station, Goldington Power Station, Letchworth, Peterborough Power Station, Hastings Power Station, Shoreham Power Station, Communist Party of Britain, Ashford Power Station, Govt First Grade College Ankola, Communist Party of Great Britain, London Electricity Board, Charing Cross and Strand Electricity Supply Corporation Limited, City of London Electric Lighting Company Limited, SEEBOARD, Southern Electric, SWEB Energy, Eastern Electricity, East Midlands Electricity, Midlands Electricity, SWALEC, MANWEB, Yorkshire Electricity, North Eastern Electricity Board, NORWEB, Smart Growth America, Norwich power stations, Faaaha, 2020 coronavirus pandemic in Qatar, Timeline of the 2019–20 coronavirus pandemic, United Nations Security Council, 2020 coronavirus pandemic in Sri Lanka, 2019–20 coronavirus pandemic in mainland China, 2019–20 coronavirus pandemic.

Most of these are simple static graphs. Some notable exceptions are interactive time-scale maps, such as the one at w:Template:Interactive COVID-19 maps and the one at w:List_of_countries_by_carbon_dioxide_emissions, which show how geographic data evolves over time (see also w:template:Global Heat Maps by Year), as well as the graph at w:Vancouver_Whitecaps_FC. Nonetheless, I have yet to see any examples where a graph-based visualization makes what would otherwise be a difficult concept clear, or where the visualization stops me in my tracks and is core to my understanding of the article. A picture is worth a thousand words; where are the graphs that are worth a thousand words?

What are the issues

I think the fundamental problem with the Graph extension is that it's both too high level and too low level at the same time.

It is too low level: The Graph extension uses the "Vega" JSON format to specify graphs. This is a declarative grammar for specifying visualizations. It is (imo) unintuitive, and extremely difficult for the average user to make sense of. It is also specialist knowledge that people wouldn't just happen to have for unrelated reasons (compare Lua programming with Scribunto, where a certain percentage of wiki editors happen to be programmers and so are familiar with it already). As a result, there is not much innovation around different graph types. Most are just copied and pasted from a few sets of examples.
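To give a sense of how much grammar even a trivial chart needs, here is a rough sketch of a bar chart spec in the style of the Vega 2 bar chart tutorial, wrapped in a TypeScript function. The function name `barChartSpec` and the sample data are made up for illustration, and the property names are reproduced from memory and vary between Vega versions, so treat this as an approximation rather than something to paste into a page.

```typescript
// Hypothetical helper: builds a Vega-2-style bar chart spec from plain data.
// Property names follow the Vega 2 tutorial from memory and may differ in
// other Vega versions.
type Datum = { x: string; y: number };

function barChartSpec(values: Datum[]): Record<string, unknown> {
  return {
    width: 400,
    height: 200,
    data: [{ name: "table", values }],
    scales: [
      { name: "x", type: "ordinal", range: "width", domain: { data: "table", field: "x" } },
      { name: "y", type: "linear", range: "height", nice: true, domain: { data: "table", field: "y" } },
    ],
    axes: [
      { type: "x", scale: "x" },
      { type: "y", scale: "y" },
    ],
    marks: [
      {
        type: "rect",
        from: { data: "table" },
        properties: {
          enter: {
            x: { scale: "x", field: "x" },
            width: { scale: "x", band: true, offset: -1 },
            y: { scale: "y", field: "y" },
            y2: { scale: "y", value: 0 },
          },
          update: { fill: { value: "steelblue" } },
        },
      },
    ],
  };
}

// Print the JSON that would end up on the wiki page.
const exampleSpec = barChartSpec([{ x: "A", y: 28 }, { x: "B", y: 55 }]);
console.log(JSON.stringify(exampleSpec, null, 2));
```

Even with only two data points, the expanded JSON runs to dozens of lines, and little of it resembles anything an editor would already know from templates or Lua.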

It is too high level: The JSON syntax is rather high level, and does not mesh well with existing template technologies, such as Scribunto. While people do use Scribunto to create these JSON documents, the process (imho) is not really pleasant. It is my belief that a more imperative model would be easier for people to integrate into the template ecosystem. This is critically important, as better integration into the template system allows for better abstractions, which are needed for it to be used by average users.

Additionally, imo the format is too high level to meet the goals of general interactive content. The format provides support for various types of graphs, but is limited to graphs. The vision I see of interactive content includes things like physics demonstrations, Game of Life demos, etc. In essence, the sorts of things Java applets were used for in educational content on the early internet.

The Graph extension provides a very complex primitive that is confusing to understand and hard to build abstractions around. I believe we should instead be providing a simple primitive that is easy to understand (but low level) and has a very good integration story with Scribunto, so that users can build abstraction layers that are appropriate to their needs.
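As a rough illustration of what a simple primitive plus user-built abstractions could look like, here is a sketch in TypeScript. Everything in it (`DrawList`, `rect`, `text`, `barChart`) is hypothetical and not an existing MediaWiki, Scribunto, or Vega API; in practice the equivalent would presumably be exposed to Scribunto modules.

```typescript
// Hypothetical low-level primitive: a plain list of drawing commands.
type Command =
  | { op: "rect"; x: number; y: number; w: number; h: number; fill: string }
  | { op: "text"; x: number; y: number; value: string };

class DrawList {
  readonly commands: Command[] = [];

  rect(x: number, y: number, w: number, h: number, fill: string): void {
    this.commands.push({ op: "rect", x, y, w, h, fill });
  }

  text(x: number, y: number, value: string): void {
    this.commands.push({ op: "text", x, y, value });
  }
}

// An abstraction a template author could layer on top: a basic bar chart.
// Bars are scaled so the largest value fills the full height.
function barChart(data: Array<[string, number]>, height: number, barWidth: number): DrawList {
  const out = new DrawList();
  const max = Math.max(...data.map(([, v]) => v), 1);
  data.forEach(([label, value], i) => {
    const h = (value / max) * height;
    out.rect(i * barWidth, height - h, barWidth - 1, h, "steelblue");
    out.text(i * barWidth, height + 12, label);
  });
  return out;
}

const exampleChart = barChart([["A", 28], ["B", 55]], 100, 20);
console.log(JSON.stringify(exampleChart.commands));
```

The primitive is only two operations, so it is easy to explain, and the bar chart is an ordinary loop that a template author could read and modify.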

Last of all, I think there is a lack of shared understanding around what type of visualizations Wikipedia needs. We need to better understand the types of things that Wikipedians want in their articles, so that we can make tools to give that to them. A Wikipedia article is a formal document; not every type of interactive content is appropriate to it. Right now, there is a lack of motivating examples of what an ideal, interactively rich Wikipedia article would look like. At most, people make hand-wavy gestures to Encarta's interactive content or the idea of a museum where you can "touch" the exhibits. I think to really bootstrap this enterprise, we would need to see what an example Wikipedia article with amazing, high-quality visualizations looks like.

To somewhat answer my own question: I think what we should be aiming for is something akin to some of the more educational articles on ObservableHQ. The aim should be to allow users to interact with the topic of the article, if that's applicable to the subject. For example, consider https://observablehq.com/@tmcw/enigma-machine. This allows users to understand the Enigma machine by letting them play with it. The explanatory power of this type of visualization is easily worth a thousand words. This is the gold standard we should be aiming for. Some other interesting examples from ObservableHQ: https://observablehq.com/@freedmand/sounds, https://observablehq.com/@kristw/boba-science, https://observablehq.com/@elaval/coronavirus-worldwide-evolution

Alternative approaches

I think the better approach would be to provide simple low level primitives, that users could build complex abstractions on.

D3 clearly seems to be the most popular choice in this domain.

I think a cool idea would be to use something like Fengari (a Lua VM written in JavaScript) combined with bindings to D3, to execute certain Scribunto scripts on the client side.

This would give users the full freedom to do anything D3 supports. D3 has a lot of resources around the internet on how to use it. Using Lua would mean it would be very similar to making a normal Scribunto template.

Lua allows us to easily expose only the browser APIs we want the user to use. We could combine this with iframe sandboxing (<iframe sandbox="allow-scripts">) to make the iframe be from a different origin. Last of all, we could use CSP to disable network access (and require the template editor to pre-specify all resources that might be needed).
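The sandbox and CSP layers could be sketched roughly as follows. This is only an illustration under assumptions: `sandboxedFrame` and `escapeAttr` are hypothetical helper names, the pre-specified resources are assumed to be image origins that have already been validated, and the inner script stands in for whatever would actually boot the Lua runtime.

```typescript
// Escape a string for use inside a double-quoted HTML attribute.
function escapeAttr(s: string): string {
  return s.replace(/&/g, "&amp;").replace(/"/g, "&quot;").replace(/</g, "&lt;");
}

// Hypothetical helper: wraps a script in a sandboxed iframe whose inner
// document carries a CSP that denies everything not explicitly listed.
// allowedImages is assumed to hold trusted, already-validated origins.
function sandboxedFrame(script: string, allowedImages: string[]): string {
  const directives = ["default-src 'none'", "script-src 'unsafe-inline'"];
  if (allowedImages.length > 0) {
    directives.push("img-src " + allowedImages.join(" "));
  }
  // Keep the script from closing its own <script> element early.
  const safeScript = script.replace(/<\/script/gi, "<\\/script");
  const inner =
    "<!DOCTYPE html><html><head>" +
    '<meta http-equiv="Content-Security-Policy" content="' +
    escapeAttr(directives.join("; ")) +
    '">' +
    "</head><body><script>" + safeScript + "</script></body></html>";
  // No allow-same-origin token, so the frame runs in an opaque origin.
  return '<iframe sandbox="allow-scripts" srcdoc="' + escapeAttr(inner) + '"></iframe>';
}

console.log(sandboxedFrame("console.log('hello')", ["https://upload.wikimedia.org"]));
```

This only shows the moving parts. Whether a policy like this really closes every outbound channel is something that would need a proper security review.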

From a security perspective this gives us good coverage: Exposing a limited API via Lua means that the user should not be able to do anything evil. If they somehow find a sandbox escape, our users should still be safe, since the code is being executed in a different security context due to the sandbox flag. As a last resort, since network access is disabled, the attacker cannot exfiltrate any data. So even if the attacker manages to, say, fingerprint the user or run a crypto-miner, it's useless, because they cannot retrieve or save the data. The main remaining risk would be DoS attacks that slow down the browser by consuming lots of CPU. This can probably be treated like any other vandalism; it is, after all, much less destructive than shock images.