Wikimedia Performance Team/Perceived Performance
The field of web performance tends to assume that faster is always better, and has produced many guidelines over the years like RAIL that promote the existence of absolute universal values for what feels instantaneous, fast or slow. Over the years, many academic studies have looked into the subject of performance perception, and we attempt here to provide a meaningful literature review for this body of work, to shed some light on what is known and what is yet to be studied.
Specific duration values are often cited in performance guidelines, and they come up regularly in academia, but their origin is always traced back to arbitrary essays that didn't use research to prove these values. One paper in particular with popular magic numbers, written by Robert B. Miller in 1968, comes up constantly and has been used as the basis for many performance guidelines, including RAIL.
Miller's numbers are pulled out of thin air, and his influential paper is only an essay, with no research to back up the dogmatic magic numbers that have proven to be so popular. Miller develops some intellectually appealing arguments to explain his point of view, which is probably why his essay became so popular, but none of the arguments made are demonstrated through experiments. Miller also describes ideas that look bizarre nowadays, such as delaying errors on purpose to let users give up on their train of thought naturally, rather than being abruptly stopped by an instantaneous error.
The second most popular source of magic numbers is Jakob Nielsen, who, like Miller, positions himself as an expert on the matter and repackages Miller's magic numbers in a more modern (yet equally arbitrary and unproven) form, filtering out Miller's wildest theories. Nielsen's essays are probably popular due to their simplicity and intellectual appeal, but they put forward magic numbers that are not backed by any research.
When looking at actual numbers coming out of experiments, we find that they vary wildly from one study to the next, depending on the context. Frustration thresholds can be at 11s for page loads, while for UI latency, frustration can start at 450ms or 600ms depending on the context (dragging vs tapping). None of these match the most frequently quoted magic numbers about frustration.
The latency threshold (the limit under which something feels instantaneous) has been found to range from 34 to 137ms, with a median of 54ms. This is quite different from the 100ms magic number that is widely accepted for this phenomenon.
None of the universal magic numbers circulating in the performance community have been proven to be real through experiments.
The majority of studies on the matter of performance perception are of limited quality. People tend to only quote the one-liner finding, but the nature of the studies that "proved" those statements often leaves much to be desired.
Many of them are dated, looking at waiting times in increments of dozens of seconds or even minutes. Such waiting durations were relevant at the beginning of the web, but people's expectations have changed greatly over time, and it seems very far-fetched to transpose people's reaction to waiting 30 seconds for a web page to load on a desktop computer in the early 2000s to the current experience of mobile web browsing.
Some studies look at entirely different mediums than the web, which makes it questionable to treat their results as universal psychological phenomena. How can the behavior of people waiting in line in a bank in the pre-internet era translate to the way people experience web performance? Yet those old studies come up regularly as citations in performance perception studies.
Another frequent offender is the use of fake browsers and fake mobile devices. Conclusions are drawn about intricate interactions with the medium, but subjects aren't using the real thing, let alone in a real-life context.
The most common weakness of all is that the majority of those studies are conducted in laboratories, with subjects who are all students from the same university. Young people with access to higher education aren't a representative group of the human population, and their tech savviness in particular introduces a lot of bias that makes these studies' results hard to accept as universal truths that apply to everyone.
Finally, some studies focus too closely on the "low-level" mechanics of performance, disconnecting subjects from the real experience. Examples include comparing videos side by side, or letting subjects replay a video to decide the point at which the page is loaded.
While this doesn't tell us directly about the positive or negative feeling associated with a waiting experience, it's interesting to know how granular people's perception can be, when deciding whether or not a specific amount of performance improvement is worth pursuing.
At equal duration, it seems that people might under-estimate a black screen's duration compared to a waiting animation, and that people tend to over-estimate short times and under-estimate long times. However, that study was done on a fake mobile device using an unusual loading animation, when loading an app, a context for which people probably have a de facto expectation of behavior based on the other apps they use.
Latency perception, or the threshold under which people can't tell the difference from something instantaneous, varies from 34 to 137ms in a group of 20 students, showing how diverse the granularity of time perception can be between people.
Overall, attempts to find mathematical formulas behind time perception have all failed, as the variations between people are too wide. No universal rule or threshold has ever been demonstrated. While time perception isn't performance perception, the two might be related, and this might tell us something about why universal performance magic numbers have never been proven either.
When dealing with large delays (4s, 10s, 16s), it has been demonstrated that lower performance variability is more desirable than a better average performance with bigger poor performance peaks. However, on smaller delays (300ms - 3s) a small study found no difference in task error rates nor in the satisfaction survey results when comparing low and high performance variability. One notable difference was that with low variability the "human time" decreased, presumably because people could predict the waiting time reliably and use it to think about their next move.
This question of the importance of lower performance variability as a trade-off for higher averages comes up regularly, but it's never been studied properly.
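To make the variability trade-off concrete, here is a minimal Python sketch comparing two sets of delays. The samples are hypothetical values chosen for illustration, not data from any of the studies cited here, and the `summarize` helper is an assumption of this example.

```python
from statistics import mean, pstdev


def summarize(delays_ms):
    """Return the mean, population standard deviation and worst case of delays."""
    return {
        "mean": mean(delays_ms),
        "stdev": pstdev(delays_ms),
        "worst": max(delays_ms),
    }


# Hypothetical delay samples in milliseconds (illustrative only).
steady = [4000, 4100, 3900, 4000, 4000]  # slower on average, very consistent
spiky = [2000, 2000, 2000, 2000, 10000]  # faster on average, one large peak

steady_stats = summarize(steady)
spiky_stats = summarize(spiky)

# The spiky sample wins on the average but loses on variability and worst case.
print(steady_stats)
print(spiky_stats)
```

Looking at the mean alone would favor the spiky sample, which is why the spread and the worst case are worth reporting alongside it when weighing this kind of trade-off.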
While most studies assume that everyone is the same, some do set out to look at differences between people.
A study looked at monochronic versus polychronic cultures and found that subjects from polychronic cultures are better at estimating long wait times. And while long wait times affected attitude negatively regardless of the type of time culture experienced, subjects from polychronic cultures had a less negative attitude, which makes sense considering that they're more used to interruptions and multitasking.
On the other hand, when comparing two personality types, one of which is considered to be more obsessed with competitiveness and urgency, no difference was found in the anxiety reaction to poor performance of a system in a time-limited task. However, that study is very old and the validity of the method used to measure anxiety is unclear.
Older people have been shown to be more tolerant of delays.
Finally, a recent study showed that people with a habit of playing action video games will perceive more granular differences in short time spans.
So far it does look like poor performance and delays are perceived negatively regardless of culture and personality, but that the person's background can influence the extent of the negative effect, with some people being more affected than others. The minimum granularity of time people can perceive also varies, but that doesn't tell us anything directly about performance perception.
The busier people are physically, the shorter their attention windows to their mobile devices get. A fairly obvious observation, but one that has been confirmed by an experiment.
It has been suggested (very old study) that people are more tolerant of delays on familiar websites.
The longer people use a website, the stronger the opinion they form about its performance. And if they experience fast performance at some point with a website, it seems to affect their future expectations of the website's performance.
WPO stats is full of real-world studies correlating web performance changes with business KPIs like conversion rates, and some academic studies demonstrate the same effect, but they tend to be in highly competitive contexts, where people have plenty of competition to pick from should the site at hand display sub-par performance. While at first glance we might consider that Wikipedia doesn't have direct competition for the in-depth information people seek on it, it is competing for people's attention and might be affected by the same phenomenon.
The varying degrees of effect that performance has in WPO stats case studies show that the effect of performance is best studied on real users on the real website. People have different expectations for different websites.
We often focus performance improvements on infrastructure and code we have direct control over, or assume that bad performance is due to the network, but a recent large scale study by Akamai has shown that for mobile, hardware and OS generation affect performance a lot more than the network does. A phone generation upgrade can speed up page load time by 24 to 56%.
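As a rough worked example of what that range means, the Python sketch below applies both ends of it to a baseline load time. It assumes the percentages describe a reduction in page load time, and the 5 second baseline is a hypothetical value, not a figure from the study.

```python
def load_time_after_speedup(baseline_s, reduction_pct):
    """Load time after reducing a baseline (in seconds) by a percentage."""
    return baseline_s * (1 - reduction_pct / 100)


# Hypothetical baseline page load time, in seconds (illustrative only).
baseline_s = 5.0

# Both ends of the 24 to 56% range quoted above.
slowest_gain_s = round(load_time_after_speedup(baseline_s, 24), 2)
fastest_gain_s = round(load_time_after_speedup(baseline_s, 56), 2)

print(f"{baseline_s}s would become {slowest_gain_s}s to {fastest_gain_s}s")
```

Under those assumptions, a 5 second page load would drop to somewhere between roughly 3.8 and 2.2 seconds from the device upgrade alone.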
A topic that hasn't been studied much is where users put the blame when performance is poor. An old study showed that in the case of a text-only website, users would blame factors other than the website for the slowness, while when the website included graphics, they would blame the website for the poor performance.
Where people put the blame when they experience poor performance (website, network, hardware) has hardly been studied. We tend to focus on improving the performance of things we fully control, but there might be much greater gains to be had for users beyond the limit of our data centers.
User surveys and performance metrics
A real-world study showed the correlation between the frequency of backend performance issues and user reports of bad performance. This was used as a way of establishing a baseline performance target: improvements were made until user reports of bad performance stopped. This black box approach is effective in gauging the real opinion of users on performance, but in itself doesn't help target where the improvements need to be made.
When asked to rate their experience of loading sites over HTTP/1 and HTTP/2, people were unable to feel the difference. However, the objective difference in that study was between 20 and 40ms, which is probably below the granularity of difference that people are capable of perceiving.
A recent study has demonstrated that it's possible to generate synthetic visual metrics that correlate highly (85%) with the survey results when asking users to rate page performance. It acknowledges that this attempt to produce a one-size-fits-all metric is probably a fool's errand given how different websites are from one another. A more contextual metric tailored to a specific website or page type might give even better correlation with user perception.
A high correlation has been found between optimizing elements where people focus their gaze and user-perceived performance as expressed in a survey. This would suggest that measuring gaze could help prioritize performance optimizations that carry the most impact on perception. On mobile, scrolling the viewport has been correlated highly to gaze, making it possible to predict accurately, based on scroll data, which areas of attention are worth optimizing for.
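As an illustration of how a metric can be compared against survey answers, here is a minimal Python sketch of a Pearson correlation. The source does not say which correlation coefficient the study used, so the choice of Pearson is an assumption, and the metric values and ratings below are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally sized sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of the same length (at least 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical data: a metric in ms per page view, and the 1-5 rating
# the same users gave in a survey (illustrative only).
metric_ms = [800, 1200, 1500, 2300, 3100, 4000]
ratings = [5, 5, 4, 3, 2, 2]

r = pearson(metric_ms, ratings)
print(f"correlation between metric and ratings: {r:.2f}")
```

A strongly negative value here would mean that higher metric values go with lower ratings, which is the direction one would hope for from a metric meant to reflect perceived slowness.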
The low correlation between the most popular performance metrics and user surveys demonstrates how important direct user feedback is to get an accurate view of perceived performance. The holy grail of a universal passively collected real-user metric or even a synthetic one that correctly captures people's perception is still out of reach for now.
Existing RUM metrics like onload, TTFP as well as SpeedIndex correlate very poorly with user-perceived page load time. This suggests that there is still significant work to be done to create new metrics that correlate better with the performance users perceive. Some efforts have been made in that area, discovering for example that images seem to carry a lot of weight in the performance perception on most websites, and that metrics capturing the progress over time are much more accurate than "single point in time" metrics.
Novel metrics like ReadyIndex, which focuses on interactivity, have been tried, but in very artificial settings where it's unclear how important they really are to users in organic conditions.
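The difference between a "single point in time" metric and one that captures progress over time can be sketched in Python as follows. This is written in the spirit of SpeedIndex (the area above the visual completeness curve), not the exact method of any study cited here. The `progress_metric` function and the sample values are assumptions of this example.

```python
def progress_metric(samples):
    """Area above a visual completeness curve, in milliseconds.

    `samples` is a list of (time_ms, completeness) pairs sorted by time,
    with completeness between 0.0 and 1.0. Completeness is treated as a
    step function that starts at 0.0 and changes at each sample.
    """
    area = 0.0
    prev_time, prev_completeness = 0.0, 0.0
    for time_ms, completeness in samples:
        area += (time_ms - prev_time) * (1.0 - prev_completeness)
        prev_time, prev_completeness = time_ms, completeness
    return area


# Two hypothetical page loads that both finish rendering at 3000 ms.
early_paint = [(500, 0.8), (3000, 1.0)]  # most content visible early
late_paint = [(2500, 0.1), (3000, 1.0)]  # almost nothing until the end

# A single point in time metric cannot tell them apart...
finish_early = early_paint[-1][0]
finish_late = late_paint[-1][0]

# ...while the progress-based metric separates them clearly.
score_early = progress_metric(early_paint)
score_late = progress_metric(late_paint)
print(finish_early, finish_late, score_early, score_late)
```

Both loads share the same completion time, yet the first scores about 1000 and the second about 2950, reflecting that one of them showed most of its content much sooner.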
This extensive review shows how little we really know about performance perception and how inadequate the metrics currently used as industry standards probably are.
It's clear that universality is highly unlikely and that the best way to understand perceived performance of a website is still to ask its users directly about it.
There are some very identifiable gaps in the performance perception research that could easily be filled with simple research experiments, such as looking at variability trade-offs or at what people blame when they encounter poor performance. These are great opportunities for influential work.
Continue reading on the Wikimedia Performance blog with posts such as:
- "Magic numbers", Gilles Dubuc, 2019
- "Performance perception: How satisfied are Wikipedia users?", Gilles Dubuc, 2019
- "Performance perception: Correlation to RUM metrics", Gilles Dubuc, 2019
- "Response Time in Man-Computer Conversational Transactions" Miller 1968
- "Response Times: The 3 Important Limits" Nielsen 1993
- "A methodology for the evaluation of high response time on E-commerce users and sales" Poggi, Carrera, Gavaldà, Ayguadé, Torres 2012
- "User-Acceptance of Latency in Touch Interactions" Ritter, Kempter, Werner 2015
- "Are 100 ms Fast Enough? Characterizing Latency Perception Thresholds in Mouse-Based Interaction" Forch, Franke, Rauh, Krems 2017
- "Interaction in 4-Second Bursts: The Fragmented Nature of Attentional Resources in Mobile HCI" Oulasvirta, Tamminen, Roto, Kuorelahti 2005
- "Stuck in traffic: how temporal delays affect search behaviour" Maxwell, Azzopardi 2014
- "Improving the Human–Computer Dialogue With Increased Temporal Predictability" Weber, Haering, Thomaschke 2013
- "Subjective vs. objective time measures: a note on the perception of time in consumer behavior" Hornik 1986
- "A Study on Tolerable Waiting Time: How long Are Web Users Willing to Wait?" Nah 2004
- "The duration perception of loading applications in smartphone: Effects of different loading types" Zhao, Ge, Qu, Zhang, Sun 2017
- "Perceived Performance of Top Retail Webpages In the Wild: Insights from Large-scale Crowdsourcing of Above-the-Fold QoE" Gao, Dey, Ahammad 2017
- "EYEORG: A Platform For Crowdsourcing Web Quality Of Experience Measurements" Varvello, Blackburn, Naylor, Papagiannaki 2016
- "The Impact of Waiting Time Distributions on QoE of Task-Based Web Browsing Sessions" Nazrul, Vijaya, David 2014
- "Culture and consumer responses to web download time: a four-continent study of mono and polychronism" Rose, Evaristo, Straub 2003
- "Impact of system response time on state anxiety" Guynes 1988
- "Examining tolerance for online delays" Selvidge 2003
- "Web site delays: how slow can you go?" Galletta, Henry, McCoy, Polak 2002
- "What slows you down? Your network or your device?" Steiner, Gao 2016
- "The effect of network delay and media on user perceptions of web resources" Jacko, Sears, Borella 2000
- "Defining Standards for Web Page Performance in Business Applications" Rempel 2015
- "The Web, the Users, and the MOS: Influence of HTTP/2 on User Experience" Bocchi, De Cicco, Mellia, Rossi 2017
- "Narrowing the gap between QoS metrics and Web QoE using Above-the-fold metrics" Neves da Hora, Sheferaw Asrese, Christophides, Teixeira, Rossi 2018
- "Improving User Perceived Page Load Times Using Gaze" Kelton, Ryoo, Balasubramanian, Das 2017
- "Towards Better Measurement of Attention and Satisfaction in Mobile Search" Lagun, Hsieh, Webster, Navalpakkam 2014
- "Vesper: Measuring Time-to-Interactivity for Modern Web Pages" Netravali, Nathan, Mickens, Balakrishnan 2017