The first results from the SpeedPerception challenge are in! We had over 5,000 sessions completed, with over 50,000 valid data points (77,000+ votes overall).
We tested three initial hypotheses, of which two were confirmed:
✔ No single existing webperf metric explains human choices with 90%+ accuracy
✔ Users did not wait until “visual complete” to make their choice
✗ Visual metrics did NOT perform better than non-visual/network metrics
For those of you unaware, SpeedPerception is a free, open-source benchmark dataset of how people perceive above-the-fold rendering and the webpage loading process, which can be used to better understand the perceptual aspects of end-user web experience. The benchmark we’ve posted on Github can provide a quantitative basis for comparing different algorithms. Our hope is that this data will spur computer scientists and other web performance engineers to make progress in quantifying perceived web performance.
While we’ve posted the initial findings on Github, we will be releasing additional results. We appreciate feedback on both the study and the results, as well as suggestions for next steps. If you want to analyze the data yourself and test your own hypotheses, the data and code are all available on Github. Please do share any results and conclusions with us.
And, if you were one of the over 5,000 challenge participants: thank you.
Parvez Ahammad, Clark Gao, Prasenjit Dey, Estelle Weyl, Pat Meenan.
The chi-squared test is typically used to evaluate the strength of association in scenarios where the measurement variables are categorical (for example, vehicles classified as two-wheelers, four-wheelers, etc.). It is also a nonparametric test; in other words, it is "distribution-free". Just as the correlation coefficient helps interpret the strength of association between real-valued variables, Cramer's V-score derived from the chi-squared test can be used with categorical variables. For a goodness-of-fit test, the degrees of freedom (df) for a chi-square distribution is one less than the number of categories (df = number of categories - 1); for a test of association on an r x c contingency table, df = (r - 1)(c - 1). Here's a handy table for how to interpret Cramer's V-score:
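To make this concrete, here is a minimal pure-Python sketch that computes the chi-squared statistic and Cramer's V from a contingency table. The table values below are made up purely for illustration; they are not from the SpeedPerception data.

```python
import math

def cramers_v(table):
    """Cramer's V for a 2D contingency table (list of rows of counts).

    V = sqrt(chi2 / (n * (min(r, c) - 1))), where n is the total count
    and r, c are the numbers of rows and columns.
    """
    rows, cols = len(table), len(table[0])
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(table[r][c] for r in range(rows)) for c in range(cols)]
    chi2 = 0.0
    for r in range(rows):
        for c in range(cols):
            expected = row_tot[r] * col_tot[c] / n  # under independence
            chi2 += (table[r][c] - expected) ** 2 / expected
    return math.sqrt(chi2 / (n * (min(rows, cols) - 1)))

# Hypothetical 2x2 table: e.g. metric-above/below-median vs. user choice.
table = [[30, 10],
         [10, 30]]
print(cramers_v(table))  # 0.5
```

For a 2x2 table, min(r, c) - 1 = 1, so V reduces to sqrt(chi2 / n), which is why the clean symmetric table above yields exactly 0.5.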
There are two popular types of correlation coefficients (Pearson and Spearman). See the table below for how to interpret these coefficients.
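The key difference between the two is that Pearson measures linear association while Spearman measures monotonic association (it is simply Pearson applied to the ranks of the data). A minimal sketch, using a made-up monotone-but-nonlinear example to show the two diverging (this simple ranking ignores ties):

```python
import math

def pearson(x, y):
    """Pearson correlation: linear association between two samples."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) *
                    sum((b - my) ** 2 for b in y))
    return num / den

def spearman(x, y):
    """Spearman correlation: Pearson on ranks (no tie handling)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank + 1
        return r
    return pearson(ranks(x), ranks(y))

x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]   # y = x^2: perfectly monotone, not linear
print(pearson(x, y))    # ~0.98 (strong but imperfect linear fit)
print(spearman(x, y))   # 1.0 (perfect monotonic relationship)
```

In practice you would reach for `scipy.stats.pearsonr` and `scipy.stats.spearmanr`, which also handle ties and report p-values.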
To play & participate in the challenge: http://speedperception.meteorapp.com/challenge
SpeedPerception is an open-source, collaborative effort to understand perceptual aspects of end-user web experience. Our goal is to create a free, open-source, benchmark dataset to advance the systematic study of how people perceive the webpage loading process – the above-the-fold rendering in particular. Our belief (and hope) is that this benchmark can provide a quantitative basis to compare different algorithms and spur computer scientists and other web performance engineers to make progress quantifying perceived webpage performance. We plan to open source all the collected data and analysis once there is sufficient participation. Please share this post – we need as many people to participate as possible.
-- Parvez Ahammad, Clark Gao, Prasenjit Dey, Estelle Weyl, Pat Meenan
In our group, we have been interested in the question of how the structure of a webpage influences its performance on the web. It is (in my opinion) one of the key questions at the heart of distributed web application delivery. Thanks to amazing resources like HTTP Archive and BigQueri.es, it is pretty straightforward to access and play with large-scale web performance data (measured twice a month across 400,000+ websites and made available for free!).
Note: This blog post describes collaborative work that had contributions from myself, Clark Gao, Matthew Mok and Karan Kumar (all at Instart Logic Inc.). The image above is borrowed from Paul Irish's talk slides with my hand drawn squiggly circle to highlight this blog post's focus.
Sometime last year, when I started to dig deeper into web performance and how human end-users perceive the process of webpage loading, it quickly became clear that typical W3C standard metrics weren't sufficient. I also learned (painfully) that popular page-level metrics like onLoad could be easily gamed via scripting tricks. I became particularly fascinated with the problem of measuring the above-the-fold webpage loading process, one that's almost a computer-vision type of problem (but my computer vision friends don't know about it yet). A couple of early discoveries kept me going:
Note: The title of this article is a play on the term "Rube Goldberg machine". According to Wikipedia, a Rube Goldberg machine is a contraption, invention, device, or apparatus that is deliberately over-engineered to perform a simple task in a complicated fashion, generally including a chain reaction. Keep that in mind.
Everyone (that cares about ML) knows about supervised / unsupervised / semi-supervised learning pipelines. I have now come across an entirely new class of ML pipelines that I shall call "Rube Goldberg Machine Learning" pipelines. Before I go on and explain what I mean, let me provide some context.
Last week, I attended the Velocity conference in Santa Clara, CA. For those of you who are unfamiliar, Velocity is a popular enterprise-oriented (non-academic) conference focused on web performance and DevOps topics. I was very excited to see a machine learning talk in the program: "Using machine learning to determine drivers of bounce and conversion". Apparently it was the first ML talk ever at this venue (for a field that produces an insane amount of data, I don't know why ML doesn't show up more often at Velocity). So, yay! Given what I know about the prior work of Pat Meenan and Tammy Everts, I had high hopes for the talk and the potential findings. Neither of them is an ML practitioner, but both have done stellar work in the web performance community before. Unfortunately, my excitement dissipated within the first few slides (click the link to the slides if you are curious). Several web performance folks told me informally that the conclusions of the talk didn't seem right, because they appeared to go against the conventional wisdom in the web performance field. I have no problem with going against conventional wisdom; sometimes it is nice to correct long-held misconceptions if there's good evidence. My disappointment mainly stems from the misuse of ML models in this talk. Instead of simply venting, let me break down the good/bad/ugly aspects of this talk:
I could go on about other nitty-gritty details, but let me stop here and summarize. I saw a first-ever ML talk at Velocity that didn't really teach me anything about web performance, despite it analyzing a million sessions' worth of data. The talk generated a lot of buzz on the basis of disrupting conventional wisdom, while the models are highly questionable. I think what allowed the talk to fly through is the use of over-engineered and overly complicated ML pipelines that aren't interpretable. This is the class of ML that I am going to call "Rube Goldberg Machine Learning" from now on.
As a recent ICML talk title says: "Friends Don’t Let Friends Deploy Models They Don’t Understand".
A decade before the "AI winter", going by this NYT article on single-layer perceptrons, the AI craze was in full swing. AI and machine learning have made significant progress in the last decade, but it's always worth remembering that the marketing of AI will always be way ahead of its true capabilities. While undue excitement can hurt a research field at times, notice that some of the predictions about AI agents (recognizing people/faces, language translation) have already become reality today.