Speeding up Web Page Loads with Shandian – Wang et al. 2016
Despite its importance and various attempts to improve page load time (PLT), the end-to-end PLT for most pages is still a few seconds on desktops and more than ten seconds on mobile devices.
Page load times are very important for user experience and translate directly into commercial results for many sites: Shopzilla increased revenue by 12% by reducing PLT from 6 seconds to 1.2 seconds, and Amazon famously found that every 100ms of increase in PLT cost them 1% in sales. By determining and prioritising precisely the resources that are needed during initial page load, Shandian is able to demonstrate significant PLT improvements and remain compatible with caching and CDN services.
We evaluate Shandian on the top 100 Alexa web pages which have been heavily optimized by other technologies. Our evaluations still show that Shandian reduces PLT by more than half with a reasonably powerful proxy server on a variety of mobile settings with varied RTT, bandwidth, CPU power, and memory… Unlike many techniques that only improve network or computation, Shandian shows consistent benefits on a variety of settings. We also find that the amount of load-time state is decreased while the total amount of traffic is increased moderately by 1%.
(emphasis mine).
That represents a whopping 5+ seconds of page-load time reduction on mobile devices for the average page. Because we’ve made a bit of a mess with the interactions between HTML, CSS, and JavaScript, there are lots of inefficiencies in the page load process that make it hard to load resources in parallel. Shandian adds a reverse proxy that loads a page up to the load event, computes the resulting state, and sends just that to the browser for the initial page load. On the browser, the state is unmarshalled and used to display the initial page. The trick of course is to do all of this with low overhead, in a manner that is compatible with existing web infrastructure, and such that it does not break subsequent web page functionality (what happens after the load event…). Before you get too excited, Shandian does require a lightly modified client-side web browser.
A proxy server is set up to preload a web page… the preload is expected to be fast since it exploits greater compute power at the proxy server and since all the resources that would normally result in blocking transfers are locally available. When migrating state (logic that determines a Web page and the state of the page load process) to the client, the proxy server prioritizes state needed for the initial page load over state that will be used later, so as to convey critical information as fast as possible. After all the state is fully migrated, the user can interact with the page normally as if the page were loaded directly without using a proxy server.
Why are Page Loads so Slow?
While [techniques such as SPDY] are moderately effective at speeding up the individual activities corresponding to a page load, they have had limited impact in reducing overall PLT because they still communicate redundant code, stall in the presence of conflicting operations, and are constrained by the limited parallelism in the page load process.
Ideally a browser would fetch the web objects of a page fully in parallel, but this is often prevented by dependencies among web objects. Consider loading this sample page:
When CSS appears ahead of JavaScript (1.css in our example), evaluating the JavaScript needs to wait until the CSS is loaded and evaluated since both JavaScript and CSS can modify the elements’ styles in the DOM. When the HTML parser processing 0.html encounters script tags, it must stop parsing, load the corresponding JavaScript, evaluate the script, and then resume parsing.
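The two blocking rules just described can be illustrated with a toy timing model. This is a sketch for intuition only, not something taken from the paper: the script names 2.js and 3.js and every timing are made up, and the model ignores real-browser optimisations such as speculative preloading.

```python
from dataclasses import dataclass


@dataclass
class Resource:
    name: str
    kind: str      # "css" or "js"
    fetch_ms: int  # hypothetical network time
    eval_ms: int   # hypothetical evaluation time


def blocking_load_time(resources):
    """Finish time when scripts block the parser and wait for earlier CSS."""
    parser_clock = 0  # time at which the parser reaches the next tag
    css_ready = 0     # time at which all earlier CSS is loaded and evaluated
    for r in resources:
        fetched = parser_clock + r.fetch_ms
        if r.kind == "css":
            # CSS does not stop the parser, but later scripts must wait for it.
            css_ready = max(css_ready, fetched + r.eval_ms)
        else:
            # A script waits for its own fetch and for earlier CSS,
            # and the parser cannot move on until it has been evaluated.
            start_eval = max(fetched, css_ready)
            parser_clock = start_eval + r.eval_ms
    return max(parser_clock, css_ready)


def ideal_parallel_time(resources):
    """Finish time if every object could be fetched and evaluated independently."""
    return max(r.fetch_ms + r.eval_ms for r in resources)


# Hypothetical page: one stylesheet followed by two scripts.
page = [
    Resource("1.css", "css", fetch_ms=100, eval_ms=10),
    Resource("2.js", "js", fetch_ms=50, eval_ms=20),
    Resource("3.js", "js", fetch_ms=80, eval_ms=20),
]

print(blocking_load_time(page))   # 230
print(ideal_parallel_time(page))  # 110
```

In this made-up case the second script is not even requested until the first has been evaluated, which is the kind of sequential loading on the critical path that the study below measures.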
To understand inefficiencies in the Web page load process, we conduct a study on the top 100 Alexa pages by using Chrome (which is a highly optimized browser)…
- CSS files often contain rules that are never used in a page, or at least not used during initial page load. In the top 100 sites, 75% of CSS rules are unused in the median case. “Surprisingly, 80% and 96% of CSS rules are unused for google.com and facebook.com respectively.”
- 15% of the page load time for top pages is spent waiting for JavaScript or CSS to be loaded on the critical path, and 5% of page load time is used for evaluating CSS and JavaScript.
- 80% of pages have sequentially loaded Web objects on the critical path.
Capturing Page Load State
Precisely identifying the state that is needed during a page load (load-time state) is non-trivial since load-time state and post-load state are largely mingled.
The Web page rendered using Shandian also needs to be functionally equivalent to one that is computed solely on the client, therefore the server needs proper client-side state to function properly (e.g. browser size, cookies, HTML5 local storage).
The load-time state sent to the client does not include any JavaScript (and hence there is no JavaScript evaluation during initial page load). Instead, the result of evaluating the JavaScript, which is reflected in the HTML elements and their styles, is sent. This minimizes client computation time and avoids blocking executions on the client.
For example, instead of transmitting a piece of D3 JavaScript to construct an SVG graphic on the client, the JavaScript is evaluated at the server to generate the load-time state of HTML elements that represent the SVG.
CSS evaluation is also slow and should be avoided during initial page load as much as possible. The result of CSS evaluation is unfortunately often a detailed and unwieldy list of styles for each HTML element. The most expensive part is the CSS selector matching step that matches the selectors of all the CSS rules to each HTML element:
Our design decision here is to perform CSS parsing and matching on the server, but leave style computations to be performed on the client. We migrate all the inputs required by style computations as part of load-time state.
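To make the split between server-side matching and client-side style computation concrete, here is a minimal sketch. It is an illustration under heavy assumptions, not Shandian's implementation: the selector support is limited to a bare tag, `.class`, or `#id`, the cascade ignores specificity and inheritance, and all function names and data shapes are invented for this example.

```python
def matches(selector, element):
    """Toy selector matching: supports only 'tag', '.class' and '#id'."""
    if selector.startswith("#"):
        return element.get("id") == selector[1:]
    if selector.startswith("."):
        return selector[1:] in element.get("classes", [])
    return element["tag"] == selector


def match_rules(rules, elements):
    """Server side: for each element, the indices of the rules that match it."""
    return [
        [i for i, (selector, _) in enumerate(rules) if matches(selector, el)]
        for el in elements
    ]


def compute_style(rules, matched):
    """Client side: merge the matched rules in source order (no specificity)."""
    style = {}
    for i in matched:
        style.update(rules[i][1])
    return style


rules = [
    ("p", {"color": "black"}),
    (".note", {"color": "gray", "font-style": "italic"}),
    ("#unused", {"display": "none"}),
]
elements = [
    {"tag": "p", "classes": ["note"]},
    {"tag": "div", "id": "main"},
]

matched = match_rules(rules, elements)
used = {i for indices in matched for i in indices}

print(matched)                            # [[0, 1], []]
print(compute_style(rules, matched[0]))   # {'color': 'gray', 'font-style': 'italic'}
print(sorted(set(range(len(rules))) - used))  # [2]
```

The expensive all-rules-against-all-elements loop runs once on the proxy, and the rule that matches nothing (`#unused` here) never needs to be part of the load-time state.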
Once the page load event fires in the Shandian proxy, the HTML elements in the DOM and the matched CSS rules are serialized to JSON.
Deserializing the load-time state, which is both simple and fast, determines the page load time on the client… Compared to the page load process, the deserialization process does not block, does not incur additional network interactions, and avoids parsing of unused CSS or JavaScript, thereby significantly speeding up page loads on the client.
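A rough sketch of that round trip, assuming a JSON layout invented purely for illustration (the summary here does not give Shandian's actual schema, and the field names `elements`, `matched_rules`, `attrs`, and `children` are hypothetical):

```python
import json
from html import escape


def serialize_load_state(elements, matched_rules):
    """Proxy side: pack the DOM tree and the matched CSS rules into JSON."""
    return json.dumps({"elements": elements, "matched_rules": matched_rules})


def deserialize_load_state(payload):
    """Client side: a single parse, with no script evaluation and no extra fetches."""
    state = json.loads(payload)
    return state["elements"], state["matched_rules"]


def render(node):
    """Rebuild markup from the deserialized tree (stands in for DOM construction)."""
    attrs = "".join(f' {k}="{escape(v)}"' for k, v in node.get("attrs", {}).items())
    inner = escape(node.get("text", "")) + "".join(
        render(child) for child in node.get("children", [])
    )
    return f"<{node['tag']}{attrs}>{inner}</{node['tag']}>"


# Hypothetical load-time state for a tiny page.
tree = {
    "tag": "div",
    "attrs": {"id": "main"},
    "children": [{"tag": "p", "attrs": {}, "text": "hello", "children": []}],
}
css = [{"selector": "p", "declarations": {"color": "black"}}]

payload = serialize_load_state(tree, css)
elements, matched_rules = deserialize_load_state(payload)

print(render(elements))  # <div id="main"><p>hello</p></div>
```

The payload carries only elements and already-matched rules, so the client has nothing to execute and nothing further to request before it can build the initial page.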
Here’s an example of Shandian load state and page loading for the example we saw previously:
Post-Load State
Following the initial load, all the other resources needed by the page must be made available in a way that makes the behaviour of the page identical to one loaded without Shandian.
To ensure interactivity, the post-load state should include the portion of JavaScript that was not used in the load-time state, together with unused CSS, because they might be required later in user interactions.
The CSS is just sent unmodified in its original form since CSS evaluation is idempotent. Unfortunately the same cannot be said for JavaScript evaluation. JavaScript is split up into idempotent function declarations and other, non-idempotent statements. A partial heap is transmitted with the load state, from which the client can construct the full heap state. If the JavaScript uses eval or document.write, this approach can no longer work (their use is considered bad practice), and the use of Shandian is disabled for such pages.
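As a rough illustration of those two steps, the sketch below partitions an already-parsed script into function declarations and everything else, and applies a naive textual check for eval and document.write. Both are assumptions made for this example: the statement dictionaries are a simplified stand-in for a real JavaScript AST, and how Shandian actually detects unsupported pages is not described here.

```python
import re

# Naive textual check (an assumption of this sketch): it also flags matches
# inside comments or string literals, which a real implementation would avoid
# by inspecting the parsed program or hooking the calls at run time.
UNSUPPORTED = re.compile(r"\beval\s*\(|\bdocument\s*\.\s*write\s*\(")


def shandian_applicable(js_source: str) -> bool:
    """False if the script appears to call eval or document.write."""
    return UNSUPPORTED.search(js_source) is None


def split_statements(program_body):
    """Separate function declarations (safe to re-evaluate) from other statements."""
    declarations, others = [], []
    for node in program_body:
        if node["type"] == "FunctionDeclaration":
            declarations.append(node)
        else:
            others.append(node)
    return declarations, others


# Hypothetical top-level statements of a script, already parsed.
body = [
    {"type": "FunctionDeclaration", "name": "onClick"},
    {"type": "ExpressionStatement", "source": "counter += 1"},
    {"type": "FunctionDeclaration", "name": "render"},
]
declarations, others = split_statements(body)

print([d["name"] for d in declarations])                   # ['onClick', 'render']
print([o["source"] for o in others])                       # ['counter += 1']
print(shandian_applicable("function f() { return 1; }"))   # True
print(shandian_applicable("document.write('<p>hi</p>')"))  # False
```

Re-running a function declaration leaves the program in the same state, whereas re-running `counter += 1` does not, which is why the effects of the second group have to reach the client as heap state rather than as code to evaluate again.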
Evaluation
The Shandian reverse proxy is implemented as a webserver extension based on Chrome’s content shell, with most modifications made to Blink and a few to V8. “The client-side browser is also based on Chrome, and we modify it as little as possible.”
The evaluation shows that:
- Shandian significantly improves PLT under a variety of scenarios
- Shandian does not significantly hurt data usage, and
- The amount of client-side state that needs to be transferred to the server is small.
Page load times of the Alexa top 100 websites mobile home pages using a modified Android Chrome are reduced by up to 60% in the median case.
Using local assets and Dummynet to emulate varying bandwidths and RTTs showed that Shandian is insensitive to RTT, that bandwidth is not a limiting factor of PLTs, and that varying CPU and memory has the same impact for both Shandian and Chrome.
In summary, Shandian significantly improves PLT compared to Chrome under a variety of realistic mobile scenarios. This is rare since most techniques are specific to improve one of computation and network. But Shandian improves both…
The size of the total data loaded by the browser (pre and post load) increases by 7% before compression when using Shandian, but this drops to only 1% with standard gzip compression.
Shandian is compatible with existing latency-reduction techniques, notably caching and CDNs.