Design Doc: Critical CSS Beaconing

Jan-Willem Maessen, 2013-03-04

Goal

Add critical CSS beaconing capability to mod_pagespeed.

Background

Our goal is to inline the critical CSS (those CSS rules that affect content above the fold) on pages served by mod_pagespeed installations. A filter that inlines critical CSS is already under development, so the real challenge is to identify the critical CSS in the first place. We cannot use a headless browser with mod_pagespeed, so we would like to do this rendering in the browser of an actual visitor to the page, then send the results back as a beacon. The goal is to use the same basic settings and flow to generate instrumented pages, keeping the difference between the headless browser implementation and beaconing as small as possible.

Design options

One option is to just use PhantomJS to render the page offline. There are a number of reasons why this might be problematic; they are discussed in a separate document.

The other option is to use beacon data from page visitors to compute the critical CSS. The problem is that, for security reasons, browsers expose only same-domain CSS rules to JavaScript code. This means that the external CSS referenced by a page must either be inlined or served from the same domain as the page itself. This is at odds with common configurations of mod_pagespeed, where site owners explicitly request that rewritten resources be moved to a cookieless domain different from the domain on which the page itself resides.

This means that we will rewrite pages in two different modes:

Critical CSS Rewriting mode. In this mode existing property-cache information is used to insert critical CSS into the page, and no beaconing is required. Heavily-visited pages should generally be served in this mode. CSS files are rewritten and sharded/served in accordance with the usual mod_pagespeed configuration.
Critical CSS Instrumentation mode. In this mode JavaScript code is injected into the page that traverses the DOM, identifies the critical CSS, and returns the result to the server in a beacon. To avoid the same-domain restriction on CSS, we inline any CSS file that can’t be moved to the origin domain.

We choose whether a given site visitor is served an instrumentation-mode page based on the contents of the property cache. Broadly speaking, we will serve an instrumented page if there are no critical CSS entries in the property cache. As we collect more instrumentation data, the probability of entering instrumentation mode will drop (based on flipping a biased coin), with the goal of serving some target fraction of page views in rewriting mode.
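The biased coin described above could be sketched as follows. This is only an illustration: the design does not fix a probability schedule, so the 1/(n+1) decay, the function names, and the `target_rewrite_fraction` parameter are all assumptions.

```python
import random


def instrumentation_probability(num_beacons: int,
                                target_rewrite_fraction: float) -> float:
    """Probability of serving an instrumented page (hypothetical schedule)."""
    if num_beacons <= 0:
        return 1.0  # no critical CSS data yet: always instrument
    # Never drop below the share of views reserved for instrumentation.
    floor = 1.0 - target_rewrite_fraction
    # Assumed decay: the more beacons we have stored, the less often we ask.
    return max(floor, 1.0 / (num_beacons + 1))


def should_instrument(num_beacons: int, target_rewrite_fraction: float,
                      rng: random.Random) -> bool:
    """Flip the biased coin for a single page view."""
    p = instrumentation_probability(num_beacons, target_rewrite_fraction)
    return rng.random() < p
```

With a target of 0.95, a page with no stored data is always instrumented, and a page with many stored beacons is instrumented on roughly 5% of views.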

In the remainder of this document we focus on serving pages in instrumentation mode.

Flow

Early in the page flow, the critical_css_beacon_filter runs. It examines the existing property cache entry and decides whether to enter instrumentation mode. In instrumentation mode the filter does several important rewrites:

Insert an annotation at the start of the document indicating that several subsequent filters should be disabled (most notably, critical CSS itself needs to be disabled).
Annotate <link rel="stylesheet"> entries with either a pagespeed_always_inline (different domain from the HTML) or a pagespeed_no_transform (same domain as the HTML) attribute.
Insert the instrumentation JavaScript if any such links are found in the page.
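The annotation choice in the second rewrite above might look like the sketch below. The function name is hypothetical, and comparing host names only is an assumption; a real implementation may also need to compare scheme and port, since the browser restriction is based on origin.

```python
from urllib.parse import urljoin, urlsplit

ALWAYS_INLINE = "pagespeed_always_inline"
NO_TRANSFORM = "pagespeed_no_transform"


def annotation_for_stylesheet(page_url: str, href: str) -> str:
    """Pick the attribute to add to a <link rel="stylesheet"> element."""
    # Resolve relative and protocol-relative hrefs against the page URL.
    css_url = urljoin(page_url, href)
    same_domain = urlsplit(css_url).hostname == urlsplit(page_url).hostname
    # Same-domain CSS is left in place so the beacon JS can read its rules;
    # anything else must be inlined to become visible to the script.
    return NO_TRANSFORM if same_domain else ALWAYS_INLINE
```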

The CSS inliner must be modified to unconditionally inline CSS annotated with pagespeed_always_inline. Note that this attribute can easily be employed by site owners for reasons unrelated to critical CSS identification.
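As a rough sketch of the modified inlining decision, assuming the inliner otherwise applies a size threshold: the function name, the dictionary representation of the link attributes, and the 2048-byte default are illustrative and not taken from the actual filter.

```python
def should_inline_css(link_attrs: dict, css_size_bytes: int,
                      max_inline_bytes: int = 2048) -> bool:
    """Decide whether a stylesheet link should be replaced by inline CSS."""
    if "pagespeed_no_transform" in link_attrs:
        return False  # the link must be left untouched
    if "pagespeed_always_inline" in link_attrs:
        return True  # unconditional: the size threshold is bypassed
    # Assumed default behavior: inline only small stylesheets.
    return css_size_bytes <= max_inline_bytes
```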

The critical CSS filter must be modified to ignore pages with the "instrumentation mode" flag at the start of the document.

When the page arrives at the browser, the beacon JS runs, identifies the critical CSS, and sends the data back to the mod_pagespeed server in a beacon. The beacon may be one or more GETs or (more likely) a single POST; both options are being explored. The code to identify the critical CSS can be shared with the headless browser mode; the only single-purpose code should be the actual beaconing that returns the collected data.

Back at the server, the beacon handler will take the data and store it back into the property cache.
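A minimal sketch of such a handler for the POST case is shown below. The wire format is still undecided, so the form-encoded body and the field names `url` and `cs` are hypothetical, and a plain dictionary stands in for the property cache.

```python
from urllib.parse import parse_qs


def handle_beacon(body: str, device_type: str, property_cache: dict) -> bool:
    """Parse a form-encoded beacon and record the selectors it reports.

    Returns False when the beacon is malformed and was ignored.
    """
    fields = parse_qs(body)
    urls = fields.get("url", [])
    if len(urls) != 1:
        return False  # exactly one page URL is expected per beacon
    selectors = frozenset(fields.get("cs", []))
    # Mobile and desktop results are stored under separate keys.
    key = (urls[0], device_type)
    property_cache.setdefault(key, []).append(selectors)
    return True
```

Each beacon is stored as one set of selectors, so later processing can count how many beacons agree on a given selector.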

Risks

Security concerns:

There are several possible attack vectors that rely on carefully crafted beacons. One mitigation is to attach a random signature to each beacon, and to ignore any incoming beacon whose signature does not match one that was recently sent.
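The signature check could be sketched roughly as follows. The class name, the five-minute lifetime, and the single-use behavior are assumptions made for illustration; the design only requires that a beacon match a recently sent signature.

```python
import secrets
import time


class BeaconNonces:
    """Tracks recently issued signatures (hypothetical helper)."""

    def __init__(self, ttl_seconds: float = 300.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._expiry = {}  # signature -> time after which it is rejected

    def issue(self) -> str:
        """Create a signature to embed in one instrumented page."""
        nonce = secrets.token_urlsafe(16)
        self._expiry[nonce] = self._clock() + self._ttl
        return nonce

    def accept(self, nonce: str) -> bool:
        """Return True once for a signature that was issued and is still fresh."""
        # pop() makes each signature single-use (an assumption, see above).
        expiry = self._expiry.pop(nonce, None)
        return expiry is not None and self._clock() <= expiry
```

This sketch never purges signatures that are issued but not returned; a real server would need to bound that table.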

Injection of arbitrary CSS into the page. No signature scheme can completely prevent this attack. Instead, we must rewrite the critical_css_filter to incorporate critical CSS rules only if they are actually found on the page or in its CSS files. One simple expedient is to have beacons simply return the relevant selectors (and possibly the files in which they occur), and then include all rules that contain the returned selectors. However, this is different from the way the filter is currently implemented.
Inclusion of irrelevant CSS. An attacker can effectively force all the CSS to be inlined by returning beacons that identify every CSS rule as critical. This will result in page slowdowns, but should not actually break pages or result in security breaches.
DoSing the server with beacons. This is a particular risk for a POST-based scheme, where a very large POST could be returned as a beacon. We intend to limit the size of the POST data we will consider, and to discard or terminate longer POSTs to the beacon address.
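Two of the mitigations above, the size limit and the restriction to selectors that actually occur in the page's CSS, can be combined in a sketch like this one. The 16 KB cap, the form-encoded body, and the `cs` field name are illustrative assumptions.

```python
from urllib.parse import parse_qs

MAX_BEACON_BYTES = 16 * 1024  # illustrative cap, not a value from the design


def accepted_selectors(body: bytes, selectors_in_css: set,
                       max_bytes: int = MAX_BEACON_BYTES) -> set:
    """Return the beaconed selectors that really occur in the page's CSS."""
    if len(body) > max_bytes:
        return set()  # oversized beacon: discard it entirely
    fields = parse_qs(body.decode("utf-8", errors="replace"))
    reported = set(fields.get("cs", []))
    # Intersecting with known selectors means a beacon can only choose among
    # existing rules; it can never introduce new CSS text.
    return reported & set(selectors_in_css)
```

This sketch checks the length only after the body has been read. A real server would want to reject on the declared content length, or stop reading once the limit is passed.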

Unstable data:

Because we’re obtaining data from real web browsers, the set of CSS rules identified as critical will vary depending upon browser configuration and window size. Note that the property cache distinguishes mobile and desktop user agents, so data from these sources will not be mixed. We will consider a rule to be critical if it is identified as such in a large percentage of returned beacons.
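The "large percentage" rule could be sketched as below. The 0.8 threshold and the function name are assumptions; the design does not specify a value.

```python
from collections import Counter


def critical_selectors(beacons, min_fraction: float = 0.8) -> set:
    """Selectors reported as critical in at least min_fraction of beacons."""
    if not beacons:
        return set()
    counts = Counter()
    for selectors in beacons:
        counts.update(set(selectors))  # count each selector once per beacon
    needed = min_fraction * len(beacons)
    return {sel for sel, n in counts.items() if n >= needed}
```

For example, given five beacons of which all report `.header`, four report `.hero`, and one reports `.sidebar`, the default threshold keeps `.header` and `.hero` and drops `.sidebar`.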

Startup transients:

The first time a page is served, its CSS will not be available to be inlined. We should control for the absence of some critical CSS data. This is a particular concern for long-tail hosted sites, where site data is likely to be evicted from mod_pagespeed’s cache before a repeat page view occurs. This is a systemic problem with mod_pagespeed and we do not propose to solve it here.

Tasks

Task	Who	Time / notes
critical_css_beacon_filter	jmaessen
pagespeed_always_inline	jmaessen	3-4 days
explore POST for beacon	jud
critical_css_beacon_js	jmaessen
beacon handler	jmaessen
modify critical_css_filter to only include CSS actually occurring on page	jmaessen / ksimbili	Probably the highest risk work item

Future

Some future prospects:

Hybrid evaluation and rewriting mode, in which only a fraction of <style> tags are inlined and the pingback data takes that fact into account.
Code sharing between different beacons. In particular critical_css_beacon and critical_image_beacon are likely to share a great deal of code, and it might be possible for an instrumentation-mode page to not inject the critical_image_beacon code at all, and do all the work in the critical_css_beacon code instead.
