This repository has been archived by the owner on Apr 21, 2023. It is now read-only.

Design Doc: Brainstorming PageSpeed Optimization Products and Content Security Policy

Maks Orlovich edited this page Aug 2, 2017 · 33 revisions

Maks Orlovich, December 2016

Introduction

PageSpeed Optimization Products (mod_pagespeed, ngx_pagespeed, etc., all based on the same set of libraries, the PageSpeed Optimization Libraries, or PSOL) improve the loading speed of websites by automatically applying various optimizations, such as cache extension, resource combining, and inlining, in the flow of requests. Since such operations can alter the sources of resources and the nature of scripts used on a website, a previously valid Content Security Policy (CSP) provided by the website may become inaccurate and prevent the website from operating. This document outlines which features of PageSpeed Optimization Libraries interact with the CSP security model, and what challenges arise in getting the two to cooperate.

Background: Content Security Policy

Content Security Policy permits web pages to restrict the sources from which various resources (such as scripts, style sheets, etc.) can be loaded or executed, both by restricting paths and by disabling inline scripts or styles. This is meant as defense in depth against XSS attacks: for example, if a page does not permit any inline script content and restricts scripts to its own domain, an attacker injecting a <script> tag would only be able to refer to a resource on the page’s domain, and not something totally arbitrary. Limited inlining can also be permitted by having the content carry a nonce that matches the policy, or by having the policy include cryptographic hashes of inline style or script content. Policies are expressed as HTTP headers or <meta> elements.
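To make the shape of a policy concrete, here is a minimal sketch of splitting a header value into directives. It is an illustration only, not the PSOL parser: the function name and the sample policy are made up for this example, and it ignores validation of individual source expressions.

```python
def parse_policy(header_value):
    """Split a CSP header value into {directive name: [source expressions]}."""
    directives = {}
    for token in header_value.split(";"):
        parts = token.split()
        if not parts:
            continue  # skip empty segments such as a trailing ";"
        name = parts[0].lower()  # directive names are case-insensitive
        if name in directives:
            continue  # a repeated directive is ignored; the first one wins
        directives[name] = parts[1:]
    return directives


# Hypothetical policy used only for illustration.
policy = parse_policy(
    "default-src 'self'; script-src 'self' https://cdn.example.com; img-src"
)
```

Note that `img-src` is present with an empty source list (nothing of that type is allowed), which is different from `style-src`, which is absent and falls back to `default-src`.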

Background: PageSpeed Optimization Libraries Processing Model

The way PSOL processes pages has important implications for its interactions with CSP. Pages are not processed all at once, but incrementally, as chunks of source come in to the libraries. This means in particular that the response headers and the <head> of the document containing CSP information may be sent out on the wire before some of the resources they affect are seen. (As of this writing, it is a bit unclear to me what all the implications of <meta> CSP are for content below it; that may require some finesse, since the policy could potentially differ within a chunk.)

The library also generally will not wait for slow operations such as fetches when optimizing; instead it tries to perform them in the background and relies mainly on cached information for live traffic. This means one can’t guarantee that a given optimization will happen on any particular response.

The library can keep track of information associated with pages probabilistically (e.g. the resources they use), but there is generally no guarantee that the actual page will match, so such information can only be used conservatively.

Interactions with Specific Features

URL Mapping for CDNs

A user can ask PSOL to move resources to a CDN. The interaction here is pretty straightforward: given a mapping from a.com to a.cdn.com/foo/, any policy that permits a.com/bar should be extended to also permit a.cdn.com/foo/bar (retaining the original in case some mappings are missed). In this case this expansion feels natural, since after all the content author is explicitly asking for the CDN to be in use.
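The expansion described above can be sketched as follows. This is a simplified illustration under stated assumptions: sources are scheme-less host or host/path expressions, the CDN prefix ends with a slash, and `extend_sources_for_cdn` is a hypothetical helper name, not a PSOL function.

```python
def extend_sources_for_cdn(sources, origin_host, cdn_prefix):
    """Return sources plus CDN equivalents of every entry on origin_host.

    Originals are kept in case some mappings are missed.
    """
    extended = list(sources)
    for source in sources:
        host, _, path = source.partition("/")
        if host != origin_host:
            continue  # keywords like 'self' and other hosts are left alone
        mapped = cdn_prefix + path
        if mapped not in extended:
            extended.append(mapped)
    return extended


# Mapping from the text: a.com -> a.cdn.com/foo/
expanded = extend_sources_for_cdn(["'self'", "a.com/bar"], "a.com", "a.cdn.com/foo/")
```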

URL Changes for Optimized Resources / Cache Extension

Whenever a resource is optimized, it is given a new name that encodes the optimization performed, and a content-hash of the result (this permits the optimized version to be given a very long cache time: if a new version needs to be generated due to original or settings change, it will get a different URL unless it somehow manages to be bit-identical). For example, http://www.modpagespeed.com/mod_pagespeed_example/images/Puzzle.jpg might get changed into http://www.modpagespeed.com/mod_pagespeed_example/images/256x192xPuzzle.jpg.pagespeed.ic.LxXAhtOwRv.webp
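As an illustration, the example URL above can be taken apart as follows. The field layout (`name.pagespeed.<filter id>.<hash>.<extension>`) is an assumption read off this one example, and `parse_pagespeed_leaf` is a hypothetical helper; real rewritten URLs may carry additional fields.

```python
import posixpath
from urllib.parse import urlsplit


def parse_pagespeed_leaf(url):
    """Return (directory, name, filter id, hash, ext), or None if not rewritten."""
    directory, leaf = posixpath.split(urlsplit(url).path)
    parts = leaf.rsplit(".", 4)
    if len(parts) != 5 or parts[1] != "pagespeed":
        return None
    name, _, filter_id, content_hash, ext = parts
    return directory, name, filter_id, content_hash, ext


rewritten = parse_pagespeed_leaf(
    "http://www.modpagespeed.com/mod_pagespeed_example/images/"
    "256x192xPuzzle.jpg.pagespeed.ic.LxXAhtOwRv.webp"
)
original = parse_pagespeed_leaf(
    "http://www.modpagespeed.com/mod_pagespeed_example/images/Puzzle.jpg"
)
```

In this example the rewritten resource stays in the same directory as the original; only the leaf name changes.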

This is important because if the original page’s CSP specified the exact URL, the modified page will not work correctly unless the rewritten URL is also explicitly permitted. The rewritten URL, however, cannot be predicted with certainty. Three options seem possible:

  1. Relax the security policy to be directory level
  2. Try to predict the optimized URL from history (taking UA-dependent optimizations into account), set CSP headers based on the prediction, and if it fails to match, do not use the optimized resource. (This may be hard to implement, in part due to resources that undergo multiple optimizations with intermediate results.)
  3. Block the rewrite if existing policy is too narrow to permit it — so if the policy was directory-level it would go through, but if an exact file was required, it will not.

On reflection, I think checking things against the existing policy is a must-have, since we don't know where an injection hole may be. It may be upstream of us (which seems likely with e.g. PHP), in which case we can't actually trust the input HTML. We should therefore also be careful not to "legalize" anything that the policy did not already accept when rewriting it.
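A rough sketch of option 3 combined with the input check argued for above: a rewrite goes through only if both the original and the rewritten URL are accepted by the existing source list. The matching here is deliberately simplified (exact scheme and host, no wildcards or ports; a source path ending in "/" is a prefix match, anything else is an exact match), and the helper names are hypothetical.

```python
from urllib.parse import urlsplit


def source_permits(source, url):
    """Simplified check of one source expression against one URL."""
    src, target = urlsplit(source), urlsplit(url)
    if (src.scheme, src.netloc) != (target.scheme, target.netloc):
        return False
    if src.path == "":
        return True  # host-only source: any path is fine
    if src.path.endswith("/"):
        return target.path.startswith(src.path)  # directory-level source
    return target.path == src.path  # exact-file source


def may_rewrite(sources, original_url, rewritten_url):
    """Allow a rewrite only if input and output are both already permitted."""
    def allowed(url):
        return any(source_permits(s, url) for s in sources)
    return allowed(original_url) and allowed(rewritten_url)


IMAGES = "http://www.modpagespeed.com/mod_pagespeed_example/images/"
ORIGINAL = IMAGES + "Puzzle.jpg"
REWRITTEN = IMAGES + "256x192xPuzzle.jpg.pagespeed.ic.LxXAhtOwRv.webp"
```

With a directory-level source such as `IMAGES` the rewrite is allowed; with a policy that names only `ORIGINAL` it is blocked, and the original URL is left in place.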

Library Canonicalization

PSOL can be told to replace loads of common libraries (e.g. jQuery) with a version on a CDN (e.g. Google Hosted Libraries). This is a variant of the above: some URLs get mapped to other URLs, with pretty much the same issues.

Combining Across Paths

PSOL can (optionally) combine, for example, CSS files a/b/foo.css and a/c/bar.css into a file inside the a/ directory containing both. Obviously, if the original policy only permitted a/b and a/c but not a/, this would fail. So either the policy has to be expanded (with similar issues as above, since it requires prediction), which seems quite undesirable, or this particular behavior has to be blocked in cases where it would violate the existing policy.
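The blocking variant can be sketched as below, assuming the combined file is placed in the deepest directory common to all inputs and that the policy is reduced to a list of permitted directory prefixes. Both the helper names and that reduction are assumptions made for illustration.

```python
import posixpath


def combined_output_dir(input_paths):
    """Deepest directory shared by all inputs, with a trailing slash."""
    common = posixpath.commonpath([posixpath.dirname(p) for p in input_paths])
    return common.rstrip("/") + "/"


def combination_allowed(input_paths, allowed_dirs):
    """True if the combined output would land under a permitted directory."""
    output_dir = combined_output_dir(input_paths)
    return any(output_dir.startswith(d) for d in allowed_dirs)


inputs = ["/a/b/foo.css", "/a/c/bar.css"]
```

Here `combined_output_dir(inputs)` is `/a/`, so a policy permitting only `/a/b/` and `/a/c/` blocks the combination, while one permitting `/a/` allows it.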

JavaScript Combining

The implementation uses eval, so it’s incompatible with policies that do not have unsafe-eval. This doesn’t seem like the sort of thing we should be adding automatically.

Image Inlining

This seems impossible when the policy does not permit all data: sources for images. Options again seem to be either auto-expanding the policy (highly dubious) or disabling the feature if the policy disallows it.
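The "disable the feature" option amounts to a check like the following sketch. It assumes the policy has already been parsed into a dictionary from directive name to source list (a representation chosen for this example, not PSOL's), and it relies on `img-src` falling back to `default-src` when absent. A `*` source does not cover `data:` URLs, so only an explicit `data:` scheme source counts.

```python
def effective_sources(policy, directive):
    """Source list governing a directive, or None if it is unrestricted."""
    if directive in policy:
        return policy[directive]
    return policy.get("default-src")


def may_inline_images(policy):
    """True if rewriting an image into a data: URL would not violate policy."""
    sources = effective_sources(policy, "img-src")
    if sources is None:
        return True  # neither img-src nor default-src: no restriction
    return any(s.lower() == "data:" for s in sources)
```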

Script and Style Inlining

The spec offers a nonce mechanism to permit such usage. However, we have to be careful not to add nonces to a ‘default’ policy, or to any other one containing unsafe-inline (but no existing nonces or hashes), since adding the first one would disable unsafe-inline and likely break a page relying on it. This is made slightly trickier by the potential presence of policy in meta tags, which makes it positional.
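The hazard can be expressed as a small predicate over a source list. This is a sketch with hypothetical helper names; it only looks at one source list and ignores the positional issue with meta tags.

```python
HASH_PREFIXES = ("'sha256-", "'sha384-", "'sha512-")


def has_nonce_or_hash(sources):
    """True if the source list already contains a nonce or hash expression."""
    return any(
        s.lower().startswith("'nonce-") or s.lower().startswith(HASH_PREFIXES)
        for s in sources
    )


def adding_nonce_breaks_unsafe_inline(sources):
    """True if the list relies on 'unsafe-inline', which a first nonce would disable."""
    lowered = [s.lower() for s in sources]
    return "'unsafe-inline'" in lowered and not has_nonce_or_hash(sources)
```

If a nonce or hash is already present, browsers that understand them already ignore `'unsafe-inline'`, so adding another nonce does not change the page's behavior in that respect.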

Transformation of Already Inline Resources

If an inline script or style already has a hash, it needs to be updated if the contents are altered by PSOL. There are two potential implementation difficulties:

  1. The opening tag should not be flushed out to the wire before the content + end tag are.
  2. Hash computation may be affected by whitespace; and it’s extremely unlikely PSOL’s HTML parser gets whitespace folding the same as the browser. When the content is minified this will likely be moot (since leading/trailing whitespace will be removed anyway), but if the only change to content is due to rewriting of URLs it’s possible that computing the new hash precisely may be difficult. Perhaps switching to a nonce may be an acceptable alternative.

We also need to be sure that the original hash matched before updating it, so we perhaps cannot escape potential whitespace issues.
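A minimal sketch of the update step, assuming the inline content is available as an exact string (which, per the whitespace caveat above, is precisely the hard part). The function names are hypothetical; the hash format is the base64-encoded digest with an algorithm prefix, as used in CSP hash source expressions.

```python
import base64
import hashlib


def csp_hash(content, algorithm="sha256"):
    """Hash source expression for inline content, e.g. 'sha256-...'."""
    digest = hashlib.new(algorithm, content.encode("utf-8")).digest()
    return "'%s-%s'" % (algorithm, base64.b64encode(digest).decode("ascii"))


def updated_hash(original, rewritten, declared_hashes):
    """New hash for rewritten content, or None if the original never matched."""
    if csp_hash(original) not in declared_hashes:
        return None  # do not legalize content the policy did not accept
    return csp_hash(rewritten)


script = "alert('hi');"
declared = [csp_hash(script)]
```

Even a single leading space yields a different hash, so `updated_hash(" " + script, "alert(1);", declared)` returns `None`.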

Issues Affecting Synthesized Scripts

If I understand the spec correctly, on* attributes and style= attributes can’t be used unless the policy has unsafe-inline (though CSP revision 3 adds unsafe-hashed-attributes, which might help). The former should probably be using a listener anyway; the latter, if PSOL uses any, would be hard to replace, however.

Any synthesized <script> tags would have to be annotated with the nonce, which we would have to wire through somehow. Details of how that happens might matter; we need to be careful not to expose it anywhere an attacker may see it (it’s already visible to scripts, at least).

Effects of CSP on Page Semantics

base-uri restrictions can alter URL resolution behavior (by making certain <base> tags inoperable). This is potentially troublesome: if a specific UA does not support base-uri restrictions, URLs will resolve differently across user agents, which may matter for page correctness.

The Trickiest Case: Recursive Optimization of CSS

The CSP policy applies to an entire page, which means the restrictions it imposes on CSS are delivered with the HTML rather than with the CSS resource. This means two things:

  1. We have to encode them in the URL for .pagespeed. resources
  2. IPRO (In-Place Resource Optimization) can't really do anything recursive if CSP is present, since we will not have a policy available at the time. We could conceivably optimize an image to a URL the policy won't permit, or, worse, inline a CSS that would have been blocked by the policy.

Possible options, and their tradeoffs:

  1. No special handling in MPS (mod_pagespeed). This should mostly work if the policy is delivered site-wide in HTTP headers, including on CSS resources: we will apply input checks and output checks for images (I think cache extension might be too conservative, though). However, if the policy is not site-wide, both .pagespeed. rewrites and IPRO could fail to honor the policy (including potentially inlining resources which would be disallowed by it), and also get into hash-mismatch flakiness. Also, if the policy differs between pages, we have no way of applying its different effects to the same CSS resource (as the output policy check would be done when the CSS is generated).

  2. Disable all nested ops if non-trivial CSP, encode URL. This is overly conservative, but .pagespeed. resources will be rewritten consistently. If IPRO doesn't have access to the CSP policy, however, that in turn can do all the same bad things as in option #1. Policies being different doesn't matter, though, as long as some exist, since this basically does a conservative approximation.

    a) Note that this in a sense encodes the policy as a bit. I don't think we can encode it precisely, but there may be useful summaries that carry more info, which suggests that ResourceContext should probably use an enum for this.

General Implementation Notes

An implementation would need to keep track of both default-src and other policies accurately at each point; expansions of type-specific policies may require copying over of default-src.
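The copying of default-src can be sketched as follows, again assuming a policy represented as a dictionary from directive name to source list (an assumption of this example, with a hypothetical helper name). If a type-specific directive is created without first copying default-src, the new directive replaces the fallback and silently drops its sources.

```python
def add_source(policy, directive, source):
    """Return a copy of policy with source added to directive."""
    updated = {name: list(sources) for name, sources in policy.items()}
    if directive not in updated:
        if "default-src" not in updated:
            # The type is unrestricted; creating the directive would narrow it.
            return updated
        # Start from the fallback so its existing sources stay permitted.
        updated[directive] = list(updated["default-src"])
    if source not in updated[directive]:
        updated[directive].append(source)
    return updated


before = {"default-src": ["'self'"]}
after = add_source(before, "img-src", "a.cdn.com")
```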

Implementation Status

Some bits of CSP syntax support and integration are in, but a lot is still outstanding.

Outstanding CLs:

  1. csp-urls (pull request): parses the host-scheme part of the grammar in detail. That's basically the syntax for most URL restriction patterns.
  2. csp_urls2 (pull request): Implements the algorithm for matching URLs against source patterns.
  3. csp-urls3 (pull request): Implements all the top-level algorithms. Tweaks the representation of sources whose presence is basically used as a bool to be just bools. Also had to update parsing to understand nonce and hash expressions to some extent, since even if we don't match them, their presence affects some things.
  4. csp-integrate (pull request): hooks up CSP with RewriteDriver and ScanFilter and an off switch. Makes one filter use it.
  5. csp-integrate2 (pull request): Check policy in CreateInputResource. Avoid modifying inline things and combining when appropriate.
  6. csp-integrate3 (no pull request yet, branch, depends on csp-integrate2): validate Render requests for policy (inlining permission, output URL locations), be aware of base-uri restrictions.
  7. csp-integrate4 (no pull request yet, branch, depends on csp-integrate3): Makes us understand the distinction between a missing directive (which doesn't affect anything) and an empty one (which disallows everything of the sort), including properly parsing the latter. Refines our "is policy empty" checks.
  8. doc update
  9. Another audit pass of all filters: Otto highlighted domain_rewrite_filter as one suspect.