
Design Doc: HTML Caching in PageSpeed (Plan)


Anupama Dutta, 2013-04-08

Background:

Please read Jeff Kaufman’s doc on this subject. Jeff’s doc accurately captures the problems with HTML caching within PageSpeed, and potential solutions to them.

Since this HTML caching feature has many parts, here is a rough plan for implementing the complete set of features in phases.


Phase 1: Basic functionality (with NPS/CDN integration as end goal).

We have a CDN partner who would like rewritten HTML cached and served out. For Phase 1, we will design with ngx_pagespeed integration with this CDN in mind. Note that the framework we build should be easily extensible to integrations with Varnish or other similar caching layers in subsequent phases.

Assuming that nginx has its cache feature (proxy_cache) enabled, it can be configured to serve cached content or bypass the cache completely based on user-agent matching.

Example config changes can be seen here.
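Since the linked example is not reproduced here, the following is a minimal illustrative sketch of what such a configuration might look like; the UA classes, capability header name, upstream address, and cache-zone parameters are all hypothetical:

    # Illustrative sketch only; not the linked example config.

    # Classify the user-agent into a coarse UA class.
    map $http_user_agent $ua_class {
        default            "desktop";
        ~*(iPhone|Android) "mobile";
    }

    # Bypass the cache entirely for UAs we never want to serve cached HTML to.
    map $http_user_agent $skip_cache {
        default   0;
        ~*Crawler 1;
    }

    proxy_cache_path /var/cache/nginx levels=1:2 keys_zone=html_cache:10m
                     max_size=1g inactive=60m;

    server {
        listen 80;
        location / {
            proxy_cache        html_cache;
            # Include the UA class in the key so each class is cached separately.
            proxy_cache_key    "$scheme$host$request_uri|$ua_class";
            proxy_cache_bypass $skip_cache;
            proxy_no_cache     $skip_cache;
            # Tell the PageSpeed server which UA-dependent capabilities to
            # optimize for (hypothetical header name).
            proxy_set_header   X-UA-Capabilities $ua_class;
            proxy_pass         http://127.0.0.1:8080;
        }
    }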

In order to use the nginx cache effectively for caching rewritten HTML, every HTML request should be processed as follows:

a) The nginx proxy_cache configuration will construct a proxy_cache_key based on the user-agent of the request (using logic recommended by us). Refer to this document for details of how the UA influences which optimizations are applied and whether the response is served from cache.

b) The nginx configuration will decide whether a cache lookup is relevant, and also whether to add a header indicating the UA-dependent capabilities (e.g. WebP, image inlining) that it wants to support.

c) The PageSpeed server, on getting a request, will use the additional header to determine which UA-dependent optimizations are supported, in addition to the non-UA-dependent ones.

d) Once the asynchronous (background) rewrites for the request are complete, the percentage of rewriting that was done before the response was served out is determined. If this percentage is below a specified threshold (say 95%), we issue a purge request to the proxy_cache using the same request headers as the original request. Cookies, user-agents, and any other headers that make a difference to the caching logic will therefore automatically be factored into the purge logic too. For example, if the cache was bypassed for requests with certain cookies, the purge request will simply be a no-op because of the same cookie field. (The webmaster will have to ensure that the proxy_cache_key construction logic remains the same in both the proxy_cache_purge and proxy_cache configuration.) A sketch of such a purge endpoint follows.
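For illustration, assuming the third-party ngx_cache_purge module, a purge endpoint might look like this; the location name is hypothetical, and the key must be constructed exactly like proxy_cache_key in the sketch above:

    # Sketch assuming the third-party ngx_cache_purge module.
    location ~ ^/purge(/.*) {
        allow 127.0.0.1;   # only the rewriting server may purge
        deny  all;
        # The key must match proxy_cache_key, including the UA class,
        # so purges hit the same entry the original request populated.
        proxy_cache_purge html_cache "$scheme$host$1|$ua_class";
    }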

Phase 2a: Advanced functionality

  1. Generating cacheable rewritten HTML without UA-dependent optimizations (perhaps with options for enabling the UA-dependent optimizations on the client side). [Use case: news site pages cached on a CDN]

  2. Support beacon-dependent rewriters (inline-images, inline-preview-images, lazyload-images, critical-css) by triggering a purge whenever a sufficient amount of beacon data is available for a given URL (for a given UA class), over and above the purge done for the completed async rewrites. So, successive requests may see different versions of the HTML in the cache (original; partially rewritten; fully rewritten without any beacon-dependent rewriting applied; fully rewritten with beacon data also applied). This has since been implemented using a different approach, and the details are in this document.

  3. If there is a need / feature request for computing the min-nested-resource TTL and using it for the fully rewritten HTML, maybe try to incorporate this computation logic somewhere.

  4. Cache the original HTML if computing it is too expensive (and if this is a big feature request).

  5. Provide a direct purge API so the webmaster can purge specific HTML URLs from the pcache.

Phase 2b: Other caching layers/options.

  1. Caching within PageSpeed (when no other caching layer is present).

Pcache or the HTTP cache could be used in these cases. This could be relevant to MPS and NPS, and used as a fallback method for avoiding overload of the origin servers during DoS attacks or unusually large loads.

A rough flow for this could be:

a) When an HTML request is received, we do a pcache lookup.

b) If the pcache lookup succeeds, the corresponding cached HTML response is served out with a large, webmaster-specified TTL.

c) If the pcache lookup fails, the original request is fetched, the content is rewritten in the line of the request as much as possible, and it is served out with a low TTL. This content is also buffered up and used for an async blocking rewrite call, which, on finishing, stores its response in the pcache.

  2. Other external caching layers: Varnish for WordPress, and maybe mod_cache for MPS.

Varnish configuration changes (or mod_cache config changes) mirroring those made for the nginx cache configuration should be generated for the webmaster’s convenience.

a) Varnish should cache responses for different User-Agent classes separately. The .vcl can be used to do this and to add the Vary: User-Agent header. The logic for UA classes should be identical within PageSpeed and in the .vcl file.

b) Varnish should cache responses with general cookies, but not the cookies of logged-in users. By default, Varnish does not cache anything with a cookie, so the .vcl methods should be changed to actually cache responses for the non-logged-in-user cookies specified by the webmaster.

c) Provide Varnish purge API access for HTML URLs that need to be purged by the webmaster or from the code.

  3. Caching some information within PageSpeed (using other caching layers to store the actual rewritten HTML responses):

We could add support for a local map (a Bloom filter, perhaps; in the case of MPS and NPS it may even be just a pcache lookup) allowing a direct (non-async) lookup that determines whether we are in a position to generate a fully rewritten entry for the HTML, and use that information to provide more accurate TTLs (i.e. lower TTLs for partially rewritten HTML and larger TTLs for fully rewritten HTML).
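One way the nginx side of such differentiated TTLs could work, sketched under the assumption that PageSpeed annotates each response with a Cache-Control max-age reflecting how completely it was rewritten:

    # Sketch: nginx honors upstream Expires/Cache-Control headers by
    # default, so partially rewritten HTML (served with a low max-age)
    # and fully rewritten HTML (high max-age) get different lifetimes
    # in the same cache zone. proxy_cache_valid is only a fallback for
    # responses with no caching headers.
    location / {
        proxy_cache       html_cache;
        proxy_cache_valid 200 1m;
        proxy_pass        http://127.0.0.1:8080;
    }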

  4. Handling cookies: when caching within PageSpeed or externally, whenever there is some caching logic in the external caching layer related to cookies, we might need a flow of the following kind:

a) Inspect the cookies on the request to see if they can be ignored for caching the rewritten HTML (i.e., whether they are mentioned in the CacheDespiteCookies configuration described below).

b) If all the cookies can be ignored, then use the UA classification logic (which decides whether the UA supports inlining, WebP, and other transformations) to determine the UA class that this request belongs to.

c) Create the cache key (for the nginx cache) using the UA class and the URL, and trigger a proxy_cache_purge to get rid of the incompletely rewritten response that would have been stored in the cache (as a result of the synchronously served rewritten response).

CacheDespiteCookies option:

We should be able to ignore cookies that represent non-logged-in users, and serve out the same cached rewritten content for such users. Example cookies could be session/click-tracking cookies, the __utm cookies (read only in JS) added by Google Analytics, centralAuth_token, etc. These ought to be specified in the config file in some way, e.g.:

pagespeed CacheDespiteCookies session_id,__utma,__utmb,__utmc,__utmz
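As a rough nginx-side analogue of this option (Varnish would need equivalent .vcl changes), the cache could simply omit cookies from the cache key and bypass only on cookies outside the ignorable list, approximated below by matching a known logged-in-user cookie; the cookie name is hypothetical:

    # Sketch: the cache key deliberately omits $http_cookie, so the
    # listed tracking cookies are ignored. Only a known logged-in-user
    # cookie (hypothetical name) forces a cache bypass.
    map $http_cookie $logged_in {
        default        0;
        ~*wp_logged_in 1;
    }

    location / {
        proxy_cache        html_cache;
        proxy_cache_key    "$scheme$host$request_uri|$ua_class";
        proxy_cache_bypass $logged_in;
        proxy_no_cache     $logged_in;
        proxy_pass         http://127.0.0.1:8080;
    }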

Code structure (outdated wrt Phase 1):

The details here are still to be figured out, but basically we should see if CacheHtmlFlow can be reused.

The existing CacheHtmlFlow serves out the in-the-line-of-request-rewritten page (with or without the split filter enabled) and triggers an async flow which rewrites the same buffered HTML through a specific filter (the strip-non-cacheable filter). It also always rewrites the cached original HTML in the line of the request.

In our case, we won’t have any specific additional filters; instead, the same set of filters acts in both the in-the-line-of-request flow and the async flow. The async flow needs to be forced to become a blocking one (maybe it already is?). The in-the-line-of-request flow will need to serve out the rewritten cached HTML as-is (if available) instead of doing fresh rewriting.

The structure of the flows and rewrite drivers involved here might be the same, but the rewriters enabled in each code path might be very different.

Other useful links to read up / refer to:

Varnish:

Cache invalidation by URL pattern in MPS:
