This repository has been archived by the owner on Apr 21, 2023. It is now read-only.

Slurping

Jeff Kaufman edited this page Jan 10, 2017 · 1 revision

To debug, load-test, and measure performance of our Apache module prior to release, we must collect content from the Internet. Targeted test cases that we write ourselves are also valuable for unit testing, but they only cover the situations we deliberately think of. Our quality is substantially increased by ensuring our system performs correctly and efficiently across content found on the real web.

To serve substantive internet content from a local Apache server, there are a few options. We have not had much success mirroring complex sites using techniques such as wget -p, for a variety of reasons, including Ajax, filename limitations, and wget's limited prowess in extracting links from HTML documents. Slurping is a technology and file format that exists in a few places within Google, like Chromium, and is also built into mod_pagespeed. It captures HTTP response headers and data in one file. The filename reflects the URL, but escapes characters that are invalid or cumbersome to use in Unix and Windows filenames.

Building a Slurped Directory

To slurp content with mod_pagespeed, add (or uncomment) the pagespeed.conf settings:

ModPagespeedSlurpDirectory /path/to/slurped/dir/
ModPagespeedSlurpReadOnly off

Note from Josh: I keep a symlink at the root: /slurp/, so that I can quickly turn on slurping with a conf file edit without having to remember & type a long pathname.

Restart Apache, then set your browser proxy to point to the mod_pagespeed server (e.g. YOURHOST:8080). Everything you browse will (a) be rewritten by mod_pagespeed and (b) have its original non-rewritten content recorded in the directory you specify. To achieve this, mod_pagespeed handles origin content in addition to rewriting: it first checks the specified directory for the appropriately encoded filename. If the file is not present, mod_pagespeed fetches the content from the Internet and saves it in the directory so that subsequent requests can be satisfied locally.

You may then set ModPagespeedSlurpReadOnly to 'on', after which you will be served only content that is already in the slurped directory. Other requests will get a 404 (Not Found) error. This typically happens for randomly generated ad content, Google Analytics, and other non-reproducible behaviors.
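
The lookup order described above can be sketched in a few lines of Python. This is only an illustration of the check-the-directory-then-fetch flow and of the read-only behavior, not mod_pagespeed's implementation: the names slurp_filename and serve, the fetch callback, and the percent-encoding used for filenames are all assumptions made for the example. The real filename escaping scheme and on-disk format (which also stores the response headers) differ.

```python
import os
from urllib.parse import quote


def slurp_filename(url):
    # Hypothetical escaping: percent-encode every character that is awkward
    # in a filename. mod_pagespeed uses its own encoding scheme.
    return quote(url, safe="")


def serve(url, slurp_dir, read_only, fetch):
    """Return (status, body) for url, consulting slurp_dir first.

    fetch is a caller-supplied function (an assumption of this sketch)
    that takes a URL and returns the origin response body as bytes.
    """
    path = os.path.join(slurp_dir, slurp_filename(url))

    # 1. Serve from the slurp directory if the content was already recorded.
    if os.path.exists(path):
        with open(path, "rb") as f:
            return 200, f.read()

    # 2. In read-only mode, anything not yet recorded is a 404.
    if read_only:
        return 404, b""

    # 3. Otherwise fetch from the origin and record it for later requests.
    body = fetch(url)
    with open(path, "wb") as f:
        f.write(body)
    return 200, body
```

Calling serve with read_only set to False the first time records the response. Later calls for the same URL, in either mode, are answered from the directory without invoking fetch again.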

See also: apache_debug_slurp_test in devel/Makefile and devel/slurp_test.sh

Note: slurping, and serving slurps, is not hooked up in ngx_pagespeed.
