[pull] main from webrecorder:main #81

pull · 2024-03-23T22:22:40Z

See Commits and Changes for more details.

Can you help keep this open source service alive? 💖 Please sponsor : )

Adds a new SAX-based sitemap parser, inspired by: https://www.npmjs.com/package/sitemap-stream-parser

Supports:
- recursively parsing sitemap indexes, using p-queue to process N at a time (currently 5)
- `fromDate` and `toDate` filter dates, to only include URLs between the given dates, including filtering of nested sitemap lists
- async parsing: continue parsing in the background after 100 URLs
- timeout for initial fetch / first 100 URLs set to 30 seconds to avoid slowing down the crawl
- save/load state integration: mark in redis whether sitemaps have already been parsed and serialize this to the save state, to avoid reparsing. (Will reparse if parsing did not fully finish)
- awareness of `pageLimit`: don't add URLs past the page limit, interrupt further parsing when at limit
- robots.txt `sitemap:` parsing, check URL extension and mime type
- automatic detection of sitemaps for a seed URL if no sitemap url provided: first check robots.txt, then /sitemap.xml
- tests: test for full sitemap autodetect, sitemap with limit, and sitemap from specific URL.

Fixes #496

---------

Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
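The date filtering and `pageLimit` behavior described above could look roughly like the sketch below. This is not the crawler's actual code: `SitemapEntry`, `FilterOptions`, and `filterSitemapEntries` are hypothetical names, and keeping entries that have no `lastmod` is an assumption made for illustration.

```typescript
// Hypothetical shape for one <url> entry of a sitemap (not the crawler's real type).
interface SitemapEntry {
  loc: string;
  lastmod?: string; // date string from <lastmod>, if present
}

// Hypothetical options mirroring the fromDate / toDate / pageLimit settings.
interface FilterOptions {
  fromDate?: Date;
  toDate?: Date;
  pageLimit?: number; // undefined or 0 is treated as "no limit" in this sketch
}

// Collect URLs whose lastmod lies within [fromDate, toDate], stopping at pageLimit.
function filterSitemapEntries(
  entries: SitemapEntry[],
  opts: FilterOptions,
): string[] {
  const urls: string[] = [];
  for (const entry of entries) {
    // At the limit: stop looking at further entries.
    if (opts.pageLimit && urls.length >= opts.pageLimit) {
      break;
    }
    if (entry.lastmod) {
      const modified = new Date(entry.lastmod).getTime();
      if (opts.fromDate && modified < opts.fromDate.getTime()) continue;
      if (opts.toDate && modified > opts.toDate.getTime()) continue;
    }
    // Assumption: entries without lastmod are kept rather than dropped.
    urls.push(entry.loc);
  }
  return urls;
}
```

Checking the limit before each push means parsing can be interrupted as soon as the limit is reached, which matches the "interrupt further parsing when at limit" note.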

This PR provides improved support for running crawler as non-root, matching the user to the uid/gid of the crawl volume.

This fixes #502 initial regression from 0.12.4, where `chmod u+x` was used instead of `chmod a+x` on the node binary files.

However, that was not enough to fully support equivalent signal handling / graceful shutdown as when running with the same user. To make the running as different user path work the same way:
- need to switch to `gosu` instead of `su` (added in Brave 1.64.109 image)
- run all child processes as detached (redis-server, socat, wacz, etc..) to avoid them automatically being killed via SIGINT/SIGTERM
- running detached is controlled via `DETACHED_CHILD_PROC=1` env variable, set to 1 by default in the Dockerfile (to allow for overrides just in case)

A test has been added which runs one of the tests with a non-root `test-crawls` directory to test the different user path. The test (saved-state.test.js) includes sending interrupt signals and graceful shutdown and allows testing of those features for a non-root gosu execution.

Also bumping crawler version to 1.0.1
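As a rough illustration of the `DETACHED_CHILD_PROC` switch, the sketch below derives spawn options from an environment map. The helper name `childProcOptions` and the `ChildProcOptions` type are hypothetical and not taken from the crawler source; only the env variable name and the value `1` come from the description above.

```typescript
// Hypothetical options object; Node's child_process.spawn() accepts a
// `detached` boolean in its options argument.
interface ChildProcOptions {
  detached: boolean;
}

// Derive spawn options from an environment map (for example process.env).
// Only the exact value "1" enables detached mode in this sketch.
function childProcOptions(
  env: Record<string, string | undefined>,
): ChildProcOptions {
  return { detached: env["DETACHED_CHILD_PROC"] === "1" };
}
```

The result could be passed as the options argument when spawning helpers such as redis-server. On non-Windows platforms a detached child becomes the leader of a new process group, so a SIGINT/SIGTERM delivered to the crawler's process group does not reach it automatically, leaving the crawler in charge of when to stop it.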

…osed gracefully (#504)

The intent is for even non-graceful interruption (duplicate Ctrl+C) to still result in valid WARC records, even if page is unfinished:
- immediately exit the browser, and call closeWorkers()
- finalize() recorder, finish active WARC records but don't fetch anything else
- flush() existing open writer, mark as done, don't write anything else
- possible fix to additional issues raised in #487

Docs: Update docs on different interrupt options, eg. single SIGINT/SIGTERM, multiple SIGINT/SIGTERM (as handled here) vs SIGKILL

---------

Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
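The escalation from a single interrupt to a repeated one can be modeled as a small state machine. The sketch below is illustrative only: `InterruptTracker`, `ShutdownMode`, and `forcedShutdownSteps` are hypothetical names, and the step labels simply restate the list above as strings.

```typescript
// "graceful": first SIGINT/SIGTERM. "forced": any repeat signal.
type ShutdownMode = "graceful" | "forced";

// Counts interrupt signals and reports which shutdown path applies.
class InterruptTracker {
  private count = 0;

  onSignal(): ShutdownMode {
    this.count += 1;
    return this.count === 1 ? "graceful" : "forced";
  }
}

// Ordered steps for the forced path, restating the commit message as labels.
function forcedShutdownSteps(): string[] {
  return [
    "exit browser and close workers",
    "finalize recorder: finish active WARC records, fetch nothing else",
    "flush open writer, mark as done, write nothing else",
  ];
}
```

In a real process, `onSignal()` would be called from the SIGINT and SIGTERM handlers. SIGKILL cannot be caught, so no cleanup of this kind is possible in that case.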

Due to issues with capturing top-level pages, make bypassing service workers the default for now. Previously, it was only disabled when using profiles. (This is also consistent with ArchiveWeb.page behavior).

Includes:
- add `--serviceWorker` option which can be `disabled`, `disabled-if-profile` (previous default) and `enabled`
- ensure page timestamp is set for direct fetch
- warn if page timestamp is missing on serialization, then set to now before serializing

bump version to 1.0.2
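One way to read the three `--serviceWorker` values is as a function from the option and the presence of a profile to a bypass decision. The sketch below assumes that reading; `ServiceWorkerOpt` and `shouldBypassServiceWorkers` are hypothetical names, not the crawler's API.

```typescript
// The three values named for the --serviceWorker option.
type ServiceWorkerOpt = "disabled" | "disabled-if-profile" | "enabled";

// Decide whether service workers should be bypassed for a crawl.
// `hasProfile` indicates whether a browser profile is in use.
function shouldBypassServiceWorkers(
  opt: ServiceWorkerOpt,
  hasProfile: boolean,
): boolean {
  switch (opt) {
    case "disabled":
      return true; // new default: always bypass
    case "disabled-if-profile":
      return hasProfile; // previous default: bypass only with a profile
    case "enabled":
      return false; // never bypass
  }
}
```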

Initial (beta) support for QA/replay crawling!
- Supports running a crawl over a given WACZ / list of WACZ (multi WACZ) input, hosted in ReplayWeb.page
- Runs local http server with full-page, ui-less ReplayWeb.page embed
- ReplayWeb.page release version configured in the Dockerfile, pinned ui.js and sw.js fetched directly from cdnjs

Can be deployed with `webrecorder/browsertrix-crawler qa` entrypoint.
- Requires `--qaSource`, pointing to WACZ or multi-WACZ json that will be replay/QAd
- Also supports `--qaRedisKey` where QA comparison data will be pushed, if specified.
- Supports `--qaDebugImageDiff` for outputting crawl / replay/ diff images.
- If using `--writePagesToRedis`, a `comparison` key is added to existing page data where:
```
comparison: {
  screenshotMatch?: number;
  textMatch?: number;
  resourceCounts: {
    crawlGood?: number;
    crawlBad?: number;
    replayGood?: number;
    replayBad?: number;
  };
};
```
- bump version to 1.1.0-beta.2
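A consumer of the `comparison` data might flag pages for manual review along the lines of the sketch below. The interface mirrors the fields listed above. The `needsReview` function, its `minMatch` threshold of 0.9, and the assumption that match scores fall between 0 and 1 are illustrative and not part of the crawler.

```typescript
// Mirrors the `comparison` structure from the description above.
interface PageComparison {
  screenshotMatch?: number;
  textMatch?: number;
  resourceCounts: {
    crawlGood?: number;
    crawlBad?: number;
    replayGood?: number;
    replayBad?: number;
  };
}

// Hypothetical review rule: flag a page if any available match score is below
// the threshold, or if replay had more bad resources than the original crawl.
// Assumption: scores are fractions in [0, 1].
function needsReview(c: PageComparison, minMatch = 0.9): boolean {
  const scores = [c.screenshotMatch, c.textMatch].filter(
    (s): s is number => s !== undefined,
  );
  const lowScore = scores.some((s) => s < minMatch);
  const replayBad = c.resourceCounts.replayBad ?? 0;
  const crawlBad = c.resourceCounts.crawlBad ?? 0;
  return lowScore || replayBad > crawlBad;
}
```

Missing scores are skipped rather than treated as failures, since every score field is optional in the structure.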

tw4l and others added 14 commits March 18, 2024 14:03

Temporarily disable tmp-cdx creation (#499)

4d64eed

Fixes #498 To revert after 1.0.0 when we make changes that allow for using the temp CDX in WACZ creation.

profiles: handle terminate signals directly (#500)

5060e6b

- add our own signal handling to create-login-profile to ensure fast exit in k8s - print crawler version info string on startup

version: bump to 1.0.0

9a2ada3

Fixes docs edit link

4b5ebb0

Adds note about where to find Browsertrix — the cloud service

0d26cf2

Update docs/docs/index.md

3ec9d1b

Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>

Merge pull request #501 from webrecorder/docs-minor-fixes

79e39ae

Docs: Minor fixes to edit link & clarifications

Docs homepage link fix

5e2768e

@tw4l Oops :\

quickfix: fix typo, remove duplicate declaration!

ecbc1d8

pull bot added the ⤵️ pull label Mar 24, 2024

pull bot merged commit ecbc1d8 into justarmadillo:main Mar 24, 2024
2 checks passed
