Expose API for updating the crawl scope + exclusions of a running crawl #311

Closed
Tracked by #304
ikreymer opened this issue Sep 13, 2022 · 4 comments

@ikreymer
Member

ikreymer commented Sep 13, 2022

Thinking perhaps the most flexible way to do this is to update the crawl template in place and then restart the crawl?
This would avoid the need for a custom update API in browsertrix-crawler, for now.

The way this would work:

  • Update the crawl template (via k8s config map)
  • Send SIGINT to all the crawlers, causing them to finish the current page gracefully and then restart
  • Restarted crawlers then use the updated config.

This could allow for more flexible updates beyond exclusions, and ensures the crawler is shut down and restarted in a graceful way (see the sketch at the end of this comment).

Downside:

  • Need to restart the container, but this should be fairly quick.
  • Possibly trickier to support in a non-k8s setup, but those will be dev-only.
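A rough sketch of what this flow could look like with the Python kubernetes client; the ConfigMap name, data key, label selector, and the `kill -INT 1` exec are illustrative assumptions, not the actual Browsertrix manifests:

```python
import yaml
from kubernetes import client, config
from kubernetes.stream import stream


def update_scope_and_restart(crawl_id: str, namespace: str, new_exclusions: list):
    config.load_incluster_config()
    core = client.CoreV1Api()

    # 1. Update the crawl template stored in the ConfigMap
    #    (ConfigMap name and data key are assumptions)
    cm_name = f"crawl-config-{crawl_id}"
    cm = core.read_namespaced_config_map(cm_name, namespace)
    crawl_config = yaml.safe_load(cm.data["crawl-config.yaml"])
    crawl_config.setdefault("exclude", []).extend(new_exclusions)
    cm.data["crawl-config.yaml"] = yaml.safe_dump(crawl_config)
    core.replace_namespaced_config_map(cm_name, namespace, cm)

    # 2. Send SIGINT to each crawler pod so it finishes the current page
    #    gracefully and then exits (label selector is an assumption)
    pods = core.list_namespaced_pod(namespace, label_selector=f"crawl={crawl_id}")
    for pod in pods.items:
        stream(
            core.connect_get_namespaced_pod_exec,
            pod.metadata.name,
            namespace,
            command=["kill", "-INT", "1"],
            stderr=True, stdin=False, stdout=True, tty=False,
        )

    # 3. The restarted crawler containers read the updated config on startup.
```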
@edsu
Collaborator

edsu commented Sep 15, 2022

I think this makes sense. I guess the history of the crawl having started with a particular configuration is being preserved in some way in browsertrix cloud? I think it would be useful to know the options that went into the construction of a web archive, which seems related to webrecorder/specs#127 ?

@ikreymer ikreymer added this to the Exclusions + Crawl Queue milestone Oct 21, 2022
ikreymer added a commit that referenced this issue Oct 21, 2022
add new api: `crawls/{crawl_id}/addExclusion?regex=...` which will:
- create a new config with 'regex' added as an exclusion (deleting or deactivating the previous config)
- update the crawl to point to the new config
- update the statefulset to point to the new config, causing the crawler pods to restart
- filter out URLs matching 'regex' from both the queue and the seen list (currently a bit slow)
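As a hedged illustration of the "update statefulset to point to new config" step above, using the Python kubernetes client; the StatefulSet name, volume name, and namespace handling are assumptions, not the real manifests:

```python
from kubernetes import client, config


def point_statefulset_at_new_config(crawl_id: str, namespace: str, new_configmap_name: str):
    config.load_incluster_config()
    apps = client.AppsV1Api()

    # Patching the pod template triggers a rolling restart of the crawler pods,
    # which then start up against the new crawl config.
    patch = {
        "spec": {
            "template": {
                "spec": {
                    "volumes": [
                        {
                            "name": "crawl-config",  # assumed volume name
                            "configMap": {"name": new_configmap_name},
                        }
                    ]
                }
            }
        }
    }
    apps.patch_namespaced_stateful_set(f"crawl-{crawl_id}", namespace, patch)
```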
@ikreymer
Member Author

ikreymer commented Oct 21, 2022

Turns out it was simpler to follow the same approach as updating crawl configs: a new crawl config is created, and the old one is deleted (if this was the first crawl) or deactivated (if other crawl attempts were made).

Then, the crawl statefulset is updated to point to the new crawl config, which causes k8s to gracefully restart the crawler pods at the end of the current page. This is exactly the desired behavior, without any extra signaling.

The crawl queue filtering is the only part that's a little tricky: the JSON entries have to be removed from the Redis list by value, as there's no other atomic way to do it (removing by index won't work, since the queue may get updated concurrently).
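A minimal sketch of that queue/seen-list filtering, assuming redis-py, a queue stored as a list of JSON blobs with a "url" field, and a seen list stored as a set; the key names here are assumptions, not the actual browsertrix-crawler key layout:

```python
import json
import re

import redis


def filter_exclusion(r: redis.Redis, crawl_id: str, regex: str):
    pattern = re.compile(regex)
    queue_key = f"{crawl_id}:q"  # assumed key name
    seen_key = f"{crawl_id}:s"   # assumed key name

    # Remove matching entries from the queue by value (LREM), since removing
    # by index isn't safe while crawlers keep pushing/popping entries.
    for entry in r.lrange(queue_key, 0, -1):
        data = json.loads(entry)
        if pattern.search(data.get("url", "")):
            r.lrem(queue_key, 0, entry)

    # Remove matching URLs from the seen set as well.
    for member in r.sscan_iter(seen_key):
        url = member.decode() if isinstance(member, bytes) else member
        if pattern.search(url):
            r.srem(seen_key, member)
```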

The new API is /api/archives/{aid}/crawls/{crawl_id}/addExclusion?regex=... to keep it simple; however, this could of course also be extended to support the existing update-crawl-config approach.
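A hedged sketch of what such an endpoint might look like, assuming a FastAPI backend; the router path prefix, helper functions, and parameter names are hypothetical stand-ins for the real backend logic:

```python
from fastapi import APIRouter

router = APIRouter()


# Hypothetical helpers standing in for the real backend logic described above.
async def make_config_with_exclusion(aid: str, crawl_id: str, regex: str) -> str: ...
async def update_crawl_statefulset(crawl_id: str, new_config_id: str) -> None: ...
async def filter_queue_and_seen(crawl_id: str, regex: str) -> None: ...


@router.post("/api/archives/{aid}/crawls/{crawl_id}/addExclusion")
async def add_exclusion(aid: str, crawl_id: str, regex: str):
    # 1. Create a new config with `regex` added as an exclusion,
    #    deleting/deactivating the previous one.
    new_config_id = await make_config_with_exclusion(aid, crawl_id, regex)

    # 2. Point the crawl's statefulset at the new config so k8s gracefully
    #    restarts the crawler pods.
    await update_crawl_statefulset(crawl_id, new_config_id)

    # 3. Filter matching URLs out of the crawl queue and seen list in Redis.
    await filter_queue_and_seen(crawl_id, regex)

    return {"success": True}
```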

@ikreymer ikreymer moved this from Todo to Dev In Progress in Webrecorder Projects Oct 21, 2022
@SuaYoo
Member

SuaYoo commented Nov 1, 2022

Implemented here: #347

@ikreymer
Member Author

ikreymer commented Nov 7, 2022

To also support removing exclusions, changing the API to:

POST /api/archives/{aid}/crawls/{crawl_id}/exclusions?regex=...

DELETE /api/archives/{aid}/crawls/{crawl_id}/exclusions?regex=...
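An example of calling the updated endpoints from a client, assuming the `requests` library; the base URL and bearer-token auth header are illustrative assumptions:

```python
import requests

BASE = "https://app.browsertrix.example"       # assumed deployment URL
HEADERS = {"Authorization": "Bearer <token>"}  # assumed auth scheme


def add_exclusion(aid: str, crawl_id: str, regex: str):
    resp = requests.post(
        f"{BASE}/api/archives/{aid}/crawls/{crawl_id}/exclusions",
        params={"regex": regex},
        headers=HEADERS,
    )
    resp.raise_for_status()
    return resp.json()


def remove_exclusion(aid: str, crawl_id: str, regex: str):
    resp = requests.delete(
        f"{BASE}/api/archives/{aid}/crawls/{crawl_id}/exclusions",
        params={"regex": regex},
        headers=HEADERS,
    )
    resp.raise_for_status()
    return resp.json()
```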

Repository owner moved this from Dev In Progress to Done! in Webrecorder Projects Nov 13, 2022