Expose API for updating the crawl scope + exclusions of a running crawl #311

Closed
Tracked by #304
ikreymer opened this issue Sep 13, 2022 · 4 comments

@ikreymer
Member

ikreymer commented Sep 13, 2022

Thinking perhaps the most flexible way to do this is to update the crawl template in place and then restart the crawl?
This would avoid the need for a custom update API in browsertrix-crawler, for now.

The way this would work:

  • Update the crawl template (via k8s config map)
  • Send SIGINT to all the crawlers, causing them to finish the current page gracefully and then restart
  • Restarted crawlers then use the updated config.

This could allow for more flexible updates beyond exclusions, and ensures the crawler is shut down and restarted in a graceful way (see the sketch at the end of this comment).

Downside:

  • Need to restart the container, but this should be fairly quick.
  • Possibly trickier to support in a non-k8s setup, but those will be dev-only.
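A rough sketch of what this flow could look like with the Python kubernetes client; the ConfigMap name, data key, label selector, and the `kill -INT 1` exec are illustrative assumptions, not the actual Browsertrix manifests:

```python
import yaml
from kubernetes import client, config
from kubernetes.stream import stream


def update_scope_and_restart(crawl_id: str, namespace: str, new_exclusions: list):
    config.load_incluster_config()
    core = client.CoreV1Api()

    # 1. Update the crawl template stored in the ConfigMap
    #    (ConfigMap name and data key are assumptions)
    cm_name = f"crawl-config-{crawl_id}"
    cm = core.read_namespaced_config_map(cm_name, namespace)
    crawl_config = yaml.safe_load(cm.data["crawl-config.yaml"])
    crawl_config.setdefault("exclude", []).extend(new_exclusions)
    cm.data["crawl-config.yaml"] = yaml.safe_dump(crawl_config)
    core.replace_namespaced_config_map(cm_name, namespace, cm)

    # 2. Send SIGINT to each crawler pod so it finishes the current page
    #    gracefully and then exits (label selector is an assumption)
    pods = core.list_namespaced_pod(namespace, label_selector=f"crawl={crawl_id}")
    for pod in pods.items:
        stream(
            core.connect_get_namespaced_pod_exec,
            pod.metadata.name,
            namespace,
            command=["kill", "-INT", "1"],
            stderr=True, stdin=False, stdout=True, tty=False,
        )

    # 3. The restarted crawler containers read the updated config on startup.
```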
@edsu
Collaborator

edsu commented Sep 15, 2022

I think this makes sense. I guess the history of the crawl having started with a particular configuration is being preserved in some way in browsertrix cloud? I think it would be useful to know the options that went into the construction of a web archive, which seems related to webrecorder/specs#127 ?

@ikreymer ikreymer added this to the Exclusions + Crawl Queue milestone Oct 21, 2022
ikreymer added a commit that referenced this issue Oct 21, 2022
add new api: `crawls/{crawl_id}/addExclusion?regex=...` which will:
- create a new config with 'regex' added as an exclusion (deleting or deactivating the previous config)
- update the crawl to point to the new config
- update the statefulset to point to the new config, causing the crawler pods to restart
- filter out URLs matching 'regex' from both the queue and the seen list (currently a bit slow)
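As a hedged illustration of the "update statefulset to point to new config" step above, using the Python kubernetes client; the StatefulSet name, volume name, and namespace handling are assumptions, not the real manifests:

```python
from kubernetes import client, config


def point_statefulset_at_new_config(crawl_id: str, namespace: str, new_configmap_name: str):
    config.load_incluster_config()
    apps = client.AppsV1Api()

    # Patching the pod template triggers a rolling restart of the crawler pods,
    # which then start up against the new crawl config.
    patch = {
        "spec": {
            "template": {
                "spec": {
                    "volumes": [
                        {
                            "name": "crawl-config",  # assumed volume name
                            "configMap": {"name": new_configmap_name},
                        }
                    ]
                }
            }
        }
    }
    apps.patch_namespaced_stateful_set(f"crawl-{crawl_id}", namespace, patch)
```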
@ikreymer
Member Author

ikreymer commented Oct 21, 2022

Turns out it was simpler to follow the same approach as updating crawl configs: a new crawl config is created, and the old one is deleted (if this was the first crawl) or deactivated (if other crawl attempts were made).

Then, the crawl statefulset is updated to point to the new crawl config, which causes k8s to gracefully restart the crawler pods at the end of the current page. This is exactly the desired behavior, without any extra signaling.

The crawl queue filtering is the only part that's a little tricky: the JSON entries have to be removed from the Redis list by value, as there's no other atomic way to do it (removing by index won't work, since the queue may get updated concurrently).
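A minimal sketch of that queue/seen-list filtering, assuming redis-py, a queue stored as a list of JSON blobs with a "url" field, and a seen list stored as a set; the key names here are assumptions, not the actual browsertrix-crawler key layout:

```python
import json
import re

import redis


def filter_exclusion(r: redis.Redis, crawl_id: str, regex: str):
    pattern = re.compile(regex)
    queue_key = f"{crawl_id}:q"  # assumed key name
    seen_key = f"{crawl_id}:s"   # assumed key name

    # Remove matching entries from the queue by value (LREM), since removing
    # by index isn't safe while crawlers keep pushing/popping entries.
    for entry in r.lrange(queue_key, 0, -1):
        data = json.loads(entry)
        if pattern.search(data.get("url", "")):
            r.lrem(queue_key, 0, entry)

    # Remove matching URLs from the seen set as well.
    for member in r.sscan_iter(seen_key):
        url = member.decode() if isinstance(member, bytes) else member
        if pattern.search(url):
            r.srem(seen_key, member)
```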

The new API is /api/archives/{aid}/crawls/{crawl_id}/addExclusion?regex=... to keep it simple; however, this could of course also be extended to support the existing update-crawl-config approach.
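A hedged sketch of what such an endpoint might look like, assuming a FastAPI backend; the router path prefix, helper functions, and parameter names are hypothetical stand-ins for the real backend logic:

```python
from fastapi import APIRouter

router = APIRouter()


# Hypothetical helpers standing in for the real backend logic described above.
async def make_config_with_exclusion(aid: str, crawl_id: str, regex: str) -> str: ...
async def update_crawl_statefulset(crawl_id: str, new_config_id: str) -> None: ...
async def filter_queue_and_seen(crawl_id: str, regex: str) -> None: ...


@router.post("/api/archives/{aid}/crawls/{crawl_id}/addExclusion")
async def add_exclusion(aid: str, crawl_id: str, regex: str):
    # 1. Create a new config with `regex` added as an exclusion,
    #    deleting/deactivating the previous one.
    new_config_id = await make_config_with_exclusion(aid, crawl_id, regex)

    # 2. Point the crawl's statefulset at the new config so k8s gracefully
    #    restarts the crawler pods.
    await update_crawl_statefulset(crawl_id, new_config_id)

    # 3. Filter matching URLs out of the crawl queue and seen list in Redis.
    await filter_queue_and_seen(crawl_id, regex)

    return {"success": True}
```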

@ikreymer ikreymer moved this from Todo to Dev In Progress in Webrecorder Projects Oct 21, 2022
@SuaYoo
Member

SuaYoo commented Nov 1, 2022

Implemented here: #347

@ikreymer
Member Author

ikreymer commented Nov 7, 2022

To also support removing exclusions, changing the API to:

POST /api/archives/{aid}/crawls/{crawl_id}/exclusions?regex=...

DELETE /api/archives/{aid}/crawls/{crawl_id}/exclusions?regex=...
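An example of calling the updated endpoints from a client, assuming the `requests` library; the base URL and bearer-token auth header are illustrative assumptions:

```python
import requests

BASE = "https://app.browsertrix.example"       # assumed deployment URL
HEADERS = {"Authorization": "Bearer <token>"}  # assumed auth scheme


def add_exclusion(aid: str, crawl_id: str, regex: str):
    resp = requests.post(
        f"{BASE}/api/archives/{aid}/crawls/{crawl_id}/exclusions",
        params={"regex": regex},
        headers=HEADERS,
    )
    resp.raise_for_status()
    return resp.json()


def remove_exclusion(aid: str, crawl_id: str, regex: str):
    resp = requests.delete(
        f"{BASE}/api/archives/{aid}/crawls/{crawl_id}/exclusions",
        params={"regex": regex},
        headers=HEADERS,
    )
    resp.raise_for_status()
    return resp.json()
```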

Repository owner moved this from Dev In Progress to Done! in Webrecorder Projects Nov 13, 2022