Expose API for updating the crawl scope + exclusions of a running crawl #311
I think this makes sense. I guess the history of the crawl having started with a particular configuration is preserved in some way in Browsertrix Cloud? I think it would be useful to know the options that went into the construction of a web archive, which seems related to webrecorder/specs#127?
Add a new API, `crawls/{crawl_id}/addExclusion?regex=...`, which will:

- create a new config with 'regex' added as an exclusion (deleting or deactivating the previous config)
- update the crawl to point to the new config
- update the statefulset to point to the new config, causing the crawler pods to restart (see the sketch below)
- filter out URLs matching 'regex' from both the queue and the seen list (currently a bit slow)
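A minimal sketch of the statefulset repoint step, assuming the official `kubernetes` Python client; the statefulset name, namespace, and annotation key are illustrative assumptions, not Browsertrix Cloud's actual naming:

```python
# Sketch: point a running crawl's statefulset at a new crawl config.
# Any change under spec.template makes k8s roll the pods, and the
# crawler exits gracefully at the end of its current page, so no
# extra signaling is needed.
from kubernetes import client, config

def repoint_crawl_config(crawl_id: str, new_config_id: str) -> None:
    config.load_incluster_config()  # or load_kube_config() outside the cluster
    apps = client.AppsV1Api()
    apps.patch_namespaced_stateful_set(
        name=f"crawl-{crawl_id}",     # illustrative name
        namespace="crawlers",         # illustrative namespace
        body={
            "spec": {
                "template": {
                    "metadata": {
                        # illustrative annotation carrying the new config id
                        "annotations": {"btrix.crawlconfig": new_config_id}
                    }
                }
            }
        },
    )
```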
Turns out it was simpler to follow the same approach as updating crawl configs: a new crawl config is created, and the old one is deleted (if this was its first crawl) or deactivated (if other crawl attempts were made). Then, the crawl statefulset is updated to point to the new crawl config, which causes k8s to gracefully restart the crawler pods at the end of the current page, exactly the desired behavior, w/o any extra signaling. The crawl queue filtering is the only part that's a little bit tricky: the JSON entries have to be removed by value from the Redis list, as there's no other atomic way to do it (you can't remove by index, since the queue may get updated concurrently). The new API is `crawls/{crawl_id}/addExclusion?regex=...`, as described above.
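A rough sketch of that by-value filtering, assuming `redis-py`, a queue stored as a Redis list of JSON entries with a `url` field, and a seen list stored as a set; the key names are illustrative:

```python
import json
import re

from redis import Redis

def filter_queue(redis: Redis, crawl_id: str, regex: str) -> int:
    """Remove queued URLs matching `regex`; returns count removed."""
    pattern = re.compile(regex)
    queue_key = f"{crawl_id}:q"   # illustrative key names
    seen_key = f"{crawl_id}:s"
    removed = 0
    # Snapshot the list, then delete matches by *value* with LREM:
    # indexes shift as the crawler pushes/pops concurrently, so LREM
    # is the only atomic option. One O(n) scan per removal, hence slow.
    for entry in redis.lrange(queue_key, 0, -1):
        data = json.loads(entry)
        if pattern.search(data["url"]):
            removed += redis.lrem(queue_key, 1, entry)
            redis.srem(seen_key, data["url"])
    return removed

# Usage: filter_queue(Redis(decode_responses=True), "my-crawl", r"\.pdf$")
```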
Implemented here: #347
To also support removing exclusions, changing the API to:
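A hedged guess at what a paired add/remove design could look like, written as FastAPI stubs; the `/exclusions` path, verbs, and response bodies here are assumptions for illustration, not necessarily what was merged:

```python
from fastapi import FastAPI

app = FastAPI()

@app.post("/crawls/{crawl_id}/exclusions")
async def add_exclusion(crawl_id: str, regex: str):
    # add `regex` to the running crawl's exclusions (hypothetical stub)
    return {"success": True}

@app.delete("/crawls/{crawl_id}/exclusions")
async def remove_exclusion(crawl_id: str, regex: str):
    # drop `regex` from the running crawl's exclusions (hypothetical stub)
    return {"success": True}
```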
Thinking perhaps the most flexible way to do this is to update the crawl template in place, and then restart the crawl?
This would avoid the need for a custom update API in Browsertrix Crawler, for now.
The way this would work:
This could allow for potentially more flexible updates beyond exclusions, and ensure the crawler is shut down and restarted in a graceful way.
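A sketch of that alternative flow, assuming the crawl template lives in a ConfigMap mounted by the crawler statefulset and that the restart is triggered the way `kubectl rollout restart` does it, by bumping a pod-template annotation (all names are illustrative):

```python
import datetime

from kubernetes import client, config

def update_template_and_restart(crawl_id: str, new_config_yaml: str) -> None:
    config.load_kube_config()
    core = client.CoreV1Api()
    apps = client.AppsV1Api()

    # Rewrite the crawl template in place.
    core.patch_namespaced_config_map(
        name=f"crawl-config-{crawl_id}",   # illustrative name
        namespace="crawlers",
        body={"data": {"crawl-config.yaml": new_config_yaml}},
    )

    # Bump a pod-template annotation so the statefulset rolls its pods
    # gracefully, without any crawler-side update API.
    now = datetime.datetime.utcnow().isoformat()
    apps.patch_namespaced_stateful_set(
        name=f"crawl-{crawl_id}",
        namespace="crawlers",
        body={"spec": {"template": {"metadata": {"annotations": {
            "kubectl.kubernetes.io/restartedAt": now}}}}},
    )
```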
Downside: