Canonical method for converting multiple WARC files to WACZ #33

Open
jackdos opened this issue Mar 22, 2023 · 4 comments

Comments

@jackdos

jackdos commented Mar 22, 2023

I'm not sure if this is a feature request or just a request for clarification, but I'm looking for a canonical way to generate a WACZ file from multiple WARC files.

I am dealing with some web collections that span multiple WARCs but should be represented as a single WACZ. From the command line I can get this to work by putting all the WARC files in a single folder and running:

wacz create -o test.wacz -f warcs/*.warc

However, I have failed in multiple attempts to cleanly invoke this from within a Java wrapper. I've tried various combinations of escaping and quoting of the parameters, but to no avail. Either way, I assume this relies on either OS or Python expansion of the * wildcard, and it's not clear what would and would not be expected to work in terms of wildcards, regular expressions, etc.

What I'm looking for, ideally, is either for the -f parameter to be repeatable (in the way that -i is in ffmpeg) so that each file can be listed explicitly, or to be able to specify a -d parameter pointing to a directory that is explicitly expected to contain multiple WARC files. The directory option would probably need to let you specify which file extensions to consider, or should clearly document what happens when non-WARC content is found in the directory.

@quinn

quinn commented May 1, 2023

Bump, I'm also having trouble with this.

@ikreymer
Member

ikreymer commented May 1, 2023

Sorry, missed this earlier! The -f warcs/*.warc form relies on shell expansion to fill in the file list. The -f flag already works as you are suggesting: it expects a list of filenames (relative to the current working directory, or absolute) after the -f param.
e.g. -f warcs/a.warc warcs/b.warc ... warcs/n.warc should work.

This is what we do in the crawler: generate a list of WARC files, and then pass each one as a param after the -f param:
https://github.com/webrecorder/browsertrix-crawler/blob/main/crawler.js#L881
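For the Java wrapper case, the same idea might look roughly like the sketch below. ProcessBuilder does not go through a shell, so a literal warcs/*.warc argument is never expanded; the sketch lists the directory itself and passes each file as its own argument after -f. The class and method names (WaczCommand, findWarcs, buildCommand) are made up for illustration, and it assumes the wacz executable is on the PATH and that only *.warc files directly inside the folder should be included.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical wrapper class; not part of wacz or any Webrecorder tool.
public class WaczCommand {

    // Collect the *.warc files directly inside dir, sorted for a stable order.
    // The glob is evaluated by Java here, not by a shell.
    static List<Path> findWarcs(Path dir) throws IOException {
        List<Path> warcs = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, "*.warc")) {
            for (Path p : stream) {
                warcs.add(p);
            }
        }
        Collections.sort(warcs);
        return warcs;
    }

    // Build the argument list: wacz create -o <output> -f <warc1> <warc2> ...
    // Each WARC path is a separate argument, so no quoting or escaping is needed.
    static List<String> buildCommand(Path output, List<Path> warcs) {
        List<String> cmd = new ArrayList<>();
        cmd.add("wacz"); // assumption: wacz is on the PATH
        cmd.add("create");
        cmd.add("-o");
        cmd.add(output.toString());
        cmd.add("-f");
        for (Path warc : warcs) {
            cmd.add(warc.toString());
        }
        return cmd;
    }

    public static void main(String[] args) throws Exception {
        Path dir = Paths.get(args.length > 0 ? args[0] : "warcs");
        List<Path> warcs = findWarcs(dir);
        if (warcs.isEmpty()) {
            System.err.println("No .warc files found in " + dir);
            System.exit(1);
        }
        List<String> cmd = buildCommand(Paths.get("test.wacz"), warcs);
        // inheritIO() forwards the output of wacz to this process's console.
        Process process = new ProcessBuilder(cmd).inheritIO().start();
        System.exit(process.waitFor());
    }
}
```

Run as java WaczCommand warcs, this should be equivalent to the shell command wacz create -o test.wacz -f warcs/*.warc from the original post, with the file list built in Java instead of by the shell.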

@quinn

quinn commented May 1, 2023

Thanks, that works!

@jackdos
Author

jackdos commented May 2, 2023

OK, great, so this was just a request for clarification then!
