✨ Streaming utils for zipping and reading/writing to S3 #7186
Conversation
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #7186 +/- ##
==========================================
+ Coverage 86.98% 87.03% +0.04%
==========================================
Files 1667 1668 +1
Lines 64721 64684 -37
Branches 1096 1115 +19
==========================================
- Hits 56299 56298 -1
+ Misses 8109 8068 -41
- Partials 313 318 +5
Continue to review full report in Codecov by Sentry.
Last thing: please double-check download_fileobj.
# NOTE `download_fileobj` cannot be used to implement this because
# it will buffer the entire file in memory instead of reading it
# chunk by chunk
What about this?
https://boto3.amazonaws.com/v1/documentation/api/latest/reference/customizations/s3.html#boto3.s3.transfer.TransferConfig
I see a chunk-size configuration there, and I also see multipart download and multithreading. Are you sure about this?
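For context, a minimal sketch of what this suggestion would look like, assuming a plain boto3 client and illustrative bucket, key, and chunk-size values (not taken from the PR). TransferConfig tunes how the transfer is performed, but download_fileobj still delivers the entire object into the given file object:

import io
import boto3
from boto3.s3.transfer import TransferConfig

# Illustrative values only, not from the PR
config = TransferConfig(
    multipart_threshold=8 * 1024 * 1024,  # use multipart download above 8 MiB
    multipart_chunksize=8 * 1024 * 1024,  # size of each downloaded part
    max_concurrency=4,                    # parallel download threads
)

s3 = boto3.client("s3")
buffer = io.BytesIO()
# The whole object ends up in `buffer`; the caller cannot consume it chunk by chunk
s3.download_fileobj(
    Bucket="my-bucket", Key="my/object/key", Fileobj=buffer, Config=config
)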
The current implementation is still valid. I tried really hard to use download_fileobj, but could not do so. download_fileobj writes to a file object; the options you are highlighting help with tuning the chunk size and the parallelism of writing to the output. You cannot pause the process and resume it. To do so, I had to mull the checks one at a time, like it's done here.
mull?
@sanderegg I'll rewrite the last part.
To achieve on-demand file chunk download, the only way I found was to pull one chunk at a time, like I did here.
Cool stuff! Thanks a lot for the effort! I would suggest adding some RAM checks and perhaps also some disk space checks to your tests.
# Download the chunk
response = await self._client.get_object(
    Bucket=bucket_name, Key=object_key, Range=range_header
)
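For reference, a minimal sketch of how such ranged get_object calls can be chained into an async generator; the client argument, chunk size, and helper name are illustrative assumptions, not the PR's actual code:

from collections.abc import AsyncIterator

_CHUNK_SIZE = 8 * 1024 * 1024  # illustrative chunk size


async def iter_object_chunks(
    client, bucket_name: str, object_key: str, file_size: int
) -> AsyncIterator[bytes]:
    # Yield the object chunk by chunk via HTTP Range requests,
    # so memory usage stays bounded by _CHUNK_SIZE
    start = 0
    while start < file_size:
        end = min(start + _CHUNK_SIZE, file_size) - 1
        range_header = f"bytes={start}-{end}"
        response = await client.get_object(
            Bucket=bucket_name, Key=object_key, Range=range_header
        )
        yield await response["Body"].read()
        start = end + 1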
Just out of curiosity: do you know if there is any difference in the cost incurred by downloading a file in a single call versus multiple calls? Or is the price of downloading from AWS S3 just a function of the size of the data?
I have no real idea about this.
s3_client: S3Client,
archive_s3_object_key: S3ObjectKey,
mocked_progress_bar_cb: Mock,
):
Nice test! Maybe this is actually where it would make sense to do the memory profiling.
I could not figure out how to do it and will not spend time on it. If you have an example that already works, I will give it a try.
What do these changes do?
These changes bring a set of utilities that allow us to create zip archives on the fly and stream them to S3 as they are created. The idea is to use a constant amount of RAM and no disk space.
How does this work? A request to upload a zip archive to S3 is created. As chunks of this archive are requested by the uploader, the streaming zip utility fetches chunks of the source files on the fly and composes the archive, providing pieces of the archive to the S3 uploader as soon as they are available; see the sketch below.
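As a rough illustration of that flow, here is a minimal sketch of streaming an async iterator of archive bytes into an S3 multipart upload. The client is assumed to be an aiobotocore-style async S3 client, and archive_chunks, bucket, and key are hypothetical names; this shows only the upload pattern, not the PR's implementation:

from collections.abc import AsyncIterator

_MIN_PART_SIZE = 5 * 1024 * 1024  # S3 requires parts of at least 5 MiB (except the last)


async def upload_stream_to_s3(
    client, bucket: str, key: str, archive_chunks: AsyncIterator[bytes]
) -> None:
    # Stream an in-flight zip archive to S3 without buffering it all in RAM or on disk
    multipart = await client.create_multipart_upload(Bucket=bucket, Key=key)
    upload_id = multipart["UploadId"]
    parts, buffer, part_number = [], b"", 1

    async def _upload_part(data: bytes, number: int) -> None:
        response = await client.upload_part(
            Bucket=bucket, Key=key, PartNumber=number, UploadId=upload_id, Body=data
        )
        parts.append({"ETag": response["ETag"], "PartNumber": number})

    async for chunk in archive_chunks:
        buffer += chunk
        if len(buffer) >= _MIN_PART_SIZE:
            await _upload_part(buffer, part_number)
            buffer, part_number = b"", part_number + 1

    if buffer:  # flush whatever remains as the final (possibly small) part
        await _upload_part(buffer, part_number)

    await client.complete_multipart_upload(
        Bucket=bucket, Key=key, UploadId=upload_id,
        MultipartUpload={"Parts": parts},
    )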
Have a look at
/home/silenthk/work/pr-osparc-stream-zipping-of-s3-content/packages/aws-library/tests/test_s3_client.py::test_workflow_compress_s3_objects_and_local_files_in_a_single_archive_then_upload_to_s3
for a full working workflow.
Progress bar support has also been added: progress is reported based on the data read from the input streams, as sketched below.
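A minimal sketch of that progress reporting, assuming a hypothetical callback that receives the number of bytes read from the input stream (names are illustrative, not the PR's):

from collections.abc import AsyncIterator, Callable


async def with_progress(
    chunks: AsyncIterator[bytes], report_bytes_read: Callable[[int], None]
) -> AsyncIterator[bytes]:
    # Pass chunks through unchanged while reporting how much input data was read
    async for chunk in chunks:
        report_bytes_read(len(chunk))
        yield chunk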
Bonus: renamed _filemanager.py, which was creating confusion, to filemanager_utils.py.
Related issue/s
How to test
Dev-ops checklist