
✨ Streaming utils for zipping and reading/writing to S3 #7186

Open
wants to merge 70 commits into master

Conversation

@GitHK (Contributor) commented Feb 7, 2025

What do these changes do?

These changes bring a set of utilities that allow us to create zip archives on the fly and stream them to S3 as they are created. The idea is to use a constant amount of RAM and no disk space.

How does this work? A request to upload a zip archive to S3 is created. As the uploader requests chunks of the archive, the streaming zip utility fetches chunks of the input files on the fly and composes the archive, providing pieces of it to the S3 uploader as soon as they are available.

Have a look at packages/aws-library/tests/test_s3_client.py::test_workflow_compress_s3_objects_and_local_files_in_a_single_archive_then_upload_to_s3 for a full working workflow.
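
For illustration only (this is not the PR's code; `zip_chunks`, `upload_zip_stream`, and the bucket/key values are hypothetical), here is a minimal sketch of the underlying pattern: a generator yields archive bytes while the zip is being written, and an S3 multipart upload consumes them, so only a bounded buffer is ever held in RAM.

```python
import io
import zipfile
from collections.abc import Iterator

import boto3

_MIN_PART_SIZE = 5 * 1024 * 1024  # S3 multipart parts must be >= 5 MiB (except the last)


class _DrainableBuffer(io.RawIOBase):
    """File-like sink that lets the bytes written so far be drained as they arrive."""

    def __init__(self) -> None:
        self._chunks: list[bytes] = []

    def writable(self) -> bool:
        return True

    def write(self, b) -> int:
        self._chunks.append(bytes(b))
        return len(b)

    def drain(self) -> bytes:
        data = b"".join(self._chunks)
        self._chunks.clear()
        return data


def zip_chunks(files: dict[str, Iterator[bytes]]) -> Iterator[bytes]:
    """Yield the bytes of a zip archive while it is being composed."""
    sink = _DrainableBuffer()
    with zipfile.ZipFile(sink, mode="w", compression=zipfile.ZIP_DEFLATED) as archive:
        for name, content in files.items():
            with archive.open(name, mode="w") as entry:
                for piece in content:
                    entry.write(piece)
                    if data := sink.drain():
                        yield data
    if data := sink.drain():  # central directory, written on close
        yield data


def upload_zip_stream(bucket: str, key: str, files: dict[str, Iterator[bytes]]) -> None:
    """Feed the zip stream into an S3 multipart upload, part by part."""
    s3 = boto3.client("s3")
    upload_id = s3.create_multipart_upload(Bucket=bucket, Key=key)["UploadId"]
    parts, part_number, pending = [], 1, b""
    try:
        for chunk in zip_chunks(files):
            pending += chunk
            if len(pending) >= _MIN_PART_SIZE:
                result = s3.upload_part(
                    Bucket=bucket, Key=key, UploadId=upload_id,
                    PartNumber=part_number, Body=pending,
                )
                parts.append({"ETag": result["ETag"], "PartNumber": part_number})
                part_number, pending = part_number + 1, b""
        if pending:  # the final part may be smaller than 5 MiB
            result = s3.upload_part(
                Bucket=bucket, Key=key, UploadId=upload_id,
                PartNumber=part_number, Body=pending,
            )
            parts.append({"ETag": result["ETag"], "PartNumber": part_number})
        s3.complete_multipart_upload(
            Bucket=bucket, Key=key, UploadId=upload_id,
            MultipartUpload={"Parts": parts},
        )
    except Exception:
        s3.abort_multipart_upload(Bucket=bucket, Key=key, UploadId=upload_id)
        raise
```

The 5 MiB accumulation is why memory stays constant regardless of archive size: S3 requires multipart parts of at least 5 MiB (except the last), so the sketch buffers zip output only until one part is ready, then hands it off.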

Progress bar support has also been added. Progress is reported based on the data read from the input streams.
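
Conceptually (a sketch; `with_progress` and `report` are made-up names, not the PR's API), read-side progress reporting amounts to wrapping each input stream so every chunk read advances a callback:

```python
from collections.abc import Callable, Iterator


def with_progress(stream: Iterator[bytes], report: Callable[[int], None]) -> Iterator[bytes]:
    """Yield chunks unchanged, reporting the number of bytes read from the input."""
    for chunk in stream:
        report(len(chunk))  # progress advances as input data is consumed
        yield chunk
```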

Bonus: renamed `_filemanager.py`, which created confusion, to `filemanager_utils.py`.

Related issue/s

How to test

Dev-ops checklist

@GitHK GitHK added this to the Singularity milestone Feb 7, 2025
@GitHK GitHK self-assigned this Feb 7, 2025

codecov bot commented Feb 7, 2025

Codecov Report

Attention: Patch coverage is 96.32353% with 5 lines in your changes missing coverage. Please review.

Project coverage is 87.03%. Comparing base (4ca666e) to head (9d48624).

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #7186      +/-   ##
==========================================
+ Coverage   86.98%   87.03%   +0.04%     
==========================================
  Files        1667     1668       +1     
  Lines       64721    64684      -37     
  Branches     1096     1115      +19     
==========================================
- Hits        56299    56298       -1     
+ Misses       8109     8068      -41     
- Partials      313      318       +5     
| Flag | Coverage Δ |
|------|------------|
| integrationtests | 65.30% <75.00%> (+0.01%) ⬆️ |
| unittests | 86.03% <93.38%> (+0.02%) ⬆️ |

| Components | Coverage Δ |
|------------|------------|
| api | ∅ <ø> (∅) |
| pkg_aws_library | 94.17% <100.00%> (+0.14%) ⬆️ |
| pkg_dask_task_models_library | 97.09% <ø> (ø) |
| pkg_models_library | 91.54% <100.00%> (+0.01%) ⬆️ |
| pkg_notifications_library | 84.57% <ø> (ø) |
| pkg_postgres_database | 88.28% <ø> (ø) |
| pkg_service_integration | 70.03% <ø> (ø) |
| pkg_service_library | 72.61% <97.97%> (+0.45%) ⬆️ |
| pkg_settings_library | 90.61% <ø> (ø) |
| pkg_simcore_sdk | 85.08% <75.00%> (-0.39%) ⬇️ |
| agent | 96.46% <ø> (ø) |
| api_server | 90.56% <ø> (ø) |
| autoscaling | 96.08% <ø> (ø) |
| catalog | 91.71% <ø> (ø) |
| clusters_keeper | 99.24% <ø> (ø) |
| dask_sidecar | 91.25% <ø> (ø) |
| datcore_adapter | 93.19% <ø> (ø) |
| director | 76.59% <ø> (ø) |
| director_v2 | 91.27% <ø> (-0.03%) ⬇️ |
| dynamic_scheduler | 97.33% <ø> (ø) |
| dynamic_sidecar | 89.77% <ø> (ø) |
| efs_guardian | 90.25% <ø> (ø) |
| invitations | 93.28% <ø> (ø) |
| osparc_gateway_server | ∅ <ø> (∅) |
| payments | 92.66% <ø> (ø) |
| resource_usage_tracker | 88.32% <ø> (-0.66%) ⬇️ |
| storage | 86.67% <ø> (+0.11%) ⬆️ |
| webclient | ∅ <ø> (∅) |
| webserver | 84.83% <ø> (+0.08%) ⬆️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@GitHK GitHK added the a:services-library (issues on packages/service-libs) and a:aws-library labels Feb 7, 2025
@GitHK GitHK requested a review from sanderegg February 14, 2025 05:48
@sanderegg (Member) left a comment


Last thing: please double-check `download_fileobj`.

Comment on lines +485 to +487
# NOTE `download_fileobj` cannot be used to implement this because
# it will buffer the entire file in memory instead of reading it
# chunk by chunk
sanderegg (Member):

What about this?
https://boto3.amazonaws.com/v1/documentation/api/latest/reference/customizations/s3.html#boto3.s3.transfer.TransferConfig
I see chunk configuration there; I also see multipart download and multithreading. Are you sure about this?

GitHK (Contributor, Author):

The current implementation is still valid.
I tried really hard to use `download_fileobj`, but could not make it work.
`download_fileobj` writes to a file object; the options you are highlighting help with tuning the chunk size and the parallelism of writing to the output, but you cannot pause the process and resume it.

To do so, I had to mull the checks one at a time, like it's done here.

sanderegg (Member):

mull?

GitHK (Contributor, Author):

@sanderegg I'll rewrite the last part.

To achieve on-demand file chunk download, the only way I found was to pull one chunk at a time, like I did here.
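
For the record, the "one chunk at a time" approach boils down to ranged `GetObject` calls, as in this rough sketch (parameter names are assumed; `client` is an aiobotocore-style async S3 client, matching the `get_object(..., Range=...)` call used in the PR):

```python
from collections.abc import AsyncIterator


async def iter_object_chunks(
    client, bucket: str, key: str, chunk_size: int
) -> AsyncIterator[bytes]:
    """Pull an S3 object one ranged chunk at a time; the consumer controls the pacing."""
    head = await client.head_object(Bucket=bucket, Key=key)
    size: int = head["ContentLength"]
    for start in range(0, size, chunk_size):
        end = min(start + chunk_size, size) - 1  # HTTP Range bounds are inclusive
        response = await client.get_object(
            Bucket=bucket, Key=key, Range=f"bytes={start}-{end}"
        )
        yield await response["Body"].read()
```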

packages/pytest-simcore/src/pytest_simcore/file_extra.py (outdated; resolved)
@GitHK GitHK requested a review from sanderegg February 14, 2025 08:26
@bisgaard-itis (Contributor) left a comment:

Cool stuff! Thanks a lot for the effort! I would suggest adding some RAM checks, and perhaps also some disk space checks, to your tests.

# Download the chunk
response = await self._client.get_object(
    Bucket=bucket_name, Key=object_key, Range=range_header
)
bisgaard-itis (Contributor):

Just out of curiosity: do you know if there is any difference in the cost incurred by downloading a file in a single call versus multiple calls? Or is the price of downloading from AWS S3 just a function of the size of the data?

GitHK (Contributor, Author):

I have no real idea about this.

packages/aws-library/tests/test_s3_client.py (resolved)
    s3_client: S3Client,
    archive_s3_object_key: S3ObjectKey,
    mocked_progress_bar_cb: Mock,
):
bisgaard-itis (Contributor):

Nice test! Maybe this is actually where it would make sense to do the memory profiling.

GitHK (Contributor, Author):

Could not figure out how to do it and won't spend more time on it. If you have an example that already works, I'll give it a try.
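
For what it's worth, one low-effort option would be a `tracemalloc`-based budget assertion, sketched below (`run_streaming_upload` is a placeholder for the actual workflow under test; note that `tracemalloc` only sees Python-level allocations, not native buffers, so this is a rough guard rather than a true RSS check):

```python
import tracemalloc


def test_streaming_upload_stays_within_memory_budget():
    tracemalloc.start()
    run_streaming_upload()  # placeholder for the streaming workflow under test
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    assert peak_bytes < 100 * 1024 * 1024  # e.g. a 100 MiB budget
```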

@GitHK GitHK mentioned this pull request Feb 14, 2025