
✨ Streaming utils for zipping and reading/writing to S3 #7186

Open
wants to merge 70 commits into master

Conversation

@GitHK (Contributor) commented Feb 7, 2025

What do these changes do?

These changes bring a set of utilities that allow us to create zip archives on the fly and stream them to S3 as they are created. The idea is to use a constant amount of RAM and no disk space.

How does this work? A request to upload a zip archive to S3 is created. As the uploader requests chunks of the archive, the streaming zip utility fetches chunks of the input files on the fly and composes the archive, providing pieces of it to the S3 uploader as soon as they are available.

Have a look at packages/aws-library/tests/test_s3_client.py::test_workflow_compress_s3_objects_and_local_files_in_a_single_archive_then_upload_to_s3 for a full working workflow.
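
For illustration only (this is not the PR's code; `zip_chunks`, `upload_zip_stream`, and the bucket/key values are hypothetical), here is a minimal sketch of the underlying pattern: a generator yields archive bytes while the zip is being written, and an S3 multipart upload consumes them, so only a bounded buffer is ever held in RAM.

```python
import io
import zipfile
from collections.abc import Iterator

import boto3

_MIN_PART_SIZE = 5 * 1024 * 1024  # S3 multipart parts must be >= 5 MiB (except the last)


class _DrainableBuffer(io.RawIOBase):
    """File-like sink that lets the bytes written so far be drained as they arrive."""

    def __init__(self) -> None:
        self._chunks: list[bytes] = []

    def writable(self) -> bool:
        return True

    def write(self, b) -> int:
        self._chunks.append(bytes(b))
        return len(b)

    def drain(self) -> bytes:
        data = b"".join(self._chunks)
        self._chunks.clear()
        return data


def zip_chunks(files: dict[str, Iterator[bytes]]) -> Iterator[bytes]:
    """Yield the bytes of a zip archive while it is being composed."""
    sink = _DrainableBuffer()
    with zipfile.ZipFile(sink, mode="w", compression=zipfile.ZIP_DEFLATED) as archive:
        for name, content in files.items():
            with archive.open(name, mode="w") as entry:
                for piece in content:
                    entry.write(piece)
                    if data := sink.drain():
                        yield data
    if data := sink.drain():  # central directory, written on close
        yield data


def upload_zip_stream(bucket: str, key: str, files: dict[str, Iterator[bytes]]) -> None:
    """Feed the zip stream into an S3 multipart upload, part by part."""
    s3 = boto3.client("s3")
    upload_id = s3.create_multipart_upload(Bucket=bucket, Key=key)["UploadId"]
    parts, part_number, pending = [], 1, b""
    try:
        for chunk in zip_chunks(files):
            pending += chunk
            if len(pending) >= _MIN_PART_SIZE:
                result = s3.upload_part(
                    Bucket=bucket, Key=key, UploadId=upload_id,
                    PartNumber=part_number, Body=pending,
                )
                parts.append({"ETag": result["ETag"], "PartNumber": part_number})
                part_number, pending = part_number + 1, b""
        if pending:  # the final part may be smaller than 5 MiB
            result = s3.upload_part(
                Bucket=bucket, Key=key, UploadId=upload_id,
                PartNumber=part_number, Body=pending,
            )
            parts.append({"ETag": result["ETag"], "PartNumber": part_number})
        s3.complete_multipart_upload(
            Bucket=bucket, Key=key, UploadId=upload_id,
            MultipartUpload={"Parts": parts},
        )
    except Exception:
        s3.abort_multipart_upload(Bucket=bucket, Key=key, UploadId=upload_id)
        raise
```

The 5 MiB accumulation is why memory stays constant regardless of archive size: S3 requires multipart parts of at least 5 MiB (except the last), so the sketch buffers zip output only until one part is ready, then hands it off.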

Progress bar support has also been added. Progress is reported based on the data read from the input streams.
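
Conceptually (a sketch; `with_progress` and `report` are made-up names, not the PR's API), read-side progress reporting amounts to wrapping each input stream so every chunk read advances a callback:

```python
from collections.abc import Callable, Iterator


def with_progress(stream: Iterator[bytes], report: Callable[[int], None]) -> Iterator[bytes]:
    """Yield chunks unchanged, reporting the number of bytes read from the input."""
    for chunk in stream:
        report(len(chunk))  # progress advances as input data is consumed
        yield chunk
```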

Bonus: renamed `_filemanager.py`, which created confusion, to `filemanager_utils.py`.

Related issue/s

How to test

Dev-ops checklist

@GitHK GitHK added this to the Singularity milestone Feb 7, 2025
@GitHK GitHK self-assigned this Feb 7, 2025

codecov bot commented Feb 7, 2025

Codecov Report

Attention: Patch coverage is 96.32353% with 5 lines in your changes missing coverage. Please review.

Project coverage is 87.03%. Comparing base (4ca666e) to head (9d48624).

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #7186      +/-   ##
==========================================
+ Coverage   86.98%   87.03%   +0.04%     
==========================================
  Files        1667     1668       +1     
  Lines       64721    64684      -37     
  Branches     1096     1115      +19     
==========================================
- Hits        56299    56298       -1     
+ Misses       8109     8068      -41     
- Partials      313      318       +5     
| Flag | Coverage Δ |
|------|------------|
| integrationtests | 65.30% <75.00%> (+0.01%) ⬆️ |
| unittests | 86.03% <93.38%> (+0.02%) ⬆️ |

| Components | Coverage Δ |
|------------|------------|
| api | ∅ <ø> (∅) |
| pkg_aws_library | 94.17% <100.00%> (+0.14%) ⬆️ |
| pkg_dask_task_models_library | 97.09% <ø> (ø) |
| pkg_models_library | 91.54% <100.00%> (+0.01%) ⬆️ |
| pkg_notifications_library | 84.57% <ø> (ø) |
| pkg_postgres_database | 88.28% <ø> (ø) |
| pkg_service_integration | 70.03% <ø> (ø) |
| pkg_service_library | 72.61% <97.97%> (+0.45%) ⬆️ |
| pkg_settings_library | 90.61% <ø> (ø) |
| pkg_simcore_sdk | 85.08% <75.00%> (-0.39%) ⬇️ |
| agent | 96.46% <ø> (ø) |
| api_server | 90.56% <ø> (ø) |
| autoscaling | 96.08% <ø> (ø) |
| catalog | 91.71% <ø> (ø) |
| clusters_keeper | 99.24% <ø> (ø) |
| dask_sidecar | 91.25% <ø> (ø) |
| datcore_adapter | 93.19% <ø> (ø) |
| director | 76.59% <ø> (ø) |
| director_v2 | 91.27% <ø> (-0.03%) ⬇️ |
| dynamic_scheduler | 97.33% <ø> (ø) |
| dynamic_sidecar | 89.77% <ø> (ø) |
| efs_guardian | 90.25% <ø> (ø) |
| invitations | 93.28% <ø> (ø) |
| osparc_gateway_server | ∅ <ø> (∅) |
| payments | 92.66% <ø> (ø) |
| resource_usage_tracker | 88.32% <ø> (-0.66%) ⬇️ |
| storage | 86.67% <ø> (+0.11%) ⬆️ |
| webclient | ∅ <ø> (∅) |
| webserver | 84.83% <ø> (+0.08%) ⬆️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@GitHK GitHK added the a:services-library (issues on packages/service-libs) and a:aws-library labels Feb 7, 2025
@GitHK GitHK requested a review from sanderegg February 14, 2025 05:48
@sanderegg (Member) left a comment


Last thing: please double-check `download_fileobj`.

Comment on lines +485 to +487
# NOTE `download_fileobj` cannot be used to implement this because
# it will buffer the entire file in memory instead of reading it
# chunk by chunk
sanderegg (Member):

What about this?
https://boto3.amazonaws.com/v1/documentation/api/latest/reference/customizations/s3.html#boto3.s3.transfer.TransferConfig
I see chunk configuration there; I also see multipart download and multithreading. Are you sure about this?

GitHK (Contributor, Author):

The current implementation is still valid.
I tried really hard to use `download_fileobj`, but could not make it work.
`download_fileobj` writes to a file object; the options you are highlighting help with tuning the chunk size and the parallelism of writing to the output, but you cannot pause the process and resume it.

To do so, I had to mull the checks one at a time, like it's done here.

sanderegg (Member):

mull?

GitHK (Contributor, Author):

@sanderegg I'll rewrite the last part.

To achieve on-demand file chunk download, the only way I found was to pull one chunk at a time, like I did here.
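
For the record, the "one chunk at a time" approach boils down to ranged `GetObject` calls, as in this rough sketch (parameter names are assumed; `client` is an aiobotocore-style async S3 client, matching the `get_object(..., Range=...)` call used in the PR):

```python
from collections.abc import AsyncIterator


async def iter_object_chunks(
    client, bucket: str, key: str, chunk_size: int
) -> AsyncIterator[bytes]:
    """Pull an S3 object one ranged chunk at a time; the consumer controls the pacing."""
    head = await client.head_object(Bucket=bucket, Key=key)
    size: int = head["ContentLength"]
    for start in range(0, size, chunk_size):
        end = min(start + chunk_size, size) - 1  # HTTP Range bounds are inclusive
        response = await client.get_object(
            Bucket=bucket, Key=key, Range=f"bytes={start}-{end}"
        )
        yield await response["Body"].read()
```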

packages/pytest-simcore/src/pytest_simcore/file_extra.py (outdated; resolved)
@GitHK GitHK requested a review from sanderegg February 14, 2025 08:26
@bisgaard-itis (Contributor) left a comment:

Cool stuff! Thanks a lot for the effort! I would suggest adding some RAM checks, and perhaps also some disk space checks, to your tests.

# Download the chunk
response = await self._client.get_object(
    Bucket=bucket_name, Key=object_key, Range=range_header
)
bisgaard-itis (Contributor):

Just out of curiosity: do you know if there is any difference in the cost incurred by downloading a file in a single call versus multiple calls? Or is the price of downloading from AWS S3 just a function of the size of the data?

GitHK (Contributor, Author):

I have no real idea about this.

packages/aws-library/tests/test_s3_client.py (resolved)
    s3_client: S3Client,
    archive_s3_object_key: S3ObjectKey,
    mocked_progress_bar_cb: Mock,
):
bisgaard-itis (Contributor):

Nice test! Maybe this is actually where it would make sense to do the memory profiling.

GitHK (Contributor, Author):

Could not figure out how to do it and won't spend more time on it. If you have an example that already works, I'll give it a try.
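
For what it's worth, one low-effort option would be a `tracemalloc`-based budget assertion, sketched below (`run_streaming_upload` is a placeholder for the actual workflow under test; note that `tracemalloc` only sees Python-level allocations, not native buffers, so this is a rough guard rather than a true RSS check):

```python
import tracemalloc


def test_streaming_upload_stays_within_memory_budget():
    tracemalloc.start()
    run_streaming_upload()  # placeholder for the streaming workflow under test
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    assert peak_bytes < 100 * 1024 * 1024  # e.g. a 100 MiB budget
```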

@GitHK GitHK mentioned this pull request Feb 14, 2025