
refactor blob creation #7

Merged · 1 commit merged into containers:main on Jan 22, 2025

Conversation

jaideepr97 (Collaborator)

This PR does the following:

  • adds a new create_blobs function
  • moves blob-related util functions out of basics.py into more appropriate locations in the utils package
  • splits the existing unit tests and rearranges them to target the new locations of their respective functions
  • adds common testing functions

Signed-off-by: Jaideep Rao <[email protected]>
@tarilabs (Member)

Hi @jaideepr97, thanks for this PR, but I'm not following why:

  • the creation of the tar and tar.gz layer types was removed
  • the refactor drops the use of HashingWriter, adding the need to recompute the hash from the file

The idea was to have 2 functions which would write a layer, either plain tar or tar.gz, and, while writing this layer, return the simple hash (computed during the write, without re-reading) and the pre-/post-compression hashes to use as DiffID and Digest.
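For readers following along: the hash-while-writing pattern described here can be sketched roughly as follows. This is a minimal illustration matching the `hash_func` usage in the code quoted later in this thread, not the exact olot implementation:

```python
import hashlib

class HashingWriter:
    """File-like wrapper that hashes bytes as they are written,
    so the digest is available without re-reading the file."""

    def __init__(self, fileobj, hash_func=None):
        self.fileobj = fileobj
        self.hash_func = hash_func or hashlib.sha256()

    def write(self, data):
        self.hash_func.update(data)
        return self.fileobj.write(data)

    def tell(self):
        # tarfile may query the stream position while writing
        return self.fileobj.tell()
```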

It seems to me this refactor into file_checksums_without_compression and file_checksums_with_compression computes the hash by re-reading the files, and further deletes the actual layer blob.

Can you kindly clarify?
I'm also happy to do an online call if preferable.

Thanks in advance for your clarifications.

@jaideepr97 (Collaborator, Author)

Thanks for reviewing @tarilabs

I haven't removed any functionality; I moved these functions into the utils package in a previous PR, and am only removing them from basics.py in this PR.
The same is true for HashingWriter.

> It seems to me this refactor into file_checksums_without_compression and file_checksums_with_compression computes the hash by re-reading the files, and further deletes the actual layer blob.

These functions were created only for unit-testing purposes; that's why they are placed in tests/common.py.

The actual logic has not been changed.

@tarilabs left a comment (Member)

thank you so much @jaideepr97

> I haven't removed any functionality; I moved these functions into the utils package in a previous PR, and am only removing them from basics.py in this PR. The same is true for HashingWriter.
>
> These functions were created only for unit-testing purposes; that's why they are placed in tests/common.py.
>
> The actual logic has not been changed.

I see now, as noted below. Thanks for your patience; I didn't recall that previous refactor.

/approve

I think we can deal with the small comment below about the test utility function refactor in a follow-up PR, indeed separately.

To hopefully make things easier, I will exceptionally merge this one with a merge commit.

Comment on lines -143 to -172 (basics.py)

```python
def tar_into_ocilayout(ocilayout: Path, model: Path):
    sha256_path = ocilayout / "blobs" / "sha256"
    temp_tar_filename = sha256_path / "temp_layer"
    with open(temp_tar_filename, "wb") as temp_file:
        writer = HashingWriter(temp_file)
        with tarfile.open(fileobj=writer, mode="w") as tar:  # type:ignore
            tar.add(model, arcname="/models/"+model.name, filter=tar_filter_fn)
    checksum = writer.hash_func.hexdigest()
    click.echo(f"digest of the tar file: {checksum}")
    final_tar_filename = checksum
    os.rename(temp_tar_filename, sha256_path / final_tar_filename)
    click.echo(f"tar file renamed to: {final_tar_filename}")
    return checksum


def targz_into_ocilayout(ocilayout: Path, model: Path):
    sha256_path = ocilayout / "blobs" / "sha256"
    temp_tar_filename = sha256_path / "temp_layer"
    with open(temp_tar_filename, "wb") as temp_file:
        writer = HashingWriter(temp_file)
        with gzip.GzipFile(fileobj=writer, mode="wb", mtime=0, compresslevel=6) as gz:  # type:ignore
            inner_writer = HashingWriter(gz)
            with tarfile.open(fileobj=inner_writer, mode="w") as tar:  # type:ignore
                tar.add(model, arcname="/models/"+model.name, filter=tar_filter_fn)
    checksum = writer.hash_func.hexdigest()
    tar_checksum = inner_writer.hash_func.hexdigest()
    final_tar_filename = checksum
    os.rename(temp_tar_filename, sha256_path / final_tar_filename)
    return (checksum, tar_checksum)
```

tarilabs (Member)

I see now, these were moved into olot/utils/files.py

Comment on lines -143 to -155 (basics.py)

```python
def tar_into_ocilayout(ocilayout: Path, model: Path):
    sha256_path = ocilayout / "blobs" / "sha256"
    temp_tar_filename = sha256_path / "temp_layer"
    with open(temp_tar_filename, "wb") as temp_file:
        writer = HashingWriter(temp_file)
        with tarfile.open(fileobj=writer, mode="w") as tar:  # type:ignore
            tar.add(model, arcname="/models/"+model.name, filter=tar_filter_fn)
    checksum = writer.hash_func.hexdigest()
    click.echo(f"digest of the tar file: {checksum}")
    final_tar_filename = checksum
    os.rename(temp_tar_filename, sha256_path / final_tar_filename)
    click.echo(f"tar file renamed to: {final_tar_filename}")
    return checksum
```
tarilabs (Member)

indeed moved to olot/utils/files.py (lines 68 to 85 in c7ceea6):

```python
    if not file_path.exists():
        raise ValueError(f"Input file '{file_path}' does not exist.")
    os.makedirs(dest, exist_ok=True)
    temp_dest = dest / "temp"
    try:
        with open(temp_dest, "wb") as temp_file:
            writer = HashingWriter(temp_file)
            with tarfile.open(fileobj=writer, mode="w") as tar:  # type:ignore
                tar.add(file_path, arcname="/models/"+file_path.name, filter=tar_filter_fn)
        checksum = writer.hash_func.hexdigest()
        os.rename(temp_dest, dest / checksum)
        return checksum
    except tarfile.TarError as e:
        raise tarfile.TarError(f"Error creating tarball: {e}") from e
    except OSError as e:
        raise OSError(f"File operation failed: {e}") from e
```
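For context, a minimal usage sketch of the relocated function, assuming the import path shown in the permalink above and the signature exercised by the tests later in this thread:

```python
from pathlib import Path

from olot.utils.files import tarball_from_file  # module path per the permalink above

blobs = Path("my-oci-layout") / "blobs" / "sha256"
blobs.mkdir(parents=True, exist_ok=True)

# Writes the tar layer blob into blobs/ and returns its sha256 hex digest,
# which is also the blob's filename.
digest = tarball_from_file(Path("model.joblib"), blobs)
print(blobs / digest)
```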

Comment on lines -158 to -171 (basics.py)

```python
def targz_into_ocilayout(ocilayout: Path, model: Path):
    sha256_path = ocilayout / "blobs" / "sha256"
    temp_tar_filename = sha256_path / "temp_layer"
    with open(temp_tar_filename, "wb") as temp_file:
        writer = HashingWriter(temp_file)
        with gzip.GzipFile(fileobj=writer, mode="wb", mtime=0, compresslevel=6) as gz:  # type:ignore
            inner_writer = HashingWriter(gz)
            with tarfile.open(fileobj=inner_writer, mode="w") as tar:  # type:ignore
                tar.add(model, arcname="/models/"+model.name, filter=tar_filter_fn)
    checksum = writer.hash_func.hexdigest()
    tar_checksum = inner_writer.hash_func.hexdigest()
    final_tar_filename = checksum
    os.rename(temp_tar_filename, sha256_path / final_tar_filename)
    return (checksum, tar_checksum)
```
tarilabs (Member)

indeed was moved to olot/utils/files.py (lines 108 to 128 in c7ceea6):

```python
    if not file_path.exists():
        raise ValueError(f"Input file '{file_path}' does not exist.")
    os.makedirs(dest, exist_ok=True)
    temp_dest = dest / "temp"
    try:
        with open(temp_dest, "wb") as temp_file:
            writer = HashingWriter(temp_file)
            with gzip.GzipFile(fileobj=writer, mode="wb", mtime=0, compresslevel=6) as gz:  # type:ignore
                inner_writer = HashingWriter(gz)
                with tarfile.open(fileobj=inner_writer, mode="w") as tar:  # type:ignore
                    tar.add(file_path, arcname="/models/"+file_path.name, filter=tar_filter_fn)
        precompress_checksum = inner_writer.hash_func.hexdigest()
        postcompress_checksum = writer.hash_func.hexdigest()
        os.rename(temp_dest, dest / postcompress_checksum)
        return (postcompress_checksum, precompress_checksum)
    except tarfile.TarError as e:
        raise tarfile.TarError(f"Error creating tarball: {e}") from e
    except OSError as e:
        raise OSError(f"File operation failed: {e}") from e
```
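The pre-/post-compression pair returned here is what feeds the DiffID/Digest distinction raised at the top of this conversation. A usage sketch, under the same import-path assumption as above:

```python
from pathlib import Path

from olot.utils.files import targz_from_file  # module path per the permalink above

blobs = Path("my-oci-layout") / "blobs" / "sha256"
blobs.mkdir(parents=True, exist_ok=True)

postcompress, precompress = targz_from_file(Path("model.joblib"), blobs)
# Per the OCI image spec:
#   sha256:<postcompress> -> layer descriptor digest (hash of the .tar.gz blob)
#   sha256:<precompress>  -> rootfs diff_id in the image config (hash of the inner .tar)
```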

```diff
@@ -114,9 +114,9 @@ def targz_from_file(file_path: Path, dest: Path) -> tuple[str, str]:
     try:
         with open(temp_dest, "wb") as temp_file:
             writer = HashingWriter(temp_file)
-            with gzip.GzipFile(fileobj=writer, mode="wb", mtime=0, compresslevel=6) as gz: # type:ignore
+            with gzip.GzipFile(fileobj=writer, mode="wb", mtime=0, compresslevel=6) as gz: # type: ignore[call-overload]
```
tarilabs (Member)

thank you for making the lint ignore more specific
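For reference, mypy supports scoping an ignore to a named error code, which keeps other diagnostics on the same line visible:

```python
# Silences every mypy error on the line:
x: int = "oops"  # type: ignore

# Silences only the named error code, keeping other diagnostics visible:
y: int = "oops"  # type: ignore[assignment]
```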

Comment on lines +16 to +49 (tests/common.py)

```python
def file_checksums_without_compression(source_file: Path, tmp_dest: Path) -> str:

    tmp_dest.mkdir(parents=True, exist_ok=True)
    temp_tar_filename = tmp_dest / "temp_layer"

    with open(temp_tar_filename, "wb") as temp_file:
        writer = HashingWriter(temp_file)
        with tarfile.open(fileobj=writer, mode="w") as tar:  # type: ignore[call-overload]
            tar.add(source_file, arcname="/models/"+source_file.name, filter=tar_filter_fn)
    sha = get_file_hash(temp_tar_filename)
    shutil.rmtree(tmp_dest)
    return sha


def file_checksums_with_compression(source_file: Path, tmp_dest: Path) -> tuple[str, str]:
    temp_tar_filename = tmp_dest / "temp_layer.tar.gz"

    with open(temp_tar_filename, "wb") as temp_file:
        writer = HashingWriter(temp_file)
        with gzip.GzipFile(fileobj=writer, mode="wb", mtime=0, compresslevel=6) as gz:  # type: ignore[call-overload]
            inner_writer = HashingWriter(gz)
            with tarfile.open(fileobj=inner_writer, mode="w") as tar:  # type: ignore[call-overload]
                tar.add(source_file, arcname="/models/"+source_file.name, filter=tar_filter_fn)

    postcompress_chksum_from_disk = get_file_hash(temp_tar_filename)

    uncompressed_tar = tmp_dest / "uncompressed.tar"
    with tarfile.open(uncompressed_tar, mode="w") as tar:  # type: ignore[call-overload]
        tar.add(source_file, arcname=source_file.name)
    uncompressed_checksum_from_disk = get_file_hash(uncompressed_tar)

    shutil.rmtree(tmp_dest)

    return postcompress_chksum_from_disk, uncompressed_checksum_from_disk
```
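These helpers deliberately recompute digests by re-reading the written file via get_file_hash, giving the tests a check that is independent of the HashingWriter digests. A plausible chunked implementation, assuming get_file_hash is a plain sha256 over the file contents:

```python
import hashlib
from pathlib import Path

def get_file_hash(path: Path, chunk_size: int = 64 * 1024) -> str:
    """Sketch: sha256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```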
tarilabs (Member)

I think we could refactor these to reuse the existing functions (above) and then remove the added layer, i.e. to avoid duplicating part of the hashing logic, but I also think this could be done in a subsequent PR, wdyt?

Comment on lines +17 to +40

```python
def test_Hashwriter_tar(tmp_path):
    """Example bespoke use of HashingWriter for .tar
    """
    hello_path = sample_model_path() / "hello.md"
    write_dest = sha256_path(tmp_path)
    write_dest.mkdir(parents=True, exist_ok=True)
    temp_tar_filename = write_dest / "temp_layer"

    with open(temp_tar_filename, "wb") as temp_file:
        writer = HashingWriter(temp_file)
        with tarfile.open(fileobj=writer, mode="w") as tar:
            tar.add(hello_path, arcname=hello_path.name)

    checksum_from_disk = get_file_hash(temp_tar_filename)
    checksum = writer.hash_func.hexdigest()
    assert checksum == checksum_from_disk

    final_tar_filename = f"{checksum}.tar"
    os.rename(temp_tar_filename, write_dest / final_tar_filename)
    print(f"tar file renamed to: {final_tar_filename}")

    for file in tmp_path.rglob('*'):
        if file.is_file():
            print(file)
```

Comment on lines +43 to +75

```python
def test_Hashwriter_tar_gz(tmp_path):
    """Example bespoke use of HashingWriter for .tar.gz
    """
    hello_path = sample_model_path() / "hello.md"
    write_dest = sha256_path(tmp_path)
    write_dest.mkdir(parents=True, exist_ok=True)
    temp_tar_filename = write_dest / "temp_layer.tar.gz"

    with open(temp_tar_filename, "wb") as temp_file:
        writer = HashingWriter(temp_file)
        with gzip.GzipFile(fileobj=writer, mode="wb", mtime=0, compresslevel=6) as gz:
            inner_writer = HashingWriter(gz)
            with tarfile.open(fileobj=inner_writer, mode="w") as tar:
                tar.add(hello_path, arcname=hello_path.name)

    checksum_from_disk = get_file_hash(temp_tar_filename)
    checksum = writer.hash_func.hexdigest()
    assert checksum == checksum_from_disk

    uncompressed_tar = tmp_path / "uncompressed.tar"
    with tarfile.open(uncompressed_tar, mode="w") as tar:
        tar.add(hello_path, arcname=hello_path.name)

    uncompressed_checksum_from_disk = get_file_hash(uncompressed_tar)
    tar_checksum = inner_writer.hash_func.hexdigest()
    assert tar_checksum == uncompressed_checksum_from_disk

    final_tar_filename = f"{checksum}.tar.gz"
    os.rename(temp_tar_filename, write_dest / final_tar_filename)

    for file in tmp_path.rglob('*'):
        if file.is_file():
            print(file)
```
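Note the mtime=0 and fixed compresslevel on GzipFile: gzip embeds a timestamp in its header, so pinning it is what makes the .tar.gz digest reproducible across runs. A quick way to see this:

```python
import gzip, hashlib, io

def gz_digest(mtime):
    buf = io.BytesIO()
    with gzip.GzipFile(fileobj=buf, mode="wb", mtime=mtime, compresslevel=6) as gz:
        gz.write(b"same payload")
    return hashlib.sha256(buf.getvalue()).hexdigest()

assert gz_digest(0) == gz_digest(0)  # pinned timestamp: stable digest
assert gz_digest(1) != gz_digest(2)  # varying timestamp changes the bytes
```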

Comment on lines +77 to +96

```python
def test_tarball_from_file(tmp_path):
    """Test tarball_from_file() function is able to produce the expected tar (uncompressed) layer blob in the oci-layout
    """
    model_path = sample_model_path() / "model.joblib"
    write_dest = sha256_path(tmp_path)
    write_dest.mkdir(parents=True, exist_ok=True)
    digest = tarball_from_file(model_path, write_dest)  # forcing it into a partial temp directory with blobs subdir for tests
    for file in tmp_path.rglob('*'):
        if file.is_file():
            print(file)

    checksum_from_disk = get_file_hash(write_dest / digest)  # read the file
    assert digest == checksum_from_disk  # filename should match its digest

    found = None
    with tarfile.open(write_dest / digest, "r") as tar:
        for tarinfo in tar:
            if tarinfo.name == "models/model.joblib" and tarinfo.mode == 0o664:
                found = tarinfo  # model.joblib is added in models/ inside modelcar with expected permissions
    assert found
```
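The mode assertion above is what tar_filter_fn exists for. Its body isn't shown in this diff, but based on the behavior asserted here (normalized 0o664 permissions on added members), it plausibly looks something like this, with the uid/gid reset being an assumption rather than anything confirmed by the PR:

```python
import tarfile

def tar_filter_fn(tarinfo: tarfile.TarInfo) -> tarfile.TarInfo:
    # Sketch: normalize ownership and permissions so layer digests
    # do not depend on the local filesystem state.
    tarinfo.uid = 0
    tarinfo.gid = 0
    tarinfo.mode = 0o664
    return tarinfo
```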

Comment on lines +99 to +129

```python
def test_targz_from_file(tmp_path):
    """Test targz_from_file() function is able to produce the expected tar.gz layer blob in the oci-layout
    """
    model_path = sample_model_path() / "model.joblib"
    write_dest = sha256_path(tmp_path)
    write_dest.mkdir(parents=True, exist_ok=True)
    postcomp_chksum, precomp_chskum = targz_from_file(model_path, write_dest)  # forcing it into a partial temp directory with blobs subdir for tests

    for file in tmp_path.rglob('*'):
        if file.is_file():
            print(file)

    checksum_from_disk = get_file_hash(write_dest / postcomp_chksum)  # read the file
    assert postcomp_chksum == checksum_from_disk  # filename should match its digest

    found = None
    with tarfile.open(write_dest / postcomp_chksum, "r:gz") as tar:
        for tarinfo in tar:
            if tarinfo.name == "models/model.joblib" and tarinfo.mode == 0o664:
                found = tarinfo  # model.joblib is added in models/ inside modelcar with expected permissions
    assert found

    uncompressed_tar = write_dest / "uncompressed.tar"
    with gzip.open(write_dest / postcomp_chksum, "rb") as g_in:
        with open(uncompressed_tar, "wb") as f_out:
            shutil.copyfileobj(g_in, f_out)
    for file in tmp_path.rglob('*'):
        if file.is_file():
            print(file)
    tar_checksum_from_disk = get_file_hash(uncompressed_tar)  # compute the digest for the .tar from the .tar.gz
    assert precomp_chskum == tar_checksum_from_disk  # digests should match
```

Comment on lines +3 to +8

```python
def test_compute_hash_of_str():
    """Basic compute_hash_of_str() fn testing
    """
    my_string = "Hello, World! Thanks Andrew, Roland, Syed and Quay team for the idea :)"
    actual = compute_hash_of_str(my_string)
    assert actual == "8695589abd30ec83cde42fabc3b4f6fd7da629eca94cf146c7920c6b067f4087"  # can verify with: echo -n "Hello, World! Thanks Andrew, Roland, Syed and Quay team for the idea :)" | shasum -a 256
```
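Given the shasum comparison in the comment, compute_hash_of_str is presumably a thin wrapper over hashlib; a sketch consistent with this test:

```python
import hashlib

def compute_hash_of_str(content: str) -> str:
    """Sketch: sha256 hex digest of a UTF-8 encoded string."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()

assert compute_hash_of_str(
    "Hello, World! Thanks Andrew, Roland, Syed and Quay team for the idea :)"
) == "8695589abd30ec83cde42fabc3b4f6fd7da629eca94cf146c7920c6b067f4087"
```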

@tarilabs merged commit ceccbe9 into containers:main on Jan 22, 2025
8 checks passed