Add sparse file support #35

Open · wants to merge 5 commits into `main`
32 changes: 31 additions & 1 deletion packages/tar-parser/README.md
@@ -6,7 +6,7 @@

- Runs anywhere JavaScript runs
- Built on the standard [web Streams API](https://developer.mozilla.org/en-US/docs/Web/API/Streams_API), so it's composable with `fetch()` streams
- Supports POSIX, GNU, PAX tar formats, and old GNU sparse files
- Memory efficient and does not buffer anything in normal usage
- 0 dependencies

@@ -34,6 +34,8 @@ await parseTar(response.body.pipeThrough(new DecompressionStream('gzip')), (entry) => {
});
```

### Handling Different Filename Encodings

If you're parsing an archive with filename encodings other than UTF-8, use the `filenameEncoding` option:

```ts
await parseTar(response.body, { filenameEncoding: 'latin1' }, (entry) => {
});
```

### Working with Sparse Files

For sparse files, `tar-parser` reconstructs the full file by default, filling in the zeroed regions indicated by the sparse map:

```ts
await parseTar(response.body, async (entry) => {
if (entry.header.type === 'sparse') {
// Fully reconstructed sparse file
let reconstructedData = await entry.bytes();
console.log(entry.name, reconstructedData.length);
}
});
```
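Each entry's sparse map is also exposed on `entry.header.sparseMap` as an array of `{ offset, size }` data segments, terminated by a zero-size entry at the file's real size, so you can inspect which regions of the file actually contained data:

```ts
await parseTar(response.body, async (entry) => {
  if (entry.header.type === 'sparse' && entry.header.sparseMap) {
    for (let { offset, size } of entry.header.sparseMap) {
      console.log(`${entry.name}: data segment at byte ${offset} (${size} bytes)`);
    }
  }
});
```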

If you prefer the raw data chunks as they appear in the archive (without reconstructing zeros), you can call `entry.bytes({ raw: true })`:
> **Owner:** When you say "without reconstructing zeros", do you mean that the resulting byte array will have zeroes in it? That's just spacer data, right? Forgive my ignorance, but when would that ever be useful?

> **Author:** This is the default behavior of tar. Basically, if someone sends you a sparse tarball, you don't have to do anything differently if you just pipe the entry to a write stream for the file. So, for someone with no knowledge of sparse files, it "just works". However, if you're a more advanced user, you can get the sparse offsets and lengths and use `fs.write(fd, buffer, offset, length)`, which is more efficient because you're not writing the extra data.

```ts
await parseTar(response.body, async (entry) => {
if (entry.header.type === 'sparse') {
// Raw archived data segments only, no sparse reconstruction
let rawData = await entry.bytes({ raw: true });
console.log(entry.name, rawData.length);
}
});
```

This allows you to save memory by only working with blocks that actually contain data.
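For example, on Node.js you could combine raw mode with the sparse map to write only the data segments at their original offsets, letting the filesystem keep the holes. This is a sketch rather than part of the library's API: it assumes Node's `node:fs/promises` and relies on the raw segments arriving concatenated in sparse-map order (which the tests in this PR exercise):

```ts
import { open } from 'node:fs/promises';

await parseTar(response.body, async (entry) => {
  if (entry.header.type !== 'sparse' || !entry.header.sparseMap) return;

  let raw = await entry.bytes({ raw: true });
  let handle = await open(entry.name, 'w');
  try {
    let consumed = 0;
    for (let { offset, size } of entry.header.sparseMap) {
      if (size === 0) continue; // skip the trailing end-of-map marker
      // Write each segment at its original position; regions we never touch
      // remain holes on filesystems with sparse file support
      await handle.write(raw.subarray(consumed, consumed + size), 0, size, offset);
      consumed += size;
    }
    // Extend the file to its full logical size in case it ends in a hole
    if (entry.header.realSize) await handle.truncate(entry.header.realSize);
  } finally {
    await handle.close();
  }
});
```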

## Benchmark

`tar-parser` performs on par with other popular tar parsing libraries on Node.js.
162 changes: 162 additions & 0 deletions packages/tar-parser/src/lib/tar.test.ts
@@ -456,4 +456,166 @@ describe('tar-stream test cases', () => {
      ],
    ]);
  });

  it('parses sparse.tar', async () => {
    /* sparse.tar generated with:

    truncate -s 32K sparse

    # Insert multiple sparse data segments
    echo -n "DATA1" | dd of=sparse bs=1 seek=0 conv=notrunc
    echo -n "DATA2" | dd of=sparse bs=1 seek=8192 conv=notrunc
    echo -n "DATA3" | dd of=sparse bs=1 seek=16384 conv=notrunc

    tar --sparse -cf sparse.tar sparse
    */
    let blockSize = 4096;
    let entries: { name: string; data: Uint8Array; header: TarHeader }[] = [];

    await parseTar(readFixture(fixtures.sparse), async (entry) => {
      let data = await entry.bytes();
      entries.push({ name: entry.name, data, header: entry.header });
    });
    assert.equal(entries.length, 1);
    const { name, data, header } = entries[0];
    assert.equal(name, 'sparse');
    assert.equal(data.length, blockSize * 8);
    assert.deepEqual(header.sparseMap, [
      { offset: 0, size: 4096 },
      { offset: 8192, size: 4096 },
      { offset: 16384, size: 4096 },
      { offset: 32768, size: 0 },
    ]);

    let dec = new TextDecoder();

    for (let i = 0; i < 3; i++) {
      let exp = `DATA${i + 1}`;
      assert.equal(
        dec.decode(data.subarray(i * blockSize * 2, i * blockSize * 2 + exp.length)),
        exp,
      );
    }
  });

  it('supports raw mode', async () => {
    let blockSize = 4096;

    let data: Uint8Array;
    await parseTar(readFixture(fixtures.sparse), async (entry) => {
      data = await entry.bytes({ raw: true });
    });

    let dec = new TextDecoder();

    for (let i = 0; i < 3; i++) {
      let exp = `DATA${i + 1}`;
      assert.equal(dec.decode(data!.subarray(i * blockSize, i * blockSize + exp.length)), exp);
    }
  });

  it('parses sparse-extended.tar', async () => {
    /* sparse-extended.tar generated with:

    block_size=4096
    truncate -s $((block_size*20)) sparse

    for i in {1..20..2}; do
      echo -n "DATA$i" | dd of=sparse bs=1 seek=$((i*block_size)) conv=notrunc
    done

    tar --sparse -cf sparse-extended.tar sparse
    */
    let blockSize = 4096;
    let entries: { name: string; data: Uint8Array; header: TarHeader }[] = [];

    await parseTar(readFixture(fixtures.sparseExtended), async (entry) => {
      let data = await entry.bytes();
      entries.push({ name: entry.name, data, header: entry.header });
    });
    assert.equal(entries.length, 1);
    const { name, data, header } = entries[0];
    assert.equal(name, 'sparse');
    assert.equal(data.length, blockSize * 20);
    assert.deepEqual(header.sparseMap, [
      { offset: 4096, size: 4096 },
      { offset: 12288, size: 4096 },
      { offset: 20480, size: 4096 },
      { offset: 28672, size: 4096 },
      { offset: 36864, size: 4096 },
      { offset: 45056, size: 4096 },
      { offset: 53248, size: 4096 },
      { offset: 61440, size: 4096 },
      { offset: 69632, size: 4096 },
      { offset: 77824, size: 4096 },
      { offset: 81920, size: 0 },
    ]);

    let dec = new TextDecoder();

    for (let i = 1; i < 20; i += 2) {
      let exp = `DATA${i}`;
      assert.equal(dec.decode(data.subarray(i * blockSize, i * blockSize + exp.length)), exp);
    }

    // Block 0 is a hole, so the first bytes should be zero
    for (let i = 0; i < 10; i++) {
      assert.equal(data[i], 0);
    }
  });

  it('parses sparse-multiple-extended-headers.tar', async () => {
    /* sparse-multiple-extended-headers.tar generated with:

    block_size=4096
    truncate -s $((block_size*26)) sparse

    for i in {1..26..2}; do
      echo -n "DATA$i" | dd of=sparse bs=1 seek=$((i*block_size)) conv=notrunc
    done

    tar --sparse -cf sparse-multiple-extended-headers.tar sparse
    */
    let blockSize = 4096;
    let entries: { name: string; data: Uint8Array; header: TarHeader }[] = [];
    const numHoles = 13;

    await parseTar(readFixture(fixtures.sparseMultipleExtendedHeaders), async (entry) => {
      let data = await entry.bytes();
      entries.push({ name: entry.name, data, header: entry.header });
    });
    assert.equal(entries.length, 1);
    const { name, data, header } = entries[0];
    assert.equal(name, 'sparse');
    assert.equal(data.length, blockSize * (numHoles * 2));
    const expectedMap = Array.from({ length: numHoles }, (_, i) => ({
      offset: (i + 1) * blockSize * 2 - blockSize,
      size: blockSize,
    }));
    expectedMap.push({ offset: blockSize * numHoles * 2, size: 0 });
    assert.deepEqual(header.sparseMap, expectedMap);

    let dec = new TextDecoder();

    for (let i = 1; i < numHoles * 2; i += 2) {
      let exp = `DATA${i}`;
      assert.equal(dec.decode(data.subarray(i * blockSize, i * blockSize + exp.length)), exp);
    }

    // Block 0 is a hole, so the first bytes should be zero
    for (let i = 0; i < numHoles; i++) {
      assert.equal(data[i], 0);
    }
  });
});
103 changes: 89 additions & 14 deletions packages/tar-parser/src/lib/tar.ts
@@ -7,6 +7,8 @@ import {
  getOctal,
  getString,
  overflow,
  parseOldGnuSparse,
  parseOldGnuSparseExtension,
} from './utils.ts';

const TarBlockSize = 512;
@@ -32,8 +34,23 @@ export interface TarHeader {
  devmajor: number | null;
  devminor: number | null;
  pax: Record<string, string> | null;

  // Old GNU sparse support
  atime?: number | null;
  ctime?: number | null;
  volumeOffset?: number | null;
  realSize?: number | null;
  // The sparse map records which byte regions of the file actually contain data
  sparseMap?: { offset: number; size: number }[];
  isExtended?: boolean;
}

const ZeroOffset = '0'.charCodeAt(0);
const UstarMagic = new Uint8Array([0x75, 0x73, 0x74, 0x61, 0x72, 0x00]); // "ustar\0"
const UstarVersion = new Uint8Array([ZeroOffset, ZeroOffset]); // "00"
const GnuMagic = new Uint8Array([0x75, 0x73, 0x74, 0x61, 0x72, 0x20]); // "ustar "
const GnuVersion = new Uint8Array([0x20, 0x00]); // " \0"

const TarFileTypes: Record<string, string> = {
  '0': 'file',
  '1': 'link',
@@ -43,19 +60,17 @@ const TarFileTypes: Record<string, string> = {
  '5': 'directory',
  '6': 'fifo',
  '7': 'contiguous-file',
  '20': 'dumpdir',
  '27': 'gnu-long-link-path',
  '28': 'gnu-long-path',
  '29': 'multivolume',
  '30': 'gnu-long-path',
  '35': 'sparse',
  '38': 'volume',
  '55': 'pax-global-header',
  '72': 'pax-header',
};
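The numeric keys above appear to be the header's typeflag byte relative to `'0'` (the `ZeroOffset` constant), which is why the old GNU sparse flag `'S'` shows up as `'35'`:

```ts
'S'.charCodeAt(0) - '0'.charCodeAt(0); // 83 - 48 === 35, so TarFileTypes['35'] === 'sparse'
```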

export interface ParseTarHeaderOptions {
  /**
   * Set false to disallow unknown header formats. Defaults to true.
@@ -141,6 +156,10 @@ export function parseTarHeader(block: Uint8Array, options?: ParseTarHeaderOptions
    throw new TarParseError('Invalid tar header, unknown format');
  }

  if (header.type === 'sparse') {
    parseOldGnuSparse(block, header);
  }

  return header;
}
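`parseOldGnuSparse` and `parseOldGnuSparseExtension` live in `utils.ts`, which isn't part of this diff. A plausible sketch of them, based on the old GNU header layout (four 24-byte sparse entries at offset 386, an `isextended` flag byte at 482, and the real file size at 483; each 512-byte extension block holds up to 21 more entries with its own flag at byte 504) and assuming `getOctal(block, offset, length)` returns `null` for blank fields:

```ts
function readSparseEntries(
  block: Uint8Array,
  start: number,
  count: number,
): { offset: number; size: number }[] {
  let entries: { offset: number; size: number }[] = [];
  for (let i = 0; i < count; i++) {
    let base = start + i * 24; // 12-byte octal offset + 12-byte octal size
    let offset = getOctal(block, base, 12);
    let size = getOctal(block, base + 12, 12);
    if (offset == null || size == null) break; // a blank slot ends the list
    entries.push({ offset, size });
  }
  return entries;
}

function parseOldGnuSparse(block: Uint8Array, header: TarHeader): void {
  header.sparseMap = readSparseEntries(block, 386, 4);
  header.isExtended = block[482] === 1;
  header.realSize = getOctal(block, 483, 12);
}

function parseOldGnuSparseExtension(block: Uint8Array, header: TarHeader): boolean {
  header.sparseMap!.push(...readSparseEntries(block, 0, 21));
  return block[504] === 1; // another extension block follows
}
```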

@@ -312,7 +331,17 @@ export class TarParser {
      return;
    }

    if (!this.#header?.isExtended) {
      this.#header = parseTarHeader(block, this.#options);
    }

    if (this.#header.type === 'sparse' && this.#header.isExtended) {
      // Old GNU sparse: the header's sparse map overflowed, so keep reading
      // 512-byte extension blocks until the isExtended flag clears
      while (this.#header.isExtended) {
        if (this.#buffer!.length < TarBlockSize) return;
        let extBlock = this.#read(TarBlockSize);
        this.#header.isExtended = parseOldGnuSparseExtension(extBlock, this.#header);
      }
    }

    switch (this.#header.type) {
      case 'gnu-long-path':
@@ -459,20 +488,66 @@ export class TarEntry {
  /**
   * The content of this entry buffered into a single typed array.
   */
  async bytes(options?: { raw?: boolean }): Promise<Uint8Array> {
    if (this.bodyUsed) {
      throw new Error('Body is already consumed or is being consumed');
    }

    this.#bodyUsed = true;

    if (
      this.header.type === 'sparse' &&
      this.header.sparseMap &&
      this.header.realSize &&
      !options?.raw
    ) {
      // Reconstruct the sparse file: allocate the full (real) size, which is
      // zero-filled by default, then copy each archived data segment to the
      // offset given by the sparse map
      const result = new Uint8Array(this.header.realSize);
      const reader = this.body.getReader();
      let leftover = new Uint8Array(0);

      // Read exactly `size` bytes from the body stream, buffering any
      // overshoot in `leftover` for the next call
      async function readExactly(size: number): Promise<Uint8Array> {
        const chunk = new Uint8Array(size);
        let bytesFilled = 0;

        while (bytesFilled < size) {
          if (leftover.length === 0) {
            const { done, value } = await reader.read();
            if (done) throw new Error('Unexpected end of sparse data');
            leftover = value;
          }

          const needed = size - bytesFilled;
          const toCopy = Math.min(needed, leftover.length);

          chunk.set(leftover.subarray(0, toCopy), bytesFilled);
          bytesFilled += toCopy;

          leftover = leftover.subarray(toCopy);
        }

        return chunk;
      }

      for (const { offset, size } of this.header.sparseMap) {
        const chunkData = await readExactly(size);
        result.set(chunkData, offset);
      }

      return result;
    }

    // Non-sparse entries (or raw mode): buffer all chunks, then concatenate
    const chunks: Uint8Array[] = [];
    let total = 0;
    for await (let c of this.body) {
      chunks.push(c);
      total += c.length;
    }
    let result = new Uint8Array(total);
    let pos = 0;
    for (let c of chunks) {
      result.set(c, pos);
      pos += c.length;
    }
    return result;
  }
