Zip with no compression is a nice contender for a container format that shouldn't be slept on. It effectively reduces the per-file I/O overhead while, unlike TAR, allowing direct random access to the files without "extracting" them or seeking through the entire archive. That works even via mmap, over HTTP range queries, etc.
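As a rough sketch of what that buys you, here's Go's archive/zip pulling a single member out of an archive (file names made up). The central directory at the end of the file records every member's offset, so f.Open() is a seek plus a read, not a scan of the whole archive; writing the stored flavor is just a matter of passing Method: zip.Store in the FileHeader.

    package main

    import (
        "archive/zip"
        "io"
        "os"
    )

    func main() {
        // hypothetical archive written with Method: zip.Store (no compression)
        r, err := zip.OpenReader("bundle.zip")
        if err != nil {
            panic(err)
        }
        defer r.Close()

        // the central directory gives us every member's offset up front,
        // so opening one member never touches the others' bytes
        for _, f := range r.File {
            if f.Name == "assets/logo.png" {
                rc, err := f.Open()
                if err != nil {
                    panic(err)
                }
                defer rc.Close()
                io.Copy(os.Stdout, rc)
            }
        }
    }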
You can still get the compression benefits by serving files with Content-Encoding: gzip or whatever. Zip has builtin compression, but you can simply not use it and lean on external compression instead, especially over the wire.
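A minimal sketch of that pattern in Go (the file name is hypothetical): keep the archive uncompressed at rest, and compress per-request only for clients that advertise gzip support.

    package main

    import (
        "compress/gzip"
        "io"
        "net/http"
        "os"
        "strings"
    )

    func serveBundle(w http.ResponseWriter, r *http.Request) {
        f, err := os.Open("bundle.zip") // stored, i.e. uncompressed, zip
        if err != nil {
            http.Error(w, "not found", http.StatusNotFound)
            return
        }
        defer f.Close()

        if !strings.Contains(r.Header.Get("Accept-Encoding"), "gzip") {
            io.Copy(w, f) // client didn't ask for gzip: send as-is
            return
        }
        w.Header().Set("Content-Encoding", "gzip")
        gz := gzip.NewWriter(w) // compression happens on the wire only
        defer gz.Close()
        io.Copy(gz, f)
    }

    func main() {
        http.HandleFunc("/bundle.zip", serveBundle)
        http.ListenAndServe(":8080", nil)
    }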
It's pretty widely used, though often dressed up as something else. JAR files or APK files or whatever.
I think the article's complaints about lacking unix access rights and metadata are a bit strange. That seems like a feature more than a bug, as I wouldn't expect this to be something that transfers between machines. I don't want to unpack an archive and have to scrutinize it for files with o+rxst permissions, or have their creation date be anything other than when I unpacked them.
1718627440 35 minutes ago [-]
Isn't this already common in the Python community?
> I don't want to unpack an archive and have to scrutinize it for files with o+rxst permissions, or have their creation date be anything other than when I unpacked them.
I'm the opposite: when I pack and unpack something, I want the files to be identical, attributes included. Why should I throw away all the timestamps just because the files were temporarily in an archive?
stabbles 2 hours ago [-]
> Zip with no compression is a nice contender for a container format that shouldn't be slept on
SquashFS with zstd compression is used by various container runtimes, and is popular in HPC where filesystems often have high latency. It can be mounted natively or with FUSE, and the decompression overhead is not really felt.
stabbles 2 hours ago [-]
"I/O is the bottleneck" is only true in the loose sense that "reading files" is slow.
Strictly speaking, the bottleneck was latency, not bandwidth: opening ten thousand tiny files costs ten thousand round trips to the filesystem, however fast each individual read is.
akaltar 2 hours ago [-]
Amazing article, thanks for sharing. I really appreciate the deep investigations in response to the comments.
raggi 2 hours ago [-]
there are a loooot of languages/compilers for which the most wall-time expensive operation in compilation or loading is stat(2) searching for files
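To make that concrete, here's a toy sketch (all names invented) of the lookup loop most compilers and module loaders run. With N imports and M search-path entries you pay up to N*M stat calls, most of them misses:

    package main

    import (
        "fmt"
        "os"
        "path/filepath"
    )

    // resolve probes each search-path entry in order, the way include
    // paths, classpaths, and module paths typically work
    func resolve(name string, searchPath []string) (string, error) {
        for _, dir := range searchPath {
            p := filepath.Join(dir, name)
            if _, err := os.Stat(p); err == nil { // one syscall per probe
                return p, nil
            }
        }
        return "", fmt.Errorf("%s: not found in search path", name)
    }

    func main() {
        path := []string{"/usr/local/include", "/usr/include", "./vendor"}
        fmt.Println(resolve("stdio.h", path))
    }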
ghthor 1 hour ago [-]
I actually ran into this issue building dependency graphs of a golang monorepo. We analyzed the CPU trace and found the program was doing a lot of GC, so we reduced allocations. That turned out to be noise though: the runtime was just making use of time spent waiting on I/O, because we had shelled out to go list to get a JSON dep graph. That's slow due to all the stat calls and reads from disk. We replaced go list with a custom package import graph parser built on the std lib parser packages, and instead of reading from disk we feed the parser byte blobs from git, also using git ls-files to "stat" the files. I don't remember the specifics, but I believe we brought the time to build the dep graph down from 30-45s to 500ms.
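For anyone curious, a hedged sketch of that shape of solution (the revision and exact commands are my guesses, not ghthor's actual code): enumerate tracked files with git ls-files, pull blobs from the object store, and parse with parser.ImportsOnly so only the import block is processed.

    package main

    import (
        "fmt"
        "go/parser"
        "go/token"
        "os/exec"
        "strings"
    )

    func main() {
        // enumerate tracked .go files from the index: no stat(2) per file
        out, err := exec.Command("git", "ls-files", "*.go").Output()
        if err != nil {
            panic(err)
        }
        fset := token.NewFileSet()
        for _, path := range strings.Split(strings.TrimSpace(string(out)), "\n") {
            // read the blob straight out of the object store, not the disk
            src, err := exec.Command("git", "cat-file", "blob", "HEAD:"+path).Output()
            if err != nil {
                continue
            }
            // ImportsOnly stops parsing after the import declarations
            f, err := parser.ParseFile(fset, path, src, parser.ImportsOnly)
            if err != nil {
                continue
            }
            for _, imp := range f.Imports {
                fmt.Println(path, "->", imp.Path.Value)
            }
        }
    }

In practice you'd stream blobs through a single git cat-file --batch pipe rather than spawn one subprocess per file, but even this naive version skips the per-file stat and open on the working tree.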