I've studied the Burrows-Wheeler Transform, I understand the transformation, I've re-implemented it countless times for kicks, I see how it improves compressibility, but for the life of me the intuition of _why_ it works has never really clicked.
It's a fantastic bit of algorithmic magic that will always impress me to see it.
unnah 1 days ago [-]
The Burrows-Wheeler transform has been described as a unique algorithmic idea in that there are no non-trivial variations or related algorithms, unlike more conventional compression algorithms, which can be tweaked and improved in so many ways. There is no general compression theory in which BWT could be described as a special case.
It looks to me that the above still holds: Bzip2 and Bzip3 are simply combining more conventional compression algorithms with the BWT, which itself is still the same old transform. Bzip2 does Huffman coding after BWT, and Bzip3 does arithmetic coding.
derf_ 22 hours ago [-]
> There is no general compression theory in which BWT could be described as a special case.
I would not really say this is true. BWT is spiritually similar to the various Prediction by Partial Matching (PPM) algorithms, except that instead of needing to decide how much context (i.e., preceding symbols) to use to model the next symbol, and carefully learning the probabilities for each unique context, it naturally sorts symbols with the same context (of _any_ length) so they appear next to each other, and relies on adaptation to update your learned statistics as you move from one context to the next, without ever explicitly tracking context boundaries [0].
[0] N.J. Larsson, "The Context Trees of Block Sorting Compression," 1998. https://ieeexplore.ieee.org/abstract/document/672147
It definitely is related to prediction by partial match (PPM).
BWT sorts rotated data, and the effect is that the same suffixes group together:
...
"Bzip2 and Bzip3 are simply combining more"
"Bzip3 are simply combining moreBzip2 and "
The preceding (to the suffix) character goes to the end and then gets outputted. This is much like PPM going backward. There is a PPM* algorithm (unbounded context length) where the authors considered reconstruction of contexts from data, utilizing something like LZSS search. The same idea is in BWT - the context is reconstructed from the data.
BWT also breaks near dependencies in data, this is why move-to-front with Huffman or arithmetic encoding works well there.
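To make the grouping concrete, here's a minimal Python sketch of the forward transform (naive rotation sort with an assumed '\0' sentinel; fine for illustration, far too slow for real block sizes):

    def bwt(s: str) -> tuple[str, int]:
        """Naive Burrows-Wheeler transform: sort all rotations, keep the last column."""
        s = s + "\0"                        # sentinel so every rotation is distinct
        rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
        last_column = "".join(rot[-1] for rot in rotations)
        primary_index = rotations.index(s)  # needed later to invert the transform
        return last_column, primary_index

    last, idx = bwt("Bzip2 and Bzip3 are simply combining more conventional compression")
    print(last)  # characters that precede similar contexts end up adjacent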
unnah 17 hours ago [-]
Wow, thanks. As always, the best way to learn more on the internet is to be confidently and totally wrong!
altairprime 1 days ago [-]
Can BWT be combined with zstd, which uses asymmetric numeral systems?
CJefferson 1 days ago [-]
Yes, it would actually be interesting to just have a BWT pass which does no compression, so we can then try lots of post-compression options.
rkeene2 22 hours ago [-]
I've been thinking about adding support for this kind of stacking to DACT [0].
BWT can be combined with anything which does RLE and get a benefit. What it does is give RLE more to work with.
[0] http://dact.rkeene.org/
klodolph 23 hours ago [-]
I think you would just need ANS, not the rest of zstd.
altairprime 8 hours ago [-]
I don’t know enough to evaluate that, but it sounds plausible. Apparently modifying or integrating zstd into custom solutions was a common path in submissions to, at the very least, the GDCC 2021 T2 contest. This is all well outside of my learned competence so I’m just here to learn and ask naive questions.
klodolph 3 hours ago [-]
Zstd is basically LZ77 + FSE. There are some other pieces to it, but they’re minor. FSE is a form of entropy coding that you can swap out with other entropy coding techniques, like Huffman or arithmetic coding. For most objectives, FSE is the clear winner of those three.
As people mentioned elsewhere in the thread, Burrows-Wheeler isn’t really composable with other systems. Nobody has figured out a reasonable way to combine LZ77 and Burrows-Wheeler. That’s why zstd + bzip2 does not work. But you do need an entropy coding system to go with Burrows-Wheeler… and FSE fits the bill.
thrtythreeforty 1 days ago [-]
Thank you for the reference. I learned something new today. That algorithm is wild. If you had shown me the transform and asked if it had an inverse, I would have said of course it doesn't, it's too weird.
loxias 1 days ago [-]
I always understood it as working because of the predictability of a symbol/letter/token given the previous one.
Sorting all the shifts of a string puts all the characters in order, then looking at the last column shows you all the _preceding_ characters. If there's any predictability there (which there often is), it's now easier to compress. It's sorta like an entropy coder in that way.
I've never thought of it as being that deep, and I've understood it since I was a kid -- building an intuition for "why" the FFT works is much harder -- but that being said, I clicked quickly to reply thinking "that's easy! I can explain this!" then struggled for a while trying to get the picture in my mind into text. :)
crazygringo 1 days ago [-]
But shouldn't Huffman coding already detect that same predictability and compress it the same?
What I don't get isn't the benefits of BWT on its own. It's why BWT should add any additional benefit if you're already doing Huffman.
loxias 1 days ago [-]
> What I don't get isn't the benefits of BWT on its own. It's why BWT should add any additional benefit if you're already doing Huffman.
Ahhhh. Now we're on the same page. :) Seeing how it helps when combined is somewhat subtle/non-obvious. I believe it relates to BWT and Huffman both being approximations of something more optimal. The two transforms could also have different window sizes -- one rarely does BWT on a whole 1GB file -- which introduce inefficiencies. Huffman coding is also only optimal in the very large alphabet and very long data limits. As your data length and alphabet size decrease, it gets less optimal.
Put differently, "I think that's a wonderfully phrased question, this _is_ my specialization/subfield, and I'm gonna need to chew on it for a while."
crazygringo 1 days ago [-]
Thanks. Yeah, I can see how that would make more sense if BWT was redundant under a theoretically perfect Huffman compression, but it happens to pick up some things that real-world Huffman encoders don't, with their practical limits on CPU and memory.
Sesse__ 13 hours ago [-]
Nearly any real-world Huffman encoder is trivially optimal, i.e., given a set of symbols with probabilities, it is easy to construct the optimal set of output bits for each symbol. (There are some exceptions in that if you have a huge amount of symbols, or some that are extremely rare, you can bound the upper length and get a non-optimal code. This matters very little in practice, i.e., less than 0.1% in a typical setting. And of course, if you allow fractional bits or probabilities that change based on context, then you can do better, but then it's no longer Huffman.)
BWT is orthogonal to Huffman; like LZ, it exploits that some symbol _sequences_ are more common than others, while Huffman is about the optimality of coding each symbol on its own.
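For reference, a sketch of that construction - the textbook heap-based Huffman merge over whatever symbol counts you hand it (the example string here is made up):

    import heapq
    from collections import Counter

    def huffman_code(freqs: dict[str, int]) -> dict[str, str]:
        """Build an optimal prefix code from symbol frequencies (classic Huffman merge)."""
        # Heap entries: (weight, unique tiebreaker, {symbol: code-so-far})
        heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(freqs.items())]
        heapq.heapify(heap)
        counter = len(heap)
        while len(heap) > 1:
            w1, _, c1 = heapq.heappop(heap)
            w2, _, c2 = heapq.heappop(heap)
            merged = {s: "0" + code for s, code in c1.items()}
            merged.update({s: "1" + code for s, code in c2.items()})
            heapq.heappush(heap, (w1 + w2, counter, merged))
            counter += 1
        return heap[0][2]

    print(huffman_code(Counter("this is an example of a huffman tree")))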
jltsiren 1 days ago [-]
One way to look at data compression is splitting it into a model and an encoder. The model describes the data, while the encoder encodes (or equivalently predicts) the data according to the model. The compressed output consists of the serialized model and the encoded data. BWT is a model, while Huffman is an encoder.
Huffman takes a probability distribution and a symbol and encodes the symbol according to the distribution. If you encode all symbols independently according to the same distribution, you probably don't get very good compression.
You get a bit better results with a model that has a separate distribution for each context. If the previous symbols were X, Y, and Z, you encode the next symbol according to the distribution for context XYZ. This approach doesn't really scale, because the size of the model grows rapidly (exponentially in a naive implementation) with context length. You get better compression with an adaptive model. You start with a uniform distribution and update the available contexts and distributions after encoding each symbol. On the one hand, you don't have to store the model explicitly. But on the other hand, updating the model is very slow.
Burrows-Wheeler transform is an implicit model. It sorts the symbols according to the context that follows them, and it does that simultaneously for each context length. Because you don't have to store the model explicitly, you can effectively use longer context lengths than with a fixed explicit model. And because you don't have to update an explicit model after encoding each symbol, using the BWT is much faster than using an adaptive model.
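A toy version of the "one distribution per context" idea, just to make the bookkeeping visible (order-2 counts over a made-up string; no real codec stores its model this naively):

    from collections import Counter, defaultdict

    def order_k_model(text: str, k: int = 2) -> dict[str, Counter]:
        """For each length-k context, count which symbols follow it."""
        model: dict[str, Counter] = defaultdict(Counter)
        for i in range(k, len(text)):
            model[text[i - k:i]][text[i]] += 1
        return model

    model = order_k_model("the theory of the thing", k=2)
    print(model["th"])  # Counter({'e': 3, 'i': 1}) -- 'e' is the likely successor of "th"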
palaiologos 1 days ago [-]
Hi, tool author here.
Huffman coding is a static minimum-redundancy code. What this means is that it finds an optimal assignment of bit sequences to letters in the input alphabet (commonly US-ASCII or extensions). This however means that Huffman coding can not exploit redundancies that stem from the concrete sequence of characters. For example, you could easily predict that an `e` comes after `Th`, but Huffman coding can not know that.
Hence after applying the Burrows-Wheeler transform you need to have some sort of a higher-order transform (i.e. a transform that considers more than just individual bytes) which somehow reaps from the changed distribution of the result of the algorithm. But we will get to that in a second.
The joke here is that the Burrows-Wheeler transform is closely related to suffix trees and suffix arrays, which are often used in bioinformatics and HPC for full-text search. If you wanted to find a pattern of length `p` in a text of length `n`, if you already have a suffix tree of the original text, the search is linear in the length /of the pattern/ - i.e. O(p). The suffix tree stores all suffixes of a string in a compressed manner (i.e. it has a linear space overhead, approximately O(20n) as given by Gusfield, D. (1997). Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. Cambridge University Press), so you can search for a word in it by simply traversing from the root node to an internal or leaf node by following a sequence of bytes that comprise the word.
As such, a suffix tree (and equivalently suffix array and the BWT, which is trivially computed from a suffix array) form something which can be thought of as a static PPM model. Notably real world implementations of PPM use suffix trees as a part of their main storage data structure (e.g. PPMd). What this all means is that given a suffix tree, we can very cheaply give the probability distribution for the next byte that follows a given fixed-order sequence of bytes. This is nice, because then e.g. an order-2 predictor would be able to tell that `Th` is followed by `e` once enough data has been gathered.
As you can probably guess, the more preceding bytes you know, the better will be your estimate for what is the most likely next byte. But the larger your context, the more expensive the searches and computations become due to pointer chasing in the suffix tree.
So how do we remedy this? We notice that the Burrows-Wheeler transform essentially clusters similar contexts together, meaning that a low order predictor (= faster, simpler) on BWT compresses as well as a high order predictor (= slow, complicated) on the original data, at the cost of an extra transformation. This is viable, because the Burrows-Wheeler transform can be quickly computed and there have been recent advancements in running it on the GPU. So what this means is that bzip3 uses BWT + a low order predictor with an arithmetic coder to encode the bytes, meaning that it can make use of high order statistics for compression and performs comparably at a faster speed.
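One way to see why a low-order stage suddenly works after BWT: the clustered output collapses under something as simple as move-to-front, which is roughly what bzip2 feeds its entropy coder (bzip3's context model is more sophisticated, so treat this only as an illustration of the principle):

    def move_to_front(data: bytes) -> list[int]:
        """Replace each byte by its index in a recency list; runs of equal bytes become runs of zeros."""
        alphabet = list(range(256))
        out = []
        for b in data:
            i = alphabet.index(b)
            out.append(i)
            alphabet.pop(i)
            alphabet.insert(0, b)
        return out

    print(move_to_front(b"nnbaaaaa"))  # [110, 0, 99, 99, 0, 0, 0, 0] -- the run of 'a's collapses to zeros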
eru 17 hours ago [-]
You are spot on.
Btw, the Burrows-Wheeler transform is often explained as taking the last column.
I find it easier to understand why it works if you think of BWT as sorting all rotations of your string from their _second_ character onwards, and then writing down all the first characters.
eru 17 hours ago [-]
> But shouldn't Huffman coding already detect that same predictability and compress it the same?
Huffman coding only works on individual letters. It doesn't know anything about relationships between adjacent letters.
vanviegen 1 days ago [-]
Understanding why increasing predictability helps with compression is not the hard part though. What's hard to grasp is why the transform is reversible.
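For the reversibility part, a minimal sketch (assuming, as is standard, that the index of the original rotation is stored next to the last column): prepend the last column to an empty table and re-sort, n times, and the sorted rotations reappear.

    def bwt(s: str) -> tuple[str, int]:
        s = s + "\0"                       # sentinel
        rots = sorted(s[i:] + s[:i] for i in range(len(s)))
        return "".join(r[-1] for r in rots), rots.index(s)

    def inverse_bwt(last: str, primary_index: int) -> str:
        """Invert by repeatedly prepending the last column and re-sorting."""
        table = [""] * len(last)
        for _ in range(len(last)):
            table = sorted(last[i] + table[i] for i in range(len(last)))
        return table[primary_index].rstrip("\0")

    encoded, idx = bwt("banana banana banana")
    assert inverse_bwt(encoded, idx) == "banana banana banana"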
gfody 1 days ago [-]
a word can be factored into the set and frequency of letters + the specific permutation. compressible patterns in either channel seem likely when the underlying words are language like.
SideQuark 1 days ago [-]
And in general that set of descriptors provides no compressibility. BWT is much richer than that set in that the way it works performs well for data we care about.
Describing a multiset takes as much information as the multiset contained to begin with, on average. BWT somehow works better on things of use.
mnw21cam 1 days ago [-]
I remember the lecturer commenting on what sort of sick and twisted mind could come up with such a ridiculous convoluted notion when I was taught it at university.
ajb 19 hours ago [-]
Wheeler was also one of the inventors of the "closed subroutine" AKA function, which had to be implemented via a hack as machines of the time did not include ISA support for "return":
I feel exactly the same, and have also implemented it backwards and forwards. I've thought about it in my sleep, trying to recall how it REALLY works. Happens every few years ;-) I always thought it was probably obvious to everyone else what the "magic" is.
Isn't it basically run length encoding but on patterns? Sorting lexicographically on all rotations means blocks of patterns get grouped together, which means you can do compression more easily, right?
duskwuff 1 days ago [-]
> the intuition of _why_ it works has never really clicked.
In terms of why it aids compression, or why it's reversible?
As far as the first goes: it transforms n-grams into repeated characters.
forty 1 days ago [-]
Haha I also studied as part of a student project and I remember my computer science teacher saying "it's dark magic" ^^
bawolff 1 days ago [-]
Yeah. BWT and zero knowledge proofs are my goto examples of CS things that seem like pure magic.
jltsiren 1 days ago [-]
Public-key cryptography is magic. Zero-knowledge proofs are a consequence that's difficult to find on your own but easy to understand once you've seen it.
I remember seeing someone (probably Avi Wigderson) demonstrating a zero-knowledge proof for graph coloring on an overhead projector when I was starting my second year studying CS. He had a slide with a graph, multiple slides with different permutations of the same valid coloring of the vertices, and a piece of paper with "doors" over the vertices to cover everything. The audience could ask him to open the doors over the vertices connected by any edge, and he would show that the coloring is valid, as far as those two vertices are concerned. And then he would replace the coloring with another permutation for the next iteration. The idea felt clever but kind of obvious in retrospect.
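A toy simulation of one such round, with hash commitments standing in for the paper doors (the graph, coloring, and round count are made up for illustration):

    import hashlib, os, random

    # A 3-colorable toy graph and one valid coloring (illustrative assumptions).
    edges = [("A", "B"), ("B", "C"), ("C", "A"), ("C", "D")]
    coloring = {"A": 0, "B": 1, "C": 2, "D": 0}

    def commit(value: str, salt: bytes) -> str:
        return hashlib.sha256(salt + value.encode()).hexdigest()

    def one_round() -> bool:
        # Prover: permute the colors and commit to each vertex's permuted color.
        perm = random.sample(range(3), 3)
        permuted = {v: perm[c] for v, c in coloring.items()}
        salts = {v: os.urandom(16) for v in permuted}
        commitments = {v: commit(str(c), salts[v]) for v, c in permuted.items()}

        # Verifier: pick one edge and ask the prover to open both endpoints.
        u, v = random.choice(edges)
        opened = {w: (permuted[w], salts[w]) for w in (u, v)}

        # Verifier checks the openings match the commitments and the colors differ.
        ok = all(commit(str(c), s) == commitments[w] for w, (c, s) in opened.items())
        return ok and opened[u][0] != opened[v][0]

    # Repeating the round drives a cheater's success probability down; each round
    # reveals only "these two endpoints differ", thanks to the fresh permutation.
    print(all(one_round() for _ in range(100)))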
toast0 8 hours ago [-]
> Public-key cryptography is magic.
Public key cryptography is kind of magic, but the basic piece that makes everything work isn't too hard. It's been a while since I took discrete math, so this is a bit rusty, but it should get you in the right direction. It's modular multiplication in a field. If you take a small field and make a times table, you can see the magic. All of the factors that are relatively prime to the modulus will have a pattern where F times X is only equal to F times Y if X and Y are equal. There will also be a multiplicative inverse so that F times X times I equals X for all X; because of commutation, I times X times F also equals X. When you've determined an F and an I, you give your correspondent F and keep I --- then if you multiply your message by F, they multiply it by I and get back the message. If they want to send you a message, they multiply it by I and send it, then you multiply it by F.
There's some other stuff, but that's the magic part, IMHO. You just have to use a bigger field and some other stuff I forgot. ;)
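The F/I relationship described above is a modular inverse; a quick sketch with a toy modulus (emphatically not a secure scheme - real public-key systems rest on operations like modular exponentiation that are hard to invert, which plain multiplication is not):

    m = 2**61 - 1          # a Mersenne prime, so every nonzero residue is invertible
    f = 123456789          # the factor you hand out
    i = pow(f, -1, m)      # its multiplicative inverse mod m (Python 3.8+)

    message = 42
    scrambled = (message * f) % m
    assert (scrambled * i) % m == message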
bawolff 1 days ago [-]
Magical things are usually built up from the trivial examples.
Graph colouring is usually used as a teaching example for zkp, because of how easy it is to understand. It's still amazing how you can go from that example to "Here is a file that anyone can verify (non-interactively) which shows i have a passport and that passport is not from a country sanctioned by the usa but otherwise does not reveal anything about me" or "I will not reveal my ip or any other identifying/unique info but i will prove i have not been previously blocked from this website including during the other times i anonymously accessed this website"
jltsiren 1 days ago [-]
Things that look magical often stop being magical when you have the right perspective and the right abstractions. The step from a toy example to proving any verifiable statement is just NP-completeness. Which is simple enough that undergraduates are often expected to understand it.
Interactive zero-knowledge proofs are also technically non-interactive. They are defined in terms of the verifier evaluating a transcript of the protocol. If the verifier accepts some causal assumptions about the provenance of the transcript, they will accept the proof. If they disagree with the assumptions, the proof is indistinguishable from random noise they could generate themself. An interactive commitment – challenge – response protocol is one possible causal assumption. A source of randomness could replace the challenges, making the protocol non-interactive. Or there could be a pre-committed secret, making a single-round protocol effectively non-interactive.
These are things a sufficiently interested CS undergraduate can prove and understand. Public-key cryptography, on the other hand, remains magical. There are many things people assume to be true. Which need to be true for public-key cryptography to function. Empirically these things seem to be true, but nobody has been able to prove them. And I don't think anyone really understands them.
maginx 9 hours ago [-]
Agreed - I've worked with PKI for many years, and I know why the various systems work... in terms of "why you can decrypt again", not in terms of why it is secure (no other way to decrypt), which no one really knows. But if we assume for a moment that the systems are secure, it is truly fascinating when thinking about it in abstract terms: who would have thought it is possible to give someone an exact recipe to follow that scrambles a message to be sent... we can consider e.g. an encryption routine where the public key is inline, so it is really just a standalone scrambling problem. Even though it is completely public, it can only be used to scramble and not unscramble. The sender, who literally went through the steps to scramble the message, cannot undo what they just did (the sender could have saved the original before scrambling!). And it is not because data is thrown away that it can't be unscrambled - all the information is there, fully recoverable, but only by the person who made the scrambling recipe, and there's no practical way to deduce the unscrambling recipe from the scrambling recipe.
As you may be aware, different compression tools fill in different data type niches. In particular, less specialised statistical methods (bzip2, bzip3, PPMd) generally perform poorly on vaguely defined binary data due to unnatural distribution of the underlying data that at least in bzip3's case does not lend well to suffix sorting.
Conversely, Lempel-Ziv methods usually perform suboptimally on vaguely defined "textual data" due to the fact that the future stages of compression that involve entropy coding can not make good use of the information encoded by match offsets while maintaining fast decompression performance - it's a long story that I could definitely go into detail about if you'd like, but I want to keep this reply short.
All things considered, data compression is more of an art than a science, trying to hit an acceptable spot on the time to compression ratio curve. I created bzip3 as an improvement to the original algorithm, hoping that we can replace some uses of bzip2 with a more modern and worthwhile technology as of 2022. I have included benchmarks against LZMA, zstandard, etc. mostly as a formality; in reality, if you were to choose a compression method, it'd be very dependent on what exactly you're trying to compress, but my personal stance is that bzip3 would likely be strictly better than bzip2 in all of them.
bzip3 usually operates on bigger block sizes, up to 16 times bigger than bzip2. Additionally, bzip3 supports parallel compression/decompression out of the box. For fairness, the benchmarks have been performed in single-thread mode, but they aren't entirely fair either, as bzip3 uses a much bigger block size. What bzip3 aims to be is a replacement for bzip2 on modern hardware. What used to not be viable decades ago (arithmetic coding, context mixing, SAIS algorithms for BWT construction) became viable nowadays, as CPU frequencies don't tend to change, while caches and RAM keep getting bigger and faster.
nmz 3 hours ago [-]
If the focus is on text, then the best example is probably the SQLite amalgamation file, which is a 9 MB C file.
linsomniac 1 days ago [-]
Thanks for the reply. I just figured I'd try it and see, and the bzip3 results are extremely good. I figured it was worth trying because a fair bit of the data in that image is non-binary (man pages, config files, shell/python code), but probably the bulk of it is binary (kernel images, executables).
andix 1 days ago [-]
Shouldn't a modern compression tool, targeting a high compression rate, try to switch its compression method on the fly depending on the input data?
I have no idea about compression, just a naive thought.
supertrope 1 days ago [-]
7-Zip can apply a BCJ filter before LZMA to more effectively compress x86 binaries. https://www.7-zip.org/7z.html. Btrfs’ transparent compression feature checks if the first block compressed well; if not it gives up for the rest of the file.
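That "try the first block" heuristic is easy to approximate; a sketch (the real Btrfs logic differs in its details, and the 128 KiB sample and 0.9 threshold here are arbitrary):

    import zlib

    def worth_compressing(path: str, sample_size: int = 128 * 1024, threshold: float = 0.9) -> bool:
        """Compress a leading sample cheaply; give up on the file if it barely shrinks."""
        with open(path, "rb") as f:
            block = f.read(sample_size)
        if not block:
            return False
        return len(zlib.compress(block, 1)) < threshold * len(block)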
As you may notice, my sanity check actually has a slightly different size. Not sure why. The benchmark is a bit underspecified because new Perl versions were released in the interim. I used all releases up to perl-5.37.1 to get to the correct number of files. Just treat all numbers as having about 2% uncertainty to account for this difference.
I can't provide compression/decompression times, but the --long or --long=31 arguments should not have major impact on speed, they mostly impact used memory. --long=31 requires setting the same in the decompressor, making that option mostly useful for internal use, not archives meant for public consumption.
As you can see, the benchmark chosen by the author mostly comes down to finding similar data that's far away. I wonder if bzip3 can do this better than other algorithms (especially in less memory) or simply chose default parameters that use more memory.
Edit: added more benchmarks
SeptiumMMX 1 days ago [-]
Given that it's BWT, the difference should be the most prominent on codebases with huge amounts of mostly equivalent files. Most compression algorithms won't help if you get an exact duplicate of some block when it's past the compression window (and will be less efficient if near the end of the window).
But here's a practical trick: sort files by extension and then by name before putting them into an archive, and then use any conventional compression. It will very likely put the similar-looking files together, and save you space. Done that in practice, works like a charm.
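A sketch of that trick (hypothetical directory name; the key is just (extension, name), and you compress the resulting tar afterwards with whatever you like):

    import tarfile
    from pathlib import Path

    def make_sorted_tar(src_dir: str, out_path: str) -> None:
        """Add files grouped by extension, then name, so similar files sit near each other."""
        files = [p for p in Path(src_dir).rglob("*") if p.is_file()]
        files.sort(key=lambda p: (p.suffix, p.name))
        with tarfile.open(out_path, "w") as tar:
            for p in files:
                tar.add(str(p), arcname=str(p.relative_to(src_dir)))

    make_sorted_tar("some_codebase", "all.tar")  # then e.g. `zstd -16 all.tar`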
Ooh, that’s neat. How much improved do you get from this? Is it more single or double digit % diff?
sevg 1 days ago [-]
To make your comment more useful you’ll want to include compression and decompression time.
Using the results from the readme, seems like bzip3 performs competitively with zstd on both counts.
idoubtit 1 days ago [-]
I've experimented a bit with bzip3, and I think the results in the readme are not representative. I think it's a hand-picked case, with an uncommon input and an unfair choice of parameters. And it's made with a HDD, which skews the results even more.
For instance, with a 800 MB SQL file, for the same compression time and optimal parameters (within my capacity), bzip3 produced a smaller file (5.7 % compression ratio) than zstd (6.1 % with `--long -15`). But the decompression was about 20× slower (with all cores or just one).
I'm not claiming my stupid benchmark is better or even right. It's just that my results were very different from bzip3's readme. So I'm suspicious.
dralley 1 days ago [-]
Also the compression levels..
linsomniac 1 days ago [-]
I believe the compression levels are included in the list above.
dralley 1 days ago [-]
Not for zstd or lzma
linsomniac 1 days ago [-]
Added, thanks.
jl6 1 days ago [-]
A 4x improvement over lzma is an extraordinary claim. I see the author has also given a result after applying lrzip (which removes long-range redundancies in large files), and the difference isn’t so great (but bzip3 still wins). Does the amazing result without lrzip mean bzip3 is somehow managing to exploit some of that long-range redundancy natively?
I’d be astonished if such a 4x result generalized to tarballs that aren’t mostly duplicated files.
wongarsu 1 days ago [-]
Currently running my own benchmarks, my preliminary results are that zstd becomes competitive again once you add the --long option (so `zstd --long -16 all.tar` instead of `zstd -16 all.tar`). Which is an option that not everyone might be aware of, but whose usefulness should be intuitive for this benchmark of >200 very similar files.
eqvinox 1 days ago [-]
I'd argue that's actually the lowlight of the README since that is a very poor choice of benchmark. Combining a multitude of versions of the same software massively favors an algorithm good at dealing with this kind of repetitiveness in a way that will not be seen in typical applications.
The "Corpus benchmarks" further down in the README are IMHO much more practically relevant. The compression ratio of bzip3 is not significantly better, but the runtime seems quite a bit lower than lzma at least.
miohtama 1 days ago [-]
In Linux source benchmark results are interestingly more equal, LZMA still holding up well.
What makes Perl source benchmark special? Deduplication?
mmastrac 1 days ago [-]
An old friend used to say that Perl is line noise that was given sentience.
ape4 1 days ago [-]
This is the source - which is probably C.
bhaney 1 days ago [-]
It's 246 C files and 3163 perl files
loeg 1 days ago [-]
Why -T12 for zstd and T16 for xz? How many threads is bzip3 using?
zimpenfish 1 days ago [-]
From the source, it looks like bzip3 defaults to 1 thread if not explicitly set by arguments.
JoshTriplett 1 days ago [-]
...using zstd level 16, when zstd goes up to 22. And without turning on zstd's long-range mode.
t43562 8 hours ago [-]
> bzip2 -9 -k all.tar 981.78s user 9.77s system 95% cpu 8M memory 17:16.64 total
> bzip3 -e -b 256 -j 12 all.tar 2713.81s user 16.28s system 634% cpu 18301M memory 7:10.10 total
The difference in memory usage might be worth noting: 8M v 18301M
Groxx 7 hours ago [-]
Probably worth noting that bzip2 also did by far the worst in this. ~7x larger files than the best bzip3. Large memory use is generally required for good compression.
I'm curious how well gzip would have handled this though, as it's generally low memory too, and all the others in that list have FAR more than 8M memory used.
t43562 7 hours ago [-]
I think gzip was 5M. This all depends where you are using it e.g. on a Raspberry Pico I'd be using gzip no matter the extra bytes probably.
pxeger1 1 days ago [-]
The author is super cool. They are one of very few people to write any substantial programs in Malbolge, a programming language designed to be cryptographically hard to use (or something like that)
matthberg 15 hours ago [-]
The Wikipedia page on Malbolge was quite the horrific read, downright amazing to have a lisp written in it.
I still remember going crazy about bzip (the first one) and re-compressing all my data with it.
And then I remember discovering, several years later, that bzip (the first one) is an obsolete format that is now difficult to even decompress.
I learned my lesson and now use horribly sub-optimal formats that I'm sure will stick around for a long time, if not forever.
thomasmg 18 hours ago [-]
Improving BWT is great!
In my view, improving "long range" compression has the biggest potential. There are many, many algorithms and implementations for very short range (Huffman, arithmetic, ANS) and short range (LZ, BWT), but not that much research has gone into "long range" yet. There's deduplication, and large-window LZ/BWT... but not much more yet. What is missing is efficiently (and with little memory) finding similarities in multi-GB data sets. I think sorting by similarity would help there. Or did I miss research in this area?
ggm 1 days ago [-]
Small request: write a header or tail block which records the compression efficiency. Bzip2 doesn't. Gzip does. Knowing the uncompressed size can be vital. Yes, there is a risk of lying and making zip bombs.
Zamiel_Snawley 1 days ago [-]
Shouldn’t knowing how big it’s supposed to be make it easier to stop a zip bomb? Just stop decompressing once you hit the size from the header.
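That cap is straightforward to enforce when decompressing incrementally; a sketch with Python's bz2 module, where expected_size has to come from your own metadata since the bzip2 format itself doesn't record it (which is the parent comment's complaint):

    import bz2

    def bounded_decompress(compressed: bytes, expected_size: int) -> bytes:
        """Refuse to inflate past expected_size, so a lying header can't become a zip bomb."""
        d = bz2.BZ2Decompressor()
        out = d.decompress(compressed, max_length=expected_size + 1)
        if len(out) > expected_size:
            raise ValueError("output exceeds declared size; refusing to continue")
        return out

    data = bz2.compress(b"hello world" * 1000)
    print(len(bounded_decompress(data, expected_size=11_000)))  # 11000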
jeroenhd 1 days ago [-]
That only works if the standard actually describes what you're supposed to do with extra data at the end, and everyone agrees.
In practice, there have been antivirus bypasses that made use of AV scanners treating the additional data differently from common extraction software (I believe it was winrar?).
One could argue that a text document with a corrupt file size byte should still be decodeable. One could also argue that the file is completely corrupt and should be treated as such. Knowing that there will be tools that will take the first approach regardless, I'd stick to just decoding the extra data and marking the file as potentially damaged rather than send the user down a data recovery rabbit hole for a file that'll decompress just fine.
o11c 21 hours ago [-]
IIRC to decode raw zlib or deflate using only command-line tools, you have to prepend the gzip header and stream while ignoring errors at EOF.
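In Python the equivalent is a one-liner, since zlib exposes the window-bits knob directly (negative wbits means raw deflate, wbits | 16 means gzip-wrapped):

    import gzip, zlib

    payload = b"raw deflate example " * 20

    # A zlib stream is a 2-byte header + deflate data + 4-byte Adler-32; strip those for raw deflate.
    raw = zlib.compress(payload)[2:-4]
    assert zlib.decompress(raw, wbits=-zlib.MAX_WBITS) == payload                          # raw deflate
    assert zlib.decompress(gzip.compress(payload), wbits=zlib.MAX_WBITS | 16) == payload   # gzip-wrapped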
How does this compare to existing BWT-based compressors on things like the Large Text Compression Benchmark?
twotwotwo 1 days ago [-]
A random thought that Might Work, Who Knows(tm): first compress long-ish repetitions in the input and store the copies separately from the literals (like zstd). Then run just blocks of literals through the BWT before entropy coding.
The idea is that the BWT can help take advantage of however much context remains after the copies are snipped out, whether that's one byte or a few, and shrinking the input with the LZ-ish step may make it faster. It might be strictly worse than using PPM or basic context modeling like Brotli's; either of those can be "aware" of the preceding bytes even when they come from copies rather than literals.
It's implied in "successor to bzip2" and a lot of the comments, but it's worth highlighting that high-compression algorithms, especially those that are also slow to decompress, are a pretty specialized niche now. Using zstd or brotli at low to medium settings sometimes speeds things up by reducing network or storage transfers more than their CPU use slows you down. (Especially when your storage transfers are network transfers.) Even when compressing isn't a net speedup, you pay fairly little time for the saved space. Even lowish levels of zstd and brotli often eke out decent compression ratios for big inputs since, with modern quantities of RAM, their MBs-long history windows let them take advantage of long-range matches.
biglost 22 hours ago [-]
Every time someone says I'm too smart, I remember this kind of black magic and immediately answer: no, I'm not.
jgalt212 1 days ago [-]
I poke around in this space periodically, but I've never found a compelling reason to move away from gzip.
alkh 1 days ago [-]
I am a huge proponent of zstd after I learned about it on HN.
I've recently had to compress a 12 GB CSV file. zstd took ~3 sec for compression and ~11 sec for decompression and got the file to ~1.1 GB. xz took ~3.5 min for compression(!) and the resulting file was ~740 MB (I didn't measure the decompression time). I just realized that in most cases it's more efficient to use zstd, especially for file transfer.
The majority of OSes that people use either have it installed or can have it trivially downloaded, so I see no point in using gzip nowadays, unless it is mandated by an API.
eqvinox 1 days ago [-]
Did you confuse gzip and xz? You mention numbers from xz and then suddenly talk about gzip? These two are not related…
alkh 1 days ago [-]
Hey, I didn't confuse them, but I guess I should have been more specific. I've addressed 2 main points in my text. 1) In my opinion, for the vast majority of cases you don't need to be worried about compatibility, as you can easily install better alternatives to gzip for platforms a lot of people use. 2) I think zstd can become a primary replacement for gzip due to its high speed of compression and good compression ratio, even when compared to such great algorithms as xz/lzma. Sacrificing some compression ratio for (de)compression speed is worth it, for me at least.
dist-epoch 16 hours ago [-]
zstandard at level 21 with some tweaked parameters (larger dictionaries) will give you xz/lzma compression ratios, while still being faster.
loeg 1 days ago [-]
zstd is faster and provides better compression than gzip at every point on the curve. There is no reason to use gzip these days other than backwards compatibility.
idoubtit 1 days ago [-]
I've reached the same conclusion, and I've been using zstd unless I want extremely fast (de)compression with lz4.
And even when I still have to use gzip (the format), the executables of libdeflate are fully compatible and noticeably faster than gzip/gunzip (the tool).
> There is no reason to use gzip these days other than backwards compatibility
And forward compatibility. The Lindy effect says gzip is likelier to be widely available across platforms and tools in the long term than zstd.
lazide 6 hours ago [-]
Zip beats all on that front.
hinkley 1 days ago [-]
Do zless, zcat and zgrep support zstd everywhere? And I mean everywhere? VMs? Alpine? FreeBSD? OSX? Openwrt?
Nothing is shittier than sshing into a box that doesn’t understand half of your command line tricks. Or the clever shell script you just tested six ways to Sunday. It’s like fighting with your hands tied behind your back.
sophiebits 1 days ago [-]
> other than backwards compatibility
hinkley 1 days ago [-]
Other than that, Mrs Lincoln, how was the play?
adastra22 1 days ago [-]
Yes, it's supported on all those platforms.
hinkley 1 days ago [-]
How far back?
adastra22 1 days ago [-]
How much does that matter? All currently supported releases at least.
Brian_K_White 1 days ago [-]
bash4 was released Feb 2009. 16 years ago.
The very latest version of osx ships with bash3, from 20 years ago.
It's a fair question, and "currently supported" is essentially meaningless.
(Yes we all know about macports and the other more popular but less correct systems, and yes most bash scripts should be written to specifically avoid relying on any special features exactly to avoid being broken so easily. These are both true and both beside the point. Fact is still, this is not only a "currently supported" system by some technicality like the final months of a 5 year lts release or something, this is the up to this very minute version of a massive install base, with a 20 year old version of something as basic and all-touching as the very shell itself.
I know about this so intimately because I actually have scripts that don't work because they were intentional stunts to see just what could be done without forking any external exes or even child shells, but allow using every possible feature and trick in bash itself.)
loeg 1 days ago [-]
The bash 4 thing is due to the GPL 3, not some inherent slowness in updating software. It has nothing to do with zstd, which is permissively licensed.
Brian_K_White 1 days ago [-]
It doesn't matter why. All that matters is that "current" is not a valid word. Old things exist in "current" systems. And current systems may also be old systems.
loeg 1 days ago [-]
Ok. I think engineers are well capable of evaluating what systems they need to support and if zstd is a usable option for them. In many situations, the answer will be "yes."
adastra22 21 hours ago [-]
You can download a supported zstd release through homebrew for every version of macOS that is supported by Apple (as those are the releases homebrew supports), and it compiles from source for older ones.
I don't even know why we're talking about bash here. But for the record, this is a unique circumstance that affects only software which made the transition from GPLv2 -> GPLv3. Changing your license can cause you to fork away from some users. News at 11.
hinkley 1 days ago [-]
I really don’t want to switch to zsh, but these are facts.
The number of devs working on OSX producing non-OSX applications is staggering.
adastra22 1 days ago [-]
What does that have to do with zstd?
Brian_K_White 1 days ago [-]
It directly addresses "How much does that matter? All currently supported releases at least."
adastra22 1 days ago [-]
No it doesn’t? You can get zstd on macOS just fine. We’re talking about zstd, not bash, and supported releases not what ships with the distribution.
Brian_K_White 11 hours ago [-]
You can get bash4 or 5 too, as I already said. I can't help you with your depth of experience. I didn't even say zstd was unfit to depend on, just that it was a fair question.
hinkley 6 hours ago [-]
Once again, I’m having to remind a person that coding is a team sport and personal choices are bullshit.
Every deviation you make from stock is something else you have to put into the onboarding docs and convince all your coworkers to install. They won't. Or they think they will but will miss a step or machine. Or for language-specific tools, they'll use a multi-version tool like asdf or nvm or rvm and have to reinstall again every upgrade.
It’s not up to me. It’s not up to you. It’s up to us. And if you can’t see that, maybe you shouldn’t be working with other developers. Or maybe you haven’t been and you need to.
hinkley 1 days ago [-]
Seriously?
There are places still migrating off of Java 8.
And pet servers still exist in the wild.
jeroenhd 1 days ago [-]
Java 8 will be supported until 2032 if you pick the right distro, and that distro comes with zstd in the repos (if not pre-installed).
And if your system is so old that you don't have access to eight-year-old software, I don't see why the rest of the world needs to accommodate you. Sticking with weird and old software is fine, but that comes with the risk of having to deal with everyone else moving on.
mobutu 1 days ago [-]
Seriously who cares about those who are living in the past? It is on them for not keeping up with the times.
loeg 1 days ago [-]
Yes, it's been integrated in lots of places, mostly as long ago as the late 2010s.
ThatGuyRaion 1 days ago [-]
zstd is a great setup, for sure, but the build system they use is patently awful. Can someone make an autoconf or, hell, even a CMake build system for them please???
loeg 1 days ago [-]
Just install the binary package from whatever distro you use? Why do you need to build it?
But if it matters, FreeBSD has a pretty trivial BSDmake build of it:
You could easily do something similar in GNU make or whatever without the dependencies on the FBSD build system. It's basically just the classic "cc -c a bunch of C files, then link them."
jrockway 1 days ago [-]
I suppose the point of open source software is to edit the source code and make your own version. If you can't build it, you can't do that.
loeg 1 days ago [-]
Zstd has lots of contributors who have all figured out how to edit and build it.
jrockway 1 days ago [-]
Yeah, not sure what the original commenter is complaining about. I did "git clone https://github.com/facebook/zstd", "cd zstd", "make" and have a working binary. Doesn't get easier than that.
ThatGuyRaion 21 hours ago [-]
Not everyone is using a standard toolchain or anything like that, and oftentimes when I find projects with atypical build systems that give me trouble, I tend to just move on to something else.
ThatGuyRaion 21 hours ago [-]
It's basically a giant GCCism. Don't be condescending.
loeg 19 hours ago [-]
I have no idea what you're talking about at this point.
ThatGuyRaion 22 hours ago [-]
> Just install the binary package from whatever distro you use? Why do you need to build it?
Cuz I'm actually a packaging maintainer for a couple different old operating systems. Regardless I thank you for being thankful even though I feel like you were being backhanded and demeaning on a number of levels.
arp242 14 hours ago [-]
> you were being backhanded and demeaning on a number of levels.
What an awful comment to make.
MawKKe 1 days ago [-]
Quick glance of zstd github repo shows they do provide CMake build scripts?
ThatGuyRaion 22 hours ago [-]
Last I touched it they only had glitchy make files and I don't like cmake particularly, I was just like "bespoke make files are fine if they're posix compliant, but these aren't"
t43562 8 hours ago [-]
> bzip2 -9 -k all.tar 981.78s user 9.77s system 95% cpu 8M memory 17:16.64 total
> bzip3 -e -b 256 -j 12 all.tar 2713.81s user 16.28s system 634% cpu 18301M memory 7:10.10 total
The memory usage is one reason: 8M vs 18301M
Borg3 1 days ago [-]
Same here. Storage is not that expensive, so I do NOT care to squeeze every byte out of an archive. Also, I'm more into retro, so portability and memory usage are more important for me :)
palaiologos 1 days ago [-]
Hi, tool author here!
Regarding your first remark: high ratio data compression has its time and place, and I personally understand that to many people it is not very desirable. In a lot of scenarios something as plain and simple as LZ4 generally suffices.
On the other hand, there is an unofficial (= unsupported) port of bzip3 to older (386+) machines that run MS-DOS6.22. I have prepared it for a retrocomputing meeting in Ontario that I attended a while back. Let me know what you think :).
Oh nice. I'll look at it at some point to check things out and to see if I can compile it here and there out of the box :)
jgalt212 10 hours ago [-]
After reading this, I see lots of zstd fans around here, and with good reason. That being said, I think our shop will stick with gzip until zstd arrives in the Python Standard Library.
otterley 1 days ago [-]
(2022)
froh 1 days ago [-]
what do you mean??
usually this signifies the year some article was finalized and published.
and this is a GitHub repo with recent commits so this is not correct here.
lifthrasiir 21 hours ago [-]
But there seem to have been no major updates since the original submission (every release up to 1.5.1 is for minor maintenances), so the year would be indeed useful for the vast majority who do remember that.
otterley 1 days ago [-]
The date signifies that this tool is not new; and this is ostensibly a "news" site.
hulitu 1 days ago [-]
> Bzip3: A better and stronger spiritual successor to BZip2
Please, no. What's next ? BZip4 and BZip5, each one incompatible with each other ?
lifthrasiir 20 hours ago [-]
While the name bzip3 itself would be okay for reasons others stated, I still don't like it because it is so easy to confuse with bzip2 and may indicate a false relationship with bzip2 regardless of intents. A spiritual successor doesn't have to be named confusingly similar, after all.
snvzz 1 days ago [-]
The format gets a new major version precisely because it is incompatible.
sitkack 1 days ago [-]
this is the way.
wakawaka28 1 days ago [-]
If they have to be incompatible then it's better to not conceal that. Generalized file formats require you to implement more stuff to support them, and we can't tell the format by looking at the file name.
chefandy 1 days ago [-]
Yeah I'm sick of this. Did you know you can't even use ext2/3/4 together on the same partition? What a mess.
loeg 1 days ago [-]
The ext4/3 filesystems, notably, can read/write ext2 (and for ext4: ext3) filesystems in a compatible way.
toast0 7 hours ago [-]
IIRC, ext3 filesystems, if properly unmounted, can be mounted with an ext2 driver
chefandy 1 days ago [-]
Oh yeah true
LeFantome 6 hours ago [-]
Perhaps not the best example
lexicality 1 days ago [-]
> DO NOT COMPRESS ANY DATA WITH THIS PROGRAM UNLESS YOU ARE PREPARED TO ACCEPT THE POSSIBILITY, HOWEVER SMALL, THAT THE DATA WILL NOT BE RECOVERABLE.
I know every open source project (and quite a lot of expensive proprietary ones!) come with a "btw this software might wipe your computer, if it does that's your fault lol" clause in their license but I can't imagine trying to convince anyone else that using this for anything remotely serious is a good idea with that line in the readme.
palaiologos 1 days ago [-]
Hi! Tool author here.
Almost every single open source compression tool contains a clause like this. For example, the one in the README that you see has been directly lifted from the bzip2 README. Almost all open source projects come with such a no-warranty scheme. 7-Zip, zstandard, xz-utils, etc; as exemplified by a quote from the license text of the MIT license:
> THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
If you were willing to sign a commercial support contract with me on terms that we negotiated, I would be willing to provide you warranty.
If you were not aware, this is essentially the business model of WinRAR. The reason why tools like 7-Zip are not used by the public sector or companies (at least here) is that they provide no warranty in case of data loss. However, if you actually buy WinRAR, then you can hold them liable for damage to your archives. The "infinite 40 day trial" of WinRAR does not entitle you to compensation for damages and thus corporate entities and public entities have to buy WinRAR licenses. WinRAR has never cared about personal customers.
In general, having to cope with the occasional unreliability of software is something you have to live with - you already get more than you paid for. Not to say that my tool is unreliable - I put a lot of effort into it, but it would put you in a bad light to complain about something that you generously received for free :).
lexicality 17 hours ago [-]
I did acknowledge that.
My point was more that if you went into a store to buy some cereal and you had two options, "Cornflakes" and "Cornflakes 2 - they're better!", but you noticed that while both packets had the standard legal nonsense on them, Cornflakes 2 also had "This cereal almost certainly does not contain broken glass" - personally, I think human nature would make me go with the packet that didn't bring up broken glass in the first place, even if both of them have the exact same chance of containing it.
cwillu 1 days ago [-]
Simple enough to be safe, at the cost of performance: uncompress and compare to the original.
hinkley 1 days ago [-]
You could have bugs that show up on different hardware or compiler versions. So the round trip is table stakes but not a guarantee.
Edit: someone deleted a response that said that if you can read it back then the data is there. I think in a data recovery sense that's definitely true, if it's consistent across inputs. But building something that simulates the undefined behavior or race condition - if it's symmetrical between read and write - could be pretty tricky. And you'd have to decode based on the file version number to know if you need the emulation. So possible but terrible to maintain, and the interim versions from creating and discovering the bug would still be broken.
lexicality 1 days ago [-]
And what do you do if it doesn't match?
vardump 1 days ago [-]
Isn't it obvious? Warn the user, who can now use something else instead.
lexicality 1 days ago [-]
That only works if the "user" is an interactive TTY with a human on the other end of it though. What if I tried using this for compressing automatic backups? Do I need an error handling routine that uses something else?
vardump 1 days ago [-]
A backup system should be reliable and be able to report errors. No matter what they may be.
jeroenhd 1 days ago [-]
Your automatic backups may actually be corrupted by random bit flips. Happens quite a lot with ZFS NAS systems where the admin forgot to set up a scrub job and still uses incremental backups.
Any read or write could fail for a multitude of reasons. The chance of an entire file being lost is rather small, but it's still an edge case that can happen in the real world if the FS flips to read only at just the wrong time. Hell, on platforms like macOS, you can't even assume fsync returning success will actually write floating data to storage!
andix 1 days ago [-]
It reads to me more like: DONT USE EXPERIMENTAL COMPRESSION TOOLS FOR BACKING UP YOUR IMPORTANT DATA. USE IT FOR TRANSFERRING DATA AND CHECK HASHES.
chasil 1 days ago [-]
The author of lzip goes into some degree of excitement on the reliability and recoverability of the lzip format compared to xz.
I personally back up about a terabyte each week, and I use 7-zip because it has built-in encryption, which is required because of the HR data in the backup. Thank heavens for xargs -P.
I could use "openssl enc" combined with any pure compression utility, but I don't want to make decompression impossible if I get hit by a bus.
I have replaced all my previous uses of xz with lzip ever since I read that page (via https://news.ycombinator.com/item?id=32210438), but for some reason lzip never seem to rise to the same level of fame as xz. bzip3 also wasn't benchmarked against lzip.
lifthrasiir 20 hours ago [-]
I think you should just skip both xz and lzip, because that essay is in my opinion technically correct but also only deals with a very much minor concern [1]. If you want the recovery out of archives, you should use dedicated formats like PArchive and not ordinary archives with half-baked recovery attempts.
For my personal backup use, I actually use RAR with recovery records, optionally with encryption if it's sensitive. I was only using xz in places where I couldn't use RAR (e.g. for work), and those places tend to also have lzip available.
hinkley 1 days ago [-]
Honestly I think this may be part of why compression in flight has been a growth area in compression lately. If your transport encoding becomes garbled people notice immediately. If you manage to serve a few billion compressed requests without any incident reports then maybe you can trust it for data at rest.
PaulKeeble 1 days ago [-]
I had bzip2 eat some files some years back and ended up losing them. I don't trust bzip2 for backup files and I wouldn't trust bzip3 either as a result.
supertrope 1 days ago [-]
A shell script to run a checksum on the original file, compress, uncompress, and verify output integrity should be trivial.
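A sketch of that check in Python rather than shell; the -e/-d/-k/-c flags are assumed to mirror bzip2's conventions, so adjust them for whatever compressor you're actually testing:

    import hashlib, subprocess, sys
    from pathlib import Path

    def sha256(path: str) -> str:
        return hashlib.sha256(Path(path).read_bytes()).hexdigest()

    def verify_roundtrip(original: str) -> bool:
        """Compress, decompress to a separate copy, and compare checksums before trusting the archive."""
        subprocess.run(["bzip3", "-e", "-k", original], check=True)   # assumed: keep input, write original.bz3
        restored = original + ".restored"
        with open(restored, "wb") as out:
            subprocess.run(["bzip3", "-d", "-c", original + ".bz3"], check=True, stdout=out)
        return sha256(original) == sha256(restored)

    if __name__ == "__main__":
        print("OK" if verify_roundtrip(sys.argv[1]) else "MISMATCH -- keep the original!")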
Arech 1 days ago [-]
I wonder why this isn't built into the program...
cowsandmilk 1 days ago [-]
I mean, hosting downloads that have been compressed with this seems fine. I have the original, I just want a smaller version for those downloading.
Rendered at 02:40:21 GMT+0000 (Coordinated Universal Time) with Vercel.
It's a fantastic bit of algorithmic magic that will always impress me to see it.
It looks to me that the above still holds: Bzip2 and Bzip3 are simply combining more conventional compression algorithms with the BWT, which itself is still the same old transform. Bzip2 does Huffman coding after BWT, and Bzip3 does arithmetic coding.
I would not really say this is true. BWT is spiritually similar to the various Prediction by Partial Matching (PPM) algorithms, except that instead of needing to decide how much context (i.e., preceding symbols) to use to model the next symbol, and carefully learning the probabilities for each unique context, it naturally sorts symbols with the same context (of _any_ length) so they appear next to each other, and relies on adaptation to update your learned statistics as you move from one context to the next, without ever explicitly tracking context boundaries [0].
[0] N.J. Larsson, "The Context Trees of Block Sorting Compression," 1998. https://ieeexplore.ieee.org/abstract/document/672147
BWT sorts rotated data and what is achieved is that same suffixes group together:
The preceding (to suffix) character goes to end and then gets outputted. This is much like PPM going backward. There is a PPM* algorithm (unbounded context length) where authors considered reconstruction of contexts from data, utilizing something like LZSS seach. Same idea is in BWT - context is reconstructed from data.BWT also breaks near dependencies in data, this is why move-to-front with Huffman or arithmetic encoding works well there.
[0] http://dact.rkeene.org/
BWT can be combined with anything which does RLE and get a benefit.
What does it does is give RLE more to work with.
As people mentioned elsewhere in the thread, Burrows-Wheeler isn’t really composable with other systems. Nobody has figured out a reasonable way to combine LZ77 and Burrows-Wheeler. That’s why zstd + bzip2 does not work. But you do need an entropy coding system to go with Burrows-Wheeler… and FSE fits the bill.
Sorting all the shifts of a string puts all the characters in order, then looking at the last column shows you all the _preceding_ characters. If there's any predictability there (which there often is), it's now easier to compress. It's sorta like an entropy coder in that way.
I've never thought of it as being that deep, and understood them since I was a kid -- building an intuition for "why" the FFT works is much harder -- but that being said, I clicked quickly to reply thinking "that's easy! I can explain this!" then struggled for a while trying to get the picture in my mind into text. :)
What I don't get isn't the benefits of BWT on its own. It's why BWT should add any additional benefit if you're already doing Huffman.
Ahhhh. Now we're on the same page. :) Seeing how it helps when combined is somewhat subtle/non-obvious. I believe it relates to BWT and Huffman both being approximations of something more optimal. The two transforms could also have different window sizes -- one rarely does BWT on a whole 1GB file -- which introduce inefficiencies. Huffman coding is also only optimal in the very large alphabet and very long data limits. As your data length and alphabet size decrease, it gets less optimal.
Put differently, "I think that's a wonderfully phrased question, this _is_ my specialization/subfield, and I'm gonna need to chew on it for a while."
BWT is orthogonal to Huffman; like LZ, it exploits that some symbol _sequences_ are more common than others, while Huffman is about the optimality of coding each symbol on its own.
Huffman takes a probability distribution and a symbol and encodes the symbol according to the distribution. If you encode all symbols independently according to the same distribution, you probably don't get very good compression.
You get a bit better results with a model that has a separate distribution for each context. If the previous symbols were X, Y, and Z, you encode the next symbol according to the distribution for context XYZ. This approach doesn't really scale, because the size of the model grows rapidly (exponentially in a naive implementation) with context length. You get better compression with an adaptive model. You start with a uniform distribution and update the available contexts and distributions after encoding each symbol. On the one hand, you don't have to store the model explicitly. But on the other hand, updating the model is very slow.
Burrows-Wheeler transform is an implicit model. It sorts the symbols according to the context that follows them, and it does that simultaneously for each context length. Because you don't have to store the model explicitly, you can effectively use longer context lengths than with a fixed explicit model. And because you don't have to update an explicit model after encoding each symbol, using the BWT is much faster than using an adaptive model.
Huffman coding is a static minimum-redundancy code. What this means is that it finds an optimal assignment of bit sequences to letters in the input alphabet (commonly US-ASCII or extensions). This however means that Huffman coding can not exploit redundancies that stem from the concrete sequence of characters. For example, you could easily predict that an `e` comes after `Th`, but Huffman coding can not know that.
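To make the per-letter point concrete, here is a rough sketch of static Huffman code construction (not bzip2's exact implementation, just the textbook greedy merge): the code is built from symbol counts alone, so the fact that `e` tends to follow `Th` never enters the picture.

    import heapq
    from collections import Counter

    def huffman_code(text: str) -> dict:
        # Greedily merge the two least frequent subtrees; the running
        # counter `nxt` breaks ties so the heap tuples stay comparable.
        heap = [(f, i, {sym: ""}) for i, (sym, f) in enumerate(Counter(text).items())]
        heapq.heapify(heap)
        nxt = len(heap)
        while len(heap) > 1:
            f1, _, c1 = heapq.heappop(heap)
            f2, _, c2 = heapq.heappop(heap)
            merged = {s: "0" + c for s, c in c1.items()}
            merged.update({s: "1" + c for s, c in c2.items()})
            heapq.heappush(heap, (f1 + f2, nxt, merged))
            nxt += 1
        return heap[0][2]

    print(huffman_code("The theme"))  # built from symbol counts alone; order is irrelevant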
Hence, after applying the Burrows-Wheeler transform, you need some sort of higher-order transform (i.e. a transform that considers more than just individual bytes) that can reap the benefits of the changed distribution of the algorithm's output. But we will get to that in a second.
The joke here is that the Burrows-Wheeler transform is closely related to suffix trees and suffix arrays, which are often used in bioinformatics and HPC for full-text search. If you wanted to find a pattern of length `p` in a text of length `n`, if you already have a suffix tree of the original text, the search is linear in the length /of the pattern/ - i.e. O(p). The suffix tree stores all suffixes of a string in a compressed manner (i.e. it has a linear space overhead, approximately O(20n) as given by Gusfield, D. (1997). Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. Cambridge University Press), so you can search for a word in it by simply traversing from the root node to an internal or leaf node by following a sequence of bytes that comprise the word.
As such, a suffix tree (and equivalently suffix array and the BWT, which is trivially computed from a suffix array) form something which can be thought of as a static PPM model. Notably real world implementations of PPM use suffix trees as a part of their main storage data structure (e.g. PPMd). What this all means is that given a suffix tree, we can very cheaply give the probability distribution for the next byte that follows a given fixed-order sequence of bytes. This is nice, because then e.g. an order-2 predictor would be able to tell that `Th` is followed by `e` once enough data has been gathered.
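A sketch of that "trivially computed" step, assuming the suffix array is already available (here it is built naively by sorting suffixes; a real implementation would use SA-IS or another linear-time construction):

    def bwt_from_suffix_array(s: str) -> str:
        # The BWT is the character immediately preceding each suffix,
        # taken in suffix-array (sorted-suffix) order.
        s = s + "\0"  # sentinel terminator
        sa = sorted(range(len(s)), key=lambda i: s[i:])
        return "".join(s[i - 1] for i in sa)  # s[-1] is the sentinel itself

    print(repr(bwt_from_suffix_array("banana")))  # 'annb\x00aa'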
As you can probably guess, the more preceding bytes you know, the better will be your estimate for what is the most likely next byte. But the larger your context, the more expensive the searches and computations become due to pointer chasing in the suffix tree.
So how do we remedy this? We notice that the Burrows-Wheeler transform essentially clusters similar contexts together, meaning that a low order predictor (= faster, simpler) on BWT compresses as well as a high order predictor (= slow, complicated) on the original data, at the cost of an extra transformation. This is viable, because the Burrows-Wheeler transform can be quickly computed and there have been recent advancements in running it on the GPU. So what this means is that bzip3 uses BWT + a low order predictor with an arithmetic coder to encode the bytes, meaning that it can make use of high order statistics for compression and performs comparably at a faster speed.
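For contrast with the bzip2-style pipeline mentioned elsewhere in the thread, here is a sketch of move-to-front, the classic low-order step applied after the BWT (bzip3 itself replaces this with a context model feeding an arithmetic coder). Runs of identical symbols in the BWT output become runs of zeros, which any entropy coder handles cheaply:

    def move_to_front(data: bytes) -> list:
        # Each byte is replaced by its current position in a recency list,
        # then moved to the front of that list.
        alphabet = list(range(256))
        out = []
        for b in data:
            i = alphabet.index(b)
            out.append(i)
            alphabet.insert(0, alphabet.pop(i))
        return out

    print(move_to_front(b"aaaabbbba"))  # [97, 0, 0, 0, 98, 0, 0, 0, 1]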
Btw, the Burrows-Wheeler transform is often explained as taking the last column.
I find it easier to understand why it works if you think of BWT as sorting all rotations of your string from their _second_ character onwards, and then writing down all the first characters.
Huffman coding only works on individual letters. It doesn't know anything about relationships between adjacent letters.
Describing a multiset takes, on average, as much information as the data it came from contained to begin with. BWT somehow works better on the kinds of data we actually use.
https://en.m.wikipedia.org/wiki/Wheeler_Jump
In terms of why it aids compression, or why it's reversible?
As far as the first goes: it transforms n-grams into repeated characters.
I remember seeing someone (probably Avi Wigderson) demonstrating a zero-knowledge proof for graph coloring on an overhead projector when I was starting my second year studying CS. He had a slide with a graph, multiple slides with different permutations of the same valid coloring of the vertices, and a piece of paper with "doors" over the vertices to cover everything. The audience could ask him to open the doors over the vertices connected by any edge, and he would show that the coloring is valid, as far as those two vertices are concerned. And then he would replace the coloring with another permutation for the next iteration. The idea felt clever but kind of obvious in retrospect.
Public key cryptography is kind of magic, but the basic piece that makes everything work isn't too hard. It's been a while since I took discrete math, so this is a bit rusty, but it should get you in the right direction. It's modular multiplication in a field. If you take a small field and make a times table, you can see the magic. All of the factors that are relatively prime to the modulus will have a pattern where F times X is only equal to F times Y if X and Y are equal. There will also be a multiplicative inverse so that F times X times I equals X for all X; because of commutation, I times X times F also equals X. When you've determined an F and an I, you give your correspondent F and keep I --- then if you multiply your message by F, they multiply it by I and get back the message. If they want to send you a message, they multiply it by I and send it, then you multiply it by F.
There's some other stuff, but that's the magic part, IMHO. You just have to use a bigger field and some other stuff I forgot. ;)
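A toy run of the round trip described above (purely illustrative: with a small prime modulus anyone who knows F can compute I, which is exactly where the "other stuff" - large composite moduli, exponentiation - comes in for real public-key schemes):

    p = 101            # small prime modulus, illustration only
    F = 7              # the factor handed to your correspondent
    I = pow(F, -1, p)  # multiplicative inverse of F mod p (Python 3.8+)

    msg = 42
    assert (msg * F % p) * I % p == msg  # encode with F, decode with I
    assert (msg * I % p) * F % p == msg  # the reverse direction works too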
Graph colouring is usually used as a teaching example for ZKP, because of how easy it is to understand. It's still amazing how you can go from that example to "Here is a file that anyone can verify (non-interactively) which shows I have a passport and that passport is not from a country sanctioned by the USA but otherwise does not reveal anything about me" or "I will not reveal my IP or any other identifying/unique info but I will prove I have not been previously blocked from this website, including during the other times I anonymously accessed this website"
Interactive zero-knowledge proofs are also technically non-interactive. They are defined in terms of the verifier evaluating a transcript of the protocol. If the verifier accepts some causal assumptions about the provenance of the transcript, they will accept the proof. If they disagree with the assumptions, the proof is indistinguishable from random noise they could generate themself. An interactive commitment – challenge – response protocol is one possible causal assumption. A source of randomness could replace the challenges, making the protocol non-interactive. Or there could be a pre-committed secret, making a single-round protocol effectively non-interactive.
These are things a sufficiently interested CS undergraduate can prove and understand. Public-key cryptography, on the other hand, remains magical. There are many things people assume to be true, which need to be true for public-key cryptography to function. Empirically these things seem to be true, but nobody has been able to prove them. And I don't think anyone really understands them.
(edited to add 12 thread runs of bzip3 and remove superfluous filenames)
Thank you for your benchmark!
As you may be aware, different compression tools fill different data-type niches. In particular, less specialised statistical methods (bzip2, bzip3, PPMd) generally perform poorly on vaguely defined binary data due to the unnatural distribution of the underlying data, which at least in bzip3's case does not lend itself well to suffix sorting.
Conversely, Lempel-Ziv methods usually perform suboptimally on vaguely defined "textual data" due to the fact that the future stages of compression that involve entropy coding can not make good use of the information encoded by match offsets while maintaining fast decompression performance - it's a long story that I could definitely go into detail about if you'd like, but I want to keep this reply short.
All things considered, data compression is more of an art than a science, trying to fit into an acceptable spot on the time to compression ratio curve. I created bzip3 as an improvement to the original algorithm, hoping that we can replace some uses of bzip2 with a more modern and worthwhile technology as of 2022. I have included benchmarks against LZMA, zstandard, etc. mostly as a formality; in reality, if you were to choose a compression method it would depend heavily on what exactly you're trying to compress, but my personal stance is that bzip3 would likely be strictly better than bzip2 in all of those cases.
bzip3 usually operates on bigger block sizes, up to 16 times bigger than bzip2. Additionally, bzip3 supports parallel compression/decompression out of the box. For fairness, the benchmarks have been performed using single-thread mode, but they aren't quite as fair towards bzip3 itself, as it uses a way bigger block size. What bzip3 aims to be is a replacement for bzip2 on modern hardware. What used to not be viable decades ago (arithmetic coding, context mixing, SAIS algorithms for BWT construction) has become viable nowadays, as CPU frequencies don't tend to change while cache and RAM keep getting bigger and faster.
I have no idea about compression, just a naive thought.
I can't provide compression/decompression times, but the --long or --long=31 arguments should not have a major impact on speed; they mostly affect memory use. --long=31 requires setting the same in the decompressor, making that option mostly useful for internal use, not for archives meant for public consumption.
As you can see, the benchmark chosen by the author mostly comes down to finding similar data that's far away. I wonder if bzip3 can do this better than other algorithms (especially in less memory) or simply chose default parameters that use more memory.
Edit: added more benchmarks
But here's a practical trick: sort files by extension and then by name before putting them into an archive, and then use any conventional compression. It will very likely put the similar-looking files together, and save you space. Done that in practice, works like a charm.
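A sketch of that trick using Python's tarfile (the paths here are hypothetical); any conventional compressor can then be pointed at the resulting tar:

    import os, tarfile

    def make_sorted_tar(root: str, out_path: str) -> None:
        # Collect all files, order them by (extension, basename) so
        # similar-looking files end up adjacent, then tar in that order.
        files = []
        for dirpath, _, names in os.walk(root):
            files.extend(os.path.join(dirpath, n) for n in names)
        files.sort(key=lambda p: (os.path.splitext(p)[1], os.path.basename(p)))
        with tarfile.open(out_path, "w") as tar:
            for path in files:
                tar.add(path, arcname=os.path.relpath(path, root))

    make_sorted_tar("project/", "project-sorted.tar")  # hypothetical paths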
Using the results from the readme, seems like bzip3 performs competitively with zstd on both counts.
For instance, with an 800 MB SQL file, for the same compression time and optimal parameters (within my capacity), bzip3 produced a smaller file (5.7 % compression ratio) than zstd (6.1 % with `--long -15`). But the decompression was about 20× slower (with all cores or just one).
I'm not claiming my stupid benchmark is better or even right. It's just that my results were very different from bzip3's readme. So I'm suspicious.
I’d be astonished if such a 4x result generalized to tarballs that aren’t mostly duplicated files.
The "Corpus benchmarks" further down in the README are IMHO much more practically relevant. The compression ratio of bzip3 is not significantly better, but the runtime seems quite a bit lower than lzma at least.
What makes Perl source benchmark special? Deduplication?
I'm curious how well gzip would have handled this though, as it's generally low-memory too, and all the others in that list use FAR more than 8M of memory.
Here's the program as talked about on HN:
https://news.ycombinator.com/item?id=38850961
https://github.com/kspalaiologos/malbolge-lisp
The cursed language: https://en.wikipedia.org/wiki/Malbolge
And the variant used by the program: https://esolangs.org/wiki/Malbolge_Unshackled
And then I remember discovering, several years later, that bzip (the first one) is an obsolete format that is now difficult to even decompress.
I learned my lesson and now use horribly sub-optimal formats that I'm sure will stick around for a long time, if not forever.
In my view, improving "long range" compression has the biggest potential. There are many, many algorithms and implementations for very short range (Huffman, arithmetic, ANS) and short range (LZ, BWT), but not that much research has gone into "long range" yet. There's deduplication, and large-window LZ / BWT.. but not much more yet. What is missing is efficiently (and with little memory) finding similarities in multi-GB data sets. I think sorting by similarity would help there. Or did I miss research in this area?
In practice, there have been antivirus bypasses that made use of AV scanners treating the additional data differently from common extraction software (I believe it was winrar?).
One could argue that a text document with a corrupt file size byte should still be decodeable. One could also argue that the file is completely corrupt and should be treated as such. Knowing that there will be tools that will take the first approach regardless, I'd stick to just decoding the extra data and marking the file as potentially damaged rather than send the user down a data recovery rabbit hole for a file that'll decompress just fine.
1: https://github.com/IlyaGrebnov/libbsc
The idea is that the BWT can help take advantage of however much context remains after the copies are snipped out, whether that's one byte or a few, and shrinking the input with the LZ-ish step may make it faster. It might be strictly worse than using PPM or basic context modeling like Brotli's; either of those can be "aware" of the preceding bytes even when they come from copies rather than literals.
It's implied in "successor to bzip2" and a lot of the comments, but it's worth highlighting that high-compression algorithms, especially those that are also slow to decompress, are a pretty specialized niche now. Using zstd or brotli at low to medium settings sometimes speeds things up by reducing network or storage transfers more than their CPU use slows you down. (Especially when your storage transfers are network transfers.) Even when compressing isn't a net speedup, you pay fairly little time for the saved space. Even lowish levels of zstd and brotli often eke out decent compression ratios for big inputs since, with modern quantities of RAM, their MBs-long history windows let them take advantage of long-range matches.
The majority of OSes that people use either have it installed already or can get it trivially, so I see no point in using gzip nowadays, unless it is mandated by an API
And even when I still have to use gzip (the format), the executables of libdeflate are fully compatible and noticeably faster than gzip/gunzip (the tool).
And forward compatibility. The Lindy effect says gzip is likelier to be widely available across platforms and tools in the long term than zstd.
Nothing is shittier than sshing into a box that doesn’t understand half of your command line tricks. Or the clever shell script you just tested six ways to Sunday. It’s like fighting with your hands tied behind your back.
It's a fair question, and "currently supported" is essentially meaningless.
(Yes, we all know about macports and the other more popular but less correct systems, and yes, most bash scripts should be written to specifically avoid relying on any special features exactly to avoid being broken so easily. These are both true and both beside the point. The fact remains: this is not merely a "currently supported" system by some technicality like the final months of a 5-year LTS release; it is the up-to-this-minute version of a massive install base, shipping a 20-year-old version of something as basic and all-touching as the shell itself.
I know about this so intimately because I actually have scripts that don't work because they were intentional stunts to see just what could be done without forking any external exes or even child shells, but allow using every possible feature and trick in bash itself.)
I don't even know why we're talking about bash here. But for the record, this is a unique circumstance that affects only software which made the transition from GPLv2 -> GPLv3. Changing your license can cause you to fork away from some users. News at 11.
The number of devs working on OSX producing non-OSX applications is staggering.
Every deviation you make from stock is something else you have to put into the onboarding docs and convince all your coworkers to install. They won't. Or they think they will but will miss a step or a machine. Or for language-specific tools, they'll use a multiversion tool like asdf or nvm or rvm and have to reinstall again every upgrade.
It’s not up to me. It’s not up to you. It’s up to us. And if you can’t see that, maybe you shouldn’t be working with other developers. Or maybe you haven’t been and you need to.
There are places still migrating off of Java 8.
And pet servers still exist in the wild.
And if your system is so old that you don't have access to eight-year-old software, I don't see why the rest of the world needs to accommodate you. Sticking with weird and old software is fine, but that comes with the risk of having to deal with everyone else moving on.
But if it matters, FreeBSD has a pretty trivial BSDmake build of it:
https://github.com/freebsd/freebsd-src/blob/main/lib/libzstd...
https://github.com/freebsd/freebsd-src/blob/main/usr.bin/zst...
You could easily do something similar in GNU make or whatever without the dependencies on the FBSD build system. It's basically just the classic "cc -c a bunch of C files, then link them."
Cuz I'm actually a packaging maintainer for a couple different old operating systems. Regardless I thank you for being thankful even though I feel like you were being backhanded and demeaning on a number of levels.
What an awful comment to make.
> bzip3 -e -b 256 -j 12 all.tar 2713.81s user 16.28s system 634% cpu 18301M memory 7:10.10 total
The memory usage is one reason: 8M vs 18301M
Regarding your first remark: high ratio data compression has its time and place, and I personally understand that to many people it is not very desirable. In a lot of scenarios something as plain and simple as LZ4 generally suffices.
On the other hand, there is an unofficial (= unsupported) port of bzip3 to older (386+) machines that run MS-DOS 6.22. I have prepared it for a retrocomputing meeting in Ontario that I attended a while back. Let me know what you think :).
https://github.com/kspalaiologos/dev-urandom/blob/main/dos/B...
Usually this signifies the year some article was finalized and published.
And this is a GitHub repo with recent commits, so this is not correct here.
Please, no. What's next ? BZip4 and BZip5, each one incompatible with each other ?
I know every open source project (and quite a lot of expensive proprietary ones!) comes with a "btw this software might wipe your computer, and if it does that's your fault lol" clause in its license, but I can't imagine trying to convince anyone else that using this for anything remotely serious is a good idea with that line in the readme.
Almost every single open source compression tool contains a clause like this. For example, the one in the README that you see has been directly lifted from the bzip2 README. Almost all open source projects come with such a no-warranty scheme. 7-Zip, zstandard, xz-utils, etc; as exemplified by a quote from the license text of the MIT license:
> THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
If you were willing to sign a commercial support contract with me on terms that we negotiated, I would be willing to provide you warranty.
If you were not aware, this is essentially the business model of WinRAR. The reason why tools like 7-Zip are not used by the public sector or companies (at least here) is that they provide no warranty in case of data loss. However, if you actually buy WinRAR, then you can hold them liable for damage to your archives. The "infinite 40 day trial" of WinRAR does not entitle you to compensation for damages and thus corporate entities and public entities have to buy WinRAR licenses. WinRAR has never cared about personal customers.
In general, having to cope with the middling reliability of free software is something you have to live with - you already get more than you paid for. Not to say that my tool is unreliable - I put a lot of effort into it, but it would put you in a bad light to complain about something that you generously received for free :).
My point was more that if you went into a store to buy some cereal and had two options, "Cornflakes" and "Cornflakes 2 - they're better!", and you noticed that while both packets had the standard legal nonsense on them, Cornflakes 2 also had "This cereal almost certainly does not contain broken glass", personally I think human nature would make me go with the packet that didn't bring up broken glass in the first place - even if both of them have the exact same chance of containing it
Edit: someone deleted a response that said that if you can read it back then the data is there. I think in a data recovery sense that's definitely true, if it's consistent across inputs. But building something that simulates the undefined behavior or race condition - if it's even symmetrical between read and write - could be pretty tricky. And you'd have to decode based on the file version number to know whether you need the emulation. So possible, but terrible to maintain, and the interim versions from creating and discovering the bug would still be broken.
Any read or write could fail for a multitude of reasons. The chance of an entire file being lost is rather small, but it's still an edge case that can happen in the real world if the FS flips to read only at just the wrong time. Hell, on platforms like macOS, you can't even assume fsync returning success will actually write floating data to storage!
https://www.nongnu.org/lzip/xz_inadequate.html
I personally back up about a terabyte each week, and I use 7-zip because it has built-in encryption, which is required because of the HR data in the backup. Thank heavens for xargs -P.
I could use "openssl enc" combined with any pure compression utility, but I don't want to make decompression impossible if I get hit by a bus.
I have replaced all my previous uses of xz with lzip ever since I read that page (via https://news.ycombinator.com/item?id=32210438), but for some reason lzip never seems to rise to the same level of fame as xz. bzip3 also wasn't benchmarked against lzip.
[1] Previously: https://news.ycombinator.com/item?id=39873122