Sunday, April 10, 2011

A look into zip

 If there is one compression format which is well known and popular, it is Zip. No other format has achieved such ubiquity.

Zip was invented by Phil Katz, creator of PKWARE, in 1989. Yes, that was 22 years ago. The format was made freely available to anyone, an unusual move back in those days. This openness was key to bringing Zip to popularity. It gave other companies, such as WinZip, the opportunity to create their own version of Zip and distribute it without the cost, hassle and risk of licensing. As a consequence, Zip quickly became the "default" compression format for PC systems, and beyond.

Later, an open-source version of the algorithm was created by Jean-loup Gailly, named gzip for the program, and zlib for the library. Zlib is probably one of the most widely deployed libraries today, being used in a mind-blowing variety of devices and systems.

All these programs have something in common: they share the same compression format, called deflate.

Deflate, the core of Zip files, is now well defined, being described in IETF RFC 1951. Standardization by such a body lends great credibility to the format, which is considered reliable enough for anyone to build a program on top of the specification.

And indeed, the definition is clear enough. Well, at least to anyone with a bit of knowledge of entropy coding, more specifically Huffman coding.

The driving idea behind this analysis was to assess whether the Zhuff algorithm could be re-used in a way that would produce zip-compatible data streams, at the speed of Zhuff. The quick answer is "not easily", since the differences are non-negligible.

The very first property of Zip streams is that they entangle all information (literals, offsets, lengths) into a single stream. This stream, however, uses 2 trees: one for offsets, and another one for literals, match lengths, and the "end of block" marker.
Consequently, the more complex tree, handling 3 types of values, has more than 256 elements. For each decoded symbol, the decoder must check whether it is a literal, a match length (triggering an offset) or an end-of-block marker (triggering a new tree description, or the end of file).
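To make that branching concrete, here is a minimal sketch of the dispatch logic. The tokens are assumed to be already Huffman-decoded (a real decoder would read them bit by bit from the two trees, and derive length and offset from the symbol plus extra bits); only the literal / match / end-of-block decision is shown, and the `Token` struct and `inflate_tokens` helper are inventions for this illustration.

```c
#include <stddef.h>

/* Pre-decoded tokens stand in for Huffman-decoded symbols: in real
   deflate, sym would come from the literal/length tree and offset
   from the distance tree. This sketch only shows the dispatch. */
typedef struct { int sym; int length; int offset; } Token;

static size_t inflate_tokens(const Token *t, size_t n, unsigned char *out)
{
    size_t pos = 0;
    for (size_t i = 0; i < n; i++) {
        if (t[i].sym < 256) {
            out[pos++] = (unsigned char)t[i].sym;   /* literal byte */
        } else if (t[i].sym == 256) {
            break;                                  /* end-of-block marker */
        } else {                                    /* match: copy from history */
            for (int j = 0; j < t[i].length; j++, pos++)
                out[pos] = out[pos - t[i].offset];
        }
    }
    return pos;
}
```

The match branch copies byte by byte on purpose: deflate allows the length to exceed the offset, so the copy may overlap its own output.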

This is in stark contrast with Zhuff, which relies on 5 separate data streams, resulting in very tight loops. Each stream is decoded very fast, thanks to Huff0. More streams, but faster ones.

The complex part of the format is the header tree description, which is itself Huffman-compressed after being transformed using special codes for repeated lengths and runs of zeros. It looks pretty hard at first, but since the description is complete, it can nonetheless be implemented without ambiguity.
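For reference, those special codes are 16, 17 and 18 in RFC 1951: code 16 repeats the previous length 3 to 6 times, code 17 emits 3 to 10 zeros, and code 18 emits 11 to 138 zeros, with the repeat count carried by extra bits. The sketch below expands such a sequence; to keep it self-contained, the extra-bit values are passed in already decoded (`expand_code_lengths` is a made-up helper, not zlib API).

```c
#include <stddef.h>

/* Expands the run-length codes of RFC 1951, section 3.2.7:
   0..15 are literal code lengths, 16 repeats the previous length,
   17 and 18 emit runs of zeros. extra[i] holds the already-decoded
   extra-bit value attached to codes[i] (0 for codes 0..15). */
static size_t expand_code_lengths(const int *codes, const int *extra,
                                  size_t n, int *lens)
{
    size_t pos = 0;
    for (size_t i = 0; i < n; i++) {
        int c = codes[i];
        if (c <= 15) {
            lens[pos++] = c;                 /* literal code length */
        } else if (c == 16) {
            int rep = 3 + extra[i];          /* 2 extra bits: 3..6 copies */
            int prev = lens[pos - 1];
            while (rep--) lens[pos++] = prev;
        } else if (c == 17) {
            int rep = 3 + extra[i];          /* 3 extra bits: 3..10 zeros */
            while (rep--) lens[pos++] = 0;
        } else {                             /* c == 18 */
            int rep = 11 + extra[i];         /* 7 extra bits: 11..138 zeros */
            while (rep--) lens[pos++] = 0;
        }
    }
    return pos;
}
```

The resulting array of code lengths is all a decoder needs to rebuild the canonical Huffman trees, which is why the header can stay so compact.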

Now, since the changes to Zhuff (and, by consequence, to Huff0) would be quite substantial, I wonder whether this exercise is worth attempting. After all, who needs a Zip at Zhuff's speed?