Friday, November 18, 2011

A format for compressed streams

 I've been pretty careful at designing compression format & algorithm, but much less detailed when it comes to compressed stream format. A good example of this situation is given in this post, where LZ4 compressed format is precisely defined, but not a single word is mentioned regarding the stream format generated by LZ4.exe.

But there is a difference. The stream format has more responsibilities than "only" compression.
For a simple example, a stream typically starts with a "Magic Number", which hints the decoder that the stream being decoded is most probably a valid one : if the magic number does not match, then what follows will be nonsense to the decoder. It's one of many ways to avoid some basic error, such as feeding the decoder with random data.

Format definitions are very important, since once settled, they are keys to create compatible applications. Therefore, once published, a format should either not change, or guarantee that any future change will not break existing applications. This is called forward and backward compatibility, and is not trivial.
Since a great deal of attention must be pushed into this definition, instead of rushing one, i want to underline in this post what are the objectives of the format, and how they are achieved, in an attempt to both clarify my mind and get comments.

First definition : what is a stream ?
At its most basic level, a file satisfy the definition of a stream. These are ordered bytes, with a beginning and an end.
File is in fact a special stream with added properties. A file size can to be known beforehand. And in most circumstances, it is likely to be stored into a seekable media, such as an HDD.
But a stream is not limited to a file. It can also be a bunch of files grouped together (tar), or some dynamically generated data. It may be read from a streaming media, such as pipe, with no seek capability (no way to "jump" forward or backward) and no prior knowledge of its size : we'll learn that the stream is finished on hitting its end.

OK, so now we are getting into the "hard" part : what are the missions of the Compressed Stream Format ?
Here is a list, from the top of my mind, sorted in priority order :

1) Be compatible with both streaming and seekable media, in both directions (compression and decompression)
2) Detect valid streams from invalid ones
3) Designed for backward & forward compatibility
4) Control data validity (e.g. checksum)
5) Optionally slice the stream into multiple independent "chunks", for multi-threading purpose
6) Offer user control over chunk size
7) Allow to quick-jump to a specific location into the stream (seekable media only)
8) Provide a way to correct transmission errors, or at least retrieve as much data as possible from a transmission error
9) Enforce alignment rules

Options 1 to 5 seem really compulsory, while beyond that point, they may be questionable.

There are other missions which i'm still unsure if they should join the list.
For example, should the stream format embed a definition of the compression algorithm ?
That's not what i'm doing currently : the "magic number" is also associated to a specific family of compatible decoders, and therefore already performs that role.

Should it allow some user comments ?
Here, i'm expecting this feature to rather be part of an Archive format.

What about file names ? Should there be a place for them into the format ?
In my opinion, this is exactly the role of an Archive format, listing, sorting, placing files names and directory structure. Furthermore, embedding a name into a compressed stream format disregards the fact that some streams are not single files.