Tuesday, May 15, 2012

Dealing with blocks side-effects

Continuing the previous post's analysis of lz4demo's current framing format, another side-effect created by this simple format is "latency".

Since the fixed block size is 8MB, the codec must wait for the first block to be completely filled before starting any compression or decompression operation. It effectively defers processing until the buffer is full, by an amount of time that depends on how fast the input arrives.

This issue may not seem large, especially when the underlying I/O is fast. Nonetheless, not all I/O is fast, and even when it is, an 8MB "starting time" is bound to be measurably worse than, for example, a 256KB one.
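To put rough numbers on it (assuming, purely for illustration, a 100MB/s input stream): filling an 8MB block takes about 80 ms before the first byte can be processed, whereas a 256KB block is ready after roughly 2.5 ms.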

As a consequence, a framing format with a smaller block size would offer better and smoother processing throughput.

Which leads us to a last and important issue : independent blocks.
While this strategy is good for simplicity and multi-threading, it's bad for compression : it translates into a worsened compression ratio on the first 64KB of each block.

With block sizes of 8MB, this effect is negligible (affecting compression ratio by less than 0.1%). However, the smaller the block size, the worse the side-effect. With small block sizes, this effect can no longer be neglected.
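For example, with 64KB blocks, the penalty would apply to the entire stream, since the copy window is itself 64KB and every block would start compression with an empty history.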

Therefore, should the blocks remain independent?

Not necessarily. Making each block depend on the previous one nullifies the problem of worsened compression ratio. But it also makes it impossible to decode a compressed block independently, with negative consequences on multi-threading and partial decoding capabilities.
Is that really an issue? Well, it really depends on the use case.

In many circumstances, such as simple file compression or stream compression, it does not matter, since data is encoded and decoded sequentially anyway. Throwing away the capability to multi-thread decompression seems bad, but in fact, most of the time, I/O is unable to cope with LZ4 decoding speed (around 1GB/s). So a single decoding thread is enough to handle almost any I/O load.
Since there is little need for either partial decoding or multi-threaded decoding in such scenarios, the compression ratio gain looks more useful.

There is just one little remaining problem:
While the decoding function will need only minor adaptation to handle this new use case (most of the complexity being located in the buffer manager), the compression function, on the other hand, has to be adapted.

While each block was independent, compression could start from a pristine, clean reference table.
But with sequentially dependent blocks, the initialization becomes more complex: the previous 64KB needs to be copied in front of the next block, and then loaded/hashed into the reference table, before compression can start. This obviously costs CPU and time.
A variant is to just "translate" the references already loaded into the table as a result of compressing the previous block, but it is limited to single-thread scenarios.
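
To make the first approach more concrete, here is a minimal sketch in C of chained-block compression. It relies on the streaming interface from more recent lz4.h versions (LZ4_createStream(), LZ4_compress_fast_continue(), LZ4_saveDict()); read_block() and write_block() are hypothetical I/O helpers supplied by the application, not part of LZ4.

    #include "lz4.h"      /* LZ4_createStream, LZ4_compress_fast_continue, LZ4_saveDict */

    #define WINDOW_SIZE  (64 * 1024)    /* LZ4 copy window : 64KB */
    #define BLOCK_SIZE   (256 * 1024)   /* chosen framing block size */

    /* hypothetical I/O helpers, supplied by the application */
    extern int  read_block (char* dst, int maxSize);   /* returns bytes read, 0 at end of input */
    extern void write_block(const char* src, int size);

    void compress_chained_blocks(void)
    {
        /* working buffer : up to 64KB of history, followed by the block to compress */
        static char inBuf [WINDOW_SIZE + BLOCK_SIZE];
        static char outBuf[LZ4_COMPRESSBOUND(BLOCK_SIZE)];
        LZ4_stream_t* const stream = LZ4_createStream();
        int blockSize;

        while ((blockSize = read_block(inBuf + WINDOW_SIZE, BLOCK_SIZE)) > 0)
        {
            /* compress the new block ; matches may reach back into the previous block */
            int const cSize = LZ4_compress_fast_continue(stream,
                                  inBuf + WINDOW_SIZE, outBuf,
                                  blockSize, (int)sizeof(outBuf), 1);
            if (cSize <= 0) break;   /* compression error */
            write_block(outBuf, cSize);

            /* copy the last 64KB in front of the next block,
               and update the reference table accordingly */
            LZ4_saveDict(stream, inBuf, WINDOW_SIZE);
        }
        LZ4_freeStream(stream);
    }

The LZ4_saveDict() call plays exactly the role described above: it keeps the last 64KB in front of the data about to be compressed, so that matches can reach back into the previous block.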

OK, but now that we can reference data from previous blocks, how far back should we go? The natural maximum distance is the "copy window size", which is 64KB by default for LZ4. But it could happen that the receiving end of the compressed stream does not have enough memory to store that much data. In such a case, the compressor must be careful not to use references beyond the memory capacity of the receiver. In other words, it must deliberately discard long-distance copy operations.

Should such a use case be part of the generic framing format or not ?
My answer would be : it's okay, as long as an easy solution can be found.

How could that work? Once again, let's focus on the decoder side.
Let's imagine a controller with only 4K of memory available as a buffer.
A simple way to handle such a case is to divide this space into 2 parts: 2K for the "previous block", and 2K for the "block to decode". So we end up with:
- Block size = 2K = memory limit / 2
- Maximum copy reference = lower limit of previous block
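
To picture it, here is a minimal decoder-side sketch of this 2K/2K split, written in C. It uses LZ4_decompress_safe_usingDict(), available in more recent lz4.h versions; read_compressed_block() and write_decoded_block() are hypothetical I/O helpers.

    #include "lz4.h"      /* LZ4_decompress_safe_usingDict */

    #define MEM_LIMIT   (4 * 1024)        /* total decode RAM available : 4K */
    #define BLOCK_SIZE  (MEM_LIMIT / 2)   /* block size = memory limit / 2 = 2K */

    /* hypothetical I/O helpers, supplied by the application */
    extern int  read_compressed_block(char* dst, int maxSize);   /* returns 0 at end of stream */
    extern void write_decoded_block (const char* src, int size);

    void decode_with_4k_buffer(void)
    {
        static char decBuf[MEM_LIMIT];    /* two 2K halves, used alternately */
        static char cmpBuf[LZ4_COMPRESSBOUND(BLOCK_SIZE)];   /* compressed input, on top of the 4K budget */
        int curHalf  = 0;                 /* which half receives the next block */
        int prevSize = 0;                 /* size of the previously decoded block */

        for (;;)
        {
            int const cSize = read_compressed_block(cmpBuf, (int)sizeof(cmpBuf));
            if (cSize == 0) break;        /* end of stream */

            char* const dst  = decBuf + curHalf * BLOCK_SIZE;
            char* const dict = decBuf + (curHalf ^ 1) * BLOCK_SIZE;

            /* the previous block (other half) serves as dictionary :
               no reference ever goes further back than one block */
            int const dSize = LZ4_decompress_safe_usingDict(cmpBuf, dst, cSize,
                                                            BLOCK_SIZE, dict, prevSize);
            if (dSize < 0) return;        /* corrupted input */

            write_decoded_block(dst, dSize);
            prevSize = dSize;
            curHalf ^= 1;
        }
    }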

Obviously, there are other possibilities, such as cutting data into even smaller parts (for example, 1K blocks) and allowing back references of up to 3 previous blocks. But as a first approximation, it seems these variants would provide almost equivalent results while being more complex.

This situation can be summarized with a simple rule : never reference data beyond one block distance.

With only this simple rule in place, it seems the default LZ4 framing format could be made compatible even with environments with severely limited RAM, provided the encoder selects a suitable block size.
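As a rough sizing guide: under this rule the decoder only ever needs to hold two blocks in memory, so a device with, say, 32KB of spare RAM can be served by selecting a 16KB block size, just as the 2K blocks above fit the 4K controller.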

Monday, May 14, 2012

Memory buffer management

 One of the objectives for the LZ4 framing format is to instruct the LZ4 decoder how to best handle memory to decode the compressed data.

To illustrate with a simple example, let's study the current framing format of lz4demo.

This framing has no options, and is therefore completely static. At the core of the specification is a "data split" methodology, which automatically splits input data (typically files) into multiple blocks of 8MB each. That is to say, if the source data is larger than 8MB, the first block will hold the first 8MB of data, the next block the following 8MB, and so on, until only the last part remains. This last block can have any size between 0 and 8MB, and it can also be the first one, if the source data is smaller than 8MB.
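For example, a 20MB file would be split into three blocks of 8MB, 8MB and 4MB, only the last one being allowed to be smaller than 8MB.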

Moreover, each block is completely independent, which means each block can be compressed/decompressed in any order, without dependency, the only requirement being to output data blocks in the same order as the input ones. This property is very interesting for multi-threading.

Now, this simple scheme also comes with some problems of its own. Let's look into them.

First, since we don't know the size of the source data, the decoder has to allocate 8MB of memory, no matter what. Maybe the file is only 4KB, for example? Well, the decoder doesn't know it.
To be fair, knowing the maximum compression ratio (255:1), it could guess that, if the compressed size is 2KB, then the decoded size cannot be more than 512KB. But this is an awkward and awful trick.

For modern computers, it seems this is a minor issue. However, in many other circumstances, such as embedded and low-power devices, this amount of memory is way too large.

To limit the size of allocated memory, a potential work-around for the decoder is to use a "sliding buffer" methodology.
For example, data would be decoded into blocks of 256KB, a lot smaller than the full 8MB size. On reaching the end of each small block, decoded data would be written to disk, keeping only the last 64KB (the copy window size) as reference for the next small block. (A sketch of this idea follows the list below.)
This is workable, and would save a lot of memory. But there are a few drawbacks:
  • A specific decoding function is necessary to do this, which does not exist (yet).
    Such a function needs to keep track of its "internal state", in order to resume decoding where it left off.
    This means, for example, that it may have to stop in the middle of a match copy operation.
    The "exit" and "error" conditions are also different and more numerous.
    As a consequence, the code for such function will be more complex and slower.
  • Adding to the performance issue, the "sliding" mechanism involves repeatedly copying the last 64KB of each small block in preparation for the next one.
  • Although 256KB is a lot smaller than 8MB, it's still too large for some low-memory applications.
    Alas, there is a limit to this methodology: the buffer cannot shrink below the size of the copy window (64KB) plus a reasonable amount of data to decode, typically another 64KB.
    So the minimum decoding buffer for this methodology is 128KB.
    This is better, but still too large.
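
For illustration only, here is roughly what the driving loop of such a sliding-buffer decoder could look like in C. The resumable decoding function it calls does not exist in LZ4 (as noted in the first drawback above), so decode_resume() and its state structure are purely hypothetical, as is write_to_disk().

    #include <string.h>   /* memmove */

    #define WINDOW_SIZE  (64 * 1024)     /* LZ4 copy window */
    #define SMALL_BLOCK  (256 * 1024)    /* size of the sliding decode area */

    /* Hypothetical resumable decoder : produces at most dstCapacity bytes into dst,
       and remembers inside 'state' where it stopped (possibly mid-match),
       so that the next call resumes from that exact point.
       Returns the number of bytes produced, 0 once the block is fully decoded. */
    typedef struct decode_state_s decode_state_t;
    extern int decode_resume(decode_state_t* state,
                             const char* src, int srcSize,
                             char* dst, int dstCapacity);

    extern void write_to_disk(const char* src, int size);   /* hypothetical I/O helper */

    void sliding_buffer_decode(decode_state_t* state, const char* cmpBlock, int cmpSize)
    {
        static char buf[WINDOW_SIZE + SMALL_BLOCK];
        int filled = 0;    /* how much of the buffer currently holds already-decoded history */

        for (;;)
        {
            int const produced = decode_resume(state, cmpBlock, cmpSize,
                                               buf + filled, (int)sizeof(buf) - filled);
            if (produced == 0) break;            /* the whole 8MB block has been decoded */

            write_to_disk(buf + filled, produced);
            filled += produced;

            /* slide : keep only the last 64KB as reference for upcoming matches */
            if (filled > WINDOW_SIZE)
            {
                memmove(buf, buf + filled - WINDOW_SIZE, WINDOW_SIZE);
                filled = WINDOW_SIZE;
            }
        }
    }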

As a consequence, it seems clear that there is no such thing as "one size fits all". The block size needs to be customizable, in order to fit a broader range of situations.

Additionally, it seems better to help the decoder allocate only what's needed, especially when that is smaller than the default block size. This means that, for small data sources (smaller than the block size), it would be interesting to provide the original source size: this allows the decoder to allocate only what's necessary, instead of the default maximum block size.
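
As a purely hypothetical illustration (this is not the actual framing format, just one way to picture the idea), such header information and the resulting buffer sizing could look like this in C:

    #include <stddef.h>   /* size_t */
    #include <stdint.h>   /* uint8_t, uint64_t */

    /* Purely illustrative : a customizable block size plus an optional original-size field */
    typedef struct {
        uint8_t  blockSizeCode;    /* e.g. 0=64KB, 1=256KB, 2=1MB, 3=4MB */
        uint8_t  hasContentSize;   /* 1 if contentSize below is present and valid */
        uint64_t contentSize;      /* original (uncompressed) size, when known */
    } frame_header_t;

    /* Decoder-side buffer sizing : never allocate more than necessary */
    static size_t decode_buffer_size(const frame_header_t* h)
    {
        static const size_t blockSizes[4] = { 64*1024, 256*1024, 1024*1024, 4*1024*1024 };
        size_t bufSize = blockSizes[h->blockSizeCode & 3];
        if (h->hasContentSize && h->contentSize < bufSize)
            bufSize = (size_t)h->contentSize;   /* small source : no need for a full block */
        return bufSize;
    }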