Saturday, January 1, 2011

Quick look back at Huff0 : an entropy coder analysis

Since LZ4 is pretty much limited when it comes to testing match searches over long distances, I'm almost pushed into writing a new compression program to continue these investigations.

Therefore, since such an effort is required anyway, it is also a good time to take a fresh look at entropy coders.

Entropy coding is generally not used as a direct compression method. Its better use is *after* modeling.
For example, after an LZ77 compression pass, we end up with many match lengths, whose distribution is not random : short lengths are much more common than longer ones.
An entropy coder will ensure that frequent symbols get coded using fewer bits than uncommon ones. That's the basis for better compression results.

The most renowned entropy coder is probably the Huffman one. It is no longer considered top of the line, but it continues to be used in many mainstream applications, due to its simplicity and speed.
Huffman guarantees that, given a symbol distribution, the best possible combination of whole bits per symbol will be created. That's a strong property.
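
To make the idea concrete, here is a small self-contained toy (illustrative only, not Huff0's code) which builds Huffman code lengths for a 4-symbol alphabet with invented frequencies, by repeatedly merging the two lowest-weight nodes :

#include <stdio.h>

#define NSYMBOLS 4

int main(void)
{
    /* toy distribution : 'a' is much more frequent than the others */
    unsigned freq[NSYMBOLS]    = { 60, 20, 15, 5 };
    const char* name[NSYMBOLS] = { "a", "b", "c", "d" };

    unsigned weight[2*NSYMBOLS];
    int      parent[2*NSYMBOLS];
    int      alive [2*NSYMBOLS] = {0};
    int nbNodes = NSYMBOLS;
    int active  = NSYMBOLS;
    int i;
    for (i = 0; i < NSYMBOLS; i++) { weight[i] = freq[i]; parent[i] = -1; alive[i] = 1; }

    while (active > 1)
    {
        /* find the two lowest-weight nodes still alive */
        int min1 = -1, min2 = -1;
        for (i = 0; i < nbNodes; i++)
        {
            if (!alive[i]) continue;
            if (min1 < 0 || weight[i] < weight[min1]) { min2 = min1; min1 = i; }
            else if (min2 < 0 || weight[i] < weight[min2]) { min2 = i; }
        }
        /* merge them under a new node */
        weight[nbNodes] = weight[min1] + weight[min2];
        parent[min1] = parent[min2] = nbNodes;
        alive[min1] = alive[min2] = 0;
        alive[nbNodes] = 1;
        parent[nbNodes] = -1;
        nbNodes++; active--;
    }

    /* the code length of a symbol is the depth of its leaf */
    for (i = 0; i < NSYMBOLS; i++)
    {
        int n, len = 0;
        for (n = i; parent[n] >= 0; n = parent[n]) len++;
        printf("symbol %s (freq %u%%) -> %d bit(s)\n", name[i], freq[i], len);
    }
    return 0;   /* prints 1, 2, 3, 3 bits : the frequent symbol gets the short code */
}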

Since Huffman proved that his method was the best possible prefix code, entropy coders received little attention until, a few decades later, arithmetic coders started to be implemented.
This new beast is a lot more complex to explain, so I'll invite you to read the Wikipedia entry to get more details. For the purpose of this post, it is enough to accept that arithmetic coders can attribute the equivalent of a fractional number of bits to a symbol. Yes, it becomes possible to encode a symbol using, for example, 0.38 bits.
This may sound strange, but it's all mathematics.

By breaking the "integer" boundary, arithmetic coders allow for much better precision, and therefore better compression. This is especially important when a symbol's probability is higher than 50% : Huffman cannot do better than spending 1 bit per symbol, which is just right for 50%, but far from optimal for 80%. In such a situation, an arithmetic coder will do much better.
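
To put numbers on that 80% example : the optimal cost of a symbol with probability p is -log2(p) bits, and a quick computation (plain C, illustrative only) shows how far 1 bit per symbol is from the optimum for a two-symbol source :

#include <stdio.h>
#include <math.h>

int main(void)
{
    double p = 0.80;                                   /* probability of the frequent symbol */
    double optimal = -p*log2(p) - (1-p)*log2(1-p);     /* Shannon cost, in bits per symbol */
    double huffman = 1.0;                              /* Huffman cannot go below 1 bit per symbol */
    printf("optimal : %.3f bits/symbol\n", optimal);                /* ~0.722 */
    printf("huffman : %.3f bits/symbol, %.0f%% overhead\n",
           huffman, 100.0*(huffman/optimal - 1.0));                 /* ~39% overhead */
    return 0;
}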

A lesser version of the arithmetic coder is called the Range coder. It's pretty much the same thing, except that it works with integer numbers, and, more importantly, is free from patents, unlike arithmetic ones.

In order to test both methods, I've created 2 programs, Huff0 and Range0.

Huff0 is a block-based Huffman entropy coder. Input data is divided into blocks of 16KB. The compressed output consists of a tree, followed by the compressed entries.
Smaller blocks allow faster adaptation, but cost more in headers, so a trade-off had to be chosen. 16KB was selected as being mostly optimal for distributed input, such as literals, or most files.

Range0 is a block-based Range entropy coder. Input data is divided into blocks of 128KB. The compressed output consists of a frequency count, followed by the compressed entries.
Why 128KB ? Because the frequency count is much heavier than the Huffman tree. As a consequence, dividing data into smaller blocks results in a loss, in spite of better adaptation.
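
As an illustration of this block-based layout, here is a rough framing sketch. The helper names (writeHeader, compressBlock) are invented for this post and are not the actual Huff0/Range0 functions :

#include <stddef.h>

/* Hypothetical helpers, invented for illustration :
   writeHeader() stores the Huffman tree (or the frequency table for Range0),
   compressBlock() writes the entropy-coded payload. */
size_t writeHeader  (unsigned char* dst, const unsigned count[256]);
size_t compressBlock(unsigned char* dst, const unsigned char* src, size_t srcSize,
                     const unsigned count[256]);

#define BLOCK_SIZE (16*1024)   /* 16KB for Huff0 ; Range0 uses 128KB blocks */

size_t compressAll(unsigned char* dst, const unsigned char* src, size_t srcSize)
{
    size_t produced = 0;
    while (srcSize)
    {
        size_t blockSize = (srcSize < BLOCK_SIZE) ? srcSize : BLOCK_SIZE;
        unsigned count[256] = {0};
        size_t i;
        for (i = 0; i < blockSize; i++) count[src[i]]++;   /* per-block statistics */
        produced += writeHeader(dst + produced, count);    /* the cost smaller blocks pay more often */
        produced += compressBlock(dst + produced, src, blockSize, count);
        src += blockSize; srcSize -= blockSize;
    }
    return produced;
}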

Such a trade-off between adaptation speed and header cost deserves a full study in itself, so we'll keep that for another post.

In order to better understand the properties of each coder, I needed a benchmark corpus. I started by using files, but this is nonsense. As stated earlier, entropy coders are not used as direct compression tools, but rather as a "second stage" pass, applied to the statistics or probabilities output by a model.
So it sounded more reasonable to create files which would mimic the statistics produced by such models. Here is what I ended up with :

                        Huff0 v0.5            Range0 v0.6
                        %       C     D       %       C     D
Not compressible
  enwik8.7z           100.01   740  1400    100.00   870  1400
Hardly compressible
  audio1               93.43   285   280     93.38   174    83
Distributed
  enwik8-lz4-lit       73.35   210   200     73.00   155    76
Lightly ordered
  enwik8-lz4-offh      84.18   188   181     83.95   138    76
Ordered
  enwik8-lz4-ml        34.30   200   197     33.94   155    83
Squeezed
  enwik8-lz4-run       23.34   209   218     21.85   163   116
Ultra compressible
  loong                 0.36   450  2930      0.07   362   427

That's a lot of results for a single post. But there is a lot to learn from it.
The enwik8-lz4-* files are created by running LZ4 over enwik8 and extracting a specific field. This is the fast LZ4 version, not the HC one, so these figures could change in the future.
% is the compressed size, as a percentage of the original file size.
C is the compression speed, in MB/s.
D is the decoding speed, in MB/s.

Not compressible : as stated, this 7z entry is already compressed, so there is nothing left to gain.
The basic idea here is to detect this situation fast enough, and to switch to a backup copy mode.
The catch is that this mode should not be triggered too easily, otherwise we may lose some compression.
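
The simplest possible form of that fallback compresses the block, then compares sizes (a sketch only : the 1-byte flags and the helper are invented, and a real detector would want to bail out earlier, which is precisely the "fast enough" part) :

#include <string.h>
#include <stddef.h>

#define RAW_BLOCK        0   /* illustrative 1-byte block flags, not Huff0's real format */
#define COMPRESSED_BLOCK 1

/* hypothetical entropy-coding helper, assumed to return the compressed size */
size_t entropyCompress(unsigned char* dst, const unsigned char* src, size_t srcSize);

size_t storeOrCompress(unsigned char* dst, const unsigned char* src, size_t srcSize)
{
    size_t cSize = entropyCompress(dst+1, src, srcSize);
    if (cSize >= srcSize)               /* no gain : switch to backup copy mode */
    {
        dst[0] = RAW_BLOCK;
        memcpy(dst+1, src, srcSize);
        return srcSize + 1;
    }
    dst[0] = COMPRESSED_BLOCK;
    return cSize + 1;
}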

What immediately looks strange here is that Huff0 is slower than Range0. There is probably still some room for improvement.

Hardly compressible : this is a good torture test for the compression detector. This entry is a wave file, 16-bit stereo. As a consequence, when handled directly by an entropy coder, half the data is purely random (the lower bits), and the other half has a small tendency to cluster around an average value. All in all, the compression that can be achieved is less than stellar, but still measurable.
We are starting to witness a tendency that will only grow stronger : Range0 compresses better, but slower. In particular, the decoding speed of Range0 is an issue when compared with Huff0.

I will skip directly to 
Squeezed : in this file, one of the symbols has a probability higher than 50%. And it pays off quite a lot for Range0. This is no longer a small .2% benefit, but a more sensible 1.5% advantage. And this figure can only get better when using the HC version of LZ4, which will produce even more squeezed output.

Ultra compressible : this is a test file, produced by Eugene Shelwien, to test limit conditions when compressing long streams of identical bytes. It is not a single repeated byte, however; there are some changes from time to time, to make life harder for the coder.
What should immediately be visible is that Huff0 succeeds in producing a file which is compressed much better than 1/8, the theoretical minimum for Huffman (1 bit per byte). This is thanks to the relatively small block size : when a block consists of only one repeated byte, it gets a specific shortcut sequence (see the sketch below). The decoding speed shows this is highly effective.
This is in contrast with Range0, whose decoding speed is *only* 427 MB/s. This is due to the larger block size, which means a block made of a single repeated byte is much rarer; as a consequence, the block must be fully decoded, which is slower than the special escape sequence.
Note, however, that, as could be expected, Range0 compression is much better than Huff0 on this example.
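
For reference, the shortcut mentioned above could look roughly like this (a sketch; the flag value and the fallback helper are invented, not Huff0's actual format) :

#include <stddef.h>

#define SINGLE_BYTE_BLOCK 2   /* illustrative flag value */

/* hypothetical fallback for normal blocks */
size_t huffmanCompress(unsigned char* dst, const unsigned char* src, size_t srcSize);

size_t compressOneBlock(unsigned char* dst, const unsigned char* src, size_t srcSize)
{
    size_t i;
    for (i = 1; i < srcSize; i++)
        if (src[i] != src[0]) break;
    if (i == srcSize)                  /* the whole block is a single repeated byte */
    {
        dst[0] = SINGLE_BYTE_BLOCK;    /* the decoder just memset()s the block : extremely fast */
        dst[1] = src[0];
        return 2;
    }
    return huffmanCompress(dst, src, srcSize);
}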

So, where does that lead us ?
Well, that's just a start. It provides some interesting ground for future comparisons, allowing targeted improvements to become measurable. 

More of it in a future post...

Thursday, December 30, 2010

MMC available as Source Code

Finally, MMC can be downloaded as C source code, thanks to Google Code's excellent publishing services.

The source is for programmers, obviously.
The interface has been made as simple as possible, with just a few calls :


U32 MMC_InsertAndFindBestMatch (void* MMC_Data, BYTE* ip, BYTE** r);  /* insert position ip, search for the best match, store its position into *r */
U32 MMC_Insert1 (void* MMC_Data, BYTE* ip);                           /* insert a single position, without searching */
U32 MMC_InsertMany (void* MMC_Data, BYTE* ip, U32 length);            /* insert 'length' consecutive positions, without searching */


MMC_Data is handled transparently, so that users of the library do not need to care about it. It can be manipulated with the simple definitions below :

void* MMC_Create (BYTE* beginBuffer, BYTE* endBuffer);               /* allocate and initialize an MMC_Data context over the given buffer */
U32 MMC_Init (void* MMC_Data, BYTE* beginBuffer, BYTE* endBuffer);   /* re-initialize an existing context */
U32 MMC_Free (void** MMC_Data);                                      /* release the context */

Although this is plain C, the logic is pretty close to C++, with all data encapsulated into a void*.
This design also ensures that MMC is multi-thread ready.
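
As an illustration, a minimal usage sketch could look like the code below. It assumes a header named "mmc.h" providing BYTE, U32 and the declarations above, and that the value returned by MMC_InsertAndFindBestMatch is the match length (0 when no match is found); check the actual source for the exact semantics.

#include <stdio.h>
#include "mmc.h"   /* hypothetical header name, assumed to provide BYTE, U32 and the MMC_* declarations */

void findAllMatches(BYTE* buffer, BYTE* bufferEnd)
{
    void* mmc = MMC_Create(buffer, bufferEnd);
    BYTE* ip = buffer;
    while (ip < bufferEnd)
    {
        BYTE* matchPos;
        U32 length = MMC_InsertAndFindBestMatch(mmc, ip, &matchPos);   /* assumed to return the match length */
        if (length)
        {
            printf("pos %u : match of length %u at distance %u\n",
                   (unsigned)(ip - buffer), (unsigned)length, (unsigned)(ip - matchPos));
            MMC_InsertMany(mmc, ip+1, length-1);   /* index the positions covered by the match */
            ip += length;
        }
        else ip++;   /* no match : move to the next position */
    }
    MMC_Free(&mmc);
}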


You can download it here :
http://code.google.com/p/mmc/

Sunday, December 26, 2010

A new look at collisions

Quite some time ago, I investigated collision effects, while working on the first version of MMC.

This was my second investigation, the first one having been completed while working on the simpler HC (Hash Chain), and whose conclusion was : it's not worth increasing the Head Table beyond a certain level, since the improved dispatching ability is hampered by decreasing cache efficiency.

MMC made me take a second look, since its natural behavior is meant to reduce, if not eliminate, collisions. I ended up with the following diagram, which proved me wrong : MMC was merely decreasing the collision effect, not eliminating it.
I was also struck by the very linear nature of the collision counter, which stands out when using an exponential scale. In effect, it proved something simple : doubling the hash size halves the number of collisions, which means that the hash formula is about as good as it can be.

Nonetheless, this second study did not change my conclusion, which was that beyond a certain level, the cache effect matters more than the reduced collision gains.

This was before working on large search window sizes. Indeed, since my earlier implementations were too slow, working on large search windows was simply out of reach. This has all changed with the latest versions of MMC.

Invited by Piotr, I tried MMC with much larger window sizes, starting with 16MB.
The algorithm was slow as hell, but something struck me : the collision rate was extremely high, sometimes beyond 80% of all comparison loops. This is way out of control, and can alone explain the bad performance.

Hence, a third look at this phenomenon.

A "collision" happen when a hashed sequence, of length MINMATCH, get the same hash value as another different sequence. If the hash formula is relatively well distributed, it should happen once in every "hash-size" position. In fact, this is a worst case situation, since compressible data, by definition, have a non-random distribution pattern, meaning that many sequences are either extremely rare if not impossible.
On the other hand, worst case scenario do sometimes happen, typically when compressing an already compressed file, or part of file.
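
To be explicit about what gets counted here, this is the test in question (illustrative, not MMC's actual code; MINMATCH = 4 is an assumed value) :

#include <string.h>

#define MINMATCH 4   /* assumed value, for illustration */

/* Two positions landing in the same hash bucket count as a "collision"
   when their first MINMATCH bytes differ : the comparison work is pure waste.
   With a well-spread hash, this should trigger about once every "hash-size" positions on random data. */
static int isCollision(const unsigned char* ip, const unsigned char* candidate)
{
    return memcmp(ip, candidate, MINMATCH) != 0;
}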

With collisions sometimes representing up to 80% of comparisons, there was definitely quite a lot of effort wasted at this stage.

This is the result of keeping the hash table size identical across all window sizes, in an effort to keep the numbers "comparable" (thus avoiding crediting a better MMC method for a gain that would in fact come from a larger hash distribution).

This, however, made me forget that hash size is also part of the performance equation.
A quick solution to this problem is simply to enlarge the initial hash table, in charge of dispatching the sequences. With a good hash formula, doubling the hash size should halve collisions. And it works exactly as it should.

Hence, I came back to a 4MB dictionary to run some tests and measure the effects. Here are the results : for each file, the first figure is the % of comparisons which are collisions, and the second one is the speed.

Window Search 4M :

Hash size    Calgary          Firefox          Enwik8           OpenOffice
Hash 128K    13% - 8.1 MB/s   60% - 2.7 MB/s    6% - 3.0 MB/s   66% - 2.8 MB/s
Hash 256K     7% - 8.0 MB/s   42% - 3.8 MB/s    3% - 2.9 MB/s   49% - 4.4 MB/s
Hash 512K     4% - 8.0 MB/s   27% - 4.9 MB/s    2% - 2.9 MB/s   33% - 6.2 MB/s
Hash 1M       2% - 8.0 MB/s   16% - 5.8 MB/s    1% - 2.9 MB/s   19% - 7.6 MB/s
Hash 2M       1% - 8.0 MB/s    9% - 6.3 MB/s   .5% - 2.9 MB/s   11% - 8.5 MB/s
Hash 4M      .5% - 7.9 MB/s    5% - 6.7 MB/s   .2% - 2.9 MB/s    6% - 9.2 MB/s
Hash 8M      .3% - 7.7 MB/s    2% - 6.7 MB/s   .1% - 2.9 MB/s    3% - 9.3 MB/s
Now this is pretty clear : this "fixed size" hash table was holding back MMC performance at larger search sizes. With an increased hash budget, the final speed becomes relatively acceptable.

But there is more to it : as could be expected, files which were not showing many collisions (Calgary, Enwik8) do not gain anything, and even lose a bit of performance, most probably due to cache effects.

On the other hand, some files suffer heavily from collisions (Firefox, OpenOffice), and as a consequence, increasing the hash size results in major speed gains, with OpenOffice showing impressive results.

This is in stark contrast with my earlier statement, which was probably biased due to :
1) using small search windows, where larger hash sizes have a notable negative performance impact
2) using Enwik8 and Calgary as my main test files. I thought Firefox was an exception, but maybe it was the other way round.

Obviously, different files feature different distributions.

Which leads us to the question : which "Head" size is best ?
I started with a fixed size of 17 bits, but now it seems a better rule would be to correlate this size with the window size.
This, in effect, means that the "Head" size is no longer a "small additional table", and that its budget is now correlated with the dictionary size.

As far as I remember, 7zip uses 2x the dictionary size for its Head table. Since each head element is 4 bytes (a 32-bit pointer or index), it means the number of elements is half the dictionary size (e.g. 2M elements for a 4M window).
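
In code form, that sizing rule amounts to the small computation below (a sketch of the arithmetic, not actual 7zip or MMC code) :

#include <stddef.h>

/* 7zip reportedly spends 2x the dictionary size, in bytes, on its head table.
   With 4-byte entries, that is dictionarySize/2 elements. */
size_t headTableBytes(size_t dictionarySize)
{
    return 2 * dictionarySize;
}

size_t headTableElements(size_t dictionarySize)
{
    return headTableBytes(dictionarySize) / 4;   /* one 32-bit pointer or index per entry */
}
/* e.g. a 4MB window -> 8MB head table -> 2M elements, matching the figure above */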

Judging by the result in the above table, this looks just right.