Thursday, December 30, 2010

MMC available as Source Code

Finally, MMC can be downloaded as C source code, thanks to Google Code's excellent publishing services.

The source is for programmers, obviously.
The interface has been made as simple as possible, with just a few calls :


U32 MMC_InsertAndFindBestMatch (void* MMC_Data, BYTE* ip, BYTE** r);
U32 MMC_Insert1 (void* MMC_Data, BYTE* ip);
U32 MMC_InsertMany (void* MMC_Data, BYTE* ip, U32 length);


MMC_Data is handled transparently, so that users of the program do not need to care about its contents. It can be manipulated with the simple definitions below :

void* MMC_Create (BYTE* beginBuffer, BYTE* endBuffer);
U32 MMC_Init (void* MMC_Data, BYTE* beginBuffer, BYTE* endBuffer);
U32 MMC_Free (void** MMC_Data);

Although this is plain C, the logic is pretty close to C++, with all data encapsulated behind a void*.
This logic also ensures that MMC is multi-thread ready.
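
For illustration, here is a minimal sketch of how these calls could be chained in a greedy match-finding loop. It is not taken from the package : the exact conventions (match length returned, reference written into r, current position inserted by the search call) are assumptions to be checked against the actual source.

/* Minimal usage sketch - conventions assumed, not guaranteed.
   BYTE and U32 are the types declared in the MMC headers. */
void example_search (BYTE* buffer, BYTE* bufferEnd)
{
    void* mmc = MMC_Create (buffer, bufferEnd);    /* allocate & initialize the search structure */
    BYTE* ip = buffer;

    while (ip < bufferEnd)
    {
        BYTE* ref;
        U32 matchLength = MMC_InsertAndFindBestMatch (mmc, ip, &ref);
        if (matchLength)
        {
            /* assumed : the call above already inserted 'ip' ;
               register the remaining positions of the match, then skip over it */
            MMC_InsertMany (mmc, ip+1, matchLength-1);
            ip += matchLength;
        }
        else ip++;                                 /* no match : move to the next position */
    }

    MMC_Free (&mmc);
}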


You can download it here :
http://code.google.com/p/mmc/

Sunday, December 26, 2010

A new look at collisions

Quite some time ago, I investigated collision effects while working on the first version of MMC.

That was in fact my second investigation; the first one was completed while working on the simpler HC (Hash Chain), and its conclusion was : it is not worth increasing the head table beyond a certain level, since the improved dispatching ability is offset by decreasing cache efficiency.

MMC made me take a second look, since its natural behavior is meant to reduce, if not eliminate, collisions. I ended up with the following diagram, which proved me wrong : MMC was merely reducing the collision effect, not eliminating it.
I was also struck by the very linear nature of the collision counter, especially evident on an exponential scale. In effect, it proved something simple : doubling the hash size halves the number of collisions, which means the hash formula is about as good as it can be.

Nonetheless, this second study did not change my conclusion : beyond a certain level, the cache effect matters more than the reduced-collision gains.

This was before working on large search window sizes. Indeed, since my earlier implementations were too slow, working on large search windows was simply out of reach. This has all changed with the latest versions of MMC.

Invited by Piotr, I tried MMC with much larger window sizes, starting with 16MB.
The algorithm was slow as hell, but something struck me : the collision rate was extremely high, sometimes beyond 80% of all comparison loops. This is way out of control, and can by itself explain the poor performance.

Hence, a third look at this phenomenon.

A "collision" happen when a hashed sequence, of length MINMATCH, get the same hash value as another different sequence. If the hash formula is relatively well distributed, it should happen once in every "hash-size" position. In fact, this is a worst case situation, since compressible data, by definition, have a non-random distribution pattern, meaning that many sequences are either extremely rare if not impossible.
On the other hand, worst case scenario do sometimes happen, typically when compressing an already compressed file, or part of file.

With collisions sometimes representing up to 80% of comparisons, quite a lot of effort was clearly being wasted at this stage.

This is the result of keeping the hash table identical across all window sizes, in an effort to keep the numbers "comparable" (thus avoiding attributing to a better MMC method a gain that would in fact come from a larger hash distribution).

This, however, made me forget that hash size is also part of the performance equation.
A quick solution to this problem is simply to enlarge the initial hash table, which is in charge of dispatching the sequences. With a good hash formula, doubling the hash size should halve the number of collisions. And it works just as it should.

Hence, I came back to a 4MB dictionary to run some tests and measure the effects. Here are the results : the first figure is the percentage of comparisons which are collisions, and the second one is the speed.

Window Search 4M :

File        Calgary           Firefox           Enwik8            OpenOffice
Hash 128K   13% - 8.1 MB/s    60% - 2.7 MB/s     6% - 3.0 MB/s    66% - 2.8 MB/s
Hash 256K    7% - 8.0 MB/s    42% - 3.8 MB/s     3% - 2.9 MB/s    49% - 4.4 MB/s
Hash 512K    4% - 8.0 MB/s    27% - 4.9 MB/s     2% - 2.9 MB/s    33% - 6.2 MB/s
Hash 1M      2% - 8.0 MB/s    16% - 5.8 MB/s     1% - 2.9 MB/s    19% - 7.6 MB/s
Hash 2M      1% - 8.0 MB/s     9% - 6.3 MB/s    .5% - 2.9 MB/s    11% - 8.5 MB/s
Hash 4M     .5% - 7.9 MB/s     5% - 6.7 MB/s    .2% - 2.9 MB/s     6% - 9.2 MB/s
Hash 8M     .3% - 7.7 MB/s     2% - 6.7 MB/s    .1% - 2.9 MB/s     3% - 9.3 MB/s

Now this is pretty clear : the "fixed size" hash table was holding back MMC performance at larger search sizes. With an increased budget, the final speed becomes relatively acceptable.

But there is more to it : as could be expected, files which were not showing many collisions (Calgary, Enwik8) do not gain anything, and even lose some performance, most probably due to cache effects.

On the other hand, some files suffer heavily from collisions (Firefox, OpenOffice), and as a consequence, increasing the hash size results in major speed gains, with OpenOffice showing impressive results.

This is in stark contrast with my earlier statement, which was probably biased due to :
1) using a small search window, where larger hash sizes have a notably bad performance impact
2) using Enwik8 and Calgary as my main test files. I thought Firefox was an exception, but maybe it was the other way round.

Obviously, different files feature different distributions.

Which leads us to the question : which "head" size is best ?
I started with a fixed size of 17 bits, but it now seems a better rule would be to correlate this size with the window size.
This, in effect, means that the head table is no longer a "small additional table", and that its budget now scales with the dictionary size.

As far as I remember, 7zip uses 2x the dictionary size for its head table. Since each head element is 4 bytes (a 32-bit pointer or index), this means the number of elements is half the dictionary size (e.g. 2M entries for a 4M window).

Judging by the result in the above table, this looks just right.
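
As an illustration, a sizing rule along these lines could look like the sketch below. The 2x-dictionary budget simply mirrors the 7zip behaviour described above, and the helper name is made up for the example; it is not taken from the MMC source.

/* Hypothetical sizing helper : head-table byte budget = 2 x dictionary size,
   i.e. (2 * dictSize) / 4 entries, rounded down to a power of 2 so that the
   hash value can be reduced with a simple mask. Assumes dictSize < 2 GB. */
static U32 headTableEntries (U32 dictSize)
{
    U32 entries = (2 * dictSize) / sizeof(U32);   /* 4 bytes per head element */
    U32 pow2 = 1;
    while ((pow2 << 1) <= entries) pow2 <<= 1;    /* largest power of 2 <= entries */
    return pow2;                                  /* e.g. 2M entries for a 4M window */
}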

Saturday, December 18, 2010

Parsing Level 1 : Lazy matching

All the compressors you can download from my site use a pretty simple logic :

At each position, we try to find a match (the best possible one for LZ4 HC, or just a good-enough one for all others). If we find one, we encode a match, then move forward by the match length, and try again.

This is known as "greedy matching". We just use a match as soon as we get an opportunity for it.

However, even when making a full search for the longest possible match, this does not produce the best possible output. It is quite possible that, by using the match at the current position, we miss a better one at the next position.

Need an example ?
Just consider the following sentence :

(...) a roaming attempt (...) really necessary. (...) a really necessary (...)

On reaching the second "a", the match finder will find a 3-character match : "a_r".
Reaching the final "y" will therefore require 2 match sequences, the second one adding "eally necessary".

However, should we wait for the next position, we would output the "a" as a literal.
And then, all the rest of the sentence would be described as a single match.

This is cheaper. Considering the LZ4 format (just as an example), a match sequence costs 3 bytes, and a literal costs 1 byte. This gives us :
Don't wait (greedy) : 3 + 3 = 6 bytes
Wait one byte : 1 (literal) + 3 (match) = 4 bytes
So we would lose 2 bytes with greedy matching.

There are a number of ways to improve on this situation. Let's start with the simplest one : lazy matching.

The idea is very simple : on finding a match, we keep it in memory and try to find another, better match at the next position.
If we don't find a better match, then we use the first match.
If we do find a better match, we use this one instead.

The idea can also be applied recursively; that means that if we find a better match, we continue checking the next position, and so on.

This idea is known as "lazy matching", and it has been used for several years in LZ compression programs.
However, it obviously requires more searches, and is therefore slower.
This is a typical trade-off that each compression program must choose.
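
As a rough illustration, a one-step lazy loop could be organized like the sketch below. The helper functions (findBestMatch, emitLiteral, emitMatch) are placeholders invented for the example, not part of any of my compressors.

/* One-step lazy matching sketch - helper functions are hypothetical. */
U32  findBestMatch (const BYTE* ip, const BYTE** matchPos);   /* best match length at ip, 0 if none */
void emitLiteral   (BYTE c);
void emitMatch     (const BYTE* ref, U32 length);

void lazyCompress (const BYTE* ip, const BYTE* iend)
{
    while (ip < iend)
    {
        const BYTE* ref;
        U32 ml = findBestMatch (ip, &ref);
        if (ml == 0) { emitLiteral (*ip++); continue; }       /* no match : output a literal */

        {
            const BYTE* ref2 = NULL;
            U32 ml2 = (ip+1 < iend) ? findBestMatch (ip+1, &ref2) : 0;
            if (ml2 > ml)
            {
                emitLiteral (*ip);                            /* defer : 1 literal ...            */
                emitMatch (ref2, ml2);                        /* ... then the better match at ip+1 */
                ip += 1 + ml2;
            }
            else
            {
                emitMatch (ref, ml);                          /* current match wins (greedy case)  */
                ip += ml;
            }
        }
    }
}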

For maximum compression strategy, advanced parsing is a necessity.

Thursday, December 16, 2010

MMC : Secondary Promotion : Experimental results

I spent quite some time working on Secondary promotion lately. The algorithm was particularly complex, and ensuring that all corner cases were correctly addressed required many debug iterations.

In the end, I finally completed a new version of SecFull, an aggressive implementation which tries to promote all characters it meets during its travel across the different levels. The changes required the capability to "move back" by one level, since in some cases chains were broken and needed repair.

A much simpler version was produced in the process, called Sec1. It just tries to promote the last seen character, and only when a valid gateway is confirmed. This keeps the changes minimal and well localized, leaving all the rest of the code unmodified.
The complexity difference between Sec1 and SecFull is quite large, so it is not clear in advance which one will bring the better benefit.

Anyway, it is now possible to compare. All search algorithms generate exactly the same references, resulting in the same output. Therefore, all changes affect only the speed of searches.

As usual, let's start with a simple count of comparisons :

Nb of comparisons - Window Search 64K :
File      NoSec    Sec1     SecFull
Calgary   4.87M    4.31M    3.70M
Firefox   35.2M    31.2M    25.7M
Enwik8    166 M    139 M    121 M

Window Search 512K :
File      NoSec    Sec1     SecFull
Calgary   9.31M    7.94M    6.32M
Firefox   78.2M    67.3M    52.9M
Enwik8    370 M    285 M    238 M

Window Search 4M :
File      NoSec    Sec1     SecFull
Calgary   12.0M    10.2M    7.95M
Firefox   161 M    141 M    114 M
Enwik8    619 M    472 M    357 M

The trend is pretty clear : the more aggressively we promote, the fewer comparisons are necessary to complete a full search. And the difference becomes more and more pronounced as the search buffer size increases.
That's for comparisons. Now, what about speed ?
Surely, the fewer the comparisons, the better the speed ?
Well, yes, approximately; but we also have to take code complexity into consideration. If each search step is "more costly", then the total may end up being slower.
As can be guessed, Sec1 is more complex than NoSec, and SecFull is the most complex of all. So, does it pay off ? Here are some results :

Speed - Window Search 64K :
File      NoSec       Sec1        SecFull
Calgary   32.8 MB/s   31.5 MB/s   32.0 MB/s
Firefox   40.5 MB/s   38.9 MB/s   40.0 MB/s
Enwik8    30.8 MB/s   29.9 MB/s   30.1 MB/s

Window Search 512K :
File      NoSec       Sec1        SecFull
Calgary   16.6 MB/s   16.9 MB/s   17.9 MB/s
Firefox   19.4 MB/s   18.6 MB/s   19.9 MB/s
Enwik8    13.0 MB/s   13.2 MB/s   14.7 MB/s

Window Search 4M :
File      NoSec       Sec1        SecFull
Calgary   6.5 MB/s    6.8 MB/s    8.1 MB/s
Firefox   2.3 MB/s    2.4 MB/s    2.7 MB/s
Enwik8    1.9 MB/s    2.4 MB/s    3.0 MB/s

Time for some serious thinking.
Even with fewer comparisons, the new algorithms are not guaranteed to be faster. This is especially true at 64K, where both secondary algorithms fail to produce any gain, even though the number of comparisons is clearly reduced.
We need a larger search buffer to make secondary promotions worthwhile. This point is reached as soon as 512K, and the benefit then grows with the search size.

Actually, SecFull seems to win over Sec1 in each and every circumstance. So it can be preferred.

To be fair, though, the results are a bit disappointing. The very large difference in comparison counts would lead us to believe that the benefit could be larger, but in the end it remains relatively mild.
No ground-breaking progress this time...

Wednesday, December 15, 2010

RZM

Once upon a time...
There was one of those incredibly talented programmers, who could produce in a short time what would take years for amateurs like myself. During his short presence on the compression scene, he produced several excellent programs, featuring compression/speed ratios beyond anything available back then.

Among these little jewels was RZM, an LZ-class compressor. Right from its first release, it became a serious competitor to 7-zip, the best LZ reference implementation so far. This was achieved without even using any external filter, nor any strange trick : just plain LZ.

Or was it ? Christian Martelock, the author, described its algorithm as being a "dull ROLZ order-1" compression engine, with 64K positions per context.
This description was close enough to Etincelle to deserve a try.

Etincelle is also an order-1 ROLZ; however, it gets its speed from a quick scan into the dictionary. This is in contrast with RZM, which makes a full search.

It took only a little time to implement full search into Etincelle and get some first results.
RZM is approximately 30% more powerful than Etincelle, so I was wondering whether that kind of result was readily reachable.

It was not, obviously. With full search, Etincelle features a 15% improvement, same as LZ4.
So where could the other 15% come from ?

Well, for a start, Christian clearly states that he uses optimal parsing to get this result. From experience, optimal parsing provides a fair boost to compression ratio, but not a huge one; 3-5% is a typical improvement to expect.

Another, more obvious, modification would be to move away from the Huffman entropy coder and exchange it for a range coder. With its improved precision, this would help quite a bit, especially on literals. I suspect an improvement in the range of 2-3%.

Now, that still leaves us with something like 8% to account for. And this part is not so clearly explained.

On reading the thread on RZM at encode.ru, a repeated pattern became apparent : all encoded tokens are context-evaluated.
This is a step above my simple block-based implementations. While I simply separate data depending on its type, ensuring homogeneous probabilities, context evaluation goes a few steps further : it separates probabilities based on one or several selected elements or statistics, which are considered "correlated" with the probability being calculated.
It does not need to be completely deterministic (in which case we would get a perfect predictor). A loosely coupled correlation is good enough to sharpen the probability table in a way which makes it more accurate, therefore producing better compression.

Since we benefit more from better correlation, the choice of context is pretty important.
As an example, for literals, the previous character is usually a good indication, hence it makes sense to use it as a context. That's called order-1 entropy coding.
Obviously, many more measurable elements can be selected. For example : was the previous position a literal or the end of a match, what was the length of the previous match, and so on. They can also be combined.
But the prediction will only be as good as the correlation. Therefore, mixing unrelated elements together will prove a poor indicator.
The story becomes even harder when considering that correlation elements may change over time. This is quite normal if the type of data being compressed changes, even within a single file. Tracking such changes can become a significant burden, so a faster alternative is to fix the probability model, hard-wiring the correlation rules. This is necessarily less optimal, but obviously much cheaper to compute.

Another problem with creating too many contexts is sparsity : the more numerous the contexts, the sparser the data within each of them. It therefore becomes difficult, if not dangerous, to extract probabilities from too small a sample set; this is the "escape estimation" problem, well known since the PPM era.
This problem is critical for hard-wired correlation rules, such as PPM's. More modern techniques, with dynamic merging & splitting, can somewhat circumvent it.

Exploiting such correlations requires, as a bare minimum, a fully adaptive entropy coder, able to change its probabilities at each step. Although simple in principle, its implementation is not, and it can become a serious speed bottleneck. There is, however, a specific sub-case : the binary range coder. With just two output values to take care of (1 or 0), probability updates are very simple. Hence its predominance in most, if not all, of the best compression programs today.
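
To give an idea, the typical building block is an adaptive bit probability, nudged toward the observed value after each coded bit. The sketch below is the generic textbook scheme, not RZM's actual code; context evaluation then simply consists in keeping one such model per context value (for example, one per previous byte for order-1 literal coding).

/* Generic adaptive binary model, as commonly used with a binary range coder. */
typedef unsigned int U32;

#define PROB_BITS   12
#define PROB_ONE    (1U << PROB_BITS)             /* probabilities scaled to 4096 */
#define ADAPT_SHIFT 5                             /* adaptation speed */

typedef struct { U32 p0; } BinModel;              /* probability that the next bit is 0 */

static void BinModel_init (BinModel* m) { m->p0 = PROB_ONE / 2; }

static void BinModel_update (BinModel* m, U32 bit)
{
    if (bit == 0) m->p0 += (PROB_ONE - m->p0) >> ADAPT_SHIFT;   /* bit was 0 : increase p0 */
    else          m->p0 -= m->p0 >> ADAPT_SHIFT;                /* bit was 1 : decrease p0 */
}

/* Context evaluation (hypothetical layout) : one model per previous-byte context;
   a real coder would use a small tree of such models per context. */
static BinModel literalModel[256];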

So here is a look at the different steps needed to progress towards an RZM-like compressor :
1) move to range coder
2) work on optimal parsing
3) move to fully adaptive entropy
4) play with hard wired contexts
5) play with context selection, mixing, and creation

Quite a long road ahead...