Friday, March 7, 2014

Perfect Normalization

People keeping an eye on the FSE github repository may have noticed the appearance of a new function, called FSE_normalizeCountHC(). Let's detail it.

The previous post presented a new technique to improve the accuracy of the normalization step, while remaining fast. Today's post will concentrate on making the normalization step "optimal", with speed as a secondary objective.
(Note : this objective has also been covered on Charles Bloom's blog, using a different explanation and formula, but with essentially the same result. It's a good read.)

Reaching the "optimal" status means no other probability distribution can achieve a better compression. This is obviously measured against a "theoretical optimal entropy coder", which FSE is not (it's a close approximation of it). But still, any gain from a better distribution is likely to translate into better FSE compression.

To achieve the best possible distribution, we need to invert our reasoning : instead of distributing remaining "bonus points", we will round up every probability, and then treat each "round down" as a cost. This cost is unavoidable : we can't "round up" all probabilities, otherwise the sum of probabilities would be bigger than the table. By design, some of these probabilities will have to be rounded down.
We will simply select the ones which cost the least.

This requires defining a "cost function".
The cost function is relatively easy to understand :
Let's suppose we have a symbol S, which appears Qs times in the data source, resulting in a normalized probability of P. The cost per symbol can be calculated using Shannon's formula :
Cp = -log2(P / TableSize);

Now, we want to evaluate the cost of downgrading the probability of S from P to P-1.
The new cost per symbol will be higher : C(p-1) = -log2((P-1) / TableSize);
So the total cost of this operation will be : Qs x (C(p-1) - Cp) = Qs x (log2(P) - log2(P-1))

We can calculate this cost for each and every symbol present in the data source, pick the one which costs the least, downgrade its probability by one, update its cost, select the new symbol with the lowest cost, rinse and repeat, and stop when the sum of probabilities reaches the required total (= table size).
By construction, this method is guaranteed to provide the best possible normalized distribution.
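To make the selection loop concrete, here is a minimal Python sketch of this greedy process (illustrative only : the names and details are mine, not the actual FSE_normalizeCountHC() code) :

```python
import heapq
from math import ceil, log2

def normalize_hc(counts, table_size):
    """Round every scaled probability up, then repeatedly round down
    the symbol whose downgrade costs the least, until the sum of
    probabilities equals table_size."""
    total = sum(counts)
    # Step 1 : round every present symbol's scaled probability up.
    probs = [ceil(c * table_size / total) if c else 0 for c in counts]

    # Cost of lowering symbol s from P to P-1 : Qs x (log2(P) - log2(P-1)).
    # A probability of 1 must never be lowered further, hence the infinite cost.
    def down_cost(s):
        p = probs[s]
        return counts[s] * (log2(p) - log2(p - 1)) if p > 1 else float("inf")

    # One heap entry per present symbol ; the pop/decrement/push cycle
    # below preserves that invariant, so entries are never stale.
    heap = [(down_cost(s), s) for s, c in enumerate(counts) if c]
    heapq.heapify(heap)

    excess = sum(probs) - table_size
    while excess > 0:
        _, s = heapq.heappop(heap)          # cheapest symbol to round down
        probs[s] -= 1
        excess -= 1
        heapq.heappush(heap, (down_cost(s), s))
    return probs
```

The heap keeps the "select the lowest cost" step cheap ; a linear scan per iteration would give the same result, just slower.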

So, what kind of result does it provide ?
First, a look at the "book1" benchmark :

4K table, Fast normalization : 435 459 bytes
4K table, Perfect normalization : 435 426 bytes (gain : 0.008%)

2K table, Fast normalization : 436 138 bytes
2K table, Perfect normalization : 436 041 bytes (gain : 0.022%)

1K table, Fast normalization : 437 952 bytes
1K table, Perfect normalization : 437 756 bytes (gain : 0.045%)

As can be seen, the remaining gain is very small. That's because we had already grabbed most of the benefits with the latest improvements.
Still, it can be interesting for those willing to reach the best possible compression ratio : the "normalization" cost only applies during compression, and is totally free on the decompression side.

Looking further, the new function also provides a debug mode which details its selection process. It's very interesting, so let's have a look at it, for example compressing "book1" with 2K tables :

smallest :  76 : cost  413.0 bits  -  count    413 : 1.10  (from 2 -> 1)
smallest :  70 : cost  413.0 bits  -  count    413 : 1.10  (from 2 -> 1)
smallest :  89 : cost  416.0 bits  -  count    416 : 1.11  (from 2 -> 1)
smallest :  87 : cost  440.5 bits  -  count    753 : 2.01  (from 3 -> 2)
smallest :  63 : cost  444.0 bits  -  count    759 : 2.02  (from 3 -> 2)
smallest :  69 : cost  444.0 bits  -  count    444 : 1.18  (from 2 -> 1)
smallest :  59 : cost  445.7 bits  -  count    762 : 2.03  (from 3 -> 2)
smallest : 106 : cost  468.0 bits  -  count    468 : 1.25  (from 2 -> 1)
smallest :  33 : cost  486.7 bits  -  count    832 : 2.22  (from 3 -> 2)
smallest :  83 : cost  497.2 bits  -  count    850 : 2.26  (from 3 -> 2)
smallest :  62 : cost  498.0 bits  -  count    498 : 1.33  (from 2 -> 1)
smallest :  60 : cost  498.0 bits  -  count    498 : 1.33  (from 2 -> 1)
smallest :  79 : cost  500.7 bits  -  count    856 : 2.28  (from 3 -> 2)
smallest :  78 : cost  502.0 bits  -  count    502 : 1.34  (from 2 -> 1)
smallest : 120 : cost  503.7 bits  -  count    861 : 2.29  (from 3 -> 2)
smallest :  84 : cost  517.1 bits  -  count   1966 : 5.24  (from 6 -> 5)
smallest : 113 : cost  520.0 bits  -  count    520 : 1.39  (from 2 -> 1)
smallest :  46 : cost  530.6 bits  -  count   7170 : 19.10  (from 20 -> 19)
smallest :  39 : cost  533.5 bits  -  count   6470 : 17.24  (from 18 -> 17)
smallest : 107 : cost  533.9 bits  -  count   4994 : 13.30  (from 14 -> 13)
smallest : 118 : cost  535.7 bits  -  count   5382 : 14.34  (from 15 -> 14)
smallest :  98 : cost  537.8 bits  -  count   9132 : 24.33  (from 25 -> 24)
smallest : 115 : cost  538.8 bits  -  count  36788 : 98.00  (from 99 -> 98)
smallest :  10 : cost  538.9 bits  -  count  16622 : 44.28  (from 45 -> 44)
smallest : 110 : cost  539.1 bits  -  count  40919 : 109.01  (from 110 -> 109)
smallest : 104 : cost  539.2 bits  -  count  37561 : 100.06  (from 101 -> 100)
smallest :  44 : cost  540.2 bits  -  count  10296 : 27.43  (from 28 -> 27)
smallest : 109 : cost  540.3 bits  -  count  14044 : 37.41  (from 38 -> 37)
smallest : 116 : cost  540.6 bits  -  count  50027 : 133.27  (from 134 -> 133)
smallest : 111 : cost  540.8 bits  -  count  44795 : 119.33  (from 120 -> 119)
smallest :  97 : cost  541.3 bits  -  count  47836 : 127.43  (from 128 -> 127)
smallest : 119 : cost  541.4 bits  -  count  14071 : 37.49  (from 38 -> 37)
smallest : 108 : cost  541.4 bits  -  count  23078 : 61.48  (from 62 -> 61)
smallest :  32 : cost  541.5 bits  -  count 125551 : 334.47  (from 335 -> 334)
smallest : 105 : cost  542.0 bits  -  count  37007 : 98.59  (from 99 -> 98)
smallest : 114 : cost  542.3 bits  -  count  32889 : 87.62  (from 88 -> 87)
smallest : 101 : cost  542.8 bits  -  count  72431 : 192.96  (from 193 -> 192)
smallest :  32 : cost  543.1 bits  -  count 125551 : 334.47  (from 334 -> 333)
smallest : 102 : cost  543.2 bits  -  count  12237 : 32.60  (from 33 -> 32)
smallest :  45 : cost  543.8 bits  -  count   3955 : 10.54  (from 11 -> 10)
smallest : 110 : cost  544.1 bits  -  count  40919 : 109.01  (from 109 -> 108)
smallest : 117 : cost  544.2 bits  -  count  16031 : 42.71  (from 43 -> 42)
smallest : 115 : cost  544.4 bits  -  count  36788 : 98.00  (from 98 -> 97)
smallest : 104 : cost  544.6 bits  -  count  37561 : 100.06  (from 100 -> 99)
smallest : 116 : cost  544.7 bits  -  count  50027 : 133.27  (from 133 -> 132)
smallest :  32 : cost  544.7 bits  -  count 125551 : 334.47  (from 333 -> 332)
smallest : 100 : cost  544.8 bits  -  count  26623 : 70.92  (from 71 -> 70)
smallest : 111 : cost  545.4 bits  -  count  44795 : 119.33  (from 119 -> 118)
smallest :  97 : cost  545.6 bits  -  count  47836 : 127.43  (from 127 -> 126)
smallest : 101 : cost  545.7 bits  -  count  72431 : 192.96  (from 192 -> 191)
smallest : 103 : cost  546.2 bits  -  count  12303 : 32.78  (from 33 -> 32)

OK, it may seem a bit daunting to analyze, so let's go straight to the most interesting conclusions :
1) The first selected symbols make a big difference, since they really cost much less (starting with 413 bits). Then the costs converge, and there is very little difference left between the various available choices, between 535 and 545 bits.
Basically, it means most of the gain of the "HC" version comes from correctly selecting the first few symbols.
2) These first symbols tend to have a low probability : a lot of 2->1 transitions, some 3->2, and so on. Starting around 530 bits, they completely disappear, and we only find higher probability symbols.
3) Some higher probability symbols appear several times. Note for example symbol 32, which is rounded down several times, reaching 332, while its "real" probability should be 334.47. It means it's better to downgrade it several times than to downgrade a lower probability symbol a single time.

Point 2 was the most surprising to me. Remember that last time, we showed a graphic suggesting that the low probability symbols were prone to the largest "round down" losses.


This is because that graphic only considered the "worst case" situation, with the "real probability" being infinitely close to the next probability step (e.g. 1.999, then 2.999, etc.).

The new result implies that low probability symbols can also generate the lowest losses, when they are at the other end of the spectrum (e.g. 1.01).

This led me to try to visualize the "level of freedom" of the "round down" cost (all possible values from 1.001 to 1.999, then from 2.001 to 2.999, and so on). It gave the following graphic :



As could be guessed, this "level of freedom" is much higher at low probability than at high probability. In fact, the "freedom area" quickly becomes a very thin line beyond value 300.
This corresponds to intuition :
From 1.01 to 1.99, the level of freedom is a factor of 2.
From 2.01 to 2.99, the level of freedom is now 50%.
and so on.
By the time it reaches probability 1000, the level of freedom is basically 1/1000th, which is almost nothing.



This is the zoomed version of the same graphic, concentrating on the first few values. As said last time, this graphic will remain the same irrespective of the total size of the table.

So that means most of the attention must be paid to the symbols with the lowest probabilities : for them, the difference between a low and a high "fractional rest" can make a large cost difference, so they must be selected carefully.
High probability symbols don't have such a level of freedom : their "fractional rest" has very little importance, and almost no impact on the total cost.

Another interesting point is that the graphic converges towards 1/ln(2) = 1.443, not 1.5. Which means that the intuitive 50% threshold is not true, at least not for low-probability symbols.

Unfortunately, the new "HC" version is also slow. Not "horribly slow", just slow enough to be noticeable in benchmarks, reducing total speed by as much as 5-10%, for negligible compression gains. That means the "fast" version remains the default one, while the "HC" version is provided as an option.

Nonetheless, the above findings are very interesting, and may be useful in the quest for a better fast renorm algorithm.

Sunday, March 2, 2014

Better normalization, for better compression

A secondary issue to deal with when implementing an FSE entropy coder is the requirement to normalize statistics. Basically, it consists in attributing a probability to each symbol, so that the sum of all probabilities equals the table size, which is itself a power of 2.
It is in no way specific to FSE, and any other entropy coder with the ability to represent fractional bits, such as arithmetic coders, would face the same issue.

The process seems fairly straightforward : get a ratio, Ratio = destSize / origSize, with destSize = table size and origSize = input size. Then, simply multiply each symbol occurrence by this ratio to get its scaled probability :
P[s] = Occurrence[s] * R;

We just have 2 little details to solve with this process :

1) The ratio is a floating-point number (since in practice the input size is generally not an exact multiple of the table size). It basically means all calculations are going to be float, which is hardly ideal for speed.
This part is the easiest to solve : just use some fixed-point arithmetic. It's particularly simple with 64-bit arithmetic : scale the ratio by a large number, such as 2^60, and R becomes a large integer, with which you can multiply symbol occurrences. It's accurate enough.
32-bit arithmetic is fine too, but requires paying a bit more attention to limit situations, such as large input sizes.
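As a rough sketch of the fixed-point idea (using a 2^32 scale factor for readability ; the function name and details are hypothetical, not FSE's actual code) :

```python
def scaled_probability(occurrence, src_size, table_size, scale_bits=32):
    """Fixed-point version of P[s] = Occurrence[s] * (tableSize / srcSize).
    Returns the integer part and the 'fractional rest' as a raw
    scale_bits-bit fixed-point value."""
    # R is the ratio tableSize / srcSize, scaled up to stay in integers.
    R = (table_size << scale_bits) // src_size
    scaled = occurrence * R
    return scaled >> scale_bits, scaled & ((1 << scale_bits) - 1)

# e.g. a symbol seen 413 times in book1 (768 771 bytes), with a 2K table,
# gets an integer part of 1 (its real scaled value is ~1.10)
```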

2) P[s] is unlikely to be a natural number. You will end up with something like P[s] = 10.1, or P[s] = 10.9. So the natural question is : to which value should P[s] be rounded ?
With the above examples, it looks simple : 10.1 should be rounded to 10, 10.9 should be rounded to 11.
Except that, if you apply this simple logic to all symbols, the sum of P[s] is not going to be equal to the table size. It's going to be close, but not exactly that number. Which is unforgiving : this strict condition is mandatory, otherwise compressed data will be corrupted.
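A tiny sketch (illustrative, not FSE code) makes the problem concrete : rounding each P[s] to the nearest value does not preserve the sum.

```python
def naive_round(counts, table_size):
    """Round each scaled probability to the nearest integer."""
    total = sum(counts)
    return [round(c * table_size / total) for c in counts]

# 3 symbols, each scaling to 8/3 = 2.67, all rounded up to 3 :
probs = naive_round([1, 1, 1], 8)
# sum(probs) is 9, not 8 : the table constraint is violated
```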

There are a few ways to handle this situation.
A "panic" solution would simply consist of adding or removing some points to random P[s] until reaching the desired total. As long as all present symbols get P[s] >= 1, it's going to work. Compression will suffer, but compressed data will be decodable.

This is obviously not very optimal. So let's try to find something better, keeping in mind the requirement for speed.

The initial algorithm shipped within FSE was completely speed oriented : it would attribute a probability to each symbol in a single pass, guaranteeing that all present symbols get a probability >= 1, and that their sum is exactly equal to the table size.
To reach this result, there were 2 complementary operations :
1) Statistics were altered, to guarantee that all present symbols get a probability >= 1. It simply consists in adding a constant to all symbol statistics, so that even when a symbol is present only once, (Occurrence[s] + base) * R >= 1.
2) The "fractional rest" was tracked. P[s] was rounded down by default, but would be rounded up when the cumulative sum of the current and previous fractional rests exceeded 1. This idea is somewhat equivalent to dithering algorithms, but in 1D space.
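The two operations above can be sketched as follows (an illustrative reconstruction using exact integer arithmetic ; the base computation is naive, and none of this is FSE's actual code) :

```python
def normalize_fast(counts, table_size):
    """Single-pass normalization sketch : alter statistics so every present
    symbol reaches probability >= 1, then round down by default and grant
    a 'round up' bonus whenever the accumulated fractional rests reach 1."""
    total = sum(counts)
    present = sum(1 for c in counts if c)
    # 1) Naive base : smallest constant such that even a count of 1
    #    still scales to a probability >= 1.
    base = 0
    while (1 + base) * table_size < total + base * present:
        base += 1
    adjusted = total + base * present

    # 2) Round down by default, tracking fractional rests exactly
    #    (as numerators over 'adjusted') -- 1-D dithering.
    probs = [0] * len(counts)
    rest = 0
    for s, c in enumerate(counts):
        if not c:
            continue
        p, r = divmod((c + base) * table_size, adjusted)
        rest += r
        if rest >= adjusted:        # cumulative rests passed 1 : round up
            p += 1
            rest -= adjusted
        probs[s] = p
    return probs
```

Because the scaled probabilities sum exactly to the table size, the dithering loop ends with a rest of 0, so the output always sums to table_size.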

It works. It's not optimal, but damn fast.
Note that we are not talking about a large loss here. It would typically be ~0.05% using a 4K table. Almost negligible. However, I also discovered that the error became larger as the table size was reduced (up to 0.20% for a 2K table, for example).
Since one of my current objectives is to use smaller tables, I figured it was a good idea to investigate.

Both mechanisms above introduce some distortion. The more important one is the statistics alteration. The effect is more pronounced as sourceSize / tableSize gets bigger, as one would expect, since the base alteration has to be increased.
The second effect is less harmful, but still not optimal : the "round up" bonus is more likely to be attributed to symbols with a large "fractional rest", such as 10.9, but it's not guaranteed. If the accumulated sum of previous fractional rests is very small, for example 0.05, the new sum will be 0.95, and therefore not trigger the round up bonus. On the other hand, the next symbol, even with a small fractional rest such as 10.1, would push the total to 1.05, and therefore trigger the "round up" bonus.

It's desirable to make a more accurate selection of which symbol probabilities must be rounded up or down.
So the algorithm is going to be altered. The new idea is : round down all symbol probabilities by default, then provide the "round up" bonus to the symbols which most deserve it.

We just introduce 2 new problems.

First, we can't round down probabilities < 1 : these symbols have reached the absolute minimum. We must grant them a probability of 1.

That means we now have 2 situations : either the "round down" of all symbols with probability > 1 is prevalent, and we still have some "bonus points" to distribute ; or the "round up" of symbols with probability < 1 is prevalent, and we must now remove some more points from other symbols.

Let's suppose we are in situation 1 : we still have a few bonus points to distribute.
So the initial reasoning remains valid.
This leads us to the second problem : which symbol "most deserves it" ? How do we translate that ?

An intuitive idea is that symbols with the highest "fractional rest" deserve it more. Simply said, 10.9 should be rounded up rather than 10.1.
This is the simple case. But more complex choices have to be made too. For example, suppose I've got only 1 bonus point : which probability should be rounded up, 1.9 or 10.9 ?

To investigate this question, I've started to look at the impact of a "round down" error. Suppose I give symbol A a probability of 1 instead of 2, and symbol B a probability of 10 instead of 11 : what is the impact ?
First, let's look at the impact on the unit cost of each symbol :



This is intuitive : the smaller the probability, the more a "round down" error will have an impact on symbol cost.
But it doesn't tell the whole story. The symbols with smaller probabilities are, by definition, less numerous than symbols with higher probabilities. So it could be that, even though the "unit cost" of more probable symbols is less impacted, multiplied by the number of symbols, the total error may be larger.

So let's have a look at how much compressed data is generated by these symbols, simply by multiplying their unit cost by their occurrence :



Well, it's interesting. Symbols with probabilities of ~30% actually generate more compressed data than symbols with probabilities > 80%.
Interesting, yes, but that's still the wrong picture to look at. The above picture gives the total amount of compressed data generated if the probability is correct. I won't get a smaller amount of data by attributing a probability of 5% to a symbol whose real probability is 30% : its cost will sky-rocket instead, following the trajectory of the first picture.

What actually matters is "how much additional compressed data will be generated if I round down this probability".
To visualize this question, I've created the following scenario :
- Let's imagine that all probabilities are accurately represented (meaning, a symbol with a probability of 2 is really present 2 times).
- Let's imagine that I provide to this symbol a probability wrong by one (meaning, a symbol present 2 times is now attributed a probability of 1)
- What is the additional cost of this error ?

It gives the following picture :



OK, now this is really interesting.
This is almost a flat line. The total error is consistently going to be around 1.5 bits.

To be more precise, it's not totally flat : it is in fact slightly diminishing, converging towards a value around 1.443 (which seems to be 1/ln(2)).
There is a more noticeable difference at the beginning, so let's zoom in on this portion :


It generates a few comments. First, it's not an "exploding" cost : it is capped at 2 bits max, for the specific case of downgrading a probability from 2 to 1.
Second, it converges very fast. We almost reach 1.5 bits at probability 10 ; it then continues its slow convergence towards 1/ln(2).

Another interesting fact is that this convergence is independent of the total size of table. This example is extracted from a table size of 2K, but you could have a table of 16K, or a table of just 32, and you would still get the exact same result !

This seems strange, but math explains it all.
As stated previously, the scenario is :
the symbol has exactly P[s] occurrences,
so its total compressed cost shall be : P[s] * -log2(P[s] / TableSize)
but since we undervalued its probability, its new cost is : P[s] * -log2((P[s]-1) / TableSize)
So the cost difference is : (P[s] * -log2((P[s]-1) / TableSize)) - (P[s] * -log2(P[s] / TableSize))
and since : log2(P[s] / TableSize) = log2(P[s]) - log2(TableSize)
we have : diff = (P[s] * (log2(TableSize) - log2(P[s]-1))) - (P[s] * (log2(TableSize) - log2(P[s])))
which simplifies into : diff = P[s] * (log2(P[s]) - log2(P[s]-1))
Basically, the term TableSize disappears from the equation.
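Plugging numbers into this final formula confirms both the 2-bit cap and the convergence towards 1/ln(2) :

```python
from math import log, log2

def round_down_loss(p):
    """diff = P[s] * (log2(P[s]) - log2(P[s]-1)) : extra bits generated by
    rounding a perfectly represented probability P down to P-1.
    Note that table size does not appear anywhere."""
    return p * (log2(p) - log2(p - 1))

print(round_down_loss(2))     # 2.0 bits : the worst case (2 -> 1)
print(round_down_loss(10))    # ~1.52 bits
print(round_down_loss(1000))  # ~1.44 bits
print(1 / log(2))             # the limit : ~1.4427
```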

Concretely, what it means for our fast normalization algorithm is that using the "fractional rest" as the key metric to decide which symbol gets its "round up" bonus is correct. The gain is almost flat across the spectrum.
It also gives a second key, to solve ties.
Since the loss curve decreases with probability, it means that, for an identical "fractional rest", there is a bit more to gain when the probability is lower.

And here we have our algorithm to distribute the "round up bonus" : select symbols with highest "fractional rest", solve ties by putting lower probability symbols first.
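This selection rule can be sketched as a simple sort (illustrative only ; FSE's actual implementation is organized differently) :

```python
def distribute_bonus(scaled, bonus):
    """Give 'bonus' round-up points to the most deserving symbols :
    highest 'fractional rest' first, ties broken by lower probability.
    scaled : real-valued scaled probabilities (e.g. 10.9)."""
    probs = [int(p) for p in scaled]          # round everything down
    order = sorted(range(len(scaled)),
                   key=lambda s: (-(scaled[s] - probs[s]),  # biggest rest first
                                  probs[s]))                # then lowest probability
    for s in order[:bonus]:
        probs[s] += 1
    return probs

# With 1 bonus point and candidates 1.9 / 10.1 / 3.5,
# 1.9 wins (largest fractional rest, 0.9) and becomes 2.
```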


We still have to deal with scenario 2 :
we have no "bonus" point to distribute ;
quite the reverse : we have distributed too many probability points, as a consequence of the requirement to give a minimum probability of 1 to every symbol present even only once. So now we must remove some points, somewhere, to reach the desired total.

We'll deal with this situation quite easily. All symbols are already rounded down, so there is no more "fractional rest" to play with : we will have to downgrade some symbols by a full probability point.
So we just look at the graphic. It states that, even though the losses are almost equivalent, the loss is still lower for higher probability symbols, even if marginally so.
So we'll give the entire cost to the symbol with the highest probability. It's a fast approximation, which works well in practice.


With this new algorithm, FSE now compresses a bit more.

To give an example, let's compress book1 (part of calgary corpus) as a single block.

4K table, Old normalization : 435 723 bytes
4K table, New normalization : 435 459 bytes (gain : 0.06%)

A very small, but still perceptible gain. Let's check with lower table sizes :

2K table, Old normalization : 436 783 bytes
2K table, New normalization : 436 138 bytes (gain : 0.15%)

1K table, Old normalization : 439 181 bytes
1K table, New normalization : 437 952 bytes (gain : 0.28%)

And indeed, with lower table sizes, the increased accuracy becomes more important.

The good thing is, the new approximate normalization algorithm is still very fast, and can therefore be used as a drop-in replacement for the older one, with no perceptible speed impact.