Wednesday, April 2, 2014

Ultra-fast normalization

Today's objective is to use the lessons learned while defining the perfect normalization algorithm to design a new, better fast normalization algorithm. As stated last time, perfect normalization works fine; it's just a bit too slow to be used within FSE.

The main lesson, in my opinion, is that the fractional rest doesn't really matter. Or more specifically, it only matters for small p values, where the round down cost elasticity is highest.
Also of interest : the "threshold" level is not a stable 0.5 across all p values, but a p-specific threshold, starting at 0.443 for p==1 and then converging towards 0.5.
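
One way to see where these values come from (my own back-of-the-envelope check, not taken from the FSE source) : the break-even point between rounding a scaled probability down to p or up to p+1 sits around 1/ln((p+1)/p), whose fractional part is 0.443 at p==1 and converges towards 0.5 :

#include <stdio.h>
#include <math.h>

int main(void)
{
    /* fractional part of the assumed break-even point 1/ln((p+1)/p)
       between rounding down to p and rounding up to p+1 */
    int p;
    for (p = 1; p <= 16; p *= 2)
        printf("p=%2i : threshold fraction = %.3f\n",
               p, 1.0 / log((p + 1.0) / p) - p);
    return 0;
}

This prints 0.443, 0.466, 0.481, 0.490, 0.495 : starting at 0.443 and converging towards 0.5, as observed.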

[Graph : elasticity of the "round down cost"]

The list of "round down" selection also shows that only the first few decisions make a (large) difference, targeting almost exclusively small p values with small fractional rest.
After these first "easy gains", all other "round down" basically produce the same cost. The higher the value, the most stable the cost, in accordance with the graph "round down cost elasticity".

smallest :  76 : cost  413.0 bits  -  count    413 : 1.10  (from 2 -> 1)
smallest :  70 : cost  413.0 bits  -  count    413 : 1.10  (from 2 -> 1)
smallest :  89 : cost  416.0 bits  -  count    416 : 1.11  (from 2 -> 1)
smallest :  87 : cost  440.5 bits  -  count    753 : 2.01  (from 3 -> 2)
smallest :  63 : cost  444.0 bits  -  count    759 : 2.02  (from 3 -> 2)
(..)

smallest : 110 : cost  544.1 bits  -  count  40919 : 109.01  (from 109 -> 108)
smallest : 117 : cost  544.2 bits  -  count  16031 : 42.71  (from 43 -> 42)
smallest : 115 : cost  544.4 bits  -  count  36788 : 98.00  (from 98 -> 97)
smallest : 104 : cost  544.6 bits  -  count  37561 : 100.06  (from 100 -> 99)
smallest : 116 : cost  544.7 bits  -  count  50027 : 133.27  (from 133 -> 132)
smallest :  32 : cost  544.7 bits  -  count 125551 : 334.47  (from 333 -> 332)
smallest : 100 : cost  544.8 bits  -  count  26623 : 70.92  (from 71 -> 70)
(...)


All these results point towards a simple idea :
could we just define a threshold under which all p values get "rounded down", and then fine-tune the remaining difference by attributing it to the largest value, since it makes almost no cost difference which p is rounded down as long as it is large enough (stable cost) ?

OK, let's try that.
The initial idea is to use as a threshold the converging value (1.443 = 1/ln(2)), although its relative position varies depending on p, from 0.443 to 0.5. It works, but we can do better.

One important side-effect regarding weight distribution concerns the very small values, those resulting in a realP < 1, which must be rounded up to p=1 (the minimum weight). The number of such small values varies greatly depending on the type and size of the data to compress. We could be lucky and have no small realP at all, or inversely have hundreds of them. Between those extremes, the most likely situation is that we will get a few of them.
Because of these round-ups, there is a hidden cost for all other symbol probabilities, which will have to "make room" for them (hence, round down). To take this collective cost into consideration, we will simply raise the decision bar on the graph.

[Graph : round up/down threshold]

Since the "round down cost elasticity" becomes very small as p becomes big (it ends up being a very thin line), above a certain value of p, all values are simply rounded down, irrespective of their fractional rest.
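
Putting these pieces together, here is a rough sketch of the idea (my own reconstruction for illustration; the actual FSE code differs, notably by using integer-only arithmetic, and the threshold and cut-off constants below are arbitrary placeholders) :

#include <stdio.h>

#define ROUNDUP_THRESHOLD    0.55   /* raised decision bar (placeholder value) */
#define ALWAYS_ROUNDDOWN_AT  8      /* above this weight, always round down (placeholder) */

/* Sketch of the threshold-based normalization : distribute (1<<tableLog) slots
   among symbols, rounding down by default and giving the remainder to the
   most frequent symbol, where the round-down cost is most stable. */
void normalizeFast(short* norm, unsigned tableLog,
                   const unsigned* count, unsigned total, unsigned maxSymbol)
{
    const unsigned tableSize = 1u << tableLog;
    unsigned s, largest = 0;
    int stillToDistribute = (int)tableSize;

    for (s = 0; s <= maxSymbol; s++)
    {
        double realP, rest;
        short p;
        if (count[s] == 0) { norm[s] = 0; continue; }
        if (count[s] > count[largest]) largest = s;
        realP = (double)count[s] * tableSize / total;
        p = (short)realP;                /* round down by default */
        rest = realP - p;
        if (p == 0) p = 1;               /* realP < 1 : round up to the minimum weight */
        else if ((p < ALWAYS_ROUNDDOWN_AT) && (rest > ROUNDUP_THRESHOLD)) p++;
        norm[s] = p;
        stillToDistribute -= p;
    }
    /* attribute the remaining difference to the largest symbol
       (corner cases, such as a negative remainder, are not handled here) */
    norm[largest] += (short)stillToDistribute;
}

int main(void)
{
    unsigned count[4] = { 600, 300, 80, 20 };   /* toy distribution, total = 1000 */
    short norm[4];
    normalizeFast(norm, 6, count, 1000, 3);     /* 64-slot table */
    printf("%i %i %i %i\n", norm[0], norm[1], norm[2], norm[3]);   /* sums to 64 */
    return 0;
}

Note the single pass over the symbols, with no sorting and no iterative correction : that simplicity is where the speed gain is expected to come from.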

This makes for a very simple algorithm, which we expect to be both fast and efficient.
Let's put that to the test.

First, let's have a look at final compression size.
As usual, we'll use "book1" as a single-block compression test :

Table  HC    Fast3   Fast2
 4K  435426  435398  435459 
 2K  436041  436072  436138
 1K  437756  437875  437952

(Note : Fast3 is the newly designed fast approximation algorithm; Fast2 is the one designed in a previous post)

As can be seen, Fast3 consistently beats Fast2, albeit by a small margin, since there is very little gain remaining.

The most important expected benefit is on the speed front : since the algorithm is much simpler, it should run faster.
To measure this effect, I've modified FSE so that it runs its normalization loop several thousand times per block, ensuring normalization becomes the dominant operation. The final result should not be interpreted as an absolute speed, but rather as a relative measure.
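
For reference, the shape of such a harness could look like this (a minimal sketch reusing the normalizeFast function from the sketch above; the real measurement lives inside FSE's block loop, and the iteration count here is arbitrary) :

#include <stdio.h>
#include <time.h>

void normalizeFast(short* norm, unsigned tableLog,
                   const unsigned* count, unsigned total, unsigned maxSymbol);  /* from the sketch above */

int main(void)
{
    unsigned count[4] = { 600, 300, 80, 20 };
    short norm[4];
    long sink = 0;                     /* prevents the loop from being optimized away */
    int i;
    clock_t start = clock();
    for (i = 0; i < 1000000; i++)      /* repeat so that normalization dominates the timing */
    {
        normalizeFast(norm, 12, count, 1000, 3);   /* 4K table */
        sink += norm[0];
    }
    printf("%.3f s (sink=%ld)\n", (double)(clock() - start) / CLOCKS_PER_SEC, sink);
    return 0;
}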

                  Fast3  Fast2
Norm. SpeedTest : 25.6   6.4

This one is pretty clear : the new algorithm is approximately 4x faster than the previous one.

Faster, stronger, simpler, that's a good deal.
The new algorithm is now integrated into the open source version of FSE hosted on Github.

------------------------------------------------------------
Epilogue :
Astute readers may have noted that Fast3 compresses better than HC with the 4K table setting.
How could that be ? HC is supposed to be the perfect, hence unbeatable, normalization !

Well, one answer is that the perfect normalization assumes a perfect entropy coder, which FSE is not : it's only a good approximation of one.

On the "book1" example, the threshold bar of Fast3 produces distributions which incidentally benefit from FSE irregularity.
Which introduce a new question : could we intentionally use these irregularities to boost compression ratio even further ?

Well, let's keep that question for a future blog post...