Friday, March 15, 2019

Presenting XXH3

xxh3

XXH3 - a new speed-optimized hash algorithm

The xxHash family of hash functions has proven more successful than anticipated. Initially designed as a checksum companion for LZ4, it has found its way into many more projects, featuring vastly different workloads.

I was recently summoned to investigate performance for a bloom filter implementation, which needs to quickly generate 64 pseudo-random bits from small inputs of variable length. XXH64 could fit the bill, but performance on small inputs was never its priority. It’s not completely wasteful either : it pays a bit of attention to short inputs, thanks to a small speed module in SMHasher. However, the module itself does the bare minimum, and it was not clear to me what exactly is measured.

So I decided to create my own benchmark program, as a way to ensure that I understand and control what’s being measured. This was a very interesting journey, leading to surprising discoveries.

The end result of this investigation is XXH3, a cross-over inspired by many other great hash algorithms, which proves substantially faster than existing variants of xxHash, across basically all dimensions.
Let’s detail those dimensions, and give some credit where inspiration is due.

Checksumming long inputs

xxHash started as a fast checksum for LZ4, and I believe it can still be useful for this purpose. It has proven popular among movie makers for file transfer verification, saving a lot of time thanks to its great speed. The main downside is that XXH64() is limited to 64-bit, which is insufficient when comparing a really large number of files (and by large I mean many millions of them). For this reason, a 128-bit variant has often been requested.

XXH3 features a wide internal state of 512 bits, which makes it suitable to generate a hash of up to 256 bits. For the time being, only 64-bit and 128-bit variants are exposed, but a similar recipe can be used for a 256-bit variant if there is any need for it one day. All variants feature the same speed, since only the finalization stage is different.

XXH3 bandwidth

I’m using this opportunity to compare with a few other well known hash algorithms, either because their notoriety makes them frequently named in discussions related to hash algorithms (FNV, CRC), or because they are very good in at least one dimension.

XXH3 proves very fast on large inputs, thanks to a vector-friendly inner loop, inspired by Bulat Ziganshin’s Farsh, itself based on the UMAC paper.

Unfortunately, UMAC features a critical flaw for checksumming, which makes it ignore 4 bytes of input, on average every 16 GB. This might not seem much, and it might even be acceptable if the goal is to generate a 32-bit checksum as in the original paper. But for checksumming large files with 64-bit or 128-bit fingerprints, this is a big no-no.
So the version embedded into XXH3 is modified, to guarantee that all input bytes are necessarily present in the final mix. This makes it a bit slower, but as can be seen in the graphs, it remains plenty fast.
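To make the flaw concrete, here is a minimal sketch of an NH-style accumulation step, as found in the UMAC family. It is an illustration under assumptions, not the actual XXH3 or Farsh code : the function names, the single accumulator and the key values are all hypothetical. When one keyed half wraps to zero, the product is zero whatever the other half contains, so 4 input bytes vanish. Each 4-byte half has a 2^-32 chance of cancelling its neighbour this way, which on random data means about 4 bytes lost every 2^32 x 4 bytes = 16 GB. Adding the raw input into the accumulator is one possible way to keep every byte present.

```c
#include <assert.h>
#include <stdint.h>

/* NH-style step : both 32-bit halves are keyed (modulo 2^32), then multiplied.
 * If (lo + key_lo) wraps to 0, the product is 0 whatever `hi` contains. */
static uint64_t nh_step(uint64_t acc, uint32_t lo, uint32_t hi,
                        uint32_t key_lo, uint32_t key_hi)
{
    uint32_t const a = lo + key_lo;
    uint32_t const b = hi + key_hi;
    return acc + (uint64_t)a * b;
}

/* Hypothetical variant : the raw input is added too, so `hi` still
 * reaches the accumulator when the product collapses to 0. */
static uint64_t nh_step_keep_input(uint64_t acc, uint32_t lo, uint32_t hi,
                                   uint32_t key_lo, uint32_t key_hi)
{
    uint64_t const input = ((uint64_t)hi << 32) | lo;
    return nh_step(acc, lo, hi, key_lo, key_hi) + input;
}

/* Demonstration : choose `lo` so that its keyed value is 0. */
static void nh_flaw_demo(void)
{
    uint32_t const key_lo = 0x9E3779B1u;   /* arbitrary key values */
    uint32_t const key_hi = 0x85EBCA77u;
    uint32_t const lo = 0u - key_lo;       /* lo + key_lo == 0 (mod 2^32) */
    /* two inputs differing only by `hi` collide in the plain step ... */
    assert(nh_step(0, lo, 1, key_lo, key_hi) == nh_step(0, lo, 2, key_lo, key_hi));
    /* ... but remain distinct once the raw input is added */
    assert(nh_step_keep_input(0, lo, 1, key_lo, key_hi)
        != nh_step_keep_input(0, lo, 2, key_lo, key_hi));
}
```

The extra addition is cheap, which is consistent with the small slowdown mentioned above.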

Vectorization must be done manually, using intrinsics, as the compiler seems unable to properly auto-vectorize the scalar code path.
For this reason, the code offers 4 paths : scalar (universal), SSE2, AVX2, and also NEON offered by Devin Hussey. It may be possible to vectorize additional platforms, though this requires dedicated efforts.

SSE2 is enough to reach substantial speed, which is great because all x64 cpus necessarily support this instruction set. SSE2 is also free of dynamic throttling issues, and is automatically enabled on all x64 compilers. Hence I expect it to be the most common target.

On a given code path, compilers can make a difference. For example, AVX2 vectorization is significantly more effective with clang. Actually, the speed of this variant is so fast that I was wondering if it was faster than my main memory. So I graphed the speed over a variety of input sizes.

XXH3 Bandwidth, per size

As one can see, the AVX2 build is much faster than main memory, and the impact of cache size is now clearly observable, at 32 KB (L1), 256 KB (L2) and 8 MB (L3). As a rule, “top speed” is only achievable when data is already in cache.

So is it worth being so fast ?
If data is very large (say, a movie), it can’t fit in the cache, so the bottleneck will be at best the main memory, if not the I/O system itself. In which case, a faster hash may save cpu time, but will not make the checksumming operation faster.

On the other hand, there are many use cases where data is neither large nor small, say in the KB range. This includes many types of records, typical of database workloads. In these use cases, hashing is not the main operation : it’s just one of many operations, sandwiched between other pieces of code. Input data is already in the cache, because it was needed anyway by these other operations. In such a scenario, faster hashing contributes to a faster overall run time, as cpu savings are employed by subsequent operations.

32-bit friendliness

The computing world is massively transitioning to 64-bit, even on mobile. The remaining space for 32-bit seems ever shrinking. Yet, it’s still present, in more places than can be listed. For example, many virtual environments generate bytecode designed to produce a 32-bit application.

Thing is, most modern hash algorithms take advantage of 64-bit instructions, which can ingest data twice as fast, so it’s key to great speed. Once translated for the 32-bit world, these 64-bit instructions can still be emulated, but at a cost. In most cases, it translates into a massive speed loss. That’s why XXH32 remains popular for 32-bit applications : it’s a great performer in this category.

A nice property of XXH3 is that it doesn’t lose so much speed when translated into 32-bit instructions. This is due to some careful choices in instructions used in the main loop. The result is actually pretty good :

XXH3, bandwidth in 32-bit mode

XXH3 can overtake XXH32, even without vector instructions ! Enabling SSE2 puts it in another league.

A similar property can be observed on ARM 32-bit. The base speed is very competitive, and the NEON vector code path designed by Devin works wonders, pushing speed to new boundaries.

Hashing small inputs

The high speed achieved on large input wasn’t actually the center of my investigation.
The main focus is about short keys of random lengths, with a distribution of length roughly in the 20-30 bytes area, featuring occasional outliers, both tiny and large.

This scenario is very different. Actually, with such small input, the vectorized inner loop is never triggered. Delivering a good quality hash result must be achieved using a small number of operations.

This investigation quickly converged onto Google’s CityHash, by Geoff Pike and Jyrki Alakuijala. This algorithm features an excellent access pattern for small data, later replicated into FarmHash, giving them an edge. This proved another major source of inspiration for XXH3.

A small concern is that CityHash comes in 2 variants, with or without seed. One could logically expect that both variants are “equivalent”, with one just setting a default seed value.
That’s not the case. The variant without seed forgoes the final avalanche stage, making it faster. Unfortunately, it also makes it fail SMHasher’s avalanche test, showing very large bias. For this reason, I will distinguish both variants in the graph, as the speed difference on small inputs is quite noticeable.

The benchmark test looks simple enough : just loop over some small input of known size, and count the number of hashes produced. Size is only known at run time, so there’s no way for the compiler to “specialize” the code for a given size. There are some finicky details in ensuring proper timing, but once solved, it gives an interesting ranking.

XXH3, throughput, small fixed size

Top algorithms are based on the same “access pattern”, and there are visible “steps” on reaching 33+ length, and then again at 65+. That’s because, in order to generate fewer branches, the algorithm does exactly the same work from 33 to 64 bytes. So the number of instructions to run is comparatively large for 33 bytes.
In spite of this, XXH3 maintains a comfortable lead even at “bad” length values (17, 33, 65).

These first results look good, but they are not yet satisfying.
Remember the “variable size” requirement ?
This is not met by this scenario.

Impact of variable input sizes

Always providing the same input size is simply too easy for branches. The branch predictor can do a good job at guessing the outcome every time.

This is just not representative of most real-life scenarios, where there’s no such predictability. Mix inputs of different sizes, and it wreaks havoc on all these branches, adding a considerable cost at each hash. This impact is often overlooked, because measuring it is a bit more difficult. But it’s important enough to deserve some focus.

In the following scenario, input sizes are presumed randomly distributed between 1 and N. The distribution of lengths is pre-generated, and the same distribution is used for all hashes for a given N. This leans towards a worst-case scenario : generally, input sizes feature some kind of locality (as in the target scenario, mostly between 20 and 30 bytes). But it gives us a good idea of how algorithms handle varying sizes.
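The setup can be sketched as follows. This is a minimal illustration under assumptions, not the benchmark program itself : the function names are hypothetical, FNV-1a is only a stand-in for the hash under test, xorshift64 is an arbitrary choice of generator, and the timing code around the loop is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t (*hash_fn)(const void* data, size_t len);

/* Stand-in hash (FNV-1a, 64-bit), only there to make the sketch runnable. */
static uint64_t fnv1a64(const void* data, size_t len)
{
    const unsigned char* const p = (const unsigned char*)data;
    uint64_t h = 0xCBF29CE484222325ULL;
    size_t i;
    for (i = 0; i < len; i++) { h ^= p[i]; h *= 0x100000001B3ULL; }
    return h;
}

/* Pre-generate nb lengths in [1, maxLen], from a fixed seed (xorshift64),
 * so that every hash is measured against the same sequence for a given N. */
static void generate_lengths(size_t* lengths, size_t nb, size_t maxLen, uint64_t seed)
{
    uint64_t x = seed | 1;   /* xorshift state must not be 0 */
    size_t i;
    assert(maxLen >= 1);
    for (i = 0; i < nb; i++) {
        x ^= x << 13; x ^= x >> 7; x ^= x << 17;
        lengths[i] = 1 + (size_t)(x % maxLen);
    }
}

/* Loop to be timed : the buffer must hold at least maxLen bytes.
 * Results are accumulated so that the compiler cannot discard the calls. */
static uint64_t hash_random_lengths(hash_fn h, const void* buffer,
                                    const size_t* lengths, size_t nb)
{
    uint64_t sink = 0;
    size_t i;
    for (i = 0; i < nb; i++) sink += h(buffer, lengths[i]);
    return sink;
}
```

Generating the lengths ahead of time keeps the cost of the random generator out of the measured loop.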

XXH3, throughput on small inputs of random length

This is a more significant victory for algorithms with an optimized access pattern. When input sizes become unpredictable, branch mispredictions become a much larger contributor to performance. The optimized access pattern makes the workload more predictable, and reduces the number of branches which can be mispredicted. This is key to preserve a good level of performance in these conditions.

Throughput versus Latency

Throughput is relatively simple to measure : just loop over a bunch of inputs, hash them, then count the number of hashes completed in a given time.
But one could wonder if throughput is an appropriate metric. It represents a “batch” workload, where a ton of hashes are feverishly completed one after another. It may happen sometimes.

But in many cases, hashing is just one operation sandwiched between other very different tasks. This is a completely different background.
In this new setup, hashing must wait for the prior operation to complete in order to receive its input, and the later operation is blocked as long as the hash result is not produced. Hence latency seems a much better metric.

However, measuring latency is a lot more complex. I had many false starts in this experiment.
I initially thought that it would be enough to provide the result of the previous hash as seed of the next hash. It doesn’t work : not only do some algorithms not take a seed as argument, a few others only use the seed at the very end of the calculation, letting them start hash calculations before the end of the previous hash.
In reality, in a latency scenario, the hash is waiting for the input to be available, so it’s the input which must be based on previous hash result. After a lot of pain, the better solution was finally suggested by Felix Handte : use a pre-generated buffer of random bytes, and start hashing from a variable position derived from previous hash result. It enforces that next hash has to wait for previous hash result before starting.
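This dependency chain can be sketched in a few lines. It is a minimal illustration under assumptions, not the actual benchmark code : the names are hypothetical, FNV-1a stands in for the hash under test, and timing is left out. Note that the modulo used to derive the position adds a constant cost to every iteration; a buffer sized so that a simple mask is enough would be cheaper.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t (*hash_fn)(const void* data, size_t len);

/* Stand-in hash (FNV-1a, 64-bit), only there to make the sketch runnable. */
static uint64_t fnv1a64(const void* data, size_t len)
{
    const unsigned char* const p = (const unsigned char*)data;
    uint64_t h = 0xCBF29CE484222325ULL;
    size_t i;
    for (i = 0; i < len; i++) { h ^= p[i]; h *= 0x100000001B3ULL; }
    return h;
}

/* Fill the buffer with pseudo-random bytes, once, before the measurement. */
static void fill_random(unsigned char* buffer, size_t size, uint64_t seed)
{
    uint64_t x = seed | 1;   /* xorshift state must not be 0 */
    size_t i;
    for (i = 0; i < size; i++) {
        x ^= x << 13; x ^= x >> 7; x ^= x << 17;   /* xorshift64 */
        buffer[i] = (unsigned char)(x >> 32);
    }
}

/* Latency-style loop : the start position of each input is derived from
 * the previous result, so hash N+1 cannot begin before hash N completes. */
static uint64_t hash_chained(hash_fn h, const unsigned char* buffer, size_t bufferSize,
                             size_t len, size_t nbHashes)
{
    size_t nbPositions;
    uint64_t result = 0;
    size_t i;
    assert(len <= bufferSize);
    nbPositions = bufferSize - len + 1;
    for (i = 0; i < nbHashes; i++) {
        size_t const pos = (size_t)(result % nbPositions);
        result = h(buffer + pos, len);
    }
    return result;
}
```

Since the input address is unknown until the previous hash is produced, no part of the next hash can be started early, whatever the way the algorithm uses its seed.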

This new setup creates a surprisingly different ranking :

XXH3, latency, fixed size

Measurements are a bit noisy, but trends look visible.

The latency-oriented test favors algorithms like Vladimir Makarov’s mumv2 and Leo Yuriev’s t1ha2, using the 64x64=>128-bits multiplication. This proved another source of inspiration for XXH3.

CityHash suffers in this benchmark. CityHash is based on simpler instructions, and completing a hash requires many more of them. In a throughput scenario, where there is no serialization constraint, CityHash can start the next hash before finishing the previous one. Its simple instructions can be spread more effectively over multiple execution units, achieving a high level of IPC (Instructions per Clock). This makes CityHash throughput friendly.

In contrast, the 64x64=>128-bits multiplication has access to a very restricted set of ports, but is more powerful at mixing bits, allowing usage of fewer instructions to create a hash with good avalanche properties. Fewer instructions translate into a shorter pipeline.
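For reference, here is a minimal sketch of such a multiplication, followed by a fold of the 128-bit product back into 64 bits. It relies on `unsigned __int128`, a gcc / clang extension available on 64-bit targets. The function name is hypothetical, and folding with XOR is an assumption for the illustration : some designs add both halves instead.

```c
#include <assert.h>
#include <stdint.h>

/* Multiply two 64-bit values into a full 128-bit product,
 * then fold the high and low halves together. */
static uint64_t mul128_fold64(uint64_t a, uint64_t b)
{
    unsigned __int128 const product = (unsigned __int128)a * b;
    return (uint64_t)product ^ (uint64_t)(product >> 64);
}

/* A few values which can be verified by hand. */
static void mul128_fold64_check(void)
{
    assert(mul128_fold64(2, 3) == 6);            /* fits in the low half */
    assert(mul128_fold64(1ULL << 63, 2) == 1);   /* 2^64 : low = 0, high = 1 */
}
```

A single multiplication lets every bit of one operand influence many bits of the product, including the high half, which would otherwise require a longer chain of simpler instructions.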

In the latency scenario, mumv2 fares very well, fighting for first place up to the 32-byte mark, after which XXH3 starts to take a lead.

However, this scenario involves fixed input size. It’s simple to code and explain, but as we’ve seen before, fixed size is actually an uncommon scenario : for most real-world use cases, input has an unpredictable size.

Hence, let’s combine the benchmark techniques seen previously, and look at the impact of random input lengths on latency.

XXH3, latency, random length

This is an important graph, as it matches the target use case of XXH3, and incidentally many real-world database/server use cases I’m aware of.

The variable size scenario favors algorithms using an optimized access pattern to reduce branch misprediction. mumv2, which was performing very well when input size was stable, loses a lot in this scenario. t1ha2 makes a better effort, and while not as well optimized as CityHash for this purpose, loses nonetheless much less performance to variable input size, taking second place (if one does not count the “seed-less” variants in the ranking, due to the aforementioned avalanche problems).

As could be expected, XXH3 is well tuned for this scenario. It’s no surprise since it was its target. So it’s basically mission accomplished ?

Hash Quality

It wouldn’t be a complete presentation without a note on hash quality. A good hash should make collisions as rare as possible, bounded by the birthday paradox, and offer great avalanche property : two different inputs shall produce vastly different output, even if they only differ by a single bit.

As expected, XXH3 completes all tests from the SMHasher test suite. Both 64 and 128-bit variants were validated, as well as each of their 32-bit constituents.

But it went a bit farther.
SMHasher was designed many years ago, at a time when hashing was mostly a single main loop iterating over input. But as hash algorithms have become more powerful, this model feels no longer correct : modern hashes tend to feature a large inner loop, which is only triggered after a certain length. That means that the algorithm being tested when there are only a few input bytes is actually different from the one run on large inputs.

Because search space tends to explode with input size, and because computing capacity used to be weaker when SMHasher was created, most tests are concentrated on small inputs. As a consequence, tests for larger input sizes are very limited.

In order to stress the algorithm, it was necessary to push the tests beyond their usual limits. So I created a fork of rurban’s excellent SMHasher fork, methodically increasing limits to new boundaries. It’s still the same set of tests, but exploring a larger space, hence longer to run.
This proved useful during the design stage, eliminating risks of long-distance “echo” for example (when bits cancel each other by virtue of being at some exact relative position).
It also proved interesting to run these extended tests on existing algorithms, uncovering some “surprises” that were masked by the lower threshold of original tests.
To this end, these changes will be offered back to rurban’s fork, in the hope that they will prove useful for future testers and implementers.

Release

XXH3 is now released as part of xxHash v0.7.0. It’s still labelled “experimental”, and must be unlocked using macro XXH_STATIC_LINKING_ONLY. It’s suitable for ephemeral data and tests, but avoid storing long-term hash values yet. This period will be used to gather users’ feedback, after which the algorithm will be transferred into stable in a future release.

Update: Since the release of xxHash v0.8.0, XXH3 is now labelled "stable", meaning produced hash values can be stored on disk or exchanged over a network, as any future version is now guaranteed to produce the same hash value. Compared with the initial release, v0.8.0 comes with streaming capabilities, 128-bit variant support, and better inlining.

Wednesday, January 30, 2019

Compiler-checked contracts

 btrc: compile6 : compiler-validated contracts

Contract programming is not a new concept. It’s a clean way to design narrow contracts, by explicitly spelling out their conditions, and checking them actively at the contract’s interface. In essence, it’s translated into a bunch of assert() at the entrance and exit of a function. It’s a fairly good formal design, although one associated with a runtime penalty.

We left the previous episode with an ability to express function preconditions and make them checked by the compiler, but no good way to transport the outcome of these checks into the function body. We’ll pick up from there.

The proposed solution is to re-express these invariants in the function as assert(), as they should have been anyway if EXPECT() was absent. It works, but it also means that downstream EXPECT() can only be validated when assert() are enabled, aka. in debug builds.

Let’s try to improve this situation, and keep EXPECT() active while compiling in release mode, aka with assert() disabled.
What is needed is an assert() that still achieves its outcome on Value Range Analysis while disabled. Such a thing exists, and is generally called an assume().

assume()

assume() is not part of the language, but most compilers offer some kind of hook to build one. Unfortunately, they all differ.

On gcc, assume() can be created using __builtin_unreachable() :

#define assume(cond) do { if (!(cond)) __builtin_unreachable(); } while (0)

clang provides __builtin_assume(). icc and Visual provide __assume(). etc.
You get the idea.

An important point here is that, in contrast with all techniques seen so far, assume() actually reduces the compiler’s effectiveness at catching bugs. It’s an explicit “trust me, I’ll tell you what to know” situation, and it’s easy to get it wrong.

One way to mitigate this negative impact is to make sure assume() are converted into assert() within debug builds, so that there is at least a chance that wrong assumptions get caught during tests.
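Putting these pieces together, a possible definition looks like the sketch below. The selection logic is an assumption on my side and may need adjustments for a given toolchain. The order of tests matters : clang and icc can also define __GNUC__, so they are checked first. div_by_16() is a hypothetical example function.

```c
#include <assert.h>

#if !defined(NDEBUG)
   /* debug build : the assumption is verified at run time */
#  define assume(cond)  assert(cond)
#elif defined(__clang__)
#  define assume(cond)  __builtin_assume(cond)
#elif defined(__INTEL_COMPILER) || defined(_MSC_VER)
#  define assume(cond)  __assume(cond)
#elif defined(__GNUC__)
#  define assume(cond)  do { if (!(cond)) __builtin_unreachable(); } while (0)
#else
#  define assume(cond)  ((void)0)   /* unknown compiler : no hint */
#endif

/* Example of a simple condition : knowing that i is not negative,
 * the compiler may translate the division into a plain shift. */
static int div_by_16(int i)
{
    assume(i >= 0);
    return i / 16;
}
```

With this definition, a wrong assumption triggers an assert() failure in debug builds, instead of silently becoming undefined behavior.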

assume(), however, comes with additional restrictions compared to assert(). assert() merely needs to produce no side-effect, and offer a tractable runtime cost, though even this last point is negotiable. But assume() must be transformed into pure compiler hints, leaving no trace in the generated binary (beyond the assumption’s impact). In particular, the test itself should not be present in the generated binary.
This reduces eligible tests to simple conditions only, such as (i>=0) or (ptr != NULL). A counter example would be (is_this_graph_acyclic(*graph)). “complex” conditions will not provide any useful hint to the compiler, and on top of that, may also leave a runtime trace into the generated binary, resulting in a reduction of performance.

Note : I haven’t found a way to ensure this property on gcc : if assume() is served a complex condition, it will happily generate additional asm code, without issuing any warning.
Fortunately, clang is way better at this game, and will correctly flag bad assume() conditions, which would generate additional code in gcc.
As a consequence, it’s preferable to have clang available and check the code with it from time to time to ensure all assume() conditions are correct.

In our case, the main objective of assume() is not really performance : it is to forward conditions already checked by EXPECT() into the function’s body, so that they can be re-used to automatically comply with downstream EXPECT() conditions, even when assert() are disabled, aka during release compilation.

So here we are : every time a condition is required by EXPECT() and cannot be deduced from the local code, express it using assume() rather than assert(). This will make it possible to keep EXPECT() active irrespective of the debug nature of the build.
Note that, if any condition is too complex for assume(), we are back to square one, and need to rely on assert() only (hence debug builds only).

Being able to keep EXPECT() active in release builds is nice, but not terrific. At this stage, we still need to write all these assume() in the code, and we cannot take advantage of pre-conditions already expressed at the entrance of the function.

Worse, since pre-conditions are expressed on one side, in the *.h header where function prototype is published, while the corresponding assume() are expressed in the function body, within *.c unit file, that’s 2 separate places, and it’s easy to lose sync, when one side changes the conditions.

Expressing conditions in one place

What we need is to express preconditions in a single source of truth. This place should preferably be close to the prototype declaration, since it can also serve as function documentation. Then the same conditions will be used in the function body, becoming assumptions.

The solution is simple : define a macro to transport the conditions in multiple places.
Here is an example.
The conditions are transferred from the header, close to prototype declaration, into the function body, using a uniquely named macro. It guarantees that conditions are kept in sync.
In the example, note how the knowledge of minus2() preconditions, now considered satisfied within function body, makes it possible to automatically comply with the preconditions of invoked minus1(), without adding any assert() or assume().
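The pattern can be sketched as follows. To keep the sketch compilable anywhere, EXPECT() and assume() are reduced here to plain assert() : these are stand-ins (an assumption), whereas the real EXPECT() is the compile-time check built in the previous episode. Header and unit parts are shown in a single listing for brevity.

```c
#include <assert.h>

/* Stand-ins, for illustration only */
#define EXPECT(cond)  assert(cond)   /* real one : compile-time check */
#define assume(cond)  assert(cond)   /* debug form of assume() */

/* ---- header part : prototype, conditions, masking macro ---- */
int minus1(int i);
#define minus1_preconditions(i)  ((i) >= 1)
#define minus1(i)  ( EXPECT(minus1_preconditions(i)), minus1(i) )

int minus2(int i);
#define minus2_preconditions(i)  ((i) >= 2)
#define minus2(i)  ( EXPECT(minus2_preconditions(i)), minus2(i) )

/* ---- unit part : bodies re-use the very same conditions ---- */
/* Parentheses around the name prevent expansion of the masking macro. */
int (minus1)(int i)
{
    assume(minus1_preconditions(i));
    return i - 1;
}

int (minus2)(int i)
{
    assume(minus2_preconditions(i));   /* i >= 2 is now known ... */
    return minus1(i) - 1;              /* ... hence i >= 1, no extra code needed */
}
```

Changing minus2_preconditions() in the header automatically changes what the body assumes, so both sides cannot drift apart.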

In this example, the condition is trivial (i>=2), using a single argument. Using a macro to synchronize such a trivial condition may seem overkill. However, synchronization is important in its own right. Besides, more complex functions, featuring multiple conditions on multiple arguments, will be served by a design pattern which can be just reproduced mindlessly, whatever the complexity of the preconditions : assume(function_preconditions());.

There is still a variable element, related to the number of arguments and their order.
To deal with that variance, argument names could be baked directly into the preconditions macro. Unfortunately, this would only work within a function. But since the macro transporting preconditions is itself invoked within a macro, it wouldn’t expand correctly.

Another downside is that we just lost a bit of clarity in the warning message : conditions themselves used to be part of the warning message, now only the macro name is, which transmits less information.
Unfortunately, I haven’t found a way around this issue.

To preserve clarity of the warning message, it may be tempting to keep the previous format, with conditions expressed directly in the masking function macro, whenever such conditions are not required afterwards in the body. However, it creates special cases, with some functions which replicate conditions in their body, and some that don’t.

Transmitting preconditions compliance into the function body makes it easier to comply with a chain of preconditions. As a consequence, it becomes more tractable to apply compile-time preconditions to a larger part of the code base.

Post conditions

Yet we are not completely done, because the need to check preconditions implies that all contributors of any parameter are part of the game. Function’s return values themselves are contributors.

For example, one may invoke a function f1() requiring an argument i>0.
The said argument may be provided as a return value of a previous function f2().
f2() might guarantee in its documentation that its return value is necessarily >0, hence is compliant,
but the compiler doesn’t read the documentation. As far as it’s concerned, the return value could be any value the type allows.

The only way to express this situation is to save the return value into an intermediate variable,
and then assert() or assume() it with the expected guarantee,
then pass it to the second function.
This is a bit more verbose than necessary, especially as f2() was already fulfilling the required preconditions. Besides, if f2() guarantees change, the local assumption will no longer be correct.
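In code, the intermediate-variable approach looks like the sketch below. As before, EXPECT() and assume() are reduced to assert() stand-ins (an assumption) so that the sketch compiles anywhere, and the bodies of f1() and f2() are hypothetical.

```c
#include <assert.h>

/* Stand-ins, for illustration only */
#define EXPECT(cond)  assert(cond)   /* real one : compile-time check */
#define assume(cond)  assert(cond)   /* debug form of assume() */

/* f2() : its documentation guarantees a return value > 0 */
static int f2(int v) { return (v & 0xFF) + 1; }

/* f1() : requires i > 0 */
static int f1(int i);
#define f1_preconditions(i)  ((i) > 0)
#define f1(i)  ( EXPECT(f1_preconditions(i)), f1(i) )
static int (f1)(int i)
{
    assume(f1_preconditions(i));
    return 1000 / i;
}

static int caller(int v)
{
    int const r = f2(v);   /* 1. save the return value */
    assume(r > 0);         /* 2. restate the guarantee of f2() locally */
    return f1(r);          /* 3. the precondition of f1() can now be validated */
}
```

Step 2 is the fragile part : it duplicates a guarantee which belongs to f2(), and nothing keeps both in sync.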

Guarantees on function’s outcome are also called post-conditions. The whole game is to pass this information to the compiler.

This could be done by bundling the post-conditions into the macro invoking the function.
Unfortunately, that’s a bit hard to achieve with a portable macro, usual woes get in the way : single-evaluation, variable declarations and returning a value are hard to achieve together.

For this particular job, we are better off using an inline function.
See this example on godbolt.
It works almost fine : the guarantees from first function are used to satisfy preconditions of second function. This works without the need to locally re-assess first function’s guarantees.
As an exercise, removing the post-conditions from encapsulating inline function immediately triggers a warning on second invocation, proving it’s effective.

However, we just lost a big property by switching to an inline function : warnings now locate precondition violations inside the inline function, instead of the place where the function is invoked with incorrect arguments. Without this information, we just know there is a contract violation, but we don’t know where. This makes fixing it noticeably more difficult.

To circumvent this issue, let’s use a macro again. This time we will combine a macro to express preconditions with an inlined function to express outcome guarantees. Here is an example.
This one gets it right on almost everything : it’s portable, conditions and guarantees are transferred to the compiler, which triggers a warning whenever a condition is not met, indicating the correct position of the problem.

There is just one last little problem : notice how the input parameter v gets evaluated twice in the macros. This is fine if v is a variable, but not if it’s a function call. Something like f1( f2(v) ) will evaluate f2() twice, which is bad, both for runtime and potentially for correctness, should f2(v) return a different value on second invocation.

It’s a pity because this problem was solved in the first proposal, using only an inline function. It just could not forward the position where a condition was broken. Now we are left with two incomplete proposals.

Let’s try it again, using a special kind of macro.
gcc and by extension clang support a special kind of statement expression, which makes it possible to create a compound statement able to return a value (its last expression). This construction is not portable. In general, I wouldn’t advocate it due to portability restrictions. But in this case, EXPECT() only works on gcc to begin with, so it doesn’t feel too bad to use a gcc specific construction. It simply must be disabled on non-gcc targets.

The new formulation, reproduced below, works perfectly, and now enforces the contract while avoiding the double-evaluation problem, and correctly indicates the position at which a condition is violated, significantly improving diagnosis.

int positive_plus1(int v); 
#define positive_plus1_preconditions(v)   ((v)>=0)     // Let's first define the conditions. Name is long, because conditions must be unique to the function.
#define positive_plus1_postconditions(r)   ((r)>0)     // Convention : r is the return value. Only used once, but published close to the prototype, for documentation.

// Encapsulating macro
// The macro itself can be published in another place of the header,
// to leave complete visibility to the prototype and its conditions.
// This specific type of macro is called a statement-expression,
// a non-portable construction supported by `gcc` (and `clang`)
// It's okay in this case, because `EXPECT()` only works with `gcc` anyway.
// But it will have to be disabled for non-gcc compilers.
#define positive_plus1(iv) ({                                             \
    int const _v = (iv); /* avoid double-evaluation of iv */              \
    int _r;                                                               \
    EXPECT(positive_plus1_preconditions(_v));  /* also used within function body */ \
    _r = positive_plus1(_v);                                              \
    assume(positive_plus1_postconditions(_r)); /* only used here */       \
    _r;   /* last expression is the return value of compound statement */ \
 })

Summary

That’s it. This construction gives all the tools necessary to use compiler-checked contracts in a C code base. Such strong checks increase the reliability of the code base, especially during refactoring exercises, by catching at compile time all potential contract breaches, and requiring the programmer to deal with them, either through branches or at least by explicitly assert()-ing them. This is a big step up from a situation where breaking a condition was plain silent at compilation, and would only be caught during tests, provided the assert() was not forgotten and some test case was able to break the condition.

It can be argued that applying this design pattern makes declaring functions more verbose, and it’s true. The effort though was supposed to be already done in a different way : as part of code documentation, and as part of runtime checks (list of assert() within function body). The difference is that they are expressed upfront, and are known to the compiler, which is more powerful.

Nonetheless, it would be even better if conditions could become part of the function signature, making the notation clearer, better supported, and by extension possibly compatible with automatic documentation or IDE’s context info, simplifying their presentation.
There is currently a C++20 proposal, called attribute contract, which plans to offer something close. Okay, it’s not C, and quite importantly it is a bit different in subtle ways : it’s more focused on runtime checks. There is a specific [[expects axiom: (...)]] notation which seems closer to what is proposed in this article, because it doesn’t silently insert automatic runtime checks. However, as far as I know, it also doesn’t guarantee any compiler check, reducing the contract to a simple assume(). It implies this topic is left free to compiler’s willingness, which may or may not pick it up, most likely resulting in significant behavior differences.

Fortunately, the trick presented in this article is available right now, and doesn’t need to wait for any committee : it can be used immediately on existing code bases.

I hope this article will raise awareness on what compilers already know as part of their complex machinery primarily oriented towards better runtime performance, and make a case on how to re-purpose a small part of it to improve correctness too.

Monday, January 28, 2019

Compile-time tests

btrc: compile5 : compile-time tests

Compile-time tests

A function generally operates on states and parameters. The function’s result is deemed valid if its inputs respect a number of (hopefully documented) conditions. It can be as simple as saying that a size should be positive, and a state should be already allocated.

The usual way to check if conditions are met is to assert() them, right at the beginning of the function. The assert() adds a runtime check, which is typically active during tests. The hope is that, if tests are thorough enough, any scenario which can violate the conditions will be found during tests, and fixed.

As one can already guess, this method is imperfect. Don’t get me wrong: adding assert() is way way better than not adding them, but the whole premise is the hope that tests will be good enough to find the bad paths leading to a condition violation, and one can never be sure that all bad paths were uncovered.

In some cases, it’s possible to transfer a check at compile time instead.
It only works for a subset of what can be checked. But whatever is validated at compilation stage carries much stronger guarantees : it’s like a mini-proof that always holds, for whatever state the program is in.

As a consequence, it eliminates the need for a runtime check, which saves cpu and binary size.
More importantly, it removes the need of a “failure code path”, requiring the caller to test and consider carefully what must be done when an incorrect condition happens. This leads to a corresponding simplification of the code, with massive maintenance benefits.
On top of that, since the condition can be checked immediately during compilation or parsing, it’s right in the short feedback loop of the programmer, allowing failures to be identified and fixed quickly.

This set of benefits is too strong to miss. As a general rule, whatever can be checked at compile time should be.

static assert

Invariant guarantees can be checked at compile time with a static_assert(). Compilation will stop, with an error, if the invariant condition is not satisfied. A successful compilation necessarily means that the condition is always respected (for the compilation target).

A typical usage is to ensure that the int type of target system is wide enough. Or that some constants respect a pre-defined order. Or, as suggested in an earlier article, that a shell type is necessarily large enough to host its target type.

It has all the big advantages mentioned previously : no trace in the generated binary, no runtime nor space cost, no reliance on runtime tests to ensure that the condition is respected.
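As a minimal sketch of these three usages, assuming a C11 compiler: the constants and the two types below are hypothetical, invented only for illustration.

```c
#include <assert.h>   /* in C11, also provides the static_assert macro */
#include <limits.h>   /* CHAR_BIT */

/* Hypothetical constants and types, for illustration only */
enum { LEVEL_MIN = 1, LEVEL_DEFAULT = 3, LEVEL_MAX = 9 };
typedef struct { int id; char name[12]; } target_t;
typedef union  { char body[32]; long long alignment; } shell_t;

/* The int type of the target system is wide enough */
static_assert(sizeof(int) * CHAR_BIT >= 32, "int must be at least 32 bits wide");
/* Constants respect a pre-defined order */
static_assert(LEVEL_MIN <= LEVEL_DEFAULT && LEVEL_DEFAULT <= LEVEL_MAX, "levels must be ordered");
/* The shell type is large enough to host its target type */
static_assert(sizeof(shell_t) >= sizeof(target_t), "shell type too small for its target");

/* Thanks to the ordering assertion, the result can never be negative */
int level_range(void) { return LEVEL_MAX - LEVEL_MIN; }
```

None of the three assertions leaves anything in the binary: if one of them is false, the build simply stops.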

C90 compatibility

static_assert() is a special macro added in the C11 standard. While most modern compilers are compatible with this version of the standard, if you plan on making your code portable on a wider set of compilers, it’s a good thing to consider an alternative which is compatible with older variants, such as C90.

Fortunately, it’s not that hard. static_assert() started its life as a “compiler trick”, and many variants can be found on the Internet. The basic idea is to transform a condition into an invalid construction, so that the compiler must issue an error at the position of the static_assert(). Typical tricks include :

  • defining an enum value as a constant divided by 0
  • defining a table whose size is negative

For example :

#define STATIC_ASSERT(COND,MSG) typedef char static_assert_##MSG[(COND)?1:-1]

One can find multiple versions, with different limitations. The macro above has the following ones :

  • cannot be expressed in a block after the first statement for C90 compatibility (declarations before statements)
  • require different error messages to distinguish multiple assertions
  • require the error message to be a single uninterrupted word, without double quotes, differing from C11 version

The 1st restriction can be circumvented by putting brackets around { static_assert(); } whenever needed.
The 2nd one can be improved by adding a __LINE__ macro as part of the name, thus making it less probable for two definitions to use exactly the same name. The macro definition becomes more complex though.
The last restriction is more concerning: it’s a strong limitation, directly incompatible with C11.
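To give an idea of the added complexity, here is one possible sketch of the __LINE__ variant. The SA_PASTE helper names are mine, not part of any standard. The runtime assert() at the end is only there for contrast: its condition depends on a variable, so it cannot become a static one.

```c
#include <assert.h>   /* runtime assert(), for conditions which are not constants */

/* Two-level concatenation : __LINE__ must be expanded before being pasted */
#define SA_PASTE_(a, b) a##b
#define SA_PASTE(a, b)  SA_PASTE_(a, b)
#define STATIC_ASSERT(COND, MSG) \
    typedef char SA_PASTE(static_assert_##MSG##_line_, __LINE__)[(COND) ? 1 : -1]

/* Checked during compilation : a false condition gives the table a negative size */
STATIC_ASSERT(sizeof(unsigned) >= 2, unsigned_too_narrow);
STATIC_ASSERT(sizeof(long) >= sizeof(int), long_narrower_than_int);

/* Keeps the lowest nbBits bits of v */
unsigned low_bits(unsigned v, unsigned nbBits)
{
    assert(nbBits < 16);               /* depends on a variable : remains a runtime check */
    return v & ((1u << nbBits) - 1u);  /* the shift is valid : unsigned is at least 2 bytes */
}
```

Two assertions can still collide if they share both the same message and the same line number, for example in two different headers, which is why the text above only says "less probable".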

That’s why I rather recommend this more complete version by Joseph Quinsey, which makes it possible to invoke the macro the same way as the C11 version, allowing to switch easily from one to another. The declaration / statement limitation for C90 is still present, but as mentioned, easily mitigated.

Limitations

A huge limitation is that static asserts can only reason about constants, whose values are known at compile time.

Constants, in the C dialect, cover a very restricted set :

  • Literal values, e.g. 222.
  • Macros which expand to literal values, e.g. #define value 18
  • enum values
  • sizeof() results
  • Mathematical operations over constants which can be solved at compile time, e.g. 4+1 or even ((4+3) << 2) / 18.

As a counter-example, one might believe that const int i = 1; is a constant, as implied by the qualifier const. But it’s a misnomer : it does not define a constant, merely a “read-only” (immutable) value.

Therefore it’s not possible to static_assert() conditions on variables, not even const ones. It’s also not possible to express conditions using functions, not even pure ones (only macro replacements are valid).
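The distinction can be sketched as follows, assuming a C11 compiler. All names are hypothetical. The rejected case is left as a comment, since it is not expected to compile as C.

```c
#include <assert.h>   /* static_assert (C11) */

#define TABLE_SIZE_MACRO 16              /* constant : macro expanding to a literal */
enum { TABLE_SIZE_ENUM = 16 };           /* constant : enum value */
static const int table_size_var = 16;    /* not a constant in C : a read-only variable */

static_assert(TABLE_SIZE_MACRO == 16, "macro : accepted");
static_assert(TABLE_SIZE_ENUM == TABLE_SIZE_MACRO, "enum : accepted");
static_assert(sizeof(char[TABLE_SIZE_ENUM]) == 16, "sizeof : accepted");

/* static_assert(table_size_var == 16, "const variable");
 * This one is not an integer constant expression in C,
 * so a C compiler is entitled to reject it. */

int table_capacity(void) { return table_size_var; }
```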

This obviously strongly restrains the set of conditions that can be expressed with a static_assert().

Nonetheless, every time static_assert() is a valid option, it’s recommended to use it. It’s a very cheap, efficient zero-cost abstraction which guarantees an invariant, contributing to a safer code generation.

Arbitrary conditions validated at compile time

Checking an arbitrary condition at compile time? like a runtime assert() ? That sounds preposterous.
Yet, that’s exactly what we are going to see in this paragraph.

The question asked changes in a subtle way : it’s no longer “prove that the condition holds given current value(s) in memory”, but rather “prove that the condition can never be false”, which is a much stronger statement.

The benefits are similar to static_assert() : as the condition is guaranteed to be met, no need to check it at run time, hence no runtime cost, no need for a failure path, no reliance on tests to detect bad cases, etc.

Enforcing such a strong property may seem a bit overwhelming. However, that’s exactly what is already required by the standard, for any operation featuring undefined behavior as a consequence of violation of their narrow contract.
The real problem is that the full responsibility of knowing and respecting the contract is transferred onto the programmer, which receives, by default, no compile-time signal to warn when these conditions are broken.

Compile-time condition validation reverses this logic, and ensures that a condition is always met if it passes compilation. This is a big change, with corresponding safety benefits.

This method is not suitable for situations determined by some unpredictable runtime event. For example, it’s not possible to guarantee that a certain file will exist at runtime, so trying to open a file always requires a runtime check.

But there are a ton of conditions that the programmer expects to be always true, and whose violation necessarily constitutes a programming error. These are our targets.

Example

Let’s give a simple example :
dereferencing a pointer requires that, as a bare minimum, the pointer is not NULL. It’s not a loose statement, like “this pointer is probably not NULL in general”, it must be 100% true, otherwise, undefined behavior is invoked.
How to ensure this property then ?

Simple : test if the pointer is NULL, and if it is, do not dereference it, and branch elsewhere.
Passing the branch test guarantees the pointer is now non-NULL .

This example is trivial, yet very applicable.
It’s extremely common to forget such a test, since there’s no warning for the programmer. A NULL pointer can happen due to exceptional conditions which can be difficult to trigger during tests, such as a rare malloc() failure for example.

And that’s just a beginning : most functions and operations feature a set of conditions to be respected for their behavior and result to be correct. Want to divide ? better be by non-zero. Want to add signed values ? Well, be sure they don’t overflow. Let’s call memcpy() ? First, ensure memory segments are allocated and don’t overlap.
And on, and on, and on.
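To make a couple of these contracts concrete, here is a sketch of what respecting them by hand looks like. The checked_add() and checked_div() helpers are hypothetical names, and the "return 0 on failure" convention is just one possible choice.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Stores a+b into *result and returns 1, or returns 0 if the sum would overflow.
 * `result` must not be NULL : violating this is a programming error. */
int checked_add(int a, int b, int* result)
{
    assert(result != NULL);
    if (b > 0 && a > INT_MAX - b) return 0;   /* would overflow upward */
    if (b < 0 && a < INT_MIN - b) return 0;   /* would overflow downward */
    *result = a + b;                          /* now guaranteed to be representable */
    return 1;
}

/* Stores num/den into *result and returns 1, or returns 0 if the division is invalid */
int checked_div(int num, int den, int* result)
{
    assert(result != NULL);
    if (den == 0) return 0;                       /* division by zero */
    if (num == INT_MIN && den == -1) return 0;    /* the quotient would overflow */
    *result = num / den;
    return 1;
}
```

Note how many tests are needed for two of the simplest operations of the language, and nothing warns the programmer when one of them is missing.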

While it’s sometimes possible to assert() some of these conditions, it’s not great, because in absence of compilation warnings, contract violation can still happen at runtime. And while the assert(), if enabled, will prevent the situation from degenerating into undefined behavior, it still translates into an abrupt abort(), which is another form of vulnerability.

A better solution is to ensure that the condition always holds. This is where a compile-time guarantee comes in.

Solution

We want the compiler to emit a warning whenever a condition cannot be guaranteed to be true. Technically, this is almost like an assert(), though without a trace in the generated binary.

This outcome is already common : whenever an assert() can be proven to be always true, the compiler will remove it, through a fairly common optimization stage called Dead Code Elimination (DCE).

Therefore, the idea is to design an assert() that must be removed from the final binary through DCE, and emits a warning if it is not.

Since no such instruction exists in the base language, we’ll have to rely on some compiler-specific extensions. gcc for example offers a function attribute which does exactly that :

warning ("message")
If the warning attribute is used on a function declaration and a call to such a function is not eliminated through dead code elimination or other optimizations, a warning that includes "message" is diagnosed. This is useful for compile-time checking.

This makes it possible to create this macro :

__attribute__((noinline))
__attribute__((warning("condition not guaranteed")))
static void never_reach(void) { abort(); } // must define a side effect, to not be optimized away

// EXPECT() : will trigger a warning if the condition is not guaranteed to be true
#define EXPECT(c) (void)((c) ? (void)0 : never_reach())

The resulting macro is called EXPECT(), for consistency with a recent C++20 proposal, called attribute contract, which suggests the notation [[expects: expression]] to achieve something similar (though not strictly identical, but that’s a later topic).

EXPECT() is designed to be used the same way as assert(), the difference being it will trigger a warning at compile time whenever it cannot be optimized away, underlining that the condition cannot be proven to be always true.

Limitations

It would be too easy if one could just start writing EXPECT() everywhere as an assert() replacement. Beyond the fact that it can only be used to test programming invariants, there are additional limitations.

First, this version of EXPECT() macro only works well on gcc. I have not found a good enough equivalent for other compilers, though it can be emulated using other tricks, such as an incorrect assembler statement, or linking to some non existing function, both of which feature significant limitations : do not display the line at which condition is broken, or do not work when it’s not a program with a main() function.

Second, checking the condition is tied to compiler’s capability to combine Value Range Analysis with Dead Code Elimination. That means the compiler must use at least a bit of optimization. These optimizations are not too intense, so -O1 is generally enough. Higher levels can make a difference if they increase the amount of inlining (see below).

However, -O0 definitely does not cut it, and all EXPECT() will fail. Therefore, EXPECT() must be disabled when compiling with -O0. -O0 can be used for fast debug builds for example, so it cannot be ruled out. This issue makes it impossible to keep EXPECT() always active by default, so its activation must be tied to some explicit build macro.
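One possible way to tie the activation to a build macro is sketched below. EXPECT_ENABLE is a hypothetical name, and falling back to a plain assert() when the compile-time version is unavailable is my own choice, not the only option. __OPTIMIZE__ is predefined by gcc whenever optimization is enabled.

```c
#include <assert.h>
#include <stdlib.h>   /* abort(), size_t */

/* EXPECT_ENABLE is a hypothetical build macro, e.g. -DEXPECT_ENABLE=1 */
#if defined(EXPECT_ENABLE) && defined(__GNUC__) && !defined(__clang__) && defined(__OPTIMIZE__)
__attribute__((noinline))
__attribute__((warning("condition not guaranteed")))
static void never_reach(void) { abort(); }
#  define EXPECT(c) (void)((c) ? (void)0 : never_reach())
#else
#  define EXPECT(c) assert(c)   /* assumption : fall back to a runtime check */
#endif

/* Returns the first element of the table, or -1 if there is none */
int first_element(const int* table, size_t size)
{
    if (table == NULL || size == 0) return -1;
    EXPECT(table != NULL);   /* both conditions are guaranteed by the test above */
    EXPECT(size > 0);
    return table[0];
}
```

With this layout, a fast -O0 debug build still compiles without a flood of warnings, and the compile-time checks only kick in for builds which explicitly ask for them.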

Third, Value Range Analysis is limited, and can only track function-local changes. It cannot cross function boundaries.

There is a substantial exception to this last rule for inline functions : for these cases, since function body will be included into the caller’s body, EXPECT() conditions will be applied to both sides of the interface, doing a great job at checking conditions and inheriting VRA outcome for optimization.

inline functions are likely the best place to start introducing EXPECT() into an existing code base.
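Here is a sketch of what this could look like. The function names are hypothetical. EXPECT() is given an assert() stand-in so that the sample builds with any compiler. The compile-time behavior described in the comments is what I would expect from the gcc version of the macro with optimization enabled.

```c
#include <assert.h>
#include <stddef.h>   /* NULL */

#ifndef EXPECT   /* stand-in : replace with the gcc version to get compile-time warnings */
#  define EXPECT(c) assert(c)
#endif

/* Never call with table==NULL, nor with idx outside of [0, size) */
static inline int table_get(const int* table, int size, int idx)
{
    EXPECT(table != NULL);
    EXPECT(0 <= idx && idx < size);
    return table[idx];
}

int sum_of_ends(const int* table, int size)
{
    if (table == NULL || size < 1) return 0;
    /* Once table_get() is inlined here, the compiler knows that
     * table != NULL and size >= 1, so both calls can be validated. */
    return table_get(table, size, 0) + table_get(table, size, size - 1);
}
```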

Function pre-conditions

When a function is not inline, the situation becomes more complex, and EXPECT() must be used differently compared to assert().

For example, a typical way to check that input conditions are respected is to assert() them at the beginning of the function. This wouldn’t work with EXPECT().

Since VRA does not cross function boundaries, EXPECT() will not know that the function is called with bad parameters. Actually, it will also not know that the function is called with good parameters. With no ability to make any assumption on function parameters, EXPECT() will just always fail.

// Never call with `v==0`
int division(int v)
{
    EXPECT(v!=0);  // This condition will always fail :
                   // the compiler cannot make any assumption about `v` value.
    return 1/v;
}

int lets_call_division_zero(void)
{
    return division(0);   // No warning here, though condition is violated
}

To be useful, EXPECT() must be declared on the caller side, where it can properly check input conditions.
Yet, having to spell input conditions on the caller side at every invocation is cumbersome. Worse, it’s too difficult to maintain: if conditions change, all invocations must be updated !

A better solution is to spell all conditions in a single place, and encapsulate them as part of the invocation.

// Never call with `v==0`
int division(int v)
{
    return 1/v;
}

// The macro has the same name as the function, so it masks it.
// It encapsulates all preconditions, and delivers the same result as the function.
// The argument is parenthesized, so that an expression such as `a+b` expands safely.
#define division(v) ( EXPECT((v)!=0), division(v) )

int lets_call_division_zero(void)
{
    return division(0);   // Now, this one gets flagged right here
}

int lets_call_division_by_something(int divisor)
{
    return division(divisor);   // This one gets flagged too : there is no guarantee that it is not 0 !
}

int lets_divide_and_pay_attention_now(int divisor)
{
    if (divisor == 0) return 0;
    return division(divisor);   // This one is okay : no warning
}

Here are some more example usages. Note how EXPECT() are combined with a function signature into a macro, so that compile time checks get triggered every time the function is called.

Limitations

This construction solves the issue on the caller side, which is the most important one.

You may note that the macro features a typical flaw : its argument v is present twice. It means that, if v is actually a function, it’s going to be invoked twice. In some cases, like rand(), both invocations may even produce different results.

However, at this stage, it’s impossible to successfully invoke the macro using a function as argument to begin with.
That’s because a function’s return value carries no guarantee beyond its type.
So, if the function is int f(), its return value could be any value, from INT_MIN to INT_MAX.
As a consequence, no function’s return value can ever comply with any condition. It will necessarily generate a warning.

The encapsulating macro can only check conditions on variables, and it will only accept variables which are guaranteed to respect the conditions. If a single one may break any condition, a warning is issued.

However, pre-conditions remain unknown to the function body itself. This is an issue, because without it, it is necessary to re-express the conditions within the function body, which is an unwelcome burden.

A quick work-around is to express these guarantees inside the function body using assert(). This is, by the way, what should have been done anyway.

An associated downside is that ensuring that EXPECT() conditions are respected using assert() presumes that assert() are present and active in source code, to guide the Value Range Analysis. If assert() are disabled, their corresponding EXPECT() will fail.
This suggests that EXPECT() can only be checked in debug builds, and with optimization enabled (-O1).
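The interplay between assert() and EXPECT() can be sketched as follows. share() is a hypothetical caller, and EXPECT() is again given an assert() stand-in so that the sample builds anywhere. With the gcc version of the macro, the assert() is what lets Value Range Analysis conclude that parts is not 0.

```c
#include <assert.h>

#ifndef EXPECT   /* stand-in : replace with the gcc version to get compile-time warnings */
#  define EXPECT(c) assert(c)
#endif

/* Never call with `v==0` */
int division(int v) { return 1/v; }
#define division(v) ( EXPECT((v)!=0), division(v) )

/* Never call with `parts==0` */
int share(int total, int parts)
{
    assert(parts != 0);               /* when active, guarantees parts!=0 from here on */
    return total * division(parts);   /* so the EXPECT() inside the macro can be satisfied */
}
```

Compile the same code with -DNDEBUG and the assert() vanishes, along with the guarantee it was providing to the compiler.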

With all these assert() back, it seems like these compile-time checks are purely redundant, hence almost useless.

Not quite. It’s true that so far, it has not reduced the amount of assert() present in the code, but the compiler now actively checks expressed pre-conditions, and mandates the presence of assert() for every condition that the local code does not explicitly rule out. This is still a step up : risks of contract violation are now underlined early, and it’s no longer possible to “forget” an assert(). As a side effect, tests will also catch condition violations sooner, leading to more focused and shorter debug sessions. This is still a notable improvement.

It nonetheless feels kind of incomplete. One missing aspect is an ability to transfer pre-conditions from the calling site to the function body, so that they can be re-used to satisfy a chain of pre-conditions.
This capability requires another complementary tool. We’ll see that in the next blog post.

Thursday, January 24, 2019

Compiler Warnings

 btrc: compile4 : compiler warnings

One way to improve C code quality is to reduce the number of strange constructions that the standard does not explicitly forbid. This will greatly help code reviewers, who want fewer surprises, and try to understand what a segment of source code is achieving and impacting.

A straightforward way to create such a “constrained” C variant is to add compiler-specific warning flags. They will trigger warnings on detecting certain constructions considered dubious, if not downright dangerous.

A simple example is the condition if (i=1) {. This test seems to check if i equals 1, but that’s not what it does : it assigns the value 1 to i. Also, as a consequence, it is always true. This is most likely a typo, the programmer probably wanted to express an equality test if (i==1) {. Yet, it’s not invalid, strictly speaking. So a compiler is allowed to accept it at face value and generate corresponding assembly without any warning. That may take a while to debug …

The if (i=1) { typo statement is well known, and nowadays it triggers a warning in most compilers with the help of warning flags.
At the very least, the warning is an invitation to spell the intention more clearly.
Sometimes, it was a genuine error, and the compiler just helped us catch this issue before it ever reaches production, saving some considerable debug time.
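As an illustration, here is a sketch of the two ways out of this warning, which gcc reports through -Wparentheses (part of -Wall). The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* if (i = 1) { ... }   assignment, always true : this is the construction being flagged */

/* Case 1 : it was a typo, an equality test was intended */
int is_one(int i) { return i == 1; }

/* Case 2 : the assignment is intentional. Extra parentheses and an explicit
 * comparison spell the intention clearly, and the warning goes away. */
size_t string_length(const char* s)
{
    size_t n = 0;
    char c;
    while ((c = s[n]) != '\0') n++;
    return n;
}

/* usage */
void conditions_demo(void)
{
    assert(is_one(1) && !is_one(2));
    assert(string_length("abc") == 3);
}
```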

Multiplying the number of flags will increase the number of warnings. But sifting through a large list of warnings to find which ones are interesting and which ones are merely informational can be daunting. Moreover, collaborative programming requires simple rules, that anyone can abide by.

Using warnings should be coupled with a strict “zero-warning” policy. Every warning must be considered an error to be dealt with immediately. This is a clean signal that everyone understands, and that any CI environment can act upon. If a warning message is considered not fixable, or not desirable to fix, it’s preferable to remove the associated flag from the build chain.

On gcc, ensuring that no warning can be ignored can be enforced by the -Werror flag, which makes any warning a fatal error. Visual has “treat warnings as errors”.
More complex policies are possible, such as activating more warnings and making only some of them fatal (for example -Werror=vla), but it makes the setup more complex, and logs more difficult to follow.

As a consequence, it’s not a good idea to just “enable everything”. Each additional flag increases the number of false positives to deal with. When too many warnings are generated, fixing them feels like a discouraging and low-value task, which ends up being abandoned. Only warnings which bring some value deserve to be tracked, fixed, and continuously maintained. Therefore, it is preferable to only add a flag when its benefit is clearly understood.

That being said, the best moment to crank up the warning level is at the beginning of a project. What tends to be difficult is to add new flags to an existing project, because new flags will reveal tons of programming patterns that were silently allowed and must now be avoided, anywhere within the repository. On the other hand, keeping an existing code clean is much simpler, as issues appear only in new commits, and can therefore be located and fixed quickly.

MS Visual

My programming habits have largely switched from Windows to Unix these last few years, so I’m no longer up to date on this topic.
By default, Visual organizes its list of optional warnings into “levels”. The higher the level, the more warnings it generates. It’s also possible to opt-in for a single specific warning, but I don’t have enough experience to comment on that usage.

By default, the Visual compiler uses level 1 on the command line, and level 3 in the IDE.
Level 3 is already pretty good, but I recommend to aim for level 4 if possible. That level will catch additional tricky corner cases, making the code cleaner and more portable.
Obviously, on an existing project, move up progressively towards higher levels, as each of them will generate more warnings to clean up.

The exact way to change the warning level may depend on the IDE version. On command line, it’s always /W4, so that one is pretty clear. On IDE, it’s generally accessible in the properties->C tab, which is one of the first displayed, as shown here.

Do not use /Wall as part of your regular build process. It contains too many warnings of “informational” value, which are not meant to be suppressed, hence will continuously drown the signal and make “zero warning policy” impossible.

gcc and clang

gcc and by imitation clang offer a command line experience with a large list of compatible flags for warnings.
Over time, I’ve developed my own selection, which has become pretty long. I would recommend it to any code base. I’m going to detail it below. It is by no means a “final” or “ultimate” version. The list can always evolve, integrating more flags, either because I missed them, or they end up being more useful than I initially anticipated, or because they become more broadly supported.

For simplicity, I tend to concentrate on flags that are well supported by gcc and clang, and have been present for a few revisions. Flags which only work on “latest version of X” are not considered in this list, because they can cause trouble for compilation on targets without version X. This issue can be solved by adding yet another machinery to maintain version-specific flags, complete with its own set of problems, but I would not recommend to start with such complexity.

If your project does not include those flags yet, I suggest to only enable them one after another. A project developed without a specific flag is likely to have used the flagged pattern in many places. It’s important to clean one flag completely before moving to next one, otherwise, the list of warnings to fix becomes so large that it will seem insurmountable. Whenever it is, just drop the flag for the time being, you’ll come back to it later.

Basics

  • -Wall : This is the “base” warning level for gcc/clang. In contrast to what its name implies, it does not enable “all” warnings, far from it, but a fairly large set of flags that the compiler team believes is generally safe to follow. For a detailed list of what it includes, you can consult this page, which is likely applicable to latest compiler version. The exact list of flags evolves with the specific version of the compiler. It’s even different depending on gcc or clang. It’s okay, because the flag itself doesn’t change.
    I would recommend to start with this flag, and get to the bottom of it before moving on to more flags. Should the generated list of warnings be overwhelming, you can break it down into a more narrow set of flags, or selectively disable a few annoying warnings with -Wno-###, then plan to re-enable them progressively later.

  • -Wextra : This is the second level for gcc and clang. It includes an additional set of flags, which constrains the code style further, improving maintainability. For example, this level will raise a warning whenever a switch() { case: } uses a fall-through implicitly, which is generally (but not always) a mistake.
    This flag used to be called -W, but I recommend the -Wextra form, which is more explicit.
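As a sketch of the fall-through case : when it is intentional, gcc accepts a comment placed right before the next case label as a marker, so the warning only remains on the accidental ones. The function below is a hypothetical example, and other compilers may expect a different kind of marker.

```c
#include <assert.h>

/* Hypothetical example : number of setup steps remaining from a given stage */
int steps_remaining(int stage)
{
    int steps = 0;
    switch (stage) {
        case 0:
            steps++;
            /* fall through */
        case 1:
            steps++;
            break;
        default:
            break;
    }
    return steps;
}

/* usage */
void fallthrough_demo(void)
{
    assert(steps_remaining(0) == 2);   /* stage 0 goes through both cases */
    assert(steps_remaining(1) == 1);
}
```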

Correctness

  • -Wcast-qual : This flag ensures that a QUALifier is respected during a cast operation. This is especially important for the const “read-only” qualifier: it ensures that a pointer to a read-only area cannot be silently transformed into another pointer with write capability, which is quite an essential guarantee. I don’t quite get why this is an optional warning, instead of a compulsory property of the language.

  • -Wcast-align : the C standard requires that a type must be stored at an address suitable for its alignment restriction. For example, on 32-bit systems, an int must be stored at an address which is a multiple of 4. This restriction tends to be forgotten nowadays because x86 cpus have always been good at dealing with unaligned memory accesses, and ARM ones have become better at this game (they used to be terrible). But it’s still important to respect this property, for portability, for performance (avoid inter-pages accesses), and for compatibility with deep transformations such as auto-vectorization. Casting can unintentionally violate this condition. A typical example is casting a memory area previously reserved as a table of char, hence without any alignment restriction, in order to store int values, which require an alignment of 4. -Wcast-align will detect the violation, and fixing it will make sure the code respects alignment restrictions, making it more portable.

  • -Wstrict-aliasing : Strict aliasing is a complex and badly known rule. It states that, in order to achieve better performance, compilers are allowed to consider that 2 pointers of different types never reference the same address space, so their content cannot “collide”. If they nonetheless do, it’s an undefined behavior, hence anything can happen unpredictably.
    To ensure this rule is not violated, compilers may optionally offer some code analysis capabilities, that will flag suspicious constructions. gcc offers -Wstrict-aliasing, with various levels of caution, with 1 being the most paranoid.
    Issues related to strict aliasing violation only show up in optimized code, and are among the most difficult to debug. It’s best to avoid them. I recommend using this flag at its maximum setting. If it generates too much noise, try more permissive levels. -Wstrict-aliasing=3 is already included as part of -Wall, so if -Wall is already clean, the next logical step is level 2, then 1.
    One beneficial side-effect of this flag is that it re-inforces the separation of types, which is a safer practice. Cross-casting a memory region with pointers of different types is no longer an easy option, as it gets immediately flagged by the compiler. There are still ways to achieve this, primarily through the use of void* memory segments, which act as wildcards. But the extra-care required is in itself protective, and should remind the developer of the risks involved.

  • -Wpointer-arith forbids pointer arithmetic on void* or function pointer. C unfortunately lacks the concept of “memory unit”, so a void* is not a pointer to an address: it’s a pointer to an object “we don’t know anything about”. Pointer arithmetic is closely related to the concept of table, and adding +1 is always relative to the size of the table element (which must be a constant). With void*, we have no idea what this element size could be, so it’s not possible to +1 it, nor do more complex pointer arithmetic.
    To perform operation on bytes, it’s necessary to use a pointer to a byte type, be it char*, unsigned char* or int8_t*.
    This is a strict interpretation of the standard, and helps make the resulting code more portable.
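Here is a sketch of code which should stay clean under these flags. read32() and advance_bytes() are hypothetical helpers. Using memcpy() to read a value from a byte buffer is a common approach which I’m assuming here, not something mandated by the flags themselves.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Reads a 32-bit value from any position within a byte buffer.
 * A cast such as *(const uint32_t*)(buf + pos) could break both alignment
 * and strict aliasing rules. Copying the bytes avoids both issues. */
uint32_t read32(const unsigned char* buf, size_t pos)
{
    uint32_t v;
    memcpy(&v, buf + pos, sizeof(v));
    return v;
}

/* Byte-level arithmetic on an untyped segment goes through a byte pointer,
 * never directly through void*. The const qualifier is preserved by the cast. */
const void* advance_bytes(const void* ptr, size_t nbBytes)
{
    return (const unsigned char*)ptr + nbBytes;
}

/* usage */
void byte_access_demo(void)
{
    unsigned char buf[8] = { 0 };
    uint32_t const value = 0x11223344u;
    memcpy(buf + 1, &value, sizeof(value));   /* deliberately unaligned position */
    assert(read32(buf, 1) == value);
    assert(advance_bytes(buf, 3) == (const void*)(buf + 3));
}
```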

Variable declaration

  • -Winit-self : prevents a fairly silly corner case, where a variable is initialized with itself, such as int i = i+1;, which can not be right. clang and g++ make it part of -Wall, but not gcc.

  • -Wshadow : A variable v declared at a deep nesting level shadows any other variable with the same name v declared at an upper level. This means that invoking v at the deeper level will target the deeper v. This is legal from a C standard perspective, but it’s considered bad practice, because it’s confusing for the reviewer. Now 2 different variables with different roles and lifetimes carry the same name. It’s better to differentiate them, by using different names.
    Sidenote : name shadowing can be annoying when using a library which unfortunately defines very common symbol names as part of its interface. Don’t forget that the C namespace is global. For this reason, whenever publishing an API, always ensure that no public symbol is too “common” (such as i, min, max, etc.). At a minimum, add a PREFIX_ to the public symbol name, so that opportunities of collision get drastically reduced.

  • -Wswitch-enum : This flag ensures that, in a switch(enum) { case: } construction, all declared values of the enum have a case: branch. This can be useful to ensure that no enum value has been forgotten (even if there is a default: branch down the list to deal with them). Forgetting an enum value is a fairly common scenario when the enum list changes, typically by adding an element to it. The flag will issue a warning on all relevant switch() { case: }, simplifying code traversal to ensure that no case has been missed.
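To illustrate the last flag, here is a sketch with a hypothetical enum. As written, every value has its own case:, so the code is clean. Add a color_yellow to the enum, and -Wswitch-enum should flag the switch(), even though the default: branch would have handled it silently.

```c
#include <assert.h>

/* Hypothetical enum, for illustration only */
typedef enum { color_red, color_green, color_blue } color_e;

const char* color_name(color_e c)
{
    switch (c) {
        case color_red:   return "red";
        case color_green: return "green";
        case color_blue:  return "blue";
        default:          return "unknown";   /* still useful for out-of-range values */
    }
}

/* usage */
void color_demo(void)
{
    assert(color_name(color_green)[0] == 'g');
    assert(color_name((color_e)42)[0] == 'u');
}
```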

Functions

  • -Wstrict-prototypes : historically, C functions used to be declared with just their name, without even detailing their parameters. This is considered bad practice nowadays, and this flag will ensure that a function is declared with a fully formed prototype, including all parameter types.
    A common side effect happens for functions without any parameter. Declaring them as int function() seems to mean “this function takes no argument”, but it’s not correct. Due to this historical background, it actually means “this function may have any number of arguments of any type, it’s just not documented”. Such definition will limit the effectiveness of the compiler in controlling the validity of an invocation, so it’s bad, and this flag will issue a warning. The correct way to tell that a function has no (zero) argument is int function(void).

  • -Wmissing-prototypes : this flag enforces that any public function (non-static) has a declaration somewhere. It’s easy to game that condition by just writing a prototype declaration right in front of the function definition itself, but it misses the point : this flag will help find functions which are (likely) no longer useful.
    The problem with public functions is that the compiler has no way to ensure they are not used anymore. So it will generate them, and wait for the linking stage to know more. In a library, such “ghost” function will be present, occupy valuable space, and more importantly will still offer a public symbol to be reached, remaining within the library’s attack surface, and offering a potential backdoor for would-be attackers. Being no longer used, these functions may also not be correctly tested anymore, and might allow unintended state manipulations. So it’s better to get rid of them.
    If a kind of “private function just for internal tests” is needed, and should not be exposed in the official *.h header, create a secondary header, like *-debug.h for example, where the function is declared. And obviously #include it in the *.c unit. This will be cleaner and compatible with this flag.

  • -Wredundant-decls : A prototype should be declared only once, and this single declaration should be #included everywhere it’s needed. This policy avoids multiple sources of truth, with associated synchronization problems.
    This flag will trigger a warning if it detects that a function prototype is declared twice (or more).
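A layout compatible with these three flags could look like the sketch below. The counter module is hypothetical, and both files are shown in a single block for brevity.

```c
#include <assert.h>

/* ---- counter.h : the single place where the prototype is declared ---- */
int counter_next(void);   /* (void) : takes no argument. An empty () would leave the parameters unspecified. */

/* ---- counter.c : would #include "counter.h" ---- */
static int g_count = 0;

/* static : internal to this unit, hence no public symbol and no prototype required */
static int increment(int v) { return v + 1; }

int counter_next(void)
{
    g_count = increment(g_count);
    return g_count;
}

/* ---- user code ---- */
void counter_demo(void);
void counter_demo(void)
{
    int const first = counter_next();
    assert(counter_next() == first + 1);
}
```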

Floating point

  • -Wfloat-equal : this flag prevents usage of the == equality operator between float values. This is because floating point values are lossy representations of real numbers, and any operation on them incurs some inaccuracy, whose exact detail depends on the target platform, hence is not portable. Two floating-point values should not be compared for equality: it’s not supposed to make sense given the lossy nature of the representation. Rather, ensure that the distance between 2 floats is below a certain threshold to consider them “equivalent enough”.
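A minimal sketch of such a distance test follows. nearly_equal() is a hypothetical helper, and the right tolerance depends entirely on the application: an absolute threshold like this one is only suitable when the order of magnitude of the values is known.

```c
#include <assert.h>

/* Returns 1 if a and b are closer than `tolerance`, 0 otherwise */
int nearly_equal(double a, double b, double tolerance)
{
    double const distance = (a > b) ? (a - b) : (b - a);
    return distance <= tolerance;
}

/* usage */
void float_compare_demo(void)
{
    double const sum = 0.1 + 0.2;   /* not guaranteed to be exactly 0.3 */
    assert(nearly_equal(sum, 0.3, 1e-9));
    assert(!nearly_equal(1.0, 1.1, 1e-9));
}
```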

Preprocessor

  • -Wundef : forbids evaluation of a macro symbol that’s not defined. Without it, #if SYMBOL_NOT_EXIST is silently translated into #if 0, which may or may not generate the intended outcome. This is useful when the list of macro symbols evolves : whenever a macro symbol disappears, all related preprocessor tests get flagged with this warning, which makes it possible to review and adapt them.
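One way to stay compatible with this flag is to give every tested symbol an explicit default. The sketch below assumes a hypothetical build option named `FEATURE_X`:

```c
/* FEATURE_X is a hypothetical build option. Giving it an explicit default
 * means `#if FEATURE_X` never evaluates an undefined symbol, so -Wundef
 * stays silent, while a misspelled or removed symbol still gets reported. */
#ifndef FEATURE_X
#  define FEATURE_X 0
#endif

#if FEATURE_X
static int feature_x_enabled(void) { return 1; }
#else
static int feature_x_enabled(void) { return 0; }
#endif
```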

Standard Library

  • -Wformat=2 : this will track potential printf() issues which can be abused to create security hazard scenarios.
    An example is when the format string itself can be under control of an external source, such as printf(message), with char* message being externally manipulated. This can be used to read and write out of bounds and take remote control of the system. Yep, it’s that dangerous.
    The solution to this specific issue is to write printf("%s", message). It may look equivalent, but this second version is safer, as it interprets message only as a pure char* string to display, instead of a formatting string which can trigger read/write orders from inside printf().
    -Wformat=2 will flag this issue, and many more, such as ensuring proper correspondence between the argument type and control string statement, leading to a safer program.
    These issues go beyond the C language proper, and more into stdio library territory, but it’s good to enable more options to be protected from this side too.
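The difference can be sketched with snprintf(), which follows the same format rules as printf(). The helper name `format_message` is an assumption for illustration:

```c
#include <stdio.h>

/* Hypothetical helper: copies an externally provided `message` into `dst`.
 * `message` is passed as an argument of "%s", so any '%' it contains is
 * treated as plain data, never as a conversion directive. */
static int format_message(char* dst, size_t capacity, const char* message)
{
    /* return snprintf(dst, capacity, message);
     *   reported with -Wformat=2 (format string is not a literal) */
    return snprintf(dst, capacity, "%s", message);
}
```

Even a hostile input such as "100%n%s" is then copied verbatim instead of being interpreted.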

Extended compatibility

  • -Wvla : prevents usage of Variable Length Array.
    VLAs were introduced in C99, but have been optional since C11 (support can be tested using the __STDC_NO_VLA__ macro). They allow nice things such as allocating on the stack a table of variable size, depending on a function parameter. However, VLAs have a pretty poor reputation. I suspect a reason is that they were served by sub-par implementations leading to all sorts of hard issues, such as undetected stack overflow, with unpredictable consequences.
    Note that even “good” implementations, able to dynamically expand stack to make room for larger tables, and correctly detect overflow issue to properly abort(), cannot provide any way for the program to be informed of such issue and react accordingly. It makes it impossible to create a program that is guaranteed to not abort().
    For better portability, it’s enough to know that some implementations of VLA are/were poor, and that VLA is no longer required in C11 to justify avoiding it. VLA is also not available for C90.

  • -Wdeclaration-after-statement : this flag is useful for C90 compatibility. Having all declarations at the top of the block, before any statement, is a strict rule that was dropped with C99, and it’s now possible to declare new variables anywhere in a block. This flag is mostly useful if the goal is to be compatible with C90 compilers, such as MS Visual Studio C before 2015 as an example.

  • -Wc++-compat : this flag ensures that the source can be compiled unmodified as both valid C and C++ code. This will require a few additional restrictions, such as casting from void*, which is unnecessary in C, but required in C++.
    This is handy for highly portable code, because it’s not uncommon for some users to just import the source file in their project and compile it as a C++ file, even though it’s clearly labelled as C. Moreover, when targeting C90 compatibility, C++ compatibility is not too far away, so the remaining effort is moderate.
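The void* case is the most common one in practice. A minimal sketch, using a hypothetical helper named `make_counters`:

```c
#include <stdlib.h>

/* Hypothetical helper: returns an array of `n` zero-initialized ints,
 * or NULL on allocation failure. The caller must free() the result. */
static int* make_counters(size_t n)
{
    /* int* p = calloc(n, sizeof(int));
     *   valid C, but reported by -Wc++-compat and rejected by a C++ compiler */
    int* const p = (int*)calloc(n, sizeof(int));   /* valid in both languages */
    return p;
}
```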

Other interesting flags

  • -Wconversion : The C language allows most conversions to be performed silently. Transforming an int value into a short one ? No problem, just spell it. This design choice dates from the 70’s, when reducing the number of keystrokes was important, due to concerns we can’t even start to imagine today (slow printers, limited display space, hard key presses, etc.). Thing is, many type conversions are actually dangerous. That int to short ? What if the original value is larger than SHRT_MAX ? Yep, the result is implementation-defined, and rarely what was intended. short to int conversion, on the other hand, is risk free.
    -Wconversion will flag any silent type conversion which is not risk free. In an existing code base developed without this flag, this will lead to a very large number of warnings, likely within intractable territory.
    The situation is even worse for gcc, because it flags type conversions resulting from implicit operation conversions. In this short example, all variables are short types. There is no other type anywhere. Yet, gcc's -Wconversion flag will trigger multiple warnings, because a basic operation such as + is allowed to be performed into int space, hence storing the final result into a short is now considered a “risky” conversion. Some constructions, such as += can’t even be fixed !
    Bottom line : starting a new code base with -Wconversion is doable, but adding this flag to an existing project is likely too large a burden.
    Special mention for the combination clang + -Wconversion -Wno-sign-conversion, which I use regularly, but only on clang.
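The promotion issue can be sketched as follows. The function names are hypothetical, and the exact set of diagnostics depends on the gcc version:

```c
/* All operands are `short`, yet the arithmetic is performed in `int`
 * because of integer promotion. */
static short add_shorts(short a, short b)
{
    /* short r = a + b;
     *   gcc -Wconversion: conversion from 'int' to 'short' may change value */
    short const r = (short)(a + b);   /* explicit cast documents the narrowing */
    return r;
}

static short accumulate(short total, short v)
{
    /* total += v;
     *   may also be reported by gcc, and the compound form offers
     *   no place to put a cast */
    total = (short)(total + v);       /* the expanded form accepts a cast */
    return total;
}
```

Expanding `+=` into a plain assignment is the usual workaround, at the cost of noisier code.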

  • -Weverything (clang only) : While it’s not recommended to use too many warnings in the production build chain, it can be sometimes interesting to look at more options. Special mention can be given to -Weverything on clang, which will activate every possible warning flag.
    Now, -Weverything is not meant to be used in production. It’s mostly a convenient “discovery” feature for clang developers, which can track and understand new warnings as they are added to “trunk”.
    But for the purpose of testing if the compiler can help find new issues, it can be an interesting temporary digression. One or two of these warnings might uncover real issues, inviting to re-assess the list of flags used in production.

Summary

All the flags presented so far can be combined into the following list, provided below for copy-pasting purposes :
-Wall -Wextra -Wcast-qual -Wcast-align -Wstrict-aliasing -Wpointer-arith -Winit-self -Wshadow -Wswitch-enum -Wstrict-prototypes -Wmissing-prototypes -Wredundant-decls -Wformat=2 -Wfloat-equal -Wundef -Wvla -Wdeclaration-after-statement -Wc++-compat

Quite a mouthful. Adopting as-is this list into an existing project might result in an abundant list of warnings if they were not already part of the build. Don’t be afraid, your code is not completely broken, but consider having a look: it might be fragile in subtle ways that these flags will help find. Enable additional warnings one by one, selectively, pick those which add value to your project. In the long run, these flags will help keep the code better maintained.

Compiler warning flags can be seen as a giant list of patterns that the compiler is pre-trained to detect. It’s great. But beyond these pre-defined capabilities, one might be interested in adding one’s own set of conditions for the compiler to check and enforce. That’s the purpose of the next blog post.

Special Thanks

An early version of this article was commented by Nick Terrell and Evan Nemerson.

Tuesday, January 22, 2019

Opaque types and static allocation

In a previous episode, we’ve seen that it is possible to create opaque types. However, creation and destruction of such type must be delegated to some dedicated functions, which themselves rely on dynamic allocation mechanisms.

Sometimes, it can be convenient to bypass the heap, and all its malloc() / free() shenanigans. Pushing a structure onto the stack, or within thread-local storage, are natural capabilities offered by a normal struct. It can be desirable at times.

The previously described opaque type is so secret that it has no size, hence is not suitable for such scenario.

Fortunately, static opaque types are possible.
The main idea is to create a “shell type”, with a known size and an alignment, able to host the target (private) structure.

For safer maintenance, the shell type and the target structure must be kept in sync, typically with a static assert. It ensures that the shell type is always large enough to host the target structure. This check is important to automatically detect any future evolution of the target structure.
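A minimal sketch of such a check, assuming a C11 compiler for `_Static_assert` and `_Alignof`. The fields of `struct thing_s` and the `THING_SIZE` constant are hypothetical:

```c
#include <stddef.h>

/* Private definition, visible only inside the implementation unit
 * (hypothetical fields, for illustration). */
struct thing_s { int a; int b; };

/* Shell type, as it would appear in the public header. */
#define THING_SIZE 8   /* hypothetical public constant */
typedef union {
    char body[THING_SIZE];
    unsigned alignment_enforcer;
} thingBody;

/* Compilation fails as soon as the private structure outgrows its shell. */
_Static_assert(sizeof(thingBody) >= sizeof(struct thing_s),
               "shell type is too small for struct thing_s");
_Static_assert(_Alignof(thingBody) >= _Alignof(struct thing_s),
               "shell type is under-aligned for struct thing_s");
```

The assertions belong in the implementation unit, since that is the only place where both types are fully visible.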

If it wasn’t for the strict aliasing rule, we would have a winner : just use the shell type as the “public” user-facing type, proceed with transforming it into the private type inside the unit. It would combine properties of struct while remaining opaque.

Strict aliasing

Unfortunately, the strict aliasing rule gets in the way : we can't manipulate the same memory region from two pointers of different type (edit Christer Ericson : for the lifespan of the stored value). That's because the compiler is allowed to make assumptions about pointer value provenance for the benefit of performance.

To visualize the issue, I like this simple example, powered by Godbolt. Notice how the two +1 get combined into a single +2, saving one save+load round trip, and allowing computation over i and f in parallel, so it’s real saving.
But unfortunately, if f and i have the same address, the result is wrong : the first i+1 influences the operation on f which influences the final value of i.
Of course, this example feels silly : it’s pretty hard to find a use case which justifies operations on int and float simultaneously and pointing at the same memory address. It shows that the rule is quite logical : if these pointers have different type, they most likely do not reference the same memory area. And since benefits are substantial, it’s tempting to use that assumption.
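The linked example is not reproduced here. A function of roughly the following shape (a reconstruction with hypothetical names, not necessarily the exact code behind the link) exhibits the behavior described:

```c
/* Sketch: two increments of *i separated by an update of *f. */
static void bump(int* i, float* f)
{
    *i += 1;
    *f += 1.0f;
    *i += 1;
    /* Under strict aliasing, the compiler may assume that `i` and `f`
     * point to different objects, and merge both increments of *i into
     * a single +2, processed independently from the update of *f. */
}
```

When `i` and `f` designate distinct objects, both orderings give the same result, which is exactly why the optimization is considered legitimate.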

Interpreting differently the same memory area using different types of pointers is called “type punning”. It may work, as long as the compiler serializes operations as expected in the code, but there is no guarantee that it will continue to work safely in the future. A known way to break older programs employing type punning is to recompile them with modern compilers using advanced performance optimizations such as -O3 -flto. With enough inlining, register caching and dead code elimination, one will start to experience strange effects, which can be very hard to debug.

This is explained in greater detail in this excellent article from Mike Acton. For an even deeper understanding of what can happen under the hood, you can read this document suggested by Josh Simmons. It demonstrates that there is a lot more to a pointer than just its binary representation.

One line of defense could be to disable usage of strict aliasing by the optimizer, with a compilation flag such as -fno-strict-aliasing on gcc.
I wouldn’t recommend it though. On top of impacting performance, it ties code correctness to a specific compiler setting, which may or may not be present in the user’s project. Portability is also impacted, since there is no guarantee that this capability will be available on a different C compiler.

Another line of defense consists in using the char* pointer, which is the exception to the rule, and can alias anything. When one memory area is passed as a char*, the compiler will pay attention to serialize char* read/write properly. It works well in practice, at least in my tests. What is worrying though is that in theory, the compiler is only obliged to guarantee the read in correct order. That it pays attention to serialize the write too seems to be “extra care”, presumably so that existing programs continue to work as intended. Not sure if it is reliable to depend on it in the long term.

Another issue is, our proposed shell type is not a char array. It’s a union, containing a char array. That’s not the same, and in this case, the exception does not hold.

As a consequence, the shell type must not be confused with the target type. The strict aliasing rule makes them non-interchangeable !

Safe static allocation for opaque types

The trick is to use a 3rd party initializer, to convert the allocated space and return a pointer of appropriate type.
To ensure strict compliance with the C standard, it’s a multi-step trick, hence a more complex setup. Consider this technique as “advanced”, implying limited usage scenarios.

Here is an example :

typedef struct thing_s thing;   // incomplete (opaque) type

typedef union {
    char body[SIZE];
    unsigned alignment_enforcer;   // ensures `thingBody` respect alignment of largest member of `thing`
} thingBody;

// PREFIX_initStatic_thing() accepts any buffer as input, 
// and returns a properly initialized `thing*` opaque pointer.
// It ensures `buffer` has proper size (`SIZE`) and alignment (4) restrictions
// and will return `NULL` if it does not.
// Resulting `thing*` uses the provided buffer only, it will not allocate further memory on its own.
// Use `thingBody` to define a memory area respecting all conditions.
// On success, `thing*` will also be correctly initialized.
thing* PREFIX_initStatic_thing(void* buffer, size_t size);

// Notice there is no corresponding destructor.
// Since the space is reserved externally, its deallocation is controlled externally.
// This presumes that `initStatic` does not dynamically allocate further space.
// Note that it doesn't make sense for `initStatic` to invoke dynamic allocation.

/* ====================================== */
/* Example usage */

#include <assert.h>

int function(void)
{
    thingBody scratchSpace;   /* on stack */
    thing* const T = PREFIX_initStatic_thing(&scratchSpace, sizeof(scratchSpace));
    assert(T != NULL);  // Should be fine. Only exception is if `struct thing_s` definition changes and there is some version mismatch.
    
    // Now use `T` as a normal `thing*` pointer
    // (...)
    
    // do Not `free(T)` at function's end, since thingBody is part of the stack
    return 0;
}

In this example, the static size of thingBody is used to allocate space for thing on the stack. It’s faster, and there is no need to care about deallocation.

But that’s all it does. No data is ever read from nor written to thingBody. All usages of the memory region pass through thing*, which is safe.
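On the implementation side, the initializer could look like the sketch below. The fields of `struct thing_s` are hypothetical, and `_Alignof` assumes a C11 compiler. This is one possible arrangement rather than a reference implementation:

```c
#include <stddef.h>   /* size_t, NULL */
#include <stdint.h>   /* uintptr_t */

/* Private definition, known only to this unit (hypothetical fields). */
struct thing_s { int value; int count; };
typedef struct thing_s thing;

thing* PREFIX_initStatic_thing(void* buffer, size_t size)
{
    if (buffer == NULL) return NULL;
    if (size < sizeof(thing)) return NULL;                       /* too small */
    if ((uintptr_t)buffer % _Alignof(thing) != 0) return NULL;   /* misaligned */

    {   /* From here on, the memory is only ever accessed through `thing*`. */
        thing* const t = (thing*)buffer;
        t->value = 0;
        t->count = 0;
        return t;
    }
}
```

Checking size and alignment at runtime may look redundant with the static assert, but it covers the case where the caller provides a buffer that was not declared as thingBody.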

Compared to a usual public struct, the experience is not equivalent.
To begin with, the proposed stack allocation is a multi-liner and creates 2 variables : the shell type, and the target pointer. It’s not too bad, and this model fits well enough any kind of manual allocation scenario, be it on stack or within a pre-reserved area (for embedded environments typically).

If that matters, stack allocation could have been made a one liner, hidden behind a macro.
But I tend to prefer the variant in the above example. It makes it clear what’s happening. Since one of C’s strengths is a clear grasp of resource control, it is better to preserve that level of understanding.

There are more problematic differences though.
It’s not possible to use the shell type as a return type of a function: once again, shell type and target incomplete type are different things. Along the same lines, it’s not possible to pass the shell type by value. The memory region can only be passed by reference, and only using the correctly typed pointer.

Embedding the shell type into a larger structure is dangerous and generally not recommended : it requires 2 members (the shell and the pointer), but the pointer is only valid if the struct is not moved around, nor copied. That constraint is too strong to make it safely usable.

Removing the pointer

Suggested by Sebastian Aaltonen, it is generally possible to bypass the target pointer, and just reuse the address of the shell type instead. Since the shell type is never accessed directly, there is no aliasing to be afraid of.

The only issue is, some compilers might not like the pointer cast from shellType* to target opaque*, irrespective of the fact that the shellType is never accessed directly. This is an annoying false positive. That being said, newer compilers are better at detecting this pattern, and won’t complain.
Note that the explicit cast is not optional, so the notation cannot be shortened, hence this method will not save many keystrokes.

The real goal is to guarantee that the address transmitted is necessarily the address of shell. This makes sense when the intention is to move shell around or copy it : no risk to lose sync with a separate pointer variable.

To be complete, note that, in above proposal, initStatic() does more than casting a pointer :

  • It ensures that the memory area has correct size & alignment properties
    • shellType provides these guarantees too.
      • The only corner case is when the program invokes initStatic() from a dynamic library. If runtime library version is different from the one used during compilation of the program, it can lead to a potential discrepancy on size or alignment requirements.
      • No such risk when using static linking.
  • It ensures that the resulting pointer references a properly initialized memory area.

The second bullet point, in particular, still needs to be done one way or another, so initStatic() is still useful, at least as an initializer.

Using the shell type directly

Removing the pointer is nice, but the real game changer is to be able to employ the opaque type as if it was a normal struct, in particular :

  • assign with =
  • can be passed by value as function parameter
  • can be received as return type from a function

These properties can influence the API design, making the opaque type “feel” more natural to use. For example :

// declaration
#define SIZE 8
typedef union {
    char body[SIZE];
    unsigned align4;   // ensures `thing` is aligned on 4-bytes boundaries
} thing;
// No need for a "separate" incomplete type.
// The shell IS the public-facing type for API.

thing thing_init(void);
thing thing_set_byValue(int v);
thing thing_combine(thing a, thing b);

// usage
thing doubled_value(int v)
{
    thing const ta = thing_set_byValue(v);
    thing const tb = ta;
    return thing_combine(ta, tb);
}

This can be handy for small POD types (typically less than a few dozen bytes), giving them a behavior similar to basic types.
Since passing arguments and results by value implies some memory copy, the cost of this approach increases as type size increases. Therefore, whenever the type becomes uncomfortably large, prefer switching to a pointer reference.

The compiler may completely eliminate the memory copy operation if it can somehow inline the invoked functions. That’s, by definition, hard to do when these functions are in a separate unit, due to the need to access a private type declaration.
However, -flto (Link Time Optimization) can break the unit barrier. As a consequence, functions which were behaving correctly while not inlined might end up being inlined, triggering weird optimization effects.

For example, statements acting directly on shell*, such as potential memset() initialization, or any kind of value assignment, might be reordered for parallel processing with other statements within inlined functions acting on internal_type*, on the assumption that shell* and internal_type* should not be aliased.
To be fair, I would expect a modern compiler to be clever enough to detect that shell* and internal_type* reference effectively the same address, and avoid re-ordering or eliding memory read / write operations. Nevertheless, this is a risk, that might be triggered by complex cases or less clever compilers (typically older ones).

The solution is to use memcpy() to transfer data back and forth between internal type and shell type. memcpy() acts as a synchronization point for memory accesses : it guarantees that read and write orders will be serialized, ordered as written in the source code. The compiler will not be able to “outsmart” the code by re-ordering statements under the assumptions that side-effects on 2 pointers of different types cannot alias each other : a memcpy() can alias anything, so it has to be performed in the requested order.
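Applied to the API declared above, the implementation unit could look like the following sketch. The private `thing_internal` layout and the `thing_value()` accessor are assumptions added for illustration, and `_Static_assert` assumes a C11 compiler:

```c
#include <string.h>   /* memcpy, memset */

#define SIZE 8
typedef union {
    char body[SIZE];
    unsigned align4;
} thing;              /* public shell type, same as in the declaration */

/* Private representation, known only to this unit (hypothetical layout). */
typedef struct { int value; int weight; } thing_internal;

_Static_assert(sizeof(thing_internal) <= sizeof(thing), "shell too small");

/* Every transfer between both representations goes through memcpy(). */
static thing to_shell(thing_internal in)
{
    thing out;
    memset(&out, 0, sizeof(out));     /* define any trailing bytes */
    memcpy(&out, &in, sizeof(in));
    return out;
}

static thing_internal from_shell(thing t)
{
    thing_internal out;
    memcpy(&out, &t, sizeof(out));
    return out;
}

thing thing_set_byValue(int v)
{
    thing_internal const in = { v, 1 };
    return to_shell(in);
}

thing thing_combine(thing a, thing b)
{
    thing_internal const ia = from_shell(a);
    thing_internal const ib = from_shell(b);
    thing_internal const r  = { ia.value + ib.value, ia.weight + ib.weight };
    return to_shell(r);
}

/* Hypothetical accessor, added so that the result can be observed. */
int thing_value(thing t) { return from_shell(t).value; }
```

No pointer to `thing` is ever converted into a pointer to `thing_internal`, so there is nothing left for the aliasing rule to object to.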

Back to struct ?

Adding memcpy() everywhere is a small inconvenience. Also, there is always a risk that the compiler will not be smart enough to elide the copy operation.

Due to these limitations and risks, it can be better to give up this complexity and just use a public struct. As long as the struct is a POD type, all conveniences are available. And without the need to add some private declaration, it’s now possible to define implementations directly in header, as explicit inline functions, sharply reducing the cost of passing parameters.

To discourage direct access to structure members, one can state it clearly in code comments. A more involved way to protect struct members is to give them scary and useless names, such as dont_access_me_1, dont_access_me_2, etc. and rename them with macros in the code section which can actually interpret them. This takes a bit more work, especially if the number of members is large, and can lead to confusion. More importantly, the compiler will no longer be able to help in case of contract violation, and protecting the design pattern will now entirely depend on reviewers. Still, it’s a very reasonable choice, notably for “internal” types, which are not exposed in the user-facing API, hence should only be manipulated by a small number of skillful contributors subject to a review process.
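The renaming technique can be sketched as follows, using a hypothetical `counter` type. The macro names are assumptions, and they are undefined again once the implementation section ends:

```c
/* Public POD struct: member names are deliberately meaningless. */
typedef struct {
    int dont_access_me_1;
    int dont_access_me_2;
} counter;

/* Implementation section: restore meaningful names locally. */
#define COUNTER_VALUE dont_access_me_1
#define COUNTER_LIMIT dont_access_me_2

static counter counter_create(int limit)
{
    counter c;
    c.COUNTER_VALUE = 0;
    c.COUNTER_LIMIT = limit;
    return c;
}

/* Increments the counter up to its limit, and returns the new value. */
static int counter_increment(counter* c)
{
    if (c->COUNTER_VALUE < c->COUNTER_LIMIT) c->COUNTER_VALUE++;
    return c->COUNTER_VALUE;
}

#undef COUNTER_VALUE
#undef COUNTER_LIMIT
```

Nothing prevents user code from writing `c.dont_access_me_1` directly, which is why this approach relies on reviewers rather than on the compiler.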

For user facing types though, opacity is more valuable. And if the type size is large enough to begin with, it seems a no brainer : prefer the opaque type, and only use references.