A few days ago I learned about a cool new project called printpp which is a printf format string preprocessor. It rewrites format strings from brace-delimited format into printf format at compile time. Although the project mentions Python formatting, the only thing they seem to have in common is delimiters.

Great thing about this approach is that it doesn’t add any overhead to printf. For example, the code

where pprintf is a macro defined by pprintpp, compiles to

This is very nice in terms of binary code size, but can we do better in terms of performance if we don’t limit ourselves to printf? Notice that the compiler generated a call to __printf_chk. Let’s have a look at this function:

Unsurprisingly this function does little but call vfprintf. Calling an overload that takes va_list is common for functions that use varargs. Let’s use Google Benchmark to measure the effect of this extra call and vararg processing:

and let’s compare this to a method of passing formatting arguments via variadic templates implemented in the fmt library:

Note that I’m using the experimental std branch of the fmt library for cleaner C++11 API, but the results should be the same for the stock version which is C++98 compatible.

The test_print function calls make_format_args to place formatting arguments in an fmt::format_args object that is passed to test_vprint. This looks very similar to the printf case but because test_print and fmt::make_format_args are inline, compiler optimizes them away and the effect is that the arguments are placed into fmt::format_args at the caller site. And since fmt::format_args is not parameterized on format argument types, there is no template code bloat.

Running this benchmark on my Early 2015 MacBook Pro gives the following result:

The fmt_variadic method that uses variadic templates is 2.5 times faster but the absolute difference is just 3ns per call. Does it really matter? To answer this question, let’s compare it to the total formatting time:

As you can see formatting is so fast that even varargs overhead is nontrivial. Using varargs would slow down fmt::format by more than 6% and this is not even the most efficient API fmt provides, because it returns std::string.

The situation becomes even more interesting when you take into account positional arguments. The arguments in va_list can only be accessed sequentially and need to be copied by a printf-like function to an array for efficient access by index.

With variadic templates nothing prevents storing arguments in an array on the caller side and this is exactly what fmt does. The effect is easy to see on this benchmark:

Using positional arguments slows down sprintf by 33% and fmt::format by only 4%.

Of course you don’t get all this performance improvement completely for free. The formatting code

compiles to the following binary code:

Since the earlier printpp’s objdump output seem to be obtained with GCC on Linux I use the same platform here to get a meaningful comparison. I also passed -fno-stack-protector to gcc because there was no stack protection code in printpp’s case either.

Compared to printf, fmt::print requires 9 extra bytes to place arguments in the array on stack, 3 more bytes to pass the pointer to the argument array, and 3 fewer bytes due to no return value. So the total difference is just 9 bytes and the number of CPU instructions is the same.

The cool thing is that the type information is effectively passed for free (compared to printf) because it takes the same amount of code as passing the flag argument to __printf_chk.

So replacing varargs with variadic templates may give nontrivial performance improvement if you are willing to accept small increase in binary code. If you want to keep your binaries as small as possible, but still want to enjoy somewhat improved type safety compared to printf, check out pprintpp.

Not relying on printf also facilitates extensibility and, in particular, formatting of user-defined types, but that’s a topic for another blog post.

Benchmarks used in this post are available in the format-benchmark repository (see vararg-benchmark.cc).