Botan 1.7 Progress Report: Fast Word Load/Store
So far there have been a number of pretty interesting changes made in the development tree of Botan in the 1.7 release series. I think the thing that excites me most about them is that a number were contributed by people who were using Botan for their own projects, noticed areas that it could be improved, and sent patches. I'm planning on a series of posts detailing some of the more exciting or interesting of them, this being the first.
One change that has been very nice is the inclusion of word<->byte conversion templates by Yves Jerschow. In many algorithms, we have to convert between arrays of bytes (the input and/or output buffers) and some internal wordsize that everything is actually be processed in. Here is an example taken from the source code to Blowfish in 1.6.4: R and L are the two 32-bit words that are used inside the algorithm, and represent the processed ciphertext, which we want to load into the output array:
out[0] = get_byte(0, R); out[1] = get_byte(1, R); out[2] = get_byte(2, R); out[3] = get_byte(3, R); out[4] = get_byte(0, L); out[5] = get_byte(1, L); out[6] = get_byte(2, L); out[7] = get_byte(3, L);
Seems a little redundant, yes? Unfortunately due to performance requirements, we can't just use a loop for this: shocking as it may seem, many C++ compilers will not generate the same code for a loop like
for(size_t j = 0; j != 4; ++j)
{
out[j ] = get_byte(j, R):
out[j+4] = get_byte(j, L):
}
as it will for the unrolled instructions above.
And since this code has to run for every 8 bytes of input, every cycle counts. However, clearly the exact same code is needed anywhere one wishes to write out a pair of 32-bit integers to a byte array in big-endian format: in fact I would often copy-and-paste the code to do it! In retrospect, that this did not get me to start looking at alternatives is somewhat embarrassing. The code for handling four words, or for little-endian values, and so on, are just simple variations on this code.
With the new templates, we can simply write:
store_be(out, R, L);
and let some very basic template and function overload magic do the
work for us, without any additional runtime overhead. What does
store_be look like? First we have some code for handling
multiple values; the one for storing two words looks like:
template<typename T>
inline void store_be(byte out[], T a, T b)
{
store_be(a, out + (0 * sizeof(T)));
store_be(b, out + (1 * sizeof(T)));
}
Since sizeof is a compile-time constant, the compiler
should calculate these offsets at compile time and insert them
directly into the instruction stream, then inline the appropriate
version of store_be, which depends on the type of T (which,
again, is known at compile time).
And now we get to the additional optimization that this whole thing
provides. The version of store_be for a single 32-bit integer
is:
inline void store_be(u32bit in, byte out[4])
{
#if BOTAN_TARGET_UNALIGNED_LOADSTOR_OK
*reinterpret_cast<u32bit*>(out) = BOTAN_ENDIAN_B2N(in);
#else
out[0] = get_byte(0, in);
out[1] = get_byte(1, in);
out[2] = get_byte(2, in);
out[3] = get_byte(3, in);
#endif
}
What is this? The second part, with a sequence of
get_byte, is our same dumb sequence of extractions
(get_byte itself compiles down to a shift and a mask on most
architectures). So, assuming the compiler is reasonable good about
inlining, we will get the exact same assembly as we did before, but
with substantially less code required.
But, what is up with the macro? It turns out that, if we happen to
know that unaligned memory reads and writes work, we can substantially
optimize the sequence via nasty pointer manipulation
(reinterpret_cast is not to be used lightly!) and writing to
a byte pointer as if it was a word pointer. Programmers who work on
RISC processors are no doubt raising their eyebrows at this point,
because this is something that only works safely on processors like
x86/x86-64: a SPARC program that attempts to write to an unaligned
word pointer will be immediately killed by a SIGBUS signal from the
kernel. That is why the guard macro is used; currently it is only
enabled by default on x86/x86-64 processors, though if someone has a
non-x86 processor out there that can handle misaligned word
read/writes they can enable it at build time and get the same
optimization. This does require a byte swap operation, if the target
endianness does not match that of the CPU we are running on; on x86-64
we make use of the bswap instruction for this. On my Core2, I
see a 2-10% improvement in the performance of many algorithms with
this optimization enabled.
It may seem like this a relatively small gain, but over the source code for 25 block ciphers, 3 stream ciphers, 12 hash functions, and various other bits of code that need perform byte<->word conversions, we gain true improvements in code size and clarity (which was the whole point: the extra performance optimization was just a bonus).
Posted 2008/05/03 in devnotes; no comments
< License Change in Botan | Botan 1.7.6 Released >