Botan News

Botan 1.7 Progress Report: Fast Word Load/Store

So far there have been a number of pretty interesting changes made in the development tree of Botan in the 1.7 release series. I think the thing that excites me most about them is that a number were contributed by people who were using Botan for their own projects, noticed areas that it could be improved, and sent patches. I'm planning on a series of posts detailing some of the more exciting or interesting of them, this being the first.

One change that has been very nice is the inclusion of word<->byte conversion templates by Yves Jerschow. In many algorithms, we have to convert between arrays of bytes (the input and/or output buffers) and some internal wordsize that everything is actually be processed in. Here is an example taken from the source code to Blowfish in 1.6.4: R and L are the two 32-bit words that are used inside the algorithm, and represent the processed ciphertext, which we want to load into the output array:

   out[0] = get_byte(0, R); out[1] = get_byte(1, R);
   out[2] = get_byte(2, R); out[3] = get_byte(3, R);
   out[4] = get_byte(0, L); out[5] = get_byte(1, L);
   out[6] = get_byte(2, L); out[7] = get_byte(3, L);

Seems a little redundant, yes? Unfortunately due to performance requirements, we can't just use a loop for this: shocking as it may seem, many C++ compilers will not generate the same code for a loop like

   for(size_t j = 0; j != 4; ++j)
      {
      out[j  ] = get_byte(j, R):
      out[j+4] = get_byte(j, L):
      }

as it will for the unrolled instructions above.

And since this code has to run for every 8 bytes of input, every cycle counts. However, clearly the exact same code is needed anywhere one wishes to write out a pair of 32-bit integers to a byte array in big-endian format: in fact I would often copy-and-paste the code to do it! In retrospect, that this did not get me to start looking at alternatives is somewhat embarrassing. The code for handling four words, or for little-endian values, and so on, are just simple variations on this code.

With the new templates, we can simply write:

   store_be(out, R, L);

and let some very basic template and function overload magic do the work for us, without any additional runtime overhead. What does store_be look like? First we have some code for handling multiple values; the one for storing two words looks like:

template<typename T>
inline void store_be(byte out[], T a, T b)
   {
   store_be(a, out + (0 * sizeof(T)));
   store_be(b, out + (1 * sizeof(T)));
   }

Since sizeof is a compile-time constant, the compiler should calculate these offsets at compile time and insert them directly into the instruction stream, then inline the appropriate version of store_be, which depends on the type of T (which, again, is known at compile time).

And now we get to the additional optimization that this whole thing provides. The version of store_be for a single 32-bit integer is:

inline void store_be(u32bit in, byte out[4])
   {
#if BOTAN_TARGET_UNALIGNED_LOADSTOR_OK
   *reinterpret_cast<u32bit*>(out) = BOTAN_ENDIAN_B2N(in);
#else
   out[0] = get_byte(0, in);
   out[1] = get_byte(1, in);
   out[2] = get_byte(2, in);
   out[3] = get_byte(3, in);
#endif
   }

What is this? The second part, with a sequence of get_byte, is our same dumb sequence of extractions (get_byte itself compiles down to a shift and a mask on most architectures). So, assuming the compiler is reasonable good about inlining, we will get the exact same assembly as we did before, but with substantially less code required.

But, what is up with the macro? It turns out that, if we happen to know that unaligned memory reads and writes work, we can substantially optimize the sequence via nasty pointer manipulation (reinterpret_cast is not to be used lightly!) and writing to a byte pointer as if it was a word pointer. Programmers who work on RISC processors are no doubt raising their eyebrows at this point, because this is something that only works safely on processors like x86/x86-64: a SPARC program that attempts to write to an unaligned word pointer will be immediately killed by a SIGBUS signal from the kernel. That is why the guard macro is used; currently it is only enabled by default on x86/x86-64 processors, though if someone has a non-x86 processor out there that can handle misaligned word read/writes they can enable it at build time and get the same optimization. This does require a byte swap operation, if the target endianness does not match that of the CPU we are running on; on x86-64 we make use of the bswap instruction for this. On my Core2, I see a 2-10% improvement in the performance of many algorithms with this optimization enabled.

It may seem like this a relatively small gain, but over the source code for 25 block ciphers, 3 stream ciphers, 12 hash functions, and various other bits of code that need perform byte<->word conversions, we gain true improvements in code size and clarity (which was the whole point: the extra performance optimization was just a bonus).

Posted 2008/05/03 in devnotes; no comments

< License Change in Botan | Botan 1.7.6 Released >

Name:


E-mail:


URL:


Comment: