Converting 9-digit integers to strings

Much has been written on the internet about converting integers to strings (e.g. A, B, C). Normally, when this is considered, half of the problem (or the interesting trick in the solution) revolves around knowing how long the resulting string will be. As such, I'd like to consider a slight variant of the problem: converting an integer between 0 and 999999999 to a nine digit string, and always emitting the full nine digits (so 456 becomes 000000456, for example). We can begin with a simple libc-based approach:

void sprintf9(char* p, uint32_t n) {
  sprintf(p, "%09u", n);
}

Using sprintf is short and pithy, but potentially not the most efficient choice. Another option might be:

void divmod9(char* p, uint32_t n) {
  for (uint32_t i = 9; i--; n /= 10) {
    p[i] = '0' + (n % 10);
  }
}

It might look like this approach is going to be slowed down by looping and dividing quite a lot, but Clang will fully unroll the loop, turn all the division into multiplication, and turn the addition into bitwise-or. The result is impressively efficient. In particular, the divide-by-multiplying trick is something that everyone should read up on - I'd recommend this explaination by "fish" for those unfamiliar with it.

A third alternative is this enigma from LuaJIT:

#define WINT_R(x, sh, sc) { \
  uint32_t d = (x * (((1<<sh)+sc-1)/sc)) >> sh; \
  x -= d * sc; \
  *p++ = (char)('0' + d); }

void lj_strfmt_wuint9(char* p, uint32_t n) {
  uint32_t v = n / 10000, w;
  n -= v * 10000;
  w = v / 10000;
  v -= w * 10000;
  *p++ = (char)('0' + w);
  WINT_R(v, 23, 1000)
  WINT_R(v, 12, 100)
  WINT_R(v, 10, 10)
  *p++ = (char)('0' + v);
  WINT_R(n, 23, 1000)
  WINT_R(n, 12, 100)
  WINT_R(n, 10, 10)
  *p++ = (char)('0' + n);
}

If you consider (x * (((1<<sh)+sc-1)/sc)) >> sh in the realm of mathematical real numbers, rather than the realm of C integer arithmetic, then the expression looks kind of like x * (((1 << sh) / sc) >> sh), which in turn looks kind of like x * (1 / sc), which is of course x / sc. This intution turns out to be spot-on: this expression is the same divide-by-multiplying trick as used by Clang - sc is the value we want to divide by (what fish calls d), sh is what fish calls k, and (((1<<sh)+sc-1)/sc) is what fish calls m. The choices of sh look lightly magical, but 23 is the smallest k which suffices for dividing numbers in the range 0 through 9999 by 1000, 12 is the smallest k for 0 through 999 by 100, and 10 is the smallest k for 0 through 99 by 10.

Another option is this SSE2 implementation, which is clear as mud:

#include <emmintrin.h>

void vectorised9(char* p, uint32_t n) {
  __m128i a = _mm_set1_epi32(n);
  __m128i b = _mm_srli_epi64(
      _mm_mul_epu32(a, _mm_set1_epi32(879609303)), 43);
  __m128i c = _mm_shuffle_epi32(_mm_mul_epu32(b,
      _mm_setr_epi32(10000, 0, 429497, 0)), 0x47);
  p[0] = '0' | _mm_cvtsi128_si32(c);
  __m128i d = _mm_sub_epi32(_mm_unpacklo_epi64(b, a),
      _mm_mul_epu32(c, _mm_setr_epi32(10000, 0, 1, 0)));
  __m128i e = _mm_srli_epi32(
      _mm_mul_epu32(d, _mm_set1_epi32(5243)), 19);
  __m128i f = _mm_or_si128(e,
      _mm_shuffle_epi32(_mm_sub_epi32(d,
          _mm_mul_epu32(e, _mm_set1_epi32(100))), 0x91));
  __m128i g = _mm_mulhi_epu16(f, _mm_set1_epi32(6554));
  __m128i h = _mm_slli_si128(_mm_sub_epi32(f,
      _mm_mullo_epi16(g, _mm_set1_epi32(10))), 2);
  __m128i i = _mm_packus_epi16(_mm_or_si128(g, h), h);
  _mm_storel_epi64((__m128i*)(p + 1),
      _mm_or_si128(i, _mm_set1_epi32(0x30303030)));
}

After you get over all the underscores, you can see that this code is full of the same divide-by-multiplying trick. For example, the assignment to g is doing WINT_R(f, 16, 10) four times in parallel (note the choice of k = 16 rather than the minimal k = 10; it could use 10, but using 16 allows it to get a shift for free as part of _mm_mulhi_epu16).

Of course, a consideration of different implementations wouldn't be complete without a benchmark. As such, I present a highly unscientific benchmark of how long it takes a single core of my MacBook to convert every integer between 0 and 999999999 to a string:

	clang -O2	clang -O2 -m32 -msse2
sprintf9	1m31s	1m52s
divmod9	8.7s	11.7s
lj_strfmt_wuint9	9.5s	11.4s
vectorised9	5.2s	5.2s

We can conclude that vectorisation is a win, that 32-bit code is slightly slower than 64-bit code (except for the vectorised solution), that LuaJIT's cleverness isn't an improvement upon the simple divmod9 (once Clang has optimised the hell out of divmod9), and that sprintf is dog slow.

Of course, your choice of compiler is important. I've used Clang for the above, as it is the default easy option for MacBooks, but we shouldn't forget gcc. Then again, perhaps we should forget gcc: its 64-bit compilation of vectorised9 turns _mm_set1_epi32(n) into a store-to-memory, a load-from-memory, and a shuffle — the shuffle is required, but the store and load are definitely not. Meanwhile, gcc's 32-bit compilation of vectorised9 turns _mm_storel_epi64 into a store-to-memory, two loads-from-memory, and two more stores-to-memory — everything except the first store is completely superfluous.

Next time we'll see how converting 9-digit integers to strings turns out to be very useful in the context of converting floating point numbers to strings.

Update:

@nasonov points out a more efficient vectorised implementation, which clocks in at 3.6s (64-bit) or 4.2 seconds (32-bit) in my benchmark setup:

#include <emmintrin.h>

void nasonov9(char* p, uint32_t u) {
  uint32_t v = u / 10000;
  uint32_t w = v / 10000;
  u -= v * 10000;
  v -= w * 10000;

  const __m128i first_madd =
      _mm_set_epi16(-32768, -32768, 0, 26215, 0, 10486, 0, 8389);
  const __m128i mask =
      _mm_set_epi16(0xffff, 0, 0xfffc, 0, 0xfff0, 0, 0xff80, 0);
  const __m128i second_madd =
      _mm_set_epi16(-256, -640, 64, -160, 16, -20, 2, 0);

  __m128i x = _mm_madd_epi16(_mm_set1_epi16(v), first_madd);
  __m128i y = _mm_madd_epi16(_mm_set1_epi16(u), first_madd);
  x = _mm_and_si128(x, mask);
  y = _mm_and_si128(y, mask);
  x = _mm_or_si128(x, _mm_slli_si128(x, 2));
  y = _mm_or_si128(y, _mm_slli_si128(y, 2));
  x = _mm_madd_epi16(x, second_madd);
  y = _mm_madd_epi16(y, second_madd);

  __m128i z = _mm_srli_epi16(_mm_packs_epi32(x, y), 8);
  z = _mm_packs_epi16(z, z);
  p[0] = '0' | w;
  _mm_storel_epi64((__m128i*)(p + 1),
      _mm_or_si128(z, _mm_set1_epi32(0x30303030)));
}