Converting floats to strings, part 1

Much has been written about converting floating point numbers to strings: Bruce Dawson has lots to say, David M. Gay's venerable dtoa is a classic, and Florian Loitsch's relatively recent grisu paper is worth studying. Often the problem is framed as "converting a floating point number to the shortest possible decimal string representation", but this framing is neither neccessary nor sufficient for implementing the %e / %f / %g formats of sprintf. Furthermore, this framing introduces significant complexity. As such, I'd like to begin by considering a much simpler framing: convert a double-precision floating point number to a decimal string, such that the string gives the exact mathematical real number represented by the float.

First of all, we need a quick reminder of what an IEEE754 double-precision floating point number looks like under the hood. If we ignore negative numbers, infinities, NaNs, and denormals, then we have just an 11-bit exponent e and a 53-bit mantissa m, which together represent the number m (an integer) times 2^e (where e might be negative). Denormals also fit the pattern of m times 2^e, albeit with a different encoding of m and e.

Armed with this knowledge, we can pull the m and e fields out of a number:

typedef union {
  double n;
  uint64_t u64;
  struct {
    uint32_t lo;
    uint32_t hi;
  } u32;
} TValue;

void decode(double n) {
  TValue t;
  t.n = n;
  if ((t.u32.hi << 1) >= 0xffe00000) {
    if (((t.u32.hi & 0x000fffff) | t.u32.lo) != 0) {
      printf("NaN\n");
    } else {
      printf("Infinity\n");
    }
  } else {
    int32_t e = (t.u32.hi >> 20) & 0x7ff;
    uint64_t m = t.u32.hi & 0xfffff;
    if (e == 0) {
      e++;
    } else {
      m |= 0x100000;
    }
    e -= 1043;
    if (t.u32.lo) {
      e -= 32;
      m = (m << 32) | t.u32.lo;
    }
    printf("%llu * 2^%d\n", (long long unsigned)m, (int)e);
  }
}

Some example outputs of this decode function are:

n	Output
`0./0.`	`NaN`
`1./0.`	`Infinity`
`0.`	`0 * 2^-1042`
`1.`	`1048576 * 2^-20`
`10.`	`1310720 * 2^-17`
`0.1`	`7205759403792794 * 2^-56`
`1e-308`	`2024022533073106 * 2^-1074`
`pow(2, 1020)`	`1048576 * 2^1000`

These outputs are not in decimal, and they're not particularly convenient for human comprehension, but they are 100% accurate.

From m and e values, one classical approach for getting decimal digits is:

Convert m to a binary bignum (easy).
Multiply the binary bignum by 2^e (easy).
In a loop, compute the binary bignum modulo ten to get one digit (hard), and then divide the binary bignum by ten (hard).

I prefer the following approach, as it gets rid of the hard part:

Convert m to a decimal bignum (not easy, but not hard).
Multiply the decimal bignum by 2^e (not easy, but not hard).
In a loop, print the digits of the decimal bignum (easy).

What do I mean by binary bignum and decimal bignum? A binary bignum is a number stored as multiple uint32_t pieces, with each piece being in the range 0 through 2³²-1, and the number being sum(piece_i * (2³²)ⁱ). On the other hand, a decimal bignum is a number stored as multiple uint32_t pieces, with each piece being in the range 0 through 10⁹-1, and the number being sum(piece_i * (10⁹)ⁱ). 10⁹ is chosen as it is the largest power of ten which fits into a uint32_t (even better, 10⁹ requires 29.9 bits to store, so only 2.1 bits out of every 32 are wasted).

I'd like to represent a decimal bignum using the following variables:

uint32_t nd[128];
int32_t ndlo;
int32_t ndhi;

For ndlo <= i <= ndhi, piece_i is nd[i & 127], and otherwise piece_i is zero.

As it happens, we already know how to print the digits of an individual decimal bignum piece. Using one of those functions from last time, printing an entire decimal bignum is painless:

void nd_print(char* p, uint32_t* nd, int32_t ndlo, int32_t ndhi) {
  int32_t i;
  for (i = ndhi; i >= 0; --i) {
    nasonov9(p, nd[i & 127]); p += 9;
  }
  *p++ = '.';
  for (; i >= ndlo; --i) {
    nasonov9(p, nd[i & 127]); p += 9;
  }
  *p = 0;
}

Multiplying by 2^e takes a bit of effort. Let's start by considering the case where e is negative, at which point we're really dividing by 2^-e. In turn, this is dividing by 2 -e times. Just dividing by 2 doesn't take that much code:

int32_t nd_div2(uint32_t* nd, int32_t ndlo, int32_t ndhi) {
  uint32_t i = ndhi & 127, carry = 0;
  for (;;) {
    uint32_t val = nd[i];
    nd[i] = (val >> 1) + carry;
    carry = (val & 1) * 500000000;
    if (i == (ndlo & 127)) break;
    i = (i - 1) & 127;
  }
  if (carry) nd[--ndlo & 127] = carry;
  return ndlo;
}

We can generalise this to dividing by 2^k:

int32_t nd_div2k(uint32_t* nd, int32_t ndlo, int32_t ndhi,
                 uint32_t k) {
  uint32_t mask = (1U << k) - 1, mul = 1000000000 >> k;
  uint32_t i = ndhi & 127, carry = 0;
  for (;;) {
    uint32_t val = nd[i];
    nd[i] = (val >> k) + carry;
    carry = (val & mask) * mul;
    if (i == (ndlo & 127)) break;
    i = (i - 1) & 127;
  }
  if (carry) nd[--ndlo & 127] = carry;
  return ndlo;
}

The above will work for k between zero and nine inclusive; if k is larger than nine, then 1000000000 >> k can no longer be represented by an integer. As such, the complete solution has to start by dividing in batches of 2⁹:

int32_t nd_div2k(uint32_t* nd, int32_t ndlo, int32_t ndhi,
                 uint32_t k) {
  while (k >= 9) {
    uint32_t i = ndhi & 127, carry = 0;
    for (;;) {
      uint32_t val = nd[i];
      nd[i] = (val >> 9) + carry;
      carry = (val & 0x1ff) * 1953125;
      if (i == (ndlo & 127)) break;
      i = (i - 1) & 127;
    }
    if (carry) nd[--ndlo & 127] = carry;
    k -= 9;
  }
  if (k) {
    uint32_t mask = (1U << k) - 1, mul = 1000000000 >> k;
    uint32_t i = ndhi & 127, carry = 0;
    for (;;) {
      uint32_t val = nd[i];
      nd[i] = (val >> k) + carry;
      carry = (val & mask) * mul;
      if (i == (ndlo & 127)) break;
      i = (i - 1) & 127;
    }
    if (carry) nd[--ndlo & 127] = carry;
  }
  return ndlo;
}

We can then go through the same process for multiplying, starting with multiplication by 2:

int32_t nd_mul2(uint32_t* nd, int32_t ndhi) {
  uint32_t carry_in = 0;
  for (uint32_t i = 0; i <= (uint32_t)ndhi; i++) {
    uint32_t val = (nd[i] << 1) | carry_in;
    carry_in = val / 1000000000;
    nd[i] = val - carry_in * 1000000000;
  }
  if (carry_in) nd[++ndhi] = carry_in;
  return ndhi;
}

By promoting val to 64-bits, this can be generalised to small k:

int32_t nd_mul2k(uint32_t* nd, int32_t ndhi, uint32_t k) {
  uint32_t carry_in = 0;
  for (uint32_t i = 0; i <= (uint32_t)ndhi; i++) {
    uint64_t val = ((uint64_t)nd[i] << k) | carry_in;
    carry_in = (uint32_t)(val / 1000000000);
    nd[i] = (uint32_t)val - carry_in * 1000000000;
  }
  if (carry_in) nd[++ndhi] = carry_in;
  return ndhi;
}

This time the constraint on k comes from wanting val to be no more than (10⁹)², which limits k to 29. It also turns out to be useful to make carry_in a parameter, all of which leads to the complete code for multiplying by 2^k:

int32_t nd_mul2k(uint32_t* nd, int32_t ndhi, uint32_t k,
                 uint32_t carry_in) {
  while (k >= 29) {
    for (uint32_t i = 0; i <= (uint32_t)ndhi; i++) {
      uint64_t val = ((uint64_t)nd[i] << 29) | carry_in;
      carry_in = (uint32_t)(val / 1000000000);
      nd[i] = (uint32_t)val - carry_in * 1000000000;
    }
    if (carry_in) {
      nd[++ndhi] = carry_in; carry_in = 0;
    }
    k -= 29;
  }
  if (k) {
    for (uint32_t i = 0; i <= (uint32_t)ndhi; i++) {
      uint64_t val = ((uint64_t)nd[i] << k) | carry_in;
      carry_in = (uint32_t)(val / 1000000000);
      nd[i] = (uint32_t)val - carry_in * 1000000000;
    }
    if (carry_in) nd[++ndhi] = carry_in;
  }
  return ndhi;
}

We can plug these routines into the decode function from earlier to create a print function:

void print(double n) {
  TValue t;
  t.n = n;
  if ((t.u32.hi << 1) >= 0xffe00000) {
    if (((t.u32.hi & 0x000fffff) | t.u32.lo) != 0) {
      printf("NaN\n");
    } else {
      printf("Infinity\n");
    }
  } else {
    char buf[1154];
    uint32_t nd[128];
    int32_t ndlo = 0;
    int32_t ndhi = 0;
    int32_t e = (t.u32.hi >> 20) & 0x7ff;
    nd[0] = t.u32.hi & 0xfffff;
    if (e == 0) {
      e++;
    } else {
      nd[0] |= 0x100000;
    }
    e -= 1043;
    if (t.u32.lo) {
      e -= 32;
      nd[0] = (nd[0] << 3) | (t.u32.lo >> 29);
      ndhi = nd_mul2k(nd, ndhi, 29, t.u32.lo & 0x1fffffff);
    }
    if (e >= 0) {
      ndhi = nd_mul2k(nd, ndhi, (uint32_t)e, 0);
    } else {
      ndlo = nd_div2k(nd, ndlo, ndhi, (uint32_t)-e);
    }
    nd_print(buf, nd, ndlo, ndhi);
    printf("%s\n", buf);
  }
}

Some example outputs of this print function are:

n	Output
`0./0.`	`NaN`
`1./0.`	`Infinity`
`0.`	`000000000.`
`1.`	`000000001.`
`10.`	`000000010.`
`0.1`	`000000000000000000.1000000000000000055511 15123125782702118158340454101562500000000`
`1e-308`	000000000000000000.0000000000000000000000 00000000000000000000000000000000000000000 00000000000000000000000000000000000000000 00000000000000000000000000000000000000000 00000000000000000000000000000000000000000 00000000000000000000000000000000000000000 00000000000000000000000000000000000000000 00000000000000000000000000000000000000009 99999999999999909326625337248461995470488 73403204569370722504933164788134100221702 36685306110285951575783017584918228243784 38792553200763769833775473829862512856683 41346193998972906543693727922885247662294 86591679434355446221493480729436132941672 16662821737555414480159115639791276054897 20142038977058035153396077150619905566488 97702602917109778267250244017165230316273 90652604144008597950935492433262042405635 56399326294969169893097546113480479123599 46979384052000893178607312050101591177117 04697471514344499487123311264707354172378 09953873785021982614510236627959137966047 18812599767273565216024053297899062477635 21525981391443887618575275588619928089116 90506171197530846785775640581096161907433 18668839610809435427125598308539800029848 265694454312324523925781250000000
`pow(2, 1020)`	`00000001123558209288947442330815744243140 45851123561183894160795893800723582922378 43810195794279832650471001320007117491962 08485367436055090103890580296441496713277 36104933390540928297688887250778808824658 17684505312860552384417646403930092119569 40880170232270940691778664363999670287115 4982269052209770601514008576.`

Some of these outputs could be tidied up by trimming leading and trailing zeroes, but other than that, this is the output we were aiming for. We can see that the double-precision floating point number closest to 0.1 is actually ever so slightly more than 0.1, that the double-precision floating point number closest to 1e-308 is actually ever so slightly less than 1e-308, and that in some cases the string representation of the exact decimal represented by a float can be extremely long (though, to be fair, 1e-308 is almost as bad as it gets for length).

Next time we'll see how to adapt nd_print into something which can behave like the %e, %f, and %g formats of sprintf (i.e. outputs which you might actually want, as opposed to pedantic exact representations).