Converting floats to strings, part 1
Much has been written about converting floating point numbers to strings: Bruce Dawson has lots to say, David M. Gay's venerable dtoa is a classic, and Florian Loitsch's relatively recent grisu paper is worth studying. Often the problem is framed as "converting a floating point number to the shortest possible decimal string representation", but this framing is neither neccessary nor sufficient for implementing the %e
/ %f
/ %g
formats of sprintf
. Furthermore, this framing introduces significant complexity. As such, I'd like to begin by considering a much simpler framing: convert a double-precision floating point number to a decimal string, such that the string gives the exact mathematical real number represented by the float.
First of all, we need a quick reminder of what an IEEE754 double-precision floating point number looks like under the hood. If we ignore negative numbers, infinities, NaNs, and denormals, then we have just an 11-bit exponent e and a 53-bit mantissa m, which together represent the number m (an integer) times 2e (where e might be negative). Denormals also fit the pattern of m times 2e, albeit with a different encoding of m and e.
Armed with this knowledge, we can pull the m and e fields out of a number:
typedef union {
double n;
uint64_t u64;
struct {
uint32_t lo;
uint32_t hi;
} u32;
} TValue;
void decode(double n) {
TValue t;
t.n = n;
if ((t.u32.hi << 1) >= 0xffe00000) {
if (((t.u32.hi & 0x000fffff) | t.u32.lo) != 0) {
printf("NaN\n");
} else {
printf("Infinity\n");
}
} else {
int32_t e = (t.u32.hi >> 20) & 0x7ff;
uint64_t m = t.u32.hi & 0xfffff;
if (e == 0) {
e++;
} else {
m |= 0x100000;
}
e -= 1043;
if (t.u32.lo) {
e -= 32;
m = (m << 32) | t.u32.lo;
}
printf("%llu * 2^%d\n", (long long unsigned)m, (int)e);
}
}
Some example outputs of this decode
function are:
n | Output |
---|---|
0./0. | NaN |
1./0. | Infinity |
0. | 0 * 2^-1042 |
1. | 1048576 * 2^-20 |
10. | 1310720 * 2^-17 |
0.1 | 7205759403792794 * 2^-56 |
1e-308 | 2024022533073106 * 2^-1074 |
pow(2, 1020) | 1048576 * 2^1000 |
These outputs are not in decimal, and they're not particularly convenient for human comprehension, but they are 100% accurate.
From m and e values, one classical approach for getting decimal digits is:
- Convert m to a binary bignum (easy).
- Multiply the binary bignum by 2e (easy).
- In a loop, compute the binary bignum modulo ten to get one digit (hard), and then divide the binary bignum by ten (hard).
I prefer the following approach, as it gets rid of the hard part:
- Convert m to a decimal bignum (not easy, but not hard).
- Multiply the decimal bignum by 2e (not easy, but not hard).
- In a loop, print the digits of the decimal bignum (easy).
What do I mean by binary bignum and decimal bignum? A binary bignum is a number stored as multiple uint32_t
pieces, with each piece being in the range 0 through 232-1, and the number being sum(piecei * (232)i)
. On the other hand, a decimal bignum is a number stored as multiple uint32_t
pieces, with each piece being in the range 0 through 109-1, and the number being sum(piecei * (109)i)
. 109 is chosen as it is the largest power of ten which fits into a uint32_t
(even better, 109 requires 29.9 bits to store, so only 2.1 bits out of every 32 are wasted).
I'd like to represent a decimal bignum using the following variables:
uint32_t nd[128];
int32_t ndlo;
int32_t ndhi;
For ndlo <= i <= ndhi
, piecei
is nd[i & 127]
, and otherwise piecei
is zero.
As it happens, we already know how to print the digits of an individual decimal bignum piece. Using one of those functions from last time, printing an entire decimal bignum is painless:
void nd_print(char* p, uint32_t* nd, int32_t ndlo, int32_t ndhi) {
int32_t i;
for (i = ndhi; i >= 0; --i) {
nasonov9(p, nd[i & 127]); p += 9;
}
*p++ = '.';
for (; i >= ndlo; --i) {
nasonov9(p, nd[i & 127]); p += 9;
}
*p = 0;
}
Multiplying by 2e takes a bit of effort. Let's start by considering the case where e is negative, at which point we're really dividing by 2-e. In turn, this is dividing by 2 -e times. Just dividing by 2 doesn't take that much code:
int32_t nd_div2(uint32_t* nd, int32_t ndlo, int32_t ndhi) {
uint32_t i = ndhi & 127, carry = 0;
for (;;) {
uint32_t val = nd[i];
nd[i] = (val >> 1) + carry;
carry = (val & 1) * 500000000;
if (i == (ndlo & 127)) break;
i = (i - 1) & 127;
}
if (carry) nd[--ndlo & 127] = carry;
return ndlo;
}
We can generalise this to dividing by 2k:
int32_t nd_div2k(uint32_t* nd, int32_t ndlo, int32_t ndhi,
uint32_t k) {
uint32_t mask = (1U << k) - 1, mul = 1000000000 >> k;
uint32_t i = ndhi & 127, carry = 0;
for (;;) {
uint32_t val = nd[i];
nd[i] = (val >> k) + carry;
carry = (val & mask) * mul;
if (i == (ndlo & 127)) break;
i = (i - 1) & 127;
}
if (carry) nd[--ndlo & 127] = carry;
return ndlo;
}
The above will work for k
between zero and nine inclusive; if k
is larger than nine, then 1000000000 >> k
can no longer be represented by an integer. As such, the complete solution has to start by dividing in batches of 29:
int32_t nd_div2k(uint32_t* nd, int32_t ndlo, int32_t ndhi,
uint32_t k) {
while (k >= 9) {
uint32_t i = ndhi & 127, carry = 0;
for (;;) {
uint32_t val = nd[i];
nd[i] = (val >> 9) + carry;
carry = (val & 0x1ff) * 1953125;
if (i == (ndlo & 127)) break;
i = (i - 1) & 127;
}
if (carry) nd[--ndlo & 127] = carry;
k -= 9;
}
if (k) {
uint32_t mask = (1U << k) - 1, mul = 1000000000 >> k;
uint32_t i = ndhi & 127, carry = 0;
for (;;) {
uint32_t val = nd[i];
nd[i] = (val >> k) + carry;
carry = (val & mask) * mul;
if (i == (ndlo & 127)) break;
i = (i - 1) & 127;
}
if (carry) nd[--ndlo & 127] = carry;
}
return ndlo;
}
We can then go through the same process for multiplying, starting with multiplication by 2:
int32_t nd_mul2(uint32_t* nd, int32_t ndhi) {
uint32_t carry_in = 0;
for (uint32_t i = 0; i <= (uint32_t)ndhi; i++) {
uint32_t val = (nd[i] << 1) | carry_in;
carry_in = val / 1000000000;
nd[i] = val - carry_in * 1000000000;
}
if (carry_in) nd[++ndhi] = carry_in;
return ndhi;
}
By promoting val
to 64-bits, this can be generalised to small k
:
int32_t nd_mul2k(uint32_t* nd, int32_t ndhi, uint32_t k) {
uint32_t carry_in = 0;
for (uint32_t i = 0; i <= (uint32_t)ndhi; i++) {
uint64_t val = ((uint64_t)nd[i] << k) | carry_in;
carry_in = (uint32_t)(val / 1000000000);
nd[i] = (uint32_t)val - carry_in * 1000000000;
}
if (carry_in) nd[++ndhi] = carry_in;
return ndhi;
}
This time the constraint on k
comes from wanting val
to be no more than (109)2, which limits k
to 29. It also turns out to be useful to make carry_in
a parameter, all of which leads to the complete code for multiplying by 2k:
int32_t nd_mul2k(uint32_t* nd, int32_t ndhi, uint32_t k,
uint32_t carry_in) {
while (k >= 29) {
for (uint32_t i = 0; i <= (uint32_t)ndhi; i++) {
uint64_t val = ((uint64_t)nd[i] << 29) | carry_in;
carry_in = (uint32_t)(val / 1000000000);
nd[i] = (uint32_t)val - carry_in * 1000000000;
}
if (carry_in) {
nd[++ndhi] = carry_in; carry_in = 0;
}
k -= 29;
}
if (k) {
for (uint32_t i = 0; i <= (uint32_t)ndhi; i++) {
uint64_t val = ((uint64_t)nd[i] << k) | carry_in;
carry_in = (uint32_t)(val / 1000000000);
nd[i] = (uint32_t)val - carry_in * 1000000000;
}
if (carry_in) nd[++ndhi] = carry_in;
}
return ndhi;
}
We can plug these routines into the decode
function from earlier to create a print
function:
void print(double n) {
TValue t;
t.n = n;
if ((t.u32.hi << 1) >= 0xffe00000) {
if (((t.u32.hi & 0x000fffff) | t.u32.lo) != 0) {
printf("NaN\n");
} else {
printf("Infinity\n");
}
} else {
char buf[1154];
uint32_t nd[128];
int32_t ndlo = 0;
int32_t ndhi = 0;
int32_t e = (t.u32.hi >> 20) & 0x7ff;
nd[0] = t.u32.hi & 0xfffff;
if (e == 0) {
e++;
} else {
nd[0] |= 0x100000;
}
e -= 1043;
if (t.u32.lo) {
e -= 32;
nd[0] = (nd[0] << 3) | (t.u32.lo >> 29);
ndhi = nd_mul2k(nd, ndhi, 29, t.u32.lo & 0x1fffffff);
}
if (e >= 0) {
ndhi = nd_mul2k(nd, ndhi, (uint32_t)e, 0);
} else {
ndlo = nd_div2k(nd, ndlo, ndhi, (uint32_t)-e);
}
nd_print(buf, nd, ndlo, ndhi);
printf("%s\n", buf);
}
}
Some example outputs of this print
function are:
n | Output |
---|---|
0./0. | NaN |
1./0. | Infinity |
0. | 000000000. |
1. | 000000001. |
10. | 000000010. |
0.1 | 000000000000000000.1000000000000000055511 |
1e-308 | 000000000000000000.0000000000000000000000 |
pow(2, 1020) | 00000001123558209288947442330815744243140 |
Some of these outputs could be tidied up by trimming leading and trailing zeroes, but other than that, this is the output we were aiming for. We can see that the double-precision floating point number closest to 0.1 is actually ever so slightly more than 0.1, that the double-precision floating point number closest to 1e-308 is actually ever so slightly less than 1e-308, and that in some cases the string representation of the exact decimal represented by a float can be extremely long (though, to be fair, 1e-308 is almost as bad as it gets for length).
Next time we'll see how to adapt nd_print
into something which can behave like the %e
, %f
, and %g
formats of sprintf
(i.e. outputs which you might actually want, as opposed to pedantic exact representations).