Converting integers to floats on x86

February 10, 2019

There are several ways to convert integers to floating-point numbers. The most obvious one is to use the native instruction of the target CPU, while the others include table lookups and bit manipulation tricks, each with its own trade-offs. This article compares these techniques.

Techniques

Hardware instructions

Modern CPUs with an FPU usually have dedicated instructions for converting integers to floating-point numbers and back, such as x86's cvtsi2ss instruction. Although these might seem to be the optimal choice, and usually are, the other techniques described below can sometimes outperform them.
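For instance, a plain cast in Rust is lowered to this instruction on x86_64 (the function name below is just for illustration; the exact mnemonic depends on the compiler and the enabled target features):

fn i32_to_f32(x: i32) -> f32 {
    // A plain `as` cast; rustc/LLVM lower this to cvtsi2ss on x86_64
    // (vcvtsi2ss when AVX is enabled).
    x as f32
}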

Bit manipulation

The mapping from the integer representations of IEEE floating-point numbers to the numbers they represent is a piecewise linear function. The bit manipulation technique exploits this property.
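As a quick illustration of this linearity (the constants used here are explained in the rest of this section), consecutive bit patterns within one exponent band map to evenly spaced floats:

let a = f32::from_bits(0x4b000000); // 2^23
let b = f32::from_bits(0x4b000001); // the next representable value
assert_eq!(b - a, 1.0);             // evenly spaced within the band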

The bits of a single-precision floating-point number are laid out as follows:

 bit 31 | bits 30-23        | bits 22-0
 sign   | exponent (8 bits) | fraction (23 bits)

The value represented is

$$\text{value} = (-1)^{\text{sign}} \times 2^{\text{exponent} - 127} \times \left( 1 + \text{fraction} \cdot 2^{-23} \right)$$

If the exponent field is set to $127 + 23 = 150$ and the sign is positive, the represented value is $2^{23} + \text{fraction}$. Based on this, we can construct a conversion function, which would be something like the following in Rust:

fn u23_to_f32(x: u32) -> f32 {
    // Only inputs that fit in the 23-bit fraction field are supported.
    assert!(x < 0x800000);
    // 0x4b000000 sets the exponent field to 150, so the bits represent
    // 2^23 + x; subtracting 2^23 (= 8388608.0) leaves x as an f32.
    <f32>::from_bits(x | 0x4b000000) - 8388608.0
}
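
As a quick sanity check at the boundaries of the supported range:

assert_eq!(u23_to_f32(0), 0.0);
assert_eq!(u23_to_f32(0x7fffff), 8388607.0);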

If the input range must cover the entirety of the 32-bit integers, the upper and lower halves of the input value can be converted separately and then added together; it appears that LLVM uses this technique in some cases. A sketch of the idea follows.
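The following is a minimal sketch of such a split, using 16-bit halves (this illustrates the idea and is not necessarily the exact sequence LLVM emits). Each half is converted exactly, so the only rounding happens in the final addition, which therefore yields the same result as a direct conversion:

fn u32_to_f32(x: u32) -> f32 {
    // Low 16 bits: exponent field 150 makes the bits represent 2^23 + lo.
    let lo = f32::from_bits((x & 0xffff) | 0x4b000000) - 8388608.0; // 2^23
    // High 16 bits, scaled by 2^16: exponent field 166 makes the bits
    // represent 2^39 + hi * 2^16.
    let hi = f32::from_bits((x >> 16) | 0x53000000) - 549755813888.0; // 2^39
    // Both halves are exact f32 values, so this single rounding step
    // produces the correctly rounded result.
    hi + lo
}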

Table lookups

Since the L1 data cache of a modern CPU has very high throughput, it makes sense to use a lookup table for integer-to-FP conversion. However, reading a lookup table is just a normal data access, so it competes with other memory accesses and puts a strain on the memory system. The technique rapidly loses efficiency as the input range grows, so it is only applicable when the range is very small.

Unlike the other techniques, extra care must be taken to limit the input range when applying this one. Out-of-bounds accesses not only lead to application crashes but may also cause serious security problems.
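
As a minimal sketch (the 256-entry table and the u8 input type are choices made for this illustration), one way to rule out out-of-bounds access by construction is to index with a type whose range matches the table:

// 256 entries * 4 bytes = 1 KiB, small enough to stay resident in L1.
fn build_u8_table() -> [f32; 256] {
    let mut table = [0.0f32; 256];
    for (i, entry) in table.iter_mut().enumerate() {
        *entry = i as f32;
    }
    table
}

fn u8_to_f32(table: &[f32; 256], x: u8) -> f32 {
    // A u8 index can never exceed 255, so the bounds check is optimized
    // away and the lookup compiles to a single load.
    table[x as usize]
}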

Theoretical analysis

Scalar operations

cvtsi2ss is not very fast in general. The following table, based on [Fog2018], shows the performance characteristics of cvtsi2ss with a 32-bit register operand:


cvtsi2ss x,r32          Sandy Bridge   Skylake   Ryzen
Latency                 4              6         8
Reciprocal throughput   1.5            2         1
µops                    4              2         2

The u23_to_f32 function shown in the Bit manipulation section is compiled to the following x86_64 assembly (the subtraction is emitted as vaddss, presumably with the negated constant -8388608.0 preloaded into %xmm0):

orl     $1258291200, %esi
vmovd   %esi, %xmm1
vaddss  %xmm0, %xmm1, %xmm1

The following tables based on [Fog2018] show the characteristics of these instructions:

or r,r                  Sandy Bridge   Skylake   Ryzen
Latency                 1              1         1
Reciprocal throughput   1              0.25      0.25
Port                    p015           p0156

movd x,r32              Sandy Bridge   Skylake   Ryzen
Latency                 1              2         3
Reciprocal throughput   1              1
Port                    p015           p5

addss x,x               Sandy Bridge   Skylake   Ryzen
Latency                 3              4         3
Reciprocal throughput   1              0.5       0.5
Port                    p1             p01       p23

Based on these tables, the performance characteristics of u23_to_f32 can be estimated: the latency is the sum of the latencies along the dependency chain, and the reciprocal throughput is limited by the most contended resource:

u23_to_f32              Sandy Bridge   Skylake   Ryzen
Latency                 5              7         7
Reciprocal throughput   1              1         1
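
For example, the Skylake latency estimate is the sum along the dependency chain: 1 (orl) + 2 (vmovd) + 4 (vaddss) = 7 cycles.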

Comparing with the table for cvtsi2ss, the bit manipulation technique provides superior or equivalent throughput on all of the microarchitectures shown here, albeit with a narrower input range. On the other hand, it requires one extra µop on Skylake and Ryzen.

The assembly code of the table lookup technique consists of a single instruction:

vmovss  (%rdx,%rbx,4), %xmm0

Assuming an L1 cache hit, the load latency is 5 cycles and the reciprocal throughput is 0.5 [IntelSORM], so this technique theoretically provides the highest throughput of the three.


Vector operations

TODO

References

[Fog2018] Agner Fog, Software optimization resources, https://www.agner.org/optimize/.

[IntelSORM] Intel Corporation, Intel® 64 and IA-32 Architectures Optimization Reference Manual.