There are several ways to convert integers to floating-point numbers. The most obvious one is using the native instruction of a target CPU, while the other ones include table lookups and bit manipulation trick, each having their own trade-offs. This article provides a comparison between these techniques.
๐Techniques
๐Hardware instructions
Modern CPUs with a FPU usually have dedicated instructions for converting integers to floating-point numbers and forth. The examples include x86's cvtsi2ss
instruction. Although they might seem to be the optimal choice as they usually are, sometimes the other techniques described later may outperform the dedicated instructions.
๐Bit manipulation
The mapping from the integer representations of IEEE floating-point numbers to the represented numbers is a piecewise linear function. The bit manipulation technique utilizes this property to accomplish the task.
The bits of a single-precision floating-point number are laid out as follows:
The value represented is
If the exponent field is set to and the sign is positive, the represented value is: . Based on this, we can construct a conversion function, which would be something like the following in Rust:
If the input range must support the entirety of 32-bit integers, the upper part and the lower part of an input value could be converted separately and then added together. It appears that LLVM uses this technique in some cases.
๐Table lookups
Since the L1 data cache of a modern CPU has very high throughput, it makes a sense to use a lookup table for integer-to-FP conversion. As reading a lookup table is just a normal data access, it will compete with other memory accesses and put a strain on the memory system. It rapidly loses efficiency as the input range becomes large, so it's only applicable if the range is very small.
Unlike other techniques, an extra care must be taken to limit the input range when using applying this technique. Out-of-bounds accesses not only lead to an application crash, but may also cause serious security problems.
๐Theoretical analysis
cvt2siss
is not very fast in general. The following table based on Agner Fog's work1 shows the latency and reciprocal throughput of cvt2siss
with 32-bit operand:
cvt2siss x,r32 | Sandy Bridge | Skylake | Ryzen |
---|---|---|---|
Latency | 4 | 6 | 8 |
Reciprocal throughput | 1.5 | 2 | 1 |
ยตops | 4 | 2 | 2 |
The u23_to_f32
function shown in the Bit manipulation section is compiled to the following x86_64 assembly:
orl $1258291200, %esi
vmovd %esi, %xmm1
vaddss %xmm0, %xmm1, %xmm1
The following tables based on Agner Fog's work1 show the characteristics of these instructions:
or r,r | Sandy Bridge | Skylake | Ryzen |
---|---|---|---|
Latency | 1 | 1 | 1 |
Reciprocal throughput | 1 | 0.25 | 0.25 |
Port | p015 | p0156 |
mov x,r32 | Sandy Bridge | Skylake | Ryzen |
---|---|---|---|
Latency | 1 | 2 | 3 |
Reciprocal throughput | 1 | 1 | |
Port | p015 | p5 |
addss x,x | Sandy Bridge | Skylake | Ryzen |
---|---|---|---|
Latency | 3 | 4 | 3 |
Reciprocal throughput | 1 | 0.5 | 0.5 |
Port | p1 | p01 | p23 |
Based on these tables, the performance characteristics of u23_to_f32
can be estimated:
u23_to_f32 | Sandy Bridge | Skylake | Ryzen |
---|---|---|---|
Latency | 5 | 7 | 7 |
Reciprocal throughput | 1 | 1 | 1 |
Comparing with the table for cvt2siss
, it can be said that the bit manipulation technique provides a superior or equivalent throughput in all of the microarchitectures shown here, albeit having a narrower input range. On the other hand, this technique requires one extra ยตops.
The assembly code of the table lookup technique consists of a single instruction:
vmovss (%rdx,%rbx,4), %xmm0
The load latency is 5 cycles and the reciprocal throughput is 0.5 according to Intel's document2, thus it theoretically provides the highest throughput.
๐References
Agner Fog, Software optimization resources.
Intel corporation, Intelยฎ 64 and IA-32 Architectures Optimization Reference Manual.