CPU pipelines: when more is less
2017/10/03
I have been working on micro-optimizations for some simulation code, and was reminded of a counter-intuitive artifact of modern CPU architecture, which is worth a short post.
Consider (just for the sake of example) a very simple (if not particularly meaningful) function,
\[ f(x) = \begin{cases} (x+2)^2 & \text{if } x \ge 0,\\ 1-x & \text{otherwise} \end{cases} \]
with implementations
f1(x) = ifelse(x ≥ 0, abs2(x+2), 1-x)
f2(x) = x ≥ 0 ? abs2(x+2) : 1-x
f1 calculates both possibilities before choosing between them with ifelse, while f2 only evaluates the branch that is actually needed. So, intuitively, f2 should be faster.
But it isn't...
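Before the numbers, here is a toy demonstration of my own (not in the original post) that ifelse, being an ordinary function, really does evaluate both arguments, while the ternary evaluates only the taken branch:
julia> ifelse(false, print("both "), print("arms evaluated"))
both arms evaluated
julia> false ? print("both ") : print("only the taken arm")
only the taken arm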
julia> x = randn(1_000_000);
julia> using BenchmarkTools
julia> @btime f1.($x);
664.228 μs (2 allocations: 7.63 MiB)
julia> @btime f2.($x);
6.519 ms (2 allocations: 7.63 MiB)
...it is about 10x slower.
This can be understood as an artifact of the instruction pipeline: your x86 CPU likes to perform similar operations in a staggered, overlapping manner, and it does not like branches (jumps) because they break the flow. With input like ours the branch direction is essentially random, so the CPU mispredicts it roughly half the time and has to throw away work.
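One way to see that prediction matters (a check of my own, not in the original post; exact numbers depend on the CPU, and on whether the compiler also vectorizes the branch-free version) is to sort the input, which makes the branch in f2 almost perfectly predictable:
julia> xs = sort(x);
julia> @btime f2.($xs);   # the branch is now predictable; much of the penalty typically disappears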
Comparing the native code reveals that while f1 is jump-free (the comparison cmpnlesd produces a mask, which andpd/andnpd/orpd then use to blend the two precomputed results), the if in f2 results in a jump (jae):
julia> @code_native f1(1.0)
.text
Filename: REPL[2]
pushq %rbp
movq %rsp, %rbp
movabsq $139862498743472, %rax # imm = 0x7F34468E14B0
movsd (%rax), %xmm2 # xmm2 = mem[0],zero
Source line: 1
addsd %xmm0, %xmm2
mulsd %xmm2, %xmm2
movabsq $139862498743480, %rax # imm = 0x7F34468E14B8
movsd (%rax), %xmm3 # xmm3 = mem[0],zero
subsd %xmm0, %xmm3
xorps %xmm1, %xmm1
cmpnlesd %xmm0, %xmm1
andpd %xmm1, %xmm3
andnpd %xmm2, %xmm1
orpd %xmm3, %xmm1
movapd %xmm1, %xmm0
popq %rbp
retq
nopw %cs:(%rax,%rax)
julia> @code_native f2(1.0)
.text
Filename: REPL[3]
pushq %rbp
movq %rsp, %rbp
Source line: 1
xorps %xmm1, %xmm1
ucomisd %xmm1, %xmm0
jae L37
movabsq $139862498680736, %rax # imm = 0x7F34468D1FA0
movsd (%rax), %xmm1 # xmm1 = mem[0],zero
subsd %xmm0, %xmm1
movapd %xmm1, %xmm0
popq %rbp
retq
L37:
movabsq $139862498680728, %rax # imm = 0x7F34468D1F98
addsd (%rax), %xmm0
mulsd %xmm0, %xmm0
popq %rbp
retq
nopl (%rax)
In my application the speed gain was more modest, but still sizeable. Benchmarking a non-branching version of your code is sometimes worth it, especially if the change is simple and both branches of the conditional can be evaluated without errors. If, for example, we had to calculate
g(x) = x ≥ 0 ? √(x+2) : 1-x
then we could not use ifelse without restricting the domain, since √(x+2) would fail whenever x < -2.
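One workaround (a sketch of my own, not from the post) is to make the speculative branch safe to evaluate, for example by clamping its argument before taking the square root, and then select with ifelse:
g_safe(x) = ifelse(x ≥ 0, √(max(x, zero(x)) + 2), 1 - x)   # √ always gets a non-negative argument
Whether the redundant max actually pays off is, as always, something to benchmark rather than assume.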
Julia Base contains many optimizations like this: for a particularly nice example, see the functions that use Base.null_safe_op.