Matthias Wahl - BobKonf 2021
Systems engineer at Wayfair
Working on Tremor, doing Rust for a living.
AMD Epyc 7702 ES (Zen 2 | Rome | CCD)
Form of Data Parallelism
huge registers
128, 256, 512 bits
interpreted as packed numeric types (signed, unsigned integers, floats, doubles)
huge set of instructions
Lane-wise operations, e.g. add corresponding lanes of 2 registers, logical shift left
horizontal operations, e.g. add together all lanes in a register
masked operations
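The operation kinds above can be sketched in plain Rust, modeling a 128-bit register as four packed i32 lanes (the helper names are mine, for illustration only; real SIMD hardware performs each of these in a single instruction):

```rust
// Conceptual model of a 128-bit SIMD register holding 4 packed i32 lanes.
type I32x4 = [i32; 4];

// Lane-wise operation: add lane i of `a` to lane i of `b`.
fn add_lanes(a: I32x4, b: I32x4) -> I32x4 {
    [a[0] + b[0], a[1] + b[1], a[2] + b[2], a[3] + b[3]]
}

// Horizontal operation: add together all lanes of one register.
fn horizontal_sum(a: I32x4) -> i32 {
    a.iter().sum()
}

// Masked operation: take lane i from `a` where mask[i] is set, else from `b`.
fn select(mask: [bool; 4], a: I32x4, b: I32x4) -> I32x4 {
    let mut out = [0; 4];
    for i in 0..4 {
        out[i] = if mask[i] { a[i] } else { b[i] };
    }
    out
}

fn main() {
    let (a, b) = ([1, 2, 3, 4], [10, 20, 30, 40]);
    println!("{:?}", add_lanes(a, b)); // [11, 22, 33, 44]
    println!("{}", horizontal_sum(a)); // 10
    println!("{:?}", select([true, false, true, false], a, b)); // [1, 20, 3, 40]
}
```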
Via instruction set extensions.
Not necessarily available on your CPU.
CVT T PD 2 DQ
Convert with Truncation (round towards zero) Packed Double precision floating point numbers to packed Doubleword (32-bit) integers
VPMOVZXBQ
VFMADD231SD
Our compiler knows our CPU better than we do.
#include <stdint.h>
#include <stddef.h>

int32_t sum(int32_t* array, size_t len) {
    int32_t s = 0;
    for (size_t i = 0; i < len; i++) {
        s += array[i];
    }
    return s;
}
Simple sum of given array of ints
On Godbolt
Small benchmark with len = 1000
| no optimizations (-O0) | auto-vectorization enabled (-O3) |
| --- | --- |
| 630 ticks | 153 ticks |
.L4:
vmovdqu (%rax), %xmm2
vinserti128 $0x1, 16(%rax), %ymm2, %ymm0
addq $32, %rax
vpaddd %ymm0, %ymm1, %ymm1
cmpq %rdx, %rax
jne .L4
.L3:
leaq 1(%rdx), %rcx
addl (%rdi,%rdx,4), %eax
cmpq %rcx, %rsi
jbe .L1
addl (%rdi,%rcx,4), %eax
leaq 2(%rdx), %rcx
cmpq %rcx, %rsi
jbe .L1
...
ret
Autovectorized assembly
A.K.A. memchr
x86_64 AVX2
x86_64 AVX2 - SIMD Loop
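A minimal sketch of what such an AVX2 memchr loop can look like in Rust, built on the stable std::arch intrinsics plus runtime feature detection (the function names and the scalar tail handling are mine, not taken from the talk):

```rust
fn memchr_scalar(needle: u8, haystack: &[u8]) -> Option<usize> {
    haystack.iter().position(|&b| b == needle)
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn memchr_avx2(needle: u8, haystack: &[u8]) -> Option<usize> {
    use std::arch::x86_64::*;
    // Broadcast the needle into all 32 byte lanes of a 256-bit register.
    let needles = _mm256_set1_epi8(needle as i8);
    let mut i = 0;
    while i + 32 <= haystack.len() {
        // Load 32 bytes, compare each lane against the needle,
        // then collapse the per-lane results into a 32-bit mask.
        let chunk = _mm256_loadu_si256(haystack.as_ptr().add(i) as *const __m256i);
        let eq = _mm256_cmpeq_epi8(chunk, needles);
        let mask = _mm256_movemask_epi8(eq) as u32;
        if mask != 0 {
            return Some(i + mask.trailing_zeros() as usize);
        }
        i += 32;
    }
    // Scalar tail for the last < 32 bytes.
    memchr_scalar(needle, &haystack[i..]).map(|p| i + p)
}

pub fn memchr(needle: u8, haystack: &[u8]) -> Option<usize> {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            return unsafe { memchr_avx2(needle, haystack) };
        }
    }
    memchr_scalar(needle, haystack)
}

fn main() {
    let hay = b"find the } brace";
    assert_eq!(memchr(b'}', hay), Some(9));
    assert_eq!(memchr(b'z', hay), None);
}
```

One compare, one movemask, and a trailing_zeros count handle 32 bytes per iteration; the scalar loop falls back in when AVX2 is not detected at runtime.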
convenient syntax to broadcast functions onto array/matrix elements
compiler can freely vectorize
NTuple{4, VecElement{Float64}}
implemented on top of LLVM vector type
<4 x double>
Abstracts over different platforms
register = (VecElement(0.5), VecElement(1.1), VecElement(1.2), VecElement(0.0))
compiler transforms this into an LLVM vector
operations on SIMD registers: calling LLVM intrinsics via embedded LLVM IR strings:
llvmcall("%res = fadd <4 x double> %0, %1 ...")
https://github.com/eschnett/SIMD.jl
Some more convenience
Part of JDK 16 (to be released March 2021)
Early access builds available.
Typed interface to SIMD registers
Parameterized on lane type
and register size
Vector size property of the host CPU
auto-detected at runtime
Hotspot C2 JIT emits SIMD instructions if available
Fallback scalar methods in plain Java
memchr in java with Vector API
Hard to beat auto-vectorized scalar version:
if is_x86_feature_detected!("avx2") {
    unsafe { foo_avx2() };
}
https://github.com/simd-lite/simd-json
Port of Daniel Lemire's C++ simdjson project
In heavy use at Tremor
Focused on maxing out performance
Deliberate choice to use vendor intrinsics
Different techniques work on different platforms
Compared to scalar version
Structural characters: { } : " , and whitespace
structural character detection algorithm "shufti" (ARM NEON)
'}' = 0x7d
      /    \
nibbles:  7    13
# table lookup using nibbles as lookup index
high_nibble_table[7] = 0x01
low_nibble_table[13] = 0x09
# character class
(0x01 & 0x09) & mask = 0x01
~12 instructions to create a bitmask for 16 characters
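The nibble lookup above can be sketched byte-at-a-time in Rust (the tables here mark only '}', and the final AND with the load mask is omitted; the real algorithm classifies 16 bytes at once with a single table-lookup instruction such as NEON's TBL):

```rust
// Scalar sketch of shufti's per-byte classification: split the byte into
// high and low nibbles, use each as an index into a 16-entry table, and
// AND the two lookups. A nonzero result means the byte is in the class.
fn shufti_classify(b: u8) -> u8 {
    let mut high_nibble_table = [0u8; 16];
    let mut low_nibble_table = [0u8; 16];
    high_nibble_table[0x7] = 0x01; // high nibble of '}' (0x7d)
    low_nibble_table[0xd] = 0x09;  // low nibble of '}' (0x7d)
    high_nibble_table[(b >> 4) as usize] & low_nibble_table[(b & 0x0f) as usize]
}

fn main() {
    assert_eq!(shufti_classify(0x7d), 0x01); // '}' is structural
    assert_eq!(shufti_classify(b'a'), 0x00); // 'a' is not
}
```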
Up to 2x faster than Rust's serde_json
according to serde-rs/json-benchmark
You can show off!