Answer to: SIMD slower than a single implementation for toy example
Score: 3
Your Godbolt link omitted -DNDEBUG, so the compiled code had to check assertions in every scalar iteration, which prevented auto-vectorization. With the GCC options you specified in the question, gradient_normal auto-vectorizes as expected; the SIMD loop is at .L65: on Godbolt.
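To illustrate why the assertion matters, here is a minimal sketch. The names checked_at and gradient_checked are hypothetical stand-ins, not the code from the question and not Eigen's actual accessor; the loop body just assumes a difference-times-0.5 kernel like the one discussed further down.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical bounds-checked accessor, standing in for a library operator()
// that asserts on the index. Without -DNDEBUG, every call carries a
// compare-and-branch to an abort path; with -DNDEBUG the assert vanishes.
inline float checked_at(const std::vector<float>& v, std::size_t i) {
    assert(i < v.size());
    return v[i];
}

// Assumed kernel shape: out[i] = (in[i + 2] - in[i]) * 0.5f.
// The caller must size `out` to at least in.size() - 2 elements.
void gradient_checked(const std::vector<float>& in, std::vector<float>& out) {
    for (std::size_t i = 0; i + 2 < in.size(); ++i)
        out[i] = (checked_at(in, i + 2) - checked_at(in, i)) * 0.5f;
}
```

Compiling this once with and once without -DNDEBUG (same -O3 and -march options) and comparing the asm for gradient_checked should show whether your GCC version vectorizes the loop in each case.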
So you'd expect both versions to run at about the same speed, modulo any warm-up effects making one slower (see "Idiomatic way of performance evaluation?"). The auto-vectorized version has extra code to handle lengths that aren't a multiple of the vector width. But you're testing the manually-vectorized version first, so any warm-up effects make it slower. If you reversed the order of the benchmarks, you'd probably find the other one slower in your current hand-rolled test framework.
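One way to reduce the ordering bias is to run each candidate a few times untimed before measuring it. The helper below is only a sketch under that assumption; best_time_ns and its default repeat counts are made up for illustration and are not part of the question's framework.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <limits>

// Hypothetical helper: calls f() `warmup` times untimed (page faults, caches,
// CPU frequency ramp-up), then returns the fastest of `reps` timed calls in ns.
template <typename F>
double best_time_ns(F&& f, int warmup = 3, int reps = 10) {
    assert(reps > 0);
    using clock = std::chrono::steady_clock;

    for (int i = 0; i < warmup; ++i)
        f();  // untimed warm-up runs

    double best = std::numeric_limits<double>::infinity();
    for (int i = 0; i < reps; ++i) {
        const auto t0 = clock::now();
        f();
        const auto t1 = clock::now();
        const double ns = std::chrono::duration<double, std::nano>(t1 - t0).count();
        best = std::min(best, ns);
    }
    return best;
}
```

The callable still has to do work the compiler can't discard, for example by writing into a buffer that is read afterwards. Timing both versions through the same helper, in both orders, is a quick check on whether the difference you measured was a warm-up artifact.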
Not quite a duplicate of "Basic ways to speed up a simple Eigen program", which mentions -DNDEBUG being important for Eigen.
-march=x86-64-v2 -mavx2 is very weird, and only makes sense for Via Isaiah or Zhaoxin Lujiazui (which apparently doesn't signal AVX2 in its CPUID feature bits, if https://chipsandcheese.com/p/the-weird-and-wacky-world-of-via-part-2-zhaoxins-not-quite-electric-boogaloo is correct, but does also support AVX2 without FMA). All other x86 CPUs with AVX2 also support FMA. -march=x86-64-v3 enables AVX2+FMA+BMI1/2, so your option loses out on BMI instructions even if you wanted to avoid FMA.
You commented that you're getting some crashes in your large code-base with some third-party libraries with x86-64-v3 and haven't debugged what's really going on yet, so sure, I guess this works as a temporary measure and is better than just SSE4. You might try -march=x86-64-v3 -mno-fma to just disable FMA if that's the problem; BMI1/2 integer instructions can't change numerical results.
Or -march=x86-64-v3 -ffp-contract=off (https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html#index-ffp-contract) to never automatically use FMA for expressions like a*b + c, only when used manually.
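A small sketch of the difference between automatic and manual FMA use follows. The function names are hypothetical; std::fma is the standard library call that always computes a*b + c with a single rounding, regardless of the contraction setting.

```cpp
#include <cmath>

// Plain expression: with -ffp-contract=fast and FMA available, the compiler
// may fuse this into one FMA instruction (one rounding). With
// -ffp-contract=off it stays a separate multiply and add (two roundings).
double mul_add_plain(double a, double b, double c) {
    return a * b + c;
}

// Manual use: always fused semantics, whatever -ffp-contract says.
// Without hardware FMA enabled, this falls back to a slower library routine.
double mul_add_fused(double a, double b, double c) {
    return std::fma(a, b, c);
}
```

With a = b = 1 + 2^-27 and c = -(1 + 2^-26), the fused version returns exactly 2^-54, because the low bit of the product survives until the single final rounding. The plain version returns 0 when the multiply and add are rounded separately in double precision, and 2^-54 if the compiler contracted it. That kind of last-bit difference is what -ffp-contract=off rules out for code that didn't ask for FMA.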
In this microbenchmark, FMA doesn't help. You have (x-y)*0.5, which can't be optimized to a single FMA. GCC's default -ffp-contract=fast makes the same asm with -march=x86-64-v3 as with -march=x86-64-v2 -mavx2.
Question
Score: 2 • Views: 55
Site: stackoverflow