Question Details

No question body available.

Tags

c++ performance eigen simd avx

Answers (1)

January 15, 2026 Score: 3 Rep: 378,632 Quality: Medium Completeness: 80%

Your Godbolt link omitted -DNDEBUG, so the code had to check assertions in every scalar iteration, which prevented auto-vectorization. With the GCC options you specified in the question, gradientnormal auto-vectorizes as expected; its SIMD loop is at .L65: on Godbolt.
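As a rough illustration of why the assertion matters, here is a minimal sketch that does not use Eigen at all. `checked_at` and `half_difference` are hypothetical names, and the plain `assert` is only a stand-in for the checks Eigen performs through its own assertion macro. Without -DNDEBUG every iteration carries a compare and a possible call to abort, which is the kind of loop body compilers typically refuse to vectorize; with -DNDEBUG the check disappears and the loop is a plain subtract-and-scale.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical checked accessor, standing in for a library's bounds assertion.
inline float checked_at(const std::vector<float>& v, std::size_t i) {
    assert(i < v.size() && "index out of range");  // compiled out when NDEBUG is defined
    return v[i];
}

// out[i] = (x[i] - y[i]) * 0.5f for every element of out.
// Assumes x and y are at least as long as out.
void half_difference(const std::vector<float>& x,
                     const std::vector<float>& y,
                     std::vector<float>& out) {
    for (std::size_t i = 0; i < out.size(); ++i)
        out[i] = (checked_at(x, i) - checked_at(y, i)) * 0.5f;
}

// Reports whether this translation unit was built with assertions active.
constexpr bool assertions_enabled() {
#ifdef NDEBUG
    return false;
#else
    return true;
#endif
}
```

Compiling this once with and once without -DNDEBUG (for example at -O3) and comparing the loop in the generated asm should show the difference; `assertions_enabled()` is just a convenient way to confirm which build you are actually running.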

So you'd expect both versions to run at about the same speed, modulo any warm-up effects making one slower (see Idiomatic way of performance evaluation?). The auto-vectorized version has extra code to handle lengths that aren't a multiple of the vector width. But you're testing the manually-vectorized version first, so any warm-up effects make it slower. If you reversed the order of the benchmarks, you'd probably find the other one slower in your current hand-rolled test framework.
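One way to reduce the effect of benchmark order is to run each candidate a few times untimed before measuring it. The following is only a sketch under that assumption: `best_time_ns` is a hypothetical helper, not part of any library, and the warm-up and repetition counts are arbitrary defaults.

```cpp
#include <algorithm>
#include <chrono>
#include <limits>

// Hypothetical helper: call fn() `warmup` times without timing (to fault in
// pages, warm caches, and let the CPU clock ramp up), then return the fastest
// of `reps` timed calls, in nanoseconds.
template <class Fn>
double best_time_ns(Fn&& fn, int warmup = 3, int reps = 10) {
    using clock = std::chrono::steady_clock;
    for (int i = 0; i < warmup; ++i)
        fn();
    double best = std::numeric_limits<double>::infinity();
    for (int i = 0; i < reps; ++i) {
        const auto t0 = clock::now();
        fn();
        const auto t1 = clock::now();
        const double ns = std::chrono::duration<double, std::nano>(t1 - t0).count();
        best = std::min(best, ns);
    }
    return best;
}
```

You would call it once per version, e.g. `best_time_ns([&]{ /* run one version */ })`, and ideally repeat the whole comparison in both orders. The callable needs to write its result somewhere the compiler cannot prove is unused, otherwise the work being timed may be optimized away.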


Not quite a duplicate of Basic ways to speed up a simple Eigen program, which mentions -DNDEBUG being important for Eigen.


-march=x86-64-v2 -mavx2 is very weird, and only makes sense for Via Isaiah or Zhaoxin Lujiazui (which apparently doesn't signal AVX2 in its CPUID feature bits, if https://chipsandcheese.com/p/the-weird-and-wacky-world-of-via-part-2-zhaoxins-not-quite-electric-boogaloo is correct, but does support AVX2 without FMA). All other x86 CPUs with AVX2 also support FMA. -march=x86-64-v3 enables AVX2+FMA+BMI1/2, among other extensions. So your option loses out on BMI instructions even if you wanted to avoid FMA.

You commented that you're getting some crashes in your large code-base with some third-party libraries with x86-64-v3 and haven't debugged what's really going on yet, so sure, I guess this works as a temporary measure and is better than just SSE4. You might try -march=x86-64-v3 -mno-fma to just disable FMA if that's the problem; BMI1/2 integer instructions can't change numerical results.

Or -march=x86-64-v3 -ffp-contract=off (https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html#index-ffp-contract) so that FMA is never used automatically for expressions like a*b + c, only when you request it explicitly (e.g. with std::fma or intrinsics).

In this microbenchmark, FMA doesn't help. You have (x-y)*0.5, which can't be optimized to a single FMA. GCC's default -ffp-contract=fast makes the same asm with -march=x86-64-v3 as with -march=x86-64-v2 -mavx2.
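To make the contraction point concrete, here is a small sketch showing how a fused a*b + c can differ numerically from a separately rounded multiply and add, and why the benchmark's (x-y)*0.5 gives the compiler nothing to fuse. The function names are hypothetical; the `volatile` is there only to force the product to be rounded to double regardless of the -ffp-contract setting.

```cpp
#include <cmath>

// Multiply, round to double, then add: the result without contraction.
inline double mul_then_add(double a, double b, double c) {
    volatile double prod = a * b;  // volatile forces the rounded product to be materialized
    return prod + c;
}

// Fused multiply-add: a*b + c computed with a single rounding at the end.
inline double fused(double a, double b, double c) {
    return std::fma(a, b, c);
}

// The shape from the benchmark: a subtract feeding a multiply.
// There is no multiply feeding an add here, so there is nothing to contract.
inline double half_diff(double x, double y) {
    return (x - y) * 0.5;
}
```

With a = 1 + 2^-27, b = 1 - 2^-27 and c = -1, the exact product is 1 - 2^-54, which rounds to 1.0 as a double. So `mul_then_add` returns 0 while `fused` returns -2^-54. That kind of difference is what -ffp-contract=off or -mno-fma rules out in auto-generated code, at no cost for an expression like `half_diff`.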