stackoverflow January 15, 2026 Rep: 85

SIMD slower than a single implementation for toy example

Question Details

No question body available.

Answers (1)

January 15, 2026 Score: 3 Rep: 378,632 Quality: Medium Completeness: 80%

Your Godbolt link omitted -DNDEBUG, so the compiler had to check assertions in every scalar iteration, which prevented auto-vectorization. With the GCC options you specified in the question, it auto-vectorizes gradientnormal as expected. The SIMD loop in it is at .L65: on Godbolt.
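As a rough illustration of the assertion issue, here is a minimal sketch with a hypothetical checked accessor (`checked_at`) and a hypothetical loop (`half_difference`); neither name comes from the question, and Eigen's real checks are more involved. Whether a given compiler can still vectorize the checked loop depends on whether it can prove the check never fires, so compare the asm with and without -DNDEBUG rather than assuming.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical bounds-checked accessor, similar in spirit to a library
// accessor that asserts on its index when NDEBUG is not defined.
inline float checked_at(const std::vector<float>& v, std::size_t i) {
    assert(i < v.size() && "index out of range");  // compiled out with -DNDEBUG
    return v[i];
}

// out[i] = (in[i+1] - in[i-1]) * 0.5f for interior elements.
// Assumes out.size() >= in.size(); boundary elements are left untouched.
void half_difference(const std::vector<float>& in, std::vector<float>& out) {
    for (std::size_t i = 1; i + 1 < in.size(); ++i) {
        out[i] = (checked_at(in, i + 1) - checked_at(in, i - 1)) * 0.5f;
    }
}

// Example build commands to compare (file name is an assumption):
//   g++ -O3 -march=x86-64-v3 demo.cpp           // assertions active
//   g++ -O3 -march=x86-64-v3 -DNDEBUG demo.cpp  // assertions removed
```

With assertions active, each call carries a branch to an abort path, which is the kind of per-iteration check that can block the vectorizer.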

So you'd expect both versions to run at about the same speed, modulo any warm-up effects making one slower (see Idiomatic way of performance evaluation?). The auto-vectorized version has extra code to handle lengths that aren't a multiple of the vector width. But you're testing the manually-vectorized version first, so any warm-up effects make it slower. If you reversed the order of the benchmarks, you'd probably find the other one slower in your current hand-rolled test framework.
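One way to reduce the warm-up bias is to run each candidate a few times untimed before measuring, then keep the best of several timed runs. The helper below is a minimal sketch; `best_seconds` and its parameters are hypothetical names, not part of the question's test framework.

```cpp
#include <algorithm>
#include <chrono>
#include <limits>

// Runs f() warmup_runs times untimed, then timed_runs times timed,
// and returns the fastest timed run in seconds.
template <class F>
double best_seconds(F&& f, int warmup_runs = 3, int timed_runs = 10) {
    for (int i = 0; i < warmup_runs; ++i) {
        f();  // untimed: lets page faults, caches and CPU clocks settle
    }
    double best = std::numeric_limits<double>::infinity();
    for (int i = 0; i < timed_runs; ++i) {
        const auto t0 = std::chrono::steady_clock::now();
        f();
        const auto t1 = std::chrono::steady_clock::now();
        best = std::min(best, std::chrono::duration<double>(t1 - t0).count());
    }
    return best;
}
```

Pass each version as a lambda that writes its result somewhere the compiler cannot discard, and try both orders of the two benchmarks to see whether the ranking is stable.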

Not quite a duplicate of Basic ways to speed up a simple Eigen program, which mentions -DNDEBUG being important for Eigen.

-march=x86-64-v2 -mavx2 is very weird, and only makes sense for Via Isaiah or Zhaoxin Lujiazui (which apparently doesn't signal AVX2 in its CPUID feature bits, if https://chipsandcheese.com/p/the-weird-and-wacky-world-of-via-part-2-zhaoxins-not-quite-electric-boogaloo is correct, but does support AVX2 without FMA). All other x86 CPUs with AVX2 also support FMA. -march=x86-64-v3 enables AVX2+FMA+BMI1/2, so your option loses out on BMI instructions even if you wanted to avoid FMA.

You commented that you're getting some crashes in your large code-base with some third-party libraries with x86-64-v3 and haven't debugged what's really going on yet, so sure, I guess this works as a temporary measure and is better than just SSE4. You might try -march=x86-64-v3 -mno-fma to just disable FMA if that's the problem; BMI1/2 integer instructions can't change numerical results.

Or -march=x86-64-v3 -ffp-contract=off (https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html#index-ffp-contract) to never automatically use FMA for expressions like a*b + c, only when used manually.

In this microbenchmark, FMA doesn't help. You have (x-y)*0.5, which can't be optimized to a single FMA. GCC's default -ffp-contract=fast makes the same asm with -march=x86-64-v3 as with -march=x86-64-v2 -mavx2.
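To make the distinction concrete, the sketch below contrasts the three shapes discussed here. The function names are hypothetical. An FMA computes a*b + c with a single rounding, multiply first; (x-y)*0.5 is a subtract followed by a multiply, so it does not match that pattern.

```cpp
#include <cmath>

// Contractible: with -ffp-contract=fast on an FMA-capable target, GCC may
// compile this to one FMA instruction; with -ffp-contract=off it stays a
// separate multiply and add.
double mul_add(double a, double b, double c) {
    return a * b + c;
}

// Explicit ("manual") FMA: fused regardless of the -ffp-contract setting.
// On targets without hardware FMA this falls back to a library routine.
double mul_add_fused(double a, double b, double c) {
    return std::fma(a, b, c);
}

// The shape from the microbenchmark: subtract, then multiply.
// There is no multiply feeding an add here, so nothing to contract.
double half_diff(double x, double y) {
    return (x - y) * 0.5;
}
```

Because a fused multiply-add rounds once instead of twice, `mul_add` and `mul_add_fused` can differ in the last bit for some inputs, which is why disabling contraction is a reasonable thing to try when results change between -march settings.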

Question Information

Question ID 79868320

Posted January 15, 2026 8:17 AM

Owner ID 7787890
