Question Details

No question body available.

Tags

openmp

Answers (2)

Accepted Answer
March 18, 2026 Score: 2 Rep: 2,979 Quality: High Completeness: 60%

If you are performing OpenMP scaling experiments to measure the performance of a single OpenMP code as you change the number of threads (rather than OpenMP inside MPI), setting OMP_DYNAMIC seems a very bad idea, since it allows the OpenMP implementation to choose how many threads you actually use, but that is your X-axis!

Setting it will therefore make it hard to perform sane scaling experiments. (You could print out how many threads have actually been allocated, and use that as the X-axis, but it seems a perverse way to do things!)

For simple scaling, you want to ensure that nothing else is running on the node you are using for measurements, and then use the OMP_NUM_THREADS environment variable to control the number of threads you use.

Of course, you could investigate the performance of running multiple instances of your OpenMP code on a single node, in which case not setting OMP_NUM_THREADS and using the number of OpenMP instances as the X-axis would be feasible, but that is really back to MPI-style scaling...

March 18, 2026 Score: 0 Rep: 15,338 Quality: Medium Completeness: 80%

As usual with OpenMP features, the exact implementation, and therefore the effect, is rather vague if you simply read the standard. I've investigated what this setting does in GCC's libgomp and Clang's libomp.

Common implementations

For GCC the setting can be traced to the function gomp_dynamic_max_threads. Here is the Linux version; versions for other platforms do essentially the same. The comment helpfully states:

When OMP_DYNAMIC is set, at thread launch determine the number of threads we should spawn for this team.
??? I have no idea what best practice for this is. Surely some function of the number of processors that are still online and the load average. Here I use the number of processors online minus the 15 minute load average.

The code does exactly that and I have verified it with a quick test.

Clang also uses the load average (function __kmp_load_balance_nproc), though its implementation is more involved. I have not tested or analyzed it too deeply, so I could be wrong. As I understand it, it essentially takes the load average but accounts for its own contribution through the active thread pool. However, it seems to use a much shorter load average, resulting in more frequent and faster fluctuations.

Usefulness in general

The effect for both is the same: If the system is under load, the number of active threads is reduced. In theory this is nice but I think the implementation is very lacking and highly dubious.

Especially for GCC with its very long 15-minute average, it is entirely possible that part of that load was a previous run of the very same program you are currently running. Essentially, the code can become scared of its own shadow. On something like a well-utilized compute cluster or batch system, that load probably belongs to a different batch job that has just finished. And as a developer with quick compile-and-test cycles, the load includes the compilation itself.

There is also no check whether the load even occurs on CPUs that could be used by OpenMP. For example if you use numactl to bind the process to a certain number of CPUs, both implementations will be overly pessimistic. This will be a particular issue for mixed MPI and OpenMP use cases, or if you use a system with E- and P-cores such as current Intel or Apple hardware where your application should be bound to the P-cores while E-cores may run your development environment.

And of course there is no check whether the load has a lower priority than your program.

Overall I think OMP_DYNAMIC is rarely useful. The heuristic is simply too dumb to account for all possible workflows and system configurations.

Usefulness for scaling experiments

Regarding your original question: no, this flag should not be used for benchmarking. If you run the same program three times, chances are it will give you three different thread counts. If your system has such a high background load that it needs to be factored into your benchmarking, your system is unsuitable for benchmarking in the first place! At best you have to account for it in a more reproducible manner, such as limiting the thread count consistently.