stackoverflow January 7, 2026 Rep: 497

Triangulating matrices using LU decomposition method in C : Question about for loop optimization

Score

Answers

Views

22.6

Trend Score

Question Details

No question body available.

Answers (2)

January 7, 2026 Score: 3 Rep: 3,704 Quality: Medium Completeness: 60%

You have actually achieved your aim of 10 essential lines of C code to implement LU matrix decomposition. It is all the optional brackets that are making your solution much longer than it needs to be.

Your objective of merging the two for loops is complicated by the fact that the outer loop on i has different limits i = 1 for the first loop and i = 0 for the second.

Adding a conditional test if(i) to merge them creates an unnecessary overhead so is unwise and it will only work correctly for the very first line and column since the second inner loop on j that normalises the off diagonal elements will be working on transient incompletely computed diagonal values.

You can however make optimisations that will make the resulting code faster and that modern optimising compilers still cannot see. NB It is generally better to use established system libraries for matrix operations as they are robust, very well tested and cache aware.

The first optimisation that holds generally is that divide is considerably (~10x) more expensive than every other binary arithmetic operation so that eliminating the divide y = y/x in the innermost loop by computing rx = 1.0/x outside the loop and then y = yrx is much faster but slightly less accurate. But only do this after confirming by looking at the assembler listing that the compiler won't do it for you.

It also eliminates one none essential multiply from the innermost loop.

The other specific tweak to this algorithm and only really helpful for moderate size matrices is to swap the order of the loops i,j to j,i and make i > j by construction (see tri_opt).

These are the code snippets for the various versions of tri :

// original code with all brackets included 20 non blank lines 6 { 6 }void tri(double lu[d][d])
{
    for (int i = 1; i < d; i++)
    {
        for (int j = 0; j < d - i; j++)
        {
            for (int k = 0; k < d - i; k++)
            {
                lu[i + j][i + k] = (lu[i + j][i + k]  lu[i - 1][i - 1] - lu[i - 1][i + k]  lu[i + j][i - 1]) / lu[i - 1][i - 1];
            }
        }
    }
    for (int i = 0; i < d; i++)
    {
        for (int j = 0; j < d; j++)
        {
            if (i > j) { lu[i][j] = lu[i][j] / lu[j][j]; }
        }
    }
}
// after removing all non essential "{ }" bracket pairs 10 lines 1 { 1 }
// core code could be two very long lines if you insist so 5 lines in total 
void tri_obfuscated(double lu[d][d])
{
    for (int i = 1; i < d; i++)
        for (int j = 0; j < d - i; j++)
            for (int k = 0; k < d - i; k++)
                lu[i + j][i + k] = (lu[i + j][i + k]  lu[i - 1][i - 1] - lu[i - 1][i + k]  lu[i + j][i - 1]) / lu[i - 1][i - 1];
    for (int i = 0; i < d; i++)
        for (int j = 0; j < d; j++)
            if (i > j) lu[i][j] = lu[i][j] / lu[j][j];
}
// this version takes the scaling divisions out of the inner loop to optimise speed
// I was surprised that neither MSVC or Intel ICX did this even when allowed FP:fast
// with d visible as 5 at compile time they each unrolled the innermost loops.
void tri_opt(double lu[d][d])
{
    for (int i = 1; i < d; i++)
    {
        double scale = 1.0 / lu[i - 1][i - 1];
        for (int j = 0; j < d - i; j++)
        {
            for (int k = 0; k < d - i; k++)
            {
                lu[i + j][i + k] = lu[i + j][i + k] - lu[i - 1][i + k]  lu[i + j][i - 1]  scale;
            }
        }
    }    for (int j = 0; j < d; j++)
    {
        double scale = 1.0 / lu[j][j];
        for (int i = j+1; i < d; i++)
        {
            lu[i][j] = lu[i][j]  scale;
        }
    }
}

Most compilers today at higher levels of optimisation will spot common loop invariants and move them outside. The divide to multiply by a reciprocal can be problematic enough that compilers will only consider it at maximum optimisation and when allowed the relaxed precision of fp-fast.

January 7, 2026 Score: 0 Rep: 144 Quality: Low Completeness: 60%

To include the second nested loop in the first, you need to see where it changes a value that will be used in the sequence:

the calculation

lu[i + j][i + k] = (lu[i + j][i + k]  lu[i - 1][i - 1] - lu[i - 1][i + k]  lu[i + j][i - 1]) / lu[i - 1][i - 1];

uses the line i-1, so you can't change the line i before the lu[i+1] assignment, that occurs when j=1 or, in other words, you can change it when j > 0.

The first column of each line did not enter the nested loop, so do the calculation in the end.

The second nested loop uses the condition i>j, wich implies i>j>=0, then i>0. So we could start the loop with i=1.

Also, the same condition lead us to d>i>j, then (because i is integer), d>d-1>j or, in other words,

j 0 && i > j) lu[i][j] = lu[i][j] / lu[j][j];
        }
        lu[i][0] = lu[i][0] / lu[0][0];
    }
}

If you don't like the if(j>0) you could include this case outside and start the second loop with j=1.

for (int i = 1; i < d; i++)
    {
        //case when j=0
        for (int k = 0; k < d-i; k++)
            lu[i][i+k] = (lu[i][i+k]lu[i-1][i-1] - lu[i-1][i+k]lu[i][i-1]) / lu[i-1][i-1];
        for (int j = 1; j < d-i; j++)
        {
            for (int k = 0; k < d-i; k++)
                lu[i+j][i+k] = (lu[i+j][i+k]lu[i-1][i-1] - lu[i-1][i+k]lu[i+j][i-1]) / lu[i-1][i-1];
            if (i > j) lu[i][j] = lu[i][j] / lu[j][j];
        }
        lu[i][0] = lu[i][0] / lu[0][0];
    }

Export Question Data

Export this question and its answers for further analysis or reporting.

Back to Questions

Triangulating matrices using LU decomposition method in C : Question about for loop optimization

Question Details

Tags

Answers (2)

Analysis Metrics

Question Information

Actions

Related Questions

Export Question Data