Tags

bash awk text

Answers (2)

Accepted Answer
May 30, 2025 Score: 3 Rep: 30,828 Quality: Expert Completeness: 60%

With any POSIX awk:

$ cat foo.awk
function stat( sum, sum2, i) {
    sum = sum2 = 0
    for(i = 1; i
May 29, 2025 Score: 2 Rep: 38,342 Quality: Medium Completeness: 80%

Pearson, I updated the post, thanks for it. About the tool, I use bash scripting, so Linux tools such as awk are easy to include and code using those great tools is the way to go, but I think I can also include Python solutions.

I would then suggest using an existing implementation to compute the Pearson correlation.

If you are okay with installing GNU datamash, you can use its ppearson operation as follows. Let the content of file.tsv be

1   1   -1  1
1.5 1.5 -1.5    2.25
2   2   -2  4
2.5 2.5 -2.5    6.25
3   3   -3  9
3.5 3.5 -3.5    12.25
4   4   -4  16
4.5 4.5 -4.5    20.25
5   5   -5  25

then

datamash 'ppearson 1:2 ppearson 1:3 ppearson 1:4' < file.tsv

gives output

1   -1  0.98263874354359

These are the correlations between columns 1 and 2, columns 1 and 3, and columns 1 and 4. You can either build a single string describing everything to compute, or call datamash once for each correlation you want. For example, to correlate every column with every other column (the Cartesian product), you might do

for i in $(seq 4)
do
  for j in $(seq 4)
  do
    corr=$(datamash "ppearson $i:$j" < file.tsv)
    echo "$i vs $j correlation is $corr"
  done
done

which will result in

1 vs 1 correlation is 1
1 vs 2 correlation is 1
1 vs 3 correlation is -1
1 vs 4 correlation is 0.98263874354359
2 vs 1 correlation is 1
2 vs 2 correlation is 1
2 vs 3 correlation is -1
2 vs 4 correlation is 0.98263874354359
3 vs 1 correlation is -1
3 vs 2 correlation is -1
3 vs 3 correlation is 1
3 vs 4 correlation is -0.98263874354359
4 vs 1 correlation is 0.98263874354359
4 vs 2 correlation is 0.98263874354359
4 vs 3 correlation is -0.98263874354359
4 vs 4 correlation is 1

You would first need to convert your data into a format GNU datamash accepts (tab-separated fields by default), then read the resulting values and act on them.

(tested in GNU datamash 1.9)
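
For the preprocessing step mentioned above, here is a minimal Python sketch, assuming your data is separated by runs of spaces rather than tabs. The function name to_tsv is hypothetical and not part of datamash. If you would rather not preprocess at all, datamash's -W (--whitespace) option should also accept whitespace-separated input directly.

```python
def to_tsv(text):
    """Convert whitespace-separated rows into tab-separated rows.

    Hypothetical helper for illustration; blank lines are dropped.
    """
    rows = []
    for line in text.splitlines():
        fields = line.split()  # split on any run of spaces or tabs
        if fields:             # skip empty lines
            rows.append("\t".join(fields))
    # End every row with a newline so the result can be written to a file as-is
    return "".join(row + "\n" for row in rows)


# Two sample rows in the same shape as file.tsv above
sample = "1   1   -1  1\n1.5 1.5 -1.5    2.25\n"
print(to_tsv(sample), end="")
```

The result can be written to a file and passed to datamash exactly as in the commands above.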

Python has statistics.correlation in its standard library, but be warned that it was added in Python 3.10; if python --version reports an older version, the function is not available.
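
As a rough sketch of the Python route, the following computes the same every-against-every table as the datamash loop above. It assumes the file.tsv data is embedded as a string (DATA) rather than read from disk, and the helper correlation_table is a hypothetical name introduced for illustration. The fallback function is only an illustrative stand-in for Python versions older than 3.10, not the standard library's implementation.

```python
import math

try:
    from statistics import correlation  # standard library, Python 3.10+
except ImportError:
    def correlation(x, y):
        """Illustrative fallback: Pearson correlation for older Pythons."""
        n = len(x)
        mean_x = sum(x) / n
        mean_y = sum(y) / n
        cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
        var_x = sum((a - mean_x) ** 2 for a in x)
        var_y = sum((b - mean_y) ** 2 for b in y)
        return cov / math.sqrt(var_x * var_y)

# Same values as file.tsv above, embedded here so the sketch is self-contained
DATA = """\
1 1 -1 1
1.5 1.5 -1.5 2.25
2 2 -2 4
2.5 2.5 -2.5 6.25
3 3 -3 9
3.5 3.5 -3.5 12.25
4 4 -4 16
4.5 4.5 -4.5 20.25
5 5 -5 25
"""

# Parse rows of floats, then transpose rows into columns
rows = [[float(v) for v in line.split()] for line in DATA.splitlines()]
columns = list(zip(*rows))


def correlation_table(cols):
    """Map (i, j), using 1-based column numbers, to the Pearson correlation."""
    count = len(cols)
    return {
        (i, j): correlation(cols[i - 1], cols[j - 1])
        for i in range(1, count + 1)
        for j in range(1, count + 1)
    }


for (i, j), r in correlation_table(columns).items():
    print(f"{i} vs {j} correlation is {r:.14g}")
```

The printed values should agree with the datamash results above up to floating-point rounding.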