Get value from columns in two files

Question

My initial observations look like this:

  name analyte
 spring 0.1
 winter 0.4

To calculate the p-value, I ran a bootstrap simulation:

  name analyte
 spring 0.001
 winter 0
 spring 0
 winter 0.2
 spring 0.03
 winter 0
 spring 0.01
 winter 0.02
 spring 0.1
 winter 0.5
 spring 0
 winter 0.04
 spring 0.2
 winter 0
 spring 0
 winter 0.06
 spring 0
 winter 0
 .....

Now I want to calculate the empirical p-value. In the initial data, the winter analyte is 0.4. If a winter analyte >= 0.4 occurred in the simulated (bootstrap) data, say, 1 time, and the simulation was run, say, 100 times, then the empirical p-value for the winter analyte is:

1/100 = 0.01

(That is, the number of times the simulated value was the same as or higher than the original value, divided by the total number of observations.)

For the spring analyte, the p-value is:

2/100 = 0.02
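Written as a formula, the rule described above is roughly the following. The symbols are assumptions introduced here for illustration: x_obs is the original value for a name, x_i^* are the simulated values for that name, and N is how many simulated values there are.

```latex
\hat{p} \;=\; \frac{\#\{\, i : x_i^{*} \ge x_{\mathrm{obs}} \,\}}{N},
\qquad \text{e.g. for winter: } \hat{p} = \frac{1}{100} = 0.01
```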

I want to calculate these p values with awk. My solution for spring:

awk -v VAR="spring" '($1==VAR && $2>=0.1) {n++} END {print VAR,"p-value=",n/100}'

    spring p-value= 0.02

I need help passing the original file (with the names spring and winter and their analyte values) to awk, so that the observed values and the number of observations are read from the files instead of being hard-coded.

+4

regex grep awk perl sed

Dany bee Jun 17 '13 at 18:37

2 answers

This works for me (GNU awk 3.1.6):

    FNR == NR { a[$1] = $2; next }
    $2 > a[$1] { b[$1]++ }
    { c[$1]++ }
    END { for (i in a) print i, "p-value=", b[i]/c[i] }

Output:

    winter p-value= 0.111111
    spring p-value= 0.111111
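Note that the script above compares with a strict `>`, while the question counts values that are "the same or higher". A minimal sketch of the inclusive variant follows. It assumes both files start with a header line (as in the sample files in this thread) and uses temporary files. The array names `obs`, `hits`, and `total` are made up for this example.

```sh
#!/bin/sh
# Sketch only: temporary file names and array names are assumptions.
tmp=$(mktemp -d)

printf '%s\n' 'name Analyte' 'spring 0.1' 'winter 0.4' > "$tmp/original"
printf '%s\n' 'name Analyte' \
    'spring 0.001' 'winter 0'    'spring 0'    'winter 0.2' \
    'spring 0.03'  'winter 0'    'spring 0.01' 'winter 0.02' \
    'spring 0.1'   'winter 0.5'  'spring 0'    'winter 0.04' \
    'spring 0.2'   'winter 0'    'spring 0'    'winter 0.06' \
    'spring 0'     'winter 0' > "$tmp/bootstrap"

out=$(awk '
    FNR == 1   { next }                      # skip the header of each file
    FNR == NR  { obs[$1] = $2 + 0; next }    # first file: observed value per name
    !($1 in obs) { next }                    # ignore names with no observed value
    { total[$1]++ }                          # second file: count every draw
    $2 + 0 >= obs[$1] { hits[$1]++ }         # inclusive: same or higher counts
    END {
        for (name in obs)
            if (total[name] > 0)             # avoid dividing by zero
                printf "%s p-value= %.6f\n", name, hits[name] / total[name]
    }
' "$tmp/original" "$tmp/bootstrap" | sort)

printf '%s\n' "$out"
rm -rf "$tmp"
```

With this sample data the inclusive comparison gives `spring p-value= 0.222222` (both 0.1 and 0.2 count, so 2/9) and `winter p-value= 0.111111`. Adding `+ 0` forces a numeric comparison, and the output is piped through `sort` because the order of `for (name in obs)` is not defined in awk.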

+2

Endoro Jun 17 '13 at 20:55

jaypal singh · Accepted Answer · 2013-06-17T19:00:12+0000

Explanation and script content:

Run it like: `awk -f script.awk original bootstrap`

    # Slurp the original file in an array a
    # Ignore the header
    NR==FNR && NR>1 {
        # Index of this array will be type
        # Value of that type will be original value
        a[$1]=$2
        next
    }

    # If in the bootstrap file value
    # of second column is greater than original value
    FNR>1 && $2>a[$1] {
        # Increment an array indexed at first column
        # which is nothing but type
        b[$1]++
    }

    # Increment another array regardless to identify
    # the number of times bootstrapping was done
    {
        c[$1]++
    }

    # for each type in array a
    END {
        for (type in a) {
            # print the type and calculate empirical p-value
            # which is done by dividing the number of times higher value
            # of a type was seen and total number of times
            # bootstrapping was done.
            print type, b[type]/c[type]
        }
    }

Test:

    $ cat original
    name Analyte
    spring 0.1
    winter 0.4

    $ cat bootstrap
    name Analyte
    spring 0.001
    winter 0
    spring 0
    winter 0.2
    spring 0.03
    winter 0
    spring 0.01
    winter 0.02
    spring 0.1
    winter 0.5
    spring 0
    winter 0.04
    spring 0.2
    winter 0
    spring 0
    winter 0.06
    spring 0
    winter 0

    $ awk -f script.awk original bootstrap
    spring 0.111111
    winter 0.111111

Analysis:

    Spring Original Value is 0.1
    Winter Original Value is 0.4
    Bootstrapping done is 9 times for this sample file
    Count of values higher than Spring original value = 1
    Count of values higher than Winter original value = 1
    So, 1/9 = 0.111111
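The arithmetic in the analysis above can be checked by printing the counts next to the ratio. This is a sketch rather than part of the answer: it keeps the same strict `>` comparison and the array names `a`, `b`, and `c`, but the temporary files, the `counts` variable, and the output format are assumptions made for illustration.

```sh
#!/bin/sh
# Sketch only: recreates the sample files from the test above in a temp dir.
tmp=$(mktemp -d)

printf '%s\n' 'name Analyte' 'spring 0.1' 'winter 0.4' > "$tmp/original"
printf '%s\n' 'name Analyte' \
    'spring 0.001' 'winter 0'    'spring 0'    'winter 0.2' \
    'spring 0.03'  'winter 0'    'spring 0.01' 'winter 0.02' \
    'spring 0.1'   'winter 0.5'  'spring 0'    'winter 0.04' \
    'spring 0.2'   'winter 0'    'spring 0'    'winter 0.06' \
    'spring 0'     'winter 0' > "$tmp/bootstrap"

counts=$(awk '
    FNR == 1  { next }                    # skip the header of each file
    NR == FNR { a[$1] = $2 + 0; next }    # original value per type
    !($1 in a) { next }                   # ignore types missing from the original
    { c[$1]++ }                           # number of bootstrap draws per type
    $2 + 0 > a[$1] { b[$1]++ }            # draws strictly higher than the original
    END {
        for (type in a)
            if (c[type] > 0)
                printf "%s %d/%d = %.6f\n", type, b[type], c[type], b[type] / c[type]
    }
' "$tmp/original" "$tmp/bootstrap" | sort)

printf '%s\n' "$counts"
rm -rf "$tmp"
```

For the sample files this prints `spring 1/9 = 0.111111` and `winter 1/9 = 0.111111`, matching the counts listed in the analysis.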
