How can I calculate an empirical CDF in R?

I am reading a sparse table from a file that looks like this:

1 0 7 0 0 1 0 0 0 5 0 0 0 0 2 0 0 0 0 1 0 0 0 1
1 0 0 1 0 0 0 3 0 0 0 0 1 0 0 0 1
0 0 0 1 0 0 0 2 0 0 0 0 1 0 0 0 1 0 1 0 0 1
1 0 0 1  0 3 0 0 0 0 1 0 0 0 1
0 0 0 1 0 0 0 2 0 0 0 0 1 0 0 0 1 0 1 0 0 1 1 2 1 0 1 0 1

Note that the lines have different lengths.

Each line represents one simulation. The value in the i-th column in each row indicates how many times the value of i-1 was observed in this simulation. For example, in the first simulation (first row), we got a single result with a value of "0" (first column), 7 results with a value of "2" (third column), etc.

I want to create an average cumulative distribution function (CDF) for all simulation results, so I could use it later to calculate the empirical p-value for true results.

To do this, I can sum each column first, but I need to treat the missing entries in the shorter rows as zeros.

How can I read such a table with rows of different lengths? How do I sum the columns while replacing the missing ("undef") values with 0? And finally, how do I create the CDF? (I can do it manually, but I assume there is a package that can do this.)

2 answers

This will read the data in:

dat <- textConnection("1 0 7 0 0 1 0 0 0 5 0 0 0 0 2 0 0 0 0 1 0 0 0 1
1 0 0 1 0 0 0 3 0 0 0 0 1 0 0 0 1
0 0 0 1 0 0 0 2 0 0 0 0 1 0 0 0 1 0 1 0 0 1
1 0 0 1  0 3 0 0 0 0 1 0 0 0 1
0 0 0 1 0 0 0 2 0 0 0 0 1 0 0 0 1 0 1 0 0 1 1 2 1 0 1 0 1")
df <- data.frame(scan(dat, fill = TRUE, what = as.list(rep(1, 29))))
names(df) <- paste("Val", 1:29)
close(dat)

Result:

> head(df)
  Val 1 Val 2 Val 3 Val 4 Val 5 Val 6 Val 7 Val 8 Val 9 Val 10 Val 11 Val 12
1     1     0     7     0     0     1     0     0     0      5      0      0
2     1     0     0     1     0     0     0     3     0      0      0      0
3     0     0     0     1     0     0     0     2     0      0      0      0
4     1     0     0     1     0     3     0     0     0      0      1      0
5     0     0     0     1     0     0     0     2     0      0      0      0
....

If the data is in a file, pass the file name instead of dat. This code assumes that there are no more than 29 columns, as in the data you provided. Change 29 to match your actual data.
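If the maximum row length is not known in advance, it can be measured before reading. The following is only a sketch under assumptions: the temporary file, its three toy rows, and the names path and n_cols are hypothetical, and the column names are built with paste0() (no space) rather than the "Val 1" style used above. It relies on count.fields() and on read.table() with fill = TRUE, which pads short rows with NA.

```r
# Hypothetical ragged file, written to a temp location for illustration
path <- tempfile(fileext = ".txt")
writeLines(c("1 0 7 0 0 1",
             "1 0 0 1",
             "0 0 0 1 0 0 0 2"), path)

# Number of whitespace-separated fields on the longest line
n_cols <- max(count.fields(path))

# fill = TRUE pads short rows with NA; col.names fixes the column count
df <- read.table(path, fill = TRUE,
                 col.names = paste0("Val", seq_len(n_cols)))

# Treat the padded (missing) entries as zero counts
df[is.na(df)] <- 0

colSums(df)
```

Replacing NA with 0 up front is optional; colSums(df, na.rm = TRUE) gives the same sums without modifying the data frame.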

Get the column sums, ignoring the missing entries (which is equivalent to counting them as 0), using

df.csum <- colSums(df, na.rm = TRUE)

The function ecdf() generates the ECDF that you need,

df.ecdf <- ecdf(df.csum)

and we can plot it using the plot() method:

plot(df.ecdf, verticals = TRUE)
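Note that ecdf(df.csum) treats the column sums themselves as the sample. If the sums are instead read as pooled frequencies of the values 0, 1, 2, ..., as the question describes, the CDF can be built from the cumulative counts. The following is a minimal sketch under that assumption: the toy vector counts stands in for df.csum, and obs and the upper-tail direction of the p-value are hypothetical choices, not something the question specifies.

```r
# Toy pooled counts standing in for df.csum:
# counts[i] is how many times the value i - 1 was observed
counts <- c(3, 0, 7, 3, 0, 4)
values <- seq_along(counts) - 1      # 0, 1, ..., 5

# CDF of the pooled simulations: P(X <= v) for each v in values
cdf <- cumsum(counts) / sum(counts)

# Right-continuous step function for evaluating the CDF anywhere
cdf_fun <- stepfun(values, c(0, cdf))

# Hypothetical observed value; upper-tail empirical p-value P(X >= obs)
obs <- 3
p_upper <- sum(counts[values >= obs]) / sum(counts)

cdf_fun(2)   # share of pooled results <= 2
p_upper      # share of pooled results >= obs
```

With real data, counts would be replaced by df.csum, and plot(cdf_fun, verticals = TRUE) draws the resulting step function.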

You can use ecdf() (in base R) or Ecdf() (from the Hmisc package).
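As a small illustration of the base R version: ecdf() returns a function, so the estimated CDF can be evaluated at any point. The sample x and the name Fn below are made up for the example.

```r
# Toy sample of raw observations (not counts)
x <- c(1, 2, 2, 3, 5)

# ecdf() returns a step function
Fn <- ecdf(x)

Fn(2)   # proportion of observations <= 2
Fn(0)   # below the smallest observation
```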


Source: https://habr.com/ru/post/1773121/

