Distribution definition so that I can generate test data

Question

I have about 100M value/count pairs in a text file on my Linux machine. I would like to figure out what kind of formula I could use to generate more pairs that follow the same distribution.

From a casual inspection, it looks power-law-ish, but I need to be more rigorous than that. Can R do this easily? If so, how? Is there anything else that works better?

r

twk Jun 17 '09 at 14:44

3 answers

Alex Martelli · Answer 1 · 2009-06-17T15:04:11+0000

While a bit costly, you can reproduce your sample's distribution exactly (without any hypothesis about the underlying population distribution) as follows.

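You need a file structure that can be searched quickly for "the highest entry with key <= X". Sleepycat's Berkeley DB, for example, has a btree structure for this; SQLite is even easier, though perhaps not quite as fast (with an index on the key it should be fine).

Store your data as pairs whose key is the cumulative count up to that point (sorted by increasing value). Call the highest key K.

To generate a random pair that follows exactly the same distribution as the sample, generate a random integer X between 0 and K, look it up in that file structure with the "highest key <= X" search, and use the corresponding value.

I'm not sure how to do all of this in R. In your shoes I would try a Python/R bridge, doing the logic and control in Python and only the statistics in R itself, but that is a personal choice!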
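A minimal in-memory sketch of the cumulative-count lookup, in Python. The names `build_index` and `draw` and the toy pairs are made up for illustration, and a sorted list searched with `bisect` stands in for the btree or SQLite index, which you would want for 100M rows. Each key here is the total count *before* its value, so X is drawn from 0 up to but not including the grand total.

```python
import bisect
import random
from itertools import accumulate


def build_index(pairs):
    """Return (starts, values, total) for a list of (value, count) pairs.

    starts[i] is the cumulative count before values[i]; counts are
    assumed to be positive integers.
    """
    values = [value for value, _ in pairs]
    counts = [count for _, count in pairs]
    ends = list(accumulate(counts))   # running totals including each value
    starts = [0] + ends[:-1]          # running totals excluding each value
    return starts, values, ends[-1]


def draw(starts, values, total, rng=random):
    """Draw one value with probability proportional to its count."""
    x = rng.randrange(total)                  # 0 <= x < total
    i = bisect.bisect_right(starts, x) - 1    # highest key <= x
    return values[i]


# Hypothetical toy data standing in for the real value/count file.
pairs = [("a", 5), ("b", 3), ("c", 2)]
starts, values, total = build_index(pairs)

rng = random.Random(0)  # seeded only so the sketch is repeatable
sample = [draw(starts, values, total, rng) for _ in range(10000)]
```

With a database, the same idea becomes one indexed query per draw: the row with the largest key not exceeding X.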

John D. Cook · Answer 2 · 2009-06-17T16:26:29+0000

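To see whether you really have a power-law distribution, make a log-log plot of the frequencies and check whether they fall roughly on a straight line. If they do, read up on the Pareto distribution for more on how to describe your data.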
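One informal way to test the "power-ish" impression numerically is to fit a straight line to the counts on log-log axes. The sketch below is an assumption-laden illustration, not a rigorous test: `loglog_fit` is a made-up helper, it ranks the counts from largest to smallest, and the input is synthetic data that follows count = 1000 / rank exactly. A good straight-line fit is suggestive of a power law but does not prove one.

```python
import math


def loglog_fit(counts):
    """Least-squares line through (log rank, log count).

    Returns (slope, r_squared). Needs at least two distinct positive counts.
    """
    ordered = sorted((c for c in counts if c > 0), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ordered) + 1)]
    ys = [math.log(c) for c in ordered]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, r_squared


# Synthetic counts (hypothetical): an exact power law with exponent -1.
counts = [1000.0 / rank for rank in range(1, 201)]
slope, r_squared = loglog_fit(counts)
```

On real data, a slope that stays stable across the range and an R-squared close to 1 would support the straight-line reading; strong curvature on the log-log plot would argue against it.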

medriscoll · Answer 3 · 2009-06-27T08:54:55+0000

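I'm assuming you are interested in understanding the distribution over your categorical values.

The best way to generate "new" data is to sample from your existing data using R's sample() function. This gives you values that follow the probability distribution indicated by your existing counts.

As a trivial example, suppose you had a file of voter data for a small town, where the values are the voters' political affiliations and the counts are the number of voters: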

affils <- as.factor(c('democrat','republican','independent'))
counts <- c(552,431,27)
## Simulate 20 new voters, sampling from affiliation distribution
new.voters <- sample(affils,20, replace=TRUE,prob=counts)
new.counts <- table(new.voters)

In practice, you will probably read your 100M rows of values and counts with R's read.csv() function. Assuming the file has a header line reading "values\tcounts", the code might look something like this:

dat <- read.csv('values-counts.txt',sep="\t",colClasses=c('factor','numeric'))
new.dat <- sample(dat$values,100,replace=TRUE,prob=dat$counts)

One caveat: as you may know, R keeps all of its objects in memory, so make sure you have enough free memory for 100M rows of data (storing character strings as factors will help reduce the footprint).
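For anyone taking the Python route instead, here is a rough equivalent of the read-then-sample step. It is only a sketch: `load_pairs` is a made-up helper, and an in-memory string stands in for the tab-separated file with its "values\tcounts" header. Like the R version, it holds every row in memory.

```python
import csv
import io
import random
from itertools import accumulate

# Hypothetical stand-in for the real tab-separated file.
text = "values\tcounts\ndemocrat\t552\nrepublican\t431\nindependent\t27\n"


def load_pairs(handle):
    """Read 'values' and 'counts' columns from a tab-separated file object."""
    reader = csv.DictReader(handle, delimiter="\t")
    values, counts = [], []
    for row in reader:
        values.append(row["values"])
        counts.append(int(row["counts"]))
    return values, counts


values, counts = load_pairs(io.StringIO(text))
cum_counts = list(accumulate(counts))  # cumulative weights for sampling

rng = random.Random(42)  # seeded only so the sketch is repeatable
# Sample 100 values with replacement, weighted by the observed counts.
new_dat = rng.choices(values, cum_weights=cum_counts, k=100)
```

For a real file, replace the `io.StringIO(text)` argument with an open file handle.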
