Determining a distribution so that I can generate test data

I have about 100M value/count pairs in a text file on my Linux machine. I would like to figure out what kind of formula I could use to generate more pairs that follow the same distribution.

From a casual inspection, it looks power-law-ish, but I need to be more rigorous than that. Can R do this easily? If so, how? Is there anything else that works better?

+3
3 answers

Although it is somewhat costly, you can mimic the distribution of your sample exactly (without any hypothesis about the underlying population distribution) as follows.

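You need a file structure that can be searched quickly for "the highest entry with key <= X". Sleepycat's Berkeley DB, for example, has a btree structure for this; SQLite is even easier to use, though perhaps not quite as fast (with an index on the key it should be fine).

Store your data as pairs whose key is the cumulative count up to that point (sorted by increasing value). Call the highest key K.

To generate a random pair that follows exactly the same distribution as the sample, generate a random integer X between 0 and K, look it up in the file structure with the "highest key <= X" search, and use the corresponding value.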
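The cumulative-count lookup can also be sketched directly in R when the pairs fit in memory. This is only an illustration under assumptions: the `values` and `counts` vectors are made-up data, `draw_values` is a hypothetical helper, and base R's `findInterval()` stands in for the btree or SQLite search. Here each key is the cumulative count before its value, so X is drawn from 0 to total - 1.

```r
# Hypothetical sample: distinct values and how often each was observed.
values <- c(1, 2, 5, 10)
counts <- c(50, 30, 15, 5)

# Sort by value; each key is the cumulative count *before* that value.
ord    <- order(values)
values <- values[ord]
counts <- counts[ord]
starts <- cumsum(counts) - counts   # 0, 50, 80, 95
total  <- sum(counts)               # 100

# Draw n values: pick an integer X uniformly from 0..total-1 and take the
# value belonging to the highest key <= X.
draw_values <- function(n) {
  x <- sample.int(total, n, replace = TRUE) - 1
  values[findInterval(x, starts)]
}

set.seed(1)
new_values <- draw_values(1000)
table(new_values)
```

Each value owns exactly as many integers in 0..total-1 as its count, so it is drawn with probability count/total. For data too large for memory, the same lookup would have to run against an indexed on-disk store, as the answer describes.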

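I am not sure how to do all of this in R. In your shoes I would try a Python/R bridge, doing the logic and control in Python and only the statistics in R, but that is a personal choice!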

+4

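To see whether you really have a power-law distribution, plot the frequencies on log-log axes and check whether the points fall roughly on a straight line. If they do, read up on the Pareto distribution for more on how to describe your data.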
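A minimal sketch of that log-log check in base R, assuming the pairs are in a data frame with columns `values` and `counts`. The data here are synthetic, generated from an exact power law so the result is known in advance; the least-squares slope is only a rough diagnostic, not a rigorous estimate of the exponent.

```r
# Hypothetical value/count pairs from an exact power law: count = 1000 * value^-2
dat <- data.frame(values = 1:50)
dat$counts <- 1000 * dat$values^(-2)

# On log-log axes a power law shows up as a straight line.
plot(dat$values, dat$counts, log = "xy", xlab = "value", ylab = "count")

# The slope of a straight-line fit to the logged data approximates the exponent.
fit   <- lm(log(counts) ~ log(values), data = dat)
slope <- unname(coef(fit)[2])
slope
```

With real data the points will scatter around the line, and strong curvature on the log-log plot suggests the distribution is not a power law.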

+4

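I am assuming you are interested in the distribution over your categorical values.

The best way to generate "new" data is to sample from your existing data using R's sample() function. This gives you values that follow the probability distribution indicated by your existing counts.

As a trivial example, suppose you had a file of voter data for a small town, where the values are the voters' political affiliations and the counts are the number of voters with each affiliation: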

affils <- as.factor(c('democrat','republican','independent'))
counts <- c(552,431,27)
## Simulate 20 new voters, sampling from affiliation distribution
new.voters <- sample(affils,20, replace=TRUE,prob=counts)
new.counts <- table(new.voters)

In practice, you would probably read in your 100M rows of values and counts using R's read.csv() function. Assuming the file has a header line of "values\tcounts", the code might look something like this:

dat <- read.csv('values-counts.txt',sep="\t",colClasses=c('factor','numeric'))
new.dat <- sample(dat$values,100,replace=TRUE,prob=dat$counts)

One caveat: as you may know, R keeps all of its objects in memory, so make sure you have enough free memory for 100M rows of data (storing character strings as factors will help reduce the footprint).

+1

Source: https://habr.com/ru/post/1710690/

