Using a histogram as an input to R

This is admittedly a very simple question that I simply cannot find the answer to.

In R, I have a file with two columns: the first holds the categorical names, and the second holds the count for each category. With a small dataset, I would use the 'reshape' package and its 'untable' function to expand this into a single column of observations and do the analysis that way. The question is how to handle this with a large dataset.
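For reference, the small-data approach looks roughly like this (a minimal sketch; it assumes the untable function from the reshape package, which repeats each row of a data frame by a given count):

    library(reshape)  # provides untable()

    dat <- data.frame(Cat = c("A", "B", "C"), Count = c(5, 7, 1))

    # untable() repeats each row Count times, giving one row per observation:
    expanded <- untable(dat[, "Cat", drop = FALSE], num = dat$Count)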

In this case, my data is huge and that approach just won't work.

My question is: how can I tell R to use something like the following as distribution data:

    Cat Count
    A       5
    B       7
    C       1

That is, I give R a histogram as input, and R takes it to mean that there are 5 of A, 7 of B, and 1 of C when calculating other statistics about the data.

The desired effect is for R to understand that the input data is equivalent to the following:

a a a a a b b b b b b b c

With data of a reasonable size I can do this expansion myself, but what do you do when the data is very large?

Edit

The total of all counts is 262,916,849.

In terms of what it will be used for:

It is count data, and I am trying to understand the relationship between it and other data. I need to be able to run linear regressions and mixed models on it.
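(An aside not from the original thread: for plain linear regression, lm has a weights argument, so frequency data can sometimes be fit without any expansion. A hedged sketch with made-up data:)

    # Hypothetical data: each (x, y) row was observed Count times.
    dat <- data.frame(x = c(1, 2, 3), y = c(2.1, 3.9, 6.2), Count = c(5, 7, 1))

    # Frequency-weighted fit: coefficient estimates match the fully expanded
    # data, but residual degrees of freedom are based on 3 rows rather than
    # 13, so standard errors need care.
    fit <- lm(y ~ x, data = dat, weights = Count)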

+4
4 answers

I think you are asking how to convert a data frame of categories and counts into a single observation vector in which the categories are repeated. Here is one way:

    dat <- data.frame(Cat=LETTERS[1:3], Count=c(5,7,1))
    dat
    #  Cat Count
    #1   A     5
    #2   B     7
    #3   C     1
    rep.int(dat$Cat, times=dat$Count)
    # [1] A A A A A B B B B B B B C
    #Levels: A B C
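A related idiom (my addition, not part of the answer): if the table carries more columns than just the category, indexing by repeated row numbers expands whole rows at once:

    # Repeat each row of dat according to its Count, keeping all columns:
    expanded <- dat[rep(seq_len(nrow(dat)), times = dat$Count), ]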
+7

To follow up on @Blue Magister's excellent answer, here is a 100,000-row histogram with counts totaling 551,245,193:

    set.seed(42)
    Cat <- sapply(rep(10, 100000), function(x) {
      paste(sample(LETTERS, x, replace=TRUE), collapse='')
    })
    dat <- data.frame(Cat, Count=sample(1000:10000, length(Cat), replace=TRUE))
    > head(dat)
             Cat Count
    1 XYHVQNTDRS  5154
    2 LSYGMYZDMO  4724
    3 XDZYCNKXLV  8691
    4 TVKRAVAFXP  2429
    5 JLAZLYXQZQ  5704
    6 IJKUBTREGN  4635

This is a fairly large dataset by my standards, and the operation Blue Magister describes is very quick:

    > system.time(x <- rep(dat$Cat, times=dat$Count))
       user  system elapsed
       4.48    1.95    6.42

The operation uses about 6 GB of RAM to complete.
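If even the expanded vector will not fit in memory, one workaround (my suggestion, not from this answer) is to draw a large weighted sample from the histogram instead of expanding it in full:

    # Draw 1e6 observations from the binned distribution without
    # materializing all 551 million elements:
    x.sample <- sample(dat$Cat, size = 1e6, replace = TRUE, prob = dat$Count)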

+4

It depends on what statistics you are trying to calculate. The xtabs function will build contingency tables for you directly from the supplied counts. The Hmisc package has functions such as wtd.mean, which takes a vector of weights for computing a mean (and related functions for standard deviation, quantiles, etc.). The biglm package can be used to feed in chunks of the dataset at a time and analyze them incrementally. There are probably other packages that will handle frequency data, but which is best depends on the questions you are trying to answer.
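A minimal sketch of the weighted approach (the val column is an illustrative numeric coding, not from the question):

    library(Hmisc)  # wtd.mean, wtd.var, wtd.quantile

    dat <- data.frame(Cat = c("A", "B", "C"), val = c(1, 2, 3), Count = c(5, 7, 1))

    # xtabs rebuilds a contingency table directly from the counts:
    xtabs(Count ~ Cat, data = dat)

    # Weighted statistics use the counts as weights; nothing is expanded:
    wtd.mean(dat$val, weights = dat$Count)
    wtd.quantile(dat$val, weights = dat$Count, probs = c(0.25, 0.5, 0.75))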

+2

The existing answers all amount to expanding the binned dataset into a full distribution and then using R's histogram functions, which is memory-inefficient and will not scale for the very large datasets the original poster asks about. The HistogramTools CRAN package includes a PreBinnedHistogram function that takes arguments for breaks and counts and creates a Histogram object in R without massively expanding the dataset.

For example, if the dataset has 3 buckets with 5, 7, and 1 elements, all the other solutions posted here so far expand it into a list of 13 elements first and then create the histogram. PreBinnedHistogram, by contrast, creates the histogram directly from the 3-element input lists without building a much larger intermediate vector in memory.

 big.histogram <- PreBinnedHistogram(my.data$breaks, my.data$counts) 
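Since the returned object should behave like a standard R histogram object (an assumption based on the package's design, so verify against its documentation), the usual generics ought to apply:

    # Assuming big.histogram behaves like a base-R histogram object:
    plot(big.histogram)    # plot without ever expanding the data
    big.histogram$counts   # the original bin counts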
0

Source: https://habr.com/ru/post/1433637/

