Criteria for deciding which character columns should be converted to factors

I am working through the book “Analyzing Baseball Data with R” by Marchi and Albert, and I have run into a problem that they do not address.

Many of the data sets that I need to import are quite large (although not really "big" in the sense of "big data"). For example, the Retrosheet Game Logs have one csv file per year starting in 1871, where each file has a row for every game played that year and 161 columns. When I read one into a data frame with read.csv() , using the default setting for stringsAsFactors , 75 of the 161 columns become factors. Some of these columns are conceptually factors (such as the one containing "D" or "N" for day or night games), but others are probably better left as strings (many of the columns contain the names of starting pitchers, closers, etc.). I know how to convert columns from factors to strings and vice versa, but I don't want to scan through 161 columns and make an explicit decision for 75 of them.
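For concreteness, this is roughly how I read one of these files in (the file name is just what the unzipped 2016 game log happens to be called on my machine; the files have no header row, so read.csv() auto-names the columns V1 through V161):

    # Sketch: read the 2016 Retrosheet game log. "GL2016.TXT" is an assumed
    # local file name; header = FALSE because the file has no header row,
    # so the columns get the default names V1..V161.
    GL2016 <- read.csv("GL2016.TXT", header = FALSE)

    # With the pre-R-4.0 default stringsAsFactors = TRUE, every character
    # column becomes a factor; count how many:
    sum(sapply(GL2016, is.factor))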

What makes this important to me is that conceptually small subsets of these game logs turn out to be surprisingly large, because they carry complete information about the factor levels. For example, with GL2016 being the data frame obtained by downloading, unzipping, and reading in the 2016 file, object.size(GL2016) is about 2.8 MB, and when I use:

 df <- with(GL2016,GL2016[V7 == "CLE" & V13 == "D",]) 

to extract the home day games played by the Cleveland Indians in 2016, I get a df with 26 rows. 26/2428 (where 2428 is the number of rows in the full data frame) is just over 1%, but object.size(df) is about 1.3 MB, which is far more than 1% of the size of GL2016 .
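One way to see where the space goes is to look at how many levels each factor column of the subset still carries (a quick sketch; the exact counts will vary):

    # Each factor column of the 26-row subset still stores every level from
    # the full 2016 season (pitcher names, umpire names, ...), which is why
    # object.size(df) is so large relative to the data it actually contains.
    sapply(Filter(is.factor, df), nlevels)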

I came up with an ad-hoc solution. First I defined a function:

 big.factor <- function(v,k){is.factor(v) && length(levels(v)) > k} 

And then used mutate_if from dplyr , for example:

 GL2016 %>% mutate_if(function(v){big.factor(v,30)},as.character) -> GL2016 

30 is the number of teams in MLB, and I somewhat arbitrarily decided that any factor with more than 30 levels should probably be treated as a string.

After running this code, the number of factor variables dropped from 75 to 12. It works in the sense that, although GL2016 is now about 3.2 MB (a little larger than before), if I subset the data frame to pull out Cleveland's home day games, the resulting data frame is only about 0.1 MB.
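As a quick check (a sketch; the sizes are approximate), the conversion and its effect on the subset can be verified like this:

    # Count the factor columns remaining after the mutate_if() conversion
    sum(sapply(GL2016, is.factor))    # 12, down from 75

    # Redo the Cleveland home day-game subset and look at its size
    df <- with(GL2016, GL2016[V7 == "CLE" & V13 == "D", ])
    object.size(df)                   # about 0.1 MB instead of 1.3 MB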

Questions:

1) What criteria (hopefully less ad hoc than the one I used above) are relevant for deciding which character columns should be converted to factors when importing a large data set?

2) I am aware of the memory cost of converting all character data into factors, but am I incurring any hidden costs (say, during later processing) when I convert most of those factors back to strings?

1 answer

Essentially, I think you need to do the following:

 df <- with(GL2016, GL2016[V7 == "CLE" & V13 == "D", ])
 df <- droplevels(df)

The droplevels function removes all unused factor levels and thus greatly reduces the size of df .
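For example (a sketch using the same subset as in the question; sizes are approximate):

    # The 26-row subset still carries every factor level from the full
    # 2016 season, so it is about 1.3 MB.
    df <- with(GL2016, GL2016[V7 == "CLE" & V13 == "D", ])
    format(object.size(df), units = "MB")

    # Dropping the unused levels shrinks it dramatically.
    df <- droplevels(df)
    format(object.size(df), units = "MB")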
