Efficient way to filter low-frequency data in a data frame in R

I have a data.frame with multiple columns and I want to filter out low-frequency data according to a combination of variables. An example would be a sex variable with male/female and a cholesterol variable with high/low (in the sample below the second variable is called Age). My data frame would then look like this:

set.seed(123)
Sex = sample(c('Male','Female'),size = 20,replace = TRUE)
Age = sample(c('Low','High'),size = 20,replace = TRUE)
Index = 1:20
df = data.frame(index = Index,Sex=Sex,Age=Age)
df


  index    Sex  Age
1      1   Male High
2      2 Female High
3      3   Male High
4      4 Female High
5      5 Female High
6      6   Male High
7      7 Female High
8      8 Female High
9      9 Female  Low
10    10   Male  Low
11    11 Female High
12    12   Male High
13    13 Female High
14    14 Female High
15    15   Male  Low
16    16 Female  Low
17    17   Male High
18    18   Male  Low
19    19   Male  Low
20    20 Female  Low

Now I want to keep only the Sex/Age combinations whose frequency is above 3:

table(df[,2:3])
        Age
Sex      High Low
  Female    8   3
  Male      5   4

In other words, I want to keep the indices for Female/High, Male/High and Male/Low.

Please note that 1) my real data frame has several variables (unlike the example above), 2) I don't want to use third-party R packages, and 3) I want it to be fast.


In base R:

lvls <- interaction(df$Sex, df$Age)
counts <- table(lvls)
df[lvls %in% names(counts)[counts > 3], ]

#   index    Sex  Age
#1      1   Male High
#2      2 Female High
#3      3   Male High
#4      4 Female High
#5      5 Female High
#6      6   Male High
#7      7 Female High
#8      8 Female High
#10    10   Male  Low
#11    11 Female High
#12    12   Male High
#13    13 Female High
#14    14 Female High
#15    15   Male  Low
#17    17   Male High
#18    18   Male  Low
#19    19   Male  Low

With more variables, you can store their names in a vector:

vars <- c("Age", "Sex") # add more
lvls <- interaction(df[, vars])
counts <- table(lvls)
df[lvls %in% names(counts)[counts > 3], ]
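The same idea can be wrapped in a small helper so that the grouping columns and the threshold become arguments. This is only a sketch: the name keep_frequent and its parameters are hypothetical and not part of the answer above, and the sample data is rebuilt by hand to match the data frame printed in the question, so the result does not depend on the random number generator.

```r
# Rebuild the question's sample data explicitly (no random sampling),
# matching the printed data frame row by row.
sex_code <- strsplit("MFMFFMFFFMFMFFMFMMMF", "")[[1]]
age_code <- strsplit("HHHHHHHHLLHHHHLLHLLL", "")[[1]]
df <- data.frame(
  index = 1:20,
  Sex   = ifelse(sex_code == "M", "Male", "Female"),
  Age   = ifelse(age_code == "H", "High", "Low")
)

# Hypothetical helper: keep the rows whose combination of `vars`
# occurs more than `min_count` times.
keep_frequent <- function(data, vars, min_count) {
  # drop = FALSE keeps a data frame even when `vars` has length 1
  lvls   <- interaction(data[, vars, drop = FALSE], drop = TRUE)
  counts <- table(lvls)
  data[lvls %in% names(counts)[counts > min_count], , drop = FALSE]
}

res <- keep_frequent(df, c("Sex", "Age"), 3)
nrow(res)  # 17: only the three Female/Low rows are dropped
```

Passing drop = FALSE when selecting the columns means the helper should also work with a single grouping variable, for example keep_frequent(df, "Sex", 9).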

Another base R option, using ave:

subset(df, ave(as.integer(factor(Sex)), Sex, Age, FUN = "length") > 3)
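A slightly more direct variant of the ave approach uses a plain row counter as the vector to be grouped, instead of converting Sex to integer codes. A minimal sketch, assuming the sample data from the question (rebuilt here by hand so the example is deterministic); the name group_size is only illustrative:

```r
# Rebuild the question's sample data explicitly (no random sampling).
sex_code <- strsplit("MFMFFMFFFMFMFFMFMMMF", "")[[1]]
age_code <- strsplit("HHHHHHHHLLHHHHLLHLLL", "")[[1]]
df <- data.frame(
  index = 1:20,
  Sex   = ifelse(sex_code == "M", "Male", "Female"),
  Age   = ifelse(age_code == "H", "High", "Low")
)

# ave() returns, for every row, FUN applied to that row's group;
# with FUN = length this is the size of the row's Sex/Age group.
group_size <- ave(seq_len(nrow(df)), df$Sex, df$Age, FUN = length)

# Keep the rows that belong to a group with more than 3 members.
res <- df[group_size > 3, ]
```

Because ave returns one value per input row, the original row order is preserved and no merge is needed.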

Another base R solution:

set.seed(123)
Sex = sample(c('Male','Female'),size = 20,replace = TRUE)
Age = sample(c('Low','High'),size = 20,replace = TRUE)
Index = 1:20
df = data.frame(index = Index,Sex=Sex,Age=Age)
df

merge(
    df
    , aggregate(rep(1, nrow(df)), by = df[,c("Sex", "Age")], sum)
    , by = c("Sex", "Age")
)

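The aggregate call sums a column of 1s within each Sex/Age group, so the merged data frame gains a count column (named x) that can then be used for filtering.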
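The merge above only attaches the group count to each row; the filtering step itself could look like the following. This is a sketch rather than the answerer's code: the count column is named n here for readability, and the sample data is rebuilt by hand to match the question.

```r
# Rebuild the question's sample data explicitly (no random sampling).
sex_code <- strsplit("MFMFFMFFFMFMFFMFMMMF", "")[[1]]
age_code <- strsplit("HHHHHHHHLLHHHHLLHLLL", "")[[1]]
df <- data.frame(
  index = 1:20,
  Sex   = ifelse(sex_code == "M", "Male", "Female"),
  Age   = ifelse(age_code == "H", "High", "Low")
)

# Count rows per Sex/Age group by summing a column of 1s.
counts <- aggregate(list(n = rep(1, nrow(df))),
                    by = df[, c("Sex", "Age")], FUN = sum)

# Attach the group count to every row, then keep groups with n > 3.
merged <- merge(df, counts, by = c("Sex", "Age"))
res    <- merged[merged$n > 3, names(df)]

# merge() sorts by the key columns, so restore the original order.
res <- res[order(res$index), ]
```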


We can do this with data.table, which should also be efficient.

library(data.table)
setDT(df)[, .SD[.N > 3], .(Sex, Age)]

Or using .I

setDT(df)[df[, .I[.N >3], .(Sex, Age)]$V1]

A dplyr solution:

library(dplyr)
df %>% 
  group_by(Sex, Age) %>% 
  filter(n() > 3) 

Although this is not a base R solution as the OP requested, I thought it could be useful for future users who do not have such restrictions.

vars     <- c("Sex","Age")
max_freq <- 3
new_df   <- merge(df, subset(as.data.frame(table(df[,vars])),Freq>max_freq)[1:2])

new_df
#       Sex  Age index
# 1  Female High     2
# 2  Female High     7
# 3  Female High    14
# 4  Female High    11
# 5  Female High     5
# 6  Female High     4
# 7  Female High    13
# 8  Female High     8
# 9    Male High     6
# 10   Male High     3
# 11   Male High     1
# 12   Male High    17
# 13   Male High    12
# 14   Male  Low    10
# 15   Male  Low    15
# 16   Male  Low    18
# 17   Male  Low    19
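As the output shows, merge returns the rows sorted by the key columns and moves Sex and Age to the front. If the original row and column order matters, it can be restored afterwards. A minimal sketch, with the question's sample data rebuilt by hand so the example is deterministic:

```r
# Rebuild the question's sample data explicitly (no random sampling).
sex_code <- strsplit("MFMFFMFFFMFMFFMFMMMF", "")[[1]]
age_code <- strsplit("HHHHHHHHLLHHHHLLHLLL", "")[[1]]
df <- data.frame(
  index = 1:20,
  Sex   = ifelse(sex_code == "M", "Male", "Female"),
  Age   = ifelse(age_code == "H", "High", "Low")
)

vars     <- c("Sex", "Age")
max_freq <- 3

# Same table + merge filter as above.
new_df <- merge(df, subset(as.data.frame(table(df[, vars])), Freq > max_freq)[1:2])

# Restore the original row order (by index) and column order.
new_df <- new_df[order(new_df$index), names(df)]
rownames(new_df) <- NULL
```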

Source: https://habr.com/ru/post/1693319/