Subsetting in R without subset(): how to use [ more concisely to prevent typos?

When working with data frames, you often need a subset. However, using the subset function is discouraged. The problem with the following code is that the data frame name appears twice. If you copy, paste, and modify the code, it's easy to forget to change the second mention of adf, which can be a disaster.

    adf=data.frame(a=1:10,b=11:20)
    print(adf[which(adf$a>5),])    ##alas, adf mentioned twice
    print(with(adf,adf[{a>5},]))   ##alas, adf mentioned twice
    print(subset(adf,a>5))         ##alas, not supposed to use subset
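To make the copy-paste risk concrete, here is a minimal sketch. The second data frame bdf is hypothetical and not part of the original question; it only illustrates that the stale mention of adf produces no error, just the wrong rows.

```r
# adf is from the question; bdf is a hypothetical second data frame
adf <- data.frame(a = 1:10, b = 11:20)
bdf <- data.frame(a = 10:1, b = 11:20)

# Intended: rows of bdf where bdf$a > 5
intended <- bdf[which(bdf$a > 5), ]

# Copy-pasted from the adf version, with only the first mention renamed
pasted <- bdf[which(adf$a > 5), ]

print(intended$a)   # values of a that were actually wanted
print(pasted$a)     # values selected by the stale adf condition
```

Both lines run silently, which is why the mistake is easy to miss.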

Is there a way to write the above without mentioning adf twice? Unfortunately, with with() or within(), I can't seem to access adf as a whole.

The subset(...) function would make the task easy, but its documentation warns against using it:

This is a convenience function intended for use interactively. For programming it is better to use the standard subsetting functions like [, and in particular the non-standard evaluation of argument subset can have unanticipated consequences.
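One way the non-standard evaluation mentioned in that warning can bite is name shadowing. The sketch below is an illustration, not something from the original question: it assumes a threshold variable that happens to be named a, the same as a column.

```r
adf <- data.frame(a = 1:10, b = 11:20)

# A threshold variable that happens to share its name with a column
a <- 12

# Inside subset(), `a` resolves to the column adf$a, not to the variable
with_subset <- subset(adf, b > a)

# With [, `a` is the variable (12) and the column is spelled adf$a explicitly
with_bracket <- adf[adf$b > a, ]

nrow(with_subset)
nrow(with_bracket)
```

The two calls look equivalent but return different numbers of rows, because subset() looks names up in the data frame first.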

2 answers

After some thought, I wrote a super-simple function called given:

 given=function(.,...) { with(.,...) } 

This way I don't need to repeat the name of the data.frame. In the benchmark below, it also came out roughly 14 times faster than filter(). See below:

    library(microbenchmark)   ##needed for microbenchmark()

    adf=data.frame(a=1:10,b=11:20)
    given=function(.,...) { with(.,...) }
    with(adf,adf[a>5 & b<18,])     ##adf mentioned twice :(
    given(adf,.[a>5 & b<18,])      ##adf mentioned once :)
    dplyr::filter(adf,a>5,b<18)    ##adf mentioned once...
    microbenchmark(with(adf,adf[a>5 & b<18,]),times=1000)
    microbenchmark(given(adf,.[a>5 & b<18,]),times=1000)
    microbenchmark(dplyr::filter(adf,a>5,b<18),times=1000)

Using microbenchmark:

    > adf=data.frame(a=1:10,b=11:20)
    > given=function(.,...) { with(.,...) }
    > with(adf,adf[a>5 & b<18,]) ##adf mentioned twice :(
      a  b
    6 6 16
    7 7 17
    > given(adf,.[a>5 & b<18,]) ##adf mentioned once :)
      a  b
    6 6 16
    7 7 17
    > dplyr::filter(adf,a>5,b<18) ##adf mentioned once...
      a  b
    1 6 16
    2 7 17
    > microbenchmark(with(adf,adf[a>5 & b<18,]),times=1000)
    Unit: microseconds
                                 expr    min     lq     mean median     uq     max neval
     with(adf, adf[a > 5 & b < 18, ]) 47.897 60.441 67.59776 67.284 70.705 361.507  1000
    > microbenchmark(given(adf,.[a>5 & b<18,]),times=1000)
    Unit: microseconds
                                expr    min     lq     mean median    uq     max neval
     given(adf, .[a > 5 & b < 18, ]) 48.277 50.558 54.26993 51.698 56.64 272.556  1000
    > microbenchmark(dplyr::filter(adf,a>5,b<18),times=1000)
    Unit: microseconds
                                  expr     min       lq     mean   median       uq      max neval
     dplyr::filter(adf, a > 5, b < 18) 524.965 581.2245 748.1818 674.7375 889.7025 7341.521  1000

I noticed that given() actually comes out slightly faster than with() in this run, which I attribute to the length of the variable name.

The neat thing about given is that you can do some things inline, without assignment: given(data.frame(a=1:10,b=11:20),.[a>5 & b<18,])
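One caveat that may be worth checking before relying on given() inside other functions: because it forwards to with(), names that are not columns are looked up starting from given()'s own frame rather than from the caller's, so a variable local to the calling function is not found. The sketch below demonstrates this and shows one possible workaround. given_local, f_given and f_local are hypothetical names introduced here for illustration; they are not part of the original answer.

```r
adf <- data.frame(a = 1:10, b = 11:20)

# given() exactly as defined in the answer
given <- function(., ...) { with(., ...) }

# Hypothetical variant: evaluate the expression with `.` and the columns as
# data, and the caller's frame as the enclosing environment
given_local <- function(., expr) {
  eval(substitute(expr), c(list(. = .), as.list(.)), parent.frame())
}

f_given <- function() {
  thr <- 5   # local to f_given; assumes no global variable named thr exists
  tryCatch(given(adf, .[a > thr, ]), error = function(e) NULL)
}

f_local <- function() {
  thr <- 5
  given_local(adf, .[a > thr, ])
}

is.null(f_given())   # TRUE when the lookup of `thr` failed inside given()
nrow(f_local())      # number of rows with a > thr
```

At top level, where every variable is global, the two versions behave the same; the difference only shows up when the call sits inside another function.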


As @akrun states, I would use the dplyr filter function:

    require("dplyr")
    new <- filter(adf, a > 5)
    new

In practice, I don't find the subsetting notation ( [ ] ) problematic, because when I copy a block of code, I use find and replace within RStudio to replace all mentions of the data frame in the selected code. Instead, I use dplyr because its notation and syntax are easier to follow for new users (and for me!), and because dplyr functions "do well."


Source: https://habr.com/ru/post/986447/

