Using R to interpret a symbolic formula for external use

In R, the formula object is symbolic, and it is quite difficult to parse. However, I need to parse such a formula into an explicit set of labels for use outside of R.

(1)

Let f be a model formula with no response specified, e.g. ~V1 + V2 + V3. I tried:

    t <- terms(f)
    attr(t, "term.labels")

However, this does not give an explicit result when some of the variables in f are categorical. For example, let V1 be a categorical variable with two categories (i.e. a logical) and V2 a double.

Then the model specified by ~V1:V2 should have 2 parameters: "Intercept" and "V1yes:V2". Meanwhile, the model specified by ~V1:V2 - 1 should have the parameters "V1no:V2" and "V1yes:V2". However, without a way to tell terms() which variables are categorical (and how many categories they have), it cannot work this out. Instead, it simply has V1:V2 in its "term.labels", which is meaningless given that V1 is categorical.

(2)

Using model.matrix, on the other hand, is an easy way to get exactly what I want. The problem is that it requires a data argument, which is bad for me, because I only need an explicit interpretation of the symbolic formula for use outside R. Getting it this way would be (comparatively) time-consuming, since R has to read the data from an external source, when all it really needs to know for the formula is which variables are categorical (and how many categories they have) and which are double.

Is it possible to use "model.matrix" only with data types and not actual data? If not, what else is a viable solution?

1 answer

Well, only in the context of the data can you determine whether a given variable is a factor or numeric. So you cannot do this without a data argument. But all you need is the structure, not the data itself, so you can use a 0-row data frame with columns of the correct types.

    f <- ~ V1:V2
    V1numeric <- data.frame(V1=numeric(0), V2=numeric(0))
    V1factor <- data.frame(V1=factor(c(), levels=c("no","yes")), V2=numeric(0))

Looking at the two data frames:

    > V1numeric
    [1] V1 V2
    <0 rows> (or 0-length row.names)
    > str(V1numeric)
    'data.frame':   0 obs. of  2 variables:
     $ V1: num
     $ V2: num
    > V1factor
    [1] V1 V2
    <0 rows> (or 0-length row.names)
    > str(V1factor)
    'data.frame':   0 obs. of  2 variables:
     $ V1: Factor w/ 2 levels "no","yes":
     $ V2: num

Using model.matrix with these:

    > model.matrix(f, data=V1numeric)
         (Intercept) V1:V2
    attr(,"assign")
    [1] 0 1
    > model.matrix(f, data=V1factor)
         (Intercept) V1no:V2 V1yes:V2
    attr(,"assign")
    [1] 0 1 1
    attr(,"contrasts")
    attr(,"contrasts")$V1
    [1] "contr.treatment"
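If all you want is the labels to hand to something outside R, the column names of that matrix should be enough. Here is a minimal sketch, assuming a small wrapper I am calling `formula_labels()` (a hypothetical name, not something that exists in R), which takes the formula and a 0-row data frame describing the column types:

```r
# Hypothetical helper (not part of base R): return the design-matrix column
# labels for a one-sided formula, given only the column types.
formula_labels <- function(f, types) {
  # 'types' is a 0-row data frame; model.matrix() only needs its structure
  colnames(model.matrix(f, data = types))
}

# Same structure as V1factor in the answer: a two-level factor and a double
V1factor <- data.frame(V1 = factor(character(0), levels = c("no", "yes")),
                       V2 = numeric(0))

with_intercept <- formula_labels(~ V1:V2, V1factor)
no_intercept   <- formula_labels(~ V1:V2 - 1, V1factor)

print(with_intercept)  # "(Intercept)" "V1no:V2" "V1yes:V2"
print(no_intercept)
```

Note that with the default treatment contrasts the intercept version keeps both `V1no:V2` and `V1yes:V2`, because there is no V1 main effect in the formula; that is three columns, not two.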

If you have a real data set, it is easy to get a 0-row data.frame from it that keeps the column information. Just subset it with FALSE. That way you do not need to build the data.frame by hand if you already have one with the right properties.

    > str(mtcars)
    'data.frame':   32 obs. of  11 variables:
     $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
     $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
     $ disp: num  160 160 108 258 360 ...
     $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
     $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
     $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
     $ qsec: num  16.5 17 18.6 19.4 17 ...
     $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
     $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
     $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
     $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
    > str(mtcars[FALSE,])
    'data.frame':   0 obs. of  11 variables:
     $ mpg : num
     $ cyl : num
     $ disp: num
     $ hp  : num
     $ drat: num
     $ wt  : num
     $ qsec: num
     $ vs  : num
     $ am  : num
     $ gear: num
     $ carb: num
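Every column of mtcars is numeric, so for a categorical variable to show up in the labels it has to be a factor before you drop the rows. A rough sketch of that, where treating `am` as a factor and the level labels "automatic" and "manual" are my own choices for illustration:

```r
# Recode 'am' (0/1) as a two-level factor, then keep only the structure.
# The level labels are an assumption made for this example.
types <- transform(mtcars,
                   am = factor(am, labels = c("automatic", "manual")))[FALSE, ]

# Zero rows, but the factor levels survive the subsetting
str(types$am)

# Labels of the design matrix, obtained without any actual observations
design_labels <- colnames(model.matrix(~ am:wt, data = types))
print(design_labels)
```

The subsetting with FALSE keeps the factor levels, which is all model.matrix needs to name the columns.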

Source: https://habr.com/ru/post/945181/

