R Language: How to print / view summary statistics for a subset of samples?

I am new to statistical programming in R and have a few questions that I could not find answers to on the Internet. My data frame is called "eitc" in the code below.

1) Once I have loaded the data into a data frame, I would like to see summary statistics. I have used these functions:

    library(foreign)  # provides read.dta
    eitc <- read.dta(file="/Users/Documents/eitc.dta")
    # for sample mean, min, max, etc.
    summary(eitc)
    sapply(eitc, mean, na.rm=TRUE)

How do I get summary statistics for my data frame when certain conditions are met? For example, I would like to see summary statistics for all variables when the variable "children" is greater than or equal to 1. Equivalent Stata code:

 summarize if children >= 1 

2) How can I compute a particular statistic when certain conditions are met? For example, I want to find the mean of the variable "work" when the variable "post93" is zero and the variable "anykids" is 1. Equivalent Stata code:

 mean work if post93==0 & anykids==1 

3) Ideally, when I run the summary statistics above, I would like to know how many observations were included in the calculation, i.e. how many met the criteria.

4) When I read the data into my data frame, it would be nice to see how many cases are included in the data set (and maybe how many rows have missing values or "NA" in them).

5) In addition, I create dummy variables using the following code. Is this the right way to do this or is there a more efficient route?

    post93.dummy <- as.numeric(eitc$year>1993)
    eitc=cbind(eitc,post93.dummy)
+4
4 answers

Many of your questions are answered by subset(), for example:

    summary(subset(eitc, post93 == 0 & anykids == 1, select=work))
    nrow(subset(eitc, post93 == 0 & anykids == 1, select=work))  # for number of obs.

The documentation (?subset) has good examples.

You do not need cbind to attach the dummy variable. Just assign it directly:

 eitc$post93.dummy <- as.numeric(eitc$year>1993) 
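For something that runs as-is, here is a minimal sketch of both points on the built-in mtcars data. The eitc file is not available here, so am, cyl and mpg are only stand-ins for the question's variables, and high.mpg.dummy is a made-up column name:

```r
# mtcars stands in for eitc; am, cyl and mpg are illustrative stand-ins
sel <- subset(mtcars, am == 1 & cyl == 4, select = mpg)
summary(sel)   # summary statistics for the rows that met the condition
nrow(sel)      # number of observations that met the condition

# Add a 0/1 dummy column by direct assignment, without cbind
cars <- mtcars
cars$high.mpg.dummy <- as.numeric(cars$mpg > 20)
table(cars$high.mpg.dummy)
```

Storing the subset once and reusing it avoids repeating the condition for summary() and nrow().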
+10

I will use the mtcars data available in the datasets package. See ?mtcars.

Ad 1. You can see a summary of mtcars where gear is greater than 3:

    summary(mtcars[mtcars$gear > 3, ])
    ## or by using Tukey five number summary
    sapply(mtcars[mtcars$gear > 3, ], fivenum)

Ad 2. Use with():

 with(mtcars, mean(hp[gear > 3 & mpg > 20])) 

Ad 3. Same as above (but use length):

    with(mtcars, length(hp[gear > 3 & mpg > 20]))
    ## or
    sapply(mtcars[mtcars$gear > 3, ], length)  ## which is trivial when there are no NA's
    ## length() has no na.rm argument, so count non-missing values explicitly;
    ## this one is good when there are NA's
    sapply(mtcars[mtcars$gear > 3, ], function(x) sum(!is.na(x)))
    nrow(mtcars[mtcars$gear > 3, ])
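If you want the statistic and the number of observations behind it in one go, one possible sketch is to store the condition as a logical vector first. The names keep, n.obs and mean.hp are just illustrative:

```r
# Logical index: TRUE for the rows that meet the condition
keep <- with(mtcars, gear > 3 & mpg > 20)

n.obs   <- sum(keep)               # number of rows meeting the condition
mean.hp <- mean(mtcars$hp[keep])   # mean of hp over those rows
c(n = n.obs, mean = mean.hp)
```

If the columns used in the condition can contain NA, sum(keep, na.rm = TRUE) and which(keep) for the indexing are the safer choices.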

Ad 4. See the previous point, but to find out

how many rows have missing values or "NA" in them

do something like this:

    apply(dtf, 1, function(x) sum(is.na(x)))  ## number of NA's in each row
    sum(!complete.cases(dtf))                 ## number of rows with at least one NA
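For a version that runs as-is, the built-in airquality data (which contains missing values) can stand in for dtf. This is only a sketch, and the variable names are made up:

```r
# airquality ships with R and contains NA's in some columns
n.rows       <- nrow(airquality)                   # total number of cases
n.incomplete <- sum(!complete.cases(airquality))   # rows with at least one NA
na.per.col   <- colSums(is.na(airquality))         # NA count per variable

n.rows
n.incomplete
na.per.col
```

colSums(is.na(...)) is an addition to what was asked: it shows which variables the missing values come from.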

Ad 5. That is not a dummy variable; it is a kind of subset of the source data, bound on by columns. What are you trying to achieve anyway?

Please be concise. One question per post, please!

+6

I would advise you to look at the plyr package for creating summaries. Here is some quick code (not run):

    library(plyr)
    # Generate a new factor with 5 levels based on the numeric value of children
    eitc$childfac <- cut(eitc$children, 5)
    # Generate mean and sd of the variables foo and bar based on that factor
    # (sd() is the base R standard deviation function; there is no stdev())
    ddply(eitc, .(childfac), function(df) {
      data.frame(meanfoo = mean(df$foo), sdfoo = sd(df$foo),
                 meanbar = mean(df$bar), sdbar = sd(df$bar))
    })

You can also look at the Hmisc and psych packages for more descriptive statistics routines. (For more information, see Quick-R.)
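If you would rather not install anything, a similar grouped summary can be sketched in base R with aggregate(). Note that this swaps in a different function from the ddply call above, and that mtcars with cyl as the grouping variable is purely a stand-in for eitc and childfac:

```r
# Mean and sd of mpg and hp within each level of cyl (base R only)
grp <- aggregate(cbind(mpg, hp) ~ cyl, data = mtcars,
                 FUN = function(x) c(mean = mean(x), sd = sd(x)))
grp

# Each summarised variable comes back as a matrix column, e.g.
grp$mpg[, "mean"]
```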

+2

Here is a quick way to display summary statistics for a subset of your data using data.table.

    library(data.table)
    dt <- data.table(mtcars)
    var.names <- c("cyl", "disp", "hp")
    dt[mpg > 20,
       list(name=var.names, N=.N, mean=lapply(.SD, mean), sd=lapply(.SD, sd)),
       .SDcols=var.names]

You can use model.matrix to create dummy variables.
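A minimal sketch of the model.matrix approach, assuming you want one 0/1 column per category. Here gear from mtcars is used as an illustrative categorical variable:

```r
# One indicator column per level of gear; "- 1" drops the intercept
# so that every level gets its own column
dummies <- model.matrix(~ factor(gear) - 1, data = mtcars)
head(dummies)
colSums(dummies)   # number of cars in each gear category
```

Without the "- 1", the first level is absorbed into the intercept column, which is usually what you want when the dummies go straight into a regression.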

0

Source: https://habr.com/ru/post/1337573/

