R Language: How to print / view summary statistics for a subset of samples?

Question

I am new to statistical programming in R and have a few beginner questions that I could not find answers to online. My data frame is called "eitc" in the code below.

1) As soon as I load the data into a data frame, I would like to see summary statistics. I used the functions:

    library(foreign)  # provides read.dta
    eitc <- read.dta(file="/Users/Documents/eitc.dta")
    summary(eitc)
    sapply(eitc, mean, na.rm=TRUE)  # for sample mean, min, max, etc.

How can I get summary statistics for my data frame when certain conditions are met? For example, I would like to see summary statistics for all variables when the variable "children" is greater than or equal to 1. Equivalent Stata code:

 summarize if children >= 1

2) How can I compute a particular statistic subject to certain conditions? For example, I want to find the mean of the variable "work" when the variable "post93" is zero and the variable "anykids" is 1. Equivalent Stata code:

 mean work if post93==0 & anykids==1

3) Ideally, when I run the summary statistics above, I would like to know how many observations were included in the calculation, i.e. how many met the criteria.

4) When I read the data into a data frame, it would be nice to see how many cases the data set contains (and maybe how many rows have missing values or "NA" in them).

5) In addition, I create dummy variables using the following code. Is this the right way to do this or is there a more efficient route?

    post93.dummy <- as.numeric(eitc$year>1993)
    eitc=cbind(eitc,post93.dummy)

r statistics stata

baha-kev Jan 29 '11 at 8:11

4 answers

I will use the mtcars data available in the datasets package. See ?mtcars.

Ad 1. You can see the summary of mtcars when gear is greater than 3:

    summary(mtcars[mtcars$gear > 3, ])
    ## or by using Tukey five number summary
    sapply(mtcars[mtcars$gear > 3, ], fivenum)

Ad 2. Use with:

 with(mtcars, mean(hp[gear > 3 & mpg > 20]))

Ad 3. Same as above (but use length):

    with(mtcars, length(hp[gear > 3 & mpg > 20]))
    ## or
    sapply(mtcars[mtcars$gear > 3, ], length)  ## which is fine when there are no NA's
    ## length() has no na.rm argument, so count the non-NA values when there are NA's
    sapply(mtcars[mtcars$gear > 3, ], function(x) sum(!is.na(x)))
    nrow(mtcars[mtcars$gear > 3, ])

Ad 4. See the previous point, but to find out

how many rows have missing values or "NA" in them

do something like this:

    ## number of NA's in each row of a data frame dtf
    apply(dtf, 1, function(x) sum(is.na(x)))
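To make the counts concrete, here is a minimal sketch on the built-in airquality data, which contains missing values. The choice of airquality is an assumption for illustration only (it stands in for the asker's eitc data), and the variable names are made up. Only base R functions are used: is.na, rowSums, colSums and complete.cases.

```r
# Illustration on the built-in airquality data, which contains NAs.
# It is only a stand-in for the asker's data frame.
dtf <- airquality

n_total      <- nrow(dtf)                  # number of cases in the data set
na_per_row   <- rowSums(is.na(dtf))        # NA count for each row
rows_with_na <- sum(!complete.cases(dtf))  # rows with at least one NA
na_per_col   <- colSums(is.na(dtf))        # NA count for each variable

n_total
rows_with_na
na_per_col
```

rowSums(is.na(dtf)) gives the same per-row counts as the apply call, while complete.cases answers the "how many rows have an NA" question directly.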

Ad 5. This is not a dummy variable, it is a kind of subset of source data grouped by columns. What are you trying to achieve anyway?

And please be concise: one question per post!

aL3xa Jan 29 '11 at 10:37

I would advise you to look at the plyr package for creating summaries. Here is some quick code (untested):

    library(plyr)
    # Generate a new factor with 5 levels based on the numeric value of children
    eitc$childfac <- cut(eitc$children, 5)
    # Generate mean and sd of the variables foo and bar (placeholder names) by that factor
    ddply(eitc, .(childfac), function(df) {
      data.frame(meanfoo=mean(df$foo), sdfoo=sd(df$foo),
                 meanbar=mean(df$bar), sdbar=sd(df$bar))
    })

You can also look at the Hmisc and psych packages for more descriptive statistics routines. (See Quick-R for more information.)

PaulHurleyuk Jan 29 '11 at 10:54

Here is a quick way to display summary statistics for a subset of your data using data.table.

    library(data.table)
    dt <- data.table(mtcars)
    var.names <- c("cyl", "disp", "hp")
    dt[mpg > 20,
       list(name=var.names, N=.N, mean=lapply(.SD, mean), sd=lapply(.SD, sd)),
       .SDcols=var.names]

You can use model.matrix to create dummy variables.
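As a rough sketch of the model.matrix idea, the example below builds a small made-up data frame (toy, with hypothetical year and state columns; none of this comes from the asker's eitc data) and creates both a single 0/1 indicator and a full set of dummy columns for a factor.

```r
# Hypothetical toy data standing in for the asker's eitc data frame.
toy <- data.frame(year  = c(1991, 1993, 1994, 1996),
                  state = factor(c("A", "B", "C", "A")))

# A single 0/1 indicator can be created directly from a comparison.
toy$post93 <- as.numeric(toy$year > 1993)

# model.matrix expands a factor into 0/1 columns; "- 1" drops the
# intercept so that every level of the factor gets its own column.
dummies <- model.matrix(~ state - 1, data = toy)

toy
dummies
```

For a single binary condition such as year > 1993, the direct comparison is usually enough; model.matrix is more useful when a factor with several levels has to be expanded.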

pbaylis Nov 07 '16 at 18:54

Michael dunn · Accepted Answer · 2011-01-29T08:51:46+0000

Many of your requirements are met by subset, for example:

    summary(subset(eitc, post93 == 0 & anykids == 1, select=work))
    nrow(subset(eitc, post93 == 0 & anykids == 1, select=work))  # for number of obs.

The documentation ?subset has good examples.
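Since eitc is not available to readers, here is a runnable sketch of the same pattern on the built-in mtcars data. The mapping is an assumption for illustration: am == 0 & cyl == 4 plays the role of post93 == 0 & anykids == 1, and mpg plays the role of work.

```r
# mtcars stands in for eitc; the condition and the selected column
# are illustrative substitutes for the asker's variables.
sel <- subset(mtcars, am == 0 & cyl == 4, select = mpg)

summary(sel)                  # summary statistics for the subset
mean(sel$mpg, na.rm = TRUE)   # analogue of Stata's: mean work if post93==0 & anykids==1
nrow(sel)                     # number of observations that met the criteria
```

Keeping the subset in a variable (sel here) avoids repeating the condition for the summary, the mean and the observation count.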

Creating the dummy variable and binding it with cbind works, but it is not necessary. Just do:

 eitc$post93.dummy <- as.numeric(eitc$year>1993)
