How to import SAS files into R?

I am trying to analyze data from the 2012-2013 NATS survey from this location. There are three files in the zip folder, named 2012-2013 NATS format.sas, formats.sas7bcat and nats2012.sas7bdat. The third file contains the actual data, while the second contains the labels that go with the data; for example, if the “Race” variable in the raw data file has categories 1, 2, 3 and 4, the labels indicate that these categories stand for “Caucasian”, “African American”, “Hispanic” and “Other”. I was able to import the sas7bdat file into R using the sas7bdat package, but when I run cross-tabulations, I cannot see which category each cell represents. For example, if I try to do this:

table(SMOKSTATUS_R, RACEETHNIC) 

I get:

                RACEETHNIC
    SMOKSTATUS_R     1     2     3     4     5     6     7     8     9
               1  4045   455    55     7    63     0   675   393   373
               2  1183   222    38     2    26     0   217   255   154
               3 14480   957   238    14    95     3  1112   950   369
               4 23923  2532  1157    23   147     1  1755  3223   909
               5    81    18     4     0     1     0    11    17     9

As far as I can tell, the only way to get labels into the data is to enter them manually, but there are 240 variables, and besides, the labels already exist in the formats.sas7bcat file. Is there a way to import the format file into R so that the labels can be attached to the variables? This is how it is done in SAS, but I don’t have access to SAS right now. Thanks for the help.

+1
import r sas
Oct 29 '15 at 18:22
2 answers

The formats.sas file needs to be read and parsed into label vectors, which you then apply like any other vector of labels.

If you want to label categorical variables, which seems to be your main concern based on your question, this should be fairly simple. In the file you will see code that looks like this:

    value RACEF
      1 = 'Caucasian'
      2 = 'African-American'
      3 = 'Hispanic'
      4 = 'Other'
    ;

You just need to parse this into a vector.
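As a rough illustration of that parsing step, here is a minimal base R sketch. The helper name `parse_value_block` is hypothetical, and it assumes each pair has the simple form `code = 'label'` with single-quoted labels that contain no embedded quotes; real format files may need more care.

```r
# Hypothetical helper (not from any package): turn one SAS `value` block
# into a named character vector, with codes as names and labels as values.
parse_value_block <- function(block) {
  # pull out every "code = 'label'" pair
  pairs <- regmatches(
    block,
    gregexpr("(-?\\w+)\\s*=\\s*'([^']*)'", block, perl = TRUE)
  )[[1]]
  codes  <- sub("\\s*=.*$", "", pairs)               # text before the "="
  labels <- sub("^[^']*'([^']*)'$", "\\1", pairs)    # text inside the quotes
  setNames(labels, codes)
}

block <- "value RACEF 1 = 'Caucasian' 2 = 'African-American' 3 = 'Hispanic' 4 = 'Other' ;"
racef <- parse_value_block(block)

# Look up labels for raw codes by converting the codes to character names
decoded <- unname(racef[as.character(c(1, 3, 4))])
```

Indexing by `as.character()` keeps the lookup keyed on the code itself rather than on its position, so gaps or negative codes do not shift the labels.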

If you're lucky, the format names will be identical to the column names (possibly with an F appended, as in this example); in that case you can probably work out how to apply them directly.

If this is not the case, you will have to parse the second half of the program. It will consist of lines like the following:

    format
      race RACEF.
      gender SEXF.
      income INCRF.
      ...
    ;

This shows the relationship between each column name and its format name, and thus tells you which vector of labels you should use to label that column.
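A minimal sketch of reading that mapping in base R follows. The helper `parse_format_stmt` and the toy data are hypothetical, and the sketch assumes every column is followed by exactly one format name (SAS also allows several columns to share one format, which this does not handle).

```r
# Hypothetical helper: map column names to format names from a `format` statement.
parse_format_stmt <- function(stmt) {
  body <- sub("(?i)^\\s*format\\s+", "", stmt, perl = TRUE)  # drop the keyword
  body <- sub("\\s*;\\s*$", "", body)                        # drop the trailing ";"
  tokens <- strsplit(trimws(body), "\\s+")[[1]]
  cols <- tokens[c(TRUE, FALSE)]                     # odd tokens: column names
  fmts <- sub("\\.$", "", tokens[c(FALSE, TRUE)])    # even tokens: formats, minus "."
  setNames(fmts, cols)
}

col_fmt <- parse_format_stmt("format race RACEF. gender SEXF. income INCRF. ;")

# Toy label vector and data, for illustration only
formats <- list(RACEF = c("1" = "Caucasian", "2" = "African-American"))
df <- data.frame(race = c(2, 1, 2))

lab <- formats[[col_fmt[["race"]]]]           # label vector for the race column
df$race <- unname(lab[as.character(df$race)])
```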

+1
Oct 29 '15 at 19:02

This should be a one-liner:

    library('haven')
    sas <- read_sas('nats2012.sas7bdat', 'formats.sas7bcat')

    with(sas, table(SMOKSTATUS_R, RACEETHNIC))
    #             RACEETHNIC
    # SMOKSTATUS_R     1     2     3     4     5     6     7     8     9
    #            1  4045   455    55     7    63     0   675   393   373
    #            2  1183   222    38     2    26     0   217   255   154
    #            3 14480   957   238    14    95     3  1112   950   369
    #            4 23923  2532  1157    23   147     1  1755  3223   909
    #            5    81    18     4     0     1     0    11    17     9

    table(names(attr(sas[, 'SMOKSTATUS_R'], 'labels')[sas[, 'SMOKSTATUS_R']]),
          names(attr(sas[, 'RACEETHNIC'], 'labels')[sas[, 'RACEETHNIC']]))
    #                           Amer. Indian, AK Nat. Only, Non-Hispanic
    # Current everyday smoker                                         63
    # Current some days smoker                                        26
    # Former smoker                                                   95
    # Never smoker                                                   147
    # Unknown                                                          1

haven not only reads in the data, it also gives you useful attributes, namely the variable label and the value labels:

    attributes(sas$SMOKSTATUS_R)
    # $label
    # [1] "SMOKER STATUS (4-level)"
    #
    # $class
    # [1] "labelled"
    #
    # $labels
    #  Current everyday smoker Current some days smoker            Former smoker
    #                        1                        2                        3
    #             Never smoker                  Unknown
    #                        4                        5
    #
    # $is_na
    # [1] FALSE FALSE FALSE FALSE FALSE

You can easily wrap this in a function for more general use:

    do_fmt <- function(x, fmt) {
      lbl <- if (!missing(fmt))
        unlist(unname(fmt)) else attr(x, 'labels')

      if (!is.null(lbl))
        tryCatch(names(lbl[match(unlist(x), lbl)]),
                 error = function(e) {
                   message(sprintf('formatting failed for %s', attr(x, 'label')),
                           domain = NA)
                   x
                 }) else x
    }

    table(do_fmt(sas[, 'SMOKSTATUS_R']),
          do_fmt(sas[, 'RACEETHNIC']))
    #                           Amer. Indian, AK Nat. Only, Non-Hispanic
    # Current everyday smoker                                         63
    # Current some days smoker                                        26
    # Former smoker                                                   95
    # Never smoker                                                   147
    # Unknown                                                          1
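The core of that function is a `match()` lookup against the `labels` attribute. The toy sketch below reproduces just that step in base R, without haven or the NATS files; the vector `x` is made-up data standing in for a column, with the label set copied from the `attributes()` output shown earlier.

```r
# Toy stand-in for a labelled column: numeric codes plus a 'labels' attribute
x <- c(4, 3, 3, 1, 9)
attr(x, "labels") <- c("Current everyday smoker"  = 1,
                       "Current some days smoker" = 2,
                       "Former smoker"            = 3,
                       "Never smoker"             = 4,
                       "Unknown"                  = 5)

lbl <- attr(x, "labels")
# match() gives the position of each code within the label values;
# codes with no label (9 here) come back as NA
decoded <- names(lbl)[match(x, lbl)]
```

A code that has no entry in the label vector silently becomes `NA`, so it may be worth checking `anyNA()` on the result before tabulating.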

And apply it to the entire dataset:

    sas[] <- lapply(sas, do_fmt)
    sas$SMOKSTATUS_R[1:4]
    # [1] "Never smoker"  "Former smoker" "Former smoker" "Never smoker"

Sometimes this does not work, though, as shown below. Something seems to go wrong in the haven package, since the label values for the negative codes come back garbled:

    attr(sas$SMOKTYPE, 'labels')
    #       INAPPLICABLE            REFUSED                 DK    NOT ASCERTAINED
    #           -4.00000           -0.62500           -0.50000           -0.46875
    # PREMADE CIGARETTES      ROLL-YOUR-OWN               BOTH
    #            1.00000            2.00000            3.00000

So, instead, you can parse the format.sas file with some simple regular expressions:

    locf <- function(x) {
      x <- data.frame(x, stringsAsFactors = FALSE)
      x[x == ''] <- NA
      indx <- !is.na(x)

      x[] <- lapply(seq_along(x), function(ii) {
        idx <- cumsum(indx[, ii])
        idx[idx == 0] <- NA
        x[, ii][indx[, ii]][idx]
      })
      x[, 1]
    }

    fmt <- readLines('~/desktop/2012-2013-NATS-Format/2012-2013-NATS-Format.sas')
    ## not sure if comments are allowed in the value definitions, but
    ## this will check for those in case
    fmt <- gsub('\\*.*;|\\/\\*.*\\*\\/', '', fmt)

    vars <- gsub('(?i)value\\W+(\\w*)|.', '\\1', fmt, perl = TRUE)
    vars <- locf(vars)

    regex <- '[\'\"].*[\'\"]|[\\w\\d-]+'
    vals <- gsub(sprintf('(?i)\\s*(%s)\\s*(=)\\s*(%s)|.', regex, regex),
                 '\\1\\2\\3', fmt, perl = TRUE)

    View(dd <- na.omit(data.frame(values = vars, formats = vals,
                                  stringsAsFactors = FALSE)))

    sp <- split(dd$formats, dd$values)
    sp <- lapply(sp, function(x) {
      x <- Filter(nzchar, x)
      x <- strsplit(x, '=')
      tw <- function(x) gsub('^\\s+|\\s+$', '', x)
      sapply(x, function(y) setNames(tw(y[1]), tw(y[2])))
    })
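The step that is easiest to miss in the code above is the carry-forward: each `code = 'label'` line has to inherit the format name from the most recent `value` line. The toy sketch below shows only that idea on a few in-memory lines; the input lines are an assumed layout for illustration, not the actual contents of the NATS file.

```r
# Assumed toy input, one statement fragment per line
lines <- c("proc format;",
           "value A5_",
           "  -1 = 'INAPPLICABLE'",
           "   1 = 'PREMADE CIGARETTES'",
           ";",
           "value RACEF",
           "   1 = 'Caucasian'")

# Format name on lines that start a `value` block, NA everywhere else
is_value <- grepl("^\\s*value\\s", lines, ignore.case = TRUE)
name <- ifelse(is_value,
               sub("(?i)^\\s*value\\s+(\\w+).*", "\\1", lines, perl = TRUE),
               NA)

# Carry the last seen name forward; lines before the first block stay NA
idx <- cumsum(!is.na(name))
idx[idx == 0] <- NA
filled <- name[!is.na(name)][idx]
```

Once every line carries its format name, splitting the label lines by that name gives one group per format, which is what the `split()` call above relies on.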

So the smoke type formats (one of those that failed above), for example, get parsed as follows:

    sp['A5_']
    # $A5_
    #       'INAPPLICABLE'            'REFUSED'                 'DK'
    #                 "-1"                 "-7"                 "-8"
    #    'NOT ASCERTAINED' 'PREMADE CIGARETTES'      'ROLL-YOUR-OWN'               'BOTH'
    #                 "-9"                  "1"                  "2"                  "3"

And then you can use the same function again to apply these formats to the data:

    table(do_fmt(sas['SMOKTYPE'], sp['A5_']))
    #               'BOTH'                 'DK'       'INAPPLICABLE'
    #                  736                   17                51857
    # 'PREMADE CIGARETTES'            'REFUSED'      'ROLL-YOUR-OWN'
    #                 7184                    2                  396
+2
Oct 29 '15 at 10:19


