Reading SAS sas7bdat data in R

What options does R have for reading files in the native SAS format, sas7bdat?

The NCES Common Core, for example, contains an extensive repository of data files stored in this format. For concreteness, let me focus on trying to read in this file from the 1997-1998 LEA Universe, which contains education agency-level demographic data for all states from A to I.

Here's a preview from SAS data:

[Image: sas_preview]

What is the easiest way to get this data into my R environment? I do not have any version of SAS and do not want to pay for one, so simply converting the file to .csv myself would be difficult.

+18
r r-faq
May 02 '15 at 19:58
2 answers

sas7bdat worked fine for all but one of the files I looked at (specifically, this one). When I reported the error to the sas7bdat developer, Matthew Shotwell, he pointed me towards Hadley's haven package for R, which also has a read_sas function.
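Since the question asks for a way to get the data into R (and, ideally, out to .csv) without SAS, here is a minimal sketch of how haven could be used for that. The helper name sas_to_csv and the output file name are hypothetical, chosen only for illustration; haven::read_sas and base R's write.csv are the only real calls, and the sketch assumes haven is installed and the file sits in the working directory.

```r
# Hypothetical helper (not part of haven): read a sas7bdat file
# with haven::read_sas and save a CSV copy with base R.
sas_to_csv <- function(sas_path, csv_path) {
  stopifnot(file.exists(sas_path))
  dat <- haven::read_sas(sas_path)   # returns a tibble, i.e. a data frame
  write.csv(dat, csv_path, row.names = FALSE, na = "")
  invisible(dat)
}

# Usage, guarded so the snippet still runs when the file is absent
if (file.exists("psu97ai.sas7bdat")) {
  dat <- sas_to_csv("psu97ai.sas7bdat", "psu97ai.csv")
  dim(dat)
}
```

Calling haven::read_sas with the namespace prefix keeps the helper usable without attaching the package.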

This function is superior for two reasons:

1) It had no problems reading the linked file.
2) It is much (and I do mean much) faster than read.sas7bdat. Here's a quick benchmark (on this file, which is smaller than the others) to demonstrate:

    library(microbenchmark)
    library(sas7bdat)
    library(haven)

    microbenchmark(times = 10L,
                   read.sas7bdat("psu97ai.sas7bdat"),
                   read_sas("psu97ai.sas7bdat"))
    # Unit: milliseconds
    #                               expr        min         lq       mean     median         uq        max neval cld
    #  read.sas7bdat("psu97ai.sas7bdat") 66696.2955 67587.7061 71939.7025 68331.9600 77225.1979 82836.8152    10   b
    #       read_sas("psu97ai.sas7bdat")   397.9955   402.2627   410.4015   408.5038   418.1059   425.2762    10  a

That's right: haven::read_sas takes (on average) roughly 99.4% less time than sas7bdat::read.sas7bdat.
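For reference, the relative saving can be recomputed directly from the mean timings reported in the benchmark above. This is only a restatement of those two numbers (the variable names are mine); it works out to a reduction of a little over 99.4%, or a speedup of roughly 175 times.

```r
# Mean timings in milliseconds, copied from the benchmark output above
mean_sas7bdat <- 71939.7025
mean_haven    <- 410.4015

# Percentage of time saved by read_sas relative to read.sas7bdat
reduction_pct <- (1 - mean_haven / mean_sas7bdat) * 100

# Equivalent speedup factor
speedup <- mean_sas7bdat / mean_haven

round(reduction_pct, 2)
round(speedup)
```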

minor update

I previously could not work out whether the two functions return the same data (i.e., whether both read the file equally accurately), but I have finally checked:

    library(data.table)

    # Keep both results as data.tables
    sas7bdat <- setDT(read.sas7bdat("psu97ai.sas7bdat"))
    haven <- setDT(read_sas("psu97ai.sas7bdat"))

    # read.sas7bdat reads strings as factors, and as of now
    # has no stringsAsFactors argument with which to prevent this,
    # so collect the names of the factor columns in its output
    idj_factor <- names(which(sapply(sas7bdat, is.factor)))

    # Convert all factor columns to character
    sas7bdat[ , (idj_factor) := lapply(.SD, as.character), .SDcols = idj_factor]

    # Check equality of the tables
    all.equal(sas7bdat, haven, check.attributes = FALSE)
    # [1] TRUE
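To see why the factor columns have to be converted before comparing, here is a small base-R sketch on made-up data (the two toy data frames are stand-ins for the two readers' outputs, not the real file): the comparison fails while one side holds factors and succeeds once they are converted to character.

```r
# Toy stand-ins: one reader returns factors, the other plain strings
from_sas7bdat <- data.frame(id = 1:3, st = factor(c("AL", "AK", "AZ")))
from_haven    <- data.frame(id = 1:3, st = c("AL", "AK", "AZ"),
                            stringsAsFactors = FALSE)

# Before conversion the factor/character mismatch is reported
equal_before <- isTRUE(all.equal(from_sas7bdat, from_haven))

# Convert every factor column to character, then compare again
is_fac <- vapply(from_sas7bdat, is.factor, logical(1))
from_sas7bdat[is_fac] <- lapply(from_sas7bdat[is_fac], as.character)
equal_after <- isTRUE(all.equal(from_sas7bdat, from_haven))

c(before = equal_before, after = equal_after)
```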

However, note that read.sas7bdat retains a large list of attributes for the file, presumably carried over from SAS:

    str(sas7bdat)
    # ...
    # - attr(*, "column.info")=List of 70
    #  ..$ :List of 12
    #  .. ..$ name  : chr "NCESSCH"
    #  .. ..$ offset: int 200
    #  .. ..$ length: int 12
    #  .. ..$ type  : chr "character"
    #  .. ..$ format: chr "$"
    #  .. ..$ fhdr  : int 0
    #  .. ..$ foff  : int 76
    #  .. ..$ flen  : int 1
    #  .. ..$ label : chr "UNIQUE SCHOOL ID (NCES ASSIGNED)"
    #  .. ..$ lhdr  : int 0
    #  .. ..$ loff  : int 44
    #  .. ..$ llen  : int 32
    # ...

So, if you really need these attributes (I know some people are especially interested in the labels, for example), read.sas7bdat may be the better option for you after all.
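If the labels are what you are after, something along these lines could pull them out of the "column.info" attribute as a name-to-label lookup. This is a sketch on a toy object that mimics the structure shown in the str() output above. The second column and its label are invented for illustration, and a real file may have entries without a label, which the sketch maps to NA.

```r
# Toy stand-in for a data frame returned by read.sas7bdat
toy <- data.frame(NCESSCH = c("010000100001", "010000100002"),
                  SCHNAM  = c("School A", "School B"),   # invented column
                  stringsAsFactors = FALSE)
attr(toy, "column.info") <- list(
  list(name = "NCESSCH", type = "character",
       label = "UNIQUE SCHOOL ID (NCES ASSIGNED)"),
  list(name = "SCHNAM", type = "character")              # no label here
)

# Build a named character vector: column name -> label (NA if absent)
col_info <- attr(toy, "column.info")
labels <- vapply(col_info, function(ci) {
  if (is.null(ci$label)) NA_character_ else ci$label
}, character(1))
names(labels) <- vapply(col_info, function(ci) ci$name, character(1))

labels[["NCESSCH"]]
```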

+29
May 05 '15 at 2:20

Problem

The problem is that the files you are trying to use are poorly formatted. In particular, empty cells are not encoded as missing values (R uses NA) but simply left empty. When you try to load a tab-delimited file, this trips up R, which concludes that some rows have the wrong number of columns.
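As a small illustration of how empty cells can be treated as missing at read time, here is a sketch using base R's read.delim on an inline, made-up tab-delimited snippet. It assumes the empty fields are the only irregularity; the real NCES files may have other quirks that this does not address.

```r
# Made-up tab-delimited text with two empty cells
txt <- "id\tname\tcount\n1\tAlpha\t10\n2\t\t20\n3\tGamma\t\n"

# na.strings = "" tells R to read empty fields as NA
dat <- read.delim(text = txt, na.strings = "", stringsAsFactors = FALSE)

# Which cells came through as missing?
is.na(dat)
```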

Workaround using SAS files

I found a workaround: download the SAS file, read it with the sas7bdat package, and then recode the empty cells ("") as NA:

    install.packages("sas7bdat")
    library(sas7bdat)

    # Download and unpack the SAS version of the file
    download.file("http://nces.ed.gov/ccd/Data/zip/ag121a_supp_sas.zip",
                  destfile = "sas.zip")
    unzip("sas.zip")

    # Read the data, then recode empty strings as missing
    sas <- read.sas7bdat(file = "ag121a_supp.sas7bdat", debug = FALSE)
    sas[sas == ""] <- NA
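The recoding step in the last line can be tried without downloading anything. The following sketch applies the same idiom to a small made-up data frame (toy data, not the NCES file) to show what it does:

```r
# Made-up data frame with empty strings standing in for missing cells
toy <- data.frame(id    = 1:3,
                  name  = c("Alpha", "", "Gamma"),
                  state = c("", "AL", "AK"),
                  stringsAsFactors = FALSE)

# toy == "" is a logical matrix marking the empty cells;
# assigning NA through it recodes exactly those cells
toy[toy == ""] <- NA

colSums(is.na(toy))
```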

There are two problems with this method:

  • It is slow (see comments).
  • The sas7bdat package is considered experimental by its author at the time of writing. It may therefore fail to read some SAS files, and I would check the ones it does read for inconsistencies before use.

Non-R Solution

It's not entirely canonical, but you can also download the tab-delimited files, open them in LibreOffice Calc (Microsoft Excel seems to mangle them), and use find-and-replace to search for empty cells ("") and replace them with NA.

+5
May 02 '15 at 10:42


