Google Docs export spreadsheet values โ€‹โ€‹with commas. read.csv () in R treats them as factors instead of numeric

I'm new to R and trying to read a Google public spreadsheet in an R data frame with numeric columns. My problem is that there are a lot of commas in the exported spreadsheet, for example "13 061,422". The read.csv () function sees this as a factor. I tried the lines AsFactors = FALSE and colClasses = c (rep ("numeric", 7)), but it didn't work. Is there a way to force values โ€‹โ€‹with commas and decimals to numeric values, either in read.csv (), or later, when they are considered as factors in the R-frame of data R? Here is my code:

require(RCurl) myCsv <- getURL("https://docs.google.com/spreadsheet/pub?hl=en_US&hl=en_US&key=0Agbdciapt4QZdE95UDFoNHlyNnl6aGlqbGF0cDIzTlE&single=true&gid=0&range=A1%3AG4928&output=csv", ssl.verifypeer=FALSE) #ssl.verifypeer=FALSE gets around certificate issues I don't understand. fullmatrix <- read.csv(textConnection(myCsv)) str(fullmatrix) 

that leads to:

 'data.frame': 4927 obs. of 7 variables: $ wave. : Factor w/ 4927 levels "1,000.8900","1,002.8190",..: 4875 4874 4873 4872 4871 4870 4869 4868 4867 4866 ... $ wavelength : Factor w/ 4927 levels "1,000.074","1,000.267",..: 1 2 3 4 5 6 7 8 9 10 ... $ d2o : num 85.2 87.7 86.3 87.6 85.6 ... $ di : num 54.3 55.8 54.9 55.6 54.9 ... $ ddw : num 48.2 49.7 49.4 50.2 49.6 ... $ ddw.old : num 53.3 55 53.9 54.8 53.7 ... $ d2o.ddw.mix: num 65.8 67.9 67.2 68.4 66.8 ... 

Thanks for any help! I'm new to R, so guessing (hoping) is easy!

+4
source share
2 answers

Yes. Two methods. The easiest way to understand at first is probably just to as.is=TRUE as character vectors, and then use gsub to remove commas and any currency characters before converting to numeric. The second is a bit more complicated, but I think more. Create an as-method for the format you are using. Then you can use colClasses to do this in one step.

I see that @EDi has already executed version # 1 (using stringsAsFactors , not as.is , so I will document Strategy # 2:

  library(methods) setClass("num.with.commas") #[1] "num.with.commas" setAs("character", "num.with.commas", function(from) as.numeric(gsub(",", "", from))) require(RCurl) #Loading required package: RCurl #Loading required package: bitops myCsv <- getURL("https://docs.google.com/spreadsheet/pub?hl=en_US&hl=en_US&key=0Agbdciapt4QZdE95UDFoNHlyNnl6aGlqbGF0cDIzTlE&single=true&gid=0&range=A1%3AG4928&output=csv", ssl.verifypeer=FALSE) > fullmatrix <- read.csv(textConnection(myCsv), colClasses=c(rep("num.with.commas",2), rep("numeric",4) )) str(fullmatrix) #-------------- 'data.frame': 4927 obs. of 7 variables: $ wave. : num 9999 9997 9995 9993 9992 ... $ wavelength : num 1000 1000 1000 1001 1001 ... $ d2o : num 85.2 87.7 86.3 87.6 85.6 ... $ di : num 54.3 55.8 54.9 55.6 54.9 ... $ ddw : num 48.2 49.7 49.4 50.2 49.6 ... $ ddw.old : num 53.3 55 53.9 54.8 53.7 ... $ d2o.ddw.mix: num 65.8 67.9 67.2 68.4 66.8 ... 

as methods are mandatory. There are many methods in the R database, such as as.list , as.numeric , as.character . In each case, they try to enter input that is in one mode and make a reasonable copy of it in another mode. For example, it makes sense to force the matrix to a data frame, because they both have two dimensions. This is slightly less meaningful to force a data block into a matrix (but it succeeds with losing all the attributes of the columns and enforcing general mode.)

In this case, I take the character string as input, removing any commas and changing the character values โ€‹โ€‹to numeric. Then I use the read.table (in this case, using read.csv ) the colClasses argument to send to the as method I registered with setAs . For more details, you can go to the help(setAs) page. The S4 class system confuses many people, including me. This is the only area of โ€‹โ€‹success I have had with S4 methods.

+8
source

Read the data using stringsAsFactors = FALSE in, remove the commas (using gsub() ) and convert them to numeric (using as.numeric() ):

 > fullmatrix <- read.csv(textConnection(myCsv), stringsAsFactors = FALSE) > str(fullmatrix) 'data.frame': 4927 obs. of 7 variables: $ wave. : chr "9,999.2590" "9,997.3300" "9,995.4010" "9,993.4730" ... $ wavelength : chr "1,000.07410549122" "1,000.26707130804" "1,000.46011160533" "1,000.65312629553" ... $ d2o : num 85.2 87.7 86.3 87.6 85.6 ... $ di : num 54.3 55.8 54.9 55.6 54.9 ... $ ddw : num 48.2 49.7 49.4 50.2 49.6 ... $ ddw.old : num 53.3 55 53.9 54.8 53.7 ... $ d2o.ddw.mix: num 65.8 67.9 67.2 68.4 66.8 ... > fullmatrix$wave. <- as.numeric(gsub(",", "", fullmatrix$wave.)) > fullmatrix$wavelength <- as.numeric(gsub(",", "", fullmatrix$wavelength)) > str(fullmatrix) 'data.frame': 4927 obs. of 7 variables: $ wave. : num 9999 9997 9995 9993 9992 ... $ wavelength : num 1000 1000 1000 1001 1001 ... $ d2o : num 85.2 87.7 86.3 87.6 85.6 ... $ di : num 54.3 55.8 54.9 55.6 54.9 ... $ ddw : num 48.2 49.7 49.4 50.2 49.6 ... $ ddw.old : num 53.3 55 53.9 54.8 53.7 ... $ d2o.ddw.mix: num 65.8 67.9 67.2 68.4 66.8 ... > fullmatrix[1, 1] [1] 9999.259 
+8
source

Source: https://habr.com/ru/post/1388228/


All Articles