Problems importing csv file / conversion from integer to double in R

Today I finally decided to start climbing R's steep learning curve. I spent several hours and managed to import my dataset and do a few other basic things, but I am having problems with the data type: a column of decimal values is imported as an integer, and converting it to double changes the values.

While trying to produce a small csv file to present here as an example, I found that the problem only occurs when the data file is large enough: my source file is a 1048418 by 12 matrix, but even with "only" 5000 lines I have the same problem. When I have only 100, 1000 or even 2000 rows, the column is correctly imported as double.

Here is a smaller data set (still 500 kB, but again, if the data set is small, the problem is not replicated). The code:

    > ex <- read.csv("exampleshort.csv",header=TRUE)
    > typeof(ex$RET)
    [1] "integer"

Why is the returns column imported as an integer when the file is large, when it is obviously of type double?

Worst of all, if I try to convert it to double, the values are changed:

    > exdouble <- as.double(ex$RET)
    > typeof(exdouble)
    [1] "double"
    > ex$RET[1:5]
    [1] 0.005587 -0.005556 -0.005587 0.005618 -0.001862
    2077 Levels: -0.000413 -0.000532 -0.001082 -0.001199 -0.0012 -0.001285 -0.001337 -0.001351 -0.001357 -0.001481 -0.001486 -0.001488 ... 0.309524
    > exdouble[1:5]
    [1] 1305 321 322 1307 41

This is not the only column that was not imported correctly, but I decided that if I find a solution for one column, I should be able to sort out the others. Here is some more info:

    > sapply(ex,class)
       PERMNO      DATE    COMNAM     SICCD       PRC       RET      RETX    SHROUT    VWRETD    VWRETX    EWRETD    EWRETX
    "integer" "integer"  "factor" "integer"  "factor"  "factor"  "factor" "integer" "numeric" "numeric" "numeric" "numeric"

They should be, in this order: integer, date, string, integer, double, double, double, integer, double, double, double, double (the type names are probably incorrect, but I hope you get what I mean).

1 answer

See the help for read.csv (?read.csv). Here is the relevant section:

 colClasses: character. A vector of classes to be assumed for the columns. Recycled as necessary, or if the character vector is named, unspecified values are taken to be 'NA'. Possible values are 'NA' (the default, when 'type.convert' is used), '"NULL"' (when the column is skipped), one of the atomic vector classes (logical, integer, numeric, complex, character, raw), or '"factor"', '"Date"' or '"POSIXct"'. Otherwise there needs to be an 'as' method (from package 'methods') for conversion from '"character"' to the specified formal class. Note that 'colClasses' is specified per column (not per variable) and so includes the column of row names (if any). 

Good luck with your quest to learn R. It's hard, but so much fun after you go through the first few steps (which, I admit, will take some time).

Try this and fix the rest accordingly:

 ex <- read.csv("exampleshort.csv",header=TRUE,colClasses=c("integer","integer","factor","integer","numeric","factor","factor","integer","numeric","numeric","numeric","numeric"), na.strings=c(".")) 

As BenBolker points out, the colClasses argument is probably not needed. However, note that specifying colClasses can speed up the import, especially with a large dataset.
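As a minimal sketch of how colClasses behaves, the snippet below reads a small inline CSV twice: once letting read.csv guess the types, and once with the types stated explicitly. The three-column sample and its values are made up for illustration; only the column names PERMNO, COMNAM and PRC are borrowed from the question.

```r
# Hypothetical three-column sample standing in for the real file.
csv_text <- "PERMNO,COMNAM,PRC
10001,ACME CORP,12.5
10002,WIDGET INC,7
10003,EXAMPLE LTD,3.25"

# Let read.csv guess the type of each column.
guessed <- read.csv(text = csv_text, header = TRUE)

# State the types explicitly: one entry per column, in file order.
typed <- read.csv(text = csv_text, header = TRUE,
                  colClasses = c("integer", "character", "numeric"))

sapply(typed, class)
```

With explicit classes there is no guessing step, and a value that does not fit the declared type makes the import fail instead of silently changing the column type.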

The na.strings argument is the part that matters here. See the next section in ?read.csv:

  na.strings: a character vector of strings which are to be interpreted as 'NA' values. Blank fields are also considered to be missing values in logical, integer, numeric and complex fields. 
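To illustrate the effect of na.strings, here is a small sketch using a made-up inline sample in which "." marks a missing return, matching the na.strings=c(".") setting used above. Whether "." is the only non-numeric marker in the real file is an assumption. stringsAsFactors = TRUE is set explicitly so that the factor behaviour from the question is reproduced on R 4.0.0 and later, where text columns are otherwise read as character.

```r
# Hypothetical sample: "." stands for a missing return.
csv_text <- "PERMNO,RET
10001,0.005587
10001,-0.005556
10001,.
10001,0.005618"

# Without na.strings, the single "." makes the whole RET column text.
bad <- read.csv(text = csv_text, stringsAsFactors = TRUE)
class(bad$RET)   # "factor"
typeof(bad$RET)  # "integer" (a factor is stored as integer codes)

# With na.strings, "." is read as NA and the column stays numeric.
good <- read.csv(text = csv_text, na.strings = ".")
class(good$RET)  # "numeric"
good$RET
```

This would also be consistent with the problem only showing up in larger files: a short extract may simply not contain any of the non-numeric entries.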

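For reference purposes (this should not be used as a solution because the best solution is to import the data correctly in one step): RET was not imported as an integer. It was imported as a factor. typeof() reports "integer" because a factor is stored internally as integer codes, and as.double() returns those codes rather than the values you see printed. For future reference, if you want to convert a factor to numeric, use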

new_RET <- as.numeric(as.character(ex$RET))
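The difference between the two conversions can be seen on a small made-up factor. The sketch below also shows one possible way to list the entries that fail to convert, which are candidates for na.strings. The "C" entry is a hypothetical non-numeric code, not something taken from the original file.

```r
# Hypothetical factor holding numbers as text, like RET after a bad import.
ret <- factor(c("0.005587", "-0.005556", "C", "0.005618"))

# Direct conversion returns the integer codes (positions in levels(ret)),
# not the numbers that are printed.
as.numeric(ret)

# Going through character recovers the values; "C" becomes NA.
values <- suppressWarnings(as.numeric(as.character(ret)))
values

# Entries that failed to convert are the strings to consider for na.strings.
unique(as.character(ret)[is.na(values)])  # "C"
```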


Source: https://habr.com/ru/post/903079/

