Convert character to numeric value in R

I have a file that I read into R as a data frame (called CA1) with the following structure:

  Station_ID Guage_Type   Lat   Long     Date Time_Zone Time_Frame  H0  H1  H2  H3  H4  H5  H6  H7  H8  H9 H10 H11 H12 H13 H14 H15 H16 H17 H18 H19 H20 H21 H22 H23
1    4457700         HI 41.52 124.03 19480701         8        LST   0   0   0   0   0   0   0   0   0   0   0   0 MIS MIS MIS MIS MIS MIS MIS MIS MIS MIS MIS MIS
2    4457700         HI 41.52 124.03 19480705         8        LST   0   1   1   1   1   1   2   2   2   4   5   5   4   7   1   1   0   0  10  13   5   1   1   3
3    4457700         HI 41.52 124.03 19480706         8        LST   1   1   1   0   1   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
4    4457700         HI 41.52 124.03 19480727         8        LST   3   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
5    4457700         HI 41.52 124.03 19480801         8        LST   0   0   0   0   0   0   0   0   0   0   0   0 MIS MIS MIS MIS MIS MIS MIS MIS MIS MIS MIS MIS
6    4457700         HI 41.52 124.03 19480817         8        LST   0   0   0   0   0   0 ACC ACC ACC ACC ACC ACC   6   1   0   0   0   0   0   0   0   0   0   0

Columns H0 through H23 are read in as character, because some values are not numeric: they can be MIS, ACC, or DEL.

My question is: is there a way to convert each of the columns H0-H23 to numeric, with the character values (MIS, ACC, DEL) becoming NA or NaN, which I can then test for with is.nan or is.na, so that I can run some numerical models? Or would it be better to replace the character values with a sentinel such as -9999?

I have tried many ways. I found several on this site, but nothing works. For instance:

  for (i in 8:31) { CA1[6,i] <- as.numeric(as.character(CA1[6,i])) } 

which of course gives warnings, but when I test two specific values with is.numeric() (CA1[6,8] and CA1[6,19]), I get FALSE for both. I don't understand the first result, but I do understand the second, since that value is "". However, I can check it with is.na(CA1[6,19]), which returns TRUE, and that is enough for me to know it is not numeric.

The second method I tried is:

  for (i in 8:31) { CA1[6,i] <- as.numeric(levels(CA1[6,i]))[CA1[6,i]] } 

which gave the same results as before.

Is there an efficient way to do what I'm trying to do? Your help is greatly appreciated. Thank you.

+6
3 answers

The immediate problem: each column of a data frame can hold values of only one type. The 6 in CA1[6,i] means that only one value in each column is converted, so when it is assigned back after the conversion, it is coerced to a string again to match the rest of the column.

You can solve this by converting the whole column at once, so that the column is replaced completely, i.e. drop the 6:

  for (i in 8:31) { CA1[,i] <- as.numeric(as.character(CA1[,i])) } 
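
A minimal sketch of the difference, using a small made-up data frame (`df` and its values are hypothetical, not the asker's data) and assuming the columns were read as character:

```r
# Toy data frame standing in for CA1; names and values are illustrative only.
df <- data.frame(H0 = c("0", "3", "0"),
                 H1 = c("MIS", "1", "ACC"),
                 stringsAsFactors = FALSE)

# Assigning one converted cell: the column stays character,
# so the numeric value is coerced straight back to a string.
one_cell <- df
one_cell[2, "H0"] <- as.numeric(one_cell[2, "H0"])
class(one_cell$H0)   # still "character"

# Converting whole columns replaces them, so the new type sticks.
# Non-numeric codes such as "MIS" become NA; the coercion warning
# is silenced here with suppressWarnings().
whole <- df
whole[] <- lapply(whole, function(x) suppressWarnings(as.numeric(as.character(x))))
sapply(whole, class)
is.na(whole$H1)
```

The `lapply` form does the same job as the loop above without indexing columns by position; the `as.character()` step is kept so the same line would also handle factor columns.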
+6

When you read data, you can usually specify the column types. For example, read.table / read.csv have the colClasses argument.

 # Something like this
 read.table('foo.txt', header=TRUE, colClasses=c('integer', 'factor', 'numeric', 'numeric', 'Date'))

See ?read.table for more details.
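
As a rough sketch of colClasses in use, the snippet below reads a few inline lines instead of a file (the `text =` input and its reduced set of columns are made up for illustration). It relies on the documented behaviour that a named colClasses vector is matched by column name, with the remaining columns left for read.table to guess:

```r
# Inline sample standing in for the real file; only a few columns are shown.
txt <- "Station_ID Guage_Type Lat Long H0 H1
4457700 HI 41.52 124.03 0 MIS
4457700 HI 41.52 124.03 3 1"

dat <- read.table(text = txt, header = TRUE,
                  colClasses = c(Station_ID = "character",
                                 Guage_Type = "factor",
                                 H0 = "numeric",
                                 H1 = "character"))

# Lat and Long were not named above, so their types are guessed.
sapply(dat, class)
```

Note that declaring a column that contains MIS as "numeric" makes the read fail with an error, so colClasses on its own does not turn those codes into NA.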

+6

Following up on Tommy's answer, you can potentially deal with this problem when reading the data. If "MIS", "ACC" and "DEL" always indicate missing values, you can use the na.strings argument of read.table.

 read.table('foo.txt', header=TRUE, na.strings = c("MIS", "ACC", "DEL")) 

If there are other character strings that always indicate missing values, you can add them to the vector above.
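
A small self-contained check of this, again using made-up inline text rather than the real file:

```r
# Inline sample standing in for foo.txt; values are illustrative only.
txt <- "Station_ID Time_Frame H0 H1 H2
4457700 LST 0 MIS 2
4457700 LST ACC 1 DEL"

dat <- read.table(text = txt, header = TRUE,
                  na.strings = c("MIS", "ACC", "DEL"))

# With the codes treated as NA, the H columns are read as numbers.
sapply(dat[c("H0", "H1", "H2")], class)

# Number of missing values per column.
colSums(is.na(dat))
```

Because the codes never reach the data frame, no conversion loop is needed afterwards, and is.na() can be used directly on the H columns.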

However, if, for example, "MIS" can appear in the Time_Frame column with a meaning other than a missing value, then DO NOT TAKE THIS APPROACH!
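
If that caveat applies, one possible alternative (a sketch, not something proposed in the answers here) is to read everything as text and convert only the H columns afterwards. The sample input and the `^H[0-9]+$` pattern used to pick the columns are assumptions for illustration:

```r
# Inline sample where "MIS" is a legitimate Time_Frame value (hypothetical).
txt <- "Station_ID Time_Frame H0 H1
4457700 MIS 0 MIS
4457700 LST ACC 1"

# Read every column as character so no code is turned into NA on input.
dat <- read.table(text = txt, header = TRUE, colClasses = "character")

# Pick the hourly columns by name, then convert just those.
# Codes such as "MIS" or "ACC" become NA in these columns only.
h_cols <- grep("^H[0-9]+$", names(dat), value = TRUE)
dat[h_cols] <- lapply(dat[h_cols], function(x) suppressWarnings(as.numeric(x)))

dat$Time_Frame   # "MIS" is kept here as ordinary text
is.na(dat$H1)
```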

+2

Source: https://habr.com/ru/post/914879/

