Convert character to numeric without NA coercion in R

I work in R and have a data frame, dd_2006, with numeric vectors. When I first imported the data, I needed to remove $, decimal points, and some blank spaces from three of my variables: SumOfCost, SumOfCases, and SumOfUnits. To do that, I used str_replace_all. However, once I used str_replace_all, the vectors were converted to character. So I used as.numeric(var) to convert the vectors to numeric, but NAs were introduced, even though there were no NAs in the vectors when I ran the code below BEFORE running as.numeric.

    sum(is.na(dd_2006$SumOfCost))
    [1] 0
    sum(is.na(dd_2006$SumOfCases))
    [1] 0
    sum(is.na(dd_2006$SumOfUnits))
    [1] 0

Here is my code after the import, beginning with removing the $ from the vector. In the str(dd_2006) output I deleted some of the variables to save space, so the column numbers in the str_replace_all code below do not match the output posted here (but they do in the original code):

    library("stringr")
    dd_2006$SumOfCost <- str_sub(dd_2006$SumOfCost, 2, ) #2=the first # after the $

    #Removes decimal pt, zero after, and commas
    dd_2006[ ,9] <- str_replace_all(dd_2006[ ,9], ".00", "")
    dd_2006[,9] <- str_replace_all(dd_2006[,9], ",", "")

    dd_2006[ ,10] <- str_replace_all(dd_2006[ ,10], ".00", "")
    dd_2006[ ,10] <- str_replace_all(dd_2006[,10], ",", "")

    dd_2006[ ,11] <- str_replace_all(dd_2006[ ,11], ".00", "")
    dd_2006[,11] <- str_replace_all(dd_2006[,11], ",", "")

    str(dd_2006)
    'data.frame': 12604 obs. of 14 variables:
     $ CMHSP : Factor w/ 46 levels "Allegan","AuSable Valley",..: 1 1 1
     $ FY : Factor w/ 1 level "2006": 1 1 1 1 1 1 1 1 1 1 ...
     $ Population : Factor w/ 1 level "DD": 1 1 1 1 1 1 1 1 1 1 ...
     $ SumOfCases : chr "0" "1" "0" "0" ...
     $ SumOfUnits : chr "0" "365" "0" "0" ...
     $ SumOfCost : chr "0" "96416" "0" "0" ...

I found an answer to a similar question to mine here using the following code:

    # create dummy data.frame
    d <- data.frame(char = letters[1:5],
                    fake_char = as.character(1:5),
                    fac = factor(1:5),
                    char_fac = factor(letters[1:5]),
                    num = 1:5, stringsAsFactors = FALSE)

Let's take a look at the data.frame:

    > d
      char fake_char fac char_fac num
    1    a         1   1        a   1
    2    b         2   2        b   2
    3    c         3   3        c   3
    4    d         4   4        d   4
    5    e         5   5        e   5

and run:

    > sapply(d, mode)
           char   fake_char         fac    char_fac         num
    "character" "character"   "numeric"   "numeric"   "numeric"
    > sapply(d, class)
           char   fake_char         fac    char_fac         num
    "character" "character"    "factor"    "factor"   "integer"

Now you are probably asking yourself: "Where is the anomaly?" Well, I have come across some very peculiar things in R, and this is not the worst of them, but it can confuse you, especially if you read this just before going to bed.

Here goes: the first two columns are character. I deliberately named the second one fake_char. Note the similarity between this character variable and the one Dirk created in his answer. It is actually a numeric vector converted to character. The third and fourth columns are factors, and the last one is purely numeric.

If you use the transform function, you can convert fake_char to numeric, but not the char variable itself.

    > transform(d, char = as.numeric(char))
      char fake_char fac char_fac num
    1   NA         1   1        a   1
    2   NA         2   2        b   2
    3   NA         3   3        c   3
    4   NA         4   4        d   4
    5   NA         5   5        e   5
    Warning message:
    In eval(expr, envir, enclos) : NAs introduced by coercion

But if you do the same thing on fake_char and char_fac, you'll be lucky and get away with no NAs:

    > transform(d, fake_char = as.numeric(fake_char), char_fac = as.numeric(char_fac))
      char fake_char fac char_fac num
    1    a         1   1        1   1
    2    b         2   2        2   2
    3    c         3   3        3   3
    4    d         4   4        4   4
    5    e         5   5        5   5

So I tried the above code in my script, but still ended up with NAs (this time without any warning message about coercion).

    #changing sumofcases, cost, and units to numeric
    dd_2006_1 <- transform(dd_2006, SumOfCases = as.numeric(SumOfCases),
                           SumOfUnits = as.numeric(SumOfUnits),
                           SumOfCost = as.numeric(SumOfCost))

    > sum(is.na(dd_2006_1$SumOfCost))
    [1] 12
    > sum(is.na(dd_2006_1$SumOfCases))
    [1] 7
    > sum(is.na(dd_2006_1$SumOfUnits))
    [1] 11

I have also used table(dd_2006$SumOfCases), etc., to look at the observations and check whether there are any characters I missed, but there weren't any. Any thoughts on why the NAs are appearing and how to get rid of them?

+6
3 answers

As Anando noted, the problem is somewhere in your data, and we can't really help you much without a reproducible example. That said, here is a code snippet to help you pin down the records in your data that are causing the problem:

    test = as.character(c(1,2,3,4,'M'))
    v = as.numeric(test) # NAs introduced by coercion
    ix.na = is.na(v)
    which(ix.na) # row index of our problem = 5
    test[ix.na] # shows the problematic record, "M"

Instead of guessing why NAs are being introduced, pull out the records that are causing the problem and address them directly/individually until the NAs go away.
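The same check can be applied to every column you plan to convert. Here is a minimal sketch; the helper name find_unconvertible and the toy data frame are made up for illustration and are not from the question. It returns the distinct original strings that as.numeric cannot parse, which also surfaces empty or blank strings (those turn into NA silently, with no coercion warning).

```r
# Hypothetical helper: return the distinct original values that turn into NA
# when converted with as.numeric(), ignoring values that were already NA.
find_unconvertible <- function(x) {
  x <- as.character(x)
  converted <- suppressWarnings(as.numeric(x))
  unique(x[is.na(converted) & !is.na(x)])
}

# Toy stand-in for dd_2006 (assumed data, for illustration only)
toy <- data.frame(
  SumOfCases = c("0", "1", "", "12"),
  SumOfCost  = c("0", "96416", "1,000", " "),
  stringsAsFactors = FALSE
)

# One character vector of offending values per column
bad <- lapply(toy[c("SumOfCases", "SumOfCost")], find_unconvertible)
print(bad)
```

Running this on the real columns before calling as.numeric should show exactly which strings need cleaning.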

UPDATE: Looks like the problem is with your call to str_replace_all . I don't know the stringr library, but I think you can do the same with gsub as follows:

    v2 = c("1.00","2.00","3.00")
    gsub("\\.00", "", v2)
    [1] "1" "2" "3"

I'm not really sure what this accomplishes, though:

    sum(as.numeric(v2)!=as.numeric(gsub("\\.00", "", v2))) # Illustrate that vectors are equivalent.
    [1] 0

Unless this step serves some specific purpose for you, I would suggest dropping it from your preprocessing entirely, since it does not appear to be necessary and seems to be causing problems.
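One plausible explanation for the silent NAs, assuming the columns contain whole-number strings such as "100": the pattern is treated as a regular expression, so the unescaped dot in ".00" matches any character, not just a decimal point. The sketch below uses base gsub and made-up values (not taken from dd_2006) to show the effect, and a more careful version that escapes the dot and anchors it to the end of the string.

```r
# Toy values (assumed, not taken from dd_2006)
x <- c("100", "1,000.00", "96416.00")

# Unescaped: "." matches any character, so "100" is wiped out entirely
# and ",00" is eaten out of the middle of "1,000.00"
unescaped <- gsub(".00", "", x)
print(unescaped)

# An empty string converts to NA without any coercion warning
print(as.numeric(unescaped))

# Escaped and anchored: only a literal ".00" at the end is removed
cleaned <- gsub("\\.00$", "", x)
# Commas are removed as a fixed (non-regex) string
cleaned <- gsub(",", "", cleaned, fixed = TRUE)
print(as.numeric(cleaned))
```

If this is what is happening in your data, it would also explain why the NAs show up without the usual "NAs introduced by coercion" warning.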

+13

If you want to convert a character variable to numeric, first convert it to a factor (using as.factor) and save/overwrite the existing variable. Then convert that variable to numeric (using as.numeric). You won't create NAs this way, and you will be able to convert the data set you have to numeric.
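It is worth checking what this route returns before relying on it: as.numeric applied to a factor gives the integer level codes, not the numbers the labels spell out. A small sketch with made-up values:

```r
# Toy values (assumed, for illustration only)
x <- c("10", "200", "3")
f <- as.factor(x)

# Integer level codes of the factor, not the original numbers
codes <- as.numeric(f)
print(codes)

# Going through as.character() recovers the numbers the labels spell out
values <- as.numeric(as.character(f))
print(values)
```

So the absence of NAs here does not by itself mean the values were preserved.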

+4

A simple solution is to let retype guess the new data types for each column:

    library(dplyr)
    library(hablar)
    dd_2006 %>% retype()
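If you would rather not add a package, base R's type.convert does a similar kind of guessing. This is a sketch on a made-up data frame, not on dd_2006, and it will only help once the strings are already clean (no $, commas, or stray characters).

```r
# Toy data frame (assumed, for illustration only)
toy <- data.frame(n = c("1", "2", "3"),
                  x = c("1.5", "2", "NA"),
                  s = c("a", "b", "c"),
                  stringsAsFactors = FALSE)

# as.is = TRUE keeps non-numeric columns as character instead of factor
converted <- type.convert(toy, as.is = TRUE)
str(converted)
```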
0

Source: https://habr.com/ru/post/949238/

