R dplyr: using mutate with na.omit causes an "incompatible size (%d)" error

I am doing data cleaning. I use mutate in dplyr a lot because it builds new columns step by step, which makes it easy to follow what is happening.

Here are two examples where I get this error:

Error: incompatible size (%d), expecting %d (the group size) or 1 

Example 1: Get the name of a city from a zip code. The data looks like this:

        Zip
    1 02345
    2 02201

I noticed that it fails when the column contains NA.

Without NA, this works:

    library(dplyr)
    library(zipcode)
    data(zipcode)

    test = data.frame(Zip=c('02345','02201'), stringsAsFactors=FALSE)

    test %>%
      rowwise() %>%
      mutate( Town1 = zipcode[zipcode$zip==na.omit(Zip),'city'] )

The result:

    Source: local data frame [2 x 2]
    Groups: <by row>

        Zip   Town1
    1 02345 Manomet
    2 02201  Boston

With NA, this does not work:

    library(dplyr)
    library(zipcode)
    data(zipcode)

    test = data.frame(Zip=c('02345','02201',NA), stringsAsFactors=FALSE)

    test %>%
      rowwise() %>%
      mutate( Town1 = zipcode[zipcode$zip==na.omit(Zip),'city'] )

The result:

 Error: incompatible size (%d), expecting %d (the group size) or 1 

Example 2: I want to remove the redundant state that appears in the Town column of the following data.

             Town State
    1   BOSTON MA    MA
    2 NORTH AMAMS    MA
    3  CHICAGO IL    IL

Here is how I do it: (1) split the string in Town into words, e.g. "BOSTON" and "MA" for row 1; (2) check whether any of these words matches the State of that row; (3) delete the matching word.

    library(dplyr)

    test = data.frame(Town=c('BOSTON MA','NORTH AMAMS','CHICAGO IL'),
                      State=c('MA','MA','IL'),
                      stringsAsFactors=FALSE)

    test %>%
      mutate(Town.word = strsplit(Town, split=' ')) %>%
      rowwise() %>%  # rowwise ensures every calculation only considers the current row
      mutate(is.state = match(State, Town.word)) %>%
      mutate(Town1 = Town.word[-is.state])

This leads to:

             Town State Town.word is.state   Town1
    1   BOSTON MA    MA  <chr[2]>        2  BOSTON
    2 NORTH AMAMS    MA  <chr[2]>       NA      NA
    3  CHICAGO IL    IL  <chr[2]>        2 CHICAGO

Meaning: row 1, for example, shows is.state == 2, i.e. the second word in Town is the state name. After removing this word, Town1 holds the correct city name.

Now I want to fix the NA in row 2, but adding na.omit results in an error:

    test %>%
      mutate(Town.word = strsplit(Town, split=' ')) %>%
      rowwise() %>%  # rowwise ensures every calculation only considers the current row
      mutate(is.state = match(State, Town.word)) %>%
      mutate(Town1 = Town.word[-na.omit(is.state)])

This leads to:

 Error: incompatible size (%d), expecting %d (the group size) or 1 

I checked the type and size of the data:

    test %>%
      mutate(Town.word = strsplit(Town, split=' ')) %>%
      rowwise() %>%  # rowwise ensures every calculation only considers the current row
      mutate(is.state = match(State, Town.word)) %>%
      mutate(length(is.state)) %>%
      mutate(class(na.omit(is.state)))

This leads to:

             Town State Town.word is.state length(is.state) class(na.omit(is.state))
    1   BOSTON MA    MA  <chr[2]>        2                1                  integer
    2 NORTH AMAMS    MA  <chr[2]>       NA                1                  integer
    3  CHICAGO IL    IL  <chr[2]>        2                1                  integer

So is.state is an integer of length 1 in every row. Can anyone tell me what is going wrong? Thanks.

1 answer

Can you just sub it out?

    test %>%
      rowwise() %>%
      mutate(Town=sub(sprintf('[, ]*%s$', State), '', Town))

    ## Source: local data frame [3 x 2]
    ## Groups: <by row>
    ##
    ##          Town State
    ## 1      BOSTON    MA
    ## 2 NORTH AMAMS    MA
    ## 3     CHICAGO    IL

(This pattern also strips a comma after the city, if there is one.)

NB: if you call ungroup() on the rowwise_df here (as things currently stand), it also drops the tbl_df class and returns a plain data.frame. That is fine for data of your size, but it will flood your console if you are not careful and print a large data set (as I have done countless times). (See GitHub issues #936 and #553.)
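To see what the pattern does outside of dplyr, here is a minimal base R sketch. The helper name strip_state() and the "CHICAGO, IL" and "SELMA" inputs are made up for illustration and are not from the question. It also shows one caveat: because `[, ]*` may match nothing, a town whose name simply ends in the state's letters is trimmed as well, so requiring at least one separator with `[, ]+` may be the safer variant, depending on your data.

```r
# Hypothetical helper (not part of any package): the same substitution in base R.
strip_state <- function(town, state, pattern = "[, ]*%s$") {
  # sub() is not vectorised over its pattern, so pair each town with its state
  mapply(function(twn, st) sub(sprintf(pattern, st), "", twn),
         town, state, USE.NAMES = FALSE)
}

towns  <- c("BOSTON MA", "NORTH AMAMS", "CHICAGO, IL")
states <- c("MA", "MA", "IL")
strip_state(towns, states)
# "BOSTON" "NORTH AMAMS" "CHICAGO"

# Caveat: with "*" the separator is optional, so a trailing "MA" inside a word is cut too
strip_state("SELMA", "MA")                         # "SEL"
strip_state("SELMA", "MA", pattern = "[, ]+%s$")   # "SELMA"
strip_state("BOSTON MA", "MA", pattern = "[, ]+%s$")  # "BOSTON"
```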
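As for the error itself, my reading is that na.omit() is the culprit in both examples: on a row whose value is NA it returns a zero-length vector, so the expression inside mutate() yields nothing for that row, while a rowwise mutate() expects exactly one value. The base R sketch below illustrates this and shows two NA-tolerant alternatives. The names drop_word() and lookup are hypothetical, and lookup is a two-row stand-in for the zipcode data set, so treat it as a sketch under those assumptions and not as the definitive fix.

```r
# Why the error appears: na.omit() on a lone NA leaves nothing behind.
is_state <- NA_integer_            # what match() returned for row 2
length(is_state)                   # 1
length(na.omit(is_state))          # 0

words <- c("NORTH", "AMAMS")
words[-na.omit(is_state)]          # character(0): no value left for this row

# Example 2: branch on is.na() instead of dropping the NA.
# drop_word() is a hypothetical helper, not part of dplyr.
drop_word <- function(words, i) {
  if (is.na(i)) words else words[-i]
}
paste(drop_word(words, is_state), collapse = " ")         # "NORTH AMAMS"
paste(drop_word(c("BOSTON", "MA"), 2L), collapse = " ")   # "BOSTON"

# Example 1: a vectorised lookup with match() returns NA for an NA zip.
# lookup is a small stand-in for the zipcode data (assumption).
lookup <- data.frame(zip  = c("02345", "02201"),
                     city = c("Manomet", "Boston"),
                     stringsAsFactors = FALSE)
zips <- c("02345", "02201", NA)
town <- lookup$city[match(zips, lookup$zip)]
town                               # "Manomet" "Boston" NA
```

Neither alternative needs rowwise(): match() works on the whole column at once, and drop_word() can be applied per row with mapply() or inside a rowwise() pipeline.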


Source: https://habr.com/ru/post/988828/

