Retrieve the nth item from the nested list following strsplit - R

I was trying to figure out a better way to deal with strsplit output. I often have data that I would like to split:

    mydata <- c("144/4/5", "154/2", "146/3/5", "142", "143/4", "DNB", "90")
    #[1] "144/4/5" "154/2"   "146/3/5" "142"     "143/4"   "DNB"     "90"

After splitting, the results are as follows:

    strsplit(mydata, "/")
    #[[1]]
    #[1] "144" "4"   "5"
    #[[2]]
    #[1] "154" "2"
    #[[3]]
    #[1] "146" "3"   "5"
    #[[4]]
    #[1] "142"
    #[[5]]
    #[1] "143" "4"
    #[[6]]
    #[1] "DNB"
    #[[7]]
    #[1] "90"

I know from the strsplit reference manual that trailing empty strings are not produced. Therefore, each of my results will contain 1, 2 or 3 elements, depending on the number of "/" separators in the input string.

Getting the first element is trivial:

    sapply(strsplit(mydata, "/"), "[[", 1)
    #[1] "144" "154" "146" "142" "143" "DNB" "90"

But I'm not sure how to get the 2nd, 3rd ... when each result has an unequal number of elements.

    sapply(strsplit(mydata, "/"), "[[", 2)
    # Error in FUN(X[[4L]], ...) : subscript out of bounds

I would like a working solution to return the following:

 #[1] "4" "2" "3" "NA" "4" "NA" "NA" 

This is a relatively small example. I could easily write a for loop for this data, but for real data, with thousands of observations to run strsplit on and dozens of elements produced from each, I was hoping to find a more general solution.

+5
4 answers

At least for 1-D vectors, [ seems to return NA when i > length(x), whereas [[ raises an error.

    x = runif(5)
    x[6]
    #[1] NA
    x[[6]]
    #Error in x[[6]] : subscript out of bounds
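Applied to the question, this difference means that swapping "[[" for "[" inside sapply should be enough to pull out the nth piece without an error. The following is a minimal sketch using the mydata vector from the question; the names pieces, second and third are only illustrative.

```r
# Data from the question
mydata <- c("144/4/5", "154/2", "146/3/5", "142", "143/4", "DNB", "90")

pieces <- strsplit(mydata, "/")

# "[" pads with NA when the index is past the end of a vector,
# so short results do not raise "subscript out of bounds"
second <- sapply(pieces, "[", 2)
third  <- sapply(pieces, "[", 3)

second
#[1] "4" "2" "3" NA  "4" NA  NA
third
#[1] "5" NA  "5" NA  NA  NA  NA
```

Note that the missing values are real NA values, not the string "NA".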

Digging a bit, do_subset_dflt (i.e. [ ) calls ExtractSubset, where we can see that NA is returned when the requested index ("ii") is greater than length(x) (slightly modified for clarity):

    if(0 <= ii && ii < nx && ii != NA_INTEGER)
        result[i] = x[ii];
    else
        result[i] = NA_INTEGER;

On the other hand, do_subset2_dflt (i.e. [[ ) raises an error if the requested index ("offset") is greater than length(x) (slightly modified for clarity):

    if(offset < 0 || offset >= xlength(x)) {
        if(offset < 0 && (isNewList(x)) ...
        else errorcall(call, R_MSG_subs_o_b);
    }

where #define R_MSG_subs_o_b _("subscript out of bounds")

(I'm not sure about the above code snippets, but they seem relevant based on their results)

+4

Try the following:

    > read.table(text = mydata, sep = "/", as.is = TRUE, fill = TRUE)
       V1 V2 V3
    1 144  4  5
    2 154  2 NA
    3 146  3  5
    4 142 NA NA
    5 143  4 NA
    6 DNB NA NA
    7  90 NA NA

If you want to treat DNB as NA, add the argument na.strings="DNB".
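As a rough sketch of that variant, again using mydata from the question (the name df is only illustrative): with "DNB" declared as a missing-value marker, the first column should no longer have to be character.

```r
# Data from the question
mydata <- c("144/4/5", "154/2", "146/3/5", "142", "143/4", "DNB", "90")

# fill = TRUE pads short rows; na.strings = "DNB" turns that marker into NA
df <- read.table(text = mydata, sep = "/", as.is = TRUE, fill = TRUE,
                 na.strings = "DNB")

df$V2        # second piece of every input string, NA where it is absent
str(df)      # inspect the column types that read.table inferred
```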

If you really want to use strsplit, try the following:

    > do.call(rbind, lapply(strsplit(mydata, "/"), function(x) head(c(x,NA,NA), 3)))
         [,1]  [,2] [,3]
    [1,] "144" "4"  "5"
    [2,] "154" "2"  NA
    [3,] "146" "3"  "5"
    [4,] "142" NA   NA
    [5,] "143" "4"  NA
    [6,] "DNB" NA   NA
    [7,] "90"  NA   NA

Note: Using alexis_laz's observation that x[i] returns NA if i is not in 1:length(x), the last line of code above can be simplified to:

 t(sapply(strsplit(mydata, "/"), "[", 1:3)) 
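The 1:3 above hard-codes the maximum number of pieces. For real data where that number is not known in advance, one possible generalization is to compute it first. This is only a sketch; the names pieces, width and parts are illustrative, and lengths() assumes R 3.2.0 or later.

```r
# Data from the question
mydata <- c("144/4/5", "154/2", "146/3/5", "142", "143/4", "DNB", "90")

pieces <- strsplit(mydata, "/")

# Largest number of pieces found in any element
width <- max(lengths(pieces))

# One row per input string, one column per position, NA-padded
parts <- t(sapply(pieces, "[", seq_len(width)))

parts[, 2]
#[1] "4" "2" "3" NA  "4" NA  NA
```

Any position can then be read off as a column of parts instead of calling sapply once per position.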
+3

You can use regex (if allowed)

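    library(stringr)
    # Older stringr versions needed the pattern wrapped in perl() for lookbehind;
    # current versions accept the pattern directly
    str_extract(mydata, "(?<=\\d/)\\d+")
    #[1] "4" "2" "3" NA  "4" NA  NA
    str_extract(mydata, "(?<=/\\d/)\\d+")
    #[1] "5" NA  "5" NA  NA  NA  NA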
+1

You can assign the length inside sapply; the result is NA wherever the current length is shorter than the assigned length.

    s <- strsplit(mydata, "/")
    sapply(s, function(x) { length(x) <- 3; x[2] })
    # [1] "4" "2" "3" NA  "4" NA  NA

Then you can add a second indexing argument with mapply

    m <- max(sapply(s, length))
    mapply(function(x, y, z) { length(x) <- z; x[y] }, s, 2, m)
    # [1] "4" "2" "3" NA  "4" NA  NA
0

Source: https://habr.com/ru/post/1201576/