R: split text with multiple regex patterns and exceptions

I want to split the elements of the character vector text into sentences. There is more than one splitting pattern ( "and/ERT" , "/$" ). There are also exceptions to the patterns ( :/$. , and/ERT then , ./$. Smiley ).

My attempt: match the cases where a split should occur, insert an unusual marker ( "^&*" ) at that position, then run strsplit on that marker.

Problem: I do not know how to handle the exceptions properly. In certain explicit cases the marker ( "^&*" ) has to be removed and the original text restored before running strsplit .

code:

    text <- c(
      "This are faulty propositions one and/ERT two ,/$, which I want to split ./$. There are cases where I explicitly want and/ERT some where I don't want to split ./$. For example :/$. when there is an and/ERT then I don't want to split ./$. This is also one case where I dont't want to split ./$. Smiley !/$. Thank you ./$!",
      "This are the same faulty propositions one and/ERT two ,/$, which I want to split ./$. There are cases where I explicitly want and/ERT some where I don't want to split ./$. For example :/$. when there is an and/ERT then I don't want to split ./$. This is also one case where I dont't want to split ./$. Smiley !/$. Thank you ./$!",
      "Like above the same faulty propositions one and/ERT two ,/$, which I want to split ./$. There are cases where I explicitly want and/ERT some where I don't want to split ./$. For example :/$. when there is an and/ERT then I don't want to split ./$. This is also one case where I dont't want to split ./$. Smiley !/$. Thank you ./$!"
    )

    patternSplit <- c("and/ERT", "/\\$")
    # The class of split cases is much larger than in this example,
    # so it is not possible to address them explicitly.
    patternSplit <- paste("(", paste(patternSplit, collapse = "|"), ")", sep = "")

    exceptionsSplit <- c("\\:/\\$\\.", "and/ERT then", "\\./\\$\\. Smiley")
    exceptionsSplit <- paste("(", paste(exceptionsSplit, collapse = "|"), ")", sep = "")

    # Without exceptions, this works. Unfortunately it splits "*/$*" into
    # "*" and "/$*". It would be convenient to avoid this; see the "ideal"
    # split below.
    textsplitted <- strsplit(gsub(patternSplit, "^&*\\1", text), "^&*", fixed = TRUE)

    # Ideal split:
    # > textsplitted
    # [[1]]
    # [1] "This are faulty propositions one and/ERT"
    # [2] "two ,/$,"
    # [3] "which I want to split ./$."
    # [4] "There are cases where I explicitly want and/ERT"
    # [5] "some where I don't want to split ./$."
    # [6] "For example :/$. when there is an and/ERT then I don't want to split ./$."
    # [7] "This is also one case where I dont't want to split ./$. Smiley !/$."
    # [8] "Thank you ./$!"
    #
    # [[2]]
    # [1] "This are the same faulty propositions one and/ERT"
    # [2] "two ,/$,"
    # ...

    # This attempt doesn't work!
    text <- gsub(patternSplit, "^&*\\1", text)
    # Pseudocode: the replacement should be the original text without "^&*".
    # text <- gsub(exceptionsSplit, [original text without "^&*"], text)
    textsplitted <- strsplit(text, "^&*", fixed = TRUE)
1 answer

I think you can use an expression like this to get the desired splits. Because strsplit consumes the characters it splits on, you have to split on the spaces that follow the things to match or not match (which is what your desired output in the question shows):

    strsplit(text[[1]],
             "(?<=and/ERT)\\s(?!then)|(?<=/\\$[[:punct:]])(?<!:/\\$[[:punct:]])\\s(?!Smiley)",
             perl = TRUE)
    #[[1]]
    #[1] "This are faulty propositions one and/ERT"
    #[2] "two ,/$,"
    #[3] "which I want to split ./$."
    #[4] "There are cases where I explicitly want and/ERT"
    #[5] "some where I don't want to split ./$."
    #[6] "For example :/$. when there is an and/ERT then I don't want to split ./$."
    #[7] "This is also one case where I dont't want to split ./$. Smiley !/$."
    #[8] "Thank you ./$!"
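
The call above handles only the first element, but strsplit is vectorized, so the same expression can be applied to the whole vector in one go. A minimal sketch, assuming a shortened stand-in input (sample_text, split_rx and pieces are illustrative names, not from the question):

```r
# Hypothetical shortened input; the tags mimic the ones in the question.
sample_text <- c(
  "A and/ERT B ,/$, C ./$. D :/$. E and/ERT then F ./$. Smiley !/$. G ./$!",
  "H and/ERT then I ./$. J ./$!"
)

# Same expression as in the answer, stored once for reuse.
split_rx <- "(?<=and/ERT)\\s(?!then)|(?<=/\\$[[:punct:]])(?<!:/\\$[[:punct:]])\\s(?!Smiley)"

# One list element per input string.
pieces <- strsplit(sample_text, split_rx, perl = TRUE)

lengths(pieces)
# 5 2
```

Each list element holds the pieces of the corresponding input string, so the result can be passed to lapply or unlist as needed.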

Explanation

  • (?<=and/ERT)\\s - split on a space, \\s , that IS preceded, (?<=...) , by "and/ERT"
  • (?!then) - BUT only if that space is NOT followed, (?!...) , by "then"
  • | - OR operator to chain the next expression
  • (?<=/\\$[[:punct:]]) - a positive lookbehind assertion for "/$" followed by any punctuation character
  • (?<!:/\\$[[:punct:]])\\s(?!Smiley) - matches a space that is NOT preceded by ":/$"[[:punct:]] (but, per the previous point, IS preceded by "/$[[:punct:]]" ) and is NOT followed, (?!...) , by "Smiley"
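
If the list of patterns and exceptions is too long to maintain as one lookaround expression, the marker idea from the question can also be made to work. The following is only a sketch under stated assumptions, not the answer's method: every candidate split point is the space after "and/ERT" or after "/$" plus one punctuation character, the marker "^&*" never occurs in the text, and each exception is written as a literal string that includes the space to be kept. The names add_marker, exceptions and split_with_exceptions are hypothetical.

```r
marker <- "^&*"  # placeholder from the question; assumed absent from the text

# Assumption: a split point is the space after "and/ERT" or after "/$" plus
# one punctuation character.
split_after <- "(and/ERT|/\\$[[:punct:]])\\s"

# Replace the space at every candidate split point with the marker.
add_marker <- function(x) {
  gsub(split_after, paste0("\\1", marker), x, perl = TRUE)
}

# Exceptions as literal strings (assumption: each one spans the space that
# must survive, hence the trailing space in the first entry).
exceptions <- c(":/$. ", "and/ERT then", "./$. Smiley")

split_with_exceptions <- function(x) {
  marked <- add_marker(x)
  for (e in exceptions) {
    # add_marker(e) is what the exception looks like once marked;
    # swapping it back for e restores the original text at that spot.
    marked <- gsub(add_marker(e), e, marked, fixed = TRUE)
  }
  strsplit(marked, marker, fixed = TRUE)
}

res <- split_with_exceptions(
  "A and/ERT B ,/$, C :/$. D and/ERT then E ./$. Smiley !/$. F ./$!"
)
res[[1]]
```

Because the marked form of each exception is produced by the same function that marks the text, new split patterns or exceptions only need to be added in one place.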

Source: https://habr.com/ru/post/953473/

