I would say that the simplest and probably cleanest way to achieve what you want is to focus on finding the data you want in the result array, instead of finding the places where the text should be split.
I say this because split introduces a lot of problems, for example:
- `split(" +|(?=\\p{Punct})")` splits only on spaces and at positions before a punctuation character, which means that text like `"abc" def` is split into `"abc`, `"` and `def`. As you can see, it does not split after the `"` in `"abc`.
- The previous problem can easily be solved by adding another condition, `|(?<=\\p{Punct})`, as in `split(" +|(?=\\p{Punct})|(?<=\\p{Punct})")`, but that still does not solve all your problems because of `...`. We would need to find a way to prevent splitting between these dots: `.|.|.`.
  - To do that, we could try to exclude `.` from `\p{Punct}` and handle it separately, but that would make our regex quite complicated.
  - Another way is to replace `...` with some unique string, add that string to our split logic, and afterwards replace it back with `...` in the result array. But this approach also requires us to know a string that can never occur in your text, so we would need to generate one every time we parse a text.
- Another possible problem is that the regex engine before Java 8 generates an empty element at the beginning of the result array if a punctuation character, such as `"`, is the first character. Thus, in Java 7 the string `"foo" bar` split on `(?=\p{Punct})` results in three elements: an empty string, `"foo` and `" bar`. To avoid this problem, you need to add a condition such as `(?!^)` to the regex to prevent splitting at the start of the string.
In any case, these solutions look overly complicated.
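To make the split pitfalls above concrete, here is a small sketch. The class and method names (`SplitPitfalls`, `lookaheadOnly`, `bothSides`) are made up for illustration, and the results in the comments assume Java 8 or later; on Java 7 an extra empty first element would appear whenever the text starts with punctuation.

```java
import java.util.Arrays;

class SplitPitfalls {
    // Splits on runs of spaces and at positions *before* punctuation only.
    static String[] lookaheadOnly(String text) {
        return text.split(" +|(?=\\p{Punct})");
    }

    // Additionally splits at positions *after* punctuation.
    static String[] bothSides(String text) {
        return text.split(" +|(?=\\p{Punct})|(?<=\\p{Punct})");
    }

    public static void main(String[] args) {
        // The opening quote stays glued to abc: ["abc, ", def]
        System.out.println(Arrays.toString(lookaheadOnly("\"abc\" def")));

        // The quote is now separated: [", abc, ", def]
        System.out.println(Arrays.toString(bothSides("\"abc\" def")));

        // But the ellipsis is torn into single dots: [from, ., ., .]
        System.out.println(Arrays.toString(bothSides("from...")));
    }
}
```

Each fix for one case tends to break another, which is why the regex keeps growing.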
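So, instead of the `split` method, consider using the `find` method of the `Matcher` class and focus on what you want to have in the result array.
Try a pattern like this: `[.]{3}|\p{Punct}|[\S&&\P{Punct}]+`. Here `[.]{3}` matches `...`, `\p{Punct}` matches a single punctuation character, and `[\S&&\P{Punct}]+` matches one or more characters that are neither whitespace nor punctuation. The alternatives are tried from left to right, so `[.]{3}` has to come first; otherwise `\p{Punct}` would match each dot separately.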
Demo:
    String sentence = "In (the) preceding examples, classes derived from...";
    Pattern p = Pattern.compile("[.]{3}|\\p{Punct}|[\\S&&\\P{Punct}]+");
    Matcher m = p.matcher(sentence);
    while (m.find()) {
        System.out.println(m.group());
    }
Output:
    In
    (
    the
    )
    preceding
    examples
    ,
    classes
    derived
    from
    ...
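If you need the tokens in a collection rather than printed, the same loop can be wrapped in a helper. This is only a sketch: the `Tokenizer` class and `tokenize` method are hypothetical names, not part of any library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class Tokenizer {
    // Same pattern as in the demo: an ellipsis, a single punctuation
    // character, or a run of non-whitespace, non-punctuation characters.
    private static final Pattern TOKEN =
            Pattern.compile("[.]{3}|\\p{Punct}|[\\S&&\\P{Punct}]+");

    // Collects every match of TOKEN, in order of appearance.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String sentence = "In (the) preceding examples, classes derived from...";
        // prints [In, (, the, ), preceding, examples, ,, classes, derived, from, ...]
        System.out.println(tokenize(sentence));
    }
}
```

If an array is required, `tokenize(sentence).toArray(new String[0])` converts the list.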