How to split a string, keeping punctuation as separate elements?

I need to split a string (in Java) so that punctuation marks are stored in the same array as the words:

    String sentence = "In the preceding examples, classes derived from...";
    String[] split = sentence.split(" ");

I need the resulting array to be:

    split[0] - "In"
    split[1] - "the"
    split[2] - "preceding"
    split[3] - "examples"
    split[4] - ","
    split[5] - "classes"
    split[6] - "derived"
    split[7] - "from"
    split[8] - "..."

Is there an elegant solution?

+6
7 answers

You need lookarounds:

 String[] split = sentence.split(" ?(?<!\\G)((?<=[^\\p{Punct}])(?=\\p{Punct})|\\b) ?"); 

Lookarounds assert that something precedes or follows the current position but (importantly here) do not consume input when matching.
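
To make the "do not consume" point concrete, here is a minimal sketch comparing a consuming split with lookaround-based splits. The class and method names are made up for illustration, and the input is a simplified fragment rather than the full sentence from the question.

```java
import java.util.Arrays;

class LookaroundSplitDemo {
    // Splitting on the comma itself consumes it, so it disappears from the result.
    static String[] consuming(String s) {
        return s.split(",");
    }

    // A lookahead only asserts that a comma follows; the comma stays in the next token.
    static String[] beforeComma(String s) {
        return s.split("(?=,)");
    }

    // Lookahead plus lookbehind: split on both sides of the comma, consuming nothing.
    static String[] aroundComma(String s) {
        return s.split("(?=,)|(?<=,)");
    }

    public static void main(String[] args) {
        String s = "examples,classes";
        System.out.println(Arrays.toString(consuming(s)));   // [examples, classes]
        System.out.println(Arrays.toString(beforeComma(s))); // [examples, ,classes]
        System.out.println(Arrays.toString(aroundComma(s))); // [examples, ,, classes]
    }
}
```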


Some test code:

    String sentence = "Foo bar, baz! Who? Me...";
    String[] split = sentence.split(" ?(?<!\\G)((?<=[^\\p{Punct}])(?=\\p{Punct})|\\b) ?");
    Arrays.stream(split).forEach(System.out::println);

Output:

    Foo
    bar
    ,
    baz
    !
    Who
    ?
    Me
    ...
+2

You can try first replacing the three dots with a single ellipsis character:

    String sentence = "In the preceding examples, classes derived from...";
    String[] split = sentence.replace("...", "…").split(" +|(?=,|\\p{Punct}|…)");

After that, you can leave it as is or convert it back by running replace("…", "...") on each element of the array.
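
A possible sketch of the full round trip, including the conversion back. The class and method names are hypothetical, the ellipsis character is written as the escape \u2026 to avoid source-encoding problems, and it assumes the input does not already contain a real ellipsis character (which would also be turned into three dots). Java 8 or later is assumed for the split behaviour.

```java
import java.util.Arrays;

class EllipsisRoundTrip {
    static String[] split(String sentence) {
        // 1. Collapse "..." into one character so it cannot be split apart.
        // 2. Split on spaces, or just before a punctuation mark or the ellipsis.
        String[] parts = sentence
                .replace("...", "\u2026")
                .split(" +|(?=,|\\p{Punct}|\u2026)");
        // 3. Convert the placeholder back in every element.
        for (int i = 0; i < parts.length; i++) {
            parts[i] = parts[i].replace("\u2026", "...");
        }
        return parts;
    }

    public static void main(String[] args) {
        String sentence = "In the preceding examples, classes derived from...";
        // [In, the, preceding, examples, ,, classes, derived, from, ...]
        System.out.println(Arrays.toString(split(sentence)));
    }
}
```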

+1

I believe this method will do what you want

    public static List<String> split(String str) {
        Pattern pattern = Pattern.compile("(\\w+)|(\\.{3})|[^\\s]");
        Matcher matcher = pattern.matcher(str);
        List<String> list = new ArrayList<String>();
        while (matcher.find()) {
            list.add(matcher.group());
        }
        return list;
    }

It will split the string into

  • Runs of consecutive word characters
  • An ellipsis ...
  • Any other single non-whitespace character

In this example

 "In the preceding examples, classes.. derived from... Hello, World! foo!bar" 

the list will be

    [0] In
    [1] the
    [2] preceding
    [3] examples
    [4] ,
    [5] classes
    [6] .
    [7] .
    [8] derived
    [9] from
    [10] ...
    [11] Hello
    [12] ,
    [13] World
    [14] !
    [15] foo
    [16] !
    [17] bar
+1

I would say that the easiest and possibly the cleanest way to achieve what you want is to focus on finding the tokens you want in the array, instead of finding the places where the text should be split.

I say this because split introduces a lot of problems, for example:

  • split(" +|(?=\\p{Punct})"); will split only on spaces and at the position before a punctuation character, which means that text like "abc" def will be split into "abc , " and def . So, as you can see, it does not split after the " in "abc .

  • the previous problem can be easily solved by adding another condition |(?<=\\p{Punct}) , like split(" +|(?=\\p{Punct})|(?<=\\p{Punct})") , but that still does not solve all your problems because of ... . So we need to figure out a way to prevent splitting between the dots, as in .|.|. .

    • To do this, we could try to exclude . from \p{Punct} and handle it separately, but that would make our regex pretty complicated.
    • Another way is to replace ... with some unique string, add that string to our split logic, and afterwards replace it back with ... in the result array. But this approach also requires us to know a string that can never occur in your text, so we would need to generate one every time we parse the text.
  • Another possible problem is that the regex engine before Java 8 generates an empty element at the beginning of the result array if a punctuation character such as " is the first character. Thus, in Java 7 the string "foo" bar split on (?=\p{Punct}) results in the elements [ , "foo, " bar] . To avoid this problem, you need to add something like (?!^) to the regex to prevent splitting at the beginning of the string.
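
The first two problems can be reproduced with a short sketch like the one below. The class and method names are made up for illustration, and the results in the comments assume Java 8 or later, where a zero-width match at the start of the input does not produce a leading empty element.

```java
import java.util.Arrays;

class SplitPitfalls {
    // Splits on spaces and *before* punctuation only.
    static String[] lookaheadOnly(String s) {
        return s.split(" +|(?=\\p{Punct})");
    }

    // Also splits *after* punctuation, which breaks "..." into single dots.
    static String[] lookaheadAndLookbehind(String s) {
        return s.split(" +|(?=\\p{Punct})|(?<=\\p{Punct})");
    }

    public static void main(String[] args) {
        String quoted = "\"abc\" def";
        // ["abc, ", def]  (the opening quote stays glued to abc)
        System.out.println(Arrays.toString(lookaheadOnly(quoted)));
        // [", abc, ", def]
        System.out.println(Arrays.toString(lookaheadAndLookbehind(quoted)));
        // [from, ., ., .]  (the ellipsis is torn apart)
        System.out.println(Arrays.toString(lookaheadAndLookbehind("from...")));
    }
}
```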

In any case, these solutions look overcomplicated.


So, instead of the split method, consider the find method from the Matcher class and focus on what you want in the result array.

Try a pattern like this: [.]{3}|\p{Punct}|[\S&&\P{Punct}]+

  • [.]{3} will match ...
  • \p{Punct} will match one punctuation character, which according to the documentation is one of !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
  • [\S&&\P{Punct}]+ will match one or more characters that are
    • \S not whitespace
    • && and
    • \P{Punct} not punctuation ( \P{foo} is the negation of \p{foo} ).

Demo:

    String sentence = "In (the) preceding examples, classes derived from...";
    Pattern p = Pattern.compile("[.]{3}|\\p{Punct}|[\\S&&\\P{Punct}]+");
    Matcher m = p.matcher(sentence);
    while (m.find()) {
        System.out.println(m.group());
    }

Output:

    In
    (
    the
    )
    preceding
    examples
    ,
    classes
    derived
    from
    ...
+1

You can sanitize the string first by replacing, for example, "," with " ," and so on for every punctuation mark you want to separate.

In the special case of "..." you can do:

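    // There can be a series of dots, so put a space before each run of dots
    // (chaining replace(".", " .").replace(". .", "..") would turn "..." into " .. .")
    sentence = sentence.replaceAll("\\.+", " $0");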

Then you split on spaces.
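
Put together, the sanitize-then-split idea might look like the sketch below. This is one possible reading of the approach, not the exact code of this answer: the names are hypothetical, punctuation is padded with spaces on both sides so that cases like foo!bar also work, and any run of dots (not only exactly three) is kept as one token.

```java
import java.util.Arrays;

class PadAndSplit {
    static String[] tokenize(String sentence) {
        // Surround each run of dots, and each other punctuation character,
        // with spaces; $0 is the whole match.
        String padded = sentence.replaceAll("\\.+|[\\p{Punct}&&[^.]]", " $0 ");
        // Collapse the extra spaces while splitting.
        return padded.trim().split(" +");
    }

    public static void main(String[] args) {
        // [In, the, preceding, examples, ,, classes, derived, from, ...]
        System.out.println(Arrays.toString(
                tokenize("In the preceding examples, classes derived from...")));
        // [Hello, ,, World, !, foo, !, bar]
        System.out.println(Arrays.toString(tokenize("Hello, World! foo!bar")));
    }
}
```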

EDIT: replaced single quotes with double quotes.

0

For your particular case, there are two main problems: ordering (for example, whether the punctuation comes first and then the word, or vice versa) and the ... punctuation mark.

Otherwise, you can easily implement it using

 \p{Punct} 

like this:

    Pattern.compile("\\p{Punct}");

Regarding these two issues:

1. Ordering: you can try the following:

    private static final Pattern punctuation = Pattern.compile("\\p{Punct}");
    // A run of non-punctuation characters (a single-character "\\w" would yield one entry per letter)
    private static final Pattern word = Pattern.compile("[^\\p{Punct}]+");

    public static void main(String[] args) {
        String sentence = "In the preceding examples, classes derived from...";
        String[] split = sentence.split(" ");
        List<String> result = new LinkedList<>();
        for (String s : split) {
            List<String> withMarks = splitWithPunctuationMarks(s);
            result.addAll(withMarks);
        }
    }

    private static List<String> splitWithPunctuationMarks(String s) {
        // TreeMap keeps the entries sorted by their position in s
        Map<Integer, String> positionToString = new TreeMap<>();
        Matcher punctMatcher = punctuation.matcher(s);
        while (punctMatcher.find()) {
            positionToString.put(punctMatcher.start(), punctMatcher.group());
        }
        // Same as before, this time for words
        Matcher wordMatcher = word.matcher(s);
        while (wordMatcher.find()) {
            positionToString.put(wordMatcher.start(), wordMatcher.group());
        }
        // positionToString.values() now contains the ordered
        // words and punctuation characters
        return new ArrayList<>(positionToString.values());
    }
2. The ... mark: every time you find a . character, you can look back at (currentIndex - 1) to check whether the previous character was also a . and, if so, treat both as part of the same token.
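
A rough sketch of how the look-back could be combined with the position map from point 1. All names here are hypothetical, the word pattern is an assumption (any run of non-punctuation characters), and the input is expected to be one whitespace-free chunk, as produced by split(" ").

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EllipsisLookBack {
    private static final Pattern PUNCT = Pattern.compile("\\p{Punct}");
    private static final Pattern WORD = Pattern.compile("[^\\p{Punct}]+");

    static List<String> splitChunk(String s) {
        Map<Integer, String> byPosition = new TreeMap<>();
        int runStart = -1; // position where the current run of dots began
        Matcher p = PUNCT.matcher(s);
        while (p.find()) {
            int i = p.start();
            boolean isDot = s.charAt(i) == '.';
            if (isDot && i > 0 && s.charAt(i - 1) == '.') {
                // Previous character was a dot too: extend that token.
                byPosition.merge(runStart, ".", String::concat);
            } else {
                if (isDot) {
                    runStart = i;
                }
                byPosition.put(i, p.group());
            }
        }
        Matcher w = WORD.matcher(s);
        while (w.find()) {
            byPosition.put(w.start(), w.group());
        }
        return new ArrayList<>(byPosition.values());
    }

    public static void main(String[] args) {
        System.out.println(splitChunk("from..."));   // [from, ...]
        System.out.println(splitChunk("examples,")); // [examples, ,]
    }
}
```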
0

Another example. This solution probably works for all combinations.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class App {
        public static void main(String[] args) {
            String sentence = "In the preceding examples, classes derived from...";
            List<String> list = splitWithPunctuation(sentence);
            System.out.println(list);
        }

        public static List<String> splitWithPunctuation(String sentence) {
            // One or more characters that are not letters, digits or whitespace
            Pattern p = Pattern.compile("([^a-zA-Z\\d\\s]+)");
            String[] split = sentence.split(" ");
            List<String> list = new ArrayList<>();
            for (String s : split) {
                Matcher matcher = p.matcher(s);
                boolean found = false;
                int i = 0;
                while (matcher.find()) {
                    found = true;
                    // Skip the empty prefix when the token starts with punctuation, e.g. "(the)"
                    if (matcher.start() > i)
                        list.add(s.substring(i, matcher.start()));
                    list.add(s.substring(matcher.start(), matcher.end()));
                    i = matcher.end();
                }
                if (found) {
                    if (i < s.length())
                        list.add(s.substring(i, s.length()));
                } else
                    list.add(s);
            }
            return list;
        }
    }

Output:

    [In, the, preceding, examples, ,, classes, derived, from, ...]

More complex example:

    String sentence = "In the preced^^^in## examp!les, classes derived from...";
    List<String> list = splitWithPunctuation(sentence);
    System.out.println(list);

Output:

    [In, the, preced, ^^^, in, ##, examp, !, les, ,, classes, derived, from, ...]
0

Source: https://habr.com/ru/post/985971/

