Java Pattern.split() with overlapping delimiters

Question

First, I am aware of similar questions that have been asked here, such as:

How to split a string, but also keep the delimiters?

However, I have run into a problem when splitting a string with Pattern.split() where the pattern is built from a list of delimiters that can sometimes overlap. Here is an example:

The goal is to split a string on a set of known codewords, each surrounded by slashes, keeping both the delimiter (codeword) and the value after it (which may be an empty string).

In this example, the codewords are:

/ABC/
/DEF/
/GHI/

Based on the thread mentioned above, the pattern is constructed as follows, using look-ahead and look-behind to tokenize the string into both codewords AND values:

((?<=/ABC/)|(?=/ABC/))|((?<=/DEF/)|(?=/DEF/))|((?<=/GHI/)|(?=/GHI/))

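Working string:

"123/ABC//DEF/456/GHI/789"

When split, this yields:

"123","/ABC/","/DEF/","456","/GHI/","789"

Broken string (note the single slash between "ABC" and "DEF"):

"123/ABC/DEF/456/GHI/789"

The expectation is that "DEF/456" is the value after the "/ABC/" codeword, because "DEF/" is not actually a codeword, it only looks like one!

Desired result:

"123","/ABC/","DEF/456","/GHI/","789"

Actual result:

"123","/ABC","/","DEF/","456","/GHI/","789"

As you can see, the slash between "ABC" and "DEF" gets isolated.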
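
The behavior can be reproduced with a short program. This is only a sketch: the class name OverlapDemo and the helpers buildPattern and split are made up for illustration, and the codewords are assumed to contain no regex metacharacters.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class OverlapDemo {

    // Builds ((?<=cw)|(?=cw)) for each codeword and joins the groups with "|".
    // Assumption: codewords contain no regex metacharacters, so no quoting is done.
    static Pattern buildPattern(String... codewords) {
        String regex = Arrays.stream(codewords)
                .map(cw -> "((?<=" + cw + ")|(?=" + cw + "))")
                .collect(Collectors.joining("|"));
        return Pattern.compile(regex);
    }

    // Splits the input with the pattern built from the three example codewords.
    static List<String> split(String input) {
        return Arrays.asList(buildPattern("/ABC/", "/DEF/", "/GHI/").split(input));
    }

    public static void main(String[] args) {
        // prints [123, /ABC/, /DEF/, 456, /GHI/, 789]
        System.out.println(split("123/ABC//DEF/456/GHI/789"));
        // prints [123, /ABC, /, DEF/, 456, /GHI/, 789]
        System.out.println(split("123/ABC/DEF/456/GHI/789"));
    }
}
```

In the second string the look-ahead for /DEF/ fires on the slash that closes /ABC/, and the look-behind for /ABC/ fires right after that same slash, which is why the slash ends up as a token of its own.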

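I have tried using only look-ahead OR only look-behind, as suggested in the thread above, but the problem remains. Any help would be appreciated!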

+4

java regex

Julian Mclean Dec 15 '16 at 15:02

3 answers

Try this:

String[] parts = s.split("(?<=/(ABC|DEF|GHI)/)(?<!/(ABC|DEF|GHI)/....)|(?=/(ABC|DEF|GHI)/)(?<!/(ABC|DEF|GHI))");

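The negative look-behinds suppress a split point when the codeword found there shares a slash / with the codeword before it.

Note that the .... in the first negative look-behind assumes that every codeword is exactly three letters long.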
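
To see how this regex behaves on the two strings from the question, here is a small sketch. The class name LookaroundSplitDemo and the split helper are hypothetical; the regex itself is copied unchanged from the line above.

```java
import java.util.Arrays;
import java.util.List;

class LookaroundSplitDemo {

    // First alternative: split after a codeword, unless that codeword begins
    // on the closing slash of the previous codeword (the "...." covers the
    // four characters a three-letter codeword adds after the shared slash).
    // Second alternative: split before a codeword, unless its leading slash
    // is the closing slash of the previous codeword.
    static final String REGEX =
            "(?<=/(ABC|DEF|GHI)/)(?<!/(ABC|DEF|GHI)/....)"
          + "|(?=/(ABC|DEF|GHI)/)(?<!/(ABC|DEF|GHI))";

    static List<String> split(String input) {
        return Arrays.asList(input.split(REGEX));
    }

    public static void main(String[] args) {
        // prints [123, /ABC/, /DEF/, 456, /GHI/, 789]
        System.out.println(split("123/ABC//DEF/456/GHI/789"));
        // prints [123, /ABC/, DEF/456, /GHI/, 789]
        System.out.println(split("123/ABC/DEF/456/GHI/789"));
    }
}
```

Because the look-behinds have a fixed width, codewords of other lengths would need the regex to be adjusted.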

+1

Bohemian Dec 15 '16 at 16:11

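Here is a step-by-step approach to this problem using TDD (Red-Green-Refactor):

Write the specs (Red)

First, the specs, which describe the expected behavior. They cover the two cases from the question plus several edge cases: a value containing a slash, an input without codewords, and codewords at the start or end of the input.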

import static org.assertj.core.api.Assertions.assertThat;

import java.util.List;

import org.junit.Test;

public class TokenizerSpec {

    Tokenizer tokenizer = new Tokenizer("/ABC/", "/DEF/", "/GHI/");

    @Test
    public void itShouldTokenizeTwoConsecutiveCodewords() {
        String input = "123/ABC//DEF/456";

        List<String> tokens = tokenizer.splitPreservingCodewords(input);

        assertThat(tokens).containsExactly("123", "/ABC/", "/DEF/", "456");
    }

    @Test
    public void itShouldTokenizeMisleadingCodeword() {
        String input = "123/ABC/DEF/456/GHI/789";

        List<String> tokens = tokenizer.splitPreservingCodewords(input);

        assertThat(tokens).containsExactly("123", "/ABC/", "DEF/456", "/GHI/", "789");
    }

    @Test
    public void itShouldTokenizeWhenValueContainsSlash() {
        String input = "1/23/ABC/456";

        List<String> tokens = tokenizer.splitPreservingCodewords(input);

        assertThat(tokens).containsExactly("1/23", "/ABC/", "456");
    }

    @Test
    public void itShouldTokenizeWithoutCodewords() {
        String input = "123/456/789";

        List<String> tokens = tokenizer.splitPreservingCodewords(input);

        assertThat(tokens).containsExactly("123/456/789");
    }

    @Test
    public void itShouldTokenizeWhenEndingWithCodeword() {
        String input = "123/ABC/";

        List<String> tokens = tokenizer.splitPreservingCodewords(input);

        assertThat(tokens).containsExactly("123", "/ABC/");
    }

    @Test
    public void itShouldTokenizeWhenStartingWithCodeword() {
        String input = "/ABC/123";

        List<String> tokens = tokenizer.splitPreservingCodewords(input);

        assertThat(tokens).containsExactly("/ABC/", "123");
    }

    @Test
    public void itShouldTokenizeWhenOnlyCodeword() {
        String input = "/ABC//DEF//GHI/";

        List<String> tokens = tokenizer.splitPreservingCodewords(input);

        assertThat(tokens).containsExactly("/ABC/", "/DEF/", "/GHI/");
    }
}

Write the implementation (Green)

This class passes all of the above tests.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

public final class Tokenizer {

    private final List<String> codewords;

    public Tokenizer(String... codewords) {
        this.codewords = Arrays.asList(codewords);
    }

    public List<String> splitPreservingCodewords(String input) {
        List<String> tokens = new ArrayList<>();

        int lastIndex = 0;
        int i = 0;
        while (i < input.length()) {
            final int idx = i;
            Optional<String> codeword = codewords.stream()
                                                 .filter(cw -> input.substring(idx).indexOf(cw) == 0)
                                                 .findFirst();
            if (codeword.isPresent()) {
                if (i > lastIndex) {
                    tokens.add(input.substring(lastIndex, i));
                }
                tokens.add(codeword.get());
                i += codeword.get().length();
                lastIndex = i;
            } else {
                i++;
            }
        }

        if (i > lastIndex) {
            tokens.add(input.substring(lastIndex, i));
        }

        return tokens;
    }
}

Improve the implementation (Refactor)

Not done at the moment (I don't have enough time to spend on this answer right now). I will be happy to refactor Tokenizer if you ask me (but later). :-) Or you can do it yourself with reasonable confidence, since you have unit tests to guard against regressions.
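
As an illustration of what such a refactoring might look like (a sketch only, not the answer author's version), the variant below replaces the stream and the repeated substring(...).indexOf(...) calls with String.startsWith(prefix, offset). The class name RefactoredTokenizer and the helper codewordAt are hypothetical, and codewords are assumed to be non-empty.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class RefactoredTokenizer {

    private final List<String> codewords;

    // Assumption: every codeword is non-empty; an empty codeword would
    // never advance the scan position.
    public RefactoredTokenizer(String... codewords) {
        this.codewords = Arrays.asList(codewords);
    }

    public List<String> splitPreservingCodewords(String input) {
        List<String> tokens = new ArrayList<>();
        int valueStart = 0; // start of the value currently being collected
        int i = 0;
        while (i < input.length()) {
            String codeword = codewordAt(input, i);
            if (codeword == null) {
                i++;
                continue;
            }
            if (i > valueStart) {
                tokens.add(input.substring(valueStart, i));
            }
            tokens.add(codeword);
            i += codeword.length(); // skip the whole codeword, including its closing slash
            valueStart = i;
        }
        if (valueStart < input.length()) {
            tokens.add(input.substring(valueStart));
        }
        return tokens;
    }

    // Returns the first codeword that starts at the given offset, or null.
    private String codewordAt(String input, int offset) {
        for (String cw : codewords) {
            if (input.startsWith(cw, offset)) {
                return cw;
            }
        }
        return null;
    }
}
```

The scanning logic is unchanged, so the specs above should apply to this class as well once the class name in the spec is swapped.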

0

Spotted Dec 16 '16 at 10:45

Patrick Parker · Accepted Answer · 2016-12-15T16:13:01+0000

Use find rather than split, for example:

import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class SampleJava {
    static final String[] CODEWORDS = {
        "ABC",
        "DEF",
        "GHI"};

    public static void main(String[] args) {
        String input = "/ABC/DEF/456/GHI/789";
        String codewords = Arrays.stream(CODEWORDS)
                .collect(Collectors.joining("|", "/(", ")/"));
        //     codewords = "/(ABC|DEF|GHI)/";
        Pattern p = Pattern.compile(
        /* codewords */ ("(DELIM)"
        /* pre-delim */ + "|(.+?(?=DELIM))"
        /* final bit */ + "|(.+?$)").replace("DELIM", codewords));
        Matcher m = p.matcher(input);
        while (m.find()) {
            System.out.print(m.group(0));
            // group 1 is set only when the codeword alternative matched
            if (m.group(1) != null) {
                System.out.print(" ← code word");
            }
            System.out.println();
        }
    }
}

Output:

/ABC/ ← code word
DEF/456
/GHI/ ← code word
789
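
If the tokens are needed as a list rather than printed, the same find-based idea can be written with a pattern that matches only the codewords, collecting the text between matches as values. This is a variation on the code above, not the answer's own code; the class name FindTokenizer and the method tokenize are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindTokenizer {

    private static final Pattern CODEWORD = Pattern.compile("/(ABC|DEF|GHI)/");

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = CODEWORD.matcher(input);
        int last = 0; // end of the previous codeword match
        while (m.find()) {
            if (m.start() > last) {
                tokens.add(input.substring(last, m.start())); // value before this codeword
            }
            tokens.add(m.group());
            last = m.end();
        }
        if (last < input.length()) {
            tokens.add(input.substring(last)); // trailing value after the last codeword
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [/ABC/, DEF/456, /GHI/, 789]
        System.out.println(tokenize("/ABC/DEF/456/GHI/789"));
    }
}
```

Each call to find resumes after the end of the previous match, so the closing slash of /ABC/ is already consumed and DEF/ cannot be matched as a codeword.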
