Splitting text into pieces (JavaScript, regex)

I am trying to split a text into several smaller pieces so that I can parse it, using JavaScript and regex. My best attempt so far is here:

https://regex101.com/r/jfzTlr/1

I have a set of rules to follow. I would like to get blocks. Each block starts with an asterisk (*) as the first character of a line (or, if the block is indented, preceded by a tab), followed by 2-3 capital letters, a comma, an optional space and a code, which can be A, R, T, RS or RSS. After that comes an optional dot, then a line break, and then the text. The text ends when the next asterisk appears, following the same pattern as above.
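As a rough illustration of these rules, the header line alone can be checked with a single anchored pattern. This is only a sketch: the name isBlockHeader and the decision to allow any mix of leading spaces and tabs are assumptions, not part of the question.

```javascript
// Hypothetical helper: tests whether one line looks like a block header.
// Assumption: indentation may be any run of spaces or tabs.
const HEADER = /^[ \t]*\*[A-Z]{2,3},\s?(?:[ART]|RSS?)\.?$/;

function isBlockHeader(line) {
  return HEADER.test(line);
}

console.log(isBlockHeader('*GW, A'));                 // true
console.log(isBlockHeader('\t*JP, R.'));              // true
console.log(isBlockHeader('*PA, RSX'));               // false (RSX is not a valid code)
console.log(isBlockHeader('    *asterics, without')); // false (lowercase after the asterisk)
```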

Can someone help me figure out how to split the text this way? This is my pattern so far:

[^\t](.{2,3}),\s?.{1,3}\.?\n.*

Thank you so much!

3 answers

Since you're going to use JavaScript, why not use split with a regex that contains capture groups? It returns both the captured parts of each header and the text between the headers. Then group each heading with its block in an array that looks like

[[heading1, block1], [heading2, block2], ...]

That way, you immediately get the data in a convenient format for further processing. Just an idea!
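To see why split fits here: when the separator regex contains capture groups, the captured text is kept in the result array alongside the pieces between matches. A minimal sketch with a made-up input string (the '#X#' separators are purely illustrative):

```javascript
// Made-up input: '#A#' and '#B#' play the role of headers.
const parts = 'intro#A#first#B#second'.split(/#([A-Z])#/);
console.log(parts); // [ 'intro', 'A', 'first', 'B', 'second' ]

// Skip the text before the first header, then pair each captured
// heading with the piece of text that follows it.
const pairs = [];
for (let i = 1; i < parts.length; i += 2) {
  pairs.push([parts[i], parts[i + 1]]);
}
console.log(pairs); // [ [ 'A', 'first' ], [ 'B', 'second' ] ]
```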

const s = `*GW, A
This is my very first line. The asterics defines a new block, followed by the initials (2-3 chars), a comma, a (possible) space and a code that could be A, R, T, RS or RSS. Followed by that is an optional dot. Linebreak afterwards, where the text comes.

	*JP, R.
	New block here, as the line (kind of) starts with an asterics. Indentations with 4 spaces or a tab means that it is a second level thing only, that does not need to be stripped away necessarily.

	But as you can see, a block can be devided into several
    lines, 

    even with multiple lines.

	*GML, T.
	And so we continue...

    Let just make sure that a line can start with an
    *asterics, without breaking the whole thing.
	*GW, RS
	Yet another block here.

		*GW, RSS.
		And a very final one.

        Spread over several lines.

*TA, RS.
First level all of a sudden again.
*PA, RSX
    Just a line to check whether RSX is a separate block.

`;
  
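// Split on each header line. The three capture groups (initials, code,
// optional dot) are kept in the result; slice(1) drops the text that
// precedes the first header.
const splits = s.split(/\*([A-Z]{2,3}),\s?([AT]|RS{0,2})(\.?)\n/).slice(1);

const grouped = [];

// Each block occupies four consecutive entries: initials, code, dot, body.
for (let i = 0; i < splits.length; i += 4) {
  const group = splits.slice(i, i + 3);
  // Break the body into trimmed lines, skipping blank ones.
  group[3] = splits[i + 3].trim().split(/\s*[\r\n]+\s*/);
  grouped.push(group);
}

console.log(grouped);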
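
If named fields are easier to work with than positional arrays, the same split can feed plain objects instead. This is a sketch on a shortened input; the field names initials, code, hasDot and lines are hypothetical.

```javascript
// Shortened sample input with two blocks, the second one indented.
const text = '*GW, A\nFirst line.\nSecond line.\n\t*JP, R.\n\tIndented block.\n';

// Same header pattern as in the answer above.
const HEADER = /\*([A-Z]{2,3}),\s?([AT]|RS{0,2})(\.?)\n/;
const parts = text.split(HEADER).slice(1);

const records = [];
for (let i = 0; i < parts.length; i += 4) {
  const [initials, code, dot, body] = parts.slice(i, i + 4);
  records.push({
    initials,
    code,
    hasDot: dot === '.',
    lines: body.trim().split(/\s*[\r\n]+\s*/),
  });
}

console.log(records[0]);
// { initials: 'GW', code: 'A', hasDot: false, lines: [ 'First line.', 'Second line.' ] }
```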

You can use

^[ \t]*\*[A-Z]{2,3},\s*(?:[ART]|RSS?)\.?[\n\r](?:(?!^[ \t]*\*[A-Z]{2,3},\s*(?:[ART]|RSS?)\.?)[\s\S])+

See the demo at regex101.com. Note that the pattern needs the multiline flag so that ^ matches at the start of every line.


Broken down into parts:
^[ \t]*\*[A-Z]{2,3}           # start of the line, optional spaces or tabs, an asterisk and 2-3 UPPERCASE letters
,\s*(?:[ART]|RSS?)\.?[\n\r]   # comma, optional whitespace, code, optional dot and a newline
(?:                           # non-capturing group
    (?!^[ \t]*\*[A-Z]{2,3},\s*(?:[ART]|RSS?)\.?)
                              # neg. lookahead with the same pattern as above
    [\s\S]                    # \s + \S = effectively matching every character
)+

This technique is called a tempered greedy token.
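
A minimal sketch of how this pattern might be applied in JavaScript, assuming the g and m flags and a shortened sample text (the variable names are illustrative only):

```javascript
// The pattern from above; 'm' makes ^ match at every line start,
// 'g' makes match() return all blocks.
const BLOCK = /^[ \t]*\*[A-Z]{2,3},\s*(?:[ART]|RSS?)\.?[\n\r](?:(?!^[ \t]*\*[A-Z]{2,3},\s*(?:[ART]|RSS?)\.?)[\s\S])+/gm;

const sample =
  '*GW, A\nFirst block.\n' +
  '\t*JP, R.\n\tSecond block,\n\t*not a header.\n' +
  '*TA, RS.\nThird block.\n';

// Each match is one header line plus everything up to the next header.
const blocks = sample.match(BLOCK).map((b) => b.trim());

console.log(blocks.length); // 3
console.log(blocks[1]);
// *JP, R.
// 	Second block,
// 	*not a header.
```

The negative lookahead is checked before every consumed character, so the body stops exactly where the next header line begins, while a line such as "*not a header." is kept as ordinary text.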


Hope this is what you wanted. It works.

([\*\t])+(.{2,3}),\s?.[A,R,T,RS,RSS]{1,3}\.?\n.*


Source: https://habr.com/ru/post/1692418/

