Parsing text using regular expressions

I have a dictionary in .txt format that looks like this:

 term 1
         definition 1
         definition 2
 term 2
         definition 1
         definition 2
         definition 3
 etc.

There is a tab before each definition, so basically it looks like this:

 term 1 [tab]definition 1 [tab]definition 2 etc. 

Now I need to wrap each term and its definitions in a <term> element, i.e.:

 <term> term 1 definition 1 definition 2 </term> 

I tried to use regular expressions to find a term together with its definitions, but had no luck. Could you help me with this?

Thanks for any suggestions!

3 answers

Try this regex:

 (^|\n).+(\n[ \t]+.+)* 

This assumes that ^ marks the beginning of a line, that \n is the line break character, and that . does not match a line break.
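To see how this pattern behaves, here is a minimal Python sketch. Python is an assumption here, since the question does not name a language. In Python's `re` module without flags, `^` matches only at the very start of the input, so every entry after the first is found through the `\n` alternative and carries a leading line break that has to be stripped. The sample text and the names `SAMPLE`, `ENTRY`, `entries` and `wrapped` are made up for illustration.

```python
import re

# Hypothetical sample in the layout described in the question:
# a term on its own line, each definition on a tab-indented line.
SAMPLE = (
    "term 1\n"
    "\tdefinition 1\n"
    "\tdefinition 2\n"
    "term 2\n"
    "\tdefinition 1\n"
    "\tdefinition 2\n"
    "\tdefinition 3\n"
)

# The pattern from the answer, unchanged. Without re.DOTALL, "." does not
# match a line break, which is what the answer assumes.
ENTRY = re.compile(r"(^|\n).+(\n[ \t]+.+)*")

# group(0) is the whole entry; drop the line break picked up by (^|\n).
entries = [m.group(0).lstrip("\n") for m in ENTRY.finditer(SAMPLE)]

# Wrap each entry in <term> ... </term>.
wrapped = "\n".join(f"<term>\n{entry}\n</term>" for entry in entries)
print(wrapped)
```

Because the match itself may start with a line break, this pattern is easier to use with `finditer` than with a direct substitution.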


Assuming an implementation that

  • Matches multiple lines ( /.../m )
  • Uses ^ to indicate the beginning of a line (\A anchors only at the very start of the input, so it would match just the first term)

this should match one “term”:

 ^[^\t][^\n]+\n(\t[^\n]+\n)+ 
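A quick way to check the choice of anchor is to count matches. The sketch below assumes Python's `re` module, where `\A` matches only at the very start of the input even with `re.MULTILINE`, while `^` under that flag matches at the start of every line. Other regex flavors may differ. The sample text and variable names are illustrative and not from the original post.

```python
import re

# Hypothetical two-entry dictionary: terms unindented, definitions tab-indented.
SAMPLE = (
    "term 1\n"
    "\tdefinition 1\n"
    "\tdefinition 2\n"
    "term 2\n"
    "\tdefinition 1\n"
    "\tdefinition 2\n"
    "\tdefinition 3\n"
)

# Body of the pattern: one non-tab-led line, then one or more tab-led lines.
# The group is non-capturing so that findall returns whole matches.
BODY = r"[^\t][^\n]+\n(?:\t[^\n]+\n)+"

with_string_anchor = re.findall(r"\A" + BODY, SAMPLE, flags=re.MULTILINE)
with_line_anchor = re.findall(r"^" + BODY, SAMPLE, flags=re.MULTILINE)

print(len(with_string_anchor))  # only the first entry is found
print(len(with_line_anchor))    # one match per entry
```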

Match a line that starts with a non-whitespace character, followed by one or more lines that start with a TAB:

  $ perl -0777 -pe 's/^(\S.+\n(^\t.+\n)+)/<term>\n$1<\/term>\n/mg' dict
 <term>
 term 1
         definition 1
         definition 2
 </term>

 <term>
 term 2
         definition 1
         definition 2
         definition 3
 </term>
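For readers without Perl at hand, the same substitution can be sketched in Python. This is an assumed port, not part of the original answer: `re.MULTILINE` plays the role of `/m`, `re.sub` replaces every match like `/g`, and holding the whole text in one string corresponds to slurping the file. The function name `wrap_terms` and the sample input (with a blank line between entries) are hypothetical.

```python
import re

# Hypothetical input: two entries separated by a blank line.
SAMPLE = (
    "term 1\n"
    "\tdefinition 1\n"
    "\tdefinition 2\n"
    "\n"
    "term 2\n"
    "\tdefinition 1\n"
    "\tdefinition 2\n"
    "\tdefinition 3\n"
)


def wrap_terms(text: str) -> str:
    """Wrap each term line plus its tab-indented definition lines in <term> tags."""
    return re.sub(
        r"^(\S.+\n(?:^\t.+\n)+)",   # a line starting with non-whitespace, then TAB-led lines
        r"<term>\n\1</term>\n",     # \1 is the matched entry, including its final line break
        text,
        flags=re.MULTILINE,         # let ^ match at the start of every line
    )


print(wrap_terms(SAMPLE))
```

Applied to a file, the call would look like `wrap_terms(open("dict").read())`, with `dict` being the file name used in the Perl command.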

Source: https://habr.com/ru/post/1300563/

