Parsing text using regular expressions

Question


I have a dictionary in .txt format that looks like this:

 term 1
     definition 1
     definition 2
 term 2
     definition 1
     definition 2
     definition 3
 etc.

There is a tab before each definition, so basically it looks like this:

 term 1
 [tab]definition 1
 [tab]definition 2
 etc.

Now I need to wrap each term and its definitions in a <term> element, i.e.:

 <term>
 term 1
     definition 1
     definition 2
 </term>

I tried to use regular expressions to find a term together with its definitions, but had no luck. Could you help me with this?

Thanks for any suggestions!

+4

regex parsing

Peterim Feb 07 '10 at 21:29


3 answers

Assuming an implementation that

Matches multiple lines ( /.../m )
Uses \A to indicate the beginning of a line

this should match one “term”:

 \A[^\t][^\n]+\n(\t[^\n]+\n)+
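For anyone trying this in Python, here is a minimal sketch of the same idea. Note that in Python `\A` anchors only at the very start of the string, so the sketch uses `^` with `re.MULTILINE` instead. The sample text and the names `SAMPLE`, `ENTRY` and `find_entries` are made up for illustration, and the input layout (term on its own line, tab-indented definitions below it) is an assumption based on the question.

```python
import re

# Sample input in the layout the question describes (an assumption):
# a term on its own line, followed by tab-indented definition lines.
SAMPLE = (
    "term 1\n"
    "\tdefinition 1\n"
    "\tdefinition 2\n"
    "term 2\n"
    "\tdefinition 1\n"
    "\tdefinition 2\n"
    "\tdefinition 3\n"
)

# Same shape as the pattern above, with ^ (under re.MULTILINE) in place of \A:
# one line that does not start with a tab, then one or more tab-led lines.
ENTRY = re.compile(r"^[^\t][^\n]+\n(?:\t[^\n]+\n)+", re.MULTILINE)


def find_entries(text):
    """Return each term together with its definition lines as one string."""
    return [m.group(0) for m in ENTRY.finditer(text)]


entries = find_entries(SAMPLE)
print(len(entries))
```

Each returned string still contains the line breaks and tabs, so it can be wrapped or split further as needed.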

0

Jakob Borg Feb 07 '10 at 21:37


Match a line that starts with a non-whitespace character, followed by one or more lines that start with a TAB:

 $ perl -0777 -pe 's/^(\S.+\n(^\t.+\n)+)/<term>\n$1<\/term>\n/mg' dict
 <term>
 term 1
         definition 1
         definition 2
 </term>

 <term>
 term 2
         definition 1
         definition 2
         definition 3
 </term>
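If Perl is not available, roughly the same substitution can be sketched in Python with `re.sub`. This is only an illustration under the same assumption about the input (a term line followed by tab-indented definition lines); the function name `wrap_terms` and the sample text are hypothetical.

```python
import re

# A term line (starts with a non-whitespace character) followed by
# one or more lines that start with a tab.
TERM_BLOCK = re.compile(r"^(\S.+\n(?:^\t.+\n)+)", re.MULTILINE)


def wrap_terms(text):
    """Wrap every term and its definitions in <term>...</term> lines."""
    return TERM_BLOCK.sub(r"<term>\n\1</term>\n", text)


# Made-up sample input for illustration.
sample = (
    "term 1\n"
    "\tdefinition 1\n"
    "\tdefinition 2\n"
    "term 2\n"
    "\tdefinition 1\n"
)

wrapped = wrap_terms(sample)
print(wrapped, end="")
```

Because the whole text is passed as one string, this corresponds to Perl reading the file in one piece rather than line by line.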

0

Greg Bacon Feb 07 '10 at 21:39


Gumbo · Accepted Answer · 2010-02-07T21:33:23+0000

Try this regex:

 (^|\n).+(\n[ \t]+.+)*

This assumes that ^ marks the beginning of a line, that \n is the line break character, and that . does not match a line break.
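As a rough illustration of how this pattern behaves, here is a small Python sketch that splits each match into a term and a list of definitions. In Python without `re.MULTILINE`, `^` matches only at the start of the string, so every later entry is found through the `\n` alternative and its match begins with that line break, which the sketch strips. The names `parse_entries`, `ENTRY` and `SAMPLE` are hypothetical, and the sample input is made up.

```python
import re

# The pattern from the answer, with non-capturing groups.
ENTRY = re.compile(r"(?:^|\n).+(?:\n[ \t]+.+)*")

# Made-up sample input: terms at the start of a line, definitions indented.
SAMPLE = (
    "term 1\n"
    "\tdefinition 1\n"
    "\tdefinition 2\n"
    "term 2\n"
    "\tdefinition 1\n"
)


def parse_entries(text):
    """Return a list of (term, [definitions]) tuples."""
    entries = []
    for match in ENTRY.finditer(text):
        # Drop the line break that a non-initial match starts with.
        lines = match.group(0).lstrip("\n").split("\n")
        term = lines[0]
        definitions = [line.strip() for line in lines[1:]]
        entries.append((term, definitions))
    return entries


for term, definitions in parse_entries(SAMPLE):
    print(term, definitions)
```

Since this pattern accepts spaces as well as tabs for indentation, it is slightly more lenient than a tab-only pattern.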
