Regex negative lookbehind before a certain point

I want to parse a LaTeX document and mark some of its terms with a special command. In particular, I have a list of terms, say:

    Astah
    UML
    use case
    ...

and I want to mark the first occurrence of Astah in the text with this custom command: \gloss{Astah}. Here is what I have so far (using Python):

    import re

    for g in glossary:
        pattern = re.compile(r'(\b' + g + r'\b)', re.I | re.M)
        text = pattern.sub(start + r'\1' + end, text, 1)

This works great.

But then I found out that:

  • I don't want to match terms that follow a LaTeX inline comment (that is, terms preceded by one or more %)
  • and I don't want to match terms inside section titles (that is, \section{term} or \paragraph{term})

So, I tried this:

    for g in glossary:
        pattern = re.compile(r'(^[^%]*(?!section{))(\b' + g + r'\b)', re.I | re.M)
        text = pattern.sub(r'\1' + start + r'\2' + end, text, 1)

but it still matches terms inside comments when the % is preceded by other characters, and it also matches terms inside section titles.

Is this something about the greediness of regular expressions that I don't understand? Or is the problem somewhere else?

As an example, if I have this text:

    \section{Astah}
    Astah is a UML diagramming tool... bla bla...
    % use case:
    A use case is a...

I would like to convert it to:

    \section{Astah}
    \gloss{Astah} is a \gloss{UML} diagramming tool... bla bla...
    % use case:
    A \gloss{use case} is a...
2 answers

The trick here is to use a regular expression that starts matching at the beginning of the line, because this lets us check whether the word we are trying to match is inside a comment:

 ^([^%\n]*?)(?<!\\section{)(?<!\\paragraph{)\b(Astah)\b 

The multiline flag m is required. Matches of this regular expression are replaced with \1\\gloss{\2} .
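As a rough sketch of how this pattern could be plugged into the loop from the question: the helper name gloss_first and the sample text are assumptions made for illustration, and re.escape is added so that a term containing regex metacharacters is matched literally.

```python
import re

def gloss_first(text, glossary):
    """Wrap the first eligible occurrence of each term in \\gloss{...}."""
    for term in glossary:
        pattern = re.compile(
            # Group 1: everything from the line start up to the term, with no '%'.
            # The lookbehinds reject a term that directly follows \section{ or \paragraph{.
            r'^([^%\n]*?)(?<!\\section{)(?<!\\paragraph{)\b('
            + re.escape(term) + r')\b',
            re.I | re.M,
        )
        # count=1 limits the substitution to the first match for this term.
        text = pattern.sub(r'\1\\gloss{\2}', text, count=1)
    return text

sample = (
    "\\section{Astah}\n"
    "Astah is a UML diagramming tool... bla bla...\n"
    "% use case:\n"
    "A use case is a...\n"
)
print(gloss_first(sample, ["Astah", "UML", "use case"]))
```

Note that the lookbehinds only protect a term that comes immediately after \section{ or \paragraph{ . A title such as \section{The Astah tool} would still be marked, so a longer title needs a different approach.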


Here are my two cents:

First, we need the regex module by Matthew Barnett. It brings a lot of interesting features, and one of them is useful in this case: the added (*SKIP) and (*FAIL) .

From the documentation :

  • Added (*PRUNE), (*SKIP) and (*FAIL) (Hg issue 153)

(*PRUNE) discards the backtracking info up to that point. When used in an atomic group or a lookaround, it won't affect the enclosing pattern.

(*SKIP) is similar to (*PRUNE), except that it also sets where in the text the next attempt to match will start. When used in an atomic group or a lookaround, it won't affect the enclosing pattern.

(*FAIL) causes immediate backtracking. (*F) is a permitted abbreviation.

So, let's build a pattern and test it using the regex module:

    import regex

    pattern = regex.compile(r'%.*(*SKIP)(*FAIL)|\\section{.*}(*SKIP)(*FAIL)|(Astah|UML|use case)')
    # Raw string, so the backslash in \section is kept as-is.
    s = r"""
    \section{Astah}
    Astah is a UML diagramming tool... bla bla...
    % use case:
    A use case is a...
    """
    print(regex.sub(pattern, r'\\gloss{\1}', s))

Output:

    \section{Astah}
    \gloss{Astah} is a \gloss{UML} diagramming tool... bla bla...
    % use case:
    A \gloss{use case} is a...

The pattern:

This sentence illustrates it well:

the trick is to match the various contexts we do not want, so as to "neutralize" them.

On the left side we write the contexts we do not want, and on the right side (the last part) we capture what we do want. All contexts are separated by the alternation sign | , and only the last one (what we want) is captured.

Since in this case we are performing a replacement, we need (*SKIP)(*FAIL) to leave intact the parts we do not want to replace.

What the pattern means:

    %.*(*SKIP)(*FAIL)|\\section{.*}(*SKIP)(*FAIL)|(Astah|UML|use case)

    %.*(*SKIP)(*FAIL)            # Matches the pattern but skip and fail
    |                            # or
    \\section{.*}(*SKIP)(*FAIL)  # Matches the pattern but skip and fail
    |                            # or
    (Astah|UML|use case)         # Matches the pattern and capture it.
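If installing the regex module is not an option, the same "unwanted contexts on the left, wanted capture on the right" idea can be approximated with the standard re module by using a replacement function instead of (*SKIP)(*FAIL). The following is only a sketch. The names TERMS, PATTERN and mark_first_occurrences are made up for illustration, and the \b word boundaries, the \paragraph branch and the first-occurrence bookkeeping are additions taken from the question's requirements, not part of the pattern above.

```python
import re

TERMS = ["Astah", "UML", "use case"]  # illustrative glossary

# Unwanted contexts come first; the wanted terms are the only capturing group.
PATTERN = re.compile(
    r'%.*'                                 # a LaTeX comment up to the end of the line
    r'|\\(?:section|paragraph)\{.*\}'      # a \section{...} or \paragraph{...} title
    r'|\b(' + '|'.join(map(re.escape, TERMS)) + r')\b',  # group 1: a glossary term
    re.I,
)

def mark_first_occurrences(text):
    seen = set()  # lower-cased terms that have already been marked

    def replace(match):
        term = match.group(1)
        if term is None:
            # An unwanted context matched: put it back unchanged.
            return match.group(0)
        key = term.lower()
        if key in seen:
            return term  # later occurrences stay as they are
        seen.add(key)
        return r'\gloss{' + term + '}'

    return PATTERN.sub(replace, text)

sample = (
    "\\section{Astah}\n"
    "Astah is a UML diagramming tool... bla bla...\n"
    "% use case:\n"
    "A use case is a...\n"
    "Astah again.\n"
)
print(mark_first_occurrences(sample))
```

Because the replacement function returns the whole match untouched whenever group 1 is empty, comments and titles are consumed by the scan but never modified, which is the role (*SKIP)(*FAIL) plays in the regex-module version.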

This simple trick is described in more detail on RexEgg .

Hope this helps.


Source: https://habr.com/ru/post/1015473/

