I want to process a LaTeX document and mark some of its terms with a special command. In particular, I have a list of terms, say:
    Astah
    UML
    use case
    ...
and I want to mark the first occurrence of Astah in the text with this user command: \gloss{Astah}. This works fine (using Python):

    import re

    # start and end hold the wrapper, e.g. r'\\gloss{' and '}'
    for g in glossary:
        pattern = re.compile(r'(\b' + g + r'\b)', re.I | re.M)
        text = pattern.sub(start + r'\1' + end, text, 1)
But then I found out that:
- I don't want to match terms that follow a LaTeX inline comment marker (that is, terms preceded on the same line by one or more %)
- and I don't want to match terms inside a section title (that is, \section{term} or \paragraph{term})
So, I tried this:
    for g in glossary:
        pattern = re.compile(r'(^[^%]*(?!section{))(\b' + g + r'\b)', re.I | re.M)
        text = pattern.sub(r'\1' + start + r'\2' + end, text, 1)
but it still matches terms inside comments that are preceded by other characters, and it also matches terms inside section titles.
Is this something about the greediness of regular expressions that I don't understand, or is the problem somewhere else?
As an example, if I have this text:
    \section{Astah}
    Astah is a UML diagramming tool... bla bla...
    % use case:
    A use case is a...

I would like to convert it to:

    \section{Astah}
    \gloss{Astah} is a \gloss{UML} diagramming tool... bla bla...
    % use case:
    A \gloss{use case} is a...
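To pin down the behaviour I am after, here is a minimal sketch that processes the text line by line instead of relying on a single regular expression. It is only an illustration under some assumptions: the function name `mark_glossary` and the `PROTECTED` pattern are made up for this example, a heading is assumed to fit on one line without nested braces, and a `%` is treated as a comment marker unless it is directly preceded by a backslash.

```python
import re

# Hypothetical helper pattern: spans that must not be touched, i.e. heading
# arguments and terms that are already wrapped in \gloss{...}.
PROTECTED = re.compile(
    r"\\(?:(?:sub)*section|(?:sub)?paragraph|gloss)\*?\{[^}]*\}"
)
# A % that is not escaped as \% starts a comment (simplifying assumption).
COMMENT = re.compile(r"(?<!\\)%")


def mark_glossary(text, glossary):
    """Wrap the first eligible occurrence of each term in \\gloss{...}."""
    lines = text.split("\n")
    for term in glossary:
        # re.escape keeps terms with special characters from breaking the regex.
        pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.I)
        marked = False
        for i, line in enumerate(lines):
            # Split the line into the part before the comment and the comment.
            m = COMMENT.search(line)
            cut = m.start() if m else len(line)
            code, comment = line[:cut], line[cut:]
            protected = [p.span() for p in PROTECTED.finditer(code)]
            for hit in pattern.finditer(code):
                # Skip hits that fall inside a heading or an existing \gloss.
                if any(s <= hit.start() < e for s, e in protected):
                    continue
                lines[i] = (code[:hit.start()] + "\\gloss{" + hit.group(0)
                            + "}" + code[hit.end():] + comment)
                marked = True
                break
            if marked:
                break  # only the first eligible occurrence of each term
    return "\n".join(lines)


sample = ("\\section{Astah}\n"
          "Astah is a UML diagramming tool... bla bla...\n"
          "% use case:\n"
          "A use case is a...")
result = mark_glossary(sample, ["Astah", "UML", "use case"])
print(result)
```

Working on one line at a time avoids the backtracking problem in my attempt above: there, `[^%]*` can stop at any position, so the negative lookahead only rules out matches where `section{` starts exactly at that position, not matches that come after it.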