How to use the nltk regular expression pattern to extract a specific phrase fragment?

Question


I wrote the following chunking patterns to label certain kinds of phrases:

pattern = """
    P2: {<JJ>+ <RB>? <JJ>* <NN>+ <VB>* <JJ>*}
    P1: {<JJ>? <NN>+ <CC>? <NN>* <VB>? <RB>* <JJ>+}
    P3: {<NP1><IN><NP2>}
    P4: {<NP2><IN><NP1>}
"""

This pattern correctly chunks a sentence such as:

 a = 'The pizza was good but pasta was bad'

and gives the desired result of two phrases:

The pizza was good
The pasta was bad.

However, if my sentence looks something like this:

 a = 'The pizza was awesome and brilliant'

it only matches the phrase:

 'pizza was awesome'

instead of the desired:

 'pizza was awesome and brilliant'

How can I make the regex pattern work for my second example?


python regex nlp nltk

pd176 Dec 04 '15 at 14:37


1 answer

alvas · Accepted Answer · 2015-12-04T17:18:32+0000

First, let's take a look at the POS tags that NLTK provides:

 >>> from nltk import pos_tag
 >>> sent = 'The pizza was awesome and brilliant'.split()
 >>> pos_tag(sent)
 [('The', 'DT'), ('pizza', 'NN'), ('was', 'VBD'), ('awesome', 'JJ'), ('and', 'CC'), ('brilliant', 'JJ')]
 >>> sent = 'The pizza was good but pasta was bad'.split()
 >>> pos_tag(sent)
 [('The', 'DT'), ('pizza', 'NN'), ('was', 'VBD'), ('good', 'JJ'), ('but', 'CC'), ('pasta', 'NN'), ('was', 'VBD'), ('bad', 'JJ')]

(Note: the outputs above are from pos_tag in NLTK v3.1; earlier versions may give different tags.)

What you want to capture is essentially:

NN VBD JJ CC JJ
NN VBD JJ

So let's catch them with these patterns:

 >>> from nltk import RegexpParser
 >>> sent1 = ['The', 'pizza', 'was', 'awesome', 'and', 'brilliant']
 >>> sent2 = ['The', 'pizza', 'was', 'good', 'but', 'pasta', 'was', 'bad']
 >>> patterns = """
 ... P: {<NN><VBD><JJ><CC><JJ>}
 ... {<NN><VBD><JJ>}
 ... """
 >>> PChunker = RegexpParser(patterns)
 >>> PChunker.parse(pos_tag(sent1))
 Tree('S', [('The', 'DT'), Tree('P', [('pizza', 'NN'), ('was', 'VBD'), ('awesome', 'JJ'), ('and', 'CC'), ('brilliant', 'JJ')])])
 >>> PChunker.parse(pos_tag(sent2))
 Tree('S', [('The', 'DT'), Tree('P', [('pizza', 'NN'), ('was', 'VBD'), ('good', 'JJ')]), ('but', 'CC'), Tree('P', [('pasta', 'NN'), ('was', 'VBD'), ('bad', 'JJ')])])

But that is "cheating" by hardcoding the patterns!

Going back to the POS patterns:

NN VBD JJ CC JJ
NN VBD JJ

You can simplify them to a single pattern in which the trailing group is optional:

NN VBD JJ (CC JJ)

So you can apply the optional operator (?) to that group in the chunk pattern, for example:

 >>> patterns = """
 ... P: {<NN><VBD><JJ>(<CC><JJ>)?}
 ... """
 >>> PChunker = RegexpParser(patterns)
 >>> PChunker.parse(pos_tag(sent1))
 Tree('S', [('The', 'DT'), Tree('P', [('pizza', 'NN'), ('was', 'VBD'), ('awesome', 'JJ'), ('and', 'CC'), ('brilliant', 'JJ')])])
 >>> PChunker.parse(pos_tag(sent2))
 Tree('S', [('The', 'DT'), Tree('P', [('pizza', 'NN'), ('was', 'VBD'), ('good', 'JJ')]), ('but', 'CC'), Tree('P', [('pasta', 'NN'), ('was', 'VBD'), ('bad', 'JJ')])])
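Since the question asks for the phrases themselves rather than a tree, here is a minimal sketch of pulling the matched words out of the parse result. The helper name extract_phrases is made up for illustration, and the tagged tokens are hardcoded from the pos_tag output quoted above so that the snippet does not depend on a tagger model being installed.

```python
from nltk import RegexpParser

# Tagged tokens copied from the NLTK v3.1 pos_tag output quoted in this answer.
# Hardcoding them is an assumption made to keep the sketch self-contained.
tagged1 = [('The', 'DT'), ('pizza', 'NN'), ('was', 'VBD'),
           ('awesome', 'JJ'), ('and', 'CC'), ('brilliant', 'JJ')]
tagged2 = [('The', 'DT'), ('pizza', 'NN'), ('was', 'VBD'), ('good', 'JJ'),
           ('but', 'CC'), ('pasta', 'NN'), ('was', 'VBD'), ('bad', 'JJ')]

# Same pattern as above: the (<CC><JJ>) group is optional.
chunker = RegexpParser("P: {<NN><VBD><JJ>(<CC><JJ>)?}")


def extract_phrases(tagged, label='P'):
    """Return the text of every chunk with the given label (hypothetical helper)."""
    tree = chunker.parse(tagged)
    return [
        ' '.join(word for word, _tag in subtree.leaves())
        for subtree in tree.subtrees(filter=lambda t: t.label() == label)
    ]


print(extract_phrases(tagged1))
print(extract_phrases(tagged2))
```

With your own text you would pass pos_tag(tokens) instead of the hardcoded lists; the tags, and therefore the matches, depend on the tagger version.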

Most likely you are using an older tagger, so your tags and patterns differ, but I think you can see from the example above how to capture the phrases you need.

Steps:

First, check what the POS patterns of your sentences are using pos_tag
Then generalize the patterns and simplify them
Then put them into RegexpParser
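The steps above can be sketched roughly as follows. This is an illustration only: tag_sequence and chunk_tag_sequences are hypothetical helper names, the first tagged sentence is copied from the output earlier in this answer, and the second one (with a third adjective) is hand-tagged for the example rather than produced by pos_tag. The `*` on the group is one possible generalization of the `?` used above, not something the original patterns require.

```python
from nltk import RegexpParser

# Step 1: look at the tag sequences. In practice these lists would come from
# pos_tag(); they are hardcoded here so no tagger model is needed.
tagged_a = [('The', 'DT'), ('pizza', 'NN'), ('was', 'VBD'),
            ('awesome', 'JJ'), ('and', 'CC'), ('brilliant', 'JJ')]
# Hand-tagged, hypothetical sentence (an assumption, not tagger output).
tagged_b = [('The', 'DT'), ('pizza', 'NN'), ('was', 'VBD'),
            ('awesome', 'JJ'), ('and', 'CC'), ('brilliant', 'JJ'),
            ('and', 'CC'), ('cheap', 'JJ')]


def tag_sequence(tagged):
    """Join just the tags so the sequences are easy to compare by eye."""
    return ' '.join(tag for _word, tag in tagged)


# Step 2: generalize. Allowing the (<CC><JJ>) group zero or more times
# covers one adjective, two coordinated adjectives, three, and so on.
pattern = "P: {<NN><VBD><JJ>(<CC><JJ>)*}"

# Step 3: hand the pattern to RegexpParser.
chunker = RegexpParser(pattern)


def chunk_tag_sequences(tagged):
    """Return the tag sequence of every 'P' chunk found in a tagged sentence."""
    tree = chunker.parse(tagged)
    return [tag_sequence(subtree.leaves())
            for subtree in tree.subtrees(filter=lambda t: t.label() == 'P')]


for tagged in (tagged_a, tagged_b):
    print(tag_sequence(tagged), '->', chunk_tag_sequences(tagged))
```

Printing the tag sequences side by side first makes it easier to see which parts are shared and which part is the optional or repeatable tail.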
