How to use an NLTK regular expression pattern to extract a specific phrase chunk?

I wrote the following regular expression grammar to label certain phrase patterns:

 pattern = """
     P2: {<JJ>+ <RB>? <JJ>* <NN>+ <VB>* <JJ>*}
     P1: {<JJ>? <NN>+ <CC>? <NN>* <VB>? <RB>* <JJ>+}
     P3: {<NP1><IN><NP2>}
     P4: {<NP2><IN><NP1>}
 """

This grammar chunks a sentence such as the following correctly:

 a = 'The pizza was good but pasta was bad'

and gives the desired result with two phrases:

  • The pizza was good
  • The pasta was bad.

However, if my sentence looks something like this:

 a = 'The pizza was awesome and brilliant'

it only matches the phrase:

 'pizza was awesome' 

instead of the desired:

 'pizza was awesome and brilliant' 

How can I change the regex pattern so that it also handles my second example?

1 answer

First, let's take a look at the POS tags that NLTK provides:

 >>> from nltk import pos_tag
 >>> sent = 'The pizza was awesome and brilliant'.split()
 >>> pos_tag(sent)
 [('The', 'DT'), ('pizza', 'NN'), ('was', 'VBD'), ('awesome', 'JJ'), ('and', 'CC'), ('brilliant', 'JJ')]
 >>> sent = 'The pizza was good but pasta was bad'.split()
 >>> pos_tag(sent)
 [('The', 'DT'), ('pizza', 'NN'), ('was', 'VBD'), ('good', 'JJ'), ('but', 'CC'), ('pasta', 'NN'), ('was', 'VBD'), ('bad', 'JJ')]

(Note: the outputs above are from the NLTK v3.1 pos_tag; earlier versions may give different tags.)

What you want to capture is essentially:

  • NN VBD JJ CC JJ
  • NN VBD JJ

So let's catch them with these patterns:

 >>> from nltk import RegexpParser
 >>> sent1 = ['The', 'pizza', 'was', 'awesome', 'and', 'brilliant']
 >>> sent2 = ['The', 'pizza', 'was', 'good', 'but', 'pasta', 'was', 'bad']
 >>> patterns = """
 ... P: {<NN><VBD><JJ><CC><JJ>}
 ...    {<NN><VBD><JJ>}
 ... """
 >>> PChunker = RegexpParser(patterns)
 >>> PChunker.parse(pos_tag(sent1))
 Tree('S', [('The', 'DT'), Tree('P', [('pizza', 'NN'), ('was', 'VBD'), ('awesome', 'JJ'), ('and', 'CC'), ('brilliant', 'JJ')])])
 >>> PChunker.parse(pos_tag(sent2))
 Tree('S', [('The', 'DT'), Tree('P', [('pizza', 'NN'), ('was', 'VBD'), ('good', 'JJ')]), ('but', 'CC'), Tree('P', [('pasta', 'NN'), ('was', 'VBD'), ('bad', 'JJ')])])

But that is "cheating" by hardcoding each pattern!

Back to the POS patterns:

  • NN VBD JJ CC JJ
  • NN VBD JJ

You can merge them into a single pattern in which the parenthesized part is optional:

  • NN VBD JJ (CC JJ)

So you can use the optional operator ? in the chunk pattern, for example:

 >>> patterns = """
 ... P: {<NN><VBD><JJ>(<CC><JJ>)?}
 ... """
 >>> PChunker = RegexpParser(patterns)
 >>> PChunker.parse(pos_tag(sent1))
 Tree('S', [('The', 'DT'), Tree('P', [('pizza', 'NN'), ('was', 'VBD'), ('awesome', 'JJ'), ('and', 'CC'), ('brilliant', 'JJ')])])
 >>> PChunker.parse(pos_tag(sent2))
 Tree('S', [('The', 'DT'), Tree('P', [('pizza', 'NN'), ('was', 'VBD'), ('good', 'JJ')]), ('but', 'CC'), Tree('P', [('pasta', 'NN'), ('was', 'VBD'), ('bad', 'JJ')])])
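To see why the optional group covers both sentences, it may help to view the tags as one string and run an ordinary regular expression over it. The sketch below uses only the standard `re` module; it is an illustration of the idea, not a claim about how `RegexpParser` is implemented. The helper `match_phrases` and the constant `TAG_RE` are hypothetical names, and the tagged sentences are copied by hand from the `pos_tag` output above.

```python
import re

# Hand-copied (word, tag) pairs from the pos_tag output shown earlier.
sent1 = [('The', 'DT'), ('pizza', 'NN'), ('was', 'VBD'),
         ('awesome', 'JJ'), ('and', 'CC'), ('brilliant', 'JJ')]
sent2 = [('The', 'DT'), ('pizza', 'NN'), ('was', 'VBD'), ('good', 'JJ'),
         ('but', 'CC'), ('pasta', 'NN'), ('was', 'VBD'), ('bad', 'JJ')]

# Same shape as the chunk rule: a mandatory NN VBD JJ, then an optional CC JJ.
TAG_RE = re.compile(r'<NN><VBD><JJ>(?:<CC><JJ>)?')


def match_phrases(tagged):
    """Return the word spans whose tag sequence matches TAG_RE (hypothetical helper)."""
    tag_string = ''.join('<%s>' % tag for _word, tag in tagged)
    phrases = []
    for m in TAG_RE.finditer(tag_string):
        # Every tag starts with '<', so counting '<' turns offsets into token indices.
        start = tag_string[:m.start()].count('<')
        end = start + m.group(0).count('<')
        phrases.append(' '.join(word for word, _tag in tagged[start:end]))
    return phrases


print(match_phrases(sent1))
print(match_phrases(sent2))
```

For the first sentence the optional group matches `<CC><JJ>`, giving one long phrase; for the second, `<CC>` is followed by `<NN>`, so the group matches nothing and two short phrases are found.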

Most likely you are using an older tagger, so your tags (and therefore your patterns) will differ, but the example above should show how to capture the phrases you need.
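If the tags differ only in their suffix (say NNS instead of NN), one possible hedge is a wildcard inside the tag pattern, which NLTK chunk grammars allow. This is a minimal sketch under an assumption: the tagging below is written by hand and is hypothetical, not the output of any particular tagger version.

```python
from nltk import RegexpParser

# Hypothetical hand-written tagging: assume the tagger emits NNS for the plural noun.
tagged = [('The', 'DT'), ('pizzas', 'NNS'), ('were', 'VBD'),
          ('awesome', 'JJ'), ('and', 'CC'), ('brilliant', 'JJ')]

# <NN.*> matches NN, NNS, NNP, ...; <VB.*> matches VB, VBD, VBZ, ...
chunker = RegexpParser("P: {<NN.*><VB.*><JJ>(<CC><JJ>)?}")
tree = chunker.parse(tagged)

# Collect the leaves of every chunk labelled P.
chunks = [subtree.leaves() for subtree in tree.subtrees()
          if subtree.label() == 'P']
print(chunks)
```

The trade-off is that a looser pattern may also pick up spans you did not intend, so it is worth checking it against a few real sentences.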

Steps:

  • First, check which POS patterns occur in your sentences using pos_tag
  • Then generalize the patterns and simplify them
  • Then put them in RegexpParser
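The steps above can be sketched end to end as follows. The names `tag_sequence`, `extract_phrases` and `GRAMMAR` are hypothetical, introduced only for this illustration. To keep the sketch runnable without a tagger model, the tags are copied by hand from the `pos_tag` output earlier in this answer; in practice `pos_tag(sentence.split())` would supply them.

```python
from nltk import RegexpParser


def tag_sequence(tagged):
    """Step 1: show the POS tags of a tagged sentence as one string."""
    return ' '.join(tag for _word, tag in tagged)


def extract_phrases(tagged, grammar, label='P'):
    """Step 3: chunk a tagged sentence and return each chunk as plain text."""
    tree = RegexpParser(grammar).parse(tagged)
    return [' '.join(word for word, _tag in subtree.leaves())
            for subtree in tree.subtrees(filter=lambda t: t.label() == label)]


# Tags copied by hand from the pos_tag output shown earlier (an assumption
# that your tagger produces the same tags).
tagged = [('The', 'DT'), ('pizza', 'NN'), ('was', 'VBD'),
          ('awesome', 'JJ'), ('and', 'CC'), ('brilliant', 'JJ')]

sequence = tag_sequence(tagged)
print(sequence)

# Step 2: the generalized pattern with the optional CC JJ tail.
GRAMMAR = "P: {<NN><VBD><JJ>(<CC><JJ>)?}"
phrases = extract_phrases(tagged, GRAMMAR)
print(phrases)
```

Returning plain strings rather than `Tree` objects is a design choice for this sketch; keep the tree if you need the tags or the chunk positions later.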

Source: https://habr.com/ru/post/1237409/
