First, let's take a look at the POS tags that NLTK provides:
    >>> from nltk import pos_tag
    >>> sent = 'The pizza was awesome and brilliant'.split()
    >>> pos_tag(sent)
    [('The', 'DT'), ('pizza', 'NN'), ('was', 'VBD'), ('awesome', 'JJ'), ('and', 'CC'), ('brilliant', 'JJ')]
    >>> sent = 'The pizza was good but pasta was bad'.split()
    >>> pos_tag(sent)
    [('The', 'DT'), ('pizza', 'NN'), ('was', 'VBD'), ('good', 'JJ'), ('but', 'CC'), ('pasta', 'NN'), ('was', 'VBD'), ('bad', 'JJ')]
(Note: the outputs above are from the NLTK v3.1 pos_tag; earlier versions may give different tags.)
What you want to capture is essentially:
- NN VBD JJ CC JJ
- NN VBD JJ
So let's catch them with these patterns:
    >>> from nltk import RegexpParser
    >>> sent1 = ['The', 'pizza', 'was', 'awesome', 'and', 'brilliant']
    >>> sent2 = ['The', 'pizza', 'was', 'good', 'but', 'pasta', 'was', 'bad']
    >>> patterns = """
    ... P: {<NN><VBD><JJ><CC><JJ>}
    ...    {<NN><VBD><JJ>}
    ... """
    >>> PChunker = RegexpParser(patterns)
    >>> PChunker.parse(pos_tag(sent1))
    Tree('S', [('The', 'DT'), Tree('P', [('pizza', 'NN'), ('was', 'VBD'), ('awesome', 'JJ'), ('and', 'CC'), ('brilliant', 'JJ')])])
    >>> PChunker.parse(pos_tag(sent2))
    Tree('S', [('The', 'DT'), Tree('P', [('pizza', 'NN'), ('was', 'VBD'), ('good', 'JJ')]), ('but', 'CC'), Tree('P', [('pasta', 'NN'), ('was', 'VBD'), ('bad', 'JJ')])])
That works, but it is "cheating" by hardcoding every pattern!
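One thing to watch with the hardcoded version: as far as I can tell, RegexpParser applies the rules of a stage in the order they are listed, and a later rule does not re-chunk tokens that an earlier rule already claimed. That is why the longer pattern is listed first. Here is a small sketch of what happens when the two rules are swapped. The tagged tokens are copied from the pos_tag output above (so no tagger model is needed), and `reversed_patterns` and `chunks` are just names I made up for the illustration.

```python
from nltk import RegexpParser

# Tagged tokens copied from the NLTK v3.1 pos_tag output shown earlier,
# so this runs without downloading a tagger model.
tagged1 = [('The', 'DT'), ('pizza', 'NN'), ('was', 'VBD'),
           ('awesome', 'JJ'), ('and', 'CC'), ('brilliant', 'JJ')]

# Same two rules as the hardcoded grammar, but with the shorter one first.
reversed_patterns = """
P: {<NN><VBD><JJ>}
   {<NN><VBD><JJ><CC><JJ>}
"""
tree = RegexpParser(reversed_patterns).parse(tagged1)

# Collect the words of every P chunk in the parsed tree.
chunks = [[word for word, tag in subtree.leaves()]
          for subtree in tree.subtrees(filter=lambda t: t.label() == 'P')]
print(chunks)
# [['pizza', 'was', 'awesome']]
```

The short rule grabs "pizza was awesome" first, so "and brilliant" is left outside the chunk.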
Back to the POS patterns:
- NN VBD JJ CC JJ
- NN VBD JJ
These can be simplified to:
- NN VBD JJ (CC JJ)?
Thus, you can use optional operators in the regular expression, for example:
    >>> patterns = """
    ... P: {<NN><VBD><JJ>(<CC><JJ>)?}
    ... """
    >>> PChunker = RegexpParser(patterns)
    >>> PChunker.parse(pos_tag(sent1))
    Tree('S', [('The', 'DT'), Tree('P', [('pizza', 'NN'), ('was', 'VBD'), ('awesome', 'JJ'), ('and', 'CC'), ('brilliant', 'JJ')])])
    >>> PChunker.parse(pos_tag(sent2))
    Tree('S', [('The', 'DT'), Tree('P', [('pizza', 'NN'), ('was', 'VBD'), ('good', 'JJ')]), ('but', 'CC'), Tree('P', [('pasta', 'NN'), ('was', 'VBD'), ('bad', 'JJ')])])
Most likely you are using the old tagger, which is why your patterns come out different, but I think the example above shows how you could capture the phrases you need.
Steps:
- First, check what the POS patterns are using pos_tag
- Then generalize the patterns and simplify them
- Then put them in the RegexpParser
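
Putting the steps together, here is a minimal sketch of the whole flow that also pulls the matched phrases out of the tree as plain strings. To keep it runnable without a tagger model, the tagged sentences are copied from the NLTK v3.1 output above; with your own data you would get them from pos_tag, and your tags may differ. `extract_phrases` is a hypothetical helper name of mine, not part of NLTK.

```python
from nltk import RegexpParser

# Step 1: look at the POS patterns. These tagged sentences are copied from
# the pos_tag output above; normally you would call pos_tag(sent.split()).
tagged_sents = [
    [('The', 'DT'), ('pizza', 'NN'), ('was', 'VBD'),
     ('awesome', 'JJ'), ('and', 'CC'), ('brilliant', 'JJ')],
    [('The', 'DT'), ('pizza', 'NN'), ('was', 'VBD'), ('good', 'JJ'),
     ('but', 'CC'), ('pasta', 'NN'), ('was', 'VBD'), ('bad', 'JJ')],
]
for tagged in tagged_sents:
    print(' '.join(tag for _, tag in tagged))

# Step 2: generalize and simplify the patterns (optional "CC JJ" tail).
patterns = "P: {<NN><VBD><JJ>(<CC><JJ>)?}"

# Step 3: put them in the RegexpParser.
chunker = RegexpParser(patterns)


def extract_phrases(tagged_sent, label='P'):
    """Hypothetical helper: return each chunk with the given label as a string."""
    tree = chunker.parse(tagged_sent)
    return [' '.join(word for word, tag in subtree.leaves())
            for subtree in tree.subtrees(filter=lambda t: t.label() == label)]


results = [extract_phrases(tagged) for tagged in tagged_sents]
print(results)
# [['pizza was awesome and brilliant'], ['pizza was good', 'pasta was bad']]
```

If your tagger gives different tags, only the string in `patterns` should need to change.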