Recently I solved a very similar problem - I needed to extract the subject(s), action, and object(s). I open-sourced my work, so you can check out this library: https://github.com/krzysiekfonal/textpipeliner
It is based on spaCy (an alternative to nltk) and works on the sentence's dependency tree.
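To make "dependency tree" concrete: spaCy gives every token a dependency label and a list of children, and a pattern like "VERB/nsubj/*" describes a path from the root verb down that tree. Here is a minimal, library-free sketch of the idea - the `Tok` class and `follow` helper are my own toy illustration, not part of spaCy or textpipeliner:

```python
# Toy illustration of matching a path like "VERB/nsubj/*" over a
# dependency tree. Tok and follow() are hypothetical helpers, not
# part of spaCy or textpipeliner.

class Tok:
    def __init__(self, text, pos, dep, children=()):
        self.text, self.pos, self.dep = text, pos, dep
        self.children = list(children)

def follow(root, pattern):
    """Walk the tree along a pattern of the form node/edge/node/...

    The first segment must match the root's POS tag; each following
    (edge, node) pair matches a child's dependency label and POS tag
    ('*' matches anything). Returns the tokens reached at the path end.
    """
    parts = pattern.split("/")
    if parts[0] not in ("*", root.pos):
        return []
    current = [root]
    for dep, pos in zip(parts[1::2], parts[2::2]):
        current = [child
                   for tok in current
                   for child in tok.children
                   if dep in ("*", child.dep)
                   and pos in ("*", child.pos)]
    return current

# Tiny hand-built tree for "Germany conquered Europe".
subj = Tok("Germany", "PROPN", "nsubj")
obj = Tok("Europe", "PROPN", "dobj")
verb = Tok("conquered", "VERB", "ROOT", [subj, obj])

print([t.text for t in follow(verb, "VERB/nsubj/*")])  # ['Germany']
print([t.text for t in follow(verb, "VERB/dobj/*")])   # ['Europe']
```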
So, for example, let's take this document, parsed by spaCy:
import spacy

nlp = spacy.load("en")
doc = nlp(u"The Empire of Japan aimed to dominate Asia and the "
          u"Pacific and was already at war with the Republic of China "
          u"in 1937, but the world war is generally said to have begun on "
          u"1 September 1939 with the invasion of Poland by Germany and "
          u"subsequent declarations of war on Germany by France and the United Kingdom. "
          u"From late 1939 to early 1941, in a series of campaigns and treaties, Germany conquered "
          u"or controlled much of continental Europe, and formed the Axis alliance with Italy and Japan. "
          u"Under the Molotov-Ribbentrop Pact of August 1939, Germany and the Soviet Union partitioned and "
          u"annexed territories of their European neighbours, Poland, Finland, Romania and the Baltic states. "
          u"The war continued primarily between the European Axis powers and the coalition of the United Kingdom "
          u"and the British Commonwealth, with campaigns including the North Africa and East Africa campaigns, "
          u"the aerial Battle of Britain, the Blitz bombing campaign, the Balkan Campaign as well as the "
          u"long-running Battle of the Atlantic. In June 1941, the European Axis powers launched an invasion "
          u"of the Soviet Union, opening the largest land theatre of war in history, which trapped the major part "
          u"of the Axis' military forces into a war of attrition. In December 1941, Japan attacked "
          u"the United States and European territories in the Pacific Ocean, and quickly conquered much of "
          u"the Western Pacific.")
Now you can create a simple pipes structure (more about pipes in the README of this project):
pipes_structure = [
    SequencePipe([FindTokensPipe("VERB/nsubj/*"),
                  NamedEntityFilterPipe(),
                  NamedEntityExtractorPipe()]),
    FindTokensPipe("VERB"),
    AnyPipe([
        SequencePipe([FindTokensPipe("VBD/dobj/NNP"),
                      AggregatePipe([NamedEntityFilterPipe("GPE"),
                                     NamedEntityFilterPipe("PERSON")]),
                      NamedEntityExtractorPipe()]),
        SequencePipe([FindTokensPipe("VBD/**/*/pobj/NNP"),
                      AggregatePipe([NamedEntityFilterPipe("LOC"),
                                     NamedEntityFilterPipe("PERSON")]),
                      NamedEntityExtractorPipe()])
    ])
]

engine = PipelineEngine(pipes_structure, Context(doc), [0, 1, 2])
engine.process()
And as a result, you get:
>>>[([Germany], [conquered], [Europe]), ([Japan], [attacked], [the, United, States])]
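Each element of a result tuple is a list of tokens; if you prefer plain strings, you can join them in a small post-processing step. A sketch, with the token lists represented by their texts for illustration:

```python
# Post-process result tuples like the ones above into plain strings.
results = [(["Germany"], ["conquered"], ["Europe"]),
           (["Japan"], ["attacked"], ["the", "United", "States"])]

# Join each token list into a single string per slot.
triples = [tuple(" ".join(part) for part in row) for row in results]
print(triples)
# [('Germany', 'conquered', 'Europe'), ('Japan', 'attacked', 'the United States')]
```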
Internally it relies heavily (for the token-searching pipes) on another library - grammaregex. You can read about it in this post: https://medium.com/@krzysiek89dev/grammaregex-library-regex-like-for-text-mining-49e5706c9c6d#.zgx7odhsc
EDIT:
Actually, the example I presented in the README drops adjectives, but all you need to do is adjust the pipes structure passed to the engine to your needs. For example, for your sample sentences I can propose a structure/solution that gives you a tuple of three elements (subj, verb, adj) per sentence:
import spacy
from textpipeliner import PipelineEngine, Context
from textpipeliner.pipes import *

pipes_structure = [
    SequencePipe([FindTokensPipe("VERB/nsubj/NNP"),
                  NamedEntityFilterPipe(),
                  NamedEntityExtractorPipe()]),
    AggregatePipe([FindTokensPipe("VERB"),
                   FindTokensPipe("VERB/xcomp/VERB/aux/*"),
                   FindTokensPipe("VERB/xcomp/VERB")]),
    AnyPipe([FindTokensPipe("VERB/[acomp,amod]/ADJ"),
             AggregatePipe([FindTokensPipe("VERB/[dobj,attr]/NOUN/det/DET"),
                            FindTokensPipe("VERB/[dobj,attr]/NOUN/[acomp,amod]/ADJ")])])
]

engine = PipelineEngine(pipes_structure, Context(doc), [0, 1, 2])
engine.process()
This will give you the result:
[([Donald, Trump], [is], [the, worst])]
A small complication is that you have a compound sentence, and the lib produces one tuple per sentence. I will soon add the ability (I need it for my project too) to pass a list of pipe structures to the engine, which will allow more tuples to be produced per sentence. For now, though, you can solve it by creating a second engine for compound sentences, whose structure differs only in using VERB/conj/VERB instead of VERB (these regex-like patterns always start from ROOT, so VERB/conj/VERB takes you to the second verb in a compound sentence):
pipes_structure_comp = [
    SequencePipe([FindTokensPipe("VERB/conj/VERB/nsubj/NNP"),
                  NamedEntityFilterPipe(),
                  NamedEntityExtractorPipe()]),
    AggregatePipe([FindTokensPipe("VERB/conj/VERB"),
                   FindTokensPipe("VERB/conj/VERB/xcomp/VERB/aux/*"),
                   FindTokensPipe("VERB/conj/VERB/xcomp/VERB")]),
    AnyPipe([FindTokensPipe("VERB/conj/VERB/[acomp,amod]/ADJ"),
             AggregatePipe([FindTokensPipe("VERB/conj/VERB/[dobj,attr]/NOUN/det/DET"),
                            FindTokensPipe("VERB/conj/VERB/[dobj,attr]/NOUN/[acomp,amod]/ADJ")])])
]

engine2 = PipelineEngine(pipes_structure_comp, Context(doc), [0, 1, 2])
And now, after running both engines, you will get the expected result :)
engine.process()
engine2.process()

[([Donald, Trump], [is], [the, worst])]
[([Hillary], [is], [better])]
This is what you need, I think. Of course, I just quickly created a pipes structure for the given example sentences, and it won't work for every sentence, but I have seen many sentence structures and it should already cover a pretty good percentage. For the cases where it doesn't work yet, you can just add more FindTokensPipes, etc., and I'm sure that after a few adjustments you will cover a really good number of possible sentences (English is not too complicated, so... :)