How to extract the subjects in a sentence and their corresponding dependent phrases?

I am trying to extract the subjects in a sentence so that I can get the sentiment with respect to each subject. For this purpose I am using nltk in python2.7. Take the following sentence as an example:

Donald Trump is the worst president of USA, but Hillary is better than him

We can see that Donald Trump and Hillary are the two subjects, and that the feelings associated with Donald Trump are negative while those associated with Hillary are positive. So far I can chunk this sentence into noun phrases, which gives me the following:

 (S (NP Donald/NNP Trump/NNP) is/VBZ (NP the/DT worst/JJS president/NN) in/IN (NP USA,/NNP) but/CC (NP Hillary/NNP) is/VBZ better/JJR than/IN (NP him/PRP)) 
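
For reference, a minimal chunking setup along these lines looks like this (the NP grammar here is a simplified stand-in, not necessarily the exact one that produced the output above):

    import nltk

    sentence = "Donald Trump is the worst president of USA, but Hillary is better than him"
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))

    # Simplified NP rules: optional determiner + adjectives + nouns, or a bare pronoun
    grammar = r"""
      NP: {<DT>?<JJ.*>*<NN.*>+}
          {<PRP>}
    """
    chunker = nltk.RegexpParser(grammar)
    print(chunker.parse(tagged))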

Now, how do I go about finding the subjects in these noun phrases? And how do I group together the phrases that belong to each subject? Once I have the phrases for each subject separately, I can perform sentiment analysis on each of them.

EDIT

I looked into the library mentioned by @Krzysiek (spacy), and it also gave me dependency trees for sentences.

Here is the code:

    from spacy.en import English

    parser = English()
    example = u"Donald Trump is the worst president of USA, but Hillary is better than him"
    parsedEx = parser(example)

    # shown as: original token, dependency tag, head word, left dependents, right dependents
    for token in parsedEx:
        print(token.orth_, token.dep_, token.head.orth_,
              [t.orth_ for t in token.lefts], [t.orth_ for t in token.rights])

Here is the dependency tree:

    (u'Donald', u'compound', u'Trump', [], [])
    (u'Trump', u'nsubj', u'is', [u'Donald'], [])
    (u'is', u'ROOT', u'is', [u'Trump'], [u'president', u',', u'but', u'is'])
    (u'the', u'det', u'president', [], [])
    (u'worst', u'amod', u'president', [], [])
    (u'president', u'attr', u'is', [u'the', u'worst'], [u'of'])
    (u'of', u'prep', u'president', [], [u'USA'])
    (u'USA', u'pobj', u'of', [], [])
    (u',', u'punct', u'is', [], [])
    (u'but', u'cc', u'is', [], [])
    (u'Hillary', u'nsubj', u'is', [], [])
    (u'is', u'conj', u'is', [u'Hillary'], [u'better'])
    (u'better', u'acomp', u'is', [], [u'than'])
    (u'than', u'prep', u'better', [], [u'him'])
    (u'him', u'pobj', u'than', [], [])

This gives a deep insight into the dependencies between the different tokens of the sentence (the dependency relations between the pairs are described in a paper). How can I use this tree to attach the contextual words to the different subjects?
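
One crude starting point I can see from this output is to group tokens under each nominal subject via the subtree of its head verb (a sketch, continuing from the parsing code above) - but it lumps the whole compound sentence under the first subject, which is exactly the problem:

    # group tokens by each nominal subject via the subtree of its head verb;
    # for the first "is" (the ROOT) the subtree is the entire sentence
    for token in parsedEx:
        if token.dep_ == "nsubj":
            clause = " ".join(t.orth_ for t in token.head.subtree)
            print(token.orth_, "->", clause)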

2 answers

I just recently solved a very similar problem - I needed to extract the subject(s), action, and object(s). And I open-sourced my work, so you can check out this library: https://github.com/krzysiekfonal/textpipeliner

It is based on spacy (a competitor of nltk) and builds on the dependency tree of a sentence.

So, for example, take this document parsed by spacy:

    import spacy

    nlp = spacy.load("en")
    doc = nlp(u"The Empire of Japan aimed to dominate Asia and the "
              "Pacific and was already at war with the Republic of China "
              "in 1937, but the world war is generally said to have begun on "
              "1 September 1939 with the invasion of Poland by Germany and "
              "subsequent declarations of war on Germany by France and the United Kingdom. "
              "From late 1939 to early 1941, in a series of campaigns and treaties, Germany conquered "
              "or controlled much of continental Europe, and formed the Axis alliance with Italy and Japan. "
              "Under the Molotov-Ribbentrop Pact of August 1939, Germany and the Soviet Union partitioned and "
              "annexed territories of their European neighbours, Poland, Finland, Romania and the Baltic states. "
              "The war continued primarily between the European Axis powers and the coalition of the United Kingdom "
              "and the British Commonwealth, with campaigns including the North Africa and East Africa campaigns, "
              "the aerial Battle of Britain, the Blitz bombing campaign, the Balkan Campaign as well as the "
              "long-running Battle of the Atlantic. In June 1941, the European Axis powers launched an invasion "
              "of the Soviet Union, opening the largest land theatre of war in history, which trapped the major part "
              "of the Axis' military forces into a war of attrition. In December 1941, Japan attacked "
              "the United States and European territories in the Pacific Ocean, and quickly conquered much of "
              "the Western Pacific.")

Now you can create a simple pipes structure (more about pipes in the project's readme):

    pipes_structure = [
        SequencePipe([FindTokensPipe("VERB/nsubj/*"),
                      NamedEntityFilterPipe(),
                      NamedEntityExtractorPipe()]),
        FindTokensPipe("VERB"),
        AnyPipe([SequencePipe([FindTokensPipe("VBD/dobj/NNP"),
                               AggregatePipe([NamedEntityFilterPipe("GPE"),
                                              NamedEntityFilterPipe("PERSON")]),
                               NamedEntityExtractorPipe()]),
                 SequencePipe([FindTokensPipe("VBD/**/*/pobj/NNP"),
                               AggregatePipe([NamedEntityFilterPipe("LOC"),
                                              NamedEntityFilterPipe("PERSON")]),
                               NamedEntityExtractorPipe()])])
    ]

    engine = PipelineEngine(pipes_structure, Context(doc), [0, 1, 2])
    engine.process()

And as a result, you get:

 >>>[([Germany], [conquered], [Europe]), ([Japan], [attacked], [the, United, States])] 

In fact, it relies heavily (for the token-finding pipes) on another library of mine - grammaregex. You can read about it in this post: https://medium.com/@krzysiek89dev/grammaregex-library-regex-like-for-text-mining-49e5706c9c6d#.zgx7odhsc
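
If you want a feel for those patterns on their own, here is a small sketch of using grammaregex directly on the document above (the entry points are quoted from its readme - verify the exact names and return types against the version you install):

    from grammaregex import match_tree, find_tokens  # names per the grammaregex readme

    sent = list(doc.sents)[0]
    # patterns are paths through the dependency tree, always starting from ROOT
    print(match_tree(sent, "VERB/nsubj/*"))   # does the tree match verb -> nominal subject?
    print(find_tokens(sent, "VERB/nsubj/*"))  # the tokens reached at the end of matching paths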

EDITED

Actually, the example I presented in the readme discards adjectives, but all you need to do is adjust the pipes structure passed to the engine to your needs. For example, for your sample sentence I can propose a structure/solution that gives you a triple of (subject, verb, adjective) per sentence:

    import spacy
    from textpipeliner import PipelineEngine, Context
    from textpipeliner.pipes import *

    pipes_structure = [
        SequencePipe([FindTokensPipe("VERB/nsubj/NNP"),
                      NamedEntityFilterPipe(),
                      NamedEntityExtractorPipe()]),
        AggregatePipe([FindTokensPipe("VERB"),
                       FindTokensPipe("VERB/xcomp/VERB/aux/*"),
                       FindTokensPipe("VERB/xcomp/VERB")]),
        AnyPipe([FindTokensPipe("VERB/[acomp,amod]/ADJ"),
                 AggregatePipe([FindTokensPipe("VERB/[dobj,attr]/NOUN/det/DET"),
                                FindTokensPipe("VERB/[dobj,attr]/NOUN/[acomp,amod]/ADJ")])])
    ]

    engine = PipelineEngine(pipes_structure, Context(doc), [0, 1, 2])
    engine.process()

This will give you the result:

 [([Donald, Trump], [is], [the, worst])] 

A slight complication is that you have a compound sentence, and the lib produces one tuple per sentence - I will soon add the ability (I need it for my project too) to pass a list of pipe structures to the engine, so that more tuples per sentence can be produced. For now you can solve it by simply creating a second engine for the compound sentences, whose structure differs only in using VERB/conj/VERB instead of VERB (those regex-like patterns always start from ROOT, so VERB/conj/VERB leads you to the second verb in a compound sentence):

    pipes_structure_comp = [
        SequencePipe([FindTokensPipe("VERB/conj/VERB/nsubj/NNP"),
                      NamedEntityFilterPipe(),
                      NamedEntityExtractorPipe()]),
        AggregatePipe([FindTokensPipe("VERB/conj/VERB"),
                       FindTokensPipe("VERB/conj/VERB/xcomp/VERB/aux/*"),
                       FindTokensPipe("VERB/conj/VERB/xcomp/VERB")]),
        AnyPipe([FindTokensPipe("VERB/conj/VERB/[acomp,amod]/ADJ"),
                 AggregatePipe([FindTokensPipe("VERB/conj/VERB/[dobj,attr]/NOUN/det/DET"),
                                FindTokensPipe("VERB/conj/VERB/[dobj,attr]/NOUN/[acomp,amod]/ADJ")])])
    ]

    engine2 = PipelineEngine(pipes_structure_comp, Context(doc), [0, 1, 2])

And now, after running both engines, you will get the expected result :)

    engine.process()
    engine2.process()

    [([Donald, Trump], [is], [the, worst])]
    [([Hillary], [is], [better])]

This is what you need, I think. Of course, I just quickly created a pipes structure for this example sentence, and it won't work for every case, but I have seen a lot of sentence constructions and it will already handle a fair percentage of them. For the cases it doesn't handle yet, you can just add more FindTokensPipe etc., and I'm sure that after a few adjustments you will cover a really good number of possible sentences (English is not too complicated, so... :)


I was going through the spacy library more, and I finally figured out a solution through dependency parsing. Thanks to this repo, I figured out how to also include adjectives in my subject-verb-object extraction (making it SVAO), as well as how to take out compound subjects in the query. Here is my solution:

    from nltk.stem.wordnet import WordNetLemmatizer
    from spacy.en import English

    SUBJECTS = ["nsubj", "nsubjpass", "csubj", "csubjpass", "agent", "expl"]
    OBJECTS = ["dobj", "dative", "attr", "oprd"]
    ADJECTIVES = ["acomp", "advcl", "advmod", "amod", "appos", "nn", "nmod",
                  "ccomp", "complm", "hmod", "infmod", "xcomp", "rcmod",
                  "poss", "possessive"]
    COMPOUNDS = ["compound"]
    PREPOSITIONS = ["prep"]

    def getSubsFromConjunctions(subs):
        # find additional subjects joined to the given ones by "and"
        moreSubs = []
        for sub in subs:
            # rights is a generator
            rights = list(sub.rights)
            rightDeps = {tok.lower_ for tok in rights}
            if "and" in rightDeps:
                moreSubs.extend([tok for tok in rights if tok.dep_ in SUBJECTS or tok.pos_ == "NOUN"])
                if len(moreSubs) > 0:
                    moreSubs.extend(getSubsFromConjunctions(moreSubs))
        return moreSubs

    def getObjsFromConjunctions(objs):
        # find additional objects joined to the given ones by "and"
        moreObjs = []
        for obj in objs:
            # rights is a generator
            rights = list(obj.rights)
            rightDeps = {tok.lower_ for tok in rights}
            if "and" in rightDeps:
                moreObjs.extend([tok for tok in rights if tok.dep_ in OBJECTS or tok.pos_ == "NOUN"])
                if len(moreObjs) > 0:
                    moreObjs.extend(getObjsFromConjunctions(moreObjs))
        return moreObjs

    def getVerbsFromConjunctions(verbs):
        # find additional verbs joined to the given ones by "and"
        moreVerbs = []
        for verb in verbs:
            rightDeps = {tok.lower_ for tok in verb.rights}
            if "and" in rightDeps:
                moreVerbs.extend([tok for tok in verb.rights if tok.pos_ == "VERB"])
                if len(moreVerbs) > 0:
                    moreVerbs.extend(getVerbsFromConjunctions(moreVerbs))
        return moreVerbs

    def findSubs(tok):
        # climb the tree to the nearest verb or noun head and take its subjects
        head = tok.head
        while head.pos_ != "VERB" and head.pos_ != "NOUN" and head.head != head:
            head = head.head
        if head.pos_ == "VERB":
            subs = [tok for tok in head.lefts if tok.dep_ == "SUB"]
            if len(subs) > 0:
                verbNegated = isNegated(head)
                subs.extend(getSubsFromConjunctions(subs))
                return subs, verbNegated
            elif head.head != head:
                return findSubs(head)
        elif head.pos_ == "NOUN":
            return [head], isNegated(tok)
        return [], False

    def isNegated(tok):
        # True if any immediate child of the token is a negation word
        negations = {"no", "not", "n't", "never", "none"}
        for dep in list(tok.lefts) + list(tok.rights):
            if dep.lower_ in negations:
                return True
        return False

    def findSVs(tokens):
        # extract (subject, verb) pairs
        svs = []
        verbs = [tok for tok in tokens if tok.pos_ == "VERB"]
        for v in verbs:
            subs, verbNegated = getAllSubs(v)
            if len(subs) > 0:
                for sub in subs:
                    svs.append((sub.orth_, "!" + v.orth_ if verbNegated else v.orth_))
        return svs

    def getObjsFromPrepositions(deps):
        # pull objects out of prepositional phrases
        objs = []
        for dep in deps:
            if dep.pos_ == "ADP" and dep.dep_ == "prep":
                objs.extend([tok for tok in dep.rights if tok.dep_ in OBJECTS or
                             (tok.pos_ == "PRON" and tok.lower_ == "me")])
        return objs

    def getAdjectives(toks):
        toks_with_adjectives = []
        for tok in toks:
            adjs = [left for left in tok.lefts if left.dep_ in ADJECTIVES]
            adjs.append(tok)
            adjs.extend([right for right in tok.rights if right.dep_ in ADJECTIVES])
            tok_with_adj = " ".join([adj.lower_ for adj in adjs])
            toks_with_adjectives.extend(adjs)
        return toks_with_adjectives

    def getObjsFromAttrs(deps):
        # handle attribute nouns that themselves govern a verb with objects
        for dep in deps:
            if dep.pos_ == "NOUN" and dep.dep_ == "attr":
                verbs = [tok for tok in dep.rights if tok.pos_ == "VERB"]
                if len(verbs) > 0:
                    for v in verbs:
                        rights = list(v.rights)
                        objs = [tok for tok in rights if tok.dep_ in OBJECTS]
                        objs.extend(getObjsFromPrepositions(rights))
                        if len(objs) > 0:
                            return v, objs
        return None, None

    def getObjFromXComp(deps):
        # follow open clausal complements ("wants to buy X") to their objects
        for dep in deps:
            if dep.pos_ == "VERB" and dep.dep_ == "xcomp":
                v = dep
                rights = list(v.rights)
                objs = [tok for tok in rights if tok.dep_ in OBJECTS]
                objs.extend(getObjsFromPrepositions(rights))
                if len(objs) > 0:
                    return v, objs
        return None, None

    def getAllSubs(v):
        verbNegated = isNegated(v)
        subs = [tok for tok in v.lefts if tok.dep_ in SUBJECTS and tok.pos_ != "DET"]
        if len(subs) > 0:
            subs.extend(getSubsFromConjunctions(subs))
        else:
            foundSubs, verbNegated = findSubs(v)
            subs.extend(foundSubs)
        return subs, verbNegated

    def getAllObjs(v):
        # rights is a generator
        rights = list(v.rights)
        objs = [tok for tok in rights if tok.dep_ in OBJECTS]
        objs.extend(getObjsFromPrepositions(rights))

        potentialNewVerb, potentialNewObjs = getObjFromXComp(rights)
        if potentialNewVerb is not None and potentialNewObjs is not None and len(potentialNewObjs) > 0:
            objs.extend(potentialNewObjs)
            v = potentialNewVerb
        if len(objs) > 0:
            objs.extend(getObjsFromConjunctions(objs))
        return v, objs

    def getAllObjsWithAdjectives(v):
        # like getAllObjs, but falls back to adjectives when no object is found
        # rights is a generator
        rights = list(v.rights)
        objs = [tok for tok in rights if tok.dep_ in OBJECTS]

        if len(objs) == 0:
            objs = [tok for tok in rights if tok.dep_ in ADJECTIVES]

        objs.extend(getObjsFromPrepositions(rights))

        potentialNewVerb, potentialNewObjs = getObjFromXComp(rights)
        if potentialNewVerb is not None and potentialNewObjs is not None and len(potentialNewObjs) > 0:
            objs.extend(potentialNewObjs)
            v = potentialNewVerb
        if len(objs) > 0:
            objs.extend(getObjsFromConjunctions(objs))
        return v, objs

    def findSVOs(tokens):
        svos = []
        verbs = [tok for tok in tokens if tok.pos_ == "VERB" and tok.dep_ != "aux"]
        for v in verbs:
            subs, verbNegated = getAllSubs(v)
            # hopefully there are subs, if not, don't examine this verb any longer
            if len(subs) > 0:
                v, objs = getAllObjs(v)
                for sub in subs:
                    for obj in objs:
                        objNegated = isNegated(obj)
                        svos.append((sub.lower_,
                                     "!" + v.lower_ if verbNegated or objNegated else v.lower_,
                                     obj.lower_))
        return svos

    def findSVAOs(tokens):
        svos = []
        verbs = [tok for tok in tokens if tok.pos_ == "VERB" and tok.dep_ != "aux"]
        for v in verbs:
            subs, verbNegated = getAllSubs(v)
            # hopefully there are subs, if not, don't examine this verb any longer
            if len(subs) > 0:
                v, objs = getAllObjsWithAdjectives(v)
                for sub in subs:
                    for obj in objs:
                        objNegated = isNegated(obj)
                        obj_desc_tokens = generate_left_right_adjectives(obj)
                        sub_compound = generate_sub_compound(sub)
                        svos.append((" ".join(tok.lower_ for tok in sub_compound),
                                     "!" + v.lower_ if verbNegated or objNegated else v.lower_,
                                     " ".join(tok.lower_ for tok in obj_desc_tokens)))
        return svos

    def generate_sub_compound(sub):
        # collect compound tokens around the subject ("Donald" + "Trump")
        sub_compounds = []
        for tok in sub.lefts:
            if tok.dep_ in COMPOUNDS:
                sub_compounds.extend(generate_sub_compound(tok))
        sub_compounds.append(sub)
        for tok in sub.rights:
            if tok.dep_ in COMPOUNDS:
                sub_compounds.extend(generate_sub_compound(tok))
        return sub_compounds

    def generate_left_right_adjectives(obj):
        # collect adjective modifiers on both sides of the object ("worst" + "president")
        obj_desc_tokens = []
        for tok in obj.lefts:
            if tok.dep_ in ADJECTIVES:
                obj_desc_tokens.extend(generate_left_right_adjectives(tok))
        obj_desc_tokens.append(obj)
        for tok in obj.rights:
            if tok.dep_ in ADJECTIVES:
                obj_desc_tokens.extend(generate_left_right_adjectives(tok))
        return obj_desc_tokens

Now when you pass a sentence, for example:

    from spacy.en import English

    parser = English()
    sentence = u""" Donald Trump is the worst president of USA, but Hillary is better than him """
    parse = parser(sentence)
    print(findSVAOs(parse))

You will get the following:

 [(u'donald trump', u'is', u'worst president'), (u'hillary', u'is', u'better')] 
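
To close the loop on the original goal - sentiment per subject - the extracted tuples can now be scored separately. A sketch with NLTK's VADER analyzer (assuming the vader_lexicon resource has been downloaded):

    from nltk.sentiment.vader import SentimentIntensityAnalyzer  # needs nltk.download('vader_lexicon')

    analyzer = SentimentIntensityAnalyzer()
    for subj, verb, desc in findSVAOs(parse):
        # score only the words attached to this subject, not the whole sentence
        scores = analyzer.polarity_scores(u"%s %s" % (verb, desc))
        print(subj, scores["compound"])  # 'worst president' should score negative, 'better' positive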

Thanks @Krzysiek for your solution too - I couldn't really go deep enough into your library to modify it, so instead I tried modifying the code from the link above to solve my problem.

