NLTK relation extraction (extract_rels) returns nothing

I recently worked on using NLTK to extract relations from text. I created a sample text, "Tom is Microsoft's co-founder.", and used the following program to test it, but it returns nothing. I can't understand why.

I am using NLTK version 3.2.1 and Python version 3.5.2.

Here is my code:

    import re
    import nltk
    from nltk.sem.relextract import extract_rels, rtuple
    from nltk.tokenize import sent_tokenize, word_tokenize

    def test():
        with open('sample.txt', 'r') as f:
            sample = f.read()  # "Tom is the cofounder of Microsoft"

        sentences = sent_tokenize(sample)
        tokenized_sentences = [word_tokenize(sentence) for sentence in sentences]
        tagged_sentences = [nltk.tag.pos_tag(sentence) for sentence in tokenized_sentences]

        OF = re.compile(r'.*\bof\b.*')

        for i, sent in enumerate(tagged_sentences):
            sent = nltk.chunk.ne_chunk(sent)  # ne_chunk expects one tagged sentence
            rels = extract_rels('PER', 'GPE', sent, corpus='ace', pattern=OF, window=10)
            for rel in rels:
                print('{0:<5}{1}'.format(i, rtuple(rel)))

    if __name__ == '__main__':
        test()

1. After some debugging, I discovered that when I changed the input to

"Gates was born in Seattle, Washington, on October 28, 1955."

nltk.chunk.ne_chunk() outputs:

    (S (PERSON Gates/NNS) was/VBD born/VBN in/IN (GPE Seattle/NNP) ,/, (GPE Washington/NNP) ,/, on/IN October/NNP 28/CD ,/, 1955/CD ./.)

test() returns:

    [PER: 'Gates/NNS'] 'was/VBD born/VBN in/IN' [GPE: 'Seattle/NNP']

2. After I changed the input to:

"Gates was born in Seattle on October 28, 1955."

test() returns nothing.

3. I dug into nltk/sem/relextract.py and found something strange:

the output comes from the function semi_rel2reldict(pairs, window=5, trace=False), which returns results only when len(pairs) > 2, and therefore a sentence with fewer than three NEs returns nothing.

Is this a bug or am I using NLTK incorrectly?

1 answer

First, to chunk NEs with ne_chunk , the idiom looks something like this:

    >>> from nltk import ne_chunk, pos_tag, word_tokenize
    >>> text = "Tom is the cofounder of Microsoft"
    >>> chunked = ne_chunk(pos_tag(word_tokenize(text)))
    >>> chunked
    Tree('S', [Tree('PERSON', [('Tom', 'NNP')]), ('is', 'VBZ'), ('the', 'DT'), ('cofounder', 'NN'), ('of', 'IN'), Tree('ORGANIZATION', [('Microsoft', 'NNP')])])

(see also fooobar.com/questions/711991/...)

Then consider the function extract_rels .

    def extract_rels(subjclass, objclass, doc, corpus='ace', pattern=None, window=10):
        """
        Filter the output of ``semi_rel2reldict`` according to specified NE classes and a filler pattern.

        The parameters ``subjclass`` and ``objclass`` can be used to restrict the
        Named Entities to particular types (any of 'LOCATION', 'ORGANIZATION',
        'PERSON', 'DURATION', 'DATE', 'CARDINAL', 'PERCENT', 'MONEY', 'MEASURE').
        """

When you call this function:

 extract_rels('PER', 'GPE', sent, corpus='ace', pattern=OF, window=10) 

It sequentially performs 4 steps.

1. It checks whether your subjclass and objclass are valid NE classes,

i.e. https://github.com/nltk/nltk/blob/develop/nltk/sem/relextract.py#L202:

    if subjclass and subjclass not in NE_CLASSES[corpus]:
        if _expand(subjclass) in NE_CLASSES[corpus]:
            subjclass = _expand(subjclass)
        else:
            raise ValueError("your value for the subject type has not been recognized: %s" % subjclass)
    if objclass and objclass not in NE_CLASSES[corpus]:
        if _expand(objclass) in NE_CLASSES[corpus]:
            objclass = _expand(objclass)
        else:
            raise ValueError("your value for the object type has not been recognized: %s" % objclass)
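This validation can be sketched without NLTK at all. The NE_CLASSES values below come from the docstring quoted above; the SHORT2LONG mapping is a hypothetical stand-in for what relextract's _expand does internally (it is consistent with 'PER' being accepted in the question's code, but the exact mapping is an assumption):

```python
# Sketch of the subjclass/objclass validation step.
# SHORT2LONG and expand() are stand-ins, not the real NLTK internals.
NE_CLASSES = {'ace': ['LOCATION', 'ORGANIZATION', 'PERSON', 'DURATION',
                      'DATE', 'CARDINAL', 'PERCENT', 'MONEY', 'MEASURE']}
SHORT2LONG = {'LOC': 'LOCATION', 'ORG': 'ORGANIZATION', 'PER': 'PERSON'}

def expand(cls):
    """Stand-in for nltk.sem.relextract._expand."""
    return SHORT2LONG.get(cls, cls)

def validate(subjclass, corpus='ace'):
    # Mirrors the branch above: accept full names, expand short ones,
    # and reject everything else with a ValueError.
    if subjclass and subjclass not in NE_CLASSES[corpus]:
        if expand(subjclass) in NE_CLASSES[corpus]:
            subjclass = expand(subjclass)
        else:
            raise ValueError("unrecognized subject type: %s" % subjclass)
    return subjclass

print(validate('PER'))     # short name expanded to PERSON
print(validate('PERSON'))  # full name passes through unchanged
```

So passing 'PER' and 'GPE' as in the question is not the problem here; both survive this step for the 'ace' corpus.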

2. It extracts "pairs" from your NE-tagged input:

    if corpus == 'ace' or corpus == 'conll2002':
        pairs = tree2semi_rel(doc)
    elif corpus == 'ieer':
        pairs = tree2semi_rel(doc.text) + tree2semi_rel(doc.headline)
    else:
        raise ValueError("corpus type not recognized")

Now, consider your input sentence Tom is the cofounder of Microsoft and what tree2semi_rel() returns for it:

    >>> from nltk.sem.relextract import tree2semi_rel, semi_rel2reldict
    >>> from nltk import word_tokenize, pos_tag, ne_chunk
    >>> text = "Tom is the cofounder of Microsoft"
    >>> chunked = ne_chunk(pos_tag(word_tokenize(text)))
    >>> tree2semi_rel(chunked)
    [[[], Tree('PERSON', [('Tom', 'NNP')])], [[('is', 'VBZ'), ('the', 'DT'), ('cofounder', 'NN'), ('of', 'IN')], Tree('ORGANIZATION', [('Microsoft', 'NNP')])]]

Thus, it returns a list of 2 lists; the first inner list consists of an empty list and a Tree carrying the "PERSON" tag:

 [[], Tree('PERSON', [('Tom', 'NNP')])] 

The second list consists of the phrase is the cofounder of and a Tree carrying the "ORGANIZATION" tag.

Moving forward.

3. extract_rels then tries to convert the pairs into some kind of relation dictionary:

 reldicts = semi_rel2reldict(pairs) 

If we look at what semi_rel2reldict returns for your example sentence, we see that this is where the empty list comes from:

    >>> tree2semi_rel(chunked)
    [[[], Tree('PERSON', [('Tom', 'NNP')])], [[('is', 'VBZ'), ('the', 'DT'), ('cofounder', 'NN'), ('of', 'IN')], Tree('ORGANIZATION', [('Microsoft', 'NNP')])]]
    >>> semi_rel2reldict(tree2semi_rel(chunked))
    []

So, let's look at the code of semi_rel2reldict , https://github.com/nltk/nltk/blob/develop/nltk/sem/relextract.py#L144:

    def semi_rel2reldict(pairs, window=5, trace=False):
        """
        Converts the pairs generated by ``tree2semi_rel`` into a 'reldict': a dictionary which
        stores information about the subject and object NEs plus the filler between them.
        Additionally, a left and right context of length =< window are captured (within
        a given input sentence).
        :param pairs: a pair of list(str) and ``Tree``, as generated by
        :param window: a threshold for the number of items to include in the left and right context
        :type window: int
        :return: 'relation' dictionaries whose keys are 'lcon', 'subjclass', 'subjtext', 'subjsym', 'filler', objclass', objtext', 'objsym' and 'rcon'
        :rtype: list(defaultdict)
        """
        result = []
        while len(pairs) > 2:
            reldict = defaultdict(str)
            reldict['lcon'] = _join(pairs[0][0][-window:])
            reldict['subjclass'] = pairs[0][1].label()
            reldict['subjtext'] = _join(pairs[0][1].leaves())
            reldict['subjsym'] = list2sym(pairs[0][1].leaves())
            reldict['filler'] = _join(pairs[1][0])
            reldict['untagged_filler'] = _join(pairs[1][0], untag=True)
            reldict['objclass'] = pairs[1][1].label()
            reldict['objtext'] = _join(pairs[1][1].leaves())
            reldict['objsym'] = list2sym(pairs[1][1].leaves())
            reldict['rcon'] = _join(pairs[2][0][:window])
            if trace:
                print("(%s(%s, %s)" % (reldict['untagged_filler'], reldict['subjclass'], reldict['objclass']))
            result.append(reldict)
            pairs = pairs[1:]
        return result

The first thing semi_rel2reldict() does is check whether the output of tree2semi_rel() has more than 2 elements, which is not the case for your sentence:

    >>> tree2semi_rel(chunked)
    [[[], Tree('PERSON', [('Tom', 'NNP')])], [[('is', 'VBZ'), ('the', 'DT'), ('cofounder', 'NN'), ('of', 'IN')], Tree('ORGANIZATION', [('Microsoft', 'NNP')])]]
    >>> len(tree2semi_rel(chunked))
    2
    >>> len(tree2semi_rel(chunked)) > 2
    False

Ah, so that's why extract_rels returns nothing.
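The gate itself is easy to see with plain data. Here is a minimal sketch using simple lists as stand-ins for the (left-context, Tree) pairs that tree2semi_rel produces; no NLTK objects are needed to show the behavior:

```python
# Stand-in for tree2semi_rel output on "Tom is the cofounder of Microsoft":
# two pairs, one per named entity.
pairs = [[[], 'PERSON<Tom>'],
         [['is', 'the', 'cofounder', 'of'], 'ORGANIZATION<Microsoft>']]

result = []
while len(pairs) > 2:       # never true for a sentence with only two NEs...
    result.append(pairs[0])
    pairs = pairs[1:]

print(result)               # ...so the result stays empty: []
```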

Now the question is: can extract_rels() return something even with only two elements from tree2semi_rel() ? Is that possible?

Let's try another sentence:

    >>> text = "Tom is the cofounder of Microsoft and now he is the founder of Marcohard"
    >>> chunked = ne_chunk(pos_tag(word_tokenize(text)))
    >>> chunked
    Tree('S', [Tree('PERSON', [('Tom', 'NNP')]), ('is', 'VBZ'), ('the', 'DT'), ('cofounder', 'NN'), ('of', 'IN'), Tree('ORGANIZATION', [('Microsoft', 'NNP')]), ('and', 'CC'), ('now', 'RB'), ('he', 'PRP'), ('is', 'VBZ'), ('the', 'DT'), ('founder', 'NN'), ('of', 'IN'), Tree('PERSON', [('Marcohard', 'NNP')])])
    >>> tree2semi_rel(chunked)
    [[[], Tree('PERSON', [('Tom', 'NNP')])], [[('is', 'VBZ'), ('the', 'DT'), ('cofounder', 'NN'), ('of', 'IN')], Tree('ORGANIZATION', [('Microsoft', 'NNP')])], [[('and', 'CC'), ('now', 'RB'), ('he', 'PRP'), ('is', 'VBZ'), ('the', 'DT'), ('founder', 'NN'), ('of', 'IN')], Tree('PERSON', [('Marcohard', 'NNP')])]]
    >>> len(tree2semi_rel(chunked)) > 2
    True
    >>> semi_rel2reldict(tree2semi_rel(chunked))
    [defaultdict(<type 'str'>, {'lcon': '', 'untagged_filler': 'is the cofounder of', 'filler': 'is/VBZ the/DT cofounder/NN of/IN', 'objsym': 'microsoft', 'objclass': 'ORGANIZATION', 'objtext': 'Microsoft/NNP', 'subjsym': 'tom', 'subjclass': 'PERSON', 'rcon': 'and/CC now/RB he/PRP is/VBZ the/DT', 'subjtext': 'Tom/NNP'})]

But this only confirms that extract_rels cannot extract anything when tree2semi_rel returns 2 or fewer pairs. What happens if we remove the condition while len(pairs) > 2 ?

Why can't we do while len(pairs) > 1 ?

If we look closer at the code, we see the last reldict-filling line, https://github.com/nltk/nltk/blob/develop/nltk/sem/relextract.py#L169:

 reldict['rcon'] = _join(pairs[2][0][:window]) 

It tries to access the third element of pairs , and if the length of the pairs is 2, you get an IndexError .
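That IndexError is easy to reproduce with the same two-element stand-in for tree2semi_rel output used above:

```python
# With only two pairs, pairs[2] (the source of 'rcon') does not exist.
pairs = [[[], 'PERSON<Tom>'],
         [['is', 'the', 'cofounder', 'of'], 'ORGANIZATION<Microsoft>']]

try:
    rcon_source = pairs[2][0][:5]   # what the original 'rcon' line would do
except IndexError:
    print('IndexError: pairs has no third element')
```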

So, what happens if we drop that rcon lookup (setting the key to an empty list instead) and change the condition to while len(pairs) >= 2 ?

To do this, we have to override the semi_rel2reldict() function:

    >>> from nltk.sem.relextract import _join, list2sym
    >>> from collections import defaultdict
    >>> def semi_rel2reldict(pairs, window=5, trace=False):
    ...     """
    ...     Converts the pairs generated by ``tree2semi_rel`` into a 'reldict': a dictionary which
    ...     stores information about the subject and object NEs plus the filler between them.
    ...     Additionally, a left and right context of length =< window are captured (within
    ...     a given input sentence).
    ...     :param pairs: a pair of list(str) and ``Tree``, as generated by
    ...     :param window: a threshold for the number of items to include in the left and right context
    ...     :type window: int
    ...     :return: 'relation' dictionaries whose keys are 'lcon', 'subjclass', 'subjtext', 'subjsym', 'filler', objclass', objtext', 'objsym' and 'rcon'
    ...     :rtype: list(defaultdict)
    ...     """
    ...     result = []
    ...     while len(pairs) >= 2:
    ...         reldict = defaultdict(str)
    ...         reldict['lcon'] = _join(pairs[0][0][-window:])
    ...         reldict['subjclass'] = pairs[0][1].label()
    ...         reldict['subjtext'] = _join(pairs[0][1].leaves())
    ...         reldict['subjsym'] = list2sym(pairs[0][1].leaves())
    ...         reldict['filler'] = _join(pairs[1][0])
    ...         reldict['untagged_filler'] = _join(pairs[1][0], untag=True)
    ...         reldict['objclass'] = pairs[1][1].label()
    ...         reldict['objtext'] = _join(pairs[1][1].leaves())
    ...         reldict['objsym'] = list2sym(pairs[1][1].leaves())
    ...         reldict['rcon'] = []
    ...         if trace:
    ...             print("(%s(%s, %s)" % (reldict['untagged_filler'], reldict['subjclass'], reldict['objclass']))
    ...         result.append(reldict)
    ...         pairs = pairs[1:]
    ...     return result
    ...
    >>> text = "Tom is the cofounder of Microsoft"
    >>> chunked = ne_chunk(pos_tag(word_tokenize(text)))
    >>> tree2semi_rel(chunked)
    [[[], Tree('PERSON', [('Tom', 'NNP')])], [[('is', 'VBZ'), ('the', 'DT'), ('cofounder', 'NN'), ('of', 'IN')], Tree('ORGANIZATION', [('Microsoft', 'NNP')])]]
    >>> semi_rel2reldict(tree2semi_rel(chunked))
    [defaultdict(<type 'str'>, {'lcon': '', 'untagged_filler': 'is the cofounder of', 'filler': 'is/VBZ the/DT cofounder/NN of/IN', 'objsym': 'microsoft', 'objclass': 'ORGANIZATION', 'objtext': 'Microsoft/NNP', 'subjsym': 'tom', 'subjclass': 'PERSON', 'rcon': [], 'subjtext': 'Tom/NNP'})]

Oh! It works, but there is still the 4th step in extract_rels() .

4. It filters the reldicts with the regular expression you provided via the pattern parameter, https://github.com/nltk/nltk/blob/develop/nltk/sem/relextract.py#L222:

    relfilter = lambda x: (x['subjclass'] == subjclass and
                           len(x['filler'].split()) <= window and
                           pattern.match(x['filler']) and
                           x['objclass'] == objclass)
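This filter can be exercised on a plain dict, without any NLTK objects; the values below are copied from the reldict shown earlier, reduced to the keys the filter actually inspects:

```python
import re

# A reldict reduced to the keys relfilter looks at.
reldict = {'subjclass': 'PERSON',
           'filler': 'is/VBZ the/DT cofounder/NN of/IN',
           'objclass': 'ORGANIZATION'}

subjclass, objclass, window = 'PERSON', 'ORGANIZATION', 5
pattern = re.compile(r'.*\bof\b.*')

relfilter = lambda x: (x['subjclass'] == subjclass and
                       len(x['filler'].split()) <= window and
                       pattern.match(x['filler']) and
                       x['objclass'] == objclass)

# 4 filler tokens <= window, and \bof\b matches 'of/IN', so it passes.
print(bool(relfilter(reldict)))  # True
```

Note that the pattern is matched against the tagged filler ('is/VBZ the/DT cofounder/NN of/IN'), not the untagged one, and the window counts tagged tokens.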

Now try it with the hacked version of semi_rel2reldict :

    >>> text = "Tom is the cofounder of Microsoft"
    >>> chunked = ne_chunk(pos_tag(word_tokenize(text)))
    >>> tree2semi_rel(chunked)
    [[[], Tree('PERSON', [('Tom', 'NNP')])], [[('is', 'VBZ'), ('the', 'DT'), ('cofounder', 'NN'), ('of', 'IN')], Tree('ORGANIZATION', [('Microsoft', 'NNP')])]]
    >>> semi_rel2reldict(tree2semi_rel(chunked))
    [defaultdict(<type 'str'>, {'lcon': '', 'untagged_filler': 'is the cofounder of', 'filler': 'is/VBZ the/DT cofounder/NN of/IN', 'objsym': 'microsoft', 'objclass': 'ORGANIZATION', 'objtext': 'Microsoft/NNP', 'subjsym': 'tom', 'subjclass': 'PERSON', 'rcon': [], 'subjtext': 'Tom/NNP'})]
    >>>
    >>> pattern = re.compile(r'.*\bof\b.*')
    >>> reldicts = semi_rel2reldict(tree2semi_rel(chunked))
    >>> relfilter = lambda x: (x['subjclass'] == subjclass and
    ...                        len(x['filler'].split()) <= window and
    ...                        pattern.match(x['filler']) and
    ...                        x['objclass'] == objclass)
    >>> relfilter
    <function <lambda> at 0x112e591b8>
    >>> subjclass = 'PERSON'
    >>> objclass = 'ORGANIZATION'
    >>> window = 5
    >>> list(filter(relfilter, reldicts))
    [defaultdict(<type 'str'>, {'lcon': '', 'untagged_filler': 'is the cofounder of', 'filler': 'is/VBZ the/DT cofounder/NN of/IN', 'objsym': 'microsoft', 'objclass': 'ORGANIZATION', 'objtext': 'Microsoft/NNP', 'subjsym': 'tom', 'subjclass': 'PERSON', 'rcon': [], 'subjtext': 'Tom/NNP'})]

It works! Now let's see it as a tuple:

    >>> from nltk.sem.relextract import rtuple
    >>> rels = list(filter(relfilter, reldicts))
    >>> for rel in rels:
    ...     print(rtuple(rel))
    ...
    [PER: 'Tom/NNP'] 'is/VBZ the/DT cofounder/NN of/IN' [ORG: 'Microsoft/NNP']