How to speed up NE recognition with Stanford NER in Python NLTK

First I split the contents of the file into sentences, and then I call Stanford NER on each sentence. But this process is very slow. I know it would be faster if I tagged the whole file content at once, but I call the tagger on every sentence because I want to index each sentence both before and after NE recognition.
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.tag.stanford import NERTagger

st = NERTagger('stanford-ner/classifiers/english.all.3class.distsim.crf.ser.gz', 'stanford-ner/stanford-ner.jar')
for filename in filelist:
    filecontent = open(filename).read()
    sentences = sent_tokenize(filecontent)  # break file content into sentences
    for j, sent in enumerate(sentences):
        words = word_tokenize(sent)  # tokenize sentence into words
        ne_tags = st.tag(words)  # get tagged NEs from Stanford NER

This is probably because st.tag() is called once per sentence, but is there a way to speed it up?

EDIT

The reason I want to tag sentences separately is that I want to write the sentences to a file (one record per sentence, so the sentences are indexed); then, given an untagged sentence at a later step, I can look up the original raw sentence (I am also doing lemmatization here).

file format:

(sent_number, orig_sentence, NE_and_lemmatized_sentence)
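A minimal sketch of writing records in that format, assuming tab-separated fields and a token/TAG encoding for the tagged sentence (both are assumptions, not from the question):

```python
# Hypothetical sketch: one (sent_number, orig_sentence,
# NE_and_lemmatized_sentence) record per line, tab-separated.
records = [
    (0, "Alice went home .", "Alice/PERSON go/O home/O ./O"),
    (1, "Bob slept .", "Bob/PERSON sleep/O ./O"),
]
with open("sentences.tsv", "w") as f:
    for j, orig, tagged in records:
        f.write("%d\t%s\t%s\n" % (j, orig, tagged))
```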

3 answers

There is a function StanfordNERTagger.tag_sents(), see https://github.com/nltk/nltk/blob/develop/nltk/tag/stanford.py#L68

>>> st = StanfordNERTagger('stanford-ner/classifiers/english.all.3class.distsim.crf.ser.gz', 'stanford-ner/stanford-ner.jar')
>>> tokenized_sents = [word_tokenize(sent)
...                    for filename in filelist
...                    for sent in sent_tokenize(open(filename).read())]
>>> st.tag_sents(tokenized_sents)
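To keep the per-sentence indexing from the question while still making a single batched call, you can record a (filename, sentence_number) key for every tokenized sentence. A minimal sketch of that bookkeeping, with a stand-in for st.tag_sents() and inlined file contents so it runs without the Stanford jar:

```python
# fake_tag_sents stands in for StanfordNERTagger.tag_sents(), which
# tags a whole list of tokenized sentences in one JVM invocation.
def fake_tag_sents(tokenized_sents):
    return [[(w, "O") for w in sent] for sent in tokenized_sents]

# Inlined stand-in for reading files from filelist; sentences pre-split.
files = {
    "a.txt": ["Alice went home .", "Bob slept ."],
    "b.txt": ["Carol ran ."],
}

keys, tokenized = [], []
for name, sents in files.items():
    for j, sent in enumerate(sents):
        keys.append((name, j))          # remember where the sentence came from
        tokenized.append(sent.split())  # stand-in for word_tokenize

tagged = fake_tag_sents(tokenized)      # one batched call for everything
indexed = {key: toks for key, toks in zip(keys, tagged)}
```

Afterwards, indexed[(filename, j)] holds the tag sequence for sentence j of that file, so the per-sentence records can still be written out in order.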

You can use the Stanford NER server; it will be much faster.

Install the sner client:

pip install sner

Start the NER server:

cd your_stanford_ner_dir
java -Djava.ext.dirs=./lib -cp stanford-ner.jar edu.stanford.nlp.ie.NERServer -port 9199 -loadClassifier ./classifiers/english.all.3class.distsim.crf.ser.gz

from sner import Ner

test_string = "Alice went to the Museum of Natural History."
tagger = Ner(host='localhost', port=9199)
print(tagger.get_entities(test_string))
Output:

[('Alice', 'PERSON'),
 ('went', 'O'),
 ('to', 'O'),
 ('the', 'O'),
 ('Museum', 'ORGANIZATION'),
 ('of', 'ORGANIZATION'),
 ('Natural', 'ORGANIZATION'),
 ('History', 'ORGANIZATION'),
 ('.', 'O')]

https://github.com/caihaoyu/sner


Download Stanford CoreNLP 3.5.2 from http://nlp.stanford.edu/software/corenlp.shtml

Unzip it to, for example, /User/username/stanford-corenlp-full-2015-04-20

Then run the CoreNLP pipeline from Python:

import os

stanford_distribution_dir = "/User/username/stanford-corenlp-full-2015-04-20"
list_of_sentences_path = "/Users/username/list_of_sentences.txt"
stanford_command = "cd %s ; java -Xmx2g -cp \"*\" edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner -ssplit.eolonly -filelist %s -outputFormat json" % (stanford_distribution_dir, list_of_sentences_path)
os.system(stanford_command)

Then load the resulting .json file in Python:

import json
sample_json = json.load(open("sample_file.txt.json"))

Now sample_json contains a dictionary with all the sentences from the file:

for sentence in sample_json["sentences"]:
  tokens = []
  ner_tags = []
  for token in sentence["tokens"]:
    tokens.append(token["word"])
    ner_tags.append(token["ner"])
  print (tokens, ner_tags)
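The same loop can be exercised against a hand-built dict shaped like CoreNLP's JSON output, for example to rebuild the (sent_number, orig_sentence, NE_and_lemmatized_sentence) records from the question (a sketch; the token/TAG encoding of the tagged sentence is an assumption):

```python
# Hand-built stand-in for the parsed CoreNLP JSON of one file.
sample_json = {"sentences": [
    {"tokens": [
        {"word": "Alice", "lemma": "Alice", "ner": "PERSON"},
        {"word": "slept", "lemma": "sleep", "ner": "O"},
    ]},
]}

records = []
for i, sentence in enumerate(sample_json["sentences"]):
    orig = " ".join(t["word"] for t in sentence["tokens"])
    tagged = " ".join("%s/%s" % (t["lemma"], t["ner"]) for t in sentence["tokens"])
    records.append((i, orig, tagged))
```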

list_of_sentences.txt should contain the paths to your input files, one per line:

input_file_1.txt
input_file_2.txt
...
input_file_100.txt
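Because of the -ssplit.eolonly flag in the command above, each listed file is expected to contain one sentence per line. A sketch of writing one input file and the file list:

```python
# Write one sentence per line (as required by -ssplit.eolonly) and
# list the resulting path in list_of_sentences.txt.
sentences = ["Alice went to the Museum of Natural History .", "Bob stayed home ."]
with open("input_file_1.txt", "w") as f:
    f.write("\n".join(sentences) + "\n")
with open("list_of_sentences.txt", "w") as f:
    f.write("input_file_1.txt\n")
```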

So for each input_file.txt (one sentence per line), the Java command generates an input_file.txt.json, and that .json file contains the NER annotations (for example, the ner tag sequence for every token). You can then load each .json with json.loads(...) as shown above and pull out just the fields you need.



Source: https://habr.com/ru/post/1616146/

