What is the difference between gensim LabeledSentence and TaggedDocument

Please help me understand the difference between gensim's TaggedDocument and LabeledSentence and how each one works. My ultimate goal is text classification using the Doc2Vec model and any classifier. I am following this blog.

    import os

    from gensim.models.doc2vec import Doc2Vec, LabeledSentence, TaggedDocument
    from nltk.corpus import stopwords

    STOPWORDS = set(stopwords.words('english'))  # build the stopword set once


    class MyLabeledSentences(object):
        def __init__(self, dirname):
            self.dirname = dirname
            self.sentList = []

        def ToArray(self):
            # One LabeledSentence per line, tagged "<file stem>_<line number>".
            for fname in os.listdir(self.dirname):
                with open(os.path.join(self.dirname, fname)) as fin:
                    for item_no, sentence in enumerate(fin):
                        self.sentList.append(LabeledSentence(
                            # keep only non-stopword tokens
                            [w for w in sentence.lower().split() if w not in STOPWORDS],
                            [fname.split('.')[0].strip() + '_%s' % item_no]))
            return self.sentList


    class MyTaggedDocument(object):
        def __init__(self, dirname):
            self.dirname = dirname
            self.sentList = []

        def ToArray(self):
            # Same as above, but wrapping each line in a TaggedDocument.
            for fname in os.listdir(self.dirname):
                with open(os.path.join(self.dirname, fname)) as fin:
                    for item_no, sentence in enumerate(fin):
                        self.sentList.append(TaggedDocument(
                            [w for w in sentence.lower().split() if w not in STOPWORDS],
                            [fname.split('.')[0].strip() + '_%s' % item_no]))
            return self.sentList


    # Model trained on LabeledSentence objects
    sentences = MyLabeledSentences(some_dir_name)
    model_l = Doc2Vec(min_count=1, window=10, size=300, sample=1e-4, negative=5, workers=7)
    sentences_l = sentences.ToArray()
    model_l.build_vocab(sentences_l)
    for epoch in range(15):
        # random.shuffle(sentences_l)
        model_l.train(sentences_l)
        model_l.alpha -= 0.002  # decrease the learning rate
        model_l.min_alpha = model_l.alpha  # keep the rate fixed within an epoch

    # Model trained on TaggedDocument objects
    sentences = MyTaggedDocument(some_dir_name)
    model_t = Doc2Vec(min_count=1, window=10, size=300, sample=1e-4, negative=5, workers=7)
    sentences_t = sentences.ToArray()
    model_t.build_vocab(sentences_t)
    for epoch in range(15):
        # random.shuffle(sentences_t)
        model_t.train(sentences_t)
        model_t.alpha -= 0.002  # decrease the learning rate
        model_t.min_alpha = model_t.alpha  # keep the rate fixed within an epoch

My question: is model_l.docvecs['some_word'] the same as model_t.docvecs['some_word']? Can you point me to good sources for understanding how TaggedDocument or LabeledSentence works?

1 answer

LabeledSentence is an older, deprecated name for the same simple object type, now called TaggedDocument, that encapsulates one text example. Any object that has words and tags properties, each of them a list, will do. (words is always a list of strings; tags can be a mix of integers and strings, but in the common and most efficient case it is just a list with a single integer id, starting at 0.)
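As a minimal sketch of that shape, the snippet below builds text examples from a plain namedtuple with a words list and a tags list. The name Doc and the sample lines are hypothetical stand-ins for illustration; Doc is not gensim's own class, it only mirrors the two fields described above.

```python
from collections import namedtuple

# Hypothetical stand-in with the two fields the answer describes;
# this is an illustration, not gensim's TaggedDocument itself.
Doc = namedtuple("Doc", ["words", "tags"])

# Made-up sample text, one "document" per line.
lines = ["The quick brown fox", "jumps over the lazy dog"]

# String tags in the style of the question: "<file stem>_<line number>".
string_tagged = [
    Doc(words=line.lower().split(), tags=["sample_%s" % i])
    for i, line in enumerate(lines)
]

# Single integer ids starting at 0, the case described as most efficient.
int_tagged = [
    Doc(words=line.lower().split(), tags=[i])
    for i, line in enumerate(lines)
]

print(string_tagged[1].tags)
print(int_tagged[1].words)
```

Either list has the words/tags layout that the answer says is all a text example needs, so the class name wrapped around it does not change what the model sees.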

model_l and model_t will serve the same purposes, having been trained on the same data with the same parameters, using only different names for the objects. But the vectors they return for individual word tokens ( model['some_word'] ) or document tags ( model.docvecs['somefilename_NN'] ) are likely to be different: there is randomness in the Word2Vec / Doc2Vec initialization and in the sampling done during training, plus ordering jitter introduced by multithreaded training.
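A toy illustration of why two such models can be equally useful even though their raw vectors differ. The 2-D vectors and tag names below are invented, and the second "run" is simply a rotated copy of the first. Real training runs are not exact rotations of one another, so treat this only as an idealized sketch of the idea that coordinates are arbitrary while similarities within one model carry the meaning.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rotate(vec, angle):
    """Rotate a 2-D vector counter-clockwise by `angle` radians."""
    x, y = vec
    c, s = math.cos(angle), math.sin(angle)
    return (x * c - y * s, x * s + y * c)


# Invented document vectors standing in for one training run.
run_a = {"doc_0": (1.0, 0.0), "doc_1": (0.8, 0.6), "doc_2": (-0.6, 0.8)}

# A second "run": same geometry, different coordinates (idealized assumption).
run_b = {tag: rotate(vec, math.radians(70)) for tag, vec in run_a.items()}

# The raw vectors for the same tag differ between the two runs...
coordinates_match = run_a["doc_0"] == run_b["doc_0"]

# ...but the similarity between two tags inside each run is unchanged.
sim_a = cosine(run_a["doc_0"], run_a["doc_1"])
sim_b = cosine(run_b["doc_0"], run_b["doc_1"])

print(coordinates_match)
print(math.isclose(sim_a, sim_b, abs_tol=1e-9))
```

In practice this means comparing model_l.docvecs[tag] against model_t.docvecs[tag] directly is not meaningful; comparing how each model ranks or classifies the same documents is the fairer check.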


Source: https://habr.com/ru/post/1261380/

