Using a dataset for training and testing with NLTK

I am trying to use the Naive Bayes algorithm for sentiment analysis and have gone through several articles. As almost every article mentions, I need to train the Naive Bayes classifier on some pre-labeled sentiment data.

Now I have a piece of code that uses the movie_reviews corpus bundled with NLTK. Code:

import nltk
import random
from nltk.corpus import movie_reviews

# Pair each review's word list with its category ('pos' or 'neg')
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
random.shuffle(documents)

# Frequency distribution over all (lower-cased) words in the corpus
all_words = []
for w in movie_reviews.words():
    all_words.append(w.lower())
all_words = nltk.FreqDist(all_words)

# Use the first 3000 distinct words as features
word_features = list(all_words.keys())[:3000]

def find_features(document):
    words = set(document)
    features = {}
    for w in word_features:
        features[w] = (w in words)
    return features

featuresets = [(find_features(rev), category) for (rev, category) in documents]
training_set = featuresets[:1900]
testing_set = featuresets[1900:]

classifier = nltk.NaiveBayesClassifier.train(training_set)
print("Classifier accuracy percent:", (nltk.classify.accuracy(classifier, testing_set)) * 100)

So in the above code I have a training_set and a testing_set. I looked inside the movie_reviews corpus, and it is simply a collection of many small text files, each containing one review, sorted into category directories.
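For reference, here is a minimal sketch, using only the standard NLTK corpus API, that shows how that directory of files is exposed through the corpus reader:

from nltk.corpus import movie_reviews

# The corpus has two categories, 'neg' and 'pos', each backed by a
# directory of small plain-text review files.
print(movie_reviews.categories())        # ['neg', 'pos']
print(movie_reviews.fileids('pos')[:2])  # file ids look like 'pos/cv000_....txt'
print(len(movie_reviews.fileids()))      # 2000 reviews in total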

  • So my question is: we had the movie_reviews corpus, we imported it, and we trained and tested a classifier with it, but how can I do the same thing with an external training dataset and an external test dataset?
  • In other words, just as NLTK parses the movie_reviews directory with all of its text files, I want to use http://ai.stanford.edu/~amaas/data/sentiment/ as my training dataset, so I need to understand how that is done (a rough sketch of one possible approach is below).
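Here is a minimal sketch of one way this could be done, assuming the Stanford archive (aclImdb_v1.tar.gz) has been extracted into a local aclImdb/ directory; the directory layout (train/pos, train/neg, test/pos, test/neg) comes from that dataset, and the exact paths and regex patterns below are assumptions about it. It wraps the external files in NLTK's CategorizedPlaintextCorpusReader, the same kind of reader that backs movie_reviews, so the rest of the pipeline from the question can stay unchanged:

import random
import nltk
from nltk.corpus.reader import CategorizedPlaintextCorpusReader

# Assumed paths: adjust to wherever you unpacked aclImdb_v1.tar.gz.
# Files live under e.g. aclImdb/train/pos/0_9.txt, aclImdb/test/neg/... .
train_corpus = CategorizedPlaintextCorpusReader(
    'aclImdb/train', r'(pos|neg)/.*\.txt', cat_pattern=r'(pos|neg)/.*')
test_corpus = CategorizedPlaintextCorpusReader(
    'aclImdb/test', r'(pos|neg)/.*\.txt', cat_pattern=r'(pos|neg)/.*')

def load_documents(corpus):
    # Same (word list, category) pairs as with movie_reviews
    return [(list(corpus.words(fileid)), category)
            for category in corpus.categories()
            for fileid in corpus.fileids(category)]

train_docs = load_documents(train_corpus)
test_docs = load_documents(test_corpus)
random.shuffle(train_docs)

# Build the vocabulary from the training data only
all_words = nltk.FreqDist(w.lower() for (words, _) in train_docs for w in words)
word_features = list(all_words.keys())[:3000]

def find_features(document):
    words = set(document)
    return {w: (w in words) for w in word_features}

training_set = [(find_features(words), category) for (words, category) in train_docs]
testing_set = [(find_features(words), category) for (words, category) in test_docs]

classifier = nltk.NaiveBayesClassifier.train(training_set)
print("Accuracy:", nltk.classify.accuracy(classifier, testing_set) * 100)

The idea is that an external dataset is just a directory tree of text files plus a rule for mapping file paths to categories, so once it is wrapped in a corpus reader the feature extraction and training steps do not change. Note that the Stanford set has 25,000 training and 25,000 test reviews, so this will run much more slowly than the movie_reviews example; taking a slice of the documents first is a reasonable way to test the pipeline.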

Source: https://habr.com/ru/post/1242606/

