How to prepare a large dataset for classification

I have a training dataset of 1,600,000 tweets. How can I train on this much data?

I tried something using nltk.NaiveBayesClassifier. If I run it, it will take more than 5 days.

```python
def extract_features(tweet):
    tweet_words = set(tweet)
    features = {}
    for word in featureList:
        features['contains(%s)' % word] = (word in tweet_words)
    return features

training_set = nltk.classify.util.apply_features(extract_features, tweets)
NBClassifier = nltk.NaiveBayesClassifier.train(training_set)  # This takes a lot of time
```

What should I do?

I need to classify my dataset using SVM and Naive Bayes.

The dataset I want to use is: Link

Example (training set):

```text
Label  Tweet
0      url aww bummer you shoulda got david carr third day
4      thankyou for your reply are you coming england again anytime soon
```

Example (test data set):

```text
Label  Tweet
4      love lebron url
0      lebron beast but still cheering the til the end
```

I have to predict Label 0/4 only.

How can I effectively train this huge data set?


Following what has already been excellently suggested on feature extraction, you can use the TfidfVectorizer from scikit-learn to extract the important words from the tweets. With the default configuration, combined with a simple logistic regression, it gives me an accuracy of 0.8. Hope this helps. Here is an example of how to use it:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix

train_df_raw = pd.read_csv('train.csv', header=None, names=['label', 'tweet'])
test_df_raw = pd.read_csv('test.csv', header=None, names=['label', 'tweet'])
train_df_raw = train_df_raw[train_df_raw['tweet'].notnull()]
test_df_raw = test_df_raw[test_df_raw['tweet'].notnull()]
test_df_raw = test_df_raw[test_df_raw['label'] != 2]

# Map the labels to binary: 0 stays 0, everything else (4) becomes 1
y_train = [x if x == 0 else 1 for x in train_df_raw['label'].tolist()]
y_test = [x if x == 0 else 1 for x in test_df_raw['label'].tolist()]
X_train = train_df_raw['tweet'].tolist()
X_test = test_df_raw['tweet'].tolist()

print('At vectorizer')
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train)
print('At vectorizer for test data')
X_test = vectorizer.transform(X_test)

print('At classifier')
classifier = LogisticRegression()
classifier.fit(X_train, y_train)
predictions = classifier.predict(X_test)
print('Accuracy:', accuracy_score(y_test, predictions))
print(confusion_matrix(y_test, predictions))
```

Output:

```text
Accuracy: 0.8
[[135  42]
 [ 30 153]]
```
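Since the question also asks for Naive Bayes, swapping the classifier for scikit-learn's MultinomialNB is a one-line change (and it trains very fast on sparse tf-idf matrices). A minimal sketch on a tiny made-up corpus, not the real dataset:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny made-up corpus standing in for the real train.csv
X_train = ["aww bummer you shoulda got david carr",
           "thankyou for your reply see you soon",
           "worst day ever so sad",
           "love this song so happy"]
y_train = [0, 1, 0, 1]

vectorizer = TfidfVectorizer()
X_train_tfidf = vectorizer.fit_transform(X_train)

# MultinomialNB handles the sparse, non-negative tf-idf matrix directly
clf = MultinomialNB()
clf.fit(X_train_tfidf, y_train)

X_test = vectorizer.transform(["so sad worst day"])
print(clf.predict(X_test))
```

The same `vectorizer.transform` / `clf.predict` pattern applies to the real test set.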

Before speeding up the training, I would personally make sure you actually need it. While this does not directly answer your question, I will try to offer a different angle that you may or may not be missing (it is hard to tell from your initial post).

Take the other answer here as a baseline: 1.6M training samples and 500 test samples with 3 features give an accuracy of 0.35.

Using the same setup, you can go down to fewer than 50 thousand training samples without losing accuracy; in fact, the accuracy will increase slightly, probably because with that many examples you are overfitting (you can check this by running the code with a smaller sample size). I am fairly sure that using a neural network at this stage would give terrible accuracy with this setup (an SVM can be tuned to counter overfitting, though that is not my point).

In your initial post, you wrote that you have 55k features (which you deleted for some reason?). That number should match your training set size. Since you did not provide your feature list, it is impossible to give you a properly working model or to test my assumptions.

However, I strongly recommend that you reduce your training data as a first step and see: (a) how well you do, and (b) at what point overfitting occurs. I would also increase the test set size; 500 vs. 1.6M is a rather strange split. Try an 80/20% train/test split. As a third step, check the size of your feature list. Is it representative of what you need? If there are unnecessary or duplicate features in that list, you should consider pruning it.
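A minimal way to check points (a) and (b) is to train on increasing subsamples and watch where the test accuracy plateaus. A sketch with synthetic data (the feature matrix here is a made-up stand-in, not the tweet dataset):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
# Synthetic stand-in for the real feature matrix: 5000 samples, 10 features,
# label driven by the first feature plus some noise
X = rng.rand(5000, 10)
y = (X[:, 0] + 0.1 * rng.randn(5000) > 0.5).astype(int)

# 80/20 train/test split, as suggested above
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Train on growing subsamples; when the score stops improving,
# extra training data is no longer buying you anything
for n in [100, 500, 1000, len(X_train)]:
    clf = LogisticRegression()
    clf.fit(X_train[:n], y_train[:n])
    print(n, clf.score(X_test, y_test))
```

On the real data, replace the synthetic `X`/`y` with your extracted features and labels; the loop itself stays the same.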

As a final thought: if you do come back to long training times (for example, because you decide you really need much more data than you have now), consider whether slow training is actually a problem (besides for testing your model). Many state-of-the-art classifiers are trained for days or weeks on GPUs. In that case the training time does not matter much, because they are trained only once and possibly updated only with small batches of data when they "go online".


Here is an option. It took 3 minutes on my machine (I really should get a new one :P).

```text
MacBook 2006
2 GHz Intel Core 2 Duo
2 GB DDR2 SDRAM
```

Accuracy Achieved: 0.3555421686747

I am sure that if you tune the support vector machine, you can get better results.

First, I changed the CSV file format so it is easier to import. I just replaced the first space with a comma, which can then be used as the delimiter during import.

```shell
cat testing.csv | sed 's/\ /,/' > test.csv
cat training.csv | sed 's/\ /,/' > train.csv
```

In Python, I used pandas to read the CSV files and a list comprehension to extract the features. This is much faster than for loops. Afterwards, I used sklearn to train a support vector machine.

```python
import pandas
from sklearn import svm
from sklearn.metrics import accuracy_score

featureList = ['obama', 'usa', 'bieber']

train_df = pandas.read_csv('train.csv', sep=',', dtype={'label': int, 'tweet': str})
test_df = pandas.read_csv('test.csv', sep=',', dtype={'label': int, 'tweet': str})

# Binary bag-of-words features: does each word in featureList appear in the tweet?
train_features = [[w in str(tweet) for w in featureList] for tweet in train_df.values[:, 1]]
test_features = [[w in str(tweet) for w in featureList] for tweet in test_df.values[:, 1]]
train_labels = train_df.values[:, 0]
test_labels = test_df.values[:, 0]

clf = svm.SVC(max_iter=1000)
clf.fit(train_features, train_labels)
prediction = clf.predict(test_features)
print('accuracy:', accuracy_score(test_labels.tolist(), prediction.tolist()))
```
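One tuning note: kernel `svm.SVC` training scales roughly quadratically with the number of samples, so on 1.6M rows it is the bottleneck (hence the `max_iter=1000` cap above). `LinearSVC`, which uses liblinear, scales much better for this kind of data. A sketch of the swap on dummy boolean features (the data here is synthetic, only the API is real):

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

rng = np.random.RandomState(0)
# Synthetic boolean features shaped like the ones above
X_train = rng.rand(1000, 3) > 0.5
y_train = X_train[:, 0].astype(int)  # label is a function of the first feature
X_test = rng.rand(200, 3) > 0.5
y_test = X_test[:, 0].astype(int)

# Drop-in replacement for svm.SVC; trains in roughly linear time
clf = LinearSVC()
clf.fit(X_train, y_train)
print('accuracy:', accuracy_score(y_test, clf.predict(X_test)))
```

For the real dataset you would pass the same `train_features`/`train_labels` lists as in the code above.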

Source: https://habr.com/ru/post/981027/

