The lack of labeled data is a problem that plagues many machine learning applications. Ask yourself: are you hoping that someone looked at your tweets, blog articles, and news articles, labeled the source of each one, and published that database? Or is it acceptable for a program to have done the classification? In the first case, keywords look like a good classification scheme, but in reality they are not: different people will choose different keywords for the same content, and that inconsistency will fundamentally damage your machine learning process.
My point is that you should use unsupervised learning (no labels provided) rather than supervised learning (labels provided) - do not go looking for labeled data, because you will not find it. Even if you come across some data that was labeled by a program, that program probably used unsupervised learning methods to produce the labels.
I recommend that you use some of the functions defined in the scikit-learn cluster module. They implement unsupervised learning methods.
UC Irvine has a large repository of machine learning datasets, and you can test some of your natural language processing ideas on them. One popular dataset is the Enron email dataset. It and 4 others are compiled here.
The UCI datasets are great, but they are not in scikit-learn format, so you will have to convert them. I usually start with the iris dataset, since it is small and easy to play with in scikit-learn. As you can see in the example line
est.fit(X)
only a data array X is required; no labels Y are needed. The line
X = iris.data
assigns X a 150_instances by num_features numpy array. You need to get the UCI data into this same form.
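Putting those pieces together, a minimal sketch of the iris workflow looks like this (the choice of n_clusters=3 is mine, purely for illustration; the labels are never used):

from sklearn import datasets
from sklearn.cluster import KMeans

iris = datasets.load_iris()
X = iris.data                  # 150 instances x 4 features, no labels needed
est = KMeans(n_clusters=3)     # 3 clusters chosen arbitrarily for this sketch
est.fit(X)

print(X.shape)                 # (150, 4)
print(est.labels_)             # a cluster assignment for every instance

Now let's take a look at the NYTimes news articles.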
From the readme.txt file noted under the UCI link:
For each text collection, D is the number of documents, W is the number of words in the vocabulary, and N is the total number of words in the collection (below, NNZ is the number of nonzero counts in the bag-of-words). After tokenization and removal of stopwords, the vocabulary of unique words was truncated by only keeping words that occurred more than ten times. ... NYTimes news articles: orig source: ldc.upenn.edu D=300000 W=102660 N=100,000,000 (approx)
That is, your X will have the form 300000_instances by 102660_features. Note the attribute format:
Attribute Information: The format of the docword.*.txt file is 3 header lines, followed by NNZ triples:
---
D
W
NNZ
docID wordID count
docID wordID count
docID wordID count
...
docID wordID count
---
That data is in the docword.nytimes.txt file. Here is some code to read it and run a clustering algorithm on it:
import numpy as np
from sklearn.cluster import KMeans

with open('docword.nytimes.txt', 'r') as f:
    # read the header information
    n_instances = int(f.readline())
    n_attributes = int(f.readline())
    n_nnz = int(f.readline())

    # create the scikit-learn X numpy array
    X = np.zeros((n_instances, n_attributes))
    for line in f:
        doc_id, word_id, count = (int(v) for v in line.split())
        # docID and wordID in the file are 1-indexed
        X[doc_id - 1, word_id - 1] = count

# run sklearn clustering on the NYTimes data
n_clusters = 8
est = KMeans(n_clusters=n_clusters)
est.fit(X)
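For reference, if fit() does complete, KMeans exposes the cluster assignments and centroids as attributes:

print(est.labels_[:10])            # cluster assignment for the first 10 documents
print(est.cluster_centers_.shape)  # one centroid per cluster: (8, 102660)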
Unfortunately, this requires a lot of memory: a dense 300000 x 102660 array of 8-byte floats is roughly 250 GB, which is more memory than my machine has, so I could not test this code. However, I believe your application domain is comparable to this one. You will need to look at dimensionality reduction methods, or work with only smaller subsets of the words at a time.
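One way to act on that advice, sketched here with illustrative parameter choices and not tested on the full dataset either: keep X as a scipy sparse matrix (the file only lists nonzero counts anyway), reduce the 102660 word features with TruncatedSVD, and cluster with MiniBatchKMeans, which processes the data in small batches.

from scipy.sparse import coo_matrix
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import MiniBatchKMeans

rows, cols, vals = [], [], []
with open('docword.nytimes.txt', 'r') as f:
    n_instances = int(f.readline())
    n_attributes = int(f.readline())
    n_nnz = int(f.readline())
    for line in f:
        doc_id, word_id, count = (int(v) for v in line.split())
        rows.append(doc_id - 1)    # the file is 1-indexed
        cols.append(word_id - 1)
        vals.append(count)

# store only the ~100 million nonzero counts instead of a dense array
X = coo_matrix((vals, (rows, cols)),
               shape=(n_instances, n_attributes)).tocsr()

# project the word counts down to a few hundred dense components
X_reduced = TruncatedSVD(n_components=100).fit_transform(X)

# mini-batch k-means clusters the reduced data in small chunks
est = MiniBatchKMeans(n_clusters=8)
est.fit(X_reduced)

MiniBatchKMeans also accepts sparse input directly, so the SVD step is optional if you only want the cluster assignments.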
Hope this helps. Feel free to let me know if you have questions.