The lack of labeled data is a problem that plagues many machine learning applications. Ask yourself: are you hoping that someone looked at your tweets, blog articles, and news articles, labeled the source of each one, and published that database? Or is it acceptable for a program to have done the classification? In the first case, keywords look like a good classification scheme, but in reality they are not: different people will choose different keywords for the same content, and that inconsistency will fundamentally damage your machine learning process.
My point is that you should use unsupervised learning (no labels provided) rather than supervised learning (labels provided) - do not go looking for labeled data, because you will not find it. Even if you come across some data that was labeled by a program, that program probably used unsupervised learning methods to produce the labels.
I recommend that you use some of the functions defined in the scikit-learn cluster module. They implement unsupervised learning methods.
UC Irvine has a large repository of machine learning datasets, and you can test some of your natural language processing ideas on them. One popular dataset is the Enron email dataset. It and 4 others are compiled here.
The UCI datasets are great, but they are not in scikit-learn format, so you will have to convert them. I usually start with the iris dataset, since it is small and easy to play with in scikit-learn. As you can see in the example line
est.fit(X)
only a data array X is required; no labels Y are needed. The line
X = iris.data
assigns X a 150_instances by num_features numpy array. You need to get the UCI data into this same form.
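Putting those pieces together, a minimal sketch of the iris workflow looks like this (the choice of n_clusters=3 is mine, purely for illustration; the labels are never used):

from sklearn import datasets
from sklearn.cluster import KMeans

iris = datasets.load_iris()
X = iris.data                  # 150 instances x 4 features, no labels needed
est = KMeans(n_clusters=3)     # 3 clusters chosen arbitrarily for this sketch
est.fit(X)

print(X.shape)                 # (150, 4)
print(est.labels_)             # a cluster assignment for every instance

Now let's take a look at the NYTimes news articles.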
From the readme.txt file noted under the UCI link:
For each text collection, D is the number of documents, W is the number of words in the vocabulary, and N is the total number of words in the collection (below, NNZ is the number of nonzero counts in the bag-of-words). After tokenization and removal of stopwords, the vocabulary of unique words was truncated by only keeping words that occurred more than ten times. ... NYTimes news articles: orig source: ldc.upenn.edu D=300000 W=102660 N=100,000,000 (approx)
That is, your X will have the form 300000_instances by 102660_features. Note the attribute format:
Attribute Information: The format of the docword.*.txt file is 3 header lines, followed by NNZ triples:
---
D
W
NNZ
docID wordID count
docID wordID count
docID wordID count
...
docID wordID count
---
That data is in the docword.nytimes.txt file. Here is some code to read it and run a clustering algorithm on it:
import numpy as np
from sklearn.cluster import KMeans

with open('docword.nytimes.txt', 'r') as f:
    # read the header information
    n_instances = int(f.readline())
    n_attributes = int(f.readline())
    n_nnz = int(f.readline())

    # create the scikit-learn X numpy array
    X = np.zeros((n_instances, n_attributes))
    for line in f:
        doc_id, word_id, count = (int(v) for v in line.split())
        # docID and wordID in the file are 1-indexed
        X[doc_id - 1, word_id - 1] = count

# run sklearn clustering on the NYTimes data
n_clusters = 8
est = KMeans(n_clusters=n_clusters)
est.fit(X)
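For reference, if fit() does complete, KMeans exposes the cluster assignments and centroids as attributes:

print(est.labels_[:10])            # cluster assignment for the first 10 documents
print(est.cluster_centers_.shape)  # one centroid per cluster: (8, 102660)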
Unfortunately, this requires a lot of memory: a dense 300000 x 102660 array of 8-byte floats is roughly 250 GB, which is more memory than my machine has, so I could not test this code. However, I believe your application domain is comparable to this one. You will need to look at dimensionality reduction methods, or work with only smaller subsets of the words at a time.
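One way to act on that advice, sketched here with illustrative parameter choices and not tested on the full dataset either: keep X as a scipy sparse matrix (the file only lists nonzero counts anyway), reduce the 102660 word features with TruncatedSVD, and cluster with MiniBatchKMeans, which processes the data in small batches.

from scipy.sparse import coo_matrix
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import MiniBatchKMeans

rows, cols, vals = [], [], []
with open('docword.nytimes.txt', 'r') as f:
    n_instances = int(f.readline())
    n_attributes = int(f.readline())
    n_nnz = int(f.readline())
    for line in f:
        doc_id, word_id, count = (int(v) for v in line.split())
        rows.append(doc_id - 1)    # the file is 1-indexed
        cols.append(word_id - 1)
        vals.append(count)

# store only the ~100 million nonzero counts instead of a dense array
X = coo_matrix((vals, (rows, cols)),
               shape=(n_instances, n_attributes)).tocsr()

# project the word counts down to a few hundred dense components
X_reduced = TruncatedSVD(n_components=100).fit_transform(X)

# mini-batch k-means clusters the reduced data in small chunks
est = MiniBatchKMeans(n_clusters=8)
est.fit(X_reduced)

MiniBatchKMeans also accepts sparse input directly, so the SVD step is optional if you only want the cluster assignments.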
Hope this helps. Feel free to let me know if you have questions.