Upload a custom dataset (which looks like a set of 20 newsgroups) to Scikit to classify text documents

I am trying to run this scikit example code for my custom Ted Talks dataset. Each directory is a topic under which there are text files that contain a description for each Ted Talk.

This is what my tree-like data structure looks like. As you can see, each directory is a topic, and below are text files that contain a description.

Topics/
|-- Activism
|   |-- 1149.txt
|   |-- 1444.txt
|   |-- 157.txt
|   |-- 1616.txt
|   |-- 1706.txt
|   |-- 1718.txt
|-- Adventure
|   |-- 1036.txt
|   |-- 1777.txt
|   |-- 2930.txt
|   |-- 2968.txt
|   |-- 3027.txt
|   |-- 3290.txt
|-- Advertising
|   |-- 3673.txt
|   |-- 3685.txt
|   |-- 6567.txt
|   `-- 6925.txt
|-- Africa
|   |-- 1045.txt
|   |-- 1072.txt
|   |-- 1103.txt
|   |-- 1112.txt
|-- Aging
|   |-- 1848.txt
|   |-- 2495.txt
|   |-- 2782.txt
|-- Agriculture
|   |-- 3469.txt
|   |-- 4140.txt
|   |-- 4733.txt
|   |-- 4939.txt

I made my data set in such a way that it resembled the 20news group, whose tree structure is as follows:

20news-18828/
|-- alt.atheism
|   |-- 49960
|   |-- 51060
|   |-- 51119

|-- comp.graphics
|   |-- 37261
|   |-- 37913
|   |-- 37914
|   |-- 37915
|   |-- 37916
|   |-- 37917
|   |-- 37918
|-- comp.os.ms-windows.misc
|   |-- 10000
|   |-- 10001
|   |-- 10002
|   |-- 10003
|   |-- 10004
|   |-- 10005 

To the source code (98-124). Here's how to download training and testing data directly from scikit.

print("Loading 20 newsgroups dataset for categories:")
print(categories if categories else "all")

data_train = fetch_20newsgroups(subset='train', categories=categories,
                                shuffle=True, random_state=42,
                                remove=remove)

data_test = fetch_20newsgroups(subset='test', categories=categories,
                               shuffle=True, random_state=42,
                               remove=remove)
print('data loaded')

categories = data_train.target_names    # for case categories == None
def size_mb(docs):
    return sum(len(s.encode('utf-8')) for s in docs) / 1e6

data_train_size_mb = size_mb(data_train.data)
data_test_size_mb = size_mb(data_test.data)

print("%d documents - %0.3fMB (training set)" % (
    len(data_train.data), data_train_size_mb))
print("%d documents - %0.3fMB (test set)" % (
    len(data_test.data), data_test_size_mb))
print("%d categories" % len(categories))
print()

# split a training set and a test set
y_train, y_test = data_train.target, data_test.target

Scikit, .. . , ( 84):

dataset = load_files('./TED_dataset/Topics/')

, . , :

data_train.data,  data_test.data 

, , . , .

, . , data_train.target_names .

:

, :

dataset = load_files('./TED_dataset/Topics/')
train, test = train_test_split(dataset, train_size = 0.8)

.

+4
2

, - :

In [1]: from sklearn.datasets import load_files

In [2]: from sklearn.cross_validation import train_test_split

In [3]: bunch = load_files('./Topics')

In [4]: X_train, X_test, y_train, y_test = train_test_split(bunch.data, bunch.target, test_size=.4)

# Then proceed to train your model and validate.

, bunch.target - , , bunch.target_names.

In [14]: X_test[:2]
Out[14]:
['Psychologist Philip Zimbardo asks, "Why are boys struggling?" He shares some stats (lower graduation rates, greater worries about intimacy and relationships) and suggests a few reasons -- and challenges the TED community to think about solutions.Philip Zimbardo was the leader of the notorious 1971 Stanford Prison Experiment -- and an expert witness at Abu Ghraib. His book The Lucifer Effect explores the nature of evil; now, in his new work, he studies the nature of heroism.',
 'Human growth has strained the Earth\ resources, but as Johan Rockstrom reminds us, our advances also give us the science to recognize this and change behavior. His research has found nine "planetary boundaries" that can guide us in protecting our planet\ many overlapping ecosystems.If Earth is a self-regulating system, it\ clear that human activity is capable of disrupting it. Johan Rockstrom has led a team of scientists to define the nine Earth systems that need to be kept within bounds for Earth to keep itself in balance.']

In [15]: y_test[:2]
Out[15]: array([ 84, 113])

In [16]: [bunch.target_names[idx] for idx in y_test[:2]]
Out[16]: ['Education', 'Global issues']
+6

, , sklearn, ( fetch_20newsgroup()). , , , , , numpy . , . , , (, , ;-)).

, . 90% 10% . , 10- , 10 , 9 10- , 8 10- 9- .. ..

0

Source: https://habr.com/ru/post/1615062/


All Articles