I am trying to run this scikit-learn example code on my custom dataset of TED Talks. Each directory is a topic, and under it are text files that each contain the description of one talk. This is what my directory tree looks like:
Topics/
|-- <topic_1>/
|   |-- <talk_id>.txt
|   |-- <talk_id>.txt
|   |-- ...
|-- <topic_2>/
|   |-- <talk_id>.txt
|   |-- ...
|-- ...
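A quick sanity check that the layout is readable (using the ./TED_dataset/Topics/ path that appears further down):

import os

root = './TED_dataset/Topics/'
for topic in sorted(os.listdir(root)):             # each directory is one topic
    talks = os.listdir(os.path.join(root, topic))  # one .txt file per talk
    print(topic, '-', len(talks), 'descriptions')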
I built my dataset so that it resembles the 20 newsgroups dataset, whose tree structure is as follows:
20news-18828/
|-- alt.atheism/
|   |-- <article_id>
|   |-- <article_id>
|   |-- ...
|-- comp.graphics/
|   |-- <article_id>
|   |-- ...
|-- ...
In the example's source code (lines 98-124), this is how the training and testing data are downloaded directly from scikit:
print("Loading 20 newsgroups dataset for categories:")
print(categories if categories else "all")
data_train = fetch_20newsgroups(subset='train', categories=categories,
shuffle=True, random_state=42,
remove=remove)
data_test = fetch_20newsgroups(subset='test', categories=categories,
shuffle=True, random_state=42,
remove=remove)
print('data loaded')
categories = data_train.target_names
def size_mb(docs):
return sum(len(s.encode('utf-8')) for s in docs) / 1e6
data_train_size_mb = size_mb(data_train.data)
data_test_size_mb = size_mb(data_test.data)
print("%d documents - %0.3fMB (training set)" % (
len(data_train.data), data_train_size_mb))
print("%d documents - %0.3fMB (test set)" % (
len(data_test.data), data_test_size_mb))
print("%d categories" % len(categories))
print()
y_train, y_test = data_train.target, data_test.target
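For context, the example goes on (from around line 124) to feed these variables into a vectorizer, roughly like this, so whatever I load has to provide the same kinds of objects: lists of document strings plus integer label arrays.

from sklearn.feature_extraction.text import TfidfVectorizer

# Roughly what the example does next: learn the vocabulary on the
# training documents, then reuse it to transform the test documents.
vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5,
                             stop_words='english')
X_train = vectorizer.fit_transform(data_train.data)
X_test = vectorizer.transform(data_test.data)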
Scikit apparently provides load_files for loading a custom dataset like this, so I deleted the code above and replaced it with this single line (line 84):
dataset = load_files('./TED_dataset/Topics/')
From here I don't know how to proceed. The rest of the example needs the training and testing documents in these variables:
data_train.data, data_test.data
That is, I need to split my dataset into a training and a testing part and fill these variables, but I don't see how to get there from what load_files returns. I also don't know how to obtain data_train.target_names from my dataset.
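For reference, load_files gives me back a single Bunch rather than separate train and test sets. Inspecting it shows roughly this (the encoding argument is my addition, to get str objects instead of bytes):

from sklearn.datasets import load_files

dataset = load_files('./TED_dataset/Topics/', encoding='utf-8')
print(dataset.target_names)   # topic names, taken from the directory names
print(len(dataset.data))      # raw documents, one per .txt file
print(dataset.target[:5])     # integer labels aligned with dataset.data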
Edit:
I tried the following:
dataset = load_files('./TED_dataset/Topics/')
train, test = train_test_split(dataset, train_size = 0.8)
but it throws an error instead of giving me the train and test sets.
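From the train_test_split docs it seems to expect indexable arrays rather than a Bunch, so I suspect the split has to be done on dataset.data and dataset.target directly. A sketch of what I mean (assuming sklearn.model_selection; older releases have train_test_split in sklearn.cross_validation):

from sklearn.datasets import load_files
from sklearn.model_selection import train_test_split

dataset = load_files('./TED_dataset/Topics/', encoding='utf-8')

# Split the parallel sequences, not the Bunch object itself.
docs_train, docs_test, y_train, y_test = train_test_split(
    dataset.data, dataset.target, train_size=0.8, random_state=42)

target_names = dataset.target_names   # stands in for data_train.target_names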