How do I sample a dataset in scikit-learn?

Question

We have an eye dataset in which diseased-eye samples make up 70 percent of the data, while healthy-eye samples make up the remaining 30 percent. We want a dataset in which the diseased and healthy samples are equal in number. Is there a function that can do this?

python python-2.7 scikit-learn dataset sampling

Gaurav patil Mar 23 '15 at 5:53

2 answers

RickardSjogren · Answer 1 · 2015-03-23T07:08:02+0000

I would do this with a pandas DataFrame and numpy.random.choice. That makes it easy to draw random samples and build datasets of equal size. Example:

    import pandas as pd
    import numpy as np

    data = pd.DataFrame(np.random.randn(7, 4))
    data['Healthy'] = [1, 1, 0, 0, 1, 1, 1]

This data contains two unhealthy and five healthy samples. To randomly select two samples from the healthy population, do:

    healthy_indices = data[data.Healthy == 1].index
    random_indices = np.random.choice(healthy_indices, 2, replace=False)
    healthy_sample = data.loc[random_indices]

To automatically select a subsample of the same size as the unhealthy group, you can do:

    sample_size = sum(data.Healthy == 0)  # Equivalent to len(data[data.Healthy == 0])
    random_indices = np.random.choice(healthy_indices, sample_size, replace=False)
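Putting the steps above together, a balanced dataset could be assembled roughly as follows. This is only a sketch: the fixed seed, the `RandomState` generator, and the names `unhealthy_indices`, `sampled_healthy`, and `balanced` are assumptions added for illustration and are not part of the answer above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data mirroring the example above: 5 healthy, 2 unhealthy rows.
rng = np.random.RandomState(0)  # fixed seed (an assumption) so the draw is repeatable
data = pd.DataFrame(rng.randn(7, 4))
data['Healthy'] = [1, 1, 0, 0, 1, 1, 1]

healthy_indices = data[data.Healthy == 1].index
unhealthy_indices = data[data.Healthy == 0].index

# Draw as many healthy rows as there are unhealthy rows, without replacement.
sampled_healthy = rng.choice(healthy_indices, len(unhealthy_indices), replace=False)

# Keep every unhealthy row plus the sampled healthy rows.
balanced = data.loc[np.concatenate([unhealthy_indices, sampled_healthy])]
```

The resulting `balanced` frame holds the same number of rows from each group; on the real dataset you would sample from whichever class is the majority.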

Fomalhaut · Answer 2 · 2015-03-23T06:36:38+0000

Alternatively, you can use a stochastic method. Suppose you have a dataset data, which is a large collection of tuples (X, Y), where Y indicates a diseased eye (0 or 1). You can write a wrapper for your dataset that passes through all healthy eyes and passes diseased eyes with a probability of 0.3 / 0.7, so that the diseased samples kept amount to roughly 30% of the original dataset, matching the number of healthy samples.

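    from random import random

    def wrapper(data):
        prob = 0.3 / 0.7  # fraction of diseased samples to keep
        for X, Y in data:
            if Y == 0:
                # healthy sample: always keep
                yield X, Y
            else:
                # diseased sample: keep with probability prob
                if random() < prob:
                    yield X, Y

    # now you can use the wrapper to extract the needed information
    for X, Y in wrapper(your_dataset):
        print(X, Y)  # parenthesized form runs on Python 2 and 3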

Be careful: if you need to run this wrapper several times as a generator and want the same results each time, set a fixed random seed before calling random(). More on this: https://docs.python.org/2/library/random.html
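One way to get repeatable results is to give the wrapper its own seeded generator instead of relying on the global one. The sketch below is a variation on the wrapper, not the answer's exact code: the name `balanced_wrapper`, the `prob` and `seed` parameters, and the toy dataset are hypothetical and exist only for illustration.

```python
import random

def balanced_wrapper(data, prob=0.3 / 0.7, seed=None):
    """Yield every (X, Y) with Y == 0, and each Y == 1 pair with probability prob."""
    rng = random.Random(seed)  # private generator; leaves the global random state alone
    for X, Y in data:
        # Short-circuit: no random number is drawn for healthy samples.
        if Y == 0 or rng.random() < prob:
            yield X, Y

# Hypothetical toy dataset: 70 diseased (Y == 1) and 30 healthy (Y == 0) samples.
toy_dataset = [(i, 1) for i in range(70)] + [(i, 0) for i in range(70, 100)]

# Two passes with the same seed select exactly the same samples.
first_pass = list(balanced_wrapper(toy_dataset, seed=42))
second_pass = list(balanced_wrapper(toy_dataset, seed=42))
```

Because the selection is random, the number of diseased samples kept is only approximately equal to the number of healthy ones; the seed makes the selection repeatable, not exact.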
