I would do this with a Pandas DataFrame and numpy.random.choice. That makes it easy to draw random samples so that the groups end up the same size. Example:
    import pandas as pd
    import numpy as np

    data = pd.DataFrame(np.random.randn(7, 4))
    data['Healthy'] = [1, 1, 0, 0, 1, 1, 1]
This data set contains two unhealthy and five healthy samples. To randomly pick two samples from the healthy group, you can do:
    healthy_indices = data[data.Healthy == 1].index
    random_indices = np.random.choice(healthy_indices, 2, replace=False)
    healthy_sample = data.loc[random_indices]
To automatically select a subsample of the same size as the unhealthy group, you can do:
    sample_size = sum(data.Healthy == 0)  # Equivalent to len(data[data.Healthy == 0])
    random_indices = np.random.choice(healthy_indices, sample_size, replace=False)
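If you want the result as one balanced data set, here is a minimal sketch of how the pieces could fit together. It uses DataFrame.sample instead of numpy.random.choice, which should give an equivalent draw without replacement. The names unhealthy, healthy and balanced and the seed 0 are my own choices for illustration, not anything from the question.

```python
import numpy as np
import pandas as pd

# Same toy data as above; the seed is arbitrary and only there for reproducibility.
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.standard_normal((7, 4)))
data['Healthy'] = [1, 1, 0, 0, 1, 1, 1]

unhealthy = data[data.Healthy == 0]
healthy = data[data.Healthy == 1]

# Draw as many healthy rows as there are unhealthy rows (without replacement,
# which is the default for DataFrame.sample).
healthy_sample = healthy.sample(n=len(unhealthy), random_state=0)

# Stack both groups into one balanced frame.
balanced = pd.concat([unhealthy, healthy_sample])
print(balanced.Healthy.value_counts())
```

Both groups then have the same number of rows in balanced. Drop random_state if you want a different draw on every run.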