scikit-learn: KMeans and PCA dimensionality reduction
I have a dataset of 2M rows with 7 measurement columns describing household energy consumption, plus a date for each measurement:
- the date
- Global_active_power
- Global_reactive_power
- Voltage
- Global_intensity
- Sub_metering_1
- Sub_metering_2
- Sub_metering_3
I load the dataset into a pandas DataFrame, select every column except the date, and then do a train/test split.
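    import pandas as pd
    # sklearn.cross_validation was removed in scikit-learn 0.20;
    # train_test_split now lives in sklearn.model_selection.
    from sklearn.model_selection import train_test_split

    data = pd.read_csv('household_power_consumption.txt', delimiter=';')
    # Keep the 7 measurement columns (positions 2..8) and drop incomplete rows.
    power_consumption = data.iloc[:, 2:9].dropna()
    pc_toarray = power_consumption.values
    # Work on a 1% sample; the remaining 99% ends up in hpc_fit1.
    hpc_fit, hpc_fit1 = train_test_split(pc_toarray, train_size=.01)
    power_consumption.head()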
I then cluster the sample with K-means, using PCA to reduce the data to two dimensions for plotting.
    from sklearn.cluster import KMeans
    import matplotlib.pyplot as plt
    import numpy as np
    from sklearn.decomposition import PCA

    # Project the sample onto its first two principal components.
    hpc = PCA(n_components=2).fit_transform(hpc_fit)

    # Cluster the 2-D projection (KMeans defaults to 8 clusters).
    k_means = KMeans()
    k_means.fit(hpc)

    # Build a grid over the plotting area and predict a cluster for each cell.
    x_min, x_max = hpc[:, 0].min() - 5, hpc[:, 0].max() - 1
    y_min, y_max = hpc[:, 1].min(), hpc[:, 1].max() + 5
    xx, yy = np.meshgrid(np.arange(x_min, x_max, .02),
                         np.arange(y_min, y_max, .02))
    Z = k_means.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)

    # Draw the cluster regions as a colored background.
    plt.figure(1)
    plt.clf()
    plt.imshow(Z, interpolation='nearest',
               extent=(xx.min(), xx.max(), yy.min(), yy.max()),
               cmap=plt.cm.Paired, aspect='auto', origin='lower')

    # Overlay the projected points and mark the centroids with white crosses.
    plt.plot(hpc[:, 0], hpc[:, 1], 'k.', markersize=4)
    centroids = k_means.cluster_centers_
    inert = k_means.inertia_
    plt.scatter(centroids[:, 0], centroids[:, 1], marker='x', s=169,
                linewidths=3, color='w', zorder=8)
    plt.xlim(x_min, x_max)
    plt.ylim(y_min, y_max)
    plt.xticks(())
    plt.yticks(())
    plt.show()
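Note that the code above runs PCA first and clusters the 2-D projection. For comparison, the other order (clustering on all the columns and using PCA only to draw the result) might look roughly like the sketch below. The random array `X` and the choice of 4 clusters are assumptions for illustration, standing in for `hpc_fit` and a real cluster count.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Hypothetical stand-in for hpc_fit: 300 rows, 7 numeric columns.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 7))

# Cluster in the full 7-dimensional feature space.
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)

# Fit PCA on the same rows, purely for display.
pca = PCA(n_components=2).fit(X)
points_2d = pca.transform(X)

# Project the 7-D centroids with the same fitted PCA so they can be plotted too.
centroids_2d = pca.transform(km.cluster_centers_)

print(points_2d.shape, centroids_2d.shape)
```

Either way, `km.labels_[i]` refers to row `i` of the array that was passed to `fit`, because PCA does not reorder rows.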
Now I would like to know which rows fell into a given cluster, and then which dates those rows correspond to.
- Is there a way to associate the points on the chart with the index in my dataset after the PCA?
- Some method I don't know about?
- Or is my approach fundamentally flawed?
- Any recommendations?
I am new to this field and am trying to read a lot of code; this script is compiled from several documented examples I have seen.
My goal is to cluster the data and then get the dates that belong to each cluster.
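To make the goal concrete, here is a minimal sketch of the result I am after. It assumes the split is done on the DataFrame itself rather than on `.values`, so that the original row index survives. The `demo` frame, its three measurement columns, and the choice of 3 clusters are made up for illustration and are not the real data.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the real file: a date column plus numeric measurements.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Date": pd.date_range("2007-01-01", periods=1000, freq="D"),
    "Global_active_power": rng.normal(1.0, 0.5, 1000),
    "Voltage": rng.normal(240.0, 3.0, 1000),
    "Global_intensity": rng.normal(4.0, 2.0, 1000),
})

features = demo.drop(columns="Date").dropna()

# Splitting the DataFrame (not .values) keeps the original row index on each part.
sample, _ = train_test_split(features, train_size=0.1, random_state=0)

# PCA and KMeans return arrays whose row order matches the rows of `sample`.
reduced = PCA(n_components=2).fit_transform(sample)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(reduced)

# Re-attach the original index: labels[i] belongs to sample.index[i].
clusters = pd.Series(labels, index=sample.index, name="cluster")

# Look up the dates of every row assigned to cluster 0.
dates_in_cluster_0 = demo.loc[clusters[clusters == 0].index, "Date"]
print(dates_in_cluster_0.head())
```

The same lookup works for any cluster number, and `clusters` could also be joined back onto the full frame with `demo.join(clusters)`, which leaves unsampled rows as NaN.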
thanks