I have a dataset of 2M rows with 7 measurement columns of household energy consumption, plus a date for each measurement:

  • the date
  • Global_active_power
  • Global_reactive_power
  • Voltage
  • Global_intensity
  • Sub_metering_1
  • Sub_metering_2
  • Sub_metering_3

I loaded the dataset into a pandas DataFrame, selected every column except the date, and then did a train/test split.

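    import pandas as pd
    # in newer scikit-learn releases train_test_split lives in
    # sklearn.model_selection (originally sklearn.cross_validation)
    from sklearn.model_selection import train_test_split

    data = pd.read_csv('household_power_consumption.txt', delimiter=';')
    # keep the seven measurement columns, skipping the date columns
    power_consumption = data.iloc[0:, 2:9].dropna()
    pc_toarray = power_consumption.values
    # train_size=.01 keeps a random 1% sample of the rows
    hpc_fit, hpc_fit1 = train_test_split(pc_toarray, train_size=.01)
    power_consumption.head()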

power table

I use K-means clustering, with PCA reducing the data to two dimensions for display.

    from sklearn.cluster import KMeans
    import matplotlib.pyplot as plt
    import numpy as np
    from sklearn.decomposition import PCA

    # project the sample onto its first two principal components
    hpc = PCA(n_components=2).fit_transform(hpc_fit)
    k_means = KMeans()
    k_means.fit(hpc)  # the model must be fitted before predict() is called

    # build a grid over the plotting area and label every grid point
    x_min, x_max = hpc[:, 0].min() - 5, hpc[:, 0].max() - 1
    y_min, y_max = hpc[:, 1].min(), hpc[:, 1].max() + 5
    xx, yy = np.meshgrid(np.arange(x_min, x_max, .02),
                         np.arange(y_min, y_max, .02))
    Z = k_means.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)

    # draw the cluster regions, the data points, and the centroids
    plt.figure(1)
    plt.clf()
    plt.imshow(Z, interpolation='nearest',
               extent=(xx.min(), xx.max(), yy.min(), yy.max()),
               aspect='auto', origin='lower')
    plt.plot(hpc[:, 0], hpc[:, 1], 'k.', markersize=4)
    centroids = k_means.cluster_centers_
    inert = k_means.inertia_
    plt.scatter(centroids[:, 0], centroids[:, 1], marker='x', s=169,
                linewidths=3, color='w', zorder=8)
    plt.xlim(x_min, x_max)
    plt.ylim(y_min, y_max)
    plt.xticks(())
    plt.yticks(())
    plt.show()

PCA output

Now I would like to know which rows fell into a given cluster, and then which dates fell into that cluster.

  • Is there a way to associate the points on the chart with the index in my dataset after the PCA?
  • Some method I don't know about?
  • Or is my approach fundamentally flawed?
  • Any recommendations?

I am new to this field and am trying to read a lot of code; this is a compilation of several documented examples that I have seen.

My goal is to classify the data and then get dates that belong to the class.


1 answer

KMeans().predict(X), from the scikit-learn docs:

Predict the closest cluster each sample in X belongs to.

In the vector quantization literature, cluster_centers_ is called a codebook, and each value returned by the prediction is an index of the closest code in the codebook.

    Parameters:
        X : {array-like, sparse matrix}, shape = [n_samples, n_features]
            New data to predict.

    Returns:
        labels : array, shape [n_samples,]
            Index of the cluster each sample belongs to.
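To make the "index into the codebook" idea concrete, here is a minimal sketch on synthetic data (the two blobs, `n_clusters=2`, and the variable names are illustrative assumptions, not part of the household dataset). It checks by hand that each label returned by `predict` is the position of the nearest row in `cluster_centers_`, and that labels come back in the same order as the rows passed in.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic 2-D data: two well-separated blobs (illustrative only)
rng = np.random.RandomState(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(50, 2)),
    rng.normal(loc=10.0, scale=0.5, size=(50, 2)),
])

k_means = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# predict() returns one integer per row, in the same order as the rows of X
labels = k_means.predict(X)

# Recompute the nearest centroid by hand: distance from every row to
# every centroid, then take the index of the smallest distance
distances = np.linalg.norm(
    X[:, None, :] - k_means.cluster_centers_[None, :, :], axis=2
)
nearest = distances.argmin(axis=1)

print((labels == nearest).all())
```

Because the labels keep the row order of the input, row `i` of the input array always corresponds to `labels[i]`, which is what makes it possible to attach them back to the original rows.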

The problem I have with the code you posted is its use of train_test_split(), which returns two arrays of randomly chosen rows from your dataset. This effectively destroys the order of your dataset and makes it difficult to match the labels returned by KMeans to the sequential dates in your dataset.
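The effect can be seen on a small toy frame (a hypothetical stand-in for the real data; the column name, dates, and `random_state` are assumptions chosen for illustration). Splitting the bare NumPy array loses the dates entirely, while splitting the DataFrame itself still shuffles the rows but lets each row keep its date label in the index.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with a datetime index (hypothetical stand-in for the real data)
toy = pd.DataFrame(
    {"Voltage": np.arange(10, dtype=float)},
    index=pd.date_range("2007-01-01", periods=10, freq="D"),
)

# Splitting the bare array: rows are picked at random and the dates are gone
arr_train, arr_test = train_test_split(
    toy.values, train_size=0.5, random_state=0
)

# Splitting the DataFrame: rows are still picked at random,
# but every row carries its date along in the index
df_train, df_test = train_test_split(toy, train_size=0.5, random_state=0)

print(arr_train.ravel())
print(df_train)
```

If a random sample is really needed, passing the DataFrame rather than `.values` is one way to keep the link between rows and dates; otherwise, skipping the split altogether keeps the original order.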

Here is an example:

    import pandas as pd
    import numpy as np
    from sklearn.cluster import KMeans

    # read data into a pandas DataFrame
    df = pd.read_csv('household_power_consumption.txt', delimiter=';')

Raw dataset head

    # merge the Date and Time columns and convert them to datetime objects
    df['Datetime'] = pd.to_datetime(df['Date'] + ' ' + df['Time'])
    # inplace=True belongs to set_index(), not to DatetimeIndex()
    df.set_index(pd.DatetimeIndex(df['Datetime']), inplace=True)
    df.drop(['Date', 'Time'], axis=1, inplace=True)

    # put the last column (Datetime) first
    cols = df.columns.tolist()
    cols = cols[-1:] + cols[:-1]
    df = df[cols]
    df = df.dropna()

preprocessed dates

    # convert the DataFrame to a data array, leaving out the Datetime
    # column, which should not be clustered
    sliced = df.iloc[0:, 1:8].dropna()
    hpc = sliced.values

    k_means = KMeans()
    k_means.fit(hpc)  # labels_ is only available after fitting

    # array of cluster indexes, one per row, in the order of your dataset
    classified_data = k_means.labels_

    # copy the DataFrame (may be memory intensive, but just for illustration)
    df_processed = df.copy()
    df_processed['Cluster Class'] = pd.Series(classified_data,
                                              index=df_processed.index)


  • Now you can see the result matched to your dataset, in the rightmost column.
  • Now that the data is classified, it is up to you to work out what the clusters mean.
  • This is just an example of how it can be used from start to finish.
  • To display your result, look into PCA, or create other plots based on the cluster class.
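As a rough sketch of the last step (getting the dates that belong to a cluster), the snippet below uses a small synthetic frame rather than the real file; `df_demo`, the column names `a`, `b`, `c`, and `n_clusters=2` are all hypothetical. The idea is the same as above: keep the rows in their original order, store the labels as a column, and filter on that column. PCA's `fit_transform` also returns one row per input row in the same order, so the 2-D coordinates can be stored next to the labels.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Synthetic stand-in: 200 minutes of made-up readings with a datetime index
rng = np.random.RandomState(0)
index = pd.date_range("2007-01-01", periods=200, freq="min")
values = np.vstack([
    rng.normal(0.0, 0.5, size=(100, 3)),
    rng.normal(8.0, 0.5, size=(100, 3)),
])
df_demo = pd.DataFrame(values, index=index, columns=["a", "b", "c"])

# Cluster the rows in their original order and store the label per row
features = df_demo[["a", "b", "c"]].values
k_means = KMeans(n_clusters=2, n_init=10, random_state=0).fit(features)
df_demo["Cluster Class"] = k_means.labels_

# Dates (index values) of every row assigned to cluster 0
dates_in_cluster_0 = df_demo.index[df_demo["Cluster Class"] == 0]

# PCA output has one row per input row, in the same order,
# so the plot coordinates can sit beside the labels and dates
coords = PCA(n_components=2).fit_transform(features)
df_demo["pc1"] = coords[:, 0]
df_demo["pc2"] = coords[:, 1]

# All clusters at once: map each label to the dates it contains
dates_by_cluster = {
    label: group.index for label, group in df_demo.groupby("Cluster Class")
}
print(dates_in_cluster_0[:5])
```

With the coordinates and labels in one frame, a scatter plot of `pc1` against `pc2` colored by `Cluster Class` shows the clusters, and any point can be traced back to its date through the index.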


