First of all, this is a regression problem, not a classification problem, because the values in the Daily_KWH_System column do not form a set of labels. Instead, they appear to be real numbers (at least based on the above example).
If you nevertheless want to treat it as a classification problem, then according to the sklearn documentation:
When doing classification in scikit-learn, y is a vector of integers or strings.
In your case, y is a floating-point vector, and therefore you get an error. So instead of the line
y = df['Daily_KWH_System']
write the line
y = np.asarray(df['Daily_KWH_System'], dtype="|S6")
and this will solve the problem. (Here you can learn more about this approach: Python RandomForest - Unknown label Error )
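To see why a floating-point y is rejected, you can ask scikit-learn how it categorizes a target vector. The following is only a small sketch using type_of_target from sklearn.utils.multiclass on a few sample Daily_KWH_System values; the variable names are made up for illustration.

```python
import numpy as np
from sklearn.utils.multiclass import type_of_target

# A few sample Daily_KWH_System values (non-integer floats).
y_float = np.array([4136.900384, 3061.657187, 4099.614033])

# Non-integer floats are categorized as a regression-style target.
float_kind = type_of_target(y_float)

# After conversion to strings, the same values are treated as class labels.
y_str = y_float.astype(str)
str_kind = type_of_target(y_str)

print(float_kind, str_kind)
```

Note that after the conversion every distinct reading becomes its own class, which is another sign that regression fits this data better.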
However, since regression is more appropriate in this case, replace the lines
from sklearn.neural_network import MLPClassifier
mlp = MLPClassifier(hidden_layer_sizes=(30,30,30))
with
from sklearn.neural_network import MLPRegressor
mlp = MLPRegressor(hidden_layer_sizes=(30,30,30))
The code will work without throwing an error (but, of course, there is not enough data to check whether our model works well).
With that said, I don’t think that this is the right approach to select features for this problem.
In this problem, we are dealing with a sequence of real numbers that form a time series. One reasonable feature we could choose is the number of seconds (or minutes, hours, days, etc.) that have passed since the start. Since this particular data contains only days, months, and years (the other values are always 0), we could use the number of days that have passed since the very beginning as a feature. Then your data frame will look like this:
   Daily_KWH_System  days_passed
0       4136.900384            0
1       3061.657187            1
2       4099.614033            2
3       3922.490275            3
4       3957.128982            4
You can take the values in the days_passed column as features and the values in the Daily_KWH_System column as targets. You can also add some indicator features. For example, if you think that the end of the year can affect the target, you can add an indicator feature that says whether the month is December or not.
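The days_passed column and the December indicator described above could be built with pandas roughly as follows. This is a sketch under assumptions: the "Date" column name, the dates themselves, and the is_december name are hypothetical and chosen only for illustration; only the Daily_KWH_System values come from the example above.

```python
import pandas as pd

# Hypothetical input: the "Date" column and its values are assumed for
# illustration; the Daily_KWH_System values are the ones shown above.
df = pd.DataFrame({
    "Date": pd.to_datetime([
        "2016-12-29", "2016-12-30", "2016-12-31", "2017-01-01", "2017-01-02",
    ]),
    "Daily_KWH_System": [
        4136.900384, 3061.657187, 4099.614033, 3922.490275, 3957.128982,
    ],
})

# Number of whole days since the first observation.
df["days_passed"] = (df["Date"] - df["Date"].min()).dt.days

# Indicator column: 1 if the observation falls in December, else 0.
df["is_december"] = (df["Date"].dt.month == 12).astype(int)

X = df[["days_passed", "is_december"]]  # model inputs
y = df["Daily_KWH_System"]              # regression target
print(df)
```

X and y can then be passed to mlp.fit(X, y) with the MLPRegressor from above.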
If the data is really daily (at least in this example you have one data point per day) and you want to solve this problem with neural networks, then another sensible approach would be to treat it as a time series and try a suitable recurrent neural network. Here are some great blog posts that describe this approach:
http://machinelearningmastery.com/time-series-prediction-lstm-recurrent-neural-networks-python-keras/
http://machinelearningmastery.com/time-series-forecasting-long-short-term-memory-network-python/
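A common first step for this kind of approach is to turn the series into supervised pairs, where a window of past values is used to predict the next one. The sketch below uses only NumPy; the make_windows helper and the look_back parameter are hypothetical names for illustration, and the window size of 3 is an arbitrary choice, not a recommendation.

```python
import numpy as np

def make_windows(series, look_back=3):
    """Turn a 1-D series into (X, y) pairs: look_back past values -> next value."""
    values = np.asarray(series, dtype=float)
    X, y = [], []
    for start in range(len(values) - look_back):
        X.append(values[start:start + look_back])
        y.append(values[start + look_back])
    # Recurrent layers usually expect input shaped (samples, time steps, features).
    return np.array(X).reshape(-1, look_back, 1), np.array(y)

# The five Daily_KWH_System values from the example above.
series = [4136.900384, 3061.657187, 4099.614033, 3922.490275, 3957.128982]
X, y = make_windows(series, look_back=3)
print(X.shape, y.shape)
```

With only five data points this yields just two training samples, so a real experiment would need a much longer history.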