Convert a structured array to a regular NumPy array for use with scikit-learn

I'm having trouble converting a structured array, loaded from CSV with np.genfromtxt, into a plain np.array so that the data can be passed to scikit-learn estimators. The problem is that at some point a cast from the structured array to a regular array has to happen, which raises ValueError: can't cast from structure to non-structure. For a while I used .view to do the conversion, but that now triggers deprecation warnings from NumPy. My code is as follows:

    import numpy as np
    from sklearn.ensemble import GradientBoostingClassifier

    data = np.genfromtxt(path, dtype=float, delimiter=',', names=True)

    target = "occupancy"
    features = [
        "temperature", "relative_humidity", "light", "C02", "humidity"
    ]

    # Doesn't work directly
    X = data[features]
    y = data[target].astype(int)

    clf = GradientBoostingClassifier(random_state=42)
    clf.fit(X, y)

The exception raised is: ValueError: Can't cast from structure to non-structure, except if the structure only has a single field.

My second attempt was to use a view, as follows:

    # View is raising deprecation warnings
    X = data[features]
    X = X.view((float, len(X.dtype.names)))
    y = data[target].astype(int)

This works and does exactly what I want (I do not need a copy of the data), but it raises the following FutureWarning:

 FutureWarning: Numpy has detected that you may be viewing or writing to an array returned by selecting multiple fields in a structured array. This code may break in numpy 1.15 because this will return a view instead of a copy -- see release notes for details. 

At the moment I use tolist() to convert the structured array to a list and then pass that to np.array. This works, but it seems terribly inefficient:

    # Current method (efficient?)
    X = np.array(data[features].tolist())
    y = data[target].astype(int)

There must be a better way; I would be grateful for any advice.

NOTE: The data for this example is taken from the UCI ML Occupancy Detection dataset, and looks like this:

    array([(nan, 23.18, 27.272 ,  426.  ,  721.25, 0.00479299, 1.),
           (nan, 23.15, 27.2675,  429.5 ,  714.  , 0.00478344, 1.),
           (nan, 23.15, 27.245 ,  426.  ,  713.5 , 0.00477946, 1.),
           ...,
           (nan, 20.89, 27.745 ,  423.5 , 1521.5 , 0.00423682, 1.),
           (nan, 20.89, 28.0225,  418.75, 1632.  , 0.00427949, 1.),
           (nan, 21.  , 28.1   ,  409.  , 1864.  , 0.00432073, 1.)],
          dtype=[('datetime', '<f8'), ('temperature', '<f8'),
                 ('relative_humidity', '<f8'), ('light', '<f8'),
                 ('C02', '<f8'), ('humidity', '<f8'), ('occupancy', '<f8')])
2 answers

Add .copy() to data[features]:

    X = data[features].copy()
    X = X.view((float, len(X.dtype.names)))

and the FutureWarning disappears.

This should be more efficient than converting to a list.
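If you want to check that on your own data, a rough timing comparison is easy to set up. This is just a sketch: it assumes data and features are defined as in the question, and that your NumPy version still accepts the copy-then-view approach shown above.

    import timeit

    # Copy the selected fields, then reinterpret them as a plain 2-D float array
    copy_view = timeit.timeit(
        lambda: data[features].copy().view((float, len(features))),
        number=100)

    # Round-trip through a Python list of tuples
    via_list = timeit.timeit(
        lambda: np.array(data[features].tolist()),
        number=100)

    print(f"copy + view: {copy_view:.4f} s, tolist: {via_list:.4f} s")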


You can avoid the need for a copy altogether if you can read the data into a plain NumPy array in the first place, by omitting the names parameter and skipping the header row instead:

 data = np.genfromtxt(path, dtype=float, delimiter=',', skip_header=1) 

Then, as luck would have it, X consists of all but the first and last columns (i.e. everything except datetime and occupancy), so we can express both X and y as slices:

    X = data[:, 1:-1]
    y = data[:, -1].astype(int)

We can then pass these arrays to scikit-learn directly:

    clf = GradientBoostingClassifier(random_state=42)
    clf.fit(X, y)

and, if you want, you can view the plain NumPy array as a structured array afterwards:

 features = ["temperature", "relative_humidity", "light", "C02", "humidity"] X = X.ravel().view([(field, X.dtype.type) for field in features]) 

Unfortunately, this workaround relies on X being expressible as a slice: we could not avoid a copy if, for example, the occupancy column sat between the other feature columns. It also means you have to define X as X = data[:, 1:-1] rather than the more human-friendly X = data[features].
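To make that concrete, here is a minimal sketch with a toy array: basic slicing over a contiguous range of columns returns a view, while selecting an arbitrary list of columns with fancy indexing returns a copy.

    import numpy as np

    a = np.arange(12.0).reshape(3, 4)

    sliced = a[:, 1:-1]      # contiguous column range -> view
    picked = a[:, [0, 2]]    # arbitrary column list   -> copy

    print(np.shares_memory(a, sliced))  # True
    print(np.shares_memory(a, picked))  # False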


Putting it all together:

    import numpy as np
    from sklearn.ensemble import GradientBoostingClassifier

    data = np.genfromtxt(path, dtype=float, delimiter=',', skip_header=1)

    X = data[:, 1:-1]
    y = data[:, -1].astype(int)

    clf = GradientBoostingClassifier(random_state=42)
    clf.fit(X, y)

    features = ["temperature", "relative_humidity", "light", "C02", "humidity"]
    X = X.ravel().view([(field, X.dtype.type) for field in features])

If you must start with a structured array, then hpaulj's answer shows how viewing, reshaping and slicing the structured array gets you a plain array without copying:

    import numpy as np

    nan = np.nan
    data = np.array([(nan, 23.18, 27.272 ,  426.  ,  721.25, 0.00479299, 1.),
                     (nan, 23.15, 27.2675,  429.5 ,  714.  , 0.00478344, 1.),
                     (nan, 23.15, 27.245 ,  426.  ,  713.5 , 0.00477946, 1.),
                     (nan, 20.89, 27.745 ,  423.5 , 1521.5 , 0.00423682, 1.),
                     (nan, 20.89, 28.0225,  418.75, 1632.  , 0.00427949, 1.),
                     (nan, 21.  , 28.1   ,  409.  , 1864.  , 0.00432073, 1.)],
                    dtype=[('datetime', '<f8'), ('temperature', '<f8'),
                           ('relative_humidity', '<f8'), ('light', '<f8'),
                           ('C02', '<f8'), ('humidity', '<f8'),
                           ('occupancy', '<f8')])

    target = 'occupancy'

    nrows = len(data)
    X = data.view('<f8').reshape(nrows, -1)[:, 1:-1]
    y = data[target].astype(int)

This exploits the fact that every field is an 8-byte float, so the structured array can be reinterpreted as a plain array of dtype <f8 with a single view. Reshaping then turns it into a two-dimensional array with the same number of rows, and the slice drops the datetime and occupancy columns/fields.
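As an optional sanity check (assuming the data, X and y from the snippet above), you can confirm that no copy was made and that the shapes come out as expected:

    print(np.shares_memory(data, X))  # True: X is a view on the structured array's buffer
    print(X.shape)                    # (nrows, 5) -- one column per feature
    print(y.shape)                    # (nrows,)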

