Scikit-learn grid search for KNN regression: ValueError: array contains NaN or infinity

I am trying to run a grid search to select the best parameters for KNN regression with scikit-learn. Specifically, this is what I'm doing:

    parameters = [{'weights': ['uniform', 'distance'],
                   'n_neighbors': [5, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]}]
    clf = GridSearchCV(neighbors.KNeighborsRegressor(), parameters)
    clf.fit(features, rewards)

Unfortunately, I get a ValueError: Array contains NaN or infinity.

    /Users/zikesjan/anaconda/lib/python2.7/site-packages/sklearn/grid_search.pyc in fit(self, X, y, **params)
        705                           " The params argument will be removed in 0.15.",
        706                           DeprecationWarning)
    --> 707         return self._fit(X, y, ParameterGrid(self.param_grid))
        708
        709

    /Users/zikesjan/anaconda/lib/python2.7/site-packages/sklearn/grid_search.pyc in _fit(self, X, y, parameter_iterable)
        491                 X, y, base_estimator, parameters, train, test,
        492                 self.scorer_, self.verbose, **self.fit_params)
    --> 493             for parameters in parameter_iterable
        494             for train, test in cv)
        495

    /Users/zikesjan/anaconda/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.pyc in __call__(self, iterable)
        515         try:
        516             for function, args, kwargs in iterable:
    --> 517                 self.dispatch(function, args, kwargs)
        518
        519             self.retrieve()

    /Users/zikesjan/anaconda/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.pyc in dispatch(self, func, args, kwargs)
        310         """
        311         if self._pool is None:
    --> 312             job = ImmediateApply(func, args, kwargs)
        313             index = len(self._jobs)
        314             if not _verbosity_filter(index, self.verbose):

    /Users/zikesjan/anaconda/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.pyc in __init__(self, func, args, kwargs)
        134         # Don't delay the application, to avoid keeping the input
        135         # arguments in memory
    --> 136         self.results = func(*args, **kwargs)
        137
        138     def get(self):

    /Users/zikesjan/anaconda/lib/python2.7/site-packages/sklearn/grid_search.pyc in fit_grid_point(X, y, base_estimator, parameters, train, test, scorer, verbose, loss_func, **fit_params)
        309             this_score = scorer(clf, X_test, y_test)
        310         else:
    --> 311             this_score = clf.score(X_test, y_test)
        312     else:
        313         clf.fit(X_train, **fit_params)

    /Users/zikesjan/anaconda/lib/python2.7/site-packages/sklearn/base.pyc in score(self, X, y)
        320
        321         from .metrics import r2_score
    --> 322         return r2_score(y, self.predict(X))
        323
        324

    /Users/zikesjan/anaconda/lib/python2.7/site-packages/sklearn/metrics/metrics.pyc in r2_score(y_true, y_pred)
       2181
       2182     """
    -> 2183     y_type, y_true, y_pred = _check_reg_targets(y_true, y_pred)
       2184
       2185     if len(y_true) == 1:

    /Users/zikesjan/anaconda/lib/python2.7/site-packages/sklearn/metrics/metrics.pyc in _check_reg_targets(y_true, y_pred)
         59         Estimated target values.
         60     """
    ---> 61     y_true, y_pred = check_arrays(y_true, y_pred)
         62
         63     if y_true.ndim == 1:

    /Users/zikesjan/anaconda/lib/python2.7/site-packages/sklearn/utils/validation.pyc in check_arrays(*arrays, **options)
        231         else:
        232             array = np.asarray(array, dtype=dtype)
    --> 233         _assert_all_finite(array)
        234
        235         if copy and array is array_orig:

    /Users/zikesjan/anaconda/lib/python2.7/site-packages/sklearn/utils/validation.pyc in _assert_all_finite(X)
         25     if (X.dtype.char in np.typecodes['AllFloat'] and not np.isfinite(X.sum())
         26             and not np.isfinite(X).all()):
    ---> 27         raise ValueError("Array contains NaN or infinity.")
         28
         29

    ValueError: Array contains NaN or infinity.

Based on this post, I already tried calling fit like this instead of the line above:

 clf.fit(np.asarray(features).astype(float), np.asarray(rewards).astype(float)) 

Then, based on this post, I even tried this:

    scaler = preprocessing.StandardScaler().fit(np.asarray(features).astype(float))
    transformed_features = scaler.transform(np.asarray(features).astype(float))
    clf.fit(transformed_features, rewards)

But, unfortunately, without success. So I would like to ask whether anyone has an idea where the problem might be and how I can get my code to work.
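For what it's worth, one way to check whether the converted arrays really contain non-finite values is with numpy's finiteness predicates. This is only a sketch; the `features` and `rewards` below are hypothetical stand-ins for the real data:

```python
import numpy as np

# Hypothetical stand-ins for `features` and `rewards`:
features = [[1.0, 2.0], [3.0, float('nan')]]
rewards = [0.5, 1.5]

X = np.asarray(features, dtype=float)
y = np.asarray(rewards, dtype=float)

# Count the NaN / infinite entries and report their positions.
print(np.isnan(X).sum(), np.isinf(X).sum())  # 1 0
print(np.argwhere(~np.isfinite(X)))          # [[1 1]]
print(np.isfinite(y).all())                  # True
```

If these checks come back clean on the real data, the non-finite values are being produced inside scikit-learn rather than coming from the input.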

Thank you in advance.

EDIT:

I found out that I do not get this error in the case when I have only the following parameters:

 parameters = [{'weights': ['uniform'], 'n_neighbors': [5, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]}] 

So the problem seems to occur only when weights='distance'. Does anyone have an idea why?

There is another issue related to this, which I ask about here.

EDIT 2:

If I run my code with logging set during debugging, I get the following warning:

    /Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/neighbors/regression.py:160: RuntimeWarning: invalid value encountered in divide
      y_pred[:, j] = num / denom

So there is clearly a division by zero happening. My question is: why does scikit-learn divide by 0 on line 160 of regression.py?
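The warning makes sense once you look at how inverse-distance weights behave when a distance is exactly zero. Here is a minimal numpy sketch of the mechanism (an illustration, not scikit-learn's actual code):

```python
import numpy as np

# A test point that coincides with a training point has distance 0,
# so its inverse-distance weight becomes inf; the weighted average
# num / denom then evaluates to inf / inf, which is NaN.
dist = np.array([0.0, 1.0, 2.0])
targets = np.array([3.0, 4.0, 5.0])
with np.errstate(divide='ignore', invalid='ignore'):
    weights = 1.0 / dist             # [inf, 1.0, 0.5]
    num = (weights * targets).sum()  # inf
    denom = weights.sum()            # inf
    pred = num / denom               # inf / inf -> nan
print(np.isnan(pred))  # True
```

That NaN prediction is then what trips the finiteness check inside r2_score.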

2 answers

In addition to what you have tried, you can also see whether the following helps:

    import numpy as np
    features = np.nan_to_num(features)
    rewards = np.nan_to_num(rewards)

This replaces NaN values in your arrays with 0 (and ±inf with very large finite numbers), which should at least let your algorithm run, unless the error occurs somewhere inside the algorithm itself. Make sure there aren't many non-finite entries in your data, since replacing them all can introduce strange biases into your estimates.
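For reference, this is what np.nan_to_num actually does to the different kinds of non-finite values:

```python
import numpy as np

a = np.array([1.0, np.nan, np.inf, -np.inf])
b = np.nan_to_num(a)
# NaN becomes 0; +inf / -inf become the largest / smallest finite floats.
print(b[1] == 0.0)           # True
print(np.isfinite(b).all())  # True
print(b[2] > 1e300)          # True: +inf maps to a huge finite number, not 0
```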

If this is not the case and you are using weights='distance', then check whether any of your training samples are identical: duplicates have zero distance to each other, which leads to a division by zero in the inverse-distance weight.
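One way to check for identical training samples, sketched with a hypothetical `features` array (np.unique with an axis argument needs NumPy >= 1.13):

```python
import numpy as np

# Hypothetical example; substitute your own `features` array.
features = np.array([[1.0, 2.0],
                     [3.0, 4.0],
                     [1.0, 2.0]])  # rows 0 and 2 are identical

# Collapse duplicate rows; if fewer unique rows remain than
# original rows, the data contains exact duplicates.
unique_rows = np.unique(features, axis=0)
has_duplicates = len(unique_rows) < len(features)
print(has_duplicates)  # True
```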

If inverse distances are the cause of the division by zero, you can work around it with a custom distance-weighting function, e.g.:

    def better_inv_dist(dist):
        c = 1.
        return 1. / (c + dist)

and then use 'weights': better_inv_dist. You may need to adapt the constant c to the right scale. In any case, it avoids division by zero as long as c > 0.
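Putting it together, the callable can be passed directly inside the parameter grid. A sketch with random stand-in data, using the current import path (sklearn.model_selection rather than the old sklearn.grid_search):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

def better_inv_dist(dist):
    # Shifted inverse distance: stays finite even when dist == 0.
    c = 1.
    return 1. / (c + dist)

# Random stand-in data in place of the original `features` / `rewards`.
rng = np.random.RandomState(0)
features = rng.rand(40, 3)
rewards = rng.rand(40)

parameters = {'weights': ['uniform', better_inv_dist],
              'n_neighbors': [3, 5]}
clf = GridSearchCV(KNeighborsRegressor(), parameters, cv=3)
clf.fit(features, rewards)
preds = clf.predict(features)
print(np.isfinite(preds).all())  # True: no NaN or inf in the predictions
```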


I ran into the same issue with KNN regression on scikit-learn. I was using weights='distance', and this led to infinite values when computing the predictions (but not when fitting the KNN model, i.e. building the KD tree or ball tree). I switched to weights='uniform' and the program ran to completion correctly, indicating that the distance weighting was the problem. If you do want distance-based weighting, supply a custom weight function that does not blow up to infinity at zero distance, as suggested in the other answer.


Source: https://habr.com/ru/post/1200099/

