I am trying to implement a grid search to select the best parameters for KNN regression using scikit-learn. Specifically, this is what I'm trying to do:
parameters = [{'weights': ['uniform', 'distance'],
               'n_neighbors': [5, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]}]
clf = GridSearchCV(neighbors.KNeighborsRegressor(), parameters)
clf.fit(features, rewards)
Unfortunately, I get a ValueError: Array contains NaN or infinity. Here is the full traceback:
/Users/zikesjan/anaconda/lib/python2.7/site-packages/sklearn/grid_search.pyc in fit(self, X, y, **params)
    705                           " The params argument will be removed in 0.15.",
    706                           DeprecationWarning)
--> 707         return self._fit(X, y, ParameterGrid(self.param_grid))
    708
    709

/Users/zikesjan/anaconda/lib/python2.7/site-packages/sklearn/grid_search.pyc in _fit(self, X, y, parameter_iterable)
    491                 X, y, base_estimator, parameters, train, test,
    492                 self.scorer_, self.verbose, **self.fit_params)
--> 493             for parameters in parameter_iterable
    494             for train, test in cv)
    495

/Users/zikesjan/anaconda/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.pyc in __call__(self, iterable)
    515         try:
    516             for function, args, kwargs in iterable:
--> 517                 self.dispatch(function, args, kwargs)
    518
    519             self.retrieve()

/Users/zikesjan/anaconda/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.pyc in dispatch(self, func, args, kwargs)
    310         """
    311         if self._pool is None:
--> 312             job = ImmediateApply(func, args, kwargs)
    313             index = len(self._jobs)
    314             if not _verbosity_filter(index, self.verbose):

/Users/zikesjan/anaconda/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.pyc in __init__(self, func, args, kwargs)
    134         # Don't delay the application, to avoid keeping the input
    135         # arguments in memory
--> 136         self.results = func(*args, **kwargs)
    137
    138     def get(self):

/Users/zikesjan/anaconda/lib/python2.7/site-packages/sklearn/grid_search.pyc in fit_grid_point(X, y, base_estimator, parameters, train, test, scorer, verbose, loss_func, **fit_params)
    309             this_score = scorer(clf, X_test, y_test)
    310         else:
--> 311             this_score = clf.score(X_test, y_test)
    312     else:
    313         clf.fit(X_train, **fit_params)

/Users/zikesjan/anaconda/lib/python2.7/site-packages/sklearn/base.pyc in score(self, X, y)
    320
    321         from .metrics import r2_score
--> 322         return r2_score(y, self.predict(X))
    323
    324

/Users/zikesjan/anaconda/lib/python2.7/site-packages/sklearn/metrics/metrics.pyc in r2_score(y_true, y_pred)
   2181
   2182     """
-> 2183     y_type, y_true, y_pred = _check_reg_targets(y_true, y_pred)
   2184
   2185     if len(y_true) == 1:

/Users/zikesjan/anaconda/lib/python2.7/site-packages/sklearn/metrics/metrics.pyc in _check_reg_targets(y_true, y_pred)
     59         Estimated target values.
     60     """
---> 61     y_true, y_pred = check_arrays(y_true, y_pred)
     62
     63     if y_true.ndim == 1:

/Users/zikesjan/anaconda/lib/python2.7/site-packages/sklearn/utils/validation.pyc in check_arrays(*arrays, **options)
    231         else:
    232             array = np.asarray(array, dtype=dtype)
--> 233         _assert_all_finite(array)
    234
    235     if copy and array is array_orig:

/Users/zikesjan/anaconda/lib/python2.7/site-packages/sklearn/utils/validation.pyc in _assert_all_finite(X)
     25     if (X.dtype.char in np.typecodes['AllFloat'] and not np.isfinite(X.sum())
     26             and not np.isfinite(X).all()):
---> 27         raise ValueError("Array contains NaN or infinity.")
     28
     29

ValueError: Array contains NaN or infinity.
Based on this post, I already tried replacing the fit call above with the following line:
clf.fit(np.asarray(features).astype(float), np.asarray(rewards).astype(float))
Then, based on this post, I even tried this:
scaler = preprocessing.StandardScaler().fit(np.asarray(features).astype(float))
transformed_features = scaler.transform(np.asarray(features).astype(float))
clf.fit(transformed_features, rewards)
But, unfortunately, without success. So I would like to ask if anyone has an idea where the problem might be and how I can get my code to work.
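For what it's worth, this is roughly how I can check whether the inputs themselves contain NaN or infinity (just a sketch, assuming features and rewards convert to float arrays as above):

import numpy as np

# assumes features and rewards are defined as above
X = np.asarray(features).astype(float)
y = np.asarray(rewards).astype(float)

print(np.isnan(X).any())   # any NaN in the features?
print(np.isinf(X).any())   # any infinity in the features?
print(np.isnan(y).any())   # any NaN in the targets?
print(np.isinf(y).any())   # any infinity in the targets?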
Thank you in advance.
EDIT:
I found out that I do not get this error when I use only the following parameters:
parameters = [{'weights': ['uniform'], 'n_neighbors': [5, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]}]
So the problem seems to occur when weights = 'distance'. Does anyone have an idea why?
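To take GridSearchCV out of the picture, here is a minimal check to see whether a single distance-weighted regressor already produces NaN predictions (again just a sketch, using a crude 50/50 split in place of the real CV folds):

import numpy as np
from sklearn import neighbors

# assumes features and rewards are defined as above
X = np.asarray(features).astype(float)
y = np.asarray(rewards).astype(float)

# fit on one half, predict on the other half, roughly like one CV fold
half = len(X) // 2
knn = neighbors.KNeighborsRegressor(n_neighbors=5, weights='distance')
knn.fit(X[:half], y[:half])
pred = knn.predict(X[half:])

print(np.isnan(pred).any())   # True would mean the NaN comes from the predictions
print(np.isinf(pred).any())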
There is another issue related to this, which I ask about here.
EDIT 2:
If I run my code with warning logging enabled while debugging, I get the following warning:
/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/neighbors/regression.py:160: RuntimeWarning: invalid value encountered in divide
  y_pred[:, j] = num / denom
So there is clearly a division by zero somewhere. My question is: why does scikit-learn divide by 0 on line 160 of regression.py?
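If I understand distance weighting correctly, it weights neighbors by the inverse of their distance, so a distance of exactly zero would give an infinite weight. Here is a standalone numpy sketch (an illustration only, not scikit-learn's actual code) of how that turns into inf / inf = NaN:

import numpy as np

# standalone illustration only -- not scikit-learn's actual code
# one test point sits exactly on a training point, so one distance is 0
dist = np.array([0.0, 1.0, 2.0])
y_train = np.array([3.0, 5.0, 7.0])

weights = 1.0 / dist                          # [inf, 1.0, 0.5], RuntimeWarning: divide by zero
num = np.array([np.sum(weights * y_train)])   # inf
denom = np.array([np.sum(weights)])           # inf
print(num / denom)                            # [nan], RuntimeWarning: invalid value encountered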