Exclude missing values from model performance calculation

I have a dataset and I want to build a model, preferably with the caret package. My data is actually a time series, but the question is not specific to time series; I just use createTimeSlices for data partitioning.

My data has a number of missing (NA) values, which I imputed separately from the caret code. I also kept track of their positions:

 # a logical vector same size as the data, which obs were imputed NA
 imputed = c(FALSE, FALSE, FALSE, TRUE, FALSE, FALSE)
 imputed[imputed] <- NA; print(imputed)
 #### [1] FALSE FALSE FALSE NA FALSE FALSE

I know that caret's train function has an option to either exclude NAs or impute them with different methods. This is not what I want. I need to build the model on the already imputed dataset, but I want to exclude the imputed points from the calculation of the error metrics (RMSE, MAE, ...).

I do not know how to do this in caret. In my first script, I did the whole cross-validation manually, and there I had a custom error measure:

 actual = c(5, 4, 3, 6, 7, 5)
 predicted = c(4, 4, 3.5, 7, 6.8, 4)
 Metrics::rmse(actual, predicted) # with all the points
 #### [1] 0.7404953
 sqrt(mean((!imputed) * (actual - predicted)^2, na.rm = TRUE)) # excluding the imputed
 #### [1] 0.676757

How can I handle this in caret? Or is there another way that avoids coding everything by hand?

1 answer

I am not sure this is what you are looking for, but here is a simple solution using a small helper function.

 i = which(imputed == FALSE) ## indices of the non-imputed observations; which() drops the NA entries
 metric_na = function(fun, actual, predicted, index) {
   fun(actual[index], predicted[index])
 }
 metric_na(Metrics::rmse, actual, predicted, index = i)
 #### [1] 0.676757
 metric_na(Metrics::mae, actual, predicted, index = i)
 #### [1] 0.54

Alternatively, you can simply use the index directly when calculating the desired metrics.

 Metrics::rmse(actual[i], predicted[i]) 
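
To get the same masking inside caret itself, one possible route is a custom summary function passed to trainControl(summaryFunction = ...). The following is only a sketch under assumptions, not a tested caret recipe: `masked_summary` and `imputed_flag` are hypothetical names, the flag vector is kept as plain TRUE/FALSE (rather than NA, as in the question), and the code assumes that the data frame caret hands to the summary function contains a `rowIndex` column pointing back to the rows of the training set. Please verify that last point for your caret version. The function is checked here on a hand-built hold-out set with the numbers from the question, so it runs without caret installed.

```r
# TRUE where the training row was imputed (hypothetical name, toy data from the question).
imputed_flag <- c(FALSE, FALSE, FALSE, TRUE, FALSE, FALSE)

# Summary function with the (data, lev, model) signature used by
# caret::trainControl(summaryFunction = ...).
# ASSUMPTION: `data` has a `rowIndex` column referring to training-set rows.
masked_summary <- function(data, lev = NULL, model = NULL) {
  keep <- !imputed_flag[data$rowIndex]   # keep only non-imputed hold-out rows
  err  <- data$obs[keep] - data$pred[keep]
  c(RMSE = sqrt(mean(err^2)), MAE = mean(abs(err)))
}

# Stand-alone check that mimics a single hold-out set.
holdout <- data.frame(
  obs      = c(5, 4, 3, 6, 7, 5),
  pred     = c(4, 4, 3.5, 7, 6.8, 4),
  rowIndex = 1:6
)
result <- masked_summary(holdout)
print(result)

# Hypothetical wiring into caret (not run here; window sizes are placeholders):
# ctrl <- caret::trainControl(method = "timeslice", initialWindow = 4, horizon = 2,
#                             summaryFunction = masked_summary)
# fit  <- caret::train(y ~ ., data = dat, method = "lm",
#                      trControl = ctrl, metric = "RMSE")
```

Note that if every row of a hold-out set is imputed, `keep` is all FALSE and both metrics come out as NaN, so very short horizons may need extra handling.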

Source: https://habr.com/ru/post/1257786/

