I am trying to evaluate a model based on its performance in a historical competition.
I have a dataset consisting of the following columns:
feature1 | ... | featureX | oddsPlayerA | oddsPlayerB | winner
The model performs a regression; its output, prediction_player_A_win_odds, is the predicted (fair) odds that player A wins the match.
As far as I understand, I can supply my own scoring function that returns the "money" the model would make if it bet every time a condition holds, and use that value to measure the model's suitability. The condition is something like:
    if prediction_player_A_win_odds < oddsPlayerA:
        money += bet_playerA(oddsPlayerA, winner)
    if inverse_odd(prediction_player_A_win_odds) < oddsPlayerB:
        money += bet_playerB(oddsPlayerB, winner)
In the custom scoring function I receive the usual arguments, `ground_truth` and `predictions` (where `ground_truth` is the winner array and `predictions` is the prediction_player_A_win_odds array), but I also need the columns "oddsPlayerA" and "oddsPlayerB" from the dataset (and here is the problem!).
If the custom scoring function were called with the data in the same order as the original dataset, it would be trivial to fetch the extra columns. But in reality, when cross-validation is used, the data it receives is shuffled relative to the original.
I tried the most obvious approach, which was to pass y as [oddsA, oddsB, winner] (shape [n, 3]), but scikit-learn did not allow a multi-column y here.
So, how can I pass data that is neither X nor y into a user-defined scoring function while keeping it aligned (in the same row order) with the data being scored?
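For what it's worth, one workaround I have seen (sketched below under assumed data; the column names match the question, everything else is made up) relies on scikit-learn preserving the pandas index when it slices a Series during cross-validation: keep the odds in the full DataFrame, pass y as a Series, and let the scorer look the extra columns up via `y_true.index`:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import make_scorer
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the real dataset described in the question.
rng = np.random.default_rng(0)
n = 100
df = pd.DataFrame({
    "feature1": rng.normal(size=n),
    "oddsPlayerA": rng.uniform(1.2, 3.0, size=n),
    "oddsPlayerB": rng.uniform(1.2, 3.0, size=n),
    "winner": rng.integers(0, 2, size=n),  # 1 = player A won
})

X = df[["feature1"]]
# Regression target: assumed here to be the "fair" odds for player A.
y = pd.Series(rng.uniform(1.2, 3.0, size=n), index=df.index)

def profit_score(y_true, y_pred):
    # When y is a pandas Series, scikit-learn slices it per fold while
    # keeping its original index, so the matching odds rows can be
    # recovered from the full DataFrame in the correct order.
    rows = df.loc[y_true.index]
    odds_a = rows["oddsPlayerA"].to_numpy()
    bet_a = y_pred < odds_a              # "value bet" condition on A
    win_a = rows["winner"].to_numpy() == 1
    profit = np.where(win_a, odds_a - 1.0, -1.0)  # unit stake payoff
    return float(np.sum(profit[bet_a]))

scorer = make_scorer(profit_score)
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring=scorer)
```

This only covers the player-A side of the condition; the player-B branch would be added the same way. It also assumes the scorer is always invoked on subsets of this one DataFrame, so it is a sketch of the idea rather than a general solution.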