Feature selection algorithms are often skipped with one-hot encoding because of the relationships between the encoded features. For example, if you encode a feature with n values into n binary columns, and a selection procedure keeps n-1 of them, the remaining column is redundant: it is fully determined by the others.
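Here is a minimal sketch of that redundancy using pandas (the `color` column is purely illustrative); `drop_first=True` keeps only n-1 of the n dummy columns:

```python
import pandas as pd

# A single categorical feature with n = 3 values (illustrative data).
df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# Full one-hot encoding produces n columns...
full = pd.get_dummies(df["color"])

# ...but each row has exactly one 1, so any column equals
# 1 minus the sum of the others; n-1 columns carry the same information.
reduced = pd.get_dummies(df["color"], drop_first=True)

print(full)
print(reduced)
```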
Since the number of your features is fairly small (~10), feature selection won't help you much anyway: you can probably drop only a few of them without losing too much information.
You wrote that one-hot encoding turns your 10 features into 500, which means each feature has about 50 values. In that case, you may be more interested in reducing the values themselves. If there is an implied order to the values, you can use algorithms for continuous features (e.g., discretization into bins). Another option is to simply drop rare values, or values without a strong correlation with the target; a sketch of the rare-value approach follows below.
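Here is one way that could look in pandas. The helper name `collapse_rare`, the column name, and the threshold are illustrative, not from the original question:

```python
import pandas as pd

def collapse_rare(series: pd.Series, min_count: int = 30) -> pd.Series:
    """Replace values occurring fewer than `min_count` times with 'other',
    shrinking the number of columns the feature produces after encoding."""
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    return series.where(~series.isin(rare), other="other")

# Hypothetical usage on one of the ~10 categorical features:
# df["city"] = collapse_rare(df["city"], min_count=50)
```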
If you do use feature selection, most algorithms will work on categorical data, but you should beware of corner cases. For example, mutual information, suggested by @Igor Raush, is a great measure. However, features with many values tend to have higher entropy than features with fewer values; this, in turn, can inflate their mutual information and bias the selection toward many-valued features. One way to deal with this is to normalize, dividing the mutual information by the feature's entropy.
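A sketch of that normalization, assuming scikit-learn and SciPy are available; `normalized_mi` is a made-up helper name, and the result is MI(feature; target) / H(feature), with both quantities in nats:

```python
import numpy as np
from scipy.stats import entropy
from sklearn.metrics import mutual_info_score

def normalized_mi(feature, target):
    """Mutual information between a categorical feature and the target,
    divided by the feature's entropy to offset the bias toward
    features with many values."""
    mi = mutual_info_score(feature, target)
    _, counts = np.unique(feature, return_counts=True)
    h = entropy(counts)  # scipy normalizes the counts to probabilities
    return mi / h if h > 0 else 0.0
```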
Another family of feature selection algorithms that can help you are wrappers. They delegate the actual work to the classification algorithm, so they are indifferent to the representation as long as the classifier can handle it.
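For instance, forward selection with scikit-learn's `SequentialFeatureSelector` is one wrapper approach; the `X_encoded`/`y` names and all parameter values below are placeholders, not from the original question:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SequentialFeatureSelector

# X_encoded: the one-hot encoded matrix (~500 columns), y: class labels.
# The wrapper repeatedly fits the classifier on candidate feature subsets
# and keeps whichever subset cross-validates best.
selector = SequentialFeatureSelector(
    RandomForestClassifier(n_estimators=100, random_state=0),
    n_features_to_select=50,
    direction="forward",
    cv=3,
)
# selector.fit(X_encoded, y)
# X_selected = selector.transform(X_encoded)
```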