How can I encode features with more than one value per column? Do I need a MultiDictVectorizer?

I am encoding some features in sklearn and have run into a problem. DictVectorizer works well if each feature of an item takes a single value that can be one-hot encoded. But what if an item can have two or more values for the same column? For example, DictVectorizer works great with an element like this:

{'a': 'b', 'b': 'c'}

But what about something similar, with more than one value per column?

{'a': ['b','c'], 'b': 'd'}

The one-hot encoding strategy still applies: you would just need two columns, a=b and a=c. As far as I can tell, no such vectorizer exists! What should be done in this situation? Do I need to write my own MultiDictVectorizer?
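One way to get the a=b and a=c columns described above is to flatten each sample before vectorizing. The following is only a minimal sketch: the helper name flatten_multi_valued is hypothetical and not part of sklearn. It relies on the fact that DictVectorizer passes numeric values through under their own key, and it mimics the feature=value column names DictVectorizer produces by default.

```python
def flatten_multi_valued(sample, separator="="):
    """Expand list-valued features into one indicator key per value.

    Hypothetical helper, not an sklearn API.
    {'a': ['b', 'c'], 'b': 'd'} becomes {'a=b': 1, 'a=c': 1, 'b=d': 1}.
    """
    flat = {}
    for name, value in sample.items():
        if isinstance(value, str):
            values = [value]          # single categorical value
        elif isinstance(value, (list, tuple, set, frozenset)):
            values = value            # several categorical values
        else:
            flat[name] = value        # numeric feature: keep as-is
            continue
        for v in values:
            flat[f"{name}{separator}{v}"] = 1
    return flat


samples = [
    {'a': ['b', 'c'], 'b': 'd'},
    {'a': 'b', 'b': 'c'},
]
flat_samples = [flatten_multi_valued(s) for s in samples]
```

Because the flattened dicts contain only numeric values, they can be handed to DictVectorizer().fit_transform(flat_samples) as usual, which should yield one column per feature=value pair.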

I wrote about this on my blog here before posting.

+4
2 answers

In this situation, there are at least two quick solutions:

  • Create a new value that represents the combination of the two values:

    {'a': 'bc', 'b': 'd'}, or give it another name, e.g. 'bc' --> 'e'

  • Repeat the sample once per value, taking one of the values each time:

    {'a': 'b', 'b': 'd'} and {'a': 'c', 'b': 'd'}

Of course, which one fits depends on the context of your problem. For case 1: is a conceptually different new feature value acceptable? For case 2: is it valid to duplicate the sample, once for each value? I also don't know whether a multi-valued feature like this could correspond to something like N/A in your data, for example.
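Both workarounds can be sketched in a few lines of plain Python. The helper names merge_values and duplicate_sample are hypothetical and used only for illustration. The sketch assumes multi-valued features are given as lists or tuples of strings.

```python
from itertools import product


def merge_values(sample, joiner=""):
    """Case 1: collapse a multi-valued feature into one combined value."""
    merged = {}
    for name, value in sample.items():
        if isinstance(value, (list, tuple)):
            # Sort so ['b', 'c'] and ['c', 'b'] map to the same new value.
            merged[name] = joiner.join(sorted(value))
        else:
            merged[name] = value
    return merged


def duplicate_sample(sample):
    """Case 2: emit one copy of the sample per combination of values."""
    names = list(sample)
    choices = [
        list(v) if isinstance(v, (list, tuple)) else [v]
        for v in sample.values()
    ]
    return [dict(zip(names, combo)) for combo in product(*choices)]


sample = {'a': ['b', 'c'], 'b': 'd'}
merged = merge_values(sample)          # {'a': 'bc', 'b': 'd'}
duplicates = duplicate_sample(sample)  # two single-valued samples
```

Either result contains only single-valued features, so it can go straight into DictVectorizer. Note that duplicating samples changes how many rows you have, so any labels or sample weights would need to be repeated to match.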


+1

DictVectorizer , . , sklearn. , DictVectorizer MultiDictVectorizer .

Pull Github

Github

0

Source: https://habr.com/ru/post/1628912/

