I am encoding some features in sklearn and have run into a problem. DictVectorizer works well if your data can be one-hot encoded with a single value per column for each item. What if an item can have two or more values in the same column? For example, DictVectorizer works great with an item like this:
{'a': 'b', 'b': 'c'}
But what about something similar, with more than one value per column?
{'a': ['b','c'], 'b': 'd'}
The one-hot encoding strategy still applies: you just need two columns, a=b and a=c. As far as I can tell, no such vectorizer exists. What should be done in this situation? Do I need to write my own MultiDictVectorizer?
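To make the idea concrete, here is a minimal sketch of one possible workaround: flatten each list-valued entry into separate indicator keys before handing the dicts to DictVectorizer. The helper name expand_multi_values is hypothetical (it is not part of sklearn), and the sketch assumes that multi-valued columns arrive as lists, tuples, or sets of strings. Newer scikit-learn releases may accept such values directly, so it is worth checking the DictVectorizer documentation for your version first.

```python
from sklearn.feature_extraction import DictVectorizer


def expand_multi_values(record):
    """Hypothetical helper: replace each list-valued entry with one
    indicator key per value, e.g. {'a': ['b', 'c']} -> {'a=b': 1, 'a=c': 1}.
    Scalar entries are passed through unchanged."""
    flat = {}
    for key, value in record.items():
        if isinstance(value, (list, tuple, set)):
            for item in value:
                # Mirror the "key=value" naming DictVectorizer uses for strings.
                flat[f"{key}={item}"] = 1
        else:
            flat[key] = value
    return flat


records = [
    {'a': ['b', 'c'], 'b': 'd'},  # multi-valued column 'a'
    {'a': 'b', 'b': 'c'},         # ordinary single-valued item
]

vectorizer = DictVectorizer(sparse=False)
X = vectorizer.fit_transform([expand_multi_values(r) for r in records])

print(vectorizer.feature_names_)
print(X)
```

Because the indicator keys reuse the key=value naming, a value that appears inside a list in one item and as a plain string in another ends up in the same column.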
I wrote about this on my blog here before posting.