In Pandas, you can use something like this:
import pandas as pd import numpy as np data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'], 'year': [2000, 2001, 2002, 2001, 2002], 'pop': [1.5, 1.7, 3.6, 2.4, 2.9]} data = pd.DataFrame(data) def hash_col(df, col, N): cols = [col + "_" + str(i) for i in range(N)] def xform(x): tmp = [0 for i in range(N)]; tmp[hash(x) % N] = 1; return pd.Series(tmp,index=cols) df[cols] = df[col].apply(xform) return df.drop(col,axis=1) print hash_col(data, 'state',4)
The output will be
pop year state_0 state_1 state_2 state_3 0 1.5 2000 0 1 0 0 1 1.7 2001 0 1 0 0 2 3.6 2002 0 1 0 0 3 2.4 2001 0 0 0 1 4 2.9 2002 0 0 0 1
Also at the row level you could
import numpy as np, os import sys, pandas as pd
def hash_col(df, col, N): df = df.replace('',np.nan) cols = [col + "_" + str(i) for i in range(N)] tmp = [0 for i in range(N)] tmp[hash(df.ix[col]) % N] = 1 res = df.append(pd.Series(tmp,index=cols)) return res.drop(col) a = pd.Series(['new york',30,''],index=['city','age','test']) b = pd.Series(['boston',30,''],index=['city','age','test']) print hash_col(a,'city',10) print hash_col(b,'city',10)
This will work for one series, the column name will be considered the pandas index. It also replaces empty lines with nan and swims everything.
age 30 test NaN city_0 0 city_1 0 city_2 0 city_3 0 city_4 0 city_5 0 city_6 0 city_7 1 city_8 0 city_9 0 dtype: object age 30 test NaN city_0 0 city_1 0 city_2 0 city_3 0 city_4 0 city_5 1 city_6 0 city_7 0 city_8 0 city_9 0 dtype: object
If, however, there is a dictionary, and you just want one-hot-encode, you can use
import numpy as np import pandas as pd, os import scipy.sparse as sps def hash_col(df, col, vocab): cols = [col + "=" + str(v) for v in vocab] def xform(x): tmp = [0 for i in range(len(vocab))]; tmp[vocab.index(x)] = 1; return pd.Series(tmp,index=cols) df[cols] = df[col].apply(xform) return df.drop(col,axis=1) data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'], 'year': [2000, 2001, 2002, 2001, 2002], 'pop': [1.5, 1.7, 3.6, 2.4, 2.9]} df = pd.DataFrame(data) df2 = hash_col(df, 'state', ['Ohio','Nevada']) print sps.csr_matrix(df2)
which will give
pop year state=Ohio state=Nevada 0 1.5 2000 1 0 1 1.7 2001 1 0 2 3.6 2002 1 0 3 2.4 2001 0 1 4 2.9 2002 0 1
I also added the separation of the final piece of data. In incremental configuration, where we might not have met all the values ββin advance (but somehow we somehow got a list of all possible values), you can use the approach described above. Incremental ML methods would need the same number of functions for each increment, so a single coding should result in the same number of lines in each batch.