Create open-bound indicators alongside pandas get_dummies on a binned numeric column

Given a numeric pandas column age, binned into ageD with qcut, we create open bounds from the qcut bin edges:

    import pandas as pd
    from itertools import chain

    d = {'age': {0: 5, 1: 23, 2: 43, 3: 70, 4: 30}}
    df = pd.DataFrame.from_dict(d)
    df['ageD'] = pd.qcut(df.iloc[:, 0], 2)
    df.ageD.cat.categories
    # Index([u'[5, 30]', u'(30, 70]'], dtype='object')

From Index([u'[5, 30]', u'(30, 70]'], dtype='object'), we build bopens:

    >>> bopens = get_open_bounds(df)
    >>> bopens
    # ['(-inf, 5]', '(-inf, 30]', '(-inf, 70]', '(5, +inf)', '(30, +inf)', '(70, +inf)']

Then we convert the categorical variable into indicator (dummy) columns with get_dummies:

    df = pd.get_dummies(df)
    print(df)
    #    age  ageD_[5, 30]  ageD_(30, 70]
    # 0    5             1              0
    # 1   23             1              0
    # 2   43             0              1
    # 3   70             0              1
    # 4   30             1              0

I want to enrich the data frame with open-bound columns. df.shape will be quite large, about (10e6, 32). What is the best way to create the 6 bopens columns for each row?

The target df will look like this:

    >>> df
       age  age_[5, 30]  age_(30, 70]  (-inf, 5]  (-inf, 30]  (-inf, 70]  (5, +inf)  (30, +inf)  (70, +inf)
    0    5            1             0          1           1           1          0           0           0
    1   23            1             0          0           1           1          1           0           0
    2   43            0             1          0           0           1          1           1           0
    3   70            0             1          0           0           1          1           1           0
    4   30            1             0          0           1           1          1           0           0

PS: the get_open_bounds function used to create bopens:

    def get_open_bounds(df):
        bounds = [(int(x[1:]), int(y[:-1]))
                  for x, y in [c.split(', ') for c in df.ageD.cat.categories]]
        bounds = list(chain(*bounds))
        bounds
        # [5, 30, 30, 70]

        # to get uniques, keeping the order
        bounds = [b for idx, b in enumerate(bounds) if b not in bounds[:idx]]

        # make the open bounds
        bopens = ["(-inf, {}]".format(b) for b in bounds] + \
                 ["({}, +inf)".format(b) for b in bounds]
        return bopens
1 answer

IIUC, you can do this with a little broadcasting:

    df['ageD'], bins = pd.qcut(df.iloc[:, 0], 2, retbins=True)
    left = (df["age"].values <= bins[:, None]).T.astype(int)
    dl = pd.DataFrame(left, columns=["(-inf, {}]".format(b) for b in bins])
    dr = pd.DataFrame(1 - left, columns=["({}, +inf)".format(b) for b in bins])
    dout = pd.concat([pd.get_dummies(df), dl, dr], axis=1)

which gives me

    >>> dout
       age  ageD_[5, 30]  ageD_(30, 70]  (-inf, 5]  (-inf, 30]  (-inf, 70]  (5, +inf)  (30, +inf)  (70, +inf)
    0    5             1              0          1           1           1          0           0           0
    1   23             1              0          0           1           1          1           0           0
    2   43             0              1          0           0           1          1           1           0
    3   70             0              1          0           0           1          1           1           0
    4   30             1              0          0           1           1          1           0           0
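
To make the broadcasting step easier to follow, here is a minimal NumPy-only sketch of the comparison. The arrays ages and edges are hard-coded stand-ins for df["age"].values and the qcut bin edges (an assumption for illustration, not taken from a real qcut call):

```python
import numpy as np

# Stand-ins for df["age"].values and the qcut bin edges (hard-coded here).
ages = np.array([5, 23, 43, 70, 30])   # shape (5,)
edges = np.array([5, 30, 70])          # shape (3,)

# edges[:, None] has shape (3, 1). Comparing it with shape (5,) broadcasts
# to shape (3, 5): one row per edge, one column per observation.
cmp = ages <= edges[:, None]

# Transpose so that rows are observations and columns are edges.
left = cmp.T.astype(int)               # shape (5, 3): age <= edge
right = 1 - left                       # shape (5, 3): age > edge (no NaN here)
```

The whole comparison is a single vectorized operation, which is why it should scale reasonably to a frame with millions of rows.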

Note #1: by adding retbins=True, I get the bin edges directly and avoid some awkward string parsing.
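
As a small sketch of what retbins=True returns for the sample data: the edges come back as a NumPy array, typically of floats, so plain "{}".format(b) may produce labels such as "(-inf, 5.0]" depending on your pandas version. Using the "{:g}" format here is my own assumption about how you might want the labels to look, not part of the code above:

```python
import pandas as pd

age = pd.Series([5, 23, 43, 70, 30], name="age")

# qcut returns the binned Series and, with retbins=True, the bin edges.
binned, bins = pd.qcut(age, 2, retbins=True)

# The edges are the min, the median and the max of the data.
edge_list = [float(b) for b in bins]

# "{:g}" drops a trailing ".0", giving labels like those in the question.
left_labels = ["(-inf, {:g}]".format(b) for b in bins]
right_labels = ["({:g}, +inf)".format(b) for b in bins]
```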

Note #2: by using the implicit "right = 1 - left", I assume that no age is NaN, so that one of <= or > must be true; if that is not the case, you can use right = (df["age"].values > bins[:, None]).T.astype(int) instead.
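
To illustrate the NaN caveat in Note #2, here is a short sketch with one missing age. The arrays are made-up stand-ins for the real column and bin edges:

```python
import numpy as np

# Made-up data: the second age is missing.
ages = np.array([5.0, np.nan, 43.0])
edges = np.array([5.0, 30.0, 70.0])

left = (ages <= edges[:, None]).T.astype(int)

# Implicit complement: a NaN row is 0 everywhere in left, so it becomes
# 1 everywhere here, which wrongly claims NaN > every edge.
right_implicit = 1 - left

# Explicit comparison: NaN > edge is False, so the NaN row stays all 0.
right_explicit = (ages > edges[:, None]).T.astype(int)
```

For rows without NaN the two versions agree; they differ only on the missing row.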

Note #3: really, I should also pass df.index to the DataFrame constructors; your example has the default index, but that may not be true of your real data.
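
A minimal sketch of why the index matters in Note #3, assuming a frame whose index is not the default 0..n-1 (the index values 10, 20, 30 and the hard-coded edges are invented for illustration):

```python
import numpy as np
import pandas as pd

# Invented example with a non-default index.
df = pd.DataFrame({"age": [5, 23, 43]}, index=[10, 20, 30])
edges = np.array([5, 30, 70])

left = (df["age"].values <= edges[:, None]).T.astype(int)
cols = ["(-inf, {}]".format(b) for b in edges]

# Without index=..., the new frame gets the default index 0, 1, 2, and
# concat aligns on labels: the result has 6 rows padded with NaN.
misaligned = pd.concat([df, pd.DataFrame(left, columns=cols)], axis=1)

# Passing index=df.index keeps the rows lined up.
aligned = pd.concat(
    [df, pd.DataFrame(left, columns=cols, index=df.index)], axis=1
)
```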


Source: https://habr.com/ru/post/1239484/

