Sort by one column, but get a uniform distribution of the second variable where the values are equal

I have two columns, Col1 and Col2, in a pandas DataFrame. Col1 holds numbers from 1 to 100, and Col2 holds 0s and 1s.

I want to sort this DataFrame so that the rows are ordered by Col1. I have several million rows, so the Col1 values are certainly repeated many times.

I can do data = data.sort_values('Col1') to sort by Col1. For example, this may give:

Col1 Col2 ... OR ... Col1 Col2 ... OR ... Col1 Col2
100  0               100  1               100  0
100  0               100  1               100  0
100  1               100  1               100  0
100  0               100  1               100  0
100  1               100  0               100  1
100  1               100  0               100  1
100  1               100  0               100  1
100  0               100  0               100  1
99   1               99   1               99   1
...                  ...                  ...

There are many possible orderings of Col2 within the Col1 = 100 block, depending on the sorting algorithm used (quicksort, mergesort, etc.).
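To make that concrete, here is a small sketch (the column names match the question; the 10,000-row size is just for illustration): a stable sort such as mergesort preserves the original input order of the tied rows, while the default quicksort gives no such guarantee.

import pandas as pd
import numpy as np

# small demo: within equal Col1 values, the order of Col2 depends on the sort kind
data = pd.DataFrame({'Col1': np.random.randint(1, 101, 10000),
                     'Col2': np.random.randint(0, 2, 10000)})
print(data.sort_values('Col1', kind='mergesort').head(8))   # stable: ties keep input order
print(data.sort_values('Col1').head(8))                     # default quicksort: no guarantee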

Within each block where Col1 has the same value, I want the distribution of Col2 to be uniform, for example:

Col1 Col2
100  0
100  1
100  0
100  1
100  0
100  1
100  0
100  1
99   1
...

Is there a way to do this with python/numpy/pandas? If so, how?


One way to alternate the 0s and 1s:

import pandas as pd
import numpy as np
from numpy.random import randint

df = pd.DataFrame({'col1': randint(0, 100, 1000), 'col2': randint(0, 2, 1000)})
df.sort_values(['col1', 'col2'], inplace=True)
cnt = df.groupby(['col1', 'col2']).col1.count()       # size of each (col1, col2) group
df['rk'] = np.hstack([list(range(n)) for n in cnt])   # 0, 1, 2, ... within each group
df.sort_values(['col1', 'rk'], inplace=True)          # equal ranks interleave the 0s and 1s

Step by step:

First, sort df by both columns:

df.sort_values(['col1','col2'],inplace=True)

Then count the size of each (col1, col2) group:

cnt= df.groupby(['col1','col2']).col1.count()

Build a rank 0, 1, 2, ... within each group:

df['rk']=np.hstack([list(range(n)) for n in cnt])

Finally, sort by col1 and the rank:

df.sort_values(['col1','rk'],inplace=True)

For df = pd.DataFrame({'col1': randint(0, 100, 1000), 'col2': randint(0, 2, 1000)}), this gives:

     col1  col2  rk
161     0     0   0
1       0     1   0
253     0     0   1
118     0     1   1
471     0     0   2
391     0     1   2
582     0     0   3
444     0     1   3
579     0     1   4
735     0     1   5
887     0     1   6
111     1     0   0
57      1     1   0
......
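As a rough sanity check (my addition, not part of the answer), you can measure how well col2 alternates within each col1 block after the final sort:

# mean absolute difference between consecutive col2 values within each col1 group:
# close to 1.0 means strict 0/1 alternation, close to 0 means long runs of equal values
check = df.groupby('col1').col2.apply(lambda s: s.diff().abs().mean())
print(check.describe())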

Another option: keep a running count of how many times each (Col1, Col2) pair has been seen, turn that count into a fractional offset added to Col1, and sort by the offset so matching occurrences of 0 and 1 end up next to each other:

from collections import defaultdict

# per-(Col1, Col2) counter, starting at 2 so the first occurrence gets offset 1/2
offset_dict = defaultdict(lambda: defaultdict(lambda: 2))

def get_offset(row):
    step = offset_dict[row["Col1"]][row["Col2"]]
    offset_dict[row["Col1"]][row["Col2"]] += 1
    return row["Col1"] + 1.0 / step

df["offset"] = df.apply(get_offset, axis=1)
df = df.sort_values("offset")

Given the input:

    Col1  Col2
0    100     1
1    100     1
2    100     1
3     99     1
4    100     0
5    100     0
6     99     1
7     99     0
8     99     0
9    100     0
10    99     0
11   100     1
12   100     1
13   100     0
14   100     0

The output, sorted by the offset, is:

    Col1  Col2      offset
10    99     0   99.250000
6     99     1   99.333333
8     99     0   99.333333
3     99     1   99.500000
7     99     0   99.500000
12   100     1  100.166667
14   100     0  100.166667
11   100     1  100.200000
13   100     0  100.200000
2    100     1  100.250000
9    100     0  100.250000
1    100     1  100.333333
5    100     0  100.333333
0    100     1  100.500000
4    100     0  100.500000
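The same offset can also be computed without a row-wise apply. A vectorized sketch of the equivalent idea (my addition): cumcount numbers the occurrences of each (Col1, Col2) pair in row order, just like the counter above.

# k-th occurrence (k = 0, 1, 2, ...) of each (Col1, Col2) pair gets offset Col1 + 1/(k + 2),
# matching the defaultdict counter that starts at 2
df['offset'] = df['Col1'] + 1.0 / (df.groupby(['Col1', 'Col2']).cumcount() + 2)
df = df.sort_values('offset')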

You can use cumcount, which numbers the rows within each group, and sort by that count:

import pandas as pd
import numpy as np

# data from B. M.'s answer
df = pd.DataFrame({'col1': np.random.randint(0, 100, 1000), 'col2': np.random.randint(0, 2, 1000)})

# make a new column with the cumulative count within each (col1, col2) group
df['values'] = df.groupby(['col1', 'col2']).cumcount()

# sort by col1 and the cumulative count
df.sort_values(['col1', 'values'])

    col1    col2    values
61  0   1   0
213 0   0   0
173 0   1   1
473 0   0   1
266 0   1   2

This alternates the values within each col1 block. If the counts of 0s and 1s in a group differ, you can additionally merge in each group's size and divide by it, so the values are spread proportionally:

# cumulative count within each (col1, col2) group
df['values'] = df.groupby(['col1', 'col2']).cumcount()

# sort by col1 and the cumulative count (as above)
df.sort_values(['col1', 'values'])

# merge in a count of each (col1, col2) group; the size column gets the label 0
df = df.merge(df.groupby(['col1', 'col2']).size().reset_index())

# sort key: position within the group divided by the group's size
df['sortkey'] = df['values'] / df[0]

# sort by col1 and the key
df.sort_values(['col1', 'sortkey'])

    col1    col2    values  sortkey 0
393 0   0   0   0.000000    3
812 0   1   0   0.000000    4
813 0   1   1   0.250000    4
394 0   0   1   0.333333    3
814 0   1   2   0.500000    4
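A small readability tweak (a sketch of my own, not from the answer): naming the size column avoids indexing the frame with the integer label 0.

# give the group size an explicit name instead of the default label 0
sizes = df.groupby(['col1', 'col2']).size().rename('grp_size').reset_index()
df = df.merge(sizes)
df['sortkey'] = df['values'] / df['grp_size']
df = df.sort_values(['col1', 'sortkey'])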

It depends on what you mean by “even distribution”. Is there a specific test it has to pass to some threshold? If you just want it to be “fairly mixed” or “unpredictable”, you can simply randomize the order within each Col1 value.

# setup
import pandas as pd
import numpy as np

df = pd.DataFrame({'col1': np.random.randint(0, 100, 1000), 'col2': np.random.randint(0, 2, 1000)})

# add a column with random numbers
df['random_col'] = np.random.random(len(df))

# two-level sort: col1 first, then the random column to shuffle within ties
df.sort_values(['col1', 'random_col'])
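An equivalent sketch of the same idea (my addition): shuffle all rows once, then do a stable sort on col1 so the shuffled order survives within each col1 block.

# shuffle rows, then a stable sort on col1 keeps the random order within ties
df = df.sample(frac=1).sort_values('col1', kind='mergesort')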

Source: https://habr.com/ru/post/1670406/

