Shuffling / permuting a DataFrame in pandas

What is a simple and efficient way to shuffle a DataFrame in pandas, by rows or by columns? That is, how do I write a function shuffle(df, n, axis=0) that takes a DataFrame, a number of shuffles n, and an axis (axis=0 is rows, axis=1 is columns) and returns a copy of the DataFrame that has been shuffled n times?

Edit: the key is to do this without destroying the row/column labels of the DataFrame. If you just shuffle df.index, that loses all that information. I want the resulting df to be the same as the original, except that the order of the rows or the order of the columns is different.

Edit2: my question was unclear. When I say shuffle the rows, I mean shuffle each row independently. So if you have two columns a and b, I want each row shuffled on its own, so that you do not keep the same associations between a and b as you would if you simply reordered each row as a whole. Something like:

    for 1...n:
        for each col in df: shuffle column
    return new_df

But hopefully more efficient than naive looping. This does not work for me:

    def shuffle(df, n, axis=0):
        shuffled_df = df.copy()
        for k in range(n):
            shuffled_df.apply(np.random.shuffle(shuffled_df.values),axis=axis)
        return shuffled_df

    df = pandas.DataFrame({'A':range(10), 'B':range(10)})
    shuffle(df, 5)
+49
python numpy pandas
Apr 02 '13 at 18:50
11 answers
    In [16]: def shuffle(df, n=1, axis=0):
        ...:     df = df.copy()
        ...:     for _ in range(n):
        ...:         df.apply(np.random.shuffle, axis=axis)
        ...:     return df
        ...:

    In [17]: df = pd.DataFrame({'A':range(10), 'B':range(10)})

    In [18]: shuffle(df)

    In [19]: df
    Out[19]:
       A  B
    0  8  5
    1  1  7
    2  7  3
    3  6  2
    4  3  4
    5  0  1
    6  9  0
    7  4  6
    8  2  8
    9  5  9
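
On recent pandas versions, `DataFrame.apply` may pass the function a copy rather than a view, so an in-place `np.random.shuffle` inside `apply` is not guaranteed to change the frame. The following is a minimal sketch that does not rely on that behaviour. It assumes NumPy 1.17 or later for `np.random.default_rng`; the name `shuffle_independent` and the `seed` parameter are illustrative and not part of the answer above.

```python
import numpy as np
import pandas as pd


def shuffle_independent(df, n=1, axis=0, seed=None):
    """Return a copy of df whose values are shuffled n times.

    axis=0 shuffles the values within each column independently,
    axis=1 shuffles the values within each row independently.
    Row and column labels stay in their original order.
    """
    rng = np.random.default_rng(seed)
    values = df.to_numpy(copy=True)  # private copy, df itself is untouched
    # Iterate over columns (axis=0) or rows (axis=1) as 1-D views.
    views = values.T if axis == 0 else values
    for _ in range(n):
        for view in views:
            rng.shuffle(view)  # in place on the copied array
    return pd.DataFrame(values, index=df.index, columns=df.columns)


df = pd.DataFrame({'A': range(10), 'B': range(10)})
shuffled = shuffle_independent(df, n=5, seed=0)
print(shuffled)
```

For frames with mixed dtypes the intermediate array has object dtype, so the column dtypes may need to be restored afterwards.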
+22
Apr 2 '13 at 7:10

Use the numpy random.permutation function:

    In [1]: df = pd.DataFrame({'A':range(10), 'B':range(10)})

    In [2]: df
    Out[2]:
       A  B
    0  0  0
    1  1  1
    2  2  2
    3  3  3
    4  4  4
    5  5  5
    6  6  6
    7  7  7
    8  8  8
    9  9  9

    In [3]: df.reindex(np.random.permutation(df.index))
    Out[3]:
       A  B
    0  0  0
    5  5  5
    6  6  6
    3  3  3
    8  8  8
    7  7  7
    9  9  9
    1  1  1
    2  2  2
    4  4  4
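
As a sketch of the same idea with a seeded generator: permuting positions with `df.iloc` gives the same kind of result as the `reindex` call when the labels are unique, and it also works when they are not. The string index, the `rng` name, and the seed are illustrative assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # seed chosen only for reproducibility
df = pd.DataFrame({'A': range(10), 'B': range(10)}, index=list('abcdefghij'))

# Permute row positions; each row keeps its own label and values.
shuffled_rows = df.iloc[rng.permutation(len(df))]

# Permute column positions the same way.
shuffled_cols = df.iloc[:, rng.permutation(df.shape[1])]

print(shuffled_rows)
print(shuffled_cols)
```

Note that this reorders whole rows (or whole columns), so the pairing between A and B within a row is preserved.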
+168
Apr 02 '13 at 19:09

Sampling is random, so just sample the entire DataFrame:

 df.sample(frac=1) 
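
A short sketch of how this call might be used in practice, assuming a pandas version where `sample` accepts `random_state` and `axis`. The example frame and variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': range(5), 'B': range(5, 10), 'C': range(10, 15)})

# Shuffle rows; random_state makes the result reproducible.
rows_shuffled = df.sample(frac=1, random_state=0)

# Shuffle columns instead by sampling along axis=1.
cols_shuffled = df.sample(frac=1, axis=1, random_state=0)

# Optionally discard the old row labels after shuffling.
renumbered = rows_shuffled.reset_index(drop=True)

print(rows_shuffled)
print(cols_shuffled)
```

Like `reindex` with a permutation, this moves whole rows or whole columns, so it does not break the associations between columns within a row.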
+65
Mar 03 '16 at 22:51

You can use sklearn.utils.shuffle() (sklearn 0.16.1 or higher is required to support Pandas data frames):

    # Generate data
    import pandas as pd
    df = pd.DataFrame({'A':range(5), 'B':range(5)})
    print('df: {0}'.format(df))

    # Shuffle Pandas data frame
    import sklearn.utils
    df = sklearn.utils.shuffle(df)
    print('\n\ndf: {0}'.format(df))

outputs:

    df:    A  B
    0  0  0
    1  1  1
    2  2  2
    3  3  3
    4  4  4


    df:    A  B
    1  1  1
    0  0  0
    3  3  3
    4  4  4
    2  2  2

Then you can use df.reset_index() to reset the index column if necessary:

    df = df.reset_index(drop=True)
    print('\n\ndf: {0}'.format(df))

outputs:

    df:    A  B
    0  1  1
    1  0  0
    2  4  4
    3  2  2
    4  3  3
+11
Aug 11 '16 at 17:40

From the docs, use sample():

    In [79]: s = pd.Series([0,1,2,3,4,5])

    # When no arguments are passed, returns 1 row.
    In [80]: s.sample()
    Out[80]:
    0    0
    dtype: int64

    # One may specify either a number of rows:
    In [81]: s.sample(n=3)
    Out[81]:
    5    5
    2    2
    4    4
    dtype: int64

    # Or a fraction of the rows:
    In [82]: s.sample(frac=0.5)
    Out[82]:
    5    5
    4    4
    1    1
    dtype: int64
+5
Feb 24 '16 at 19:07

I resorted to adapting @root's answer slightly and using the raw values directly. Of course, this means you lose the ability to do fancy indexing, but it works perfectly for just shuffling the data.

    In [1]: import numpy

    In [2]: import pandas

    In [3]: df = pandas.DataFrame({"A": range(10), "B": range(10)})

    In [4]: %timeit df.apply(numpy.random.shuffle, axis=0)
    1000 loops, best of 3: 406 µs per loop

    In [5]: %%timeit
       ...: for view in numpy.rollaxis(df.values, 1):
       ...:     numpy.random.shuffle(view)
       ...:
    10000 loops, best of 3: 22.8 µs per loop

    In [6]: %timeit df.apply(numpy.random.shuffle, axis=1)
    1000 loops, best of 3: 746 µs per loop

    In [7]: %%timeit
       ...: for view in numpy.rollaxis(df.values, 0):
       ...:     numpy.random.shuffle(view)
       ...:
    10000 loops, best of 3: 23.4 µs per loop

Note that numpy.rollaxis brings the specified axis to the front, which lets us iterate over arrays with the remaining dimensions. That is, if we want to shuffle along the first dimension (within each column), we need to roll the second dimension to the front, so that the shuffle is applied to views that run along the first dimension.

    In [8]: numpy.rollaxis(df.values, 0).shape
    Out[8]: (10, 2) # we can iterate over 10 arrays with shape (2,) (rows)

    In [9]: numpy.rollaxis(df.values, 1).shape
    Out[9]: (2, 10) # we can iterate over 2 arrays with shape (10,) (columns)

Your final function then uses a trick to bring the result in line with what you would expect when applying a function along an axis:

    def shuffle(df, n=1, axis=0):
        df = df.copy()
        axis = int(not axis) # pandas.DataFrame is always 2D
        for _ in range(n):
            for view in numpy.rollaxis(df.values, axis):
                numpy.random.shuffle(view)
        return df
+2
Feb 01 '14 at 20:08

This might be more useful if you want your index shuffled.

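    import random

    def shuffle(df):
        index = list(df.index)
        random.shuffle(index)  # shuffle the row labels in place
        df = df.loc[index]     # .ix was removed from pandas; .loc selects by label
        # reset_index returns a new frame, so its result has to be returned
        return df.reset_index(drop=True)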

It selects a new df using the shuffled index, then resets the index.

+2
Aug 14 '14 at 23:48

A simple solution in pandas is to use the sample method on each column independently. Use apply to iterate over each column:

    df = pd.DataFrame({'a':[1,2,3,4,5,6], 'b':[1,2,3,4,5,6]})
    df

       a  b
    0  1  1
    1  2  2
    2  3  3
    3  4  4
    4  5  5
    5  6  6

    df.apply(lambda x: x.sample(frac=1).values)

       a  b
    0  4  2
    1  1  6
    2  6  5
    3  5  3
    4  2  4
    5  3  1

You must use .values so that you return a numpy array and not a Series. Otherwise the returned Series is aligned with the original DataFrame's index and nothing changes:

    df.apply(lambda x: x.sample(frac=1))

       a  b
    0  1  1
    1  2  2
    2  3  3
    3  4  4
    4  5  5
    5  6  6
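
The alignment effect described in this answer can be checked directly. This is a small sketch; the second column is given distinct values (an illustrative change from the answer's data) so that a broken pairing is visible.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6], 'b': [10, 20, 30, 40, 50, 60]})

# Returning a Series: pandas realigns on the index labels,
# so every label still sees its original (a, b) pair.
aligned = df.apply(lambda x: x.sample(frac=1))

# Returning a bare array: values are placed by position,
# so each column is shuffled independently of the other.
shuffled = df.apply(lambda x: x.sample(frac=1).values)

# Check whether the original pairing b == 10 * a survived in each result.
print((aligned['b'] == 10 * aligned['a']).all())
print((shuffled['b'] == 10 * shuffled['a']).all())
```

In the array-based version the pairing between a and b is broken in general, although any single random draw can leave some rows paired by chance.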
+1
Nov 04 '17 at 15:40

Here is a workaround I found if you only want to shuffle a subset of the DataFrame:

    shuffle_to_index = 20
    df = pd.concat([df.iloc[np.random.permutation(range(shuffle_to_index))],
                    df.iloc[shuffle_to_index:]])
0
Jun 23 '16 at 19:28

I know the question is about a pandas df, but when the shuffle happens within each row (column order changed, row order unchanged), the column names no longer matter, and it can make sense to work with an np.array instead. In that case np.apply_along_axis() is what you are looking for.

If that is acceptable, this should help. Note that it is easy to switch the axis along which the data is shuffled.

If your pandas data frame is named df, you could:

  • get the values of the data frame using values = df.values ,
  • create np.array from values
  • apply the method below to shuffle np.array by row or column
  • recreate the new (shuffled) pandas df from the shuffled np.array

Original array

    a = np.array([[10, 11, 12], [20, 21, 22], [30, 31, 32],[40, 41, 42]])
    print(a)
    [[10 11 12]
     [20 21 22]
     [30 31 32]
     [40 41 42]]

Keep row order, shuffle columns in each row

    print(np.apply_along_axis(np.random.permutation, 1, a))
    [[11 12 10]
     [22 21 20]
     [31 30 32]
     [40 41 42]]

Keep column order, shuffle rows in each column

    print(np.apply_along_axis(np.random.permutation, 0, a))
    [[40 41 32]
     [20 31 42]
     [10 11 12]
     [30 21 22]]

The original array does not change

    print(a)
    [[10 11 12]
     [20 21 22]
     [30 31 32]
     [40 41 42]]
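
The four steps listed in this answer (extract the values, shuffle the array, rebuild the frame) can be sketched end to end. This variant swaps in `Generator.permuted` in place of `np.apply_along_axis`, which assumes NumPy 1.20 or later. The column names, the seed, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # illustrative seed
df = pd.DataFrame(
    [[10, 11, 12], [20, 21, 22], [30, 31, 32], [40, 41, 42]],
    columns=['x', 'y', 'z'],
)

values = df.to_numpy()

# axis=1: shuffle within each row independently (row order kept).
within_rows = rng.permuted(values, axis=1)
# axis=0: shuffle within each column independently (column order kept).
within_cols = rng.permuted(values, axis=0)

# Rebuild DataFrames with the original labels.
df_rows = pd.DataFrame(within_rows, index=df.index, columns=df.columns)
df_cols = pd.DataFrame(within_cols, index=df.index, columns=df.columns)

print(df_rows)
print(df_cols)
```

Without an `out` argument, `permuted` returns a new array, so `values` and `df` are left unchanged.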
0
Jun 21 '17 at 21:18

If you want to shuffle only one column (not an index) of a data frame with many columns:

    df['column_name'] = numpy.random.permutation(df.column_name)

0
Aug 31 '17 at 0:39


