How to remove blanks / NA from a dataframe and push the values up

I have a huge dataframe that contains values and blanks / NaN. I want to remove the blanks from the dataframe and move the following values up within each column. Consider the example dataframe below.

    import pandas as pd
    import numpy as np

    df = pd.DataFrame(np.random.randn(5,4))
    df.iloc[1,2] = np.nan
    df.iloc[0,1] = np.nan
    df.iloc[2,1] = np.nan
    df.iloc[2,0] = np.nan
    df

              0         1         2         3
    0  1.857476       NaN -0.462941 -0.600606
    1  0.000267 -0.540645       NaN  0.492480
    2       NaN       NaN -0.803889  0.527973
    3  0.566922  0.036393 -1.584926  2.278294
    4 -0.243182 -0.221294  1.403478  1.574097

I want the output below:

              0         1         2         3
    0  1.857476 -0.540645 -0.462941 -0.600606
    1  0.000267  0.036393 -0.803889  0.492480
    2  0.566922 -0.221294 -1.584926  0.527973
    3 -0.243182            1.403478  2.278294
    4                                1.574097

I want the NaN values removed and the next value in each column moved up. df.shift did not help. I got the desired result with several loops and if statements, but is there a better way to do this?

3 answers

You can use apply with dropna:

    np.random.seed(100)
    df = pd.DataFrame(np.random.randn(5,4))
    df.iloc[1,2] = np.nan
    df.iloc[0,1] = np.nan
    df.iloc[2,1] = np.nan
    df.iloc[2,0] = np.nan
    print (df)
              0         1         2         3
    0 -1.749765       NaN  1.153036 -0.252436
    1  0.981321  0.514219       NaN -1.070043
    2       NaN       NaN -0.458027  0.435163
    3 -0.583595  0.816847  0.672721 -0.104411
    4 -0.531280  1.029733 -0.438136 -1.118318

    # drop the NaN in each column and rebuild it with a fresh 0..n-1 index
    df1 = df.apply(lambda x: pd.Series(x.dropna().values))
    print (df1)
              0         1         2         3
    0 -1.749765  0.514219  1.153036 -0.252436
    1  0.981321  0.816847 -0.458027 -1.070043
    2 -0.583595  1.029733  0.672721  0.435163
    3 -0.531280       NaN -0.438136 -0.104411
    4       NaN       NaN       NaN -1.118318

Then, if you need to replace the remaining NaN with empty strings, add fillna(''). Note that this creates mixed values (strings alongside numbers), so some functions may break:

    df1 = df.apply(lambda x: pd.Series(x.dropna().values)).fillna('')
    print (df1)
              0         1         2         3
    0  -1.74977  0.514219   1.15304 -0.252436
    1  0.981321  0.816847 -0.458027 -1.070043
    2 -0.583595   1.02973  0.672721  0.435163
    3  -0.53128           -0.438136 -0.104411
    4                               -1.118318
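To make the warning about mixed values concrete, here is a minimal sketch on a small hand-made frame (the column names and values are made up for illustration, not taken from the question). It compares the dtypes before and after fillna('').

```python
import numpy as np
import pandas as pd

# Hypothetical example data: two columns with NaN, one complete column.
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, 5.0, np.nan],
    "c": [7.0, 8.0, 9.0],
})

# Push values up within each column; shorter columns are padded with NaN.
pushed = df.apply(lambda x: pd.Series(x.dropna().values))

# Replacing NaN with '' forces the affected columns to object dtype.
blanked = pushed.fillna("")

print(pushed.dtypes)      # numeric columns
print(blanked.dtypes)     # columns that held NaN are now object
print(pushed["b"].sum())  # NaN is skipped, so numeric reductions still work
```

If the blanks are only needed for display, it may be simpler to keep the NaN version for computation and call fillna('') just before printing.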

A numpy approach
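The idea is to sort each column on np.isnan so that the np.nan values are placed last. I use kind='mergesort' because it is a stable sort, which preserves the original order of the non-NaN values. Finally, I slice the array and reassign it. I follow this with fillna.

    v = df.values
    i = np.arange(v.shape[1])  # column positions
    # stable argsort of the NaN mask down each column: non-NaN rows first, NaN rows last
    a = np.isnan(v).argsort(0, kind='mergesort')
    # write the reordered values back; this relies on v being a writable view of df's data
    v[:] = v[a, i]

    print(df.fillna(''))
              0         1         2         3
    0   1.85748 -0.540645 -0.462941 -0.600606
    1  0.000267  0.036393 -0.803889  0.492480
    2  0.566922 -0.221294  -1.58493  0.527973
    3 -0.243182             1.40348  2.278294
    4                                1.574097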

If you do not want to change the dataframe in place:

    v = df.values
    i = np.arange(v.shape[1])
    a = np.isnan(v).argsort(0, kind='mergesort')
    pd.DataFrame(v[a, i], df.index, df.columns).fillna('')
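As a self-contained sketch of the same idea, the non-mutating version can be wrapped in a small helper. The name push_up and the sample values are assumptions made for illustration, and the frame is assumed to be all-float so that np.isnan applies.

```python
import numpy as np
import pandas as pd

def push_up(df):
    """Return a new frame with the NaN in each column moved to the bottom."""
    v = df.to_numpy()
    cols = np.arange(v.shape[1])
    # Stable sort of the boolean mask: False (real values) sorts before
    # True (NaN), and the original order is kept inside each group.
    order = np.isnan(v).argsort(axis=0, kind="mergesort")
    # Fancy indexing returns a copy, so the input frame is left untouched.
    return pd.DataFrame(v[order, cols], index=df.index, columns=df.columns)

# Hypothetical data; 2.0 precedes 1.0 on purpose to show that the values
# themselves are not sorted, only the NaN are moved.
df = pd.DataFrame({
    "a": [np.nan, 2.0, 1.0],
    "b": [4.0, np.nan, 3.0],
})
result = push_up(df)
print(result)
```

The stable sort matters here: with an unstable sort kind, the relative order of the non-NaN values in a column would not be guaranteed.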

The point of this approach is to take advantage of numpy's speed.

Naive timing test (image not preserved)


Adding to piRSquared's solution: this version shifts all values to the left instead of up.
If not all values are numbers, use pd.isnull.

    v = df.values
    a = [[n]*v.shape[1] for n in range(v.shape[0])]
    b = pd.isnull(v).argsort(axis=1, kind = 'mergesort')
    # a is a matrix used to reference the row index,
    # b is a matrix used to reference the column index
    # taking an entry from a and the respective entry from b (same index),
    # we have a position that references an entry in v
    v[a, b]

A little explanation:

a is a list of length v.shape[0], and it looks something like this:

    [[0, 0, 0, 0],
     [1, 1, 1, 1],
     [2, 2, 2, 2],
     [3, 3, 3, 3],
     [4, 4, 4, 4],
     ...

What happens here is that v is m x n, and I made both a and b m x n as well. For each position i,j, the result takes the element of v whose row is the value at i,j in a and whose column is the value at i,j in b. Therefore, if a and b both looked like the matrix above, v[a,b] would return a matrix in which the first row contains n copies of v[0][0], the second row contains n copies of v[1][1], and so on.
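The indexing rule described above can be checked on a tiny array. This is only an illustrative sketch with made-up values, where a and b are both the "row number repeated" matrix:

```python
import numpy as np

v = np.array([[10, 11, 12],
              [20, 21, 22],
              [30, 31, 32]])

# Row indices and column indices, both 3 x 3.
a = np.array([[0, 0, 0],
              [1, 1, 1],
              [2, 2, 2]])
b = a.copy()

# Position (i, j) of the result is v[a[i, j], b[i, j]].
picked = v[a, b]
print(picked)
# [[10 10 10]
#  [21 21 21]
#  [32 32 32]]
```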

In piRSquared's solution, i is a flat list rather than a matrix, so it is broadcast and reused v.shape[0] times, i.e. once for each row. Similarly, we could do:

    a = [[n] for n in range(v.shape[0])]
    # which looks like
    # [[0],[1],[2],[3]...]
    # since we are trying to indicate the row indices of the matrix v as opposed to
    # [0, 1, 2, 3, ...] which refers to column indices
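Putting the broadcast form together, a left-shifting helper might look like the sketch below. The name push_left and the sample values are hypothetical, and np.arange(...)[:, None] is used as the array equivalent of the [[0],[1],[2],...] list.

```python
import numpy as np
import pandas as pd

def push_left(df):
    """Return a new frame with the missing values in each row moved to the right."""
    v = df.to_numpy()
    # Column vector of row numbers, shape (m, 1); it broadcasts against
    # the (m, n) matrix of column positions below.
    rows = np.arange(v.shape[0])[:, None]
    # Stable sort of the missing-value mask along each row.
    order = pd.isnull(v).argsort(axis=1, kind="mergesort")
    return pd.DataFrame(v[rows, order], index=df.index, columns=df.columns)

# Hypothetical data.
df = pd.DataFrame([[np.nan, 1.0, 2.0],
                   [3.0, np.nan, 4.0],
                   [np.nan, np.nan, 5.0]])
shifted = push_left(df)
print(shifted)
```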

Let me know if anything is unclear. Thanks :)


Source: https://habr.com/ru/post/1266130/

