How to deal with modifying a large pandas dataframe

I have a pretty big pandas dataframe (a 1.5 GB CSV on disk). I can load it into memory and query it. I want to create a new column that combines the values of two other columns, and I tried this:

def combined(row):
    row['combined'] = row['col1'].join(str(row['col2']))
    return row

df = df.apply(combined, axis=1)

This causes my Python process to be killed, presumably due to memory issues.

A more iterative solution to the problem is as follows:

df['combined'] = ''
col_pos = list(df.columns).index('combined')
crs_pos = list(df.columns).index('col1')
sub_pos = list(df.columns).index('col2')

for row_pos in range(len(df)):
    df.iloc[row_pos, col_pos] = df.iloc[row_pos, sub_pos].join(str(df.iloc[row_pos, crs_pos]))

This, of course, feels very unpleasant. And it is very slow.

My next thought was something like apply_chunk(), which would work like apply but only on a piece of the dataframe at a time. I also found dask, which provides dask dataframes, but I am not sure it is the right tool here. Is there a better, faster, more pandas-idiomatic way to solve this?
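For reference, the apply_chunk() idea can be approximated with pandas' own chunked CSV reader. A minimal sketch, assuming hypothetical file names (the question only says the CSV is 1.5 GB) and an arbitrary chunk size:

import pandas as pd

def combined(row):
    row['combined'] = row['col1'].join(str(row['col2']))
    return row

# Stream the file in bounded pieces; each chunk is an ordinary DataFrame.
reader = pd.read_csv('big.csv', chunksize=100_000)

with open('big_combined.csv', 'w') as out:
    for i, chunk in enumerate(reader):
        # The same row-wise apply as above, but memory is capped at one chunk.
        chunk = chunk.apply(combined, axis=1)
        chunk.to_csv(out, header=(i == 0), index=False)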


You can use a plain list comprehension plus zip (on Python 2, itertools.izip is the lazy equivalent):

import pandas as pd

df = pd.DataFrame({
    'a': ['ab'] * 200,
    'b': ['ffff'] * 200,
})


[a.join(b) for (a, b) in zip(df.a, df.b)]

"unpandas", pandas, , .str, , "".

To put the result back into the frame as a new column:

df['c'] = [a.join(b) for (a, b) in zip(df.a, df.b)]

If you only need to process part of the frame, you can slice first:

[a.join(b) for (a, b) in zip(df.a[10:20], df.b[10:20])]

Note that zip iterates lazily (as itertools.izip did on Python 2), so nothing beyond the result list is materialized and memory stays bounded.
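To see the difference against row-wise apply, a minimal timing sketch (the sizes and names are illustrative, not from the original answer):

import time
import pandas as pd

df = pd.DataFrame({'a': ['ab'] * 100_000, 'b': ['ffff'] * 100_000})

t0 = time.perf_counter()
via_apply = df.apply(lambda row: row['a'].join(row['b']), axis=1)
t1 = time.perf_counter()
via_comp = [a.join(b) for a, b in zip(df.a, df.b)]
t2 = time.perf_counter()

# The comprehension avoids building a Series object per row,
# so it is typically several times faster.
print(f"apply: {t1 - t0:.3f}s  comprehension: {t2 - t1:.3f}s")
assert list(via_apply) == via_comp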


Both pandas and dask.dataframe support the .assign method:

In [1]: import pandas as pd

In [2]: df = pd.DataFrame({'x': [1, 2, 3, 4], 'y': ['a', 'b', 'a', 'b']})

In [3]: df
Out[3]: 
   x  y
0  1  a
1  2  b
2  3  a
3  4  b

In [4]: df.assign(z=df.x * df.y)
Out[4]: 
   x  y     z
0  1  a     a
1  2  b    bb
2  3  a   aaa
3  4  b  bbbb

This works the same in memory (pandas) and out of core (dask.dataframe), as long as the computation can be written as an expression over whole columns rather than arbitrary Python code run per row. Your apply and iloc versions are slow precisely because they drop into Python for every single row; Pandas is fast when it operates on entire columns at once.
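Applied to the out-of-core case from the question, the same style would look roughly like this (a sketch with hypothetical file names, using plain concatenation in place of the question's join, which has no columnar equivalent):

import dask.dataframe as dd

# Partitions are read lazily, so the 1.5 GB CSV never has to fit
# in memory at once.
df = dd.read_csv('big.csv')
df = df.assign(combined=df.col1 + df.col2.astype(str))

# One output file per partition; dask fills in the '*'.
df.to_csv('big_combined-*.csv', index=False)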


Source: https://habr.com/ru/post/1599171/

