How to set all values of an existing Pandas DataFrame to zero?

I currently have an existing Pandas DataFrame with a date index and specifically named columns.

The data cells are filled with various float values.

I would like to copy my DataFrame, but replace all of these values with zero.

The goal is to reuse the DataFrame structure (dimensions, index, column names), but clear all current values, replacing them with zeros.

At the moment I am achieving this with:

df[df > 0] = 0 

However, this will not replace any negative value in the DataFrame.
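For example, with a small made-up frame (the column names and values below are just for illustration), only the positive cells are reset and the negative ones survive:

    import pandas as pd

    df = pd.DataFrame({"a": [1.5, -2.0], "b": [-0.5, 3.0]})
    df[df > 0] = 0
    print(df)
    #      a    b
    # 0  0.0 -0.5
    # 1 -2.0  0.0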

Is there a more general approach to populating an entire existing DataFrame with one common value?

Thank you in advance for your help.

+19
3 answers

The fastest way, which also preserves the dtypes, is this:

    for col in df.columns:
        df[col].values[:] = 0

This writes directly to the NumPy array backing each column. I doubt any other method will be faster, since this allocates no additional memory and does not go through pandas' dtype handling. You can also use np.issubdtype to zero out only the numeric columns. That is probably what you want if you have a mixed-dtype DataFrame, but it is not necessary if the DataFrame is already entirely numeric.

    for col in df.columns:
        if np.issubdtype(df[col].dtype, np.number):
            df[col].values[:] = 0

For small DataFrames, the subtype check is somewhat costly. However, the cost of zeroing a non-numeric column is substantial, so if you are not sure whether your DataFrame is entirely numeric, you should include the issubdtype check.
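To make the point concrete, here is a rough sketch (the tiny frame below is made up) of what the check buys you: writing zeros through .values into a non-numeric column does not raise an error, it just silently overwrites the strings or dates, which you probably do not want; the issubdtype guard skips those columns. Note also that on newer pandas versions with copy-on-write enabled, writing through .values may no longer modify the original DataFrame.

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"x": [1.0, 2.0], "label": ["a", "b"]})  # mixed dtypes

    for col in df.columns:
        if np.issubdtype(df[col].dtype, np.number):
            df[col].values[:] = 0   # only "x" is zeroed; "label" is skipped

    print(df)
    #      x label
    # 0  0.0     a
    # 1  0.0     b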


Timing comparison

Setup

    import pandas as pd
    import numpy as np

    def make_df(n, only_numeric):
        series = [
            pd.Series(range(n), name="int", dtype=int),
            pd.Series(range(n), name="float", dtype=float),
        ]
        if only_numeric:
            series.extend(
                [
                    pd.Series(range(n, 2 * n), name="int2", dtype=int),
                    pd.Series(range(n, 2 * n), name="float2", dtype=float),
                ]
            )
        else:
            series.extend(
                [
                    pd.date_range(start="1970-1-1", freq="T", periods=n, name="dt")
                    .to_series()
                    .reset_index(drop=True),
                    pd.Series(
                        [chr((i % 26) + 65) for i in range(n)],
                        name="string",
                        dtype="object",
                    ),
                ]
            )
        return pd.concat(series, axis=1)

    >>> make_df(5, True)
       int  float  int2  float2
    0    0    0.0     5     5.0
    1    1    1.0     6     6.0
    2    2    2.0     7     7.0
    3    3    3.0     8     8.0
    4    4    4.0     9     9.0
    >>> make_df(5, False)
       int  float                  dt string
    0    0    0.0 1970-01-01 00:00:00      A
    1    1    1.0 1970-01-01 00:01:00      B
    2    2    2.0 1970-01-01 00:02:00      C
    3    3    3.0 1970-01-01 00:03:00      D
    4    4    4.0 1970-01-01 00:04:00      E

Small DataFrame

    n = 10_000

    # Numeric df, no issubdtype check
    %%timeit df = make_df(n, True)
    for col in df.columns:
        df[col].values[:] = 0
    36.1 µs ± 510 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

    # Numeric df, with issubdtype check
    %%timeit df = make_df(n, True)
    for col in df.columns:
        if np.issubdtype(df[col].dtype, np.number):
            df[col].values[:] = 0
    53 µs ± 645 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

    # Non-numeric df, no issubdtype check
    %%timeit df = make_df(n, False)
    for col in df.columns:
        df[col].values[:] = 0
    113 µs ± 391 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

    # Non-numeric df, with issubdtype check
    %%timeit df = make_df(n, False)
    for col in df.columns:
        if np.issubdtype(df[col].dtype, np.number):
            df[col].values[:] = 0
    39.4 µs ± 1.91 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Large DataFrame

    n = 10_000_000

    # Numeric df, no issubdtype check
    %%timeit df = make_df(n, True)
    for col in df.columns:
        df[col].values[:] = 0
    38.7 ms ± 151 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

    # Numeric df, with issubdtype check
    %%timeit df = make_df(n, True)
    for col in df.columns:
        if np.issubdtype(df[col].dtype, np.number):
            df[col].values[:] = 0
    39.1 ms ± 556 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

    # Non-numeric df, no issubdtype check
    %%timeit df = make_df(n, False)
    for col in df.columns:
        df[col].values[:] = 0
    99.5 ms ± 748 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

    # Non-numeric df, with issubdtype check
    %%timeit df = make_df(n, False)
    for col in df.columns:
        if np.issubdtype(df[col].dtype, np.number):
            df[col].values[:] = 0
    17.8 ms ± 228 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

I previously suggested the answer below, but I now consider it harmful: it is much slower than the approaches above, and it is harder to reason about. Its only advantage is that it is nicer to write.

The cleanest-looking way is to use a bare colon to reference the entire DataFrame:

 df[:] = 0 

Unfortunately, the dtype situation is a bit murky, because every column in the resulting DataFrame will end up with the same dtype. If every column of df was originally float, the new dtypes will still be float. But if any column was int or object, it seems that the new dtypes will all become int.
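Since this upcasting behaviour has not been the same across pandas versions, the quickest way to know what you will get is to compare .dtypes before and after on your own data. A minimal sketch, with made-up columns:

    import pandas as pd

    df = pd.DataFrame({"a": [1, 2], "b": [0.5, 1.5]})
    print(df.dtypes)   # a: int64, b: float64

    df[:] = 0
    print(df.dtypes)   # check what your pandas version turned these into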

+33

You can use the replace function:

 df2 = df.replace(df, 0) 
+9

Since you're trying to create a copy, it might be better to simply create a new DataFrame filled with 0, reusing the columns and index from the original DataFrame:

 pd.DataFrame(0, columns=df.columns, index=df.index) 
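One thing worth checking if you go this route (a small sketch, not part of the answer above): a scalar 0 gives integer columns, so if you want to keep float dtypes like the original frame, pass a float fill value or an explicit dtype.

    import pandas as pd

    df = pd.DataFrame({"a": [1.5, 2.5], "b": [3.5, 4.5]})

    zeros_int = pd.DataFrame(0, index=df.index, columns=df.columns)
    print(zeros_int.dtypes)     # both columns come out as int64

    zeros_float = pd.DataFrame(0.0, index=df.index, columns=df.columns)
    print(zeros_float.dtypes)   # both columns come out as float64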
+4

Source: https://habr.com/ru/post/1265116/

