Binarize a float64 Pandas Dataframe in Python

Question

I have a pandas DataFrame in which each column gives the frequency of a word in the corpus. Each row corresponds to a document, and every column has dtype float64.

For example:

word1 word2 word3
0.0   0.3   1.0
0.1   0.0   0.5
etc

I want to binarize this so that, instead of frequencies, I end up with a DataFrame of 0s and 1s indicating whether each word is present.

The example above would therefore be converted to:

word1 word2 word3
0      1     1
1      0     1
etc

I looked at get_dummies(), but it did not give the result I expected.

python pandas dataframe

Snake_a Sep 27 '16 at 23:08

4 answers

Alberto Garcia-Raboso · Answer 1 · 2016-09-27T23:36:02+0000

Casting to boolean gives True for anything that is non-zero and False for any zero entry. If you then cast to integer, you get ones and zeros.

import io
import pandas as pd

data = io.StringIO('''\
word1 word2 word3
0.0   0.3   1.0
0.1   0.0   0.5
''')
# sep=r'\s+' splits on runs of whitespace (delim_whitespace is deprecated in recent pandas)
df = pd.read_csv(data, sep=r'\s+')

res = df.astype(bool).astype(int)
print(res)

Output:

   word1  word2  word3
0      0      1      1
1      1      0      1

piRSquared · Answer 2 · 2016-09-28T00:09:23+0000

An alternative to @Alberto Garcia-Raboso's answer, using np.where:

pd.DataFrame(np.where(df, 1, 0), df.index, df.columns)
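
For reference, a self-contained version of the np.where one-liner might look like the sketch below. The sample frame is copied from the question, and the name res is only illustrative. np.where treats every non-zero entry as true, so it writes 1 there and 0 everywhere else.

```python
import numpy as np
import pandas as pd

# Sample data taken from the question
df = pd.DataFrame({
    "word1": [0.0, 0.1],
    "word2": [0.3, 0.0],
    "word3": [1.0, 0.5],
})

# np.where returns a plain NumPy array, so the index and column
# labels are passed back in to rebuild a DataFrame
res = pd.DataFrame(np.where(df, 1, 0), index=df.index, columns=df.columns)
print(res)
```

Because np.where drops the labels, forgetting to pass df.index and df.columns would leave the result with default integer labels.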

sascha · Answer 3 · 2016-09-27T23:19:33+0000

Code:

import numpy as np
import pandas as pd

# create some test data
random_data = np.random.random([3, 3])
random_data[0,0] = 0.0
random_data[1,2] = 0.0

df = pd.DataFrame(random_data,
     columns=['A', 'B', 'C'], index=['first', 'second', 'third'])

print(df)

# binarize: 1 where the value is strictly positive, 0 otherwise
threshold = lambda x: x > 0
df_ = df.apply(threshold).astype(int)

print(df_)

Output:

               A         B         C
first   0.000000  0.610263  0.301024
second  0.728070  0.229802  0.000000
third   0.243811  0.335131  0.863908
        A  B  C
first   0  1  1
second  1  1  0
third   1  1  1

Explanation:

get_dummies() looks at the unique values in each column and adds one new column per unique value, marking whether a row holds that value. For example, if column A has 20 unique values, 20 new columns are added; in every row exactly one of them is true and the rest are false.
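
To make this concrete, here is a small sketch with a made-up column A holding three distinct values. One assumption worth flagging: get_dummies leaves numeric columns untouched by default, so the column is named explicitly through the columns argument.

```python
import pandas as pd

# Hypothetical column with three distinct values (0.0, 0.3, 1.0)
df = pd.DataFrame({"A": [0.0, 0.3, 0.3, 1.0]})

# Float columns are skipped unless listed in `columns`
dummies = pd.get_dummies(df, columns=["A"])
print(dummies)

# One indicator column per unique value, exactly one set per row
n_indicator_columns = dummies.shape[1]
row_sums = dummies.astype(int).sum(axis=1)
```

This is one-hot encoding of the values themselves, which is why it does not produce the single 0/1 column per word that the question asks for.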

Snake_a · Answer 4 · 2016-10-04T19:55:36+0000

I found an alternative that uses pandas boolean indexing. It can be done simply:

df[df>0] = 1

Just like that!
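
Two things to keep in mind with this approach: the assignment changes the DataFrame in place, and the columns stay float64, so the result holds 1.0 and 0.0 rather than integers. The sketch below, using the sample data from the question and an illustrative name binary, works on a copy and casts at the end; treat it as one possible way to round this off, not the only one.

```python
import pandas as pd

# Sample data taken from the question
df = pd.DataFrame({
    "word1": [0.0, 0.1],
    "word2": [0.3, 0.0],
    "word3": [1.0, 0.5],
})

# Work on a copy so the original frequencies are preserved
binary = df.copy()
binary[binary > 0] = 1

# The masked assignment does not change the dtype
still_float = (binary.dtypes == "float64").all()

# Cast explicitly to get integer 0s and 1s
binary = binary.astype(int)
print(binary)
```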
