Python: converting numeric data in a pandas DataFrame to float in the presence of strings

I have a pandas DataFrame with a "cap" column. This column mostly consists of floats, but has a few strings in it, for example at index 1.

    df =
         cap
    0    5.2
    1    na
    2    2.2
    3    7.6
    4    7.5
    5    3.0
    ...

I import my data from the csv file as follows:

 df = DataFrame(pd.read_csv(myfile.file)) 

Unfortunately, when I do this, the "cap" column is imported entirely as strings. I would like the floats to be identified as floats and the strings as strings. Trying to convert it with:

 df['cap'] = df['cap'].astype(float) 

causes an error:

 could not convert string to float: na 

Is there a way to turn all the numbers into floats, but keep "na" as a string?

+9
python pandas dataframe
Nov 08 '13 at
4 answers

Here is a possible workaround.

First, define a function that converts a value to float only when that is possible:

    def to_number(s):
        try:
            s1 = float(s)
            return s1
        except ValueError:
            return s

and then you apply it row by row.




Example:

given

    df
       0
    0  a
    1  2

where both a and 2 are strings, we do the conversion through

    converted = df.apply(lambda f: to_number(f[0]), axis=1)
    converted
    0    a
    1    2

Checking the types directly:

    type(converted.iloc[0])
    str
    type(converted.iloc[1])
    float
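The same idea can be tried end to end on a single column. This is only a minimal sketch: the sample data is hypothetical (it mirrors the question's "cap" column), and it uses `Series.map` instead of a row-wise `apply`, which is an alternative to the call shown above rather than the answer's own method.

```python
import pandas as pd


def to_number(s):
    """Return float(s) if s parses as a number, otherwise return s unchanged."""
    try:
        return float(s)
    except ValueError:
        return s


# Hypothetical sample data: every value starts out as a string.
df = pd.DataFrame({"cap": ["5.2", "na", "2.2", "7.6"]})

# Series.map calls to_number on each element of the column.
converted = df["cap"].map(to_number)

# The result mixes Python floats and strings in one column.
print(converted.tolist())
print([type(v).__name__ for v in converted])
```

Because the column still holds a mix of floats and strings, numeric operations such as `sum()` will not work on it directly.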
+1
Nov 08 '13 at 16:51

Calculations on columns of dtype float64 (rather than object) are much more efficient, so this is usually preferred... it will also allow you to do other calculations. Because of this, it is recommended that you use NaN for missing data (rather than your own placeholder or None).

Is this really the answer you want?

    In [11]: df.sum()  # all strings
    Out[11]:
    cap    5.2na2.27.67.53.0
    dtype: object

    In [12]: df.apply(lambda f: to_number(f[0]), axis=1).sum()  # floats and 'na' strings
    TypeError: unsupported operand type(s) for +: 'float' and 'str'

You should use convert_numeric to coerce to floats:

    In [21]: df.convert_objects(convert_numeric=True)
    Out[21]:
       cap
    0  5.2
    1  NaN
    2  2.2
    3  7.6
    4  7.5
    5  3.0
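Note that `convert_objects` was deprecated and later removed from pandas (it is not available in pandas 1.0 and newer). On a recent version, the same coercion can be sketched with `pd.to_numeric` and `errors='coerce'`. The sample data below is hypothetical and simply mirrors the question's column.

```python
import pandas as pd

# Hypothetical sample data: the "cap" column as it arrives, all strings.
df = pd.DataFrame({"cap": ["5.2", "na", "2.2", "7.6", "7.5", "3.0"]})

# errors="coerce" turns anything that cannot be parsed as a number into NaN.
df["cap"] = pd.to_numeric(df["cap"], errors="coerce")

print(df["cap"].dtype)         # numeric dtype instead of object
print(df["cap"].isna().sum())  # how many values became NaN
print(df["cap"].sum())         # NaN is skipped by default
```

With the column stored as floats, `sum()` and similar reductions skip the NaN entry instead of raising.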

Or read it in directly from the CSV, adding "na" to the list of values that should be treated as NaN:

    In [22]: pd.read_csv(myfile.file, na_values=['na'])
    Out[22]:
       cap
    0  5.2
    1  NaN
    2  2.2
    3  7.6
    4  7.5
    5  3.0
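To try the `na_values` approach without a file on disk, the CSV can be fed from an in-memory string. This is a small sketch under that assumption: `csv_text` is hypothetical sample content standing in for the real file.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for the real file.
csv_text = "cap\n5.2\nna\n2.2\n7.6\n7.5\n3.0\n"

# na_values adds "na" to the strings that read_csv parses as NaN.
df = pd.read_csv(io.StringIO(csv_text), na_values=["na"])

print(df["cap"].dtype)
print(df["cap"].isna().sum())
```

Handling the placeholder at read time means the column never passes through an object dtype at all.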

In any case, the sum (and many other pandas functions) will now work:

    In [23]: df.sum()
    Out[23]:
    cap    25.5
    dtype: float64

As Jeff advises:

repeat 3 times fast: object == bad, float == good

+18
Nov 08 '13 at 18:40

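I tried an alternative to the above:

    import numpy as np

    for num, item in enumerate(data['col']):
        try:
            float(item)
        except (TypeError, ValueError):
            # not a number: replace it with NaN
            data.loc[data.index[num], 'col'] = np.nan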
0
May 4 '14 at 10:03

First of all, the way you import the CSV is redundant. Instead of:

 df = DataFrame(pd.read_csv(myfile.file)) 

you can read it directly:

 df = pd.read_csv(myfile.file) 

Then, to convert to float and turn everything that is not a number into NaN:

    df['cap'] = pd.to_numeric(df['cap'], errors='coerce')
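`pd.to_numeric` works on one-dimensional input such as a Series or a list, so for a whole DataFrame it has to be applied column by column. A minimal sketch, assuming every column should be coerced; the "vol" column and the sample values are hypothetical.

```python
import pandas as pd

# Hypothetical frame with two string columns, each containing one bad value.
df = pd.DataFrame({"cap": ["5.2", "na", "2.2"], "vol": ["1", "2", "x"]})

# DataFrame.apply calls pd.to_numeric once per column and forwards
# errors="coerce", so unparseable entries become NaN in every column.
numeric = df.apply(pd.to_numeric, errors="coerce")

print(numeric.dtypes)
print(numeric.isna().sum())
```

If only some columns should be converted, the same call can be restricted to a subset, for example `df[["cap"]].apply(pd.to_numeric, errors="coerce")`.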
0
Jan 30 '18 at 4:48


