Pandas data processing

I have a CSV file with the lines:

ID,98.4,100M,55M,65M,75M,100M,75M,65M,100M,98M,100M,100M,92M,0#,0N#, 

I can read it with

 #!/usr/bin/env python
 import pandas as pd
 import sys

 filename = sys.argv[1]
 df = pd.read_csv(filename)

Given a specific column, I would like to split the rows by identifier, and then output the mean and standard deviation for each identifier.

My first problem is how to remove all non-numeric parts from values such as "100M" and "0N#", which should become 100 and 0 respectively.

I also tried looping through the appropriate headers and using

 df[header].replace(regex=True, inplace=True, to_replace=r'\D', value=r'')

as suggested in "Pandas DataFrame: remove unwanted parts from the rows in a column".

However, this changes 98.4 to 984.
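A minimal sketch reproducing the problem: \D matches any non-digit character, so it strips the decimal point as well as the letter suffixes.

```python
import pandas as pd

# a small series mimicking one column of the file
s = pd.Series(['98.4', '100M', '0N#'])
cleaned = s.replace(regex=True, to_replace=r'\D', value=r'')
print(cleaned.tolist())  # the decimal point is stripped too: '98.4' becomes '984'
```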

2 answers

Use str.extract:

 In [356]: import io
      ...: import pandas as pd
      ...: t = """ID,98.4,100M,55M,65M,75M,100M,75M,65M,100M,98M,100M,100M,92M,0#,0N#"""
      ...: df = pd.read_csv(io.StringIO(t), header=None)
      ...: df
 Out[356]:
     0     1     2    3    4    5     6    7    8     9   10    11    12   13  14   15
 0  ID  98.4  100M  55M  65M  75M  100M  75M  65M  100M  98M  100M  100M  92M  0#  0N#

 In [357]: for col in df.columns[2:]:
      ...:     df[col] = df[col].str.extract(r'(\d+)').astype(int)
      ...: df
 Out[357]:
     0     1    2   3   4   5    6   7   8    9  10   11   12  13  14  15
 0  ID  98.4  100  55  65  75  100  75  65  100  98  100  100  92   0   0
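A note for anyone running this on a recent pandas: str.extract now returns a DataFrame by default, so pass expand=False to keep the Series shape the loop above relies on. A sketch with a shortened sample row:

```python
import io
import pandas as pd

t = "ID,98.4,100M,55M,0#,0N#"
df = pd.read_csv(io.StringIO(t), header=None)
for col in df.columns[2:]:
    # expand=False returns a Series instead of a one-column DataFrame
    df[col] = df[col].str.extract(r'(\d+)', expand=False).astype(int)
print(df)
```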

If you have floating point numbers, you can use the following regular expression:

 In [379]: t = """ID,98.4,100.50M,55.234M,65M,75M,100M,75M,65M,100M,98M,100M,100M,92M,0#,0N#"""
      ...: df = pd.read_csv(io.StringIO(t), header=None)
      ...: df
 Out[379]:
     0     1        2        3    4    5     6    7    8     9   10    11    12   13  14   15
 0  ID  98.4  100.50M  55.234M  65M  75M  100M  75M  65M  100M  98M  100M  100M  92M  0#  0N#

 In [380]: for col in df.columns[2:]:
      ...:     df[col] = df[col].str.extract(r'(\d+\.?\d+)').astype(float)
      ...: df
 Out[380]:
     0     1      2       3   4   5    6   7   8    9  10   11   12  13  14   15
 0  ID  98.4  100.5  55.234  65  75  100  75  65  100  98  100  100  92 NaN  NaN

So (\d+\.?\d+) captures a group consisting of \d+ (one or more digits), \.? (an optional decimal point), and \d+ (one or more digits after the decimal point). Note that this requires at least two digits, which is why the single-digit values 0# and 0N# come out as NaN above.
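A quick check with the re module shows both the match and the failure mode:

```python
import re

pat = re.compile(r'(\d+\.?\d+)')
print(pat.search('100.50M').group(1))  # the float is captured in full
print(pat.search('0#'))                # no match: at least two digits are required
```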

EDIT

OK, I've edited my regex pattern so that single digits also match:

 In [408]: t = """Name,97.7,0A,0A,65M,0A,100M,5M,75M,100M,90M,90M,99M,90M,0#,0N#"""
      ...: df = pd.read_csv(io.StringIO(t), header=None)
      ...: df
 Out[408]:
      0     1   2   3    4   5     6   7    8     9   10   11   12   13  14   15
 0  Name  97.7  0A  0A  65M  0A  100M  5M  75M  100M  90M  90M  99M  90M  0#  0N#

 In [409]: for col in df.columns[2:]:
      ...:     df[col] = df[col].str.extract(r'(\d+\.*\d*)').astype(float)
      ...: df
 Out[409]:
      0     1  2  3   4  5    6  7   8    9  10  11  12  13  14  15
 0  Name  97.7  0  0  65  0  100  5  75  100  90  90  99  90   0   0
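The difference is the quantifiers: in (\d+\.*\d*) only the leading \d+ is mandatory, so a lone digit now matches too. A small check with plain re:

```python
import re

pat = re.compile(r'(\d+\.*\d*)')
print(pat.search('0A').group(1))    # a single digit now matches
print(pat.search('97.7').group(1))  # floats still work
print(pat.search('100M').group(1))  # suffixed integers still work
```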

My first problem is how to remove all non-numeric parts from numbers, such as "100M" and "0N#", which should be 100 and 0 respectively.

 import re

 df = pd.read_csv(yourfile, header=None)
 df.columns = ['ID'] + list(df.columns)[1:]
 # keep digits and the decimal point, so values like 98.4 are not mangled
 df = df.stack().apply(lambda v: re.sub('[^0-9.]', '', v) if isinstance(v, str) else v).astype(float).unstack()
 df.groupby('ID').agg(['std', 'mean'])

Here .stack() converts the dataframe to a series, .apply() calls the lambda on each value, re.sub() strips the unwanted characters from strings, .astype(float) converts everything to numeric, and .unstack() converts the series back to a dataframe. This works equally well for integers and floating-point numbers.

Given a specific column, I would like to split the rows by identifier, and then output the mean and standard deviation for each identifier.

 # for all columns
 df.groupby('ID').agg(['std', 'mean'])

 # for a specific column
 df.groupby('ID')['<colname>'].agg(['std', 'mean'])

The output is a dataframe with a std and mean column for each input column.

The sample data used in this example:

 from io import StringIO

 s = """\
 1,98.4,100M,55M,65M,75M,100M,75M,65M,100M,98M,100M,100M,92M,0#,0N#,
 1,98.4,100M,55M,65M,75M,100M,75M,65M,100M,98M,100M,100M,92M,0#,0N#,
 2,98.4,100M,55M,65M,75M,100M,75M,65M,100M,98M,100M,100M,92M,0#,0N#,
 2,98.4,100M,55M,65M,75M,100M,75M,65M,100M,98M,100M,100M,92M,0#,0N#,
 """
 yourfile = StringIO(s)
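Putting the pieces together, the whole pipeline can be sketched end to end. This uses shortened versions of the sample rows and string column names (to keep the column index a single type); the std is 0 here only because the rows within each ID are identical.

```python
import re
from io import StringIO
import pandas as pd

s = """\
1,98.4,100M,55M,65M,0#,0N#
1,98.4,100M,55M,65M,0#,0N#
2,98.4,100M,55M,65M,0#,0N#
2,98.4,100M,55M,65M,0#,0N#
"""
df = pd.read_csv(StringIO(s), header=None)
df.columns = ['ID'] + [str(c) for c in df.columns[1:]]
# strip everything except digits and the decimal point from string cells
df = (df.stack()
        .apply(lambda v: re.sub('[^0-9.]', '', v) if isinstance(v, str) else v)
        .astype(float)
        .unstack())
result = df.groupby('ID').agg(['std', 'mean'])
print(result)
```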

Source: https://habr.com/ru/post/1235864/

