Pandas DataFrame - how to group and label rows

Question

I have a large dataset from which I want to extract two columns, which I managed to do using the code below:

import pandas as pd
import numpy as np
import os


pickupfile = 'pickuplist.xls'

path = os.chdir('some path')
files = os.listdir(path)
files_xls = [f for f in files if f[-3:] == 'xls']

df = pd.DataFrame()
pl = pd.ExcelFile(pickupfile)
pickuplist = pd.read_excel(pl)

df = [pd.read_excel(f, 'Sheet1')[['Exp. m/z','Intensity']] for f in files_xls]

plistcollect = pd.concat(df, keys=files_xls)\
                 .reset_index(level=1, drop=True)\
                 .rename_axis('Tag')\
                 .reset_index()

Each file in the pk list folder contains 10 columns, and the code above pulls two of them from each file into the plistcollect dataframe. The downside is that each iteration appends the new file's data below the previous data. The data looks like this:

Number    Exp. m/z    Intensity
1         1013.33     1000
2         1257.52     2000

and so on. After the next file is added it becomes:

Number    Exp. m/z    Intensity
1         1013.33     1000
2         1257.52     2000
3         1013.35     3000
4         1257.61     4000

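Rows 1 ~ 2 come from one file, rows 3 ~ 4 from another, and so on (the row count differs per file, e.g. file 1 has 400 rows, file 2 has 501). Once everything is in plistcollect I can no longer tell which file a row came from. Is there a way to label the rows by file as they are added to the plistcollect DataFrame?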

After building plistcollect, I match it against the pickup list:

ppm = 150

matches = pd.DataFrame(index=pickuplist['mass'], columns=plistcollect.set_index(list(plistcollect.columns)).index, dtype=bool)

for index, findex, exp_mass, intensity in plistcollect.itertuples():
    matches[findex, exp_mass] = abs(matches.index - exp_mass) / matches.index < ppm / 1e6


results = {i: list(s.index[s]) for i, s in matches.iterrows()}
results2 = {key for key, value in matches.any().items() if value}
results3 = matches.any().reset_index()[matches.any().values]

This keeps the Exp. m/z values from plistcollect that fall within the ppm tolerance (150 ppm) of the pickup list. I then do the binning with np.digitize:

bins = np.arange(900, 3000, 1)

groups = results3.groupby(np.digitize(results3['Exp. m/z'], bins))


stdev = groups['Intensity'].std()
average = groups['Intensity'].mean()
CV = stdev/average*100



resulttable = pd.concat([groups['Exp. m/z'].mean(),average,CV], axis=1)


resulttable.columns.values[1] = 'Average'
resulttable.columns.values[2] = 'CV'


resulttable.to_excel('test.xls', index=False)

This gives a table like the following:

Exp. m/z    Average     CV
1013.32693  582361.5354 13.49241757
1257.435414 494927.0904 12.45206038
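
The binning step above can be illustrated on a tiny in-memory table. This is only a sketch: the four sample rows are made up, and only the column names and bin edges are taken from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical sample values; only the column names come from the question.
results = pd.DataFrame({
    'Exp. m/z': [1013.33, 1257.52, 1013.35, 1257.61],
    'Intensity': [1000, 2000, 3000, 4000],
})

# Bin edges 900, 901, ..., 2999, as in the question.
bins = np.arange(900, 3000, 1)

# For each value x, np.digitize returns the index i with bins[i-1] <= x < bins[i].
bin_ids = np.digitize(results['Exp. m/z'], bins)

groups = results.groupby(bin_ids)['Intensity']
summary = pd.DataFrame({
    'Average': groups.mean(),
    'CV': groups.std() / groups.mean() * 100,  # std() uses ddof=1 by default
})
print(summary)
```

Because the bins have a fixed width of 1, two values that straddle an integer edge (for example 1013.99 and 1014.01) end up in different groups even though they are close together.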

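This works, but the statistics are computed over all of plistcollect at once, regardless of which file each row came from. I would like to handle each file separately. For example, normalizing within each file, 1013.33 would give 1000/(1000 + 2000), and 1013.35 would give 3000/(3000 + 4000).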
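
The per-file arithmetic 1000/(1000 + 2000) and 3000/(3000 + 4000) can be sketched with groupby and transform. This assumes each row already carries its file label in a Tag column; the column name Fraction and the sample file names are hypothetical.

```python
import pandas as pd

# Hypothetical sample data; 'Tag' holds the source file of each row.
plistcollect = pd.DataFrame({
    'Tag': ['test1.xls', 'test1.xls', 'test2.xls', 'test2.xls'],
    'Exp. m/z': [1013.33, 1257.52, 1013.35, 1257.61],
    'Intensity': [1000, 2000, 3000, 4000],
})

# transform('sum') returns the per-file total aligned with the original rows.
per_file_total = plistcollect.groupby('Tag')['Intensity'].transform('sum')

# 'Fraction' is an illustrative column name: intensity relative to its own file.
plistcollect['Fraction'] = plistcollect['Intensity'] / per_file_total
print(plistcollect)
```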

EDIT:

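Following the answer, the file label is now available as "findex" in the matching loop. It is carried into the results3 dataframe, but it is missing from the final table. I would like the final DataFrame to include the Tag as well. Is there a way to keep/add it?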

filetags = groups['Tag']
resulttable = pd.concat([filetags, groups['Exp. m/z'].mean(), average, CV], axis=1)

Error: cannot concatenate a non-NDFrame object.
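
A likely reading of this error is that groups['Tag'] is a SeriesGroupBy object rather than a Series, so pd.concat refuses it until it has been reduced to one value per group. The sketch below reproduces that on made-up rows; joining the tags with ', ' is just one illustrative reduction, not necessarily the one wanted here.

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows; column names follow the question.
results3 = pd.DataFrame({
    'Tag': ['test1.xls', 'test2.xls', 'test1.xls', 'test2.xls'],
    'Exp. m/z': [1013.33, 1013.35, 1257.52, 1257.61],
    'Intensity': [1000, 3000, 2000, 4000],
})
bins = np.arange(900, 3000, 1)
groups = results3.groupby(np.digitize(results3['Exp. m/z'], bins))

# groups['Tag'] is a SeriesGroupBy, which pd.concat rejects with a TypeError.
try:
    pd.concat([groups['Tag'], groups['Intensity'].mean()], axis=1)
    concat_failed = False
except TypeError:
    concat_failed = True

# Reducing each group to a single value first yields ordinary Series.
tags = groups['Tag'].agg(', '.join)
table = pd.concat([tags, groups['Exp. m/z'].mean(), groups['Intensity'].mean()], axis=1)
print(table)
```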

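Edit2: pickuplist.xls contains a "mass" column, which is matched against Exp. m/z. An m/z value is picked up when it lies within the ppm tolerance of 150, i.e. an Exp. m/z value matches when abs(mass - mass_from_file) / mass * 1000000 is below 150. The masses come from pickuplist.xls.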

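Sorry if this is unclear, I am new to Stack Overflow. The goal is to pick out of plistcollect the Exp. m/z values that lie within 150 ppm of a "mass" value.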
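
The ppm test described above can be written as a small vectorized helper. This is a sketch only: the function name within_ppm and the sample values are hypothetical, and it uses NumPy broadcasting instead of the column-by-column loop in the question.

```python
import numpy as np

def within_ppm(targets, observed, ppm=150):
    """Boolean matrix: rows are target masses, columns are observed m/z values."""
    targets = np.asarray(targets, dtype=float)[:, None]    # shape (n_targets, 1)
    observed = np.asarray(observed, dtype=float)[None, :]  # shape (1, n_observed)
    # Relative deviation from the target mass, expressed in parts per million.
    return np.abs(targets - observed) / targets * 1e6 < ppm

# Hypothetical target masses and observed values.
mask = within_ppm([1013.33, 1257.52], [1013.35, 1013.60, 1257.61])
print(mask)
```

Here 1013.35 is about 20 ppm from 1013.33 and matches, while 1013.60 is about 266 ppm away and does not.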

python pandas

Bong Kyo Seo, 16 Jun '17 at 8:20

jezrael · Accepted Answer · 2017-06-16T08:39:18+0000

I think you need the keys parameter in concat. First collect the DataFrames in a list:

dfs = []
for f in files_xls:
    # read only the two columns of interest from each file
    data = pd.read_excel(f, 'Sheet1')[['Exp. m/z','Intensity']]
    dfs.append(data)

or, as a list comprehension:

dfs = [pd.read_excel(f, 'Sheet1')[['Exp. m/z','Intensity']] for f in files_xls]

plistcollect = pd.concat(dfs, keys=files_xls) \
                 .reset_index(level=1, drop=True) \
                 .rename_axis('Tag') \
                 .reset_index()
print (plistcollect)
         Tag  Exp. m/z  Intensity
0  test1.xls   1013.33       1000
1  test1.xls   1257.52       2000
2  test2.xls   1013.35       3000
3  test2.xls   1257.61       4000
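
For comparison, the same labelling can also be done with DataFrame.assign before concatenating. This is an alternative to keys, not what the code above does, and the in-memory frames below are hypothetical stand-ins for the Excel files.

```python
import pandas as pd

# Hypothetical stand-ins for the per-file DataFrames read from Excel.
frames = {
    'test1.xls': pd.DataFrame({'Exp. m/z': [1013.33, 1257.52], 'Intensity': [1000, 2000]}),
    'test2.xls': pd.DataFrame({'Exp. m/z': [1013.35, 1257.61], 'Intensity': [3000, 4000]}),
}

# assign adds a constant 'Tag' column to each frame before they are stacked.
tagged = pd.concat(
    [df.assign(Tag=name) for name, df in frames.items()],
    ignore_index=True,
)
print(tagged)
```

With this variant the Tag column ends up last rather than first, and no index reshuffling is needed afterwards.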

EDIT:

The Tag column is kept in results3. To keep it in the final table, group by both the np.digitize bins and Tag:

print (plist)
         Tag  Exp. m/z  Intensity
0  test1.xls      1000       2000
1  test1.xls      1000       1500
2  test1.xls      2000       3000
3  test2.xls      3000       4000
4  test2.xls      4000       5000
5  test2.xls      4000       5500

pickup = pd.DataFrame({'mass':[1000,1200,1300, 4000]})
print (pickup)
   mass
0  1000
1  1200
2  1300
3  4000

matches = pd.DataFrame(index=pickup['mass'], 
                       columns = plist.set_index(list(plist.columns)).index, 
                       dtype=bool)

ppm = 150
for index, tags, exp_mass, intensity in plist.itertuples():
    matches[(tags, exp_mass)] = abs(matches.index - exp_mass) / matches.index < ppm / 1e6

print (matches)
Tag       test1.xls               test2.xls              
Exp. m/z       1000          2000      3000   4000       
Intensity      2000   1500   3000      4000   5000   5500
mass                                                     
1000           True   True  False     False  False  False
1200          False  False  False     False  False  False
1300          False  False  False     False  False  False
4000          False  False  False     False   True   True

results3 = matches.any().reset_index(name='a')[matches.any().values]
print (results3)
         Tag  Exp. m/z  Intensity     a
0  test1.xls      1000       2000  True
1  test1.xls      1000       1500  True
4  test2.xls      4000       5000  True
5  test2.xls      4000       5500  True

bins = np.arange(900, 3000, 1)
groups = results3.groupby([np.digitize(results3['Exp. m/z'], bins), 'Tag'])

resulttable = groups.agg({'Intensity':['mean','std'], 'Exp. m/z': 'mean'})
resulttable.columns = resulttable.columns.map('_'.join)
resulttable['CV'] = resulttable['Intensity_std'] / resulttable['Intensity_mean'] * 100
d = {'Intensity_mean':'Average','Exp. m/z_mean':'Exp. m/z'}
resulttable = resulttable.reset_index().rename(columns=d) \
                          .drop(['Intensity_std', 'level_0'],axis=1)
print (resulttable)
         Tag  Average  Exp. m/z         CV
0  test1.xls     1750      1000  20.203051
1  test2.xls     5250      4000   6.734350
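
The aggregation step can also be sketched with named aggregation, which avoids joining the MultiIndex column names afterwards. This assumes a pandas version that supports named aggregation (0.25 or later); the helper column names bin and Std are hypothetical, and the sample rows mirror the matched rows shown above.

```python
import numpy as np
import pandas as pd

# Sample rows mirroring the matched rows in results3 above.
results3 = pd.DataFrame({
    'Tag': ['test1.xls', 'test1.xls', 'test2.xls', 'test2.xls'],
    'Exp. m/z': [1000, 1000, 4000, 4000],
    'Intensity': [2000, 1500, 5000, 5500],
})

bins = np.arange(900, 3000, 1)
# 'bin' is an illustrative name for the np.digitize result.
bin_ids = pd.Series(np.digitize(results3['Exp. m/z'], bins),
                    index=results3.index, name='bin')

resulttable = (
    results3.groupby([bin_ids, 'Tag'])
    .agg(**{
        'Exp. m/z': ('Exp. m/z', 'mean'),
        'Average': ('Intensity', 'mean'),
        'Std': ('Intensity', 'std'),
    })
    .reset_index()
)
resulttable['CV'] = resulttable['Std'] / resulttable['Average'] * 100
resulttable = resulttable.drop(columns=['bin', 'Std'])
print(resulttable)
```

Note that with these bin edges every value at or above 2999 falls into the same overflow bin, so grouping by Tag as well is what keeps the 4000 rows of different files apart.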
