I have a large dataset from which I want to extract two columns, which I managed to do using the code below:
import pandas as pd
import numpy as np
import os
pickupfile = 'pickuplist.xls'
os.chdir('some path')  # os.chdir() returns None, so its result is not assigned
files = os.listdir()  # list the new working directory
files_xls = [f for f in files if f.endswith('xls')]
pl = pd.ExcelFile(pickupfile)
pickuplist = pd.read_excel(pl)
df = [pd.read_excel(f, 'Sheet1')[['Exp. m/z','Intensity']] for f in files_xls]
plistcollect = pd.concat(df, keys=files_xls)\
.reset_index(level=1, drop=True)\
.rename_axis('Tag')\
.reset_index()
Each file in the pk list folder contains 10 columns, and the code above pulls two of them from each file into the plistcollect dataframe. The downside for me is that each iteration appends a file's data below the data from the previous file. The data looks as follows:
Number Exp. m/z Intensity
1 1013.33 1000
2 1257.52 2000
etc., and after the next file is added:
Number Exp. m/z Intensity
1 1013.33 1000
2 1257.52 2000
3 1013.35 3000
4 1257.61 4000
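Rows 1 ~ 2 come from one file, rows 3 ~ 4 from the next, and so on (e.g. file 1 has 400 rows, file 2 has 501 rows, etc.). Is there a way to record, inside the plistcollect DataFrame, which file each row came from?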
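For illustration, here is a minimal sketch of how `pd.concat(..., keys=...)` can label each row with its source file. The file names and the in-memory frames are hypothetical stand-ins for the real `pd.read_excel` results, using the sample values from the tables above.

```python
import pandas as pd

# Hypothetical stand-ins for two files that would normally come from pd.read_excel
frames = {
    'file1.xls': pd.DataFrame({'Exp. m/z': [1013.33, 1257.52], 'Intensity': [1000, 2000]}),
    'file2.xls': pd.DataFrame({'Exp. m/z': [1013.35, 1257.61], 'Intensity': [3000, 4000]}),
}

# keys= adds an outer index level that holds the file name of every row
tagged = (
    pd.concat(list(frames.values()), keys=list(frames.keys()))
    .reset_index(level=1, drop=True)  # drop the per-file row counter
    .rename_axis('Tag')               # name the remaining index level
    .reset_index()                    # turn it into a regular column
)
print(tagged)
```

Each row then carries its file name in the `Tag` column, regardless of how many rows each file contributes.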
Next, using plistcollect, I run the following:
ppm = 150
matches = pd.DataFrame(index=pickuplist['mass'], columns=plistcollect.set_index(list(plistcollect.columns)).index, dtype=bool)
for index, findex, exp_mass, intensity in plistcollect.itertuples():
matches[findex, exp_mass] = abs(matches.index - exp_mass) / matches.index < ppm / 1e6
results = {i: list(s.index[s]) for i, s in matches.iterrows()}
results2 = {key for key, value in matches.any().items() if value}  # Series.iteritems() was removed in pandas 2.0
results3 = matches.any().reset_index()[matches.any().values]
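This picks out the Exp. m/z values from plistcollect that lie within the ppm tolerance (150 ppm) of the pick-up list masses. I then do the binning with np.digitize: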
bins = np.arange(900, 3000, 1)
groups = results3.groupby(np.digitize(results3['Exp. m/z'], bins))
stdev = groups['Intensity'].std()
average = groups['Intensity'].mean()
CV = stdev/average*100
resulttable = pd.concat([groups['Exp. m/z'].mean(),average,CV], axis=1)
resulttable.columns = ['Exp. m/z', 'Average', 'CV']  # assign all labels at once instead of mutating columns.values in place
resulttable.to_excel('test.xls', index=False)
The output looks like this:
Exp. m/z Average CV
1013.32693 582361.5354 13.49241757
1257.435414 494927.0904 12.45206038
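What I would like to do next is normalize the intensities. For every row in plistcollect, the intensity should be divided by the sum of the intensities that come from the same file. For example, for 1013.33 the normalized value would be 1000/(1000 + 2000), and for 1013.35 it would be 3000/(3000 + 4000).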
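The arithmetic in that example can be sketched with a per-file `groupby(...).transform('sum')`. This is only a sketch under assumptions: the `Tag` values and the `Normalized` column name are hypothetical, and the total is taken over all rows of a file, which may not be the intended denominator.

```python
import pandas as pd

# Sample rows from the question; the Tag values are hypothetical file names
plist = pd.DataFrame({
    'Tag': ['file1.xls', 'file1.xls', 'file2.xls', 'file2.xls'],
    'Exp. m/z': [1013.33, 1257.52, 1013.35, 1257.61],
    'Intensity': [1000, 2000, 3000, 4000],
})

# transform('sum') returns the per-file total aligned to every original row
file_total = plist.groupby('Tag')['Intensity'].transform('sum')
plist['Normalized'] = plist['Intensity'] / file_total
print(plist)
```

With these values the first row becomes 1000/(1000 + 2000) and the third row 3000/(3000 + 4000), matching the example.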
EDIT:
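The loop now uses "findex". Starting from the results3 dataframe, I would also like the final DataFrame to show which file the values came from, i.e. the Tag. Is there a way to add/keep it? I tried: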
filetags = groups['Tag']
resulttable = pd.concat([filetags, groups['Exp. m/z'].mean(), average, CV], axis=1)
This fails with an error saying that a non-NDFrame object cannot be concatenated.
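One likely cause is that `groups['Tag']` is a grouped object rather than a Series, so `pd.concat` has nothing it can align. A possible workaround, sketched here with hypothetical sample data, is to aggregate the tags of each bin into a single string first. Joining the sorted unique file names is just one assumption about what the Tag column should hold.

```python
import numpy as np
import pandas as pd

# Hypothetical matched peaks standing in for results3
results3 = pd.DataFrame({
    'Tag': ['file1.xls', 'file2.xls', 'file1.xls', 'file2.xls'],
    'Exp. m/z': [1013.33, 1013.35, 1257.52, 1257.61],
    'Intensity': [1000, 3000, 2000, 4000],
})
bins = np.arange(900, 3000, 1)
groups = results3.groupby(np.digitize(results3['Exp. m/z'], bins))

# Aggregating turns the grouped object into a Series with one value per bin
filetags = groups['Tag'].agg(lambda tags: ', '.join(sorted(set(tags))))
average = groups['Intensity'].mean()
CV = groups['Intensity'].std() / average * 100

resulttable = pd.concat([filetags, groups['Exp. m/z'].mean(), average, CV], axis=1)
resulttable.columns = ['Tag', 'Exp. m/z', 'Average', 'CV']
print(resulttable)
```

Because every piece is now a Series indexed by bin number, the concatenation lines up row by row.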
Edit2:
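pickuplist.xls contains the "mass" values against which the Exp. m/z values are compared. An m/z is picked up (with ppm set to 150) if the Exp. m/z lies within 150 ppm of a mass (abs(mass - mass_from_file)/mass * 1000000 = 150). pickuplist.xls looks like this: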
mass
1013.34
1079.3757
1095.3706
1136.3972
1241.4285
1257.4234
To summarize: from plistcollect I want the Exp. m/z values that fall within 150 ppm of a "mass" in the pick-up list.
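As a small worked example of the ppm criterion, the sketch below compares two masses from the pick-up list with the sample Exp. m/z values from the tables above. It uses plain NumPy broadcasting rather than the loop from the question, and the variable names are illustrative only.

```python
import numpy as np

ppm = 150
masses = np.array([1013.34, 1257.4234])                  # two entries from the pick-up list
exp_mz = np.array([1013.33, 1257.52, 1013.35, 1257.61])  # sample Exp. m/z values

# Deviation of every Exp. m/z (columns) from every mass (rows), expressed in ppm
deviation_ppm = np.abs(masses[:, None] - exp_mz[None, :]) / masses[:, None] * 1e6

# True where the measured value lies inside the tolerance window of that mass
within = deviation_ppm < ppm
print(within)
```

Each row of `within` corresponds to one mass, and each column to one measured value.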