Combine two pandas DataFrames, resample on one time column, interpolate

This is my first stackoverflow question. Take it easy on me!

I have two data sets obtained simultaneously by different data acquisition systems with different sampling rates. One of them is sampled very regularly, and the other is not. I would like to create a single DataFrame containing both data sets, using the regularly spaced timestamps (in seconds) as the reference for both. The irregularly sampled data should be interpolated onto the regularly spaced timestamps.

Here is some toy data that demonstrates what I'm trying to do:

    import pandas as pd
    import numpy as np

    # evenly spaced times
    t1 = np.array([0,0.5,1.0,1.5,2.0])
    y1 = t1

    # unevenly spaced times
    t2 = np.array([0,0.34,1.01,1.4,1.6,1.7,2.01])
    y2 = 3*t2

    df1 = pd.DataFrame(data={'y1':y1,'t':t1})
    df2 = pd.DataFrame(data={'y2':y2,'t':t2})

df1 and df2 are as follows:

    df1:
         t   y1
    0  0.0  0.0
    1  0.5  0.5
    2  1.0  1.0
    3  1.5  1.5
    4  2.0  2.0

    df2:
          t    y2
    0  0.00  0.00
    1  0.34  1.02
    2  1.01  3.03
    3  1.40  4.20
    4  1.60  4.80
    5  1.70  5.10
    6  2.01  6.03

I am trying to merge df1 and df2, interpolating y2 onto df1.t. The desired result is:

    df_combined:
         t   y1   y2
    0  0.0  0.0  0.0
    1  0.5  0.5  1.5
    2  1.0  1.0  3.0
    3  1.5  1.5  4.5
    4  2.0  2.0  6.0

I have read the documentation for pandas.resample and looked through previous Stack Overflow questions, but could not find a solution to my particular problem. Any ideas? It seems like it should be easy.

UPDATE: I figured out one possible solution: interpolate the second series first, then add it to the first DataFrame:

    from scipy.interpolate import interp1d

    f2 = interp1d(t2,y2,bounds_error=False)
    df1['y2'] = f2(df1.t)

which gives:

    df1:
         t   y1   y2
    0  0.0  0.0  0.0
    1  0.5  0.5  1.5
    2  1.0  1.0  3.0
    3  1.5  1.5  4.5
    4  2.0  2.0  6.0

This works, but I'm still open to other solutions if there is a better way.
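
For plain linear interpolation, a similar result seems achievable without SciPy. The following is a minimal sketch, assuming NumPy's `np.interp` is an acceptable substitute for `interp1d`. One difference to be aware of: outside the range of `t2`, `np.interp` returns the end values rather than NaN, so it does not behave exactly like `bounds_error=False`. The name `df_combined` is only illustrative.

```python
import numpy as np
import pandas as pd

t1 = np.array([0, 0.5, 1.0, 1.5, 2.0])
t2 = np.array([0, 0.34, 1.01, 1.4, 1.6, 1.7, 2.01])

df1 = pd.DataFrame({'t': t1, 'y1': t1})
df2 = pd.DataFrame({'t': t2, 'y2': 3 * t2})

# np.interp(x, xp, fp) evaluates the piecewise-linear curve through
# the points (xp, fp) at the positions x; xp must be increasing.
df_combined = df1.copy()
df_combined['y2'] = np.interp(
    df1['t'].to_numpy(), df2['t'].to_numpy(), df2['y2'].to_numpy()
)
print(df_combined)
```

Working on a copy leaves `df1` untouched, which may be convenient if the raw data is needed again later.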

2 answers

If you construct a single DataFrame from Series that use the time values as their index, like this:

    >>> t1 = np.array([0, 0.5, 1.0, 1.5, 2.0])
    >>> y1 = pd.Series(t1, index=t1)
    >>> t2 = np.array([0, 0.34, 1.01, 1.4, 1.6, 1.7, 2.01])
    >>> y2 = pd.Series(3*t2, index=t2)
    >>> df = pd.DataFrame({'y1': y1, 'y2': y2})
    >>> df
           y1    y2
    0.00  0.0  0.00
    0.34  NaN  1.02
    0.50  0.5   NaN
    1.00  1.0   NaN
    1.01  NaN  3.03
    1.40  NaN  4.20
    1.50  1.5   NaN
    1.60  NaN  4.80
    1.70  NaN  5.10
    2.00  2.0   NaN
    2.01  NaN  6.03

You can then simply interpolate it and select only the rows where y1 is defined:

    >>> df.interpolate(method='index').reindex(y1.index)
          y1   y2
    0.0  0.0  0.0
    0.5  0.5  1.5
    1.0  1.0  3.0
    1.5  1.5  4.5
    2.0  2.0  6.0
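
The same idea can be applied to the `df1` and `df2` from the question, which keep time in a `t` column instead of the index. The sketch below is one possible way to wrap it up. The helper name `interpolate_onto` is hypothetical, and it assumes that each frame has unique, numeric values in `t`.

```python
import numpy as np
import pandas as pd

def interpolate_onto(df_ref, df_other, on='t'):
    """Interpolate df_other's columns at df_ref's time values (hypothetical helper)."""
    ref = df_ref.set_index(on)
    other = df_other.set_index(on)
    # Sorted union of both time axes, so every sample gets a row.
    all_times = ref.index.union(other.index)
    # method='index' interpolates linearly, using the index values as x.
    filled = other.reindex(all_times).interpolate(method='index')
    # Keep only the reference times and place the columns side by side.
    return ref.join(filled.reindex(ref.index)).reset_index()

t1 = np.array([0, 0.5, 1.0, 1.5, 2.0])
t2 = np.array([0, 0.34, 1.01, 1.4, 1.6, 1.7, 2.01])
df1 = pd.DataFrame({'t': t1, 'y1': t1})
df2 = pd.DataFrame({'t': t2, 'y2': 3 * t2})

df_combined = interpolate_onto(df1, df2)
print(df_combined)
```

Reindexing `df_other` onto the union before interpolating is what makes the reference times available as rows; the final `reindex` then discards the irregular timestamps again.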

It isn't entirely clear to me how you are getting rid of some of the values in y2, but it seems that if there is more than one value for a given point in time, you only want the first one. It also seems that your time values should be in the index. I also added column labels. It looks like this:

    import pandas as pd

    # evenly spaced times
    t1 = [0,0.5,1.0,1.5,2.0]
    y1 = t1

    # unevenly spaced times
    t2 = [0,0.34,1.01,1.4,1.6,1.7,2.01]

    # round t2 values to the nearest half
    new_t2 = [round(num * 2)/2 for num in t2]

    # set y2 values
    y2 = [3*z for z in new_t2]

    # eliminate entries that have the same index value;
    # iterate backwards so that deletions do not shift the remaining positions
    for x in range(len(new_t2) - 1, 0, -1):
        if new_t2[x] == new_t2[x-1]:
            del new_t2[x]
            del y2[x]

    ser1 = pd.Series(y1, index=t1)
    ser2 = pd.Series(y2, index=new_t2)
    df = pd.concat((ser1, ser2), axis=1)
    df.columns = ['Y1', 'Y2']
    print(df)

This prints:

          Y1   Y2
    0.0  0.0  0.0
    0.5  0.5  1.5
    1.0  1.0  3.0
    1.5  1.5  4.5
    2.0  2.0  6.0
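
If repeated index labels need to be dropped, pandas can do this without a manual loop. Here is a small sketch, assuming that keeping the first value for each repeated time is the intended rule, as described in this answer. It uses `Index.duplicated` to build a boolean mask; the name `ser2_unique` is only illustrative.

```python
import pandas as pd

t2 = [0, 0.34, 1.01, 1.4, 1.6, 1.7, 2.01]

# Round each time to the nearest half, as in the answer.
new_t2 = [round(num * 2) / 2 for num in t2]
ser2 = pd.Series([3 * z for z in new_t2], index=new_t2)

# duplicated(keep='first') flags every repeat of a label after its
# first occurrence; negating the mask keeps only the first of each.
ser2_unique = ser2[~ser2.index.duplicated(keep='first')]
print(ser2_unique)
```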

Source: https://habr.com/ru/post/977101/

