Pandas DataFrame: the desired index has duplicate values

This is my first attempt at pandas. I think I have a reasonable use case, but I'm stumbling. I want to load a tab-delimited file into a pandas DataFrame, then group it by Symbol and plot it with the x-axis indexed by the TimeStamp column. Here is a subset of the data:

    Symbol,Price,M1,M2,Volume,TimeStamp
    TBET,2.19,3,8.05,1124179,9:59:14 AM
    FUEL,3.949,9,1.15,109674,9:59:11 AM
    SUNH,4.37,6,0.09,24394,9:59:09 AM
    FUEL,3.9099,8,1.11,105265,9:59:09 AM
    TBET,2.18,2,8.03,1121629,9:59:05 AM
    ORBC,3.4,2,0.22,10509,9:59:02 AM
    FUEL,3.8599,7,1.07,102116,9:58:47 AM
    FUEL,3.8544,6,1.05,100116,9:58:40 AM
    GBR,3.83,4,0.46,64251,9:58:24 AM
    GBR,3.8,3,0.45,63211,9:58:20 AM
    XRA,3.6167,3,0.12,42310,9:58:08 AM
    GBR,3.75,2,0.34,47521,9:57:52 AM
    MPET,1.42,3,0.26,44600,9:57:52 AM

Note two things about the TimeStamp column:

  • it has duplicate values, and
  • the intervals are irregular.

I thought I could do something like this ...

    from pandas import *
    import pylab as plt

    df = read_csv('data.txt', index_col=5)
    df.sort(ascending=False)
    df.plot()
    plt.show()

But the read_csv method throws an exception "I tried columns 1-X as an index, but found duplicates." Is there an option that allows me to specify an index column with duplicate values?

I would also be interested in aligning my irregular timestamps to one-second resolution. I would still like to plot multiple events within a given second, but maybe I could introduce a unique index and then align my prices to it?

1 answer

I just created several issues to address some features and conveniences that I think would be nice to have: GH-856, GH-857, GH-858

We are currently revamping the time series capabilities, and alignment to one-second resolution is now possible (although not with duplicates, so that would require writing some functions). I also want to support duplicate timestamps better. However, this is really panel (3D) data, so one way you might reshape it is:

    In [29]: df.pivot('Symbol', 'TimeStamp').stack()
    Out[29]:
                       M1    M2   Price   Volume
    Symbol TimeStamp
    FUEL   9:58:40 AM   6  1.05  3.8544   100116
           9:58:47 AM   7  1.07  3.8599   102116
           9:59:09 AM   8  1.11  3.9099   105265
           9:59:11 AM   9  1.15  3.9490   109674
    GBR    9:57:52 AM   2  0.34  3.7500    47521
           9:58:20 AM   3  0.45  3.8000    63211
           9:58:24 AM   4  0.46  3.8300    64251
    MPET   9:57:52 AM   3  0.26  1.4200    44600
    ORBC   9:59:02 AM   2  0.22  3.4000    10509
    SUNH   9:59:09 AM   6  0.09  4.3700    24394
    TBET   9:59:05 AM   2  8.03  2.1800  1121629
           9:59:14 AM   3  8.05  2.1900  1124179
    XRA    9:58:08 AM   3  0.12  3.6167    42310

Note that this created a MultiIndex. Another way to get this:

    In [32]: df.set_index(['Symbol', 'TimeStamp'])
    Out[32]:
                        Price  M1    M2   Volume
    Symbol TimeStamp
    TBET   9:59:14 AM  2.1900   3  8.05  1124179
    FUEL   9:59:11 AM  3.9490   9  1.15   109674
    SUNH   9:59:09 AM  4.3700   6  0.09    24394
    FUEL   9:59:09 AM  3.9099   8  1.11   105265
    TBET   9:59:05 AM  2.1800   2  8.03  1121629
    ORBC   9:59:02 AM  3.4000   2  0.22    10509
    FUEL   9:58:47 AM  3.8599   7  1.07   102116
           9:58:40 AM  3.8544   6  1.05   100116
    GBR    9:58:24 AM  3.8300   4  0.46    64251
           9:58:20 AM  3.8000   3  0.45    63211
    XRA    9:58:08 AM  3.6167   3  0.12    42310
    GBR    9:57:52 AM  3.7500   2  0.34    47521
    MPET   9:57:52 AM  1.4200   3  0.26    44600

    In [33]: df.set_index(['Symbol', 'TimeStamp']).sortlevel(0)
    Out[33]:
                        Price  M1    M2   Volume
    Symbol TimeStamp
    FUEL   9:58:40 AM  3.8544   6  1.05   100116
           9:58:47 AM  3.8599   7  1.07   102116
           9:59:09 AM  3.9099   8  1.11   105265
           9:59:11 AM  3.9490   9  1.15   109674
    GBR    9:57:52 AM  3.7500   2  0.34    47521
           9:58:20 AM  3.8000   3  0.45    63211
           9:58:24 AM  3.8300   4  0.46    64251
    MPET   9:57:52 AM  1.4200   3  0.26    44600
    ORBC   9:59:02 AM  3.4000   2  0.22    10509
    SUNH   9:59:09 AM  4.3700   6  0.09    24394
    TBET   9:59:05 AM  2.1800   2  8.03  1121629
           9:59:14 AM  2.1900   3  8.05  1124179
    XRA    9:58:08 AM  3.6167   3  0.12    42310
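For readers on a newer pandas: as far as I know `sortlevel` is no longer available there, and `sort_index` covers the same ground. The following is only a sketch under assumptions: the sample is read as comma-separated (as it is displayed in the question), and the inline `SAMPLE` string is a stand-in for the real file.

```python
import io

import pandas as pd

# Assumption: the question's sample, inlined here instead of reading data.txt.
SAMPLE = """Symbol,Price,M1,M2,Volume,TimeStamp
TBET,2.19,3,8.05,1124179,9:59:14 AM
FUEL,3.949,9,1.15,109674,9:59:11 AM
SUNH,4.37,6,0.09,24394,9:59:09 AM
FUEL,3.9099,8,1.11,105265,9:59:09 AM
TBET,2.18,2,8.03,1121629,9:59:05 AM
ORBC,3.4,2,0.22,10509,9:59:02 AM
FUEL,3.8599,7,1.07,102116,9:58:47 AM
FUEL,3.8544,6,1.05,100116,9:58:40 AM
GBR,3.83,4,0.46,64251,9:58:24 AM
GBR,3.8,3,0.45,63211,9:58:20 AM
XRA,3.6167,3,0.12,42310,9:58:08 AM
GBR,3.75,2,0.34,47521,9:57:52 AM
MPET,1.42,3,0.26,44600,9:57:52 AM
"""

# Load without an index first, so duplicate timestamps are not an issue.
df = pd.read_csv(io.StringIO(SAMPLE))

# (Symbol, TimeStamp) pairs are unique in this sample, so they make a valid key.
# sort_index() orders by Symbol, then by TimeStamp (still compared as strings).
indexed = df.set_index(["Symbol", "TimeStamp"]).sort_index()

# Select every row for one symbol from the first index level.
print(indexed.loc["FUEL"])
```

Loading first and setting the index afterwards sidesteps the `index_col` problem from the question, because the combined key is unique even though TimeStamp alone is not.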

You can get this data in true panel format as follows:

    In [35]: df.set_index(['TimeStamp', 'Symbol']).sortlevel(0).to_panel()
    Out[35]:
    <class 'pandas.core.panel.Panel'>
    Dimensions: 4 (items) x 11 (major) x 7 (minor)
    Items: Price to Volume
    Major axis: 9:57:52 AM to 9:59:14 AM
    Minor axis: FUEL to XRA

    In [36]: panel = df.set_index(['TimeStamp', 'Symbol']).sortlevel(0).to_panel()

    In [37]: panel['Price']
    Out[37]:
    Symbol        FUEL   GBR  MPET  ORBC  SUNH  TBET     XRA
    TimeStamp
    9:57:52 AM     NaN  3.75  1.42   NaN   NaN   NaN     NaN
    9:58:08 AM     NaN   NaN   NaN   NaN   NaN   NaN  3.6167
    9:58:20 AM     NaN  3.80   NaN   NaN   NaN   NaN     NaN
    9:58:24 AM     NaN  3.83   NaN   NaN   NaN   NaN     NaN
    9:58:40 AM  3.8544   NaN   NaN   NaN   NaN   NaN     NaN
    9:58:47 AM  3.8599   NaN   NaN   NaN   NaN   NaN     NaN
    9:59:02 AM     NaN   NaN   NaN   3.4   NaN   NaN     NaN
    9:59:05 AM     NaN   NaN   NaN   NaN   NaN  2.18     NaN
    9:59:09 AM  3.9099   NaN   NaN   NaN  4.37   NaN     NaN
    9:59:11 AM  3.9490   NaN   NaN   NaN   NaN   NaN     NaN
    9:59:14 AM     NaN   NaN   NaN   NaN   NaN  2.19     NaN

You can generate some graphs from this data.
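The `Panel` class has since been removed from pandas, to the best of my knowledge, so the wide price table would have to be built another way on a current version. One possible sketch uses `DataFrame.pivot` with keyword arguments; the inline `SAMPLE` string is again an assumption standing in for the real file.

```python
import io

import pandas as pd

# Assumption: the question's sample, inlined here instead of reading data.txt.
SAMPLE = """Symbol,Price,M1,M2,Volume,TimeStamp
TBET,2.19,3,8.05,1124179,9:59:14 AM
FUEL,3.949,9,1.15,109674,9:59:11 AM
SUNH,4.37,6,0.09,24394,9:59:09 AM
FUEL,3.9099,8,1.11,105265,9:59:09 AM
TBET,2.18,2,8.03,1121629,9:59:05 AM
ORBC,3.4,2,0.22,10509,9:59:02 AM
FUEL,3.8599,7,1.07,102116,9:58:47 AM
FUEL,3.8544,6,1.05,100116,9:58:40 AM
GBR,3.83,4,0.46,64251,9:58:24 AM
GBR,3.8,3,0.45,63211,9:58:20 AM
XRA,3.6167,3,0.12,42310,9:58:08 AM
GBR,3.75,2,0.34,47521,9:57:52 AM
MPET,1.42,3,0.26,44600,9:57:52 AM
"""

df = pd.read_csv(io.StringIO(SAMPLE))

# One row per timestamp, one column per symbol, NaN where a symbol did not trade.
# pivot() raises if a (TimeStamp, Symbol) pair occurs more than once.
wide = df.pivot(index="TimeStamp", columns="Symbol", values="Price")

print(wide)
```

With matplotlib installed, something like `wide.plot(marker="o")` should then draw one series per symbol. The markers matter because most cells are NaN, and isolated points would otherwise not be visible.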

Note that the timestamps are still strings. I think they could be converted to Python datetime.time objects, which might be a little easier to work with. I don't have many plans to provide much support for raw times as opposed to timestamps (date + time), but if enough people need it, I suppose I can be convinced :)
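A minimal sketch of that conversion, assuming the strings always look like `9:59:14 AM` (12-hour clock with an AM/PM suffix). The `Time` column name and the three-row frame are made up for illustration.

```python
import datetime

import pandas as pd

# Hypothetical three-row excerpt of the question's data.
df = pd.DataFrame(
    {
        "Symbol": ["TBET", "FUEL", "SUNH"],
        "TimeStamp": ["9:59:14 AM", "9:59:11 AM", "9:59:09 AM"],
    }
)

# Parse with an explicit 12-hour format; pandas attaches a placeholder date,
# which .dt.time then drops, leaving plain datetime.time objects.
parsed = pd.to_datetime(df["TimeStamp"], format="%I:%M:%S %p")
df["Time"] = parsed.dt.time

# time objects compare chronologically, unlike the raw strings.
late = df[df["Time"] >= datetime.time(9, 59, 10)]
print(late)
```

String comparison happens to work for the sample because every value starts with `9:`, but it would put `10:00:00 AM` before `9:59:14 AM`, which is one reason to convert.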

If you have several observations per second for a single symbol, some of the above methods will not work. But I want to improve support for that in future releases of pandas, so knowing your use cases will be helpful to me. Consider joining the mailing list (pystatsmodels).
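Until then, one workaround in the spirit of the "unique index" idea from the question is to number the repeated observations yourself. This is only a sketch on made-up data: the `ticks` frame and the `Seq` column are hypothetical, and `groupby(...).cumcount()` is used as the numbering step.

```python
import pandas as pd

# Hypothetical ticks: FUEL trades twice within the same second.
ticks = pd.DataFrame(
    {
        "Symbol": ["FUEL", "FUEL", "FUEL", "GBR"],
        "TimeStamp": ["9:59:09 AM", "9:59:09 AM", "9:59:11 AM", "9:59:09 AM"],
        "Price": [3.90, 3.91, 3.95, 3.75],
    }
)

# Number observations within each (Symbol, TimeStamp) group: 0, 1, 2, ...
ticks["Seq"] = ticks.groupby(["Symbol", "TimeStamp"]).cumcount()

# The three-part key is unique, so every observation is kept.
keyed = ticks.set_index(["Symbol", "TimeStamp", "Seq"]).sort_index()

# Alternative: collapse to one value per symbol per second (here, the last price).
last = ticks.groupby(["Symbol", "TimeStamp"])["Price"].last()

print(keyed)
print(last)
```

Which of the two fits depends on whether the intra-second events need to be plotted individually or a single representative price per second is enough.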


Source: https://habr.com/ru/post/909944/