One way is to add a column for the year, month, and day:
df['year'] = df.SomeDatetimeColumn.map(lambda x: x.year)
df['month'] = df.SomeDatetimeColumn.map(lambda x: x.month)
df['day'] = df.SomeDatetimeColumn.map(lambda x: x.day)
Then group by year and month, sort each group by day, and take only the first record of each group (which will be the minimum day):
df.groupby(['year', 'month']).apply(lambda x: x.sort_values('day', ascending=True).head(1))
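The whole pipeline can be sketched end to end; the frame below uses hypothetical sample data with a `SomeDatetimeColumn` of timestamps, as in the snippets above:

```python
import pandas as pd

# Hypothetical sample data: irregular dates spanning two months.
df = pd.DataFrame({
    "SomeDatetimeColumn": pd.to_datetime(
        ["2023-01-15", "2023-01-03", "2023-02-20", "2023-02-07"]
    ),
    "value": [10, 20, 30, 40],
})

# Split the timestamp into separate year/month/day columns.
df["year"] = df.SomeDatetimeColumn.map(lambda x: x.year)
df["month"] = df.SomeDatetimeColumn.map(lambda x: x.month)
df["day"] = df.SomeDatetimeColumn.map(lambda x: x.day)

# Sort each (year, month) group by day and keep only the first row,
# i.e. the earliest date observed in that month.
first_per_month = (
    df.groupby(["year", "month"], group_keys=False)
      .apply(lambda g: g.sort_values("day").head(1))
)
print(first_per_month)
```

Note the `.head(1)` lives inside the lambda so it runs once per group; placed after the `apply`, it would keep only one row of the entire result.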
Using lambda expressions makes this less than ideal for large datasets. You may also not want to inflate the data by storing separate columns for the year, month, and day. However, for these kinds of date-alignment problems, sooner or later separating out these values proves very useful.
Another approach is to group directly on a function of the datetime column:
dfrm.groupby(by=dfrm.dt.map(lambda x: (x.year, x.month))).apply(lambda x: x.sort_values('dt', ascending=True).head(1))
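A cheaper equivalent avoids `apply` entirely by taking the index of the minimum timestamp per group with `idxmin`; this sketch assumes a hypothetical frame with a datetime column named `dt`:

```python
import pandas as pd

# Hypothetical frame with a datetime column named "dt".
dfrm = pd.DataFrame({
    "dt": pd.to_datetime(
        ["2023-01-15", "2023-01-03", "2023-02-20", "2023-02-07"]
    ),
    "value": [10, 20, 30, 40],
})

# Index label of the earliest timestamp within each (year, month) group.
idx = dfrm.groupby([dfrm.dt.dt.year, dfrm.dt.dt.month])["dt"].idxmin()

# Select those rows directly -- no per-group sort needed.
earliest = dfrm.loc[idx]
print(earliest)
```

This does a single vectorized reduction per group instead of sorting every group, which scales better on large frames.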
Typically, these problems arise from a dysfunctional database or data-storage scheme sitting upstream of the Python/pandas layer.
For example, in this situation it would be natural to rely on a calendar database table or calendar dataset that contains (or makes it easy to query) the earliest active date per month relative to the given data set (for example, the first trading day, the first day of the week, the first business day, the first holiday, or whatever else).
If such a companion table exists, it is easy to join it with the data set you have already loaded (for example, on a date column you already have), and then it is just a matter of applying a logical filter on the calendar columns.
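As a sketch, assuming a hypothetical calendar frame with one row per known first trading day, the join-then-filter step looks like:

```python
import pandas as pd

# Hypothetical trade data and a hypothetical companion calendar table.
trades = pd.DataFrame({
    "date": pd.to_datetime(
        ["2023-01-03", "2023-01-15", "2023-02-01", "2023-02-10"]
    ),
    "qty": [5, 6, 7, 8],
})
calendar = pd.DataFrame({
    "date": pd.to_datetime(["2023-01-03", "2023-02-01"]),
    "is_first_trading_day": [True, True],
})

# Join on the shared date column, then filter on the calendar flag.
merged = trades.merge(calendar, on="date", how="left")
first_days = merged[merged["is_first_trading_day"].eq(True)]
print(first_days)
```

All the date logic lives in the calendar table, so downstream code never has to re-derive what "first trading day" means.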
This becomes especially important when you need to work with lagged dates: for example, joining a company's 1-month-lagged market capitalization to its stock returns in the current month, in order to compute the total profit earned over that 1-month period.
This can be done by lagging columns in pandas with shift, or by attempting a complicated self-join that is likely to harbor bugs of its own and that bakes a specific date convention into every downstream location that consumes this code's data.
It is much better to simply require (or do it yourself) that the data properly normalizes its date fields in the raw format (database, flat files, whatever): stop what you are doing, fix that problem first, and only then come back to the date-based analysis.