One way is to add a column for the year, month, and day:
df['year'] = df.SomeDatetimeColumn.map(lambda x: x.year)
df['month'] = df.SomeDatetimeColumn.map(lambda x: x.month)
df['day'] = df.SomeDatetimeColumn.map(lambda x: x.day)
Then group by year and month, sort each group by day, and take only the first record of each group (which will be the minimum day):
df.groupby(['year', 'month']).apply(lambda x: x.sort_values('day', ascending=True).head(1))
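The whole pipeline can be sketched end to end; the frame below uses hypothetical sample data with a `SomeDatetimeColumn` of timestamps, as in the snippets above:

```python
import pandas as pd

# Hypothetical sample data: irregular dates spanning two months.
df = pd.DataFrame({
    "SomeDatetimeColumn": pd.to_datetime(
        ["2023-01-15", "2023-01-03", "2023-02-20", "2023-02-07"]
    ),
    "value": [10, 20, 30, 40],
})

# Split the timestamp into separate year/month/day columns.
df["year"] = df.SomeDatetimeColumn.map(lambda x: x.year)
df["month"] = df.SomeDatetimeColumn.map(lambda x: x.month)
df["day"] = df.SomeDatetimeColumn.map(lambda x: x.day)

# Sort each (year, month) group by day and keep only the first row,
# i.e. the earliest date observed in that month.
first_per_month = (
    df.groupby(["year", "month"], group_keys=False)
      .apply(lambda g: g.sort_values("day").head(1))
)
print(first_per_month)
```

Note the `.head(1)` lives inside the lambda so it runs once per group; placed after the `apply`, it would keep only one row of the entire result.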
Using lambda expressions makes this less than ideal for large datasets. You may also not want to inflate the data by storing separate columns for the year, month, and day. However, for these kinds of date-alignment problems, sooner or later separating out these values proves very useful.
Another approach is to group directly on a function of the datetime column:
dfrm.groupby(by=dfrm.dt.map(lambda x: (x.year, x.month))).apply(lambda x: x.sort_values('dt', ascending=True).head(1))
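A cheaper equivalent avoids `apply` entirely by taking the index of the minimum timestamp per group with `idxmin`; this sketch assumes a hypothetical frame with a datetime column named `dt`:

```python
import pandas as pd

# Hypothetical frame with a datetime column named "dt".
dfrm = pd.DataFrame({
    "dt": pd.to_datetime(
        ["2023-01-15", "2023-01-03", "2023-02-20", "2023-02-07"]
    ),
    "value": [10, 20, 30, 40],
})

# Index label of the earliest timestamp within each (year, month) group.
idx = dfrm.groupby([dfrm.dt.dt.year, dfrm.dt.dt.month])["dt"].idxmin()

# Select those rows directly -- no per-group sort needed.
earliest = dfrm.loc[idx]
print(earliest)
```

This does a single vectorized reduction per group instead of sorting every group, which scales better on large frames.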
Typically, these problems arise from a dysfunctional database or data-storage scheme sitting upstream of the Python/pandas layer.
For example, in this situation it would be natural to rely on a calendar database table or calendar dataset that contains (or makes it easy to query) the earliest active date per month relative to the given data set (for example, the first trading day, the first day of the week, the first business day, the first holiday, or whatever else).
If such a companion table exists, it is easy to join it with the data set you have already loaded (for example, on a date column you already have), and then it is just a matter of applying a logical filter on the calendar columns.
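As a sketch, assuming a hypothetical calendar frame with one row per known first trading day, the join-then-filter step looks like:

```python
import pandas as pd

# Hypothetical trade data and a hypothetical companion calendar table.
trades = pd.DataFrame({
    "date": pd.to_datetime(
        ["2023-01-03", "2023-01-15", "2023-02-01", "2023-02-10"]
    ),
    "qty": [5, 6, 7, 8],
})
calendar = pd.DataFrame({
    "date": pd.to_datetime(["2023-01-03", "2023-02-01"]),
    "is_first_trading_day": [True, True],
})

# Join on the shared date column, then filter on the calendar flag.
merged = trades.merge(calendar, on="date", how="left")
first_days = merged[merged["is_first_trading_day"].eq(True)]
print(first_days)
```

All the date logic lives in the calendar table, so downstream code never has to re-derive what "first trading day" means.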
This becomes especially important when you need to work with lagged dates: for example, joining a company's 1-month-lagged market capitalization to its stock returns in the current month, in order to compute the total profit earned over that 1-month period.
This can be done by lagging columns in pandas with shift, or by attempting a complicated self-join that is likely to harbor bugs of its own and that bakes a specific date convention into every downstream location that consumes this code's data.
It is much better to simply require (or do it yourself) that the data properly normalizes its date fields in the raw format (database, flat files, whatever): stop what you are doing, fix that problem first, and only then come back to the date-based analysis.