I need to calculate standardized yearly means from a time series (monthly frequency), but I also need to exclude "incomplete" years (those with fewer than 12 months) from the calculation.
NumPy / SciPy "working" version:
    import numpy as np
    import scipy.stats as sts

    url='http://www.cpc.ncep.noaa.gov/data/indices/sstoi.indices'
    npdata = np.genfromtxt(url, skip_header=1)
    unique_enso_year = [int(value) for value in set(npdata[:, 0])]
    nin34 = np.zeros(len(unique_enso_year))
    for ind, year in enumerate(unique_enso_year):
        indexes = np.flatnonzero(npdata[:, 0]==year)
        if len(indexes) == 12:
            nin34[ind] = np.mean(npdata[indexes, 9])
        else:
            nin34[ind] = np.nan
    nin34x = (nin34 - sts.nanmean(nin34)) / sts.nanstd(nin34)

    array([[  1.02250000e+00,   5.15000000e-01,  -6.73333333e-01,  -7.02500000e-01,
              1.16666667e-01,   1.32916667e+00,  -1.10333333e+00,  -8.11666667e-01,
              1.51666667e-01,   6.42500000e-01,   6.49166667e-01,   3.71666667e-01,
              4.05000000e-01,  -1.98333333e-01,  -4.79166667e-01,   1.24666667e+00,
             -1.44166667e-01,  -1.18166667e+00,  -8.89166667e-01,  -2.51666667e-01,
              7.36666667e-01,   3.02500000e-01,   3.83333333e-01,   1.19166667e-01,
              1.70833333e-01,  -5.25000000e-01,  -7.35000000e-01,   3.75000000e-01,
             -4.50833333e-01,  -8.30000000e-01,  -1.41666667e-02,              nan]])
Pandas attempt:
    import pandas as pd
    from datetime import datetime

    def parse(yr, mon):
        date = datetime(year=int(yr), day=2, month=int(mon))
        return date

    url='http://www.cpc.ncep.noaa.gov/data/indices/sstoi.indices'
    data = pd.read_table(url, sep=' ', header=0, skiprows=0, parse_dates = [['YR', 'MON']], skipinitialspace=True, index_col=0, date_parser=parse)
    grouped = data.groupby(lambda x: x.year)
    zscore = lambda x: (x - x.mean()) / x.std()
    transformed = grouped.transform(zscore)
    print transformed['ANOM.3']

    YR_MON
    1982-01-02   -0.986922
    1982-02-02   -1.179216
    1982-03-02   -1.179216
    1982-04-02   -0.885119
    1982-05-02   -0.376105
    1982-06-02    0.087664
    1982-07-02   -0.161188
    1982-08-02    0.098975
    1982-09-02    0.415695
    1982-10-02    1.049134
    1982-11-02    1.286674
    1982-12-02    1.829622
    1983-01-02    1.715072
    1983-02-02    1.428598
    1983-03-02    0.976272
    ...
    2012-03-02   -0.999284
    2012-04-02   -0.663736
    2012-05-02   -0.063283
    2012-06-02    0.572491
    2012-07-02    0.961020
    2012-08-02    1.314227
    2012-09-02    0.925699
    2012-10-02    0.537170
    2012-11-02    0.660793
    2012-12-02   -0.169245
    2013-01-02   -1.001483
    2013-02-02   -0.924445
    2013-03-02    0.462223
    2013-04-02    1.386668
    2013-05-02    0.077037
    Name: ANOM.3, Length: 377, dtype: float64
This is not what I want, because the calculation also includes 2013 (which has only 5 months).
To extract what I want, I need to do something like:
    (grouped.mean()['ANOM.3'][:-1] - sts.nanmean(grouped.mean()['ANOM.3'][:-1])) / sts.nanstd(grouped.mean()['ANOM.3'][:-1])
But this assumes that I already know the last year is incomplete, and I also lose the np.nan that should stand in for the 2013 value.
So I am now trying to write a query in pandas, for example:
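To make the goal concrete, here is a minimal sketch of the result I am after, run on synthetic data rather than the NOAA file. The series values and the names `annual_mean`, `months_per_year` and `annual_z` are made up for illustration; the idea is to count the months in each year and blank out the annual mean wherever that count is not 12, instead of slicing off the last year by hand.

```python
import numpy as np
import pandas as pd

# Synthetic monthly series (a hypothetical stand-in for the 'ANOM.3' column):
# three complete years plus the first five months of a fourth year.
idx = pd.date_range("2010-01-01", periods=41, freq="MS")
series = pd.Series(np.arange(41, dtype=float), index=idx)

by_year = series.groupby(series.index.year)
annual_mean = by_year.mean()
months_per_year = by_year.count()

# Keep the annual mean only for years with all 12 months; the rest become NaN.
annual_mean = annual_mean.where(months_per_year == 12)

# Standardize. pandas mean() and std() skip NaN by default,
# and std() uses ddof=1 unless told otherwise.
annual_z = (annual_mean - annual_mean.mean()) / annual_mean.std()
print(annual_z)
```

The incomplete year stays in the result as NaN, so the output still has one entry per year.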
    grouped2 = data.groupby(lambda x: x.year).apply(lambda sdf: sdf if len(sdf) > 11 else None).reset_index(drop=True)
This gives me the "correct values", but it generates a new DataFrame without the timestamp index. I'm sure there is a simple and elegant way to do this. Thanks for any help!
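One direction that might work is `groupby(...).filter(...)`, which drops whole groups while keeping the surviving rows under their original index. The sketch below uses a small synthetic frame, not the real data; the column values and the name `complete` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Synthetic monthly frame (hypothetical stand-in for the NOAA table):
# 2010-2012 are complete, 2013 has only five months.
idx = pd.date_range("2010-01-01", periods=41, freq="MS")
data = pd.DataFrame({"ANOM.3": np.arange(41, dtype=float)}, index=idx)

# filter() keeps only the groups for which the function returns True
# and leaves the remaining rows on their original DatetimeIndex.
complete = data.groupby(data.index.year).filter(lambda g: len(g) == 12)

print(complete.index.min(), complete.index.max())
print(len(complete))
```

Because the timestamps survive, `complete` can be grouped by year again for the z-score step without rebuilding the index.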