Statsmodels Poisson glm is different from R

Question

I'm trying to fit some models (spatial interaction models) based on code that is provided in R. I have been able to get some of the code to work using statsmodels in Python, but some of the models do not match at all. I believe the R and Python code below should give the same results. Does anyone see a difference? Or is there some fundamental difference between the two that might be throwing things off? The R code is the original code, and it reproduces the numbers given in the tutorial (found here: http://www.bartlett.ucl.ac.uk/casa/pdf/paper181 ).

R sample Code:

    library(mosaic)
    Data = fetchData('http://dl.dropbox.com/u/8649795/AT_Austria.csv')
    Model = glm(Data~Origin+Destination+Dij+offset(log(Offset)), family=poisson(link="log"), data = Data)
    cor = cor(Data$Data, Model$fitted, method = "pearson", use = "complete")
    rsquared = cor * cor
    rsquared

Output R:

    > Model = glm(Data~Origin+Destination+Dij+offset(log(Offset)), family=poisson(link="log"), data = Data)
    Warning messages:
    1: glm.fit: fitted rates numerically 0 occurred
    2: glm.fit: fitted rates numerically 0 occurred
    > cor = cor(Data$Data, Model$fitted, method = "pearson", use = "complete")
    > rsquared = cor * cor
    > rsquared
    [1] 0.9753279

Python Code:

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf
    import statsmodels.api as sm
    from scipy.stats import pearsonr

    Data = pd.DataFrame(pd.read_csv('http://dl.dropbox.com/u/8649795/AT_Austria.csv'))
    # Poisson GLM with a log link and log(Offset) as the offset term
    Model = smf.glm('Data~Origin+Destination+Dij', data=Data, offset=np.log(Data['Offset']), family=sm.families.Poisson(link=sm.families.links.log)).fit()
    # Squared Pearson correlation between fitted and observed flows
    cor = pearsonr(Model.fittedvalues, Data["Data"])[0]
    print("R-squared for doubly-constrained model is: " + str(cor*cor))

Python output:

 R-squared for doubly-constrained model is: 0.104758481123


python r glm statsmodels poisson

user3311076 Feb 14 '14 at 16:52


1 answer

jseabold · Answer 1 · 2014-02-14T20:25:42+0000

It looks like GLM is having convergence problems here in statsmodels. Maybe in R too, but R only gives these warnings.

    Warning messages:
    1: glm.fit: fitted rates numerically 0 occurred
    2: glm.fit: fitted rates numerically 0 occurred

That could mean something like perfect separation in the Logit/Probit context. I'd have to think about what it means for a Poisson model.

R does a better, if subtle, job of telling you that something may be wrong in your fit. If you look at the fitted log-likelihood in statsmodels, for instance, it is -1.12e27. That should be a clue right there that something is off.

Using the Poisson model directly (I always prefer maximum likelihood to GLM when possible), I can replicate the R results (but I get a convergence warning). Tellingly, again, the default Newton-Raphson solver fails, so I use bfgs.

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf
    import statsmodels.api as sm
    from scipy.stats import pearsonr

    data = pd.DataFrame(pd.read_csv('http://dl.dropbox.com/u/8649795/AT_Austria.csv'))
    # Maximum-likelihood Poisson fit; BFGS replaces the default Newton-Raphson solver
    mod = smf.poisson('Data~Origin+Destination+Dij', data=data, offset=np.log(data['Offset'])).fit(method='bfgs')
    # Convergence flag reported by the optimizer
    print(mod.mle_retvals['converged'])
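To get the same R-squared that the R code computes, you would correlate the observed counts with the predicted means from this fit. The sketch below does that on synthetic data; the column names `flow`, `dist` and `offset_term` are hypothetical stand-ins, not columns of the Austria file. As far as I know, `fittedvalues` on the discrete-model results is the linear predictor rather than the expected count, so the sketch uses `predict()` instead.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the flow data (assumption, not the real dataset)
rng = np.random.default_rng(1)
n = 300
df = pd.DataFrame({
    "dist": rng.normal(size=n),
    "offset_term": rng.uniform(1.0, 5.0, size=n),
})
df["flow"] = rng.poisson(df["offset_term"] * np.exp(0.5 - 0.8 * df["dist"]))

res = smf.poisson("flow ~ dist", data=df,
                  offset=np.log(df["offset_term"])).fit(method="bfgs", disp=False)

# Predicted means on the count scale (the model's own offset is included)
mu_hat = np.asarray(res.predict())

# Squared Pearson correlation, mirroring cor(Data$Data, Model$fitted)^2 in R
r = np.corrcoef(df["flow"], mu_hat)[0, 1]
r_squared = r ** 2
print("converged:", res.mle_retvals["converged"])
print("R-squared:", r_squared)
```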
