Stata Probit replication with persistent errors in R

I am trying to replicate Stata output to R. I am using a dataset questions . I'm having trouble replicating probit functions using standard standard errors.

The Stata code is as follows:

probit affair male age yrsmarr kids relig educ ratemarr, r

I started with:

  probit1 <- glm(affair ~ male + age + yrsmarr + kids + relig + educ + ratemarr, family = binomial (link = "probit"), data = mydata) 

Then I tried various settings with the sandwich package, for example:

 myProbit <- function(probit1, vcov = sandwich(..., adjust = TRUE)) { print(coeftest(probit1, vcov = sandwich(probit1, adjust = TRUE))) } 

Or (with all types of HC0 to HC5 ):

 myProbit <- function(probit1, vcov = sandwich) { print(coeftest(probit1, vcovHC(probit1, type = "HC0")) } 

Or this, as suggested here (do I need to enter something else for the object ?):

 sandwich1 <- function(object, ...) sandwich(object) * nobs(object) / (nobs(object) - 1) coeftest(probit1, vcov = sandwich1) 

None of these attempts led to the same standard errors or z values ​​from stata output.

Hoping for some constructive ideas!

Thanks in advance!

+6
source share
2 answers

For people who are considering jumping on this car, here is some code that demonstrates the problem (data here ):

 clear set more off capture ssc install bcuse capture ssc install rsource bcuse affairs saveold affairs, version(12) replace rsource, terminator(XXX) library("foreign") library("lmtest") library("sandwich") mydata<-read.dta("affairs.dta") probit1<-glm(affair ~ male + age + yrsmarr + kids + relig + educ + ratemarr, family = binomial (link = "probit"), data = mydata) sandwich1 <- function(object,...) sandwich(object) * nobs(object)/(nobs(object) - 1) coeftest(probit1,vcov = sandwich1) XXX probit affair male age yrsmarr kids relig educ ratemarr, robust cformat(%9.6f) nolog 

R gives:

 z test of coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) 0.764157 0.546692 1.3978 0.1621780 male 0.188816 0.133260 1.4169 0.1565119 age -0.024400 0.011423 -2.1361 0.0326725 * yrsmarr 0.054608 0.019025 2.8703 0.0041014 ** kids 0.208072 0.168222 1.2369 0.2161261 relig -0.186085 0.053968 -3.4480 0.0005647 *** educ 0.015506 0.026389 0.5876 0.5568012 ratemarr -0.272711 0.053668 -5.0814 3.746e-07 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Stata gives:

 Probit regression Number of obs = 601 Wald chi2(7) = 54.93 Prob > chi2 = 0.0000 Log pseudolikelihood = -305.2525 Pseudo R2 = 0.0961 ------------------------------------------------------------------------------ | Robust affair | Coef. Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------- male | 0.188817 0.131927 1.43 0.152 -0.069755 0.447390 age | -0.024400 0.011124 -2.19 0.028 -0.046202 -0.002597 yrsmarr | 0.054608 0.018963 2.88 0.004 0.017441 0.091775 kids | 0.208075 0.166243 1.25 0.211 -0.117754 0.533905 relig | -0.186085 0.053240 -3.50 0.000 -0.290435 -0.081736 educ | 0.015505 0.026355 0.59 0.556 -0.036150 0.067161 ratemarr | -0.272710 0.053392 -5.11 0.000 -0.377356 -0.168064 _cons | 0.764160 0.534335 1.43 0.153 -0.283117 1.811437 ------------------------------------------------------------------------------ 

Addendum:

The difference in the covariance estimation of the coefficients is due to different fitting algorithms. In R, the glm command uses the iterative least square method, and the Stata probit uses the ML method, based on the Newton-Raphson algorithm. You can map what R does with glm in Stata with the irls option:

 glm affair male age yrsmarr kids relig educ ratemarr, irls family(binomial) link(probit) robust 

This gives:

 Generalized linear models No. of obs = 601 Optimization : MQL Fisher scoring Residual df = 593 (IRLS EIM) Scale parameter = 1 Deviance = 610.5049916 (1/df) Deviance = 1.029519 Pearson = 619.0405832 (1/df) Pearson = 1.043913 Variance function: V(u) = u*(1-u) [Bernoulli] Link function : g(u) = invnorm(u) [Probit] BIC = -3183.862 ------------------------------------------------------------------------------ | Semirobust affair | Coef. Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------- male | 0.188817 0.133260 1.42 0.157 -0.072367 0.450002 age | -0.024400 0.011422 -2.14 0.033 -0.046787 -0.002012 yrsmarr | 0.054608 0.019025 2.87 0.004 0.017319 0.091897 kids | 0.208075 0.168222 1.24 0.216 -0.121634 0.537785 relig | -0.186085 0.053968 -3.45 0.001 -0.291862 -0.080309 educ | 0.015505 0.026389 0.59 0.557 -0.036216 0.067226 ratemarr | -0.272710 0.053668 -5.08 0.000 -0.377898 -0.167522 _cons | 0.764160 0.546693 1.40 0.162 -0.307338 1.835657 ------------------------------------------------------------------------------ 

They will be close, although not identical. I'm not sure how to get R to use something like NR without a lot of work.

+3
source

I use the matrix approach as described in detail here (p. 57) to match the results of R with Stata. However, I couldn’t find the exact result. I think a small difference may be due to the difference in scores. Ratings in R correspond to Stata only up to 4 decimal places.

Stata STRONG>

 clear all bcuse affairs probit affair male age yrsmarr kids relig educ ratemarr mat var_nr=e(V) predict double u, score matrix accum s = male age yrsmarr kids relig educ ratemarr [iweight=u^2*601/600] //n=601,n-1=600 matrix rv = var_nr*s*var_nr mat diagrv=vecdiag(rv) matmap diagrv rse,m(sqrt(@)) //install matmap mat list rse //standard errors 

This gives you the same standard errors as:

 qui probit affair male age yrsmarr kids relig educ ratemarr,r rse[1,8] affair: affair: affair: affair: affair: affair: affair: affair: male age yrsmarr kids relig educ ratemarr _cons r1 .13192707 .01112372 .01896336 .16624258 .05324046 .02635524 .05339163 .53433495 

R:

 library(AER) # Affairs data data(Affairs) mydata<-Affairs mydata$affairs<-with(mydata,ifelse(affairs>0,1,affairs)) # convert to 1 and 0 probit1<-glm(affairs ~ gender+ age + yearsmarried + children + religiousness+education + rating,family = binomial(link = "probit"),data = mydata) u<-subset(estfun(probit1),select="(Intercept)") #scores: perfectly matches to 4 decimals with Stata: difference may be due to this step w0<-u%*%t(u)*(601/600) #(n/n-1) iweight<-matrix(0,nrow=601,ncol=601) #perfectly matches to 4 decimals with Stata diag(iweight)<-diag(w0) x<-model.matrix(probit1) s<-t(x)%*%iweight%*%x #doesn't match with Stata : rv<-vcov(probit1)%*%s%*%vcov(probit1) rse<-sqrt(diag(rv)) # standard errors rse (Intercept) gendermale age yearsmarried childrenyes religiousness education rating 0.54669177 0.13325951 0.01142258 0.01902537 0.16822161 0.05396841 0.02638902 0.05366828 

It corresponds:

  sandwich1 <- function(object, ...) sandwich(object) * nobs(object) / (nobs(object) - 1) coeftest(probit1, vcov = sandwich1) 

Conclusion: the difference in results between R and Stata is explained by the difference in scores (corresponding to only 4 decimal places).

+2
source

Source: https://habr.com/ru/post/987197/


All Articles