Linear regression fails in Python with large values in dependent variables

Question

Linear regression fails in Python with large values in dependent variables

I am trying to rewrite the prediction model (in Stata) using Python (with pandas.stats.api.ols) and ran into a problem with linear regression: the coefficients and intercept calculated using pandas do not match the data from Stata.

The study shows that the main reason may be the values of the dependent values, very large. I have this suspicion based on the results below:

1) I created a simple DataFrame in Python and ran a linear regression with it:

from pandas.stats.api import ols
import pandas as pd
df = pd.DataFrame({"A": [10,20,30,40,50,21], "B": [20, 30, 10, 40, 50,98], "C": [32, 234, 23, 23, 31,21], "D":[12,28,12,98,51,87], "E": [1,8,12,9,12,91]})
ols(y=df['A'], x=df[['B','C', 'D', 'E']])

and summary for LR:

-------------------------Summary of Regression Analysis-------------------------

Formula: Y ~ <B> + <C> + <D> + <E> + <intercept>

Number of Observations:         6
Number of Degrees of Freedom:   5

R-squared:         0.4627
Adj R-squared:    -1.6865

Rmse:             23.9493

F-stat (4, 1):     0.2153, p-value:     0.9026

Degrees of Freedom: model 4, resid 1

-----------------------Summary of Estimated Coefficients------------------------
      Variable       Coef    Std Err     t-stat    p-value    CI 2.5%   CI 97.5%
--------------------------------------------------------------------------------
             B     0.3212     1.1176       0.29     0.8218    -1.8693     2.5117
             C    -0.0488     0.1361      -0.36     0.7806    -0.3155     0.2178
             D     0.1512     0.4893       0.31     0.8092    -0.8077     1.1101
             E    -0.4508     0.8268      -0.55     0.6822    -2.0713     1.1697
     intercept    20.9222    23.6280       0.89     0.5386   -25.3887    67.2331
---------------------------------End of Summary---------------------------------

I saved this DataFrame in the Stata.dta file and ran LR in Stata as:

 use "/tmp/lr.dta", clear
 reg A B C D E

The result is the same:

      Source |       SS       df       MS              Number of obs =       6
-------------+------------------------------           F(  4,     1) =    0.22
       Model |  493.929019     4  123.482255           Prob > F      =  0.9026
    Residual |  573.570981     1  573.570981           R-squared     =  0.4627
-------------+------------------------------           Adj R-squared = -1.6865
       Total |      1067.5     5       213.5           Root MSE      =  23.949

------------------------------------------------------------------------------
           A |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           B |   .3211939   1.117591     0.29   0.822    -13.87914    14.52153
           C |  -.0488429   .1360552    -0.36   0.781    -1.777589    1.679903
           D |   .1512067   .4892539     0.31   0.809    -6.065353    6.367766
           E |  -.4508122   .8267897    -0.55   0.682    -10.95617    10.05455
       _cons |    20.9222   23.62799     0.89   0.539    -279.2998    321.1442
------------------------------------------------------------------------------

I tried this in R and got the same result.

2) However, if I increased the values of the dependent variables in Python:

df = pd.DataFrame({"A": [10.0,20.0,30.0,40.0,50.0,21.0]})
df['B'] = pow(df['A'], 30)
df['C'] = pow(df['A'], 5)
df['D'] = pow(df['A'], 15)
df['E'] = pow(df['A'], 25)

, float64 : df.dtypes 64 B float64 C float64 D float64 E float64 dtype: object

:

-------------------------Summary of Regression Analysis-------------------------

Formula: Y ~ <B> + <C> + <D> + <E> + <intercept>

Number of Observations:         6
Number of Degrees of Freedom:   2

R-squared:        -0.7223
Adj R-squared:    -1.1528

Rmse:             21.4390

F-stat (4, 4):    -1.6775, p-value:     1.0000

Degrees of Freedom: model 1, resid 4

-----------------------Summary of Estimated Coefficients------------------------
      Variable       Coef    Std Err     t-stat    p-value    CI 2.5%   CI 97.5%
--------------------------------------------------------------------------------
             B    -0.0000     0.0000      -0.00     0.9973    -0.0000     0.0000
             C     0.0000     0.0000       0.00     1.0000    -0.0000     0.0000
             D     0.0000     0.0000       0.00     1.0000    -0.0000     0.0000
             E     0.0000     0.0000       0.00     0.9975    -0.0000     0.0000
     intercept     0.0000    21.7485       0.00     1.0000   -42.6271    42.6271
---------------------------------End of Summary---------------------------------

Stata :

      Source |       SS       df       MS              Number of obs =       6
-------------+------------------------------           F(  4,     1) =  237.35
       Model |  1066.37679     4  266.594196           Prob > F      =  0.0486
    Residual |   1.1232144     1   1.1232144           R-squared     =  0.9989
-------------+------------------------------           Adj R-squared =  0.9947
       Total |      1067.5     5       213.5           Root MSE      =  1.0598

------------------------------------------------------------------------------
           A |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           B |  -1.45e-45   2.32e-46    -6.24   0.101    -4.40e-45    1.50e-45
           C |   2.94e-06   3.67e-07     8.01   0.079    -1.72e-06    7.61e-06
           D |  -3.86e-21   6.11e-22    -6.31   0.100    -1.16e-20    3.90e-21
           E |   4.92e-37   7.88e-38     6.24   0.101    -5.09e-37    1.49e-36
       _cons |   9.881129    1.07512     9.19   0.069    -3.779564    23.54182
------------------------------------------------------------------------------

R Stata: lm ( = A ~ B + C + D + E, data = stata)

Residuals:
         1          2          3          4          5          6 
-1.757e-01  8.211e-01  1.287e-03 -1.269e-06  1.289e-09 -6.467e-01 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)  
(Intercept)  9.881e+00  1.075e+00   9.191    0.069 .
B           -1.449e-45  2.322e-46  -6.238    0.101  
C            2.945e-06  3.674e-07   8.015    0.079 .
D           -3.855e-21  6.106e-22  -6.313    0.100  
E            4.919e-37  7.879e-38   6.243    0.101  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.06 on 1 degrees of freedom
Multiple R-squared:  0.9989,  Adjusted R-squared:  0.9947 
F-statistic: 237.3 on 4 and 1 DF,  p-value: 0.04864

, , pandas . - ?

+4

python pandas r stata linear-regression

Jackie Wong 18 . '15 5:50

1

Jianxun Li · Accepted Answer · 2015-07-18T08:38:30+0000

, python ( python, , ++). np.finfo(float).eps 2.2204460492503131e-16, , eps*max_value_of_your_data, 0 , + - * /. , 1e117 + 1e100 == 1e117 True, 1e100/1e117 = 1e-17 < eps. .

# your data
# =======================
print(df)

    A           B          C           D           E
0  10  1.0000e+30     100000  1.0000e+15  1.0000e+25
1  20  1.0737e+39    3200000  3.2768e+19  3.3554e+32
2  30  2.0589e+44   24300000  1.4349e+22  8.4729e+36
3  40  1.1529e+48  102400000  1.0737e+24  1.1259e+40
4  50  9.3132e+50  312500000  3.0518e+25  2.9802e+42
5  21  4.6407e+39    4084101  6.8122e+19  1.1363e+33

,

# ===================================================
import numpy as np

np.finfo(float).eps # 2.2204460492503131e-16

df[df < df.max().max()*np.finfo(float).eps] = 0
df

   A           B  C  D           E
0  0  0.0000e+00  0  0  0.0000e+00
1  0  1.0737e+39  0  0  0.0000e+00
2  0  2.0589e+44  0  0  8.4729e+36
3  0  1.1529e+48  0  0  1.1259e+40
4  0  9.3132e+50  0  0  2.9802e+42
5  0  4.6407e+39  0  0  0.0000e+00

, y(A) , statsmodels 0 . , .

Linear regression fails in Python with large values ​​in dependent variables

More articles:

Linear regression fails in Python with large values in dependent variables