Fitting a lognormal distribution in Python using curve_fit

I have a hypothetical function y of x and am trying to find/fit the lognormal distribution curve that best shapes the data. I use the curve_fit function and can fit a normal distribution, but the fit is not good.

The data points y and x are shown below, where y = f(x).

y_axis = [0.00032425299473065838, 0.00063714106162861229, 0.00027009331177605913, 0.00096672396877715144, 0.002388766809835889, 0.0042233337680543182, 0.0053072824980722137, 0.0061291327849408699, 0.0064555344006149871, 0.0065601228278316746, 0.0052574034010282218, 0.0057924488798939255, 0.0048154093097913355, 0.0048619350036057446, 0.0048154093097913355, 0.0045114840997070331, 0.0034906838696562147, 0.0040069911024866456, 0.0027766995669134334, 0.0016595801819374015, 0.0012182145074882836, 0.00098231827111984341, 0.00098231827111984363, 0.0012863691645616997, 0.0012395921040321833, 0.00093554121059032721, 0.0012629806342969417, 0.0010057068013846018, 0.0006081017868837127, 0.00032743942370661445, 4.6777060529516312e-05, 7.0165590794274467e-05, 7.0165590794274467e-05, 4.6777060529516745e-05] 

The y axis is the probability of the event occurring in the bins along the x axis:

 x_axis = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 28.0, 29.0, 30.0, 31.0, 32.0, 33.0, 34.0] 

I was able to get a better fit to my data in Excel using a lognormal approach. When I try to fit a lognormal in Python, the fit does not work, so I must be doing something wrong.

Below is the code I use for the normal distribution, which seems to be the only distribution I can fit in Python (hard to believe):

```python
# fitting a distribution on top of the savitzky-golay output
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
import pandas as pd
import scipy
import scipy.stats
import numpy as np
from scipy.stats import norm, gamma, lognorm, halflogistic, foldcauchy
from scipy.optimize import curve_fit

matplotlib.rcParams['figure.figsize'] = (16.0, 12.0)
matplotlib.style.use('ggplot')

# results from savgol
x_axis = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 28.0, 29.0, 30.0, 31.0, 32.0, 33.0, 34.0]
y_axis = [0.00032425299473065838, 0.00063714106162861229, 0.00027009331177605913, 0.00096672396877715144, 0.002388766809835889, 0.0042233337680543182, 0.0053072824980722137, 0.0061291327849408699, 0.0064555344006149871, 0.0065601228278316746, 0.0052574034010282218, 0.0057924488798939255, 0.0048154093097913355, 0.0048619350036057446, 0.0048154093097913355, 0.0045114840997070331, 0.0034906838696562147, 0.0040069911024866456, 0.0027766995669134334, 0.0016595801819374015, 0.0012182145074882836, 0.00098231827111984341, 0.00098231827111984363, 0.0012863691645616997, 0.0012395921040321833, 0.00093554121059032721, 0.0012629806342969417, 0.0010057068013846018, 0.0006081017868837127, 0.00032743942370661445, 4.6777060529516312e-05, 7.0165590794274467e-05, 7.0165590794274467e-05, 4.6777060529516745e-05]

## y_axis values must be normalised
sum_ys = sum(y_axis)
# normalize to 1
y_axis = [_ / sum_ys for _ in y_axis]

# def gamma_f(x, a, loc, scale):
#     return gamma.pdf(x, a, loc, scale)

def norm_f(x, loc, scale):
    # print('loc: ', loc, 'scale: ', scale, "\n")
    return norm.pdf(x, loc, scale)

fitting = norm_f

# param_bounds = ([-np.inf, 0, -np.inf], [np.inf, 2, np.inf])
result = curve_fit(fitting, x_axis, y_axis)
result_mod = result
# mod scale
# results_adj = [result_mod[0][0] * .75, result_mod[0][1] * .85]

plt.plot(x_axis, y_axis, 'ro')
plt.bar(x_axis, y_axis, 1, alpha=0.75)
plt.plot(x_axis, [fitting(_, *result[0]) for _ in x_axis], 'b-')
plt.axis([0, 35, 0, .1])

# convert back into probability
y_norm_fit = [fitting(_, *result[0]) for _ in x_axis]
y_fit = [_ * sum_ys for _ in y_norm_fit]
print(list(y_fit))

plt.show()
```

I am trying to get answers to two questions:

  1. Is this the best fit I can get from a normal distribution curve? How can I improve it?

Normal distribution result: [plot of the normal fit over the data]

  2. How can I fit a lognormal distribution to this data, or is there a better distribution I could use?

I played with a lognormal distribution curve, tuning mu and sigma, and it seems to suit the data better. I do not understand what I am doing wrong that I cannot get similar results in Python.
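For reference, the lognormal fit I am attempting looks like the sketch below; on synthetic data it only converges when an explicit p0 is passed (curve_fit otherwise starts every parameter at 1.0). The shape 0.5 and scale 10.0 here are made-up values, not derived from the data above:

```python
import numpy as np
from scipy.stats import lognorm
from scipy.optimize import curve_fit

def lognorm_f(x, s, scale):
    # scipy's parameterisation: s = sigma, scale = exp(mu), loc fixed at 0
    return lognorm.pdf(x, s, loc=0, scale=scale)

x = np.arange(1.0, 35.0)
# synthetic stand-in for the normalised y_axis values above
rng = np.random.default_rng(0)
y = lognorm_f(x, 0.5, 10.0) + rng.normal(0.0, 1e-3, x.size)

# without p0, curve_fit starts at (1.0, 1.0) and the lognormal fit
# usually fails to converge; a rough initial guess fixes that
popt, pcov = curve_fit(lognorm_f, x, y, p0=[1.0, 10.0])
print(popt)  # close to the true (0.5, 10.0)
```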

3 answers

In fact, a Gamma distribution might be appropriate, as suggested by @Glen_b. I use the second parameterisation, with shape α and rate β.

NB: the trick I use for quick fitting is to compute the mean and variance; for a typical two-parameter distribution this is enough to recover the parameters and get a quick idea of whether they fit or not.

[plot: gamma fit overlaid on the data]

code

```python
import math
import matplotlib.pyplot as plt

y_axis = [0.00032425299473065838, 0.00063714106162861229, 0.00027009331177605913, 0.00096672396877715144, 0.002388766809835889, 0.0042233337680543182, 0.0053072824980722137, 0.0061291327849408699, 0.0064555344006149871, 0.0065601228278316746, 0.0052574034010282218, 0.0057924488798939255, 0.0048154093097913355, 0.0048619350036057446, 0.0048154093097913355, 0.0045114840997070331, 0.0034906838696562147, 0.0040069911024866456, 0.0027766995669134334, 0.0016595801819374015, 0.0012182145074882836, 0.00098231827111984341, 0.00098231827111984363, 0.0012863691645616997, 0.0012395921040321833, 0.00093554121059032721, 0.0012629806342969417, 0.0010057068013846018, 0.0006081017868837127, 0.00032743942370661445, 4.6777060529516312e-05, 7.0165590794274467e-05, 7.0165590794274467e-05, 4.6777060529516745e-05]
x_axis = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 28.0, 29.0, 30.0, 31.0, 32.0, 33.0, 34.0]

## y_axis values must be normalised
sum_ys = sum(y_axis)
# normalize to 1
y_axis = [_ / sum_ys for _ in y_axis]

# method of moments: weighted mean and variance of the distribution
m = 0.0
for k in range(len(x_axis)):
    m += y_axis[k] * x_axis[k]
v = 0.0
for k in range(len(x_axis)):
    t = x_axis[k] - m
    v += y_axis[k] * t * t
print(m, v)

# for a gamma, mean = a/b and variance = a/b**2, so:
b = m / v
a = m * b
print(a, b)

# gamma density with shape a and rate b
z = []
for k in range(len(x_axis)):
    q = b**a * x_axis[k]**(a - 1.0) * math.exp(-b * x_axis[k]) / math.gamma(a)
    z.append(q)

plt.plot(x_axis, y_axis, 'ro')
plt.plot(x_axis, z, 'b*')
plt.axis([0, 35, 0, .1])
plt.show()
```
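The hand-rolled density in the code uses the shape/rate form; it can be cross-checked against `scipy.stats.gamma`, which uses a shape/scale parameterisation (scale = 1/β). A small sketch with illustrative parameter values, not values fitted to the data:

```python
import math
from scipy.stats import gamma

# illustrative shape/rate values; the code above computes its own a and b
# from the data's mean and variance
a, b = 4.5, 0.5
x = 9.0

manual = b**a * x**(a - 1.0) * math.exp(-b * x) / math.gamma(a)
# scipy's gamma takes shape and scale, so the rate b becomes scale = 1/b
scipy_val = gamma.pdf(x, a, loc=0, scale=1.0 / b)
print(manual, scipy_val)  # the two values agree
```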

Note that if a lognormal curve is correct and you take logs of both variables, you should see a quadratic relationship; even if that is not a suitable scale for the final model (because of variance effects: if your variance is nearly constant on the original scale, it will over-weight the small values), it should at least give good starting values for a nonlinear fit.

Indeed, apart from the first two points, it looks pretty good:

[log-log plot showing a nearly quadratic relationship]

A quadratic fit to the solid points would describe this data well enough and should give suitable starting values if you want to carry out a nonlinear fit.

(Unless error in x is impossible, the lack of fit at the lowest x values may be as much a problem of error in x as of error in y.)
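A minimal sketch of that quadratic log-log fit, on synthetic data generated from an assumed unnormalised lognormal shape (`np.polyfit` stands in for whatever regression routine you prefer):

```python
import numpy as np

# synthetic stand-in for the (x_axis, y_axis) data, generated from an
# unnormalised lognormal shape y = c * exp(-(log x - mu)^2 / (2 sigma^2)) / x
x = np.arange(1.0, 35.0)
mu, sigma, c = np.log(10.0), 0.5, 1.0
y = c * np.exp(-(np.log(x) - mu) ** 2 / (2.0 * sigma ** 2)) / x

# quadratic fit on the log-log scale: log y = b2 (log x)^2 + b1 log x + b0
b2, b1, b0 = np.polyfit(np.log(x), np.log(y), 2)

# matching coefficients against the lognormal shape gives
# b2 = -1/(2 sigma^2) and b1 = mu/sigma^2 - 1
sigma_hat = np.sqrt(-1.0 / (2.0 * b2))
mu_hat = (b1 + 1.0) * sigma_hat ** 2
print(sigma_hat, mu_hat)  # recovers sigma = 0.5 and mu = log(10)
```

These recovered values are the starting point for the nonlinear fit mentioned above.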

By the way, this plot suggests the gamma curve may fit a little better than the lognormal (in particular if you do not want to down-weight the influence of the first two points relative to points 4-6). A good initial fit for it can be obtained by regressing log(y) on x and log(x):

[plot: fitted gamma curve shown on a log scale]

The scaled gamma density is g = c x^(a-1) exp(-bx); taking logs, log(g) = log(c) + (a-1) log(x) - bx = b0 + b1 log(x) + b2 x, so supplying log(x) and x to a linear regression routine will fit it. The same caveats about variance effects apply (so this may be best used as a starting point for a nonlinear least-squares fit, unless your relative error in y is nearly constant).
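A sketch of that regression, again on synthetic data with assumed parameter values: regress log y on the columns [1, log x, x] and read a, b, c back off the coefficients.

```python
import numpy as np

# synthetic stand-in generated from the scaled gamma g = c * x^(a-1) * exp(-b*x)
x = np.arange(1.0, 35.0)
a_true, b_true, c_true = 3.0, 0.3, 1.0
y = c_true * x ** (a_true - 1.0) * np.exp(-b_true * x)

# log g = b0 + b1 log(x) + b2 x, so regress log y on [1, log x, x]
A = np.column_stack([np.ones_like(x), np.log(x), x])
b0, b1, b2 = np.linalg.lstsq(A, np.log(y), rcond=None)[0]

a_hat = b1 + 1.0      # shape: b1 = a - 1
b_hat = -b2           # rate:  b2 = -b
c_hat = np.exp(b0)    # scale factor
print(a_hat, b_hat, c_hat)  # recovers 3.0, 0.3, 1.0
```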


A discrete distribution might be a better fit: your x values are all integers. Your variance is about 3 times the mean and the distribution is skewed, so something like a Negative Binomial is likely to work quite well. Here is a quick fit:

[plot: negative binomial fit overlaid on the data]

r comes out slightly above 6, so you could move on to a distribution with real-valued r: the Pólya distribution.

code

```python
from scipy.special import comb  # scipy.misc.comb was removed in newer scipy
import matplotlib.pyplot as plt

y_axis = [0.00032425299473065838, 0.00063714106162861229, 0.00027009331177605913, 0.00096672396877715144, 0.002388766809835889, 0.0042233337680543182, 0.0053072824980722137, 0.0061291327849408699, 0.0064555344006149871, 0.0065601228278316746, 0.0052574034010282218, 0.0057924488798939255, 0.0048154093097913355, 0.0048619350036057446, 0.0048154093097913355, 0.0045114840997070331, 0.0034906838696562147, 0.0040069911024866456, 0.0027766995669134334, 0.0016595801819374015, 0.0012182145074882836, 0.00098231827111984341, 0.00098231827111984363, 0.0012863691645616997, 0.0012395921040321833, 0.00093554121059032721, 0.0012629806342969417, 0.0010057068013846018, 0.0006081017868837127, 0.00032743942370661445, 4.6777060529516312e-05, 7.0165590794274467e-05, 7.0165590794274467e-05, 4.6777060529516745e-05]
x_axis = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 28.0, 29.0, 30.0, 31.0, 32.0, 33.0, 34.0]

## y_axis values must be normalised
sum_ys = sum(y_axis)
# normalize to 1
y_axis = [_ / sum_ys for _ in y_axis]

s = 1.0  # shift x by 1 so the support starts at 0

# weighted mean and variance on the shifted support
m = 0.0
for k in range(len(x_axis)):
    m += y_axis[k] * (x_axis[k] - s)
v = 0.0
for k in range(len(x_axis)):
    t = x_axis[k] - s - m
    v += y_axis[k] * t * t
print(m, v)

# for a negative binomial, p = 1 - mean/variance and r = mean*(1-p)/p
p = 1.0 - m / v
r = int(m * (1.0 - p) / p)
print(p, r)

# pmf on the shifted support k = x - 1 = 0, 1, 2, ...
z = []
for k in range(len(x_axis)):
    q = comb(k + r - 1, k) * (1.0 - p)**r * p**k
    z.append(q)

plt.plot(x_axis, y_axis, 'ro')
plt.plot(x_axis, z, 'b*')
plt.axis([0, 35, 0, .1])
plt.show()
```
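For the real-valued r mentioned above, `comb(k + r - 1, k)` can be replaced by the gamma-function form Γ(k + r) / (k! Γ(r)); a sketch using `scipy.special.gammaln` for numerical stability (the r and p values here are illustrative, not fitted to the data):

```python
import math
from scipy.special import gammaln

def polya_pmf(k, r, p):
    # negative binomial pmf generalised to real-valued r:
    # Gamma(k + r) / (k! * Gamma(r)) * (1 - p)^r * p^k
    log_coef = gammaln(k + r) - gammaln(k + 1.0) - gammaln(r)
    return math.exp(log_coef + r * math.log(1.0 - p) + k * math.log(p))

# illustrative real-valued parameters (not fitted to the data above)
r, p = 6.3, 0.68
probs = [polya_pmf(k, r, p) for k in range(34)]
print(sum(probs))  # most of the mass falls inside k = 0..33
```

With integer r this reduces exactly to the comb-based formula in the code above.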

Source: https://habr.com/ru/post/1266391/

