Gini coefficient calculation in Python / numpy

I am calculating a Gini coefficient (similar to: Calculating a Python-Gini coefficient using Numpy), but I get an odd result. For a uniform distribution sampled with np.random.rand(), the Gini coefficient is 0.3, but I expected it to be close to 0 (perfect equality). What is going wrong here?

    import numpy as np
    import matplotlib.pyplot as plt

    def G(v):
        bins = np.linspace(0., 100., 11)
        total = float(np.sum(v))
        yvals = []
        for b in bins:
            bin_vals = v[v <= np.percentile(v, b)]
            bin_fraction = (np.sum(bin_vals) / total) * 100.0
            yvals.append(bin_fraction)
        # perfect equality area
        # (np.trapz was renamed np.trapezoid in NumPy 2.0)
        pe_area = np.trapz(bins, x=bins)
        # lorenz area
        lorenz_area = np.trapz(yvals, x=bins)
        gini_val = (pe_area - lorenz_area) / float(pe_area)
        return bins, yvals, gini_val

    v = np.random.rand(500)
    bins, result, gini_val = G(v)
    plt.figure()
    plt.subplot(2, 1, 1)
    plt.plot(bins, result, label="observed")
    plt.plot(bins, bins, '--', label="perfect eq.")
    plt.xlabel("fraction of population")
    plt.ylabel("fraction of wealth")
    plt.title("GINI: %.4f" % (gini_val))
    plt.legend()
    plt.subplot(2, 1, 2)
    plt.hist(v, bins=20)
    plt.show()

For a given set of numbers, the above code calculates the fraction of the distribution's total value that falls at or below each percentile bin.

Result:

[Image: the observed Lorenz curve plotted against the perfect-equality line (top) and a histogram of v (bottom)]

A uniform distribution should be close to "perfect equality", so the bend in the Lorenz curve looks wrong.

2 answers

This is to be expected. Random sampling from a uniform distribution does not produce homogeneous values (that is, values that are all relatively close to each other). With a little calculus, it can be shown that the expected value (in the statistical sense) of the Gini coefficient for a sample from the uniform distribution on [0, 1] is 1/3, so getting a value of about 1/3 for this sample is reasonable.

You will get a lower Gini coefficient with a sample such as v = 10 + np.random.rand(500). All of these values are close to 10.5, so the relative variation is lower than in the sample v = np.random.rand(500). In fact, the expected value of the Gini coefficient for the sample base + np.random.rand(n) is 1 / (6 * base + 3).
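The value 1 / (6 * base + 3), and with it the 1/3 for base = 0, can be checked with a short derivation from the Lorenz curve. This is a sketch for the underlying distribution (the large-sample limit), not an exact statement about a finite sample. The symbols b (standing for base), Q (quantile function), \mu (mean), and L (Lorenz curve) are notation introduced here.

```latex
% X = b + U, with U uniform on [0, 1] and b >= 0 (b stands for `base`)
\begin{align*}
Q(p) &= b + p, \qquad \mu = \mathbb{E}[X] = b + \tfrac{1}{2} \\
L(p) &= \frac{1}{\mu}\int_0^p Q(t)\,dt = \frac{b\,p + p^2/2}{b + 1/2} \\
G &= 1 - 2\int_0^1 L(p)\,dp
   = 1 - \frac{2\left(b/2 + 1/6\right)}{b + 1/2}
   = \frac{1/6}{b + 1/2}
   = \frac{1}{6b + 3}
\end{align*}
```

With b = 0 the Lorenz curve is L(p) = p^2 and G = 1/3, so a visibly bent curve is what a uniform sample on [0, 1] should produce.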

Here is a simple implementation of the Gini coefficient. It uses the fact that the Gini coefficient is half the relative mean absolute difference.

    import numpy as np

    def gini(x):
        # (Warning: This is a concise implementation, but it is O(n**2)
        # in time and memory, where n = len(x).  *Don't* pass in huge
        # samples!)

        # Mean absolute difference
        mad = np.abs(np.subtract.outer(x, x)).mean()
        # Relative mean absolute difference
        rmad = mad/np.mean(x)
        # Gini coefficient
        g = 0.5 * rmad
        return g
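For larger samples, the same quantity can be computed without the n-by-n difference matrix by sorting first. The following is a minimal sketch, not part of the original answer: gini_sorted and gini_outer are hypothetical names, and the input is assumed to be a non-empty 1-D collection of non-negative values with a positive sum.

```python
import numpy as np


def gini_sorted(x):
    """Gini coefficient via sorting; O(n log n) time, O(n) memory.

    Hypothetical helper, not part of the original answer. Assumes x is
    a non-empty 1-D collection of non-negative values with a positive sum.
    """
    x = np.sort(np.asarray(x, dtype=float))
    n = x.size
    ranks = np.arange(1, n + 1)  # 1-based rank of each sorted value
    # Sum over all pairs of |x_i - x_j| equals 2 * sum((2*rank - n - 1) * x),
    # so half the relative mean absolute difference reduces to this:
    return np.sum((2 * ranks - n - 1) * x) / (n * np.sum(x))


def gini_outer(x):
    """The O(n**2) pairwise definition, kept here only for comparison."""
    x = np.asarray(x, dtype=float)
    return 0.5 * np.abs(np.subtract.outer(x, x)).mean() / np.mean(x)


rng = np.random.default_rng(0)  # arbitrary seed for reproducibility
sample = rng.random(500)
print(gini_sorted(sample), gini_outer(sample))  # the two should agree
```

Both functions average over all n**2 ordered pairs (including each value paired with itself), so they should agree up to floating-point rounding.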

Here is the Gini coefficient for several samples of the form v = base + np.random.rand(500):

    In [80]: v = np.random.rand(500)

    In [81]: gini(v)
    Out[81]: 0.32760618249832563

    In [82]: v = 1 + np.random.rand(500)

    In [83]: gini(v)
    Out[83]: 0.11121487509454202

    In [84]: v = 10 + np.random.rand(500)

    In [85]: gini(v)
    Out[85]: 0.01567937753659053

    In [86]: v = 100 + np.random.rand(500)

    In [87]: gini(v)
    Out[87]: 0.0016594595244509495
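These outputs can be set against the 1 / (6 * base + 3) formula. The short sketch below does only that arithmetic. The observed values are copied from the session above, and the names observed and rel_errors are introduced here for illustration.

```python
# Gini values reported in the session above, keyed by `base`.
observed = {
    0: 0.32760618249832563,
    1: 0.11121487509454202,
    10: 0.01567937753659053,
    100: 0.0016594595244509495,
}

rel_errors = {}
for base, obs in observed.items():
    predicted = 1.0 / (6 * base + 3)  # formula quoted in the answer
    rel_errors[base] = abs(obs - predicted) / predicted
    print(f"base = {base:>3}: predicted = {predicted:.6f}, "
          f"observed = {obs:.6f}, relative difference = {rel_errors[base]:.2%}")
```

For these four samples of size 500, each observed value is within a few percent of the prediction, which is what sampling noise of that size would suggest.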

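The Gini coefficient is derived from the Lorenz curve: it is the area between the line of perfect equality and the Lorenz curve, divided by the total area under the line of perfect equality. It is usually calculated to analyze the distribution of income in a population. https://github.com/oliviaguest/gini provides a simple Python implementation.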
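To make the Lorenz-curve definition concrete, here is a minimal sketch that builds the empirical Lorenz curve from cumulative sums and integrates it with the trapezoid rule. It is not the code from the linked repository. The name lorenz_gini is hypothetical, and the input is assumed to be a non-empty 1-D collection of non-negative values with a positive sum.

```python
import numpy as np


def lorenz_gini(x):
    """Gini coefficient from the empirical Lorenz curve (a sketch).

    Hypothetical helper; assumes x is a non-empty 1-D collection of
    non-negative values with a positive sum.
    """
    x = np.sort(np.asarray(x, dtype=float))
    n = x.size
    # Lorenz curve: cumulative share of the total held by the poorest k/n
    # of the population, starting from the point (0, 0).
    pop_share = np.arange(n + 1) / n
    wealth_share = np.concatenate(([0.0], np.cumsum(x))) / np.sum(x)
    # Area under the Lorenz curve by the trapezoid rule, written out by
    # hand so the sketch does not depend on a particular NumPy version.
    heights = (wealth_share[1:] + wealth_share[:-1]) / 2.0
    lorenz_area = np.sum(heights * np.diff(pop_share))
    # The area under the line of perfect equality is 1/2, so
    # G = (1/2 - lorenz_area) / (1/2) = 1 - 2 * lorenz_area.
    return 1.0 - 2.0 * lorenz_area


rng = np.random.default_rng(0)  # arbitrary seed for reproducibility
print(lorenz_gini(rng.random(500)))       # expected to land near 1/3
print(lorenz_gini(10 + rng.random(500)))  # expected to be much closer to 0
```

Using one Lorenz point per observation, rather than eleven percentile bins as in the question, gives a less coarse curve but the same qualitative picture.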


Source: https://habr.com/ru/post/1275591/

