how quantities, e.g., income, are distributed over a population.
Could a single number express equality/inequality of a distribution?
Gini axiomatised the requirements with his \(G\) index:
\(G \approx 0\): all individuals have exactly the same share of wealth/income
\(G \approx 1\): one individual has it all, everyone else has exactly zero
\(G < 0.3\): rather egalitarian (Slovakia = 0.22)
\(G > 0.4\): rather elitist wrt. income (S. Africa = 0.62)
Consider the pairwise absolute differences between individuals:
\[ G_0 = \sum_i \sum_j | x_i - x_j | \]
normalise for scale wrt. the overall average \(\overline{x}\) and the \(n^2\) ordered pairs; the extra factor 2 keeps \(G\) within \([0,1]\):
\[ G = \frac{\sum_i \sum_j | x_i - x_j |}{2n^2\overline{x}} \]
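A minimal sketch of the pairwise-difference formula, assuming Python/NumPy (function name and sample values are illustrative):

```python
import numpy as np

def gini_coefficient(x):
    """Gini index: sum of absolute pairwise differences, normalised by 2 * n^2 * mean."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    diffs = np.abs(x[:, None] - x[None, :]).sum()  # sum_i sum_j |x_i - x_j|
    return diffs / (2 * n**2 * x.mean())

print(gini_coefficient([10, 10, 10, 10]))  # 0.0  -> perfect equality
print(gini_coefficient([0, 0, 0, 100]))    # 0.75 -> approaches 1 as n grows
```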
Sort individuals by increasing income and plot cumulative population share (X-axis) against cumulative income share (Y-axis): the Lorenz curve.
Area interpretation: \(G = \frac{A}{A+B}\), where \(A\) is the area between the diagonal (perfect equality) and the Lorenz curve, and \(B\) is the area under the Lorenz curve.
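A sketch of the Lorenz-curve view, assuming Python/NumPy; with the trapezoid rule on the sorted cumulative shares it should reproduce the pairwise-difference value above (function name and sample values are illustrative):

```python
import numpy as np

def gini_from_lorenz(x):
    """Gini index as A / (A + B) = 1 - 2B, where B is the area under the Lorenz curve."""
    x = np.sort(np.asarray(x, dtype=float))
    cum_income = np.insert(np.cumsum(x), 0, 0.0) / x.sum()  # cumulative income share (Y)
    cum_pop = np.linspace(0.0, 1.0, len(x) + 1)             # cumulative population share (X)
    B = np.sum((cum_income[1:] + cum_income[:-1]) / 2 * np.diff(cum_pop))  # trapezoid rule
    return 1.0 - 2.0 * B                                    # A + B = 1/2, so G = A/(A+B) = 1 - 2B

print(gini_from_lorenz([0, 0, 0, 100]))  # 0.75, matching the pairwise-difference value
```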
\(G\) is a measure of dispersion, not necessarily of egalitarianism;
it measures the present dispersion rather than a trend;
often, proxy measures are easier to observe, e.g., home computer ownership as a proxy for wealth.
In classification, Gini impurity is a measure of quality for a subset of the data that is to be given a single classification/label.
Algorithm: take a set of elements and choose the set's label by randomly selecting one element and using its category.
What is the probability that this simple method leads to misclassification?
It depends on the dispersion in the set.
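A quick Monte Carlo sketch of that question, assuming Python; the category counts below are purely illustrative:

```python
import random

def misclassification_rate(labels, trials=100_000):
    """Empirical probability that the random-labelling scheme misclassifies an element."""
    errors = 0
    for _ in range(trials):
        set_label = random.choice(labels)  # label the whole set by one random element
        element = random.choice(labels)    # element whose true category we check
        errors += (set_label != element)
    return errors / trials

labels = ["a"] * 6 + ["b"] * 3 + ["c"] * 1   # illustrative category counts
print(misclassification_rate(labels))        # ~ 1 - (0.6^2 + 0.3^2 + 0.1^2) = 0.54
```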
Let \(P(i)\) be the normalised frequency distribution of the \(n\) elements over \(k\) categories.
What is the prob. of misclassification, when the label is chosen randomly?
\[ G = \sum_{i=1}^k P(i)\,(1 - P(i)) \]
which, since \(\sum_{i=1}^k P(i) = 1\), simplifies to
\[ G = 1 - \sum_{i=1}^k P(i)^2 \]
\(G \approx 0\): all items are in one category (whatever that is): good classification likely
\(G \approx 0.5\): items equally scattered over two categories: bad classification likely (with \(k\) categories the maximum is \(1 - 1/k\))
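A minimal sketch of the closed-form impurity, assuming Python; function and label names are illustrative:

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity 1 - sum_i P(i)^2 of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

print(gini_impurity(["no"] * 8))                # 0.0 -> pure subset, good classification likely
print(gini_impurity(["yes"] * 4 + ["no"] * 4))  # 0.5 -> maximally mixed (two categories)
```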
Dataset: playing golf today?
Day | Outlook | Temp. | Hum. | Wind | Play? |
---|---|---|---|---|---|
1 | Sunny | Hot | High | Weak | No |
2 | … |
Consider the three subsets obtained by splitting the data along the values of the Outlook dimension.
\(G(\text{Outlook})\) is the weighted sum of the impurities of these subsets (labels assigned by the random-labelling algorithm), each subset weighted by its relative size:
\[ G(\text{Outlook}) = \sum_{v} \frac{n_v}{n}\, G(S_v) \]
where \(S_v\) is the subset of elements with Outlook value \(v\) and \(n_v = |S_v|\).
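A sketch of the weighted split impurity, assuming Python; the per-subset labels are placeholders for illustration, not the (elided) table above:

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity 1 - sum_i P(i)^2 of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def weighted_split_impurity(subsets):
    """Weighted sum of subset impurities, each subset weighted by its relative size n_v / n."""
    n = sum(len(s) for s in subsets)
    return sum(len(s) / n * gini_impurity(s) for s in subsets)

# Placeholder Play? labels, one list per Outlook value (illustrative only)
subsets = [
    ["No", "No", "Yes"],          # mixed subset
    ["Yes", "Yes", "Yes"],        # pure subset, impurity 0
    ["Yes", "No", "Yes", "No"],   # maximally mixed subset, impurity 0.5
]
print(weighted_split_impurity(subsets))  # 3/10 * 4/9 + 3/10 * 0 + 4/10 * 0.5 = 1/3
```

A lower weighted impurity indicates a better split along that dimension.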
Q: can we do better? E.g., Majority voting?