Ch. 3 of Provost-Fawcett’s Data Science for Business
introduces CART by example
illustrates the use of Entropy
and shows the training and testing of a predictive model.
A predictive model is a formula for estimating the unknown value of interest, often called the target attribute.
Regression: numerical target
Classification: class membership; a closely related task is class-probability estimation
A descriptive model is a formula for estimating the underlying phenomena and causal connections between values.
Descriptive modeling often is used to work towards a causal understanding of the data-generating process (e.g., why do users watch Sci-Fi sagas?)
Divide the dataset into segments (sets of rows) by the value of their output (target) variable.
If the segmentation is done using values of variables that will be known even when the target is not, then these segments can be used to predict the value of the target.
Example: the golf days example seen when studying Gini.
Day | Outlook | Temp. | Hum. | Wind | Play? |
---|---|---|---|---|---|
1 | Sunny | Hot | High | Weak | No |
2 | Sunny | Hot | High | Strong | ? |
What are the variables that contain important information about the target variable?
Can they be selected automatically?
Attributes
Target
Measure: purity of segments, i.e., all datapoints have the same target variable value.
1. non-binary attributes in binary classification
2. non-discrete attributes
3. attributes seldom split a dataset perfectly
4. attributes seldom split segments perfectly (see final slides)
Let’s focus on 3.
If an unlabelled datapoint falls within a pure segment, we can safely assign it the target variable value of its segment.
Day | Outlook | Temp. | Hum. | Wind | Play? |
---|---|---|---|---|---|
1 | Sunny | Hot | High | Strong | No |
2 | Sunny | Hot | High | Strong | No |
3 | Sunny | Hot | High | Strong | No |
6 | Sunny | Hot | High | Strong | ? |
Predict ‘No.’
Day | Outlook | Temp. | Hum. | Wind | Play? |
---|---|---|---|---|---|
1 | Sunny | Hot | High | Strong | No |
2 | Sunny | Hot | High | Strong | No |
3 | Sunny | Hot | High | Strong | No |
4 | Sunny | Hot | High | Strong | Yes |
6 | Sunny | Hot | High | Strong | ? |
assign each target value a probability equal to its frequency in the segment:
\(Pr[\textrm{Play}_6=\textrm{`No'}]=\frac{3}{4}\), \(Pr[\textrm{Play}_6=\textrm{`Yes'}]=\frac{1}{4}.\)
(alternative: majority voting?)
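A minimal sketch (plain Python, a segment stored as a list of target labels) of both options, class-probability estimation and majority voting:

```python
from collections import Counter

def class_probabilities(labels):
    """Estimate Pr[target = v] for each value v as its frequency in the segment."""
    counts = Counter(labels)
    return {value: count / len(labels) for value, count in counts.items()}

def majority_vote(labels):
    """Alternative: predict the single most frequent target value in the segment."""
    return Counter(labels).most_common(1)[0][0]

segment = ["No", "No", "No", "Yes"]   # Play? values of days 1-4 above
print(class_probabilities(segment))    # {'No': 0.75, 'Yes': 0.25}
print(majority_vote(segment))          # 'No'
```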
We would improve prediction by decreasing segment impurity.
Entropy (or Gini impurity) drives segmentation.
IG measures how much an attribute reduces entropy over the whole segmentation it creates.
How much purer are the children wrt. their parent segment?
IG(parent, children) = H(parent) − \([prop(c_1)\cdot H(c_1) + prop(c_2)\cdot H(c_2) + \dots ]\)
where \(prop(c_i)\) is the proportion of elements assigned to child \(c_i\): \(prop(c_i)=\frac{|c_i|}{n}\)
[\(IG(P, C)\) is connected with Kullback-Leibler divergence]
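To make that connection concrete (a standard identity, not spelled out in the slides): when the children are induced by the values of an attribute \(X\) and the class is \(Y\), the information gain equals the mutual information between \(X\) and \(Y\), which is a Kullback-Leibler divergence:
\[
IG = I(X;Y) = D_{\mathrm{KL}}\big(p(x,y)\,\|\,p(x)\,p(y)\big) = \sum_{x,y} p(x,y)\log\frac{p(x,y)}{p(x)\,p(y)}
\]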
H(parent) = 0.99
H(left) = 0.39
H(right) = 0.79
\(IG(p, C) = 0.99 -[13/30\cdot 0.39 + 17/30 \cdot 0.79] = 0.37\)
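A small sketch that reproduces this computation (entropies and proportions taken from the example above; 13 of the 30 datapoints go to the left child, 17 to the right):

```python
def information_gain(h_parent, children):
    """IG = H(parent) - sum over children of prop(c) * H(c); children are (proportion, entropy) pairs."""
    return h_parent - sum(prop * h for prop, h in children)

# Entropies and proportions from the example: 13 of 30 datapoints go left, 17 go right.
ig = information_gain(0.99, [(13 / 30, 0.39), (17 / 30, 0.79)])
print(round(ig, 2))  # 0.37
```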
Notice the discretization of the numerical variable we split on
H(parent) = 0.99
H(left) = 0.54
H(center) = 0.97
H(right) = 0.98
IG(p, C) = 0.13
Discretization may reduce numerical dimensions to discrete ones
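A minimal sketch of such a discretization (the thresholds here are illustrative, not taken from the example above):

```python
def discretize(value, thresholds):
    """Map a numerical value to a discrete bin index, given a list of thresholds."""
    for i, threshold in enumerate(sorted(thresholds)):
        if value < threshold:
            return i
    return len(thresholds)

# A two-way split (one threshold) vs. a three-way split (two thresholds)
print(discretize(42, [50]))      # 0 -> left segment
print(discretize(42, [30, 60]))  # 1 -> center segment
```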
In regression, variance works as the analogue of information entropy.
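A minimal sketch of the regression analogue (weighted variance reduction in place of information gain; the data values are made up):

```python
from statistics import pvariance

def variance_reduction(parent_values, child_value_lists):
    """Analogue of IG: Var(parent) minus the weighted average of the children's variances."""
    n = len(parent_values)
    weighted_children = sum(len(c) / n * pvariance(c) for c in child_value_lists)
    return pvariance(parent_values) - weighted_children

parent = [10, 12, 30, 33, 35]
# Large reduction: the split separates low values from high values.
print(variance_reduction(parent, [[10, 12], [30, 33, 35]]))
```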
A graphical method used to visualize information gain:
The shaded area represents entropy;
the white area ‘reclaimed’ from the shade is the information gain at each attempt.
edible: 4208 (51.8%), poisonous: the rest.
H = \(-[0.518\log_2 0.518 + 0.482\log_2 0.482] = -[0.518\cdot(-0.949) + 0.482\cdot(-1.053)] = 0.999\)
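A quick check of that arithmetic (base-2 logs, class proportions as above):

```python
import math

p_edible, p_poisonous = 0.518, 0.482
h = -(p_edible * math.log2(p_edible) + p_poisonous * math.log2(p_poisonous))
print(round(h, 3))  # 0.999
```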
Each node \(\rightarrow\) leaves split represents a segmentation that increases purity, i.e., decreases information entropy or Gini impurity.
Iterate until the set of all root \(\rightarrow\) leaf trajectories gives a complete classification.
Measure: total entropy of the set of leaf segments.
Decision tree: a set of if-then rules over attribute (or discretized) values
Each observation falls into exactly one leaf, and each leaf is uniquely defined by a set of rules.
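For instance, a hypothetical tree for the golf data reads as nested if-then rules over attribute values (the splits below are illustrative, not the ones derived in these slides); each root \(\rightarrow\) leaf path is one rule, and every observation matches exactly one of them:

```python
def predict_play(outlook, humidity, wind):
    """Hypothetical decision tree for the golf example, written as if-then rules."""
    if outlook == "Sunny":
        return "No" if humidity == "High" else "Yes"
    if outlook == "Overcast":
        return "Yes"
    # remaining case: outlook == "Rain"
    return "No" if wind == "Strong" else "Yes"

print(predict_play("Sunny", "High", "Weak"))  # 'No', matching day 1 of the table above
```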
Even if one subgroup happens to be pure, the other may not be!
If the second person were not there, then body-color=gray would create a pure segment (write-off=no).
However, body-color=white still is not pure.
body-color=gray only splits off one single data point into the pure subset.
Is this better than a split that does not produce any pure subset, but reduces impurity more broadly?
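Information gain answers exactly this trade-off; a sketch with made-up counts (a hypothetical parent segment of 8 ‘yes’ and 12 ‘no’ write-offs) comparing the two kinds of split:

```python
import math

def entropy(pos, neg):
    """Base-2 entropy of a two-class segment given its class counts."""
    total = pos + neg
    return -sum(c / total * math.log2(c / total) for c in (pos, neg) if c > 0)

def information_gain(parent, children):
    """parent and children are (pos, neg) count pairs; the children partition the parent."""
    n = sum(parent)
    return entropy(*parent) - sum((p + q) / n * entropy(p, q) for p, q in children)

parent = (8, 12)  # hypothetical counts: 8 yes, 12 no
print(information_gain(parent, [(0, 1), (8, 11)]))  # ~0.04: carves off one pure point
print(information_gain(parent, [(6, 3), (2, 9)]))   # ~0.18: no pure child, but a broader reduction
```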