Ch. 3 of Provost-Fawcett’s Data Science for Business
introduces CART by example
illustrates the use of Entropy
and shows the training and testing of a predictive model.
A predictive model is a formula for estimating the unknown value of interest, often called the target attribute.
Regression: numerical target
Classification: class membership; a closely related task is class-probability estimation
A descriptive model is a formula for estimating the underlying phenomena and causal connections between values.
Descriptive modeling often is used to work towards a causal understanding of the data-generating process (e.g., why do users watch Sci-Fi sagas?)
Divide the dataset into segments (sets of rows) by the value of their output (target) variable.
If the segmentation is done using values of variables that will be known even when the target is not, then these segments can be used to predict the value of the target.
Example: the golf days example seen when studying Gini.
Day | Outlook | Temp. | Hum. | Wind | Play? |
---|---|---|---|---|---|
1 | Sunny | Hot | High | Weak | No |
2 | Sunny | Hot | High | Strong | ? |
What are the variables that contain important information about the target variable?
Can they be selected automatically?
Attributes
Target
Measure: purity of segments, i.e., all datapoints have the same target variable value.
1. non-binary attributes in binary classification
2. non-discrete attributes
3. attributes seldom split a dataset perfectly
4. attributes seldom split segments perfectly (see final slides)
Let’s focus on 3.
If an unlabelled datapoint falls within a pure segment, we can safely assign it the target variable value of its segment.
Day | Outlook | Temp. | Hum. | Wind | Play? |
---|---|---|---|---|---|
1 | Sunny | Hot | High | Strong | No |
2 | Sunny | Hot | High | Strong | No |
3 | Sunny | Hot | High | Strong | No |
6 | Sunny | Hot | High | Strong | ? |
Predict ‘No.’
Day | Outlook | Temp. | Hum. | Wind | Play? |
---|---|---|---|---|---|
1 | Sunny | Hot | High | Strong | No |
2 | Sunny | Hot | High | Strong | No |
3 | Sunny | Hot | High | Strong | No |
4 | Sunny | Hot | High | Strong | Yes |
6 | Sunny | Hot | High | Strong | ? |
assign each target value a probability equal to its frequency in the segment:
\(Pr[\textrm{Play}_6=\textrm{`No'}]=\frac{3}{4}\), \(Pr[\textrm{Play}_6=\textrm{`Yes'}]=\frac{1}{4}.\)
(alternative: majority voting?)
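A minimal sketch (plain Python, a segment stored as a list of target labels) of both options, class-probability estimation and majority voting:

```python
from collections import Counter

def class_probabilities(labels):
    """Estimate Pr[target = v] for each value v as its frequency in the segment."""
    counts = Counter(labels)
    return {value: count / len(labels) for value, count in counts.items()}

def majority_vote(labels):
    """Alternative: predict the single most frequent target value in the segment."""
    return Counter(labels).most_common(1)[0][0]

segment = ["No", "No", "No", "Yes"]   # Play? values of days 1-4 above
print(class_probabilities(segment))    # {'No': 0.75, 'Yes': 0.25}
print(majority_vote(segment))          # 'No'
```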
We would improve prediction by decreasing segment impurity.
Entropy (or Gini impurity) drives segmentation.
IG measures how much an attribute reduces entropy over the whole segmentation it creates.
How much purer are the children wrt. their parent segment?
IG(parent, children) = H(parent) − \([prop(c_1)\cdot H(c_1) + prop(c_2)\cdot H(c_2) + \dots ]\)
where \(prop(c_i)\) is the proportion of elements assigned to child \(c_i\): \(prop(c_i)=\frac{|c_i|}{n}\)
[\(IG(P, C)\) is connected with Kullback-Leibler divergence]
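To make that connection concrete (a standard identity, not spelled out in the slides): when the children are induced by the values of an attribute \(X\) and the class is \(Y\), the information gain equals the mutual information between \(X\) and \(Y\), which is a Kullback-Leibler divergence:
\[
IG = I(X;Y) = D_{\mathrm{KL}}\big(p(x,y)\,\|\,p(x)\,p(y)\big) = \sum_{x,y} p(x,y)\log\frac{p(x,y)}{p(x)\,p(y)}
\]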
H(parent) = 0.99
H(left) = 0.39
H(right) = 0.79
\(IG(p, C) = 0.99 -[13/30\cdot 0.39 + 17/30 \cdot 0.79] = 0.37\)
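A small sketch that reproduces this computation (entropies and proportions taken from the example above; 13 of the 30 datapoints go to the left child, 17 to the right):

```python
def information_gain(h_parent, children):
    """IG = H(parent) - sum over children of prop(c) * H(c); children are (proportion, entropy) pairs."""
    return h_parent - sum(prop * h for prop, h in children)

# Entropies and proportions from the example: 13 of 30 datapoints go left, 17 go right.
ig = information_gain(0.99, [(13 / 30, 0.39), (17 / 30, 0.79)])
print(round(ig, 2))  # 0.37
```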
Notice the discretization of the numerical variable we split on
H(parent) = 0.99
H(left) = 0.54
H(center) = 0.97
H(right) = 0.98
IG(p, C) = 0.13
Discretization may reduce numerical dimensions to discrete ones
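A minimal sketch of such a discretization (the thresholds here are illustrative, not taken from the example above):

```python
def discretize(value, thresholds):
    """Map a numerical value to a discrete bin index, given a list of thresholds."""
    for i, threshold in enumerate(sorted(thresholds)):
        if value < threshold:
            return i
    return len(thresholds)

# A two-way split (one threshold) vs. a three-way split (two thresholds)
print(discretize(42, [50]))      # 0 -> left segment
print(discretize(42, [30, 60]))  # 1 -> center segment
```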
In regression, variance works as the analogue of information entropy.
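A minimal sketch of the regression analogue (weighted variance reduction in place of information gain; the data values are made up):

```python
from statistics import pvariance

def variance_reduction(parent_values, child_value_lists):
    """Analogue of IG: Var(parent) minus the weighted average of the children's variances."""
    n = len(parent_values)
    weighted_children = sum(len(c) / n * pvariance(c) for c in child_value_lists)
    return pvariance(parent_values) - weighted_children

parent = [10, 12, 30, 33, 35]
# Large reduction: the split separates low values from high values.
print(variance_reduction(parent, [[10, 12], [30, 33, 35]]))
```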
A graphical method used to visualize information gain:
The shaded area represents entropy;
the white area ‘reclaimed’ from the shade is the information gain at each attempt.
edible: 4208 (51.8%), poisonous: the rest.
H = \(-[0.518\log_2 0.518 + 0.482\log_2 0.482] = -[0.518\cdot(-0.949) + 0.482\cdot(-1.053)] = 0.999\)
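A quick check of that arithmetic (base-2 logs, class proportions as above):

```python
import math

p_edible, p_poisonous = 0.518, 0.482
h = -(p_edible * math.log2(p_edible) + p_poisonous * math.log2(p_poisonous))
print(round(h, 3))  # 0.999
```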
Each node \(\rightarrow\) leaves split represents a segmentation that increases purity, i.e., decreases information entropy or Gini impurity.
Iterate until the set of all root \(\rightarrow\) leaf trajectories gives a complete classification.
Measure: total entropy of the set of leaf segments.
Decision tree: a set of if-then rules over attribute (or discretized) values
Each observation falls into exactly one leaf, and each leaf is uniquely defined by a set of rules.
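For instance, a hypothetical tree for the golf data reads as nested if-then rules over attribute values (the splits below are illustrative, not the ones derived in these slides); each root \(\rightarrow\) leaf path is one rule, and every observation matches exactly one of them:

```python
def predict_play(outlook, humidity, wind):
    """Hypothetical decision tree for the golf example, written as if-then rules."""
    if outlook == "Sunny":
        return "No" if humidity == "High" else "Yes"
    if outlook == "Overcast":
        return "Yes"
    # remaining case: outlook == "Rain"
    return "No" if wind == "Strong" else "Yes"

print(predict_play("Sunny", "High", "Weak"))  # 'No', matching day 1 of the table above
```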
Even if one subgroup happens to be pure, the other may not be!
If the second person were not there, then body-color=gray would create a pure segment (write-off=no).
However, body-color=white still is not pure.
body-color=gray only splits off one single data point into the pure subset.
Is this better than a split that does not produce any pure subset, but reduces impurity more broadly?
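Information gain answers exactly this trade-off; a sketch with made-up counts (a hypothetical parent segment of 8 ‘yes’ and 12 ‘no’ write-offs) comparing the two kinds of split:

```python
import math

def entropy(pos, neg):
    """Base-2 entropy of a two-class segment given its class counts."""
    total = pos + neg
    return -sum(c / total * math.log2(c / total) for c in (pos, neg) if c > 0)

def information_gain(parent, children):
    """parent and children are (pos, neg) count pairs; the children partition the parent."""
    n = sum(parent)
    return entropy(*parent) - sum((p + q) / n * entropy(p, q) for p, q in children)

parent = (8, 12)  # hypothetical counts: 8 yes, 12 no
print(information_gain(parent, [(0, 1), (8, 11)]))  # ~0.04: carves off one pure point
print(information_gain(parent, [(6, 3), (2, 9)]))   # ~0.18: no pure child, but a broader reduction
```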