Decision trees

DSTA

Decision trees

  • A simple-yet-effective classification algorithm: CART
  • it introduces us to predictive modeling

Study Plan

Ch. 3 of Provost-Fawcett’s Data Science for Business

  • introduces CART by example

  • illustrates the use of Entropy

  • shows the training and testing of a predictive model.

Models

A simplified representation of reality

A predictive model is a formula for estimating the unknown value of interest, often called the target attribute.

Regression: numerical target

Classification: class membership, e.g., class-probability estimation

A descriptive model is a formula for estimating the underlying phenomena and causal connections between values.

Descriptive modeling often is used to work towards a causal understanding of the data-generating process (e.g., why do users watch Sci-Fi sagas?)

Supervised segmentation

divide the dataset into segments (sets of rows) by the value of their output variable.

If the segmentation is done using values of variables that will be known when the target is not, then these segments can be used to predict the value of the target.

Example: the golf days example seen when studying Gini.

Day Outlook Temp. Hum. Wind Play?
1 Sunny Hot High Weak No
2 Sunny Hot High Strong ?
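
A minimal sketch of such a segmentation in plain Python, assuming a handful of toy rows modeled on the table above (not the full dataset):

```python
from collections import defaultdict

# Toy rows modeled on the golf-days table above (values are illustrative).
rows = [
    {"Outlook": "Sunny",    "Temp": "Hot",  "Hum": "High",   "Wind": "Weak",   "Play": "No"},
    {"Outlook": "Sunny",    "Temp": "Hot",  "Hum": "High",   "Wind": "Strong", "Play": "No"},
    {"Outlook": "Overcast", "Temp": "Mild", "Hum": "Normal", "Wind": "Weak",   "Play": "Yes"},
    {"Outlook": "Rain",     "Temp": "Cool", "Hum": "Normal", "Wind": "Weak",   "Play": "Yes"},
]

def segment_by(rows, attribute, target="Play"):
    """Group rows by the value of `attribute` and collect the target values per segment."""
    segments = defaultdict(list)
    for row in rows:
        segments[row[attribute]].append(row[target])
    return dict(segments)

print(segment_by(rows, "Outlook"))
# {'Sunny': ['No', 'No'], 'Overcast': ['Yes'], 'Rain': ['Yes']}
```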

Questions

What are the variables that contain important information about the target variable?

Can they be selected automatically?

Selecting informative attributes

  • Attributes

    • head-shape: {square, circular}
    • body-shape: {rectangular, oval}
    • body-color: {gray, white}
  • Target

    • write-off: {yes, no}

Measure:

purity of segments: all datapoints have the same target variable value.
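
A purity check is a one-liner; a minimal sketch with hypothetical write-off values:

```python
def is_pure(target_values):
    """A segment is pure if all its datapoints share one target value."""
    return len(set(target_values)) <= 1

print(is_pure(["yes", "yes", "yes"]))  # True  -> pure segment
print(is_pure(["yes", "no", "yes"]))   # False -> impure segment
```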

Complications

  1. non-binary attributes in binary classification

  2. non-discrete attributes

  3. attributes seldom split a dataset perfectly.

  4. attributes seldom split segments perfectly (see final slides)

Let’s focus on 3.

The importance of purity

If an unlabelled datapoint falls within a pure segment, we can safely assign it the target variable value of its segment.

Day Outlook Temp. Hum. Wind Play?
1 Sunny Hot High Strong No
2 Sunny Hot High Strong No
3 Sunny Hot High Strong No
6 Sunny Hot High Strong ?

Predict ‘No.’

Impurity

Day Outlook Temp. Hum. Wind Play?
1 Sunny Hot High Strong No
2 Sunny Hot High Strong No
3 Sunny Hot High Strong No
4 Sunny Hot High Strong Yes
6 Sunny Hot High Strong ?

assign each target value with probability equal to its frequency in the segment:

\(Pr[\textrm{Play}_6=\textrm{`No'}]=\frac{3}{4}\), \(Pr[\textrm{Play}_6=\textrm{`Yes'}]=\frac{1}{4}.\)

(alternative: majority voting?)
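
A minimal sketch of this class-probability estimate in plain Python, using the four labelled days above as the segment:

```python
from collections import Counter

segment = ["No", "No", "No", "Yes"]           # target values of the labelled days

counts = Counter(segment)
n = len(segment)
probabilities = {label: count / n for label, count in counts.items()}
print(probabilities)                           # {'No': 0.75, 'Yes': 0.25}

# Majority voting would instead return the single most frequent label:
print(counts.most_common(1)[0][0])             # 'No'
```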

Entropy and impurity

We would improve prediction by decreasing segment impurity.

Entropy (or Gini impurity) drives segmentation.
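
Both measures are functions of the class proportions within a segment; a minimal sketch in Python, using base-2 logarithms as in the rest of the slides:

```python
from collections import Counter
from math import log2

def entropy(target_values):
    """Shannon entropy H = -sum(p_i * log2(p_i)) over the class proportions."""
    n = len(target_values)
    proportions = [count / n for count in Counter(target_values).values()]
    return -sum(p * log2(p) for p in proportions if p > 0)

def gini_impurity(target_values):
    """Gini impurity = 1 - sum(p_i^2) over the class proportions."""
    n = len(target_values)
    proportions = [count / n for count in Counter(target_values).values()]
    return 1 - sum(p * p for p in proportions)

segment = ["No", "No", "No", "Yes"]
print(round(entropy(segment), 3))        # 0.811
print(round(gini_impurity(segment), 3))  # 0.375
```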

Information gain

Definition

IG measures how much an attribute decreases entropy over the whole segmentation it creates.

How much purer are the children wrt. their parent segment?

\(IG(\textrm{parent}, \textrm{children}) = H(\textrm{parent}) - [prop(c_a)\cdot H(c_a) + prop(c_b)\cdot H(c_b) + \dots ]\)

where \(prop(c_i)\) is the proportion of elements assigned to child \(c_i\): \(prop(c_i)=\frac{|c_i|}{n}\).

[\(IG(P, C)\) is connected with Kullback-Leibler divergence]

H(parent) = 0.99

H(left) = 0.39

H(right) = 0.79

\(IG(p, C) = 0.99 - [13/30\cdot 0.39 + 17/30 \cdot 0.79] = 0.37\)
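
A quick numerical check of this worked example, reusing the child entropies and the 13/30 and 17/30 proportions given above:

```python
def information_gain(h_parent, children):
    """IG = H(parent) - sum over children of prop(c) * H(c)."""
    return h_parent - sum(prop * h for prop, h in children)

# (proportion, entropy) for the left and right child segments
children = [(13 / 30, 0.39), (17 / 30, 0.79)]
print(round(information_gain(0.99, children), 2))   # 0.37
```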

Notice the discretization of the numerical variable we split on.

H(parent) = 0.99

H(left) = 0.54

H(center) = 0.97

H(right) = 0.98

IG(p, C) = 0.13
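
A minimal sketch of such a discretization, binning a hypothetical numeric attribute (e.g., age) with hypothetical cut points; the figure's actual thresholds are not given in the text:

```python
def discretize(value, thresholds=(40, 60)):
    """Map a numeric value to one of three bins: 'low' / 'mid' / 'high' (hypothetical cut points)."""
    low, high = thresholds
    if value < low:
        return "low"
    if value < high:
        return "mid"
    return "high"

ages = [23, 38, 45, 52, 61, 70]
print([discretize(a) for a in ages])   # ['low', 'low', 'mid', 'mid', 'high', 'high']
```

Each bin then becomes a child segment, and IG is computed over the three children exactly as in the binary case.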

Numeric targets

Discretization may reduce numerical dimensions to discrete ones

In regression, variance works as the analogue of information entropy.
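
A minimal sketch of variance as the splitting criterion, with toy numeric target values; the analogue of information gain is the weighted variance reduction:

```python
def variance(values):
    """Population variance of a segment's numeric target values."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def variance_reduction(parent, children):
    """Analogue of IG: Var(parent) - weighted sum of the children's variances."""
    n = len(parent)
    return variance(parent) - sum(len(c) / n * variance(c) for c in children)

parent = [10, 12, 30, 33, 11, 31]            # toy numeric target
children = [[10, 12, 11], [30, 33, 31]]      # candidate split
print(round(variance_reduction(parent, children), 2))
```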

Attribute selection with IG

By example

A graphical method to visualize Information gain:

The shaded area represents Entropy.

The white area ‘reclaimed’ from the shade is the Information gain at each attempt.

edible: 4208 (51.8%), poisonous: the rest.

H = \(-[0.518\log_2 0.518 + 0.482\log_2 0.482]\) = \(-[.518\cdot -.949 + .482\cdot -1.053]\) = .999
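
A quick numerical check of this base-rate entropy:

```python
from math import log2

p_edible, p_poisonous = 0.518, 0.482
h = -(p_edible * log2(p_edible) + p_poisonous * log2(p_poisonous))
print(round(h, 3))   # 0.999
```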

Decision trees (DTs)

DT: iterated supervised segmentation

A node \(\rightarrow\) leaves split represents a segmentation that increases purity, i.e., decreases information entropy or Gini impurity.

Iterate until the set of all root \(\rightarrow\) leaf trajectories gives a complete classification.

Measure: total entropy of the set of leaf segments.

Decision tree: a set of if-then rules over attribute (or discretized) values

Each observation falls into exactly one leaf, and each leaf is uniquely defined by a set of rules.
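
As a sketch of this rule view, scikit-learn's DecisionTreeClassifier (an optimized CART implementation) can be fit on a one-hot encoding of a toy golf-style table and its if-then rules printed; the rows below are illustrative, and pandas/scikit-learn are assumed to be available:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy rows in the style of the golf-days table (illustrative values only).
data = pd.DataFrame({
    "Outlook": ["Sunny", "Sunny",  "Overcast", "Rain", "Rain",   "Overcast"],
    "Hum":     ["High",  "High",   "Normal",   "High", "Normal", "High"],
    "Wind":    ["Weak",  "Strong", "Weak",     "Weak", "Strong", "Strong"],
    "Play":    ["No",    "No",     "Yes",      "Yes",  "No",     "Yes"],
})

X = pd.get_dummies(data.drop(columns="Play"))   # one-hot encode the categorical attributes
y = data["Play"]

tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))   # the tree as if-then rules
```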

CART, visually

Attributes seldom split groups perfectly

Even if one subgroup happens to be pure, the other may not!

If the second person were not there, then body-color=gray would create a pure segment (write-off=no).

However, body-color=white is still not pure.

body-color=gray only splits off one single data point into the pure subset.

Is this better than a split that does not produce any pure subset, but reduces impurity more broadly?
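
Information gain gives a quantitative answer; a minimal sketch comparing the two kinds of split on a hypothetical ten-point parent segment (the counts are illustrative, not the book's figure):

```python
from math import log2

def entropy(counts):
    """Entropy from raw class counts, e.g. (yes, no)."""
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)

def information_gain(parent, children):
    """IG = H(parent) - weighted sum of the children's entropies (counts per segment)."""
    n = sum(parent)
    return entropy(parent) - sum(sum(c) / n * entropy(c) for c in children)

parent = (5, 5)                      # 5 write-off=yes, 5 write-off=no (hypothetical)

# Split A: peels off a single pure datapoint, leaves the rest almost unchanged.
split_a = [(0, 1), (5, 4)]
# Split B: no pure child, but both children lean clearly one way.
split_b = [(1, 4), (4, 1)]

print(round(information_gain(parent, split_a), 3))
print(round(information_gain(parent, split_b), 3))   # larger: the broader impurity reduction wins
```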