Decision Trees¶
A piecewise-constant model that adaptively learns to divide the predictor space into regions and then fits a simple model (a constant) in each region, without human intervention for deciding the cuts
Can be applied to both regression and classification
Advantages¶
- Interpretable, for small trees
- Little data preprocessing required
Disadvantages¶
- Small trees are not powerful
- Large trees tend to overfit and are hard to regularize
- Do not model linear relationships well
- Cannot extrapolate well
- Decision regions tend to be highly fragmented
- The resulting function (and hence the decision boundary, for classification) is very non-smooth and blocky
Terms¶
Term | Meaning |
---|---|
Internal Nodes | Labelled by attribute names; contain the splitting attributes (the test on an attribute) |
Leaf/Terminal Nodes | Labelled by class labels |
Edges | Determined by the number of outcomes of an attribute's test condition |
Size | Number of leaves |
Goal¶
Find cuts that produce regions \(R_m\) which improve in-sample fit, while minimizing out-of-sample loss
Since it is computationally infeasible to consider every possible partition of the predictor space, we adopt a forward stepwise (greedy) procedure
Decision Tree Approaches¶
Approach | Steps | Disadvantage |
---|---|---|
Greedy | Divide the problem into steps and take a decision at each step. Build the tree in a top-down manner. Split the training set into purer subsets (the best split) at every node | Backtracking is not possible |
Recursive | Apply the same splitting procedure recursively to each resulting subset | |
Regression Tree¶
\(\hat f(x) = \sum \limits_{m=1}^{M} \hat c_m \cdot 1 \{ x \in R_m \}, \qquad \hat c_m = \dfrac{1}{n_m} \sum \limits_{x_i \in R_m} y_i\)
where
- \(M =\) no of leaves = no of regions
- \(R_m =\) region \(m\)
- \(n_m=\) no of obs in \(R_m\)
For every observation that falls into region \(R_m\), we make the same prediction, which is simply the mean of the response values for the training observations in \(R_m\)
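A minimal sketch of this prediction rule, assuming a single predictor and one already-chosen cut (the split value `0.5` is hypothetical, for illustration):

```python
import numpy as np

# Toy data: one predictor, two regions separated at x = 0.5
# (a hypothetical split; a real tree learns the cut from the data)
x = np.array([0.1, 0.2, 0.4, 0.6, 0.8, 0.9])
y = np.array([1.0, 1.2, 0.9, 3.1, 2.8, 3.0])

split = 0.5
left = x < split

# The fitted constant in each region is the mean response there
c_left, c_right = y[left].mean(), y[~left].mean()

def predict(x_new):
    """Piecewise-constant prediction: the same value for every x in a region."""
    return np.where(x_new < split, c_left, c_right)

print(predict(np.array([0.3, 0.7])))  # -> [c_left, c_right]
```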
CART Algorithm¶
Greedy recursive partitioning algorithm that divides predictor space through successive binary splits, until a stopping criterion is reached.
This is greedy because at each step of the tree-building process, the optimal split is made at that particular step, rather than looking ahead and picking a split that will lead to a better tree in some future step
Steps¶
- At each step, generate the binary split of the predictor space into two regions that achieves the biggest reduction in \(E_\text{in}\)
- The 2 regions should be as homogeneous in the response \(y\) as possible
- \(R_\text{left} = \{ x \in R: x_j < s \}\)
- \(R_\text{right} = \{ x \in R: x_j \ge s \}\)
- For regression, this means choosing \(j, s\) to solve
\(\arg \min\limits_{j, s} \left \{ \sum \limits_{i: \, x_i \in R_\text{left}} L(y_i, \hat y_\text{left}) + \sum \limits_{i: \, x_i \in R_\text{right}} L(y_i, \hat y_\text{right}) \right \}\)
- Prune the tree by choosing a subtree \(T \subset T_0\) that minimizes
\(\sum \limits_{m=1}^{\vert T \vert} \sum \limits_{i: \, x_i \in R_m} L(y_i, \hat y_m) + \alpha \vert T \vert, \quad \alpha \ge 0\)
- \(\vert T \vert =\) size of tree (number of leaves)
- \(\alpha\) controls the tradeoff b/w the subtree's complexity and its fit to the training data
- \(\alpha=0 \implies T=T_0\)
- Larger \(\alpha \implies\) smaller \(\vert T \vert\)
- Optimal \(\alpha\) obtained through cross-validation (a sketch follows this list)
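scikit-learn implements this pruning as minimal cost-complexity pruning; a sketch of choosing \(\alpha\) by cross-validation (the synthetic dataset and seeds are arbitrary):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, (200, 2))
y = np.sin(4 * X[:, 0]) + rng.normal(0, 0.2, 200)

# Grow a large tree T0 and compute the candidate alphas along its pruning path
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X, y)

# Cross-validate a pruned tree for each alpha and keep the best one
scores = [
    cross_val_score(
        DecisionTreeRegressor(ccp_alpha=a, random_state=0), X, y, cv=5
    ).mean()
    for a in path.ccp_alphas
]
best_alpha = path.ccp_alphas[int(np.argmax(scores))]
pruned = DecisionTreeRegressor(ccp_alpha=best_alpha, random_state=0).fit(X, y)
print(best_alpha, pruned.get_n_leaves())
```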
Classification Tree¶
\(\hat c_m =\) most frequent class in \(R_m\)
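A one-line version of this rule, assuming integer-encoded labels:

```python
import numpy as np

y_region = np.array([0, 1, 1, 2, 1, 0])  # labels of training points in R_m
c_hat = np.bincount(y_region).argmax()   # most frequent class in the region
print(c_hat)  # -> 1
```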
Node Impurity¶
Measure | Formula |
---|---|
Misclassification Rate | \(1 - \max\limits_c \hat p_c^m\) |
Gini Index | \(\sum \limits_{c=1}^C \hat p_c^m (1 - \hat p_c^m)\) |
Cross-Entropy | \(- \sum \limits_{c=1}^C \hat p_c^m \cdot \ln \hat p_c^m\) |
where \(\hat p_c^m =\) proportion of class \(c\) among the training observations in region \(R_m\)
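The three measures as functions of the class proportions at a node (a minimal NumPy sketch):

```python
import numpy as np

def class_proportions(y):
    """p_c^m: proportion of each class among the records at a node."""
    counts = np.bincount(y)
    return counts / counts.sum()

def misclassification(p):
    return 1 - p.max()

def gini(p):
    return np.sum(p * (1 - p))

def entropy(p):
    p = p[p > 0]  # zero-probability classes contribute 0 to the sum
    return -np.sum(p * np.log(p))

p = class_proportions(np.array([0, 0, 1, 1, 1, 2]))
print(misclassification(p), gini(p), entropy(p))
```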
Steps¶
- Pick an independent variable
- Find the entropy of the class labels within each value of that independent variable
- Calculate the gain of each independent variable for the current set/subset of data
- Pick the independent variable with the highest gain
- Recursively repeat for each resulting subset (a sketch of the gain computation follows this list)
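A sketch of the entropy/gain computation for one categorical attribute (the attribute names and records are hypothetical):

```python
import numpy as np

def entropy(y):
    p = np.bincount(y) / len(y)
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def information_gain(attribute, y):
    """Entropy of the parent minus the weighted entropy of the children
    produced by splitting on each value of the attribute."""
    children = sum(
        (attribute == v).mean() * entropy(y[attribute == v])
        for v in np.unique(attribute)
    )
    return entropy(y) - children

# Hypothetical records: pick the attribute with the highest gain
outlook = np.array([0, 0, 1, 1, 2, 2])  # e.g. sunny / overcast / rainy
windy = np.array([0, 1, 0, 1, 0, 1])
y = np.array([0, 0, 1, 1, 1, 0])
print(information_gain(outlook, y), information_gain(windy, y))
```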
Hunt’s Algorithm¶
Let
- \(t\) be a node
- \(D_t\) be train dataset @ node \(t\)
- \(y\) be the set of class labels \(\{ C_1, C_2, \dots, C_n \}\)
If all records in \(D_t\) belong to the same class \(y_t\), make node \(t\) into a leaf node and label \(t\) with class label \(y_t\). Otherwise, split using an appropriate attribute:
- Apply the splitting criterion to each attribute and obtain the impurity of the split using that attribute
- Pick the attribute that gives the lowest impurity
- Label the node \(t\) with this attribute
- Determine the outcomes
- Create nodes for each outcome
- Draw the edges
- Split the dataset
- Recursively repeat these steps for each subtree (a minimal sketch of the whole procedure follows this list)
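A compact recursive sketch of the procedure above, assuming integer-encoded categorical attributes and Gini impurity (`grow` and the dict-based tree are illustrative, not a library API):

```python
import numpy as np

def gini(y):
    p = np.bincount(y) / len(y)
    return np.sum(p * (1 - p))

def grow(X, y, attrs, parent_majority=None, min_records=2):
    # Empty subset: leaf labelled with the parent's majority class
    if len(y) == 0:
        return {"label": parent_majority}
    majority = int(np.bincount(y).argmax())
    # Pure node, no attributes left, or too few records: make a leaf
    if len(np.unique(y)) == 1 or not attrs or len(y) < min_records:
        return {"label": majority}

    # Weighted impurity of splitting on attribute a
    def split_impurity(a):
        return sum(
            (X[:, a] == v).mean() * gini(y[X[:, a] == v])
            for v in np.unique(X[:, a])
        )

    # Pick the attribute with the lowest impurity, split, and recurse
    best = min(attrs, key=split_impurity)
    rest = [a for a in attrs if a != best]
    children = {
        int(v): grow(X[X[:, best] == v], y[X[:, best] == v], rest, majority)
        for v in np.unique(X[:, best])
    }
    return {"attr": best, "children": children}

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 1, 1])
print(grow(X, y, attrs=[0, 1]))
```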
Additional Cases¶
Empty Subset¶
Suppose that, when splitting the data, you end up with an empty subset
- Make node \(t\) into a leaf node
- \(t\) is labelled as the majority class of the parent dataset
All records have identical attribute values¶
- Make \(t\) into a leaf node
- \(t\) is labelled as the majority class represented in the current subset
Number of records falls below a minimum threshold¶
- Make \(t\) into a leaf node
- \(t\) is labelled as the majority class represented in the current subset
Splitting Continuous Attributes¶
Binary Split¶
Choose the splitting value that results in the purest partition (a sketch follows the diagrams below)
Multi-Way Split¶
- Apply discretization
- Each bin will be a different node
flowchart LR
subgraph Multi-Way Split
direction TB
s2((Salary))
a2(( ))
b2(( ))
c2(( ))
d2(( ))
e2(( ))
s2 -->|< 10k| a2
s2 -->|10k - 30k| b2
s2 -->|30k - 50k| c2
s2 -->|50k - 80k| d2
s2 -->| > 80k| e2
end
subgraph Binary
direction TB
s1((Salary))
a1(( ))
b1(( ))
%% 50k is an illustrative threshold
s1 -->|< 50k| a1
s1 -->|>= 50k| b1
end
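A sketch of both approaches on a continuous attribute; the salary values, labels, and bin edges are illustrative (the bins mirror the diagram above):

```python
import numpy as np

def gini(y):
    p = np.bincount(y) / len(y)
    return np.sum(p * (1 - p))

def best_binary_split(x, y):
    """Try midpoints between consecutive sorted values and return the
    threshold whose two partitions have the lowest weighted impurity."""
    xs = np.unique(x)
    candidates = (xs[:-1] + xs[1:]) / 2

    def weighted_impurity(s):
        left = x < s
        return left.mean() * gini(y[left]) + (~left).mean() * gini(y[~left])

    return min(candidates, key=weighted_impurity)

salary = np.array([8_000, 25_000, 40_000, 60_000, 90_000, 120_000])
label = np.array([0, 0, 0, 1, 1, 1])
print(best_binary_split(salary, label))  # -> 50000.0

# Multi-way split: discretize into bins; each bin becomes a child node
bins = [10_000, 30_000, 50_000, 80_000]
print(np.digitize(salary, bins))  # bin index for each record
```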
Challenges¶
- How to split training records?
- Find best splitting attribute (Test on Attribute)
- Measure for goodness of split
- Stopping conditions: stopping only when the tree is fully grown (the tree has learnt all the characteristics of the training data) may not be optimal; in that case we may require other stopping conditions, such as
- When all records at a node have the same attribute value
- When all records at a node have the same class label value
- When a node receives an empty subset
Attribute Test Condition and Outcomes¶
Nominal¶
A binary split will have \(2^{k-1} - 1\) possible combinations, where \(k=\) no of values of the attribute (these are enumerated in code after the diagram). For example:
flowchart LR
subgraph 1
direction TB
aa[Attribute] --> v1a[v1] & notv1a["Not v1 <br> (v2 or v3 or v4)"]
end
subgraph 2
direction TB
ab[Attribute] --> v1b[v1 or v2] & notv1b["Not (v1 or v2) <br> (v3 or v4)"]
end
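The groupings can be enumerated directly; a minimal sketch (fixing the first value on one side avoids double-counting mirror-image splits, which is where \(2^{k-1} - 1\) comes from):

```python
from itertools import combinations

def binary_splits(values):
    """All ways to divide k nominal values into two non-empty groups:
    2^(k-1) - 1 distinct splits."""
    first, rest = values[0], values[1:]
    for r in range(len(rest) + 1):
        for extra in combinations(rest, r):
            left = {first, *extra}
            right = set(values) - left
            if right:  # skip the degenerate split with everything on one side
                yield left, right

for left, right in binary_splits(["v1", "v2", "v3", "v4"]):
    print(left, "|", right)  # prints 2^3 - 1 = 7 splits
```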
Missed this class¶
Terms¶
Variables
- \(I(\text{Parent}) =\) Impurity at parent
- \(N =\) no of records before split (parent)
- \(k =\) no of splits
- \(V_j =\) split \(j\)
- \(N(V_j) =\) no of records at split \(V_j\)
- \(I(V_j) =\) impurity at child (split) \(V_j\)
Steps to choose splitting attribute¶
- Compute degree of impurity of parent node (before splitting)
- Compute degree of impurity of child nodes (after splitting)
- Compute the weighted average impurity of the split: \(I(\text{Children}) = \sum \limits_{j=1}^{k} \dfrac{N(V_j)}{N} I(V_j)\)
- Compute the gain \(\Delta = I(\text{Parent}) - I(\text{Children})\); the larger the gain, the better the test condition
- Repeat for all attributes
- Choose the attribute with the largest gain (a sketch follows this list)
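These steps in code (a minimal sketch; swapping `gini` for entropy turns \(\Delta\) into information gain, per the note below):

```python
import numpy as np

def gini(y):
    p = np.bincount(y) / len(y)
    return np.sum(p * (1 - p))

def gain(splits, y_parent):
    """Delta = I(Parent) - sum_j N(V_j)/N * I(V_j)."""
    N = len(y_parent)
    weighted = sum(len(y_j) / N * gini(y_j) for y_j in splits)
    return gini(y_parent) - weighted

y = np.array([0, 0, 0, 1, 1, 1])
split_a = [y[:3], y[3:]]     # perfectly pure children
split_b = [y[::2], y[1::2]]  # children almost as mixed as the parent
print(gain(split_a, y), gain(split_b, y))  # larger gain -> better split
```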
Note¶
"Information gain" refers to the gain \(\Delta\) computed when the impurity measure is entropy