Game theoretic Shapley values have recently become a popular way to explain the predictions of tree-based machine learning models. This rise in popularity was made possible by new efficient algorithms. Here we examine how these algorithms work, and how they solve what is in general an NP-hard problem in linear time for trees.
One of the most popular classes of machine learning models is tree-based models. A 2017 survey of data scientists and researchers found that tree-based models occupied both the second and third spots among the most popular classes of methods (1). Although small tree models can be interpretable on their own, the large ensembles of trees used in practice (e.g., random forests and gradient boosted trees) are not, which motivates methods that explain their predictions.
This article discusses how to explain tree-based models with Shapley values - a unique game-theoretic solution for spreading credit among features. First, we discuss how Shapley values can be applied to machine learning models (Section 1). Then, since exactly computing Shapley values for an arbitrary model is NP-hard, we show how to compute them exactly for trees, starting from a brute-force algorithm and building up to a dynamic programming algorithm that runs in time linear in the size of the tree (Section 2). Finally, we compare this approach to alternative methods for explaining tree models.
What is the goal of this article?
Given that there is a long history of model explanations going awry when users do not understand what an explanation means (e.g., p-values for linear models), a major goal of this article is to build a clear understanding of what Shapley value explanations mean and how the algorithms that compute them work.
In this section, we introduce the notion of feature attributions. Then, we introduce Shapley values and describe the ways in which they have been used to explain machine learning models, using an example with a linear model to motivate a specific extension of the Shapley values. With this formulation, we show how obtaining these Shapley values reduces to an average of Baseline Shapley values that can be thought of as explanations with respect to a single foreground sample (sample being explained that we will denote as \(x^f\)) and a single background sample (sample the foreground sample is compared to that we will denote as \(x^b\)).
The goal of feature attribution methods is to explain a model by assigning an importance value for each feature the model uses. In order to discuss the goal of feature attribution, it is useful to think about a linear model: $$ f(x)=\beta_0 + \beta_1 x_1 + \cdots + \beta_d x_d $$ Linear models are considered inherently interpretable because they summarize the importance for each feature with a scalar value where the importance (\(\phi\)) of a given feature (\(i\)) for a linear model (\(f\)) is: $$ \phi_i(f)=\beta_i $$ The attribution \(\phi_i(f)\) is known as a global feature attribution that explains the model as a whole. This is possible because features in a linear model are constrained to have a linear relationship with the model prediction, so a sufficient statistic to describe the relationship between any feature and the output of the model is the slope of that feature.
However, in many cases, it may be preferable to give a more individualized explanation that is not for the model as a whole, but rather for a specific sample's prediction (\(f(x^f)\)). For linear models, we could do so by returning an attribution of the following form: $$ \phi_i(f,x^f)=\beta_i x^f_i $$ The attribution \(\phi_i(f,x^f)\) is known as a local feature attribution that explains the model for a specific individual. Here, the attributions are easy to obtain because the relationship between \(x^f\) and \(f(x^f)\) is constrained to both be linear and consist only of marginal effects.
However, many easily interpretable models require making unrealistic assumptions (e.g., linear relationships in linear models, strictly marginal or pairwise effects in generalized additive models). Instead, if we use non-linear models that allow for complex interaction effects between variables, we can much better capture relationships in our data. For these models, summarizing non-linear effects with a single number (a global feature attribution \(\phi_i(f)\)) may not make sense, because we no longer have a slope. Instead, we can aim for local feature attributions (\(\phi_i(f,x^f)\)) that will explain the model's prediction for a specific individual. In order to do so in a principled manner, we can turn to the Shapley values.
Shapley values are a method to spread credit among players in a "coalitional game". We can define the players to be a set \(N=\{1,\cdots,d\}\). Then, the coalitional game is a function that maps subsets of the players to a scalar value: $$ v:\text{Power Set}(N)\to\mathbb{R} $$
To make these concepts more concrete, we can imagine a company that makes a profit \(v(S)\) that is determined by what combination of individuals they employ \(S\). Furthermore, let's assume we know \(v(S)\) for all possible teams of employees. Then, the Shapley values assign credit to an individual \(i\) by taking a weighted average of how much the profit increases when \(i\) works with a group \(S\) versus when they do not. Repeating this for all possible teams \(S\) and taking the weighted average gives us the Shapley values: $$ \overbrace{\phi_i(v)}^{\mathclap{\text{Shapley value of }i}} = \underbrace{\sum_{S\subseteq N\setminus\{i\}}}_{\mathclap{\text{All subsets w/o }i}} \overbrace{\frac{|S|!(|N|-|S|-1)!}{|N|!}}^{\mathclap{\text{Weight }W(|S|,|N|)}} ( \underbrace{v(S\cup\{i\})-v(S)}_{\text{Profit individual }i\text{ adds}}) $$
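To make the formula concrete, here is a minimal Python sketch that computes exact Shapley values for a small coalitional game by enumerating all subsets. The employee names match the interactive example below, but the profit numbers are hypothetical placeholders chosen purely for illustration.

```python
import itertools
import math

def shapley_values(v, players):
    """Exact Shapley values of a coalitional game v (a callable from frozensets to floats)."""
    n = len(players)
    phi = {}
    for i in players:
        others = [p for p in players if p != i]
        phi[i] = 0.0
        for size in range(n):
            for S in itertools.combinations(others, size):
                # Weight W(|S|, |N|) = |S|! (|N| - |S| - 1)! / |N|!
                weight = math.factorial(size) * math.factorial(n - size - 1) / math.factorial(n)
                phi[i] += weight * (v(frozenset(S) | {i}) - v(frozenset(S)))
    return phi

# Hypothetical profits for the three-employee example (values chosen only for illustration).
profits = {
    frozenset(): 0, frozenset({"Ava"}): 10, frozenset({"Ben"}): 20, frozenset({"Cat"}): 0,
    frozenset({"Ava", "Ben"}): 50, frozenset({"Ava", "Cat"}): 10, frozenset({"Ben", "Cat"}): 20,
    frozenset({"Ava", "Ben", "Cat"}): 50,
}
# By efficiency, the attributions sum to the grand-coalition profit v({Ava, Ben, Cat}) = 50.
print(shapley_values(profits.get, ["Ava", "Ben", "Cat"]))
```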
[Interactive figure: a coalitional game among three employees, Ava, Ben, and Cat. The table lists the profit \(v(S)\) for every subset \(S\), and the resulting Shapley values \(\phi_{Ava}(v)\), \(\phi_{Ben}(v)\), and \(\phi_{Cat}(v)\) update as the profits are edited; with all profits at zero, no employee receives any credit.]
Shapley values are an excellent way to give credit to individuals in a coalitional game. In fact, they are known to be the unique way to spread credit that satisfies the following three desirable properties: local accuracy (efficiency), meaning the attributions sum to the value of the full coalition; missingness, meaning players that never change the value of the game receive no credit; and consistency (monotonicity), meaning a player whose marginal contributions grow never receives less credit.
Shapley values are an optimal solution for allocating credit among players (\({1,\cdots,d}\)) in a coalitional game (\(v(S)\)). However, our goal is actually to allocate credit among features (\(x:=\{x_1,\cdots,x_d\}\in\mathbb{R}^d\)) in a machine learning model (\(f(x)\)) which is generally not a coalitional game: $$ v(S)\neq f(x) $$ In order to use Shapley values to explain a model, our goal moving forward is to define a new coalitional game related to the model's prediction: $$ v(S)=g(f,x) $$
As a first step, we can explore how coalitional games differ from machine learning models by looking at a model that is itself a coalitional game: a model with binary features \(x_i\in\{0,1\}\), where setting a feature to 1 corresponds to a player being present and setting it to 0 corresponds to a player being absent.
Presence of a feature: Our goal is to obtain local feature attributions (\(\phi_i(f,x^f)\)), a vector of importance values for each feature of a model prediction for a specific sample \(x^f\). This means that if feature \(i\) is present, we can simply set that feature to its value in the foreground sample \(x_i^f\). The next step is to address the absence of a feature, which can be done in several ways.
One straightforward way to address the absence of a feature is to define a baseline \(x^b\) (i.e., the background sample). In this case, if feature \(i\) is absent, we can simply set that feature to be \(x_i^b\). Then the new coalitional game is: $$ v(S)=f(h^S)\text{, where } h^S_i = x^f_i\text{ if }i\in S\text{, }x^b_i\text{ otherwise} $$
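As a small sketch of this game, the hybrid sample \(h^S\) and the Baseline Shapley game \(v(S)=f(h^S)\) can be written as follows; `f` stands in for any model's prediction function, and the helper names are our own.

```python
def hybrid_sample(S, x_f, x_b):
    """h^S: take feature i from the foreground x^f when i is in S, otherwise from the background x^b."""
    return [x_f[i] if i in S else x_b[i] for i in range(len(x_f))]

def baseline_game(S, f, x_f, x_b):
    """The Baseline Shapley coalitional game v(S) = f(h^S)."""
    return f(hybrid_sample(S, x_f, x_b))
```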
This specific adaptation of Shapley values to machine learning models is called Baseline Shapley, and it was shown to be a unique solution among attribution methods with a single baseline, based on results from the cost-sharing literature.
a.) Model and sample being explained.

| | \(x_1\) | \(x_2\) | \(x_3\) |
|---|---|---|---|
| Linear model \(\beta_i\) | 2 | -1 | 10 |
| Explicand \(x^f_i\) | 70 | 135 | 0 |

b.) Zero baseline.

| | \(x_1\) | \(x_2\) | \(x_3\) |
|---|---|---|---|
| Baseline \(x^b_i\) | 0 | 0 | 0 |
| Attribution \(\phi_i\) | 140 | -135 | 0 |

c.) Average baseline.

| | \(x_1\) | \(x_2\) | \(x_3\) |
|---|---|---|---|
| Baseline \(x^b_i\) | 70 | 135 | 0.5 |
| Attribution \(\phi_i\) | 0 | 0 | -5 |
One downside of Baseline Shapley is that the choice of baseline strongly influences the resulting feature attributions, as Figure 3 shows.
Figure 3b uses the zero baseline. In this case, it appears that height and weight are quite important, while being male is not important at all. This is counter-intuitive, because relative to being female, being male reduces the prediction for this individual. Put another way, consider a new encoding where \(x_3\) is 0 for female rather than 1 for female. Then an equivalent model is \(y=2x_1-x_2-10x_3+10\), and the same zero baseline gives an attribution of -10 instead of 0 for being male, for the exact same model. Since the meaning of zero is often arbitrary for different variables, selecting it as your baseline can result in misleading attributions.
Alternatively, we could use a mean baseline as in Figure 3c. Although the mean of a binary variable does not have a natural interpretation, the mean baseline is a reasonable choice for linear models when we use the interventional conditional expectation to define our coalitional game (more details in Section 1.5). With this baseline, an individual with average height and weight receives zero importance for height and weight, relative to an average individual. Furthermore, unlike the zero baseline, the attribution for gender under a mean baseline is consistent across different encodings of gender.
For linear models, comparing to a single average individual (the mean baseline) is equivalent to comparing to the average over the whole population, because the expectation of a linear model equals the model evaluated at the expected input.
Another natural approach to incorporate a background distribution to the coalitional game is with a conditional expectation. Instead of simply replacing "missing" features with a fixed value, we condition on the set of features that are "present" as if we know them and use those to guess at the "missing" features. If we define \(D\) to be the background (underlying) distribution our samples are drawn from, the value of the game is: $$ v(S)=\mathbb{E}_D[f(x)|x_{S}] $$
One caveat is that getting this conditional expectation for actual data is very difficult. Furthermore, even if you do manage to do so, the resulting explanations can end up having undesirable characteristics (more on this later). Because our goal is just to explain the model itself, an arguably more natural approach is to use causal inference's interventional conditional expectation:
$$
v(S)=\mathbb{E}_D[f(x)|\text{do}(x_{S})]
$$
The do notation is causal inference's do-operator: rather than conditioning on the features in \(S\), we intervene and set them to their observed values, which breaks their dependence on the remaining features. The absent features are then simply drawn from the background distribution \(D\), which is exactly what replacing them with background values accomplishes.
[Interactive figure 4: a linear model with coefficients \(\beta_1,\ldots,\beta_4\), a feature covariance matrix \(C\), and a foreground sample \(x^f\), together with the resulting CE and ICE Shapley values. In the "Independent variables" panel the two definitions agree:]

| | \(\phi_1\) | \(\phi_2\) | \(\phi_3\) | \(\phi_4\) |
|---|---|---|---|---|
| Shapley values (CE) | 1 | 2 | 3 | 4 |
| Shapley values (ICE) | 1 | 2 | 3 | 4 |
In general, computing the conditional expectation Shapley value is difficult; however, for a multivariate normal distribution the conditional expectation is easy to calculate. In addition, for a linear function the conditional expectation of the function equals the function applied to the conditional expectation (\(\mathbb{E}[f(x)|x_S]=f(\mathbb{E}[x|x_S])\)), so in this setting we can compute CE Shapley values exactly.
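To illustrate the difference, here is a minimal sketch that computes exact CE and ICE Shapley values for a linear model with multivariate normal features, reusing the `shapley_values` helper from the earlier sketch. The coefficients, covariance matrix, and foreground sample are hypothetical values chosen only to show the effect of correlation.

```python
import numpy as np

def ce_game(S, beta, x_f, mu, C):
    """v(S) = E[f(x) | x_S = x^f_S] for the linear model f(x) = beta @ x with x ~ N(mu, C)."""
    S = sorted(S)
    Sbar = [j for j in range(len(beta)) if j not in S]
    if not S:
        return float(beta @ mu)
    if not Sbar:
        return float(beta @ x_f)
    # Conditional mean of the missing features given the present ones (multivariate normal).
    cond_mean = mu[Sbar] + C[np.ix_(Sbar, S)] @ np.linalg.solve(C[np.ix_(S, S)], x_f[S] - mu[S])
    return float(beta[S] @ x_f[S] + beta[Sbar] @ cond_mean)

def ice_game(S, beta, x_f, mu):
    """v(S) = E[f(x) | do(x_S = x^f_S)]: missing features keep their marginal mean."""
    S = sorted(S)
    Sbar = [j for j in range(len(beta)) if j not in S]
    return float(beta[S] @ x_f[S] + beta[Sbar] @ mu[Sbar])

# Hypothetical setup: four standardized features, the first two strongly correlated.
beta = np.array([2.0, -1.0, 0.5, 3.0])
mu = np.zeros(4)
C = np.array([[1.0, 0.8, 0.0, 0.0],
              [0.8, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.0],
              [0.0, 0.0, 0.0, 1.0]])
x_f = np.array([1.0, 1.0, -1.0, 0.5])

phi_ce = shapley_values(lambda S: ce_game(S, beta, x_f, mu, C), [0, 1, 2, 3])
phi_ice = shapley_values(lambda S: ice_game(S, beta, x_f, mu), [0, 1, 2, 3])
print(phi_ce)   # the correlated features x_1 and x_2 share credit under CE
print(phi_ice)  # ICE equals beta_i * (x^f_i - mu_i) for each feature of a linear model
```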
In Figure 4 we highlight tradeoffs between the conditional expectation and the interventional conditional expectation for a linear model. Comparing Figure 4a to Figure 4b, we can see that correlation between variables causes the CE Shapley values to be split between correlated variables. Although this may be desirable at times, it means the explanation mixes the model's behavior with correlations in the data rather than describing the model's functional dependence alone.
Although both CE Shapley values and ICE Shapley values are meaningful, we will focus on ICE Shapley values moving forward for two reasons: 1.) ICE Shapley values are tractable to compute as opposed to CE Shapley values that require modelling many conditional expectations and 2.) ICE Shapley values more explicitly describe the model's behavior whereas CE Shapley values conflate the model's behavior with the correlation in the data.
To compute \(\phi_i(f,x^f)\) we evaluate the interventional conditional expectation. However, this depends on a background distribution \(D\) that the foreground sample will be compared to.
One natural definition of the background distribution is a uniform distribution over a population sample (e.g., equal probability for every sample in a data set). With this background distribution, we can re-write the Shapley value as an average of Baseline Shapley values (which are analogous to masking missing features): $$ \phi_i(f,x^f)=\frac{1}{|D|}\sum_{x^b\in D}\phi_i(f,x^f,x^b) $$
In summary, we show that the problem of obtaining \(\phi_i(f,x^f)\) reduces to averaging Baseline Shapley values \(\phi_i(f,x^f,x^b)\) over background samples \(x^b\) drawn from the background distribution \(D\).
Now our goal is to tackle the simpler problem of obtaining Baseline Shapley values \(\phi_i(f,x^f,x^b)\): attributions for a single foreground sample and a single background sample. The examples in this section all use the specific foreground sample, background sample, and tree specified in Figure 5.
[Interactive figure 5: the example tree (its node split variables and thresholds) together with the foreground sample \(x^f\) and background sample \(x^b\) used throughout this section.]
Based on the proof in Section 1.6, the brute force approach would be to compute the following: $$ \phi_i(f,x^f,x^b)=\sum_{S\subseteq N\setminus\{i\}} \underbrace{W(|S|,|N|)}_{W}(\underbrace{f(h^{S\cup \{i\}})}_{\text{\textcolor{green}{Pos} term}} {-} \underbrace{f(h^S)}_{\text{\textcolor{red}{Neg} term}}) $$ The algorithm is then fairly straightforward: for each subset \(S\subseteq N\setminus\{i\}\), construct the hybrid samples \(h^S\) and \(h^{S\cup\{i\}}\), evaluate the tree on both, and add the weighted difference to the attribution.
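As a concrete (if inefficient) reference, here is a minimal sketch of this brute-force approach. It assumes a simple hypothetical `Node` class for binary trees (not the representation used by any particular library) and reuses the `shapley_values` and `baseline_game` helpers from the earlier sketches.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    feature: Optional[int] = None      # split feature index (None at leaves)
    threshold: Optional[float] = None  # go left when x[feature] <= threshold
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    value: Optional[float] = None      # leaf prediction (None at internal nodes)

def predict_tree(node, x):
    """Evaluate a single decision tree on a sample x."""
    while node.value is None:
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node.value

def brute_force_baseline_shapley(root, x_f, x_b):
    """Baseline Shapley values for every feature by enumerating all subsets S."""
    d = len(x_f)
    game = lambda S: baseline_game(S, lambda x: predict_tree(root, x), x_f, x_b)
    # shapley_values re-enumerates the subsets once per feature: O(d * 2^d) tree evaluations.
    phi = shapley_values(game, list(range(d)))
    return [phi[i] for i in range(d)]
```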
[Interactive figure 6: the brute-force algorithm on the example tree with foreground sample \(x^f=(0,0,10)\) and background sample \(x^b=(10,10,0)\). For a chosen feature \(i\), the algorithm state shows each subset \(S\subseteq N\setminus\{i\}\) with its weight \(W\) (here 1/3, 1/6, 1/6, and 1/3), the hybrid samples \(h^S\) and \(h^{S\cup\{i\}}\), their predictions, and the accumulated attribution values \(\phi_1,\phi_2,\phi_3\).]
If we assume the computational cost of computing the weight \(W\) is constant, then each term requires evaluating the tree on two hybrid samples at a cost of \(O(\text{tree depth})\) each, and there are \(2^{d-1}\) subsets \(S\subseteq N\setminus\{i\}\) to sum over, giving \(O((\text{tree depth})\times 2^{d})\) time for a single feature.
Then, in order to compute \(\phi_i(f,x^f,x^b)\) for all features \(i\), we have to re-run the entire algorithm \(d\) times, giving us an overall complexity of $$ O(d\times (\text{tree depth})\times 2^{d}) $$
An exponential computational complexity is bad; however, if we constrain \(f(x)\) to be a tree-based model (e.g., XGBoost, decision trees, random forests, etc.), then we can come up with a polynomial time algorithm to compute \(\phi_i(f,x^f,x^b)\) exactly. Why is this the case? Well, looking at Figure 6, we can see that even when explaining a single feature, the brute force algorithm may consider a particular path multiple times. However, to compute the Baseline Shapley value for a single feature, it turns out that we only need to consider each path once. This insight leads us to the naive algorithm in Section 2.2.
Before we get into the algorithm, we first describe a theorem that is the basis for this naive implementation.
Theorem 1: To calculate \(\phi_i(f,x^f,x^b)\), we can calculate attributions for each path from the root to each leaf and then sum them: $$ \phi_i(f,x^f,x^b)=\sum_P \phi_i^P(f,x^f,x^b) $$
To obtain \(\phi_i^P(f,x^f,x^b)\) for a given path \(P\), we define \(N_P\) to be the unique features in the path and \(S_P\) to be the unique features in the path that came from \(x^f\). Finally, define \(v\) to be the value of the path's leaf. Then, the attribution of the path is: $$ \phi_i^P(f,x^f,x^b)= \begin{cases} 0 & \text{if}\ i\notin N_P \\ \underbrace{W(|S_P|-1,|N_P|)\times v}_{\text{\textcolor{green}{Pos} term}} & \text{if}\ i\in S_P \\ \underbrace{-W(|S_P|,|N_P|)\times v}_{\text{\textcolor{red}{Neg} term}} & \text{otherwise} \end{cases} $$
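As a minimal sketch, Theorem 1's weight and per-path attribution translate directly into two small helpers (the names are our own); the algorithms below reuse them.

```python
import math

def shapley_weight(s, n):
    """W(s, n) = s! (n - s - 1)! / n!, the weight for a subset of size s out of n features."""
    return math.factorial(s) * math.factorial(n - s - 1) / math.factorial(n)

def path_attribution(i, S_P, N_P, leaf_value):
    """phi_i^P for a single root-to-leaf path with leaf value v (Theorem 1)."""
    if i not in N_P:
        return 0.0                                                   # feature i not on the path
    if i in S_P:
        return shapley_weight(len(S_P) - 1, len(N_P)) * leaf_value   # Pos term
    return -shapley_weight(len(S_P), len(N_P)) * leaf_value          # Neg term
```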
Given Theorem 1, we just need an algorithm to obtain \(N_P\) and \(S_P\) for each path, which can be done by traversing the tree. We will start by explaining the algorithm via an example:
In the naive algorithm, we maintain the lists \(S_P\) and \(N_P\) as we traverse the tree. At each internal node (Cases 2-4) we update the lists and pass them to the node's children; at the leaf nodes (Case 1) we calculate the attribution for each path. There are four possible cases:

- Case 1 (leaf node): use \(S_P\), \(N_P\), and the leaf value to compute the path's attribution \(\phi^P_i\) for every feature, as in Theorem 1.
- Case 2 (internal node whose feature is already in \(N_P\)): the feature's value in the hybrid sample is already fixed, so follow the child consistent with \(x^f\) if the feature is in \(S_P\), or with \(x^b\) otherwise, without updating the lists.
- Case 3 (internal node with a new feature where \(x^f\) and \(x^b\) fall on the same side of the threshold): follow that common child without updating the lists, since the feature cannot change the path.
- Case 4 (internal node with a new feature where \(x^f\) and \(x^b\) fall on different sides): traverse both children, adding the feature to \(N_P\) for both and to \(S_P\) only for the child that \(x^f\) follows.
[Interactive figure: the naive algorithm traversing the example tree with \(x^f=(0,0,10)\) and \(x^b=(10,10,0)\), showing the hybrid sample \(h\), the lists \(S_P\) and \(N_P\) maintained along each path, and the accumulated attribution values \(\phi_1,\phi_2,\phi_3\).]
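Putting Cases 1 through 4 together, here is a minimal sketch of the naive algorithm, reusing the hypothetical `Node` class and the `path_attribution` helper from the earlier sketches.

```python
def naive_baseline_shapley(root, x_f, x_b):
    """Baseline Shapley values for all features by visiting every path once (Cases 1-4)."""
    d = len(x_f)
    phi = [0.0] * d

    def recurse(node, S_P, N_P):
        if node.value is not None:                           # Case 1: leaf
            for i in range(d):
                phi[i] += path_attribution(i, S_P, N_P, node.value)
            return
        j, t = node.feature, node.threshold
        f_left, b_left = x_f[j] <= t, x_b[j] <= t
        if j in N_P:                                         # Case 2: feature seen before on this path
            go_left = f_left if j in S_P else b_left
            recurse(node.left if go_left else node.right, S_P, N_P)
        elif f_left == b_left:                               # Case 3: x^f and x^b agree
            recurse(node.left if f_left else node.right, S_P, N_P)
        else:                                                # Case 4: x^f and x^b diverge on a new feature
            f_child = node.left if f_left else node.right
            b_child = node.left if b_left else node.right
            recurse(f_child, S_P + [j], N_P + [j])           # this path takes feature j from x^f
            recurse(b_child, S_P, N_P + [j])                 # this path takes feature j from x^b

    recurse(root, [], [])
    return phi
```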
Next we examine the computational complexity of computing \(\phi_i(f,x^f,x^b)\) for all features \(i\) using the naive algorithm.
For each internal node, the complexity is determined by Cases 2-4. In the worst case, we need to check whether the current node's feature has been encountered previously by iterating through \(S_P\) and \(N_P\). Since these lists have length up to \((\text{tree depth})\), each internal node incurs an \(O(\text{tree depth})\) cost.
For the leaf nodes, we compute the contributions for each of the \(d\) features. Computing the contribution for a feature requires checking whether it appears in \(S_P\) and \(N_P\), so each leaf node incurs an \(O((\text{tree depth})\times d)\) cost.
Putting this together, we get an overall cost of: $$ O((\text{\# internal nodes})\times (\text{tree depth})) + O((\text{\# leaf nodes})\times (\text{tree depth}) \times d) $$
In the next section we present two observations that allow us to compute \(\phi_i(f,x^f,x^b)\) for all features \(i\) in just \(O(\text{\# nodes})\) time.
To improve the computational complexity of the straightforward naive algorithm, we can make two observations.
The first observation is that we can get rid of the multiplicative \((\text{tree depth})\) factor by focusing on what the lists \(S_P\) and \(N_P\) are used for. The first use is to check if a feature has been previously encountered. By replacing the lists with boolean arrays, we can check if a given feature has been encountered by a constant time access into the arrays. This means the internal nodes now incur a constant cost. The second use of \(S_P\) and \(N_P\) is to calculate \(|S|\) and \(|N|\) at the leaves. By maintaining integers that keep track of these values, the leaves no longer have to iterate through lists. This gets rid of the multiplicative \((\text{tree depth})\) factor for leaf nodes. These optimizations lead to a new computational complexity of: $$ O(\text{\# internal nodes}) + O((\text{\# leaf nodes}) \times d) $$
The second observation is that we can compute the attributions for all features simultaneously as we traverse the tree by passing \(\textcolor{green}{\text{Pos}}\) and \(\textcolor{red}{\text{Neg}}\) attributions up to parent nodes. To see why, consider the attributions for the first two features of the example tree written in terms of the per-path \(\textcolor{green}{\text{Pos}}\) and \(\textcolor{red}{\text{Neg}}\) terms:
| Attribution | In terms of per-path terms |
|---|---|
| \(\phi_1(f,x^f,x^b)\) | \(\textcolor{green}{\text{Pos}_{R\to L1}}+\textcolor{green}{\text{Pos}_{R\to L2}}+\textcolor{red}{\text{Neg}_{R\to L3}}+\textcolor{red}{\text{Neg}_{R\to L4}}\) |
| \(\phi_2(f,x^f,x^b)\) | \(\textcolor{red}{\text{Neg}_{R\to L1}}+\textcolor{green}{\text{Pos}_{R\to L2}}+\textcolor{green}{\text{Pos}_{R\to L3}}+\textcolor{red}{\text{Neg}_{R\to L4}}\) |
In Figure 10, we can first observe that, according to Theorem 1, each leaf contributes only two possible values to the Baseline Shapley values (\(\textcolor{green}{\text{Pos}}\) and \(\textcolor{red}{\text{Neg}}\)). Looking at the attributions for \(x_1\), we see that these \(\textcolor{green}{\text{Pos}}\) and \(\textcolor{red}{\text{Neg}}\) terms can be grouped by the left and right subtrees below the node that splits on \(x_1\). To generalize this example, we make the following observation:
Observation: We only add to the attribution for Case 4 nodes. For a specific Case 4 node \(n\), one child is associated with \(x^f\) and one child is associated with \(x^b\). If \(\phi_i(f,x^f,x^b)\) is initialized to zeros for all features, then, for each Case 4 node: $$ \phi_{n_{feature}}(f,x^f,x^b)\mathrel{{+}{=}}\sum_{L\in \text{leaves}(x^f\text{ child})}\textcolor{green}{\text{Pos}_{R \to L}} + \sum_{L\in \text{leaves}(x^b\text{ child})}\textcolor{red}{\text{Neg}_{R \to L}} $$ In words, we only consider \(\textcolor{green}{\text{Pos}}\) terms from all paths under the \(x^f\) child and \(\textcolor{red}{\text{Neg}}\) terms from all paths under the \(x^b\) child.
Therefore, it is sufficient to sum the \(\textcolor{green}{\text{Pos}}\) and \(\textcolor{red}{\text{Neg}}\) terms at a given node and pass them up to the parent, which can then compute the attribution for its own feature. This aggregation of the \(\textcolor{green}{\text{Pos}}\) and \(\textcolor{red}{\text{Neg}}\) terms is the dynamic programming step that lets each node perform only a constant number of operations.
Then, the dynamic programming algorithm computes the attributions for all features simultaneously (which gets rid of the multiplicative \(d\) factor for the leaf nodes):
[Interactive figure 10: the dynamic programming algorithm on the example tree with \(x^f=(0,0,10)\) and \(x^b=(10,10,0)\), showing the hybrid sample \(h\), the counters \(S_C\) and \(N_C\), and the accumulated attribution values \(\phi_1,\phi_2,\phi_3\).]
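A minimal sketch of the dynamic programming algorithm is below, again using the hypothetical `Node` class and the `shapley_weight` helper from earlier. The boolean arrays from the first observation appear here as the `assigned` map, the counters \(S_C\) and \(N_C\) appear as `s_c` and `n_c`, and we assume weight lookups cost constant time (e.g., with precomputed factorials).

```python
def dynamic_baseline_shapley(root, x_f, x_b):
    """Baseline Shapley values for all features in O(# nodes), assuming O(1) weight lookups."""
    phi = [0.0] * len(x_f)

    def recurse(node, assigned, s_c, n_c):
        # assigned[j] is True if feature j is fixed to x^f on this path, False if fixed to x^b.
        # s_c = |S_P| and n_c = |N_P| so far; returns (sum of Pos, sum of Neg) over reachable leaves.
        if node.value is not None:                           # Case 1: leaf
            if n_c == 0:
                return 0.0, 0.0                              # x^f and x^b never diverged on this path
            pos = shapley_weight(s_c - 1, n_c) * node.value if s_c > 0 else 0.0
            neg = -shapley_weight(s_c, n_c) * node.value if s_c < n_c else 0.0
            return pos, neg
        j, t = node.feature, node.threshold
        f_left, b_left = x_f[j] <= t, x_b[j] <= t
        if j in assigned:                                    # Case 2: feature already assigned
            go_left = f_left if assigned[j] else b_left
            return recurse(node.left if go_left else node.right, assigned, s_c, n_c)
        if f_left == b_left:                                 # Case 3: x^f and x^b agree
            return recurse(node.left if f_left else node.right, assigned, s_c, n_c)
        f_child = node.left if f_left else node.right        # Case 4: x^f and x^b diverge
        b_child = node.left if b_left else node.right
        assigned[j] = True
        pos_f, neg_f = recurse(f_child, assigned, s_c + 1, n_c + 1)
        assigned[j] = False
        pos_b, neg_b = recurse(b_child, assigned, s_c, n_c + 1)
        del assigned[j]
        # The Observation: Pos terms from the x^f child, Neg terms from the x^b child.
        phi[j] += pos_f + neg_b
        return pos_f + pos_b, neg_f + neg_b

    recurse(root, {}, 0, 0)
    return phi
```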
In Figure 10, each node now only requires a constant amount of work. The computational complexity of computing \(\phi_i(f,x^f,x^b)\) for all features \(i\) with this algorithm is therefore just: $$ O(\text{\# nodes}) $$
Background distribution: Our original goal was to compute \(\phi_i(f,x^f)\). In order to do so, we compute \(\phi_i(f,x^f,x^b)\) for many references \(x^b\), resulting in a run time of: $$ O(|D|\times (\text{\# nodes})) $$ where \(|D|\) is the number of samples in the background distribution. In practice, using a fixed number of about 100 to 1000 references works well.
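In code, this is just an average of the per-reference attributions from the sketch above, where the `background` list stands in for \(D\):

```python
def interventional_shapley(root, x_f, background):
    """phi_i(f, x^f): average the per-reference Baseline Shapley values over a background set D."""
    d = len(x_f)
    phi = [0.0] * d
    for x_b in background:
        phi_b = dynamic_baseline_shapley(root, x_f, x_b)
        phi = [p + q for p, q in zip(phi, phi_b)]
    return [p / len(background) for p in phi]
```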
Further optimization: Explaining the tree for a specific foreground and background sample requires knowing where every hybrid sample falls in the tree, so any exact algorithm needs \(\Omega(\text{\# nodes})\) time; our algorithm is therefore optimal for Baseline Shapley values. However, it may be possible to compute the attributions for many background samples simultaneously in sub-linear time per sample, reducing the \(O(|D|\times (\text{\# nodes}))\) cost.
Ensembles of trees: Many tree-based methods are ensembles (e.g., random forests, gradient boosted trees). To compute the attributions for these models, we can leverage the additivity of Shapley values: explain each tree separately and sum the per-tree attributions. This means that to explain a gradient boosted tree model, the computational complexity is: $$ O((\text{\# trees}) \times |D|\times (\text{\# nodes})) $$
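A minimal sketch of this additive decomposition, assuming `trees` is a list of `Node` roots whose individual predictions are summed by the ensemble:

```python
def ensemble_interventional_shapley(trees, x_f, background):
    """Attributions for an additive tree ensemble: explain each tree and sum (Shapley additivity)."""
    d = len(x_f)
    phi = [0.0] * d
    for tree in trees:
        phi_t = interventional_shapley(tree, x_f, background)
        phi = [p + q for p, q in zip(phi, phi_t)]
    return phi
```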
Empirical evaluation: The goal of this article is understanding how these algorithms work, and not empirically evaluating them. For evaluation we refer readers to
It should be noted that there are a number of alternative methods that aim to estimate interventional conditional expectation Shapley values, including Sampling Explainer, Kernel Explainer, and Path Dependent Tree Explainer. If you are explaining tree-based models, it may not be clear which one you should use. In this article we briefly overview a few of these methods and compare them to Interventional Tree Explainer (ITE).
First, there are two model-agnostic explanation methods in the SHAP package. The first is Sampling Explainer, an implementation of the Interactions-based Method for Explanation (IME), which estimates Shapley values by sampling feature subsets. The second is Kernel Explainer, an implementation of Kernel SHAP, which estimates Shapley values by solving a weighted linear regression. Because both are model agnostic and sampling based, they require many model evaluations to reduce the variance of their estimates, whereas ITE computes exact attributions by taking advantage of the tree structure.
Second, the SHAP package includes a method named Path Dependent Tree Explainer (PDTE) that is designed for tree models specifically. PDTE approximates the interventional conditional expectation based on how many training samples went down each path in the tree, whereas ITE computes it exactly. The computational complexity of PDTE is \(O((\text{\# leaf nodes})\times (\text{tree depth})^2)\). In practice, PDTE can be faster than ITE, depending on the number of background references and the tree depth. The tradeoff is that the attribution values estimated by PDTE are biased away from the interventional expectation, because the computation of the expected values conditions on parent nodes.
Two pre-existing methods to explain trees include Gain (Gini Importance)
In this article we do not empirically evaluate these methods, because evaluating interpretability methods can be difficult due to the subjective nature of an explanation. Instead, we identify a way of assigning credit that comes equipped with a set of desirable properties (i.e., Shapley values) and show how to compute it exactly for trees with a tractable algorithm. This means that the credit allocation we obtain comes built in with many of the desirable properties that ablation tests aim to measure (e.g., consistency). That being said, empirical comparisons remain valuable, and we encourage interested readers to consult the evaluations referenced above.
This material is based upon work supported by the National Science Foundation Graduate Research Fellowship under Grant No. DGE-1762114. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.