Building the first real-time, accurate explainability engine

OuterProduct solves a fundamental trade-off between the performance and transparency of predictive models.

OuterProduct Labs

Quick Summary

  • OuterProduct builds the first real-time, accurate explainability engine for predictive models, substantially outperforming industry standards (Shapley values and Integrated Gradients) across hundreds of datasets (generated in the style of OpenXAI). 

  • The explainability engine solves a fundamental trade-off that has persisted between the performance of predictive models on critical tasks, and the ability to understand, explain and utilize the context for how they make decisions. 

Introduction

Although existing AI explainability algorithms are widely used, it is difficult to make their explanations actionable, stable, and easy to validate. Many AI practitioners would rather focus on improving model capabilities and performance.

At the same time, the explanations that do exist are rarely communicated in a way that business process owners can understand and use confidently.  It is not enough to output “feature importances” or attribution scores.  Rather, we need to be able to contextualize explainability into actionable narratives.

The goal of this blog post is to explain the existing challenges in building such an explainability engine and to show how we at OuterProduct Labs have built the first accurate and real-time primitives for an explainability engine.  

Explaining predictive models

We first need to define what it means to “explain” a predictive model. Ultimately, an explanation needs to be an actionable narrative. 

For example, a credit underwriting process needs to determine why an application was not approved and which application properties would have resulted in a different outcome. Given a predictive model for credit scoring, building this actionable narrative requires (1) identifying the underlying drivers for why an application was not approved and (2) based on these drivers, verifying the different scenarios that would reverse this decision.

In this post, we focus specifically on the first step of explainability, which is defined below: 

Given a predictive model and an input, we want to identify the subset of input attributes that determine the prediction. 

This measure of explainability is one of the key measures considered in the explainable AI (XAI) literature and in particular, falls under the purview of “ground-truth faithfulness” (more details on this here).

Evaluating explainability 

Unlike evaluating predictive models themselves, evaluating explainability is not obvious because in real world data, there is no “ground truth” for whether an explanation is correct or not. 

To overcome this issue with real world data, we instead turn to synthetic data where we know the ground truth. Namely, we can generate input and output pairs where the output is a function of some subset of input fields. We then require that the output be determined by a subset of input attributes that is specific to each input. For example, credit risk drivers can vary by applicant: for a young borrower with limited history, income stability may be the sole driver of risk, while for an older borrower with a long credit record, debt-to-income may be the sole driver of risk. This approach is consistent with that taken by XAI researchers (as in OpenXAI).

In particular, we here generate three types of synthetic datasets of increasing difficulty:


  1. Homogeneous data: In this setting, we have input-output pairs where all outputs depend on the same subset of fields. For example, given 100 fields, the label could just be the product of values in the first two fields. (Mathematically, we would say f(x) = x₁ x₂ for x containing 100 fields.) Such data are generally quite rare and unrealistic but provide a simple check for any explainability method. An accurate explainability method should be able to identify the fixed, relevant fields.  


  2. Slightly heterogeneous data: In this setting, we have input-output pairs where half of the pairs depend on one subset of attributes and the other half depend on another subset of attributes. For example, given 100 attributes, the output could be the value of the second attribute if the first attribute is positive and the value of the third attribute if the first attribute is negative. (Mathematically, we would say f(x) = x₂ if x₁ > 0 and x₃ if x₁ < 0.) An accurate explainability method should be able to identify that the attributes x₁ and x₂ are important when x₁ > 0 and that the attributes x₁ and x₃ are important when x₁ < 0.


  3. Highly heterogeneous data: In this setting, we have input-output pairs where the input data form clusters and the output depends on a subset of attributes that differs for each cluster. This data more accurately mimics real-world settings. For example, in Customer Success Management, the reasons for churn may vary greatly based on the region or sector of service (perhaps East coast software customers churn for different reasons than West coast retail customers).
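The three settings above can be sketched in a few lines of NumPy. This is a minimal illustration and not the generator used for the benchmarks in this post: the Gaussian inputs, the cluster construction, the number of clusters, and the product rule used in the highly heterogeneous case are all assumptions made for the example. Each function also returns a boolean mask marking the ground-truth attributes for every sample.

```python
import numpy as np

def make_homogeneous(n, d, rng):
    """f(x) = x1 * x2: the same two attributes drive every sample."""
    X = rng.standard_normal((n, d))          # Gaussian inputs are an assumption
    y = X[:, 0] * X[:, 1]
    mask = np.zeros((n, d), dtype=bool)
    mask[:, [0, 1]] = True
    return X, y, mask

def make_slightly_heterogeneous(n, d, rng):
    """f(x) = x2 if x1 > 0, else x3."""
    X = rng.standard_normal((n, d))
    pos = X[:, 0] > 0
    y = np.where(pos, X[:, 1], X[:, 2])
    mask = np.zeros((n, d), dtype=bool)
    mask[:, 0] = True                        # x1 always matters (it selects the branch)
    mask[pos, 1] = True
    mask[~pos, 2] = True
    return X, y, mask

def make_highly_heterogeneous(n, d, rng, n_clusters=8, k=2):
    """Clustered inputs; each cluster has its own k driver attributes.

    Hypothetical design: the output is the product of the cluster's drivers.
    """
    centers = 5.0 * rng.standard_normal((n_clusters, d))
    labels = rng.integers(0, n_clusters, size=n)
    X = centers[labels] + rng.standard_normal((n, d))
    drivers = np.stack(
        [rng.choice(d, size=k, replace=False) for _ in range(n_clusters)]
    )
    cols = drivers[labels]                   # shape (n, k): drivers per sample
    mask = np.zeros((n, d), dtype=bool)
    mask[np.arange(n)[:, None], cols] = True
    y = np.take_along_axis(X, cols, axis=1).prod(axis=1)
    return X, y, mask

rng = np.random.default_rng(0)
X, y, mask = make_slightly_heterogeneous(n=1024, d=32, rng=rng)
```

The per-sample mask is what makes evaluation possible: an explanation for a sample can be compared directly against the attributes that actually produced its output.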

In each of these synthetic settings, we get to control the number of input attributes (d) and the number of available samples (n). Upon varying n and d, we can evaluate explainability methods quantitatively along two axes:


  1. Accuracy: We can measure how often a method identifies the ground-truth relevant attributes for each sample. In particular, for any sample, suppose there are k true attributes. We can check whether the k attributes with the highest importance exactly match those k true attributes. This measure is a stronger notion than feature agreement (where the authors measure the percentage of attributes that overlap between the ground truth and the top attribute importance scores).


  2. Speed: We can evaluate the speed (in wall-clock time) for each method to output relevant fields for a batch of samples.
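As a rough sketch of these two measurements, the helpers below compute the exact top-k match rate and the wall-clock time of an explainer. The function names are hypothetical, and ranking attributes by the absolute value of their score is an assumption; a given method may define importance differently.

```python
import time
import numpy as np

def topk_exact_match(scores, mask):
    """Fraction of samples whose top-k attributes equal the k true attributes.

    scores: (n, d) importance scores; mask: (n, d) boolean ground truth.
    Attributes are ranked by |score| (an assumption for this sketch).
    """
    order = np.argsort(-np.abs(scores), axis=1)
    hits = 0
    for i in range(scores.shape[0]):
        k = int(mask[i].sum())
        predicted = set(order[i, :k].tolist())
        truth = set(np.flatnonzero(mask[i]).tolist())
        hits += int(predicted == truth)
    return hits / scores.shape[0]

def time_explainer(explain_fn, X):
    """Return the scores and the wall-clock seconds taken for a batch."""
    start = time.perf_counter()
    scores = explain_fn(X)
    return scores, time.perf_counter() - start

# Two samples, three attributes: the first is recovered exactly, the second is not.
demo_scores = np.array([[0.9, -0.8, 0.1],
                        [0.1,  0.2, 0.9]])
demo_mask = np.array([[True, True, False],
                      [True, False, False]])
accuracy = topk_exact_match(demo_scores, demo_mask)  # 0.5
```

Note that a sample counts as correct only if all k attributes are recovered, which is why this measure is stricter than an overlap percentage.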

In our experiments, we will evaluate the performance of different methods across values of n from 1024 (=2¹⁰) to 32,768 (=2¹⁵) and values of d from 32 (=2⁵) to 1024 (=2¹⁰). 

Existing explainability methods are highly inaccurate

Today, there are a number of existing explainability methods for black-box predictive models. The most prominent approach is known as Shapley values – a method originating from game theory. 

Shapley values were intended to be a method for distributing gains among a number of collaborators based on how much or little each person contributed to the overall outcome.

A practical disadvantage of the exact Shapley value computation for general models is that it is slow (the computation grows exponentially with the number of fields). To overcome this, many practitioners have resorted to fast approximations or to exact calculations for specific predictive models. In particular, for tree-based models like XGBoost or variants (like CatBoost), the state-of-the-art is to use TreeSHAP (a fast implementation of Shapley values for tree-based models), which is what we will consider in our evaluation benchmark below.
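For reference, the standard Shapley value of field i can be written as follows. The notation is introduced here for illustration: d is the number of fields and v(S) denotes the model's expected output when only the fields in S are known.

```latex
\phi_i(v) \;=\; \sum_{S \subseteq \{1,\dots,d\} \setminus \{i\}}
  \frac{|S|!\,(d-|S|-1)!}{d!}\,
  \bigl[\, v(S \cup \{i\}) - v(S) \,\bigr]
```

The sum ranges over all 2^(d-1) subsets that exclude field i, which is the source of the exponential cost: with d = 32 fields, that is already about 2.1 billion subsets per field.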

The major limitation of both approximate and exact SHAP is that it, provably, can never achieve 100% accuracy in simple settings like the slightly heterogeneous data setting above. This means that even with an infinite amount of data, SHAP would still not return the true fields that drive predictions.  
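To make this concrete, here is a small brute-force computation of exact Shapley values for the slightly heterogeneous rule f(x) = x₂ if x₁ > 0 and x₃ otherwise, restricted to three fields. It is a sketch under stated assumptions, not the TreeSHAP algorithm: the value of a subset is taken to be the average model output when the fields outside the subset are drawn from a background set, and the background (all sign patterns in {-1, +1}³) is chosen only so that the averages are exact.

```python
from itertools import combinations, product
from math import factorial
import numpy as np

def f(X):
    """Slightly heterogeneous rule from the text: x2 if x1 > 0, else x3."""
    return np.where(X[:, 0] > 0, X[:, 1], X[:, 2])

def exact_shapley(f, x, background):
    """Brute-force Shapley values by enumerating every subset of fields."""
    d = len(x)

    def value(S):
        # Fix the fields in S to the input's values; average over the rest.
        Z = background.copy()
        idx = np.array(S, dtype=int)
        Z[:, idx] = x[idx]
        return f(Z).mean()

    phi = np.zeros(d)
    for i in range(d):
        others = [j for j in range(d) if j != i]
        for size in range(d):
            w = factorial(size) * factorial(d - size - 1) / factorial(d)
            for S in combinations(others, size):
                phi[i] += w * (value(S + (i,)) - value(S))
    return phi

# Assumed background: independent, symmetric fields with mean zero.
background = np.array(list(product([-1.0, 1.0], repeat=3)))
x = np.array([1.0, 0.5, 0.4])   # x1 > 0, so the true drivers are x1 and x2
phi = exact_shapley(f, x, background)
# phi is approximately [0.025, 0.375, 0.1]
```

For this input the output is x₂, and changing x₃ has no effect on it. Yet x₃ receives a larger Shapley value (0.1) than x₁ (0.025), so the top two fields by Shapley value are x₂ and x₃ rather than the true drivers x₁ and x₂. The reason is that subsets in which x₁ is unknown still credit x₃ for the branch that was not taken.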

Another prominent method we will compare with is Integrated Gradients (IG), a method originally intended to explain how deep-learning-based image classification models make predictions. IG works by identifying the set of fields along which a model’s predictions change most along a path from an “average” input (typically the all-zeros input) to the current input. While IG is faster than exact SHAP, we will show below that it is still slow in wall-clock time relative to TreeSHAP and, moreover, that it can perform even worse than TreeSHAP on our benchmarks.
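The path integral behind IG can be approximated numerically. The sketch below is a minimal illustration rather than the implementation benchmarked in this post: it uses a midpoint Riemann sum along the straight line from the baseline to the input, and central finite differences stand in for the automatic differentiation that a deep-learning framework would normally provide. The step count and `eps` are arbitrary choices.

```python
import numpy as np

def integrated_gradients(f, x, baseline=None, steps=64, eps=1e-4):
    """Approximate IG_i = (x_i - b_i) * integral over a in [0, 1] of
    df/dx_i evaluated at b + a * (x - b)."""
    x = np.asarray(x, dtype=float)
    if baseline is None:
        baseline = np.zeros_like(x)          # all-zeros baseline, as in the text
    else:
        baseline = np.asarray(baseline, dtype=float)

    alphas = (np.arange(steps) + 0.5) / steps  # midpoints of [0, 1]
    avg_grad = np.zeros_like(x)
    for a in alphas:
        point = baseline + a * (x - baseline)
        for i in range(x.size):
            step = np.zeros_like(x)
            step[i] = eps
            # Central finite difference in place of an autodiff gradient.
            avg_grad[i] += (f(point + step) - f(point - step)) / (2 * eps)
    avg_grad /= steps
    return (x - baseline) * avg_grad

# Homogeneous example from earlier: f(x) = x1 * x2 with two irrelevant fields.
product_rule = lambda v: v[0] * v[1]
attributions = integrated_gradients(product_rule, [2.0, 3.0, 5.0, 7.0])
# attributions is approximately [3, 3, 0, 0]
```

On this example the attributions sum to f(x) - f(baseline) = 6 and the two irrelevant fields receive zero. Each explanation requires a gradient evaluation at every step along the path, which is one reason IG is slower in wall-clock time than a specialized method like TreeSHAP.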

Below, we present the results of running both TreeSHAP and IG on data sampled from our three evaluation benchmarks.

On homogeneous data, neither method consistently achieves 100% accuracy, even in simple settings with 1,024 samples and 32 features. Still, with more samples, TreeSHAP reaches up to 99% accuracy.

Homogeneous data

TreeSHAP (accuracy by number of fields and number of samples)

Fields \ Samples   2¹⁰    2¹¹    2¹²    2¹³    2¹⁴    2¹⁵
2¹⁰                71%    84%    90%    96%    96%    97%
2⁹                 69%    84%    91%    93%    97%    98%
2⁸                 74%    90%    92%    96%    96%    99%
2⁷                 83%    88%    93%    97%    97%    98%
2⁶                 79%    87%    93%    96%    97%    98%
2⁵                 86%    90%    93%    95%    98%    99%

Integrated Gradients (accuracy by number of fields and number of samples)

Fields \ Samples   2¹⁰    2¹¹    2¹²    2¹³    2¹⁴    2¹⁵
2¹⁰                0%     68%    91%    95%    96%    97%
2⁹                 0%     71%    92%    94%    95%    96%
2⁸                 49%    88%    92%    94%    95%    95%
2⁷                 55%    90%    92%    93%    94%    96%
2⁶                 73%    87%    92%    95%    94%    95%
2⁵                 83%    87%    89%    93%    96%    95%

Next, we turn to the slightly heterogeneous data setting. First, we find that both methods are far from 100% accurate. Moreover, even as the number of samples grows large, TreeSHAP is stuck at around 55% accuracy. This is because exact SHAP itself is, provably, not capable of reaching 100% accuracy on this task.

Slightly heterogeneous data

TreeSHAP (accuracy by number of fields and number of samples)

Fields \ Samples   2¹⁰    2¹¹    2¹²    2¹³    2¹⁴    2¹⁵
2¹⁰                54%    55%    55%    54%    55%    54%
2⁹                 53%    54%    55%    55%    54%    55%
2⁸                 53%    54%    54%    54%    54%    54%
2⁷                 53%    53%    54%    54%    54%    54%
2⁶                 54%    55%    54%    54%    55%    55%
2⁵                 53%    55%    55%    54%    54%    55%

Integrated Gradients (accuracy by number of fields and number of samples)

Fields \ Samples   2¹⁰    2¹¹    2¹²    2¹³    2¹⁴    2¹⁵
2¹⁰                2%     32%    47%    52%    63%    78%
2⁹                 8%     32%    48%    72%    51%    73%
2⁸                 17%    50%    74%    56%    64%    82%
2⁷                 29%    54%    52%    76%    72%    91%
2⁶                 33%    75%    80%    62%    86%    74%
2⁵                 66%    74%    55%    82%    90%    86%

Performance of both methods continues to get worse in the highly heterogeneous data setting. Here, TreeSHAP never gets more than 20% accuracy and for high dimensional data with 1,024 fields and 32,768 samples, IG achieves only 8% accuracy. 

Highly heterogeneous data

TreeSHAP (accuracy by number of fields and number of samples)

Fields \ Samples   2¹⁰    2¹¹    2¹²    2¹³    2¹⁴    2¹⁵
2¹⁰                2%     2%     13%    16%    18%    20%
2⁹                 4%     5%     12%    15%    18%    18%
2⁸                 0%     4%     13%    17%    18%    18%
2⁷                 1%     8%     13%    16%    19%    17%
2⁶                 2%     6%     15%    18%    17%    17%
2⁵                 2%     10%    15%    16%    19%    18%

Integrated Gradients (accuracy by number of fields and number of samples)

Fields \ Samples   2¹⁰    2¹¹    2¹²    2¹³    2¹⁴    2¹⁵
2¹⁰                0%     0%     1%     2%     3%     8%
2⁹                 0%     0%     1%     2%     5%     9%
2⁸                 0%     1%     2%     2%     6%     19%
2⁷                 0%     1%     1%     3%     16%    27%
2⁶                 0%     1%     3%     10%    20%    31%
2⁵                 1%     2%     6%     15%    25%    40%

Real world data are clearly highly heterogeneous, and existing explainability algorithms are not equipped to tackle this heterogeneity.  OuterProduct unlocks explainability in heterogeneous settings. 

OuterProduct’s direct, real-time and accurate explainability solution

OuterProduct has built the first real-time, accurate explainability engine, achieving roughly 90% accuracy or higher on all of the above evaluation benchmarks at the largest sample size. Our solution is rooted in our fundamental understanding of feature learning (built on our foundational research published in Science).

On homogeneous data, our solution achieves 100% accuracy at all scales. On slightly heterogeneous data, our solution achieves 99% accuracy for most settings, far exceeding the best performance of both TreeSHAP and IG. On highly heterogeneous data, our solution reaches roughly 90% accuracy (between 89% and 93%) across all field counts when the number of samples is around 32K.

OuterProduct

Homogeneous data (accuracy by number of fields and number of samples)

Fields \ Samples   2¹⁰    2¹¹    2¹²    2¹³    2¹⁴    2¹⁵
2¹⁰                100%   100%   100%   100%   100%   100%
2⁹                 100%   100%   100%   100%   100%   100%
2⁸                 100%   100%   100%   100%   100%   100%
2⁷                 100%   100%   100%   100%   100%   100%
2⁶                 100%   100%   100%   100%   100%   100%
2⁵                 100%   100%   100%   100%   100%   100%

Slightly heterogeneous data (accuracy by number of fields and number of samples)

Fields \ Samples   2¹⁰    2¹¹    2¹²    2¹³    2¹⁴    2¹⁵
2¹⁰                95%    97%    97%    99%    99%    99%
2⁹                 97%    98%    98%    99%    99%    99%
2⁸                 93%    97%    97%    98%    99%    98%
2⁷                 94%    96%    99%    99%    99%    99%
2⁶                 93%    96%    98%    99%    99%    99%
2⁵                 91%    95%    98%    99%    99%    99%

Highly heterogeneous data (accuracy by number of fields and number of samples)

Fields \ Samples   2¹⁰    2¹¹    2¹²    2¹³    2¹⁴    2¹⁵
2¹⁰                0%     0%     17%    29%    62%    89%
2⁹                 22%    21%    24%    35%    62%    92%
2⁸                 0%     18%    27%    40%    62%    90%
2⁷                 2%     0%     21%    35%    62%    92%
2⁶                 0%     17%    23%    32%    65%    93%
2⁵                 2%     10%    34%    50%    65%    92%

The difference between our solution and other methods is apparent when comparing their performance on the largest data settings shown below. 

[Figure: Recovery of true prediction drivers (%), 32,768 samples and 1,024 features]

Moreover, our solution took 1.4 seconds to produce 10k explanations on a single NVIDIA A100 (Ampere-series) GPU, matching TreeSHAP (1.4 seconds for 10k explanations) and running much faster than IG (19 seconds for 10k explanations).

From accurate explainability methods to narratives

Explainability is foundational to accelerating the impact of applied AI. From governance to powering new agentic systems, obtaining deterministic reasoning about AI decision making is a fundamental bottleneck.

OuterProduct’s real-time explainability engine represents a material breakthrough for AI systems that can use as context not only data but also the intelligence of the world’s predictive models themselves.