Mar 2, 2026
Building the first real-time, accurate explainability layer

OuterProduct Labs
Quick Summary
OuterProduct builds the first real-time, accurate explainability layer that allows agentic AI to talk to predictive models.
OuterProduct Labs’ new explainability layer dramatically outperforms industry standards (Shapley values and Integrated Gradients) across hundreds of datasets (generated in the style of OpenXAI).
An explainability layer unlocks context for agentic recommendations through scenario-exploration grounded in data.
From lookups to predictions to recommendations
Explainability of predictive AI is often treated as a compliance requirement rather than a product capability. Although existing explainability algorithms are widely used, it is difficult to make these explanations actionable, stable, and easy to validate. Many AI practitioners would rather focus on improving model capabilities and performance. Explanations often remain an afterthought rather than a foundational piece of the workflow.
At the same time, the explanations that do exist are rarely communicated in a way that business process owners can understand and use confidently. It is not enough to output “feature importances” or attribution scores. Rather, we need to be able to contextualize explainability into actionable narratives and agentic recommendations. Understanding what drives predictive models and the persistence of their predictive signal is now critical to delivering sufficient context to guide these agentic recommendation systems.
The next major unlock in applied AI will come from building an explainability layer that quickly and accurately explains the patterns in data used by predictive models. This explainability layer will be the contextual bridge for generative AI to “talk to predictive models.”
The goal of this blog post is to explain the existing challenges in building such an explainability layer and to show how we at OuterProduct Labs have built the first accurate and real-time algorithm for an explainability layer.
Explaining predictive models
In order to build our explainability layer, we first need to define what it means to “explain” a predictive model. Ultimately, an explanation needs to be an actionable narrative.
For example, a customer success manager needs to understand why a customer is likely to churn and what interventions they can proactively take to mitigate churn risk. Given a predictive model for churn risk, building this actionable narrative requires (1) identifying underlying drivers for why a customer is likely to churn and (2) assessing the impact of different scenarios based on these drivers.
In this post, we focus specifically on the first step of explainability, which is defined below:
Given a predictive model and an input sample, we want to identify the subset of input fields that determine the prediction for the sample.
This measure of explainability is one of the key measures considered in the explainable AI literature and in particular, falls under the purview of “ground-truth faithfulness” (more details on this here). In our example above, among all input fields, a customer may have been likely to churn entirely based on (1) a lack of product usage and (2) their number of customer service calls (rather than say their income bracket or their location). In order to build an actionable narrative for this customer, we first need a method that can accurately identify these relevant fields.
Evaluating explainability
With this measure in mind, suppose we have an explainability algorithm that outputs importance scores for different fields and want to evaluate its effectiveness. How do we do so?
Unlike evaluating predictive models themselves, evaluating explainability algorithms is not so obvious because in real world data, there is no “ground truth” for whether an explanation is correct or not.
To overcome this issue with real-world data, we instead turn to synthetic data where we know the ground truth. Namely, we generate input and label pairs where the label is a function of a known subset of input fields, and we allow that subset to vary based on the input sample itself. For example, credit risk drivers can vary by applicant: for a young borrower with limited history, income stability may be the sole driver of risk, while for an older borrower with a long credit record, debt-to-income may be the sole driver. This approach is consistent with that taken by explainable AI researchers (as in OpenXAI).
In particular, we here generate three types of synthetic datasets of increasing difficulty:
Homogeneous data: In this setting, we have input-label pairs where all labels depend on the same subset of fields. For example, given 100 fields, the label could just be the product of values in the first two fields (mathematically, we would say f(x) = x₁ x₂ for x containing 100 fields.) Such data are generally quite rare and unrealistic but provide a simple check for any explainability algorithm. An accurate explainability algorithm should be able to identify the fixed, relevant fields.
Slightly heterogeneous data: In this setting, we have input-label pairs where half of the labels depend on one subset of fields and the other half depend on another subset of fields. For example, given 100 fields, the label could be the value of the second field if the first field is positive and the value of the third field if the first field is negative (mathematically, we would say f(x) = x₂ if x₁ > 0 and x₃ if x₁ < 0.) An accurate explainability algorithm should be able to identify that the fields x₁ and x₂ are important when x₁ > 0 and that the fields x₁ and x₃ are important when x₁ < 0.
Highly heterogeneous data: In this setting, we have input-label pairs where the input data form clusters and the label depends on a subset of fields that differs for each cluster. This data more accurately mimics real-world settings. For example, in Customer Success Management, the reasons for churn may vary greatly based on the region or sector of service (perhaps East coast software customers churn for different reasons than West coast retail customers).
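To make the setup concrete, here is a minimal sketch of a generator for the slightly heterogeneous setting (the function name and exact construction are our illustration, not OuterProduct’s actual benchmark code): the label follows the second field when the first field is positive and the third field otherwise, with all remaining fields acting as noise.

```python
import numpy as np

def make_slightly_heterogeneous(n, d, seed=0):
    """Illustrative generator for the 'slightly heterogeneous' setting:
    y = x2 if x1 > 0 else x3, with the remaining d - 3 fields pure noise.
    Returns X, y, and a per-sample boolean mask of the ground-truth
    relevant fields (the gating field plus the active value field)."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))
    pos = X[:, 0] > 0
    y = np.where(pos, X[:, 1], X[:, 2])
    truth = np.zeros((n, d), dtype=bool)
    truth[:, 0] = True      # x1 gates the label, so it is always relevant
    truth[pos, 1] = True    # x2 drives the label when x1 > 0
    truth[~pos, 2] = True   # x3 drives the label when x1 <= 0
    return X, y, truth
```

Because the ground-truth mask is emitted alongside each sample, any explainability algorithm’s output can be scored against it directly.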
In each of these synthetic settings, we get to control the number of input fields (d) and the number of available samples (n). Upon varying n and d, we can evaluate explainability algorithms quantitatively along two axes:
Accuracy: We can measure how often they identify the ground-truth relevant fields for prediction per sample. In particular, for any sample, suppose there are k true fields. We can check whether the top k fields with the highest importance scores match those k true fields. This measure is a stronger notion of feature agreement (where the authors measure the percentage of fields that overlap between the ground truth and the field importance scores).
Speed: We can evaluate the speed (in wall-clock time) for each algorithm to output relevant fields for a batch of samples.
In our experiments, we will evaluate the performance of different algorithms across values of n from 1024 (=2¹⁰) to 32,768 (=2¹⁵) and values of d from 32 (=2⁵) to 1024 (=2¹⁰).
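The accuracy metric above, and the weaker feature-agreement metric it strengthens, can be sketched in a few lines (our illustration; the function names are not a published API):

```python
import numpy as np

def topk_exact_match(scores, truth):
    """Fraction of samples whose k highest-scoring fields exactly match
    the k ground-truth fields (k varies per sample).
    scores: (n, d) importance scores; truth: (n, d) boolean mask."""
    hits = 0
    for s, t in zip(scores, truth):
        k = int(t.sum())
        topk = set(np.argsort(s)[::-1][:k])
        hits += topk == set(np.flatnonzero(t))
    return hits / len(scores)

def feature_agreement(scores, truth):
    """Weaker metric: average fractional overlap between the top-k
    fields and the ground-truth fields (in the style of OpenXAI)."""
    total = 0.0
    for s, t in zip(scores, truth):
        k = int(t.sum())
        topk = set(np.argsort(s)[::-1][:k])
        total += len(topk & set(np.flatnonzero(t))) / k
    return total / len(scores)
```

The exact-match metric only credits a sample when every relevant field is recovered, which is why it is a strictly harder bar than feature agreement.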
Existing explainability algorithms are highly inaccurate
Given the above evaluation benchmarks, what explainability algorithms will we evaluate?
Today, there are a number of existing explainability algorithms for black-box predictive models. The most prominent approach is known as Shapley values – a method originating from game theory (introduced by Lloyd Shapley in 1951).
Shapley values were originally intended as a method for distributing gains among a number of collaborators based on how much or how little each person contributed to the overall outcome. The key idea behind applying Shapley values to explainability is to view the prediction from a model as the “gain” and the input fields as the “collaborators.” SHAP then provides attribution scores for each field based on how much it “contributed” to the prediction. As a result, the prediction is guaranteed to equal the sum of the SHAP attribution scores (plus the model’s average prediction).
A practical disadvantage of exact Shapley value computation for general models is that it is slow (the computation grows exponentially with the number of fields). To overcome this, many practitioners resort to fast approximations or to exact calculations for specific model classes. In particular, for tree-based models like XGBoost or variants (like CatBoost), the state of the art is TreeSHAP (a fast implementation of Shapley values for tree-based models), which is what we will consider in our evaluation benchmark below.
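To see where the exponential cost comes from, here is a textbook brute-force Shapley computation for an arbitrary black-box model (a pedagogical sketch, not the SHAP library): the value of a coalition is the model evaluated with the “absent” fields replaced by a baseline, and every one of the 2^d subsets must be visited.

```python
from itertools import combinations
from math import factorial

def exact_shapley(f, x, baseline):
    """Exact Shapley attributions for a black-box model f at input x.
    v(S) evaluates f with fields outside S replaced by the baseline.
    Enumerates all 2^d subsets, so it is only feasible for small d --
    the exponential blow-up that motivates approximations like TreeSHAP."""
    d = len(x)
    def v(S):
        z = list(baseline)
        for i in S:
            z[i] = x[i]
        return f(z)
    phi = []
    for i in range(d):
        others = [j for j in range(d) if j != i]
        total = 0.0
        for size in range(d):
            # Shapley weight for coalitions of this size
            w = factorial(size) * factorial(d - size - 1) / factorial(d)
            for S in combinations(others, size):
                total += w * (v(S + (i,)) - v(S))
        phi.append(total)
    return phi
```

By the efficiency property, the attributions sum to f(x) minus f(baseline), which is the additivity guarantee mentioned above.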
The major limitation of SHAP, whether approximated or computed exactly, is that it provably cannot achieve 100% accuracy even in simple settings like the slightly heterogeneous data setting above. This means that even with an infinite amount of data, SHAP would still not return the true fields that drive predictions.
To this end, another prominent method we will compare against is Integrated Gradients (IG) – a method originally designed to explain how deep-learning-based image classification models make predictions. IG identifies the fields along which a model’s prediction changes most as the input moves along a path from a “baseline” input (typically the all-zeros input) to the current input. In our benchmark experiments, IG was on average slower than TreeSHAP in wall-clock time and, in some cases, less accurate.
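The path integral behind IG can be sketched numerically for any scalar black-box model (our illustration; production IG implementations use framework autodiff rather than finite differences): attribute to field i the quantity (x_i − b_i) times the gradient of f along the straight-line path from baseline b to input x, averaged over the path.

```python
import numpy as np

def integrated_gradients(f, x, baseline, steps=64):
    """Numerical Integrated Gradients for a scalar model f:
    IG_i = (x_i - b_i) * integral_0^1 df/dx_i(b + a*(x - b)) da,
    with the integral approximated by a midpoint Riemann sum and the
    gradient by central finite differences (sketch only)."""
    x, b = np.asarray(x, float), np.asarray(baseline, float)
    eps = 1e-5
    grads = np.zeros_like(x)
    for step in range(steps):
        alpha = (step + 0.5) / steps          # midpoint rule
        z = b + alpha * (x - b)
        for i in range(len(x)):
            zp, zm = z.copy(), z.copy()
            zp[i] += eps
            zm[i] -= eps
            grads[i] += (f(zp) - f(zm)) / (2 * eps)
    return (x - b) * grads / steps
```

Like SHAP, IG satisfies a completeness property: the attributions sum to f(x) minus f(baseline).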
Below, we present the results of running both TreeSHAP and IG on data sampled from our three evaluation benchmarks.
On homogeneous data, we find that, remarkably, neither method consistently achieves 100% accuracy, even in the simple setting with 1,024 samples and 32 fields. With more samples, however, TreeSHAP reaches up to 99% accuracy.
Homogeneous data
Next, we turn to the slightly heterogeneous data setting. First, we find that both methods are far from 100% accurate. Moreover, even as the number of samples grows large, TreeSHAP remains stuck at around 55% accuracy. This is because exact SHAP itself is provably incapable of reaching 100% accuracy on this task.
Slightly heterogeneous data
The performance of both methods degrades further in the highly heterogeneous data setting. Here, TreeSHAP never exceeds 20% accuracy, and on high-dimensional data with 1,024 fields and 32,768 samples, IG achieves only 8% accuracy.
Highly heterogeneous data
Real world data are clearly highly heterogeneous, and existing explainability algorithms are not equipped to tackle this heterogeneity.
OuterProduct Labs’ new algorithm, however, unlocks explainability in heterogeneous settings.
OuterProduct’s direct, real-time and accurate explainability solution
At OuterProduct Labs, we have built the first real-time, accurate explainability solution. Our solution is rooted in our fundamental understanding of feature learning (in particular, built on the foundational research from our founders’ labs and published in Science).
Below, we show that our solution achieves high performance (>90% accuracy) in all of the above evaluation benchmarks.
On homogeneous data, our solution achieves 100% performance at all scales. On slightly heterogeneous data, our solution achieves 99% performance for most settings, far exceeding the best performance of both TreeSHAP and IG. On highly heterogeneous data, we see that our solution is well into the 90% accuracy range across all settings when the number of samples is around 32K.
OuterProduct
The difference between our solution and other methods is apparent when comparing their performance on the largest data settings shown below.
Moreover, our solution took 1.4 seconds to produce 10K explanations on a single NVIDIA A100 (Ampere-series) GPU, matching TreeSHAP’s 1.4 seconds for 10K explanations and far faster than IG’s 19 seconds for the same batch.
From accurate explainability algorithms to narratives
Explainability layers are the key to unlocking the next generation of agentic recommendation systems, which can use as context not only raw data but also the intelligence of the world’s predictive models themselves. The fundamental bottleneck in building explainability layers has been the lack of real-time, accurate explainability algorithms.
At OuterProduct, we developed an algorithm that overcomes this bottleneck, as verified quantitatively over an explainability evaluation benchmark. As such, we have an algorithmic core for powering an explainability layer.
Yet, explainability does not stop at any single algorithm. Ultimately, explanations need to be actionable narratives and recommendations. In upcoming posts, we will discuss how we implemented scenario-exploration methods that build on our explainability algorithm to provide the essential context for agentic recommendations.