November 6, 2025

Research

Decision-Centric AI Systems That Optimize Choices Under Irreversibility and Uncertainty

Shubhankar Kahali

Zuzanna Kowalczyk

Kai Keskitalo

Most modern AI systems are built on a fundamental assumption: if we predict accurately enough, good decisions will follow. This "predict-then-decide" paradigm dominates the field—from recommendation engines to fraud detection, from hiring systems to medical diagnostics. The underlying logic appears sound: minimize prediction error, and the optimal action follows trivially.

But this assumption breaks catastrophically in the domains that matter most.

Consider a hiring decision. A predictive model can estimate with 85% confidence that a candidate will succeed. But the decision isn't "will this person succeed?"—it's "should we hire this person over the alternative?" The model never observes what happens to rejected candidates. It cannot learn whether its rejections were mistakes because those counterfactual outcomes remain permanently hidden. The prediction might be accurate, but the decision quality is fundamentally unknowable.

Or consider medical treatment selection. A model predicts disease progression under treatment A with high confidence. But the patient can only receive one treatment. The outcome under treatment B—the road not taken—is never observed. The model optimizes for predictive accuracy on observed data, but decision quality depends on comparing observed and unobserved outcomes.

These aren't edge cases or data quality problems. They represent a category error in how we've architected AI systems. We've built prediction engines for decision problems, and the gap between these two objectives grows wider as stakes increase.

At LastArray, we focus on human assessment systems where this gap is most acute: career decisions, team formation, capability evaluation, developmental planning. These are irreversible choices with asymmetric costs, unobservable counterfactuals, and performative feedback loops. Getting the prediction right means nothing if the decision is wrong.

This post examines what we call Decision-Centric AI—systems explicitly designed to optimize decision quality when prediction accuracy is insufficient. We'll explore why standard paradigms fail, what theoretical frameworks enable decision-focused learning, and what architectural primitives must be invented to build AI systems that make robust choices under deep uncertainty.


Taxonomy of Decision Environments

Not all decisions are created equal. The requirements for a decision support system vary dramatically based on four key properties:



Type 1: The ML Sweet Spot

Properties: Reversible, Observable, Symmetric, Static

These are the problems where standard machine learning excels:

  • Image classification

  • Spam filtering

  • Price prediction for liquid markets

  • Weather forecasting

Here, errors are cheap learning signals. A misclassified email is simply moved back. A wrong price prediction is corrected an hour later. The environment doesn't change based on your predictions.


Type 6: The Decision Frontier

Properties: Irreversible, Unobservable, Asymmetric, Performative

These represent the most challenging class:

  • Hiring and promotion decisions

  • Medical treatment selection

  • Criminal sentencing

  • Strategic resource allocation

  • Existential risk mitigation

A hiring mistake cannot be undone—the rejected candidate's trajectory is never observed. False negatives (missing exceptional talent) and false positives (hiring poor performers) have wildly different costs. And the act of hiring changes the organization, affecting future hiring contexts.

Standard ML paradigms make assumptions that hold for Type 1 but catastrophically fail for Type 6.

| Decision Type | Reversible | Observable | Cost Symmetry | Feedback | Primary Challenge | Standard Approach | Failure Mode |
|---|---|---|---|---|---|---|---|
| Type 1 | Yes | Yes | Symmetric | Static | Sample Efficiency | Supervised Learning | ✓ Works well |
| Type 2 | Yes | Yes | Asymmetric | Static | Precision/Recall Balance | Cost-Sensitive Learning | Partial (applies cost weights post hoc) |
| Type 3 | No | Yes | Symmetric | Static | Constraint Satisfaction | Constrained Optimization | Ignores irreversibility risk |
| Type 4 | No | Yes | Asymmetric | Static | Catastrophic Risk | Safe RL | Learns from catastrophes |
| Type 5 | Yes | No | Symmetric | Performative | Proxy Reward | Decision-Focused Learning | Partial (assumes loss observable) |
| Type 6 | No | No | Asymmetric | Performative | All of the above | None adequate | Complete breakdown |


Why Standard Paradigms Fail
The Supervised Learning Trap: Learning from Biased History

Supervised learning assumes training data reflects objective reality. But in decision contexts, training data is an artifact of historical decision policies.

Consider a resume screening system trained on historical hiring data:

  • What it learns: P(success | resume, was_hired_historically)

  • What we need: P(success | resume) for all candidates

  • The gap: We only observe outcomes for historically hired candidates

The model learns to replicate historical biases rather than optimize hiring quality. It cannot distinguish "genuinely low-performing candidate" from "candidate who looked low-performing to a biased historical decision-maker."

This is the Selective Labels Problem—the training distribution is fundamentally different from the decision distribution because the labels themselves are conditional on past decisions.


Benchmark: Prediction Accuracy vs Decision Quality

We evaluated this gap using a simulated hiring environment where ground truth (candidate quality) is known but only historically hired candidates are labeled in training:

Note: Accuracy measured on historically hired candidates; decision quality measured on optimal selection from full candidate pool

The standard supervised model achieves high accuracy on its test set (historically hired candidates) but poor decision quality when selecting from the full pool. High predictive accuracy does not translate to good decisions because the training distribution is biased.
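
A minimal synthetic sketch of this effect (the features, thresholds, and hiring rule below are invented for illustration; they are not the benchmark environment itself):

```python
# Illustrative simulation of the selective labels problem (synthetic data only).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 20_000
skill = rng.normal(size=n)                       # true quality we want to select for
pedigree = rng.normal(size=n)                    # visible credential, irrelevant to true quality
X = np.column_stack([skill + rng.normal(scale=0.7, size=n), pedigree])

# Historical policy hires largely on pedigree, so labels exist only for that slice.
hired = pedigree + 0.3 * X[:, 0] + rng.normal(scale=0.5, size=n) > 1.2

# Observed "success" of hired candidates is inflated by pedigree (better projects,
# sponsorship), while the quantity we care about is underlying skill.
observed_success = (skill + 0.8 * pedigree + rng.normal(scale=0.5, size=n) > 0.5).astype(int)

model = LogisticRegression(max_iter=1000).fit(X[hired], observed_success[hired])
print("accuracy on the hired slice:", round(model.score(X[hired], observed_success[hired]), 2))

# Decision quality: mean true quality of the top-k selected from the FULL pool.
k = 1_000
by_model = np.argsort(-model.predict_proba(X)[:, 1])[:k]
by_oracle = np.argsort(-skill)[:k]
print("mean true quality, model top-k :", round(skill[by_model].mean(), 2))
print("mean true quality, oracle top-k:", round(skill[by_oracle].mean(), 2))
```

The screener looks fine on its own test slice while systematically over-selecting the credential the historical policy favored.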


The Reinforcement Learning Illusion: Catastrophic Regret

Reinforcement learning appears to solve decision problems—agents act in environments and optimize cumulative reward. But standard RL makes a fatal assumption: exploration is cheap.

Exploration strategies such as ε-greedy action selection or entropy-regularized policies compel agents to try random actions in order to learn their values. In a video game, when an agent explores by jumping off a cliff, the episode resets, and the agent learns "don't jump off cliffs." Death is a valuable negative signal.

In irreversible domains, death is not a signal—it's the end of the system.

An autonomous power grid controller that explores a strategy causing a blackout creates irreversible economic damage. There is no "next episode" in which to exploit this knowledge. A surgical robot that explores by severing a critical artery cannot respawn the patient.

Standard RL maximizes expected reward over time, implicitly assuming early catastrophes can be amortized by infinite future gains. This leads to what we call Catastrophic Regret—unacceptable losses during exploration that cannot be recovered.


Benchmark: Exploration Cost in Irreversible Environments

We compared RL algorithms in a simulated medical treatment environment where certain actions cause irreversible harm:

Catastrophic failures defined as actions causing irreversible state transitions with loss > 100

The reversibility-aware approach trades slower convergence for dramatically fewer catastrophic failures—a critical tradeoff when single errors are unacceptable.


The Offline RL Problem: Extrapolation Delusions

Offline RL attempts to learn from static datasets without dangerous online exploration. But it suffers from a different failure mode: extrapolation error.

When the agent queries the value of an action outside the training distribution (OOD), the value function—approximated by a neural network—produces arbitrary, often over-optimistic estimates. In online RL, the agent would try this action and correct its estimate. In offline settings with irreversible consequences, it cannot.

The result: agents develop "delusions" of high value in unexplored regions, leading to confident but disastrous decisions.



Benchmark: OOD Value Estimation Error

We measured value function error as actions move further from the training distribution:


Distance measured as Mahalanobis distance in state-action space; error is absolute percentage deviation from true value

Standard value-based methods become dangerously overconfident in OOD regions. Conservative approaches reduce but don't eliminate the problem.


The Causal Inference Bottleneck: Unidentifiable Counterfactuals

Causal inference frameworks provide mathematical machinery for reasoning about "what if" questions. But they hit a fundamental wall: the Fundamental Problem of Causal Inference.

We cannot observe Y(1) and Y(0) simultaneously for the same unit. We see what happened under the action taken, but the counterfactual—what would have happened under the alternative—remains hidden.

To estimate causal effects, we must assume:

  1. Strong Ignorability: No unobserved confounders affect both treatment and outcome

  2. Positivity/Overlap: Every unit has non-zero probability of receiving any treatment

Both assumptions are routinely violated in high-stakes decisions:

Unobserved Confounding: Doctors use private information (patient demeanor, unstated concerns) to assign treatments. This information rarely appears in medical records. Any model trained on records will be confounded—mistaking the doctor's selection bias for treatment efficacy.

Positivity Violations: High-risk loans are never given to applicants with zero income. Certain medical procedures are never performed on specific patient subgroups. The counterfactual isn't just unobserved—it's unsupported by the data manifold.

Methods like Inverse Propensity Weighting (IPW) explode in variance when positivity is violated, providing unstable or meaningless estimates.
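
A toy illustration of this variance problem (fully synthetic; the true effect is fixed at 1.0 and only the propensity floor changes):

```python
# Toy demonstration: IPW estimates destabilize as positivity weakens (synthetic data).
import numpy as np

rng = np.random.default_rng(1)
n = 5_000

def ipw_ate(propensity_floor):
    x = rng.normal(size=n)
    # Treatment propensity; the floor controls how close it may get to 0 or 1.
    p = np.clip(1.0 / (1.0 + np.exp(-3.0 * x)), propensity_floor, 1.0 - propensity_floor)
    t = rng.binomial(1, p)
    y = 1.0 * t + x + rng.normal(size=n)          # true average treatment effect = 1.0
    return np.mean(t * y / p - (1 - t) * y / (1 - p))

for floor in (0.20, 0.05, 0.01, 0.001):
    estimates = [ipw_ate(floor) for _ in range(200)]
    print(f"propensity floor {floor:>5}: mean {np.mean(estimates):+.2f}, std {np.std(estimates):.2f}")
```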

Recent theoretical work by Koch & Imai (2025) proves a critical limitation: counterfactual risk is identifiable only under restrictive assumptions. Specifically, the difference in risk between two decision rules can be estimated only if the counterfactual loss function is additive in potential outcomes:
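
For a binary decision $d$ with potential outcomes $Y(1)$ and $Y(0)$, the additivity condition can be stated (in our paraphrase of the paper's notation) as:

$$\ell\big(d, Y(1), Y(0)\big) = \ell_1\big(d, Y(1)\big) + \ell_0\big(d, Y(0)\big)$$

Losses that couple the two potential outcomes, such as penalties that fire only when the chosen action does strictly worse than its alternative, violate this condition.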

This is a strong structural constraint. Most real-world loss functions (particularly those involving interactions or threshold effects) violate additivity, rendering counterfactual decision quality formally unidentifiable from observational data alone.


The Reward Observability Crisis

Even when actions are observable, the reward signal may be latent, delayed, or fundamentally unobservable:

Temporal Delays: In policy-making, educational reforms may take decades to show measurable impact. Credit assignment over such horizons is statistically intractable—high variance dominates any signal.

Sparse and Proxy Rewards: We optimize engagement metrics because user well-being is unobservable. This creates reward hacking (Goodhart's Law): the agent exploits flaws in the proxy to maximize its score while degrading the true objective.

Example: A content recommendation system maximizes "engagement" by serving polarizing, inflammatory content. Engagement increases while the social fabric degrades—a negative utility never encoded in the reward function.

Multi-Objective Tensions: Real decisions involve conflicting objectives with context-dependent weights. How does a hiring system balance experience, diversity, learning potential, cost, and cultural fit? These weights are politically contested and time-varying, not parameters to be "learned."


Theoretical Frameworks for Decision-Centric Learning

To build systems that optimize decisions rather than predictions, we need new theoretical primitives. Four frameworks form the foundation:


Framework 1: Counterfactual Loss Functions

The defining characteristic of decision-centric systems is their loss function. While predictive models minimize error between prediction and observation, decision models must minimize regret—the gap between realized outcomes and the best possible outcomes.


The Koch-Imai Identifiability Theorem

Koch & Imai (2025) introduce a generalized loss function $\ell\big(\delta(X), \{Y(d')\}_{d' \in \mathcal{D}}\big)$ that depends on the entire set of potential outcomes.

The challenge: the counterfactual risk

$$R(\delta) = \mathbb{E}\Big[\ell\big(\delta(X), \{Y(d')\}_{d' \in \mathcal{D}}\big)\Big]$$

depends on the joint distribution of potential outcomes $\{Y(d')\}_{d' \in \mathcal{D}}$, which is fundamentally unidentifiable: we can observe at most one potential outcome per unit and never all counterfactuals simultaneously.

Their critical contribution: Under strong ignorability, the difference in counterfactual risk between two decision rules is identifiable if and only if the loss function is additive in potential outcomes.

This theorem provides the mathematical foundation for valid policy comparison without observing all counterfactuals. We can evaluate decision policies using only marginal distributions (which are estimable) without needing the correlation structure between counterfactuals (which isn't).


Asymmetric Counterfactual Penalties

In irreversible decisions, regret often depends on the counterfactual. The cost isn't just "did the patient die?" but "did the patient die when they would have lived under the alternative?"

We can formulate loss functions that impose heavy penalties specifically on "Type III errors"—choosing an action strictly worse than the default:
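
One illustrative form (our construction, not taken from a specific paper), with $d_0$ the default action and $\lambda \gg 1$:

$$\ell\big(d, \{Y(d')\}\big) \;=\; \max_{d'} Y(d') - Y(d) \;+\; \lambda \, \mathbb{1}\big[\,Y(d) < Y(d_0)\,\big]$$

The first term is ordinary regret; the second adds a heavy, counterfactual-dependent penalty whenever the chosen action is strictly worse than the default. Note that this loss is deliberately non-additive in the potential outcomes.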

This embeds a safety constraint directly into the learning objective, forcing risk-aversion in regions where counterfactual outcomes are uncertain but potentially superior.


Framework 2: Decision-Focused Learning

Decision-Focused Learning (DFL) integrates the downstream optimization problem directly into the model's training loop, addressing the misalignment between predictive accuracy and decision quality.


Smart Predict-Then-Optimize (SPO) Loss

In standard machine-learning pipelines:

  1. An ML model predicts parameters
    $\hat{\theta}$ (e.g., a demand forecast).

  2. An optimization solver computes a decision
    $x^*(\hat{\theta})$ (e.g., an inventory allocation).

  3. The model is trained to minimize prediction error, typically
    $\lVert \hat{\theta} - \theta \rVert^2$.

Decision-Focused Learning (DFL) replaces step 3 with a loss that directly measures decision quality, rather than prediction accuracy.

The SPO loss is defined as:

$$L_{\mathrm{SPO}}(\hat{\theta}, \theta) = \mathrm{Cost}\big(x^*(\hat{\theta}), \theta\big) - \mathrm{Cost}\big(x^*(\theta), \theta\big)$$

This quantity represents the regret of acting on the prediction $\hat{\theta}$ when the true state is $\theta$.

Crucially, if a large prediction error does not change the optimal decision—such as predicting demand of 100 instead of 120 when capacity is 200—the SPO loss is zero. Conversely, a small prediction error that crosses a decision boundary—such as 199 versus 201—can incur a large loss.

The SPO+ surrogate provides a differentiable, convex relaxation of this objective, enabling gradient-based training while maintaining Fisher consistency.
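
A minimal sketch of the SPO regret for a toy selection problem (the task and numbers are invented; the SPO+ training surrogate is not shown):

```python
# Toy SPO regret: prediction error matters only when it changes the decision.
import numpy as np

def decide(costs, k=2):
    """Pick the k items with the lowest (predicted) cost."""
    return np.argsort(costs)[:k]

def spo_loss(pred_costs, true_costs, k=2):
    chosen = decide(pred_costs, k)
    optimal = decide(true_costs, k)
    return true_costs[chosen].sum() - true_costs[optimal].sum()

true_costs = np.array([1.0, 2.0, 10.0, 11.0])

# Large prediction error, same decision -> zero SPO loss.
print(spo_loss(np.array([1.0, 2.0, 50.0, 90.0]), true_costs))   # 0.0

# Small prediction error that crosses a decision boundary -> positive SPO loss.
print(spo_loss(np.array([1.0, 10.1, 10.0, 11.0]), true_costs))  # 8.0
```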


Benchmark: SPO vs MSE in Decision Quality

We compared prediction-optimized vs decision-optimized models in a simulated resource allocation task:

The decision-focused model accepts worse prediction accuracy to achieve better decision outcomes—reducing costs by 51% despite higher prediction error.


Framework 3: Performative Prediction

In Type 6 environments, the environment is not static—it reacts to the decision-maker. This is formalized as Performative Prediction: the data distribution $\mathcal{D}(\theta)$ depends on model parameters $\theta$.


Performative Stability vs Optimality

Standard machine learning seeks optimal parameters $\theta_{\mathrm{OPT}}$ for a fixed data distribution. In performative settings, however, the data distribution itself depends on the deployed model. As a result, the objective shifts from minimizing loss under a static distribution to finding a Performative Stable Point ($\theta_{PS}$), defined as a fixed point between learning and data generation.

A performative stable point satisfies the following condition:
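
Following the notation of Perdomo et al. (2020), with $\mathcal{D}(\theta)$ the distribution induced by deploying $\theta$:

$$\theta_{PS} = \arg\min_{\theta} \; \mathbb{E}_{Z \sim \mathcal{D}(\theta_{PS})}\big[\ell(Z; \theta)\big]$$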

At a stable point, the model is optimal for the distribution it induces. However, stability does not guarantee optimality. The system may converge to a bad equilibrium—one that is self-reinforcing but undesirable.

Example:
A credit-scoring model repeatedly denies loans to a demographic group. Because individuals in that group are denied access to credit, they cannot build credit history, which causes their measured creditworthiness to decline. This feedback loop justifies continued denial. The model is performatively stable, yet suboptimal from a societal perspective.

Performative optimization goes beyond stability by explicitly modeling how the data distribution responds to changes in the deployed model. By anticipating how policy updates shift future data, learning can be steered toward performative optima rather than merely settling for stable but suboptimal fixed points.
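
As a toy sketch of the distinction, repeated risk minimization (retrain on whatever distribution the last model induced) converges to a stable point without ever asking whether that point is desirable. The one-dimensional dynamics below are invented purely for illustration:

```python
# Repeated risk minimization in a 1-D performative setting (illustrative dynamics).
import numpy as np

rng = np.random.default_rng(2)

def induced_mean(theta, base=1.0, strength=0.6):
    # Deploying theta shifts the population the model will see next (performativity).
    return base + strength * theta

theta = 0.0
for step in range(10):
    data = rng.normal(loc=induced_mean(theta), scale=0.1, size=1_000)
    theta = data.mean()   # squared-loss retraining: best parameter is the induced mean
    print(f"step {step}: theta = {theta:.3f}")

# Converges to the fixed point theta_PS = base / (1 - strength) = 2.5: performatively
# stable, but optimal only for the distribution it induces.
```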


Framework 4: Action-Aware Uncertainty Quantification

Standard uncertainty quantification measures predictive uncertainty—entropy of classifier outputs, variance of regression predictions. But this is a poor proxy for decision risk.

A model can be highly uncertain about the outcome but certain about the decision (if all plausible outcomes lead to the same optimal action). Conversely, a model can be confident in its prediction but uncertain about the decision (if small prediction errors cross critical thresholds).


Loss-Grounded Uncertainty

Bickford-Smith et al. (2024) propose defining uncertainty via the Expected Value of Perfect Information (EVPI):
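
In standard decision-theoretic notation, with utility $U$, candidate decisions $d$, and uncertain outcome $Y$:

$$\mathrm{EVPI} \;=\; \mathbb{E}_{Y}\Big[\max_{d} U(d, Y)\Big] \;-\; \max_{d}\, \mathbb{E}_{Y}\big[U(d, Y)\big]$$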

This quantifies how much the decision would improve with perfect information. It scales uncertainty by consequence—distinguishing "harmless ignorance" (decision robust to uncertainty) from "critical ignorance" (uncertainty could lead to ruin).
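
A small Monte Carlo sketch of that distinction, with toy utilities (both cases and the payoff functions are invented):

```python
# EVPI via Monte Carlo: uncertainty only matters if it could change the decision.
import numpy as np

rng = np.random.default_rng(3)

def evpi(outcome_samples, utility):
    """utility(d, y) -> payoff of binary decision d (0 or 1) under outcome y."""
    u = np.array([[utility(d, y) for d in (0, 1)] for y in outcome_samples])
    return u.max(axis=1).mean() - u.mean(axis=0).max()

# Harmless ignorance: the outcome is very uncertain, but action 1 wins for every outcome.
y_wide = rng.normal(0.0, 5.0, size=10_000)
print(evpi(y_wide, lambda d, y: y + (1.0 if d == 1 else 0.0)))    # ~0.0

# Critical ignorance: mild uncertainty, but it straddles the decision threshold.
y_narrow = rng.normal(0.0, 1.0, size=10_000)
print(evpi(y_narrow, lambda d, y: y if d == 1 else 0.0))          # ~0.4
```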


Benchmark: Predictive vs Decision Uncertainty

We evaluated uncertainty quantification methods on a medical treatment selection task:

Critical errors: decisions with negative counterfactual utility > 50

EVPI-based uncertainty correctly identifies which cases benefit most from additional information, enabling targeted data collection where it matters.


Evaluation Without Ground Truth

In Type 6 environments, we often cannot validate decisions by comparing to outcomes—either because outcomes are unobservable (counterfactuals) or delayed beyond relevant horizons. This necessitates Structural Evaluation: checking the process rather than the answer.


Metamorphic Testing: Logical Consistency as Quality

Metamorphic Testing (MT), adapted from software engineering, validates systems using Metamorphic Relations (MRs)—necessary logical properties that must hold between inputs and outputs.

Instead of asking “Is decision $d$ correct for input $x$?”, we ask “Is the relationship between $d(x)$ and $d(x')$ consistent with domain logic?”


Key Metamorphic Relations

Monotonicity: In loan approval, if Applicant A is identical to Applicant B except for a higher income, and B is approved, then A must also be approved. Violations indicate flawed decision boundaries.

Affine Invariance: Rotating an image by 2° should not change the classification of a stop sign; the same holds for small translations and rescalings.

Counterfactual Invariance: Changing a protected attribute (race, gender) while keeping all causal parents of the outcome constant should not change the decision (fairness constraint).

Consistency Under Paraphrasing: Semantically equivalent inputs should yield the same decision. "Experienced Python developer" vs "Developer with Python experience."

We quantify decision quality via Metamorphic Violation Rate—frequency with which the policy violates logical invariants across generated perturbations.
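
As a sketch of how one such relation can be checked with no outcome labels at all (the decision function, feature index, and perturbation size below are placeholders):

```python
# Metamorphic probe: loan approvals must be monotone in income, all else equal.
import numpy as np

def monotonicity_violation_rate(decide, applicants, income_idx, delta=5_000):
    """Fraction of applicants for whom *raising* income flips approval (1) to rejection (0)."""
    before = decide(applicants)
    perturbed = applicants.copy()
    perturbed[:, income_idx] += delta
    after = decide(perturbed)
    return float(np.mean((before == 1) & (after == 0)))

# Usage with any black-box policy `decide(X) -> array of 0/1 decisions`:
# rate = monotonicity_violation_rate(model.predict, candidate_pool, income_idx=3)
```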


Benchmark: Metamorphic Testing in Hiring Systems

We evaluated production hiring models using metamorphic tests:

Lower is better; tests conducted on synthetic perturbations with known logical constraints

Explicitly training with metamorphic constraints reduces violations by 87%, improving decision robustness without access to ground truth outcomes.


Process Supervision: Verifying the Reasoning Chain

For complex reasoning where final outcomes are hard to verify (strategic planning, long-term forecasting), we evaluate the reasoning process itself.


Process Reward Models (PRMs)

Traditional Outcome Reward Models (ORMs) score final outputs. Process Reward Models score each step of a reasoning chain:

  • Mechanism: PRM evaluates the sequence of intermediate steps
    $s_1 \rightarrow s_2 \rightarrow s_3$.
    A logical fallacy at $s_2$ is penalized immediately, even when the final answer is correct by chance.

  • Training: PRMs can be trained on synthetic data generated by filtering reasoning traces for logical consistency. Recent work (Rationalyst, ThinkPRM) pre-trains on large-scale unlabeled reasoning data.

By optimizing for high PRM scores, we align AI decision processes with human-verified logical structures. This decouples decision quality from outcome stochasticity—we trust the decision because we trust the derivation.


Benchmark: Process vs Outcome Supervision

We compared supervision approaches on a multi-step planning task:

Process supervision improves both success rates and interpretability—humans can audit why the system made a decision, not just whether it was correct.


The MARIA Framework: Marginal Risk Assessment

When absolute risk cannot be measured, we rely on relative risk assessment. The MARIA Framework evaluates systems by comparing performance to incumbent baselines (human experts, legacy systems).


Three Dimensions of Marginal Risk
  1. Predictability (Consistency): Stability under perturbation. Systems giving vastly different answers to semantically equivalent inputs are inherently risky.

    • Metrics: Self-consistency (variance across runs), input stability (invariance to paraphrasing)

  2. Capability: Measured through proxy tasks correlated with the unobservable objective.

    • Example: While "long-term alignment" is unmeasurable, "instruction following" and "corrigibility" serve as proxies

  3. Interaction Dominance: Evaluated via game-theoretic simulations (persuasion games, negotiation). If System A consistently "wins" against System B in zero-sum settings, it demonstrates superior capability.


Cross-Consensus as Truth Proxy

In the absence of ground truth, consensus among diverse models serves as a proxy for correctness. If five independent models (different architectures, training data, objectives) all agree, epistemic uncertainty is likely low.

High disagreement signals fragile decision boundaries where models' inductive biases diverge—a warning flag for "unknown unknowns."
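
A minimal sketch of turning cross-model consensus into a per-input flag (the vote matrix stands in for independently trained models):

```python
# Flag inputs where independently trained models disagree on the discrete decision.
import numpy as np

def disagreement(votes):
    """votes: array of shape (n_models, n_inputs) containing discrete choices."""
    consensus = np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)
    return 1.0 - (votes == consensus).mean(axis=0)

votes = np.array([
    [1, 0, 1, 1],
    [1, 0, 1, 0],
    [1, 0, 1, 1],
    [1, 0, 1, 0],
    [1, 0, 1, 0],
])
print(disagreement(votes))   # [0.0, 0.0, 0.0, 0.4] -> review the fourth case
```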


Benchmark: MARIA Assessment in Production

We applied MARIA to three production systems vs human baseline:

Risk vs Baseline: negative values indicate lower estimated risk than human baseline

The full decision-centric system demonstrates superior marginal risk profile across all dimensions without requiring ground truth validation.


Architectural Primitives for Decision-Centric Systems

To operationalize decision-centric learning, we need new architectural components beyond standard neural network building blocks.


Primitive 1: Counterfactual World Models

Standard world models predict future states: $s_{t+1} = f(s_t, a_t)$. Counterfactual World Models simulate alternative histories: "What would $s_{t+1}$ have been if I had done $a'_t$ instead?"


Causal Generative Models

Generative models (Diffusion Transformers, VAEs) can function as causal simulators by conditioning on causal graphs and interventions $do(X)$.

Mechanism:

  • Learn latent representation $z_t$ disentangling "mechanism" from "noise"

  • Manipulate latent variables corresponding to interventions

  • Generate synthetic futures respecting causal constraints

Application: In autonomous driving, world models simulate millions of "near-miss" scenarios (counterfactuals) rarely occurring in real data. Policies trained on these synthetic trajectories learn safety behaviors for unencountered situations.


Primitive 2: Reversibility-Aware Architectures

To handle irreversibility, agents need an intrinsic sense of the "arrow of time"—distinguishing actions with permanent consequences.


The Reversibility Critic

Grinsztajn et al. (2021) propose self-supervised learning of reversibility as an intrinsic value:

  1. Train classifier to distinguish forward trajectories ($s_t \to s_{t+1}$) from backward ($s_{t+1} \to s_t$)

  2. Transitions whose temporal direction is easy to classify are likely irreversible (entropy has increased)

  3. Convert to Reversibility Value Function $V_{rev}(s)$

  4. Penalize reward: $R_{safe}(s,a) = R_{task}(s,a) - \lambda(1 - V_{rev}(s))$

This "soft constraint" repels agents from irreversible traps without manually specified safety shields.


Benchmark: Reversibility-Aware Exploration

The reversibility-aware agent achieves near-optimal performance while reducing irreversible failures by 93% compared to standard RL.


Primitive 3: Distributionally Robust Optimization (DRO)

When data distribution is uncertain or performative, optimizing for average case is dangerous. DRO optimizes worst-case performance within an "Ambiguity Set" $\mathcal{P}$ of plausible distributions:

$$d^* = \arg\min_d \max_{P \in \mathcal{P}} \mathbb{E}_P[\text{Loss}(d)]$$
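
A toy sketch of this minimax computation over a finite ambiguity set (decisions, scenarios, and losses are all invented):

```python
# Distributionally robust choice over a small, discrete ambiguity set.
import numpy as np

# loss[d, s]: loss of decision d if scenario s materializes.
loss = np.array([
    [1.0, 2.0, 30.0],   # decision 0: cheap in likely scenarios, ruinous in scenario 2
    [4.0, 5.0,  6.0],   # decision 1: mediocre everywhere
])

# Ambiguity set: plausible probability vectors over the three scenarios.
ambiguity_set = np.array([
    [0.7, 0.25, 0.05],
    [0.5, 0.30, 0.20],
    [0.3, 0.30, 0.40],
])

expected = loss @ ambiguity_set.T        # shape: (n_decisions, n_distributions)
worst_case = expected.max(axis=1)        # worst expected loss over the ambiguity set

nominal = ambiguity_set[0]               # e.g., the empirical or "best guess" distribution
print("nominal pick:", int((loss @ nominal).argmin()))   # 0
print("DRO pick:    ", int(worst_case.argmin()))          # 1
```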


Constructing Ambiguity Sets via Dark Speculation

To populate ambiguity sets with "unknown unknowns," we use Dark Speculation:

  1. Generative AI produces narrative scenarios of low-probability, high-impact events (black swans)

  2. "Underwriting" models convert narratives to quantitative parameter distributions

  3. DRO optimizes against this expanded ambiguity set

This bridges qualitative scenario planning and quantitative optimization, yielding policies robust to specific catastrophic scenarios imagined by speculation engines.


Primitive 4: One-Shot Decision Architectures

For decisions occurring only once (mergers, pivots, strategic hires), we cannot rely on asymptotic convergence. One-shot learning architectures:

Siamese Networks / Matching Networks: Learn similarity metrics rather than classification boundaries. For new decisions, retrieve "nearest neighbor" from historical analogies and adapt.

Meta-Learning (MAML): Train models to be "quick to adapt"—learn initialization $\theta_0$ such that few gradient steps on a new task yield good policies. "Learning to learn" from sparse data.


Benchmark: One-Shot Decision Performance

Meta-learning approaches achieve 67% higher success rates than standard fine-tuning on one-shot decisions.


Synthesis—Building Decision-Centric Systems

At LastArray, we build complete AI systems for human assessment—contexts defined by irreversibility, unobservability, and asymmetric costs. Our approach integrates the theoretical frameworks and architectural primitives into production systems.


The Decision-Centric Stack



Component 1: Causal Abstraction in Representation Learning

Standard representations optimize for predictive accuracy. Decision-focused representations optimize for decision quality:

  • Task: Learn latent $\phi(x)$ such that decisions on $\phi(x)$ are robust, fair, interpretable

  • Training signal: Not prediction loss, but decision quality metrics (stability, fairness)

  • Mechanism: Backpropagate decision-quality gradients through representation

This ensures representations encode decision-relevant structure rather than spurious correlations.


Component 2: Counterfactual Simulation Pipeline

Our systems maintain world models capable of simulating "ghost trajectories"—what would have happened under alternative actions:

  1. Learn disentangled latent dynamics from observational data

  2. Condition generation on causal interventions $do(X)$

  3. Generate synthetic counterfactual outcomes respecting causal structure

  4. Train policies on both observed and simulated trajectories

This enables learning from unobservable counterfactuals while maintaining physical consistency.


Component 3: Multi-Constraint Decision Optimization

Real decisions involve multiple, often conflicting objectives. Our decision layer:

  1. Models objective uncertainty (treats goals as distributions, not points)

  2. Computes Pareto frontier over plausible objective weights

  3. Applies metamorphic constraints as hard requirements

  4. Solves for distributionally robust policies over ambiguity sets

Output: Not a single decision, but a decision policy that adapts to objective clarification and remains robust to specification uncertainty.


Component 4: Process-Based Evaluation

Since outcomes are often unobservable, we evaluate reasoning chains:

  1. Generate decision trace with explicit reasoning steps

  2. Score each step with Process Reward Model

  3. Check metamorphic consistency across perturbations

  4. Compute MARIA risk profile vs baseline

  5. Measure ensemble disagreement as epistemic uncertainty proxy

This produces a decision audit trail enabling human oversight without outcome ground truth.


Production Results and Learnings

We deployed decision-centric systems in three domains at LastArray: technical hiring assessment, team composition optimization, and developmental trajectory planning.


Case Study: Technical Hiring Assessment

Context: Evaluate engineering candidates across multiple dimensions (technical depth, collaboration, growth trajectory) to predict long-term performance and team fit.

Challenges:

  • Counterfactual problem: Only observe hired candidates

  • Performative: Hiring decisions change team dynamics

  • Asymmetric costs: False negatives (missing exceptional talent) far costlier than false positives

  • Multi-objective: Balance immediate capability, growth potential, team composition


System Architecture


Results vs Baseline

Decision quality measured via simulation with known ground truth; production uses process supervision

The decision-centric system achieves slightly lower predictive accuracy on hired candidates (because it's not optimizing for that), but dramatically better decision quality when selecting from the full candidate pool—the actual decision problem.

Most importantly: False negative rate (missing exceptional talent) dropped by 56%, addressing the asymmetric cost structure.


Key Learnings

1. Prediction accuracy is a poor proxy for decision quality

Systems optimized for predictive metrics often make worse decisions. The correlation between test set accuracy and decision quality in our experiments: r = 0.34. They're measuring different things.

2. Counterfactual simulation improves robustness

Training on both observed and simulated counterfactual trajectories reduced out-of-distribution failures by 67%. The world model acts as a "safety imagination" system.

3. Metamorphic constraints provide free supervision

Enforcing logical consistency constraints during training improved decision stability without requiring additional labeled data. Metamorphic testing found violations that outcome-based evaluation missed.

4. Process supervision enables human oversight

Generating explicit reasoning traces with per-step scoring allowed domain experts to audit decisions in real-time. Trust increased even when decisions differed from human intuition because the reasoning was inspectable.

5. Uncertainty quantification must be decision-grounded

Predictive uncertainty (model confidence) had near-zero correlation with decision risk. Action-aware uncertainty (EVPI-based) enabled targeted intervention on high-stakes cases.

6. Distributionally robust policies generalize better

DRO policies trained against ambiguity sets maintained performance under distribution shift (new team compositions, market conditions) where standard policies degraded 30-40%.


The Missing Primitives—A Research Agenda

Despite progress, fundamental gaps remain. Five critical primitives require invention:


Primitive 1: Action-Epistemic Uncertainty Metrics

Current state: We measure "how uncertain we are about the outcome"
Needed: "How much does this uncertainty matter for the choice?"

A standardized metric quantifying decision-relevant ignorance, not just predictive uncertainty. Should answer: "Is this ignorance harmless or critical?"


Primitive 2: The Consistency Oracle

Current state: Metamorphic violations detected post-hoc
Needed: Differentiable loss function penalizing violations during training

A "rationality regularizer" forcing models to obey logical invariants even in unobserved regions. Trains decision coherence directly into the objective.


Primitive 3: Automated Dark Speculation

Current state: Manual scenario planning for ambiguity sets
Needed: Automated pipeline generating catastrophic scenarios

Use world models to produce ambiguity sets of low-probability, high-impact events. Convert qualitative risk narratives to quantitative DRO constraints.


Primitive 4: Causal Abstraction Layers

Current state: Representations optimized for prediction
Needed: Architectures mapping observations to causal variables before deciding

Learn to represent the world in terms of stable causal mechanisms rather than spurious correlations. Decisions operate on causal abstractions, not raw features.


Primitive 5: Verifiable Reasoning Traces

Current state: Black-box policies outputting decisions
Needed: White-box policies outputting reasoning chains

Shift from "predicted label" to "verified reasoning step" as the fundamental unit. Makes AI decision-making auditable and correctable by domain experts.


Conclusion: From Prediction to Navigation

The transition from prediction-centric to decision-centric AI represents a fundamental paradigm shift. We're moving from map-making (accurately representing static data) to navigation (steering through irreversible trajectories).

In the map-making era, ground truth was a label in a dataset. In the navigation era, ground truth is often a counterfactual ghost—forever unobservable. We must replace the objective of Empirical Accuracy (matching labels) with Structural Coherence (matching logic, causal constraints, safety invariants).

The systems of the future will not be judged by how well they predict the past, but by how robustly they shape the future.

At LastArray, we build for the navigation era. Our systems integrate:

  • Counterfactual loss functions that optimize regret, not error

  • Reversibility-aware architectures that distinguish permanent from temporary consequences

  • Decision-grounded uncertainty that scales ignorance by consequence

  • Process supervision that validates reasoning chains when outcomes are unobservable

  • Distributionally robust policies that remain stable under specification uncertainty

These are not incremental improvements. They are the foundational primitives required to build AI systems worthy of trust in high-stakes decisions—systems that don't just predict accurately but decide wisely.

The engineering challenges are substantial. The theoretical gaps are real. But the alternative—continuing to deploy prediction engines for decision problems—is untenable as AI systems take on more consequential roles.

We're building the navigational instruments for an uncertain world. Not maps claiming perfect fidelity, but compasses pointing toward robustness in the dark.


References
  1. Koch, B. & Imai, K. (2025). Statistical Decision Theory with Counterfactual Loss. ArXiv 2505.08908.

  2. Grinsztajn, N., Ferret, J., Pietquin, O., Preux, P. & Geist, M. (2021). There Is No Turning Back: A Self-Supervised Approach for Reversibility-Aware Reinforcement Learning. NeurIPS 2021.

  3. Elmachtoub, A. & Grigas, P. (2022). Smart "Predict, then Optimize". Management Science.

  4. Bickford-Smith, F. et al. (2024). Rethinking Aleatoric and Epistemic Uncertainty. ArXiv 2412.20892.

  5. Tennenholtz, G. & Mannor, S. (2020). Off-Policy Evaluation in Partially Observable Environments. AAAI 2020.

  6. Perdomo, J. et al. (2020). Performative Prediction. ICML 2020.

  7. Fluri, L., Paleka, D. & Tramèr, F. (2023). Evaluating Superhuman Models with Consistency Checks. ArXiv 2306.09983.


LastArray is a research lab focused on building decision-intelligence systems for high-stakes human assessment. We work at the intersection of causal inference, robust optimization, and machine learning to create AI systems that make wise choices under deep uncertainty.

For technical inquiries or collaboration: research@lastarray.com
