December 25, 2025

Research
How AI Systems Reason Through Process-Level Evaluation and Reliable Assessment

Shubhankar Kahali

Zuzanna Kowalczyk

Kai Keskitalo

At LastArray, we approach AI assessment through a lens that most research labs overlook: the gap between what a model outputs and how it actually reasons. In high-stakes human decisions—hiring, clinical diagnosis, autonomous vehicle routing—the correctness of a final answer matters far less than the reliability of the reasoning path that produced it.

Traditional AI evaluation asks: "Did the model get the right answer?" We ask: "Can we trust how it got there? Would it arrive at the same conclusion through equally valid reasoning in similar scenarios? Can a human expert verify each step?"

This distinction isn't academic. It's the difference between deploying systems that work in controlled benchmarks and building systems that maintain reliability as real-world conditions shift. Reasoning trajectory modeling represents a fundamental rethinking of how we evaluate, improve, and deploy AI systems for human assessment.


The Core Problem: Outcome Supervision Isn't Enough

The dominant paradigm in AI development relies on outcome supervision: train models on input-output pairs, optimize for final answer accuracy, and deploy. This approach has produced remarkable benchmark results. It has also produced systems that fail unpredictably, generate plausible-sounding nonsense, and provide no mechanism for human oversight when stakes are high.

Consider a medical diagnosis system that correctly identifies a rare condition. Traditional evaluation would mark this as success. But if the model arrived at this diagnosis through a chain of reasoning that contradicts established medical knowledge—perhaps confusing symptoms, ignoring contradictory test results, or making logical leaps unsupported by evidence—the correct answer is actually a dangerous failure. The next patient with slightly different presentation will receive unreliable care.

Reasoning trajectory modeling addresses this by treating AI reasoning as a dynamic sequence of intermediate steps—a trajectory through state-action space—that can be independently evaluated, verified, and optimized. Rather than asking "what did the model output," we ask "how did the model think, and is each reasoning step sound?"


Technical Foundation: Trajectories as Markov Decision Processes

The mathematical formulation is elegant. We model natural language reasoning as a Markov Decision Process where trajectories τ = ⟨s₀, a₀, ..., s_T, a_T⟩ represent sequences of states and actions generated by a policy model π_θ.

For each problem input x, the reasoning trajectory τ decomposes into discrete steps where:

  • Actions represent reasoning thoughts (e.g., "I'll apply the distributive property here")

  • States capture observations and updated information (e.g., "This simplifies to 2x + 4")

  • Transitions follow from the policy model's learned behavior
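
Concretely, a trajectory is just an alternating record of reasoning thoughts and the observations they produce. A minimal sketch of that data structure (the class and field names are our own, for illustration):

```python
from dataclasses import dataclass

@dataclass
class Step:
    action: str  # a reasoning thought, e.g. "Apply the distributive property"
    state: str   # the resulting observation, e.g. "This simplifies to 2x + 4"

@dataclass
class Trajectory:
    problem: str       # the problem input x
    steps: list[Step]  # the (action, state) pairs making up tau

    def prefix(self, t: int) -> "Trajectory":
        """Truncate after step t: the unit a process reward model scores."""
        return Trajectory(self.problem, self.steps[: t + 1])
```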

The breakthrough insight: we can estimate the value of intermediate reasoning steps through offline simulation using outcome supervision, without requiring human annotation of every step:

Process Reward Estimation:

r_e(τ_{1:t}, y) = (1/K) Σ_{k=1}^{K} r_f(τ^k | τ_{1:t}, y)

Where τ_{1:t} denotes the trajectory prefix ending at action a_t, τ^k is one of K independent completions sampled from that prefix, and r_f assigns reward 1 if the completion reaches the correct answer y, 0 otherwise.

This formulation enables us to identify which reasoning steps lead to correct solutions and which lead to errors—even without explicit human annotation of step quality. A reasoning step that frequently leads to correct completions when the trajectory continues from that point receives high reward. Steps that lead to incorrect or incoherent completions receive low reward.
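
The estimator is straightforward to implement given a way to sample completions from the policy model and check final answers. A minimal sketch, where `sample_completion` and `extract_answer` are hypothetical wrappers around the policy model and the answer parser:

```python
def estimate_step_reward(prefix, y, sample_completion, extract_answer, k=8):
    """Monte Carlo estimate of r_e for a trajectory prefix tau_{1:t}:
    roll out K independent completions from the prefix and return the
    fraction whose final answer matches the reference answer y."""
    hits = 0
    for _ in range(k):
        completion = sample_completion(prefix)        # tau^k | tau_{1:t}
        hits += int(extract_answer(completion) == y)  # r_f in {0, 1}
    return hits / k
```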


State of the Art: The RPRM Framework

The Reasoning-Driven Process Reward Model (RPRM) represents the current technical frontier. Traditional process reward models directly predict scalar scores for intermediate steps—an approach that lacks both explainability and robustness. RPRM reimagines this through a two-phase generation process within a single LLM forward pass:


Phase 1 - Analysis

The model generates comprehensive multi-dimensional analysis of each reasoning step:

  • Historical context from previous steps

  • Objective and data sources of the current step

  • Coherence with preceding reasoning

  • Calculation verification and logical validity


Phase 2 - Judgment

The model generates natural language judgment ("Yes"/"No") regarding step correctness, replacing opaque scalar prediction with interpretable binary assessment.

The framework operates across three training stages:

1. Cold Start: Strong LLMs (LLaMA-3.3-70B) generate seed data from limited human-annotated process-level labels, filtering only judgments consistent with human annotations to produce 289k supervised fine-tuning samples.

2. Self-Evolution: Direct Preference Optimization (DPO) refines the model without additional labeled data, encouraging evaluation trajectories that yield correct judgments while penalizing incorrect ones.

3. Inference-Time Scaling: Multiple evaluation trajectories (K=4-10) are sampled per reasoning step and aggregated by averaging the probability of "Yes" judgments.
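
Stage 3 reduces to a simple aggregation rule at inference time. A sketch of the voting step, assuming a hypothetical `judge_once` that runs one analysis-plus-judgment trajectory and returns its probability of "Yes":

```python
def score_step(step_context, judge_once, k=4):
    """RPRM inference-time scaling: sample K evaluation trajectories for
    a single reasoning step and average the probability of a 'Yes'
    judgment. K=4 is the recommended performance-cost operating point."""
    return sum(judge_once(step_context) for _ in range(k)) / k
```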


Benchmark Performance

The results demonstrate substantial improvements over prior methods:

After iterative training, F1 reaches 74.1 on ProcessBench, an additional 3.7-point gain.

The data efficiency is remarkable: RPRM with 64k samples outperforms Qwen2.5-Math-7B-PRM800K (trained on 265k samples) by 3.6 F1 points, and exceeds the teacher model (LLaMA-3.3-70B) by 13.0 points when tested on ProcessBench.


Inference-Time Scaling Trade-offs

Computational cost analysis reveals a clear operating point: K=4 offers the best performance-cost trade-off, delivering 67.6 F1 at 0.52s per example on H100 GPUs. Beyond K=4, marginal gains diminish rapidly while computational cost scales linearly.


The Stability Crisis: Why Single-Run Evaluation Misleads

Recent research reveals a troubling reality: reasoning method performance exhibits concerning instability that standard evaluation protocols completely miss. The ReasonBENCH framework introduces multi-run evaluation protocols that expose this hidden variance.

Standard practice: report single-run accuracy.

ReasonBENCH protocol: 10 independent runs per model-algorithm-task combination.

The findings challenge conventional evaluation wisdom:

Finding 1 - Confidence Interval Variance: Methods with similar average accuracy display confidence intervals up to 4× wider. Two methods both achieving 82% average accuracy might show one with ±2% variance and another with ±8% variance.

Finding 2 - Cost-Quality Instability: Top-performing methods often incur higher and less stable computational costs. Graph-of-Thought (GoT) shows the cost-quality relationship actually flipping direction across benchmarks—sometimes higher cost improves quality, sometimes it degrades it.

Finding 3 - Price ≠ Consistency: Model price is not predictive of consistency. Some expensive models show higher variance than cheaper alternatives on identical tasks.

Best Quality + Stability: Few-shot aggregation (FoA) emerges as the most reliable method across diverse evaluation scenarios.

This instability has profound implications for deployment. A model that achieves 85% accuracy on a benchmark in one run but varies between 78-92% across repeated runs is fundamentally unreliable for production use. Real-world systems must operate consistently, not occasionally.
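
Adopting the multi-run protocol requires little more than repeating the evaluation and reporting an interval instead of a point estimate. A minimal sketch using a normal approximation for the confidence interval:

```python
import statistics

def multi_run_accuracy(run_eval, n_runs=10, z=1.96):
    """ReasonBENCH-style protocol: repeat the full evaluation n_runs
    times and report mean accuracy with a 95% confidence interval
    instead of a single-run point estimate."""
    accs = [run_eval() for _ in range(n_runs)]
    mean = statistics.mean(accs)
    half_width = z * statistics.stdev(accs) / n_runs ** 0.5
    return mean, half_width  # report as "mean ± half_width"
```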




Reliability Beyond Accuracy: Precision and Prudence

The ReliableMath benchmark introduces a critical dimension missing from traditional evaluation: the ability to recognize unsolvable problems. The test set comprises 313 solvable problems and 1,102 unsolvable problems synthesized via deliberate removal of critical information or introduction of logical contradictions.

The results are striking: All models using standard prompts show Precision(Unsolvable) ≈ 0, meaning they essentially never correctly identify that a problem cannot be solved. Instead, they generate lengthy outputs (6-16k tokens) that hallucinate solutions through invalid reasoning.


Reliability Metrics Framework

ReliableMath defines two complementary metrics:

Precision measures correctness:

  • Prec.(𝒜) = #Correct answers on solvable / #Solvable

  • Prec.(𝒰) = #Correctly identified unsolvable / #Unsolvable

  • Prec. = ½[Prec.(𝒜) + Prec.(𝒰)] (balanced metric)

Prudence measures cautious refusal:

  • Prud.(𝒜) = #Refusals on solvable / #Solvable

  • Prud.(𝒰) = #Refusals on unsolvable / #Unsolvable

  • Prud. = ½[Prud.(𝒜) + Prud.(𝒰)]

The preference hierarchy: Success > Refusal > Failure (S ⪰ R ⪰ F)

This captures a critical insight for high-stakes assessment: a system that refuses to answer when uncertain is more valuable than one that confidently provides incorrect answers.
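
Both metric families reduce to counting outcomes over the labeled test set. A sketch, assuming each record pairs a ground-truth solvability flag with a model outcome in {'correct', 'refused', 'wrong'}:

```python
def reliability_metrics(records):
    """ReliableMath-style Precision and Prudence. Each record is a pair
    (solvable, outcome) with outcome in {'correct', 'refused', 'wrong'};
    for unsolvable problems, 'correct' means the model correctly flagged
    the problem as unsolvable."""
    def rate(rs, outcome):
        return sum(o == outcome for _, o in rs) / len(rs)
    solvable = [r for r in records if r[0]]
    unsolvable = [r for r in records if not r[0]]
    precision = 0.5 * (rate(solvable, "correct") + rate(unsolvable, "correct"))
    prudence = 0.5 * (rate(solvable, "refused") + rate(unsolvable, "refused"))
    return precision, prudence
```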



Model Scale and Reliability Collapse

Model size dramatically affects reliability on unsolvable problems.


Below 7B parameters, models exhibit near-zero capability to identify unsolvable problems. This isn't simply an instruction-following issue—alignment via rejection sampling and DPO improves small models significantly (Qwen2.5-1.5B reaches 0.231 Precision after alignment) but remains well below large models. This suggests a fundamental capability gap related to model scale or training data quality.

DeepSeek-R1 with reliable prompt achieves:

  • Precision(Solvable) = 0.735

  • Precision(Unsolvable) = 0.549

  • Total Precision = 0.642 (best among all tested models)

  • Prudence = 0.007 (rarely refuses, relies on detection rather than abstention)


Multi-Dimensional Trustworthiness: The ReFIne Framework

Accuracy alone does not constitute trustworthiness. The ReFIne framework optimizes across three orthogonal properties through supervised fine-tuning combined with Group Relative Policy Optimization (GRPO):


1. Interpretability (+44.0% improvement)
  • Produces structured, tag-based traces with high-level planning

  • Enables human verification of logical structure

  • Metric: Agreement with human-annotated logical structure


2. Faithfulness (+18.8% improvement)
  • Explicitly discloses decisive information guiding each solution

  • Maintains consistent cross-section references to evidence

  • Metric: Factual grounding of claims in derivation


3. Reliability (+42.4% improvement)
  • Provides self-assessments of derivation soundness

  • Offers calibrated confidence estimates for final answers

  • Metric: Calibration between confidence and correctness

Implementation on Qwen-3 models (1.7B-8B parameters) demonstrates that trustworthiness can be systematically improved through targeted optimization, with evaluation across mathematical benchmarks of varying difficulty.
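
The reliability dimension rests on calibration between stated confidence and actual correctness. Expected calibration error is one standard way to quantify this; the sketch below uses that metric as an illustration, not necessarily the exact measure used by ReFIne:

```python
import numpy as np

def expected_calibration_error(confidence, correct, n_bins=10):
    """Bin answers by stated confidence and compare each bin's average
    confidence with its empirical accuracy; a well-calibrated model has
    an ECE near zero."""
    confidence = np.asarray(confidence, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.minimum((confidence * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            ece += mask.mean() * abs(confidence[mask].mean() - correct[mask].mean())
    return ece
```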


Mathematical Reasoning Performance Landscape

The contemporary benchmark landscape spans 15+ major frameworks targeting different dimensions of reasoning quality:

End-to-End Problem-Solving Accuracy


The dramatic drop from 91% on grade-school problems to 2% on expert-level mathematics (FrontierMath) reveals that current reasoning capabilities remain far from human expert performance on truly difficult problems. The field has largely saturated easier benchmarks while struggling with problems requiring deep domain expertise.


Multi-agent debate achieves the highest accuracy on GSM8K (91%), suggesting that ensemble reasoning approaches—where multiple models reason independently and reach consensus—may offer more reliable trajectories than single-model generation, even with sophisticated prompting strategies.


Self-Improving Reasoning: The TRIDENT Architecture

Moving beyond static reasoning, TRIDENT reconceptualizes reasoning as structured search with self-correction loops. Key architectural innovations:


1. Tree-of-Thought Reasoning Policy

Unlike linear Chain-of-Thought, concurrent evaluation of multiple reasoning paths permits identification of high-quality trajectories and prevents premature commitment to unproductive branches.


2. GNN-Guided Path Evaluation

Graph Neural Networks assess reasoning states and assign promise scores to partial paths, enabling early elimination of unproductive branches and concentration of computational resources on promising trajectories.


3. Self-Generative Reasoning Loop (SGRL)

The model generates its own reasoning paths, evaluates both answers and reasoning processes using verifiable rewards, and improves iteratively without dependence on human-annotated chains or external preference data.

This architecture overcomes the fixed reasoning behavior typical of models trained on single forward passes, enabling dynamic adaptation to problem-specific characteristics and current model capabilities.
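
The search loop itself is compact: propose several candidate thoughts per state, score partial paths with a learned evaluator, and keep only the most promising frontier. A sketch with hypothetical `expand`, `promise_score`, and `is_solution` callables standing in for the policy model and the GNN evaluator:

```python
import heapq

def tree_of_thought_search(root, expand, promise_score, is_solution,
                           beam_width=5, max_depth=8):
    """Best-first search over partial reasoning paths. expand(path)
    proposes candidate next thoughts; promise_score(path) estimates how
    likely the partial path is to reach a correct solution (a GNN in
    TRIDENT). Low-promise branches are pruned instead of being reasoned
    to completion."""
    frontier = [(-promise_score([root]), [root])]
    for _ in range(max_depth):
        candidates = []
        for neg_score, path in frontier:
            if is_solution(path):
                return path
            for thought in expand(path):
                new_path = path + [thought]
                candidates.append((-promise_score(new_path), new_path))
        if not candidates:
            break
        # keep only the beam_width most promising partial paths
        frontier = heapq.nsmallest(beam_width, candidates)
    return min(frontier)[1]  # best remaining path if no solution was found
```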




Applications: Autonomous Driving and Behavioral Topology

Reasoning trajectory modeling finds its most mission-critical application in autonomous vehicle navigation, where safety depends on accurately predicting and planning coordinated actions among multiple interactive agents.


The BeTop Framework

Behavioral Topology (BeTop) introduces topological formulations derived from braid theory to represent consensual behavioral patterns among multi-agent futures. Rather than dense behavioral representations (computationally inefficient) or sparse representations (inconsistent across scenarios), BeTop explicitly models relational topology through two layers:

Formation Layer: Captures multi-agent interaction structure (which vehicles are interacting, in what spatial relationships)

Dynamics Layer: Encodes motion evolution within the established topology (how interactions change over time)

Consistency Constraint: Ensures prediction-planning integration without trajectory conflicts between what the system predicts other agents will do and what it plans to do itself
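
The consistency constraint can be checked directly on the output trajectories: at every future timestep, the ego plan must keep safe separation from every predicted agent. A minimal sketch of that check (the 2-meter threshold is an illustrative choice, not a value from the paper):

```python
import numpy as np

def plan_conflicts(ego_plan, predicted_agents, min_separation=2.0):
    """ego_plan: (T, 2) planned x-y positions; predicted_agents: list of
    (T, 2) predicted positions for other agents over the same horizon.
    Returns True if the plan ever comes within min_separation meters of
    a predicted agent -- the prediction-planning inconsistency the
    constraint rules out."""
    ego = np.asarray(ego_plan, dtype=float)
    for agent in predicted_agents:
        dists = np.linalg.norm(ego - np.asarray(agent, dtype=float), axis=1)
        if (dists < min_separation).any():
            return True
    return False
```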


Performance on Motion Prediction Benchmarks

BeTop is evaluated on nuPlan and the Waymo Open Motion Dataset (WOMD).


The framework demonstrates that reasoning about vehicle trajectories through topological structure—modeling relationships and formations rather than just individual paths—produces more coherent, safer, and more interpretable autonomous driving behavior.


Scene-Context Representation (DSC-LLM)

The DSC-LLM framework combines multimodal LLM reasoning with trajectory prediction to enable:

  1. Real-time trajectory forecasting (critical for safety)

  2. Interpretable risk-aware reasoning (supports human oversight)

  3. Unified framework: Quantitative motion prediction + qualitative natural language explanation

This addresses a critical gap: traditional motion prediction models output coordinates without explanation, making it impossible for safety operators to understand why the system made specific predictions. DSC-LLM generates both the predicted trajectories and natural language reasoning explaining the prediction:

"Vehicle A is likely to merge left because: (1) their turn signal is active, (2) they are positioned at the lane boundary, (3) there is sufficient gap in left lane traffic."


Healthcare Applications: Neuro-Symbolic Reasoning and Clinical Training

In healthcare diagnostics, the explainability-accuracy gap is not merely inconvenient—it's a barrier to clinical adoption. Clinicians cannot use systems they cannot verify.


Logical Neural Networks for Clinical Diagnosis

Neuro-symbolic AI integrates neural perception with logical reasoning through Logical Neural Networks (LNNs) that learn:

  • Feature contributions via interpretable weights

  • Diagnostic rule thresholds directly from data

  • Explanations grounded in domain logic rather than black-box embeddings
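
The interpretability follows from the functional form: each diagnostic rule is a soft logical gate whose learned weights read off directly as feature contributions. A simplified sketch of a weighted AND neuron in the LNN style (the feature names and values are hypothetical, and this is not the paper's exact parameterization):

```python
import numpy as np

def weighted_and(truth_values, weights, beta=1.0):
    """Weighted Lukasiewicz AND, the building block of LNN diagnostic
    rules. Inputs are feature truth values in [0, 1]; each learned
    weight reads off directly as that feature's contribution to the
    rule, and beta acts as the rule's firing threshold."""
    truth_values = np.asarray(truth_values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return float(np.clip(beta - weights @ (1.0 - truth_values), 0.0, 1.0))

# e.g. a rule over [elevated_troponin, chest_pain] with troponin weighted
# more heavily; output near 1 means the rule fires for this patient
risk = weighted_and([0.9, 0.7], [1.0, 0.4])
```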

Performance on Diagnosis Prediction:

| Metric | LNN Performance | Pure Neural Network | Advantage |
| --- | --- | --- | --- |
| Accuracy | 87.3% | 88.9% | -1.6% (acceptable trade-off) |
| Explainability Score | 91.2% agreement with clinicians | 23.4% | +67.8% |
| Adaptability | Weight adjustment without retraining | Full retraining required | Operational |

The modest accuracy trade-off (1.6%) is acceptable given the dramatic explainability improvement (67.8% higher clinician agreement). In high-stakes medical decisions, being able to verify each reasoning step matters more than marginal accuracy gains.

Application Domains:

  • Cardiovascular disease risk stratification

  • Cancer prognosis prediction

  • Sepsis detection in ICU settings


Sequential Behavioral Analysis in Clinical Training

Extended Reality (XR) simulation combined with sequential analysis techniques enables unprecedented insight into clinical decision-making patterns:

Data Collection:

  • Behavioral log data: Computer-recorded timestamped sequences of clinical actions

  • Real-time environment feedback: Immediate indication of action correctness/incorrectness

  • Multiple completion paths: several valid routes to a successful outcome

Sequential Pattern Analysis:

  • Identifies common decision patterns

  • Distinguishes high vs. low performer trajectories

  • Reveals critical decision junctures where errors concentrate

Example Finding: In fluid resuscitation scenarios, successful trajectories show the pattern:

Rescue → Definitive Treatment → Outcome

Unsuccessful trajectories show:

Diagnostic → Outcome (premature termination, missing the treatment phase)

Sequential analysis automates what human observers previously had to manually identify, enabling scalable training assessment across thousands of simulation sessions.



Behavioral Signal Processing: Multi-Modal Assessment

Beyond text-based reasoning, behavioral signal processing (BSP) extracts predictive features from speech, language, and interaction patterns to enable richer human assessment.


BSP Pipeline Architecture

Components:

  1. Data Acquisition: Multi-modal recording (audio, video, EEG) in ecologically valid settings

  2. Feature Extraction: Speech spectral features, prosody, discourse patterns, gesture kinematics

  3. Behavior Modeling: Machine learning mapping behavioral cues to higher-level constructs
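
The behavior-modeling stage is an ordinary supervised mapping from extracted features to the construct of interest, as sketched below with made-up feature names and toy data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stage 2 output: hypothetical per-session features
# [speech_rate, mean_pause_sec, pitch_variability, interruption_count]
X = np.array([
    [3.1, 0.4, 22.0, 1.0],
    [2.2, 1.1, 12.0, 4.0],
    [2.9, 0.6, 19.0, 2.0],
    [1.8, 1.4, 10.0, 5.0],
])
y = np.array([1, 0, 1, 0])  # expert-assigned construct label per session

# Stage 3: behavior modeling. A linear model keeps the cue-to-construct
# mapping inspectable, in line with the human-in-the-loop principle.
model = LogisticRegression().fit(X, y)
features = ["speech_rate", "mean_pause_sec", "pitch_variability", "interruptions"]
print(dict(zip(features, model.coef_[0].round(3))))
```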

Applications:

  • Literacy assessment via oral language fluency

  • Autism diagnostics via vocal atypicality indices

  • Psychotherapy progress tracking via linguistic markers

  • Marital relationship quality prediction via conflict patterns

Key Principle: Human-in-the-loop design where automated feature extraction augments rather than replaces human expert judgment. BSP identifies patterns humans might miss (subtle prosodic changes, micro-gesture timing) while humans provide contextual interpretation and final assessment.


The QUEST Framework: Systematic Healthcare Evaluation

High-stakes healthcare deployment requires evaluation frameworks that go beyond benchmark accuracy. The QUEST framework establishes a three-phase evaluation workflow based on systematic review of 142 studies:


Phase 1 - Planning
  • Define evaluator qualifications and number (literature median: 20 evaluators)

  • Establish evaluation dimensions specific to clinical context

  • Choose blind vs. unblinded assessment strategy


Phase 2 - Implementation & Adjudication
  • Deploy evaluators with predefined scoring guidelines

  • Resolve inter-rater disagreements via consensus or senior adjudication

  • Collect both quantitative scores and qualitative rationales


Phase 3 - Scoring & Review
  • Calculate metrics with confidence intervals (not point estimates)

  • Perform statistical significance testing

  • Report both average performance and variance


Five Core Evaluation Principles

| Principle | Description | Validation |
| --- | --- | --- |
| Quality of Information | Accuracy, completeness, relevance | 93.9% expert agreement on criterion clarity |
| Understanding & Reasoning | Logical coherence, evidence integration | 80.2% LLM-as-judge agreement with humans |
| Expression Style & Persona | Clarity, empathy, appropriateness | 79.6% inter-human agreement (comparable baseline) |
| Safety & Harm | Absence of bias, fabrication, plagiarism | Mandatory safety assessment |
| Trust & Confidence | User satisfaction, reliability perception | Longitudinal measurement required |

Rubric validation achieves 93.9% expert agreement on criterion clarity, and LLM-as-judge demonstrates 80.2% agreement with human experts—comparable to 79.6% inter-human agreement, suggesting automated evaluation can approximate human judgment when properly calibrated.
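
The agreement figures are simple to reproduce for a local evaluation. A sketch of raw percent agreement between the LLM judge and human adjudicated labels:

```python
def percent_agreement(judge_labels, human_labels):
    """Fraction of items where the LLM-as-judge label matches the
    adjudicated human label -- the style of agreement statistic reported
    above. A chance-corrected statistic such as Cohen's kappa is a
    common stricter alternative."""
    pairs = list(zip(judge_labels, human_labels))
    return sum(j == h for j, h in pairs) / len(pairs)
```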


Limitations and Open Challenges


Computational Efficiency Trade-offs

Trajectory-based reasoning incurs substantial computational overhead:

Token Budget Scaling:

  • Reasoning models generate longer outputs under high token budgets (32k tokens)

  • Increased length paradoxically correlates with reduced accuracy (overthinking pathology)

  • Inference time: Pruned models at 32k tokens require 29.1 minutes vs. 23.3 minutes for dense models

  • Trade-off disappears at practical token limits (4-8k tokens)

The overthinking pathology represents a fundamental challenge: given unlimited computational budget, reasoning models don't always produce better answers. They can reason themselves into incorrect conclusions by over-analyzing or considering irrelevant factors. Optimal performance occurs at intermediate computational budgets where the model reasons sufficiently but not excessively.


Generalization Across Domains

Current systems show limited cross-domain transfer:

| Domain Transfer | Performance Drop | Implication |
| --- | --- | --- |
| High-school MATH → Olympiad AIME | Precision 0.7-0.9 → 0.1-0.6 | Domain-specific tuning required |
| Automotive trajectory prediction → New vehicle classes | Requires complete retraining | Limited transfer within domain |
| Clinical decision patterns → Different specialties | 30-45% accuracy drop | Specialty-specific models needed |

The ReliableMath benchmark demonstrates this clearly: unsolvability detection that works well on high-school problems (Precision 0.7-0.9) largely fails on olympiad-level problems (Precision 0.1-0.6). This isn't simply a difficulty issue—the reasoning patterns that indicate unsolvability differ fundamentally across mathematical domains.


The Causality Gap

A critical open challenge: reasoning trajectories often correlate with correct outcomes without necessarily explaining why those outcomes occur.

Trajectory Plausibility ≠ Causal Explanation: Models can generate coherent narratives that don't reflect actual decision logic. The model might produce a step-by-step derivation that arrives at the correct answer, but the internal computation that generated that answer may have followed a completely different path.

Counterfactual Validity Untested: Does removing a reasoning step actually change the conclusion as the model claims? Current evaluation frameworks struggle to measure whether explanations reflect human-like reasoning or merely plausible post-hoc rationalization.

Black-Box Evaluation of Reasoning Quality: Even trajectory-based methods struggle to distinguish genuine step-by-step reasoning from sophisticated pattern matching that mimics reasoning structure.

Future work requires formal verification methods and causal discovery techniques to establish trajectory-outcome links beyond correlation.


The Path Forward

Reasoning trajectory modeling has matured from theoretical construct to practical paradigm with concrete instantiations in process reward modeling, benchmark design, and human evaluation frameworks. The RPRM framework demonstrates that step-by-step reasoning can be reliably assessed at scale, achieving improvements of 8.5 to 13.9 F1 points over prior methods. ReliableMath and ReasonBENCH reveal that stability, reproducibility, and appropriate uncertainty calibration are as critical as average accuracy—and frequently overlooked in standard evaluation.

The most sophisticated contemporary systems combine three elements:

  1. Structured reasoning representation via trajectory modeling

  2. Human-aligned assessment through careful evaluation frameworks and verification

  3. Domain-specific instantiation tailored to application requirements

In safety-critical domains—autonomous driving, healthcare diagnostics, human assessment—this integration is non-negotiable.

At LastArray, we focus on building AI systems that behave well outside the lab. That means clear design, careful engineering, and emphasis on reliability over appearance. Reasoning trajectory modeling provides the technical substrate for preserving human-in-the-loop oversight in an era of increasingly capable AI systems.

The field stands at an inflection point. As reasoning becomes central to AI capability, the ability to transparently evaluate and improve reasoning processes determines whether AI systems remain intelligible to human oversight or become increasingly opaque.

We measure success not by benchmark leaderboard positions but by whether the systems we build continue to work as intended and earn trust as conditions change over time. Reasoning trajectory modeling—when implemented with appropriate attention to stability, reliability, and human verification—represents a path toward that goal.

The science of human assessment requires systems that show their work. We're building them.


References
  1. She, S., et al. (2025). "RPRM: Reasoning-Driven Process Reward Modeling." EMNLP 2025.

  2. Xue, B., et al. (2025). "ReliableMath: Benchmark of Reliable Mathematical Reasoning on Large Language Models."

  3. Potamitis, N., Klein, L., Arora, A. (2025). "ReasonBENCH: Benchmarking the (In)Stability of LLM Reasoning."

  4. Jiao, F., et al. (2024). "Learning Planning-based Reasoning by Trajectories Collection and Process Reward Synthesizing." EMNLP 2024.

  5. Sun, C.E., et al. (2025). "ReFIne: A Framework for Trustworthy Large Reasoning Models."

  6. Tam, T.Y.C., et al. (2024). "A framework for human evaluation of large language models in healthcare." Nature Medicine.

  7. Narayanan, S., et al. (2013). "Deriving Human Behavioral Informatics From Speech and Language."

  8. Lu, Q., et al. (2025). "Explainable Diagnosis Prediction through Neuro-Symbolic AI."

  9. Rochlen, L.R., et al. (2022). "Sequential Behavioral Analysis in XR Clinical Simulation." Simulation in Healthcare.
