Source: arXiv
Published: 17 June 2026

A conversation between

Hyeongwon Jang, Gyouk Chu, Changhun Kim, Joonhyung Park, Hangyul Yoon, Eunho Yang

TRIAGE: Dialectical Reasoning for Explainable Risk Prediction on Irregularly Sampled Medical Time Series with LLMs

§03

Synthesis

## The Problem: LLMs Oversimplify Medical Risk

Large language models show promise for clinical early warning systems, but they have a critical flaw: they collapse nuanced patient risk into overconfident binary predictions. When a patient is admitted to the ICU, clinicians need two things: a calibrated risk score (not just "yes" or "no") and reasoning they can verify. Current LLM approaches fail at both. They tend to commit confidently to one outcome or the other and lose the ability to express graduated risk. This "risk polarization" also makes it hard to rank patients against one another, because two patients with different underlying risk can receive identical predictions.

## The Method: Dialectical Reasoning

TRIAGE trains an LLM to generate competing arguments for different clinical outcomes using irregularly sampled medical time series (medical measurements taken at varying intervals, as they naturally occur in hospitals). The key insight is dialectical reasoning: instead of predicting a single outcome, the LLM explicitly reasons *for* and *against* each competing outcome. For example, rather than deciding "sepsis: yes or no," the model generates rationales explaining why the patient *might* develop sepsis and why they might not.

This approach works with real clinical data, where observations are scattered unevenly in time: a blood test at 3 AM, vital signs at 8 AM, lab work at noon. The abstract gives no implementation details, but the framework appears to elicit outcome-specific reasoning from the LLM before it produces a final risk score.
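To make the setup concrete, here is a minimal sketch of how irregularly sampled observations might be turned into a prompt that asks for arguments on both sides. It is an illustration only, not the authors' implementation: the `Observation` type, the `serialize` and `build_dialectical_prompt` helpers, the prompt wording, and the sample values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    hour: float     # hours since admission (hypothetical time base)
    variable: str
    value: float
    unit: str


def serialize(observations):
    """Render observations in time order, noting the gap since the previous one."""
    lines = []
    previous_hour = None
    for obs in sorted(observations, key=lambda o: o.hour):
        gap = "" if previous_hour is None else f" (+{obs.hour - previous_hour:.1f}h)"
        lines.append(f"t={obs.hour:.1f}h{gap}: {obs.variable} = {obs.value:g} {obs.unit}")
        previous_hour = obs.hour
    return "\n".join(lines)


def build_dialectical_prompt(observations, outcomes):
    """Ask for an argument for and against each outcome, then a graded risk."""
    parts = ["Patient observations (irregularly sampled):", serialize(observations), ""]
    for outcome in outcomes:
        parts.append(f"Argue FOR the outcome '{outcome}' using the observations.")
        parts.append(f"Argue AGAINST the outcome '{outcome}' using the observations.")
    parts.append("Then give a risk between 0 and 1 for each outcome.")
    return "\n".join(parts)


# Made-up example values, deliberately listed out of time order.
observations = [
    Observation(12.0, "creatinine", 1.3, "mg/dL"),
    Observation(3.0, "lactate", 2.4, "mmol/L"),
    Observation(8.0, "heart_rate", 112, "bpm"),
]
prompt = build_dialectical_prompt(observations, ["sepsis"])
print(prompt)
```

Writing the time gap next to each reading keeps the uneven sampling visible to the model instead of resampling the series onto a fixed grid.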

The dialectical formulation mitigates polarization because the model must attend to evidence for *all* outcomes, not just the highest-confidence prediction. This naturally produces continuous risk scores rather than binary labels, and the explicit reasoning grounds those scores in clinical logic.
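The contrast between a polarized prediction and a graded one can be shown with a toy calculation. The sketch below is not TRIAGE's scoring rule, which the source does not describe. It assumes, purely for illustration, that each outcome has a numeric "support" value summarizing its arguments, and it compares a hard argmax decision with a softmax over those values. The function names and numbers are hypothetical.

```python
import math


def polarized_prediction(support):
    """Hard decision: put all probability on the best-supported outcome."""
    best = max(support, key=support.get)
    return {outcome: 1.0 if outcome == best else 0.0 for outcome in support}


def graded_risk(support, temperature=1.0):
    """Softmax over per-outcome support, so every outcome keeps some weight."""
    peak = max(support.values())  # subtract the max for numerical stability
    exps = {k: math.exp((v - peak) / temperature) for k, v in support.items()}
    total = sum(exps.values())
    return {k: e / total for k, e in exps.items()}


# Two hypothetical patients: A is a borderline case, B is clear-cut.
patient_a = {"sepsis": 2.0, "no_sepsis": 1.5}
patient_b = {"sepsis": 4.0, "no_sepsis": 0.5}

for name, support in [("A", patient_a), ("B", patient_b)]:
    hard = polarized_prediction(support)["sepsis"]
    soft = graded_risk(support)["sepsis"]
    print(f"patient {name}: polarized={hard:.2f} graded={soft:.2f}")
```

Under the hard decision both patients receive the same sepsis score, so they cannot be ranked. The graded score keeps the borderline patient below the clear-cut one.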

## Why It Works: Results and Clinical Value

Evaluated on three medical time-series benchmarks, TRIAGE improves area under the precision-recall curve (AUPRC) by an average of 3.3% over competitive baselines. More strikingly, it reduces calibration error by 81%, meaning the predicted risk matches observed outcome rates much more closely. A 3.3% AUPRC gain may sound modest, but in clinical systems, even small improvements in risk stratification can shift patient management decisions at scale.
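For readers unfamiliar with calibration error, the sketch below computes expected calibration error (ECE), a common binned calibration metric. The source does not say which calibration metric the paper reports, so treating it as ECE is an assumption, and the ten-patient data set is invented for illustration.

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """Weighted average gap between mean predicted risk and observed event rate per bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[index].append((p, y))

    total = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_confidence = sum(p for p, _ in members) / len(members)
        event_rate = sum(y for _, y in members) / len(members)
        ece += (len(members) / total) * abs(mean_confidence - event_rate)
    return ece


# Ten hypothetical patients; three of the first five have the event.
labels = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]

# Polarized model: certain about the first five, certain against the rest.
polarized = [1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]

# Graded model: same ranking, but the flagged group gets a 0.6 risk.
graded = [0.6, 0.6, 0.6, 0.6, 0.6, 0.0, 0.0, 0.0, 0.0, 0.0]

ece_polarized = expected_calibration_error(polarized, labels)
ece_graded = expected_calibration_error(graded, labels)
print(f"polarized ECE={ece_polarized:.3f}, graded ECE={ece_graded:.3f}")
```

Both models flag the same five patients, but only the graded one reports a risk that matches how often the event occurs in that group, which is what a lower calibration error captures.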

Perhaps more important for clinical adoption: an LLM-as-a-judge evaluation (where another LLM assesses explanation quality) shows that TRIAGE's rationales outperform post-hoc explanations from baseline methods by 20% in clinical reasoning quality. Clinicians can read the model's reasoning and understand *why* it assigned a particular risk score, rather than reverse-engineering a black box.

The framework addresses a genuine tension in AI-for-healthcare: models must be both accurate *and* intelligible. By forcing the LLM to reason dialectically, articulating the competing evidence for each outcome, TRIAGE sidesteps the usual tradeoff. The source code is publicly available, suggesting the authors intend the work to be reproducible and adoptable.
