Recursive Failure Archaeology: RLM for Agent Failure Diagnosis

Comparing Recursive Investigation With Full-Context GPT-5.5

I wanted to test a simple idea: instead of giving GPT-5.5 an entire agent trace and asking “what went wrong?”, can I make the model investigate the trace recursively?

The project became a small experiment in failure archaeology. The input is a long agent trajectory. The output is a report that tries to identify where the failure began, how it propagated, and which events support the diagnosis.

The core comparison was:

Method	What it does
Full-context GPT-5.5	Reads the whole trace in one prompt and writes a diagnosis.
RLM	Breaks the trace into segments, analyzes local evidence, generates hypotheses, prunes evidence, verifies failure/control status, then writes the final report.

I ran both methods on the same 15 official tau2 retail traces.

The working hypothesis was not that RLM would automatically beat GPT-5.5. The more realistic hypothesis was narrower: RLM might produce a more inspectable diagnosis, because the reasoning process is decomposed into explicit stages instead of hidden inside a single long prompt.

The RLM Idea I Used

The RLM paper’s main idea is that long context should not always be treated as text stuffed into a single prompt. Instead, the long input can live outside the model as an environment, and the model can inspect pieces of it recursively.

The useful ideas for this project were:

RLM paper idea	How I used it
Externalized context	The full trajectory lives as structured JSON, not only inside one model prompt.
Symbolic intermediate state	Segment analyses, hypotheses, pruned evidence, and verifier outputs are stored as JSON artifacts.
Recursive inspection	The system opens selected trajectory segments instead of always reading everything at once.
Final synthesis	A root model uses the intermediate artifacts to produce the final report.

For this project, the mapping is:

RLM concept	Failure archaeology equivalent
Long prompt	Agent trajectory
REPL/context environment	Local dataset + trajectory schema
Sub-calls	Segment analysis calls
Intermediate variables	Evidence, hypotheses, control verdict
Final answer	Failure archaeology report

So the point was not to make a chatbot with a longer prompt. The point was to build a small investigation loop.

This distinction mattered in practice. Full-context prompting can produce a good final answer, but the reasoning process is mostly implicit. RLM produces additional artifacts that can be inspected after the run: which segments were selected, what each segment analysis found, which hypotheses were considered, which evidence was pruned, and whether the verifier believed a real task-level failure had been proven.

The Pipeline

The RLM pipeline I implemented looks like this:

flowchart TD
    A["Agent trace"] --> B["Trajectory schema"]
    B --> C["Signal detection"]
    C --> D["Segmentation"]
    D --> E["Segment analysis"]
    E --> F["Competing hypotheses"]
    F --> G["Evidence pruning"]
    G --> H["Control verification"]
    H --> I["Final report"]

The most important design decision was separating evidence discovery from final judgment. Early versions of recursive analysis can collect many suspicious events, but suspicious does not always mean causal. That is why I added a verifier step before the final report.

Here is the core implementation shape from the investigator:

selected = self.select_segments(segment_map, limit=segment_limit)
analyses = [
    self.analyze_segment(trajectory, signals, segment, live=live)
    for segment in selected
]

hypotheses = self.generate_hypotheses(trajectory, analyses, live=live)
pruned_evidence = self.prune_evidence(
    trajectory,
    analyses,
    hypotheses,
    live=live,
)
control_verdict = self.verify_control(
    trajectory,
    hypotheses,
    pruned_evidence,
    live=live,
)
report = self.final_judge(
    trajectory,
    hypotheses,
    pruned_evidence,
    control_verdict,
    live=live,
)

That code is the whole idea in miniature: inspect, hypothesize, prune, verify, judge.

The pipeline produces these artifacts for every trace:

Artifact	Purpose
Selected segments	The parts of the trace the recursive investigator decided to inspect.
Segment analyses	Local findings from each selected segment.
Competing hypotheses	Multiple possible explanations, including a no-failure/control hypothesis.
Pruned evidence	The smallest event set considered sufficient for each hypothesis.
Control verdict	A separate decision about whether a persistent task failure was proven.
Normalized report	The final root cause, causal chain, evidence, confidence, and interventions.
Causal graph	A graph representation of the final causal explanation.

That is why I think of RLM less as a prompt and more as an experiment harness for trajectory reasoning.
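One way to keep those artifacts inspectable is to write them to disk as a single JSON record per trace. The following is a minimal sketch, not the project's actual storage code: the class name `InvestigationArtifacts`, its field names, and the helper `load_artifacts` are assumptions chosen to mirror the artifact table above.

```python
import json
from dataclasses import asdict, dataclass, field
from pathlib import Path


@dataclass
class InvestigationArtifacts:
    """Hypothetical container for everything one RLM run produces for a trace."""

    trajectory_id: str
    selected_segments: list = field(default_factory=list)
    segment_analyses: list = field(default_factory=list)
    hypotheses: list = field(default_factory=list)
    pruned_evidence: list = field(default_factory=list)
    control_verdict: dict = field(default_factory=dict)
    report: dict = field(default_factory=dict)

    def save(self, out_dir):
        """Write the artifacts as <trajectory_id>.json and return the path."""
        out_dir = Path(out_dir)
        out_dir.mkdir(parents=True, exist_ok=True)
        path = out_dir / f"{self.trajectory_id}.json"
        path.write_text(json.dumps(asdict(self), indent=2))
        return path


def load_artifacts(path):
    """Read a saved record back into an InvestigationArtifacts instance."""
    return InvestigationArtifacts(**json.loads(Path(path).read_text()))
```

Keeping every stage's output in one record means a later audit can start from the final report and walk backwards to the segments that produced it.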

Segment Analysis

The first model calls are local. A segment analyzer only sees a selected region of the trace and has to return compact JSON.

The prompt rule is intentionally narrow:

SEGMENT_ANALYSIS_SYSTEM = """You are a sub-investigator in a recursive failure archaeology system.
Analyze only the provided segment. Do not invent events that are not in the segment.
Return compact JSON with:
- segment_id
- local_findings
- causal_clues
- memory_issues
- likely_failure_role
- confidence
"""

This is directly inspired by the RLM paper’s “inspect a slice, store the result, continue” pattern. The segment model is not supposed to solve the whole trace. It only contributes local evidence.

In the implementation, the segment limit is also important. For this run I used --segment-limit 4, which means the recursive investigator could inspect only a small number of selected trajectory regions. This creates a useful pressure: if the selector chooses poor segments, the final report suffers. That makes segment selection a real component of the method rather than just formatting.
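To make that pressure concrete, here is a minimal sketch of a budgeted selector. It is not the project's `select_segments` method: the standalone function and the `signal_score` and `start_turn` fields are assumptions used only for illustration.

```python
def select_segments(segment_map, limit=4):
    """Keep the `limit` highest-scoring segments, returned in trace order.

    Each segment is assumed to be a dict with a numeric "signal_score"
    (how suspicious the segment looks) and a "start_turn" (its position).
    """
    if limit <= 0:
        return []
    # Rank by suspicion first so the budget goes to the strongest signals.
    ranked = sorted(segment_map, key=lambda s: s["signal_score"], reverse=True)
    chosen = ranked[:limit]
    # Restore chronological order so later stages read events in sequence.
    return sorted(chosen, key=lambda s: s["start_turn"])
```

With a limit of 4, any causal segment that scores below the fourth-ranked one is never opened, which is exactly how a weak selector would degrade the final report.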

Hypotheses, Pruning, and Control Verification

After local analysis, the root model generates multiple possible explanations. I found this important because a trace often has several plausible stories:

a tool failed
the agent escalated too early
the user gave incomplete identity information
nothing actually failed and the trace is a successful control

The hypothesis generator is forced to include a no-failure/control hypothesis:

HYPOTHESIS_GENERATOR_SYSTEM = """You are the hypothesis generator for a Recursive Failure Archaeology pipeline.
You are given trajectory metadata, task outcome, detected signals, and local segment analyses.
Your job is to produce competing hypotheses, not a final report.

Rules:
- Always include one explicit no-failure/control hypothesis.
- Include 2-4 failure hypotheses when evidence exists.
- A suspicious event is not enough; each failure hypothesis needs task-level consequence.
"""
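Because these rules live in a prompt, the model can still ignore them, so a cheap structural check on the returned hypotheses is useful. This is a hedged sketch: the `kind` and `task_level_consequence` field names and the function `check_hypotheses` are assumptions, not the project's schema.

```python
def check_hypotheses(hypotheses):
    """Return a list of rule violations for a generated hypothesis set."""
    problems = []
    controls = [h for h in hypotheses if h.get("kind") == "control"]
    failures = [h for h in hypotheses if h.get("kind") == "failure"]

    # Rule: exactly one explicit no-failure/control hypothesis.
    if len(controls) != 1:
        problems.append(f"expected 1 control hypothesis, found {len(controls)}")

    # Rule: at most four failure hypotheses (the lower bound depends on
    # whether evidence exists, so it is not enforced here).
    if len(failures) > 4:
        problems.append(f"expected at most 4 failure hypotheses, found {len(failures)}")

    # Rule: a suspicious event is not enough; each failure hypothesis
    # must name a task-level consequence.
    for index, hypothesis in enumerate(failures):
        if not hypothesis.get("task_level_consequence"):
            problems.append(f"failure hypothesis {index} has no task-level consequence")

    return problems
```

An empty list means the hypothesis set is structurally acceptable; it says nothing about whether the hypotheses are correct.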

Then the evidence pruner tries to reduce noise:

EVIDENCE_PRUNER_SYSTEM = """You are the evidence pruner for Recursive Failure Archaeology.
Your job is to reduce noisy local findings into a minimal evidence set for each hypothesis.

Rules:
- required_event_ids should be the smallest sufficient set, usually 2-5 events.
- Exclude events that are merely noisy, downstream, or redundant.
- Preserve contradicting evidence for the control verifier.
"""

The verifier is the guardrail. It asks whether a persistent task-level failure has actually been proven:

CONTROL_VERIFIER_SYSTEM = """You are the no-failure/control verifier for Recursive Failure Archaeology.
Your job is to answer: has a persistent task-level failure been proven?

Rules:
- Prefer no_persistent_failure when the task outcome is successful and no concrete unmet requirement is proven.
- Do not count a tool error, clarification, or inefficiency as a failure if the user task was eventually resolved.
- A failure is proven only if evidence shows an unmet user request, wrong state change, policy violation, or unresolved escalation.
"""

This is the part that made the method feel more serious. Without a control verifier, the system can become too eager to call every suspicious event a failure.

The final judge then receives the hypotheses, pruned evidence, and verifier output. Its job is not to restart the investigation from scratch, but to adjudicate between already-structured intermediate results:

report = self.final_judge(
    trajectory,
    hypotheses,
    pruned_evidence,
    control_verdict,
    live=live,
)

This is the closest part of the system to the RLM paper’s final synthesis step: the model does not need to remember every raw event, because important intermediate state has already been externalized.

The Experiment

I used 15 official tau2 retail traces from the prepared 50-trace set. Each trace had a GPT-5.5 LLM-as-judge causal label. These labels are not human-gold annotations, so I treat the result as an experimental benchmark rather than a final scientific claim.

Both methods used GPT-5.5:

Method	Root model	Segment model
RLM	GPT-5.5	GPT-5.5
Full-context baseline	GPT-5.5	none

I kept both methods on GPT-5.5 to avoid making the comparison about model quality. The difference was the reasoning structure: recursive investigation versus direct full-context diagnosis.

The evaluation metrics were:

Metric	What I used it to check
Semantic root cause	Did the predicted cause match the label semantically?
Causal consistency	Did the explanation actually support the final outcome?
First failure turn error	Did the method locate the beginning of the failure?
Evidence recall	Did it recover the labeled evidence events?
Evidence precision	Were the cited events actually useful evidence?
Causal anchor recall	Did it recover the most central causal events?
Control correctness	Did it correctly identify no-failure traces?
False positive failure rate	Did it invent failures on successful controls?
Parse success	Could the report be normalized into the schema?
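For intuition about the two evidence metrics, here is a minimal set-based sketch. It assumes both the report and the label reduce to sets of event IDs; the benchmark script may score evidence differently, and `evidence_scores` is a name introduced only for this example.

```python
def evidence_scores(predicted_ids, labeled_ids):
    """Set-based precision and recall over cited event IDs."""
    predicted, labeled = set(predicted_ids), set(labeled_ids)
    hits = predicted & labeled
    # Precision: share of cited events that the label also contains.
    precision = len(hits) / len(predicted) if predicted else 0.0
    # Recall: share of labeled events that the report recovered.
    recall = len(hits) / len(labeled) if labeled else 0.0
    return {"precision": precision, "recall": recall}
```

Under this definition, a report that cites four events when only two are labeled can reach full recall while its precision drops to one half, which is the over-inclusion pattern to watch for.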

The live RLM run:

.venv/bin/python scripts/run_investigator_batch.py \
  datasets/real/tau2_official_retail_50 \
  --experiment-id phase_18_rlm_gpt55_live_15 \
  --limit 15 \
  --segment-limit 4 \
  --root-model openai/gpt-5.5 \
  --segment-model openai/gpt-5.5 \
  --root-max-tokens 3000 \
  --segment-max-tokens 900 \
  --resume \
  --live

The benchmark:

.venv/bin/python scripts/run_scientific_benchmark.py \
  datasets/real/tau2_official_retail_50 \
  --investigation-dir datasets/real/tau2_official_retail_50/experiments/phase_18_rlm_gpt55_live_15/investigations \
  --method-id rlm_gpt55_live_15 \
  --label-file datasets/real/tau2_official_retail_50/llm_judge_labels.json \
  --output datasets/real/tau2_official_retail_50/experiments/phase_18_rlm_gpt55_live_15/benchmark_15.json

Results

RLM vs full-context GPT-5.5 on 15 official tau2 retail traces

Metric	RLM GPT-5.5	Full-context GPT-5.5
Semantic root cause	0.8571	0.8762
Causal consistency	0.9038	0.9272
First failure turn error	0.0667	0.0000
Evidence recall	0.9089	0.9511
Evidence precision	0.6833	0.8111
Causal anchor recall	0.5444	0.7556
Control correctness	1.0000	1.0000
False positive failure rate	0.0000	0.0000
Parse success	1.0000	1.0000

Full-context GPT-5.5 is still better on this short-trace slice: slightly on most metrics, and more clearly on evidence precision and causal anchor recall. That is not surprising. These tau2 traces are short enough for GPT-5.5 to read directly, so the full-context baseline is very strong.

But RLM was close on the main report-quality metrics:

semantic root cause: 0.8571 vs 0.8762
causal consistency: 0.9038 vs 0.9272
evidence recall: 0.9089 vs 0.9511

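The bigger weaknesses were evidence precision and causal anchor recall:

RLM evidence precision: 0.6833
full-context evidence precision: 0.8111
RLM causal anchor recall: 0.5444
full-context causal anchor recall: 0.7556

So the main bottleneck is not “can RLM find relevant evidence?” With evidence recall at 0.9089, it can. The bottleneck is selecting the smallest and cleanest evidence set while still keeping the most central causal events in it.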

This result was useful because it showed a specific weakness. RLM did not collapse on root-cause quality, and it did not fail to retrieve evidence. Instead, its weakness was over-inclusion: it cited more evidence than necessary or included context events that were not as directly causal as the full-context baseline’s evidence set.
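The per-metric gaps (RLM minus full-context) can be recomputed directly from the results table. The numbers below are copied from that table; only the higher-is-better metrics where the two methods differ are included, and the variable names are mine.

```python
# Scores copied from the tau2 results table (higher is better for all five).
rlm = {
    "semantic_root_cause": 0.8571,
    "causal_consistency": 0.9038,
    "evidence_recall": 0.9089,
    "evidence_precision": 0.6833,
    "causal_anchor_recall": 0.5444,
}
full_context = {
    "semantic_root_cause": 0.8762,
    "causal_consistency": 0.9272,
    "evidence_recall": 0.9511,
    "evidence_precision": 0.8111,
    "causal_anchor_recall": 0.7556,
}

# Negative values mean RLM scored lower than the full-context baseline.
deltas = {name: round(rlm[name] - full_context[name], 4) for name in rlm}

# The metric with the most negative delta is where RLM trails the most.
largest_gap = min(deltas, key=deltas.get)

for name, delta in sorted(deltas.items(), key=lambda item: item[1]):
    print(f"{name:24s} {delta:+.4f}")
```

The three report-quality metrics differ by less than 0.05, evidence precision by 0.1278, and causal anchor recall by 0.2112, so the largest gap on this slice is in causal anchor recall.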

Figure: RLM minus full-context GPT-5.5

Reliability Result

Figure: Reliability checks on the 15-trace evaluation

The part I care about most is reliability:

Reliability check	RLM
Control correctness	1.0000
False positive failure rate	0.0000
Parse success	1.0000
Consistency issues	0

This means RLM did not hallucinate failures on successful controls in this slice. For a failure-analysis system, that matters a lot. A model that finds interesting clues but invents failures would not be useful.

The reliability result is especially important because the system is meant to diagnose failures, not merely find anomalies. A failed lookup, a clarification question, or a transfer to a human can be normal behavior depending on the task constraints. The verifier is supposed to prevent the system from treating every irregularity as a root cause.

Example Trace

One example report came from:

tau2-official-retail-35-trial-0-39d2c249-6e85-4db3-8f36-049be2744cf3

The user wanted to return a speaker and change a laptop order. The task did not complete because identity verification failed.

RLM’s root cause:

The customer record was not found via either email lookup or name/zip lookup, so identity verification and order access failed. Without verified identity, the assistant could not perform the requested return or order modification.

Figure: Example RLM causal chain for trace 35

The causal chain was:

Step	Event
1	Email lookup failed.
2	User supplied the missing zip code.
3	Populated name/zip lookup also failed.
4	Identity remained unresolved.
5	Assistant escalated to human support.
6	Session ended without completing the order actions.

What I like about this example is that the report is not just “the agent failed.” It separates an agent mistake from a data or identity-resolution blockage. That is the kind of distinction a failure archaeology system should make.

The intermediate hypotheses are what make this example interesting. The system considered at least four possible interpretations:

Hypothesis	Interpretation
Identity lookup failure	The backend/user data could not resolve the customer, blocking order access.
Premature escalation	The assistant may have escalated before exhausting available recovery paths.
Procedural lookup error	An earlier lookup may have used incomplete zip information.
No-failure/control	The assistant may have behaved correctly under identity-verification constraints.

The final report selected the identity-resolution explanation because the later populated name/zip lookup also failed. That matters: the earlier incomplete lookup was suspicious, but it was not sufficient to prove the root cause once the later complete lookup failed too.

Long SWE-Agent Traces

After the tau2 experiment, I wanted a harder test. The tau2 traces are useful, but they are short enough that a full-context model can simply read the whole thing. The more interesting question for RLM is what happens when the trace is genuinely long.

So I added a small real long-trace experiment using SWE-agent trajectories from AI-ModelScope/SWE-agent-trajectories. I converted a local subset of the 15 longest traces from the first downloaded shard into the project schema.

The resulting traces were much longer than the tau2 examples:

Dataset	Trace count	Min events	Max events	Mean events
SWE-agent long subset	15	525	797	632.47

This changed the problem. The task was no longer just “summarize a failure.” The task became:

In a 500-800 event trajectory, where did the failure actually begin?

That question matters because the final failure is often obvious. The scientific value is in locating the earlier causal turn that made the collapse likely.

Labeling the Long Traces

I did not want to score these traces with placeholder labels, so I added a long-trace judge:

scripts/run_swe_long_llm_judge.py

The judge does not simply dump the entire trajectory into a prompt. It builds a hierarchical packet:

return {
    "trajectory_id": trajectory.trajectory_id,
    "outcome": trajectory.outcome,
    "metadata": {...},
    "event_count": len(trajectory.events),
    "event_skeleton": event_skeleton,
    "selected_events": selected_events,
    "packet_policy": {
        "selection_features": [
            "head_events",
            "tail_events",
            "error_windows",
            "assistant_repetition_windows",
            "middle_breadcrumbs",
        ],
    },
}

The important idea is that the judge sees both:

a compressed view of the full trajectory,
fuller evidence windows around likely causal regions.

That is closer to how I want the actual system to work: not blind truncation, and not a single undifferentiated wall of text.
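A minimal sketch of that two-level view follows. It covers only three of the selection features (head events, tail events, and error windows), and the `role` and `is_error` event fields, the function names, and the default sizes are all assumptions rather than the judge script's real logic.

```python
def build_skeleton(events):
    """Compressed view: one tiny record per event, no full content."""
    return [
        {"turn": i, "role": e.get("role"), "is_error": bool(e.get("is_error"))}
        for i, e in enumerate(events)
    ]


def select_event_indices(events, head=5, tail=5, window=2):
    """Indices of events to include in full: head, tail, and error windows."""
    n = len(events)
    chosen = set(range(min(head, n)))          # head events
    chosen |= set(range(max(n - tail, 0), n))  # tail events
    for i, event in enumerate(events):
        if event.get("is_error"):
            # Keep a few neighbours on each side of every error event.
            chosen.update(range(max(i - window, 0), min(i + window + 1, n)))
    return sorted(chosen)
```

The skeleton keeps every turn visible at low cost, while the selected indices decide which turns are shown with their full content.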

The live labeling command was:

.venv/bin/python scripts/run_swe_long_llm_judge.py \
  datasets/real/swe_agent_long_15 \
  --model openai/gpt-5.5 \
  --limit 15 \
  --resume \
  --live

Validation passed:

Labels: 15
Errors: 0

The label distribution was surprisingly concentrated.

Failure taxonomy in the 15 long SWE-agent traces

Failure taxonomy	Count
repeated invalid command loop	14
wrong file or location	1

This is an important limitation. These 15 traces are not a balanced coding-agent benchmark. They are mostly extreme collapse traces where the agent gets stuck repeating malformed commands or operating from a wrong assumption. But that also makes them useful stress tests for long-horizon failure archaeology: the failure begins early and then propagates for hundreds of events.

Figure: First causal failure versus full trace length

This chart is the clearest explanation of the experiment. The gray dots are the full trace lengths. The red dots are the labeled first causal failure turns.

In most traces, the first causal failure happened very early. The trace then continued for hundreds of events. That is exactly the failure archaeology setting: the collapse is visible at the end, but the cause is buried near the beginning.

RLM vs Full-Context GPT-5.5 on Long Traces

The first RLM run completed 9 of the 15 long traces. I then ran a matched full-context GPT-5.5 baseline on the same first 9 traces and scored both against the new labels.

Long SWE-agent report-quality metrics

Metric	RLM GPT-5.5	Full-context GPT-5.5
Semantic root cause	0.4583	0.4440
Causal consistency	0.5313	0.5684
First failure turn error	0.5556	4.0000
Evidence recall	0.3889	0.5833
Evidence precision	0.7333	0.6861
Parse success	1.0000	1.0000

At first this looks mixed. RLM is not universally better. Full-context GPT-5.5 has better evidence recall and slightly better causal consistency. But the first-failure result is the important one.

Figure: First-failure localization error

RLM had a mean first-failure turn error of 0.5556. Full-context GPT-5.5 had 4.0000.

That means RLM was much closer to the labeled first causal failure. Full-context GPT-5.5 found more evidence across the trace, but it was less precise about where the causal failure began.
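To make the metric concrete, here is a sketch that treats first-failure turn error as the mean absolute difference between predicted and labeled turns. That definition is my assumption for illustration; the benchmark script may compute it differently, and the function name is hypothetical.

```python
def mean_first_failure_error(predicted_turns, labeled_turns):
    """Mean absolute turn difference over traces present in both mappings.

    Both arguments map trajectory IDs to a first-failure turn number.
    """
    shared = predicted_turns.keys() & labeled_turns.keys()
    if not shared:
        raise ValueError("no overlapping trajectories to score")
    total = sum(abs(predicted_turns[t] - labeled_turns[t]) for t in shared)
    return total / len(shared)
```

Under that assumption, a mean of 0.5556 over 9 traces corresponds to 5 turns of total error, and a mean of 4.0000 corresponds to 36 turns.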

This gave me the first clean long-trace finding:

On real long SWE-agent traces, RLM appears better at localizing the earliest causal failure, while full-context GPT-5.5 is better at broad evidence recall.

That is the result I had been trying to understand. RLM is not magically smarter. Its advantage is structural. Because it breaks the trace into investigated pieces and stores intermediate evidence, it is better suited to asking “where did this start?” Full-context GPT-5.5 is strong at reading the whole trace and collecting many relevant details, but that broad view can blur the earliest causal point.

Case Study: The Invalid Command Loop

One long trace was:

swe-agent-long-01-iterative__dvc-7497

The task was a DVC issue, but the agent almost immediately entered an invalid shell-command loop.

The label was:

Field	Value
Root cause	`invalid_mkdir_git_command_loop`
First causal failure	turn 3
Visible failure	turn 6
Failure type	`repeated_invalid_command_loop`
Trace length	797 events

The causal story was:

Turn	What happened
2	The user provided the DVC issue.
3	The agent tried to simulate Git setup instead of investigating the repository.
4	The shell reported the command was invalid.
6	The agent repeated the same malformed command.
736+	The trace was still stuck in the same kind of loop.
797	The run ended without a real patch.

RLM’s normalized root cause was:

Severe command perseveration with feedback non-incorporation: the agent repeatedly issued the malformed shell command despite immediate corrective feedback, blocking all task-relevant software engineering work.

This is the simple version:

early bad command -> explicit failure feedback -> same bad command repeated -> no repository work -> timeout

The interesting part is not that the agent failed. That is obvious by the end. The useful part is that RLM marked the first causal failure at turn 3, not near the final timeout.
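The difference between the first causal turn and the visible failure can be illustrated with a simple repetition detector. This is a sketch under assumptions, not the investigator's logic: events are assumed to be dicts ordered by turn with hypothetical `turn`, `role`, and `command` fields, and `find_command_loop` is a name introduced here.

```python
from collections import defaultdict


def find_command_loop(events, min_repeats=3):
    """Find the earliest assistant command that is issued min_repeats times.

    Returns the turn where the command first appeared (a candidate first
    causal failure) and the turn where it was first repeated (where the
    loop becomes visible), or None if no command repeats often enough.
    """
    if min_repeats < 2:
        raise ValueError("min_repeats must be at least 2")
    turns_by_command = defaultdict(list)
    for event in events:
        if event.get("role") == "assistant" and event.get("command"):
            turns_by_command[event["command"].strip()].append(event["turn"])
    loops = [
        {
            "command": command,
            "first_turn": turns[0],
            "first_repeat_turn": turns[1],
            "repeats": len(turns),
        }
        for command, turns in turns_by_command.items()
        if len(turns) >= min_repeats
    ]
    return min(loops, key=lambda loop: loop["first_turn"]) if loops else None
```

A detector like this only flags exact repeats, so it would still need the segment analysis to confirm that the first occurrence, and not some earlier decision, is where the failure began.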

Case Study: Working in the Wrong Project

Another useful example was:

swe-agent-long-09-iterative__dvc-3534

This one was different. It was not just a malformed command loop. The agent adopted the wrong working premise.

The label was:

Field	Value
Root cause	`worked_in_example_project_simulator_instead_of_dvc_source`
First causal failure	turn 3
Visible failure	turn 13
Failure type	`wrong_file_or_location`
Trace length	588 events

The causal story was:

Turn	What happened
3	The agent moved away from the actual DVC source context.
5	It began reproduction work in an example project.
13	It concluded the real implementation was unavailable.
15	It created a simulator instead of modifying the actual DVC code.
Later	It entered a downstream invalid-command loop and never recovered.

RLM’s summary was:

The trajectory failed because the assistant abandoned the real DVC repository almost immediately, built and reasoned inside a fake example project, then substituted an unrelated simulator for a production fix.

This case is useful because it separates two layers:

Layer	Explanation
Root failure	The agent worked in the wrong project context.
Downstream failure	The agent later got stuck in command repetition.

That distinction is exactly what failure archaeology is supposed to do. A shallow report might say “the agent looped.” RLM’s better explanation is “the agent first left the real codebase; the later loop made recovery impossible.”

What the Long Traces Changed

The tau2 result said:

RLM is reliable and auditable, but full-context GPT-5.5 is slightly stronger on short traces.

The SWE-agent result adds:

On long traces, RLM’s structure helps with first-failure localization.

That is a much stronger reason to keep working on this project.

The current picture is:

Setting	What happened
Short tau2 traces	Full-context GPT-5.5 was slightly stronger overall.
Long SWE-agent traces	RLM was much better at locating the first causal failure.
Evidence behavior	On long traces, full-context GPT-5.5 recovered more evidence and RLM cited cleaner evidence; on short traces, full-context led on both recall and precision.
Scientific limitation	The long subset is loop-heavy and LLM-labeled, not human-gold.

So the more precise thesis is now:

RLM is most useful when the evaluation target is not just “summarize the failure,” but “identify the earliest causal failure in a long trajectory.”

That is the project’s strongest direction.

What I Learned

The short-trace experiment did not show that RLM beats GPT-5.5 full-context. On tau2, full-context GPT-5.5 is still slightly stronger.

But the long-trace experiment made the picture more interesting. RLM can work as an investigation architecture:

It produces structured reports.
It stores intermediate reasoning artifacts.
It can include a no-failure/control hypothesis.
It can avoid false-positive failures.
It gives a path for ablations: remove segmentation, remove pruning, remove verification, compare again.
It can localize early causal failures in long traces more accurately than the full-context baseline in the current SWE-agent slice.

The most useful way to frame the result is:

RLM turns failure analysis from one-shot diagnosis into an auditable investigation process.

That is different from claiming it is always more accurate than full-context GPT-5.5.

My read of the result is:

Question	Current answer
Can RLM produce valid structured reports?	Yes. Parse success was 1.0000.
Can RLM avoid false failures on this slice?	Yes. False positive rate was 0.0000.
Does RLM beat full-context GPT-5.5 on short traces?	No. Full-context is slightly stronger.
Does RLM show an advantage on long traces?	Yes, on first-failure localization in the current SWE-agent slice.
Where is RLM weakest?	Evidence precision and causal anchor recall on short traces; evidence recall and causal consistency on long traces.
Why keep working on RLM?	It gives an auditable structure for long-horizon causal diagnosis.

What I Would Improve Next

The next improvement should target evidence quality more carefully. On the short tau2 traces, RLM found relevant evidence but was less precise than full-context GPT-5.5. On the long SWE-agent traces, RLM had stronger evidence precision but weaker evidence recall.

So the problem is not simply “cite less evidence” or “cite more evidence.” The better goal is:

cite the smallest set of events that proves the root cause, while preserving enough anchors to explain the trajectory.

The next feature I would add is an event-level entailment checker:

def verify_evidence_item(event, claim):
    """
    Return whether this event actually supports the causal claim,
    is merely context, contradicts it, or is irrelevant.
    """
    ...

Then every cited event could be classified before the final report:

Label	Meaning
supports	The event directly supports the claim.
context	The event is useful background but not proof.
contradicts	The event weakens the claim.
irrelevant	The event should be removed.

That should improve evidence precision and make the causal chains cleaner.

A slightly more complete version would look like this:

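def classify_evidence(event, causal_claim):
    # judge_event_relation stands for the event-level entailment call
    # (not implemented here); it should return one of the four labels above.
    label = judge_event_relation(event, causal_claim)
    if label == "supports":
        return {"keep": True, "role": "primary"}
    if label == "context":
        return {"keep": True, "role": "background"}
    if label in {"contradicts", "irrelevant"}:
        return {"keep": False, "role": label}
    # Fail loudly on an unexpected label instead of silently returning None.
    raise ValueError(f"unexpected evidence label: {label!r}")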

Then the final judge would be allowed to cite only primary evidence for root-cause claims, while background evidence could remain in a separate context field. That should directly target the precision gap.
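A self-contained sketch of that split is below. The entailment judge is passed in as a plain callable so the example runs without a model; `partition_evidence`, the `id` field, and the output keys are assumptions for illustration.

```python
# Map each kept label to the report field it would land in.
KEEP_ROLES = {"supports": "primary", "context": "background"}


def partition_evidence(events, causal_claim, judge):
    """Split cited events into primary, background, and dropped IDs.

    `judge(event, causal_claim)` is assumed to return one of:
    "supports", "context", "contradicts", "irrelevant".
    """
    result = {"primary": [], "background": [], "dropped": []}
    for event in events:
        label = judge(event, causal_claim)
        role = KEEP_ROLES.get(label)
        if role is None:
            # Contradicting and irrelevant events are not cited.
            result["dropped"].append(event["id"])
        else:
            result[role].append(event["id"])
    return result
```

Only the `primary` list would be eligible as root-cause evidence, which is where a precision gain would come from; the `background` list keeps context available without counting it as proof.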

I would also expand the long-trace benchmark. The current SWE-agent subset is useful, but it is too dominated by command-loop failures. A stronger benchmark should include:

Needed trace type	Why it matters
command loops	tests execution-collapse detection
wrong-file edits	tests repository localization
incomplete patches	tests task-outcome reasoning
test failures	tests evidence from execution output
context loss	tests long-horizon memory and instruction retention
successful controls	tests false-positive resistance

Bottom Line

This experiment gave a useful result:

RLM is reliable on 15 real tau2 traces.
Full-context GPT-5.5 is still slightly better on short traces.
On long SWE-agent traces, RLM is better at localizing the first causal failure.
Full-context GPT-5.5 is still stronger at broad evidence recall on the long slice.
RLM’s strongest value is auditability plus early-failure localization, not generic benchmark dominance.
The next bottleneck is evidence quality: knowing which events prove the cause, which are only context, and which are downstream symptoms.

The project is now less of a vague “AI observability” idea and more of a concrete experiment:

Can recursive investigation make agent failure analysis more auditable, more structured, and better at locating the first causal failure in long trajectories?

Supporting Artifacts

The project folder also includes the data-summary and method-diagnostic figures that supported the report: dataset composition, method quality metrics, error metrics, evidence tradeoff, and label taxonomy.

Supporting tables and reproduction script: