Teaching an LLM to Explore: Reinforcement Learning for Document Navigation

Mohammed Alshehri — March 2026

How we trained a Qwen3-8B model to efficiently search through documents using GRPO, inspired by Karpathy’s autoresearch — and what Goodhart’s Law taught us along the way.


Table of Contents

  1. Introduction — The Autoresearch Connection
  2. The Problem: Partially Observable Multi-Hop QA
  3. Environment Design
  4. The Dataset: 2WikiMultiHopQA
  5. Training Infrastructure: Tinker + GRPO
  6. Run 1: Shaped Reward — The Goodhart Trap
  7. Run 2: Pure F1 — Cleaning the Signal
  8. Run 3: Status Text Fix — The Breakthrough
  9. The Paradox: Lower Training Reward = Better Model
  10. Results and Analysis
  11. Lessons Learned
  12. What's Next

1. Introduction — The Autoresearch Connection

In March 2025, Andrej Karpathy announced autoresearch — a project where an AI agent autonomously improves a language model by searching over code edits, running experiments within a 5-minute budget, and optimizing validation perplexity. The core insight was simple but profound: give an agent a search space, a budget, and a scalar reward, then let reinforcement learning do the rest.

Karpathy’s search space was code modifications. Ours is something different: document exploration.

We built Tinker-Explorer, an RL agent that learns to navigate a set of document chunks to answer multi-hop questions. Like autoresearch, it operates under a budget (limited steps), must decide what information to gather (which documents to read), and receives a scalar reward (answer correctness). Unlike autoresearch, the search space isn’t code — it’s evidence.

The mapping between the two projects:

|               | Karpathy's Autoresearch         | Tinker-Explorer                     |
|---------------|---------------------------------|-------------------------------------|
| Agent         | Code-editing LLM                | Document-exploring LLM              |
| Search space  | Code modifications              | Document chunk selections           |
| Budget        | 5 minutes wall-clock            | 10 action steps max                 |
| Reward        | val_bpb improvement             | Token F1 on answer                  |
| Optimization  | RL over code edits              | GRPO over exploration trajectories  |
| Key challenge | Which edits improve the model?  | Which documents contain the answer? |

Both projects share the same fundamental question: Can an LLM learn to make better decisions about what to explore, through trial and error?

This post documents our journey across three training runs — including a Goodhart’s Law failure, a reward debugging mystery, and the realization that looking at your model’s actual outputs matters more than tuning hyperparameters.


2. The Problem: Partially Observable Multi-Hop QA

Standard QA gives the model everything upfront. Retrieval-augmented QA retrieves documents first, then answers. But neither captures the active information gathering that humans do naturally.

Consider this question:

“Which film has the director born first, Once A Gentleman or The Girl In White?”

A human researcher would: look at the available sources, open the article about “Once A Gentleman” to find the director’s birth year, open the article about “The Girl In White” to do the same, then compare the two dates and answer.

This process requires sequential decision-making under uncertainty — the agent doesn’t know which documents are useful until it reads them. This is fundamentally a reinforcement learning problem.

Why Not Just Retrieve Everything?

You could argue: “Just open all the documents.” But in real-world settings, API calls cost money (think: calling a paid knowledge base), context windows have limits, irrelevant information introduces noise that degrades LLM performance, and some documents are red herrings that actively mislead.

The goal isn’t just to answer correctly — it’s to answer correctly while reading as few documents as possible. This is the efficiency-accuracy tradeoff that makes the problem interesting.


3. Environment Design

Tinker-Explorer implements a standard RL environment loop:

State Space

At each step, the agent observes:

The question (e.g., “Who is Marie Zéphyrine Of France’s paternal grandmother?”)
A list of chunk previews — one-line titles of all available documents, but NOT their contents
Previously opened chunks — full text of any documents the agent has chosen to read
Remaining step budget — how many actions it has left

Action Space

Three possible actions at each step:

{"action": "OPEN", "target": 3, "reasoning": "Chunk 3 mentions Marie Zéphyrine..."}
{"action": "SUMMARIZE", "target": 3, "text": "Born 1750, daughter of Louis XV..."}
{"action": "ANSWER", "text": "Marie Leszczyńska"}

OPEN(i) — Read chunk i’s full text. This is the exploration action.
SUMMARIZE(i) — Write a summary of chunk i. This helps manage context length.
ANSWER(text) — Submit a final answer. Ends the episode.
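A minimal sketch of how these JSON actions might be validated before the environment executes them. The `parse_action` helper here is hypothetical; the project's actual `action_schema.py` may differ:

```python
import json

# Hypothetical validator for the three-action schema above.
VALID_ACTIONS = {"OPEN", "SUMMARIZE", "ANSWER"}

def parse_action(raw: str) -> dict:
    """Parse one model completion and check it against the action schema."""
    act = json.loads(raw)
    kind = act.get("action")
    if kind not in VALID_ACTIONS:
        raise ValueError(f"unknown action: {kind!r}")
    if kind in ("OPEN", "SUMMARIZE") and not isinstance(act.get("target"), int):
        raise ValueError(f"{kind} requires an integer 'target'")
    if kind == "ANSWER" and not act.get("text", "").strip():
        raise ValueError("ANSWER requires non-empty 'text'")
    return act
```

Invalid completions (wrong action name, missing fields) raise early, which keeps malformed outputs from silently becoming no-op environment steps.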

Reward Function

The agent receives reward only when it answers. The reward is the token-level F1 score between its predicted answer and the gold answer — a standard metric from the SQuAD literature that gives partial credit for overlapping words.

For example:

Gold: “The Mask Of Fu Manchu” → Predicted: “The Mask of Fu Manchu” → F1 = 1.0 ✅
Gold: “Małgorzata Braunek” → Predicted: “Chunk 4 has been opened” → F1 = 0.0 ❌
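A minimal implementation of this metric, assuming simple lowercasing and whitespace tokenization (the SQuAD reference scorer additionally strips punctuation and articles):

```python
from collections import Counter

def token_f1(predicted: str, gold: str) -> float:
    """Token-level F1: harmonic mean of precision and recall over word overlap."""
    pred_toks = predicted.lower().split()
    gold_toks = gold.lower().split()
    if not pred_toks or not gold_toks:
        return 0.0
    # Multiset intersection: shared tokens, counted with multiplicity.
    overlap = sum((Counter(pred_toks) & Counter(gold_toks)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)
```

On the examples above, the capitalization-only mismatch scores 1.0 and the status-text answer scores 0.0, since it shares no tokens with the gold entity.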

Episode Flow

1. Agent receives question + chunk previews
2. Agent decides: OPEN, SUMMARIZE, or ANSWER?
3. If OPEN/SUMMARIZE → environment reveals text, step counter increments
4. If ANSWER → episode ends, reward = token_f1(predicted, gold)
5. If step budget exhausted → episode ends with reward = 0

A typical successful episode takes 2-4 steps: open the relevant chunks, then answer.
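The five steps above can be sketched as a rollout loop. The `agent`/`env` interfaces are hypothetical stand-ins for the project's `explorer_env.py`, and a compact token-F1 helper is inlined for self-containment:

```python
from collections import Counter

def _f1(pred, gold):
    """Compact token F1, equivalent to 2*overlap / (|pred| + |gold|)."""
    p, g = pred.lower().split(), gold.lower().split()
    ov = sum((Counter(p) & Counter(g)).values())
    return 0.0 if not p or not g or ov == 0 else 2 * ov / (len(p) + len(g))

def run_episode(agent, env, max_steps=10):
    """One rollout: explore until the agent answers or the step budget runs out."""
    obs = env.reset()                      # question + chunk previews
    for _ in range(max_steps):
        action = agent.act(obs)
        if action["action"] == "ANSWER":   # terminal: score the answer
            return _f1(action["text"], env.gold_answer)
        obs = env.step(action)             # reveal text, burn one step
    return 0.0                             # budget exhausted: zero reward
```

Reward arrives only at the end, which is exactly what makes the credit-assignment problem interesting: every OPEN along the way is paid for (or not) by the final answer.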


4. The Dataset: 2WikiMultiHopQA

We use 2WikiMultiHopQA — a multi-hop question answering dataset built from Wikipedia. Each question requires reasoning across exactly two Wikipedia articles.

Why this dataset?

Natural chunk structure — Each Wikipedia article is a chunk. The agent must decide which articles to read.
Ground truth supporting facts — The dataset provides supporting_facts — which paragraphs contain the answer evidence. This lets us measure whether the agent opens the right documents.
Multi-hop reasoning — Questions require combining information from two sources. The agent can’t answer from a single document.
Varied question types — Comparison (“which film came first”), bridge (“who is the mother of the director of…”), and compositional questions.

Dataset Split

Training: 5,000 tasks (randomly sampled from the full 167K)
Validation: 200 held-out tasks for evaluation
Chunk pool: Each task has ~10 candidate chunks (2 relevant, ~8 distractors)


5. Training Infrastructure: Tinker + GRPO

Tinker Platform

All training runs on Tinker — a cloud fine-tuning platform that provides:

Hosted model weights — Qwen3-8B with LoRA adapters, no local GPU needed
Sampling API — generate completions from the current policy
Training API — submit gradient updates (importance-sampled policy gradient)
Checkpointing — save/restore model states

This architecture is powerful: the model lives in the cloud, and our local code just orchestrates rollouts and submits the training updates. We ran all experiments on a MacBook with no GPU.

GRPO (Group Relative Policy Optimization)

We use GRPO — a variant of policy optimization from DeepSeek-R1 — instead of PPO. The key idea:

1. For each task, sample G = 16 trajectories from the current policy.
2. Compute rewards for all 16.
3. Normalize rewards within the group: advantage_i = (r_i - mean(r)) / std(r).
4. Update the policy to increase the probability of above-average trajectories.

Why GRPO over PPO?

No value function needed — PPO requires a separate critic network. GRPO computes baselines from the group.
Natural for language — each trajectory is a sequence of tokens. GRPO treats the whole sequence as one “action.”
Simpler implementation — just a weighted policy gradient update, no GAE, no clipping ratio (handled by importance sampling).

Hyperparameters

| Parameter      | Value | Rationale                                          |
|----------------|-------|----------------------------------------------------|
| GROUP_SIZE     | 16    | Balance between variance reduction and compute     |
| BATCH_SIZE     | 8     | Tasks per batch (= 128 total rollouts per update)  |
| Learning rate  | 5e-6  | Conservative — RL fine-tuning is sensitive         |
| LoRA rank      | 32    | Enough capacity without overparameterizing         |
| Grad clip norm | 1.0   | Prevents IS loss explosions                        |
| Max steps      | 10    | Episode timeout                                    |

SFT Warm-Start

Before RL training, we performed supervised fine-tuning (SFT) on 450 demonstration episodes generated by a heuristic policy. This gives the model a reasonable starting point — it knows the action format and basic exploration strategy.

The heuristic policy simply: computes TF-IDF overlap between the question and each chunk title, opens the highest-overlap chunk, and answers with the chunk title (which is often the Wikipedia article title — and therefore the exact entity name).

This heuristic achieves F1 = 0.246 — a surprisingly strong baseline.
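A sketch of such a heuristic, simplified here to raw word overlap rather than full TF-IDF weighting (the actual `heuristics.py` may differ):

```python
import re

def heuristic_policy(question, chunk_titles):
    """Open the title that best overlaps the question; answer with that title."""
    q_words = set(re.findall(r"\w+", question.lower()))
    def overlap(i):
        return len(q_words & set(re.findall(r"\w+", chunk_titles[i].lower())))
    best = max(range(len(chunk_titles)), key=overlap)
    return best, chunk_titles[best]    # (chunk to OPEN, answer text)
```

The strength of this baseline comes from the answer side, not the retrieval side: 2WikiMultiHopQA answers are often entity names that match Wikipedia article titles, so answering with the title alone earns substantial token F1.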


6. Run 1: Shaped Reward — The Goodhart Trap

Hypothesis

“If we add a bonus for opening relevant chunks, the model will learn to explore more effectively.”

Reward Design

reward = token_f1(predicted, gold) + 0.2 × (number of relevant chunks opened)

The idea was sound: reward the agent not just for the final answer, but for the intermediate exploration steps. Every relevant chunk opened earns a +0.2 bonus (capped at 0.4).

Training

100 batches over ~12 hours. Training reward climbed steadily — the 5-batch average increased from 0.08 to 0.20 over the course of training.

[Figure: Run 1 training reward]

Results

| Model     | F1    | EM    | Answer Rate |
|-----------|-------|-------|-------------|
| Heuristic | 0.246 | 0.220 | 100%        |
| SFT       | 0.152 | 0.065 | 87%         |
| RL Run 1  | 0.123 | 0.010 | 96%         |

Wait — F1 went DOWN? The RL agent performed worse than the SFT baseline it started from. Training reward increased, but actual answer quality decreased.

The Diagnosis: Goodhart’s Law

“When a measure becomes a target, it ceases to be a good measure.”

The relevance bonus created a shortcut: the agent learned to open the right chunks (earning the +0.2 bonus) while paying less attention to actually answering correctly. The training reward was dominated by the exploration bonus, not answer quality — so the gradient signal pushed the model to optimize opens, not answers.

This is a textbook case of reward hacking. The agent found a way to earn high reward without doing the thing we actually wanted (accurate answers).

[Figure: Goodhart's Law visualization — the inverse relationship between training reward and model quality]

Lesson

Shaped rewards are dangerous. The extra signal might seem helpful, but if it’s easier to optimize than the true objective, the agent will chase the bonus and ignore what matters.


7. Run 2: Pure F1 — Cleaning the Signal

The Fix

Strip everything back to the minimum:

reward = token_f1(predicted, gold)    # nothing else

Plus one safety gate: the agent must open at least 1 chunk before earning any reward. This prevents the degenerate strategy of answering immediately without reading anything.

Training

100 batches. Immediately obvious: training reward was lower than Run 1. The agent earned less reward per batch because there was no easy bonus to collect.

[Figure: All training reward curves across three runs — lower training reward correlated with better model quality]

The red line (Run 1) sits higher than the green line (Run 2) throughout training. To someone only watching the training curves, Run 1 looks better. But…

Results

| Model              | F1    | EM    | Answer Rate |
|--------------------|-------|-------|-------------|
| Heuristic          | 0.246 | 0.220 | 100%        |
| SFT                | 0.152 | 0.065 | 87%         |
| RL Run 1 (shaped)  | 0.123 | 0.010 | 96%         |
| RL Run 2 (pure F1) | 0.154 | 0.065 | 89.5%       |

F1 jumped from 0.123 to 0.154 — a 25% improvement — by removing reward signal. The pure F1 agent matched the SFT baseline, which means the RL training at least didn’t hurt, even if it didn’t help much.

Error Analysis

We looked at what the model actually output on the 200 val tasks:

[Figure: Prediction breakdown — 50% of outputs were status text instead of answers]

50% of outputs were status text — things like: “Chunk(s) 4 and 7 (‘Polish-Russian War’ and ‘Xawery Żuławski’) have been opened.” or “Chunk 0 (‘Blind Shaft’) has highest overlap with the question.”

The model was outputting descriptions of its own actions instead of answers. It opened the right chunks, formed the right reasoning — then put the reasoning in the answer field instead of the actual answer.

This error mode is invisible in aggregate metrics. You can only find it by reading examples.


8. Run 3: Status Text Fix — The Breakthrough

The Insight

The 50% status text rate meant that half our training signal was wasted. These episodes earned zero reward (because “Chunk 4 has been opened” has zero token overlap with “Małgorzata Braunek”) — but their gradient signal looked identical to that of a genuinely wrong answer, so the model had no way to learn that the failure was one of format rather than reasoning.

We needed to make the punishment explicit.

Three Changes

1. Reward penalty — If the answer starts with known status text patterns, force reward to 0:

_STATUS_PREFIXES = ("chunk", "the chunk", "chunks", "i have", "i've", "based on")

if answer_text.lower().startswith(_STATUS_PREFIXES):
    return 0.0  # you gave me reasoning, not an answer

2. Prompt update — Added an explicit “NEVER do this” section:

NEVER put any of these in the "text" field:
- "Chunk X has been opened" — this is NOT an answer
- "Chunk X has highest overlap" — this is reasoning, not an answer
- "Based on the text..." — just give the answer directly

3. Same pure F1 base — No relevance bonus. Min-read gate still active.
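Putting the three pieces together, the Run 3 reward can be sketched as follows. The compact token-F1 helper is inlined for self-containment, and the real `reward.py` may differ in details:

```python
from collections import Counter

_STATUS_PREFIXES = ("chunk", "the chunk", "chunks", "i have", "i've", "based on")

def _f1(pred, gold):
    """Compact token F1, equivalent to 2*overlap / (|pred| + |gold|)."""
    p, g = pred.lower().split(), gold.lower().split()
    ov = sum((Counter(p) & Counter(g)).values())
    return 0.0 if not p or not g or ov == 0 else 2 * ov / (len(p) + len(g))

def run3_reward(predicted, gold, n_opened):
    """Pure token F1, gated by a min-read requirement and a status-text penalty."""
    if n_opened < 1:
        return 0.0                # must read at least one chunk first
    if predicted.strip().lower().startswith(_STATUS_PREFIXES):
        return 0.0                # status/reasoning text is not an answer
    return _f1(predicted, gold)
```

Note the ordering: the gates run before F1 is ever computed, so a status-text answer is zeroed out even if it accidentally shares tokens with the gold answer.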

Training

~102 batches across two sessions (the first session crashed at batch 52 due to a Tinker API network timeout; we resumed from the batch 50 checkpoint).

Training reward was the lowest of all three runs — the status text penalty made the signal even sparser (71% nonzero batches, vs 78% in Run 2 and 96% in Run 1).

[Figure: Run 3 training reward — lowest training reward, but best model]

Results

| Model                 | F1    | EM    | Answer Rate | Avg Opens |
|-----------------------|-------|-------|-------------|-----------|
| Heuristic             | 0.246 | 0.220 | 100%        | 1.2       |
| SFT                   | 0.142 | 0.060 | 87.0%       | 1.6       |
| RL Run 1 (shaped)     | 0.123 | 0.010 | 96.0%       | 1.4       |
| RL Run 2 (pure F1)    | 0.154 | 0.065 | 89.5%       | 1.5       |
| RL Run 3 (status fix) | 0.172 | 0.085 | 99.5%       | 1.2       |

🎉 The best learned-model results across every metric:

F1 = 0.172 — 40% better than Run 1, 21% better than SFT
EM = 0.085 — 8.5× better than Run 1
Answer Rate = 99.5% — the model almost always gives an answer now
Efficiency = 1.2 opens — same as the heuristic, better than SFT

[Figure: Validation results across all models — Run 3 dominates]

The Reasoning Evolution

The most compelling evidence comes from looking at the same questions across all three runs:

[Figure: Reasoning evolution — how the same questions were answered across three runs; the model opened the right documents every time, but only Run 3 put the right answer in the answer field]

Example 2 tells the whole story:

Run 1: “Chunk(s) 4 and 7 have been opened.” → F1 = 0.00
Run 2: “Chunk(s) 4 and 7 have been opened.” → F1 = 0.00
Run 3: “Małgorzata Braunek” → F1 = 1.00

The model opened the right documents in all three runs. The reasoning was correct in all three runs. The only difference was what it put in the answer field.


9. The Paradox: Lower Training Reward = Better Model

This is the central insight of the project. Across three runs, we observed an inverse relationship between training reward and actual model quality:

[Figure: The paradox — lower training reward = better model quality]

| Run                | Training Reward | Val F1        |
|--------------------|-----------------|---------------|
| Run 1 (shaped)     | 0.108 (highest) | 0.123 (worst) |
| Run 2 (pure F1)    | 0.067           | 0.154         |
| Run 3 (status fix) | 0.056 (lowest)  | 0.172 (best)  |

The run with the lowest training reward produced the best model.

Why? Because each successive run used a stricter, more honest reward function:

Run 1’s reward was easy to earn (bonus for opens) but misleading.
Run 2’s reward was harder (F1 only) but still didn’t penalize status text.
Run 3’s reward was the hardest (F1 + status text penalty) but most aligned with what we actually want.

Each time we made the reward harder, the training curves looked worse — but the model learned better behaviors. This is the opposite of what intuition suggests. Most practitioners assume higher training reward = better learning. Our experience shows that reward quality matters more than reward quantity.

This echoes a broader principle in RL: easy rewards lead to shallow learning. The difficulty of the reward function is a feature, not a bug.


10. Results and Analysis

Final Comparison Table

| Model                 | Token F1 | Exact Match | Answer Rate | Avg Opens | Training     |
|-----------------------|----------|-------------|-------------|-----------|--------------|
| Heuristic             | 0.246    | 0.220       | 100%        | 1.2       | N/A          |
| SFT (warm-start)      | 0.142    | 0.060       | 87.0%       | 1.6       | 450 steps    |
| RL Run 1 (shaped)     | 0.123    | 0.010       | 96.0%       | 1.4       | 100 batches  |
| RL Run 2 (pure F1)    | 0.154    | 0.065       | 89.5%       | 1.5       | 100 batches  |
| RL Run 3 (status fix) | 0.172    | 0.085       | 99.5%       | 1.2       | ~102 batches |

The Improvement Trajectory

[Figure: F1 progression across runs — starting from the SFT baseline, RL initially worsened then systematically improved through reward engineering]

Starting from the SFT baseline (0.142), RL initially made things worse (Run 1: 0.123), then systematically improved through reward engineering (Run 2: 0.154, Run 3: 0.172). The gap to the heuristic (0.246) narrowed from 50% to 30%.

Answer Quality Distribution

[Figure: Answer quality evolution — status text dropped dramatically from Run 1 to Run 3]

The most dramatic change is in the “Status text” category — dropping from ~120 instances in Run 1 to significantly fewer in Run 3, while “Correct” and “Partial” answers increased.

Where the Gap to the Heuristic Remains

The heuristic still leads by 30%. Why?

Exact entity names — The heuristic uses Wikipedia article titles as answers, which happen to be the exact entity names that 2WikiMultiHopQA expects. The RL model generates free-text that may paraphrase or include minor variations.
Formatting — “The Mask Of Fu Manchu” vs “The Mask of Fu Manchu” — capitalization differences reduce F1.
Inherent difficulty — Some questions require reasoning chains the 8B model struggles with, regardless of which documents it reads.

Cumulative Reward

[Figure: Cumulative average reward — RL runs plateau below heuristic and SFT baselines]

The cumulative average reward shows both RL runs plateauing well below the heuristic and SFT baselines — confirming that the training signal, while sufficient for improvement, remains sparse.


11. Lessons Learned

1. Look at Your Model’s Actual Outputs

This is the single most important lesson. We spent hours tuning hyperparameters and reward functions — but the breakthrough came from reading 20 example predictions and noticing that half were status text. No amount of quantitative analysis would have revealed this.

Practical advice: Before any training run, sample 20 predictions and read them manually. Two minutes of qualitative analysis beats two hours of hyperparameter sweeps.

2. Goodhart’s Law Is Real and Painful

Shaped rewards feel scientific and principled. Adding a bonus for opening relevant chunks seems like it should help. In practice, it creates a shortcut the agent exploits while ignoring the actual objective.

Practical advice: Start with the simplest possible reward. Add complexity only when you’ve confirmed the base signal works.

3. Sparse but Honest > Dense but Misleading

Run 3 had 71% nonzero batches (vs 96% for Run 1). Many practitioners would look at that and add reward shaping to “help” the agent learn. Our experience shows the opposite: the sparse signal produced the best model because every bit of reward was earned through genuinely correct behavior.

4. SFT Warm-Start Matters More Than You Think

Without SFT, the model wouldn’t know the action format (JSON with “action”, “target”, “text” fields). The 450 demonstration episodes gave it the basic vocabulary for exploration. RL then refined the decisions within that format.

5. Infrastructure Resilience is a First-Class Concern

Across three runs (~300 total batches), we experienced: 2 network timeouts requiring session restarts, 1 tensor alignment bug requiring a code fix, and multiple loss spikes caused by extreme importance-sampling ratios.

Long-running RL experiments need: checkpointing every N batches, automatic resume logic, and gradient clipping. These aren’t optional — they’re survival features.
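The resume logic can be sketched against the per-batch `metrics.jsonl` format mentioned in the appendix. The `batch` field name is an assumption for illustration:

```python
import json
import os

def next_batch_index(metrics_path):
    """Scan a JSONL metrics log for the last completed batch, to resume after a crash."""
    if not os.path.exists(metrics_path):
        return 0                      # fresh run
    last = -1
    with open(metrics_path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                last = max(last, json.loads(line)["batch"])
            except (json.JSONDecodeError, KeyError):
                continue              # tolerate a line truncated mid-write
    return last + 1
```

Pairing this with a checkpoint saved every N batches means a crashed session loses at most N batches of work, which is what let us resume Run 3 from the batch 50 checkpoint.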

6. The Heuristic Baseline Was Shockingly Strong

A simple TF-IDF overlap heuristic achieved F1 = 0.246. After 3 RL runs totaling ~300 batches (~36 hours of training), the learned agent reached 0.172. The heuristic required zero training.

This humbling result is common in RL: carefully engineered baselines are hard to beat. The heuristic’s advantage came from a design choice (using chunk titles as answers) that happened to align perfectly with the evaluation metric. The RL agent had to discover this alignment from scratch.


12. What’s Next

The 30% gap to the heuristic is surmountable. Based on our error analysis, these are the highest-ROI next steps:

Near-Term (Runs 4-5)

Exact Match reward — Replace token F1 with binary EM. The heuristic wins because its answers are exact matches. Training for EM directly would close the gap, though the sparser signal needs a larger GROUP_SIZE (32 or 64).
Post-processing cleanup — Strip common prefixes (“The answer is…”, “Based on the text…”) at eval time. This is a free 2-3% F1 improvement.
Few-shot prompting — Add 3-4 worked examples to the system prompt showing the complete question → open → answer flow.
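The post-processing item above might look like the following sketch. The prefix list is illustrative, not exhaustive; a production version would be built from the error analysis:

```python
import re

# Illustrative lead-in phrases observed in free-text answers.
_LEADIN = re.compile(r"^(the answer is|based on the text,?)\s*", re.IGNORECASE)

def clean_answer(text):
    """Strip boilerplate lead-ins from a predicted answer at eval time."""
    cleaned = _LEADIN.sub("", text.strip())
    return cleaned.strip(" .\"'")
```

Because token F1 penalizes every extra token, removing a three-word lead-in from a two-word answer roughly doubles precision on that example, which is where the estimated 2-3% aggregate gain comes from.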

Medium-Term

Larger GROUP_SIZE (32-64) — More rollouts per task reduce variance, which is critical with sparser rewards.
Curriculum learning — Start with easier questions (single-hop) and gradually introduce multi-hop. This provides denser early reward without sacrificing eventual difficulty.
SEARCH(q) sub-action — Let the agent compose sub-queries. Instead of just opening chunks by index, it could search for “director of Polish-Russian War” and get filtered results.

Long-Term

Larger models — Qwen3-8B is capable but limited. 14B or 32B models may have stronger reasoning that translates to better exploration.
Multi-hop reasoning chains — Visualize and reward intermediate reasoning steps, not just the final answer.


Appendix: Technical Details

Code Structure

tinker-explorer/
├── env/
│   ├── explorer_env.py      # RL environment
│   ├── action_schema.py     # JSON action parsing
│   └── reward.py            # Reward function (all 3 variants)
├── train/
│   └── rl_train.py          # GRPO training loop
├── eval/
│   ├── eval_rl.py           # Val set evaluation
│   └── eval_trajectories.py # Trajectory visualization
├── policies/
│   ├── prompts.py           # System prompt
│   └── heuristics.py        # Heuristic baseline
├── runs/                    # Run reports (this post's source data)
├── plots/                   # All figures in this post
└── logs/                    # Raw metrics and rollouts

Compute

All experiments ran on a MacBook Pro with no local GPU. Model inference and gradient computation happened on Tinker’s cloud infrastructure. Total wall-clock time across 3 runs: ~40 hours. Estimated cloud compute: ~120 GPU-hours (Qwen3-8B with LoRA on A100-equivalent).

Reproducibility

All metrics, checkpoints, and evaluation results are saved in the repository:

logs/rl-run-v{1,2,3}/metrics.jsonl — per-batch training metrics
logs/eval-rl-v1.json — full per-task evaluation results
runs/run_{1,2,3}_*.md — detailed run reports with interpretation


This project was built as a research experiment exploring RL for language agent training. The code, data, and this post are available for educational purposes. Inspired by Karpathy’s autoresearch, built on Tinker.
