---
title: When the LLM Remembers Too Much
url: https://varstatt.com/jurij/p/when-llm-remembers-too-much
author: Jurij Tokarski
date: 2026-04-28 14:14
description: A vendor rate becomes a budget constraint and reasoning leaks across step boundaries — two failures from implicit state crossing where it shouldn't, and the architecture that fixed both.
section: Blog (https://varstatt.com/jurij/archive)
tags: ai (https://varstatt.com/jurij/c/ai), software-design (https://varstatt.com/jurij/c/software-design)
---

Multi-step AI pipelines break in production the same way every time. The output looks wrong — stale reasoning, invented constraints, downstream logic that doesn't add up. It looks like a bug in the later step. It's not. The model retained something from a prior step that it shouldn't have.

## The Budget That Wasn't

A multi-step pipeline walks users through sequential analysis stages. [Market sizing](/discovery/market-sizing) feeds [competitive analysis](/discovery/competitive-analysis), which feeds the tech stack recommendation, which feeds the [cost estimate](/discovery/build-cost-plan). Ten steps, each reading prior outputs.

One step produced a vendor rate of **$997/week** as a reference point for agency development costs. Three steps later, the tech stack recommendation cited "$997/week" as the user's budget. The model had reinterpreted a comparison rate as a constraint.

The prior step outputs were injected as raw text:

```xml
<prior_context>
{{costCalculatorOutput}}
</prior_context>
```

The model had no way to distinguish "this is a vendor rate for comparison" from "this is what the user can spend." The fix was scope guards, labels that state what the context represents and what the model should not do with it:

```xml
<prior_context step="cost-calculator">
This output contains vendor rate comparisons used for cost estimation.
These are NOT the user's budget. Do not use these figures as budget
constraints in the tech stack recommendation.
{{costCalculatorOutput}}
</prior_context>
```
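
The same guard can be attached in code rather than hand-written into every prompt template. A minimal sketch, with `wrapPriorContext` as a hypothetical helper rather than anything from the pipeline above:

```typescript
// A minimal sketch, not the pipeline's actual helper: attach a scope label
// and explicit guidance to a prior step's output before prompt injection.
function wrapPriorContext(step: string, guard: string, output: string): string {
  return [`<prior_context step="${step}">`, guard, output, "</prior_context>"].join("\n");
}

// Stand-in for the {{costCalculatorOutput}} template variable shown above.
declare const costCalculatorOutput: string;

const scopedContext = wrapPriorContext(
  "cost-calculator",
  "These figures are vendor rate comparisons, NOT the user's budget. " +
    "Do not use them as budget constraints in the tech stack recommendation.",
  costCalculatorOutput,
);
```

Keeping the guard next to the output means every downstream step that reads the block also reads what the figures mean and what they must not be used for.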

## The Reasoning That Leaked

An eight-step pipeline ran sequential analysis stages — opportunity analysis, SWOT, bid strategy, and so on. Each step consumed the prior step's output via `previous_response_id`.

The first symptom was token creep. Responses were getting longer and slower. When I traced what the model was reasoning about in step three, it was hauling in thinking from step one — competitor analysis referencing initial opportunity framing that was no longer relevant. The chain didn't carry forward selected outputs. It carried everything: the model's reasoning, intermediate tool calls, abandoned lines of thought.

The fix was a clean break at every step transition:

```typescript
function isSyntheticThreadId(threadId: string): boolean {
  return threadId.startsWith("synthetic_");
}

// At a step transition, reset to a synthetic thread ID and rebuild the
// message from current state instead of relying on the chain.
if (source === RunSource.NEXT_STEP) {
  threadId = buildSyntheticThreadId(sessionId);
  message = buildStepStartContext(artifacts, recentMessages)
    + "\n\n---\n\n" + userInput;
}

// A synthetic thread ID means there is no real prior response to chain
// from, so the call goes out without previous_response_id.
await llm.streamOrchestrator({
  threadId,
  previousResponseId: isSyntheticThreadId(threadId)
    ? undefined
    : threadId,
  message,
});
```

At step boundaries, the thread ID resets to a synthetic value. The driver checks `isSyntheticThreadId()` before setting `previous_response_id` — if synthetic, the call is fully stateless. Context comes from a structured text block built from the current state, not from the chain's accumulated history.
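
The snippet references `buildSyntheticThreadId` without showing it. Any implementation works as long as the ID carries the `synthetic_` prefix the check looks for; this version is an assumption, not the original helper:

```typescript
// Assumed shape of the helper used above: anything with the "synthetic_"
// prefix is recognized by isSyntheticThreadId() as a stateless boundary.
function buildSyntheticThreadId(sessionId: string): string {
  return `synthetic_${sessionId}_${Date.now()}`;
}
```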

## The Fix That Worked Everywhere

Both failures had the same shape: implicit state, whether unscoped context or a chained conversation history, crossed a step boundary the code assumed it wouldn't. The model remembered too much, and the wrong things.

The fix was the same each time. **Stateless at step boundaries, chained within.**

```typescript
function buildStepStartContext(artifacts: Artifact[], messages: ChatMessage[]): string {
  const lines: string[] = [];
  lines.push("[Previous artifacts — current versions including user edits]");
  for (const artifact of artifacts) {
    lines.push(`<artifact step="${artifact.step}" type="${artifact.type}">`);
    lines.push(artifact.content);
    lines.push(`</artifact>`);
  }
  lines.push("[Recent conversation — last 30 messages]");
  for (const msg of messages.slice(-30)) {
    lines.push(`${msg.role}: ${msg.message}`);
  }
  return lines.join("\n");
}
```

At every step transition: reset the thread ID, fetch the current state of all artifacts from the database, inject that state as a structured text block with semantic labels. Within a step — retries, follow-ups, agent callbacks — chain normally for conversational continuity.
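
Put together, a step-transition handler might look like the sketch below. The data-access helpers (`loadArtifacts`, `loadRecentMessages`) and the `llm` client are assumptions standing in for whatever the real pipeline uses:

```typescript
// Assumed stand-ins for the real pipeline's data-access layer.
declare function loadArtifacts(sessionId: string): Promise<Artifact[]>;
declare function loadRecentMessages(sessionId: string, limit: number): Promise<ChatMessage[]>;

async function startNextStep(sessionId: string, userInput: string): Promise<void> {
  // Fetch current state from the database; artifacts include user edits.
  const artifacts = await loadArtifacts(sessionId);
  const recentMessages = await loadRecentMessages(sessionId, 30);

  // Reset the thread and rebuild context from current state, not chain history.
  const threadId = buildSyntheticThreadId(sessionId);
  const message =
    buildStepStartContext(artifacts, recentMessages) + "\n\n---\n\n" + userInput;

  // Synthetic thread ID, no previous_response_id: a fully stateless call.
  await llm.streamOrchestrator({ threadId, previousResponseId: undefined, message });
}
```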

Each step becomes independently resumable. Contaminated reasoning is gone. A comparison rate stays a comparison rate because the label says so.

The cost surprised me. Stateless reconstruction loses the cached token discount from chaining. Across a full eight-step workflow, it was about 27% cheaper anyway. Early steps had little to cache. Later steps were paying to cache stale content the user had already edited.

Chain for conversational depth within a step. Break the chain and inject labeled context at every step boundary.
