
Jurij Tokarski
When the LLM Remembers Too Much
A vendor rate becomes a budget constraint and reasoning leaks across step boundaries — two failures from implicit state crossing where it shouldn't, and the architecture that fixed both.
Multi-step AI pipelines break in production the same way every time. The output looks wrong — stale reasoning, invented constraints, downstream logic that doesn't add up. It looks like a bug in the later step. It's not. The model retained something from a prior step that it shouldn't have.
The Budget That Wasn't
A multi-step pipeline walks users through sequential analysis stages. Market sizing feeds competitive analysis, which feeds the tech stack recommendation, which feeds the cost estimate. Ten steps, each reading prior outputs.
One step produced a vendor rate of $997/week as a reference point for agency development costs. Three steps later, the tech stack recommendation cited "$997/week" as the user's budget. The model had reinterpreted a comparison rate as a constraint.
The prior step outputs were injected as raw text:
<prior_context>
{{costCalculatorOutput}}
</prior_context>
The model had no way to distinguish "this is a vendor rate for comparison" from "this is what the user can spend." The fix was scope guards — labeling what context represents, and what the model should not do with it:
<prior_context step="cost-calculator">
This output contains vendor rate comparisons used for cost estimation.
These are NOT the user's budget. Do not use these figures as budget
constraints in the tech stack recommendation.
{{costCalculatorOutput}}
</prior_context>
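
The same guard can be generated in code rather than hand-written into each prompt. A minimal TypeScript sketch; wrapWithScopeGuard and its arguments are illustrative names, not the pipeline's actual helpers:

// Illustrative sketch: wrap a prior step's output with a scope guard
// that says what the text is and what it must not be used for.
// All names here are hypothetical, not the production pipeline's code.
function wrapWithScopeGuard(
  stepId: string,
  whatThisIs: string,
  whatNotToDo: string,
  output: string
): string {
  return [
    `<prior_context step="${stepId}">`,
    whatThisIs,
    whatNotToDo,
    output,
    `</prior_context>`,
  ].join("\n");
}

// Usage for the vendor-rate case above:
declare const costCalculatorOutput: string; // assumed to hold the cost calculator step's output

const priorContext = wrapWithScopeGuard(
  "cost-calculator",
  "This output contains vendor rate comparisons used for cost estimation.",
  "These are NOT the user's budget. Do not use these figures as budget constraints.",
  costCalculatorOutput
);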
The Reasoning That Leaked
An eight-step pipeline ran sequential analysis stages — opportunity analysis, SWOT, bid strategy, and so on. Each step consumed the prior step's output via previous_response_id.
The first symptom was token creep. Responses were getting longer and slower. When I traced what the model was reasoning about in step three, it was hauling in thinking from step one — competitor analysis referencing initial opportunity framing that was no longer relevant. The chain didn't carry forward selected outputs. It carried everything: the model's reasoning, intermediate tool calls, abandoned lines of thought.
The fix was a clean break at every step transition:
function isSyntheticThreadId(threadId: string): boolean {
  return threadId.startsWith("synthetic_");
}

if (source === RunSource.NEXT_STEP) {
  // New step: break the chain. Fresh synthetic thread ID, context rebuilt
  // from current state rather than the chained history.
  threadId = buildSyntheticThreadId(sessionId);
  message = buildStepStartContext(artifacts, recentMessages)
    + "\n\n---\n\n" + userInput;
}

await llm.streamOrchestrator({
  threadId,
  // Synthetic thread ID: fully stateless call, no previous_response_id.
  previousResponseId: isSyntheticThreadId(threadId)
    ? undefined
    : threadId,
  message,
});
At step boundaries, the thread ID resets to a synthetic value. The driver checks isSyntheticThreadId() before setting previous_response_id — if synthetic, the call is fully stateless. Context comes from a structured text block built from the current state, not from the chain's accumulated history.
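
The driver code leans on buildSyntheticThreadId() without showing it. A plausible sketch, assuming the only contract is the synthetic_ prefix that isSyntheticThreadId() checks:

// Hypothetical implementation; the real one only needs to satisfy the
// prefix check in isSyntheticThreadId() and stay unique per step start.
function buildSyntheticThreadId(sessionId: string): string {
  return `synthetic_${sessionId}_${Date.now()}`;
}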
The Fix That Worked Everywhere
Every failure had the same shape: implicit state — whether it was unscoped context or a chained conversation history — created a contract the code assumed would hold. The model remembered too much, and the wrong things.
The fix was the same each time: stateless at step boundaries, chained within a step.
// Shapes assumed by the builder (simplified).
interface Artifact { step: string; type: string; content: string; }
interface ChatMessage { role: string; message: string; }

// Rebuilds step-start context from current state: labeled artifacts plus
// a short conversation tail, instead of the chain's accumulated history.
function buildStepStartContext(artifacts: Artifact[], messages: ChatMessage[]): string {
  const lines: string[] = [];

  lines.push("[Previous artifacts — current versions including user edits]");
  for (const artifact of artifacts) {
    lines.push(`<artifact step="${artifact.step}" type="${artifact.type}">`);
    lines.push(artifact.content);
    lines.push(`</artifact>`);
  }

  lines.push("[Recent conversation — last 30 messages]");
  for (const msg of messages.slice(-30)) {
    lines.push(`${msg.role}: ${msg.message}`);
  }

  return lines.join("\n");
}
At every step transition: reset the thread ID, fetch the current state of all artifacts from the database, inject that state as a structured text block with semantic labels. Within a step — retries, follow-ups, agent callbacks — chain normally for conversational continuity.
Each step becomes independently resumable. Contaminated reasoning is gone. A comparison rate stays a comparison rate because the label says so.
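
Putting the pieces together, a step-transition driver might look like the sketch below. fetchArtifacts, fetchRecentMessages, and llm.streamOrchestrator are stand-ins for whatever the real pipeline uses; buildSyntheticThreadId and buildStepStartContext are the helpers shown above.

// Sketch of a step transition under the assumptions above.
declare function fetchArtifacts(sessionId: string): Promise<Artifact[]>;
declare function fetchRecentMessages(sessionId: string, limit: number): Promise<ChatMessage[]>;
declare const llm: {
  streamOrchestrator(args: { threadId: string; previousResponseId?: string; message: string }): Promise<void>;
};

async function startNextStep(sessionId: string, userInput: string): Promise<void> {
  // Current state from the database, including the user's edits to prior outputs.
  const artifacts = await fetchArtifacts(sessionId);
  const recentMessages = await fetchRecentMessages(sessionId, 30);

  // Clean break: a synthetic thread ID means no previous_response_id,
  // so the model sees only the labeled context block, not the chain.
  const threadId = buildSyntheticThreadId(sessionId);
  const message =
    buildStepStartContext(artifacts, recentMessages) + "\n\n---\n\n" + userInput;

  await llm.streamOrchestrator({
    threadId,
    previousResponseId: undefined, // stateless at the step boundary
    message,
  });
}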
The cost surprised me. Stateless reconstruction loses the cached token discount from chaining. Across a full eight-step workflow, it was about 27% cheaper anyway. Early steps have little to cache. Later steps were paying to cache stale content the user had already edited.
Chain for conversational depth within a step. Break the chain and inject labeled context at every step boundary.
About Jurij Tokarski
I run Varstatt and create software. Usually, I'm deep in work shipping for clients or building for myself. Sometimes, I share bits I don't want to forget.