Most AI content tools genuinely understand what a blog post says, but not why it exists. We spent three iterations of prompt engineering closing that gap, and the fix more than tripled our LLM's instruction compliance, from 24% to 83%, with zero quality loss. Here's the technical journey.
When a B2B company publishes a blog post, they have a reason: drive traffic back to the blog, promote a product mentioned in the content, leverage a timely event, or build the author's credibility. Existing AI repurposing tools throw all of that away. They extract themes and quotes, but they ignore the marketing intent behind the content, so the generated posts feel disconnected from anything the creator actually cares about.
At Sembra we call this the purpose gap: the distance between "AI that understands content" and "AI that understands marketing intent." Closing it is the difference between a content summarizer and a marketing tool.
This post is the three-iteration story of how we closed it — the prompt engineering dead ends, the hidden generation failures, and the single JSON schema trick that moved the needle further than everything else combined. If you're building with LLMs and struggling to get them to actually follow instructions, the lessons here might save you weeks.
What Is the Purpose Gap in AI Content Tools?
The purpose gap is the distance between an AI that can extract themes from a blog post and an AI that knows why the blog post was written in the first place.
Every content repurposing tool on the market does roughly the same thing: take a blog post, extract the key points, reformat them for social media. The result reads well but accomplishes nothing for the creator's marketing goals. The post is coherent; it is also genuinely purposeless.
Closing this gap is precisely what Sembra's Contextual Grounding feature was designed to do. The idea sounds simple — detect four types of marketing context in source content (Content Marketing URLs, Product mentions, Event references, Author credentials), then weave them into the generated posts. The implementation was anything but.
V1: Detection Worked, Everything Else Broke
Our first version detected marketing context accurately but failed at every downstream step — URLs were null, relevance scores never varied, and the LLM fabricated links when we asked it to include them.
The V1 architecture had three stages: detection (extract grounding context alongside themes and quotes), theme mapping (connect grounding to relevant themes), and two-layer generation (decide whether each platform gets grounded, then how). Testing across 7 real content pieces surfaced 7 distinct issues — every one of the 14 grounding details came back with url: null, every relevance score landed above 0.7 (so the medium and low paths never triggered), and every Instagram post ended with the exact same "Save this" CTA.
The worst failure was trust-breaking: when we told the LLM to "include links" but no URL was available in context, it invented them. Plausible-looking URLs that went nowhere. That's the kind of bug that ships once and loses a customer forever.
V2: Seven Targeted Fixes, One Hidden Problem
V2 eliminated hallucinations, fixed the detection gaps, and restored variety in relevance scoring — but a strict audit revealed that roughly 75% of grounding details were being completely ignored by the generation LLM anyway.
We made seven changes:

- a user-provided source_url field, so the LLM never guesses
- a raised relevance threshold (0.7 → 0.85)
- simplified CTA instructions, removing duplication with the platform prompts
- an anti-hallucination guard with a subtle fallback when no URL exists
- better Product Marketing detection in blog contexts
- Author Identity detection that references the AUTHOR field directly
- automatic source_url injection into Content Marketing details
On paper, V2 was a triumph. Hallucinated URLs dropped from "multiple observed" to zero. Variety returned. Blog-context detection finally worked. And then we ran a rigorous check: did the generated posts actually mention the grounding details they'd been given?
| Grounding Type | V2 Hit Rate |
|---|---|
| Content Marketing | 24% |
| Product Marketing | 21% |
| Event Occasion | 24% |
| Author Identity | 45% |
Buffer — our ideal B2B blog customer — scored 0%. Not a single generated post mentioned "Buffer," despite Product Marketing grounding being attached to 2 of 5 post sets. The pipeline was correctly detecting the context, correctly mapping it to themes, and then the generation LLM was quietly throwing it away.
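The audit itself is worth automating. A sketch of the hit-rate check we ran (function name and matching logic are illustrative; a substring match is the bluntest possible version):

```python
def grounding_hit_rate(posts: list[str], terms: list[str]) -> float:
    """Fraction of posts that mention at least one grounding term,
    case-insensitively. A post 'hits' if any term appears anywhere in it."""
    if not posts:
        return 0.0
    hits = sum(
        1 for post in posts
        if any(term.lower() in post.lower() for term in terms)
    )
    return hits / len(posts)
```

Running something like this per grounding type is what produced the table above; measuring only detection would have shown 100% and hidden the problem entirely.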
Why the Generation LLM Was Ignoring Instructions
Four compounding factors — weak imperative language, no structural slot in the prompt, positional disadvantage in the middle of the prompt, and no forced reasoning in the output schema — combined to make grounding instructions easy for the LLM to skip.
The root cause analysis turned up six factors, but the underlying pattern was strikingly simple: our grounding instructions sat in the worst possible position. Liu et al. (2023) documented the "Lost in the Middle" effect, in which LLMs attend strongly to the beginning and end of a prompt but lose focus in the middle. Our grounding lived exactly there, buried as the fourth item in a {content_elements} variable alongside key phrases, quotes, and data points, all weighted equally.
On top of that, the platform prompts' STRUCTURE sections (Hook → Context → Content → CTA) never mentioned grounding. The output JSON schema forced hook awareness via a hook_type_used field, but nothing forced grounding awareness. The LLM had no example of what a grounded post even looks like. Every nudge in the prompt was pointing away from the one thing we needed it to do.
V3: Four Prompt Engineering Changes That Actually Worked
V3 applied four techniques — variable separation, sandwich positioning, structural anchoring, and a JSON schema reasoning field — that together tripled grounding compliance from ~30% to 83% with zero quality degradation.
Change 1 — Separation. We pulled grounding out of the shared {content_elements} blob and gave it its own {grounding_instructions} template variable. Its own visual block. No competition with other content elements for the LLM's attention.
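A before/after sketch of the separation, with simplified template text (the real prompts are longer, and these variable names are illustrative):

```python
# Before (V2): grounding buried mid-list in a shared variable.
V2_USER_PROMPT = """Write a {platform} post.

CONTENT ELEMENTS:
{content_elements}"""  # key phrases, quotes, data points, grounding: all mixed together

# After (V3): grounding gets its own labeled block the model cannot miss.
V3_USER_PROMPT = """Write a {platform} post.

CONTENT ELEMENTS:
{content_elements}

CONTEXTUAL GROUNDING (follow exactly):
{grounding_instructions}"""

prompt = V3_USER_PROMPT.format(
    platform="twitter",
    content_elements="- quote: ...\n- data point: ...",
    grounding_instructions="Reference the source blog URL in the closing CTA.",
)
```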
Change 2 — Sandwich pattern. We added a "MUST reference" rule to each platform's system prompt (primacy) and kept the detailed instructions in the user prompt (recency). Exploit both ends of the U-shaped attention curve; don't fight it.
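In message terms, the sandwich looks roughly like this (a sketch with hypothetical helper and argument names, shaped for a chat-style API):

```python
def build_messages(platform_rules: str, grounding_detail: str, source_summary: str) -> list[dict]:
    # Primacy: a short, absolute rule at the very top of the system prompt.
    system = (
        "MUST: every post references the grounding context given at the "
        "end of the user message.\n\n" + platform_rules
    )
    # Recency: the detailed grounding instructions close out the user prompt.
    user = (
        f"{source_summary}\n\n"
        f"GROUNDING (apply the MUST rule above):\n{grounding_detail}"
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]
```

The same rule appears twice on purpose: once compressed at the start, once in full at the end, so the middle of the prompt carries nothing the model is allowed to forget.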
Change 3 — Structural anchoring. We added parenthetical hints directly to the platform STRUCTURE sections: [Opening context -- expand on the hook's promise; integrate contextual grounding here if instructed]. The LLM needed to know where to place the thing, not just that it should exist.
Change 4 — The JSON schema trick. This one moved the needle more than the other three combined. We added grounding_reference as the first field in each platform's response schema:
```json
{
  "grounding_reference": "<Quote the exact phrase from your post that references the grounding context. Write 'none' if no grounding was provided>",
  "hook_type_used": "...",
  "content": "..."
}
```

Our code ignores this field entirely. It exists purely to force the LLM to articulate what grounding it's about to use before it writes the post. Research on reasoning-before-answering in structured output supports this ordering, and the PARSE framework treats JSON schemas as first-class instructions, with reported compliance gains of up to 64.7%. We saw something very similar.
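On the consuming side, the trick costs nothing: the field is parsed and then dropped. A sketch of what the pipeline reads back (the response text here is illustrative):

```python
import json

# Example model response. Because the model generates tokens left to right,
# it has already committed to its grounding before it writes `content`.
raw = """{
  "grounding_reference": "Read the full guide on the blog",
  "hook_type_used": "contrarian",
  "content": "Stop trying to beat the Instagram algorithm. There isn't one."
}"""

post = json.loads(raw)
final_text = post["content"]  # grounding_reference is read and deliberately discarded
```

That left-to-right commitment is the whole mechanism: by the time the model writes the post body, it has already named the grounding phrase it must include.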
The Results: 24% → 83% With No Quality Loss
V3 improved grounding compliance by 53 percentage points overall, with Event Occasion and Author Identity hitting 100% — and it did so without a single regression in quality or cost.
| Grounding Type | V2 | V3 | Change |
|---|---|---|---|
| Content Marketing | 24% | 69% | +45pp |
| Product Marketing | 21% | 71% | +50pp |
| Event Occasion | 24% | 100% | +76pp |
| Author Identity | 45% | 100% | +55pp |
| Overall | ~30% | 83% | +53pp |
Per platform: Twitter went from 53% to 94%. LinkedIn from 47% to 80%. Instagram from 29% to 76%.
The critical question was whether we'd broken anything to get there. Validation failures: same (zero). Hallucinated URLs: same (zero). LinkedIn engagement questions: same (100%). Twitter hook violations actually dropped slightly. Total cost: 9% cheaper, because shorter focused prompts beat longer unfocused ones.
Before and After: What Customers Actually See
Here's the Buffer blog post rendered as a Twitter post in V2 and V3:
V2 (0% grounding):

> Stop thinking Instagram has one algorithm. It doesn't.
>
> The platform uses multiple distinct AI systems... [great thought leadership, zero mention of Buffer or the blog URL]
>
> #InstagramAlgorithm #ContentStrategy

V3 (grounding integrated):

> Stop trying to beat "the Instagram algorithm." There isn't one. There are four.
>
> Instagram's head Adam Mosseri confirmed this... [same quality content]
>
> Read the full guide: https://buffer.com/resources/instagram-algorithm/
>
> #InstagramAlgorithm #ContentStrategy
Same voice, same insight, same quality — but one of these posts drives traffic back to Buffer's blog and one doesn't. That's the purpose gap, closed.
What I Discovered Building This
Three insights stand out from the three-iteration journey, and none of them were obvious before we started.
Detection is the easy part — compliance is where it breaks. Our extraction LLM identified grounding context correctly from day one. The gap between "AI knows the information" and "AI acts on the information" is far wider than most people think, and it's exactly where most AI content tools quietly fail. If you only measure detection, you might be shipping a product that's 75% broken without noticing.
Prompt positioning matters more than prompt wording. We spent V2 rewriting instruction text and got nowhere. V3 barely changed the words — it moved them. Separating grounding into its own variable, duplicating the rule across system and user prompts, and adding one forced-reasoning field in the JSON schema did more than any amount of imperative language ever could. When an LLM is ignoring your instructions, the first question is rarely "what did I say"; it's "where did I say it."
Simplicity beats complexity in prompt engineering. Our first V3 attempt had a 4×3 matrix of 12 type-specific instruction strings. A code simplicity review correctly flagged it as over-engineered — the detail text already carries the type specificity. We collapsed to three style-based templates with a conditional URL overlay, and got better results. The LLM doesn't need to be told "mention the product by name" when it can see [Product Marketing] Atomic Habits Workbook right there in context. Every line you add to a prompt is a line the LLM has to choose to attend to.
Closing the Gap
Closing the purpose gap isn't about making your AI smarter — it's about making your instructions impossible to ignore. Three iterations, four prompt engineering techniques, and one JSON schema trick took us from 24% compliance to 83% without touching the model, the cost structure, or the quality bar. If you're building with LLMs, the lesson is worth sitting with: when the model ignores you, the fix is usually structural, not linguistic.
This is the kind of work that makes Sembra different from every other content tool — we're not wrapping an API, we're engineering the pipeline that turns one long-form piece into 15-25 platform-native social posts that genuinely serve your marketing goals. If that's the kind of content tool you've been waiting for, read the full story of why I'm building Sembra or join the waitlist to be among the first to use it.