Most AI content tools genuinely understand what a blog post says, but not why it exists. We spent three iterations of prompt engineering closing that gap, and the fix more than tripled our LLM's instruction compliance, from 24% to 83%, with zero quality loss. Here's the technical journey.
When a B2B company publishes a blog post, they have a reason: drive traffic back to the blog, promote a product mentioned in the content, leverage a timely event, or build the author's credibility. Existing AI repurposing tools throw all of that away. They extract themes and quotes, but they ignore the marketing intent behind the content, so the generated posts feel disconnected from anything the creator actually cares about.
At Sembra we call this the purpose gap: the distance between "AI that understands content" and "AI that understands marketing intent." Closing it is the difference between a content summarizer and a marketing tool.
This post is the three-iteration story of how we closed it — the prompt engineering dead ends, the hidden generation failures, and the single JSON schema trick that moved the needle further than everything else combined. If you're building with LLMs and struggling to get them to actually follow instructions, the lessons here might save you weeks.
What Is the Purpose Gap in AI Content Tools?
The purpose gap is the distance between an AI that can extract themes from a blog post and an AI that knows why the blog post was written in the first place.
Every content repurposing tool on the market does roughly the same thing: take a blog post, extract the key points, reformat them for social media. The result reads well but accomplishes nothing for the creator's marketing goals. The post is coherent; it is also genuinely purposeless.
Closing this gap is precisely what Sembra's Contextual Grounding feature was designed to do. The idea sounds simple — detect four types of marketing context in source content (Content Marketing URLs, Product mentions, Event references, Author credentials), then weave them into the generated posts. The implementation was anything but.
V1: Detection Worked, Everything Else Broke
Our first version detected marketing context accurately but failed at every downstream step — URLs were null, relevance scores never varied, and the LLM fabricated links when we asked it to include them.
The V1 architecture had three stages: detection (extract grounding context alongside themes and quotes), theme mapping (connect grounding to relevant themes), and two-layer generation (decide whether each platform gets grounded, then how). Testing across 7 real content pieces surfaced 7 distinct issues — every one of the 14 grounding details came back with url: null, every relevance score landed above 0.7 (so the medium and low paths never triggered), and every Instagram post ended with the exact same "Save this" CTA.
The worst failure was trust-breaking: when we told the LLM to "include links" but no URL was available in context, it invented them. Plausible-looking URLs that went nowhere. That's the kind of bug that ships once and loses a customer forever.
V2: Seven Targeted Fixes, One Hidden Problem
V2 eliminated hallucinations, fixed the detection gaps, and restored variety in relevance scoring — but a strict audit revealed that roughly 75% of grounding details were being completely ignored by the generation LLM anyway.
We made seven changes:

- a user-provided source_url field, so the LLM never guesses
- a raised relevance threshold (0.7 → 0.85)
- simplified CTA instructions, removing duplication with the platform prompts
- an anti-hallucination guard with a subtle fallback when no URL exists
- better Product Marketing detection in blog contexts
- Author Identity detection that references the AUTHOR field directly
- automatic source_url injection into Content Marketing details
On paper, V2 was a triumph. Hallucinated URLs dropped from "multiple observed" to zero. Variety returned. Blog-context detection finally worked. And then we ran a rigorous check: did the generated posts actually mention the grounding details they'd been given?
| Grounding Type | V2 Hit Rate |
|---|---|
| Content Marketing | 24% |
| Product Marketing | 21% |
| Event Occasion | 24% |
| Author Identity | 45% |
Buffer — our ideal B2B blog customer — scored 0%. Not a single generated post mentioned "Buffer," despite Product Marketing grounding being attached to 2 of 5 post sets. The pipeline was correctly detecting the context, correctly mapping it to themes, and then the generation LLM was quietly throwing it away.
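The audit itself is worth automating. A sketch of the hit-rate check we ran (function name and matching logic are illustrative; a substring match is the bluntest possible version):

```python
def grounding_hit_rate(posts: list[str], terms: list[str]) -> float:
    """Fraction of posts that mention at least one grounding term,
    case-insensitively. A post 'hits' if any term appears anywhere in it."""
    if not posts:
        return 0.0
    hits = sum(
        1 for post in posts
        if any(term.lower() in post.lower() for term in terms)
    )
    return hits / len(posts)
```

Running something like this per grounding type is what produced the table above; measuring only detection would have shown 100% and hidden the problem entirely.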
Why the Generation LLM Was Ignoring Instructions
Four compounding factors — weak imperative language, no structural slot in the prompt, positional disadvantage in the middle of the prompt, and no forced reasoning in the output schema — combined to make grounding instructions easy for the LLM to skip.
The root cause analysis turned up six factors, but the underlying pattern was strikingly simple: our grounding instructions sat in the worst possible position. Liu et al. (2023) documented the "Lost in the Middle" effect, in which LLMs attend strongly to the beginning and end of a prompt but lose focus in the middle. Our grounding lived exactly there, buried as the fourth item in a {content_elements} variable alongside key phrases, quotes, and data points, all weighted equally.
On top of that, the platform prompts' STRUCTURE sections (Hook → Context → Content → CTA) never mentioned grounding. The output JSON schema forced hook awareness via a hook_type_used field, but nothing forced grounding awareness. The LLM had no example of what a grounded post even looks like. Every nudge in the prompt was pointing away from the one thing we needed it to do.
V3: Four Prompt Engineering Changes That Actually Worked
V3 applied four techniques — variable separation, sandwich positioning, structural anchoring, and a JSON schema reasoning field — that together tripled grounding compliance from ~30% to 83% with zero quality degradation.
Change 1 — Separation. We pulled grounding out of the shared {content_elements} blob and gave it its own {grounding_instructions} template variable. Its own visual block. No competition with other content elements for the LLM's attention.
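A before/after sketch of the separation, with simplified template text (the real prompts are longer, and these variable names are illustrative):

```python
# Before (V2): grounding buried mid-list in a shared variable.
V2_USER_PROMPT = """Write a {platform} post.

CONTENT ELEMENTS:
{content_elements}"""  # key phrases, quotes, data points, grounding: all mixed together

# After (V3): grounding gets its own labeled block the model cannot miss.
V3_USER_PROMPT = """Write a {platform} post.

CONTENT ELEMENTS:
{content_elements}

CONTEXTUAL GROUNDING (follow exactly):
{grounding_instructions}"""

prompt = V3_USER_PROMPT.format(
    platform="twitter",
    content_elements="- quote: ...\n- data point: ...",
    grounding_instructions="Reference the source blog URL in the closing CTA.",
)
```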
Change 2 — Sandwich pattern. We added a "MUST reference" rule to each platform's system prompt (primacy) and kept the detailed instructions in the user prompt (recency). Exploit both ends of the U-shaped attention curve; don't fight it.
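In message terms, the sandwich looks roughly like this (a sketch with hypothetical helper and argument names, shaped for a chat-style API):

```python
def build_messages(platform_rules: str, grounding_detail: str, source_summary: str) -> list[dict]:
    # Primacy: a short, absolute rule at the very top of the system prompt.
    system = (
        "MUST: every post references the grounding context given at the "
        "end of the user message.\n\n" + platform_rules
    )
    # Recency: the detailed grounding instructions close out the user prompt.
    user = (
        f"{source_summary}\n\n"
        f"GROUNDING (apply the MUST rule above):\n{grounding_detail}"
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]
```

The same rule appears twice on purpose: once compressed at the start, once in full at the end, so the middle of the prompt carries nothing the model is allowed to forget.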
Change 3 — Structural anchoring. We added parenthetical hints directly to the platform STRUCTURE sections: [Opening context -- expand on the hook's promise; integrate contextual grounding here if instructed]. The LLM needed to know where to place the thing, not just that it should exist.
Change 4 — The JSON schema trick. This one moved the needle more than the other three combined. We added grounding_reference as the first field in each platform's response schema:
```json
{
  "grounding_reference": "<Quote the exact phrase from your post that references the grounding context. Write 'none' if no grounding was provided>",
  "hook_type_used": "...",
  "content": "..."
}
```

Our code ignores this field entirely. It exists purely to force the LLM to articulate what grounding it's about to use before it writes the post. Research on reasoning-before-answering in structured output supports this ordering, and the PARSE framework treats JSON schemas as first-class instructions, with reported compliance gains of up to 64.7%. We saw something very similar.
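On the consuming side, the trick costs nothing: the field is parsed and then dropped. A sketch of what the pipeline reads back (the response text here is illustrative):

```python
import json

# Example model response. Because the model generates tokens left to right,
# it has already committed to its grounding before it writes `content`.
raw = """{
  "grounding_reference": "Read the full guide on the blog",
  "hook_type_used": "contrarian",
  "content": "Stop trying to beat the Instagram algorithm. There isn't one."
}"""

post = json.loads(raw)
final_text = post["content"]  # grounding_reference is read and deliberately discarded
```

That left-to-right commitment is the whole mechanism: by the time the model writes the post body, it has already named the grounding phrase it must include.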
The Results: 24% → 83% With No Quality Loss
V3 improved grounding compliance by 53 percentage points overall, with Event Occasion and Author Identity hitting 100% — and it did so without a single regression in quality or cost.
| Grounding Type | V2 | V3 | Change |
|---|---|---|---|
| Content Marketing | 24% | 69% | +45pp |
| Product Marketing | 21% | 71% | +50pp |
| Event Occasion | 24% | 100% | +76pp |
| Author Identity | 45% | 100% | +55pp |
| Overall | ~30% | 83% | +53pp |
Per platform: Twitter went from 53% to 94%. LinkedIn from 47% to 80%. Instagram from 29% to 76%.
The critical question was whether we'd broken anything to get there. Validation failures: same (zero). Hallucinated URLs: same (zero). LinkedIn engagement questions: same (100%). Twitter hook violations actually dropped slightly. Total cost: 9% cheaper, because shorter focused prompts beat longer unfocused ones.
Before and After: What Customers Actually See
Here's the Buffer blog post rendered as a Twitter post in V2 and V3:
V2 (0% grounding):

> Stop thinking Instagram has one algorithm. It doesn't.
>
> The platform uses multiple distinct AI systems... [great thought leadership, zero mention of Buffer or the blog URL]
>
> #InstagramAlgorithm #ContentStrategy

V3 (grounding integrated):

> Stop trying to beat "the Instagram algorithm." There isn't one. There are four.
>
> Instagram's head Adam Mosseri confirmed this... [same quality content]
>
> Read the full guide: https://buffer.com/resources/instagram-algorithm/
>
> #InstagramAlgorithm #ContentStrategy
Same voice, same insight, same quality — but one of these posts drives traffic back to Buffer's blog and one doesn't. That's the purpose gap, closed.
What I Discovered Building This
Three insights stand out from the three-iteration journey, and none of them were obvious before we started.
Detection is the easy part — compliance is where it breaks. Our extraction LLM identified grounding context correctly from day one. The gap between "AI knows the information" and "AI acts on the information" is far wider than most people think, and it's exactly where most AI content tools quietly fail. If you only measure detection, you might be shipping a product that's 75% broken without noticing.
Prompt positioning matters more than prompt wording. We spent V2 rewriting instruction text and got nowhere. V3 barely changed the words — it moved them. Separating grounding into its own variable, duplicating the rule across system and user prompts, and adding one forced-reasoning field in the JSON schema did more than any amount of imperative language ever could. When an LLM is ignoring your instructions, the first question is rarely "what did I say"; it's "where did I say it."
Simplicity beats complexity in prompt engineering. Our first V3 attempt had a 4×3 matrix of 12 type-specific instruction strings. A code simplicity review correctly flagged it as over-engineered — the detail text already carries the type specificity. We collapsed to three style-based templates with a conditional URL overlay, and got better results. The LLM doesn't need to be told "mention the product by name" when it can see [Product Marketing] Atomic Habits Workbook right there in context. Every line you add to a prompt is a line the LLM has to choose to attend to.
Closing the Gap
Closing the purpose gap isn't about making your AI smarter — it's about making your instructions impossible to ignore. Three iterations, four prompt engineering techniques, and one JSON schema trick took us from 24% compliance to 83% without touching the model, the cost structure, or the quality bar. If you're building with LLMs, the lesson is worth sitting with: when the model ignores you, the fix is usually structural, not linguistic.
This is the kind of work that makes Sembra different from every other content tool — we're not wrapping an API, we're engineering the pipeline that turns one long-form piece into 15-25 platform-native social posts that genuinely serve your marketing goals. If that's the kind of content tool you've been waiting for, read the full story of why I'm building Sembra or join the waitlist to be among the first to use it.