How Do You Teach AI Your Writing Voice? Building Brand Voice Extraction for Sembra

Madeline Fredriksz

How we built AI brand voice extraction that reads your writing and captures how you actually sound — not how you think you sound.

If you create long-form content, you have probably tried using AI to turn it into social posts — and noticed the output sounds nothing like you. Brand voice extraction fixes this by reading your actual writing and building a structured profile of how you sound.

And right now, it matters more than ever. LinkedIn's 2026 algorithm actively downranks generic AI content. Human-written content gets 5.44x more traffic. One creator on Reddit put it bluntly: "One post I wrote by hand earned me more than the 143 I created using AI."

We built a brand voice extraction pipeline at Sembra to solve this problem. Here is how we built it, what broke, and what we learned.

Why "Pick Your Tone" Fails

Before we could build the right solution, we had to understand why the obvious one does not work. The content marketing community has started calling the problem "Voice Drift" — AI repurposing tools strip away the creator's personality, leaving behind neutral, generic output. Most tools try to fix this by asking you to self-describe your voice. Pick from a dropdown: formal or casual, playful or serious, authoritative or approachable. This fails for a reason that might sound counterintuitive: people are genuinely terrible at describing how they write.

I asked one of our test writers if he hedges. He said no. Then I ran his writing through the extraction pipeline and counted 15 instances of "I think," "maybe," and "tends to." He hedges constantly — he just does not experience it that way. Self-perception and linguistic reality diverge consistently, and a generation model built on self-perception produces content that sounds like who the writer thinks they are, not who they actually are on the page.

The specificity problem is maybe worse. "Professional but friendly" describes half the internet. It gives a generation model nothing concrete to work with. What a model actually needs is: this writer hedges at medium frequency using "rather," "might," and "possibly"; uses em dashes and semicolons but never exclamation points; writes sentences averaging 23 words with high variance; and addresses the reader directly but rarely uses inclusive "we." That is a usable voice profile. A dropdown is not.
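To make "usable voice profile" concrete, here is a minimal sketch of what such a profile might look like and how it could be rendered into prompt guidance. The field names and rendering function are illustrative assumptions, not Sembra's actual schema.

```python
# Illustrative voice profile -- field names are assumptions, not Sembra's schema.
voice_profile = {
    "hedging": {"frequency": "medium", "markers": ["rather", "might", "possibly"]},
    "punctuation": {"em_dash": True, "semicolon": True, "exclamation": False},
    "sentence_length": {"mean_words": 23, "variance": "high"},
    "reader_address": {"direct": True, "inclusive_we": "rare"},
}

def to_prompt_fragment(profile: dict) -> str:
    """Render the profile as plain-text guidance for a generation prompt."""
    h = profile["hedging"]
    allowed = [name for name, used in profile["punctuation"].items() if used]
    return (
        f"Hedge at {h['frequency']} frequency using: {', '.join(h['markers'])}. "
        f"Punctuation to use: {', '.join(allowed)}; avoid the rest."
    )

print(to_prompt_fragment(voice_profile))
```

The point of the structure is that every field is specific enough for a generation model to act on, which "professional but friendly" never is.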

Reading the Writer Instead

So rather than asking writers to describe themselves, we built a pipeline that reads their actual content and extracts a structured voice profile — a machine-readable representation of how someone genuinely writes.

We needed a foundation more rigorous than intuition, so we grounded the schema in Ken Hyland's metadiscourse framework — an empirically validated model from academic linguistics. The framework describes two dimensions of writing: stance (how the writer positions themselves relative to their own claims — hedging, boosting, attitude, self-reference) and engagement (how they connect with readers — direct address, questions, directives, shared knowledge). Hyland's framework includes curated lists of marker words — the specific words that signal hedging, boosting, and attitude. But those lists were built from academic writing — which uses completely different language than blogs and newsletters. Words like "really" and "think" barely appear in academic prose but show up constantly in informal writing. We adapted the lists using research on how informal writing actually works.
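A toy version of such marker lists, in the spirit of Hyland's stance and engagement categories with informal additions like "really" and "think," looks like this. The words are illustrative examples, not the curated lists from the literature, and the counting is deliberately naive substring matching (the trouble that causes is covered in the next section).

```python
# Toy marker lists in the spirit of Hyland's categories, extended with
# informal items. Words are illustrative, not the curated research lists.
STANCE_MARKERS = {
    "hedges": ["might", "maybe", "possibly", "think", "tends to"],
    "boosters": ["definitely", "clearly", "really", "obviously"],
    "attitude": ["surprisingly", "remarkably", "importantly"],
}
ENGAGEMENT_MARKERS = {
    "directives": ["consider", "note that", "remember"],
    "shared_knowledge": ["of course", "as we know", "everyone knows"],
}

def count_markers(text: str, markers: list[str]) -> int:
    """Naive substring counting; no awareness of who the subject is."""
    lowered = text.lower()
    return sum(lowered.count(m) for m in markers)

sample = "I think this might work. Really, it tends to surprise people."
print(count_markers(sample, STANCE_MARKERS["hedges"]))  # → 3
```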

The pipeline splits work between traditional NLP and an LLM. Everything countable — word frequencies, sentence lengths, pronoun ratios, punctuation marks — is extracted deterministically. The LLM only handles what requires judgment: classifying tone, identifying structural patterns, assessing rhetorical function. And crucially, the LLM receives the deterministic results as context, so its qualitative assessments are grounded in what the numbers actually show.
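The deterministic half needs nothing more than standard text processing. This sketch (a simplification; a production pipeline would presumably use a real tokenizer) shows the kind of countable features that never touch the LLM:

```python
import re
import statistics

def countable_features(text: str) -> dict:
    """Deterministic extraction of countable signals -- no LLM required.
    Simplified sketch: sentences split on terminal punctuation, words on
    letter runs."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    lengths = [len(s.split()) for s in sentences]
    words = re.findall(r"[a-z']+", text.lower())
    first_person = {"i", "me", "my", "we", "our"}
    return {
        "avg_sentence_words": round(statistics.mean(lengths), 1) if lengths else 0,
        "sentence_stdev": round(statistics.stdev(lengths), 1) if len(lengths) > 1 else 0.0,
        "first_person_ratio": sum(w in first_person for w in words) / max(len(words), 1),
        "questions": text.count("?"),
        "exclamations": text.count("!"),
        "semicolons": text.count(";"),
    }

feats = countable_features("I hedge a lot. Maybe too much? We will see; probably not!")
print(feats["questions"], feats["semicolons"], feats["exclamations"])  # → 1 1 1
```

Results like these are what get passed to the LLM as grounding context, so its qualitative judgments cannot drift from the measured numbers.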

The pipeline ends with a personality reveal — not a settings page. Each writer gets a name and a short portrait in second person. One of our test writers came back as "The Curious Bridge-Builder": you give direct instructions — "stop," "reconsider" — while softening them with shared knowledge appeals like "you've been waiting to feel ready." Your attitude runs on "genuinely" and "interesting," and you address the reader as a peer, not a student. She read it and recognized herself immediately. That is the validation that mattered most: not that the numbers were correct, but that the writer saw her own voice reflected back at her.

When Words Lie: The Context Problem

With the pipeline scaffolded, we tested five writers with genuinely different voices — and immediately hit a wall. The initial extraction used flat word lists: if a word appeared on our "hedge" list, it got counted as a hedge. The problem is that common English words change meaning depending on how they are used.

"I think this matters" — that is a hedge; the writer is expressing uncertainty. "People think AI is magic" — that is not a hedge; the writer is stating what others believe. Same word, completely different function. The same problem showed up with boosters: "Research shows that X" sounds assertive, but the writer is citing an external source, not personally claiming anything. Compare that with "I'll show you that this works" — now the writer is personally asserting. A writer who cites research often is not the same as a writer who personally asserts often, even though both use the word "shows."

The fix: instead of just matching words, we look at who is saying it. Linguistics research confirmed what our false positives were showing — these words only count as voice markers when the writer themselves is the subject. "I think" is a hedge; "people think" is not. "I'll show" is a booster; "research shows" is not. That single rule eliminated false positives across the board, and every writer's profile became more accurate.
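The subject-aware rule can be sketched with a regular expression. A real pipeline would likely use dependency parsing to find the grammatical subject; this simplified version, which allows one optional adverb ("I really think"), captures the idea:

```python
import re

# Subject-aware rule: a hedge verb only counts when the writer is the
# subject. "I think" counts; "people think" does not. Regex sketch only;
# real detection would presumably use dependency parsing.
HEDGE_VERBS = ("think", "guess", "suppose", "believe")

WRITER_HEDGE = re.compile(
    r"\bI\s+(?:\w+\s+)?(?:" + "|".join(HEDGE_VERBS) + r")\b"
)

def count_writer_hedges(text: str) -> int:
    return len(WRITER_HEDGE.findall(text))

print(count_writer_hedges("I think this matters."))      # → 1
print(count_writer_hedges("People think AI is magic."))  # → 0
print(count_writer_hedges("I really think it helps."))   # → 1
```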

The Propose-Prove Pattern

Attitude markers — the evaluative language that gives writing its emotional texture — posed a genuinely different kind of problem. Our initial approach used a curated list of evaluative words from the linguistics literature: "surprisingly," "remarkably," "importantly." This worked for some writers but completely missed others.

One test writer's entire emotional register lives in words like "startling," "astonishing," and "genuine breakthroughs" — none of which were on our list. We could have expanded the list, but that is whack-a-mole; English has hundreds of evaluative words, and different writers use completely different ones. No finite list would provide coverage.

We hit a similar wall with shared knowledge — phrases where the writer assumes common ground with the reader ("Everyone knows that evals are important"). The LLM could identify these naturally, but it also hallucinated plausible-sounding phrases that were not actually in the text. Tighter prompting helped but could not fully solve the hallucination problem.

The solution — and the most important pattern we developed — is what we call the Propose-Prove Pattern. The LLM reads the text and proposes candidate markers or phrases. Then a deterministic function proves each candidate against the source text: does this exact phrase actually appear in the writing? Any candidate the LLM hallucinated gets zero matches and is automatically dropped.

Linguistics research has long documented that curated word lists break down for open-vocabulary features like evaluative language — there are simply too many ways writers express attitude for any fixed list to capture. That is precisely why we needed a different approach for these fields: the LLM has essentially unlimited vocabulary, and the deterministic proof guarantees precision. It can identify "spookily good" for one writer and "brutal reality" for another without either appearing in any predefined list. And if the LLM invents a marker that is not in the text, the validation catches it automatically. We applied the Propose-Prove Pattern to both attitude markers and shared knowledge, and it solved both problems identically.
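The prove step is deliberately simple. This sketch uses case-insensitive exact substring matching as a stand-in for whatever matching the production pipeline uses; the candidate list and source text are invented for illustration:

```python
def prove_candidates(candidates: list[str], source: str) -> list[str]:
    """Keep only proposed phrases that literally occur in the source text.
    Hallucinated candidates get zero matches and are dropped."""
    lowered = source.lower()
    return [c for c in candidates if c.lower() in lowered]

source_text = "The results were startling: genuine breakthroughs, not hype."
# Imagine these came back from the LLM; the last one is hallucinated.
proposed = ["startling", "genuine breakthroughs", "spookily good"]
print(prove_candidates(proposed, source_text))  # → ['startling', 'genuine breakthroughs']
```

The asymmetry is the whole point: the LLM supplies open-vocabulary recall, and the deterministic check supplies precision, with no curated list in sight.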

What Surprised Me

The most surprising finding was that contradictions in a voice profile are often the most distinctive feature. One writer's schema shows both medium-to-high hedging and high directives — she hedges her claims while telling readers exactly what to do. Both signals are true. The tension between them is her voice. A schema that tried to resolve this into a single "confidence score" would lose the most interesting signal.

The Full Pipeline

The extraction runs in 12-20 seconds per document. Traditional NLP handles the countable features in under 2.5 seconds; the LLM qualitative analysis takes 4-8 seconds. For multi-document analysis (the production path uses up to 3 writing samples), the system pools text, runs detection once, and uses a final LLM pass to reconcile qualitative fields across documents.

Input: 1-3 writing samples
→ Traditional NLP: sentence rhythm, pronouns, directives, questions, punctuation signature
→ Marker Detection: hedges + boosters (with context filtering)
→ LLM Qualitative Analysis: attitude, tone, shared knowledge, rhythm pattern (grounded in NLP results)
→ Propose-Prove Validation: LLM proposals verified against source text; hallucinations dropped
→ Output: Structured Voice Profile (JSON), injected into generation prompts
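The multi-document path can be sketched as follows: pool up to three samples and run the deterministic detection once over the pooled text. The reconciling LLM pass is omitted, and all names here are illustrative assumptions rather than Sembra's actual API:

```python
# Pool up to three writing samples, then run detection once over the pool.
# Names and the hedge list are illustrative, not Sembra's API.
MAX_SAMPLES = 3
HEDGES = ("might", "maybe", "possibly")

def pool_samples(samples: list[str]) -> str:
    return "\n\n".join(samples[:MAX_SAMPLES])

def hedge_count(text: str) -> int:
    lowered = text.lower()
    return sum(lowered.count(h) for h in HEDGES)

samples = [
    "This might work.",
    "Maybe it will. Possibly.",
    "No hedging in this one.",
    "A fourth sample, ignored. Maybe.",
]
pooled = pool_samples(samples)
print(hedge_count(pooled))  # → 3
```

Pooling before detection means marker frequencies are computed over one consistent corpus instead of being averaged across per-document runs.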

The architectural decision to handle everything countable with traditional NLP rather than sending it all to the LLM is not just about accuracy — it also keeps the pipeline fast and cost-efficient at scale.

Why This Changes Your Content Strategy

All of the work described above — the context filtering, the Propose-Prove Pattern, the linguistics foundation — exists to solve one problem: when Sembra turns your blog post into 15-25 social posts, every one of them needs to sound like you wrote it. Self-description dropdowns were never going to get us there. Extraction was the only viable path.

We have since fed these voice profiles into the generation pipeline — and the results are genuinely surprising. The generated posts do not just avoid sounding like AI; they pick up the writer's specific patterns. The hedging words, the punctuation habits, the sentence rhythm. It works. Brand voice is not a setting you configure — it is a signal you extract.

Join the Sembra waitlist — voice-matched content amplification is coming.

Frequently Asked Questions

How does AI learn a writer's brand voice?
Sembra's brand voice extraction analyzes a writer's actual content — not self-descriptions — to detect hedging patterns, sentence rhythm, punctuation habits, and evaluative language. An LLM classifies qualitative traits like tone while deterministic NLP handles everything countable. The result is a structured voice profile that generation models use to match the writer's style.
What is brand voice extraction?
Brand voice extraction reads a writer's existing content and produces a structured profile of how they write. Sembra's approach captures specific linguistic signals — hedging frequency, sentence length variance, punctuation preferences, attitude markers — rather than relying on vague self-descriptions like 'professional but friendly' that give AI models nothing concrete to work with.
Does brand voice matter for social media reach in 2026?
Yes — LinkedIn's 2026 algorithm actively downranks generic AI content, and human-generated content receives 5.44x more traffic than AI-generated content. Brand voice preservation is no longer a quality preference; it is a distribution requirement. Content that sounds like AI gets suppressed by the platforms themselves.
What is the Propose-Prove Pattern?
The Propose-Prove Pattern is a hybrid AI architecture developed by Sembra where the LLM reads text and proposes candidate findings, then a deterministic function proves each candidate against the source text via exact matching. Any hallucinated candidate gets dropped automatically, giving broad AI coverage with zero false positives.
Can AI accurately detect writing style from a few samples?
Sembra's pipeline can reliably detect measurable features like sentence rhythm, punctuation signature, pronoun patterns, and word-level markers from 1-3 writing samples. Qualitative traits like tone require LLM judgment but are grounded in the measured data. The profile describes what was observed, not assumed preferences.
What linguistics framework is used for brand voice analysis?
Sembra's extraction is grounded in Ken Hyland's metadiscourse framework from academic linguistics. It categorizes how writers express stance (hedges, boosters, attitude markers, self-mention) and engagement (reader address, directives, questions, shared knowledge). This gives the voice profile empirical grounding rather than subjective intuition.
What is Voice Drift in AI content?
Voice Drift is the loss of a creator's personality when AI tools repurpose their content — the output reads as competent but generic, stripped of the hedging patterns, punctuation habits, and evaluative language that make the writer sound like themselves. Sembra's brand voice extraction was built specifically to prevent Voice Drift by profiling the writer's actual linguistic patterns.

Enjoyed this post? Get content strategies delivered to your inbox.