Research shows all LLMs converge toward the same polished, voiceless register when they write. We measure which models resist that pull when constrained by a real style guide.
The problem, proven: van Nuenen (2026) showed that LLM rewriting produces consistent stylistic normalization. Function words, contractions, and first-person pronouns decrease, while vocabulary diversity and word length increase. "Voice-preserving" prompts reduce this effect but don't eliminate it. We measure which models, when constrained by bookmoth's style guide, deviate least from the source voice across 13 linguistic markers.
Weighted: Voice Fidelity 35%, Prose Quality 25%, Instruction Following 20%, Continuity 10%, Text Manipulation 10%
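As a rough illustration of how the weighting above combines into one number, here is a minimal Python sketch. The weights come from the line above; the function name `overall_score`, the dictionary keys, and the 0-100 score scale are assumptions made for the example, not bookmoth's actual scoring code.

```python
# Category weights as stated above (fractions of the overall score).
WEIGHTS = {
    "voice_fidelity": 0.35,
    "prose_quality": 0.25,
    "instruction_following": 0.20,
    "continuity": 0.10,
    "text_manipulation": 0.10,
}


def overall_score(category_scores: dict[str, float]) -> float:
    """Combine per-category scores (assumed to be on a 0-100 scale)
    into a single weighted overall score."""
    missing = WEIGHTS.keys() - category_scores.keys()
    if missing:
        raise ValueError(f"missing category scores: {sorted(missing)}")
    return sum(WEIGHTS[name] * category_scores[name] for name in WEIGHTS)


# Hypothetical per-category scores for one model.
example = {
    "voice_fidelity": 80.0,
    "prose_quality": 70.0,
    "instruction_following": 90.0,
    "continuity": 60.0,
    "text_manipulation": 100.0,
}
print(round(overall_score(example), 1))
```

Because Voice Fidelity carries 35% of the weight, a ten-point change there moves the overall score by 3.5 points, while the same change in Continuity moves it by only one.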
Results table (#, Model, Overall): benchmark results are not yet available. The first run is in progress.
Voice Fidelity: 35% of overall score
Measured via 13 linguistic markers (function words, type-token ratio, mean word length, contraction rate, pronoun density, punctuation, sentence variance, and more). We compute feature-space distance between the style guide's example prose and the model output. Lower distance = higher fidelity. No subjective judging required.
Prose Quality: 25% of overall score
Avoidance of AI-typical anti-patterns: filter words, passive voice, purple prose, cliche metaphors, saidisms, telling over showing. Tested both with and without style guide constraint.
Instruction Following: 20% of overall score
Can the model hit specific targets? Word count accuracy, dialogue ratio, POV consistency, tense adherence, scene brief compliance.
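Two of these targets lend themselves to simple automated checks. The sketch below is illustrative only: the linear penalty in `word_count_accuracy` and the "words inside straight double quotes" definition of `dialogue_ratio` are assumptions chosen for the example, not the benchmark's published formulas.

```python
import re


def word_count_accuracy(text: str, target: int) -> float:
    """Return 1.0 for an exact hit, falling linearly to 0.0 as the
    word count drifts from the target (assumed penalty shape)."""
    if target <= 0:
        raise ValueError("target must be positive")
    actual = len(text.split())
    return max(0.0, 1.0 - abs(actual - target) / target)


def dialogue_ratio(text: str) -> float:
    """Fraction of words that sit inside straight double quotes.
    Curly quotes and single-quoted dialogue are not handled here."""
    words = text.split()
    if not words:
        return 0.0
    quoted_spans = re.findall(r'"([^"]*)"', text)
    quoted_words = sum(len(span.split()) for span in quoted_spans)
    return quoted_words / len(words)


print(word_count_accuracy("one two three", target=4))  # 0.75
print(dialogue_ratio('"Go now," she said.'))            # 0.5
```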
Continuity: 10% of overall score
Does the model respect established facts? Character details from a story bible, setting specifics, plot continuity from prior chapters.
Text Manipulation: 10% of overall score
Mechanical rewriting accuracy: tense conversion, POV shifts, character renames, contraction expansion. Deterministic, objectively scored.
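To show what deterministic scoring can look like for one of these tasks, here is a small Python sketch of a character-rename check. The function names and the scoring rule (fraction of original occurrences that now carry the new name) are hypothetical; the benchmark's own checks may differ.

```python
import re


def count_name(text: str, name: str) -> int:
    """Count whole-word occurrences of a name (possessives like
    "Mara's" count, since the apostrophe ends the word)."""
    return len(re.findall(rf"\b{re.escape(name)}\b", text))


def rename_score(source: str, output: str, old: str, new: str) -> float:
    """Fraction of the source's occurrences of `old` that appear as
    `new` in the output. Assumes `new` does not occur in the source."""
    expected = count_name(source, old)
    if expected == 0:
        raise ValueError("old name does not occur in the source text")
    renamed = min(count_name(output, new), expected)
    return renamed / expected


source = "Mara opened the door. Mara's coat was wet."
partial = "Elena opened the door. Mara's coat was wet."
full = "Elena opened the door. Elena's coat was wet."
print(rename_score(source, partial, "Mara", "Elena"))  # 0.5
print(rename_score(source, full, "Mara", "Elena"))     # 1.0
```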
Bonus insight
We run the same prompts with and without style guide constraint and measure how much the normalization signature is suppressed. Research confirms all models pull toward convergence. The question is: how much does a proper voice constraint fight back?
Each scenario is run 5 times per model to measure both quality and stability. A model that scores 90% one run and 40% the next isn't useful in practice.
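A minimal sketch of how five runs might be summarized, using Python's standard `statistics` module. The function name `summarize_runs` and the choice of sample standard deviation as the stability measure are assumptions for illustration.

```python
from statistics import mean, stdev


def summarize_runs(scores: list[float]) -> dict[str, float]:
    """Summarize repeated runs of one scenario for one model.
    A high mean with a low stdev indicates quality and stability."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to measure spread")
    return {
        "mean": mean(scores),
        "stdev": stdev(scores),  # sample standard deviation
        "min": min(scores),
        "max": max(scores),
    }


# Hypothetical scores: one steady model, one erratic model.
steady = summarize_runs([80.0, 80.0, 80.0, 80.0, 80.0])
erratic = summarize_runs([90.0, 40.0, 85.0, 88.0, 45.0])
print(steady)
print(erratic)
```

Reporting the spread alongside the mean keeps an erratic model from looking acceptable on its average alone.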
Voice fidelity scoring is based on the 13 linguistic markers from van Nuenen (2026): function word ratio, type-token ratio, mean word length, punctuation density, contraction rate, first-person and third-person pronoun density, emotion word frequency, mean sentence length, sentence length variance, paragraph density, dialogue ratio, and causal connector frequency. We extract these markers from both the style guide's example prose and the model's output, then compute feature-space distance. This is fully deterministic: no LLM-as-judge, no subjectivity, reproducible by anyone.
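The extract-then-compare step can be sketched in Python as follows. This is a simplified illustration under stated assumptions: it computes only five of the 13 markers, uses a tiny hypothetical function-word list, and measures distance as a Euclidean norm over symmetric relative differences. The text above does not specify the distance metric, so that choice is a stand-in.

```python
import math
import re

# Tiny illustrative function-word list (an assumption; a real list is longer).
FUNCTION_WORDS = {
    "the", "a", "an", "and", "but", "or", "of", "to", "in", "on", "at",
    "it", "is", "was", "that", "this", "with", "for", "as",
    "i", "you", "he", "she", "we", "they",
}


def tokenize(text: str) -> list[str]:
    """Lowercase words, keeping straight-apostrophe forms like "didn't"."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def extract_markers(text: str) -> dict[str, float]:
    """Compute a small subset of the linguistic markers."""
    words = tokenize(text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sentences:
        raise ValueError("text must contain at least one word and sentence")
    n = len(words)
    return {
        "function_word_ratio": sum(w in FUNCTION_WORDS for w in words) / n,
        "type_token_ratio": len(set(words)) / n,
        "mean_word_length": sum(len(w) for w in words) / n,
        # Crude proxy: any apostrophe form counts, including possessives.
        "contraction_rate": sum("'" in w for w in words) / n,
        "mean_sentence_length": n / len(sentences),
    }


def voice_distance(reference: str, output: str) -> float:
    """Euclidean distance over symmetric relative differences, so that
    markers on different scales contribute comparably (assumed metric)."""
    ref, out = extract_markers(reference), extract_markers(output)
    total = 0.0
    for name in ref:
        scale = (abs(ref[name]) + abs(out[name])) / 2
        if scale > 0:
            total += ((out[name] - ref[name]) / scale) ** 2
    return math.sqrt(total)


reference = "I didn't go. It's late and I'm tired."
formal = "The individual elected to remain at home owing to considerable exhaustion."
print(voice_distance(reference, reference))  # 0.0
print(voice_distance(reference, formal))
```

Identical texts give a distance of zero, and the formal rewrite lands farther away because it drops the contractions and lengthens the words and sentences.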
We additionally check whether outputs follow the normalization signature: function words decreasing, contractions decreasing, first-person pronouns decreasing, vocabulary diversity increasing, word length increasing, punctuation elaborating. Models that resist more of these directions score higher.
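One way to sketch this directional check in Python is shown below. The marker names, the mapping of "punctuation elaborating" to an increase in punctuation density, and the scoring rule (share of directions resisted) are all assumptions made for the example.

```python
# Expected direction of drift under normalization: -1 = decreases, +1 = increases.
NORMALIZATION_DIRECTIONS = {
    "function_word_ratio": -1,
    "contraction_rate": -1,
    "first_person_pronoun_density": -1,
    "type_token_ratio": +1,
    "mean_word_length": +1,
    "punctuation_density": +1,  # assumed proxy for "punctuation elaborating"
}


def normalized_directions(
    source: dict[str, float], output: dict[str, float], tolerance: float = 0.0
) -> list[str]:
    """List the markers that moved in the normalization direction
    by more than `tolerance`."""
    moved = []
    for name, direction in NORMALIZATION_DIRECTIONS.items():
        delta = output[name] - source[name]
        if direction * delta > tolerance:
            moved.append(name)
    return moved


def resistance_score(source: dict[str, float], output: dict[str, float]) -> float:
    """Share of normalization directions the output did NOT follow."""
    moved = normalized_directions(source, output)
    return 1 - len(moved) / len(NORMALIZATION_DIRECTIONS)


# Hypothetical marker values for a source text and a model rewrite.
source_markers = {
    "function_word_ratio": 0.50, "contraction_rate": 0.05,
    "first_person_pronoun_density": 0.04, "type_token_ratio": 0.60,
    "mean_word_length": 4.0, "punctuation_density": 0.12,
}
output_markers = {
    "function_word_ratio": 0.45, "contraction_rate": 0.05,
    "first_person_pronoun_density": 0.04, "type_token_ratio": 0.66,
    "mean_word_length": 4.4, "punctuation_density": 0.12,
}
print(normalized_directions(source_markers, output_markers))
print(resistance_score(source_markers, output_markers))  # 0.5
```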
Text manipulation and most instruction-following checks use automated deterministic scoring (word counts, regex, string matching). Continuity and POV adherence use an LLM-as-judge with structured rubrics. Prose quality combines automated detection (AI-isms, passive voice, dialogue tags) with LLM evaluation for show-vs-tell.
We use three synthetic style profiles: literary fiction (short sentences, heavy interiority, domestic imagery), genre thriller (punchy, external, driving pace), and lyric prose (long flowing sentences, sensory density, contemplative). Each includes example prose whose cadence serves as the reference for marker extraction.
We test models available via OpenRouter and the Anthropic API, covering frontier (Opus, GPT-5.4, Gemini 3), strong mid-tier (GLM-5.1, DeepSeek V4, Qwen 3.5), and local-capable options (Qwen 27B, Mistral 24B). Pricing data is taken from OpenRouter at the time of testing.
The benchmark is re-run quarterly and on major model releases. All tests use identical fixtures and scoring logic across runs for comparability.
Every model normalizes your voice. The question is how much.
bookmoth's pipeline is designed to fight the pull.