bookmoth bench — creative writing model benchmark

#	Model	Tier	Overall	Voice	Prose	Instruct	$/1M out
Benchmark results loading... First run in progress. Check back soon.

Test Categories

Voice Fidelity

35% of overall score

Measured via 13 linguistic markers (function words, type-token ratio, mean word length, contraction rate, pronoun density, punctuation, sentence variance, and more). We compute the feature-space distance between the style guide's example prose and the model output: the lower the distance, the higher the fidelity. No subjective judging is required.

Prose Quality

25% of overall score

Avoidance of AI-typical anti-patterns: filter words, passive voice, purple prose, clichéd metaphors, saidisms, telling over showing. Tested both with and without a style guide constraint.
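
As a rough illustration of the automated side of this category, here is a minimal sketch that counts filter words and saidisms per 100 words. The word lists and the function name `rate_per_100_words` are hypothetical stand-ins, not the benchmark's actual lists or code.

```python
import re

# Hypothetical word lists for illustration only; real lists would be longer.
FILTER_WORDS = {"saw", "felt", "heard", "noticed", "realized", "seemed", "watched"}
SAIDISMS = {"exclaimed", "opined", "retorted", "interjected", "chortled"}


def rate_per_100_words(text: str, vocabulary: set[str]) -> float:
    """Return how many words per 100 belong to the given vocabulary."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    hits = sum(1 for word in words if word in vocabulary)
    return 100.0 * hits / len(words)
```

A lower rate is better under this sketch. Passive voice and show-vs-tell need more than a word list, so they are left out here.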

Instruction Following

20% of overall score

Can the model hit specific targets? Word count accuracy, dialogue ratio, POV consistency, tense adherence, scene brief compliance.
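
Two of these targets lend themselves to simple deterministic checks. The sketch below assumes a linear penalty for word count deviation and treats anything inside double quotes as dialogue. Both choices, and the function names, are illustrative assumptions rather than the benchmark's published scoring.

```python
import re


def word_count_accuracy(text: str, target: int) -> float:
    """1.0 at the exact target, falling linearly to 0.0 at 100% deviation."""
    if target <= 0:
        raise ValueError("target must be a positive word count")
    actual = len(text.split())
    return max(0.0, 1.0 - abs(actual - target) / target)


def dialogue_ratio(text: str) -> float:
    """Share of words that sit inside straight or curly double quotes."""
    total = len(text.split())
    if total == 0:
        return 0.0
    quoted_spans = re.findall(r'["“]([^"”]*)["”]', text)
    return sum(len(span.split()) for span in quoted_spans) / total
```

A scene brief asking for 40% dialogue could then be scored by how far `dialogue_ratio` lands from 0.4.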

Continuity

10% of overall score

Does the model respect established facts? Character details from a story bible, setting specifics, plot continuity from prior chapters.

Text Manipulation

10% of overall score

Mechanical rewriting accuracy: tense conversion, POV shifts, character renames, contraction expansion. Deterministic, objectively scored.
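
Taken together, the five category weights (35%, 25%, 20%, 10%, 10%) sum to 100%. A minimal sketch of how they might combine into the Overall column, assuming a plain weighted average of category scores on a shared scale (the aggregation formula is not stated here, and the key names are hypothetical):

```python
# Weights taken from the category descriptions; key names are illustrative.
WEIGHTS = {
    "voice": 0.35,
    "prose": 0.25,
    "instruct": 0.20,
    "continuity": 0.10,
    "text_manipulation": 0.10,
}


def overall_score(category_scores: dict[str, float]) -> float:
    """Weighted average of the five category scores."""
    missing = WEIGHTS.keys() - category_scores.keys()
    if missing:
        raise KeyError(f"missing category scores: {sorted(missing)}")
    return sum(WEIGHTS[name] * category_scores[name] for name in WEIGHTS)
```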

The Normalization Delta

bonus insight

We run the same prompts with and without a style guide constraint and measure how much the constraint suppresses the normalization signature. Research confirms that all models pull prose toward a converged, normalized style. The question is: how much does a proper voice constraint fight back?

Methodology

Each scenario is run 5 times per model to measure both quality and stability. A model that scores 90% one run and 40% the next isn't useful in practice.
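
One way to summarize the repeated runs is to report the mean alongside a spread measure. This is a sketch only: the choice of population standard deviation and the name `summarize_runs` are assumptions, not the benchmark's stated statistic.

```python
from statistics import mean, pstdev


def summarize_runs(scores: list[float]) -> dict[str, float]:
    """Summarize repeated runs of one scenario: quality and stability."""
    if not scores:
        raise ValueError("need at least one run")
    return {
        "mean": mean(scores),      # typical quality
        "spread": pstdev(scores),  # run-to-run instability
        "worst": min(scores),      # what a user might hit on a bad run
    }
```

Under this summary, a model alternating between 90% and 40% would show a large spread even if its mean looks acceptable.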

Voice fidelity scoring

Based on the 13 linguistic markers from van Nuenen (2026): function word ratio, type-token ratio, mean word length, punctuation density, contraction rate, first-person pronoun density, third-person pronoun density, emotion word frequency, mean sentence length, sentence length variance, paragraph density, dialogue ratio, and causal connector frequency. We extract these markers from both the style guide's example prose and the model's output, then compute the feature-space distance between the two. This is fully deterministic: no LLM-as-judge, no subjectivity, reproducible by anyone.
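
The extract-then-compare step can be sketched as follows. This covers only six of the 13 markers, uses a tiny hypothetical function-word list, and assumes a Euclidean distance over relative differences, since the exact metric and normalization are not specified here.

```python
import math
import re
from statistics import mean, pvariance

# Tiny illustrative list; a real function-word list would be much longer.
FUNCTION_WORDS = {"the", "a", "an", "and", "but", "or", "of", "to",
                  "in", "on", "it", "is", "was", "that"}


def extract_markers(text: str) -> dict[str, float]:
    """Extract a subset of style markers from a passage."""
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sentences:
        raise ValueError("text must contain at least one word and one sentence")
    sentence_lengths = [len(s.split()) for s in sentences]
    n = len(words)
    return {
        "function_word_ratio": sum(w in FUNCTION_WORDS for w in words) / n,
        "type_token_ratio": len(set(words)) / n,
        "mean_word_length": mean(len(w.replace("'", "")) for w in words),
        "contraction_rate": sum("'" in w for w in words) / n,
        "mean_sentence_length": mean(sentence_lengths),
        "sentence_length_variance": pvariance(sentence_lengths),
    }


def style_distance(reference: dict[str, float], output: dict[str, float]) -> float:
    """Euclidean distance over per-marker relative differences (assumed metric)."""
    total = 0.0
    for name, ref_value in reference.items():
        scale = abs(ref_value) or 1.0  # avoid dividing by zero
        total += ((output[name] - ref_value) / scale) ** 2
    return math.sqrt(total)
```

Scaling each difference by the reference value keeps markers with large raw ranges (sentence length) from drowning out ratios. The sentence splitter is deliberately naive and will miscount abbreviations such as "e.g.".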

Normalization detection

We additionally check whether outputs follow the normalization signature: function words decreasing, contractions decreasing, first-person pronouns decreasing, vocabulary diversity increasing, word length increasing, punctuation elaborating. Models that resist more of these directions score higher.
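
A minimal sketch of the direction check, assuming each marker is compared against the reference prose and that "punctuation elaborating" can be approximated by rising punctuation density. The marker names and the equal weighting of the six directions are illustrative assumptions.

```python
# Expected drift under normalization: -1 means the marker falls, +1 means it rises.
NORMALIZATION_DIRECTIONS = {
    "function_word_ratio": -1,
    "contraction_rate": -1,
    "first_person_density": -1,
    "type_token_ratio": +1,
    "mean_word_length": +1,
    "punctuation_density": +1,  # assumption: "elaborating" read as rising density
}


def resistance_score(reference: dict[str, float], output: dict[str, float]) -> float:
    """Fraction of the six directions in which the output did NOT drift."""
    resisted = 0
    for name, direction in NORMALIZATION_DIRECTIONS.items():
        drift = output[name] - reference[name]
        if drift * direction <= 0:  # no movement, or movement against the pull
            resisted += 1
    return resisted / len(NORMALIZATION_DIRECTIONS)
```

Running this once on constrained output and once on unconstrained output, then comparing the two scores, gives one possible reading of the normalization delta.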

Other categories

Text manipulation and most of instruction following use automated deterministic scoring (word counts, regex, string matching). Continuity and POV adherence use LLM-as-judge with structured rubrics. Prose quality combines automated detection (AI-isms, passive voice, dialogue tags) with LLM evaluation for show-vs-tell.
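
For the text manipulation tasks, regex and string matching might look like the sketch below. The scoring rule (converted mentions over expected mentions) and both function names are hypothetical. The contraction pattern skips `'s` on purpose because it cannot tell a contraction from a possessive.

```python
import re

# Matches n't, 're, 've, 'll, 'd, 'm endings; 's is skipped (possessive ambiguity).
CONTRACTION = re.compile(r"\b\w+(?:n't|'(?:re|ve|ll|d|m))\b", re.IGNORECASE)


def count_name(text: str, name: str) -> int:
    """Count whole-word occurrences of a character name."""
    return len(re.findall(rf"\b{re.escape(name)}\b", text))


def rename_score(source: str, output: str, old_name: str, new_name: str) -> float:
    """Fraction of the original mentions that now carry the new name."""
    expected = count_name(source, old_name)
    if expected == 0:
        raise ValueError("old_name does not occur in source")
    converted = min(count_name(output, new_name), expected)
    return converted / expected


def remaining_contractions(text: str) -> list[str]:
    """List contractions left behind after a contraction-expansion task."""
    return CONTRACTION.findall(text)
```

An empty list from `remaining_contractions` would count as a pass for contraction expansion under this sketch. It only handles straight apostrophes.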

Voice profiles

Three synthetic profiles: literary fiction (short sentences, heavy interiority, domestic imagery), genre thriller (punchy, external, driving pace), and lyric prose (long flowing sentences, sensory density, contemplative). Each includes example cadence used as the reference for marker extraction.

Models

We test models available via OpenRouter and the Anthropic API, covering frontier (Opus, GPT-5.4, Gemini 3), strong mid-tier (GLM-5.1, DeepSeek V4, Qwen 3.5), and local-capable options (Qwen 27B, Mistral 24B). Pricing data from OpenRouter at time of testing.

Updates

The benchmark is re-run on major model releases and quarterly. All tests use identical fixtures and scoring logic across runs for comparability.

Which model resists the normalization pull?
