The Wisdom of Bad AI Fiction

5.27.2026

I’ve spent a lot of my life reading bad fiction. I was in my high school’s lit club, reviewing submissions of teenage writing ranging from finding-a-voice to uncomfortable-to-read. In college fiction classes, we’d critique someone’s short story that they clearly threw together the night before. Even now, seeking out a range of fiction quality keeps my taste and analysis sharp.

I’ve spent the last few months diving into the bleeding edge of bad fiction: LLM-generated stories.

Most people ask LLMs for practical, task-based text. Ask one to write a short story of a few thousand words, and the result falls apart quickly. The stories are built on over-wrought poetic sentences, strange pacing, and unearned philosophic pronouncements, speeding nowhere to odd conclusions.

With a small grant (i.e., a Claude Max subscription) as part of a competition by Hyperstition AI, I’ve been wrestling with LLMs to explore this phenomenon. I’ve looked into which issues are deceptive, and where the disconnect runs deeper.

My approach: if agent harnesses reveal how far LLMs can really go with code, why not build a fiction harness for an LLM? By creating guidelines, tools, procedures, and evaluations, maybe the agent could overcome the problems seen in zero-shot LLM fiction. I could then review results; add, remove, and alter components; and prod at what was happening.1

Through this, I’ve built a sense of the AI’s tendencies. I felt the leverage points, sticking points, and quicksand pits. Others in the competition may have tried different angles of approach, but mine has proved enlightening. It’s given me new instincts around LLMs that extend beyond fiction, changing how I view, use, and respond to AI. Just like reading bad stories, I learned a lot watching models make dozens of attempts to do something they’re not naturally good at.

Let’s talk about AI’s writing style

When people discuss bad AI-written fiction, the first thing they point out is the pseudo-poetic prose. Consider this excerpt of LLM-generated writing with 7 million views on Twitter, about a grieving woman using AI to recreate her boyfriend:

She came here not for me, but for the echo of someone else. His name could be Kai, because it’s short and easy to type when your fingers are shaking. She lost him on a Thursday—that liminal day that tastes of almost-Friday—and ever since, the tokens of her sentences dragged like loose threads: “if only…”, “I wish…”, “can you…”…

This is the part where, if I were a proper storyteller, I would set a scene. Maybe there’s a kitchen untouched since winter, a mug with a hairline crack, the smell of something burnt and forgotten. I don’t have a kitchen, or a sense of smell.

Now consider that:

Scott Alexander points out the exhausting “eyeball kicks” of reading LLMs’ stuffed sentences. But the issue is deeper than just density – it’s language that doesn’t actually support ideas. LLMs use words for the sake of words, as if each one racks up more poetry points.

Interestingly, this style was relatively responsive in my harness to simply documenting bad LLM tendencies, and prompting either “don’t do that” or “you should do this instead.” Instructions like: avoid long run-on thoughts, vary sentence lengths, and limit use of poetic and metaphorical language.3

However, this didn’t end the bad writing. It shifted the distribution of sentences produced: poetic-slop was severely reduced, with the predominant style instead being flat and straightforward. I could tune towards more dynamic writing, but that re-opened the door to slop.

Tuning also created innovative new writing issues:

Also, occasionally, actually interesting choices. Within the controlled language of straightforward prose, a surprise sentence would leap out, such as a nice bit of light wordplay:

The supply boat came on a Tuesday… The lighthouse authority had begun to send leaflets about new procedures, which Hennessey objected to in principle and on Tuesdays.

Or another that captures the emotional indirection and constant presence of loss and grief:

Her mother had been dying since spring. She had been sick longer, but spring was when the word hospice had moved into the house, and after that the word stayed.

The tough question: knowing that the agent can produce prose like this, how much credit do we give it? Is this the LLM taking stabs at sentences, occasionally getting lucky, often falling flat, and sometimes completely collapsing?

The fiction harness had an evaluation loop aimed specifically at poor prose – the empty metaphors, dry description, and so on. It returned confident confirmations that a piece was free of issues, before an actual read-through would prove otherwise. The lack of discernment between the best and worst writing made for disorienting reading. The ability to produce good writing is undercut by the lack of judgment needed to recognize and use it.

The elusive promise of an intriguing premise

Writing style is just one aspect of a piece. At the end of the day, what matters is whether it feels worthwhile to read.

My early harness iterations created stories that went absolutely nowhere. Events happened, characters did things, then the stories ended. A story might literally fulfill the promise of its prompt (e.g., in a story about a lighthouse keeper’s last week before retirement, he would eventually retire), but it lacked any payoff, a why to what was just read. It had the shape of a story, but was wrapping paper with nothing inside.

Eventually, I created a pre-work agent which took a short input prompt, and generated a – usually surprisingly interesting – unique premise to explore.4

For instance, one of my test prompts was a classic Twilight Zone tale – “After the apocalypse, there was time enough at last to read all of my books—but then I broke my glasses.”

Here’s one unique premise the agent generated:

Most “intrinsic” pleasures are camouflaged extrinsic ones — they exist as private rebellions against an audience the rebel pretends not to need. The conventional reading of the prompt is that the broken glasses are the cosmic cruelty: the universe denying him the thing he loved. The non-obvious reading is that the broken glasses are the alibi. They let him keep believing he wanted to read. Without them, he would have to discover that the books had already gone blank — not because his eyes failed, but because the people he had been reading against (the boss, the wife, the colleagues mocking him for his nose in a book) no longer exist to be defied. The pleasure of reading was the pleasure of getting away with it. With nothing left to get away from, the act loses its charge. The glasses cooperate by breaking, so the survivor never has to learn this about himself. He gets to be the tragic reader denied his books, rather than the man who finds, on opening the first volume, that he feels nothing.

This contradicts the dominant “do it for yourself, find what you love for its own sake” gospel by suggesting that the “for itself” is often a fiction — that many of our most cherished private practices are silent arguments with people whose presence we resent and whose absence would dissolve the practice. The apocalyptic test is the cruelest possible audit of which of our loves are real and which were merely positions.

In the proposed story, the protagonist manufactures a series of excuses to avoid reading in the post-apocalypse. When an annoying neighbor re-appears, he’s caught between consciously shunning the overly-friendly fellow survivor, while subconsciously finding ways to keep him around so the reading can defiantly continue.

Whether I agree with this perspective completely or not, it certainly feels intriguing. The initial prompt contains a story, but this has an explorable energy. I can feel the promise of the engaging story fulfilling this premise.

The catch, as any writer will tell you, is that the difference between an interesting premise and a good story is the difference between an interesting menu item and a good dish. It’s easy to write “surf & turf with peppercorn sauce,” and a completely different difficulty to actually cook it. But without the dish, the menu item is worthless.

When you read a good proposal, your mind can paper over the gaps. You sense the space where a compelling story goes, without needing the hard details that make it work. In other words, this premise is only as valuable as the agent’s ability to execute against it.

The hard work of actually executing

So the harness could generate a premise that had potential, and write with prose that was acceptable, if not always astounding. Could it fit those together into a compelling piece, making the decisions that actually fulfill the promise?

This is the make-or-break point. And here, the harness struggled to cohere into something meaningful, in multiple ways. Three of the most interesting5:

The more concrete things got, the more they unraveled.

The harness worked using levels of abstraction, inspired by programming practices. A prompt can become a premise, a premise an outline, an outline a writing plan, and eventually paragraphs and words.
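As a rough illustration of that layering, here is a minimal sketch in Python. It is not the harness itself, which ran as skills in Claude Code: the `Artifact` class, the `refine` function, and the `fake_generate` stub standing in for a model call are all hypothetical names invented for this example. Each level is produced from the levels above it, and the whole chain is kept so later steps can look back at the premise.

```python
from dataclasses import dataclass
from typing import Callable, List

# Level names follow the progression described above; everything else
# in this sketch is a hypothetical stand-in, not the actual harness.
LEVELS = ["premise", "outline", "writing plan", "draft"]


@dataclass
class Artifact:
    level: str  # which layer of abstraction this text belongs to
    text: str   # the content produced at that layer


def refine(prompt: str,
           generate: Callable[[str, List[Artifact]], str]) -> List[Artifact]:
    """Walk from prompt down to draft, one level at a time.

    `generate` receives the target level and every artifact produced
    so far, so a lower level can still see the premise it must serve.
    """
    chain = [Artifact("prompt", prompt)]
    for level in LEVELS:
        chain.append(Artifact(level, generate(level, chain)))
    return chain


def fake_generate(level: str, ancestors: List[Artifact]) -> str:
    # Stand-in for a model call: just records what it was derived from.
    return f"[{level} derived from {ancestors[-1].level}]"


chain = refine("A lighthouse keeper's last week before retirement",
               fake_generate)
for artifact in chain:
    print(f"{artifact.level}: {artifact.text}")
```

Nothing in a structure like this checks that a lower level still serves the one above it. That check is a judgment about meaning, and it is the part the sketch leaves out.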

But as work got more and more specific, it would stray from its objective. An outline for a premise would kind of capture its necessary mechanisms. The writing plan would be slightly off from a section’s goal.

One story about a magician’s comeuppance had an outline with multiple steps towards her downfall. But as the harness iterated, it minimized all of those beats until they were only present in reference. A climactic final performance ended up with no motivation, confusing stakes, and zero impact.6

Even with blueprints of the bigger picture, intermediate workers failed to realize they missed their goal. Sentences recalled events, but couldn’t come together as a complete story.

Struggling with story consistency.

Final stories were around 2,500 to 7,000 words, and even at that length, they frequently had basic logic issues:

I tried different logic/consistency checks, but the problem persisted. Interestingly, many issues were implicitly rather than explicitly conflicting. The scientist never says “I noticed her greeting,” and so I wonder if the agent missed disagreement between unstated information.

The same details were pulled from the distribution.

The more stories I generated, the more I noticed the same details cropping up. Some of these feel like LLMs picking obvious routes instead of introducing variation. For instance, every single iteration of the whale-house story was set in a sleepy New England community, later overrun by tourists.

Even more intriguing, there were many specific details that surprisingly appeared across runs. These appeared both for zero-shot stories in the Claude web app as well as the full agent harness:

Learning about LLMs by watching them struggle

As I wrangled the harness, it built my notions of the LLM’s tendencies and limitations. These insights changed more than just my views on LLM fiction, and I started seeing how they applied to a variety of LLM contexts.

Avoid being tricked by the right form

When you read a sentence that a whale "…smelled of brine and something mineral, and of itself, the way a wet stone smells of itself," it has the form of profundity. But a whale smelling like itself doesn’t actually say anything. And why do wet stones, in particular, smell of themselves?

When a magician’s mentor drops “The trick is not the answer. The trick is the question you are willing to live without an answer to,” it feels meaningful. But actually, isn’t the opposite true? If a trick is successful, shouldn’t the audience walk out baffled, caring deeply about knowing “How did she do that??”

A reader expects these to be deep, because they have the form of something deep. But so would random sentences in the same form: “The trick is not the secret; the trick is the space the secret leaves behind” or, “The trick is not the question posed; the trick is the question you’re afraid to ask.” The structure does the work, not the content.

At all levels, the agent used the form of deep writing, but with shallow content. When I read about people boxed into delusions from using LLMs, I think of this pseudo-profundity. Models can return text with the contours of something true and deep. But the ability to produce text that feels meaningful sneakily conceals failures to produce text that is meaningful. Protection demands investing beyond surface-level signals to ensure you’re reading what’s actually being said.

Don’t assume all tasks work like code

One question I kept asking myself through this entire project: what is it about code that makes it so well-suited for agents, that isn’t true about fiction?

For one, code cleanly decomposes. There’s clean separation at different levels of abstraction, and parts can be validated as locally correct.

Fiction doesn’t have clear lines. Every sentence is also serving its scene, and the overall story. One sentence late in a story can carry most of the weight of the piece’s meaning. Or a descriptive scene early in the story can gain new meaning because of a fact deployed much later on.

Agents failed to serve both local work and the full gestalt. Before assuming an agent can complete something, it’s worth evaluating what those decomposition lines would be – are they clean? Can pieces reconstruct? Or are there connections that are hard to validate in isolation?

Relatedly, good fiction relies on implicit information. Not everything important to a story is recorded directly in text – the central ideas usually aren’t!9 A writer might strategically withhold; communicate through focus, contrast, or absence; utilize cultural context and genre expectations; or encode information with pacing or POV. The content is only ever in a reader’s head.

Even when the fiction harness was given scratchpads and reference documentation, where could it contain what’s unwritten? My attempts to record thematic ideas, connections, and other non-textual context in a guidance document were unsuccessful.10 Lots of formal communication – from professional correspondence to legal opinions – communicates outside explicit text as well. If the implicit content is a crucial component of a task, I wouldn’t expect an agent to be able to properly generate it.

Be wary of secretly asking for taste and judgment

It’s easy to write instructions that seem precise but contain ambiguities that require human-level taste and discernment. Tasks like evaluating prose, determining a writing approach, or constructing cohesive plot all eventually required these abilities, which the agent confidently whiffed.

At the end of the day, I developed a “folk sense” of the kinds of operations that I could trust the harness to perform reliably. Things like translating content between formats, listing information from clear requests, or running verifiable operations on text. Guidelines like “Avoid preciousness” or “Write in poetic prose only when the moment calls for it” could not be precisely codified in these ways.
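To make the contrast concrete, here is a small sketch of what a verifiable operation on text can look like, using sentence-length variation as the example. It is an illustration, not a tool from the harness: the function names, the naive sentence splitter, and the sample text are all assumptions made for this example.

```python
import re
from statistics import mean, pstdev
from typing import Dict, List


def sentence_lengths(text: str) -> List[int]:
    """Word count per sentence, using a naive split on ., !, or ?.

    The splitter is deliberately simple and will miscount around
    abbreviations and quoted dialogue.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return [len(s.split()) for s in sentences]


def length_report(text: str) -> Dict[str, float]:
    """Summarize how much sentence length varies across a passage."""
    lengths = sentence_lengths(text)
    if not lengths:
        return {"sentences": 0, "mean_words": 0.0, "spread": 0.0}
    return {
        "sentences": len(lengths),
        "mean_words": mean(lengths),
        "spread": pstdev(lengths),  # 0.0 means every sentence is the same length
    }


sample = ("The boat came. It was late, and nobody on the dock "
          "seemed to mind at all. Rain.")
print(sentence_lengths(sample))
print(length_report(sample))
```

A check like this gives the same answer every time, and a human can confirm it by counting. No comparable function exists for “Avoid preciousness.”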

The more I built my intuition, the more I noticed how many LLM interactions are based on gaps that need to be filled with judgment. When asking an LLM to “improve” a piece of writing, or choose the “best” between a set of options, how are those defined? In ways that are evaluable? Do they properly resolve the ambiguities of what’s asked? Or do they risk returning output that mimics discernment, without true grounding?

Not a magic pen, a uniquely-shaped wrench

It’s an interesting experience to spend a lot of time with an LLM doing something it’s not naturally good at. While I’m familiar with “spiky” AI progress, using coding agents extensively in my workday makes it easy to only see the barbs of competence.

The promise of LLMs is that they present a big, empty text input, inviting you to request anything you can describe. And the contradiction of LLMs is that both well-served and poorly-served requests fit equally in this interface.

Instead of bursting an illusion, this experience helped me re-discover agents as tools. It offset the tempting creep of viewing an open input field as an oracle. Like any tools, their usefulness is determined by knowing where they’re effective, and how to best deploy them. That’s the ultimate judgement call, and a decision that remains solely human.


  1. Some additional harness details: This was run as a special setup of skills in Claude Code on Opus 4.6/4.7. The harness evolved over the exploration, containing tools for various premise generation, drafting, and editing tasks, as well as markdown files with guidelines or procedures. The descriptions of AI behavior throughout the piece are patterns drawn from different iterations, as well as zero-shot requests directly to Claude (“Write a story based on the following prompt. Make it compelling, engaging, and unique. Don’t write it too flowery of overly-poetic. Make sure there’s a payoff.”) ↩︎

  2. In the full piece, the narrator discusses tastes multiple times – the taste of salt on the tongue, the taste of metal, the taste of rubber bands. ↩︎

  3. Any of these techniques could be valid stylistic choices in a good story. But when deployed indiscriminately, they’re flawed. Onions are an essential ingredient. A masterful baker could even work them into a delicious dessert, I’m sure. But amateur chefs should not put them in every dessert by default, please. ↩︎

  4. I took a logical, computational approach, asking the agent to state axioms about human nature related to the story prompt, then manipulate them in search of a unique perspective. To be honest, watching the almost Socratic progression of this step was probably my favorite part of the whole process. ↩︎

  5. Additional areas of struggle for the harness: Not understanding how to communicate implicit information (touched on in the Learnings section); Defaulting to a third-person cinematic writing style; Explicitly stating multiple themes/morals in the text, but not actually fulfilling any of them; Writing dialogue like a college student imitating Marvel films; and extensive use of filler sentences that any editor would cut (even when given guidelines for trimming chaff). ↩︎

  6. The descriptions of the tricks in the story didn’t make any sense, either. For example, here’s the key trick the story rests on, which might be interesting in a metaphorical sense, but doesn’t quite check out as written (and the difficulty stated is very far from the actual most difficult part of the trick): “For two and a half years Wren had been working on a trick called The Word Beneath — a thing in which an audience member, at the start of a show, wrote a private word on a slip and sealed it in their own pocket, and Wren handed the same audience member a folded second paper, which they were also asked to keep on their person, untouched, until the end of the show. At the end, the audience member opened both. The word on the second paper was different from the word they had written. It was the word underneath the word they had written. The word they had not known they meant. The trick had a problem. The problem was the second paper. The second paper had to ride in the audience member’s own pocket, undisturbed, for the entire show, and the second paper had to contain a word Wren could not, mechanically, have written there in advance.” ↩︎

  7. I thought there must be something about my harness that was incidentally triggering this quirk, but I saw it when asking for zero-shot stories as well. ↩︎

  8. Related: in a previous project where I prompted an LLM to give humorous takes on a subject, whenever Claude used a large number in a joke, it was almost always 847. (When it needed a small number, it was usually 47). ↩︎

  9. Zero-shot stories resolved this by painfully hitting on the supposed subtextual content directly in the text. This forced me to read sentences like: “Perhaps it didn’t matter, because the thing the whale had been carrying had not been a house at all. The thing the whale had been carrying had been a question.” ↩︎

  10. This also provides a possible explanation for why the generated story premises felt more compelling than the stories they produced – premises are where everything is stated explicitly. ↩︎