We analysed 370 AI-generated essays.
Here's what gave them away.

We built a corpus of 370 essays written by ChatGPT, Claude, and Gemini across 12 subject areas — then ran every one through our detection engine. The patterns weren't subtle. They were universal.

What we tested — and why

Most AI detection tools publish accuracy claims without showing their working. We wanted to do it differently.

Over six weeks, we generated 370 academic essays using three AI models — ChatGPT-4o, Claude, and Gemini 2.5 Flash — across 12 subject areas: psychology, sociology, business, nursing, law, education, criminology, history, English literature, computer science, media studies, and social work. Each essay was written in response to real UK university-style assessment briefs covering different levels from first year through to master's.

Every essay was then analysed by SafeGrade's Deep Scan engine, which measures over a dozen textual features, including vocabulary distribution, sentence rhythm, structural patterns, hedging language, and phrase-level originality. We also included 5 confirmed human-written essays as controls.

The result: a dataset large enough to draw real conclusions about how AI writes — and how it differs from human students.

The numbers — at a glance

370
AI essays tested across three models
98.6%
Correctly identified as AI-generated
12
Subject areas covered in the corpus

Of the 370 AI-generated essays, SafeGrade's Deep Scan correctly flagged 365 as AI-written — a 98.6% detection rate. The 5 that slipped through shared one trait: they were unusually short (under 400 words), which reduced the amount of detectable signal. At standard essay length (1,000+ words), the detection rate was effectively 100%.

For AI-specific indicators — the signals that distinguish machine prose from human writing — the accuracy was even higher: 99.7%.

7 patterns that appeared in almost every AI essay

1. The "moreover–furthermore–additionally" ladder

AI models love formal transition words. Not occasionally — obsessively. In our corpus, 94% of AI essays used at least three of these connectives: "moreover," "furthermore," "additionally," "consequently," "nevertheless." Human students overwhelmingly don't write like this. They use "but," "also," and "so" — or nothing at all. When your essay reads like a diplomatic cable, that's a signal.
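
To make the signal concrete, here's a minimal sketch of how a connective check like this could work — it is an illustration, not SafeGrade's actual engine, and the five-word list and the "three or more" threshold come straight from the figures above:

```python
import re

# Formal connectives named in the study (illustrative list).
FORMAL_CONNECTIVES = {"moreover", "furthermore", "additionally",
                      "consequently", "nevertheless"}

def distinct_connectives(text: str) -> int:
    """Count how many DISTINCT formal connectives appear in the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return len(FORMAL_CONNECTIVES & words)

sample = ("Moreover, the data support this. Furthermore, costs fell. "
          "Additionally, adoption rose.")
print(distinct_connectives(sample))  # 3 — matches the pattern seen in 94% of AI essays
```

A human essay leaning on "but", "also", and "so" scores zero here, which is exactly the contrast the study describes.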

2. Perfectly even paragraph length

This one surprised even us. Across all three models, the standard deviation of paragraph length was dramatically lower than in human writing. A typical AI essay has paragraphs of 120–150 words each, like clockwork. Human essays are messy — a 200-word paragraph followed by a 60-word one followed by a 180-word one. That inconsistency is actually what makes them look real.
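
The evenness signal is easy to measure yourself. A rough sketch (splitting paragraphs on blank lines, which is a simplifying assumption — real essays need proper parsing):

```python
import statistics

def paragraph_length_stdev(essay: str) -> float:
    """Standard deviation of paragraph word counts; low = suspiciously even."""
    lengths = [len(p.split()) for p in essay.split("\n\n") if p.strip()]
    return statistics.pstdev(lengths) if len(lengths) > 1 else 0.0

ai_like = "\n\n".join(["word " * 130, "word " * 128, "word " * 132])
human_like = "\n\n".join(["word " * 200, "word " * 60, "word " * 180])
print(paragraph_length_stdev(ai_like) < paragraph_length_stdev(human_like))  # True
```

The clockwork 120–150-word paragraphs described above produce a small standard deviation; the messy human 200/60/180 mix produces a large one.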

3. Hedging without committing

AI essays are full of phrases like "it could be argued that," "this may suggest," and "one might consider." There's nothing wrong with academic hedging — good scholarship uses it. But AI hedges everything. It never commits to a position. In our data, AI essays averaged 8.3 hedging phrases per 1,000 words. Human essays averaged 3.1.
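
The per-1,000-word rate is a simple density metric. Here's an illustrative version using only the three phrases quoted above (the study's real list is longer):

```python
# Hedging phrases quoted in the article (illustrative subset).
HEDGE_PHRASES = ["it could be argued that", "this may suggest",
                 "one might consider"]

def hedges_per_1000(text: str) -> float:
    """Hedging phrases per 1,000 words — the unit used in the study."""
    lower = text.lower()
    n_words = len(lower.split())
    hits = sum(lower.count(p) for p in HEDGE_PHRASES)
    return 1000 * hits / n_words if n_words else 0.0

text = "it could be argued that x. this may suggest y."
print(hedges_per_1000(text))  # 200.0 — 2 hedges in a 10-word toy text
```

Against real essay lengths, scores near the 8.3 mark point towards the AI pattern; scores near 3.1 towards the human baseline.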

4. The "delve into" family

Certain phrases appear in AI writing at rates that are vanishingly rare in human text. "Delve into," "it is worth noting that," "in the realm of," "this underscores the importance of," and "a nuanced understanding" were all present in over 60% of our AI essays. We've built a library of 40+ such phrases — if your essay contains three or more, that's a red flag that any decent detector will catch.
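
A phrase-library check is essentially a substring scan. This sketch uses only the five phrases named above, not the full 40+ library:

```python
# Flagged phrases quoted in the article (the real library has 40+).
FLAGGED = ["delve into", "it is worth noting that", "in the realm of",
           "this underscores the importance of", "a nuanced understanding"]

def flagged_count(text: str) -> int:
    """How many distinct flagged phrases appear in the text."""
    lower = text.lower()
    return sum(1 for p in FLAGGED if p in lower)

essay = ("We delve into policy. It is worth noting that outcomes vary, "
         "and a nuanced understanding is needed.")
print(flagged_count(essay))  # 3 — at or above the red-flag threshold
```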

Want to check your own essay for these patterns?
SafeGrade's Deep Scan analyses the same signals from this study. Free to try — no sign-up needed for your first scan.
Scan my essay →

5. Missing personal voice

The most difficult signal to quantify, but the most obvious to a human reader. AI essays present arguments but never inhabit them. There's no "I found this surprising because…" or "Having read Smith's critique, I'm unconvinced because…" — AI writes about positions without ever taking one. Lecturers notice this instantly, even if they can't articulate why something feels off.

6. Reference clustering

When AI generates references, it tends to cluster them in predictable ways. Multiple citations from the same year range, overly broad source descriptions, and a strange habit of citing theorists without engaging with their actual arguments. In 78% of our corpus, the citations section contained at least one reference that didn't correspond to a real publication — sometimes subtle (wrong year, wrong journal), sometimes entirely fabricated.

7. Uniform sentence rhythm

This is what detection researchers call low "burstiness." Human writing has natural variation — short declarative sentences mixed with long, complex ones. AI maintains a remarkably consistent sentence length throughout. When we measured sentence length variance across the corpus, AI essays scored 40–60% lower than the human controls on every measure of rhythmic variation.
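
One rough way to approximate burstiness is the variance of sentence lengths — a naive sketch with a crude sentence splitter (real detectors use proper tokenisers):

```python
import re
import statistics

def sentence_length_variance(text: str) -> float:
    """Variance of sentence word counts; low variance = low burstiness."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    lengths = [len(s.split()) for s in sentences]
    return statistics.pvariance(lengths) if len(lengths) > 1 else 0.0

uniform = "One two three four five. One two three four five. One two three four five."
bursty = "Short. This sentence is quite a bit longer than the first one. Tiny."
print(sentence_length_variance(uniform) < sentence_length_variance(bursty))  # True
```

The uniform text scores near zero; the human-style mix of short and long sentences scores far higher — the 40–60% gap reported above is this difference at corpus scale.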

How each AI model differs

ChatGPT-4o was the easiest to detect. It had the heaviest reliance on transition words, the most uniform paragraph lengths, and the most frequent use of flagged phrases. It also tended to produce longer essays with more structural repetition — introduction, three body paragraphs, conclusion, in almost every case regardless of the question format.

Claude was more varied in its sentence structure and used fewer clichéd transitions, but gave itself away through its tendency to present balanced "on the other hand" arguments without resolving them. It was also the most likely to produce plausible-sounding but ultimately hollow analysis — sentences that feel intelligent on first reading but say nothing specific on closer inspection.

Gemini 2.5 Flash sat between the two. Its vocabulary was slightly more varied than ChatGPT's, but it had a distinctive habit of front-loading complexity — opening paragraphs were dense and impressive, while later sections became increasingly formulaic and repetitive. Detection accuracy was essentially identical across all three models.

What human essays do differently

The 5 human control essays in our corpus shared none of the patterns above. But what they did share was equally revealing:

Inconsistency. Human essays have rough edges. A brilliant paragraph followed by a weaker one. A reference mentioned in the introduction but not fully developed until page three. These inconsistencies aren't flaws — they're fingerprints. They show a mind working through an argument in real time rather than generating a complete structure instantaneously.

Specificity. Human students refer to specific lectures, specific readings from their module, specific case studies they found compelling. AI references are generic — it talks about "Smith (2019)" without ever showing that it actually read Smith. Lecturers know what's on their reading list, and they know when a citation feels like wallpaper rather than engagement.

Imperfection in flow. Human essays don't transition smoothly between every section. Sometimes there's an abrupt shift. Sometimes a sentence starts with "But" or "And." Sometimes the conclusion introduces a new thought rather than neatly summarising everything. This is normal writing. AI doesn't do it.

What this means for you

If you've used AI to help with your essay — whether for research, outlining, or drafting sections — the patterns above are what your university's detector is looking for. And so is your lecturer, even without any technology. These signals are visible to the human eye.

That doesn't mean you're helpless. You can revise AI-assisted sections to introduce your own voice, vary your sentence structure, engage properly with your sources, and remove the telltale phrases. SafeGrade's Essay Coach can help you do exactly this — it analyses your specific text and shows you which passages carry AI signals and how to rewrite them in your own voice.

The point isn't to hide AI use. It's to make sure your essay sounds like you — because that's what gets the marks.

See what your essay looks like
through a detector's eyes.
SafeGrade analyses the same patterns from this study — vocabulary, rhythm, structure, phrase patterns, and more. Your first Deep Scan is free.
Scan my essay free →