Some employers ban the use of large language models (LLMs) in job applications, like cover letters and essays. The appeal is understandable: it seems to ensure applicants are demonstrating their actual skills, and it promises to cut down the volume of applications to review manually.
But these bans create a fundamental integrity problem. When you implement rules that you can't meaningfully enforce, you're penalizing honest people and rewarding rule-breaking. Many of your applicants know this, so you're also undermining your hiring process.
But what about tools for detecting 'AI'? Many claim impressive accuracy rates. Some advertise 99% accuracy on both synthetic and human-generated text. And they're not being dishonest, exactly. Those metrics are true in certain contexts. But there's no reason to believe you'll achieve anything close to that with your actual applicant pool, and more importantly, you'll never be able to verify your real accuracy rate to determine if it's good enough.
This is because, when you use LLM detection for application materials, you face three simultaneous problems:
No Real Verification Possible: No ability to validate correctness for your specific use case
Outdated Human Writing Samples: No reliable source of contemporary human writing at all
Mismatched Usage Patterns: No certainty that synthetic text examples in training match how applicants actually use these tools
These aren't bad choices companies are making when building detectors; they're inherent limitations.
Because of this, I don't recommend using them. But if you're determined, maybe because you're overwhelmed with applications and need to get through them somehow, here's a different approach:
Eliminate Unenforceable LLM Bans: Skip the ban altogether
Test Against Your Data: Validate performance on your own data to determine if your tools filter applications in ways that align with your goals
Practice Internal Honesty: Take those results and be transparent internally about what these tools do
Here's why.
The Three Validation Problems
LLM detection, like any classification problem, works by learning patterns from training data and then applying those patterns to new examples.
The features that differentiate synthetic from human-written text in training are things like word choice, sentence structure, and repetition patterns.
These distinctions are real for the data they were trained on, but applying them to your applicant pool introduces several problems because the patterns in your data may not match the patterns in the training data—and you'll never know for sure.
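To make that concrete, here's a toy sketch of how a classifier bakes its training distribution into a decision rule. The single "average sentence length" feature and the midpoint threshold are hypothetical simplifications for illustration; real detectors use far richer signals, but the structural point is the same: whatever the training data looked like is frozen into the rule, and new text gets judged by that same bar.

```python
# Toy detector: learn one threshold from labeled training examples,
# then apply it to unseen text. Purely illustrative.
from dataclasses import dataclass

@dataclass
class Example:
    text: str
    is_synthetic: bool

def avg_sentence_length(text: str) -> float:
    # Crude sentence split on terminal punctuation.
    cleaned = text.replace("!", ".").replace("?", ".")
    sentences = [s for s in cleaned.split(".") if s.strip()]
    return sum(len(s.split()) for s in sentences) / len(sentences)

def train_threshold(examples: list[Example]) -> float:
    # "Training": take the midpoint between the mean feature value of
    # synthetic examples and that of human examples.
    syn = [avg_sentence_length(e.text) for e in examples if e.is_synthetic]
    hum = [avg_sentence_length(e.text) for e in examples if not e.is_synthetic]
    return (sum(syn) / len(syn) + sum(hum) / len(hum)) / 2

def predict_synthetic(threshold: float, text: str) -> bool:
    # The training distribution is baked into this comparison; text from
    # a different population is still judged by the same frozen rule.
    return avg_sentence_length(text) > threshold
```

If your applicants' writing drifts away from the distribution this rule was fit on, the rule doesn't know that, and neither do you.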
No Real Verification Possible
When you reject an application based on an LLM detector, no one ever comes back to say, "you got it wrong. I did use Claude." You have no ground truth for your specific context, meaning you can't know if the tool is working as intended for your use case.
And there are problems on both sides of the equation, the human side and the LLM-generated side, that may cause the patterns in training data to differ from the patterns in your actual data.
Outdated Human Writing Samples
First, the problem with human data. Detection tools must compare synthetic text against datasets of human writing that predate widespread LLM usage. This isn't an oversight—it’s the only way to ensure training data that no LLM was involved in writing. We must assume all recent text is potentially "contaminated" by LLMs, leaving no way to update models with current human-written text at any significant scale.

This creates a growing problem: as definitely-human text ages, it becomes harder to tell whether a new piece of writing differs from those older samples because it was generated by an LLM or because the writer has been influenced by all the LLM-generated text they've been reading. There could also be some entirely different reason for the resemblance: LLM output keeps evolving, including through updates based on new patterns in human-generated training data, while the human-written reference data doesn't update at all.
Mismatched Usage Patterns
On the synthetic side, there are also issues: creating training examples involves guesswork about how people use these LLMs. You can generate endless samples with various models and prompting strategies, but there's limited visibility into how people use them.
Even if you had access to user data from companies like OpenAI or Anthropic, you still wouldn't know how your applicants modify outputs before submission, much less what any part of this process looks like for your specific applicant pool.
How We Test Robustness vs. How People Actually Use LLMs
We can see this issue when we look at how researchers evaluate the robustness of LLM detection tools to attempts to evade them, or "adversarial attacks."
For instance, in a 2024 study published in the International Journal of Speech Technology, researchers systematically evaluated six different detectors by testing them against various attacks. They looked at algorithmic techniques like character substitution (replacing letters with similar-looking characters from other alphabets), paraphrasing the LLM output, or adjusting generation parameters like temperature settings to make the output less predictable.
The researchers were basically asking: "How robust are these detectors if someone deliberately tries to fool them?" They found varying levels of success. Some tools were more vulnerable to certain evasion tactics than others.
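As a rough illustration of one such attack (my own sketch, not drawn from the study), character substitution can be as simple as a lookup table of visually similar characters. The Latin-to-Cyrillic mapping below is a hypothetical minimal example; real evasion tools are more elaborate, but the idea is the same: the text looks identical to a human reader while the underlying bytes, and therefore the detector's features, change.

```python
# Homoglyph substitution: swap Latin letters for look-alike Cyrillic ones.
# To a reader the output renders the same; to a detector's tokenizer it
# may be entirely different text.
HOMOGLYPHS = {
    "a": "\u0430",  # Cyrillic а
    "c": "\u0441",  # Cyrillic с
    "e": "\u0435",  # Cyrillic е
    "o": "\u043e",  # Cyrillic о
    "p": "\u0440",  # Cyrillic р
}

def substitute(text: str) -> str:
    return "".join(HOMOGLYPHS.get(ch, ch) for ch in text)

evaded = substitute("a nice cover letter")
# Renders like the original, but is no longer byte-identical to it.
```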
This research is valuable for understanding vulnerabilities in detection systems in certain settings. But these scenarios may not match how people use LLMs for your situation.
That's because people use LLMs in two ways: through services that generate application materials at scale, or by generating the material themselves and then (likely) tweaking it.
Automated services that generate cover letters for lots of people might indeed use character substitutions or other at-scale tricks to evade detection. But those services are already trying to stay ahead of common detection methods, with varying degrees of success. It's a cat-and-mouse game, and the detection product you're using might keep up to some extent, but how much?
And real individual applicants typically don't use these kinds of technical approaches at all. Instead, they might:
Request specific voices or styles ("write this in my voice" or "make this sound more professional")
Edit portions of the output to personalize it
Expand on their own drafts with LLM assistance, rather than generating content from scratch
Integrate specific suggestions into existing writing they've already created
Unless you know what those usage patterns look like—including what happens between the initial LLM output and the final submitted document—how can you test for them to know what your accuracy rates are?
But maybe you’re not trying to catch that type of usage—the minor, "we used it a little" kind.
Great. But if you're not trying to catch it, why are you banning it?
"But We Need Some Filtering Method"
One answer: companies need some way to handle application volume. No screening method is perfect, and for most screening methods we don't have numbers on how well they're performing.
Fair enough. But if that's your position, consider this approach instead:
Eliminate Unenforceable LLM Bans
Use classification tools if needed, but don't pair them with bans. This avoids creating a system where conscientious applicants face disadvantages while those willing to lie gain an edge—and where, unless you're Anthropic, you risk looking like you don't understand how these tools work or how people use them.
Test Against Your Data
Periodically test these tools against content from existing employees who weren't screened this way, or with applications you've also reviewed manually.
For instance, use your current successful employees' cover letters and resumes as a baseline. Would this tool have rejected them?
If you do this, you're no longer necessarily looking to detect LLM usage: you're determining if the tool filters applications in ways that align with your hiring criteria. For instance, maybe these tools are great at filtering out generic slop, regardless of where it came from.
Practice Internal Honesty
Don't tell your internal stakeholders "this is 99% effective at finding AI-generated content," no matter what the vendor reports. Share your internal validation metrics and be transparent about the problems you’re trying to solve.
Solve a Better Problem
As a small test, I had an LLM substantially rewrite an earlier draft of this post. When I submitted it to what literature suggests is among the best available detectors, it returned "no AI content found." That was direct LLM output with zero human editing.
Maybe that was an anomaly. But I don't know, and if you're using one of these detectors, you don't either.
Some domains are forced to play this detection cat-and-mouse game. Fraud prevention and cybersecurity teams have to detect malicious activity even without perfect ground truth data. They can’t opt out.
But when you're evaluating job applicants? You have other options. Yes, they require more work to figure out what you’re looking for and how to identify it, including how the skills needed for your roles may have changed now that these technologies are available. But this approach lets you avoid building a system that depends on detection methods with limited visibility—methods that make less and less sense in a world where using LLMs is increasingly just part of how people write.
Don’t be the LLM police. Solve a better problem.
There is very little literature on human rewrites of LLM-generated content. Lee's (2023) conference paper testing AI detection tools against human modifications of a single AI-generated essay found decreased detection accuracy across most tested tools—but it was on ONE ESSAY, so I don't think this constitutes strong evidence either way.