How much should you actually trust a single detection score? Asking for a workflow decision

Building out a content quality review workflow and trying to figure out how much weight to give detection scores as one of several signals.

Current thinking: detection scores alone are not a reliable basis for any high-stakes decision. The false positive rate on human-written content is too variable across tools, and a single score from a single tool tells you almost nothing about the actual provenance of a piece.

What I’m less sure about is how to use them usefully. A few questions I’m working through:

Is there a meaningful signal when multiple tools independently flag the same piece at high scores? Or do they share enough underlying methodology that they’d all be wrong in the same direction?

Is there a score threshold below which you can reasonably treat a piece as human-written for workflow purposes, even if it’s not a certainty?

How do you handle the asymmetry between false positives and false negatives in a professional context where both errors have real costs?

We’ve already pressure-tested one tool against a sample of known human-written content from our team and got a 15-20% false positive rate on certain writers whose style happens to be more structured and formal. That’s not usable for anything consequential.
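In case it's useful to anyone running the same check, this is roughly the shape of the pressure-test. A minimal sketch only: `get_detection_score` is a placeholder for whatever tool or API you're evaluating, and scores are assumed to be on a 0-100 scale.

```python
# Sketch of the pressure-test: run a detector over content we already know is
# human-written and count how often it gets flagged. `get_detection_score` is a
# placeholder for the tool being evaluated; scores assumed to be 0-100.
from typing import Callable

def false_positive_rate(
    human_samples: list[str],
    get_detection_score: Callable[[str], float],
    flag_threshold: float = 70.0,
) -> float:
    """Share of known-human pieces the tool flags at or above the threshold."""
    if not human_samples:
        return 0.0
    flagged = sum(1 for text in human_samples if get_detection_score(text) >= flag_threshold)
    return flagged / len(human_samples)
```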

Genuinely curious how others are building detection into workflows where the decisions matter — not academic curiosity, actual process design.

To be clear: I don’t use detection scores as evidence of anything. I use them as a flag that prompts a closer editorial read.

The 15-20% false positive rate you’re describing is consistent with what I’ve seen. Writers with formal, structured styles — academic background, legal training, certain editorial traditions — get flagged regularly. The tool isn’t wrong, exactly; it’s just that “statistically consistent with AI patterns” is not the same as “written by AI.”

For workflow purposes, I’d treat any score under 30 as noise and anything above 70 as a reason to read more carefully, not as a finding. The range between 30 and 70 is where you do the actual editorial work.
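For what it’s worth, the triage logic is simple enough to write down. This is a sketch only: the 30/70 cut points are the conventions I happen to use, not validated thresholds, and the score is assumed to be on a 0-100 scale.

```python
# Sketch of the advisory triage bands described above. Cut points are workflow
# conventions, not findings; the score never decides anything on its own.
def triage(score: float) -> str:
    if score < 30:
        return "treat as noise"       # below 30: no action on the score alone
    if score > 70:
        return "read more carefully"  # above 70: prompts a closer editorial read
    return "editorial judgment"       # 30-70: where the actual editorial work happens
```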

The multiple-tools question is interesting. My understanding is that most tools share similar underlying approaches — they’re all trained on variations of the same signal types. So correlation between tools doesn’t necessarily mean independent confirmation. It might just mean they share the same blind spots.

What would actually be more useful than running the same piece through multiple tools is running the same tool on multiple pieces from the same writer over time. A consistent pattern across a body of work is more meaningful than a single high score on a single piece.
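A rough sketch of what I mean, assuming you keep scores keyed by writer over time. The field names here are made up for illustration.

```python
# Sketch of looking at a writer's body of work rather than a single piece.
# Assumes you have stored (writer, score) pairs over time; names are illustrative.
from collections import defaultdict
from statistics import mean

def per_writer_summary(scored_pieces: list[tuple[str, float]], high: float = 70.0) -> dict:
    """Summarise detection scores per writer: piece count, mean, share of high scores."""
    by_writer: dict[str, list[float]] = defaultdict(list)
    for writer, score in scored_pieces:
        by_writer[writer].append(score)
    return {
        writer: {
            "pieces": len(scores),
            "mean_score": round(mean(scores), 1),
            "share_above_high": round(sum(s >= high for s in scores) / len(scores), 2),
        }
        for writer, scores in by_writer.items()
    }
```

A single high score on one piece tells you very little; a writer whose whole history sits above the band is at least a different conversation.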

In my experience, the threshold question doesn’t have a clean answer. It depends heavily on the tool and the content type.

Tools trained primarily on student essays perform differently on professional marketing copy. Tools trained on web content may flag academic-style writing. There is no universal threshold that works across content types, which is itself a problem if you’re trying to build a consistent workflow.

In many cases, the most useful thing detection scores can do is surface pieces for human review, not replace human review. If the cost of a false positive is high in your context, the workflow has to account for that explicitly.
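If it helps, “account for that explicitly” can be as simple as making the review threshold a per-content-type setting rather than a global one. A sketch under assumptions: the content types and numbers below are placeholders, not recommendations.

```python
# Sketch: surface pieces for human review using content-type-specific thresholds.
# Types and numbers are placeholders; tune them against your own labeled samples.
REVIEW_THRESHOLDS = {
    "marketing_copy": 75,
    "academic": 85,       # formal, structured writing over-flags, so demand more signal
    "blog_post": 70,
}

def needs_human_review(score: float, content_type: str, default: float = 75) -> bool:
    """Route a piece to the human review queue; the score only surfaces, never decides."""
    return score >= REVIEW_THRESHOLDS.get(content_type, default)
```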

the asymmetry question is the real one and nobody talks about it enough.

false positive: you flag human-written work as AI. cost = damaged relationship with the writer, slowed workflow, potential morale issue.
false negative: AI-generated content passes. cost = whatever downstream risk you’re trying to prevent.

which cost is higher in your context? that should determine where you set your threshold and how much weight you give any given score. there’s no universal answer.
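back-of-envelope version of that tradeoff, if it helps. all the numbers are made up; the point is the comparison, not the values.

```python
# back-of-envelope sketch of the asymmetry: expected cost per piece at a given
# threshold, given your tool's error rates and what each error costs you.
# every input here is an estimate you supply; none of these numbers are real.
def expected_cost_per_piece(
    fp_rate: float,   # share of human-written pieces the tool flags at this threshold
    fn_rate: float,   # share of AI-generated pieces the tool misses at this threshold
    p_ai: float,      # your prior: share of incoming pieces that are actually AI-generated
    cost_fp: float,   # cost of wrongly flagging a human writer (morale, slowdown, trust)
    cost_fn: float,   # cost of AI content passing through (whatever downstream risk)
) -> float:
    return (1 - p_ai) * fp_rate * cost_fp + p_ai * fn_rate * cost_fn

# lowering the threshold raises fp_rate and lowers fn_rate; compare the expected
# cost at each candidate threshold and pick the cheaper regime for your context.
```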

for us the false positive cost is higher so we treat scores as advisory only and require a second human read before any action.

honestly the 15-20% false positive rate on formal writers is higher than i’d expect but not surprising. structured writing with clear topic sentences, transitions, and summaries looks statistically like AI output because AI was trained on structured writing.

the signal you actually want is probably more qualitative. does this piece contain specific observations or details that a model wouldn’t produce without access to primary sources? is there genuine opinion rather than balanced “on the other hand” hedging? does it reference things that happened recently enough that a model couldn’t know them?

those are human editorial checks, not algorithmic ones. the score just tells you where to look.