How much should you actually trust a single detection score? Asking for a workflow decision

I’m building out a content-quality review workflow and trying to figure out how much weight to give detection scores as one of several signals.

Current thinking: detection scores alone are not a reliable basis for any high-stakes decision. The false positive rate on human-written content is too variable across tools, and a single score from a single tool tells you almost nothing about the actual provenance of a piece.

What I’m less sure about is how to use them usefully. A few questions I’m working through:

Is there a meaningful signal when multiple tools independently flag the same piece at high scores? Or do they share enough underlying methodology that they’d all be wrong in the same direction?
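To make concrete why that question matters so much, here’s a quick back-of-envelope sketch (the 15% rate is an assumed placeholder, not a measurement): agreement between two tools is only informative to the extent their errors are independent.

```python
# Assumed per-tool false positive rate on human-written text.
fpr = 0.15

# If the two tools err independently, the chance that BOTH falsely
# flag the same human-written piece is the product of their rates:
independent_joint_fpr = fpr * fpr  # 0.0225 -> agreement is a strong signal

# If they share methodology (similar features, similar base models),
# their errors correlate. In the fully correlated limit, the second
# tool flags exactly the same pieces as the first, so joint agreement
# is no better than a single tool:
correlated_joint_fpr = fpr  # 0.15 -> agreement adds almost nothing
```

Real tools sit somewhere between these two extremes, which is why measuring whether their false positives land on the *same* pieces (not just comparing their individual rates) is the useful test.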

Is there a score threshold below which you can reasonably treat a piece as human-written for workflow purposes, even if it’s not a certainty?

How do you handle the asymmetry between false positives and false negatives in a professional context where both errors have real costs?
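One framing I keep coming back to for that asymmetry is expected cost, with the big caveat that this treats the detector score as if it were a calibrated probability, which it usually isn’t. The costs below are illustrative placeholders, and “escalate” means a closer human read, not an accusation, so its cost is bounded and known:

```python
def should_escalate(p_ai, cost_closer_read, cost_missed_ai):
    """Escalate when the expected cost of skipping review exceeds
    the (known, bounded) cost of a closer editorial read.

    p_ai: the piece's score, naively treated as a probability (big assumption).
    """
    expected_cost_of_skipping = p_ai * cost_missed_ai
    return expected_cost_of_skipping > cost_closer_read

# The implied threshold: escalate whenever
#   p_ai > cost_closer_read / cost_missed_ai
# e.g. if a closer read costs 1 unit and a missed AI piece costs 20,
# anything scoring above 0.05 is worth escalating.
```

The asymmetry shows up directly in the ratio: the more costly a false negative is relative to a review, the lower the escalation threshold should sit — but only because escalation here is cheap and reversible.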

We’ve already pressure-tested one tool against a sample of known human-written content from our team and got a 15-20% false positive rate on certain writers whose style happens to be more structured and formal. That’s not usable for anything consequential.
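For anyone wanting to run the same pressure test, here’s a minimal sketch of what we did. `detect_score` is a hypothetical stand-in for whatever tool or API you’re testing; the per-writer breakdown is what surfaced the style effect for us:

```python
from collections import defaultdict

def false_positive_rates(samples, detect_score, threshold=0.7):
    """Compute per-writer false positive rates on known human-written text.

    samples: list of (writer, text) pairs, all known to be human-written.
    detect_score: callable mapping text -> score in [0, 1] (hypothetical).
    threshold: score at or above which the tool is treated as flagging.
    """
    per_writer = defaultdict(lambda: [0, 0])  # writer -> [flagged, total]
    for writer, text in samples:
        flagged = detect_score(text) >= threshold
        per_writer[writer][0] += int(flagged)
        per_writer[writer][1] += 1
    return {w: flagged / total for w, (flagged, total) in per_writer.items()}
```

Breaking the rate out per writer rather than reporting one aggregate number is the important design choice — an aggregate 5% can hide a 20% rate concentrated on your most formal writers.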

Genuinely curious how others are building detection into workflows where the decisions matter — not academic curiosity, actual process design.

To be clear: I don’t use detection scores as evidence of anything. I use them as a flag that prompts a closer editorial read.

The 15-20% false positive rate you’re describing is consistent with what I’ve seen. Writers with formal, structured styles — academic background, legal training, certain editorial traditions — get flagged regularly. The tool isn’t wrong, exactly; it’s just that “statistically consistent with AI patterns” is not the same as “written by AI.”

For workflow purposes, I’d treat any score under 30 as noise and anything above 70 as a reason to read more carefully, not as a finding. The range between 30 and 70 is where you do the actual editorial work.
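As a sketch, that rule is simple enough to pin down in a few lines (the 30/70 cut points on a 0-100 score are my working values, not universal constants — recalibrate them against your own known-human sample):

```python
def triage(score: float) -> str:
    """Route a piece based on its detection score (0-100 scale, assumed)."""
    if score < 30:
        return "treat-as-noise"          # proceed as human-written
    if score > 70:
        return "closer-editorial-read"   # a flag, not a finding
    return "standard-editorial-review"   # ambiguous band: do the real work
```

The point of writing it down is that every outcome is an editorial action, not a verdict — the score never becomes evidence, only routing.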