looking for empirical data rather than anecdotes
i’ve seen a lot of informal testing shared in forums, but nothing that systematically compares humanizer tools across multiple detectors using a consistent methodology. does anything like that exist publicly? an academic paper, a blog comparison, anything with actual numbers rather than “worked great for me”?