Has anyone published a comparison of humanizer tool accuracy across different detector types?

Looking for empirical data rather than anecdotes.

I’ve seen a lot of informal testing shared in forums, but nothing that systematically compares humanizer tool performance across multiple detectors with a consistent methodology. Does anything like that exist publicly? An academic paper, a blog comparison, anything with actual numbers rather than “worked great for me”?