We Tested 5 AI Detectors with 20 Different Texts: Full Results

The Test Setup

We prepared 20 texts: 10 entirely AI-generated (5 from ChatGPT-4, 5 from Claude 3 Opus) and 10 entirely human-written across different styles. We submitted each to five detection tools: Turnitin (via institutional access), GPTZero, Copyleaks, ZeroGPT, and Winston AI. No text was modified before submission.

Overall Accuracy Results

Turnitin led with 88% overall accuracy (95% on English, lower on other languages). Winston AI surprised us with 84% accuracy despite being less well-known, followed by GPTZero at 82% and Copyleaks at 80%. ZeroGPT performed worst at 71%.
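For readers who want to reproduce this kind of scoring, here is a minimal sketch of how an overall accuracy figure can be computed from a detector's verdicts. The function name `accuracy` and the sample verdicts are hypothetical illustrations, not our measured data, and the sketch assumes each tool's output has already been reduced to a simple AI or human verdict.

```python
from typing import List


def accuracy(truth: List[bool], predicted: List[bool]) -> float:
    """Fraction of texts whose verdict matches the true label (True = AI-generated)."""
    if len(truth) != len(predicted) or not truth:
        raise ValueError("label lists must be non-empty and the same length")
    correct = sum(t == p for t, p in zip(truth, predicted))
    return correct / len(truth)


# Hypothetical data, NOT results from this test:
# 10 AI-generated texts followed by 10 human-written texts.
truth = [True] * 10 + [False] * 10
# An imaginary detector that misses 2 AI texts and wrongly flags 1 human text.
predicted = [True] * 8 + [False] * 2 + [True] * 1 + [False] * 9

score = accuracy(truth, predicted)
print(f"accuracy: {score:.0%}")
```

With a 20-text sample, each misclassified text shifts the score by 5 percentage points, so small differences between tools should be read with caution.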

False Positive Analysis

The most concerning finding: GPTZero incorrectly flagged 23% of human-written technical articles as AI-generated. In other words, nearly 1 in 4 genuinely human-written technical texts was labeled as AI output, so a student could face a false accusation if their professor uses GPTZero and follows its verdict blindly.
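The false positive rate is a separate measure from overall accuracy: it only looks at the human-written texts and asks how many were wrongly flagged. A minimal sketch follows; the function name `false_positive_rate` and the sample verdicts are hypothetical and are not the figures reported above.

```python
from typing import List


def false_positive_rate(truth: List[bool], predicted: List[bool]) -> float:
    """Share of human-written texts (truth == False) that were flagged as AI."""
    human_verdicts = [p for t, p in zip(truth, predicted) if not t]
    if not human_verdicts:
        raise ValueError("need at least one human-written text")
    return sum(human_verdicts) / len(human_verdicts)


# Hypothetical data, NOT results from this test: 10 human-written texts only.
truth = [False] * 10
# An imaginary detector that wrongly flags 2 of them as AI-generated.
predicted = [True] * 2 + [False] * 8

fpr = false_positive_rate(truth, predicted)
print(f"false positive rate: {fpr:.0%}")
```

A detector can score well on overall accuracy and still have a high false positive rate, which is why the two numbers are worth reporting separately.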

After Humanization with Temiz Metin

We then processed all 10 AI-generated texts through Temiz Metin and retested them. Results: 4 of 5 detectors classified the humanized texts as human-written. Only Turnitin flagged one text, with a low-risk score of 31%, which is below most institutional action thresholds.

Conclusion

No detector is perfectly accurate. The most reliable approach combines multiple tools for detection with quality humanization for prevention. Temiz Metin's results across all five detectors were consistently strong.
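One simple way to combine multiple tools is a majority vote, sketched below. This is an assumption on our part rather than a method used in this test: the function name `majority_flags_ai` and the verdicts shown are hypothetical, and other combination rules (for example, requiring unanimous agreement before acting) may suit stricter settings.

```python
from typing import Dict


def majority_flags_ai(verdicts: Dict[str, bool]) -> bool:
    """Return True only if more than half of the detectors flag the text as AI."""
    if not verdicts:
        raise ValueError("need at least one detector verdict")
    flagged = sum(verdicts.values())
    return flagged > len(verdicts) / 2


# Hypothetical verdicts for a single text, NOT results from this test.
verdicts = {
    "Turnitin": True,
    "GPTZero": True,
    "Copyleaks": False,
    "ZeroGPT": False,
    "Winston AI": False,
}

combined = majority_flags_ai(verdicts)
print("flagged as AI" if combined else "treated as human-written")
```

Here only two of five tools flag the text, so the combined verdict treats it as human-written, which reduces the impact of any single tool's false positive.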
