They're adversarial benchmarks - they intentionally target the weak points of existing LLMs. Not "AGI-complete" by any means, but not useless either.