Anyways, this is how we should benchmark new models. AI's supposed to take our jobs? Great, let's give it a real one and see how close it gets!
We keep grading models on the equivalent of a standardized test, and like any standardized test, it mostly measures who studied the most for it.