Presenting my state-of-the-art eval framework for AGI. Not a single model has passed this rigorous test.