@boyuan_chen:

For coding agents, consistency beats peak IQ. I would rather ship with the model that is 8/10 everywhere than the one that is 10/10 on benchmarks and 3/10 after the harness changes.

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞) (@teortaxesTex)

GLM 5.2 is one of the greatest gap reductions ever, but I think it is the greatest show of benchmark solidity from an open model claiming SoTA ever. Normally, you have some variety of the bad old Qwen pattern: headline benchmarks are SoTA+, new OOD ones are ≈8 months behind, and real experience is spiky, competitive in places, but usually ≈1 year behind, and sometimes utterly falling apart. Knock on it and hear the hollow sound. Yes, even DeepSeek.
Not so here. There's no progressive decay. It's "Opus 4.5-4.7ish" throughout, in anything of value that you throw at it. It is the first truly, completely solid Chinese model. A phase change, I hope.

— https://nitter.net/teortaxesTex/status/2068252024320192620#m