Apparently it is possible to go much faster than my 13.5 t/s, and this is very good news. It means 4-bit GLM 5.2 will be usable on M3 Ultra Mac Studios. The problem is that this is hardware you cannot buy anymore.
Ivan Fioravanti ᯅ (@ivanfioravanti)
MLX GLM 5.2 Distributed on two M3 Ultra 512GB 🔥
One M3 Ultra: 18.8 tokens/sec
Two M3 Ultra: 23.4 tokens/sec
Context:
- PR by @pcuenq is still open and probably there is room for improvement: github.com/ml-explore/mlx-lm…
- basic generation test to measure decoding performance here, I will do a full context benchmarking once PR is more mature
- nvfp4 quantization used
- Video alternates standard speed and x20, with one Mac first and distributed later.
Enjoy! 🙌🏻
— https://nitter.net/ivanfioravanti/status/2068722319913066603#m
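
To put the quoted numbers in perspective, here is a minimal sketch of the arithmetic. It uses only the three throughput figures mentioned above (13.5, 18.8 and 23.4 tokens/sec). The helper names `speedup` and `scaling_efficiency` are my own, and the comparison against my 13.5 t/s is only indicative, since it assumes the setups are comparable, which the post does not confirm.

```python
def speedup(baseline_tps: float, new_tps: float) -> float:
    """Ratio of new throughput to baseline throughput."""
    return new_tps / baseline_tps


def scaling_efficiency(single_tps: float, multi_tps: float, n_machines: int) -> float:
    """Fraction of ideal linear scaling achieved with n_machines."""
    return speedup(single_tps, multi_tps) / n_machines


# Figures taken from the post above (tokens/sec).
MY_TPS = 13.5            # my own result; setup may differ (assumption)
ONE_M3_ULTRA_TPS = 18.8  # one M3 Ultra
TWO_M3_ULTRA_TPS = 23.4  # two M3 Ultras, distributed

two_vs_one = speedup(ONE_M3_ULTRA_TPS, TWO_M3_ULTRA_TPS)
efficiency = scaling_efficiency(ONE_M3_ULTRA_TPS, TWO_M3_ULTRA_TPS, 2)
one_vs_mine = speedup(MY_TPS, ONE_M3_ULTRA_TPS)

print(f"Two machines vs one: {two_vs_one:.2f}x")
print(f"Scaling efficiency on two machines: {efficiency:.0%}")
print(f"One M3 Ultra vs my result: {one_vs_mine:.2f}x")
```

So the second machine adds roughly a quarter more decoding speed, about 62% of ideal linear scaling, which fits the remark that the PR probably still has room for improvement.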