I'm using the Unsloth 4-bit weights to integrate inference into DwarfStar. If the experiment works well, I'll specialize the quants with an optimizer for the best setup. My target, however, is the 512GB M3 Ultra and, for distributed inference, four 128GB machines with 3-bit variants if quality holds. Inference quality must be the mantra of local inference.
Unsloth AI (@UnslothAI)
GLM-5.2 can now be run locally!🔥
The 2-bit model retains ~82% accuracy after we shrunk it from 1.51TB to 238GB (-84% size).
Run on a 256GB Mac or RAM/VRAM setups.
GLM-5.2 is the strongest open model to date.
Guide: unsloth.ai/docs/models/glm-5…
GGUF: huggingface.co/unsloth/GLM-5…
— https://nitter.net/UnslothAI/status/2067588262156501497#m