Twitter/X

@ivanfioravanti

I see Nvidia sending DGX Spark to many on X so that they can test and publish results.

It seems I'll have to buy my own to test and share my own results 😎
But that memory bandwidth is really stopping me from buying one 😖

Anyone out there with a DGX Spark testing text-to-image or video models and willing to share results? That could be what pushes me to buy one.

Otherwise I think I'll save (a lot of) money for a GB300.

Ahmad (@TheAhmadOsman)

Local AI hardware = capacity × bandwidth × software stack

  • Capacity tells you what fits
  • Bandwidth tells you how hard the box can breathe
  • The software stack tells you how much of the spec sheet you can actually cash out.

Hardware by Memory Bandwidth
- RTX PRO 6000 Blackwell: 96GB @ 1792 GB/s
- RTX 5090: 32GB @ 1792 GB/s
- RTX 4090: 24GB @ 1008 GB/s
- RX 7900 XTX: 24GB @ 960 GB/s
- Radeon PRO W7900: 48GB @ 864 GB/s
- Mac Studio M3 Ultra: up to 512GB @ 819 GB/s
- AMD Radeon AI PRO R9700: 32GB @ 640 GB/s
- Intel Arc Pro B65: 32GB @ ~608 GB/s
- Tenstorrent Wormhole n300: 24GB @ 576 GB/s
- Tenstorrent Blackhole p150: 32GB @ 512 GB/s + 800G
- MacBook Pro M5 Max: 460-614 GB/s
- Intel Arc Pro B60: 24GB @ ~456 GB/s
- MacBook Pro M5 Pro: 307 GB/s
- DGX Spark: 128GB @ 273 GB/s (coherent + CUDA)
- Mac mini M4 Pro: 273 GB/s
- Ryzen AI Max / Strix Halo: ~256 GB/s (~96GB usable GPU)
- MacBook Air M5: 153 GB/s
- Snapdragon X2 Elite: 152-228 GB/s
- Intel Lunar Lake: 136 GB/s
- Snapdragon X Elite: 135 GB/s
- Mac mini M4: 120 GB/s
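
One rough way to read the list: during single-stream decode, each generated token has to stream roughly all of the model's weights from memory, so bandwidth divided by weight size gives a ceiling on tokens/sec. A minimal sketch, assuming a hypothetical 40 GB quantized dense model and ignoring KV-cache reads, compute limits, and framework overhead. The function name and the model size are illustrative, not from the thread:

```python
# Bandwidth figures (GB/s) are copied from the list above.
BANDWIDTH_GBPS = {
    "RTX PRO 6000 Blackwell": 1792,
    "Mac Studio M3 Ultra": 819,
    "DGX Spark": 273,
    "Ryzen AI Max / Strix Halo": 256,
}


def decode_ceiling_tps(bandwidth_gbps: float, weights_gb: float) -> float:
    """Upper bound on single-stream decode tokens/sec for a dense model.

    Assumes every token reads all weights exactly once; real throughput
    is lower once KV-cache traffic and software overhead are included.
    """
    if weights_gb <= 0:
        raise ValueError("weights_gb must be positive")
    return bandwidth_gbps / weights_gb


if __name__ == "__main__":
    # Assumption: ~40 GB of weights, roughly a 70B model at ~4.5 bits/weight.
    WEIGHTS_GB = 40.0
    for name, bandwidth in BANDWIDTH_GBPS.items():
        ceiling = decode_ceiling_tps(bandwidth, WEIGHTS_GB)
        print(f"{name}: <= {ceiling:.1f} tok/s")
```

The ceiling only ranks boxes that can hold the model in the first place, which is why capacity comes before bandwidth in the framing above.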

Verdict

  • GPUs are still the bandwidth kings

  • Apple wins: when you need stupid amounts of memory and don’t want to shard across GPUs

  • Apple loses: when raw tokens/sec and concurrency matter more

  • DGX Spark: coherent memory + NVIDIA stack

  • Strix Halo / Ryzen AI Max: first real x86 unified-memory contender

  • Tenstorrent: fully OSS stack, excited to see this mature

Fitting ≠ serving

Even if it fits, you still pay for
- bandwidth during decode
- KV cache growth
- dequantization
- batching + concurrency
- scheduler quality
- framework overhead
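
To put a number on the KV cache item: the cache stores one key and one value vector per layer, per KV head, per token, so it grows linearly with context length and with the number of concurrent sequences. A minimal sketch of that arithmetic, assuming an fp16 cache and a hypothetical model shape (80 layers, 8 KV heads, head dimension 128) chosen only for illustration:

```python
def kv_cache_bytes(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    seq_len: int,
    batch: int = 1,
    bytes_per_elem: int = 2,  # assumption: fp16/bf16 cache
) -> int:
    """Bytes of KV cache for `batch` sequences of `seq_len` tokens each."""
    # Factor of 2: one tensor for keys, one for values.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * seq_len * batch


if __name__ == "__main__":
    GIB = 2**30
    # Hypothetical model shape; not taken from the thread.
    one_stream = kv_cache_bytes(80, 8, 128, seq_len=32_768)
    eight_streams = kv_cache_bytes(80, 8, 128, seq_len=32_768, batch=8)
    print(f"1 x 32k context: {one_stream / GIB:.1f} GiB")     # 10.0 GiB
    print(f"8 x 32k context: {eight_streams / GIB:.1f} GiB")  # 80.0 GiB
```

Under these assumptions a model that "fits" with a few GB to spare stops fitting as soon as context length or concurrency goes up, which is the gap between fitting and serving.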

The only mental model that matters:

  1. What must fit?
  2. What bandwidth tier do I need?
  3. What software stack can actually deliver it?
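
The three questions can be sketched as three filters applied in order. This is only an illustration: capacity and bandwidth figures come from the list earlier in the thread, while the `Box` type, the `shortlist` helper, and the stack labels ("cuda", "metal", "rocm") are assumptions added here:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    name: str
    capacity_gb: float
    bandwidth_gbps: float
    stack: str  # illustrative label for the primary software stack


BOXES = [
    Box("RTX 5090", 32, 1792, "cuda"),
    Box("RTX PRO 6000 Blackwell", 96, 1792, "cuda"),
    Box("Mac Studio M3 Ultra", 512, 819, "metal"),
    Box("DGX Spark", 128, 273, "cuda"),
    Box("Ryzen AI Max / Strix Halo", 96, 256, "rocm"),
]


def shortlist(boxes, must_fit_gb, min_bandwidth_gbps, usable_stacks):
    """Apply the three questions in order and rank survivors by bandwidth."""
    fits = [b for b in boxes if b.capacity_gb >= must_fit_gb]            # 1. what must fit?
    fast = [b for b in fits if b.bandwidth_gbps >= min_bandwidth_gbps]   # 2. bandwidth tier?
    usable = [b for b in fast if b.stack in usable_stacks]               # 3. stack that delivers?
    return sorted(usable, key=lambda b: b.bandwidth_gbps, reverse=True)


if __name__ == "__main__":
    # Hypothetical requirement: 60 GB working set, at least 500 GB/s.
    for box in shortlist(BOXES, 60, 500, {"cuda", "metal"}):
        print(box.name)
```

With these example inputs the RTX 5090 drops out on capacity and the DGX Spark and Strix Halo drop out on bandwidth, so each excluded box names the bottleneck it would have bought.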

In short:
- NVIDIA → fastest raw speed
- Apple Studio M3 Ultra → biggest one-box memory
- Strix Halo → first real x86 unified-memory contender
- DGX Spark → coherent NVIDIA dev appliance
- AMD / Intel Arc → rising alternatives
- Tenstorrent → fully open-source stack

Do ask: “which bottleneck am I buying?”

Not: “which hardware is best?”

— https://nitter.net/TheAhmadOsman/status/2062312164455862286#m