In addition to the M5 iPad Pro, which I reviewed earlier today, I also received an M5 MacBook Pro review unit from Apple last week. I really wanted to write a companion piece to my iPad Pro story about MLX and the M5’s Neural Accelerators; sadly, I couldn’t get the latest MLX branch to work on the MacBook Pro either.
However, Max Weinbach at Creative Strategies did, and shared some impressive results with the M5 and its GPU’s Neural Accelerators:
These dedicated neural accelerators in each core lead to that 4x speedup of compute! In compute heavy parts of LLMs, like the pre-fill stage (the processing that happens during the time to first token) this should lead to massive speed-ups in performance! The decode, generating each token, should be accelerated by the memory bandwidth improvements of the SoC.
Now, I would have loved to show this off! Unfortunately, full support for the Neural Accelerators isn’t in MLX yet. There is preliminary support, though! There will be an update later this year with full support, but that doesn’t mean we can’t test now! Unfortunately, I don’t have an M4 Mac on me (traveling at the moment) but what I was able to do was compare M5 performance before and after tensor core optimization! We’re seeing between a 3x and 4x speedup in prefill performance!
Looking at Max’s benchmarks with Qwen3 8B and a ~20,000-token prompt, the prefill stage indeed shows a 3.65x speedup in tokens/sec – jumping from 158.2 tok/s to a remarkable 578.7 tok/s. This is why I’m very excited about the future of MLX for local inference on M5, and why I’m also looking forward to M5 Pro/M5 Max chips in future Mac models.
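If you want to run a similar prefill measurement yourself once the optimized MLX builds ship, here is a minimal sketch using the mlx-lm Python package. It isn’t Max’s exact setup: the model identifier and the prompt below are placeholders I’ve chosen for illustration, and the timing approach (generating a single token so elapsed time is dominated by prefill) is an assumption about how to isolate that stage.

```python
# Minimal sketch: estimate prefill throughput with mlx-lm on Apple silicon.
# Assumptions: mlx-lm is installed (`pip install mlx-lm`) and the model ID
# below is a placeholder for whichever Qwen3 8B conversion you use.
import time

from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Qwen3-8B-4bit")  # hypothetical model ID

# A very long prompt (Max used ~20,000 tokens) stresses the compute-bound
# prefill stage, which is what the GPU's Neural Accelerators speed up.
long_prompt = "..."  # paste or load your long document here

prompt_tokens = len(tokenizer.encode(long_prompt))

start = time.perf_counter()
# max_tokens=1 keeps decode negligible, so elapsed time is roughly prefill time.
generate(model, tokenizer, prompt=long_prompt, max_tokens=1)
elapsed = time.perf_counter() - start

print(f"prefill: {prompt_tokens / elapsed:.1f} tok/s over {prompt_tokens} tokens")
```

Passing `verbose=True` to `generate` should also print prompt and generation tokens-per-second directly, which is the simpler way to compare prefill and decode throughput before and after the Neural Accelerator support lands in MLX.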