gekro
GitHub LinkedIn

The Mac Mini M4: The Un-official Local LLM King

8 min read
Intermediate March 1, 2026

TL;DR

Why unified memory architecture is the only way to run 70B parameter models without a data-center budget.
Stop chasing 24GB NVIDIA cards. For local LLM inference, Video RAM (VRAM) is the only metric that matters, and Apple's Unified Memory Architecture (UMA) is the most cost-effective way to get 64GB+ of it. This post explains why a Mac Mini M4 Pro is the silent, high-density heart of my lab's reasoning layer.

If you’re building a lab in a DFW suburb, you quickly learn that power draw and heat are your biggest enemies. I spent years fighting with a dual-3090 PC build that sounded like a jet engine and turned my office into a sauna every time an agent ran a background task. Then I switched to a Mac Mini M4 Pro with 64GB of unified memory. It’s silent, it draws less power than a desk lamp, and it runs Llama 3-70B models at usable speeds. For the first time, the hardware has become invisible, which is the ultimate goal of any engineering lab.

The Architecture

The “Magic” of Apple Silicon isn’t the CPU speed—it’s the Unified Memory Architecture (UMA). In a traditional PC, your CPU RAM and GPU RAM (VRAM) are separate. If you want to run a 40GB model, you need a $1,600 GPU. On a Mac, the system RAM is the VRAM.

FeatureDesktop PC (RTX 4090)Mac Mini (M4 Pro 64GB)
VRAM CapacityHard cap at 24GBUp to 64GB (Flexible)
Memory Bandwidth1,000 GB/s (GDDR6X)273 GB/s (Unified)
Power Consumption450W - 600W20W - 50W
AcousticsHigh Fan NoiseNear-Silent
Ideal Model Size8B - 34B8B - 70B (Quantized)

While the PC beats the Mac on pure speed (tokens per second), the Mac wins on Size-to-Cost. You literally cannot run a 70B model on a single consumer NVIDIA card without massive pruning. The Mac Mini just eats it.

The Build

Configuring a Mac for a production lab requires moving away from the “Desktop App” mentality and toward a headless service mentality.

1. The Inference Stack

I use Ollama because it has the best implementation of the Apple Metal API. It offloads the tensor math directly to the GPU cores in the M4 chip.

# Verify Metal acceleration is active in the logs
grep "Metal" ~/.ollama/logs/server.log

# You should see: "Metal device is available" and "offloading layers to GPU"

2. The Python Bridge

My agents talk to the Mac Mini over the local network using the GekroLLMClient I detailed in my API Sovereignty post.

import ollama

def run_heavy_inference(prompt):
    # This runs on the Mac Mini, called by a Pi or my Workstation
    response = ollama.chat(
        model='llama3:70b-instruct-q4_K_M',
        messages=[{'role': 'user', 'content': prompt}]
    )
    return response['message']['content']

3. Model Quantization Selection

For a 64GB Mac, Q4_K_M is the “Goldilocks” quantization for Llama 3-70B. It fits comfortably in RAM (leaving 20GB for the system and other agents) while retaining 99% of the base model’s intelligence.

The Tradeoffs

The Mac Mini isn’t perfect. The biggest “Tax” is the Lack of CUDA. If you’re doing model training or fine-tuning, the Mac is a paperweight compared to an NVIDIA build. Most cutting-edge research code is written for CUDA first, and “Metal support” is often an afterthought that arrives months later.

I also hit a major issue with Heat Soak. During a 4-hour batch processing run of 1,000 Tesla telemetry logs, the Mac Mini’s internal fan finally kicked in, and the inference speed dropped from 8 TPS to 5 TPS. Even Apple’s efficiency has limits when pushed to 100% utilization for hours. I ended up 3D-printing a custom stand with a 120mm fan to keep the bottom of the chassis cool during long inference runs.

Where This Goes

I’m currently looking at Clustering Mac Minis using high-speed Thunderbolt bridges. If I can pool the memory of two M4 Pros, I can run a 405B parameter model locally. That’s the dream: a private, local “Super-Intelligence” in a form factor that fits in a desk drawer.

Continue Reading

Related from the lab

The Lab Newsletter

Get notified when I build something worth sharing.

No filler. No frequency commitment. Just the experiments, tools, and breakdowns I'd want to read myself.

Add PUBLIC_BEEHIIV_PUBLICATION_ID to .env.local to enable the real Beehiiv embed.

Discussion

Comments

Powered by GitHub Discussions — requires a GitHub account.