Question 1

What device do I need to run AI locally?

Accepted Answer

6 GB of VRAM is the practical floor: enough for 7B models at Q4 quantization. At the same quantization, 12 GB covers most 13B models, 24 GB opens up 30B, and 70B needs 48 GB+ or a unified-memory setup like Mac Studio. NVIDIA RTX (discrete VRAM), Apple Silicon (unified memory), and AMD Radeon all work.
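The tiers above follow from simple arithmetic: weight memory is roughly parameter count times bits per weight, divided by 8. A minimal sketch, assuming average bits-per-weight figures of about 4.85 for Q4_K_M and 8.5 for Q8_0 (approximations that vary slightly by model; the function and table names are made up for illustration):

```python
# Assumed average bits per weight for each quantization format.
# These are approximations, not exact for every model architecture.
BITS_PER_WEIGHT = {"Q4_K_M": 4.85, "Q8_0": 8.5, "F16": 16.0}


def weight_gb(params_billion: float, quant: str) -> float:
    """Estimate weight size in GB (1 GB = 1e9 bytes).

    params_billion * 1e9 weights * bits / 8 bytes, divided by 1e9,
    simplifies to params_billion * bits / 8.
    """
    return params_billion * BITS_PER_WEIGHT[quant] / 8


if __name__ == "__main__":
    for size in (7, 13, 30, 70):
        print(f"{size}B at Q4_K_M: ~{weight_gb(size, 'Q4_K_M'):.1f} GB of weights")
```

This counts weights only. The KV cache, activations, and runtime overhead come on top, which is why each tier leaves some headroom above the raw weight size.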

Question 2

What's the best device for local AI — Apple Silicon or NVIDIA RTX?

Accepted Answer

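It depends on whether you need capacity or speed. Apple wins on capacity: Mac Studio unified memory goes up to 192 GB. NVIDIA wins on bandwidth: an RTX 4090 delivers ~1 TB/s versus 200 GB/s on the M2 Pro (the M2 Ultra in the 192 GB Mac Studio reaches 800 GB/s, still short of the 4090). Bigger models fit on Apple; models that fit in VRAM run faster on NVIDIA.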

Question 3

Does VRAM or memory bandwidth matter more for local inference?

Accepted Answer

Both, but they answer different questions. VRAM decides whether a model loads; bandwidth decides how fast it runs. An RTX 4090 has 24 GB of VRAM and ~1 TB/s of bandwidth: a 7B model at Q4 fits in about 5 GB and runs at 80+ tokens/sec. An Apple M2 Pro with 16 GB of unified memory at 200 GB/s fits the same model but runs it at around 20 tokens/sec.
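Why bandwidth sets the speed: generating each token requires reading roughly the whole set of weights from memory once, so bandwidth divided by model size gives a ceiling on tokens per second. A rough sketch under that assumption (the function name is hypothetical, and real throughput lands well below the ceiling because of compute, KV-cache reads, and software overhead):

```python
def max_tokens_per_sec(bandwidth_gb_s: float, model_gb: float) -> float:
    """Upper bound on decode speed for a memory-bound model.

    Assumes every generated token reads all model weights once,
    so the ceiling is simply bandwidth / bytes read per token.
    """
    return bandwidth_gb_s / model_gb


if __name__ == "__main__":
    model_gb = 5.0  # 7B model at Q4, as in the answer above
    for name, bandwidth in (("RTX 4090", 1000.0), ("Apple M2 Pro", 200.0)):
        ceiling = max_tokens_per_sec(bandwidth, model_gb)
        print(f"{name}: at most ~{ceiling:.0f} tokens/sec")
```

The measured figures in the answer (80+ and around 20 tokens/sec) sit below these ceilings, as expected, and differ by roughly the same factor as the two bandwidths.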

Question 4

Why does the same 70B model fit on one device but not another with the same VRAM?

Accepted Answer

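Quantization and context length. At Q4_K_M a 70B model needs ~42 GB for its weights; at Q8_0 it needs ~75 GB, which no single 48 GB device can hold. Two 48 GB devices will both load the Q4 weights, but that leaves only about 6 GB for the KV cache and activations, so a longer context, or memory already reserved by the display or OS, can push one of them past the edge.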
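To see how context length eats into the headroom, here is a minimal sketch of a fit check. The architecture defaults (80 layers, 8 KV heads, head dimension 128, 16-bit cache) are assumptions matching a Llama-3-70B-style model, and `reserved_gb` is a hypothetical allowance for memory the system has already claimed; adjust both for your model and machine.

```python
def kv_cache_gb(context_tokens: int, n_layers: int = 80, n_kv_heads: int = 8,
                head_dim: int = 128, bytes_per_value: int = 2) -> float:
    """KV cache size in GB (1 GB = 1e9 bytes) for a given context length.

    Each token stores one key and one value vector (hence the factor 2)
    per layer and per KV head. Defaults are assumed, not universal.
    """
    bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return context_tokens * bytes_per_token / 1e9


def fits(vram_gb: float, weights_gb: float, context_tokens: int,
         reserved_gb: float = 1.0) -> bool:
    """True if weights + KV cache + reserved memory fit in vram_gb."""
    needed = weights_gb + kv_cache_gb(context_tokens) + reserved_gb
    return needed <= vram_gb


if __name__ == "__main__":
    for context in (8192, 32768):
        print(f"Q4 (~42 GB) at {context} tokens on 48 GB: {fits(48, 42, context)}")
    print(f"Q8 (~75 GB) on 48 GB: {fits(48, 75, 0)}")
```

Under these assumptions the Q4 model fits at an 8K context but not at 32K, and Q8 does not fit at any context length.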
