CAN I RUN AI LOCALLY? · A FIELD GUIDE
Cairn reads your GPU, VRAM, and bandwidth from the browser, then ranks 50+ open-weight LLMs against your hardware. Offline, in about 300 ms.
6 GB of VRAM runs a 7B model at Q4 quantization. 12 GB covers most 13B. 24 GB opens up 30B and MoE 70B. Cairn checks 50+ open-weight LLMs against your hardware so you know what fits before you pull a 40 GB checkpoint.
Want the full picture? Check the tier list, or put two GPUs side-by-side.
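At 6 GB of VRAM, stick to 7B-parameter models at Q4 quantization. 12 GB covers most 13B models. 24 GB opens up 30B dense models and MoE 70B Q4. For a dense 70B model, plan on 48 GB+ at Q4 (about 45 GB of weights at ~0.65 GB per billion parameters) and roughly twice that at Q8.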
Speed comes down to your GPU's memory bandwidth. An RTX 4090 runs a 7B model at 80+ tokens per second — about as fast as an API response, minus the network trip.
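To see why bandwidth is the limit, here is a minimal sketch of the usual back-of-envelope estimate: generating one token reads roughly every weight once, so tokens per second is capped at bandwidth divided by model size. The function name and the `efficiency` factor are assumptions for illustration, not Cairn's actual ranking logic. The 1008 GB/s figure is the RTX 4090's published memory bandwidth, and 4.55 GB is a 7B model at ~0.65 GB per billion parameters.

```python
def estimate_decode_tokens_per_second(
    bandwidth_gb_s: float,
    model_size_gb: float,
    efficiency: float = 0.5,
) -> float:
    """Rough decode speed for a memory-bound model.

    Each generated token streams (approximately) all weights from VRAM once,
    so the ceiling is bandwidth / model size. `efficiency` is an assumed
    fudge factor for real-world overhead; it is not a measured value.
    """
    if model_size_gb <= 0:
        raise ValueError("model_size_gb must be positive")
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    return bandwidth_gb_s * efficiency / model_size_gb


# RTX 4090 (1008 GB/s) with a 7B model at Q4 (~4.55 GB of weights).
ceiling = estimate_decode_tokens_per_second(1008, 4.55, efficiency=1.0)
realistic = estimate_decode_tokens_per_second(1008, 4.55)
print(f"theoretical ceiling: {ceiling:.0f} tok/s, with 50% efficiency: {realistic:.0f} tok/s")
```

The ceiling lands well above 80 tokens per second, so the 80+ figure is plausible once runtime overhead is taken out. Treat the result as an order-of-magnitude guide; real numbers depend on the runtime, context length, and batch size.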
All three work. Cairn reads your GPU via WebGPU / WebGL — the inference itself runs in llama.cpp, Ollama, LM Studio, or whatever local runtime you prefer. Model support is the same across operating systems.
Q4_K_M uses ~0.65 GB per billion parameters with ~1% quality loss vs full precision. Q8_0 doubles the VRAM but keeps ~99.9% quality. Q4 is the default for most consumer GPUs.
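The rule of thumb above is easy to turn into a quick fit check. This is a rough sketch, not how Cairn itself scores models: the function names are hypothetical, the Q8_0 figure is simply double the Q4_K_M one as stated above, and `overhead_gb` is an assumed allowance for the KV cache and runtime buffers.

```python
# Approximate weight size in GB per billion parameters.
# Q4_K_M comes from the text above; Q8_0 is taken as double that.
GB_PER_BILLION_PARAMS = {"Q4_K_M": 0.65, "Q8_0": 1.30}


def estimate_weights_gb(params_billions: float, quant: str = "Q4_K_M") -> float:
    """Estimate the VRAM taken by model weights alone."""
    if quant not in GB_PER_BILLION_PARAMS:
        raise ValueError(f"unknown quantization: {quant}")
    return params_billions * GB_PER_BILLION_PARAMS[quant]


def fits_in_vram(
    params_billions: float,
    vram_gb: float,
    quant: str = "Q4_K_M",
    overhead_gb: float = 1.0,  # assumed headroom for KV cache and buffers
) -> bool:
    """Return True if the weights plus assumed overhead fit in the given VRAM."""
    return estimate_weights_gb(params_billions, quant) + overhead_gb <= vram_gb


for params, vram in [(7, 6), (13, 12), (30, 24), (70, 24), (70, 48)]:
    size = estimate_weights_gb(params)
    verdict = "fits" if fits_in_vram(params, vram) else "does not fit"
    print(f"{params}B at Q4_K_M: ~{size:.1f} GB of weights, {verdict} in {vram} GB")
```

Long contexts need more headroom than the 1 GB assumed here, so raise `overhead_gb` if you plan to run large context windows.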