The meteoric rise of open-weights large language models means you no longer need a cloud account—or an internet connection—to experiment with Meta’s powerful Llama family. This guide shows you exactly how to download, install, and run Llama 2, Llama 3, and the cutting-edge Llama 4 models on Windows, macOS, or Linux. By the end, you’ll be chatting with your own private AI, benchmarking performance, and knowing which tool—Ollama or LM Studio—fits your workflow.
Why go local?
- Privacy first: All prompts and data stay on-device.
- No network latency: No API calls, no rate limits, and no waiting on a remote server.
- Cost control: Forget usage bills; run as much as your hardware allows.
What Is Ollama and Why Use It for Llama?
Ollama is a lightweight, Docker-style runtime that treats Llama models like containers:
- Simplicity: A single command such as `ollama run llama3:8b` spins up a chat.
- Cross-platform: Native installers for Windows, macOS, and Linux, plus a Docker image.
- Model management: Pull, list, copy, or remove models with intuitive CLI verbs.
- Quantization built-in: Automatically fetches efficient GGUF quants (Q4_K_M, IQ4_XS, etc.) so 8 GB VRAM GPUs can load models once reserved for datacenters.
When Ollama Shines
Developers love its API-first design: Ollama serves a native REST API at http://localhost:11434/api and an OpenAI-compatible endpoint at http://localhost:11434/v1, so you can swap cloud calls for local ones and keep coding.
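As a minimal sketch, assuming a model tagged llama3:8b has already been pulled and the Ollama service is listening on its default port, a request against the OpenAI-compatible endpoint looks like this:

```bash
# Chat completion via Ollama's OpenAI-compatible endpoint (default port 11434).
# Assumes `ollama pull llama3:8b` has already been run.
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "llama3:8b",
        "messages": [{"role": "user", "content": "Say hello in five words."}]
      }'
```

Because the request shape matches the OpenAI Chat Completions format, most OpenAI-compatible SDKs work after changing only the base URL and model name.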
Step 1 – Obtain the Ollama Software
Grab the official installer for your OS from ollama.com/download and save it to your desktop.
Step 2 – Install Ollama on Your System
- Windows: Double-click `OllamaSetup.exe`, accept the prompts, and let the service start in the system tray.
- macOS: Drag the Ollama app to Applications; it auto-launches a background daemon.
- Linux: Run `curl -fsSL https://ollama.com/install.sh | sh`. The script installs binaries to /usr/local/bin and registers a systemd service.
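On systemd-based distributions you can optionally confirm the service came up; this sketch assumes the installer registered the unit under the name ollama:

```bash
# Confirm the Ollama systemd service is active (unit name assumed from the install script)
systemctl status ollama

# Follow its logs if something looks wrong
journalctl -u ollama -f
```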
Step 3 – Verify Ollama Installation
Open Terminal (macOS/Linux) or PowerShell (Windows) and type:
ollama
A help screen listing commands (`pull`, `run`, `list`, `rm`, …) confirms everything is ready.
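For a slightly more thorough check, two built-in commands are handy (the version string will vary with your install):

```bash
# Print the installed Ollama version; any version string means the binary is on PATH
ollama --version

# List locally downloaded models (empty on a fresh install)
ollama list
```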
Quick Reference – Llama Model Ollama Commands & VRAM
Below is a ballpark guide. Always check the Ollama Library for the latest tags and sizes.
| Model | Size (GB) | VRAM Required | Ollama Command |
|---|---|---|---|
| Llama 4 Scout | 55–60 (4-bit) | ≈ 64–80 GB | `ollama run llama4:scout` |
| Llama 4 Maverick | 200 (4-bit) | ≥ 200 GB | `ollama run llama4:maverick` |
| Llama 3.3 70B | 43 | 43 GB | `ollama run llama3.3` |
| Llama 3.2 1B | 1.3 | 1.8 – 3.1 GB | `ollama run llama3.2:1b` |
| Llama 3.2 3B | 2.0 | 3.4 – 6.5 GB | `ollama run llama3.2:3b` |
| Llama 3.1 8B | 4.9 | 8 – 16 GB | `ollama run llama3.1:8b` |
| Llama 3.1 70B | 43 | 42 – 84 GB | `ollama run llama3.1:70b` |
| Llama 3.1 405B | 243 | ≥ 243 GB | `ollama run llama3.1:405b` |
| Llama 3 8B | 4.7 | 8 – 16 GB | `ollama run llama3:8b` |
| Llama 3 70B | 40 | 26 – 44 GB | `ollama run llama3:70b` |
Step 4 – Download and Run Your Chosen Llama Model(s)
Use `ollama run` (or `ollama pull`) with the tag that matches the model and quantization you need. For example:
ollama run llama3:8b
A stable internet connection is required; some downloads exceed 15 GB.
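If you prefer to fetch models ahead of time and chat later, `ollama pull` downloads without opening a session. The tags below are illustrative; check the Ollama Library for current ones:

```bash
# Download now, chat later; tags shown are examples from the table above
ollama pull llama3.1:8b
ollama pull llama3.2:3b

# Confirm what is on disk and how large each model is
ollama list

# Remove a model you no longer need to reclaim disk space
ollama rm llama3.2:3b
```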
Step 5 – Start Chatting
If you just downloaded a model with `ollama run`, an interactive session opens automatically; otherwise start one with:
ollama run llama3:8b
The terminal shows loading progress, then the `>>> Send a message` prompt appears and the model is ready for input.
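You can also skip the interactive session entirely and pass a prompt as an argument, which is convenient for scripting (a sketch; the model tag is just an example):

```bash
# One-shot generation: the reply is printed to stdout and the command exits
ollama run llama3:8b "Summarize the benefits of running LLMs locally in two sentences."
```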
Step 6 – Test Your Llama Installation
Try a prompt:
Explain the difference between Llama 3 and Llama 4 in one paragraph.
A coherent reply confirms success. To exit, type `/bye`.
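To get rough tokens-per-second numbers for benchmarking, recent Ollama builds accept a --verbose flag on `ollama run` (availability depends on your version):

```bash
# --verbose prints timing stats (prompt eval rate, eval rate, total duration) after each reply
ollama run llama3:8b --verbose
```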
Troubleshooting Common Issues
“Command not found” for ollama
Add the install directory to your PATH or restart the terminal.
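On macOS or Linux, that usually means appending the install directory to PATH in your shell profile; the path below is a common default and may differ on your system:

```bash
# Assumes the binary landed in /usr/local/bin (the Linux install script's default)
echo 'export PATH="$PATH:/usr/local/bin"' >> ~/.bashrc
source ~/.bashrc

# Verify the shell can now find the binary
which ollama
```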
RuntimeError: CUDA out of memory
- Switch to a smaller model or a tighter quant (e.g., Q4_K_M → Q3_K_L).
- Lower the context length with `PARAMETER num_ctx 4096` in a Modelfile (see the sketch after this list).
- In LM Studio, reduce the GPU offload slider.
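As a minimal sketch of that Modelfile approach, assuming llama3.1:8b is already pulled and llama3.1-4k is simply a name chosen for the derived variant:

```bash
# Create a lower-context variant of an existing model to reduce memory pressure
cat > Modelfile <<'EOF'
FROM llama3.1:8b
PARAMETER num_ctx 4096
EOF

ollama create llama3.1-4k -f ./Modelfile
ollama run llama3.1-4k
```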
Ollama using CPU, not GPU
- Update NVIDIA or AMD drivers.
- On laptops, force the high-performance GPU in OS settings.
- Docker users: launch the container with `--gpus all` (see the sketch after this list).
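For reference, a typical GPU-enabled container launch looks something like the following; the volume and port mappings are conventional choices rather than requirements, and NVIDIA users also need the NVIDIA Container Toolkit installed:

```bash
# Run the official Ollama image with GPU access, persisting models in a named volume
docker run -d --gpus all \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  --name ollama \
  ollama/ollama

# Pull and chat with a model inside the running container
docker exec -it ollama ollama run llama3:8b
```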
Model download failed
- Manually download the GGUF file from Hugging Face and “sideload” it with a Modelfile containing `FROM ./Meta-Llama-3.1-8B.Q4_K_M.gguf`, then run `ollama create llama3-local -f ./Modelfile` (a full sketch follows).
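Putting those pieces together, assuming the GGUF file sits in the current directory under the name shown above:

```bash
# Write a one-line Modelfile pointing at the locally downloaded GGUF weights
printf 'FROM ./Meta-Llama-3.1-8B.Q4_K_M.gguf\n' > Modelfile

# Register the weights with Ollama under a custom name, then verify and chat
ollama create llama3-local -f ./Modelfile
ollama list
ollama run llama3-local
```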
Other Local Deployment Options (Advanced)
- LM Studio GUI: Perfect for beginners—search, download, and chat without touching the CLI.
- llama.cpp: Compile from source for maximum control and speed tuning (see the build sketch after this list).
- Open WebUI + Ollama: Adds a sleek web-app front-end with built-in RAG and multi-user support.
- vLLM or Text Generation Inference: High-throughput serving for production workloads.
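As an illustration of the llama.cpp route, a recent CMake-based build might look like this; exact targets, paths, and flags vary between llama.cpp releases, so treat it as a sketch:

```bash
# Clone and build llama.cpp (recent releases use CMake; older ones used plain make)
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release -j

# Run a local GGUF file directly; -m selects the model, -p supplies the prompt
./build/bin/llama-cli -m ../Meta-Llama-3.1-8B.Q4_K_M.gguf -p "Hello from llama.cpp"
```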
Conclusion & Next Steps
You now have a repeatable workflow to download, install, and run Meta’s Llama models completely offline:
- Install Ollama (or LM Studio).
- Pull the model tag that matches your VRAM.
- Chat, code, or build an API—no cloud required.
Experiment with different quantizations, benchmark token-per-second rates, and explore advanced features like Modelfile-based agents or LM Studio’s RAG mode. Ready to push further? Try creating a custom preset or integrating Ollama into LangChain for document search. Happy hacking!
FREQUENTLY ASKED QUESTIONS (FAQ)
QUESTION: Do I need a dedicated GPU to run Llama locally?
ANSWER: No. CPU-only inference works, but expect slow responses—often below 2 tokens/s on large models. A modern GPU with at least 8 GB VRAM dramatically improves speed and energy efficiency.
QUESTION: How much VRAM is required for Llama 3 70B?
ANSWER: With a Q4_K_M quant you’ll need roughly 40 GB of VRAM (or a GPU + system RAM combination that accommodates the ~40 GB model file, since Ollama can split layers between GPU and CPU). FP16 versions demand well over 100 GB.
QUESTION: Can I run Llama on Windows 11 without WSL?
ANSWER: Absolutely. The native Ollama installer bundles everything; no Windows Subsystem for Linux is required. LM Studio also offers a standalone Windows build.
QUESTION: How do I expose my local Llama as an API?
ANSWER: Start the Ollama service (`ollama serve`, or rely on the background service the installer sets up) or enable “Developer → OpenAI API Server” in LM Studio. Point your client’s `base_url` to http://localhost:11434/v1 (Ollama) or http://localhost:1234/v1 (LM Studio) and use any OpenAI-compatible SDK.
QUESTION: Is commercial use allowed under the Llama license?
ANSWER: Yes for most individuals and startups. If your product exceeded 700 million monthly active users before the model’s release, you must negotiate a separate commercial license with Meta.