In September 2024 Meta released Llama 3.2, a milestone that split the Llama family in two clear directions. On one side are the 1 B and 3 B text-only models engineered for on-device privacy and speed. On the other are the 11 B Vision and 90 B Vision models that fuse images with state-of-the-art language understanding. Despite the different goals, every variant shares the same 128 K-token context window, an optimized transformer backbone with Grouped-Query Attention (GQA), and instruction tuning reinforced by RLHF. This guide walks you through every detail—from architecture and benchmarks to licensing and real-world deployment—using plain language first and deep dives where you need them.
Download and Run Llama 3.2 Models
Need the exact steps for getting Llama 3.2 1B, 3B, 11B Vision, 90B Vision or any other Meta model onto your machine? Click here for our dedicated download guide; it covers official checkpoints, the best quantization choices, and one-click launches with Ollama, LM Studio, and more.
Why Llama 3.2 Is More Than an Update
Llama 3.2 is not a linear bump from 3.1. Think of 3.1 as the sturdy trunk; 3.2 grows two distinct branches:
- High-performance multimodality – the Vision models (11 B and 90 B) process an image plus text in one prompt to generate text-only answers.
- Edge-first efficiency – the 1 B and 3 B models shrink a large language model down to phone-level hardware without losing basic reasoning.
Architecture
A Shared Transformer Core
Every member of the 3.2 family is an auto-regressive transformer tuned with GQA, which balances multi-head richness with reduced memory traffic. Instruction versions layer supervised fine-tuning and Direct Preference Optimization on top, so you get helpful, safe answers out of the box.
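To make the memory saving concrete, here is a minimal PyTorch sketch of grouped-query attention, in which several query heads share a single key/value head. The head counts and dimensions are illustrative, not Meta's exact configuration.

```python
import torch
import torch.nn.functional as F

batch, seq, head_dim = 1, 16, 64
n_q_heads, n_kv_heads = 32, 8            # 4 query heads share each KV head

q = torch.randn(batch, n_q_heads, seq, head_dim)
k = torch.randn(batch, n_kv_heads, seq, head_dim)
v = torch.randn(batch, n_kv_heads, seq, head_dim)

# Expand each KV head so a whole group of query heads can attend to it;
# the cache only ever stores the 8 KV heads, cutting memory traffic.
group = n_q_heads // n_kv_heads
k = k.repeat_interleave(group, dim=1)
v = v.repeat_interleave(group, dim=1)

out = F.scaled_dot_product_attention(q, k, v)
print(out.shape)                          # torch.Size([1, 32, 16, 64])
```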
How Vision Works Under the Hood
Vision models bolt a trainable adapter onto a frozen Llama 3.1 text backbone. An image encoder transforms pixels into features, and cross-attention layers feed those features to the language model at several depths. Freezing the text weights prevents “catastrophic forgetting,” so you keep elite language quality while adding sight.
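The mechanism can be sketched as a gated cross-attention block inserted between frozen language layers. The module below is a simplified illustration with made-up dimensions, not the released checkpoint's layout.

```python
import torch
import torch.nn as nn

class CrossAttentionAdapter(nn.Module):
    """Gated cross-attention block: text tokens attend over image features."""
    def __init__(self, d_model=1024, n_heads=16):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)
        self.gate = nn.Parameter(torch.zeros(1))   # starts as a no-op

    def forward(self, text_hidden, image_features):
        attended, _ = self.attn(text_hidden, image_features, image_features)
        # Gated residual: training can blend vision in gradually while the
        # surrounding (frozen) language layers stay untouched.
        return text_hidden + torch.tanh(self.gate) * self.norm(attended)

text_hidden = torch.randn(1, 32, 1024)     # activations from the frozen LM
image_feats = torch.randn(1, 256, 1024)    # projected image patch features
print(CrossAttentionAdapter()(text_hidden, image_feats).shape)  # (1, 32, 1024)
```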
Pruning + Distillation for 1 B and 3 B
Meta trimmed unimportant weights from the 8 B parent, then trained the tiny “students” to mimic the larger model’s logits. Result: the 3 B scores ~78 % on ARC-Challenge—remarkable for something you can run on a laptop GPU.
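The distillation objective can be sketched as a temperature-scaled KL term between student and teacher logits, blended with ordinary cross-entropy. The temperature and mixing weight below are assumptions, not Meta's published recipe.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets from the teacher; the KL term is scaled by T^2 as usual.
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard-label cross-entropy on the student's own logits.
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce

student = torch.randn(4, 32000)            # (batch, vocab)
teacher = torch.randn(4, 32000)
labels = torch.randint(0, 32000, (4,))
print(distillation_loss(student, teacher, labels).item())
```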
Performance Snapshot
- 1 B – 49 % on MMLU, 30 % on MATH; perfect for quick summaries and offline commands.
- 3 B – 78.6 % on ARC-Challenge and 77.7 % on GSM8K; sweet spot for cost-efficient RAG.
- 11 B Vision – ~111 tokens per second (TPS) on Oracle Cloud; excels at single-page visual Q&A.
- 90 B Vision – 86 % on MMLU, 90 % on DocVQA; state-of-the-art open model for complex documents.
Hardware Requirements & Quantization
Quick-Reference Tables
PC / NVIDIA GPUs
Model | Q4_K_M Size | Min VRAM | Rec. VRAM | Rec. System RAM
---|---|---|---|---
1B Instruct | ~0.8 GB | 2 GB | 4 GB | 16 GB
3B Instruct | ~2.0 GB | 4 GB | 6 GB | 16 GB
11B Vision | ~7.8 GB | 10 GB | 12 GB | 32 GB
90B Vision | ~55 GB | 1× A100 (80 GB) | 2× RTX 4090 | 128 GB
Apple Silicon (Unified Memory)
Model | Q4_K_M Size | Recommended Unified Memory
---|---|---
1B Instruct | ~0.8 GB | 8 GB
3B Instruct | ~2.0 GB | 8 GB
11B Vision | ~7.8 GB | 16 GB
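The sizes above roughly follow parameter count × bits per weight plus some overhead. Here is a quick back-of-the-envelope check, assuming ~4.5 effective bits for Q4_K_M, a 10 % overhead factor, and approximate parameter counts; the Vision figures drift further because they also carry an image encoder.

```python
# Rough sanity check on the Q4_K_M sizes above: weights take about
# params * bits / 8 bytes, plus ~10% overhead for embeddings, quant scales,
# and runtime buffers. Bit-width, overhead, and parameter counts are
# approximations, not exact figures.
def approx_gguf_gb(params_billion, bits=4.5, overhead=1.1):
    return params_billion * bits / 8 * overhead   # billions of bytes ≈ GB

for name, params in [("1B", 1.23), ("3B", 3.21), ("11B Vision", 10.6), ("90B Vision", 88.8)]:
    print(f"{name}: ~{approx_gguf_gb(params):.1f} GB")
```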
Choosing a Format
- GGUF – best all-rounder; lets CPU and GPU share the load (see the loading sketch after this list).
- GPTQ – compact and GPU-friendly, provided you have a small calibration set for quantization.
- AWQ – fastest 4-bit inference on CUDA and ROCm.
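If GGUF is your pick, a minimal llama-cpp-python sketch looks like this. The model path is an example filename; point it at whichever quant you downloaded, and adjust the context size to your RAM budget.

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./Llama-3.2-3B-Instruct-Q4_K_M.gguf",  # example filename
    n_ctx=8192,          # context window to allocate
    n_gpu_layers=-1,     # -1 = offload every layer that fits on the GPU
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize GQA in one sentence."}]
)
print(out["choices"][0]["message"]["content"])
```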
Real-World Use Cases
- Edge assistants – 3 B on a tablet can translate, summarize notes, or answer FAQs without internet.
- Document automation – 90 B Vision parses invoices, charts, and long reports better than many closed APIs.
- Multimodal RAG – retrieve PDF pages as images with ColPali, then feed them plus the question to 11 B Vision for a rich answer (see the sketch after this list).
- Research & security – a March 2025 paper fine-tuned 1 B for C/C++ vulnerability detection, hitting 66 % F1 with modest compute.
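For the multimodal RAG item above, the loop can be sketched with the Ollama Python client. The `retrieve_page_images` helper is a stand-in for a ColPali-style retriever, and the model tag and file paths are assumptions about your local setup.

```python
import ollama

def retrieve_page_images(question: str, k: int = 1) -> list[str]:
    # Placeholder for a ColPali-style retriever that scores PDF page images
    # against the query and returns paths to the top-k pages.
    return ["./pages/invoice_page_3.png"][:k]

question = "What is the total amount due on this invoice?"
pages = retrieve_page_images(question)

response = ollama.chat(
    model="llama3.2-vision:11b",           # assumed local model tag
    messages=[{"role": "user", "content": question, "images": pages}],
)
print(response["message"]["content"])
```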
Fine-Tuning and RAG Best Practices
LoRA in Four Lines
Freeze 99 % of the weights, train tiny rank-decomposition matrices, and save VRAM. Frameworks like Unsloth or NVIDIA NeMo ship ready-made scripts for both text and vision.
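A minimal sketch with Hugging Face peft, assuming rank-16 adapters on the attention projections; Unsloth and NeMo ship their own recipes with different defaults.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Load the frozen base model; only the LoRA adapters will be trained.
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")

config = LoraConfig(
    r=16,                                   # adapter rank (assumption)
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, config)
model.print_trainable_parameters()          # typically well under 1% of the weights
```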
Embeddings & Reranking
Use llama-3.2-nv-embedqa-1b-v2 for multilingual retrieval; Matryoshka embeddings let you store vectors at 2048 D today and trim to 384 D when bandwidth matters.
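Trimming a Matryoshka-style vector is just truncation plus re-normalization. A small sketch follows, with `full` standing in for an embedding returned by whichever client you use to call the model; no specific embedding API is assumed.

```python
import numpy as np

def truncate_embedding(vec: np.ndarray, dim: int = 384) -> np.ndarray:
    # Keep the leading dimensions of a Matryoshka embedding and re-normalize
    # so cosine similarity still behaves.
    small = vec[:dim]
    return small / np.linalg.norm(small)

full = np.random.randn(2048)              # stand-in for a 2048-D embedding
full = full / np.linalg.norm(full)

compact = truncate_embedding(full, dim=384)
print(compact.shape, np.linalg.norm(compact))   # (384,) 1.0
```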
Licensing, AUP, and Compliance
- Community License – free for research and most businesses; companies with more than 700 million monthly active users (as of 25 Sep 2024) need a Meta-approved commercial deal.
- Attribution – show “Built with Llama” and prefix any fine-tuned model name with “Llama.”
- EU restriction – the license does not grant rights to the Vision (multimodal) models to individuals or companies domiciled in the European Union.
- Acceptable Use Policy – forbids harmful uses such as terrorism, child exploitation, and unauthorized medical advice.
Known Limitations & Mitigations
- Single-image cap – Vision models accept only one image per prompt; for sequences, loop over prompts or choose another architecture.
- Inference speed – the adapter design makes the Vision models slower than natively multimodal peers; prefer 11 B Vision over 90 B when latency is critical.
- Prompt injection – third-party tests forced the 8 B model to output malicious code; always add external safety filters around untrusted input.
Conclusion
Llama 3.2 is a toolbox, not a one-size-fits-all model. If you need blazing document understanding with fully open weights, 90 B Vision is hard to beat. For privacy-first mobile apps, 3 B delivers more brains per gigabyte than anything in its class. By blending modular architecture with a vibrant open-source ecosystem, Meta has pushed the baseline for what free-to-use AI can do—and given developers the freedom to run it anywhere, from data center to pocket.
FREQUENTLY ASKED QUESTIONS (FAQ)
QUESTION: What sets Llama 3.2 apart from Llama 3.1 and 3.3?
ANSWER: Llama 3.2 adds two Vision models for image-plus-text reasoning and two pruned-and-distilled text models for edge devices. Llama 3.1 focused on scaling context to 128 K, while Llama 3.3 offers a single, highly optimized 70 B text-only model that rivals 405 B performance but lacks multimodality.
QUESTION: Can I run the 90 B Vision model on consumer hardware?
ANSWER: Only if you have at least one 24 GB GPU and accept heavy offloading. For production-grade speed you’ll need dual RTX 4090s or an A100 80 GB server.
QUESTION: Is the 1 B model good enough for a chatbot?
ANSWER: Yes for lightweight tasks like summarizing, basic Q&A, or offline voice assistants. For deeper reasoning or multi-step analysis, upgrade to the 3 B or larger.
QUESTION: How many images can a Vision prompt include?
ANSWER: Currently just one. You can chain prompts for comparisons or pick a different multimodal model that supports multiple images.
QUESTION: Is it legal to ship a commercial app with Llama 3.2 inside?
ANSWER: Yes, provided you respect the Community License: attribute “Built with Llama,” follow the Acceptable Use Policy, and obtain a commercial license if your product already served more than 700 million monthly active users on 25 Sep 2024.