LLM models comparison
| Model Name | Parameter Sizes | Release/Peak Year | Architecture Type | Best For / Strengths | Hardware Requirements (Quantized) | Hardware Requirements (Full Precision) | Context Window | Multilingual Support | License | Special Features |
|---|---|---|---|---|---|---|---|---|---|---|
| Phi-3.5 Mini | 3.8B | 2024 | Transformer decoder-only | Long-context reasoning, RAG applications, reading PDFs, technical documentation, code generation and debugging, multilingual tasks [citation:1] | 6-10GB RAM (4-bit) [citation:1] | 16GB RAM (16-bit) [citation:1] | 128K tokens (book-length inputs) [citation:1] | Yes [citation:1] | MIT [citation:1] | Handles very long inputs, strong for document-heavy workflows [citation:1] |
| Llama 3.2 | 1B, 3B | 2024 | Transformer | General chat and Q&A, document summarization, text classification, customer support automation [citation:1] | 1B: 2-4GB RAM (4-bit), 3B: 6GB RAM (4-bit) [citation:1] | 1B: 4-6GB RAM (16-bit), 3B: 12GB RAM (16-bit) [citation:1] | 128K tokens [citation:1] | 8 languages (English, German, French, Italian, Portuguese, Hindi, Spanish, Thai) [citation:1] | Llama Community License (commercial use under 700M MAU) [citation:7] | 1B runs on phones and mobile devices, 3B is balanced all-rounder [citation:1] |
| Ministral 3 / Ministral | 3B, 8B | 2025 | Transformer (dense) | Complex reasoning tasks, multi-turn conversations, code generation, tasks requiring nuanced understanding, on-device deployment [citation:1][citation:7] | 8B: 10GB RAM (4-bit) [citation:1] | 8B: 20GB RAM (16-bit) [citation:1] | 128K tokens (largest models) [citation:7] | Yes, strong multilingual [citation:7] | Apache 2.0 (newer releases), Mistral Research License (older) [citation:1] | Native function calling works without special prompting, sub-500 ms responses on phones [citation:7] |
| Qwen 2.5 / Qwen 3 | 0.5B, 1.5B, 3B, 7B, 14B, 32B, 72B | 2024-2025 | Transformer | Code generation and completion, mathematical reasoning, technical documentation, multilingual tasks (especially Chinese/English) [citation:1] | 7B: 8GB RAM (4-bit) [citation:1] | 7B: 16GB RAM (16-bit) [citation:1] | 128K tokens (most variants) [citation:5] | 29+ languages; Coder variants cover 92+ programming languages [citation:5] | Apache 2.0 or Qianwen LICENSE [citation:1] | Dominates coding and math benchmarks in its size class [citation:1] |
| Gemma 3 / Gemma 3n | 270M, 1B, 4B, 12B, 27B | 2025 | Transformer with 5-to-1 interleaved local/global attention [citation:7] | Complex instruction-following, tasks requiring careful safety handling, general knowledge Q&A, content moderation, multimodal (4B and up) [citation:1][citation:7] | 12B: 12GB RAM (4-bit) [citation:1], 4B multimodal fits mobile-class hardware | 12B: 24GB RAM (16-bit) [citation:1] | 128K tokens [citation:7] | 140+ languages [citation:7] | Gemma Terms of Use [citation:1] | ShieldGemma 2 filters harmful image content, 270M uses 0.75% battery for 25 conversations on a Pixel 9 Pro [citation:7] |
| Llama 4 Scout | 109B total, 17B active (MoE) | 2025 | Mixture-of-Experts (16 experts) [citation:4] | General purpose AI, massive-context applications [citation:4] | Fits on a single H100 (int4) [citation:4] | ~218GB (16-bit weights; multi-GPU) | 10 million tokens [citation:4] | Pretrained on 200 languages; instruction-tuned for 12 (Arabic, English, French, German, Hindi, Indonesian, Italian, Portuguese, Spanish, Tagalog, Thai, Vietnamese) [citation:4] | Llama 4 Community License (commercial use under 700M MAU) [citation:4] | Multimodal (text+image in, text out) [citation:4] |
| Llama 4 Maverick | 400B total, 17B active (MoE) | 2025 | Mixture-of-Experts (128 experts) [citation:4] | High-performance general AI, production deployments [citation:4] | Requires multiple GPUs [citation:4] | Multiple H100 GPUs | 1 million tokens [citation:4] | Pretrained on 200 languages; instruction-tuned for 12 [citation:4] | Llama 4 Community License [citation:4] | Powers Meta AI across WhatsApp, Messenger, and Instagram; beats GPT-4o and Gemini 2.0 Flash on reported benchmarks [citation:4] |
| Falcon 3 | 1B, 3B, 7B, 10B | 2024-2025 | Transformer / Mamba (Falcon3-Mamba variant) [citation:7] | Resource-constrained deployment, runs on laptops [citation:7] | 3B runs on a MacBook Air [citation:7] | 7B: 16GB RAM | 8K-32K tokens | English, French, Spanish, Portuguese [citation:7] | TII Falcon License (free for research/commercial) [citation:7] | Falcon3-Mamba uses State Space Models for faster long-sequence inference, trained on 14 trillion tokens [citation:7] |
| EXAONE Deep | 2.4B, 7.8B, 32B | 2025 | Transformer | Reasoning-enhanced tasks, math, coding benchmarks [citation:5] | 7.8B: ~8GB RAM (4-bit, estimated) | 7.8B: ~16GB RAM (16-bit, estimated) | 32K tokens | Primarily English/Korean | EXAONE AI Model License (research-focused) | 2.4B outperforms similarly sized models and 7.8B surpasses OpenAI o1-mini [citation:5]; AWQ and GGUF quantized weights available, runs with llama.cpp and Ollama [citation:5] |
| Devstral | Based on Mistral-Small-3.1 | 2025 | Transformer | Software engineering tasks, navigating complex codebases, editing multiple files, resolving real-world issues [citation:5] | Mac with 32GB RAM or RTX 4090 [citation:5] | High-end consumer GPU [citation:5] | 128K tokens [citation:5] | Primarily English | Apache 2.0 [citation:5] | 46.8% on SWE-Bench Verified (best open-source score at release), agentic LLM designed for software engineering [citation:5] |
| GLM-4.7-Flash | 19B | 2026 | New 2026 architecture [citation:9] | General purpose, vision-language tasks [citation:9] | 19GB VRAM [citation:9] | ~38GB VRAM (16-bit) | 64K tokens (65,536) [citation:9] | Multilingual | GLM License | ~46.3 tok/s on AMD Instinct MI50 [citation:9] |
| Qwen3-VL | 8B, 32B, 480B cloud | 2025 | Vision-Language | Multimodal tasks, vision-language understanding [citation:8][citation:9] | 8B: 6.1GB VRAM, 32B: 20GB VRAM [citation:9] | 32B: ~64GB VRAM (16-bit) | 64K+ tokens [citation:8] | Multilingual | Qianwen LICENSE | 8B: 61.5 tok/s, 32B: 17.3 tok/s on AMD MI50 [citation:9] |
| SmolLM2 | 1.7B | 2024 | Transformer | Rapid prototyping, learning and experimentation, simple NLP tasks, educational projects [citation:1] | 4GB RAM (4-bit) [citation:1] | 6GB RAM (16-bit) [citation:1] | 2K-8K tokens | English primarily | Apache 2.0 [citation:1] | Extremely fast iteration, runs on any modern laptop, perfect for testing pipelines [citation:1] |
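Most of the small models above can be exercised locally through Ollama. A minimal sketch using the `ollama` Python package (pip install ollama), assuming the Ollama server is running on its default port and the `llama3.2:3b` tag has already been pulled:

```python
# Minimal sketch: chatting with a small local model via the Ollama Python
# client. Assumes `ollama pull llama3.2:3b` was run and the Ollama server
# is listening on its default port.
import ollama

response = ollama.chat(
    model="llama3.2:3b",  # any tag from the table works the same way
    messages=[{"role": "user", "content": "Summarize the tradeoffs of 4-bit quantization."}],
)
print(response["message"]["content"])
```

Swapping the model tag is the only change needed to compare models from the table on the same prompt.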
Hardware Requirements Reference (Ollama System Requirements 2026)
| Component | Minimum | Recommended |
|---|---|---|
| Memory RAM | 8GB [citation:2] | 16GB [citation:2] |
| GPU VRAM (7B-13B models) | 8GB [citation:2] | 8GB+ [citation:2] |
| GPU VRAM (30B models) | 16GB [citation:2] | 24GB [citation:2] |
| GPU VRAM (65B-70B models) | 32GB [citation:2] | 48GB+ or multi-GPU [citation:2] |
| CPU | x86-64 with AVX2 support [citation:2] | 11th-gen Intel / AMD Zen 4 with AVX512, DDR5 [citation:2] |
| OS Support | macOS 11+, Linux, Windows [citation:2] | Ubuntu 24.04 LTS [citation:2] |
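Read as a lookup, the table above maps parameter count to a VRAM tier. A rough sketch of that mapping (the tier numbers are copied from the table; treating the size bands as hard cutoffs is an assumption):

```python
# Rough mapping from model size to the recommended VRAM tier in the table
# above. Tier values come from the table; the cutoff logic is an assumption
# about how to read its rows.
def recommended_vram_gb(params_billion: float) -> str:
    if params_billion <= 13:
        return "8GB+"
    if params_billion <= 30:
        return "24GB"
    return "48GB+ or multi-GPU"  # 65B-70B class

for size in (7, 30, 70):
    print(f"{size}B model -> {recommended_vram_gb(size)} VRAM recommended")
```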
Summary Notes
Small models for laptops and edge: Phi-3.5 Mini (3.8B) for long context, Llama 3.2 (1B/3B) for general use, Ministral (3B/8B) for on-device performance, Qwen 2.5 (7B) for coding, Gemma 3 (4B/12B) for quality and safety, Falcon 3 (3B) for resource-constrained setups, SmolLM2 (1.7B) for prototyping [citation:1].
Large models for production: Llama 4 Scout (109B MoE) for massive context, Llama 4 Maverick (400B MoE) for high performance, Gemma 3 27B for quality that rivals much larger models, EXAONE Deep 32B for reasoning tasks [citation:4][citation:5][citation:7].
Specialized models: Devstral for software engineering (46.8% on SWE-Bench Verified), Qwen3-VL for vision-language tasks, GLM-4.7-Flash for strong throughput on its new 2026 architecture [citation:5][citation:8][citation:9].
Coding tools integration: Ollama's launch command supports Claude Code, OpenCode, and Codex; the recommended local models are glm-4.7-flash, qwen3-coder, and gpt-oss:20b, with cloud variants up to 480B parameters [citation:8].
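These tools typically attach to the local server through Ollama's OpenAI-compatible endpoint. A sketch using the `openai` Python client; the endpoint and placeholder key follow Ollama's OpenAI-compatibility docs, and the `qwen3-coder` tag is taken from the recommendation above:

```python
# Sketch: pointing an OpenAI-compatible client at a local Ollama server,
# the same mechanism coding tools use to drive local models.
# The API key is a placeholder; Ollama ignores its value.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
completion = client.chat.completions.create(
    model="qwen3-coder",
    messages=[{"role": "user", "content": "Write a Python function that reverses a linked list."}],
)
print(completion.choices[0].message.content)
```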
Licensing note: Llama models carry commercial restrictions above 700M MAU, DeepSeek and Phi-3.5 Mini use MIT, newer Ministral releases use Apache 2.0, and Gemma ships under its own Terms of Use. Always verify the current license on the model page [citation:1][citation:4][citation:7].
Performance benchmarks 2026: Ollama 0.3.15 running Gemma 2B averages from 25 tok/s on low-tier hardware to 619 tok/s on high-tier hardware [citation:3]. On an AMD Instinct MI50, Qwen3-VL 8B achieves 61.5 tok/s and GLM-4.7-Flash 19B achieves 46.3 tok/s [citation:9].
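Comparable tokens/s figures can be measured on your own hardware from the timing fields Ollama returns with non-streaming responses. A sketch, assuming a local server and the `gemma3:4b` tag (any pulled model works):

```python
# Measuring local decode throughput from Ollama's response metadata.
# eval_count is the number of generated tokens; eval_duration is in
# nanoseconds, so tok/s = eval_count / (eval_duration / 1e9).
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "gemma3:4b", "prompt": "Explain quantization in one paragraph.", "stream": False},
    timeout=300,
).json()

print(f"{resp['eval_count'] / (resp['eval_duration'] / 1e9):.1f} tok/s")
```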
Quantization tip: 4-bit quantized models need roughly 40-60% less total RAM than full precision, because weights shrink about 4x while KV cache and runtime overhead do not, making larger models accessible on consumer hardware [citation:1][citation:2].
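The arithmetic behind the 40-60% figure can be sketched as follows; the fixed 4GB overhead for OS headroom, KV cache, and runtime buffers is an illustrative assumption, not a measured value:

```python
# Why 4-bit saves ~40-60% of total RAM rather than the naive 75%: weights
# shrink 4x (16-bit -> 4-bit), but the fixed overhead does not.
def total_ram_gb(params_billion: float, bits: int, overhead_gb: float = 4.0) -> float:
    return params_billion * bits / 8 + overhead_gb  # 1B params at 8 bits = 1 GB

for params in (3.0, 7.0):
    full, quant = total_ram_gb(params, 16), total_ram_gb(params, 4)
    print(f"{params:.0f}B model: {100 * (1 - quant / full):.0f}% less total RAM at 4-bit")
# prints roughly 45% and 58%, inside the 40-60% range quoted above
```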