NOTES (2026)

LLM models comparison

Model Name | Parameter Sizes | Release/Peak Year | Architecture Type | Best For / Strengths | Hardware Requirements (Quantized) | Hardware Requirements (Full Precision) | Context Window | Multilingual Support | License | Special Features
Phi-3.5 Mini | 3.8B | 2024 | Transformer decoder-only | Long-context reasoning, RAG applications, reading PDFs, technical documentation, code generation and debugging, multilingual tasks [citation:1] | 6-10GB RAM (4-bit) [citation:1] | 16GB RAM (16-bit) [citation:1] | 128K tokens [citation:1] | Yes [citation:1] | MIT [citation:1] | Handles very long inputs; strong for document-heavy workflows [citation:1]
Llama 3.2 | 1B, 3B | 2024 | Transformer | General chat and Q&A, document summarization, text classification, customer support automation [citation:1] | 1B: 2-4GB RAM (4-bit), 3B: 6GB RAM (4-bit) [citation:1] | 1B: 4-6GB RAM (16-bit), 3B: 12GB RAM (16-bit) [citation:1] | 128K tokens [citation:1] | 8 languages (English, German, French, Italian, Portuguese, Hindi, Spanish, Thai) [citation:1] | Llama Community License (commercial use under 700M MAU) [citation:7] | 1B runs on phones and mobile devices; 3B is a balanced all-rounder [citation:1]
Ministral 3 / Ministral | 3B, 8B | 2025 | Mixture-of-Experts (MoE) | Complex reasoning tasks, multi-turn conversations, code generation, tasks requiring nuanced understanding, on-device deployment [citation:1][citation:7] | 8B: 10GB RAM (4-bit) [citation:1] | 8B: 20GB RAM (16-bit) [citation:1] | 128K tokens (largest models) [citation:7] | Yes, strong multilingual [citation:7] | Apache 2.0 (newer releases), Mistral Research License (older) [citation:1] | Native function calling works without special prompting; runs on phones with sub-500ms responses [citation:7]
Qwen 2.5 / Qwen 3 | 0.5B, 1.5B, 3B, 7B, 14B, 32B, 72B (Qwen 2.5); 0.6B-32B dense (Qwen 3) | 2024-2025 | Transformer | Code generation and completion, mathematical reasoning, technical documentation, multilingual tasks (especially Chinese/English) [citation:1] | 7B: 8GB RAM (4-bit) [citation:1] | 7B: 16GB RAM (16-bit) [citation:1] | 128K tokens (most variants) [citation:5] | 29+ languages; Coder variants cover 92+ programming languages [citation:5] | Apache 2.0 or Qianwen LICENSE [citation:1] | Dominates coding and math benchmarks in its size class [citation:1]
Gemma 3 / Gemma 3n | 270M, 1B, 4B, 12B, 27B | 2025-2026 | Transformer with 5-to-1 interleaved local/global attention [citation:7] | Complex instruction-following, tasks requiring careful safety handling, general knowledge Q&A, content moderation, multimodal (4B and up) [citation:1][citation:7] | 12B: 12GB RAM (4-bit) [citation:1]; 4B multimodal fits mobile | 12B: 24GB RAM (16-bit) [citation:1] | 128K tokens [citation:7] | 140+ languages [citation:7] | Gemma Terms of Use [citation:1] | ShieldGemma 2 filters harmful image content; 270M uses 0.75% battery for 25 conversations on a Pixel 9 Pro [citation:7]
Llama 4 Scout | 109B total, 17B active (MoE) | 2025 | Mixture-of-Experts (16 experts) [citation:4] | General-purpose AI, massive-context applications [citation:4] | Fits on a single H100 (int4) [citation:4] | ~218GB (BF16: 109B × 2 bytes), multi-GPU | 10 million tokens [citation:4] | 200 languages in training, fine-tuned for 12 (Arabic, English, French, German, Hindi, Indonesian, Italian, Portuguese, Spanish, Tagalog, Thai, Vietnamese) [citation:4] | Llama 4 Community License (commercial use under 700M MAU) [citation:4] | Multimodal (text+image in, text out) [citation:4]
Llama 4 Maverick | 400B total, 17B active (MoE) | 2025 | Mixture-of-Experts (128 experts) [citation:4] | High-performance general AI, production deployments [citation:4] | Requires multiple GPUs [citation:4] | Multiple H100 GPUs (~800GB at BF16) | 1 million tokens [citation:4] | 200 languages in training, fine-tuned for 12 [citation:4] | Llama 4 Community License [citation:4] | Used internally for WhatsApp, Messenger, and Instagram; beats GPT-4o and Gemini 2.0 Flash on benchmarks [citation:4]
Falcon 3 | 1B, 3B, 7B, 10B | 2024 | Transformer / Mamba (Falcon3-Mamba variant) [citation:7] | Resource-constrained deployment; runs on laptops [citation:7] | 3B runs on a MacBook Air [citation:7] | 7B: 16GB RAM | 8K-32K tokens | English, French, Spanish, Portuguese [citation:7] | TII Falcon License (free for research/commercial) [citation:7] | Falcon3-Mamba uses State Space Models for faster long-sequence inference; trained on 14 trillion tokens [citation:7]
EXAONE Deep | 2.4B, 7.8B, 32B | 2025 | Transformer | Reasoning-enhanced tasks, math, coding benchmarks; 2.4B outperforms similarly sized models, 7.8B surpasses OpenAI o1-mini [citation:5] | AWQ and GGUF quantized weights available [citation:5] | — | 32K tokens | Primarily English/Korean | Apache 2.0 [citation:5] | Runs with llama.cpp and Ollama [citation:5]
Devstral | 24B (based on Mistral-Small-3.1) | 2025 | Transformer | Software engineering tasks: navigating complex codebases, editing multiple files, resolving real-world issues [citation:5] | Mac with 32GB RAM or RTX 4090 [citation:5] | High-end consumer GPU [citation:5] | 128K tokens [citation:5] | Primarily English | Apache 2.0 [citation:5] | 46.8% on SWE-Bench Verified (best open source); agentic LLM designed for software engineering [citation:5]
GLM-4.7-Flash | 19B | 2026 | New 2026 architecture [citation:9] | General purpose, vision-language tasks [citation:9] | 19GB VRAM [citation:9] | ~20GB VRAM | 64K (65,536) tokens [citation:9] | Multilingual | GLM License | ~46.3 tok/s on AMD Instinct MI50 [citation:9]
Qwen3-VL | 8B, 32B (local); 480B (cloud) | 2026 | Vision-Language | Multimodal tasks, vision-language understanding [citation:8][citation:9] | 8B: 6.1GB VRAM [citation:9] | 32B: 20GB VRAM [citation:9] | 64K+ tokens [citation:8] | Multilingual | Qianwen LICENSE | 8B: 61.5 tok/s, 32B: 17.3 tok/s on AMD MI50 [citation:9]
SmolLM2 | 1.7B | 2024 | Transformer | Rapid prototyping, learning and experimentation, simple NLP tasks, educational projects [citation:1] | 4GB RAM (4-bit) [citation:1] | 6GB RAM (16-bit) [citation:1] | 8K tokens | Primarily English | Apache 2.0 [citation:1] | Extremely fast iteration; runs on any modern laptop; perfect for testing pipelines [citation:1]
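
A quick way to try any of the small models above is the ollama Python client (pip install ollama). A minimal sketch, assuming a local Ollama daemon is running and using the llama3.2:3b library tag:

    import ollama

    # Download the default 4-bit quantized build if it is not cached yet.
    ollama.pull("llama3.2:3b")

    # One chat turn; dict-style access works across client versions.
    reply = ollama.chat(
        model="llama3.2:3b",
        messages=[{"role": "user", "content": "Give three use cases for a 3B model."}],
    )
    print(reply["message"]["content"])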

Hardware Requirements Reference (Ollama System Requirements 2026)

Component | Minimum | Recommended
System RAM | 8GB [citation:2] | 16GB [citation:2]
GPU VRAM (7B-13B models) | 8GB [citation:2] | 8GB+ [citation:2]
GPU VRAM (30B models) | 16GB [citation:2] | 24GB [citation:2]
GPU VRAM (65B-70B models) | 32GB [citation:2] | 48GB+ or multi-GPU [citation:2]
CPU | x86-64 with AVX2 support [citation:2] | 11th-gen Intel / AMD Zen 4 with AVX-512, DDR5 [citation:2]
OS Support | macOS 11+, Linux, Windows [citation:2] | Ubuntu 24.04 LTS [citation:2]
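
Encoded as data, the VRAM tiers above make a handy pre-flight check before pulling a model. A minimal sketch in Python; the tier boundaries come from the table, the helper itself is illustrative:

    # (max params in billions, minimum VRAM GB, recommended VRAM GB)
    VRAM_TIERS = [(13, 8, 8), (30, 16, 24), (70, 32, 48)]

    def vram_needed(params_b: float) -> tuple[int, int]:
        """Return (minimum, recommended) VRAM in GB for a given model size."""
        for max_params, minimum, recommended in VRAM_TIERS:
            if params_b <= max_params:
                return minimum, recommended
        raise ValueError("above 70B: 48GB+ or multi-GPU territory")

    print(vram_needed(7))   # -> (8, 8)
    print(vram_needed(70))  # -> (32, 48)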

Summary Notes

Small models for laptops and edge: Phi-3.5 Mini (3.8B) for long context, Llama 3.2 (1B/3B) for general use, Ministral (3B/8B) for on-device performance, Qwen 2.5 (7B) for coding, Gemma 3 (4B/12B) for quality and safety, Falcon 3 (3B) for resource-constrained deployments, SmolLM2 (1.7B) for prototyping [citation:1].

Large models for production: Llama 4 Scout (109B MoE) for massive context, Llama 4 Maverick (400B MoE) for high performance, Gemma 3 (27B), which competes with much larger models, and EXAONE Deep (32B) for reasoning tasks [citation:4][citation:5][citation:7].

Specialized models: Devstral for software engineering (46.8% on SWE-Bench Verified), Qwen3-VL for vision-language tasks, GLM-4.7-Flash for fast inference on its new 2026 architecture [citation:5][citation:8][citation:9].

Coding tools integration: the Ollama launch command supports Claude Code, OpenCode, and Codex, with glm-4.7-flash, qwen3-coder, and gpt-oss:20b recommended for local use, plus cloud variants up to 480B parameters [citation:8].
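
Because Ollama exposes an OpenAI-compatible API on port 11434, these coding tools (and any OpenAI-style client) can point at a local model. A minimal sketch using the openai package and the gpt-oss:20b tag from the list above, assuming it has been pulled locally:

    from openai import OpenAI

    # Ollama serves an OpenAI-compatible endpoint at /v1; the API key is ignored.
    client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

    resp = client.chat.completions.create(
        model="gpt-oss:20b",  # any locally pulled Ollama tag works here
        messages=[{"role": "user", "content": "Write a binary search in Python."}],
    )
    print(resp.choices[0].message.content)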

Licensing note: Llama models have commercial restrictions over 700M MAU, DeepSeek uses MIT, Ministral newer releases use Apache 2.0, Gemma has usage terms. Always verify current license on model page [citation:1][citation:4][citation:7].

Performance benchmarks 2026: Ollama 0.3.15 with Gemma 2B shows average throughput ranging from 25 tok/s on low-tier hardware to 619 tok/s on high-tier hardware [citation:3]. On an AMD Instinct MI50, Qwen3-VL 8B achieves 61.5 tok/s and GLM-4.7-Flash 19B achieves 46.3 tok/s [citation:9].
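
Throughput on your own hardware is easy to reproduce: Ollama's generate response reports eval_count (tokens generated) and eval_duration (nanoseconds). A minimal probe, assuming a Gemma 2B tag (here gemma2:2b) is pulled locally:

    import ollama

    resp = ollama.generate(model="gemma2:2b", prompt="Explain KV caching briefly.")
    # eval_count = generated tokens; eval_duration = generation time in ns.
    print(f"{resp['eval_count'] / (resp['eval_duration'] / 1e9):.1f} tok/s")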

Quantization tip: 4-bit quantized models need roughly 40-60% less total RAM than full precision; the weights themselves shrink about 4x, but KV cache and runtime overhead do not, so the overall footprint drops by about half and larger models become accessible on consumer hardware [citation:1][citation:2].
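
The arithmetic behind that tip: a 16-bit weight takes 2 bytes per parameter and a 4-bit weight 0.5 bytes, while KV cache and runtime buffers do not shrink with quantization. A back-of-the-envelope sketch; the 20% overhead factor is an assumption and grows with context length:

    def model_ram_gb(params_b: float, bits: int, overhead: float = 1.2) -> float:
        """Rough footprint: weights at bits/8 bytes per parameter,
        plus ~20% for KV cache and runtime buffers (assumed)."""
        return params_b * (bits / 8) * overhead

    print(f"{model_ram_gb(7, 16):.1f} GB")  # 7B @ 16-bit -> 16.8 GB
    print(f"{model_ram_gb(7, 4):.1f} GB")   # 7B @ 4-bit  -> 4.2 GB

The table's 8GB figure for a 7B 4-bit model adds OS and application headroom on top of this estimate.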



