LLM models comparison
| Model Name | Parameter Sizes | Release/Peak Year | Architecture Type | Best For / Strengths | Hardware Requirements (Quantized) | Hardware Requirements (Full Precision) | Context Window | Multilingual Support | License | Special Features |
|---|---|---|---|---|---|---|---|---|---|---|
| Phi-3.5 Mini | 3.8B | 2024 | Transformer decoder-only | Long-context reasoning, RAG applications, reading PDFs, technical documentation, code generation and debugging, multilingual tasks [citation:1] | 6-10GB RAM (4-bit) [citation:1] | 16GB RAM (16-bit) [citation:1] | 128K tokens (book-length inputs) [citation:1] | Yes [citation:1] | MIT [citation:1] | Handles very long inputs, strong for document-heavy workflows [citation:1] |
| Llama 3.2 | 1B, 3B | 2024 | Transformer | General chat and Q&A, document summarization, text classification, customer support automation [citation:1] | 1B: 2-4GB RAM (4-bit), 3B: 6GB RAM (4-bit) [citation:1] | 1B: 4-6GB RAM (16-bit), 3B: 12GB RAM (16-bit) [citation:1] | 128K tokens [citation:1] | 8 languages (English, German, French, Italian, Portuguese, Hindi, Spanish, Thai) [citation:1] | Llama Community License (commercial use under 700M MAU) [citation:7] | 1B runs on phones and mobile devices, 3B is balanced all-rounder [citation:1] |
| Ministral 3 / Ministral | 3B, 8B | 2025 | Transformer (dense) | Complex reasoning tasks, multi-turn conversations, code generation, tasks requiring nuanced understanding, on-device deployment [citation:1][citation:7] | 8B: 10GB RAM (4-bit) [citation:1] | 8B: 20GB RAM (16-bit) [citation:1] | 128K tokens (largest models) [citation:7] | Yes, strong multilingual [citation:7] | Apache 2.0 (newer releases), Mistral Research License (older) [citation:1] | Native function calling works without special prompting, sub-500 ms responses on phones [citation:7] |
| Qwen 2.5 / Qwen 3 | 0.5B, 1.5B, 3B, 7B, 14B, 32B, 72B | 2024-2025 | Transformer | Code generation and completion, mathematical reasoning, technical documentation, multilingual tasks (especially Chinese/English) [citation:1] | 7B: 8GB RAM (4-bit) [citation:1] | 7B: 16GB RAM (16-bit) [citation:1] | 128K tokens (most variants) [citation:5] | 29+ languages; Coder variants cover 92+ programming languages [citation:5] | Apache 2.0 or Qianwen LICENSE [citation:1] | Dominates coding and math benchmarks in its size class [citation:1] |
| Gemma 3 / Gemma 3n | 270M, 1B, 4B, 12B, 27B | 2025 | Transformer with 5-to-1 interleaved local/global attention [citation:7] | Complex instruction-following, tasks requiring careful safety handling, general knowledge Q&A, content moderation, multimodal (4B and up) [citation:1][citation:7] | 12B: 12GB RAM (4-bit) [citation:1], 4B multimodal fits mobile-class hardware | 12B: 24GB RAM (16-bit) [citation:1] | 128K tokens [citation:7] | 140+ languages [citation:7] | Gemma Terms of Use [citation:1] | ShieldGemma 2 filters harmful image content, 270M uses 0.75% battery for 25 conversations on a Pixel 9 Pro [citation:7] |
| Llama 4 Scout | 109B total, 17B active (MoE) | 2025 | Mixture-of-Experts (16 experts) [citation:4] | General purpose AI, massive-context applications [citation:4] | Fits on a single H100 (int4) [citation:4] | ~218GB (16-bit weights; multi-GPU) | 10 million tokens [citation:4] | Pretrained on 200 languages; instruction-tuned for 12 (Arabic, English, French, German, Hindi, Indonesian, Italian, Portuguese, Spanish, Tagalog, Thai, Vietnamese) [citation:4] | Llama 4 Community License (commercial use under 700M MAU) [citation:4] | Multimodal (text+image in, text out) [citation:4] |
| Llama 4 Maverick | 400B total, 17B active (MoE) | 2025 | Mixture-of-Experts (128 experts) [citation:4] | High-performance general AI, production deployments [citation:4] | Requires multiple GPUs [citation:4] | Multiple H100 GPUs | 1 million tokens [citation:4] | Pretrained on 200 languages; instruction-tuned for 12 [citation:4] | Llama 4 Community License [citation:4] | Powers Meta AI across WhatsApp, Messenger, and Instagram; beats GPT-4o and Gemini 2.0 Flash on reported benchmarks [citation:4] |
| Falcon 3 | 1B, 3B, 7B, 10B | 2024-2025 | Transformer / Mamba (Falcon3-Mamba variant) [citation:7] | Resource-constrained deployment, runs on laptops [citation:7] | 3B runs on a MacBook Air [citation:7] | 7B: 16GB RAM | 8K-32K tokens | English, French, Spanish, Portuguese [citation:7] | TII Falcon License (free for research/commercial) [citation:7] | Falcon3-Mamba uses State Space Models for faster long-sequence inference, trained on 14 trillion tokens [citation:7] |
| EXAONE Deep | 2.4B, 7.8B, 32B | 2025 | Transformer | Reasoning-enhanced tasks, math, coding benchmarks [citation:5] | 7.8B: ~8GB RAM (4-bit, estimated) | 7.8B: ~16GB RAM (16-bit, estimated) | 32K tokens | Primarily English/Korean | EXAONE AI Model License (research-focused) | 2.4B outperforms similarly sized models and 7.8B surpasses OpenAI o1-mini [citation:5]; AWQ and GGUF quantized weights available, runs with llama.cpp and Ollama [citation:5] |
| Devstral | Based on Mistral-Small-3.1 | 2025 | Transformer | Software engineering tasks, navigating complex codebases, editing multiple files, resolving real-world issues [citation:5] | Mac with 32GB RAM or RTX 4090 [citation:5] | High-end consumer GPU [citation:5] | 128K tokens [citation:5] | Primarily English | Apache 2.0 [citation:5] | 46.8% on SWE-Bench Verified (best open-source score at release), agentic LLM designed for software engineering [citation:5] |
| GLM-4.7-Flash | 19B | 2026 | New 2026 architecture [citation:9] | General purpose, vision-language tasks [citation:9] | 19GB VRAM [citation:9] | ~38GB VRAM (16-bit) | 64K tokens (65,536) [citation:9] | Multilingual | GLM License | ~46.3 tok/s on AMD Instinct MI50 [citation:9] |
| Qwen3-VL | 8B, 32B, 480B cloud | 2025 | Vision-Language | Multimodal tasks, vision-language understanding [citation:8][citation:9] | 8B: 6.1GB VRAM, 32B: 20GB VRAM [citation:9] | 32B: ~64GB VRAM (16-bit) | 64K+ tokens [citation:8] | Multilingual | Qianwen LICENSE | 8B: 61.5 tok/s, 32B: 17.3 tok/s on AMD MI50 [citation:9] |
| SmolLM2 | 1.7B | 2024 | Transformer | Rapid prototyping, learning and experimentation, simple NLP tasks, educational projects [citation:1] | 4GB RAM (4-bit) [citation:1] | 6GB RAM (16-bit) [citation:1] | 2K-8K tokens | English primarily | Apache 2.0 [citation:1] | Extremely fast iteration, runs on any modern laptop, perfect for testing pipelines [citation:1] |
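Most of the small models above can be exercised locally through Ollama. A minimal sketch using the `ollama` Python package (pip install ollama), assuming the Ollama server is running on its default port and the `llama3.2:3b` tag has already been pulled:

```python
# Minimal sketch: chatting with a small local model via the Ollama Python
# client. Assumes `ollama pull llama3.2:3b` was run and the Ollama server
# is listening on its default port.
import ollama

response = ollama.chat(
    model="llama3.2:3b",  # any tag from the table works the same way
    messages=[{"role": "user", "content": "Summarize the tradeoffs of 4-bit quantization."}],
)
print(response["message"]["content"])
```

Swapping the model tag is the only change needed to compare models from the table on the same prompt.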
Hardware Requirements Reference (Ollama System Requirements 2026)
| Component | Minimum | Recommended |
|---|---|---|
| Memory RAM | 8GB [citation:2] | 16GB [citation:2] |
| GPU VRAM (7B-13B models) | 8GB [citation:2] | 8GB+ [citation:2] |
| GPU VRAM (30B models) | 16GB [citation:2] | 24GB [citation:2] |
| GPU VRAM (65B-70B models) | 32GB [citation:2] | 48GB+ or multi-GPU [citation:2] |
| CPU | x86-64 with AVX2 support [citation:2] | 11th-gen Intel / AMD Zen 4 with AVX512, DDR5 [citation:2] |
| OS Support | macOS 11+, Linux, Windows [citation:2] | Ubuntu 24.04 LTS [citation:2] |
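Read as a lookup, the table above maps parameter count to a VRAM tier. A rough sketch of that mapping (the tier numbers are copied from the table; treating the size bands as hard cutoffs is an assumption):

```python
# Rough mapping from model size to the recommended VRAM tier in the table
# above. Tier values come from the table; the cutoff logic is an assumption
# about how to read its rows.
def recommended_vram_gb(params_billion: float) -> str:
    if params_billion <= 13:
        return "8GB+"
    if params_billion <= 30:
        return "24GB"
    return "48GB+ or multi-GPU"  # 65B-70B class

for size in (7, 30, 70):
    print(f"{size}B model -> {recommended_vram_gb(size)} VRAM recommended")
```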
Summary Notes
Small models for laptops and edge: Phi-3.5 Mini (3.8B) for long context, Llama 3.2 (1B/3B) for general use, Ministral (3B/8B) for on-device performance, Qwen 2.5 (7B) for coding, Gemma 3 (4B/12B) for quality and safety, Falcon 3 (3B) for resource-constrained setups, SmolLM2 (1.7B) for prototyping [citation:1].
Large models for production: Llama 4 Scout (109B MoE) for massive context, Llama 4 Maverick (400B MoE) for high performance, Gemma 3 27B for quality that rivals much larger models, EXAONE Deep 32B for reasoning tasks [citation:4][citation:5][citation:7].
Specialized models: Devstral for software engineering (46.8% on SWE-Bench Verified), Qwen3-VL for vision-language tasks, GLM-4.7-Flash for strong throughput on its new 2026 architecture [citation:5][citation:8][citation:9].
Coding tools integration: Ollama's launch command supports Claude Code, OpenCode, and Codex; the recommended local models are glm-4.7-flash, qwen3-coder, and gpt-oss:20b, with cloud variants up to 480B parameters [citation:8].
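These tools typically attach to the local server through Ollama's OpenAI-compatible endpoint. A sketch using the `openai` Python client; the endpoint and placeholder key follow Ollama's OpenAI-compatibility docs, and the `qwen3-coder` tag is taken from the recommendation above:

```python
# Sketch: pointing an OpenAI-compatible client at a local Ollama server,
# the same mechanism coding tools use to drive local models.
# The API key is a placeholder; Ollama ignores its value.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
completion = client.chat.completions.create(
    model="qwen3-coder",
    messages=[{"role": "user", "content": "Write a Python function that reverses a linked list."}],
)
print(completion.choices[0].message.content)
```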
Licensing note: Llama models carry commercial restrictions above 700M MAU, DeepSeek and Phi-3.5 Mini use MIT, newer Ministral releases use Apache 2.0, and Gemma ships under its own Terms of Use. Always verify the current license on the model page [citation:1][citation:4][citation:7].
Performance benchmarks 2026: Ollama 0.3.15 running Gemma 2B averages from 25 tok/s on low-tier hardware to 619 tok/s on high-tier hardware [citation:3]. On an AMD Instinct MI50, Qwen3-VL 8B achieves 61.5 tok/s and GLM-4.7-Flash 19B achieves 46.3 tok/s [citation:9].
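Comparable tokens/s figures can be measured on your own hardware from the timing fields Ollama returns with non-streaming responses. A sketch, assuming a local server and the `gemma3:4b` tag (any pulled model works):

```python
# Measuring local decode throughput from Ollama's response metadata.
# eval_count is the number of generated tokens; eval_duration is in
# nanoseconds, so tok/s = eval_count / (eval_duration / 1e9).
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "gemma3:4b", "prompt": "Explain quantization in one paragraph.", "stream": False},
    timeout=300,
).json()

print(f"{resp['eval_count'] / (resp['eval_duration'] / 1e9):.1f} tok/s")
```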
Quantization tip: 4-bit quantized models need roughly 40-60% less total RAM than full precision, because weights shrink about 4x while KV cache and runtime overhead do not, making larger models accessible on consumer hardware [citation:1][citation:2].
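The arithmetic behind the 40-60% figure can be sketched as follows; the fixed 4GB overhead for OS headroom, KV cache, and runtime buffers is an illustrative assumption, not a measured value:

```python
# Why 4-bit saves ~40-60% of total RAM rather than the naive 75%: weights
# shrink 4x (16-bit -> 4-bit), but the fixed overhead does not.
def total_ram_gb(params_billion: float, bits: int, overhead_gb: float = 4.0) -> float:
    return params_billion * bits / 8 + overhead_gb  # 1B params at 8 bits = 1 GB

for params in (3.0, 7.0):
    full, quant = total_ram_gb(params, 16), total_ram_gb(params, 4)
    print(f"{params:.0f}B model: {100 * (1 - quant / full):.0f}% less total RAM at 4-bit")
# prints roughly 45% and 58%, inside the 40-60% range quoted above
```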