Qwen (Alibaba)
What makes it distinctive
Qwen is one model family with two superpowers: a single network can switch between slow step-by-step "thinking" and fast direct answers, and the same family scales from a 0.6B phone-sized model all the way to a 235B Mixture-of-Experts (many small sub-networks, few active at once).
Built by Alibaba Cloud, Qwen takes the standard decoder-only recipe (predict-the-next-token) and adds two flourishes. First, hybrid thinking: you can ask the model to reason out loud inside a <think> block for hard problems, or skip straight to the answer for easy ones, with a thinking budget (a cap on how many reasoning tokens it may spend) that you control. Second, fine-grained MoE: many tiny experts, a handful lit per token, giving big-model quality at small-model compute cost.
Every open model shares the same set of building bricks — attention, normalization, position encoding, the feed-forward block. See How Open-Source LLMs Are Built for the shared recipe, then come back to see which bricks Qwen swaps out.
The family so far
Qwen2 (2024), then Qwen2.5 (late 2024) with a 1M-context variant, then Qwen3 (Apr–May 2025) and its "2507" updates. Qwen3.5 (Feb 2026) jumped to a 397B-A17B MoE with 262K native context, plus 122B-A10B / 35B-A3B / 27B siblings; the newest Qwen3.7 previews (May 2026) are no longer open weights.
Dense: 0.6 / 1.7 / 4 / 8 / 14 / 32B. Sparse MoE: 30B-A3B (30B total, 3B active) and 235B-A22B (235B total, 22B active, 94 layers).
Qwen3 is trained with a 32,768-token context and stretches to 131,072 (128K) with YaRN at inference. "2507" variants reach 262,144 (256K) natively; Qwen2.5-1M reaches 1,000,000.
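The stretch factor follows directly from the two context lengths: 131,072 / 32,768 = 4. The sketch below builds a matching scaling entry. It assumes the Hugging Face-style `rope_scaling` keys (`rope_type`, `factor`, `original_max_position_embeddings`) described in the Qwen3 model cards, so check the card for your exact checkpoint before relying on them. `yarn_rope_scaling` is a hypothetical helper name, not part of any library.

```python
def yarn_rope_scaling(native_len: int, target_len: int) -> dict:
    """Build a YaRN rope_scaling entry that stretches native_len to target_len."""
    if target_len <= native_len:
        raise ValueError("target_len must exceed native_len; no scaling needed")
    return {
        "rope_type": "yarn",                            # assumed key name
        "factor": target_len / native_len,              # how far to stretch
        "original_max_position_embeddings": native_len, # trained context
    }


cfg = yarn_rope_scaling(32_768, 131_072)
print(cfg["factor"])  # 4.0
```

Because the scaling is applied to every input, it is usually worth enabling only when prompts actually exceed the native 32K window.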
Qwen3 (and most of Qwen2.5) ships under Apache 2.0 — a permissive license you can use commercially. Vocabulary is ~151K tokens of byte-level BPE.
Signature: one model, two speeds of thought
Most model families ship a "chat" model and a separate "reasoning" model. Qwen3 folds both into one set of weights. In non-thinking mode the prompt goes straight to an answer — fast and cheap. In thinking mode the model first writes a visible chain of intermediate <think> steps, then the answer. A thinking budget lets you cap how long it reasons: more steps for a tricky proof, fewer for a quick fact.
Non-thinking = prompt straight to answer. Thinking = a chain of <think> steps first; the budget caps how many.
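The two modes differ only in whether a reasoning block precedes the answer, so client code can treat them uniformly. Below is a minimal sketch, assuming the reasoning arrives wrapped in literal `<think>` and `</think>` tags; `split_thinking` and `enforce_budget` are hypothetical helpers. The budget function shows one way a serving layer could cut reasoning short, not necessarily how Qwen implements it.

```python
THINK_OPEN, THINK_CLOSE = "<think>", "</think>"


def split_thinking(text: str) -> tuple[str, str]:
    """Return (reasoning, answer); reasoning is empty in non-thinking mode."""
    start = text.find(THINK_OPEN)
    end = text.find(THINK_CLOSE)
    if start == -1 or end == -1 or end < start:
        return "", text.strip()
    reasoning = text[start + len(THINK_OPEN):end].strip()
    answer = text[end + len(THINK_CLOSE):].strip()
    return reasoning, answer


def enforce_budget(reasoning_tokens: list[str], budget: int) -> list[str]:
    """Keep at most `budget` reasoning tokens and close the think block,
    so generation can continue straight into the answer."""
    return [THINK_OPEN, *reasoning_tokens[:budget], THINK_CLOSE]


reasoning, answer = split_thinking("<think>2 + 2 is 4</think>\n\n4")
print(repr(reasoning), repr(answer))  # '2 + 2 is 4' '4'
print(split_thinking("Paris"))        # ('', 'Paris')
```

Everything inside the reasoning string is generated text, so its length is a direct measure of the extra latency and cost that thinking mode adds.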
Under the hood: fine-grained MoE with no shared expert
The big Qwen3 models are sparse. Instead of one fat feed-forward block per layer, there are 128 small experts; a tiny router scores them all for each token and lights up only the top 8. So Qwen3-235B-A22B holds 235B parameters but runs just 22B per token — the quality of a giant, the bill of a midsize model.
Unlike DeepSeek's fine-grained MoE, Qwen3 uses no always-on shared expert; every expert is chosen by the router, and training adds a global-batch load-balancing loss so no expert sits idle.
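The routing step for a single token can be sketched in a few lines. This is an illustrative sketch, not Qwen's implementation: it assumes the router emits one raw score (logit) per expert and that the weights of the chosen experts are a softmax over the selected scores only. `route_token` is a hypothetical name, and the random logits stand in for a real router's output.

```python
import math
import random


def route_token(router_logits: list[float], top_k: int = 8) -> dict[int, float]:
    """Pick the top_k experts for one token and return {expert_id: weight}.

    Weights are a softmax over the selected logits, so they sum to 1.
    """
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:top_k]
    peak = max(router_logits[i] for i in chosen)  # subtract for numerical stability
    exps = {i: math.exp(router_logits[i] - peak) for i in chosen}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


random.seed(0)
logits = [random.gauss(0.0, 1.0) for _ in range(128)]  # one score per expert
weights = route_token(logits)
print(len(weights))  # 8 experts active out of 128
```

The token's output is then the weighted sum of the eight chosen experts' outputs; the other 120 experts do no work for that token, which is where the compute saving comes from.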
vs the shared recipe
Qwen keeps almost every standard brick of a modern decoder-only transformer, then tweaks a few for stability and reach.
- Decoder-only next-token prediction — the usual backbone.
- RMSNorm pre-norm — normalize before each block for stable training.
- SwiGLU feed-forward — the modern gated activation.
- RoPE rotary positions, with GQA (grouped query attention) at every size to shrink the KV cache.
- Byte-level BPE tokenizer with a large ~151K multilingual vocabulary.
- Adds QK-Norm — normalizes the query and key inside attention for steadier large-scale training.
- Removes the QKV bias that Qwen2/2.5 carried in attention.
- Fine-grained MoE (128 experts, top-8) on big models, but no shared expert (unlike DeepSeek) — relies on a load-balancing loss instead.
- Long context is bolted on: base-frequency scaling (RoPE base 10,000 → 1,000,000) plus YaRN at inference to stretch 32K toward 128K.
- Dense models have no MoE: expert routing appears only in the 30B-A3B and 235B-A22B variants.
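To make the QK-Norm item concrete, here is a minimal sketch of one attention score with and without it. It assumes an RMSNorm applied to the query and key vectors of a single head and omits the learned gain for brevity; `rms_norm` and `attention_score` are hypothetical names, not Qwen code.

```python
import math


def rms_norm(vec: list[float], eps: float = 1e-6) -> list[float]:
    """Scale vec so its root-mean-square is about 1 (learned gain omitted)."""
    rms = math.sqrt(sum(x * x for x in vec) / len(vec) + eps)
    return [x / rms for x in vec]


def attention_score(q: list[float], k: list[float], qk_norm: bool = True) -> float:
    """Scaled dot-product score for one query/key pair of one head."""
    if qk_norm:
        q, k = rms_norm(q), rms_norm(k)
    return sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q))


q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.0]
big_q = [100.0 * x for x in q]  # same direction, 100x the magnitude

# Without QK-Norm the score grows with the magnitude of q.
print(attention_score(q, k, qk_norm=False), attention_score(big_q, k, qk_norm=False))
# With QK-Norm both queries give nearly the same score.
print(attention_score(q, k), attention_score(big_q, k))
```

Keeping the score independent of vector magnitude is what stops attention logits from blowing up as activations grow during large-scale training.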
Gotchas / good to know
- Native vs stretched context. Qwen3 is trained at 32K; 128K comes from YaRN scaling at inference, and very long inputs can lose some quality. Only the "2507" (256K) and Qwen2.5-1M variants are built for truly long documents.
- Thinking mode costs tokens. Every <think> step is generated text you pay for in latency and money. Set a thinking budget; don't leave it on for simple chat.
- Active ≠ total parameters. 235B-A22B needs memory for all 235B weights even though only 22B run per token: MoE saves compute, not VRAM.
- No shared expert. Routing quality leans entirely on the load-balancing loss; a poorly balanced router can leave experts under-trained, so don't assume MoE always beats a same-size dense model.
- Generation-specific details. QK-Norm and the dropped QKV bias are Qwen3 changes — Qwen2/2.5 behave differently, so port configs carefully.
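The "active ≠ total" point above is easy to check with back-of-the-envelope arithmetic. The sketch below assumes 2 bytes per parameter (bf16 or fp16) and counts weights only, ignoring the KV cache, activations, and any quantization; both function names are hypothetical.

```python
def weight_memory_gb(total_params_b: float, bytes_per_param: float = 2.0) -> float:
    """Memory for the weights alone, in GB (1e9 bytes).

    total_params_b is in billions; 2 bytes/param assumes bf16 or fp16.
    """
    return total_params_b * bytes_per_param


def active_fraction(total_params_b: float, active_params_b: float) -> float:
    """Share of the parameters that actually run for each token."""
    return active_params_b / total_params_b


print(weight_memory_gb(235))                # 470.0 GB must be resident
print(round(active_fraction(235, 22), 3))   # 0.094, under 10% run per token
```

So under these assumptions the 235B-A22B model needs roughly the memory of a 235B dense model while spending per-token compute closer to a 22B one.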
Related
How Open-Source LLMs Are Built
The shared recipe: attention, normalization, positions, MoE — the bricks every model reuses.
DeepSeek
The other fine-grained MoE family — and the one Qwen contrasts with: DeepSeek keeps an always-on shared expert.
Llama
A dense-only counterpart with the same RoPE + GQA + SwiGLU bricks — handy for an apples-to-apples compare.
What Is an LLM?
Start here if next-token prediction, tokens, and context windows are new.