Qwen (Alibaba): One Family, Dense and MoE, with Hybrid Thinking

What makes it distinctive

Qwen is one model family with two superpowers: a single network can switch between slow step-by-step "thinking" and fast direct answers, and the same family scales from a 0.6B phone-sized model all the way to a 235B Mixture-of-Experts (many small sub-networks, few active at once).

Built by Alibaba Cloud, Qwen takes the standard decoder-only recipe (predict the next token) and adds two flourishes. First, hybrid thinking: you can ask the model to reason out loud inside <think> tags for hard problems, or skip straight to the answer for easy ones, with a thinking budget (a cap on reasoning tokens) that you control. Second, fine-grained MoE: many tiny experts, a handful lit per token, giving big-model quality at small-model compute cost.

New to how these models are assembled?

Every open model shares the same set of building bricks — attention, normalization, position encoding, the feed-forward block. See How Open-Source LLMs Are Built for the shared recipe, then come back to see which bricks Qwen swaps out.

The family so far

Timeline: Qwen2 → Qwen3.5

Qwen2 (2024), then Qwen2.5 (late 2024) with a 1M-context variant, then Qwen3 (Apr–May 2025) and its "2507" updates. Qwen3.5 (Feb 2026) jumped to a 397B-A17B MoE with 262K native context, plus 122B-A10B / 35B-A3B / 27B siblings; the newest Qwen3.7 previews (May 2026) are no longer open weights.

Sizes (Qwen3): 0.6B → 235B

Dense: 0.6 / 1.7 / 4 / 8 / 14 / 32B. Sparse MoE: 30B-A3B (30B total, 3B active) and 235B-A22B (235B total, 22B active, 94 layers).

Context: 32K → 1M

Qwen3 trains at 32,768 tokens, stretches to 131,072 (128K) with YaRN. "2507" variants reach 262,144 (256K) natively; Qwen2.5-1M reaches 1,000,000.
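The stretch from 32,768 to 131,072 tokens corresponds to a YaRN scaling factor of 4. A minimal sketch of that arithmetic follows. The `rope_scaling` key names follow the style of Hugging Face configs but are an assumption here, so check the model card for the runtime you use.

```python
NATIVE_CONTEXT = 32_768    # tokens Qwen3 is trained at (from the text above)
TARGET_CONTEXT = 131_072   # tokens reachable with YaRN (from the text above)


def yarn_factor(native: int, target: int) -> float:
    """Scaling factor needed to stretch `native` positions to `target`."""
    if target < native:
        raise ValueError("target context must be at least the native context")
    return target / native


# Assumed config fragment: key names are illustrative, not verified
# against any particular inference runtime.
rope_scaling = {
    "rope_type": "yarn",
    "factor": yarn_factor(NATIVE_CONTEXT, TARGET_CONTEXT),
    "original_max_position_embeddings": NATIVE_CONTEXT,
}

print(rope_scaling["factor"])
```

Because the stretch is applied at inference time, it makes sense to enable it only when prompts actually exceed the native window.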

License: Apache 2.0

Qwen3 (and most of Qwen2.5) ships under Apache 2.0 — a permissive license you can use commercially. Vocabulary is ~151K tokens of byte-level BPE.

Signature: one model, two speeds of thought

Most model families ship a "chat" model and a separate "reasoning" model. Qwen3 folds both into one set of weights. (The original Qwen3 release, that is — the July 2025 "2507" refresh dropped hybrid mode and ships separate Instruct and Thinking models.) In non-thinking mode the prompt goes straight to an answer — fast and cheap. In thinking mode the model first writes a visible chain of intermediate <think> steps, then the answer. A thinking budget lets you cap how long it reasons: more steps for a tricky proof, fewer for a quick fact.
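To make the two modes concrete, here is a small sketch that separates a thinking-mode response into its reasoning and its answer, plus a toy budget cap. The helper names `split_thinking` and `cap_reasoning` are hypothetical, and the budget here counts whitespace-separated words purely for illustration; a real budget would count model tokens.

```python
OPEN_TAG, CLOSE_TAG = "<think>", "</think>"


def split_thinking(text: str) -> tuple[str, str]:
    """Return (reasoning, answer); reasoning is empty for non-thinking output."""
    start = text.find(OPEN_TAG)
    end = text.find(CLOSE_TAG)
    if start == -1 or end == -1 or end < start:
        return "", text.strip()
    reasoning = text[start + len(OPEN_TAG):end].strip()
    answer = text[end + len(CLOSE_TAG):].strip()
    return reasoning, answer


def cap_reasoning(reasoning: str, budget: int) -> str:
    """Toy budget: keep at most `budget` words (real budgets count tokens)."""
    words = reasoning.split()
    return " ".join(words[:budget])


sample = (
    "<think>17 is odd. It has no divisors up to 4. So it is prime.</think>"
    "Yes, 17 is prime."
)
reasoning, answer = split_thinking(sample)
print("reasoning:", cap_reasoning(reasoning, 3))
print("answer:", answer)
```

Non-thinking output has no <think> block, so `split_thinking` returns an empty reasoning string and the full text as the answer.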

Under the hood: fine-grained MoE with no shared expert

The big Qwen3 models are sparse. Instead of one fat feed-forward block per layer, there are 128 small experts; a tiny router scores them all for each token and lights up only the top 8. So Qwen3-235B-A22B holds 235B parameters but runs just 22B per token: the quality of a giant with the compute bill of a midsize model.
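The routing step described above can be sketched in a few lines of plain Python. This is an illustration under assumptions, not Qwen's implementation: the router logits are random stand-ins, and renormalizing the selected weights so they sum to 1 is a common choice that the text does not confirm.

```python
import math
import random


def route_token(logits, top_k=8):
    """Return [(expert_index, weight)] for the top_k experts of one token."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]   # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)          # assumed: renormalize over chosen
    return [(i, probs[i] / norm) for i in chosen]


random.seed(0)
router_logits = [random.gauss(0.0, 1.0) for _ in range(128)]  # one score per expert
selected = route_token(router_logits)

for expert, weight in selected:
    print(f"expert {expert:3d} weight {weight:.3f}")
```

The token's output would then be the weighted sum of the 8 selected experts' outputs; the other 120 experts do no work for that token.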

Unlike DeepSeek's fine-grained MoE, Qwen3 uses no always-on shared expert; every expert is chosen by the router, and training adds a global-batch load-balancing loss so no expert sits idle.
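For intuition about what a load-balancing loss measures, here is a sketch of one common auxiliary-loss formulation (number of experts times the sum over experts of routed fraction times mean router probability). The exact formula Qwen3 uses is not given in the text, so treat this as an assumed stand-in; "global-batch" is modeled simply by computing the statistics over every token passed in.

```python
def load_balance_loss(probs_per_token, top_k):
    """Auxiliary loss: about 1.0 when routing is even, larger when it collapses."""
    num_tokens = len(probs_per_token)
    num_experts = len(probs_per_token[0])
    counts = [0] * num_experts
    mean_prob = [0.0] * num_experts
    for probs in probs_per_token:
        chosen = sorted(range(num_experts), key=lambda i: probs[i], reverse=True)[:top_k]
        for i in chosen:
            counts[i] += 1                      # how often each expert is picked
        for i, p in enumerate(probs):
            mean_prob[i] += p / num_tokens      # average router probability
    frac = [c / (num_tokens * top_k) for c in counts]
    return num_experts * sum(f * p for f, p in zip(frac, mean_prob))


# Four experts, top-1 routing: each token prefers a different expert.
balanced = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.7, 0.1, 0.1],
    [0.1, 0.1, 0.7, 0.1],
    [0.1, 0.1, 0.1, 0.7],
]
# Every token prefers expert 0: the other three would sit idle.
collapsed = [[0.7, 0.1, 0.1, 0.1]] * 4

print(load_balance_loss(balanced, top_k=1))
print(load_balance_loss(collapsed, top_k=1))
```

Minimizing a term like this during training pushes the router toward the balanced case, which is the job a shared expert would otherwise partly cover.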

vs the shared recipe

Qwen keeps almost every standard brick of a modern decoder-only transformer, then tweaks a few for stability and reach.

Keeps the standard bricks

Decoder-only next-token prediction — the usual backbone.
RMSNorm pre-norm — normalize before each block for stable training.
SwiGLU feed-forward — the modern gated activation.
RoPE rotary positions, with GQA (grouped query attention) at every size to shrink the KV cache.
Byte-level BPE tokenizer with a large ~151K multilingual vocabulary.
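To see why GQA shrinks the KV cache, the sketch below compares cache size when every query head keeps its own keys and values against the case where groups of query heads share them. The layer and head counts are illustrative assumptions, not values read from a Qwen config.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes to cache keys and values for one sequence (leading 2 = K and V)."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value


# Illustrative shape (assumed): 32 query heads sharing 8 KV heads.
LAYERS, QUERY_HEADS, KV_HEADS, HEAD_DIM, SEQ_LEN = 36, 32, 8, 128, 32_768

mha = kv_cache_bytes(LAYERS, QUERY_HEADS, HEAD_DIM, SEQ_LEN)  # one K/V per query head
gqa = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, SEQ_LEN)     # K/V shared per group

print(f"MHA: {mha / 2**30:.1f} GiB")
print(f"GQA: {gqa / 2**30:.1f} GiB")
print(f"saving: {mha // gqa}x")
```

The saving equals the ratio of query heads to KV heads, and it grows in absolute terms with sequence length, which is why it matters most for long contexts.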

What Qwen3 changes / trades

Adds QK-Norm — normalizes the query and key inside attention for steadier large-scale training.
Removes the QKV bias that Qwen2/2.5 carried in attention.
Fine-grained MoE (128 experts, top-8) on big models, but no shared expert (unlike DeepSeek) — relies on a load-balancing loss instead.
Long context comes in two steps: a raised RoPE base (10,000 → 1,000,000) for the trained long-context window, and YaRN as an optional inference-time technique to stretch beyond the natively trained length.
Dense models have no MoE: expert routing appears only in the 30B-A3B and 235B-A22B variants.
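The QK-Norm item in the list above can be illustrated with a toy attention score. This sketch assumes an RMS-style normalization applied to each query and key head vector before the dot product and omits the learned gain; it shows the stabilizing effect only, not Qwen3's exact layer.

```python
import math


def rms_norm(vec, eps=1e-6):
    """Scale `vec` so its root-mean-square is about 1 (learned gain omitted)."""
    rms = math.sqrt(sum(x * x for x in vec) / len(vec) + eps)
    return [x / rms for x in vec]


def attention_logit(q, k, use_qk_norm):
    """Scaled dot product between one query and one key head vector."""
    if use_qk_norm:
        q, k = rms_norm(q), rms_norm(k)
    return sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q))


q = [100.0, -200.0, 300.0, 50.0]    # deliberately large activations
k = [150.0, -100.0, 250.0, -75.0]

raw = attention_logit(q, k, use_qk_norm=False)
normed = attention_logit(q, k, use_qk_norm=True)
print(f"without QK-Norm: {raw:.1f}")
print(f"with QK-Norm:    {normed:.3f}")
```

With normalized vectors the score is bounded by the square root of the head dimension no matter how large the raw activations grow, which keeps the softmax from saturating.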

Gotchas / good to know

Read before you reach for it

Native vs stretched context. Qwen3 is trained at 32K; 128K comes from YaRN scaling at inference, and very long inputs can lose some quality. Only the "2507" (256K) and Qwen2.5-1M variants are built for truly long documents.
Thinking mode costs tokens. Every <think> step is generated text you pay for in latency and money. Set a thinking budget; don't leave it on for simple chat.
Active ≠ total parameters. 235B-A22B needs memory for all 235B weights even though only 22B run per token — MoE saves compute, not VRAM.
No shared expert. Routing quality leans entirely on the load-balancing loss; a poorly balanced router can leave experts under-trained, so don't assume MoE always beats a same-size dense model.
Generation-specific details. QK-Norm and the dropped QKV bias are Qwen3 changes — Qwen2/2.5 behave differently, so port configs carefully.
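The "active ≠ total" point above is easy to check with back-of-the-envelope arithmetic. The sketch below estimates weight storage only (no KV cache or activations), and the bytes-per-parameter values are generic precision assumptions rather than official figures.

```python
def weight_memory_gb(params_billion: float, bytes_per_param: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return params_billion * bytes_per_param


TOTAL_B, ACTIVE_B = 235, 22   # from the model name 235B-A22B

for label, bytes_per_param in [("BF16", 2), ("8-bit", 1), ("4-bit", 0.5)]:
    gb = weight_memory_gb(TOTAL_B, bytes_per_param)
    print(f"{label}: ~{gb:.0f} GB of weights must be resident")

active_fraction = ACTIVE_B / TOTAL_B
print(f"parameters used per token: {active_fraction:.1%}")
```

Every expert has to be loaded because the router may pick any of them for the next token, so memory follows the total count while compute follows the active count.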