Qwen (Alibaba)

Alibaba Cloud · Dense + MoE · Hybrid thinking · GQA + QK-Norm · Fine-grained MoE · Apache 2.0 · Multilingual

What makes it distinctive

Qwen is one model family with two superpowers: a single network can switch between slow step-by-step "thinking" and fast direct answers, and the same family scales from a 0.6B phone-sized model all the way to a 235B Mixture-of-Experts (many small sub-networks, few active at once).

Built by Alibaba Cloud, Qwen takes the standard decoder-only recipe (predict-the-next-token) and adds two flourishes. First, hybrid thinking: you can ask the model to reason out loud inside a <think> block for hard problems, or skip straight to the answer for easy ones, with a thinking budget (a cap on how much reasoning it may generate) that you control. Second, fine-grained MoE: many tiny experts, a handful lit per token, giving big-model quality at small-model running cost.

New to how these models are assembled?

Every open model shares the same set of building bricks: attention, normalization, position encoding, and the feed-forward block. See "How Open-Source LLMs Are Built" for the shared recipe, then come back to see which bricks Qwen swaps out.

The family so far

Timeline: Qwen2 → Qwen3.5

Qwen2 (2024), then Qwen2.5 (late 2024) with a 1M-context variant, then Qwen3 (Apr–May 2025) and its "2507" updates. Qwen3.5 (Feb 2026) jumped to a 397B-A17B MoE with 262K native context, plus 122B-A10B / 35B-A3B / 27B siblings; the newest Qwen3.7 previews (May 2026) are no longer open weights.

Sizes (Qwen3): 0.6B → 235B

Dense: 0.6 / 1.7 / 4 / 8 / 14 / 32B. Sparse MoE: 30B-A3B (30B total, 3B active) and 235B-A22B (235B total, 22B active, 94 layers).
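The "A" in each MoE name is simple arithmetic, and a small sketch makes the ratio concrete. It uses only the total and active counts quoted above; the dictionary and function names are illustrative, not part of any Qwen tooling.

```python
# Total and active parameter counts (in billions) as quoted in the text above.
QWEN3_MOE = {
    "30B-A3B": (30, 3),
    "235B-A22B": (235, 22),
}


def active_fraction(total_b: float, active_b: float) -> float:
    """Share of the weights that take part in one token's forward pass."""
    return active_b / total_b


for name, (total_b, active_b) in QWEN3_MOE.items():
    share = active_fraction(total_b, active_b)
    print(f"{name}: {share:.1%} of weights active per token")
```

Both MoE models run roughly a tenth of their weights for any single token.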

Context: 32K → 1M

Qwen3 trains at 32,768 tokens, stretches to 131,072 (128K) with YaRN. "2507" variants reach 262,144 (256K) natively; Qwen2.5-1M reaches 1,000,000.
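The YaRN stretch boils down to one scaling factor: target length divided by native length. The sketch below derives it from the two numbers above. The dictionary key names are an assumption modeled on the `rope_scaling` entry used in Hugging Face style model configs; check the model card for the exact spelling before copying them.

```python
NATIVE_CONTEXT = 32_768    # training length quoted above
TARGET_CONTEXT = 131_072   # stretched length quoted above


def yarn_factor(native: int, target: int) -> float:
    """How many times the native window must be stretched to reach the target."""
    return target / native


# Assumed key names (Hugging Face style config), shown for illustration only.
rope_scaling = {
    "rope_type": "yarn",
    "factor": yarn_factor(NATIVE_CONTEXT, TARGET_CONTEXT),
    "original_max_position_embeddings": NATIVE_CONTEXT,
}

print(rope_scaling["factor"])  # prints 4.0
```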

License: Apache 2.0

Qwen3 (and most of Qwen2.5) ships under Apache 2.0 — a permissive license you can use commercially. Vocabulary is ~151K tokens of byte-level BPE.

Signature: one model, two speeds of thought

Most model families ship a "chat" model and a separate "reasoning" model. Qwen3 folds both into one set of weights. In non-thinking mode the prompt goes straight to an answer — fast and cheap. In thinking mode the model first writes a visible chain of intermediate <think> steps, then the answer. A thinking budget lets you cap how long it reasons: more steps for a tricky proof, fewer for a quick fact.


Non-thinking = prompt straight to answer. Thinking = a chain of <think> steps first; the budget caps how many.
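On the consuming side, the two modes differ only in whether a <think> block precedes the answer. Here is a minimal sketch of separating the two, plus a toy budget cap. It assumes the reasoning is wrapped in literal <think> and </think> tags, and it counts whitespace-separated words as a stand-in for tokens. A real budget is enforced during generation, not afterwards, so `cap_thinking` is an illustration of the idea rather than how any serving stack does it; both function names are hypothetical.

```python
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_thinking(text: str) -> tuple[str, str]:
    """Return (reasoning, answer); reasoning is empty for non-thinking output."""
    match = THINK_RE.search(text)
    if match is None:
        return "", text.strip()
    return match.group(1).strip(), text[match.end():].strip()


def cap_thinking(reasoning: str, budget: int) -> str:
    """Keep at most `budget` words of reasoning (words stand in for tokens)."""
    return " ".join(reasoning.split()[:budget])


raw = "<think>17 is odd. It has no divisor below 5 except 1.</think>Yes, 17 is prime."
reasoning, answer = split_thinking(raw)
print(answer)
print(cap_thinking(reasoning, 3))
```

Output without a <think> block passes through unchanged, so the same parser handles both modes.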

Under the hood: fine-grained MoE with no shared expert

The big Qwen3 models are sparse. Instead of one fat feed-forward block per layer, there are 128 small experts; a tiny router scores them all for each token and lights up only the top 8. So Qwen3-235B-A22B holds 235B parameters but runs just 22B per token — the quality of a giant, the bill of a midsize model.
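The routing step described above can be sketched in a few lines of plain Python. The 128 and 8 come from the text; the softmax-then-renormalize weighting is a common choice that this sketch assumes rather than something the text specifies, and random logits stand in for a trained router.

```python
import math
import random

NUM_EXPERTS = 128  # experts per MoE layer, as stated above
TOP_K = 8          # experts activated per token, as stated above


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route(router_logits, top_k=TOP_K):
    """Return [(expert_index, weight)] for the top_k experts; weights sum to 1."""
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:top_k]
    mass = sum(probs[i] for i in chosen)
    # Renormalize over the chosen experts (an assumption of this sketch).
    return [(i, probs[i] / mass) for i in chosen]


rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]  # stand-in router output
selection = route(logits)
print(len(selection), "experts chosen out of", NUM_EXPERTS)
```

The token's output is then the weighted sum of those eight experts' outputs; the other 120 experts do no work for that token.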

Unlike DeepSeek's fine-grained MoE, Qwen3 uses no always-on shared expert; every expert is chosen by the router, and training adds a global-batch load-balancing loss so no expert sits idle.
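To see what a load-balancing loss rewards, here is a sketch of one widely used form: the number of experts times the sum, over experts, of (fraction of routed slots the expert received) × (mean router probability for that expert). The text only says Qwen3 uses a global-batch load-balancing loss, so the exact formula is an assumption; "global batch" is modeled here simply by passing every token of the batch in one call.

```python
def load_balance_loss(assignments, router_probs, num_experts, top_k):
    """
    assignments:  per-token lists of chosen expert indices (length top_k each)
    router_probs: per-token lists of router probabilities (length num_experts)
    Returns 1.0 for perfectly even routing and grows as routing collapses.
    """
    tokens = len(assignments)
    slot_share = [0.0] * num_experts
    for chosen in assignments:
        for expert in chosen:
            slot_share[expert] += 1.0 / (tokens * top_k)
    mean_prob = [
        sum(probs[e] for probs in router_probs) / tokens for e in range(num_experts)
    ]
    return num_experts * sum(f * p for f, p in zip(slot_share, mean_prob))


# Four tokens, four experts, top-1 routing (tiny numbers for readability).
balanced = load_balance_loss(
    [[0], [1], [2], [3]], [[0.25] * 4 for _ in range(4)], num_experts=4, top_k=1
)
collapsed = load_balance_loss(
    [[0], [0], [0], [0]], [[1.0, 0.0, 0.0, 0.0] for _ in range(4)], num_experts=4, top_k=1
)
print(balanced, collapsed)
```

Minimizing a term like this pushes the router toward the balanced case, where every expert sees a similar share of tokens.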

vs the shared recipe

Qwen keeps almost every standard brick of a modern decoder-only transformer, then tweaks a few for stability and reach.

Keeps the standard bricks
  • Decoder-only next-token prediction — the usual backbone.
  • RMSNorm pre-norm — normalize before each block for stable training.
  • SwiGLU feed-forward — the modern gated activation.
  • RoPE rotary positions, with GQA (grouped query attention) at every size to shrink the KV cache.
  • Byte-level BPE tokenizer with a large ~151K multilingual vocabulary.
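How much GQA shrinks the KV cache can be estimated with back-of-envelope arithmetic. In the sketch below, only the 94-layer count comes from this page; the head counts (64 query heads, 4 key/value heads) and the head dimension of 128 are assumptions for illustration, so substitute the values from the model's config before relying on the numbers.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence."""
    # The leading 2 covers the separate key and value tensors.
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value


LAYERS = 94        # stated above for 235B-A22B
Q_HEADS = 64       # assumed for illustration
KV_HEADS = 4       # assumed for illustration
HEAD_DIM = 128     # assumed for illustration
SEQ_LEN = 32_768   # native context quoted above

full_mha = kv_cache_bytes(LAYERS, Q_HEADS, HEAD_DIM, SEQ_LEN)   # one K/V pair per query head
with_gqa = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, SEQ_LEN)  # K/V shared across groups

print(f"without GQA: {full_mha / 2**30:.1f} GiB")
print(f"with GQA:    {with_gqa / 2**30:.1f} GiB")
print(f"reduction:   {full_mha // with_gqa}x")
```

The saving is exactly the ratio of query heads to key/value heads, whatever the other dimensions are.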
What Qwen3 changes / trades
  • Adds QK-Norm — normalizes the query and key inside attention for steadier large-scale training.
  • Removes the QKV bias that Qwen2/2.5 carried in attention.
  • Fine-grained MoE (128 experts, top-8) on big models, but no shared expert (unlike DeepSeek) — relies on a load-balancing loss instead.
  • Long context is bolted on: base-frequency scaling (RoPE base 10,000 → 1,000,000) plus YaRN at inference to stretch 32K toward 128K.
  • Dense models have no MoE — the experts trick only appears in the 30B and 235B variants.
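Why QK-Norm steadies training is easiest to see numerically: once the query and key are normalized, the attention logit no longer grows with the size of the raw activations. The sketch below uses an RMS-style normalization without a learned gain, applied to a single toy head; the function names and the toy vectors are illustrative, and the real layer may differ in detail.

```python
import math


def rms_norm(vec, eps=1e-6):
    """Scale a vector so its root-mean-square is about 1 (learned gain omitted)."""
    rms = math.sqrt(sum(x * x for x in vec) / len(vec) + eps)
    return [x / rms for x in vec]


def attention_logit(q, k, qk_norm):
    """Scaled dot-product score for one query/key pair."""
    if qk_norm:
        q, k = rms_norm(q), rms_norm(k)
    return sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q))


q = [0.5, -1.0, 2.0, 0.25]
k = [1.0, 0.5, -0.5, 2.0]
big_q = [100.0 * x for x in q]  # simulate activations blowing up mid-training

plain_growth = attention_logit(big_q, k, False) / attention_logit(q, k, False)
normed_growth = attention_logit(big_q, k, True) / attention_logit(q, k, True)
print(f"logit growth without QK-Norm: {plain_growth:.1f}x")
print(f"logit growth with QK-Norm:    {normed_growth:.3f}x")
```

Without normalization the logit scales with the query's magnitude, which can saturate the softmax; with it, the logit stays bounded by the square root of the head dimension.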

Gotchas / good to know

Read before you reach for it
  • Native vs stretched context. Qwen3 is trained at 32K; 128K comes from YaRN scaling at inference, and very long inputs can lose some quality. Only the "2507" (256K) and Qwen2.5-1M variants are built for truly long documents.
  • Thinking mode costs tokens. Every <think> step is generated text you pay for in latency and money. Set a thinking budget; don't leave it on for simple chat.
  • Active ≠ total parameters. 235B-A22B needs memory for all 235B weights even though only 22B run per token — MoE saves compute, not VRAM.
  • No shared expert. Routing quality leans entirely on the load-balancing loss; a poorly balanced router can leave experts under-trained, so don't assume MoE always beats a same-size dense model.
  • Generation-specific details. QK-Norm and the dropped QKV bias are Qwen3 changes — Qwen2/2.5 behave differently, so port configs carefully.
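The "active ≠ total" point above is worth checking with arithmetic before you size hardware. This sketch estimates weight storage only (no KV cache, activations, or framework overhead) from the parameter counts on this page; the bytes-per-parameter table covers common precisions and is not a statement about which formats are officially released.

```python
BYTES_PER_PARAM = {"bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(params_billions: float, dtype: str) -> float:
    """Approximate weight storage in GB (10^9 bytes) for a given precision."""
    return params_billions * BYTES_PER_PARAM[dtype]


for dtype in BYTES_PER_PARAM:
    resident = weight_memory_gb(235, dtype)  # every expert must be loaded
    touched = weight_memory_gb(22, dtype)    # weights actually used per token
    print(f"{dtype}: ~{resident:.0f} GB resident, ~{touched:.0f} GB touched per token")
```

The resident figure is what has to fit in memory; the per-token figure only tells you how much compute and memory bandwidth each token costs.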

Related