Four models, four technical approaches, one shared direction: prioritizing tool-calling and agent capabilities over raw intelligence. Behind this trend lies the rising value of the orchestration layer.
In the week surrounding Chinese New Year 2026, China's AI scene put on a collective release show: Qwen3.5, MiniMax M2.5, GLM-5, and Kimi K2.5 all dropped within days of each other. I've been putting each of them through their paces, and the more I use them, the more I notice something interesting—their technical approaches differ, but they share a common direction: prioritizing tool-calling and agent capabilities over raw intelligence.
This post is my research notes on these four models, ending with a broader observation I've been sitting with.
Alibaba's Qwen3.5 is the most architecturally adventurous of the bunch. The flagship Qwen3.5-397B-A17B uses a hybrid design that combines linear attention (via Gated DeltaNet) with a sparse mixture-of-experts (MoE) layer stack: 397B total parameters, but only 17B activated per forward pass. That sparsity is the whole trick: you get the capacity of a very large model while paying inference costs closer to those of a much smaller dense one.
The language coverage expansion is also notable: from 119 languages in the previous generation to 201 languages and dialects, which has real implications for global deployment.
On tool calling, Qwen3.5 performs impressively: 72.9 on BFCL-V4 (a function/tool-calling benchmark) and 46.1 on MCP-Mark (tool calling over the MCP protocol), both placing it in the top tier among open-source models. The hosted Qwen3.5-Plus adds built-in official tools, adaptive tool selection, and a 1-million-token context window.
The small-scale versions are equally interesting: Qwen3.5-9B outperforms OpenAI's gpt-oss-120B on multiple benchmarks while running on a standard laptop—a significant development for locally deployed agent frameworks.
| Spec | Detail |
|---|---|
| Architecture | Linear attention + Sparse MoE hybrid |
| Active parameters | 17B (of 397B total) |
| Tool calling | BFCL-V4: 72.9, MCP-Mark: 46.1 |
| Context window | 1M tokens (Plus version) |
| Language support | 201 languages and dialects |
| Open source | Yes (Qwen3.5-397B-A17B) |
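The "17B active of 397B total" idea is easier to see in code. Here's a toy sketch of top-k sparse-MoE routing, with made-up numbers and a trivial stand-in for each expert FFN; nothing here reflects Qwen3.5's actual router, only the general mechanism:

```python
import math
import random

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

class ToyMoE:
    """Toy sparse-MoE layer: many experts exist, but only top_k run per token."""

    def __init__(self, num_experts=8, top_k=2, dim=4, seed=0):
        rng = random.Random(seed)
        self.top_k = top_k
        # Each "expert" is just a scalar gain here (a stand-in for a full FFN).
        self.experts = [rng.uniform(0.5, 1.5) for _ in range(num_experts)]
        # Router: one weight vector per expert, scoring each token.
        self.router = [[rng.uniform(-1, 1) for _ in range(dim)]
                       for _ in range(num_experts)]

    def forward(self, x):
        # Score every expert, but only *execute* the top_k highest-scoring ones.
        scores = [sum(w * xi for w, xi in zip(row, x)) for row in self.router]
        top = sorted(range(len(scores)),
                     key=lambda i: scores[i], reverse=True)[:self.top_k]
        gates = softmax([scores[i] for i in top])
        out = [0.0] * len(x)
        for g, i in zip(gates, top):
            expert_out = [self.experts[i] * xi for xi in x]  # the only FLOPs spent
            out = [o + g * e for o, e in zip(out, expert_out)]
        return out, top

moe = ToyMoE()
y, active = moe.forward([0.1, -0.2, 0.3, 0.4])
print(f"active experts: {len(active)} of {len(moe.experts)}")
```

The parameters of the six unchosen experts sit idle for this token, which is exactly why total parameter count and per-token compute can diverge so sharply (397B vs. 17B in Qwen3.5's case).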
After spending time with these models, I keep coming back to the same observation.
The current AI tool usage pattern looks roughly like this: you open Claude or ChatGPT, describe your need, wait for an answer, then go execute it yourself. The model is the "consultant," you are the "executor."
But tools like OpenClaw represent a different paradigm: the model itself is the executor. It doesn't just give you suggestions—it calls tools, runs commands, manages files, sends messages. You describe the goal; it completes the task.
There's a key cognitive shift here: the execution-layer model doesn't need to be the smartest—it just needs to call tools accurately and orchestrate task flows correctly.
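That execution-layer loop is simple enough to sketch. Everything below is hypothetical: `fake_model` stands in for a real tool-calling LLM, and the tool registry and message shapes are invented for illustration, not any actual agent framework's API:

```python
# Minimal "model as executor" loop: the model repeatedly picks a tool,
# the harness runs it, and the result is fed back until the model finishes.

def fake_model(goal, history):
    """Hypothetical stand-in for an LLM that returns tool calls as dicts."""
    if not history:
        return {"tool": "add", "args": {"a": 2, "b": 3}}
    return {"tool": "finish", "args": {"answer": history[-1]["result"]}}

# Tool registry: in a real agent these would be shell commands, file ops, etc.
TOOLS = {
    "add": lambda a, b: a + b,
}

def run_agent(goal, max_steps=5):
    history = []
    for _ in range(max_steps):
        call = fake_model(goal, history)
        if call["tool"] == "finish":
            return call["args"]["answer"]
        # The loop executes the tool itself -- no human in the middle.
        result = TOOLS[call["tool"]](**call["args"])
        history.append({"call": call, "result": result})
    raise RuntimeError("step budget exhausted")

print(run_agent("add 2 and 3"))  # -> 5
```

Notice what the model has to be good at here: emitting a well-formed tool call and deciding when to stop. Raw reasoning depth matters less than reliable orchestration, which is precisely what benchmarks like BFCL-V4 and MCP-Mark measure.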
These are my thoughts from recent experimentation—not necessarily right, happy to discuss.