Four models, four technical approaches, one shared direction: prioritizing tool-calling and agent capabilities over raw intelligence. Behind this trend lies the rising value of the orchestration layer.
In the week surrounding Chinese New Year 2026, China's AI scene put on a collective release show: Qwen3.5, MiniMax M2.5, GLM-5, and Kimi K2.5 all dropped within days of each other. I've been putting each of them through their paces, and the more I use them, the more I notice something interesting—their technical approaches differ, but they share a common direction: prioritizing tool-calling and agent capabilities over raw intelligence.
This post is my research notes on these four models, ending with a broader observation I've been sitting with.
Alibaba's Qwen3.5 is the most architecturally adventurous of the bunch. The flagship Qwen3.5-397B-A17B uses a hybrid design that combines linear attention (via Gated DeltaNet) with a sparse mixture-of-experts (MoE) layer stack: 397B total parameters, but only 17B activated per forward pass. That sparsity is the whole trick: you get the capacity of a very large model while paying inference costs closer to those of a much smaller dense one.
The language coverage expansion is also notable: from 119 languages in the previous generation to 201 languages and dialects, which has real implications for global deployment.
On tool calling, Qwen3.5 performs impressively: 72.9 on BFCL-V4 (a function/tool-calling benchmark) and 46.1 on MCP-Mark (tool calling over the MCP protocol), both placing it in the top tier among open-source models. The hosted Qwen3.5-Plus adds built-in official tools, adaptive tool selection, and a 1-million-token context window.
The small-scale versions are equally interesting: Qwen3.5-9B outperforms OpenAI's gpt-oss-120B on multiple benchmarks while running on a standard laptop—a significant development for locally deployed agent frameworks.
| Spec | Detail |
|---|---|
| Architecture | Linear attention + Sparse MoE hybrid |
| Active parameters | 17B (of 397B total) |
| Tool calling | BFCL-V4: 72.9, MCP-Mark: 46.1 |
| Context window | 1M tokens (Plus version) |
| Language support | 201 languages and dialects |
| Open source | Yes (Qwen3.5-397B-A17B) |
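The "17B active of 397B total" idea is easier to see in code. Here's a toy sketch of top-k sparse-MoE routing, with made-up numbers and a trivial stand-in for each expert FFN; nothing here reflects Qwen3.5's actual router, only the general mechanism:

```python
import math
import random

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

class ToyMoE:
    """Toy sparse-MoE layer: many experts exist, but only top_k run per token."""

    def __init__(self, num_experts=8, top_k=2, dim=4, seed=0):
        rng = random.Random(seed)
        self.top_k = top_k
        # Each "expert" is just a scalar gain here (a stand-in for a full FFN).
        self.experts = [rng.uniform(0.5, 1.5) for _ in range(num_experts)]
        # Router: one weight vector per expert, scoring each token.
        self.router = [[rng.uniform(-1, 1) for _ in range(dim)]
                       for _ in range(num_experts)]

    def forward(self, x):
        # Score every expert, but only *execute* the top_k highest-scoring ones.
        scores = [sum(w * xi for w, xi in zip(row, x)) for row in self.router]
        top = sorted(range(len(scores)),
                     key=lambda i: scores[i], reverse=True)[:self.top_k]
        gates = softmax([scores[i] for i in top])
        out = [0.0] * len(x)
        for g, i in zip(gates, top):
            expert_out = [self.experts[i] * xi for xi in x]  # the only FLOPs spent
            out = [o + g * e for o, e in zip(out, expert_out)]
        return out, top

moe = ToyMoE()
y, active = moe.forward([0.1, -0.2, 0.3, 0.4])
print(f"active experts: {len(active)} of {len(moe.experts)}")
```

The parameters of the six unchosen experts sit idle for this token, which is exactly why total parameter count and per-token compute can diverge so sharply (397B vs. 17B in Qwen3.5's case).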
After spending time with these models, I keep coming back to the same observation.
The current AI tool usage pattern looks roughly like this: you open Claude or ChatGPT, describe your need, wait for an answer, then go execute it yourself. The model is the "consultant," you are the "executor."
But tools like OpenClaw represent a different paradigm: the model itself is the executor. It doesn't just give you suggestions—it calls tools, runs commands, manages files, sends messages. You describe the goal; it completes the task.
There's a key cognitive shift here: the execution-layer model doesn't need to be the smartest—it just needs to call tools accurately and orchestrate task flows correctly.
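That execution-layer loop is simple enough to sketch. Everything below is hypothetical: `fake_model` stands in for a real tool-calling LLM, and the tool registry and message shapes are invented for illustration, not any actual agent framework's API:

```python
# Minimal "model as executor" loop: the model repeatedly picks a tool,
# the harness runs it, and the result is fed back until the model finishes.

def fake_model(goal, history):
    """Hypothetical stand-in for an LLM that returns tool calls as dicts."""
    if not history:
        return {"tool": "add", "args": {"a": 2, "b": 3}}
    return {"tool": "finish", "args": {"answer": history[-1]["result"]}}

# Tool registry: in a real agent these would be shell commands, file ops, etc.
TOOLS = {
    "add": lambda a, b: a + b,
}

def run_agent(goal, max_steps=5):
    history = []
    for _ in range(max_steps):
        call = fake_model(goal, history)
        if call["tool"] == "finish":
            return call["args"]["answer"]
        # The loop executes the tool itself -- no human in the middle.
        result = TOOLS[call["tool"]](**call["args"])
        history.append({"call": call, "result": result})
    raise RuntimeError("step budget exhausted")

print(run_agent("add 2 and 3"))  # -> 5
```

Notice what the model has to be good at here: emitting a well-formed tool call and deciding when to stop. Raw reasoning depth matters less than reliable orchestration, which is precisely what benchmarks like BFCL-V4 and MCP-Mark measure.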
These are my thoughts from recent experimentation—not necessarily right, happy to discuss.