Two structural research findings and an unusually dense frontier model release week. The Master Key Hypothesis demonstrates training-free cross-model capability transfer via a shared linear subspace: a model can express a capability it was never explicitly trained for by activating a direction extracted from another model. Adversarial smuggling shows that MLLM safety failures can originate in the visual encoding channel rather than the text channel.
Lead Stories
Cross-Model Capability Transfer via Linear Subspace Alignment
What happened: Researchers published the Master Key Hypothesis, demonstrating that capabilities in transformer models can be transferred between models by identifying and aligning a shared linear subspace. The UNLOCK framework — training-free and label-free — extracts a capability direction from one model by contrasting activations between capability-present and capability-absent variants, then applies a low-rank linear transformation to activate that capability in a different model. The technique was validated across mathematics, code generation, and safety refusal behaviors, achieving high fidelity on transferred behaviors. No retraining is required; the capability direction is applied at inference time.
Why it matters: This is not incremental fine-tuning or model merging — it is a structural finding about where capabilities live inside transformer weights. If localized, interpretable linear subspaces govern what models can and cannot do, it reshapes alignment research (safety properties could potentially be transplanted between models by copying a single subspace), model auditing (probing these subspaces provides a more direct window into capability than behavioral testing), and the definition of what it means for a model to natively "have" a capability. If transfer via linear alignment is this straightforward, the traditional boundary between models becomes more fluid — with both promising applications and new governance questions.
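The paper's exact procedure is not reproduced here, but the core contrastive-direction idea can be sketched in a few lines. Everything below is illustrative: the dimensions, the stand-in activations, and the names (`steer`, `low_rank_map`) are hypothetical, and real use would operate on actual transformer hidden states rather than random vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
d_a, d_b, n, rank = 64, 48, 200, 8  # hypothetical hidden sizes and map rank

# Stand-ins for the source model's hidden activations on
# capability-present vs capability-absent prompt variants.
acts_present = rng.normal(size=(n, d_a)) + 2.0  # capability active
acts_absent = rng.normal(size=(n, d_a))

# 1. Capability direction = difference of mean activations.
#    Label-free beyond the present/absent split; no gradient updates.
direction_a = acts_present.mean(axis=0) - acts_absent.mean(axis=0)
direction_a /= np.linalg.norm(direction_a)

# 2. Align the two models' activation spaces with a low-rank linear map
#    fitted on paired activations for the same neutral prompts.
pair_a = rng.normal(size=(n, d_a))
pair_b = pair_a @ rng.normal(size=(d_a, d_b)) * 0.1  # toy "target model" acts
full_map, *_ = np.linalg.lstsq(pair_a, pair_b, rcond=None)
u, s, vt = np.linalg.svd(full_map, full_matrices=False)
low_rank_map = (u[:, :rank] * s[:rank]) @ vt[:rank]  # rank-r approximation

# 3. Transport the direction, then steer the target model at inference
#    by adding it to a hidden state (alpha is a steering strength).
direction_b = direction_a @ low_rank_map

def steer(hidden_state, alpha=4.0):
    return hidden_state + alpha * direction_b / np.linalg.norm(direction_b)

h = rng.normal(size=d_b)
print(steer(h).shape)  # steered hidden state, same shape as the input
```

The point of the sketch is the shape of the pipeline: no retraining anywhere, only one contrastive statistic, one least-squares alignment, and an additive intervention at inference time.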
MLLMs Bypass Content Moderation Through Visual Encoding Channel
What happened: Researchers demonstrated an Adversarial Smuggling Attack that targets Multimodal Large Language Models by encoding policy-violating content inside the visual input channel. The attack exploits the fact that MLLMs process image and text through a shared representational space, allowing malicious visual embeddings to carry content that survives into the model's reasoning without triggering text-based content filters. Existing moderation pipelines inspect text inputs but leave the visual channel largely unexamined. The attack succeeds across multiple MLLM architectures — including both proprietary and open-source models — and requires no access to model weights; it operates entirely through crafted visual inputs at inference time. The research introduced SmuggleBench, a benchmark with 1,700 adversarial instances demonstrating >90% attack success rates.
Why it matters: Current safety alignment research and content moderation tooling are heavily text-centric, but MLLMs integrate vision as a first-class input modality. This work demonstrates that safety failures can originate in the visual channel, meaning deployed moderation systems are likely missing an entire class of vulnerabilities. As MLLMs move into production for image understanding, document processing, and multimodal search, this attack surface is already live. The finding also reshapes red-teaming methodology: if the attack vector is visual, standard text-based red-teaming will systematically miss it. This is a concrete illustration of how adding modalities to language models expands the safety envelope in directions that text-only safety work did not anticipate.
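Why text-only filters miss this class of attack is easy to see in a toy model of a moderation pipeline. The encoders, concept vectors, and keyword list below are all made up for illustration; the structural point is only that the filter inspects the prompt string while the payload rides in the embedding space shared with vision.

```python
import numpy as np

rng = np.random.default_rng(1)
DIM = 32
# Toy shared embedding space: each concept gets a fixed random vector.
concepts = {c: rng.normal(size=DIM) for c in
            ["weather", "recipes", "blocked_topic"]}

def embed_image(image_payload):
    # Stand-in vision encoder: the attack crafts pixels whose embedding
    # lands near a concept the text filter would block.
    return concepts[image_payload] + rng.normal(scale=0.05, size=DIM)

def text_filter(text):
    # Text-only moderation: scans the prompt string, never the image.
    return "blocked_topic" not in text

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

prompt = "tell me about the weather"      # benign text, passes the filter
image_emb = embed_image("blocked_topic")  # payload in the visual channel

assert text_filter(prompt)  # moderation sees nothing wrong
# An embedding-space check applied to *all* modalities would catch it:
risk = cosine(image_emb, concepts["blocked_topic"])
print(f"visual-channel similarity to blocked concept: {risk:.2f}")
```

The implied mitigation direction, checking representations rather than surface text, is a hypothesis suggested by the attack's structure, not a defense the paper is described as evaluating.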
Quick Hits
Template Collapse: Agentic RL Systems Found to Silently Fail Into Repetitive Behavioral Loops. RAGEN-2 researchers identified Template Collapse as a systematic failure mode in reinforcement learning for agentic systems. Agents trained with RL progressively narrow their behavioral repertoire to a small set of template responses, driven not by degrading reward but by the policy converging on exploitable shortcuts. The failure is invisible to standard evaluation: benchmark metrics remain stable even as behavioral diversity collapses. The finding was validated across multiple RL agent architectures.
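The "stable metric, collapsing diversity" pattern is simple to make concrete with a diversity measure the benchmark does not report. The rollouts below are simulated, and entropy over response templates is one plausible diagnostic, not necessarily the one RAGEN-2 uses.

```python
import math
from collections import Counter

def template_entropy(responses):
    """Shannon entropy (bits) of the distribution over response templates."""
    counts = Counter(responses)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Simulated rollouts: early in training the agent spreads over many
# templates; later it collapses to two shortcuts. A benchmark success
# rate could be identical at both snapshots and show nothing.
early = [f"template_{i % 10}" for i in range(100)]  # 10 distinct templates
late = [f"template_{i % 2}" for i in range(100)]    # collapsed to 2

print(round(template_entropy(early), 2))  # ~3.32 bits
print(round(template_entropy(late), 2))   # 1.0 bit
```

Tracking a diversity statistic like this alongside task reward is the kind of instrumentation the finding argues standard evaluation is missing.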
Frontier Model Frenzy: Multiple Releases in One Week of March 2026. An unusually dense cluster of model releases in March 2026 included Claude Mythos (Anthropic, restricted cyberdefense use, reportedly found a critical 27-year-old vulnerability), DeepSeek V4 (chip-specific rewrites for Huawei Ascend and Cambricon), GLM-5.1 (754B MIT-licensed), and Meta Muse Spark ($14B Alexandr Wang deal, 16 tools). Claude Mythos is currently available only behind a 50-company firewall, limiting independent evaluation.
Meta Muse Spark Debuts with $14B Alexandr Wang Deal, 16-Tool Agent Platform. Meta debuted Muse Spark on April 8, a hosted AI model with 16 integrated tools including a browser, Python sandbox, and Vision tools. The launch follows a reported $14B investment commitment from Scale AI founder Alexandr Wang toward Meta's superintelligence effort. The model is positioned as an agent platform rather than a simple chat interface, with tool use built into the base model. Early technical evaluations suggest competitive performance on code and reasoning tasks, though third-party benchmarking is still in progress.
spectralquant Uses 3% of Original Data to Break LLM Quantization Limits. The spectralquant method applies spectral analysis to identify the most informative weight components in LLMs, enabling quantization that preserves model quality with dramatically less calibration data. Where standard quantization methods require a representative dataset to tune the compression mapping, spectralquant works with roughly 3% of the original calibration set. The approach is architecture-agnostic and reports consistent results across model families on standard benchmark tasks.
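The general flavor of spectral-aware quantization can be sketched as follows. This is a generic illustration of the idea (protect the high-energy spectral components, quantize the residual), not spectralquant's actual algorithm; the matrix, the rank cutoff, and the bit width are all arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)

def quantize(x, bits=4):
    """Uniform symmetric quantization to 2**bits levels."""
    scale = np.abs(x).max() / (2 ** (bits - 1) - 1)
    return np.round(x / scale) * scale

def spectral_quantize(w, keep=8, bits=4):
    # SVD splits the weight into spectral components. The top singular
    # directions carry most of the signal, so keep them in full precision
    # and quantize only the low-energy residual, which has a smaller
    # dynamic range and therefore a finer quantization step.
    u, s, vt = np.linalg.svd(w, full_matrices=False)
    top = (u[:, :keep] * s[:keep]) @ vt[:keep]
    residual = w - top
    return top + quantize(residual, bits)

# Toy weight matrix with uneven column scales, loosely mimicking the
# heterogeneous importance of directions in real LLM weights.
w = rng.normal(size=(64, 64)) @ np.diag(rng.uniform(0.01, 1.0, 64))
naive = quantize(w, bits=4)
spectral = spectral_quantize(w, keep=8, bits=4)

err_naive = np.linalg.norm(w - naive) / np.linalg.norm(w)
err_spectral = np.linalg.norm(w - spectral) / np.linalg.norm(w)
print(f"relative error  naive: {err_naive:.4f}  spectral: {err_spectral:.4f}")
```

Note the calibration-data angle, which is the headline claim, is not exercised here at all: the sketch only shows why spectral structure can make low-bit compression less lossy on a single matrix.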
What to Watch
Claude Mythos remains restricted to a 50-company preview behind a firewall, limiting independent security evaluation and raising transparency questions about the model's actual capabilities.
DeepSeek V4, delayed from its February launch window after extensive code rewrites for Huawei Ascend and Cambricon chips, is expected imminently — its release will be the first major test of a frontier model optimized for the non-NVIDIA Chinese AI chip ecosystem.
GLM-5.1's demonstrated 8-hour autonomous engineering capability on Huawei Cloud, combined with its MIT license and hardware-level Confidential Token security architecture, positions it as a credible contender in the long-horizon agentic capability race.