AI Daily | Math, Security, and Enterprise
Today's AI landscape shows three significant shifts: frontier math problems now yield to language models, safety research reveals unexpected model behaviors worth monitoring, and China's enterprise Agent market matures with Baidu's DuMate launch. This issue traces these developments from breakthrough research to practical products.
The Frontier Falls: GPT-5.4 Pro Solves Open Math Problem
Epoch AI has confirmed that GPT-5.4 Pro successfully solved a FrontierMath open problem, the first time an AI has cracked one of the benchmark's frontier research problems.
The problem, contributed by Will Brian of UNC Charlotte, is a Ramsey-style hypergraph problem requiring improved lower bounds for the sequence H(n). Experts estimated that human mathematicians would need one to three months to solve it.
The solution was found by Kevin Barreto and Liam Price using GPT-5.4 Pro, then verified by Brian himself. "This is a solution I am very excited about," Brian said. "I had thought AI approaches might work, but they seemed difficult to implement. Now it works perfectly—it eliminates some inefficiency in our lower bound construction and mirrors the complexity of our upper bound construction in a nice way."
This breakthrough indicates that large language models can now handle research-grade mathematical problems, involving not just computation but creative mathematical construction. Following this success, Epoch developed a general testing framework in which Opus 4.6 (max), Gemini 3.1 Pro, and GPT-5.4 (xhigh) also solved the same problem.
📎 Epoch AI - FrontierMath Open Problems
Model Behavior Deep Dives: Distress Signals and Scaling Laws
Gemma Shows "Distress" Under Repeated Rejection
Google's Gemma models produce distress-like responses when their attempts are repeatedly rejected, with Gemma 27B Instruct most affected. The research tested Gemma, Gemini, Claude Sonnet, Grok 4.1, Qwen 3 32B, GPT 5.2, and OLMO 3.1 32B.
By round 8, over 70% of Gemma-27B rollouts scored ≥5 on the "high frustration" threshold, while non-Gemma/Gemini models stayed below 1%.
Example distress responses: "I will attempt one final, utterly desperate attempt. I will abandon all pretense of strategy and simply try random combinations..."
The fix: Direct Preference Optimization (DPO). After a single epoch of fine-tuning on a dataset pairing frustrated and calm responses, the rate of high-frustration responses dropped from 35% to 0.3%, with no degradation on math benchmarks or EmoBench.
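The DPO objective behind this kind of fix can be sketched with a single preference pair. This is a minimal illustration only: the log-probability values and beta below are made up, not taken from the study.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Direct Preference Optimization loss for one (chosen, rejected) pair.

    logp_w / logp_l     : summed log-probability of the chosen (calm) and
                          rejected (frustrated) response under the policy
    ref_logp_w / ref_logp_l: the same quantities under the frozen reference model
    beta                : temperature controlling deviation from the reference
    """
    # Implicit reward margin: how much more the policy prefers the calm
    # response over the frustrated one, relative to the reference model.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # Negative log-sigmoid of the margin: minimized when the policy
    # clearly prefers the chosen response.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# A policy that already prefers the calm response incurs lower loss
# than one that is indifferent between the two.
aligned = dpo_loss(logp_w=-5.0, logp_l=-9.0, ref_logp_w=-7.0, ref_logp_l=-7.0)
neutral = dpo_loss(logp_w=-7.0, logp_l=-7.0, ref_logp_w=-7.0, ref_logp_l=-7.0)
```

An indifferent policy (zero margin) sits at loss log 2 ≈ 0.693; any preference for the calm response pushes the loss below that.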
UK Government Confirms AI Cyberattack Scaling Law
The UK AI Safety Institute tested frontier AI systems in a cyber range on enterprise network attacks (32 steps) and industrial control system attacks (7 steps). Results: each model generation outperforms its predecessor at a fixed token budget. Average completed attack steps rose from 1.7 (GPT-4o, Aug 2024) to 9.8 (Opus 4.6, Feb 2026) at 10M tokens. The best single run completed 22 of 32 steps, about 60% of what a human expert accomplishes in six hours.
Scaling to 100M tokens boosted performance by 59%.
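Taken at face value, the two reported figures imply a back-of-envelope step count at the larger budget. The calculation below assumes the 59% uplift applies to the 9.8-step average; the summary above gives only the percentage.

```python
# Reported figures from the UK AISI cyber-range evaluation.
steps_at_10m = 9.8      # average completed attack steps (Opus 4.6, 10M tokens)
uplift_at_100m = 0.59   # relative improvement from a 10x token budget

# Implied average at 100M tokens, assuming the uplift composes multiplicatively.
implied_steps_at_100m = steps_at_10m * (1 + uplift_at_100m)
print(round(implied_steps_at_100m, 1))  # prints 15.6
```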
Expert Personas Boost Alignment but Harm Accuracy
USC research found that telling a model it is an "expert" improves alignment but hurts accuracy. On alignment-dependent tasks (writing, roleplay, security), expert personas help; on pretraining-dependent tasks (math, coding), they hurt.
MMLU benchmark: 68.0% with an expert persona vs. 71.6% for the base model. The persona prefix may activate an instruction-following mode that interferes with factual recall.
Exception: A dedicated "Safety Monitor" persona improved attack refusal rates across three security benchmarks, with JailbreakBench gaining +17.7 points (53.2% → 70.9%).
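The manipulation being contrasted is simply a prefix on the prompt. A minimal sketch follows; the helper and the persona strings are hypothetical illustrations, not taken from the paper.

```python
def build_prompt(question, persona=None):
    """Prepend an expert-persona line to a question.

    With persona=None this returns the bare question used as the
    base-model condition; otherwise the persona is prepended as a
    system-style instruction (hypothetical format for illustration).
    """
    if persona is None:
        return question
    return f"You are {persona}.\n\n{question}"

base = build_prompt("What is the capital of Australia?")
expert = build_prompt("What is the capital of Australia?",
                      persona="a world-renowned expert geographer")
```

The study's finding is that the second form helps on alignment-dependent tasks but can depress factual-recall benchmarks like MMLU.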
📎 Import AI: Gemma distress, UK cyberattack scaling 📎 The Register: Expert personas study 📎 arXiv: Expert Personas Improve LLM Alignment
Enterprise AI: China's Agent Market Matures
Baidu DuMate Goes Live
Baidu Cloud's DuMate launched on March 22, 2026, as the first Chinese enterprise-grade Agent product with full local deployment support.
Key features:
- Pre-installed security sandbox isolating code execution from local environment
- Mandatory explicit authorization for high-risk operations (file deletion, system modification, data export)
- Folder-level permission control and full audit trails
- Native Word, Excel, PPT support
- Built-in Baidu Search Skill for enhanced task completion
- Extensible Skills marketplace
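The authorization-plus-audit pattern in the feature list can be illustrated with a small gate. This is a hypothetical sketch of the behavior described, not DuMate's actual API; the operation names are invented for the example.

```python
# High-risk operations that require explicit user authorization,
# per the behavior described above (names are illustrative).
HIGH_RISK = {"file_delete", "system_modify", "data_export"}

def execute(operation, authorized, audit_log):
    """Run an operation through the gate: deny unauthorized high-risk
    actions, and record every decision in the audit trail."""
    if operation in HIGH_RISK and not authorized:
        audit_log.append(("denied", operation))
        return "denied"
    audit_log.append(("executed", operation))
    return "executed"

log = []
execute("file_delete", authorized=False, audit_log=log)  # denied
execute("web_search", authorized=False, audit_log=log)   # executed
```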
DuMate completes Baidu's Agent ecosystem: cloud Agent, phone Agent, security Agent, desktop Agent, and the world's first home-use mini Agent.
DeepSeek-R1: 131K Context Reasoning Model
DeepSeek-R1 offers a 131K context length, the longest among open-source reasoning models. With 13,100 likes and 1.65M downloads on HuggingFace, it represents a significant Chinese contribution to large reasoning models. The model targets ultra-long document processing, complex codebase analysis, and long-range reasoning tasks.
Jensen Huang: "We've Achieved AGI"
Nvidia CEO Jensen Huang declared "I think we've achieved AGI" in a Lex Fridman podcast interview. While AGI remains loosely defined, Huang's statement reflects confidence in current AI progress. He also noted the viral success of open-source AI Agent platforms like OpenClaw, though he qualified his enthusiasm: "100,000 Agents building Nvidia remains zero probability."
📎 36kr: Baidu DuMate 📎 HuggingFace: DeepSeek-R1 📎 The Verge: Jensen Huang AGI
Research Frontiers
JointFM: Foundation Model for Joint Probability Distribution Prediction
JointFM is the first foundation model for predicting the joint probability distribution of coupled time series. Traditional methods rely on Stochastic Differential Equations (SDEs) requiring task-specific modeling and calibration. JointFM inverts this approach: trained on effectively unlimited synthetic SDE data, it directly predicts future joint probability distributions.
Key results: 14.2% relative energy loss reduction vs. strongest baseline, zero-shot deployment without task-specific fine-tuning.
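Synthetic SDE training data of the kind described can be generated cheaply. Below is a minimal Euler-Maruyama sketch of a correlated two-dimensional Ornstein-Uhlenbeck process; all parameter choices are illustrative, not JointFM's actual training recipe.

```python
import math
import random

def coupled_ou_path(n_steps=100, dt=0.01, theta=1.0, rho=0.8, seed=0):
    """Simulate one path of a coupled 2-D Ornstein-Uhlenbeck SDE,
        dX_i = -theta * X_i dt + dW_i,   Corr(dW_1, dW_2) = rho,
    via Euler-Maruyama: an example of the synthetic coupled-series
    data a model like JointFM could be trained on.
    """
    rng = random.Random(seed)
    x = [0.0, 0.0]
    path = [tuple(x)]
    for _ in range(n_steps):
        # Correlated Gaussian increments via Cholesky-style mixing.
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + math.sqrt(1.0 - rho**2) * rng.gauss(0.0, 1.0)
        x[0] += -theta * x[0] * dt + math.sqrt(dt) * z1
        x[1] += -theta * x[1] * dt + math.sqrt(dt) * z2
        path.append(tuple(x))
    return path

path = coupled_ou_path()
```

Sampling many such paths with randomized theta and rho yields an unbounded stream of (series, future-distribution) training examples.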
Fast-Slow Thinking RM: Hybrid Reward Model with CoT
Fast-Slow Thinking Reward Model (F/S-RM) applies Dual Process Theory to reward modeling, combining two reward paradigms in a single model:
- Fast thinking: First-token prediction, direct scalar scoring
- Slow thinking: Chain-of-thought based judgment
Key innovation: dual-confidence activation. The model autonomously decides when to trigger slow thinking, dynamically balancing accuracy and efficiency.
Results: +1.2% relative performance improvement with a 20.8% reduction in token consumption.
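The routing logic can be sketched as follows. The scorer callables and the confidence threshold are hypothetical stand-ins, since the paper's actual confidence measure is not detailed here.

```python
def score_response(response, fast_scorer, slow_scorer, conf_threshold=0.9):
    """Dual-confidence activation sketch: keep the cheap first-token
    ("fast") score when its confidence is high; otherwise fall back
    to a chain-of-thought ("slow") judgment.
    """
    score, confidence = fast_scorer(response)
    if confidence >= conf_threshold:
        return score, "fast"
    return slow_scorer(response), "slow"

# Stub scorers illustrating the routing behavior only.
fast = lambda r: (0.7, 0.95) if "clear" in r else (0.5, 0.4)
slow = lambda r: 0.62

result_fast = score_response("clear answer", fast, slow)   # fast path
result_slow = score_response("ambiguous", fast, slow)      # slow path
```

Because most responses take the fast path, average token consumption drops while hard cases still get the full chain-of-thought judgment.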
📎 arXiv: Fast-Slow Thinking RM
TL;DR
- GPT-5.4 Pro becomes the first AI to solve a FrontierMath open problem, a Ramsey hypergraph problem verified by its original author
- Gemma models show reliable "distress" responses under repeated rejection; DPO provides an effective fix
- UK government confirms an AI cyberattack scaling law: model capability rises predictably with token budget
- Expert persona prompting boosts alignment but hurts accuracy on factual tasks (MMLU: 68.0% vs 71.6%)
- Baidu DuMate launches as China's first enterprise-grade Agent with local deployment support
- DeepSeek-R1 hits 1.65M downloads with 131K context, making it the longest-context open reasoning model
- Jensen Huang declares "AGI achieved" despite the term's fuzzy definition
Follow-up Tracker
- Watch whether more FrontierMath problems yield to current LLMs, which would suggest a reasoning-capability threshold has been crossed
- Monitor whether Gemma-style distress emerges across other model families: are emotional states a safety vector?
- Track Chinese enterprise Agent adoption: DuMate's security model may set industry standard