
AI Daily | Where AI starts to look more like infrastructure than a demo

Mar 14, 2026 · 6 items · 1414 words

This issue tracks two threads accelerating at once: frontier research and capability shifts, and retrieval, multimodal, and memory systems. Hugging Face, Simon Willison, and MarkTechPost each push forward from the research, product, and tooling sides.




Context windows are becoming a commodity. Anthropic shipped 1M tokens for Opus 4.6 and Sonnet 4.6 as a generally available feature — and Simon Willison's analysis shows the pricing story is more nuanced than the raw number suggests. Meanwhile, NVIDIA proved that agentic retrieval — where an LLM actively decides how to search — beats pure embedding similarity by a wide margin on ViDoRe v3. DeepMind introduced Aletheia, an agent designed to move from math competition performance to real research contributions. And Google's Groundsource project extracted 2.6 million historical flash flood events from unstructured global news.

Thread 1 | Retrieval, multimodal, and memory systems

NVIDIA NeMo Retriever's Generalizable Agentic Retrieval Pipeline

Traditional retrieval pipelines compute embeddings once and rank by cosine similarity. NVIDIA's agentic approach gives the LLM itself the ability to decide how to search — choosing between different retrieval strategies, reformulating queries, and combining results. On ViDoRe v3, pairing Opus 4.5 with the nemotron-colembed-vl-8b-v2 embedding model achieved a score of 69.22, taking the #1 spot. On BRIGHT, llama-embed-nemotron-reasoning-3b outperformed the previous model by 19 points. The tradeoff: each query takes ~136 seconds and consumes roughly 760K input tokens plus 6.3K output tokens.
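The loop described above can be sketched in a few lines. This is a toy illustration of the agentic retrieval pattern, not NVIDIA's NeMo Retriever API: the search and reformulation functions are deliberately naive stand-ins for a dense retriever and an LLM rewriter.

```python
# Minimal sketch of an agentic retrieval loop: instead of a single
# embedding pass, a controller lets the "model" collect results,
# judge sufficiency, and reformulate the query between rounds.
# All functions here are toy stand-ins, not the actual pipeline.

def embedding_search(query, corpus):
    # Stand-in for dense retrieval: rank documents by naive token overlap.
    query_tokens = set(query.lower().split())
    scored = [(len(query_tokens & set(doc.lower().split())), doc) for doc in corpus]
    return [doc for score, doc in sorted(scored, reverse=True) if score > 0]

def reformulate(query):
    # Stand-in for the LLM rewriting the query when results are thin.
    synonyms = {"fast": "quick", "car": "vehicle"}
    return " ".join(synonyms.get(word, word) for word in query.lower().split())

def agentic_retrieve(query, corpus, max_rounds=3, want=2):
    seen, results = set(), []
    for _ in range(max_rounds):
        for doc in embedding_search(query, corpus):
            if doc not in seen:
                seen.add(doc)
                results.append(doc)
        if len(results) >= want:  # the sufficiency check an LLM judge would make
            break
        query = reformulate(query)
    return results

corpus = [
    "the vehicle was quick on the track",
    "a fast car wins races",
    "bread recipes for beginners",
]
print(agentic_retrieve("fast car", corpus))
```

The reformulation round is what distinguishes this from a one-shot similarity search: the first pass misses the paraphrased document, and the rewritten query recovers it. That extra looping is also exactly where the ~136-second, 760K-token cost comes from in the real pipeline.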

If your RAG pipeline relies on a single embedding similarity pass, this result should make you reconsider. Agentic retrieval dramatically improves quality on complex queries that require reasoning about what to look for, not just matching surface-level similarity. The cost is real — ~136 seconds and 760K tokens per query isn't cheap — but for high-value tasks (legal research, medical literature review, compliance scanning), accuracy matters more than latency.

Watch whether NVIDIA or others release optimized, faster variants of this pipeline. 136 seconds per query works for batch processing but not for interactive applications. If inference optimization brings that below 30 seconds without significant quality loss, agentic retrieval becomes viable for real-time products.

Links: Hugging Face Blog


Thread 2 | AI product experience

1M context is now generally available for Opus 4.6 and Sonnet 4.6

Anthropic made 1M-token context windows generally available across Opus 4.6 and Sonnet 4.6, but Simon Willison's analysis reveals a pricing structure that penalizes large contexts differently than you might expect. OpenAI and Gemini both charge elevated rates above certain thresholds (200K for Gemini 3.1 Pro, 272K for GPT-5.4), meaning "unlimited context" never actually means unlimited cost. Anthropic's move to GA removes the access barrier but doesn't change the economics: stuffing a million tokens into every call will burn budget fast.

Before migrating your workflows to 1M context, run a cost analysis on your actual usage patterns. Most tasks don't need that much context — they need better retrieval. Reserve long-context windows for cases where the full document matters (legal contract review, full codebase analysis) and stick with shorter contexts + RAG for everything else.

Check Anthropic's per-token pricing tiers for Opus 4.6 and Sonnet 4.6 when context exceeds 200K tokens. If the price curve steepens significantly, build context-length-aware routing into your application: short tasks go to standard context, long tasks go to 1M with explicit user confirmation.
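A routing gate like that can be very small. The sketch below assumes a 200K-token tier boundary purely for illustration; the threshold and the tier names are placeholders to be replaced with whatever Anthropic's current pricing page actually says.

```python
# Context-length-aware routing sketch. The threshold below is an
# assumed tier boundary for illustration only -- verify against the
# provider's published pricing before relying on it.

LONG_CONTEXT_THRESHOLD = 200_000  # assumed boundary, not a confirmed price tier

def route_request(prompt_tokens, user_confirmed_long=False):
    """Pick a context tier for a request; refuse long-context calls
    that haven't been explicitly confirmed by the user."""
    if prompt_tokens <= LONG_CONTEXT_THRESHOLD:
        return "standard-context"
    if not user_confirmed_long:
        raise ValueError(
            f"{prompt_tokens} tokens exceeds {LONG_CONTEXT_THRESHOLD}; "
            "require explicit confirmation before using the 1M tier"
        )
    return "long-context-1m"

print(route_request(50_000))
print(route_request(800_000, user_confirmed_long=True))
```

The explicit-confirmation step is the point: it turns an invisible cost cliff into a visible decision, which is cheap insurance while long-context pricing is still settling.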

Links: Simon Willison


Thread 3 | Frontier research and capability shifts

Google DeepMind's Aletheia: from math competitions to autonomous research

DeepMind's Aletheia agent sits at the intersection of mathematical reasoning and scientific discovery. While models achieved gold-medal performance at the 2025 IMO, competition math and professional research operate under different constraints — research problems are ill-defined, solutions require literature synthesis, and validation is open-ended rather than binary. Aletheia addresses this by iteratively generating, verifying, and revising solutions in a loop, bridging the structured world of competition problems and the messy reality of actual research.

For research teams, Aletheia represents a potential shift in how AI assistants are used in scientific work. Rather than treating the model as a knowledgeable oracle, Aletheia treats it as a hypothesis generator that proposes, tests, and refines. If your team works in math-heavy domains (physics, quantitative finance, operations research), the iterative verification pattern is directly applicable even without using Aletheia itself.
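The generate-verify-revise pattern is simple enough to show in miniature. Here the "hypothesis generator" is a toy integer guesser and the verifier is an exact numeric check; in a system like Aletheia both would be model or tool calls, but the control flow is the same.

```python
# Toy generate-verify-revise loop. The propose/verify functions are
# deliberately trivial stand-ins for model calls: the point is the
# loop structure, where a hard verifier gates each revision.

def propose(attempt):
    # Toy hypothesis generator: try successive integer candidates.
    return attempt

def verify(candidate, target):
    # Deterministic verifier: the binary accept/revise signal.
    return candidate * candidate == target

def generate_verify_revise(target, max_rounds=100):
    for attempt in range(max_rounds):
        candidate = propose(attempt)
        if verify(candidate, target):
            return candidate
    return None  # exhausted the revision budget without a verified answer

print(generate_verify_revise(49))
```

The design choice worth copying is that the verifier, not the generator, decides termination: the loop only ever returns an answer that passed an external check, which is what makes the pattern usable in domains where the model's own confidence is unreliable.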

Watch for Aletheia's results on open research benchmarks — not just math competition scores. If it produces verifiable, novel contributions in a real research domain (not a synthetic benchmark), that crosses a meaningful threshold from competition solver to research tool.

Links: MarkTechPost


Brief | Agents and developer tooling

Craig Mod built his own accounting system with Claude

Writer Craig Mod shared his experience building a custom accounting tool with Claude after years of frustration with off-the-shelf software. The tool handles multiple currencies and pulls daily historical conversion rates — functionality that existing products couldn't match for his specific workflow. His approach: sit down with an AI coding agent and build exactly what you need, not what a product manager guessed you needed.

Small-scale, personal tools are the sweet spot for current AI coding agents. If you've been tolerating a clunky workflow because no commercial product fits your exact requirements, try building it with Claude Code or Codex in a weekend. The bar for "good enough" personal software is lower than you think, and the agent handles the tedious parts.

Watch whether a pattern emerges of domain experts (writers, accountants, designers) building their own niche tools with AI agents and open-sourcing them. That would signal a bottom-up software market where supply comes from practitioners, not product teams.

Links: Simon Willison


Brief | Physical AI and industry

Physical AI and manufacturing's next advantage

MIT Technology Review explores how manufacturing is shifting from traditional automation, which delivered gains through repetition, to "physical AI" systems that handle complexity, adapt to variation, and compensate for labor constraints. The argument: decades of pursuing pure efficiency have hit diminishing returns, and the new frontier is adaptability amid rising complexity and workforce shortages.

If you're in manufacturing, logistics, or robotics, the question isn't whether to adopt AI but where to start. Physical AI requires significant integration work — sensor data pipelines, real-time inference, safety constraints — so begin with a constrained, high-value process (quality inspection, predictive maintenance) and expand from there.

Track whether major manufacturing companies publish ROI data from physical AI deployments within 2026. If numbers show 2x+ improvement in specific metrics (defect rates, throughput, downtime), expect rapid industry adoption.

Links: MIT Technology Review


Brief | Frontier research and capability shifts

Google's Groundsource: turning global news into structured historical data

Google Research released Groundsource, a methodology using Gemini to extract structured data from unstructured global news. Its first output: an open-source dataset of 2.6 million historical urban flash flood events across 150+ countries, addressing a critical gap in disaster preparedness data. The approach transforms decades of news articles — previously unsearchable at scale — into a queryable, structured resource.

For teams working with unstructured text archives (legal documents, medical records, historical data), Groundsource's methodology is directly applicable. The key insight: don't ask the model to answer questions about the text — ask it to extract structured records that you can then query with standard tools. This separation of extraction from analysis dramatically improves reliability and auditability.
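The extract-then-query split looks like this in miniature. The `fake_model_extract` function below is a stand-in for a real Gemini or other LLM call prompted to return one JSON record per event (or JSON `null` when no event is mentioned); everything downstream is ordinary, auditable code.

```python
# Sketch of the extract-then-query pattern: the model emits
# structured records, and analysis happens with plain code over
# those records. fake_model_extract is a hypothetical stand-in
# for a real LLM call, not Groundsource's pipeline.

import json

def fake_model_extract(article):
    # Stand-in: a real pipeline would prompt the model to return a
    # JSON record for each event, or JSON null for no event.
    if "flood" not in article.lower():
        return "null"
    record = {
        "event": "flash_flood",
        "location": "Valencia" if "Valencia" in article else "unknown",
        "year": 2024 if "2024" in article else None,
    }
    return json.dumps(record)

articles = [
    "Flash floods struck Valencia in late 2024, displacing thousands.",
    "A bakery opened downtown last week.",
]

# Extraction pass: model output -> parsed records, nulls dropped.
records = [json.loads(fake_model_extract(a)) for a in articles]
records = [r for r in records if r is not None]

# Analysis pass: plain queries over structured data, fully auditable.
floods_2024 = [r for r in records if r["year"] == 2024]
print(len(floods_2024))
```

Because the extraction output is inspectable JSON rather than a free-text answer, every downstream claim can be traced back to a specific record and, from there, to a specific source article.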

Watch for Groundsource's extraction pipeline to be generalized beyond flood events. If Google releases the extraction templates and evaluation framework for other domains (conflict tracking, disease surveillance, economic indicators), it becomes a practical tool for data teams working with messy archives.

Links: MarkTechPost

⚙️ Generated by EVA · blog.lincept.com
