Latent

Migrating from Gemini 2.5 Flash to 3.5 Flash · what actually changed, and what to touch in your code

2026-05-21T00:00:00+00:00

If you have a production workload on gemini-2.5-flash, the answer to “should I move?” is almost certainly yes — but the migration is not a simple model-ID swap. The Gemini 3.x family changed the thinking config, dropped the classic sampling knobs, and tightened the function-calling contract. Here is what I’d hand a teammate on day one of the cutover.

What actually got better

Hold the model name fixed at Flash and look at the deltas that matter for a production agent:

Coding / agentic workloads jumped substantially. On Terminal-Bench 2.1, 3.5 Flash hits 76.2% — and notably beats Gemini 3.1 Pro on the agentic suite (Terminal-Bench, MCP Atlas, Finance Agent v2, GDPval-AA). That last one is the headline I keep coming back to: a Flash-tier model out-scoring the previous Pro-tier on agent tasks is not a normal generational jump.
Low-reasoning coding is up 10–20% over the previous Flash generation. This is the band most production traffic actually sits in.
Thinking is now first-class. thinking_level is a string enum (minimal → high), and medium is the new default. The model is tuned for it, and chain-of-thought scaffolding in your prompts is now actively counterproductive — simpler prompts at medium beat elaborate CoT at low.
Thought preservation is on by default. Multi-turn agentic loops get more coherent across tool calls. Token usage goes up a bit; quality goes up more.

What got worse

One thing, and it matters: price.

3.5 Flash: $1.50 / 1M input, $9.00 / 1M output, $0.15 / 1M cached input.
That’s roughly 3× the price of the Gemini 3 Flash Preview and, per Artificial Analysis’s full-benchmark suite, about 5.5× the run cost of the previous Flash (the increase compounds because thinking-on-by-default eats more output tokens).
It does sit 25% below Gemini 3.1 Pro ($2.00 / $12.00) on both input and output list price, so the intelligence-per-dollar picture is still favourable. But if your 2.5 Flash bill was tight, model your new cost before you cut over.

The fix on cost is almost always the same: aggressive prompt caching (cached input at $0.15/M is a tenth of the uncached input price), and thinking_level: 'minimal' or 'low' on routes that don’t need reasoning.
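
To make “model your new cost” concrete, here is a minimal sketch of the per-request arithmetic at the list prices quoted above. The RequestProfile type and the token counts are hypothetical placeholders, and the sketch assumes thinking tokens are billed as output; substitute numbers measured from your own traffic.

```python
from dataclasses import dataclass

# List prices quoted in this post, in USD per 1M tokens.
INPUT_PER_M = 1.50
CACHED_INPUT_PER_M = 0.15
OUTPUT_PER_M = 9.00


@dataclass
class RequestProfile:
    """Hypothetical per-request token profile; measure your own traffic."""
    input_tokens: int    # total prompt tokens, cached and uncached
    cached_tokens: int   # the portion of input_tokens served from cache
    output_tokens: int   # visible output plus thinking tokens (assumption)


def cost_per_request(p: RequestProfile) -> float:
    """Return the USD cost of one request at the prices above."""
    if not 0 <= p.cached_tokens <= p.input_tokens:
        raise ValueError("cached_tokens must be between 0 and input_tokens")
    uncached = p.input_tokens - p.cached_tokens
    return (
        uncached * INPUT_PER_M
        + p.cached_tokens * CACHED_INPUT_PER_M
        + p.output_tokens * OUTPUT_PER_M
    ) / 1_000_000


# Illustrative numbers only: same request with and without a warm cache.
cold = RequestProfile(input_tokens=20_000, cached_tokens=0, output_tokens=1_000)
warm = RequestProfile(input_tokens=20_000, cached_tokens=18_000, output_tokens=1_000)
print(f"cold: ${cost_per_request(cold):.4f}")
print(f"warm: ${cost_per_request(warm):.4f}")
```

With a profile like this one, output tokens dominate once the prompt is mostly cached, which is why the thinking level is the second lever after caching.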

The four code changes you’ll actually make

1. Model ID

- model = "gemini-2.5-flash"
+ model = "gemini-3.5-flash"

GA, no preview suffix. Available in the Gemini API, AI Studio, Antigravity, the Gemini app, and AI Mode in Search.

2. Drop `temperature`, `top_p`, `top_k`

These are no longer recommended on any Gemini 3.x model. The reasoning is tuned for default sampling; passing custom values is a net-negative on most evals I’ve seen.

- generation_config = {
-   "temperature": 0.2,
-   "top_p": 0.9,
-   "top_k": 40,
- }
+ # leave sampling at defaults on 3.x

If you were using temperature=0 for determinism, you’re going to need a new strategy — usually thinking_level: 'minimal' plus a stricter system prompt and schema-constrained output.

3. `thinking_budget` → `thinking_level`

The integer budget is gone. Replace it with the string enum.

- thinking_config = ThinkingConfig(thinking_budget=7500)
+ thinking_config = ThinkingConfig(thinking_level="medium")

Mapping I’ve been using as a starting point (adjust per route):

Old (`thinking_budget`)	New (`thinking_level`)
`0` / disabled	`minimal`
`~1k–3k`	`low`
`~5k–10k`	`medium` (new default)
`>10k`	`high`
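
If you have many call sites, the table can be turned into a small helper. This is a sketch of my starting-point mapping, not an official conversion: budget_to_level is a hypothetical name, the table leaves 3k–5k and sub-1k budgets unassigned so the cut-offs below are assumptions, and negative values (for example a dynamic-budget sentinel, if your code uses one) are rejected rather than guessed.

```python
def budget_to_level(thinking_budget: int) -> str:
    """Map a legacy integer thinking_budget to a thinking_level string.

    Assumptions beyond the table: any non-zero budget up to 4,000 maps to
    "low", and 4,001-10,000 maps to "medium". Adjust per route.
    """
    if thinking_budget < 0:
        raise ValueError("negative budgets are not covered by the table")
    if thinking_budget == 0:
        return "minimal"
    if thinking_budget <= 4_000:
        return "low"
    if thinking_budget <= 10_000:
        return "medium"
    return "high"


for budget in (0, 2_000, 7_500, 20_000):
    print(budget, "->", budget_to_level(budget))
```

Treat the result as a first guess and let your eval set decide whether a route should move up or down a level.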

Important gotcha if you’re coming from gemini-3-flash-preview rather than 2.5: the preview defaulted to high. 3.5 GA defaults to medium. If your eval scores quietly dropped after the model-ID swap, this is almost certainly why — set thinking_level: 'high' explicitly to restore the previous behaviour.

4. Tighten your `FunctionResponse` parts

The function-calling contract is stricter now. There are three requirements; miss any of them and the call will either be rejected or the model will hallucinate around the missing result:

id must match the original FunctionCall.id — you can no longer get away with omitting it.
name must match the call’s name.
Exactly one response per function call — no merging, no extras.

  response_part = Part.from_function_response(
+     id=call.id,
      name=call.name,
      response={"result": tool_output},
  )

Two related cleanups while you’re in this file:

Multimodal tool results: put media inside the function response parts, not as sibling parts.
Inline instructions (the “and now do X with this” trailing nudge): append them to the response text with two newlines, rather than sending them as a separate Part.
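
The three requirements are easy to check before a turn goes back to the model. Below is a minimal sketch of such a pre-flight check. Call and Response are hypothetical stand-ins that carry only id and name, not the SDK’s own types, so adapt the field access to whatever objects your client library hands you.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Call:
    """Stand-in for a function call from the model (hypothetical type)."""
    id: str
    name: str


@dataclass(frozen=True)
class Response:
    """Stand-in for a function response part (hypothetical type)."""
    id: Optional[str]
    name: str


def check_responses(calls: List[Call], responses: List[Response]) -> List[str]:
    """Return a list of contract violations; empty means the turn looks valid."""
    problems: List[str] = []
    calls_by_id = {c.id: c for c in calls}
    seen = Counter(r.id for r in responses)

    # Every response needs an id that points at a real call with the same name.
    for r in responses:
        if r.id is None:
            problems.append(f"response for {r.name!r} is missing id")
        elif r.id not in calls_by_id:
            problems.append(f"response id {r.id!r} matches no call")
        elif r.name != calls_by_id[r.id].name:
            problems.append(
                f"response {r.id!r} is named {r.name!r}, "
                f"expected {calls_by_id[r.id].name!r}"
            )

    # Every call needs exactly one response.
    for c in calls:
        if seen[c.id] == 0:
            problems.append(f"call {c.id!r} ({c.name}) has no response")
        elif seen[c.id] > 1:
            problems.append(f"call {c.id!r} ({c.name}) has {seen[c.id]} responses")
    return problems


calls = [Call("c1", "get_weather"), Call("c2", "get_time")]
bad_turn = [Response("c1", "get_weather"), Response(None, "get_time")]
for problem in check_responses(calls, bad_turn):
    print(problem)
```

Running this in a unit test around your tool-dispatch layer catches a missing id before it shows up as a rejected request in production.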

Prompt cleanups you’ll thank yourself for

3.5 Flash punishes the prompting habits that 2.5 Flash rewarded.

Strip explicit chain-of-thought scaffolding. “Think step by step, first list assumptions, then…” — delete it. Set thinking_level: 'medium' (or 'high') and let the model do the reasoning natively. I’ve seen 5–10% accuracy gains from removing CoT prompts on 3.5.
Shorten system prompts. The model follows tighter instructions more reliably than 2.5 did; verbosity in the system prompt now correlates negatively with instruction-following on a couple of my internal evals.
Use schema-constrained output where you used to coerce JSON in the prompt. Cheaper, more reliable.

Caveats and not-yets

Computer Use is not supported on 3.5 Flash yet. If you have a workload using the computer-use surface, stay on Gemini 3 Flash Preview for that specific route. Mixed-model deployments are fine.
PDF token usage can go up at media_resolution_high. Video usage typically goes down. Re-baseline both before you trust your cost projections.
Thought preservation increases output tokens. Worth it for agent loops; you can opt out per route if you’re cost-sensitive on single-turn classification work.

A migration checklist I’d hand a team

Swap the model ID in one non-production route. Run your eval set.
Remove temperature / top_p / top_k. Re-run.
Convert thinking_budget → thinking_level using the table above. Re-run.
Audit every FunctionResponse site: add id, verify name, ensure 1:1 with the call.
Move multimodal tool outputs inside the function response part.
Delete explicit CoT scaffolding from your prompts; trim system prompts.
Re-baseline cost on a representative day of traffic — the price jump is real, prompt caching is your main lever.
Canary 5% of production traffic for 48 hours. Watch p95 latency, tool-call error rate, and cost-per-request, in that order.
Cut over fully. Leave 3 Flash Preview wired up for any computer-use routes.
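
For the canary step, a sketch of the gate I have in mind: compare the canary against the 2.5 Flash baseline and report the first metric that regressed, checked in the order given above. The metric names, the nearest-rank percentile, and every threshold here are hypothetical placeholders; the cost allowance in particular should come from your own re-baselined projection, since some increase is expected.

```python
from typing import Dict, List, Optional

# Checked in this order, matching the checklist.
CHECK_ORDER = ("p95_latency_ms", "tool_call_error_rate", "cost_per_request_usd")


def p95(samples: List[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty sample."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil of 0.95 * n, 1-based
    return ordered[rank - 1]


def first_regression(
    baseline: Dict[str, float],
    canary: Dict[str, float],
    max_increase: Dict[str, float],
) -> Optional[str]:
    """Return the first metric whose relative increase exceeds its allowance."""
    for metric in CHECK_ORDER:
        limit = baseline[metric] * (1 + max_increase[metric])
        if canary[metric] > limit:
            return metric
    return None


# Illustrative values only.
baseline = {"p95_latency_ms": 800.0, "tool_call_error_rate": 0.01, "cost_per_request_usd": 0.004}
canary = {"p95_latency_ms": 900.0, "tool_call_error_rate": 0.03, "cost_per_request_usd": 0.02}
allowed = {"p95_latency_ms": 0.25, "tool_call_error_rate": 0.5, "cost_per_request_usd": 5.0}
print(first_regression(baseline, canary, allowed))
```

Note that a baseline of exactly zero makes any non-zero canary value a regression under a relative allowance; add an absolute floor if that is too strict for your error-rate metric.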

The TL;DR: the model is meaningfully better at exactly the workloads most teams are running (agents, tool-use, coding) and the migration is mostly a config-and-prompt cleanup rather than an architectural change. Budget a day for a small team, two days if your function-calling layer is non-trivial.

— Priyanshu


Google I/O 2026 · the day Google stopped shipping models and started shipping agents

2026-05-21T00:00:00+00:00

I watched the Shoreline keynote yesterday with the same notepad I use for client calls. By the end of it the page had one underlined sentence at the top: the model is no longer the product. Below that, the things that actually shipped on 20 May 2026.

The two models

Gemini 3.5 is the new frontier family. The framing Google used — “frontier intelligence with action” — is the giveaway. 3.5 is being positioned less as a benchmark-chasing release and more as the substrate the rest of the agentic stack runs on. Gemini 3.5 Flash was the SKU shown end-to-end on stage; the bigger sibling was alluded to but not benchmarked head-to-head against Opus 4.7 or GPT-5.5, which tells you something about where Google wants the conversation to go.

Gemini Omni is the more interesting release. “Any input to any output, starting with video” is the pitch. The demos that landed were the editing ones — point at a frame, describe the change in natural language, get a coherent edit back across the rest of the clip. This is the first time I’ve seen a multimodal model where video editing felt like a first-class output modality rather than a party trick stitched onto an image model.

The agentic surface

This is the part of the keynote I’ll remember. Five separate product launches, all of them agents, all shipping into surfaces that already have hundreds of millions of users:

Gemini Spark — a general-purpose agent inside the Gemini app that can reason across your connected apps and take action under your direction. Beta, Ultra subscribers and trusted testers first, wider rollout to follow.
Information agents in Search — Search becomes a thing that goes and does the research loop, not just a thing that ranks ten blue links.
Daily Brief — proactive, 24/7 surfacing inside the Gemini app. The “agent that pre-empts you” pattern that everyone has been trying; Google now has the personal-context graph to actually make it useful.
Universal Cart — a shopping cart that holds items across merchants and lets an agent transact on your behalf. The commerce-side implications of this are bigger than the demo suggested.
Google Antigravity — the agent-first developer platform got a substantial bump. “Moving beyond AI tools that help write, to agents that help act” is the line. This is Google’s answer to Cursor + Claude Code + the rest of the agentic-IDE cohort.

What this actually means

Three things stood out to me, and I think they’re going to shape the rest of the year:

1. Google is now competing on distribution, not model quality. The model announcements were almost a formality. The keynote spent its minutes on what the models do inside Gmail, Docs, Keep, Drive, Search, and the Gemini app. When your moat is a billion-user surface area, the model is a feature; the agent on top of it is the product. Anthropic and OpenAI do not have this lever, and yesterday made that asymmetry very visible.

2. The “agentic in everything” pivot is now industry-wide. Microsoft has Copilot agents, Anthropic has Claude with computer-use and tool ecosystems, OpenAI has Operator-class products, and now Google has shipped its own coherent agent layer end-to-end across consumer and developer surfaces in a single keynote. The interesting question for buyers in Q3 is no longer “which model is best” — it’s “whose agent layer integrates with our system of record.” That is a very different procurement conversation.

3. Antigravity is the one to watch for developer tooling. The agentic-IDE space has been fragmenting fast. Google entering with first-party access to Gemini 3.5 + Omni + the Workspace graph is a different kind of entrant than the startups in this category. I expect a real fight over the next two quarters.

What I’m watching next

Whether Gemini Spark actually generalises across connected apps the way the demo suggested, or whether it ends up being a Google-properties agent with thin connectors. The reliability story on cross-app agents is still unsolved across the industry.
Universal Cart’s merchant adoption curve. If Google can get the big retailers in by holiday season, this becomes the default purchase surface inside the Gemini app and the discovery economics change.
Antigravity vs. Claude Code vs. Cursor on real agentic coding workloads. The benchmarks I’d want here don’t exist publicly yet; I’ll try to put one together for an enterprise client this quarter and write it up.
Whether Omni lands a real video-editing market or remains a demo capability. The editing UX in the keynote was good. The proof is whether anyone ships paid video workflows on it within 90 days.

The headline yesterday wasn’t “Google shipped Gemini 3.5.” It was “Google stopped shipping a model and started shipping an operating layer.” That is a different company than the one that showed up at I/O 2024.

— Priyanshu


This week in AI · the lines I drew on the whiteboard

2026-05-07T00:00:00+00:00

A short week-in-review. Six items I keep returning to, with a one-line take on each.

Anthropic’s compute deal with SpaceX is the structural story of the week. 220K+ GPUs at Colossus 1, doubled Claude Code rate limits, and gigawatt-scale orbital data-centre talks. The capex race is now openly the lever pulling everything else. (I wrote about this on Tuesday — see the previous post.)

CAISI’s pre-launch agreements with Google, Microsoft, xAI. Pre-deployment government evaluation moved from voluntary commitment to formal-and-repeated. This is going to start showing up as a procurement-checklist item by Q4. Anthropic has had a separate arrangement; OpenAI’s status is less clear. Watch for the next labs to sign on.

DeepSeek V4-Pro is the #2 open-weights model on Artificial Analysis’s intelligence index. Behind only Kimi K2.6. Open-weights gap on reasoning is single-digit-percent territory now. MIT licence on V4 makes it deployable in places Llama Community Licence isn’t. (Wrote about this last week.)

MCP roadmap update is worth reading. The 2026 priorities are unglamorous but exactly right: audit trails, SSO-integrated auth, gateway patterns, async/long-running tools, configuration portability. The protocol won the integration layer. Now the boring work of enterprise-readiness gets done. The community link is below.

Quiet shift in the agent stack: the move from “frameworks” to “primitives”. The trend through 2024 was full-stack agent frameworks — opinionated, vertically integrated. The trend in 2026 is people composing MCP servers + a stateful orchestrator (LangGraph or similar) + their own evaluation harness, instead of buying a framework. We made that move last quarter and the operability gain was real.

GPT-5.5 (“Spud”) shipping under the GPT-5 brand instead of GPT-6. OpenAI held the line on what counts as a major version. Industry-restraint signal. Worth watching whether competitors follow suit; the model-versioning hype cycle has been a problem for buyers.

What I’m thinking about for next week

A longer note on agentic evaluation harnesses — what we’ve built, what we’re still missing.
A field report on moving an enterprise client from Llama 3.3 to a V4-Flash deployment — what changed, what didn’t.
Possibly a shorter post on Project Glasswing’s ripple effects — security teams I’m talking to are starting to plan around the assumption that Mythos-class capabilities will be more broadly available within 12 months.

Take care of yourselves. See you next week.

— Priyanshu

Sources:

Claude Opus 4.7 and the SpaceX deal — orbital compute, doubled Code limits, same prices

2026-05-05T00:00:00+00:00

Two announcements from Anthropic on May 4 that ought to be read together: a new flagship model, and a compute deal that materially changes their capacity ceiling.

Claude Opus 4.7

The model itself is positioned as an incremental upgrade over Opus 4.6:

Notable gains on advanced software engineering, particularly on the hardest tasks. Anthropic’s framing — and matching independent reports — is that Opus 4.7 is the first Claude where you can confidently hand off your hardest coding work without babysitting.
Substantially better vision — higher resolution image inputs, improved fine-detail extraction.
Same pricing as 4.6 — $5/M input tokens, $25/M output tokens.

Pricing-flat-with-quality-up is the move I’ve been expecting from all the frontier labs as competition tightens. Worth noting that the headline numbers don’t move; the unit economics do.

The SpaceX partnership

The structurally interesting announcement. Anthropic signed a deal with SpaceX for the entire compute capacity at Colossus 1:

300+ megawatts of new capacity.
220,000+ NVIDIA GPUs coming online within the month.
The two companies are exploring orbital data centres — Anthropic mentioned interest in multiple gigawatts of orbital AI compute capacity.

Yes, orbital. The pitch for space-based compute is solar (constant illumination), thermal (radiative cooling against deep space), and political (jurisdiction, supply-chain redundancy). Whether it’s actually economical at scale is a different question, but the fact that it’s being seriously discussed at the GW scale by an organisation that ships product is worth filing away.

The downstream effects are immediate:

Claude Code’s 5-hour rate limits doubled for Pro, Max, Team, and Enterprise plans, effective immediately.
API rate limits raised across the board for Opus.
This is the second compute capacity announcement from Anthropic in two months — they’ve been visibly compute-constrained, and this addresses it.

What I take from this

The compute war is here. The frontier labs are now openly trading capital for capacity in ways that show up as customer-facing rate limits. Operating at the frontier is a capex question first, R&D question second.

The duopoly dynamic is real. Anthropic at $30B annualised, OpenAI at $24B, both racing to lock in compute. The mid-2025 view that the field would have 4–6 frontier labs is now harder to defend. Most of the rest are de-facto open-source partners or cloud-vendor offerings.

Vendor risk is back on the agenda. When two companies are responsible for most of the inference for the GenAI economy, single-vendor dependency starts costing real procurement points. Build for vendor-neutrality (MCP, abstracted model calls) accordingly.

We doubled our Claude Code allocation today and used it. The new Opus is genuinely better. If the orbital data-centre piece comes together, this gets even more interesting.


CAISI signs pre-launch evaluation agreements with Google, Microsoft, xAI

2026-05-02T00:00:00+00:00

The story most enterprise practitioners aren’t watching closely enough this week: the U.S. Center for AI Standards and Innovation (CAISI) announced agreements with Google DeepMind, Microsoft, and xAI that allow the government to evaluate frontier AI models before they’re publicly released.

This is the first concrete instance I’ve seen of pre-launch evaluation moving from voluntary commitments to a formal, repeated process. It deserves more attention than it’s getting.

What’s actually in the agreement

The reported scope:

Pre-launch evaluation access — CAISI gets to test new frontier models from each company before public release.
Capability and safety testing — focused on dual-use risks (biosecurity, cybersecurity, autonomy).
Findings sharing — the labs receive the evaluation results; not all findings are necessarily public.

CAISI also recently published its evaluation of DeepSeek V4-Pro following V4’s April release — separate work, but the same body, and a useful signal that they’re scaling up evaluations across both U.S. and overseas frontier labs.

What it doesn’t say

Worth being clear about the limits:

It’s not a launch veto. Evaluations inform; they don’t block. (At least not under the current framework.)
OpenAI and Anthropic aren’t in the announcement — though Anthropic has had a different evaluation arrangement going back further. The question is when (not whether) similar formal agreements expand.
It’s not the EU AI Act. Different framework, different teeth, different scope. Don’t conflate them.

What it means for enterprise practitioners

Compliance buyers will want to see the evaluation report. If you’re advising a regulated client on model selection, the CAISI report is becoming the artefact you’ll be asked about. Not the model card. Not the system card. The independent evaluation.

The pre-launch window is shifting. Labs that have been comfortable with “ship fast, eval after” are facing a structural pressure to delay launches for evaluation. This shows up downstream as longer pre-release windows, more cautious staged rollouts, and (sometimes) features held back at launch.

Cross-border models become a separate question. A model evaluated by CAISI is in a different operational risk category — for U.S. enterprise procurement — than a model that wasn’t. This isn’t legally required (yet), but it’s becoming a procurement-checklist item at large enterprises.

My read

This is the kind of slow, mostly-procedural development that won’t make AI Twitter trend, but will substantially shape what enterprise AI procurement looks like in 2027. We’ll start seeing CAISI-evaluation status as a vendor-comparison axis. Plan accordingly.


DeepSeek V4-Pro and V4-Flash — open weights catch the closed frontier

2026-04-29T00:00:00+00:00

DeepSeek dropped two models on April 24, and they matter more than the standard “another open release” framing suggests.

The headline:

DeepSeek V4-Pro — 1.6 trillion parameters total, 49B activated per token (MoE), 1M token context.
DeepSeek V4-Flash — 284B parameters total, 13B activated per token (MoE), 1M token context.
Both shipped same-day as API endpoints AND as open weights under the MIT licence on Hugging Face.

MIT. Not a Llama-style “community licence” with monthly-active-user carve-outs — a genuinely permissive licence that allows commercial use, modification, and redistribution.

Where V4-Pro sits

On the Artificial Analysis Intelligence Index for open weights, V4-Pro is now #2, behind only Kimi K2.6. On the GDPval-AA agentic-real-world-tasks benchmark, V4-Pro Max scored 1554, beating Kimi K2.6 (1484), GLM-5.1 (1535), GLM-5 (1402), and MiniMax-M2.7 (1514).

Coding-specific: V4-Pro now outperforms most closed flagships on Codeforces, LiveCodeBench, and Terminal-Bench. The “open-weights gap” on coding has effectively closed.

What this means for enterprise builds

Three concrete shifts I’m thinking about for our roadmap:

One. For workloads where the cost-per-token math has been borderline (high-volume RAG, agentic pipelines that burn tokens on tool-calling intermediate steps), V4-Flash on self-hosted infrastructure is now competitive on quality with where Claude Sonnet was twelve months ago. The unit economics shift.

Two. For reasoning-heavy workloads (Text2SQL on complex schemas, code agents, multi-step plans), V4-Pro is the first open-weights model I’d seriously consider against the closed flagships. The activation count (49B per token) keeps per-token compute in a range a lot of teams can already serve, though all 1.6T parameters still have to sit in memory across the serving nodes.

Three. MIT licence changes the conversation in regulated industries. The Llama Community Licence is a non-starter at some financial-services and healthcare clients I’ve worked with — their legal teams won’t sign off. MIT clears that hurdle.

The caveats

Inference cost for V4-Pro at full quality is still substantial — that 1.6T total parameter count needs serving infrastructure.
The model has a recognisable Chinese-language tilt in its training mix, which shows up on culturally sensitive evaluations. Test against your domain.
Open weights are not the same as open data — we still don’t have the training corpus.

The bigger pattern

We’re now at the point where every 2–4 weeks, a frontier-class open release lands. Llama 4 in early April. Mistral Large 3 in March. Gemma 4 in April under Apache 2.0. DeepSeek V4 now. The closed-source labs still have the absolute leading edge, but the gap measured in “months until the open model is good enough for this workload” has dropped to single digits for most enterprise tasks.

If you haven’t priced an open-weights deployment into your 2026 architecture decisions, do it now.


GPT-5.5 ships as ‘Spud’ — what the rebrand from GPT-6 tells us

2026-04-26T00:00:00+00:00

OpenAI’s most-anticipated model of the year shipped this week, nine days late and with a different name on the box.

The recap, briefly:

The model — internally codenamed Spud — was confirmed for an April 14, 2026 global launch.
That date came and went. No public weights, no API rollout, no developer keys.
On April 23, OpenAI shipped it — branded GPT-5.5, not GPT-6.

There are two threads to pull on here. The benchmarks, and the naming.

The benchmarks

Pre-launch leaks pointed to GPT-6-class performance: a high-70s SWE-bench Pro score (the agentic software-engineering benchmark that’s become the headline metric for “frontier” claims), substantial improvements on long-context reasoning, etc.

The actual numbers in the system card came in lower than the leaks. SWE-bench Pro at 58.6%, well short of the high-70s rumour. Strong-but-incremental gains on most other axes.

That score is genuinely good — comparable to or above the latest Claude Opus releases on several axes — but it’s not the leap that “GPT-6” had been priced into.

The naming

The decision to ship as GPT-5.5 rather than GPT-6 is the part of this release I find most worth dwelling on.

OpenAI has been criticised — fairly — for inflating model versioning in the past (the GPT-4-Turbo / GPT-4o / o1 lineage was a mess for end users to track). Choosing to not call this GPT-6 when the benchmark didn’t land is a small but real act of restraint. It’s the right call. It also tells us something:

OpenAI has internal numerical bars for major versions, and they apparently held the line.
The GPT-6 brand is now being saved for whatever ships next that does clear that bar.
The model is good. It’s not the leap the rumour mill was paying for.

What I’m telling clients

GPT-5.5 is a serious frontier release — particularly on coding and reasoning workloads. It’s also incremental. If you’ve already standardised on Claude Opus 4.6 / 4.7 or Gemini 2.5 Pro, there’s no urgent reason to switch. If you’re on an older GPT-4-class model, the upgrade is worth running an eval on.

The bigger story is the honesty of the renaming. We’re past the era where every release has to be the biggest one yet, and that’s a healthier place for the field to be.


Google ships new AI agents to challenge OpenAI and Anthropic — a sober read

2026-04-22T00:00:00+00:00

Bloomberg reported today that Google has released a new line of AI agents aimed squarely at OpenAI’s and Anthropic’s enterprise lead. I’ve been waiting for this move — we’ve all been waiting for it — and the strategic story is more interesting than any single demo.

The state of play

Going into Q2 2026:

Anthropic has the agentic-workloads lead. Claude’s tool-use reliability and MCP made them the default for production agent stacks.
OpenAI has the consumer-product lead and the largest install base, but enterprise agent share is lagging.
Google has the platform lead — Gemini 2.5 Pro’s 1M-token context, native multimodal, deep Workspace integration — but the agent product story has been thin.

This release is Google saying: we have the model, we have the data, we have the cloud — now we have the agents.

What I think the bet is

Google’s competitive moat in enterprise AI is integration with what enterprises already use: Gmail, Drive, Calendar, Docs, Sheets, BigQuery, Workspace. Anthropic has MCP (and the connector ecosystem on top). OpenAI has Operator and a growing tool catalogue. Google has the data the agent should be acting on, already in their cloud.

If the agents are even competent, the integration story alone makes them a serious enterprise play. You don’t need to build connectors for the data when the data lives in the same cloud as the agent.

Two things I’m watching

MCP support. Will Google’s new agents speak MCP, or push their own integration story? My bet — they’ll do both, badge MCP support as table stakes, and try to differentiate on Workspace/Cloud-native depth. The vendor-neutral protocol is too widely adopted to ignore at this point.

Enterprise SLAs. Anthropic’s enterprise lead is partly built on operational reliability — SLAs, predictable rate limits, a quieter incident history. Google’s enterprise track record on AI products is mixed (remember Bard?). The model is the easy part. Operating a model business at enterprise scale is the hard part.

The shape of the rest of the year

We now have three credible agent platforms competing for enterprise share, plus a healthy long tail (Microsoft via Azure + OpenAI, AWS via Bedrock, Salesforce, etc). For customers, this is excellent — pricing pressure is real, multi-vendor patterns become viable. For practitioners, it means the integration layer (MCP, agent frameworks, eval harnesses) becomes more important, not less.

Pick the platform that’s strong where your data lives. Build the integrations on the standard.


MCP at 97 million monthly downloads — what’s shipped, what’s still missing

2026-04-17T00:00:00+00:00

Model Context Protocol shipped in November 2024 with roughly 100,000 SDK downloads in its first month. By March 2026, that monthly number was 97 million: a 970× increase in about 16 months.

We use MCP heavily at Elastiq. We’ve also tripped over most of the rough edges. A short field report on what’s actually shipped, and what’s still in the roadmap column.

What’s shipped

A real connector ecosystem. The public MCP server registry grew from ~1,200 in Q1 2025 to 9,400+ by April 2026, and that’s just the public ones — the count of internal/private servers in enterprises is much higher. Drive, GCS, S3, Azure Blob, Slack, GitHub, Postgres, Sentry, Snowflake, Salesforce — every major data source has at least one server, often three.

Cross-vendor adoption. Anthropic open-sourced MCP. The interesting part is that OpenAI, Google, Microsoft, and AWS have all adopted it as a first-class integration surface. The protocol is genuinely vendor-neutral now, in a way it wasn’t 12 months ago.

Enterprise deployment patterns. 78% of enterprise teams with AI agents in production are now running them on MCP. The “build a connector for every model provider” anti-pattern has died.

What’s still missing

Audit trails and SSO-integrated auth. This is the biggest gap. Out-of-the-box MCP doesn’t give you the audit story enterprise security teams want. Most production deployments have a custom logging layer wrapped around the protocol. The 2026 roadmap calls these out as priorities — enterprise readiness is the headline theme — but they’re not in the standard yet.

Gateway patterns. As MCP servers proliferate, you need to centralise auth, rate-limiting, and policy. The community is converging on a gateway pattern (an MCP-aware reverse proxy), but there’s no blessed implementation. We rolled our own.

Async / long-running tools. MCP doesn’t have a great answer for tool calls that take 30+ seconds, or that want to stream partial results back. The roadmap mentions a Tasks primitive for async agent calls; it’s not there yet.

Configuration portability. Moving an MCP setup from one host application to another (Claude Desktop → Cursor → your own product, etc.) is still more friction than it should be.

ACL propagation. I keep banging on about this. Connectors know who the user is. The LLM doesn’t have a first-class way to honour that identity downstream when it composes responses or chains calls. Solved in our shop with a custom layer. Should be in the protocol.

My take

The protocol has won the integration layer. That’s settled. The next 12 months are about closing the enterprise-readiness gap — auth, audit, gateways, async — without breaking the simplicity that made it spread in the first place. Hard problem. Solvable.

If you’re building enterprise AI in 2026 and you’re not on MCP, the burden of proof is now on you to explain why.


Llama 4 Scout’s 10 million-token context window — what changes, and what doesn’t

2026-04-12T00:00:00+00:00

Meta shipped two Llama 4 models this month — the headline numbers:

Llama 4 Scout — 17B active / 109B total MoE, 16 experts, 10 million-token context.
Llama 4 Maverick — 17B active / 400B total MoE, 128 experts, 1M-token context, native multimodal.

Both are open-weight under the Llama Community Licence. Scout’s 10M context — supported by a technique called Interleaved RoPE that lets it generalise from a 256K training window — is the largest of any openly available model at launch.

Two questions everyone asks

“Is the 10M context real, or marketing?” Real, in the sense that Scout can be fed and reason over very long inputs. Real-but-caveated, in the sense that performance degrades on the long tail of the window, attention drift is real, and your inference cost scales with input tokens regardless.

“Does this kill RAG?” No. Read on.

Why long-context doesn’t replace retrieval

Four things stack against the “stuff everything into the prompt” architecture, even when it’s technically possible:

Cost. Generating against 8M input tokens costs roughly 8M-tokens-worth of inference. RAG keeps your effective input small (top-K retrieved chunks), and that ratio determines your unit economics. For workloads that run thousands of times an hour, the math doesn’t even close.

Latency. Time-to-first-token scales with context length. A user-facing query at 1M tokens of context is tens of seconds in. RAG keeps the LLM call fast.

Recency and update propagation. A long context is baked at request time. RAG is evaluated at retrieval time. When your underlying corpus updates, RAG sees the change immediately. Long-context approaches need to re-stuff.

Access control. This is the one nobody mentions enough. RAG can apply ACLs at retrieval — the user only ever sees passages they’re authorised for. Long-context naively dumps everything in. Solving ACL inside a 10M-token prompt is a problem you don’t want.
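
The access-control point is the easiest of the four to show in code. Here is a minimal sketch of ACL filtering at retrieval time: chunks the user cannot see are dropped before ranking, so they never reach the prompt. The Chunk type, the group names, and the precomputed score field are all hypothetical; in a real stack the score would come from your vector or keyword search.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Chunk:
    """A retrievable passage with its ACL (hypothetical shape)."""
    text: str
    allowed_groups: FrozenSet[str]
    score: float  # relevance score, assumed precomputed by the search layer


def retrieve(chunks: List[Chunk], user_groups: FrozenSet[str], k: int = 3) -> List[Chunk]:
    """Return the top-k chunks the user is authorised to see."""
    # Filter first, then rank: unauthorised text is never a candidate.
    visible = [c for c in chunks if c.allowed_groups & user_groups]
    return sorted(visible, key=lambda c: c.score, reverse=True)[:k]


corpus = [
    Chunk("Q3 revenue forecast", frozenset({"finance"}), 0.9),
    Chunk("Office opening hours", frozenset({"all"}), 0.5),
    Chunk("Leave policy update", frozenset({"hr"}), 0.8),
]
hits = retrieve(corpus, user_groups=frozenset({"all", "hr"}), k=2)
print([c.text for c in hits])
```

The highest-scoring chunk is excluded here because the user is not in the finance group. Getting the same guarantee inside a single 10M-token prompt would mean building a different prompt per user.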

Where 10M context does change things

Single-document workflows on huge documents. Legal corpora, code bases, full books, multi-day audio transcripts. If the entire input is one logical thing the user wants to reason over, the long-context model is the right tool.

Reduced retrieval engineering for prototypes. For internal tools and experiments where the corpus is “this one PDF” or “this one repo”, you can skip the retrieval stack entirely and ship faster.

Long-trace agentic workflows. Agents that maintain extensive history (tool calls, intermediate reasoning) benefit from windows that won’t truncate them mid-task.

The thoughtful pattern in 2026 is RAG for retrieval, long-context for reasoning over the retrieval. The two compose. They don’t replace each other.

