<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://sinha96.github.io/newsletter/feed.xml" rel="self" type="application/atom+xml" /><link href="https://sinha96.github.io/" rel="alternate" type="text/html" /><updated>2026-05-21T02:33:03+00:00</updated><id>https://sinha96.github.io/newsletter/feed.xml</id><title type="html">Latent</title><subtitle>A continuous newsletter on enterprise AI, RAG architecture, agentic systems, and the changing GenAI landscape — by Priyanshu Shekhar Sinha.</subtitle><author><name>Priyanshu Shekhar Sinha</name><email>priyanshu1996@hotmail.com</email></author><entry><title type="html">Migrating from Gemini 2.5 Flash to 3.5 Flash · what actually changed, and what to touch in your code</title><link href="https://sinha96.github.io/newsletter/2026/05/gemini-3-5-flash-migration-from-2-5/" rel="alternate" type="text/html" title="Migrating from Gemini 2.5 Flash to 3.5 Flash · what actually changed, and what to touch in your code" /><published>2026-05-21T00:00:00+00:00</published><updated>2026-05-21T00:00:00+00:00</updated><id>https://sinha96.github.io/newsletter/2026/05/gemini-3-5-flash-migration-from-2-5</id><content type="html" xml:base="https://sinha96.github.io/newsletter/2026/05/gemini-3-5-flash-migration-from-2-5/"><![CDATA[<p>If you have a production workload on <code class="language-plaintext highlighter-rouge">gemini-2.5-flash</code>, the answer to “should I move?” is almost certainly yes — but the migration is <em>not</em> a simple model-ID swap. The Gemini 3.x family changed the thinking config, dropped the classic sampling knobs, and tightened the function-calling contract. Here is what I’d hand a teammate on day one of the cutover.</p>

<h2 id="what-actually-got-better">What actually got better</h2>

<p>Hold the model name fixed at Flash and look at the deltas that matter for a production agent:</p>

<ul>
  <li><strong>Coding / agentic workloads jumped substantially.</strong> On Terminal-Bench 2.1, 3.5 Flash hits <strong>76.2%</strong>, and it notably <em>beats Gemini 3.1 Pro</em> on the agentic suite (Terminal-Bench, MCP Atlas, Finance Agent v2, GDPval-AA). That comparison is the headline I keep coming back to: a Flash-tier model out-scoring the previous Pro-tier on agent tasks is not a normal generational jump.</li>
  <li><strong>Low-reasoning coding</strong> is up <strong>10–20%</strong> over the previous Flash generation. This is the band most production traffic actually sits in.</li>
  <li><strong>Thinking is now first-class.</strong> <code class="language-plaintext highlighter-rouge">thinking_level</code> is a string enum (<code class="language-plaintext highlighter-rouge">minimal</code> → <code class="language-plaintext highlighter-rouge">high</code>), and <code class="language-plaintext highlighter-rouge">medium</code> is the new default. The model is tuned for it, and chain-of-thought scaffolding in your prompts is now actively counterproductive — simpler prompts at <code class="language-plaintext highlighter-rouge">medium</code> beat elaborate CoT at <code class="language-plaintext highlighter-rouge">low</code>.</li>
  <li><strong>Thought preservation is on by default.</strong> Multi-turn agentic loops get more coherent across tool calls. Token usage goes up a bit; quality goes up more.</li>
</ul>

<h2 id="what-got-worse">What got worse</h2>

<p>One thing, and it matters: <strong>price.</strong></p>

<ul>
  <li>3.5 Flash: <strong>$1.50 / 1M input</strong>, <strong>$9.00 / 1M output</strong>, <strong>$0.15 / 1M cached input</strong>.</li>
  <li>That’s roughly <strong>3× the price of the Gemini 3 Flash Preview</strong> and, per Artificial Analysis’s full-benchmark suite, about <strong>5.5× the run cost of the previous Flash</strong> (the increase compounds because thinking-on-by-default eats more output tokens).</li>
  <li>It does sit 25% below Gemini 3.1 Pro’s list price ($2.00 / $12.00), so the <em>intelligence-per-dollar</em> picture is still favourable. But if your 2.5 Flash bill was tight, model your new cost before you cut over.</li>
</ul>

<p>The fix on cost is almost always the same: aggressive prompt caching ($0.15/M cached is the cheapest token Google sells), and <code class="language-plaintext highlighter-rouge">thinking_level: 'minimal'</code> or <code class="language-plaintext highlighter-rouge">'low'</code> on routes that don’t need reasoning.</p>
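<p>To make “model your new cost” concrete, here is a minimal sketch in Python that uses only the list prices quoted above. The <code class="language-plaintext highlighter-rouge">RouteTraffic</code> type, the token totals, and the 80% cache share are made-up illustrations, not measurements; swap in a representative day from your own logs.</p>

```python
from dataclasses import dataclass

# List prices quoted above, in USD per 1M tokens.
INPUT_PER_M = 1.50
OUTPUT_PER_M = 9.00
CACHED_INPUT_PER_M = 0.15


@dataclass
class RouteTraffic:
    """Hypothetical per-day token totals for one route."""
    input_tokens: int
    output_tokens: int
    cached_fraction: float  # share of input tokens served from cache, 0.0 to 1.0


def daily_cost_usd(traffic: RouteTraffic) -> float:
    """Estimate one day's spend from token totals and the cache share."""
    cached = traffic.input_tokens * traffic.cached_fraction
    uncached = traffic.input_tokens - cached
    return (
        uncached / 1_000_000 * INPUT_PER_M
        + cached / 1_000_000 * CACHED_INPUT_PER_M
        + traffic.output_tokens / 1_000_000 * OUTPUT_PER_M
    )


# Invented traffic: 200M input tokens and 20M output tokens per day.
no_cache = RouteTraffic(input_tokens=200_000_000, output_tokens=20_000_000, cached_fraction=0.0)
with_cache = RouteTraffic(input_tokens=200_000_000, output_tokens=20_000_000, cached_fraction=0.8)

print(f"no cache:  ${daily_cost_usd(no_cache):.2f}")
print(f"80% cache: ${daily_cost_usd(with_cache):.2f}")
```

<p>On these invented numbers, caching 80% of input takes the day from about $480 to about $264. Output tokens dominate once caching is in place, which is why trimming <code class="language-plaintext highlighter-rouge">thinking_level</code> is the second lever.</p>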

<h2 id="the-four-code-changes-youll-actually-make">The four code changes you’ll actually make</h2>

<h3 id="1-model-id">1. Model ID</h3>

<div class="language-diff highlighter-rouge"><div class="highlight"><pre class="hl"><code><span class="gd">- model = "gemini-2.5-flash"
</span><span class="gi">+ model = "gemini-3.5-flash"
</span></code></pre></div></div>

<p>GA, no preview suffix. Available in the Gemini API, AI Studio, Antigravity, the Gemini app, and AI Mode in Search.</p>

<h3 id="2-drop-temperature-top_p-top_k">2. Drop <code class="language-plaintext highlighter-rouge">temperature</code>, <code class="language-plaintext highlighter-rouge">top_p</code>, <code class="language-plaintext highlighter-rouge">top_k</code></h3>

<p>These are no longer recommended on any Gemini 3.x model. The models’ reasoning is tuned for the default sampling settings, and passing custom values has been a net negative on most evals I’ve seen.</p>

<div class="language-diff highlighter-rouge"><div class="highlight"><pre class="hl"><code><span class="gd">- generation_config = {
-   "temperature": 0.2,
-   "top_p": 0.9,
-   "top_k": 40,
- }
</span><span class="gi">+ # leave sampling at defaults on 3.x
</span></code></pre></div></div>

<p>If you were using <code class="language-plaintext highlighter-rouge">temperature=0</code> for determinism, you’re going to need a new strategy — usually <code class="language-plaintext highlighter-rouge">thinking_level: 'minimal'</code> plus a stricter system prompt and schema-constrained output.</p>
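<p>If the sampling keys are scattered across many call sites, one option is to strip them in a single place and log what was removed. This is a sketch, not part of any SDK: <code class="language-plaintext highlighter-rouge">strip_sampling</code> is a hypothetical helper, and <code class="language-plaintext highlighter-rouge">max_output_tokens</code> is only there to show a key that survives.</p>

```python
# Keys the section above recommends leaving at their defaults on Gemini 3.x.
SAMPLING_KEYS = ("temperature", "top_p", "top_k")


def strip_sampling(generation_config: dict) -> tuple[dict, list[str]]:
    """Return a copy of the config without sampling keys, plus the keys removed."""
    removed = [key for key in SAMPLING_KEYS if key in generation_config]
    cleaned = {k: v for k, v in generation_config.items() if k not in SAMPLING_KEYS}
    return cleaned, removed


legacy = {"temperature": 0.2, "top_p": 0.9, "top_k": 40, "max_output_tokens": 1024}
cleaned, removed = strip_sampling(legacy)

print(cleaned)  # {'max_output_tokens': 1024}
print(removed)  # ['temperature', 'top_p', 'top_k']
```

<p>Returning the removed keys makes it easy to emit a one-off warning per route, so you can see which callers were relying on custom sampling before you delete the settings for good.</p>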

<h3 id="3-thinking_budget--thinking_level">3. <code class="language-plaintext highlighter-rouge">thinking_budget</code> → <code class="language-plaintext highlighter-rouge">thinking_level</code></h3>

<p>The integer budget is gone. Replace it with the string enum.</p>

<div class="language-diff highlighter-rouge"><div class="highlight"><pre class="hl"><code><span class="gd">- thinking_config = ThinkingConfig(thinking_budget=7500)
</span><span class="gi">+ thinking_config = ThinkingConfig(thinking_level="medium")
</span></code></pre></div></div>

<p>Mapping I’ve been using as a starting point (adjust per route):</p>

<table>
  <thead>
    <tr>
      <th>Old (<code class="language-plaintext highlighter-rouge">thinking_budget</code>)</th>
      <th>New (<code class="language-plaintext highlighter-rouge">thinking_level</code>)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">0</code> / disabled</td>
      <td><code class="language-plaintext highlighter-rouge">minimal</code></td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">~1k–3k</code></td>
      <td><code class="language-plaintext highlighter-rouge">low</code></td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">~5k–10k</code></td>
      <td><code class="language-plaintext highlighter-rouge">medium</code> <em>(new default)</em></td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">&gt;10k</code></td>
      <td><code class="language-plaintext highlighter-rouge">high</code></td>
    </tr>
  </tbody>
</table>
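<p>The table can be turned into a small conversion helper for bulk config migration. A sketch under stated assumptions: the table leaves gaps (below 1k and between 3k and 5k), so the cut-offs here are my own fill-ins rather than documented thresholds, and treating an unset budget or <code class="language-plaintext highlighter-rouge">-1</code> (dynamic thinking on 2.5) as the new default is also a judgment call.</p>

```python
from typing import Optional


def budget_to_level(thinking_budget: Optional[int]) -> str:
    """Map a legacy integer thinking budget to a thinking_level string.

    Cut-offs are assumptions that fill the gaps in the mapping table.
    """
    if thinking_budget is None or thinking_budget == -1:
        return "medium"   # nothing explicit was set: use the new default
    if thinking_budget > 10_000:
        return "high"     # above 10k
    if thinking_budget >= 4_000:
        return "medium"   # roughly 5k to 10k (gap below 5k assigned here)
    if thinking_budget > 0:
        return "low"      # roughly 1k to 3k (small budgets assigned here)
    return "minimal"      # 0 / disabled


for budget in (0, 2_000, 7_500, 16_000, None):
    print(budget, budget_to_level(budget))
```

<p>Treat the output as a first pass only. Routes that sat near a boundary are the ones worth re-running through your eval set at both neighbouring levels.</p>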

<p>Important gotcha if you’re coming from <code class="language-plaintext highlighter-rouge">gemini-3-flash-preview</code> rather than 2.5: the preview defaulted to <code class="language-plaintext highlighter-rouge">high</code>. 3.5 GA defaults to <code class="language-plaintext highlighter-rouge">medium</code>. If your eval scores quietly dropped after the model-ID swap, this is almost certainly why — set <code class="language-plaintext highlighter-rouge">thinking_level: 'high'</code> explicitly to restore the previous behaviour.</p>

<h3 id="4-tighten-your-functionresponse-parts">4. Tighten your <code class="language-plaintext highlighter-rouge">FunctionResponse</code> parts</h3>

<p>The function-calling contract is stricter now. There are three requirements; miss any one and the request is either rejected or the model hallucinates around the missing tool result:</p>

<ol>
  <li><strong><code class="language-plaintext highlighter-rouge">id</code> must match the original <code class="language-plaintext highlighter-rouge">FunctionCall.id</code></strong> — you can no longer get away with omitting it.</li>
  <li><strong><code class="language-plaintext highlighter-rouge">name</code> must match the call’s <code class="language-plaintext highlighter-rouge">name</code>.</strong></li>
  <li><strong>Exactly one response per function call</strong> — no merging, no extras.</li>
</ol>

<div class="language-diff highlighter-rouge"><div class="highlight"><pre class="hl"><code>  response_part = Part.from_function_response(
<span class="gi">+     id=call.id,
</span>      name=call.name,
      response={"result": tool_output},
  )
</code></pre></div></div>
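<p>The three requirements are cheap to check before a turn is sent. The sketch below is a pre-flight validator over plain stand-in types; <code class="language-plaintext highlighter-rouge">Call</code>, <code class="language-plaintext highlighter-rouge">Response</code>, and <code class="language-plaintext highlighter-rouge">check_responses</code> are hypothetical names, not SDK classes, so adapt the field access to whatever objects your client library hands you.</p>

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Call:
    """Hypothetical stand-in for a FunctionCall: just the id and name."""
    id: str
    name: str


@dataclass(frozen=True)
class Response:
    """Hypothetical stand-in for a FunctionResponse part."""
    id: str
    name: str


def check_responses(calls: list[Call], responses: list[Response]) -> list[str]:
    """Return contract violations; an empty list means the turn looks clean."""
    problems = []
    calls_by_id = {call.id: call for call in calls}
    seen = Counter(resp.id for resp in responses)

    # Requirements 1 and 2: every response id and name must match a call.
    for resp in responses:
        call = calls_by_id.get(resp.id)
        if call is None:
            problems.append(f"response id {resp.id!r} matches no call")
        elif call.name != resp.name:
            problems.append(
                f"response {resp.id!r} has name {resp.name!r}, expected {call.name!r}"
            )

    # Requirement 3: exactly one response per call.
    for call in calls:
        if seen[call.id] != 1:
            problems.append(
                f"call {call.id!r} has {seen[call.id]} responses, expected exactly 1"
            )
    return problems


calls = [Call("c1", "get_weather"), Call("c2", "get_time")]
good = [Response("c1", "get_weather"), Response("c2", "get_time")]
bad = [Response("c1", "get_forecast"), Response("c1", "get_weather")]

print(check_responses(calls, good))  # []
for problem in check_responses(calls, bad):
    print(problem)
```

<p>Running this in a test or a debug-only assertion catches the common regressions (a dropped <code class="language-plaintext highlighter-rouge">id</code>, merged tool outputs) before they show up as rejected requests in production.</p>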

<p>Two related cleanups while you’re in this file:</p>

<ul>
  <li><strong>Multimodal tool results:</strong> put media <em>inside</em> the function response parts, not as sibling parts.</li>
  <li><strong>Inline instructions</strong> (the “and now do X with this” trailing nudge): append them to the response text with two newlines, rather than sending them as a separate <code class="language-plaintext highlighter-rouge">Part</code>.</li>
</ul>

<h2 id="prompt-cleanups-youll-thank-yourself-for">Prompt cleanups you’ll thank yourself for</h2>

<p>3.5 Flash punishes the prompting habits that 2.5 Flash rewarded.</p>

<ul>
  <li><strong>Strip explicit chain-of-thought scaffolding.</strong> “Think step by step, first list assumptions, then…” — delete it. Set <code class="language-plaintext highlighter-rouge">thinking_level: 'medium'</code> (or <code class="language-plaintext highlighter-rouge">'high'</code>) and let the model do the reasoning natively. I’ve seen 5–10% accuracy gains from <em>removing</em> CoT prompts on 3.5.</li>
  <li><strong>Shorten system prompts.</strong> The model follows tighter instructions more reliably than 2.5 did; verbosity in the system prompt now correlates negatively with instruction-following on a couple of my internal evals.</li>
  <li><strong>Use schema-constrained output where you used to coerce JSON in the prompt.</strong> Cheaper, more reliable.</li>
</ul>
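<p>As an illustration of the last point, here is a sketch of replacing “reply ONLY with JSON” prose with a schema. The triage schema is invented for the example, and the <code class="language-plaintext highlighter-rouge">response_mime_type</code> / <code class="language-plaintext highlighter-rouge">response_schema</code> keys are how I understand the generation config to be spelled in the Python SDK; check them against the version you have installed. No API call is made here.</p>

```python
import json

# A hypothetical ticket-triage schema in the OpenAPI-style subset used for
# structured output.
TRIAGE_SCHEMA = {
    "type": "object",
    "properties": {
        "category": {"type": "string", "enum": ["billing", "bug", "other"]},
        "urgent": {"type": "boolean"},
    },
    "required": ["category", "urgent"],
}

# Config fragment: the schema takes over from JSON instructions in the prompt.
generation_config = {
    "response_mime_type": "application/json",
    "response_schema": TRIAGE_SCHEMA,
}


def parse_triage(raw: str) -> dict:
    """Parse a reply and check the required keys as a cheap second guard."""
    data = json.loads(raw)
    missing = [key for key in TRIAGE_SCHEMA["required"] if key not in data]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    return data


print(parse_triage('{"category": "bug", "urgent": true}'))
```

<p>Keeping a small parser like <code class="language-plaintext highlighter-rouge">parse_triage</code> on the client side is optional, but it gives you one place to count malformed replies while you compare the old and new setups.</p>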

<h2 id="caveats-and-not-yets">Caveats and not-yets</h2>

<ul>
  <li><strong>Computer Use is not supported on 3.5 Flash yet.</strong> If you have a workload using the computer-use surface, stay on Gemini 3 Flash Preview for that specific route. Mixed-model deployments are fine.</li>
  <li><strong>PDF token usage can go up</strong> at <code class="language-plaintext highlighter-rouge">media_resolution_high</code>. Video usage typically goes <em>down</em>. Re-baseline both before you trust your cost projections.</li>
  <li><strong>Thought preservation increases output tokens.</strong> Worth it for agent loops; you can opt out per route if you’re cost-sensitive on single-turn classification work.</li>
</ul>

<h2 id="a-migration-checklist-id-hand-a-team">A migration checklist I’d hand a team</h2>

<ol>
  <li>Swap the model ID in one non-production route. Run your eval set.</li>
  <li>Remove <code class="language-plaintext highlighter-rouge">temperature</code> / <code class="language-plaintext highlighter-rouge">top_p</code> / <code class="language-plaintext highlighter-rouge">top_k</code>. Re-run.</li>
  <li>Convert <code class="language-plaintext highlighter-rouge">thinking_budget</code> → <code class="language-plaintext highlighter-rouge">thinking_level</code> using the table above. Re-run.</li>
  <li>Audit every <code class="language-plaintext highlighter-rouge">FunctionResponse</code> site: add <code class="language-plaintext highlighter-rouge">id</code>, verify <code class="language-plaintext highlighter-rouge">name</code>, ensure 1:1 with the call.</li>
  <li>Move multimodal tool outputs <em>inside</em> the function response part.</li>
  <li>Delete explicit CoT scaffolding from your prompts; trim system prompts.</li>
  <li>Re-baseline cost on a representative day of traffic — the price jump is real, prompt caching is your main lever.</li>
  <li>Canary 5% of production traffic for 48 hours. Watch p95 latency, tool-call error rate, and cost-per-request, in that order.</li>
  <li>Cut over fully. Leave 3 Flash Preview wired up for any computer-use routes.</li>
</ol>
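<p>For the canary step, one simple approach is deterministic hash-based bucketing, sketched below. The function name and the choice of a user or session id as the key are assumptions for illustration; a feature-flag service or gateway rule does the same job if you already have one.</p>

```python
import hashlib

CANARY_MODEL = "gemini-3.5-flash"
STABLE_MODEL = "gemini-2.5-flash"


def pick_model(request_key: str, canary_percent: int = 5) -> str:
    """Route a stable slice of traffic to the canary model.

    Hashing the key (a user or session id, say) keeps each caller on the
    same model for the whole canary window.
    """
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # 0..99
    return CANARY_MODEL if canary_percent > bucket else STABLE_MODEL


# Rough check of the split over synthetic keys.
keys = [f"user-{i}" for i in range(10_000)]
share = sum(pick_model(k) == CANARY_MODEL for k in keys) / len(keys)
print(f"canary share: {share:.1%}")
```

<p>Sticky assignment matters for the metrics in step 8: if a session bounces between models mid-conversation, tool-call error rate and cost-per-request stop being attributable to either one.</p>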

<p>The TL;DR: the model is meaningfully better at exactly the workloads most teams are running (agents, tool-use, coding) and the migration is mostly a config-and-prompt cleanup rather than an architectural change. Budget a day for a small team, two days if your function-calling layer is non-trivial.</p>

<p>— Priyanshu</p>

<hr />

<p><strong>Sources:</strong></p>
<ul>
  <li><a href="https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-5/">Gemini 3.5: frontier intelligence with action (Google Blog)</a></li>
  <li><a href="https://ai.google.dev/gemini-api/docs/whats-new-gemini-3.5">What’s new in Gemini 3.5 — generateContent API (Google AI for Developers)</a></li>
  <li><a href="https://ai.google.dev/gemini-api/docs/interactions/whats-new-gemini-3.5">What’s new in Gemini 3.5 — Interactions API (Google AI for Developers)</a></li>
  <li><a href="https://ai.google.dev/gemini-api/docs/pricing">Gemini Developer API pricing (Google AI for Developers)</a></li>
  <li><a href="https://artificialanalysis.ai/models/gemini-3-5-flash">Gemini 3.5 Flash — Intelligence, Performance &amp; Price Analysis (Artificial Analysis)</a></li>
  <li><a href="https://deepmind.google/models/gemini/flash/">Gemini 3.5 Flash — DeepMind model page</a></li>
  <li><a href="https://www.buildfastwithai.com/blogs/gemini-3-5-flash-review-benchmarks-price-api">Gemini 3.5 Flash Review: Benchmarks, Price &amp; API (Build Fast With AI)</a></li>
</ul>]]></content><author><name>Priyanshu Shekhar Sinha</name><email>priyanshu1996@hotmail.com</email></author><category term="gemini" /><category term="migration" /><category term="api" /><category term="flash" /><category term="google" /><summary type="html"><![CDATA[A practical migration note for teams running on 2.5 Flash. What's measurably better in 3.5, what's worse (price), the four config changes you'll have to make, and a checklist for cutting over without breaking your function-calling code.]]></summary></entry><entry><title type="html">Google I/O 2026 · the day Google stopped shipping models and started shipping agents</title><link href="https://sinha96.github.io/newsletter/2026/05/google-io-2026-the-agentic-pivot/" rel="alternate" type="text/html" title="Google I/O 2026 · the day Google stopped shipping models and started shipping agents" /><published>2026-05-21T00:00:00+00:00</published><updated>2026-05-21T00:00:00+00:00</updated><id>https://sinha96.github.io/newsletter/2026/05/google-io-2026-the-agentic-pivot</id><content type="html" xml:base="https://sinha96.github.io/newsletter/2026/05/google-io-2026-the-agentic-pivot/"><![CDATA[<p>I watched the Shoreline keynote yesterday with the same notepad I use for client calls. By the end of it the page had one underlined sentence at the top: <em>the model is no longer the product.</em> Below that, the things that actually shipped on 20 May 2026.</p>

<h2 id="the-two-models">The two models</h2>

<p><strong>Gemini 3.5</strong> is the new frontier family. The framing Google used — “frontier intelligence with action” — is the giveaway. 3.5 is being positioned less as a benchmark-chasing release and more as the substrate the rest of the agentic stack runs on. <strong>Gemini 3.5 Flash</strong> was the SKU shown end-to-end on stage; the bigger sibling was alluded to but not benchmarked head-to-head against Opus 4.7 or GPT-5.5, which tells you something about where Google wants the conversation to go.</p>

<p><strong>Gemini Omni</strong> is the more interesting release. “Any input to any output, starting with video” is the pitch. The demos that landed were the editing ones — point at a frame, describe the change in natural language, get a coherent edit back across the rest of the clip. This is the first time I’ve seen a multimodal model where video editing felt like a first-class output modality rather than a party trick stitched onto an image model.</p>

<h2 id="the-agentic-surface">The agentic surface</h2>

<p>This is the part of the keynote I’ll remember. Five separate product launches, all of them agents, all shipping into surfaces that already have hundreds of millions of users:</p>

<ul>
  <li><strong>Gemini Spark</strong> — a general-purpose agent inside the Gemini app that can reason across your connected apps and take action under your direction. Beta, Ultra subscribers and trusted testers first, wider rollout to follow.</li>
  <li><strong>Information agents in Search</strong> — Search becomes a thing that goes and does the research loop, not just a thing that ranks ten blue links.</li>
  <li><strong>Daily Brief</strong> — proactive, 24/7 surfacing inside the Gemini app. The “agent that pre-empts you” pattern that everyone has been trying; Google now has the personal-context graph to actually make it useful.</li>
  <li><strong>Universal Cart</strong> — a shopping cart that holds items across merchants and lets an agent transact on your behalf. The commerce-side implications of this are bigger than the demo suggested.</li>
  <li><strong>Google Antigravity</strong> — the agent-first developer platform got a substantial bump. “Moving beyond AI tools that help write, to agents that help act” is the line. This is Google’s answer to Cursor + Claude Code + the rest of the agentic-IDE cohort.</li>
</ul>

<h2 id="what-this-actually-means">What this actually means</h2>

<p>Three things stood out to me, and I think they’re going to shape the rest of the year:</p>

<p><strong>1. Google is now competing on distribution, not model quality.</strong> The model announcements were almost a formality. The keynote spent its minutes on what the models do <em>inside</em> Gmail, Docs, Keep, Drive, Search, and the Gemini app. When your moat is a billion-user surface area, the model is a feature; the agent on top of it is the product. Anthropic and OpenAI do not have this lever, and yesterday made that asymmetry very visible.</p>

<p><strong>2. The “agentic in everything” pivot is now industry-wide.</strong> Microsoft has Copilot agents, Anthropic has Claude with computer-use and tool ecosystems, OpenAI has Operator-class products, and now Google has shipped its own coherent agent layer end-to-end across consumer and developer surfaces in a single keynote. The interesting question for buyers in Q3 is no longer “which model is best” — it’s “whose agent layer integrates with our system of record.” That is a very different procurement conversation.</p>

<p><strong>3. Antigravity is the one to watch for developer tooling.</strong> The agentic-IDE space has been fragmenting fast. Google entering with first-party access to Gemini 3.5 + Omni + the Workspace graph is a different kind of entrant than the startups in this category. I expect a real fight over the next two quarters.</p>

<h2 id="what-im-watching-next">What I’m watching next</h2>

<ul>
  <li>Whether <strong>Gemini Spark</strong> actually generalises across connected apps the way the demo suggested, or whether it ends up being a Google-properties agent with thin connectors. The reliability story on cross-app agents is still unsolved across the industry.</li>
  <li><strong>Universal Cart</strong>’s merchant adoption curve. If Google can get the big retailers in by holiday season, this becomes the default purchase surface inside the Gemini app and the discovery economics change.</li>
  <li><strong>Antigravity vs. Claude Code vs. Cursor</strong> on real agentic coding workloads. The benchmarks I’d want here don’t exist publicly yet; I’ll try to put one together for an enterprise client this quarter and write it up.</li>
  <li>Whether <strong>Omni</strong> lands a real video-editing market or remains a demo capability. The editing UX in the keynote was good. The proof is whether anyone ships paid video workflows on it within 90 days.</li>
</ul>

<p>The headline yesterday wasn’t “Google shipped Gemini 3.5.” It was “Google stopped shipping a model and started shipping an operating layer.” That is a different company than the one that showed up at I/O 2024.</p>

<p>— Priyanshu</p>

<hr />

<p><strong>Sources:</strong></p>
<ul>
  <li><a href="https://blog.google/innovation-and-ai/sundar-pichai-io-2026/">Google I/O 2026: Sundar Pichai’s opening keynote (Google Blog)</a></li>
  <li><a href="https://blog.google/innovation-and-ai/technology/developers-tools/google-io-2026-collection/">Google I/O 2026: News and announcements (Google Blog)</a></li>
  <li><a href="https://developers.googleblog.com/all-the-news-from-the-google-io-2026-developer-keynote/">All the news from the Google I/O 2026 Developer keynote (Google Developers Blog)</a></li>
  <li><a href="https://cloud.google.com/blog/products/ai-machine-learning/innovations-from-google-io-26-on-google-cloud">Innovations from Google I/O 26 on Google Cloud (Google Cloud Blog)</a></li>
  <li><a href="https://www.cnbc.com/2026/05/19/google-ai-ultra-gemini-spark-omni.html">Google debuts new AI models, personal AI agents (CNBC)</a></li>
  <li><a href="https://www.techradar.com/news/live/google-io-2026-live">Google I/O 2026 live updates (TechRadar)</a></li>
</ul>]]></content><author><name>Priyanshu Shekhar Sinha</name><email>priyanshu1996@hotmail.com</email></author><category term="google" /><category term="gemini" /><category term="agents" /><category term="io-2026" /><summary type="html"><![CDATA[Notes from yesterday's I/O keynote — Gemini 3.5, Gemini Omni, Antigravity, Spark, Universal Cart. The story isn't the model numbers. It's that the entire product surface pivoted to agentic in one keynote.]]></summary></entry><entry><title type="html">This week in AI · the lines I drew on the whiteboard</title><link href="https://sinha96.github.io/newsletter/2026/05/this-week-in-ai-may-7/" rel="alternate" type="text/html" title="This week in AI · the lines I drew on the whiteboard" /><published>2026-05-07T00:00:00+00:00</published><updated>2026-05-07T00:00:00+00:00</updated><id>https://sinha96.github.io/newsletter/2026/05/this-week-in-ai-may-7</id><content type="html" xml:base="https://sinha96.github.io/newsletter/2026/05/this-week-in-ai-may-7/"><![CDATA[<p>A short week-in-review. Six items I keep returning to, with a one-line take on each.</p>

<p><strong>Anthropic’s compute deal with SpaceX is the structural story of the week.</strong> 220K+ GPUs at Colossus 1, doubled Claude Code rate limits, and gigawatt-scale orbital data-centre talks. The capex race is now openly the lever pulling everything else. <em>(I wrote about this on Tuesday — see the previous post.)</em></p>

<p><strong>CAISI’s pre-launch agreements with Google, Microsoft, xAI.</strong> Pre-deployment government evaluation moved from voluntary commitment to formal-and-repeated. This is going to start showing up as a procurement-checklist item by Q4. Anthropic has had a separate arrangement; OpenAI’s status is less clear. Watch for the next labs to sign on.</p>

<p><strong>DeepSeek V4-Pro is the #2 open-weights model on Artificial Analysis’s intelligence index.</strong> Behind only Kimi K2.6. Open-weights gap on reasoning is single-digit-percent territory now. MIT licence on V4 makes it deployable in places Llama Community Licence isn’t. <em>(Wrote about this last week.)</em></p>

<p><strong>MCP roadmap update is worth reading.</strong> The 2026 priorities are unglamorous but exactly right: audit trails, SSO-integrated auth, gateway patterns, async/long-running tools, configuration portability. The protocol won the integration layer. Now the boring work of enterprise-readiness gets done. The community link is below.</p>

<p><strong>Quiet shift in the agent stack:</strong> the move from “frameworks” to “primitives”. The trend through 2024 was full-stack agent frameworks — opinionated, vertically integrated. The trend in 2026 is people composing <strong>MCP servers + a stateful orchestrator (LangGraph or similar) + their own evaluation harness</strong>, instead of buying a framework. We made that move last quarter and the operability gain was real.</p>

<p><strong>GPT-5.5 (“Spud”) shipping under the GPT-5 brand instead of GPT-6.</strong> OpenAI held the line on what counts as a major version. Industry-restraint signal. Worth watching whether competitors follow suit; the model-versioning hype cycle has been a problem for buyers.</p>

<h2 id="what-im-thinking-about-for-next-week">What I’m thinking about for next week</h2>

<ul>
  <li>A longer note on <strong>agentic evaluation harnesses</strong> — what we’ve built, what we’re still missing.</li>
  <li>A field report on <strong>moving an enterprise client from Llama 3.3 to a V4-Flash deployment</strong> — what changed, what didn’t.</li>
  <li>Possibly a shorter post on <strong>Project Glasswing’s ripple effects</strong> — security teams I’m talking to are starting to plan around the assumption that Mythos-class capabilities will be more broadly available within 12 months.</li>
</ul>

<p>Take care of yourselves. See you next week.</p>

<p>— Priyanshu</p>

<hr />

<p><strong>Sources:</strong></p>
<ul>
  <li><a href="https://www.anthropic.com/news/higher-limits-spacex">Higher usage limits for Claude and a compute deal with SpaceX (Anthropic)</a></li>
  <li><a href="https://www.cnn.com/2026/05/05/tech/microsoft-google-xai-government-test-ai-models">Microsoft, Google and xAI will let the government test their AI models (CNN Business)</a></li>
  <li><a href="https://artificialanalysis.ai/articles/deepseek-is-back-among-the-leading-open-weights-models-with-v4-pro-and-v4-flash">DeepSeek is back among the leading open weights (Artificial Analysis)</a></li>
  <li><a href="https://blog.modelcontextprotocol.io/posts/2026-mcp-roadmap/">The 2026 MCP Roadmap (Model Context Protocol Blog)</a></li>
  <li><a href="https://openai.com/index/gpt-5-5-system-card/">GPT-5.5 System Card (OpenAI)</a></li>
  <li><a href="https://www.anthropic.com/glasswing">Project Glasswing (Anthropic)</a></li>
</ul>]]></content><author><name>Priyanshu Shekhar Sinha</name><email>priyanshu1996@hotmail.com</email></author><category term="weekly" /><category term="links" /><category term="digest" /><summary type="html"><![CDATA[Six things that earned a tab and a sticky note this week — Anthropic's compute deal, CAISI's pre-launch evaluations, the open-weights frontier closing in, and one quiet shift in the agent stack.]]></summary></entry><entry><title type="html">Claude Opus 4.7 and the SpaceX deal — orbital compute, doubled Code limits, same prices</title><link href="https://sinha96.github.io/newsletter/2026/05/claude-opus-4-7-spacex-compute/" rel="alternate" type="text/html" title="Claude Opus 4.7 and the SpaceX deal — orbital compute, doubled Code limits, same prices" /><published>2026-05-05T00:00:00+00:00</published><updated>2026-05-05T00:00:00+00:00</updated><id>https://sinha96.github.io/newsletter/2026/05/claude-opus-4-7-spacex-compute</id><content type="html" xml:base="https://sinha96.github.io/newsletter/2026/05/claude-opus-4-7-spacex-compute/"><![CDATA[<p>Two announcements from Anthropic on May 4 that ought to be read together: a <strong>new flagship model</strong>, and a <strong>compute deal that materially changes their capacity ceiling</strong>.</p>

<h2 id="claude-opus-47">Claude Opus 4.7</h2>

<p>The model itself is positioned as an incremental upgrade over Opus 4.6:</p>

<ul>
  <li><strong>Notable gains on advanced software engineering</strong>, particularly on the hardest tasks. Anthropic’s framing — and matching independent reports — is that Opus 4.7 is the first Claude where you can confidently hand off your hardest coding work without babysitting.</li>
  <li><strong>Substantially better vision</strong> — higher resolution image inputs, improved fine-detail extraction.</li>
  <li><strong>Same pricing as 4.6</strong> — $5/M input tokens, $25/M output tokens.</li>
</ul>

<p>Pricing-flat-with-quality-up is the move I’ve been expecting from all the frontier labs as competition tightens. The headline price doesn’t move, but the unit economics do: the same dollar now buys more completed work.</p>

<h2 id="the-spacex-partnership">The SpaceX partnership</h2>

<p>The structurally interesting announcement. Anthropic signed a deal with SpaceX for the <strong>entire compute capacity at Colossus 1</strong>:</p>

<ul>
  <li><strong>300+ megawatts of new capacity</strong>.</li>
  <li><strong>220,000+ NVIDIA GPUs</strong> coming online within the month.</li>
  <li>The two companies are exploring <strong>orbital data centres</strong> — Anthropic mentioned interest in <em>multiple gigawatts of orbital AI compute capacity</em>.</li>
</ul>

<p>Yes, orbital. The pitch for space-based compute is solar (constant illumination), thermal (radiative cooling against deep space), and political (jurisdiction, supply-chain redundancy). Whether it’s actually economical at scale is a different question, but the fact that it’s being seriously discussed at the GW scale by an organisation that ships product is worth filing away.</p>

<p>The downstream effects are immediate:</p>

<ul>
  <li><strong>Claude Code’s 5-hour rate limits doubled</strong> for Pro, Max, Team, and Enterprise plans, effective immediately.</li>
  <li><strong>API rate limits raised across the board</strong> for Opus.</li>
  <li>This is the second compute capacity announcement from Anthropic in two months — they’ve been visibly compute-constrained, and this addresses it.</li>
</ul>

<h2 id="what-i-take-from-this">What I take from this</h2>

<p><strong>The compute war is here.</strong> The frontier labs are now openly trading capital for capacity in ways that show up as customer-facing rate limits. Operating at the frontier is a capex question first, R&amp;D question second.</p>

<p><strong>The duopoly dynamic is real.</strong> Anthropic at $30B annualised, OpenAI at $24B, both racing to lock in compute. The mid-2025 view that the field would have 4–6 frontier labs is now harder to defend. Most of the rest are de-facto open-source partners or cloud-vendor offerings.</p>

<p><strong>Vendor risk is back on the agenda.</strong> When two companies are responsible for most of the inference for the GenAI economy, single-vendor dependency starts costing real procurement points. Build for vendor-neutrality (MCP, abstracted model calls) accordingly.</p>

<p>We doubled our Claude Code allocation today and used it. The new Opus is genuinely better. If the orbital data-centre piece comes together, this gets even more interesting.</p>

<hr />

<p><strong>Sources:</strong></p>
<ul>
  <li><a href="https://www.anthropic.com/news/higher-limits-spacex">Higher usage limits for Claude and a compute deal with SpaceX (Anthropic)</a></li>
  <li><a href="https://www.anthropic.com/news/claude-opus-4-7">Introducing Claude Opus 4.7 (Anthropic)</a></li>
  <li><a href="https://www.axios.com/2026/05/06/anthropic-spacex-elon-musk-compute">Anthropic will get compute capacity from Elon Musk’s SpaceX (Axios)</a></li>
  <li><a href="https://www.pcworld.com/article/3132997/anthropic-doubles-claude-code-limits-thanks-to-a-deal-with-spacex.html">Anthropic doubles Claude Code limits, thanks to a deal with SpaceX (PCWorld)</a></li>
  <li><a href="https://www.coindesk.com/tech/2026/05/06/anthropic-signs-elon-musk-s-spacex-for-colossus-1-compute-ahead-of-june-ipo">Anthropic signs Elon Musk’s SpaceX for Colossus 1 compute (CoinDesk)</a></li>
</ul>]]></content><author><name>Priyanshu Shekhar Sinha</name><email>priyanshu1996@hotmail.com</email></author><category term="anthropic" /><category term="claude-opus" /><category term="spacex" /><category term="compute" /><summary type="html"><![CDATA[Anthropic shipped Opus 4.7 yesterday and announced a SpaceX compute partnership: 220K+ NVIDIA GPUs at Colossus 1, doubled Claude Code rate limits, and exploratory work on orbital data centres.]]></summary></entry><entry><title type="html">CAISI signs pre-launch evaluation agreements with Google, Microsoft, xAI</title><link href="https://sinha96.github.io/newsletter/2026/05/caisi-pre-launch-government-evals/" rel="alternate" type="text/html" title="CAISI signs pre-launch evaluation agreements with Google, Microsoft, xAI" /><published>2026-05-02T00:00:00+00:00</published><updated>2026-05-02T00:00:00+00:00</updated><id>https://sinha96.github.io/newsletter/2026/05/caisi-pre-launch-government-evals</id><content type="html" xml:base="https://sinha96.github.io/newsletter/2026/05/caisi-pre-launch-government-evals/"><![CDATA[<p>The story most enterprise practitioners aren’t watching closely enough this week: the <strong>U.S. Center for AI Standards and Innovation (CAISI)</strong> announced agreements with <strong>Google DeepMind, Microsoft, and xAI</strong> that allow the government to evaluate frontier AI models <em>before</em> they’re publicly released.</p>

<p>This is the first concrete instance I’ve seen of pre-launch evaluation moving from voluntary commitments to a formal, repeated process. It deserves more attention than it’s getting.</p>

<h2 id="whats-actually-in-the-agreement">What’s actually in the agreement</h2>

<p>The reported scope:</p>

<ul>
  <li><strong>Pre-launch evaluation access</strong> — CAISI gets to test new frontier models from each company before public release.</li>
  <li><strong>Capability and safety testing</strong> — focused on dual-use risks (biosecurity, cybersecurity, autonomy).</li>
  <li><strong>Findings sharing</strong> — the labs receive the evaluation results; not all findings are necessarily public.</li>
</ul>

<p>CAISI also recently published its evaluation of <strong>DeepSeek V4-Pro</strong> following V4’s April release — separate work, but the same body, and a useful signal that they’re scaling up evaluations across both U.S. and overseas frontier labs.</p>

<h2 id="what-it-doesnt-say">What it doesn’t say</h2>

<p>Worth being clear about the limits:</p>

<ul>
  <li><strong>It’s not a launch veto.</strong> Evaluations inform; they don’t block. (At least not under the current framework.)</li>
  <li><strong>OpenAI and Anthropic aren’t in the announcement</strong>, though Anthropic has had a different evaluation arrangement going back further. The question is when (not whether) similar formal agreements extend to them.</li>
  <li><strong>It’s not the EU AI Act.</strong> Different framework, different teeth, different scope. Don’t conflate them.</li>
</ul>

<h2 id="what-it-means-for-enterprise-practitioners">What it means for enterprise practitioners</h2>

<p><strong>Compliance buyers will want to see the evaluation report.</strong> If you’re advising a regulated client on model selection, the CAISI report is becoming the artefact you’ll be asked about. Not the model card. Not the system card. The independent evaluation.</p>

<p><strong>The pre-launch window is shifting.</strong> Labs that have been comfortable with “ship fast, eval after” are facing a structural pressure to delay launches for evaluation. This shows up downstream as longer pre-release windows, more cautious staged rollouts, and (sometimes) features held back at launch.</p>

<p><strong>Cross-border models become a separate question.</strong> A model evaluated by CAISI is in a different operational risk category — for U.S. enterprise procurement — than a model that wasn’t. This isn’t <em>legally</em> required (yet), but it’s becoming a procurement-checklist item at large enterprises.</p>

<h2 id="my-read">My read</h2>

<p>This is the kind of slow, mostly-procedural development that won’t make AI Twitter trend, but will substantially shape what enterprise AI procurement looks like in 2027. We’ll start seeing CAISI-evaluation status as a vendor-comparison axis. Plan accordingly.</p>

<hr />

<p><strong>Sources:</strong></p>
<ul>
  <li><a href="https://www.cnbc.com/2026/05/05/ai-oversight-trump-google-microsoft-xai.html">Trump admin moves further into AI oversight (CNBC)</a></li>
  <li><a href="https://www.cnn.com/2026/05/05/tech/microsoft-google-xai-government-test-ai-models">Microsoft, Google and xAI will let the government test their AI models before launch (CNN Business)</a></li>
  <li><a href="https://www.nist.gov/news-events/news/2026/05/caisi-evaluation-deepseek-v4-pro">CAISI Evaluation of DeepSeek V4 Pro (NIST)</a></li>
</ul>]]></content><author><name>Priyanshu Shekhar Sinha</name><email>priyanshu1996@hotmail.com</email></author><category term="policy" /><category term="regulation" /><category term="caisi" /><category term="governance" /><summary type="html"><![CDATA[The U.S. Center for AI Standards and Innovation will now evaluate frontier models from Google DeepMind, Microsoft, and xAI before they're publicly available. The first concrete operationalisation of pre-deployment government oversight.]]></summary></entry><entry><title type="html">DeepSeek V4-Pro and V4-Flash — open weights catch the closed frontier</title><link href="https://sinha96.github.io/newsletter/2026/04/deepseek-v4-pro-flash-open-weights/" rel="alternate" type="text/html" title="DeepSeek V4-Pro and V4-Flash — open weights catch the closed frontier" /><published>2026-04-29T00:00:00+00:00</published><updated>2026-04-29T00:00:00+00:00</updated><id>https://sinha96.github.io/newsletter/2026/04/deepseek-v4-pro-flash-open-weights</id><content type="html" xml:base="https://sinha96.github.io/newsletter/2026/04/deepseek-v4-pro-flash-open-weights/"><![CDATA[<p>DeepSeek dropped two models on April 24, and they matter more than the standard “another open release” framing suggests.</p>

<p>The headline:</p>

<ul>
  <li><strong>DeepSeek V4-Pro</strong> — 1.6 trillion parameters total, 49B activated per token (MoE), 1M token context.</li>
  <li><strong>DeepSeek V4-Flash</strong> — 284B parameters total, 13B activated per token (MoE), 1M token context.</li>
  <li>Both shipped <strong>same-day</strong> as API endpoints AND as open weights under the <strong>MIT licence</strong> on Hugging Face.</li>
</ul>

<p>MIT. Not a Llama-style “community licence” with monthly-active-user carve-outs — a genuinely permissive licence that allows commercial use, modification, and redistribution.</p>

<h2 id="where-v4-pro-sits">Where V4-Pro sits</h2>

<p>On the <strong>Artificial Analysis Intelligence Index</strong> for open weights, V4-Pro is now <strong>#2</strong>, behind only Kimi K2.6. On the <strong>GDPval-AA</strong> agentic-real-world-tasks benchmark, V4-Pro Max scored <strong>1554</strong>, beating Kimi K2.6 (1484), GLM-5.1 (1535), GLM-5 (1402), and MiniMax-M2.7 (1514).</p>

<p>Coding-specific: V4-Pro now outperforms most closed flagships on Codeforces, LiveCodeBench, and Terminal-Bench. The “open-weights gap” on coding has effectively closed.</p>

<h2 id="what-this-means-for-enterprise-builds">What this means for enterprise builds</h2>

<p>Three concrete shifts I’m thinking about for our roadmap:</p>

<p><strong>One.</strong> For workloads where the cost-per-token math has been borderline (high-volume RAG, agentic pipelines that burn tokens on tool-calling intermediate steps), V4-Flash on self-hosted infrastructure is now competitive on quality with where Claude Sonnet was twelve months ago. The unit economics shift.</p>

<p><strong>Two.</strong> For <em>reasoning-heavy</em> workloads (Text2SQL on complex schemas, code agents, multi-step plans), V4-Pro is the first open-weights model I’d seriously consider against the closed flagships. The activation count (49B per token) keeps per-token compute close to that of a mid-sized dense model, although all 1.6T parameters still have to sit in memory somewhere on the serving cluster.</p>

<p><strong>Three.</strong> <strong>MIT licence</strong> changes the conversation in regulated industries. The Llama Community Licence is a non-starter at some financial-services and healthcare clients I’ve worked with — their legal teams won’t sign off. MIT clears that hurdle.</p>

<h2 id="the-caveats">The caveats</h2>

<ul>
  <li>Inference cost for V4-Pro at full quality is still substantial — that 1.6T total parameter count needs serving infrastructure.</li>
  <li>The model has a recognisable Chinese-language tilt in its training mix, which shows up on culturally sensitive evaluations. Test against your domain.</li>
  <li>Open weights are not the same as <em>open data</em> — we still don’t have the training corpus.</li>
</ul>
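
<p>To put a rough number on the serving caveat, the sketch below estimates the memory needed for the weights alone, using the parameter counts quoted above. The 16-bit and 8-bit storage widths are my assumptions for illustration, not published deployment settings, and the estimate ignores KV cache, activations, and runtime overhead.</p>

```python
def weight_memory_gb(total_params: float, bytes_per_param: float) -> float:
    """Memory needed to hold the weights alone, in decimal gigabytes."""
    return total_params * bytes_per_param / 1e9


def active_fraction(active_params: float, total_params: float) -> float:
    """Share of parameters touched per token in a mixture-of-experts model."""
    return active_params / total_params


# (total parameters, activated parameters per token), as quoted in this post.
MODELS = {
    "V4-Pro": (1.6e12, 49e9),
    "V4-Flash": (284e9, 13e9),
}

# Assumed storage widths in bytes per parameter (illustrative only).
WIDTHS = (("16-bit", 2), ("8-bit", 1))

for name, (total, active) in MODELS.items():
    for label, width in WIDTHS:
        print(f"{name} {label}: {weight_memory_gb(total, width):,.0f} GB of weights")
    print(f"{name} active share: {active_fraction(active, total):.1%}")
```

<p>Under those assumptions, the small active share helps per-token compute, but the full weight set still has to be resident, which is what drives the node count.</p>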

<h2 id="the-bigger-pattern">The bigger pattern</h2>

<p>We’re now at the point where every 2–4 weeks, a frontier-class open release lands. Llama 4 in early April. Mistral Large 3 in March. Gemma 4 in April under Apache 2.0. DeepSeek V4 now. The closed-source labs still have the absolute leading edge, but the gap measured in “months until the open model is good enough for this workload” has dropped to single digits for most enterprise tasks.</p>

<p>If you haven’t priced an open-weights deployment into your 2026 architecture decisions, do it now.</p>

<hr />

<p><strong>Sources:</strong></p>
<ul>
  <li><a href="https://api-docs.deepseek.com/news/news260424">DeepSeek V4 Preview Release · DeepSeek API Docs</a></li>
  <li><a href="https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro">DeepSeek V4-Pro on Hugging Face</a></li>
  <li><a href="https://artificialanalysis.ai/articles/deepseek-is-back-among-the-leading-open-weights-models-with-v4-pro-and-v4-flash">DeepSeek is back among the leading open weights models (Artificial Analysis)</a></li>
  <li><a href="https://codersera.com/blog/deepseek-v4-pro-review-benchmarks-pricing-2026/">DeepSeek V4 Pro Review 2026 (Codersera)</a></li>
  <li><a href="https://www.nist.gov/news-events/news/2026/05/caisi-evaluation-deepseek-v4-pro">CAISI Evaluation of DeepSeek V4 Pro (NIST)</a></li>
</ul>]]></content><author><name>Priyanshu Shekhar Sinha</name><email>priyanshu1996@hotmail.com</email></author><category term="deepseek" /><category term="open-weights" /><category term="reasoning" /><category term="mit-license" /><summary type="html"><![CDATA[DeepSeek shipped V4-Pro (1.6T MoE) and V4-Flash (284B MoE) on April 24 under MIT licence. V4-Pro is now the #2 open-weights model on the Artificial Analysis Intelligence Index, behind only Kimi K2.6.]]></summary></entry><entry><title type="html">GPT-5.5 ships as ‘Spud’ — what the rebrand from GPT-6 tells us</title><link href="https://sinha96.github.io/newsletter/2026/04/gpt-5-5-spud-rebrand/" rel="alternate" type="text/html" title="GPT-5.5 ships as ‘Spud’ — what the rebrand from GPT-6 tells us" /><published>2026-04-26T00:00:00+00:00</published><updated>2026-04-26T00:00:00+00:00</updated><id>https://sinha96.github.io/newsletter/2026/04/gpt-5-5-spud-rebrand</id><content type="html" xml:base="https://sinha96.github.io/newsletter/2026/04/gpt-5-5-spud-rebrand/"><![CDATA[<p>OpenAI’s most-anticipated model of the year shipped this week, nine days late and with a different name on the box.</p>

<p>The recap, briefly:</p>

<ul>
  <li>The model — internally codenamed <strong>Spud</strong> — was confirmed for an <strong>April 14, 2026</strong> global launch.</li>
  <li>That date came and went. No public weights, no API rollout, no developer keys.</li>
  <li>On <strong>April 23</strong>, OpenAI shipped it — branded <strong>GPT-5.5</strong>, not GPT-6.</li>
</ul>

<p>There are two threads to pull on here. The benchmarks, and the naming.</p>

<h2 id="the-benchmarks">The benchmarks</h2>

<p>Pre-launch leaks pointed to GPT-6-class performance: a high-70s SWE-bench Pro score (the agentic software-engineering benchmark that’s become the headline metric for “frontier” claims), substantial improvements on long-context reasoning, etc.</p>

<p>The actual numbers in the system card came in lower than the leaks. SWE-bench Pro at <strong>58.6%</strong>, well short of the high-70s rumour. Strong-but-incremental gains on most other axes.</p>

<p>That score is genuinely good, and the model is comparable to or above the latest Claude Opus releases on several axes, but it is not the leap that the “GPT-6” label had priced in.</p>

<h2 id="the-naming">The naming</h2>

<p>The decision to ship as <strong>GPT-5.5</strong> rather than <strong>GPT-6</strong> is the part of this release I find most worth dwelling on.</p>

<p>OpenAI has been criticised — fairly — for inflating model versioning in the past (the GPT-4-Turbo / GPT-4o / o1 lineage was a mess for end users to track). Choosing to <em>not</em> call this GPT-6 when the benchmark didn’t land is a small but real act of restraint. It’s the right call. It also tells us something:</p>

<ul>
  <li><strong>OpenAI has internal numerical bars for major versions</strong>, and they apparently held the line.</li>
  <li><strong>The GPT-6 brand is now being saved</strong> for whatever ships next that does clear that bar.</li>
  <li>The model is good. It’s not the leap the rumour mill was paying for.</li>
</ul>

<h2 id="what-im-telling-clients">What I’m telling clients</h2>

<p>GPT-5.5 is a serious frontier release — particularly on coding and reasoning workloads. It’s also incremental. If you’ve already standardised on Claude Opus 4.6 / 4.7 or Gemini 2.5 Pro, there’s no urgent reason to switch. If you’re on an older GPT-4-class model, the upgrade is worth running an eval on.</p>

<p>The bigger story is the <em>honesty</em> of the renaming. We’re past the era where every release has to be the biggest one yet, and that’s a healthier place for the field to be.</p>

<hr />

<p><strong>Sources:</strong></p>
<ul>
  <li><a href="https://openai.com/index/gpt-5-5-system-card/">GPT-5.5 System Card (OpenAI)</a></li>
  <li><a href="https://openai.com/index/introducing-gpt-5-5/">Introducing GPT-5.5 (OpenAI)</a></li>
  <li><a href="https://felloai.com/all-we-know-about-chatgpt-6/">ChatGPT 6 Release Date: Spud Shipped as GPT-5.5 (Fello AI)</a></li>
  <li><a href="https://www.mejba.me/blog/gpt-6-spud-openai-analysis">GPT-6 (Spud): What’s Real, What’s Hype (Mejba Ahmed)</a></li>
  <li><a href="https://findskill.ai/blog/gpt-6-release-date/">GPT-6 Release Date: 7 Days Past April 14 (FindSkill.ai)</a></li>
</ul>]]></content><author><name>Priyanshu Shekhar Sinha</name><email>priyanshu1996@hotmail.com</email></author><category term="openai" /><category term="gpt-5-5" /><category term="model-naming" /><category term="benchmarks" /><summary type="html"><![CDATA[The model OpenAI confirmed for an April 14 launch — internal codename 'Spud' — finally shipped on April 23. Branded GPT-5.5, not GPT-6. The renaming is the most honest thing about the release.]]></summary></entry><entry><title type="html">Google ships new AI agents to challenge OpenAI and Anthropic — a sober read</title><link href="https://sinha96.github.io/newsletter/2026/04/google-new-ai-agents-bloomberg/" rel="alternate" type="text/html" title="Google ships new AI agents to challenge OpenAI and Anthropic — a sober read" /><published>2026-04-22T00:00:00+00:00</published><updated>2026-04-22T00:00:00+00:00</updated><id>https://sinha96.github.io/newsletter/2026/04/google-new-ai-agents-bloomberg</id><content type="html" xml:base="https://sinha96.github.io/newsletter/2026/04/google-new-ai-agents-bloomberg/"><![CDATA[<p>Bloomberg reported today that <strong>Google has released a new line of AI agents</strong> aimed squarely at OpenAI’s and Anthropic’s enterprise lead. I’ve been waiting for this move — we’ve all been waiting for it — and the strategic story is more interesting than any single demo.</p>

<h2 id="the-state-of-play">The state of play</h2>

<p>Going into Q2 2026:</p>

<ul>
  <li><strong>Anthropic</strong> has the agentic-workloads lead. Claude’s tool-use reliability and MCP made them the default for production agent stacks.</li>
  <li><strong>OpenAI</strong> has the consumer-product lead and the largest install base, but enterprise agent share is lagging.</li>
  <li><strong>Google</strong> has the platform lead — Gemini 2.5 Pro’s 1M-token context, native multimodal, deep Workspace integration — but the agent product story has been thin.</li>
</ul>

<p>This release is Google saying: <em>we have the model, we have the data, we have the cloud — now we have the agents</em>.</p>

<h2 id="what-i-think-the-bet-is">What I think the bet is</h2>

<p>Google’s competitive moat in enterprise AI is <strong>integration with what enterprises already use</strong>: Gmail, Drive, Calendar, Docs, Sheets, BigQuery, Workspace. Anthropic has MCP (and the connector ecosystem on top). OpenAI has Operator and a growing tool catalogue. Google has <em>the data the agent should be acting on, already in their cloud</em>.</p>

<p>If the agents are even competent, the integration story alone makes them a serious enterprise play. You don’t need to build connectors for the data when the data lives in the same cloud as the agent.</p>

<h2 id="two-things-im-watching">Two things I’m watching</h2>

<p><strong>MCP support.</strong> Will Google’s new agents speak MCP, or push their own integration story? My bet — they’ll do both, badge MCP support as table stakes, and try to differentiate on Workspace/Cloud-native depth. The vendor-neutral protocol is too widely adopted to ignore at this point.</p>

<p><strong>Enterprise SLAs.</strong> Anthropic’s enterprise lead is partly built on operational reliability — SLAs, predictable rate limits, a quieter incident history. Google’s enterprise track record on AI products is mixed (remember Bard?). The model is the easy part. <em>Operating</em> a model business at enterprise scale is the hard part.</p>

<h2 id="the-shape-of-the-rest-of-the-year">The shape of the rest of the year</h2>

<p>We now have three credible agent platforms competing for enterprise share, plus a healthy long tail (Microsoft via Azure + OpenAI, AWS via Bedrock, Salesforce, etc.). For customers, this is excellent: pricing pressure is real and multi-vendor patterns become viable. For practitioners, it means the <em>integration layer</em> (MCP, agent frameworks, eval harnesses) becomes more important, not less.</p>

<p>Pick the platform that’s strong where your data lives. Build the integrations on the standard.</p>

<hr />

<p><strong>Sources:</strong></p>
<ul>
  <li><a href="https://www.bloomberg.com/news/articles/2026-04-22/google-releases-new-ai-agents-to-challenge-openai-and-anthropic">Google Releases New AI Agents to Challenge OpenAI and Anthropic (Bloomberg)</a></li>
  <li><a href="https://gurusup.com/blog/ai-comparisons">AI Models in 2026: Which One Should You Actually Use? (Gurusup)</a></li>
  <li><a href="https://llm-stats.com/ai-news">LLM News Today (May 2026)</a></li>
</ul>]]></content><author><name>Priyanshu Shekhar Sinha</name><email>priyanshu1996@hotmail.com</email></author><category term="google" /><category term="agents" /><category term="gemini" /><category term="market" /><summary type="html"><![CDATA[Bloomberg reported today on Google's new agent line. The strategic story is more interesting than the demo. Here's what I think the bet is.]]></summary></entry><entry><title type="html">MCP at 97 million monthly downloads — what’s shipped, what’s still missing</title><link href="https://sinha96.github.io/newsletter/2026/04/mcp-97m-downloads-what-shipped/" rel="alternate" type="text/html" title="MCP at 97 million monthly downloads — what’s shipped, what’s still missing" /><published>2026-04-17T00:00:00+00:00</published><updated>2026-04-17T00:00:00+00:00</updated><id>https://sinha96.github.io/newsletter/2026/04/mcp-97m-downloads-what-shipped</id><content type="html" xml:base="https://sinha96.github.io/newsletter/2026/04/mcp-97m-downloads-what-shipped/"><![CDATA[<p>Model Context Protocol shipped in November 2024 with roughly 100,000 SDK downloads in its first month. By March 2026, that monthly number was <strong>97 million</strong> — a <strong>970× increase in 18 months</strong>.</p>

<p>We use MCP heavily at Elastiq. We’ve also tripped over most of the rough edges. A short field report on what’s actually shipped, and what’s still in the roadmap column.</p>

<h2 id="whats-shipped">What’s shipped</h2>

<p><strong>A real connector ecosystem.</strong> The public MCP server registry grew from ~1,200 in Q1 2025 to <strong>9,400+ by April 2026</strong>, and that’s just the public ones — the count of internal/private servers in enterprises is much higher. Drive, GCS, S3, Azure Blob, Slack, GitHub, Postgres, Sentry, Snowflake, Salesforce — every major data source has at least one server, often three.</p>

<p><strong>Cross-vendor adoption.</strong> Anthropic open-sourced MCP. The interesting part is that <strong>OpenAI, Google, Microsoft, and AWS have all adopted it</strong> as a first-class integration surface. The protocol is genuinely vendor-neutral now, in a way it wasn’t 12 months ago.</p>

<p><strong>Enterprise deployment patterns.</strong> <strong>78% of enterprise teams</strong> with AI agents in production are now running them on MCP. The “build a connector for every model provider” anti-pattern has died.</p>

<h2 id="whats-still-missing">What’s still missing</h2>

<p><strong>Audit trails and SSO-integrated auth.</strong> This is the biggest gap. Out-of-the-box MCP doesn’t give you the audit story enterprise security teams want. Most production deployments have a custom logging layer wrapped around the protocol. The 2026 roadmap calls these out as priorities — <em>enterprise readiness</em> is the headline theme — but they’re not in the standard yet.</p>

<p><strong>Gateway patterns.</strong> As MCP servers proliferate, you need to centralise auth, rate-limiting, and policy. The community is converging on a <em>gateway</em> pattern (an MCP-aware reverse proxy), but there’s no blessed implementation. We rolled our own.</p>
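
<p>For a sense of what such a gateway has to decide on every call, here is a deliberately small sketch of the policy core: token lookup, a per-user tool allow-list, and a sliding-window rate limit, with each decision written to an audit list. It is not our implementation and it does not use any MCP SDK; the <code class="language-plaintext highlighter-rouge">Gateway</code> class, the token format, and the tool names are all hypothetical.</p>

```python
import time
from collections import defaultdict, deque


class Gateway:
    """Toy policy layer in front of tool servers: auth, allow-list, rate limit."""

    def __init__(self, tokens, allowed_tools, max_calls, window_seconds,
                 clock=time.monotonic):
        self.tokens = tokens                # bearer token -> user id
        self.allowed_tools = allowed_tools  # user id -> set of tool names
        self.max_calls = max_calls
        self.window_seconds = window_seconds
        self.clock = clock
        self.calls = defaultdict(deque)     # user id -> timestamps of allowed calls
        self.audit_log = []                 # (user, tool, decision) tuples

    def authorize(self, token, tool):
        user = self.tokens.get(token)
        if user is None:
            return self._record("unknown", tool, "deny:auth")
        if tool not in self.allowed_tools.get(user, set()):
            return self._record(user, tool, "deny:policy")
        now = self.clock()
        recent = self.calls[user]
        # Drop timestamps that have aged out of the sliding window.
        while recent and now - recent[0] >= self.window_seconds:
            recent.popleft()
        if len(recent) >= self.max_calls:
            return self._record(user, tool, "deny:rate")
        recent.append(now)
        return self._record(user, tool, "allow")

    def _record(self, user, tool, decision):
        # Every decision is logged, including denials.
        self.audit_log.append((user, tool, decision))
        return decision


gateway = Gateway(
    tokens={"tok-alice": "alice"},
    allowed_tools={"alice": {"search_docs"}},
    max_calls=2,
    window_seconds=60,
)
decisions = [
    gateway.authorize("tok-alice", "search_docs"),
    gateway.authorize("tok-alice", "search_docs"),
    gateway.authorize("tok-alice", "search_docs"),  # third call inside the window
    gateway.authorize("tok-alice", "drop_table"),   # tool not on the allow-list
    gateway.authorize("bad-token", "search_docs"),  # unknown token
]
print(decisions)
```

<p>A real gateway would sit in the request path as a reverse proxy and forward allowed calls to the upstream server; the sketch only covers the decision and the audit record, which are the parts worth centralising.</p>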

<p><strong>Async / long-running tools.</strong> MCP doesn’t have a great answer for tool calls that take 30+ seconds, or that want to stream partial results back. The roadmap mentions a <strong>Tasks primitive</strong> for async agent calls; it’s not there yet.</p>

<p><strong>Configuration portability.</strong> Moving an MCP setup from one host application to another (Claude Desktop → Cursor → your own product, etc.) is still more friction than it should be.</p>

<p><strong>ACL propagation.</strong> I keep banging on about this. Connectors know who the user is. The LLM doesn’t have a first-class way to honour that identity downstream when it composes responses or chains calls. Solved in our shop with a custom layer. Should be in the protocol.</p>

<h2 id="my-take">My take</h2>

<p>The protocol has <em>won</em> the integration layer. That’s settled. The next 12 months are about closing the enterprise-readiness gap — auth, audit, gateways, async — without breaking the simplicity that made it spread in the first place. Hard problem. Solvable.</p>

<p>If you’re building enterprise AI in 2026 and you’re not on MCP, the burden of proof is now on you to explain why.</p>

<hr />

<p><strong>Sources:</strong></p>
<ul>
  <li><a href="https://www.digitalapplied.com/blog/mcp-adoption-statistics-2026-model-context-protocol">MCP Adoption Statistics 2026 (Digital Applied)</a></li>
  <li><a href="https://blog.modelcontextprotocol.io/posts/2026-mcp-roadmap/">The 2026 MCP Roadmap (Model Context Protocol Blog)</a></li>
  <li><a href="https://thenewstack.io/model-context-protocol-roadmap-2026/">MCP’s biggest growing pains for production use (The New Stack)</a></li>
  <li><a href="https://dev.to/composiodev/what-is-an-mcp-gateway-and-why-do-enterprise-ai-teams-need-one-in-2026-1lie">What Is an MCP Gateway (DEV / Composio)</a></li>
  <li><a href="https://www.cdata.com/blog/2026-year-enterprise-ready-mcp-adoption">2026: The Year for Enterprise-Ready MCP Adoption (CData)</a></li>
</ul>]]></content><author><name>Priyanshu Shekhar Sinha</name><email>priyanshu1996@hotmail.com</email></author><category term="mcp" /><category term="integration" /><category term="agents" /><category term="enterprise" /><summary type="html"><![CDATA[Model Context Protocol's adoption numbers are extraordinary — 970× growth in 18 months, 9,400+ public servers, 78% of enterprise teams running agents in production. A status check from a working integrator.]]></summary></entry><entry><title type="html">Llama 4 Scout’s 10 million-token context window — what changes, and what doesn’t</title><link href="https://sinha96.github.io/newsletter/2026/04/llama-4-scout-10m-context/" rel="alternate" type="text/html" title="Llama 4 Scout’s 10 million-token context window — what changes, and what doesn’t" /><published>2026-04-12T00:00:00+00:00</published><updated>2026-04-12T00:00:00+00:00</updated><id>https://sinha96.github.io/newsletter/2026/04/llama-4-scout-10m-context</id><content type="html" xml:base="https://sinha96.github.io/newsletter/2026/04/llama-4-scout-10m-context/"><![CDATA[<p>Meta shipped two <strong>Llama 4</strong> models this month — the headline numbers:</p>

<ul>
  <li><strong>Llama 4 Scout</strong> — 17B active / 109B total MoE, 16 experts, <strong>10 million-token context</strong>.</li>
  <li><strong>Llama 4 Maverick</strong> — 17B active / 400B total MoE, 128 experts, 1M-token context, native multimodal.</li>
</ul>

<p>Both are open-weight under the Llama Community Licence. Scout’s 10M context — supported by a technique called Interleaved RoPE that lets it generalise from a 256K training window — is the largest of any openly available model at launch.</p>

<h2 id="two-questions-everyone-asks">Two questions everyone asks</h2>

<p><strong>“Is the 10M context real, or marketing?”</strong> Real, in the sense that Scout can be fed and reason over very long inputs. Real-but-caveated, in the sense that performance degrades on the long tail of the window, attention drift is real, and your inference cost scales with input tokens regardless.</p>

<p><strong>“Does this kill RAG?”</strong> No. Read on.</p>

<h2 id="why-long-context-doesnt-replace-retrieval">Why long-context doesn’t replace retrieval</h2>

<p>Four things stack against the “stuff everything into the prompt” architecture, even when it’s technically possible:</p>

<p><strong>Cost.</strong> Generating against 8M input tokens costs roughly 8M-tokens-worth of inference. RAG keeps your effective input small (top-K retrieved chunks), and that ratio determines your unit economics. For workloads that run thousands of times an hour, the math doesn’t even close.</p>
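
<p>A back-of-envelope version of that cost argument, as a sketch. The price per million input tokens, the call volume, and the 8K-token RAG prompt are all assumed figures chosen for illustration, not quotes from any provider.</p>

```python
def input_cost(tokens_per_call: int, calls: int, usd_per_million_tokens: float) -> float:
    """Input-side spend only; output tokens are ignored for simplicity."""
    return tokens_per_call * calls * usd_per_million_tokens / 1_000_000


PRICE = 0.10            # hypothetical USD per million input tokens
CALLS_PER_HOUR = 1_000  # assumed workload

LONG_CONTEXT_TOKENS = 8_000_000  # whole corpus stuffed into the prompt
RAG_TOKENS = 8_000               # assumed top-K chunks plus instructions

long_context = input_cost(LONG_CONTEXT_TOKENS, CALLS_PER_HOUR, PRICE)
rag = input_cost(RAG_TOKENS, CALLS_PER_HOUR, PRICE)

print(f"long-context: ${long_context:,.2f} per hour")
print(f"RAG:          ${rag:,.2f} per hour")
print(f"ratio:        {long_context / rag:,.0f}x")
```

<p>The absolute dollar figures move with the assumed price, but the ratio does not: it is simply the ratio of input tokens per call, so it holds at whatever rate you actually pay.</p>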

<p><strong>Latency.</strong> Time-to-first-token scales with context length. A user-facing query carrying 1M tokens of context can wait tens of seconds before the first token arrives. RAG keeps the LLM call fast.</p>

<p><strong>Recency and update propagation.</strong> A long context is <em>baked at request time</em>. RAG is <em>evaluated at retrieval time</em>. When your underlying corpus updates, RAG sees the change immediately. Long-context approaches need to re-stuff.</p>

<p><strong>Access control.</strong> This is the one nobody mentions enough. RAG can apply ACLs at retrieval — the user only ever sees passages they’re authorised for. Long-context naively dumps everything in. Solving ACL inside a 10M-token prompt is a problem you don’t want.</p>
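
<p>Applying ACLs at retrieval can be as simple as filtering before ranking. The sketch below assumes each chunk carries the set of groups allowed to read it; the <code class="language-plaintext highlighter-rouge">Chunk</code> type, the group names, and the hard-coded relevance scores are hypothetical stand-ins for whatever your index and retriever provide.</p>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    text: str
    allowed_groups: frozenset  # groups permitted to read this chunk
    score: float               # relevance score from a retriever (stubbed here)


def retrieve(chunks, user_groups, top_k):
    """Filter by ACL first, then rank, so unauthorised text never reaches the prompt."""
    visible = [c for c in chunks if not c.allowed_groups.isdisjoint(user_groups)]
    return sorted(visible, key=lambda c: c.score, reverse=True)[:top_k]


corpus = [
    Chunk("Q3 board minutes", frozenset({"exec"}), 0.95),
    Chunk("Expense policy", frozenset({"staff", "exec"}), 0.80),
    Chunk("Onboarding guide", frozenset({"staff"}), 0.60),
]

hits = retrieve(corpus, user_groups={"staff"}, top_k=2)
print([c.text for c in hits])
```

<p>Note that the highest-scoring chunk is dropped for a staff user even though it is the most relevant. That is the behaviour you want, and it is exactly what is hard to reproduce once everything has already been concatenated into one prompt.</p>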

<h2 id="where-10m-context-does-change-things">Where 10M context <em>does</em> change things</h2>

<p><strong>Single-document workflows on huge documents.</strong> Legal corpora, code bases, full books, multi-day audio transcripts. If the entire input is one logical thing the user wants to reason over, the long-context model is the right tool.</p>

<p><strong>Reduced retrieval engineering for prototypes.</strong> For internal tools and experiments where the corpus is “this one PDF” or “this one repo”, you can skip the retrieval stack entirely and ship faster.</p>

<p><strong>Long-trace agentic workflows.</strong> Agents that maintain extensive history (tool calls, intermediate reasoning) benefit from windows that won’t truncate them mid-task.</p>

<p>The thoughtful pattern in 2026 is <strong>RAG for retrieval, long-context for reasoning over the retrieval</strong>. The two compose. They don’t replace each other.</p>

<hr />

<p><strong>Sources:</strong></p>
<ul>
  <li><a href="https://www.llama.com/models/llama-4/">Meta Llama 4 · llama.com</a></li>
  <li><a href="https://ai.meta.com/blog/llama-4-multimodal-intelligence/">The Llama 4 herd (Meta AI Blog)</a></li>
  <li><a href="https://huggingface.co/blog/llama4-release">Welcome Llama 4 Maverick &amp; Scout on Hugging Face</a></li>
  <li><a href="https://explore.n1n.ai/blog/meta-llama-4-scout-maverick-production-guide-2026-04-27">Llama 4 production guide (n1n.ai)</a></li>
</ul>]]></content><author><name>Priyanshu Shekhar Sinha</name><email>priyanshu1996@hotmail.com</email></author><category term="llama" /><category term="context-window" /><category term="rag" /><category term="meta" /><summary type="html"><![CDATA[Meta shipped Llama 4 Scout (10M context, MoE) and Maverick (1M context, 128 experts) earlier this month. The context-length number is real. The 'do we still need RAG' question still has the same answer.]]></summary></entry></feed>