Tech Lead · GenAI @ Elastiq AI 12.97°N / 77.59°E · Bangalore
GCP · Professional MLE
Folio · 2026 Bangalore, IN Technical Lead — GenAI @ Elastiq AI

Priyanshu Shekhar Sinha

Technical Lead — GenAI · AI/ML Engineer · GCP Professional MLE


Seven plus years building production-grade AI systems. Currently architecting an enterprise RAG platform at Elastiq AI — multi-source ingestion, hybrid search, cross-encoder re-ranking, and document-level access control.

01 / About

Engineer of tangible outcomes.

Translating ambiguous business problems into deployed, measurable, data-driven systems.

I'm an AI/ML Engineer with seven plus years shipping production-grade data systems. I've grown from classical data science into the architecture of large-scale enterprise AI — currently leading the design and development of an enterprise RAG platform at Elastiq AI.

My work sits at the intersection of search infrastructure, LLM orchestration, and enterprise security — multi-source connectors (Drive, GCS, S3, Azure Blob, Slack), hybrid retrieval (BM25 + kNN on OpenSearch), cross-encoder re-ranking, and document-level access control synchronized across cloud providers.

I believe most of the work is finding the right shape of the problem. The rest is craft, iteration, and the discipline to keep what works.

Currently building
Enterprise RAG · ACL-aware retrieval
Lately exploring
Agentic AI · MCP connectors
Always for hire on
GenAI · LLM fine-tuning
02 / Stack

Tools, in daily use.

A pragmatic kit refined over a decade of building, breaking, and shipping.
Top Skills
GenAI Solution Architect · Agentic AI · Enterprise RAG · Google Cloud Platform
Languages
Python · SQL · Cypher
LLM & GenAI
LLaMA 3.1 70B · QLoRA · LoRA · SFT · RLHF · Classic RAG · GraphRAG · Cypher-RAG · Text2SQL · MCP · LangChain · Transformers · Ollama
Search & Retrieval
OpenSearch · BM25 · kNN / Vector · Hybrid Search · Cross-encoder Re-ranking · ChromaDB · Embeddings · ACL Sync
Multi-modal
Text · PDF · Audio · Video · Speaker Diarization · OCR
ML / DL
PyTorch · TensorFlow · scikit-learn · MLFlow · Pandas · Polars · NumPy · PySpark · NLP · Deep Learning · MLOps
Cloud
GCP · AWS S3 · AWS EKS · AWS ECS · AWS SageMaker · AWS SQS · AWS SES · EC2 · GCS · Azure Blob
Data
PostgreSQL · Citus DB · Snowflake · Databricks · Graph DB · ChromaDB
Tooling
Docker · Airflow · Streamlit · ChainLit · Flask · Plotly · Altair · Power BI · Postman · Git · GitHub · GitLab
Soft Skills
Communication · Project Management · Team Leadership · Mentorship
02·B / Off-screen

When I'm away from the keyboard.

Long Drives

Badminton

Hiking

Cuisine

Documentaries

Music

Photography

Mentorship

03 / Echo

What collaborators say.

Selected notes from former managers, peers, and mentees.

Priyanshu got a very deep understanding of our pipeline and quickly recognized nuances in the Python-pandas framework. He learned and implemented ML models for our text-processing pipeline which improved accuracy and reduced false positives, and made significant contributions to our Learning Framework.

Venkatesh Mohan
Founder & Director — Sumyag

Priyanshu has immense knowledge — and not just that, he focuses on the core concepts. Conceptually it's very difficult to beat him. Whether or not he has worked on a certain algorithm, he picks it up in upcoming projects. His grasping power is amazing. He is a great asset to the team.

Sakshi Sehgal
Senior Data Scientist — Paxcomm

Priyanshu is highly focused and relentless. His incredibly deep knowledge of neural networks, deep learning, and ML could be an asset to any organization. He gels with the team well, always provides constructive feedback, and I would love to work with him again.

Garima Thakur
ML Engineer — Softuvo Solutions

Priyanshu is one of the most dependable, thorough, and dedicated people I've worked with. His contributions in document enrichment and image-processing initiatives improved customer experiences significantly. Priyanshu is an amazing value-add to any organisation.

Abhishek Kumar
Engineering Manager — Google
04 / Recognition

Selected honors.

Certifications, awards, and contributions worth flagging.
Elastiq AI March · 2026

Star Employee of the Month

Excellence grows through continuous learning and self-mastery. — presented by the Elastiq leadership team.

Recruitment Smart 2021

Star Performer of the Month

Recognised for impact across reporting automation — reducing analytics report time from 7 days to 15 minutes (99% improvement).

GitHub Open Source

Arctic Code Vault Contributor

Open-source code preserved in the GitHub Archive Program — a 1,000-year repository archived in the Arctic.

GitHub Pull Shark

Pull Shark

Awarded for sustained contributions across multiple open-source repositories.

DeepLearning.AI Specializations · 2018, 2021

Deep Learning & NLP Specializations

Two multi-course Coursera specializations covering DNNs, CNNs, sequence models, attention, and the foundations of modern NLP.

05 / Résumé

Trajectory, in brief.

A linear walk through the past seven years, plus the studies that led there.

Education & Certifications

★ Cert

Google Cloud — Professional ML Engineer

Google Cloud
Designing, building, and productionising ML solutions on Google Cloud — feature engineering, model development, serving, MLOps, and responsible AI.
2021

NLP Specialization

DeepLearning.AI · Coursera
Classification & Vector Spaces · Probabilistic Models · CNNs · Sequence Models · Attention Models.
2019

PG Program — Data Science Engineering

Great Lakes Institute of Management
Python · Machine Learning · Statistics · Data Analysis · NLP · Visualization · Deep Learning · Feature Engineering.
2018

Deep Learning Specialization

DeepLearning.AI · Coursera
Improving DNNs · Structuring ML Projects · CNNs · Neural Networks · Sequence Models.
2018

Python Data Structures

University of Michigan · Coursera
Data Structures · Object-Oriented Programming.
Cert

Spark & Python for Big Data with PySpark

Coursera
Cert

Programming for Everybody

University of Michigan · Coursera
2014 — 2018

B.E. — Electrical & Electronics Engineering

Sir M. Visvesvaraya Institute of Technology
Electrical Machine Design · Engineering Mathematics · Analog Circuits · Microcontrollers · Power Electronics · Control Systems · Network Analysis.

Experience

2025.01 → Now

Technical Lead — GenAI

Elastiq AI · India
Lead the design and development of an enterprise RAG platform — multi-source ingestion, hybrid search, cross-encoder re-ranking, and document-level access control synchronized across cloud providers.
  • Architecting scalable distributed-database solutions via Citus DB research and implementation.
  • Designing distributed database architectures for enterprise-scale applications.
  • Mentoring cross-functional team members on robust data pipelines for large-scale ML projects.
  • Guiding the team through complex LLM fine-tuning and enterprise deployment strategies.
  • Collaborating with senior leadership on strategic product development.
  • Onboarded the first enterprise client for our Text2SQL solution, establishing initial market presence.
Python · LLMs · RAG · OpenSearch · Citus DB · GCP · AWS · Docker
2024.09 → 2025.01

Senior Software Engineer

Elastiq AI · Bangalore
Foundational AI-engineering work on Discover — the company's flagship product — covering fine-tuning, retrieval, and agent design.
  • Fine-tuned LLaMA 3.1 70B via advanced QLoRA / LoRA — 85% accuracy on the target task.
  • Engineered custom datasets and context-aware chunking for unstructured document RAG.
  • Built AI agents for Text2SQL and unstructured file processing inside Discover.
  • Built the foundational Text2SQL framework from scratch.
  • Integrated multiple Graph DB backends offering Classic RAG, Cypher-based RAG, and GraphRAG.
  • Engineered data-processing systems capable of large-scale enterprise workloads.
Python · LLaMA 3.1 · QLoRA · LangChain · GraphRAG · Text2SQL · OpenSearch
2023.05 → 2024.08

Senior Data Scientist

Acuity Knowledge Partners · Bangalore
Senior Data Scientist at a research, analytics, and technology partner to the financial services sector — asset managers, IBs, PE, VC, hedge funds, and consulting firms.
  • Developed a versatile analysis dashboard for an investment firm enabling experiments and KPI monitoring — 87% reduction in client analysis time.
  • Built a web dashboard integrated with an ESG data engine for a top-tier investment firm shaping their investment strategy decisions.
  • Shipped an end-to-end RAG system for parsing financial reports.
Python · Streamlit · SQL · Databricks · MLFlow · Llama 3 · ChromaDB · Ollama · LangChain · Postman · Docker
2021.12 → 2023.04

Associate — Data Scientist

TheMathCompany · Bangalore
Contributed to an enterprise AI and analytics company trusted for data-driven decision-making by some of the largest organisations across industries.
  • Led a 5-person team designing and shipping a flexible A/B Testing Framework for a leading pharmaceutical client to measure promotion success in revenue generation.
  • Co-led a 7-person team on an analyse-and-flag pipeline for sales anomaly detection with seasonality — +$125K revenue for the client in Q1.
  • Built an LSTM-based model forecasting customer return-to-purchase windows for supply/demand balancing — R²: 0.9012.
Python · Deep Learning · Docker · Airflow · Codx
2021.05 → 2021.12

Data Analyst

Recruitment Smart Technologies · Ahmedabad
Built a scalable solution for periodic opportunity, operational, and advanced-analytics reporting on product performance — across multiple clients.
  • New-client integration: 10 days → 3 hours (97% improvement).
  • Operational reports: 5 days → 30 mins; opportunity reports: 1 day → 30 mins (95% improvement).
  • Advanced analytics report: 7 days → 15 mins (99% improvement).
  • Automated Power BI dashboard refreshes. Star Performer of the Month.
Python · AWS S3 · EC2 · SES · Power BI · Docker
2019.08 → 2021.04

Data Scientist

Sumyag Data Sciences · Bangalore
Designed and built a multi-phase, large-scale data pipeline for enriching extracted data points from documents — covering NLP, data wrangling, and text mining.
  • Generated 500+ features per data point to create a holistic document view via NLP, data wrangling, and text mining.
  • Co-developed a large matrix-multiplication model on the NumPy stack ensembling 10+ unique sources for entity classification.
  • Built a sequential automated pipeline producing clean datasets for custom Bayesian learning models — 10+ distinct interpretations to generate word embeddings.
  • Shipped an end-to-end pipeline extracting insights from claims documents for automobile and healthcare enterprise clients.
  • Developed a geometric model for tabular-entity extraction using asymmetric graphs and inter-point distances.
Python · NLP · Deep Learning · Bayesian · NumPy
2020.12 → 2023.05

Data Science Mentor — freelance

Independent · Topmate
  • Mentored college students and working professionals transitioning into data science.
  • Guided post-grad students on thesis projects and professionals on project management.
  • Conducted dozens of mock interviews for analytics and ML roles.
2016 — 2019

Earlier roles

OpenGenus · GXWeb · sabziadda · SMVIT
Pre-data-science chapter.
  • Content Writer (Technical) — OpenGenus Foundation, Jan – Apr 2019.
  • Web Developer Intern — GXWeb, Sep – Dec 2017.
  • Web Developer Intern — sabziadda.com (e-commerce startup), Jan – May 2017.
  • Student Placement Coordinator — SMVIT, Apr 2016 – Sep 2017.
06 / Services

How I can help.

Six engagements I take on most often. Pick one — or bring me a problem and we'll shape the right one.

Enterprise RAG

Architect end-to-end RAG platforms — multi-source ingestion (Drive, GCS, S3, Azure Blob, Slack), hybrid search (BM25 + kNN), cross-encoder re-ranking, and document-level ACL sync.

Agentic AI

Design and ship AI agents for Text2SQL, document processing, and custom enterprise workflows — including evaluating emerging protocols like MCP for connectors.

LLM Fine-tuning

Domain-specific tuning with QLoRA / LoRA / SFT / RLHF on open-source models (e.g. LLaMA 3.1 70B), plus custom datasets and context-aware chunking.

ML Engineering & Microservices

Build ML / DL microservices — hosted on GCP / AWS, instrumented with MLOps, observability, and the unglamorous infrastructure that makes the magic dependable.

Analytics & Visualisation

Develop analytic platforms and intuitive dashboards — Streamlit, Power BI, Tableau — for every flavour of analysis a problem requires.

Mentorship

I guide freshers and working professionals transitioning into the data & AI world — projects, interviews, career paths.

07 / Selected Work

Things I've shipped.

A small cross-section of industrial work, side projects, and writing.
Drive · GCS · S3 · Azure · Slack · BM25 + kNN · cross-encoder · ACL sync
Industrial · Elastiq

Enterprise RAG Platform

Architected the multi-source ingestion layer (Drive · GCS · S3 · Azure Blob · Slack), hybrid retrieval (BM25 + kNN on OpenSearch), cross-encoder re-ranking, and document-level ACL synchronization across cloud providers.

> question: "top 5 customers by revenue this quarter"
↓ agent
SELECT name, SUM(rev) FROM orders WHERE q='Q3' GROUP BY 1 ORDER BY 2 DESC LIMIT 5;
customer · revenue
Acme Corp · $4.2M
Globex · $3.1M
Industrial · Elastiq

Discover · Text2SQL

Built the foundational Text2SQL framework from scratch for Elastiq's flagship "Discover" product. AI agents for natural-language → SQL plus unstructured file processing — and the first enterprise client onboarded.

LLaMA 3.1 · 70B · QLoRA · 85% target accuracy
Industrial · Elastiq

LLaMA 3.1 · 70B Fine-tuning

Fine-tuned the open-source 70B base via advanced QLoRA & LoRA, hitting 85% target accuracy. Engineered custom datasets and context-aware chunking algorithms for unstructured document RAG.

customer · order · product · region · vendor · sku · channel · graph · cypher · MATCH (c)-[:BOUGHT]->(p) ...
Industrial · Elastiq

GraphRAG & Cypher-RAG

Integrated multiple Graph DB backends offering clients a choice of Classic RAG, Cypher-based RAG, and GraphRAG — extending Discover with structural retrieval for relational and ontological data.

multi-modal ingest · text · PDF · audio · video · diarize · embed
Industrial · Elastiq

Multi-modal Pipeline

Designed pipelines spanning text, PDF, audio, and video — including speaker diarization for spoken content, with shared embedding and retrieval surfaces.

analysis · KPI dashboard · 87% time saved per analysis
Industrial · Acuity

Investment Analysis Dashboard

Versatile analysis dashboard for an investment firm — running experiments and monitoring KPIs across portfolios. 87% reduction in client analysis time.

extracted · $12.4M · Q3 · 23.4% · ESG 87/100 · Δ +4.2bp · RAG · 10-K · ESG
Industrial · Acuity

Financial Reports RAG

End-to-end RAG system for parsing financial reports — plus a web dashboard wired into an ESG data engine that shaped a top-tier investment firm's strategy.

A / B · pharma promo · A = control · B = variant · p < 0.01 · significant lift
Industrial · MathCo

A/B Testing Framework

Led a 5-engineer team designing and shipping a flexible A/B testing framework for a leading pharmaceutical client to measure promotion success against revenue lift.

sales · weekly · seasonal · anomaly · +$125K Q1 client lift
Industrial · MathCo

Sales Anomaly & Flag Pipeline

Co-led a 7-engineer team on an analyse-and-flag pipeline detecting sales anomalies under seasonality — driving customer outreach that lifted client revenue by +$125K in Q1.

LSTM · sequence R² · 0.9012
Industrial · MathCo

Customer Return Forecast · LSTM

Built an LSTM-based model forecasting how many days a customer would take to return for the next purchase — used by the client to balance supply and demand. R² · 0.9012.

onboard 10 days → 3 hrs (97%) · ops 5 days → 30 min (95%) · advanced 7 days → 15 min (99%) · time saved across reporting suite
Industrial · RST

Reporting Automation Suite

Built a scalable, multi-client reporting solution. Cut new-client onboarding from 10 days → 3 hrs, ops/opportunity reports to 30 mins, and advanced analytics from 7 days → 15 mins.

REVENUE / Q1 ▲ 23.4%
Industrial

Auto Analytics

A configurable, real-time pipeline that generates multi-sheet Excel reports with embedded graphs across heterogeneous client data — Python · Flask · Power BI · AWS S3.

KYC · DOC.012
Industrial

Sygnif.ai

Document classification (Passport, Aadhaar, Voter ID, DL) on InceptionNet, masked QR for privacy, OCR + structural rebuild — Python · PyTorch · Deep Learning.

LSTM · 4 layers
Industrial

Symplif.ai

Structured extraction from insurance documents — limits, periods, deductibles, listed forms, jurisdictional in/exclusions — LSTM · NLP · PyTorch.

TRAIN · EVAL · DEPLOY · retrain · n7d
Industrial

Insurance Learning Pipeline

Automated periodic retraining with shallow-prod evaluation against the existing model before promotion — Python · AWS S3 · SQS · EC2.

CNN · 96% accuracy · < 2s
Personal

Facial Recognition Attendance

CNN-based attendance system that recognises students in 1–2 seconds, logs to PostgreSQL, and emails guardians on absence — CNN · AWS SES.

> question > context ↳ answer.span
Personal

Questionary

A Flask app that extracts answers from 512-token contexts using a fine-tuned ELECTRA — Python · NLP · Transformers.

kWh · 30d forecast
Personal

Power Consumption Forecast

LSTM-based time-series model forecasting household power consumption from prior behaviour — Python · LSTM · Deep Learning.

ResNet-50 · Haar cascade
Personal

Facial Expression Classification

ResNet-50 image classifier with Haar-cascade face detection and bounding-box overlays — CV · CNN · Deep Learning.

CKD · KNN precision 100% · recall 97.2%
Personal

Chronic Kidney Disease Classifier

KNN classifier on a clinical dataset — 97.91% accuracy, 100% precision, 97.23% recall — sklearn · EDA · feature engineering.

Pose · 17 keypoints
Personal

Posenet Live

Real-time pose estimation in the browser — TensorFlow.js, webcam-driven, 17-keypoint skeleton overlay.

India 2014 · GE
Personal

2014 Elections — Tableau

Tableau dashboard visualising the 2014 Indian general election — vote share, swing, and party performance.

conv · pool · conv
Writing

Convolutional Neural Networks

A primer on CNNs — convolutions, pooling, and the intuition behind feature maps.

DNN · OpenGenus
Writing

Deep Learning Notes

Selected articles on deep learning published on OpenGenus IQ.

inblog · personal feed
Writing

Inblogs

A long-running personal feed — short notes on data science and engineering.

{ ssLabs } // programming notebook
Writing

Programming Notebook

A WordPress notebook of programming explorations and snippets.

08 / Learn AI

Modern AI, visualized.

Ten live chapters. Every panel is real math, linked. Click anything.

01 · Inside a transformer.

Every modern LLM — GPT-4, Claude, Llama — is a stack of attention layers. Pick a sentence, switch heads, click any token, and watch the math light up across all five panels in sync.

Tokens — click one to focus its attention as the query
A

Attention matrix

Row = query · Col = key · Brighter = "this token looks at that token more"

i

What you're looking at

This updates as you interact.

The transformer takes your sentence, converts each token to a vector, then in every layer each token "attends to" every other token. The matrix on the left is that attention. In a real model this happens in many heads in parallel (GPT-3 175B runs 96 per layer) and is repeated across dozens of layers (96 in GPT-3, 80 in Llama 3 70B).

Switch Head above — each head learns a different pattern (position, syntax, coreference). Click a token to spotlight its row.

D

Next-token probabilities

Sampled at temperature 0.80 — try the slider.

B

Token embeddings

Each row is one token's vector. Real models use 768–12,288 dims; we show 16 for legibility.

C₁

Q · queries

"what am I looking for?"

C₂

K · keys

"what do I offer?"

C₃

V · values

"what I contribute"

1 · Tokenize — split text into sub-word IDs.
2 · Embed — each ID becomes a learned vector (panel B).
3 · Project — multiply by W_Q, W_K, W_V (panels C₁/C₂/C₃).
4 · Score — softmax(QKᵀ / √d) → attention (panel A).
5 · Mix & predict — weighted sum of V, MLP, then logits → probabilities (panel D).
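
The five steps above can be sketched in a few lines of plain Python. This is a toy under stated assumptions, not how a production model is written: the embeddings and W_Q / W_K / W_V are random stand-ins rather than learned weights, there is a single head, and the causal mask, MLP, and output logits are left out. `softmax`, `matmul`, and `attention` are illustrative helper names.

```python
import math
import random

def softmax(xs, temperature=1.0):
    """Numerically stable softmax; a lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in xs]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product attention over the token vectors in x."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)  # step 3: project
    d = len(q[0])
    # Step 4: score every query against every key, scale by sqrt(d), softmax per row.
    weights = [
        softmax([sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k])
        for qi in q
    ]
    # Step 5 (first half): each output row is the attention-weighted sum of values.
    return weights, matmul(weights, v)

rng = random.Random(0)

def rand_matrix(rows, cols):
    return [[rng.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]

n_tokens, d_model = 4, 16                 # 16 dims, as in panel B
x = rand_matrix(n_tokens, d_model)        # stand-in for learned embeddings (step 2)
w_q, w_k, w_v = (rand_matrix(d_model, d_model) for _ in range(3))
weights, mixed = attention(x, w_q, w_k, w_v)
```

Each row of `weights` is one token's view of the sentence and sums to 1, which is what panel A draws. The same `softmax` with `temperature=0.8` is the knob behind the slider in panel D.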

02 · The semantic map.

Embedding models turn anything — words, sentences, images, code — into points in a high-dimensional space where distance = meaning. Below is a 2-D projection of 32 concepts. Click pairs, do vector arithmetic, see clusters emerge.

A

Semantic space — 2-D projection

Hover any point · click two for cosine similarity · try the analogy presets below.

B

Cosine similarity

Click two points (or pick below) — distance measured by angle, not raw position.


Pick two concepts to compare.
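
For reference, the two readouts in this panel come down to a couple of lines each. A minimal sketch, assuming plain Python lists as vectors; the three 2-d vectors are made up for illustration and do not come from a real embedding model.

```python
import math

def cosine_similarity(a, b):
    """Angle-based similarity: 1 = same direction, 0 = orthogonal, -1 = opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def euclidean(a, b):
    """Straight-line distance; sensitive to vector length, unlike cosine."""
    return math.dist(a, b)

# Hypothetical vectors: `kitten` points the same way as `cat` but is three times longer.
cat, kitten, truck = [1.0, 2.0], [3.0, 6.0], [2.0, -1.0]

same_direction = cosine_similarity(cat, kitten)   # close to 1 despite the length gap
far_apart = euclidean(cat, kitten)                # clearly non-zero
unrelated = cosine_similarity(cat, truck)         # orthogonal vectors
```

That is the point of "angle, not raw position": `cat` and `kitten` are far apart in Euclidean terms yet score as near-identical by cosine.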

C

Vector arithmetic

The famous king − man + woman ≈ queen — embeddings encode relations as directions.

Pick an analogy — watch the path on the map.
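
A toy version of the analogy trick, assuming hand-made 3-d vectors (royalty, gender, youth) instead of trained embeddings. Real models only approximate this, and usually only after the input words are excluded from the candidates, as done here. `VOCAB` and `analogy` are illustrative names.

```python
import math

# Hand-made toy vectors: (royalty, gender, youth). NOT from a trained model.
VOCAB = {
    "king":   [1.0,  1.0, 0.0],
    "queen":  [1.0, -1.0, 0.0],
    "man":    [0.0,  1.0, 0.0],
    "woman":  [0.0, -1.0, 0.0],
    "prince": [1.0,  1.0, 1.0],
    "girl":   [0.0, -1.0, 1.0],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def analogy(a, b, c):
    """Return the word nearest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(VOCAB[a], VOCAB[b], VOCAB[c])]
    candidates = (w for w in VOCAB if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(target, VOCAB[w]))

result = analogy("king", "man", "woman")
```

Subtracting `man` and adding `woman` flips only the gender axis, so the nearest remaining word is the one that keeps royalty and changes gender.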
D

How embedding models are trained

Contrastive learning · pull similar pairs together, push unrelated apart.

Drag the slider — or hit play — watch the encoder learn:

step 0 / 1000

Anchor: "a small striped feline"

Positive (pull closer): "tabby kitten"

Negative (push away): "industrial logistics"

Loss: 2.41 — InfoNCE / triplet · same idea behind CLIP, SBERT, OpenAI ada, Cohere, etc.
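
The loss in this panel can be written down directly. A minimal sketch of InfoNCE for a single anchor, assuming cosine similarity and a temperature of 0.1 (both are common choices, not something this page specifies); the 2-d vectors are invented to mimic "before" and "after" training.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def info_nce(anchor, positive, negatives, temperature=0.1):
    """-log softmax probability of the positive among positive + negatives."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum_exp - logits[0]

anchor = [1.0, 0.0]                      # "a small striped feline"

# Untrained encoder: the positive sits far away, the negative sits close.
loss_before = info_nce(anchor, positive=[0.2, 1.0], negatives=[[1.0, 0.3]])

# Trained encoder: positive pulled in, negative pushed out.
loss_after = info_nce(anchor, positive=[1.0, 0.1], negatives=[[-0.5, 1.0]])
```

The loss only drops when the positive is closer to the anchor than every negative, which is the "pull together, push apart" behaviour the slider animates.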

E

Picking an embedding model

Dimensions ≠ quality. Context length, domain, and inference cost matter more for most apps.

Model | Dim | Ctx | MTEB | Best for
OpenAI text-embedding-3-large | 3072 | 8k | 64.6 | General · multilingual
Cohere embed-v3 | 1024 | 512 | 64.5 | Search · compressed
Voyage voyage-3-large | 1024 | 32k | 62.5 | Long docs · code
BGE-M3 (open) | 1024 | 8k | 59.6 | Self-host · dense+sparse
nomic-embed-text-v2 | 768 | 2k | 57.3 | On-device · CPU
jina-embeddings-v3 | 1024 | 8k | 59.5 | Multilingual · long

03 · Fine-tuning, the cheap way.

A 70B model has 70 billion weights. Fine-tuning all of them takes a multi-GPU cluster with over a terabyte of memory. LoRA trains tiny rank-r adapters instead: near-full-fine-tune quality on many tasks, <1% of the parameters, and with a 4-bit base (QLoRA) it can fit on a single large GPU. Below is the actual math, animated.

A

LoRA · weight decomposition

Freeze W₀ · learn small B·A where r ≪ d. Only B·A is trained.
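
A minimal sketch of that decomposition, assuming the usual LoRA conventions (B initialised to zero, A to small random values, update scaled by alpha / r). `LoRALinear` is an illustrative class, not the API of any library, and it covers the forward pass only.

```python
import random

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

class LoRALinear:
    """y = W0 x + (alpha / r) * B (A x), with W0 frozen and only A, B trainable."""

    def __init__(self, w0, r, alpha=16, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(w0), len(w0[0])
        self.w0 = w0                                           # frozen base weights
        self.a = [[rng.gauss(0, 0.01) for _ in range(d_in)]    # r x d_in, trainable
                  for _ in range(r)]
        self.b = [[0.0] * r for _ in range(d_out)]             # d_out x r, zero-init
        self.scale = alpha / r

    def forward(self, x):
        base = matvec(self.w0, x)
        delta = matvec(self.b, matvec(self.a, x))              # low-rank path: B (A x)
        return [h + self.scale * dh for h, dh in zip(base, delta)]

    def trainable_fraction(self):
        """Adapter parameters divided by frozen parameters for this one matrix."""
        d_out, d_in = len(self.w0), len(self.w0[0])
        r = len(self.a)
        return r * (d_in + d_out) / (d_out * d_in)

rng = random.Random(1)
d = 64
w0 = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(d)]
layer = LoRALinear(w0, r=4)
x = [rng.gauss(0, 1) for _ in range(d)]
y = layer.forward(x)
```

Because B starts at zero, the adapted layer reproduces the base model exactly at step 0 and only drifts as B·A is trained. For a square d × d matrix the trainable fraction is 2r / d, so r = 16 on an 8192-wide layer is roughly 0.4% of that matrix.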

B

Training loss

Cross-entropy ↓ as the adapter learns your domain.

C

Before vs. after — same prompt

Domain: financial-report summarization. Move the training slider.

PROMPT
Summarize Acme's Q3 retail performance in our internal format.
BASE MODEL
A company reported earnings. Sales were up. Some products performed better than others.
FINE-TUNED
A company reported earnings. Sales were up. Some products performed better than others.
D

Pick your weapon

Cost & quality trade-offs for a 70B base.

Method | Trains | VRAM | Quality
Full FT | 100% (70B) | ~1.4 TB | ★★★★★
LoRA (r=16) | ~0.3% | ~160 GB | ★★★★
QLoRA (4-bit + LoRA) | ~0.3% | ~48 GB | ★★★★
Prompt-tune | <0.01% | ~140 GB+ (16-bit base) | ★★
Prompt eng. only | 0% | 0 GB | ★ – ★★★

Rule of thumb: start with prompting → RAG → LoRA → full FT only if you've genuinely outgrown the rest.

04 · Retrieval-Augmented Generation.

LLMs hallucinate when asked about private data, fresh info, or anything outside training. RAG reduces that: embed your docs, look up the most relevant ones at query-time, stuff them into the prompt, and instruct the model to cite. Pick a query — watch the whole pipeline run.

A

The pipeline

5 stages. Each lights up as the query flows through.

1 · Embed query

2 · Vector search

cosine sim · top-k in index

3 · Re-rank

cross-encoder · refines order

4 · Inject

top-k → prompt context

5 · Generate

LLM · grounded + cited
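
The first four stages can be sketched end to end with toy stand-ins. Everything here is an assumption made for illustration: the four chunks are invented, `embed` is a bag-of-words counter rather than a neural encoder, `rerank` uses term overlap in place of a cross-encoder, and stage 5 (the LLM call) is left out.

```python
import math
from collections import Counter

# Hypothetical corpus: chunk ID -> text.
CHUNKS = {
    "doc1#0": "Refunds are issued within 14 days of the return request.",
    "doc1#1": "Shipping is free for orders above 50 dollars.",
    "doc2#0": "Return requests must include the original order number.",
    "doc3#0": "Our office is closed on public holidays.",
}

def embed(text):
    """Stage 1 stand-in: bag-of-words counts instead of a dense embedding."""
    return Counter(text.lower().replace(".", "").replace("?", "").split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def vector_search(query, k=3):
    """Stage 2: top-k chunks by cosine similarity to the query."""
    q = embed(query)
    return sorted(CHUNKS, key=lambda cid: cosine(q, embed(CHUNKS[cid])), reverse=True)[:k]

def rerank(query, chunk_ids, k=2):
    """Stage 3 stand-in: share of query terms found in the chunk (no real cross-encoder)."""
    q_terms = set(embed(query))
    def score(cid):
        return len(q_terms & set(embed(CHUNKS[cid]))) / len(q_terms)
    return sorted(chunk_ids, key=score, reverse=True)[:k]

def build_prompt(query, chunk_ids):
    """Stage 4: inject the surviving chunks, tagged with IDs the model must cite."""
    context = "\n".join(f"[{cid}] {CHUNKS[cid]}" for cid in chunk_ids)
    return (
        "Answer using only the context and cite chunk IDs.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )

query = "How many days until refunds are issued?"
top = rerank(query, vector_search(query))
prompt = build_prompt(query, top)
```

Swapping `embed` for a real encoder and `rerank` for a cross-encoder changes the quality, not the shape: retrieve wide, re-rank narrow, then hand the model only what survived.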

B

Retrieval space — where your query lands

Each dot is one document chunk. Closer to the query (★) = more relevant.

C

Retrieved chunks

Top-k after re-ranking. Score = cross-encoder relevance, not just cosine.

D

Grounded answer

LLM sees only retrieved context — must cite by chunk ID.

Pick a query above to start.

Production RAG adds: hybrid search (BM25 + dense), ACL filters per user, chunk-aware splitting, recency boost, and answer-grounding evals.

05 · Mixture of Experts.

Mixtral, DeepSeek, Grok, and reportedly GPT-4 — many frontier models are sparse. Instead of activating all 400B parameters per token, a tiny router picks just 2 of 8 (or 16) "experts" to use. Comparable quality, ~4× cheaper to run. Click tokens — watch the router decide.

A

Router → experts

Each token is scored against every expert · top-k win.
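
A minimal sketch of top-k routing, assuming a linear router and a softmax over just the selected experts' logits (one common choice). The eight "experts" are trivial scaling functions standing in for full MLPs, and `route` / `moe_forward` are illustrative names.

```python
import math
import random

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route(token_vec, router_weights, top_k=2):
    """Score the token against every expert, keep the top-k, renormalise their gates."""
    logits = [sum(w * x for w, x in zip(row, token_vec)) for row in router_weights]
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    gates = softmax([logits[i] for i in top])
    return list(zip(top, gates))

def moe_forward(token_vec, router_weights, experts, top_k=2):
    """Run only the chosen experts and mix their outputs by gate weight."""
    out = [0.0] * len(token_vec)
    for idx, gate in route(token_vec, router_weights, top_k):
        y = experts[idx](token_vec)
        out = [o + gate * yi for o, yi in zip(out, y)]
    return out

rng = random.Random(0)
d_model, n_experts = 8, 8
router_weights = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(n_experts)]
# Toy experts: expert i just scales its input by (i + 1).
experts = [lambda v, s=i + 1: [s * x for x in v] for i in range(n_experts)]

token = [rng.gauss(0, 1) for _ in range(d_model)]
chosen = route(token, router_weights)
output = moe_forward(token, router_weights, experts)
```

Six of the eight experts never run for this token, which is where the per-token compute saving comes from; all eight still have to be held in memory.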

B

Expert specialization

Emergent — experts drift toward themes the router prefers.

C

Compute saved

Per-token FLOPs · live.

D

Dense vs. sparse — same intelligence, different cost

Why so many frontier labs moved to MoE.

Model | Total params | Active / token | Style
Llama 3 70B | 70B | 70B | Dense
Mixtral 8×7B | 47B | ~13B | MoE · 2 of 8
Mixtral 8×22B | 141B | ~39B | MoE · 2 of 8
DeepSeek V3 | 671B | 37B | MoE · fine-grained
GPT-4 (reported) | ~1.8T | ~280B | MoE · 2 of 16

The trick: serve a 671B-param model with roughly the per-token compute of a 37B dense model (all 671B weights still have to sit in memory) — that's why MoE won.

06 · KV cache at scale.

Every generated token re-uses the keys & values of every prior token. Naive caching reserves one contiguous, max-length KV buffer per request, so a GPU fits only a handful of concurrent users. PagedAttention (vLLM) treats KV memory like virtual memory — 16-token blocks, allocated on demand, shared across requests with the same prefix. The result: the vLLM paper reports 2–4× higher throughput than earlier serving systems at similar latency.

A

GPU memory · 80 GB A100

Each cell = one 16-token page. Color = request owner. Grey = free. Striped = wasted (fragmentation / over-allocation).

B

Memory budget

Live · per-strategy.

C

Throughput

Tokens/sec the GPU can serve.

D

Survival tactics at scale

What real serving stacks do once they hit the wall.

  • Paged KV (vLLM, TGI) — 16-token blocks, no over-allocation, <4% fragmentation vs. 60-80% naive.
  • Prefix caching — share system-prompt KV across all users on a node. A 4k system prompt × 1000 users = saved once.
  • Continuous batching — new requests slot into a running batch on token boundaries instead of waiting for the slowest to finish.
  • KV offload & quantization — INT8 KV halves memory; offload cold prefixes to CPU/NVMe; recompute on miss.
  • Speculative re-use — assistant tool calls often share prefixes (system + tools schema) — cache aggressively, expire on schema change.
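
The first tactic, paged KV, is mostly bookkeeping. A minimal sketch of on-demand block allocation, assuming 16-token blocks as above; `PagedKVCache` is an illustrative class that tracks block ownership only (no tensors, no prefix sharing, no eviction) and is not vLLM's implementation.

```python
BLOCK_TOKENS = 16

class PagedKVCache:
    """Maps each request to a list of fixed-size physical blocks, allocated lazily."""

    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))   # physical block IDs not in use
        self.tables = {}                      # request ID -> list of block IDs
        self.lengths = {}                     # request ID -> tokens stored so far

    def append_token(self, req):
        n = self.lengths.get(req, 0)
        if n % BLOCK_TOKENS == 0:             # first token, or the last block is full
            if not self.free:
                raise MemoryError("out of KV blocks")
            self.tables.setdefault(req, []).append(self.free.pop())
        self.lengths[req] = n + 1

    def release(self, req):
        """Return a finished request's blocks to the free pool."""
        self.free.extend(self.tables.pop(req, []))
        self.lengths.pop(req, None)

    def wasted_tokens(self):
        """Allocated-but-unused slots; at most one partial block per request."""
        return sum(len(t) * BLOCK_TOKENS - self.lengths[r] for r, t in self.tables.items())

cache = PagedKVCache(num_blocks=8)
for _ in range(20):
    cache.append_token("req-a")               # 20 tokens -> 2 blocks
cache.append_token("req-b")                   # 1 token   -> 1 block
```

Waste is bounded by one partly filled block per request, whereas a contiguous max-length reservation wastes everything between the current length and the maximum.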
E

Quick math

Per-token KV bytes for a 70B model.

Formula: 2 (K,V) × n_layers × n_kv_heads × d_head × bytes. For Llama-3 70B (80 layers · 8 kv-heads · 128 dim · FP16): ~320 KB/token. A 32k context = 10 GB just in KV. 100 users at 32k = 1 TB.
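
The same arithmetic as a small calculator, so other model shapes can be plugged in. The function names are illustrative; the Llama-3 70B shape is the one quoted above.

```python
def kv_bytes_per_token(n_layers, n_kv_heads, d_head, bytes_per_value):
    """2 tensors (K and V) per layer, each n_kv_heads * d_head values wide."""
    return 2 * n_layers * n_kv_heads * d_head * bytes_per_value

def kv_bytes_for_context(per_token_bytes, context_tokens, users=1):
    """Total KV footprint for `users` concurrent requests of the same length."""
    return per_token_bytes * context_tokens * users

# Llama-3 70B shape: 80 layers, 8 KV heads, 128 dims per head, FP16 (2 bytes).
per_token = kv_bytes_per_token(n_layers=80, n_kv_heads=8, d_head=128, bytes_per_value=2)
per_token_kib = per_token / 1024          # 320 KiB per token

one_request = kv_bytes_for_context(per_token, context_tokens=32 * 1024)
hundred_users = kv_bytes_for_context(per_token, context_tokens=32 * 1024, users=100)
```

Setting `bytes_per_value=1` models an INT8 KV cache and halves every figure.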

07 · Speculative decoding vs quantization.

Two unrelated ways to make a slow model fast. Spec decode: a tiny "draft" model proposes K tokens, the big model verifies them in parallel (one forward pass for K tokens). Quantization: shrink weights from 16-bit to 8/4-bit — model gets smaller and arithmetic gets faster, with controlled quality loss. They compose.

A

Speculative decoding · token trace

Draft proposes K tokens · target verifies in one pass · accepted tokens are kept, rejected ones cause a rollback.
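
A minimal sketch of that accept / rollback loop, assuming greedy decoding (a proposed token is accepted only if it equals the target's own choice). The sampling variant uses a probabilistic accept rule instead. Both "models" are toy functions over integer tokens, and the inner verification loop stands in for what a real target model does in one batched forward pass.

```python
def greedy_decode(prefix, target_next, max_new):
    """Baseline: one target call per generated token."""
    out = list(prefix)
    for _ in range(max_new):
        out.append(target_next(out))
    return out

def speculative_decode(prefix, draft_next, target_next, k=4, max_new=12):
    out = list(prefix)
    target_passes = 0
    while len(out) - len(prefix) < max_new:
        # Draft model proposes k tokens autoregressively (cheap).
        proposal = []
        for _ in range(k):
            proposal.append(draft_next(out + proposal))
        # Target verifies all k positions; counted as ONE pass.
        target_passes += 1
        accepted = []
        for tok in proposal:
            expected = target_next(out + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                accepted.append(expected)   # rollback: keep the target's token, drop the rest
                break
        else:
            # Every draft token matched, so the same pass also yields one bonus token.
            accepted.append(target_next(out + accepted))
        out.extend(accepted)
    return out[: len(prefix) + max_new], target_passes

# Toy models: the target counts upward mod 10; the draft agrees except after a 2.
def target_next(seq):
    return (seq[-1] + 1) % 10

def draft_next(seq):
    return 0 if seq[-1] == 2 else (seq[-1] + 1) % 10

spec_out, passes = speculative_decode([0], draft_next, target_next)
baseline = greedy_decode([0], target_next, max_new=12)
```

The output is identical to plain greedy decoding by construction; the only thing that changes is how many target passes it takes to get there.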

B

Weight distribution · quantization

Same layer · same weights · fewer distinct values. INT4 has only 16 levels; the model snaps to the nearest one.
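
The "snap to the nearest level" step can be sketched as symmetric absmax quantization, assuming one scale for the whole tensor. GPTQ and AWQ are more careful about where they spend the error; this is only the basic rounding they build on, and the six weights are made up.

```python
def quantize(weights, bits=4):
    """Map floats to signed integer codes sharing one scale (symmetric absmax)."""
    qmax = 2 ** (bits - 1) - 1                 # 7 for INT4
    qmin = -(2 ** (bits - 1))                  # -8 for INT4, so 16 levels in total
    scale = max(abs(w) for w in weights) / qmax or 1.0   # fall back to 1.0 if all zero
    codes = [max(qmin, min(qmax, round(w / scale))) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    return [c * scale for c in codes]

weights = [0.42, -0.13, 0.07, -0.91, 0.33, 0.0]   # hypothetical layer weights
codes, scale = quantize(weights, bits=4)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

The rounding error per weight is at most half a step (`scale / 2`), so one large outlier stretches the scale and costs precision everywhere else, which is why real schemes quantize in small groups.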

C

Throughput vs. quality

Both techniques stacked. Live.

D

When to use which

Decision shortcuts.

  • Spec decode — best for predictable outputs (code, structured JSON, common chat). Wasted draft tokens kill the gain on creative text.
  • INT8 — nearly free: ≤1% quality drop on most evals. Default for production serving.
  • INT4 / GPTQ / AWQ — 4× smaller than FP16 (2× vs. INT8), ~2-4% eval drop. Worth it when you need to fit a 70B on a single GPU.
  • Compose — INT4 + spec decode → 4-7× faster end-to-end on the same hardware, with quality within ~3% of FP16.

08 · Did the RAG actually work?

RAG looks fine until you measure it. Four canonical metrics catch the failures: context recall (did we retrieve the right docs?), context precision (was the order right?), faithfulness (did the answer stay grounded?), and answer relevance (does the answer match the question?). Compare three configs side-by-side.

A

Question & ground truth

Test question, the docs that should have been retrieved, and the gold answer.

Q
Should-retrieve (ground truth IDs)
Gold answer
B

What the system did

Retrieved chunks + generated answer. Sentences are tagged by which chunk they're grounded in (or hallucinated).

Retrieved baseline
Answer
C

Metrics · this run

Higher = better · 0.0 to 1.0 · RAGAS-style.
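
Three of the four metrics can be approximated from chunk IDs alone. A minimal sketch, assuming a golden set labelled with should-retrieve IDs and answer sentences already tagged with their source chunk. RAGAS itself scores these with LLM judgments, so treat the functions below as simplified, ID-based stand-ins with illustrative names.

```python
def context_recall(retrieved, gold):
    """Share of should-retrieve chunks that were actually retrieved."""
    gold = set(gold)
    return len(gold & set(retrieved)) / len(gold) if gold else 0.0

def context_precision(retrieved, gold):
    """Mean precision@rank over the ranks holding a relevant chunk (rewards good ordering)."""
    gold = set(gold)
    hits, total = 0, 0.0
    for rank, cid in enumerate(retrieved, start=1):
        if cid in gold:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

def faithfulness(sentence_sources):
    """Share of answer sentences grounded in a retrieved chunk (None = hallucinated)."""
    if not sentence_sources:
        return 0.0
    return sum(src is not None for src in sentence_sources) / len(sentence_sources)

# Hypothetical run: ranked retrieval, ground-truth IDs, per-sentence grounding tags.
retrieved = ["c7", "c2", "c9", "c4"]
gold = ["c2", "c4", "c5"]
answer_sources = ["c2", None, "c4"]

recall = context_recall(retrieved, gold)           # 2 of 3 gold chunks found
precision = context_precision(retrieved, gold)     # relevant chunks sit at ranks 2 and 4
faith = faithfulness(answer_sources)               # 2 of 3 sentences grounded
```

Moving `c2` and `c4` to ranks 1 and 2 leaves recall unchanged but lifts precision to 1.0, which is exactly the difference a re-ranker is supposed to make.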

D

Config comparison

Same question · three retrieval setups.

Config | Recall | Precision | Faith. | Rel.

Build a small golden set (50-200 Q/A pairs), run it on every retriever change, watch each metric independently. A higher faithfulness with lower relevance often means you over-grounded into the wrong chunks.

09 · Where the money goes.

A polished LLM product can burn $10k/day before anyone notices. The leaks aren't dramatic — they're boring: a 4k system prompt re-sent 200k times a day, conversation history that never gets trimmed, RAG context with 30 chunks where 4 would do, retry loops on schema errors. Audit your token bill like cloud spend.

A

Where each token goes · per request

Stacked bar — system prompt, tool schema, conversation history, RAG context, user message, response. Hover for $/day attribution.

B

Top leaks · ranked

Auto-flagged patterns sorted by $ wasted.

C

The leak checklist

Print this. Pin it.

  • Cache system prompts — Anthropic / OpenAI prompt caching discounts repeated prefix tokens by up to 90%. Close to a free 10× on that slice of the bill.
  • Trim history — keep last K turns, summarize the rest. Cap to a hard token budget per conversation.
  • Right-size RAG — 4-8 well-ranked chunks beat 30 mediocre ones. Re-ranker is cheaper than long context.
  • Stop streaming on schema fail — early-stop when structured output is malformed; don't burn 4k tokens on bad JSON.
  • Per-endpoint $ caps — alert when an endpoint's $/req drifts >30%. Catches prompt-bloat regressions.
  • Log every call's token breakdown — input by source (system / history / RAG / user) + output. Without this you can't find leaks.
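
The last two items are easy to prototype. A minimal sketch, assuming a flat price of $0.003 per 1k input tokens and a made-up per-request breakdown; the numbers, `daily_cost_by_source`, and `cost_drifted` are all hypothetical, and output tokens are left out for brevity.

```python
PRICE_PER_1K_INPUT = 0.003   # hypothetical $ per 1k input tokens

def daily_cost_by_source(tokens_by_source, requests_per_day, price_per_1k=PRICE_PER_1K_INPUT):
    """Attribute daily input spend to each prompt component."""
    return {
        src: tokens * requests_per_day / 1000 * price_per_1k
        for src, tokens in tokens_by_source.items()
    }

def cost_drifted(baseline_per_req, current_per_req, threshold=0.30):
    """True when $/request has grown by more than `threshold` versus the baseline."""
    return (current_per_req - baseline_per_req) / baseline_per_req > threshold

# Hypothetical per-request token breakdown pulled from call logs.
request_log = {"system": 4000, "history": 1500, "rag": 6000, "user": 100}
costs = daily_cost_by_source(request_log, requests_per_day=200_000)
worst = max(costs, key=costs.get)

alert = cost_drifted(baseline_per_req=0.010, current_per_req=0.014)
```

With this breakdown the 4k system prompt alone comes to about $2,400 a day at 200k requests, and RAG context costs even more, which is the kind of ranking panel B is meant to surface.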
D

Real-world hidden leaks

War stories.

The retry loop. Structured-output failure triggers an automatic retry with the bad response appended for "context." 4 retries × ~2k tokens each = 8k wasted per failure. At 0.5% failure rate × 1M req/day that is 5,000 failures and 40M wasted tokens: $120/day on a $0.003/1k model.
The runaway agent. An agent calls a search tool, gets 200KB of HTML back, stuffs it into context, calls again. Each turn doubles. One bug = 6 figures.
The dev who left logging on. A production path accidentally called with verbose=True appends the full reasoning trace to every response. 3× output tokens. $8k/mo until someone checked.
The shared system prompt. One service builds the prompt fresh each call, defeating the upstream cache. 70% cache hit → 0%. $22k surprise.

10 · Agents — and the loops they fall into.

Agents are LLMs that call tools and decide what to do next. Without guardrails they will: loop on the same tool, ignore failure modes, exhaust budgets, and confidently take destructive actions. Step through a trace below — watch the loop detector, budget meter, and validators kick in.

A

Execution trace

Each row = one tool call. Loops, schema fails, and budget breaches highlight automatically.

B

Budgets

Hard limits. Halt on breach.

C

Guardrail status

Which checks fired so far.

D

Loop detection — the actual patterns

What "stuck" looks like in production.

  • Same-call repeat — identical tool(args) twice in a row. Trivially stuck. Halt immediately.
  • 2-cycle — alternating A → B → A → B with no progress (e.g., re-search after re-search). Detect via hash of last N actions.
  • Argument drift — same tool, slightly tweaked args. Often a retry loop in disguise. Halt after N similar calls (Levenshtein on args < 5%).
  • State stagnation — observation entropy collapses. The agent is no longer learning from results.
  • Token / time / cost budget — never trust the model to self-regulate. Hard caps at the framework level.
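
The first three patterns can be checked from the action history alone. A minimal sketch, assuming the history is a list of `(tool, args)` tuples and that "similar" means a normalised edit distance under 5% between JSON-serialised args; `detect_loop` and its thresholds are illustrative, and state stagnation and budget caps are left out.

```python
import json

def levenshtein(a, b):
    """Classic edit distance via dynamic programming over two rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def call_key(tool, args):
    return f"{tool}:{json.dumps(args, sort_keys=True)}"

def detect_loop(history, drift_threshold=0.05, drift_calls=3):
    """history: (tool, args) tuples, oldest first. Returns a reason string or None."""
    keys = [call_key(t, a) for t, a in history]
    if len(keys) >= 2 and keys[-1] == keys[-2]:
        return "same-call repeat"
    if len(keys) >= 4 and keys[-4:-2] == keys[-2:]:
        return "2-cycle"                       # A, B, A, B with A != B
    recent = history[-drift_calls:]
    if len(recent) == drift_calls and len({t for t, _ in recent}) == 1:
        args = [json.dumps(a, sort_keys=True) for _, a in recent]
        if all(
            levenshtein(x, y) / max(len(x), len(y)) < drift_threshold
            for x, y in zip(args, args[1:])
        ):
            return "argument drift"
    return None

base = "quarterly revenue report for acme corporation fiscal 2024"
drifting = [("search", {"q": base}), ("search", {"q": base + "?"}), ("search", {"q": base + "!"})]
verdict = detect_loop(drifting)
```

The check runs after every tool call, and any non-`None` verdict should halt the agent at the framework level rather than being fed back to the model as a hint.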
E

Guardrail stack — defense in depth

No single check is enough.

  • Input filter — PII / prompt-injection / unsafe-content classifier before the model.
  • Schema validation — JSON schema / regex / type check on every tool argument. Reject & resample.
  • Allowlist tools — minimum viable set per task. No shell.exec unless the task genuinely needs it.
  • Output filter — content moderation + structured-output validator on every response.
  • Human-in-loop — irreversible actions (delete, send, pay) require an explicit confirm step.
  • Audit log — every tool call, args, result, and decision. Replay any failure.
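
Four of these layers (allowlist, schema validation, human-in-loop, audit log) fit in one gate in front of the tool executor. A minimal sketch, assuming a hand-rolled type schema instead of a JSON Schema library; the tool names, `ALLOWED_TOOLS`, and `check_tool_call` are hypothetical, and the input and output filters are not shown.

```python
# Hypothetical allowlist: tool name -> {argument name: expected type}.
ALLOWED_TOOLS = {
    "search": {"query": str, "top_k": int},
    "delete_file": {"path": str},
}
IRREVERSIBLE = {"delete_file"}    # these need an explicit human confirm

class GuardrailError(Exception):
    pass

def check_tool_call(tool, args, confirmed=False, audit_log=None):
    """Raise GuardrailError unless the call passes every check; always log the decision."""
    entry = {"tool": tool, "args": args, "decision": "allowed"}
    try:
        if tool not in ALLOWED_TOOLS:
            raise GuardrailError(f"tool not allowlisted: {tool}")
        schema = ALLOWED_TOOLS[tool]
        if set(args) != set(schema):
            raise GuardrailError("unexpected or missing arguments")
        for name, expected in schema.items():
            if not isinstance(args[name], expected):
                raise GuardrailError(f"{name} must be {expected.__name__}")
        if tool in IRREVERSIBLE and not confirmed:
            raise GuardrailError("irreversible action needs human confirmation")
    except GuardrailError as exc:
        entry["decision"] = f"rejected: {exc}"
        raise
    finally:
        if audit_log is not None:
            audit_log.append(entry)   # rejected calls are logged too

log = []
check_tool_call("search", {"query": "Q3 revenue", "top_k": 5}, audit_log=log)
```

Rejections are raised rather than returned so the framework decides whether to resample, escalate, or halt, and the `finally` block guarantees the audit log sees every attempt either way.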
09 / Contact

Let's build something.

For projects, mentorship, or a chat about an interesting problem.

Bring a problem.
I'll help you find the shape of it.

priyanshu1996@hotmail.com