// ml.engineer · multi-agent · production systems

Nikhil Mourya

ML Engineer · Multi-Agent Systems · Production Systems

I optimize models for production. Production still finds a way to optimize me.

About

I'm Nikhil Mourya. I build ML systems where the math decides when to stop, not the model's confidence. Retrieval scores over self-reported certainty. Pruned weights over bloated checkpoints. Shipped code over benchmark screenshots.

BIT Mesra, studying what doesn't fit in a lecture hall. Right now: questioning whether your retrieval pipeline actually needs that vector DB, or just needs a better question.

I'm always maintaining unbroken eye contact with that single code block like we got beef.

Model Compression

Pruning and LoRA are quiet admissions that full fine-tuning is often billing for capacity you never needed. I like models that shrink without forgetting what they're for.

efficiency

NLP & Text Systems

Summarization, classification, multi-agent pipelines where fluent isn't the same as faithful. Production NLP is mostly telling confident hallucinations they can't sit with us.

production

Multi-Agent Systems & MLOps

LangGraph orchestration, eval loops that catch what the LLM won't admit, and FastAPI + Docker + AWS: the glue between a Jupyter notebook and someone else's pager.

infra

Competitive Programming

Codeforces Specialist, 1400+. Graphs have a way of humbling you on a schedule; the upside is you stop trusting clever one-liners without proof.

cf · 1400+
ongoing

Specialist Codeforces 1400+

Still grinding rated rounds. The graphs are optional; the ego damage isn't. Proof you can think with a clock breathing down your neck.

peak: 1400+ · still climbing
2024

Finalist SIH - Hospital Mgmt System

99.5% uptime with real models in the loop; the 0.5% was character development. Backend stayed polite even when the night shift wasn't.

99.5% uptime · production ML
2023

Finalist IIIT Delhi - ResNet50

94% accuracy after grid search stopped me from brute-forcing the hyperparameter void. Sometimes the boring search is the clever move.

94% acc · −30% train time

Skills

PyTorch
TensorFlow
HuggingFace
Scikit-learn
NumPy
Pandas
Python
FastAPI
Git
MySQL
LangChain
LangGraph
Redis
🧠 ChromaDB
🦙 Ollama
Groq
C++
AWS
Linux
Jupyter
GitHub
📊 Matplotlib
OpenTelemetry
🔍 LangSmith
Pydantic
📐 RAGAS
🌳 tree-sitter
Docker
RAGs

Projects

// flagship.projects

HiveMind

GitHub

"Can a research system know its own answer isn't good enough, and fix it?"

Production-grade autonomous research system with a 5-agent sequential pipeline (Planner → Researcher → Critic → Writer → Evaluator). The orchestrator loops autonomously until a deterministic confidence threshold is met. Confidence is calculated as the mean retrieval score, not the LLM's self-report.
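
A minimal sketch of that stopping rule, under stated assumptions: the threshold value, the round cap, and the `run_round` callable are hypothetical stand-ins for illustration, not HiveMind's actual interface. It only shows the idea that the loop exits on a number computed from retrieval scores rather than on anything the model says about itself.

```python
from statistics import mean
from typing import Callable, List, Tuple


def confidence(retrieval_scores: List[float]) -> float:
    """Deterministic confidence: the mean retrieval score (0.0 if nothing was retrieved)."""
    return mean(retrieval_scores) if retrieval_scores else 0.0


def run_until_confident(
    run_round: Callable[[int], Tuple[str, List[float]]],  # hypothetical: one full pipeline pass
    threshold: float = 0.75,  # hypothetical value
    max_rounds: int = 5,      # hypothetical safety cap so the loop always terminates
) -> Tuple[str, float, int]:
    """Repeat the pipeline until the mean retrieval score clears the threshold."""
    draft, score, round_no = "", 0.0, 0
    for round_no in range(1, max_rounds + 1):
        draft, scores = run_round(round_no)
        score = confidence(scores)
        if score >= threshold:
            break
    return draft, score, round_no


# Fake pipeline whose retrieval improves each round (illustrative data only).
FAKE_SCORES = {1: [0.4, 0.6], 2: [0.6, 0.7], 3: [0.8, 0.9]}


def fake_round(round_no: int) -> Tuple[str, List[float]]:
    return f"draft v{round_no}", FAKE_SCORES[round_no]


if __name__ == "__main__":
    print(run_until_confident(fake_round))
```

With the fake data above, the loop stops on the third round, the first one whose mean score clears the threshold.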

Infra: asyncio.Lock serializing LLM calls (one at a time) · Docker Compose (FastAPI + Redis + ChromaDB + OpenTelemetry) · SSE streaming
5-agent pipeline · autonomous eval loop · deterministic confidence scoring
Python · FastAPI · ChromaDB · Redis · OpenTelemetry · Groq · Ollama · Docker · Pydantic · asyncio
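
The asyncio.Lock point can be illustrated with a small sketch. The class name, the sleep standing in for a model call, and the counters are all assumptions made for the example; the only thing taken from the description above is that a single lock guards LLM execution so concurrent agents cannot overlap on the model.

```python
import asyncio
from typing import List, Tuple


class SerializedLLM:
    """Hypothetical wrapper: many coroutines may call generate(), but only one runs at a time."""

    def __init__(self) -> None:
        self._lock = asyncio.Lock()
        self.active = 0      # calls currently inside the critical section
        self.max_active = 0  # highest concurrency ever observed

    async def generate(self, prompt: str) -> str:
        async with self._lock:
            self.active += 1
            self.max_active = max(self.max_active, self.active)
            await asyncio.sleep(0.01)  # stand-in for the real model call
            self.active -= 1
            return f"response to: {prompt}"


async def demo() -> Tuple[List[str], int]:
    llm = SerializedLLM()  # created inside the running event loop
    prompts = ["plan", "research", "critique"]
    results = await asyncio.gather(*(llm.generate(p) for p in prompts))
    return list(results), llm.max_active


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The three calls are launched together, yet the observed concurrency never exceeds one, which is the property a local model with limited memory would need.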

"Can you search a codebase by intent, not keywords, fully offline?"

Fully offline VS Code extension that indexes codebases using AST-based chunking via tree-sitter and 768-dim embeddings via Ollama. Natural language query → ranked semantic results → one-click jump to exact line. No internet, no API keys.

9K+ chunks indexed · sub-10ms ANN query · 100% offline
Python · TypeScript · FastAPI · tree-sitter · Ollama · VS Code API · Docker
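
To show what AST-based chunking means in practice, here is a rough sketch. It deliberately swaps tree-sitter for Python's standard-library `ast` module so it runs with no dependencies, which also means it only handles Python source; the `Chunk` dataclass and function name are hypothetical. The idea carried over is that chunks follow syntax boundaries (functions, classes) and keep their line ranges, so a search hit can jump to the exact line.

```python
import ast
from dataclasses import dataclass
from typing import List


@dataclass
class Chunk:
    name: str
    kind: str        # "FunctionDef", "AsyncFunctionDef", or "ClassDef"
    start_line: int  # 1-based, inclusive
    end_line: int    # 1-based, inclusive
    text: str


def chunk_python_source(source: str) -> List[Chunk]:
    """Split source into one chunk per function or class, using syntax-tree boundaries."""
    lines = source.splitlines()
    chunks = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            text = "\n".join(lines[node.lineno - 1 : node.end_lineno])
            chunks.append(
                Chunk(node.name, type(node).__name__, node.lineno, node.end_lineno, text)
            )
    return sorted(chunks, key=lambda c: c.start_line)


SAMPLE = '''import os

def load(path):
    return open(path).read()

class Index:
    def add(self, item):
        self.items.append(item)
'''

if __name__ == "__main__":
    for chunk in chunk_python_source(SAMPLE):
        print(chunk.kind, chunk.name, chunk.start_line, chunk.end_line)
```

Each chunk's text would then be embedded and stored alongside its line range; that part is left out here because it depends on the embedding model.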

Pruned U-Net

GitHub

"How much of a segmentation model is actually load-bearing?"

Structured magnitude pruning pipeline targeting whole channels, not individual weights. Key finding: IoU held flat until ~96% reduction, then degraded sharply, meaning there's a wide safe compression window most implementations never explore. IIT Kharagpur research collaboration.

97.3% parameter reduction · 92% FLOPs reduction · IoU > 0.95 on MoNuSeg
PyTorch · U-Net · Model Pruning · Computer Vision · MoNuSeg
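
A dependency-free sketch of channel-level magnitude pruning, assuming an L1-norm ranking criterion (the description above says "magnitude" but does not name the norm) and using plain lists where the real pipeline would use PyTorch tensors. Function names and the toy weights are hypothetical.

```python
from typing import List, Sequence, Tuple


def channel_l1_norms(conv_weight: Sequence[Sequence[float]]) -> List[float]:
    """One score per output channel: the sum of absolute weights in that channel."""
    return [sum(abs(w) for w in channel) for channel in conv_weight]


def select_channels(conv_weight: Sequence[Sequence[float]], keep_ratio: float) -> List[int]:
    """Indices of the highest-magnitude channels, returned in their original order."""
    norms = channel_l1_norms(conv_weight)
    n_keep = max(1, round(len(norms) * keep_ratio))  # never prune a layer to zero channels
    ranked = sorted(range(len(norms)), key=lambda i: norms[i], reverse=True)
    return sorted(ranked[:n_keep])


def prune_layer(
    conv_weight: Sequence[Sequence[float]], keep_ratio: float
) -> Tuple[List[List[float]], List[int]]:
    """Drop whole output channels. The next layer's matching input channels
    must be dropped with the same indices, which this sketch does not do."""
    keep = select_channels(conv_weight, keep_ratio)
    return [list(conv_weight[i]) for i in keep], keep


# Toy layer: 4 output channels, each flattened to 2 weights (illustrative values).
TOY_WEIGHT = [[0.9, -0.8], [0.01, 0.02], [0.5, 0.4], [-0.03, 0.0]]

if __name__ == "__main__":
    print(prune_layer(TOY_WEIGHT, keep_ratio=0.5))
```

Because entire channels disappear, the resulting layer is genuinely smaller and cheaper to run, unlike unstructured pruning, which only zeroes entries in a tensor of unchanged shape.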
// other.work
01

AttentionIsALLICode

Not "used a framework." Actually from scratch.

Full architecture · Multi-head attention · Custom training loop
PyTorch · Transformers · NLP · From Scratch
02

Vectorless RAGs

Vector databases are the default. I wanted to know if the default was actually necessary.

Zero embeddings · Zero vector DB · Full retrieval
Python · Ollama · LLaMA 3 · RAG · Tree Traversal · Streamlit
03

Second Brain Debugger

A senior engineer code-reviewed my brain. Six stages. Real AI. No affirmations.

6-stage pipeline · SSE streaming · Multimodal input
Next.js 14 · TypeScript · Mistral-7B · Whisper · Stable Diffusion · Zod
04

DermaVision

7 skin lesion classes. 57:1 class imbalance. Focal Loss said no problem.

HAM10000 dataset · Grad-CAM XAI · ONNX export
EfficientNet-B3 · PyTorch · FastAPI · Next.js 14 · Docker · Albumentations
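
For readers who have not met Focal Loss: it is cross-entropy scaled by (1 − p_t)^γ, so examples the model already gets right contribute very little and the rare, hard classes dominate the gradient. The sketch below is plain Python for a single example; the γ and α values are illustrative assumptions, not the settings used in DermaVision.

```python
import math
from typing import Sequence


def cross_entropy(probs: Sequence[float], target: int) -> float:
    """Standard cross-entropy for one example, given predicted class probabilities."""
    return -math.log(max(probs[target], 1e-12))


def focal_loss(
    probs: Sequence[float],
    target: int,
    gamma: float = 2.0,  # assumed focusing parameter; gamma = 0 recovers cross-entropy
    alpha: float = 1.0,  # assumed class weight; usually set per class for imbalanced data
) -> float:
    """Focal loss: -alpha * (1 - p_t)^gamma * log(p_t)."""
    p_t = max(probs[target], 1e-12)  # clamp to avoid log(0)
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)


if __name__ == "__main__":
    easy = [0.9, 0.1]  # confident and correct
    hard = [0.1, 0.9]  # confident and wrong (target is class 0)
    for name, probs in (("easy", easy), ("hard", hard)):
        ratio = focal_loss(probs, 0) / cross_entropy(probs, 0)
        print(name, "focal / cross-entropy =", ratio)
```

With γ = 2 the easy example keeps about 1% of its cross-entropy loss while the hard one keeps about 81%, which is the mechanism that stops a majority class from drowning out the minority ones.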
05

PEGASUS + LoRA · Efficient Summarization

Fine-tuned a 767M-parameter model using only 1.57M trainable parameters via LoRA. Full fine-tuning produced incoherent outputs on unseen domains. LoRA didn't.

99.8% param reduction · 27× faster training · 767M → 1.57M
PyTorch · HuggingFace · PEFT · LoRA · NLP · XSum
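
The parameter arithmetic can be checked by hand. A LoRA adapter on a d_in × d_out weight matrix adds rank × (d_in + d_out) trainable parameters. The rank and target modules are not stated above, so everything in the configuration below is an assumption: rank 8 on the query and value projections of a 1024-wide encoder-decoder with 16 layers per side is simply one setup that lands on the quoted 1.57M.

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters added by one adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


# Assumed configuration (not stated in the project description).
D_MODEL = 1024
RANK = 8
ENCODER_LAYERS = 16
DECODER_LAYERS = 16
TARGETS_PER_ATTENTION = 2  # query and value projections only

# Each encoder layer has one attention block; each decoder layer has two
# (self-attention and cross-attention).
attention_blocks = ENCODER_LAYERS + 2 * DECODER_LAYERS
adapted_matrices = attention_blocks * TARGETS_PER_ATTENTION
trainable = adapted_matrices * lora_params(D_MODEL, D_MODEL, RANK)

TOTAL_PARAMS = 767_000_000  # figure quoted in the description above
reduction = 1 - trainable / TOTAL_PARAMS

if __name__ == "__main__":
    print(f"adapted matrices: {adapted_matrices}")
    print(f"trainable parameters: {trainable:,}")
    print(f"reduction vs. full fine-tuning: {reduction:.1%}")
```

Under these assumptions the count is 96 adapted matrices at 16,384 parameters each, or 1,572,864 trainable parameters, which rounds to the 99.8% reduction quoted above.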

// ventures

LIVE PRODUCT
dev-path.site ↗
Founder

DevPath

Structured learning for developers who are tired of tutorial hell.

Growing curated roadmaps
Daily focused tasks
Free to get started
Visit DevPath ↗

// experience.log

LAYER_01: HireBuddy_Software_Engineer_ML (2025) [STATUS: DEPLOYED]
[ROLE] Software Engineer - Machine Learning
[INIT] Resume-JD matching via RAG + Transformers. Zero regex. Zero vibes.
[THROUGHPUT] NLP_pipeline.ingest(resumes=10k/day, mode="shortlist")
[METRIC] shortlist_acc: +35%  ·  chain_latency: −28%
[DEPLOY] FastAPI + Docker on AWS  ·  REST API, Friday-proof since day one.
[EVAL] eval_stack: RAGAS  ·  LangSmith, because vibes aren't a metric.
[SIGNAL] Engagement: +40%, the model got better at reading people than the recruiters did.