I study how language models fail and build tools to catch it.

AI Engineer, 7 years. I design agents for production and research LLM trust at the architectural level.

Now

Open-source profiler that looks inside LLMs to find failure modes benchmarks miss. Next: training sparse autoencoders on Qwen3.5's hybrid DeltaNet layers.

Writing

Claude Code Treats Its System Prompt Like Infrastructure
I read Claude Code's source looking for prompt engineering. I found cost engineering. Cache boundaries, sticky latches, circuit breakers, and the 77% number.

Apr 1

Watch a Language Model Think
I used sparse autoencoders to look inside GPT-2. I found interpretable concepts, feature loops, and a temperature feature that fires correctly while the model gets the answer wrong.

Apr 1

I Built a Terminal Dashboard to See Where My Tokens Go
crux is a TUI that reads Claude Code session logs and shows you real-time context growth, cache efficiency, cost breakdowns, and session health grades.

Mar 16

Hybrid Attention at 8M Params (and What Broke at 150M)
I trained Qwen3.5's hybrid DeltaNet+attention architecture from scratch on a MacBook. Pure attention won at 8M. Scaling to 150M hit a math bug that looked like a hyperparameter problem.

Mar 10

Why I'm Building Trust Bench
Evaluation tools score LLM outputs. They don't tell you why models fail. Trust Bench connects profiling, diagnosis, and repair into a single open-source tool.

Feb 20

Tools

tokenizer-arena
tokenizer-arena is a Rust CLI that shows how different LLM tokenizers encode the same text, revealing surprising differences in efficiency.
cargo install tokenizer-arena

crux
crux is a TUI that reads Claude Code session logs and shows you real-time context growth, cache efficiency, cost breakdowns, and session health grades.
brew install amaljithkuttamath/tap/crux

microGPT Playground
Train a transformer in your browser.