I study how language models fail and build tools to catch it.
AI Engineer, 7 years. I design agents for production and research LLM trust at the architectural level.
Now
Trust Bench (in progress)
Open-source profiler that looks inside LLMs to find failure modes benchmarks miss. Next: training sparse autoencoders on Qwen3.5's hybrid DeltaNet layers.
Writing
Claude Code Treats Its System Prompt Like Infrastructure
I read Claude Code's source looking for prompt engineering. I found cost engineering. Cache boundaries, sticky latches, circuit breakers, and the 77% number.
Watch a Language Model Think
I used sparse autoencoders to look inside GPT-2. I found interpretable concepts, feature loops, and a temperature feature that fires correctly while the model gets the answer wrong.
I Built a Terminal Dashboard to See Where My Tokens Go
crux is a TUI that reads Claude Code session logs and shows you real-time context growth, cache efficiency, cost breakdowns, and session health grades.
Hybrid Attention at 8M Params (and What Broke at 150M)
I trained Qwen3.5's hybrid DeltaNet+attention architecture from scratch on a MacBook. Pure attention won at 8M. Scaling to 150M hit a math bug that looked like a hyperparameter problem.
Why I'm Building Trust Bench
Evaluation tools score LLM outputs. They don't tell you why models fail. Trust Bench connects profiling, diagnosis, and repair into a single open-source tool.
Tools
tokenizer-arena
A Rust CLI that shows how different LLM tokenizers encode the same text, revealing surprising differences in efficiency.
cargo install tokenizer-arena
crux
A TUI that reads Claude Code session logs and shows you real-time context growth, cache efficiency, cost breakdowns, and session health grades.
brew install amaljithkuttamath/tap/crux