Blog
When an LLM Says 90%, Should You Believe It?
Measuring LLM calibration by asking 102 factual questions and checking whether stated confidence matches actual accuracy. The answer: models are overconfident.
Watching Superposition Emerge in a Toy Model
Reproducing the key finding from Anthropic's Toy Models of Superposition paper. A 30-line model shows how neural networks pack more features than they have dimensions.
Four Attention Variants, One Training Loop
I implemented MHA, GQA, MQA, and sliding-window attention from scratch and trained small transformers to compare them.
I Built a Terminal Dashboard to See Where My Tokens Go
crux is a TUI that reads Claude Code session logs and shows you real-time context growth, cache efficiency, cost breakdowns, and session health grades.
What's Actually Inside a GGUF File?
I parsed a Llama 3.2 model file byte by byte. Here's what the format reveals about quantization, architecture, and how inference engines load models.
Comparing LLM Tokenizers Side by Side
tokenizer-arena is a Rust CLI that shows how different LLM tokenizers encode the same text, revealing surprising differences in efficiency.
Hybrid Attention at 8M Params (and What Broke at 150M)
I trained Qwen3.5's hybrid DeltaNet+attention architecture from scratch on a MacBook. Pure attention won at 8M. Scaling to 150M hit a math bug that looked like a hyperparameter problem.
I Built a Tool to Fix Claude Code Skills
I audited my own Claude Code skills and found problems in every one. So I built a plugin to do the audit for anyone.
Why I'm Building Trust Bench
Evaluation tools score LLM outputs. They don't tell you why models fail. Trust Bench connects profiling, diagnosis, and repair into a single open-source tool.
How Large Language Models Actually Work
A visual, from-scratch deep dive into the algorithm behind GPT, Qwen, Llama, and every other LLM.