Blog

When an LLM Says 90%, Should You Believe It?

Measuring LLM calibration by asking 102 factual questions and checking whether stated confidence matches actual accuracy. The answer: models are overconfident.

calibration · llm-safety · evaluation · python

Watching Superposition Emerge in a Toy Model

Reproducing the key finding from Anthropic's Toy Models of Superposition paper. A 30-line model shows how neural networks pack more features than they have dimensions.

superposition · interpretability · anthropic · pytorch

Four Attention Variants, One Training Loop

I implemented MHA, GQA, MQA, and Sliding Window attention from scratch and trained small transformers to compare them.

attention · transformers · pytorch · from-scratch

I Built a Terminal Dashboard to See Where My Tokens Go

crux is a TUI that reads Claude Code session logs and shows you real-time context growth, cache efficiency, cost breakdowns, and session health grades.

rust · tui · claude-code · tooling · open-source

What's Actually Inside a GGUF File?

I parsed a Llama 3.2 model file byte by byte. Here's what the format reveals about quantization, architecture, and how inference engines load models.

gguf · llama · quantization · rust · model-internals

Comparing LLM Tokenizers Side by Side

tokenizer-arena is a Rust CLI that shows how different LLM tokenizers encode the same text, revealing surprising differences in efficiency.

rust · tokenizers · llm · nlp · open-source

Hybrid Attention at 8M Params (and What Broke at 150M)

I trained Qwen3.5's hybrid DeltaNet+attention architecture from scratch on a MacBook. Pure attention won at 8M. Scaling to 150M hit a math bug that looked like a hyperparameter problem.

mlx · attention · deltanet · training · apple-silicon

I Built a Tool to Fix Claude Code Skills

I audited my own Claude Code skills and found problems in every one. So I built a plugin to do the audit for anyone.

claude-code · skills · tooling

Why I'm Building Trust Bench

Evaluation tools score LLM outputs. They don't tell you why models fail. Trust Bench connects profiling, diagnosis, and repair into a single open-source tool.

trust-bench · llm-safety · evaluation · interpretability

How Large Language Models Actually Work

A visual, from-scratch deep dive into the transformer algorithm behind GPT, Qwen, Llama, and every other modern LLM.

transformers · from-scratch · deep-dive