Blog
When an LLM Says 90%, Should You Believe It?
Measuring LLM calibration by asking 102 factual questions and checking whether stated confidence matches actual accuracy. The answer: models are overconfident.
Watching Superposition Emerge in a Toy Model
Reproducing the key finding from Anthropic's Toy Models of Superposition paper. A 30-line model shows how neural networks pack more features than they have dimensions.
Four Attention Variants, One Training Loop
I implemented MHA, GQA, MQA, and sliding-window attention from scratch and trained small transformers to compare them.
I Built a Terminal Dashboard to See Where My Tokens Go
crux is a TUI that reads Claude Code session logs and shows you real-time context growth, cache efficiency, cost breakdowns, and session health grades.
What's Actually Inside a GGUF File?
I parsed a Llama 3.2 model file byte by byte. Here's what the format reveals about quantization, architecture, and how inference engines load models.
Comparing LLM Tokenizers Side by Side
tokenizer-arena is a Rust CLI that shows how different LLM tokenizers encode the same text, revealing surprising differences in efficiency.
Hybrid Attention at 8M Params (and What Broke at 150M)
I trained Qwen3.5's hybrid DeltaNet+attention architecture from scratch on a MacBook. Pure attention won at 8M. Scaling to 150M hit a math bug that looked like a hyperparameter problem.
I Built a Tool to Fix Claude Code Skills
I audited my own Claude Code skills and found problems in every one. So I built a plugin to do the audit for anyone.
Why I'm Building Trust Bench
Evaluation tools score LLM outputs. They don't tell you why models fail. Trust Bench connects profiling, diagnosis, and repair into a single open-source tool.
How Large Language Models Actually Work
A visual, from-scratch deep dive into the algorithm behind GPT, Qwen, Llama, and every other LLM.