Writing
Why I'm Building Trust Bench
Evaluation tools score LLM outputs. They don't tell you why models fail. Trust Bench connects profiling, diagnosis, and repair into a single open-source tool.
TOOL Three Behavioral Evals on Claude's Medical Safety
TOOL The Medical AI Evaluation Problem
EXP When an LLM Says 90%, Should You Believe It?
DEEP DIVE Six Experiments That Built My Intuition for Trust Bench
I Built a Terminal Dashboard to See Where My Tokens Go
crux is a terminal UI (TUI) that reads Claude Code session logs and shows you real-time context growth, cache efficiency, cost breakdowns, and session health grades.
TOOL I Built a Tool to Fix Claude Code Skills
TOOL Comparing LLM Tokenizers Side by Side
TOOL What's Actually Inside a GGUF File?