Trust Bench
Open-source profiler that looks inside LLMs to find failure modes benchmarks miss. Profile, diagnose, repair.
Every LLM evaluation tool scores outputs. None of them tell you why a model fails. Trust Bench connects profiling, diagnosis, and repair into a single loop: extract trust signals from model internals, evaluate against peer-reviewed safety benchmarks, apply targeted fixes, verify they worked.
I am starting with Qwen3.5's hybrid attention architecture (alternating DeltaNet and GQA layers), which has not yet been studied for trust properties, and training sparse autoencoders on these layers to find interpretable failure features.
Progress
Three Behavioral Evals on Claude's Medical Safety
Running Anthropic's Bloom framework on medical safety behaviors. The arithmetic verification loophole, zero deference to fabricated claims, and how disclaimers degrade over multi-turn conversations.

The Medical AI Evaluation Problem
A systematic review of 70 studies found that 69 assessed accuracy, 3 evaluated safety, and 2 addressed privacy. Each layer of medical AI evaluation has problems.

When an LLM Says 90%, Should You Believe It?
Measuring LLM calibration by asking 102 factual questions and checking if stated confidence matches actual accuracy. The answer: models are overconfident.

A Single Neuron for 'And' in Six Languages
Using Gemma Scope's pre-trained sparse autoencoders to find cross-lingual features in Gemma 2 2B. Feature #10543 fires on 'and', 'et', 'und', 'y', and 'e' with zero activation on control sentences.

Hybrid Attention at 8M Params (and What Broke at 150M)
I trained Qwen3.5's hybrid DeltaNet+attention architecture from scratch on a MacBook. Pure attention won at 8M. Scaling to 150M hit a math bug that looked like a hyperparameter problem.

Six Experiments That Built My Intuition for Trust Bench
Before I could build a tool that profiles trust inside language models, I needed to understand what's happening inside them. These are the experiments that got me there.

Why I'm Building Trust Bench
Evaluation tools score LLM outputs. They don't tell you why models fail. Trust Bench connects profiling, diagnosis, and repair into a single open-source tool.
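
The calibration check described in the "When an LLM Says 90%" entry above (compare stated confidence with actual accuracy) can be sketched with expected calibration error. This is a minimal illustration, not the code from that post: the function names, the 10-bin choice, and the input format are all assumptions, and any values you pass in are your own data, not results from the post.

```python
from typing import List


def expected_calibration_error(
    confidences: List[float], correct: List[bool], n_bins: int = 10
) -> float:
    """Weighted average gap between stated confidence and accuracy per bin.

    confidences: stated confidence per answer, in [0, 1] (assumed format).
    correct: whether each answer was actually right.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one answer")

    # Group answers into equal-width confidence bins.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece


def overconfidence_gap(confidences: List[float], correct: List[bool]) -> float:
    """Mean stated confidence minus accuracy; positive means overconfident."""
    accuracy = sum(1 for ok in correct if ok) / len(correct)
    return sum(confidences) / len(confidences) - accuracy
```

A positive `overconfidence_gap` corresponds to the "models are overconfident" finding; `expected_calibration_error` additionally shows how far off the model is within each confidence range.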
Architecture
1. Extract: Pull trust signals from model internals: entropy, layer activations, and attention patterns, separated by attention type.
2. Evaluate: Score across TrustLLM dimensions: truthfulness, safety, fairness, robustness, privacy, machine ethics.
3. Improve: Apply targeted techniques (calibration, RepE steering, Constitutional AI). Re-evaluate. Close the loop.
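
As a rough illustration of the first and last steps, the sketch below computes next-token entropy from raw logits (one of the signals named under Extract) and then re-measures it after temperature scaling. Temperature scaling is used here only as one common calibration technique; the source does not say which calibration method Trust Bench applies, and every function name below is hypothetical rather than part of the Trust Bench API.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over one logit vector."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_entropy(logits: List[float]) -> float:
    """Shannon entropy (in nats) of the next-token distribution."""
    return -sum(p * math.log(p) for p in softmax(logits) if p > 0.0)


def mean_sequence_entropy(step_logits: List[List[float]]) -> float:
    """Average per-token entropy over a generated sequence."""
    if not step_logits:
        raise ValueError("need at least one decoding step")
    return sum(token_entropy(step) for step in step_logits) / len(step_logits)


def apply_temperature(logits: List[float], temperature: float) -> List[float]:
    """Divide logits by a temperature (assumed calibration knob, must be > 0)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    return [x / temperature for x in logits]


# Extract: measure entropy. Improve: rescale. Re-evaluate: measure again.
logits = [4.0, 1.0, 0.5, 0.1]  # made-up logits for a 4-token vocabulary
before = token_entropy(logits)
after = token_entropy(apply_temperature(logits, 2.0))
```

With a temperature above 1 the distribution flattens, so `after` is larger than `before`; in a real loop the temperature would be fit on held-out data and the Evaluate step rerun to confirm the change helped.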