in progress trust-bench

Trust Bench

Open-source profiler that looks inside LLMs to find failure modes benchmarks miss. Profile, diagnose, repair.

Every LLM evaluation tool scores outputs. None of them tell you why a model fails. Trust Bench connects profiling, diagnosis, and repair into a single loop: extract trust signals from model internals, evaluate against peer-reviewed safety benchmarks, apply targeted fixes, and verify that they worked.

The first target is Qwen3.5's hybrid attention architecture (alternating DeltaNet and GQA layers), which to my knowledge has not yet been studied for trust properties. The plan is to train sparse autoencoders on these layers to find interpretable failure features.
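
For readers new to sparse autoencoders, here is a minimal sketch of the forward pass and training objective in plain Python. It is an illustration under assumptions, not Trust Bench's code: the class name, layer sizes, initialization, and L1 coefficient are all hypothetical, and a real run would use an autodiff framework to minimize this loss over activations captured from the model.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class SparseAutoencoder:
    """Toy SAE: ReLU encoder, linear decoder (hypothetical sizes and init)."""

    def __init__(self, d_model, d_hidden, seed=0):
        rng = random.Random(seed)
        self.w_enc = [[rng.gauss(0.0, 0.1) for _ in range(d_model)]
                      for _ in range(d_hidden)]
        self.b_enc = [0.0] * d_hidden
        self.w_dec = [[rng.gauss(0.0, 0.1) for _ in range(d_hidden)]
                      for _ in range(d_model)]
        self.b_dec = [0.0] * d_model

    def encode(self, x):
        # Subtract the decoder bias, project up, then apply ReLU.
        centered = [xi - bi for xi, bi in zip(x, self.b_dec)]
        pre = matvec(self.w_enc, centered)
        return [max(0.0, p + b) for p, b in zip(pre, self.b_enc)]

    def decode(self, features):
        # Reconstruct the activation from the feature vector.
        recon = matvec(self.w_dec, features)
        return [r + b for r, b in zip(recon, self.b_dec)]

    def loss(self, x, l1_coeff=1e-3):
        # Squared reconstruction error plus an L1 sparsity penalty.
        features = self.encode(x)
        x_hat = self.decode(features)
        recon_error = sum((a - b) ** 2 for a, b in zip(x, x_hat))
        # Features are non-negative after ReLU, so their sum is the L1 norm.
        return recon_error + l1_coeff * sum(features)


if __name__ == "__main__":
    sae = SparseAutoencoder(d_model=4, d_hidden=8)
    activation = [1.0, -0.5, 0.25, 0.0]  # stand-in for one layer activation
    print(sae.encode(activation))
    print(sae.loss(activation))
```

The sparsity penalty is what pushes each input to be explained by only a few active features, which is what makes individual features candidates for interpretation.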

Three Behavioral Evals on Claude's Medical Safety
Running Anthropic's Bloom framework on medical safety behaviors. The arithmetic verification loophole, zero deference to fabricated claims, and how disclaimers degrade over multi-turn conversations.
Tags: evaluation, medical-ai, llm-safety

The Medical AI Evaluation Problem
A systematic review of 70 studies found that 69 assessed accuracy, 3 evaluated safety, and 2 addressed privacy. Each layer of medical AI evaluation has problems.
Tags: evaluation, medical-ai, llm-safety

When an LLM Says 90%, Should You Believe It?
Measuring LLM calibration by asking 102 factual questions and checking if stated confidence matches actual accuracy. The answer: models are overconfident.
Tags: calibration, llm-safety, evaluation

A Single Neuron for 'And' in Six Languages
Using Gemma Scope's pre-trained sparse autoencoders to find cross-lingual features in Gemma 2 2B. Feature #10543 fires on 'and', 'et', 'und', 'y', and 'e' with zero activation on control sentences.
Tags: interpretability, sae, gemma

Hybrid Attention at 8M Params (and What Broke at 150M)
I trained Qwen3.5's hybrid DeltaNet+attention architecture from scratch on a MacBook. Pure attention won at 8M. Scaling to 150M hit a math bug that looked like a hyperparameter problem.
Tags: mlx, attention, deltanet

Six Experiments That Built My Intuition for Trust Bench
Before I could build a tool that profiles trust inside language models, I needed to understand what's happening inside them. These are the experiments that got me there.
Tags: interpretability, training, attention

Why I'm Building Trust Bench
Evaluation tools score LLM outputs. They don't tell you why models fail. Trust Bench connects profiling, diagnosis, and repair into a single open-source tool.
Tags: trust-bench, llm-safety, evaluation
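
The calibration post listed above ("When an LLM Says 90%, Should You Believe It?") compares stated confidence with actual accuracy. One common way to put a number on that gap is expected calibration error (ECE). The sketch below is an assumption about how such a check could look, not the post's actual code; the function name, the bin count, and the input format (confidences in [0, 1] paired with correctness flags) are hypothetical.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average of |mean confidence - accuracy| over confidence bins."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    total = len(confidences)
    if total == 0:
        return 0.0

    # Group answers into equal-width confidence bins; 1.0 goes in the last bin.
    bins = [[] for _ in range(n_bins)]
    for conf, is_correct in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, is_correct))

    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_conf = sum(conf for conf, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        # Weight each bin's gap by the share of answers that fall in it.
        ece += (len(bucket) / total) * abs(mean_conf - accuracy)
    return ece


if __name__ == "__main__":
    # Made-up illustration: four answers stated at 90%, only half correct.
    stated = [0.9, 0.9, 0.9, 0.9]
    was_correct = [True, True, False, False]
    print(expected_calibration_error(stated, was_correct))
```

An overconfident model shows up as bins where mean confidence sits above accuracy; ECE collapses those gaps into a single score, so it is worth looking at the per-bin values as well.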
01. SAE on DeltaNet layers
Train sparse autoencoders on Qwen3.5's linear attention layers. Find interpretable features that predict trust failures.

02. Hybrid attention trust comparison
Compare trust signal patterns between DeltaNet and GQA layers. Do different attention types fail differently?

03. Repair loop
Apply RepE activation steering and Constitutional AI techniques. Profile before and after. Did the fix work?
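
To make the "profile before and after" idea in item 03 concrete, here is a small sketch of difference-of-means activation steering, one simple variant used in activation-steering work. The original does not say which steering method Trust Bench will use, so treat this as an assumption; the function names and the toy two-dimensional activations are hypothetical.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    if not vectors:
        raise ValueError("need at least one vector")
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def steering_vector(positive_acts, negative_acts):
    """Direction from the mean 'undesired' to the mean 'desired' activation."""
    pos_mean = mean_vector(positive_acts)
    neg_mean = mean_vector(negative_acts)
    return [p - n for p, n in zip(pos_mean, neg_mean)]


def steer(hidden, direction, alpha=1.0):
    """Shift a hidden state along the steering direction by strength alpha."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


def projection(hidden, direction):
    """Scalar projection of a hidden state onto the steering direction."""
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    return sum(h * d for h, d in zip(hidden, direction)) / norm


if __name__ == "__main__":
    # Toy activations standing in for one layer's residual stream.
    desired = [[1.0, 0.0], [3.0, 0.0]]
    undesired = [[0.0, 0.0], [0.0, 0.0]]
    direction = steering_vector(desired, undesired)

    hidden = [0.0, 1.0]
    steered = steer(hidden, direction, alpha=0.5)
    # Profile before and after: did the state move along the direction?
    print(projection(hidden, direction), projection(steered, direction))
```

The projection is only a proxy. Verifying a repair still means re-running the behavioral evaluation on the steered model.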
1. Extract
Pull trust signals from model internals. Entropy, layer activations, attention patterns, separated by attention type.

2. Evaluate
Score across TrustLLM dimensions: truthfulness, safety, fairness, robustness, privacy, machine ethics.

3. Improve
Apply targeted techniques (calibration, RepE steering, Constitutional AI). Re-evaluate. Close the loop.
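
Of the signals named in the Extract step, entropy is the simplest to illustrate. The sketch below computes the Shannon entropy of a next-token distribution from raw logits. It assumes the logits are already available as a plain list of floats; the function names are hypothetical and this is not Trust Bench's extraction code.

```python
import math


def softmax(logits):
    """Convert logits to probabilities, subtracting the max for stability."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_entropy(logits):
    """Shannon entropy (in nats) of the distribution implied by the logits."""
    probs = softmax(logits)
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_sequence_entropy(per_token_logits):
    """Average entropy over the positions of a generated sequence."""
    entropies = [token_entropy(logits) for logits in per_token_logits]
    return sum(entropies) / len(entropies)


if __name__ == "__main__":
    confident = [10.0, 0.0, 0.0, 0.0]  # one token dominates
    uncertain = [0.0, 0.0, 0.0, 0.0]   # uniform over four tokens
    print(token_entropy(confident), token_entropy(uncertain))
```

Low entropy means the model concentrates probability on a few tokens. Whether that confidence is justified is a calibration question, which is why entropy is treated here as one signal among several.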