Trust Bench

in progress

Open-source profiler for LLM trustworthiness. Extracts internal signals, evaluates across safety dimensions, diagnoses failure modes, and measures trust boundaries specific to deployment context.

AI safetymechanistic interpretabilityPyTorchevaluation
Predicting Risk of Lung Cancer From Medical History Quality Management in Health Care, 2025 paper →pubmed →
Cross-lingual feature found in 6 languages

sae-explorer

Found a single SAE feature in Gemma 2 2B that detects conjunctions across 6 languages with zero false positives.

interpretabilitysaegemma
Problem Do multilingual models store concepts once or per-language? SAEs can decompose activations into interpretable features, but finding genuinely monosemantic cross-lingual features requires systematic probing.
Approach Load Gemma Scope layer 12 SAE (16K features). Run parallel sentences in 6 languages through Gemma 2 2B. Filter for narrow, selective features.
Stack SAELens, TransformerLens, PyTorch, Gemma 2 2B
Outcome Feature #10543: conjunction-only, 6 languages, 0 false positives. Also found cross-lingual "cat" feature (#4497) and sentence-initial determiner feature (#1178).
DeltaNet from scratch on Apple Silicon

hybrid-attention-150m

Trained Qwen3.5 hybrid architecture at 8M and 150M params. Found and fixed a numerical bug in the triangular solve.

mlxdeltanettraining
Problem Qwen3.5 uses a novel 3:1 linear/full attention pattern. Understanding it requires training it, not just reading about it.
Approach From-scratch MLX implementation. Trained at 8M (works) and 150M (NaN). Debugged the repeated-squaring triangular solve to find missing terms in the Neumann series.
Stack MLX, Python, Apple Silicon M-series
Outcome 150M model trained to BPB 2.036. Found that MLX lacks tri_inv VJP and associative_scan, limiting DeltaNet training.
ECE 0.107, overconfident by 3%

calibration-probe

Measures whether LLMs know when they are wrong. 102 factual questions, forced confidence, calibration curves.

calibrationllm-safetyevaluation
Problem A model that says "90% confident" but is only right 75% of the time is dangerous in production.
Approach 102 factual questions across 5 categories. Model states confidence 0-100. Binned by confidence, actual accuracy measured per bin.
Stack Python, matplotlib, Anthropic API, OpenAI API
Outcome Models are slightly overconfident (89% confidence, 86% accuracy). Geography worst calibrated. ECE = 0.107.
6 experiments, 5 repos

building-intuition

Superposition, activation projections, loss landscapes, scaling laws, and attention variants. Each experiment answered a question I needed for Trust Bench.

interpretabilitytrainingattentionscaling-laws
Problem Before building a trust profiler, I needed hands-on understanding of model internals: superposition, layer representations, landscape geometry, scaling behavior, and attention mechanics.
Approach Six targeted experiments, each answering one question for Trust Bench. Superposition phase transitions, UMAP activation projections, filter-normalized loss surfaces, Chinchilla power law fits, and four attention variants from scratch.
Stack PyTorch, MLX, matplotlib, UMAP, scipy
Outcome Confirmed: trust signals need SAE decomposition (superposition), middle-to-late layers matter most (activations), flat minima tolerate interventions (landscapes), small-scale results are directionally useful (scaling laws), GQA layers need normalization (attention).
10K+ daily queries

Healthcare RAG Pipeline

Production RAG system for clinicians. Hallucination-aware retrieval with domain-specific re-ranking.

RAGvertex aihallucination detection
Problem Clinicians need grounded answers from medical records and research. Standard retrieval surfaces irrelevant context, leading to hallucinated outputs.
Approach Retrieval-augmented generation on Vertex AI with hallucination-aware retrieval. Domain-specific re-ranking for clinical relevance.
Stack Vertex AI, Gemini Pro, LangChain, LlamaIndex
Outcome Serves 10,000+ daily clinical queries. Developed methodology for measuring retrieval fidelity and hallucination rates.
25% lower inference cost

LLM Fine-tuning System

Fine-tuning Qwen2.5-3B with GRPO + LoRA. Studied how reward-guided optimization changes model behavior.

fine-tuningGRPOvLLM
Problem Off-the-shelf models underperform on domain tasks. Understanding how fine-tuning alters behavior requires both training infra and evaluation methodology.
Approach 4-bit quantization with LoRA adapters. GRPO for reward-guided optimization. vLLM for high-throughput inference.
Stack PyTorch, Hugging Face, LoRA, vLLM
Outcome Deployed fine-tuned 3B model to production, reducing inference costs by 25%.
5 entity types, cross-format search

Medical Knowledge Graph

Knowledge graph connecting diseases, drugs, treatments, and clinical trials. Powers retrieval grounding.

neo4jknowledge graphsembeddings
Problem Medical knowledge is scattered across databases, papers, and records. Flat storage loses entity relationships.
Approach Graph schema mapping diseases, treatments, drugs, clinical trials. Multi-modal embeddings for cross-format querying.
Stack Neo4j, Elasticsearch, sentence-transformers
Outcome Powers the retrieval grounding layer. Enables entity-aware queries across text, structured data, and metadata.
Early warning signals

Drug Safety Sentiment Analysis

Transformer classifiers detecting sentiment shifts in medical expert opinions for pharmacovigilance.

NLPsafety signalshealthcare
Problem Pharmaceutical safety teams need to detect early warning signs before adverse events escalate.
Approach Fine-tuned transformer classifiers on domain-specific medical language. Tracks sentiment per drug per condition over time.
Stack PyTorch, Hugging Face, spaCy
Outcome Surfaces drugs with increasing negative sentiment. Integrated into pharmacovigilance workflows.
90%+ accuracy, 50% less downtime

Predictive Maintenance

Digital twin platform for industrial machinery. Anomaly detection and time-to-failure prediction from IoT streams.

anomaly detectiontime seriesdigital twins
Problem Unplanned machine downtime is expensive. Raw sensor data exists but is not used for proactive failure prediction.
Approach Signal processing pipeline ingesting IoT sensor streams. Feature extraction and time-to-failure models trained on historical failure data.
Stack Python, scikit-learn, Azure Data Lake, Power BI
Outcome 90%+ anomaly detection accuracy. Reduced machine downtime by 50% across deployed sites.
Real-time token analytics

crux

Terminal dashboard for Claude Code usage. Tracks context growth, cache efficiency, cost breakdowns, and session health.

brew install amaljithkuttamath/tap/cruxcargo install crux-cli
rusttuiclaude-code
Problem Claude Code users have no visibility into token usage patterns. Sessions vary wildly in cost with no way to diagnose why.
Approach Rust TUI that reads Claude Code JSONL session logs in real-time. Computes rolling statistics, cache hit ratios, and health grades.
Stack Rust, ratatui, tokio, serde, MCP server
Outcome Open-source with cross-platform releases. Reveals usage patterns invisible from the Claude Code interface.
Compare 4 encodings

tokenizer-arena

CLI that runs the same text through multiple LLM tokenizers and shows differences in token count and boundaries.

cargo install tokenizer-arena
rusttokenizerscli
Problem Tokenizer choice affects training efficiency and inference cost, but comparing tokenizers requires custom scripts.
Approach Wraps tiktoken-rs to compare cl100k_base, o200k_base, p50k_base, and r50k_base side by side.
Stack Rust, tiktoken-rs, comfy-table, clap
Outcome Newer tokenizers are ~25% more efficient on code. Works fully offline. JSON output for scripting.
Parse models byte by byte

gguf-inspect

CLI that reads GGUF model files and prints architecture, quantization, tensor shapes, and memory estimates.

rustggufmodel-internals
Problem Understanding what is inside a quantized model file requires reading llama.cpp source or scattered docs.
Approach From-scratch GGUF binary parser in Rust. Reads header, metadata, tensor info. Computes parameter counts and memory estimates.
Stack Rust, byteorder, comfy-table, clap
Outcome Tested on Llama 3.2 3B. Reveals mixed-precision quantization, GQA head ratios, and 6.7x compression from FP32.
Claude Code plugin

skill-doctor

Audits your Claude Code skills and diagnoses issues. Builds upgrades in staging with rollback.

claude-codeskillstooling
Problem Claude Code skills accumulate bad patterns: redundant context, missing guards, no tool restrictions.
Approach Reads all skills and sends each to the claude-code-guide agent for evaluation against current docs.
Stack Claude Code plugin, Markdown skills
Outcome Found and fixed issues in all 5 of my own skills. Saved ~2,500 tokens per session from context duplication.

microGPT Playground

try it live →

Train a transformer in your browser. Watch attention patterns, embeddings, and loss evolve in real time.

interactive d3.js transformers