Trust Bench

in progress

Framework for profiling LLM trustworthiness. Extracts internal signals, evaluates across safety dimensions (truthfulness, fairness, robustness), diagnoses failure modes, and measures trust boundaries specific to the deployment context.

Predicting Risk of Lung Cancer From Medical History

Quality Management in Health Care, 2025

Peer-reviewed research achieving AUC 0.82 in lung cancer risk prediction using electronic health records (EHR). Developed a methodology for clinical risk modeling with structured EHR data.

microGPT Playground

try it live →

Train a transformer in your browser. Watch attention patterns, embeddings, and loss evolve in real time. Same architecture as frontier models, small enough to inspect every component.

interactive · d3.js · transformers
Phase transition at 0.7 sparsity

superposition-viz

Reproduces key findings from Anthropic's Toy Models of Superposition paper. Visualizes how neural networks pack more features than they have dimensions.

superposition · interpretability · anthropic · pytorch
Problem: Superposition is a fundamental obstacle for mechanistic interpretability. If 10 features share 5 dimensions, no single neuron maps to a single feature.
Approach: Toy model with linear encoder, ReLU bottleneck, and tied-weight decoder. Trains at multiple sparsity levels. Generates phase diagrams, feature geometry plots, dimensionality curves, and interference matrices.
Stack: PyTorch, matplotlib, numpy
Outcome: Clear phase transition at 0.7 sparsity. Model jumps from representing 4 to 5+ features in a 5-dim bottleneck. Matches the Anthropic paper's Figure 2.
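A minimal sketch of the tied-weight toy model described above, assuming a 10-feature, 5-dimension setup; the hyperparameters and training loop are illustrative, not the repo's exact configuration:

```python
import torch

torch.manual_seed(0)

class ToyModel(torch.nn.Module):
    """Linear encoder + tied-weight ReLU decoder, as in the
    Toy Models of Superposition setup."""
    def __init__(self, n_features=10, n_hidden=5):
        super().__init__()
        self.W = torch.nn.Parameter(torch.randn(n_hidden, n_features) * 0.1)
        self.b = torch.nn.Parameter(torch.zeros(n_features))

    def forward(self, x):
        h = x @ self.W.T                         # compress into the bottleneck
        return torch.relu(h @ self.W + self.b)   # tied-weight reconstruction

def sparse_batch(n, n_features, sparsity):
    # Features are uniform in [0, 1], zeroed out with probability `sparsity`.
    x = torch.rand(n, n_features)
    return x * (torch.rand(n, n_features) > sparsity)

model = ToyModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
losses = []
for _ in range(200):
    x = sparse_batch(256, 10, sparsity=0.7)
    loss = ((model(x) - x) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    losses.append(loss.item())
```

Sweeping `sparsity` and inspecting the columns of `W` is what produces the phase diagrams and feature-geometry plots.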
ECE 0.107, overconfident by 3%

calibration-probe

Measures whether LLMs know when they are wrong. Asks 102 factual questions, forces stated confidence, and plots calibration curves.

calibration · llm-safety · evaluation · python
Problem: A model that says "90% confident" but is only right 75% of the time is dangerous. Calibration measures this gap, which matters for safety-critical deployments.
Approach: 102 factual questions across 5 categories. Model states confidence 0-100. Responses binned by confidence, actual accuracy measured per bin. Supports direct, chain-of-thought, and step-by-step prompting strategies.
Stack: Python, matplotlib, Anthropic API, OpenAI API
Outcome: Models are slightly overconfident (89% confidence, 86% accuracy). Geography worst calibrated. Math best calibrated. ECE = 0.107. Dry-run mode works without API keys.
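The binning metric behind the headline number is expected calibration error (ECE); a minimal sketch of the standard computation, not the repo's exact code:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: the bin-weighted mean of |accuracy - confidence|
    over equal-width confidence bins."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap   # weight by fraction of samples in bin
    return ece
```

A model that states 95% confidence but answers 90% correctly contributes a 0.05 gap, weighted by how many answers fall in that bin.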
4 variants, from scratch

attention-bench

Benchmarks MHA, GQA, MQA, and Sliding Window attention by training small transformers on the same data and comparing perplexity, throughput, and memory.

attention · pytorch · transformers · from-scratch
Problem: Attention variants like GQA and MQA are described in papers with theoretical tradeoffs, but direct empirical comparison on identical setups is rare.
Approach: Four attention mechanisms implemented from scratch in PyTorch. Each plugs into the same decoder-only transformer LM. Trained on TinyStories with identical hyperparameters.
Stack: PyTorch, tiktoken, matplotlib, HuggingFace datasets
Outcome: MQA uses 25% fewer KV parameters than MHA. GQA sits between them. All four implementations are covered by 25 passing tests. SWA trades global context for linear memory scaling.
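The core mechanic that unifies MHA, GQA, and MQA is how many KV heads the query heads share. A minimal non-causal sketch (the repo's versions add masking and projections per its own design):

```python
import torch

torch.manual_seed(0)

def grouped_query_attention(x, wq, wk, wv, n_heads, n_kv_heads):
    """Minimal GQA: n_heads query heads share n_kv_heads K/V heads.
    n_kv_heads == n_heads recovers MHA; n_kv_heads == 1 recovers MQA."""
    B, T, D = x.shape
    hd = D // n_heads
    q = (x @ wq).view(B, T, n_heads, hd).transpose(1, 2)     # (B, H,   T, hd)
    k = (x @ wk).view(B, T, n_kv_heads, hd).transpose(1, 2)  # (B, Hkv, T, hd)
    v = (x @ wv).view(B, T, n_kv_heads, hd).transpose(1, 2)
    # Broadcast each KV head across its group of query heads.
    rep = n_heads // n_kv_heads
    k = k.repeat_interleave(rep, dim=1)
    v = v.repeat_interleave(rep, dim=1)
    att = torch.softmax(q @ k.transpose(-2, -1) / hd ** 0.5, dim=-1)
    return (att @ v).transpose(1, 2).reshape(B, T, D)

# D=8, 4 query heads, 2 KV heads: the K/V projections are half-width.
x = torch.randn(2, 6, 8)
out = grouped_query_attention(x, torch.randn(8, 8), torch.randn(8, 4),
                              torch.randn(8, 4), n_heads=4, n_kv_heads=2)
```

The KV parameter savings fall directly out of the narrower `wk`/`wv` projections.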
Real-time token analytics

crux

Terminal dashboard for Claude Code usage. Tracks context growth, cache efficiency, cost breakdowns, and session health grades from local session logs.

rust · tui · claude-code · open-source
Problem: Claude Code users have no visibility into token usage patterns. Sessions vary wildly in cost and efficiency with no way to diagnose why.
Approach: Rust TUI that reads Claude Code JSONL session logs in real time. Computes rolling statistics, cache hit ratios, context growth rates, and assigns health grades per session.
Stack: Rust, ratatui, tokio, serde, MCP server
Outcome: Open-source, published to GitHub with CI/CD and cross-platform releases. Reveals usage patterns invisible from the Claude Code interface.
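The aggregation idea is simple: fold per-message token counts from the JSONL log into session-level ratios. A Python sketch of the concept (the field names `input_tokens` and `cache_read_tokens` are illustrative, not the actual Claude Code log schema, and the real tool is in Rust):

```python
import json

def session_stats(jsonl_lines):
    """Fold JSONL log records into session-level token totals
    and a cache hit ratio. Field names are hypothetical."""
    total_in = cache_read = 0
    for line in jsonl_lines:
        rec = json.loads(line)
        total_in += rec.get("input_tokens", 0)
        cache_read += rec.get("cache_read_tokens", 0)
    ratio = cache_read / total_in if total_in else 0.0
    return {"input_tokens": total_in, "cache_hit_ratio": ratio}
```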
Compare 4 encodings

tokenizer-arena

CLI tool that runs the same text through multiple LLM tokenizers and shows differences in token count, compression ratio, and token boundaries.

rust · tokenizers · cli · open-source
Problem: Tokenizer choice affects training efficiency, inference cost, and multilingual performance, but comparing tokenizers requires writing custom scripts each time.
Approach: Wraps tiktoken-rs to compare cl100k_base (GPT-4/Claude), o200k_base (GPT-4o), p50k_base, and r50k_base side by side. Color-coded token boundary visualization.
Stack: Rust, tiktoken-rs, comfy-table, clap
Outcome: Shows newer tokenizers are ~25% more efficient on code. Works fully offline. JSON output for scripting.
Parse models byte by byte

gguf-inspect

CLI that reads GGUF model files and prints architecture details, quantization info, tensor shapes, and memory estimates. Hand-rolled binary parser, no external GGUF crates.

rust · gguf · model-internals · open-source
Problem: Understanding what is inside a quantized model file requires reading llama.cpp source or scattered documentation. No simple tool shows the full picture.
Approach: From-scratch GGUF binary parser in Rust. Reads the header, metadata KV pairs, and tensor info. Computes parameter counts, memory estimates, and identifies quantization schemes per layer.
Stack: Rust, byteorder, comfy-table, clap
Outcome: Tested on Llama 3.2 3B. Reveals mixed-precision quantization (Q6_K embeddings, Q4_K attention), GQA head ratios from tensor shapes, and 6.7x compression from FP32.
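The fixed GGUF header is small enough to show in full. A Python sketch of the same parse the Rust tool does (magic, version, tensor count, metadata KV count, all little-endian), run here against a synthetic header rather than a real model file:

```python
import struct

GGUF_MAGIC = 0x46554747  # the bytes b"GGUF" read as a little-endian uint32

def parse_gguf_header(buf):
    """Parse the fixed 24-byte GGUF header that precedes the
    metadata KV pairs and tensor info."""
    magic, version = struct.unpack_from("<II", buf, 0)
    if magic != GGUF_MAGIC:
        raise ValueError("not a GGUF file")
    n_tensors, n_kv = struct.unpack_from("<QQ", buf, 8)
    return {"version": version, "tensors": n_tensors, "metadata_kv": n_kv}

# Synthetic header for demonstration: version 3, 255 tensors, 24 KV pairs.
hdr = struct.pack("<IIQQ", GGUF_MAGIC, 3, 255, 24)
```

Everything after these 24 bytes (metadata values, tensor shapes, quantization type IDs) is what drives the parameter counts and memory estimates.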
Claude Code plugin

skill-doctor

Audits your Claude Code skills using the built-in claude-code-guide agent. Diagnoses issues, builds upgrades in a staging directory, and migrates with rollback.

claude-code · skills · tooling
Problem: Claude Code skills accumulate bad patterns: redundant context, missing tool restrictions, no invocation guards. Existing validators only check YAML syntax.
Approach: Reads all skills, agents, and CLAUDE.md. Sends each skill to the claude-code-guide agent for evaluation against current Claude Code docs. Consult mode maps findings to actual workflow pain points.
Stack: Claude Code plugin, Markdown skills
Outcome: Found and fixed issues in all 5 of my own skills. Saved ~2,500 tokens per session from context duplication alone.
10K+ daily queries

Healthcare RAG Pipeline

Production RAG system for clinicians querying medical records and PubMed. Built hallucination-aware retrieval with domain-specific re-ranking.

RAG · vertex ai · hallucination detection
Problem: Clinicians need accurate, grounded answers from medical records and research. Standard retrieval surfaces irrelevant context, leading to hallucinated outputs in clinical settings.
Approach: Retrieval-augmented generation on Vertex AI with hallucination-aware retrieval. Domain-specific re-ranking ensures clinical relevance. Measures retrieval fidelity across query types.
Stack: Vertex AI, Gemini Pro, LangChain, LlamaIndex, Python
Outcome: Serves 10,000+ daily clinical queries. Developed methodology for measuring retrieval fidelity and hallucination rates in domain-specific RAG.
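A toy illustration of the domain-specific re-ranking idea: boost retrieved passages that share clinical vocabulary with the query. The scoring weights and structure are invented for illustration, not the production scorer:

```python
def rerank(query, candidates, domain_terms):
    """Illustrative domain-aware re-ranker: combine the retriever's
    score with query-term overlap and clinical-vocabulary hits.
    Weights (0.5, 0.25) are arbitrary for the sketch."""
    q_terms = set(query.lower().split())

    def score(c):
        text = set(c["text"].lower().split())
        overlap = len(q_terms & text)       # lexical match with the query
        domain = len(text & domain_terms)   # clinical vocabulary hits
        return c["retriever_score"] + 0.5 * overlap + 0.25 * domain

    return sorted(candidates, key=score, reverse=True)
```

In a production system the retriever score would come from dense embeddings and the domain signal from a learned model, but the combination-then-sort shape is the same.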
5 entity types, cross-format search

Medical Knowledge Graph

Knowledge graph connecting diseases, drugs, treatments, and clinical trials. Powers retrieval grounding for the RAG system.

neo4j · knowledge graphs · embeddings
Problem: Medical knowledge is scattered across databases, papers, and records. Flat storage loses entity relationships critical for accurate retrieval.
Approach: Graph schema mapping diseases, treatments, drugs, clinical trials, and expert opinions. Multi-modal embeddings for cross-format querying. Provides structured grounding for RAG retrieval.
Stack: Neo4j, Elasticsearch, Python, sentence-transformers
Outcome: Powers the retrieval grounding layer. Enables entity-aware queries across text, structured data, and metadata.
25% lower inference cost

LLM Fine-tuning System

Fine-tuning Qwen2.5-3B with GRPO + LoRA. Studied how reward-guided optimization changes model behavior at the output distribution level.

LLM · fine-tuning · GRPO · vLLM
Problem: Off-the-shelf models underperform on domain tasks. Understanding how fine-tuning alters model behavior requires both training infrastructure and evaluation methodology.
Approach: 4-bit quantization with LoRA adapters for memory-efficient training. GRPO for reward-guided optimization. vLLM for high-throughput inference with auto-scaling.
Stack: PyTorch, Hugging Face, LoRA, vLLM, Python
Outcome: Deployed fine-tuned 3B parameter model to production, reducing inference costs by 25%. Developed evaluation pipeline for measuring behavioral shift pre/post fine-tuning.
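The memory savings of the LoRA approach come down to simple low-rank arithmetic: instead of updating a d×k weight, train two thin factors B and A and merge them scaled by α/r. A numpy sketch of that math (dimensions illustrative, and GRPO itself is not shown):

```python
import numpy as np

d, k, r = 64, 64, 8                 # layer dims and LoRA rank (illustrative)
rng = np.random.default_rng(0)
W = rng.normal(size=(d, k))          # frozen base weight
A = rng.normal(size=(r, k)) * 0.01   # trainable down-projection
B = np.zeros((d, r))                 # B starts at zero, so no initial drift
alpha = 16                           # LoRA scaling numerator

# Effective weight after merging the adapter into the base layer.
W_eff = W + (alpha / r) * B @ A

# The trainable-parameter savings that make 4-bit + LoRA fit in memory.
full_params = d * k                  # 4096 if the full weight were trained
lora_params = r * (d + k)            # 1024 for the two adapter factors
```

At rank 8 on a 64×64 layer the adapter trains 4x fewer parameters; on real 3B-model projection matrices the ratio is far larger.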
early warning signals

Drug Safety Sentiment Analysis

Transformer classifiers detecting sentiment shifts in medical expert opinions to flag safety signals for pharmaceutical products.

NLP · safety signals · healthcare
Problem: Pharmaceutical safety teams need to detect early warning signs in expert opinions before adverse events escalate. Manual review doesn't scale.
Approach: Fine-tuned transformer classifiers on domain-specific medical language. Tracks sentiment shifts per drug per condition over time. Measures classifier reliability against pharmacovigilance ground truth.
Stack: PyTorch, Hugging Face Transformers, spaCy, Python
Outcome: Surfaces drugs with increasing negative sentiment for safety review. Integrated into existing pharmacovigilance workflows.
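Once per-opinion sentiment scores exist, the "shift over time" signal can be as simple as comparing a recent window against the prior baseline. An illustrative heuristic, not the production signal logic:

```python
def flag_sentiment_shift(scores, window=3, threshold=-0.2):
    """Flag a drug when mean sentiment over the latest `window`
    scores drops below the earlier baseline by more than
    `threshold`. Window and threshold are illustrative."""
    if len(scores) < 2 * window:
        return False                 # not enough history to compare
    recent = sum(scores[-window:]) / window
    baseline = sum(scores[:-window]) / len(scores[:-window])
    return (recent - baseline) < threshold
```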
5,000+ SKUs

NL-to-Insights

Ask a question in plain English about your inventory, get back a chart.

RAG · GPT · plotly
Problem: Warehouse managers need inventory analytics but can't write SQL. Analysts become bottlenecks for routine data questions.
Approach: Natural language to SQL translation via GPT-3.5 Turbo. Queries execute against inventory and sales databases. Results render as interactive Plotly charts.
Stack: GPT-3.5 Turbo, Plotly, PostgreSQL, Python, FastAPI
Outcome: Used for demand forecasting across 5,000+ SKUs. Eliminated analyst dependency for standard inventory queries.
2TB+ processed, 50+ pipelines

EDI Pipelines

Data pipelines moving business data between ERP systems, warehouses, and reporting tools.

azure · databricks · ETL
Problem: Business partners exchange data via EDI, a standardized-in-theory format that varies wildly in practice.
Approach: Azure Data Factory orchestrates ingestion and routing. Databricks handles transformation, schema normalization, and quality checks.
Stack: Azure Data Factory, Databricks, Azure Data Lake, Python, SQL
Outcome: Processed over 2TB of transaction data. Connected ERP systems, warehouses, and reporting tools into a single pipeline.
90%+ accuracy, 50% less downtime

Predictive Maintenance

Digital twin platform for industrial machinery. Anomaly detection and time-to-failure prediction from IoT sensor streams.

anomaly detection · time series · digital twins
Problem: Unplanned machine downtime is expensive. Raw sensor data exists but isn't used for proactive failure prediction.
Approach: Signal processing pipeline ingesting IoT sensor streams. Feature extraction, anomaly detection, and time-to-failure models trained on historical failure data.
Stack: Python, scikit-learn, Azure Data Lake, Power BI
Outcome: 90%+ anomaly detection accuracy. 95% time-to-failure prediction accuracy. Reduced machine downtime by 50% across deployed sites.
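A common baseline for sensor-stream anomaly detection of the kind described above is a rolling z-score: flag any reading that deviates from its trailing window by more than a few standard deviations. A minimal sketch (window and threshold are illustrative, not the deployed model, which also used learned features):

```python
import numpy as np

def rolling_zscore_anomalies(x, window=20, z=3.0):
    """Flag points more than `z` standard deviations from the
    trailing-window mean -- a simple streaming anomaly baseline."""
    x = np.asarray(x, dtype=float)
    flags = np.zeros(len(x), dtype=bool)
    for i in range(window, len(x)):
        mu = x[i - window:i].mean()
        sd = x[i - window:i].std()
        if sd > 0 and abs(x[i] - mu) > z * sd:
            flags[i] = True
    return flags
```

In practice the same windowed statistics feed the time-to-failure models as features alongside frequency-domain signals.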