2025 · Solo build
Groundtruth — eval harness for retrieval pipelines
A reproducible eval harness for RAG systems: golden questions, citation-level scoring, and regression detection across model and index changes.
- Stack: Python / DuckDB / Anthropic API / OpenAI API
- Code: Repository ↗
Built after one too many silent regressions caused by an index change three weeks earlier. Groundtruth separates the eval set from the model, scores at the citation level rather than the answer level, and gives you a one-line diff between any two runs.
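A minimal sketch of the two core ideas, citation-level scoring and a one-line run diff. All names here (`Golden`, `citation_score`, `diff_runs`) are illustrative, not Groundtruth's actual API:

```python
# Hypothetical sketch based on the description above: score each golden
# question by which chunks were cited, then diff two runs in one line.
from dataclasses import dataclass

@dataclass(frozen=True)
class Golden:
    question: str
    expected_citations: frozenset  # chunk/doc IDs the answer must cite

def citation_score(golden: Golden, cited: set) -> float:
    # F1 over cited chunk IDs: rewards citing the right evidence,
    # independent of answer wording.
    expected = golden.expected_citations
    if not expected and not cited:
        return 1.0
    tp = len(expected & cited)
    if tp == 0:
        return 0.0
    precision = tp / len(cited)
    recall = tp / len(expected)
    return 2 * precision * recall / (precision + recall)

def diff_runs(run_a: dict, run_b: dict) -> str:
    # One-line diff between two runs, keyed by question -> score.
    improved = sum(1 for q in run_a if run_b.get(q, 0.0) > run_a[q])
    regressed = sum(1 for q in run_a if run_b.get(q, 0.0) < run_a[q])
    mean_a = sum(run_a.values()) / len(run_a)
    mean_b = sum(run_b.values()) / len(run_b)
    return f"mean {mean_a:.3f} -> {mean_b:.3f} | {improved} improved, {regressed} regressed"
```

Because the score depends only on cited IDs, the same golden set can be replayed against any model or index version, which is what makes the run-to-run diff meaningful.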