Hande Kafkas


2025 · Solo build

Groundtruth — eval harness for retrieval pipelines

A reproducible eval harness for RAG systems: golden questions, citation-level scoring, and regression detection across model and index changes.

Stack
Python / DuckDB / Anthropic API / OpenAI API
Code
Repository ↗

Built after one too many silent regressions caused by an index change three weeks earlier. Groundtruth separates the eval set from the model, scores at the citation level rather than the answer level, and gives you a one-line diff between any two runs.
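The two core ideas can be sketched in a few lines. This is an illustrative sketch, not Groundtruth's actual API: the function names, the `"docN#pM"` citation-id format, and the per-question score maps are all assumptions. Citation-level scoring compares the citation ids a pipeline returned against the golden set for each question, and the run diff collapses two score maps into a single line.

```python
def citation_scores(golden: set[str], retrieved: set[str]) -> tuple[float, float]:
    """Precision and recall over citation ids, not answer text."""
    hits = len(golden & retrieved)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(golden) if golden else 0.0
    return precision, recall


def run_diff(run_a: dict[str, float], run_b: dict[str, float], tol: float = 0.01) -> str:
    """One-line summary of how per-question scores moved between two runs."""
    regressed = sum(1 for q in run_a if run_b.get(q, 0.0) < run_a[q] - tol)
    improved = sum(1 for q in run_a if run_b.get(q, 0.0) > run_a[q] + tol)
    unchanged = len(run_a) - regressed - improved
    return f"{regressed} regressed, {improved} improved, {unchanged} unchanged"


# The answer text can read fine while a supporting citation is missing:
print(citation_scores({"doc3#p2", "doc7#p1"}, {"doc3#p2", "doc9#p4"}))  # (0.5, 0.5)

# Same golden set, two runs either side of an index change:
print(run_diff({"q1": 0.90, "q2": 0.80}, {"q1": 0.60, "q2": 0.85}))
# → 1 regressed, 1 improved, 0 unchanged
```

The point of scoring sets of citation ids rather than answer strings is that an index change which swaps a correct source for a plausible paraphrase still shows up as a regression, even when the generated answer looks unchanged.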