Feature request: Add evaluation metric for comparing different approaches
garrytan/gbrain
**What problem does this solve?**
There are several candidate improvements to gbrain, such as adding reranking or switching embedding models, but currently we have no way to measure search quality. Hybrid search (vector + keyword + RRF), 4-layer dedup, and multi-query expansion all have tunable parameters (RRF K=60, cosine dedup threshold=0.85, type ratio cap=0.6, etc.), but changes are evaluated by vibes only. We also can't compare embedding models (e.g. `text-embedding-3-large` vs `gemini-embedding-preview-2` vs `qwen3-embedding:8b`) or validate whether adding reranking actually improves results.
**What does the solution look like?**
An evaluation harness with:
- **Ground truth format/interface**: a simple schema (JSON/YAML) for users to define query → expected relevant slugs pairs against their own brain data. Not a bundled dataset — each brain is different.
- **Retrieval metrics**: nDCG@k, MRR, Precision@k, Recall@k computed against user-provided ground truth
- **A/B comparison mode**: run the same query set against two configurations (different embeddings, with/without reranking, different RRF K values) and output a diff table
- **CLI command**: `gbrain eval` that runs the benchmark suite and prints a summary
- New operation in `operations.ts` so it's available via both CLI and MCP
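
To make the metrics and A/B mode concrete, here is a rough sketch of how they could be computed. It assumes binary relevance (a slug is either in the expected set or not) and measures reciprocal rank within the top k only. The names `GroundTruthCase`, `Metrics`, `scoreQuery`, `evaluate`, and `diffMetrics` are hypothetical and not existing gbrain APIs. The `search` callback is a synchronous stand-in for whatever the real search operation returns.

```typescript
// Hypothetical ground-truth entry: a query plus the slugs the user judges relevant.
interface GroundTruthCase {
  query: string;
  relevant: string[];
}

interface Metrics {
  precision: number; // Precision@k
  recall: number;    // Recall@k
  mrr: number;       // reciprocal rank of the first relevant hit within the top k
  ndcg: number;      // nDCG@k with binary gains
}

// Score one ranked list of slugs (best first) against the relevant set.
// Assumes `ranked` contains no duplicate slugs.
function scoreQuery(ranked: string[], relevant: string[], k: number): Metrics {
  const rel = new Set(relevant);
  const topK = ranked.slice(0, k);
  const hits = topK.filter((slug) => rel.has(slug)).length;
  const firstHit = topK.findIndex((slug) => rel.has(slug));

  // DCG with binary gains: a relevant result at 1-based rank r contributes 1 / log2(r + 1).
  let dcg = 0;
  topK.forEach((slug, i) => {
    if (rel.has(slug)) dcg += 1 / Math.log2(i + 2);
  });

  // Ideal DCG: all relevant slugs packed at the top, capped at k positions.
  let idcg = 0;
  for (let i = 0; i < Math.min(k, rel.size); i++) {
    idcg += 1 / Math.log2(i + 2);
  }

  return {
    precision: k > 0 ? hits / k : 0,
    recall: rel.size > 0 ? hits / rel.size : 0,
    mrr: firstHit >= 0 ? 1 / (firstHit + 1) : 0,
    ndcg: idcg > 0 ? dcg / idcg : 0,
  };
}

// Average the per-query metrics over the whole ground-truth set.
function evaluate(
  cases: GroundTruthCase[],
  search: (query: string) => string[], // hypothetical: returns ranked slugs
  k: number,
): Metrics {
  const total: Metrics = { precision: 0, recall: 0, mrr: 0, ndcg: 0 };
  for (const c of cases) {
    const m = scoreQuery(search(c.query), c.relevant, k);
    total.precision += m.precision;
    total.recall += m.recall;
    total.mrr += m.mrr;
    total.ndcg += m.ndcg;
  }
  const n = Math.max(cases.length, 1);
  return {
    precision: total.precision / n,
    recall: total.recall / n,
    mrr: total.mrr / n,
    ndcg: total.ndcg / n,
  };
}

// A/B comparison: per-metric delta of configuration B over configuration A.
function diffMetrics(a: Metrics, b: Metrics): Metrics {
  return {
    precision: b.precision - a.precision,
    recall: b.recall - a.recall,
    mrr: b.mrr - a.mrr,
    ndcg: b.ndcg - a.ndcg,
  };
}
```

Running `evaluate` once per configuration over the same ground-truth set and passing both results to `diffMetrics` would give the rows of the diff table. Graded relevance (e.g. a 0 to 3 score per slug) would change the gain term in the DCG loop but not the overall shape.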
**Follow-up issues (out of scope here):**
- DSPy integration — use eval metrics as optimization objectives for automatic prompt tuning
- Reranking — cross-encoder reranking ...