Feature request: Add evaluation metric for comparing different approaches

garrytan/gbrain

Status: Open

Opened: Apr 11, 2026

**What problem does this solve?**

There are several ways gbrain could still be improved, such as adding reranking or choosing a better-suited embedding model, but we currently have no way to measure search quality. Hybrid search (vector + keyword + RRF), 4-layer dedup, and multi-query expansion all have tunable parameters (RRF K=60, cosine dedup threshold=0.85, type ratio cap=0.6, etc.), but changes are evaluated by vibes only. We also can't compare embedding models (e.g. `text-embedding-3-large` vs `gemini-embedding-preview-2` vs `qwen3-embedding:8b`) or validate whether adding reranking actually improves results.

**What does the solution look like?**

An evaluation harness with:

- **Ground truth format/interface**: a simple schema (JSON/YAML) for users to define query → expected relevant slugs pairs against their own brain data. Not a bundled dataset, since each brain is different.
- **Retrieval metrics**: nDCG@k, MRR, Precision@k, Recall@k computed against user-provided ground truth
- **A/B comparison mode**: run the same query set against two configurations (different embeddings, with/without reranking, different RRF K values) and output a diff table
- **CLI command**: `gbrain eval` that runs the benchmark suite and prints a summary
- New operation in `operations.ts` so it's available via both CLI and MCP

**Follow-up issues (out of scope here):**

- DSPy integration: use eval metrics as optimization objectives for automatic prompt tuning
- Reranking: cross-encoder reranking ...
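
To make the proposed metrics and A/B mode concrete, here is a rough sketch of how they might be computed. Everything named here (`GroundTruthCase`, `SearchFn`, `scoreQuery`, `evaluate`, `compare`) is hypothetical and not part of gbrain today. It assumes binary relevance (a slug is either expected or not), a synchronous search function that returns unique slugs best-first, and a single cutoff `k`. A real implementation would probably be async and go through `operations.ts`.

```typescript
// Hypothetical ground-truth entry: one query plus the slugs judged relevant.
interface GroundTruthCase {
  query: string;
  relevant: string[];
}

interface Metrics {
  precision: number; // Precision@k
  recall: number;    // Recall@k
  mrr: number;       // reciprocal rank per query; MRR once averaged
  ndcg: number;      // nDCG@k with binary gains
}

// Assumed shape of a search configuration: query in, ranked unique slugs out.
type SearchFn = (query: string) => string[];

// Score one ranked result list against the expected slugs.
function scoreQuery(ranked: string[], relevant: string[], k: number): Metrics {
  const rel = new Set(relevant);
  const topK = ranked.slice(0, k);
  const hits = topK.filter((slug) => rel.has(slug)).length;

  // Reciprocal rank of the first relevant slug in the top k (0 if none).
  const firstHit = topK.findIndex((slug) => rel.has(slug));
  const mrr = firstHit === -1 ? 0 : 1 / (firstHit + 1);

  // DCG with binary gains: 1 / log2(rank + 1) for each relevant position.
  let dcg = 0;
  topK.forEach((slug, i) => {
    if (rel.has(slug)) dcg += 1 / Math.log2(i + 2);
  });

  // Ideal DCG: every relevant slug ranked at the top, capped at k.
  let idcg = 0;
  for (let i = 0; i < Math.min(rel.size, k); i++) {
    idcg += 1 / Math.log2(i + 2);
  }

  return {
    precision: k > 0 ? hits / k : 0,
    recall: rel.size > 0 ? hits / rel.size : 0,
    mrr,
    ndcg: idcg > 0 ? dcg / idcg : 0,
  };
}

// Average the per-query metrics over the whole query set.
function evaluate(cases: GroundTruthCase[], search: SearchFn, k: number): Metrics {
  const total: Metrics = { precision: 0, recall: 0, mrr: 0, ndcg: 0 };
  for (const c of cases) {
    const m = scoreQuery(search(c.query), c.relevant, k);
    total.precision += m.precision;
    total.recall += m.recall;
    total.mrr += m.mrr;
    total.ndcg += m.ndcg;
  }
  const n = Math.max(cases.length, 1); // avoid dividing by zero on an empty set
  return {
    precision: total.precision / n,
    recall: total.recall / n,
    mrr: total.mrr / n,
    ndcg: total.ndcg / n,
  };
}

// A/B mode: metric deltas of configuration B relative to configuration A.
function compare(cases: GroundTruthCase[], a: SearchFn, b: SearchFn, k: number): Metrics {
  const ma = evaluate(cases, a, k);
  const mb = evaluate(cases, b, k);
  return {
    precision: mb.precision - ma.precision,
    recall: mb.recall - ma.recall,
    mrr: mb.mrr - ma.mrr,
    ndcg: mb.ndcg - ma.ndcg,
  };
}
```

A positive delta from `compare` means configuration B scored higher on that metric for the given query set. Graded relevance (e.g. "must-have" vs "nice-to-have" slugs) would need a different gain term in the DCG sum.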

TypeScript
