Hamel Husain's guide to creating #LLM-as-a-judge systems is the best I’ve ever seen.
Ada uses one of these in production to evaluate whether transcripts were resolved (and, if not, why). We relied on domain experts to validate that it works, but we didn’t use the scenario, feature, and persona dimensions. Those dimensions are a very clever way to reduce blind spots in test coverage.
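
For anyone curious what those dimensions could look like in practice, here is a minimal sketch. The feature, scenario, and persona values are made up for illustration (they come from neither the guide nor Ada's setup), and `build_coverage_grid` is a hypothetical helper. It just enumerates every combination so that each one can seed a test case:

```python
from itertools import product

# Hypothetical dimension values, for illustration only.
FEATURES = ["order tracking", "refund request"]
SCENARIOS = ["single clear request", "ambiguous request", "out-of-scope request"]
PERSONAS = ["first-time customer", "frustrated repeat customer"]


def build_coverage_grid(features, scenarios, personas):
    """Return one dict per (feature, scenario, persona) combination."""
    return [
        {"feature": f, "scenario": s, "persona": p}
        for f, s, p in product(features, scenarios, personas)
    ]


grid = build_coverage_grid(FEATURES, SCENARIOS, PERSONAS)
print(len(grid))  # 2 * 3 * 2 = 12 combinations
```

Listing the combinations explicitly makes it easier to spot which ones have no test transcript yet, which is the blind-spot check the dimensions are meant to support.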