Posts about evals

Only one LLM is good at chess

Explores how different LLMs perform at chess: most fail, with turbo-instruct the one exception. Discusses how tuning and training may influence the results.


Creating an LLM-as-a-Judge

Just found an incredible guide by Hamel Husain on building an LLM-as-a-judge! Super insightful, especially since we’re using a similar system at Ada to evaluate transcript resolutions. Excited about how smartly it avoids blind spots in test coverage!


Working Probabilistically

Explores why thinking probabilistically matters when working with LLMs, with insights on effective eval methodologies, the quirks of model behavior, and practical tips for building robust evaluation processes that go beyond traditional testing.