Working Probabilistically

I was chatting with a co-worker today about how powerful evals are for code that relies on #LLMs, and how we both felt evals are still not taken as seriously as they should be. He said, “we have to get engineers to think probabilistically, rather than functionally”, by which I think he meant ‘deterministically’.

So, I largely agree, and in the past I was guilty of the common pitfall of building on #LLMs and convincing myself that every possible output must be as good as the handful I observed locally. It’s just not the case with LLMs. I still get strange surprises from flagship models like GPT-4o:

> Well, this is a new one. GPT-4o thinks the response was "poetic in nature" ... Hm ... pic.twitter.com/TBCCgl1NrQ
>
> — David Hariri (@davehariri) October 15, 2024

So, thinking probabilistically while building on #LLMs means, I think, the following:

  1. Knowing that traditional tests of functions that rely on an LLM generation are not very useful (and slow ... and expensive ... if run in CI)
  2. Functions that rely on LLMs can exhibit kooky behaviour. Constrain what is a valid generation as much as possible with structured output. Pydantic + OpenAI’s new parse method is great for that:
from typing import Optional

from openai import OpenAI
from pydantic import BaseModel

openai_client = OpenAI()


class SemanticIntentEvaluation(BaseModel):
    passed: bool
    reason: Optional[str] = None


# response_content is the generation under test; expectations_str lists the
# expectations for this eval case (both assumed to be defined elsewhere).
response = openai_client.beta.chat.completions.parse(
    model="gpt-4o-mini",
    response_format=SemanticIntentEvaluation,
    messages=[
        {
            "role": "system",
            "content": f"""
            Evaluate the following customer service response:

            <response>{response_content}</response>

            Does the response meet the following expectations:
            - {expectations_str}
            """,
        }
    ],
    temperature=0,
    top_p=1,
)

evaluation_result = response.choices[0].message.parsed
  3. More often than not, the goal state with #evals is not 100%. Unlike with a test suite, not every eval must score perfectly for the system to be in a production-ready state. In fact, 100% performance might be a sign that an eval set is not comprehensive or ambitious enough (there’s a sketch of a pass-rate gate after this list).
  4. Using human preferences as ground truth is still the gold standard (as far as I know) and an enduring way to evaluate system performance. If you can’t afford that, using slow, smart LLMs like o1-preview to create evals synthetically is also a good option (sketched below).
  5. Comparing strings is a common need for evals. You can embed both strings and use semantic distance and a threshold, or you can use another LLM to compare the generation to your eval’s ground truth and give it a score. I also sometimes want to ensure key strings are quoted from source material in generations, so I supply a list of strings which must be present in the generation and evaluate that deterministically (both checks are sketched below).
  6. Building tooling to quickly create evals when bugs are found is a good investment in the likelihood that your team keeps extending the eval set (a small sketch of that is below too).
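
To make the “not 100%” point concrete, here’s a minimal sketch of gating on an aggregate pass rate rather than requiring every eval to pass. The 90% threshold is an arbitrary illustration, not a recommendation, and SemanticIntentEvaluation is the model defined earlier:

def eval_suite_passes(results: list[SemanticIntentEvaluation], threshold: float = 0.9) -> bool:
    # Gate on an aggregate pass rate instead of demanding 100% like a unit test suite.
    pass_rate = sum(r.passed for r in results) / len(results)
    print(f"pass rate: {pass_rate:.0%} (threshold: {threshold:.0%})")
    return pass_rate >= threshold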
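
For creating evals synthetically, here’s a sketch of drafting candidate cases with a slower, more capable model. The prompt is an illustrative assumption, and I’d still review the output by hand before it lands in the eval set, since the cases are themselves generations:

draft = openai_client.chat.completions.create(
    model="o1-preview",
    messages=[
        {
            "role": "user",
            "content": (
                "Write 10 tricky customer service scenarios. For each, include the "
                "customer's message, a plausible agent response, and the expectations "
                "a good response should meet."
            ),
        }
    ],
)

# Review and curate before adding these cases to the eval set.
print(draft.choices[0].message.content)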
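
For string comparison, here’s a sketch of both checks: cosine similarity between embeddings against a threshold, and a deterministic check that key strings are quoted verbatim. The 0.8 threshold and text-embedding-3-small are illustrative choices:

import numpy as np

def semantically_close(generated: str, ground_truth: str, threshold: float = 0.8) -> bool:
    # Embed both strings and compare them with cosine similarity.
    embeddings = openai_client.embeddings.create(
        model="text-embedding-3-small",
        input=[generated, ground_truth],
    )
    a = np.array(embeddings.data[0].embedding)
    b = np.array(embeddings.data[1].embedding)
    similarity = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return similarity >= threshold

def quotes_present(generated: str, required_quotes: list[str]) -> bool:
    # Deterministic check that required strings from the source material appear verbatim.
    return all(quote in generated for quote in required_quotes)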
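
And for capturing bugs as evals, a tiny sketch of the kind of tooling I mean: a helper teammates can call the moment a bad generation is found. The JSONL path and field names here are arbitrary assumptions:

import json
from pathlib import Path

EVAL_SET_PATH = Path("evals/semantic_intent.jsonl")  # hypothetical location for the eval set

def capture_eval_case(response_content: str, expectations: list[str], note: str = "") -> None:
    # Append the failing case to the eval set so it gets re-checked on every future run.
    EVAL_SET_PATH.parent.mkdir(parents=True, exist_ok=True)
    case = {"response_content": response_content, "expectations": expectations, "note": note}
    with EVAL_SET_PATH.open("a") as f:
        f.write(json.dumps(case) + "\n")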