Working Probabilistically
I was chatting with a co-worker today about how powerful evals are for code that relies on #LLMs, and how we both felt evals are still not taken as seriously as they should be. He said “we have to get engineers to think probabilistically, rather than functionally”, by which I think he meant ‘deterministically’.
So, I largely agree. In the past I was guilty of the common pitfall of building on #LLMs and convincing myself that all possible states must be as good as the handful I observed locally. It’s just not the case with LLMs. I still get strange surprises from flagship models like GPT-4o:
Well, this is a new one. GPT-4o thinks the response was "poetic in nature" ... Hm ... pic.twitter.com/TBCCgl1NrQ
— David Hariri (@davehariri) October 15, 2024
So, thinking probabilistically while building on #LLMs is, I think, the following:
- Knowing that traditional tests of functions that rely on an LLM generation are not very useful (and slow … and expensive … if run in CI)
- Functions that rely on LLMs can exhibit kooky behaviour. Constrain what is a valid generation as much as possible with structured output.
Pydantic + OpenAI’s new parse method is great for that:
from typing import Optional

from openai import OpenAI
from pydantic import BaseModel

openai_client = OpenAI()

# The model can only return something that validates against this schema.
class SemanticIntentEvaluation(BaseModel):
    passed: bool
    reason: Optional[str] = None

# response_content and expectations_str come from the eval case being checked.
response = openai_client.beta.chat.completions.parse(
    model="gpt-4o-mini",
    response_format=SemanticIntentEvaluation,
    messages=[
        {
            "role": "system",
            "content": f"""
            Evaluate the following customer service response:
            <response>{response_content}</response>
            Does the response meet the following expectations:
            - {expectations_str}
            """,
        }
    ],
    temperature=0,
    top_p=1,
)

evaluation_result = response.choices[0].message.parsed  # a SemanticIntentEvaluation instance
- More often than not, the goal state with #evals is not 100%. Unlike with a test suite, not every eval must score perfectly for the system to be in a production-ready state. In fact, 100% performance might be a sign that an eval set is not comprehensive or ambitious enough. In practice that means gating on a pass rate rather than on every case, something like the sketch below.
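A minimal sketch of that kind of gate; eval_cases, run_eval and the 90% target are hypothetical placeholders for your own harness, not a fixed recipe:

# Treat the eval suite as a pass rate, not an all-or-nothing gate.
PASS_RATE_TARGET = 0.90  # a per-project number, not a universal goal

results = [run_eval(case) for case in eval_cases]  # each returns True or False
pass_rate = sum(results) / len(results)

print(f"Eval pass rate: {pass_rate:.1%} (target {PASS_RATE_TARGET:.0%})")
if pass_rate < PASS_RATE_TARGET:
    raise SystemExit("Below target pass rate; investigate before shipping.")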
- Using human preferences as ground truth is still the gold standard (as far as I know) and an enduring way to evaluate system performance. If you can’t afford that, using slow, smart LLMs like o1-preview to draft evals synthetically is also a good option (sketched below).
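A rough sketch of that; the prompt wording and the JSON-lines output format are my own assumptions, and the drafted cases should still be reviewed by a person:

draft = openai_client.chat.completions.create(
    model="o1-preview",  # a slower, stronger model; swap in whichever you trust most
    messages=[
        {
            "role": "user",
            "content": (
                "You are helping build an eval set for a customer service assistant. "
                "Write 10 tricky customer messages, one JSON object per line, "
                'each shaped like {"input": "...", "expectations": ["..."]}.'
            ),
        }
    ],
)

# One candidate eval case per line; review by hand before adding to the set.
candidate_cases = draft.choices[0].message.content.splitlines()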
- Comparing strings is a common need for evals. You can embed both strings and compare their semantic distance against a threshold, or you can use another LLM to compare the generation to your eval’s ground truth and give it a score. I also sometimes want to ensure key strings are quoted verbatim from source material, so I supply a list of strings which must be present in the generation and evaluate that deterministically. Both checks are sketched below.
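A sketch of both checks; the embedding model and the 0.8 threshold are assumptions you would tune on your own data:

import numpy as np

def semantically_close(generation: str, ground_truth: str, threshold: float = 0.8) -> bool:
    # Embed both strings and compare cosine similarity against a threshold.
    emb = openai_client.embeddings.create(
        model="text-embedding-3-small",
        input=[generation, ground_truth],
    )
    a = np.array(emb.data[0].embedding)
    b = np.array(emb.data[1].embedding)
    cosine = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return cosine >= threshold

def quotes_required_strings(generation: str, required: list[str]) -> bool:
    # Deterministic check that key strings appear verbatim in the generation.
    return all(s in generation for s in required)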
- Building tooling to quickly create evals when bugs are found is a good investment in your team’s likelihood of extending the eval set. Even something as small as the helper sketched below lowers the friction.
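For example, a tiny, hypothetical helper that appends a failing case to a JSON-lines eval set; the file path and field names are assumptions to adapt to your own harness:

import json
from datetime import datetime, timezone

def add_eval_case(input_text: str, expectations: list[str], path: str = "evals/cases.jsonl") -> None:
    # Append one eval case to the JSON-lines file the eval runner reads.
    case = {
        "input": input_text,
        "expectations": expectations,
        "added_at": datetime.now(timezone.utc).isoformat(),
    }
    with open(path, "a") as f:
        f.write(json.dumps(case) + "\n")

# Call it the moment a bad generation is spotted, e.g.:
# add_eval_case("I was double charged", ["Apologizes", "Offers a refund or an escalation"])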