Using LLMs in Production

Will Larson just wrote about his mental models for using LLMs in production. I agree with much of it, particularly the re-framing of what LLMs can really do today for product developers.

On the Unsupervised Case...

Because you cannot rely on LLMs to provide correct responses, and you cannot generate a confidence score for any given response, you have to either accept potential inaccuracies (which makes sense in many cases, humans are wrong sometimes too) or keep a Human-in-the-Loop (HITL) to validate the response.

I only wish the post touched more on the unsupervised (no human in the loop) scenario. For many workflows, pairing an LLM with a human in the loop only marginally improves the workflow. Making systems autonomous is not just about accepting potential inaccuracies; it's also about accepting responsibility for driving them down. This is the really hard part of unsupervised LLM applications. You first have to educate customers on the trade-offs and risks they are taking, and then you have to build systems that drive those risks toward zero and optimize those trade-offs for value, so that customers become increasingly confident in the system.

Using Schemas in Prompts

A tactic that wasn't mentioned in the post is using JSONSchema within LLM prompts. This is a great way to make generations more accurate and more likely to meet your system's expectations.

You don't need to use JSONSchema if you don't want to. We have had good results from simply showing a few examples of desired output in the prompt and letting the LLM infer the schema from that.
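As a rough sketch of that examples-only approach, the snippet below builds a prompt from a couple of sample outputs rather than a formal schema. The `EXAMPLES` list, the `build_few_shot_prompt` helper, and the prompt wording are all hypothetical and only meant to illustrate the idea.

```python
import json

# Hypothetical hand-written examples of the output we want back.
EXAMPLES = [
    {"document_id": 101, "review": "good"},
    {"document_id": 102, "review": "bad"},
]


def build_few_shot_prompt(doc: dict, examples: list[dict]) -> str:
    """Build a prompt that shows example outputs instead of a formal schema."""
    example_lines = "\n".join(json.dumps(example) for example in examples)
    return (
        "Review the following <document> and reply with a single JSON object "
        "shaped like the examples.\n\n"
        f'<document id="{doc["id"]}">\n{doc["content"]}\n</document>\n\n'
        f"Example reviews:\n{example_lines}\n\n"
        "Your review:\n"
    )


prompt_text = build_few_shot_prompt(
    {"id": 1, "content": "This is a well formed sentence."}, EXAMPLES
)
```

The reply still needs to be parsed and checked on the way back, since nothing here forces the LLM to follow the examples.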

Here's a toy example of how you can use JSONSchema with an LLM prompt:

import json
import os

from openai import OpenAI
from pydantic import BaseModel, ValidationError
from typing import Literal

# Example docs
docs = [
    {
        'id': 1,
        'content': 'This is a well formed sentence that has no errors in grammar or spelling.'
    },
    {
        'id': 2,
        'content': 'This is an exampel sentence errors in grammer and speling.'
    }
]

# Define your schema for the LLM. We're using pydantic, but there are many options for this.
class DocumentReview(BaseModel):
    """A document review."""
    document_id: int
    review: Literal['good', 'bad']

# Make a prompt template that uses the schema. This is a plain string (not an
# f-string) so the placeholders are filled in later by str.format().
prompt = """Given the following <document>, please review the document and provide your review using the provided JSONSchema:

<document id="{doc[id]}">
{doc[content]}
</document>

JSONSchema:
{schema}

Your review:
"""

# This uses the client interface from openai>=1.0.
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {
            "role": "system",
            "content": prompt.format(
                doc=docs[0],
                # Serialize the schema dict to real JSON rather than a Python repr
                schema=json.dumps(DocumentReview.model_json_schema(), indent=2)
            )
        }
    ]
)

# Assuming the LLM returns a JSON string that fits our schema
try:
    review = DocumentReview.model_validate_json(response.choices[0].message.content.strip())
except ValidationError as e:
    print(f"Error validating schema: {e}")
    review = None

Handling ValidationError

You can handle the ValidationError by re-prompting the LLM with the error message and re-running the prompt.
You can also handle it by dropping down to a HITL, using a queue of documents for a human to review.
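Both options can be combined: retry a few times with the error fed back, then fall back to a human. Here is a minimal sketch of that flow using only the standard library. The `call_llm` callable, the `parse_review` stand-in validator, the `hitl_queue`, and the retry wording are all hypothetical names chosen for illustration.

```python
import json
from collections import deque
from typing import Callable, Optional

# Hypothetical in-memory queue of documents awaiting human review.
hitl_queue: deque = deque()


def parse_review(raw: str) -> dict:
    """Stand-in validator: raises ValueError when the reply is malformed."""
    data = json.loads(raw)  # json.JSONDecodeError is a ValueError subclass
    if not isinstance(data, dict) or data.get("review") not in ("good", "bad"):
        raise ValueError("review must be 'good' or 'bad'")
    return data


def review_with_retries(
    doc: dict,
    prompt: str,
    call_llm: Callable[[str], str],
    parse: Callable[[str], dict] = parse_review,
    max_attempts: int = 3,
) -> Optional[dict]:
    """Re-prompt with the validation error; fall back to a human on failure."""
    current_prompt = prompt
    for _ in range(max_attempts):
        raw = call_llm(current_prompt)
        try:
            return parse(raw.strip())
        except ValueError as error:
            # Feed the error back so the model can correct its own output.
            current_prompt = (
                f"{prompt}\n\nYour previous reply was:\n{raw}\n\n"
                f"It failed validation with this error:\n{error}\n\n"
                "Reply again with only valid JSON that matches the schema."
            )
    # Still invalid after every attempt: hand the document to a human.
    hitl_queue.append(doc)
    return None
```

With the pydantic model from the toy example, passing `parse=DocumentReview.model_validate_json` should slot in here, since pydantic's ValidationError is a ValueError subclass. In a real system the queue would likely be a durable store rather than an in-memory deque.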

Using schemas to validate generations allows you to ensure the data generated by an LLM at least matches your data types. In addition, if the LLM is referencing source material passed in the prompt (such as in a RAG architecture), you can ensure the document IDs referenced in generations at least match the source documents given. To improve this further, you can perform some semantic/string distance checks between the source documents' content and the generated output.
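As an illustration, the two checks might look something like the sketch below. It uses the standard library's `difflib` as a simple stand-in for a string distance measure; a semantic check would need embeddings and is not shown. The helper names and the sample document are hypothetical.

```python
from difflib import SequenceMatcher


def unknown_document_ids(referenced_ids: list[int], source_docs: list[dict]) -> set[int]:
    """Return IDs the generation cites that were never given to the LLM."""
    known_ids = {doc["id"] for doc in source_docs}
    return set(referenced_ids) - known_ids


def best_match_ratio(quote: str, source_text: str) -> float:
    """Similarity (0.0 to 1.0) between quote and its closest same-length window in source_text."""
    quote, source_text = quote.lower(), source_text.lower()
    if not quote:
        return 0.0
    if len(quote) >= len(source_text):
        return SequenceMatcher(None, quote, source_text).ratio()
    # Slide a window the size of the quote across the source text.
    return max(
        SequenceMatcher(None, quote, source_text[i:i + len(quote)]).ratio()
        for i in range(len(source_text) - len(quote) + 1)
    )


source_docs = [{"id": 1, "content": "The invoice is due on the first of March."}]
missing = unknown_document_ids([1, 7], source_docs)
score = best_match_ratio("due on the first of March", source_docs[0]["content"])
```

Here `missing` contains the ID 7, which was never supplied, and `score` is 1.0 because the quote appears verbatim in the source. Where to set the cutoff for a "close enough" score is a judgment call that depends on your documents.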

For more on using JSONSchema with LLM prompts, see this post from ThoughtBot.