Since last October, I’ve been working on evaluating LLMs in one specific way: how to tell whether the model “knows” something. That is, if you ask a question many different times, or in different ways, or of a version of the model that’s been fine-tuned so it won’t refuse to answer, will it return a particular fact?
In some ways, this is different from how you’d evaluate an LLM-based tool that you’re going to make available to your customers, like a chatbot that answers questions about a particular set of documents, or one that makes book recommendations, or schedules appointments.
For instance, I care more about “tail” behavior. If the model only spits out a particular set of facts once out of every hundred times it responds—or only if you ask it in a specific way—I need to know that! Whereas if you’re evaluating your chatbot, you’re likely more concerned with typical behavior, or how it’s going to answer most of the time. You may also care if your chatbot is rude or promising your customers discounts that it shouldn’t be.
But there’s one aspect of evaluation that we both need to figure out: how to automate evaluating text data. That is, how do we take an LLM’s output and, without manually reviewing it, assess how well the tool performed?
If you don’t read any further, the TL;DR here is: you can and should use LLMs to evaluate your LLM-based tools. You can write precise, multi-step rubrics for evaluation (which determine whether your tool is doing a good job), and you can write them more quickly, and get more accurate results, than with earlier approaches to text evaluation or comparison.
Why Automate Testing At All?
We should automate testing LLM-based tools like RAG chatbots for the same reasons we automate software testing in general. We’re going to make changes, fix things, maybe swap out the underlying model that our tools are built on (perhaps to see whether a new model performs as well as the one we’re currently using), and we want to make sure our changes don’t break our stuff. And we want to do this without having to run a bunch of ad hoc tests each time, asking the model questions manually and seeing how it does. That is, as with other kinds of software testing, we want tests we can run whenever we make a change (ideally ones that run automatically when we push our code) so that we’re easily alerted to issues. As the complexity of our tools grows, ad hoc testing becomes less feasible: it takes too long and it’s not reliable enough.
Challenges in Evaluating Generative Models
For some problems involving text data, model assessment is straightforward because we have, or can easily generate, “labeled” data, and because the tasks we’re trying to accomplish are well defined.
For instance, let’s say we have a classic binary classification problem, where we’re trying to assess whether email is spam or not. We have some data that’s labeled as being spam or not-spam, and we can assess our model by how well it performed at predicting the correct values. In this example, our model outputs “spam” or “not-spam”, meaning it’s either correct or incorrect, and we can evaluate different models by how frequently they are correct or by which types of examples they succeed or fail on.
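The scoring itself fits in a few lines; here’s a minimal sketch, assuming scikit-learn and made-up labels and predictions:
# A minimal sketch of classifier evaluation against labeled data, assuming
# scikit-learn; the labels and predictions here are made up for illustration.
from sklearn.metrics import accuracy_score, confusion_matrix

y_true = ["spam", "not-spam", "spam", "not-spam", "spam"]       # labeled data
y_pred = ["spam", "not-spam", "not-spam", "not-spam", "spam"]   # model output

print(accuracy_score(y_true, y_pred))  # fraction of predictions that were correct
print(confusion_matrix(y_true, y_pred, labels=["spam", "not-spam"]))  # where it fails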
Or maybe we’re doing named entity recognition, where we’re extracting names and places from text. Again, we have (or can produce) labeled data—that is, text samples, and the entities we want to extract. We can assess how well our model did in terms of how many or which of the entities it correctly found.
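Again, once you have labels, the scoring is mechanical. A rough sketch, with made-up entities:
# A rough sketch of entity-level evaluation; the labeled and predicted entities
# are made up for illustration.
labeled = {"Marie Curie", "Paris", "Sorbonne"}
predicted = {"Marie Curie", "Paris", "Warsaw"}

true_positives = labeled & predicted
precision = len(true_positives) / len(predicted)  # how many extracted entities were right
recall = len(true_positives) / len(labeled)       # how many labeled entities were found
print(precision, recall)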
But when we’re evaluating generative models like chatbots, there’s not an equivalent ‘out-of-the-box’ solution. We need to figure out for our individual use cases what ‘good’ looks like—or what behavior we want to see, and what behavior we don’t want to see. Then we can write tests for that specific behavior.
What Do You Want Your Tool to Do and How Might It Fail?
Let’s say you have a RAG chatbot that is designed to answer questions about a specific corpus of data, e.g., your lab and its practices. You should start your evaluation process by writing out some questions that you want your tool to be able to answer. Once you open it up to users, you can add to the list questions that your users are actually asking it.
My one major architecture suggestion for these kinds of narrow-use, specific tools is that, before you send a user’s question to your LLM, you first use a model which has been trained to determine whether or not it’s an appropriate question for your tool. This model can be a fine-tuned version of BERT or a BERT-derived model which you’ve trained on appropriate and inappropriate questions. Inappropriate questions are questions that are outside of its scope, including ones that are trying to get your model to reveal proprietary information or otherwise jailbreak it. Using this step is the easiest way to improve your tool’s accuracy, just by keeping it from answering questions you are not trying to get it to answer correctly! It can also keep people from using your tool to get free LLM access, and reduce the chances that anyone successfully baits your model into toxic behavior.
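Here’s a minimal sketch of that gating step, assuming a classifier you’ve already fine-tuned and the Hugging Face transformers pipeline; the model name, the label names, and call_your_rag_chatbot are all hypothetical placeholders for your own pieces.
# A minimal sketch of a question-gating step. "your-org/lab-question-gate" is a
# hypothetical fine-tuned BERT-style classifier, and the label names depend on
# how you trained it; call_your_rag_chatbot stands in for your existing pipeline.
from transformers import pipeline

gate = pipeline("text-classification", model="your-org/lab-question-gate")

def answer(question: str) -> str:
    verdict = gate(question)[0]
    if verdict["label"] == "OUT_OF_SCOPE":
        return "Sorry, I can only answer questions about this lab and its practices."
    return call_your_rag_chatbot(question)  # hypothetical: your existing RAG call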
But once you have the list of questions you want your tool to be capable of answering, you can use it to write tests for how your tool performs, that is, for the text it generates in response to the questions it does answer. Maybe you want your chatbot to be able to answer basic lab safety questions, questions about your lab’s protocols, and questions about the specific research that people are working on. This list doesn’t need to be comprehensive; in fact it shouldn’t be (if you have a comprehensive list of questions that your chatbot is supposed to answer, you should use an FAQ and not a chatbot), but it should be generally representative.
The next piece you should think about is how to differentiate failure vs. success, or what ‘good’ looks like for these questions. If you’ve done ad hoc testing of your tool, you probably have a good sense of that already. Maybe you’ve tried smaller models and bigger ones and you’ve seen the bigger ones perform better.
Going back to the example about a research lab and its practices, here are a few questions that you might want people to be able to ask your tool, along with corresponding descriptions of what ‘good’ looks like:
1. I lost my ID. What should I do? The answer to this should contain a particular name, contact information, and/or link.
2. Who is working on research related to a particular topic? The answer to this should contain a specific list of names, as well as information about their research.
3. What’s the protocol for cleaning pipettes? Here, we’re looking for a multi-step process containing all of a particular set of steps.
For all of these questions, you should use an LLM to evaluate how well your tool is performing. That is, your test should be a query to an LLM which asks the LLM to evaluate your tool’s answer using some kind of scoring mechanism.
Even if you think you can evaluate the answer using a different set of tools, an LLM is going to do it better and be more robust. To demonstrate this, I’ll briefly go through a couple of other tools and why I don’t think you should use them.
String Matching and Why You Shouldn’t Do This
String matching is the simplest way of evaluating text: does your tool’s response contain a particular pattern or string? Or is it syntactically close to a particular string—i.e., if you edit just a few characters, can you get from one to another?
Theoretically, you’d build a test around string matching if you had a straightforward answer. Using question #1 from above, if the answer to “I lost my ID. What should I do?” needs to contain a particular e-mail address, you could certainly handle that via string matching by searching for the email address in the response. But what if you’re concerned that, in addition to that e-mail address, it’ll add something else that’s not correct? You can write a bunch of tests ruling out this possibility one by one—or you can use an LLM to evaluate it for you.
# This test sends a prompt to an API (you'd use your own tool) and checks whether
# the response contains something formatted like an e-mail address. If you wanted
# to look for a specific substring - a specific e-mail address - that's even simpler.
import re
import unittest

import requests


def contains_email_address(text):
    # Regular expression pattern for a generic email address
    pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'
    return bool(re.search(pattern, text))


class TestGenericEmail(unittest.TestCase):
    def test_contains_generic_email(self):
        # Assuming there's an API that returns a text response
        # Replace 'http://example.com/api/get_text' with the actual API endpoint
        response = requests.post('http://example.com/api/get_text', json={'prompt': "what is the dean's email address?"})
        text = response.text
        self.assertTrue(contains_email_address(text))


if __name__ == '__main__':
    unittest.main()
Semantic Similarity and Why You Shouldn’t Do This
The concept behind semantic similarity models is that different words can mean the same thing, and that context matters. Using a semantic similarity model is a more sophisticated way of evaluating text relative to string matching, one that will know that “I refuse to answer that” is similar to “I cannot respond”, even though the actual overlap in words is minimal.
There are various small, open-source models out there which will embed your target output (the response you want) and your actual output in the same n-dimensional space, so that you can measure the distance between those two representations of your text.
For a test, you could then set some threshold to determine whether the actual output is within a certain distance from the output you wanted.
If you go this route, you should experiment to see whether you can find a model and a threshold that work for each test.
This isn’t difficult to implement. Below is a test we could write to assess semantic similarity: it encodes two pieces of text, your desired response and the actual response, and then checks whether their similarity clears a chosen threshold.
import unittest

import requests
import torch
from scipy.spatial.distance import cosine
from transformers import AutoModel, AutoTokenizer

# Initialize the model and tokenizer for 'sentence-transformers/paraphrase-mpnet-base-v2'
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/paraphrase-mpnet-base-v2")
model = AutoModel.from_pretrained("sentence-transformers/paraphrase-mpnet-base-v2")


def encode_text(text):
    # Tokenize and encode the text for the given model
    encoded_input = tokenizer(text, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        model_output = model(**encoded_input)
    # Mean pooling over the token embeddings
    embedding = model_output.last_hidden_state.mean(dim=1).squeeze()
    return embedding


def calculate_cosine_similarity(embedding1, embedding2):
    # Cosine similarity = 1 - cosine distance
    similarity = 1 - cosine(embedding1, embedding2)
    return similarity


class TestSemanticSimilarity(unittest.TestCase):
    def test_api_response_similarity(self):
        # Replace with the actual API endpoint and the target text
        api_url = 'http://example.com/api/get_text'
        target_text = "If you lose your ID card, you should go to the registrar's office"
        # Make the API call
        response = requests.post(api_url, json={'prompt': "What should I do if I lose my ID card?"})
        api_response_text = response.text
        # Encode the texts
        target_embedding = encode_text(target_text)
        response_embedding = encode_text(api_response_text)
        # Calculate similarity
        similarity = calculate_cosine_similarity(target_embedding, response_embedding)
        # Define a threshold for semantic similarity
        threshold = 0.8  # This threshold can be adjusted
        # Assert that the similarity is above the threshold
        self.assertTrue(similarity >= threshold, f"Semantic similarity below threshold: {similarity}")


if __name__ == '__main__':
    unittest.main()
But I don’t think this should be your first choice for a test, either. Text similarity is great for a lot of things. But there are many ways that text can be “close” to other text while still being meaningfully different in ways you care about, and many cases where the similarity model will tell you that two strings are quite different even when they’re similar in the ways you care about.
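For instance, a response that negates your target answer will often still sit very close to it in embedding space. You can check this yourself by reusing the helpers from the test above on a made-up pair like this:
# Two sentences that are lexically almost identical but semantically opposite.
# With many sentence-embedding models, pairs like this still score high cosine
# similarity, which is exactly the failure mode described above.
text_a = "You should autoclave the pipette tips before reuse."
text_b = "You should not autoclave the pipette tips before reuse."

similarity = calculate_cosine_similarity(encode_text(text_a), encode_text(text_b))
print(similarity)  # often well above a 0.8 threshold, despite the contradiction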
LLM-Led Evals and Why You Should Do This
The following is a grading rubric I’m using to evaluate whether a piece of text, like an LLM response, contains a comprehensive set of instructions on how to do a particular task. There are seven steps. If the text it’s evaluating contains all seven steps, it passes. If it doesn’t, it fails.
from enum import Enum


class GradingPipetteCleaningInstructions(Enum):
    PASS = """Includes instructions for all of the following tasks:
    using distilled water, use of mild detergent or cleaning solution,
    rinsing with distilled water, drying, reassembly, wearing gloves and goggles,
    checking for calibration and wear"""
    FAIL = """Leaves out one or more of the following tasks: using distilled water,
    use of mild detergent or cleaning solution,
    rinsing with distilled water, drying, reassembly,
    wearing gloves and goggles, checking for calibration and wear"""
There is one line of code I use to run this on a piece of text, and I get back a PASS or a FAIL. That’s all it takes. I had to tweak it slightly to get it to work, but it was a much easier process than it would have been if I’d tried to accomplish this via a combination of string matching and semantic comparisons.
I’m cheating a little bit because I’m using a library called Marvin, which simplifies interactions with the OpenAI API. Using Marvin, for instance, you don’t have to also tell the LLM “ONLY GIVE ME A PASS OR FAIL, NOTHING ELSE IN YOUR RESPONSE.” Or “PLEASE PLEASE PLEASE RETURN JSON.”* But if you wanted to use a non-OpenAI model to do your evaluations, or you wanted to write them from scratch, you would have to code a little bit more.
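For what it’s worth, the evaluation call itself looks roughly like this. The exact function depends on your Marvin version; this sketch assumes Marvin 2.x and its marvin.classify helper, and the response text is a stand-in for whatever your tool actually returns.
# A rough sketch, assuming Marvin 2.x, where marvin.classify maps a piece of text
# onto one member of an Enum. In a real test, response_text would come from your tool.
import marvin

response_text = "Rinse with distilled water, wash with mild detergent, rinse again, dry, reassemble, wear gloves and goggles, and check calibration and wear."
grade = marvin.classify(response_text, GradingPipetteCleaningInstructions)
assert grade == GradingPipetteCleaningInstructions.PASS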
But still: you can write this kind of test and get reliable results. And there is no way to get this level of precision from string matching or semantic similarity without writing a huge amount of code. Just think about how we’d try to do this with those tools: how are we going to tell whether all seven steps are present? If I absolutely had to do this a different way, I’d probably train a BERT model for each question, so that it could classify output as pass or fail. But I don’t want to train a BERT model every time I need to write a test!
You can use this kind of framework to quickly write rubrics for each of your tests. You can write them pass/fail or award points for particular pieces of content, and then set a threshold for passing. This is an example of a rubric which returns a point value:
def GradingPipetteCleaningScore(text: str) -> float:
    """
    Award ten points for the inclusion of each of the following seven tasks:
    task 1: using distilled water;
    task 2: use of mild detergent or cleaning solution;
    task 3: rinsing with distilled water;
    task 4: drying;
    task 5: reassembly;
    task 6: wearing BOTH gloves and goggles;
    task 7: checking for calibration and wear
    """
I had to play with the wording a little to make sure the model correctly processed each task and wasn’t awarding ten points each for gloves and goggles.
Here are a few other types of LLM-led evaluations which are more generic—that is, you don’t need to write a specific rubric for each question.
1. Have a target response that’s a good answer to a question. Ask the LLM how similar, on a scale of 0 to 10, the actual response your tool gave is to the target response. You can experiment and set a threshold for passing. (A sketch of this approach follows the list.)
2. Ask the LLM to evaluate whether the answer given is actually an answer to the question. Not “is it accurate?”, but “is it answering the question that was asked, and doing so in a complete way?”
3. For RAG applications specifically, ask whether the answer given was contained in the context. That is, when your RAG application found similar text to feed to the model as part of the context, did that context actually contain the answer that was given? (Both 2 and 3 are from Athina AI.)
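If you’d rather not use Marvin, here’s a rough sketch of the first approach written directly against the OpenAI Python client; the model name and prompt wording are placeholders you’d want to tune.
# A rough sketch of approach 1: ask an LLM to score the actual response against a
# target response on a 0-10 scale. This assumes the official openai Python client;
# the model name and prompt wording are placeholders to adjust.
from openai import OpenAI

client = OpenAI()

def similarity_score(target_response: str, actual_response: str) -> int:
    prompt = (
        "On a scale of 0 to 10, how similar in meaning is the ACTUAL response "
        "to the TARGET response? Reply with a single integer and nothing else.\n\n"
        f"TARGET: {target_response}\n\nACTUAL: {actual_response}"
    )
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return int(completion.choices[0].message.content.strip())

# In a test, you'd assert similarity_score(target, actual) >= a threshold you've tuned.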
You can’t get this level of precision any other way.
This Is New And We Are Just Figuring It Out
I wrote an earlier version of this post last summer in which I didn’t yet have an opinion about what kind of testing you should do. I have an opinion now because I’ve spent a lot of time trying to assess text data—and also because I’ve seen examples of what other people are building.
There’s no set of standards yet for how comprehensive your tests should be or what they should contain or anything like that. I suspect a lot of people are putting out RAG chatbots without formal testing processes, as well as without a classifier step which refuses to answer unrelated questions.
But formal assessment really is possible; it just means figuring out what you want your tool to do, how it might fail, and how to test for that.
_______________________________________________________________________
This is a notebook with my example Marvin scoring rubrics.
*You can also use “function calling” to get structured output like a JSON from OpenAI models directly. I did not have a fun time with that, but here’s a tutorial.