Large Language Models (LLMs) like OpenAI’s GPT are being applied to a broad range of tasks, from code generation to customer support. You may be trying to figure out how to use these models – or you may be getting directives from higher up, like “add AI to it!” But just because an LLM can more-or-less do a task doesn’t mean it’s the right tool, and even if it is the right tool, you’ll still be making choices that require assessing its output, like selecting the most appropriate model, crafting prompts, and determining if and how to pair it with human assessments.
So how can you determine if GPT, or any LLM, is the right fit for your specific project? This post outlines an approach to answering that question: what you can do in an ad hoc way versus where you’ll need to write code and have specifically formatted data, and how the level of rigor you need in evaluation depends on your specific use case – the alternatives as well as the stakes if you get it wrong.
But regardless of the context, if an LLM isn’t going to work, you want to figure that out as soon as possible. You should only proceed to the more time-intensive and rigorous evaluation if there’s some chance it might work.
The tasks I’m going to talk about evaluating LLMs for are ones where systematic evaluation is possible: tasks where you’ll be feeding an LLM the same prompt over and over again with different pieces of text via an API, like doing a first pass at assessing resumes, pulling names or organizations out of regulations, or summarizing emails.
A Three-Tiered Approach to Assessing LLMs Like GPT
When determining if GPT is a suitable choice for your specific project, it’s helpful to think in terms of a three-tiered approach: ad hoc checks, informal benchmarking, and formal evaluation. The first stage you can do quickly, without any technical expertise or specialized data; by the third, you’ll need data that’s formatted in a specific way, and you’ll need to write code or work with someone who does.
1. Ad Hoc Checks: The Quick and Dirty Way
The quickest way to assess if an LLM might be a fit for your project is through ad hoc checks. This process involves feeding it some sample prompts, along with text examples, and checking if the outputs make sense.
For instance, if you're considering using GPT for initial resume review, you might input a resume along with a sample prompt. That prompt could be: “I’m looking to hire a data scientist with experience in Natural Language Processing. Please assess whether this resume fits that profile, and answer beginning with a ‘Yes’ or ‘No’, followed by an explanation.”
You then examine the model’s output and judge whether it’s reasonable. You can do this via OpenAI’s playground, where you can experiment with different models, prompts, and other parameters to understand their impact on the model’s performance.
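If you’d rather script even this quick check than click through the playground, a minimal sketch with the openai Python package might look like the following. This assumes the v1-style client, and the model name and resume file are placeholders, not recommendations:

```python
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in your environment

prompt = (
    "I'm looking to hire a data scientist with experience in Natural Language "
    "Processing. Please assess whether this resume fits that profile, and answer "
    "beginning with a 'Yes' or 'No', followed by an explanation."
)
resume_text = open("sample_resume.txt").read()  # hypothetical resume file

response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # swap in whichever model you're evaluating
    messages=[{"role": "user", "content": f"{prompt}\n\n{resume_text}"}],
)
print(response.choices[0].message.content)
```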
Remember, this is a quick and dirty check. We're not looking for perfection here; instead, we're trying to identify any immediate deal-breakers. If GPT frequently misunderstands the prompt or produces nonsensical outputs, it may not be the right tool for your project.
As an example, I found that a prompt that worked well with GPT 3.5 for a certain type of named entity recognition – pulling out the names of software tools and programming languages – was hopeless with GPT 4: the output format required so much cleaning that scaling up would never have been feasible, and it identified a lot of entities that were not what I was looking for.
Conversely, if the outputs seem reasonable and potentially useful, it might be worth proceeding to the next stage of evaluation.
2. Informal Benchmarking: The Trial Run
If the initial ad hoc checks are promising, the next step is to undertake informal benchmarking. This involves a more systematic comparison of the LLM's outputs with the results you'd expect from your current process – or the results you’d like to see, if you don’t have a current process.
Returning to the resume review example, you might randomly pull ten resumes, label each of them with the output that you want from GPT, and then see how well GPT performs relative to your expected outputs. You can still do this with a graphical user interface like OpenAI’s playground, or you can switch at this point to using your model’s API and writing code to test responses against your desired outcomes.
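Here is one sketch of what that code could look like, assuming you’ve saved the ten resumes as text files and written down the answer you’d expect for each; the filenames and labels are hypothetical, and the API call follows the same pattern as the sketch above:

```python
from openai import OpenAI

client = OpenAI()

PROMPT = (
    "I'm looking to hire a data scientist with experience in Natural Language "
    "Processing. Please assess whether this resume fits that profile, and answer "
    "beginning with a 'Yes' or 'No', followed by an explanation."
)

# (filename, expected answer) pairs labeled by hand -- hypothetical examples
labeled_resumes = [
    ("resume_01.txt", "Yes"),
    ("resume_02.txt", "No"),
    # ... eight more
]

matches = 0
for filename, expected in labeled_resumes:
    resume_text = open(filename).read()
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": f"{PROMPT}\n\n{resume_text}"}],
    )
    answer = response.choices[0].message.content.strip()
    predicted = "Yes" if answer.lower().startswith("yes") else "No"
    print(f"{filename}: expected {expected}, got {predicted}")
    matches += predicted == expected

print(f"{matches} of {len(labeled_resumes)} responses matched your labels")
```

Asking the model to begin its answer with ‘Yes’ or ‘No’ is what makes it possible to score the responses automatically rather than reading every explanation by hand.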
During this stage, you can also start to investigate whether tweaking the model's parameters improves its performance. You might experiment with different settings or try using a different model.
By this point, you’ll need to have a sense of how well GPT needs to perform to be useful to you. In the resume example, let’s say two of the ten resumes should be classified as promising and needing human review, and eight should not. How accurate does the model have to be for you to incorporate it into your process? What’s your current process like? This can help you evaluate your results and decide whether to go on to the next step.
And before you invest too much time, think about considerations beyond whether this works technically that could be deal breakers for your project. If you can only get this working with GPT 4, but using it isn’t feasible from a cost perspective, it doesn’t matter how good the results are. If you have private data that needs to stay secure, then you should test only on models you can run locally or otherwise trust to be secure, because those are the only models you can use. If you’re building a long-term project rather than doing a one-time task, think about whether you want to become dependent on a technology whose terms of service and costs could change at any time. Finally, if you’re building a tool that people outside your organization will be able to feed data into, as opposed to running it on already-curated data sets, think really hard about whether you’ve solved the security and prompt-injection problem.
3. Formal Evaluation: The Rigorous Test
If informal benchmarking suggests that GPT could be a good fit, it's time to move on to a more formal evaluation. There is a caveat to this: if your initial, informal benchmarking has shown that GPT is already performing significantly better than your current process, or if the stakes of the task are low, you might decide that a full-blown formal evaluation isn't necessary.
For example, if GPT is triaging customer complaints with high accuracy, and your current manual process is slow and error-prone, it could be clear that GPT is a substantial improvement without needing a rigorous evaluation. Similarly, if GPT is sorting a daily batch of low-stakes social media comments, and the occasional misclassification isn't a major issue, you might be content with a less formal assessment.
If you do decide to proceed with a rigorous, formal evaluation, this is where you’ll need more labeled data and the ability to write code.
Labeled data is data that's been tagged with one or more labels to provide context or meaning. In the world of machine learning, these labels are often the correct answers or outputs that a model should aim to predict. They allow us to compare the model's predictions against the known correct answers, giving us a direct measure of its performance.
If you’ve ever sorted your emails into different folders based on their content, or tagged photos of your friends on social media, you’ve created labeled data without even knowing it – and there are plenty of other places where your organization may already have labeled data. For instance, if someone has sorted through resumes and decided whom to contact, and you’ve saved that information, then you have labeled data. If you track every visit to your website and which visits ended in a purchase, that’s labeled data as well.
If you don’t have labeled data, you will likely need to label your data yourself, which can be a laborious process.
To fully understand the concept of formal evaluation and labeled data, let’s go back to our resumes for an example of a classification task: categorizing data scientist resumes by whether they show experience with Natural Language Processing.
In this scenario, imagine you have a large number of resumes that you’ve previously categorized, maybe with keyword searches supplemented by human judgment, as “interview” or “don’t interview.” Each resume (the data) has been tagged with a category (“interview” or “don’t interview”). This is your labeled data.
Now, suppose you want to use GPT to automate the task of categorizing resumes. You would feed a resume into GPT along with your prompt: “I’m looking to hire a data scientist with experience in Natural Language Processing. Please assess whether this resume fits that profile, and answer beginning with a ‘Yes’ or ‘No’, followed by an explanation.” GPT would predict a category – “interview” or “don’t interview.” By comparing this predicted category with the known correct category (the label), you can assess how well GPT is performing the task. This process of comparing GPT's predictions to known correct answers is at the core of a formal evaluation process.
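One way to sketch that comparison loop in code is below. It assumes your labels live in a CSV with columns resume_file and label – both the file layout and the gpt-3.5-turbo model name are assumptions, not requirements:

```python
import csv
from openai import OpenAI

client = OpenAI()

PROMPT = (
    "I'm looking to hire a data scientist with experience in Natural Language "
    "Processing. Please assess whether this resume fits that profile, and answer "
    "beginning with a 'Yes' or 'No', followed by an explanation."
)

def classify(resume_text: str, model: str = "gpt-3.5-turbo") -> str:
    """Map the model's leading Yes/No answer onto our two labels."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": f"{PROMPT}\n\n{resume_text}"}],
    )
    answer = response.choices[0].message.content.strip().lower()
    return "interview" if answer.startswith("yes") else "don't interview"

# labels.csv has two columns: resume_file,label
# where label is "interview" or "don't interview"
y_true, y_pred = [], []
with open("labels.csv") as f:
    for row in csv.DictReader(f):
        y_true.append(row["label"])
        y_pred.append(classify(open(row["resume_file"]).read()))
```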
In the simplest kind of modeling – like our resume problem – we’re going to evaluate our results using a confusion matrix, which shows the counts of all four possible outcomes: resumes that our model correctly identified as ‘Yes’, resumes that it wrongly identified as ‘Yes’, resumes that it correctly identified as ‘No’, and resumes that it wrongly identified as ‘No’.
A better model has more correctly identified data points, but the specifics – whether we care more about false positives or false negatives, and how much – will depend on the specific situation. Since we can easily vary parameters like the prompt text and model type (3.5 vs. 4 in the case of GPT, or even other models entirely), we’ll write code that produces these metrics for each set of parameters, and then compare them with each other as well as with how good they would have to be for us to use them for our problem.
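Continuing the sketch above, with the y_true and y_pred lists in hand, the confusion matrix and a couple of summary metrics need nothing beyond the standard library (scikit-learn’s confusion_matrix would work just as well):

```python
from collections import Counter

# Count each (actual, predicted) pair: the four cells of the confusion matrix.
cells = Counter(zip(y_true, y_pred))

tp = cells[("interview", "interview")]              # correctly flagged for review
fp = cells[("don't interview", "interview")]        # flagged, but shouldn't have been
tn = cells[("don't interview", "don't interview")]  # correctly screened out
fn = cells[("interview", "don't interview")]        # resumes we wanted to see but missed

print(f"TP={tp}  FP={fp}  TN={tn}  FN={fn}")
print(f"Accuracy:  {(tp + tn) / len(y_true):.2f}")
print(f"Precision: {tp / (tp + fp):.2f}" if tp + fp else "Precision: n/a")
print(f"Recall:    {tp / (tp + fn):.2f}" if tp + fn else "Recall: n/a")
```

Rerunning this for each combination of prompt and model gives you the side-by-side numbers to compare.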
It’s important to note that the effectiveness of this process hinges heavily on the quality of your labeled data. If there are mistakes in your labels, or your labeled data doesn’t include all of the kinds of data you’ll get in the future – if you’re going to get resumes that look really different from any you’ve seen before – then the evaluation might give you misleading results about the quality of your model.
The Takeaway: A Step-by-Step Approach to Evaluation
Ad hoc checks, informal benchmarking, and formal evaluation provide an approach to this assessment that lets you fail quickly. Starting with quick, low-effort checks and progressing to more rigorous testing allows you to weed out non-viable options early and only invest significant time and resources when there’s a reasonable chance of success.
Although LLMs are a powerful tool, they’re not the solution to every problem – there may even be lower-tech solutions involving existing NLP tools, like Python libraries that can summarize text, find topics, or extract entities. Evaluate LLMs in the context of your specific project: consider the nature of the task, the structure of your data, the resources at your disposal, and the potential alternatives. By taking this approach, you can make an informed decision about whether an LLM is the right tool for your project – and not treat it like a hammer in search of a nail.
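As one illustration of a lower-tech option for the entity-extraction case, a few lines of spaCy (assuming you’ve installed the library and its small English model) pull out organizations and people with no LLM involved:

```python
import spacy

# Requires: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

doc = nlp("The rule was issued by the Department of Energy and reviewed by Jane Smith.")
for ent in doc.ents:
    print(ent.text, ent.label_)  # e.g. "the Department of Energy" ORG, "Jane Smith" PERSON
```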
If you have a use case you want to discuss, need code written to test out the feasibility of your project, or want a workshop for your organization, get in touch with me.