Like many data scientists, I’m working on prototyping narrow-use tools built on top of LLMs like Llama 2 and GPT-4. Unlike the base LLMs, these tools are designed for specific tasks rather than answering any type of question.
It’s important to assess these tools to see if they work, but the broad benchmarks used to assess base LLMs won’t do. You need to assess the tool you build for the specific tasks it’s going to get used for.
A common LLM use case, called retrieval augmented generation (RAG), involves answering questions from unstructured data, such as Word documents or PDFs.
I wrote about this type of use case previously, but the summary is:
We want to create a chatbot which lets users ask questions just about a specific corpus of information, like the documentation for a software package or a set of regulations or all of the documents living on their organization’s intranet.
Maybe these are private or recent documents that aren't in the training corpus of any LLM; maybe they are, but because the LLM knows so many other things as well, when we ask questions specifically about these documents, it gets mixed up and gives us answers about other documents too.
We don't want to go through the trouble of fine-tuning a model, or actually modifying the LLM itself, maybe for lack of expertise or compute resources, maybe because our data is not in a format that would let us do this.
Instead, we split our unstructured data into chunks and design a tool that identifies the chunks most relevant to our query, then sends those chunks to the LLM along with the question. The LLM now has our question and, hopefully, the text it needs to answer it.
Since many organizations possess text data or unstructured documents, retrieval augmented generation is a popular and accessible use case to demo. There are tools that let you build one of these chatbots without writing any code, but they won't give you much control over how your documents are processed or how the chatbot works. Alternatively, you can write more code and get correspondingly more control over things like how your documents are processed and which LLM you use. Either way, the end result is a chatbot designed exclusively to answer questions related to your specific data.
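If you go the code-heavy route, the core loop is fairly short. Here's a minimal sketch, assuming the openai (>= 1.0) and sentence-transformers packages and an OPENAI_API_KEY in the environment; the chunk size, embedding model, and function names are illustrative choices, not a recommendation.

```python
# Minimal retrieval augmented generation loop (sketch, not a production tool).
import numpy as np
from openai import OpenAI
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")
client = OpenAI()  # reads OPENAI_API_KEY from the environment


def chunk_text(text: str, chunk_size: int = 1000) -> list[str]:
    """Split a document into fixed-size character chunks."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def top_chunks(question: str, chunks: list[str], k: int = 3) -> list[str]:
    """Return the k chunks whose embeddings are most similar to the question."""
    chunk_vecs = embedder.encode(chunks, normalize_embeddings=True)
    q_vec = embedder.encode([question], normalize_embeddings=True)[0]
    scores = chunk_vecs @ q_vec  # cosine similarity on normalized vectors
    return [chunks[i] for i in np.argsort(scores)[::-1][:k]]


def answer(question: str, chunks: list[str]) -> str:
    """Send the question, plus the retrieved context, to the LLM."""
    context = "\n\n".join(top_chunks(question, chunks))
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content
```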
Great, But How Do You Determine If It’s Working?
I've discussed methods for evaluating specific LLM tools, such as retrieval augmented generation chatbots, by comparing the answers they generate to our desired responses (labeled data). They are as follows, with a quick code sketch after the list:
For exact matches, we’re comparing the response we got from the LLM tool to see if it exactly matches the correct response, or label.
For keywords/regex, we’re looking for the presence of a particular word or pattern of characters in the response.
If we use similarity algorithms, we compare the two results, either semantically or syntactically.
If we ask an LLM, the LLM compares the correct result to the result our tool gave us.
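For concreteness, here is roughly what those four checks look like in code. This is a sketch: the label and response strings are made-up examples, and the ask-an-LLM judge assumes the openai (>= 1.0) package.

```python
# Four ways to compare a tool's response to a labeled answer (sketch).
import re
from difflib import SequenceMatcher

from openai import OpenAI

label = "No, the vehicle may be impounded."          # made-up labeled answer
response = "No. The vehicle may be impounded then."  # made-up tool response

# 1. Exact match: passes only if the two strings are identical.
exact = response.strip().lower() == label.strip().lower()

# 2. Keyword / regex: look for a required word or character pattern.
keyword = re.search(r"\bimpounded\b", response, re.IGNORECASE) is not None

# 3. Similarity: a syntactic ratio here; a semantic check would compare embeddings.
similarity = SequenceMatcher(None, label, response).ratio()  # 0.0 to 1.0

# 4. Ask an LLM whether the two answers agree.
client = OpenAI()
verdict = client.chat.completions.create(
    model="gpt-4",
    messages=[{
        "role": "user",
        "content": f"Do these two answers agree? Reply YES or NO.\n"
                   f"Reference: {label}\nCandidate: {response}",
    }],
).choices[0].message.content
```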
But all of these rely on having the “right answers”, or labeled data with which to compare your tool’s responses. How do you get those labels to begin with?
Ideally, you already have labeled data. For instance, if you’re trying to supplement an existing customer service process, you already have a lot of data about what customers ask, what answers your current process gets them, and whether the customers are satisfied by those answers. You can use that data to evaluate how your tool performs.
But What If You Don’t Have Labeled Data?
While evaluating a prototype of the retrieval augmented generation chatbot, I created ad hoc questions based on the unstructured text data. For instance, when we used recent newspaper articles as our unstructured data source, I read the articles, wrote questions that were answered within them, and then gave those same questions to the chatbot to see how it performed.
That process worked, but it doesn’t scale. I can’t evaluate performance on a large corpus of data that way unless I want to make generating labeled data my full-time job.
Because of this, I wanted a tool to write me labeled data for assessment – essentially, to generate pairs of questions and answers about my unstructured text data. Then, I can see how my chatbot performs and potentially compare it to other possible tools. For instance, I can compare how well the retrieval augmented generation chatbot does relative to just the base LLM (without the additional document context), or how it does if I use a different base model, like Llama 2 vs. GPT-4. Or I could compare the use of different algorithms to determine which chunks of text the tool will pull to send as context to the LLM to help it answer the question, or how many chunks of text I send as context. There are lots of parameters that can potentially be varied for your chatbot, but to determine the best solution, you need labeled data.
First Prompts and Results
I started by using the ChatGPT GUI and pasting in chunks of text from the California Vehicle Code, preceded by a prompt. The prompt I used was this: "I'm going to give you some text, and I'd like for you to return a list of question/answer pairs about this text."
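Pasting chunks into the GUI doesn't scale, but the same prompt can be sent through the API and run over every chunk of a corpus. Here's a rough sketch, assuming the openai (>= 1.0) package; asking for JSON output is my own addition so the pairs are easy to parse, not part of the original prompt.

```python
# Generating question/answer pairs about one chunk of text (sketch).
import json
from openai import OpenAI

client = OpenAI()

PROMPT = (
    "I'm going to give you some text, and I'd like for you to return a list of "
    "question/answer pairs about this text. Return them as a JSON list of "
    'objects with "question" and "answer" fields.'
)


def generate_qa_pairs(chunk: str) -> list[dict]:
    """Ask the model for question/answer pairs about a chunk of text."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": f"{PROMPT}\n\n{chunk}"}],
    )
    return json.loads(response.choices[0].message.content)
```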
But when I did this, I got questions that lacked context. They only made sense for someone looking at the same chunk of text as the LLM. You could know a lot about the California Vehicle Code, but if I asked you any of these questions without adding the context that this is about the California Vehicle Code, you're not going to be able to answer them.
For instance:
What happens if a vehicle has been issued five or more notices of parking violations?
What are the conditions for a vehicle to be released to the legal owner?
What happens if a vehicle is found illegally parked and there are no license plates or other evidence of registration displayed?
What is the condition for a removal to be considered reasonable?
In some cases, if we preface these questions with the topic, for instance, “This question is about the California Vehicle Code: What happens if a vehicle has been issued five or more notices of parking violations?”, that could provide the necessary context.
In others, even if we provided this context, it still wouldn't be enough. “This question is about the California Vehicle Code: What is the condition for a removal to be considered reasonable?” isn't a good question because there are multiple conditions under which a removal could be considered reasonable, covered in different sections, but the answer GPT-4 generated for this question was based on only one specific section.
Example Question and Answer
I’ve previously found that GPT-4 is quite familiar with the California Vehicle Code. So I asked it that question:
“This question is about the California Vehicle Code: What is the condition for a removal to be considered reasonable?”
The answer that I was looking for was “If a vehicle has been issued five or more notices of parking violations to which the owner or person in control of the vehicle has not responded within 21 calendar days of notice of citation issuance or citation issuance or 14 calendar days of the mailing of a notice of delinquent parking violation, the vehicle may be impounded.”
But when I asked GPT-4, it determined (correctly) that there was actually a broad set of possible answers to this question, as opposed to just the one I was looking for.
This is an excerpt from its response:
If you're referring to when it's reasonable to remove (or tow) a vehicle in California, generally, a vehicle can be removed:
When it's parked in a prohibited location or manner that is specified by signage or the law.
When it poses a hazard or obstruction to the regular flow of traffic.
If it's involved in criminal activities.
If it's left unattended for a specified period in a public place.
If registration has been expired for a specified duration.
And other specific reasons detailed in the CVC.
This illustrates the challenge with the question-answer pairs GPT-4 generates: the questions depend on context that only the generator can see. Even if an LLM has the information necessary to answer a question, the way we're posing the question is too broad to elicit the specific information we're looking for, even if we append the overall topic.
Trying To Get GPT-4 To Add Context Via Prompting
I played with prompt style and added this to my question-generating prompts: “these questions should be answerable by someone who knows this material but isn't looking at this specific text, so you need to make the questions specific and so there isn't additional context you need to understand what it's asking.”
It didn’t work very well. I was still getting questions like: “What are the conditions for a vehicle to be released to the legal owner?” and “What is the role of law enforcement and other agencies identified in this chapter?”
Additional efforts to clarify the context issue were also unsuccessful. I specifically asked it not to use language like “this section” or “this chapter”, and that failed.
I could append the chapter or section title to each question, but the point is to come up with a set of tools that's generally usable with most sets of unstructured data, so that's not helpful. Many sets of unstructured data won't have sections or chapters, but rather some other organizing structure.
I’m not ready to conclude there isn’t a solution to this – but I have not found it yet.
Two Paths Forward, Depending on Your Use Case
The implications of these imperfect-but-not-useless sets of labeled data depend on your use case.
If you want to compare two narrow-use LLM tools that have access to the same retrieval-augmented data sets, the context issue may not be important. For instance, if you're varying the base model (Llama 2 vs. GPT-4, say), or varying the size and number of text chunks you send as context, then the lack of context isn't stacking the deck against one of your tools vs. the other. They might correctly answer some of the weird questions that lack the needed context, they might not, but it's not an unfair comparison. If they both fail on those questions, the differences in assessment results will look smaller than if all the questions were “good” questions, but you'll still reach the correct conclusion: the tool that is actually better will also perform better on your assessment.
But if you're comparing a base model without retrieval augmented generation against one with it, you'll probably want to manually filter out questions that we wouldn't expect anyone to be able to answer without additional context.
For instance, even with the topic appended, “This question is about the California Vehicle Code: What is the condition for a removal to be considered reasonable?” isn't a question that a person who knows the California Vehicle Code is going to get “correct”, and it's not one that an LLM is, either. This is because there are multiple circumstances under which the California Vehicle Code says removal is reasonable, not just the specific one that the LLM wrote when looking at one section of the code.
So if that’s your use case, you should filter out those kinds of questions – or tag them as being context-specific, or find some method of prompt engineering that I’ve not yet found to keep them from being generated in the first place.
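One rough way to do that filtering or tagging is a heuristic pass over the generated questions. The phrase list below is an illustrative guess at what to look for, not a rule that will catch everything.

```python
# Flag questions that probably depend on context the reader can't see (sketch).
import re

CONTEXT_DEPENDENT_PHRASES = [
    r"\bthis (section|chapter|text|passage|article|code)\b",
    r"\bidentified in this\b",
    r"\bthe (above|following) (text|section|chapter)\b",
]


def is_context_dependent(question: str) -> bool:
    """True if the question uses phrasing that only makes sense next to the source text."""
    return any(re.search(p, question, re.IGNORECASE) for p in CONTEXT_DEPENDENT_PHRASES)


questions = [
    "What is the role of law enforcement and other agencies identified in this chapter?",
    "What happens if a vehicle has been issued five or more notices of parking violations?",
]
kept = [q for q in questions if not is_context_dependent(q)]  # drops the first question
```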
What Prompting Is Good For: Answer Formatting
While my efforts at prompting away the context issue were not particularly successful, I was able to use prompting to get GPT-4 to structure my question-answer pairs in a way that makes it easier to evaluate whether the answers we get back during assessment are correct. My goal was to have three answer formats:
Yes/no
Short answer
Long answer
For the yes/no answers, my prompt included “please have the answers be a yes or no, and a mixture of both”. For the short answers, I included “please have the answers be NO MORE THAN FOUR WORDS”. I didn't include any length instruction for the long answers, because I found that GPT-4 defaulted to long answers.
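Mechanically, these instructions just get appended to the question-generating prompt. Here's a sketch of how that might look; the two quoted instructions are from the text above, and the assembly around them is my own.

```python
# Building the three prompt variants for different answer formats (sketch).
BASE_PROMPT = (
    "I'm going to give you some text, and I'd like for you to return a list of "
    "question/answer pairs about this text."
)

FORMAT_INSTRUCTIONS = {
    "yes_no": "Please have the answers be a yes or no, and a mixture of both.",
    "short": "Please have the answers be NO MORE THAN FOUR WORDS.",
    "long": "",  # no length instruction; GPT-4 defaults to long answers
}


def build_prompt(answer_format: str) -> str:
    """Compose the generation prompt for one of the three answer formats."""
    return f"{BASE_PROMPT} {FORMAT_INSTRUCTIONS[answer_format]}".strip()
```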
This worked reasonably well at limiting answer length. Of the 116 yes/no questions I generated, every one correctly had only a yes or no answer. For the short answers, where I requested no more than four words, more than half came in at four words or fewer, but a long right tail drove up the mean; the longest was 70 words. The long answers, where I didn't specify a length, had limited overlap with the other two: half were longer than 48 words.
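Checking whether those constraints held is straightforward once the pairs are in a structured form. A sketch, assuming each set of pairs is a list of dicts with "question" and "answer" keys (that layout and the names are my own):

```python
# Summarize answer lengths and spot-check the yes/no set (sketch).
import statistics


def answer_word_counts(pairs: list[dict]) -> list[int]:
    """Word count of each generated answer."""
    return [len(p["answer"].split()) for p in pairs]


def summarize(pairs: list[dict]) -> dict:
    counts = answer_word_counts(pairs)
    return {
        "n": len(counts),
        "median_words": statistics.median(counts),
        "max_words": max(counts),
    }


def all_yes_no(pairs: list[dict]) -> bool:
    """True if every answer in the set is literally 'yes' or 'no'."""
    return all(p["answer"].strip().lower().rstrip(".") in {"yes", "no"} for p in pairs)
```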
The prompting for answer format basically worked.
If This Works, Can We Fine-Tune an LLM With This?
My initial interest in going from unstructured data to question-answer pairs wasn’t for assessment, it was for training. That’s the major use of labeled data for LLMs: training (or fine-tuning) a model.
But even if we can get this working well enough for assessment, the bar for training is higher. For assessment, particularly assessment that's just a step up from “here are some ad hoc prompts I made”, we don't necessarily need question-and-answer pairs that cover 100% of a piece of text in order to determine which tool (model, prompt, etc.) is better. But if we want to fine-tune a model to answer questions about a whole piece of text, we would need question-and-answer pairs that cover that whole text.
Conclusion
Part of the problem with narrow-use tools built on LLMs is that you can quickly build a demo and show that it's kind-of-working, but that doesn't mean it's ready for users. One big piece of getting it ready, and of making choices like which base LLM to use or what size chunks to split your unstructured data into, is being able to assess performance. Ideally, you would already have labeled data, and the only challenge would be how to systematically compare the tool's answers to that labeled data. But if you don't have labeled data, you'll have to generate it yourself.
Because of the context issue, using LLMs to do this is promising but imperfect. For specific use cases, there might be ad hoc solutions: for instance, I could automatically append the chapter and section number to each assessment question about the California Vehicle Code. But if the goal is to create a tool that works with other kinds of unstructured text documents, most of which won't have chapter or section labels, then we need a solution that works for a broader set of use cases.
But even with the context issue not fixed, this is still potentially worth pursuing. For next steps, I’m going to use these questions with GPT-4 and with a retrieval augmented generation tool giving GPT-4 access to the California Vehicle Code, see what answers they give, and then show how to evaluate those answers using different methods for text comparison.