Using GPT To Search By Concept Rather Than Keyword

Finding Data Science-Related Jobs on USAJobs

Apr 27, 2023

If you've applied for a federal government job, you've probably used the USAJobs website. While federal hiring processes have a reputation for being slow and opaque, the USAJobs site itself is useful. It has comprehensive search capabilities, and you can generate an alert that will send you emails when jobs that meet your criteria are posted.

However, while you can filter job listings results on specific fields, or perform keyword searches to match phrases in the text, both of these depend on job descriptions being labeled and described consistently. And in areas like data analysis and statistical modeling, which are highly relevant to my job field, there are a lot of different types of language that can be used to describe similar tasks. This can make it difficult to find all of the relevant jobs.

I wanted to know whether ChatGPT could help me search by concept instead of by keyword.

The bottom line is that it can. Both the overall results that it gave me – categorizing job listings – as well as the summary text it wrote explaining the categorization – show that it’s doing something useful and fundamentally different from keyword searches.

Defining the Problem

To start with, I took a look at the existing search functionality on USAJobs to see what already worked well, and where there was additional functionality that I wanted.

Here is a brief summary.

Filters

If you’re looking for something that corresponds well to a field in the USAJobs listings, the filters are easy to use. For example, each job listing is owned by a specific agency, so filtering for jobs with a particular agency (e.g. Bonneville Power Administration) is straightforward.

And this can work for job categories as well: the job “economist” corresponds to occupational series 0110, and filtering jobs by this series produces results that more or less match job listings that most people would consider to be “economists.”

However, some jobs are not well-matched by any occupational series. In these cases, things become more complicated. Take my job: “data scientist.” Although there is an occupational category (1560) for data science jobs it’s a relatively new one – and the concept of a data scientist is also not well-defined, with significant overlap in tasks with other jobs. This is especially true for government employees.

As a result, there are a lot of jobs with tasks, like statistical analysis, mathematical modeling, or using R or Python, that a data scientist (or someone adjacent to data science) might be interested in, but which are not in series 1560. For instance, when I was a data scientist for the Army, I was classified as an 0343 – Management and Program Analyst.

So filters alone aren’t always sufficient to find the job listings you might want.

Keyword Search

But there are other ways to refine search results, such as keyword search. What happens when you search USAJobs for ‘data science’ (without parentheses)?

I got 1,849 jobs.

The top listings were very relevant – they’re jobs with data science or data scientist in the title, and/or which are in that or 1560 “data scientist” occupational category.

But by the second page of results, I’m getting jobs that I wouldn’t really classify even very broadly as data science jobs, like this contract specialist job:

When I searched for “data scientist”, that narrowed it down considerably – but then I only got 23 jobs, and I knew there were many more with the kinds of tasks that I was trying to find. There were 96 for “data scientist” in parentheses — but even of that small number, some of them were not remotely data science jobs, like this Security Specialist role:

This is potentially a situation where being able to do semantic searching – searching for concepts rather than just specific words or phrases – could be helpful.

Using GPT as a Quick-and-Dirty Classification Model

GPT excels in an area where computers have traditionally struggled: understanding and responding to natural language. My hypothesis was that, with some guidance, GPT could apply a fuzzy, human definition of “data scientist-type tasks” and determine if a given job contained these better than simple keyword search or the human-labeled occupation series field.

To test this, I did a few things:

Performed some searches using the USAJobs API to find potential matches
Wrote a prompt (using some trial and error) to tell GPT what a “data scientist-type” job looked like, and asked GPT to respond in a particular format
Fed each potential job listing to the GPT API to have it apply my prompt, and saved the result

Essentially, this idea boils down to treating an API call to GPT with my custom prompt as a “data scientist classification model.” (This would be much quicker than training my own classification model by hand!)

Before I show you my results, let’s look at each of these steps in a little more detail.

Narrowing Down Job Listings

Because I didn’t want to put all of the ~28,000 current job listings through the GPT API, I first narrowed them down to a set of job listings I considered potentially of interest to data scientists. I did this in two ways:

By occupational series. I used a broad set of occupational codes that might have data science jobs in them. Some of these are obvious, like "1515: Operations Research Analyst” and “0110: Economist”; others are engineering or scientist roles which may have significant data science components.
By mention of ‘data’. I narrowed down further by jobs which contained the word “data” at least twice in a particular set of fields, "QualificationSummary", "JobSummary", "MajorDuties", "Education", and "Evaluations", which were also the set of fields I fed to GPT when I was telling it about each job.

This got me down to 2,067 job listings.

Prompting GPT For Data Science Jobs

The next step was building my prompt. After some trial and error, I came up with this:

I'm going to give you text from a federal job listing that I'd like you to categorize as whether it would be a good job for someone who is interested in data science tasks like writing code to clean and analyze data, do modeling, automate processes related to data, using tools like R, SQL, and Python. But I need you to keep in mind that these listings are different from private sector listings in that they frequently use different vocabulary to talk about those tasks, and they don’t list specific tools or languages, so I need you to think broadly and make a guess. I’d like you to include leadership jobs involving data, if you think they still involve some analysis and coding, and also to include jobs which have data science components and also require significant background in a specific topic like economics or physical science. I also do want to include data engineering jobs, but exclude pure software engineering jobs and software development and IT jobs. Please structure your response starting with Yes or No and then explaining your reasoning.

After some preliminary investigation, the results initially looked good both in terms of finding jobs which include data science-type duties and excluding solely leadership-type jobs which were getting included with the naive ‘data science’ search in the GUI. For instance, it had this to say about an “IC AI Coordinator” job which contained the phrase “data science” in the job listing:

“No, this job listing does not seem to involve tasks related to data science. While it does mention Artificial Intelligence (AI), the focus of the job is on policy, strategy, and leadership, rather than data analysis or coding. The desired requirements also do not mention any specific skills or experience related to data science.”

It also was able to find data science-adjacent jobs which didn’t contain the phrase “data science” and weren’t in the 96 “data science jobs.” For instance, it found this Senior Geographer job, which is as central an example of a data scientist job as you’ll find on USAJobs:

“Yes, this job would be a good fit for someone interested in data science tasks. The job listing mentions using Python scripting and geoprocessing tools to support geospatial analysis and data integrity within a geographic information system application, as well as applying advanced analytical techniques to support military planning and modeling and simulation using sophisticated geographic information system geo-processing tools and/or related scripting. These tasks involve data cleaning, analysis, and automation using tools like Python and GIS software. Additionally, the job requires a significant background in geography or related fields, which suggests that the data analysis tasks will be focused on geographic data. The job also involves leadership responsibilities, but the job description indicates that the incumbent will still be involved in data analysis and coding tasks.”

It also classified the previously-mentioned contract specialist job the way I would also classify it — as a “No.”

I did find that some IT and software jobs were getting labeled as data science. Including “exclude pure software engineering jobs and software development and IT jobs” did not entirely solve this problem.

For instance, one justification for why an IT specialist job qualified as data science included the line “The use of languages such as COBOL is also relevant to data science tasks" – which, no, it is generally not.

But the results I was getting were still looking better than I was getting through the USAJobs GUI or that I could have gotten from a more complex keyword search via the USAJobs API.

I had my prompt – now I just had to apply it systematically to my potential data science jobs.

Running the Data Through My “Classification Model”

I’ve put the full code in git, but below is a high-level summary of the steps I took with the 2,067 job listings that made it through my initial screen:

Make an API call to GPT for each of these. In each API call, I include:
1. my data categorization prompt
2. text from the listing.

Each API call returns an answer which starts with “Yes” or “No” and gives reasoning as to the classification.

Use that text (“does it start with Yes”?) to determine which are positives.
Store the classification result along with important data from USAJobs, such as job title, location, salary range, department/agency, closing date, and whether it’s remote.

I’ve posted the positive and negative results in a prototype app. you can search for “Yes” in the occupation column if you want to find the positives, or you can just download the whole file.

Results

So, did my approach work?

I feel confident that it’s working to some degree and doing something different and useful. But it’s hard to quantify how well it worked, because there’s a lot of fuzziness involved in saying what is or isn’t a data science-adjacent job. I didn’t go in with labeled data in part because I’m not sure I can confidently label every job as data science-related or not, although I’m thinking about how to do that.

Of the 2,067 jobs I classified using this process, 68 were too long to classify using the GPT-3.5 model or hit some issue with the API. Of the 1,999 that did get classified, 428 were “positives”: they were labeled by GPT as having data science content.

About 90% of these positives did not contain the phrase “data sci” (as in, “data science” or “data scientist”). Of the 55 jobs which did contain that phrase, 43 were classified as having data science content – the model classified the remaining 12 as not having sufficient data science content.

Here are some examples, and they show the value of doing semantic searches:

One example of these five was a position that called for “experience working with laboratory, bioinformatic, epidemiologic or data science professionals” - that’s exactly the kind of position that we’d want to exclude, and which the model correctly labeled.
Another was a Supervisory Personnel Research Psychologist - also a definite “no”
There was an Assistant Professor of Cyber and a Computer Engineer
Finally, there was a Data Steward, which I might have classified as being a data science-adjacent job, but I think GPT’s is actually reasonably persuasive about why it’s not, and I think it could go either way:

No, this job listing does not seem to be a good fit for someone interested in data science tasks. While the education requirement includes degrees in Data Science and related fields, the mandatory qualification criteria and job description focus on leadership and customer service skills, as well as experience with GIS tools. The desirable qualification criteria do mention experience with quality assurance/quality control review procedures for geospatial data, but this is not the primary focus of the job. Additionally, the physical requirement of distinguishing colors and visual acuity may not be relevant to data science tasks.

How well is the classification working with the jobs that didn’t contain “data sci” and that it classified as positives?

We can check that by looking at some examples of the 385 job listings that the model classified as data science-related but which didn’t contain that phrase.

Here are three different, randomly-selected examples and the justification that GPT gave for the classification:

IT Specialist (APPSW/SYSADMIN): Yes, this job would be a good fit for someone interested in data science tasks. The job listing mentions building and managing database systems, performing computer programming processes, and analyzing hardware and software capabilities to determine the most efficient methods to meet business requirements. These tasks require skills in data cleaning, analysis, and modeling, as well as knowledge of programming languages like SQL and Visual Basic. Additionally, the job involves serving as a senior analyst and database administration expert, which suggests a leadership role that still involves analysis and coding. The job also requires specialized experience in information technology, which is relevant to data science tasks. Overall, this job seems to have a strong data science component and would be a good fit for someone with experience in this area.

Mathematical Statistician: Yes, this job would be a good fit for someone interested in data science tasks. The job listing mentions developing mathematical/statistical techniques, approaches, and methodologies to analyze energy and commodity markets, reviewing data operations to assure compliance with specifications, agreements, and laws in the use of energy supply and demand data collection programs, applying advanced statistical techniques to experimental design, data analysis, sampling, forecasting, quality control, and operations research, and developing and/or presenting professional papers or reports on statistical studies or surveys. These tasks involve data analysis, modeling, and statistical techniques, which are common in data science. Additionally, the job requires a degree that includes 24 semester hours of mathematics and statistics, which is a strong foundation for data science.

Science Software Developer (IT Specialist): Yes, this job would be a good fit for someone interested in data science tasks. The job listing mentions specialized experience in programming languages like Python and C++, knowledge of code versioning tools and repositories, and experience with continuous integration and testing environments. Additionally, the job involves designing, developing, and implementing astronomical software, which requires skills in data analysis and modeling. The job also requires knowledge of databases and data access, web and grid enabled technologies, and user interfaces, which are all relevant to data science tasks. Finally, the job requires a degree in computer science, engineering, information science, mathematics, operations research, statistics, or technology management, which are all fields related to data science.

None of these are modal data science jobs. Job 3 is a software developer job, which I was trying to screen out. But jobs 1 and 2 are exactly the kinds of jobs I had in mind and wanted the prompt to find.

While all three of these involve tasks and have requirements that you might think of not purely data science tasks and requirements, this is also the case for some of the jobs that explicitly mention “data science” on USAJobs, and has more to do with the kinds of jobs that the federal government hires for.

For instance, one job that mentions “data science” requires “a doctoral degree in data science, computational toxicology, biology, physical sciences, chemistry, or a related field, and substantial experience in a related field”. Another requires “a bachelor's degree in computer science or a related field, and at least one year of specialized experience in providing technical support services in all phases of the Software Engineering and Software Development Life Cycles related to cloud-based software engineering”.

Therefore, it’s also not the case that this model is picking up on jobs that have a lot of non-data-science requirements, in contrast to if you searched for “data science.”

In general, these results look like what I wanted. My prompt at the moment still appears to be picking up a few false positives–mostly software developer jobs–but otherwise looks like an improvement over the basic keyword search, even if I can’t currently quantify how much of an improvement.

Conclusion & Next Steps

Adding simple classification models to search results looks like a good use case for GPT. I’m getting these results using GPT 3.5, which is not even the newest version.

This is also a great use case for the GPT APIs - with a fairly small amount of code to pull the USAJobs listings to get current listings, and then another small amount of code to pass it to GPT, we can classify a listing. This approach can potentially provide more accurate and relevant search results for job seekers, especially for those looking for positions in less structured fields like data science.

To be fully confident in this conclusion, I would need to blindly classify some job listings to send to GPT. This would give us more insight into how much GPT (and my prompt) may be over-classifying or under-classifying certain job types, and in turn give me some labels that I can use to systematically test different prompts.

It’s also worth noting that this entire experiment took just a few days to conceive, get up and running, and complete. This suggests that using GPT as “good enough” classification model can be much faster than training your own model in many practical contexts. (For example, when the stakes are low, when a human is available to write a classification description, and when accuracy is less important than time-to-market.)

Another obvious next step is building this out into a simple product that regularly re-searches and classifies jobs. By doing this, we can build something that’s providing useful functionality that USAJobs isn’t yet able to provide – and with minimal cost or work.

If this is a general topic that you’re working on – including, making government workforce data more legible – or you have another job category that you’d like to see labeled/categorized, or you want to label some data, please send me an e-mail.

The Present of Coding

Discussion about this post