I recently analyzed 35,000 public comments on "Schedule F"—a proposed regulation to strip civil service protections from many government workers—using large language models. The project illustrates where AI genuinely helps and where we're still stuck with fundamental problems of user interface design and data collection, as well as security and procurement constraints.
What AI Made Easier
When agencies propose certain types of new regulations, they're required to collect and respond to public comments. The idea is that agencies should address all substantive arguments being made. But when there are tens of thousands of comments or more, manual analysis becomes difficult and expensive.
Before large language models, automating this analysis was severely limited. You could search for keywords or use smaller models to cluster similar comments and identify topics. But these methods struggle with varied wording or complicated arguments.
Large language models do much more. You can more reliably categorize comments as agree/disagree and identify very granular themes, even when the wording varies widely. Multimodal models like Gemini also make attachments easier to process: they can extract text from scanned PDFs, images, and other documents that were previously difficult to handle automatically.
For Schedule F, I processed 35,000 comments through OpenAI's cheapest model for under $20. Processing attachments through Gemini cost under a dollar for files that failed to process locally with other tools. It didn't work on all of them, but it was a big improvement over previous models. This was fast, cheap, and good—and it could be better with more time and evaluation, while still being much faster and cheaper than human analysis.
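As a rough sketch of what that attachment processing can look like (not my exact code; the library choice, model name, and file path are illustrative):

```python
# Sketch: transcribe a scanned PDF attachment with Gemini via the
# google-generativeai library. Model name and path are placeholders.
import google.generativeai as genai

genai.configure(api_key="YOUR_GEMINI_API_KEY")
model = genai.GenerativeModel("gemini-1.5-flash")

# Upload the attachment, then ask the model to transcribe it to plain text.
attachment = genai.upload_file(path="attachments/comment_12345.pdf")
response = model.generate_content(
    [attachment, "Transcribe all legible text in this document. Return plain text only."]
)
print(response.text)
```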
This project is an example of processing unstructured text data: a workflow where we want to ask the same questions of every document. For these comments, I was looking for a specific set of themes, a representative quote, and an agree/disagree categorization. For other types of documents, we'd ask different questions.
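Concretely, the pattern looks something like this. This is a simplified sketch, not my actual prompts: the model name, theme list, and output fields are illustrative.

```python
# Sketch: ask the same questions of every comment and get structured JSON back.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

THEMES = ["merit system", "political loyalty", "accountability", "due process"]  # illustrative

def analyze_comment(comment_text: str) -> dict:
    prompt = (
        "You are analyzing a public comment on a proposed regulation.\n\n"
        f"Comment:\n{comment_text}\n\n"
        "Return JSON with three keys: stance ('agree' or 'disagree' with the proposal), "
        f"themes (a subset of {THEMES}), and representative_quote (a short verbatim excerpt)."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # stand-in for a cheap general-purpose model
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)

# Example: analyze_comment("I oppose this rule because ...")
```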
LLMs represent a genuine advance in data processing capabilities. Organizations have tons of documents that have become much easier to analyze. I saw a tiny fraction of the possible use cases when I was a machine learning engineer at DHS, but they're endless.
The Bigger Win: Fixing Data Collection Upfront
But what's better than processing unstructured data? Not having to process unstructured text data because you were able to collect structured data upfront.
For comment data, the government could improve data collection in two ways.
First, instead of accepting any type of file attachment, uploads could be limited to text files, Word documents, or non-scanned PDFs. Some of the attachments people send are truly not readable in any way, like dark, scanned PDFs. It's not good user interface design to accept attachments we can't use.
Second, in addition to accepting raw text (with or without attachments), agencies could add a field asking users to summarize their comment or state the most important point. Where it makes sense, they could add an approve/reject selection button. Both changes are sketched below.
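This sketch is illustrative only: the field names and validation rules are hypothetical, not an actual regulations.gov schema.

```python
# Sketch of a more structured comment submission: an explicit position, a
# required summary, and a restricted set of attachment types.
from dataclasses import dataclass, field

ALLOWED_EXTENSIONS = {".txt", ".docx", ".pdf"}  # non-scanned PDFs, enforced elsewhere

@dataclass
class StructuredComment:
    position: str                          # "approve" or "reject", from a selection button
    summary: str                           # "the most important thing," in the commenter's words
    full_text: str = ""                    # optional raw comment text
    attachments: list[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        if self.position not in {"approve", "reject"}:
            raise ValueError("position must be 'approve' or 'reject'")
        for name in self.attachments:
            if not any(name.lower().endswith(ext) for ext in ALLOWED_EXTENSIONS):
                raise ValueError(f"unsupported attachment type: {name}")
```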
However, there are still many cases where data is inherently unstructured, and so you’re stuck parsing it if you want to understand it. For instance, if you're analyzing emails to determine how to respond to Freedom of Information Act (FOIA) requests, the unstructured nature of that email data is unavoidable.
Better Interfaces for Existing Data
Better interfaces are another non-AI way to make data more accessible. For instance, the easiest way to get public comment data is to fill out a form and get a massive spreadsheet (CSV) file that does not contain the text of the attachments. It's great that this exists. But even without any AI, it would be straightforward to build frontends to make the data easier to search, including the attachments.
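Even a small amount of tooling over the bulk export would help. Here's a sketch that assumes a local copy of the CSV with a "Comment" text column; the actual export's layout and column names may differ.

```python
# Sketch: keyword search over the bulk comment export.
import pandas as pd

comments = pd.read_csv("schedule_f_comments.csv")  # hypothetical local copy of the export

def search(term: str) -> pd.DataFrame:
    """Case-insensitive keyword search over the comment text column."""
    mask = comments["Comment"].str.contains(term, case=False, na=False)
    return comments[mask]

print(len(search("due process")))  # how many comments mention "due process"
```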
This applies to much government data. Some of it lacks programming interfaces (APIs) and has difficult or limited front ends, like FedScope data (which is currently getting a redesign). Some, like USAJobs data, has no graphical interface for historical data; and if you want any of the rich text, like job duties or requirements, you can't get it from the API at all, so you're stuck scraping inconsistent web data.
People struggle to access information that's technically public and available, and that’s a problem. There's a lot of room for the government and non-governmental groups to build out tooling to make data more accessible.
This is also an issue within government, where a lot of data gets passed around via spreadsheet. It could be more accessible (and version-controlled!) if it lived in a central repository with some basic table functionality.
Harder to Build From the Inside
When I was a DHS employee, I wanted to build tooling to analyze public comments. But even if I'd gotten the go-ahead, I would have found it more challenging to build on the inside than it was on the outside. That was true even though I sometimes had access to LLM APIs, so I could query models like the one I used for this text processing. Many technical employees in government don't have that access, including ones with clear use cases for it.
But outside of DHS, I can also use whatever coding assistants I want—like Claude Code, which I used for this project. Coding assistants are enormously useful to me because they integrate directly into my system. And even though the comment data was public, there were no coding assistants I could have used at DHS: I would have been copying and pasting code between my computer and the available LLM, which is a much worse way to use LLMs for coding.

I get that there's a balance here between security and access. But the balance now tilts toward too much risk aversion: in government, you never get in trouble for saying no to new technology. Because of this, there isn't enough urgency around making sure civil servants have access to the best tooling, even when it's inexpensive enough that those of us on the outside use it for side projects.
Evaluation and Trust
In contrast to a lot of narratives, I think AI actually provides an opportunity for more transparency—but that doesn't happen automatically.
For instance, I didn't get every comment categorized correctly.
But you can see the prompts I used to categorize them, because they're on GitHub. You can see the evaluation frameworks I built and discarded, because they're in the git history. If I'd categorized the comments manually in a spreadsheet, I still wouldn't have gotten it perfect, and you wouldn't have gotten any insight into the process.
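The kind of evaluation worth sharing doesn't have to be elaborate. Here's a sketch that compares model labels against a small hand-labeled sample; the file and column names are hypothetical.

```python
# Sketch: measure agreement between model stance labels and hand labels.
import pandas as pd

labeled = pd.read_csv("hand_labeled_sample.csv")  # columns: comment_id, stance
model = pd.read_csv("model_output.csv")           # columns: comment_id, stance

merged = labeled.merge(model, on="comment_id", suffixes=("_human", "_model"))
agreement = (merged["stance_human"] == merged["stance_model"]).mean()
print(f"Stance agreement on {len(merged)} hand-labeled comments: {agreement:.1%}")
```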
As the government uses these tools, I think it's important that, at a minimum, the code lives in internal repositories rather than proprietary, drag-and-drop interfaces where it's difficult even for the government clients—the people the models are being built for—to see exactly what happened. And when possible, the code should be open-sourced for the public to see.
And regardless of whether it’s appropriate to open-source the code, you should at least be able to share some info about your evaluation framework. The 2024 Report on Select Use Cases of Face Recognition and Face Capture Technology from DHS was a good example of this.
AI isn't magic—it's a useful tool in some circumstances and the wrong answer in others. Our job as technologists is to work with users to determine what they need and to provide transparency so stakeholders can make informed decisions. But the government should give its employees and contractors the technical infrastructure to do that well.
A version of this appeared at Reform for Results.