This project develops an evaluation methodology around the implementation of Large Language Model (LLM)-based tools to support human experts working in federal, state, and local government programs. The complexity of rules in these programs means that mistakes are common, often creating more work for staff and participants. LLM-based tools have the potential to reduce that complexity by reflecting relevant rules and information from internal knowledge bases back to staff, along with citations. They can also be used to automate and simplify steps that require significant time and introduce potential for human error. For instance, rather than having staff manually key in data from scanned documents, these tools can assist by categorizing and populating data fields, which staff then review. This project will conduct a study that reflects the breadth of households across the country and evaluate the trade-offs of implementing these systems: from the cost of development, to quantifying errors in LLM responses, to evaluating the potential burden on human experts of correcting those errors. The experiment compares performance (measured via metrics such as answer accuracy and time-to-answer) across responses generated under three conditions: 1) LLMs only, 2) humans only, and 3) humans working together with LLMs.

The project goal is to evaluate, with a replicable approach, whether LLMs ought to be applied to specific use cases within public services, and to provide structured summaries of findings for decision-makers weighing LLM adoption. Natural language processing methods are used to generate nationally representative synthetic prompts based on program rules and demographics across states (e.g., eligibility for different programs given different situations). The project will collect responses, in the form of a next step or decision, from three experimental conditions: a hypothetical LLM-only condition, a human-only condition, and a human-supported-by-LLM condition. The LLM tool itself was created with partner organizations using retrieval-augmented generation methods and is fine-tuned on similar question-and-answer data regarding government services. The project relies on gold-standard responses from quality-control auditors to evaluate correctness. The answers from each condition will be analyzed for subgroup disparities based on prompt characteristics (e.g., type of employment, age group, housing circumstances), allowing for granular reporting of the potential trade-offs of LLM adoption.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
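
The following is a minimal, illustrative sketch of how the evaluation described above might be scored: responses from the three conditions are compared against the auditors' gold-standard answers, with accuracy and time-to-answer summarized per condition and broken out by a prompt characteristic. The file name, column names, and subgroup variable (housing status) are assumptions for illustration, not the project's actual pipeline.

```python
"""Hypothetical scoring sketch for the three-condition evaluation."""
import pandas as pd

# Assumed layout: one row per (prompt, condition) response, with the
# quality-control auditors' gold-standard answer already joined in.
# Assumed columns: prompt_id, condition ("llm_only" | "human_only" |
# "human_plus_llm"), answer, gold_answer, time_to_answer_sec,
# employment_type, age_group, housing_status.
responses = pd.read_csv("responses.csv")  # hypothetical file

# Mark each response correct if it matches the gold-standard answer.
responses["correct"] = responses["answer"] == responses["gold_answer"]

# Headline metrics per condition: answer accuracy and mean time-to-answer.
by_condition = responses.groupby("condition").agg(
    accuracy=("correct", "mean"),
    mean_time_to_answer_sec=("time_to_answer_sec", "mean"),
    n=("correct", "size"),
)
print(by_condition)

# Subgroup disparity check: accuracy by one prompt characteristic
# (here, housing circumstances) within each condition.
by_subgroup = (
    responses.groupby(["condition", "housing_status"])["correct"]
    .mean()
    .unstack("housing_status")
)
print(by_subgroup)
```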