More than 350 languages, along with many additional variants and dialects, are spoken in the United States, yet voice technology recognizes only a handful. This research will create crucial training datasets, predominantly optimized for speech recognition (speech-to-text), for three underrepresented American sociolinguistic contexts: a sociolect, a code-switching language context, and an Indigenous language. The methodology for co-creating these datasets with communities prioritizes building the agency, skills, and knowledge that people need to apply their dataset in their own social and economic context. Inclusive speech-to-text technology that recognizes more American languages and dialects means that more Americans can access critical information across citizen services, finance, education, health, and justice.<br/><br/>The project iterates on a community-mobilizing, inherently capacity-building, applied methodology for creating crucial machine-learning datasets, predominantly optimized for speech recognition (speech-to-text). The data creation process (text and audio) for these datasets will be run, hosted, and released through an open-source platform and infrastructure to ensure public accessibility. Communities will co-create the datasets from the design phase through quality assurance, with space to shape the governance framework, diversity criteria, and domain representation. This program will: (1) bridge critical gaps for innovative technological research on underrepresented languages and variants; (2) evolve understanding of culturally conscious, consent-centric modes of community participation in the building of artificial intelligence (AI); and (3) accelerate the development of first-language technology tooling in key economic domains such as health, education, justice, and agriculture, thereby opening pathways to societal and economic benefits.
The project will also advance skills development in machine learning by actively involving speakers of these underrepresented language variants in the data collection process. The project methodology is a form of applied pedagogy: communities learn about AI training datasets by taking part in their design and build. This skill-building approach can improve community representation within STEM professions while immediately mitigating dataset biases and potential harms.<br/><br/>This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.