Collaborative Research: Syntactically-annotated corpora for endangered languages in areal contact

Information

  • Award Id
    2319246
  • Award Effective Date
    8/15/2023
  • Award Expiration Date
    7/31/2026
  • Award Amount
    $299,977.00
  • Award Instrument
    Standard Grant

There is a revolution underway in the ability of computers to understand and use human language. This revolution, though, depends on techniques that require large amounts of data. This project builds specially annotated datasets for a series of endangered languages for which only small amounts of data exist, and validates new computational techniques that work on datasets such as these. The work develops language technology that works on "small languages" while also generating new, computationally driven analytical insights about the grammars of the languages of interest. A core aspect of the project is training: it provides opportunities for US students to learn these computational tools, to build high-quality datasets, and to build technological platforms on top of the datasets produced by the project. The overall result is an increase in the United States' language infrastructure, including human resources, since the project prepares students to enter the private sector, government, or academia with advanced STEM training in computational linguistics.

Universal Dependencies (UD) is a standardized framework for building syntactically annotated corpora of any language in the world. The UD framework has garnered widespread support due to its ease of use for quantitative linguistic analyses and cross-linguistic comparisons, and due to its utility for training natural language processing (NLP) pipelines to annotate new input texts. It is especially useful for smaller languages, where rich annotation can make up for the lack of the large datasets that many modern NLP techniques presuppose. This project develops five new UD corpora of endangered languages, large enough to support deep scientific analysis as well as to build technology platforms. The final results of this project are: (1) five freely available and fully annotated treebanks of at least 30,000 tokens, (2) free and open-source NLP pipeline models that automatically perform word segmentation, stemming, morphological analysis, and syntactic parsing for each of the target languages, made freely available in public-facing code repositories, (3) novel methods for identifying areal linguistic clusters in large multilingual language models, and for leveraging this information to bootstrap NLP systems for related languages, and (4) a thematic volume describing comparative quantitative and computational syntactic investigations based on the corpora collected during the project. This award is made as part of a funding partnership between the National Science Foundation and the National Endowment for the Humanities for the NSF Dynamic Language Infrastructure – NEH Documenting Endangered Languages Program.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

  • Program Officer
    Jorge Valdes Kroff, jvaldesk@nsf.gov, 7032927920
  • Min Amd Letter Date
    8/3/2023
  • Max Amd Letter Date
    8/3/2023
  • ARRA Amount

Institutions

  • Name
    Indiana University
  • City
    BLOOMINGTON
  • State
    IN
  • Country
    United States
  • Address
    107 S INDIANA AVE
  • Postal Code
    47405-7000
  • Phone Number
    3172783473

Investigators

  • First Name
    Francis
  • Last Name
    Tyers
  • Email Address
    ftyers@iu.edu
  • Start Date
    8/3/2023

Program Element

  • Text
    IIS Special Projects
  • Code
    7484

Program Reference

  • Text
    IIS SPECIAL PROJECTS
  • Code
    7484
  • Text
    DEL
  • Code
    7719