Building a parsed historical corpus to investigate word-order change and variation

Information

  • NSF Award
  • 2314522
Owner
  • Award Id
    2314522
  • Award Effective Date
    9/1/2023
  • Award Expiration Date
    8/31/2026
  • Award Amount
    $ 457,997.00
  • Award Instrument
    Standard Grant

Living languages change over time in a number of areas, including not only vocabulary and pronunciation, but also sentence structure. Historical linguistics is concerned with documenting these changes and seeking explanations for them. Changes in sentence structure often unfold over an extended period, during which several grammatical patterns for expressing the same basic notion coexist and compete. For periods before the introduction of sound recording, the only evidence for these changes consists of written documents. However, gathering sufficient evidence from written documents for a rigorous scientific investigation of variation and change in the grammatical patterns of a given language requires the examination of a large parsed corpus, that is, a collection of texts divided into sentences, clauses, and phrases. This project builds a parsed electronic corpus of a single language, covering multiple centuries, geographical areas, and text genres. This allows for the investigation of grammatical change and variation in the history of the language as well as comparison with similar developments in related languages. The corpus is publicly available for any researcher to use, and outreach to universities and high schools promotes public awareness of the use of science and technology to explore questions about the structure of language. The development of the corpus also contributes to the training of the next generation of researchers in linguistics, including a postdoctoral researcher, graduate students, and undergraduate students.

This project builds a 1.4-million-word syntactically parsed electronic corpus of 165 texts spanning the years 1050-1950, ten dialectal regions, and a range of text genres. This requires substantial extension of existing annotation schemes based on previous syntactically parsed corpora to accommodate a broader range of syntactic phenomena, while also keeping the annotation scheme as comparable as possible with those used in the handful of syntactically parsed historical corpora of other languages. This project involves manual annotation of texts, correcting errors that arise in automatic part-of-speech parsing, disambiguation of many sentences, and cross-checking for accuracy of syntactic annotations. The resulting annotated corpus fills a gap among the parsed corpora of the world's languages and is available free of charge to researchers around the world, together with documentation on the use of the corpus. The empirical data generated by this project informs research on the mechanisms and spread of typological change over time across closely related dialects. The corpus can be used to investigate phenomena not only in the domain of syntax, but also in the interfaces between syntax and other components of grammar. Given the broad range of texts in the corpus, these phenomena can be examined synchronically, diachronically, sociolinguistically, and in comparison with other languages.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

  • Program Officer
    Jorge Valdes Kroff, jvaldesk@nsf.gov, 7032927920
  • Min Amd Letter Date
    7/24/2023
  • Max Amd Letter Date
    7/24/2023
  • ARRA Amount

Institutions

  • Name
    Indiana University
  • City
    BLOOMINGTON
  • State
    IN
  • Country
    United States
  • Address
    107 S INDIANA AVE
  • Postal Code
    474057000
  • Phone Number
    3172783473

Investigators

  • First Name
    Christopher
  • Last Name
    Sapp
  • Email Address
    csapp@iu.edu
  • Start Date
    7/24/2023 12:00:00 AM
  • First Name
    Rex
  • Last Name
    Sprouse
  • Email Address
    rsprouse@indiana.edu
  • Start Date
    7/24/2023 12:00:00 AM

Program Element

  • Text
    Linguistics
  • Code
    1311

Program Reference

  • Text
    LINGUISTICS
  • Code
    1311
  • Text
    REU SUPP-Res Exp for Ugrd Supp
  • Code
    9251
  • Text
    SCIENCE, MATH, ENG & TECH EDUCATION