I-Corps: Translation Potential of Synthetic Data Generation with Nullspace Sampling for Tabular and Timeseries Data

Information

  • Award Id
    2422393
  • Award Effective Date
    5/15/2024
  • Award Expiration Date
    4/30/2025
  • Award Amount
    $50,000.00
  • Award Instrument
    Standard Grant

The broader impact of this I-Corps project is based on the development of software that generates synthetic data for the healthcare, consulting, and insurance industries. Synthetic data is artificially generated data that is statistically similar to the real-world datasets businesses use. It can support analytics and machine learning when access to real data is limited, and it may help augment minority representation in real-world datasets, thereby aiding more equitable outcomes. More broadly, synthetic datasets have the potential to drive innovation in healthcare and other industries by allowing businesses to share synthetic versions of proprietary data with strategic partners, such as data analytics companies, while remaining in full compliance with data privacy laws. This ability can lead to more data-driven decision-making in the private sector and more effective policy formulation in the public sector. For instance, synthetic medical data may help healthcare researchers and administrators better model patient activity, including representative data for understudied populations, and ultimately improve human health.

This I-Corps project uses experiential learning coupled with a first-hand investigation of the industry ecosystem to assess the translation potential of the technology. The solution builds on a previously developed non-deep-learning technique that generates synthetic datasets from features of real data. It relies internally on linear algebra-based techniques and requires no training or optimization, which allows significantly faster generation of tabular and timeseries synthetic data. Although the solution was initially created to generate synthetic timeseries data, it can be modified to generate synthetic tabular data. Because it is completely non-parametric and skips the additional steps associated with training and optimization, it is 300x faster than state-of-the-art deep learning generation methods for tabular data. The approach can therefore generate richly structured datasets using significantly less computing time than deep-learning methods.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
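
The abstract does not describe the algorithm itself, so the following is only a minimal sketch of the general idea of sampling in a null space, not the awardee's method. It assumes the statistics to preserve can be written as a linear map A applied to the rows of a data matrix X; the function names, the choice of A, and the Gaussian perturbation are all hypothetical. Adding a random combination of null-space vectors of A to X changes the individual records while leaving A @ X unchanged, and no training or optimization step is involved.

```python
import numpy as np

def nullspace_basis(A, tol=1e-10):
    """Orthonormal basis for the null space of A, computed from the SVD."""
    _, s, vt = np.linalg.svd(A, full_matrices=True)
    rank = int(np.sum(s > tol))
    return vt[rank:].T  # columns span {v : A @ v = 0}

def sample_synthetic(X, A, scale=1.0, seed=0):
    """Return Y = X + N @ Z, where the columns of N span null(A).

    Illustrative assumption: A encodes the linear statistics to keep.
    Because A @ N = 0, A @ Y equals A @ X up to rounding error.
    """
    rng = np.random.default_rng(seed)
    N = nullspace_basis(A)  # shape (n_rows, n_rows - rank(A))
    Z = rng.normal(scale=scale, size=(N.shape[1], X.shape[1]))
    return X + N @ Z

# Hypothetical example: 50 records, 3 features, preserve the column means.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
A = np.ones((1, 50)) / 50      # A @ X is the row vector of column means
Y = sample_synthetic(X, A)
print(np.allclose(A @ Y, A @ X))  # means are preserved
```

In this toy setup only the column means are constrained. Preserving richer structure, such as temporal trends in a timeseries, would require additional rows in A, and a real generator would also need to address privacy, which this sketch does not.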

  • Program Officer
    Molly Wasko, mwasko@nsf.gov, 7032924749
  • Min Amd Letter Date
    5/6/2024
  • Max Amd Letter Date
    5/6/2024
  • ARRA Amount

Institutions

  • Name
    Vanderbilt University
  • City
    NASHVILLE
  • State
    TN
  • Country
    United States
  • Address
    110 21ST AVE S
  • Postal Code
    372032416
  • Phone Number
    6153222631

Investigators

  • First Name
    Mikail
  • Last Name
    Rubinov
  • Email Address
    mika.rubinov@vanderbilt.edu
  • Start Date
    5/6/2024 12:00:00 AM

Program Element

  • Text
    I-Corps
  • Code
    802300

Program Reference

  • Text
    Software Services and Applications
  • Code
    8032