Software to Handle Missing Values in Large Data

Information

  • Research Project
  • ApplicationId
    6690119
  • Core Project Number
    R43RR017862
  • Full Project Number
    1R43RR017862-01A1
  • Serial Number
    17862
  • FOA Number
  • Sub Project Id
  • Project Start Date
    7/1/2003
  • Project End Date
    9/30/2004
  • Program Officer Name
    SWAIN, AMY L
  • Budget Start Date
    7/1/2003
  • Budget End Date
    9/30/2004
  • Fiscal Year
    2003
  • Support Year
    1
  • Suffix
    A1
  • Award Notice Date
    6/9/2003
Organizations

DESCRIPTION (provided by applicant): This SBIR project aims to produce commercial software for handling missing data in large data sets where the goal is data mining and knowledge discovery. Such data sets may have a large number of subjects, variables, or both; examples include microarray data, surveys, genomic data, and high-throughput screening data. Handling missing data is an important step in careful data preparation, which is key to the success of an entire project. Missing values often arise in medical data, and they are an obstacle because many data mining tools either require complete data or are not robust to missing data. Principled methods of handling missing data are computationally intensive, so computational feasibility is a challenge in large data sets. Phase I work will explore strategies such as sampling, constraining parameters, and monotone data algorithms for model-based techniques. Factor analysis and multivariate linear mixed-effects models will be used to reduce the number of parameters. A variable-by-variable approach using a popular data mining technique, recursive partitioning, will also be used to impute missing values. For each method, we will write prototype software and test its performance on missing-data patterns simulated on real data. Several ad hoc techniques will serve as a baseline for comparison. Experience writing prototypes and using them in simulations will lead to a preliminary software design that will serve as the foundation of Phase II work. The proposed software will enable medical researchers to gain more from their data mining efforts by extracting as much information as possible and achieving unbiased predictions despite missing data.
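
The evaluation idea in the description (simulate missing-data patterns on complete data, impute, and compare against an ad hoc baseline) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the project: the function names (mask_mcar, impute_column_means, rmse_on_masked) are hypothetical, the missingness is assumed to be completely at random, column-mean imputation stands in for "ad hoc techniques", and a small synthetic table stands in for real data.

```python
import math
import random

def mask_mcar(rows, rate, rng):
    """Return a copy of rows with each cell set to None with probability `rate`."""
    return [[None if rng.random() < rate else v for v in row] for row in rows]

def impute_column_means(rows):
    """Ad hoc baseline: replace each missing cell with its column's observed mean."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not None]
        # Fall back to 0.0 if a column happens to be entirely missing.
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return [[means[j] if v is None else v for j, v in enumerate(row)] for row in rows]

def rmse_on_masked(truth, masked, imputed):
    """Root-mean-square error over only the cells that were masked out."""
    errors = [
        (t - p) ** 2
        for t_row, m_row, p_row in zip(truth, masked, imputed)
        for t, m, p in zip(t_row, m_row, p_row)
        if m is None
    ]
    return math.sqrt(sum(errors) / len(errors)) if errors else 0.0

rng = random.Random(0)

# Synthetic stand-in for a complete data set: 200 rows, two correlated columns.
complete = []
for _ in range(200):
    x = rng.gauss(0.0, 1.0)
    complete.append([x, 2.0 * x + rng.gauss(0.0, 0.5)])

# Simulate a missing-data pattern, impute, and score against the known truth.
masked = mask_mcar(complete, rate=0.2, rng=rng)
imputed = impute_column_means(masked)
print("baseline RMSE on masked cells:", rmse_on_masked(complete, masked, imputed))
```

Because the true values are known before masking, any other imputation method could be scored by the same rmse_on_masked call and compared with this baseline figure.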

IC Name
NATIONAL CENTER FOR RESEARCH RESOURCES
  • Activity
    R43
  • Administering IC
    RR
  • Application Type
    1
  • Direct Cost Amount
  • Indirect Cost Amount
  • Total Cost
    99847
  • Sub Project Total Cost
  • ARRA Funded
  • CFDA Code
    371
  • Ed Inst. Type
  • Funding ICs
    NCRR:99847
  • Funding Mechanism
  • Study Section
    ZRG1
  • Study Section Name
    Special Emphasis Panel
  • Organization Name
    INSIGHTFUL CORPORATION
  • Organization Department
  • Organization DUNS
    150683779
  • Organization City
    SEATTLE
  • Organization State
    WA
  • Organization Country
    UNITED STATES
  • Organization Zip Code
    98109
  • Organization District
    UNITED STATES