RAW SEQUENCING DATA PROCESSING AND BASE CALLING

Information

  • Research Project
  • ApplicationId
    7252524
  • Core Project Number
    R01HG002929
  • Full Project Number
    5R01HG002929-03
  • Serial Number
    2929
  • FOA Number
    PA-97-44
  • Sub Project Id
  • Project Start Date
    6/1/2005
  • Project End Date
    5/31/2008
  • Program Officer Name
    FELSENFELD, ADAM
  • Budget Start Date
    6/1/2007
  • Budget End Date
    5/31/2008
  • Fiscal Year
    2007
  • Support Year
    3
  • Suffix
  • Award Notice Date
    6/5/2007

DESCRIPTION (provided by applicant): The long-term objective of this application is to develop software that processes raw data from DNA capillary electrophoresis sequencing machines (data processing) and identifies the DNA bases with higher overall accuracy than existing techniques (base calling). The specific aims are to: collect a large number of data files (approximately 50,000 files will be used); create a database that includes the correct basecalls associated with each data file; develop a methodology for comparing the results of two basecallers (incorporating the confidence values associated with each call into the assessment method); develop novel algorithms for processing the raw data; incorporate a model of the peak amplitudes into basecalling; improve the current base spacing model; and, finally, test the basecaller against the proposed database. The proposed methodology is based on a novel signal processing approach applied to the raw data. A highly adaptive filter will be applied to the raw data; it will adapt to the varying levels of noise and to the variation in peak width. The traditional order of the raw-data processing steps will be changed to allow better color separation between the channels. Features of the data itself will be identified and used to predict the base calls. For instance, a peak amplitude model will be created to allow better prediction of the base calls. This model will also be used to indicate whether or not an individual base follows the model, giving a probability of an insertion/deletion error. An automatic algorithm will be developed to detect and remove stutter peaks from the raw data. Combined with an improved cross-talk removal procedure, this will allow better sensitivity in identifying heterozygotes in the processed sequences.
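
To illustrate the cross-talk removal step mentioned above, the following is a minimal sketch of the classical linear approach, not the improved procedure proposed in this application. It assumes that each four-channel detector sample is a linear mixture of the four dye signals, so the dye signals can be recovered by solving a small linear system. The matrix values and the function names (solve_linear, apply_crosstalk, remove_crosstalk) are hypothetical and chosen only for illustration.

```python
def solve_linear(matrix, rhs):
    """Solve matrix * x = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    # Work on an augmented copy so the caller's data is not modified.
    aug = [list(row) + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        # Pick the row with the largest entry in this column as the pivot.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("cross-talk matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][n] for i in range(n)]


# Hypothetical cross-talk matrix (illustrative values only): column j is the
# response of the four detector channels to a unit amount of dye j.
CROSSTALK = [
    [1.00, 0.20, 0.02, 0.00],
    [0.15, 1.00, 0.25, 0.03],
    [0.01, 0.18, 1.00, 0.22],
    [0.00, 0.02, 0.12, 1.00],
]


def apply_crosstalk(dye_signal, crosstalk):
    """Simulate what the detector sees for one sample of pure dye signals."""
    n = len(dye_signal)
    return [sum(crosstalk[i][j] * dye_signal[j] for j in range(n)) for i in range(n)]


def remove_crosstalk(samples, crosstalk):
    """Recover the dye signals for each observed four-channel sample."""
    return [solve_linear(crosstalk, sample) for sample in samples]


if __name__ == "__main__":
    true_signal = [100.0, 0.0, 50.0, 0.0]
    observed = apply_crosstalk(true_signal, CROSSTALK)
    recovered = remove_crosstalk([observed], CROSSTALK)[0]
    print([round(v, 6) for v in recovered])
```

In practice the mixing matrix has to be estimated from the data, and residual cross-talk can mimic a small secondary peak, which is why it matters for heterozygote detection.
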
The calculated confidence values will follow the current standard introduced by phred and will be calibrated so that, for data with reduced levels of noise, they match the actual accuracy rate over the testing database. The software and the testing database will be free of charge for academic and publicly funded sequencing projects.
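
For reference, the phred convention expresses the confidence of a call as a quality value Q = -10 * log10(P), where P is the estimated probability that the call is wrong, so P = 0.001 corresponds to Q = 30. The sketch below shows that conversion together with one simple way to check calibration: group calls by predicted quality value and compare the predicted error rate with the observed one. The calibration_table helper and its input format are assumptions made for illustration and do not describe the project's actual procedure.

```python
import math
from collections import defaultdict


def phred_quality(error_probability):
    """Convert an error probability into a phred-style quality value."""
    return -10.0 * math.log10(error_probability)


def error_probability(quality):
    """Convert a phred-style quality value back into an error probability."""
    return 10.0 ** (-quality / 10.0)


def calibration_table(calls):
    """Compare predicted and observed error rates per quality value.

    calls is an iterable of (predicted_quality, is_correct) pairs, where
    is_correct comes from comparing each call with a trusted reference.
    Returns {quality: (predicted_error_rate, observed_error_rate)}.
    """
    counts = defaultdict(lambda: [0, 0])  # quality -> [total calls, errors]
    for quality, is_correct in calls:
        counts[quality][0] += 1
        if not is_correct:
            counts[quality][1] += 1
    return {
        quality: (error_probability(quality), errors / total)
        for quality, (total, errors) in sorted(counts.items())
    }


if __name__ == "__main__":
    # Hypothetical data: 100 calls at Q20, one of which is wrong.
    calls = [(20, True)] * 99 + [(20, False)]
    for quality, (predicted, observed) in calibration_table(calls).items():
        print(quality, predicted, observed)
```

A well-calibrated basecaller would show predicted and observed error rates that agree closely in every quality bin, given enough calls per bin.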

IC Name
NATIONAL HUMAN GENOME RESEARCH INSTITUTE
  • Activity
    R01
  • Administering IC
    HG
  • Application Type
    5
  • Direct Cost Amount
  • Indirect Cost Amount
  • Total Cost
    161534
  • Sub Project Total Cost
  • ARRA Funded
  • CFDA Code
    172
  • Ed Inst. Type
    GRADUATE SCHOOLS
  • Funding ICs
    NHGRI:161534
  • Funding Mechanism
  • Study Section
    ZRG1
  • Study Section Name
    Special Emphasis Panel
  • Organization Name
    UNIVERSITY OF ST. THOMAS
  • Organization Department
    NONE
  • Organization DUNS
    606870090
  • Organization City
    ST PAUL
  • Organization State
    MN
  • Organization Country
    UNITED STATES
  • Organization Zip Code
    55105
  • Organization District
    UNITED STATES