Measuring functional similarity between transcriptional enhancers using deep learning

Information

  • Research Project
  • ApplicationId
    10302539
  • Core Project Number
    R21HG011507
  • Full Project Number
    1R21HG011507-01A1
  • Serial Number
    011507
  • FOA Number
    PA-20-195
  • Sub Project Id
  • Project Start Date
    9/1/2021
  • Project End Date
    8/31/2023
  • Program Officer Name
    GILCHRIST, DANIEL A
  • Budget Start Date
    9/1/2021
  • Budget End Date
    8/31/2023
  • Fiscal Year
    2021
  • Support Year
    01
  • Suffix
    A1
  • Award Notice Date
    8/25/2021

PROJECT SUMMARY Understanding transcriptional regulation remains a major task in molecular biology. Enhancers are genetic elements that regulate when and where genes are expressed, as well as their expression levels. These elements are hard to discover because their locations and orientations are not constrained with respect to their target genes. Several diseases, and susceptibility to certain diseases, are linked to mutations and variants in enhancers. Multiple experimental and computational methods have been developed for locating enhancers. Computational methods are better suited to the large number of genomes now being sequenced because they are faster, cheaper, and less labor intensive than experimental methods. Despite the many computational tools available, we lack a sophisticated tool that can measure the similarity in enhancer activity between a pair of sequences. We propose using Deep Artificial Neural Networks (DANNs) to develop such a tool. The long-term objective of this project is to decipher the code governing gene regulation, with the following specific aims: (i) design a computational tool for measuring enhancer-enhancer similarity, (ii) validate up to 96 putative enhancers experimentally, (iii) understand enhancer grammar, and (iv) annotate enhancers in more than 50 insect genomes. To achieve these aims, a novel application of DANNs is proposed. Current tools use DANNs to answer a yes-no question: does a sequence have similar activity to the tissue-specific enhancers comprising a particular training set of known enhancers? These approaches require training a separate network for each tissue, leading to inconsistent performance across tissues. Instead, we use a DANN to answer a related but different question: does this sequence have similar enhancer activity to a single known tissue-specific enhancer? This deep network should perform consistently across cell types because it is trained on pairs of sequences (not on individual sequences, as is the case in the available tools) representing all tissues for which there are known enhancers. The DANN is trained to recognize sequence pairs with similar enhancer activities and pairs with dissimilar activities, the latter including (i) two enhancers active in two different tissues, (ii) one enhancer and a random genomic sequence, and (iii) two random genomic sequences. The tool outputs a score between 0 and 1 indicating how similar the enhancer activities of the two sequences are. Using a much simpler machine learning algorithm than DANNs, we demonstrate that pairs with similar enhancer activities can be separated with very high accuracy from pairs of random genomic sequences and from pairs of one enhancer and a random genomic sequence. The new tool has many important potential applications, including consistent annotation of enhancers across cell types and related species. Our tool can annotate enhancers active in a cell type that has only a small number of known enhancers, and it can annotate enhancers in related genomes when a set of known enhancers is demarcated in one of them. Discovering new transcription factor binding sites is another potential application. The proposed tool can also facilitate the study of enhancer "design principles" and of the effects of variants. Such applications will advance our field.
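
The pair-based training setup described in the summary can be illustrated with a minimal sketch. This is not the project's implementation: the function name build_pairs, the toy tissue names ("wing", "eye"), and the toy sequences are all hypothetical, and the sketch assumes that "similar enhancer activity" means two enhancers active in the same tissue. It only shows how one positive class and the three dissimilar pair types listed in the summary could be assembled into labeled training examples.

```python
from itertools import combinations, product


def build_pairs(enhancers_by_tissue, random_seqs):
    """Return (seq_a, seq_b, label) triples: 1 = similar activity, 0 = dissimilar.

    Assumption (not stated in the summary): two enhancers count as
    "similar" when they are active in the same tissue.
    """
    pairs = []
    tissues = sorted(enhancers_by_tissue)

    # Similar pairs: two enhancers active in the same tissue.
    for tissue in tissues:
        for a, b in combinations(enhancers_by_tissue[tissue], 2):
            pairs.append((a, b, 1))

    # Dissimilar type (i): two enhancers active in two different tissues.
    for t1, t2 in combinations(tissues, 2):
        for a, b in product(enhancers_by_tissue[t1], enhancers_by_tissue[t2]):
            pairs.append((a, b, 0))

    # Dissimilar type (ii): one enhancer and one random genomic sequence.
    all_enhancers = [s for t in tissues for s in enhancers_by_tissue[t]]
    for a, b in product(all_enhancers, random_seqs):
        pairs.append((a, b, 0))

    # Dissimilar type (iii): two random genomic sequences.
    for a, b in combinations(random_seqs, 2):
        pairs.append((a, b, 0))

    return pairs


# Hypothetical toy data, for illustration only.
toy_enhancers = {
    "wing": ["ACGTAC", "ACGTTC"],
    "eye": ["GGCATA", "GGCTTA"],
}
toy_random = ["TTTTAA", "CCGCGC"]

pairs = build_pairs(toy_enhancers, toy_random)
n_similar = sum(label for _, _, label in pairs)
print(len(pairs), "pairs,", n_similar, "labeled similar")
```

A network trained on such triples would take both sequences as input and emit a score between 0 and 1, so a single model covers every tissue with known enhancers instead of one model per tissue.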

IC Name
NATIONAL HUMAN GENOME RESEARCH INSTITUTE
  • Activity
    R21
  • Administering IC
    HG
  • Application Type
    1
  • Direct Cost Amount
    293951
  • Indirect Cost Amount
    73940
  • Total Cost
    367891
  • Sub Project Total Cost
  • ARRA Funded
    False
  • CFDA Code
    172
  • Ed Inst. Type
    BIOMED ENGR/COL ENGR/ENGR STA
  • Funding ICs
    NHGRI:367891
  • Funding Mechanism
    Non-SBIR/STTR RPGs
  • Study Section
    BDMA
  • Study Section Name
    Biodata Management and Analysis Study Section
  • Organization Name
    TEXAS A&M UNIVERSITY-KINGSVILLE
  • Organization Department
    ENGINEERING (ALL TYPES)
  • Organization DUNS
    868154089
  • Organization City
    KINGSVILLE
  • Organization State
    TX
  • Organization Country
    UNITED STATES
  • Organization Zip Code
    783638202
  • Organization District
    UNITED STATES