III-CXT: Learning from graph-structured data: new algorithms for modeling physical interactions in cellular networks

The complex behavior of the cell derives from an intricate network of molecular interactions among thousands of genes and their products. Understanding how this network operates and predicting its behavior are primary goals of biology and have broad implications for life science, medicine and biotechnology.

The genomic information revolution of the last ten years has enabled new systems-level and data-driven approaches for studying cellular networks. In particular, using machine learning to model gene regulatory networks (the switching on and off of genes by regulatory proteins that bind to non-coding DNA) has emerged as a central problem in systems biology. Now, an explosion of new high-throughput technologies for measuring physical interactions between proteins and between proteins and DNA provides a new data integration challenge for computational modeling of gene regulation. These new data can all be viewed as graph-structured data, or physical interaction networks.

The central computational goal of this project is to develop new machine learning algorithms for exploiting graph-structured data, including: (1) boosting with efficient graph mining; (2) graph kernels based on subgraph histogramming; and (3) information-based graph partitioning. These new algorithms will be used to integrate physical interaction network data into models of gene regulation in order to better represent underlying biological mechanisms. The focus will be on two fundamental modeling problems: inferring signal transduction pathways and modeling cis-regulatory modules at the level of DNA sequence and interacting regulatory proteins.
The algorithms will be applied both to publicly available data and to primary gene expression data provided by one of the investigators to study hypoxia in yeast and the response to environmental toxins in mammalian neural cells.

This project will learn systems-level models that lead to new insight into the underlying mechanisms of gene regulation and open the way to broader biological discoveries. All data, results and source code will be publicly available via the Web (http://www.cs.columbia.edu/compbio/cellular-networks) and disseminated through courses and bioinformatics software packages. The project will also create undergraduate research opportunities for joint dry and wet lab projects and outreach activities to introduce New York City public high school students to new interdisciplinary areas of science.