This award to the J. Craig Venter Institute (JCVI) is to develop novel methods for analyzing metagenomic sequence data. Metagenomics is the study of the genomic content of microbial communities using cultivation-independent techniques. This paradigm has revolutionized the field of microbiology, especially given our inability to cultivate a majority of the microbes that exist in many environments. Next-generation sequencing technologies are used routinely to generate large volumes of nucleotide sequence read data from metagenomic samples. The identification of full-length protein sequences from these data allows for a comprehensive and accurate analysis of the metabolic potential of the constituent microbes. This is often implemented by assembling the nucleotide sequence reads and then using the resulting set of contigs as the substrate for protein identification. However, metagenomic assemblies are typically very fragmented, producing short contigs and leaving a large fraction of reads unassembled, thereby limiting the utility of this approach. This project will develop a sequence assembly framework for reconstructing full-length protein sequences directly from short peptide fragments identified on nucleotide reads. This approach is motivated by two observations: (a) the high coding density observed in prokaryotic genomes, which implies that most of the nucleotide reads will contain at least part of a protein; and (b) the redundancy in the genetic code, which alleviates the effect of nucleotide-level polymorphisms that greatly confound nucleotide assembly, and allows for the reconstruction of protein sequences even when the underlying constituent nucleotide sequences are not identical. The peptide assembler output will be used to develop a framework for analyzing and comparing metagenomic samples based on their protein and pathway abundances.
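The second observation above can be illustrated with a minimal sketch. It is not the project's assembler: the reads, the function names (`translate`, `merge_by_overlap`), and the greedy suffix-prefix merge are all hypothetical, chosen only to show how two reads that differ by synonymous substitutions fail to overlap at the nucleotide level yet merge cleanly once translated. The codon table is the standard genetic code, and both reads are assumed to be in frame.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(read):
    """Translate an in-frame nucleotide read into a peptide fragment."""
    return "".join(
        CODON_TABLE[read[i:i + 3]] for i in range(0, len(read) - 2, 3)
    )


def merge_by_overlap(left, right, min_overlap):
    """Greedy merge: join `right` onto `left` using the longest exact
    suffix-prefix overlap of at least `min_overlap` characters.
    Returns None when no such overlap exists."""
    for k in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:k]):
            return left + right[k:]
    return None


# Two hypothetical reads from strains that differ only by synonymous changes.
read_1 = "ATGAAAACCGCTTATATT"      # encodes M K T A Y I
read_2 = "ACTGCATACATCGCGAAACAG"   # encodes T A Y I A K Q

# Nucleotide level: the shared region differs, so no exact overlap is found.
nucleotide_merge = merge_by_overlap(read_1, read_2, min_overlap=9)

# Peptide level: both reads agree on T-A-Y-I, so the fragments merge.
peptide_1 = translate(read_1)
peptide_2 = translate(read_2)
peptide_merge = merge_by_overlap(peptide_1, peptide_2, min_overlap=3)

print(nucleotide_merge)  # None
print(peptide_merge)     # MKTAYIAKQ
```

A real peptide assembler would also have to handle unknown reading frames, sequencing errors, and non-synonymous variation; this sketch deliberately ignores all three.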
Open-source documented software packages for assembly and analysis will be created by this project, and made available for use by the research community.

This project will create infrastructure and tools for training in the broad area of metagenomic informatics to make data analysis concepts accessible to the wider community doing research in the biological sciences, and for mathematics and science education. Additionally, postdoctoral researchers and interns will be trained in metagenomics and computational biology. An educational module will also be created to introduce high school teachers, via workshops, to bioinformatic methods in genomics and metagenomics. The teachers participating in the workshops will subsequently teach the curriculum in their classrooms. This approach will promote the excitement of discovery and research in young scientists.