Protein sequencing plays an important role in identifying protein functions, analyzing protein-protein interactions, and characterizing post-translational modifications. Despite recent progress in protein sequencing and assembly, many of the currently available assembled proteins exist only in draft form. Many gaps remain in the assembled protein sequences even when top-down and bottom-up sequencing methods are combined. In other words, at the end of the sequencing step for a specific protein, the result is more likely to be a set of contigs separated by gaps (called a scaffold) than a complete sequence. Hence, an important and natural combinatorial problem is to fill the missing amino acids into a scaffold to obtain a complete protein sequence. With the new framework produced by this project, de novo protein sequencing will greatly advance the research and clinical practice of identifying the function and structure of proteins. The project will provide researchers with powerful computational tools for obtaining the sequence information of antibodies, which is extremely valuable for the construction of antibody databases. This interdisciplinary research also provides various training projects to students at all levels, particularly to underrepresented African American students, and helps them pursue high-quality research from an open-minded and cross-disciplinary perspective. New advances will be integrated into undergraduate and graduate curricula. The results will be disseminated through journal publications, conferences, open-source software releases, tutorials, and seminar talks.

In this project, the investigators will study the mass spectrometry-based de novo protein scaffold filling problem in two related phases. Firstly, the investigators will analyze top-down and bottom-up tandem mass spectrometry data to construct the protein scaffold without a suitable reference.
The methods include general global optimization, dynamic programming, and graph algorithms, which can also handle small protein mutations (where the mass of some amino acids changes). Secondly, the investigators will use deep learning methods, such as combined convolutional neural network and long short-term memory models, convolutional denoising autoencoders, and transformer models, to complete the final step of protein sequencing: filling the scaffold obtained from the top-down and bottom-up tandem mass spectrometry analysis in the first phase. The project will result in a new framework of combined combinatorial and deep learning methods for protein scaffold filling, along with corresponding open-source software.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
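
To make the combinatorial side of scaffold filling concrete, the following is a minimal illustrative sketch and not the project's actual method. It assumes a gap is described only by its total mass, rounded to nominal integer residue masses in daltons, and uses dynamic programming to count the candidate fillings and recover one of them. The function names `count_fillings` and `one_filling` are hypothetical.

```python
from typing import Optional

# Nominal (integer) residue masses in daltons for the 20 standard amino acids.
# Assumption for this sketch: real spectra need monoisotopic masses and tolerances.
RESIDUE_MASS = {
    "G": 57, "A": 71, "S": 87, "P": 97, "V": 99,
    "T": 101, "C": 103, "L": 113, "I": 113, "N": 114,
    "D": 115, "Q": 128, "K": 128, "E": 129, "M": 131,
    "H": 137, "F": 147, "R": 156, "Y": 163, "W": 186,
}


def count_fillings(gap_mass: int) -> int:
    """Count ordered residue strings whose nominal masses sum to gap_mass."""
    ways = [0] * (gap_mass + 1)
    ways[0] = 1  # the empty string fills a gap of mass 0
    for m in range(1, gap_mass + 1):
        # The last residue has mass r; the prefix must fill mass m - r.
        ways[m] = sum(ways[m - r] for r in RESIDUE_MASS.values() if r <= m)
    return ways[gap_mass]


def one_filling(gap_mass: int) -> Optional[str]:
    """Return one residue string with total nominal mass gap_mass, or None."""
    # last[m] is a residue that ends some valid filling of mass m.
    last = [None] * (gap_mass + 1)
    reachable = [False] * (gap_mass + 1)
    reachable[0] = True
    for m in range(1, gap_mass + 1):
        for aa, r in RESIDUE_MASS.items():
            if r <= m and reachable[m - r]:
                reachable[m] = True
                last[m] = aa
                break
    if not reachable[gap_mass]:
        return None
    # Walk back from gap_mass to 0, collecting the chosen residues.
    residues = []
    m = gap_mass
    while m > 0:
        residues.append(last[m])
        m -= RESIDUE_MASS[last[m]]
    return "".join(reversed(residues))
```

Even this toy version shows why mass alone is not enough: leucine and isoleucine share a nominal mass, as do glutamine and lysine, and the number of candidate fillings grows quickly with gap size. That ambiguity is part of what motivates combining combinatorial search with additional evidence from the spectra and with learned models.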