Optimizing the Utility of Electronic Medical Records Data in Data-driven Health Research

ABSTRACT

Medical centers continue to archive patient follow-up data in Electronic Medical Records (EMR), which hold tremendous value for discovering new knowledge and insights. The large volume of EMR data can play an important role in improving the accuracy and generalizability of predictive models in healthcare, especially given reports that medical errors, including misdiagnosis, are the third leading cause of death in the United States. Despite these merits, EMR data are invariably corrupted by missing values, outliers, and unrealistic measurements, which prevent researchers from fully utilizing such abundant data in many important studies. Many studies simply discard a large number of samples to eliminate missingness, which ultimately biases their data-driven analytical models. Existing techniques for missing data imputation use simplified linear models and are mostly suited to cross-sectional data, ignoring the longitudinal missingness found in patient follow-up data. This proposal aims to investigate novel artificial intelligence (AI) based models to improve the quality and utility of EMR data in preparation for data-driven retrospective studies. Toward this preparation, the goals of the project are 1) to investigate data imputation models that are more accurate and robust than existing ones and 2) to adapt state-of-the-art deep learning techniques to prepare an optimal representation of large EMR data. The proposed research will 1) maximize the quality and utility of EMR data to support a multitude of retrospective studies, 2) enable visualization of complex patient data, 3) identify the most important and predictive clinical parameters, and 4) yield a compact and optimal representation of large EMR datasets. We hypothesize that EMR data optimally processed with state-of-the-art AI models can model patient risk more accurately than existing statistical and clinical risk models.
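The contrast drawn above between discarding incomplete samples, cross-sectional imputation, and imputation that respects each patient's follow-up sequence can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the toy records, the field layout, and the function names are hypothetical, and last-observation-carried-forward is used only as a simple stand-in for a longitudinal method, not as the AI-based models this proposal will investigate.

```python
from statistics import mean

# Hypothetical toy follow-up data: (patient_id, visit_number, systolic_bp).
# None marks a missing measurement. These values are illustrative only.
records = [
    ("A", 1, 120.0), ("A", 2, None), ("A", 3, 130.0),
    ("B", 1, 90.0),  ("B", 2, None), ("B", 3, None),
]


def complete_case(rows):
    """Discard every row that has a missing value (complete-case analysis)."""
    return [r for r in rows if r[2] is not None]


def mean_impute(rows):
    """Cross-sectional imputation: fill gaps with the mean over all patients."""
    observed = [r[2] for r in rows if r[2] is not None]
    fill = mean(observed)
    return [(p, v, x if x is not None else fill) for p, v, x in rows]


def locf_impute(rows):
    """Longitudinal stand-in: carry each patient's last observed value forward."""
    last_seen = {}
    out = []
    for p, v, x in sorted(rows, key=lambda r: (r[0], r[1])):
        if x is None:
            # Stays None if this patient has no earlier observed visit.
            x = last_seen.get(p)
        else:
            last_seen[p] = x
        out.append((p, v, x))
    return out


if __name__ == "__main__":
    print(len(records), "rows before,", len(complete_case(records)), "after discarding")
    print(mean_impute(records))
    print(locf_impute(records))
```

In this toy data, discarding incomplete rows removes half of the visits, and the global mean assigns patient B a value drawn mostly from patient A, whereas the per-patient fill keeps each imputed value within that patient's own history.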
This project will combine the complementary expertise of the collaborators, Dr. Manar Samad, PhD (Computer Science), Dr. Owen Johnson, DPH (Biostatistics and Public Health), and Dr. Edilberto Raynes, MD, PhD (Medicine), along with participating undergraduate students at Tennessee State University (TSU). The proposal entails several research and development components that will allow undergraduate students to gain valuable research and analytical skills in data science, programming, and health informatics. The project activities will expose health science students to AI-based computing solutions, broadening the scope of their future health research and careers. This project will help TSU prepare a strong workforce of minority students with competitive skill sets in data science and health informatics, which are currently in high demand. Overall, the project will develop a data-capable workforce and strengthen interdisciplinary research capacity and collaboration between the computer science and health science departments at TSU.