Modern scientific practice is rooted in the statistical testing of hypotheses on data. To limit the risk of false discoveries, these tests must offer strict statistical guarantees. The task is very challenging due to the sheer amount of rich data available today and to the ever-increasing number of complex hypotheses that scientists want to test on the same data. For science to advance, and thereby advance society and human well-being, it is of the utmost importance that scientists be given tools that overcome these challenges. This project will design novel computational methods for statistical hypothesis testing that tackle all of the above challenges by combining modern statistical results with recent approaches from knowledge discovery and data mining, a field of computer science dealing with the efficient analysis of data. As part of its educational activities, the project will develop materials for college-level courses so that the next generation of scientists and computer scientists possesses the intellectual and practical knowledge needed for statistically sound data analysis and hypothesis testing, by using and extending the methods developed in the project. A diverse cohort of undergraduate students will be involved in the research and educational components of the project.

The team of researchers in this project will design and mathematically analyze algorithms that make statistical hypothesis testing iterative and scalable along multiple dimensions. Many existing statistical procedures are already computationally expensive when testing a single hypothesis on moderate-size datasets, and they become even more inefficient as the amount of data or the number of hypotheses grows. Along the dimension of data complexity, available tests often lack scalability because they are limited to simple types of data (e.g., binary tables), while fewer methods are available for rich data such as attributed graphs or panel time series. The lack of scalable methods may be due in part to the requirement that hypothesis tests satisfy stringent statistical guarantees (e.g., control of the Family-Wise Error Rate (FWER) and the False Discovery Rate (FDR)) to ensure that subsequent inference is sound. Additionally, the iterative nature of data analysis in practice has largely been ignored in the design of statistical tests, yet accounting for it is crucial to ensure that these guarantees are satisfied. This project will develop algorithms for the scalable and iterative statistical testing of multiple complex hypotheses on massive rich datasets, while imposing only weak assumptions on the data-generation process and controlling the FWER and the FDR. These results will be achieved by bringing together two areas of computer science research that have had, until now, only very limited points of contact: statistical learning theory and data mining. The novel methods developed in this project will use concepts from the former, such as (local) Rademacher averages, covering numbers, and pseudodimension, to exploit the structure of the class of hypotheses being tested and achieve better sample complexity bounds, which translate into higher statistical power and improved control of the FWER/FDR, even in an iterative data analysis setting. These concepts will be adapted to statistical hypothesis testing and strengthened to fully exploit their practical usefulness, especially on rich datasets and in the presence of dependencies between the data points.
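As a point of reference for the kind of tool the project builds on (not the project's own methods), the following minimal Python sketch estimates the empirical Rademacher average of a finite family of hypotheses by Monte Carlo and plugs it into a classical uniform-deviation bound. The function names, the Monte Carlo estimator, and the specific constants are illustrative assumptions; constants vary across statements of the bound in the literature.

```python
import numpy as np

def empirical_rademacher(values, n_draws=1000, seed=0):
    """Monte Carlo estimate of the empirical Rademacher average of a finite
    family of hypotheses.

    values: array of shape (n_hypotheses, n_samples); entry [h, i] is the
    value hypothesis h takes on data point i, assumed bounded in [0, 1].
    """
    rng = np.random.default_rng(seed)
    n = values.shape[1]
    total = 0.0
    for _ in range(n_draws):
        sigma = rng.choice([-1.0, 1.0], size=n)   # random Rademacher signs
        total += np.max(values @ sigma) / n       # sup over hypotheses of the signed average
    return total / n_draws

def uniform_deviation_bound(values, delta=0.05, n_draws=1000):
    """With probability at least 1 - delta, the true mean of every hypothesis
    in the family exceeds its empirical mean by at most this quantity
    (one classical form of the bound; constants differ across the literature)."""
    n = values.shape[1]
    r_hat = empirical_rademacher(values, n_draws)
    return 2.0 * r_hat + 3.0 * np.sqrt(np.log(2.0 / delta) / (2.0 * n))

# Example: 50 random {0,1}-valued hypotheses evaluated on 1000 data points.
values = np.random.default_rng(1).integers(0, 2, size=(50, 1000)).astype(float)
print(uniform_deviation_bound(values))
```

In the project's setting, sharper (e.g., local) versions of such data-dependent bounds are what would drive the improved sample complexity, statistical power, and FWER/FDR control described above.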
The project team will also use techniques from the knowledge discovery task of pattern mining to efficiently explore the space of hypotheses and filter out those that are definitively not significant. To reach this goal, the project team will develop novel bounds for the p-value functions of different tests and adapt these techniques to rich datasets such as attributed graphs; a classical instance of this filtering idea is sketched below.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
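The novel p-value bounds themselves are part of the proposed research, but the underlying filtering idea can be illustrated with a classical ingredient from significant pattern mining: Tarone's observation that a pattern's support fixes the smallest p-value Fisher's exact test could ever assign to it, so patterns that cannot reach the corrected significance threshold can be discarded without being tested. The sketch below is a simplified, conservative variant of that idea, not the project's method; the function names and the dictionary-based interface are illustrative assumptions.

```python
from math import comb

def min_attainable_pvalue(support, n_cases, n_total):
    """Smallest one-sided p-value Fisher's exact test can assign to a pattern
    with the given support: the hypergeometric probability of the most extreme
    2x2 table, reached when every occurrence of the pattern falls in a case."""
    x = min(support, n_cases)
    return comb(n_cases, x) * comb(n_total - n_cases, support - x) / comb(n_total, support)

def prune_never_significant(pattern_supports, n_cases, n_total, alpha=0.05):
    """Discard patterns whose minimum attainable p-value already exceeds the
    Bonferroni-corrected level, then recompute the correction over the survivors.
    pattern_supports maps each candidate pattern to its support."""
    threshold = alpha / max(len(pattern_supports), 1)
    kept = {p: s for p, s in pattern_supports.items()
            if min_attainable_pvalue(s, n_cases, n_total) <= threshold}
    # Only the survivors are actually tested, so the multiplicity correction
    # needs to account only for them (a Tarone-style reduction).
    return kept, alpha / max(len(kept), 1)

# Example: 4 candidate patterns in a dataset of 500 transactions, 100 of which are cases.
kept, corrected_alpha = prune_never_significant(
    {"A": 40, "B": 3, "C": 75, "D": 1}, n_cases=100, n_total=500)
print(kept, corrected_alpha)
```

In the project, analogous but sharper bounds, adapted to richer data such as attributed graphs, would play the same pruning role while preserving control of the FWER and the FDR.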