Many researchers argue that commercial services that yield a high volume of data have little scientific utility because their users are not representative of the general population. This project develops methods for generalizing inferences drawn from non-representative big data sources. The researchers are implementing a multi-phase survey to compare survey results with online social networking data. They will collect a nationwide probability sample and a separate sample of online users, and use novel statistical procedures to combine them in a manner that will enable statistically valid estimators to be produced from the universe of online users. The results of this project will be broadly applicable, enabling more accurate statistical models to be created for applications in medicine, economics, and many other fields.

Statistical weighting will be used to match the online media sample with the probability survey sample across a set of auxiliary variables observed for both samples. The researchers will then match the data in the universe of online users to the weighted sample across a second set of variables that is measured for all users. Doing so will make it possible to extract results from media users that are more representative of the general population. To produce real-time forecasts, Bayesian models for mixed-frequency time series will be used to combine the weighted online-based analyses with traditional polls.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
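
The weighting step described above, in which the online sample is matched to the probability sample on shared auxiliary variables, can be illustrated with a minimal sketch. The abstract does not specify the weighting procedure, so this example assumes simple post-stratification on a single auxiliary variable. The function names, the age-group cells, the outcome values, and the reference shares are all hypothetical and are not taken from the project.

```python
from collections import Counter


def poststratification_weights(online_cells, reference_shares):
    """Return one weight per online respondent.

    online_cells: the auxiliary-variable cell of each online respondent.
    reference_shares: the share of each cell in the reference population,
        here assumed to be estimated from the probability sample.
    Each weight is (reference share) / (online sample share) for that cell,
    so the weighted cell shares of the online sample match the reference.
    """
    counts = Counter(online_cells)
    n = len(online_cells)
    return [reference_shares[cell] / (counts[cell] / n) for cell in online_cells]


def weighted_mean(values, weights):
    """Weighted average of values."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Hypothetical toy data: the online sample over-represents the "young" cell.
online_cells = ["young", "young", "young", "old"]
outcome = [1, 1, 0, 0]
reference_shares = {"young": 0.5, "old": 0.5}  # assumed probability-sample shares

weights = poststratification_weights(online_cells, reference_shares)
raw = sum(outcome) / len(outcome)            # unweighted online estimate
adjusted = weighted_mean(outcome, weights)   # estimate after reweighting
```

In this toy case the unweighted online estimate is 0.5, while the reweighted estimate is 1/3, because the over-represented cell is weighted down. With several auxiliary variables, a procedure such as raking would typically be used in place of single-variable post-stratification; that choice is likewise an assumption here.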