Monitoring of environmental conditions includes measuring the levels of various components of the surroundings, allowing detection of potentially harmful air pollution, radiation, greenhouse gases, or other contaminants in the environment. Monitoring of environmental conditions typically includes gathering environmental data. Environmental data includes detection and measurement of pollutants or contaminants such as nitrogen dioxide (NO2), carbon monoxide (CO), nitrogen oxide (NO), ozone (O3), sulfur dioxide (SO2), carbon dioxide (CO2), methane (CH4), volatile organic compounds (VOC), air toxics, temperature, sound, radiation, and particulate matter. In order to assess the effects of such pollutants, it is desirable to associate environmental data sensing these pollutants at particular times with geographic locations (homes, businesses, towns, etc.). Such an association would allow individuals and communities to evaluate the quality of their surroundings.
Environmental data collected over a time period and a geographic region is generally sparsely populated along a spatial dimension and/or a temporal dimension. Plausible values for environmental data at locations or times for which an observation is not collected may be predicted to populate an environmental dataset. Thus, a mechanism for improving collection and processing of environmental data is desired.
Various embodiments of the invention are disclosed in the following detailed description and the accompanying drawings.
The invention can be implemented in numerous ways, including as a process; an apparatus; a system; a composition of matter; a computer program product embodied on a computer readable storage medium; and/or a processor, such as a processor configured to execute instructions stored on and/or provided by a memory coupled to the processor. In this specification, these implementations, or any other form that the invention may take, may be referred to as techniques. In general, the order of the steps of disclosed processes may be altered within the scope of the invention. Unless stated otherwise, a component such as a processor or a memory described as being configured to perform a task may be implemented as a general component that is temporarily configured to perform the task at a given time or a specific component that is manufactured to perform the task. As used herein, the term ‘processor’ refers to one or more devices, circuits, and/or processing cores configured to process data, such as computer program instructions.
A detailed description of one or more embodiments of the invention is provided below along with accompanying figures that illustrate the principles of the invention. The invention is described in connection with such embodiments, but the invention is not limited to any embodiment. The scope of the invention is limited only by the claims and the invention encompasses numerous alternatives, modifications and equivalents. Numerous specific details are set forth in the following description in order to provide a thorough understanding of the invention. These details are provided for the purpose of example and the invention may be practiced according to the claims without some or all of these specific details. For the purpose of clarity, technical material that is known in the technical fields related to the invention has not been described in detail so that the invention is not unnecessarily obscured.
As used herein, road type includes a classification of a particular road segment. The classification may be determined by a third-party map service. For example, open street maps (OSM) comprises labels attributed to different road segments. Examples of road types include highway, major roads, residential roads, etc. Additional information for each road may indicate a number of lanes, a road surface, a maximum speed, a minimum speed, an indication that the corresponding road segment is within a school zone, an indication that the corresponding road segment is within a construction zone, etc.
Generally, environmental data cannot be sampled everywhere in a predefined geographic region (e.g., a contract region over which environmental data, such as pollutant concentrations, are to be measured) at all times continuously. The system can use the environmental data to generate maps or other representations or time series, or to make predictions of environmental characteristics (e.g., pollutant concentrations) at a particular location and a particular point in time. Effective use of the environmental data to represent or predict an environmental characteristic generally requires the system to ensure that the environmental data used to generate the representations or predictions is accurate. As such, the system generally takes into consideration invalid or inaccurate sampling.
According to various embodiments, the system augments an environmental dataset with best guesses. The environmental dataset may be a dataset of collected samples such as sampling obtained during a session in which a vehicle travels according to a predefined drive plan. Augmenting the environmental dataset with best guesses includes imputing values for environmental data at locations and/or times that are unsampled. The system may use the sampled environmental data and one or more models (e.g., statistical models) of the effects of different features (e.g., spatial features such as topography or road type, temporal features, and/or spatiotemporal features). In some embodiments, the system determines the one or more models based on an environmental dataset (e.g., data collected over a predefined period of time, such as a contract length for pollutant monitoring) to determine the contributions made to the collected environmental data by one or more features to predict the collected environmental data. The system obtains the environmental dataset, extracts information from the environmental data comprised in the environmental dataset (e.g., spatial or temporal structures), and imputes observations to a particular location for a particular time.
Various embodiments provide a system, method, and device for augmenting environmental data. The method includes (i) obtaining an environmental dataset having a first data resolution, and (ii) determining an augmented environmental dataset based at least in part on the environmental dataset, a set of spatial features, a set of temporal features, and a set of spatiotemporal features. The model has a second data resolution that is finer than the first data resolution. In some embodiments, the model has fewer null values for a cell corresponding to a predefined location at a predefined time than the environmental dataset. In some embodiments, the model has no missing values.
Various embodiments provide a system, method, and device for augmenting environmental data. The method includes (i) obtaining a sparse environmental dataset, and (ii) determining an augmented environmental dataset based at least in part on the environmental dataset, a set of spatial features, a set of temporal features, and a set of spatiotemporal features. The resulting augmented environmental dataset has no missing values.
Various embodiments provide a system, method, and device for augmenting environmental data. The method includes (i) obtaining an environmental dataset (e.g., a sparse environmental dataset), and (ii) determining an augmented environmental dataset based at least in part on the environmental dataset, a set of spatial features, a set of temporal features, and a set of spatiotemporal features. The augmented environmental dataset is less sparse than the environmental dataset. In some embodiments, the augmented environmental dataset has fewer missing or null values than the environmental dataset. In some embodiments, a system uses a Data Interpolating Empirical Orthogonal Functions (DINEOF)/Kalman Filter (KF) (DKF) to determine imputed values for a sparsely populated environmental dataset. The system uses the DKF to leverage the spatiotemporal correlations between measurements (e.g., sampled observations) to perform imputation of values to populate the environmental dataset (e.g., in connection with determining an augmented environmental dataset). The system uses a DKF on an environmental dataset having a particular temporal resolution (e.g., a daily temporal resolution) to impute values to empty cells in a matrix representation of the environmental dataset. In some embodiments, the system imputes the values without using any extraneous information or other information outside of the collected environmental data. For example, the system imputes the values based on the particular pollutant being analyzed. As an illustrative example, if the system is monitoring/analyzing ozone, then the system generates an augmented dataset that comprises the collected ozone dataset combined with the associated imputed values that are generated based on the ozone dataset without other information.
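As a rough illustration of filling empty cells in a location-by-time matrix from the collected data alone, the following is a minimal sketch of a DINEOF-style iteration (repeated truncated singular value decomposition). It is an assumption-laden simplification and not the DKF itself: the Kalman filter stage is omitted, the number of retained modes (`rank`) is fixed by hand rather than chosen by cross-validation, and the function name `dineof_impute` and its parameters are hypothetical.

```python
import numpy as np


def dineof_impute(data, rank=1, max_iter=200, tol=1e-8):
    """Fill NaN cells of a (location x time) matrix by iterated truncated SVD.

    Hypothetical helper: a simplified DINEOF-style loop, without the
    Kalman filter stage and without cross-validated rank selection.
    """
    data = np.asarray(data, dtype=float)
    missing = np.isnan(data)
    # Work on anomalies: remove the mean of the observed cells.
    mean = data[~missing].mean()
    filled = np.where(missing, 0.0, data - mean)
    for _ in range(max_iter):
        # Keep only the leading `rank` modes of the current matrix.
        u, s, vt = np.linalg.svd(filled, full_matrices=False)
        recon = (u[:, :rank] * s[:rank]) @ vt[:rank, :]
        if missing.any():
            change = np.sqrt(np.mean((recon[missing] - filled[missing]) ** 2))
        else:
            change = 0.0
        # Only missing cells are overwritten; observations stay untouched.
        filled[missing] = recon[missing]
        if change < tol:
            break
    return filled + mean


# Toy matrix: rows are road segments, columns are days, NaN is unsampled.
sparse = np.array([[1.0, 2.0, np.nan],
                   [2.0, np.nan, 6.0],
                   [3.0, 6.0, 9.0]])
augmented = dineof_impute(sparse, rank=1)
```

In this sketch the observed cells are returned unchanged and only the empty cells receive imputed values, which mirrors the description of an augmented dataset as the collected data combined with imputed values.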
In some embodiments, the system pre-processes the environmental dataset (e.g., pass data from a set of driving sessions) to obtain decycled residuals data by removing road-type-specific diurnal cycles (e.g., contributions from morning and evening rush hours). Diurnal cycles may also arise based on reactions with pollutants that are dependent on radiation (e.g., ozone, nitrogen dioxide, etc.), photochemistry effects on the pollutants, solar radiation that may drive or contribute to production cycles, or other loss mechanisms (e.g., humidity, time of day, etc.). In some embodiments, the system uses DKF to obtain a dataset of daily environmental characteristic measurements (e.g., daily pollutant concentrations) for all road segments within a predefined geographic region (e.g., a contracted region). To obtain the dataset of daily environmental characteristic measurements, the system uses the decycled residuals data. As an example, the system removes the diurnal cycle from the raw pass data so that the resulting hyperlocal daily dataset is not full of spurious bumps and troughs (e.g., nearby roads might appear to have very different levels of pollution simply because such roads were sampled at different times of day). To remove the cycle, the system subtracts a road-type-specific estimate of the diurnal cycle from each observation. The system can determine the noise contribution to the environmental data based on the identified hyperlocal variability.
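The subtraction step can be illustrated with a minimal sketch. The sketch assumes, purely for illustration, that the road-type-specific diurnal cycle is estimated as the mean of all observations sharing a road type and an hour of day; the record layout (`road_type`, `hour`, `value`) and the function name `remove_diurnal_cycle` are hypothetical.

```python
from collections import defaultdict


def remove_diurnal_cycle(passes):
    """Return the passes with a 'residual' field added.

    Assumption: the diurnal cycle for a road type is estimated as the
    mean value per (road_type, hour) over the supplied passes.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for p in passes:
        key = (p["road_type"], p["hour"])
        sums[key] += p["value"]
        counts[key] += 1
    cycle = {key: sums[key] / counts[key] for key in sums}
    # Subtract the road-type-specific estimate from each observation.
    return [
        {**p, "residual": p["value"] - cycle[(p["road_type"], p["hour"])]}
        for p in passes
    ]


raw_passes = [
    {"road_type": "highway", "hour": 8, "value": 10.0},
    {"road_type": "highway", "hour": 8, "value": 14.0},
    {"road_type": "residential", "hour": 8, "value": 5.0},
]
decycled = remove_diurnal_cycle(raw_passes)
```

A smoother estimate of the cycle (for example, a periodic spline per road type) could replace the per-hour mean without changing the subtraction step.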
Various embodiments build on the use of DKF by using a residual model for air quality (REMAQ) to perform imputation. The REMAQ includes the use of regression analysis (e.g., a generalized additive model) in conjunction with a DKF to obtain imputed environmental data.
In some embodiments, the system implements a residual model for air quality (REMAQ) in conjunction with a DKF to impute values to an environmental dataset in connection with imputing values to an augmented environmental dataset or otherwise making a prediction. The system uses the REMAQ to identify the contributions made by a set of features to the environmental dataset. In some embodiments, the system uses information outside the collected environmental dataset to determine the feature-specific components in the environmental dataset. For example, the system uses one or more of topography (e.g., latitude, longitude, altitude), weather, road type, and other features (e.g., correlations associated with other pollutants at the particular location and time) to make predictions of environmental data at particular locations and particular times. In some embodiments, the use of the REMAQ enables better and higher resolution predictions for imputed values.
Various features outside the collected environmental data may be used in connection with determining an augmented environmental dataset or otherwise predicting environmental data for a particular location and time. Examples of such features include other pollutants (e.g., data from other pollutants that may have a causal relationship with the pollutant being analyzed), weather, topography, road characteristics, etc. Weather includes one or more of temperature, precipitation level, visibility, wind speed (e.g., decomposing a wind vector into a zonal wind speed component and/or a meridional wind speed component), downwards radiation, height of the planetary boundary layer, etc. Topography may include one or more of latitude, longitude, and altitude. Examples of road characteristics include road type, road width, road surface, number of lanes, traffic information (e.g., historical traffic and/or current traffic), road construction zones, etc.
The system uses regression analysis to incorporate extraneous environmental predictors (e.g., features other than the collected measurements). An example of a type of environmental monitoring includes the monitoring/modeling of pollution at a particular point and time. In some embodiments, the system deems pollution (e.g., pollutant concentrations) to be a function of a plurality of different variables (e.g., features), such as road type, altitude of the road segment, weather (e.g., weather occurring when a sample is collected, or weather occurring when a prediction is being generated), etc. The system uses regression to estimate the coefficients for the variables. As an example, the system uses a generalized additive model in connection with determining the contributions to the predictions of environmental data made by the various features (e.g., a set of components of the environmental data respectively associated with each of the features). The relationship between the various features and a corresponding pollutant can be non-linear. For example, environmental data is not necessarily a linear combination of the extraneous environmental predictors. As an illustrative example, an increase in altitude from 0 to 100 m does not entail the same change in pollutant concentration as a change in altitude from 100 to 200 m.
In some embodiments, the system pre-processes the environmental data (e.g., sampled environmental data within the environmental dataset) before using the environmental data to generate models (e.g., determine a contribution by a particular feature to the prediction of environmental data) or make predictions (e.g., determine imputed values, etc.). Pre-processing the data enables the system to use high-quality data to build the models and make predictions. Pre-processing the environmental data includes updating the environmental data to adjust/remove nonsensical data, etc. For example, the pre-processing includes adjusting the environmental data to resolve negative concentrations. In some embodiments, when selecting which features to use in predicting environmental data (e.g., determining the imputed values and/or augmented environmental dataset), the system prefers those features that have a plausible causal nature, but sometimes the system may be required to use proxies (e.g., number of lanes, maximum speed, etc.) as surrogates for unavailable features that the system would otherwise use (e.g., emissions from ICE vehicles).
In response to determining the models, the system may validate the models before deployment. In various embodiments, the system validates the models based at least in part on one or more of (a) assessing an internal goodness-of-fit (e.g., qq plots, histograms, correlograms, etc.); (b) determining whether the Kalman filter (KF) initial conditions are important, such as by examining initial time points; (c) performing out-of-sample validation, such as assessing a root mean square error (RMSE); (d) validating against synthetic data; and (e) validating against third-party data (e.g., regulatory stations, PurpleAir, etc.).
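As an illustration of the out-of-sample validation in item (c), the following minimal sketch holds out a fraction of the observed samples and computes an RMSE between predictions and the held-out observations. The helper names `split_holdout` and `rmse`, the 20% holdout fraction, and the fixed seed are assumptions made for the example only.

```python
import math
import random


def split_holdout(samples, fraction=0.2, seed=0):
    """Shuffle the samples and return (training, held_out) lists."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_test = max(1, int(round(len(shuffled) * fraction)))
    return shuffled[n_test:], shuffled[:n_test]


def rmse(predicted, observed):
    """Root mean square error between two equally long sequences."""
    if len(predicted) != len(observed) or len(observed) == 0:
        raise ValueError("sequences must be non-empty and of equal length")
    squared = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared) / len(squared))


training, held_out = split_holdout(list(range(10)), fraction=0.2)
# Predictions here are placeholders; in practice they would come from a
# model fitted on `training` only and evaluated at the held-out cells.
example_error = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
```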
Hyper-local environmental data, for example related to air quality and greenhouse gas data, can be collected using vehicles with air pollutant sensors installed. Embodiments of techniques usable in gathering hyper-local data are described in U.S. patent application Ser. No. 16/682,871, filed on Nov. 13, 2019, entitled HYPER-LOCAL MAPPING OF ENVIRONMENTAL CONDITIONS and assigned to the assignee of the present application, U.S. patent application Ser. No. 16/409,624, filed on May 10, 2019, entitled INTEGRATION AND ACTIVE FLOW CONTROL FOR ENVIRONMENTAL SENSORS and assigned to the assignee of the present application; U.S. patent application Ser. No. 16/773,873, filed on Jan. 27, 2020, entitled SENSOR DATA AND PLATFORMS FOR VEHICLE ENVIRONMENTAL QUALITY MANAGEMENT, assigned to the assignee of the present application and which claims priority to U.S. Patent Application Ser. No. 62/798,395 entitled SENSOR DATA AND PLATFORMS FOR VEHICLE ENVIRONMENTAL QUALITY MANAGEMENT and assigned to the assignee of the present application, which are all incorporated herein in their entirety for all purposes.
Mobile sensor platforms 102A, 102B and 102C may be mounted in a vehicle, such as an automobile or a drone. In some embodiments, mobile sensor platforms 102A, 102B and 102C are desired to stay in proximity to the ground to be better able to sense conditions analogous to what a human would experience. Mobile sensor platform 102A includes a bus 106, sensors 110, 120 and 130. Although three sensors are shown, another number may be present on mobile sensor platform 102A. In addition, a different configuration of components may be used with sensors 110, 120 and 130. Each sensor 110, 120 and 130 is used to sense environmental quality and may be of primary interest to a user of system 100. For example, sensors 110, 120 and 130 may be gas sensors, volatile organic compound (VOC) sensors, particulate matter sensors, radiation sensors, noise sensors, light sensors, temperature sensors, or other analogous sensors that capture variations in the environment. For example, sensors 110, 120 and 130 may be used to sense one or more of NO2, CO, NO, O3, SO2, CO2, VOCs, CH4, particulate matter, noise, light, temperature, radiation, and other compounds. In some embodiments, sensor 110, 120 and/or 130 may be a multi-modality sensor. A multi-modality gas sensor senses multiple gases or compounds. For example, if sensor 110 is a multi-modality NO2/O3 sensor, sensor 110 might sense both NO2 and O3 together. Sensor 110 may comprise a plurality of sensors, such as sensors 112, 114, and 116. Sensor 120 may comprise a plurality of sensors, such as sensors 122, 124, and 126. Sensor 130 may comprise a plurality of sensors, such as sensors 132 and 134.
Although not shown in
Sensors 110, 120 and 130 provide sensor data over bus 106, or via another mechanism. In some embodiments, data from sensors 110, 120 and 130 incorporates time. This time may be provided by a master clock (not shown) and may take the form of a timestamp. Master clock may reside on sensor platform 102A, may be part of processing unit 140, or may be provided from server 150. As a result, sensors 110, 120 and 130 may provide timestamped sensor data to server 150. In other embodiments, the time associated with the sensor data may be provided in another manner. Because sensors 110, 120 and 130 generally capture data at a particular frequency, sensor data is discussed as being associated with a particular time interval (e.g., the period associated with the frequency), though the sensor data may be timestamped with a particular value. For example, sensors 110, 120 and/or 130 may capture sensor data every second, every two seconds, every ten seconds, or every thirty seconds. The time interval may be one second, two seconds, ten seconds, or thirty seconds. The time interval may be the same for all sensors 110, 120 and 130 or may differ for different sensors 110, 120 and 130. In some embodiments, the time interval for a sensor data point is centered on the timestamp. For example, if the time interval is one second and a timestamp is t1, then the time interval may be from t1−0.5 seconds to t1+0.5 seconds. However, other mechanisms for defining the time interval may be used.
Sensor platform 102A also includes a position unit 145 that provides position data. In some embodiments, position unit 145 is a global positioning satellite (GPS) unit. Consequently, system 100 is described in the context of a position unit 145. The position data may be time-stamped in a manner analogous to sensor data. Because position data is to be associated with sensor data, the position data may also be considered associated with time intervals, as described above. However, in some embodiments, position data (e.g., GPS data) may be captured more or less frequently than sensor data. For example, position unit 145 may capture position data every second, while sensor 130 may capture data every thirty seconds. Thus, multiple data points for the position data may be associated with a single thirty second time interval. The position data may be processed as described below.
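The association between position data and a sensor time interval can be sketched as follows. The sketch assumes a time interval centered on the sensor timestamp and treats the interval as half-open so that a position fix on a boundary is assigned to exactly one interval; the tuple layout `(timestamp, latitude, longitude)` and the function names are hypothetical.

```python
def interval_for(timestamp, length):
    """Return the (start, end) of an interval centered on the timestamp."""
    half = length / 2.0
    return (timestamp - half, timestamp + half)


def positions_in_interval(position_fixes, sensor_timestamp, interval_length):
    """Select the position fixes that fall inside the sensor's interval.

    Assumption: the interval is half-open, [start, end).
    """
    start, end = interval_for(sensor_timestamp, interval_length)
    return [fix for fix in position_fixes if start <= fix[0] < end]


# Position captured every second; sensor data captured every thirty seconds.
fixes = [(float(t), 37.0 + t * 1e-5, -122.0) for t in range(60)]
matched = positions_in_interval(fixes, sensor_timestamp=30.0, interval_length=30.0)
```

With these inputs, the thirty position fixes with timestamps from 15 to 44 seconds are associated with the single sensor data point timestamped at 30 seconds.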
Optional processing unit 140 may perform some processing and functions for data from sensor platform 104, may simply pass data from sensor platform 104 to server 150 or may be omitted.
Mobile sensor platforms 102B and 102C are analogous to mobile sensor platform 102A. In some embodiments, mobile sensor platforms 102B and 102C have the same components as mobile sensor platform 102A. In other embodiments, the components may differ. In either case, mobile sensor platforms 102A, 102B and 102C function in an analogous manner.
Server 150 includes sensor data database 156, calibration tables 154 (e.g., stored in database 152), processor(s) 158, and memory 159. Processor(s) 158 may include multiple cores. Processor(s) 158 may include one or more central processing units (CPUs), one or more graphical processing units (GPUs) and/or one or more other processing units. Memory 159 can include a first primary storage, typically a random-access memory (RAM), and a second primary storage area, typically a non-volatile storage such as solid-state drive (SSD) or hard disk drive (HDD). Memory 159 stores programming instructions and data for processes operating on processor(s) 158. Primary storage typically includes basic operating instructions, program code, data and objects used by processor(s) 158 to perform their functions. Primary storage devices (e.g., memory 159) may include any suitable computer-readable storage media, described below, depending on whether, for example, data access needs to be bi-directional or uni-directional.
Sensor data database 156 includes data received from mobile sensor platforms 102A, 102B and/or 102C. After capture by mobile sensor platform 102A, 102B and/or 102C, sensor data stored in sensor data database 156 may be operated on by various analytics, as described below. Position data database 152 stores position data received from mobile sensor platforms 102A, 102B and/or 102C. In some embodiments, sensor data database 156 stores position data as well as sensor data. In such embodiments, position data database 152 may be omitted. Server 150 may include other databases and/or store and utilize other data. For example, server 150 may include calibration data (not shown) used in calibrating sensors 110, 120 and 130.
System 100 may be used to capture, analyze, and provide information regarding hyper-local environmental data. Mobile sensor platforms 102A, 102B and 102C may be used to traverse routes and provide sensor and position data to server 150. Server 150 may process the sensor data and position data. Server 150 may also assign the sensor data to map features corresponding to the locations of mobile sensor platforms 102A, 102B and 102C within the same time interval as the sensor data was captured. As discussed above, these map features may be hyper-local (e.g., one hundred meter or less road segments or thirty meter or less road segments). Thus, mobile sensor platforms 102A, 102B and 102C may provide sensor data that can capture variations on this hyper-local distance scale. Server 150 may provide the environmental data, a score, confidence score and/or other assessment of the environmental data to a user. Thus, using system 100 hyper-local environmental data may be obtained using a relatively sparse network of mobile sensor platforms 102A, 102B and 102C, associated with hyper-local map features and processed for improved understanding of users.
Mobile sensor platforms traverse routes in a geographic region, at 202. While traversing the routes, the mobile sensor platforms collect not only sensor data, but also position data. For example, a mobile sensor platform may sense one or more of NO2, CO, NO, O3, SO2, CO2, CH4, VOCs, particulate matter, other compounds, radiation, noise, light, and other environmental data at various times during traversal of the route. Other environmental characteristics, including but not limited to temperature, pressure, and/or humidity may also be sensed at 202. In addition, the time corresponding to the environmental data is also captured. The time may be in the form of a timestamp for the sensor data (sensor timestamp), which may correspond to a particular time interval. Different sensors on the mobile sensor platform may capture the environmental data at different times and/or at different frequencies. Also, at 202 the mobile sensor platforms capture position data, for example via a GPS unit. The position data may include location (as indicated by a GPS unit), velocity and/or other information related to the geographic location of the mobile sensor platform. In some embodiments, position data from other sources, such as acceleration, may be captured by the vehicle or another source. The position data may include a timestamp (position timestamp) or other indicator of the time at which the position data is captured.
The mobile sensor platforms provide the position and sensor data to a server, at 204. In some embodiments, mobile sensor platforms provide this data substantially in real time, as the mobile sensor platforms traverse their routes at 202. Thus, the position and sensor data may be transmitted wirelessly to the server. In some embodiments, some or all of the position and/or sensor data is stored at the mobile sensor platform and provided to the server at a later time. For example, the data may be transferred to the server when the mobile sensor platform returns to its base. In some embodiments, the mobile sensor platform may process the sensor data and/or position data prior to sending the sensor and/or position data to the server. In other embodiments, the mobile sensor platform provides little or no processing. The sensor data and position data may be sent at the same time or may be sent separately.
At 206, the route traversal and data collecting of 202 and data sending of 204 are repeated. Thus, the mobile sensor platforms may traverse the same or different routes at 206. In either case, multiple passes of the same geographic locations, and thus multiple passes of the same corresponding map features, are made at 206. In some embodiments, the repetition at 206 may be periodic (e.g., approximately every week, month, or other time period). In some embodiments, the repetition at 206 may be performed based on other timing. In some cases, the same mobile sensor platform is sent on the same route and/or collects data for the same map features. In some embodiments, different mobile sensor platforms may be used to collect data for the same routes and/or map features. Also at 206, steps 202 and 204 may be performed multiple times. Thus, at 206, data for a particular region may be aggregated over time.
For example,
At 206, mobile sensor platform 102A and/or other mobile sensor platform(s) 102B and 102C repeat the route traversal, data collection and sending of the position and sensor data. In some cases, mobile sensor platform(s) 102A, 102B and/or 102C follow route 330 again. In some cases, mobile sensor platform(s) 102A, 102B and/or 102C traverse a different route. For example,
Thus, using method 200, sensor and position data may be captured for regions of a map. The sensor data and position data may be provided to server 150 or other component for processing, aggregation, and analysis. Sensor data and position data are sensed sufficiently frequently using method 200 that variations in environmental quality on the hyper-local scales may be reflected in the sensor data. Method 200 may be performed using a relatively small number of mobile sensor platforms. Consequently, efficiency of data gathering may be improved while maintaining sufficient sensitivity in both sensor and position data.
The system uses REMAQ based on collected environmental data (e.g., raw pass data) as input. Generally, each measurement includes an average of 1 Hz measurements taken by a mobile sensor as the mobile sensor travelled a road segment (e.g., a stretch of road roughly 100 meters long). For each session, the system may generate (e.g., probabilistically) a drive plan to provide good spatial diversity and temporal diversity in the environmental dataset (e.g., a set of collected samples over a set of sessions). The number of 1 Hz measurements that go into a pass average depends not only on segment length but also the sensor's speed over ground. For some pollutants, the average (or minimum) distance between the sensor and other vehicles might also affect the final pass average. In some embodiments, the system accounts for these factors. In other embodiments, the system does not take such factors into consideration and takes the pass “measurement” at face value.
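The dependence of a pass average on segment length and speed over ground can be made concrete with a short sketch. It assumes a constant speed along the segment and an unweighted mean of the 1 Hz readings; the function names `expected_sample_count` and `pass_average` are hypothetical.

```python
def expected_sample_count(segment_length_m, speed_m_per_s, rate_hz=1.0):
    """Approximate number of readings taken while crossing a segment.

    Assumption: the sensor moves at constant speed along the segment.
    """
    if speed_m_per_s <= 0:
        raise ValueError("speed must be positive")
    return segment_length_m / speed_m_per_s * rate_hz


def pass_average(readings):
    """Unweighted mean of the 1 Hz readings collected during one pass."""
    if not readings:
        raise ValueError("a pass needs at least one reading")
    return sum(readings) / len(readings)


# A 100 m segment crossed at 10 m/s yields about ten 1 Hz readings,
# while the same segment crossed at 25 m/s yields about four.
slow_count = expected_sample_count(100.0, 10.0)
fast_count = expected_sample_count(100.0, 25.0)
average = pass_average([1.0, 2.0, 3.0])
```

A pass built from fewer readings is noisier, which is one reason an embodiment might account for speed rather than take the pass measurement at face value.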
The sampled environmental data may comprise noise (e.g., a noise component or contribution to the prediction of environmental data). For example, the mobile sensor deployed to collect the sampled environmental data may introduce noise or measurement errors. These noise/measurement errors generally do not cancel out at the pass level. As shown in
In this expression, the parameter κ>1 controls how early the function begins to approach the 1:1 line. The system determines modality-specific κs based on pass data collected over a predetermined time period (e.g., pass data collected over two years). In some embodiments, the parameter κ is selected to preserve the mode of the original distribution (e.g., the distribution over the raw data samples).
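The behavior of κ can be illustrated with a commonly used parametrized softplus, softplus(x, κ) = log(1 + exp(κx))/κ. This specific form is an assumption made for the sketch and is not confirmed by the text above; it is chosen because it is strictly positive for all inputs and approaches the 1:1 line earlier as κ increases. The function names are hypothetical.

```python
import math


def softplus(x, kappa):
    """Assumed form: log(1 + exp(kappa * x)) / kappa, computed stably."""
    if kappa <= 0:
        raise ValueError("kappa must be positive")
    # max(x, 0) + log1p(exp(-kappa * |x|)) / kappa avoids overflow.
    return max(x, 0.0) + math.log1p(math.exp(-kappa * abs(x))) / kappa


def log_softplus(x, kappa):
    """Log transform of the corrected value, which is always positive."""
    return math.log(softplus(x, kappa))


# A negative raw concentration maps to a small positive value, and a
# large positive value is left almost unchanged (near the 1:1 line).
corrected_negative = softplus(-5.0, 2.0)
corrected_large = softplus(10.0, 2.0)
```

Because the corrected value is strictly positive, a subsequent log transform is well defined even when the raw pass measurement is negative.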
As illustrated in
In some embodiments, the system deploys a model(s) to determine contributions (e.g., by a particular feature) to the prediction of environmental data (e.g., spatially-based features, temporally-based features, or spatiotemporally-based features). The model(s) is used in connection with determining predictions for environmental data.
In some embodiments, the system determines the model(s) based on fitting a regression model, such as a generalized additive model, to an environmental dataset. The system uses a generalized additive model to remove the effects of different features (e.g., topography, road type, etc.) and obtains corresponding residual data. The residual data are obtained because model predictions will generally not match the observed data perfectly. The system can feed the environmental dataset to a generalized additive regression model to determine predictions for all locations in the environmental dataset. For example, the system represents the environmental dataset in a matrix along the dimensions of location and time. The system passes the environmental dataset through the additive regression model to impute values to all empty cells in the matrix. The system obtains an augmented environmental dataset from the collected environmental dataset and the imputed values for the empty cells of the collected environmental dataset.
In response to obtaining the augmented dataset, the system can remove the predictive contribution of each feature (or a subset of features) from the original environmental dataset (e.g., the dataset of pollutant concentrations), thereby generating a dataset of residuals that are not explained by that feature(s). As an illustrative example, in response to obtaining the augmented dataset, the system determines the effect of topography on the environmental data and removes the topography component to obtain a set of residual data that is not explained by topography, such as data/variability of data that may be explained by other features, such as weather or temporally-based effects.
The system iteratively determines the effect of a particular feature on the environmental data and removes that feature's predicted contribution from the environmental data (or from the input residual data) to obtain residual data that may be used in a subsequent iteration. In some embodiments, at each iteration, the input data (e.g., the environmental dataset on the first pass, or the corresponding residual data for subsequent passes) is fed into a regression analysis to determine the effect of a next feature on the environmental data (e.g., the component for the feature corresponding to the iteration being performed).
In some embodiments, in connection with implementing the REMAQ with an environmental dataset, the system attempts to fit a model to log-transformed, softplus-corrected pass measurements. In other words, the response variable is Y_t(s) = log(softplus(Y_t^0(s), κ)), where Y_t^0(s) denotes the original pass measurement made at time t and road segment s. The system assumes that the particular random variable follows a Gaussian distribution, with a fairly complex expression for the mean (m_t(s)) and a simple unknown parameter (σ^2) for the variance, as shown in Equation 2.
$$Y_t(s) \sim N\left[m_t(s), \sigma^2\right] \quad (2)$$
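As a minimal sketch of the response transform described above, the following Python snippet computes Y_t(s) = log(softplus(Y_t^0(s), κ)). The exact softplus expression is not reproduced in this text, so the sketch assumes the common scaled form softplus(y, κ) = log(1 + exp(κy))/κ, which is strictly positive and approaches the 1:1 line sooner as κ grows. The function names and sample values are hypothetical.

```python
import numpy as np

def softplus(y, kappa):
    """Assumed form: log(1 + exp(kappa * y)) / kappa, computed stably."""
    y = np.asarray(y, dtype=float)
    return np.logaddexp(0.0, kappa * y) / kappa

def response(y_raw, kappa):
    """Log-transformed, softplus-corrected pass measurement."""
    return np.log(softplus(y_raw, kappa))

raw = np.array([-0.5, 0.0, 0.2, 5.0, 50.0])   # hypothetical pass values
corrected = softplus(raw, kappa=2.0)           # strictly positive, so the log is defined
y = response(raw, kappa=2.0)
```

Under this assumed form, negative or zero raw readings (which can arise from sensor noise) are mapped to small positive values, while large readings are left essentially unchanged.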
The mean is provided by Equation 3 below. The constituent parts of Equation 3 are defined by functions such as by the examples described in Equations 4-8.
$$m_t(s) = g_1(x^{\text{space}}(s)) + g_2(x_t^{\text{weather}}) + g_3(x_t^{\text{hour}}) + g_4(t) + g_5(t, s) \quad (3)$$
In Equation (3) above:
In the expression for the mean, such as the mean defined by Equation 3, the system utilizes the following underlying assumptions:
In some embodiments, different models can be determined in which the above-noted assumptions are not made.
The nonlinear functions that connect environmental log-characteristics (e.g., pollutant log-concentrations) to environmental features are defined by Equations 4 and 5 below.
$$g_1(x^{\text{space}}(s)) = \text{tensor}(\text{latitude}(s), \text{longitude}(s)) + \text{spline}(\text{altitude}(s)) + \text{factor}(\text{road type}(s)) \quad (4)$$
$$g_2(x_t^{\text{weather}}) = \text{spline}(\text{DLWRF}_t) + \text{spline}(\text{HLBL}_t) + \text{spline}(\text{PRATE}_t) + \text{spline}(\text{TCDC}_t) + \text{spline}(\text{TMP}_t) + \text{spline}(\text{UGRD}_t) + \text{spline}(\text{VGRD}_t) \quad (5)$$
In Equation 5 above, DLWRF refers to downward longwave radiation flux, HLBL refers to the height of the planetary boundary layer, PRATE refers to the precipitation rate, TCDC refers to the total cloud cover, TMP refers to the near-surface air temperature, UGRD refers to the near-surface zonal wind speed, and VGRD refers to the near-surface meridional wind speed.
The system accounts for temporal features, such as temporally varying diurnal cycles. As an example, the system implements Equation 6 to account for temporally varying diurnal cycles. Although Equation 6 uses values at an hourly resolution, various other time resolutions may be implemented.
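$$g_3(x_t^{\text{hour}}) = \text{spline}(\mathbb{1}_{\text{hour}(t) = \text{12AM}}) + \text{spline}(\mathbb{1}_{\text{hour}(t) = \text{1AM}}) + \text{spline}(\mathbb{1}_{\text{hour}(t) = \text{2AM}}) + \dots + \text{spline}(\mathbb{1}_{\text{hour}(t) = \text{11PM}}) \quad (6)$$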
In some embodiments, the system accounts for the principal component of variability, such as based on Equation 7.
$$g_4(t) = \text{spline}(t) \quad (7)$$
In some embodiments, the system accounts for hyperlocal variability via Equation 8. Hyperlocal variability may correspond to consistent differences in pollutant concentrations between neighboring road segments during a predefined period of time (e.g., a predefined contract). As an example, this equation can be part of the KF.
$$g_5(t, s) = \sum_{k=1}^{K} \phi(s)[k] \, \lambda_{\text{day}(t)}[k] \quad (8)$$
In Equation 8, K∈N denotes the number of hyperlocal components, φ(s)[k]∈R is the value of the k-th spatial component at location s, and λday(t)[k] is the k-th coefficient associated with day(t).
The sample size of environmental data in the environmental dataset can be extremely large. The sample size can be sufficiently large that solving for the coefficients for the set of features may be extremely computationally expensive and thus infeasible. In some embodiments, the system solves the problem by solving for different types of features. For example, the system iteratively fits the environmental data to a feature or type of feature.
In some embodiments, the system fits the environmental data by implementing five modules: (1) a spatial module, (2) a weather module, (3) a diurnal cycle module, (4) a principal component module, and (5) a hyperlocal module. The inputs of one module are the residuals of the previous module, hence the name REMAQ. For example, the system begins with the pre-processed environmental data from the environmental dataset and determines the component for a first feature (e.g., a spatial feature), and iteratively determines the component for the remaining features based on residual data from the solution of an immediately preceding feature. Although examples described herein solve the fitting of the environmental dataset to a set of features in the order of spatial effects, temporal effects, spatiotemporal effects, and noise effects, various other orders may be implemented.
The spatial module is used for fitting the data to spatially-based effects to determine the spatial component of (e.g., contribution to the predictions of) the environmental data. In some embodiments, the spatial module employs temporally averaged means as the response variable. An example of the temporally averaged means is provided in Equation 9.
The system fits the generalized additive model (GAM) according to Equation 10.
$$Y(s) = g_1(x^{\text{space}}(s)) + \epsilon(s) \quad (10)$$
In Equation 10, ϵ(s) denotes spatially uncorrelated Gaussian error. Various embodiments fit this GAM to the environmental data using the Python library pygam or similar tools. GAMs provide an easy way to describe the behavior of response variables via smooth, non-linear functions of features.
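One way to picture the spatial fit of Equation 10 is as a least-squares problem over additive basis terms. The sketch below is a simplified stand-in rather than the spline-based GAM itself: it assumes a quadratic polynomial basis in place of the tensor and spline smooths and a one-hot encoding for road type. All array names and synthetic values are hypothetical.

```python
import numpy as np

def spatial_design(lat, lon, alt, road_type, n_types):
    """Additive basis: lat/lon terms, altitude polynomial, road-type factor."""
    one_hot = np.eye(n_types)[road_type]          # factor(road type); also spans the intercept
    return np.column_stack([
        lat, lon, lat * lon, lat**2, lon**2,      # stand-in for tensor(latitude, longitude)
        alt, alt**2,                              # stand-in for spline(altitude)
        one_hot,
    ])

rng = np.random.default_rng(0)
n = 200                                           # hypothetical number of road segments
lat, lon, alt = rng.uniform(-1, 1, (3, n))
road_type = rng.integers(0, 3, n)

# Synthetic temporally averaged response with known structure plus noise
y_bar = 0.5 * lat - 0.3 * lon + 0.2 * alt**2 + np.array([0.0, 0.4, 0.8])[road_type]
y_bar = y_bar + rng.normal(0, 0.05, n)

X = spatial_design(lat, lon, alt, road_type, n_types=3)
coef, *_ = np.linalg.lstsq(X, y_bar, rcond=None)
g1_hat = X @ coef                                 # fitted spatial component
residual = y_bar - g1_hat                         # what the spatial terms do not explain
```

A spline-based GAM library would replace the polynomial columns with penalized smooth terms; the additive structure and the residual hand-off stay the same.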
The weather module is used for fitting the data to weather-based effects to determine the weather component of (e.g., contribution to the prediction of) the environmental data. Based on the predictions from the spatial module (e.g., g_1(x^space(s))), the weather module employs spatially-averaged residuals as the response variable. The residuals may be computed from a function such as the function defined in Equation 11.
$$Y_t = \sum_{s} \left( Y_t(s) - \hat{g}_1(x^{\text{space}}(s)) \right) \quad (11)$$
The system fits the GAM according to Equation 12.
$$Y_t = g_2(x_t^{\text{weather}}) + \epsilon_t \quad (12)$$
In Equation 12, ϵ_t denotes temporally uncorrelated Gaussian error.
The diurnal cycle module is used for fitting the data to diurnal cycle-based effects to determine the diurnal cycle component of (e.g., contribution to the prediction of) the environmental data. Based on the predictions from the weather module (e.g., g_2(x_t^weather)), the diurnal cycle module employs the resulting residuals computed from a function such as Equation 13 as the response variable.
$$Y_t^{\dagger} = Y_t - \hat{g}_2(x_t^{\text{weather}}) \quad (13)$$
The system fits the GAM according to Equation 14.
$$Y_t^{\dagger} = g_3(x_t^{\text{hour}}) + \zeta_t \quad (14)$$
In Equation 14, ζ_t denotes temporally uncorrelated Gaussian error.
The principal component module is used for fitting the data to principal component-based effects to determine the contribution of the principal component to the prediction of environmental data. Based on the predictions from the diurnal cycle module (e.g., g_3(x_t^hour)), the principal component module employs the resulting residuals computed from a function such as Equation 15 as the response variable.
$$Y_t^{\ddagger} = Y_t^{\dagger} - \hat{g}_3(x_t^{\text{hour}}) \quad (15)$$
The system fits the GAM according to Equation 16.
$$Y_t^{\ddagger} = g_4(t) + \nu_t \quad (16)$$
In Equation 16, ν_t denotes temporally uncorrelated Gaussian error.
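The chain of Equations 11 through 16 can be sketched as a sequence of fits in which each module receives the residuals of the previous one. The snippet below is illustrative only. It assumes the spatial predictions are already available, uses the mean over sampled segments as the spatial average, and replaces each spline-based GAM with ordinary least squares on a simple basis (linear weather terms, hour-of-day indicators, and a cubic polynomial in time). All names and synthetic data are hypothetical.

```python
import numpy as np

def fit_predict(X, y):
    """Least-squares stand-in for fitting a GAM and returning its predictions."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return X @ coef

rng = np.random.default_rng(1)
M, N = 24 * 30, 15                                  # hourly time points, road segments
t = np.arange(M)
hour = t % 24
weather = rng.normal(size=(M, 2))                   # two hypothetical weather covariates
g1_hat = rng.normal(size=N)                         # assumed output of the spatial module

# Synthetic log-scale passes: spatial + weather + diurnal + trend + noise
signal = 0.3 * weather[:, 0] + 0.2 * np.sin(2 * np.pi * hour / 24) + 1e-3 * t
Y = g1_hat[None, :] + signal[:, None] + rng.normal(0, 0.05, size=(M, N))
Y[rng.random((M, N)) < 0.7] = np.nan                # most cells are unobserved

# Eq. 11: spatially averaged residuals, using only time points with at least one pass
R = Y - g1_hat[None, :]
obs = ~np.isnan(R)
seen = obs.any(axis=1)
Y_t = np.where(obs, R, 0.0)[seen].sum(axis=1) / obs[seen].sum(axis=1)
w, h, ts = weather[seen], hour[seen], t[seen] / M

# Eqs. 12-13: weather module, then its residuals
g2_hat = fit_predict(np.column_stack([np.ones(len(ts)), w]), Y_t)
Y_dag = Y_t - g2_hat

# Eqs. 14-15: diurnal cycle module on hour-of-day indicators, then its residuals
g3_hat = fit_predict(np.eye(24)[h], Y_dag)
Y_ddag = Y_dag - g3_hat

# Eq. 16: slow trend in time (cubic polynomial as a spline stand-in)
g4_hat = fit_predict(np.vander(ts, 4), Y_ddag)
noise = Y_ddag - g4_hat
```

Because each stage only sees what the previous stage left unexplained, the residual variance shrinks from one module to the next, and each fit involves only a small design matrix.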
In some embodiments, the system implements a DKF in connection with determining the hyperlocal-based effects. The system uses a hyperlocal module that fits the environmental data to hyperlocal-based effects to determine the contribution of the hyperlocal features to the prediction of environmental data.
Data Interpolating Empirical Orthogonal Functions (DINEOF) is a machine learning method that attempts to reconstruct a spatiotemporal field of observations based on a sparse sample. Data interpolation, or imputation (e.g., predicting the value of the missing observations), is accomplished by leveraging the spatial and temporal correlations present in the sparse sample. The spatial and temporal correlations enable the system to identify K components of variability. As an example, each component comprises:
According to DINEOF, if an observation z_t(s) is missing in a dataset, the system imputes the missing value/observation based on a function, such as the function described in Equation 17.
$$z_t(s) = \sum_{k=1}^{K} \phi(s)[k] \, \lambda_t[k] \quad (17)$$
Based on the predictions from the previous modules (e.g., the spatial module, the weather module, the diurnal cycle module, the principal component module, etc.), the system first computes the corresponding residuals. The system can compute the residuals according to a function such as the function described in Equation 18.
$$Y_t^{*}(s) = Y_t(s) - \left( \hat{g}_1(x^{\text{space}}(s)) + \hat{g}_2(x_t^{\text{weather}}) + \hat{g}_3(x_t^{\text{hour}}) + \hat{g}_4(t) \right) \quad (18)$$
If a particular location s was sampled n_d(s) times on the same day d, then the system can compute a daily mean residual according to Equation 19.
If no observations were made, then the system deems Z_d(s) missing and determines to impute a value for the particular location on day d (or such other time point). The system can use an estimation technique (e.g., algorithm) described below to impute the value.
In some embodiments, the system implements an estimation algorithm to determine imputed values. To impute missing values, the system obtains the values of K, {ϕ(s1)[k], . . . , ϕ(sN)[k]}, and {λ1[k], . . . , λM[k]}, for all k=1, . . . , K. The system can obtain these values via an iterative optimizer. An example of an iterative optimizer is provided below:
Once this optimizer has converged, the system has obtained the final low-rank SVD decomposition $\hat{Z} = \hat{U}\hat{D}\hat{V}$. The system uses the SVD decomposition to compute a few quantities of interest:
$$\phi(s)[k] = U(s, k) \cdot D(k, k) \quad (22)$$
In some embodiments, DINEOF enables the system to impute missing observations (e.g., empty cells of a matrix representation of the environmental dataset) when the environmental dataset is sparse. However, DINEOF's performance can degrade as the sparsity of the dataset increases. The degradation can reach the point that the algorithm is incapable of determining predictions for days on which no samples were collected at any location.
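The iterative optimizer itself is not reproduced in this text. The snippet below sketches a commonly described DINEOF-style iteration under stated assumptions: missing cells start at zero, a rank-K truncated SVD is recomputed, and only the missing cells are overwritten with the low-rank reconstruction until the update becomes small. The spatial components follow Equation 22. The function name, the tolerance, and the convention that the temporal coefficients are the columns of V are hypothetical choices.

```python
import numpy as np

def dineof_impute(Z, K, tol=1e-8, max_iter=500):
    """Z: (N locations, M days) array with NaN marking missing cells."""
    missing = np.isnan(Z)
    filled = np.where(missing, 0.0, Z)             # initial guess for missing cells
    for _ in range(max_iter):
        U, d, Vt = np.linalg.svd(filled, full_matrices=False)
        low_rank = (U[:, :K] * d[:K]) @ Vt[:K]     # rank-K reconstruction
        step = np.max(np.abs(low_rank[missing] - filled[missing]), initial=0.0)
        filled[missing] = low_rank[missing]        # observed cells are never altered
        if step < tol:
            break
    phi = U[:, :K] * d[:K]                         # Eq. 22: phi(s)[k] = U(s,k) * D(k,k)
    lam = Vt[:K].T                                 # assumed: lambda_d[k] = V(d,k)
    return filled, phi, lam

rng = np.random.default_rng(2)
true = rng.normal(size=(30, 2)) @ rng.normal(size=(2, 40))   # synthetic rank-2 field
Z = true.copy()
Z[rng.random(true.shape) < 0.3] = np.nan                      # 30% missing at random
Z_hat, phi, lam = dineof_impute(Z, K=2)
```

Note that a day with no samples at any location contributes an all-missing column, for which this iteration has no data to anchor the temporal coefficient. That is the limitation described above.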
According to various embodiments, the system implements a Kalman Filter (KF) to mitigate this problem of DINEOF. The system implements the KF in accordance with the following:
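$$Z_d(s) \sim N\left[ \sum_{k=1}^{K} \phi(s)[k] \, \lambda_d[k], \; \sigma^2 \right] \quad (23)$$
$$\lambda_d[k] \sim N\left[ \lambda_{d-1}[k], \; \tau^2 \right] \quad (24)$$
$$\lambda_1[k] \sim N\left[ 0, \; \tau^2 \right] \quad (25)$$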
According to various embodiments, the system uses the Kalman Filter to estimate the Gaussian posterior mean and variance of the temporal coefficients λ_d[k], for all d=1, . . . , M and k=1, . . . , K, given the available observations. The system estimates the Gaussian posterior mean and variance of the temporal coefficients using the forward filtering, backward smoothing equations described in the following: C. K. Carter and R. Kohn, “On Gibbs Sampling for State Space Models”, Biometrika. Vol. 81, No. 3 (August, 1994), pp. 541-553 (hereinafter Carter and Kohn); and Fruhwirth-Schnatter, “Data Augmentation and Dynamic Linear Models”, Journal of Time Series Analysis. Vol. 15, Issue 2 (March 1994), pp. 183-202 (hereinafter Fruhwirth-Schnatter). Both Carter and Kohn and Fruhwirth-Schnatter are hereby incorporated by reference in their entireties for all purposes. From these distributions, the system can sample the missing observations P times, Z_d(s)[1], . . . , Z_d(s)[P], thereby obtaining a probabilistic reconstruction of the response variable, for all locations and times (e.g., for all road segments and days).
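The state-space model of Equations 23 through 25 can be illustrated with a standard forward filter and backward smoother. The sketch below is not the cited authors' algorithm verbatim: it assumes the spatial components ϕ and the variances σ^2 and τ^2 are already known, treats the K coefficients as one state vector with an identity transition, and skips the update step on days without samples. The function name and the toy inputs are hypothetical, and the sampling of Z_d(s)[p] is not shown.

```python
import numpy as np

def kalman_smooth(Z, Phi, sigma2, tau2):
    """Z: (M days, N locations) with NaN where unobserved; Phi: (N, K).
    Returns smoothed means (M, K) and covariances (M, K, K) of lambda_d."""
    M, K = Z.shape[0], Phi.shape[1]
    I = np.eye(K)
    m_pred, P_pred = np.zeros((M, K)), np.zeros((M, K, K))
    m_filt, P_filt = np.zeros((M, K)), np.zeros((M, K, K))
    for d in range(M):
        if d == 0:
            m, P = np.zeros(K), tau2 * I                    # Eq. 25 prior
        else:
            m, P = m_filt[d - 1], P_filt[d - 1] + tau2 * I  # Eq. 24 random walk
        m_pred[d], P_pred[d] = m, P
        obs = ~np.isnan(Z[d])
        if obs.any():                                       # Eq. 23 update; skipped on empty days
            H = Phi[obs]
            S = H @ P @ H.T + sigma2 * np.eye(int(obs.sum()))
            G = np.linalg.solve(S, H @ P).T                 # gain P H^T S^{-1}
            m = m + G @ (Z[d, obs] - H @ m)
            P = (I - G @ H) @ P
        m_filt[d], P_filt[d] = m, P
    m_s, P_s = m_filt.copy(), P_filt.copy()
    for d in range(M - 2, -1, -1):                          # backward smoothing pass
        J = P_filt[d] @ np.linalg.inv(P_pred[d + 1])
        m_s[d] = m_filt[d] + J @ (m_s[d + 1] - m_pred[d + 1])
        P_s[d] = P_filt[d] + J @ (P_s[d + 1] - P_pred[d + 1]) @ J.T
    return m_s, P_s

Phi = np.array([[1.0], [1.0]])                              # K = 1 component, N = 2 locations
Z = np.array([[0.9, 1.1], [np.nan, np.nan], [2.1, 1.9]])    # second day has no samples
m_s, P_s = kalman_smooth(Z, Phi, sigma2=0.1, tau2=1.0)
z_day2 = Phi @ m_s[1]                                       # smoothed mean for the unsampled day
```

In this toy run the unsampled day still receives an estimate, carried in from its neighbors through the random walk, and its posterior variance is larger than on the sampled days.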
In some embodiments, the system generates an augmented environmental dataset, such as a dataset that comprises collected/sampled values or imputed values for all locations and time points during a predefined period of time (e.g., during a length of a contract). For example, the system combines the output of the various modules (e.g., the spatial module, the weather module, the diurnal cycle module, the principal component module, the hyperlocal module, etc.) in REMAQ, to generate the final reconstruction of the environmental dataset (e.g., pollutant concentrations at all road segments and time points (hours) of a contract).
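$$\exp\left( \hat{\mu}_t(s) + Z_d(s)[1] \right), \; \dots, \; \exp\left( \hat{\mu}_t(s) + Z_d(s)[P] \right) \quad (26)$$
where
$$\hat{\mu}_t(s) = \hat{g}_1(x^{\text{space}}(s)) + \hat{g}_2(x_t^{\text{weather}}) + \hat{g}_3(x_t^{\text{hour}}) + \hat{g}_4(t) \quad (27)$$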
The system can assess the extent of the uncertainty of the estimate (e.g., imputed value) based on the spread of the sample computed using Equation 26. One advantage of using a sample-based approach is that the system can also construct spatial and temporal aggregations (e.g., census-tract baselines, monthly baselines, etc.) that take spatiotemporal correlations into account, because such correlations are embedded in the DKF algorithm used to generate the sample.
A useful way to think about this augmented dataset is as if it were a data cube, D, with dimensions P×N×M. The general element of this data cube, D(p, s, t) is equal to a (softplus-corrected) measurement if one was made at time t and location s, or equal to a simulated value if no measurements were made.
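The data cube and the sample-based uncertainty described above can be sketched as follows. This is a toy illustration under stated assumptions: the module predictions and the posterior draws are replaced with random placeholders, a single time index stands in for both hours and days, and all names and values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
P, N, M = 100, 4, 6                                  # draws, locations, time points
observed = np.full((N, M), np.nan)
observed[0, 1], observed[2, 4] = 12.0, 8.5           # hypothetical softplus-corrected passes
mu_hat = rng.normal(2.0, 0.1, size=(N, M))           # placeholder for the summed module predictions
z_samples = rng.normal(0.0, 0.2, size=(P, N, M))     # placeholder for posterior draws Z_d(s)[p]

simulated = np.exp(mu_hat[None, :, :] + z_samples)   # one simulated layer per draw
# D(p, s, t): the measurement where one exists, otherwise the simulated value
D = np.where(np.isnan(observed)[None, :, :], simulated, observed[None, :, :])

spread = D.std(axis=0)                               # per-cell uncertainty; zero where measured
aggregate = D.mean(axis=(1, 2))                      # one region-wide average per draw
interval = np.percentile(aggregate, [2.5, 97.5])     # sample-based interval for the aggregate
```

Aggregating within each draw before summarizing across draws is what lets an interval for an aggregate reflect correlations between cells, provided the draws themselves are generated jointly.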
The foregoing equations are merely illustrative examples. In various embodiments, the system may implement various equations in connection with processing the environmental data and/or making predictions pertaining to the environmental data.
Mobile sampling artifacts, in the shape of polygonal regions that coincide with census tracts, are visible in the raw mean of the pass values (e.g.,
As shown in
In the example shown in
A null model is a model with no predictors. For example, the null model simply captures the average of the data. Therefore, any competing model should present smaller residuals, as a sign of improved fit. As illustrated in
In some embodiments, the weather module is used to assess (e.g., determine) the extent to which weather variables influence the temporal dynamics of spatially averaged residuals from the spatial module. In other words, the system uses the weather module to assess the skill of a weather-based GAM to predict all the variability in the pass data that could not be captured by smooth geographical predictions.
In the example shown, the temporal dynamics of spatially averaged residuals from the PM2.5 spatial module (e.g., the dotted representation) are illustrated together with predictions from the weather module (e.g., the solid line representation).
In the example shown, environmental dataset 905 can be represented/organized into a matrix having dimensions of time and space (e.g., road segments). The white cells correspond to cells for which no observation was collected. The grey and black cells correspond to cells in which observations were sampled. As illustrated in the missingness mask representation 910, which identifies cells for which observations are collected, the pollutant associated with environmental dataset 905 was observed at only some locations and at only a few times. The system can deconstruct the environmental dataset 905 into (i) effects observed over space (e.g., effects having spatial variability) as represented by the spatial effect component 915, (ii) effects observed over time (e.g., effects having temporal variability) as represented by temporal effect component 920, (iii) effects observed over space-time (e.g., effects having spatiotemporal variability) as represented by spatiotemporal effect component 925, and (iv) effects of noise as represented by noise component 930.
The system obtains an environmental dataset 1005 that is based at least in part on collected sample data (e.g., pass data from a mobile sensor platform traveling and sampling along a drive plan during a session). Although not shown in
In some embodiments, the system determines the spatial feature component of the environmental dataset 1005. The spatial feature component may comprise topography effects (e.g., the effects of latitude, longitude, or elevation) or road characteristics (e.g., road type classification, such as highway or rural, road surface, number of lanes, etc.). Various other effects that cause spatial variability may be identified. The system determines the spatial feature component based on passing the environmental dataset 1005 through a generalized additive model, such as the spatial module described herein. In response to determining the spatial feature component, the system removes the spatial feature component from the environmental dataset 1005 to obtain spatially detrended residual data 1010.
In some embodiments, the system determines the weather component of the environmental dataset 1005. The system determines the effect of weather on the environmental dataset by passing the spatially detrended residual data 1010 through a generalized additive model such as the weather module described herein. In response to determining the weather component, the system removes the weather component from the spatially detrended residual data 1010 to obtain weather detrended residual data 1015.
In some embodiments, the system determines the dynamic diurnal cycle component of the environmental dataset 1005. The system determines the dynamic diurnal cycle component by passing the weather detrended residual data 1015 through a generalized additive model such as the diurnal cycle module described herein. In response to determining the dynamic diurnal cycle component, the system removes the dynamic diurnal cycle component from the weather detrended residual data 1015 to obtain decycled residual data 1020.
In some embodiments, the system determines the nonlinear trend component of the environmental dataset 1005. The system determines the nonlinear trend component by passing the decycled residual data 1020 through a generalized additive model such as the principal component module described herein. In response to determining the nonlinear trend component, the system removes the nonlinear trend component from the decycled residual data 1020 to obtain temporally detrended residual data 1025.
In some embodiments, the system determines the noise component of the environmental dataset 1005. For example, the system determines the noise component after removing the effects of spatial variability, weather variability, temporal variability, and spatiotemporal variability. The system can determine the noise component by passing the temporally detrended residual data 1025 through a DKF to remove hyperlocal variability. In response to determining the hyperlocal variability, the system removes the hyperlocal variability from the temporally detrended residual data 1025 to obtain the noise component 1030.
As illustrated, the system obtains the environmental dataset 1105. The system can pre-process the environmental dataset 1105 to remove static, road-type-specific diurnal cycles to obtain decycled residual data 1110. In response to obtaining the decycled residual data 1110, the system determines the hyperlocal variability component. In response to determining the hyperlocal variability component, the system removes the hyperlocal variability component from the decycled residual data 1110 to obtain a noise component 1115. In some embodiments, the system determines the noise component by passing the decycled residual data 1110 through a DKF to remove the hyperlocal variability component.
Stacked bar chart 1605 illustrates the components of the set of features for environmental data pertaining to PM2.5 concentrations collected over 2019. Similarly, stacked bar chart 1610 illustrates the same components for environmental data collected over 2020. As illustrated, the variance in the PM2.5 concentrations for each feature is relatively constant over the 2019 and 2020 datasets.
Stacked bar chart 1615 illustrates the components of the set of features for environmental data pertaining to ozone concentrations collected over 2019. Similarly, stacked bar chart 1620 illustrates the same components for environmental data collected over 2020. As illustrated, the variance in the ozone concentrations for each feature is relatively constant over the 2019 and 2020 datasets.
Stacked bar chart 1625 illustrates the components of the set of features for environmental data pertaining to nitrogen dioxide concentrations collected over 2019. Similarly, stacked bar chart 1630 illustrates the same components for environmental data collected over 2020. As illustrated, the variance in the nitrogen dioxide concentrations for each feature is relatively constant over the 2019 and 2020 datasets.
Stacked bar chart 1635 illustrates the components of the set of features for environmental data pertaining to carbon monoxide concentrations collected over 2019. Similarly, stacked bar chart 1640 illustrates the same components for environmental data collected over 2020. As illustrated, the variance in the carbon monoxide concentrations for each feature is relatively constant over the 2019 and 2020 datasets.
As indicated by the results depicted in
According to various embodiments, various orders of steps 2215-2230 may be implemented.
At 2305, the system obtains an indication that an augmented environmental dataset is to be generated. At 2310, the system generates a matrix for the environmental dataset. At 2315, the system determines a set of empty cells in the matrix. At 2320, the system selects an empty cell. At 2325, the system applies a model to impute a value to the selected empty cell. At 2330, the system determines whether a value is to be imputed for another empty cell. In response to determining that a value is to be imputed for another empty cell, process 2300 returns to 2320 and iterates over 2320-2330 until no further values are to be imputed. Conversely, in response to determining that no further values are to be imputed for an empty cell(s), process 2300 proceeds to 2335. At 2335, the system provides the augmented environmental dataset. At 2340, a determination is made as to whether process 2300 is complete. In some embodiments, process 2300 is determined to be complete in response to a determination that no further environmental datasets are to be analyzed, no further augmented environmental datasets are to be generated, no further environmental datasets are to be completed, an administrator indicates that process 2300 is to be paused or stopped, etc. In response to a determination that process 2300 is complete, process 2300 ends. In response to a determination that process 2300 is not complete, process 2300 returns to 2305.
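The cell-by-cell loop of process 2300 can be sketched in a few lines of Python. This is a minimal illustration, not the disclosed model: the imputation model is passed in as a function, and the row-mean model used here is a hypothetical placeholder chosen only to keep the example self-contained.

```python
import math

def augment(matrix, impute):
    """Fill every empty (NaN) cell of a location-by-time matrix using `impute`."""
    empty = [(i, j) for i, row in enumerate(matrix)        # 2315: determine the empty cells
             for j, v in enumerate(row) if math.isnan(v)]
    augmented = [list(row) for row in matrix]              # copy, so the input stays untouched
    for i, j in empty:                                     # 2320-2330: loop over empty cells
        augmented[i][j] = impute(matrix, i, j)             # 2325: apply the model
    return augmented                                       # 2335: provide the augmented dataset

def row_mean_model(matrix, i, j):
    """Hypothetical placeholder model: mean of the observed values at the same location."""
    seen = [v for v in matrix[i] if not math.isnan(v)]
    return sum(seen) / len(seen) if seen else 0.0

nan = float("nan")
dataset = [[1.0, nan, 3.0],
           [nan, 4.0, nan]]
augmented = augment(dataset, row_mean_model)
```

The model always reads from the original matrix rather than from the partially filled copy, so the order in which empty cells are visited does not affect the result.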
Various examples of embodiments described herein are described in connection with flow diagrams. Although the examples may include certain steps performed in a particular order, according to various embodiments, various steps may be performed in various orders and/or various steps may be combined into a single step or performed in parallel.
Although the foregoing embodiments have been described in some detail for purposes of clarity of understanding, the invention is not limited to the details provided. There are many alternative ways of implementing the invention. The disclosed embodiments are illustrative and not restrictive.
This application claims priority to U.S. Provisional Patent Application No. 63/407,536 entitled METHOD FOR AUGMENTING ENVIRONMENTAL DATA RESOLUTION filed Sep. 16, 2022 which is incorporated herein by reference for all purposes.
Number | Date | Country
---|---|---
63407536 | Sep 2022 | US