The present disclosure relates to systems and methods for training machine learning algorithms.
To increase occupant awareness and convenience, vehicles may be equipped with driver assistance systems and/or automated driving systems. Driver assistance systems may use inputs from multiple vehicle sensors to determine information about an environment surrounding the vehicle and determine suggested or optimal behaviors for the vehicle and/or the occupant. In order to determine information about complex environments in dynamic conditions with large amounts of input sensor data, driver assistance systems may utilize machine learning algorithms which take inputs from the vehicle sensors and determine suggested or optimal behaviors. However, machine learning algorithms must be trained using a large amount of input data. Gathering input data may require additional time and/or resources. Additionally, in order to optimally train machine learning algorithms, specific types or categories of training data having specific characteristics may be required.
Thus, while current machine learning algorithms for vehicles achieve their intended purpose, there is a need for a new and improved system and method for training a machine learning algorithm for a vehicle.
According to several aspects, a method for training a machine learning algorithm is provided. The method includes performing at least one exploratory training session of the machine learning algorithm using an input data set. The input data set includes a first plurality of input data samples. The method further includes dividing the input data set into a plurality of input data categories. The method further includes determining a regression curve equation for each of the plurality of input data categories based at least in part on the at least one exploratory training session. The method further includes collecting a second plurality of input data samples based at least in part on the regression curve equation for each of the plurality of input data categories. The method further includes training the machine learning algorithm using the second plurality of input data samples and the input data set.
In another aspect of the present disclosure, dividing the input data set into the plurality of input data categories further may include identifying at least one data set parameter by which to categorize the input data set. Dividing the input data set into the plurality of input data categories further may include dividing the input data set into the plurality of input data categories based on the at least one data set parameter.
In another aspect of the present disclosure, determining the regression curve equation for each of the plurality of input data categories further may include generating an exploratory learning curve plot of a model loss of the machine learning algorithm versus a quantity of input data samples trained for each of the plurality of input data categories in the at least one exploratory training session. Determining the regression curve equation for each of the plurality of input data categories further may include determining the regression curve equation for each of the plurality of input data categories based at least in part on the exploratory learning curve plot for each of the plurality of input data categories.
In another aspect of the present disclosure, determining the regression curve equation for each of the plurality of input data categories further may include determining the regression curve equation for each of the plurality of input data categories based at least in part on the exploratory learning curve plot for each of the plurality of input data categories. The regression curve equation for one of the plurality of input data categories is a power law equation having a form:
$$\varepsilon = \alpha m^{\beta_g} + \gamma$$
where ε is the model loss of the machine learning algorithm for the one of the plurality of input data categories, α is a first constant factor for the one of the plurality of input data categories, m is a quantity of input data samples trained for the one of the plurality of input data categories, $\beta_g$ is a steepness of a regression curve described by the regression curve equation for the one of the plurality of input data categories, and γ is a lower bound model loss of the machine learning algorithm for the one of the plurality of input data categories.
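As an illustration only, the following Python sketch evaluates a power law of the form ε = α·m^β_g + γ, which is the form suggested by the symbol definitions above, and inverts it to estimate how many samples a category would need to reach a target loss. The function names and numeric constants are hypothetical, and the inversion is a plain algebraic rearrangement, not the predetermined quota equation of the disclosure.

```python
import math


def predicted_loss(m, alpha, beta_g, gamma):
    """Model loss predicted by the power law for m trained samples."""
    if m <= 0:
        raise ValueError("m must be positive")
    return alpha * m ** beta_g + gamma


def samples_for_target_loss(target, alpha, beta_g, gamma):
    """Smallest integer m whose predicted loss does not exceed target.

    Assumes a decreasing curve (alpha > 0, beta_g < 0); the target must lie
    above the lower bound gamma, which the curve only approaches.
    """
    if alpha <= 0 or beta_g >= 0:
        raise ValueError("inversion assumes alpha > 0 and beta_g < 0")
    if target <= gamma:
        raise ValueError("target loss must exceed the lower bound gamma")
    # Rearranged from target = alpha * m**beta_g + gamma.
    return math.ceil(((target - gamma) / alpha) ** (1.0 / beta_g))
```

With the hypothetical constants α = 2.0, β_g = -0.5, and γ = 0.25, a target loss of 0.5 corresponds to 64 samples, and no finite quantity of samples reaches a loss of 0.25 or below.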
In another aspect of the present disclosure, collecting the second plurality of input data samples further may include identifying a first subset of the plurality of input data categories based at least in part on a quantity of input data samples in each of the plurality of input data categories. Collecting the second plurality of input data samples further may include determining a plurality of data collection quotas based at least in part on the regression curve equation for each of the plurality of input data categories. Each of the plurality of data collection quotas corresponds to one of the first subset of the plurality of input data categories. Collecting the second plurality of input data samples further may include collecting a second plurality of input data samples based at least in part on the plurality of data collection quotas.
In another aspect of the present disclosure, identifying the first subset of the plurality of input data categories further may include comparing a quantity of input data samples in each of the plurality of input data categories to a previous data collection quota. The previous data collection quota is one of the plurality of data collection quotas determined during a previous execution of the method. Identifying the first subset of the plurality of input data categories further may include determining each of the plurality of input data categories having a quantity of input data samples less than the previous data collection quota to be one of the first subset of the plurality of input data categories.
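The comparison described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the dictionary layout, the function name, and the handling of a category that has no previous quota (it is treated as not needing more data) are all hypothetical and are not specified by the disclosure.

```python
def split_by_quota(category_counts, previous_quotas):
    """Split category names into (first_subset, second_subset).

    A category joins the first subset when its current sample count is
    less than the quota determined during the previous execution. Every
    other category joins the second subset.
    """
    first_subset, second_subset = [], []
    for name, count in category_counts.items():
        # Assumption: a category with no previous quota is never "below quota".
        quota = previous_quotas.get(name, 0)
        if count < quota:
            first_subset.append(name)
        else:
            second_subset.append(name)
    return first_subset, second_subset
```

A category whose count exactly equals its previous quota is placed in the second subset, because the text requires the count to be less than the quota.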
In another aspect of the present disclosure, determining one of the plurality of data collection quotas corresponding to one of the first subset of the plurality of input data categories further may include identifying a second subset of the plurality of input data categories. The second subset of the plurality of input data categories includes each of the plurality of input data categories not in the first subset of the plurality of input data categories. Determining one of the plurality of data collection quotas corresponding to one of the first subset of the plurality of input data categories further may include determining an average steepness
In another aspect of the present disclosure, determining the one of the plurality of data collection quotas using the predetermined equation further may include determining the one of the plurality of data collection quotas using the predetermined equation. The predetermined equation includes:
where $m_{i+1}$ is the one of the plurality of data collection quotas, $m_i$ is a quantity of input data samples in the one of the first subset of the plurality of input data categories, α′ is a first constant factor of the one of the first subset of the plurality of input data categories, γ′ is a lower bound model loss of the one of the first subset of the plurality of input data categories, and
In another aspect of the present disclosure, collecting the second plurality of input data samples further may include determining a quantity of additional input data samples to collect for each of the first subset of the plurality of input data categories using an equation:
$$s = m_{i+1} - m_i$$
where s is the quantity of additional input data samples to collect for the one of the first subset of the plurality of input data categories, $m_{i+1}$ is the one of the plurality of data collection quotas for the one of the first subset of the plurality of input data categories, and $m_i$ is the quantity of input data samples in the one of the first subset of the plurality of input data categories. Collecting the second plurality of input data samples further may include transmitting a data sample collection task to a vehicle. The data sample collection task includes at least the quantity of additional input data samples to collect for each of the first subset of the plurality of input data categories.
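A minimal Python sketch of this step, assuming the relationship s = m_{i+1} - m_i implied by the symbol definitions, is given below. The function names and the dictionary used to represent the data sample collection task are hypothetical, and clamping the result at zero is an added assumption that the text does not state.

```python
def additional_samples(quota, current_count):
    """s = m_{i+1} - m_i, clamped at zero (the clamp is an assumption)."""
    return max(quota - current_count, 0)


def build_collection_task(quotas, counts):
    """Map each category with a quota to the number of extra samples to request.

    quotas: {category: m_{i+1}} for categories in the first subset.
    counts: {category: m_i}, the samples currently held per category.
    """
    return {
        category: additional_samples(quota, counts.get(category, 0))
        for category, quota in quotas.items()
    }
```

For example, quotas of 200 and 150 against current counts of 120 and 80 produce requests for 80 and 70 additional samples, respectively.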
In another aspect of the present disclosure, training the machine learning algorithm using the second plurality of input data samples and the input data set further may include receiving the second plurality of input data samples from the vehicle. Training the machine learning algorithm using the second plurality of input data samples and the input data set further may include generating an updated input data set. The updated input data set includes the second plurality of input data samples and the input data set. Training the machine learning algorithm using the second plurality of input data samples and the input data set further may include performing the method using the updated input data set.
According to several aspects, a system for training a machine learning algorithm for a vehicle is provided. The system includes a server system including a server storage device, a server communication system, and a server controller in electrical communication with the server storage device and the server communication system. The server controller is programmed to perform at least one exploratory training session of the machine learning algorithm using an input data set. The input data set includes a first plurality of input data samples. The input data set is stored on the server storage device. The server controller is further programmed to divide the input data set into a plurality of input data categories. The server controller is further programmed to generate an exploratory learning curve plot of a model loss of the machine learning algorithm versus a quantity of input data samples trained for each of the plurality of input data categories in the at least one exploratory training session. The server controller is further programmed to determine a regression curve equation for each of the plurality of input data categories based at least in part on the exploratory learning curve plot for each of the plurality of input data categories. The regression curve equation for one of the plurality of input data categories is a power law equation having a form:
$$\varepsilon = \alpha m^{\beta_g} + \gamma$$
where ε is the model loss of the machine learning algorithm for the one of the plurality of input data categories, α is a first constant factor for the one of the plurality of input data categories, m is a quantity of input data samples trained for the one of the plurality of input data categories, $\beta_g$ is a steepness of a regression curve described by the regression curve equation for the one of the plurality of input data categories, and γ is a lower bound model loss of the machine learning algorithm for the one of the plurality of input data categories. The server controller is further programmed to collect a second plurality of input data samples from the vehicle using the server communication system. The second plurality of input data samples is based at least in part on the regression curve equation for each of the plurality of input data categories. The server controller is further programmed to train the machine learning algorithm using the second plurality of input data samples and the input data set.
In another aspect of the present disclosure, to collect the second plurality of input data samples from the vehicle using the server communication system, the server controller is further programmed to identify a first subset of the plurality of input data categories based at least in part on a quantity of input data samples in each of the plurality of input data categories. To collect the second plurality of input data samples from the vehicle using the server communication system, the server controller is further programmed to determine a plurality of data collection quotas based at least in part on the regression curve equation for each of the plurality of input data categories, where each of the plurality of data collection quotas corresponds to one of the first subset of the plurality of input data categories. To collect the second plurality of input data samples from the vehicle using the server communication system, the server controller is further programmed to determine a quantity of additional input data samples to collect for each of the first subset of the plurality of input data categories using an equation:
$$s = m_{i+1} - m_i$$
wherein s is the quantity of additional input data samples to collect for the one of the first subset of the plurality of input data categories, $m_{i+1}$ is the one of the plurality of data collection quotas for the one of the first subset of the plurality of input data categories, and $m_i$ is a quantity of input data samples in the one of the first subset of the plurality of input data categories. To collect the second plurality of input data samples from the vehicle using the server communication system, the server controller is further programmed to transmit a data sample collection task to the vehicle using the server communication system. The data sample collection task includes at least the quantity of additional input data samples to collect for each of the first subset of the plurality of input data categories.
In another aspect of the present disclosure, to determine one of the plurality of data collection quotas corresponding to one of the first subset of the plurality of input data categories, the server controller is further programmed to identify a second subset of the plurality of input data categories. The second subset of the plurality of input data categories includes each of the plurality of input data categories not in the first subset of the plurality of input data categories. To determine one of the plurality of data collection quotas corresponding to one of the first subset of the plurality of input data categories, the server controller is further programmed to determine an average steepness
In another aspect of the present disclosure, to determine the one of the plurality of data collection quotas using the predetermined equation, the server controller is further programmed to determine the one of the plurality of data collection quotas using the predetermined equation. The predetermined equation includes:
where $m_{i+1}$ is the one of the plurality of data collection quotas, $m_i$ is a quantity of input data samples in the one of the first subset of the plurality of input data categories, α′ is the constant factor of the one of the first subset of the plurality of input data categories, γ′ is the lower bound model loss of the one of the first subset of the plurality of input data categories, and
In another aspect of the present disclosure, to transmit the data sample collection task to the vehicle using the server communication system, the server controller is further programmed to transmit the data sample collection task to the vehicle using the server communication system. The data sample collection task includes a validation algorithm describing one of the plurality of input data categories and at least one of a task priority, a projected decrease in model loss, and the quantity of additional input data samples to collect.
In another aspect of the present disclosure, the system further includes a vehicle system including at least one vehicle sensor, a vehicle communication system, and a vehicle controller in electrical communication with the at least one vehicle sensor and the vehicle communication system. The vehicle controller is programmed to receive the data sample collection task from the server system using the vehicle communication system. The vehicle controller is further programmed to determine a priority of the data sample collection task based at least in part on at least one of the task priority, the projected decrease in model loss, and the quantity of additional input data samples to collect. The vehicle controller is further programmed to perform the data sample collection task using the at least one vehicle sensor.
In another aspect of the present disclosure, to perform the data sample collection task, the vehicle controller is further programmed to record a plurality of unvalidated input data samples using the at least one vehicle sensor. To perform the data sample collection task, the vehicle controller is further programmed to determine a second plurality of input data samples based at least in part on the validation algorithm. The second plurality of input data samples is a subset of the plurality of unvalidated input data samples. To perform the data sample collection task, the vehicle controller is further programmed to transmit the second plurality of input data samples to the server communication system using the vehicle communication system.
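The vehicle-side filtering described above can be sketched as follows, treating the validation algorithm as a predicate that accepts or rejects a recorded sample. This is only a sketch: the function names, the sample dictionary with a `light_level` field, the threshold value, and the choice to stop once the requested quantity is reached are hypothetical assumptions.

```python
def select_validated_samples(unvalidated, validation_algorithm, quantity):
    """Keep the recorded samples that pass validation, up to `quantity`.

    unvalidated: iterable of recorded samples (any type).
    validation_algorithm: callable returning True when a sample belongs
        to the requested input data category.
    quantity: number of additional samples requested by the task.
    """
    selected = []
    for sample in unvalidated:
        if len(selected) >= quantity:
            break  # assumption: stop once the requested quantity is met
        if validation_algorithm(sample):
            selected.append(sample)
    return selected


def is_low_light(sample, threshold=50.0):
    """Hypothetical validator for a low-light category."""
    return sample["light_level"] < threshold
```

The returned list is always a subset of the recorded samples, which matches the relationship between the second plurality of input data samples and the plurality of unvalidated input data samples.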
According to several aspects, a method for training a machine learning algorithm for a vehicle is provided. The method includes performing at least one exploratory training session of the machine learning algorithm using an input data set. The input data set includes a first plurality of input data samples. The input data set is stored on a server storage device. The method also includes dividing the input data set into a plurality of input data categories. The method also includes generating an exploratory learning curve plot of a model loss of the machine learning algorithm versus a quantity of input data samples trained for each of the plurality of input data categories in the at least one exploratory training session. The method also includes determining a regression curve equation for each of the plurality of input data categories based at least in part on the exploratory learning curve plot for each of the plurality of input data categories. The regression curve equation for one of the plurality of input data categories is a power law equation having a form:
$$\varepsilon = \alpha m^{\beta_g} + \gamma$$
where ε is the model loss of the machine learning algorithm for the one of the plurality of input data categories, α is a first constant factor for the one of the plurality of input data categories, m is a quantity of input data samples trained for the one of the plurality of input data categories, $\beta_g$ is a steepness of a regression curve described by the regression curve equation for the one of the plurality of input data categories, and γ is a lower bound model loss of the machine learning algorithm for the one of the plurality of input data categories. The method also includes transmitting a data sample collection task to a vehicle communication system of the vehicle using a server communication system. The method also includes receiving the data sample collection task using the vehicle communication system. The method also includes collecting a second plurality of input data samples using at least one vehicle sensor. The second plurality of input data samples is based at least in part on the regression curve equation for each of the plurality of input data categories. The method also includes transmitting the second plurality of input data samples from the vehicle communication system to the server communication system. The method also includes training the machine learning algorithm using the second plurality of input data samples received from the vehicle communication system and the input data set.
In another aspect of the present disclosure, collecting the second plurality of input data samples from the vehicle using the server communication system further may include identifying a first subset of the plurality of input data categories based at least in part on a quantity of input data samples in each of the plurality of input data categories. Collecting the second plurality of input data samples from the vehicle using the server communication system further may include determining a plurality of data collection quotas based at least in part on the regression curve equation for each of the plurality of input data categories. Each of the plurality of data collection quotas corresponds to one of the first subset of the plurality of input data categories. Collecting the second plurality of input data samples from the vehicle using the server communication system further may include determining a quantity of additional input data samples to collect for each of the first subset of the plurality of input data categories using an equation:
$$s = m_{i+1} - m_i$$
where s is the quantity of additional input data samples to collect for the one of the first subset of the plurality of input data categories, $m_{i+1}$ is the one of the plurality of data collection quotas for the one of the first subset of the plurality of input data categories, and $m_i$ is a quantity of input data samples in the one of the first subset of the plurality of input data categories. Collecting the second plurality of input data samples from the vehicle using the server communication system further may include transmitting the data sample collection task to the vehicle communication system using the server communication system. The data sample collection task includes at least the quantity of additional input data samples to collect for each of the first subset of the plurality of input data categories.
In another aspect of the present disclosure, determining one of the plurality of data collection quotas corresponding to one of the first subset of the plurality of input data categories further may include identifying a second subset of the plurality of input data categories. The second subset of the plurality of input data categories includes each of the plurality of input data categories not in the first subset of the plurality of input data categories. Determining one of the plurality of data collection quotas corresponding to one of the first subset of the plurality of input data categories further may include determining an average steepness
wherein $m_{i+1}$ is the one of the plurality of data collection quotas, $m_i$ is a quantity of input data samples in the one of the first subset of the plurality of input data categories, α′ is the constant factor of the one of the first subset of the plurality of input data categories, γ′ is the lower bound model loss of the one of the first subset of the plurality of input data categories, and
Further areas of applicability will become apparent from the description provided herein. It should be understood that the description and specific examples are intended for purposes of illustration only and are not intended to limit the scope of the present disclosure.
The drawings described herein are for illustration purposes only and are not intended to limit the scope of the present disclosure in any way.
The following description is merely exemplary in nature and is not intended to limit the present disclosure, application, or uses.
Machine learning algorithms find various applications in the contexts of vehicle control, automated driving, driver assistance systems, and the like. Training machine learning algorithms may require large amounts of data. Gathering and/or storing large amounts of data may be resource and/or time intensive. Therefore, the present disclosure provides a new and improved system and method for gathering data for training a machine learning algorithm for a vehicle which allows for optimized data collection based on performance of the machine learning algorithm, reducing training time and/or resources.
Referring to
The vehicle controller 22 is used to implement a method 100 for training a machine learning algorithm, as will be described below. The vehicle controller 22 includes at least one processor 40 and a non-transitory computer readable storage device or media 42. The processor 40 may be a custom made or commercially available processor, a central processing unit (CPU), a graphics processing unit (GPU), an auxiliary processor among several processors associated with the vehicle controller 22, a semiconductor-based microprocessor (in the form of a microchip or chip set), a macroprocessor, a combination thereof, or generally a device for executing instructions. The computer readable storage device or media 42 may include volatile and nonvolatile storage in read-only memory (ROM), random-access memory (RAM), and keep-alive memory (KAM), for example. KAM is a persistent or non-volatile memory that may be used to store various operating variables while the processor 40 is powered down. The computer-readable storage device or media 42 may be implemented using a number of memory devices such as PROMs (programmable read-only memory), EPROMs (erasable PROM), EEPROMs (electrically erasable PROM), flash memory, or other electric, magnetic, optical, or combination memory devices capable of storing data, some of which represent executable instructions, used by the vehicle controller 22 to control various systems of the vehicle 12. The vehicle controller 22 may also consist of multiple controllers which are in electrical communication with each other. The vehicle controller 22 may be inter-connected with additional systems and/or controllers of the vehicle 12, allowing the vehicle controller 22 to access data such as, for example, speed, acceleration, braking, and steering angle of the vehicle 12.
The vehicle controller 22 is in electrical communication with the at least one vehicle sensor 24 and the vehicle communication system 26. In an exemplary embodiment, the electrical communication is established using, for example, a CAN network, a FLEXRAY network, a local area network (e.g., WiFi, ethernet, and the like), a serial peripheral interface (SPI) network, or the like. It should be understood that various additional wired and wireless techniques and communication protocols for communicating with the vehicle controller 22 are within the scope of the present disclosure.
The at least one vehicle sensor 24 is used to determine performance data about the vehicle 12. In an exemplary embodiment, the at least one vehicle sensor 24 includes at least one of a motor speed sensor, a motor torque sensor, an electric drive motor voltage and/or current sensor, an accelerator pedal position sensor, a coolant temperature sensor, a cooling fan speed sensor, and a transmission oil temperature sensor. In another exemplary embodiment, the at least one vehicle sensor 24 further includes sensors to determine information about an environment surrounding the vehicle 12, for example, an ambient air temperature sensor, a barometric pressure sensor, and/or a photo and/or video camera which is positioned to view the environment in front of the vehicle 12. In another exemplary embodiment, at least one of the at least one vehicle sensor 24 is capable of measuring distances in the environment surrounding the vehicle 12. In a non-limiting example wherein the at least one vehicle sensor 24 includes a camera, the at least one vehicle sensor 24 measures distances using an image processing algorithm configured to process images from the camera and determine distances between objects. In another non-limiting example, the at least one vehicle sensor 24 includes a stereoscopic camera having distance measurement capabilities. In one example, at least one of the at least one vehicle sensor 24 is affixed inside of the vehicle 12, for example, in a headliner of the vehicle 12, having a view through a windscreen of the vehicle 12. In another example, at least one of the at least one vehicle sensor 24 is affixed outside of the vehicle 12, for example, on a roof of the vehicle 12, having a view of the environment surrounding the vehicle 12. It should be understood that various additional types of vehicle sensors, such as, for example, LiDAR sensors, ultrasonic ranging sensors, radar sensors, and/or time-of-flight sensors are within the scope of the present disclosure.
The vehicle communication system 26 is used by the vehicle controller 22 to communicate with other systems external to the vehicle 12. For example, the vehicle communication system 26 includes capabilities for communication with vehicles (“V2V” communication), infrastructure (“V2I” communication), remote systems at a remote call center (e.g., ON-STAR by GENERAL MOTORS) and/or personal devices. In general, the term vehicle-to-everything communication (“V2X” communication) refers to communication between the vehicle 12 and any remote system (e.g., vehicles, infrastructure, and/or remote systems). In certain embodiments, the vehicle communication system 26 is a wireless communication system configured to communicate via a wireless local area network (WLAN) using IEEE 802.11 standards or by using cellular data communication (e.g., using GSMA standards, such as, for example, SGP.02, SGP.22, SGP.32, and the like). Accordingly, the vehicle communication system 26 may further include an embedded universal integrated circuit card (eUICC) configured to store at least one cellular connectivity configuration profile, for example, an embedded subscriber identity module (eSIM) profile. The vehicle communication system 26 is further configured to communicate via a personal area network (e.g., BLUETOOTH) and/or near-field communication (NFC). However, additional or alternate communication methods, such as a dedicated short-range communications (DSRC) channel and/or mobile telecommunications protocols based on the 3rd Generation Partnership Project (3GPP) standards, are also considered within the scope of the present disclosure. DSRC channels refer to one-way or two-way short-range to medium-range wireless communication channels specifically designed for automotive use and a corresponding set of protocols and standards. The 3GPP refers to a partnership between several standards organizations which develop protocols and standards for mobile telecommunications.
3GPP standards are structured as “releases”. Thus, communication methods based on 3GPP release 14, 15, 16 and/or future 3GPP releases are considered within the scope of the present disclosure. Accordingly, the vehicle communication system 26 may include one or more antennas and/or communication transceivers for receiving and/or transmitting signals, such as cooperative sensing messages (CSMs). The vehicle communication system 26 is configured to wirelessly communicate information between the vehicle 12 and another vehicle. Further, the vehicle communication system 26 is configured to wirelessly communicate information between the vehicle 12 and infrastructure or other vehicles. It should be understood that the vehicle communication system 26 may be integrated with the vehicle controller 22 (e.g., on a same circuit board with the vehicle controller 22 or otherwise a part of the vehicle controller 22) without departing from the scope of the present disclosure.
With continued reference to
Referring to
At block 106, the server controller 32 identifies at least one data set parameter by which to categorize the input data set. In an exemplary embodiment, the at least one data set parameter is a parameter which may vary between individual input data samples. In a non-limiting example, wherein the input data set includes images of traffic signals, the at least one data set parameter includes one or more of an environmental light level, a traffic signal phase (i.e., traffic signal color), a weather condition (e.g., clear, fog, rain, and/or the like), a distance between the vehicle and the traffic signal, and/or the like. In an exemplary embodiment, the at least one data set parameter by which to categorize the input data set is predetermined and saved in the server storage device 34 for retrieval by the server controller 32. In another exemplary embodiment, the server controller 32 determines the at least one data set parameter by random selection. After block 106, the method 100 proceeds to block 112.
At block 112, the server controller 32 divides the input data set into a plurality of input data categories based on the at least one data set parameter identified at block 106. In an exemplary embodiment, each of the plurality of input data categories includes a subset of the first plurality of input data samples of the input data set. In a non-limiting example, one of the plurality of input data categories is “low environment light”. In this example, the relevant data set parameter is the environmental light level, and the input data category “low environment light” includes a subset of the first plurality of input data samples of the input data set having an environmental light level below a predetermined environmental light level threshold. Accordingly, at block 112, each of the first plurality of input data samples of the input data set is labeled as being a member of one of the plurality of input data categories, and this information is saved in the server storage device 34. After block 112, the method 100 proceeds to blocks 108 and 110.
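The labeling performed at block 112 can be sketched in Python as follows. This is a minimal sketch, assuming each input data sample is a dictionary carrying a numeric `light_level` field. The field name, the threshold value of 50.0, and the label “normal environment light” for the remaining samples are hypothetical.

```python
from collections import defaultdict


def light_category(sample, threshold=50.0):
    """Label one sample by comparing its light level to a threshold."""
    if sample["light_level"] < threshold:
        return "low environment light"
    return "normal environment light"  # hypothetical complementary label


def divide_into_categories(samples, categorize):
    """Group samples by the category label returned by `categorize`."""
    categories = defaultdict(list)
    for sample in samples:
        categories[categorize(sample)].append(sample)
    return dict(categories)
```

Because each sample receives exactly one label, the resulting categories do not overlap, consistent with each input data sample being a member of one of the plurality of input data categories.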
At block 108, the server controller 32 generates an exploratory learning curve plot for each of the plurality of input data categories determined at block 112 based on the at least one exploratory training session performed at block 104. In the scope of the present disclosure, a learning curve plot is a plot of a model loss of the machine learning algorithm versus a quantity of input data samples trained in the at least one exploratory training session. In the scope of the present disclosure, the model loss of the machine learning algorithm represents a difference between a predicted output of the machine learning algorithm and an actual output for a given input data sample. At block 108, the server controller 32 generates a plurality of exploratory learning curve plots, one for each of the plurality of input data categories. Therefore, the plurality of exploratory learning curve plots describe a learning curve for each of the plurality of input data categories after the at least one exploratory training session. After block 108, the method 100 proceeds to block 114.
At block 114, the server controller 32 determines a regression curve equation for each of the plurality of input data categories. In the scope of the present disclosure, the regression curve equation is a mathematical expression which represents a relationship between a dependent variable (i.e., the model loss) and an independent variable (i.e., the quantity of input data samples trained). In an exemplary embodiment, the server controller 32 first determines a mathematical model for the regression curve equation. In a non-limiting example, the mathematical model is predetermined to be a power law equation having a form:
$$\varepsilon = \alpha \, m^{\beta_g} + \gamma \qquad (1)$$
wherein ε is the model loss of the machine learning algorithm for one of the plurality of input data categories, α is a first constant factor for the one of the plurality of input data categories, m is a quantity of input data samples trained for the one of the plurality of input data categories, β_g is a steepness of a regression curve described by the regression curve equation for the one of the plurality of input data categories, and γ is a lower bound model loss of the machine learning algorithm (i.e., a second constant factor) for the one of the plurality of input data categories.
In an exemplary embodiment, to determine the regression curve equation for each of the plurality of input data categories, the server controller 32 uses a curve fitting method on each of the plurality of exploratory learning curve plots determined at block 108 to fit each of the plurality of exploratory learning curve plots to the regression curve equation (1) described above. In an exemplary embodiment, the curve fitting method is configured to find a tuple (α, β_g, γ) for each of the plurality of input data categories which minimizes curve fitting error between the regression curve described by the regression curve equation (1) and the exploratory learning curve plot for each of the plurality of input data categories. In a non-limiting example, the curve fitting method includes at least one of: the least squares method, the maximum likelihood method, and the Bayesian inference method. It should be understood that additional methods for determining a mathematical expression which represents a relationship between the model loss and the quantity of input data samples trained based on the plurality of exploratory learning curve plots are within the scope of the present disclosure. After block 114, the method 100 proceeds to blocks 116 and 118 as will be discussed in greater detail below.
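The least squares option for the curve fitting at block 114 can be sketched in Python as follows. This is a minimal illustration and not the disclosed implementation. It assumes the power law form of the regression curve equation (1), a negative steepness (loss falling as samples are added), and a hypothetical search grid for the steepness. The function name `fit_power_law` and the sample values are invented for the example.

```python
def fit_power_law(quantities, losses, betas=None):
    """Fit loss = alpha * m**beta + gamma by least squares.

    For a fixed beta the model is linear in (alpha, gamma), so those two
    are solved in closed form; beta itself is chosen by a grid search.
    """
    if betas is None:
        # Assumed search range for the steepness: -0.01 down to -2.00.
        betas = [-(k / 100.0) for k in range(1, 201)]
    n = len(quantities)
    mean_y = sum(losses) / n
    best = None
    for beta in betas:
        xs = [m ** beta for m in quantities]
        mean_x = sum(xs) / n
        sxx = sum((x - mean_x) ** 2 for x in xs)
        if sxx == 0.0:
            continue  # all quantities identical; slope is undefined
        sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, losses))
        alpha = sxy / sxx
        gamma = mean_y - alpha * mean_x
        # Sum of squared residuals is the curve fitting error to minimize.
        sse = sum((alpha * x + gamma - y) ** 2 for x, y in zip(xs, losses))
        if best is None or sse < best[0]:
            best = (sse, alpha, beta, gamma)
    if best is None:
        raise ValueError("need at least two distinct sample quantities")
    return best[1], best[2], best[3]


# Synthetic learning curve for one category (illustrative values only).
quantities = [10, 20, 50, 100, 200, 500]
losses = [2.0 * m ** -0.5 + 0.1 for m in quantities]
alpha, beta, gamma = fit_power_law(quantities, losses)
```

The grid search keeps the sketch dependency-free. A general nonlinear least squares routine would serve the same purpose and would not restrict the steepness to grid values.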
At block 110, the server controller 32 identifies a first subset of the plurality of input data categories. In an exemplary embodiment, the first subset of the plurality of input data categories includes input data categories for which not enough input data samples have yet been collected. Upon an initial execution of the method 100, to determine the first subset of the plurality of input data categories, the server controller 32 compares a quantity of input data samples in each of the plurality of input data categories to a predetermined sample quantity threshold (e.g., one hundred samples). If the quantity of input data samples in a given one of the plurality of input data categories is less than the sample quantity threshold, the given one of the plurality of input data categories is determined to be included in the first subset of the plurality of input data categories. As will be discussed in greater detail below, the method 100 may be repeatedly executed, with some or all data from previous executions of the method 100 being stored persistently in the server storage device 34. Upon subsequent executions of the method 100, the server controller 32 compares the quantity of input data samples in each of the plurality of input data categories to a previous data collection quota. The previous data collection quota is a data collection quota determined during a previous execution of the method 100 and stored in the server storage device 34. The data collection quota will be defined and discussed in further detail below. If the quantity of input data samples in a given one of the plurality of input data categories is less than the previous data collection quota, the given one of the plurality of input data categories is determined to be included in the first subset of the plurality of input data categories. After block 110, the method 100 proceeds to block 120.
At block 120, the server controller 32 identifies a second subset of the plurality of input data categories. In an exemplary embodiment, the second subset of the plurality of input data categories includes input data categories for which enough input data samples have been collected. In an exemplary embodiment, the second subset of the plurality of input data categories includes each of the plurality of input data categories not in the first subset of the plurality of input data categories as determined at block 110. After block 120, the method 100 proceeds to blocks 116 and 118.
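The split into the first and second subsets at blocks 110 and 120 can be illustrated with a short Python sketch. The function name `split_categories`, the category names, and the counts are hypothetical. Falling back to the predetermined threshold when a category has no stored previous quota is an assumption of this sketch and is not stated in the disclosure.

```python
def split_categories(sample_counts, previous_quotas=None, default_threshold=100):
    """Return (first_subset, second_subset) of category names.

    first_subset: categories still short of samples.
    second_subset: every remaining category.
    """
    first_subset, second_subset = [], []
    for name, count in sample_counts.items():
        if previous_quotas and name in previous_quotas:
            # Subsequent executions: compare against the stored quota.
            threshold = previous_quotas[name]
        else:
            # Initial execution (or, by assumption, no stored quota).
            threshold = default_threshold
        if count < threshold:
            first_subset.append(name)
        else:
            second_subset.append(name)
    return first_subset, second_subset


counts = {"low light": 40, "rain": 250, "fog": 100}
initial_first, initial_second = split_categories(counts)
later_first, later_second = split_categories(counts, previous_quotas={"rain": 400})
```

Because the second subset is defined as the complement of the first, a single pass over the categories yields both.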
At block 116, the server controller 32 determines an average steepness of the regression curves described by the regression curve equations determined at block 114.
At block 118, the server controller 32 determines a constant factor α′ and a lower bound model loss γ′ of each of the first subset of the plurality of input data categories based on the plurality of regression curve equations corresponding to each of the first subset of the plurality of input data categories determined at block 114. After block 118, the method 100 proceeds to block 122.
At block 122, the server controller 32 determines a plurality of data collection quotas, wherein each of the plurality of data collection quotas corresponds to one of the first subset of the plurality of input data categories. In the scope of the present disclosure, a data collection quota for a given input data category is a total number of input data samples for the given input data category which will result in a reduction of the model loss for the given input data category by a predetermined loss reduction amount (e.g., fifty percent) upon further training. In an exemplary embodiment, each of the plurality of data collection quotas is determined using a predetermined equation:
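wherein m_{i+1} is one of the plurality of data collection quotas, m_i is the quantity of input data samples in the one of the first subset of the plurality of input data categories, α′ is the constant factor of the one of the first subset of the plurality of input data categories as determined at block 118, γ′ is the lower bound model loss of the one of the first subset of the plurality of input data categories as determined at block 118, and the average steepness is as determined at block 116. The quantity of additional input data samples to collect is then the difference between the data collection quota and the quantity of input data samples already collected:
$$s = m_{i+1} - m_i$$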
wherein s is the quantity of additional input data samples to collect for one of the first subset of the plurality of input data categories, m_{i+1} is the one of the plurality of data collection quotas for the one of the first subset of the plurality of input data categories, and m_i is the quantity of input data samples in the one of the first subset of the plurality of input data categories. Accordingly, after completion of block 122, the method 100 has computed a quantity of additional input data samples to collect for each of the first subset of the plurality of input data categories in order to reduce the model loss of each of the first subset of the plurality of input data categories by the predetermined loss reduction amount (by half, in the example above). It should be understood that by collecting the additional input data samples (as will be discussed in greater detail below) and repeatedly executing the method 100, the model loss will be iteratively decreased. After block 122, the method 100 proceeds to block 124.
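The following Python sketch shows one way a data collection quota could be computed. It does not reproduce the predetermined equation of the disclosure. It assumes the power law form of the regression curve equation (1), a positive constant factor, a negative average steepness, and a target loss equal to the current predicted loss scaled by the loss reduction amount. The function names and example values are hypothetical.

```python
import math


def data_collection_quota(m_i, alpha, beta_avg, gamma, loss_reduction=0.5):
    """Solve alpha * m**beta_avg + gamma = target loss for m.

    Returns the quota as a float, or None when the target loss is at or
    below the lower bound gamma and therefore cannot be reached.
    """
    if alpha <= 0 or beta_avg >= 0:
        raise ValueError("sketch assumes alpha > 0 and beta_avg < 0")
    current_loss = alpha * m_i ** beta_avg + gamma
    target_loss = (1.0 - loss_reduction) * current_loss
    if target_loss <= gamma:
        return None
    return ((target_loss - gamma) / alpha) ** (1.0 / beta_avg)


def additional_samples(m_i, quota):
    """Additional samples to collect: quota minus samples already held."""
    return max(0, math.ceil(quota) - m_i)


quota = data_collection_quota(m_i=100, alpha=2.0, beta_avg=-0.5, gamma=0.0)
unreachable = data_collection_quota(m_i=100, alpha=2.0, beta_avg=-0.5, gamma=0.3)
extra = additional_samples(100, 400.0)
```

Under these assumptions, a category whose lower bound model loss is already more than half of its current loss has no finite quota for a fifty percent reduction. How the disclosed method treats that case is not stated, so the sketch simply returns `None`.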
At block 124, the server controller 32 uses the server communication system 36 to transmit a data sample collection task for each of the first subset of the plurality of input data categories to the vehicle 12. In the scope of the present disclosure, the data sample collection task is a message containing information for the vehicle controller 22 to enable collection of the additional input data samples. In an exemplary embodiment, the data sample collection task includes a validation algorithm describing one of the plurality of input data categories, a task priority (i.e., an indication of a priority level of the data sample collection task relative to other data sample collection tasks), a projected decrease in model loss (e.g., determined by extrapolation of the regression curve equation), the quantity of additional input data samples to collect, a task identifier (i.e., a numeric code which uniquely identifies the data sample collection task), and a sensor identifier (i.e., a numeric code which uniquely identifies which of the at least one vehicle sensor 24 should be used for data collection).
In the scope of the present disclosure, the validation algorithm is a mathematical, logical, programmatic, or similar algorithm which is configured to determine whether a particular collected data sample falls into a desired one of the plurality of input data categories. In a non-limiting example, the validation algorithm is a simple logical statement such as, for example, “if light_level is greater than 200, then true, otherwise, false”. In the above example, the validation algorithm would result in selection of only input data having a light level greater than two hundred. In other examples, the validation algorithm may be a more complex construction, such as, for example, a series of nested logical statements, a set of computer-executable instructions (i.e., a computer program or script), or a mathematical function. The validation algorithm will be discussed in further detail below. It should be understood that the data sample collection task may include additional information and/or omit portions of the above information without departing from the scope of the present disclosure. After block 124, the method 100 proceeds to block 126.
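A data sample collection task with the fields listed above might be represented as in the following Python sketch. The class name, field names, and values are hypothetical. The validation algorithm is modeled here as an in-process Python callable, whereas a task transmitted to a vehicle would need a serializable form such as a script or expression.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class DataSampleCollectionTask:
    task_id: int                      # uniquely identifies the task
    sensor_id: int                    # which vehicle sensor to use
    task_priority: int                # priority relative to other tasks
    projected_loss_decrease: float    # extrapolated from the regression curve
    samples_to_collect: int           # quantity of additional samples
    validation: Callable[[dict], bool]


def bright_scene(sample: dict) -> bool:
    """The example rule from the text: true if light_level exceeds 200."""
    return sample["light_level"] > 200


task = DataSampleCollectionTask(
    task_id=1,
    sensor_id=7,
    task_priority=2,
    projected_loss_decrease=0.05,
    samples_to_collect=300,
    validation=bright_scene,
)
accepted = task.validation({"light_level": 250})
rejected = task.validation({"light_level": 200})
```

A light level of exactly 200 is rejected because the example rule is a strict "greater than" comparison.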
At block 126, the vehicle 12 receives the data sample collection tasks transmitted at block 124 and collects a second plurality of input data samples, as will be discussed in greater detail in reference to
In an exemplary embodiment, the server controller 32 repeatedly exits the standby state 128 and restarts the method 100 at block 102. In a non-limiting example, the server controller 32 exits the standby state 128 and restarts the method 100 on a timer, for example, every three hundred milliseconds. By repeatedly performing the method 100, the model loss for each of the first subset of the plurality of input data categories is iteratively decreased.
Referring to
At block 304, the vehicle controller 22 determines a priority of each of the plurality of data sample collection tasks received at block 302. In an exemplary embodiment, the priority is determined based on the task priority of each of the plurality of data sample collection tasks. In examples where multiple data sample collection tasks have an equal task priority, the projected decrease in model loss and/or the quantity of additional input data samples to collect are used to determine the priority of each of the plurality of data sample collection tasks. After block 304, the exemplary embodiment of block 126 proceeds to block 306.
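The ordering at block 304 can be sketched as a multi-key sort in Python. The sketch assumes that a larger `task_priority` value means a more urgent task, and that ties are broken first by the larger projected decrease in model loss and then by the larger quantity of samples to collect. The disclosure names these tie-break inputs but does not fix their direction or order, so both are assumptions, as are the dictionary keys.

```python
def order_tasks(tasks):
    """Return tasks sorted from highest to lowest collection priority."""
    return sorted(
        tasks,
        key=lambda t: (
            -t["task_priority"],            # assumed: larger is more urgent
            -t["projected_loss_decrease"],  # first tie-break (assumed order)
            -t["samples_to_collect"],       # second tie-break (assumed order)
        ),
    )


tasks = [
    {"task_id": 1, "task_priority": 1, "projected_loss_decrease": 0.20, "samples_to_collect": 50},
    {"task_id": 2, "task_priority": 2, "projected_loss_decrease": 0.05, "samples_to_collect": 300},
    {"task_id": 3, "task_priority": 2, "projected_loss_decrease": 0.10, "samples_to_collect": 100},
]
ordered_ids = [t["task_id"] for t in order_tasks(tasks)]
```

Tasks 2 and 3 share the top priority, so the projected decrease in model loss decides between them.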
At block 306, the vehicle controller 22 uses the at least one vehicle sensor 24 to record a plurality of unvalidated input data samples. In the scope of the present disclosure, unvalidated input data samples are input data samples recorded by the at least one vehicle sensor 24, but which may not necessarily correspond to any of the first subset of the plurality of input data categories. After block 306, the exemplary embodiment of block 126 proceeds to block 308.
At block 308, the vehicle controller 22 uses the validation algorithm, as discussed above, to determine the second plurality of input data samples. In the scope of the present disclosure, the second plurality of input data samples is a subset of the plurality of unvalidated input data samples for which the validation algorithm of at least one of the plurality of data sample collection tasks returned a true value. In other words, the validation algorithm of each of the plurality of data sample collection tasks is executed with each of the plurality of unvalidated input data samples as an input. Each unvalidated input data sample for which at least one of the validation algorithms returns “true” is included in the second plurality of input data samples. After block 308, the exemplary embodiment of block 126 proceeds to block 310.
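The selection at block 308 amounts to keeping every recorded sample that at least one validation algorithm accepts. A minimal Python sketch follows, with hypothetical validators and sample fields.

```python
def select_validated(unvalidated_samples, validators):
    """Keep samples for which at least one validator returns True."""
    return [
        sample
        for sample in unvalidated_samples
        if any(validator(sample) for validator in validators)
    ]


# Hypothetical validators, one per data sample collection task.
def is_bright(sample):
    return sample["light_level"] > 200


def is_rain(sample):
    return sample["weather"] == "rain"


recorded = [
    {"id": 1, "light_level": 250, "weather": "clear"},
    {"id": 2, "light_level": 80, "weather": "fog"},
    {"id": 3, "light_level": 90, "weather": "rain"},
]
second_plurality = select_validated(recorded, [is_bright, is_rain])
```

Sample 2 matches neither validator and is dropped, so only samples relevant to an open collection task would be transmitted at block 310.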
At block 310, the vehicle controller 22 uses the vehicle communication system 26 to transmit the second plurality of input data samples determined at block 308 to the server communication system 36. In an exemplary embodiment, after receiving the second plurality of input data samples, the server system 30 adds the second plurality of input data samples to the input data set discussed above, generating an updated input data set. The method 100 is then repeated with the updated input data set. After block 310, the exemplary embodiment of block 126 is concluded, and the method 100 continues as discussed above.
The system 10 and method 100 of the present disclosure offer several advantages. The system 10 allows for communication between the vehicle system 20 and the server system 30 to optimize data collection for training of the machine learning algorithm. Using the method 100, characteristics of the training data gathered by the vehicle system 20 are adjusted based on the performance of the machine learning algorithm, resulting in reduced training time and reduced resource use.
The description of the present disclosure is merely exemplary in nature and variations that do not depart from the gist of the present disclosure are intended to be within the scope of the present disclosure. Such variations are not to be regarded as a departure from the spirit and scope of the present disclosure.