The following relates generally to a machine learning augmented reality (AR) system and method.
Machine learning (ML) algorithms are generally operable to perform a given task using patterns and inferences. Machine learning algorithms are generally based on a mathematical model that relies on “training data” to make predictions or decisions without being explicitly programmed to perform the requested task.
In one embodiment, an augmented reality system (AR system) and method is disclosed that may include a controller operable to process one or more convolutional neural networks (CNN) and a visualization device operable to acquire one or more two-dimensional RGB images. The controller may also generate an anchor vector in a semantic space in response to an anchor image being provided to a first convolutional neural network (CNN). The anchor image may be one of the two-dimensional RGB images acquired by the visualization device. The controller may also generate a positive vector and a negative vector in the semantic space in response to a positive image and a negative image, respectively, being provided to a second CNN. The negative image may be a first three-dimensional computer-aided design (CAD) image and the positive image may be a second three-dimensional CAD image. Both the first CAD image and second CAD image may be provided to the AR system from a database.
It is contemplated that the first CNN and the second CNN may each include one or more convolutional layers, one or more max pooling layers, a flattening layer, a dropout layer, and a fully connected layer. It is also contemplated that the controller may apply a cross-domain deep metric learning algorithm that is operable to extract image features in the semantic space using the anchor image, positive image, and negative image. It is further contemplated that the controller may be operable to extract one or more image features from different modalities using the anchor vector, positive vector, and negative vector.
The cross-domain deep metric learning algorithm may be implemented as a triplet loss algorithm that is operable to decrease a first distance between the anchor vector and the positive vector in the semantic space and increase a second distance between the anchor vector and the negative vector in the semantic space. Also, the convolutional layers included within the first CNN and second CNN may be implemented using one or more activation functions that may include a rectified linear unit.
It is contemplated that the first CNN and second CNN may employ a skip-connection architecture. The second CNN may also be designed as a Siamese network. The controller may perform step recognition by analyzing the image features extracted in the semantic space. Lastly, the controller may also be operable to determine if an invalid repair sequence has occurred based on an analysis of the image features in the semantic space.
It is contemplated that the controller may further be operable to determine a pose of an image object within the one or more RGB images. The controller may also apply a post-processing image algorithm to the one or more RGB images.
As required, detailed embodiments are disclosed herein; however, it is to be understood that the disclosed embodiments are merely exemplary and may be embodied in various and alternative forms. The figures are not necessarily to scale; some features may be exaggerated or minimized to show details of particular components. Therefore, specific structural and functional details disclosed herein are not to be interpreted as limiting, but merely as a representative basis for teaching one skilled in the art to variously employ the present embodiments.
It is contemplated that a vision-based AR system may be desirable that is operable to recognize the different statuses or steps of a work procedure (e.g., the correct steps needed to repair a vehicle). The vision-based AR system may be operable to recognize the different steps using three-dimensional (3D) models of the entire procedure (e.g., 3D models of the vehicle and the subset of vehicle parts). It is contemplated that the AR system may employ a machine learning algorithm that receives input data from different domains and encodes the input data as high-dimensional feature vectors. The machine learning algorithm may also be operable to transform the encoded features into one semantic space for distance measurement.
The client system 110 may further include a controller 114 and memory 116. The controller 114 may be one or more computing devices for processing commands, such as a quad-core processor, a computer processor, a microprocessor, or any other device, series of devices, or other mechanism capable of performing the operations discussed herein. The memory 116 may be operable to store instructions and commands. The instructions may be in the form of software, firmware, computer code, or some combination thereof. The memory 116 may be in any form of one or more data storage devices, such as volatile memory, non-volatile memory, electronic memory, magnetic memory, optical memory, or any other form of data storage device. The memory 116 may be internal to client system 110 (e.g., DDR memory) or may comprise removable memory components (e.g., micro-SD card memory).
RGB images 113 may be captured by visualization device 112 for further processing by controller 114. For instance, the controller 114 may be operable to determine the position and orientation (i.e., pose) of the object captured by the RGB images 113 using visual odometry algorithms, simultaneous localization and mapping (SLAM) algorithms, and/or model-based tracking algorithms.
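For illustration only, one such model-based approach may be sketched using OpenCV's solvePnP, which recovers a pose from 2D-3D point correspondences. The correspondences and camera intrinsics below are assumed placeholder values, not values produced by the disclosed system:

```python
import cv2
import numpy as np

# Hypothetical 2D-3D correspondences between keypoints detected in an RGB
# image 113 and points on the tracked object's 3D model (placeholder values).
object_points = np.array(
    [[0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0], [1, 0, 1]],
    dtype=np.float64,
)
image_points = np.array(
    [[320, 240], [420, 235], [315, 150], [335, 225], [415, 145], [430, 220]],
    dtype=np.float64,
)
# Assumed pinhole camera intrinsics (800 px focal length, principal point at
# the image center) with no lens distortion.
camera_matrix = np.array([[800, 0, 320], [0, 800, 240], [0, 0, 1]], dtype=np.float64)
dist_coeffs = np.zeros(4)

# solvePnP recovers the object's pose: rvec is the rotation (as a Rodrigues
# vector) and tvec the translation of the object relative to the camera.
ok, rvec, tvec = cv2.solvePnP(object_points, image_points, camera_matrix, dist_coeffs)
```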
The client system 110 may be connected and in communication with a server system 120. For instance, client system 110 may include a client transceiver 118 operable to transmit and receive data using a wired network (e.g., LAN) or using a wireless network communication (e.g., WiFi, cellular, Bluetooth, or Zigbee). Server system 120 may include a server transceiver 122 also operable to transmit and receive data using a wired network (e.g., LAN) or using a wireless network communication (e.g., WiFi, cellular, Bluetooth, or Zigbee). It is contemplated that the client transceiver 118 may be operable to transmit data, e.g., RGB images 113 and pose data, to the server transceiver 122. However, it is further contemplated that client system 110 and server system 120 may be included in one unit, therefore not requiring the client transceiver 118 and server transceiver 122.
It is contemplated that the server system 120 may also include memory 126, like memory 116, for storing the RGB image 113 and pose data received from client system 110. The server system 120 may also include a controller 124, like controller 114, operable to perform post-processing algorithms on the RGB image 113. The server system 120 may also be connected and in communication with a database 128 for storing 3-dimensional computer aided design (3D-CAD) models (i.e., 3-D images) of the target work procedure. Server system 120 may communicate with database 128 using server transceiver 122. It is also contemplated that database 128 may be included within server system 120, therefore not requiring the server transceiver 122. Controller 124 may be operable to apply a computer graphic rendering algorithm that generates one or more normal map images of the 3D-CAD model that may be compared with the RGB images 113 based on the view defined by the pose data received from client system 110. Controller 124 may also be operable to apply post-processing image algorithms to the RGB image 113 received from client system 110.
Controller 124 may also be operable to apply one or more deep neural network (DNN) algorithms. It is contemplated that the controller 124 may apply a DNN status prediction algorithm using the normal map images and the RGB images 113 to predict the status of a work procedure (e.g., current step). The server system 120 may transmit instructions back to the client system 110 that may be visually displayed by visualization device 112.
It is contemplated that the DNN may be a convolutional neural net (CNN) that includes convolutional and pooling layers for handling image recognition applications. The CNN may be a two-branch CNN that is operable to obtain a bilinear feature vector for fine-grained classification. The CNN may also be operable to learn the proper features to compare the fine-grained categories of a given image. It is also contemplated that the machine learning algorithm may be operable to capture small variations in the important regions of interest (ROI) and to avoid noisy background images, illumination changes, and viewpoint changes. It is further contemplated that the controller 124 may be operable to employ deep metric learning (e.g., cross-domain deep metric learning) to further improve performance of the CNN by not constraining the recognition process to only the procedures provided during a training phase.
For instance, controller 124 may be operable to employ a CNN for determining the current step the user is performing and provide instructions for the next step the user will need to perform (i.e., “step recognition”). Controller 124 may also be operable to employ the CNN for detecting if a user's action deviates from a prestored sequence of steps (i.e., “invalid step detection”). Upon detecting that the user has performed an invalid step, the controller 124 may transmit instructions to client system 110 to inform the user how to rectify any incorrect actions already taken. It is contemplated that the controller 124 may be operable to perform “step recognition” and “invalid step detection” using the 3D-CAD models stored in database 128 that correspond to the target work procedure being assessed.
It is contemplated that the CNN may be designed as a cross-domain deep metric learning algorithm that is trained to determine similarity measurements between the 2-D RGB images 113 acquired by the visualization device 112 and the 3D-CAD data stored in database 128. By training the CNN, the controller 124 may be operable to compare the similarity or distance between the original data (i.e., the original 2-D images and 3-D CAD models) and the data transformed to the semantic space using a variation of the triplet loss algorithm as represented by Equation (1) below:
$$\mathcal{L}_{triplet}=\sum_{i}^{N}\Big[\big\|f_{RGB}(x_i^{a})-f_{Normal}(x_i^{p})\big\|_2^2-\big\|f_{RGB}(x_i^{a})-f_{Normal}(x_i^{n})\big\|_2^2+\alpha\Big]_{+}\qquad(1)$$
where $\mathcal{L}_{triplet}$ is the calculated triplet loss, $f_{RGB}$ is the branch of the CNN that encodes features from the RGB images 113, $f_{Normal}$ is the branch of the CNN that extracts features from the normal map images rendered from the 3D-CAD models stored in database 128, $x_i^{a}$, $x_i^{p}$, and $x_i^{n}$ are the anchor, positive, and negative inputs of the $i$-th triplet, and $\alpha$ is the margin enforced between the positive and negative pairs. It is contemplated that the triplet loss function may operate on a triplet of feature vectors derived from an anchor image, a positive image, and a negative image.
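As a non-limiting illustration, Equation (1) may be sketched in PyTorch as follows; the margin value of 0.2 is an assumption, as the disclosure does not specify a value for $\alpha$:

```python
import torch

def triplet_loss(anchor_feat, positive_feat, negative_feat, margin=0.2):
    """Equation (1): sum over the batch of [d(a, p) - d(a, n) + alpha]_+,
    where d is the squared L2 distance in the semantic space."""
    d_pos = (anchor_feat - positive_feat).pow(2).sum(dim=1)  # ||f_RGB(x^a) - f_Normal(x^p)||^2
    d_neg = (anchor_feat - negative_feat).pow(2).sum(dim=1)  # ||f_RGB(x^a) - f_Normal(x^n)||^2
    return torch.clamp(d_pos - d_neg + margin, min=0.0).sum()  # [.]_+ summed over N triplets
```

Here anchor_feat, positive_feat, and negative_feat would be the outputs of $f_{RGB}$ and $f_{Normal}$ for a batch of N triplets.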
It is also contemplated that the controller 124 may be operable to employ a two-branch CNN that extracts high-dimensional features from the RGB images 113 acquired by the client system 110 and from the normal map images rendered from the 3D-CAD models stored in database 128. It is contemplated that each branch of the CNN may include one or more convolution layers having kernels of various sizes (e.g., 3×3 or 5×5). The number and size of the kernels may be adjusted depending on a given application.
It is also contemplated that the CNN may include: one or more convolution layers employing a rectified linear unit (ReLU) or tanh as the activation function; one or more normalization layers that are operable to improve the performance and stability of the CNN; and a max pooling layer that may reduce the dimensionality. The CNN may include one or more stacked modules based upon the complexity of the target data distribution. The CNN may also employ a skip-connection architecture where the output of one layer of the CNN may not be directly inputted to the next sequential layer of the CNN. The skip-connection architecture may instead allow the output of one layer to be connected to the input of a non-sequential layer.
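For illustration, a minimal PyTorch sketch of one such stacked module with a skip connection follows, assuming batch-normalized convolutional layers with ReLU activations as described above:

```python
import torch
from torch import nn

class SkipConvBlock(nn.Module):
    """One stacked module: convolution, normalization layer, and ReLU, with a
    skip connection that routes the block's input around the convolution and
    adds it to the output (a non-sequential connection)."""
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size, padding=kernel_size // 2)
        self.norm = nn.BatchNorm2d(channels)  # improves performance and stability
        self.act = nn.ReLU()

    def forward(self, x):
        return self.act(self.norm(self.conv(x)) + x)  # "+ x" is the skip connection
```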
Table 1 below illustrates an exemplary structure that may be used by the CNN for one branch that may comprise a design having 8,468,784 parameters.
As illustrated, the CNN may include one or more convolutional layers, one or more max pooling layers, a flattening layer, a dropout layer, and a dense layer (i.e., fully connected or linear layer). Table 1 also lists the size of the data input into each layer of the CNN (i.e., “Size-in”) and the corresponding size of the data output by a given layer of the CNN (i.e., “Size-out”). It is contemplated that the kernel size for each convolutional layer can vary, but it is also contemplated that the kernel sizes may be the same depending on a given application. The CNN may also employ a ReLU activation function for each convolutional layer. It is contemplated, however, that the activation function may vary based on a given application and the CNN may employ other known activation functions (e.g., a tanh activation function).
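Because Table 1 itself is not reproduced here, the following PyTorch sketch assembles one illustrative branch from the layer types named above; the channel counts, kernel sizes, dropout rate, 128-dimensional output, and assumed 224×224 RGB input are placeholders rather than the actual Table 1 values:

```python
from torch import nn

# One illustrative branch; assumes a 3-channel 224x224 input image.
branch = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=5, padding=2), nn.ReLU(),   # convolutional layer + ReLU
    nn.MaxPool2d(2),                                          # max pooling layer
    nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),  # convolutional layer + ReLU
    nn.MaxPool2d(2),                                          # max pooling layer
    nn.Flatten(),                                             # flattening layer
    nn.Dropout(0.5),                                          # dropout layer
    nn.Linear(64 * 56 * 56, 128),                             # dense (fully connected) layer
)
```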
CNN 400 may include an RGB network 410 branch that may receive an anchor image 420 as input data (e.g., RGB image 113). The RGB network 410 may apply the function $f_{RGB}$, which represents the branch of the CNN that encodes features from one of the RGB images 113. The CNN 400 may also include a branch having a normalization network 430 that may receive as input data a positive image 440 and a negative image 450 (e.g., one of the 3D-CAD models stored within database 128).
The output data of the RGB network 410 may be provided to an RGB encoded feature 460. Also, the output data of the normalization network 430 may be provided to a positive encoded feature 470 and negative encoded feature 480. The triplet loss algorithm 490 (discussed with reference to Equation 1 above) may then be employed to decrease the vector distance between the anchor image 420 and the positive image 440 and increase the vector distance between the anchor image 420 and negative image 450.
It is contemplated that the normalization network 430 may be implemented as a “Siamese Network” that is able to operate in tandem to compute different vector distances for the positive image 440 and negative image 450 provided. It is also contemplated that RGB network 410 and normalization network 430 may include different layers (i.e., convolutional layers, max pooling layers, flattening layers, dropout layers, normalization layers, and dense layers) that employ different training data depending on a given application.
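A minimal sketch of this two-branch arrangement follows; reusing a single normal-map module for both the positive and negative inputs is what produces the shared-weight (“Siamese”) behavior, and the module names are placeholders:

```python
from torch import nn

class TwoBranchCNN(nn.Module):
    """Sketch of CNN 400: rgb_net plays the role of the RGB network 410 and
    normal_net the normalization network 430. Applying the same normal_net
    module to both the positive and negative images shares its weights."""
    def __init__(self, rgb_net, normal_net):
        super().__init__()
        self.rgb_net = rgb_net
        self.normal_net = normal_net

    def forward(self, anchor, positive, negative):
        return (
            self.rgb_net(anchor),       # RGB encoded feature 460
            self.normal_net(positive),  # positive encoded feature 470
            self.normal_net(negative),  # negative encoded feature 480
        )
```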
RGB network 510 (i.e., $f_{RGB}$) may be provided an RGB image 520 as an input. Again, the RGB image 520 may be one of the RGB images 113 received from client system 110. CNN 500 may also include a normalization network 530 ($f_{normalization}$) that receives one or more normalized images 540 extracted during the training phase. RGB network 510 may provide output vector data to an RGB encoded feature 550 and normalization network 530 may provide output vector data to one or more normal encoded features 560, 570.
CNN 500 may then use the output vector data provided by the RGB encoded feature 550 and the one or more normal encoded features 560, 570 to compute a set of distance vectors 580 between the feature vectors of the normalized images 540 and the feature vector encoded from RGB image 520. CNN 500 may then select the distance vector that has the smallest distance in the semantic space.
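A hedged sketch of this distance-based selection follows, assuming the encoded features are already available as tensors:

```python
import torch

def recognize_step(rgb_feature, normal_features):
    """Given the RGB encoded feature 550 (shape [D]) and the stacked normal
    encoded features 560, 570, ... (shape [K, D]), return the index of the
    candidate with the smallest distance in the semantic space."""
    distances = (normal_features - rgb_feature).pow(2).sum(dim=1)  # squared L2 per candidate
    return int(torch.argmin(distances))
```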
Again, it is also contemplated that the CNN may be operable to detect an “invalid” status, which indicates that a given step or sequence being performed by a user may not match any step in the procedure. The CNN may determine that the invalid status may have been the result of an incorrect sequence of operations performed by the user. It is contemplated that, by analyzing the RGB images 113 acquired by client system 110, the CNN may be able to analyze the critical parts of a procedure to recognize an invalid status.
It is contemplated that the presence or absence of parts within RGB image 600 may be processed using the two-branch CNN network that is trained using the triplet loss algorithm discussed with reference to Equation (1) above.
It is also contemplated that the CNN may use one of the 3D bounding boxes (e.g., first bounding box 610) as an anchor image. It is also contemplated that the anchor image and a positive image and negative image provided by database 128 are used to train CNN 400 to employ the invalid step recognition process. During the training phase, the anchor image and positive image may contain the same part, while the negative image may contain a part that is not the same as the anchor image. It is contemplated that the triplet loss algorithm may again be employed using the anchor image, positive image, and negative image to train CNN 400 for determining when a part is detected or absent within a given region of interest.
Once trained, CNN 500 may be employed during a testing phase or run-time operation to provide invalid or incorrect step detection. The testing or run-time operation may crop a captured image (e.g., RGB image 113 received by client system 110) into different parts. The CNN may then be operable to compare the parts in the image with the corresponding areas in two or more normal map images. It is contemplated that one normal map image may include the part and the other normal map image may not include the part. If the feature extracted from the image is closer to the normal map image with the part, the CNN may determine that the part is DETECTED. If the feature extracted from the image is closer to the normal map image without the part, the CNN may determine that the part is not detected (i.e., NONE).
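This DETECTED/NONE decision may be sketched as follows, assuming the per-part features have already been extracted by the trained CNN:

```python
import torch

def detect_part(crop_feature, feature_with_part, feature_without_part):
    """Compare a cropped region's feature against the features of the two
    normal map images (one containing the part, one without it) and return
    DETECTED or NONE accordingly."""
    d_with = (crop_feature - feature_with_part).pow(2).sum()
    d_without = (crop_feature - feature_without_part).pow(2).sum()
    return "DETECTED" if d_with < d_without else "NONE"
```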
Table 2 below illustrates the outputs that may be generated by CNN 500 during the invalid step recognition process.
As illustrated, the CNN 500 may employ the invalid step recognition process on more than one image (e.g., image 1, image 2, image 3, image 4). For instance, Table 2 illustrates that for “image 1” the CNN 500 may generate the following outputs for “part 1” to “part 5”: [DETECTED, DETECTED, NONE, NONE, NONE]. Based on these outputs, the invalid step recognition process may be operable to map the outputs to standard repair steps. As shown, for image 1, image 2, and image 3 the invalid step recognition process determined that no invalid step had occurred. Stated differently, for image 1 to image 3 the CNN 500 had determined that the user had been following the repair steps in the correct sequence.
For image “4,” however, the invalid step recognition process determined that an “Invalid Status” had occurred. The invalid status output may have been generated because the user mistakenly performed the repair sequence incorrectly. For instance, the fourth step in the repair sequence may have required that “part 1” be “DETECTED.” Since the CNN 500 did not detect “part 1” as being present, an invalid sequence was detected. Alternatively, the invalid status may also be generated if a user has incorrectly tried to re-add a part out of the correct repair sequence. For instance, “part 1” may have been detected as not being present for image 5 because it is currently behind or blocked by an additional part (e.g., part 5). Since “part 1” is not visible, the invalid step recognition process may still output an “invalid status” because “part 5” was not required to be reassembled until later in the repair sequence.
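For illustration only, the mapping from per-part outputs to repair steps might be sketched as a simple lookup; the step table below is hypothetical, and the actual table would be derived from the target work procedure stored in database 128:

```python
# Hypothetical step table: tuples of per-part outputs mapped to repair steps.
STEP_TABLE = {
    ("DETECTED", "NONE", "NONE", "NONE", "NONE"): "step 1",
    ("DETECTED", "DETECTED", "NONE", "NONE", "NONE"): "step 2",
    ("DETECTED", "DETECTED", "DETECTED", "NONE", "NONE"): "step 3",
}

def map_outputs_to_step(part_outputs):
    """Map the per-part outputs for one image to a repair step; any pattern
    that matches no valid step is reported as an invalid status."""
    return STEP_TABLE.get(tuple(part_outputs), "Invalid Status")
```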
The processes, methods, or algorithms disclosed herein can be deliverable to/implemented by a processing device, controller, or computer, which can include any existing programmable electronic control unit or dedicated electronic control unit. Similarly, the processes, methods, or algorithms can be stored as data, logic, and instructions executable by a controller or computer in many forms including, but not limited to, information permanently stored on non-writable storage media such as ROM devices and information alterably stored on writeable storage media such as floppy disks, magnetic tapes, CDs, RAM devices, and other magnetic and optical media. The processes, methods, or algorithms can also be implemented in a software executable object. Alternatively, the processes, methods, or algorithms can be embodied in whole or in part using suitable hardware components, such as Application Specific Integrated Circuits (ASICs), Field-Programmable Gate Arrays (FPGAs), state machines, controllers or other hardware components or devices, or a combination of hardware, software and firmware components.
While exemplary embodiments are described above, it is not intended that these embodiments describe all possible forms of the invention. Rather, the words used in the specification are words of description rather than limitation, and it is understood that various changes may be made without departing from the spirit and scope of the invention. Additionally, the features of various implementing embodiments may be combined to form further embodiments of the invention.