The present application, in accordance with one or more embodiments, relates generally to classification systems and methods and, more particularly, for example, to systems and methods for training and/or implementing multi-object classification systems.
Object detection is often implemented as a computer vision technique for locating instances of objects in images or videos. Object detection algorithms typically leverage machine learning or deep learning to produce meaningful results. When humans look at images or video, they can recognize and locate objects of interest within a matter of moments. A goal of object detection is to replicate this intelligence using a computer. In some systems, objects are detected in an image by an object detection process and a bounding box is defined surrounding each detected object with an identification of an object class. For example, an image of a neighborhood may include a dog, a bicycle, and a truck, each of which is detected and classified.
Object detection is used in a variety of real-time systems, such as advanced driver assistance systems that enable cars to detect driving lanes or perform pedestrian detection to improve road safety. Object detection is also useful in applications such as video surveillance, image retrieval, and other systems. Object detection problems are often solved using deep learning, machine learning, and other artificial intelligence systems. Popular deep learning-based approaches use convolutional neural networks (CNNs), such as regions with convolutional neural networks (R-CNN), You Only Look Once (YOLO), and other approaches that automatically learn to detect objects within images.
In one approach for object detection through deep learning, a custom object detector is created and trained. To train a custom object detector from scratch, a network architecture is designed to learn the features for the objects of interest, using a large set of labeled data to train the CNN. The results of a custom object detector are acceptable for many applications. However, these systems may require a lot of time and effort to set up the layers and weights in the CNN. In a second approach, a pretrained object detector is used. Many object detection workflows using deep learning leverage transfer learning, an approach that enables the system to start with a pretrained network and then fine-tune it for a particular application. This method can provide faster results because the object detectors have already been trained on thousands, or even millions, of images, but has other drawbacks in terms of complexity and accuracy.
In view of the foregoing, there is a continued need in the art for improved object detection and classification systems and methods.
The present disclosure is directed to systems and methods for object detection and classification. In various embodiments, improved systems and methods are described that can be used for a variety of classification problems including object detection and speech recognition tasks. In some embodiments, improved training methods incorporate a “rhino” loss function that forces the model to activate only once for each object. These approaches reduce the complexity of full system solutions, including eliminating the need in many embodiments for conventional post-processing that is typically applied after the classification step. For example, in some object detection systems, a post-processing step called non-maximum suppression is used to reject redundant detections per object. This post-processing not only increases the computational complexity, it can also decrease performance. The single-detection systems and methods disclosed herein provide advantages over such systems.
Various embodiments disclosed herein can be used without conventional post-processing, greatly reducing the run-time computational complexity and increasing the effectiveness of accurately estimating small objects. In addition, the training system can converge faster than other state-of-the-art methods. In a speech recognition task, for example, a speech recognition system typically applies a computationally heavy decoding algorithm in order to decode the speech letters from the input data. In practice, the decoding can be less than optimal due to a trade-off between the amount of processing and the performance of the search algorithm. The techniques disclosed herein can greatly simplify the decoding part of speech recognition and can improve performance while reducing computational complexity.
The scope of the present disclosure is defined by the claims, which are incorporated into this section by reference. A more complete understanding of the present disclosure will be afforded to those skilled in the art, as well as a realization of additional advantages thereof, by a consideration of the following detailed description of one or more embodiments. Reference will be made to the appended sheets of drawings that will first be described briefly.
Aspects of the disclosure and their advantages can be better understood with reference to the following drawings and the detailed description that follows. It should be appreciated that like reference numerals are used to identify like elements illustrated in one or more of the figures, where showings therein are for purposes of illustrating embodiments of the present disclosure and not for purposes of limiting the same. The components in the drawings are not necessarily to scale, emphasis instead being placed upon clearly illustrating the principles of the present disclosure.
The present disclosure is directed to improved systems and methods for object detection and/or classification. The techniques disclosed herein can be applied generally to classification problems, including voice detection and authentication in audio, object detection and classification in an image, and/or other classification problems. For example, a two-dimensional classification problem may include an object detection process directed to identifying and locating objects of certain classes in an image. Object localization can be done in various ways, including creating a bounding box around the object. A one-dimensional classification problem, for example, may include phoneme recognition. In phoneme recognition, unlike object detection in an image, the system receives a sequence of data. The detection of classes in a sequence is often important when detecting speech. In the present disclosure, improved techniques are described that can be applied to various classification systems, including an object detection problem (as an example of a 2-D classification problem) and a phoneme recognition problem (as an example of a 1-D classification problem with sequential data).
Whether a classification system includes a custom object detector or uses a pretrained one, the system designer decides what type of object detection network to use (e.g., a two-stage network or a single-stage network). The initial stage of two-stage networks, such as R-CNN and its variants, identifies region proposals, or subsets of the image that might contain an object. The second stage classifies the objects within the region proposals. Two-stage networks can achieve accurate object detection results; however, they are typically slower than single-stage networks.
In single-stage networks, such as YOLO v2, the CNN produces network predictions for regions across an image using anchor boxes, and the predictions are decoded to generate the final bounding boxes for the objects. Single-stage networks can be much faster than two-stage networks, but they may not reach the same level of accuracy, especially for scenes containing small objects. However, single-stage networks are simpler, faster, and more memory- and computationally-efficient object detectors, making them more practical for use in many end-user products.
Many conventional object detector techniques require the use of a post-processing stage, such as non-maximum suppression, in order to disregard redundant detections for each object. For example, an object detector may detect a single object (e.g., a car) three different times and place three different bounding boxes around the object. After applying non-maximum suppression, the highest-confidence estimate is retained while the others are rejected, allowing each object to be identified by a single bounding box. This post-processing stage can impose additional computational complexity, especially when the number of objects per image is high. Embodiments of the deep learning-based techniques disclosed herein include a single-stage object detector that does not require a post-processing stage such as non-maximum suppression, which can improve the performance of estimation for multi-class object detection.
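For reference, the following is a minimal sketch of the kind of non-maximum suppression step that such conventional detectors apply (written here in Python with NumPy; the box and score arrays are hypothetical inputs). The embodiments described herein are designed so that this step can be omitted.

```python
import numpy as np

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Keep the highest-confidence box and reject overlapping redundant detections.

    boxes: (M, 4) array of [x1, y1, x2, y2]; scores: (M,) confidence scores.
    """
    order = np.argsort(scores)[::-1]      # indices sorted by descending confidence
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(best)
        # Intersection of the kept box with the remaining boxes
        x1 = np.maximum(boxes[best, 0], boxes[order[1:], 0])
        y1 = np.maximum(boxes[best, 1], boxes[order[1:], 1])
        x2 = np.minimum(boxes[best, 2], boxes[order[1:], 2])
        y2 = np.minimum(boxes[best, 3], boxes[order[1:], 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_best = (boxes[best, 2] - boxes[best, 0]) * (boxes[best, 3] - boxes[best, 1])
        area_rest = (boxes[order[1:], 2] - boxes[order[1:], 0]) * (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_best + area_rest - inter + 1e-9)
        # Drop remaining boxes that overlap the kept box too much
        order = order[1:][iou < iou_threshold]
    return keep
```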
Referring to the figures, embodiments of the present disclosure will now be described. The present disclosure introduces a novel network that can recognize multi-class objects and localize them with one bounding box per object. The proposed technique is a pretrained object detector that leverages transfer learning in order to build a single-stage object detector.
In order to understand what is in an image, the input image is fed through a convolutional network to build a rich feature representation of the original image. This part of the architecture may be referred to herein as the “backbone” network, which is pre-trained as an image classifier to learn how to extract features from an image. In this approach, it is recognized that image classification data may be easier and cheaper to label than full object detection data, as it requires only a single label as opposed to defining bounding box annotations for each image. The training can be conducted on a large labeled dataset (e.g., ImageNet) in order to learn good feature representations.
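As an illustrative sketch only (Python/PyTorch, assuming a ResNet-18 backbone pretrained on ImageNet; the disclosure does not require any particular backbone), the classification head of a pretrained network can be removed so that only the convolutional feature extractor remains:

```python
import torch
import torchvision

# Load an image classifier pretrained on ImageNet and drop its classification head,
# keeping only the convolutional layers that produce the feature maps.
# (Older torchvision versions use pretrained=True instead of the weights argument.)
resnet = torchvision.models.resnet18(weights="IMAGENET1K_V1")
backbone = torch.nn.Sequential(*list(resnet.children())[:-2])  # remove avgpool + fc

image = torch.randn(1, 3, 224, 224)   # dummy RGB input
features = backbone(image)            # shape: (1, 512, 7, 7) for a 224x224 input
print(features.shape)
```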
An example of a backbone network is illustrated in
Referring to
Referring to
If the input image contains multiple objects, then multiple activations can be identified on the grid denoting that an object is in each of the activated regions. For example, as illustrated in the example of
In various embodiments disclosed herein, the network learns to find the responsible grid cell to be used for detection of the object. In other words, the network will choose all the grid cells that are inside the ground truth bounding box of the object, such as the grid cells marked with an “X” in
In some embodiments, the last layer generates an N*N output probability for each class (here it is assumed N=7 in a 7×7 grid). If we assume the number of classes is C, then there will be N*N*C output probabilities (y(1), . . . , y(C)). For each of the N*N grid cells, the last layer also generates four coordinates cx1, cy1, cx2, cy2, corresponding to four estimated outputs related to the x-axis and y-axis positions of the two corners at the top left and bottom right of the rectangular bounding box, as shown in
The likelihood that a grid cell contains an object of class i is defined as y(i), and it is assumed that the number of classes is C. If all the y(i) for all grid cells are close to zero, then it is determined that no object is detected in the image.
Four bounding box descriptors are used to describe the x-y coordinates of the upper left corner of the bounding box (cx1, cy1) and the bottom right corner of the bounding box (cx2, cy2). These values will be mapped to the corresponding values (mx1, my1, mx2, my2), taking the reference point of (x=0, y=0) at the upper left corner of the image and (x=1, y=1) at the bottom right corner of the image.
Thus, the network is configured to learn a convolution filter for each of the above attributes such that it produces 4+C output channels to describe a single bounding box at each grid cell location. This means that the network will learn a set of weights to look across all feature maps (in the above example it is assumed to be 512) to evaluate the grid cells.
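The following sketch illustrates one way such a detection head could be realized (Python/PyTorch). The 7×7 grid, 512 feature maps, and 4+C outputs per grid cell follow the example above; the use of a 1×1 convolution and a sigmoid to map the outputs into [0, 1] are assumptions for illustration rather than requirements of the disclosure.

```python
import torch

N, C, F = 7, 3, 512   # grid size, number of classes, backbone feature maps (as assumed above)

# A 1x1 convolution over the N x N x F feature map yields (4 + C) outputs per grid cell:
# 4 bounding-box values and C per-class confidence scores.
head = torch.nn.Conv2d(F, 4 + C, kernel_size=1)

features = torch.randn(1, F, N, N)    # backbone output for one image
out = head(features)                  # shape: (1, 4 + C, N, N)

# Assumption: a sigmoid maps raw outputs to [0, 1] so the box values can be read as
# (mx1, my1, mx2, my2) relative to the image corners, and the rest as class scores y(i).
out = torch.sigmoid(out)
boxes = out[:, :4]                    # (1, 4, N, N): corner coordinates per grid cell
class_scores = out[:, 4:]             # (1, C, N, N): confidence per class per grid cell
```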
It is also possible to increase the size of the model by introducing new parameters to learn a separate bounding box estimate for each class. In other words, there will be 5*C outputs for each grid cell instead of 4+C, as shown in the figure below. This will enlarge the model size at the output layer, and it may improve the performance of the model for objects that have different aspect ratios or shapes. In the embodiments described below, 4+C outputs for each grid cell are assumed unless stated otherwise.
Now we will describe a proposed rhino loss function that forces the network to detect each object using only one grid cell activation. Without loss of generality, it is assumed that the number of classes is one (C=1) and the object of interest is “car”. So, we have the “car” confidence score yn(1) and bounding box coordinates mx,n1, my,n1, mx,n2, my,n2 for the n-th grid cell. In each image, each object is shown with a rectangular bounding box around it as its ground truth. All the grid cells inside the bounding box will be considered as target grid cells that will be used to detect the object. For example, a car object in image 900 of
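A minimal sketch of constructing the target grid-cell mask described above is shown below (Python/NumPy). The assumption that a grid cell counts as “inside” the ground-truth box when its center lies within the box is illustrative only; the disclosure simply requires that the grid cells inside the box be treated as target cells.

```python
import numpy as np

def target_mask(gt_box, N=7):
    """Mark the grid cells whose centers fall inside the ground-truth box.

    gt_box: (x1, y1, x2, y2) in normalized image coordinates, (0,0) top-left, (1,1) bottom-right.
    Returns an N x N mask with 1 for target grid cells and 0 elsewhere.
    """
    x1, y1, x2, y2 = gt_box
    mask = np.zeros((N, N))
    for row in range(N):
        for col in range(N):
            cx = (col + 0.5) / N   # cell-center x in normalized coordinates
            cy = (row + 0.5) / N   # cell-center y in normalized coordinates
            if x1 <= cx <= x2 and y1 <= cy <= y2:
                mask[row, col] = 1.0
    return mask

# Example: a car occupying the center of the image activates the central grid cells.
print(target_mask((0.3, 0.4, 0.7, 0.8)))
```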
The rhino loss function for the ith sample of the data (Lrhino(i)) and the total detection loss for a batch of data of size D (Lrhinototal) are given below.
γ≥0 is a hyperparameter that needs to be tuned during training.
Embodiments of rhino loss with overlapping bounding boxes using reassignment will now be described with reference to
In one embodiment, the mask of each overlapped object is modified at each update of the training. The modified mask is indexed by grid cell n, object s, class j, and image i. To do this, the following rhino soft target score (rhinon,s(j)(i)) is computed for each object (∀s).
The rhino soft target score will be computed for each grid cell of all the objects. Then the object of any class that has the maximum metric value will have its mask set to one. For example, in the image of
In one embodiment, the system replaces the mask in (1) with the modified mask computed using the method below to address the problem of overlapping area of the bounding boxes:
for each image i: compute rhinon,s(j)(i) for all j, s, and n
for each grid cell n: find the s* and j* that have the maximum value of rhinon,s(j)(i) among all s and j; set the modified mask for (n, s*, j*) to 1 and set the modified mask to 0 for all other (j, s) at that grid cell
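The following sketch illustrates the reassignment procedure above (Python/NumPy). The rhino soft target score of (11) is not reproduced here and is assumed to be precomputed and passed in; for simplicity, objects of all classes are flattened into a single object axis.

```python
import numpy as np

def reassign_masks(rhino_score, masks):
    """Resolve overlapping bounding boxes by reassigning each grid cell to one object.

    rhino_score: (num_objects, N*N) array of the rhino soft target score of equation (11)
                 for each object (of any class) and grid cell (computed elsewhere).
    masks:       array of the same shape, 1 where a grid cell lies inside the object's box.
    Returns the modified masks: in an overlap, only the object with the maximum score
    keeps the cell; all other objects get 0 for that cell.
    """
    modified = masks.copy()
    overlap_cells = np.where(masks.sum(axis=0) > 1)[0]   # cells claimed by more than one object
    for n in overlap_cells:
        candidates = np.where(masks[:, n] == 1)[0]
        winner = candidates[np.argmax(rhino_score[candidates, n])]
        modified[:, n] = 0.0
        modified[winner, n] = 1.0
    return modified
```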
If we increase the number of parameters by having one set of coordinates for each class, then there is no need to modify the mask for the overlapping area of objects belonging to different classes. In this case the number of outputs will be changed from (4+C)*N² to 5*C*N². This can increase the number of parameters of the object detector model, and it may also improve the performance when the classes do not have similar shapes or aspect ratios (e.g., a person and a car).
In another embodiment, an alternative approach is provided to address the problem of overlapping bounding boxes when the two bounding boxes belong to different classes. Note that if the overlapped bounding boxes belong to the same class, then the method that is proposed above with respect to
Equation (4) can be revised as follows to address the problem of overlapped bounding boxes of different classes:
As shown, one additional term (hn,s(j)(i)) is added to the multiplications in (4) in order to address the problem of overlapping bounding boxes when the objects belong to different classes. As mentioned above, if there is a mix of intra-class and inter-class objects, the grid cells of intra-class objects may be reassigned using the method previously discussed, and the inter-class objects will have their rhino loss function modified as given in (12)-(13).
The bounding box loss function is designed to estimate the bounding box around the detected object. The total bounding box loss is defined as follows:
Where IoUn,s(j) and Rn,s(j) are the Intersection over Union (IoU) loss and the penalty term defined in [7] for predicted box B and target box Bgt for each grid cell n of the sth object of the jth class of image i. Both IoUn,s(j) and Rn,s(j) will be computed using the (mx,n1(i), my,n1(i), mx,n2(i), my,n2(i)) outputs for each grid cell n. Note that the losses are defined in [7] using the height and width of the bounding box and the center point of the bounding box. So (mx,n1(i), my,n1(i), mx,n2(i), my,n2(i)), which are the x-y coordinates of the upper left and bottom right corners of the bounding box, will be translated to a height/width with center point, and then the losses will be computed. In one embodiment, the loss may be computed as described in Zhaohui Zheng, Ping Wang, Wei Liu, Jinze Li, Rongguang Ye, and Dongwei Ren, "Distance-IoU Loss: Faster and Better Learning for Bounding Box Regression," AAAI 2020, which is incorporated by reference herein.
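For illustration, the following sketch computes the Distance-IoU loss for one grid cell following the cited Zheng et al. formulation, after translating the corner coordinates to a center point with width and height as described above. It is a sketch of the referenced loss, not a required implementation.

```python
def diou_loss(pred_corners, gt_corners):
    """Distance-IoU loss for one grid cell, per Zheng et al. (AAAI 2020): 1 - IoU + R.

    Both inputs are (x1, y1, x2, y2) corner coordinates in normalized image coordinates;
    they are first translated to center points and width/height.
    """
    def to_center(box):
        x1, y1, x2, y2 = box
        return (x1 + x2) / 2.0, (y1 + y2) / 2.0, x2 - x1, y2 - y1

    pcx, pcy, pw, ph = to_center(pred_corners)
    gcx, gcy, gw, gh = to_center(gt_corners)

    # Intersection over Union of the two boxes
    ix1, iy1 = max(pred_corners[0], gt_corners[0]), max(pred_corners[1], gt_corners[1])
    ix2, iy2 = min(pred_corners[2], gt_corners[2]), min(pred_corners[3], gt_corners[3])
    inter = max(ix2 - ix1, 0.0) * max(iy2 - iy1, 0.0)
    union = pw * ph + gw * gh - inter
    iou = inter / (union + 1e-9)

    # Penalty term R: squared center distance over squared diagonal of the smallest enclosing box
    cx1, cy1 = min(pred_corners[0], gt_corners[0]), min(pred_corners[1], gt_corners[1])
    cx2, cy2 = max(pred_corners[2], gt_corners[2]), max(pred_corners[3], gt_corners[3])
    diag_sq = (cx2 - cx1) ** 2 + (cy2 - cy1) ** 2 + 1e-9
    center_dist_sq = (pcx - gcx) ** 2 + (pcy - gcy) ** 2
    penalty = center_dist_sq / diag_sq

    return 1.0 - iou + penalty
```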
The total loss function Ltotal is the sum of the bounding box loss and the rhino loss:
Ltotal=Lrhinototal+β*Lloctotal  (17)
β>0 is a hyperparameter that needs to be tuned to balance the values of the two losses, namely the rhino loss and the localization loss.
A phoneme recognition task involves recognizing the phonemes of speech (C classes) in a sequence of audio data. This is often the initial step for a speech recognition system. The backbone for phoneme recognition can be a recurrent neural network or a CNN. Each output is the confidence score for the probability of detecting the jth class, and it is obtained after applying the sigmoid function. As in object detection, a marking window is defined for each phoneme to be classified in a sequence. Note that the marking window is a 1-D array, unlike the bounding box of the object detector, which is a 2-D array. So the rhino loss function for the ith sample of the data (Lrhino(i)) and the total detection loss for a batch of data of size D (Lrhinototal) can be obtained as in (10).
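For illustration, a 1-D marking-window mask analogous to the 2-D grid-cell mask of the object detector could be constructed as follows (Python/NumPy; the frame indices are hypothetical example values):

```python
import numpy as np

def marking_window_mask(start_frame, end_frame, num_frames):
    """Build a 1-D marking-window mask for one phoneme in a sequence of time frames.

    Frames from start_frame to end_frame (inclusive) are the target frames for the phoneme,
    analogous to the grid cells inside a 2-D bounding box for object detection.
    """
    mask = np.zeros(num_frames)
    mask[start_frame:end_frame + 1] = 1.0
    return mask

# Example: a phoneme spanning frames 10-19 in a 50-frame sequence.
print(marking_window_mask(10, 19, 50))
```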
Embodiments of applying rhino loss with an overlapping marking window in a sequence of data with reassignment will now be described. As previously explained, if two marking windows have an overlap area, the system will reassign the overlap area to either of the two classes using a rhino score such as the one defined in (11). For example, in
Embodiments of rhino loss with an overlapping marking window in a sequence of data without reassignment will now be described with reference to
Similar to overlapClassn,s(j)(i) discussed herein, overlapClasspreceding,s(j)(i) and overlapClasssucceeding,s(j)(i) are defined to include the index of the overlapped classes that come before and after the sth phoneme. For example, if the sequence ABC has overlaps, overlapClasspreceding,s(j)(i) for class B would be the index of class A and overlapClasssucceeding,s(j)(i) for class B would be the index of class C. Also, we define npreceding,s(j)(i) and nsucceeding,s(j)(i) to be the end time frame of the preceding class (here class A) in the overlap area and the start time frame of the succeeding class (here class C) in the overlap area, respectively. This is shown in the example of
The modified rhino loss can be written as follows:
Note that it is assumed that npreceding,s(j)(i)≥n and nsucceeding,s(j)(i)≤n for each time frame n. If either of these two conditions is not met, then there is no need to compute the multiplications in (18) or (19).
The techniques described herein provide a general solution for any classification problem and so can be applied to many problems, including object detection, keyword spotting, acoustic event detection, and speech recognition. The disclosure can provide an opportunity to solve many practical problems in which high accuracy with low computational complexity is an important requirement.
Referring to
The neural network 1300 is trained using a supervised learning process that compares the network output for input data to a ground truth (e.g., expected network output). For a speaker verification system, for example, a training dataset 1302 may include sample speech input (e.g., an audio sample) labeled with a corresponding speaker ID. The input data 1302 may comprise other labeled data types, such as a plurality of images labeled with object classification data, audio data labeled for phoneme recognition, etc. In some embodiments, the input data 1302 is provided to a feature extraction process 1304 to generate a batch of features for input to the neural network 1300. The output generated for the input batch is compared against the ground truth, and differences between the generated output data and the ground truth output data are determined using a rhino loss function 1340 as disclosed herein and fed back into the neural network 1300 to make corrections to the various trainable weights and biases. The loss may be fed back into the neural network 1300 using a back-propagation technique (e.g., using a stochastic gradient descent algorithm or similar algorithm). In some examples, training data combinations may be presented to the neural network 1300 multiple times until the overall rhino loss function converges to an acceptable level.
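The following is a minimal sketch of such a supervised training loop (Python/PyTorch). The rhino loss is treated as a user-supplied callable because its exact form is given by the equations above; the model, data loader, and hyperparameters are placeholders rather than required choices.

```python
import torch

# Minimal supervised training loop sketch. `model` stands in for neural network 1300,
# `rhino_loss` for loss function 1340 (treated here as a user-supplied callable), and
# `loader` yields (features, targets) batches produced by the feature extraction step.
def train(model, rhino_loss, loader, epochs=10, lr=1e-3):
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)   # stochastic gradient descent
    for _ in range(epochs):
        for features, targets in loader:
            optimizer.zero_grad()
            outputs = model(features)            # forward pass
            loss = rhino_loss(outputs, targets)  # compare output against ground truth
            loss.backward()                      # back-propagate the loss
            optimizer.step()                     # update trainable weights and biases
    return model
```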
In some examples, each of input layer 1310, hidden layers 1320, and/or output layer 1330 include one or more neurons, with each neuron applying a combination (e.g., a weighted sum using a trainable weighting matrix W) of its inputs x, adding an optional trainable bias b, and applying an activation function f to generate an output a as shown in the equation a=f(Wx+b). In some examples, the activation function f may be a linear activation function, an activation function with upper and/or lower limits, a log-sigmoid function, a hyperbolic tangent function, a rectified linear unit function, and/or the like. In some examples, each of the neurons may have a same or a different activation function.
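As a small numerical illustration of a=f(Wx+b) (Python/NumPy, with arbitrary example values and a rectified linear unit as the activation function):

```python
import numpy as np

# A single neuron: weighted sum of inputs plus bias, followed by an activation function.
W = np.array([[0.2, -0.5, 0.1]])   # trainable weighting matrix (1 neuron, 3 inputs)
b = np.array([0.05])               # trainable bias
x = np.array([1.0, 0.3, -0.7])     # inputs

def relu(z):                       # one possible activation function f
    return np.maximum(z, 0.0)

a = relu(W @ x + b)                # output a = f(Wx + b); here a = [0.03]
print(a)
```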
After training, the neural network 1300 may be implemented in a run time environment of a remote device to receive input data and generate associated classifications. It should be understood that the architecture of neural network 1300 is representative only and that other architectures are possible, including a neural network with only one hidden layer, a neural network with different numbers of neurons, a neural network without an input layer and/or output layer, a neural network with recurrent layers, and/or the like.
In other embodiments, the training dataset 1302 may include captured sensor data associated with one or more types of sensors, such as speech utterances, visible light images, fingerprint data, and/or other types of biometric information. The training dataset may include images of a user's face for a face identification system, fingerprint images for a fingerprint identification system, retina images for a retina identification system, and/or datasets for training another type of biometric identification system.
The system 1400 includes an authentication device 1420 including processing components 1430, audio input processing components 1440, user input/output components 1446, communications components 1448, and a memory 1450. In some embodiments, other sensors and components 1445 may be included to facilitate additional biometric authentication modalities, such as fingerprint recognition, facial recognition, iris recognition, etc. Various components of authentication device 1420 may interface and communicate through a bus or other electronic communications interface.
The authentication device 1420, for example, may be implemented on a general-purpose computing device, as a system on a chip, integrated circuit, or other processing system and may be configured to operate as part of an electronic system 1410. In some embodiments, the electronic system 1410 may be, or may be coupled to, a mobile phone, a tablet, a laptop computer, a desktop computer, an automobile, a personal digital assistant (PDA), a television, a voice interactive device (e.g., a smart speaker, conference speaker system, etc.), a network or system access point, and/or other system or device configured to receive user voice input for authentication and/or identification.
The processing components 1430 may include one or more of a processor, a controller, a logic device, a microprocessor, a single-core processor, a multi-core processor, a microcontroller, a programmable logic device (PLD) (e.g., field programmable gate array (FPGA)), a digital signal processing (DSP) device, an application specific integrated circuit, or other device(s) that may be configured by hardwiring, executing software instructions, or a combination of both, to perform various operations discussed herein for audio source enhancement. In the illustrated embodiment, the processing components 1430 include a central processing unit (CPU) 1432, a neural processing unit (NPU) 1434 configured to implement logic for executing machine learning algorithms, and/or a graphics processing unit (GPU) 1436. The processing components 1430 are configured to execute instructions stored in the memory 1450 and/or other memory components. The processing components 1430 may perform operations of the authentication device 1420 and/or electronic system 1410, including one or more of the processes and/or computations disclosed herein.
The memory 1450 may be implemented as one or more memory devices or components configured to store data, including audio data, user data, trained neural networks, authentication data, and program instructions. The memory 1450 may include one or more types of memory devices including volatile and non-volatile memory devices, such as random-access memory (RAM), read-only memory (ROM), electrically-erasable programmable read-only memory (EEPROM), flash memory, hard disk drive, and/or other types of memory.
Audio input processing components 1440 include circuits and digital logic components for receiving an audio input signal, such as speech from one or more users 1444 that is sensed by an audio sensor, such as one or more microphones 1442. In various embodiments, the audio input processing components 1440 are configured to process a multi-channel input audio stream received from a plurality of microphones, such as a microphone array, and generate an enhanced target audio signal comprising speech from the user 1444.
Communications components 1448 are configured to facilitate communication between the authentication device 1420 and the electronic system 1410 and/or one or more networks and external devices. For example, the communications components 1448 may enable Wi-Fi (e.g., IEEE 802.11) or Bluetooth connections between the electronic system 1410 and one or more local devices or enable connections to a wireless router to provide network access to an external computing system via a network 1480. In various embodiments, the communications components 1448 may include wired and/or other wireless communications components for facilitating direct or indirect communications between the authentication device 1420 and/or other devices and components.
The authentication device 1420 may further include other sensors and components 1445, depending on a particular implementation. The other sensors and components 1445 may include other biometric input sensors (e.g., fingerprint sensors, retina scanners, video or image capture for face recognition, etc.), and the user input/output components 1446 may include I/O components such as a touchscreen, a touchpad display, a keypad, one or more buttons, dials, or knobs, a loudspeaker, and/or other components operable to enable a user to interact with the electronic system 1410.
The memory 1450 includes program logic and data configured to facilitate speaker verification in accordance with one or more embodiments disclosed herein, and/or perform other functions of the authentication device 1420 and/or electronic system 1410. The memory 1450 includes program logic for instructing processing components 1430 to perform voice processing 1452, including speech recognition 1454, on an audio input signal received through the audio input processing components 1440. In various embodiments, the voice processing 1452 logic is configured to identify an audio sample comprising one or more spoken utterances for speaker verification processing.
The memory 1450 may further include program logic for implementing user verification controls 1462, which may include security protocols for verifying a user 1444 (e.g., to validate the user's identity for a secure transaction, to identify access rights to data or programs of the electronic system 1410, etc.). In some embodiments, the user verification controls 1462 include program logic for an enrollment and/or registration procedure to identify a user and/or obtain user voice print information, which may include a unique user identifier and one or more embedding vectors. The memory 1450 may further include program logic for instructing the processing components 1430 to perform a voice authentication process 1464 as described herein, which may include neural networks trained for speaker verification using generalized negative log-likelihood loss processes, feature extraction components for extracting features from an input audio sample, and processes for identifying embedding vectors and generating centroid or other vectors and confidence scores for use in speaker identification.
The memory 1450 may further include other biometric authentication processes 1466, which may include facial recognition, fingerprint identification, retina scanning, and/or other biometric processing for a particular implementation. The other biometric authentication processes 1466 may include feature extraction processes, one or more neural networks, statistical analysis modules, and/or other processes. In some embodiments, the user verification controls 1462 may process confidence scores or other information from the voice authentication process 1464 and/or one or more other biometric authentication processes 1466 to generate the speaker identification determination. In some embodiments, the other biometric authentication processes 1466 include a neural network trained through a process using a batch of biometric input data and a rhino loss function as described herein.
The memory 1450 includes program logic for instructing processing components 1430 to perform image processing 1456, including object detection 1456, on images received through one or more components (e.g., other sensors/components 1445 such as image capture components, communications components 1448, etc.).
In various embodiments, the authentication device 1420 may operate in communication with one or more servers across a network 1480. For example, a neural network server 1490 includes processing components and program logic configured to train neural networks (e.g., neural network training module 1492) for use in speaker verification as described herein. In some embodiments, a database 1494 stores training data 1496, including training datasets and validation datasets for use in training one or more neural network models. Trained neural networks 1498 may also be stored in the database 1494 for downloading to one or more runtime environments for use in the voice authentication processes 1464. The trained neural networks 1498 may also be provided to the one or more verification servers 1482, which provide cloud or other networked speaker identification services. For example, the authentication device 1420 may upload biometric data, such as voice data or other biometric data, to the verification server 1482 for further processing. The uploaded data may include a received audio sample, extracted features, embedding vectors, and/or other data. The verification server 1482 uses a biometric authentication process 1484 that includes one or more neural networks (e.g., a trained neural network 1488 stored in a database 1486) trained in accordance with the present disclosure, together with system and/or user data 1489, to compare the sample against known authentication factors and/or user identifiers to determine whether the user 1444 has been verified. In various embodiments, the verification server 1482 may be implemented to provide authentication for a financial service or transaction, access to a cloud or other online system, cloud or network authentication services for use with an electronic system 1410, etc.
Where applicable, various embodiments provided by the present disclosure may be implemented using hardware, software, or combinations of hardware and software. Also, where applicable, the various hardware components and/or software components set forth herein may be combined into composite components comprising software, hardware, and/or both without departing from the scope of the present disclosure. Where applicable, the various hardware components and/or software components set forth herein may be separated into sub-components comprising software, hardware, or both without departing from the scope of the present disclosure. In addition, where applicable, it is contemplated that software components may be implemented as hardware components and vice versa.
Software, in accordance with the present disclosure, such as program code and/or data, may be stored on one or more computer readable mediums. It is also contemplated that software identified herein may be implemented using one or more general purpose or specific purpose computers and/or computer systems, networked and/or otherwise. Where applicable, the ordering of various steps described herein may be changed, combined into composite steps, and/or separated into sub-steps to provide features described herein.
The foregoing disclosure is not intended to limit the present disclosure to the precise forms or particular fields of use disclosed. As such, it is contemplated that various alternate embodiments and/or modifications to the present disclosure, whether explicitly described or implied herein, are possible in light of the disclosure. Having thus described embodiments of the present disclosure, persons of ordinary skill in the art will recognize that changes may be made in form and detail without departing from the scope of the present disclosure. Thus, the present disclosure is limited only by the claims.