Vehicles, such as automobiles, motorcycles, and the like, are being provided with image or video capturing devices to capture their surrounding environments. These devices are provided to allow for an enhanced driving experience. By processing the data captured by these sensors, the surrounding environment, or objects in the surrounding environment, can be identified.
For example, a vehicle implementing an image capturing device configured to capture a surrounding environment may detect road signs indicating danger or information, highlight local attractions and other objects for education and entertainment, and provide a whole host of other services.
This technology becomes even more important as autonomous vehicles are introduced. An autonomous vehicle employs many sensors to determine an optimal driving route and technique. One such sensor captures real-time images of the surrounding area, and driving decisions are made based on the captured images.
As shown in
The processing power of devices situated in vehicles has improved. At the same time, the operations that need to be performed in the vehicular context have also become more intensive. One organization technique for performing such processing is a convolutional neural network (CNN) as shown in
Referring specifically to the CNN, the first layer includes a set of nodes that receives the captured data, such as image data, and provides outputs to be input by the next layer. Each subsequent layer consists of a set of nodes which receive inputs from the previous layer, except the last layer, which outputs the identification of the object. A node acts as an artificial neuron and as an example calculates a weighted sum of its inputs, and then applies an activation function to the sum to produce a single output.
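The node computation described above can be illustrated with a minimal sketch. The function names, the bias term, and the choice of ReLU as the activation function are assumptions made for illustration only; the description does not specify them.

```python
def relu(x: float) -> float:
    """Rectified linear unit, one common activation (assumed here)."""
    return max(0.0, x)


def node_output(inputs, weights, bias=0.0, activation=relu):
    """Weighted sum of the inputs, followed by an activation function.

    The bias term is an assumption; the description only mentions a
    weighted sum and an activation function.
    """
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(weighted_sum)


def layer_output(inputs, weight_rows, biases, activation=relu):
    """One layer: every node receives the same inputs from the previous layer."""
    return [
        node_output(inputs, weights, bias, activation)
        for weights, bias in zip(weight_rows, biases)
    ]


if __name__ == "__main__":
    # Two inputs feeding a layer of two nodes (toy values).
    print(layer_output([1.0, 2.0], [[0.5, 0.25], [0.5, -1.0]], [1.0, 0.0]))
```

Each subsequent layer would call `layer_output` on the outputs of the previous layer, with the last layer producing the identification.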
Thus, because the process of searching every data item is potentially processor intensive, vehicle implementers are attempting to incorporate processors with greater capabilities and processing power. As a result, the price of the components that need to be implemented in a vehicle-based computing system increases.
The following description relates to employing vehicular sensor information for retrieval of data. Exemplary embodiments may also be directed to any of the system, the method, or an application disclosed herein, and the subsequent implementation in existing vehicular systems, microprocessors, and autonomous vehicle driving systems.
Additional features of the invention will be set forth in the description which follows, and in part will be apparent from the description, or may be learned by practice of the invention.
Disclosed herein are systems, methods, and devices for optimally performing object identification employing a neural network (NN), for example a convolutional neural network (CNN). The aspects disclosed herein employ audio data captured by one or more microphones to at least identify an object, or to augment image capturing to perform the same. The audio data and the image data are each propagated to the NN to perform object identification.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory and are intended to provide further explanation of the invention as claimed. Other features and aspects will be apparent from the following detailed description, the drawings, and the claims.
The detailed description refers to the following drawings, in which like numerals refer to like items, and in which:
The invention is described more fully hereinafter with reference to the accompanying drawings, in which exemplary embodiments of the invention are shown. This invention may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. Rather, these exemplary embodiments are provided so that this disclosure is thorough, and will fully convey the scope of the invention to those skilled in the art. It will be understood that for the purposes of this disclosure, “at least one of each” will be interpreted to mean any combination of the enumerated elements following the respective language, including combinations of multiples of the enumerated elements. For example, “at least one of X, Y, and Z” will be construed to mean X only, Y only, Z only, or any combination of two or more items X, Y, and Z (e.g. XYZ, XZ, YZ, X). Throughout the drawings and the detailed description, unless otherwise described, the same drawing reference numerals are understood to refer to the same elements, features, and structures. The relative size and depiction of these elements may be exaggerated for clarity, illustration, and convenience.
As explained above, vehicle implementers are implementing processors with increased capabilities, thereby attempting to perform the search for captured data via a complete database in an optimal manner. However, these techniques are limited in that they require increased processor resources, costs, and power to accomplish the increased processing.
The disclosure incorporates vehicle-based context information, obtainable via passive sensors, to augment object recognition in a vehicle-based context. Thus, by utilizing extra information available to a vehicle, the ability to process and retrieve information through a CNN (such as those described in the Background) is greatly enhanced.
In general, the aspects associated with the disclosure allow an increase in the object recognition capabilities of vehicles. This can result in higher performance using the same amount of processing capacity or more, or the reduction in required processing capacity to achieve the same performance level, or a combination of these. Higher performance could mean for example more total objects that can be identified, faster object identification, more classes of objects that can be identified, more accurate object identification and more accurate bounding of object areas.
Specifically, the disclosure relies on the employment of passive-based audio equipment. By passive, it is meant that the microphone is configured such that as the vehicle traverses through a driving condition, the microphone continually receives audio content from the vehicle's external environment.
One such technology employable is a beamforming microphone. Beamforming or spatial filtering is a signal processing technique used in sensor arrays for directional signal transmission or reception. This is achieved by combining elements in a microphone array in such a way that signals at particular angles experience constructive interference while others experience destructive interference. Beamforming can be used at both the transmitting and receiving ends in order to achieve spatial selectivity.
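As an illustration of the constructive and destructive combination described above, the following is a minimal delay-and-sum sketch for the receiving case. It assumes a linear microphone array, a far-field source, whole-sample delays, and a speed of sound of 343 m/s; the function names and parameters are hypothetical and are not taken from the disclosure.

```python
import math

SPEED_OF_SOUND_M_S = 343.0  # assumed value for air at room temperature


def arrival_delays(mic_positions_m, angle_deg, sample_rate_hz):
    """Whole-sample arrival delays for a far-field source.

    mic_positions_m are positions along a linear array; angle_deg is
    measured from broadside, positive toward increasing position, so
    microphones farther along the array hear the source earlier.
    """
    sin_a = math.sin(math.radians(angle_deg))
    raw = [-x * sin_a / SPEED_OF_SOUND_M_S * sample_rate_hz
           for x in mic_positions_m]
    earliest = min(raw)
    return [round(r - earliest) for r in raw]


def delay_and_sum(channels, delays):
    """Advance each channel by its arrival delay and average the result.

    Sound from the steered direction adds constructively; sound from
    other directions is misaligned and partially cancels.
    """
    n = len(channels[0])
    out = [0.0] * n
    for channel, delay in zip(channels, delays):
        for i in range(n):
            j = i + delay
            if 0 <= j < n:  # samples outside the buffer count as zero
                out[i] += channel[j]
    return [value / len(channels) for value in out]


if __name__ == "__main__":
    # The same pulse reaches microphone 0 two samples after microphone 1.
    mic0 = [0.0, 0.0, 0.0, 0.0, 1.0, 0.0]
    mic1 = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
    print(delay_and_sum([mic0, mic1], [2, 0]))  # steered at the source
    print(delay_and_sum([mic0, mic1], [0, 0]))  # steered elsewhere
```

When the delays match the source direction the pulse is recovered at full amplitude; with mismatched delays the energy is spread out and attenuated.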
Disclosed herein are devices, systems, and methods for employing audio information that may be combined with visual information for an identification of an object in or around the environment of the vehicle. By employing the aspects disclosed herein, the need to incorporate more powerful processors is obviated. As such, images, or objects in the images, may be identified more quickly, while allowing a cheaper, less resource-intensive, and lower-power implementation of a vehicle-based processor. Further, and as mentioned above, all the advantages associated with higher performance may be achieved.
The vehicle microprocessor 300 is electronically coupled to a variety of sensors. According to sample configurations shown in
Microphones 360 and 370 may be beamforming microphones. Processing of the microphone outputs determines positional information. This processing could be specific processing associated with the microphone array for distance estimation (this processing being some type of conventional processing or a CNN), or this could be implemented within the CNN that also performs object identification. Furthermore, the output(s) of the object identification CNN could be fed back into the distance estimation processing to enhance performance of the position estimation (i.e. accuracy, speed of calculation). While several of the aspects disclosed herein are described with a CNN, a neural network (NN) may also be used.
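The disclosure does not specify how positional information is derived from the microphone outputs. As one conventional (non-CNN) possibility, the sketch below estimates the time difference of arrival between two microphones by cross-correlation and converts it to a bearing. The two-microphone geometry, the far-field assumption, the 343 m/s speed of sound, and all names are assumptions for illustration; distance estimation would need additional processing not shown here.

```python
import math


def best_lag(a, b, max_lag):
    """Lag in samples by which signal b trails signal a.

    Chosen as the lag that maximizes the cross-correlation of a and b.
    """
    best, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i, x in enumerate(a):
            j = i + lag
            if 0 <= j < len(b):
                score += x * b[j]
        if score > best_score:
            best, best_score = lag, score
    return best


def bearing_deg(lag_samples, mic_spacing_m, sample_rate_hz,
                speed_of_sound_m_s=343.0):
    """Far-field bearing from broadside implied by a time difference of arrival."""
    ratio = lag_samples / sample_rate_hz * speed_of_sound_m_s / mic_spacing_m
    ratio = max(-1.0, min(1.0, ratio))  # guard against rounding outside [-1, 1]
    return math.degrees(math.asin(ratio))


if __name__ == "__main__":
    near = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]  # microphone closer to the source
    far = [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0]   # same pulse, two samples later
    lag = best_lag(near, far, max_lag=3)
    print(lag, bearing_deg(lag, mic_spacing_m=0.1, sample_rate_hz=34300.0))
```

A bearing of this kind could either be computed ahead of the object identification network or, as the paragraph above notes, learned within it.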
Additionally, and in another embodiment, the beamforming microphones may be equipped with automatic movement devices. These automatic movement devices allow the microphones 360 and 370 (and other microphones not shown) to be oriented in a manner optimal to record sound. The control of the beamforming microphones may also be accomplished with a CNN, wherein the control would be subsequently trained through iterative operations. In another example, two microphones employing a processor to convert the recorded signals to a beamformed signal may also be employed (independent of an NN or CNN).
In another example, the microphones 360 and 370 may be non-beamforming, and their outputs may be input to a processor, such as a CNN 310, and converted into a beamformed signal. This embodiment is further described in
As shown in
This operation is highlighted in
The vehicle microprocessor 300 is configured to receive the data (351, 361, and 371), and propagate said data to the CNN 310. Employing the aspects disclosed herein, and as highlighted in either method 400 (
In operations 410 and 420 (in no particular order), image data of an object and sound data (captured by a beamforming microphone) of the object are obtained. This may be done employing the described peripherals shown in
In operation 430, this captured data (image/video data and audio data) is propagated/communicated through an electronic coupling to a CNN for object identification. The CNN 310 is shown as a separate element in
In operation 440, employing both the visual data and the audio data, a CNN is employed to perform object identification. Once the object is identified, it may be propagated to the vehicle microprocessor 300 or another party that may employ the identification data for one or more objects identified, such identification data exemplified by object class or type, and object position.
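One simple way to picture object identification from both kinds of data is to concatenate image-derived and audio-derived feature vectors and score them with a single classification layer. The sketch below is a toy illustration under that assumption; the feature values, weights, labels, and function names are all hypothetical, and the actual network structure of the CNN 310 is not specified by this description.

```python
import math


def softmax(scores):
    """Convert raw scores to probabilities that sum to one."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def identify(image_features, audio_features, weight_rows, labels):
    """Fuse features by concatenation and return (label, probability)."""
    fused = list(image_features) + list(audio_features)
    scores = [sum(w * x for w, x in zip(row, fused)) for row in weight_rows]
    probs = softmax(scores)
    best = max(range(len(labels)), key=probs.__getitem__)
    return labels[best], probs[best]


if __name__ == "__main__":
    # Hypothetical classes and hand-picked toy weights: the image features
    # alone score both classes equally, and the audio feature decides.
    labels = ["emergency_vehicle", "passenger_car"]
    weight_rows = [[1.0, 0.0, 2.0],
                   [1.0, 0.0, -2.0]]
    print(identify([0.5, 0.5], [1.0], weight_rows, labels))
    print(identify([0.5, 0.5], [-1.0], weight_rows, labels))
```

In this toy case the visual evidence is ambiguous and the audio feature resolves the class, which is the kind of augmentation the combined input is meant to provide.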
Alternatively, the CNN 310 may learn from operation 440, and update the neural connections in the CNN 310 based on the audio correlated with the object. In this case, the CNN 310 is made more efficient in subsequent operations.
A key distinction is that operation 420 is omitted in method 500. Instead, in operation 520, a set of at least two non-beamforming microphones is employed to record audio data. These non-beamforming microphones are configured to record data and input the data into a CNN (operation 530). This operation converts the non-beamforming audio signals to a beamformed signal. The beamformed signal may be employed in operation 430. In another example, the multiple microphones being employed may input data into the NN or CNN, and beamforming may not be employed.
Thus, employing the aspects associated with the disclosure allows an increase in the object recognition capabilities of vehicles. This can result in higher performance using the same amount of processing capacity or more, or the reduction in required processing capacity to achieve the same performance level, or a combination of these. Higher performance could mean for example more total objects that can be identified, faster object identification, more classes of objects that can be identified, more accurate object identification and more accurate bounding of object areas.
Certain of the devices shown include a computing system. The computing system includes a processor (CPU) or a graphics processor (GPU) and a system bus that couples various system components, including a system memory such as read only memory (ROM) and random access memory (RAM), to the processor. Other system memory may be available for use as well. The computing system may include more than one processor or a group or cluster of computing systems networked together to provide greater processing capability. The system bus may be any of several types of bus structures including a memory bus or memory controller, a peripheral bus, and a local bus using any of a variety of bus architectures. A basic input/output system (BIOS) stored in the ROM or the like may provide basic routines that help to transfer information between elements within the computing system, such as during start-up.
To enable human (and in some instances, machine) user interaction, the computing system may include an input device, such as a microphone for speech and audio, a touch sensitive screen for gesture or graphical input, keyboard, mouse, motion input, and so forth. An output device can include one or more of a number of output mechanisms. In some instances, multimodal systems enable a user to provide multiple types of input to communicate with the computing system. A communications interface generally enables the computing system to communicate with one or more other computing devices using various communication and network protocols.
Embodiments disclosed herein can be implemented in digital electronic circuitry, or in computer software, firmware, or hardware, including the herein disclosed structures and their equivalents. Some embodiments can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions, encoded on a tangible computer storage medium for execution by one or more processors. A computer storage medium can be, or can be included in, a computer-readable storage device, a computer-readable storage substrate, or a random or serial access memory. The computer storage medium can also be, or can be included in, one or more separate tangible components or media such as multiple CDs, disks, or other storage devices. The computer storage medium does not include a transitory signal.
As used herein, the term processor (or microprocessor) encompasses all kinds of apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, a system on a chip, or multiple ones, or combinations, of the foregoing. The processor can include special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit). The processor also can include, in addition to hardware, code that creates an execution environment for the computer program in question, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, a cross-platform runtime environment, a virtual machine, or a combination of one or more of them.
A computer program (also known as a program, module, engine, software, software application, script, or code) can be written in any form of programming language, including compiled or interpreted languages, declarative or procedural languages, and the program can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, object, or other unit suitable for use in a computing environment. A computer program may, but need not, correspond to a file in a file system. A program can be stored in a portion of a file that holds other programs or data (e.g., one or more scripts stored in a markup language document), in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code). A computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a communication network.
To provide for interaction with an individual, the herein disclosed embodiments can be implemented using an interactive display, such as a graphical user interface (GUI). Such GUIs may include interactive features such as pop-up or pull-down menus or lists, selection tabs, scannable features, and other features that can receive human inputs.
The computing system disclosed herein can include clients and servers. A client and server are generally remote from each other and typically interact through a communications network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. In some embodiments, a server transmits data (e.g., an HTML page) to a client device (e.g., for purposes of displaying data to and receiving user input from a user interacting with the client device). Data generated at the client device (e.g., a result of the user interaction) can be received from the client device at the server.
As a person skilled in the art will readily appreciate, the above description is meant as an illustration of an implementation of the principles of this invention. This description is not intended to limit the scope or application of this invention in that the invention is susceptible to modification, variation and change, without departing from the spirit of this invention, as defined in the following claims.