This disclosure relates generally to neural networks, and more particularly to neural networks used on limited resource devices.
Neural networks have found many applications today, and more are being developed every day. However, current deep neural network models are computationally expensive and memory intensive. For example, the commonly used image classification network ResNet50 takes over 95 MB of RAM for storage and performs over 3.8 billion floating point multiplications to classify a single image. This has created problems when neural networks are to be employed in embedded systems. The large RAM utilization and processor cycle consumption can easily hinder other functions executing on the embedded system, limiting deployment or forcing the neural network to operate very infrequently, such as at very low frame rates in face finding applications. When used in a videoconferencing application, the frame rates can be so low that tracking individuals for view framing becomes difficult, hindering proper camera tracking of a speaker.
In the described examples, resources of an embedded system, such as RAM utilization and available processor cycles or bandwidth, are monitored. Neural network models of varying size and computational load for given neural networks are utilized in conjunction with this resource monitoring. The neural network model used for a particular neural network is dynamically varied based on the resource monitoring. In one example, neural network models of varying precision are stored, and the best model for the available RAM and processor cycles is loaded. In another example, neural network model weight values are quantized before being loaded for use, the level of quantization being based on the available RAM and processor cycles. This dynamic adaptation of the neural network models allows other processes in the embedded system to operate normally and yet allows the neural network to operate at the maximum capability available for a given period.
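As a minimal illustration of this approach (the thresholds, variant names and two-input decision below are hypothetical assumptions, not taken from the disclosure), a selection routine might map measured resource headroom to one of several stored model variants:

```python
# Hypothetical sketch: choose a neural network model variant based on
# currently available RAM and processor headroom. Thresholds, names and
# the three-variant split are illustrative assumptions only.

def select_model_variant(free_ram_mb: float, cpu_idle_pct: float) -> str:
    """Return the identifier of the model variant to load."""
    if free_ram_mb > 200 and cpu_idle_pct > 50:
        return "fp32_full"      # plenty of headroom: full-precision model
    if free_ram_mb > 100 and cpu_idle_pct > 25:
        return "fp16_reduced"   # moderate headroom: reduced-precision model
    return "int8_quantized"     # tight resources: heavily quantized model

# Example: 80 MB free and 20% CPU idle selects the most compact variant.
print(select_model_variant(80, 20))
```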
For illustration, there are shown in the drawings certain examples described in the present disclosure. In the drawings, like numerals indicate like elements throughout. The full scope of the inventions disclosed herein is not limited to the precise arrangements, dimensions, and instruments shown. In the drawings:
In the drawings and the description of the drawings herein, certain terminology is used for convenience only and is not to be taken as limiting the examples of the present disclosure. In the drawings and the description below, like numerals indicate like elements throughout.
Throughout this disclosure, terms are used in a manner consistent with their use by those of skill in the art, for example:
Computer vision is an interdisciplinary scientific field that deals with how computers can be made to gain high-level understanding from digital images or videos. Computer vision seeks to automate tasks imitative of the human visual system. Computer vision tasks include methods for acquiring, processing, analyzing and understanding digital images, and extraction of high-dimensional data from the real world to produce numerical or symbolic information. Computer vision is concerned with artificial systems that extract information from images. Computer vision includes algorithms which receive a video frame as input and produce data detailing the visual characteristics that a system has been trained to detect.
A convolutional neural network is a class of deep neural network which can be applied to analyzing visual imagery. A deep neural network is an artificial neural network with multiple layers between the input and output layers.
Artificial neural networks are computing systems inspired by the biological neural networks that constitute animal brains. Artificial neural networks exist as code being executed on one or more processors. An artificial neural network is based on a collection of connected units or nodes called artificial neurons, which mimic the neurons in a biological brain. Each connection, like the synapses in a biological brain, can transmit a ‘signal’ to other neurons. An artificial neuron that receives a signal then processes it and can signal neurons connected to it. The signal at a connection is a real number, and the output of each neuron is computed by some non-linear function of the sum of its inputs. The connections are called edges. Neurons and edges have weights, the value of which is adjusted as ‘learning’ proceeds and/or as new data is received by the system. The weight increases or decreases the strength of the signal at a connection. Neurons may have a threshold such that a signal is sent only if the aggregate signal crosses that threshold.
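The neuron computation described above can be summarized in a few lines; the following is a generic sketch using a sigmoid non-linearity and an optional firing threshold, not code from the disclosure:

```python
import math

def neuron_output(inputs, weights, bias, threshold=None):
    """Weighted sum of inputs passed through a non-linear function (sigmoid here).
    If a threshold is given, the neuron only 'fires' when the activation crosses it."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    activation = 1.0 / (1.0 + math.exp(-z))    # one common non-linearity
    if threshold is not None and activation < threshold:
        return 0.0                             # aggregate signal did not cross the threshold
    return activation

print(neuron_output([0.5, -1.2, 3.0], [0.8, 0.1, -0.4], bias=0.2))
```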
The device 100 includes loudspeaker(s) 122, camera(s) 116 and microphone(s) 114 interfaced to a bus 115, the microphones 114 through an analog to digital (A/D) converter 112 and the loudspeaker 122 through a digital to analog (D/A) converter 113. The device 100 also includes a processing unit 102, a network interface 108, a flash memory 104, RAM 105, and an input/output general interface 110, all coupled by bus 115. An HDMI interface 118 is connected to the bus 115 and to an external display 120. Bus 115 is illustrative, and any interconnect between the elements can be used, such as Peripheral Component Interconnect Express (PCIe) links and switches, Universal Serial Bus (USB) links and hubs, and combinations thereof. The cameras 116 and microphones 114 can be contained in a housing containing the other components or can be external and removable, connected by wired or wireless connections.
The processing unit 102 can include digital signal processors (DSPs), central processing units (CPUs), graphics processing units (GPUs), dedicated hardware elements, such as neural network accelerators and hardware codecs, and the like in any desired combination.
The flash memory 104 stores modules of varying functionality in the form of software and firmware, generically programs, for controlling the device 100. Illustrated modules include a video codec 150, camera control 152, face and body finding 154, other video processing 156, audio codec 158, audio processing 160, neural network models 162, resource monitor 164, network operations 166, user interface 168 and operating system and various other modules 170. The RAM 105 is used for storing any of the modules in the flash memory 104 when the module is executing, storing video images of video streams and audio samples of audio streams and can be used for scratchpad operation of the processing unit 102. Relevant to this description is that the neural network models 162 are loaded into the RAM 105 when the respective neural network is being used, such as for face and body finding, background detection and other operations that vary based on the actual device.
The network interface 108 enables communications between the device 100 and other devices and can be wired, wireless or a combination. In one example, the network interface 108 is connected or coupled to the Internet 130 to communicate with remote endpoints 140 in a videoconference. In one or more examples, the general interface 110 provides data transmission with local devices such as a keyboard, mouse, printer, projector, display, external loudspeakers, additional cameras, and microphone pods.
In one example, the cameras 116 and the microphones 114 capture video and audio, respectively, in the videoconference environment and produce video and audio streams or signals transmitted through the bus 115 to the processing unit 102. In at least one example of this disclosure, the processing unit 102 processes the video and audio using algorithms in the modules stored in the flash memory 104. Processed audio and video streams can be sent to and received from remote devices coupled to network interface 108 and devices coupled to general interface 110. This is just one example of the configuration of a device 100.
In a second configuration, the components are disaggregated or separated. In this second configuration, the camera and a set of microphones used for speaker location are in a separate camera component with its own processing unit and flash memory storing software and firmware. In such a configuration, the camera control module 152, the face and body finding module 154, and the neural network models 162 are present in the camera component, the camera component then performing the neural network processing used in face and body finding, for example. The camera component provides properly framed video to a codec component. The codec component also has its own processing unit and flash memory storing software and firmware. In this second configuration, the remaining modules in the flash memory 104 of the device 100 are present in the codec component.
Other configurations, with differing components and arrangement of components, are well known for both videoconferencing endpoints and for devices used in other manners.
A graphics acceleration module 224 is connected to the high-speed interconnect 208. A display subsystem 226 is connected to the high-speed interconnect 208 to allow operation with and connection to various video monitors. A system services block 232, which includes items such as DMA controllers, memory management units, general-purpose I/O's, mailboxes and the like, is provided for normal SoC 200 operation. A serial connectivity module 234 is connected to the high-speed interconnect 208 and includes modules as normal in an SoC. A vehicle connectivity module 236 provides interconnects for external communication interfaces, such as PCIe block 238, USB block 240 and an Ethernet switch 242. A capture/MIPI module 244 includes a four-lane CSI-2 compliant transmit block 246 and a four-lane CSI-2 receive module and hub.
An MCU island 260 is provided as a secondary subsystem and handles operation of the integrated SoC 200 when the other components are powered down to save energy. An MCU ARM processor 262, such as one or more ARM R5F cores, operates as a master and is coupled to the high-speed interconnect 208 through an isolation interface 261. An MCU general purpose I/O (GPIO) block 264 operates as a slave. MCU RAM 266 is provided to act as local memory for the MCU ARM processor 262. A CAN bus block 268, an additional external communication interface, is connected to allow operation with a conventional CAN bus environment in a vehicle. An Ethernet MAC (media access control) block 270 is provided for further connectivity. External memory, generally non-volatile memory (NVM) such as flash memory 104, is connected to the MCU ARM processor 262 via an external memory interface 269 to store instructions loaded into the various other memories for execution by the various appropriate processors. The MCU ARM processor 262 operates as a safety processor, monitoring operations of the SoC 200 to ensure proper operation of the SoC 200.
It is understood that this is one example of an SoC provided for explanation and many other SoC examples are possible, with varying numbers of processors, DSPs, accelerators and the like.
In the example where the device 100 is a videoconferencing device, all of the illustrated modules in the flash memory 104 are executing concurrently during a videoconference. Camera 116 is providing a video stream which is being analyzed by the face and body finding module 154 using the neural network models 162. The video codec 150 and other video processing module 156 are operating on the resulting stream, with camera control module 152 focusing the camera on the speakers as determined by the face and body finding module 154. The audio processing module 160 is operating on speech of the participants of the videoconference provided by the microphones 114, with the resulting speech being provided through the audio codec 158. The network operations module 166 is operating to provide the outputs of the video codec 150 and the audio codec 158 to the far end and to provide the far end audio and video data to the video codec 150 and the audio codec 158 for decoding and presentation on the display 120 and reproduction on the loudspeakers 122. User interface module 168 is operating to allow user control of the various devices and the layout of the display 120. The operating system and various other modules 170 are operating as necessary to allow the device 100 to operate. The resource monitor module 164 is operating to monitor the use and loading of all of the various components for resource scheduling.
The concurrent operation of this many modules often puts a strain on the processing capabilities of the processing unit 102, even one as complex and capable as the SoC 200. Not only are many of the modules operating concurrently, but some of the modules are also replicated, with the multiple instances running concurrently. For example, if the device 100 is acting as a videoconferencing bridge, multiple instances of the video codec 150 and the audio codec 158 will be executing for each of the remote endpoints, and the network operations module 166 will be interfacing with each of those remote endpoints. Additional modules not shown, such as the modules to combine the various audio streams and the video streams, would also be executing on the processing unit 102. This places an even greater burden on the processing unit 102. Alternatively, if the videoconference is a peer-to-peer videoconference, multiple instances of the video codec 150, audio codec 158 and network operations module 166 will be executing for each of the endpoints in the videoconference. The situation can be further exacerbated if the protocol used in the videoconference is scalable video coding (SVC), which produces multiple video streams at different resolutions and so requires further instances of the video codec 150 to be in operation.
For example, if the device 100 is in a single point videoconference with a single remote endpoint, only single instances of the various modules would be executing. However, when a second remote endpoint is added to the videoconference, additional instances of the video codec 150, audio codec 158 and other modules as needed would be spawned and begin executing. While performance of the processing unit 102 may be acceptable for this three-party peer-to-peer videoconference, when a fourth remote endpoint is added, the processing unit 102 may exceed its capabilities under certain circumstances, particularly if the videoconference is being conducted using SVC.
Referring now to
As discussed above, neural network models are used for face and body finding, background finding and the like. In step 312, the particular neural network model to be used for each neural network which is operating is selected or determined. This selection or determination is based on the loads and utilizations as determined in the steps 302-310. If the DSP load, the RAM utilization, and so on are high, a simpler, less complex neural network model is used to minimize resource drain on the other necessary modules of the device 100. If, instead, the DSP load and memory utilization, for example, are low, a higher quality neural network model can be utilized to provide enhanced results for face and body finding and the like. Alternatively, if the DSP load is high and the GPU load is low, a neural network model that primarily utilizes the GPU instead of the DSP can be utilized, with a quality based on the GPU load. The selection of the neural network model can change quality or specific processing unit, or both, depending on resource availability, loading and utilization. Step 312 selects the appropriate neural network model based on the various loading and utilization conditions. In step 314, it is determined whether there are any changes from the currently executing neural network models. If not, operation returns to step 302 to again determine the resource loading. Though shown as a loop for continuous operation, a delay can be included so that the resource determination is only performed periodically. The period can vary, with example values ranging from five or ten seconds to thirty seconds; specific values vary based on components and processing tasks and are determined for a particular instance by tuning for the specific environment. If changes are necessary as determined in step 314, in step 316 the neural network models are swapped to the newly determined neural network models. In this manner, the highest quality neural network model appropriate for the operating circumstances of the device 100 is provided, so that the device 100 and the processing unit 102 are not overloaded, which would impair operation of the device 100.
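A hedged sketch of this monitoring loop follows; the callables, names and default period are illustrative assumptions supplied by the surrounding system, not prescribed by the disclosure:

```python
import time

def resource_monitor_loop(get_loads, select_model, current_model, swap_model,
                          period_s: float = 10.0):
    """Periodically determine loads and utilizations, select the appropriate
    neural network model, and swap models only when the selection changes.
    get_loads, select_model and swap_model are assumed to be provided elsewhere."""
    while True:
        loads = get_loads()                    # analogous to the load determinations of steps 302-310
        chosen = select_model(loads)           # analogous to the selection of step 312
        if chosen != current_model:            # analogous to the change check of step 314
            swap_model(chosen)                 # analogous to the model swap of step 316
            current_model = chosen
        time.sleep(period_s)                   # periodic rather than continuous checking
```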
It is understood that the specific elements whose loading or utilization is being determined can vary as needed for the particular environment. In some examples, GPU loading is minimal in all instances, so the GPU load determination of step 308 can be omitted. In many cases, the neural networks are programs operating on the DSPs, so step 310 can be omitted as it is incorporated in step 306. In some examples, the load determinations can be finer grained. For example, the DSP loading of step 306 can be done per DSP or per DSP task group, such as neural network processing. Similarly, CPU loading as determined in step 302 can be finer grained, per processor or per task type.
To maintain satisfactory loading levels, various versions of the neural network models are present to allow this proper resource tuning.
The flash memory 104 stores each of the specific neural network models at each level of quantization or precision. The total space occupied by the neural network models is then relatively large, but the flash memory 104 is relatively large compared to the RAM 105, so this replication of varying-precision neural network models in the flash memory 104 does not pose the problem that large neural network models pose when they must reside in the RAM 105.
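One way such a catalog of stored variants might be organized is sketched below; the file names, RAM footprints and precision levels are hypothetical, and only the selected variant is loaded into RAM:

```python
# Hypothetical catalog of stored model files at different precisions.
# Flash can hold all of them; only the selected one is loaded into RAM.
MODEL_CATALOG = {
    "face_body_finder": {
        "fp32": {"file": "fbf_fp32.bin", "ram_mb": 98},
        "fp16": {"file": "fbf_fp16.bin", "ram_mb": 49},
        "int8": {"file": "fbf_int8.bin", "ram_mb": 25},
    },
}

def pick_precision(network: str, free_ram_mb: float) -> str:
    """Pick the highest precision whose RAM footprint fits the available RAM."""
    variants = MODEL_CATALOG[network]
    for precision in ("fp32", "fp16", "int8"):           # highest quality first
        if variants[precision]["ram_mb"] <= free_ram_mb:
            return precision
    return "int8"                                        # fall back to the smallest

print(pick_precision("face_body_finder", free_ram_mb=60))   # -> "fp16"
```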
where N is the number of connections
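The expression referenced above is not fully reproduced here. As a sketch only, the storage saving from weight sharing is commonly expressed as the ratio below, where $N$ is the number of connections, $b$ is the bit width of an original weight, and $k$ is the number of shared weight values (clusters); the exact expression used in a given implementation may differ.

```latex
% Assumed standard weight-sharing compression ratio (not reproduced from the disclosure):
%   original storage:   N b bits (one full-precision weight per connection)
%   compressed storage: N \lceil \log_2 k \rceil bits of indices plus a k-entry codebook of b-bit values
r = \frac{N\,b}{N\,\lceil \log_2 k \rceil + k\,b}
```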
Computation speed is increased using quantization. For the ResNet50 example, the weight values must be stored in external DRAM because of their size. Quantizing reduces the number of distinct weight values, allowing a portion of the weight values to be cached in the relevant processor. For example, if 8-bit quantization is used, the 256 distinct 32-bit weight values can all be stored in the relevant L1D cache. In one example, the retrieval time from the L1D cache is just one cycle, as opposed to many cycles from external DRAM. This single cycle retrieval time versus the many cycles for external DRAM provides a computation speed increase. Varying the number of bits in the quantization varies the number of weight values retained in the L1D and L2D caches, which in turn varies the computation speed increase.
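The cache arithmetic behind this example can be made explicit; the sketch below simply computes codebook sizes for a few bit widths (actual cache sizes and latencies depend on the specific processor):

```python
# Codebook size for shared weights at different quantization bit widths.
# With b-bit quantization there are at most 2**b distinct 32-bit weight values,
# so the whole codebook can often fit in a small L1 data cache.
for bits in (8, 6, 4):
    distinct_values = 2 ** bits
    codebook_bytes = distinct_values * 4      # each shared value stored as a 32-bit float
    print(f"{bits}-bit quantization: {distinct_values} values, {codebook_bytes} bytes")
# 8-bit -> 256 values, 1024 bytes
# 6-bit -> 64 values, 256 bytes
# 4-bit -> 16 values, 64 bytes
```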
The weight quantizer compressor 456 cooperates with the resource monitor module 164 to set the number of clusters or quantization bits to provide a neural network model of the desired size and computation speed to match the desired RAM utilization and computation overhead.
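A minimal sketch of what such a weight quantizer could do is shown below, assuming simple uniform quantization to 2**bits shared levels; the disclosure also mentions cluster-based quantization, for which a clustering method such as k-means could be substituted. The function and its interface are illustrative only:

```python
import numpy as np

def quantize_weights(weights: np.ndarray, bits: int):
    """Quantize float32 weights to 2**bits shared levels.
    Returns integer indices plus the codebook needed to reconstruct values."""
    levels = 2 ** bits
    w_min, w_max = float(weights.min()), float(weights.max())
    step = (w_max - w_min) / (levels - 1)
    if step == 0.0:
        step = 1.0                              # degenerate case: all weights equal
    index_dtype = np.uint8 if bits <= 8 else np.uint16
    indices = np.round((weights - w_min) / step).astype(index_dtype)
    codebook = w_min + np.arange(levels, dtype=np.float32) * step
    return indices, codebook

# Example: more free RAM and processor cycles -> more bits -> higher fidelity.
weights = np.random.randn(1000).astype(np.float32)
idx, book = quantize_weights(weights, bits=8)
reconstructed = book[idx]                       # approximate weights used at inference time
```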
It is understood that changing the precision or quantization of the neural network will change the accuracy of the analysis performed by the neural network, but this change in precision is preferable to starving other functions of RAM or processor cycles or reducing the frequency of the neural network operations.
In various examples the neural network models of both
In other examples, the storage of models of differing precision as in
The illustrated precision variations and weight value quantization are two examples of variable compression that can be used to size the neural network model adaptively to the available RAM and processor cycles. Other methods of neural network model compression can be utilized as well. For example, low-rank tensor factorization can be used, in which the order of the factorization is adjustable, with higher orders used when the available RAM and processor cycles are high and lower orders used as the available RAM and processor cycles are reduced.
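As a sketch of this adjustable-order idea for a single weight matrix, assuming truncated singular value decomposition as the factorization (the disclosure does not name a specific factorization method), the retained rank plays the role of the adjustable order:

```python
import numpy as np

def low_rank_factors(weight_matrix: np.ndarray, rank: int):
    """Factor an (m x n) weight matrix into (m x rank) and (rank x n) factors.
    Storage drops from m*n to rank*(m + n) values when rank is small."""
    u, s, vt = np.linalg.svd(weight_matrix, full_matrices=False)
    a = u[:, :rank] * s[:rank]        # (m x rank), singular values folded in
    b = vt[:rank, :]                  # (rank x n)
    return a, b

w = np.random.randn(512, 512).astype(np.float32)
a, b = low_rank_factors(w, rank=64)   # ~64*(512+512) = 65,536 values vs 262,144
approx = a @ b                        # used in place of w during inference
```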
In some examples, each neural network operating in the device is dynamically sized, while in other examples only specific neural networks are dynamically sized and other neural networks have a fixed size.
It is understood that, while the detailed examples used herein are for a videoconferencing unit, the adaptive sizing of neural network models based on RAM utilization and available processor cycles is generally applicable to any embedded system utilizing neural networks, such as vehicles for advanced driver assistance systems (ADAS) applications, robots for vision and movement processing, augmented reality, security and surveillance, cameras and the like.
By periodically monitoring the available RAM and the processor cycles available on the various processors, neural network models of differing size and processing requirements can be utilized adaptively to maximize the quality of the neural network output while also ensuring that other functions using the embedded processor are not starved of RAM or processing cycles.
The various examples described are provided by way of illustration and should not be construed to limit the scope of the disclosure. Various modifications and changes can be made to the principles and examples described herein without departing from the scope of the disclosure and without departing from the claims which follow.