The present disclosure relates to systems and methods configured to create a set of one or more presentation models based on a system design and associated resource constraints, and execute one presentation model using an inference engine.
Contemporary multiprocessor system architecture design methodologies rely on a manual, user-involved approach. For example, a typical AI model compiler or mapper software development kit (SDK) reports inferences per second (IPS) for each model when the model is compiled offline (for example, in a single-model use case), based on user-defined architectures. The SDK does not consider, and is not aware of, an associated application stack or host functions involved in an end-to-end use case. Other performance aspects, such as input/output (IO) transfers, pre-processing, post-processing, input and output transfer times (to and from a digital video device and a host), and pipelining of a proxy inference pipeline, may not be accounted for. Actual implementations of user-defined multiprocessing architectures may therefore provide lower-than-predicted performance in an actual end-to-end system deployment. Contemporary multiprocessor system design techniques place the burden on the end user to come up with an appropriate pipeline execution model to achieve a desired hardware throughput.
Aspects of the invention are directed to systems and methods to execute a presentation model by an inference engine.
One aspect includes identifying resource constraints for multiple computing devices. Using the identified resource constraints, a presentation model having a plurality of modifiable parameters based at least in part on the resource constraints may be created. One or more inference engines may be used to execute a particular neural network model, where an inference engine supports neural network processing.
Another aspect includes identifying resource constraints for multiple computing devices. Using identified resource constraints, multiple presentation models may be created. The presentation models may be created at least in part based on identified processing metrics, with the multiple presentation models including multiple processing pipelines configurable for execution on multiple computing devices. An inference engine may be used to provide an execution model for the multiple processing pipelines based at least in part on the multiple presentation models. In one aspect, the execution model has improved processing metrics as compared to at least one of the multiple presentation models.
Aspects of the invention include apparatuses that implement the above methods.
Non-limiting and non-exhaustive embodiments of the present disclosure are described with reference to the following figures, wherein like reference numerals refer to like parts throughout the various figures unless otherwise specified.
In the following description, reference is made to the accompanying drawings that form a part thereof, and in which is shown by way of illustration specific exemplary embodiments in which the disclosure may be practiced. These embodiments are described in sufficient detail to enable those skilled in the art to practice the concepts disclosed herein, and it is to be understood that modifications to the various disclosed embodiments may be made, and other embodiments may be utilized, without departing from the scope of the present disclosure. The following detailed description is, therefore, not to be taken in a limiting sense.
Reference throughout this specification to “one embodiment,” “an embodiment,” “one example,” or “an example” means that a particular feature, structure, or characteristic described in connection with the embodiment or example is included in at least one embodiment of the present disclosure. Thus, appearances of the phrases “in one embodiment,” “in an embodiment,” “one example,” or “an example” in various places throughout this specification are not necessarily all referring to the same embodiment or example. Furthermore, the particular features, structures, databases, or characteristics may be combined in any suitable combinations and/or sub-combinations in one or more embodiments or examples. In addition, it should be appreciated that the figures provided herewith are for explanation purposes to persons ordinarily skilled in the art and that the drawings are not necessarily drawn to scale.
Embodiments in accordance with the present disclosure may be embodied as an apparatus, method, or computer program product. Accordingly, the present disclosure may take the form of an entirely hardware-comprised embodiment, an entirely software-comprised embodiment (including firmware, resident software, micro-code, etc.), or an embodiment combining software and hardware aspects that may all generally be referred to herein as a “circuit,” “module,” or “system.” Furthermore, embodiments of the present disclosure may take the form of a computer program product embodied in any tangible medium of expression having computer-usable program code embodied in the medium.
Any combination of one or more computer-usable or computer-readable media may be utilized. For example, a computer-readable medium may include one or more of a portable computer diskette, a hard disk, a random-access memory (RAM) device, a read-only memory (ROM) device, an erasable programmable read-only memory (EPROM or Flash memory) device, a portable compact disc read-only memory (CDROM), an optical storage device, a magnetic storage device, and any other storage medium now known or hereafter discovered. Computer program code for carrying out operations of the present disclosure may be written in any combination of one or more programming languages. Such code may be compiled from source code to computer-readable assembly language or machine code suitable for the device or computer on which the code can be executed.
Embodiments may also be implemented in cloud computing environments. In this description and the following claims, “cloud computing” may be defined as a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned via virtualization and released with minimal management effort or service provider interaction and then scaled accordingly. A cloud model can be composed of various characteristics (e.g., on-demand self-service, broad network access, resource pooling, rapid elasticity, and measured service), service models (e.g., Software as a Service (“SaaS”), Platform as a Service (“PaaS”), and Infrastructure as a Service (“IaaS”)), and deployment models (e.g., private cloud, community cloud, public cloud, and hybrid cloud).
The flow diagrams and block diagrams in the attached figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flow diagrams or block diagrams may represent a module, segment, or portion of code, which includes one or more executable instructions for implementing the specified logical function(s). It is also noted that each block of the block diagrams and/or flow diagrams, and combinations of blocks in the block diagrams and/or flow diagrams, may be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions. These computer program instructions may also be stored in a computer-readable medium that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable medium produce an article of manufacture including instruction means which implement the function/act specified in the flow diagram and/or block diagram block or blocks.
Aspects of the invention described herein address the shortcomings associated with contemporary multiprocessor system implementations. The systems and methods described herein facilitate the pipelining of the stream processing functions associated with, for example, streaming media applications. Some aspects provide a mechanism to order the stream processing functions, and may enable designing higher-order applications based on applications with relatively low complexity.
Some aspects may be configured to autonomously identify one or more performance constraints (e.g., bandwidth constraints, latency constraints, system processing constraints, etc.) associated with a multiprocessor architecture and one or more associated processing pipelines. Some embodiments may provide analytical information on performance gains or losses associated with moving one or more workloads across a spectrum of computing devices (i.e., across different computing devices in the multiprocessor system). One aspect may support a domain-specific language (DSL) to specify pipeline functionality. Another aspect may provide a user interface (UI) for building a pipeline using one or more widgets and a gallery.
Pipeline processing architecture 120 further includes preprocessing 106, inferencing 108, and postprocessing 112. Inferencing 108 may further include machine learning functionality. In one aspect, data sources 104 are sources of digital data. Examples of digital data files include text files, audio files, video files, etc. Another aspect may include data sources 104 configured as video streaming sources (e.g., video cameras, webcams, security cameras, etc.). Other examples of data sources include online media files (e.g., streaming video from a website such as YouTube), radar data, streaming digital music, a digitized audio signal from a microphone, etc. Digital data generated by data sources 104 may be input to pipeline processing architecture 120 as input 102. Other sources to input 102 may include one or more cameras, files or directories, still images, video sources, one or more decoders, one or more real time streaming protocol (RTSP) or network sources, etc.
In one aspect, preprocessing 106 performs preprocessing functions on data received from input 102. Examples of preprocessing functions on video or image data include resizing, cropping, scaling, element-wise division, color-space conversion, mean subtraction, normalization, mirror imaging, transposing, tiling, input quantization, etc.
In one aspect, inferencing 108 is implemented using one or more processing systems that include any combination of artificial intelligence, neural network, and machine learning algorithms. Inferencing 108 may implement one or more of the following operations:
Postprocessing 112 may perform postprocessing operations on processed output data from inferencing 108. Postprocessing 112 may implement functions such as one or more detection layers, arithmetic functions such as sigmoid, softmax, and exponential, data sorting functions, converting one or more data vectors into corresponding text words or audio signals, dequantization, and so on. Results from postprocessing 112 may be received by output 114. Output 114 may perform one or more final operations on postprocessed data received from postprocessing 112. Such operations may include:
Outputs from output 114 may be transmitted to sink 116. Sink 116 may be a user interface that is configured to present or display results from output 114. Examples of sink 116 may include any combination of visual display monitors (e.g., computer or television screens), audio devices (e.g., buzzers or loudspeakers), lamps or LED lights, and so on. Results from output 114 may be displayed as a text file, an image file, a video stream, an audio or sound stream, and so on.
In one aspect, pipeline processing architecture may include an accuracy metric stage (not depicted), after postprocessing 112 and before output 114. Accuracy metrics associated with the accuracy metric stage may include numerical accuracy (i.e., range, precision and error). The accuracy metric stage may be configured to perform on-the-fly updating of any accuracy metrics associated with pipeline processing architecture 120. Examples of metrics include a latency, an execution time, a memory consumed, an input/output data transfer time, and an inference time.
In one aspect, pipeline processing architecture 120 may be configured to perform processing operations using pipelining, or pipeline processing. Examples of pipelines include an audio stream pipeline, a video stream pipeline, and so on. In stream processing applications (e.g., audio or video stream processing), a pipeline, also known as a data pipeline, is a set of data processing elements connected in series, where the output of one element is the input of the next element. The elements of a pipeline are often executed in parallel or in a time-sliced fashion. Pipeline models may be used to increase throughput. The ideal throughput of a pipelined processing system is the same as the input rate. To bring the stream processing application output up to the level of the input, the same execution sequence of stream processing functions (e.g., a line) is run multiple times in parallel. The number of lines to run in parallel depends on the latency/processing time of the stream processing functions.
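As a minimal illustration of this relationship (the stage latency, input interval, and helper name below are hypothetical values chosen for the example and are not part of the disclosed system), the number of parallel lines can be estimated as the ceiling of a stage's processing time divided by the desired input interval:

```python
import math

def lines_needed(stage_latency_ms: float, input_interval_ms: float) -> int:
    """Number of copies of a line that must run in parallel so the
    pipeline can accept a new input every input_interval_ms."""
    return math.ceil(stage_latency_ms / input_interval_ms)

# Hypothetical example: a 330 ms preprocessing stage fed a new frame every
# 110 ms needs ceil(330 / 110) = 3 parallel lines to keep up with the input.
print(lines_needed(330.0, 110.0))  # -> 3
```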
In one aspect, a processing architecture for pipeline processing architecture 120 is generated by pipeline processing architecture generator 118. Pipeline processing architecture generator 118 may generate one or more candidate multiprocessing system architectures (also referred to as “multiprocessing system configurations”) that can be used to implement pipeline processing architecture 120. Based on processing parameters (e.g., desired latency, available computational and memory resources, etc.) pipeline processing architecture generator 118 may generate one or more multiprocessing system configurations that may be used to implement pipeline processing architecture 120. Of these multiprocessing system configurations, a user may select one multiprocessing system configuration to implement pipeline processing architecture 120.
Pipeline processing architecture generator 118 may be configured to receive a design specification for a multiprocessing system. The design specification may include functionality (e.g., face detection or object tracking), a number of pipeline stages, a listing of the pipeline stages and their order, and other such information. The design specification may also include available system resources (e.g., number of CPUs, number of GPUs, system memory (RAM), etc.). The design specification may also include a requirement that a particular aspect of system design or system performance be given priority. For example, a first user may want to use as little computing resource utilization as possible in a first design, a second user may want to reduce latency in a second design, and so on.
To ensure that certain minimum systemic and performance objectives are maintained, the design specification may also include one or more constraints. For example, while trying to increase system throughput, a processing resource utilization constraint may impose an associated limit based on available computing resources (e.g., a limited number of processing cores). Or, while trying to reduce system resource utilization, a throughput constraint may require that the throughput of the system not drop below a certain value. Pipeline processing architecture generator 118 may receive the system design requirements along with the constraints, and generate one or more multiprocessing system designs (i.e., multiprocessing system configurations or multiprocessing system architectures) based on the performance requirements and constraints. A user may then select one of these multiprocessing system designs to implement as pipeline processing architecture 120.
In one aspect, processing module 212 includes components that perform the generation of the multiprocessing system designs. In one aspect, these design specifications are referred to as “presentation models.” Memory manager 214 may process data associated with the design specification to ensure that a multiprocessing system architecture is within any memory constraints specified by the multiprocessing system design specification.
In one aspect, statistics 216 is configured to gather statistics either statically from an input design specification, or at run time. Statistics 216 may gather statistics by querying various components in the system to obtain information such as memory consumed for a particular operation, time consumed for a particular configuration, etc. Taskflow 220 may be configured to provide a capability to traverse and execute one or more task graphs. In one aspect, task graph 224 allows for representation of a set of execution elements in a graph format, i.e., in a specific order, such as a DAG (directed acyclic graph). Pipeline 222 allows for representation of a set of pipestages that are eventually traversed and executed in a sequential manner. A pipestage is a logical grouping of 'related' tasks that together represent a high-level function, for example a 'pre-processing' stage.
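A minimal sketch of how a pipestage might be represented as a DAG of tasks follows; the class names, fields, and topological-order helper are illustrative assumptions and do not reflect the actual interfaces of taskflow 220, pipeline 222, or task graph 224.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class Task:
    name: str
    fn: Callable                                     # work performed by this task
    deps: List[str] = field(default_factory=list)    # names of upstream tasks

@dataclass
class PipeStage:
    """A logical grouping of 'related' tasks, e.g. a 'pre-processing' stage."""
    name: str
    tasks: Dict[str, Task] = field(default_factory=dict)

    def add(self, task: Task) -> None:
        self.tasks[task.name] = task

    def topological_order(self) -> List[str]:
        # Kahn's algorithm over the task DAG; raises if a cycle exists.
        indegree = {t: len(task.deps) for t, task in self.tasks.items()}
        ready = [t for t, d in indegree.items() if d == 0]
        order = []
        while ready:
            t = ready.pop()
            order.append(t)
            for other, task in self.tasks.items():
                if t in task.deps:
                    indegree[other] -= 1
                    if indegree[other] == 0:
                        ready.append(other)
        if len(order) != len(self.tasks):
            raise ValueError("task graph contains a cycle")
        return order

# Usage: a pre-processing stage whose 'resize' feeds 'crop' and then 'quantize'.
stage = PipeStage("pre-processing")
stage.add(Task("resize", fn=lambda x: x))
stage.add(Task("crop", fn=lambda x: x, deps=["resize"]))
stage.add(Task("quantize", fn=lambda x: x, deps=["crop"]))
print(stage.topological_order())  # ['resize', 'crop', 'quantize']
```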
Function library 218 may be configured to interface with plugin module 228. Plugin module 228 may include modules such as Gstreamer 230, nnstreamer 232, and other Python modules. The modules included in plugin module 228 may be external modules that the system can avail of in order to implement one or more functions (e.g., a resize function in a pre-processing stage). External frameworks such as GStreamer can provide alternate representations that pipeline processing architecture generator 118 can leverage to implement such functions. An abstract representation of these functions may be provided as one or more inputs to pipeline processing architecture generator 118 in order to use these functions appropriately.
In one aspect, template repository 202 includes one or more system architecture templates that can be used by a user to generate a multiprocessing system design (architecture). Use cases and demos 204 may include multiprocessing design architectures for specific use cases, as well as demonstration multiprocessing system design architectures that may be used to demonstrate the capabilities of pipeline processing architecture generator 118. Accuracy flows 206 is a set of templates that can include additional pipe stages, as opposed to functional variants of pipelines such as demos. These additional pipe stages perform functions such as configuration, observation, and reporting of numerical accuracy statistics. In one aspect, numerical accuracy data is used to determine whether a particular use case is able to accurately perform its function and, if not, how much a metric such as Mean Intersection over Union (MIoU) deviates from a predetermined or acceptable value.
In one aspect, example pipelines 208 includes sample pipeline architectures that can be used by a user as learning tools or as base architectures on which a multiprocessing system architecture can be based. In one aspect, model zoo pipelines 210 are fully-functional, simplistic representations of a pipeline. For example, in one such implementation, a model zoo pipeline may perform all steps sequentially (e.g., one by one) while dumping parameters, inputs, and outputs at each stage, step, or task of the pipeline. Such a representation allows a user to easily understand, visualize, and debug the pipeline or task. Model zoo pipelines 210 may also be used to demonstrate the capability of a particular machine learning model without the complexities of a performant implementation of the same.
In general, a stream processing paradigm simplifies parallel software and hardware by restricting the parallel computation that can be performed by a multiprocessing system architecture. Given a sequence of data (e.g., a stream), a series of operations is applied to each element in the stream. Kernel functions are usually pipelined, and local on-chip computing functionality and memory reuse is attempted, in order to reduce or minimize the loss in bandwidth. Uniform streaming, where the same set of kernel functions is applied to all elements in the stream, is typical. Pipeline processing architecture generator 118 may be viewed as a framework that:
Pipeline models may be used mainly to increase the throughput of a multiprocessing system architecture; the ideal throughput is the same as the input rate. To bring the stream processing application output up to the level of the input, the same execution sequence of stream processing functions is run; in other words, a "line" of tasks is run multiple times in parallel. The number of lines to run in parallel depends on the latency/processing time of the stream processing functions.
In one aspect, the following concepts may be associated with pipeline processing architecture generator 118:
In one aspect, pipeline processing architecture generator 118 facilitates building an inference pipeline from scratch, and/or facilitates building prototypes of use cases. In one aspect, pipeline processing architecture generator 118 may be equipped with a metrics collection system (e.g., statistics 216), and may collect memory and CPU usage at the task, stage, and line levels (e.g., using memory manager 214 and other modules). For an inference stage, pipeline processing architecture generator 118 may collect data transfer time and inference time. Pipeline processing architecture generator 118 may provide tools to analyze and identify the performance degradation points (e.g., bottlenecks) in the pipeline execution. Pipeline processing architecture generator 118 may also provide a mechanism to offload the execution of a step onto any generic or specialized hardware present on a host (e.g., a host computing system).
While building an app (e.g., application software or a multiprocessing system design architecture), two aspects may be taken into consideration: a composition of a pipeline and an execution of the pipeline.
Pipeline composition may be considered a building mode. After analyzing the application, a logical grouping of associated functionalities may be performed. Each such group is called a stage. Inter-step dependencies determine the possibility of parallel execution of the steps in the stage. The stages in a pipeline are always executed in sequential order. For example, input 102, preprocessing 106, inferencing 108, post-processing 112 (also referred to as "postprocessing 112"), and output 114 are stages. Operations such as resize, crop, shift, pad and quantize are steps. Based on their order in the directed acyclic graph (DAG), shift and pad steps can be executed in parallel. Stages 102 through 114 may be executed sequentially within a given pipeline instance.
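A minimal sketch of this composition follows, assuming hypothetical step functions and using Python threads only to illustrate that shift and pad may run concurrently while the stages themselves execute in order; it is not the execution machinery of pipeline processing architecture generator 118.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical step functions; real implementations would come from a
# function library or plugin module (e.g., function library 218).
def resize(frame): return frame
def crop(frame): return frame
def shift(frame): return frame
def pad(frame): return frame
def quantize(shifted, padded): return (shifted, padded)

def run_preprocessing(frame):
    """Steps within a stage follow their DAG: resize -> crop, then shift
    and pad (no mutual dependency) may run in parallel, and quantize
    consumes both results."""
    x = crop(resize(frame))
    with ThreadPoolExecutor(max_workers=2) as pool:
        f_shift = pool.submit(shift, x)
        f_pad = pool.submit(pad, x)
        return quantize(f_shift.result(), f_pad.result())

def run_line(frame, stages):
    """Stages are executed in sequential order within a pipeline line."""
    data = frame
    for stage_fn in stages:   # e.g. [read_input, run_preprocessing, infer, ...]
        data = stage_fn(data)
    return data
```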
In one aspect, pipeline processing architecture generator 118 may be deployed as a software application running on a host computing system. A selected multiprocessing system architecture may be deployed as pipeline processing architecture 120. Pipeline processing architecture 120 may be run on a target computing system that may include any combination of multi-core CPUs, multi-core GPUs, system memory, multi-core custom hardware accelerators, neural processing units (NPUs), deep learning accelerators (DLAs), FPGA-based custom accelerators, and one or more wired or wireless network connections. In one aspect, the target computing system may be implemented as a customized processing integrated circuit. The target computing system may include inferencing 108, which may further include machine learning or artificial intelligence components such as neural engines, convolutional neural networks, and neural network models.
In one aspect, stages 308 through 312 may correspond to input stage 102. Input video 308 may be generated by a video source such as a digital camera. Image signal processor 310 may relate to image processing functions implemented on the digital camera. The digital camera may also have a video codec stage 312. Stages 314, 316, 318, and 320 may be respectively similar to preprocessing 106, inferencing 108, postprocessing 112, and output 114.
Mapping options 306 provide different options that each stage can be mapped to (e.g., associated with, generated by, or executed on). Input video 308 may be generated by sensor 322. Sensor 322 may be an image sensor (e.g., a CCD sensor, a CMOS sensor, or some other type of image sensor) in a digital camera or digital imaging device. Image signal processor 310 may be mapped to any combination of an image signal processor (ISP) 324, a network 336, or a peripheral 346. When an input source (e.g., input video 308 or sensor 322) is via network 336, a network interface card in the system is configured to receive or retrieve the corresponding input source data. For example, a camera in an industrial installation, say at a door, will stream the video data via a protocol such as RTSP (real-time streaming protocol) over network 336. An application may subscribe to the RTSP stream in order to execute network stage 336. Similarly, a peripheral could be a USB camera that is connected to the USB port of the system. The application can invoke a USB camera driver and issue corresponding requests for video frames in order to execute peripheral stage 346. Video codec 312 may be mapped to any combination of a hardware accelerator 326, one or more GPUs 338, one or more CPUs 348, or software 356. Any combination of these blocks can provide video codec functionalities.
In one aspect, mapping options 306 map inferencing 316 to a customized integrated circuit Ara-1 330. Ara-1 330 may be implemented as a computing system with one or more multi-core CPUs, one or more GPU arrays, system memory, and a network connection. Ara-1 may be a target computing system configured to run a multiprocessing system architecture generated by pipeline processing architecture generator 118. Post-processing 318 may be mapped by mapping options 306 to any combination of a hardware accelerator 332, one or more GPUs 342, one or more CPUs 352, one or more DSPs 360, or software 366. Mapping options 306 can map output 320 to display 334 (e.g., a video display), a network 344, a storage 354 (e.g., a hard disk drive or a cloud storage server), or to software 362.
In one aspect, network 336 and network 344 may be implemented via software. Boundaries associated with allocated stages in workload migration 302 may be fluid based on selected components, component feature set, cost, power, scalability, etc.
If a stage/task "A" precedes a stage/task "B," stage/task "A" executes before stage/task "B." The types and formats of outputs of stage/task "A" should then match the types and formats of inputs of stage/task "B."
When building a pipeline, pipeline processing architecture generator 118 verifies the inputs and outputs of each stage and auto-stitches the stages, so that data can flow through the pipeline seamlessly.
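A simplified sketch of the kind of input/output compatibility check that auto-stitching implies follows; the Port structure and its fields are assumptions introduced for illustration, not the generator's actual data model.

```python
from typing import NamedTuple, Sequence

class Port(NamedTuple):
    dtype: str    # e.g. "uint8" or "float32"
    shape: tuple  # e.g. (1, 3, 224, 224)

def can_stitch(producer_outputs: Sequence[Port],
               consumer_inputs: Sequence[Port]) -> bool:
    """Return True if every output of the upstream stage/task matches the
    type and format expected by the downstream stage/task."""
    if len(producer_outputs) != len(consumer_inputs):
        return False
    return all(out == inp for out, inp in zip(producer_outputs, consumer_inputs))

# Example: a preprocessing stage emitting a float32 1x3x224x224 tensor can be
# stitched to an inference stage that expects exactly that tensor.
pre_out = [Port("float32", (1, 3, 224, 224))]
infer_in = [Port("float32", (1, 3, 224, 224))]
assert can_stitch(pre_out, infer_in)
```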
Task flow 500 can be representative of a set of tasks associated with a pipeline stage (e.g., preprocessing stage 106). Tasks that may be associated with preprocessing stage 106 may include tasks such as resize (task A, executed on a GPU), crop (task B, executed on a CPU), shift (task C, executed on a DSP), pad (task D), and quantize (task E). As shown, task flow 500 is represented as a graph.
In one aspect, pipeline line 1 input 712 starts at approximately the same time as inferencing 706 of line 0. Pipeline line 2 input 722 may start at approximately the same time as inferencing 716 of line 1. In one aspect, each of input 702, 712, and 722 may need 10 ms to complete; each of preprocessing 704, preprocessing 714, and preprocessing 724 may need 330 ms to complete; each of inferencing 706, 716, and 726 may need 200 ms to complete; each of postprocessing 708, 718, and 728 may need approximately 15 ms to complete; and each of output 710, 720, and 730 may need approximately 10 ms to complete.
Each pipeline line (i.e., line 0, line 1, and line 2) may need approximately 565 ms to complete (end-to-end latency). A corresponding throughput for each inferencing stage may be 1/330 ms, or approximately 3 inferences per second. A total of 3 units of memory may be consumed, and a total of 6 active threads may be used (one each for input, inferencing, postprocessing, and output, and two for preprocessing) for each pipeline, for a total of three pipelines. In one aspect, a latency between a start time of preprocessing 704 and an end time of preprocessing 724 may be approximately 1 second. All three pipelines may be run in parallel.
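The arithmetic behind these figures can be reproduced directly from the per-stage times given above; the dictionary and variable names below are illustrative and not part of the disclosure.

```python
# Per-stage completion times (ms) from the example above.
stage_ms = {"input": 10, "preprocessing": 330, "inferencing": 200,
            "postprocessing": 15, "output": 10}

# End-to-end latency of one line is the sum of its stage times.
latency_ms = sum(stage_ms.values())            # 10+330+200+15+10 = 565 ms

# Steady-state throughput is limited by the slowest stage interval.
bottleneck_ms = max(stage_ms.values())         # 330 ms (preprocessing)
throughput_ips = 1000.0 / bottleneck_ms        # ~3 inferences per second

print(latency_ms, round(throughput_ips, 2))    # 565  3.03
```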
For throughput enhancement, a 4-way parallel preprocessing stage 806 may be implemented. 4-way parallel preprocessing stage 806 may split up processing tasks between four parallel processors, resulting in a fourfold reduction in the per-frame processing interval. With this enhancement, the effective interval associated with 4-way parallel preprocessing stage 806 is 82.5 ms (down from 330 ms). However, the latency associated with the 4-way parallel preprocessing stage 806 may still be 330 ms. Inferencing 808 still takes 200 ms to complete, and is now the stage with the highest processing latency in this implementation.
For latency and throughput enhancement, a DSP-accelerated preprocessing stage 810 may be implemented. DSP-accelerated preprocessing stage 810 may use DSP acceleration to bring the completion time of preprocessing 802 down from 330 ms to 50 ms when implemented as DSP-accelerated preprocessing stage 810. Inferencing 812 still takes 200 ms to complete, and is now the stage with the highest processing latency in this implementation.
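Under the same reasoning, the throughput of each enhanced configuration is bounded by its slowest stage, which in both cases is the 200 ms inferencing stage. A short sketch of that derivation follows; the computed 5 IPS figures are estimates derived from the stated stage times, not performance claims from the disclosure.

```python
def bottleneck_throughput(stage_intervals_ms):
    """Throughput (inferences per second) is set by the slowest stage interval."""
    return 1000.0 / max(stage_intervals_ms.values())

# 4-way parallel preprocessing: effective interval 330/4 = 82.5 ms, so
# inferencing (200 ms) becomes the bottleneck.
four_way = {"input": 10, "preprocessing": 330 / 4, "inferencing": 200,
            "postprocessing": 15, "output": 10}

# DSP-accelerated preprocessing: 50 ms, inferencing (200 ms) again dominates.
dsp = {"input": 10, "preprocessing": 50, "inferencing": 200,
       "postprocessing": 15, "output": 10}

print(round(bottleneck_throughput(four_way), 1))  # 5.0 IPS
print(round(bottleneck_throughput(dsp), 1))       # 5.0 IPS
```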
In one aspect, the combination of 4-way parallel preprocessing stage 806 and inferencing stage 808 provides the following performance metrics:
In one aspect, the combination of DSP-accelerated preprocessing stage 810 and inferencing stage 812 provides the following performance metrics:
Both system architectures (4-way parallel preprocessing 806 with inferencing 808, and DSP-accelerated preprocessing 810 with inferencing 812) may be generated by pipeline processing architecture generator 118 and presented as implementation options to a user. The user can select whichever system design option best suits their needs, and can use the selected system design to build a target system for deployment.
A sequence for pipeline processing architecture generator 118 to generate one or more multiprocessor computing system architectures may include:
In the task flow 900, tasks A and B are executed sequentially (i.e., serially), while tasks C and D are executed concurrently (i.e., in parallel).
As depicted, DSL user interface 1100 allows a user to specify a use case (e.g., person tracking). An associated graph may be used to specify that detection is executed before tracking. Pipeline stages (input, preprocessing, inference, postprocessing, and remote output) may be specified by the user. DSL user interface 1100 may also allow the user to specify a 1024×768 camera as an input source.
Using DSL user interface 1100, a user may be able to specify a task flow graph for one or more stages, as required. As depicted, a user specifies a graph with the tasks in task flow 900 for the preprocessing stage. The user can specify a sequence of execution (A>B>C>E, B>D>E, consistent with task flow 900), a type of each function (e.g., task A is a resize function), and a processing platform for execution (e.g., task A is executed on a GPU). The remote (output) stage may be specified as a network that communicates using the RTSP protocol, at address 10.10.20.21:20000.
Pipeline processing architecture generator 118 may receive user inputs from DSL user interface 1100, and generate one or more multiprocessor system architectures based on the user inputs.
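The DSL syntax itself is not reproduced in this text. The following Python-flavored rendering is a hypothetical sketch of the information a user might enter through DSL user interface 1100; the function and device assignments for tasks B through E are illustrative assumptions.

```python
# Hypothetical, Python-flavored rendering of a DSL pipeline specification;
# the real DSL syntax may differ.
pipeline_spec = {
    "use_case": "person tracking",           # detection executes before tracking
    "stages": ["input", "preprocessing", "inference",
               "postprocessing", "remote_output"],
    "input": {"source": "camera", "resolution": (1024, 768)},
    "preprocessing": {
        # Task flow 900: A>B>C>E and B>D>E, with per-task devices.
        "tasks": {
            "A": {"fn": "resize",   "device": "GPU", "deps": []},
            "B": {"fn": "crop",     "device": "CPU", "deps": ["A"]},
            "C": {"fn": "shift",    "device": "DSP", "deps": ["B"]},
            "D": {"fn": "pad",      "device": "CPU", "deps": ["B"]},
            "E": {"fn": "quantize", "device": "CPU", "deps": ["C", "D"]},
        }
    },
    "remote_output": {"protocol": "RTSP", "address": "10.10.20.21:20000"},
}
```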
Outputs from pipeline 2 1204 and pipeline 3 1206 may be transmitted over network 1212 (e.g., a local area network, wide area network, the Internet, etc.) to an AI appliance. In one aspect, the AI appliance costs approximately $300. The AI appliance may receive inputs from multiple (e.g., 20) cameras and implement a shared computing load. The AI appliance may implement pipeline 4 1208 and pipeline 5 1210. Pipeline 4 1208 may perform association operations on the different visual data streams received from the multiple cameras. For example, if multiple cameras image a space (e.g., a section of a store), then the different cameras may capture images of a person in the store from different angles. Pipeline 4 1208 may perform an association operation to link together the identity of the person in the different video streams. Outputs from pipeline 4 1208 may be received by pipeline 5 1210. Pipeline 5 1210 may collectively track the identified person in all video streams.
In one aspect, a pipeline architecture to implement inference use case 1200 may be generated by pipeline processing architecture generator 118. Other possible system embodiments may include system architectures for processing data from an AI camera only, processing data from a camera without AI features (e.g., a “dumb” camera) using an AI appliance, and an AI camera and an AI appliance (as depicted in inference use case 1200).
In one aspect, pipeline 1 1312 executes a neural network application (NNApp) by gathering memory needs and latency of each node in an associated NNApp graph. The memory needs and latency may be determined by pipeline processing architecture generator 118 via one or more trial runs. Pipeline processing architecture generator 118 may automatically choose a system design configuration that most closely matches the design specification for the actual execution. For example, a system design configuration may be chosen to substantially maximize the throughput of the NNApp under user-defined system level resource constraints. Pipeline processing architecture generator 118 may include a constraint solver or may generate a system design configuration algorithmically/heuristically.
A constraint solver, when presented with a set of constraints expressed either as a mathematical formula or as a table of values for all the parameters in the expression, is able to apply several different techniques, such as machine learning techniques (e.g., gradient boosting, XGBoost), manually coded targeted algorithms, solvers (e.g., integer linear programming-based solvers or SAT solvers), or heuristics (practical methods for solving problems of this type that may be unintuitive and may not necessarily give the most optimal result). These techniques enable the constraint solver to determine one or more sets of parameter values that produce optimal results. Conversely, given a compute graph and associated constraints, pipeline processing architecture generator 118 may report suboptimal usage of resources.
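As a concrete and deliberately simple stand-in for such a solver, the exhaustive search below picks the largest number of parallel lines that fits a hypothetical memory and core budget; the per-line costs and stage intervals are assumed trial-run values, not outputs of statistics 216, and the function names are illustrative.

```python
# Hypothetical per-line resource costs and per-stage intervals from a trial run.
MEM_PER_LINE_MB = 512
CORES_PER_LINE = 2
STAGE_MS = {"preprocessing": 82.5, "inferencing": 200.0, "postprocessing": 15.0}

def throughput_ips(num_lines: int) -> float:
    # With num_lines identical lines, the effective bottleneck interval shrinks.
    return num_lines * 1000.0 / max(STAGE_MS.values())

def choose_optimal_lines(mem_budget_mb: int, core_budget: int) -> int:
    """Pick the line count that fits the user budget and maximizes throughput;
    a stand-in for an ILP/SAT or heuristic constraint solver."""
    best = 0
    for lines in range(1, 65):
        fits = (lines * MEM_PER_LINE_MB <= mem_budget_mb
                and lines * CORES_PER_LINE <= core_budget)
        if fits and throughput_ips(lines) > throughput_ips(best):
            best = lines
    return best

print(choose_optimal_lines(mem_budget_mb=4096, core_budget=8))  # -> 4
```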
As an example, pipeline processing architecture generator 118 may be presented with the following user budget:
The system design may include the following objectives:
Based on these inputs, pipeline processing architecture generator 118 may generate the following result:
The algorithm uses the system constraints and trial run statistics to determine the recommended resources needed to maximize the objectives:
Based on #optimal_lines, the firmware provides feedback to the user about the possible system performance degradation parameters (e.g., bottlenecks) from each section (stage), and suggestions on how to improve the throughput by adjusting the resource budget.
The algorithm to determine recommended resources is an improvement over the current state-of-the-art, as current solutions are based on manual back-of-the-envelope calculations and some raw estimates before deploying. If the app graph changes, a user needs to go through the manual process all over again, which could get complicated, laborious, and time-consuming.
Other aspects of system designs generated by pipeline processing architecture generator 118 may include systems for source rate control. Source rate control provides a methodology to automatically detect the maximum throughput an app (e.g., a pipeline processing architecture generated by pipeline processing architecture generator 118) can support for the given system constraints. The app can then dynamically control the input source rate (e.g., a camera) in case the maximum achievable throughput is unable to match the input rate. In doing so, unnecessary computation of dropped frames is reduced and/or there is less jitter in frames at the sink.
A similar methodology can be applied for multiple independent streams: each input stream can be dynamically rate controlled based on the pipeline statistics and user-provided system constraints. For example, in case the pipeline supports a maximum frame rate of 10 FPS and the input camera runs at 40 FPS, the algorithm can drop source frames at a 4:1 ratio to prevent, for example, unnecessary computation on frames that would otherwise be dropped later in the pipeline.
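A minimal sketch of such source rate control follows, assuming a simple keep-one-in-n policy; the function names and the generator-based stream are illustrative, and a real implementation might instead adjust the camera or RTSP source directly.

```python
def keep_every_nth(camera_fps: int, pipeline_max_fps: int) -> int:
    """Keep one frame out of every n so the delivered rate does not exceed
    what the pipeline can sustain (the other n - 1 are dropped at the source)."""
    if camera_fps <= pipeline_max_fps:
        return 1
    # Round up so the pipeline's sustainable rate is never exceeded.
    return -(-camera_fps // pipeline_max_fps)

def rate_controlled_stream(frames, camera_fps: int, pipeline_max_fps: int):
    n = keep_every_nth(camera_fps, pipeline_max_fps)
    for i, frame in enumerate(frames):
        if i % n == 0:      # e.g. 40 FPS camera, 10 FPS pipeline -> keep 1 in 4
            yield frame

# Example matching the text: 40 FPS camera, 10 FPS pipeline => 4:1 drop ratio.
kept = list(rate_controlled_stream(range(40), camera_fps=40, pipeline_max_fps=10))
print(len(kept))  # 10 frames forwarded per second of 40 FPS input
```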
In one aspect, execution flow diagram 1400 may represent an execution flow implemented on a processing architecture generated by pipeline processing architecture generator 118. Pipeline processing architecture generator 118 may use this processing architecture to simulate a performance of the processing architecture. Results from the simulation may be presented to a user for evaluation. The user may use these results to determine whether the processing architecture is suitable for the user's needs.
The simulation methodology for simulating a use case graph associated with a processing architecture may provide approximate latency numbers for each task using performance benchmarks of the candidate system-on-chip (SOC). The simulation may determine whether the system configuration is ideal or not in terms of throughput. Latency/throughput/compute requirements of the system can be simulated, and can be modified based on hardware constraints. The simulation process can be used as a methodology to help choose/compare SoC candidates for various end-to-end inference pipelines.
Execution flow diagram 1401 is an example of a methodology to determine system resource allocation across parallel independent streams, such that the system resource allocation provides the required system performance within the requisite system constraints. In contrast, current solutions are based on equal distribution for each independent stream or back-of-the-envelope calculations.
As an example, suppose a user's budget includes the following constraints:
The example above is representative of how pipeline processing architecture generator 118 can generate a multiprocessing system design that reduces system resource allocation.
The memory allocation table design presents a methodology to statically allocate memory for each graph node in each pipeline, so that the app does not need to search for or allocate memory during the actual execution and can round-robin across lines. This aspect may be managed by memory manager 214.
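A minimal sketch of what such a static allocation table might look like follows, assuming per-node output sizes obtained from a trial run; the sizes, node names, and structure are illustrative and do not reflect the actual memory allocation table.

```python
def build_allocation_table(node_sizes: dict, num_lines: int) -> dict:
    """node_sizes maps graph-node name -> output size in bytes (assumed to come
    from a trial run); the table holds one preallocated buffer per line per node,
    so no allocation happens during pipeline execution."""
    return {line: {node: bytearray(size) for node, size in node_sizes.items()}
            for line in range(num_lines)}

# Hypothetical node sizes for a 3-line pipeline.
node_sizes = {"resize": 224 * 224 * 3, "quantize": 224 * 224 * 3, "logits": 1000 * 4}
table = build_allocation_table(node_sizes, num_lines=3)

# At run time, line 1 simply looks up its buffer instead of allocating:
buf = table[1]["resize"]   # reused on every frame that line 1 processes
```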
In one aspect, pipeline processing architecture generator 118 may also perform memory colocation while generating a pipeline processing architecture design. Memory colocation allows reusing data across tasks and avoiding memory copy operations (memcpys) using colocation logic.
Requirement: In a multi-branch graph, a vertex node with more than one input edge would require the memory to be contiguous. The source nodes may produce the outputs in a scattered manner, where multiple memcpy operations would be needed to make the outputs contiguous. To address this problem, a colocation algorithm implemented by pipeline processing architecture generator 118 can parse through the graph, identify the graph patterns in the trial run, and reduce the number of memcpys required.
Example: In the case of multi-input model inferences, typical inference engines need input tensors to be in a particular order and contiguous. These tensors may be independently generated by their previous tasks in non-contiguous memory locations. In order to serve the inference, these tensors need to be arranged in the required order using memcpys.
Colocation: The colocation algorithm implemented by pipeline processing architecture generator 118 will understand this requirement in the trial run and automatically align the tensor output memories in the required order (and contiguously) in their respective tasks, without the need for extra memcpys. This saves memory bandwidth associated with the SOC.
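A simplified sketch of the colocation idea follows, assuming two producer tasks whose outputs must be contiguous for a multi-input inference; the buffer sizes and task names are hypothetical.

```python
# Instead of letting each producer task allocate its own tensor and then
# memcpy-ing the tensors into a contiguous block for the inference engine,
# one contiguous buffer is carved into views that the producers write into.
tensor_sizes = {"branch_a": 1024, "branch_b": 2048}   # bytes, in engine order

contiguous = bytearray(sum(tensor_sizes.values()))
views, offset = {}, 0
for name, size in tensor_sizes.items():
    views[name] = memoryview(contiguous)[offset:offset + size]
    offset += size

def branch_a_task(out: memoryview):
    out[:] = bytes(len(out))          # producer writes its output in place

def branch_b_task(out: memoryview):
    out[:] = bytes(len(out))

branch_a_task(views["branch_a"])
branch_b_task(views["branch_b"])

# 'contiguous' can now be handed to the inference engine with no extra memcpys.
```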
Pipeline processing architecture generator 118 may also perform a load balancing operation. Pipeline processing architecture generator 118 may implement a methodology to load-balance the inference load across the inference engines available in the system and automatically detect the number of parallel loads needed to pipeline the IO transfers. This methodology automatically detects an optimal number of IO slots to be placed inside the memory of the inference engines. Representative performance figures are presented below:
In one aspect, pipeline processing architecture generator 118 functions based on:
In one aspect, model 1602 receives one or more inputs that are input to stage 1604. The inputs are processed by task 1610. The outputs of task 1610 are input to and processed by task 1612. Outputs generated by task 1612 are outputs generated by stage 1604, and are received by task 1614 of stage 1606. Outputs generated by task 1614 are received by task 1616 and task 1618, and processed in parallel. Task 1616 and task 1618 generate separate outputs that may be processed by tasks 1620 and 1622, respectively. Outputs of tasks 1620 and 1622 may be processed by task 1624, which generates one or more outputs for model 1602. Model 1602 may also be associated with one or more library functions 1626. In one aspect, library functions 1626 are the actual implementations of functions such as 'Resize' described herein.
In one aspect, pipeline processing architecture generator 118 uses presentation model 1700 to generate execution model 1701 as presented in
Essentially, pipeline processing architecture generator 118 compiles the 5-stage pipeline depicted in
Pipeline processing architecture generator 118 may implement higher-order functionality/complex use cases by joining one or more pipelines, based on user-defined models. In one aspect, pipeline processing architecture generator 118 facilitates reusing the models and building complex use cases and higher-order functionality using the existing models. Pipelines may be joined in two ways: serial (depicted in
As depicted in schematic diagram 18B, presentation model 1812 includes input source 1 1814, input stage 1816, inferencing 1818, output 1820, and output sink 1822. Each of input stage 1816, inferencing 1818, and output 1820 may have a set of tasks, with each set being generically represented with tasks A through D.
This sequence of calls joins the models M1 1800 and M2 1812 in a serial fashion, where M1 1800 precedes M2 1812 (M1>>M2).
In the above serial join process, the following commands may be used:
Using these commands, pipeline processing architecture generator 118 joins models M1 1800 and M2 1812 serially, using the user-defined adapter functionality.
On a Join( ) call, the resultant pipeline will also have only three stages, the same as each of pipelines M1 1800 and M2 1812. The join call merges all step DAGs present in the stages between the input stage of M1 1800 and the output stage of M2 1812. While merging the DAGs, pipeline processing architecture generator 118 ensures that the order of dependency is maintained.
In one aspect, while performing a serial merge, dummy steps may be introduced. These dummy steps allow identification of the starting and ending of the DAG of each stage. If a DAG associated with a stage has multiple source nodes, a dummy node that precedes all the sources may be added. If a DAG associated with a stage has multiple leaf nodes, a dummy step succeeding all the leaf nodes may be added. This enables marking a beginning and an end of each stage.
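A minimal sketch of this dummy-node insertion over a stage DAG represented as a predecessor map follows; the node names and marker names are illustrative assumptions.

```python
def add_stage_markers(dag: dict) -> dict:
    """dag maps node -> list of predecessor nodes. Add a dummy start node
    before every source and a dummy end node after every leaf, so each stage
    has a single identifiable beginning and end when DAGs are merged."""
    nodes = set(dag)
    sources = [n for n in nodes if not dag[n]]
    leaves = [n for n in nodes if all(n not in preds for preds in dag.values())]

    marked = {n: list(preds) for n, preds in dag.items()}
    marked["__start__"] = []
    for s in sources:
        marked[s] = ["__start__"]
    marked["__end__"] = list(leaves)
    return marked

# A stage DAG with two sources (A, B) and two leaves (D, E).
stage_dag = {"A": [], "B": [], "C": ["A", "B"], "D": ["C"], "E": ["C"]}
print(add_stage_markers(stage_dag))
```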
A user can instruct pipeline processing architecture generator 118 to join two pipelines. This join operation receives outputs from all models (e.g., M1 1800 and M2 1812) as inputs and emits output(s) in the required form. Pipeline processing architecture generator 118 may implement a sink functionality (e.g., sink 2002) to join the models/pipelines in parallel. An example sequence of function calls to achieve a parallel join operation may be:
In the background, compiling each pipeline to a three-stage pipeline eases the process of joining the pipelines.
In one aspect, pipeline processing architecture generator 118 processes a join( ) call and merges the pipelines in a parallel manner. The resulting pipeline may have three stages. As the pipelines are merged in parallel, the resulting DAG in each stage of the parallel execution model executes the DAGs from the original models in parallel. In one aspect, pipeline processing architecture generator 118 joins the DAGs of the input and processing stages using a dummy node; however, a sink (e.g., sink 2002 or 2114) may be used to join the DAGs of the respective output stages.
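A toy sketch of a parallel join follows, assuming each model is a list of stage callables of equal length and a user-supplied sink merges the final outputs; it illustrates the control flow only, not the actual join( ) implementation.

```python
def parallel_join(models: list, sink):
    """Join pipelines in parallel: at each stage index, run every model's
    stage on its own data, then let a sink combine the final outputs."""
    num_stages = len(models[0])   # assumes all models have equal stage counts

    def joined(inputs):                     # one input per model
        data = list(inputs)
        for s in range(num_stages):
            data = [models[m][s](data[m]) for m in range(len(models))]
        return sink(data)                   # sink merges outputs of all models

    return joined

# Usage with two trivial 3-stage models and a sink that pairs their outputs.
m1 = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]
m2 = [lambda x: x * 10, lambda x: x + 5, lambda x: x // 2]
combined = parallel_join([m1, m2], sink=lambda outs: tuple(outs))
print(combined([1, 2]))   # -> (1, 12)
```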
Domain-specific language (DSL): A model building/specification language that can comprehensively specify the model functionality and pipeline. Pipeline processing architecture generator 118 may include a DSL compiler that generates the code for a given model specification written in a DSL format compatible with pipeline processing architecture generator 118.
User interface (UI): The pipeline processing architecture generator 118 framework provides a UI-based widget to compose pipelines. This framework provides drag-and-drop facilities for existing steps/stages and models from the gallery. Behind the scenes, the framework validates the combination and generates the model specification using the pipeline processing architecture generator DSL.
Memory Management: A memory manager (e.g., memory manager 214) associated with pipeline processing architecture generator 118 understands the memory requirement patterns of the pipeline and allocates the memory best suited for the pipeline execution. Memory manager 214 learns the memory allocation pattern during a trial phase of execution of the pipeline, and creates a map best suited for pipeline execution with possibly minimal memory use. Memory manager 214 then replicates the same memory allocation across all the execution lines of the pipeline. This mechanism helps memory manager 214 avoid the allocation/deallocation of memory during the pipeline execution.
Apart from the memory management tasks described above, memory manager 214 performs other tasks such as:
Apart from these memory manager tasks, memory manager 214 can also hint the model to decide on the maximum scale factor.
Statistics: In one aspect, pipeline processing architecture generator 118 may include statistics 216 that provides deeper insights into pipeline execution, regarding resource consumption and performance. Pipeline processing architecture generator 118 may receive the execution time and memory requirements at every level of pipeline execution (e.g., at a step level, a stage level and a line level). These statistics help pipeline processing architecture generator 118 determine a maximum number of lines in a pipeline execution.
Analysis & Debugging Tools: In one aspect, pipeline processing architecture generator 118 provides logs to debug any concurrent issues that may arise during pipeline execution. In addition, pipeline processing architecture generator 118 may also provide a pictorial view of a pipeline execution. This tool can help a user in understanding and conceptualizing any issues that arise during pipeline execution.
An algorithm implemented by pipeline processing architecture generator 118 may include a methodology to automatically detect a maximum throughput a system can achieve for a given set of system constraints. Pipeline processing architecture generator 118 may dynamically control the input source rate (e.g., a camera) in case an inference engine is unable to match the input rate. This reduces unnecessary computation of dropped frames, and/or provides better continuity in frames at the sink.
An algorithm implemented by pipeline processing architecture generator 118 may include a methodology to automatically detect the optimal resources needed to obtain maximum efficiency for an application under user-given system-level resource constraints. Pipeline processing architecture generator 118 may provide options to a user to reduce the resources used, if necessary.
Communication manager 2304 can be configured to manage communication protocols and associated communication with external peripheral devices as well as communication with other components in pipeline processing architecture generator 118. For example, communication manager 2304 may be responsible for generating and maintaining a communication interface between pipeline processing architecture generator 118 and pipeline processing architecture 120.
Memory 2306 is configured to store data associated with pipeline processing architecture generator 118. Memory 2306 may include both long-term memory and short-term memory. Memory 2306 may be comprised of any combination of hard disk drives, flash memory, random access memory, read-only memory, solid state drives, and other memory components.
System resource allocator 2308 may function to allocate one or more system resources (e.g., CPUs, GPUs, memory, etc.) for a pipeline processing architecture design. System resource allocator 2308 may perform resource allocation functions based on one or more system-specific constraints (e.g., processor utilization, memory allocation, etc.).
Statistics collection module 2310 may be similar to statistics 216, and may collect memory and CPU usage at the task, stage, and line levels associated with a pipeline. Data collected by statistics collection module 2310 comprises trial run statistics that are used in conjunction with system design constraints to determine the recommended resources needed to maximize the objectives. These statistics may also help pipeline processing architecture generator 118 determine a maximum number of lines in a pipeline execution.
Memory allocation module 2312 may be configured to perform memory allocation operations as described herein. Memory allocation module 2312 may generate a memory allocation table (e.g., memory allocation table 1500), allowing pipeline processing architecture generator 118 to statically allocate memory for each graph node in each pipeline.
Pipeline merging module 2314 may execute serial or parallel pipeline merging tasks. For example, pipeline M1 1800 and pipeline M2 1812 may be merged serially or in parallel by pipeline merging module 2314.
Processor 2316 is configured to perform functions associated with pipeline processing architecture generator 118. These functions may include generalized processing functions, arithmetic functions, and so on. Processor 2316 is configured to process information associated with the systems and methods described herein.
User interface 2318 allows a user to interact with aspects of the invention described herein. User interface 2318 may include any combination of user interface devices such as a keyboard, a mouse, a trackball, one or more visual display monitors, touch screens, incandescent lamps, LED lamps, audio speakers, buzzers, microphones, push buttons, toggle switches, and so on.
Source rate control module 2320 may be configured to perform operations related to source rate control, as described herein. Memory management module 2322 may be configured to perform memory management operations, such as to ensure that a multiprocessing system architecture is within any memory constraints specified by the multiprocessing system design specification.
Memory co-location module 2324 may be configured to perform memory colocation operations, as described herein. Data bus 2326 communicatively couples the different components of computing system 2302, and allows data and communication messages to be exchanged between these different components.
Method 2400 may include creating multiple presentation models (2404). For example, pipeline processing architecture generator 118 may create multiple presentation models that meet design criteria while satisfying system design constraints.
Method 2400 may include using an inference engine to provide an execution model for the multiple processing pipelines (2406). For example, the multiple processing pipelines designed by pipeline processing architecture generator 118 may be converted into an execution model and executed on inferencing 108.
Method 2500 may include creating multiple presentation models (2504). For example, pipeline processing architecture generator 118 may create multiple presentation models that meet design criteria while satisfying system design constraints.
Method 2500 may include using an inference engine supporting neural network processing to execute a neural network model (2506). For example, the multiple processing pipelines designed by pipeline processing architecture generator 118 may be converted into an execution model and executed on inferencing 108. Inferencing 108 may support neural network processing, and the pipelines may be executed as neural network models.
Although the present disclosure is described in terms of certain example embodiments, other embodiments will be apparent to those of ordinary skill in the art, given the benefit of this disclosure, including embodiments that do not provide all of the benefits and features set forth herein, which are also within the scope of this disclosure. It is to be understood that other embodiments may be utilized, without departing from the scope of the present disclosure.