This application claims priority to Chinese Patent Application No. 202111134551.3, filed Sep. 27, 2021, the entire contents of which is hereby incorporated by reference as if fully set forth herein.
Systems that process live video in a pipelined manner are latency-sensitive systems. The end-to-end processing time of a pipelined data element (e.g., a video frame), that is, the system's latency, should be kept low in order to maintain a high quality of user experience. Low latency is a challenging requirement given the amount of data that such systems have to process. This is especially so for systems that process the live feeds of multiple cameras, given the high demand for content at a high resolution and a high dynamic range. Typically, such systems process a video feed sequentially across a pipeline, with an overall processing time that is limited by the video frames' resolution, dynamic range, and rate. For example, Augmented Reality (AR) systems that support AR-based applications, in addition to handling multiple video streams in real time, have to employ computer vision and image processing algorithms of high complexity. For such AR systems, low latency is imperative; otherwise, the user's immersive experience will be compromised.
A more detailed understanding can be achieved through the following description, given by way of example in conjunction with the accompanying drawings.
The present disclosure describes systems and methods for reducing the latency of a real-time system that processes live data in a pipelined manner. For purposes of illustration only, features disclosed in the present disclosure are described with respect to an AR system and a camera enabled device. However, features disclosed herein are not so limited. The methods and systems described below are applicable to other latency-sensitive systems, such as systems related to Human Computer Interface (HCI) or Autonomous Vehicles (AV), that run on any computing device, such as a laptop, a tablet, or a wearable device.
An AR system and a camera enabled device are described to demonstrate the benefit of the low latency systems and methods disclosed herein. Camera enabled devices—such as Head Mounted Devices (HMDs) or handheld mobile devices—interface with AR systems to provide users with an immersive experience when interacting with device applications in gaming, aviation, and medicine, for example. Such immersive applications typically capture video data that cover a scene currently being viewed by the user and insert (or inlay), via an AR system, content (e.g., an enhancement such as a virtual object) into the video viewed by the user or onto an image plane of a see-through display. To facilitate the immersive experience, the content has to be inserted in a perspective that matches the perspective of the camera or the perspective in which the user views the scene via the see-through display. As the camera moves with the user's head movements (when attached to an HMD) or with the user's hand movements (when attached to a mobile device), the time between the capturing of a video and the insertion of an enhancement must be very short. Otherwise, the immersive experience will be compromised, as the enhancement will be inserted at a displaced location and perspective. In other words, the latency of the AR system should be low.
AR systems employ complex algorithms that process video data and metadata. These algorithms include, for example, computing a camera-model of the camera (or the camera's pose), a three-dimensional (3D) reconstruction of the scene captured by the camera (i.e., a real-world representation of the scene), detection and tracking of the user's gaze and of objects located at the scene, and mapping of a virtual (animated) object that is placed at the real-world representation of the scene onto a projection plane (either the camera's image plane or an image plane consistent with the user's gaze). For example, an AR system can process a video of a table in a room, as captured by the camera of a user's HMD. Through the operations described above, the AR system can insert into the video a virtual object placed in perspective on the top of the table. As the user (and the attached camera) moves, the AR system can continuously compute the camera-model, track the table, and update the perspective in which the virtual object is inserted into the video. As long as the AR system operates with sufficiently low latency, the virtual object's insertion will be updated frequently enough to allow an immersive experience as the user (and therefore the camera) moves.
The present disclosure describes a method of communication between system components, using a hybrid communication protocol. The hybrid communication protocol comprises the operations of processing a slice of a video by a first system component; sending a first message to a second system component, indicating that the processed slice is stored in a memory, wherein the sending of the first message comprises writing the first message, stored in an out-buffer of the first system component, into a mailbox of the second system component; and receiving a hardware interrupt issued by the second system component, indicating that the mailbox is released. In an alternative, the out-buffer is managed by a hardware message controller that controls the writing, by the first system component, of messages to the out-buffer via a direct memory access. The method further comprises receiving a second message from the second system component, indicating that a further processing of the processed slice is completed and that the further processed slice is stored in the memory, wherein the receiving of the second message is a one-way transaction that comprises reading the message from an in-buffer of the first system component and wherein the reading completes the transaction. In an alternative, the in-buffer is managed by the hardware message controller that controls the reading, by the first system component, of messages from the in-buffer via a direct memory access.
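The sketch below illustrates the two transaction types just described: the handshake used to send a message (write into the peer's mailbox, then wait for the hardware interrupt that signals the mailbox is released) and the one-way receive (a simple read from the in-buffer). It is offered only as a hedged illustration; the structure layout, memory-mapped resources, and helper names are assumptions and are not part of the disclosed hardware interface.

```c
/* Hedged sketch of the two transaction types. All names, layouts, and
 * registers here are hypothetical illustrations, not the actual interface. */
#include <stdint.h>
#include <stdbool.h>

typedef struct {
    uint32_t msg_type;       /* e.g., "processed slice ready" */
    uint32_t slice_index;    /* which slice of the current frame */
    uint64_t buffer_offset;  /* where the processed slice sits in shared memory */
} ipc_msg_t;

/* Hypothetical memory-mapped resources. */
volatile ipc_msg_t *peer_mailbox;      /* mailbox of the second system component */
volatile ipc_msg_t *local_in_buffer;   /* in-buffer filled by the hardware message
                                          controller via direct memory access    */
volatile bool mailbox_released;        /* set by the hardware-interrupt handler  */

/* Handshake send: copy the message staged in the out-buffer into the peer's
 * mailbox, then wait for the hardware interrupt indicating the mailbox is free. */
void send_slice_ready(const ipc_msg_t *out_buffer_msg)
{
    *peer_mailbox = *out_buffer_msg;
    mailbox_released = false;
    while (!mailbox_released)
        ;   /* the interrupt handler below sets the flag */
}

/* Interrupt handler invoked when the second component releases the mailbox. */
void mailbox_release_irq(void)
{
    mailbox_released = true;
}

/* One-way receive: reading the message that the hardware message controller
 * has already placed in the in-buffer completes the transaction; no
 * acknowledgment is sent back. */
ipc_msg_t receive_processing_done(void)
{
    ipc_msg_t msg = *local_in_buffer;
    return msg;
}
```

The point of the asymmetry is that the send is a two-way handshake (the sender waits for the mailbox to be released), while the receive costs only a single read, which keeps the per-slice messaging overhead low.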
The present disclosure further discloses a first system component that comprises at least one processor and memory storing instructions. The instructions, when executed by the at least one processor, cause the first system component to: process a slice of a video; send a first message to a second system component, indicating that the processed slice is stored in a memory, wherein the sending of the first message comprises writing the first message, stored in an out-buffer of the first system component, into a mailbox of the second system component; and receive a hardware interrupt issued by the second system component, indicating that the mailbox is released. In an alternative, the out-buffer is managed by a hardware message controller that controls the writing, by the first system component, of messages to the out-buffer via a direct memory access. The instructions further cause the first system component to receive a second message from the second system component, indicating that a further processing of the processed slice is completed and that the further processed slice is stored in the memory, wherein the receiving of the second message is a one-way transaction that comprises reading the message from an in-buffer of the first system component and wherein the reading completes the transaction. In an alternative, the in-buffer is managed by the hardware message controller that controls the reading, by the first system component, of messages from the in-buffer via a direct memory access.
Furthermore, the present disclosure discloses a non-transitory computer-readable medium comprising instructions executable by at least one processor to perform a method. The method comprises: processing a slice of a video by a first system component; sending a first message to a second system component, indicating that the processed slice is stored in a memory, wherein the sending of the first message comprises writing the first message, stored in an out-buffer of the first system component, into a mailbox of the second system component; and receiving a hardware interrupt issued by the second system component, indicating that the mailbox is released. The method further comprises receiving a second message from the second system component, indicating that a further processing of the processed slice is completed and that the further processed slice is stored in the memory, wherein the receiving of the second message is a one-way transaction that comprises reading the message from an in-buffer of the first system component and wherein the reading completes the transaction.
In various alternatives, the processor 102 includes a central processing unit (CPU), a graphics processing unit (GPU), a CPU and GPU located on the same die, or one or more processor cores, wherein each processor core can be a CPU or a GPU. In various alternatives, the memory 104 is located on the same die as the processor 102 or is located separately from the processor 102. The memory 104 includes a volatile or non-volatile memory, for example, random access memory (RAM), dynamic RAM, or a cache.
The storage 106 includes a fixed or removable storage, for example, a hard disk drive, a solid-state drive, an optical disk, or a flash drive. The input devices 108 include, without limitation, a keyboard, a keypad, a touch screen, a touch pad, a detector, a microphone, an accelerometer, a gyroscope, a biometric scanner, or a network connection (e.g., a wireless local area network card for transmission and/or reception of wireless IEEE 802 signals). The output devices 110 include, without limitation, a display, a speaker, a printer, a haptic feedback device, one or more lights, an antenna, or a network connection (e.g., a wireless local area network card for transmission and/or reception of wireless IEEE 802 signals).
The input driver 112 communicates with the processor 102 and the input devices 108, and permits the processor 102 to receive input from the input devices 108. The output driver 114 communicates with the processor 102 and the output devices 110, and permits the processor 102 to send output to the output devices 110. It is noted that the input driver 112 and the output driver 114 are optional components, and that the device 100 will operate in the same manner if the input driver 112 and the output driver 114 are not present. The output driver 114 includes an accelerated processing unit (“APU”) 116 which is coupled to a display device 118. The APU 116 accepts compute commands and graphics rendering commands from the processor 102, processes those compute and graphics rendering commands, and provides pixel output to the display device 118 for display. The APU 116 can include one or more parallel processing units to perform computations in accordance with a single-instruction-multiple-data (“SIMD”) paradigm. Thus, although various functionality is described herein as being performed by or in conjunction with the APU 116, in various alternatives, the functionality described as being performed by the APU 116 is additionally or alternatively performed by other computing devices having similar capabilities that are not driven by a host processor (e.g., processor 102) and provide graphical output to a display device 118. For example, it is contemplated that any processing system that performs processing tasks in accordance with a SIMD paradigm can perform the functionality described herein. Alternatively, it is contemplated that computing systems that do not perform processing tasks in accordance with a SIMD paradigm can also perform the functionality described herein.
The communication interface 255 (e.g., PCIe) can facilitate video data and control data communication with a camera enabled device. The video CODEC 250 (e.g., AMD VCN) can encode processed video that is streamed back to the camera enabled device, to storage, or out to other destinations. The video CODEC 250 can also decode video received from the camera enabled device or other sources. Alternatively, the video received from the camera enabled device can be decoded by the CVIP 240, as described below.
In an alternative, the CVIP 240 is an SoC designed to employ computer vision algorithms to support AR-based applications that run on a camera enabled device. The CVIP 240 can contain a co-processor (e.g., an ARM A55 core cluster) and Digital Signal Processors (DSPs) that can interconnect via an internal bus. For example, one DSP can be used to perform machine learning (e.g., Convolutional Neural Network (CNN)) algorithms and another DSP can be used to perform other pattern recognition and tracking algorithms. The CVIP 240 can run a secondary operating system concurrently with the primary operating system that runs on the CPU 260. For example, a multi-core Linux based OS can be used. The secondary operating system can contain various CVIP software drivers that manage, e.g., memory, DSP control, DMA engine control, clock and power control, IPC messages, secure PCIe handling, system cache control, fusion controller hub (FCH) support, a watchdog timer, and CoreSight debugging/tracing handling.
The ISP 245 processes decoded video data and corresponding metadata captured by the camera enabled device. In an alternative, the captured video data and corresponding metadata are received by the CVIP 240 via the communication interface 255 (e.g., PCIe), decoded by the CVIP, and then provided to the ISP 245 for processing. Based on the processing of the decoded video and/or corresponding metadata, the ISP can determine camera controls, such as lens exposure, white balancing, and focus, and send these controls to the CVIP to be relayed to the camera enabled device. For example, the ISP can determine white balancing parameters based on a comparison of the video frames' color histograms, or determine lens exposure or focus parameters based on blur detection applied to the video frames. ISP software can provide an abstraction of the camera sub-system to the primary operating system, thereby bridging the command/data path between applications running on the primary operating system and the ISP. The ISP software, in its user space, can be responsible for implementing the framework APIs required by the OS to collect configurations of streams and pipelines and to arbitrate capture requests to an ISP driver in its kernel space. The ISP's kernel driver can manage the power and clocks of the ISP through, e.g., a standard graphics service interface. The ISP's kernel driver can also transfer capture requests and collect the results from the ISP's firmware through ring buffer communication. Off-chip camera devices can also be controlled by the ISP's kernel driver for streaming at the desired resolution and frame rate, as inputs to the ISP pipeline.
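As a hedged illustration of how white-balancing parameters could be derived from frame color statistics, the sketch below uses the classic gray-world assumption; this assumption, the function name, and the control structure are illustrative only and are not asserted to be the ISP's actual algorithm.

```c
/* Hedged sketch: derive per-channel white-balance gains from a decoded RGB
 * frame using the gray-world assumption (the average of each color channel
 * should be roughly equal). The control structure is illustrative only. */
#include <stdint.h>
#include <stddef.h>

typedef struct {
    float gain_r, gain_g, gain_b;   /* per-channel white-balance gains */
} wb_controls_t;

wb_controls_t estimate_white_balance(const uint8_t *rgb, size_t pixel_count)
{
    wb_controls_t unity = { 1.0f, 1.0f, 1.0f };
    if (pixel_count == 0)
        return unity;

    double sum_r = 0.0, sum_g = 0.0, sum_b = 0.0;
    for (size_t i = 0; i < pixel_count; ++i) {
        sum_r += rgb[3 * i + 0];
        sum_g += rgb[3 * i + 1];
        sum_b += rgb[3 * i + 2];
    }
    /* Small offset guards against all-black channels. */
    double avg_r = sum_r / pixel_count + 1e-6;
    double avg_g = sum_g / pixel_count + 1e-6;
    double avg_b = sum_b / pixel_count + 1e-6;

    /* Scale red and blue so their averages match green, which keeps unit gain. */
    wb_controls_t c = {
        .gain_r = (float)(avg_g / avg_r),
        .gain_g = 1.0f,
        .gain_b = (float)(avg_g / avg_b),
    };
    return c;
}
```

Exposure and focus parameters could similarly be derived from per-frame brightness or sharpness statistics, with the resulting controls relayed through the CVIP to the camera enabled device.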
The camera enabled device 360 can represent a mobile device or an HMD. Both can be applied to provide an immersive experience based on an augmented reality presentation to a user of the device. For example, when the camera enabled device 360 is an HMD, the device can include one or more (vision and/or infrared) cameras attached to it. Some cameras 375 can be facing toward the scene viewed by the user, while some cameras 375 can be facing toward the user's eyes. The former are instrumental for scene recognition and tracking, while the latter are instrumental for tracking the user's gaze. Further, an HMD can include a head mounted display 385, such as a see-through display. Employing the processes provided by the AR system 310, the user will be able to see through the display 385 a virtual object placed at the scene at a perspective that matches the user's viewing perspective. In addition to the video data captured by the cameras 375, the camera enabled device can capture sensory data using one or more sensors 380. For example, sensors 380 can be attached to the device or to the user's body and provide real time localization measurements or inertial measurements. Localization and inertial measurements can be used by various pattern recognition algorithms, employable by the AR system 310, for example, to analyze the user's behavior or preferences and, accordingly, to present the user with targeted virtual objects or data. The device's encoder 370 can encode the video data captured by the one or more cameras 375 and embed the data captured by the one or more sensors 380 as metadata. Thus, the encoded video and the corresponding metadata can be sent to the AR system for processing, the result of which will be received at the user's display 385. As explained before, to maintain an immersive experience, the end-to-end processing time—the time elapsed from video data capturing, through processing at the AR system 310, to delivering the virtual augmentation at the user's display—should be sufficiently short.
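A minimal sketch of how encoded video and sensor data might be packaged together is shown below; the field names, units, and layout are assumptions made for illustration and do not describe the device's actual bitstream or metadata format.

```c
/* Illustrative layout only: one possible way the encoder 370 could pair a
 * compressed frame with the sensor measurements captured alongside it.
 * Field names and units are assumptions, not taken from the disclosure. */
#include <stdint.h>

typedef struct {
    uint64_t timestamp_ns;   /* capture time of the measurements            */
    float    accel[3];       /* inertial measurements: acceleration (m/s^2) */
    float    gyro[3];        /* inertial measurements: angular rate (rad/s) */
    float    position[3];    /* real-time localization estimate             */
    float    gaze_dir[3];    /* gaze direction from the eye-facing cameras  */
} sensor_metadata_t;

typedef struct {
    uint32_t frame_id;
    uint32_t camera_id;         /* which of the device's cameras produced the frame */
    uint32_t encoded_size;      /* size of the compressed bitstream, in bytes       */
    sensor_metadata_t metadata; /* sensor data embedded as frame metadata           */
    /* ...followed by encoded_size bytes of compressed video data                   */
} encoded_frame_packet_t;
```

Pairing the measurements with the frame they were captured alongside lets the AR system associate each decoded frame with the device's pose and gaze at capture time.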
In an alternative, the AR system 310 receives the encoded video and corresponding metadata from the device 360 at the CVIP 240 via the communication interface 255. The encoded video and corresponding metadata are first decoded, typically frame by frame, by the decoder 320. Once the decoder 320 completes the decoding of a current frame and corresponding metadata, it can write the decoded data to the memory 230 via the data fabric 220 bus. Then, the decoder 320 can send a message to the image processor 325, via the SMN 210, informing the image processor that the decoded data of the current frame are ready. The decoder 320 can also inform the tracker 330 that the decoded data of the current frame are ready. For example, the decoder 320 can make available to the tracker 330 decoded data of a gray scale (monochrome) frame image and can make available to the image processor 325 decoded data of a color (RGB) frame image.
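The sketch below illustrates this fan-out of "data ready" notifications; the message layout and the smn_send() helper are hypothetical stand-ins for whatever messaging mechanism the SMN 210 actually provides.

```c
/* Hypothetical sketch of the decoder's "decoded data ready" notifications:
 * the monochrome plane is advertised to the tracker and the color image to
 * the image processor, each identified by its location in shared memory. */
#include <stdint.h>

enum component_id { IMAGE_PROCESSOR_325, TRACKER_330 };

typedef struct {
    uint32_t frame_id;
    uint64_t buffer_offset;   /* where the decoded plane was written in memory */
    uint32_t width, height;
} decoded_ready_msg_t;

/* Stand-in for delivery over the SMN 210. */
static void smn_send(enum component_id dest, const decoded_ready_msg_t *msg)
{
    (void)dest;
    (void)msg;   /* in a real system, this would write into the destination's
                    mailbox or in-buffer */
}

void notify_decoded_frame(uint32_t frame_id,
                          uint64_t gray_offset, uint64_t rgb_offset,
                          uint32_t width, uint32_t height)
{
    decoded_ready_msg_t gray = { frame_id, gray_offset, width, height };
    decoded_ready_msg_t rgb  = { frame_id, rgb_offset,  width, height };
    smn_send(TRACKER_330, &gray);          /* monochrome image for tracking    */
    smn_send(IMAGE_PROCESSOR_325, &rgb);   /* color image for image processing */
}
```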
Upon receiving a message, via the SMN 210, that decoded data of the current frame are ready, the image processor 325 can read the decoded data from the memory 230, via the data fabric 220, and process them. For example, the image processor 325 can process decoded data containing color images to evaluate the need to adjust the cameras' lens exposure, focus, or white balancing, and, accordingly, can determine camera controls. Upon completion of the processing of the decoded data, the image processor 325 can write the processed decoded data into the memory 230, via the data fabric 220, and can then send a message to the tracker 330 to inform the tracker that the processed decoded data of the current frame are ready. The image processor 325 can write the camera controls determined by the image processor directly into a mailbox of the CVIP 240. The tracker 330 can then proceed to read the data processed by the image processor 325, and the CVIP 240 can immediately send the camera controls to the camera enabled device 360 via the communication interface 255.
The tracker 330 can apply algorithms that detect, recognize, and track objects at the scene based on the data captured by the cameras 375 and by the sensors 380. For example, the tracker 330 can compute a camera-model for a camera 375. Typically, a camera-model defines the camera's location and orientation (i.e., the camera's pose) and other parameters of the camera, such as zoom and focus. A camera-model of a camera can be used to project the 3D scene onto the camera's projection plane. A camera-model can also be used to correlate a pixel in the video image with its corresponding 3D location at the scene. Hence, the camera-model has to be continuously updated as the camera's location, orientation, or other parameters change. The tracker can also use machine learning algorithms to recognize objects at the scene and then, e.g., based on the camera-model, compute their pose relative to the scene. For example, the tracker can recognize an object (a table) at the scene and compute its surface location relative to a 3D representation of the scene. The tracker can then generate a 3D representation of a virtual object (a cup of tea) to be inserted at a location and an orientation relative to the table (e.g., on top of the table). At its output, the tracker can provide to the renderer 335 the virtual object representation, the location and orientation in which the virtual object is to be inserted, as well as the projection plane (the projection plane of the camera or the projection plane that is aligned with the see-through display) onto which the virtual object is to be projected—namely, enhancement data. For example, the tracker can send a message to the renderer 335 to inform the renderer that enhancement data with respect to the current frame are ready in the memory to be retrieved. Upon receiving such a message, the renderer can use the enhancement data to render the virtual object onto the projection plane at the given location and orientation.
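To make the role of the camera-model concrete, the sketch below shows the textbook pinhole projection of a 3D world point onto the camera's image plane. This is standard computer-vision mathematics offered only as an illustration under assumed names; it is not asserted to be the tracker 330's actual implementation.

```c
/* Sketch of the standard pinhole camera model: a world point is transformed
 * by the camera's pose (rotation R, translation t) and projected with focal
 * lengths fx, fy and principal point cx, cy. Illustration only. */
#include <stdbool.h>

typedef struct { double x, y, z; } vec3_t;

typedef struct {
    double R[3][3];          /* world-to-camera rotation */
    vec3_t t;                /* world-to-camera translation */
    double fx, fy, cx, cy;   /* intrinsics: focal lengths and principal point */
} camera_model_t;

/* Projects a 3D world point onto the camera's image plane.
 * Returns false if the point lies behind the camera. */
bool project_point(const camera_model_t *cam, vec3_t p_world,
                   double *u, double *v)
{
    /* p_cam = R * p_world + t */
    vec3_t p = {
        cam->R[0][0]*p_world.x + cam->R[0][1]*p_world.y + cam->R[0][2]*p_world.z + cam->t.x,
        cam->R[1][0]*p_world.x + cam->R[1][1]*p_world.y + cam->R[1][2]*p_world.z + cam->t.y,
        cam->R[2][0]*p_world.x + cam->R[2][1]*p_world.y + cam->R[2][2]*p_world.z + cam->t.z,
    };
    if (p.z <= 0.0)
        return false;
    *u = cam->fx * p.x / p.z + cam->cx;
    *v = cam->fy * p.y / p.z + cam->cy;
    return true;
}
```

With such a model, the tracker can, for example, project the anchor point of the virtual cup of tea (placed relative to the tracked table) to the pixel location at which the renderer should draw it.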
As mentioned above, since the camera moves with the movements of the user it is attached to, the tracker has to update its calculation of the camera-model (or the camera's projection plane) continuously. Moreover, if the virtual object is to be mapped onto an image plane (or a projection plane) that is consistent with the see-through display, that projection plane should be updated continuously as the user moves or changes her gaze. These updates should be at a high rate to allow for immersive content enhancement (accurate virtual object insertion). For example, by the time the renderer 335 has completed the rendering of the virtual object and made it available in memory to the projector 340, small changes in the camera's pose or the see-through display's pose may call for an update of the virtual object's projection as rendered by the renderer 335. Thus, upon receiving a message from the renderer 335 that the rendered image of the virtual object is ready to be fetched from the memory 230 and a message from the tracker 330 that an update of the projection plane is also available, the projector 340 can read the rendered image and the updated projection plane, and can re-project the rendered image based on the updated projection plane. The image of the re-projected virtual object is then saved into the memory 230, and a message is issued to the display driver 345 informing it that the image of the re-projected virtual object is ready in memory to be fetched and delivered via the communication interface 255 to the display 385 of the device. When presented on a see-through display 385, for example, the result allows the user of the device to see the virtual object at the scene being viewed as if the object were indeed present at the scene (as if a cup of tea were indeed present on the top of the table).
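One widely used way to perform such a late re-projection under a small pose change is a rotation-only correction of the rendered image: back-project each pixel to a ray, rotate the ray by the incremental rotation, and project it again with the same intrinsics. The sketch below illustrates that general technique; the disclosure does not specify the projector 340's actual method, and all names here are assumptions.

```c
/* Sketch of rotation-only re-projection: each rendered pixel is re-mapped
 * through the homography K * R_delta * K^-1. General technique shown for
 * illustration only; assumes the rotated ray stays in front of the camera. */
typedef struct { double fx, fy, cx, cy; } intrinsics_t;

/* Re-maps a pixel (u, v) rendered under the old pose to its position under
 * the new pose, given the incremental rotation R_delta (old camera frame to
 * new camera frame). */
void reproject_pixel(const intrinsics_t *K, const double R_delta[3][3],
                     double u, double v, double *u_new, double *v_new)
{
    /* Back-project to a ray in the old camera frame (unit depth). */
    double x = (u - K->cx) / K->fx;
    double y = (v - K->cy) / K->fy;
    double z = 1.0;

    /* Rotate the ray into the new camera frame. */
    double xr = R_delta[0][0]*x + R_delta[0][1]*y + R_delta[0][2]*z;
    double yr = R_delta[1][0]*x + R_delta[1][1]*y + R_delta[1][2]*z;
    double zr = R_delta[2][0]*x + R_delta[2][1]*y + R_delta[2][2]*z;

    /* Re-project with the same intrinsics. */
    *u_new = K->fx * xr / zr + K->cx;
    *v_new = K->fy * yr / zr + K->cy;
}
```

Because only the incremental rotation is applied, a correction of this kind is cheap enough to run just before display, which is what allows the projected virtual object to track the latest pose update.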
In an alternative, the enhancement described above, as viewed by a user of the see-through display 385, can be viewed by another user who does not view the scene via the see-through display (i.e., a third-eye feature), by means of fusing 350 the image of the current frame, provided by the image processor 325, with the image of the re-projected virtual object, provided by the projector 340. Thus, prompted by a message from the image processor 325 that a processed current frame is ready in the memory 230 and a message from the projector 340 that the corresponding image of a re-projected virtual object is also ready, the fuser 350 will fetch the processed current frame and the corresponding image of the re-projected virtual object and fuse these data into one output frame. The output frame is then saved in the memory 230, to be fetched and encoded by the encoder 355 upon receiving a message from the fuser 350. The encoded fused frame can then be available to be stored or viewed by another user.
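A minimal sketch of one way such a fusion could be carried out is per-pixel alpha compositing of the re-projected virtual object over the processed camera frame. The interleaved RGB/RGBA buffer layout and the function name are assumptions; the disclosure does not specify the fuser 350's actual blending method.

```c
/* Hedged sketch: fuse the processed camera frame with the re-projected
 * virtual object image by per-pixel alpha compositing. Buffer layout
 * (interleaved RGB and RGBA) is an illustrative assumption. */
#include <stdint.h>
#include <stddef.h>

void fuse_frames(const uint8_t *camera_rgb,    /* processed current frame      */
                 const uint8_t *virtual_rgba,  /* re-projected virtual object
                                                  with per-pixel alpha         */
                 uint8_t *out_rgb,             /* fused output frame           */
                 size_t pixel_count)
{
    for (size_t i = 0; i < pixel_count; ++i) {
        uint32_t a = virtual_rgba[4 * i + 3];          /* 0..255 coverage */
        for (int c = 0; c < 3; ++c) {
            uint32_t v = virtual_rgba[4 * i + c];
            uint32_t b = camera_rgb[3 * i + c];
            out_rgb[3 * i + c] = (uint8_t)((v * a + b * (255 - a)) / 255);
        }
    }
}
```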
The AR system described above can process video data at a slice level rather than at a frame level; that is, each video frame can be partitioned into slices, and a system component can begin processing a slice, and hand it off to the next component, as soon as that slice (rather than the entire frame) is available.
Processing data at a slice level, as described above, reduces the latency of the systems 200, 300; however, more time has to be spent on communication among the system components. For example, CVIP software and ISP software and firmware run on different processors; hence, using conventional software-to-software communication at the slice level can erode the latency gain achieved by moving from frame-based processing to slice-based processing, as explained above. For example, communication between the CVIP and the ISP is bi-directional, and so can be initiated each time the CVIP informs the ISP, or each time the ISP informs the CVIP, about the completion of a slice's processing. The higher the number of slices, the higher the number of communications between the CVIP and the ISP. To reduce the time the system spends on communication among its components, a hybrid communication protocol is disclosed herein.
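As a rough, hedged illustration of this trade-off (the assumption of S equal pipeline stages, each taking time T per frame and about T/K per slice when a frame is split into K slices, is made only for exposition and is not taken from the disclosure):

```latex
L_{\text{frame}} = S\,T,
\qquad
L_{\text{slice}} \approx T + (S - 1)\,\frac{T}{K},
\qquad
\text{messages per frame} \approx S\,K \;\;(\text{instead of about } S).
```

As the number of slices K grows, the end-to-end latency approaches the time of a single stage, but the number of inter-component messages grows roughly in proportion, which is the overhead the hybrid communication protocol is intended to keep small.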
Hence, the hybrid communication protocol is much faster than the standard software-based communication protocol and, therefore, most suitable for transactions at the firmware and hardware layers. For example, a typical transaction between two system components using the software-based communication protocol (described above) has to traverse each component's software messaging stack, and therefore takes considerably longer than the same transaction carried out with the hybrid communication protocol at the hardware and firmware layers.
In an alternative, a software-based communication protocol can be used at the software layer to send less frequent messages, such as stream-level communications, for example, camera properties (e.g., image size or frame rate). However, communications at the hardware layer and the firmware layer that are required to be issued multiple times per frame and per slice can, advantageously, use the hybrid communication protocol described herein. For example, at the hardware layer, messages can be issued that indicate the readiness of input data to the ISP or output data from the ISP, including the slice buffer offset. In another example, at the firmware layer, messages can be issued with frame-level information, such as frame attributes, frame buffer allocation, or frame id.
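The following sketch summarizes this layering as a simple routing rule; the enumeration values and the routing function are assumptions introduced only to illustrate the partitioning described above.

```c
/* Illustrative partitioning of message classes across the two protocols.
 * The enum values and routing function are assumptions for exposition only. */
typedef enum {
    MSG_STREAM_CAMERA_PROPERTY,   /* stream level: image size, frame rate      */
    MSG_FRAME_ATTRIBUTES,         /* frame level: attributes, buffer, frame id */
    MSG_SLICE_INPUT_READY,        /* slice level: input data ready, offset     */
    MSG_SLICE_OUTPUT_READY        /* slice level: output data ready, offset    */
} msg_class_t;

typedef enum { SOFTWARE_PROTOCOL, HYBRID_PROTOCOL } transport_t;

/* Infrequent stream-level messages can tolerate the software path; anything
 * issued several times per frame or per slice goes over the hybrid protocol. */
transport_t choose_transport(msg_class_t cls)
{
    return (cls == MSG_STREAM_CAMERA_PROPERTY) ? SOFTWARE_PROTOCOL
                                               : HYBRID_PROTOCOL;
}
```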
It should be understood that many variations are possible based on the disclosure herein. Although features and elements are described above in particular combinations, each feature or element can be used alone without the other features and elements or in various combinations with or without other features and elements.
The various functional units illustrated in the figures and/or described herein (including, but not limited to, the processor 102, the input driver 112, the input devices 108, the output driver 114, the output devices 110, the APU 116) may be implemented as a general purpose computer, a processor, or a processor core, or as a program, software, or firmware, stored in a non-transitory computer readable medium or in another medium, executable by a general purpose computer, a processor, or a processor core. The methods provided can be implemented in a general purpose computer, a processor, or a processor core. Suitable processors include, by way of example, a general purpose processor, a special purpose processor, a conventional processor, a digital signal processor (DSP), a plurality of microprocessors, one or more microprocessors in association with a DSP core, a controller, a microcontroller, Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs) circuits, any other type of integrated circuit (IC), and/or a state machine. Such processors can be manufactured by configuring a manufacturing process using the results of processed hardware description language (HDL) instructions and other intermediary data including netlists (such instructions capable of being stored on a computer readable media). The results of such processing can be maskworks that are then used in a semiconductor manufacturing process to manufacture a processor which implements features of the disclosure.
The methods or flow charts provided herein can be implemented in a computer program, software, or firmware incorporated in a non-transitory computer-readable storage medium for execution by a general purpose computer or a processor. Examples of non-transitory computer-readable storage mediums include a read only memory (ROM), a random access memory (RAM), a register, cache memory, semiconductor memory devices, magnetic media such as internal hard disks and removable disks, magneto-optical media, and optical media such as CD-ROM disks, and digital versatile disks (DVDs).