The present invention relates to memory access and in particular to an improved direct memory access (DMA) technique. Also disclosed is a specific use of the DMA of the present invention as applied to video data processing.
In typical computer-based applications, data transfers through computer input/output (I/O) devices must often be performed at high speeds, in large blocks, or both. Three conventional data transfer mechanisms for computer I/O are polling, interrupts (also known as programmed I/O), and direct memory access (DMA). Polling is a technique in which the central processing unit (CPU, data processor, etc.) is dedicated to acquiring the incoming data. The processor issues an I/O instruction and polls the progress of the I/O in a loop.
Interrupt-driven (programmed) I/O involves the processor issuing the I/O instruction without having to poll for completion of the I/O operation. An interrupt is asserted when the operation completes, causing the processor to branch to an appropriate interrupt handler to process the completed I/O.
With DMA, a dedicated device referred to as a DMA controller reads incoming data from a device and stores that data in a system memory buffer for later retrieval by the processor. Conversely, the DMA controller writes data stored in the system memory buffer to a device. A typical DMA transfer (e.g., a read operation) sequence involves the following: the processor sets up the DMA controller with the address and length of a memory buffer; the DMA controller transfers the data between the device and the buffer; the DMA controller signals completion, typically by interrupting the processor; and the processor then sets up the DMA controller for the next transfer.
Video processing systems have greatly increased the throughput requirements placed on a processor. Parallel processor architectures are increasingly used to serve the demands of real-time video by processing video streams in parallel fashion. A typical video operation is the streaming of video from memory to an output device, for example a video display unit. Here, large amounts of data must be transferred out of memory to the screen. In addition, this data transfer must be of sufficient bandwidth to avoid visual artifacts. Meanwhile, because memory is limited, new video data must be loaded into memory at the same time. This involves switching between loading video data into memory and setting up for the next DMA transfer, placing a heavy burden on the video processing unit(s). The problem is amplified if some kind of processing of the video is desired prior to outputting it to a display.
It is therefore desirable to be able to move data on and off RAM with even less burden on the processors than is possible with conventional DMA techniques. Video data processing systems would benefit from such improvements, and data processing systems in general can realize substantial gains from them as well.
A DMA transfer method according to the present invention includes a data processing block initiating a first DMA transfer operation to obtain first data. Based at least on address information contained in the first data, a second DMA transfer operation is performed absent further action by the data processing block. The second DMA transfer obtains an additional data block having additional address information. Additional DMA transfer operations are performed in this manner absent intervention from the data processing block to obtain still further blocks of data.
Thus, DMA transfers in accordance with the present invention require only one initial setup operation. For example, a processor need only set up a starting address and, optionally, a data length of the first block of data to be DMA-transferred. Subsequent blocks of data can then be DMA-transferred without further intervention from the processor.
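By way of illustration, the following sketch shows what such a one-time setup might look like from the processor's perspective; the register names, the 32-bit address width, and the C representation are assumptions made for illustration only and do not describe a particular hardware interface.

    #include <stdint.h>

    /* Hypothetical one-time setup: the processor programs only the starting
     * address (and, optionally, the length) of the first block of data. Every
     * subsequent block is located through address information carried in the
     * previously transferred data, so the processor is not involved again. */
    static void setup_chained_dma(volatile uint32_t *start_addr_reg,
                                  volatile uint32_t *length_reg,
                                  uint32_t first_block_addr,
                                  uint32_t first_block_len)
    {
        *start_addr_reg = first_block_addr;
        *length_reg     = first_block_len;   /* the length is optional per the text */
    }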
Aspects, advantages and novel features of the present invention will become apparent from the following description of the invention presented in conjunction with the accompanying drawings, wherein:
Storing a single large block of data 102 in the memory 116 typically requires a contiguous area of memory large enough to hold the block of data. An advantage of the present invention arises from the fact that only smaller blocks of memory are needed to store the linked list elements, and it is easier to allocate smaller blocks of memory than it is to allocate one very large block of contiguous memory. As will be discussed in connection with a more specific embodiment, in order to reduce latency in a video application, it is desirable to be able to store one line of video data and to send out one line of video data at a time.
In response to receiving an indication to begin the DMA transfer, the output control block 202 reads out (fetches) an element from the linked list 122, beginning with the element indicated by the start address, e.g., element 122a. The address in the memory 116 of the next element in the linked list 122 is determined from the next address field in the currently fetched linked list element. The next element is then transferred from memory and processed accordingly. This is repeated for each element in the linked list, so that the linked list elements 122b-122f are subsequently read out.
The linked list 122 allows the entire data block 102 (
The linked list 122 contains information that can be used by the output control block 202 to perform DMA transfer of the entire data block 102. The last element 122f of the linked list 122 points back to the beginning of the list. Consequently, traversal through the linked list 122 can simply be repeated when the last element 122f of the linked list is reached.
In accordance with another aspect of the present invention, the last element in a linked list can point to another linked list.
In accordance with still another aspect of the present invention, an application executing on the processor 112 can simultaneously update previously read-out portions of the linked list while subsequent parts of the linked list are being output by the output control block 202. Referring to
In accordance with yet another aspect of the present invention, new linked list elements written by the processor 112 during DMA transfer processing by the output control block 202 can be written to different partitions of the memory 116. Since each linked list element has a next address field, the next element in the linked list can be located anywhere in memory. This would be useful where some form of “garbage collection” or memory defragmentation processing is performed. Defragmentation is a process whereby a memory manager coalesces allocated portions of memory to create large contiguous blocks of free memory for allocation. For example, a linked list can be initially written to a first portion of memory, and a DMA transfer can be initiated. A final element of the linked list in the first memory portion can be made to point to a linked list element stored in a second portion of memory, which continues the list in the second portion of memory. When the final element of the linked list in the first portion of memory is read out, DMA transfer can then proceed in the second portion of memory. At this point, the processor 112 can perform maintenance operations on the first memory portion, e.g., defragmentation or the like. Note that, all the while, the DMA transfer continues without additional instruction from the processor beyond initiation of the DMA operation. A sketch of this splicing operation is given below.
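The following is a minimal host-side sketch of that splicing step, assuming an illustrative element layout in which the next-address field is modeled as a C pointer; the names are hypothetical and the remaining fields are omitted for brevity.

    /* Host-side model of a linked list element; in hardware the "next" field
     * would hold a memory address, and the data and control fields are
     * omitted here for brevity. */
    struct dma_list_element {
        struct dma_list_element *next;
    };

    /* Splice the list so that traversal continues in a second memory region.
     * Once the output control block has read past 'tail_in_first_region', the
     * first region is free for defragmentation or other maintenance while the
     * DMA transfer continues uninterrupted in the second region. */
    static void continue_in_second_region(struct dma_list_element *tail_in_first_region,
                                          struct dma_list_element *first_in_second_region)
    {
        tail_in_first_region->next = first_in_second_region;
    }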
In general, the processor 112 can be any data processing block. Typical examples include a microprocessor (e.g., a central processing unit, CPU) or an application-specific integrated circuit (ASIC) that is designed to perform data processing functions. The processor 112 can also be a digital signal processor (DSP), and so on.
In a particular embodiment of the present invention the processor 112 is a data processing component in a video processing system; e.g., a video encoder. In fact, the processor 112 might comprise a plurality of video processors in a multiprocessor architecture. Accordingly, the data block 102 comprises video data that is processed by the video processing system. The output control block 202 shown in
The data block 102 can be any unit of video data suitable for the particular video application. For example, each data block can be the video data for an entire video frame, or a video field in the case of interlaced video. Each linked list element can contain the video data for a line in the video frame or field. For example, a video frame might comprise 720 video lines in the case of progressively scanned video (720P). The number of lines varies depending upon the format of the video data, such as SD, HD, 1080i, etc. It might be convenient to organize the video on a frame-by-frame basis, where there is a linked list structure for each frame of video. Each linked list structure would comprise a number of linked list elements that constitute a video frame, where each element holds the data for a line of video in the frame. More generally, the video data may be structured such that each linked list holds only a portion of the video frame or field. Video data can be separated out into a luma data stream and a chroma data stream, in the case of component video. A linked list structure can be provided for each data stream.
Many memory systems impose a constraint on the length of a data transfer. In this particular embodiment of the present invention, the length of the transfer must be a multiple of 128 bytes. Therefore, according to this particular aspect of the invention, each element 302 of the linked list is size-constrained to satisfy the condition that its length is an integer multiple of 128 (i.e., a value divisible by 128 with no remainder). The filler field 316 is used to ensure that this condition is met. The number of bytes of fill data (m) in the filler field 316 is selected to satisfy the condition that the sum (12+n+m) is an integer multiple of 256, where “12” is the size of the three four-byte fields and n is the data length. Given that the data length (n) can be zero, the filler field has a maximum value of “252”, and a minimum value of “0” when the sum (12+n) is itself an integer multiple of 128.
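As a minimal arithmetic sketch of the padding rule stated above (the 12-byte header size and the 256-byte target are taken from the text; the constant and function names are illustrative only):

    #include <stdint.h>

    #define HEADER_BYTES 12u    /* the three four-byte fields                 */
    #define ALIGN_BYTES  256u   /* alignment target from the stated condition */

    /* Number of filler bytes m chosen so that (12 + n + m) is an integer
     * multiple of ALIGN_BYTES, where n is the data length in bytes. */
    static uint32_t filler_length(uint32_t data_length_n)
    {
        uint32_t remainder = (HEADER_BYTES + data_length_n) % ALIGN_BYTES;
        return remainder == 0u ? 0u : ALIGN_BYTES - remainder;
    }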
In this particular embodiment of the present invention, the vertical sync byte 332 is encoded with control information. The vertical sync byte 332 is used to indicate the end of a frame of video (hence “vertical sync”). In a particular implementation, a value of 0x01 is used to indicate the end of a video frame. The vertical sync byte 332 can also encode additional information. For example, a value (e.g., 0x03) can be inserted to cause the output control block 202 to immediately cease DMA transfer operations. This is useful for diagnostic purposes.
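Software that builds the linked list might represent these encodings with named constants, as in the brief sketch below; only the values 0x01 and 0x03 come from the text, and the names are assumptions.

    #include <stdint.h>

    /* Illustrative names for the encodings of the vertical sync byte 332. */
    #define VSYNC_END_OF_FRAME 0x01u   /* marks the end of a video frame          */
    #define VSYNC_CEASE_DMA    0x03u   /* tells the output control block to stop  */

    /* Example: mark the last element of a frame while building the linked list. */
    static void mark_end_of_frame(uint8_t *vsync_byte)
    {
        *vsync_byte = VSYNC_END_OF_FRAME;
    }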
DMA transfer processing is performed by the output control block 202 when it is triggered (step 404). The DMA transfer operation can be initiated by the processor 112 using any of a number of well-known techniques, including asserting an interrupt, asserting a predefined signal line, writing to an area in the output control block 202, and so on. The output control block contains the address of the starting element in the linked list.
In a step 406, a DMA transfer operation is performed to read out the addressed linked list element. In a video application, the data for a line of video is typically on the order of 1K (1024) bytes. Therefore, in the case that each element in the linked list represents a video line in the video frame or video field, the amount of data that is transferred by the DMA operation is about 1M (2^20) bytes. Depending on the memory architecture and the data bus width, reading out an element may require two or more DMA transfer operations. Thus, a first DMA transfer reads out a first portion of the linked list element. Then, a computation can be made based on the data length field 322 to determine whether one or more further DMA transfer operations are needed.
In a step 408, the video data portion of the linked list element is obtained and processed in some manner. This typically involves outputting the video data to a video output channel of the output control block 202, such as the data output channel 216. In accordance with conventional DMA processing, an interrupt or some similar signaling mechanism would be used to interrupt the processor 112 at this time so that the next DMA transfer can be set up by the processor.
However, in accordance with the present invention, a determination is made in a step 409 whether or not to continue traversing the linked list for the next element. Referring to
In step 410, the next address field in the currently fetched linked list element is accessed to obtain the address in the memory 116 of the next element in the list. Processing then proceeds to step 406 to obtain the next element. It is noted here that, in accordance with the present invention, DMA processing continues without additional setup by the processor 112. Thus, DMA transfer is continuously performed by repeating steps 406 through 410, absent intervention by the processor 112.
If the last element in the linked list points back to the starting element (i.e., forms a circular linked list), then the linked list will be repeatedly traversed. An application executing on the processor 112 can update each element in the list with new video data after it is read out, thereby effectively outputting another frame or field of video.
The linked list need not be circularly linked. Instead, a process can continuously add elements to the end of the linked list, while another process performs some form of garbage collection processing on elements which have been read out. In these scenarios, it is noted that the processor 112 need not manage any aspect of the DMA transfer operations after the initial steps of establishing the setup data (step 402) to read out the starting element in the linked list and initiating DMA transfer processing (step 404).
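To summarize steps 404 through 410, here is a minimal host-side model of the traversal loop, written as if the output control block's behavior were software; the element layout, the pointer-based modeling of the next-address field, and the check for the 0x03 stop encoding are illustrative assumptions rather than a description of the actual hardware.

    #include <stdint.h>
    #include <stdio.h>

    /* Host-side model of a linked list element; in hardware the "next" field
     * would be the next-address field holding a memory address. */
    struct dma_list_element {
        const struct dma_list_element *next;   /* next element in the list         */
        uint32_t data_length;                  /* payload bytes in this element    */
        uint32_t control;                      /* e.g., an encoded stop indication */
        const uint8_t *data;                   /* payload data                     */
    };

    /* Traverse the linked list with no further processor involvement after the
     * single setup step that supplies 'start' (steps 402 and 404). */
    static void output_control_loop(const struct dma_list_element *start)
    {
        const struct dma_list_element *e = start;
        for (;;) {
            /* Step 406: fetch the element (modeled here as a pointer dereference). */
            /* Step 408: process the element's data, e.g., output it.               */
            printf("outputting %u bytes\n", (unsigned)e->data_length);
            /* Step 409: decide whether to continue traversing the list.            */
            if (e->control == 0x03u)    /* illustrative "cease DMA" encoding         */
                break;
            /* Step 410: follow the next-address field; a circular list repeats.    */
            e = e->next;
        }
    }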
The discussion will now turn to a description of a specific embodiment of the present invention in a video processing application. A commonly used video format represents video as luma data and chroma data. In this embodiment, a video frame comprises a luma data stream that is stored in the linked list arrangement discussed above. Similarly, a chroma data stream is stored in a separate linked list arrangement. Each element in the respective linked lists contains the data for a line of video in the frame. The chroma data actually comprises chroma-R data and chroma-B data. However, a 4:2:2 sampling technique is used to reduce video data storage requirements by subsampling the chroma information, so that each chroma component is sampled at half the horizontal rate of the luma. Consequently, the chroma-R and chroma-B data can be combined and stored in the same amount of space as is used to store the luma data.
A signal 522 (DMA-data-ready) from the memory (e.g., DMA) controller 114 feeds into the DMA interface block 500 to indicate that the DMA controller 114 has data to be read out. A DMA address bus 524 feeds into the memory controller 114. A 64-bit data bus 526 from the memory controller 114 feeds into latches 504, 506, and to a buffer (not shown) for storing data read out from the memory 116.
A data store 518 (e.g., a register bank) stores starting addresses and other information to initiate a DMA transfer of the starting elements from the linked lists in the memory 116. The information contained in the data store 518 is programmatically accessed. For example, software executing on the processor 112 can write the data 212 to the data store 518 or read from the data store 518. The information 212 includes a luma start address which identifies a beginning element (622a,
The data store 518 also includes information relating to the data size, whether the data is 8-bit data or 10-bit data; the video data can be stored in 8-bit format or 10-bit format. A luma-only datum indicates whether the data to be accessed from the memory 116 contains only a luma data stream. As will be explained below, a video-start datum (Start-video-out) triggers processing to output the stored video data. Thus, the software will set up the address information, and when video output is desired, the video-start datum is written.
The DMA address bus (address lines) 524 is driven by a mux 502. The mux 502 is coupled to receive the luma start address and the chroma start address information contained in the data store 518. The mux 502 also receives a luma-next address and a chroma-next address from a data latch 506 (typically provided by flip-flops). A selector input 502a on the mux 502 selects which of the data into the mux will be driven on the DMA address bus 524.
The 64-bit data bus 526 feeds into the data latch 504. In operation, the data bus 526 initially carries a data length value (322,
The 64-bit data bus 526 also feeds into the data latch 506. In operation, the data bus 526 carries a 32-bit address (luma-next) for the next linked list element (e.g., 622b) in the linked list 622 for the luma data stream, and a 32-bit address (chroma-next) for the next linked list element (e.g., 624b) in the linked list 624 for the chroma data stream. Referring again to
The adder circuit 512 receives the data length value and filler length value from the data latch 504. A constant value of “12” is also provided to the adder circuit 512. Referring to
The computed sum produced by the adder circuit 512 feeds into a comparator 514. The comparator 514 compares the computed sum with a value from a 32-bit counter 516. The counter 516 counts the number of bytes read from the memory controller 114. In the specifically disclosed embodiment of the present invention, the memory controller 114 outputs eight bytes at a time to the DMA interface block 500. Consequently, the counter 516 is incremented by a constant value of “8”.
The comparator asserts a signal when the computed sum and the counter value match. This signal resets the counter. The comparator output also serves as a signal indicating that the end of the linked list element has been reached.
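The counting arrangement described above can be modeled briefly as follows; the 8-byte increment and the 12-byte constant come from the text, while the structure and function names are illustrative assumptions.

    #include <stdint.h>
    #include <stdbool.h>

    /* Model of the adder 512 / comparator 514 / counter 516 arrangement.
     * The memory controller delivers eight bytes per beat, so the counter
     * advances by 8 until it reaches 12 + n + m, the total element length. */
    struct element_tracker {
        uint32_t expected_total;   /* 12 + data length (n) + filler length (m) */
        uint32_t bytes_counted;    /* running count of bytes received          */
    };

    static void start_element(struct element_tracker *t,
                              uint32_t data_length_n, uint32_t filler_length_m)
    {
        t->expected_total = 12u + data_length_n + filler_length_m;  /* adder 512    */
        t->bytes_counted  = 0u;                                     /* counter reset */
    }

    /* Call once per 8-byte beat; returns true when the end of the element has
     * been reached (modeling the comparator 514 asserting end-of-list). */
    static bool beat_received(struct element_tracker *t)
    {
        t->bytes_counted += 8u;                       /* counter 516 increments by 8 */
        return t->bytes_counted == t->expected_total; /* comparator match            */
    }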
A state machine 508 provides control signals and sequencing control to perform the series of operations comprising the DMA transfer operations of the present invention. The state machine is in an idle state until a start-video-out datum is written. In response to receiving the start-video-out datum, the state machine operates the mux 502 to latch the luma-start-address onto the DMA address bus 524.
A block of eight bytes of data is read from the memory, and when that block of data is ready, the DMA-data-ready signal is asserted; this block is the first eight bytes of the starting element in the linked list for the luma data. The state machine 508 responds by latching data from the DMA channel 526 into the data latch 504. The data length field 322 and the filler length field 334 are extracted and fed into the adder circuit 512, where the sum is computed and compared against the list-counter 516. Data comprising the data field portion 314 is then stored from the channel 526 to a buffer (not shown). The list-counter 516 is incremented by “8”.
Subsequent 8-byte blocks of the linked list element are read in and stored to the buffer. With each 8-byte block, the list-counter 516 is incremented by “8”. When the last eight bytes of the linked list element are read in, the comparator 514 will assert end-of-list. This will trigger latch 506 to latch in the luma-next address. At this point, one line of luma data has been read out of memory.
The end-of-list signal will cause the state machine 508 to output (via mux 502) the chroma-start address to the DMA address bus 524, to begin reading out the starting element in the linked list for the chroma data. The starting element of the linked list for the chroma data is read out in the same manner as discussed for the starting element of the luma data.
When readout of the linked list element for the chroma data has completed, the chroma-next-address will have been latched into the latch 506. At this point, a line of luma data and a line of chroma data will have been read out and buffered. The data can then be processed, for example, by simply outputting it on a video out channel.
Meanwhile, the state machine 508 drives the luma-next-address latched in the data latch 506, via the mux 502, onto the DMA-address bus 524, to begin DMA transfer of the next element in the luma linked list. When the next element in the luma linked list has been read into the buffer (not shown), the state machine 508 drives the chroma-next-address, again via the mux 502, onto the DMA-address bus 524 to read in the next element in the chroma linked list.
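The resulting interleaving of luma and chroma line transfers can be summarized with the small sketch below; the state names and loop structure are illustrative, and the actual state machine 508 is a hardware block rather than software.

    #include <stdio.h>

    /* Illustrative summary of the read-out order driven by the state machine 508:
     * the luma and chroma linked lists are traversed in alternation, one video
     * line of each per iteration. */
    enum dma_state { READ_LUMA_LINE, READ_CHROMA_LINE };

    static void interleave_lines(unsigned lines_per_frame)
    {
        enum dma_state state = READ_LUMA_LINE;
        unsigned lines_done = 0;

        while (lines_done < lines_per_frame) {
            switch (state) {
            case READ_LUMA_LINE:
                /* Drive the current luma address (luma-start, then luma-next)
                 * onto the DMA address bus and read one line of luma data.    */
                printf("line %u: read luma element\n", lines_done);
                state = READ_CHROMA_LINE;
                break;
            case READ_CHROMA_LINE:
                /* Drive the current chroma address and read one line of chroma
                 * data; a buffered luma/chroma line pair is now ready.         */
                printf("line %u: read chroma element\n", lines_done);
                state = READ_LUMA_LINE;
                lines_done++;
                break;
            }
        }
    }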
Thus, in accordance with the present invention, a single DMA set up operation to read in a first block of data is sufficient to initiate a continuous series of DMA operations to read in additional blocks of data. Significantly, the additional (subsequent) blocks of data are not identified in the initial DMA set up operation. Instead, the additional blocks of data are identified in a previously obtained block of data.
The present invention is related to the following commonly owned applications: METHOD AND APPARATUS FOR CLOCK SYNCHRONIZATION BETWEEN A PROCESSOR AND EXTERNAL DEVICES, filed concurrently herewith (attorney docket no. 021111-001600US); and VECTOR PROCESSOR WITH SPECIAL PURPOSE REGISTERS AND HIGH SPEED MEMORY ACCESS, filed concurrently herewith (attorney docket no. 021111-001300US), both of which are incorporated herein by reference for all purposes.