1. Field of the Invention
The present invention relates generally to the field of timing synchronization and, more particularly, to multiple synchronization mechanisms used together to create effective synchronization between different clock sources.
2. Introduction
Encoder-to-decoder clock synchronization is an issue that arises in many different types of multimedia transmission systems. It is a particularly difficult issue in transmission over asynchronous packet-switched networks such as Ethernet/Internet. The encoder and decoder in the system agree on a nominal sample clock frequency, such as 16 kHz audio or 29.97 frames per second video. The encoder has a crystal clock source of a certain nominal frequency fce which runs at least one PLL/DLL, and the encoder creates its 16 kHz sample clock from this clock source. The decoder likewise has its own crystal clock source of nominal frequency fcd which, through a PLL/DLL, creates the decoder's 16 kHz sample clock. The encoder's audio ADC (analog-to-digital converter) uses the encoder's 16 kHz sample clock; the encoder encodes the samples and transmits them over the network to the decoder, which decodes them and outputs to its audio DAC (digital-to-analog converter), which uses the decoder's 16 kHz sample clock. The problem is that the crystal frequencies fce and fcd are only nominal frequencies. In practice, crystals have some tolerance in their frequencies (for example ±40 parts per million), plus further changes due to aging and temperature. Thus, while the actual crystal frequencies are both very close to 27 MHz, they are not likely to be exactly equal to one another. If the specification for each were ±50 ppm total, in the worst case the encoder could be at 27,001,350 Hz and the decoder at 26,998,650 Hz.
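The worst-case figures above follow from simple ppm arithmetic. The short sketch below reproduces them and shows how quickly the resulting sample-clock error accumulates; the 27 MHz and ±50 ppm values are the illustrative ones from the paragraph above, not limits of the invention.

```python
# Illustrative worst-case drift arithmetic for two 27 MHz crystals,
# each with a +/-50 ppm total tolerance (example values from the text).

NOMINAL_HZ = 27_000_000
TOLERANCE_PPM = 50

ppm = TOLERANCE_PPM / 1_000_000
f_encoder = NOMINAL_HZ * (1 + ppm)   # encoder crystal at the high extreme
f_decoder = NOMINAL_HZ * (1 - ppm)   # decoder crystal at the low extreme

print(round(f_encoder))  # 27001350
print(round(f_decoder))  # 26998650

# With 16 kHz sample clocks derived from these crystals, the 100 ppm
# relative error accumulates samples the decoder never expects:
drift_samples_per_hour = 16_000 * 3600 * 2 * ppm
print(round(drift_samples_per_hour))  # 5760
```

At roughly 5,760 surplus samples per hour, an uncorrected decoder buffer would overflow or underflow within minutes, which is why an explicit synchronization mechanism is needed.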
The asynchronous communication network does not provide a clock source which can be used to directly synchronize the two ends. To make matters worse, the packet-switched network typically introduces data transmission latency. While the network must be able to maintain the average data transmission rate, there is quite a bit of “jitter” in packet transmission times. This jitter makes it considerably more difficult for the decoder to determine the encoder's actual sample rate frequency.
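One common way to cope with such jitter (a sketch of a standard technique, not necessarily the exact scheme claimed here) exploits the fact that network jitter only ever adds delay: the minimum of (arrival time − send timestamp) over a recent window approximates the fixed offset plus minimum path delay, and its trend over time reveals the clock skew.

```python
# Jitter-robust offset estimation: because queuing jitter is strictly
# additive, min(arrival - sent) over a window estimates the true
# clock offset plus the minimum propagation delay.

def best_offset(packets, window=64):
    """packets: list of (send_timestamp, arrival_time) pairs, same time units."""
    recent = packets[-window:]
    return min(arrival - sent for sent, arrival in recent)

# Hypothetical traffic: true offset is 5 units; jitter adds 0-3 units.
pkts = [(t, t + 5 + (t * 7) % 4) for t in range(100)]
print(best_offset(pkts))  # 5
```

Repeating the estimate over successive windows and comparing the minima yields the slow drift that the hardware and software corrections described below must remove.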
Method and apparatus are disclosed for synchronizing multimedia in asynchronous networks. In this invention, the clock domains are first reduced by separate hardware clock correction circuits at the separate endpoints of the asynchronous network. At each network node, a controllable input device such as a video device is synchronized to a non-controllable output device such as a set top box to prevent unknown or poor-quality alterations by the output device. Output device timestamp packets are regularly sent to the input device, which then adjusts its clock accordingly. The exchange of packets between input devices over the asynchronous network is then subjected to a software-based scheme to effectively synchronize these devices.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The features and advantages of the invention may be realized and obtained by means of the instruments and combinations particularly pointed out in the appended claims. These and other features of the present invention will become more fully apparent from the following description and appended claims, or may be learned by the practice of the invention as set forth herein.
The invention concerns the use of two separate synchronization mechanisms on each endpoint of the asynchronous network. The first, Hardware-based Correction, actually changes the sample clock rate. The second, Software-based Correction, adjusts the number of samples and/or timestamps before outputting them to a reproduction device.
The asynchronous network nodes such as network node 115 may be an MPEG player, satellite radio receiver, AM/FM radio receiver, satellite television, portable music player, portable computer, wireless radio, wireless telephone, portable digital video recorder, a Media system, handheld device, cellular telephone, mobile telephone, mobile device, personal digital assistant (PDA), or combinations of the above, for example.
The Media system performs full-duplex audio and video communication between different network nodes or endpoints. Each media production device or media system has at a minimum an audio and video capture device and an audio and video output device such as a Set Top Box. The capture and output devices have separate clocks for encoding and decoding purposes. Thus, when exchanging packets between Media system endpoints there are at least four independent clock sources: a video clock (VN) at the video device and an STB clock (SN) at the Set Top Box, where “N” identifies the network node or endpoint. In actuality, several other clocks may be used at each Media system, but they have no influence on the exchange of packets between endpoints. The four clock sources (V1, S1, V2, S2) are the ones related to media capture and output. For instance, most video devices will run off a fixed 27 MHz clock source (VN), but a separate 27 MHz VCXO, controlled by the video device, will be used to derive the sample clock (SN) that in turn synchronizes the audio samples to the device.
The plurality of capture devices such as capture device 105 each comprise a microphone for producing audio signals, a camera for producing video signals, and a processing platform such as the Davinci® video platform with a DM6446 evaluation module (EVM). The DM6446 features robust operating system support, rich user interfaces, high processing performance, and long battery life through the maximum flexibility of a fully integrated mixed processor solution. The peripheral set includes: configurable video ports; an Ethernet MAC (EMAC) with a Management Data Input/Output (MDIO) module; an inter-integrated circuit (I2C) bus interface; an audio serial port (ASP); general-purpose timers; a watchdog timer; general-purpose input/output (GPIO) with programmable interrupt/event generation modes, multiplexed with other peripherals; UARTs with hardware handshaking support; pulse width modulator (PWM) peripherals; and external memory interfaces: an asynchronous external memory interface (EMIFA) for slower memories/peripherals, and a higher-speed synchronous memory interface for DDR2.
The DM6446 device includes a Video Processing Subsystem (VPSS) with two configurable video/imaging peripherals: a Video Processing Front-End (VPFE) input used for video capture, and a Video Processing Back-End (VPBE) output with an imaging co-processor (VICP) used for display. The Video Processing Front-End (VPFE) comprises a CCD Controller (CCDC), a Preview Engine (Previewer), a Histogram Module, an Auto-Exposure/White Balance/Focus Module (H3A), and a Resizer. The CCDC is capable of interfacing to common video decoders, CMOS sensors, and Charge Coupled Devices (CCDs). The Previewer is a real-time image processing engine that takes raw imager data from a CMOS sensor or CCD and converts it from an RGB Bayer pattern to YUV4. The Histogram and H3A modules provide statistical information on the raw color data for use by the DM6446. The Resizer accepts image data for separate horizontal and vertical resizing from ¼× to 4× in increments of 256/N, where N is between 64 and 1024.
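The 256/N rule above spans the stated ¼× to 4× range. A quick check (illustrative arithmetic only, not device driver code):

```python
# Resizing ratios available from the 256/N rule described above,
# with N constrained to the stated range of 64 to 1024.

def resize_ratio(n):
    if not 64 <= n <= 1024:
        raise ValueError("N must be between 64 and 1024")
    return 256 / n

print(resize_ratio(64))    # 4.0   (maximum upscale, 4x)
print(resize_ratio(256))   # 1.0   (no resize)
print(resize_ratio(1024))  # 0.25  (maximum downscale, 1/4x)
```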
The capture devices produce a data stream 140 consisting of audio packets and video packets, which respectively contain the audio and video data. Data stream 140 can be communicated to another network node or exchanged between the capture device 105 and the set top box 110 in the form of local traffic or intra-node communication. Data stream 140 in most cases is audio and video data that can be reproduced by a set top box such as STB 110 into an audio signal to be produced by a speaker system and a video signal to be produced by a TV monitor or other video generating device. The capture devices such as capture device 105 can also format the captured data into data packets 135 to transmit to another video device, another capture device, or another media production device through an asynchronous network node according to an asynchronous network media access protocol. In inter-node communication, data packet 135 originates in either network node 115 or network node 130. Data packets 135 received at second capture device 120 are processed so as to be reproduced by STB 125. STB 125 and STB 110 are substantially identical and operate in a similar fashion.
The network environment 100 illustrated in
Processor 230 may include at least one conventional processor or microprocessor that interprets and executes a set of instructions. Memory 220 may be a random access memory (RAM) or another type of dynamic storage device that stores information and instructions for execution by processor 230. Memory 220 may also include a read-only memory (ROM), which may include a conventional ROM device or another type of static storage device that stores static information and instructions for processor 230.
Communication interface 240 may include any mechanism that facilitates communication via network 145. For example, communication interface 240 may include a modem. Alternatively, communication interface 240 may include other mechanisms, such as a transceiver, for communicating with other devices or systems via wireless connections. User interface 250 may include one or more conventional input mechanisms that permit a user to input information, communicate with the capture device, and present information to the user, such as an electronic display, microphone, touchpad, keypad, keyboard, mouse, pen, stylus, voice recognition device, buttons, and one or more speakers.
Microphone 210 is used for picking up the audio signals of a user of the capture device. A second microphone could be used to capture stereo sound signals. Camera 260 is a single camera or a camera array comprising one or more still or video electronic cameras, e.g., CCD or CMOS cameras, either color or monochrome, or an equivalent combination of components that capture an area. Motion and operation of each camera 260 may be controlled by control signals, e.g., under computer and/or software control. Moreover, operational parameters for camera 260, including pan/tilt mirror, lens system, focus motor, pan motor, and tilt motor control, are controlled by control signals from a controller such as processor 230.
The capture device 105 may perform with processor 230 input, output, communication, programmed, and user-recognition functions by executing sequences of instructions contained in a computer-readable medium, such as, for example, memory 220. Such sequences of instructions may be read into memory 220 from another computer-readable medium, such as a storage device, or from a separate device via communication interface 240.
In hardware synchronization the set top box is treated as a decoder and correction is only performed at the capture device. The STB is prevented from performing any synchronization because the capture device needs to be aware of the operations performed on the packets. For example, when the capture device performs echo cancellation, a post-corrected packet would not be available to the echo canceller running on the capture device. Without the post-corrected packets there would be complications with full duplex systems, which could result in poor-quality echo cancellation. Thus, to improve operations it is advisable to try to prevent any corrections from being performed by the STB. If the STB performs estimation, ideally it should find that no correction is necessary. It is, however, possible that the STB will perform a correction infrequently, in which case the echo canceller may perform less than ideally for a brief period of time. In hardware synchronization VCXO 425 is then used to match the capture device to the STB. Since VCXO 425 is synchronized to the set top box, any clock mismatch between the different capture devices has to be corrected by using schemes that do not employ VCXO 425.
In action 630, the best arrival time 620 is used for fine-tuning to adjust the nominal frequency of the capture device clock, or local clock, or adjustable clock in the capture device such as VCXO 425. Action 630 actually changes the sample clock rate of the decoder by using external VCXO 425 as the crystal clock source. With the proper control circuits, a VCXO can be adjusted by a small amount around its nominal frequency. The VCXO would be used as the source from which the decoder's sample clock is derived. In the alternative, action 640 adjusts the VCXO's clock rate through a scaling factor or multiply/divide ratio. The multiply/divide ratio is applied by a clock recovery or timing extraction circuit capable of locking onto data bits having a bit repetition rate related to the frequency of VCXO 425 by the ratio or fraction N/M, where each of N and M is an integer. It will of course be understood that the frequency and divisor values given herein are for purposes of illustrating a specific example of the invention, and not by way of limiting the invention.
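The multiply/divide alternative of action 640 can be illustrated with exact rational arithmetic (a sketch with hypothetical values; the 27 MHz, 16 kHz, and 40 ppm figures are examples, not claimed parameters): the derived sample clock is f_VCXO × N/M, so small corrections can be made by re-deriving the integers N and M instead of, or in addition to, pulling the VCXO itself.

```python
from fractions import Fraction

# Sketch of an N/M multiply/divide correction. Values are illustrative.
F_VCXO = 27_000_000          # nominal VCXO frequency, Hz
TARGET = 16_000              # nominal sample clock, Hz

# Exact nominal ratio: 16 kHz / 27 MHz reduces to 2/3375.
ratio = Fraction(TARGET, F_VCXO)
print(ratio)                 # 2/3375

# If arrival-time measurements show the far end running 40 ppm fast,
# scale the ratio accordingly and re-derive integer N and M:
corrected = ratio * Fraction(1_000_040, 1_000_000)
n, m = corrected.numerator, corrected.denominator
print(F_VCXO * n / m)        # corrected sample clock, 16000.64 Hz
```

A hardware divider cannot use arbitrarily large N and M, so a real circuit would quantize the corrected ratio to the nearest realizable fraction; the residual error is then absorbed by the software-based correction described below.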
As noted above, hardware synchronization has been reserved for local traffic, i.e., packets transmitted to the set top box by the video device. When audio quality is a key evaluation factor, the Software-based Correction for audio should do more than trivial sample drop or repeat. In sample drop, the number of samples is decreased to accommodate faster traffic arriving at the video device. In sample repeat, the number of samples is increased to accommodate slower traffic arriving at the video device.
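For reference, the trivial drop/repeat baseline mentioned above can be stated in a few lines (a sketch only; function name and block interface are illustrative):

```python
# Trivial per-block sample drop/repeat correction, the baseline the
# interpolation scheme below is meant to improve upon.

def correct_block(samples, direction):
    """direction: -1 to drop one sample, +1 to repeat one, 0 to pass through."""
    if direction < 0:
        return samples[:-1]              # drop the last sample
    if direction > 0:
        return samples + samples[-1:]    # repeat the last sample
    return samples

block = [10, 11, 12, 13]
print(correct_block(block, -1))  # [10, 11, 12]
print(correct_block(block, +1))  # [10, 11, 12, 13, 13]
```

The abrupt discontinuity each drop or repeat introduces is what makes this approach audible, motivating the interpolation-based scheme that follows.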
The data sampling could be controlled by using three playback rate settings (slower/normal/faster) and using bilinear or bicubic interpolation to implement “slower” and “faster.” For example, “slower” might interpolate to create 5% more samples, and “faster” might interpolate to create 5% fewer samples. The actual percentage adjustment likely impacts the complexity of the interpolation filter, so 5% may not turn out to be a good choice. On the other hand, larger percentage adjustments may result in more noticeable changes in audio pitch and more oscillation between “slower” and “faster.” The Software-based Correction for video must be carefully coordinated with the correction for audio. The actual Correction method for video will probably need to be frame skip/repeat. Video timestamp adjustment at the Set Top Box could be used to adjust presentation times based on the MPEG-2 transport stream or H.264 SEI picture timing timestamps.
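The slower/normal/faster scheme above can be sketched with a simple linear-interpolation resampler (an illustration only; a production implementation would use a proper polyphase filter, and the ±5% figure is the example rate from the text):

```python
# Linear-interpolation resampler for the slower/normal/faster scheme.
# rate < 1.0 stretches the signal ("slower": more output samples);
# rate > 1.0 compresses it ("faster": fewer output samples).

def resample(samples, rate):
    out = []
    pos = 0.0
    while pos < len(samples) - 1:
        i = int(pos)
        frac = pos - i
        # Linearly interpolate between the two neighboring input samples.
        out.append(samples[i] * (1 - frac) + samples[i + 1] * frac)
        pos += rate
    return out

tone = [float(n % 8) for n in range(80)]
print(len(resample(tone, 0.95)))  # "slower": about 5% more samples
print(len(resample(tone, 1.05)))  # "faster": about 5% fewer samples
```

Because every output sample is a weighted blend of its neighbors rather than a raw drop or repeat, the correction spreads smoothly across the block, at the cost of a slight pitch shift proportional to the rate setting.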
Embodiments within the scope of the present invention may also include computer-readable media for carrying or having computer-executable instructions or data structures stored thereon. Such computer-readable media can be any available media that can be accessed by a general purpose or special purpose computer. By way of example, and not limitation, such computer-readable media can comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to carry or store desired program code means in the form of computer-executable instructions or data structures. When information is transferred or provided over a network or another communications connection (either hardwired, wireless, or a combination thereof) to a computer, the computer properly views the connection as a computer-readable medium. Thus, any such connection is properly termed a computer-readable medium. Combinations of the above should also be included within the scope of the computer-readable media.
Computer-executable instructions include, for example, instructions and data which cause a general purpose computer, special purpose computer, or special purpose processing device to perform a certain function or group of functions. Computer-executable instructions also include program modules that are executed by computers in stand-alone or network environments. Generally, program modules include routines, programs, objects, components, and data structures, et cetera, that perform particular tasks or implement particular abstract data types. Computer-executable instructions, associated data structures, and program modules represent examples of the program code means for executing steps of the methods disclosed herein. The particular sequence of such executable instructions or associated data structures represents examples of corresponding acts for implementing the functions described in such steps.
In particular, one of skill in the art will readily appreciate that the names of the methods and apparatus are not intended to limit embodiments. Furthermore, additional methods and apparatus can be added to the components, functions can be rearranged among the components, and new components corresponding to future enhancements and physical devices used in embodiments can be introduced without departing from the scope of embodiments. One of skill in the art will readily recognize that embodiments are applicable to future communication devices, different file systems, and new data types. Accordingly, the invention should be defined only by the appended claims and their legal equivalents, rather than by any specific examples given.