The present disclosure relates to the field of multimedia content delivery networks and, more particularly, to synchronizing audio and video components of multimedia content delivered over a multimedia content delivery network.
When multimedia content is delivered over a distribution network to a plurality of end users, whether via satellite, cable, twisted copper, fiber, or another medium, audio components and video components may be segregated to improve network efficiencies. However, when segregated audio and video packets are transported across the network, random and systematic sources of error or delay may affect video and audio packets differently and can, therefore, negatively impact the synchronization. Because the most common or recognizable manifestation of the problem may be a detectable difference in timing between the visual perception of the movement of a speaker's lips and the audio perception of the corresponding sound, this problem is commonly referred to as lip synchronization error or, more simply, lip sync error.
A disclosed method of managing lip synchronization error in a multimedia content delivery network (MCDN) includes identifying a video packet and an audio packet associated with the video packet and determining a first synchronization offset between the video packet and the audio packet at a first monitoring point in the network. The video packet and the audio packet are then detected at a second monitoring point in the network and a second synchronization offset between the video packet and the audio packet is determined. When a delta between the first synchronization offset and the second synchronization offset exceeds a threshold, lip synchronization error information is automatically reported to a service provider and corrective action may be taken if potential sources of the lip synchronization error are within the domain of the service provider.
Identifying the video packet may include identifying a timestamp associated with the video packet. The video packet may be encoded according to a Moving Picture Experts Group (MPEG)-compliant video encoding and the timestamp may include a presentation timestamp (PTS). A PTS is a metadata field in an MPEG program stream that is used to achieve synchronization of elementary streams at an end point of the network.
The audio packet may be encoded according to an MPEG-compliant audio encoding and the audio packet may be identified based on pulse code modulation data in the packet. In some implementations, the audio packet and the video packet occur contemporaneously or substantially contemporaneously in a multimedia content program. In these embodiments, the synchronization offset that is monitored may be relatively small or minor at an upstream monitoring point in the network unless there is lip sync error in the content as received from a content provider.
Determining a synchronization offset may include associating a network timestamp with the video packet and a network timestamp with the audio packet and determining a difference between the video packet network timestamp and the audio packet network timestamp. Associating a network timestamp with the video packet may include assigning a network time protocol (NTP) timestamp to the video packet while or when the video packet is being processed at the first monitoring point. Similarly, associating a network timestamp with the audio packet may include assigning an NTP timestamp to the audio packet while or when the audio packet is being processed at the first monitoring point.
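By way of illustration only, the following sketch shows one way this offset computation could look. The helper names, and the use of the local system clock as a stand-in for an NTP-disciplined clock, are assumptions and are not part of the disclosure:

```python
import time

def network_timestamp() -> float:
    """Stand-in for an NTP-disciplined clock read; a real monitoring
    point would obtain this value from an NTP client."""
    return time.time()

def synchronization_offset(video_seen_at: float, audio_seen_at: float) -> float:
    """Offset is the video packet's network timestamp minus the audio
    packet's network timestamp; a positive value means the video packet
    was observed after its associated audio packet."""
    return video_seen_at - audio_seen_at

# At a monitoring point, each identified packet is stamped as it is processed.
video_ts = network_timestamp()  # assigned while the video packet is processed
audio_ts = network_timestamp()  # assigned while the audio packet is processed
offset = synchronization_offset(video_ts, audio_ts)
```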
In another aspect, a disclosed audio/video synchronization server, suitable for use in automatically detecting lip sync error in an MCDN, includes a general purpose, embedded, or other form of processor having access to computer readable storage media. Instructions, embedded or otherwise stored in the storage media and executable by the processor, include instructions to identify a video packet and an audio packet, determine a synchronization offset between the video packet and the audio packet at first and second monitoring points in the network, and, when a delta between the first and second synchronization offsets exceeds a predetermined threshold, log synchronization data indicative of the audio and video packets and the synchronization offset. Some embodiments may include instructions to compensate for the synchronization offset delta by adding packets to either a video stream carrying the video packet or an audio stream carrying the audio packet.
In some implementations, the first monitoring point includes an encoder of the MCDN and the second monitoring point comprises a central office switch. In addition, a synchronization offset between the second monitoring point and a third monitoring point may be determined. The third monitoring point may include customer premises equipment at a client site of the network.
Embodiments of the lip sync error detection and correction methods described herein may feature MPEG implementations, i.e., implementations that operate on MPEG-compliant multimedia content. Accordingly, aspects of MPEG are described herein. The output of a single MPEG audio or video encoder is called an elementary stream. An elementary stream is an endless, near real-time signal. An elementary stream may be broken into data blocks of manageable size, referred to as a packetized elementary stream (PES). A video PES and one or more audio PESs can be combined to form a program stream. Packets in a PES may include header information to demarcate the start of each packet and timestamps to resolve time base disruptions caused by the packetizing itself.
For transmission and digital broadcasting, several programs and their associated PESs can be multiplexed into a single transport stream. A transport stream differs from a program stream in that the PES packets are further subdivided into short fixed-size packets, referred to herein as transport packets. MPEG transport packets are fixed-size data packets, each containing 188 bytes. Each transport stream packet includes a program identifier code (PID). Transport stream packets within the same elementary stream will all have the same PID, so that the decoder (or a demultiplexer) can select the elementary stream(s) it wants and reject the remainder. Packet continuity counts ensure that every packet that is needed to decode a stream is received. An effective synchronization system is needed so that decoders can correctly identify the beginning of each packet and deserialize the bit stream into words.
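Because transport packets have a fixed 188-byte format, PID extraction is straightforward. The following sketch follows the standard MPEG-2 transport stream header layout (ISO/IEC 13818-1); it is illustrative only and not an implementation taken from the disclosure:

```python
def parse_transport_packet(packet: bytes) -> dict:
    """Extract the PID and continuity counter from one 188-byte
    MPEG-2 transport packet."""
    if len(packet) != 188 or packet[0] != 0x47:  # 0x47 is the sync byte
        raise ValueError("not a valid MPEG transport packet")
    pid = ((packet[1] & 0x1F) << 8) | packet[2]  # 13-bit program identifier
    continuity = packet[3] & 0x0F                # 4-bit continuity counter
    return {"pid": pid, "continuity_counter": continuity}
```

A demultiplexer keeps packets whose PID matches a selected elementary stream and uses the continuity counter to detect missing packets.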
An MPEG transport stream may carry packets for multiple programs encoded with different clocks. To enable this functionality, a transport stream includes a program clock reference (PCR) mechanism that is used to regenerate clocks at a decoder. In MPEG-2, timing references such as the PTS are relative to the PCR. A PTS has a resolution of 90 kHz, which is suitable for the presentation synchronization task. The PCR has a resolution of 27 MHz which is suitable for synchronization of a decoder's overall clock with that of the usually remote encoder.
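The two resolutions are related by a factor of 300 (27 MHz / 90 kHz): in MPEG-2, a PCR is transmitted as a 33-bit base counted at 90 kHz plus a 9-bit extension counted at 27 MHz. A short sketch of the standard conversions (generic MPEG-2 arithmetic, not specific to the disclosure):

```python
PTS_CLOCK = 90_000      # Hz; presentation timestamp resolution
PCR_CLOCK = 27_000_000  # Hz; program clock reference resolution

def pts_to_seconds(pts_ticks: int) -> float:
    """A PTS counts ticks of a 90 kHz clock."""
    return pts_ticks / PTS_CLOCK

def pcr_to_seconds(pcr_base: int, pcr_extension: int) -> float:
    """PCR = base * 300 + extension, expressed in 27 MHz ticks."""
    return (pcr_base * 300 + pcr_extension) / PCR_CLOCK
```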
Despite the use of timestamps and clock references, lip sync error can occur when multimedia content is delivered by multicasting a multi-program transport stream over a wide area, best-efforts network to a plurality of users, who may receive the content via access networks that employ different media including, as examples, twisted pair wire, co-axial cable, and/or fiber optic cables.
Lip sync error may result from the accumulation of video delays at several locations in the delivery network when no provision for compensating audio delay is made. Lip sync error may be of different types, including valid content lip sync error, provider-introduced lip sync error, lip sync error induced by NTP-induced PCR offset and jitter, and even MPEG-2 timestamp missing-packet errors.
Disclosed herein are methods and systems enabled to automatically detect and, when feasible, remediate lip sync error. Automated lip sync error detection/correction alleviates the need for costly and time-consuming intervention by a network engineer or technician. Disclosed lip sync error detection/correction systems and methods include systems and methods for measuring and monitoring parameters indicative of lip sync error at multiple points in a network, as well as ongoing attention to evolving solutions and standards. Lip sync error auto detection and correction may encompass content delivered via different transmission media. For example, lip sync error may occur when television content is transported via a first medium, e.g., a geosynchronous satellite radio link, having significantly different delay times than content delivered via a second medium, e.g., landline. The lip sync error methods and systems disclosed herein may delay the earlier of the two signals electronically to compensate for the different propagation times.
Automated lip sync error detection may implement an MPEG analyzer to monitor the video PTS timing and measure frame delay with respect to a reference. When a stream experiencing lip sync error is compared to the reference, the audio lead or lag may be quantified and flagged if greater than a predetermined threshold (e.g., a quarter of a second, which is sufficient to be easily perceived).
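A minimal sketch of this flagging step, assuming the quarter-second figure above as the threshold (the function name is hypothetical):

```python
LIP_SYNC_THRESHOLD_S = 0.25  # a quarter of a second, per the example above

def flag_lip_sync_error(stream_offset_s: float, reference_offset_s: float) -> bool:
    """Compare a stream's audio/video offset against the reference and
    flag the stream when the audio lead or lag exceeds the threshold."""
    lead_or_lag = stream_offset_s - reference_offset_s
    return abs(lead_or_lag) > LIP_SYNC_THRESHOLD_S
```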
In some cases, automated lip sync error isolation disclosed herein may identify the content provider or the service provider as the source of a lip sync error problem. A content provider may be identified as a source of a lip sync error when a headend receiver detects lip sync error, in the form of audio leading or lagging video, in content received from a content provider. A service provider may be identified as a lip sync error source when, for example, a headend encoder or receiver incorrectly inserts a video PTS that is out of alignment with an audio PTS or out of sync with the timing reference on a set top box (STB). A headend receiver may also discard serial digital interface (SDI) packets pre-encoder, causing the encoder to insert or stuff null packets into the multicast stream in order to maintain a constant bit rate. If sufficient null packets are stuffed into the post-encoder multicast stream, time misalignment sufficient to produce lip sync error may occur. Lip sync error can even result when network timing errors are caused by network elements that block NTP packets. Automated correction of lip sync error attributable to the service provider may include adding frames to either the video or audio components either within the receiver or at the encoder.
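One way to express this isolation logic is sketched below; it is a simplification under assumed inputs (an offset measured at the headend and an offset delta accumulated inside the network) and is not the literal method of the disclosure:

```python
def isolate_lip_sync_source(offset_at_headend_s: float,
                            offset_delta_in_network_s: float,
                            threshold_s: float = 0.25) -> str:
    """Attribute lip sync error to the content provider when it is already
    present in content received at the headend, and to the service provider
    when it accumulates between monitoring points inside the network."""
    if abs(offset_at_headend_s) > threshold_s:
        return "content provider"
    if abs(offset_delta_in_network_s) > threshold_s:
        return "service provider"
    return "no actionable lip sync error"
```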
In the following description, details are set forth by way of example to facilitate discussion of the disclosed subject matter. It should be apparent to a person of ordinary skill in the field, however, that the disclosed embodiments are exemplary and not exhaustive of all possible embodiments. Throughout this disclosure, a hyphenated form of a reference numeral refers to a specific instance of an element and the un-hyphenated form of the reference numeral refers to the element generically or collectively. Thus, for example, widget 12-1 refers to an instance of a widget class, which may be referred to collectively as widgets 12 and any one of which may be referred to generically as a widget 12.
In the depicted embodiment of MCDN 100, multimedia content is distributed from an upstream location such as super headend office (SHO) 150 and video headend office (VHO) 140, across a backbone network 130 to central offices (COs) 120 (only one of which is depicted in
In the case of DSL, content is sent from CO 120 to one or more DSL access multiplexer(s) (DSLAM(s)) 110, only one of which is depicted, and then to residential gateway (RG) 104-2 and STB 103-2 at client 102-2 via DSL access network 109, which may be implemented with twisted copper pair transmission medium.
In the case of an optical access network 108, content may be sent from an optical line terminal (OLT) 124 to an optical network termination (ONT) 106, which may be located at the exterior of a premises of a subscriber associated with client 102-1. ONT 106 converts optical signals to electrical signals and provides the electrical signals to RG 104-1 and STB 103-1, which may be functionally equivalent or similar to RG 104-2 and STB 103-2 in client 102-2.
Depending upon the implementation, CO 120 may include one or more switches and/or routing devices including, for example, a multiservice edge router 126 that couples CO 120 to backbone network 130. In the depicted embodiment, edge router 126 connects to one or more service switches 122 that provide an interface between CO 120 and one or more DSLAMs 110. In the embodiment depicted in
One or more of the switches and routers of CO 120 may include hardware and/or software to implement or facilitate automated detection and/or correction of lip sync error of multimedia content. In these embodiments, service switch 122 and/or edge router 126 may include a general purpose or embedded processor and computer readable storage for storing processor executable instructions to perform all or some of the lip sync error detection and correction methods and procedures. Similarly, lip sync error detection and correction modules may be included in upstream resources including SHO 150 or VHO 140 and in downstream resources including RG 104 and/or STB 103.
Referring now to upstream portions of MCDN 100, SHO 150 receives content from national content sources collectively represented by reference numeral 155. In some embodiments, SHO 150 provides “national” feed content including nationally distributed television channels including, as examples, TBS, USA, CNN, CSPAN, and the like. VHO 140 may encompass providers of regional or local content delivered from regional or local sources collectively represented as regional sources 145.
In some embodiments, national feed content provided via SHO 150 may be received by and/or delivered from SHO 150 via different media than regional/local content delivered to and distributed by VHO 140. National feed content may, for example, be delivered to SHO 150 via a satellite transmission while content may be delivered to VHO 140 via terrestrial broadcast, coaxial cable, twisted copper, optical fiber and so forth. Moreover, although
Referring now to
The embodiment of MCDN 100 depicted in
In some embodiments, monitoring servers 212 may include features and functionality similar to or analogous to features found in commercially implemented element management systems such as the ROSA video service manager from Cisco, Inc. Monitoring servers 212 may be configured to detect and identify individual network packets. Monitoring servers 212 may be further configured to interact with a source of timing that is external to MCDN 100. In some embodiments, for example, monitoring servers 212 may implement or be configured to communicate with an NTP client 214. NTP client 214, as suggested by its name, is a network element configured to communicate NTP messages with one or more NTP servers. NTP is a protocol for synchronizing clocks in a computer network. See, e.g., Network Time Protocol (Version 3), Internet Engineering Task Force RFC 1305 (1992). In Unix environments, NTP client 214 may be implemented as a daemon process that runs continuously in user space. In a Windows® environment, NTP client 214 may be implemented within Windows® time service. NTP employs 64-bit timestamps that have a theoretical resolution of approximately 200 picoseconds although the accuracy of actual NTP implementations may be closer to approximately 10 milliseconds over the Internet and 200 microseconds over a local area network.
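By way of example, a monitoring element could obtain a network timestamp with the third-party ntplib package; this is merely one possible client, and the disclosure does not mandate any particular NTP implementation:

```python
import ntplib  # third-party package: pip install ntplib

def ntp_timestamp(server: str = "pool.ntp.org") -> float:
    """Query an NTP server and return its transmit time as Unix seconds."""
    client = ntplib.NTPClient()
    response = client.request(server, version=3)
    return response.tx_time
```

Packets observed at a monitoring point can then be stamped with this value rather than with the local, possibly drifting, system clock.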
In some embodiments, MCDN 100 is configured to detect lip sync error and take corrective action, whenever possible, to compensate for or otherwise correct any lip sync error detected. As depicted in
In some embodiments, encoders 210 implement and/or support one or more levels of MPEG video encoding. Some embodiments, for example, may support MPEG-2 encoding, MPEG-4 encoding, and additional encodings. Encoders 210 may further include audio encoders that may support MPEG-1 Audio Layers I, II, and III as well as MPEG-2 Audio and MPEG-4 Audio.
In the embodiment depicted in
For optical access networks, content may be routed from CO switch 126-1 through service switch 122-1 to OLT 124-1 for conversion from an electrical signal to an optical signal. Content is then delivered from service switch 122-2 to client 102-2 over DSL access network 109, and from OLT 124-1 to client 102-1 via optical access network 108 and ONT 106, as described above with respect to
National content is depicted in
Referring now to
MPEG describes stream packets and transport packets. Stream packets are relatively large, variable-sized packets that represent a meaningful grain of the content. The packets in a video stream, for example, may represent a frame of the content, i.e., one entire image or screen. As content is transported over MCDN 100, however, MPEG-compliant devices generate transport streams that include a series of relatively small, fixed-size packets referred to as transport packets. Each MPEG transport packet contains 188 bytes, which includes a header and a payload. The packets illustrated in
Associated with each packet depicted in
In some implementations, the automated detection and correction of lip sync error is implemented as an application or service that is distributed across various elements in MCDN 100. In the depicted implementations, for example, the network points identified for monitoring may each include a processor and software or access to software containing instructions to perform a lip sync error detection and correction application. Also, as described above with respect to
When executed by the applicable one or more processor(s), the automated lip sync error detection and correction described herein may be implemented as a method 400 represented by the flow diagram of
After identifying the lip sync error monitoring points in block 402, the embodiment of method 400 depicted in
Identifying audio and video packets in block 404 may be facilitated by leveraging or extending functionality that is implemented in a video service management tool such as the ROSA video service manager from Cisco, Inc. In some embodiments, the identification of audio and video packets may be based, in part, on the encoding scheme employed by encoders 210. Embodiments that employ, for example, MPEG-2 video encoding may identify a video packet based, at least in part, on temporal or timestamp information contained in the video packet itself. In the case of MPEG-2 video, for example, a video packet may include PTS information that is highly indicative, if not uniquely identifying, of the packet itself. In some embodiments, PTS information in a video packet may be combined with other information to further identify the packet of interest. The PTS information in an MPEG-compliant encoded video packet is a 33-bit value representing a sample of a counter that increments at 90 kHz. Because the PTS value increases monotonically as the content progresses, the PTS is highly indicative of the corresponding packet. Moreover, as suggested above, PTS data may be combined with additional packet data to further refine the identification of specific packets.
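Because the 33-bit counter wraps roughly every 26.5 hours, a long-running comparison of PTS values must tolerate wraparound. A short sketch of the arithmetic (standard MPEG timing math, not specific to the disclosure):

```python
PTS_MODULUS = 1 << 33  # the 33-bit PTS wraps about every 26.5 hours
PTS_CLOCK = 90_000     # 90 kHz tick rate

def pts_delta_seconds(earlier_pts: int, later_pts: int) -> float:
    """Elapsed time between two PTS samples, tolerant of one wraparound."""
    delta_ticks = (later_pts - earlier_pts) % PTS_MODULUS
    return delta_ticks / PTS_CLOCK
```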
In block 404 of the embodiment of method 400 depicted in
The depicted embodiment of method 400 further includes determining (block 406), at the first monitoring point, a synchronization offset between the identified video packet and the identified audio packet. The synchronization offset may represent the difference between network-based timestamps associated with the identified video and audio packets. In some embodiments, for example, the lip sync error monitoring server 212-1 at first monitoring point 201 implements or is configured to invoke an NTP client 214 to obtain network-based timestamps for selected packets. In some embodiments, lip sync error detection method 400 may include obtaining NTP timestamps for the identified video and audio packets at first monitoring point 201. Any audio/video synchronization difference detected at first monitoring point 201 may represent a synchronization offset that is imperceptible, inherent in the content as received from the content provider, or both. The synchronization offset that is determined at first monitoring point 201 is taken as the baseline synchronization offset.
Block 408 of the embodiment of method 400 depicted in
Method 400 as shown further includes determining (block 422) a synchronization offset between the recognized audio and video packets. In the absence of network processing errors including, as examples, dropped packets, cyclic redundancy check (CRC) errors, and other network-based errors, one would not expect to see any substantial change in the synchronization offset that was present at first monitoring point 201. If, however, an appreciable shift in synchronization offset is detected, the shift may be categorized as lip sync error. The determination of lip sync error at second monitoring point 202 would tend to indicate that processing in the service provider's backbone network is causing or otherwise introducing lip sync error into the content.
When an appreciable change or delta in the synchronization offset between the identified packets is detected at second monitoring point 202, the synchronization offset shift may be recorded (block 424) in computer readable storage for subsequent analysis. In some embodiments, detection of appreciable changes in synchronization offset between first monitoring point 201 and second monitoring point 202 may trigger initiation of a corrective action procedure (block 426). Corrective action might be performed, for example, by a monitoring server 212-2 at monitoring point 202 and may include, for example, injecting empty or null packets into the component of content that is leading or lagging, as appropriate. Corrective action may also include initiating a trouble ticket or otherwise notifying a service provider and/or a content provider of the lip sync error, and notifying a subscriber if and when lip sync error is detected and if and when a trouble ticket is initiated.
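Putting blocks 422 through 426 together, the following simplified sketch illustrates one possible shape of the monitoring loop; the helper functions and logging scheme are hypothetical stand-ins for whatever recording, padding, and ticketing mechanisms a given deployment provides:

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("lip_sync_monitor")

def pad_with_null_packets(stream: str, seconds: float) -> None:
    """Hypothetical hook: inject null packets to delay the leading stream."""
    log.info("padding %s stream to absorb %.3f s of lead", stream, seconds)

def open_trouble_ticket(delta_s: float) -> None:
    """Hypothetical hook: notify the service provider."""
    log.info("trouble ticket opened for %.3f s synchronization shift", delta_s)

def check_and_correct(baseline_offset_s: float, downstream_offset_s: float,
                      threshold_s: float = 0.25) -> None:
    """Compare the downstream offset to the baseline (block 422), record an
    appreciable shift (block 424), and initiate correction (block 426)."""
    delta = downstream_offset_s - baseline_offset_s
    if abs(delta) <= threshold_s:
        return  # no appreciable shift in synchronization offset
    log.warning("synchronization offset delta %.3f s exceeds threshold", delta)
    # With offset defined as video minus audio, a positive delta means video
    # has fallen further behind, so audio is the leading component.
    leading = "audio" if delta > 0 else "video"
    pad_with_null_packets(leading, abs(delta))
    open_trouble_ticket(delta)
```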
Referring now to
In the embodiment depicted in
Computing apparatus 500, as depicted in
Apparatus 500 as shown in
Storage media 510 encompasses persistent and volatile media, fixed and removable media, magnetic, semiconductor, and optical media. As depicted in