The disclosed subject matter relates to video communication systems, and particularly point-to-point or multi-point video communication systems in which one or more participants may have access to more than one camera and/or more than one display.
Video communication systems such as the ones used for videoconferencing often involve a single camera and a single display for each of the participants. This is typically the case when the system is hosted on a personal computer. High-end systems, intended for use in dedicated conferencing rooms, may feature multiple monitors. The second monitor is often dedicated to application sharing material (“content”). When no such content is used, one monitor may feature the loudest speaker whereas the other monitor shows some or all of the remaining participants.
Recently, there has been significant interest in so-called “telepresence” systems. These are systems that are intended to convey the sense of “being in the same room” as the remote participant(s). In order to accomplish this goal, these systems utilize multiple cameras as well as multiple displays. The displays and cameras are positioned at carefully calculated positions in order to be able to give a sense of eye-contact. Typical systems involve three displays—left, center, and right—although configurations with only two or more than three displays are also available.
The displays are situated in carefully selected positions in the conferencing room. Looking at each of the displays from any physical position on the conferencing room table is supposed to give the illusion that the remote participant is physically located in the room. This is accomplished by matching the exact size of the person as displayed to the expected physical size that the subject would have if he or she were actually present in the perceived position within the room. High-end systems go as far as matching the furniture, room colors, and lighting, to further enhance the life-like experience.
In order to be effective, telepresence systems must offer very high resolution and operate with very low latency. For example, these systems can operate at high definition (HD) 1080p/30 resolutions, i.e., 1080 horizontal lines progressive at 30 frames per second. To eliminate latency and packet loss, they also use dedicated multi-megabit networks and typically operate in point-to-point or switched configurations (i.e., they avoid transcoding).
Traditional video conferencing systems assume that each endpoint is equipped with a single camera, although they can be equipped with several displays. For example, the commercially available VidyoRoom HD-220 system, produced by Vidyo, Inc., is equipped with one camera and two monitors. The dual monitor configuration can be used in several different ways. For example, the active speaker can be displayed in the primary monitor, with the other participants shown in the second monitor in a matrix of smaller windows. The matrix layout is referred to as “continuous presence”, since participants are continuously present on the screen rather than being switched in and out depending on who is the active speaker. An alternative way is to use the second monitor to display content (e.g., a slide presentation from a computer) and the primary monitor to show the participants. The primary monitor then is treated as with a single-monitor system.
Telepresence systems that feature multiple cameras are designed so that each camera is assigned to its own codec. A system with three cameras and three screens would then use three separate codecs to perform encoding and decoding at each endpoint. These codecs would make connections to three counterpart codecs on the remote site, using proprietary signaling or proprietary signaling extensions to existing protocols.
The three codecs are typically identified as “left,” “right,” and “center.” In this document such positional references are made from the perspective of a user of the system; left, in this context, is the left-hand side of a user that is sitting in front of the camera(s) and is using the system. Audio is typically stereo, and can be handled through the center codec. In addition to the three video screens, telepresence systems typically include a fourth screen to display computer-related content such as presentations. This is referred to as the “content” or “data” stream.
Telepresence systems pose unique challenges compared with traditional videoconferencing systems. A key challenge is the fact that such systems must be able to handle multiple video streams. A typical videoconferencing system only handles a single video stream, and optionally an additional “data” stream for content. Even when multiple participants are present, the Multipoint Control Unit (MCU) is responsible for compositing the multiple participants in a single frame and transmitting the encoded frame to the receiving endpoint. Existing systems today address the need for multiple stream support in two different ways. One way is to establish as many connections as there are video cameras. This means that, for a three-camera system, three separate connections have to be established. Note that mechanisms have to be provided to properly treat these separate streams as a unit, i.e., as coming from the same location.
A second way is to use proprietary extensions to existing signaling protocols, or use new protocols, such as the Telepresence Interoperability Protocol (TIP). TIP was originally designed by Cisco Systems, Inc., and is currently managed by the International Multimedia Telecommunications Consortium (IMTC); the specification can be obtained from IMTC at the address 2400 Camino Ramon, Suite 375, San Ramon, Calif. 94583, USA or from the web site http://www.imtc.org/tip. TIP was designed to allow multiple audio and video streams to be transported over a single RTP (Real-time Transport Protocol, RFC 3550) connection. TIP enables the multiplexing of up to four video or audio streams in the same RTP session, using proprietary RTCP (RTP Control Protocol, defined in RFC 3550 as part of RTP) messages.
A significant difficulty in designing and implementing telepresence systems is multipoint operation and integration with non-telepresence systems. Traditional multipoint systems utilize MCUs, either in switching or transcoding configurations. The transcoding configuration introduces significant delay due to cascaded decoding and encoding, in addition to quality loss, and is thus problematic for the high-quality experience expected of a telepresence system. Switching, on the other hand, can become awkward, particularly when used between systems with a different number of screens.
Integration with non-telepresence systems, such as single-screen room systems, computer desktop systems, or mobile systems (on tablets or phones), is similarly problematic. In fact, existing telepresence systems that are based on legacy videoconferencing equipment do not support the presence of a low-end device without transcoding through an MCU.
Scalable video coding (‘SVC’), an extension of the well-known video coding standard H.264 that is already used in most digital video applications, is a video coding technique that has proven to be very effective in interactive video communication. The bitstream syntax and decoding process are formally specified in ITU-T Recommendation H.264, and particularly Annex G. ITU-T Rec. H.264, incorporated herein by reference in its entirety, can be obtained from the International Telecommunication Union, Place des Nations, 1211 Geneva, Switzerland, or from the web site www.itu.int. The packetization of SVC for transport over RTP is defined in RFC 6190, “RTP payload format for Scalable Video Coding,” incorporated herein by reference in its entirety, which is available from the Internet Engineering Task Force (IETF) at the web site http://www.ietf.org.
Scalable video and audio coding has been beneficially used in video and audio communication using the so-called Scalable Video Coding Server (SVCS) architecture. The SVCS is a type of video and audio communication server and is described in commonly assigned U.S. Pat. No. 7,593,032, “System and Method for a Conference Server Architecture for Low Delay and Distributed Conferencing Applications”, as well as commonly assigned International Patent Application No. PCT/US06/62569, “System and Method for Videoconferencing using Scalable Video Coding and Compositing Scalable Video Servers,” both incorporated herein by reference in their entirety. It provides an architecture that allows for very high quality video communication with high robustness and low delay.
Commonly assigned International Patent Application Nos. PCT/US06/061815, “Systems and methods for error resilience and random access in video communication systems,” PCT/US07/63335, “System and method for providing error resilience, random access, and rate control in scalable video communications,” and PCT/US08/50640, “Improved systems and methods for error resilience in video communication systems,” all incorporated herein by reference in their entirety, further describe mechanisms through which a number of features such as error resilience and rate control are provided through the use of the SVCS architecture.
In one form, the SVCS operation includes receiving scalable video from a transmitting endpoint and selectively forwarding layers of that video to the receiving participant(s). In a multipoint configuration, and contrary to an MCU, the SVCS performs no decoding/composition/re-encoding. Instead, all appropriate layers from all video streams are sent to each receiving endpoint by the SVCS, and each receiving endpoint is itself responsible for performing the composition for final display. Note that this means that, in the SVCS system architecture, all endpoints have to have multiple stream support, because the video from each transmitting endpoint is transmitted as a separate stream to the receiving endpoint(s). Of course, the different streams can be transmitted over the same RTP session (i.e., multiplexed), but the endpoint must be configured to receive multiple video streams, decode, and compose them for display. This is a very important advantage for SVC/SVCS-based systems in terms of supporting telepresence-type operation. In fact, the architecture lends itself to a much more general treatment, where telepresence is simply a special case of a multiple camera/multiple monitor architecture.
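The selective forwarding behavior described above can be illustrated with a minimal sketch. It is not the SVCS implementation; the `Packet` type, the `forward` function, the subscription table, and the layer labels are hypothetical names introduced only for illustration. The point it shows is that the server makes a routing decision per packet and never decodes, composes, or re-encodes video.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    source: str          # identifier of the transmitting endpoint (assumed)
    layer: str           # scalable layer label, e.g. "B0" or "S2" (assumed)
    payload: bytes = b""


def forward(packet, subscriptions):
    """Return the receivers that should be sent this packet.

    subscriptions maps receiver -> {source -> set of wanted layers}.
    The packet itself is left untouched; it is only routed.
    """
    return [
        receiver
        for receiver, wanted in subscriptions.items()
        if receiver != packet.source
        and packet.layer in wanted.get(packet.source, set())
    ]


# Hypothetical three-party conference.
subscriptions = {
    "A": {"B": {"B0", "B1", "B2"}},                    # A views B at low resolution
    "B": {"A": {"B0", "B1", "B2", "S0", "S1", "S2"}},  # B views A at full resolution
    "C": {"A": {"B0"}, "B": {"B0"}},                   # C receives base layers only
}
```

Because each receiver gets the layers of each source as a separate stream, composition for display remains the responsibility of the receiving endpoint, as the text states.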
Consideration now is being given to developing architectures and systems for video communication using devices that feature multiple cameras and multiple monitors, that take advantage of the capabilities made available by scalable video coding and SVCSs.
Systems and methods for performing videoconferencing using endpoints with multiple monitors and multiple cameras are disclosed herein.
In some embodiments, multimonitor/multicamera endpoints are comprised of nodes, where each node is comprised of a control unit and one or more node units, each connected to at least one monitor, camera, speaker, or microphone. Video is encoded using scalable coding, and endpoints are connected to each other over a network using an SVCS. In one embodiment, media from node units does not flow through the control unit. In another embodiment, media from node units flows through the control unit.
In one embodiment, the control unit assigns particular monitor layouts to each of the nodes, and selectively forwards layers from and to each endpoint. The control unit can dynamically change the layout of each monitor depending on system events.
In one embodiment of the disclosed subject matter, media streams are tagged with attributes such as loudness of audio, so that a control unit can apply prioritized stream selection in its assignment algorithm. Additional attributes can include linking and geolocation information. Stream allocation can take into account performance limits such as maximum pixel rate or maximum bit rate for a particular node.
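One way such a prioritized allocation could work is sketched below. The greedy strategy, the dictionary fields, and the function name `allocate_streams` are assumptions made for illustration; the disclosure does not prescribe a particular assignment algorithm. Streams are considered from loudest to quietest, and a stream is accepted only if the node's maximum pixel rate is not exceeded.

```python
def allocate_streams(streams, max_pixel_rate):
    """Greedy selection: loudest first, skipping streams that do not fit."""
    chosen, used = [], 0
    for s in sorted(streams, key=lambda s: s["loudness"], reverse=True):
        rate = s["width"] * s["height"] * s["fps"]  # pixels per second
        if used + rate <= max_pixel_rate:
            chosen.append(s["id"])
            used += rate
    return chosen


# Hypothetical candidate streams for one node.
streams = [
    {"id": "cam1", "width": 1280, "height": 720, "fps": 30, "loudness": 0.9},
    {"id": "cam2", "width": 640, "height": 360, "fps": 30, "loudness": 0.4},
    {"id": "cam3", "width": 1280, "height": 720, "fps": 30, "loudness": 0.7},
]
selected = allocate_streams(streams, max_pixel_rate=40_000_000)
```

With the assumed budget, the loudest 720p stream is accepted, the second 720p stream does not fit, and the smaller stream fills the remaining capacity. A bit rate limit could be handled the same way with a second running total.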
Throughout the figures the same reference numerals and characters, unless otherwise stated, are used to denote like features, elements, components or portions of the illustrated embodiments. Moreover, while the disclosed subject matter will now be described in detail with reference to the figures, it is done so in connection with the illustrative embodiments.
The Primary codec is responsible for audio handling. The system here is shown as having multiple microphones, which are mixed into a single signal that is encoded by the primary codec. There is also a fourth screen to display content. The entire system is managed by a special device labeled as the Controller. In order to establish a connection with a remote site, this system performs three separate H.323 calls, one for each codec. This is because existing ITU-T standards do not allow the establishment of multi-camera calls. This architecture is typical of existing telepresence products that use standards-based signaling for session establishment and control. Use of the TIP protocol would allow system operation with a single connection, and would allow up to 4 video streams and 4 audio streams to be carried over two RTP sessions (one for audio and one for video).
In embodiments of the disclosed subject matter, an endpoint is equipped with multiple monitors. The monitors can be positioned in a row, as with a telepresence system, but more generally they can have any number of different placement configurations.
In some embodiments of the disclosed subject matter the endpoint can be equipped with multiple cameras, which could be more or fewer than the number of monitors. The cameras can be located on the monitors (attached to them at the top), built inside the monitors (e.g., at the bezel, such as with the built-in camera of commercially available Apple Cinema displays), or they could be positioned in completely different locations.
In the following, the number of monitors associated with an endpoint will be denoted by M and the number of cameras associated with an endpoint will be denoted by C. Often a telepresence system is designed so that a set number of users is intended to be shown in each monitor, typically one or two. This number is herein indicated U. A system configuration can then be described by M/C/U; a 3/3/2 system then involves 3 monitors, 3 cameras, with 2 users intended to be shown in each of the monitors (and captured by each of the cameras). The system shown in
In all embodiments of the disclosed subject matter it is assumed that scalable video (and optionally audio) coding is used, following the H.264 SVC specification (previously cited). For audio it is assumed that the MPEG AAC-LD audio coding is used.
By selecting a subset of the layers, one can obtain a version of the original signal at different spatial and temporal resolutions. For example, taking only the B components (i.e., all B0, B1, and B2), one obtains the signal in full temporal resolution but low spatial resolution. Similarly, by taking the B0/S0 and B1/S1 components one obtains the signal at full resolution but at half the original frame rate.
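The layer selection just described can be expressed as a short sketch. The function `describe` is hypothetical, and it assumes a three-level temporal structure in which each additional temporal level doubles the frame rate, so that level 0 alone yields one quarter of the original frame rate; only the full-rate and half-rate cases are stated in the text.

```python
from fractions import Fraction


def describe(layers):
    """Return (spatial resolution, fraction of full frame rate) for a layer subset."""
    # Highest temporal level t such that B0..Bt are all present.
    t = -1
    while f"B{t + 1}" in layers:
        t += 1
    if t < 0:
        raise ValueError("the base layer B0 is required")
    # Full spatial resolution needs S0..St alongside the base layers.
    # Simplification: a subset with only some of these is reported as "low".
    full = all(f"S{i}" in layers for i in range(t + 1))
    return ("full" if full else "low", Fraction(1, 2 ** (2 - t)))


all_base = describe({"B0", "B1", "B2"})            # low resolution, full frame rate
half_rate = describe({"B0", "S0", "B1", "S1"})     # full resolution, half frame rate
```

The two calls reproduce the two examples given in the paragraph above.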
The particular picture coding structure is just an example, and other structures can also be used as described in commonly assigned International Patent Application No. PCT/US06/028365, “System and method for scalable and low-delay videoconferencing using scalable video coding,” incorporated herein by reference in its entirety, or others, as is known to people skilled in the art.
In embodiments of the disclosed subject matter, one or more SVCS servers can be used. The basic operation of the SVCS is described below with reference to
As explained in detail in U.S. Pat. No. 7,593,032 (previously cited), the SVCS architecture has significant advantages compared to traditional switching and transcoding Multipoint Control Units (MCUs) for multipoint video and audio communication: there is very little delay (10-20 msec vs. 200 msec for an MCU), there is no transcoding loss, and rate matching and personalized layout operations become packet forwarding decisions (i.e., decisions on which packet should be forwarded to each of the receiving endpoints).
The architecture for multimonitor/multicamera support in an endpoint in one embodiment of the disclosed subject matter is shown in
It is possible that each Node 650 can be equipped with more than one Monitor 610, or support more than one Camera 630. These cases are treated in essentially the same way as the single camera/single monitor Node case, as will be obvious to people skilled in the art. In the following, it is assumed that each Node 650 is equipped with a single monitor and camera, for simplicity of the presentation. Similarly, although
In one embodiment of the disclosed subject matter, content can be generated and encoded by either a Node Unit 655 or a Control Unit 670. Content can be encoded using the same SVC algorithm that is used for regular video, albeit tuned in a different way (higher spatial resolution but lower frame rate). This allows content decoding by any video-capable Node Unit. The content generation could be performed either internally, by capturing the contents of a window on the host computer or its entire desktop, or by obtaining an external computer graphics signal (not shown in
With continued reference to
A Node Unit 655 is a device that is capable of performing decoding of video if a monitor is present, encoding of video if a camera is present, encoding of audio if one or more microphones are present, and decoding of audio if one or more speakers are present. In one embodiment of the disclosed subject matter each Node Unit 655 can be a general purpose personal computer with appropriate hardware interfaces for the peripheral devices, running appropriate software to perform the Node Unit 655 functions described herein. The commercially available VidyoRoom HD-50 system from Vidyo, Inc., is a hardware device that features a general purpose personal computer equipped with appropriate interfaces to perform video encoding and decoding, as well as Speex Wideband audio, and can thus be used for this purpose.
It is also possible to create custom devices to perform the Node Unit 655 functions. For example, for Node Units without a camera, it is possible to produce very inexpensive devices. In fact, Node Unit capability could even be added to television sets that are equipped with hardware SVC decoders. Similarly, webcam manufacturers today are already designing devices that feature built-in SVC encoders; it is therefore straightforward to add network connectivity and the associated software to make them Node Units 655 offering camera (and microphone) functionality.
Similarly, it is possible for the Control Unit 670 to integrate at least one Node Unit 655, since it can have a monitor, camera, and speaker/microphone. One can equip a large room system such as the VidyoRoom HD-220, commercially available from Vidyo, Inc., to be a Control Unit 670 with an integrated Node Unit serving one camera. In fact, the system can have two such units, one acting as a Control Unit and the other as a simple Node Unit. This way, if the active Control Unit ceases to work, the other Node Unit can start operating as a Control Unit/Node Unit combination, thus providing fault tolerance.
The Control Unit 670 operates very similarly to an SVCS. For video and audio streams arriving at the Endpoint 600, the Control Unit 670 decides to which Node Unit 655 to send them (including which layers to include). Similarly, the Control Unit 670 activates each video and audio encoder in the various Nodes 650 so equipped, receives their coded video or audio streams, and transmits them to the connected SVCS (or other endpoint). The communication of the real-time streams between the two devices can be performed using standard RTP.
In one embodiment of the disclosed subject matter, the Node Units 655, the Control Panel 680 and the Control Unit 670 can automatically self-discover each other using the UPnP protocol (Universal Plug-and-Play, UPnP Forum, and also International Standard ISO/IEC 29341) and work as a cluster. UPnP allows devices to advertise their presence and the services that they offer to control devices on the network. Control devices in turn can send suitable control messages to the control URL for the service (provided in the service description), and express them in XML using SOAP (Simple Object Access Protocol, World Wide Web Consortium/W3C). Other protocols, including custom ones, e.g., using Remote Procedure Calls (RPC), can also be used as is obvious to persons skilled in the art. The control of the communication between Node Units 655 and the Control Unit 670, including error resilience functions, is performed by a special Node Unit Protocol that is implemented using UPnP, and is discussed after the overall system architecture is presented. Note that this architecture allows Node Units 655 to be dynamically removed from, or added to, the system, even when communication is on-going (in real-time). This can be useful in certain applications and provides very high fault-tolerance.
The Control Unit 670 is also the connection point of the Endpoint 600 to SVCSs or directly to other endpoints (neither shown in this diagram). Similarly to the Node-to-Control Unit connection, in one embodiment of the disclosed subject matter the connection between the Control Unit 670 and any SVCS or other endpoint is performed over an IP-based network.
Finally, in one embodiment of the disclosed subject matter the Endpoint 600 is also connected to a Portal (indicated here to be outside this diagram). The Portal is a server function responsible for user management and authentication, as well as other system management functions, as discussed later on. The connection between the Endpoint 600 and the Portal is preferably over the IP network to which the Endpoint 600 is attached. Note that the Portal can also be integrated with an SVCS (or even with an Endpoint, in some product configurations); its operation remains the same.
With continued reference to
The particular selection of endpoints and gateways is only used for purposes of illustration; any number of multimonitor/multicamera endpoints can be used, as well as any number of legacy endpoints or gateways, as is obvious to persons skilled in the art. At a minimum, it is assumed that at least one multimonitor/multicamera endpoint is present and at least one SVCS or other endpoint.
In one embodiment of the disclosed subject matter, all communication between endpoints, the SVCS 710, and Portals 740 is performed over a common IP network. The multimonitor/multicamera endpoints 720 and 722 are treated as any other endpoint by the SVCS and Portal, except that each one of them can produce more than one video stream. Functionally, for an SVCS (and a Portal), there is no difference if multiple video streams originate from the same endpoint or from different endpoints.
In one embodiment of the disclosed subject matter, the communication between Endpoints and the Portals 740 is performed using an Endpoint Management and Control Protocol (EMCP). These connections are shown with dashed lines in
In one embodiment of the disclosed subject matter a protocol is used between SVCSs and Endpoints to indicate to an upstream device (i.e., a receiving SVCS to a transmitting endpoint, a receiving endpoint to a transmitting SVCS, or a receiving SVCS to a transmitting SVCS) which layers of each of the available sources to include in its transmission. In one embodiment of the disclosed subject matter the Conference Management and Control Protocol (CMCP) is used, described in commonly assigned U.S. Provisional Patent Application 61/384,634, “System and method for the control and management of multipoint conferences,” which is incorporated by reference herein in its entirety.
The fact that scalable video coding offers multiple resolutions on the same bitstream, as well as the fact that composition occurs on the receiving endpoint provides complete flexibility in implementing different layouts.
With continued reference to
In the media control unit embodiment (also referred to as the “media aggregation” embodiment), all media flow from and to the Node Units is always performed through the Control Unit of each corresponding Endpoint. For Endpoint 1 720, this means that all media flow from and to the three Node Units 720n is performed through the Control Unit 720c. In this case the Control Unit acts more like an SVCS, or a cascaded SVCS, in that all media flows through it and it can make and implement decisions on which data to forward in either direction. The main advantage of the media control unit embodiment is that a single RTP session can be used for all media of the same type, thus simplifying firewall traversal and session setup processes. Also, any decisions that the Control Unit has to make with respect to sending a particular video stream to a particular Node Unit are implemented by the Control Unit itself, and do not have to be communicated to the SVCS, as with signaling-only aggregation. Node Units are relatively simpler as well, since they only have to implement very basic signaling functionality. Finally, it can potentially offer a simpler and/or more secure implementation if media encryption is used. In the following it is assumed that the media aggregation embodiment is used.
In summary, with continued reference to
In order to fully control placement of videos on the multiple monitors, the Node Unit uses a tree-structured conceptual model of the monitor area comprised of a display (the entire monitor display area), windows, and tiles. The structure is shown in
Example message definitions for the NUP are provided in
The “NAKpacketRequestEvent” event implements the negative acknowledgment in an identical way to the LR protection protocol, but this time carried over the NUP protocol. The R picture sequence indices can be carried over the standard RTP stream, using the optional Y bit of the PACSI header (with the associated optional TLOPICIDX and IDRPICID fields, and the S and E flags) in RFC 6190 (previously cited).
When a Control Unit 720c receives a “NAKpacketRequestEvent” event, it either retransmits the missing packet, if still available in its cache, or passes the negative acknowledgment upstream to the SVCS 710 (or whatever device happens to be connected upstream).
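The retransmit-or-forward decision can be sketched as follows. The class `ControlUnitCache`, its fixed-size eviction policy, and the `send_upstream` callback are hypothetical; the sketch only illustrates the branching described above and not the NUP message formats.

```python
class ControlUnitCache:
    """Keeps recently relayed packets so that NAKs can be answered locally."""

    def __init__(self, capacity, send_upstream):
        self.capacity = capacity
        self.packets = {}                  # sequence number -> payload, oldest first
        self.send_upstream = send_upstream

    def store(self, seq, payload):
        self.packets[seq] = payload
        while len(self.packets) > self.capacity:
            self.packets.pop(next(iter(self.packets)))  # evict the oldest entry

    def on_nak(self, seq):
        """Handle a negative acknowledgment received from a Node Unit."""
        if seq in self.packets:
            return ("retransmit", self.packets[seq])
        self.send_upstream(seq)            # pass the NAK on to the upstream device
        return ("forwarded", None)


upstream_naks = []                         # stands in for the upstream SVCS
cache = ControlUnitCache(capacity=2, send_upstream=upstream_naks.append)
for seq in (1, 2, 3):
    cache.store(seq, b"p%d" % seq)         # packet 1 is evicted when 3 arrives
```

A request for packet 3 is served from the cache, while a request for the evicted packet 1 is passed upstream.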
The main difference between single-camera systems and the multimonitor/multicamera systems is the fact that a multicamera source can be treated as a unit, and that a multimonitor system offers considerable flexibility in terms of how incoming video streams are to be displayed in the various monitors.
The spatial positioning of the video streams provided from a multicamera endpoint can have significance. For example, the relative positioning of each stream may need to be indicated (e.g., left, center, and right). This will enable a receiving endpoint to respect the spatial orientation and facilitate proper display. At a lower level, it can be desirable to only establish that streams are “linked”—in other words, they should be treated as a unit: if one is displayed, the other(s) should as well, and vice versa. For a camera, the system can even indicate its relative position from a known system location, as well as its vertical and horizontal angles (tilt/pan) and zoom factors. This information allows the receiver to know what the camera is looking at in the remote room.
It is also possible that the multiple video streams have no spatial orientation preference; for example, consider two cameras that take very close-up shots of two users. In this scenario, the relative spatial positioning may not matter.
The streams can have other attributes of significance. For example one of them can be marked as the loudest, or active, speaker. If the multiple cameras capture multiple people, it is not possible to infer (at least without some significant processing) which video corresponds to the active speaker. This information should therefore be provided by the endpoint itself. Proper tagging in this case can require that each camera has an associated microphone. This can also enable the system to provide spatial localization for audio, in that the active speaker's audio can be played back on the monitor where the video is shown. Such audio localization is also useful when a user is entering or leaving the conferencing session; the “chime” (or possibly a more complicated, text-to-speech driven announcement) that is played by the system should be played on the monitor (or near the monitor, if the monitor does not have speakers) where the particular user will be, or was, shown. This can ensure that the user's attention will be drawn to the right monitor.
Other attributes of interest can be geolocation, which, for example, can provide information about the physical location where the video streams originate (e.g., “New York Office”, or “Atlanta Airport”). The IP address where the video streams originate can be used to the same effect. Video resolution can also be an attribute of significance.
In order to better facilitate the organization of the layout on the remote side, and to attempt to preserve the physical characteristics of the room and the position of the participants, one embodiment of the disclosed subject matter also includes in each of the video streams it transmits, or in its signaling information, metadata that provides a set of attributes or tags.
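A possible shape for such per-stream metadata is sketched below. The `StreamTags` record, its field names, and the `expand_linked` helper are assumptions introduced for illustration and are not the disclosed signaling format. The helper shows how a receiver could honor the "linked" attribute (if one stream of a group is displayed, all are) and the relative positioning attribute (left, center, right).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StreamTags:
    stream_id: str
    position: Optional[str] = None     # "left", "center", "right", or no preference
    link_group: Optional[str] = None   # streams sharing a group are shown together
    geolocation: Optional[str] = None  # e.g. "New York Office"
    loudness: float = 0.0              # audio intensity tag


ORDER = {"left": 0, "center": 1, "right": 2}


def expand_linked(selected, all_tags):
    """Add every stream linked to a selected one; return ids ordered left to right."""
    groups = {t.link_group for t in all_tags
              if t.stream_id in selected and t.link_group}
    picked = [t for t in all_tags
              if t.stream_id in selected or t.link_group in groups]
    picked.sort(key=lambda t: ORDER.get(t.position, len(ORDER)))
    return [t.stream_id for t in picked]


tags = [
    StreamTags("ny-r", "right", "ny", "New York Office"),
    StreamTags("ny-l", "left", "ny", "New York Office"),
    StreamTags("ny-c", "center", "ny", "New York Office"),
    StreamTags("desk1"),
]
```

Selecting only the center stream of the hypothetical New York room pulls in its left and right companions in display order, while the unlinked desktop stream is unaffected.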
Similarly, a receiving system can indicate to the sender the number, position, and size of its monitors. It is possible that during call setup the systems exchange these parameters so that they identify the best possible operating mode between them.
When the configuration of two communicating systems is different, it is possible for either the transmitter or the receiver to perform adaptation. For example, if a system is designed to transmit video that is supposed to be shown in three display monitors that are on the same plane, whereas a particular receiving room has its display monitors in an arc configuration, the sender or receiver could perform perspective correction operation on the video signals.
Assuming that multiple monitors are available to a single Node Unit, it is possible to decode and render video so that it spans more than one monitor.
Note that, with SVC, there is no need to perform downsampling of the three video signals by a factor of ⅔ in order to fit them to the two monitors. If spatial scalability with a ratio of 1.5 is used, low resolution versions can be directly obtained from the SVC bitstream. Similarly, for a 4/4/-system, the room can be fitted into two monitors using 2:1 spatial scalability without any processing. Other ratios can be more difficult to accommodate, since they are not directly supported by SVC's Scalable Baseline profile.
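The arithmetic behind this observation can be checked with a short sketch. It assumes that the streams are placed side by side in a single row and that only the two ratios named above (1.5 and 2) are available as spatial scalability ratios; the function names and the 1920-pixel monitor width are illustrative assumptions.

```python
from fractions import Fraction

SUPPORTED_RATIOS = {Fraction(3, 2), Fraction(2)}   # ratios named in the text


def required_ratio(cameras, monitors):
    """Scale-down factor needed to fit `cameras` streams across `monitors`."""
    return Fraction(cameras, monitors)


def fits_without_resampling(cameras, monitors):
    r = required_ratio(cameras, monitors)
    return r == 1 or r in SUPPORTED_RATIOS


def low_layer_width(full_width, ratio):
    """Width of the lower spatial layer for a given scalability ratio."""
    return full_width / ratio


# Three 1920-wide streams at ratio 1.5 give three 1280-wide pictures,
# which together span exactly two 1920-wide monitors.
width = low_layer_width(1920, required_ratio(3, 2))
```

For three cameras and two monitors the required ratio is 3/2, so the lower spatial layer can be taken directly from the bitstream; five cameras on two monitors would need a ratio of 5/2, which falls outside the assumed set.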
c) shows a layout in which two 3/3/2 telepresence system rooms are displayed in a single set of 3 monitors. In this instance the video streams have been cropped in their upper and lower parts (one quarter of the picture height each), so that two sets can fit vertically on the monitors. Of course cropping can produce visual problems if the placement of the subjects is not appropriate within each frame.
It is evident from the preceding discussion that the traditional telepresence system layout has considerable limitations in terms of flexibility in combining video from different sources. In general, telepresence systems work best in symmetrical and point-to-point configurations.
Commonly assigned International Patent Application No. PCT/US09/046,758, “System and method for improved layout management in scalable video and audio communication systems,” incorporated by reference herein in its entirety, describes several techniques for performing layout management on a single monitor (or, more generally, a single rectangular display area), focusing particularly on the unique layout capabilities offered by SVC and the SVCS architecture.
With reference to
In
With continued reference to
These single monitor layouts can now be combined in multimonitor systems in various ways.
c) shows a video wall in which one monitor is allocated for the active speaker, one for content, and two are allocated for continuous presence of 18 participants. This particular configuration can also be used to connect 6 telepresence rooms that use a 3/3/-configuration. The lower monitors display the 6 rooms by placing the three videos of each room in a window row of each monitor. The active speaker monitor displays the active speaker, which can be in any one of the 18 video streams provided. The Control Unit of the Endpoint will select which incoming video stream is the one corresponding to the active speaker, and forward a duplicate of that video to the associated Node Unit. In this example the active speaker can also be shown in the bottom views.
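The placement of 6 rooms with three videos each on the two continuous presence monitors can be sketched as a simple index mapping. The assumption that each monitor holds three window rows, and the name `place_rooms`, are illustrative and not taken from the disclosure.

```python
def place_rooms(num_rooms=6, streams_per_room=3, rows_per_monitor=3):
    """Map (room, stream position) to (monitor, row, column), giving each
    room one window row so that its videos stay side by side."""
    placement = {}
    for room in range(num_rooms):
        monitor, row = divmod(room, rows_per_monitor)
        for column in range(streams_per_room):
            placement[(room, column)] = (monitor, row, column)
    return placement


placement = place_rooms()
```

With these assumed values the 18 windows fill two monitors, and the left, center, and right videos of a room always share a row.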
d) shows a layout with two separate content monitors and a continuous presence monitor, together with a single-view monitor. This allows content from two different applications and/or participants to be simultaneously shown on the monitors. Clearly, with the disclosed subject matter, any number of monitors can be allocated to display content. Although in the examples of
The determination of the active speaker can be done by appropriate tagging of the associated video and audio streams with a standardized audio intensity measure. This way the Control Unit can easily make the decision without any audio processing.
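A minimal sketch of this decision is shown below, assuming each stream carries a numeric audio intensity tag. The `TaggedStream` type and its field names are hypothetical, and the disclosure does not specify the scale of the measure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaggedStream:
    stream_id: str
    audio_level: int  # assumed tag value: larger means louder


def select_active_speaker(streams):
    """Pick the stream with the highest tagged audio level.
    Only the tags are inspected; no audio is decoded."""
    if not streams:
        return None
    return max(streams, key=lambda s: s.audio_level)


streams = [
    TaggedStream("room1-left", 12),
    TaggedStream("room2-center", 47),
    TaggedStream("room3-right", 30),
]
active = select_active_speaker(streams)
```

A practical system would likely smooth the decision over time to avoid rapid switching, but that is outside what the text describes.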
Note that the system can transition between layouts dynamically, to accommodate changes in the system. For example, it could transition from layout
Contrary to traditional telepresence systems, in a general multimonitor/multicamera system videos can be assigned to windows and monitors at will. With SVC used for the video representation, coupled with the selective forwarding capability of a Node's Control Unit and of the SVCS, any desirable layout (including those of traditional telepresence systems) can be easily implemented, as can transitions between layouts.
Through the use of tagging, which is used above to implement active speaker selection, other interesting layout management strategies can be implemented. For example, a monitor can be marked to always show a particular participant (e.g., the CEO of a company). Using geolocation tagging (or the IP address of the video sources), participants from the same geographical location can be forced to always be shown in windows that are physically next to each other, to further enhance the perception of their physical proximity. By linking streams to each other, they can be shown in particular configurations in their respective windows (for example, one to the left of the other).
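As one possible illustration of geolocation-based placement, the sketch below orders streams so that those sharing a location tag occupy consecutive windows. The tag values and the function name `order_by_location` are assumptions.

```python
def order_by_location(streams):
    """streams: list of (stream_id, location_tag) in arrival order.
    Returns stream ids reordered so that streams with the same tag are
    adjacent, keeping the order in which each location first appeared."""
    groups = {}
    for stream_id, location in streams:
        groups.setdefault(location, []).append(stream_id)
    return [sid for members in groups.values() for sid in members]


ordered = order_by_location([
    ("alice", "New York"),
    ("bob", "London"),
    ("carol", "New York"),
    ("dave", "London"),
])
```

Assigning the resulting list to windows left to right keeps participants from the same site next to each other.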
Finally, operations such as duplication (“duping”) or mirroring can also be easily implemented. In the first case a video stream is shown in more than one window, and in the second case there is duplication but with a reflection of the picture along the vertical axis (as with a regular mirror). This can be used for creative effect in large monitor walls.
In one embodiment of the disclosed subject matter, the Control Panel of the Endpoint can allow the user to select the desired layout configuration and switch between different ones in real time. It is also possible that identical functionality is provided through some other interface to the Control Unit, which manages the layout, as is obvious to persons skilled in the art. Additional functionalities that the system can provide include swapping streams between windows, “pinning” streams to particular windows so that any automatic layout determination algorithm does not modify their window allocation, or moving streams from one window to another. The system can also optionally overlay the name of the associated endpoint as text in each video window.
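The window operations described above (duplication, mirroring, swapping, pinning, and moving) amount to bookkeeping on a window table. The following is a minimal sketch under that assumption; the `Layout` class and its method names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Window:
    stream_id: Optional[str] = None
    mirrored: bool = False
    pinned: bool = False


class Layout:
    def __init__(self, num_windows):
        self.windows = [Window() for _ in range(num_windows)]

    def assign(self, index, stream_id, mirrored=False):
        self.windows[index].stream_id = stream_id
        self.windows[index].mirrored = mirrored

    def dup(self, src, dst, mirror=False):
        """Show the stream of window src in window dst as well."""
        self.assign(dst, self.windows[src].stream_id, mirrored=mirror)

    def swap(self, a, b):
        """Exchange the contents of two windows; pin flags stay in place."""
        wa, wb = self.windows[a], self.windows[b]
        content_a = (wa.stream_id, wa.mirrored)
        wa.stream_id, wa.mirrored = wb.stream_id, wb.mirrored
        wb.stream_id, wb.mirrored = content_a

    def move(self, src, dst):
        self.dup(src, dst, mirror=self.windows[src].mirrored)
        self.assign(src, None)

    def pin(self, index):
        self.windows[index].pinned = True

    def auto_assign(self, index, stream_id):
        """Assignment requested by an automatic layout algorithm.
        Pinned windows are left untouched."""
        if self.windows[index].pinned:
            return False
        self.assign(index, stream_id)
        return True
```

None of these operations touches the video itself; mirroring is recorded as a flag that the rendering Node would honor.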
An additional function that can be provided by the Control Unit is an “identify” operation; when triggered, the Control Unit instructs each Node equipped with one or more monitors to display an integer number on the monitor. This way the user can easily identify which monitor number is assigned to which physical monitor in the system. Alternatively, the Control Panel, or an alternative interface to the Control Unit, could provide the ability to show a particular uniform color on a monitor selected on its user interface. This would also allow the user to identify the particular monitor.
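A sketch of the “identify” operation is given below. The global numbering that starts at 1 and the shape of the command tuples are assumptions made for illustration.

```python
def identify_commands(nodes):
    """nodes: mapping of node name to the number of monitors attached to it.
    Returns (node, local monitor index, number to display) tuples, numbering
    monitors consecutively across all nodes that have at least one monitor."""
    commands = []
    number = 1
    for node, monitor_count in nodes.items():
        for local_index in range(monitor_count):
            commands.append((node, local_index, number))
            number += 1
    return commands


commands = identify_commands({"node-a": 2, "node-b": 0, "node-c": 1})
```

Each tuple would be sent to the named Node, which draws the number on the indicated monitor.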
All these layout management strategies and operations are simple packet forwarding decisions that are made at the Control Unit of the multimonitor/multicamera endpoint, and require no signal processing. In one embodiment of the disclosed subject matter, the Control Unit establishes a desired layout with the various Node Units. Video-to-window allocation within the available monitors is based on prioritization attributes (e.g., active speaker) while respecting stream placement constraints (e.g., stream linking, telepresence grouping, geolocation).
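One way to picture priority-based allocation under placement constraints is sketched below. The numeric priorities, the `pinned` mapping, and the function name `allocate` are hypothetical; the disclosure does not prescribe a particular algorithm.

```python
def allocate(streams, num_windows, pinned=None):
    """streams: list of (stream_id, priority); a higher priority is served first.
    pinned: optional mapping of stream_id to a fixed window index.
    Returns a mapping of window index to stream_id."""
    pinned = dict(pinned or {})
    # Placement constraints are honored before anything else.
    allocation = {window: stream_id for stream_id, window in pinned.items()}
    free_windows = [w for w in range(num_windows) if w not in allocation]
    # Remaining streams fill the free windows in descending priority order.
    ranked = sorted(
        (s for s in streams if s[0] not in pinned),
        key=lambda s: -s[1],
    )
    for window, (stream_id, _priority) in zip(free_windows, ranked):
        allocation[window] = stream_id
    return allocation


result = allocate([("a", 1), ("b", 5), ("c", 3)], num_windows=3, pinned={"a": 0})
```

The output is just a forwarding table, consistent with the point that no signal processing is involved.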
Assuming that monitor #3 is allocated to be used for the active speaker, then the Control Unit will perform stream swapping as necessary so that the active speaker is always shown on monitor #3. For example, with reference to
As shown, the benefits associated with SVCS can carry over to the environment of the multimonitor/multicamera endpoint. This is because the scalable nature of the coded representation of the video signal enables the elimination of signal processing for most of the useful system operations.
In commercial systems, an important consideration is how the system is licensed for use. The typical model used in videoconferencing is the concept of a “port.” A port in a legacy system is associated with a physical port on an MCU, and implies use of DSP resources. In an SVCS architecture, the concept of a port can be replaced by that of a “line,” i.e., a form of soft-licensing associated with connections to the SVCS. Line licensing is performed at the Portal, so that a set of line licenses can be used across a set of SVCSs. This allows, for example, implementing “follow the sun” strategies for license management, so that the same licenses are used in the US, Europe, and Asia, but at different times of the day. In a multimonitor/multicamera endpoint setting where a large number of monitors or cameras can be involved, it is advantageous to be able to specify licensing levels that depend on any of the following: number of streams per monitor; number of streams per node; number of cameras; resolution limits; number of monitors; or total bandwidth. License management is performed at the Portal, when connections are set up.
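A licensing level of this kind can be pictured as a set of limits checked when a connection is set up. The sketch below is an assumption-laden illustration: the field names, units, and the `violations` function are hypothetical and do not describe an actual Portal interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LicenseLevel:
    max_streams_per_monitor: int
    max_streams_per_node: int
    max_cameras: int
    max_monitors: int
    max_lines: int  # resolution limit, expressed as picture height in lines
    max_kbps: int   # total bandwidth limit


@dataclass(frozen=True)
class ConnectionRequest:
    streams_per_monitor: int
    streams_per_node: int
    cameras: int
    monitors: int
    lines: int
    kbps: int


def violations(level, request):
    """Return the names of all limits that the request exceeds."""
    checks = [
        ("streams per monitor", request.streams_per_monitor, level.max_streams_per_monitor),
        ("streams per node", request.streams_per_node, level.max_streams_per_node),
        ("cameras", request.cameras, level.max_cameras),
        ("monitors", request.monitors, level.max_monitors),
        ("resolution", request.lines, level.max_lines),
        ("total bandwidth", request.kbps, level.max_kbps),
    ]
    return [name for name, asked, allowed in checks if asked > allowed]


level = LicenseLevel(9, 18, 3, 4, 720, 8000)
request = ConnectionRequest(9, 18, 3, 6, 1080, 6000)
exceeded = violations(level, request)
```

A connection would be admitted only when the returned list is empty.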
The methods for scalable video communication using multiple cameras and multiple monitors described above can be implemented as computer software using computer-readable instructions and physically stored in a computer-readable medium. The computer software can be encoded using any suitable computer languages. The software instructions can be executed on various types of computers. For example,
The components shown in
Computer system 1500 includes a display 1532, one or more input devices 1533 (e.g., keypad, keyboard, mouse, stylus, etc.), one or more output devices 1534 (e.g., speaker), one or more storage devices 1535, and various types of storage medium 1536.
The system bus 1540 links a wide variety of subsystems. As understood by those skilled in the art, a “bus” refers to a plurality of digital signal lines serving a common function. The system bus 1540 can be any of several types of bus structures including a memory bus, a peripheral bus, and a local bus using any of a variety of bus architectures. By way of example and not limitation, such architectures include the Industry Standard Architecture (ISA) bus, Enhanced ISA (EISA) bus, the Micro Channel Architecture (MCA) bus, the Video Electronics Standards Association local (VLB) bus, the Peripheral Component Interconnect (PCI) bus, the PCI-Express (PCIe) bus, and the Accelerated Graphics Port (AGP) bus.
Processor(s) 1501 (also referred to as central processing units, or CPUs) optionally contain a cache memory unit 1502 for temporary local storage of instructions, data, or computer addresses. Processor(s) 1501 are coupled to storage devices including memory 1503. Memory 1503 includes random access memory (RAM) 1504 and read-only memory (ROM) 1505. As is well known in the art, ROM 1505 acts to transfer data and instructions uni-directionally to the processor(s) 1501, and RAM 1504 is used typically to transfer data and instructions in a bi-directional manner. Both of these types of memories can include any suitable form of the computer-readable media described below.
A fixed storage 1508 is also coupled bi-directionally to the processor(s) 1501, optionally via a storage control unit 1507. It provides additional data storage capacity and can also include any of the computer-readable media described below. Storage 1508 can be used to store operating system 1509, EXECs 1510, application programs 1512, data 1511 and the like and is typically a secondary storage medium (such as a hard disk) that is slower than primary storage. It should be appreciated that the information retained within storage 1508, can, in appropriate cases, be incorporated in standard fashion as virtual memory in memory 1503.
Processor(s) 1501 are also coupled to a variety of interfaces such as graphics control 1521, video interface 1522, input interface 1523, output interface 1524, and storage interface 1525, and these interfaces in turn are coupled to the appropriate devices. In general, an input/output device can be any of: video displays, track balls, mice, keyboards, microphones, touch-sensitive displays, transducer card readers, magnetic or paper tape readers, tablets, styluses, voice or handwriting recognizers, biometrics readers, or other computers. Processor(s) 1501 can be coupled to another computer or telecommunications network 1530 using network interface 1520. With such a network interface 1520, it is contemplated that the CPU 1501 might receive information from the network 1530, or might output information to the network in the course of performing the above-described method. Furthermore, method embodiments of the present disclosure can execute solely upon CPU 1501 or can execute over a network 1530 such as the Internet in conjunction with a remote CPU 1501 that shares a portion of the processing.
According to various embodiments, when in a network environment, i.e., when computer system 1500 is connected to network 1530, computer system 1500 can communicate with other devices that are also connected to network 1530. Communications can be sent to and from computer system 1500 via network interface 1520. For example, incoming communications, such as a request or a response from another device, in the form of one or more packets, can be received from network 1530 at network interface 1520 and stored in selected sections in memory 1503 for processing. Outgoing communications, such as a request or a response to another device, again in the form of one or more packets, can also be stored in selected sections in memory 1503 and sent out to network 1530 at network interface 1520. Processor(s) 1501 can access these communication packets stored in memory 1503 for processing.
In addition, embodiments of the present disclosure further relate to computer storage products with a computer-readable medium that has computer code thereon for performing various computer-implemented operations. The media and computer code can be those specially designed and constructed for the purposes of the present disclosure, or they can be of the kind well known and available to those having skill in the computer software arts. Examples of computer-readable media include, but are not limited to: magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and holographic devices; magneto-optical media such as optical disks; and hardware devices that are specially configured to store and execute program code, such as application-specific integrated circuits (ASICs), programmable logic devices (PLDs) and ROM and RAM devices. Examples of computer code include machine code, such as produced by a compiler, and files containing higher-level code that are executed by a computer using an interpreter.
As an example and not by way of limitation, the computer system having architecture 1500 can provide functionality as a result of processor(s) 1501 executing software embodied in one or more tangible, computer-readable media, such as memory 1503. The software implementing various embodiments of the present disclosure can be stored in memory 1503 and executed by processor(s) 1501. A computer-readable medium can include one or more memory devices, according to particular needs. Memory 1503 can read the software from one or more other computer-readable media, such as mass storage device(s) 1535, or from one or more other sources via a communication interface. The software can cause processor(s) 1501 to execute particular processes or particular parts of particular processes described herein, including defining data structures stored in memory 1503 and modifying such data structures according to the processes defined by the software. In addition or as an alternative, the computer system can provide functionality as a result of logic hardwired or otherwise embodied in a circuit, which can operate in place of or together with software to execute particular processes or particular parts of particular processes described herein. Reference to software can encompass logic, and vice versa, where appropriate. Reference to computer-readable media can encompass a circuit (such as an integrated circuit (IC)) storing software for execution, a circuit embodying logic for execution, or both, where appropriate. The present disclosure encompasses any suitable combination of hardware and software.
While this disclosure has described several exemplary embodiments, there are alterations, permutations, and various substitute equivalents, which fall within the scope of the disclosed subject matter. It will thus be appreciated that those skilled in the art will be able to devise numerous systems and methods which, although not explicitly shown or described herein, embody the principles of the invention and are thus within the spirit and scope of the invention.
This application claims priority to U.S. Provisional Application Ser. No. 61/347,994, filed May 25, 2010, which is incorporated by reference herein in its entirety.
Number | Date | Country
---|---|---
61347994 | May 2010 | US