The present disclosure relates generally to communication networks, and more particularly, to providing summary information for a media session.
The use of live media sessions has become increasingly popular as a way to reduce travel expenses and enhance collaboration between people at distributed geographic locations. Live broadcasts or conferences may be used, for example, for meetings (e.g., all-hands, town-hall), remote training lectures, classes, or other purposes. A common occurrence with a live media session is that a participant joins late, after the session has started. A participant may also miss a portion of the live media session. This may disturb others if the participant inquires as to what was missed. If the participant does not ask for an update, he may lose context and have trouble following the rest of the session. A participant may also have to step out of an ongoing session, in which case he does not know what he is missing.
Corresponding reference characters indicate corresponding parts throughout the several views of the drawings.
In one embodiment, a method generally comprises receiving media from a live media session, processing the media to generate summary information for the live media session, and transmitting the summary information for a specified segment of the live media session to a user during the live media session.
In another embodiment, an apparatus generally comprises a processor for processing media received from a live media session to generate summary information for the live media session and for transmitting the summary information for a specified segment of the live media session to a user during the live media session. The apparatus further comprises memory for storing the processed media.
The following description is presented to enable one of ordinary skill in the art to make and use the embodiments. Descriptions of specific embodiments and applications are provided only as examples, and various modifications will be readily apparent to those skilled in the art. The general principles described herein may be applied to other applications without departing from the scope of the embodiments. Thus, the embodiments are not to be limited to those shown, but are to be accorded the widest scope consistent with the principles and features described herein. For purposes of clarity, details relating to technical material that is known in the technical fields related to the embodiments have not been described in detail.
The embodiments described herein provide summary information for a live media session during the media session. For example, a user may automatically receive a summary of the media session from a start of the session until the point at which the user joined the session, or may request summary information for a specific segment of the media session, as described below. The user can therefore catch up on a missed portion of a media session as soon as he joins the live session and does not need to wait until the media session is over to receive a summary. The user may also set an alert for an event that may occur later in the media session, so that a notification can be sent to the user upon occurrence of the event. This allows the user to leave the live media session and return to the session if a notification is received.
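For purposes of illustration only, the following Python sketch shows one way such an alert could be registered and triggered against a live transcript feed. The names used (AlertService, register_alert, on_transcript_text) are hypothetical and are not part of any particular embodiment; a deployed system would obtain the transcript text from the media processing components described below.

    from dataclasses import dataclass, field
    from typing import Callable, List

    @dataclass
    class Alert:
        user_id: str
        keyword: str
        notify: Callable[[str], None]  # callback used to deliver the notification

    @dataclass
    class AlertService:
        alerts: List[Alert] = field(default_factory=list)

        def register_alert(self, user_id, keyword, notify):
            # Register an alert for an event (here, a keyword) that may occur later.
            self.alerts.append(Alert(user_id, keyword.lower(), notify))

        def on_transcript_text(self, timestamp_s, text):
            # Called as new transcript text is produced for the live session.
            lowered = text.lower()
            for alert in self.alerts:
                if alert.keyword in lowered:
                    alert.notify("'%s' mentioned at %.0f s: %s" % (alert.keyword, timestamp_s, text))

    if __name__ == "__main__":
        service = AlertService()
        service.register_alert("user-1", "budget", notify=print)
        service.on_transcript_text(845.0, "Next we will review the budget for Q3.")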
As described in detail below, the summary information may be a transcription, keywords, identification of speakers and associated speech time, audio or video tags, notification of an event occurrence, or any other information about the media session that can be used by a participant of the media session.
The term ‘media’ as used herein refers to video, audio, data, or any combination thereof (e.g., multimedia). The media may be encrypted, compressed, or encoded according to any format. The media content may be transmitted as streaming media or media files, for example.
The term ‘media session’ as used herein refers to a meeting, class, conference (e.g., video conference, teleconference), broadcast, telecast, or any other communication session between a plurality of users transmitted using any audio or video means, including signals, data, or messages transmitted through voice or video devices. The media session may combine media from multiple sources or may be from a single source. The media session is ‘live’ from the start of the session (e.g., transmission of audio or video stream begins, start of broadcast/telecast, one or more participants logging on or dialing in to a conference, etc.) until the session ends (e.g., broadcast/telecast ends, all participants log off or hang up, etc.). A participant of the media session may be an active participant (e.g., receive and transmit media) or a nonactive participant (e.g., only receive media or temporarily located remote from the media session).
The embodiments operate in the context of a data communications network including multiple network devices (nodes). Some of the devices in the network may be appliances, switches, routers, gateways, servers, call managers, service points, media sources, media receivers, media processing units, media experience engines, multimedia transformation units, multipoint conferencing units, or other network devices.
Referring now to the drawings, an example of a network in which the embodiments described herein may be implemented includes a plurality of endpoints 10 in communication with a media source 12 and a media processor 16 over a network 14.
The endpoints 10 are configured to originate or terminate communications over the network 14. The endpoints 10 may be any device or combination of devices configured for receiving, transmitting, or receiving and transmitting media flows. For example, the endpoint 10 may be a personal computer, media center device (e.g., TelePresence device), mobile device (e.g., phone, personal digital assistant), or any other device capable of engaging in audio, video, or data exchanges within the network 14. The endpoints 10 may include, for example, one or more processors, memory, a network interface, microphone, camera, speaker, display, keyboard, whiteboard, and video conferencing interface. There may be one or more participants (users) located at or associated with each endpoint 10.
The endpoint 10 may include a user interface (e.g., graphical user interface, mouse, buttons, keypad) with which the user can interact to request summary information from the media processor 16. For example, upon joining a live media session, the user may be presented with a screen displaying options to request summary information. The user may specify, for example, the type of summary information (e.g., transcript, speakers, keywords, notification, etc.) and may also specify the segment of the media session for which the summary is requested (e.g., from the beginning to the time at which the participant joined the media session, the segment during which a specific speaker was presenting, a segment for a specified time period before and after a keyword, etc.). The endpoint 10 may also include a display screen for presenting the summary information. For example, the summary information may be displayed within a window (note) or side screen (side bar) along with a video display of the live media session. The summary information may also be displayed on a user device (e.g., personal computer, mobile device) associated with the participant and independent from the endpoint 10 used in the media session.
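For purposes of illustration only, one possible form of the summary request sent from the endpoint 10 to the media processor 16 is sketched below in Python. The SummaryRequest structure and its field names are hypothetical; any particular embodiment may carry more, fewer, or different fields.

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class SummaryRequest:
        session_id: str
        summary_type: str                      # e.g., "transcript", "speakers", "keywords", "notification"
        start_time_s: Optional[float] = None   # None means "from the start of the session"
        end_time_s: Optional[float] = None     # None means "up to the current time"
        speaker: Optional[str] = None          # restrict to the segment when this speaker spoke
        keyword: Optional[str] = None          # restrict to a window around this keyword

    # Example: a participant who joined 20 minutes late asks for the keywords
    # spoken from the start of the session until the time of joining.
    request = SummaryRequest(session_id="mtg-42", summary_type="keywords",
                             start_time_s=0.0, end_time_s=1200.0)
    print(request)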
The media source 12 is a network device operable to broadcast live audio, video, or audio and video. The media source 12 may originate a live telecast or may receive media from one or more of the endpoints and broadcast the media to one or more of the endpoints. For example, the media source 12 may be a conferencing system including a multipoint conferencing unit (multipoint control unit (MCU)) configured to manage a multi-party conference by connecting multiple endpoints 10 into the same conference. The MCU collects audio and video signals transmitted by conference participants through their endpoints 10 and distributes the signals to other participants of the conference.
The media processor 16 is a network device (e.g., appliance) operable to process and share media across the network 14 from any source to any endpoint. As described in detail below, the media processor 16 processes the live media to provide summary information to a participant. The media processor 16 may also be configured to perform other processing on the media, including, for example, media transformation, pulse video analytics, integrating video into the media session, conversion from one codec format to another codec format, etc.
The media processor 16 is located between the media source 12 and the endpoints 10 and may be implemented, for example, at the media source, at one or more of the endpoints, or any other network device interposed in the communication path between the media source and endpoints. Also, one or more processing components of the media processor 16 may be located remote from the other components. For example, a speech-to-text converter may be located at the media source 12 and a search engine configured to receive and search the text may be located at one or more endpoints 10 or other network device.
It is to be understood that the network described above is only an example and that the embodiments described herein may be implemented in networks having different network topologies or network devices without departing from the scope of the embodiments.
An example of a network device 20 (e.g., media processor) that may be used to implement the embodiments described herein includes a processor 22, memory 24, network interfaces 26, and media processing components 28.
Logic may be encoded in one or more tangible computer readable media for execution by the processor 22. For example, the processor 22 may execute code stored in a computer-readable medium such as memory 24. The computer-readable medium may be, for example, electronic (e.g., RAM (random access memory), ROM (read-only memory), EPROM (erasable programmable read-only memory)), magnetic, optical (e.g., CD, DVD), electromagnetic, semiconductor technology, or any other suitable medium.
The network interfaces 26 may comprise one or more wireless or wired interfaces (linecards, ports) for receiving signals or data or transmitting signals or data to other devices. The interfaces 26 may include, for example, an Ethernet interface for connection to a computer or network.
The media processing components 28 may include, for example, a speech-to-text converter, search engine, speaker identifier (e.g., voice or face recognition application), tagging engine, or any other media processing component that may be used to generate summary information from the live media. Examples of media processing components 28 are described further below.
The network device 20 may further include any suitable combination of hardware, software, algorithms, processors, devices, components, or elements operable to facilitate the capabilities described herein. It is to be understood that the network device 20 described above is only an example and that network devices having different components and configurations may be used without departing from the scope of the embodiments.
As described below with respect to the flowchart, media from the live media session is received and processed at the media processor 16 to generate summary information. The media session starts at time t0, and a participant joins the media session at time tx.
If the participant joins the media session after the start of the media session (tx − t0 > 0) (step 44), summary information may be automatically sent (or sent on demand) to the user during the live media session for the missed segment of the session (t0 to tx) (step 46). If the difference between the joining time (tx) and the start time (t0) is equal to (or less than) zero, there is no missed segment to send and the process moves on to step 48. At any time during the media session, the user may request on-demand summary information for a specified segment of the media session (steps 48 and 49). Even if the user does not leave the media session, he may miss a part of the session, want to check whether he heard something correctly, or want to identify a speaker in the session, for example. The user may request a specific segment (e.g., from time x to time y, the segment when speaker z was talking, a time period before or after a keyword was spoken or a video frame was shown, etc.).
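For purposes of illustration only, the following sketch captures the join-time check and segment selection described above, using a simple time-stamped transcript as a stand-in for the processed media. The function names (summarize, on_join, on_demand) are hypothetical.

    def summarize(transcript, start_s, end_s):
        # Return the transcript entries that fall within the requested segment.
        return [(t, text) for (t, text) in transcript if start_s <= t <= end_s]

    def on_join(transcript, session_start_s, join_time_s):
        # If the participant joined late (tx - t0 > 0), return the missed segment.
        if join_time_s - session_start_s > 0:
            return summarize(transcript, session_start_s, join_time_s)
        return []  # joined at (or before) the start; nothing was missed

    def on_demand(transcript, start_s, end_s):
        # Summary for any segment the user specifies during the live session.
        return summarize(transcript, start_s, end_s)

    if __name__ == "__main__":
        transcript = [(30.0, "Welcome everyone."), (600.0, "Quarterly results look strong.")]
        print(on_join(transcript, session_start_s=0.0, join_time_s=900.0))
        print(on_demand(transcript, start_s=500.0, end_s=700.0))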
It is to be understood that the processes described above are only examples and that steps may be added, removed, combined, or reordered without departing from the scope of the embodiments.
The summary information may include any synopsis attributes (e.g., transcript (full or partial), keywords, video tags, speakers, speakers and associated time, list of ‘view-worthy’ sections of session, notification for event occurrence, etc.) that may be used by the participant to gain insight into the portion of the session that he has missed or needs to review. The following provides examples of processing that may be performed on the live media to provide summary information. It is to be understood that these are only examples and that other processing or types of summary information may be used without departing from the scope of the embodiments.
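For purposes of illustration only, one possible container for these synopsis attributes is sketched below. The SummaryInfo structure and its field names are hypothetical, and any particular embodiment may include more, fewer, or different attributes.

    from dataclasses import dataclass, field
    from typing import Dict, List, Tuple

    @dataclass
    class SummaryInfo:
        transcript: List[Tuple[float, str]] = field(default_factory=list)     # (time_s, text)
        keywords: List[str] = field(default_factory=list)
        speaker_times: Dict[str, float] = field(default_factory=dict)         # speaker -> seconds spoken
        video_tags: List[Tuple[float, str]] = field(default_factory=list)     # (time_s, tag)
        view_worthy: List[Tuple[float, float]] = field(default_factory=list)  # (start_s, end_s)
        notifications: List[str] = field(default_factory=list)

    summary = SummaryInfo(keywords=["budget", "roadmap"],
                          speaker_times={"Alice": 420.0, "Bob": 180.0})
    print(summary)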
Speech-to-text transcription may be performed to extract the content of the media session. A full transcript may be provided or transcript summarization may be used. A transcript summary may be presented, for example, with highlighted keywords that can be selected to request a full transcript of a selected section of the transcript summary. The transcript is preferably time-stamped. The speech-to-text converter may be any combination of hardware, software, or encoded logic that operates to receive speech signals and generate text that corresponds to the received speech. In one example, speech-to-text operations may include waveform acquisition, phoneme matching, and text generation. The waveform may be broken down into individual phonemes (e.g., eliminating laughter, coughing, background noises, etc.). Phoneme matching can be used to assign a symbolic representation to the phoneme waveform (e.g., using some type of phonetic alphabet). The text generation can map phonemes to their intended textual representation. If more than one mapping is possible, contextual analysis may be used to select the most likely version.
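For purposes of illustration only, the following sketch shows how a keyword-highlighted transcript summary could be derived from a time-stamped transcript, and how a selected summary entry could be expanded back into the surrounding section of the full transcript. No actual speech recognition is performed here; the transcript entries and function names are hypothetical.

    from typing import List, Tuple

    Transcript = List[Tuple[float, str]]  # (time in seconds, text)

    def summarize_transcript(transcript: Transcript, keywords: List[str]) -> Transcript:
        # Keep only the entries that contain one of the keywords.
        lowered = [k.lower() for k in keywords]
        return [(t, text) for (t, text) in transcript
                if any(k in text.lower() for k in lowered)]

    def expand_selection(transcript: Transcript, selected_time_s: float,
                         window_s: float = 60.0) -> Transcript:
        # Return the full transcript for a window around a selected summary entry.
        return [(t, text) for (t, text) in transcript
                if abs(t - selected_time_s) <= window_s]

    if __name__ == "__main__":
        transcript = [(10.0, "Good morning, everyone."),
                      (95.0, "The budget for next quarter is nearly final."),
                      (130.0, "We still need sign-off from finance.")]
        print(summarize_transcript(transcript, ["budget"]))   # keyword-highlighted summary entries
        print(expand_selection(transcript, 95.0))             # full transcript around the selection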
Speaker recognition may also be used to provide summary information such as which speakers spoke during a specified segment of the media session. Speaker recognition may be provided using characteristics extracted from the speakers' voices. For example, the users may enroll in a speaker recognition program in which the speaker's voice is recorded and a number of features are extracted to form a voice print, template, or model. During the media session, the speech is compared against the previously created voice prints to identify the speaker. The speaker may also be identified using facial recognition software that identifies a person from a digital image or video frame. For example, selected facial features from the image may be compared with a facial database.
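For purposes of illustration only, the following sketch compares a feature vector extracted from a speech segment against previously enrolled voice prints using cosine similarity. The feature vectors shown are placeholders; a deployed system would use features produced by an actual speaker recognition engine from the recorded enrollment speech.

    import math
    from typing import Dict, List

    def cosine_similarity(a: List[float], b: List[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0

    def identify_speaker(segment_features: List[float],
                         voice_prints: Dict[str, List[float]],
                         threshold: float = 0.8) -> str:
        # Return the enrolled speaker whose voice print best matches the segment,
        # or "unknown" if no enrolled print exceeds the similarity threshold.
        best_name, best_score = "unknown", threshold
        for name, print_features in voice_prints.items():
            score = cosine_similarity(segment_features, print_features)
            if score > best_score:
                best_name, best_score = name, score
        return best_name

    if __name__ == "__main__":
        enrolled = {"Alice": [0.9, 0.1, 0.3], "Bob": [0.2, 0.8, 0.5]}
        print(identify_speaker([0.85, 0.15, 0.35], enrolled))  # -> Alice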
Media tagging can be used to transform media (video, audio, data) into a text-tagged file for use in presenting summary information. A search module can interact with the media tagging module to search the tagged information. The tags identified during a specified segment of the media session can be used to give the user a general idea of the topics that were discussed or mentioned in the media session. The tags may be processed, for example, using pulse video tagging techniques.
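For purposes of illustration only, the following sketch shows a time-stamped tag index for a media session and a simple search over the tags identified within a specified segment. The TagIndex representation and function names are hypothetical.

    from typing import List, Tuple

    TagIndex = List[Tuple[float, str]]  # (time in seconds, tag)

    def tags_in_segment(tags: TagIndex, start_s: float, end_s: float) -> List[str]:
        # Tags identified during the specified segment, in order of occurrence.
        return [tag for (t, tag) in sorted(tags) if start_s <= t <= end_s]

    def search_tags(tags: TagIndex, query: str) -> List[float]:
        # Times at which a queried tag (topic) was identified in the session.
        return [t for (t, tag) in tags if query.lower() in tag.lower()]

    if __name__ == "__main__":
        tags = [(120.0, "product roadmap"), (640.0, "hiring plan"), (910.0, "budget")]
        print(tags_in_segment(tags, 0.0, 700.0))  # topics covered before t = 700 s
        print(search_tags(tags, "budget"))        # when the budget was discussed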
Although the method and apparatus have been described in accordance with the embodiments shown, one of ordinary skill in the art will readily recognize that there could be variations made without departing from the scope of the embodiments. Accordingly, it is intended that all matter contained in the above description and shown in the accompanying drawings shall be interpreted as illustrative and not in a limiting sense.