The application relates generally to multimodal game video summarization in computer simulations and other applications.
A video summary of a computer simulation video or other video provides a concise video for quickly viewing highlights for, e.g., a spectating platform or online gaming platform to enhance the spectating experience. As understood herein, generating an effective summary video automatically is difficult, and generating a summary manually is time-consuming.
An apparatus includes at least one processor programmed with instructions to receive audio-video (AV) data and provide a video summary of the AV data that is shorter than the AV data at least in part by inputting to a machine learning (ML) engine first modality data and second modality data. The instructions are executable to receive the video summary of the AV data from the ML engine responsive to the inputting of the first and second modality data.
In example embodiments, the first modality data includes audio from the AV data and the second modality data includes computer simulation video from the AV data. In other implementations the second modality data can include computer simulation chat text related to the AV data.
In non-limiting examples the instructions are executable to execute the ML engine to extract at least a first parameter from the second modality data and provide the first parameter to an event relevance detector (ERD). In these examples the instructions can be executable to execute the ML engine to extract at least a second parameter from the first modality data and provide the second parameter to the ERD. The instructions can be further executable to execute the ERD to output the video summary at least in part based on the first and second parameters.
In another aspect, a method includes identifying an audio-video (AV) entity such as a computer game audio-video stream. The method includes using audio from the AV entity to identify plural first candidate segments of the AV entity for establishing a summary of the entity, and likewise using video from the AV entity to identify plural second candidate segments of the AV entity for establishing a summary of the entity. The method further includes identifying at least one parameter associated with chat related to the AV entity and selecting at least some of the plural first and second candidate segments based at least in part on the parameter. The method uses the at least some of the plural first and second candidate segments for generating a video summary of the AV entity that is shorter than the AV entity.
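By way of non-limiting illustration, the following Python sketch outlines the method at a high level. The segment tuples, the `chat_interest` mapping, and the length budget `max_fraction` are hypothetical stand-ins for the models and parameters described elsewhere herein, not a definitive implementation.

```python
def summarize(audio_candidates, video_candidates, chat_interest, duration_sec,
              max_fraction=0.1):
    """Select (start_sec, end_sec) segments for the video summary.

    audio_candidates / video_candidates: segments proposed from audio and video;
    chat_interest: dict mapping a segment to a chat-derived interest score;
    max_fraction: illustrative cap on summary length relative to the AV entity.
    """
    candidates = sorted(set(audio_candidates) | set(video_candidates))
    # Rank candidates by the chat-derived parameter and keep the most
    # interesting ones until the length budget is reached.
    ranked = sorted(candidates, key=lambda seg: chat_interest.get(seg, 0.0),
                    reverse=True)
    budget = max_fraction * duration_sec
    selected, total = [], 0.0
    for start, end in ranked:
        if total + (end - start) <= budget:
            selected.append((start, end))
            total += end - start
    return sorted(selected)

# Hypothetical usage with made-up segment boundaries and scores:
summary = summarize(
    audio_candidates=[(10, 15), (40, 45)],
    video_candidates=[(40, 45), (90, 95)],
    chat_interest={(10, 15): 0.9, (40, 45): 0.7, (90, 95): 0.2},
    duration_sec=600,
    max_fraction=0.02,
)
```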
In example implementations of the method, the method may include presenting the video summary on a display. In non-limiting embodiments, using video from the AV entity for identifying plural second candidate segments of the AV entity includes identifying scene changes in the AV entity. In addition, or alternatively, using video from the AV entity for identifying plural second candidate segments of the AV entity can include identifying text in the video of the AV entity.
In some embodiments using audio from the AV entity for identifying plural first candidate segments of the AV entity can include identifying acoustic events in the audio. In addition, or alternatively, using audio from the AV entity for identifying plural first candidate segments of the AV entity can include identifying pitch and/or amplitude of at least one voice in the audio. In addition, or alternatively, using audio from the AV entity for identifying plural first candidate segments of the AV entity can include identifying emotion in the audio. In addition, or alternatively, using audio from the AV entity for identifying plural first candidate segments of the AV entity can include identifying words in speech in the audio.
In example implementations, identifying the parameter associated with chat related to the AV entity can include identifying sentiment of chat. In addition, or alternatively, identifying the parameter associated with chat related to the AV entity may include identifying emotion of the chat. In addition, or alternatively, identifying the parameter associated with chat related to the AV entity can include identifying topic of the chat. In addition, or alternatively, identifying the parameter associated with chat related to the AV entity can include identifying at least one grammatical category of at least one word in the chat. In addition, or alternatively, identifying the parameter associated with chat related to the AV entity can include identifying a summary of the chat.
In another aspect, an assembly includes at least one display apparatus configured to present an audio-video (AV) computer game. At least one processor is associated with the display apparatus and is configured with instructions to execute a machine learning (ML) engine to generate a video summary of the computer game that is shorter than the computer game. The ML engine includes an acoustic event ML model trained to identify events in audio of the computer game, a speech pitch and power ML model trained to identify pitch and power in speech of the audio, and a speech emotion ML model trained to identify emotion in the audio. The ML engine also includes a scene change detector ML model trained to identify scene changes in video of the computer game. Further, the ML engine includes a text sentiment detector model trained to identify sentiment in text associated with chat related to the computer game, a text emotion detector model trained to identify emotion in text associated with the chat, and a text topic detector model trained to identify at least one topic of text associated with the chat. An event relevance detector (ERD) module is configured to receive input from the acoustic event ML model, speech pitch and power ML model, speech emotion ML model, and scene change detector ML model to identify plural candidate segments of the computer game and to select a subset of the plural candidate segments to establish the video summary based at least in part on input from one or more of the text sentiment detector model, text emotion detector model, and text topic detector model.
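The cooperation of the models and the ERD may be sketched as follows, where the per-segment score container and the threshold and top-k values are assumptions introduced purely for illustration.

```python
from dataclasses import dataclass

# Hypothetical per-segment container for the model outputs named above.
@dataclass
class SegmentScores:
    acoustic_event: float      # acoustic event ML model
    speech_pitch_power: float  # speech pitch and power ML model
    speech_emotion: float      # speech emotion ML model
    scene_change: float        # scene change detector ML model
    chat_sentiment: float      # text sentiment detector model
    chat_emotion: float        # text emotion detector model
    chat_topic: float          # text topic detector model

def erd_select(segments, av_threshold=0.5, keep_top_k=10):
    """ERD sketch: audio/video scores nominate candidate segments, and the
    chat-text scores select the subset that establishes the video summary."""
    candidates = [
        (idx, s) for idx, s in enumerate(segments)
        if max(s.acoustic_event, s.speech_pitch_power,
               s.speech_emotion, s.scene_change) >= av_threshold
    ]
    # Reinforce with chat: keep the candidates with the strongest text signal.
    candidates.sort(key=lambda item: item[1].chat_sentiment
                    + item[1].chat_emotion + item[1].chat_topic, reverse=True)
    return sorted(idx for idx, _ in candidates[:keep_top_k])
```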
The details of the present application, both as to its structure and operation, can best be understood in reference to the accompanying drawings, in which like reference numerals refer to like parts, and in which:
This disclosure relates generally to computer ecosystems including aspects of consumer electronics (CE) device networks such as but not limited to computer game networks. A system herein may include server and client components which may be connected over a network such that data may be exchanged between the client and server components. The client components may include one or more computing devices including game consoles such as Sony PlayStation® or a game console made by Microsoft or Nintendo or other manufacturer, virtual reality (VR) headsets, augmented reality (AR) headsets, portable televisions (e.g. smart TVs, Internet-enabled TVs), portable computers such as laptops and tablet computers, and other mobile devices including smart phones and additional examples discussed below. These client devices may operate with a variety of operating environments. For example, some of the client computers may employ, as examples, Linux operating systems, operating systems from Microsoft, or a Unix operating system, or operating systems produced by Apple, Inc., or Google. These operating environments may be used to execute one or more browsing programs, such as a browser made by Microsoft or Google or Mozilla or other browser program that can access websites hosted by the Internet servers discussed below. Also, an operating environment according to present principles may be used to execute one or more computer game programs.
Servers and/or gateways may include one or more processors executing instructions that configure the servers to receive and transmit data over a network such as the Internet. Or, a client and server can be connected over a local intranet or a virtual private network. A server or controller may be instantiated by a game console such as a Sony PlayStation®, a personal computer, etc.
Information may be exchanged over a network between the clients and servers. To this end and for security, servers and/or clients can include firewalls, load balancers, temporary storage, proxies, and other network infrastructure for reliability and security. One or more servers may form an apparatus that implements methods of providing a secure community such as an online social website to network members.
A processor may be a single- or multi-chip processor that can execute logic by means of various lines such as address lines, data lines, and control lines, as well as registers and shift registers.
Components included in one embodiment can be used in other embodiments in any appropriate combination. For example, any of the various components described herein and/or depicted in the Figures may be combined, interchanged, or excluded from other embodiments.
“A system having at least one of A, B, and C” (likewise “a system having at least one of A, B, or C” and “a system having at least one of A, B, C”) includes systems that have A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B, and C together, etc.
Now specifically referring to
Accordingly, to undertake such principles the AVD 12 can be established by some or all of the components shown in
In addition to the foregoing, the AVD 12 may also include one or more input ports 26 such as a high definition multimedia interface (HDMI) port or a USB port to physically connect to another CE device and/or a headphone port to connect headphones to the AVD 12 for presentation of audio from the AVD 12 to a user through the headphones. For example, the input port 26 may be connected via wire or wirelessly to a cable or satellite source 26a of audio video content. Thus, the source 26a may be a separate or integrated set top box, or a satellite receiver. Or, the source 26a may be a game console or disk player containing content. The source 26a when implemented as a game console may include some or all of the components described below in relation to the CE device 44.
The AVD 12 may further include one or more computer memories 28 such as disk-based or solid state storage that are not transitory signals, in some cases embodied in the chassis of the AVD as standalone devices or as a personal video recording device (PVR) or video disk player either internal or external to the chassis of the AVD for playing back AV programs or as removable memory media. Also in some embodiments, the AVD 12 can include a position or location receiver such as but not limited to a cellphone receiver, GPS receiver and/or altimeter 30 that is configured to receive geographic position information from a satellite or cellphone base station and provide the information to the processor 24 and/or determine an altitude at which the AVD 12 is disposed in conjunction with the processor 24. The component 30 may also be implemented by an inertial measurement unit (IMU) that typically includes a combination of accelerometers, gyroscopes, and magnetometers to determine the location and orientation of the AVD 12 in three dimensions.
Continuing the description of the AVD 12, in some embodiments the AVD 12 may include one or more cameras 32 that may be a thermal imaging camera, a digital camera such as a webcam, and/or a camera integrated into the AVD 12 and controllable by the processor 24 to gather pictures/images and/or video in accordance with present principles. Also included on the AVD 12 may be a Bluetooth transceiver 34 and other Near Field Communication (NFC) element 36 for communication with other devices using Bluetooth and/or NFC technology, respectively. An example NFC element can be a radio frequency identification (RFID) element.
Further still, the AVD 12 may include one or more auxiliary sensors 37 (e.g., a motion sensor such as an accelerometer, gyroscope, cyclometer, or a magnetic sensor, an infrared (IR) sensor, an optical sensor, a speed and/or cadence sensor, a gesture sensor (e.g., for sensing gesture commands)) providing input to the processor 24. The AVD 12 may include an over-the-air TV broadcast port 38 for receiving OTA TV broadcasts providing input to the processor 24. In addition to the foregoing, it is noted that the AVD 12 may also include an infrared (IR) transmitter and/or IR receiver and/or IR transceiver 42 such as an IR data association (IRDA) device. A battery (not shown) may be provided for powering the AVD 12, as may be a kinetic energy harvester that may turn kinetic energy into power to charge the battery and/or power the AVD 12.
Still referring to
Now in reference to the afore-mentioned at least one server 50, it includes at least one server processor 52, at least one tangible computer readable storage medium 54 such as disk-based or solid state storage, and at least one network interface 56 that, under control of the server processor 52, allows for communication with the other devices of
Accordingly, in some embodiments the server 50 may be an Internet server or an entire server “farm”, and may include and perform “cloud” functions such that the devices of the system 10 may access a “cloud” environment via the server 50 in example embodiments for, e.g., network gaming applications. Or, the server 50 may be implemented by one or more game consoles or other computers in the same room as the other devices shown in
It is to be understood that audio is first stripped from the video of the AV entity, and the audio and video are aligned in time (e.g., using timestamps) and processed by respective ML models in segments that may be, e.g., five seconds or other period in length. The segments are contiguous to each other and together make up the AV entity. Each ML model outputs a probability of an interesting segment, and a segment whose probability from either audio or video processing satisfies a threshold is a candidate for inclusion in the video summary 204, which includes audio and video of selected segments plus, if desired, X seconds of AV content on both sides of selected segments. As discussed further below, while both audio and video are used to identify candidate segments for the video summary, to avoid over-inclusion (and, hence, a too-long video summary), text from chat associated with the AV entity may be used to reinforce identified segments. This essentially limits the total length of segments that are included in the video summary to be no more than a predefined percentage of the full AV entity by eliminating candidate segments whose associated text from chat indicates less interest than other candidate segments.
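A minimal sketch of this segmentation and padding step is shown below; the five-second segment length, two-second pad, and plain tuples of timestamps are illustrative assumptions only.

```python
def make_segments(duration_sec, seg_len=5.0):
    """Split the AV entity into contiguous, fixed-length segments."""
    bounds, t = [], 0.0
    while t < duration_sec:
        bounds.append((t, min(t + seg_len, duration_sec)))
        t += seg_len
    return bounds

def pad_and_merge(selected, duration_sec, pad_sec=2.0):
    """Add X seconds of AV content on both sides of each selected segment
    and merge overlapping spans before cutting the summary video."""
    padded = [(max(0.0, s - pad_sec), min(duration_sec, e + pad_sec))
              for s, e in sorted(selected)]
    merged = []
    for start, end in padded:
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```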
A ML model may be trained as shown in
Commencing at block 300, the training set of data is input to the ML engine, such as by inputting the training set to various ML models that are to process respective types of data in an AV entity. As discussed further below, at block 302 the ML engine combines feature vectors of two or more data type modes to output at 304 the video summary of an AV entity, the efficacy of the predictions of which may be annotated and fed back to the ML engine to refine its processing.
The acoustic event detector 402 is trained to identify events in segments of audio of the AV entity that indicate interesting content and thus indicate that a particular segment is a candidate for inclusion in the video summary. The acoustic event detector 402 is described further below and may include one or more layers of convolutional neural networks (CNN) to identify acoustic events as being interesting based on a training set of events that are predefined as being “interesting”.
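One possible form of such a CNN-based detector is sketched below under the assumption that each audio segment is represented as a log-mel spectrogram patch; the layer sizes are arbitrary and illustrative, not the detector 402 itself.

```python
import torch
import torch.nn as nn

class AcousticEventDetector(nn.Module):
    """Toy CNN over a log-mel spectrogram patch; outputs the probability
    that an audio segment contains an 'interesting' acoustic event."""
    def __init__(self, n_mels=64, n_frames=128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * (n_mels // 4) * (n_frames // 4), 1),
        )

    def forward(self, spec):  # spec: (batch, 1, n_mels, n_frames)
        return torch.sigmoid(self.head(self.features(spec)))
```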
Similarly, the pitch and power detector 404 is an ML model trained to identify pitch and power in speech of the audio that indicate interesting content. Examples are higher voice pitches indicating more interest than lower pitches, wider variations in pitch indicating more interest than narrower variations, and louder voices indicating more interest than quieter speech. Pitch varies substantially when a player is excited or when an interesting event takes place, and this can be detected in the player's voice/speech. Thus, regions of the sound with high power and sudden variations in speech can be classified as candidate regions for highlight generation.
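As a sketch of the kind of cues such a model might use, the following assumes the librosa audio library is available and extracts rough pitch-variation and loudness features per segment; the feature names and thresholds that would follow are illustrative only.

```python
import numpy as np
import librosa  # assumption: librosa is available for audio analysis

def pitch_power_features(y, sr):
    """Rough per-segment cues: pitch variation and loudness, of the kind the
    pitch-and-power model could score for 'interestingness'."""
    f0, voiced, _ = librosa.pyin(
        y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C6"), sr=sr)
    f0 = f0[~np.isnan(f0)]             # keep voiced frames only
    rms = librosa.feature.rms(y=y)[0]  # frame-wise signal power
    return {
        "pitch_mean": float(f0.mean()) if f0.size else 0.0,
        "pitch_std": float(f0.std()) if f0.size else 0.0,  # wide variation -> excitement
        "rms_max": float(rms.max()),                       # loud speech -> more interest
    }
```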
The speech emotion ML model 406 is trained to identify emotion in the audio to identify interesting emotions. One or both of categorical emotion detection and dimensional emotion detection may be used. Categorical emotion detections may detect plural (e.g., ten) different categories of emotions such as but not limited to happiness, sadness, anger, anticipation, fear, loneliness, jealousy, and disgust. Dimensional emotion detection has two variables, namely, arousal and valence.
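A dual-head model of the kind described may be sketched as follows, assuming a precomputed audio embedding of size `feat_dim`; the dimensions and class count are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SpeechEmotionModel(nn.Module):
    """Sketch of a dual-head emotion model: a categorical head over N emotion
    classes and a dimensional head predicting (arousal, valence)."""
    def __init__(self, feat_dim=128, n_emotions=10):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU())
        self.categorical = nn.Linear(64, n_emotions)  # happiness, sadness, anger, ...
        self.dimensional = nn.Linear(64, 2)           # (arousal, valence)

    def forward(self, x):
        h = self.backbone(x)
        return torch.softmax(self.categorical(h), dim=-1), self.dimensional(h)
```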
The ERD 400 also may receive input from a text sentiment analyzer or detector model 410 that is trained to identify parameters such as but not limited to sentiment and emotion in text associated with the chat 412 related to the AV entity. Sentiment is different from emotion, in that sentiment is generally positive or negative while emotion is more specific as discussed further below. Positive sentiment for example may be correlated to an interesting segment and negative sentiment may be correlated to a less interesting segment.
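The sentiment signal may be produced by any suitable text classifier; the sketch below assumes the Hugging Face transformers library and its default sentiment-analysis pipeline, and maps the chat messages in a segment to a single positive-sentiment score that the ERD can treat as evidence of an interesting segment.

```python
# Assumption: the Hugging Face transformers library is installed; any
# comparable sentiment classifier could be substituted.
from transformers import pipeline

sentiment = pipeline("sentiment-analysis")

def chat_sentiment_score(messages):
    """Average positive-sentiment confidence over a segment's chat messages."""
    if not messages:
        return 0.0
    results = sentiment(messages)
    positives = [r["score"] for r in results if r["label"] == "POSITIVE"]
    return sum(positives) / len(messages)
```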
The ERD 400 receives probabilities from the ML models described herein to identify plural candidate segments of the AV entity on the basis of an audio-based or video-based probability for a segment satisfying a threshold. The ERD 400 selects a subset of the plural candidate segments based on chat text-based probabilities to establish the video summary.
In addition, each voice track may be input to an automatic speech recognition (ASR) model 420, which converts the speech of each track to words and sends probabilities of the words indicating terms of interest as defined by the training set for the model to the ERD 400. The automatic speech recognition model 420 can also identify segments as uninteresting based on lengthy periods of no speech.
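A simple illustration of scoring the ASR output is given below; the word list, the set of terms of interest, and the minimum-word threshold for flagging silence are hypothetical.

```python
def transcript_interest(words, terms_of_interest, min_words=3):
    """Score a segment's ASR transcript: overlap with terms of interest raises
    the score; a near-empty transcript (lengthy silence) marks the segment as
    uninteresting."""
    if len(words) < min_words:
        return 0.0  # lengthy period of no speech
    hits = sum(1 for w in words if w.lower() in terms_of_interest)
    return hits / len(words)

# Hypothetical usage with a transcript from the ASR model 420:
score = transcript_interest(["triple", "kill", "wow"], {"kill", "wow", "clutch"})
```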
As shown in
Turning now to the chat text portion of the ML engine, chat may be used to reinforce summary predictions based on video and audio. As shown in
A named entity recognition (NER) and aspect detection (NERAD) model 430 may be used to output probabilities of interesting grammatical types detected in the input text based on a training set correlating words to interesting and uninteresting grammatical types. For example, the NERAD model 430 may output a probability that a term is a proper noun, which may be predefined to be of more interest than an adjective. The NERAD model 430 may also output probabilities that a brief summary of the text in the segment indicates an interesting or uninteresting segment.
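A grammatical-category scorer in the spirit of the NERAD model might be sketched as follows, assuming spaCy with its small English model installed (python -m spacy download en_core_web_sm); the part-of-speech weights are illustrative only.

```python
import spacy

nlp = spacy.load("en_core_web_sm")

# Illustrative weights: proper nouns (e.g., character names) count more
# toward segment interest than common nouns or adjectives.
POS_WEIGHTS = {"PROPN": 1.0, "NOUN": 0.5, "ADJ": 0.1}

def grammatical_interest(chat_text):
    """Score chat text by the grammatical categories of its words."""
    doc = nlp(chat_text)
    if len(doc) == 0:
        return 0.0
    return sum(POS_WEIGHTS.get(tok.pos_, 0.0) for tok in doc) / len(doc)
```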
Note that the chat text may include “stickers” or emoticons that may in some cases require purchase by a user to employ, meaning that attachment of such a sticker to chat may indicate greater interest in the corresponding segment to reinforce learning derived from other modalities.
Note further that in addition to receiving the text from the chat 412, the chat text-based models can also receive terms from the automatic speech recognition model 420 to process along with the terms in the chat text.
Also, as indicated at 1206, the fundamental frequency variation (pitch variation) of the signal 1200 is identified. These variations are indicated at 1208. The model is trained to identify interesting segments from the shapes of the variations. ASR and NER as discussed above in relation to
Note that in embodiments in which the ERD 400 is implemented by a ML model, the ERD model may be trained using a set of audio, video, and chat text probabilities and corresponding video summaries derived therefrom as generated by human annotators.
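Such training may be sketched as below, where the ERD is treated as a small classifier over the per-segment probabilities from the audio, video, and chat-text models, with human-annotated "include in summary" labels as ground truth; scikit-learn and the toy data rows are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([
    # acoustic, pitch/power, emotion, scene, chat_sent, chat_emo, chat_topic
    [0.9, 0.8, 0.7, 0.2, 0.9, 0.8, 0.6],  # annotated as a highlight
    [0.1, 0.2, 0.1, 0.9, 0.2, 0.1, 0.3],  # annotated as not a highlight
])
y = np.array([1, 0])  # human annotator labels

erd = LogisticRegression().fit(X, y)
include_prob = erd.predict_proba(X)[:, 1]  # probability each segment is kept
```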
As indicated at 1400 and 1402, respectively, metadata may be received from both the game event data 434 in
Block 1406 indicates that portions of video that are the subject of current time-aligned metadata may be visibly highlighted by, e.g., increasing the brightness of the portions, presenting a line around the portions, etc. For example, if the metadata includes a proper noun (name of a character), that character may be highlighted in the video summary during the time the metadata pertains to. In other words, any or all of the metadata may be visually indicated by highlighting the associated portions of the video summary.
The metadata also may be used at block 1408 to generate text that can be overlaid onto the video summary. Any or all of the metadata accordingly may be textually presented on a portion of the video summary. This metadata can include who has expressed likes for certain portions of the AV entity summarized in the video summary, themes present in the video summary as derived from, e.g., the Aspect Detection block, emoticons representing emotions indicated in the metadata, and so on.
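The highlighting and text-overlay steps might be implemented per frame as sketched below, assuming OpenCV is available; the bounding box tied to the time-aligned metadata and the caption string are hypothetical.

```python
import cv2

def annotate_frame(frame, box, caption):
    """Brighten and outline the metadata's region of the frame, then overlay
    metadata-derived text (e.g., likes, themes) onto the frame."""
    x, y, w, h = box
    frame[y:y+h, x:x+w] = cv2.convertScaleAbs(frame[y:y+h, x:x+w],
                                              alpha=1.0, beta=40)
    cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 255), 2)
    cv2.putText(frame, caption, (10, frame.shape[0] - 20),
                cv2.FONT_HERSHEY_SIMPLEX, 0.8, (255, 255, 255), 2)
    return frame
```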
It will be appreciated that while present principles have been described with reference to some example embodiments, these are not intended to be limiting, and that various alternative arrangements may be used to implement the subject matter claimed herein.
Number | Date | Country
---|---|---
63074333 | Sep 2020 | US