The present application relates generally to real-time AI screening and auto-moderation of audio comments in a livestream.
Livestreamers on services like Twitch, Instagram Live, YouTube Live, and TikTok Live broadcast to hundreds, thousands, even millions of fans in real time. The streams usually involve a single person in front of an Internet-connected camera or smartphone, talking to the audience directly via the camera.
One way these streamers keep their streams interesting is to have a live, real-time chat running, where viewers can type in text or emoji reactions to what the streamer is doing or saying.
As understood herein, the streamer may read these text comments on stream as a way of engaging with the audience. However, popular streams can have dozens or hundreds of chat comments per minute, making the stream of comments basically unreadable because they fly by so fast.
In similar media such as talk radio, listeners are able to call into the show to ask the host a question or to comment on what the host has said. This adds a different dimension to the show because callers can hear themselves (or people like them) on the air, instead of just hearing the host reading text comments in his/her own voice.
Such a “call-in” mechanic isn't practical for a livestream, however, because there is generally no way for the streamer to screen the call. Whereas a talk radio show has staff to answer calls and make sure each call is legitimate and has an interesting question, the streamer is typically working alone and must provide an entertaining livestream during the entire broadcast; the streamer cannot simultaneously screen audio calls while streaming.
Even with a separate person screening calls, a talk radio host will sometimes be trolled by a disingenuous caller who makes it past the screener. The caller will curse live on the air, berate the host, or otherwise behave in a disruptive manner. The risk of this kind of bad behavior is especially high for livestreamers, due to the types of audiences they attract. If they were to allow any type of user's call to go out on their stream unscreened, they would be at high risk of people saying offensive things. Therefore, livestreamers generally don't allow their viewers' voices to go out live on their streams.
It is in this context that present principles arise.
Accordingly, a system includes at least one computer medium that is not a transitory signal and that in turn includes instructions executable by at least one processor assembly to receive from at least a first viewer of a computer network livestream at least one audio comment. The instructions are executable to convert the audio comment to text and to use at least one machine learning (ML) model to process the text to identify whether the text contains first content. Responsive to the text not containing first content, the instructions are executable to present the text on at least one display of a person generating the livestream, and responsive to the person selecting the text, send the audio comment with the livestream.
The first content may include one or more of profanity, hate speech, personally-identifiable information, or a topic different from a topic being discussed in the livestream (off-topic).
In some examples the instructions can be executable to allow the person generating the livestream to define the first content to be identified by the ML model.
In example implementations the instructions may be executable to present on the display along with the text at least one selector selectable to cause the audio comment to be inserted into the livestream.
If desired, the instructions can be executable to indicate that first text represents first content for a first segment of the livestream and to indicate that first text does not represent first content for a second segment of the livestream.
In another aspect, a method includes analyzing audio associated with a livestream, and automatically blocking the audio from being included in the livestream responsive to the audio containing a first characteristic.
The first characteristic can include one or more of profanity, hate speech, personally-identifiable information, off-topic content, or non-verbal audio feature. The audio can be spoken by a livestreamer transmitting the livestream or by a viewer of the livestream.
In another aspect, an apparatus includes at least one processor assembly configured to identify at least one word spoken by a person associated with a livestream, and identify whether the word is of a class not desired to be presented in the livestream. The processor assembly is configured to, responsive to the word being of a class not desired to be presented in the livestream, block audio of the word from being sent in the livestream.
The details of the present application, both as to its structure and operation, can be best understood in reference to the accompanying drawings, in which like reference numerals refer to like parts, and in which:
This disclosure relates generally to computer ecosystems including aspects of consumer electronics (CE) device networks such as but not limited to computer game networks. A system herein may include server and client components which may be connected over a network such that data may be exchanged between the client and server components. The client components may include one or more computing devices including game consoles such as Sony PlayStation® or a game console made by Microsoft or Nintendo or other manufacturer, virtual reality (VR) headsets, augmented reality (AR) headsets, portable televisions (e.g., smart TVs, Internet-enabled TVs), portable computers such as laptops and tablet computers, and other mobile devices including smart phones and additional examples discussed below. These client devices may operate with a variety of operating environments. For example, some of the client computers may employ, as examples, Linux operating systems, operating systems from Microsoft, or a Unix operating system, or operating systems produced by Apple, Inc., or Google. These operating environments may be used to execute one or more browsing programs, such as a browser made by Microsoft or Google or Mozilla or other browser program that can access websites hosted by the Internet servers discussed below. Also, an operating environment according to present principles may be used to execute one or more computer game programs.
Servers and/or gateways may include one or more processors executing instructions that configure the servers to receive and transmit data over a network such as the Internet. Or a client and server can be connected over a local intranet or a virtual private network. A server or controller may be instantiated by a game console such as a Sony PlayStation®, a personal computer, etc.
Information may be exchanged over a network between the clients and servers. To this end and for security, servers and/or clients can include firewalls, load balancers, temporary storages, proxies, and other network infrastructure for reliability and security. One or more servers may form an apparatus that implements methods of providing a secure community such as an online social website to network members.
A processor may be a single- or multi-chip processor that can execute logic by means of various lines such as address lines, data lines, and control lines and registers and shift registers. A processor assembly may include one or more processors acting independently or in concert with each other to execute an algorithm.
Components included in one embodiment can be used in other embodiments in any appropriate combination. For example, any of the various components described herein and/or depicted in the Figures may be combined, interchanged, or excluded from other embodiments.
“A system having at least one of A, B, and C” (likewise “a system having at least one of A, B, or C” and “a system having at least one of A, B, C”) includes systems that have A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B, and C together, etc.
Now specifically referring to
Accordingly, to undertake such principles the AVD 12 can be established by some, or all of the components shown in
In addition to the foregoing, the AVD 12 may also include one or more input and/or output ports 26 such as a high-definition multimedia interface (HDMI) port or a USB port to physically connect to another CE device and/or a headphone port to connect headphones to the AVD 12 for presentation of audio from the AVD 12 to a user through the headphones. For example, the input port 26 may be connected via wire or wirelessly to a cable or satellite source 26a of audio video content. Thus, the source 26a may be a separate or integrated set top box, or a satellite receiver. Or the source 26a may be a game console or disk player containing content. The source 26a, when implemented as a game console, may include some or all of the components described below in relation to the CE device 48.
The AVD 12 may further include one or more computer memories 28 such as disk-based or solid-state storage that are not transitory signals, in some cases embodied in the chassis of the AVD as standalone devices, or as a personal video recording device (PVR) or video disk player either internal or external to the chassis of the AVD for playing back AV programs, or as removable memory media, or the below-described server. Also, in some embodiments, the AVD 12 can include a position or location receiver such as but not limited to a cellphone receiver, GPS receiver and/or altimeter 30 that is configured to receive geographic position information from a satellite or cellphone base station and provide the information to the processor 24 and/or determine an altitude at which the AVD 12 is disposed in conjunction with the processor 24. The component 30 may also be implemented by an inertial measurement unit (IMU) that typically includes a combination of accelerometers, gyroscopes, and magnetometers to determine the location and orientation of the AVD 12 in three dimensions, or by an event-based sensor.
Continuing the description of the AVD 12, in some embodiments the AVD 12 may include one or more cameras 32 that may be a thermal imaging camera, a digital camera such as a webcam, an event-based sensor, and/or a camera integrated into the AVD 12 and controllable by the processor 24 to gather pictures/images and/or video in accordance with present principles. Also included on the AVD 12 may be a Bluetooth transceiver 34 and other Near Field Communication (NFC) element 36 for communication with other devices using Bluetooth and/or NFC technology, respectively. An example NFC element can be a radio frequency identification (RFID) element.
Further still, the AVD 12 may include one or more auxiliary sensors 38 (e.g., a motion sensor such as an accelerometer, gyroscope, cyclometer, or a magnetic sensor, an infrared (IR) sensor, an optical sensor, a speed and/or cadence sensor, an event-based sensor, a gesture sensor (e.g., for sensing a gesture command)) providing input to the processor 24. The AVD 12 may include an over-the-air TV broadcast port 40 for receiving OTA TV broadcasts providing input to the processor 24. In addition to the foregoing, it is noted that the AVD 12 may also include an infrared (IR) transmitter and/or IR receiver and/or IR transceiver 42 such as an IR data association (IRDA) device. A battery (not shown) may be provided for powering the AVD 12, as may be a kinetic energy harvester that may turn kinetic energy into power to charge the battery and/or power the AVD 12. A graphics processing unit (GPU) 44 and field programmable gate array 46 also may be included. One or more haptics generators 47 may be provided for generating tactile signals that can be sensed by a person holding or in contact with the device.
Still referring to
Now in reference to the afore-mentioned at least one server 52, it includes at least one server processor 54, at least one tangible computer readable storage medium 56 such as disk-based or solid-state storage, and at least one network interface 58 that, under control of the server processor 54, allows for communication with the other devices of
Accordingly, in some embodiments the server 52 may be an Internet server or an entire server “farm” and may include and perform “cloud” functions such that the devices of the system 10 may access a “cloud” environment via the server 52 in example embodiments for, e.g., network gaming applications. Or the server 52 may be implemented by one or more game consoles or other computers in the same room as the other devices shown in
The components shown in the following figures may include some or all components shown in
Present principles may employ various machine learning models, including deep learning models. Machine learning models consistent with present principles may use various algorithms trained in ways that include supervised learning, unsupervised learning, semi-supervised learning, reinforcement learning, feature learning, self-learning, and other forms of learning. Examples of such algorithms, which can be implemented by computer circuitry, include one or more neural networks, such as a convolutional neural network (CNN), a recurrent neural network (RNN), and a type of RNN known as a long short-term memory (LSTM) network. Support vector machines (SVM) and Bayesian networks also may be considered to be examples of machine learning models. However, a preferred network contemplated herein is a generative pre-trained transformer (GPT) that is trained using unsupervised training techniques described herein.
As understood herein, performing machine learning may therefore involve accessing and then training a model on training data to enable the model to process further data to make inferences. An artificial neural network/artificial intelligence model trained through machine learning may thus include an input layer, an output layer, and multiple hidden layers in between that are configured and weighted to make inferences about an appropriate output.
Turning to
In the example shown, the livestream computer 302 includes at least one processor 310 with computer storage controlling at least one video display 312, typically including or being associated with at least one audio speaker as well as a video display, and at least one network interface 314 for communicating with the WAN 308. The processor 310 may receive images of, for example, the livestreamer 300 from one or more cameras 316 as well as audio spoken by the livestreamer from one or more microphones 318.
If desired, a viewer computer 304 may include components similar to those shown for the livestream computer 302. For example, a viewer computer 304 may include at least one processor 320 with computer storage controlling at least one video display 322, typically including or being associated with at least one audio speaker as well as a video display, and at least one network interface 324 for communicating with the WAN 308. The processor 320 may receive images of, for example, the viewer 306 from one or more cameras 326 as well as audio spoken by the viewer from one or more microphones 328.
In
Now refer to
However, responsive to the text containing objectionable content, the logic moves from state 904 to block 908. At block 908, either the objectionable portion of the text is filtered out and the remainder is passed to the livestream computer, or the text is entirely blocked from being presented on the livestream computer. In the former instance, if the livestreamer subsequently selects the filtered text, a filtered version of the original audio clip is presented consistent with the filtered-out text.
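The screening logic above can be sketched as follows, with the caveat that `transcribe` and `classify` are hypothetical stand-ins for a real speech-to-text service and a trained ML classifier, and the category names are illustrative:

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Categories treated as "objectionable content" herein (illustrative set).
OBJECTIONABLE = {"profanity", "hate_speech", "pii", "off_topic"}

@dataclass
class AudioComment:
    viewer_id: str
    audio: bytes
    text: str = ""
    labels: List[str] = field(default_factory=list)

def screen_comment(comment: AudioComment,
                   transcribe: Callable[[bytes], str],
                   classify: Callable[[str], List[str]]) -> str:
    """Convert the audio comment to text, classify it, and route it."""
    comment.text = transcribe(comment.audio)
    comment.labels = classify(comment.text)
    if OBJECTIONABLE.isdisjoint(comment.labels):
        return "present"  # show the text to the livestreamer for selection
    return "block"        # withhold it from the livestreamer's comment feed
```

In this sketch, a "present" result corresponds to displaying the text for the livestreamer, who may then select it to play the underlying audio on the stream.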
It is to be understood that the ML model may be trained on a training set of terms and ground truth labels as to categories of the terms, with certain categories being labeled as “objectionable”. Moreover, the model may be trained to recognize topics of phrases using a training set of phrases and ground truth labels as to what topics the phrases are associated with for subsequent screening of off-topic viewer messages.
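As a minimal illustration of that labeling scheme (the terms, categories, and function names below are invented for illustration and stand in for a trained model, not real training data), a lookup-based screener might look like:

```python
# Each training pair maps a term to a ground-truth category label;
# certain categories are designated "objectionable."
TRAINING_SET = [
    ("darnit", "profanity"),
    ("slur_example", "hate_speech"),
    ("123 main st", "pii"),
    ("nice shot", "praise"),
]
OBJECTIONABLE_CATEGORIES = {"profanity", "hate_speech", "pii"}

def build_lexicon(pairs):
    """'Train' by memorizing term -> category; a real system generalizes."""
    return {term: category for term, category in pairs}

def is_objectionable(text, lexicon):
    """Flag text containing any term whose category is objectionable."""
    text = text.lower()
    return any(term in text for term, cat in lexicon.items()
               if cat in OBJECTIONABLE_CATEGORIES)
```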
The livestreamer can watch the feed of text comments during streaming. The livestreamer can feel secure in knowing that the comments in the feed have been filtered to remove anything unwanted in the livestream. Also, the sheer volume of comments to view will be far lower than it would be absent present principles, because only the most substantive comments will make it through the moderation algorithm. The livestreamer can read the text of the comment (silently) to identify comments that may be interesting, funny, controversial, entertaining, etc. for the audience.
Upon identifying a comment worth broadcasting on the stream, the livestreamer can click the post selector 504 on the text version of the comment, which plays the audio version of the comment live on the stream. The livestreamer can then react to that comment with his or her own live commentary.
The end result for the audience would be an effect much like a live caller calling into a talk radio show, but without the risk that the caller might say something offensive or unexpected once live on the air. It allows audience members to hear their own voices, or the voices of people like them, on the stream, which can make for a more engaging and interactive viewing experience.
Proceeding to block 1104, the ML model outputs text representing the viewer's action depicted in the video at block 1100. The text is processed by an ML model at block 1106 to identify whether it describes any objectionable actions. For example, certain gestures may be defined as objectionable, or certain facial expressions. The model may be trained on a training set of text describing actions along with ground truth labels indicating whether the text describes an objectionable action. Text describing non-objectionable actions may be presented on the display of the livestreamer at block 1108.
The text of a livestream may be analyzed by a machine learning algorithm to build a contextual and semantic understanding of the content of the livestream. This understanding can be more general or more specific depending on the length of time being analyzed. For example, an analysis of an entire livestream might conclude that one stream is about “sports” and another is about “politics.” Looking at smaller increments of time, say five-to-ten-minute segments of the livestream, can yield a more specific analysis. The sports stream might be about “basketball” for a few minutes before turning to “baseball.” When basketball is being discussed, for example, comments about baseball and other off-topic subjects may be blocked.
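A keyword-count sketch of that per-segment topic analysis follows; the topic lexicons are invented for illustration, and a production system would use a trained topic model rather than hand-built keyword sets:

```python
from collections import Counter

# Illustrative topic lexicons; a real system would learn these.
TOPIC_KEYWORDS = {
    "basketball": {"basketball", "dunk", "lakers", "celtics"},
    "baseball": {"baseball", "inning", "astros", "pitcher"},
}

def topic_for_window(transcript_words):
    """Assign a coarse topic to one time window of transcript words."""
    scores = Counter()
    for word in transcript_words:
        for topic, keywords in TOPIC_KEYWORDS.items():
            if word.lower() in keywords:
                scores[topic] += 1
    return scores.most_common(1)[0][0] if scores else "unknown"
```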
Accordingly, at block 1400 in
In an even smaller increment of time, the basketball portion may be broken into a minute talking about the “Lakers” and another minute talking about the “Celtics”, with text relating to Celtics being screened out when the discussion concerns the Lakers and vice-versa.
Once the system has an understanding of the topic of the livestream, each audio comment can be processed by the same algorithm. For example, assume a first comment is about “voting rights,” a second comment is about “misinformation,” and a third comment is about “the Astros winning the World Series.” The topical and semantic analysis of the livestream can then be compared to the analysis of the audio comment to determine whether the comment is relevant to the topic of the stream. Thus, even though the third comment about the Astros winning the World Series is not necessarily offensive, it can be filtered out of the comments shown to the politics livestreamer, because it is not relevant to the current topic(s) of the livestream.
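One simple way to compare the analysis of the stream against the analysis of a comment is a cosine similarity over word counts; this is offered only as an assumption for illustration, since the disclosure does not prescribe a particular comparison, and the threshold value is invented:

```python
import math
from collections import Counter

def cosine(words_a, words_b):
    """Cosine similarity between two bag-of-words vectors."""
    ca, cb = Counter(words_a), Counter(words_b)
    num = sum(ca[w] * cb[w] for w in set(ca) & set(cb))
    den = (math.sqrt(sum(v * v for v in ca.values()))
           * math.sqrt(sum(v * v for v in cb.values())))
    return num / den if den else 0.0

def is_on_topic(comment_words, stream_words, threshold=0.1):
    """Keep a comment only if it overlaps the stream's recent vocabulary."""
    return cosine(comment_words, stream_words) >= threshold
```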
Because the livestream content has been analyzed and broken into time blocks related to different topics, a temporal dimension can also be added to the moderation. For example, a comment about “voting rights” that comes more than five minutes after the topic of “voting rights” has last been discussed by the streamer could be filtered out as irrelevant, because the streamer is no longer discussing that topic. The analysis of the audio comments can also filter out comments which aren't necessarily offensive, but which are inappropriate in context. For example, the moderation could filter out parasocial comments that express some love or attraction to the streamer. Things like requests to view certain body parts, requests for a date or to meet in person, expressions of love or affection, personal questions about the streamer's life, requests for personally identifiable information, etc. may be filtered out.
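The temporal dimension can be sketched as a recency check; the five-minute (300-second) window mirrors the example above, and the data layout is an assumption:

```python
def is_timely(comment_topic, comment_time, topic_last_seen, window=300):
    """Pass a comment only if its topic was discussed within `window` seconds.

    topic_last_seen maps each topic to the last timestamp (in seconds)
    at which the streamer discussed it.
    """
    last = topic_last_seen.get(comment_topic)
    return last is not None and comment_time - last <= window
```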
Thus, moderation/censorship can also be applied to the livestreamer's outgoing stream. If the system is analyzing the content of the streamer's audio, it can detect when the streamer says something he shouldn't have and can proactively censor the stream audio to ensure the questionable content isn't broadcast to the audience.
As another example, a livestreamer may accidentally reveal information about where he lives, such as the name of a hometown, a high school, a nearby landmark, etc. Without any kind of screening, that information would go out unfiltered to the audience, and it's impossible to take back once it is out there. This system could mute, “bleep out,” or otherwise censor this sensitive information before it is broadcast to the audience. Alternatively, as divulged above the system can warn the streamer that the streamer just said something objectionable, and the streamer can manually choose to censor that phrase as shown in
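A rough sketch of the mute/“bleep out” step follows, assuming a speech-to-text pass supplies time spans for the sensitive words; the function name, span format, and sample rate are hypothetical:

```python
def mute_spans(samples, spans, sample_rate=16000):
    """Silence flagged time ranges in a PCM buffer before broadcast.

    samples: sequence of PCM sample values
    spans: list of (start_sec, end_sec) ranges to silence
    """
    out = list(samples)
    for start, end in spans:
        lo = int(start * sample_rate)
        hi = min(int(end * sample_rate), len(out))
        for i in range(lo, hi):
            out[i] = 0  # zeroed samples are heard as silence
    return out
```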
In an additional feature, for a livestreamer to be confident that an audio comment is suitable for broadcast, additional analysis beyond the content of the livestream audio may be provided. This analysis may determine whether the overall volume is appropriate (neither too loud nor too quiet), whether the voice volume relative to background noise is appropriate so that the speaker's voice isn't drowned out by noise, whether the amount of background noise is excessive, whether the audio contains uncomfortable or offensive sounds such as very high or very low frequencies, spikes in volume, gunshots, or offensive nonverbal utterances, and the age of the speaker using voice age analysis, to disallow children from adding to the stream. This additional analysis may use as input audio features such as spectrum, amplitude, and frequency that are input to a ML model trained to detect such irregularities or objectionable sounds on a training set of spectra, amplitudes, and frequencies along with ground truth labels as to whether the audio components represent objectionable sounds.
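The loudness and voice-to-background checks above can be sketched from raw samples as follows; the RMS formulation and all threshold values are illustrative assumptions, and the ML-based detection of offensive sounds is not shown:

```python
import math

def rms(samples):
    """Root-mean-square level of a sample buffer (0.0 for empty input)."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def audio_acceptable(voice, background,
                     min_rms=0.01, max_rms=0.9, min_snr=2.0):
    """Reject audio that is too quiet, too loud, or drowned out by noise."""
    level = rms(voice)
    if not (min_rms <= level <= max_rms):
        return False  # overall volume out of range
    noise = rms(background)
    return noise == 0 or level / noise >= min_snr  # voice-to-noise check
```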
While the particular embodiments are herein shown and described in detail, it is to be understood that the subject matter which is encompassed by the present invention is limited only by the claims.