Machine learning relies on the development of algorithms and statistical models to execute tasks using pattern recognition and inference. In some cases, machine learning may provide opportunities to reduce the processing and memory resources that are expended in executing a particular task, for example through efficient machine-learned models that reduce expenditure of computer resources. For example, depending on a particular machine learning application, server machines having central processing unit (CPU), graphics processing unit (GPU), tensor processing unit (TPU), or application-specific integrated circuit (ASIC) components may be used to execute algorithms to perform some task or to achieve some operational result. Reducing unnecessary usage of the machine components may extend the time before equipment must be replaced or refreshed, resulting in less down time. For example, machine learning algorithms may be tailored to make efficient use of server resources for predictive maintenance of the associated servers. Accordingly, machine learning may be used to improve the useful life of server infrastructure by reducing inefficient use of processor and memory resources.
Machine learning has been used in many different types of applications. As an example, machine learning has been used to develop topic models for describing content of textual documents within a large document collection by discovering latent topics underlying the documents. For example, the words comprising a document come from a mixture of topics, wherein a topic can be defined as a probability distribution over the words. Dynamic topic modeling is an established tool for capturing temporal dynamics of the topics of a corpus. However, dynamic topic models can only consider a small set of frequent words because of their computational complexity and insufficient data for less frequent words. Moreover, conventional topic models do not consider word correlation. The assumption of word independence given the topic does not allow information sharing across words, which in practice limits the applicability of topic models to corpora with large vocabularies and short documents.
In general terms, this disclosure is directed to dynamic topic models. In one possible configuration and by non-limiting example, a dynamic word correlated topic model (DWCTM) (also referred to as MIST) identifies underlying topics of a set of documents or user listening sessions that span a period of time, and models, for each topic, a topic popularity, a word embedding, and a correlation with other topics across the period of time to capture the evolution thereof. In some examples, the output of the DWCTM can be provided for post-processing such that the output can be utilized or further applied in a wide array of scenarios. Various aspects are described in this disclosure, which include, but are not limited to, the following aspects.
One aspect is a method for dynamic topic modeling with word correlation, the method comprising: receiving, as input to a DWCTM, a set of documents each comprised of a plurality of words and having associated timestamps, wherein the timestamps of the set of documents span a period of time; identifying, as input to the DWCTM, a quantity of topics for modeling; providing the set of documents as input to the DWCTM for modeling according to the quantity of topics identified; for each topic, modeling via the DWCTM: a document-topic distribution across the period of time to yield a popularity of each of the topics across the period of time; a topic-word distribution across the period of time that captures a correlation among the plurality of words to yield a word embedding; and a series of covariance matrices to yield a correlation of each topic with other topics across the period of time; and providing, as output of the DWCTM: the popularity of each of the topics across the period of time; the word embedding across the period of time; and the correlation of each topic with other topics across the period of time.
Another aspect is a system for dynamic topic modeling with word correlation, the system comprising: a DWCTM; and a server communicatively coupled to the DWCTM, the server comprising at least one processing device and a memory coupled to the at least one processing device and storing instructions, that when executed by the at least one processing device, cause the at least one processing device to: receive a set of documents each comprised of a plurality of words and having associated timestamps, wherein the timestamps for the set of documents span a period of time; identify, as input to the DWCTM, a quantity of topics for modeling; provide the set of documents as input to the DWCTM for modeling according to the quantity of topics identified; for each topic, model via the DWCTM: a document-topic distribution across the period of time to yield a popularity of each of the topics across the period of time; a topic-word distribution at given time points across the period of time that captures a correlation among the plurality of words to yield a word embedding; and a series of covariance matrices to yield a correlation of each topic with other topics across the period of time; and provide, as output of the DWCTM: the popularity of each of the topics across the period of time; the word embedding across the period of time; and the correlation of each topic with other topics across the period of time.
A further aspect is a system for dynamic topic modeling with word correlation related to user consumption of media content items over time, the system comprising: a DWCTM; and a server communicatively coupled to the DWCTM, the server comprising at least one processing device and a memory coupled to the at least one processing device and storing instructions, that when executed by the at least one processing device, cause the at least one processing device to: receive a set of user listening sessions each comprised of a plurality of media content items and having associated timestamps, wherein the plurality of media content items include one or more types of media content metadata and timestamps for the set of user listening sessions that span a period of time; identify, as input to the DWCTM, a quantity of topics for modeling; provide the set of user listening sessions as input to the DWCTM for modeling based on the quantity of topics identified; for each topic, model via the DWCTM: a media content item-topic distribution across the period of time to yield a popularity of each of the topics across the period of time; a topic-media content metadata distribution across the period of time that captures a correlation among the set of user listening sessions comprising a media content metadata embedding; and a series of covariance matrices to yield a correlation of each topic with other topics across the period of time; and provide, as output of the DWCTM: the popularity of each of the topics across the period of time; the media content metadata embedding across the period of time; and the correlation of each topic with other topics across the period of time.
Various embodiments will be described in detail with reference to the drawings, wherein like reference numerals represent like parts and assemblies throughout the several views. Reference to various embodiments does not limit the scope of the claims attached hereto. Additionally, any examples set forth in this specification are not intended to be limiting and merely set forth some of the many possible embodiments for the appended claims.
The modeling system 106 is hosted by a service 102. In some examples, the service 102 can also host another system that utilizes outputs 116 of a dynamic topic model 107 as inputs to derive additional information, referred to hereinafter as application system 110. As one non-limiting example, the service 102 is a media streaming service that includes a media delivery system that utilizes the outputs 116 of a dynamic topic model 107 for generating, monitoring, and/or providing media content items to users for consumption, as discussed in greater detail with references to
The service 102 can also include one or more databases 112 for storing at least a portion of inputs 114 to a dynamic topic model 107, such as DCTM 201 or DWCTM 1501 for example. The service 102 can receive the inputs 114 directly from client devices 122 or indirectly from one or more third party services 124 that collect the inputs 114 from the client devices 122 or otherwise store the inputs 114 in a database, library, or archive, for example. The service 102 can receive the inputs 114 over a network 120.
The inputs 114 include the data collection and a quantity of topics to be modeled. The data collection is a set of data items that span a period of time, where the data items are comprised of a vocabulary that can be clustered or grouped to represent one or more topics. A topic is defined as a probability distribution over the vocabulary. As one example and as described with reference to
The modeling system 106 can also be useful for processing other data types, such as a collection of timestamped user listening sessions comprised of media content items, to understand how user consumption of media evolves over time and apply that understanding to enhance future listening experiences. Therefore, as another example and as described with reference to
The modeling system 106 provides the inputs 114 to the dynamic topic model 107 for modeling. For example, modeling system 106 provides the inputs 114 to DCTM 201 or DWCTM 1501 for modeling. The dynamic topic model 107 is configured to analyze the data items and identify underlying topics by clustering the respective vocabulary into the specified quantity of topics. The outputs 116 of the dynamic topic model 107 include a list of topics, each topic defined by a cluster of the respective vocabulary representing the topic. When modeling system 106 deploys DCTM 201, the outputs 116 include, for each topic, a topic popularity, a topic representation, and a correlation with other topics at given time points across the period of time to illustrate the evolution thereof, thereby extending capabilities of traditional correlated topic models. When modeling system 106 deploys DWCTM 1501, the outputs 116 include, for each topic, a topic popularity, a word embedding, and a correlation with other topics at given time points across the period of time to illustrate the evolution thereof, also extending capabilities of traditional correlated topic models.
For example, in a traditional correlated topic model (CTM), a correlation in co-occurrence of topics is modeled in a non-dynamic manner. In some examples, one or more of the topic popularity and topic representation are also modeled in a non-dynamic manner by the traditional CTM. As described in greater detail with reference to
As described in greater detail with reference to
The outputs 116 of the dynamic topic model 107 can be stored in the databases 112 and/or provided to the application system 110 as inputs for further processing. Processed outputs 118 generated by the application system 110 can be provided to the client devices 122 or the third party services 124. In some examples, the outputs 116 can be provided to the client devices 122 or the third party services 124, alternatively or in addition to the processed outputs 118.
A topic is a probability distribution over vocabulary. Here, the vocabulary consists of the words 205 of the documents 204. Thus, the topics identified by the DCTM 201 are each a cluster or grouping of a subset of the words 205, as illustrated and described in greater detail with reference to
The quantity of topics 208 received as input informs the DCTM 201 of a number of word clusters to identify. Depending on the quantity of topics 208 provided as input and a number of topics that can be drawn from the set 202 of documents 204, a list of topics 210 output by the DCTM 201 can include all topics inferred or the N most probable topics for the documents 204 where N corresponds to the quantity of topics 208. As the topics are clusters of the words 205, in some examples, the most probable words associated with a topic are provided as the topic within the list of topics 210.
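As one non-limiting illustration of how the most probable words associated with a topic might be selected, the following sketch ranks the words of a small vocabulary by probability for each topic. The function name top_words, the five-word vocabulary, and the probability values are hypothetical and are used only for illustration; they are not outputs of the DCTM 201.

```python
import numpy as np

def top_words(topic_word_probs, vocabulary, n_top=3):
    """Return the n_top most probable words for each topic (one list per topic)."""
    topic_word_probs = np.asarray(topic_word_probs)
    # Sort each row by descending probability and keep the first n_top indices.
    order = np.argsort(-topic_word_probs, axis=1)[:, :n_top]
    return [[vocabulary[j] for j in row] for row in order]

# Hypothetical vocabulary and topic-word probabilities (each row sums to 1).
vocabulary = ["network", "neuron", "layer", "spike", "kernel"]
topic_word_probs = np.array([
    [0.40, 0.05, 0.35, 0.05, 0.15],
    [0.10, 0.45, 0.05, 0.35, 0.05],
])

topics = top_words(topic_word_probs, vocabulary, n_top=2)
print(topics)
```

In this sketch each topic is reported as its highest-probability words, mirroring the way a topic can be provided as a cluster of words within the list of topics 210.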
Additionally, for each topic 212 in the list of topics 210 (e.g., for each topic 1 through topic N), the DCTM 201 can provide topic popularity 214, topic representation 216, and topic correlation 218 at given time points across the period of time that the set 202 of documents 204 span to illustrate how the popularity, representations, and correlations among topics have evolved over time for the set 202.
As described in more detail with reference to
The list of topics 210 provided as output of the DCTM 201 includes 7 topics (e.g., topics 1 through 7) corresponding to the quantity of topics 208 provided as input. As illustrated, the topics 402 are represented by clusters 404 of the plurality of words 205. For example, as shown in a blown-up representation 406 of the list of topics 210, the thirty most probable words associated with topics 1, 6, and 7 are displayed. The topics 402 are not assumed to be independent from one another. Accordingly, two or more topics can have common words 408, as illustrated by the common words 408 associated with at least two of the topics 1, 6, and 7 that are highlighted in the blown-up representation 406. The DCTM 201 is robust enough to discriminate between the topics 402 that share common words 408, and is also able to consider multiple topics with similar interpretation, such as topics 6 and 7, and split facets of a single topic into more than one topic. Additionally, as described in greater detail below in
In some examples, the topics can later be labeled or categorized based on a general subject matter reflected by the words. As one example, the topic 1 can be labeled or categorized as neural networks. As another example, topics 6 and 7 can be labeled or categorized as neuroscience.
The conference paper dataset scenario presented in
A document-topic distribution is a distribution over the topics for a document. An example distribution over the topics for a document at a given time point is as follows: x % of topic 1, y % of topic 2, z % of topic 3, etc., where the sum of the percentages for each topic equals 100%. For example, as shown in the graph 500, for a document from 1987, the document-topic distribution is likely to be 8% topic 7, 6% topic 6, 4% topic 5, and so on. A legend 506 indicates a visual scheme to distinguish a proportion of the document that each of the topics 1 through 7 make up as part of the document-topic distribution. For additional clarity, the corresponding data may also be labeled with each of the topics within the graph 500, as illustrated.
As illustrated by the graph 500, a distribution for each of the topics 1, 6 and 7 has generally decreased as time has progressed from 1987 to 2015. This decrease in distribution corresponds to a decrease in trends or decrease in popularity for these topics within conference papers related to neural information processing systems.
As illustrated in
As illustrated in
The generative process 902 defines a joint probability distribution over observed data and latent variables. The joint probability distribution can be decomposed into a likelihood for the observed data conditioned on the latent variables and a prior distribution from which the latent variables are drawn. In some examples, a probabilistic graphical model 904 is used to illustrate the generative process 902. For example, within the probabilistic graphical model 904, the shaded node represents the observed data, the unshaded nodes represent the latent variables, and the edges represent possible dependencies between the nodes. A goal of the training phase is to learn the latent variables and model parameters. In some examples, at least a portion of the training dataset 908 is held back to be used for testing. For example, 75% of the dataset can be used for training, and 25% can be withheld as a testing dataset.
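As one non-limiting illustration of withholding a portion of the dataset for testing, the following sketch shuffles document indices and splits them 75%/25%. The function name split_corpus, the corpus size of 100 documents, and the fixed random seed are assumptions made for this example only.

```python
import numpy as np

def split_corpus(n_documents, train_fraction=0.75, seed=0):
    """Randomly split document indices into training and withheld testing sets."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_documents)
    n_train = int(round(train_fraction * n_documents))
    return indices[:n_train], indices[n_train:]

# Hypothetical corpus of 100 documents.
train_idx, test_idx = split_corpus(100)
print(len(train_idx), len(test_idx))
```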
Here, as illustrated by the probabilistic graphical model 904, the observed data is a word (wdn), which is an nth word of a dth document (e.g., the document d having Nd words) in a set of documents D comprising a corpus W. Corpus W is the training dataset 908. The latent variables, each described in greater detail below, include: distributions of words for topics (β), a topic assignment (zdn) drawn from a mixture of topics (ηd) over the corpus W, the mixture of topics (ηd) being dependent on a mean of the prior distribution of document-topic proportion (e.g., topic probability) (μ) and co-variance matrices (Σ), the co-variance matrices (Σ) drawn from a plurality of Gaussian processes generated independent of one another (f) and a correlation between the Gaussian processes driven by (L).
The set of documents D comprising the corpus W are associated with one or more evolving indexes. Here, the set of documents D are associated with a single index of time (e.g., indicated by timestamps td associated with one or more documents d in the set of documents D). However, in other examples, the index can alternatively or additionally include a geographical location. Taking into account the temporal dynamics underlying the documents d in the training dataset 908, the DCTM 201 learns the latent variables and model parameters during the training phase 802 to infer a topic-word distribution 910 for each topic and a document-topic distribution 912. The topic-word distributions 910 include words that are most frequently associated with each of the topics, which can be determined based on β. The document-topic distribution 912 includes a proportion of each document d within the corpus W that is associated with each of the topics, which can be determined based on μ and Σ. Continuous processes are utilized for modeling to enable incorporation of the temporal dynamics into the DCTM 201. For example, as described in greater detail below, to incorporate temporal dynamics for each component of the DCTM 201, Gaussian processes (GP) are used to model β and μ and a generalized Wishart process (GWP) is used to model Σ.
The generative process 902 depicts how the variables and model parameters are learned. For example, the DCTM 201 assumes that a document d having a word count Nd at time td is generated according to the generative process 902, described as follows. A mixture of topics ηd˜(μt
Under the above-described generative process 902, the marginal likelihood for the corpus W of documents becomes:
The individual documents of the set W are assumed to be independent and identically distributed (i.i.d.) given the document-topic proportion and word-topic distribution.
In traditional correlated topic models (CTMs), the parameterization of η is relaxed by allowing topics to be correlated with each other (e.g., by allowing a non-diagonal Σt
To model the dynamics, the topic probability (μt
$f_{di} \sim \mathcal{GP}(0, \kappa_\theta), \quad d \le D, \; i \le \nu \qquad (2)$
be D×ν i.i.d. Gaussian processes with zero mean function and (shared) kernel function $\kappa_\theta$, where θ denotes any parameters of the kernel function. For example, in the case of $\kappa_\theta(x, y) = \theta_1^2 \exp(-\lVert x - y \rVert^2 / (2\theta_2^2))$, θ=(θ1, θ2) corresponds to the amplitude and length scale of the kernel (assumed to be independent from one another). In some examples, a squared exponential kernel can be used for Σ to allow more freedom for topic correlations to change rapidly. The amplitude and length scale of the kernels can be initialized as 1 and 0.1, respectively, and can then be learned using an approximate empirical Bayes approach.
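As one non-limiting illustration, the squared exponential kernel with an amplitude of 1 and a length scale of 0.1 can be sketched as follows, together with draws of independent zero-mean Gaussian processes that share that kernel. The function names, the grid of twenty time points, and the small jitter added for numerical stability are assumptions made for this example and are not part of the described model.

```python
import numpy as np

def squared_exponential(x, y, amplitude=1.0, length_scale=0.1):
    """Kernel amplitude^2 * exp(-(x - y)^2 / (2 * length_scale^2)) for 1-D inputs."""
    x = np.asarray(x, dtype=float)[:, None]
    y = np.asarray(y, dtype=float)[None, :]
    return amplitude**2 * np.exp(-((x - y) ** 2) / (2.0 * length_scale**2))

def sample_gps(time_points, n_processes, seed=0, jitter=1e-6):
    """Draw n_processes i.i.d. zero-mean GP sample paths at the given time points."""
    rng = np.random.default_rng(seed)
    K = squared_exponential(time_points, time_points)
    # Jitter on the diagonal is an assumption made only to keep Cholesky stable.
    K = K + jitter * np.eye(len(time_points))
    chol = np.linalg.cholesky(K)
    noise = rng.standard_normal((len(time_points), n_processes))
    return (chol @ noise).T  # shape: (n_processes, n_time_points)

time_points = np.linspace(0.0, 1.0, 20)  # hypothetical normalized timestamps
K = squared_exponential(time_points, time_points)
draws = sample_gps(time_points, n_processes=6)
print(K.shape, draws.shape)
```

A shorter length scale lets the sampled functions vary more quickly between neighboring time points, which is consistent with allowing topic correlations to change rapidly.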
The positive integer-valued ν≥D is denoted as the degrees of freedom parameter. Let $F_{ndk} := f_{dk}(x_n)$, and let $F_n := (F_{ndk},\; d \le D,\; k \le \nu)$ denote the D×ν matrix of collected function values, for every n≥1. Then, consider
$\Sigma_n = L F_n F_n^\top L^\top, \quad n \ge 1, \qquad (3)$
where $L \in \mathbb{R}^{D \times D}$ satisfies the condition that the symmetric matrix $L L^\top$ is positive definite. With such construction, $\Sigma_n$ is (marginally) Wishart distributed, and Σ is correspondingly called a Wishart process with degrees of freedom ν and scale matrix $V = L L^\top$.
$\Sigma_n \sim \mathcal{GWP}(V, \nu, \kappa_\theta)$ denotes that $\Sigma_n$ is drawn from a Wishart process. The dynamics of the process of the covariance matrices Σ are inherited from the Gaussian processes, controlled by the kernel function $\kappa_\theta$. With this formulation, the dependency between the D Gaussian processes is static over time, and regulated by the matrix V.
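As one non-limiting illustration of the construction in equation (3), the following sketch forms a covariance matrix at each time point as the product of L, the matrix of function values, and their transposes. For brevity the function values are drawn independently at each time point, which is an assumption of this example; in the described model they would be values of Gaussian processes evaluated at the timestamps. The function name and the sizes T, D, and ν are likewise hypothetical.

```python
import numpy as np

def wishart_process_covariances(F, L):
    """Compute Sigma_t = L F_t F_t^T L^T for every time point t.

    F has shape (T, D, nu); L has shape (D, D). Returns shape (T, D, D).
    """
    LF = np.einsum("ij,tjk->tik", L, F)       # L @ F_t for each t
    return np.einsum("tik,tjk->tij", LF, LF)  # (L F_t)(L F_t)^T for each t

rng = np.random.default_rng(0)
T, D, nu = 5, 3, 4                       # hypothetical sizes, with nu >= D
F = rng.standard_normal((T, D, nu))      # stand-in for GP function values
L = np.tril(rng.standard_normal((D, D)))
np.fill_diagonal(L, np.abs(np.diag(L)) + 0.5)  # positive diagonal, so L L^T is positive definite

Sigma = wishart_process_covariances(F, L)
print(Sigma.shape)
```

Because every matrix is a product of a matrix with its own transpose, each result is symmetric and positive semi-definite by construction, which is what makes it usable as a covariance matrix.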
L is a triangular Cholesky factor of the positive definite matrix V, with M=D(D+1)/2 free elements. The free elements can be vectorized into a vector $\ell = (\ell_1, \ldots, \ell_M)$, with each element assigned a spherical normal distribution $p(\ell_m) = \mathcal{N}(0, 1)$, where the diagonal elements of L are positive. To ensure that the diagonal elements of L are positive, a change of variables can be applied to the prior distribution of the diagonal elements by applying a soft-plus transformation $L_{ii} = \log(1 + \exp(\ell_i))$, $\ell_i \sim \mathcal{N}(0, 1)$.
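As one non-limiting illustration, a lower triangular factor with M=D(D+1)/2 free elements and a positive diagonal can be assembled as sketched below, using a soft-plus transformation on the diagonal entries. The names softplus, build_cholesky, and ell, and the choice D=3, are assumptions made for this example.

```python
import numpy as np

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return np.logaddexp(0.0, x)

def build_cholesky(free_elements, dim):
    """Fill a lower triangular matrix and make its diagonal positive via soft-plus."""
    L = np.zeros((dim, dim))
    L[np.tril_indices(dim)] = free_elements
    diag = np.diag_indices(dim)
    L[diag] = softplus(L[diag])
    return L

D = 3
M = D * (D + 1) // 2                 # number of free elements
rng = np.random.default_rng(0)
ell = rng.standard_normal(M)         # draws from the standard normal prior
L = build_cholesky(ell, D)
V = L @ L.T                          # scale matrix
print(M, np.diag(L))
```

Because the soft-plus output is strictly positive, L is invertible and the resulting scale matrix is positive definite regardless of the sign of the sampled values.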
Stochastic gradient estimation with discrete latent variables is difficult, and often results in significantly higher variance in gradient estimation even with state-of-the-art variance reduction techniques. To simplify stochastic gradient estimation, the discrete latent variables in the DCTM 201 can be marginalized out in closed form. For example, the resulting marginalized distribution p(Wd|zn, βt
Wd˜Πn=1N
As discussed above, the generative process 902 defines a joint probability distribution over observed data and latent variables that can be decomposed into a likelihood for the observed data conditioned on the latent variables and a prior distribution from which the latent variables are drawn.
The inference process 906 parametrizes the approximate posterior of the latent variables using variational inference techniques. For example, variational lower bounds are individually derived for the following variables (e.g., components of the DCTM 201): ηd, β, μ, and Σ, the derivation processes for each discussed in turn below. Once derived, the individually derived lower bounds of the components can be assembled together for a stochastic variational inference (SVI) method for the DCTM 201.
The SVI method for the DCTM 201 enables mini-batch training over the documents in the training dataset 908. This is facilitated by the use of amortized inference to derive the variational lower bound of ηd. After defining a variational posterior q(ηd) for each document, a variational lower bound of the log probability over the documents, denoted as $\mathcal{L}_W$ and also referred to as lower bound W, can be derived as follows,
As the lower bound W is a summation over individual documents, it is straightforward to derive a stochastic approximation of the summation by sub-sampling the documents,
where the mini-batch is a random sub-sample of the document indices having a given size. The above data sub-sampling enables performance of mini-batch training, where the gradients of the variational parameters are stochastically approximated from a mini-batch. An issue with the above data sub-sampling is that only the variational parameters associated with the mini-batch get updated, which causes synchronization issues when running stochastic gradient descent. To avoid this, it is assumed that the variational posteriors q(ηd) for individual documents are generated according to parametric functions,
$q(\eta_d) = \mathcal{N}(\phi_m(W_d), \phi_S(W_d)), \qquad (7)$
where ϕm and ϕS are the parametric functions that generate the mean and variance of q(ηd), respectively. This is referred to as amortized inference. With this parameterization of the variational posteriors, a common set of parameters can be updated no matter which documents are sampled into the mini-batch, thus overcoming the synchronization issue.
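As one non-limiting illustration of amortized inference, the following sketch maps the word counts of any document to the mean and variance of its variational posterior using one shared set of parameters. The class name LinearEncoder, the use of a single linear map in place of a deeper network, the vocabulary size, and the randomly generated word counts are all assumptions made for this example.

```python
import numpy as np

class LinearEncoder:
    """Hypothetical stand-in for the parametric functions phi_m and phi_S."""

    def __init__(self, vocab_size, n_topics, seed=0):
        rng = np.random.default_rng(seed)
        # One shared set of parameters, regardless of how many documents exist.
        self.W_mean = 0.01 * rng.standard_normal((vocab_size, n_topics))
        self.W_logvar = 0.01 * rng.standard_normal((vocab_size, n_topics))

    def __call__(self, word_counts):
        x = np.log1p(np.asarray(word_counts, dtype=float))
        mean = x @ self.W_mean
        variance = np.exp(x @ self.W_logvar)  # exponential keeps the variance positive
        return mean, variance

encoder = LinearEncoder(vocab_size=50, n_topics=7)
rng = np.random.default_rng(1)
batch = rng.poisson(1.0, size=(4, 50))   # hypothetical mini-batch of word counts
mean, variance = encoder(batch)
print(mean.shape, variance.shape)
```

Whichever documents fall into a mini-batch, the same two weight matrices are used and updated, which is the property that avoids per-document variational parameters.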
The lower bound W cannot be computed analytically. Instead, an unbiased estimate of W is computed using Monte Carlo sampling. As q(ηd) are normal distributions, a low-variance estimate of the gradients of the variational parameters is obtained via the reparameterization strategy.
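As one non-limiting illustration of the reparameterization strategy, a sample from a normal distribution can be written as the mean plus the standard deviation times standard normal noise, so that the sample is a deterministic function of the variational parameters. The sketch below uses such samples to form a Monte Carlo estimate of expected topic proportions. The softmax mapping from a sampled mixture of topics to proportions, as well as the mean and variance values, are assumptions made for this example.

```python
import numpy as np

def reparameterized_samples(mean, variance, n_samples, rng):
    """Draw samples as mean + sqrt(variance) * eps with eps ~ N(0, 1)."""
    eps = rng.standard_normal((n_samples,) + np.shape(mean))
    return mean + np.sqrt(variance) * eps

def softmax(x):
    """Softmax over the last axis, shifted for numerical stability."""
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
mean = np.array([0.5, -1.0, 0.0])       # hypothetical variational mean
variance = np.array([0.1, 0.2, 0.3])    # hypothetical variational variance
eta = reparameterized_samples(mean, variance, 20000, rng)

# Monte Carlo estimate of the expected topic proportions under q(eta).
expected_proportions = softmax(eta).mean(axis=0)
print(expected_proportions)
```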
Both of the word distributions of topics (β) and the mean of the prior distribution of the document-topic proportion (μ), also referred to as topic probability, follow Gaussian processes that take the time stamps of individual documents as inputs, i.e., p(β|t) and p(μ|t). A stochastic variational Gaussian process approach can be used to construct the variational lower bound of β and μ.
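For example, each Gaussian process can be augmented with a set of auxiliary variables with a set of corresponding time stamps, i.e.,
$p(\beta \mid t) = \int p(\beta \mid U_\beta, t, z_\beta)\, p(U_\beta \mid z_\beta)\, dU_\beta \qquad (8)$
$p(\mu \mid t) = \int p(\mu \mid U_\mu, t, z_\mu)\, p(U_\mu \mid z_\mu)\, dU_\mu \qquad (9)$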
where Uβ and Uμ are the auxiliary variables for β and μ, respectively, and zβ and zμ are the corresponding time stamps. Both p(β|Uβ, t, zβ) and p(Uβ|zβ) follow the same Gaussian process as the Gaussian process for p(β|t), each having the same mean and kernel functions. Similarly, both p(μ|Uμ, t, zμ) and p(Uμ|zμ) follow the same Gaussian process as the Gaussian process for p(μ|t), each having the same mean and kernel functions. Despite the augmentation, the prior distributions for β and μ are not changed.
Variational posteriors of β and μ are constructed in the following form: q(β, Uβ)=p(β|Uβ)q(Uβ) and q(μ, Uμ)=p(μ|Uμ)q(Uμ). Both q(Uβ) and q(Uμ) are multivariate normal distributions in which the mean and covariance are variational parameters. For example, $q(U_\beta) = \mathcal{N}(M_\beta, S_\beta)$ and $q(U_\mu) = \mathcal{N}(M_\mu, S_\mu)$. When β and μ are used in down-stream distributions, a lower bound can be derived,
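$\log p(\cdot \mid \beta) \ge \mathbb{E}_{q(\beta)}[\log p(\cdot \mid \beta)] - \mathrm{KL}(q(U_\beta) \,\|\, p(U_\beta)), \qquad (10)$
$\log p(\cdot \mid \mu) \ge \mathbb{E}_{q(\mu)}[\log p(\cdot \mid \mu)] - \mathrm{KL}(q(U_\mu) \,\|\, p(U_\mu)), \qquad (11)$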
where q(β)=∫p(β|Uβ)q(Uβ)dUβ and q(μ)=∫p(μ|Uμ)q(Uμ)dUμ.
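As one non-limiting illustration, when both distributions are multivariate normal, the KL divergence term in the lower bounds has a closed form. The sketch below evaluates it for two small, hypothetical cases; the function name kl_mvn and the 2×2 covariance matrix are assumptions made for this example.

```python
import numpy as np

def kl_mvn(m_q, S_q, m_p, S_p):
    """KL( N(m_q, S_q) || N(m_p, S_p) ) for multivariate normal distributions."""
    k = m_q.shape[0]
    diff = m_p - m_q
    trace_term = np.trace(np.linalg.solve(S_p, S_q))
    quad_term = diff @ np.linalg.solve(S_p, diff)
    _, logdet_p = np.linalg.slogdet(S_p)
    _, logdet_q = np.linalg.slogdet(S_q)
    return 0.5 * (trace_term + quad_term - k + logdet_p - logdet_q)

K = np.array([[1.0, 0.5], [0.5, 1.0]])   # hypothetical prior covariance
kl_same = kl_mvn(np.zeros(2), K, np.zeros(2), K)
kl_shifted = kl_mvn(np.array([1.0, 0.0]), K, np.zeros(2), K)
print(kl_same, kl_shifted)
```

The divergence is zero when the variational posterior equals the prior and grows as the variational mean or covariance moves away from it, so the term acts as a penalty within the bound.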
As previously discussed, the generalized Wishart process for Σ is derived from a set of Gaussian processes. At each time point, the covariance matrix is defined as $\Sigma_t = L F_t F_t^\top L^\top$. The vector stacking of each entry of the matrix $F_t$ across all the time points, $f_{ij} = ((F_1)_{ij}, \ldots, (F_T)_{ij})$, follows a Gaussian process $p(f_{ij} \mid t) = \mathcal{GP}(0, \kappa)$. A stochastic variational inference method for the Wishart process can be derived similarly to the stochastic variational inference method for Gaussian processes described with reference to β and μ. For example, each $p(f_{ij} \mid t)$ in the Wishart process is augmented with a set of auxiliary variables having a set of corresponding time stamps,
$p(f_{ij} \mid t) = \int p(f_{ij} \mid u_{ij}, t, z_{ij})\, p(u_{ij} \mid z_{ij})\, du_{ij}, \qquad (12)$
where $u_{ij}$ is the auxiliary variable and $z_{ij}$ is the corresponding time stamp. The variational posterior of $f_{ij}$ is defined as $q(f_{ij}, u_{ij}) = p(f_{ij} \mid u_{ij})\, q(u_{ij})$, where $q(u_{ij}) = \mathcal{N}(m_{ij}, s_{ij})$. The variational posterior of the vectorized free elements $\ell$ of L can be defined to be $q(\ell) = \mathcal{N}(m_\ell, S_\ell)$, where $S_\ell$ is a diagonal matrix. A change of variable can also be applied to the variational posterior of the diagonal elements, $L_{mm} = \log(1 + \exp(\ell_m))$, $q(\ell_m) = \mathcal{N}(m_m, s_m)$.
With such a set of variational posteriors for all the entries $\{f_{ij}\}$ and $\ell$, a variational lower bound can be derived, when Σ is used for down-stream distributions,
where $q(F) = \prod_{ij} q(f_{ij})$ with $q(f_{ij}) = \int p(f_{ij} \mid u_{ij})\, q(u_{ij})\, du_{ij}$.
After deriving the variational lower bound for all the components, the lower bounds of the individual components can be assembled together for a stochastic variational inference for DCTM 201. For example, the document-topic proportion for each document d follows a prior distribution p(ηd|μt
where the first term of the lower bound can be further decomposed by plugging in equation (5),
This formulation allows mini-batch training to be performed by data sub-sampling. For each mini-batch, the training dataset 908 is randomly sub-sampled and the term $\mathbb{E}_{q(\mu) q(F) q(L) q(\beta)}[\mathcal{L}_W]$ is re-weighted according to the ratio between the size of the training dataset 908 and the size of the mini-batch as shown in Equation (6).
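As one non-limiting illustration of the re-weighting, a sum over all documents can be estimated from a mini-batch by scaling the mini-batch sum by the ratio of the dataset size to the mini-batch size. The sketch below checks empirically that the average of many such estimates is close to the full sum. The per-document values are randomly generated stand-ins rather than actual lower-bound terms, and the function name and sizes are assumptions made for this example.

```python
import numpy as np

def minibatch_estimate(per_document_terms, batch_size, rng):
    """Estimate the full sum from a random mini-batch, re-weighted by N / batch_size."""
    n = len(per_document_terms)
    batch = rng.choice(n, size=batch_size, replace=False)
    return (n / batch_size) * per_document_terms[batch].sum()

rng = np.random.default_rng(0)
# Stand-in values for per-document contributions (not actual model outputs).
per_document_terms = rng.normal(-50.0, 5.0, size=1000)
full_sum = per_document_terms.sum()

estimates = np.array(
    [minibatch_estimate(per_document_terms, 100, rng) for _ in range(2000)]
)
print(full_sum, estimates.mean())
```

Each individual estimate is noisy, but because it is unbiased, stochastic gradient steps computed from it move in the correct direction on average.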
To test or validate an output of the DCTM 201, a test document from the portion of the training dataset 908 withheld for testing can be provided as input to the DCTM 201. A perplexity is computed using the exponential of the average negative predictive log-likelihood for each word, where the evidence lower bound (ELBO) for the test document is computed using Equation (14).
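As one non-limiting illustration, perplexity can be computed from a log-likelihood value (or a lower bound on it) and a word count as sketched below. The function name perplexity and the numerical values are assumptions made for this example; the sanity check uses a model that assigns equal probability to every word of a 1,000-word vocabulary, for which the perplexity equals the vocabulary size.

```python
import numpy as np

def perplexity(log_likelihood_bound, n_words):
    """Exponential of the average negative log-likelihood per word."""
    return float(np.exp(-log_likelihood_bound / n_words))

vocab_size = 1000
n_words = 250
# Log-likelihood of a test document under a uniform word distribution.
uniform_bound = n_words * np.log(1.0 / vocab_size)
ppl = perplexity(uniform_bound, n_words)
print(ppl)
```

When a lower bound such as the ELBO is used in place of the exact log-likelihood, the resulting value is an upper bound on the true perplexity, and lower values indicate better predictive performance.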
In a traditional correlated topic model, a prior distribution for mixtures of topics is derived from a multivariate distribution, in which the mean encodes the popularity of the topics while the covariance matrix encodes the co-occurrence of topics in a non-dynamic manner. As described in the generative process 902 and inference process 906, the DCTM 201 extends the prior distribution for mixtures of topics into a dynamic distribution by providing a set of Gaussian processes as the prior distribution for the mean, and a generalized Wishart process as the prior distribution for the covariance matrices. Accordingly, the evolution of the popularity of topics, representations of topics, and their correlations can be jointly modeled over time. Additionally, the SVI method for the DCTM 201 utilizes amortized inference to enable mini-batch training that is scalable to large datasets. For example, the DCTM 201 utilizes a deep neural network to encode the variational posterior of the mixtures of topics for individual documents. For the Gaussian processes and the generalized Wishart process, the DCTM 201 can be augmented with auxiliary variables to derive a scalable variational lower bound. Because the final lower bound is intractable, the discrete latent variables are marginalized. Further, a Monte Carlo sampling approximation with the reparameterization trick can be applied to enable a low-variance estimate for the gradients.
Also shown is a user U who uses the media playback device 1102 to continuously play back a plurality of media content items. In some examples, the media content items may be in a form of a playlist, where the playlist may be created based on recommendations from the media content recommendation engine 1110 informed from an output of the DCTM 201 or DWCTM 1501. In some examples, the DCTM 201 is trained as described with reference to
The media playback device 1102 operates to play media content items to produce media output 1112. In some embodiments, the media content items are provided by the media delivery system 1104 and transmitted to the media playback device 1102 using the network 1106. A media content item is an item of media content, including audio, video, or other types of media content, which are stored in any format suitable for storing media content. Non-limiting examples of media content items include songs, albums, music videos, movies, television episodes, podcasts, other types of audio or video content, and portions or combinations thereof. In this document, the media content items can also be referred to as tracks.
The media-playback engine 1108 operates to facilitate the playing of media content items on the media playback device 1102. The media delivery system 1104 operates to provide the media content items to the media playback device 1102. In some embodiments, the media delivery system 1104 is connectable to a plurality of media playback devices 1102 and provides the media content items to the media playback devices 1102 independently or simultaneously. Additionally, the media delivery system 1104 operates to provide recommendations for playback of media content items (e.g., in a form of a playlist) to the media playback device 1102.
For example, the media content recommendation engine 1110 operates in conjunction with the DCTM 201 or DWCTM 1501 to determine media content items to recommend and provide to the user U for playback, among other recommendations. As described in greater detail with reference to
The DCTM 201 or DWCTM 1501 can process the set of listening sessions of the user U similar to the set of documents, as discussed in detail above, to model an evolution of how the user U is consuming media content items from the media delivery system 1104 over that period of time. For example, as described in greater detail with reference to
The output of the DCTM 201 or DWCTM 1501 can be provided as input to the media content recommendation engine 1110. The output of the DCTM 201 or DWCTM 1501 can then be used to inform recommendations made by the media content recommendation engine 1110 such that recommended media content items more closely correspond to the evolving media content item preferences of the user U, while also taking into account evolution of the respective artists and genres over time to provide diverse recommendations. In some examples, the recommended media content items can be provided in a form of a playlist.
In some embodiments, the media playback device 1102 is a computing device, handheld entertainment device, smartphone, tablet, watch, wearable device, or any other type of device capable of playing media content. In yet other embodiments, the media playback device 1102 is a laptop computer, desktop computer, television, gaming console, set-top box, network appliance, Blu-ray or DVD player, media player, stereo, or radio.
In at least some embodiments, the media playback device 1102 includes a location-determining device 1202, a touch screen 1204, a processing device 1206, a memory device 1208, a content output device 1210, and a network access device 1212. Other embodiments may include additional, different, or fewer components. For example, some embodiments may include a recording device such as a microphone or camera that operates to record audio or video content. As another example, some embodiments do not include one or more of the location-determining device 1202 and the touch screen 1204.
The location-determining device 1202 is a device that determines the location of the media playback device 1102. In some embodiments, the location-determining device 1202 uses one or more of the following technologies: Global Positioning System (GPS) technology which may receive GPS signals from satellites S, cellular triangulation technology, network-based location identification technology, Wi-Fi positioning systems technology, and combinations thereof.
The touch screen 1204 operates to receive an input from a selector (e.g., a finger, stylus, etc.) controlled by the user U. In some embodiments, the touch screen 1204 operates as both a display device and a user input device. In some embodiments, the touch screen 1204 detects inputs based on one or both of touches and near-touches. In some embodiments, the touch screen 1204 displays a user interface 1214 for interacting with the media playback device 1102. As noted above, some embodiments do not include a touch screen 1204. Some embodiments include a display device and one or more separate user interface devices. Further, some embodiments do not include a display device.
In some embodiments, the processing device 1206 comprises one or more central processing units (CPU). In other embodiments, the processing device 1206 additionally or alternatively includes one or more digital signal processors, field-programmable gate arrays, or other electronic circuits. The memory device 1208 operates to store data and instructions. In some embodiments, the memory device 1208 stores instructions for a media-playback engine 1108. The memory device 1208 typically includes at least some form of computer-readable media.
Computer readable media include any available media that can be accessed by the media playback device 1102. By way of example, computer-readable media include computer readable storage media and computer readable communication media. Computer readable storage media includes volatile and nonvolatile, removable and non-removable media implemented in any device configured to store information such as computer readable instructions, data structures, program modules, or other data. Computer readable storage media includes, but is not limited to, random access memory, read only memory, electrically erasable programmable read only memory, flash memory and other memory technology, compact disc read only memory, Blu-ray discs, digital versatile discs or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and that can be accessed by the media playback device 1102. In some embodiments, computer readable storage media is non-transitory computer readable storage media.
Computer readable communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. The term “modulated data signal” refers to a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, computer readable communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, radio frequency, infrared, and other wireless media. Combinations of any of the above are also included within the scope of computer readable media.
The content output device 1210 operates to output media content. In some embodiments, the content output device 1210 generates media output 1112 (
The network access device 1212 operates to communicate with other computing devices over one or more networks, such as the network 1106. Examples of the network access device include wired network interfaces and wireless network interfaces. Wireless network interfaces include infrared, BLUETOOTH® wireless technology, 802.11a/b/g/n/ac, and cellular or other radio frequency interfaces in at least some possible embodiments.
The media-playback engine 1108 operates to play back one or more of the media content items (e.g., music) to the user U. As described herein, the media-playback engine 1108 is configured to communicate with the media delivery system 1104 to receive one or more media content items (e.g., through the stream media 1252), as well as recommendations (e.g., through communications 1254, 1256 or in the form of playlists received through the stream media 1252).
With still reference to
In some embodiments, the media delivery system 1104 includes a media server 1216 and recommendation server 1218. In this example, the media server 1216 includes a media server application 1220, a processing device 1222, a memory device 1224, and a network access device 1226. The processing device 1222, memory device 1224, and network access device 1226 may be similar to the processing device 1206, memory device 1208, and network access device 1212 respectively, which have each been previously described.
In some embodiments, the media server application 1220 operates to stream music or other audio, video, or other forms of media content. The media server application 1220 includes a media stream service 1228, a media data store 1230, and a media application interface 1232.
The media stream service 1228 operates to buffer media content such as media content items 1234 (including 1234A, 1234B, and 1234Z) for streaming to one or more streams 1236A, 1236B, and 1236Z.
The media application interface 1232 can receive requests or other communication from media playback devices or other systems, to retrieve media content items from the media delivery system 1104. For example, in
In some embodiments, the media data store 1230 stores media content items 1234, media content metadata 1238, and playlists 1240. The media data store 1230 may comprise one or more databases and file systems. Other embodiments are possible as well. As noted above, the media content items 1234 may be audio, video, or any other type of media content, which may be stored in any format for storing media content.
The media content metadata 1238 operates to provide various pieces of information associated with the media content items 1234. In some embodiments, the media content metadata 1238 includes one or more of title, artist name, album name, length, genre, sub-genre, mood, era, etc. In addition, the media content metadata 1238 includes acoustic metadata which may be derived from analysis of the track. Acoustic metadata may include temporal information such as tempo, rhythm, beats, downbeats, tatums, patterns, sections, or other structures. Acoustic metadata may also include spectral information such as melody, pitch, harmony, timbre, chroma, loudness, vocalness, or other possible features.
One or more types of the media content metadata 1238 can be used by the DCTM 201 or DWCTM 1501 to model an evolution of the users' consumption of media content items. For example, artist names can be provided as input to the DCTM 201 or DWCTM 1501, and clusters or groupings of artists representing topics can be provided as output to show how the user's taste or preference in artists has changed over time, which can be helpful in predicting new media content items to recommend that align with the user's interest but are also diverse. As another example, genres or sub-genres can be provided as input to the DCTM 201 or DWCTM 1501, and clusters or groupings of genres representing topics can be provided as output to show how the user's taste or preference in genres has changed over time.
The playlists 1240 operate to identify one or more of the media content items 1234. In some embodiments, the playlists 1240 identify a group of the media content items 1234 in a particular order. In other embodiments, the playlists 1240 merely identify a group of the media content items 1234 without specifying a particular order. Some, but not necessarily all, of the media content items 1234 included in a particular one of the playlists 1240 are associated with a common characteristic such as a common genre, mood, or era. In some examples, the group of the media content items 1234 identified within the playlist 1240 may be based on recommendations facilitated by output of the DCTM 201 or DWCTM 1501 that are provided by the recommendation server 1218 (e.g., through communications 1256).
In this example, the recommendation server 1218 includes the media content recommendation engine 1110, a recommendation interface 1242, a recommendation data store 1244, a processing device 1246, a memory device 1248, and a network access device 1250. The processing device 1246, memory device 1248, and network access device 1250 may be similar to the processing device 1206, memory device 1208, and network access device 1212 respectively, which have each been previously described.
The media content recommendation engine 1110 operates to determine which of the media content items 1234 to recommend for playback to the user U (e.g., to enhance the listening experience of the user U). In some embodiments, the DCTM 201 or DWCTM 1501 facilitates the media content recommendation determinations. The DCTM 201 can be a component of the media content recommendation engine 1110 or a separate component communicatively coupled to the media content recommendation engine 1110.
The DCTM 201 can process listening sessions of the user U that span a period of time to model an evolution of how the user U is consuming media content items from the media delivery system 1104 over that period of time. For example, as described in greater detail with reference to
The recommendation interface 1242 can receive requests or other communication from other systems. For example, the recommendation interface 1242 receives communications 1258 from the DCTM 201 or DWCTM 1501, the communications 1258 including the above-discussed output of the DCTM 201 or DWCTM 1501 to facilitate a determination of media content recommendations. In some examples, the recommendation interface 1242 provides the media server application 1220 with the media content recommendations through communications 1256, such that the media server application 1220 can select media content items 1234 based on the recommendations to provide to the media-playback engine 1108 of the media playback device 1102 for playback (e.g., as stream media 1252). In some examples, the media content items 1234 selected based on the recommendations may be included in a playlist 1240 for provision to the media-playback engine 1108.
In other embodiments, the recommendation interface 1242 may request media content items corresponding to the media content recommendation from the media server application 1220 via the communications 1256. The recommendation interface 1242 can then provide the recommended media content items directly to the media-playback engine 1108 through communication 1260. In some examples, the recommended media content items are presented in a manner (e.g., via the user interface 1214) that notifies the user U that these media content items are recommendations.
In some embodiments, the recommendation data store 1244 stores the output received from the DCTM 201 or DWCTM 1501 and the recommendations determined. The recommendation data store 1244 may comprise one or more databases and file systems. Other embodiments are possible as well.
Referring still to
In various embodiments, the network 1106 includes various types of links. For example, the network 1106 can include wired and/or wireless links, including Bluetooth, ultra-wideband (UWB), 802.11, ZigBee, cellular, and other types of wireless links. Furthermore, in various embodiments, the network 1106 is implemented at various scales. For example, the network 1106 can be implemented as one or more local area networks (LANs), metropolitan area networks, subnets, wide area networks (such as the Internet), or can be implemented at another scale. Further, in some embodiments, the network 1106 includes multiple networks, which may be of the same type or of multiple different types.
Although
Additionally, the DCTM 201 receives a quantity of topics 1312 to be modeled as input. The quantity of topics 1312 may also specify which type of metadata 1308 is to be modeled as the topic. For example, the user listening session 1304 can be modeled based on one or more types of the metadata 1308 associated with the tracks 1306 therein. Here, the artists are to be modeled, and thus the quantity indicates N artist groupings. In other words, the artists are the vocabulary for the user listening session 1304, similar to the words of a document, and a cluster or grouping of artists represents a topic (e.g., a distribution over the artists).
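The analogy between artists and words can be made concrete with a minimal sketch that turns listening sessions into count vectors over an artist vocabulary, in the same way documents are turned into bag-of-words vectors. The function names and the sample sessions are hypothetical and are not taken from the described system; how the DCTM 201 actually encodes its input is not specified here.

```python
from collections import Counter

def build_artist_vocabulary(sessions):
    """Map each distinct artist name to a column index (sorted for determinism)."""
    artists = sorted({artist for session in sessions for artist in session})
    return {artist: idx for idx, artist in enumerate(artists)}

def session_to_counts(session, vocabulary):
    """Return a count vector over the artist vocabulary for one listening session."""
    counts = [0] * len(vocabulary)
    for artist, n in Counter(session).items():
        counts[vocabulary[artist]] = n
    return counts

# Hypothetical sessions: each entry lists the artist of every track played.
sessions = [
    ["Artist A", "Artist B", "Artist A"],
    ["Artist C", "Artist B"],
]
vocabulary = build_artist_vocabulary(sessions)
count_matrix = [session_to_counts(s, vocabulary) for s in sessions]
```

Each row of `count_matrix` plays the role of one document, and each column plays the role of one vocabulary word.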
Depending on the quantity of topics 1312 (e.g., N artist groupings) provided as input and a number of clusters or groupings of the artists that can be drawn from the set 1302 of user listening sessions 1304, a list of artist groupings 1314 output by the DCTM 201 can include all artist groupings inferred or the N most probable artist groupings for the user listening sessions 1304. As the topics are groupings of artists, in some examples, the most probable artists associated with each grouping are provided within the list of artist groupings 1314. For example a top 30 artists associated with artist grouping 1 (e.g., AG 1 in
Additionally, for each artist grouping 1316 in the list of artist groupings 1314, the DCTM 201 can provide a popularity 1318, representation 1320, and correlation 1322 for the artist grouping 1316 at given time points across the period of time that the set 1302 of user listening sessions 1304 spans to illustrate how the popularity, representations, and correlations among artist groupings have evolved over time for the set 1302.
For example, the popularity 1318 of the artist grouping 1316 can be based on a distribution over the artist groupings for each user listening session 1304 at a given time point (e.g., similar to a document-topic distribution). For example, a user listening session 1304 is comprised of x % of artist grouping 1, y % of artist grouping 2, z % of artist grouping 3, etc., where the sum of the percentages equals 100%. In other words, x % of the tracks 1306 for a user listening session 1304 are associated with an artist from artist grouping 1, y % of the tracks 1306 for the user listening session 1304 are associated with an artist from artist grouping 2, and z % of the tracks 1306 for the user listening session 1304 are associated with an artist from artist grouping 3. The representation 1320 of the artist grouping 1316 can be based on a distribution over the artists for a given artist grouping at a given time point (e.g., similar to a topic-word distribution). For example, the distribution can include the artists most frequently associated with the given artist grouping at the given time point. The correlation 1322 for the artist grouping 1316 is a relationship strength between a given artist grouping and one or more other artist groupings at a given time point. The strength of the relationship can be based, at least in part, on a number of common artists shared between the given artist grouping and another artist grouping.
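The percentage breakdown of a session can be illustrated with a short sketch. It assumes, purely for illustration, that every track has already been hard-assigned to exactly one artist grouping; the model itself infers a soft distribution, so this is a simplification, and the function name is hypothetical.

```python
from collections import Counter

def grouping_shares(track_groupings):
    """Return the percentage of tracks in a session that fall in each grouping."""
    total = len(track_groupings)
    counts = Counter(track_groupings)
    return {g: 100.0 * n / total for g, n in sorted(counts.items())}

# Hypothetical session of four tracks, labeled by artist grouping id.
shares = grouping_shares([1, 1, 2, 3])
```

The resulting shares sum to 100%, matching the x %, y %, z % description of a session.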
The user U is currently listening to a first media content item 1404 (e.g., a track). For example, the media-playback engine 1108 facilitates the playing of the first media content item 1404 on the media playback device 1102, which operates to play the first media content item 1404 to produce media output 1112. The first media content item 1404 is associated with a first artist. In some examples, a name of the first artist is included as part of metadata for the first media content item 1404.
The media-playback engine 1108 can communicate over the network 1106 with the media delivery system 1104 to indicate that the first media content item 1404 is currently being played back on the media playback device 1102. The media-playback engine 1108 can include associated metadata of the first media content item 1404, such as the first artist, within the communication. Additionally or alternatively, the media content recommendation engine 1110 can retrieve such metadata from the media data store 1230 (e.g., media content metadata 1238).
Using the output of the DCTM 201 or DWCTM 1501 described in detail with respect to
Use of the output of the DCTM 201 or DWCTM 1501 to facilitate media content item recommendations, such as the recommendation 1402, enables trend-sensitive recommendations. For example, as the DCTM 201 learns the latent representations or groupings (e.g., artist groupings, genre groupings, mood groupings, etc.) within user listening sessions, they can be used by the media content recommendation engine 1110 to not only predict and recommend the next media content item to listen to, but also and more importantly identify a set of media content items with diverse properties (for example, belonging to different groupings), which can target not only satisfaction metrics of the user U but also diverse recommendations. This is an improvement over current recommendation systems that often train on all user behavior data gathered collectively over time (rather than at individual time points across time like the DCTM 201), which causes these current systems to often be outdated and insensitive to emerging trends in user preference. Using the DCTM 201 or DWCTM 1501, it is possible to recommend media content items that are relevant to the current user taste and sensitive to emerging trends, as well as possibly predict future ones.
Additionally, the media delivery system 1104 can utilize the DCTM 201 or DWCTM 1501 to discover emerging artists and artist groups to further inform the recommendations. By viewing the artists as the vocabulary, new related artists can be discovered based on a user's recent listening sessions, which is not necessarily what other users are listening to (as “similar artists”), providing further personalization. As every user listening session is a unique collection of topics, the recommendation based on the topics of the current session can offer a new personalized and diversified session creation.
In other examples, the media delivery system 1104 can utilize the DCTM 201 or DWCTM 1501 more generally to understand the evolution of artist groups to keep playlists up to date. For example, using the DCTM 201 to model user listening sessions, the media delivery system 1104 can understand an artist grouping from the point of view of the user (e.g., why artists are grouped together). As one example, the artists can be grouped not only by genre, but also based on geographical reasons (e.g., from a same country or region of a country) or shared themes in their tracks. The DCTM 201 can model an artist grouping over time to understand how the grouping evolves and changes. Human or machine editors can use this information to create new playlists of media content items, or enable connections between emerging artists while keeping the playlists up to date and aligned with users' interests. Additionally, this information can be utilized as part of business decisions to determine whether more editors should be dedicated to creating playlists for certain artist groups or genres related to those artist groups to keep up with user demand.
In further examples, the media delivery system 1104 can utilize the DCTM 201 or DWCTM 1501 to moderate content of media content items. For at least some types of media content items, such as podcasts, it is important to detect harmful content. However, a direct search of the podcasts for banned keywords is likely not an effective detection method, as synonyms or words with a different meaning are likely to be used to avoid detection. The DCTM 201 or DWCTM 1501 can be utilized to understand a relationship between the words and topics that are used in the podcasts, which can aid in detection of potential harmful content. For example, it is likely that the words and topics used to indicate harmful content would be used in a different context than usual, indicating that their meaning could be different and the content of the podcast is potentially harmful.
In yet further examples, the media delivery system 1104 can utilize the DCTM 201 or DWCTM 1501 for knowledge graph integration. Knowledge graphs provide a useful way to link together topics and keywords. However, topics in the knowledge graph are often static, meaning that two entities are linked by a fact (e.g., a family relationship), whereas a manner in which topics and keywords are going to be consumed together can depend on other dynamic factors. For example, based on recent news, topics related to “news” and “sports” can be recommended together if something recently occurred in the sports community, or at other times, topics related to “news” and “politics” if something recently occurred in the political community. The DCTM 201 or DWCTM 1501 can account for these dynamic factors to ensure that the knowledge graph is up to date, reflecting the dynamic changes in trends.
The service 102 hosting the DCTM 201 or DWCTM 1501 and/or the application systems 110 that apply the output of the DCTM 201 or DWCTM 1501 to other processes are not limited to media streaming or delivery systems. Other example services 102 and/or application systems 110 can include social networking or professional networking systems. For example, social media posts can be modeled by the DCTM 201 or DWCTM 1501 to determine emerging or trending topics. As one example, in response to a global pandemic, many users that subscribe to the social networking system now have to work from home. The DCTM 201 or DWCTM 1501 can identify as a topic a distribution of words that are associated with working from home from the social media posts, which can be utilized to provide targeted advertising (e.g., noise canceling headphones to deal with those new co-workers) or other similar recommendations within the social networking system. Additionally, the identified topic from social media posts can be provided to other systems, such as a media streaming service, and utilized to inform types of media content recommendations (e.g., stress relieving media content items).
Moreover, in addition to modeling evolution of topic popularity, representation, and correlation over time, the DCTM 201 can also model the evolution thereof based on geography to provide a spatio-temporal perspective. For example, the Gaussian processes utilized for the dynamic modeling, described in detail with respect to
A topic is a probability distribution over vocabulary. Here, the vocabulary are the words 1505 of the documents 1504. Thus, the topics identified by the DWCTM 1501 are each a cluster or grouping of a subset of the words 1505 (see the example of
The quantity of topics 1508 received as input informs the DWCTM 1501 of a number of word clusters to identify. Depending on the quantity of topics 1508 provided as input and a number of topics that can be drawn from the set 1502 of documents 1504, a list of topics 1510 output by the DWCTM 1501 can include all topics inferred during an inference process or the N most probable topics for the documents 1504 where N corresponds to the quantity of topics 1508. As the topics are clusters of words 1505, in some examples, the most probable words associated with a topic are provided as the topic within the list of topics 1510. For each topic 1512 in the list of topics 1510 (e.g., for each topic 1 through topic N), the DWCTM 1501 can provide outputs corresponding to a topic popularity 1514, a word embedding 1516, and a topic correlation 1518 at given time points across the period of time that the set 1502 of documents 1504 span to illustrate how the topic popularity, word embeddings, and topic correlations have evolved over time for the set 1502.
The topic popularity 1514 can be based on a document-topic distribution output by the DWCTM 1501. A document-topic distribution is a distribution over the topics for each document 1504. For example, a document is comprised of x % of topic 1, y % of topic 2, z % of topic 3, etc., where the sum of the percentages for each topic equals 100%. The topic correlation 1518 is a relationship strength between a topic and one or more other topics at a given point in time, which captures the evolution of topic correlations over time. The topic correlation 1518 can be based on covariance matrices that yield a correlation coefficient measuring the relationship strength.
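As a minimal sketch of how a covariance matrix yields correlation coefficients, the standard normalization divides each covariance entry by the product of the two corresponding standard deviations. The example matrix is made up for illustration, and the function name is hypothetical; the model's own covariance matrices are learned and vary over time.

```python
import numpy as np

def covariance_to_correlation(cov):
    """Convert a covariance matrix into a matrix of correlation coefficients."""
    cov = np.asarray(cov, dtype=float)
    std = np.sqrt(np.diag(cov))          # per-topic standard deviations
    return cov / np.outer(std, std)      # corr[i, j] = cov[i, j] / (std[i] * std[j])

# Hypothetical 2-topic covariance matrix at one time point.
cov = np.array([[4.0, 2.0],
                [2.0, 9.0]])
corr = covariance_to_correlation(cov)
```

Applying this at each time point gives one correlation matrix per time point, which is one way to track how topic correlations evolve.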
The word embedding 1516 can be based on a topic-word distribution output by the DWCTM 1501. A topic-word distribution is a distribution over the words for a given topic. For example, the distribution can include the words most frequently associated with a given topic. The DWCTM 1501 allows for new words to be introduced for machine learning at test time as well as reliably inferring low frequency words. The DWCTM 1501 automatically learns a word embedding that provides a proxy to model word correlation. As described below, the word embedding procedure allows the word embedding to be initialized with pre-trained embeddings, thereby speeding up the word embedding procedure while also accounting for low frequency words in the training vocabulary. The word embedding can include generating a word embedding matrix that corresponds to the word embedding, wherein words of similar context are grouped together within the word embedding matrix.
The DWCTM 1501 incorporates word correlation in part by augmenting a generative process of topic representations and removing the assumption of independence between words 1505 to leverage information from less frequent words. Accordingly, the DWCTM 1501 can obtain sufficient signals about less frequent words by observing the existence of similar words. The DWCTM 1501 correlates previously independent Gaussian Processes (GPs) into a Multi-Output Gaussian Process (MOGP) that explicitly captures word correlation in the form of a covariance matrix of all the words 1505. The DWCTM 1501 alleviates the high computational complexity and the large amount of data required for reliable estimates by embedding the words 1505 into a latent space and generating the covariance matrix via a covariance function. By applying a Bayesian treatment to the word representations (e.g., vectors) in the latent space, a reliable estimate of the word correlation can be obtained with a small amount of data, from which a word embedding may be inferred. When the DWCTM 1501 is applied to music, a track embedding may be provided in the inference, which may dramatically reduce the cost of, and improve, inference over listening sessions, resulting in a more efficient use of processing and memory resources.
The DWCTM 1501 utilizes a meta-encoder 1503 for the variational inference (e.g., variational posterior of topic mixing proportions) to manage document dynamics as they evolve over time which improves an amortized inference formulation. The dynamic encoding ability of the meta-encoder 1503 is different from a static encoder since meta-encoder 1503 is aware of a document timestamp 1506 during inference operations. Conventional topic models are constrained by not considering document dynamics as they evolve over time.
The DWCTM 1501 overcomes the scalability problem of traditional topic modeling by implementing an approximate word normalization through importance sampling. An approximated normalization constant for the word distribution allows the DWCTM 1501 to scale to millions of words 1505. Rather than using the entire vocabulary (which can include millions of words) each time, the DWCTM 1501 computes the normalization constant from the words that appear in the document 1504 plus a fixed subset of the words 1505 that do not appear in the document 1504.
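A minimal sketch of such an approximate normalization is shown below. It assumes a uniform proposal over the words outside the document, so the sampled terms are reweighted by the ratio of outside words to sampled words; the exact proposal and weighting used by the DWCTM 1501 are not given in this description, and the function and variable names are hypothetical.

```python
import numpy as np

def approx_log_normalizer(scores, doc_word_ids, negative_ids):
    """Estimate log(sum_w exp(scores[w])) without touching every vocabulary word.

    scores:        unnormalized log-probabilities, one per vocabulary word
    doc_word_ids:  ids of words that appear in the document (always included)
    negative_ids:  ids sampled uniformly, without replacement, from words
                   that do NOT appear in the document (assumed proposal)
    """
    scores = np.asarray(scores, dtype=float)
    doc_ids = np.unique(doc_word_ids)
    negative_ids = np.asarray(negative_ids)
    n_outside = scores.shape[0] - doc_ids.shape[0]

    selected = np.concatenate([doc_ids, negative_ids])
    shift = scores[selected].max()                      # numerical stability
    doc_term = np.exp(scores[doc_ids] - shift).sum()    # exact contribution
    weight = n_outside / negative_ids.shape[0]          # importance weight
    neg_term = weight * np.exp(scores[negative_ids] - shift).sum()
    return shift + np.log(doc_term + neg_term)

rng = np.random.default_rng(0)
scores = rng.normal(size=1000)                 # hypothetical vocabulary of 1000 words
doc_word_ids = np.array([3, 17, 17, 256])
outside = np.setdiff1d(np.arange(scores.shape[0]), doc_word_ids)
negative_ids = rng.choice(outside, size=100, replace=False)

estimate = approx_log_normalizer(scores, doc_word_ids, negative_ids)
exact = np.log(np.exp(scores).sum())
```

When every outside word is used as a negative sample, the weight becomes one and the estimate coincides with the exact normalizer; with fewer samples the cost drops at the price of some variance.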
The generative process 1702 defines a joint probability distribution over observed data and latent variables. The joint probability distribution can be decomposed into a likelihood for the observed data conditioned on the latent variables and a prior distribution from which the latent variables are drawn. In some examples, a probabilistic graphical model 1704 is used to illustrate the generative process 1702. For example, within the probabilistic graphical model 1704, the shaded node represents the observed data, the unshaded nodes represent the latent variables, and the edges represent possible dependencies between the nodes. A goal of the training phase is to learn the latent variables and model parameters. In some examples, at least a portion of the training dataset 1708 is held back to be used for testing (e.g., 75% of the dataset can be used for training, and 25% can be withheld as a testing dataset).
DWCTM 1501 is a probabilistic generative model that assumes that each document d, associated with a specific time point xd, is generated by sampling a set of words according to K topics. Each document has an unnormalized topic mixing proportion ηd sampled from a prior distribution, ηd˜(μχ
The generative process 1702 of a Nd-word document d is summarized as follows. First, draw a mixture of topics ηd˜(μx
1) Draw a topic assignment zn|ηd from a categorical distribution with parameter σ(ηd);
2) Draw a word wn|zn, β from a categorical distribution with parameter σ(βz
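The sampling steps above can be sketched as follows. The sketch assumes that the prior on the topic mixing proportion is a multivariate normal with mean μ and covariance Σ at the document's time point, that σ denotes the softmax function, and that `beta` holds the unnormalized topic-word scores at that time point. These are assumptions made for illustration, since the corresponding expressions are incomplete in this text, and all names are hypothetical.

```python
import numpy as np

def softmax(v):
    """Map unnormalized scores to a probability vector."""
    e = np.exp(v - np.max(v))
    return e / e.sum()

def generate_document(mu, Sigma, beta, n_words, rng):
    """Sample one document.

    mu:    (K,) prior mean of the topic mixing proportion (assumed Gaussian prior)
    Sigma: (K, K) prior covariance
    beta:  (K, P) unnormalized topic-word scores at the document's time point
    """
    eta = rng.multivariate_normal(mu, Sigma)      # unnormalized topic mixture
    theta = softmax(eta)                          # topic probabilities
    topics = np.empty(n_words, dtype=int)
    words = np.empty(n_words, dtype=int)
    for n in range(n_words):
        z = rng.choice(theta.shape[0], p=theta)               # step 1: topic assignment
        w = rng.choice(beta.shape[1], p=softmax(beta[z]))     # step 2: word draw
        topics[n], words[n] = z, w
    return eta, topics, words

rng = np.random.default_rng(0)
K, P = 3, 10                                      # hypothetical sizes
eta, topics, words = generate_document(
    mu=np.zeros(K), Sigma=np.eye(K), beta=rng.normal(size=(K, P)),
    n_words=20, rng=rng,
)
```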
The individual documents are assumed to be independent and identically distributed (i.i.d.) given the document-topic proportion and topic-word distribution. Under this generative process 1702, the marginal likelihood for a given corpus W that contains D documents becomes:
To model the temporal dynamics of topic mixing proportions ηd, the temporal processes can be considered as the prior distributions for μ and Σ. In particular, a zero-mean Gaussian process can be considered to model the topic probability (ηχ
Σχ
Conventionally, the topic representations β of dynamic topic models are allowed to change over time by defining a GP prior over time independently for each word in each topic, so that there will be $K \times P$ independent GPs, where $K$ is the number of topics and $P$ is the number of words in the vocabulary. The conventional process does not allow information sharing among similar words and results in a large number of variational parameters for inference.
According to an aspect, correlation among words can be defined by a correlated temporal process for all words. First, a latent representation $h_i \in \mathbb{R}^{Q}$ can be defined for each word in the vocabulary. The latent representations are given an uninformative prior $h_i \sim \mathcal{N}(0, I)$. Then, a MOGP is defined for the topic representations over time for each topic:
$$p\big((\beta_k)_{:} \mid H, x\big) = \mathcal{N}\big((\beta_k)_{:} \mid 0,\; K_H \otimes K_x\big), \tag{2}$$
where $(\cdot)_{:}$ denotes matrix vectorization, $\otimes$ denotes the Kronecker product, and $\beta_k$ is a $T \times P$ matrix representing the unnormalized word probabilities over time for the topic $k$ ($T$ is the number of unique time points in the corpus). The covariance matrix $K_x$ is computed using the kernel function $\kappa_x$ over all the time points $x$, and the covariance matrix $K_H$ is computed using the kernel function $\kappa_H$ over all the word representations $H = (h_1, \ldots, h_P)$. With this formulation, all the words at all the time points are jointly modeled with a single GP, in which the word correlation is encoded in the $TP \times TP$ covariance matrix. The prior distributions among different topics are assumed to be independent: $p(\beta \mid x, H) = \prod_{k=1}^{K} p(\beta_k \mid x, H)$. The word correlation is encoded through the latent representations of words, which are static over time and shared across all the topics. Although the number of latent vectors is relatively large, they can be reliably estimated by conditioning on the whole corpus. The topic assignment variables {zn}n=1N
where β(χ
The MOGP formulation provides a framework to correlate both the temporal dimension and the words in the vocabulary for the topic representations under a single GP, but it also introduces a computational challenge because calculating the probability density function (PDF) of Equation (2) is O(P³T³). To overcome this computational challenge, an efficient variational inference method is provided based on the stochastic variational sparse GP formulation (SVGP), reducing the computational complexity to be linear with respect to P and T.
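As a minimal illustration of the cost noted above, the following NumPy sketch builds the dense TP×TP covariance KH⊗Kx for toy sizes and then draws a sample of βk without using the dense matrix, relying only on Cholesky factors of the two small matrices. The squared exponential kernel, the toy sizes, and all names are assumptions chosen for illustration and are not taken from the disclosure.

```python
import numpy as np

def sq_exp_kernel(a, b, lengthscale=1.0):
    """Squared exponential kernel between the rows of a and the rows of b."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale**2)

rng = np.random.default_rng(0)
T, P, Q = 4, 5, 2                       # toy sizes (hypothetical)
x = np.linspace(0.0, 1.0, T)[:, None]   # time points
H = rng.standard_normal((P, Q))         # latent word representations

jitter = 1e-6                           # small diagonal term for numerical stability
K_x = sq_exp_kernel(x, x, lengthscale=0.3) + jitter * np.eye(T)   # T x T
K_H = sq_exp_kernel(H, H) + jitter * np.eye(P)                    # P x P

# Dense covariance of the column-wise vectorization of the T x P matrix beta_k.
K_full = np.kron(K_H, K_x)              # (T*P) x (T*P), cubic cost to factorize

# Structured alternative: vec(L_x E L_H^T) = (L_H kron L_x) vec(E),
# so beta_k below has vec-covariance K_H kron K_x.
L_x = np.linalg.cholesky(K_x)
L_H = np.linalg.cholesky(K_H)
E = rng.standard_normal((T, P))
beta_k = L_x @ E @ L_H.T                # T x P sample

# Built only to verify the identity; not needed for sampling.
factor = np.kron(L_H, L_x)
```

The structured draw only factorizes a T×T and a P×P matrix, which is the property the sparse variational method described next builds on.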
The word correlation is encoded by the latent representations of individual words H. The variational posterior of H can be parameterized as q(H)=N (mH, SH) to derive a variational lower bound as:
log p(W|x)≥Eq(H)[log p(W|x,H)]−KL(q(H)∥p(H)), (4)
where KL(·∥·) denotes the Kullback-Leibler divergence. The KL term in (4) can be computed in closed form because both q(H) and p(H) are normal distributions, but p(W|x, H) is intractable.
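The closed-form KL term can be sketched as follows, under the simplifying assumption (not stated in the disclosure) that SH is diagonal, so that q(H) factorizes over the entries of H. The function name and toy sizes are hypothetical.

```python
import numpy as np

def kl_diag_gaussian_to_standard_normal(mean, var):
    """KL( N(mean, diag(var)) || N(0, I) ), summed over all entries."""
    mean = np.asarray(mean, dtype=float)
    var = np.asarray(var, dtype=float)
    return 0.5 * np.sum(var + mean**2 - 1.0 - np.log(var))

rng = np.random.default_rng(1)
P, Q = 6, 3                                  # toy vocabulary and latent sizes
m_H = rng.standard_normal((P, Q))            # variational means
S_H = np.exp(rng.standard_normal((P, Q)))    # positive variational variances
kl_value = kl_diag_gaussian_to_standard_normal(m_H, S_H)
```

With a full SH, the same quantity would instead use the trace and log-determinant of SH.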
To derive a lower bound for the marginalized likelihood p(W|x, H), a variational lower bound is first derived for log p(βk|x, H) according to the SVGP formulation. Taking advantage of the Kronecker product structure in the covariance matrix, i.e., KH⊗Kx, the inducing variables can be defined to be on a grid in the joint space of the word embedding and the temporal dimension. Let Uβk be an Mx×MH matrix which follows the distribution,
p(Uβ
The rows of Uβ
After defining the inducing variable Uβ
p(βk|x,H)=∫p(βk|Uβk,x,H,Zx,ZH)p(Uβk|Zx,ZH)dUβk. (5)
The conditional distribution of βk is:
p(βk|Uβk,Zx,ZH,x,H)=N(βk|KfuKuu−1Uβk,Kff−KfuKuu−1Kuf), (6)
where Kfu=KfuH⊗Kfux and Kff=KffH⊗Kffx. KffH is the covariance matrix computed on H with κH, and Kffx is computed on χ with κx.
With the augmented GP formulation, a variational lower bound can be derived. However, a naive parameterization of the variational posterior q(Uβ
q(Uβ
where M is the mean of the variational posterior, ΣH is a P×P covariance matrix and Σx is a T×T covariance matrix. With this formulation, the covariance matrix can be inverted efficiently by only inverting the two smaller covariance matrices,
(ΣH⊗Σx)−1=(ΣH)−1⊗(Σx)−1. Such a parameterization dramatically reduces the number of variational parameters in the covariance matrix from Mx²MH² to Mx²+MH².
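The inverse identity and the parameter counts can be checked numerically with a short sketch. The two factors are sized here by the numbers of inducing points, which is an assumption made so that the counts match those quoted above; the helper name and sizes are hypothetical.

```python
import numpy as np

def random_spd(n, rng):
    """Random symmetric positive definite matrix (illustrative helper)."""
    a = rng.standard_normal((n, n))
    return a @ a.T + n * np.eye(n)

rng = np.random.default_rng(2)
M_x, M_H = 3, 4                          # toy numbers of inducing points
Sigma_x = random_spd(M_x, rng)
Sigma_H = random_spd(M_H, rng)

# Inverting the full Kronecker product versus inverting the two factors.
inv_dense = np.linalg.inv(np.kron(Sigma_H, Sigma_x))
inv_factored = np.kron(np.linalg.inv(Sigma_H), np.linalg.inv(Sigma_x))

# Entries of an unstructured covariance versus the two Kronecker factors.
full_params = (M_x * M_H) ** 2
kron_params = M_x**2 + M_H**2
```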
With the variational posterior q(Uβ
log p(·|H)≥Eq(βk|H)[log p(·|βk)]−KL(q(Uβ
where q(βk|H)=∫p(βk|Uβ
The multivariate normal distribution with a Kronecker product covariance matrix like p(Uβ
To compute the expectation in equation (8), samples are drawn from q(βk|H). As q(βk|H) is a multivariate normal distribution with a full covariance matrix, drawing a correlated sample of βk is computationally very expensive, O(P³T³). Usually, drawing a fully correlated sample can be avoided if the downstream log PDF, log p(·|βk), can be decomposed into a sum over individual entries of βk, e.g., when p(·|βk) is a normal distribution. However, such a decomposition is not applicable here due to the softmax function applied to βk in equation (1). To efficiently sample from q(βk|H), another sparse GP approximation, for example the “fully independent training conditional” (FITC) approximation, can be applied to the conditional distribution of βk. The resulting formulation is:
pFITC(βk|U,Zx,ZH,X,H)=N(βk|KfuKuu−1(UT),diag(Kff−KfuKuu−1Kuf)), (9)
where diag(·) returns a diagonal matrix while keeping the diagonal entries. Since Kfu, Kff and Kuu have a Kronecker structure, the mean and covariance can be rewritten to compute them efficiently. Sampling from (9) is efficient because individual entries of βk can be sampled independently. This reduces the computational complexity of sampling βk from O(P³T³) to O(PTMx²MH²).
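A minimal sketch of how the FITC mean and diagonal variance might be computed from the Kronecker factors alone is given below. It assumes a squared exponential kernel with unit prior variance for both factors, toy sizes, and randomly placed inducing locations; none of these choices comes from the disclosure.

```python
import numpy as np

def sq_exp(a, b, lengthscale=1.0):
    """Squared exponential kernel between the rows of a and b."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale**2)

rng = np.random.default_rng(3)
T, P, Q, M_x, M_H = 5, 6, 2, 3, 4            # toy sizes (hypothetical)
x = np.linspace(0.0, 1.0, T)[:, None]        # observed time points
Z_x = np.linspace(0.0, 1.0, M_x)[:, None]    # inducing time points
H = rng.standard_normal((P, Q))              # word representations
Z_H = rng.standard_normal((M_H, Q))          # inducing word locations
U = rng.standard_normal((M_x, M_H))          # inducing variables on the grid

jitter = 1e-3
Kuu_x = sq_exp(Z_x, Z_x) + jitter * np.eye(M_x)
Kuu_H = sq_exp(Z_H, Z_H) + jitter * np.eye(M_H)
Kfu_x = sq_exp(x, Z_x)                       # T x M_x
Kfu_H = sq_exp(H, Z_H)                       # P x M_H

# Per-factor projections Kfu Kuu^{-1} (Kuu is symmetric).
A_x = np.linalg.solve(Kuu_x, Kfu_x.T).T      # T x M_x
A_H = np.linalg.solve(Kuu_H, Kfu_H.T).T      # P x M_H

# Mean: Kfu Kuu^{-1} vec(U) reshaped to T x P.
fitc_mean = A_x @ U @ A_H.T

# diag(Kfu Kuu^{-1} Kuf) is the outer product of the per-factor diagonals.
q_x = np.einsum("ij,ij->i", A_x, Kfu_x)
q_H = np.einsum("ij,ij->i", A_H, Kfu_H)
fitc_var = 1.0 - np.outer(q_x, q_H)          # diag(Kff) is 1 for this kernel

# Entries are independent under FITC, so sampling is elementwise.
eps = rng.standard_normal((T, P))
beta_sample = fitc_mean + np.sqrt(np.clip(fitc_var, 0.0, None)) * eps
```

No TP×TP matrix is formed at any step; only the T×Mx and P×MH factors are used.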
Regarding the variational inference for a mixture of topics, a variational posterior q(ηd) for each document can be used to derive a variational lower bound of the log probability over the documents as:
Since the lower bound is a summation over individual documents, the formulation allows for a stochastic approximation by sub-sampling the documents.
Regarding importance sampling, computing the expectation
Eq(β|H)q(η)[log p(Wd|ηd, β(x
First, let ξd=β(x
Then, the derivative can be explicitly written as (see Supplemental analysis further below):
The sum inside the parentheses runs over the entire vocabulary (of size P), which is inefficient and may even be infeasible for a large vocabulary. To efficiently scale the DWCTM 1501 to an arbitrarily large vocabulary, the normalization constant can be approximated with a fixed number of words using self-normalizing importance sampling. The words appearing in the batch of documents under analysis are treated as positive (e.g., as a positive class in a classification problem), and a random sample of M classes (e.g., words from the vocabulary) is used to approximate the normalization constant.
Consider a sample vector s∈{1, . . . , P}M+N
As a result, the complexity of computing the expectation is further reduced from O(PTMx²MH²) to O((M+Nd)TMx²MH²), providing additional processing and memory resource conservation.
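The idea of approximating the normalization constant from the positive words plus a random subset of the remaining words can be sketched as below. This is a generic estimator with a uniform proposal over the non-positive words, in which sampled logits are shifted by the log of their inclusion probability; that particular correction, the function names, and the sizes are assumptions for illustration and may differ from the shift defined in the Supplemental analysis.

```python
import numpy as np

def logsumexp(v):
    """Numerically stable log(sum(exp(v)))."""
    m = np.max(v)
    return m + np.log(np.sum(np.exp(v - m)))

def sampled_log_normalizer(logits, positive_idx, num_negatives, rng):
    """Estimate log(sum_i exp(logits[i])) from positives plus sampled negatives."""
    vocab_size = logits.shape[0]
    positive_idx = np.unique(positive_idx)
    pool = np.setdiff1d(np.arange(vocab_size), positive_idx)
    sampled = rng.choice(pool, size=num_negatives, replace=False)
    # Probability that a given negative word is included in the sample.
    inclusion = num_negatives / pool.size
    shifted = np.concatenate(
        [logits[positive_idx], logits[sampled] - np.log(inclusion)]
    )
    return logsumexp(shifted)

rng = np.random.default_rng(4)
logits = rng.standard_normal(1000)           # toy unnormalized word scores
positives = np.array([3, 17, 256])           # words present in the document
estimate = sampled_log_normalizer(logits, positives, num_negatives=50, rng=rng)
exact = logsumexp(logits)
```

Only M+Nd logits are touched per document, which is where the reduced complexity comes from; when every non-positive word is sampled the estimate coincides with the exact normalizer.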
Referring again to
q(ηd)=N(σm([WdMβ,x
where σm and σS are the parametric functions generating the mean and variance of q(ηd), respectively, Mβ,x
The lower bound W is intractable, so an unbiased estimate of W can be computed via Monte Carlo sampling. As the posteriors q(ηd) are normal distributions, a low-variance estimate of the gradients can be obtained via reparameterization. The document-topic proportion for each document d follows a prior distribution p(ηd, μx
As shown in
A quantitative analysis highlights the benefit of incorporating word correlation in topic modelling by comparing DWCTM 1501 with state-of-the-art topic models on public datasets. Full details on the data can be found in the Supplemental analysis section further below. In all datasets, there is a timestamp associated with each document. The static topic models (LDA and CTM) are optimized without considering timestamps, while DWCTM 1501 incorporates the timestamps into the inference.
Each dataset was split with 75% of the samples for training and 25% for testing; documents associated with the same timestamp were assigned to the same split. For each dynamic topic model, a Matérn 3/2 kernel was used for β to allow topics to quickly incorporate new words, such as neologisms. This is particularly relevant for datasets such as the NeurIPS conference papers and the Elsevier corpus, where the names of novel models become quoted in citations (for example, “LDA” started to appear in publications together with “topic modeling” after its introduction in 2003). A squared exponential kernel was used for the parameters μ and f, expecting a smooth temporal evolution of both topic probabilities and their correlation. A full list of experimental settings can be found in the Supplemental analysis further below.
The average per-word perplexity computed on the held-out test set of the datasets for all the models is shown in
To highlight the benefit of using the meta-encoder 1503, for comparison purposes the DWCTM 1501 was trained using encoders that only consider the document representation as inputs. The comparison is shown in
A qualitative analysis on the NeurIPS dataset provides additional insight about the word correlation in DWCTM 1501 by visualizing the inferred word correlation. Four inferred popular topics were selected across all years on the NeurIPS dataset and the 10 most frequent words were collected for each topic (see Supplemental analysis further below). Then, the covariance matrix among these frequent words (with duplicate words removed) was computed by applying the learnt kernel function κH to the mean of the variational posterior of the word representations, mH. The covariance matrix can be converted into a correlation matrix for better interpretability. Due to the choice of the kernel function (squared exponential), no anti-correlation is captured in the correlation matrix. A simple hierarchical clustering was applied to the correlation matrix. With only the word relation, the words associated with the same topic are roughly grouped together (topics are unknown to the clustering algorithm). For example, network, weight, neural and layer, which identify the topic neural network, have very similar embeddings. Word pairs that are often used together in some research area show interesting strong correlations, such as input-output, image-pixel, and time-state. This indicates that the word correlation has contributed to the identification of these topics.
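The conversion from a kernel covariance to a correlation matrix, and the reason a squared exponential kernel yields no anti-correlation, can be illustrated with a short sketch. The embeddings below are random stand-ins for posterior means of word representations, and the kernel hyperparameters are arbitrary assumptions.

```python
import numpy as np

def sq_exp_kernel(a, b, amplitude=1.0, lengthscale=1.0):
    """Squared exponential kernel; every entry is strictly positive."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return amplitude**2 * np.exp(-0.5 * d2 / lengthscale**2)

def covariance_to_correlation(K):
    """Rescale a covariance matrix so that its diagonal equals one."""
    d = np.sqrt(np.diag(K))
    return K / np.outer(d, d)

rng = np.random.default_rng(5)
word_means = rng.standard_normal((8, 10))    # 8 words, Q=10 (stand-in values)
K = sq_exp_kernel(word_means, word_means, amplitude=2.0, lengthscale=3.0)
corr = covariance_to_correlation(K)
```

Because the kernel is an exponential of a negative squared distance, all correlations lie in the interval (0, 1], so nearby embeddings approach 1 and distant ones approach 0 without ever becoming negative.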
As discussed above, an efficient approach to model word correlation in dynamic topic modeling is provided that incorporates word dynamics through the use of MOGPs. The amortized inference is improved via the meta-encoder 1503 which allows DWCTM 1501 to be sensitive to the changes of topic representations. A scalable inference enabled for large vocabularies is provided by deriving an asymptotically unbiased estimator of the gradient to dramatically subsample the number of words in computation. Incorporating word correlation into DWCTM 1501 significantly improves the modeling quality and allows for leveraging information from related words.
Supplemental Analysis:
Importance sampling can use the probability of words in a document conditioned on the parameter ηd and β as:
Its derivative can be derived as:
To approximate this derivative, a random sample of M words can be considered from the vocabulary and used to approximate the normalization constant. Consider a sample vector s∈{1, . . . ,P}M+Nd, which represents a sample of words in the vocabulary and stores the index of the Nd positive (words appearing in document d) and the index of the M sampled words.
Let ξ′d,i := ξd,i − ln(Qdi/P) if word i does not appear in document d, and ξ′d,i := ξd,i − ln(Qdi) otherwise, where Qdi is the proposal distribution. The true logits are thus shifted by the expected number of occurrences of word i, ensuring that the sampled softmax is asymptotically unbiased. Q can be a uniform distribution over the subset of words considered, so Qdi=1/(Nd+M). Then:
The FITC approximation for the multi-output Gaussian process results in the following formulation:
pFITC(β|U,ZX,ZH,X,H)=N(β|KfuKuu−1(UT),diag(Kff−KfuKuu−1Kuf)),
where diag(·) returns a diagonal matrix while keeping the diagonal entries, and A: denotes vec(A), the column-wise vectorization of the matrix A. Since Kfu, Kff and Kuu have a Kronecker structure, the mean and covariance can be rewritten to compute them efficiently as follows:
Note that the last line becomes a vectorized outer product between vectors and can therefore be computed efficiently. A similar analysis can be used for diag(Kff). The full derivation is as follows:
The matrix normal is related to the multivariate normal distribution as:
X˜MNn×p(M,U,V), (7)
if and only if,
vec(X)∼Nnp(vec(M),V⊗U) (8)
where ⊗ denotes the Kronecker product and vec(M) denotes the vectorization of M.
Sampling from the distribution and the KL divergence can be computed efficiently. Uβk can be sampled efficiently following the procedure: (i) sample C˜MNh×x(0,I,I), C∈Rh×x, a collection of independent samples from a standard normal distribution; then (ii) let Uβk=(M+ACB), where ΣH=AAT and ΣX=BTB. The KL divergence between q(Uβk) and p(Uβk) can also be computed efficiently.
Sampling from the matrix normal distribution is a special case of the sampling procedure for the multivariate normal distribution. Let X be an n by p matrix of np independent samples from the standard normal distribution, so that:
X˜MNn×p(0,I,I). (9)
Then let,
Y=M+AXB, so that Y˜MNn×p(M,AAT,BTB), (10)
where A and B can be chosen by Cholesky decomposition or a similar matrix square root operation.
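The sampling procedure of Equations (9) and (10) can be sketched in NumPy as follows, with A and B obtained from Cholesky factors. The helper names and toy sizes are hypothetical.

```python
import numpy as np

def random_spd(n, rng):
    """Random symmetric positive definite matrix (illustrative helper)."""
    a = rng.standard_normal((n, n))
    return a @ a.T + n * np.eye(n)

def sample_matrix_normal(M, A, B, rng):
    """Draw Y = M + A X B with X i.i.d. standard normal, so Y ~ MN(M, A A^T, B^T B)."""
    X = rng.standard_normal(M.shape)
    return M + A @ X @ B

rng = np.random.default_rng(6)
n, p = 3, 4
row_cov = random_spd(n, rng)                 # among-row covariance (n x n)
col_cov = random_spd(p, rng)                 # among-column covariance (p x p)
A = np.linalg.cholesky(row_cov)              # A A^T = row_cov
B = np.linalg.cholesky(col_cov).T            # B^T B = col_cov
M = rng.standard_normal((n, p))
Y = sample_matrix_normal(M, A, B, rng)

# vec(A X B) = (B^T kron A) vec(X), hence cov(vec(Y)) = (B^T B) kron (A A^T).
linear_map = np.kron(B.T, A)
vec_cov = linear_map @ linear_map.T
```

The implied covariance of vec(Y) matches the V⊗U form of the equivalent multivariate normal, with U the row covariance and V the column covariance.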
The KL divergence between two matrix-variate normal distributions, e.g., q(Uβk) and p(Uβk), can be analytically computed as:
To implement tr[MT(KX)−1M(KH)−1], use KX=LXLTX, KH=LHLTH, A=LX−1 MLH−T, then tr[MT(KX)−1M(KH)−1]=tr(AT A).
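A small numerical check of this identity is sketched below. The matrices are random stand-ins, and generic linear solves are used for brevity; a triangular solver would exploit the structure of the Cholesky factors.

```python
import numpy as np

def random_spd(n, rng):
    """Random symmetric positive definite matrix (illustrative helper)."""
    a = rng.standard_normal((n, n))
    return a @ a.T + n * np.eye(n)

rng = np.random.default_rng(7)
M_x, M_H = 3, 4
K_X = random_spd(M_x, rng)
K_H = random_spd(M_H, rng)
M = rng.standard_normal((M_x, M_H))

L_X = np.linalg.cholesky(K_X)                # K_X = L_X L_X^T
L_H = np.linalg.cholesky(K_H)                # K_H = L_H L_H^T

# A = L_X^{-1} M L_H^{-T}, computed with two solves instead of explicit inverses.
left = np.linalg.solve(L_X, M)               # L_X^{-1} M
A = np.linalg.solve(L_H, left.T).T           # (L_H^{-1} left^T)^T = left L_H^{-T}

trace_fast = np.sum(A * A)                   # tr(A^T A)
trace_direct = np.trace(M.T @ np.linalg.inv(K_X) @ M @ np.linalg.inv(K_H))
```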
Then,
KL(q∥p)=∫q(x)(log q(x)−log p(x))dx (11)
which in the case of two multivariate Gaussian distributions, say p(x)=N(m1,S1), q(x)=N(m2, S2) is equal to
Now, a Kronecker representation of S1 and S2 can be used, with S1=Sh⊗Sx and S2=Kh⊗Kx. Let M=m1−m2. Also, consider the matrix form of M, whose vectorization is m1−m2, likewise indicated as M. Then the KL divergence becomes (using |V⊗U|=|V|^n|U|^p and the mixed-product property):
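A numerical sketch of a Kronecker-structured KL divergence between two Gaussians is given below, together with a dense reference used only for checking. It relies on the standard Gaussian KL formula, the determinant rule quoted above, and the mixed-product property; the function names, argument order, and toy sizes are assumptions and do not reproduce the notation m1, m2, S1, S2 of the text.

```python
import numpy as np

def random_spd(n, rng):
    """Random symmetric positive definite matrix (illustrative helper)."""
    a = rng.standard_normal((n, n))
    return a @ a.T + n * np.eye(n)

def kl_kron_gaussians(M_q, S_h, S_x, M_p, K_h, K_x):
    """KL(q || p) with q = N(vec(M_q), S_h kron S_x), p = N(vec(M_p), K_h kron K_x)."""
    n_x, n_h = M_q.shape
    D = M_p - M_q
    # tr((K_h kron K_x)^{-1} (S_h kron S_x)) factorizes into a product of traces.
    trace_term = np.trace(np.linalg.solve(K_h, S_h)) * np.trace(np.linalg.solve(K_x, S_x))
    # vec(D)^T (K_h kron K_x)^{-1} vec(D) = tr(D^T K_x^{-1} D K_h^{-1}).
    quad_term = np.trace(D.T @ np.linalg.solve(K_x, D) @ np.linalg.inv(K_h))
    logdet_p = n_x * np.linalg.slogdet(K_h)[1] + n_h * np.linalg.slogdet(K_x)[1]
    logdet_q = n_x * np.linalg.slogdet(S_h)[1] + n_h * np.linalg.slogdet(S_x)[1]
    return 0.5 * (trace_term + quad_term - n_x * n_h + logdet_p - logdet_q)

def kl_dense_gaussians(m_q, S_q, m_p, S_p):
    """Reference KL(q || p) using full covariance matrices."""
    d = m_p - m_q
    k = m_q.size
    return 0.5 * (np.trace(np.linalg.solve(S_p, S_q)) + d @ np.linalg.solve(S_p, d)
                  - k + np.linalg.slogdet(S_p)[1] - np.linalg.slogdet(S_q)[1])

rng = np.random.default_rng(8)
n_x, n_h = 3, 4
M_q, M_p = rng.standard_normal((n_x, n_h)), rng.standard_normal((n_x, n_h))
S_h, S_x = random_spd(n_h, rng), random_spd(n_x, rng)
K_h, K_x = random_spd(n_h, rng), random_spd(n_x, rng)

kl_structured = kl_kron_gaussians(M_q, S_h, S_x, M_p, K_h, K_x)
kl_reference = kl_dense_gaussians(M_q.flatten(order="F"), np.kron(S_h, S_x),
                                  M_p.flatten(order="F"), np.kron(K_h, K_x))
```

The structured version only ever factorizes the small matrices, while the reference builds the full product and is practical only at toy sizes.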
The variational inference for Gaussian and Wishart process is described below. The inference for μ includes first augmenting the Gaussian process with a set of auxiliary variables with a set of corresponding time stamps, i.e.,
p(μ|X)=∫p(μ|Uμ,X,zμ)p(Uμ|zμ)dUμ (22)
where Uμ is the auxiliary variable for μ and zμ is the corresponding index. Both p(μ|Uμ,X, zμ) and p(Uμ|zμ) follow the same Gaussian processes as the one for p(μ|X), i.e., these Gaussian processes have the same mean and kernel functions. As shown in Equation (22), the above augmentation does not change the prior distributions for μ.
The variational posterior of μ is constructed in a special form to enable efficient inference: q(μ,Uμ)=p(μ|Uμ)q(Uμ). q(Uμ)=N(Mμ,Sμ) is a multivariate normal distribution, in which the mean and covariance are variational parameters. p(μ|Uμ) is a conditional Gaussian process.
When μ is used in the down-stream distributions, a lower bound can be derived,
log p(·|μ)≥Eq(μ)[log p(·|μ)]−KL(q(Uμ)∥p(Uμ)), (23)
where q(μ)=∫p(μ|Uμ)q(Uμ)dUμ
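For a zero-mean GP, marginalizing the auxiliary variables in this way gives a Gaussian q(μ) with a standard closed form, sketched below. The squared exponential kernel, its length scale, the grid of time stamps, and the function names are assumptions for illustration.

```python
import numpy as np

def sq_exp(a, b, lengthscale=0.5):
    """Squared exponential kernel on one-dimensional inputs."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

def svgp_marginal(K_ff, K_fu, K_uu, M_u, S_u):
    """Mean and covariance of q(f) = integral of p(f | u) q(u) du, with q(u) = N(M_u, S_u)."""
    A = np.linalg.solve(K_uu, K_fu.T).T      # K_fu K_uu^{-1}
    mean = A @ M_u
    cov = K_ff - A @ (K_uu - S_u) @ A.T
    return mean, cov

x = np.linspace(0.0, 1.0, 6)                 # observed time stamps
z = np.linspace(0.0, 1.0, 3)                 # auxiliary (inducing) time stamps
K_ff = sq_exp(x, x)
K_fu = sq_exp(x, z)
K_uu = sq_exp(z, z) + 1e-6 * np.eye(z.size)

# Choosing q(U) equal to the prior p(U) recovers the GP prior on mu.
prior_mean, prior_cov = svgp_marginal(K_ff, K_fu, K_uu, np.zeros(z.size), K_uu)
```

This also illustrates the remark that the augmentation leaves the prior unchanged: with q(Uμ)=p(Uμ), the marginal is exactly the original GP prior.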
A similar stochastic variational inference method can be derived for the Wishart Process by augmenting each GP p(fij|X) in the Wishart process with a set of auxiliary variables and a set of the corresponding inputs,
p(fij|X)=∫p(fij|uij,X,zij)p(uij|zij)duij (24)
where uij is the auxiliary variable, zij is the corresponding inputs and p(fij|uij) is a conditional Gaussian process. The variational posterior of fij can be defined to be q(fij, uij)=p(fij|uij)q(uij), where q(uij)=N(mij,sij). The variational posterior of can be defined to be q()=(, ), where is a diagonal matrix. As the diagonal elements of L need to be positive, a change of variable can be applied to the variational posterior of the diagonal elements, i.e., m=log(1+exp(m)),q(m) Note that zμ and zij are variational parameters instead of random variables and may be omitted from the notation. A variational lower bound can be derived with such a set of variational posterior for all the entries {fij} and , when Σ is used for some down-stream distributions,
where q(F)=Πij∫p(fij|uij)q(uij)duij.
After deriving the variational lower bounds for the individual components of DWCTM 1501, the components can be combined to form the final variational lower bound. The word distributions for individual topics are used in defining the distribution of individual words for each document d, p(Wd|ηd, βX
The first term of L can be further decomposed by plugging in (3),
Note that all variational parameters of q(μ), q(′), q(F), q(β), q(η) are optimized. The following datasets were considered: State of the Union corpus (SotU), department of justice press releases (DoJ), Elsevier corpus (Abstracts), Blog Authorship Corpus (Blogs), NeurIPS conference papers (NeurIPS), A Million News Headlines (News), and Twitter sentiment classification (Twitter).
State of the Union corpus (1790-2018) dataset includes a yearly address of the US president, from 1790 to 2018 (229 years). Our vocabulary includes 1442 words after preprocessing, wherein the data is split into 170 documents as training and 57 documents as test data.
Department of justice press releases (2009-2018) dataset includes 13087 press releases from the Department of Justice from 2009 to 2018 (115 unique timestamps), preprocessed to include 2622 unique words. Documents were split into 9674 for training and 3413 for testing.
Elsevier OA CC-BY Corpus dataset includes 40 k open access (OA) CC-BY abstracts taken from articles across Elsevier's journals, published from 2010 to 2019. A random sample of 6898 abstracts was considered for training and a sample of the same size for testing, with 3000 words in the vocabulary.
Blog Authorship Corpus consists of the posts of 19 k bloggers gathered from blogger.com from June 1999 to August 2004. The corpus incorporates a total of 681 k posts, from which a random sample of 5649 were drawn for training and 5650 for testing. After preprocessing, 3000 words were considered in our vocabulary.
NeurIPS conference papers (1987-2015) dataset includes 5804 conference papers from 1987 to 2015 including an average of 34 papers per year. The dataset was preprocessed leading to 4799 (large dataset). In both cases, 4237 documents were used as training data and 1567 as test data.
A Million News Headlines dataset includes 1.2 M news headlines published over a period of 17 years (from 2003 to 2019). After preprocessing, a random sample of size 8526 was used for training and 2822 for test purposes with a vocabulary size of 3000.
Twitter sentiment classification dataset contains 1.6 M tweets, from April to May 2009. For computational efficiency, 4525 tweets were randomly sampled for training and the same number for testing. The samples were preprocessed using a tweet tokenizer, removing usernames and replacing repeated character sequences (length 3 or more) with sequences of length 3. After preprocessing, 3000 tokens were considered.
For the last two datasets, an extended version was also considered with the highest number of tokens available, that is, 22459 for the headlines dataset and 83582 for Twitter, subsampling 1 M documents in each dataset in the experiments. These datasets are referred to as extended; however, as both the samples and the dimensionality are different, they are effectively different datasets (hence not comparable) with respect to their smaller counterparts.
Experimentally, each dataset was split considering 75% of the samples as training and 25% as test. Documents associated with the same time stamps were assigned to the same split. For each dynamic topic model, a Matérn 3/2 kernel was used for β to allow topics to quickly incorporate new words. A squared exponential kernel was used for μ and f, expecting a smooth temporal evolution of both topic probabilities and their correlation. The amplitude and length scale of the kernels were initialized as 1 and 0.5, respectively, and optimized using the approximate empirical Bayes approach.
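The two kernel families named above can be sketched as follows for one-dimensional time inputs, using the stated initial values of 1 for the amplitude and 0.5 for the length scale. Treating the amplitude as a factor that enters squared is an assumption, as are the function names and the time grid.

```python
import numpy as np

def matern32_kernel(x1, x2, amplitude=1.0, lengthscale=0.5):
    """Matern 3/2 kernel: less smooth, so sample paths can change quickly."""
    r = np.abs(x1[:, None] - x2[None, :]) / lengthscale
    s = np.sqrt(3.0) * r
    return amplitude**2 * (1.0 + s) * np.exp(-s)

def squared_exponential_kernel(x1, x2, amplitude=1.0, lengthscale=0.5):
    """Squared exponential kernel: infinitely smooth sample paths."""
    r = (x1[:, None] - x2[None, :]) / lengthscale
    return amplitude**2 * np.exp(-0.5 * r**2)

t = np.linspace(0.0, 1.0, 8)                 # normalized time stamps (illustrative)
K_beta = matern32_kernel(t, t)               # kernel family used for beta
K_mu = squared_exponential_kernel(t, t)      # kernel family used for mu and f
```

At small but nonzero time differences the Matérn 3/2 kernel decays faster than the squared exponential, which is consistent with using it where abrupt changes such as newly introduced words are expected.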
Experiments were conducted using the Adam optimizer with a learning rate of 0.001 and up to 10 k epochs until convergence. Experiments included different numbers of topics; results are reported using a default choice of 30 topics for all datasets (20 for SotU) to maintain consistency with previous works. Experiments also included different numbers of inducing points for the three components β, μ and f, thus controlling the complexity of the variational posterior. The number of inducing points used for these components is 15, 20 and 15, respectively. DWCTM 1501 has an additional component for the latent embedding of words in β, for which MH=200 was used in Q=10 dimensions. The posterior for H was initialized by transforming the words in the vocabulary using ELMo embeddings pre-trained on the 1 Billion Word Benchmark and taking the first Q principal components of a PCA transformation. For the posterior of η, when using a static encoder, a dense neural network with three layers of size 500, 300 and 200, respectively, was considered. To account for the increased input dimensionality in the meta-encoder 1503, a dense neural network with three layers of size 1000, 600 and 400, respectively, was used.
The perplexity metric can be computed as:
However, in the case of a large vocabulary, the log probability cannot be computed exactly, so it is approximated by sampling M random negative words which do not appear in the document:
where rnd=M/(P Nd) if n is one of the negative words (and rnd=1 otherwise) is the uniform probability of picking word n.
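For reference, the usual definition of per-word perplexity (the exponential of the negative average per-word log-likelihood on held-out documents) can be sketched as below. This is the standard definition and is assumed, rather than confirmed, to match the expression used in the experiments; the function name and inputs are hypothetical.

```python
import numpy as np

def per_word_perplexity(doc_log_probs, doc_lengths):
    """exp(-(sum of document log-likelihoods) / (total number of words))."""
    total_log_prob = float(np.sum(doc_log_probs))
    total_words = float(np.sum(doc_lengths))
    return float(np.exp(-total_log_prob / total_words))

# Sanity check: a model that is uniform over V words has perplexity V.
V = 100
doc_lengths = np.array([3, 5])
doc_log_probs = -doc_lengths * np.log(V)
uniform_perplexity = per_word_perplexity(doc_log_probs, doc_lengths)
```

When the exact log probability is replaced by a sampled approximation, the same function applies to the approximate document log-likelihoods.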
The most correlated word as found by DWCTM 1501 is “wishart” (correlation 0.947), even though “wishart” is a word which is relatively rare in the dataset (see
The various examples and teachings described above are provided by way of illustration only and should not be construed to limit the scope of the present disclosure. Those skilled in the art will readily recognize various modifications and changes that may be made without following the examples and applications illustrated and described herein, and without departing from the true spirit and scope of the present disclosure.
This application is a continuation-in-part of U.S. application Ser. No. 16/932,323, filed Jul. 17, 2020, now U.S. Pat. No. 11,727,221, which is hereby incorporated by reference in its entirety.
Number | Date | Country | |
---|---|---|---|
20220147716 A1 | May 2022 | US |
Number | Date | Country | |
---|---|---|---|
Parent | 16932323 | Jul 2020 | US |
Child | 17526845 | US |