The present disclosure relates to the field of information extraction technologies and, more particularly, relates to providing augmented product specifications based on product information and user reviews.
When people purchase a product from an online store, they are usually provided with product-related information such as product description, product images, and user reviews. Often, product specification is also provided to specify its features in an organized way, especially for high-technology products that consist of several electronic components, and it is highly informative for users to understand the product. An example of a digital camera product specification is shown in
However, it is often hard for consumers to understand what the contents of a product specification really mean when they are unfamiliar with them. For example, when novice consumers read a digital camera's specification, they may have no idea what the value “TTL phase detection” of the feature “Auto Focus” means. Such consumers are not only unfamiliar with the feature value itself, but also do not know what it implies for their own use.
In order to choose “right” value of a feature, consumers would like to hear direct experience from consumers who own a product equipped with it, which may answer questions such as “is the feature value preferred by others?” A typical product purchase from online stores is depicted in
Opinion mining and summarization have been widely studied. Most studies have been performed on product review or Weblog data sets, since people leave rich opinions in them. In order to know the target of opinions and to mine opinions more effectively, aspect-based opinion mining and summarization has been studied as a mainstream approach in the field. To find aspects of a product, many studies have applied a topic model, which finds latent topics in documents. Most existing works in this line of research mine opinions for a product feature, either pre-defined or latent.
Although product specifications have been available in many e-commerce sites, only a limited number of studies used them for product review analysis. For example, Ontology-Supported Polarity Mining (OSPM), which takes advantage of domain ontology database from IMDb, aims to achieve sentiment classification on reviews. However, the method studied only movie properties (features), not feature values. Other methods employ product review analysis, but the goal is document categorization. Product specifications and reviews are also used to build an aspect hierarchy, but the method did not study feature values. Other studies used product specifications to summarize product features, but they also did not study feature values.
Therefore, most topic model-based opinion mining and summarization techniques do not use pre-defined topics (e.g., product specifications) for product review analysis. Further, those opinion mining techniques that incorporate product specifications still fail to address the problem that novice consumers have little knowledge of the actual value corresponding to a feature in product specifications.
The disclosed method and system are directed to solve one or more problems set forth above and other problems.
One aspect of the present disclosure provides a method for providing augmented product specifications based on user reviews. The method obtains input data of specifications and user reviews on a plurality of products, each product corresponding to a plurality of specifications and a plurality of user reviews, each specification including at least a pair of a feature and a feature-value of the product. The method concatenates the user reviews of the products to form product-documents, each product-document corresponding to the concatenated user reviews of a product. The method further employs a topic model to process the input data and learn topic distributions in the product-documents and word distributions in topics. The topics include specifications of the products. The topic model is a type of statistical model for discovering topics that occur in a collection of product-documents. Each product-document contains one or more topics, and each topic exists in one or more documents. The method can provide augmented specifications to a user based on the topic model. The augmented specifications include one or more of relevant sentences of the feature-value, feature importance information, and product-specific words of the product.
Another aspect of the present disclosure provides a non-transitory computer-readable medium having computer program for, when being executed by a processor, performing a method for providing augmented product specifications based on user reviews. The method obtains input data of specifications and user reviews on a plurality of products, each product corresponding to a plurality of specifications and a plurality of user reviews, each specification including at least a pair of a feature and a feature-value of the product. The method concatenates the user reviews of the products to form product-documents, each product-document corresponding to the concatenated user reviews of a product. The method further employs a topic model to process the input data and learn topic distributions in the product-documents and word distributions in topics. The topics include specifications of the products. The topic model is a type of statistical model for discovering topics that occur in a collection of product-documents. Each product-document contains one or more topics, and each topic exists in one or more documents. The method can provide augmented specifications to a user based on the topic model. The augmented specifications include one or more of relevant sentences of the feature-value, feature importance information, and product-specific words of the product.
Other aspects of the present disclosure can be understood by those skilled in the art in light of the description, the claims, and the drawings of the present disclosure.
The following drawings are merely examples for illustrative purposes according to various disclosed embodiments and are not intended to limit the scope of the present disclosure.
Reference will now be made in detail to exemplary embodiments of the invention, which are illustrated in the accompanying drawings. Hereinafter, embodiments consistent with the disclosure will be described with reference to drawings. Wherever possible, the same reference numbers will be used throughout the drawings to refer to the same or like parts. It is apparent that the described embodiments are some but not all of the embodiments of the present invention. Based on the disclosed embodiment, persons of ordinary skill in the art may derive other embodiments consistent with the present disclosure, all of which are within the scope of the present invention.
When a consumer purchases a product from an online store, the consumer is usually provided with product-related information such as product specifications, product images, and user reviews. However, the consumer may not know the meaning of a certain product feature and its feature value. The problem becomes more important because the feature values diverge as more high-technology components are developed and a product is equipped with more features.
Indeed, recently manufactured digital cameras or computers often contain more than fifty features, many of which are difficult for novice customers to understand. For example, the digital camera “Canon EOS 70D” has 79 features on CNET's product specifications page. It contains many advanced pairs of a feature and its value, such as (“Battery Type”, “Canon LP-E6”) and (“Light Sensitivity”, “ISO 25600”). A novice user may not know whether those feature values are preferred by other users or whether they are sufficient or excessive for the user's own needs, questions that the experience of other users may answer. The following sentences were manually retrieved from reviews of other products for the values “Canon LP-E6” and “ISO 25600”, respectively: “The 60D uses the LP-E6 battery like the 7D, which is a nice feature as this battery can often last through a full day of shooting. The only negative issues are . . . , and the highest advertised ISO setting (25600 eq) is too noisy to use.”
Through reading such useful review sentences, consumers may learn about the feature values, which may help them choose a proper product for them more efficiently and effectively. For example, a user may know useful information about the battery “Canon LP-E6” from the sentence; it gives a fact that the battery lasts through a full day of shooting and also an opinion that it is a nice feature, which means it is preferred by the author. If the user was considering a camera with a good battery life but did not know about “Canon LP-E6”, the retrieved sentence would be very helpful to the user for choosing a product. Although the retrieved sentence may have some inconsistent opinions, readers can still learn from them that there are different opinions on the feature value.
The present disclosure provides a system for automatically generating augmented product specifications based on user reviews.
Users 302 may connect to the network 303 (e.g., the Internet) and access the product information website 304 through one of the web client terminals 301. The web client terminals 301 may be any device that connects to the network 303 and allows interaction between the users 302 and the product information website 304, such as desktops, laptops, tablets, smart phones, etc. The product information website 304 may provide information about a variety of products such as model, specifications, price, user reviews, etc. The product information website 304 may also provide product purchase capabilities. The users 302 may browse the product information and buy a product on the product information website 304. The users 302 may leave user reviews for a purchased product. The website 304 may be connected to any appropriate network 303 for accessing, such as the Internet. A computing module 3041 may be configured to generate augmented specifications according to the product information and user reviews. In certain embodiments, the computing module 3041 may be integrated in the product information website 304. In other embodiments, the computing module 3041 may be an independent module that can communicate with the product information website 304.
The computing module 3041 and/or client terminal 301 may be implemented on any appropriate computing platform.
As shown in
Processor 402 may include any appropriate processor or processors. Further, processor 402 can include multiple cores for multi-thread or parallel processing. Storage medium 404 may include memory modules, such as ROM, RAM, flash memory modules, and mass storages, such as CD-ROM and hard disk, etc. Storage medium 404 may store computer programs for implementing various processes when the computer programs are executed by processor 402.
Further, peripherals 412 may include various sensors and other I/O devices, such as keyboard and mouse, and communication module 408 may include certain network interface devices for establishing connections through communication networks. Database 410 may include one or more databases for storing certain data and for performing certain operations on the stored data, such as database searching.
The product database 501 may store data from the product information website 304. Specifically, the product database 501 may store data on product specifications 5011, user reviews 5012 and augmented product specifications 5013.
In operation, when a consumer uses a web client terminal 301 to browse the product information website 304, the consumer may leave a review for a product on the website 304. The collected user reviews 5012 and product specifications 5011 stored in the product database 501 may be used to generate augmented product specifications 5013. The generated augmented product specifications 5013 may also be stored in the product database 501. In the present disclosure, one product specification refers to a pair of a feature and a value of the feature, such as (feature “Light sensitivity” with a value “ISO 25600” of a digital camera product). Each product may have a plurality of specifications, i.e., feature-value pairs. The augmented product specifications 5013 may include additional relevant information about the product to help consumers understand the product specifications.
The product specifications 5011 and user reviews 5012 from the product database 501 may be processed in a preprocessing module 502. The preprocessing module 502 may be configured to process reviews using natural language analysis tools and to concatenate all reviews of one product to form a single product document for each product.
The product documents obtained from the preprocessing module 502 may be processed in the topic model generation module 503. In the present disclosure, the topic model refers to a type of statistical model used in natural language processing for discovering topics that occur in a collection of documents. The topic model can examine a set of documents and discover, based on statistics of words in each document, what topics exist in the documents and what words exist in the topics. In other words, the topic model may learn the topic distribution in each document and the word distribution in each topic.
The topic model generation module 503 may further include a prior knowledge generation unit 5031 and a modified LDA model generation unit 5032. In certain embodiments, a modified Latent Dirichlet Allocation (LDA) model may be employed as the topic model. In the LDA model, each document is the concatenated review texts of a product, and each topic is a feature-value pair from the product specification.
An LDA model assumes the topic distribution has a Dirichlet prior. That is, the topic distribution follows a Dirichlet distribution. In operation, the LDA model generates each word in a document by drawing on two variables that follow Dirichlet distributions: a topic distribution for each document and a word distribution for each topic. Since the words in each document are observed, as the LDA model iterates over the words in a document, it updates the two variables so that they fit the data as closely as possible. The data refers to the given documents (concatenated review texts) and topics (feature-value pairs from the product specifications). Therefore, the LDA model may learn from the data and generate the estimated topic distribution in each document and word distribution in each topic.
The modified LDA model generation unit 5032 may incorporate prior knowledge into the traditional LDA model. The prior knowledge generation unit 5031 may obtain prior knowledge including a previously known specific topic distribution in a document and a specific word distribution in a topic.
The augmented specification generation module 504 may be configured to generate one or more of relevant sentences 5041, feature importance information 5042 and product-specific words 5043. In certain embodiments, the relevant sentences 5041 of a query may be relevant review sentences generated by an ad-hoc language model retrieval system. The query may be a feature-value pair of a product. The ad-hoc language model retrieval system may retrieve relevant sentences according to the query based on the generated topic model (i.e. topic distribution in document and word distribution in topic). The feature importance information 5042 may be ranked product features based on the generated topic model. The product-specific words 5043 may indicate special words for the product based on the generated topic model.
The specifications may also be preprocessed, and the preprocessed specifications are used to obtain prior knowledge (“Prior”) for the topic model and to query from a retrieval system. The topic model learns text data with prior knowledge and generates new document representation, which is also given to the retrieval system. The retrieval system retrieves sentences relevant to a query from the new document representation.
At the same time, the topic model may be used to produce feature importance and product specific words based on the text corpus and prior knowledge. The results from the topic model (e.g., the feature importance and the product specific words) and the output of the retrieval system are added to specifications to generate the augmented specifications.
An exemplary presentation of augmented product specifications is shown in
Specifically, in addition to presenting a column for the product feature and a column for the corresponding feature value, another column of feature importance may be added to the corresponding features. The feature importance column may provide a ranking for each feature according to its importance. Further, the product-specific words may also be shown in the presentation. The font and color of the product-specific words may indicate which one is more specific or special to a certain product. For example, among a plurality of words that are specific to the product, the most specific word may have the largest font and the darkest color.
In addition, if a user places a mouse pointer over a feature value, relevant sentences from reviews are shown to the user. Novice customers may learn about a particular feature value from the retrieved review sentences so that, when looking at product specifications, the customers can choose a product with the features they want. In one embodiment, the relevant sentences may be presented in a float box. The feature value word may be highlighted in the relevant sentences.
As shown in
The process 900 may address the following challenges. First, the vocabulary used in specifications and reviews for a feature or a feature value may be different. For example, a feature value “10,000,000 pixels” of the feature “Effective Sensor Resolution” is not likely to be expressed as is by users; instead, they may prefer to use “10 MP” or “ten megapixels”. If only the given feature value is used as a query, many of the relevant review sentences may be missed. Another challenge is that a feature or a feature value word is often used in multiple places in the specifications, so the same sentences may be retrieved repeatedly for different queries. For example, the feature word “resolution” may be used for the features “Sensor Resolution”, “Effective Sensor Resolution”, “Max Video Resolution”, and “Display-Resolution”, and these features are all different. The retrieval system therefore needs to distinguish such features from one another. In addition, many of the features and feature values in specifications may not appear often enough in reviews if authors think they are not worth mentioning, and this may result in many false positives.
The review texts are preprocessed (S902) by performing sentence segmentation, word tokenization, and lemmatization using natural language analysis tools such as Stanford CoreNLP. Word tokens are lowercased and punctuation is removed. Then, stop words are removed by the natural language analysis tools. In certain embodiments, word tokens that appear in fewer than five reviews are also removed. All reviews of a product are concatenated to form a single product document for topic modeling.
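The review preprocessing of S902 can be sketched as follows. This is only a minimal illustration under stated assumptions: a regular expression stands in for the tokenizer, the stop-word list is a tiny placeholder, and sentence segmentation and lemmatization (which the embodiment delegates to a toolkit such as Stanford CoreNLP) are omitted. The names `STOP_WORDS`, `tokenize`, and `build_product_documents` are hypothetical.

```python
import re
from collections import Counter

# Placeholder stop-word list (assumption); a real system would use the list
# provided by its natural language analysis toolkit.
STOP_WORDS = {"a", "an", "the", "is", "it", "and", "of", "to", "this", "in"}

def tokenize(review):
    """Lowercase a review, drop punctuation, and remove stop words."""
    tokens = re.findall(r"[a-z0-9]+", review.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def build_product_documents(reviews_by_product, min_reviews=5):
    """Return {product: token list}, concatenating all reviews of a product.

    Tokens that occur in fewer than `min_reviews` reviews are discarded.
    """
    tokenized = {
        product: [tokenize(review) for review in reviews]
        for product, reviews in reviews_by_product.items()
    }
    # Count, for every token, the number of reviews that contain it.
    review_freq = Counter()
    for reviews in tokenized.values():
        for tokens in reviews:
            review_freq.update(set(tokens))
    return {
        product: [t for tokens in reviews for t in tokens
                  if review_freq[t] >= min_reviews]
        for product, reviews in tokenized.items()
    }
```

Each returned token list plays the role of one product document for the topic model.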
Specifications data is also preprocessed (S903). In one embodiment, feature values that appear in fewer than five products are removed. Then, each feature and feature value text is split into word tokens by blanks, and the word tokens are lowercased. The word tokens for a feature and a feature value are processed to generate prior knowledge. The word tokens for a specification are given to a retrieval system as a query to retrieve relevant sentences.
Both the preprocessed user review data and the specifications data may be used in a topic model to identify the feature-value pair distribution in concatenated product reviews (S905). Prior knowledge derived from user reviews and product specifications (S904) may also be generated and used in the topic model to improve the modeling result.
Specifically, a topic model is a probabilistic model that can find latent themes and their distributions in a document from a text collection, where a theme is a cluster of words whose occurrences in documents overlap frequently. In a topic model, a topic represents correlated words; thus, even if a document does not contain a certain word w, the likelihood of w in a document d, p(w|d), can be high if d has enough words similar to w. For example, even if “LCD” is not present in a review sentence t, the sentence can have a high p(“LCD”|t) if it contains related words such as “screen” and “fragile”. Therefore, topic models, specifically topic models based on Latent Dirichlet Allocation (LDA), are employed in the embodiments to bridge the vocabulary gap between specifications and reviews.
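A minimal sketch of the specification preprocessing of S903 follows. It assumes that rarity is counted per (feature, value) pair, which the text does not state explicitly; the function names `preprocess_specifications` and `spec_query` are hypothetical.

```python
from collections import Counter

def preprocess_specifications(specs_by_product, min_products=5):
    """Return {product: [(feature_tokens, value_tokens), ...]}.

    specs_by_product maps a product to a list of (feature, value) strings.
    A pair is dropped when it occurs in fewer than `min_products` products
    (assumption: frequency is counted per feature-value pair).
    """
    pair_freq = Counter()
    for specs in specs_by_product.values():
        pair_freq.update(set(specs))
    result = {}
    for product, specs in specs_by_product.items():
        result[product] = [
            # Split on blanks and lowercase, as described above.
            (feature.lower().split(), value.lower().split())
            for feature, value in specs
            if pair_freq[(feature, value)] >= min_products
        ]
    return result

def spec_query(feature_tokens, value_tokens):
    """Build the retrieval query for one specification from its tokens."""
    return feature_tokens + value_tokens
```

The token lists serve two purposes here: they seed the prior knowledge for the topic model, and their concatenation is the query passed to the retrieval system.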
The graphical representation of LDA is shown in
The variables θd and φz are estimated to fit the data as closely as possible by approximation methods, because exact estimation over all possible topic structures is infeasible. In approximation algorithms, the documents are used as clues for where to search among possible topic structures, and the two most popular methods are variational methods and collapsed Gibbs sampling. In certain embodiments, the collapsed Gibbs sampling method may be employed for its simplicity and its performance comparable to that of variational methods.
In classic LDA, the distribution of topics for a document and the distribution of words for a topic are not known beforehand. However, clues about those distributions can be applied to improve the LDA model. Intuitively, if it is known that a document is about digital cameras, the document is likely to have topics regarding “camera” and to have words related to “camera”. Such prior knowledge can be adapted in order to estimate the model better. This model can be denoted as semi-supervised LDA.
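For illustration, collapsed Gibbs sampling for classic LDA with symmetric hyperparameters can be sketched as below. This is a generic textbook-style sampler, not the disclosed modified model; the function name `gibbs_lda` and the default values of `alpha`, `beta`, and `iters` are assumptions.

```python
import random

def gibbs_lda(docs, num_topics, vocab_size, alpha=0.1, beta=0.01, iters=50, seed=0):
    """Collapsed Gibbs sampling for classic LDA.

    docs: list of documents, each a list of word ids in [0, vocab_size).
    Returns (theta, phi): topic distribution per document and word
    distribution per topic, estimated from the final assignment counts.
    """
    rng = random.Random(seed)
    n_dk = [[0] * num_topics for _ in docs]                # topic counts per document
    n_kw = [[0] * vocab_size for _ in range(num_topics)]   # word counts per topic
    n_k = [0] * num_topics                                 # total tokens per topic

    def update(d, k, w, delta):
        n_dk[d][k] += delta
        n_kw[k][w] += delta
        n_k[k] += delta

    # Random initial topic assignment for every token.
    z = []
    for d, doc in enumerate(docs):
        z.append([rng.randrange(num_topics) for _ in doc])
        for k, w in zip(z[d], doc):
            update(d, k, w, +1)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                update(d, z[d][i], w, -1)  # exclude the current token
                weights = [
                    (n_dk[d][t] + alpha) * (n_kw[t][w] + beta)
                    / (n_k[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                z[d][i] = rng.choices(range(num_topics), weights=weights)[0]
                update(d, z[d][i], w, +1)

    theta = [[(n_dk[d][t] + alpha) / (len(doc) + num_topics * alpha)
              for t in range(num_topics)] for d, doc in enumerate(docs)]
    phi = [[(n_kw[t][w] + beta) / (n_k[t] + vocab_size * beta)
            for w in range(vocab_size)] for t in range(num_topics)]
    return theta, phi
```

The sampler never evaluates the full posterior; each step only needs the counts with the current token removed, which is what makes the collapsed variant simple to implement.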
The graphical representation of semi-supervised LDA is depicted in
Gibbs sampling can be used to learn the model. The topic zd,i is repeatedly sampled based on all the other topic assignments Z\d,i as well as priors. The topic assignment probability is thus defined as:
While the classic LDA adds the same α to different topics and documents and adds the same β to different words and topics, semi-supervised LDA adds α′ and β′ that are specific to the topic and document and the word and topic, respectively, in order to incorporate prior knowledge.
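The difference between the two samplers can be made concrete with a small sketch. It assumes, following the Dirichlet parameters K·α·α′ and V·β·β′ used elsewhere in this description, that the priors enter the Gibbs step as pseudo-counts and that α′ and β′ are each normalized to sum to one; the function name and argument layout are hypothetical.

```python
def semi_supervised_weights(n_dk_row, n_kw, n_k, w, alpha, beta,
                            alpha_prior_row, beta_prior):
    """Unnormalized topic-assignment weights for one token with word id w.

    n_dk_row[k]        : tokens of this document assigned to topic k
    n_kw[k][w], n_k[k] : word-topic counts and topic totals
    (all counts exclude the token being resampled)
    alpha_prior_row[k] : alpha'_{d,k}, assumed to sum to 1 over topics
    beta_prior[k][w]   : beta'_{k,w}, assumed to sum to 1 over words
    """
    K = len(n_k)
    V = len(n_kw[0])
    return [
        (n_dk_row[k] + K * alpha * alpha_prior_row[k])
        * (n_kw[k][w] + V * beta * beta_prior[k][w])
        / (n_k[k] + V * beta)
        for k in range(K)
    ]
```

With uniform priors (α′ = 1/K and β′ = 1/V) the pseudo-counts reduce to α and β, so the sketch falls back to the classic LDA weights; a zero entry in α′ prevents a topic from being chosen in a document that has no tokens assigned to it.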
Returning to
There are M documents, where each document is the concatenation of the reviews for a product p, and for each document, there are Np words. s is a specification (a feature-value pair), which is used as a topic, and there are |S| possible specifications. With generated prior knowledge, the topic (feature-value pair) distribution in p, θp, is now drawn from Dirichlet(K·α·α′p,s), where K is the number of topics. The word distribution for feature-value pair s, φs, is drawn from Dirichlet(V·β·β′s,w), where V is the vocabulary size. This generative process is repeated for all words in all product documents.
Specifically, β′s,w, the prior knowledge for φs, is automatically generated from the data by measuring the normalized point-wise mutual information (NPMI) between feature words and review words. Then, negatively correlated words are removed, and each word has a normalized probability p(w|f), where f is a feature. In addition, DuanLDA may generate the prior for θ based on specifications; if a feature-value pair s is not present in a product p, zero is assigned to α′p,s, and otherwise, a probability is assigned to α′p,s, which is uniform among all present feature-value pairs.
Further, the major difference between DuanLDA and semi-supervised LDA is that DuanLDA uses a background language model, which is the maximum likelihood estimate of words in the entire data set.
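A minimal sketch of the two priors follows. It uses the standard definition NPMI(x, y) = log(p(x,y) / (p(x)p(y))) / (−log p(x,y)) and makes several assumptions not fixed by the text: co-occurrence is counted per text unit (for example a sentence), scores of the individual feature words are summed, and the helper names `npmi_prior` and `alpha_prior` are hypothetical.

```python
import math
from collections import Counter

def npmi_prior(units, feature_words):
    """Return p(w|f) for one feature from NPMI with its feature words.

    units: list of token lists; co-occurrence is counted per unit.
    Words with non-positive NPMI are removed before normalization.
    """
    n = len(units)
    sets = [set(u) for u in units]
    df = Counter(w for s in sets for w in s)   # number of units containing w
    scores = Counter()
    for fw in feature_words:
        if df[fw] == 0:
            continue
        for w in df:
            joint = sum(1 for s in sets if fw in s and w in s)
            if joint == 0 or joint == n:
                continue  # no co-occurrence, or NPMI undefined (log 1 = 0)
            p_joint = joint / n
            pmi = math.log(p_joint / ((df[fw] / n) * (df[w] / n)))
            npmi = pmi / -math.log(p_joint)
            if npmi > 0:
                scores[w] += npmi
    total = sum(scores.values())
    return {w: s / total for w, s in scores.items()} if total else {}

def alpha_prior(product_specs, all_specs):
    """alpha'_{p,s}: uniform over the pairs present in product p, else zero."""
    present = set(product_specs)
    return [1.0 / len(present) if s in present else 0.0 for s in all_specs]
```

In this sketch a review word that always appears together with a feature word receives the maximum NPMI of 1, while words that never co-occur with it are left out of the prior entirely.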
Specifically, when an author writes a review word wp,i for p, the author chooses a background language model or feature topics according to switch xp,i which is determined by a parameter λ. If the background language model is chosen, wp,i is drawn from the background language model φB; otherwise, a specification sp,i is chosen according to θp, which is drawn from Dirichlet distribution with α and α′p, and wp,i is drawn from φs.
The document model for DuanLDA is defined by:
The probability of xp,i choosing background language model is determined by λ and the background language model, which is
The Gibbs sampling equation to learn sp,i when xp,i is non-background is defined as:
which basically assigns background topic if wp,i is common enough in the background language model and assigns specification topic sp,i if wp,i is closer to one of the |S| specifications.
Product reviews may have topics that are not in specifications; for example, value, design, or ease of use are not listed in specifications, but they may be mentioned in reviews. The DuanLDA+ model thus adds |E| review topics to the specification topics, resulting in all topics {s1, . . . , s|S|, s|S|+1, . . . , s|S|+|E|}. The specification distribution θp is drawn from a Dirichlet distribution with α and α′p, where α′p,s is uniform across all specification and review topics. If the drawn specification sp,i belongs to the specifications, the model works the same as DuanLDA does. However, if sp,i belongs to the review topics (E), the word wp,i is drawn from φrs, which is drawn from a Dirichlet distribution with βr.
Each specification topic s has its estimated topic size Ns. If the topic size Ns is too small or too large relative to the size of the prior Vβ, the topic s will rely too much or too little, respectively, on the prior β′s. If a topic relies too much on the prior, the topic will simply follow the word distribution of the prior; if a topic relies too little on the prior, it is likely to bear other themes that are unrelated to the prior, and the latter case is more problematic. Therefore, the priors may be regularized according to topic sizes. The DuanLDA+ model introduces prior size controllers {η1, . . . , η|S|}, each of which repeatedly decays by a decay factor ζ if the topic size is too small relative to the prior size. More specifically, the Gibbs sampling for word probability in a topic sp,i=z is now defined as:
where prior size Vβ is now controlled by ηs, which is decayed by the equation:
where the superscripts (n) and (n+1) denote the variable at the n-th and (n+1)-th Gibbs sampling iterations, respectively, and the prior proportion of specification s, pp(s), is defined as pp(s)=ηsVβ/(Ns+ηsVβ). The DuanLDA+ model avoids decaying the prior size controller when the topic size is too small in absolute terms, in order to prevent a small topic with a small prior from re-growing in a way that no longer relies enough on the prior. Therefore, the prior size ηsVβ is decreased only if the proportion of the prior is too large and the topic size for s is large enough.
Further, the background language model in DuanLDA may not be necessary for the data because, when preprocessing text data in reviews, stop words are removed to prevent topics from being occupied by popular words. Thus, the DuanLDA+ model removes the background language model and instead adds product-specific topics. Some products may have reviews that are very different from those of other products, and the topics may be specific to those products. For application purposes, the DuanLDA+ model adds a product-specific topic ωp for each product p in order to capture the product-specific words. When a review author writes a word wp,i for a product p, the author first chooses between the product-specific topic and the specification topics according to λp, which is drawn from a Beta distribution with γ. If the product-specific topic is chosen, wp,i is drawn from ωp, which is drawn from a Dirichlet distribution with δ. The remaining generative process that is not described here is the same as that in DuanLDA.
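The regularization rule described in words above can be sketched as follows. Only the prior proportion pp(s)=ηsVβ/(Ns+ηsVβ) and the decay factor ζ come from the description; the thresholds `max_proportion` and `min_topic_size`, their default values, and the function names are assumptions made for illustration.

```python
def prior_proportion(eta, topic_size, prior_size):
    """pp(s) = eta_s * V*beta / (N_s + eta_s * V*beta)."""
    return eta * prior_size / (topic_size + eta * prior_size)

def decay_prior_controller(eta, topic_size, prior_size,
                           zeta=0.9, max_proportion=0.5, min_topic_size=100):
    """Return the prior size controller for the next Gibbs iteration.

    The controller decays by `zeta` only when the prior takes up too large
    a share of the topic AND the topic is large enough; both thresholds
    are hypothetical parameters.
    """
    too_much_prior = prior_proportion(eta, topic_size, prior_size) > max_proportion
    if too_much_prior and topic_size >= min_topic_size:
        return eta * zeta
    return eta
```

Calling the function once per topic and per iteration gradually shrinks the effective prior size ηsVβ until the prior proportion drops below the chosen threshold.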
The resulting document model is thus defined by:
and the Gibbs sampling formula for learning when x=0 is given as follows:
where γ is a small constant. To learn a specification topic s∈S when x=1, the formula with all priors Ω is:
where K is the number of all topics (|S|+|E|). Similarly, learning a review topic s∈E when x=1 is done with the following Gibbs sampling formula:
The prior distribution of β based on mutual information is quite even, since there are many words that are only “somewhat” related to a feature. Since the probabilities of those “somewhat” related words add up and lower the probability p(w|f) of high-ranked words, it is hard for the topic f to be chosen even for words that are closely related to f and highly ranked in β′f. Thus, the DuanLDA+ model assumes the prior follows a Zipf's law distribution and adjusts p(w|f) accordingly. Specifically, from the priors obtained for DuanLDA, p(w|f), a new prior p′(w|f) for each word w is defined as:
where v(f) is the vocabulary in f, V is the vocabulary in all reviews, rankf(w) is w's rank in p(w|f) excluding words in v(f), and the Zipf's law distribution function Zipf(i) is defined as
where s is a parameter characterizing the distribution. Basically, p′(w|f) keeps the rankings in p(w|f) but discards the probabilities of words that are not feature words. The feature words take the sum of the first n probabilities of the Zipf's law distribution, where n is the size of the intersection of the feature word vocabulary and the vocabulary of all reviews, and the sum is redistributed to the feature words following their proportions in p(w|f). Non-feature words keep only their ranks excluding feature words, and their new prior probabilities follow the Zipf's law distribution with the corresponding ranks plus the number of feature words. By doing this, the DuanLDA+ model can discriminate important words from unimportant words more explicitly so that the topics are not occupied by unimportant prior words.
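The adjustment just described can be sketched in Python. The sketch assumes the usual normalized Zipf form, Zipf(i) = i^(−s) / Σ_j j^(−s) with the sum taken over the review vocabulary, and assumes that review words missing from p(w|f) are ranked last with probability zero; the function names are hypothetical.

```python
def zipf_pmf(n_items, s=1.0):
    """Zipf's law probabilities for ranks 1..n_items (index 0 is rank 1)."""
    weights = [1.0 / (i ** s) for i in range(1, n_items + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def zipf_adjusted_prior(p_w_given_f, feature_vocab, review_vocab, s=1.0):
    """Return p'(w|f) for every word of the review vocabulary.

    p_w_given_f  : dict word -> p(w|f) from the NPMI-based prior
    feature_vocab: words of the feature, v(f)
    review_vocab : vocabulary of all reviews, V
    """
    vocab = sorted(review_vocab)
    zipf = zipf_pmf(len(vocab), s)
    feature_words = [w for w in vocab if w in feature_vocab]
    n = len(feature_words)

    # Feature words share the mass of the first n Zipf ranks, split in
    # proportion to their original probabilities.
    head_mass = sum(zipf[:n])
    feature_total = sum(p_w_given_f.get(w, 0.0) for w in feature_words)
    new_prior = {}
    for w in feature_words:
        share = p_w_given_f.get(w, 0.0) / feature_total if feature_total else 1.0 / n
        new_prior[w] = head_mass * share

    # Non-feature words keep only their rank and receive Zipf(rank + n).
    others = sorted((w for w in vocab if w not in feature_vocab),
                    key=lambda w: -p_w_given_f.get(w, 0.0))
    for rank, w in enumerate(others, start=1):
        new_prior[w] = zipf[n + rank - 1]
    return new_prior
```

Because the feature words absorb the head of the Zipf distribution and the remaining words take the tail in rank order, the adjusted prior still sums to one over the review vocabulary.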
In the SpecLDA model, for each feature f of the |F| features, there are |Uf| possible values. To separate a feature from its feature values, a feature variable f is kept distinct from the value variable uf, which is a possible value for f. Also, feature-value topics ωf,u are introduced to separate them from the feature topics φ.
In this model, when an author writes a review word wp,i of a product p, the author first chooses whether the word is about product features or a product-specific topic using switch xp,i according to λp, which is drawn from a beta distribution with symmetric vector γ. If a product-specific topic is chosen, the word is drawn following ωp, which is drawn from a Dirichlet distribution with symmetric vector δ. If product features are chosen by xp,i, the author chooses a feature fp,i from the possible feature set {f1, . . . , f|F|, f|F|+1, . . . , f|F|+|E|}, where {f1, . . . , f|F|} is the feature set from specifications and {f|F|+1, . . . , f|F|+|E|} are features that are not in specifications but are found in reviews, according to θp, which is drawn from a Dirichlet distribution with α and asymmetric prior α′p. If fp,i belongs to review features, wp,i is drawn from the multinomial distribution φrf, which is drawn from a Dirichlet distribution with symmetric vector βr. If the chosen feature fp,i belongs to specification features, the author again chooses whether to write a feature word or a feature value word about fp,i using switch yp,i according to πf, which is drawn from a beta distribution with symmetric vector γy. If the author chooses to write a feature word for fp,i, wp,i is chosen according to φf, which is drawn from a Dirichlet distribution with a constant β and asymmetric prior β′. Otherwise, the author further chooses a value uf of the feature fp,i according to ξp,f, which is drawn from a Dirichlet distribution with a constant τ and asymmetric prior τp,f. With the chosen feature value uf, the author chooses a word according to ωf,u, which is drawn from a Dirichlet distribution with a constant ρ and asymmetric prior ρ′f,u. This process is repeated for all review words of all products.
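The sequence of switches in this generative story can be simulated with a short sketch. It takes the distributions as already drawn (the Beta and Dirichlet draws are omitted), and it assumes that λp is the probability of choosing product features (x=1) and that πf is the probability of writing a feature value word (y=1); neither direction is fixed by the text. All function and parameter names are hypothetical.

```python
import random

def choose(rng, probs):
    """Draw an index from a discrete distribution."""
    return rng.choices(range(len(probs)), weights=probs)[0]

def generate_review_words(rng, n_words, lam, omega_p, theta_p, phi, pi,
                          xi_p, value_topics, phi_r, n_spec_features):
    """Generate word ids for one product document.

    lam               : assumed probability that x = 1 (product features)
    omega_p           : product-specific word distribution
    theta_p           : distribution over specification and review features
    phi[f]            : feature-word distribution of specification feature f
    pi[f]             : assumed probability that y = 1 (feature value word)
    xi_p[f]           : distribution over the values of feature f
    value_topics[f][u]: word distribution of value u of feature f
    phi_r[e]          : word distribution of review feature e
    """
    words = []
    for _ in range(n_words):
        if rng.random() >= lam:                  # x = 0: product-specific topic
            words.append(choose(rng, omega_p))
            continue
        f = choose(rng, theta_p)                 # x = 1: choose a feature
        if f >= n_spec_features:                 # feature found only in reviews
            words.append(choose(rng, phi_r[f - n_spec_features]))
        elif rng.random() >= pi[f]:              # y = 0: feature word
            words.append(choose(rng, phi[f]))
        else:                                    # y = 1: feature value word
            u = choose(rng, xi_p[f])
            words.append(choose(rng, value_topics[f][u]))
    return words
```

Inference runs this story in reverse: the Gibbs sampler observes only the words and resamples the hidden switches, features, and values for each token.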
The generative process can be described in the following algorithm:
The document model of SpecLDA is thus:
and the Gibbs sampling formula for learning when product-specific topic is used (x=0) is the same as in formula (8).
When learning a review topic or a specification feature topic f, the formula is defined as:
p(xp,i=1,fp,i=z,yp,i=0|wp,i,X\p,i,F\p,i,E\p,i,Y\p,i,Ω)
∝p(xp,i=1|X\p,i,Ω)p(fp,i=z|F\p,i,E\p,i,Ω)
p(yp,i=0|z,Y\p,i,F\p,i,E\p,i,Ω)p(wp,i|z,F\p,i,E\p,i,Y\p,i,Ω) (15)
where p(xp,i=1|X\p,i, Ω) is defined in formula (9), and the remaining terms are defined as:
where K is the number of all reviews and specifications topics.
The SpecLDA also learns when a feature is chosen (fp,i=z) and feature value is chosen (up,i=j) to describe the feature, which is defined as:
p(xp,i=1,fp,i=z,yp,i=1,up,i=j
|wp,i,X\p,i,F\p,i,Y\p,i,U\p,i,Ω)
∝p(xp,i=1|X\p,i,Ω)
p(fp,i=z|F\p,i,E\p,i,Ω)
p(yp,i=1|z,Y\p,i,F\p,i,E\p,i,Ω)
p(up,i=j|z,Y\p,i,F\p,i,U\p,i,Ω)
p(wp,i|z,j,U\p,i,Ω) (19)
where the first and second terms are defined above, and the remaining terms are defined as:
where |Uf| is the number of all possible feature values for the feature f. Regularization is applied to both feature words and feature value words, as in DuanLDA+.
Returning to
Specifically, in order to retrieve relevant documents from a document collection, a query likelihood retrieval model may be employed:
p(d|q)∝p(q|d)·p(d)
∝p(q|d) (23)
where d is a document and q is a query, which is a list of words. p(d|q) is the probability that d satisfies the information needs of a user given q. p(q|d) measures the proximity of d to q, and p(d) is a general, query-independent preference for d. Thus, the formula assigns high scores to documents that match the query well and are preferred by users. In the disclosed embodiments, document preferences are assumed not to be given, so p(d) is taken to be uniform and the term p(d) is dropped.
In general, p(q|d) is defined as:
p(q|d)=∏w∈V p(w|d)^c(w,q) (24)
where w is a word in q, V is the vocabulary set of the document collection, and c(w,q) is the count of w in q. p(w|d) is a unigram language model estimated by maximum likelihood estimation, and it represents the likelihood of word w in document d. Thus, p(q|d), the likelihood of query q given document d, becomes higher as more query words appear in d and as they appear there more frequently.
To avoid overfitting and to prevent p(q|d) from being zero when a query word does not appear in d, a smoothed p(w|d) is generally used. Specifically, when the Jelinek-Mercer smoothing method is used, p(w|d) is defined as:
p(w|d)=(1−λ)pml(w|d)+λp(w|B) (25)
where pml(w|d) is the document language model estimated with maximum likelihood, and p(w|B) is the collection language model. To smooth pml(w|d), p(w|B) serves as a reference language model, with the entire corpus used as B so that general word likelihoods can augment pml(w|d). The resulting p(w|d) is thus a weighted average of pml(w|d) and p(w|B). In formula (24), underflow may occur because many small values are multiplied together. To avoid this, the logarithm is taken, following the standard language model retrieval approach. The resulting score of d for q is defined as
where formulas (27) and (28) are equivalent if αq=Σw∈V c(w,q)log λp(w|B) is added to formula (28). However, αq is omitted because it does not depend on d and therefore does not affect the ranking of documents. Rewritten in the form of formula (28), the score now penalizes common query words, which is a desired property in ad-hoc information retrieval.
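The equivalence described above can be checked numerically. The following Python sketch is illustrative only and assumes the standard Jelinek-Mercer forms: a smoothed log score Σ c(w,q) log((1−λ)pml(w|d)+λp(w|B)) and a rank-equivalent score Σ c(w,q) log(1+(1−λ)pml(w|d)/(λp(w|B))). The function names and the toy documents are hypothetical.

```python
import math
from collections import Counter

def ml_model(tokens):
    """Maximum likelihood unigram model: relative frequency of each token."""
    counts = Counter(tokens)
    return {w: c / len(tokens) for w, c in counts.items()}

def score_smoothed(query, doc, collection_lm, lam):
    """Log query likelihood with Jelinek-Mercer smoothing (assumed form)."""
    p_ml = ml_model(doc)
    return sum(
        c * math.log((1 - lam) * p_ml.get(w, 0.0) + lam * collection_lm[w])
        for w, c in Counter(query).items()
        if w in collection_lm  # words outside the vocabulary V are skipped
    )

def score_rank_equivalent(query, doc, collection_lm, lam):
    """Same ranking as score_smoothed, with the document-independent part removed."""
    p_ml = ml_model(doc)
    return sum(
        c * math.log(1 + (1 - lam) * p_ml.get(w, 0.0) / (lam * collection_lm[w]))
        for w, c in Counter(query).items()
        if w in collection_lm
    )

# Toy collection (hypothetical data).
docs = {
    "d1": ["screen", "is", "big", "and", "bright"],
    "d2": ["battery", "life", "is", "short"],
}
collection_lm = ml_model([w for d in docs.values() for w in d])
query = ["screen", "big"]
lam = 0.5

# alpha_q depends only on the query and the collection, not on the document.
alpha_q = sum(c * math.log(lam * collection_lm[w]) for w, c in Counter(query).items())
for name, doc in docs.items():
    full = score_smoothed(query, doc, collection_lm, lam)
    reduced = score_rank_equivalent(query, doc, collection_lm, lam)
    print(name, round(full, 4), round(reduced + alpha_q, 4))
```

Because the two scores differ by the same constant for every document, sorting by either one yields the same ranking. In the reduced form, a query word that is frequent in the collection has a large p(w|B) in the denominator and therefore contributes less.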
The present embodiments may apply the query likelihood retrieval method with proper adjustment. In the present problem setting, a query q consists of the words in a query specification sq=(fq,uq), and d is a sentence t among all review sentences T. V is thus the vocabulary of T, and B is a unigram language model of T. Because the text unit is now a sentence, which usually contains far fewer words than a document, the statistical evidence linking a query to a sentence is much weaker, and the problem is harder than document retrieval. Fortunately, the method takes advantage of specifications to filter out some unrelated sentences: if a sentence tp is from a product p's review sentences Tp and the query specification sq is not in p's specifications Sp, tp can be ignored. Thus, the relevance score of sentence t for q is defined as
where o/w means “otherwise” and pml(w|tp) can be estimated in the same way as p(w|d) in equation (25). Formula (29) is used as a baseline method and is referred to as QL.
However, the baseline method may not perform well if the vocabulary used in a query differs from the vocabulary used in documents to describe the same concept. For example, for a query feature-value pair (“Display Type”, “3 in LCD Display”), QL will assign a zero score to the sentence “Screen is big but fragile for active lifestyle” because none of the query words appear in the sentence, so pml(w|tp) is zero for every query word, even though the sentence is actually relevant to the feature. Therefore, in order to bridge the vocabulary gap between specifications and reviews, pml(w|tp) is replaced with p′(w|tp) using a topic model.
In addition, the present embodiments incorporate ad-hoc retrieval with the modified LDA model. Relying solely on the LDA document model is inadvisable because it loses the original query information, so interpolation with the original language model has been suggested. Thus, the method uses a weighted interpolation of the modified LDA document model and the maximum likelihood estimated language model.
Further, the goal of the retrieval system is to retrieve relevant sentences, not documents. Extending topic models to the sentence unit may require too many variables, since the number of sentences is usually far greater than the number of documents. Thus, the present embodiments do not use the sentence unit in LDA, but convert estimations from the document level to the sentence level. The language model p′(w|tp) for a sentence t in a document d is thus defined as:
p′(w|tp)=λ′pml(w|tp)+(1−λ′)plda(w|d) (30)
Therefore, incorporating the topic model and the maximum likelihood estimated language model, the relevance score of sentence t for q used in the present embodiments is defined as:
where p′(w|tp) is obtained from equation (30), which combines one of the modified LDA models described previously with the maximum likelihood estimated language model, in order to give scores to sentences.
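The following Python sketch illustrates how the specification filter and the interpolated sentence language model might fit together. It is a sketch under assumptions rather than the disclosed formula: the score is assumed to take the rank-equivalent Jelinek-Mercer form with pml(w|tp) replaced by p′(w|tp), the topic-model estimate is assumed to be a document-level probability passed in as a plain dictionary, and all names and numeric values are hypothetical.

```python
import math
from collections import Counter

def interpolated_lm(word, sentence_tokens, p_lda_doc, lam_prime):
    """p'(w|t) = lam' * p_ml(w|t) + (1 - lam') * p_lda(w|d), with p_lda assumed given."""
    p_ml = sentence_tokens.count(word) / len(sentence_tokens)
    return lam_prime * p_ml + (1 - lam_prime) * p_lda_doc.get(word, 0.0)

def sentence_score(query_spec, query_words, sentence_tokens, product_specs,
                   p_lda_doc, collection_lm, lam, lam_prime):
    """Score a review sentence of a product for a query specification."""
    # Specification filter: ignore sentences of products lacking the specification.
    if query_spec not in product_specs:
        return 0.0
    score = 0.0
    for w, c in Counter(query_words).items():
        if w not in collection_lm:
            continue  # word outside the review vocabulary
        p = interpolated_lm(w, sentence_tokens, p_lda_doc, lam_prime)
        score += c * math.log(1 + (1 - lam) * p / (lam * collection_lm[w]))
    return score

# Example from the text: no query word occurs in the sentence.
spec = ("Display Type", "3 in LCD Display")
query_words = ["display", "type", "3", "in", "lcd", "display"]
sentence = "screen is big but fragile for active lifestyle".split()
product_specs = {spec}

# Hypothetical probabilities, for illustration only.
collection_lm = {"display": 0.01, "type": 0.005, "3": 0.004, "in": 0.03, "lcd": 0.006}
p_lda_doc = {"display": 0.02, "lcd": 0.015}

ql_only = sentence_score(spec, query_words, sentence, product_specs,
                         p_lda_doc, collection_lm, lam=0.5, lam_prime=1.0)
with_lda = sentence_score(spec, query_words, sentence, product_specs,
                          p_lda_doc, collection_lm, lam=0.5, lam_prime=0.5)
```

With `lam_prime=1.0` the model reduces to the maximum likelihood estimate and the sentence receives a zero score, as in the QL example. With `lam_prime=0.5` the topic-model term gives the sentence a positive score even though it shares no words with the query.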
Customers may want to know what is special about a certain product compared to others. Product-specific topics can be obtained in DuanLDA+ and SpecLDA (S907). For each product p, ωp contains a product-specific topic. Words ranked highly in ωp are more closely associated with that particular product than with any other topic. Thus, those high-ranked words may suggest which words are used specifically for a certain product.
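As a small illustration, reading off the high-ranked words of a product-specific topic amounts to sorting the learned word distribution. The sketch below assumes the distribution is available as a word-to-probability dictionary; the function name and the toy values are hypothetical.

```python
def top_words(word_probs, k):
    """Return the k most probable words, breaking ties alphabetically."""
    ranked = sorted(word_probs.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:k]]

# Hypothetical product-specific word distribution for one camera.
product_topic = {"waterproof": 0.30, "rugged": 0.25, "camera": 0.20,
                 "good": 0.15, "price": 0.10}
print(top_words(product_topic, 2))
```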
The importance of a feature may also be useful for a novice customer who is not familiar with the features of a product. Feature importance information may be generated according to the modified LDA models (S908). In DuanLDA and DuanLDA+, a feature may appear in multiple feature-value pairs, and the feature importance of a feature f can be calculated from the Gibbs samples after learning, which is defined as
and since features are separated from values in SpecLDA, feature importance is defined as
A higher p(f) means that the feature f is mentioned more often in the reviews.
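One plausible reading of feature importance, offered only as an assumption because the defining formulas are not reproduced here, is the fraction of sampled word assignments that fall on a feature. The Python sketch below shows both cases: topics that are feature-value pairs, where counts are aggregated over all pairs sharing a feature, and topics that are features directly. The function names and sample data are hypothetical.

```python
from collections import Counter

def importance_from_pairs(pair_assignments):
    """Aggregate (feature, value) topic assignments by feature and normalize."""
    counts = Counter(feature for feature, _value in pair_assignments)
    total = sum(counts.values())
    return {feature: count / total for feature, count in counts.items()}

def importance_from_features(feature_assignments):
    """Normalize per-feature assignment counts when features are separate from values."""
    counts = Counter(feature_assignments)
    total = sum(counts.values())
    return {feature: count / total for feature, count in counts.items()}

# Hypothetical topic assignments of four review words after sampling.
pairs = [("Auto Focus", "TTL phase detection"), ("Auto Focus", "Contrast detection"),
         ("Display Type", "3 in LCD Display"), ("Auto Focus", "TTL phase detection")]
print(importance_from_pairs(pairs))
```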
The relevant sentences, feature importance information and product-specific words may then be provided as augmented specifications (S909) of the products obtained in the first step (S901). The augmented specifications may help users to better understand the feature value of the product.
The present disclosure provides a system and method for generating augmented product specifications based on user reviews. The augmented product specifications may enhance the product purchase experience. The system employs new approaches based on modified LDA topic models to learn the topic distribution in each document and the word distribution in each topic. The generated topic model may be used to retrieve relevant review sentences corresponding to a product feature-value pair. The generated topic model may also be used to rank feature importance and provide product-specific words. Compared to existing technologies, the present disclosure may enhance the product purchase experience by providing additional informative explanations of product specifications.
A product specification is often available for a product on E-commerce websites. However, novice customers often do not have enough knowledge to understand all features of a product, especially advanced features. In order to provide useful knowledge to the customers, the present disclosure provides a system that automatically analyzes product specifications together with product reviews, which are abundant on the web nowadays. Specifically, the disclosed embodiments provide novel LDA models that can supply useful knowledge such as feature importance and product-specific themes. The models can also retrieve review texts relevant to a specification, informing customers of what other customers have said about the specification in reviews of the same product and of different products.
It is understood that the disclosed system and method for generating augmented specifications are not limited to the product purchasing scenario. The disclosed system and method can also be used for any text collection with specification (key-value) type prior knowledge.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the claims.