The present disclosure involves processing of information included in a database.
The development of the web has boosted the production of content by users. Users are encouraged to express their opinions on various products or businesses by writing reviews about them, whether on e-commerce websites such as Amazon or in online reviewer communities such as Yelp or IMDB. Official statistics are difficult to obtain, but Yelp, for instance, recently revealed that it contained more than 15 million reviews and attracted 41 million monthly visitors.
Text reviews are a very rich source of information: they can provide businesses with useful feedback, and they offer other consumers information about the product from a variety of points of view. This allows a view of the product without the inherent bias of advertisement and can highlight uncommon characteristics or details that might have been left out of a simple description.
This diversity of content is unfortunately buried in the redundancy of the multitude of reviews. Browsing all this text becomes a tedious task for a user, who will encounter a great deal of repetition and might miss important information.
One way to capture the diversity in the text is to automatically explore and mine the data. Some research has focused on the star ratings accompanying these reviews to provide the user with personalized recommendations, based either on the features of the product or on the tastes of people similar to the user, thereby removing the need to read reviews.
Star-rating-based analysis does not, however, provide the user with the description of the product they might have wanted, nor the businesses with the aforementioned feedback. This problem is addressed by review summarization, which aims at selecting the most important information out of this mass of reviews and providing an exhaustive overview of the product.
Both of these tasks rely on detecting the product's features. Manual tagging is very tedious, does not scale, and does not transfer to other domains; it is also subjective and can be partial. A trained learning algorithm exhibits the same drawbacks. Furthermore, any automatic processing of these data is difficult given the nature of user-written content, as described further below; this is especially true of totally unsupervised methods. Strict natural language processing methods fail to account for the loose grammar, colloquial language, and frequent misspellings of such user-produced texts.
A straightforward unsupervised approximation is to consider the most frequent nouns as features; Yelp, for example, uses this method to highlight a few particularities of a restaurant. This kind of method is, however, insufficient to account for the fact that people use several words to talk about the same subject. For instance, they might use “atmosphere” or “ambiance” to describe the general feeling of a restaurant. Synonym detection is not enough: “bill” and “price” deal with the same concept but are not strictly synonyms, and will therefore not be grouped together. Moreover, the concepts are not all on the same semantic level: “food”, for example, is a generalization of “chicken”, “shrimp” or “soup” in a restaurant review.
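By way of illustration, the frequent-noun baseline described above might be sketched as follows. This is a minimal sketch, assuming the NLTK library with its tokenizer and part-of-speech tagger resources installed; the function name and choice of noun tags are illustrative, not part of the disclosure.

```python
# Minimal sketch of the frequent-noun baseline, assuming NLTK with the
# "punkt" and "averaged_perceptron_tagger" resources installed.
from collections import Counter

import nltk


def frequent_noun_features(reviews, top_k=10):
    """Return the top_k most frequent nouns across a list of review strings."""
    counts = Counter()
    for review in reviews:
        tokens = nltk.word_tokenize(review.lower())
        for word, tag in nltk.pos_tag(tokens):
            if tag.startswith("NN"):  # NN, NNS, NNP, NNPS
                counts[word] += 1
    return [word for word, _ in counts.most_common(top_k)]
```

Such a sketch exhibits exactly the limitation discussed above: “atmosphere” and “ambiance” remain two separate features, since frequency counting alone cannot group them.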
Certain existing predefined taxonomies such as WordNet might be used to address one or more of the described problems. But such predefined taxonomies may lack domain-specific words, such as dish names in the above-discussed restaurant-review example. Also, the semantic relations of interest are domain-specific: it is very unlikely to find “murgh” in any taxonomy, let alone as a synonym of “chicken”. Furthermore, words can have totally different meanings in different contexts: “app” is short for appetizer in a restaurant review but stands for application in a review of a phone. No existing taxonomy exhaustively addresses all these problems, and manually building one is quite tedious, if possible at all.
The ever-growing quantity of user-produced content on the web has led to research on the analysis of unstructured or semi-structured textual data. This is especially true for reviews of products or businesses, due to the clear potential monetary value of such information. The desired end result could be review summarization, sentiment analysis or recommendation. Regardless of the end result, topic detection and organization are the main challenges to address.
Existing review analysis techniques usually proceed in two steps: first, they detect the various features of the product mentioned by the user, and then they estimate the user's sentiment towards each of them. Various techniques have been used for review summarization, but most of them merely pick out a few significant sentences, which does not produce a usable profile definition. Some achieve useful results in word/feature clustering but rely on very heavy supervision, such as predefined classes. Others extract features and evaluate the sentiment towards each of them, but lack any kind of overlying structure between these features. Moreover, such approaches are less effective with low-frequency or abstract terms, which often constitute the particularities of a profile and hence are not to be neglected.
An aspect of the present disclosure involves a method for automatically analyzing a database of textual information associated with user reviews, the method comprising the steps of selecting words in the database exhibiting a characteristic; processing the selected words to produce a graph representing a relationship between the selected words; and applying spectral analysis comprising cover tree based divisive hierarchical clustering to the graph for creating clusters of the selected words arranged in a tree comprising multiple levels wherein each level comprises thematically coherent ones of the clusters.
Another aspect of the disclosure involves apparatus comprising a pre-processor for selecting words included in a database of textual information associated with user reviews and having a characteristic; a word graph generator for processing the selected words to produce a graph representing a relationship between the selected words; and a word graph analyzer for performing a spectral analysis on the word graph to determine a structure of the graph, wherein the structure comprises clusters of the selected words arranged in a tree comprising multiple levels, each level comprising thematically coherent ones of the clusters.
These, and other aspects, features and advantages of the present disclosure will be described or become apparent from the following detailed description of the preferred embodiments, which is to be read in connection with the accompanying drawings.
In the drawings, like reference numerals denote similar elements throughout the views.
It should be understood that the drawings are for purposes of illustrating the concepts of the disclosure and are not necessarily the only possible configuration for illustrating the disclosure.
It should be understood that the elements shown in the figures may be implemented in various forms of hardware, software or combinations thereof. Preferably, these elements are implemented in a combination of hardware and software on one or more appropriately programmed general-purpose devices, which may include a processor, memory and input/output interfaces. Herein, the phrase “coupled” is defined to mean directly connected to or indirectly connected with through one or more intermediate components. Such intermediate components may include both hardware and software based components.
The present description illustrates the principles of the present disclosure. It will thus be appreciated that those skilled in the art will be able to devise various arrangements that, although not explicitly described or shown herein, embody the principles of the disclosure and are included within its spirit and scope.
All examples and conditional language recited herein are intended for pedagogical purposes to aid the reader in understanding the principles of the disclosure and the concepts contributed by the inventors to furthering the art, and are to be construed as being without limitation to such specifically recited examples and conditions.
Moreover, all statements herein reciting principles, aspects, and embodiments of the disclosure, as well as specific examples thereof, are intended to encompass both structural and functional equivalents thereof. Additionally, it is intended that such equivalents include both currently known equivalents as well as equivalents developed in the future, i.e., any elements developed that perform the same function, regardless of structure.
Thus, for example, it will be appreciated by those skilled in the art that the block diagrams presented herein represent conceptual views of illustrative circuitry embodying the principles of the disclosure. Similarly, it will be appreciated that any flow charts, flow diagrams, state transition diagrams, pseudocode, and the like represent various processes which may be substantially represented in computer readable media and so executed by a computer or processor, whether or not such computer or processor is explicitly shown.
The functions of the various elements shown in the figures may be provided through the use of dedicated hardware as well as hardware capable of executing software in association with appropriate software. When provided by a processor, the functions may be provided by a single dedicated processor, by a single shared processor, or by a plurality of individual processors, some of which may be shared. Moreover, explicit use of the term “processor” or “controller” should not be construed to refer exclusively to hardware capable of executing software, and may implicitly include, without limitation, digital signal processor (“DSP”) hardware, read only memory (“ROM”) for storing software, random access memory (“RAM”), and nonvolatile storage.
Other hardware, conventional and/or custom, may also be included. Similarly, any switches shown in the figures are conceptual only. Their function may be carried out through the operation of program logic, through dedicated logic, through the interaction of program control and dedicated logic, or even manually, the particular technique being selectable by the implementer as more specifically understood from the context.
In the claims hereof, any element expressed as a means for performing a specified function is intended to encompass any way of performing that function including, for example, a) a combination of circuit elements that performs that function or b) software in any form, including, therefore, firmware, microcode or the like, combined with appropriate circuitry for executing that software to perform the function. The disclosure as defined by such claims resides in the fact that the functionalities provided by the various recited means are combined and brought together in the manner which the claims call for. It is thus regarded that any means that can provide those functionalities are equivalent to those shown herein.
In an exemplary embodiment, data collector 130 gathers a data set of user reviews.
It is noteworthy that this data set is particularly dense: the average number of reviews written by a user is 162.4 (standard deviation 271.6), with a maximum of 3,800 reviews for some users. 35% of users write more than 100 reviews and 80% write more than 10. The review lengths vary widely but are also fairly high, with an average of 810.0 characters (standard deviation 656.6).
An example of the data set obtained by data collector 130 is shown in the accompanying drawings.
Therefore, an aspect of the present disclosure relates to data processing involving a flexible bag-of-words representation. The data set produced by data collector 130 is next analyzed by profile generator 140.
The operation of profile generator 140 is described in more detail next.
In the graph generated by word graph generator 230, the links are weighted to account for the number of co-occurrences between the words, but in order not to favor frequent words, which would link everything together, a score based on mutual information is used as follows:

w_{ij} = \log \frac{|S| \, |S_i \cap S_j|}{|S_i| \, |S_j|}   (1)

where |S| is the total number of sentences, |S_i| is the number of sentences containing the word i, and |S_i ∩ S_j| is the number of sentences in which i occurs with j.
Various approaches to weighting of the edges exist. However, point-wise mutual information typically provides good results.
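For illustration, the weighting of Equation (1) might be computed as sketched below, where each sentence is represented as the set of selected words it contains; the filter keeping only positively associated pairs is an assumption of this sketch, not part of the disclosure.

```python
# Sketch of the point-wise mutual information weighting of Equation (1).
import math
from collections import Counter
from itertools import combinations


def pmi_edge_weights(sentences):
    """sentences: iterable of word collections; returns {(w_i, w_j): weight}."""
    n_sentences = len(sentences)
    word_counts = Counter()   # |S_i|: sentences containing word i
    pair_counts = Counter()   # |S_i ∩ S_j|: sentences containing both i and j
    for sentence in sentences:
        words = set(sentence)
        word_counts.update(words)
        pair_counts.update(combinations(sorted(words), 2))

    weights = {}
    for (i, j), n_ij in pair_counts.items():
        pmi = math.log(n_sentences * n_ij / (word_counts[i] * word_counts[j]))
        if pmi > 0:  # assumption: keep only positively associated pairs
            weights[(i, j)] = pmi
    return weights
```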
In order to find structure in this graph, the output of word graph generator 230 is processed by word graph analyzer 240, which implements spectral clustering: a deterministic, fast and efficient clustering method requiring no supervision. Such clustering relies on the spectral analysis of the graph to find the smoothest functions on it and clusters them to highlight the strongly connected parts of the graph.
Word graph analyzer 240 first projects the graph into a high-dimensional Euclidean space. A goal is to preserve the proximity of two nodes in the weighted graph. Therefore, the processing looks for axes of this space as functions f that minimize:

\frac{1}{2} \sum_{i,j} w_{ij} \left( \frac{f_i}{\sqrt{d_i}} - \frac{f_j}{\sqrt{d_j}} \right)^2   (2)

where d_i denotes the degree of node i.
Dividing by the degree ensures that the nodes are considered equally, that is to say that the most common words (highest degree) are not favored. To do so, let W be the weighted adjacency matrix of the aforementioned graph, and D the diagonal degree matrix such that

D_{ii} = \sum_j W_{ij}   (3)
and the normalized Laplacian is defined as:

L = I - D^{-1/2} W D^{-1/2}   (4)
whose eigenvectors correspond to the smooth functions on the graph minimizing Equation (2). The eigenvectors are orthogonal, and each thereby captures different information about the graph.
Solutions to this problem include the functions indicative of unconnected or barely connected components (containing one or a few words), which overweight these outlying words. Therefore, it is necessary to eliminate the smallest eigenvectors, corresponding to these smoothest functions, in order to keep only the relevant ones. This can be achieved by a threshold on the eigenvalue, as the eigenvalue λ of an eigenvector f corresponds to its smoothness:

\lambda = f^T L f = \frac{1}{2} \sum_{i,j} w_{ij} \left( \frac{f_i}{\sqrt{d_i}} - \frac{f_j}{\sqrt{d_j}} \right)^2   (5)

for unit-norm f.
Furthermore, only √N eigenvectors are kept, corresponding to the most meaningful functions, where N is the number of distinct words. This choice is enough to capture the variability in the data while discarding the noise; the results are, however, invariant with respect to small changes in this quantity. Finally, the axes of the obtained √N-dimensional space are normalized.
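The projection described above might be implemented as in the following sketch; the NumPy-based computation and the particular eigenvalue threshold are illustrative assumptions, a sketch rather than a definitive implementation.

```python
# Sketch of the spectral projection of word graph analyzer 240, assuming
# W is a symmetric NumPy matrix holding the Equation (1) weights.
import numpy as np


def spectral_embedding(W, eigenvalue_threshold=1e-6):
    n = W.shape[0]
    d = W.sum(axis=1)
    with np.errstate(divide="ignore"):
        d_inv_sqrt = np.where(d > 0, 1.0 / np.sqrt(d), 0.0)
    # Normalized Laplacian of Equation (4): L = I - D^{-1/2} W D^{-1/2}.
    L = np.eye(n) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    eigenvalues, eigenvectors = np.linalg.eigh(L)  # ascending eigenvalues

    # Drop the smoothest functions (near-zero eigenvalues of Equation (5)),
    # which mark unconnected or barely connected components, then keep
    # the first sqrt(N) remaining eigenvectors as axes.
    axes = eigenvectors[:, eigenvalues > eigenvalue_threshold]
    axes = axes[:, : int(np.sqrt(n))]
    # Normalize the axes, as described above.
    return axes / np.linalg.norm(axes, axis=0, keepdims=True)
```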
The results show that when the words are projected into the space whose axes are the selected eigenvectors, proximity in the resulting Euclidean space does correspond to thematic proximity, as expected. The overall structure resembles a ball from which bulges corresponding to certain topics arise in several dimensions, as can be seen in the three-dimensional projection in the accompanying drawings.
One approach to spectral clustering applies a k-means clustering algorithm in this space. Using k-means, however, has the major drawback of requiring a manual and arbitrary choice of a single k, which might not be the most meaningful value and will most likely vary across users or businesses. Furthermore, varying k can change the whole structure of the clustering, making it impossible to control granularity in a non-chaotic way, as illustrated in the accompanying drawings.
Instead, in accordance with the present disclosure, the exemplary embodiment applies cover tree based divisive hierarchical clustering in this space.
Cover trees have many advantages. First, they allow for variable discretization of the data. In particular, if j is the deepest level of the tree with no more than k nodes, then the nodes at depth j cover the set {x_t}_{t=1}^n within an error of 8 d({x_t}_{t=1}^n, S*), where S* is the optimal coverage of size k. Herein these nodes are referred to as representative states. Note that the above bound holds for all k ≤ n; therefore, the granularity of the discretization does not have to be chosen in advance. This is not the case for k-means and online k-center clustering.
Second, cover trees can be built incrementally, one node at a time. In particular, when a new example x_{n+1} arrives, it is added as a child of the deepest node x_i such that d(x_{n+1}, x_i) ≤ 1/2^j, where j is the level of x_i. This simple update takes O(log n) time and maintains all four invariants of the cover tree.
Finally, note that a cover tree on n data points can be built in O(n log n) time. Thus, when k > log n, the tree can be built faster than performing k-means or online k-center clustering.
A cover tree is constructed in the space of words by feeding it the words in order of decreasing frequency. In accordance with aspects of the method and apparatus described herein, the most frequent words tend to appear high in the tree, and frequent words will always be parents of infrequent words. Every level refines the precision and reduces the radius of the covering balls, dividing the previous clusters.
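A simplified sketch of this incremental construction follows. It implements only the covering rule quoted above (a new point becomes a child of the deepest node whose ball of radius 1/2^j contains it) and omits the nesting and separation bookkeeping of a full cover tree; all names are illustrative.

```python
# Simplified cover-tree insertion sketch; points are NumPy vectors fed in
# order of decreasing word frequency, so frequent words stay high in the tree.
import numpy as np


class Node:
    def __init__(self, point, level):
        self.point = point    # embedded word vector
        self.level = level    # depth j; the node covers a ball of radius 1/2^j
        self.children = []


def insert(root, point):
    """Add `point` as a child of the deepest node whose ball covers it."""
    node = root
    while True:
        child_radius = 0.5 ** (node.level + 1)
        covering = [c for c in node.children
                    if np.linalg.norm(point - c.point) <= child_radius]
        if not covering:
            node.children.append(Node(point, node.level + 1))
            return
        # Descend into the closest covering child, refining the precision.
        node = min(covering, key=lambda c: np.linalg.norm(point - c.point))
```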
An exemplary cover tree constructed in accordance with the present disclosure is shown in the accompanying drawings.
The rich structure built automatically from the text for a given user or restaurant provides a detailed profile at the output of hierarchical structure generator 250.
As described, providing such a recommendation involves a comparison of profiles. Profiles as described herein are trees, that is, organized sets of word clusters of different sizes. To compare two trees, the clusters of words which compose them are compared; therefore, an elementary comparison operation between two clusters is defined.
An exemplary embodiment of the comparison included in recommendation engine 150 computes the similarity of two clusters c_1 and c_2 as the dot product of their bag-of-words vectors:

s(c_1, c_2) = \langle v_{c_1}, v_{c_2} \rangle   (6)

which equals their cosine similarity, since the vectors over the bag of words are normalized.
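For illustration, the elementary operation of Equation (6) might look like the following sketch, where each cluster is encoded as a normalized bag-of-words vector over a shared vocabulary; the helper names are illustrative.

```python
# Sketch of the elementary cluster comparison of Equation (6).
import numpy as np


def bag_of_words_vector(cluster_words, vocabulary):
    """Binary bag-of-words vector over `vocabulary`, L2-normalized."""
    v = np.array([1.0 if w in cluster_words else 0.0 for w in vocabulary])
    norm = np.linalg.norm(v)
    return v / norm if norm else v


def cluster_similarity(v1, v2):
    """Dot product, i.e. cosine similarity, since the vectors are normalized."""
    return float(np.dot(v1, v2))
```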
The score (6) is used to compute a similarity score between two profiles. The profiles are considered level by level, the first level being the root (hence the bag of words of the whole corpus). However, two trees might not have the same number of clusters at the same level. In such a situation, the optimal matching between the two cluster sets can be approximated by the following algorithm: for each cluster in tree 1, find the best match (highest score) at the same level in tree 2, and then do the same for the clusters of tree 2.
This gives a set C of chosen cluster pairs, from which the similarity score can be obtained using the elementary operation s defined in (6) as follows:

score(C) = \frac{\sum_{(c, c') \in C} (|c| + |c'|) \, s(c, c')}{\sum_{(c, c') \in C} (|c| + |c'|)}   (7)

where |c| is the size of the cluster c (that is to say, the number of non-zero components of its bag-of-words vector).
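A sketch of this per-level matching and scoring follows, assuming each level is given as a list of (size, normalized bag-of-words vector) pairs; the size-weighted averaging mirrors Equation (7) as reconstructed above.

```python
# Sketch of the level-wise profile comparison of Equations (6) and (7).
import numpy as np


def level_similarity(level1, level2):
    """level1, level2: lists of (size, normalized bag-of-words vector)."""
    pairs = set()
    # Match every cluster of tree 1 with its best cluster of tree 2 ...
    for i, (_, v) in enumerate(level1):
        pairs.add((i, max(range(len(level2)),
                          key=lambda j: np.dot(v, level2[j][1]))))
    # ... and every cluster of tree 2 with its best cluster of tree 1.
    for j, (_, u) in enumerate(level2):
        pairs.add((max(range(len(level1)),
                       key=lambda i: np.dot(level1[i][1], u)), j))

    # Size-weighted average of the elementary scores, as in Equation (7).
    num = sum((level1[i][0] + level2[j][0]) * np.dot(level1[i][1], level2[j][1])
              for i, j in pairs)
    den = sum(level1[i][0] + level2[j][0] for i, j in pairs)
    return num / den if den else 0.0
```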
The scores obtained at the different levels are then combined linearly to yield a final compatibility score; the weights of this combination may be learned on a training set.
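As an illustration of learning the combination weights, a least-squares fit over hypothetical training data might look as follows; the numbers and target values here are invented purely for the example.

```python
# Sketch: learn per-level combination weights by least squares. X[k] holds
# the level-by-level similarity scores for training pair k (hypothetical
# numbers), and y[k] the known compatibility target for that pair.
import numpy as np

X = np.array([[0.9, 0.6, 0.3],
              [0.8, 0.7, 0.5],
              [0.2, 0.1, 0.1]])
y = np.array([1.0, 1.0, 0.0])

weights, *_ = np.linalg.lstsq(X, y, rcond=None)
final_score = X[0] @ weights  # compatibility score for the first pair
```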
The trees of topics constructed as described above capture important properties of the text and can be regarded as profiles for a business or a user. The most important words are at the top of the tree, and words which are semantically close are close in the tree. Furthermore, the tree structure covers all the aspects of a given text set and offers fine control over granularity. Examples of such trees are displayed in the accompanying drawings.
In accordance with the present disclosure, the described apparatus and method may be used to build one tree per restaurant and use the tree as a browsable representation of the restaurant's reviews.
Indeed, if the nodes of the tree are displayed as sentences containing the maximal number of words from their subtrees, this expandable tree can be viewed as a way to browse the corpus of text. Users can go deeper into the tree for the aspects they are interested in, while keeping an overview of the rest, and can access the full review from which each sentence is extracted.
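A minimal sketch of labeling a node for such browsing follows, assuming each corpus sentence is available as a (text, word set) pair; the function name is illustrative.

```python
# Sketch: pick, for a tree node, the sentence containing the maximal number
# of words from the node's subtree, to serve as the node's display label.
def representative_sentence(subtree_words, sentences):
    """sentences: list of (sentence_text, set_of_words) pairs."""
    return max(sentences, key=lambda s: len(subtree_words & s[1]))[0]
```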
The apparatus and method described herein are not limited to the exemplary system described herein and, in particular, are not limited to the restaurant embodiment. They can be used as input to any text-based recommendation or summarization engine. The detailed user profiles can serve as a basis for matchmaking or targeted advertisement, and adjusting the various scores, the comparison process, and the performance of the similarity metric would enable the described system to stand as a recommendation system by itself.
Other aspects comprise adding additional information, such as a sentiment score for every concept, and accounting for the particularities of a profile that distinguish it from the average.
Another aspect comprises providing a cold start processor 160 in the described apparatus, e.g., for handling users or businesses for which few or no reviews are yet available.
In addition, the described system could be extended to build a taxonomy over the whole dataset, fashioning an entire “restaurant” taxonomy which could be used as a baseline for profile definition. Indeed, it would provide every word in the cluster “seafood”, and the system could then determine, for a given user, their interest in and sentiment towards “seafood”, as well as finer- or coarser-grained categories. Such a score on every level would provide a baseline for sentiment analysis.
The operation of the apparatus described above is illustrated in the accompanying drawings.
Another aspect of the present disclosure involves a method as depicted in flowchart form in the accompanying drawings.
An exemplary method of operation of recommendation engine 150 is also shown in the accompanying drawings.
Although embodiments which incorporate the teachings of the present disclosure have been shown and described in detail herein, those skilled in the art can readily devise many other varied embodiments that still incorporate these teachings. Having described embodiments of a method and apparatus for processing textual information (which are intended to be illustrative and not limiting), it is noted that modifications and variations can be made by persons skilled in the art in light of the above teachings. It is therefore to be understood that changes may be made in the particular embodiments disclosed which are within the scope of the disclosure as outlined by the appended claims.
This application claims the benefit of the filing date of the following U.S. Provisional Application, which is hereby incorporated by reference in its entirety for all purposes: Ser. No. 61/541,458, filed on Sep. 30, 2011.
Filing Document | Filing Date | Country | Kind | 371(c) Date
---|---|---|---|---
PCT/US12/57857 | 9/28/2012 | WO | 00 | 3/20/2014
Number | Date | Country
---|---|---
61/541,458 | Sep. 30, 2011 | US