The invention relates generally to computer systems, and more particularly to an improved system and method for disambiguating text labeling content objects.
The collaborative efforts of users participating in social media services such as Wikipedia, Flickr, and Delicious have led to an explosion in user-generated content. The content can occur in various forms, such as text, photos, video, audio, or multimedia content. A popular way of organizing the content is through tagging. Tags are often contributed by users when they submit an image or video and then form a key part of a search approach. The tags provide useful descriptors of the content and are an important part of today's multimedia databases. A simple tag like “Tokyo” may provide more information than can possibly be gleaned from content-based algorithms. Therefore, making it as easy as possible for users to enter tags is important.
There have been numerous efforts to suggest tags to users. See, for example, M. Ames and M. Naaman, Why We Tag: Motivations for Annotation in Mobile and Online Media, Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 971-980, 2007; G. Mishne, AutoTag: A Collaborative Approach to Automated Tag Assignment for Weblog Posts, Proceedings of the 15th International Conference on World Wide Web, pages 953-954, 2006; B. Sigurbjornsson and R. van Zwol, Flickr Tag Recommendation Based on Collective Knowledge, In Proceedings of the 17th International World Wide Web Conference (WWW2008), Beijing, China, April 2008; and Z. Xu, Y. Fu, J. Mao, and D. Su, Towards the Semantic Web: Collaborative Tag Suggestions, Collaborative Web Tagging Workshop at WWW2006, Edinburgh, Scotland, May, 2006. A common method is to suggest the most likely co-occurring tags. For instance, Ames and Naaman propose a system called ZoneTag to make it easier for mobile-phone users to tag the photos they upload based on location and previous tags. Both Mishne and Xu propose systems that make suggestions by aggregating tags from similar textual content. And Sigurbjornsson proposes a system based on a probabilistic model of tag usage across all users. Each of these systems looks for the most likely tags to describe content. However, in many cases, the most likely tag is also the most obvious and least informative. As a result, most tag-suggestion systems suggest words that add little information to a user's contribution.
Instead, disambiguating tags should be recommended when the current tags are not sufficiently clear to describe an object. There are two scenarios in which tags are not sufficiently clear to describe an object. The first scenario is if the current tag set has more than one meaning. Resolving this type of ambiguity is non-trivial, as there exist many different ways a tag set can appear ambiguous. Examples of ambiguity are word-sense ambiguity (e.g. “jaguar” can be a car or an animal), geographic ambiguity (e.g. “Cambridge” as in MA or UK), temporal ambiguity (e.g. “Superbowl” from 2006 or 2005), language ambiguity (e.g. “mist” means dung in German and fog in English), and so forth. The second scenario is if the current tag set is not sufficiently specific. For example, “Asia” could describe an image from many different countries. Similarly, the tag set (“jaguar,” “car”) is not ambiguous; however, it is also not particularly specific about the type of car that is represented in an image, given there are many Jaguar models.
What is needed is a way to determine the ambiguity of a set of user-contributed tags and to suggest new tags that disambiguate the original tags. Ideally, such a system and method should be able to flexibly handle many cases of ambiguity, including word-sense ambiguity, geographic ambiguity, temporal ambiguity, and language ambiguity, without resorting to additional side information such as time or location analysis.
The present invention provides a system and method for disambiguating text strings labeling content objects. A disambiguation engine may be provided to disambiguate a text string set by calculating a divergence measure of two augmented text string sets. The disambiguation engine may be operably coupled to an ambiguity analyzer to determine the ambiguity of the text string set and may be operably coupled to a text recommendation engine to recommend a disambiguating text string set. The system and method may suggest new text strings when a set of given text strings can appear in at least two different contexts. These different contexts could be defined by geographic locations, word senses, languages, temporal events, and so forth. The different text string contexts may be measured based on a weighted KL divergence of co-occurring text string distributions. When the measure exceeds a threshold, the system and method suggest text strings that allow users to better describe their content.
In an embodiment to disambiguate text strings labeling content objects, one or more text strings forming a text string set may be received from a user. Alternatively, one or more machine-generated text strings may be provided by a content recognition system. Frequencies of co-occurring text strings in a text collection may be obtained, and a disambiguation measure may be determined for a pair of text strings that each co-occur with a text string in the text string set. In an embodiment, the disambiguation measure may be based on a weighted KL divergence of text string distributions that maximizes the value of divergence when a text string set may occur in different contexts. The pair of text strings may be output as recommendations to a user if the disambiguation measure exceeds a threshold. In various embodiments, a disambiguation measure may be determined for a list of the top most common pairs of text strings that co-occur with the text string set, and the pairs of text strings may be output in decreasing order by disambiguation measure for those pairs of text strings with a disambiguation measure that exceeds a threshold.
There are many applications which may use the present invention for disambiguating text strings labeling content objects. For instance, the present invention may be used to disambiguate tags in online content publishing and social media applications. The present invention may suggest tags that allow users to better describe their content for both new and existing content objects. Additionally, the present invention may be used in search applications to find an expanded query that best resolves ambiguity of a user's search request. Advantageously, the system and method of the present invention may be generally applied to any types of annotated content including, but not limited to, text, images, static graphics, video, audio, and rich media. Other advantages will become apparent from the following detailed description when taken in conjunction with the drawings, in which:
The invention may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, and so forth, which perform particular tasks or implement particular abstract data types. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in local and/or remote computer storage media including memory storage devices.
With reference to
The computer system 100 may include a variety of computer-readable media. Computer-readable media can be any available media that can be accessed by the computer system 100 and includes both volatile and nonvolatile media. For example, computer-readable media may include volatile and nonvolatile computer storage media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by the computer system 100. Communication media may include computer-readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. For instance, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared and other wireless media.
The system memory 104 includes computer storage media in the form of volatile and/or nonvolatile memory such as read only memory (ROM) 106 and random access memory (RAM) 110. A basic input/output system 108 (BIOS), containing the basic routines that help to transfer information between elements within computer system 100, such as during start-up, is typically stored in ROM 106. Additionally, RAM 110 may contain operating system 112, application programs 114, other executable code 116 and program data 118. RAM 110 typically contains data and/or program modules that are immediately accessible to and/or presently being operated on by CPU 102.
The computer system 100 may also include other removable/non-removable, volatile/nonvolatile computer storage media. By way of example only,
The drives and their associated computer storage media, discussed above and illustrated in
The computer system 100 may operate in a networked environment using a network 136 to one or more remote computers, such as a remote computer 146. The remote computer 146 may be a personal computer, a server, a router, a network PC, a peer device or other common network node, and typically includes many or all of the elements described above relative to the computer system 100. The network 136 depicted in
The present invention is generally directed towards a system and method for disambiguating text labeling content objects. The system and method may suggest text strings when a set of text strings can appear in at least two different contexts. These different contexts could be defined by geographic locations, word senses, languages, temporal events, and so forth. The different text string contexts may be measured based on a weighted KL divergence of co-occurring text string distributions. When the benefits are significant, the system and method suggest text strings that allow users to better describe their content. In an embodiment, a text string may label any type of content object, including for example bookmarks, photos, videos, video fragments, text, audio, other multimedia content, web pages and even user queries.
As will be seen, the present invention may be used to disambiguate tags in online content publishing and social media applications. The present invention may suggest tags that allow users to better describe their content for both new and existing content objects. Additionally, the present invention may be used in search applications to find an expanded query that best resolves ambiguity of search results. As will be understood, the various block diagrams, flow charts and scenarios described herein are only examples, and there are many other scenarios to which the present invention will apply.
Turning to
In various embodiments, a client computer 202 may be operably coupled to one or more server computers 208 by a network 206. The client computer 202 may be a computer such as computer system 100 of
The server 208 may be any type of computer system or computing device such as computer system 100 of
The server 212 may be operably coupled to storage such as storage 216 that may store content objects 218 that may include text features 220. The storage 216 may also store text co-occurrence data such as an index 222 mapping the frequency of a text string to other text strings.
There are many applications which may use the present invention for disambiguating text strings labeling content objects. Online content publishing and social media applications are examples among these many applications. For any of these applications, new tags may be generated as needed or daily for both new and existing content items, and these additional tags may be incorporated into a collection of tags labeling content items. For instance, an online photographic sharing application may allow users to upload and share photographs, and may also allow users to annotate the photographs with tags. Those skilled in the art may recognize that other online applications such as news article feeds, blogs or bulletin boards, and multimedia data applications such as images, songs, or movie clips may similarly have tags generated on top of the content. Such applications may use the present invention for disambiguating tags labeling content objects. Or the present invention may be used in search applications to find an expanded query that best resolves ambiguity of a search request.
In general, a text string set may be considered ambiguous if it can appear in at least two different contexts. These different contexts could be defined by geographic locations, word senses, languages, temporal events, and so forth. The text string contexts may be measured by the distribution over all text string co-occurrences. A good example of an ambiguous tag labeling an image, for instance, is the word “Cambridge,” since there are well-known examples of Cambridge in both Massachusetts and England. A tag such as “university” is very likely in both contexts, so suggesting it does little to resolve the ambiguity. The present invention may measure the level of ambiguity of a text string set T and select two additional text strings that can be proposed to a user to best disambiguate it. Thus, given the tag “Cambridge,” the present invention may determine that this is an ambiguous tag, and suggest either “MA” or “UK” because these words may do the most to remove the ambiguity. It may be assumed that the tag set {“Cambridge”, “MA”} co-occurs with different tags than {“Cambridge”, “UK”}. These additional tags are defined by locations and events that differ strongly between the two very distant cities. As used herein, co-occurring text strings mean two or more text strings that are features describing the same content object.
A probabilistic framework may be introduced that provides a probability p(t|T) that a tag t co-occurs with the set T. Instead of suggesting the tags that are most likely within this framework, two tags ti,tj are suggested that, once added to T, give rise to maximally different probability distributions p(t|{T∪ti}) and p(t|{T∪tj}). The level of ambiguity of a set T is measured by a weighted Kullback-Leibler (KL) divergence of these two probability distributions.
In the proposed probabilistic framework to model tag co-occurrences and measure ambiguity, consider a content object to be labeled with a set of tags T={ta, tb, . . . }. The expression I(T) represents the number of content objects that contain the tag set T. For any pair of tags ti,tj, consider the number of content object co-occurrences to be denoted by I(ti∪tj). An estimate of the probability that one tag, ti, appears in another tag's presence, tj, may be calculated by the following expression:
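As a hedged illustration of this kind of estimate, the sketch below counts single-tag and pair-wise occurrences over a small, made-up collection and forms the ratio I(ti∪tj)/I(tj). That ratio is an assumption: it is one natural reading of the definitions above, not a formula confirmed by this text. The function names and the sample data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def count_cooccurrences(objects):
    """Count I(t) for single tags and I(ti, tj) for unordered tag pairs."""
    single = Counter()
    pair = Counter()
    for tags in objects:
        tags = set(tags)
        single.update(tags)
        # Store each unordered pair once, under its sorted key.
        for a, b in combinations(sorted(tags), 2):
            pair[(a, b)] += 1
    return single, pair


def cond_prob(ti, tj, single, pair):
    """Estimate p(ti | tj) as I(ti, tj) / I(tj) (assumed form)."""
    if single[tj] == 0:
        return 0.0
    if ti == tj:
        return 1.0
    key = (ti, tj) if ti < tj else (tj, ti)
    return pair[key] / single[tj]


# Hypothetical collection: each set holds the tags of one content object.
objects = [
    {"cambridge", "ma", "harvard"},
    {"cambridge", "uk", "punting"},
    {"cambridge", "ma", "mit"},
    {"london", "uk"},
]
single, pair = count_cooccurrences(objects)
p_ma_given_cambridge = cond_prob("ma", "cambridge", single, pair)
p_cambridge_given_uk = cond_prob("cambridge", "uk", single, pair)
```

Storing only pair-wise counts keeps the index small, which matches the remark that follows about the impracticality of storing probabilities for all tag sets.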
By further summing over all contexts, the probability of a pair of tags that includes tag ti may be calculated by the following expression:
In an embodiment of a probabilistic framework, models may be based on these two probability distributions, which may be calculated from pair-wise co-occurrence data. Although tags may not appear only in pairs, it is impractical to store the probability of a tag in any context for all tag sets, T. To simplify the computation, it may be assumed that conditional co-occurrences are independent, and the probability that any one tag for all tag sets is used to label a content object may be calculated by the following expression:
Using this assumption, the probability of a tag given any context may be written using Bayes' rule as
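A minimal sketch of such a posterior follows, assuming the independence assumption takes the usual naive-Bayes form, p(t|T) proportional to p(t) times the product of p(t'|t) over t' in T. The prior I(t)/N, the small floor `eps` for unseen pairs, and all counts are assumptions introduced for illustration only.

```python
def posterior(context, single, pair, n_objects, eps=1e-9):
    """Return a normalized estimate of p(t | context) for tags outside context."""
    scores = {}
    for t in single:
        if t in context:
            continue
        score = single[t] / n_objects  # assumed prior p(t) = I(t) / N
        for c in context:
            key = tuple(sorted((c, t)))
            # p(c | t) from pair-wise counts, floored to avoid zeroing out.
            score *= max(pair.get(key, 0) / single[t], eps)
        scores[t] = score
    total = sum(scores.values())
    return {t: s / total for t, s in scores.items()}


# Hypothetical counts over four content objects.
single = {"cambridge": 3, "ma": 2, "uk": 2, "harvard": 1,
          "punting": 1, "mit": 1, "london": 1}
pair = {("cambridge", "ma"): 2, ("cambridge", "uk"): 1,
        ("cambridge", "harvard"): 1, ("cambridge", "punting"): 1,
        ("cambridge", "mit"): 1, ("harvard", "ma"): 1,
        ("ma", "mit"): 1, ("punting", "uk"): 1, ("london", "uk"): 1}

post = posterior({"cambridge"}, single, pair, n_objects=4)
most_likely = max(post, key=post.get)
```

In this toy data, tags that never co-occur with "cambridge" (such as "london") receive a near-zero posterior because of the floor rather than an exact zero.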
It is important to note that a tag set may be considered ambiguous if it can appear in at least two different tag contexts. Accordingly, a set of labels T may be considered ambiguous if there exist two labels ti and tj such that adding one or the other gives rise to very different distributions over the remaining labels. Thus, given the tag “Cambridge,” adding the tags “MA” or “UK” may lead to very different locations; and the other tags occurring in this context are likely to change, including tags about stores, people, and so forth. In an embodiment, the deviation between two posterior distributions of the different tag contexts may be measured with the KL-divergence. For additional details on measuring two posterior distributions with the KL-divergence, see S. Kullback and R. Leibler, On Information and Sufficiency, in The Annals of Mathematical Statistics, 22 (1):79-86, March 1951. Consider T to denote the current set of tags, and consider ti,tj to be two additional tags. The KL-divergence between the two corresponding distributions may be determined by calculating the following equation:
This equation integrates the amount of disagreement between the two distributions over all tags t, weighted by the probability p(t|{T∪ti}). It is strictly non-negative but not necessarily symmetric. Given that there may be no meaningful notion of order for the tags ti,tj, the following commonly used symmetric variation of the equation may instead be used:
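The two quantities discussed here can be sketched as follows. The directed divergence is the standard KL sum over all tags t; the symmetric variant is assumed to be the sum of the two directed divergences, since the text says only that a commonly used symmetric variation is meant. The floor `eps` and the example distributions are hypothetical.

```python
import math


def kl(p, q, eps=1e-12):
    """Directed KL divergence: sum over t of p(t) * log(p(t) / q(t))."""
    return sum(
        pv * math.log(pv / max(q.get(t, 0.0), eps))
        for t, pv in p.items()
        if pv > 0
    )


def symmetric_kl(p, q):
    """Assumed symmetric variant: KL(p || q) + KL(q || p)."""
    return kl(p, q) + kl(q, p)


# Hypothetical posteriors p(t | T ∪ ti) and p(t | T ∪ tj).
p = {"harvard": 0.9, "punting": 0.1}
q = {"harvard": 0.5, "punting": 0.5}

forward = kl(p, q)
backward = kl(q, p)
sym = symmetric_kl(p, q)
```

The directed values differ between the two orderings, which is why a symmetric form is convenient when the tags ti and tj have no natural order.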
Given a limited database, it may be possible to easily find tags with maximal disagreement by selecting two terms that appear in very different contexts and are unrelated to the set T. For example, for the tag set T={“Cambridge”}, the tags added could be t1=“fridge” and t2=“mercedes” and the KL-divergence between the two posterior distributions would presumably be very high. To avoid this, the equation
Accordingly, the measure of ambiguity of a tag set T may be defined in various embodiments as the maximum divergence between two potential posterior distributions: f(T)=max_{i,j} div(ti,tj). If the value of f(T) is above a certain threshold, the labels ti and tj may be recommended because they represent the “direction” of greatest ambiguity, f(T), to the system.
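The thresholding step can be sketched as below, assuming the divergence of each candidate pair has already been computed. The function name, the scores, and the threshold value are hypothetical; the text does not state a specific threshold.

```python
def rank_recommendations(scored_pairs, threshold):
    """Keep pairs whose divergence exceeds the threshold, best first."""
    kept = [(score, pair) for score, pair in scored_pairs if score > threshold]
    return sorted(kept, key=lambda item: item[0], reverse=True)


# Hypothetical divergence scores for candidate pairs given T = {"cambridge"}.
scored = [
    (0.07, ("ma", "university")),
    (0.33, ("ma", "uk")),
    (0.02, ("ma", "fridge")),
]
recommended = rank_recommendations(scored, threshold=0.05)
```

The first entry of the returned list corresponds to f(T); an empty list indicates that the tag set is not considered ambiguous under the chosen threshold.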
A naïve implementation of f(T)=max_{i,j} div(ti,tj) generally results in a computational complexity of O(n^3), where n denotes the number of terms in the database. However, for any given tag set T, almost all tags ti have a very small conditional probability p(ti|T). In order to find two terms with maximum disambiguation value, it is generally sufficient to restrict the search to the top N most common terms, where N is some small number. From experimentation, N=25 was found to be sufficient in an embodiment, under which 97.5% of all computations yielded exact results. Even finding the top N tags can be safely approximated, as the majority of all tags are never likely in any context.
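The restricted search can be sketched as follows, under stated assumptions: div(ti,tj) is taken to be p(ti|T) p(tj|T) multiplied by a symmetric KL divergence of the two augmented posteriors, and the posterior is supplied by a caller-provided function. The lookup table, the function names, and the parameter `top_n` are hypothetical stand-ins for a real co-occurrence model.

```python
import math
from itertools import combinations


def sym_kl(p, q, eps=1e-12):
    """Symmetric KL, written as sum of (a - b) * log(a / b) with a small floor."""
    total = 0.0
    for t in set(p) | set(q):
        a = max(p.get(t, 0.0), eps)
        b = max(q.get(t, 0.0), eps)
        total += (a - b) * math.log(a / b)
    return total


def most_ambiguous_pair(context, posterior_fn, top_n=25):
    """Search only the top_n most likely tags for the pair maximizing div."""
    base = posterior_fn(context)
    candidates = sorted(base, key=base.get, reverse=True)[:top_n]
    best_score, best_pair = 0.0, None
    for ti, tj in combinations(candidates, 2):
        g = sym_kl(posterior_fn(context | {ti}), posterior_fn(context | {tj}))
        score = base[ti] * base[tj] * g  # assumed weighting by p(ti|T) p(tj|T)
        if score > best_score:
            best_score, best_pair = score, (ti, tj)
    return best_score, best_pair


# Hypothetical posteriors, keyed by tag set, standing in for a learned model.
TABLE = {
    frozenset({"cambridge"}):
        {"ma": 0.35, "uk": 0.35, "university": 0.29, "fridge": 0.01},
    frozenset({"cambridge", "ma"}):
        {"harvard": 0.60, "punting": 0.05, "college": 0.30, "kitchen": 0.05},
    frozenset({"cambridge", "uk"}):
        {"harvard": 0.05, "punting": 0.60, "college": 0.30, "kitchen": 0.05},
    frozenset({"cambridge", "university"}):
        {"harvard": 0.30, "punting": 0.30, "college": 0.35, "kitchen": 0.05},
    frozenset({"cambridge", "fridge"}):
        {"harvard": 0.02, "punting": 0.02, "college": 0.01, "kitchen": 0.95},
}

score, best = most_ambiguous_pair(
    frozenset({"cambridge"}), lambda ctx: TABLE[frozenset(ctx)], top_n=4
)
```

In this toy table, the pair containing "fridge" has the largest raw divergence but a low weight, so the weighted search prefers "ma" and "uk". With N candidates the loop evaluates N(N-1)/2 pairs, independent of the vocabulary size n.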
For a very large scale implementation in an embodiment, the computation of f(T)=max_{i,j} div(ti,tj) may be parallelized, for instance, in a map-reduce framework described in J. Dean and S. Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, Communications of the ACM, 51(1):107, 2008. The reduce phase in Dean and Ghemawat may calculate the max( ) operator and the mapper may implement the div( ) operator defined in
div(ti, tj)=p(ti|T)p(tj|T)g(
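The division of work between mapper and reducer can be sketched in-process with Python's built-in `map` and `functools.reduce`. This is only a stand-in for a distributed framework, and it assumes that g denotes a pair-wise divergence value that has already been computed. The weights, the g values, and the function names are hypothetical.

```python
from functools import reduce
from itertools import combinations


def mapper(pair, weights, g_values):
    """Emit (div, pair), with div = p(ti|T) * p(tj|T) * g for one pair."""
    ti, tj = pair
    return (weights[ti] * weights[tj] * g_values[frozenset(pair)], pair)


def reducer(best, item):
    """Keep the larger of two (div, pair) records: the max() operator."""
    return item if item[0] > best[0] else best


# Hypothetical p(t|T) weights and pair-wise g values for T = {"cambridge"}.
weights = {"ma": 0.35, "uk": 0.35, "university": 0.29}
g_values = {
    frozenset(("ma", "uk")): 2.73,
    frozenset(("ma", "university")): 0.66,
    frozenset(("uk", "university")): 0.66,
}

pairs = list(combinations(weights, 2))
mapped = map(lambda p: mapper(p, weights, g_values), pairs)
best_score, best_pair = reduce(reducer, mapped, (float("-inf"), None))
```

Because taking a maximum is associative, partial maxima from separate workers could be combined in any order, which is what makes the reduce step straightforward to distribute.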
The present invention provides a system and method to suggest text strings when a set of text strings can appear in at least two different contexts. These different contexts could be defined by geographic locations, word senses, languages, temporal events, and so forth. The text string contexts may be measured by the distribution over all text string co-occurrences using a measure of ambiguity based on a weighted KL divergence of text string distributions. Advantageously, text strings are suggested that allow people to better describe their content when the benefits are significant.
As can be seen from the foregoing detailed description, the present invention provides an improved system and method for disambiguating text strings labeling content objects. A disambiguation measure based on a weighted KL divergence of tag distributions may be determined that maximizes the value of divergence when a tag set may occur in different contexts. When the benefits are significant, the system and method suggest text strings that allow users to better describe their content. Advantageously, the system and method of the present invention may be generally applied to any types of annotated content including, but not limited to, text, images, static graphics, video, audio, and rich media. As a result, the system and method provide significant advantages and benefits needed in contemporary computing, and more particularly in online applications supporting user-defined content.
While the invention is susceptible to various modifications and alternative constructions, certain illustrated embodiments thereof are shown in the drawings and have been described above in detail. It should be understood, however, that there is no intention to limit the invention to the specific forms disclosed, but on the contrary, the intention is to cover all modifications, alternative constructions, and equivalents falling within the spirit and scope of the invention.