This application relates in general to digital information search and sensemaking and, in particular, to a system and method for providing default hierarchical training for social indexing.
The Worldwide Web (“Web”) is an open-ended digital information repository into which new information is continually posted. The information on the Web can, and often does, originate from diverse sources, including authors, editors, collaborators, and outside contributors commenting, for instance, through a Web log, or “Blog.” Such diversity suggests a potentially expansive topical index, which, like the underlying information, continuously grows and changes.
Social indexing systems provide information and search services that organize evergreen information according to the topical categories of indexes built by their users. Topically organizing an open-ended information source, like the Web, into an evergreen social index can facilitate information discovery and retrieval, such as described in commonly-assigned U.S. patent application, entitled “System and Method for Performing Discovery of Digital Information in a Subject Area,” Ser. No. 12/190,552, filed Aug. 12, 2008, pending, the disclosure of which is incorporated by reference.
Social indexes organize evergreen information by topic. A user defines topics for the social index and organizes the topics into a hierarchy. The user then interacts with the system to build robust models to classify the articles under the topics in the social index using, for instance, example-based training, such as described in Id. Through the training, the system builds fine-grained topic models by generating finite-state patterns that appropriately match positive-example articles and do not match negative-example articles.
In addition, the system can build coarse-grained topic models based on population sizes of characteristic words, such as described in commonly-assigned U.S. Pat. No. 8,010,545, issued Aug. 30, 2011, the disclosure of which is incorporated by reference. The coarse-grained topic models are used to recognize whether an article is roughly on topic. Articles that match the fine-grained topic models, yet have statistical word usage far from the norm of the positive training example articles, are recognized as “noise” articles. The coarse-grained topic models can also suggest “near misses,” that is, articles that are similar in word usage to the training examples, but which fail to match any of the preferred fine-grained topic models, such as described in commonly-assigned U.S. Provisional Patent Application, entitled “System and Method for Providing Robust Topic Identification in Social Indexes,” Ser. No. 61/115,024, filed Nov. 14, 2008, pending, the disclosure of which is incorporated by reference.
To a large extent, the success of social indexing depends upon the ease of creating new indexes, yet index creation can be the most difficult step for new users, particularly when an index is built through example-based training of index topics. The example-based approach yields well-tuned topic models for the indexes and creates patterns without requiring a user to master the skills of writing potentially-complex queries. Example-based training also provides valuable feedback for tuning topic models. Notwithstanding, example-based training requires significant work and understanding. As a preliminary step, a new user must create and name each topic, and place that topic into a topic tree. Much more work is required for training. The user must identify one or more positive-example articles for each topic and train the index using the positive-example articles. Following training, the system reports the matching articles for each topic and their scores, plus candidate “near misses” for each topic. If one or more of the near misses belong under a topic, the user can add the article to the set of positive training examples. As well, if the system reports one or more off-topic articles as matching, the user can add those articles as negative training examples.
Through this routine, a user engages in an open-ended iterative process of tuning topics. Sometimes, several cycles of adding positive and negative training examples are required until satisfactory results are obtained from the topic models. For new users wanting to see quick results from their efforts, the labor of example-based training can be a disincentive.
Topic models are created without requiring a user to provide any training examples. The topic models are built based on a hierarchical topic tree, using both the individual topic labels and their locations within the tree. A random sample of articles is drawn from the given sources of information for the index, and candidate topic models, or patterns, are generated. The patterns are ranked according to a set of heuristic rules about labels, word and label uniqueness, and the relationships expressed by topic trees. The resulting topic models are less accurate and precise than those created by example-based training because the constraints used in default training are less specific. On the other hand, the approach requires much less work. While a user always needs to create a topic tree to specify the index topics, no additional work providing examples is needed, and the user gets a draft index.
One embodiment provides a system and method for providing default hierarchical training for social indexing. Articles of digital information for social indexing are maintained. A hierarchically-structured tree of topics is specified. Each topic includes a label that includes one or more words. Constraints inherent in the literal structure of the topic tree are identified. For each topic in the topic tree, a topic model that includes at least one term derived from the words in at least one of the labels is created. The topic models for the topic tree are evaluated against the constraints. Those topic models that best satisfy the constraints are identified.
Creating default social indexes enables new users to get started more quickly than when provided with example-based training alone, and provides a good basis for later switching to example-based training when topic boundaries within the social index need fine-tuning. The system produces better answers than are found by simply concatenating topic labels, by generating alternative candidate patterns and evaluating them against heuristic criteria and biases.
Still other embodiments of the present invention will become readily apparent to those skilled in the art from the following detailed description, wherein are described embodiments by way of illustrating the best mode contemplated for carrying out the invention. As will be realized, the invention is capable of other and different embodiments and its several details are capable of modifications in various obvious respects, all without departing from the spirit and the scope of the present invention. Accordingly, the drawings and detailed description are to be regarded as illustrative in nature and not as restrictive.
The following terms are used throughout and, unless indicated otherwise, have the following meanings:
Corpus: A collection or set of articles, documents, Web pages, electronic books, or other digital information available as printed material.
Document: An individual article within a corpus. A document can also include a chapter or section of a book, or other subdivision of a larger work. A document may contain several cited pages on different topics.
Cited Page: A location within a document to which a citation in an index, such as a page number, refers. A cited page can be a single page or a set of pages, for instance, where a subtopic is extended by virtue of a fine-grained topic model for indexing and the set of pages contains all of the pages that match the fine-grained topic model. A cited page can also be smaller than an entire page, such as a paragraph, which can be matched by a fine-grained topic model.
Subject Area: The set of topics and subtopics in a social index, including an evergreen index or its equivalent.
Topic: A single entry within a social index characterizing a topical category. In an evergreen index, a topic has a descriptive label and is accompanied by a fine-grained topic model, such as a pattern, that is used to match documents within a corpus.
Subtopic: A single entry hierarchically listed under a topic within a social index. In an evergreen index, a subtopic is also accompanied by one or more topic models.
Fine-grained topic model: This topic model is based on finite state computing and is used to determine whether an article falls under a particular topic. Each saved fine-grained topic model is a finite-state pattern, similar to a query. This topic model is created by training a finite state machine against positive and negative training examples.
Coarse-grained topic model: This topic model is based on characteristic words and is used in deciding which topics correspond to a query. Each saved coarse-grained topic model is a set of characteristic words, which are important to a topic, and a score indicating the importance of each characteristic word. This topic model is also created from positive training examples, plus a baseline sample of articles on all topics in an index. The baseline sample establishes baseline frequencies for each of the topics and the frequencies of words in the positive training examples are compared with the frequencies in the baseline samples. In addition to use in generating topical sub-indexes, coarse-grained models can be used for advertisement targeting, noisy article detection, near-miss detection, and other purposes.
Community: A group of people sharing main topics of interest in a particular subject area online and whose interactions are intermediated, at least in part, by a computer network. A subject area is broadly defined, such as a hobby, like sailboat racing or organic gardening; a professional interest, like dentistry or internal medicine; or a medical interest, like management of late-onset diabetes.
Augmented Community: A community that has a social index on a subject area. The augmented community participates in reading and voting on documents within the subject area that have been cited by the social index.
Evergreen Index: An evergreen index is a social index that continually remains current with the corpus.
Social Indexing System: An online information exchange infrastructure that facilitates information exchange among augmented communities, provides status indicators, and enables the passing of documents of interest from one augmented community to another. An interconnected set of augmented communities form a social network of communities.
Information Diet: An information diet characterizes the information that a user “consumes,” that is, reads across subjects of interest. For example, in his information consuming activities, a user may spend 25% of his time on election news, 15% on local community news, 10% on entertainment topics, 10% on new information on a health topic related to a relative, 20% on new developments in his specific professional interests, 10% on economic developments, and 10% on developments in ecology and new energy sources. Given a system for social indexing, the user may join or monitor a separate augmented community for each of his major interests in his information diet.
Label: A topic label from a hierarchical index of topics.
Duplicated Topic Label: A topic label that is used on more than one topic within a hierarchical index.
Common Ancestor: Given two topics in a topic tree, a common ancestor is a topic that is an ancestor of both of the topics.
Word: A stemmed word that appears within a topic label.
Duplicated Word: A word that appears in more than one topic label in any of its forms.
Local Topic Word: A word that appears in a topic's label for a given topic.
Parent Word: A word that appears in the label of the topic's parent.
Term: A word, n-gram, or group of words that appears in a pattern that functions as a default topic model. The words in each term can be derived from stemmed versions of the words in the label.
Preferred Pattern: A conjunction or n-gram pattern that uses the same words that appear in a topic label. For example, if the topic label is “Onset Venture,” the preferred pattern is either the n-gram “{onset venture}” or the conjunction “[onset venture],” skipping any stop words. If the topic label is a single word, for instance, “Mayfield,” the preferred pattern is the single word in stemmed form.
Complexity (or Simplicity) Score: A score reflecting the structure of a default candidate pattern.
Valid Pattern: A pattern that satisfies the hard constraints.
Digital Information Environment
A digital information infrastructure includes public data networks, such as the Internet, standalone computer systems, and other open-ended repositories of electronically-stored information.
In general, each user device 13a-c is a Web-enabled device that executes a Web browser or similar application, which supports interfacing to and information exchange and retrieval with the servers 14a-c. Both the user devices 13a-c and servers 14a-c include components conventionally found in general purpose programmable computing devices, such as a central processing unit, memory, input/output ports, network interfaces, and non-volatile storage, although other components are possible. Moreover, other information sources in lieu of or in addition to the servers 14a-c, and other information consumers, in lieu of or in addition to user devices 13a-c, are possible.
A social indexing system 11 supplies articles topically organized under an evergreen index through social indexing, such as described in commonly-assigned U.S. Patent Application, entitled “System and Method for Performing Discovery of Digital Information in a Subject Area,” Ser. No. 12/190,552, filed Aug. 12, 2008, pending, the disclosure of which is incorporated by reference. The social indexing system 11 also determines which topics are currently “hot” and which topics have turned “cold” to meet a user's need for recent information, such as described in commonly-assigned U.S. Patent Application, entitled “System and Method for Managing User Attention by Detecting Hot and Cold Topics in Social Indexes,” Ser. No. 12/360,834, filed Jan. 27, 2009, pending, the disclosure of which is incorporated by reference. Finally, the social indexing system 11 groups and displays articles by relevance bands, which are sorted by time and filtered by time regions, such as described in commonly-assigned U.S. Patent Application, entitled “System and Method for Using Banded Topic Relevance and Time for Article Prioritization,” Ser. No. 12/360,823, filed Jan. 27, 2009, pending, the disclosure of which is incorporated by reference.
From a user's point of view, the environment 10 for digital information retrieval appears as a single information portal, but is actually a set of separate but integrated services.
The components 20 can be loosely grouped into three primary functional modules: information collection 21, social indexing 22, and user services 23. Other functional modules are possible. Additionally, the functional modules can be implemented on the same or separate computational platforms. Information collection 21 obtains incoming content 24, such as Web content 15a, news content 15b, and “vetted” content 15c, from the open-ended information sources, including Web servers 14a, news aggregator servers 14b, and news servers with voting 14c, which collectively form a distributed corpus of electronically-stored information. The incoming content 24 is collected by a media collector that harvests new digital information from the corpus. The incoming content 24 is typically stored in a structured repository, or indirectly stored by saving hyperlinks or citations to the incoming content in lieu of maintaining actual copies.
The incoming content 24 may be stored in multiple representations, which differ from the representations in which the information was originally stored.
Different representations could be used to facilitate displaying titles, presenting article summaries, keeping track of topical classifications, and deriving and using fine-grained topic models. Words in the articles could also be stemmed and saved in tokenized form, minus punctuation, capitalization, and so forth. Moreover, fine-grained topic models created by the social indexing system 11 represent fairly abstract versions of the incoming content 24 where many of the words are discarded and mainly word frequencies are kept.
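Where the foregoing describes tokenized, stemmed representations of articles, a minimal Python sketch of that normalization follows; the suffix-stripping rules are illustrative assumptions, not the system's actual stemmer.

```python
import re

def simple_stem(word):
    # Placeholder suffix stripping; a production system would use a
    # full stemmer (e.g., Porter) rather than these illustrative rules.
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def tokenize(text):
    # Lowercase, drop punctuation, and stem, mirroring the tokenized,
    # stemmed article representation described above.
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [simple_stem(w) for w in words]

print(tokenize("Mortgage meltdowns, and the crises they trigger."))
# ['mortgage', 'meltdown', 'and', 'the', 'cris', 'they', 'trigger']
```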
The incoming content 24 is preferably organized under at least one topical index 29 that is maintained in a storage device 25. The topical index 29 may be part of a larger set of topical indexes 26 that covers all of the information. The topical index 29 can be an evergreen index built through social indexing 22, such as described in commonly-assigned U.S. patent application “System and Method for Performing Discovery of Digital Information in a Subject Area,” Ser. No. 12/190,552, filed Aug. 12, 2008, pending, the disclosure of which is incorporated by reference. The evergreen index contains fine-grained topic models, such as finite state patterns, that can be used to test whether new information falls under one or more of the topics in the index. Social indexing 22 applies supervised machine learning to bootstrap training material into the fine-grained topic models for each topic and subtopic in the topical index 29.
Alternatively, social indexing 22 can perform default training to form topic models in a self-guided fashion based on a hierarchical topic tree, using both the individual topic labels and their locations within the tree, as further described below.
User services 23 provide a front-end to users 27a-b to access the set of topical indexes 26 and the incoming content 24, to perform search queries on the set of topical indexes 26 or a single topical index 29, and to access search results, top indexes, and focused sub-indexes. In a still further embodiment, each topical index 29 is tied to a community of users, known as an “augmented” community, which has an ongoing interest in a core subject area. The community “vets” information cited by voting 28 within the topic to which the information has been assigned.
Simple Default Training
In its most fundamental form, a default social index can be formed by extracting the words from each topic label and creating a conjunction of the words, minus any stop words, as a topic model or pattern, as sketched below. This approach, however, is not without shortcomings.
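As a hedged illustration of this fundamental form, the following sketch extracts a label's words, removes stop words, and emits either a bare word or a bracketed conjunction in the notation used elsewhere in this description; the stop-word list is an assumption.

```python
STOP_WORDS = {"the", "a", "an", "of", "and", "or", "in", "on", "for"}  # illustrative

def simple_default_pattern(topic_label):
    # Extract the label's words, drop stop words, and form a
    # conjunction pattern, the most fundamental default topic model.
    words = [w.lower() for w in topic_label.split() if w.lower() not in STOP_WORDS]
    if len(words) == 1:
        return words[0]                 # single-word label: the word itself
    return "[" + " ".join(words) + "]"  # conjunction, e.g. "[mortgage crisis]"

print(simple_default_pattern("The Mortgage Crisis"))  # -> "[mortgage crisis]"
```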
In a first example index, both the simple default training and example-based training patterns did a credible job on the articles from the index's sources. Although the simple default pattern found many of the correct articles, the pattern also matched articles about a “crisis faced by the House of Representatives” and missed some articles about the “mortgage meltdown.” In contrast, the example-based pattern lacked these limitations, and could evolve to recognize more complex topic boundaries, given more positive and negative examples.
In other cases, the simple default training fails drastically, such as when an index contains duplicated topic labels: subtopics bearing identical labels under different parent topics receive identical default patterns, even though the topics themselves are distinct.
A simple variation on the simple default trainer could increase model specificity by including terms from ancestor topics drawn from the topic hierarchy. For example, the trainer could generate the pattern “[Sun Yue pre season game]” for the first of the three duplicate-label subtopics. This variation suggests that constraints on the default pattern for a topic can arise from relationships to other nodes in the topic tree. However, this variation of including parent words also has problems. As the number of words in a conjunction increases, the number of matching articles necessarily decreases. For example, articles about pre-season games of Sun Yue that include the word “Yue” but not the word “Sun” would be missed by the default pattern.
Yet another variation is to include some, but not all, of the words from parent topics.
To summarize, candidate patterns can be generated from the terms that appear in topic labels. Although simple conjunctions work for some indexes, that approach by itself is subject to failure in cases where topic labels are duplicated within the index, where label words also match off-topic senses, or where augmenting a conjunction with too many words causes relevant articles to be missed.
Default hierarchical training addresses the shortcomings of simple default training to generate a default index, which is often entirely satisfactory for organizing the subject matter.
A social index must be specified and accessed (step 71). The social index can be created by a user as a hierarchically-structured topic tree to specify the index topics, or can originate from some other index source. The topic tree includes topic labels, some of which may be duplicated wholly or in part. Each topic label in the social index is iteratively processed (steps 72-74) and, during each iteration, a default candidate pattern is generated (step 73). Each default candidate pattern can include terms derived from the words of the topic's own label, for instance as an n-gram or a conjunction, optionally augmented with words drawn from the labels of parent and other ancestor topics, as sketched below.
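A sketch of one plausible candidate generator follows; the exact candidate set is an assumption drawn from the pattern forms discussed above, namely n-grams, conjunctions of label words, and conjunctions augmented with words from ancestor labels.

```python
def candidate_patterns(label_words, ancestor_words):
    # label_words: stemmed, stop-word-free words from the topic's label.
    # ancestor_words: likewise, from parent and other ancestor labels.
    candidates = set()
    if len(label_words) == 1:
        candidates.add(label_words[0])                     # bare word
    else:
        candidates.add("{" + " ".join(label_words) + "}")  # n-gram
        candidates.add("[" + " ".join(label_words) + "]")  # conjunction
    # Conjunctions augmented with some (not necessarily all) ancestor words.
    for word in ancestor_words:
        candidates.add("[" + " ".join(label_words + [word]) + "]")
    return candidates

# For a "Pre-Season Games" subtopic under "Sun Yue":
print(candidate_patterns(["pre", "season", "game"], ["sun", "yue"]))
```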
Each default candidate pattern is then iteratively processed (steps 75-77) and, during each iteration, a score for the pattern is computed (step 76), as further described below.
Finally, the patterns are ranked based on their scores and the highest scoring patterns are selected for the default hierarchical index (step 78).
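The rank-and-select step can be sketched as follows, assuming a scoring function of the kind developed in the next section; select_default_patterns and score_fn are hypothetical names.

```python
def select_default_patterns(topics, score_fn):
    # topics: mapping from topic name to its list of candidate patterns.
    # score_fn: returns a numeric score for a (topic, pattern) pair.
    index = {}
    for topic, candidates in topics.items():
        # Rank candidates by score and keep the highest-scoring pattern
        # as the topic's default topic model (step 78).
        index[topic] = max(candidates, key=lambda p: score_fn(topic, p))
    return index

# Usage with a toy scorer that prefers n-gram patterns:
demo = {"mortgage crisis": ["[mortgage crisis]", "{mortgage crisis}"]}
print(select_default_patterns(demo, lambda t, p: 1.0 if p.startswith("{") else 0.5))
```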
Default Candidate Pattern Scoring
Several factors contribute to scoring a default candidate pattern. The factors can be quantified as percentages, or other metrics, with approximately the given percentages of influence in typical cases: article matching (up to about 70% of the total score), structural complexity (up to about 20%), and a label bonus (the remainder).
The default candidate patterns are also checked against hard constraints (step 87) and soft constraints (step 88), as further described below.
Article Matching Evaluation
The biggest single factor in evaluating a default candidate pattern is the number of articles that the pattern matches, contributing up to 70% of the total score, although other approaches that assign article matching a suitably dominant role in scoring could be used.
Although patterns that match more articles are generally favored over patterns that match fewer articles, default candidate patterns can match too many articles. Patterns that are overly prolific, such as matching more than about 20% of the test articles, are usually too non-discriminating to be useful. As a result, default candidate pattern evaluation favors patterns that match the most articles up to an “ideal maximum,” without employing a sudden, discontinuous cut-off.
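One plausible realization of such a ramp, offered only as a sketch, grows the match score linearly up to an ideal maximum (taken here as 20% of the test sample, per the discussion above) and then tapers it gradually rather than cutting off; the exact ramp and decay are assumptions.

```python
def article_match_score(num_matched, num_test_articles, ideal_fraction=0.20):
    # Score rises with match count up to an "ideal maximum" and then
    # declines gradually, avoiding a sudden, discontinuous cut-off.
    # The linear ramp and decay rate here are illustrative assumptions.
    if num_test_articles == 0:
        return 0.0
    ideal = ideal_fraction * num_test_articles
    if num_matched <= ideal:
        return num_matched / ideal          # rises linearly to 1.0 at the ideal
    # Beyond the ideal maximum, decay smoothly toward zero.
    excess = (num_matched - ideal) / (num_test_articles - ideal)
    return max(0.0, 1.0 - excess)

# A pattern matching 16% of 500 test articles scores higher than an
# overly prolific pattern matching 60% of them:
print(article_match_score(80, 500), article_match_score(300, 500))  # 0.8 0.5
```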
Structural Complexity Evaluation
Structural complexity scoring is a secondary factor in default candidate pattern evaluation, contributing up to 20% of the total score, although other approaches that assign structural complexity a suitably minor role in scoring could be used.
Following consideration of the three factors (steps 101-106), the overall score for the default candidate pattern is adjusted (step 107). The complexity (or simplicity) score is determined in accordance with:
score = (6 × numNgrams) − ((numGroups + 2) × numNonDupWords) − numDupWords (1)
where numNgrams is the number of n-grams, numGroups is the number of groups of words, numNonDupWords is the number of non-duplicated words, and numDupWords is the number of duplicated words. However, to limit the overall influence of the complexity score, bounding rules are also applied in adjusting the score, as illustrated in the sketch below.
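A direct transcription of equation (1), read with conventional operator precedence, might look like the following sketch; the clamp standing in for the bounding rules is a hypothetical placeholder, since the specific rules are not reproduced here.

```python
def complexity_score(num_ngrams, num_groups, num_nondup_words, num_dup_words):
    # Equation (1): favors n-grams and penalizes loose groups of words,
    # with duplicated words penalized less than non-duplicated ones.
    score = (6 * num_ngrams) - (num_groups + 2) * num_nondup_words - num_dup_words
    # Hypothetical bounding rule: clamp the result so this secondary
    # factor cannot dominate the overall pattern score.
    return max(-10, min(10, score))

# An n-gram pattern scores higher than a loose two-word conjunction:
print(complexity_score(1, 0, 0, 0))   # 6
print(complexity_score(0, 1, 2, 0))   # -6
```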
Labels Evaluation
Different default candidate patterns could end with the same score, and two such cases stand out in practice.
To make a reasonable guess in these cases, the system awards a label bonus to the appropriate default candidate pattern. Bonuses are awarded accordingly (step 113).
Following labels evaluation (steps 111-114), the resulting score is returned (step 115).
Hard Constraint Evaluation
“Hard” constraints represent gatekeepers of valid patterns.
Soft Constraint Evaluation
Soft constraints indicate weaker preferences than hard constraints.
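Since the specific hard and soft constraints are enumerated elsewhere, the following sketch illustrates only the two-tier structure, using hypothetical constraints: a hard constraint requiring at least one word from the topic's own label, and a soft constraint disfavoring patterns built entirely from duplicated words.

```python
def check_constraints(pattern_words, label_words, duplicated_words):
    # Hard constraints gate validity (a "valid pattern" satisfies all
    # of them); soft constraints merely adjust the score downward.
    # Both constraints below are hypothetical illustrations.
    hard_ok = any(w in label_words for w in pattern_words)
    soft_penalty = 0.0
    if pattern_words and all(w in duplicated_words for w in pattern_words):
        soft_penalty = 0.25   # disfavor, but do not invalidate
    return hard_ok, soft_penalty

print(check_constraints(["sun", "yue"], {"yue"}, {"sun"}))  # (True, 0.0)
```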
Through the default hierarchical training methodology, a topic model is created for each topic in a given index without requiring a user to provide any training examples.
Variations
Social indexes are created for a user without requiring example-based training. A system and method for providing default hierarchical training produces a draft index from chosen information sources and a hierarchy of topics for the index. Consequently, the user gets results quickly.
In the absence of training examples, there is no gold standard for performance. In one embodiment, some constraints are deemed more important than others, and the constraints are divided into “hard” and “soft” constraints. The scoring method penalizes violations of the hard constraints the most. However, other approaches are possible, such as simply ruling out default candidate patterns that violate hard constraints, rather than merely penalizing the patterns through their score. Moreover, as in example-based training, the default hierarchical training methodology counts matching articles and considers the complexity of the pattern. Pattern complexity is considered of secondary importance to the violation of constraints, and the scoring is based on counts of article matches.
Perhaps the most important and unique elements of default hierarchical training are considerations of relationships to other nodes in the topic tree. These considerations include the handling of duplicated topic labels and duplicated words, the use of words drawn from parent and other ancestor topic labels, and the relationships expressed by common ancestors within the topic tree.
In a further embodiment, a machine-learning approach to default hierarchical training could be created by collecting thousands of index topics, together with answers that have been certified as being correct. Applying a modeling approach, the system could search for the best assignment of weights to different features that meets the majority of the training cases.
In a still further embodiment, complete semantic models of the meanings of the words found in the topic labels could be incorporated into the default pattern trainer, which would facilitate finding an optimal default pattern by helping to determine the user's intent in constructing the topical index.
While the invention has been particularly shown and described as referenced to the embodiments thereof, those skilled in the art will understand that the foregoing and other changes in form and detail may be made therein without departing from the spirit and scope.