The invention relates to determining which items in a set of information are more or less relevant to one or more users.
Conferences can bring people from all over the world to share their new ideas with one another. For example, the Society for Neuroscience has an annual meeting for neuroscientists to present emerging science, learn from experts, collaborate with their peers, and explore new tools and technologies. Tens of thousands of individuals from most countries attend this conference over a multi-day period. Similarly sized conferences are held regularly throughout the world.
It is not possible for one person to learn all the information presented at a large conference, and so attendees must try to identify the presentations that are most relevant to their field of interest. Systems and methods are needed to improve the ability of an attendee to identify the most relevant information presented during a conference.
The problem of finding relevance in a large quantity of data is not unique to attendees at conferences. Anyone who has used the internet knows that vast amounts of data are available for people to review. Businesses in a variety of industries have developed “big data” and now struggle with determining its relevance. A key challenge in all of these areas is determining which information may be relevant to a particular user.
The word “butler” comes from the Anglo-Norman word buteler, corresponding to the Old French term botellier meaning “the officer in charge of the king's wine bottles” and derived from the French boteille, for “bottle.” Wikipedia gives this description of today's popular image of a butler: “the real-life modern butler attempts to be discreet and unobtrusive, friendly but not familiar, keenly anticipative of the needs of his or her employer, and graceful and precise in execution of duty.” A data butler is needed to help users determine which information is or is not relevant to them, across a wide variety of information fields.
Various features of the invention will be apparent from the following description of preferred embodiments as illustrated in the accompanying drawings, in which like reference numerals generally refer to the same parts throughout the drawings. The drawings are not necessarily to scale, the emphasis instead being placed upon illustrating the principles of the inventions.
In the following description, for purposes of explanation and not limitation, specific details are set forth such as particular structures, architectures, interfaces, techniques, etc. in order to provide a thorough understanding of the various aspects of a data butler to help a user identify relevant information from a large set of data. However, it will be apparent to those skilled in the art having the benefit of the present disclosure that the various aspects of the invention may be practiced in other examples that depart from these specific details. In certain instances, descriptions of well-known devices, circuits, and methods are omitted so as not to obscure the description of the present invention with unnecessary detail.
In an embodiment, a matching system, also known as a data butler, is provided that produces an automated schedule for visitors of a conference, such as a scientific conference or a trade show. The matching system may provide for large-scale matching capabilities. The matching system may produce an automated schedule for individual visitors of the conference. The matching system may match visitors to information of interest, such as a poster. In an embodiment, the system assigns no more than 50 visitors per poster and schedules about 20 posters per day per visitor. In an embodiment, the matching algorithm does not match a visitor to his or her own poster, or a poster of his or her own lab or organization. In an embodiment, the system uses only the abstracts of the posters being presented to produce the automated schedule. In an embodiment, the data butler reduces the amount of human intervention required to produce the automated schedule.
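The constraints in this embodiment (a cap on visitors per poster, a cap on posters per visitor, and no matches to a visitor's own organization) can be illustrated with a simple greedy pass. This is a minimal sketch under stated assumptions, not the matching algorithm of the embodiment: the function name `build_schedule`, the input dictionaries, and the greedy strategy are all hypothetical, and the per-visitor cap is applied to a single day.

```python
def build_schedule(ranked, poster_org, visitor_org,
                   max_per_poster=50, max_per_visitor=20):
    """Greedy illustration: walk each visitor's ranked posters and keep
    the first ones that satisfy every constraint.

    ranked:      visitor id -> posters ordered from most to least relevant
    poster_org:  poster id  -> organization presenting it
    visitor_org: visitor id -> organization the visitor belongs to
    """
    load = {}       # poster id -> number of visitors assigned so far
    schedule = {}   # visitor id -> posters assigned for the day
    for visitor, posters in ranked.items():
        own_org = visitor_org.get(visitor)
        picks = []
        for poster in posters:
            if len(picks) == max_per_visitor:
                break
            # Never match a visitor to a poster from his or her own organization.
            if own_org is not None and poster_org.get(poster) == own_org:
                continue
            # Respect the per-poster visitor cap.
            if load.get(poster, 0) >= max_per_poster:
                continue
            picks.append(poster)
            load[poster] = load.get(poster, 0) + 1
        schedule[visitor] = picks
    return schedule
```

A greedy pass favors visitors processed first; a production system might instead solve the assignment jointly across all visitors.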
The description below sets out in greater detail the use of the systems and methods described in the context of an academic conference. As discussed in further detail, it can be used by a conference participant to select posters or presentations that are related specifically to his or her field. However, it should be understood that the systems and methods described are useful in a wide variety of fields and situations where it is useful for a user to receive a display of an ordered listing of documents, where the listing of the documents is ordered by a relevance determined at least in part by a relevance value provided by the user.
The documents may be stored in a storage 350. For example, the storage 350 may contain documents 100 that comprise the text of conference papers to be presented at a conference. As another example, the storage 350 may contain documents 100 that comprise the abstract of conference papers to be presented at a conference.
As shown in
The window 200 may display at least one relevance input object associated with each document summary.
Even though the document summaries 150 are returned to the user on the basis of search text provided by the user, the document summaries 150 displayed may be of varying relevance to the user, based on his or her field of study or other interest. Therefore, the user is provided with the opportunity to identify whether a particular document summary shown in the window 200 is relevant or not relevant, using the relevance input objects 154 and 155.
The user of the computer 300 may indicate whether document summary 151 is relevant by activating relevance input object 154, such as by clicking or pressing it. The user of the computer 300 also may indicate whether document summary 151 is not relevant to him or her by activating relevance input object 155. Activating the relevance input object 154 or 155 causes the computer 305 to receive a relevance value 157 for the document summary 151, which may be a “1” or a “0” or another appropriate value. For instance, if the relevance input object is a plurality of stars, the relevance value 157 may reflect the number of stars selected by the user.
In an embodiment, the user may indicate a relevance value 157 for multiple document summaries displayed on the window 200. For example, the user might indicate that document summary 158a is not relevant but document summaries 158b and 158e are relevant. After making the indication, the user may activate the suggestion button 220 for the computer 305 to receive the relevance value 157. Alternately, the computer 305 may receive the relevance value 157 directly after the user activates an input object.
After the relevance value 157 is received, a revised plurality of documents 105 may be determined in response to it, as described in further detail below. A revised plurality of document summaries 150 for the documents 105 may then be displayed in the window 200. In an embodiment, the revised plurality of document summaries 150 may be ordered by relevance in response to the relevance value 157. In an embodiment, the revised plurality of document summaries 150 may differ from the document summaries 150 initially presented to the user, because the revised plurality of document summaries 150 are more relevant to the user than those presented in the initial search results.
For example, the embodiment shown in
In an embodiment, the window 200 is displayed using existing technologies, such as JAVASCRIPT, that allow only a portion of the window 200 to be updated. This functionality can make results appear more quickly for the user. For instance, each document summary may be stored as a DOM object. When the computer 300 receives the revised plurality of document summaries 150 for the documents 105, the computer 300 may compare the revised list with the prior list and update only the DOM objects that require updating. Similar update technologies may be used to display additional information about a document by clicking on a document summary. For instance, clicking the title of a document summary may cause the display 200 to be updated and show the abstract for that document, as shown in
We now turn to describing certain embodiments for and methods of determining which documents may be more relevant to a user in response to a relevance value. In an embodiment, latent semantic analysis may be performed on the documents 100. As an initial matter, certain steps may be performed in order to prepare for determining relevance.
In 602, the documents 100 are cleaned. For example, if a document 100d is a text document, such as a text abstract of a conference poster, the document 100d may be cleaned by removing punctuation and stop words (for example, ‘a’ or ‘the’), which appear in most or all documents. The document 100d may also be cleaned by removing other text that is not useful for the particular field of study. Whether numbers are removed depends on the field. For example, in the field of biology, certain organisms or diseases are identified by number, and so numbers are an important kind of information to retain to help identify an ordered list of documents for the user. In the field of computer science, numbers more often indicate results, and so are less useful for identifying an ordered list of documents for the user. Therefore, if the document relates to a field where numbers in the text are relatively less useful (such as computer science), then in 602 the numbers in the document may be removed.
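The cleaning of step 602 might be sketched as follows. This is an illustration only: the function name `clean_document`, the `keep_numbers` flag, and the very short stop-word list are assumptions made for the example, and a practical stop-word list would be far longer.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration.
STOP_WORDS = {"a", "an", "the", "of", "in", "and", "is", "are"}

def clean_document(text, keep_numbers=True):
    """Lowercase the text, strip punctuation, drop stop words, and
    optionally drop tokens that consist only of digits."""
    # Replace every character that is not a letter, digit, or whitespace.
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    words = [w for w in text.split() if w not in STOP_WORDS]
    if not keep_numbers:
        words = [w for w in words if not w.isdigit()]
    return words
```

Under these assumptions, a biology corpus would call `clean_document(text, keep_numbers=True)`, while a computer science corpus might pass `keep_numbers=False`.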
In 603, the documents 100 may be stemmed. For example, if a document 100d is a text document, then the document 100d may be stemmed by retaining the root of each word in the document but discarding the suffixes. For instance, the words “studying” and “studies” each become “studi”. The root term is known herein as a “token”. The set of tokens in a document 100d is referred to herein as 100dt and the set of tokens in all documents 100 is referred to herein as 100t.
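To make the stemming of step 603 concrete, the sketch below strips a handful of suffixes. It is a toy written for this example, not an established stemming algorithm and not necessarily the stemmer used in any embodiment; the function name `toy_stem` and its rule table are hypothetical.

```python
def toy_stem(word):
    """Strip one common suffix, then map a trailing 'y' to 'i' so that
    related word forms collapse to the same token."""
    rules = (("ies", "i"), ("ing", ""), ("ed", ""), ("s", ""))
    for suffix, replacement in rules:
        # Require at least three characters to remain after stripping.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: len(word) - len(suffix)] + replacement
            break
    if word.endswith("y") and len(word) > 2:
        word = word[:-1] + "i"
    return word
```

With these rules, “studying”, “studies”, and “study” all reduce to the token “studi”.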
In 604, a bag of words analysis is performed, wherein each document 100d is reviewed to count the number of times a token appears in the document 100d. For instance, if the token “studi” appears 10 times in a document 100d, then the token count of “studi” for that document 100d is equal to 10.
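The counting in step 604 can be sketched with the standard library. The function name `bag_of_words` and the choice to return a sorted vocabulary alongside the rows are assumptions made for illustration.

```python
from collections import Counter

def bag_of_words(tokenized_docs):
    """Return (vocabulary, rows): one row of token counts per document,
    with one column per token in the shared vocabulary."""
    counts = [Counter(tokens) for tokens in tokenized_docs]
    vocabulary = sorted(set().union(*counts)) if counts else []
    # A Counter returns 0 for tokens that a document does not contain.
    rows = [[c[token] for token in vocabulary] for c in counts]
    return vocabulary, rows
```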
In 605, the token count is weighted to reflect the importance of a token in the documents 100. Some common words, like “a” or “the”, will likely appear in most text documents, for instance, and so step 605 reduces the weight of tokens that appear in many of the documents 100. In an embodiment, term frequency-inverse document frequency (tf-idf for short) may be used in 605. In the example provided above, the count of the token “studi” may be revised to equal its former value (equal to 10) divided by the number of documents in documents 100 in which the token “studi” appeared. It should be apparent to one skilled in the art that other methods may be employed to weight the value of tokens 100t in order to reflect their importance in the documents 100. Such examples may include a logarithmic transformation of the term frequencies and document frequencies, or a normalization of the term frequency so that values are within pre-specified lower and upper bounds.
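The weighting of step 605 might look like the following sketch. It implements the simple division by document frequency described above, plus one common logarithmic variant as an option; the function names and the dictionary-based row format are assumptions made for the example.

```python
import math

def document_frequency(count_rows):
    """Count, for each token, the number of documents that contain it.
    Each row maps token -> count for one document."""
    df = {}
    for row in count_rows:
        for token in row:
            df[token] = df.get(token, 0) + 1
    return df

def weight_counts(count_rows, use_log=False):
    """Weight raw token counts by how many documents contain each token."""
    n_docs = len(count_rows)
    df = document_frequency(count_rows)
    weighted = []
    for row in count_rows:
        if use_log:
            # Logarithmic variant: count * log(N / document frequency).
            weighted.append({t: c * math.log(n_docs / df[t]) for t, c in row.items()})
        else:
            # Variant described in the text: count / document frequency.
            weighted.append({t: c / df[t] for t, c in row.items()})
    return weighted
```

For instance, a count of 10 for “studi” in a collection where “studi” appears in two documents becomes 5.0 under the simple variant. Under the logarithmic variant, a token that appears in every document receives a weight of zero.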
In 605, a weighted token matrix 120 may be prepared that includes the value of each token for each document in documents 100. A simplified example of a weighted token matrix is shown in
It should be understood that in certain uses, the weighted token matrix 120 will have millions of tokens, or potentially billions of tokens or more for very large datasets of documents. To simplify the final analysis and potentially to produce better results, in 606, a dimensionality reduction may be performed on the weighted token matrix 120. For example, truncated singular value decomposition (or SVD for short) may be performed on the weighted token matrix 120. It is known by the inventors that certain tokens are used together with an increased frequency. A dimensionality reduction such as truncated SVD helps to determine which tokens are frequently used together in the documents 100. Dimensionality reduction algorithms are available in many standard computer software packages, such as MATLAB, R, or Python, and so are not described here further. The result of the dimensionality reduction may be a vector 100dv for each document 100d, where the values of the vector describe a fingerprint of the document. In other embodiments, other dimensionality reduction methods may be employed, such as Principal Component Analysis, Non-negative Matrix Factorization, Sparse Matrix Factorization or Isomap. The number of dimensions to return after the dimensionality reduction method may be specified in advance of the reduction or determined during runtime, e.g. through nuclear norm minimization. In an embodiment, the number of dimensions may be chosen to capture a pre-specified percentage of the total variance in a selected data set. For example, for a certain data set, 400 dimensions may be selected because they capture a pre-specified level of 95% of the total variance in the data set. The number of dimensions may be optimized for a given objective. For example, the number of dimensions can be optimized for user satisfaction, for statistical reasons (as in non-parametric Bayesian approaches), or for computational reasons.
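Choosing the number of dimensions by a variance threshold can be sketched as below. The sketch assumes the singular values have already been computed by a numerical package, and that each component's share of the variance is proportional to its squared singular value (which holds exactly when the data are mean-centered). The function name `dims_for_variance` is hypothetical.

```python
def dims_for_variance(singular_values, target=0.95):
    """Return the smallest number of leading components whose squared
    singular values sum to at least `target` of the total."""
    squared = sorted((s * s for s in singular_values), reverse=True)
    total = sum(squared)
    if total == 0:
        raise ValueError("all singular values are zero")
    running = 0.0
    for k, value in enumerate(squared, start=1):
        running += value
        if running / total >= target:
            return k
    return len(squared)
```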
In 802, a set of documents 105 may be identified that are nearest neighbors to the relevant reference 140. The documents 105 may be identified using nearest neighbor methods known in the art, such as those based on Euclidean or Manhattan distance. In an embodiment, an approximate nearest neighbor search strategy may be employed, where the space of documents is recursively separated in a tree-like structure, where each leaf of the tree defines a “ball” that contains many documents. The number of branches and the depth of the tree affect the search speed and the accuracy of the search. Other methods for finding nearest neighbors include Hierarchical K-Means, KD-trees, and data-independent Locality-Sensitive Hashing.
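An exact (brute-force) version of the search in step 802 is sketched below, using either Euclidean or Manhattan distance. It does not implement the tree-based approximate strategy; the function names and the dictionary of document vectors are assumptions made for illustration.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def nearest_documents(reference, doc_vectors, k, distance=euclidean):
    """Return the ids of the k documents whose vectors lie closest to
    the reference point, nearest first.

    doc_vectors: document id -> reduced vector for that document
    """
    ranked = sorted(doc_vectors,
                    key=lambda doc_id: distance(reference, doc_vectors[doc_id]))
    return ranked[:k]
```

Brute force compares the reference against every document, so its cost grows linearly with the collection; the approximate strategies named above trade some accuracy for speed on large collections.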
In 803, the document summaries 105s for the set of documents 105 may be provided for display to the user for further review and interaction.
Steps 801 and 802 may be repeated each time the relevance value 157 is indicated, such as when the user activates a relevance input object. The relevance reference 140 is modified in response to the relevance value 157. For example, if the user indicates that document d2 (shown in
Additionally, in the step 801, the position of the relevant reference 140 may be revised if a relevance value 157 is provided for a document that indicates the document is not relevant. The position of the relevant reference 140 may be described by the following equation, which can be implemented to be executed on a computer:
where c is a constant greater than 0, vi is the vector for relevant document i, and wj is the vector for a not relevant document j, Nv is the number of relevant documents, and Nw is the number of irrelevant documents.
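The equation itself does not appear in this text, so the sketch below uses one form that is consistent with the symbols just defined and should be read as an assumption rather than as the equation of the disclosure: the reference is the mean of the relevant vectors minus c times the mean of the not relevant vectors, r = (1/Nv) Σ vi − c (1/Nw) Σ wj. The function name `relevance_reference` and the default value of c are hypothetical.

```python
def relevance_reference(relevant, not_relevant, c=0.5):
    """Assumed form: mean of relevant vectors minus c times the mean of
    not relevant vectors. Requires at least one relevant vector."""
    dims = len(relevant[0])
    n_v = len(relevant)
    # Start at the centroid of the relevant documents.
    ref = [sum(v[d] for v in relevant) / n_v for d in range(dims)]
    if not_relevant:
        n_w = len(not_relevant)
        # Push the reference away from the not relevant centroid.
        for d in range(dims):
            ref[d] -= c * sum(w[d] for w in not_relevant) / n_w
    return ref
```

Under this assumed form, a larger c moves the reference further from documents marked not relevant.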
The systems and methods described above may be implemented on one or more computers in a variety of different configurations. One possible configuration is shown in
The computer 300 may communicate with a server computer 320 through a communication link 310. As is known in the art, a communication link 310 may take many forms, including but not limited to a cellular transmission, a WI-FI transmission, a cable, a network connection, a bus, or a combination of such connections. Like the computer 300, the server computer 320 may take many different forms, including a plurality of computers arranged in a cloud network. Server computer 320 may comprise a storage 350 that stores the documents 100, and may perform the steps depicted in FIG. E and FIG. J. In another embodiment, the storage 350 may be part of computer 300, which avoids the need for the computer 300 to regularly communicate through a communication link to the server computer 320.
In an embodiment, the computer 300 may allow a user to create a profile, which allows the computer 300 and/or the server computer 320 to save the user's relevance selections and other information about the user. The profile may be created directly or indirectly, such as through an existing profile (such as a GOOGLE+ profile, a FACEBOOK profile, or another user profile). The profile could retain information about a user's preferences, either indefinitely or for a limited time (in days, months, or years). Alternately, the profile may erase at least a portion of the information about the user after each session.
In other embodiments, the systems and methods described could identify relevant documents for a user with multiple clusters of preferences. For instance, a user may be interested in the diverse fields of “computation” on one hand, and “butterflies” on the other hand. In systems with a large number of documents that extend across multiple subject areas, such as the set of web pages available through the Internet, the systems and methods described herein could return a first cluster of documents related to the user's interest in computation and a second cluster of documents related to the user's interest in butterflies.
In other embodiments, documents 100 may be weighted with relevance information that comes from other users' use of the systems and methods described herein. For example, if user_i and user_j share the same field, and user_i has indicated certain documents as relevant or not relevant, the systems and methods may weight those documents accordingly for user_j.
In other embodiments, the window 200 may display a trending list of documents. For instance, the window 200 may display documents found relevant by a large portion of users. In other embodiments, additional inputs may be included to allow users to mark whether they like or dislike a document, and the trending list may indicate documents that are liked by a large portion of users.
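A trending list of this kind could be computed as in the sketch below. The threshold for “a large portion of users” is not specified in the text, so the `min_share` parameter, like the function name `trending`, is an assumption made for illustration.

```python
from collections import Counter

def trending(relevant_by_user, min_share=0.5):
    """Return documents marked relevant by at least `min_share` of users,
    most widely marked first.

    relevant_by_user: user id -> documents that user marked relevant
    """
    n_users = len(relevant_by_user)
    # Count each document at most once per user.
    counts = Counter(doc for docs in relevant_by_user.values() for doc in set(docs))
    return [doc for doc, n in counts.most_common() if n / n_users >= min_share]
```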
Number | Date | Country
---|---|---
62218998 | Sep 2015 | US