A system supports online information search, discovery and retrieval by organizing documents by topic and content.
The workplace is an environment where a primary asset used by workers is knowledge. Further, knowledge workers require access to high quality information on a variety of topics as dictated by a dynamic set of tasks. While the web has substantially increased the number of informational sources available to such workers, finding the right information at the right time remains difficult. The Internet is a source of a wealth of knowledge; however, the tools available for accessing the content are not well suited for knowledge acquisition. Search engines are a highly dynamic source of information and provide excellent coverage. The search results they produce, however, are optimized and presented based on a set of criteria that are not optimal for knowledge acquisition. Attempting to acquire knowledge using search engines is time consuming and may not produce good results. At the other end of the spectrum, online courses or Massive Open Online Courses (MOOCs) provide education on a variety of topics (good coverage), but are static and time consuming. Known knowledge acquisition systems typically fail to find high quality information sources for a broad variety of relevant topics.
An online knowledge system locates high quality informational sources related to a particular topic by capturing the intelligence of a multitude of user selections and user labelling using machine learning techniques. The system finds high quality information sources for a broad variety of relevant topics and organizes the sources to support learning, exploration, and collaboration. The system assesses suitability of information sources available to knowledge workers based on evaluation criteria. The system categorizes information sources based on: (a) quality, whether a source provides high quality information, (b) coverage, whether a source provides content on a wide variety of topics, and (c) dynamism, whether a source provides up to date information and provides it quickly.
A system for searching the Internet for a document comprises at least one computer system including a first data repository, a second data repository and a processor. The first repository of data represents an organization of documents provided in response to frequency of terms found in individual documents. The second repository of data represents topics, with an individual topic being associated with (a) a set of documents in the first repository and (b) a related topic. The processor is configured to, in response to a received search term, use the first and second repositories to identify search result documents in the organization of documents, including documents from a first set of documents associated with the individual topic and a second set of documents associated with the related topic.
A system assesses information quality provided by an online source, organizes sources in a topic based ordering that supports knowledge acquisition and exploration, and enables labelling the organized structure based on its topic areas. A Self Organizing Map as used herein is a low-dimensional (typically two-dimensional), discretized representation of the input space of the training samples, called a map. As used herein the term “repository” is used interchangeably with the term “database”. As used herein a document comprises an informational source, text, message, compilation of data, image, picture or software code and is used interchangeably with the term “article”. As used herein a Feature Vector comprises data indicating a document spatial position within an array of elements representing documents.
The system provides a “search by example” function whereby a user identifies a document, and the system finds documents and/or topics that are relevant to that document. The system derives a Feature Vector from a document and an individual Feature Vector constitutes a point in Feature Space. The system finds documents and/or topics that are relevant to a document in response to a derived Feature Vector of the document. The system extracts terms from a document along with a measure of their relevance to the document. The relevance measure is derived from the context of the document. The system derives the Feature Vector from the extracted terms and their respective relevance values using a hashing function that is a one way mapping.
Application 53 includes inbox 41 for receiving and storing documents and includes topic browser 45 enabling a user to add, delete, edit and navigate document topics as well as to associate a received document with one or more existing or new topics. Reader 43 supports document reading and processing for presentation on a display unit (not shown). Server 12 supports document related search and database update functions. In response to UI commands via Application 53 and browser plugin 51, transaction manager 33 determines user access and authorization (e.g., in response to a password and userid) using authorization unit 29 and user data stored in repository 31. Transaction manager 33, operating together with database and training manager 25 via API 27, directs generation and update of a document database 17 and associated document SOM data array 21 as well as a topic database 19 and associated topic SOM data array 23. Unit 33, with unit 25, stores in a first repository 21 data representing an organization of documents (e.g. a 2D (two dimensional) SOM map comprising a data array of spatially organized individual elements representing corresponding individual documents) provided in response to frequency of terms found in individual documents. Unit 33, with unit 25, stores in a second repository 23 data representing topics (e.g. a 2D (two dimensional) SOM map comprising a data array of spatially organized individual elements representing corresponding individual topics associated with corresponding documents). Further, an individual topic is associated with (a) a set of documents in the first repository 21 and (b) a related topic.
A processor (units 33 and 25) is configured to, in response to a received search term, use the first and second repositories 21 and 23 to identify search result documents in the organization of documents 17, including documents from a first set of documents in unit 17 associated with the individual topic and a second set of documents associated with the related topic. Search results and UI windows supporting user interaction and operation of system 10 are presented on display 56.
A Feature Vector comprises, for example, a one-dimensional matrix of decimal numbers that describes the lexical content of a text document. A Feature Vector, in an embodiment, associates an individual cell of the vector with a word from the English language. In order to limit the size of the vectors, stemming is used to eliminate grammatical variations of the same word (such as run and running) and commonly occurring words such as connectives (for example if, while, so, but, yet) are excluded. In order to construct a Feature Vector for a given document, the importance or relevance of each word represented in the Feature Vector is determined for the target document. The use of the frequency of a word in a document (term frequency), comprising the number of occurrences of the word in the document, to determine a word's relevance to a document is limited for discriminating between documents if the word occurs frequently in the documents being classified. Therefore, a Feature Vector employs inverse document frequency, which gives higher scores to words that occur frequently in a small number of documents in a collection of documents. If N is the number of documents in a collection, inverse document frequency is calculated as

idf_t = log(N / df_t)

where df_t is the document frequency of term t, the number of documents in which term t occurs. Term relevance assigned by units 33 and 25 to each term in a Feature Vector combines both term frequency and inverse document frequency as follows:
weight_t = tf_t × idf_t
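The tf-idf weighting above can be sketched as follows; the function name and the small example corpus are illustrative, not part of the described system:

```python
import math
from collections import Counter

def tfidf_weights(doc_terms, corpus):
    """Compute weight_t = tf_t * idf_t for each term of one document.

    doc_terms: list of (stemmed) terms in the target document.
    corpus: list of term lists, one per document in the collection.
    """
    n_docs = len(corpus)
    tf = Counter(doc_terms)  # term frequency: occurrences in this document
    # document frequency: number of documents containing each term
    df = Counter(t for doc in corpus for t in set(doc))
    # idf_t = log(N / df_t): higher for terms concentrated in few documents
    return {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
```

A term occurring in every document receives weight zero (log(N/N) = 0), which is exactly the discrimination behavior the text motivates.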
Units 33 and 25 select relevant words of a document to include in a Feature Vector, reducing the number of words in the vector, and advantageously use a non-cryptographic hash function to map relevant words found in a document to cell indices in a Feature Vector. This also advantageously reduces computational overhead in obtaining, maintaining, and storing large collections of terms, especially when multiple languages are supported. The hash function used acquires an input string of characters and outputs an integer number, the hash number, that uniquely identifies the input string within the precision provided by the range of numbers that the function outputs. Hash functions, therefore, do not guarantee that two different input strings will produce different hash numbers. The approach, termed feature hashing, obviates the need to maintain large dictionaries of words and provides a computationally efficient method of constructing Feature Vectors from text documents. The feature vector hashing does not significantly impair classification performance.
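A minimal sketch of the feature hashing step, assuming CRC32 as the non-cryptographic hash and an arbitrary vector size; when two terms collide, their weights simply sum into the same cell:

```python
import zlib

def hashed_feature_vector(term_weights, dim=1024):
    """Map term -> weight pairs into a fixed-size Feature Vector via feature hashing.

    term_weights: dict of term -> relevance weight (e.g. tf-idf).
    dim: number of cells in the Feature Vector (an assumed, illustrative size).
    """
    fv = [0.0] * dim
    for term, weight in term_weights.items():
        # CRC32 as a fast non-cryptographic hash; collisions are possible
        # but rare enough not to significantly impair classification
        idx = zlib.crc32(term.encode("utf-8")) % dim
        fv[idx] += weight
    return fv
```

Because the hash replaces a dictionary lookup, no vocabulary needs to be stored or synchronized across languages, which is the overhead reduction the text describes.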
Units 33 and 25 in activity 305 query the topic SOM 23 for topics that are near (within a predetermined radius of) a location in the topic SOM map 23 determined by a document's Feature Vector and present identified topics to a user in an image on display 56. If topics are found near the document in the topic SOM map 23, they are presented to the user as candidate topics that the user can associate the document with. A user also is presented with an option of not choosing the candidate topics and creating a new one. In activity 307, in response to user addition of a document to a selected existing topic, the selected topic is added to the user's topic List in the User Database and is subsequently displayed in the Application's topic area. Units 33 and 25 in activity 309 recalculate the topic's Feature Vector as a mean value of the Feature Vectors of the Documents in the new Document list as follows:
FV_topic = (1/N) Σ_{i=1}^{N} FV_i

where N is the number of Documents in the new Document list and FV_i is the Feature Vector of the ith Document in the topic's document list.
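The recalculation of activity 309 reduces to an element-wise mean over the documents' Feature Vectors; a minimal sketch:

```python
def topic_feature_vector(doc_fvs):
    """Topic Feature Vector as the element-wise mean of its documents' vectors.

    doc_fvs: non-empty list of equal-length Feature Vectors (lists of floats).
    """
    n = len(doc_fvs)
    dim = len(doc_fvs[0])
    # average each cell across all document Feature Vectors
    return [sum(fv[i] for fv in doc_fvs) / n for i in range(dim)]
```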
In response to a user command to create a new topic, in activity 311 a new record is added by units 33 and 25 to topic Database 19, a new, unique identifier for the topic is generated, the Add Count is initialized to 1 and the Remove Count is initialized to 0. In activity 313 the added topic's Feature Vector is initialized to the same values as the associated corresponding Document Feature Vector and in activity 315 units 33 and 25 insert the new topic record into topic Database 19. The process of
Feature Vectors specify points in N-space, where N is the size of the Feature Vector, and a metric that may be used is a Euclidean distance between two points. Euclidean distance d between two Feature Vectors is calculated using, for example,

d = sqrt( Σ_{i=1}^{N} (fv1_i − fv2_i)² )

where fv1_i and fv2_i are the ith elements of Feature Vectors fv1 and fv2. Another commonly used proximity metric is the cosine of the angle between two Feature Vectors. This metric is calculated using the dot product vector operation. This metric is advantageous because dot products between Feature Vectors are preserved when feature hashing is used, while Euclidean distance may not be. The dot product metric used to calculate the proximity between two Feature Vectors is determined using,

cos θ = (fv1 · fv2) / (|fv1| |fv2|)
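Both proximity metrics can be sketched directly from their definitions; the function names are illustrative:

```python
import math

def euclidean_distance(a, b):
    """Euclidean distance between two Feature Vectors treated as points in N-space."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine of the angle between two Feature Vectors via the dot product."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

Note the opposite senses of the two metrics: smaller Euclidean distance means closer vectors, while a cosine value nearer 1 means closer vectors.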
In activity 503 following the start at activity 501, Application 53 is updated so that the Food Safety topic contains a document of interest and a list of suggested visually highlighted documents in order of relevance is presented in an image on display 56 beneath a link to the document. A user is able to view each of the recommended documents on display 56 by double clicking on a link to each one in turn to view the corresponding document in a separate area of the displayed image.
In activity 505, a user selects an individual suggested document for addition to his Food Safety topic by selecting an Add “+” button next to the document. Application 53 in activity 508 communicates with Server 12 to update user database 31 to add the additional documents to the particular user's Food Safety topic, and Server 12 increments the “Additions” counter of each document added by the user to his topic. If the documents added are not already in the Food Safety topic's list of documents, they are added to the topic. In activity 511, units 33 and 25 recalculate a feature vector associated with that topic as the mean of the feature vectors of the documents associated with that topic. In activity 514, in response to addition of documents to the Food Safety topic, topic SOM 23 is updated by initiating training of the topic SOM 23.
The SOM competitive learning selects a single node as a winner and is guaranteed to converge to a stable state. It results in a network organizing itself into a low dimensional structure that reflects the topological structure of the high dimensional data, producing a two dimensional map (SOM 21 and 23) where each node represents a set of related documents (and topics) and the relative location of the nodes (measured as a combination of the Euclidean distance and the cosine of the angle between the feature vectors of two nodes) reflects the topical relationship of the documents. Nodes that are near each other indicate documents that are topically related, while nodes that are far from each other are topically unrelated. System 10 stores a taxonomy of topics and high value information sources associated with the topics and uses SOM 21 and 23 to capture intelligence in a form that is both accessible to the user and that can grow serially to capture the intelligence of a user base.
The SOMs comprise a two level hierarchical SOM with one level including documents selected by a user base. Document SOM 21 organizes documents based on their term frequency. The second level SOM 23 contains nodes that correspond to topics created by users. When a topic is created by a user, it initially does not have a term frequency vector assigned to it. As documents are added to that topic, the term vector for that node is assigned to be the mean of the term vectors of its constituent documents. This creates a correlation between the two levels of SOM. Each node in the topic SOM is anchored to a location in the Document SOM. The neighborhood of that anchor contains the documents most relevant to that topic. This organizes topics created by users into neighborhoods of related topics.
System 10 supports browsing documents and topics using document SOM 21, enabling the size of a neighborhood to be dynamically changed to include more or fewer documents, and topic SOM 23, enabling the size of a neighborhood to be dynamically changed to include more or fewer topics. Browsing documents related to a topic involves selecting a topic including documents added by other users in a topic neighborhood of a document. Browser plug-in 51 enables users to select an open document for addition to a topic. System 10 displays topics and their associated documents and provides access through a set of web services by third party applications through a web services interface.
A Self Organizing Map (SOM) is represented as a two dimensional array of nodes. Each node consists of a data structure containing a Feature Vector and a list of the Data Observations (a Data Observation can be either a Document Identifier or a topic Identifier, depending on what the SOM represents) that are nearest in distance to that node's Feature Vector. The elements comprising this list are referred to as the Best Matching Units (BMUs). SOM 21 or 23 is trained in an iterative manner, where each iteration of training brings the SOM closer to a stable state in which its topology reflects the topical structure of the input Data Observations. SOM training begins by assigning each node a Feature Vector consisting of random values and initializing the list of BMUs to an empty list. In a training iteration, units 33 and 25 select a random Data Observation (Document or topic) from the database (units 17 and 19) and calculate the distance between the Data Observation's Feature Vector and the Feature Vectors of the SOM cells. Units 33 and 25 select the SOM cell with the smallest distance to the Data Observation as a winning cell and modify the Feature Vector of the winning cell by adding to it a vector quantity equal to the difference between the two Feature Vectors multiplied by a scalar value representing the current learning rate, so that the cell moves closer to the Data Observation. Units 33 and 25 modify the Feature Vector of other cells in the SOM by adding to their Feature Vector a vector quantity equal to the difference between each cell's Feature Vector and the Data Observation's Feature Vector multiplied by a scalar value representing the neighbor cell influence, so that the cell moves closer to the Data Observation. The learning rate scalar value controls the magnitude of changes that are made to the SOM during training. At the start of training it is set to a relatively large number, but is progressively reduced as training proceeds and the SOM approaches a stable state. Units 33 and 25 calculate the learning rate scalar as,
where lr is the learning rate used in the current training iteration, lr_initial is the learning rate at the start of training, lr_final is the learning rate at the end of training, i_current is the current training iteration and i_final is the total number of training iterations.
The neighbor cell influence is a scalar value that controls how much influence a winning cell has on its neighbors. This value is highest near the cell and falls off exponentially away from the cell. The cell influence scalar is calculated as,
where ni is the neighbor influence used in the current training iteration, d_cell is the distance in Cartesian coordinates between the winning cell and a neighboring cell, ni_initial is the maximum value of the neighbor influence scalar applied to immediate neighbors of the winning cell, ni_final is the minimum value of the neighbor influence scalar, i_current is the current training iteration and i_final is the total number of training iterations. The distance between two cells in a SOM depends on the topology of the SOM. System 10 uses a two dimensional grid, so the distance between cell_i located at (row_i, col_i) and cell_j located at (row_j, col_j) is calculated as the Manhattan distance between the cells:
d = |row_i − row_j| + |col_i − col_j|
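The training iteration described above can be sketched as follows. This is a minimal illustration, not the system's implementation: the exponential interpolation used for the learning rate and neighbor influence schedules is an assumption (the text states only that both values are progressively reduced and that influence falls off exponentially with grid distance from the winning cell), and the parameter defaults are illustrative.

```python
import math
import random

def manhattan(cell_i, cell_j):
    """Grid distance d = |row_i - row_j| + |col_i - col_j| between two SOM cells."""
    return abs(cell_i[0] - cell_j[0]) + abs(cell_i[1] - cell_j[1])

def train_som(observations, rows, cols, iterations,
              lr_initial=0.5, lr_final=0.01,
              ni_initial=0.9, ni_final=0.01, seed=0):
    """Iteratively pull SOM cell Feature Vectors toward random Data Observations."""
    rng = random.Random(seed)
    dim = len(observations[0])
    # each node starts with a Feature Vector of random values
    grid = {(r, c): [rng.random() for _ in range(dim)]
            for r in range(rows) for c in range(cols)}
    for i in range(iterations):
        frac = i / iterations
        # assumed exponential interpolation from initial to final value
        lr = lr_initial * (lr_final / lr_initial) ** frac
        ni = ni_initial * (ni_final / ni_initial) ** frac
        obs = rng.choice(observations)
        # winning cell: smallest distance between Feature Vectors
        winner = min(grid, key=lambda cell: math.dist(grid[cell], obs))
        for cell, fv in grid.items():
            if cell == winner:
                step = lr
            else:
                # neighbor influence falls off exponentially with grid distance
                step = lr * ni * math.exp(-manhattan(cell, winner))
            # move the cell's Feature Vector toward the Data Observation
            for k in range(dim):
                fv[k] += step * (obs[k] - fv[k])
    return grid
```

After enough iterations, cells near each other on the grid hold similar Feature Vectors, giving the topological ordering the text describes.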
System 10 enables advantageous querying of documents and topics, for example, to find topics related to a document, documents related to a topic, topics related to a topic, as well as documents related to a document. Individual queries may comprise a radius of relevance (spatial distance) from a point of reference on SOM 21 and SOM 23. This advantageously allows users to control the specificity of the results returned. Units 33 and 25 find relevant topics within a specified radius of relevance from a target document using SOM 21 and SOM 23. The selected radius determines the breadth of relevant topics. A user adds a new document to inbox 41 using browser plug-in 51 and selects a topic and units 33 and 25 prompt a user with a topic radius. Units 33 and 25 thereby use a radius to suggest a set of existing topics and advantageously limit the number of extraneous topics created by users. Units 33 and 25 calculate the spatial distance between a Feature Vector of a document and a Feature Vector of a selected topic node in topic SOM 23. If the distance is less than the specified radius, units 33 and 25 add the topics from the node's Best Matching Unit list to the result.
System 10 finds relevant documents within a specified radius of relevance from a target topic using SOM 21 and SOM 23 to suggest documents relevant to a topic. System 10 derives a query to provide a document recommendation allowing users to quickly build their topic content by adding documents from a list of recommended documents to a user document or search. For each node in document SOM 21, units 33 and 25 calculate a distance between a topic Feature Vector and a selected node Feature Vector. If the distance is less than a specified radius, units 33 and 25 add the Documents from the node cell's Best Matching Unit list to the result.
System 10 finds relevant topics within a specified radius of relevance from a target topic using SOM 21 and SOM 23 and derives a query enabling users to browse a topical neighborhood of the topic SOM 23. For each cell in topic SOM 23, units 33 and 25 calculate the distance between the topic Feature Vector and the selected cell Feature Vector. If the distance is less than the specified radius, units 33 and 25 add the topics from the cell Best Matching Unit list to the result.
System 10 finds relevant documents within a specified radius of relevance from a target document using SOM 21 and SOM 23 and derives a query enabling users to browse a topical neighborhood of document SOM 21. For each cell in document SOM 21, units 33 and 25 calculate the distance between the document Feature Vector and the selected cell Feature Vector. If the distance is less than the specified radius, units 33 and 25 add the documents from the cell Best Matching Unit list to the result.
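All four query variants above reduce to the same radius-of-relevance test, sketched below; the (feature_vector, BMU list) node representation is assumed from the SOM description, and the function name is illustrative:

```python
import math

def query_within_radius(target_fv, som_nodes, radius):
    """Collect Best Matching Unit entries from SOM nodes whose Feature Vector
    lies within a radius of relevance of the target Feature Vector.

    som_nodes: iterable of (feature_vector, bmu_list) pairs, the assumed
    node structure (Feature Vector plus BMU list of Document/topic IDs).
    """
    results = []
    for fv, bmus in som_nodes:
        # spatial distance in Feature Space against the relevance radius
        if math.dist(fv, target_fv) < radius:
            results.extend(bmus)
    return results
```

Widening the radius returns a broader, less specific result set, which is how the user controls specificity.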
System 10 advantageously determines a user's time varying level of expertise (topic IQ) in a topic area and displays the user expertise level for a given topic on display 56. Units 33 and 25 calculate a user's topic IQ using a ratio between the number of documents that exist within a fixed radius of a topic's Feature Vector location on Document SOM 21 and the number of those documents read by the user.
Units 33 and 25 calculate a number of documents related to a topic based on the inherent organization of the SOM 21 and 23 structure. A topic Feature Vector includes a topic location or “anchor” in document SOM 21. In order to determine the number of documents related to a given topic, units 33 and 25 find documents that fall within a predetermined distance of the topic Feature Vector. The predetermined distance used in the topicIQ calculations can be automatically and dynamically varied based on the level of expertise that a user has achieved. For a novice user, that distance can be relatively small. Once the user has achieved a high topicIQ score as a novice, units 33 and 25 move the user to Intermediate status and the distance used in the topicIQ calculation is increased. In response to the user achieving a high score at the Intermediate level, the user is moved to Expert status and the distance is increased further.
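A sketch of the topicIQ ratio and the expertise-dependent radius. The direction of the ratio (documents read over related documents) and the concrete radius values are assumptions; the text states only that a ratio of the two counts is used and that the radius grows with expertise:

```python
def topic_iq(n_related_docs, n_read):
    """topicIQ as the proportion of documents near the topic that the user
    has read (an assumed reading of the ratio described in the text)."""
    return 0.0 if n_related_docs == 0 else n_read / n_related_docs

# illustrative radii only; the text gives no concrete values per level
LEVEL_RADIUS = {"novice": 1.0, "intermediate": 2.0, "expert": 4.0}
```

As a user advances, the larger radius pulls in more related documents, so maintaining a high topicIQ requires reading more broadly.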
System 10 advantageously determines a user has read a document using,
where N is the number of pages in a document, t_i is the time, in minutes, spent on page i and s_i is the number of scroll operations performed on page i. This probability determination advantageously encompasses readers that fall outside normal behavior.
Units 33 and 25 determine document and topic relevance and order documents by their determined relevance. The relevance calculation takes into account the distance between topic or document Feature Vectors as well as the number of times the Document or Topic was added and removed by users. Documents are added to and removed by users from their document list for a topic, while topics are added to and removed from their list of topics. Document or topic relevance is calculated using,
Where distance is the distance between Feature Vectors, removed is the number of times the Document or Topic was removed by a user and added is the number of times the Document or Topic was added by a user.
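Since the relevance formula itself is not reproduced above, the following is only one plausible combination of the three stated inputs, labeled as an assumption: it makes relevance fall with Feature Vector distance and with removals, and rise with additions.

```python
def relevance(distance, added, removed):
    """A hypothetical relevance score. The actual formula is not reproduced
    in the text; this form is an assumption that merely respects the stated
    directions: closer vectors and more additions raise relevance, removals
    lower it."""
    return (1 + added) / ((1 + removed) * (1 + distance))
```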
In an example of operation, a user in a team needs to prepare a research report on a particular topic (topic A). The user and team install a browser plug-in compatible with a browser, study social media, mainstream news articles and academic papers, and identify and mark a relevant article for further study via application 53. Server 12 queries document SOM 21 for articles within a specified distance (i.e. in the neighborhood) of the marked article and provides application 53 with a list of the documents within a specified distance of the topic. The user employs a shared dashboard for the team to view articles in inbox 41 and views and adds relevant documents of other team members to the user collection. The user selects a Learning Lab button and selects a first article to read in Zen Mode, showing a plain text version of the article and removing distracting elements associated with web browsing. System 10 enables a user to highlight the text associated with an individual person and save the highlighted (or marked) text to a people profile database of the user in repository 31. System 10 also enables a user to highlight a text term (such as “food inflation”) in an article and adds the term and its definition, automatically acquired from Wikipedia, into a vocabulary builder in repository 31. The vocabulary builder saves terms and enables a user to explore definitions and reference them. The team is able to build a list of key terms and people data related to topic A using a specific dashboard for topic A, and a user is able to add a comment using a social annotation feature requesting additional information, enabling others to add information such as a link to a related document.
The array of elements representing documents comprises a two dimensional or three dimensional array of elements where the distance between two elements representing first and second documents represents the degree of relatedness of the first and second documents, and the received search term comprises data indicating a document spatial position within an array of elements representing documents. In activity 240, units 33 and 25 use the second repository to identify a topic related to the individual topic and a set of documents associated with the related topic. Units 33 and 25 in activity 242 identify search result documents in the organization of documents as documents from both the set of documents associated with the individual topic and the set of documents associated with the related topic. Units 33 and 25 identify the related topic as having a spatial position within the array closest to the spatial position of the individual topic, the spatial position of the related topic corresponding to a center of a set of documents associated with the related topic. The center of the set of documents comprises at least one of (a) a center of mass of elements representing individual documents of the set of documents, the elements being of equal weight, and (b) a center of mass of elements representing individual documents of the set of documents, the elements being weighted in response to a relevance criterion. The first and second repositories may comprise one or more data repositories or databases.
The second repository includes a topic array comprising elements representing topics and associates an individual topic with a position in the topic array, and an element in the topic array maps to a center of a set of documents associated with the individual topic in the array of elements representing documents of the first repository. Units 33 and 25, in response to a received search term, identify a first document using the first repository, identify a related topic comprising a topic related to the topic associated with the identified first document using the second repository, identify a second document associated with the identified related topic, and output data representing the search result documents including the first and second documents. In activity 244 units 33 and 25 determine a user expertise level associated with a topic in response to at least one of (a) a number of documents read by the user, (b) a number of documents related to a topic and (c) a proportion determined using (a) and (b). The process of
The above-described embodiments can be implemented in hardware, firmware or via the execution of software or computer code that can be stored in a recording medium such as a CD ROM, a Digital Versatile Disc (DVD), a magnetic tape, a RAM, a floppy disk, a hard disk, or a magneto-optical disk or computer code downloaded over a network originally stored on a remote recording medium or a non-transitory machine readable medium and to be stored on a local recording medium, so that the methods described herein can be rendered via such software that is stored on the recording medium using a general purpose computer, or a special processor or in programmable or dedicated hardware, such as an ASIC or FPGA. As would be understood in the art, the computer, the processor, microprocessor controller or the programmable hardware include memory components, e.g., RAM, ROM, Flash, etc. that may store or receive software or computer code that when accessed and executed by the computer, processor or hardware implement the processing methods described herein. In addition, it would be recognized that when a general purpose computer accesses code for implementing the processing shown herein, the execution of the code transforms the general purpose computer into a special purpose computer for executing the processing shown herein. The functions and process steps herein may be performed automatically or wholly or partially in response to user command. An activity (including a step) performed automatically is performed in response to executable instruction or device operation without user direct initiation of the activity. No claim element herein is to be construed under the provisions of 35 U.S.C. 112, sixth paragraph, unless the element is expressly recited using the phrase “means for.” A “processor” as used herein comprises, a computer system circuit and device operating in response to instruction and is not just software.
The architecture of
This is a non-provisional application claiming priority of provisional Application Ser. No. 61/764,655 by H. Fouad et al., filed 14 Feb. 2013.