While the invention is claimed in the concluding portions hereof, preferred embodiments are provided in the accompanying detailed description which may be best understood in conjunction with the accompanying diagrams where like parts in each of the several diagrams are labeled with like numbers, and where:
The processing unit 3 can be any processor known in the art with the capacity to run the program and is operatively coupled to the memory storage device 4 through a system bus. In some circumstances the data processing system 1 may contain more than one processing unit 3. The memory storage device 4 is operative to store data and can be any storage device known in the art, such as a local hard disk, and can include local memory employed during actual execution of the program code, bulk storage, and cache memories for providing temporary storage. Additionally, the memory storage device 4 can be a database that is external to the data processing system 1 but operatively coupled to the data processing system 1. The input device 5 can be any device suitable for inputting data into the data processing system 1, such as a keyboard, mouse or data port such as a network connection, and is operatively coupled to the processing unit 3 and operative to allow the processing unit 3 to receive information from the input device 5. The display device 6 is a CRT, LCD monitor, etc. operatively coupled to the data processing system 1 and operative to display information. The display device 6 could be a stand-alone screen or, if the data processing system 1 is a mobile device, the display device 6 could be integrated into a casing containing the processing unit 3 and the memory storage device 4. The program module 8 is stored in the memory storage device 4 and operative to provide instructions to the processing unit 3, and the processing unit 3 is responsive to the instructions from the program module 8.
Although other internal components of the data processing system 1 are not illustrated, it will be understood by those of ordinary skill in the art that only the components of the data processing system 1 necessary for an understanding of the present invention are illustrated and that many more components and interconnections between them are well known and can be used.
Furthermore, the invention can take the form of a computer readable medium having recorded thereon statements and instructions for execution by a data processing system 1. For the purposes of this description, a computer readable medium can be any apparatus that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. The medium can be an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system (or apparatus or device) or a propagation medium. Examples of a computer-readable medium include a semiconductor or solid state memory, magnetic tape, a removable computer diskette, a random access memory (RAM), a read-only memory (ROM), a rigid magnetic disk and an optical disk. Current examples of optical disks include compact disk—read only memory (CD-ROM), compact disk—read/write (CD-R/W) and DVD.
The concept knowledge base 10 contains information relating to a field of knowledge. For example, the concept knowledge base 10 could contain information related to the field of science. The concept knowledge base 10 contains a number of concept data objects 12, a number of term data objects 14 and a number of edge data objects 16.
Each concept data object 12 contains a concept field 13 containing a concept that is related to a specific concept falling within the field of knowledge of the concept knowledge base 10. The concept field 13 typically contains a text string identifying the concept. For example, if the concept knowledge base 10 is for computer science, there may be concept data objects 12 with the concept field 13 containing the text string of “computer graphics”, another concept data object 12 with the concept field 13 containing the text string of “distributed computing”, another concept data object 12 with the concept field 13 containing the text string “artificial intelligence”, etc.
Each term data object 14 contains a term field 15 containing a text string. The text string contains a word or phrase that describes a concept of one of the concept data objects 12.
Each concept data object 12 is associated with one or more term data objects 14 and each term data object 14 is associated with one or more concept data objects 12. The association of a concept data object 12 and a term data object 14 is defined by an edge data object 16 which contains a weight field 18. A term data object 14 that is associated with a concept data object 12 contains a term in the term field 15 that describes the concept contained in the concept field 13 of the concept data object 12. The relevancy of the term in the term field 15 of the term data object 14 to the concept in the concept field 13 of an associated concept data object 12 is represented by a weight in the weight field 18 of the edge data object 16.
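The data structure described above can be sketched in Python as follows. This is a minimal illustration only: the class names, the `associate` helper and the sample weights are assumptions introduced here, not part of the described system.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ConceptDataObject:
    concept: str  # plays the role of concept field 13


@dataclass(frozen=True)
class TermDataObject:
    term: str  # plays the role of term field 15


@dataclass
class EdgeDataObject:
    concept: ConceptDataObject
    term: TermDataObject
    weight: float = 0.0  # plays the role of weight field 18


@dataclass
class ConceptKnowledgeBase:
    concepts: dict = field(default_factory=dict)  # concept text -> ConceptDataObject
    terms: dict = field(default_factory=dict)     # term text -> TermDataObject
    edges: dict = field(default_factory=dict)     # (concept text, term text) -> EdgeDataObject

    def associate(self, concept: str, term: str, weight: float) -> EdgeDataObject:
        """Hypothetical helper: create the objects if needed and link them with an edge."""
        c = self.concepts.setdefault(concept, ConceptDataObject(concept))
        t = self.terms.setdefault(term, TermDataObject(term))
        edge = EdgeDataObject(c, t, weight)
        self.edges[(concept, term)] = edge
        return edge


# Illustrative, made-up weights.
kb = ConceptKnowledgeBase()
kb.associate("computer graphics", "rendering", 0.6)
kb.associate("computer graphics", "shading", 0.4)
kb.associate("artificial intelligence", "rendering", 0.1)
```

Keying the edges by the (concept, term) pair keeps the many-to-many association explicit: one term can be linked to several concepts, each link carrying its own weight.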
While it is possible to manually construct the data structure containing the concept knowledge base 10,
Method 100 comprises the steps of: determining a concept 110; selecting a document describing the concept 120; determining terms in the document to be analyzed 130; determining the frequency of the selected terms 140; checking if there are any remaining documents describing a concept 150; calculating a preliminary weight 160; checking if there are any more concepts 170; and normalizing all of the weights 180.
The method takes a number of documents and/or descriptions in computer readable form that describe a number of different concepts in a knowledge area and uses the documents to automatically generate a data structure of a concept knowledge base 10, as shown in
The method 100 begins with step 110. A concept falling within the concept knowledge base is determined and a concept data object is created with information identifying the concept contained in the concept field.
Each concept will be described by one or more documents or descriptions in computer readable format. Once a concept has been determined at step 110, one or more documents describing the concept are identified and at step 120 one of these documents is selected to be analyzed.
At step 130, the method 100 determines the terms to be analyzed in the document. For each term to be analyzed, the method 100 creates a term data object with the term field containing the term, if a term data object containing the term does not already exist. An edge data object indicating the association of the term data object and the concept data object is also created. After the method 100 is completed, this edge data object will contain a weight indicating the relation of the term data object with the associated concept data object containing the concept described by the document being analyzed.
The terms that are analyzed can include all of the words used in the document or only specific words in the document. For example, common words that are basically non-descriptive, such as “the”, “a”, “this”, etc. may be excluded from the terms that are selected for analysis at step 130.
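The term selection of step 130 might be sketched as below. The tokenizing regular expression and the contents of the stop-word list are assumptions made for illustration; the description only says that common non-descriptive words may be excluded.

```python
import re

# Hypothetical stop-word list; the description names only "the", "a" and "this".
STOP_WORDS = frozenset({"the", "a", "an", "this", "and", "of", "is", "in", "to"})


def select_terms(document: str, stop_words=STOP_WORDS) -> list[str]:
    """Split a document into lowercase word tokens and drop non-descriptive words."""
    tokens = re.findall(r"[a-z0-9]+", document.lower())
    return [token for token in tokens if token not in stop_words]


terms = select_terms("This is the rendering of a scene in the renderer.")
```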
At step 140 the frequency of each of the selected terms in the selected document is determined. The number of occurrences of a selected term tj in the document being analyzed can easily be determined via text matching, and is defined by the function:
f(dik,tj)
The frequency of each of the terms appearing in the document is then averaged based on the number of occurrences of all of the terms in the document. For example, the averaging could be done using the following equation:
where dik is the document being analyzed for the set of terms tik = {t1,ik, . . . , tm,ik} with m being the number of terms in document dik. This equation simply divides the frequency or tally of a term being analyzed by the total number of terms being analyzed in document dik. By conducting this averaging, the eventual weight determined for each association between a term node and a concept node takes into account the number of occurrences of a term in the document and provides a potentially more relevant indicator of the relation between the term data object and the concept data object, because words or terms that appear often relative to the total number of terms will be given more weight. This preliminary averaging is used to try to prevent a single large document describing a concept from providing term weights that overshadow the weights provided by a number of smaller documents.
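Read from the description above, the averaging can be written as follows. This is a reconstruction under assumptions: the name tf is introduced here for the averaged frequency, and the denominator is taken to be the total number of occurrences of all analyzed terms in the document.

```latex
\mathrm{tf}(d_{ik}, t_j) = \frac{f(d_{ik}, t_j)}{\sum_{l=1}^{m} f(d_{ik}, t_{l,ik})}
```

For example, under this reading a term that occurs 3 times in a document whose analyzed terms occur 20 times in total would receive an averaged frequency of 3/20 = 0.15.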
Next, at step 150, the method 100 checks to see if there are any more documents related to the concept that have not been analyzed. If there are more documents to be analyzed related to the concept, the method 100 returns to step 120, selects the next unanalyzed document and repeats steps 130, 140 and 150. As long as unanalyzed documents related to the concept remain, step 150 causes the method 100 to continue until all of the documents have been analyzed. When there are no more documents related to the concept to be analyzed, the method 100 continues on to step 160.
At step 160 the method 100 calculates a preliminary weight for each of the terms used in the documents related to a single concept. For each term an interim weight wij* is calculated taking into account the average term frequency of the documents related to the concept.
Wherein there are 1 . . . n documents.
This equation, in its entirety, is as follows:
This calculation is used to prevent concepts with a large number of documents from producing term weights that overshadow term weights from concepts with fewer documents describing the concept.
At step 170, the method 100 checks to see if there are any more concepts left to be evaluated. If there are concepts remaining that have not been analyzed, the method 100 returns to step 110 and the next concept is selected to be analyzed. The method 100 then repeats steps 120, 130, 140, 150 and 160, determining a preliminary weight for each of the terms appearing in the documents describing the selected concept. The method 100 continues to analyze each concept, repeating steps 110, 120, 130, 140, 150, 160 and 170 until all of the concepts have been analyzed, at which point the method 100 continues on to step 180.
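One plausible reading of the interim weight of step 160 is the mean, over the n documents describing concept i, of the averaged term frequency of term tj. The form below is an assumption consistent with the stated purpose of the calculation, with tf denoting the per-document averaged frequency (a name introduced here):

```latex
w_{ij}^{*} = \frac{1}{n} \sum_{k=1}^{n} \mathrm{tf}(d_{ik}, t_j)
```

Dividing by n is what keeps a concept described by many documents from accumulating larger weights than a concept described by only a few.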
At step 180 the method 100 determines a normalized weight for each of the terms associated with the concepts. The preliminary weight wij* previously determined for each association between a term ti and a concept is divided by the sum of all of the weights determined for the term ti connected to r concepts. This equation is shown as follows:
Wherein the index f(k) is given by f(x), x=1 . . . r, representing the r concepts to which term i is connected in the concept knowledge base.
The normalization of the weights is used to prevent common terms that are included in many of the documents for many concepts from having higher weight values than other less common terms. Without this normalization step, such common terms would have a very high weight, even though they are of little value in describing a concept. With this normalization step, the weights of these common terms are significantly reduced.
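Under the description of step 180, the normalization might be written as follows. This is a sketch under assumptions: it follows the step 160 convention in which the first index of the weight identifies the concept and the second the term, and f(x) enumerates the r concepts connected to that term.

```latex
w_{ij} = \frac{w_{ij}^{*}}{\sum_{x=1}^{r} w_{f(x)\,j}^{*}}
```

With this form, the normalized weights of any one term sum to 1 across the concepts it is connected to, so a term spread over many concepts receives a small weight for each of them.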
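Steps 110 through 180 can be sketched end to end in Python as shown below. This is an illustration under assumptions rather than the claimed implementation: the per-document frequency is taken as count divided by total occurrences, the preliminary weight as the mean over a concept's documents, and the function name, input layout and sample corpus are all hypothetical.

```python
from collections import Counter, defaultdict


def build_weights(corpus):
    """corpus maps a concept to a list of documents, each a list of selected terms.

    Returns a dict mapping (concept, term) to a normalized weight.
    """
    preliminary = {}
    for concept, documents in corpus.items():           # steps 110 and 170
        totals = defaultdict(float)
        for terms in documents:                         # steps 120 and 150
            if not terms:
                continue                                # nothing to count in an empty document
            counts = Counter(terms)                     # step 140: occurrences per term
            n_occurrences = sum(counts.values())
            for term, count in counts.items():
                totals[term] += count / n_occurrences   # averaged frequency (assumed form)
        for term, total in totals.items():              # step 160: mean over documents
            preliminary[(concept, term)] = total / len(documents)

    term_sums = defaultdict(float)                      # step 180: normalize per term
    for (_, term), weight in preliminary.items():
        term_sums[term] += weight
    return {key: weight / term_sums[key[1]] for key, weight in preliminary.items()}


# Made-up corpus for illustration only.
corpus = {
    "computer graphics": [["rendering", "shading", "rendering", "data"]],
    "artificial intelligence": [
        ["learning", "data"],
        ["learning", "learning", "search", "data"],
    ],
}
weights = build_weights(corpus)
```

In this toy corpus the term "data" appears under both concepts, so its weight is split between them, while a term that appears under a single concept keeps the full weight for that concept.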
Additionally, rather than using the terms exactly as they appear in the documents or descriptions, in a further aspect of the invention, the stems or roots of the terms are used to construct the knowledge base, allowing terms to be matched based on their stems or roots rather than on exact text matches.
Additionally, in some circumstances it may not be necessary to analyze every term in a document. In a further aspect, the method 100 will focus on only specific terms in a document that are highlighted in a particular way, e.g. in an abstract. Alternatively, there could be a list of terms that are not analyzed, such as common terms that are not descriptive of a concept; for example, terms such as “the”, “and”, etc. may be excluded from being selected.
At the conclusion of the method 100 a concept knowledge base as illustrated in
The search query will comprise one or more search terms. The software system 300 can be implemented on a data processing system, such as the data processing system 1 as shown in
The search query is entered into the system at the current search query module 320. From the search query module 320 the search query is passed to the query space generation module 330, which accesses the concept knowledge database 310, to generate a query space of terms a user may wish to add to his or her search query. Typically, the concept knowledge database 310 contains a concept knowledge base data structure as shown in
From the query space generation module 330 the generated query space is passed to the query visualization module 340 where a visual representation of the query space is generated. The visual representation of the query space is then passed to the user interface module 370.
Additionally, the current search query module 320 also passes the search query to a search engine preview module 350 that has a search engine API 360 conduct a preview of a web search using the search query and passes the results of the preview of the web search to the user interface module 370.
The user interface module 370 displays the visual representation of the query space to a user along with the results of a preview search. The user can perform a number of operations using the user interface module 370, such as: submitting a new search query; modifying the search query by adding or removing terms; removing a concept; expanding or collapsing a concept; and sending the search query to the search engine.
The software system 300 begins with an initial search query being input to the current search query module 320 which passes the search query to the query space generation module 330. The query space generation module 330 accesses a concept knowledge database 310 and uses the information in the concept knowledge database 310 to generate a query space from the search query.
Method 400 comprises the steps of: matching terms in the search query to term data objects in the concept knowledge base to obtain a first term set 410; obtaining a concept set of concept data objects associated with the first term set 420; obtaining a second term set of term data objects associated with the concept data objects in the concept set 430; and obtaining an edge set 450.
The method 400 begins with step 410, in which the terms in the search query are matched to term data objects in the concept knowledge database 310. The concept knowledge database 310 is accessed and each of the terms making up the search query is matched with any term data objects that have a term in the term field matching the term in the search query. After step 410 is completed, all of the term data objects in the concept knowledge database 310 that have a term in the term field corresponding to one of the terms in the search query have been identified and added to a first term set.
At step 420, the first term set is used to obtain a concept set containing concept data objects from the concept knowledge database 310 associated with one or more term data objects in the first term set. The term data objects making up the first term set are used to obtain a number of concept data objects from the concept knowledge database 310. Concept data objects associated with one or more term data objects in the first term set are selected to form the concept set.
Concept data objects that are not strongly associated with term data objects in the first term set are excluded from the concept set using a first weight threshold and a term ratio threshold. The first weight threshold is used to exclude concept data objects that are not strongly associated with one of the term data objects in the first term set: the weight assigned to an association between a concept data object and a term data object is compared to the first weight threshold, and the concept data object is excluded from the concept set if the weight is less than the first weight threshold. By using this first weight threshold, the concept set is limited to only the more relevant concepts. Additionally, a term ratio threshold is used to further exclude concept data objects from the concept set. If a concept data object is associated with one of the term data objects in the first term set with a weight greater than the first weight threshold, the concept data object is evaluated to determine the ratio of the term data objects in the first term set with which it is associated by a weight greater than the first weight threshold to all of the term data objects in the first term set. If this ratio is less than the term ratio threshold, the concept data object is excluded from the concept set.
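Steps 410 and 420, including the two thresholds, might be sketched as follows. The function name, the edge dictionary layout and the sample weights are assumptions for illustration. Associations whose weight exactly equals a threshold are kept here, following the statement that exclusion applies when the value is less than the threshold.

```python
def select_concepts(query_terms, edges, first_weight_threshold=0.05,
                    term_ratio_threshold=0.51):
    """edges maps (concept, term) to a weight; returns (first term set, concept set)."""
    known_terms = {term for (_, term) in edges}
    first_terms = set(query_terms) & known_terms          # step 410
    if not first_terms:
        return first_terms, set()

    strong = {}  # concept -> query terms it is strongly associated with
    for (concept, term), weight in edges.items():
        if term in first_terms and weight >= first_weight_threshold:
            strong.setdefault(concept, set()).add(term)

    concepts = {                                          # step 420
        concept for concept, matched in strong.items()
        if len(matched) / len(first_terms) >= term_ratio_threshold
    }
    return first_terms, concepts


# Made-up weights for illustration only.
edges = {
    ("computer graphics", "rendering"): 0.98,
    ("computer graphics", "pipeline"): 0.50,
    ("computer graphics", "shading"): 1.00,
    ("distributed computing", "pipeline"): 0.50,
    ("distributed computing", "cluster"): 1.00,
    ("artificial intelligence", "rendering"): 0.02,
}
first_terms, concepts = select_concepts(["rendering", "pipeline", "foo"], edges)
```

In this example "distributed computing" matches only one of the two recognized query terms (a ratio of 0.5, below 0.51) and "artificial intelligence" is linked to "rendering" by a weight below 0.05, so only "computer graphics" remains in the concept set.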
At step 430 a second term set is obtained. Each of the concept data objects in the concept set is evaluated to determine the term data objects, in the concept knowledge database 310, associated with that concept data object. Term data objects associated with the concept data objects selected for the concept set are added to the second term set. A second weight threshold is used to exclude term data objects from the second term set if they are associated with concept data objects in the concept set by a weight that is less than the second weight threshold.
At step 450, an edge set containing edge data objects from the concept knowledge database 310 is obtained. The edge data objects defining the associations between the term data objects in the first term set and the concept data objects in the concept set, along with the edge data objects defining the associations between the concept data objects in the concept set and the term data objects in the second term set, are placed in the edge set.
At this point, the method 400 ends and there is: a first term set containing term data objects that correspond to terms in the search query; a concept set containing concept data objects associated with term data objects in the first term set, that represent concepts the terms in the search query could be describing; a second term set containing term data objects associated with one or more concept data objects in the concept set, that indicate further terms that may be used to describe the concepts the user may be trying to look for; and an edge set defining the associations between the term data objects and concept data objects in the different sets.
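Steps 430 and 450 can be sketched in the same style. The function name and data layout are assumptions, and so is the choice to leave terms already in the first term set out of the second term set; the description does not state this explicitly.

```python
def expand_query_space(first_terms, concepts, edges, second_weight_threshold=0.10):
    """edges maps (concept, term) to a weight; returns (second term set, edge set)."""
    second_terms = {                                       # step 430
        term for (concept, term), weight in edges.items()
        if concept in concepts and weight >= second_weight_threshold
    } - set(first_terms)  # assumption: query terms are not repeated as suggestions

    edge_set = {                                           # step 450
        (concept, term): weight
        for (concept, term), weight in edges.items()
        if concept in concepts and (term in first_terms or term in second_terms)
    }
    return second_terms, edge_set


# Made-up weights for illustration only.
sample_edges = {
    ("computer graphics", "rendering"): 0.98,
    ("computer graphics", "pipeline"): 0.50,
    ("computer graphics", "shading"): 1.00,
    ("computer graphics", "misc"): 0.04,
    ("distributed computing", "cluster"): 1.00,
}
second_terms, edge_set = expand_query_space(
    {"rendering", "pipeline"}, {"computer graphics"}, sample_edges
)
```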
Through experiments, the first weight threshold, term ratio threshold and second weight threshold can be determined. For example, some initial studies found that a first weight threshold of 0.05, a term ratio threshold of 0.51 and a second weight threshold of 0.10 provided satisfactory results.
Referring again to
Referring again to
The concept data objects contained in the concept set are used to create the concept nodes 550. Each concept data object in the concept set is used to create a concept node 550 in the visual representation 500 and the concept in the concept field of the concept data object is inserted as text on the concept node 550.
The term data objects contained in the first term set are used to create the selected term nodes 560. Each term data object in the first term set is used to create a single selected term node 560 in the visual representation 500 and the term in the term field of the term data object is inserted as text on the selected term node 560.
The term data objects contained in the second term set are used to create the unselected term nodes 570 in the visual representation 500. An unselected term node 570 is created on the visual representation 500 for each term data object contained in the second term set with the term in the term field of each term data object used as text on the unselected term node.
The edge data objects in the edge set define the associations between the term data objects in the first and second term sets and the concept data objects in the concept set. Each edge data object in the edge set is used to draw the connecting lines 580 between associated concept nodes 550 and selected term nodes 560 and unselected term nodes 570. The distance between a concept node 550 and an associated selected term node 560 or associated unselected term node 570 joined by a connecting line 580 is a function of the weight of the association indicated in the edge data object. For example, if a weight of an association between a first unselected term node 570A and a concept node 550A is less than the weight of an association between the concept node 550A and a second unselected term node 570B, the first unselected term node 570A is positioned in the visual representation 500 further away from the concept node 550A than the second unselected term node 570B.
The concept nodes 550 are rendered in the visual representation 500 so that the concept nodes 550 can be visually distinguished from the selected term nodes 560 and the unselected term nodes 570. Typically, colors are used to make the concept nodes visually distinctive, e.g. the concept nodes 550 being rendered with a red background.
The selected term nodes 560 and unselected term nodes 570 are also rendered in the visual representation 500 to be visibly distinguishable from each other. Typically, this is also done by rendering the selected term nodes 560 and unselected term nodes 570 with different background colors from each other. For example, the selected term nodes 560 might be rendered with a yellow background or some other bright color and the unselected term nodes 570 can be rendered in some neutral color, such as grey.
The visual representation 500 allows users to properly interpret the underlying features of the query space. Users are able to visually distinguish between concept nodes 550, selected term nodes 560 and unselected term nodes 570, along with the relationships between these nodes. Terms the user used in their original search query are shown in the visual representation as selected term nodes 560, allowing a user to easily distinguish between terms in the visual representation 500 that the user used in his or her search query and new terms that were generated and that the user may wish to add to their search query. Additionally, this allows a user to identify whether the terms they have used in their search query are actually appropriate for their information needs. If the concepts shown in the concept nodes 550 are unrelated to the information the user is seeking, the search query may not be a proper search query and the user can try a completely new search query. The visual representation 500 can allow a user to determine if the search query they have used has very general terms (i.e. connected to numerous concept nodes) or very specific terms (i.e. connected to very few concept nodes).
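The relationship between weight and distance can be sketched as below. The description only requires that a lower weight place a term node further from its concept node; the linear mapping, the pixel bounds and the circular placement used here are assumptions chosen for illustration.

```python
import math


def edge_length(weight, min_len=40.0, max_len=200.0):
    """Map a weight in [0, 1] to a line length; lower weights give longer lines."""
    clamped = min(max(weight, 0.0), 1.0)
    return max_len - clamped * (max_len - min_len)


def place_terms(center, weighted_terms):
    """Place term nodes around a concept node at evenly spaced angles (hypothetical layout)."""
    positions = {}
    count = len(weighted_terms)
    for index, (term, weight) in enumerate(sorted(weighted_terms.items())):
        angle = 2 * math.pi * index / count
        radius = edge_length(weight)
        positions[term] = (center[0] + radius * math.cos(angle),
                           center[1] + radius * math.sin(angle))
    return positions


positions = place_terms((0.0, 0.0), {"shading": 0.8, "pipeline": 0.2})
```

Here "pipeline", with the lower weight, ends up further from the concept node than "shading".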
Referring again to
For example, both Google™ and Yahoo! offer API services that allow the system to request a search preview.
The results of the search preview are passed from the search engine preview module 350 to the user interface module 370.
A user interface module 370 is provided. If the user is using the data processing system 1 as shown in
The user interface module 370 displays to a user a visual representation created by the query visualization module 340 using the query space generated by the query space generation module 330, along with a search preview obtained by the search engine preview module 350.
The user interface 600 allows a user to: submit a new search query; modify the search query; remove a concept; expand or collapse a concept; and send the query to the search engine.
When a user sees the visual representation 610 and the search engine preview 620, if the results are much different than what the user wanted, the user can conduct a completely new search by entering a new search query in the first text field 630 and selecting a search button 635.
Referring again to
A user can also add terms to the search query by selecting unselected terms on the visual representation 610. To add a term a user selects an unselected term node in the visual representation 610 and the term in the term node is added to the terms of the search query.
Referring to
Additionally, a user can remove a term from the search query by selecting a selected term node in the visual representation 610. Referring to
Upon seeing the visual representation 610 a user may identify concept nodes illustrated in the visual representation that display concepts the user believes are not relevant to the information the user is trying to obtain in the search. To remove one of these concept nodes from the visual representation, a user selects the concept node in the visual representation 610.
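The term and concept interactions described above could be tracked with a small state object such as the one below. The class and method names are hypothetical, and how a removed concept affects the regenerated query space is not modelled; the sketch only records the user's choices.

```python
from dataclasses import dataclass, field


@dataclass
class QuerySession:
    query_terms: list = field(default_factory=list)
    removed_concepts: set = field(default_factory=set)

    def select_term_node(self, term: str) -> str:
        """Selecting an unselected term node adds its term; selecting a selected one removes it."""
        if term in self.query_terms:
            self.query_terms.remove(term)
        else:
            self.query_terms.append(term)
        return " ".join(self.query_terms)

    def select_concept_node_for_removal(self, concept: str) -> None:
        """Record that the user considers this concept irrelevant."""
        self.removed_concepts.add(concept)


session = QuerySession(query_terms=["rendering"])
after_add = session.select_term_node("shading")       # unselected node: term is added
after_remove = session.select_term_node("rendering")  # selected node: term is removed
session.select_concept_node_for_removal("artificial intelligence")
```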
Referring again to
A user can choose between an expanded and a compacted visual representation of a concept by selecting the node to be expanded or compacted. The user selects a concept node 550A on the visual representation 610 that the user either wishes to expand (if the concept node is compacted) or compact (if the concept node is currently expanded).
Referring to
Finally, the user interface 370 allows a user to send the search query to a search engine to conduct a regular web search using the search query. A user selects the search button 645 and, referring to
The foregoing is considered as illustrative only of the principles of the invention. Further, since numerous changes and modifications will readily occur to those skilled in the art, it is not desired to limit the invention to the exact construction and operation shown and described, and accordingly, all such suitable changes or modifications in structure or operation which may be resorted to are intended to fall within the scope of the claimed invention.