The novel features believed characteristic of the invention are set forth in the appended claims. The invention itself, however, as well as a preferred mode of use, further objectives and advantages thereof, will best be understood by reference to the following detailed description of an illustrative embodiment when read in conjunction with the accompanying drawings, wherein:
With reference now to the figures and in particular with reference to
With reference now to the figures,
In the depicted example, server 104 and server 106 connect to network 102 along with storage unit 108. In addition, clients 110, 112, and 114 connect to network 102. These clients 110, 112, and 114 may be, for example, personal computers or network computers. In the depicted example, server 104 provides data, such as boot files, operating system images, and applications to clients 110, 112, and 114. Clients 110, 112, and 114 are clients to server 104 in this example. Network data processing system 100 may include additional servers, clients, and other devices not shown.
In the depicted example, network data processing system 100 is the Internet with network 102 representing a worldwide collection of networks and gateways that use the Transmission Control Protocol/Internet Protocol (TCP/IP) suite of protocols to communicate with one another. At the heart of the Internet is a backbone of high-speed data communication lines between major nodes or host computers, consisting of thousands of commercial, governmental, educational and other computer systems that route data and messages. Of course, network data processing system 100 also may be implemented as a number of different types of networks, such as for example, an intranet, a local area network (LAN), or a wide area network (WAN).
With reference now to
In the depicted example, data processing system 200 employs a hub architecture including a north bridge and memory controller hub (MCH) 202 and a south bridge and input/output (I/O) controller hub (ICH) 204. Processor 206, main memory 208, and graphics processor 210 are coupled to north bridge and memory controller hub 202. Graphics processor 210 may be coupled to the MCH through an accelerated graphics port (AGP), for example.
In the depicted example, local area network (LAN) adapter 212 is coupled to south bridge and I/O controller hub 204. Audio adapter 216, keyboard and mouse adapter 220, modem 222, read only memory (ROM) 224, universal serial bus (USB) ports and other communications ports 232, and PCI/PCIe devices 234 are coupled to south bridge and I/O controller hub 204 through bus 238, and hard disk drive (HDD) 226 and CD-ROM drive 230 are coupled to south bridge and I/O controller hub 204 through bus 240. PCI/PCIe devices may include, for example, Ethernet adapters, add-in cards, and PC cards for notebook computers. PCI uses a card bus controller, while PCIe does not. ROM 224 may be, for example, a flash basic input/output system (BIOS). Hard disk drive 226 and CD-ROM drive 230 may use, for example, an integrated drive electronics (IDE) or serial advanced technology attachment (SATA) interface. A super I/O (SIO) device 236 may be coupled to south bridge and I/O controller hub 204.
An operating system runs on processor 206 and coordinates and provides control of various components within data processing system 200 in
Instructions for the operating system, the object-oriented programming system, and applications or programs are located on storage devices, such as hard disk drive 226, and may be loaded into main memory 208 for execution by processor 206. The processes of the illustrative embodiments may be performed by processor 206 using computer implemented instructions, which may be located in a memory such as, for example, main memory 208, read only memory 224, or in one or more peripheral devices.
The hardware in
In some illustrative examples, data processing system 200 may be a personal digital assistant (PDA), which is generally configured with flash memory to provide non-volatile memory for storing operating system files and/or user-generated data. A bus system may comprise one or more buses, such as a system bus, an I/O bus, and a PCI bus. Of course, the bus system may be implemented using any type of communications fabric or architecture that provides for a transfer of data between different components or devices attached to the fabric or architecture. A communications unit may include one or more devices used to transmit and receive data, such as a modem or a network adapter. A memory may be, for example, main memory 208 or a cache such as found in north bridge and memory controller hub 202. A processing unit may include one or more processors or CPUs. The depicted examples in
As previously mentioned, there are several known traditional search algorithms in the existing art which return, based on search terms entered by a user, a list of documents which contain one or more references to the search terms in the user's query. One of these traditional search algorithms is the “bag of words” model, which classifies documents based on a raw statistical analysis of the number of search terms in the page. While these traditional search algorithms may return a list of matching documents which contain one or more of the search terms in the query, these traditional algorithms do not necessarily allow for locating a document that is actually relevant to the search, for they do not take into consideration the meaning of the words or the relationships between them. The illustrative embodiments address this issue by providing a relevancy algorithm for determining how relevant a matching document is to the terms in the search query. A list of matching documents (i.e., documents containing one or more of the search terms) may be obtained using any of the traditional search algorithms in the art. Once the list of documents that contain a match to one or more search terms in the query is obtained, the relevancy algorithm described in the illustrative embodiments may be used to determine the relevancy of the matching documents to the search terms.
Prior to receiving a search query, a repository of documents is indexed for search. During the indexing, one or more semantic networks are generated for each document in the repository. Any known method of generating semantic networks may be used to implement the illustrative embodiments. A semantic network is a diagram that represents concepts that are specified in the document, as well as the relationships between the concepts. A concept may be an idea or thought that has meaning. The semantic network comprises nodes which represent the concepts, and edges which represent the semantic relations between the concepts. The generated semantic networks may be stored with the index in the repository.
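For illustration only, a semantic network of the kind described above may be sketched as a set of concept nodes plus a list of labeled edges. The class and method names (`Edge`, `SemanticNetwork`, `add_relation`, `incident_edges`) and the sample concepts are assumptions made for this sketch and are not prescribed by the illustrative embodiments, which permit any known method of generating semantic networks.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Edge:
    """A labeled semantic relation between two concepts (hypothetical type)."""
    source: str
    relation: str
    target: str


@dataclass
class SemanticNetwork:
    """Concept nodes plus the relation edges that connect them."""
    nodes: set = field(default_factory=set)
    edges: list = field(default_factory=list)

    def add_relation(self, source, relation, target):
        # Adding an edge implicitly registers both endpoint concepts as nodes.
        self.nodes.update((source, target))
        self.edges.append(Edge(source, relation, target))

    def incident_edges(self, concept):
        # Every edge that touches the given concept, in either direction.
        return [e for e in self.edges if concept in (e.source, e.target)]


# Sample concepts chosen for illustration only.
network = SemanticNetwork()
network.add_relation("mouse", "is a", "rodent")
network.add_relation("rodent", "is a", "mammal")
network.add_relation("mammal", "possesses", "hair")
```

In this sketch, `network.incident_edges("rodent")` returns the two edges that touch the "rodent" node.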
The relevancy algorithm for scoring each matching document may include a search of all of the semantic networks in the repository to locate those networks which have one or more terms which match the terms in the search query. When a search query is received from a user, the relevancy algorithm first searches the semantic networks for documents containing terms which match the terms in the search query. This search for matching networks may also be performed using traditional algorithms, such as “bag of words” matching and enumeration of referring documents. Regardless of the manner of obtaining a list of documents which contain terms matching the search query, the relevancy algorithm is then used to rank those matching documents according to each document's relevancy to the search terms. The relevancy algorithm ranks the matching networks for the documents in the list by first determining which of the semantic networks have a higher edge density around the nodes which correspond to the search terms. The edge density for a node is simply the number of edges (i.e., relationship connections) incident to the relevant node (i.e., concept). The relevancy algorithm scores each matching semantic network based on the total number of edges incident to the matching nodes in the network multiplied by the total number of matching terms in the network. If a document contains multiple matching semantic networks, the scores for each of the matching semantic networks are added together. Semantic networks having a higher edge density score are ranked as being a better match to the search query. Thus, documents that have a significant amount of context around the term(s) of interest are more likely to be relevant to the query.
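As a minimal sketch of the per-network scoring described above, the following assumes that a network is given as a list of undirected (concept, concept) pairs and that the score is the sum of the edge densities of the matching terms multiplied by the number of matching terms. The function names `edge_density` and `network_score` and the sample data are hypothetical.

```python
def edge_density(edges, concept):
    """Number of edges incident to the node for `concept`."""
    return sum(1 for a, b in edges if concept in (a, b))


def network_score(edges, query_terms):
    """(Sum of edge densities of matching terms) * (number of matching terms)."""
    nodes = {node for edge in edges for node in edge}
    # A query term matches when a node for it exists in the network.
    matching = [term for term in set(query_terms) if term in nodes]
    total_edges = sum(edge_density(edges, term) for term in matching)
    return total_edges * len(matching)


# Illustrative network: "cat" has three incident edges, "milk" has two.
edges = [("cat", "animal"), ("cat", "fur"), ("cat", "milk"), ("milk", "drink")]
print(network_score(edges, ["cat", "milk"]))  # (3 + 2) * 2 = 10
```

Note that an edge joining two query terms is counted once for each term under this reading, since the per-term edge counts are simply added together.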
The relevancy algorithm described in the illustrative embodiments provides an improvement over traditional search algorithms which determine the relevancy of a document only by the quantity of the search terms in the document and/or number of referring documents. The relevancy algorithm technique also overcomes the storage problems typically associated with semantic networks. A disadvantage of using semantic networks is the exorbitant storage requirements for storing an entire semantic network, as opposed to traditional search algorithms such as the “bag of words” model which only require one to store a list of keywords, as well as possibly storing the number of occurrences of each keyword. However, the relevancy algorithm technique in the illustrative embodiments mitigates the semantic network storage requirement by only storing the list of keywords and the number of edges incident to each keyword. For instance, when the documents are indexed as described above, the list of keywords along with the number of incident edges for each keyword are stored, rather than the entirety of the semantic network. Thus, the amount of additional storage required to implement the relevancy algorithm technique is only negligibly greater (if at all) than the storage requirements of traditional search algorithms.
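The reduced storage scheme may be sketched as follows, assuming that a network is available at indexing time as a list of (concept, concept) pairs and that only a mapping from keyword to incident-edge count is retained afterwards. The names `build_keyword_index` and `score_from_index` are assumptions of this sketch.

```python
from collections import Counter


def build_keyword_index(edges):
    """Collapse a network to {keyword: number of incident edges}."""
    counts = Counter()
    for a, b in edges:
        # Using a set counts a self-referencing edge only once.
        for node in {a, b}:
            counts[node] += 1
    return dict(counts)


def score_from_index(index, query_terms):
    """Score a network from the stored counts alone, without the full network."""
    matching = [term for term in set(query_terms) if term in index]
    return sum(index[term] for term in matching) * len(matching)


# The full edge list is needed only while indexing; it may be discarded afterwards.
index = build_keyword_index(
    [("cat", "animal"), ("cat", "fur"), ("cat", "milk"), ("milk", "drink")]
)
print(score_from_index(index, ["cat", "milk"]))  # (3 + 2) * 2 = 10
```

A consequence of this design choice is that a concept with no incident edges does not appear in the index at all, which does not affect the score because such a concept would contribute zero edges.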
Turning next to
In this particular Web-based search example, browser 308 is an application executing on client 300. Web page 310 is currently displayed in browser 308. When the user enters search criteria into Web page 310, the search criteria are sent in search request 302, which is received by server process 312 in server 304.
Server process 312 processes search request 302 and sends the search terms to search engine 316, which performs a search using repository 318 to identify sources of information related to the search terms. Repository 318 contains an index used to search documents stored within. This index also contains mappings to different Web pages or other types of content that may be searched based on the search terms. These mappings may be static or may change over time. Search engine 316 may be implemented using various well-known search engines. Some search engines which may be used include, for example, AltaVista, Google, and HotBot. Depending on the particular implementation, search engine 316 may be located on a different data processing system than server process 312.
Search engine 316 generates semantic networks for repository 318. A document or Web page may contain one or more semantic networks. The semantic networks may be stored with the index in repository 318. In one example, all of the terms in the semantic networks may be stored within a symbol table to allow the search engine to easily locate the nodes corresponding to the search terms.
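One possible form of such a symbol table is sketched below. It assumes that each semantic network is identified by a (document identifier, network identifier) pair and is given as a list of (concept, concept) edges. These identifiers and the names `build_symbol_table` and `matching_networks` are hypothetical.

```python
from collections import defaultdict


def build_symbol_table(networks):
    """Map each term to the (doc_id, network_id) keys of networks containing it.

    `networks` maps (doc_id, network_id) -> list of (concept, concept) edges.
    """
    table = defaultdict(set)
    for key, edges in networks.items():
        for a, b in edges:
            table[a].add(key)
            table[b].add(key)
    return table


def matching_networks(table, query_terms):
    """Keys of every network containing at least one of the query terms."""
    found = set()
    for term in query_terms:
        # .get avoids creating empty entries for unknown terms.
        found |= table.get(term, set())
    return found


networks = {
    ("doc-1", 0): [("cat", "fur"), ("cat", "milk")],
    ("doc-1", 1): [("dog", "bone")],
    ("doc-2", 0): [("milk", "cow")],
}
table = build_symbol_table(networks)
print(sorted(matching_networks(table, ["milk"])))
```

With the sample data above, a lookup for "milk" locates network 0 of "doc-1" and network 0 of "doc-2" without scanning the edges of any other network.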
The results of the search query are sent to server process 312 for return to client 300 in result 306. Result 306 may be, for example, a particular Web page containing the information related to the search terms or a Web page containing links to Web pages satisfying the search criteria.
As shown, semantic network 400 in
In contrast, with the relevancy algorithm, the semantic networks of the two documents are further analyzed to identify which documents are more relevant to the content of the search query. The search engine may rank the relevancy of the documents based on the number of edges around the concepts (i.e., terms) in the search query. For example, semantic network 400 in
A relation may also be negative, such that the meaning of the relation is inverted. For example, the negative relation illustrated by dotted line 420 indicates that the text of the document specifies that a hippopotamus does not possess hair. “Is a” 422, or “is a”, is commonly used in semantic networks to define hierarchies. For example, if nodes “rodent”, “mouse”, “animal”, and “mammal” are in a semantic network, “is a” may be used to specify the hierarchy between the nodes, such as “a mouse is a rodent is a mammal is an animal”. From the specified hierarchy, it may be understood that all the properties of a mammal apply to a mouse (i.e., possesses hair, gives birth to live young, etc.). In this particular case in
The relevancy algorithm analyzes semantic network 400 to determine how many edges there are around the concepts specified in the search query. With the search query, “Can a hippopotamus swim?”, semantic network 400 is shown to contain an edge density of four edges 424, 426, 428, and 430 around the concept of hippopotamus 418, and an edge density of two edges 432 and 434 around the concept of swimming 436. Once the number of edges for each concept specified in the search query is known, the relevancy algorithm obtains a total relevancy score for the semantic network by adding the number of edges together to obtain a total number of edges, and then multiplying the total number of edges by the number of matching terms in the network. In this example, the total relevancy score for semantic network 402 is twelve (i.e., 6 total edges * 2 terms = 12). Thus, the more edges (connections) a term has to other nodes in the network, the more relevant the document is likely to be to the user's search query.
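The arithmetic of this example may be written out as follows. The notation is introduced here for illustration only and does not appear elsewhere in the description: d(t) denotes the edge density of the node for term t, m denotes the number of matching terms, and S denotes the total relevancy score.

```latex
\begin{aligned}
d(\text{hippopotamus}) &= 4, \qquad d(\text{swimming}) = 2, \qquad m = 2,\\
S &= \bigl(d(\text{hippopotamus}) + d(\text{swimming})\bigr) \times m
   = (4 + 2) \times 2 = 12.
\end{aligned}
```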
Semantic network 450 in
Although semantic network 450 is more complex than semantic network 402 in
It should be noted that in the examples above, the search query, “Can a hippopotamus swim?”, is actually answered in semantic network 402 of the first matching document. In response to such a question, a deductive reasoning algorithm may be used to provide an actual “yes” or “no” answer. However, the deductive reasoning on a semantic network required by such an algorithm is much more computationally intensive than the relevancy algorithm in the illustrative embodiments. Additionally, the relevancy algorithm may still be useful with more generic search strings. For example, instead of a search comprising a question such as “Can a hippopotamus swim?”, a generic search query may merely comprise the terms, “hippopotamus swim”. In this generic search string situation, the relevancy algorithm would be able to determine the relevancy of a document to the search terms provided, while the deductive reasoning algorithm would have nothing to deduce.
A determination is then made as to whether any of the documents in the list contains multiple semantic networks (step 508). If a document does not contain more than one semantic network (‘no’ output of step 508), the process skips to step 512. If a document contains more than one semantic network (‘yes’ output of step 508), the scores for each of the semantic networks are added together to form the relevancy score for the document (step 510). The semantic networks having a higher edge density are ranked as better matches to the search query (step 512). The list of documents corresponding to the ranked semantic networks is then provided to the user in such a manner as to indicate the relevancy ranking (step 514), with the process terminating thereafter.
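Steps 508 through 514 may be sketched as follows, assuming that the per-network scores have already been computed and are supplied as a mapping from a document identifier to a list of scores. The function name `rank_documents`, the document identifiers, and the sample scores are hypothetical.

```python
def rank_documents(network_scores):
    """Return (doc_id, score) pairs ordered from most to least relevant.

    `network_scores` maps doc_id -> list of per-network relevancy scores.
    """
    # Steps 508/510: a document with several networks gets the sum of their
    # scores; a document with one network keeps that single score unchanged.
    doc_scores = {doc: sum(scores) for doc, scores in network_scores.items()}
    # Step 512: higher edge density scores rank as better matches.
    return sorted(doc_scores.items(), key=lambda item: item[1], reverse=True)


# Step 514: present the documents in ranked order.
ranked = rank_documents({"doc-a": [12], "doc-b": [4, 2], "doc-c": [8, 6]})
for position, (doc, score) in enumerate(ranked, start=1):
    print(position, doc, score)
```

Because Python's sort is stable, documents with equal scores keep the order in which they appeared in the input; the description does not specify a tie-breaking rule, so this behavior is incidental to the sketch.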
The invention can take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment containing both hardware and software elements. In a preferred embodiment, the invention is implemented in software, which includes but is not limited to firmware, resident software, microcode, etc.
Furthermore, the invention can take the form of a computer program product accessible from a computer-usable or computer-readable medium providing program code for use by or in connection with a computer or any instruction execution system. For the purposes of this description, a computer-usable or computer readable medium can be any tangible apparatus that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
The medium can be an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system (or apparatus or device) or a propagation medium. Examples of a computer-readable medium include a semiconductor or solid state memory, magnetic tape, a removable computer diskette, a random access memory (RAM), a read-only memory (ROM), a rigid magnetic disk and an optical disk. Current examples of optical disks include compact disk-read only memory (CD-ROM), compact disk-read/write (CD-R/W) and DVD.
A data processing system suitable for storing and/or executing program code will include at least one processor coupled directly or indirectly to memory elements through a system bus. The memory elements can include local memory employed during actual execution of the program code, bulk storage, and cache memories which provide temporary storage of at least some program code in order to reduce the number of times code must be retrieved from bulk storage during execution.
Input/output or I/O devices (including but not limited to keyboards, displays, pointing devices, etc.) can be coupled to the system either directly or through intervening I/O controllers.
Network adapters may also be coupled to the system to enable the data processing system to become coupled to other data processing systems or remote printers or storage devices through intervening private or public networks. Modems, cable modems, and Ethernet cards are just a few of the currently available types of network adapters.
The description of the present invention has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the invention in the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art. The embodiment was chosen and described in order to best explain the principles of the invention, the practical application, and to enable others of ordinary skill in the art to understand the invention for various embodiments with various modifications as are suited to the particular use contemplated.