The present invention generally relates to methods and systems for harvesting domain knowledge from the Web. In particular, the present invention is directed to such systems and methods that allow automatic object hierarchy building/generation from the web.
Nowadays, the computer has become a necessary tool of modern life that helps people find information of interest, especially in the Internet era, in which a huge and growing amount of diversified information has accumulated on the Web. Although a computer is fast at information processing tasks such as computing, storing, or searching, its inability to understand information is the main obstacle to intelligent information processing. To deal with that problem, semantics-related research on intelligent information processing has recently become popular. For example, relevant technologies are described in T. Berners-Lee, J. Hendler, O. Lassila, “The Semantic Web”, Scientific American, May 2001, pp. 28-37; Nigel Shadbolt, Tim Berners-Lee and Wendy Hall, “The Semantic Web Revisited”, IEEE Intelligent Systems 21(3), pp. 96-101, May/June 2006; and E. Hyvonen (editor), “Semantic Web Kick-Off in Finland - Vision, Technologies, Research, and Applications”, HIIT Publications, 2002-001, Helsinki Institute for Information Technology (HIIT), Helsinki, Finland, 304 pp. These works concentrate on the formats and technologies that help computers understand information. Building on mathematical logics for knowledge representation, such as Description Logics or Frame Logics, from the traditional discipline of Artificial Intelligence (AI), and on popular web information processing technologies, standards organizations such as the World Wide Web Consortium (W3C) are actively specifying standards such as XML, RDF (Resource Description Framework) and OWL (Web Ontology Language), as well as rule languages (e.g., Web Rule Language, Rule Markup Language), which will serve as a foundation for advancing the adoption of semantic technologies.
Also, many developers, entrepreneurs, and practitioners have entered the stage of creating and deploying relevant tool sets, products, case studies, and even real working applications to make the vision of semantic based intelligent information utilization come true.
However, to employ the computer's powerful computing capability and the semantics-related standards for providing different intelligent information utilization services to Web users, the backend domain knowledge plays the key role (currently, ontology is the dominant way of representing knowledge on the Web). Thus, domain knowledge building becomes an important problem that must be solved.
Currently, there are mainly two kinds of domain knowledge: ontology and hierarchy.
An ontology is a document or file that formally defines the relations among terms, and the most typical kind of ontology for the Web has a taxonomy and a set of inference rules. The taxonomy defines classes of objects and the relations among them. For example, an address may be defined as a type of location, and city codes may be defined to apply only to locations, and so on. An ontology may express a rule like “If a city code is associated with a state code, and an address uses that city code, then that address has the associated state code.” A program could then readily deduce, for instance, that a Cornell University address, being in Ithaca, must be in New York State, which is in the U.S., and therefore should be formatted to U.S. standards.
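The city-code rule quoted above can be illustrated with a minimal sketch. The lookup tables, the codes in them and the function names `infer_state` and `infer_country` are hypothetical placeholders chosen for illustration; they are not real postal data and are not part of any ontology language.

```python
# Hypothetical facts (illustrative values only, not real postal codes):
# which state code each city code is associated with, which country each
# state belongs to, and which city code each address uses.
CITY_TO_STATE = {"ITH": "NY"}
STATE_TO_COUNTRY = {"NY": "US"}
ADDRESS_TO_CITY = {"Cornell University": "ITH"}


def infer_state(address):
    """Apply the rule: an address that uses a city code has that city's state code."""
    city = ADDRESS_TO_CITY.get(address)
    return CITY_TO_STATE.get(city)


def infer_country(address):
    """Chain the rule one step further, from state code to country."""
    return STATE_TO_COUNTRY.get(infer_state(address))
```

With these tables, `infer_state("Cornell University")` yields the state code attached to the city code, and an address with no known city code yields `None`.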
A hierarchy contains nodes, edges that connect the nodes, and sometimes instances attached to the nodes. Compared with an ontology, a hierarchy is a much simpler form. Many elements of an ontology, such as class, property, definition and relation, can be ignored in a hierarchy, although there are ways to infer those elements from a hierarchy. Thus, a hierarchy can be regarded as a kind of pseudo-ontology with an explicit but informal specification.
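As a rough illustration of how little structure a hierarchy needs, the following sketch models nodes, parent-to-child edges and optional attached instances. The class name `Node` and the helper `paths` are assumptions made for this example only.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One node of a hierarchy; edges are the parent-to-child links."""
    name: str
    children: list = field(default_factory=list)
    instances: list = field(default_factory=list)  # optional attached instances

    def add_child(self, child):
        """Attach a child node and return it so calls can be chained."""
        self.children.append(child)
        return child


def paths(node, prefix=()):
    """Yield every root-to-node path as a tuple of node names (pre-order)."""
    current = prefix + (node.name,)
    yield current
    for child in node.children:
        yield from paths(child, current)
```

A node here carries no class definition, property or typed relation, which is the simplification over an ontology that the paragraph above describes.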
There are mainly two kinds of ontology building (OB) methods in the prior art, i.e. ontology building based on raw material and ontology building based on existing ontologies. In the raw material-based method, the ontology can be built, for example, from texts, a dictionary, a knowledge base, semi-structured data or relation schemas. In the existing ontology-based method, several existing ontologies can be integrated into one by comparing the texts or contexts of concepts.
Although ontology is crucial for the Semantic Web and related services, it is difficult to build a formal ontology automatically, because an ontology usually contains many elements that are difficult to fill in even for a human, such as classes, class definitions, relations between classes, properties and so on. The complex format of ontology has thus hindered its large-scale construction and, in turn, widespread applications such as real-time Web services. Moreover, ontology integration is usually performed through human interaction, and thus it is not as easily implemented as hierarchy integration.
There are also a few prior-art approaches to hierarchy building (HB). For example, Japanese Patent JP2001-34635 (hereinafter referred to as reference document 1) claims a method of building a hierarchy from the Web. Concretely, one term (i.e., one node) is extracted from each web page, and a hierarchical relation is built based on the links between web pages. Instead of building relations among all pages, the method does so only among web pages of the same type. For example, a link between two product pages is kept, but a link between a product page and an advertisement page is ignored. In addition, N. Liu, C. C. Yang, “A link classification based approach to website topic hierarchy generation” (WWW2007) (hereinafter referred to as reference document 2) provides a method for extracting the hierarchical relations between web pages within a website based on inter-page link structure analysis. It then wraps each web page into a topic object and builds a topic hierarchy. The disclosures of the above-mentioned reference documents 1 and 2 are hereby incorporated by reference in their entirety for all purposes.
However, the existing HB methods (such as those described in reference documents 1 and 2) consider only the case in which an object/topic is represented by a whole page, and the relationships among objects/topics are acquired by inter-page hyperlink analysis. In practice, only some of the objects/topics (nodes of the hierarchy) can be represented by a whole page, while other objects are covered only by parts of a web page. Additionally, the hierarchical relations extracted from inter-page hyperlinks alone are not accurate enough, since the links between pages contain much noise besides hierarchical relations.
In view of the deficiencies of the HB methods in the prior arts, the present invention is made for automatically extracting hierarchy of the objects (e.g. products) from a website in a more accurate and efficient way.
The present invention proposes a coordinated method for automatic hierarchy extraction from websites by integrating inter-page analysis (i.e. analysis of the hierarchy of web pages) with intra-page analysis (i.e. analysis of the relationships among semantic blocks within a web page). The hierarchical relations implied within the semantic blocks inside pages are exploited to amend the inaccurate hierarchy that comes from the inter-page analysis alone.
More specifically, the coordinated hierarchy extraction method of the present invention mainly includes three phases: (1) inter-page hierarchy analysis; (2) intra-page hierarchy analysis; and (3) coordinated hierarchy generating.
During the inter-page hierarchy analysis, the hierarchy is generated based on semantic relation analysis of the whole page set of a website. On the one hand, the nested objects are distilled from the website, and each object (topic) is bound to its representative page. On the other hand, the hierarchical relations between web pages are identified with a hyperlink-based method or a hybrid method, which integrates the analysis of hyperlinks and contents. Thus, the object hierarchy can be extracted by integrating the object-page pairs and the hierarchical relations between web pages.
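The last step of this phase, integrating object-page pairs with the page-level relations, can be sketched as follows. This is only an illustration under stated assumptions: the page hierarchy is assumed to be available as a child-to-parent map, the object-page pairs as a page-to-object-name map, and the function name `to_object_hierarchy` and the sample URLs are hypothetical.

```python
def to_object_hierarchy(page_parent, page_to_object):
    """Translate a page-level child -> parent map into object names.

    page_parent:    {child_page: parent_page}, the hierarchy of web pages.
    page_to_object: {page: object_name}, the object-page pairs.
    Relations whose child or parent page has no bound object are skipped
    (an assumption of this sketch).
    """
    object_parent = {}
    for child_page, parent_page in page_parent.items():
        child = page_to_object.get(child_page)
        parent = page_to_object.get(parent_page)
        if child is not None and parent is not None:
            object_parent[child] = parent
    return object_parent
```

For instance, a page relation from `/products/cameras` to `/products` becomes the object relation from "Cameras" to "Products" once both pages are bound to objects.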
Then, in the intra-page hierarchy analysis, the hierarchy is generated based on semantic block analysis inside a web page. The semantic block analysis is conducted on each page that has bundles of hyperlinks pointing to the object representative pages. It yields nested semantic blocks, which contain these hyperlinks, together with the hierarchical relations between the semantic blocks. These nested semantic blocks are also wrapped as objects, and thus the hierarchy of the new object set can be extracted by integrating the object-page pairs, the object-block pairs and the hierarchical relations between semantic blocks.
Finally, a refined object hierarchy is generated by fusing the results of the inter-page analysis and the intra-page analysis. In an embodiment, the fusing operations can include using each result to calibrate unreasonable hierarchical relations in the other and to complement hierarchical relations missing from the other. Of course, those skilled in the art will readily appreciate that the fusing operation for the results of the inter-page analysis and the intra-page analysis is not limited to the described example.
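One possible reading of the calibrating and complementing operations is sketched below. The policy is an assumption of this example, not something the description fixes: both results are taken as child-to-parent maps, and where they disagree the intra-page relation is preferred. The function name `fuse` and the node labels are hypothetical.

```python
def fuse(inter_page, intra_page):
    """Fuse two child -> parent maps into one coordinated map.

    Assumed policy (illustrative only): start from the inter-page result,
    then let the intra-page result
      * complement it with relations for nodes it does not cover, and
      * calibrate it by overriding relations on which the two disagree.
    Returns (fused, calibrated_nodes, complemented_nodes).
    """
    fused = dict(inter_page)
    calibrated, complemented = [], []
    for child, parent in intra_page.items():
        if child not in fused:
            complemented.append(child)
        elif fused[child] != parent:
            calibrated.append(child)
        fused[child] = parent
    return fused, calibrated, complemented
```

In a case where a shortcut link made the inter-page analysis attach `camera.html` directly to `index.html`, an intra-page block such as `block:Cameras` under `products.html` would both add the missing block node and move `camera.html` beneath it.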
In addition, the foregoing description is only used to briefly explain the principle of the present invention, but should not be viewed as limitation of the present invention. For example, in the above-mentioned example, the mapping operations of web pages-objects and semantic blocks-objects are divided as being performed in the phases of inter-page analysis and intra-page analysis respectively. However, in some other embodiments, the hierarchy of web pages and the nested relationship of semantic blocks, which are obtained as results of inter-page analysis and intra-page analysis, can be first fused, and then, the nodes (web pages or semantic blocks) on the coordinated hierarchy can be mapped into objects to achieve the final object hierarchy.
According to one aspect of the present invention, there is provided a method for hierarchy building, comprising: obtaining a set of web pages from a website; conducting an inter-page analysis on the obtained web pages to extract a hierarchy of the web pages; conducting an intra-page analysis on each of the obtained web pages to identify the semantic blocks within the web page and extract a hierarchy of the semantic blocks for all the web pages; and fusing the hierarchy of the semantic blocks with the hierarchy of the web pages to generate a coordinated hierarchy.
According to another aspect of the present invention, there is provided a system for hierarchy building, comprising: a web page obtaining means for obtaining all web pages from a website; an inter-page analysis means for conducting an inter-page analysis on the obtained web pages to extract a hierarchy of the web pages; an intra-page analysis means for conducting an intra-page analysis on each of the obtained web pages to identify the semantic blocks within the web page and extract a hierarchy of the semantic blocks for all the web pages; and a fusing means for fusing the hierarchy of the semantic blocks with the hierarchy of the web pages to generate a coordinated hierarchy.
Since the present invention focuses on hierarchy rather than ontology, it makes it possible to deal with many real cases of domain knowledge building. Moreover, the present invention can facilitate the reuse of existing informal or semi-formal knowledge in websites and reflect the common understanding of the world/domain as much as possible.
In addition, the coordinated object hierarchy extraction method adopted in the present invention can achieve a more accurate hierarchy than either a method based on inter-page analysis alone or a method based on intra-page analysis alone, because the results of the inter-page analysis and the intra-page analysis can calibrate and complement each other.
Also, since the intra-page analysis adopted in the present invention can be conducted only on the pages that have bundles of hyperlinks pointing to the object representative pages, which can be identified during the inter-page analysis, it can be more efficient than conducting the intra-page analysis on every page of the website.
The foregoing and other features and advantages of the present invention can become more obvious from the following description in combination with the accompanying drawings. Please note that the scope of the present invention is not limited to the examples or specific embodiments described herein.
The foregoing and other features of this invention may be more fully understood from the following description, when read together with the accompanying drawings in which:
The exemplified embodiments of the present invention will be described below with reference to the accompanying drawings. It should be realized that the described embodiments are only used for illustration purpose, and should not be viewed as limiting the scope of the present invention.
The present invention is directed to systems and methods for knowledge extraction, management, and utilization. In particular, the present invention provides a method and system for highly accurate and efficient object hierarchy extraction by, for example, considering a set of web pages from a website. Of course, those skilled in the art will appreciate that the application of the present invention is not limited to the examples provided here, and that it can similarly be used for the analysis and management of domain knowledge from other knowledge sources.
First,
With reference to the flow chart of
The object hierarchies for different websites stored in the object hierarchy storage 109 can later be used by a variety of hierarchy-related applications (not shown). A hierarchy-related application can be, for example, a hierarchy integration application for integrating and aligning the hierarchies extracted from different websites.
As for other components shown in
Moreover,
Compared with the first embodiment shown in
Although the system shown in
Basically, a hierarchical navigation path (HNP) is associated with a specific website. It is the multi-step sequence of hyperlinks having a hierarchical relation between web pages, which constitutes the assumed navigational path guiding users from the root page of the website to the destination page. The hyperlinks that constitute an HNP, which we call hierarchical hyperlinks (HLs), are different from reference hyperlinks, which convey a peer-to-peer recommendation, and also different from pure navigational hyperlinks, which merely provide a shortcut from one page to another. Instead, HLs are utilized for web page organization and embed a kind of hierarchical relation (e.g., whole-part or parent-child) between web pages, so that the semantics of parent pages can be inherited by child pages along sequential HLs, i.e. along HNPs. Thus, an HNP can give a meaningful indication of the content of its destination web page.
With reference to
After all the HLs within a website are identified, the hierarchical navigation path generation unit 402 can generate the HNP for each Web document within the website. At the same time, the linguistic contents along each HNP, including the URLs, anchor texts and web page titles, can be collected by the collection unit 404.
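The generation of HNPs from already identified HLs can be sketched as a breadth-first traversal from the root page. This is a minimal sketch under assumptions not stated in the description: HLs are given as `(source_url, target_url, anchor_text)` triples, the shortest path is kept when several paths reach a page, and page titles are left out for brevity. The function name `generate_hnps` is hypothetical.

```python
from collections import deque


def generate_hnps(root, hierarchical_links):
    """Return {page: [(url, anchor_text), ...]} following HLs from the root.

    hierarchical_links is a list of (source_url, target_url, anchor_text).
    Assumption: when several paths reach a page, the shortest one is kept.
    """
    outgoing = {}
    for source, target, anchor in hierarchical_links:
        outgoing.setdefault(source, []).append((target, anchor))

    hnps = {root: [(root, "")]}  # the root has no incoming anchor text
    queue = deque([root])
    while queue:
        page = queue.popleft()
        for target, anchor in outgoing.get(page, []):
            if target not in hnps:  # first visit in BFS is a shortest path
                hnps[target] = hnps[page] + [(target, anchor)]
                queue.append(target)
    return hnps
```

Each resulting path carries the URLs and anchor texts that a path-query could later search.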
Then, after the navigation paths have been generated by the hierarchical navigation path generation unit 402, the object-relevant web page identification unit 403 can conduct a path-query to retrieve object-relevant web pages, or to filter out object-irrelevant web pages, by querying the text nodes of the HNPs with the object type name or its synonyms, which have been input in advance. For example, if a user wants to extract product web pages from a company website, the HNPs can be queried with keywords such as “product”, “service” and so on. If some nodes of a page's HNPs contain these keywords, the page can be regarded as a possible object-relevant web page, because the HNPs carry meaningful context for the target page. Such object-relevant web pages can be regarded as the representative pages of a series of nested objects. The name of an object can be summarized from the corresponding web page's title and the anchor texts of the hyperlinks that point to the corresponding web page.
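The path-query can be illustrated with a simple keyword match over the text nodes of each page's HNP. The case-insensitive substring matching and the function names `is_object_relevant` and `filter_pages` are assumptions of this sketch; a real system might use stemming or a synonym dictionary instead.

```python
def is_object_relevant(hnp_texts, keywords):
    """True if any text node on the HNP contains any keyword (case-insensitive)."""
    lowered = [text.lower() for text in hnp_texts]
    return any(keyword.lower() in text for keyword in keywords for text in lowered)


def filter_pages(hnp_texts_by_page, keywords):
    """Return the sorted URLs of pages whose HNP text nodes match a keyword."""
    return sorted(
        page
        for page, texts in hnp_texts_by_page.items()
        if is_object_relevant(texts, keywords)
    )
```

Querying with "product" and "service" keeps a page whose path contains the node "Products" and drops a page whose path consists only of nodes such as "Home", "About us" and "Team".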
After the object-relevant web pages have been filtered out by the filtering means 302, these object-relevant web pages can be provided to the inter-page analysis means 102 and the intra-page analysis means 103 for inter-page analysis and intra-page analysis.
The overall structures and principles of the coordinated object hierarchy building systems and methods according to the first, second and third embodiments of the present invention have been described above with reference to the accompanying drawings. It can be seen that the crucial technical content of the above-mentioned systems lies in three aspects, i.e. the inter-page hierarchy analysis (the inter-page analysis means 102), the intra-page hierarchy analysis (the intra-page analysis means 103) and the generation of the coordinated object hierarchy (the fusing means 104 and mapping means 105 in the first embodiment, or the fusing means 104, first mapping means 1051 and second mapping means 1052 in the second embodiment). These aspects will be described in more detail later.
First, the inter-page hierarchy analysis, i.e. the operation of the inter-page analysis means 102, can be implemented by using various methods well known to those skilled in the art. For example, in the case of processing the object-relevant web pages, the hierarchical hyperlinks identified by the hierarchical hyperlink identification unit 401 can be used, so that if two object-relevant web pages can be linked by a sequence of hierarchical hyperlinks, they are regarded as a parent-child pair and the hierarchical relation between them is stored. Of course, as known to those skilled in the art, there are many inter-page analysis methods in the prior art that can be applied to the present invention. The user can choose a proper method according to the actual application requirements to extract the hierarchy of web pages.
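The parent-child pairing described in this example can be sketched as follows. The sketch makes two assumptions that the description does not state: each page has at most one hierarchical parent, and an object-relevant page is paired with its nearest object-relevant ancestor along the HL sequence. The function name `object_page_parents` is hypothetical.

```python
def object_page_parents(hl_parent, relevant_pages):
    """Pair each object-relevant page with its nearest object-relevant ancestor.

    hl_parent:      {child_page: parent_page} along hierarchical hyperlinks
                    (assumes at most one hierarchical parent per page).
    relevant_pages: set of object-relevant page URLs.
    """
    result = {}
    for page in relevant_pages:
        seen = {page}  # guards against accidental cycles in the link data
        ancestor = hl_parent.get(page)
        while ancestor is not None and ancestor not in seen:
            if ancestor in relevant_pages:
                result[page] = ancestor
                break
            seen.add(ancestor)
            ancestor = hl_parent.get(ancestor)
    return result
```

Intermediate pages that are not object-relevant, such as a listing page between a category page and a product page, are skipped when the pair is formed.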
As for the intra-page hierarchy analysis, as described above, the intra-page analysis means 103 is used to divide each web page into several nested semantic blocks and extract a hierarchy of the semantic blocks. The intra-page hierarchy analysis process can also be implemented by using various methods well-known by those skilled in the art. Here, an example of the intra-page hierarchy analysis will be given with reference to
First, the object portal page selection unit 501 selects object portal pages from the web pages obtained by the web page obtaining means 101. The object portal pages are pages containing bundles of hyperlinks pointing to different object-relevant web pages. Then, the web page segmentation unit 502 conducts web page segmentation on the selected object portal pages to generate the nested semantic blocks of the pages. In order to further improve efficiency, the web page segmentation unit 502 can pick only those semantic blocks that contain hyperlinks pointing to object-relevant web pages for the subsequent hierarchy extraction. The web page segmentation can be realized by several existing methods, such as a DOM pattern repetition based method or a visual layout based method. The details of the existing methods are not described here. After the division into semantic blocks, the hierarchy extraction unit 503 extracts the hierarchy of the semantic blocks. Then, the title generation unit 504 can generate a title for each semantic block.
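The step of picking only the semantic blocks that link to object-relevant pages can be sketched as a recursive pruning of the nested blocks. Segmentation itself is not shown; the blocks are assumed to be already available as nested dictionaries with the hypothetical keys `name`, `links` and `children`, and the function name `prune_blocks` is likewise an assumption.

```python
def prune_blocks(block, relevant_urls):
    """Keep only the parts of a nested block that link to object-relevant pages.

    block:         {"name": str, "links": [url, ...], "children": [block, ...]}
    relevant_urls: set of object-relevant page URLs.
    Returns a pruned copy of the block, or None if nothing in it qualifies.
    """
    links = [url for url in block.get("links", []) if url in relevant_urls]
    children = []
    for child in block.get("children", []):
        kept = prune_blocks(child, relevant_urls)
        if kept is not None:
            children.append(kept)
    if not links and not children:
        return None
    return {"name": block["name"], "links": links, "children": children}
```

The nesting that survives the pruning is what the hierarchy extraction would then read as parent-child relations between semantic blocks.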
As an example, the title generation for a semantic block can be realized by a hybrid context based method, which identifies a title for each semantic block by analyzing and synthesizing two contexts of the block: the intra-page context, i.e. that of the page where the block is located, and the inter-page context, i.e. that of the destination pages of the outbound links inside the block. For example,
Finally, return to
After the inter-page hierarchy analysis and the intra-page hierarchy analysis have been done, the fusing means 104 fuses the inter-page analysis result and the intra-page analysis result to generate the coordinated hierarchy.
Finally, the coordinated hierarchy L′ generated by the fusing means 104 is mapped into the corresponding coordinated object hierarchy in the mapping means 105. As shown in
The coordinated object hierarchy building systems and methods according to the first, second and third embodiments have been described above with reference to the accompanying drawings. Compared with the prior arts, the methods and systems of the present invention possess the following advantages:
First, since the present invention focuses on hierarchy rather than ontology, it makes it possible to deal with many real cases of domain knowledge building. Moreover, the present invention can facilitate the reuse of existing informal or semi-formal knowledge in websites and reflect the common understanding of the world/domain as much as possible.
In addition, the coordinated object hierarchy extraction method adopted in the present invention can achieve a more accurate hierarchy than either a method based on inter-page analysis alone or a method based on intra-page analysis alone, because the results of the inter-page analysis and the intra-page analysis can calibrate and complement each other.
Also, since the intra-page analysis adopted in the present invention can be conducted only on the pages that have bundles of hyperlinks pointing to the object representative pages, which can be identified during the inter-page analysis, it can be more efficient than conducting the intra-page analysis on every page of the website.
The specific embodiments of the present invention have been described above with reference to the accompanying drawings. However, the present invention is not limited to the particular configuration and processing shown in the accompanying drawings. In the above embodiments, several specific steps are shown and described as examples. However, the method process of the present invention is not limited to these specific steps. Those skilled in the art will appreciate that these steps can be changed, modified and complemented or the order of some steps can be changed without departing from the spirit and substantive features of the invention.
The elements of the invention may be implemented in hardware, software, firmware or a combination thereof and utilized in systems, subsystems, components or sub-components thereof. When implemented in software, the elements of the invention are programs or the code segments used to perform the necessary tasks. The program or code segments can be stored in a machine-readable medium or transmitted by a data signal embodied in a carrier wave over a transmission medium or communication link. The “machine-readable medium” may include any medium that can store or transfer information. Examples of a machine-readable medium include electronic circuit, semiconductor memory device, ROM, flash memory, erasable ROM (EROM), floppy diskette, CD-ROM, optical disk, hard disk, fiber optic medium, radio frequency (RF) link, etc. The code segments may be downloaded via computer networks such as the Internet, Intranet, etc.
Although the invention has been described above with reference to particular embodiments, the invention is not limited to the above particular embodiments and the specific configurations shown in the drawings. For example, some components shown may be combined with each other as one component, or one component may be divided into several subcomponents, or any other known component may be added. The operation processes are also not limited to those shown in the examples. Those skilled in the art will appreciate that the invention may be implemented in other particular forms without departing from the spirit and substantive features of the invention. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive. The scope of the invention is indicated by the appended claims rather than by the foregoing description, and all changes that come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein.
Number | Date | Country | Kind |
---|---|---|---|
200810111482.2 | Jun 2008 | CN | national |