The invention relates to a method of enabling a set of information items to be extended, to a method of extending a set of information items, and to software for carrying out the methods.
The term “ontology”, as used in a computational environment, typically refers to the specification of term names, term meanings, and interrelations of the terms. Ontologies, also referred to as “domain conceptualizations”, resemble taxonomies but may use richer semantic relationships among terms, as well as strict rules about how to specify terms and relationships. See, e.g., Deborah L. McGuinness. “Ontologies Come of Age”. In Dieter Fensel, Jim Hendler, Henry Lieberman, and Wolfgang Wahlster, editors. Spinning the Semantic Web: Bringing the World Wide Web to Its Full Potential. MIT Press, 2002.
The creation of an ontology is typically a time-consuming task. At Yahoo, for example, a small group of experts categorize Web pages manually. The Open Directory Project (ODP) of DMOZ leverages the collaborative effort of over 35,000 volunteer editors to generate large, simple ontologies, with over 360,000 classes in a taxonomy.
The inventors consider as an example the metadata accompanying electronic content information available on the Internet, and on carriers such as optical disks, memory cards, etc. Metadata is additional information that can be used to search or browse audio/video content. For example, the metadata relating to a song can include the title of the song, the names of the artists, an indication of the genre, etc. Given an ontology of a certain domain (pop-music, movies, etc.), it is often difficult to fill the metadata database with relevant data. To fill the database by manually adding the data is expensive and time-consuming. The inventors therefore propose to automatically fill the database, by using information that is available on web pages of the world-wide web. The idea is to automatically extend a small set of items of a given type by searching on web pages for enumerations, in which multiple items of the given set are listed. With high probability the other words (or word combinations) in such enumerations will also refer to items of the same type. The invention thus exploits the overlap between enumerations or listings that are present in electronic documents of a large collection in order to create or extend a database.
More specifically, an instantiation of the invention relates to a method of enabling the extension of a set of information items that have an ontological attribute in common. The method comprises enabling a query of a collection of electronic documents about a first enumeration of multiple items of the set. The query is run on, e.g., the world-wide web with any convenient search engine such as Google, or on any other collection of electronic documents that can be subjected to, e.g., a full-text search. In respective ones of the documents represented in a query result of the query, a respective candidate item is identified in a respective second enumeration comprising the first enumeration. Then it is determined if among the respective candidate items there is a specific item having the attribute in common with the items of the set. If the specific item is determined to have the attribute in common, and is not already comprised in the set, the specific item is provided for being added to the set. Determining whether or not this commonality is present comprises, for example, determining a number of times that the candidate item co-occurs with the items of the first enumeration and/or with another enumeration of different items of the set. For example, the determining comprises evaluating a number of documents in the query result that contain the same respective candidate item. The method of the invention may go through two or more further iterations. The collection is then further queried about a third enumeration of a plurality of items of the set, wherein the third enumeration is different from the first enumeration. For example, the third enumeration comprises a permutation of the first enumeration, or the third and first enumerations differ from one another by at least one item, e.g., the third enumeration comprises the specific item found in the previous enumeration, etc.
The method of enabling as defined above is carried out by, e.g., the server of a provider of information services on the Internet, e.g., as an extension to existing search engines.
Another instantiation of the invention relates to a method of extending a set of information items that have an ontological attribute in common. The method comprises: querying a collection of electronic documents about a first enumeration of multiple items of the set; identifying in respective ones of the documents represented in a result of the query a respective candidate item in a respective second enumeration comprising the first enumeration; determining if among the respective candidate items there is a specific item having the attribute in common with the items of the set; and, if the specific item is determined to have the attribute in common and is not already comprised in the set, adding the specific item to the set. This instantiation of the invention is carried out by a database provider or database creator using software to automatically create a database or ontology.
The invention thus exploits the overlap between enumerations or listings that are present in electronic documents of a large collection of documents in order to create or extend a database.
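The querying, identifying, determining and adding steps set out above can be illustrated with a minimal sketch. It is not the claimed implementation: it assumes the query result is already available as a list of plain-text documents, that an enumeration is a single line of comma- or "and"-separated items, and that a candidate is accepted when it appears in at least `min_docs` documents. The names `split_enumeration`, `count_candidates`, `extend_set` and `min_docs` are hypothetical.

```python
import re
from collections import Counter


def split_enumeration(line):
    """Naively split a comma/"and"-separated listing into trimmed items.

    Assumption: items themselves never contain the word "and".
    """
    return [p.strip() for p in re.split(r",|\band\b", line) if p.strip()]


def count_candidates(documents, seed_items):
    """Count, per candidate, the documents in which it shares an
    enumeration with all of the seed items (the first enumeration)."""
    seeds = {s.lower() for s in seed_items}
    counts = Counter()
    for doc in documents:
        found = set()
        for line in doc.splitlines():
            items = split_enumeration(line)
            # The "second enumeration" must comprise the first enumeration.
            if seeds <= {i.lower() for i in items}:
                found.update(i for i in items if i.lower() not in seeds)
        counts.update(found)  # each document votes once per candidate
    return counts


def extend_set(item_set, documents, seed_items, min_docs=2):
    """Return the set extended with candidates seen in >= min_docs documents."""
    counts = count_candidates(documents, seed_items)
    known = {i.lower() for i in item_set}
    new_items = sorted(
        c for c, n in counts.items() if n >= min_docs and c.lower() not in known
    )
    return list(item_set) + new_items
```

With the seed enumeration "Amsterdam, Rotterdam, Utrecht", a candidate that appears alongside the seeds in only one document would be rejected under the assumed threshold, whereas one that recurs across documents would be added.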
The invention is explained in further detail, by way of example and with reference to the accompanying drawing wherein:
Throughout the figures, same reference numerals indicate similar or corresponding features.
The invention relates to extending a collection of items of a given type with additional items of that same type by means of searching the Web for enumerations wherein multiple given items co-occur. The invention is based on the assumption that in an enumeration or list of specific items found on a web page, more items of the same type are present. By counting the number of times that a co-occurring item appears together with the enumeration of the given items, items that are unlikely to be of the proper type can be filtered out. In addition, by counting the relative frequency of hits for different enumerations of given items, further unlikely newly found items are filtered out. A next iteration may then start the querying with a next enumeration that includes one or more new items found in the previous iteration. By presenting a search program with only a few items to start with, a database can be built with many more items found over a number of iterations.
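The iterative use of successive enumerations can be sketched as follows. This is an illustration under stated assumptions, not the method as claimed: `find_enumerations` is a hypothetical caller-supplied function that returns the listings (as lists of items) containing all of the given seeds, scores are simple absolute hit counts accumulated over every seed enumeration of size `seed_size`, and `min_hits` and `max_rounds` are arbitrary illustrative parameters.

```python
from collections import Counter
from itertools import combinations


def iterate_extension(initial_items, find_enumerations,
                      seed_size=2, min_hits=2, max_rounds=5):
    """Grow a set of items over several rounds of querying.

    find_enumerations(seeds) is assumed to return an iterable of item
    lists, each of which contains every seed.
    """
    known = list(initial_items)
    for _ in range(max_rounds):
        scores = Counter()
        # Query with every enumeration of seed_size known items.
        for seeds in combinations(known, seed_size):
            for enumeration in find_enumerations(seeds):
                scores.update(set(enumeration) - set(known))
        # Keep only candidates that co-occur often enough.
        accepted = sorted(i for i, n in scores.items() if n >= min_hits)
        if not accepted:
            break  # nothing new found: stop iterating
        known.extend(accepted)  # new items seed the next round
    return known
```

Because newly accepted items are added to `known` before the next round, later rounds query with enumerations that include items found earlier, which is how a few starting items can lead to a much larger set.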
An item consists of, e.g., a single word or name, or is a composite entity consisting of multiple words in a specific order. The search program may search documents in only a particular language owing to the spelling used. A translation of the initial items into another language may turn up additional items not found or originally not accepted using the initial language. Another fine-tuning of the search relates to running the query using an ordering or arrangement of the initial items that are entered in a specific sequence. For example, information items known in advance of the set to be extended are arranged alphabetically or in order of increasing or decreasing magnitude or size of their concepts covered, etc.
In an iteration that is not the first, the analyzing in step 110 may also include correlating the current results with those of previous iterations, e.g., by analyzing the scores accumulated over the iterations carried out so far. In addition, one may also keep track of which specific electronic documents turn up in two or more iterations. These specific documents may already contain a larger listing of the items sought. For example, if the same document has appeared among the query results for, e.g., more than half of all iterations so far, one may consider scanning this document in a broader scope, e.g., by iteratively testing if the neighbor of an accepted candidate item in the second enumeration contained in this specific document, the neighbor not being present in the first enumeration, also has a high degree of occurrence in the other documents retrieved so far. If so, then this neighbor is likely to be an acceptable candidate as well. The process can then proceed by evaluating the neighbor's neighbor, etc.
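The neighbor-by-neighbor scan of a frequently recurring document can be sketched as below. The sketch assumes the document's enumeration is available as an ordered list, that `doc_counts` maps each item to the number of other retrieved documents containing it, and that "a high degree of occurrence" means at least `min_docs` documents. All names and the threshold are hypothetical.

```python
def expand_neighbors(enumeration, accepted, seeds, doc_counts, min_docs=2):
    """Walk outward from an accepted item in both directions, accepting
    each neighbor that occurs in enough of the other documents."""
    start = enumeration.index(accepted)
    found = []
    for step in (1, -1):  # scan to the right, then to the left
        i = start + step
        while 0 <= i < len(enumeration):
            item = enumeration[i]
            if item in seeds:
                i += step  # items of the first enumeration are passed over
                continue
            if doc_counts.get(item, 0) < min_docs:
                break  # weakly supported neighbor ends the walk
            found.append(item)
            i += step  # proceed to the neighbor's neighbor
    return found
```

Stopping at the first weakly supported neighbor is a design choice of this sketch; the text above does not prescribe when the scan should end.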
Further, before terminating the process in step 118 an optional step (not shown) can be carried out to further purify the set thus extended. For example, if there is a large difference between the number of documents that include a certain item and the number of documents that include any other item, one may consider the certain item an anomaly and delete it from the set. Statistical analysis, user intervention or editor intervention may be needed for this step.
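One possible form of this purification step is sketched below. The text does not say what counts as a "large difference", so the sketch assumes an item is anomalous when its document count differs from the median count by more than a factor `ratio`. Both the use of the median and the default factor are illustrative assumptions, and flagged items could equally be passed to a user or editor for review.

```python
from statistics import median


def find_anomalies(doc_counts, ratio=10.0):
    """Return items whose document count is far above or far below the
    median document count of the set (assumed criterion)."""
    if not doc_counts:
        return []
    typical = median(doc_counts.values())
    return sorted(
        item for item, n in doc_counts.items()
        if n > typical * ratio or n * ratio < typical
    )
```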
Once a listing is accepted as complete and process 100 is terminated the result is a database with a one-dimensional array of information items, possibly accompanied by meta-information using pointers as mentioned above. The database can be expanded so as to be represented by a two- or more dimensional array. For example, the user in the example under
An interesting use of the method in the invention relates to finding translations of particular words in another language. Consider for example the name of a city in different languages, e.g., the words “Milano”, “Milan”, “Mailand”, “Milaan” all refer to the same city in northern Italy in Italian, French/English, German and Dutch, respectively. The spelling of the name of the capital of the Netherlands, “Amsterdam”, is conserved when translated to most other languages. This means that the items in an enumeration of names of cities as obtained in a method of the invention depend on the language wherein the documents analyzed have been worded. Accordingly, one could start a query with a first enumeration of names that are language independent, the query being restricted to documents in a specific language. For example, the method of the invention applied to the enumeration “Amsterdam, Rotterdam, Utrecht” and restricted to documents in English will probably result in candidate items such as “Eindhoven” and “The Hague”. A similar query restricted to documents in the French language will probably have among the results “Eindhoven” and “La Haye”, whereas one limited to Dutch documents will lead to “Eindhoven” and “'s Gravenhage” and “Den Haag”. Analyzing the eventual results of the queries in different languages will lead to the insight that the terms “The Hague”, “La Haye”, “Den Haag” and “'s Gravenhage” all refer to the same Dutch city in the west of the Netherlands, and that “Den Bosch”, “'s Hertogenbosch” and “Bois-le-Duc” are different names for the same Dutch city in the south, the first two in the Dutch language and the last one in French. Note that analyzing the eventual enumerations may therefore also lead to alternative indications, e.g., “Holland”, “The Netherlands”, and “The Low Countries”, of the same entity in the same language.
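A first step of such a cross-language analysis might look like the sketch below. It assumes the per-language results are already available as lists keyed by a language code, and it only separates names shared by all languages from names specific to one language. Deciding which language-specific names denote the same entity is not specified above and is left out. The function name and the input structure are hypothetical.

```python
def split_by_language(results_by_language):
    """Separate language-independent names from language-specific ones."""
    sets = [set(items) for items in results_by_language.values()]
    shared = set.intersection(*sets) if sets else set()
    specific = {
        lang: sorted(set(items) - shared)
        for lang, items in results_by_language.items()
    }
    return shared, specific


results = {
    "en": ["Eindhoven", "The Hague"],
    "fr": ["Eindhoven", "La Haye"],
    "nl": ["Eindhoven", "Den Haag", "'s Gravenhage"],
}
shared, specific = split_by_language(results)
# shared holds names spelled identically everywhere; specific holds the
# remaining names per language, which are candidates for being translations.
```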
Incorporated herein by reference are the following:
U.S. Pat. No. 6,349,307 (attorney docket PHA 23,606) issued to Doreen Cheng for COOPERATIVE TOPICAL SERVERS WITH AUTOMATIC PREFILTERING AND ROUTING. This patent relates to an information organization and retrieval system that efficiently organizes documents for rapid and efficient search and retrieval based upon topical content. The information organization and retrieval system is optimized for the organization and retrieval of only those documents that are relevant to a given set of predefined topics. If a document does not have a topic that is included in the given set of topics, the document is excluded from the provided service. In like manner, if a document includes a topic that is specifically banned from the provided service, it is excluded. In this paradigm, the provider purposely limits the scope of the provided search and retrieval services, but in so doing provides a more efficient and effective service that is targeted to an expected user demand. The information organization and retrieval system also supports context-sensitive search and retrieval techniques, including the use of predefined or user-defined views for augmenting the search criteria, as well as the use of user-specific vocabularies. In a preferred embodiment, the select set of topics is organized in multiple overlapping hierarchies, and a distributed software architecture is used to support the topic-based information organization, routing, and retrieval services. Documents may be relevant to one or more topics, and will be associated with each topic via the topical hierarchies that are maintained by the information servers.
U.S. Pat. No. 6,256,633 (attorney docket PHA 23,422) issued to Chanda Dharap for CONTEXT-BASED AND USER-PROFILE DRIVEN INFORMATION RETRIEVAL. This patent relates to enabling a user to navigate through an electronic database in a personalized manner. A context is created based on a profile of the user, the profile being at least partly formed in advance. Candidate data is selected from the database under control of the context and the user is enabled to interact with the candidates. The profile is based on topical information supplied by the user in advance and a history of previous accesses from the user to the database. This patented invention increases the effectiveness of browsing wide-area information by means of focusing primarily on the user's interest as given by the user's access history in terms of the results of previous queries. Taking these results into account for next queries creates a context that enables interpreting the current query object in view of what currently is likely to be of interest to this specific user. The context for the current query is used to update the user's profile. The profile itself is used as a recommendation for mapping relevant information from the information provider's topic space, also referred to as document base, onto the user's search space. The profile gets updated dynamically in response to the user's interactions with the document base. Accordingly, the dynamic part reflects the path taken within the provider's information space in the course of the user's search. Preferably, the profile also has a static part that reflects the user's long-term interests. The term “static” is used to indicate a time scale substantially slower than that of the dynamic part. The static part is determined by, for example, letting the user provide topical information about his/her fields of attention the first time that the user interacts with the document base. Such entries can be changed manually in due course.
Alternatively, or in addition, statistical analysis of a statistically relevant number of results over time enables finding themes that stay substantially constant.
Number | Date | Country | Kind |
---|---|---|---|
03103363.2 | Sep 2003 | EP | regional |
Filing Document | Filing Date | Country | Kind | 371c Date |
---|---|---|---|---|
PCT/IB04/51577 | 8/26/2004 | WO | | 3/3/2006 |