The present invention relates to information management systems, and more particularly to a system, method and computer program for automatically generating multilingual electronic content from unstructured data.
The inclusion of electronic content (e-content) in learning is now inevitable. E-content is a new domain full of new challenges. E-content development is the creation, design, and deployment of content and related assets, including text, images, and animation. The management of objective-driven and multilingual content is a requirement to meet the high expectations of today's global enterprise.
The problem is that the traditional manual development of content may consume a huge amount of time. Moreover, content “localization” (the adaptation of content to a local environment) requires additional effort.
US patent application 2003/0163784 entitled “Compiling and distributing modular electronic publishing and electronic instruction materials” discloses a system and method that facilitate the development, maintenance and modification of course and publication content, because such content may be located centrally in a large library of independent electronic learning and electronic content objects that serve as building blocks for electronic courses and publications. Modular CAI (Computer Aided Instruction) systems and methods can be used to monitor student progress both by administering examinations and by tracking what content particular students have accessed and/or reviewed. The invention includes authors using Internet-accessed tools and templates to compile instructional and informational content, and the subsequent delivery of web-based instructional or informational content to end users such that the end users can receive and review such content using computing devices running standard web browsing applications.
The above-mentioned patent application assumes the existence of a large library of independent e-learning and e-content objects (structured materials) to build (compile) e-courses and publications. On the contrary, the present invention starts from scratch using unstructured input. The present invention also has the ability to handle multilingual material and to build relations between topics automatically.
US patent application 2004/205547 entitled “Annotation process for message enabled digital content” discloses an electronic message annotating method for providing interaction between instructor and student. The method involves displaying of annotation and its connection to a chosen subject item on visual displays. The method includes processes and techniques to:
The method includes a technique to encode digital content in a fashion that allows for the creation of text messages and the convenient inclusion of annotations referencing both textual and non-textual media elements. The main object of this method is the representation of the e-content during content development.
The present invention goes beyond the systems disclosed above by providing a method for automatically generating e-content.
US patent application 2002/0156702 entitled “System and method for producing, publishing, managing and interacting with e-content on multiple platforms” discloses content production tools that incorporate the XML protocol with Object Oriented methodology to enable the production of effective displays. The claimed method and system unifies the production, delivery and display of content for all content platforms under one set of tools. The tools enable the production of platform-independent content without requiring a deep knowledge of programming.
The present invention goes beyond the system disclosed above by providing a method for automatically generating e-content from unstructured data. However, the tools disclosed above can be used at the final stage of the present invention.
Automatic Language Identification for Written Texts:
Some techniques for automatically identifying the language of written text use:
U.S. Pat. No. 5,062,143 entitled “Trigram-based method of language identification”, discloses a mechanism for examining a body of text and identifying its language. This mechanism compares successive trigrams into which the body of text is parsed with a library of sets of trigrams. For a respective language-specific key set of trigrams, if the ratio of the number of trigrams in the text, for which a match in the key set has been found, to the total number of trigrams in the text is at least equal to a prescribed value, then the text is identified as being possibly written in the language associated with that respective key set. Each respective trigram key set is associated with a respectively different language and contains those trigrams that have been predetermined to occur at a frequency that is at least equal to a prescribed frequency of occurrence of trigrams for that respective language. Successive key sets for other languages are processed as above, and the language for which the percentage of matches is greatest, and for which the percentage exceeded the prescribed value as above, is selected as the language in which the body of text is written.
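The trigram matching described above can be illustrated with a minimal sketch. It is not the patented implementation: the function names, the toy key sets and the 0.3 threshold are assumptions chosen for illustration, and real key sets would hold trigrams whose frequency in a large corpus of each language exceeds a prescribed value.

```python
def trigrams(text):
    """Return the successive character trigrams of a lower-cased text."""
    cleaned = " ".join(text.lower().split())
    return [cleaned[i:i + 3] for i in range(len(cleaned) - 2)]


def identify_language(text, key_sets, threshold=0.3):
    """Return the language whose key set matches the largest share of the
    text's trigrams, provided that share reaches the threshold; else None."""
    grams = trigrams(text)
    if not grams:
        return None
    best_language, best_ratio = None, 0.0
    for language, key_set in key_sets.items():
        matches = sum(1 for gram in grams if gram in key_set)
        ratio = matches / len(grams)  # matched trigrams / total trigrams
        if ratio >= threshold and ratio > best_ratio:
            best_language, best_ratio = language, ratio
    return best_language


# Toy key sets built from one sentence each (hypothetical data, for illustration only).
KEY_SETS = {
    "english": set(trigrams("the quick brown fox jumps over the lazy dog and then the cat")),
    "french": set(trigrams("le renard brun saute par dessus le chien paresseux et le chat")),
}

print(identify_language("the dog and the fox", KEY_SETS))
print(identify_language("le chien et le chat", KEY_SETS))
```

Texts whose match ratio stays below the threshold for every key set are left unidentified, which mirrors the "possibly written in" test of the description.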
Machine Translation:
“Machine Translation” is the translation from one natural language to another by means of a computerized system. Many different approaches have been adopted by machine translation researchers and there are many systems available in the market for different languages. These systems mainly fall into two categories.
The automatic retrieval of information from a natural language text corpus is mainly based on the retrieval of documents matching one or more key words given in a user query. For instance, most conventional search engines on the Internet use a Boolean search based on key words given by the user.
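As a minimal sketch of such key-word retrieval, the example below builds an inverted index and answers a Boolean AND query. The document collection, the function names and the whitespace tokenization are assumptions made for illustration; they do not describe any particular search engine.

```python
def build_index(documents):
    """Map each lower-cased word to the set of document ids that contain it."""
    index = {}
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(doc_id)
    return index


def boolean_and(index, keywords):
    """Return the ids of the documents that contain every key word."""
    postings = [index.get(word.lower(), set()) for word in keywords]
    if not postings:
        return set()
    return set.intersection(*postings)


# Hypothetical three-document corpus.
DOCUMENTS = {
    1: "machine translation of natural language",
    2: "natural language query",
    3: "trigram language identification",
}

INDEX = build_index(DOCUMENTS)
print(boolean_and(INDEX, ["natural", "language"]))
```

The match is purely verbatim: a document that expresses the same meaning with different words is not returned, which is the limitation the semantic approaches address.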
Some proposals are based on the creation of an information retrieval system that can find documents in a natural language text corpus that match a natural language query with respect to the semantic meaning of the query.
Some of these proposals relate to systems that have been extended with specific world knowledge within a given domain. Such systems are based on an extensive database of world knowledge within a single area.
Other proposals are based on underlying linguistic levels of semantic representation. In these proposals, instead of verbatim matching of one or more key words, a semantic analysis of the natural language text corpus and of the natural language query is performed, and the documents matching the semantic content of the query are returned.
Information Extraction:
“Information extraction” consists in extracting from text documents entities and relations among these entities. Examples of entities are “people”, “organizations”, and “location”. Examples of relations are “person-affiliation” and “organization-location”. The person-affiliation relation means that a particular person is affiliated with a certain organization. For instance, the sentence “John Smith is the chief scientist of the Hardcom Corporation” contains a person-affiliation relation between the person “John Smith” and the organization “Hardcom Corporation”.
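The person-affiliation example can be sketched with a single hand-written pattern. This is only an illustration under stated assumptions: the `Relation` class, the regular expression and the function name are hypothetical, and practical extractors rely on trained models rather than one fixed pattern.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    kind: str
    person: str
    organization: str


# Hypothetical pattern: "<Person> is the <role> of the <Organization>".
AFFILIATION = re.compile(
    r"(?P<person>[A-Z][a-z]+(?: [A-Z][a-z]+)+) is the [a-z ]+ of the "
    r"(?P<org>[A-Z][A-Za-z]+(?: [A-Z][A-Za-z]+)*)"
)


def extract_affiliations(text):
    """Return every person-affiliation relation matched by the pattern."""
    return [
        Relation("person-affiliation", m.group("person"), m.group("org"))
        for m in AFFILIATION.finditer(text)
    ]


sentence = "John Smith is the chief scientist of the Hardcom Corporation"
print(extract_affiliations(sentence))
```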
“Information retrieval” gets sets of relevant documents (the user analyzes the documents) while “Information extraction” gets facts out of documents (the user analyzes the facts).
There are several approaches currently used for extracting information from natural language (e.g. Part of Speech Tagging and Entity Extraction). The Hidden Markov Model (HMM) was perhaps the most popular approach for adaptive information extraction. HMMs exhibit excellent performance for name extraction [1] (Bikel et al., 1999). HMMs are mostly appropriate for modeling local and flat problems. The extraction of relations often involves the modeling of long-range dependencies, for which the HMM methodology is not directly applicable.
Several probabilistic frameworks for modeling sequential data have recently been introduced to limit the HMM constraints:
As such, they both enjoy a number of attractive properties (e.g., global likelihood maximum) and are better suited for modeling sequential data, as contrasted with other conditional models.
Online learning algorithms for learning linear models (e.g. Perceptron, Winnow) are becoming increasingly popular for Natural Language Processing (NLP) problems [4] (Roth, 1999). These algorithms exhibit a number of attractive features such as incremental learning and scalability to a very large number of examples. Their recent applications to shallow parsing [5] (Munoz et al., 1999) and information extraction [6] (Roth and Yih, 2001) exhibit state-of-the-art performance.
More recent work has focused on unsupervised methods for extracting relations between entities from unstructured text. For example, the work presented in the article entitled “Extracting Patterns and Relations from the World Wide Web” (by Sergey Brin, Computer Science Department, Stanford University), published in “The proceedings of the 1998 International Workshop on the Web and Databases”, is directed to the extraction of authorship information as found in book descriptions on the World Wide Web. This publication is based on dual iterative pattern-relation extraction wherein a relation and pattern set is iteratively constructed.
The article entitled “Snowball: Extracting Relations from Large Plain-Text Collections” (Eugene Agichtein and Luis Gravano, Department of Computer Science, Columbia University), published in “Proceedings of the Fifth ACM International Conference on Digital Libraries”, 2000, discloses an idea similar to the previous work. Seed examples are used to generate initial patterns and to iteratively obtain further patterns. Then ad-hoc measures are deployed to estimate the relevancy of the patterns that have been newly obtained.
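The iterative construction of relations and patterns from seed examples can be sketched as follows. This is a deliberately simplified illustration, not the algorithm of either cited article: the pattern is reduced to the literal text between two terms, candidate terms are assumed to be capitalized phrases, no relevancy estimate is applied to new patterns, and the sentences and function names are hypothetical.

```python
import re

# Assumption: a candidate term is one or more capitalized words.
PHRASE = r"([A-Z]\w*(?: [A-Z]\w*)*)"


def find_patterns(sentences, pairs):
    """Collect the text found between the two terms of each known pair."""
    patterns = set()
    for sentence in sentences:
        for first, second in pairs:
            match = re.search(re.escape(first) + r"(.+?)" + re.escape(second), sentence)
            if match:
                patterns.add(match.group(1))
    return patterns


def apply_patterns(sentences, patterns):
    """Find pairs of capitalized phrases joined by a known pattern."""
    pairs = set()
    for sentence in sentences:
        for middle in patterns:
            for match in re.finditer(PHRASE + re.escape(middle) + PHRASE, sentence):
                pairs.add((match.group(1), match.group(2)))
    return pairs


def bootstrap(sentences, seeds, rounds=2):
    """Alternate between deriving patterns from pairs and pairs from patterns."""
    pairs, patterns = set(seeds), set()
    for _ in range(rounds):
        patterns |= find_patterns(sentences, pairs)
        pairs |= apply_patterns(sentences, patterns)
    return pairs, patterns


# Hypothetical corpus and a single seed pair.
SENTENCES = [
    "Microsoft is headquartered in Redmond.",
    "Exxon is headquartered in Irving.",
    "Exxon, based in Irving, reported earnings.",
    "Intel, based in Santa Clara, announced results.",
]
PAIRS, PATTERNS = bootstrap(SENTENCES, {("Microsoft", "Redmond")})
print(sorted(PAIRS))
```

In this toy run the seed yields a first pattern, the pattern yields a second pair, and that pair exposes a second pattern that reaches a third pair. Without a relevancy measure, a noisy pattern would propagate errors in the same way, which is why the cited work scores newly obtained patterns.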
US patent application US 2004/0167907 entitled “Visualization of integrated structured data and extracted relational facts from free text” (Wakefield et al.) discloses a mechanism to extract simple relations from unstructured free text.
U.S. Pat. No. 6,505,197 entitled “System and method for automatically and iteratively mining related terms in a document through relations and patterns of occurrences” (Sundaresan et al.) discloses an automatic and iterative data mining system for identifying a set of related information on the World Wide Web that defines a relationship. More particularly, the mining system iteratively refines pairs of terms that are related in a specific way and the patterns of their occurrences in web pages. The automatic mining system runs in an iterative fashion for continuously and incrementally refining the relations and their corresponding patterns. In one embodiment, the automatic mining system identifies relations in terms of the patterns of their occurrences in the web pages. The automatic mining system includes a relation identifier that derives new relations, and a pattern identifier that derives new patterns. The newly derived relations and patterns are stored in a database, which begins initially with small seed sets of relations and patterns that are continuously and iteratively broadened by the automatic mining system.
U.S. Pat. No. 6,606,625 entitled “Wrapper induction by hierarchical data analysis” (Muslea et al.) discloses an inductive algorithm generating extraction rules based on user-labeled training examples.
The present invention is directed to the field of electronic content management and more particularly to a method, system and computer program for automatically generating electronic content based on a user-designed table of contents and a desired final content form. Language identification and automatic machine translation technologies are also used to broaden the sources of information.
The method for automatically generating and localizing electronic content from unstructured data based on user preferences, comprises the steps of:
More particularly, the method according to the present invention comprises the further steps of:
An advantage of the present invention is that the user can configure an automatic digital content generator to generate electronic content in the form and language of the user's choice.
The foregoing, together with other objects, features, and advantages of this invention can be better appreciated with reference to the following specification, claims and drawings.
The new and inventive features believed characteristic of the invention are set forth in the appended claims. The invention itself, however, as well as a preferred mode of use, further objects and advantages thereof, will best be understood by reference to the following detailed description of an illustrative detailed embodiment when read in conjunction with the accompanying drawings, wherein:
The following description is presented to enable one of ordinary skill in the art to make and use the invention and is provided in the context of a patent application and its requirements. Various modifications to the preferred embodiment and the generic principles and features described herein will be readily apparent to those skilled in the art. Thus, the present invention is not intended to be limited to the embodiment shown but is to be accorded the widest scope consistent with the principles and features described herein.
Definitions
In the present invention, the terms “information”, “data”, and “documents” are used interchangeably.
General Principles
The present invention combines automatic text analysis, information searching and information extraction techniques for automatically generating digital content for e-learning from unstructured information (books, web content, etc.). The present invention proposes a system and method for automatically developing and localizing (adapting to the local environment) multilingual e-content. The present invention integrates some known technologies and proposes some new technologies to contribute to e-content development for the e-learning market. Many publications worldwide disclose aspects of automatic text analysis, information searching and information extraction techniques. Similarly, some references disclose systems and techniques using the above-mentioned technologies. However, none of these references discloses the combination of steps and means claimed in the present invention.
General View of the Invention
The operation of the Information Extractor (201), the Structured Information Generator (202), and the full ADCG system (100) will be described using the following example, where a user wishes to develop e-content for a Table of Contents (TOC) having the following list of topics:
The design of the Table of Contents (TOC) is done by the user (102). The TOC is used to feed the ADCG system (100).
Information Extractor
For each Topic (Ti) in the Table of Contents (TOC):
It is worth mentioning that the proposed system can accommodate any type of feature. The output of the Relation Extractor (304) represents named entities and relations between said named entities. A feature vector is associated with each named entity and relation. This feature vector includes information regarding the associated entity or relation.
The entities and relations are represented in a directed graph in which the nodes represent the entities and the edges represent the relations between the different entities. The topic (Ti) is also represented by a node in the graph, and all other nodes are candidate sub-topics. The output of the Feature Extractor (305) is, therefore, a Graph-based Hierarchical Topic Representation Ti_G.
The steps 301 to 305 are repeated in order to generate a graph for each topic comprised in the Table Of Contents (TOC).
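The graph-based topic representation described above can be sketched as a small data structure. This is a minimal illustration under stated assumptions: the class and function names, the use of plain dictionaries as feature vectors, and the `extract` callable standing in for steps 301 to 305 are all hypothetical and are not taken from the described system.

```python
from dataclasses import dataclass, field


@dataclass
class TopicGraph:
    """Directed graph for one topic: nodes are entities, edges are relations."""
    topic: str
    node_features: dict = field(default_factory=dict)  # entity -> feature vector (dict)
    edges: dict = field(default_factory=dict)           # (source, target) -> feature vector (dict)

    def __post_init__(self):
        # The topic itself is also a node of the graph.
        self.node_features.setdefault(self.topic, {})

    def add_entity(self, name, **features):
        self.node_features.setdefault(name, {}).update(features)

    def add_relation(self, source, target, **features):
        self.add_entity(source)
        self.add_entity(target)
        self.edges[(source, target)] = dict(features)

    def candidate_subtopics(self):
        """Every node other than the topic is a candidate sub-topic."""
        return sorted(name for name in self.node_features if name != self.topic)


def build_graphs(table_of_contents, extract):
    """Build one graph per topic; `extract` is a hypothetical callable that
    returns (source, relation, target) triples for a topic."""
    graphs = {}
    for topic in table_of_contents:
        graph = TopicGraph(topic)
        for source, relation, target in extract(topic):
            graph.add_relation(source, target, relation=relation)
        graphs[topic] = graph
    return graphs
```

A caller would supply its own extraction function, for example one returning `[("Networking", "includes", "Routing")]` for the topic "Networking", and would then read the candidate sub-topics from each resulting graph.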
Structured Information Generator
Each Graph-based Topic Representation Ti_G is passed to the Structured Information Generator (202) which performs the following step:
Then, based on all Graph-based Topic Representations Ti_G output by the Sub-Topic Relevance Checker (401), the Structured Information Generator (202) performs the following step:
As previously shown in
Presentation Composer
The generated structured content is then passed to a Presentation Composer (204), which uses the user's selection of the type of material needed (course, exam, summary, presentation, RD, etc.) to compose the final e-content.
Language Identifier and Text Processor
Note that the ADCG system is fed by unstructured information that can be in more than one language. A Language Identifier (106) can be used with a Text Processor (107) (optional as shown in
Particular Embodiment
In a particular embodiment, the present invention is executed by a content provider in a server. The server receives the requests and preferences (list of topics, selected environment, specified form) from clients and sends back to said clients the requested content in the specified form.
While the invention has been particularly shown and described with reference to a preferred embodiment, it will be understood that various changes in form and detail may be made therein without departing from the spirit and scope of the invention.
| Number | Date | Country | Kind |
|---|---|---|---|
| 05112722.3 | Dec 2005 | EP | regional |