The present application claims the priority of European patent application, Serial No. 04106536.8, titled “Method and Apparatus for Generation of Text Documents,” which was filed on Dec. 14, 2004, and which is incorporated herein by reference.
The invention relates to the generation of large volumes of text documents for test purposes.
Natural language processing (NLP) systems such as search engines or text mining software provide essential tools for retrieving information from collections of digitized text documents. Because of the strong growth of the amount of digital text data to be processed, excellent performance and scalability are essential for these systems. In order to effectively test performance and scalability, huge text document collections with specific properties concerning text and document structure are needed. Although some document collections exist for particular languages, such as the Wall Street Journal or the Gutenberg collections (Gutenberg Project, http://promo.net/pg/), such collections are often of limited use since they are restricted to specific types of documents such as newspaper articles or literary texts. Existing document collections may also cover only a few specific target languages, lack the appropriate document structure, or simply be too small. On the other hand, document collections containing artificial documents in general do not reflect important properties of natural text document collections, e.g. conformance to Zipf's law and Heaps' law. Because many algorithms for natural language processing make extensive use of these properties, artificial text document collections are in general not well suited for testing the performance and scalability of NLP programs.
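For reference, the two laws can be stated as follows (standard formulations; the symbols are not taken from the application). If f(r) denotes the frequency of the word with rank r in a corpus, and V(n) the vocabulary size after n terms have been read, then:

```latex
\text{Zipf's law:}\quad f(r) \propto r^{-s},\ s \approx 1
\qquad\qquad
\text{Heaps' law:}\quad V(n) = K\,n^{\beta},\ 0 < \beta < 1
```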
It is therefore highly desirable to create artificial text document collections which are large enough and have the essential properties of natural text document collections. These properties may be either specified by the user or learned from a set of training documents.
U.S. Pat. No. 5,418,951 discloses a method for retrieving text documents that uses language modelling prior to comparing a query with the documents contained in a database. This method includes the steps of sorting the documents of a database by language or topic and of creating n-grams for each document and for the query. The comparison between the query and the documents is performed on the basis of the n-grams.
U.S. Pat. No. 5,467,425 discloses a system and a method for creating a language model which is usable in speech or character recognizers, language translators, spelling checkers or other devices. This system and method comprise an n-gram language modeller which produces n-grams from a set of training data. These n-grams are separated into classes. A count is determined for each n-gram indicating the number of times the n-gram occurs in the training data, a class being defined by a threshold value. Complement counts identify those n-grams not previously associated with a class; such n-grams are assigned to a class if their counts exceed a second threshold value. The system and method use these factors to determine the probability of a given word occurring given that the previous two words have occurred.
An objective of the invention is to provide a method for modelling and analyzing text documents and for generating large amounts of new documents having the essential properties of natural text document collections.
The invention, as defined in the claims, comprises the steps of collecting a set of text documents as training documents and choosing a language model to be used. New documents are generated by using this model and by using additional words beyond the words contained in the training documents. The new documents have the same document length distribution as the training documents. To secure the quality of the new documents, it is determined whether the deviation of the word frequency as a function of word rank from Zipf's law, and the deviation of the vocabulary growth as a function of the number of terms from Heaps' law, are below user-defined thresholds. Only those new documents are accepted which fulfil these conditions.
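A minimal Python sketch of such an acceptance test, assuming whitespace tokenization; the particular deviation measures, the illustrative Heaps parameters k and beta, and the threshold values are assumptions for demonstration, not taken from the claims:

```python
import math
from collections import Counter

def zipf_deviation(tokens):
    """Mean log-log deviation of word frequency vs. rank from Zipf's law
    (frequency roughly proportional to 1/rank)."""
    freqs = sorted(Counter(tokens).values(), reverse=True)
    c = freqs[0]  # Zipf prediction: freq(rank) ~ c / rank
    devs = [abs(math.log(f) - math.log(c / r))
            for r, f in enumerate(freqs, start=1)]
    return sum(devs) / len(devs)

def heaps_deviation(tokens, k=10.0, beta=0.5):
    """Mean relative deviation of the vocabulary growth from Heaps' law
    V(n) = k * n**beta (k and beta are illustrative defaults; in practice
    they would be fitted to the training documents)."""
    seen, devs = set(), []
    for n, tok in enumerate(tokens, start=1):
        seen.add(tok)
        predicted = k * n ** beta
        devs.append(abs(len(seen) - predicted) / predicted)
    return sum(devs) / len(devs)

def accept_document(text, zipf_threshold=0.5, heaps_threshold=0.5):
    """Accept a generated document only if both deviations are below
    the user-defined thresholds."""
    tokens = text.lower().split()
    return (zipf_deviation(tokens) < zipf_threshold
            and heaps_deviation(tokens) < heaps_threshold)
```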
According to one aspect of the invention, n-gram probabilities are used as the language model. According to another aspect of the invention, a probabilistic context-free grammar (PCFG) is used as the language model.
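For reference, an n-gram language model approximates the probability of a word sequence by conditioning each word on its n-1 predecessors only (the standard formulation, not quoted from the application):

```latex
P(w_1, \dots, w_m) \approx \prod_{i=1}^{m} P\left(w_i \mid w_{i-n+1}, \dots, w_{i-1}\right)
```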
The invention, as defined in the claims, further provides modelling of structured text documents by combining language models for modelling the text with probabilistic deterministic finite automata (PDFA) for modelling the structure of the documents. The language models under consideration are n-gram models and probabilistic context-free grammar (PCFG) models. The combined models are used to generate new documents from scratch or by using the results of an analysis of a set of training documents. Since the models reflect various essential features of natural structured document collections, these features are carried over into the generated document collections, which are therefore suited to evaluate the performance and scalability of natural language processing (NLP) algorithms relying on these features.
Embodiments of the invention are subsequently described with reference to drawings which show:
Alternatively, in cases where no training documents are available, the document generation may take place by using pre-computed model data which contain the terms and the probabilities of an n-gram model as shown in
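A minimal Python sketch of generation from such pre-computed model data, here a bigram table; the terms, probabilities and stop condition are illustrative assumptions:

```python
import random

# Illustrative pre-computed model data: bigram successor probabilities
# (terms and probabilities would normally be loaded from a model file).
model = {
    "the":       [("system", 0.4), ("document", 0.6)],
    "system":    [("generates", 1.0)],
    "document":  [("generates", 0.3), ("collection", 0.7)],
    "generates": [("the", 1.0)],
}

def generate_text(model, start="the", max_words=20):
    """Random walk over the bigram table: each next word is drawn
    according to its conditional probability given the previous word."""
    words = [start]
    while len(words) < max_words and words[-1] in model:
        successors, weights = zip(*model[words[-1]])
        words.append(random.choices(successors, weights=weights)[0])
    return " ".join(words)

print(generate_text(model))
```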
In the following an example of a text is shown which has been generated according to the invention by using an n-gram model with a set of training documents:
Although this example does not represent a meaningful text, it comprises the essential properties of a natural text and may thus form part of a text document of the type that allows generating large amounts of documents for test purposes. A closer consideration of this text shows that its sentences make some sense, but the grammar is incorrect in most sentences. Another effect is that parentheses and quotes do not match, since n-gram models cannot capture such dependencies, which lie outside their scope. Furthermore, it is apparent that an n-gram model may generate rather long sentences. However, these characteristics are not harmful when such texts are used in documents forming part of a huge collection of text documents for test purposes, as explained above.
Subsequently, an alternative process of modelling the document text by using a probabilistic context-free grammar (PCFG) is described by reference to
In step 61 a modification is applied to the selected PCFG. Such modifications comprise various operations applied to the text and structure elements of the PCFG, including concatenation, classing, repetition and specialization. In the following step 62 an objective function OF is calculated. The objective function OF may be stated as the probability p(G|O) of a PCFG G for a given set O of training elements. Step 63 keeps the modification if the value of the objective function is increased. In step 64 the objective function OF is checked to determine whether it is smaller than a user-defined threshold. If necessary, a post-processing may be applied to the inferred grammar. If the objective function OF is above the user-defined threshold, the modified PCFG is used in step 65 to generate new documents. The document generation step 65 may include the addition of new words as described above with reference to
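The loop of steps 61 to 65 may be sketched in Python as follows; the modification operators and the objective function p(G|O) are passed in as abstract callables, since their concrete forms are not reproduced here:

```python
import random

def infer_pcfg(grammar, training_docs, objective, modifications, threshold,
               max_iterations=1000):
    """Hill-climbing grammar inference: apply a modification (concatenation,
    classing, repetition or specialization), keep it only if the objective
    function p(G|O) increases, and stop once the threshold is exceeded."""
    best = objective(grammar, training_docs)
    for _ in range(max_iterations):
        candidate = random.choice(modifications)(grammar)  # step 61
        score = objective(candidate, training_docs)        # step 62
        if score > best:                                   # step 63
            grammar, best = candidate, score
        if best >= threshold:                              # step 64
            break
    return grammar  # used in step 65 to generate new documents
```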
Alternatively, document generation may take place by using a probabilistic context-free grammar (PCFG) model directly to generate new documents. Referring to
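A minimal PCFG sampler in Python; the toy grammar, its symbols and its rule probabilities are assumptions for demonstration only:

```python
import random

# Illustrative toy PCFG: each non-terminal maps to a list of
# (right-hand side, probability) rules.
PCFG = {
    "S":  [(["NP", "VP"], 1.0)],
    "NP": [(["the", "N"], 1.0)],
    "N":  [(["system"], 0.5), (["document"], 0.5)],
    "VP": [(["V", "NP"], 1.0)],
    "V":  [(["generates"], 1.0)],
}

def expand(symbol):
    """Recursively expand a symbol, picking rules by their probabilities."""
    if symbol not in PCFG:  # terminal symbol
        return [symbol]
    rules, weights = zip(*PCFG[symbol])
    rhs = random.choices(rules, weights=weights)[0]
    return [word for part in rhs for word in expand(part)]

print(" ".join(expand("S")))  # e.g. "the system generates the document"
```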
The subsequent description relates to the generation of structured document text.
Furthermore, separating the structure models from the language models is more flexible. It allows using different language models for different fields of the documents; e.g. a simple n-gram model may be used for titles or section names, and a more complete grammar-based model may be used for paragraphs.
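A sketch of this separation in Python: each field type is mapped to its own text generator (the field names and the stand-in generators are illustrative assumptions):

```python
import random

def ngram_title():
    """Stand-in for a simple n-gram title generator."""
    return random.choice(["Document Generation", "Text Collections for Testing"])

def pcfg_paragraph():
    """Stand-in for a grammar-based paragraph generator."""
    return "the system generates the document collection"

# Each field type is served by its own language model: a simple model
# for titles and section names, a grammar-based model for paragraphs.
FIELD_MODELS = {"title": ngram_title, "h1": ngram_title, "p": pcfg_paragraph}

def fill_field(field_type):
    """Generate the text of one structure element with its assigned model."""
    return FIELD_MODELS[field_type]()

print(fill_field("h1"))
```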
For the purpose of the subsequent description it is assumed that the SGML document type of the training documents is known and well defined. A simple approach for generating documents using this document type would be to use the well-known “Document Type Definition” (DTD), which defines a set of rules specifying the top-level tag, how tags should be nested, etc., and thus can be easily converted into a context-free grammar. This approach would produce correct documents with respect to both the SGML syntax and the DTD rules. It is however insufficient, especially in cases where the DTD covers a broad range of different document structures. A prominent example of a DTD is HTML. However, the tag definitions in HTML do not have clear semantics: HTML allows many ways to use the different tags, but only a few of these uses make sense. Therefore, generating documents using only the DTD would produce correct but mostly unrealistic documents.
In order to describe the document structure modelling as used according to the invention in more detail, the following HTML example is considered: Within the <body> element the following tags (among others) can be nested:
Herein the convention of the Extensible Markup Language (XML) is used, where empty tags (i.e. tags that contain no children) end with /> instead of >. The fields represent the structure elements of the documents, consisting of the text chunks between start and end tag, with the tag name defining the field type.
Using the DTD in this case would generate documents with <body> tags whose content would start with <br> or <h1> with equal probabilities although the second case makes much more sense and thus should be more probable.
An improvement of this modelling would be to probabilize the grammar generated by the DTD by giving more weight to rules that actually occur in the training documents. In the considered example, this would mean that a <h1> element occurring within the <body> element is assigned a higher probability than the one calculated by using only the DTD. However, this approach still has a drawback: within the <body> element, the previously cited elements can occur one or more times, as defined by the DTD, but only certain sequences make sense. For example, a sequence of line breaks is not realistic, while a sequence of paragraphs makes sense. This kind of information is missing from the DTD. It is possible, at the expense of compactness, to construct DTDs that avoid such shortcomings, but the training documents to be processed have pre-existing DTDs, especially in the case of HTML.
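The first part of this improvement, weighting rules by observed usage, may be sketched in Python as follows; the use of html.parser and the short list of void tags are illustrative choices, not prescribed by the application:

```python
from collections import Counter, defaultdict
from html.parser import HTMLParser

class ChildCounter(HTMLParser):
    """Counts, for each parent tag, how often each child tag occurs,
    so that DTD rules can be weighted by observed usage."""
    def __init__(self):
        super().__init__()
        self.stack = []
        self.counts = defaultdict(Counter)

    def handle_starttag(self, tag, attrs):
        if self.stack:
            self.counts[self.stack[-1]][tag] += 1
        if tag not in ("br", "img", "hr"):  # illustrative void tags
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

def rule_probabilities(html_docs):
    """Turn raw child counts into conditional probabilities per parent tag."""
    parser = ChildCounter()
    for doc in html_docs:
        parser.feed(doc)
    return {parent: {child: n / sum(c.values()) for child, n in c.items()}
            for parent, c in parser.counts.items()}

probs = rule_probabilities(["<body><h1>Title</h1><p>Text</p><p>More</p></body>"])
# probs["body"] -> {"h1": 1/3, "p": 2/3}: <h1> and <p> receive realistic weights
```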
Since the DTD does not give sufficient information for modelling document structure, an inference framework for the document structure is used which takes into account that in comparison to human language the markup language is fixed and small, and that a context-free grammar describing SGML markup cannot be ambiguous since there can be only one possible parse.
For the document generation including the analysis of training documents, the document structure is defined by the use of a probabilistic deterministic finite automaton (PDFA). The PDFA is conditioned on the training documents, i.e. the transition probabilities between the states of the PDFA are determined and thereafter used to generate the structure of the new documents. As a result, new structured documents are obtained.
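A minimal Python sketch of such a PDFA over the tag sequences of the training documents, where, as a simplifying assumption made here for illustration, a state is identified with the previous tag:

```python
import random
from collections import Counter, defaultdict

def train_pdfa(tag_sequences):
    """Estimate transition probabilities between states from the
    training documents' tag sequences."""
    counts = defaultdict(Counter)
    for seq in tag_sequences:
        for prev, nxt in zip(["START"] + seq, seq + ["END"]):
            counts[prev][nxt] += 1
    return {s: {t: n / sum(c.values()) for t, n in c.items()}
            for s, c in counts.items()}

def generate_structure(pdfa):
    """Walk the automaton from START to END, drawing each transition
    according to its estimated probability."""
    state, tags = "START", []
    while state != "END":
        state = random.choices(list(pdfa[state]),
                               weights=pdfa[state].values())[0]
        if state != "END":
            tags.append(state)
    return tags

pdfa = train_pdfa([["h1", "p", "p"], ["h1", "p", "br", "p"]])
print(generate_structure(pdfa))  # e.g. ['h1', 'p', 'p']
```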
Referring back to
Preferably the language model is trained with a set of documents that have a similar document structure. This corresponds to realistic conditions, as represented, for example, by HTML pages from a number of related web sites. On this basis, the generated documents exhibit the same structure as the training documents.
In cases where no training documents are available, the language model may directly generate documents using the DTD as the base grammar, without weighting the possible alternatives in any way. Thus, every valid document has the same probability of being generated. For some very structured DTDs, such as XML databases, this would be sufficient. The text parts are then generated as described above by using n-gram or PCFG models. This is illustrated in
While the invention is disclosed with reference to the described embodiments, modifications or other implementations of the invention are within the scope of the invention as defined in the claims.
Number | Date | Country | Kind |
---|---|---|---
04106536.8 | Dec 2004 | DE | national |