The present application is also related to a commonly-owned patent application entitled “Selection of Interrogation Wavelengths in Optical Bio-Detection Systems” by Pierre C. Trepagnier, Matthew B. Campbell and Philip D. Henshaw filed concurrently herewith (Attorney Docket No. 101335-36). This concurrently filed application is also incorporated herein by reference in its entirety.
The present invention relates generally to methods and systems for determining characteristics of a text, such as the language or languages in which it is written, its subject matter, or its author.
Traditionally, many document categorization methods have relied on high-level identifiers such as words, sentences, punctuation, and paragraphs for this task (these methods are often known as “stylometric”). Depending on the application, however, these methods have several drawbacks. For example, they depend on natural-language characteristics, and hence require a linguist or polyglot for initial setup. Further, they can be sensitive to misspellings, variants, synonyms, and inflected forms, and they tend to be language-specific.
More recently, many researchers have found that features of a text, such as its subject matter or the language in which it is written, can be deduced from the frequency distributions of n-grams, which are defined as runs of n consecutive characters in a text. Unlike stylometric methods, methods that rely on n-gram frequency distributions do not require that a text under analysis be “understood.” In fact, n-gram frequency distributions can be generated mechanically without any need to understand the text.
The traditional methods utilizing n-gram frequency distributions have shortcomings of their own. For example, due to the large number of possible characters in a text, the potential n-gram space is very large: using the 7-bit ASCII character set, 128⁴=268,435,456 distinguishable 4-grams could in principle be created. Even though most of these are never encountered in practice, several thousand separate 4-grams can appear in a good-sized text. This creates a very high-dimensional analysis space in which to classify the text, one which cannot be easily visualized and whose analysis can be computationally intensive.
Accordingly, there is a need for enhanced methods and systems for characterizing texts.
The present invention is generally directed to methods and systems for text processing, and particularly to characterizing one or more attributes of a text, such as its language and/or author. In many embodiments, principal component analysis (PCA) can be applied to the n-gram frequency distributions derived from a text under analysis. In general, PCA produces a set of principal components, orthonormal eigenvalue/eigenvector pairs that explain the variance present in a data set; in other words, it defines a new set of axes that best suit the data. In high-dimensional data sets, it is often found that relatively few principal components (PCs) can explain the vast majority of the variance present. In many embodiments of the present invention for n-gram text classification, it has been found that essentially all the important information in the n-grams can be captured in the first ten or so principal components, even though the raw n-gram frequency distributions can have thousands of variables.
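By way of a non-limiting illustration, the following Python sketch (using synthetic data; all dimensions and names are illustrative assumptions, not values from this application) shows how a principal component transformation of a high-dimensional “n-gram frequency” matrix concentrates most of the variance in the first few components:

```python
import numpy as np

# Synthetic "n-gram frequency" matrix: many correlated columns driven by
# only a few latent factors, to illustrate that the first few principal
# components explain most of the variance.
rng = np.random.default_rng(0)
n_texts, n_ngrams, n_latent = 200, 1000, 10

latent = rng.normal(size=(n_texts, n_latent))
mixing = rng.normal(size=(n_latent, n_ngrams))
freqs = latent @ mixing + 0.1 * rng.normal(size=(n_texts, n_ngrams))

# Principal component transformation via SVD of the mean-centered matrix.
centered = freqs - freqs.mean(axis=0)
_, s, _ = np.linalg.svd(centered, full_matrices=False)
explained = (s ** 2) / (s ** 2).sum()
print(f"variance explained by first 10 PCs: {explained[:10].sum():.3f}")
```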
As discussed in more detail below, a further advantage of PCA is that the training aspect of the algorithm (in which the principal component transformation is calculated, and which can be computationally intensive) can be done separately from the analysis of a text under study, which can be accomplished relatively quickly.
In one aspect, the present invention provides a method for characterizing a text, which includes determining a frequency distribution for a plurality of n-grams in at least a segment of a text, and applying a principal component transformation to the frequency distribution to obtain a principal component vector, in a principal component (PC) space, corresponding to the text segment. The principal component vector can be compared with one or more decision rules to determine an attribute of the text segment, such as its authorship, its language and/or its topic.
In a related aspect, the decision rules can be based on assigning different attributes to different regions of the PC space. For example, different regions of the PC space can be associated with different languages, and the language of a text under analysis can be identified by considering in which region the principal component vector associated with the text lies. In some cases, a decision rule can be based on an angle between a reference principal component vector and the principal component vector associated with a text under analysis. For example, a reference principal component vector can be associated with a text authored by a known individual, and that individual can be identified as the author of a text segment under analysis if the angle between a PC vector associated with the text segment and the reference PC vector is less than a predefined value.
In some cases, for each of a plurality of n-gram groupings, frequency distributions for at least two reference texts are determined, where one text exhibits an attribute of interest and the other lacks that attribute. A principal component transformation is performed on each of the frequency distributions so as to generate a plurality of principal component vectors corresponding to the texts for each n-gram grouping, and a metric is defined based on the principal component transformation to rank order the n-gram groupings. By way of example, the metric can be based on a minimum angle between the principal component vectors corresponding to the two reference texts. The n-gram groupings can be rank ordered based on values of the metric corresponding thereto. For example, a higher rank can be assigned to an n-gram grouping associated with a larger minimum angle. Further, one or more n-gram groupings having the highest ranks can be selected for characterizing texts.
In another aspect, a method of comparing two textual documents is disclosed. In such a method, for each of at least two textual documents, the frequency distribution for a plurality of n-grams in at least a segment of the document is determined to generate a frequency histogram of the n-grams. Further, for each document, a principal component transformation is applied to the respective frequency histogram to obtain a principal component vector. At least one attribute (e.g., language or authorship) is compared between the documents based on a comparison of their principal component vectors. For example, the two documents can be characterized as having been written in the same language if an angle between their principal component vectors is less than a predefined value or both vectors lie in a region of the PC space associated with a given language.
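A minimal sketch of such a comparison rule is given below, assuming the two documents’ PC vectors have already been computed; the function name and the 30-degree threshold are hypothetical placeholders, not values from this application:

```python
import numpy as np

def same_attribute(pc_a: np.ndarray, pc_b: np.ndarray,
                   max_angle_deg: float = 30.0) -> bool:
    """Deem two documents alike if their PC vectors subtend a small angle."""
    cos = np.clip(pc_a @ pc_b / (np.linalg.norm(pc_a) * np.linalg.norm(pc_b)),
                  -1.0, 1.0)
    return float(np.degrees(np.arccos(cos))) < max_angle_deg
```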
In another aspect, the invention provides a system for processing textual data, which includes a module for determining, for each of a plurality of n-gram groupings, an occurrence frequency distribution corresponding to the n-gram members of that grouping for at least two reference texts, wherein one text exhibits an attribute of interest and the other lacks that attribute. The system can further include an analysis module receiving the frequency distributions and applying a principal component transformation to those distributions so as to generate a plurality of principal component vectors corresponding to the reference texts for each n-gram grouping. The analysis module can determine, for each n-gram grouping, a minimum angle between the principal component vectors of the texts corresponding to that grouping. Further, the analysis module can rank order the n-gram groupings based on the minimum angles corresponding thereto, e.g., by assigning a higher rank to a grouping that is associated with a larger minimum angle.
Further understanding of the invention can be obtained by reference to the following detailed description, in conjunction with the associated figures, described briefly below.
The present invention generally provides methods and systems that employ transformation of the n-gram frequency distributions of a text into principal component (PC) space for characterizing the text, as discussed in more detail below. In some embodiments, a subset of all possible n-grams is selected that is best suited for characterizing a text under analysis. The selection of such a subset of n-grams is analogous to the selection of a plurality of wavelengths for interrogating a sample, as discussed in the co-pending patent application entitled “Selection of Interrogation Wavelengths in Optical Bio-detection Systems,” which is herein incorporated by reference. Hence, in the following discussion, methods for selecting such wavelengths are discussed first; further details can be found in the aforementioned patent application.
As discussed in more detail below, in many embodiments, a metric is defined based on the transformation of spectral data into the principal component space that will allow selecting a subset of excitation wavelengths that provide optimal separation of agents and interferents. The metric can provide a measure of the separation between the principal component vectors of agents and those of the interferents. By way of example, in some embodiments, the metric can be based on spectral angles between the principal component vectors of the agents and interferents.
With reference to the accompanying flow chart, in an initial step (1), spectral data corresponding to a full set of excitation wavelengths are obtained for each of a plurality of agents and interferents.
In a subsequent step (2), for each of the agents and interferents, a subset of the spectral data corresponding to a grouping of excitation wavelengths is chosen. The number of wavelengths in each grouping can correspond to the number of optical wavelengths whose selection is desired. For instance, consider a case in which there are 20 excitation wavelengths in a full set of spectral data, and the best four wavelengths (i.e., the four wavelengths out of 20 that provide optimal results) need to be identified. As the number of combinations of n things (here wavelengths) taken k at a time, C(n,k), with n=20 and k=4 is 4845, there are 4845 distinct 4-member groupings of the wavelengths. These combinations can be ordered according to some arbitrary scheme; the first one is picked, and the method moves to step (3).
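The enumeration of such groupings can be sketched in a few lines of Python; the use of itertools here is an illustrative implementation choice, not a requirement of the method:

```python
from itertools import combinations

# Sketch of step (2): enumerate every 4-member grouping of 20 excitation
# wavelengths; C(20, 4) = 4845 distinct groupings.
wavelength_indices = range(20)
groupings = list(combinations(wavelength_indices, 4))
assert len(groupings) == 4845
```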
In step (3), a principal component transformation is applied to the subset of the data corresponding to a respective wavelength grouping to transform the data in each subset into the principal component (PC) space. The calculation of the principal component transformation can be performed, e.g., according to the teachings of the copending patent application entitled “Agent Detection in the Presence of Background Clutter,” having a Ser. No. 11/541,935 and filed on Oct. 2, 2006, which is herein incorporated by reference in its entirety. The principal component analysis can provide an eigenvector decomposition of the spectral data vector space, with the vectors (the “principal components”) arranged in the order of their eigenvalues. There are generally far fewer meaningful principal components than nominal elements in the data vector (e.g., neighboring fluorescence wavelengths are typically highly correlated). In many embodiments, only meaningful PC vectors are retained. Many ways to select the PC vectors to be retained are known in the art. For example, a PC vector can be identified as meaningful if multiple measurements of the same sample (replicates) continue to fall close together in the PC space. In many bio-aerosol embodiments, the number of meaningful PC vectors can be on the order of 7-9, depending on the exact nature of the data set.
The principal component transformation of the subset of spectral data corresponding to an agent or an interferent generates a principal component vector for that agent or interferent associated with that subset of data and its respective excitation wavelengths. In this manner, for the wavelength grouping, a set of principal component vectors are generated for the agents {Ai} and a set of principal component vectors are generated for the interferents {Ii}.
In step (4), for the selected wavelength grouping, spectral angles (SAij) (index i refers to agents and j to interferents) between the principal component vectors of the agents and those of the interferents, obtained as discussed above by applying a principal component transformation to the spectral data associated with that wavelength grouping, are calculated. By way of example, the spectral angle between two such principal component vectors a and b (that is, between an agent vector and an interferent vector) can be defined by utilizing the normalized dot product of the two vectors as follows:
SA=cos⁻¹(a.b/(|a||b|)) Eq. (1)
wherein
a.b represents the dot product of the two vectors, and
|a| and |b| represent, respectively, the lengths of the two vectors.
In many cases the principal component vectors are multi-dimensional and the above dot product of two such vectors (a and b) is calculated in a manner known in the art and in accordance with the following relation:
a.b=a1b1+a2b2+ . . . +anbn Eq. (2)
wherein
(a1, a2, . . . , an) and (b1, b2, . . . , bn) refer to the components of the a and b vectors, respectively.
Further, the norm of such a vector (a) can be defined in accordance with the following relation:
|a|=√(|a1|²+|a2|²+ . . . +|an|²) Eq. (3)
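Eqs. (1)-(3) can be collected into a short Python sketch; the function name and the use of numpy are illustrative assumptions rather than part of the disclosed method:

```python
import numpy as np

def spectral_angle(a: np.ndarray, b: np.ndarray) -> float:
    """Spectral angle (in degrees) between two PC vectors per Eqs. (1)-(3)."""
    dot = float(a @ b)                                  # Eq. (2)
    norms = np.linalg.norm(a) * np.linalg.norm(b)       # Eq. (3)
    cos_sa = np.clip(dot / norms, -1.0, 1.0)            # guard against rounding
    return float(np.degrees(np.arccos(cos_sa)))         # Eq. (1)

# Orthogonal vectors yield SA = 90 degrees, the easiest separation.
print(spectral_angle(np.array([1.0, 0.0]), np.array([0.0, 1.0])))  # 90.0
```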
Further details regarding the calculation of spectral angles between principal component vectors can be found in the aforementioned patent application entitled “Agent Detection in the Presence of Background Clutter.” This patent application presents a rotation-and-suppress (RAS) method for detecting agents in the presence of background clutter in which such spectral angles act as the metric of separability, with a SA of 90° (orthogonal) corresponding to the easiest separation.
The spectral angles between the agent vectors and the interferent vectors are used herein to define a metric (an objective function) for selecting an optimal grouping of excitation wavelengths. In particular, with continued reference to the flow chart, in step (5), the minimum spectral angle (SAmin) over all agent-interferent pairs, that is, the smallest of the SAij values, is determined for the selected wavelength grouping.
In step (6), the SAmin for the data subset is stored, e.g., in a temporary or permanent memory, along with a subset identifier (an identifier that links each subset (distinct wavelength grouping) with a SAmin associated therewith).
The same procedure is repeated for all the other wavelength groupings and their associated data subsets, with the SAmin of each wavelength grouping identified and stored. In many implementations, the calculation of all SAmins can be done via an iterative process: after an SAmin is calculated, it is determined whether any additional SAmins need to be calculated, and if so, the calculations are performed. With modern digital computers, an exhaustive search is not prohibitive, although various empirical hill-climbing techniques, genetic algorithms, and the like could also be used. Such techniques are particularly useful in the methods of text characterization discussed below, where the number of possible n-grams can be in the thousands, rendering exhaustive searches prohibitive in many cases.
Once all the SAmins are calculated (e.g., in the case in which there are 20 excitation wavelengths there would be 4845 SAmins), they can be compared as discussed below to identify the “optimal” wavelength grouping.
In step (7), the wavelength groupings (data subsets) are rank ordered in accordance with their respective SAmins with higher ranks assigned to those having greater SAmins. In other words, for any two wavelength groupings the one that is associated with a greater SAmin is assigned a greater rank. A higher rank is indicative of providing a better spectral separation between the agents and interferents.
In step (8), one or more of the wavelength groupings with the highest ranks can be selected for use as excitation wavelengths in optical detection methods, such as those disclosed in the aforementioned patent application entitled “Agent Detection in the Presence of Background Clutter.” For example, in the above example in which four wavelengths from a list of 20 need to be selected, the “best” set of four wavelengths can be computed, in the sense of those that give the best separation between agents and interferents. In some cases, the SAmin computed for the full ensemble of wavelengths (e.g., 20 in the above example), as well as the SAmin computed for a subset of the wavelengths (e.g., 4 in the above example), can be utilized to obtain a direct, quantitative measure of the extent to which the selection of the subset of the wavelengths affects differentiation of agents and interferents in the PC space.
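The following Python sketch is one possible, simplified rendering of steps (3)-(8): it retains a fixed number of variance-ordered principal components (rather than the replicate-based selection of meaningful PCs described above) and performs the exhaustive search over groupings; all function names are hypothetical:

```python
import numpy as np
from itertools import combinations

def sa_min(agent_pcs: np.ndarray, interferent_pcs: np.ndarray) -> float:
    """Minimum spectral angle (degrees) over all agent/interferent pairs."""
    best = 180.0
    for a in agent_pcs:
        for i in interferent_pcs:
            cos = np.clip(a @ i / (np.linalg.norm(a) * np.linalg.norm(i)),
                          -1.0, 1.0)
            best = min(best, float(np.degrees(np.arccos(cos))))
    return best

def rank_groupings(agents: np.ndarray, interferents: np.ndarray,
                   k: int, n_pcs: int) -> list:
    """Rank all k-wavelength groupings by the SAmin they yield in PC space
    (exhaustive search; the best-separating grouping comes first)."""
    data = np.vstack([agents, interferents])
    scores = {}
    for grouping in combinations(range(data.shape[1]), k):
        sub = data[:, grouping]
        sub = sub - sub.mean(axis=0)              # mean-center the subset
        _, _, vt = np.linalg.svd(sub, full_matrices=False)
        pcs = sub @ vt[:n_pcs].T                  # transform into PC space
        scores[grouping] = sa_min(pcs[: len(agents)], pcs[len(agents):])
    return sorted(scores, key=scores.get, reverse=True)
```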
By way of illustration, the results of applying the wavelength selection embodiment described above to an exemplary spectral data set are depicted in the associated figures.
An analogous mapping in fluorescent excitation-emission analysis can be implemented by plotting the U coefficients back “geographically” onto the locations in the two-dimensional excitation-emission fluorescence space. For example, a linear vector X in spectral data space can be unwrapped from the two-dimensional excitation-emission space according to some regular scheme, for instance, by starting at the shortest excitation wavelength and taking all emission wavelengths from the shortest to the longest, then moving to the next shortest excitation wavelength, and so forth. This scheme can be simply inverted to map the columns of U back into the excitation-emission space.
The transformation matrix U will have a column for every meaningful PC (e.g., 7 columns for 7 meaningful PCs in an exemplary data set), and hence 7 re-mapped excitation-emission plots of the coefficients of U exist, one for each PC. In the present embodiment, however, rather than employing the coefficients of U directly, the standard deviation σ of the coefficients (computed row-wise, across PC number) is utilized. As discussed above, principal component analysis (PCA) can be employed to reduce the dimensionality of a data set, which can include a large number of interrelated variables, while retaining as much of the variation present in the data set as possible. More specifically, applying a principal component transformation to the data set generates a new set of variables, the principal components, which are uncorrelated and ordered so that the first few retain most of the variation present in all the original variables.
As such, if the underlying spectral data at any single excitation-emission point in X were always constant, then no variation would have to be explained, and the corresponding coefficient of U would be zero for all columns. At the other extreme, if any single excitation-emission point were completely uncorrelated with any other excitation-emission point, then it would itself represent irreducible variation and its weight would appear entirely in one column of U. In the former case, the row-wise standard deviation σ of the coefficients would be zero, while in the latter it would be large. Thus, in this embodiment the row-wise standard deviation vector σ (with as many rows as U, but only 1 column) is utilized as a metric for the amount of variation exhibited by its corresponding spectral data, although other metrics of variation could also be used, e.g. variance or range.
As the data set in question can be a representative sample of agents and/or simulants {Ai} and interferents {Ii}, plotting the vector σ “geographically” back into excitation-emission space gives a measure of how much each area of the excitation-emission space contributes to discrimination between the agents and the interferents.
In a subsequent step (2), a transformation matrix (U) for effecting the principal component transformation is calculated for the data set, e.g., in a manner discussed above, and the data are transformed into the principal component (PC) space. As noted above, further details regarding the principal component transformation can be found in the teachings of the aforementioned pending patent application “Agent Detection in the Presence of Background Clutter.” In step (3), the number of meaningful (non-noise) PC vectors is identified. In general, only meaningful PC vectors are retained. In many bio-aerosol fluorescence cases, the number of retained PC vectors can be on the order of 7-9, depending on the exact nature of the data set. The number of meaningful PCs is herein denoted by N.
In step (4), the standard deviations of the coefficients of the first N columns of transformation matrix U are calculated, as discussed above. In some implementations, the standard deviations are then normalized (step 5), e.g., by the mean value of U to generate fractional standard deviations. In alternative implementations, the normalization step is omitted.
In step (6), the standard deviations are mapped back onto the excitation-emission space, e.g., in a manner discussed above. The excitation wavelengths can then be rank ordered (step 7) based on the standard deviations, with the wavelengths associated with larger standard deviations attaining higher ranks. The excitation wavelengths that correspond to the largest values of the standard deviations, that is, those having the highest ranks, are then selected (step 8).
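A schematic numpy rendering of steps (2)-(8) is given below; the optional normalization of step (5) is omitted, and scoring each excitation wavelength by summing σ along the emission axis is an illustrative choice, not a requirement of the method:

```python
import numpy as np

def rank_excitations(X: np.ndarray, n_exc: int, n_em: int,
                     n_meaningful: int) -> np.ndarray:
    """Rank excitation wavelengths by the row-wise standard deviation of the
    PC transformation matrix U, mapped back onto the excitation-emission grid.

    X: (n_samples, n_exc * n_em) spectra, unwrapped as described in the text
    (all emission wavelengths for the first excitation, then the next, ...).
    """
    centered = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    U = vt.T[:, :n_meaningful]              # one column per meaningful PC

    sigma = U.std(axis=1)                   # row-wise std across PC number
    sigma_map = sigma.reshape(n_exc, n_em)  # map back "geographically"

    scores = sigma_map.sum(axis=1)          # total variation per excitation
    return np.argsort(scores)[::-1]         # highest-ranked excitation first
```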
Turning again to exemplary embodiments of the methods and systems of the invention for text processing, a classifier is initially determined for a training corpus of texts. As discussed in more detail below, the determination of the classifier can include transforming distributions of n-grams in the training texts into the principal component (PC) space and identifying regions of the PC space with which the relevant types of texts are associated. The classifier can then be utilized to classify a new text. In many embodiments, the classifier is generated once (e.g., off-line) and then utilized multiple times to classify a plurality of new texts (e.g., at run time). In the following description, the generation of the classifier and its associated parameters is also referred to as the training step, and the use of the classifier to classify texts is in some cases referred to as the on-line (or run-time) step.
More specifically, with reference to the accompanying flow chart, in step 1, a training corpus of texts {Ti} exhibiting the attributes of interest is provided.
Assuming there are N texts in the corpus, in step 2, for each text Ti, where i runs from 1 to N, frequency distributions for all n-grams in the text are computed. The term “n-gram” is known in the art and refers to a consecutive sequence of n characters. By way of example, a 2-gram refers to a consecutive sequence of 2 characters, such as {ou} or {aw}, and a 3-gram refers to a consecutive sequence of 3 characters, such as {gen} or {the}. In some embodiments, punctuation marks, such as commas or semicolons, are also considered characters to be included in the n-grams. In some cases, the frequency distribution of the n-grams can be determined by simply bumping a counter for each n-gram encountered, then dividing by the total number of characters (i.e., 1-grams) in Ti. Generally, in the corpus {Ti}, many thousands of distinct n-grams will appear.
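A minimal Python sketch of this counting procedure (the function name is illustrative) follows:

```python
from collections import Counter

def ngram_frequencies(text: str, n: int) -> dict:
    """Relative frequencies of all n-grams (runs of n consecutive characters),
    obtained by bumping a counter per n-gram encountered and dividing by the
    total number of characters (1-grams) in the text."""
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = len(text)
    return {gram: c / total for gram, c in counts.items()}

print(ngram_frequencies("the theme", 3))  # e.g. 'the' occurs twice: 2/9
```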
Preferably, in some cases, in step 3, a subset of the n-grams can be selected according to some criterion for use in the subsequent steps. By way of example, a minimum frequency cut-off can be employed to select the subset (n-grams whose occurrence frequencies are less than the minimum are not included in the subset). Further details regarding such a frequency cut-off criterion can be found in an article entitled “Quantitative Authorship Attribution: An Evaluation of Techniques,” authored by Jack Grieve and published in Literary and Linguistic Computing, v. 22, pp. 251-270 (September 2007), which is herein incorporated by reference in its entirety.
More preferably, in some cases, the method discussed above for the selection of an optimal subset of wavelengths can be adapted to select a subset of n-grams. More specifically, n-grams can be treated completely analogously to the interrogation wavelengths discussed above, with the retained subset of n-grams chosen according to a criterion that maximizes separation in the PC space. For example, in cases in which classification of texts based on their language is desired, a subset of n-grams that maximizes separation between principal component vectors corresponding to different languages can be chosen.
In some implementations, the mean and standard deviation of the N n-gram frequency distributions (one for each text Ti) previously found are computed. For each n-gram frequency distribution, the mean distribution is subtracted (an operation referred to as “mean-centering” in the PCA literature) and the result is divided by the standard deviation to generate a scaled frequency distribution (step 4). Further, the mean and the standard deviation of the n-gram frequency distributions can be stored (step 5) for subsequent use in processing texts. In other implementations, the n-gram frequency distributions are employed in the subsequent steps discussed below without such scaling.
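A brief sketch of this mean-centering and scaling (steps 4-5), using a placeholder frequency matrix with illustrative dimensions, might read:

```python
import numpy as np

# Placeholder frequency matrix F: one row per training text, one column per
# retained n-gram (dimensions are illustrative only).
F = np.random.default_rng(0).random((50, 300))

mean, std = F.mean(axis=0), F.std(axis=0)
std[std == 0] = 1.0                 # guard against constant columns
F_scaled = (F - mean) / std         # mean-center, then scale

# mean and std are stored (step 5) for scaling new texts at run time.
```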
In step 6, a PC transformation is computed from the mean-centered and scaled n-gram frequency distributions, e.g., by utilizing the method of singular value decomposition known in the art. The locations in the PC space of the various classes under study are then identified in step 7. For example, in the case of generating a classifier for identifying texts written in different languages, the correspondence of different portions of the PC space with different languages is identified. In general, a decision methodology, e.g., linear discriminant analysis, one based on spectral angles, or the like, is identified for application to the transformation of texts {T} under analysis into the PC space. For example, the decision methodology can be based on comparing the angle between a PC vector of a text under analysis and a PC vector corresponding to a reference text with a predefined threshold value (a decision parameter). The selected subset of n-grams, together with the mean and standard deviation of the n-gram frequencies, the PC transformation matrix, and the decision parameters determined based on the “off-line” training corpus, are all saved (step 5), e.g., in a memory, so that they can be applied to the “on-line” test cases.
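A compact, self-contained sketch of steps 6-7 under simplifying assumptions (ten retained PCs, a per-class centroid as the decision parameter, and hypothetical language labels) might read:

```python
import numpy as np

rng = np.random.default_rng(1)
F_scaled = rng.normal(size=(50, 300))   # scaled n-gram frequencies (placeholder)
labels = np.array(["en", "fr"] * 25)    # hypothetical training labels

# Step 6: PC transformation via singular value decomposition; keep the
# first ~10 principal components.
_, _, vt = np.linalg.svd(F_scaled, full_matrices=False)
W = vt[:10].T                           # transformation matrix
pc_vectors = F_scaled @ W               # training texts in PC space

# Step 7: one simple decision methodology -- a centroid per class, against
# which a new text's PC vector can later be compared (e.g., by angle).
centroids = {lab: pc_vectors[labels == lab].mean(axis=0)
             for lab in set(labels)}

# Save everything needed at run time.
np.savez("classifier.npz", W=W,
         **{"centroid_" + lab: c for lab, c in centroids.items()})
```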
Turning to the run-time classification of a text under analysis, in step 1, frequency distributions for the selected subset of n-grams are determined for at least a segment of the text, e.g., in a manner discussed above in connection with the training step.
In some implementations, in step 2, the n-gram frequency distribution for the text under analysis is preferably offset and scaled by the factors previously determined in the off-line training step, e.g., offset by the mean and scaled by the standard deviation determined for the training corpus of texts.
In step 3, the n-gram frequency distributions of the text under analysis, which have preferably been offset and scaled, are transformed into the principal component space, utilizing the transformation matrix determined off-line, during the training step, based on the corpus of the training texts.
In step 4, the decision rules previously determined in the off-line training step can be used to classify the text. For example, the location of the principal component vector corresponding to the text within the PC space can be utilized to determine an attribute of the text, e.g., the language in which it is written.
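Steps 1-4 of this run-time classification can be sketched as follows, assuming the training artifacts (retained n-gram list, mean, standard deviation, transformation matrix W, and class centroids) have been loaded; the angle-based centroid rule and all names are illustrative:

```python
import numpy as np
from collections import Counter

def classify(text: str, ngrams: list, mean: np.ndarray, std: np.ndarray,
             W: np.ndarray, centroids: dict, n: int = 3) -> str:
    """Run-time sketch of steps 1-4: count n-grams, offset and scale by the
    stored training factors, transform into PC space, and pick the class
    whose centroid subtends the smallest angle."""
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    f = np.array([counts.get(g, 0) / len(text) for g in ngrams])
    pc = ((f - mean) / std) @ W          # steps 2-3

    def angle(a, b):
        cos = np.clip(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)), -1, 1)
        return np.degrees(np.arccos(cos))

    # Step 4: angle-based decision rule against each class centroid.
    return min(centroids, key=lambda lab: angle(pc, centroids[lab]))
```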
The above process for classifying a text can be performed efficiently as all the relevant parameters (e.g., the transformation matrix, decision rules) other than the n-gram frequencies are determined off-line and saved.
By way of illustration, and only to show the efficacy of the methods of the invention for classifying texts, exemplary classification results are presented in the associated figures.
For texts in which the language and subject were the same, it was found that short samples of text clustered by author; an illustration of such clustering is presented in the associated figures.
The methods of the invention for characterizing texts can be implemented via a variety of different systems. By way of example, such a system can include a module for determining n-gram frequency distributions and an analysis module for applying the principal component transformation and the decision rules, as discussed above.
Those having ordinary skill in the art will appreciate that various modifications can be made to the above embodiments without departing from the scope of the invention.
This application claims priority to a provisional application entitled “Selection of Interrogation Wavelengths in Optical Bio-detection Systems,” having a Ser. No. 60/916,480 and filed on May 7, 2007. This provisional application is herein incorporated by reference.