The present disclosure relates generally to identifying trends in a data set and, more particularly, to systems and methods for detecting and coordinating changes in lexical items.
Text streams are ubiquitous and contain a wealth of information, but are typically orders of magnitude too large in scale for comprehensive human inspection. Organizations often collect voluminous corpora of data continuously over time. The data may be, for example, email messages, transcriptions of customer comments or of phone conversations, recordings of phone conversations, medical records, news-feeds, or the like. Analysts in an organization may wish to learn about the contents of the data and the changes that occur over time, including when and why, such that they may understand and/or act upon the information contained within the data. Because of the large volume of data, reading each document in the corpora of data individually to determine the changes and summarize the contents can be expensive as well as difficult or impossible.
The present disclosure describes systems and methods for efficiently detecting step changes, trends, cycles, and bursts affecting lexical items within one or more data streams. The data stream can be a text stream that includes, for example, documents and can optionally be labeled with metadata. These changes can be grouped across lexical and/or metavalue vocabularies to summarize the changes that are synchronous in time. A lexical item can include a single word, a set of words, symbols, numbers, dates, places, named-entities, URLs, textual data, multimedia data, other tokens, and the like. A metavalue can include information about incoming text or other incoming data. Metadata can be external metadata or internal metadata. External metadata can include facts about the source of the document. Internal metadata can include labels inferred from the content. Examples of metavalues include, but are not limited to, information about the source, geographic location, current event data, data type, telecommunications subscriber account data, and the like.
In one embodiment of the present disclosure, a method for efficiently detecting and coordinating change events in data streams can include receiving a data stream. The data stream can include various lexical items and one or more metavalues associated therewith. The method can further include monitoring a probability of occurrence of the lexical items in the data stream over time according to a lexical occurrence model to detect a plurality of change events in the data stream. The method can further include applying a significance test and an interestingness test. The significance test can be used to determine if the change events are statistically significant. The interestingness test can be used to determine if the change events are likely to be of interest to a user. The interestingness test can be defined using conditional mutual information between the lexical items and the lexical occurrence model given a time span to determine the amount of information that is derived from the change event. The method can further include grouping the change events across the lexical items and the metavalue to summarize the change events that are synchronous in time. The method can further include presenting, via an output device, a summarization of the grouped change events to the user.
In some embodiments, the change events are step changes, trends, cycles, or bursts in the data stream.
In some embodiments, the lexical occurrence model is a piecewise-constant lexical model, for example, based upon a Poisson or other distribution. In other embodiments, the lexical occurrence model is a piecewise-linear lexical model, for example, based upon a Poisson or other distribution. In still other embodiments, the lexical occurrence model includes a piecewise-linear component and periodic component to detect the change events in the data stream for recent data and long-span data, respectively.
In some embodiments, the interestingness test can be defined by the relationship:
I(W:M|T)=H(W|T)−H(W|M,T)
to determine the amount of information that is derived from the change event.
In some embodiments, the method can further include applying the monitoring step in a stream analysis mode. In a stream analysis mode, the lexical occurrence model includes a slowly-evolving periodic component for modeling regular cyclic changes, together with a piecewise-linear component for modeling irregular acyclic changes that may occur over either long or short timescales.
According to another embodiment of the present disclosure, a computer readable medium can include computer readable instructions that, when executed, perform the steps of the aforementioned method.
According to another embodiment of the present disclosure, a computing system for detecting and coordinating change events in data streams can include a processor, an output device, and a memory in communication with the processor. The memory can be configured to store instructions, executable by the processor to perform the steps of the aforementioned method.
As required, detailed embodiments of the present disclosure are disclosed herein. It must be understood that the disclosed embodiments are merely examples of the disclosure that may be embodied in various and alternative forms, and combinations thereof. As used herein, the word “exemplary” is used expansively to refer to embodiments that serve as an illustration, specimen, model or pattern. The figures are not necessarily to scale and some features may be exaggerated or minimized to show details of particular components. In other instances, well-known components, systems, materials or methods have not been described in detail in order to avoid obscuring the present disclosure. Therefore, specific structural and functional details disclosed herein are not to be interpreted as limiting, but merely as a basis for the claims and as a representative basis for teaching one skilled in the art to variously employ the present disclosure.
By way of example and not limitation, consider a flow of text in the form of a stream of documents, each labeled with a time stamp and optionally with metadata, for example, the values of zero or more metavariables of the source. Each document can contain a set of words. The analysis described herein is also applicable to more general lexical items, such as, for example, phrases and non-local conjunctions. Given the enormous volumes of text currently being acquired and stored in many domains, it is impractical for human analysts to scan these volumes in order to find and summarize the important changes that are occurring, especially in a timely manner. Accordingly, the present disclosure provides systems and methods for detecting changes in frequency of occurrence of lexical items, either overall or for particular metavalues, localizing these changes in time, and coordinating changes that are synchronous in time across both lexical and metavalue vocabularies into higher-order events.
The present disclosure approaches the term “event” from a statistical view as would be understood by one skilled in the art. The output of a system according to the present disclosure can be a set of ranked groups, each of which can include one or more sets of lexical items and metavalues together with a description of the timing of the event, which can be a step, trend, cycle, burst, or the like. It is contemplated that the system output can be accompanied by original versions of documents that can be presented to an analyst for inspection.
Aspects of the present disclosure can be applied to documents of any length, although accuracy has been found to increase for documents that are relatively short. Documents can be divided into smaller documents, paragraph by paragraph, sentence by sentence, word by word, or character by character, for example. Some exemplary documents include:
Metadata, if available, is valuable in several respects. Changes are often concentrated in sub-streams of the text flow characterized by particular metavalues. Hence, performing change-detection for individual metavalues or groups thereof focuses the search where necessary and avoids dilution. In addition, distinct groups of changes often overlap in time and share words or metavalues. Also, availability of metadata helps the coordination of changes into distinct events and avoids confusion. From an analyst's perspective, having a change-event labeled with a metavalue or group of metavalues helps to contextualize the change-event and aids in understanding the change-event.
The potential disadvantages of using sub-streams are a loss of power after separating the data into sub-streams for analysis, and additional computational burden. To alleviate these disadvantages, the present disclosure can impose a size limit on the metavalue vocabulary, for example, by grouping metavalues to reduce computational burden. Size limitations, if needed, can depend on the data set and the computational resources available. A metavalue vocabulary size on the order of tens can be preferable to one on the order of hundreds.
Conventional statistical tools can test two predetermined time intervals for whether the frequency of a given lexical item changed. In one embodiment of the present disclosure, neither the time intervals nor the number of changes are predetermined. In one embodiment of the present disclosure, the occurrences of the lexical item in a given text stream are modeled by a Poisson process, and changes are expressed in terms of the intensity of this process. The present disclosure can be fit to other models, such as, but not limited to, processes described by generalized Poisson distributions, binomial distributions, or negative binomial distributions.
The present disclosure provides systems and methods for detecting and coordinating changes of lexical items in the following exemplary respects:
Referring now to the drawings wherein like numerals represent like elements throughout the drawings,
Although the exemplary environment described herein employs the hard disk, it should be appreciated by those skilled in the art that other types of computer readable media which can store data that are accessible by a computer, such as magnetic cassettes, flash memory cards, digital versatile disks, cartridges, RAMs, ROMs, a cable or wireless signal containing a bit stream and the like, can also be used in the exemplary operating environment.
To enable user interaction with the computing system 100, an input device 114 represents any number of input mechanisms, such as a microphone for speech, a touch-sensitive screen for gesture or graphical input, keyboard, mouse, motion input, and the like. An output device 116 can also be one or more of a number of output means, such as a display, monitor, projector, touch screen, multi-touch screen, or other output device capable of presenting results data to an analyst in a visual manner.
In some instances, multimodal systems enable a user to provide multiple types of input to communicate with the computing system 100. A communications interface 118 generally governs and manages the user input and system output. There is no restriction on the present disclosure operating on any particular hardware arrangement and therefore the basic features here may be substituted, removed, added to, or otherwise modified for improved hardware or firmware arrangements as they are developed.
Referring now to
Referring now to
In the stream analysis mode, the CoCITe tool 302 can create permanent segments (permanent models 308) of the lexical occurrence model from temporary models 310 as the span of incoming data moves forward in time. Accordingly, the CoCITe tool 302 can receive data on an on-going basis, analyze the data, and output results to a visualization interface 312 (realized via one or more output devices 116) for presentation to an end user, such as an analyst, in a graph, plot, table, or other visualization. In the stream analysis mode, new data arrives on an on-going basis, existing models are extended and updated, and an arbitrary time-span can be used for visualization.
The stream analysis mode improves efficiency over the retrospective analysis mode because earlier data is already pre-processed for model training and new data can be added expeditiously. The stream analysis mode also decouples optimization of model components. The periodic component changes slowly and the model is thereby trained using smoothed data from a long time-span. The piecewise-linear component may change quickly and the model is thereby trained using fully-detailed recent data.
Referring now to
The method 400 begins and flow proceeds to block 402 wherein one or more data streams including one or more documents each optionally labeled with metadata are received at the CoCITe tool 202, 302. It should be understood that the use of the term “documents” here is merely exemplary and the data stream can alternatively include raw or unformatted text, or other lexical items. Flow can proceed to block 404 wherein a determination is made as to whether a lexical vocabulary is prescribed. If a lexical vocabulary is not prescribed, flow can proceed to block 406 wherein a lexical vocabulary can be discovered. Flow can then proceed to block 408 wherein the probability of occurrence of lexical items in the incoming data streams over time is monitored. If a lexical vocabulary is prescribed, flow can proceed directly to block 408. At block 410, changes can be coordinated across lexical items and metadata. Flow can then proceed to block 412 wherein results can be output for visualization in the form of a graph, plot, table, or other visualization. The method can end.
Referring now to
The method 500 begins and flow proceeds to block 502 wherein one or more data streams including one or more documents each optionally labeled with metadata can be received at the CoCITe tool 202, 302. At block 504, an acyclic component of the lexical occurrence model can be defined such that documents containing a particular lexical item are assumed to occur at a rate described by an intensity function that is piecewise-linear over time. For example, a Poisson distribution model or other distribution models can be used. Each linear piece of the model is referred to herein as a segment. There is no prescribed number of segments. The acyclic component can be used to model step changes, trends, and bursts in the incoming lexical items.
At block 506, an optional cyclic component of the lexical occurrence model can be defined such that a multi-phase periodic modulation can be superimposed on the intensity function. The cyclic component can be used to model regular cyclic changes in rate and can have multiple periods and phases.
At block 508, the acyclic and cyclic model components are optimized using a dynamic programming algorithm. The optimization maximizes the likelihood of the data, which can be computed as the product of the probabilities of the actual data values under the model.
Referring briefly to
At block 510, a significance test for change-points is applied. Various exemplary significance tests are described herein below for a piecewise-constant model and a piecewise-linear model.
At block 512, an interestingness test for change-points is applied. The most significant changes are often not the most interesting. When large amounts of data are received, a ranking based on significance can obscure interesting changes affecting rare events. Accordingly, a measure of interest or otherwise termed “interestingness” can be defined using conditional mutual information between lexical item (W) and model (M) given time (T):
I(W:M|T)=H(W|T)−H(W|M,T)
where H( ) is conditional entropy. The measure of interest measures the amount of information that can be learned from the change in the model, allowing for the fact that the models may each depend on time (a trend segment). The measure of interest is defined to cover all situations and can therefore be used to rank changes consistently. From an analyst's perspective, consistency of the interestingness measure is decisive.
At block 514, the change-points are coordinated. Typically, there is a lot of output from the change-detection procedure. An exemplary method for coordinating changes can identify change-events as graph nodes, create edges between nodes that share words and/or metavalues, run a clustering algorithm, and output a measure of interest ranked list of clusters.
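By way of illustration only, the coordination step can be sketched as follows. The sketch assumes each change-event is a record with hypothetical fields `word`, `metavalue`, `time` (a bin index), and `interest`. It uses connected components (via union-find) as a stand-in for the clustering algorithm, treats two events as synchronous when their bins differ by at most a hypothetical `max_gap`, and ranks each cluster by its most interesting member. None of these choices is prescribed by the present disclosure.

```python
def cluster_events(events, max_gap=1):
    """Group change-events that share a word or metavalue and are close in time."""
    parent = list(range(len(events)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Create an edge between two nodes that share a word and/or a metavalue
    # and whose change-points are approximately synchronous.
    for i, a in enumerate(events):
        for j in range(i + 1, len(events)):
            b = events[j]
            shares = a["word"] == b["word"] or a["metavalue"] == b["metavalue"]
            if shares and abs(a["time"] - b["time"]) <= max_gap:
                parent[find(i)] = find(j)

    groups = {}
    for i, event in enumerate(events):
        groups.setdefault(find(i), []).append(event)

    # Rank clusters by the largest measure of interest they contain (assumed rule).
    return sorted(groups.values(),
                  key=lambda g: max(e["interest"] for e in g),
                  reverse=True)


events = [
    {"word": "outage", "metavalue": "west", "time": 10, "interest": 0.4},
    {"word": "refund", "metavalue": "west", "time": 10, "interest": 0.2},
    {"word": "outage", "metavalue": "east", "time": 11, "interest": 0.3},
    {"word": "billing", "metavalue": "north", "time": 30, "interest": 0.9},
]
clusters = cluster_events(events)
```

In this toy input the three events around bins 10 and 11 form one group, and the isolated "billing" event forms its own, higher-ranked group.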
In addition to the above, an optional bigram check can be implemented. Changes often occur for different words at the same time but for different reasons. Metadata do not always exist and may not be sufficient to separate node clusters. A bigram check can be used to add an edge connecting events with distinct words only if the bigram (document co-occurrence) frequency exceeds a threshold. The bigram check is an effective filter against spurious combinations. The bigram check provides an unbiased estimate of the true frequency of an arbitrary bigram from merged priority-weighted samples of consolidated documents. The bigram check is efficient and reliable and yields no false positives. Most false zeroes have true frequencies that are below threshold values.
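A minimal sketch of such a filter is given below for illustration. It assumes exact document co-occurrence counts computed over the full document set, rather than the priority-weighted sampling estimate described above, and the function names and the raw-count threshold are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(documents):
    """Count, for each unordered word pair, the documents containing both words."""
    counts = Counter()
    for text in documents:
        words = sorted(set(text.lower().split()))  # each word counted once per document
        counts.update(combinations(words, 2))      # pairs come out in sorted order
    return counts


def passes_bigram_check(w1, w2, counts, threshold):
    """Allow an edge between events on distinct words only if they co-occur often enough."""
    if w1 == w2:
        return True  # the check applies only to events with distinct words
    return counts[tuple(sorted((w1, w2)))] > threshold


documents = [
    "power outage west",
    "outage power again",
    "refund request",
    "power bill",
]
counts = cooccurrence_counts(documents)
```

With these four toy documents, "power" and "outage" co-occur in two documents, whereas "refund" and "outage" never co-occur, so an edge between simultaneous changes in the latter pair would be suppressed.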
At block 516, the results are output for visualization. Visualization can be in the form of a graph, plot, table, or other visualization output on one or more output devices 116. The method 500 can end.
Provided below are two exemplary models, a piecewise-constant lexical occurrence model and a piecewise-linear lexical occurrence model. These models are provided for further explanation of the aforementioned systems and methods and are not intended to limit the scope of the appended claims.
In one embodiment of the present disclosure, a piecewise-constant model is used to detect and coordinate changes in lexical items. In this embodiment, a typical source of lexical items, structured into documents, each labeled with a time stamp and optionally with metadata is considered. An assumption is that each document contains a set of lexical items that are of interest. In some embodiments, a prescribed vocabulary is used. In other embodiments, an open-ended vocabulary is used. An open-ended vocabulary can be acquired, for example, as part of the analysis. In still other embodiments, a vocabulary can be seeded with lexical items. The internal structure of each document can be ignored, thereby treating each document or the collective whole of documents as a set of words. Exceptions can include lexical items of interest that are either n-grams or non-local conjunctions of words, in which case the vocabulary of these can be prescribed in advance.
A system of the present disclosure can be used in either a retrospective mode or a streaming mode. In retrospective mode, a corpus of text files is presented for end-to-end processing. In streaming mode, a summary file (previously generated by the system) is presented together with the most recent data. A new or updated summary file can be generated together with the output of the change-detection algorithms. The summary file can contain enough information about the history for the system to be able to reproduce the results as though it were done retrospectively, but in far less time. Data can be carried forward from summary file to summary file until a time horizon is reached which can depend on recent change-points, so the summary file does not grow without bound.
In either mode, the system creates regular bins of data, for example, daily, weekly, monthly, yearly, etc. The system can ignore the arrival time of each document within each bin. For each bin, the system can obtain frequency data: numbers of documents labeled with particular metavalues, and numbers of documents labeled with particular metavalues and containing particular words. The system can ignore multiple occurrences of words within documents. In many instances, the presence of a word in a document is more important than repetitions thereof because repetitions often add little further information.
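The binning described above can be sketched as follows. This is an illustrative sketch only: the document tuple layout, the whitespace tokenizer, and the `bin_of` callback are assumptions rather than features of the disclosed system.

```python
from collections import Counter


def bin_counts(docs, bin_of):
    """Accumulate per-bin frequency data.

    docs   : iterable of (timestamp, metavalue, text) tuples (assumed layout)
    bin_of : function mapping a timestamp to a bin key, e.g. a calendar day
    """
    n = Counter()  # n[(m, t)]: documents labeled with metavalue m in bin t
    f = Counter()  # f[(w, m, t)]: of those, documents containing word w
    for timestamp, metavalue, text in docs:
        t = bin_of(timestamp)  # arrival time within the bin is ignored
        n[(metavalue, t)] += 1
        for w in set(text.lower().split()):  # repeats within a document count once
            f[(w, metavalue, t)] += 1
    return n, f


docs = [
    ("2024-01-01T09:00", "east", "billing error billing"),
    ("2024-01-01T17:30", "east", "billing question"),
    ("2024-01-02T08:15", "west", "outage report"),
]
n, f = bin_counts(docs, bin_of=lambda ts: ts[:10])  # daily bins
```

The first document mentions "billing" twice but contributes only one count, reflecting the choice to record presence rather than repetition.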
Text streams always suffer from missing data. For this reason, the system does not make any assumption that successive bins correspond to regular time increments. If successive bins do correspond to regular time increments, the system can be tolerant of bins that are empty or that contain no data for particular metavalues.
The system analyzes frequencies of lexical items relative to documents. If the number of documents in each bin varies substantially then this can be separately tracked, but of greater interest here is the content of these documents. This makes the analysis more robust to missing data.
By way of example, consider a stream of bins of documents, containing nmt documents labeled with metavalue m in the bin at t, where 1≦m≦M and t is discrete: t=1, . . . , T. Let the (unknown) probability that a document labeled with metavalue m in the bin at t contains word (or lexical item) w be pwmt, and the measured number of documents labeled with metavalue m in the bin at t that contain word w be fwmt. Assume a Poisson model for this quantity, i.e.
fwmt ~ Poi(nmt pwmt)
where the present disclosure temporarily conflates the random variable with the measured value.
In one embodiment, the Poisson parameter pwmt is piecewise-constant in time. Let there be I time segments where the ith segment starts at si and ends at ei=si+1−1, with s1=1 and eI=T. Assume for now that this time-segmentation is known. We also define e0=0 and sI+1=T+1 for convenience, and si, i=2, . . . , I are referred to below as change-points. Let Ti denote the time range [si, ei], and define
For word w and metavalue m the overall log-likelihood is provided by equation (1), below.
For the ith segment, let pwmt be equal to the constant rate rwmi for all t ∈ Ti; then the maximum-likelihood estimate of rwmi is
and using this estimate for each i the log-likelihood becomes equation (2), below.
The second term in equation (2) does not depend on the model or segmentation and can be treated as constant during the optimization.
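For illustration, the following sketch evaluates a constant-rate Poisson segment in the standard way: the maximum-likelihood rate is the total count divided by the total number of documents, and the terms of the log-likelihood that depend only on the data can be set aside as a constant. The function names are hypothetical, and the expressions are the textbook Poisson forms rather than a transcription of equations (1) and (2).

```python
import math


def segment_rate(f, n):
    """Maximum-likelihood constant rate on a segment: total hits / total documents."""
    total_n = sum(n)
    return sum(f) / total_n if total_n else 0.0


def poisson_loglik(f, n, rate):
    """Full Poisson log-likelihood of counts f, given document totals n and a rate."""
    ll = 0.0
    for ft, nt in zip(f, n):
        if nt == 0:
            continue  # an empty bin carries no information about the rate
        ll += ft * math.log(nt * rate) - nt * rate - math.lgamma(ft + 1)
    return ll


def model_term(f, n):
    """Rate-dependent part of the log-likelihood, evaluated at the ML rate."""
    F, N = sum(f), sum(n)
    return F * math.log(F / N) - F if F > 0 else 0.0


f = [3, 5, 4]        # documents containing the word, per bin
n = [100, 120, 110]  # documents per bin
r_hat = segment_rate(f, n)
```

Because the omitted terms are the same for every candidate segmentation, comparing segmentations by `model_term` alone gives the same ordering as comparing them by the full log-likelihood.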
The subscripts w and m are dropped hereinafter for brevity. Suppose that for a word w and a metavalue m, there is a periodic modulation where each bin t is labeled with a phase p from some set P. For example for daily binning P={Monday, . . . ,Sunday}, or for hourly binning P={0, . . . ,23}. More complex forms of cyclic behavior can also be accommodated. There is no requirement for a fixed period on t because of the possibility of missing data or, for example, to accommodate for a monthly variation and the fact that the months have unequal length. In this embodiment, the present disclosure assumes that the time-segmentation is known. Let Tp denote the subset of T with phase p, and Tip denote the subset of Ti with phase p. Also let
In this embodiment, the periodic effect can be represented as
pt = qp ri for t ∈ Tip
where qp≧0 is common for all segments. Because only |P|−1 of these values are independent the present disclosure sets the largest equal to one, and if all the remaining qp also equal one then there is no periodic effect. The present disclosure can also map the phases to a smaller set where the values of qp are similar. For daily binning, for example, it has been found that different behavior is seen at weekends compared with weekdays, but the weekend-days are similar to each other, as are the weekdays. P is then binary. This mapping can be discovered automatically using a dynamic programming algorithm that optimizes both the final number of phases and the mapping.
Now the log-likelihood equation (1) becomes (ignoring the constant term) equation (3), below.
To optimize the model we maximize with respect to ri and qp:
which is zero when ri is represented as shown below in equation (4).
which is zero when
is represented as shown below in equation (5).
These may be solved for the |P|−1 independent values of qp, and hence the present disclosure obtains {ri}i=1, . . . , I using equation (4). For a two-phase periodic modulation, equation (5) transforms into a polynomial equation of degree I for the unknown qp, which can be solved exactly for I≦4 or numerically for any I.
In this embodiment, the present disclosure assumes that the time segmentation (equivalently the set of change-points si, i=2, . . . , I) is unknown, although this may not necessarily be the case. A dynamic programming algorithm can be used to efficiently find the optimum segmentation. The periodic modulation parameters qp are assumed known. The reason for this is that these are global parameters and to attempt to optimize these at the same time as the segmentation would violate the Bellman principle of optimality. If {qp}p∈P are unknown then the method below can be iterated: initially the present disclosure assumes all qp=1, finds the optimum segmentation, and then solves equation (5) for qp. The method can repeat. This method generally converges after two or three iterations.
In one embodiment, the dynamic programming algorithm can be represented as follows. Let
And, from equation (4), the present disclosure derives equation (7), below.
An exemplary dynamic programming algorithm is illustrated below.
In step 2(b), if a J−1-segment model exists on [1, s −1] (for some s>1) then the latest segment on [s,τ] can potentially be appended to it giving a J-segment model on [1,τ]. The restriction sig(s) denotes that the potential change-point at s satisfies both the criterion of significance and that of interestingness. It is these criteria that limit the number of segments I discovered: it is not uncommon for no significant changes to be discovered, in which case the procedure terminates with I=1.
This procedure is optimal: recursively, the optimal segmentation into I segments on [1,T] must be given by the maximum over s of the optimal segmentation into I−1 segments on [1,s−1] combined with a single segment on [s,T]. And, no segmentation into fewer than I segments is expected to give a higher likelihood than the optimum for I.
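The recursion described above can be sketched as follows for the piecewise-constant case without periodic modulation. This is an illustrative sketch under stated assumptions: the predicate `accept` is a hypothetical stand-in for sig(s) (here a simple rate-ratio rule rather than the significance and interestingness tests), and the names `A`, `B`, and `max_segments` mirror the description but are not taken from the algorithm listing.

```python
import math


def seg_score(F, N):
    """Rate-dependent Poisson log-likelihood of one constant segment at its ML rate."""
    return F * math.log(F / N) - F if F > 0 and N > 0 else 0.0


def accept(prev, cur, min_ratio=3.0):
    """Hypothetical stand-in for sig(s): adjacent rates must differ by min_ratio."""
    (f0, n0), (f1, n1) = prev, cur
    if n0 == 0 or n1 == 0:
        return False
    lo, hi = sorted((f0 / n0, f1 / n1))
    return hi > 0 and hi >= min_ratio * max(lo, 1e-12)


def segment(f, n, accept, max_segments=5):
    """Return the 1-based start bins of the best accepted segmentation."""
    T = len(f)
    cf, cn = [0] * (T + 1), [0] * (T + 1)  # prefix sums; index 0 is the empty prefix
    for t in range(T):
        cf[t + 1], cn[t + 1] = cf[t] + f[t], cn[t] + n[t]

    def totals(s, e):  # (hits, documents) on bins s..e, 1-based inclusive
        return cf[e] - cf[s - 1], cn[e] - cn[s - 1]

    NEG = float("-inf")
    A = [[NEG] * (T + 1) for _ in range(max_segments + 1)]  # best score
    B = [[0] * (T + 1) for _ in range(max_segments + 1)]    # start of last segment
    for tau in range(1, T + 1):
        A[1][tau], B[1][tau] = seg_score(*totals(1, tau)), 1
    for J in range(2, max_segments + 1):
        for tau in range(J, T + 1):
            for s in range(J, tau + 1):
                if A[J - 1][s - 1] == NEG:
                    continue  # no (J-1)-segment model exists on [1, s-1]
                prev = totals(B[J - 1][s - 1], s - 1)
                cur = totals(s, tau)
                if not accept(prev, cur):
                    continue  # change-point at s rejected
                value = A[J - 1][s - 1] + seg_score(*cur)
                if value > A[J][tau]:
                    A[J][tau], B[J][tau] = value, s

    best_J = max(range(1, max_segments + 1), key=lambda J: A[J][T])
    starts, tau = [], T
    for J in range(best_J, 0, -1):  # back-trace through the stored start bins
        s = B[J][tau]
        starts.append(s)
        tau = s - 1
    return starts[::-1]


f = [2, 3, 2, 3, 20, 21, 19, 20]
n = [100] * 8
starts = segment(f, n, accept)
```

For this toy stream the sketch reports a single change-point at bin 5, and for a flat stream it terminates with one segment because no candidate change-point is accepted.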
Various additional quantities are also stored during step 2(b) for recovery during the back-trace for the optimum segmentation, including the model parameters for the Jth segment [s, τ] (which for the piecewise-linear model will be âJ and b̂J), and the measures of significance and interestingness for the change-point at s. These quantities are then available for output at the end of the procedure.
In an exemplary test for significance of a potential change-point at s, let sJ−1=B(J−1,s−1) be the start of the previous segment J−1, and eJ−1=s−1 be the end of that segment. In one embodiment, the estimated rate r̂J from equation (7) can be tested for a significant difference from that for the previous segment, which can be given by equation (8), below.
These two proportions can be compared using standard methods, for example, a 2×2 contingency table using Fisher's method for small frequencies and the chi-square test for large frequencies. If some qp≠1 then the denominators can take non-integer values, but the nearest integer can be used.
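As an illustration of the large-frequency case, the sketch below compares two proportions with a Pearson chi-square test on a 2×2 contingency table, using only the standard library. The function name is hypothetical, and Fisher's exact test, which would be preferred for small frequencies, is not implemented here.

```python
import math


def chi_square_2x2(f0, n0, f1, n1):
    """Pearson chi-square test (1 degree of freedom) for f0/n0 versus f1/n1.

    Returns (statistic, p_value). Non-integer denominators are rounded to the
    nearest integer, as suggested for the case where some q_p differ from one.
    """
    n0, n1 = round(n0), round(n1)
    a, b = f0, n0 - f0  # previous segment: documents with / without the item
    c, d = f1, n1 - f1  # latest segment: documents with / without the item
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0, 1.0  # degenerate table: no evidence of a change
    stat = (n0 + n1) * (a * d - b * c) ** 2 / denom
    # Survival function of a chi-square variable with one degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


stat, p_value = chi_square_2x2(10, 400, 80, 400)
```

Here a rise from 10 of 400 documents to 80 of 400 documents yields a statistic of roughly 61 and a very small p-value, so the candidate change-point would pass any conventional significance threshold.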
The most significant changes are often not the most interesting ones. If a word (or more generally a lexical item) is relatively frequent then changes affecting it are likely to be significant. However, changes affecting less frequent items may be of greater interest to an analyst of the data, in which case it is inappropriate to rank the items by significance level. For this reason, the present disclosure can use a separate criterion of interestingness, in addition to significance, both as a test for acceptance of a potential change-point and as a ranking criterion. A measure of interestingness provided herein is based upon information theory.
The null hypothesis is that there is no change in rate at s, that is, rJ=rJ−1. The present disclosure can test this hypothesis to measure both significance and interestingness using the estimated values from equation (7) and equation (8). The principal difference between these two measures can be summarized as follows: if the null hypothesis is false, then as the amount of data increases, the significance test statistic increases in magnitude without bound, and the measure of interest converges to a finite value depending only on rJ−1 and rJ.
The degree of interest of a change in rate (from rJ−1 to rJ) can be measured by the amount of information conveyed by this change. To evaluate this, the present disclosure can compare two possible models on the latest segment [s,τ]: the model derived for that segment (rJ) and the model extrapolated from the previous segment (rJ−1). The present disclosure can define the following three variables:
The conditional mutual information between W and M given T can be defined as shown below in equation (9)
I(W;M|T)=H(W|T)−H(W|M,T) (9)
where H(.|.) is conditional entropy:
H(Y|X)=−ΣxΣyP(x, y) log2 P(y|x)
I(W;M|T) measures the amount of information regarding W brought by knowledge of M that is not already contained in T. A reason for adopting this definition conditional on T is that this definition also covers the case where the segments are not constant but involve trends. For the piecewise-constant model, T conveys no information about W. Let P(M=1)=θ, and LJ=τ−s+1 be the length of the Jth segment. If the variables W, M, T are independent the joint distribution can be given by
From the joint distribution, the conditional entropies can be derived and substituted in equation (9) as shown below:
H(W|T)=H(θrJ+(1−θ)rJ−1)
H(W|M,T)=θH(rJ)+(1−θ)H(rJ−1)
where H(.) is the entropy function:
H(p)=−p log2 p−(1−p) log2 (1−p)
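The two conditional entropies above can be combined numerically as in the following sketch, which evaluates the unnormalized conditional mutual information of equation (9) for the piecewise-constant case. The function names are hypothetical and no normalization is applied.

```python
import math


def binary_entropy(p):
    """Entropy in bits of a Bernoulli variable with success probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def mutual_information(r_prev, r_cur, theta=0.5):
    """I(W;M|T) = H(W|T) - H(W|M,T) for rates r_prev and r_cur, with P(M=1) = theta."""
    h_w_given_t = binary_entropy(theta * r_cur + (1.0 - theta) * r_prev)
    h_w_given_mt = theta * binary_entropy(r_cur) + (1.0 - theta) * binary_entropy(r_prev)
    return h_w_given_t - h_w_given_mt


no_change = mutual_information(0.1, 0.1)
rare_item_change = mutual_information(0.001, 0.004)
```

The value is zero when the two rates coincide and positive otherwise, because the entropy function is concave. It depends only on the two rates and θ, not on the amount of data.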
Mutual information can be normalized. An exemplary measure of interest can be defined as provided in equation (10), below.
Equation (10) can be evaluated using the estimated values r̂J, r̂J−1 from equations (7) and (8) with θ=½. It can be appreciated that Ir
A candidate change-point at s is accepted (sig(s) in equation (8)) if the significance measure and this measure each reach required thresholds.
If we initially create the following as linear-time arrays for 1≦τ≦T and p ∈ P, as shown below in equations (11) and (12)
and define the arrays of equations (11) and (12) to be zero for τ=0, then equations (6) and (7) become equation (13) and equation (14) as shown below.
With this formulation, the recursion step is O(T²) in time. The space requirements are quite modest: in addition to the above linear arrays, A(.,.) and B(.,.) are each O(Imax·T), where Imax is the maximum number of segments permitted.
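One way to realize such linear-time arrays is sketched below for illustration. It assumes per-phase cumulative sums of the counts and document totals, so that the sums over any candidate segment [s, τ] are available by subtraction in constant time. The rate expression used, total hits divided by the phase-weighted document total, is what the Poisson model pt = qp ri yields when its log-likelihood is maximized over ri for fixed qp. The names and array layout are hypothetical and are not taken from equations (11) through (14).

```python
def build_prefix(f, n, phase, phases):
    """Per-phase cumulative sums; index 0 holds zero for the empty prefix."""
    T = len(f)
    F = {p: [0] * (T + 1) for p in phases}
    N = {p: [0] * (T + 1) for p in phases}
    for t in range(1, T + 1):
        for p in phases:
            hit = phase[t - 1] == p
            F[p][t] = F[p][t - 1] + (f[t - 1] if hit else 0)
            N[p][t] = N[p][t - 1] + (n[t - 1] if hit else 0)
    return F, N


def segment_rate_periodic(F, N, q, s, tau):
    """Constant rate on bins s..tau (1-based, inclusive) given modulation q[p]."""
    hits = sum(F[p][tau] - F[p][s - 1] for p in q)
    weighted_docs = sum(q[p] * (N[p][tau] - N[p][s - 1]) for p in q)
    return hits / weighted_docs if weighted_docs else 0.0


f = [4, 2, 6, 3]
n = [100, 100, 100, 100]
phase = ["weekday", "weekend", "weekday", "weekend"]
q = {"weekday": 1.0, "weekend": 0.5}
F, N = build_prefix(f, n, phase, q.keys())
rate_all = segment_rate_periodic(F, N, q, 1, 4)
```

Building the arrays costs time linear in T for a fixed number of phases, after which each of the O(T²) candidate segments is scored without revisiting the underlying bins.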
If the Poisson probability with which a lexical item occurs in a document (pwmt) trends gradually up or down over time, the piecewise-constant model can represent this as a flight of steps, which is suboptimal. Trends can be accommodated by assuming more generally that pwmt is piecewise-linear. As above, it is initially assumed that the segmentation is known. Again, the subscripts w and m are dropped for brevity, and allow for a periodic modulation.
For the ith segment, let
pt = qp rt for t ∈ Tp
where rt=ai+bi(t−ei−1), with ei−1=si−1 being the end of the previous segment. For a constant segment the coefficient bi is zero. The log-likelihood equation (1) becomes equation (15), below.
Again the final term does not depend on the model or segmentation, and is the same constant term as before. Taking the partial derivative with respect to qp, equation (15) becomes equation (16), below.
Given a segmentation and a model in the form {ai, bi}i=1, . . . ,I, the present disclosure can obtain qp by setting equation (16) to zero. However, maximizing equation (15) directly with respect to {ai, bi}i=1, . . .,I is not as simple because the algorithm would involve additional iteration loops and would be too slow.
1) Weighted Linear Regression: Because the log-likelihood is hard to maximize for ai, bi the present disclosure can use weighted linear regression instead. Consider the regression model
Setting the derivatives with respect to ai and bi of the total weighted squared error, as shown below in equation (17),
to zero and solving yields equation (18) and equation (19), below,
and all summations are over t ∈ Ti.
From the exemplary Poisson model, ft˜P(ntpt) so Var(ft)≈ntpt, hence
Setting vt ∝ nt therefore approximately equalizes the variance as well as giving greater weight to bins containing more data. In fact we use
so that if all nt are equal then all vt=1.
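By way of illustration only, the weighted fit can be sketched in Python as follows. The sketch assumes that the regressed quantity is the observed rate y_t = f_t/n_t, that x_t is the time offset within the segment, and that the weights are v_t = n_t divided by the mean of the n_t (one choice that gives v_t = 1 when all n_t are equal). These choices and the function name are assumptions made for the example; the closed-form solution is the standard one for weighted least squares.

```python
def weighted_line_fit(x, y, n):
    """Fit y ~ a + b*x by weighted least squares with weights v_t = n_t / mean(n).

    Returns (a, b, rss), where rss is the weighted residual sum of squares.
    """
    mean_n = sum(n) / len(n)
    v = [nt / mean_n for nt in n]

    # Weighted moments of the data.
    sv = sum(v)
    sx = sum(vt * xt for vt, xt in zip(v, x))
    sy = sum(vt * yt for vt, yt in zip(v, y))
    sxx = sum(vt * xt * xt for vt, xt in zip(v, x))
    sxy = sum(vt * xt * yt for vt, xt, yt in zip(v, x, y))
    syy = sum(vt * yt * yt for vt, yt in zip(v, y))

    # Solve the two normal equations for intercept a and slope b.
    det = sv * sxx - sx * sx
    b = (sv * sxy - sx * sy) / det
    a = (sy - b * sx) / sv

    # At the optimum the residuals are orthogonal to 1 and x, which gives
    # this shortcut for the weighted residual sum of squares.
    rss = syy - a * sy - b * sxy
    return a, b, rss
```

Because the fit depends on the data only through six weighted sums, the same quantities can later be obtained from cumulative arrays rather than by summing over each segment.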
Notation: Let
Then the regression parameters equations (18) and (19) can be shown to be
Also, substituting the regression parameters into equation (17), expanding and using the same definitions leads to the following expression for evaluating the residual sum of squares:
Setting equation (16) to zero and substituting for the weighted-least-squares estimates âi, b̂i also enables us to re-estimate the periodic modulation parameters qp from these quantities to derive equation (27), below:
for all p ∈ P, where δm=1 if p=m, otherwise zero. The nullspace of this matrix (found using a singular value decomposition) is spanned by the vector of reciprocals of the nonzero periodic parameters and, once found, the nonzero periodic parameters can be scaled so that the largest is equal to one.
2) Likelihood Adjustment: If we assume ai=âi+ε and bi=b̂i+δ, substitute these into the contribution to the log-likelihood equation (15) from the ith segment, set the derivatives with respect to ε and δ to zero, and expand to first order in ε and δ, then we get the following pair of equations that are linear in these increments:
The equations immediately above can be solved for ε and δ giving improved estimates of the parameters, and the process can be iterated. Generally, this process converges after one or two iterations. The present embodiment now has estimates of ai and bi that maximize the likelihood; however, the likelihood is maximized at the expense of additional summations over the data. Fortunately, the weighted-least-squares estimates are usually very close to the maximum likelihood estimates, so this step can be omitted if computational efficiency is a priority.
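By way of illustration only, the adjustment can be sketched in Python as a Newton step on the per-segment Poisson log-likelihood. The sketch assumes the log-likelihood contribution Σ f_t·ln(r_t) − n_t·r_t with r_t = a + b·x_t and all periodic parameters q_p taken as 1; the function name is hypothetical. Expanding the gradient to first order in the increments gives a 2×2 linear system, which is solved directly.

```python
def newton_step(a, b, x, f, n):
    """One first-order correction (eps, delta) to the estimates (a, b).

    x: time offsets within the segment, f: item counts, n: document counts.
    """
    g0 = g1 = 0.0          # gradient of the log-likelihood w.r.t. a and b
    h00 = h01 = h11 = 0.0  # negated second derivatives
    for xt, ft, nt in zip(x, f, n):
        r = a + b * xt
        d = ft / r - nt
        w = ft / (r * r)
        g0 += d
        g1 += d * xt
        h00 += w
        h01 += w * xt
        h11 += w * xt * xt
    # Solve [[h00, h01], [h01, h11]] @ [eps, delta] = [g0, g1].
    det = h00 * h11 - h01 * h01
    eps = (g0 * h11 - g1 * h01) / det
    delta = (g1 * h00 - g0 * h01) / det
    return a + eps, b + delta
```

Starting from the weighted-least-squares estimates, repeated calls would be expected to settle quickly, consistent with the one or two iterations noted above.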
3) Segment Constant vs. Trend: The decision as to whether to treat the latest segment spanning [s,τ] as constant or trend can be based on any combination of the following exemplary criteria:
The present embodiment can assume that the segmentation is not known, although this is not necessarily the case. The optimization proceeds similarly to that described above for the piecewise-constant model. If the periodic modulation parameters qp are not known, as is usually the case, then the procedure is to initially assume all qp=1, find the optimum segmentation and model, re-estimate qp using equation (27), and repeat. Two or three iterations of this process are generally sufficient.
The likelihood contribution L(s,τ) for the Jth segment [s,τ] is obtained using equation (13) for a constant segment. For a trend segment, equation (28) as shown below is used.
The present embodiment defers consideration of how to express this in terms of differences in cumulative values at segment endpoints. The regression parameters and the residual sum of squares can all be evaluated using linear-time arrays for the quantities defined in equations (20) and (22), namely equation (29),
for 1≦τ≦T, with all of these zero for τ=0. Since the Jth segment extends from s to τ inclusive, equation (29) becomes, for example,
$$T_J^{(k)} = T_\tau^{(k)} - T_{s-1}^{(k)}, \qquad R_J^{(k)} = R_{\tau p}^{(k)} - R_{s-1,p}^{(k)}$$
and so forth. All the quantities in equation (20) through equation (24) can be obtained in this way, and also the regression parameters âJ, b̂J from equation (25), the RSS from equation (26), and the periodic modulation parameters from equation (27).
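By way of illustration only, the following Python sketch shows how differences of cumulative arrays at the segment endpoints can yield the regression parameters and the residual sum of squares for any segment [s,τ] without summing over its bins. The particular set of six cumulative sums, the observed-rate variable y, and the function names are assumptions made for the example and do not reproduce the arrays of equation (29).

```python
from itertools import accumulate

def build_arrays(y, v):
    """Cumulative sums of v, v*t, v*t^2, v*y, v*t*y and v*y^2 for t = 1..T.

    Each returned array has length T + 1 and is zero at index 0.
    """
    t_idx = range(1, len(y) + 1)
    columns = [
        list(v),
        [vt * t for vt, t in zip(v, t_idx)],
        [vt * t * t for vt, t in zip(v, t_idx)],
        [vt * yt for vt, yt in zip(v, y)],
        [vt * t * yt for vt, yt, t in zip(v, y, t_idx)],
        [vt * yt * yt for vt, yt in zip(v, y)],
    ]
    return [[0.0] + list(accumulate(col)) for col in columns]

def segment_fit(arrays, s, tau):
    """Weighted fit y ~ a + b*(t - (s - 1)) over bins s..tau in constant time."""
    sv, st, stt, sy, sty, syy = (c[tau] - c[s - 1] for c in arrays)
    # Shift the time origin to the end of the previous segment, e = s - 1.
    e = s - 1
    sx = st - e * sv
    sxx = stt - 2 * e * st + e * e * sv
    sxy = sty - e * sy
    det = sv * sxx - sx * sx
    b = (sv * sxy - sx * sy) / det
    a = (sy - b * sx) / sv
    rss = syy - a * sy - b * sxy
    return a, b, rss
```

The arrays are built once in linear time, after which each candidate segment examined by the dynamic program costs a fixed number of arithmetic operations.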
With the segment model and likelihood available for [s,τ], the optimization can proceed once the restriction sig(s) is defined for segments that may involve trends.
1) Difference Between Regression Lines: Let sJ=s, eJ=τ be the start and end of the Jth segment, and let sJ−1=B(J−1,s−1), eJ−1=s−1 be the start and end of the previous segment. Also define eJ−2=sJ−1−1. Two tests can be used for each candidate change-point. A first test can be used to decide whether a significant change exists. A second test can be used to decide what form the significant change takes.
The first test may be used when at least one of the two segments is a trend. The null hypothesis (H0) is that there is no change. That is, the Jth segment is a linear extrapolation of the J−1st. A single regression line can first be fitted through both segments as described above, and the residual sum of squares RSS0 obtained using equation (26). The alternative hypothesis (H1) is that there is a change-point at s, and RSS1 can be obtained as the sum of the residual sums of squares over the two segments, fitted separately. Then, the F-statistic, below,
defines the critical region. The number of degrees of freedom in the denominator is n-m where n=eJ−sJ−1+1 is the total number of data points in the two segments, and m=4 is the total number of estimated parameters in the separate models. Although this test and a similar one in the next section assume normal residuals, the tests have been found to nevertheless work well in this application.
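By way of illustration only, the comparison of the single-line and two-line fits can be sketched in Python as a standard nested-model F test. The sketch assumes that the numerator has m − m0 = 2 degrees of freedom (four parameters for separate lines versus two for a single line), which is the conventional form and is an assumption here. For brevity the helper uses unweighted least squares, and the function names are hypothetical.

```python
def line_rss(x, y):
    """Residual sum of squares of an ordinary least-squares line through (x, y)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = my - b * mx
    return sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))

def change_f_statistic(rss0, rss1, n, m=4, m0=2):
    """F statistic for H0 (one line through both segments, m0 parameters)
    against H1 (separately fitted lines, m parameters) on n data points."""
    if n <= m:
        raise ValueError("need more data points than parameters")
    return ((rss0 - rss1) / (m - m0)) / (rss1 / (n - m))

def f_for_split(x, y, split):
    """F statistic for a candidate change-point after index `split` (0-based)."""
    rss0 = line_rss(x, y)
    rss1 = line_rss(x[:split], y[:split]) + line_rss(x[split:], y[split:])
    return change_f_statistic(rss0, rss1, len(x))
```

The resulting value would be compared with a critical value of the F distribution with (m − m0, n − m) degrees of freedom at the chosen significance level.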
2) Difference between Regression Slopes: If a change-point involving a trend is significant then the next question that needs to be addressed is whether the change involves a discontinuity (as for the piecewise-constant model) or merely a corner, in which case the slope changes but the intercept does not. A corner introduces one less parameter into the overall model, resulting in a simpler description of the data. To test whether a change involves a discontinuity, a modified two-phase linear regression can be used. The modified two-phase linear regression can incorporate the weights vt. The null hypothesis H0 is that the regression lines for segments J−1 and J coincide at eJ−1.
$$a_{J-1} + b_{J-1}\,(e_{J-1} - e_{J-2}) = a_J$$
The above constraint can be incorporated into the weighted squared error criterion using a Lagrange multiplier:
Setting the derivatives with respect to the four parameters and λ to zero leads to the following system of equations for the optimum solution:
All these quantities can be obtained from the arrays defined in equation (29). From this solution, equation (26) gives RSS0 which is compared with RSS1 using
If a change-point is determined to be continuous with a corner then the two-phase regression model can be adopted, as determined above for both segments. However, if two consecutive change-points consist of such corners then the middle segment would inherit two distinct models from the separate two-phase regressions, and these would have to be reconciled. So, instead, the present embodiment makes an adjustment to the model for one segment only, depending on the type of the Jth segment, as shown below.
Trend: Set â′J=âJ−1+b̂J−1(eJ−1−eJ−2)
Constant: Set b̂′J−1=(âJ−âJ−1)/(eJ−1−eJ−2)
In the first case the intercept of the Jth segment is adjusted to match the end of the J−1st segment, whereas in the second the slope of the J−1st segment, which has to be a trend, is adjusted to match the intercept of the Jth segment. Although slightly suboptimal, this method can handle any number of consecutive connected segments. Within the dynamic programming method, if b̂′J−1 is set in this way then, because this affects the previous (not the current) segment, it can be recorded in the main loop as
During the back-trace, if this value is nonzero for the Jth segment then it overrides the usual value recorded for the J−1st.
In addition to passing the significance test, a potential change-point can again satisfy the interestingness requirement based on conditional mutual information (equations (9) and (10)). The present embodiment now involves four model parameters as shown below in equation (30).
The two models for the Jth segment [sJ, eJ] are the one derived for that segment (aJ, bJ) and the one extrapolated from the preceding segment (aJ−1, bJ−1). If the variables W, M, T are defined as above, then the joint distribution is now given by:
for t=sJ, . . . , eJ, where LJ=eJ−sJ+1 is the length of this segment. The conditional entropies can then be obtained, as shown below.
Here, again, H(.) is the entropy function. The aforementioned equations are evaluated using the estimated values âJ−1, b̂J−1, âJ, b̂J, and with θ=½. It should be noted that the evaluation involves six terms (two for each H(.)), all of which can have the following general form:
for various values of α and β. Because the sum over t could degrade the overall algorithm from quadratic time to cubic time, the present embodiment can eliminate this possibility by applying the Euler-Maclaurin formula in the following form:
where fi=f(s+ih), nh=e−s, and fi(k) is the kth derivative. Since in this case s and e are integers, h can be set to 1. The following indefinite integral (for β≠0) can also be used:
and hence obtain:
All the terms on the right-hand side are evaluated at the endpoints of the segment, and in practice the last term is usually negligible. All that remains is to divide the result by ln(2). This makes it possible to efficiently compute the conditional mutual information (equation (9)) and measure of interest (equation (30)).
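By way of illustration only, the following Python sketch applies the Euler-Maclaurin formula with h=1 to a sum over an integer range, so that only endpoint evaluations are needed. The summand (α+βt)·ln(α+βt) is an assumption chosen as a representative entropy-type term; it is not asserted to be the general form referred to above. The function names are hypothetical.

```python
import math

def direct_sum(alpha, beta, s, e):
    """Reference value: sum of g*ln(g), g = alpha + beta*t, for t = s..e."""
    return sum((alpha + beta * t) * math.log(alpha + beta * t)
               for t in range(s, e + 1))

def euler_maclaurin_sum(alpha, beta, s, e):
    """Approximate the same sum from endpoint values only (requires beta != 0)."""
    def g(t):
        return alpha + beta * t

    def antiderivative(t):
        # Integral of g*ln(g) dt = g^2 * (ln(g) - 1/2) / (2*beta)
        return g(t) ** 2 * (math.log(g(t)) - 0.5) / (2.0 * beta)

    def f(t):
        return g(t) * math.log(g(t))

    def f1(t):  # first derivative of f
        return beta * (math.log(g(t)) + 1.0)

    def f3(t):  # third derivative of f
        return -beta ** 3 / g(t) ** 2

    return (antiderivative(e) - antiderivative(s)
            + 0.5 * (f(s) + f(e))
            + (f1(e) - f1(s)) / 12.0
            - (f3(e) - f3(s)) / 720.0)
```

Dividing the result by ln(2) would express it in bits. The cost is independent of the segment length, which is what keeps the overall procedure from becoming cubic in T.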
Having the measure of interest consistently defined for both constant and trend segments brings two major advantages:
Thus far, the following steps in the dynamic-programming optimization of the piecewise-linear model are based on linear arrays evaluated at segment ends:
for k=0, . . . , 11, with all Fτ(k)=0 for τ=0. Also define
Then, equation (28) becomes
This calculation leaves G(s,τ). At the moment the algorithm is cubic-time because of this term only. For short segments the cost of evaluating this is small, but for long segments it may be burdensome. Let L≧1 be a parameter which essentially governs the maximum segment length for which the sum in equation (31) can be evaluated directly. The present embodiment can use a Chebyshev polynomial approximation to ln(1+x) for 0≦x≦1 and the Clenshaw algorithm to convert this to a regular polynomial, represented in equation (32):
where K=11, accurate to 1×10⁻⁹ throughout the domain [0,1], which is sufficient for present purposes.
Suppose first that b̂J>0, and define
so that G(s,τ)=G>(s−1,s,τ), and ⌊·⌋ denotes the floor function. G>(w,s,τ) can be evaluated recursively as follows:
where xt=b̂J(t−s+1)/(âJ+b̂J(s−1−w)). Since t≦v≦u, the definition of equation (33) guarantees that 0<xt≦1. Therefore, the approximation equation (32) can be used together with a standard binomial expansion to obtain equation (35), below.
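By way of illustration only, a degree-11 Chebyshev approximation to ln(1+x) on [0,1] can be sketched in Python as follows. The coefficients are obtained by interpolation at Chebyshev nodes, and the series is evaluated with the Clenshaw recurrence rather than being converted to an ordinary polynomial; both choices are assumptions made for the example, and the function names are hypothetical.

```python
import math

def chebyshev_coeffs(func, lo, hi, n):
    """Coefficients c[0..n-1] of the Chebyshev interpolant of func on [lo, hi]."""
    fvals = []
    for k in range(n):
        u = math.cos(math.pi * (k + 0.5) / n)          # node on [-1, 1]
        fvals.append(func(0.5 * (hi - lo) * u + 0.5 * (hi + lo)))
    coeffs = []
    for j in range(n):
        total = sum(fvals[k] * math.cos(math.pi * j * (k + 0.5) / n)
                    for k in range(n))
        coeffs.append(2.0 * total / n)
    return coeffs

def clenshaw_eval(coeffs, lo, hi, x):
    """Evaluate c[0]/2 + sum_{j>=1} c[j]*T_j(u) by the Clenshaw recurrence."""
    u = (2.0 * x - lo - hi) / (hi - lo)                 # map x to [-1, 1]
    d = dd = 0.0
    for c in reversed(coeffs[1:]):
        d, dd = 2.0 * u * d - dd + c, d
    return u * d - dd + 0.5 * coeffs[0]

# Twelve coefficients correspond to polynomial degree K = 11.
LN1P_COEFFS = chebyshev_coeffs(math.log1p, 0.0, 1.0, 12)

def ln1p_approx(x):
    """Approximate ln(1 + x) for 0 <= x <= 1."""
    return clenshaw_eval(LN1P_COEFFS, 0.0, 1.0, x)
```

Once the coefficients are fixed, each evaluation involves only multiplications and additions, with no calls to the logarithm.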
Although equation (35) involves a sum over 77 terms, there are no function evaluations and empirically it turns out to be faster than the direct evaluation of equation (31) for segment length of 15 (see below).
If b̂J<0 then the present embodiment proceeds in a similar fashion and only the result will be quoted. Define
Then G(s,τ)=G<(τ+1,s,τ), and recursively
Because the number of recursive function calls in equations (34) and (36) depends on the values of âJ, b̂J and not directly on the segment time span (and in practice seldom exceeds 2), this completes a linear-space, quadratic-time formulation. To assess this experimentally the inventors used the Magellan search query corpus. The inventors selected 20 words that occur regularly throughout the corpus (internet, hotel, jobs, free, home, software, music, american, games, email, computer, world, page, school, real, college, state, tv, video, art).
The change-detection method described in previous sections typically generates a lot of output. For each word/metavalue pair there can be a sequence of change-points connecting piecewise-linear segments. Some of these individual changes can be related to similar ones for many other word/metavalue pairs. It can be undesirable to leave it to a human analyst to have to synthesize more meaningful events out of all these elementary changes.
It is often the case that where a subset of all the change-points for all word/metavalue combinations have a common cause the overall event can be visualized in three exemplary dimensions as follows:
It can be helpful to consider a new kind of event that can cover several consecutive segments and therefore change-points. Each of these events can have an onset phase, and can also have peak and offset phases. The onset of an event need not consist of a single change-point. The profiles illustrated in
The overall change profile for a word/metavalue combination can, in general, include several such events in sequence: zero or more bursts followed by an optional step. An algorithm can post-process the change profiles for each word/metavalue combination and form an overall list of these events in the following exemplary form:
φj = (wj, mj, sj, ej, Ij),  j = 1, . . . , N    (37)
where
wj is the word,
mj is the metavalue,
sj is the start-time,
ej is the end-time (zero for a step event),
Ij is the interestingness.
Because the onset and offset phases of these events can be extended, the present disclosure can characterize the start-time using the first moment of area of the profile during the onset phase about the point t=0, and similarly for the end-time. The interestingness of the event is based on the quantity defined in section E. If the span of the event φj consists of the segments i1≦i≦i2 then define equation (38):
where Ia
There are various ways in which the present disclosure can measure the dis-synchrony of two events, for example, φi,φj. A measure using only |sj−si|+|ej−ei| may not be sufficient because of the different forms the onset and offset phases can take, as illustrated above. An abrupt step can get grouped with a long trend. The present embodiment adopts the simple expedient of also incorporating the second moments of area of the onset and offset phases of φi and φj. The actual definition of the dis-synchrony measure d(φi,φj) involves further minor considerations which can be omitted here.
It is logical to separate groups of step events (with ej=0) and of burst events (with ej≠0). The principle can be the same in each case. Events of form φj can form groups when words wj and metavalues mj form sets W and M such that the Cartesian product W⊗M is substantially covered with events φj that are substantially synchronous in time.
To meet the challenge posed at the end of the previous section, the present disclosure can use a graph clustering method. In testing, the inventors determined that metric clustering algorithms did not work as well as desired because the space occupied by the events φj is a metric space only in the time dimension. It should be understood, however, that the use of metric clustering algorithms is not precluded.
Also, it should be understood that the aforementioned challenge is not a bi-clustering problem, at least in part because it is possible and quite common for words and/or metavalues to be shared between distinct groups of events at different times, and sometimes even for the same times. This is illustrated in
So the imperative is to cluster the events φj placing emphasis on the Cartesian-product structure across the sets W and M. The present embodiment can accomplish this by creating an undirected graph with the events φj as nodes. Edges are created between pairs of nodes (for example, φi and φj) that satisfy one of the following three conditions (δ is a threshold):
For clustering the nodes in the graph, the present disclosure can use a procedure that reveals clusters of densely interconnected nodes by simulating a Markov flow along the graph edges.
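By way of illustration only, one well-known procedure of this kind alternates an expansion step (squaring a column-stochastic matrix) with an inflation step (raising entries to a power and renormalizing), in the manner of the Markov Cluster algorithm. The Python sketch below assumes that procedure; the disclosure does not name a specific algorithm, and the function name, parameter values, and integer node indices are hypothetical. Nodes stand for events φj and edges for pairs that satisfy the edge conditions.

```python
def markov_cluster(n, edges, inflation=2.0, iterations=50, eps=1e-9):
    """Cluster nodes 0..n-1 of an undirected graph by simulated Markov flow."""
    # Adjacency matrix with self-loops.
    M = [[0.0] * n for _ in range(n)]
    for i in range(n):
        M[i][i] = 1.0
    for i, j in edges:
        M[i][j] = M[j][i] = 1.0

    def normalize(mat):
        # Make every column sum to one (column-stochastic).
        for j in range(n):
            col = sum(mat[i][j] for i in range(n))
            for i in range(n):
                mat[i][j] /= col

    normalize(M)
    for _ in range(iterations):
        # Expansion: flow spreads along paths of length two.
        M = [[sum(M[i][k] * M[k][j] for k in range(n)) for j in range(n)]
             for i in range(n)]
        # Inflation: strong flows are reinforced, weak flows decay.
        M = [[v ** inflation for v in row] for row in M]
        normalize(M)

    # Each row that retains flow defines a cluster: the columns it attracts.
    clusters = {frozenset(j for j in range(n) if M[i][j] > eps)
                for i in range(n)}
    clusters.discard(frozenset())
    return sorted(sorted(c) for c in clusters)
```

The dense matrix products make this sketch cubic in the number of nodes per iteration; a practical implementation would be expected to use sparse matrices and pruning.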
1) Filtering Graph Edges: Despite the additional discriminative leverage brought by the metadata, it is still possible that changes can occur for separate words at or about the same time but for different reasons, in which case groups can be generated that are misleading. Data sets without metadata are especially prone to this phenomenon. For this reason, the present embodiment can also perform a bigram check: for a pair of distinct events φi, φj such that wi≠wj, an edge connecting these events is added to the graph only if the bigram frequency for the pair wi, wj exceeds a required threshold that may depend on wi and wj.
The bigram frequency can be defined as the total frequency of documents containing both wi and wj over the range of data concerned. There is no requirement that the words be adjacent or occur in a particular order. Imposing this requirement ensures that the two words co-occur in a sufficient number of the source documents, without regard to metadata. This is an effective filter against spurious combinations. It can be expensive to compute the bigram frequency because it may be impractical to accumulate frequencies for all possible such bigrams during the original binning. A separate pass over the raw data can be implemented for this purpose. Requiring a separate pass can be slow and especially undesirable for the streaming mode, in which case it may be desirable to process all raw data only once.
2) Priority Sampling Scheme: The present embodiment can resolve the aforementioned challenge by using a priority sampling scheme through which the present embodiment is able to efficiently obtain an estimate for the frequency of an arbitrary bigram post-hoc without the need for a subsequent pass through the raw data. The general principle of priority sampling can be described as follows: Let there be n items i=1, . . . , n with positive weights νi. For each item, define a priority qi=νi/ri where ri is a uniform random number on [0,1]. The priority sample S of size k<n can include the k items of highest priority. Let γ be the k+1st priority, and let ν̂i=max{νi,γ} for each sampled item i ∈ S. Now consider an arbitrary subset U⊂{1, . . . ,n} of the original items. It can be shown that
An unbiased estimate of the total weight of the items in the arbitrary subset U is therefore obtained from the priority sample by summing ν̂i for those items that are also in U. This can be done for many different subsets U after forming the priority sample.
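By way of illustration only, the general principle of priority sampling described above can be sketched in Python as follows. The function names are hypothetical, and the random numbers are drawn from the half-open interval (0,1] so that the division is always defined, which is an implementation choice made for the example.

```python
import random

def priority_sample(weights, k, rng):
    """Draw a priority sample of size k < len(weights).

    weights: dict mapping item -> positive weight.
    Returns a dict mapping each sampled item to its adjusted weight
    max(weight, gamma), where gamma is the (k+1)-st highest priority.
    """
    priorities = []
    for item, w in weights.items():
        r = 1.0 - rng.random()            # uniform on (0, 1]
        priorities.append((w / r, item))
    priorities.sort(reverse=True)
    gamma = priorities[k][0]              # (k+1)-st priority
    return {item: max(weights[item], gamma) for _, item in priorities[:k]}

def estimate_subset_weight(sample, subset):
    """Estimate the total weight of `subset` from a priority sample."""
    return sum(w for item, w in sample.items() if item in subset)
```

A single sample can be queried for any number of subsets after it has been formed, which is the property relied upon for the bigram check.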
The present embodiment employs this for the bigram check in three stages. First, during the binning of the data the present embodiment forms a list of consolidated documents by filtering out stop words and words that are excluded from the final dictionary, then re-assembling each document with the words in word dictionary order. Metadata can be ignored. This enables the documents to merge as far as possible. The total weight νi of each consolidated document is its total frequency within that bin. From this, the present embodiment can create the priority sample for that bin as described above, and export it along with the word frequency data. In streaming mode, the priority samples are carried forward within the summary file until the data drops off the time horizon.
The second step is to form a merged priority sample for all consolidated documents throughout the data, either from all the separate bins (retrospective mode) or from the summary file together with the latest data (streaming mode). For time and space economy it may be necessary or desirable to discard the tail of the sample for each bin. If this is done, the values of ν̂i can be re-assigned using the revised value of γ, so that unbiasedness is preserved. The final step is to estimate the frequency of an arbitrary bigram for a range of time by summing the values of ν̂i for all the consolidated documents in the merged priority sample that contain that bigram, over that range of time. This can be done very quickly. A threshold can then be applied to the estimated frequencies as described above in order to decide which edges to add to the graph.
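By way of illustration only, the consolidation of documents and the bigram frequency estimate can be sketched in Python as follows. The sketch assumes whitespace tokenization, lower-casing, and removal of repeated words within a document, none of which is specified above; the function names are hypothetical. The `sample` argument stands for a mapping from consolidated documents to their adjusted weights ν̂i, restricted to the range of time of interest.

```python
from collections import Counter

def consolidate(documents, dictionary, stop_words):
    """Reduce each document to its dictionary words in sorted order and
    count how often each consolidated form occurs (metadata is ignored)."""
    counts = Counter()
    for doc in documents:
        kept = sorted({w for w in doc.lower().split()
                       if w in dictionary and w not in stop_words})
        if kept:
            counts[tuple(kept)] += 1
    return counts

def estimate_bigram_frequency(sample, w1, w2):
    """Sum the weights of consolidated documents containing both words.

    The words need not be adjacent or in any particular order.
    """
    return sum(weight for doc, weight in sample.items()
               if w1 in doc and w2 in doc)
```

Documents that differ only in word order or in excluded words collapse to the same consolidated form, which keeps the number of distinct items to be sampled small.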
No “false positives” are expected with this scheme: if an estimated bigram frequency is greater than zero, then the true frequency must be as well. However, “false zeros” are expected, where the estimated bigram frequency is zero for a bigram that does actually occur. The inventors have measured the true frequencies for these false zeros and found that, for a sufficiently large merged priority sample (approximately 10⁵), the true frequencies are typically very small and below the threshold for acceptance.
The graph clustering forms the nodes (events φj) into groups. From this, the present embodiment can immediately generate a structured output of the following form:
Φk=φk
sorted in decreasing order of Ik, where for each group Φk,
{φk
Tk is the time description,
Wk=∪j=1n
Mk=∪j=1n
Ik=Σj=1n
The time description Tk can take various forms depending on the type of onset presence and type of offset. The group measure of interest Ik is the total over that for the component events equation (38). All that needs to be presented to the user are the time Tk, sets of words Wk and metavalues Mk, and perhaps a small sample of the documents or a subset of the priority sample. This is information on a digestible scale which should enable the user to make a judgment about whether this is an important event or not.
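By way of illustration only, the assembly of the structured output from clustered events can be sketched in Python as follows. The record layout follows the event form of equation (37); the class name, field names, and dictionary keys are hypothetical, and the time description Tk is omitted because its form depends on the onset and offset types.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Event:
    word: str
    metavalue: str
    start: float
    end: float        # zero for a step event
    interest: float

def summarize_groups(groups):
    """Turn clusters of events into group summaries, most interesting first.

    groups: iterable of lists of Event, one list per cluster.
    """
    summaries = []
    for events in groups:
        summaries.append({
            "words": sorted({e.word for e in events}),            # W_k
            "metavalues": sorted({e.metavalue for e in events}),  # M_k
            "interest": sum(e.interest for e in events),          # I_k
            "events": list(events),
        })
    summaries.sort(key=lambda g: g["interest"], reverse=True)
    return summaries
```

Each summary carries only the word set, the metavalue set, and the total interest, which corresponds to the digestible scale of information intended for presentation to the user.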
The following description provides some results obtained by applying the aforementioned exemplary CoCITe procedure to various corpora.
The time requirements have been found to be roughly proportional to the numbers of words and metavalues and the square of the number of bins. Sparsity also varies from one corpus to another and makes a difference.
The first corpus consists of logs of human/machine automated dialogs. CHI Scan is a tool for reporting, analysis and diagnosis of interactive voice response (IVR) systems. IVR systems can operate using natural language or directed dialog. Natural language allows a caller to speak naturally. Directed dialog requires a caller to follow a menu which, in some cases, only permits touch-tone responses. Designing, monitoring, testing, and improving all IVR systems is predicated on the availability of tools for data analysis. CHI Scan is a web-based interactive tool for this purpose. In addition to providing both high-level and in-depth views of dialogs between callers and automated systems, CHI Scan provides views of changes occurring over time. Changes may be either planned (via a new release of the system) or unplanned.
The CoCITe algorithm can be incorporated into the CHI Scan software framework and like software using the streaming mode. Each document is a complete dialog between a caller and the IVR system. Changes in relative frequencies of the following are tracked:
Prompts: Messages played to the caller
Responses: Callers' choices in response to prompts
Call outcomes: Transfers (to human agents), hang-ups (caller ends the call), and end-calls (system ends the call)
KPIs: Key performance indicators of progress made within the automation.
These can be important metrics for evaluating and tracking IVR systems over time, and can provide invaluable insight. No call metadata are used at present for the CoCITe algorithm. However, for tracking the responses the relevant prompt is treated as a metavalue. This has the effect of conditioning each response on a preceding occurrence of the prompt, thereby ensuring that the distribution of responses is normalized. This does not preclude the future use of call metadata as well. Three versions have been implemented, using hourly, daily and weekly binning.
When a customer talks to a human agent, the agent typically makes notes on the reason for the call and the resolution. These notes are a mine of information on why customers are calling, but are usually far too numerous to be read individually. These notes also tend to be rather unstructured, containing many nonstandard abbreviations and spelling errors. However, metadata about the customer are generally available. Detecting and structuring the changes that occur within such streams of notes can provide useful intelligence to the organization.
Most of the clusters represent routine traffic, but cluster 6 (Hurricane Katrina) is unusual. Customers in the Gulf Coast region who were affected by this disaster had special needs. Many change-points therefore emerge, some involving entirely new words (e.g. Katrina), some involving pre-existing words which increased in frequency (e.g. hurricane), and some involving common words being used in new combinations (e.g. home, destroyed). The coordination procedure groups these changes as follows:
Metavalues: Louisiana, Mississippi
Words: hurricane, Katrina, hurrican, house, affected, home, victim, destroyed
The word list shown is a subset. Note the mis-spelling “hurrican,” which occurs often enough to be picked up by the procedure. Tracking this event over time we see it gradually tail off during the month of September, 2005.
Queries made to internet search engines can be treated as documents for this analysis. Such queries tend to evolve over time, both cyclically within the 24-hour period, and over a longer time-scale as changing frequency of search terms reflects evolving interest in diverse topics.
Some rather generic terms (e.g. computer, school, jobs, weather) show no change in rate throughout. Some show an increase in frequency (e.g. hotel, Internet, IM), others a decrease (e.g. chatroom, telnet). Many search terms show bursty behavior, and for grouping these in the absence of metadata the bigram check is helpful for forming coherent groups. Some search terms show an increase in frequency at the same time (e.g. Linux and mall in November 1997) but for different reasons, and the bigram check helps to prevent these from being grouped together. Some groups of burst events generated by the coordination procedure are shown in
The profile of the burst event (using daily data) for the death of Princess Diana is shown in
Turning now to
Thus, change clusters in the full Enron corpus are typically driven by corporate mass mailings (all employees receive a copy) or by targeted advertisements (multiple near-identical messages sent to a particular user). Such effects are valid changes to the language model, but not particularly illuminating as to user activity. To eliminate non-informative “changes” driven by junk mail, we tried various forms of pre-processing. Each user is associated with a number of online identities. We report some results from analysis of messages which have both sender and recipient fields including identities of members of the user group (distinct members, since self-mailings between two accounts are common). Junk email is no longer an issue. Repeated messages still occur; it is difficult to distinguish between identical and near-identical documents (e.g. a copy in the deleted items folder versus a reply with a few new words attached to a copy of the old content).
The present disclosure considers the problem of discovering and coordinating changes occurring within text streams. Typically the volume of text streams being acquired in many domains is far too large for human analysts to process and understand by direct inspection, especially in a timely manner. Therefore, there is a need for tools that can execute change detection and coordination. Changes can be abrupt, gradual, or cyclic. Changes can reverse themselves, and can occur in groups that have a common underlying cause. A tool that is designed to accommodate these behaviors can be of material assistance to analysts in providing them with compact summaries of important patterns of change that would otherwise be hidden in the noise. It is then for the analyst to decide what priority to give to the discovered events.
The above description has described a methodology for efficiently finding step changes, trends, and multi-phase cycles affecting lexical items within streams of text that can be optionally labeled with metadata. Multiple change-points for each lexical item are discovered using a dynamic programming algorithm that ensures optimality. A measure of interestingness has been introduced that weights each change-point by how much information it provides, and complements the more conventional measures of statistical significance. These changes are then grouped across both lexical and metavalue vocabularies in order to summarize the changes that are synchronous in time.
A linear-space, quadratic-time implementation of this methodology is described as a function of the time span of the data and can be applied either retrospectively to a corpus of data or in streaming mode on an ongoing basis. The output of the tool can be a set of ranked events, each including sets of lexical items and metavalues together with a description of the timing of the event. This information, perhaps augmented with a sample of the original documents, can assist a human analyst in understanding an event and its significance.
The law does not require and it is economically prohibitive to illustrate and teach every possible embodiment of the present claims. Hence, the above-described embodiments are merely exemplary illustrations of implementations set forth for a clear understanding of the principles of the disclosure. Variations, modifications, and combinations may be made to the above-described embodiments without departing from the scope of the claims. All such variations, modifications, and combinations are included herein by the scope of this disclosure and the following claims.