The present invention relates generally to information processing and classification. More particularly, the present invention relates to systems, methods and computer readable media for terminating a technology-assisted review (“TAR”) process in order to efficiently classify a plurality of documents in a collection of electronically stored information.
TAR involves the iterative retrieval and review of documents from a collection until a substantial majority (or “all”) of the relevant documents have been reviewed or at least identified. At its most general, TAR separates the documents in a collection into two classes or categories: relevant and non-relevant. Other (sub) classes and (sub) categories may be used depending on the particular application.
Presently, TAR lies at the forefront of information retrieval (“IR”) and machine learning for text categorization. Much like with ad-hoc retrieval (e.g., a Google search), TAR's objective is to find documents to satisfy an information need, given a query. However, the information need in TAR is typically met only when substantially all of the relevant documents have been retrieved. Accordingly, TAR relies on active transductive learning for classification over a finite population, using an initially unlabeled training set consisting of the entire document population. While TAR methods typically construct a sequence of classifiers, their ultimate objective is to produce a finite list containing substantially all relevant documents, not to induce a general classifier. In other words, classifiers generated by the TAR process are a means to the desired end (i.e., an accurately classified document collection).
Some applications of TAR include electronic discovery (“eDiscovery”) in legal matters, systematic review in evidence-based medicine, and the creation of test collections for IR evaluation. See G. V. Cormack and M. R. Grossman, Evaluation of machine-learning protocols for technology-assisted review in electronic discovery (Proceedings of the 37th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 153-162, 2014); C. Lefebvre, E. Manheimer, and J. Glanville, Searching for studies (Cochrane handbook for systematic reviews of interventions. New York: Wiley, pages 95-150, 2008); M. Sanderson and H. Joho, Forming test collections with no system pooling (Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 33-40, 2004). As introduced above, in contrast to ad-hoc search, the information need in TAR is typically satisfied only when virtually all of the relevant documents have been discovered. As a consequence, a substantial number of documents are typically examined for each classification task. The reviewer is typically an expert in the subject matter, not in IR or data mining. In certain circumstances, it may be undesirable to entrust the completeness of the review to the skill of the user, whether expert or not. For example, in eDiscovery, the review is typically conducted in an adversarial context, which may offer the reviewer limited incentive to conduct the best possible search.
In legal matters, an eDiscovery request typically comprises between several and several dozen requests for production (“RFPs”), each specifying a category of information sought. A review effort that fails to find documents relevant to each of the RFPs (assuming such documents exist) would likely be deemed deficient. In other domains, such as news services, topics are grouped into hierarchies, either explicit or implicit. A news-retrieval effort for “sports” that omits articles about “cricket” or “soccer” would likely be deemed inadequate, even if the vast majority of articles—about baseball, football, basketball, and hockey—were found. Similarly, a review effort that overlooked relevant short documents, spreadsheets, or presentations would likely also be seen as unsatisfactory. A “facet” is hereby defined to be any identifiable subpopulation of the relevant documents (i.e., a sub-class), whether that subpopulation is defined by relevance to a particular RFP or subtopic, by file type, or by any other characteristic.
TAR systems and methods including unsupervised learning, supervised learning, and active learning (e.g., Continuous Active Learning or “CAL”) are discussed in Cormack VI. Generally, the property that distinguishes active learning from supervised learning is that with active learning, the learning algorithm is able to choose the documents from which it learns, as opposed to relying on user- or random selection of training documents. In pool-based settings, the learning algorithm has access to a large pool of unlabeled examples, and requests labels for some of them. The size of the pool is limited by the computational effort necessary to process it, while the number of documents for which labels are requested is limited by the human effort required to label them.
Lewis and Gale in “A sequential algorithm for training text classifiers” (Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 3-12, 1994) compared three strategies for requesting labels: random sampling, relevance sampling, and uncertainty sampling, concluding that, for a fixed labeling budget, uncertainty sampling generally yields a superior classifier. At the same time, however, uncertainty sampling offers no guarantee of effectiveness, and may converge to a sub-optimal classifier. Subsequent research in pool-based active learning has largely focused on methods inspired by uncertainty sampling, which seek to minimize classification error by requesting labels for the most informative examples. Over and above the problem of determining which document to select for review, it is important to determine a stopping criterion for terminating user review. One such technique described in Cormack VI uses an estimate of recall.
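By way of illustration only, the following sketch (not drawn from the disclosure; the function name select_batch and the score representation are assumptions) contrasts the three labeling strategies compared by Lewis and Gale, with uncertainty sampling choosing the documents whose classifier scores lie closest to the decision boundary:

```python
# Illustrative sketch: three ways to choose the next batch of training documents.
import random

def select_batch(pool_scores, batch_size, strategy="uncertainty"):
    """pool_scores maps an unlabeled document id to a classifier score in [0, 1]."""
    ids = list(pool_scores)
    if strategy == "random":
        return random.sample(ids, batch_size)
    if strategy == "relevance":
        # relevance sampling: request labels for the highest-scoring documents
        return sorted(ids, key=lambda d: pool_scores[d], reverse=True)[:batch_size]
    # uncertainty sampling: request labels for scores closest to the 0.5 boundary
    return sorted(ids, key=lambda d: abs(pool_scores[d] - 0.5))[:batch_size]
```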
The objective of finding substantially all relevant documents suggests that any review effort should continue until high recall has been achieved, and until achieving still higher recall would require disproportionate effort. Recall and other measures associated with information classification are discussed in Cormack VI. Measuring recall can be problematic, in part because of imprecision in the definition and assessment of relevance. See D. C. Blair, STAIRS redux: Thoughts on the STAIRS evaluation, ten years after (Journal of the American Society for Information Science, 47(1):4-22, January 1996); E. M. Voorhees, Variations in relevance judgments and the measurement of retrieval effectiveness (Information Processing & Management, 36(5):697-716, 2000); M. R. Grossman and G. V. Cormack, Comments on “The implications of rule 26(g) on the use of technology-assisted review” (Federal Courts Law Review, 7:285-313, 2014). This difficulty can also be due to the effort, bias, and imprecision associated with sampling. See M. Bagdouri, W. Webber, D. D. Lewis, and D. W. Oard, Towards minimizing the annotation cost of certified text classification (Proceedings of the 22nd ACM International Conference on Information and Knowledge Management, pages 989-998, 2013); M. Bagdouri, D. D. Lewis, and D. W. Oard, Sequential testing in classifier evaluation yields biased estimates of effectiveness (Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 933-936, 2013); M. R. Grossman and G. V. Cormack, Comments on “The implications of rule 26(g) on the use of technology-assisted review” (Federal Courts Law Review, 7:285-313, 2014). Accordingly, it can be difficult to specify an absolute threshold value that constitutes “high recall,” or to determine reliably that such a threshold has been reached. For example, what constitutes “high recall” may depend on the particular data set, gauged in relation to the effort required.
Quality is a measure of the extent to which a TAR method achieves “high recall”, while reliability is a measure of how consistently it achieves such an acceptable level of “high recall”. Accordingly, there is a need to define, measure, and achieve high quality and high reliability in TAR using reasonable effort through new and improved stopping criteria.
The invention provides novel systems and methods for determining when to terminate a classification process such that classifiers generated during iterations of the classification process will be able to accurately classify information for an information need to which they are applied (e.g., accurately classify documents in a collection as relevant or non-relevant) and thus, achieve high quality. In addition, these novel systems and methods will also achieve a given level of quality (e.g., recall) within a certain level of assurance and thus, achieve high reliability.
Systems and computerized methods terminate a classification process by executing a classification process which utilizes an iterative search strategy to classify documents in a document collection. The documents in the document collection are stored on a non-transitory storage medium. The systems and methods also determine an upper bound for an expected review effort using an estimate of the number of relevant documents identified as part of the classification process. The systems and methods further select a gain curve slope ratio threshold and compute points on a gain curve using a selected set of documents in the document collection and results from the classification process. The systems and methods then detect an inflection point in the gain curve and determine a slope ratio for the detected inflection point. The slope ratio for the inflection point is determined using a slope of the gain curve before the detected inflection point, and a slope of the gain curve after the detected inflection point. The classification process is terminated based upon a determination that the slope ratio for the detected inflection point exceeds the selected slope ratio threshold and that the upper bound for the expected review effort has been exceeded.
In certain embodiments, the upper bound for the review effort is an estimate of an effort that would provide a reliable statistical estimate of the number of relevant documents in the document collection. In certain embodiments, the reliable statistical estimate is achieved by random sampling the document collection. In certain embodiments, the systems and methods refine the upper bound of review effort during one or more iterations of the classification process.
In certain embodiments, the systems and methods also terminate the classification process when a pre-determined portion of the document collection has been reviewed. In certain embodiments, the systems and methods also evaluate the reliability or the effort of the classification process using quadratic loss functions.
The inventive principles are illustrated in the figures of the accompanying drawings, which are meant to be exemplary and not limiting.
One of the most vexing problems that has plagued the use of TAR is determining when to stop the review effort such that a sufficient number of relevant documents in the document collection have been identified. Generally, a good stopping strategy involves determining that as much relevant information as possible has been found, using reasonable effort. Certain stopping criteria for TAR processes are described in Cormack VI.
The present invention provides a reliable method to achieve high recall using any search strategy that repeatedly retrieves documents and receives relevance feedback. A determination can be made as to when to terminate the review effort using the techniques or stopping criteria described in accordance with certain embodiments of the invention. These techniques are applicable to search strategies and classification efforts such as: ranked retrieval, interactive search and judging (“ISJ”), move-to-front pooling, and continuous active learning (“CAL”). In ISJ, a searcher repeatedly formulates queries and examines the top results from a relevance-ranking search engine. CAL, on the other hand, uses machine learning instead of, or in addition to, manually formulated queries to rank the documents for review. Techniques for carrying out these search strategies are described in Cormack VI. See e.g., Cormack VI, ¶¶65-70, 130-136, 184-190.
One objective of the present invention is to provide quality assurance for TAR applications. Such applications include: electronic discovery (“eDiscovery”) in legal matters, systematic review in evidence-based medicine, and the creation of test collections for information retrieval (“IR”) evaluation. For these types of applications, the review effort may be measured as the total number of documents presented to a reviewer. Based on this measure, an ideal (or perfect) search would find all the relevant documents with effort equal to that number. In other words, in an ideal search, each document presented to the reviewer during the TAR process would be a relevant one. Since such an ideal search is most likely impractical or even impossible, an acceptable search strategy would find an acceptable percentage of the relevant documents and limit wasted effort (e.g., presenting non-relevant documents for review).
The systems and methods described and claimed herein are particularly useful for transforming an unclassified collection of information into a collection of classified information by generating and applying one or more classifiers to the unclassified information (e.g., documents). Beyond effecting this particular transformation, the systems and methods described and claimed herein are more efficient than other systems and methods for classifying information, while still maintaining overall classification accuracy and reliability. The systems and methods described herein provide reliable techniques for classification. For example, by utilizing classification processes to independently identify a previously identified target set of documents, the systems and methods are able to meet a designed-for level of results (e.g., recall) with a designed-for level of probability. The systems and methods described herein also reduce the amount of wasted effort in a review process. For example, the systems and methods described herein account for the gain realized from one or more iterations of a classification process and determine if further iterations are likely to produce substantially improved results (e.g., identify relevant documents). In a further example, the systems and methods described herein estimate a review budget (effort) from iterations of a classification process and terminate the classification process when the budget has been exceeded and/or the gain realized from one or more iterations of a classification process is unlikely to produce substantially improved results. Thus, the efficiencies of the systems and methods described and claimed herein are not merely based on the use of computer technology to improve classification speed. Instead, these systems and methods represent a fundamental improvement in at least the field of information classification by virtue of their overall configuration.
In accordance with these goals, three different stopping criteria for TAR processes are described. For nomenclature purposes, these stopping criteria are termed the “target”, “knee,” and “budget” techniques. Each of these stopping criteria is discussed in further detail below.
Target Technique
The target technique is a provably reliable method that uses a number of relevant documents chosen from a document collection as a target set. Next, an independent search method retrieves documents from the document collection until a sufficient profile of documents from the target set is retrieved or identified. Generally, this search is deemed independent because it does not rely on any knowledge of the target set in its search. Instead, the independent search may treat any document in the target set as a typical document would be treated in the search. For example, a document that is located as part of a CAL or other TAR process, and that also happens to be in the target set, may be used to train a classifier.
In this sense, the target technique differs from the use of a “control set” because control set documents are not used in training the classifier. Instead, a control set is a set of documents held out from training the classifier. This control set is used to measure the effectiveness of the classifier, in order to determine when to stop training, and then to measure recall, so as to determine how many documents should comprise the review set. Generally, the control set must be large enough to contain a sufficient number of relevant documents to yield a precise estimate. Because the use of a control set constitutes sequential sampling, however, its net effect is to yield a biased estimate of recall, which cannot be used for quality assurance. In contrast, in certain embodiments, the target method provides an unbiased measurement of recall, which can be used for quality assurance.
In step 1040, a separate, independent search strategy for identifying relevant documents is executed. In step 1040, the independent search strategy may be employed to classify documents from the collection (e.g., as relevant or non-relevant). Generally, any strategy used to identify relevant documents in a document collection may be used. In certain embodiments, the independent search strategy is a TAR process. In certain embodiments, the TAR process is a CAL approach, which retrieves and/or classifies the most likely relevant documents from the collection. Techniques for identifying relevant information (e.g., documents in a document collection) including CAL approaches are described in Cormack VI. See e.g., Cormack VI, ¶¶ 65-70, 130-136, 184-190. Preferably, when the separate search strategy involves a human reviewer, the reviewer should be shielded from knowledge of T. Alternatively, search strategies that don't primarily rely on human reviewers may be used. Such search strategies are described in Cormack IV, which is incorporated by reference herein.
In step 1060, a determination is made as to whether a stopping criterion for the independent search strategy is reached. In certain embodiments, a stopping criterion is reached when a sufficient number of documents m in T are identified as relevant as part of the independent search strategy, such that m≤k. For example, in certain embodiments, a stopping criterion is reached if the independent search strategy presents m documents in T to a reviewer. In certain embodiments, m is equal to the number of documents k in T. Accordingly, as illustrated in decision block 1080, the classification process may be terminated when the independent search strategy has substantially identified the documents in T, or control may return to an earlier step in method 1000 if it has not.
In certain embodiments, a stopping criterion is reached based upon the distribution of the sequence of documents presented to a reviewer during the TAR process. For example, the distribution may be based upon the types of documents presented to a reviewer and assigned classifications during a TAR process. Different document types considered for the distribution may include: documents in the target set T, relevant documents, and non-relevant documents. In certain embodiments, a stopping criterion is met when the distribution of documents presented to the reviewer during the TAR process satisfies a pre-defined target distribution constraint on document types associated with the target set. For example, a pre-defined target distribution constraint may specify that the first X % of the relevant documents (e.g., the first 80%) presented to and/or identified by the reviewer must include at least m of the k target documents (e.g., 9 of 10 target documents). In certain embodiments, a pre-defined target distribution constraint may specify that the documents presented to and/or identified by the reviewer must meet one or more particular classification sequences. For example, the target distribution constraint may specify that the documents presented to a reviewer be classified in order as T, r, r, r, T; another possible sequence may be r, T, n, r, r, T, where T represents a target set document, r represents a relevant document, and n indicates a non-relevant document.
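As a non-limiting illustration of the basic target stopping criterion described above, the following sketch assumes a target set T of k documents drawn at random from documents known to be relevant, and terminates the independent search once every member of T has been presented; the names target_method_stop, relevant_ids, and ranked_review_order are hypothetical and not part of the specification:

```python
import random

def target_method_stop(relevant_ids, ranked_review_order, k=10):
    """relevant_ids: documents confirmed relevant (used only to draw the target set T).
    ranked_review_order: the order in which the independent search presents documents."""
    target = set(random.sample(list(relevant_ids), k))   # target set T, |T| = k
    found = set()
    for effort, doc in enumerate(ranked_review_order, start=1):
        if doc in target:
            found.add(doc)
        if found == target:        # stopping criterion: all of T has been retrieved
            return effort          # review effort expended when the process may stop
    return None                    # ranking exhausted without retrieving all of T
```

A distribution-based constraint, such as the T, r, r, r, T sequence above, could be checked in the same loop by recording the classification assigned to each presented document.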
For the target technique, reliability is obtained at the cost of supplemental review effort, which is inversely proportional to R, the number of relevant documents in the collection. Generally, the number of randomly selected documents that need to be reviewed to find k relevant ones is approximately k·|C|/R, for R<<|C|, where |C| is the size of the document collection. The value of R/|C| is referred to as prevalence, the inverse of which is used in the above expression. For example, for k=10 and prevalence R/|C|=1%, the target method generally incurs a review overhead of approximately 1,000 documents. On average, lower prevalence entails more overhead, while higher prevalence entails less.
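A short worked computation (illustrative only) of the supplemental review overhead k·|C|/R for a few assumed prevalence values follows; it reproduces the approximately 1,000-document example for k=10 and 1% prevalence:

```python
k = 10
for prevalence in (0.10, 0.01, 0.001):        # prevalence = R / |C|
    print(prevalence, round(k / prevalence))  # expected overhead: 100, 1000, 10000 documents
```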
Unlike other stopping criteria (e.g., control sets), it can be demonstrated that the target technique achieves a statistical guarantee of reliability. For example, an embodiment described above can be demonstrated to achieve 70% recall 95 times in 100, hence achieving 95% reliability. Consider a document collection C and a function rel(d) indicating binary relevance (e.g., relevant, non-relevant). The number of relevant documents in the collection is: R=|{d∈C|rel(d)}|. A search strategy is a ranking on C where rank(d)=1 indicates that d is the first document retrieved, rank(d)=2 the next, and so on to rank(d)=|C|. Define relrank(d)=|{d′∈C|rel(d′) and rank(d′)≤rank(d)}|, the number of relevant documents retrieved at or before d. The retrieved set of the target technique is the shortest prefix P of the ranking that contains T. The last retrieved document dlast is necessarily in T:

dlast = argmax{d∈T} rank(d).

Recall can then be represented as:

recall = relrank(dlast)/R   (1)

Taking T to be a random variable, the method is reliable if:

Pr[recall ≥ recall_target] ≥ 0.95,

where recall_target is the target level of recall.

Assuming large R, consider the problem of determining a cutoff c such that:

Pr[relrank(dlast) ≤ (1−c)·R] ≤ 0.05   (2)

For the condition in Equation (2) to hold, it must be the case that the [numerically] top-ranked cR relevant documents (i.e., the cR relevant documents retrieved last) are absent from T, which occurs with probability:

C((1−c)R, k)/C(R, k),

where C(·,·) denotes the binomial coefficient. It follows that:

Pr[relrank(dlast) ≤ (1−c)·R] = C((1−c)R, k)/C(R, k) ≤ (1−c)^k   (3)

For all R>10, where k=10 and recall_target=0.7 (i.e., 1−c=0.7):

Pr[relrank(dlast) ≤ 0.7·R] ≤ 0.7^10 ≈ 0.028 ≤ 0.05.

Finally, combining (1) and (3), we have:

Pr[recall ≥ recall_target] ≥ 1 − recall_target^k ≥ 0.95.
Accordingly, it is demonstrated that the target technique is provably reliable such that it will achieve a target level of recall with a certain probability.
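The reliability argument above can also be checked empirically. The following Monte Carlo sketch (an illustration under the stated assumptions, not part of the disclosure) draws random target sets for a hypothetical collection with R relevant documents and measures how often the target technique reaches the target recall; the observed rate should be near 97%, consistent with the 0.7^10 ≈ 0.028 failure bound:

```python
import random

def simulate_reliability(R=1000, k=10, recall_target=0.7, trials=10000):
    hits = 0
    for _ in range(trials):
        # relrank positions (1..R) of the k randomly chosen target documents
        target_relranks = random.sample(range(1, R + 1), k)
        recall = max(target_relranks) / R   # recall achieved when the last target is found
        hits += recall >= recall_target
    return hits / trials

print(simulate_reliability())   # approximately 0.97
```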
Knee Technique
The knee method relies on the assumption that a TAR process (e.g., CAL), in accordance with the probability-ranking principle, ranks more-likely relevant documents before less-likely relevant documents. As can be seen in the accompanying drawings, this behavior yields a gain curve that rises steeply while relevant documents remain plentiful among the highly ranked documents, and flattens once most of the relevant documents have been retrieved.
Generally, an ideal gain curve would have slope 1.0 until an inflection point at rank R, corresponding to the point at which all relevant documents had been retrieved, and slope 0.0 thereafter. An example of an ideal gain curve is illustrated in the accompanying drawings.
In step 3040, a subset of documents is selected from a set of documents (e.g., a document collection). Generally, when selecting documents from a given set of documents, a subset of documents in the given set may be selected. For example, when selecting from the entire document collection, the entire document collection or less than the entire document collection may be selected. In certain embodiments, the documents are selected from the one or more batches of documents used as part of a TAR process. For example, if the TAR process uses exponentially increasing batch sizes during iterations, the documents may be selected from those batches of documents. In certain embodiments, the subset of documents is selected from the set of documents presented to a human reviewer as part of the TAR process (e.g., a sub-sample of documents). Use of reviewed sub-samples in a TAR process is discussed in Cormack V. In certain embodiments, documents are selected by randomly sampling documents from a given set of documents. The documents, however, may be selected in any known manner. Techniques for selecting documents are described in Cormack VI. See e.g., Cormack VI, ¶¶ 65-70, 184-190. If it is determined that a stopping criterion has not been reached, the sub-set of documents selected in step 3040 may be augmented by selecting additional documents for subsequent iterations of the steps of the method 3000. In certain embodiments, the selected sub-set of documents is augmented at each iteration until a stopping criterion is reached. In certain other embodiments, the sub-set of documents is re-selected at each iteration. The sub-set of documents may be augmented or re-selected using the techniques used for selecting the sub-set of documents described in step 3040.
In step 3060, the selected documents are scored using one or more classifiers generated by the TAR process. Generation of classifiers through a TAR process and using such classifiers to score documents are described in Cormack IV, Cormack V, and Cormack VI. See e.g., Cormack VI, ¶¶ 90-119. In certain embodiments, the document scores computed as part of the TAR process itself are used. In step 3070, the selected documents are ordered from rank 1 to s. In certain embodiments, the documents are ordered according to the scores generated by a TAR process. For example, the document with rank 1 may be the document with the highest score from step 3060, while the document with rank s may be the document receiving the lowest score. In certain embodiments, the documents are ordered according to the rank in which they were retrieved by the TAR process. For example, the document with rank 1 may be the document where rank(d)=1 and rank s is the last document presented to a human reviewer or otherwise retrieved by a TAR process.
In step 3080, points on a gain curve are computed using the selected documents and results from one or more iterations of a TAR process (e.g., document scores and/or user coding decisions). The number of relevant documents identified at or before each rank (e.g., the y-axis of the gain curve) may be computed from the user coding decisions for the ordered documents, with the rank itself plotted on the horizontal axis.
In step 3100, the gain curve may be smoothed to provide information between computed points on the gain curve. In certain embodiments, the gain curve is smoothed by using linear interpolation between points computed on the gain curve (e.g., points computed in step 3080). In certain embodiments, the gain curve is smoothed by fitting one or more curves (e.g., quadratic equations) to the points computed on the curve. Generally, smoothed gain curves may be used in any of the computations involving gain curves discussed herein.
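The smoothing of step 3100 may, for example, be carried out by simple linear interpolation between computed gain-curve points, as in the following illustrative helper (interpolate_gain is a hypothetical name, not part of the specification):

```python
def interpolate_gain(points, x):
    """points: (rank, relevant-documents-found) pairs in increasing rank order."""
    if x <= points[0][0]:
        return points[0][1]
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 <= x <= x1:
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)   # linear interpolation
    return points[-1][1]
```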
In step 3120, one or more “knees” or inflection points in the gain curve are detected. In certain embodiments, a knee is detected by first solving for the parameters m and b (for which b should be 0) of an equation (y=mx+b) describing a line l drawn from the origin of the gain curve to the point on the gain curve at rank s (see item 2040 of the accompanying drawings). A knee may then be detected at the point on the gain curve lying at the greatest perpendicular distance from the line l.
In step 3140, a candidate rank i is determined from a knee in the gain curve. In certain embodiments, the candidate rank i is determined to be the projection onto the horizontal axis of the gain curve of the point at which the perpendicular drawn from the line l intersects the gain curve (see item 2080 of the accompanying drawings).
In step 3160, the slope ratio α of the gain curve is computed along points on the gain curve. In certain embodiments, the slope ratio of the gain curve is evaluated at a selected point on the gain curve using linear relationships. For example, the slope ratio of the gain curve at a selected point may be computed by solving for the parameters m1 and b1 (for which b1 should be 0) of an equation (y=m1x+b1) of a line running from the origin to the recall achieved at the selected point on the gain curve; then solving for the parameters m2 and b2 of an equation (y=m2x+b2) of a line running from the recall achieved at the selected point on the gain curve to the recall achieved at rank s; and then computing the ratio of m1 to m2. In certain embodiments, the slope ratio α of the gain curve is computed at a candidate point i (e.g., as determined in step 3140) according to the equation:

α = (relret(i)/i) / ((relret(s) − relret(i) + SF)/(s − i)),

where relret(j) denotes the number of relevant documents among the j highest-ranked documents.
In certain embodiments, SF is a smoothing factor (e.g., SF=1), which avoids issues where no relevant documents are beyond the point i (e.g., division by zero). Correspondingly, this smoothing factor also penalizes situations where the point i is close to s (a late inflection point). In certain embodiments, a smoothing factor SF is not used (i.e., SF=0). In certain embodiments, a second smoothing factor SF is used in the numerator of the equation (e.g., in the numerator of the numerator of the equation for α above). When estimating a proportion (such as α above), smoothing is typically employed when the sample size may be small. Smoothing may be used to avoid certain undesirable situations (e.g., 0/0, y/0 and 0/x). A simple smoothing technique is to add some constant ε to the numerator, and 2*ε to the denominator. In this example, 0/0 becomes ½, y/0 becomes finite, and 0/x becomes non-zero. Another possible technique is to employ two constants (e.g., ε and λ), adding ε to the numerator and ε+λ to the denominator.
In step 3180, it is determined whether a slope ratio α of the gain curve exceeds the slope ratio threshold. As illustrated in decision block 3200, in certain embodiments, if a slope ratio α exceeds the slope ratio threshold, a stopping criterion for the TAR process has been reached and the process may be terminated. As also illustrated in decision block 3200, if a stopping criterion has not been reached, control may return to an earlier step in method 3000 (e.g., step 3040).
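Putting steps 3040 through 3200 together, the following sketch illustrates one possible implementation of the knee technique under the description above: it builds a gain curve from reviewer coding decisions, detects the knee as the point farthest (by perpendicular distance) from the line l joining the origin to the end of the curve, and compares the smoothed slope ratio to a threshold. The SF=1 smoothing and the threshold of 6.0 follow the discussion herein; the function names and remaining details are illustrative assumptions rather than a definitive implementation:

```python
def gain_curve(labels):
    """labels[j] is True if the document at rank j+1 was coded relevant."""
    points, relret = [(0, 0)], 0
    for rank, rel in enumerate(labels, start=1):
        relret += int(rel)
        points.append((rank, relret))
    return points

def detect_knee(points):
    """Return the candidate rank i at the greatest perpendicular distance from line l."""
    s, total = points[-1]
    if total == 0:
        return None
    best_i, best_dist = None, -1.0
    for rank, relret in points[1:-1]:
        # distance from (rank, relret) to the line y = (total / s) * x through the origin
        dist = abs(total * rank - s * relret) / (total ** 2 + s ** 2) ** 0.5
        if dist > best_dist:
            best_i, best_dist = rank, dist
    return best_i

def slope_ratio(points, i, sf=1.0):
    s, total = points[-1]
    relret_i = points[i][1]
    slope_before = relret_i / i                       # slope of the gain curve up to the knee
    slope_after = (total - relret_i + sf) / (s - i)   # slope after the knee, smoothed by SF
    return slope_before / slope_after

def knee_stop(labels, threshold=6.0):
    points = gain_curve(labels)
    i = detect_knee(points)
    return i is not None and slope_ratio(points, i) >= threshold
```

For example, knee_stop([True]*50 + [False]*250) returns True, because the gain curve flattens sharply after rank 50 and the resulting slope ratio (250) far exceeds the threshold.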
As discussed above, the method 3000 may be adjusted for document collections featuring a low prevalence of relevant documents (e.g., R less than approximately 100). Generally, during a TAR process, there is no knowledge of the value of R, other than what can be estimated through relevance feedback from retrieved documents. However, even if it were known that R was so small, the sparsity of relevant documents tends to compromise the reliability of the slope-ratio calculation described above. Through observation, it was determined that recall and reliability tend to decrease for smaller R, while effort tends to increase for larger R, for a given slope ratio threshold.
Accordingly, whether or not there is low prevalence of relevant documents, it may be beneficial to fix a minimum number of documents that must be retrieved before stopping the review, regardless of the existence of a knee. Additionally, the slope ratio threshold may be adjusted based upon relret, a function which returns the number of relevant documents retrieved at or before a given rank. For example, the slope ratio threshold may be adjusted according to the equation: 156−min(relret, 150). In this case, the slope ratio threshold is 156 when no relevant documents have been retrieved, and 6 whenever at least 150 relevant documents have been retrieved. Between these values, the threshold decreases linearly with the number of relevant documents retrieved.
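For completeness, the adaptive threshold just described can be expressed directly; this small helper (a sketch with an assumed name) follows the 156−min(relret, 150) formula above:

```python
def adaptive_threshold(relret):
    return 156 - min(relret, 150)   # 156 with no relevant documents found, 6 once 150 are found

assert adaptive_threshold(0) == 156 and adaptive_threshold(150) == 6
```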
Budget Technique
The budget technique aims to stop a TAR process when a review budget has been exceeded (e.g., one comparable to the target method) and/or when a gain curve slope-ratio threshold has been satisfied (e.g., as discussed with respect to the knee technique). For small values of R, this method appropriately delays termination of the TAR process, which correspondingly helps to ensure reliability.
The budget technique approach is predicated on the hypothesis that the supplemental review effort entailed by the target method may be better spent reviewing more documents retrieved by another TAR process (e.g., CAL). As discussed above, when using random selection, the target method entails the supplemental review of about k·|C|/R documents, for R<<|C|, in order to find k relevant ones. According to the probability-ranking principle, we would expect a TAR process such as CAL to find more relevant documents than random selection, for any level of effort, up to and beyond this amount. For example, at any point in the review, if R′ relevant documents have been found, the total number of relevant documents R must be at least R′. Therefore, the expected number of documents that must be reviewed to create a target set with k documents is k·|C|/R, which is less than or equal to k·|C|/R′. In other words, k·|C|/R′ represents an upper bound on the amount of additional review that would be required for the target method.
In step 4040, a slope ratio threshold for α is selected. In certain embodiments, a slope ratio threshold of 6.0 may be selected (i.e., α=6.0). A slope ratio threshold may be selected in any known manner. For example, a slope ratio threshold may be selected in accordance with the methods described with respect to the knee technique (e.g., method 3000).
In step 4060, a determination is made as to whether a stopping criterion has been reached. In certain embodiments, a stopping criterion is reached when the number of documents reviewed during the TAR process exceeds an upper bound for expected review effort, and the slope ratio of the gain curve at an inflection point (knee) exceeds a slope ratio threshold. In certain embodiments, the upper bound may be one that was determined in accordance with step 4020 described above. In certain embodiments, a stopping criterion is reached when a certain portion of the document collection has been reviewed, even when an expected review effort and/or a gain curve slope ratio threshold has not been exceeded. In certain embodiments, the certain portion of documents in the collection to be reviewed is 0.75 |C|. A ratio of 0.75 |C| is predicated on the probability-ranking principle: random selection of 75% of the collection would, with high probability, achieve 70% recall. Accordingly, by reviewing the top-ranked documents (e.g., as specified in certain CAL approaches to TAR), a 0.75 |C| review effort should achieve even higher recall.
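The following sketch combines the budget technique's conditions as described above: an unconditional stop at 0.75|C|, and otherwise a stop once the review effort exceeds the k·|C|/R′ upper bound and a knee with a sufficient slope ratio has been detected. The name budget_stop, the knee_found flag, and the parameterization are assumptions for illustration only:

```python
def budget_stop(reviewed, relret, collection_size, knee_found, k=10):
    """reviewed: documents presented so far; relret: relevant documents found so far (R');
    knee_found: whether a knee exceeding the slope ratio threshold has been detected."""
    if reviewed >= 0.75 * collection_size:          # unconditional stop at 0.75|C|
        return True
    if relret == 0:
        return False                                # no basis yet for the k*|C|/R' bound
    upper_bound = k * collection_size / relret      # upper bound on the target method's effort
    return reviewed >= upper_bound and knee_found
```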
Measuring Quality and Loss in Information Classification
Reliability alone does not capture certain important aspects of effectiveness or efficiency in information classification. Additionally, empirical measurements of reliability lack statistical rigor, while parametric estimates depend on unproven assumptions regarding the distribution of recall values. Furthermore, the choices of acceptable recall and acceptable reliability are both somewhat arbitrary.
As an alternative, statistical measures of recall can be employed (e.g., mean μ, standard deviation δ, and/or variance δ² of recall) to provide more useful information about the quality of a classification effort. For example, quality Q may be measured according to the equation: Q=μ−z_score·δ, where z_score represents the number of standard deviations an observation or datum is above/below the mean. For instance, when z_score=1.64, the prior equation represents a quantitative measure of quality, which may be used to determine the threshold level of acceptable recall for which 95% reliability may be obtained. Other values for z_score may be substituted for lower or higher reliability thresholds as desired.
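By way of example, the quality measure Q=μ−z_score·δ can be computed from recall values observed over repeated runs of a TAR process; the following snippet is illustrative only and uses the Python standard library:

```python
import statistics

def quality(recalls, z_score=1.64):
    """Q = mean recall minus z_score standard deviations (z_score=1.64 for ~95% one-sided)."""
    return statistics.mean(recalls) - z_score * statistics.stdev(recalls)
```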
Furthermore, it is possible to replace measurements of reliability and recall with quality estimates based upon loss functions. For example, Q may be measured as Q=1−loss, where loss is a loss function such as those described below.
To capture the desirability of consistently high recall, a quadratic loss function may be used. For example, such a loss function may be expressed as: lossr=(1−recall)2. Such a quadratic function subsumes the roles of mean μ and standard deviation δ discussed above. Because the idealized goal is 100% recall, quadratic loss functions tend to penalize larger shortfalls in recall more severely.
Quadratic loss functions can be used to measure the quality of an information classification effort across multiple identified facets/(sub)categories of relevance. For example, let a1, a2, . . . an be categories of relevance, let relai(d) indicate whether a document d is relevant to category ai, and let recallai denote the recall achieved for category ai (i.e., the fraction of the documents relevant to ai that are identified by the review). Furthermore, a quadratic loss function for recall of a facet/(sub)category ai may be expressed as recall_lossai=(1−recallai)². Using the loss for each facet/(sub)category, the loss across all facets/(sub)categories may be expressed as:

lossr = Σi ωi·recall_lossai,

where ωi is a weight assigned to facet/(sub)category ai. In certain embodiments, the weights ωi may be normalized such that Σi ωi=1. In certain embodiments, the weights ωi are apportioned equally such that ωi=1/n.
The choice of weights ωi, however, is not critical and certain facets/(sub)categories ai may be afforded more or less influence by similarly adjusting the corresponding weight ωi.
As with recall, review effort may also be modeled using a loss function, which quantifies the concept of “reasonable effort.” Generally, an ideal effort would entail effort=R. However, a “reasonable effort” may be expressed as effort=aR+b, where a represents the ratio of the number of documents reviewed to the number of relevant documents R, and b represents a fixed overhead. A quadratic loss function losse, defined relative to this reasonable effort, may be used instead, where “effort” is representative of the number of documents reviewed (e.g., by a human reviewer) during the review effort. As with recall, such loss functions may also be used to measure the loss associated with the effort applicable to each facet/(sub)category ai. Furthermore, these individual facet/(sub)category measures may be combined to form a total loss effort measure (e.g., a weighted sum of the per-facet effort losses, analogous to lossr above).
In certain embodiments, various loss measures (e.g., lossr and losse) are aggregated to form a combined loss measure. When aggregating the various loss measures, each individual loss function may be weighted. In certain embodiments, the loss measures are weighted equally. In certain embodiments, the loss measures are unequally weighted.
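The loss measures discussed in this section might be combined as in the following sketch. The per-facet recall loss (1−recall)² and the weighted aggregation follow the text above; the particular form of the effort loss (a quadratic penalty on effort in excess of aR+b) and the equal weighting of the two components are illustrative assumptions, since the description leaves those choices open:

```python
def recall_loss(facet_recalls, weights=None):
    n = len(facet_recalls)
    weights = weights or [1.0 / n] * n                  # equal, normalized weights by default
    return sum(w * (1.0 - r) ** 2 for w, r in zip(weights, facet_recalls))

def effort_loss(effort, R, a=1.0, b=100.0):
    reasonable = a * R + b                              # "reasonable effort" = aR + b
    excess = max(0.0, effort - reasonable)
    return (excess / reasonable) ** 2                   # assumed quadratic penalty on excess effort

def total_loss(facet_recalls, effort, R, w_recall=0.5, w_effort=0.5):
    return w_recall * recall_loss(facet_recalls) + w_effort * effort_loss(effort, R)
```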
In addition, the systems and platforms described with respect to the accompanying figures may be used to implement the classification processes and stopping criteria described herein.
One of ordinary skill in the art will appreciate that, aside from providing advantages in e-discovery review, the improved active learning systems, methods and media discussed throughout the disclosure herein may be applicable to a wide variety of fields that require data searching, retrieval, and screening. This is particularly true for applications which require searching for predetermined information or patterns within electronically stored information (regardless of format, language and size), especially as additional documents are added to the collection to be searched. Exemplary areas of potential applicability are law enforcement, security, and surveillance, as well as internet alert or spam filtering, regulatory reporting and fraud detection (whether within internal organizations or for regulatory agencies).
For example, in law enforcement, security, and for surveillance applications, the principles of the invention could be used to uncover new potential threats using already developed classifiers or to apply newly-classified information to discover similar patterns in prior evidence (e.g., crime or counter-terrorism prevention, and detection of suspicious activities). As another example, the principles of the invention could be used for healthcare screening using already developed classifiers or to apply newly-classified information to discover similar patterns in prior evidence (e.g., as predictors for conditions and/or outcomes).
While there have been shown and described various novel features of the invention as applied to particular embodiments thereof, it will be understood that various omissions and substitutions and changes in the form and details of the systems, methods and media described and illustrated, may be made by those skilled in the art without departing from the spirit of the invention. For example, the various method steps described herein may be reordered, combined, or omitted where applicable. Those skilled in the art will recognize, based on the above disclosure and an understanding therefrom of the teachings of the invention, that the particular hardware and devices that are part of the invention, and the general functionality provided by and incorporated therein, may vary in different embodiments of the invention. Accordingly, the particular systems, methods and results shown in the figures are for illustrative purposes to facilitate a full and complete understanding and appreciation of the various aspects and functionality of particular embodiments of the invention as realized in system and method embodiments thereof. Any of the embodiments described herein may be hardware-based, software-based and preferably comprise a mixture of both hardware and software elements. Thus, while the description herein may describe certain embodiments, features or components as being implemented in software or hardware, it should be recognized that any embodiment, feature or component that is described in the present application may be implemented in hardware and/or software. Those skilled in the art will appreciate that the invention can be practiced in other than the described embodiments, which are presented for purposes of illustration and not limitation, and the present invention is limited only by the claims which follow.
The present application claims the benefit of U.S. Provisional Application No. 62/182,028, filed on Jun. 19, 2015, entitled “Systems and Methods for Conducting and Terminating a Technology-Assisted Review,” and U.S. Provisional Application No. 62/182,072, filed on Jun. 19, 2015, entitled “Systems and Methods for Conducting a Highly Autonomous Technology-Assisted Review.” The present application is also related to concurrently filed U.S. patent application Ser. No. 15/186,360 entitled “Systems and Methods for Conducting and Terminating a Technology-Assisted Review” by Cormack and Grossman (hereinafter “Cormack I”). The present application is also related to concurrently filed U.S. patent application Ser. No. 15/186,366 entitled “Systems and Methods for Conducting and Terminating a Technology-Assisted Review” by Cormack and Grossman (hereinafter “Cormack II”). The present application is also related to concurrently filed U.S. patent application Ser. No. 15/183,382 entitled “Systems and Methods for Conducting a Highly Autonomous Technology-Assisted Review Classification” by Cormack and Grossman (hereinafter “Cormack IV”). The present application is also related to concurrently filed U.S. patent application Ser. No. 15/186,387 entitled “Systems and Methods for a Scalable Continuous Active Learning Approach to Information Classification” by Cormack and Grossman (hereinafter “Cormack V”). The present application is also related to U.S. application Ser. No. 13/840,029 (now U.S. Pat. No. 8,620,842), filed on Mar. 15, 2013, entitled “Systems and methods for classifying electronic information using advanced active learning techniques” by Cormack and Grossman and published as U.S. Patent Publication No. 2014/0279716 (hereinafter “Cormack VI”). The contents of all of the above-identified applications and patent publications are hereby incorporated by reference in their entireties.
U.S. Patent Documents

Number | Name | Date | Kind
---|---|---|---|
4839853 | Deerwester et al. | Jun 1989 | A |
5675710 | Lewis | Oct 1997 | A |
5675819 | Schuetze | Oct 1997 | A |
6189002 | Roitblat | Feb 2001 | B1 |
6463430 | Brady et al. | Oct 2002 | B1 |
6678679 | Bradford | Jan 2004 | B1 |
6687696 | Hofman et al. | Feb 2004 | B2 |
6738760 | Krachman | May 2004 | B1 |
6751614 | Rao | Jun 2004 | B1 |
6778995 | Gallivan | Aug 2004 | B1 |
6847966 | Sommer et al. | Jan 2005 | B1 |
6888548 | Gallivan | May 2005 | B1 |
6954750 | Bradford | Oct 2005 | B2 |
6978274 | Gallivan et al. | Dec 2005 | B1 |
7113943 | Bradford et al. | Sep 2006 | B2 |
7197497 | Cossock | Mar 2007 | B2 |
7272594 | Lynch et al. | Sep 2007 | B1 |
7313556 | Gallivan et al. | Dec 2007 | B2 |
7328216 | Hofman et al. | Feb 2008 | B2 |
7376635 | Porcari et al. | May 2008 | B1 |
7440622 | Evans | Oct 2008 | B2 |
7461063 | Rios | Dec 2008 | B1 |
7483892 | Sommer et al. | Jan 2009 | B1 |
7502767 | Forman | Mar 2009 | B1 |
7529737 | Aphinyanaphongs et al. | May 2009 | B2 |
7529765 | Brants et al. | May 2009 | B2 |
7558778 | Carus et al. | Jul 2009 | B2 |
7574409 | Patinkin | Aug 2009 | B2 |
7574446 | Collier et al. | Aug 2009 | B2 |
7580910 | Price | Aug 2009 | B2 |
7610313 | Kawai et al. | Oct 2009 | B2 |
7657522 | Puzicha et al. | Feb 2010 | B1 |
7676463 | Thompson et al. | Mar 2010 | B2 |
7747631 | Puzicha et al. | Jun 2010 | B1 |
7809727 | Gallivan et al. | Oct 2010 | B2 |
7844566 | Wnek | Nov 2010 | B2 |
7853472 | Al-Abdulqader et al. | Dec 2010 | B2 |
7899871 | Kumar et al. | Mar 2011 | B1 |
7912698 | Statnikov et al. | Mar 2011 | B2 |
7933859 | Puzicha et al. | Apr 2011 | B1 |
8005858 | Lynch et al. | Aug 2011 | B1 |
8010534 | Roitblat et al. | Aug 2011 | B2 |
8015124 | Milo et al. | Sep 2011 | B2 |
8015188 | Gallivan et al. | Sep 2011 | B2 |
8024333 | Puzicha et al. | Sep 2011 | B1 |
8079752 | Rausch et al. | Dec 2011 | B2 |
8103678 | Puzicha et al. | Jan 2012 | B1 |
8126826 | Pollara et al. | Feb 2012 | B2 |
8165974 | Privault et al. | Apr 2012 | B2 |
8171393 | Rangan et al. | May 2012 | B2 |
8185523 | Lu et al. | May 2012 | B2 |
8189930 | Renders et al. | May 2012 | B2 |
8219383 | Statnikov et al. | Jul 2012 | B2 |
8275772 | Aphinyanaphongs et al. | Sep 2012 | B2 |
8296309 | Brassil et al. | Oct 2012 | B2 |
8326829 | Gupta | Dec 2012 | B2 |
8346685 | Ravid | Jan 2013 | B1 |
8392443 | Allon et al. | Mar 2013 | B1 |
8429199 | Wang et al. | Apr 2013 | B2 |
8527523 | Ravid | Sep 2013 | B1 |
8533194 | Ravid et al. | Sep 2013 | B1 |
8543520 | Diao | Sep 2013 | B2 |
8612446 | Knight | Dec 2013 | B2 |
8620842 | Cormack | Dec 2013 | B1 |
8706742 | Ravid et al. | Apr 2014 | B1 |
8713023 | Cormack et al. | Apr 2014 | B1 |
8751424 | Wojcik | Jun 2014 | B1 |
8838606 | Cormack et al. | Sep 2014 | B1 |
8996350 | Dub et al. | Mar 2015 | B1 |
9122681 | Cormack et al. | Sep 2015 | B2 |
9171072 | Scholtes et al. | Oct 2015 | B2 |
9223858 | Gummaregula et al. | Dec 2015 | B1 |
9235812 | Scholtes | Jan 2016 | B2 |
9269053 | Naslund et al. | Feb 2016 | B2 |
9595005 | Puzicha et al. | Mar 2017 | B1 |
9607272 | Yu et al. | Mar 2017 | B1 |
9886500 | George et al. | Feb 2018 | B2 |
20020007283 | Anelli | Jan 2002 | A1 |
20030120653 | Brady et al. | Jun 2003 | A1 |
20030139901 | Forman | Jul 2003 | A1 |
20030140309 | Saito et al. | Jul 2003 | A1 |
20040064335 | Yang | Apr 2004 | A1 |
20050010555 | Gallivan | Jan 2005 | A1 |
20050027664 | Johnson et al. | Feb 2005 | A1 |
20050134935 | Schmidtler et al. | Jun 2005 | A1 |
20050171948 | Knight | Aug 2005 | A1 |
20050228783 | Shanahan | Oct 2005 | A1 |
20050289199 | Aphinyanaphongs et al. | Dec 2005 | A1 |
20060074908 | Selvaraj | Apr 2006 | A1 |
20060161423 | Scott et al. | Jul 2006 | A1 |
20060212142 | Madani et al. | Sep 2006 | A1 |
20060242098 | Wnek | Oct 2006 | A1 |
20060242190 | Wnek | Oct 2006 | A1 |
20060294101 | Wnek | Dec 2006 | A1 |
20070122347 | Statnikov et al. | May 2007 | A1 |
20070156615 | Davar et al. | Jul 2007 | A1 |
20070156665 | Wnek | Jul 2007 | A1 |
20070179940 | Robinson et al. | Aug 2007 | A1 |
20080052273 | Pickens | Feb 2008 | A1 |
20080059187 | Roitblat et al. | Mar 2008 | A1 |
20080086433 | Schmidtler et al. | Apr 2008 | A1 |
20080104060 | Abhyankar et al. | May 2008 | A1 |
20080141117 | King et al. | Jun 2008 | A1 |
20080154816 | Xiao | Jun 2008 | A1 |
20080288537 | Golovchinsky et al. | Nov 2008 | A1 |
20090006382 | Tunkelang et al. | Jan 2009 | A1 |
20090024585 | Back et al. | Jan 2009 | A1 |
20090077068 | Aphinyanaphongs et al. | Mar 2009 | A1 |
20090077570 | Oral et al. | Mar 2009 | A1 |
20090083200 | Pollara et al. | Mar 2009 | A1 |
20090119140 | Kuo et al. | May 2009 | A1 |
20090119343 | Jiao et al. | May 2009 | A1 |
20090157585 | Fu et al. | Jun 2009 | A1 |
20090164416 | Guha | Jun 2009 | A1 |
20090265609 | Rangan et al. | Oct 2009 | A1 |
20100030763 | Chi | Feb 2010 | A1 |
20100030798 | Kumar et al. | Feb 2010 | A1 |
20100049708 | Kawai et al. | Feb 2010 | A1 |
20100077301 | Bodnick et al. | Mar 2010 | A1 |
20100082627 | Lai et al. | Apr 2010 | A1 |
20100106716 | Matsuda | Apr 2010 | A1 |
20100150453 | Ravid et al. | Jun 2010 | A1 |
20100169244 | Zeljkovic et al. | Jul 2010 | A1 |
20100198864 | Ravid et al. | Aug 2010 | A1 |
20100217731 | Fu et al. | Aug 2010 | A1 |
20100250474 | Richards et al. | Sep 2010 | A1 |
20100253967 | Privault et al. | Oct 2010 | A1 |
20100257141 | Monet et al. | Oct 2010 | A1 |
20100287160 | Pendar | Nov 2010 | A1 |
20100293117 | Xu | Nov 2010 | A1 |
20100306206 | Brassil et al. | Dec 2010 | A1 |
20100312725 | Privault et al. | Dec 2010 | A1 |
20110004609 | Chitiveli | Jan 2011 | A1 |
20110029525 | Knight | Feb 2011 | A1 |
20110029526 | Knight et al. | Feb 2011 | A1 |
20110029527 | Knight et al. | Feb 2011 | A1 |
20110029536 | Knight et al. | Feb 2011 | A1 |
20110047156 | Knight et al. | Feb 2011 | A1 |
20110103682 | Chidlovskii et al. | May 2011 | A1 |
20110119209 | Kirshenbaum et al. | May 2011 | A1 |
20110125751 | Evans | May 2011 | A1 |
20110251989 | Kraaij et al. | Oct 2011 | A1 |
20110295856 | Roitblat et al. | Dec 2011 | A1 |
20110307437 | Aliferis et al. | Dec 2011 | A1 |
20110314026 | Pickens et al. | Dec 2011 | A1 |
20110320453 | Gallivan et al. | Dec 2011 | A1 |
20120047159 | Pickens et al. | Feb 2012 | A1 |
20120095943 | Yankov et al. | Apr 2012 | A1 |
20120102049 | Puzicha et al. | Apr 2012 | A1 |
20120158728 | Kumar et al. | Jun 2012 | A1 |
20120191708 | Barsony et al. | Jul 2012 | A1 |
20120278266 | Naslund et al. | Nov 2012 | A1 |
20120278321 | Traub | Nov 2012 | A1 |
20140108312 | Knight et al. | Apr 2014 | A1 |
20140280173 | Scholtes et al. | Sep 2014 | A1 |
20150012448 | Bleiweiss et al. | Jan 2015 | A1 |
20150310068 | Pickens et al. | Oct 2015 | A1 |
20150324451 | Cormack et al. | Nov 2015 | A1 |
20160019282 | Lewis et al. | Jan 2016 | A1 |
20160371260 | Cormack et al. | Dec 2016 | A1 |
20160371261 | Cormack et al. | Dec 2016 | A1 |
20160371262 | Cormack et al. | Dec 2016 | A1 |
20160371364 | Cormack et al. | Dec 2016 | A1 |
Foreign Patent Documents

Number | Date | Country
---|---|---|
103092931 | May 2013 | CN |
WO 2013010262 | Jan 2013 | WO |
Other Publications
Forman, An Extensive Empirical Study of Feature Selection Metrics for Text Classification, pp. 207-213 (Year: 2003). |
Yang, Inflection points and singularities on C-curves, pp. 207-213, pp. 1289-1305 (Year: 2003). |
Almquist, “Mining for Evidence in Enterprise Corpora”, Doctoral Dissertation, University of Iowa, 2011, http://ir.uiowa.edu/etd/917. |
Analytics News Jul. 11, 2013, Topiary Discovery LLC blog, Critical Thought in Analytics and eDiscovery [online], [retrieved on Jul. 15, 2013]. Retrieved from the Internet: URL<postmodern-ediscovery.blogspot.com>. |
Bagdouri et al. “Towards Minimizing the Annotation Cost of Certified Text Classification,” CIKM '13, Oct. 27-Nov. 1, 2013. |
Ball, “Train, Don't Cull, Using Keywords”, [online] Aug. 5, 2012, [retrieved on Aug. 30, 2013]. Retrieved from the Internet: URL<ballinyourcourt.wordpress.com/2012/08/05/train-don't-cull-using-keywords/. |
Büttcher et al., “Information Retrieval Implementing and Evaluating Search Engines”, The MIT Press, Cambridge, MA/London, England, Apr. 1, 2010. |
Cormack et al., “Efficient and Effective Spam Filtering and Re-ranking for Large Web Datasets”, Apr. 29, 2010. |
Cormack et al., “Machine Learning for Information Retrieval: TREC 2009 Web, Relevance Feedback and Legal Tracks”, Cheriton School of Computer Science, University of Waterloo. |
Cormack et al., “Power and Bias of Subset Pooling Strategies”, Published Jul. 23-27, 2007, SIGIR 2007 Proceedings, pp. 837-838. |
Cormack et al., “Reciprocal Rank Fusion outperforms Condorcet and Individual Rank Learning Methods”, SIGIR 2009 Proceedings, pp. 758-759. |
Cormack et al., “Autonomy and Reliability of Continuous Active Learning for Technology-Assisted Review,” Apr. 26, 2015. |
Cormack et al., “Evaluation of Machine-Learning Protocols for Technology-Assisted Review in Electronic Discovery,” Jan. 27, 2014. |
Cormack et al., “Evaluation of Machine-Learning Protocols for Technology-Assisted Review in Electronic Discovery,” SIGIR 14, Jul. 6-11, 2014. |
Cormack et al., “Multi-Faceted Recall of Continuous Active Learning for Technology-Assisted Review,” Sep. 13, 2015. |
Cormack et al., “Scalability of Continuous Active Learning for Reliable High-Recall Text Classification,” Feb. 12, 2016. |
Cormack et al., “Engineering Quality and Reliability in Technology-Assisted Review,” Jan. 21, 2016. |
Cormack et al., “Waterloo (Cormack) Participation in the TREC 2015 Total Recall Track,” Jan. 24, 2016. |
Godbole et al., “Document classification through interactive supervision of document and term labels”, PKDD 2004, pp. 12. |
Grossman et al., “Technology-Assisted Review in E-Discovery Can Be More Effective and More Efficient Than Exhaustive Manual Review”, XVII Rich. J.L. & Tech. 11 (2011), http://jolt.richmond.edu/v17l3/article11.pdf |
Lad et al., “Learning to Rank Relevant & Novel Documents Through User Feedback”, CIMM 2010, pp. 10. |
Lu et al., “Exploiting Multiple Classifier Types with Active Learning”, GECCO, 2009, pp. 1905-1908. |
Pace et al., “Where the Money Goes: Understanding Litigant Expenditures for Producing Electronic Discovery”, RAND Institute for Civil Justice, 2012. |
Pickens, “Predictive Ranking: Technology Assisted Review Designed for the Real World”, Catalyst Repository Systems, Feb. 1, 2013. |
Safedi et al., “active learning with multiple classifiers for multimedia indexing”, Multimed. Tools Appl. 2012, 60, pp. 403-417. |
Shafiei et al., “Document Representation and Dimension Reduction for Text Clustering”, Data Engineering Workshop, 2007, pp. 10. |
Seggebruch, “Electronic Discovery Utilizing Predictive Coding”, Recommind, Inc. [online], [retrieved on Jun. 30, 2013]. Retrieved from the Internet: URL<http://www.toxictortlitigationblog.com/Disco.pdf>. |
Wallace et al., “Active Learning for Biomedical Citation Screening,” KDD' 10 , Jul. 25-28, 2010. |
Webber et al., “Sequential Testing in Classifier Evaluation Yields Biased Estimates of Effectiveness,” SIGIR '13, Jul. 28-Aug. 1, 2013. |