Due to the copious amounts of information attributable to the popularity of personal computing and the internet, it has become increasingly difficult for users to effectively sift through and examine such extensive data or document sets. In addition, document search, and particularly document matching, has been the subject of numerous research and commercial tools. Document matching is generally utilized for searching and clustering similar documents, organizing folders, and other content management purposes.
Typically, a document of interest is identified, and similar documents are matched against the target document on a one-to-one basis according to their semantic similarity. In cases where the key concepts in a target document are present in combination within multiple documents, the user faces the tedious process of breaking down the concepts in the document of interest, performing partial matches, determining the relevance of the documents, and manually compiling a set of documents which, in combination, match the document of interest.
The features and advantages of the invention, as well as additional features and advantages thereof, will be more clearly understood hereinafter as a result of a detailed description of particular embodiments of the invention when taken in conjunction with the following drawings in which:
The following discussion is directed to various embodiments. Although one or more of these embodiments may be discussed in detail, the embodiments disclosed should not be interpreted, or otherwise used, as limiting the scope of the disclosure, including the claims. In addition, one skilled in the art will understand that the following description has broad application, and the discussion of any embodiment is meant only to be an example of that embodiment, and not intended to intimate that the scope of the disclosure, including the claims, is limited to that embodiment. Furthermore, as used herein, the designators “A”, “B” and “N” particularly with respect to the reference numerals in the drawings, indicate that a number of the particular feature so designated can be included with examples of the present disclosure. The designators can represent the same or different numbers of the particular features.
The figures herein follow a numbering convention in which the first digit or digits correspond to the drawing figure number and the remaining digits identify an element or component in the drawing. Similar elements or components between different figures may be identified by the use of similar digits. For example, 143 may reference element “43” in
Unless specifically stated otherwise as apparent from the following discussions, it is appreciated that throughout the description of embodiments, discussions utilizing terms such as “detecting,” “determining,” “operating,” “using,” “accessing,” “comparing,” “associating,” “deleting,” “adding,” “updating,” “receiving,” “transmitting,” “inputting,” “outputting,” “creating,” “obtaining,” “executing,” “storing,” “generating,” “annotating,” “extracting,” “causing,” “transforming data,” “modifying data to transform the state of a computer system,” or the like, refer to the actions and processes of a computer system, data storage system, storage system controller, microcontroller, processor, or similar electronic computing device or combination of such electronic computing devices. The computer system or similar electronic computing device manipulates and transforms data represented as physical (electronic) quantities within the computer system's/device's registers and memories into other data similarly represented as physical quantities within the computer system's/device's memories or registers or other such information storage, transmission, or display devices.
Prior solutions for document matching involve comparing a target document with a single semantically similar document. Historically, document matching techniques have focused on matching pairs of documents based on their similarities (i.e., identity). For example, automated document matching is the process of determining if two or more documents are semantically similar. Automated document matching relies on computational linguistics and text analysis capabilities, which consider synonyms, thesauri, lexicology, and anaphora resolution, as well as statistical methods. In many cases, however, not all of the key concepts in a target document are present on a one-to-one basis in any single other document. In such cases, either the document matching process fails, or the similarity threshold has to be reduced. The latter scenario may lead to numerous unwanted false-positive matches. For example, if a target document has key elements ABMNXY, while a first relevant document has elements AB, a second relevant document contains elements MN, and a third relevant document includes elements XY, then it is apparent that no individual document exactly matches the target document. However, the first, second, and third relevant documents, in combination, match the target document. Many applications, such as searches for sales collateral, patent obviousness, plagiarism detection, and other advanced document search techniques, can benefit from matching documents in combinations. Therefore, there is a need to match multiple documents against a target document, where the key concepts of the target document appear, collectively, in a combination of two or more other relevant documents.
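The ABMNXY example above can be illustrated with a minimal sketch. It assumes that each document has already been reduced to a set of single-letter key elements, and it uses a brute-force search over small groups of documents; the function name `covering_combinations`, the `max_docs` limit, and the sample library are hypothetical and are not taken from the disclosure.

```python
from itertools import combinations


def covering_combinations(target, library, max_docs=3):
    """Return the smallest groups of documents that jointly cover `target`.

    `target` is an iterable of key elements; `library` maps a document
    name to its set of key elements. Both representations are assumptions
    made for illustration only.
    """
    target = set(target)
    for size in range(1, max_docs + 1):
        hits = [
            names
            for names in combinations(sorted(library), size)
            # The union of the group's elements must contain every target element.
            if target <= set().union(*(library[name] for name in names))
        ]
        if hits:
            return hits  # stop at the smallest group size that works
    return []


library = {
    "doc1": {"A", "B"},
    "doc2": {"M", "N"},
    "doc3": {"X", "Y"},
    "doc4": {"A", "Q"},
}

# No single document or pair covers ABMNXY; doc1, doc2, and doc3 together do.
print(covering_combinations("ABMNXY", library))
```

An exhaustive search of this kind grows quickly with the size of the library, so it is suitable only for showing the idea on a handful of documents.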
Embodiments of the present invention disclose a method and system for combinatorial document matching. More particularly, examples disclosed herein provide a method for identifying a collection of documents which, in combination, match a target document. According to one example embodiment, via text or linguistic analysis, key concepts in a target document are identified and analyzed. A similar process analyzes a source document library, and combinations of information associated with the plurality of the documents are used to match information affiliated with the target document. If a match is determined, the set of documents is returned as relevant documents which, in combination, match or substantially correspond to the target document. Hence, document search capabilities can be significantly enhanced by avoiding false negatives resulting from each document possessing only portions of the target document and not a full match unto itself. The advantages afforded by examples of the present invention include, for example, better search results for sales collateral, more effective plagiarism and patent obviousness detection, legal precedent identification, and improved eDiscovery.
Referring now in more detail to the drawings in which like numerals identify corresponding parts throughout the views,
Similar to the process of analyzing the related document set 204 described above, the text analyzer 205 is also utilized for analyzing the target document 202, which may be declared and input into the combinatorial document matching system 200 by an operating user, for example. That is, concept and phrase extraction of the target document 202 is facilitated using elements 207, 208, and 209 of the text analyzer 205 so as to create vectors, or pointers to a dynamically allocated data array, of key concepts 225 associated with the target document 202. Thereafter, concept parser 230 is configured to analyze and parse the concepts 225 into all possible permutations (i.e., groupings of the concepts into complementary subsets). For example, concepts ABXY associated with the target document may be parsed into A+BXY, AB+XY, ABX+Y, B+AXY, BX+AY, etc. The possible permutations are then used to form the permutated concept data set 235, which may be a set of vectors associated with various concept combinations of the target document 202. In the present example, combinatorial document matching is performed by the concept comparator 240 analyzing and comparing data of the consolidated source document information 220 with data (e.g., permutated concept data set 235) affiliated with the target document 202. More generally, the concept comparator 240 matches concepts of the target document with the concepts of at least a pair of documents associated with the consolidated relevant document source 220. According to one example embodiment, the concept comparator 240 utilizes the document pointers (i.e., vectors associated with information 220 and 235) for compiling a set of relevant documents/concepts 245 which, in combination, match or substantially correspond to the concepts disclosed in the target document 202.
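The parsing and comparison steps described above can be sketched as follows. This is an illustration under stated assumptions, not the implementation of the concept parser 230 or concept comparator 240: it interprets the "permutations" as two-way splits of the concept set (as in A+BXY or AB+XY), represents each source document as a plain set of concepts, and uses the hypothetical names `two_way_splits` and `match_pairs`.

```python
from itertools import combinations


def two_way_splits(concepts):
    """Yield every unordered split of distinct `concepts` into two non-empty groups."""
    items = list(concepts)
    if len(items) < 2:
        return
    first, rest = items[0], items[1:]
    # Keeping the first concept on the left side avoids emitting mirror images.
    for k in range(len(rest)):  # the right side always keeps at least one concept
        for extra in combinations(rest, k):
            left = (first, *extra)
            right = tuple(c for c in rest if c not in extra)
            yield left, right


def match_pairs(target_concepts, library):
    """Return pairs of documents that together contain some split of the target."""
    matches = set()
    for left, right in two_way_splits(target_concepts):
        left_docs = [name for name, c in library.items() if set(left) <= c]
        right_docs = [name for name, c in library.items() if set(right) <= c]
        for a in left_docs:
            for b in right_docs:
                if a != b:
                    matches.add(tuple(sorted((a, b))))
    return sorted(matches)


library = {
    "ref1": {"A", "B"},
    "ref2": {"X", "Y"},
    "ref3": {"A", "X"},
    "ref4": {"B", "Y", "Z"},
}

splits = list(two_way_splits("ABXY"))
print(len(splits))  # 7 distinct two-way splits of four concepts
print(match_pairs("ABXY", library))
```

With four concepts there are seven such splits; in this sample library, AB+XY is satisfied by ref1 with ref2, and AX+BY by ref3 with ref4.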
In the context of claim obviousness detection, when given a target document having at least one claim and at least two source documents as input, the combinatorial document matching system of the present examples may denote concept information or keywords of the target document as “P” and keywords of the source documents as “S”. In the present example, S may consist of N subsets of keywords, one for each of its N claim elements, while P consists of M subsets of keywords, one for each of its M elements. In the combinatorial concept vector and comparator, given a set S of keywords and key phrases (i.e., concept information) associated with the source documents, and a set P of keywords/phrases affiliated with the target document/claim, the concept comparator may estimate the similarity between S and P. In a given repository of documents, the existence of many documents that contain both the source keywords S and the target keywords P may serve to indicate that the sets S and P are likely to be relevant to each other. Still further, external information sources (e.g., an internetwork) may be used as the document repository, and, in such a scenario, results of a general-purpose search engine may be used as a proxy to estimate the number of documents common to both the target document keywords, P, and the source keywords, S.
Furthermore, the variable “A” may denote any subset of P, while “B” denotes any subset of S. Here, |A| may represent the number of documents that contain A, |B| the number of documents that contain B, and |A, B| the number of documents that contain both A and B. The similarity between A and B may then be computed as |A, B|/min(|A|, |B|), so that the ratio grows with the number of documents in which A and B appear together. Given any A, the subset B of S that maximizes the similarity ratio may be taken as A's counterpart in S (i.e., substantially similar). Moreover, given P and S, their similarity is taken as the sum of the similarity ratios of the counterpart subsets (A's and B's) of P and S. With respect to the text analysis, stop-words are eliminated from sets A and B. If a word in A and a word in B have the same stem, then they may be considered to be the same word. Frequently occurring or key phrases in A and B are constructed by the co-occurrence matrix as described above. Moreover, when a search engine is used as a proxy for determining the number of documents common to P and S, the repository becomes the internetwork. In this example, |A| may represent the number of documents that a general-purpose search engine retrieves in response to A, |B| the number of documents that the search engine retrieves in response to B, and |A, B| the number of documents that the search engine retrieves in response to A and B.
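A minimal sketch of the similarity computation follows. It assumes the ratio is read as the number of documents containing both A and B divided by the smaller of the two individual counts, so that more shared documents give a higher score, consistent with the statement that many documents containing both sets indicates relevance. The in-memory repository of already stemmed, stop-word-free keyword sets and the names `doc_count`, `similarity`, `counterpart`, and `set_similarity` are hypothetical; a search-engine hit count could stand in for `doc_count`.

```python
def doc_count(terms, repository):
    """Number of documents in `repository` that contain every term in `terms`."""
    terms = set(terms)
    return sum(1 for doc in repository if terms <= doc)


def similarity(a, b, repository):
    """|A, B| / min(|A|, |B|), or 0.0 when either set occurs in no document."""
    both = doc_count(set(a) | set(b), repository)
    smaller = min(doc_count(a, repository), doc_count(b, repository))
    return both / smaller if smaller else 0.0


def counterpart(a, s_subsets, repository):
    """The subset of S that maximizes the similarity ratio with `a`."""
    return max(s_subsets, key=lambda b: similarity(a, b, repository))


def set_similarity(p_subsets, s_subsets, repository):
    """Sum of the similarity ratios of each subset of P with its counterpart in S."""
    return sum(
        similarity(a, counterpart(a, s_subsets, repository), repository)
        for a in p_subsets
    )


# Hypothetical repository: each document is a set of stemmed keywords.
repository = [
    {"search", "index", "query"},
    {"search", "index"},
    {"search", "rank"},
    {"rank", "sport"},
    {"piston", "engine"},
]
P = [{"search"}, {"query"}]
S = [{"index"}, {"rank"}, {"piston"}]

print(similarity({"search"}, {"rank"}, repository))
print(counterpart({"search"}, S, repository))
print(set_similarity(P, S, repository))
```

Returning 0.0 when a keyword set appears in no document is a design choice made here to avoid division by zero; the disclosure does not address that case.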
Examples of the present invention provide a system and method for combinatorial matching of a plurality of documents. Moreover, the physical manifestation of the disclosed method may be observed in the compilations of books, journals, reports, and other document sources that may be required for a business purpose. Furthermore, many advantages and utilities are afforded by examples of the present invention. For example, in an RFP/RFI response in sales, a request for proposal (RFP) or request for information (RFI) may be used as the target document, and a combination of sales collateral can be identified as source documents. The present method may be used to quickly extract the key requirements from the RFP/RFI and search for a combination of assets that collectively meet the stated requirements. Such an implementation of the examples described herein will benefit from specialized taxonomies, legal clauses, pricing models, and other features unique to the sales process.
As described above, patent obviousness detection, in which claims of a patent application are used to identify prior art references under 35 U.S.C. Section 103, is aided by the invention described herein, which is applicable to initial patent search, patent examination, and patent litigation. Given knowledge of patent claims, the claims are parsed to extract inventive elements and their relationships. As patent filings and litigations increase, there is an increasing demand for more effective detection of patent obviousness. Ample patent data is readily available, but detection of patent obviousness is generally a hard problem since it involves finding a combination of relevant patents that, combined together, subsume the claims of a new patent application. Implementation of the present teachings has yielded positive results when applied to semantic analysis of the first independent claim of patents and thus provides a realistic means for drastically reducing the time and resources required for patent prosecution, examination, and the discovery phase in patent litigation.
Advantages further include the extension of conventional eDiscovery capabilities to locating documents that partially address the legal question. Moreover, legal precedent, where the facts of a case are used to identify legal sources (e.g., statutes, case law, etc.) as precedent, may be enhanced and simplified through the combinatorial document matching system of the present examples. Still further, the detection of plagiarism can be improved such that sections of a set of source documents are analyzed to test the originality of a target document.
Furthermore, while the invention has been described with respect to exemplary embodiments, one skilled in the art will recognize that numerous modifications are possible. Thus, it will be appreciated that the invention is intended to cover all modifications and equivalents within the scope of the following claims.