The amount of information contained in documents is rapidly increasing. There are many industries such as law, education, journalism, politics, economics, or the like that may benefit from rapid and low-cost document analysis. Yet even with recent advances in artificial intelligence and computing, manual analysis still provides the best results for many document analysis tasks that involve subjective judgment and expert knowledge. However, the cost and relatively slow speed of manual, human analysis make it effectively impossible or impracticable to perform document analysis at the scale, speed, and cost desired in many industries.
“Offshoring” to take advantage of lower costs may allow the hiring of a larger number of people to analyze documents at a lower price per hour of labor. Even so, there is a lower bound on costs and an upper bound on throughput. For example, analyzing a corpus of a million 30-page text documents overnight would be impossible using only human analysis. Automated document analysis using computers is much quicker than human analysis and performs at much lower cost. However, for analytical tasks involving subjective judgment, computers perform much worse than humans. Thus, devices and methods that can analyze documents in a way that emulates human analysis will have broad application across many different industries. Additionally, devices and methods that can analyze documents using unified rules may provide a more consistent analysis. For example, human analysis may include subjective differences when analyzing documents, which may provide for less useful results.
The foregoing challenges are compounded in analytic contexts requiring each document to be assessed based on multiple factors.
The detailed description is described with reference to the accompanying figures. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. The use of the same reference numbers in different figures indicates similar or identical items.
This disclosure describes, in part, techniques for performing automatic document analysis. For instance, documents stored in one or more data repositories may be accessed automatically by one or more computing devices and analyzed based on one or more rule sets. The format, structure, and contents of any document stored in the data repositories may be initially unknown. Thus, in some instances, part of the analysis may include filtering documents from a data repository and pre-processing the documents to identify those that are suitable for further analysis. Examples of document types that may be analyzed include, but are not limited to, issued patents and published patent applications. The analysis may focus on specific portions of the documents such as, for example, abstracts or patent claims. Pre-processing may modify the document portions by standardizing the content and removing content that could negatively affect subsequent analysis through techniques such as stop word removal, stemming, and removal of duplicate words.
In some instances, the automatic document analysis may produce a comprehensive score for a document that can be reflective of the economic value of the document (e.g., rights granted by a patent). The comprehensive score may be based on one or more component scores. For instance, and for a patent, a coverage score can reflect the breadth of the patent, which can indicate the likelihood that the patent will cover or be infringed by products and/or services. A patent that covers a greater number of products and/or services is likely to have a greater economic value, as the patent can be licensed or enforced relative to more products and/or services to generate royalty streams and/or damage payments. A risk score can reflect the likelihood that the patent will not be invalidated if challenged, such as by reexamination. A patent that is less likely to be invalidated if challenged is likely to have a greater economic value than a patent that is more likely to be invalidated during a challenge. Additionally, a market score can reflect the relative size of the market associated with the products and/or services covered by the patent. A higher market score can indicate a greater economic value for a patent based on greater potential revenue associated with the products and/or services.
In some instances, the documents may be analyzed in order to determine (e.g., calculate) comparative breadth scores associated with breadths of the documents. For instance, in some examples, breadth of document portions may be analyzed based on consideration of word count and commonality of words. Thus, the number of unique words and the frequency with which those words appear in other document portions (e.g., document portions of other documents) are the basis for automatically assigning a breadth score to a given document portion. For instance, for a given document portion of a given document, the word count is compared to the word count of other document portions in the same analysis. Similarly, a commonness score is determined for the given document portion based on the commonality of words in that document portion as compared to the commonality of words in other document portions from the same analysis. An overall breadth score of the given document can then be determined based on the breadth scores of the document portions within the given document. Based on the overall breadth scores of the documents, a comparative breadth score associated with the breadth of each of the documents is determined by comparing the overall breadth score for a respective document to overall breadth scores of the other documents in the same analysis.
In some instances, the documents may be analyzed in order to determine (e.g., calculate) comparative portion count scores associated with the number of document portions that are included in the documents. For instance, a given document may be analyzed to determine a number of document portions that are included in the given document. The comparative portion count score for the given document is then determined by comparing the number of document portions within the given document to the number of document portions that are included in other documents in the same analysis. For instance, if the given document includes a patent, the patent may be analyzed to determine a number of claims within the patent. The number of claims within the patent is then compared to the number of claims within other patents that are being analyzed in order to determine the comparative portion count score for the patent. In some instances, when analyzing patents, analyzing the number of claims may include comparing the number of independent claims and/or number of dependent claims within the patent to the number of independent claims and/or number of dependent claims within the other patents. For instance, independent claims or dependent claims may be given more weight during the analysis to determine the comparative portion count scores.
In some instances, the documents may be analyzed in order to determine (e.g., calculate) comparative differentiation scores associated with differentiations between document portions within the documents. For instance, in some examples, differentiation of document portions may be analyzed based on consideration of word counts and differentiation of words between document portions within a given document. For example, for a given document portion of a given document, a number of the words within the given document portion is determined. Additionally, words in the given document portion are compared to words in at least one other document portion (e.g., the broadest document portion) in the given document to determine a number of words in the given document portion that are unique (e.g., not included in the at least one other document portion). A differentiation score for the given document portion is then determined based on the number of words and the number of unique words. For instance, if the document portion includes ten words, and the number of unique words is five, the differentiation score for the given document portion may be 50%. An overall differentiation score is then determined for the given document based on the differentiation scores of one or more of the document portions of the given document. Based on the overall differentiation scores for the documents, a comparative differentiation score of each of the documents is determined based on comparing the overall differentiation score for a respective document to overall differentiation scores of the other documents in the same analysis.
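The ten-word example above can be reproduced with a short Python sketch. The function name `differentiation_score` and the sample word lists are hypothetical, and the sketch assumes the document portions have already been split into words. It is one plausible reading of the description rather than the disclosed implementation.

```python
def differentiation_score(portion_words, reference_words):
    """Fraction of words in a portion that are absent from a reference portion.

    Assumption: both arguments are lists of already-tokenized words, and the
    reference is, for example, the broadest portion of the same document.
    """
    if not portion_words:
        return 0.0
    reference = set(reference_words)
    unique_count = sum(1 for word in portion_words if word not in reference)
    return unique_count / len(portion_words)


# Hypothetical word lists: the portion has ten words, five of which do not
# appear in the reference portion.
reference_portion = ["device", "processor", "memory", "sensor", "housing"]
given_portion = [
    "device", "processor", "memory", "sensor", "housing",
    "wireless", "antenna", "battery", "display", "touchscreen",
]

score = differentiation_score(given_portion, reference_portion)
print(f"{score:.0%}")  # prints "50%"
```

An overall differentiation score for a document could then be derived from the per-portion values, for example by averaging them, although the text leaves that choice open.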
In some instances, a comparative coverage score is determined (e.g., calculated) for each of the documents in the analysis based on the respective comparative breadth score, the respective comparative portion count score, and the respective comparative differentiation score for a respective document. For example, for a given document, the comparative coverage score can include an average (and/or mean, mode, lowest score, highest score, etc.) of the comparative breadth score, the comparative portion count score, and the comparative differentiation score of the given document. For another example, for a given document, the comparative coverage score can include a weighted average (and/or weighted mean, weighted mode, weighted lowest score, weighted highest score, etc.) of the comparative breadth score, the comparative portion count score, and/or the comparative differentiation score. For instance, the comparative breadth score may be multiplied by a first weight to determine a weighted breadth score, the comparative portion count score may be multiplied by a second weight to determine a weighted portion count score, and the comparative differentiation score may be multiplied by a third weight to determine a weighted differentiation score. The comparative coverage score for the document can then be determined based on an average (and/or mean, mode, lowest score, highest score, etc.) of the weighted breadth score, the weighted portion count score, and the weighted differentiation score.
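As a minimal sketch of the weighted combination described above, the following Python function multiplies each component score by its weight and averages the results. The function name, the dictionary keys, and the sample values are hypothetical. Dividing by the sum of the weights (so the result stays on the same scale as the inputs) is an assumption; the text does not specify how the weighted scores are normalized.

```python
def weighted_coverage_score(scores, weights):
    """Weighted average of component scores.

    Assumption: the weighted scores are normalized by the sum of the weights,
    which the description does not mandate.
    """
    total_weight = sum(weights[name] for name in scores)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    weighted_sum = sum(scores[name] * weights[name] for name in scores)
    return weighted_sum / total_weight


# Hypothetical comparative scores for one document, on a 0-100 scale.
component_scores = {"breadth": 80.0, "portion_count": 60.0, "differentiation": 40.0}

# Equal weights reduce to a plain average.
equal = weighted_coverage_score(
    component_scores, {"breadth": 1.0, "portion_count": 1.0, "differentiation": 1.0}
)

# Giving breadth twice the weight pulls the result toward the breadth score.
breadth_heavy = weighted_coverage_score(
    component_scores, {"breadth": 2.0, "portion_count": 1.0, "differentiation": 1.0}
)

print(equal, breadth_heavy)  # prints "60.0 65.0"
```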
In some instances, the documents may be analyzed to determine (e.g., calculate) risk scores for the documents. For instance, if documents include patents, each of the patents may be analyzed to determine a respective risk score indicating a likelihood that the respective patent will be invalidated, such as if the validity is challenged. In some instances, a risk score is determined for a patent by performing a semantic search to identify a set of documents (e.g., other documents, such as references, patents, publications, articles, etc.) that are closely related to the concept of the patent, removing documents from the set that do not qualify as prior art to the patent (e.g., based on the Manual of Patent Examining Procedure (MPEP)) and/or that postdate the patent (e.g., drafted, published, filed, or the like after the priority date of the patent), and then determining the risk score based on the number of remaining documents. Additionally, or alternatively, in some instances, risk scores are adjusted or determined for patents based on other factors. For instance, a risk score for a patent may be adjusted or determined based on a number of references cited during prosecution of the patent, breadth of the claims within the patent, prosecution history associated with the patent, a remaining patent term, litigation history associated with the patent, one or more related patents (e.g., one or more foreign related patents), and/or the like.
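The date-based filtering step can be illustrated with a small Python sketch. It assumes the semantic search has already produced a candidate list, and it reduces the prior-art test to a single date comparison, which is a deliberate simplification of the qualification rules. The `Reference` type, the function name, and the sample data are hypothetical, and the mapping from the remaining count to a risk score is left open because the text does not define it.

```python
from datetime import date
from typing import List, NamedTuple


class Reference(NamedTuple):
    identifier: str
    published: date


def remaining_prior_art(candidates: List[Reference], priority_date: date) -> List[Reference]:
    """Keep only candidates dated before the priority date (simplified test)."""
    return [ref for ref in candidates if ref.published < priority_date]


# Hypothetical output of a semantic search for one patent.
search_results = [
    Reference("REF-1", date(2009, 5, 1)),
    Reference("REF-2", date(2012, 3, 15)),  # after the priority date, so removed
    Reference("REF-3", date(2010, 12, 31)),
]

kept = remaining_prior_art(search_results, priority_date=date(2011, 1, 1))

# The number of remaining documents is the input to the risk score.
print(len(kept))  # prints "2"
```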
In some instances, documents may be analyzed to determine (e.g., calculate) market scores for the documents. For instance, for a given document, subject matter of the given document may be analyzed to determine a classification associated with the given document. In some instances, the classification may include a North American Industry Classification System (NAICS) classification; however, other types of classifications can be used. A value associated with the classification is then determined and used to calculate the market score for the given document. In some instances, the value can correspond to the gross domestic product (GDP) associated with the classification. For example, respective GDPs may be determined for one or more of the NAICS classifications. The GDPs may then be used to determine normalized GDP scores for the NAICS classifications. Using the normalized GDP scores, a market score for the given document can include the normalized GDP score for the NAICS classification identified for the given document.
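The following Python sketch shows one way the normalized GDP lookup might work. The normalization method is not specified in the text, so dividing each value by the largest value is an assumption. The GDP figures are placeholders chosen for easy arithmetic and are not real economic data; the function names are hypothetical.

```python
def normalize_by_max(values):
    """Scale values so the largest becomes 1.0 (an assumed normalization)."""
    highest = max(values.values())
    return {code: value / highest for code, value in values.items()}


# Placeholder GDP values keyed by NAICS sector code. These numbers are
# invented for illustration and do not reflect actual sector GDP.
gdp_by_naics = {
    "31-33": 2000.0,  # Manufacturing
    "51": 500.0,      # Information
    "52": 1000.0,     # Finance and Insurance
}

normalized_gdp = normalize_by_max(gdp_by_naics)


def market_score(naics_code):
    """Market score of a document classified under the given NAICS code."""
    return normalized_gdp[naics_code]


print(market_score("52"))  # prints "0.5"
```

Because the lookup table is rebuilt from the current GDP data, rerunning the analysis after the data changes would shift the market scores accordingly.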
In some instances, a comprehensive score is determined (e.g., calculated) for each of the documents in the analysis based on the comparative coverage score, the risk score, and the market score for a respective document. For example, for a given document, the comprehensive score can include an average (and/or mean, mode, lowest score, highest score, etc.) of the comparative coverage score, the risk score, and the market score of the given document. For another example, for a given document, the comprehensive score can include a weighted average (and/or weighted mean, weighted mode, weighted lowest score, weighted highest score, etc.) of the comparative coverage score, the risk score, and the market score of the given document. For instance, one or more of the comparative coverage score, the risk score, and the market score of the given document may be given more weight when calculating the comprehensive score.
In some instances, a user interface is generated and used to provide scores based on the analysis. For instance, the user interface may include a list of each of the documents from the analysis. The user interface may further include one or more of the respective comparative breadth score, the respective comparative portion count score, the respective comparative differentiation score, the respective comparative coverage score, the respective risk score, the respective market score, and the respective comprehensive score for each of the documents. Additionally, for each document that is related to another document under analysis, the user interface can include an overall group score for the documents. For instance, if a patent is related to two or more other patents under analysis, such as belonging to a common patent family, then the user interface can include a group score corresponding to the patent family.
The comprehensive score for a document (e.g., a patent) can be reflective of the economic value of the document (e.g., rights granted by the patent). For instance, and for a patent, the coverage score can reflect the breadth of the patent, which can indicate the likelihood that the patent will cover or be infringed by products and/or services. A patent that covers a greater number of products and/or services is likely to have a greater economic value, as the patent can be licensed or enforced relative to more products and/or services to generate royalty streams and/or damage payments. The risk score can reflect the likelihood that the patent will not be invalidated if challenged, such as by reexamination. A patent that is less likely to be invalidated if challenged is likely to have a greater economic value than a patent that is more likely to be invalidated during a challenge. Additionally, the market score can reflect the relative size of the market associated with the products and/or services covered by the patent. A higher market score can indicate a greater economic value for a patent based on greater potential revenue associated with the products and/or services.
With regard to a group of documents, such as a family of patents, the group score can be reflective of the economic value of the group of documents. For instance, and for a family of patents, the coverage scores of each of the patents within the family can reflect the breadth of the family, which can indicate the likelihood that the family will cover products and/or services that potentially infringe the patents. A family of patents that covers a greater number of products and/or services has a greater economic value, as the family can be enforced on more products and/or services for damages. The risk scores for each of the patents within the family can reflect the likelihood that the patents will not be invalidated if challenged, such as by reexamination. A family that includes patents that are less likely to be invalidated if challenged has a greater economic value than a family that includes patents that are more likely to be invalidated during a challenge. Additionally, the market scores for each of the patents within the family can reflect the portion of the total market that includes the products and/or services covered by the family. A higher market score can indicate a greater economic value for a family of patents based on potential revenue that the products and/or services can create. As such, and combining each of the scores for the patents within the family, a higher group score for a family of patents can reflect a greater economic value for the family, and a lower group score for a family of patents can reflect a lesser economic value for the family.
By using the techniques described above, comprehensive scores for a document may be calculated over time using common foundational metrics. For instance, each of the comprehensive scores for a document can be calculated based on respective coverage scores, respective risk scores, and respective market scores, where each coverage score, risk score, and market score is calculated using one or more respective algorithms whose metrics may or may not change from one analysis to the next. As such, as long as the data being utilized to calculate the comprehensive scores remains the same from one analysis to the next, the comprehensive scores for the documents will also remain the same. However, if the data being utilized to calculate the comprehensive scores changes over time (e.g., a change in the GDP), then the comprehensive scores for the document will evolve to reflect the change in the data. For instance, and using the example where the GDP changes over time, the comprehensive scores will evolve to reflect the changing economy.
The foregoing approach to comprehensive scoring offers an additional advantage. The respective algorithms for coverage score, risk score and/or market score, along with the data available to support such algorithms, are expected to evolve and be refined over time to produce results having higher perceived accuracy or utility. This evolution or refinement can, in turn, be incorporated into the comprehensive score to produce a result that is likely to have higher perceived accuracy or utility, while remaining logically consistent with and comparable to earlier scoring. For example, where comprehensive scoring has been used to score patent portfolios associated with prior business transactions (e.g., asset purchases or license agreements), the evolution or refinement of component scoring processes and data can be incorporated into the comprehensive scoring of a portfolio associated with a prospective transaction, increasing the perceived accuracy or utility of the comprehensive score, while still enabling it to be logically compared to earlier scores as part of, e.g., a market comparables analysis.
The format and/or file type of a document received from one of the data repositories 102 may be initially unknown when that document enters the analysis pipeline 100. Thus, at the start, part of the initial analysis may include identifying the file format and/or type of document. Some level of processing may be necessary for all documents, and certain types of files, such as image files or text files lacking metadata, may require more extensive processing before further analysis can begin. In some instances, the data repositories 102 may include both issued patents and published applications for utility, design, and/or plant patents. Patent data from various jurisdictions and in various languages may also be included in the data repositories 102. Examples of data repositories 102 include a patent database provided by Innography®, the U.S. Patent Database maintained by the United States Patent and Trademark Office, patent data maintained by Relacura, as well as patent databases maintained by others such as the patent offices of various jurisdictions.
Data filtering 104 can limit the data obtained from the data repositories 102 to a corpus of documents that share specified characteristics. This may be particularly useful when the documents come from multiple different sources and/or the documents are obtained without knowledge of the document format. For example, the data filtering 104 may limit patent documents to only issued patents and exclude published patent applications. Data filtering 104 may filter by patent type and, for example, keep utility patents while excluding design and plant patents. Data filtering 104 may also filter documents by language, by author, by inventor, by assignee, by technical field, by classification, etc. Filters may be specified by user-generated input through a user interface. In one implementation, the user interface for specifying how data is to be filtered may be a command-line interface. Arguments passed on the command line are parsed by appropriate code to determine an input data set and/or filters to apply to incoming data.
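A command-line filter specification of the kind described above might be parsed as in the following Python sketch, which uses the standard `argparse` module. The flag names (`--input`, `--status`, `--patent-type`, `--language`), the record fields, and the file name are all hypothetical; the text does not define a specific command-line syntax.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for hypothetical filtering flags."""
    parser = argparse.ArgumentParser(description="Filter patent records (illustrative only).")
    parser.add_argument("--input", required=True, help="path to the input data set")
    parser.add_argument("--status", choices=["issued", "application"])
    parser.add_argument("--patent-type", choices=["utility", "design", "plant"])
    parser.add_argument("--language")
    return parser


def keep(record: dict, args: argparse.Namespace) -> bool:
    """Return True when the record satisfies every filter that was supplied."""
    wanted = {
        "status": args.status,
        "patent_type": args.patent_type,
        "language": args.language,
    }
    return all(value is None or record.get(field) == value for field, value in wanted.items())


# Simulated command line: keep only issued utility patents.
args = build_parser().parse_args(
    ["--input", "patents.jsonl", "--status", "issued", "--patent-type", "utility"]
)

# Hypothetical in-memory records standing in for repository data.
records = [
    {"id": "A", "status": "issued", "patent_type": "utility", "language": "en"},
    {"id": "B", "status": "application", "patent_type": "utility", "language": "en"},
    {"id": "C", "status": "issued", "patent_type": "design", "language": "en"},
]

kept_ids = [record["id"] for record in records if keep(record, args)]
print(kept_ids)  # prints "['A']"
```

Filters that are not supplied on the command line are left as `None` and therefore match every record, so the same code handles any combination of filters.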
Pre-processing 106 can modify the documents or portions of the documents for later processing. Pre-processing 106 may include stripping out punctuation, removing stop words 108, converting acronyms and abbreviations 110 to full words, stemming, and/or removing duplicate words. Punctuation may include any of the following marks: . , ! ? ; : ' " @ # $ % ^ & * ( ) [ ] < >. Stop words 108 are words that are filtered out before additional processing. Stop words usually refer to the most common words in a language. Stop words may include short function words such as “the,” “is,” “at,” “which,” and “on,” as well as others. However, there is no universal list of stop words. Stop words 108 may be compared to individual documents or portions of the documents and any matching words removed. The stop words 108 may be included directly in the code of a pre-processing algorithm. Additionally, or alternatively, the stop words 108 may be included in a list that is accessed to identify stop words 108. The list may be editable to add or remove stop words 108. Multiple lists of stop words 108 may be available. Particular stop words 108 may be selected based on the type of documents being analyzed. For example, patent-specific stop words 108 may include words such as “method” or “comprising” that would not typically be included in a list of general stop words. Similarly, if the data filtering 104 restricts the documents to a specific technical area, the stop words 108 may include words specific to the technical area.
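The pre-processing steps described above can be sketched in Python as follows. The stop-word lists, the abbreviation table, and the function name are small hypothetical examples rather than the lists used by any actual system. Lower-casing the text is an added assumption, and stemming is omitted because it would typically rely on a separate stemming library.

```python
import string

# Hypothetical, deliberately tiny lists for illustration.
GENERAL_STOP_WORDS = {"the", "is", "at", "which", "on", "a", "an", "and", "of"}
PATENT_STOP_WORDS = {"method", "comprising"}
ABBREVIATIONS = {"cpu": "central processing unit"}


def preprocess(text, stop_words):
    """Strip punctuation, expand abbreviations, drop stop words and duplicates."""
    # Lower-case (an assumption) and remove ASCII punctuation marks.
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))

    # Expand abbreviations into their full-word form.
    tokens = []
    for token in cleaned.split():
        tokens.extend(ABBREVIATIONS.get(token, token).split())

    # Remove stop words, then remove duplicates while preserving order.
    tokens = [token for token in tokens if token not in stop_words]
    return list(dict.fromkeys(tokens))


claim = "A method comprising: storing the data on a CPU, and storing the result."
words = preprocess(claim, GENERAL_STOP_WORDS | PATENT_STOP_WORDS)
print(words)
# prints "['storing', 'data', 'central', 'processing', 'unit', 'result']"
```

Passing the stop-word set as an argument mirrors the idea that different lists may be selected depending on the type of documents or the technical area.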
Anomaly detection 112 identifies portions of documents that likely include an anomaly, which results in the portion of the document being excluded from further analysis or being flagged to alert a human user that there may be reasons to manually review the flagged document portion. In one implementation, the analysis may be performed only on independent patent claims. However, the data filtering 104 and the pre-processing 106 may create document portions that include both independent and dependent patent claims. Due to the limits of automatic computer-based document analysis, some characteristics may be detectable even though the automatic analysis system is unable to properly analyze them for breadth. Flagging or otherwise indicating such content allows humans to focus manual review efforts on only those document portions that were not fully amenable to the automatic analytical techniques.
Breadth calculation 114 determines the breadth of one or more portions of a document. In some instances, breadth is a subjective concept that is represented in a form amenable for automatic analysis by considering word count and commonality of words. Word count is simply the number of words in a document portion. Words may be counted based on the raw input following data filtering 104 or after some level of pre-processing 106. For example, word count may be performed after removal of duplicate words so that it is a word count of unique words. Also, word count may be performed before or after removing stop words 108. Similarly, word count may be performed before or after converting acronyms and abbreviations 110 into their full word representations. In the context of patent claims, short claims are generally considered broader than longer claims.
Commonality of words represents the frequency with which a given word is found within a corpus of documents or document portions. Generally, the relevant corpus is the output of the pre-processing 106. For example, if the starting documents from the data repositories 102 were academic papers on chemistry, and pre-processing limited the corpus to the abstracts of those papers, then the commonality of a word would be based on the frequency with which that word is found throughout all the abstracts. Common words correlate with greater breadth, while the presence of infrequently found words indicates reduced breadth. In the context of patent claims, claims that include words that are often found in the technical field are generally considered broader than claims with uncommon words.
The breadth calculation 114 combines both word count and word commonality to assign a breadth score to a document portion. Specific techniques for determining word count, word commonality, and breadth score are discussed below. Some documents may have multiple portions that are scored. For example, an abstract and an executive summary of a financial document could be scored. For another example, both independent and dependent claims of a single patent document may be scored, and each of the one or more independent claims and/or each of the one or more dependent claims may be assigned a different breadth score.
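To make the interaction of word count and word commonality concrete, the following Python sketch scores two hypothetical document portions. The specific formulas are illustrative stand-ins and are not the techniques of this disclosure: the word-count component is assumed to be the shortest word count in the corpus divided by the portion's own word count, the commonness component is assumed to be the average fraction of portions containing each word, and the two are simply averaged. The function name and sample data are likewise hypothetical.

```python
from collections import Counter


def breadth_scores(portions):
    """Score each portion from its word count and the commonality of its words.

    `portions` maps a portion identifier to a non-empty list of unique,
    pre-processed words. All formulas here are illustrative assumptions.
    """
    total = len(portions)
    # Number of portions in which each word appears.
    portion_frequency = Counter(word for words in portions.values() for word in set(words))
    shortest = min(len(words) for words in portions.values())

    scores = {}
    for portion_id, words in portions.items():
        # Fewer words -> higher word-count component (1.0 for the shortest).
        count_component = shortest / len(words)
        # More widely shared words -> higher commonness component.
        commonness = sum(portion_frequency[word] for word in words) / (len(words) * total)
        scores[portion_id] = (count_component + commonness) / 2
    return scores


corpus = {
    "claim_A": ["device", "processor"],
    "claim_B": ["device", "processor", "memory", "antenna"],
}
scores = breadth_scores(corpus)
print(scores)  # prints "{'claim_A': 1.0, 'claim_B': 0.625}"
```

In this toy corpus the shorter claim, which uses only words shared by every portion, receives the higher score, matching the intuition that short claims built from common words are broader.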
Overall breadth calculation 116 determines the overall breadth scores for the documents being analyzed. In some instances, the overall breadth score of a document may be the breadth of its broadest portion, such as the breadth score of the broadest claim (e.g., broadest independent claim) of a patent document. In some instances, the overall breadth score of a document may be the breadth of its narrowest portion, such as the breadth score of the narrowest claim of a patent document. Still, in some instances, the overall breadth score of a document may be based on the breadth score(s) of two or more of the document portions. For example, the overall breadth score for a document may include a median or average of breadth scores of each of the document portions of the document. As a further example, the overall breadth score for a document may be based on the range of breadth scores between the breadth of the broadest portion and the breadth of the narrowest portion. In some instances, the overall breadth score may be represented by more than one score (e.g., the broadest breadth score, the average, median, or mean breadth score, the range of breadth scores) of the document portions or may be a composite (e.g., weighted or unweighted average) of such scores. In some instances, one or more of the document portions may be given a greater weight when determining the overall breadth score. For example, independent claims may be given a greater weight than dependent claims when determining the overall breadth score of a patent.
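The alternatives listed above can be expressed compactly in Python. This sketch assumes that a higher breadth score means a broader portion; the function names, the `method` labels, the default weight of 2.0 for independent claims, and the sample scores are all hypothetical.

```python
from statistics import mean, median


def overall_breadth(portion_scores, method="broadest"):
    """Reduce per-portion breadth scores to a single overall score."""
    if not portion_scores:
        raise ValueError("at least one portion score is required")
    if method == "broadest":
        return max(portion_scores)
    if method == "narrowest":
        return min(portion_scores)
    if method == "mean":
        return mean(portion_scores)
    if method == "median":
        return median(portion_scores)
    if method == "range":
        return max(portion_scores) - min(portion_scores)
    raise ValueError(f"unknown method: {method}")


def weighted_overall_breadth(independent, dependent, independent_weight=2.0):
    """Weighted average in which independent claims count more than dependent ones."""
    total_weight = independent_weight * len(independent) + len(dependent)
    weighted_sum = independent_weight * sum(independent) + sum(dependent)
    return weighted_sum / total_weight


# Hypothetical breadth scores: one independent claim and two dependent claims.
independent_scores = [0.75]
dependent_scores = [0.5, 0.25]
all_scores = independent_scores + dependent_scores

print(overall_breadth(all_scores, "broadest"))   # prints "0.75"
print(overall_breadth(all_scores, "narrowest"))  # prints "0.25"
print(overall_breadth(all_scores, "range"))      # prints "0.5"
print(weighted_overall_breadth(independent_scores, dependent_scores))  # prints "0.5625"
```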
The comparative breadth score calculation 118 can determine comparative breadth scores for the documents as compared to other documents within the analysis. For instance, the overall breadth calculation 116 is performed in the context of the other documents in a corpus. Thus, an overall breadth score for a document is not an absolute score, but a relative score compared to other documents that are part of the same analysis. To determine a comparative breadth score for a document as compared to other documents, the comparative breadth score calculation 118 compares the overall breadth score of the document to the overall breadth scores of other documents that are within the analysis.
For example, where the overall breadth score is based on the score of a single document portion (e.g., broadest or narrowest), the calculation 118 compares that score to the score of the corresponding single document portion of other documents that are within the analysis. Where the overall breadth score is based on the score of multiple document portions (e.g., represented as an average, median, or mean; a weighted or unweighted composite of the broadest, average (or median or mean), and narrowest or range score; or individual component scores such as broadest, average, and range), the calculation 118 compares that score or scores to the score or scores of the corresponding multiple document portions of other documents within the analysis. In some instances, the comparative breadth score for a document corresponds to the percentage of documents that include an overall breadth score that is equal to or less than the overall breadth score of the document. In some instances, the comparative breadth score for a document corresponds to the percentage of documents that include an overall breadth score that is less than the overall breadth score of the document. In some instances, the comparative breadth score for a document corresponds to the percentage of documents that include an overall breadth score that is equal to or greater than the overall breadth score of the document. Still, in some instances, the comparative breadth score for a document corresponds to the percentage of documents that include an overall breadth score that is greater than the overall breadth score of the document.
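Two of the percentage-based variants described above ("equal to or less than" and "less than") can be sketched in Python as follows. The function name and sample scores are hypothetical, and counting the document itself among the documents being compared is an assumption, since the text does not say whether it is included.

```python
def comparative_scores(overall_scores, inclusive=True):
    """Percentage of documents scoring at or below (or strictly below) each document.

    Assumption: every document in the analysis, including the document being
    scored, is counted in the denominator.
    """
    total = len(overall_scores)
    result = {}
    for doc_id, score in overall_scores.items():
        if inclusive:
            count = sum(1 for other in overall_scores.values() if other <= score)
        else:
            count = sum(1 for other in overall_scores.values() if other < score)
        result[doc_id] = 100.0 * count / total
    return result


# Hypothetical overall breadth scores for four documents in one analysis.
overall = {"P1": 0.75, "P2": 0.5, "P3": 0.5, "P4": 0.25}

at_or_below = comparative_scores(overall, inclusive=True)
strictly_below = comparative_scores(overall, inclusive=False)

print(at_or_below)     # prints "{'P1': 100.0, 'P2': 75.0, 'P3': 75.0, 'P4': 25.0}"
print(strictly_below)  # prints "{'P1': 75.0, 'P2': 25.0, 'P3': 25.0, 'P4': 0.0}"
```

The "equal to or greater than" and "greater than" variants follow by reversing the comparisons. The same percentage calculation applies to any per-document value, such as a portion count.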
In some instances, the design for the analysis captures the idea of comparing apples to apples when calculating comparative breadth scores. For instance, comparison of the breadth of a biotechnology patent to the breadth of a mechanical patent is less meaningful than comparing the breadth of one software patent to the breadth of another software patent. Because the documents are given overall breadth scores with respect to the other documents in the same corpus, those overall breadth scores may be utilized to determine the comparative breadth scores for each of the documents.
The user interface 120 may display, or otherwise present to a user, the comparative breadth scores, rankings based on the comparative breadth scores, and an identifier for each of the analyzed documents. The identifier for each of the documents may be a unique identifier such as a patent number, a published patent application number, an international standard book number (ISBN), a title, a uniform resource identifier (URI), etc. The user interface (UI) 120 may be generated by processing a text file or other textual output. The UI 120 may be implemented as a command line interface, as a graphical user interface, or as another type of interface. When implemented as a graphical user interface, the UI 120 may be generated by a cloud service that is accessible over a communications network such as the Internet. Cloud services do not require end user knowledge of the physical location or configuration of the system that delivers the services. Common names associated with cloud services include “software as a service” or “SaaS,” “platform computing,” “on-demand computing,” and so on. Any number of users may access the UI 120 at any time through specialized applications or through browsers (e.g., Internet Explorer®, Firefox®, Safari®, Google Chrome®, etc.) resident on their local computing devices.
Portion count calculation 202 can determine a value (e.g., overall portion count score) corresponding to the number of portions that are within each of the documents. For instance, after performing the filtering and/or the pre-processing of a document, the portion count calculation 202 can determine a value corresponding to the number of document portions that were identified for the document. In some instances, the value corresponds to each of the document portions that were analyzed by the processing pipeline 100 of
In some instances, the portion count calculation 202 can weight one or more of the document portions when determining the value for a document. For instance, if the document includes a patent, more weight can be provided to the independent claims than to the dependent claims when determining the value for the patent. For example, for the patent above that includes three independent claims and seventeen dependent claims, the value for the document may be twenty-nine if the independent claims are given four times more weight than the dependent claims (e.g., (3*4)+17=29). Of course, the weight of independent claims may be something other than four times, such as 1.1×, 1.2×, 1.3×, 2×, 3×, 5×, etc. In some instances, weighting independent claims greater than dependent claims for patents can provide a better prediction for the quality of the patents, since patents that include more independent claims may include a broader claim scope than other patents or may reflect a different strategy of the claim drafter.
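As a minimal sketch of the weighted count just described, the following reproduces the (3*4)+17=29 example. The function name and parameter names are hypothetical.

```python
def weighted_portion_count(independent: int, dependent: int,
                           independent_weight: float = 4.0) -> float:
    """Count claims, giving each independent claim `independent_weight` times
    the weight of a dependent claim (dependent claims have weight 1)."""
    return independent * independent_weight + dependent


# Three independent claims weighted 4x, plus seventeen dependent claims.
print(weighted_portion_count(3, 17))  # 29.0
```

Setting `independent_weight` to 1.0 gives the unweighted count of twenty portions.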
Comparative portion count score calculation 204 can determine comparative portion count scores for the documents based on the values determined for other documents being analyzed. For instance, to determine a comparative portion count score for a given document, the comparative portion count score calculation 204 can compare the value associated with the given document to the values of the other documents being analyzed. In some instances, the comparative portion count score for a document corresponds to the percentage of documents that include a value that is equal to or less than the value of the document. In some instances, the comparative portion count score for a document corresponds to the percentage of documents that include a value that is less than the value of the document. In some instances, the comparative portion count score for a document corresponds to the percentage of documents that include a value that is equal to or greater than the value of the document. Still, in some instances, the comparative portion count score for a document corresponds to the percentage of documents that include a value that is greater than the value of the document.
The UI 206 may display, or otherwise present to a user, the comparative portion count scores, rankings based on the comparative portion count scores, and an identifier for each of the analyzed documents. As discussed above, the identifier for each of the documents may be a unique identifier such as a patent number, a published patent application number, an ISBN, a title, a URI, etc. The UI 206 may be generated by processing a text file or other textual output. The UI 206 may be implemented as a command line interface, as a graphical user interface, or as another type of interface. When implemented as a graphical user interface, the UI 206 may be generated by a cloud service that is accessible over a communications network such as the Internet. Any number of users may access the UI 206 any time through specialized applications or through browsers (e.g., Internet Explorer®, Firefox®, Safari®, Google Chrome®, etc.) resident on their local computing devices.
Differentiation calculation 302 can determine differentiation between document portions within each of the documents being analyzed. Differentiation is a subjective concept that is represented in a form amenable for automatic analysis by considering at least word count and differentiation between words of various document portions within a document. Similar to the breadth analysis discussed above, words may be counted based on the raw input following data filtering 104 or after some level of pre-processing 106. For example, word count may be performed after removal of duplicate words so that it is a word count of unique words. Also, word count may be performed before or after removing stop words 108. Similarly, word count may be performed before or after converting acronyms and abbreviations 110 into their full word representations.
Differentiation of words represents a number of words within a document portion of a document that are not found within one or more other document portions of the document. For example, if a document portion includes the words “audio”, “data”, “representing”, “voice”, and “input”, and at least one other document portion includes the words “audio” and “data”, the word count for the document portion includes five words and the differentiation of words for the document portion includes three. The differentiation calculation 302 combines both word count and differentiation to assign a differentiation score to a document portion. For example, a differentiation score for the example above may include sixty percent (e.g., three unique words/five total words). Specific techniques for determining word count, word differentiation, and differentiation score are discussed below. In some instances, some documents may have multiple portions that are scored. For example, an abstract and an executive summary of a financial document could be scored. For another example, a single patent document may include independent and dependent claims, and each of one or more independent claims and/or each of one or more dependent claims may be assigned a different differentiation score.
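The word-level example above (five words, three of which do not appear in another portion) can be sketched as follows. The function name is hypothetical, and the sketch assumes duplicate words have already been removed, so each portion is treated as a set of words.

```python
from typing import Iterable


def differentiation_score(portion_words: Iterable[str],
                          other_words: Iterable[str]) -> float:
    """Fraction of the portion's words that are not found in the other portion(s)."""
    portion = set(portion_words)
    if not portion:
        return 0.0
    differentiated = portion - set(other_words)
    return len(differentiated) / len(portion)


portion = ["audio", "data", "representing", "voice", "input"]
other = ["audio", "data"]
print(differentiation_score(portion, other))  # 0.6, i.e., sixty percent
```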
For documents that include patents and/or published applications, there may be multiple types of differentiation between claims (e.g., the document portions) within the patents and/or published applications that can be analyzed using the word count/differentiation score technique above. A first type of differentiation between two claims can include a first claim and a second claim that include similar claim components, where each claim uses different wording. A second type of differentiation between two claims can include a first claim and a second claim that include similar components, but claimed in a different order. Still, a third type of differentiation between two claims can include a first claim and a second claim that are claiming different components.
In some instances, the differentiation calculation 302 may determine that the first type and the second type include less differentiation than the third type. For example, and for the first type, the differentiation calculation 302 may determine that there is no differentiation between two different words that have a similar meaning. For instance, the differentiation calculation 302 can determine that there is no word differentiation between a first claim that recites “an audio signal representing sound” and a second claim that recites “sound represented by an audio signal.” In some instances, natural language processing techniques may be used to determine whether two words have a similar or a different meaning. For a second example, and for the second type, the differentiation calculation 302 may determine that there is no word differentiation between a first claim and a second claim when components include similar words (e.g., no differentiation) that are merely organized differently. For a third example, and for the third type, the differentiation calculation 302 can determine that there is word differentiation between a first claim and a second claim that recite different components. For instance, the differentiation calculation 302 can determine that there is a word differentiation between a first claim that recites “a camera to capture an image” and a second claim that recites “a scanner to scan an image” (e.g., the word “camera” differs from “scanner” and the word “capture” differs from “scan”).
For example, a first claim in a patent may recite, “capturing a first image of an environment using a camera; analyzing the first image; and capturing a second image of the environment using the camera,” a second claim in the patent may recite, “using a camera to capture a first image of an environment; using a camera to capture a second image of the environment; and analyzing the first image,” and a third claim of the patent may recite, “obtaining a first depth map of an environment using a sensor; analyzing the first depth map; and obtaining a second depth map of the environment using the sensor.” The differentiation calculation 302 may then analyze the patent to determine a first differentiation score between the first claim and the second claim and a second differentiation score between the first claim and the third claim.
For instance, the patent may be pre-processed using 104-112 above (e.g., removing stop words, stemming, and removal of duplicate words). Based on the pre-processing, the words remaining for analysis for the first claim may include “capturing”, “first”, “image”, “environment”, “camera”, “second”, “using”, and “analyzing”, the words remaining in the second claim may include “using”, “camera”, “capture”, “first”, “image”, “environment”, “second”, and “analyzing”, and the words remaining in the third claim may include “obtaining”, “first”, “depth”, “map”, “environment”, “using”, “sensor”, “second”, and “analyzing”. The differentiation calculation 302 can then determine that the second claim includes eight words, none of which are unique when compared to the first claim. As such, the differentiation calculation 302 can determine that the second claim includes a first differentiation score of 0% as compared to the first claim. Additionally, the differentiation calculation 302 can determine that the third claim includes nine words, four of which are unique when compared to the first claim. As such, the differentiation calculation 302 can determine that the third claim includes a second differentiation score of 44.4%.
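The three-claim example can be traced end to end with a short sketch. The stop-word list and the `toy_stem` suffix-stripping rule below are simplifying assumptions chosen only so that “capturing” and “capture” reduce to the same stem; they stand in for whatever stop-word removal and stemming the pre-processing actually uses.

```python
import re

# Assumed, deliberately tiny stop-word list (illustration only).
STOP_WORDS = {"a", "an", "and", "of", "the", "to"}


def toy_stem(word: str) -> str:
    """Crude stand-in for a real stemmer: strip a trailing 'ing' or 'e'."""
    for suffix in ("ing", "e"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: -len(suffix)]
    return word


def preprocess(claim: str) -> set:
    """Lowercase, tokenize, drop stop words, stem, and de-duplicate."""
    tokens = re.findall(r"[a-z]+", claim.lower())
    return {toy_stem(t) for t in tokens if t not in STOP_WORDS}


def differentiation(claim: str, reference_claim: str) -> float:
    """Fraction of the claim's words that do not appear in the reference claim."""
    words, reference = preprocess(claim), preprocess(reference_claim)
    return len(words - reference) / len(words)


claim_1 = ("capturing a first image of an environment using a camera; "
           "analyzing the first image; and capturing a second image of "
           "the environment using the camera")
claim_2 = ("using a camera to capture a first image of an environment; "
           "using a camera to capture a second image of the environment; "
           "and analyzing the first image")
claim_3 = ("obtaining a first depth map of an environment using a sensor; "
           "analyzing the first depth map; and obtaining a second depth map "
           "of the environment using the sensor")

print(differentiation(claim_2, claim_1))            # 0.0
print(round(differentiation(claim_3, claim_1), 3))  # 0.444
```

Without the stemming step, “capture” in the second claim would count as a word not found in the first claim, and the first score would be 12.5% rather than 0%.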
As shown above, the differentiation calculation 302 determines that there is a greater differentiation between the first claim and the third claim than between the first claim and the second claim. This is because the first claim and the second claim fall within the first type of differentiation and the second type of differentiation. For instance, the first claim and the second claim include similar features, but with different wording (e.g., “capturing” in claim 1 and “capture” in claim 2), where the features are recited in each claim using a different order. Additionally, the first claim and the third claim fall within the third type of differentiation. For instance, the first claim and the third claim each include unique features.
Overall differentiation calculation 304 determines overall differentiation scores for the documents being analyzed. In some instances, the overall differentiation score for a document may be determined based on the differentiation scores of each of the document portions included within the document. For example, the overall differentiation score for a document may include the average and/or the median of the differentiation scores of each of the document portions included in the document. For another example, the overall differentiation score for a document may include the highest differentiation score, the lowest differentiation score, or a differentiation score between the highest and lowest differentiation scores for each of the document portions included within the document.
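A minimal sketch of these aggregation options follows. The function name and the `method` labels are hypothetical, and the per-portion scores are made-up values.

```python
import statistics
from typing import Sequence


def overall_differentiation(portion_scores: Sequence[float],
                            method: str = "mean") -> float:
    """Combine per-portion differentiation scores into one score for the document."""
    aggregators = {
        "mean": statistics.mean,
        "median": statistics.median,
        "highest": max,
        "lowest": min,
    }
    return aggregators[method](portion_scores)


# Hypothetical differentiation scores for four portions of one document.
portion_scores = [0.0, 0.25, 0.5, 1.0]
print(overall_differentiation(portion_scores, "mean"))     # 0.4375
print(overall_differentiation(portion_scores, "median"))   # 0.375
print(overall_differentiation(portion_scores, "highest"))  # 1.0
```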
Additionally, or alternatively, in some instances, the overall differentiation score for a document may be based on a portion of the differentiation scores for each of the document portions included within the document. For example, and based on a document including a patent, the overall differentiation score may include an average and/or median of the differentiation scores for the broadest independent claim (e.g., using the breadth scores above) and each of the dependent claims that depend from the broadest independent claim. For another example, and based on a document including a patent, the overall differentiation score may include an average and/or median of the differentiation scores of each of the independent claims.
Although the above calculations 302 and 304 describe determining differentiation between one or more portions and final differentiation scores based on word analysis within the document itself, in some instances, these calculations 302 and 304 may determine differentiation between one or more portions and final differentiation scores based on the differentiation “footprint” of the one or more portions relative to an entirety of the subject matter of the corpus of documents. For instance, the differentiation calculation 302 can generate a corpus of words based on words within the corpus of documents. In some instances, the differentiation calculation 302 can generate the corpus of words using every word that is included in the corpus of documents. In some instances, the differentiation calculation 302 can generate the corpus of words using every word that is included in the document portions that are being analyzed. For instance, if the corpus of documents includes a corpus of patents, the differentiation calculation 302 can generate the corpus of words to include every word that is included within every claim of the corpus of patents. In some instances, the corpus of words may be generated based on the raw input following data filtering 104 or after some level of pre-processing 106. For example, generating the corpus of words may be performed after removal of duplicate words so that each word in the corpus of words is unique. Also, generating the corpus of words may be performed before or after removing stop words 108. Similarly, generating the corpus of words may be performed before or after converting acronyms and abbreviations 110 into their full word representations.
Using the corpus of words, the differentiation calculation 302 may assign a portion differentiation score to one or more document portions by comparing words within the one or more document portions. In some instances, the differentiation calculation 302 may determine the number of unique words in the portion determined to have the broadest overall breadth score. For each additional document portion, the differentiation calculation 302 may determine the number of unique words in the portion that are not included in the portion having the broadest overall breadth score. In another example, the differentiation calculation 302 may determine the number of unique words that are included in that particular portion and not included in any other portion. In some instances, the number of unique words associated with each portion is then expressed as a percentage of the unique words within the corpus of words in the relevant documents. For example, if the corpus of words in the relevant documents includes 10,000 unique words, and a given document portion (e.g., independent claim) includes 20 unique words that are within the corpus of 10,000 unique words, then the percentage for the given document portion is 0.2% (i.e., 0.002 when expressed as a fraction). If a second document portion (e.g., independent claim) also includes 20 unique words that are both within the corpus of 10,000 unique words and exclusive of the words in the first (or any other previously processed) document portion, then the percentage for the second document portion is also 0.2% (i.e., 0.002).
If the document of interest includes only those two portions, in some instances the overall differentiation calculation 304 can include summing the reciprocal of each percentage (expressed as a fraction) for a differentiation calculation of 1,000 (1/0.002+1/0.002), giving more weight to portions with a relatively small percentage of the unique words of the corpus. In other instances, the reciprocal of one minus the percentage could be summed for each portion (i.e., 1/(1−0.002)+1/(1−0.002)=2.004), giving more weight to portions with a relatively large percentage of the unique words of the corpus. In other instances, the reciprocal of the percentage for the broadest portion could be used and the reciprocal of one minus the percentage could be used for all other portions. In still other instances, the summation could be made after further weighting the contribution of individual portions (e.g., in the context of patent documents, weighting the contribution of independent claims more heavily than the contribution of dependent claims). In this manner, a document with many document portions having unique words that are not common to other portions within the document will have a relatively high overall differentiation score and large “footprint.”
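The two summations in this example can be checked with a short sketch. The function names are hypothetical, and each portion's share of the corpus vocabulary is expressed as a fraction (20/10,000 = 0.002) so that the arithmetic matches the figures above.

```python
from typing import Sequence


def vocabulary_fraction(unique_words_in_portion: int, corpus_vocabulary_size: int) -> float:
    """Share of the corpus's unique words that the portion contributes, as a fraction."""
    return unique_words_in_portion / corpus_vocabulary_size


def sum_of_reciprocals(fractions: Sequence[float]) -> float:
    """Favors portions that hold a small share of the corpus vocabulary."""
    return sum(1.0 / f for f in fractions)


def sum_of_complement_reciprocals(fractions: Sequence[float]) -> float:
    """Favors portions that hold a large share of the corpus vocabulary."""
    return sum(1.0 / (1.0 - f) for f in fractions)


# Two portions, each with 20 unique words out of a 10,000-word corpus vocabulary.
fractions = [vocabulary_fraction(20, 10_000), vocabulary_fraction(20, 10_000)]
print(round(sum_of_reciprocals(fractions), 3))             # 1000.0
print(round(sum_of_complement_reciprocals(fractions), 3))  # 2.004
```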
Comparative differentiation score calculation 306 can determine comparative differentiation scores for the documents as compared to other documents within the analysis. For instance, to determine a comparative differentiation score for a document as compared to other documents in the analysis, the comparative differentiation score calculation 306 compares the overall differentiation score of the document to the overall differentiation scores of other documents that are within the analysis. In some instances, the comparative differentiation score for a document corresponds to the percentage of documents that include an overall differentiation score that is equal to or less than the overall differentiation score of the document. In some instances, the comparative differentiation score for a document corresponds to the percentage of documents that include an overall differentiation score that is less than the overall differentiation score of the document. In some instances, the comparative differentiation score for a document corresponds to the percentage of documents that include an overall differentiation score that is equal to or greater than the overall differentiation score of the document. Still, in some instances, the comparative differentiation score for a document corresponds to the percentage of documents that include an overall differentiation score that is greater than the overall differentiation score of the document.
The UI 308 may display, or otherwise present to a user, the comparative differentiation scores for the documents, rankings based on the comparative differentiation scores, and an identifier for each of the analyzed documents. The identifier for each of the documents may be a unique identifier such as a patent number, a published patent application number, an ISBN, a title, a URI, etc. The UI 308 may be generated by processing a text file or other textual output. The UI 308 may be implemented as a command line interface, as a graphical user interface, or as another type of interface. When implemented as a graphical user interface, the UI 308 may be generated by a cloud service that is accessible over a communications network such as the Internet. Any number of users may access the UI 308 any time through specialized applications or through browsers (e.g., Internet Explorer®, Firefox®, Safari®, Google Chrome®, etc.) resident on their local computing devices.
Still, in some instances, the comparative coverage score calculation 402 may calculate the comparative coverage scores using one or more of the comparative breadth scores, the comparative portion count scores, and the comparative differentiation scores. For example, the comparative coverage score calculation 402 can calculate the comparative coverage score for a patent based on the comparative breadth score (e.g., the breadth score for the broadest independent claim) without considering the comparative portion count score or the comparative differentiation score.
There is an entry for one or more documents in the UI 412 and information about those documents. The information may include the ranking 414 for each of the documents, patent number 416 for each of the documents, the comparative breadth scores 406 for each of the documents, the comparative portion count scores 408 for each of the documents, the comparative differentiation scores 410 for each of the documents, and the comparative coverage scores 404 for each of the documents. The UI 412 may also include interactive elements 418 associated with each of the entries. One of the interactive elements 418 may be activated in response to a command generated on an input device to select one of the documents. Information about the analysis of the selected document may be saved to a separate file, placed in a separate portion of memory, or added to a list for later access and/or analysis.
Furthermore, in some instances, the UI 412 can include group scores 420 for the documents under analysis. For instance, a document may be related to one or more other documents that are being analyzed. For example, a patent may be included in a patent family, which can include two or more patents. In some instances, the patent family includes patents that claim priority to one another, such as in the form of continuation applications, divisional applications, foreign applications, or the like. Thus, the group scores 420 can include a score for each of the documents that is included in a group. In some instances, the group score 420 for a document can include the average of each of the comparative coverage scores of the documents within the group. In some instances, the group score 420 for a document can include the median, mode, lowest comparative coverage score, highest comparative coverage score, or the like of the comparative coverage scores of the documents within the group. In some instances, one or more of the documents under analysis may not be included in a group and as such, may not include a group score 420. For instance, the first two patents included in the UI 412 include respective group scores 420, while the last two patents do not include respective group scores 420.
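The averaging variant of the group score can be sketched as follows. The document identifiers, family labels, and score values are hypothetical, and documents that belong to no group are given `None` rather than a score.

```python
from statistics import mean
from typing import Dict, List, Optional


def group_scores(coverage: Dict[str, float],
                 family_of: Dict[str, Optional[str]]) -> Dict[str, Optional[float]]:
    """Give every document the average comparative coverage score of its group.

    Documents whose family is None are not in any group and get no group score.
    """
    by_family: Dict[str, List[float]] = {}
    for doc, family in family_of.items():
        if family is not None:
            by_family.setdefault(family, []).append(coverage[doc])
    return {
        doc: (mean(by_family[family]) if family is not None else None)
        for doc, family in family_of.items()
    }


# Hypothetical comparative coverage scores; P1 and P2 share a patent family.
coverage = {"P1": 90.0, "P2": 70.0, "P3": 55.0, "P4": 40.0}
family_of = {"P1": "F1", "P2": "F1", "P3": None, "P4": None}
print(group_scores(coverage, family_of))
# {'P1': 80.0, 'P2': 80.0, 'P3': None, 'P4': None}
```

Replacing `mean` with `statistics.median`, `min`, or `max` would give the other variants mentioned above.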
It should be noted that, in some instances, rather than using the results from the comparative breadth score calculation 118, the comparative portion count score calculation 204, and the comparative differentiation score calculation 306, the comparative coverage score calculation 402 can additionally, or alternatively, use one or more of the results from the overall breadth calculation 116, portion count calculation 202, and the overall differentiation calculation 304 to calculate the comparative coverage scores for the documents. For instance, in some examples, the comparative coverage score calculation 402 may not normalize the overall scores for the documents when determining the comparative coverage scores for the documents. Additionally, in some instances, the comparative coverage scores may be based on only one or two of the comparative breadth score calculation 118, the comparative portion count score calculation 204, and the comparative differentiation score calculation 306.
Risk score calculation 502 can determine risks of the documents being analyzed. For instance, if documents include patents, risk can reflect a likelihood that the patents will be invalidated if the patents are challenged, such as by reexamination. To align the calculated risk scores with each of the calculated coverage scores and market scores, where a higher score indicates a better quality for the documents (e.g., the patents), the risk score calculation 502 can alternatively calculate the risk scores as an inverse of the risk of the documents. For instance, and if a document includes a patent, the risk score associated with the patent can reflect a likelihood that the patent will not be invalidated if challenged (e.g., reexamined).
The risk score calculation 502 can utilize many factors when calculating risk scores for patents. For instance, factors for a patent can include a number of sources 504 of possible prior art (e.g., other patents, publications, articles, references, or the like) that are related to the subject matter of the patent, a number of references 506 that were cited during prosecution of the patent, breadth of the claims within the patent 508 (e.g., breadth of the independent claims as well as dependent claims), prosecution history 510 (which may include cited documents 506) of the patent, and/or the like.
For example, in some instances, a semantic search can be performed using a patent to identify a set number of documents 504 (e.g., one, five, ten, one hundred, or any other number) that are related to the patent. In some instances, the semantic search is performed based on one or more claims of the patent. For example, the semantic search can be performed using the broadest independent claim, each of the independent claims, the broadest independent claim as well as claims that depend from the broadest independent claim, every claim, or any other combination of the claims. In other instances, the semantic search is performed using one or more additional or alternative portions of the patent, such as the abstract, the specification, the description of the figures, the background, or any combination thereof. The set of documents is then analyzed to remove any documents that do not qualify as prior art to the patent. For instance, the set of documents can be analyzed to remove any documents that include a priority date (e.g., drafting date, publication date, filing date, or the like) that is later than the priority date of the patent. In some instances, the set of documents can further be analyzed to remove any documents (e.g., references) that were cited during prosecution of the patent. Furthermore, in some instances, the set of documents may further be analyzed to remove any documents that are commonly assigned to the assignee of the patent, which would cause the documents to not qualify as prior art, as set forth in the rules of the MPEP. A risk score can then be calculated for the patent based on the number of documents identified during the semantic search and the number of remaining documents. Specific techniques for calculating risk score using such a process are described in detail below.
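The filtering steps can be sketched as below. The `Reference` record and function names are hypothetical, the sketch treats a reference as prior art only if its date is earlier than the patent's priority date, and the final formula is purely an assumption for illustration: this passage states only that the score depends on the number of documents identified and the number remaining, not how the two are combined.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Set


@dataclass
class Reference:
    doc_id: str
    priority_date: date
    assignee: str


def remaining_prior_art(results: List[Reference], patent_priority: date,
                        cited_ids: Set[str], patent_assignee: str) -> List[Reference]:
    """Keep only search results that are earlier than the patent, were not cited
    during prosecution, and are not commonly assigned."""
    return [
        r for r in results
        if r.priority_date < patent_priority
        and r.doc_id not in cited_ids
        and r.assignee != patent_assignee
    ]


def risk_score(num_identified: int, num_remaining: int) -> float:
    """ASSUMED formula: fewer remaining uncited references gives a higher (better) score."""
    if num_identified == 0:
        return 100.0
    return 100.0 * (1.0 - num_remaining / num_identified)


results = [
    Reference("R1", date(2010, 1, 1), "Other Co"),   # remains
    Reference("R2", date(2020, 1, 1), "Other Co"),   # too late to be prior art
    Reference("R3", date(2011, 1, 1), "Other Co"),   # already cited in prosecution
    Reference("R4", date(2012, 1, 1), "Owner Co"),   # commonly assigned
]
remaining = remaining_prior_art(results, date(2015, 6, 1), {"R3"}, "Owner Co")
print([r.doc_id for r in remaining])            # ['R1']
print(risk_score(len(results), len(remaining)))  # 75.0
```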
In some instances, a risk score for a patent may be calculated or adjusted based on the number of references 506 that were cited during prosecution of the patent. For example, a search can be performed to identify each reference that was cited during prosecution of the patent. The search can include searching one or more databases, such as one or more databases associated with PAIR (Patent Application Information Retrieval), the EPO (European Patent Office), the WIPO (World Intellectual Property Organization), or the like, that include information about references cited during prosecution. A risk score can then be calculated based on the number of references. In some instances, a higher risk score is calculated for a patent that includes a greater number of references cited during prosecution than for a patent that includes a lesser number of references cited during prosecution. This is because more references that qualify as prior art were considered during prosecution of the patent with the higher number of references. Therefore, there may be fewer references that are related to the patent, qualify as prior art, and were not considered during prosecution. Specific techniques for calculating risk score using such a process are described in detail below.
In some instances, a risk score may be calculated or adjusted for a patent based on breadth 508 of the claims in the patent. For instance, the overall breadth score of a patent may be determined using all or part of the method 100 described in
In some instances, a risk score may be calculated or adjusted for a patent based on prosecution history 510 of the patent. For instance, a search can be performed to identify information corresponding to the prosecution history 510 of the patent, such as the filing date of the patent, the issue date of the patent, the number of office actions issued during prosecution, amendments made to the claims during prosecution, whether there was a Notice of Appeal filed during prosecution, or the like. The search can include searching one or more databases, such as one or more databases associated with PAIR, the EPO, the WIPO, or any other organization that stores information associated with patents. A risk score can then be calculated for the patent based on the prosecution history 510. For example, a higher risk score may be calculated for a patent that was in prosecution for a lesser amount of time than for a patent that was in prosecution for a greater amount of time. For another example, a higher risk score may be calculated for a patent that was issued a lesser number of office actions during prosecution than for a patent that was issued a greater number of office actions during prosecution. In either example, the higher risk score is based on the assumption that there is a greater probability that an error was made during prosecution (e.g., a higher risk), which can cause the patent to be invalidated, each time the specification and/or claims in the patent are amended during prosecution. Specific techniques for calculating risk score using such a process are described in detail below.
The UI 512 may display, or otherwise present to a user, the risk scores for the documents, rankings based on the risk scores, and an identifier for each of the analyzed documents. The identifier for each of the documents may be a unique identifier, such as a patent number, a published patent application number, an ISBN, a title, a URI, etc. The UI 512 may be generated by processing a text file or other textual output. The UI 512 may be implemented as a command line interface, as a graphical user interface, or as another type of interface. When implemented as a graphical user interface, the UI 512 may be generated by a cloud service that is accessible over a communications network such as the Internet. Any number of users may access the UI 512 any time through specialized applications or through browsers (e.g., Internet Explorer®, Firefox®, Safari®, Google Chrome®, etc.) resident on their local computing devices.
Classification analysis 602 can determine market classifications corresponding to the documents being analyzed. For instance, if a document includes a patent, the patent may be analyzed to identify an initial classification corresponding to the patent. Analyzing the patent can include searching one or more databases, such as one or more databases associated with PAIR, the EPO, the WIPO, or the like, to identify the initial classification corresponding to the patent. In some instances, the initial classification can include a classification assigned to the patent that is based on the Cooperative Patent Classification (CPC). In some instances, the initial classification can include a classification assigned to the patent that is based on the United States Patent Classification (USPC), a classification assigned to the patent from the EPO, or any other type of classification that can be assigned to the patent.
A semantic search can then be performed using the initial classification in order to determine a class (e.g., a market classification) corresponding to the patent. In some instances, the market classification can include a North American Industry Classification System (NAICS) classification. In other instances, the market classification can correspond to a different classification system, such as the Standard Industrial Classification (SIC) system. In either instance, and using the NAICS as an example, a semantic search can be performed using the descriptions for one or more of the NAICS classifications to identify at least one NAICS classification that is related to the initial classification assigned to the patent. For another example, and again using the NAICS, a lookup table may be created that associates each initial classification that can be assigned to a patent to at least one of the NAICS classifications. A search can then be performed using the lookup table to identify a NAICS classification associated with the initial classification assigned to the patent.
In some instances, in addition to, or as an alternative to, using the initial classification assigned to the patent, a semantic analysis can be performed on the patent to identify at least one market classification for the patent. For example, a semantic search can be performed using the broadest independent claim, each of the independent claims, the broadest independent claim as well as the claims that depend from the broadest independent claim, every claim, or any other combination of the claims to identify a NAICS classification that is related to the patent. For another example, a semantic search can be performed using one or more additional or alternative portions of the patent, such as the abstract, the specification, the description of the figures, the background, or any combination thereof, to identify a NAICS classification that is related to the patent.
Market score calculation 604 calculates scores for the documents under analysis using respective values (e.g., metrics) associated with the identified market classifications. For instance, each market classification identified for a document may be associated with a respective value. In some instances, the values are calculated based on the gross domestic product (GDP) of the country in which the documents are being analyzed. For instance, if patents are being analyzed in the United States, each market classification identified for a respective patent may be associated with the corresponding allocation of GDP of the United States. To associate a market classification with a GDP, the GDP for the market classification may be normalized based on the total GDP of the country. For instance, a respective GDP may be identified for each market classification that can be assigned to a document. The respective GDP may then be divided by the total GDP in order to determine a portion of the total GDP for each market classification. The portions can then be multiplied by 100 in order to normalize the values to a scale from 0 to 100. In some instances, the values are then added to the lookup table described above such that market values can be identified for documents using the lookup table.
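By way of illustration only, the normalization described above may be sketched as follows. The function name `market_scores`, the classification codes, and the GDP figures are hypothetical placeholders chosen for the example and are not real economic data.

```python
def market_scores(gdp_by_class, total_gdp):
    """Map each market classification to its share of total GDP on a 0-100 scale."""
    if total_gdp <= 0:
        raise ValueError("total GDP must be positive")
    return {code: 100.0 * gdp / total_gdp for code, gdp in gdp_by_class.items()}


# Placeholder figures (not real GDP data), keyed by illustrative classification codes.
example_gdp = {"334": 300.0, "541": 500.0, "325": 200.0}
example_market_scores = market_scores(example_gdp, total_gdp=1000.0)
```

The resulting dictionary could serve as the value column of the lookup table, so that a market score is retrieved directly from a document's market classification.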
The UI 606 may display, or otherwise present to a user, the market scores for the documents, rankings based on the market scores, and an identifier for each of the analyzed documents. The identifier for each of the documents may be a unique identifier such as a patent number, a published patent application number, an ISBN, a title, a URI, etc. The UI 606 may be generated by processing a text file or other textual output. The UI 606 may be implemented as a command line interface, as a graphical user interface, or as another type of interface. When implemented as a graphical user interface, the UI 606 may be generated by a cloud service that is accessible over a communications network such as the Internet. Any number of users may access the UI 606 any time through specialized applications or through browsers (e.g., Internet Explorer®, Firefox®, Safari®, Google Chrome®, etc.) resident on their local computing devices.
There is an entry for one or more documents in the UI 702 and information about those documents. The information may include the ranking 714 for each of the documents, patent number 716 for each of the documents, the comparative coverage scores 708 for each of the documents, the risk scores 710 for each of the documents, the market scores 712 for each of the documents, and the comprehensive scores 706 for each of the documents. The UI 702 may also include interactive elements 718 associated with each of the entries. One of the interactive elements 718 may be activated in response to a command generated on an input device to select one of the documents. Information about the analysis of the selected document may be saved to a separate file, placed in a separate portion of memory, or added to a list for later access and/or analysis.
Furthermore, in some instances, the UI 702 can include group scores 720 for the documents under analysis. For instance, a document may be related to one or more other documents that are being analyzed. For example, a patent may be included in a patent family, which can include two or more patents. In some instances, the patent family includes patents that claim priority to one another, such as in the form of continuation applications, divisional applications, foreign applications, or the like. Thus, the group scores 720 can include a score for each of the documents that is included in a group. In some instances, the group score 720 for a document can include the average of each of the comprehensive scores of the documents within the group. In some instances, the group score 720 for a document can include the median, mode, lowest comprehensive score, highest comprehensive score, or the like of the comprehensive scores of the documents within the group. In some instances, one or more of the documents under analysis may not be included in a group and as such, may not include a group score 720. For instance, the first two patents included in the UI 702 include respective group scores 720, while the last two patents do not include respective group scores 720.
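As a minimal sketch of the group score described above, the following assumes that comprehensive scores are already available per document and that families are given as lists of document identifiers. The names `group_scores` and `AGGREGATORS`, the identifiers, and the score values are hypothetical.

```python
from statistics import mean, median

# Aggregation choices mentioned in the text; "average" is used by default.
AGGREGATORS = {"average": mean, "median": median, "lowest": min, "highest": max}


def group_scores(comprehensive, families, method="average"):
    """Return {doc_id: group score}; documents outside any family are omitted."""
    aggregate = AGGREGATORS[method]
    result = {}
    for members in families:
        score = aggregate([comprehensive[doc] for doc in members])
        for doc in members:
            result[doc] = score
    return result


example_comprehensive = {"A": 80.0, "B": 60.0, "C": 90.0, "D": 40.0}
example_groups = group_scores(example_comprehensive, families=[["A", "B"]])
```

Documents "C" and "D" belong to no family in this example and therefore receive no group score, mirroring the entries in the UI 702 that lack a group score 720.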
It should be noted that, in some instances, rather than using the results from the comparative coverage score calculation 402, the comprehensive score calculation 704 can additionally, or alternatively, use one or more of the results from the overall breadth calculation 116, the comparative breadth score calculation 118, the portion count calculation 202, the comparative portion count score calculation 204, the overall differentiation calculation 304, or the comparative differentiation score calculation 306 to calculate the comprehensive scores for the documents. Additionally, in some instances, the comprehensive score calculation 704 may calculate the comprehensive scores based on only one or two of the comparative coverage score calculation 402, the risk score calculation 502, and the market score calculation 604.
It should further be noted that, in some instances, a comprehensive score can be calculated or adjusted for a document (e.g., a patent) based on other factors. The other factors can include, but are not limited to, a remaining patent term, litigation history associated with the patent, licensing history associated with the patent, a security interest associated with the patent, an ownership associated with the patent, and/or one or more related patents (e.g., one or more foreign related patents). For example, a comprehensive score for a patent may be increased when the patent includes a greater amount of patent term remaining (e.g., 15 years), and decreased when the patent includes a lesser amount of patent term remaining (e.g., 2 years). For another example, a comprehensive score for a patent may be increased when the patent is already being licensed, and decreased if the patent is not already being licensed and/or if it would be difficult to license the patent.
The methods are illustrated as a collection of blocks in a logical flow graph representing a sequence of operations that can be implemented in hardware, software, firmware, or a combination thereof. The order in which the method blocks are described and claimed is not intended to be construed as a limitation, and any number of the described method blocks can be combined in any order to implement the method, or alternate methods. Additionally, individual blocks may be omitted from the method without departing from the spirit and scope of the subject matter described herein. In the context of software, the blocks represent computer instructions that, when executed by one or more processors, perform the recited operations. In the context of hardware, some or all of the blocks may represent application specific integrated circuits (ASICs) or other physical components that perform the recited operations.
Methods 800-1700 are described in the general context of computer-executable instructions. Generally, computer-executable instructions can include routines, programs, objects, components, data structures, procedures, modules, functions, and the like that perform particular functions or implement particular abstract data types. The methods can also be practiced in a distributed computing environment where functions are performed by remote processing devices that are linked through a communication network. In a distributed computing environment, computer-executable instructions may be located in local and/or remote computer storage media, including memory storage devices.
At 802, a single document may be received from a data repository for analysis. Each document in the data repository may be associated with a unique document identification number. The unique document identification number of a patent document may include an application number, a publication number, a patent number, and/or a combination of information associated with the patent document that may uniquely identify the patent document (such as a combination of a name of an inventor and a filing date, etc.).
This process may repeat until all documents in a targeted data repository are analyzed. The available data repositories may include, but are not limited to, a patent database provided and/or supported by a patent office of a particular country (e.g., a USPTO (United States Patent and Trademark Office) database, a PAIR database, EPO database, WIPO database, SIPO (State Intellectual Property Office of the P.R.C.) database, etc.), and any other databases that are provided by public and/or private institutions over the world.
At 804, it is determined if the document contains machine-readable text. Some types of files available from the data repositories, such as HTML documents, may already contain machine-readable text. Other types of files such as PDF files representing images of paper documents may lack machine-readable text. Draft documents or unpublished documents, for example, may be available only in forms that do not include machine-readable text. The determination of whether a document contains machine-readable text may be made in part by automatic detection of file type using known techniques for file type identification including recognition of filename suffixes. If a file type is not specified by a suffix or other metadata, it may be determined by opening the file and comparing the file structure to a library of known structures associated with known file types. If a document is determined to not include machine-readable text, method 800 may proceed to 806 and optical character recognition (OCR) may be used to recognize text in the document.
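The file type identification described above may be sketched as follows, under the assumption that a filename and the first few bytes of the file are available. The function name `guess_file_type` is hypothetical, and the small table of leading byte signatures stands in for a fuller library of known file structures.

```python
import os

# Leading byte signatures for a few common formats (a tiny stand-in library).
MAGIC_NUMBERS = {b"%PDF": "pdf", b"\x89PNG": "png", b"\xff\xd8\xff": "jpeg"}


def guess_file_type(filename, header=b""):
    """Prefer the filename suffix; otherwise compare the header to known signatures."""
    suffix = os.path.splitext(filename)[1].lower()
    if suffix:
        return suffix.lstrip(".")
    for magic, kind in MAGIC_NUMBERS.items():
        if header.startswith(magic):
            return kind
    return "unknown"
```

Knowing the file type is only part of the determination: a PDF, for example, may or may not carry a text layer, so a further check would be needed before deciding whether OCR at 806 is required.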
At 806, OCR may be applied to the document to convert the document into a format that contains machine-readable text. OCR is the mechanical or electronic conversion of images of typed, handwritten, or printed text into machine-encoded text, whether from a scanned document, a photo of a document, or other source. OCR is a method of digitizing from imaged texts so that they can be electronically edited, searched, stored more compactly, displayed on-line, and used in machine processes such as cognitive computing, machine translation, (extracted) text-to-speech, key data and text mining. OCR involves pattern recognition, artificial intelligence, and computer vision.
At 808, document type is identified. Document type means the type of information contained in a document rather than the computer file format in which the document is encoded. Documents may include identifying information such as unique document identification numbers, kind codes, and indications of source. Unique document identification numbers may, for example, include patent numbers that can be used to distinguish between different types of patents based on the structure of the number. For example, when analyzing document identification numbers coming from a database of U.S. patents, a seven digit number may be interpreted as indicating that the document is a utility patent, an eleven digit number, optionally with a “/” following the first four digits, may indicate a published patent application, a five or six digit number preceded by the letter D may indicate a design patent, and identifiers for plant patents begin with the letters PP. Kind codes in patent documents can also indicate if a document is a utility patent, plant patent, patent application publication, statutory invention registration, or design patent. The documents to be analyzed may come from any one of a number of different data repositories. If a given data repository is known to be limited to containing only documents of a certain type, then all documents obtained from that data repository may be assumed to be of the specified type. For example, a document obtained from a data repository that only contains academic papers on biotechnology may be identified as an academic paper on biotechnology by virtue of coming from this specific data repository. Each document at this point in method 800 will contain machine-readable text and be associated with a document type.
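The number-structure rules described above may be sketched with regular expressions as follows. The function name `classify_document_id` and the type labels are hypothetical, and the optional separator after the first four digits of a published application number is assumed here to be a slash.

```python
import re

# Ordered (pattern, label) pairs following the number structures described in the text.
ID_PATTERNS = [
    (r"PP\d+", "plant patent"),
    (r"D\d{5,6}", "design patent"),
    (r"\d{4}/?\d{7}", "published application"),  # eleven digits, optional slash
    (r"\d{7}", "utility patent"),
]


def classify_document_id(doc_id):
    """Return a document type label based only on the structure of the identifier."""
    for pattern, label in ID_PATTERNS:
        if re.fullmatch(pattern, doc_id.strip()):
            return label
    return "unknown"
```

An identifier that matches none of the patterns is reported as "unknown", at which point a kind code or the source repository could be consulted instead.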
At 810, it is determined if the document is of one or more specified document types. This filters documents based on document type. Document type(s) may be specified by a user. In the absence of user specification, filtering may be performed based on a default document type. In one implementation, the default document type may be issued U.S. patents. Thus, any document that is identified as a U.S. patent by a unique document identification number, by a kind code, by coming from a particular data repository, or by another technique is retained for further analysis. A user may also specify both issued U.S. patents and issued European patents, in which case documents of either type would be determined to match the specified document type. However, if a document does not match the specified document type, method 800 returns to 802 and a new document is received from the data repository. This portion of method 800 may proceed automatically and continually until all documents within the one or more data repositories have been analyzed. This processing and filtering allows use of varied data repositories and allows for document analysis to be applied across multiple data repositories because there are mechanisms for converting all documents into machine-readable text and for excluding documents that do not match a specified document type.
For those documents that do match the specified document type at 810, method 800 proceeds to 812.
At 812, it is determined if the claims portion of the document is labeled. A labeled claims portion is identified as a portion of text that contains patent claims separate from other portions of a patent document. For example, a document in CSV format may have all the claims in the same column, which is designated as containing claims. Alternatively, an HTML document may have specific tags on each claim indicating that it is a claim and whether it is an independent or dependent claim. However, other documents such as an OCR version of a PDF document may simply contain undifferentiated text. For such documents, claims cannot be identified as such without additional analysis. This example discusses determining if a claims portion of a patent document is labeled. However, identifying specific labeled portions of a document is not limited to this application and may also be applied to determine if other portions of documents are separately identified, such as determining which financial documents have executive summaries labeled as executive summaries.
If a document does not have a labeled claims portion, method 800 proceeds to 814.
At 814, the claims portion is detected. The specific technique for detecting the claims portion may vary based on the document format. In one implementation, keyword recognition may be used to distinguish a claims portion. For example, if a page of a document includes the word “claim” or “claims” within the first line and is followed on that same page by a paragraph beginning with a number followed by a period, then that paragraph or entire page may be designated as a claims portion. Other recognition techniques may be alternatively or additionally applied. For example, any paragraph including a line ending with a semicolon may be interpreted as a claim.
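The keyword recognition example above may be sketched as follows. The function name `is_claims_page` is hypothetical, and the sketch approximates a "paragraph" by a line of text, which is an assumption about how page text is laid out.

```python
import re


def is_claims_page(page_text):
    """Heuristic: 'claim(s)' in the first line, then a line starting with 'N. '."""
    lines = page_text.strip().splitlines()
    if not lines:
        return False
    has_keyword = re.search(r"\bclaims?\b", lines[0], re.IGNORECASE) is not None
    has_numbered = any(re.match(r"\s*\d+\.\s", line) for line in lines[1:])
    return has_keyword and has_numbered
```

Because the heuristic is deliberately simple, it may be combined with other recognition techniques, such as the semicolon test mentioned above, to reduce false positives and false negatives.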
At 816, a record is created from the document containing the claims portion and unique document identification number. This record may be stored as an independent file or as a portion of another file. The record may be in a different format than the format of the source document. In many implementations, the record will be stored in a memory that is both logically and physically separate from any of the data repositories. This record can be associated with the source document through the unique document identification number. The claims in the record may be distinguished as individual claims or may be an undifferentiated collection of text that represents some or all of the claims in the patent document. Thus, in the context of patent documents this record may represent the claims section of a patent document. Generation of multiple records from multiple documents can create a corpus of patent claims that are amenable for further analysis.
At 902, the claims section of a document may be parsed into separate words. This divides the text of the claims section into multiple discrete words. Word parsing may be performed by identifying word delimiters and using the word delimiters to separate the text into individual words. A delimiter is a blank space, comma, or other character or symbol that indicates the beginning or end of a character string, word, or data item. In one implementation, the word delimiters are both a <space> and a dash “-”. Word parsing may be performed before or after individual claims are distinguished from one another.
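A minimal sketch of word parsing with the space and dash delimiters is shown below. The function name `parse_words` is hypothetical.

```python
import re


def parse_words(text):
    """Split text on runs of spaces and dashes, dropping empty strings."""
    return [word for word in re.split(r"[ \-]+", text) if word]


example_words = parse_words("a computer-readable storage medium")
```

Treating the dash as a delimiter means that a hyphenated term such as "computer-readable" is counted as two words.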
At 904, acronyms and abbreviations are replaced with alternative standardized representations. This may be performed by comparing each word from the claim section to a synonym library (e.g., a lookup table) containing known acronyms and abbreviations that are paired with alternative representations. In some instances, the alternative representations may be fully written out words. An alternative representation may also be a standardized form that does not use periods. For example, “NASA” may be replaced with “National Aeronautics and Space Administration.” Similarly, “U.S.A.” may be replaced by “USA” or, in some implementations, “United States of America.” This serves to remove the periods that are found in some abbreviations and to normalize word count so that claims are not perceived as shorter merely because they use more acronyms or abbreviations. Removing periods in acronyms allows the end-of-sentence period to be used as an indicator of where a first claim ends and a second claim begins.
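The synonym library lookup may be sketched as follows. The table `SYNONYMS` holds only two illustrative entries, and the function name `expand_abbreviations` is hypothetical.

```python
# Illustrative synonym library: abbreviation -> standardized representation.
SYNONYMS = {
    "U.S.A.": "USA",
    "NASA": "National Aeronautics and Space Administration",
}


def expand_abbreviations(words, synonyms=SYNONYMS):
    """Replace known abbreviations; multi-word replacements become separate words."""
    expanded = []
    for word in words:
        expanded.extend(synonyms.get(word, word).split())
    return expanded
```

Splitting the replacement into separate words is what normalizes the word count: an acronym that stood for five words contributes five words after this step.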
At 906, the claims section may be divided into individual claims. Recall that after document filtering, each record of a document may include a claim section that could potentially contain multiple claims which are not separately differentiated from each other. Although it may be relatively trivial for a human to identify different claims in a document, it can be much more difficult for an automated process to accurately parse strings of text into separate claims. With patent claims, however, this may be done by creating separation between a first claim and a second claim whenever there is a period followed by a numeral. The separation may be implemented by inserting a carriage return, line break, or other marker. This is a reasonable approximation for dividing claims because once the abbreviations with periods have been replaced with full words, the only periods present in a set of claims will be at the end of a claim. Furthermore, each claim will start with a numeral (e.g., 1-20). Therefore, any point following a period and preceding a numeral is likely a division between two claims.
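The period-followed-by-numeral rule may be sketched with a regular expression as follows. The function name `split_claims` is hypothetical, and the sketch returns a list of claims rather than inserting a marker into the text.

```python
import re


def split_claims(claims_text):
    """Split wherever a period is followed, after optional whitespace, by a digit."""
    parts = re.split(r"(?<=\.)\s*(?=\d)", claims_text)
    return [part.strip() for part in parts if part.strip()]


example_claims = split_claims(
    "1. A method comprising a step. 2. The method of claim 1 wherein the step repeats."
)
```

The period after each claim number is not treated as a boundary because it is followed by a letter rather than a numeral. Decimal numbers inside a claim would be split incorrectly, which is one reason the rule is only an approximation.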
At 908, once the claims have been divided into separate claims, all punctuation may be removed. Punctuation may be removed by matching against a list of punctuation and deleting any character found in the list. Removing punctuation may remove any or all of periods, semicolons, commas, hyphens, brackets, slashes, and the like. Punctuation is generally understood to not affect claim breadth. Thus, by removing punctuation, characters that will not be processed further are taken out of the text which is to be analyzed.
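Removal of punctuation by matching against a list may be sketched as follows. The sketch assumes the ASCII punctuation list from the Python standard library; the function name `remove_punctuation` is hypothetical.

```python
import string


def remove_punctuation(text, punctuation=string.punctuation):
    """Delete every character that appears in the punctuation list."""
    return text.translate(str.maketrans("", "", punctuation))
```

A different list may be passed in when the text contains punctuation outside the ASCII range, such as typographic quotation marks.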
At 910, it is determined if there are specific stop words. Specific stop words may be based on the content of the documents being analyzed. For example, if the documents are patent documents, then the specific stop words may include words that are common in patent claims and unlikely to serve to distinguish one claim from another. A patent-specific list of stop words may include words and/or phrases such as “computer readable media,” “system,” “machine,” “comprising,” and “wherein,” as well as words and/or phrases that indicate statutory classes such as “method,” “article of manufacture”, and “composition of matter.” Technology specific stop words may also be used. For example, if all the patent documents being analyzed are from a same technological class or grouping, then stop words previously identified for that technology may be used. For example, “circuit” may be included in a stop list that is specific for documents describing electrical engineering.
If specific stop words are not available, then method 900 proceeds to 912 and uses default stop words. If, however, specific stop words are available, then method 900 proceeds to 914 and uses the specific stop words. Multiple sets of stop words may be used together. For example, one or more specific stop word lists may be used in conjunction with a default stop word list.
At 916, stop words are removed. If multiple stop word lists are used together, then words are removed if they appear in any of the stop word lists.
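Combining several stop word lists may be sketched as follows. Both lists shown are short illustrative samples rather than complete lists, and the function name `remove_stop_words` is hypothetical.

```python
# Short illustrative lists; real lists would be considerably longer.
DEFAULT_STOP_WORDS = {"a", "an", "the", "of", "and", "to"}
PATENT_STOP_WORDS = {"comprising", "wherein", "method", "system", "machine"}


def remove_stop_words(words, *stop_lists):
    """Drop any word that appears in at least one of the supplied stop word lists."""
    combined = set().union(*stop_lists)
    return [word for word in words if word.lower() not in combined]


example_kept = remove_stop_words(
    ["a", "method", "comprising", "heating", "the", "sample"],
    DEFAULT_STOP_WORDS,
    PATENT_STOP_WORDS,
)
```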
At 918, stemming is performed on the remaining words. Stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form—generally a written word form. The stem need not be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem, even if this stem is not in itself a valid root. Stemming is an additional form of normalization that removes differences between similar words such as “compare” and “comparing.” There are numerous known techniques for stemming including use of a lookup table, suffix stripping, Lemmatisation, stochastic algorithms, n-gram analysis, matching algorithms, etc. In one implementation, the Porter Stemmer algorithm from the publicly available “nltk” package is used to perform stemming.
At 920, duplicate words may be removed. When duplicate word removal occurs after stemming, it is actually the duplicate root forms of the words that are removed. For example, removal of duplicates prior to stemming would leave both “adapter” and “adapted” in the text of a process claim, but following stemming the words may both be converted to the root form “adapt” and one may be removed.
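Stemming followed by duplicate removal may be sketched as follows. To keep the example self-contained, a crude suffix stripper named `toy_stem` (hypothetical, and far less accurate than the Porter Stemmer) is used as a stand-in; any stemming function may be passed in its place.

```python
def toy_stem(word):
    """Crude suffix stripper used only as a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "er", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def unique_stems(words, stem=toy_stem):
    """Stem each word, then keep only the first occurrence of each root form."""
    seen = set()
    result = []
    for word in words:
        root = stem(word.lower())
        if root not in seen:
            seen.add(root)
            result.append(root)
    return result


example_stems = unique_stems(["adapter", "adapted", "adapting", "plate"])
```

In this example the three inflected forms collapse to the single root "adapt", so the word count of unique words is two rather than four.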
Thus, the various claim sections obtained from patent documents are standardized through pre-processing by replacing acronyms and abbreviations with alternative representations (e.g., writing out in full words), removing punctuation, removing stop words, stemming, and deletion of duplicate words. This pre-processing makes the data from the data repositories more amenable to automatic analysis of claim breadth. It also strips away some of the variation that may be introduced by various patent claim drafting techniques in an effort to approximate the content of a patent claim separate from a particular writing style. Although a human analyst can identify when writing is “wordy,” automatic analysis of breadth may be confounded by different writing styles and potentially score similar claims differently unless pre-processing is performed.
In some instances, the documents are pre-processed to generate one or more processed document portions for each of the documents. The pre-processing may use all or part of the method 800 described in
At 1004, a word count is generated for each of the document portions (e.g., processed or unprocessed document portions). For instance, a word count for each document portion may be generated by counting a number of separate words in the respective document portions. In some instances, this may be performed after pre-processing so that stop words and duplicate words are omitted from the count. A word count performed after removal of duplicate words is referred to as a word count of unique words. In some instances, the word count generated for each document portion (e.g., patent claim) is an integer (e.g., one, two, three, etc.).
At 1006, a referential word count is identified. In some instances, the referential word count is a number, but not necessarily an integer. The referential word count may be based on a characteristic derived from the word counts of the individual document portions under analysis. For example, the referential word count may be the word count of the document portion having the largest word count out of all the analyzed document portions. For another example, the referential word count may be the word count of the document portion having the smallest word count out of all the analyzed document portions.
In some instances, other characteristics may also be used to generate the referential word count such as the average or median word count of the analyzed document portions. For example, if the analyzed document portions are patent claims, then the referential word count may be the word count of the longest patent claim, the word count of the shortest patent claim, the average word count of all the analyzed patent claims, the median word count of all the analyzed patent claims, or some other metric. In some instances, the referential word count is the same for all document portions analyzed together in the same corpus. However, in some instances, due to the different characteristics of each corpus of documents analyzed, the referential word count will be different in different analyses.
At 1008, word count ratios are calculated for the document portions. For instance, a word count ratio may be calculated for each document portion by dividing the referential word count by the word count for a respective document portion. Thus, in some instances, each analyzed document portion will be associated with a word count ratio. In some instances, the numerator is the same for each document portion in a given corpus, but the denominator is different depending on the individual word count of that document portion. For example, if the word count for a given document portion is 25 and the referential word count is 72 (e.g., the longest word count of all the analyzed document portions), then the word count ratio for that particular document portion is 72/25 or 2.88.
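The word count ratio may be sketched as follows, reproducing the worked example of a 25-word portion measured against a 72-word referential count. The function name `word_count_ratios` and the portion identifiers are hypothetical, and the largest word count is assumed as the referential word count unless another function is supplied.

```python
def word_count_ratios(word_counts, reference=max):
    """Divide the referential word count by each portion's own word count."""
    referential = reference(word_counts.values())
    return {portion: referential / count for portion, count in word_counts.items()}


example_wcr = word_count_ratios({"claim_1": 25, "claim_2": 72})
```

Passing `reference=min` or `reference=statistics.mean` would select the shortest or average word count as the referential value instead.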
At 1010, a word frequency is determined for individual words. For instance, a corpus-based word frequency may be determined for each word included in any of the document portions. In some instances, the word frequency is specific to the word and not the document portion in which the word is found. Word frequency may be thought of as a measure of how common a particular word is throughout all of the analyzed document portions. In some instances, word frequency is determined by counting how many times a word appears in all of the analyzed document portions. Thus, word frequency represents the number of instances that a word is found across the entire set of content under analysis prior to removal of duplicate words. For example, if the corpus of documents being analyzed includes 1000 patents, and those patents each have on average 20 patent claims, then there will be 20,000 document portions under analysis. The number of times a given word such as “machine” appears throughout all 20,000 document portions is that word's frequency. As such, words that are common in a particular corpus will have higher word frequency values and words that are uncommon in the particular corpus will have lower word frequency values. Thus, at this point, each document portion is associated with a word count and each word (which necessarily includes the words in each document portion) is associated with a word frequency.
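Corpus-wide word frequency may be sketched with a counter as follows. The function name `word_frequencies` is hypothetical, and each document portion is assumed to be a list of words from which duplicates have not yet been removed.

```python
from collections import Counter


def word_frequencies(portions):
    """Count every occurrence of every word across all document portions."""
    counts = Counter()
    for words in portions:
        counts.update(words)
    return counts


example_freq = word_frequencies([["machine", "learn", "machine"], ["machine", "gear"]])
```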
At 1012, a commonness score is generated for the document portions. For instance, each document portion may be associated with its own commonness score. The commonness score is based on the frequency that the individual words in a particular document portion are found throughout the entire corpus of document portions under analysis. Thus, the commonness score for a document portion is based on the word frequencies of the words in that document portion. In some instances, the commonness score for a processed document portion is based on the square root of the sum of the squares of the inverse of the word frequency for each one of the separate words in that processed document portion. For instance, the commonness score (cs) for a document portion having words 1 to n, each with an associated word frequency represented by wf1 to wfn, may be calculated by the following equation:
$$cs = \sqrt{\sum_{i=1}^{n} \left(\frac{1}{wf_i}\right)^{2}} \qquad (1)$$
With this calculation, a document portion that has more common words will receive a lower commonness score, and a document portion that has more uncommon words will receive a higher commonness score. In this manner, the commonness score represents an underlying assumption or premise that patent claims with more common words tend to be broader than claims with less common words. This may not always be the case, but is a useful generalization for automatic document analysis.
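The commonness score, computed as the square root of the sum of the squares of the inverse word frequencies, may be sketched as follows. The function name `commonness_score` and the frequency values are hypothetical.

```python
import math


def commonness_score(words, word_freq):
    """Square root of the sum of squared inverse word frequencies."""
    return math.sqrt(sum((1.0 / word_freq[word]) ** 2 for word in words))


example_word_freq = {"machine": 4, "gear": 2}
example_cs = commonness_score(["machine", "gear"], example_word_freq)
```

Because each term is an inverse frequency, a rare word contributes more to the score than a common one, which is why portions with more uncommon words receive higher commonness scores.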
At 1014, a reference commonness score is identified. In some instances, the reference commonness score is identified as the highest commonness score out of all of the processed document portions undergoing analysis. The commonness scores for each of the document portions may be calculated and sorted, and the highest of those is then stored as the highest commonness score. This represents the score of the document portion that is the “most uncommon” based on the frequency and number of words included in that document portion. As such, every other document portion will have a commonness score that is lower than or equal to the highest commonness score.
At 1016, commonness score ratios are calculated for the processed document portions. For instance, commonness score ratios may be calculated by dividing the reference commonness score (e.g., the highest commonness score) by the commonness score for individual ones of the processed document portions. In some instances, the document portion with the highest commonness score (the “most uncommon” words) has a commonness score ratio of 1 (i.e., it is divided by its own commonness score value). Additionally, a document portion with half the highest commonness score (fewer “uncommon” words and more “common” words) has a commonness score ratio of 2. As the set of words in a document portion becomes more “common,” the commonness score ratio increases. As such, a higher commonness score ratio indicates more “common” or frequent words in a processed document portion. In the context of patent claims, the commonness score ratio represents an underlying assumption or premise that claims with fewer unique words tend to be broader than claims with more unique words, and thus, the commonness score ratio increases as the words in the claim become more common.
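The commonness score ratio may be sketched as follows, using the highest commonness score as the reference. The function name `commonness_score_ratios` and the score values are hypothetical.

```python
def commonness_score_ratios(commonness_scores):
    """Divide the highest commonness score by each portion's commonness score."""
    reference = max(commonness_scores.values())
    return {portion: reference / cs for portion, cs in commonness_scores.items()}


example_csr = commonness_score_ratios({"claim_1": 0.5, "claim_2": 0.25})
```

Here the portion with the highest commonness score receives a ratio of 1, and the portion with half that score receives a ratio of 2, matching the description above.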
At 1018, breadth scores for the document portions are calculated using the word count ratios and the commonness score ratios. For instance, the breadth scores may be calculated by taking a square root of the sum of the square of the word count ratio (wcr) and the square of the commonness score ratio (csr) for the individual ones of the processed document portions. In some instances, the relative weights of the word count ratio and the commonness score ratio may be normalized. One technique for normalization is to set the highest respective values for both word count ratio and commonness score ratio to 100. If, for example, the highest word count ratio is h-wcr, then all of the wcr for the corpus will be multiplied by 100/h-wcr. Similarly, in some instances, normalization may be performed for the commonness score ratio using the highest commonness score ratio (h-csr). Of course, normalization values other than 100 may be used, such as 1000, 500, 50, 10, or the like. Both are numbers, but the relative effect on a breadth score may not directly correspond to the respective numerical values. For example, a word count ratio of 10 may have more or less impact on ultimate breadth than a commonness score ratio of 10. However, without normalization both contribute equally to the breadth score. As such, the word count ratio may be weighted by a first normalization value K (e.g., 100/h-wcr) and the commonness score ratio may be weighted by a second normalization value L (e.g., 100/h-csr). When written in an equation:
$$\text{Breadth Score} = \sqrt{K\,(wcr^{2}) + L\,(csr^{2})} \qquad (2)$$
Thus, each document portion may be assigned its own breadth score. The breadth scores may be thought of as measuring the breadth of the document portions because the breadth scores are based on measures of word count and word commonness. This technique for determining a breadth score also moderates each of the underlying assumptions or premises behind the word count ratio and the commonness ratio. For example, if a patent claim is relatively shorter, but uses very uncommon terms, a patent practitioner might still consider the claim to be narrow due to the restrictive language in the claim. By defining a breadth score based on these two underlying assumptions, even shorter claims may be ranked not quite as broad if they use terms that are considered limiting or distinctive within a class in which an ontology is well developed.
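For illustration, a minimal sketch of the breadth score calculation at 1018 is given below. The function names `normalization_weight` and `breadth_score` are hypothetical. The sketch follows equation (2) as written, applying K and L to the squared ratios; whether the normalization values are instead applied to the ratios before squaring is not resolved by the description, so this choice is an assumption.

```python
import math


def normalization_weight(ratios, target=100.0):
    """Return a normalization value such as 100/h-wcr or 100/h-csr.

    Assumption: `ratios` holds positive ratios for the whole corpus.
    """
    return target / max(ratios)


def breadth_score(wcr, csr, k=1.0, l=1.0):
    """Breadth score per equation (2): sqrt(K * wcr^2 + L * csr^2)."""
    return math.sqrt(k * wcr ** 2 + l * csr ** 2)


# Hypothetical ratios for a three-portion corpus.
word_count_ratios = [1.0, 2.0, 4.0]
commonness_ratios = [1.0, 1.5, 2.0]
K = normalization_weight(word_count_ratios)   # 100 / h-wcr
L = normalization_weight(commonness_ratios)   # 100 / h-csr
scores = [breadth_score(w, c, K, L)
          for w, c in zip(word_count_ratios, commonness_ratios)]
print(scores)
```

With K and L both left at 1, the two ratios contribute equally, which corresponds to the unnormalized case noted above.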
At 1020, overall breadth scores for the documents are calculated. For instance, an overall breadth score may be calculated for each document being analyzed using the breadth scores for the document portions from the respective document. In some examples, calculating the overall breadth score for a document can include taking an average of the breadth score(s) for one or more document portions within the document. In some instances, calculating an overall breadth score for a document can include taking the highest, the lowest, the range, the average, the median, or the like of the breadth score(s) of the one or more document portions and producing a composite score or preserving them individually. Additionally, in some instances, one or more of the breadth scores for one or more of the document portions for a document may be given more weight than one or more other breadth scores for one or more other document portions. For instance, if a document is a patent, breadth score(s) of independent claim(s) (e.g., the broadest independent claim) of the patent may be given more weight when determining the overall breadth score than breadth score(s) of dependent claim(s) within the patent.
In some instances, when documents include patents and/or published applications, one or more rules may be utilized for calculating the overall breadth scores for the patents and/or published applications. For example, if documents include patents, a rule may specify that only breadth scores associated with the broadest independent claim and any dependent claim that depends from the broadest independent claim are utilized to calculate the overall breadth score for the patents using the techniques above (e.g., average, median, etc.). For another example, if documents include patents, a rule may specify that only breadth scores associated with independent claims are utilized to calculate the overall breadth score for the patents using the techniques above (e.g., average, median, etc.).
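One possible way to combine portion-level breadth scores into an overall breadth score at 1020 is sketched below. The function name `overall_breadth_score`, the `method` labels, and the example weights are hypothetical and serve only to illustrate the aggregation options described above.

```python
from statistics import mean, median


def overall_breadth_score(portion_scores, method="average", weights=None):
    """Aggregate the breadth scores of a document's portions.

    Assumption: when `weights` is given, a weighted average is used and
    `method` is ignored.
    """
    if not portion_scores:
        raise ValueError("at least one portion score is required")
    if weights is not None:
        if len(weights) != len(portion_scores):
            raise ValueError("one weight per portion score is required")
        weighted_sum = sum(s * w for s, w in zip(portion_scores, weights))
        return weighted_sum / sum(weights)
    if method == "average":
        return mean(portion_scores)
    if method == "median":
        return median(portion_scores)
    if method == "highest":
        return max(portion_scores)
    if method == "lowest":
        return min(portion_scores)
    if method == "range":
        return max(portion_scores) - min(portion_scores)
    raise ValueError(f"unknown method: {method}")


# Hypothetical patent: one independent claim weighted 3x, one dependent claim.
print(overall_breadth_score([10.0, 20.0], weights=[3.0, 1.0]))
```

A rule such as using only the independent claims could be applied in this sketch by filtering the list of portion scores before calling the function.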
At 1022, comparative breadth scores for the documents are calculated based at least in part on the overall breadth scores. For instance, a comparative breadth score may be calculated for each document being analyzed based on the overall breadth scores of the documents. For example, where the overall breadth score is based on the score of a single document portion (e.g., broadest or narrowest), the calculation 1022 compares that score to the score of the corresponding single document portion of other documents that are within the analysis. Where the overall breadth score is based on the score of multiple document portions (e.g., represented as an average; a weighted or unweighted composite of the broadest, average, and range scores; or as individual component scores such as broadest, average, and range), the calculation 1022 compares that score or scores to the score or scores of the corresponding multiple document portions of other documents within the analysis. In some instances, the comparative breadth score for a document corresponds to the percentage of documents that include an overall breadth score that is equal to or less than the overall breadth score of the document. In some instances, the comparative breadth score for a document corresponds to the percentage of documents that include an overall breadth score that is less than the overall breadth score of the document. In some instances, the comparative breadth score for a document corresponds to the percentage of documents that include an overall breadth score that is equal to or greater than the overall breadth score of the document. Still, in some instances, the comparative breadth score for a document corresponds to the percentage of documents that include an overall breadth score that is greater than the overall breadth score of the document.
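The four percentage-based variants of the comparative breadth score described above might be sketched as follows. The function name `comparative_scores`, the mode labels, and the document identifiers are hypothetical. The sketch assumes that the document under evaluation is itself counted among the analyzed documents, which the description does not specify.

```python
import operator

# Hypothetical labels for the four comparison variants.
_COMPARATORS = {
    "le": operator.le,  # equal to or less than
    "lt": operator.lt,  # less than
    "ge": operator.ge,  # equal to or greater than
    "gt": operator.gt,  # greater than
}


def comparative_scores(overall_scores, mode="le"):
    """Map each document to the percentage of documents whose overall
    score satisfies the chosen comparison against that document's score.

    Assumption: every document, including the one being scored, is counted.
    """
    compare = _COMPARATORS[mode]
    total = len(overall_scores)
    return {
        doc_id: 100.0 * sum(compare(other, score)
                            for other in overall_scores.values()) / total
        for doc_id, score in overall_scores.items()
    }


overall = {"A": 10.0, "B": 20.0, "C": 30.0, "D": 40.0}
print(comparative_scores(overall, mode="le"))
```

The same calculation applies unchanged to other overall scores, such as overall portion count scores or overall differentiation scores, because only the ordering of the scores is used.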
Where the overall breadth score is based on the score of multiple document portions and is maintained as individual component scores such as scores associated with the broadest, average, and range of document portions, calculation 1022 may compare each of those scores to the corresponding scores of the multiple document portions of other documents within the analysis. For example, in a context where the documents are patents and the portions are claims, calculation 1022 may compare the breadth score of the broadest claim in a patent to the breadth score of the broadest claims in all patents within the landscape, providing a rank ordering of the patent by broadest claim. Calculation 1022 may further compare the average breadth of the claims in the patent to the average breadth of the claims in each of the patents within the landscape, providing a rank ordering of the patent by average claim breadth. Calculation 1022 may further compare the range of breadth of the claims in the patent to the range of breadth of the claims in each of the patents within the landscape, providing a rank ordering of the patent by range of claim breadth. Then, calculation 1022 may weight the rank order of each component score equally, to determine the final breadth score. Such an approach is based on an assumption that a relatively broad claim is more likely to encompass potentially infringing products, a relatively high average claim breadth reflects that likelihood across a range of independent and dependent claims, and a relatively high range of breadth reflects at least some claims are more likely to encompass limitations that reduce the viability of potential challenges to claim validity.
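As a non-limiting sketch, the equal weighting of rank orders across the broadest, average, and range component scores could be implemented as a mean of per-component ranks. The function name `mean_component_rank`, the component keys, and the sample values are hypothetical; combining ranks by their arithmetic mean, and giving tied documents the same rank, are assumptions made for illustration.

```python
def mean_component_rank(components, names=("broadest", "average", "range")):
    """Rank each document per component (1 = highest value) and average
    the ranks with equal weight. A lower result indicates a better rank.
    """
    result = {}
    for doc_id, own in components.items():
        ranks = [
            # Rank is one plus the number of documents with a higher value.
            1 + sum(other[name] > own[name] for other in components.values())
            for name in names
        ]
        result[doc_id] = sum(ranks) / len(ranks)
    return result


# Hypothetical landscape of three patents.
landscape = {
    "P1": {"broadest": 90.0, "average": 60.0, "range": 50.0},
    "P2": {"broadest": 80.0, "average": 70.0, "range": 20.0},
    "P3": {"broadest": 70.0, "average": 50.0, "range": 30.0},
}
print(mean_component_rank(landscape))
```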
At 1024, a UI is generated that includes one or more of the comparative breadth scores. For instance, a UI may be generated such that a comparative breadth score for one of the documents is displayed in proximity to the unique document identification number associated with that document. For example, the comparative breadth score for a patent may be displayed next to the patent number. In some instances, the UI may be a textual UI or a command-line interface that displays a line of text including at least the comparative breadth score and the unique document identification number. In some instances, the UI may include information on documents either to highlight a particular document (e.g., one having a highest comparative breadth score out of all the documents in the analyzed corpus), due to limitations of screen real estate such as on mobile devices, to minimize a volume of data transmitted across a network, or for other reasons.
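A textual UI of the kind described above might be sketched as follows. The function name `format_score_lines`, the tab-separated layout, and the document identifiers are hypothetical; the `top_n` parameter is one assumed way of limiting output to a subset of the documents.

```python
def format_score_lines(comparative_scores, top_n=None):
    """Return one line of text per document, pairing the unique document
    identification number with its comparative score, highest score first.
    """
    ranked = sorted(comparative_scores.items(),
                    key=lambda item: item[1], reverse=True)
    if top_n is not None:
        ranked = ranked[:top_n]  # show only the highest-scoring documents
    return [f"{doc_id}\t{score:.1f}" for doc_id, score in ranked]


# Placeholder identifiers, not real patent numbers.
for line in format_score_lines({"DOC-0001": 25.0, "DOC-0002": 75.0}):
    print(line)
```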
Due to the processing efficiencies obtained by using automatic computer-based analysis, in some instances, the generating of word counts at 1004, the identifying of referential word counts at 1006, the calculating of word count ratios at 1008, the determining of word frequencies at 1010, the generating of commonness scores at 1012, the identifying of the reference commonness score at 1014, the calculating of commonness score ratios at 1016, the calculating of the breadth scores at 1018, the calculating of the overall breadth scores at 1020, and the calculating of the comparative breadth scores at 1022 are performed at a rate much faster than can be achieved through human analysis. For example, this analysis may proceed at a rate of more than one document per minute, more than one document per 30 seconds, more than one document per 10 seconds, or another rate. This is a rate much faster than can be achieved by manual, human analysis.
In some instances, the documents are pre-processed to generate one or more processed document portions for each of the documents. The pre-processing may use all or part of the method 800 described in
At 1104, portion counts for the documents are generated. For instance, a value corresponding to the number of document portions within each of the documents may be generated. In some instances, the value for a document indicates each of the document portions that are included in the document. Additionally, or alternatively, in some instances, the value for a document indicates one or more of the document portions that are included in the document. For example, if a document includes a patent, and the document portions include independent claims and dependent claims within the patent, the value may indicate the number of independent claims in the patent. For another example, and again if a document includes a patent, and the document portions include independent claims and dependent claims within the patent, the value may indicate the broadest independent claim as well as each of the dependent claims that depend from the broadest independent claim.
At 1106, overall portion count scores are calculated for the documents. For instance, an overall portion count score may be calculated for each document based on the respective portion counts for the respective document. In some instances, the overall portion count score for a document includes the value as generated at 1104. Additionally, or alternatively, in some instances, one or more of the document portions may be given more weight when calculating the overall portion count scores for the documents. For instance, if the documents include patents, more weight may be given to the independent claims than to the dependent claims when calculating the overall portion count scores. For example, if independent claims are given four times as much weight as dependent claims, and a patent includes three independent claims and seventeen dependent claims, the overall portion count score for the patent is twenty-nine (e.g., (3*4)+17=29). An example equation for calculating the overall portion count scores for patents and/or printed publications may look as follows:
$$\text{Overall Score} = IT(w_1) + DT(w_2) \qquad (3)$$
As shown, the overall portion count score for a patent may include a number of independent claims (IT) times a first weight (w1) associated with independent claims plus a number of dependent claims (DT) times a second weight (w2) associated with dependent claims.
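Equation (3) can be illustrated with the following minimal sketch. The function name `overall_portion_count_score` and its default weights are hypothetical; the worked values reproduce the example of three independent claims weighted four times as heavily as seventeen dependent claims.

```python
def overall_portion_count_score(independent_count, dependent_count,
                                w1=1.0, w2=1.0):
    """Equation (3): IT * w1 + DT * w2."""
    return independent_count * w1 + dependent_count * w2


# Three independent claims (weight 4) and seventeen dependent claims (weight 1).
print(overall_portion_count_score(3, 17, w1=4, w2=1))
```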
At 1108, comparative portion count scores are calculated for the documents based at least in part on the overall portion count scores. For instance, a comparative portion count score for a document can be determined by comparing the overall portion count score for the document to the overall portion count scores of the other documents being analyzed. In some instances, the comparative portion count score for a document corresponds to the percentage of documents that include an overall portion count score that is equal to or less than the overall portion count score of the document. In some instances, the comparative portion count score for a document corresponds to the percentage of documents that include an overall portion count score that is less than the overall portion count score of the document. In some instances, the comparative portion count score for a document corresponds to the percentage of documents that include an overall portion count score that is equal to or greater than the overall portion count score of the document. Still, in some instances, the comparative portion count score for a document corresponds to the percentage of documents that include an overall portion count score that is greater than the overall portion count score of the document.
At 1110, a UI is generated that includes one or more of the comparative portion count scores. For instance, a UI may be generated such that a comparative portion count score for one of the documents is displayed in proximity to the unique document identification number associated with that document. For example, the comparative portion count score for a patent may be displayed next to the patent number. In some instances, the UI may be a textual UI or a command-line interface that displays a line of text including at least the comparative portion count score and the unique document identification number. In some instances, the UI may include information on documents either to highlight a particular document (e.g., one having a highest comparative portion count score out of all the documents in the analyzed corpus), due to limitations of screen real estate such as on mobile devices, to minimize a volume of data transmitted across a network, or for other reasons.
Due to the processing efficiencies obtained by using automatic computer-based analysis, in some instances, the generating portion counts at 1104, calculating overall portion count scores at 1106, and the calculating of the comparative portion count scores at 1108 are performed at a rate much faster than can be achieved through human analysis. For example, this analysis may proceed at a rate of more than one document per minute, more than one document per 30 seconds, more than one document per 10 seconds, or another rate. This is a rate much faster than can be achieved by manual, human analysis.
In some instances, the documents are pre-processed to generate one or more processed document portions for each of the documents. The pre-processing may use all or part of the method 800 described in
At 1204, word counts are generated for document portions of a document. For instance, a word count for each document portion of a document may be generated by counting a number of separate words in the respective document portions. In some instances, this may be performed after pre-processing so that stop words and duplicate words are omitted from the count. A word count performed after removal of duplicate words is referred to as a word count of unique words. In some instances, the word count generated for each document portion (e.g., patent claim) is an integer (e.g., one, two, three, etc.).
At 1206, one or more words are identified in the document portions of the document. For instance, each of the words that are counted in step 1204 may be identified for each document portion of the document. For example, if a document portion recites “audio signal representing sound,” each of “audio”, “signal”, “representing”, and “sound” may be identified for the document portion. In some instances, this may be performed after pre-processing so that stop words and duplicate words are omitted from the identification. An identification performed after removal of duplicate words is referred to as an identification of unique words.
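The identification of unique words at 1206, and the corresponding word count at 1204, might be sketched as follows. The function name `unique_words` and the small stop-word list are hypothetical; an actual stop-word list and tokenization scheme would come from the pre-processing, which is not reproduced here.

```python
# Hypothetical stop-word list used only for this illustration.
STOP_WORDS = frozenset({"a", "an", "the", "of", "and", "or", "to", "in"})


def unique_words(text, stop_words=STOP_WORDS):
    """Return the unique non-stop words of a document portion, in order
    of first appearance. Assumption: words are separated by whitespace.
    """
    seen = []
    for token in text.lower().split():
        word = token.strip(".,;:()\"'")  # drop surrounding punctuation
        if word and word not in stop_words and word not in seen:
            seen.append(word)
    return seen


words = unique_words("An audio signal representing sound")
print(words, len(words))  # the length is the word count of unique words
```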
At 1208, differences between one or more words in a document portion and one or more words in at least one other document portion are identified. For instance, the words identified for a document portion may be compared to the words identified for at least one other document portion. In some instances, the comparing includes determining a number of words from the document portion that are included in the at least one other document portion and/or determining the number of words from the document portion that are not included in the at least one other document portion. For example, and using the example above where the document portion recites “audio signal representing sound,” the comparing may include determining that the two words “audio” and “signal” are included in the at least one other document portion, but the two words “representing” and “sound” are not included in the at least one other document portion.
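For illustration, the comparison at 1208 may be sketched as a partition of a portion's words into those that appear in the other portion and those that do not. The function name `word_differences` and the word lists are hypothetical.

```python
def word_differences(portion_words, other_words):
    """Split the words of a document portion into words that are included
    in the other portion and words that are not included in it.
    """
    other = set(other_words)
    shared = [word for word in portion_words if word in other]
    not_shared = [word for word in portion_words if word not in other]
    return shared, not_shared


shared, not_shared = word_differences(
    ["audio", "signal", "representing", "sound"],
    ["audio", "signal", "processor"],  # hypothetical other portion
)
print(shared, not_shared)
```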
In some instances, when the document includes a patent and/or published application, comparing differences between one or more words in a claim to one or more words in at least one other claim may include comparing differences between one or more words in a dependent claim to one or more words in an independent claim. For example, a dependent claim may be compared to the independent claim from which it depends. For another example, a dependent claim may be compared to both an independent claim and any intervening dependent claim(s) from which the dependent claim depends. Still, for a third example, a dependent claim may be compared to the broadest independent claim within the patent and/or published application. Additionally, or alternatively, in some instances, comparing differences between one or more words in a claim to one or more words in at least one other claim may include comparing differences between one or more words in an independent claim to one or more words in at least one other independent claim. For example, a narrower independent claim (e.g., an independent claim with a breadth score that is less than the breadth score of the broadest independent claim) may be compared to the broadest independent claim in the patent and/or published application.
At 1210, a differential score is calculated for the document portion. For instance, a differential score may be calculated for the document portion using the word count for the document portion and the identified word differences for the document portion. In some instances, the differential score may correspond to a uniqueness in which words in the document portion differ from words in the at least one other portion. For example, and using the example above where the comparing determined that the two words “audio” and “signal” are included in the at least one other document portion, but the two words “representing” and “sound” are not included in the at least one other document portion, the differential score for the document portion may include 2/4 words or 50%. An example equation that may be used to determine the differential score for a document portion may look as follows:
$$\text{Differential Score} = \frac{WU}{wc} \qquad (4)$$
As shown, the differential score for a document portion may include a number of uncommon words (WU) included in the document portion divided by the word count (wc) for the document portion.
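Equation (4) may be illustrated with the sketch below, which reproduces the 2/4 example above. The function name `differential_score` is hypothetical, and returning zero for an empty portion is an assumption made to avoid division by zero.

```python
def differential_score(portion_words, other_words):
    """Equation (4): uncommon words (WU) divided by word count (wc).

    Assumption: duplicate words in the portion are counted once.
    """
    unique = list(dict.fromkeys(portion_words))  # unique words, order kept
    if not unique:
        return 0.0
    other = set(other_words)
    uncommon = sum(1 for word in unique if word not in other)  # WU
    return uncommon / len(unique)                              # WU / wc


score = differential_score(
    ["audio", "signal", "representing", "sound"],
    ["audio", "signal", "processor"],  # hypothetical other portion
)
print(score)
```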
At 1212, it is determined whether there are any additional document portions in the document that are to be analyzed. If it is determined that there is an additional document portion to analyze (i.e., Yes), the method 1200 repeats back at step 1208 for the additional document portion. In some instances, a respective differentiation score is calculated for each document portion in a document. In some instances, a respective differential score is calculated for each of one or more selected document portions in a document. For example, if a document includes a patent and/or published application, differentiation scores may be calculated for the broadest independent claim and each of the dependent claims that depend from the broadest independent claim. For another example, and again if the document includes a patent and/or published application, a respective differentiation score may be calculated for each of the independent claims.
If it is determined that there is not an additional document portion to analyze (i.e., No) at 1212, the method 1200 proceeds to 1214. At 1214, an overall differential score is calculated for the document. For instance, an overall differential score may be calculated for a document using one or more of the differential scores for one or more of the document portions. In some instances, calculating the overall differentiation score for a document includes calculating an average of the one or more differentiation scores. For example, the overall differentiation score may include the average of the respective differentiation scores of each document portion within the document. In some instances, calculating an overall differentiation score for a document includes taking the highest, the lowest, the median, or the like of the one or more differentiation scores.
In some instances, when a document includes a patent and/or published application, other techniques may be used to calculate the overall differentiation score for the patent and/or published application. For example, if a document includes a patent, the overall differentiation score for the patent may include an average of the respective differentiation score(s) of each of the dependent claims that includes a dependency from the broadest independent claim within the patent. For a second example, and again if a document is a patent, the overall differentiation score for the patent may include an average of the respective differentiation score(s) of each independent claim that does not include the broadest independent claim.
For a third example, and again if the document is a patent, the overall differentiation score may include a combined differentiation score for each of the dependent claims that depends from a given independent claim. For instance, the overall differentiation score may be calculated based on a total number of words within dependent claims that depend from a broadest independent claim, and a uniqueness of the words within the dependent claims as compared to the broadest independent claim, using the processes described above.
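The first patent-specific example above, averaging the differentiation scores of the dependent claims that depend from the broadest independent claim, might be sketched as follows. The function name `overall_differentiation_score`, the claim data layout, and the sample scores are hypothetical. Treating claims that depend indirectly (through an intervening dependent claim) as depending from the broadest independent claim is an assumption.

```python
from statistics import mean


def overall_differentiation_score(claims, broadest_claim):
    """Average the differentiation scores of dependent claims whose chain
    of dependency ends at the broadest independent claim.

    `claims` maps a claim number to {"parent": number or None, "score": float}.
    """
    def root(claim_number):
        # Follow the dependency chain up to the independent claim.
        while claims[claim_number]["parent"] is not None:
            claim_number = claims[claim_number]["parent"]
        return claim_number

    scores = [
        claim["score"]
        for number, claim in claims.items()
        if claim["parent"] is not None and root(number) == broadest_claim
    ]
    return mean(scores) if scores else 0.0


# Hypothetical claim set: claims 1 and 4 are independent.
claims = {
    1: {"parent": None, "score": 0.0},
    2: {"parent": 1, "score": 0.25},
    3: {"parent": 2, "score": 0.75},
    4: {"parent": None, "score": 0.0},
    5: {"parent": 4, "score": 0.9},
}
print(overall_differentiation_score(claims, broadest_claim=1))
```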
At 1216, it is determined whether there are any additional documents that need to be analyzed. If it is determined that there is an additional document to analyze (i.e., Yes), the method 1200 repeats back at step 1204 for the additional document. For instance, word counts are generated for the document portions of the additional document at 1204, one or more words are identified for the document portions at 1206, differences between the one or more words in a document portion and one or more words in at least one other document portion are identified at 1208, respective differentiation scores are calculated for the document portions at 1210, and an overall differentiation score is calculated for the additional document at 1214.
If it is determined that there is not an additional document to analyze (i.e., No) at 1216, the method 1200 proceeds to 1218. At 1218, comparative differentiation scores are calculated for the documents based at least in part on the overall differentiation scores. For instance, a comparative differentiation score for a document can be determined by comparing the overall differentiation score for the document to the overall differentiation scores of the other documents being analyzed. In some instances, the comparative differentiation score for a document corresponds to the percentage of documents that include an overall differentiation score that is equal to or less than the overall differentiation score of the document. In some instances, the comparative differentiation score for a document corresponds to the percentage of documents that include an overall differentiation score that is less than the overall differentiation score of the document. In some instances, the comparative differentiation score for a document corresponds to the percentage of documents that include an overall differentiation score that is equal to or greater than the overall differentiation score of the document. Still, in some instances, the comparative differentiation score for a document corresponds to the percentage of documents that include an overall differentiation score that is greater than the overall differentiation score of the document.
At 1220, a UI is generated that includes one or more of the comparative differentiation scores. For instance, a UI may be generated such that a comparative differentiation score for one of the documents is displayed in proximity to the unique document identification number associated with that document. For example, the comparative differentiation score for a patent may be displayed next to the patent number. In some instances, the UI may be a textual UI or a command-line interface that displays a line of text including at least the comparative differentiation score and the unique document identification number. In some instances, the UI may include information on documents either to highlight a particular document (e.g., one having a highest comparative differentiation score out of all the documents in the analyzed corpus), due to limitations of screen real estate such as on mobile devices, to minimize a volume of data transmitted across a network, or for other reasons.
Although the above steps 1204-1216 describe determining differentiation between one or more portions and final differentiation scores based on word analysis within the document itself, in some instances, differentiation between one or more portions and final differentiation scores may be determined based on the differentiation “footprint” of the one or more portions relative to an entirety of the subject matter of the corpus of documents. For instance, a corpus of words based on words within the corpus of documents can be generated. Using the corpus of words, a portion differentiation score may be assigned to one or more document portions by comparing words within the one or more document portions. In some instances, the number of unique words may be determined in the portion determined to have the broadest overall breadth score. For each additional document portion, the number of unique words that are not included in the portion having the broadest overall breadth score may be determined. In another example, the number of unique words that are included in that particular portion and not included in any other portion may be determined. In some instances, the number of unique words associated with each portion is then expressed as a percentage of the unique words within the corpus of words in the relevant documents. For example, if the corpus of words in the relevant documents includes 10,000 unique words, and a given document portion (e.g., independent claim) includes 20 unique words that are within the corpus of 10,000 unique words, then the percentage for the given document portion is 0.2% (i.e., a fraction of 0.002). If a second document portion (e.g., independent claim) also includes 20 unique words that are both within the corpus of 10,000 unique words and exclusive of the words in the first (or any other previously processed) document portion, then the percentage for the second document portion is also 0.2% (i.e., a fraction of 0.002).
The overall differentiation calculation can then be determined by summing the reciprocal of each percentage, expressed as a fraction, for a differentiation calculation of 1000 (1/0.002+1/0.002), giving more weight to portions with a relatively small percentage of the unique words of the corpus. In other instances, the reciprocal of one minus the percentage could be summed for each portion (i.e., 1/(1−0.002)+1/(1−0.002)≈2.004), giving more weight to portions with a relatively large percentage of the unique words of the corpus. In other instances, the reciprocal of the percentage for the broadest portion could be used and the reciprocal of one minus the percentage could be used for all other portions. In still other instances, the summation could be made after further weighting the contribution of individual portions (e.g., in the context of patent documents, weighting the contribution of independent claims more heavily than the contribution of dependent claims). In this manner, a document with many document portions having unique words that are not common to other portions within the document will have a relatively high overall differentiation score.
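The corpus-based “footprint” calculation above may be illustrated with the following sketch, which reproduces the example of two portions each containing 20 of 10,000 unique corpus words. The function names `footprint_fraction` and `overall_footprint_differentiation` are hypothetical, and the sketch assumes each fraction is strictly between zero and one.

```python
def footprint_fraction(portion_unique_words, corpus_unique_count):
    """Share of the corpus's unique words found in a document portion,
    expressed as a fraction (20 of 10,000 gives 0.002).
    """
    return len(set(portion_unique_words)) / corpus_unique_count


def overall_footprint_differentiation(fractions, complement=False):
    """Sum 1/f over the portions, or 1/(1 - f) when `complement` is True.

    Assumption: every fraction f satisfies 0 < f < 1.
    """
    if complement:
        return sum(1.0 / (1.0 - f) for f in fractions)
    return sum(1.0 / f for f in fractions)


# Two hypothetical portions, each with 20 unique words of a 10,000-word corpus.
first = footprint_fraction([f"a{i}" for i in range(20)], 10_000)
second = footprint_fraction([f"b{i}" for i in range(20)], 10_000)
print(overall_footprint_differentiation([first, second]))
print(overall_footprint_differentiation([first, second], complement=True))
```

The first variant favors portions with a small footprint, and the complement variant favors portions with a large footprint, as described above.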
After determining the overall differential scores, steps 1218 and 1220 can then be performed. For instance, at 1218, comparative differentiation scores are calculated for the documents based at least in part on the overall differentiation scores. For instance, a comparative differentiation score for a document can be determined by comparing the overall differentiation score for the document to the overall differentiation scores of the other documents being analyzed. At 1220, a UI is generated that includes one or more of the comparative differentiation scores. For instance, a UI may be generated such that a comparative differentiation score for one of the documents is displayed in proximity to the unique document identification number associated with that document.
Due to the processing efficiencies obtained by using automatic computer-based analysis, in some instances, the generating of the word counts at 1204, the identifying of the one or more words at 1206, the identifying of the differences at 1208, the calculating of the differentiation scores at 1210, the calculating of the overall differentiation score at 1214, and the calculating of the comparative differentiation scores at 1218 are performed at a rate much faster than can be achieved through human analysis. For example, this analysis may proceed at a rate of more than one document per minute, more than one document per 30 seconds, more than one document per 10 seconds, or another rate. This is a rate much faster than can be achieved by manual, human analysis.
In some instances, the documents are pre-processed to generate one or more processed document portions for each of the documents. The pre-processing may use all or part of the method 800 described in
At 1304, comparative breadth scores, comparative portion count scores, and comparative differentiation scores for the documents are generated. For instance, in some examples, the documents may be analyzed using method 1000 in order to generate the comparative breadth scores for the documents, the documents may be analyzed using method 1100 in order to generate the comparative portion count scores for the documents, and the documents may be analyzed using method 1200 in order to generate the comparative differentiation scores for the documents. Additionally, or alternatively, in some examples, the comparative breadth scores, the comparative portion count scores, and the comparative differentiation scores may be received from one or more external sources. For instance, the comparative breadth scores, the comparative portion count scores, and the comparative differentiation scores may be received from one or more computing devices.
At 1306, comparative coverage scores are calculated for the documents. For instance, comparative coverage scores may be calculated for each document using the comparative breadth score, the comparative portion count score, and the comparative differentiation score for a respective document. In some instances, calculating the comparative coverage score for a document can include calculating the average of the comparative breadth score, the comparative portion count score, and the comparative differentiation score for the document. In some instances, calculating the comparative coverage score for a document can include taking the highest, the lowest, the median, or the like of the comparative breadth score, the comparative portion count score, and the comparative differentiation score for the document.
Still, in some instances, one or more of the comparative breadth scores, comparative portion count scores, and comparative differentiation scores may be given more weight when calculating the comparative coverage scores for the documents. For instance, the comparative coverage scores for the documents may be calculated using the following formula:
$$\text{Comparative Coverage Score} = \frac{W_1(BF) + W_2(PF) + W_3(DF)}{3}$$
In the above equation, the comparative coverage score for a document includes a first weight (W1) times the comparative breadth score (BF) of the document, plus a second weight (W2) times the comparative portion count score (PF) of the document, plus a third weight (W3) times the comparative differentiation score (DF) of the document, divided by three. In some instances, one or more of the first weight (W1), the second weight (W2), or the third weight (W3) may include a similar value. Additionally, or alternatively, in some instances, each of the first weight (W1), the second weight (W2), or the third weight (W3) may include a unique value.
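The weighted combination described above may be sketched as follows. The function name `comparative_coverage_score`, the default weights, and the sample scores are hypothetical; the division by three follows the description of the equation.

```python
def comparative_coverage_score(bf, pf, df, w1=1.0, w2=1.0, w3=1.0):
    """(W1 * BF + W2 * PF + W3 * DF) / 3, where BF, PF, and DF are the
    comparative breadth, portion count, and differentiation scores.
    """
    return (w1 * bf + w2 * pf + w3 * df) / 3.0


# Hypothetical comparative scores for one document, with equal weights.
print(comparative_coverage_score(90.0, 60.0, 30.0))
```

With all three weights equal to one, the result reduces to the plain average of the three comparative scores.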
At 1308, a UI is generated that includes one or more of the comparative coverage scores. For instance, a UI may be generated such that a comparative coverage score for one of the documents is displayed in proximity to the unique document identification number associated with that document. For example, the comparative coverage score for a patent may be displayed next to the patent number. In some instances, the UI may be a textual UI or a command-line interface that displays a line of text including at least the comparative coverage score and the unique document identification number. In some instances, the UI may include information on documents either to highlight a particular document (e.g., one having a highest comparative coverage score out of all the documents in the analyzed corpus), due to limitations of screen real estate such as on mobile devices, to minimize a volume of data transmitted across a network, or for other reasons.
In some instances, the documents are pre-processed to generate one or more processed document portions for each of the documents. The pre-processing may use all or part of the method 800 described in
At 1404, a document is analyzed to identify at least one asset that is related to the document. For instance, a semantic search can be performed using content (e.g., one or more words) within the document to identify a set number of assets (e.g., one, five, ten, one hundred, or any other number) that are related to the document. The set of assets can include other documents, such as references, patents, publications, articles, or the like. The semantic search can be performed using an open system, such as a Web-based search, or within a closed system that stores assets for the analysis. In some instances, based on the document including a patent, the semantic search is performed based on one or more claims of the patent. For example, the semantic search can be performed using content from the broadest independent claim, each of the independent claims, the broadest independent claim as well as claims that depend from the broadest independent claim, every claim, or any other combination of the claims. In addition to, or alternatively from, using content from the claims, the semantic search may be performed using content from one or more other portions of the patent, such as the abstract, the specification, the description of the figures, the background, or any combination thereof.
At 1406, assets that do not predate the document are removed from the set of assets. For instance, each asset from the set of assets is analyzed to identify a date corresponding to when the respective asset was drafted, published, filed, and/or the like. If an asset includes a patent or printed publication, the patent or printed publication may be analyzed to identify a priority date of the patent or printed publication. Using the identified dates, any asset from the set of assets that includes a respective date that does not predate a date of the document being analyzed is then identified and removed from the set of assets. For example, if the document being analyzed includes a patent, the set of assets is analyzed to identify each asset that includes a priority date that would cause the respective asset to not qualify as prior art to the patent (e.g., a priority date that does not predate the priority date of the patent). The identified assets are then removed from the set of assets identified for the patent.
At 1408, assets that were cited during prosecution are removed from the set of assets. For instance, if the document being analyzed includes a patent, assets (e.g., cited references) that were cited during prosecution of the patent may be removed from the set of assets. These assets are removed since the patent has already been found to be allowable over such assets (e.g., valid over such assets). Therefore, the assets may not add to the risk of invalidity of the patent.
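By way of illustration and not limitation, the removal of assets at 1406 and 1408 may be sketched as follows. The `Asset` type, its field names, and the placeholder identifiers are assumptions made for this sketch; the description does not prescribe any particular data structure.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Asset:
    identifier: str       # hypothetical identifier for the asset
    priority_date: date   # date the asset was drafted, published, filed, or the like


def filter_assets(assets, document_date, cited_identifiers=frozenset()):
    """Return the assets that remain after steps 1406 and (optionally) 1408."""
    remaining = []
    for asset in assets:
        # Step 1406: drop assets that do not predate the document.
        if asset.priority_date >= document_date:
            continue
        # Step 1408 (optional): drop assets already cited during prosecution.
        if asset.identifier in cited_identifiers:
            continue
        remaining.append(asset)
    return remaining


assets = [
    Asset("ASSET-A", date(2010, 1, 1)),
    Asset("ASSET-B", date(2016, 3, 1)),   # does not predate the document
    Asset("ASSET-C", date(2012, 5, 5)),   # cited during prosecution
]
remaining = filter_assets(assets, date(2015, 6, 1), cited_identifiers={"ASSET-C"})
print([asset.identifier for asset in remaining])  # ['ASSET-A']
```

Passing an empty set of cited identifiers skips the optional removal at 1408 and leaves only the date-based removal at 1406.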
At 1410, a risk score for the document is calculated based at least in part on a number of remaining assets. For instance, based on a number of assets that remain after step 1406 and optionally step 1408, a risk score is calculated for the document. If the document includes a patent, the risk score can indicate a likelihood that the patent will not be invalidated if the patent is challenged, such as by a reexamination of the patent. In some instances, the risk score is associated with the percentage of assets that remain after removal of assets after step 1406 and optionally step 1408. For instance, the risk score may be calculated using the following equation:
Risk Score=RA/NA (6)
In the above equation, the risk score is calculated by dividing the number of remaining assets (RA), after step 1406 and optionally step 1408, by the total number of assets (NA) that were identified during the analysis. As shown, the more assets that are removed during step 1406 and optionally step 1408, the higher the risk score is for the document.
In some instances, at 1410, the risk score for the document may be calculated using one or more other techniques. For instance, the risk score may be calculated purely on the number of assets that remain after step 1406 and optionally step 1408. For example, the document may be given an initial risk score of 100. Using the initial score, the risk score may be reduced based on the number of assets that remain after step 1406 and optionally step 1408. For instance, the risk score may be reduced by 1 (and/or optionally 0.01, 0.1, 5, 10, 15, or any other number) for each asset that remains after step 1406 and optionally step 1408. Using such a technique, the risk score may be calculated using the following equation:
Risk Score=100−(RF(RA)) (7)
In the above equation, the risk score is calculated based on a risk factor (RF) (e.g., 0.01, 0.1, 1, 5, 10, 15, or any other number) multiplied by the number of remaining assets (RA) after step 1406 and optionally step 1408. In some instances, equation 7 has a lower limit of zero. For instance, the risk score for a patent cannot include a negative number.
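By way of illustration and not limitation, equation 7 and its lower limit of zero may be sketched as follows. The function and parameter names are assumptions made for this sketch.

```python
def risk_score_from_remaining(remaining_assets, risk_factor=1.0, initial_score=100.0):
    """Equation 7: initial score minus RF times RA, never below zero."""
    return max(0.0, initial_score - risk_factor * remaining_assets)


# Twelve remaining assets with a risk factor of 5 per asset.
print(risk_score_from_remaining(12, risk_factor=5))  # 40.0

# Thirty remaining assets would give a negative value, so the floor applies.
print(risk_score_from_remaining(30, risk_factor=5))  # 0.0
```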
At 1412, it is determined whether there are any additional documents that need to be analyzed. If it is determined that there is an additional document to analyze (i.e., Yes), the method 1400 repeats back at step 1404 for the additional document. For instance, a set of assets is identified for the additional document at 1404, assets that do not predate the additional document are removed from the set of assets at 1406, assets that were cited during prosecution of the additional document are removed at 1408, and a risk score is calculated for the additional document at 1410.
If it is determined that there is not an additional document to analyze (i.e., No) at 1412, the method 1400 proceeds to 1414. At 1414, a UI is generated that includes one or more of the risk scores. For instance, a UI may be generated such that a risk score for one of the documents is displayed in proximity to the unique document identification number associated with that document. For example, the risk score for a patent may be displayed next to the patent number. In some instances, the UI may be a textual UI or a command-line interface that displays a line of text including at least the risk score and the unique document identification number. In some instances, the UI may include information on documents either to highlight a particular document (e.g., one having a highest risk score out of all the documents in the analyzed corpus), due to limitations of screen real estate such as on mobile devices, to minimize a volume of data transmitted across a network, or for other reasons.
In some instances, the patents are pre-processed to generate one or more processed portions for each of the patents. The pre-processing may use all or part of the method 800 described in
At 1504, a patent is analyzed in order to identify information associated with the patent. For instance, at 1506, claim breadth is determined for one or more claims of the patent. In some instances, the claim breadth is determined using all or part of the method 1000 of
Additionally, or alternatively, at 1508, references cited during prosecution of the patent are identified. For instance, a search can be performed to identify each reference that was cited during prosecution of the patent. The search can include searching one or more databases, such as one or more databases associated with PAIR, the EPO, the WIPO, or the like, that include information about references cited during prosecution. In some instances, the search is performed using the patent number of the patent. In some instances, the search is performed using a different identification number that is associated with the patent, such as the patent application number or publication number of the patent.
Additionally, or alternatively, at 1510, other information associated with the prosecution history of the patent is retrieved. For instance, a search may be performed to identify information associated with the prosecution history of the patent. The search can include searching one or more databases, such as one or more databases associated with PAIR, the EPO, the WIPO, or the like, to retrieve the information. The information can include a date the patent was filed, a date the patent was issued, a number of office actions that were issued during prosecution of the patent, amendments made to the claims during prosecution of the patent, whether a Notice of Appeal was filed during prosecution of the patent, or the like.
In some instances, although not shown in
At 1512, a risk score for the patent is calculated based at least in part on the information. For instance, a risk score may be calculated for the patent based on the overall claim breadth score of the patent. In some instances, the risk score may include a “flip” of the overall claim breadth score. This is because a patent with broad claims may be at a greater risk of being invalidated than a patent with narrow claims and/or a patent with both broad and narrow claims. For example, if the patent includes an overall claim breadth score of 80 out of 100, meaning that overall the claims within the patent have relatively broad claim scope, the risk score for the patent may include 20 out of 100. If the overall claim breadth score for the patent is based on a range of the claim breadth scores, then the risk score for the patent may include a “flip” of the lowest breadth score within the range. For example, if the patent includes an overall claim breadth score that ranges between 30-90 out of 100, then the risk score for the patent may be 70 out of 100. Using either of the techniques, the risk score may be calculated using the following formula:
Risk Score=RM(1−(BS/BM)) (8)
In the above equation, the “flip” corresponds to the part of the equation within the parentheses, where a set breadth score (BS) for a claim in the patent is divided by the maximum breadth score (BM) that a patent claim can receive. The value within the parentheses is then multiplied by the maximum risk score (RM) that a patent can receive in order to calculate the risk score for the patent. In some instances, the set claim breadth score (BS) can include the overall breadth score for the patent. In some instances, the set claim breadth score (BS) can include the lowest breadth score within the range of breadth scores. Still, in some instances, the set claim breadth score (BS) can include a different claim breadth score, such as the breadth score of the narrowest claim in the patent (e.g., the lowest claim breadth score). In some instances, the maximum risk score (RM) may include one hundred, although in other instances the maximum risk score (RM) can include any value.
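By way of illustration and not limitation, the “flip” of a claim breadth score may be sketched as follows, consistent with the 80-to-20 and 30-to-70 examples given above. The function and parameter names are assumptions made for this sketch, and the expression is written as RM(BM−BS)/BM, which is algebraically the same as RM(1−(BS/BM)) but avoids a floating-point rounding artifact.

```python
def flipped_risk_score(set_breadth, max_breadth=100.0, max_risk=100.0):
    """Flip a breadth score BS into a risk score on a scale of 0 to RM."""
    return max_risk * (max_breadth - set_breadth) / max_breadth


# An overall claim breadth score of 80 out of 100 flips to 20 out of 100.
print(flipped_risk_score(80))  # 20.0

# For a range of breadth scores, the lowest score in the range is flipped.
breadth_range = (30, 90)
print(flipped_risk_score(min(breadth_range)))  # 70.0
```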
In addition to, or alternatively from, using claim breadth to calculate a risk score, a risk score may be calculated and/or the risk score above may be adjusted based on the number of references that were cited during prosecution of the patent. For example, a risk score for the patent may be calculated by multiplying the number of references cited by a reference weight factor, such as 0.01, 0.1, 1, 2, 3, 5, 10, or any other number. Using such a technique, the risk score may be calculated using the following equation:
Risk Score=NCR(WRF) (9)
In the above equation, the risk score is calculated based on a reference weight factor (WRF) (e.g., 0.01, 0.1, 1, 2, 3, 5, 10, or any other number) multiplied by the number of references cited during prosecution (NCR). In some instances, equation 9 has an upper limit. For instance, the risk score for the patent may not exceed 100.
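By way of illustration and not limitation, equation 9 and its upper limit may be sketched as follows. The function and parameter names, and the default limit of 100, are assumptions made for this sketch.

```python
def risk_score_from_citations(num_cited_references, reference_weight=1.0,
                              upper_limit=100.0):
    """Equation 9: NCR times WRF, capped at an upper limit."""
    return min(upper_limit, num_cited_references * reference_weight)


# Twelve cited references with a reference weight factor of 5.
print(risk_score_from_citations(12, reference_weight=5))  # 60

# Forty cited references would exceed the limit, so the cap applies.
print(risk_score_from_citations(40, reference_weight=5))  # 100.0
```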
In addition to, or alternatively from, using claim breadth and/or number of cited documents to calculate a risk score, a risk score may be calculated and/or the risk scores above may be adjusted based on the prosecution history of the patent. For example, the patent may be given an initial risk score of 100. The initial risk score may then be reduced based on the prosecution history of the patent. For instance, the initial risk score may be reduced based on the length of time that the patent was being prosecuted (e.g., number of days, months, years, etc.), a number of office actions that were issued during prosecution, the number of times that the claims (e.g., the independent claims, dependent claims, and/or both) were amended during prosecution, the number of times that a Notice of Appeal was filed during prosecution, and/or the like. Using such a technique, the risk score may be calculated using the following equation:
Risk Score=100−(NF1(WF1)+NF2(WF2)+ ... +NFN(WFN)) (10)
In the above equation, the risk score is calculated based on a value of a first factor (NF1) multiplied by a weight associated with the first factor (WF1), plus a value of a second factor (NF2) multiplied by a weight associated with the second factor (WF2), and so on. A value of a factor may include the number of days, months, years, or the like that the patent was in prosecution, the number of office actions that were issued during prosecution, the number of times that the claims (e.g., the independent claims, dependent claims, and/or both) were amended during prosecution, the number of times that a Notice of Appeal was filed during prosecution, and/or the like. The weight for each factor may include the same number (e.g., 0.01, 0.1, 1, 2, 3, 5, 10, or any other number), or the weight for one or more of the factors may be unique to that factor.
For example, suppose that the prosecution history for the patent indicates that the patent was in prosecution for twenty months, the office issued four office actions, and the claims were amended four times. Further suppose that the weight for the length factor is 1 per month, the weight for the office action factor is 5 per office action, and the weight for the amendments factor is 5 per amendment. Then, using equation 10 above, the risk score for the patent would be 40 (e.g., 100−(20(1)+4(5)+4(5))).
At 1514, it is determined whether there are any additional patents that need to be analyzed. If it is determined that there is an additional patent to analyze (i.e., Yes), the method 1500 repeats back at step 1504 for the additional patent. For instance, the additional patent is analyzed to identify information associated with the additional patent at 1504, and then a risk score is calculated for the additional patent based at least in part on the information at 1512.
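By way of illustration and not limitation, equation 10 and the worked example above may be sketched as follows. The function name and the representation of the factors as (value, weight) pairs are assumptions made for this sketch.

```python
def risk_score_from_prosecution(factors, initial_score=100.0):
    """Equation 10: initial score minus the sum of each factor value times its weight.

    factors is an iterable of (value, weight) pairs, one pair per factor.
    """
    return initial_score - sum(value * weight for value, weight in factors)


factors = [
    (20, 1),  # twenty months in prosecution, weight of 1 per month
    (4, 5),   # four office actions, weight of 5 per office action
    (4, 5),   # four claim amendments, weight of 5 per amendment
]
print(risk_score_from_prosecution(factors))  # 40.0
```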
If it is determined that there is not an additional patent to analyze (i.e., No) at 1514, the method 1500 proceeds to 1516. At 1516, a UI is generated that includes one or more of the risk scores. For instance, a UI may be generated such that a risk score for one of the patents is displayed in proximity to the patent number associated with that patent. In some instances, the UI may be a textual UI or a command-line interface that displays a line of text including at least the risk score and the patent number. In some instances, the UI may include information on patents either to highlight a particular patent (e.g., one having a highest risk score out of all the patents in the analyzed corpus), due to limitations of screen real estate such as on mobile devices, to minimize a volume of data transmitted across a network, or for other reasons.
In some instances, the documents are pre-processed to generate one or more processed document portions for each of the documents. The pre-processing may use all or part of the method 800 described in
At 1604, a document is analyzed to identify a market classification corresponding to the document. For instance, if a document includes a patent, the patent may be analyzed to identify an initial classification corresponding to the patent. Analyzing the patent can include searching one or more databases, such as one or more databases associated with the USPTO, PAIR, the EPO, the WIPO, or the like, to identify the initial classification corresponding to the patent. In some instances, the initial classification can include a classification assigned to the patent (e.g., the printed publication corresponding to the patent) that is based on the CPC. In some instances, the classification can include a classification assigned to the patent that is based on the USPC, a classification assigned to the patent from the EPO, or any other type of classification that can be assigned to a patent.
A semantic search can then be performed using the initial classification in order to determine the market classification corresponding to the patent. As discussed above, in some instances, the market classification can include a NAICS classification. In other instances, the market classification can correspond to a different classification system, such as the SIC system. In either instance, and using the NAICS as an example, a semantic search can be performed using the descriptions for one or more of the NAICS classifications to identify at least one NAICS classification that is related to the initial classification. For another example, and again using the NAICS as an example, a lookup table may be created that associates each initial classification that can be assigned to a patent to at least one of the NAICS classifications. A search can then be performed using the lookup table to identify a NAICS classification associated with the initial classification assigned to the patent.
In some instances, in addition to, or alternatively from, using an initial classification assigned to the patent, a semantic analysis can be performed on the patent to identify at least one market classification. For example, a semantic search can be performed using the broadest independent claim, each of the independent claims, the broadest independent claim as well as claims that depend from the broadest independent claim, every claim, or any other combination of the claims to identify a NAICS classification that is related to the patent. For another example, a semantic search can be performed using one or more additional or alternative portions of the patent, such as the abstract, the specification, the description of the figures, the background, or any combination thereof, to identify a NAICS classification that is related to the patent.
At 1606, a value associated with the market classification is determined. For instance, each market classification identified for a document may be associated with a respective value. As discussed above, in some instances, the values are calculated based on the GDP of the country in which the documents are being analyzed. For instance, if patents are being analyzed in the United States, each market classification may be associated with the GDP of the United States. In some instances, determining the value associated with the GDP for the market classification may include searching one or more databases of the Bureau of Economic Analysis (BEA), which includes data indicating the GDP for various NAICS classifications. In some instances, determining the value associated with the GDP for the market classification may include searching one or more other databases that include data indicating GDP for various NAICS classifications (and/or any other types of market classification assigned to the documents).
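By way of illustration and not limitation, the lookup table approach may be sketched as follows. The table keys and values are placeholders rather than real classification codes, and the function name is an assumption made for this sketch.

```python
# Hypothetical lookup table: placeholder labels only, not real CPC or NAICS codes.
INITIAL_TO_MARKET = {
    "INITIAL-CLASS-A": ["MARKET-CLASS-1"],
    "INITIAL-CLASS-B": ["MARKET-CLASS-1", "MARKET-CLASS-2"],
}


def market_classifications(initial_classification, table=INITIAL_TO_MARKET):
    """Return the market classifications associated with an initial classification."""
    # An unknown initial classification yields an empty list rather than an error,
    # so a caller could fall back to a semantic search.
    return list(table.get(initial_classification, []))


print(market_classifications("INITIAL-CLASS-B"))  # ['MARKET-CLASS-1', 'MARKET-CLASS-2']
print(market_classifications("INITIAL-CLASS-Z"))  # []
```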
At 1608, a market score is calculated for the document based at least in part on the value. For instance, values associated with each of the market classifications may be normalized. The normalized value can then be used as the market score for the document. For instance, the market score for the document can be calculated using the following equation:
Market Score=(VD/VT)(100) (11)
In the above equation, the market score for the document includes the portion of the total value (VD) of the market that is associated with the document divided by the total value (VT) of the market, and then multiplied by 100 (and/or some other value in other instances). For instance, to determine the market score of the document based on GDP, the market score can be calculated by dividing the portion of the total GDP associated with the market classification corresponding to the document by the total GDP, and then multiplying that result by 100. In some instances, the market score is calculated for each document during the analysis. In some instances, rather than calculating the market score for each document during the analysis, the lookup table described above may already include a respective market value associated with each market classification. The market score for the document can then be determined using the lookup table.
At 1610, it is determined whether there are any additional documents that need to be analyzed. If it is determined that there is an additional document to analyze (i.e., Yes), the method 1600 repeats back at step 1604 for the additional document. For instance, the additional document is analyzed to identify a market classification associated with the document at 1604, a value associated with the market classification is determined at 1606, and then a market score is calculated for the additional document based at least in part on the value at 1608.
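By way of illustration and not limitation, the normalization described above may be sketched as follows. The function and parameter names are assumptions made for this sketch, and the input values are arbitrary numbers rather than actual GDP data.

```python
def market_score(document_market_value, total_market_value, scale=100.0):
    """Share of the total value (VT) attributable to the document's market (VD), scaled."""
    if total_market_value <= 0:
        raise ValueError("total_market_value must be positive")
    return scale * document_market_value / total_market_value


# Arbitrary illustrative values: a market worth 50 units out of a total of 1,000.
print(market_score(50, 1000))  # 5.0
```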
If it is determined that there is not an additional document to analyze (i.e., No) at 1610, the method 1600 proceeds to 1612. At 1612, a UI is generated that includes one or more of the market scores. For instance, a UI may be generated such that a market score for one of the documents is displayed in proximity to the unique document identification number associated with that document. For example, the market score for a patent may be displayed next to the patent number. In some instances, the UI may be a textual UI or a command-line interface that displays a line of text including at least the market score and the unique document identification number. In some instances, the UI may include information on documents either to highlight a particular document (e.g., one having a highest market score out of all the documents in the analyzed corpus), due to limitations of screen real estate such as on mobile devices, to minimize a volume of data transmitted across a network, or for other reasons.
In some instances, the documents are pre-processed to generate one or more processed document portions for each of the documents. The pre-processing may use all or part of the method 800 described in
At 1704, comparative coverage scores, risk scores, and market scores are generated for the documents. For instance, in some examples, the documents may be analyzed using method 1300 in order to generate the comparative coverage scores for the documents, the documents may be analyzed using method 1400 and/or method 1500 in order to generate the risk scores for the documents, and the documents may be analyzed using method 1600 in order to generate the market scores for the documents. Additionally, or alternatively, in some examples, the comparative coverage scores, the risk scores, and the market scores may be received from one or more external sources. For instance, the comparative coverage scores, the risk scores, and the market scores may be received from one or more computing devices.
At 1706, comprehensive scores are calculated for the documents. For instance, comprehensive scores may be calculated for each document using the comparative coverage score, the risk score, and the market score for a respective document. In some instances, calculating the comprehensive score for a document can include calculating the average of the comparative coverage score, the risk score, and the market score for the document. In some instances, calculating the comprehensive score for a document can include taking the highest, the lowest, the median, or the like of the comparative coverage score, the risk score, and the market score for the document.
Still, in some instances, one or more of the comparative coverage scores, risk scores, and market scores may be given more weight when calculating the comprehensive scores for the documents. For instance, the comprehensive scores for the documents may be calculated using the following formula:
Comprehensive Score=(WC(CS)+WR(RF)+WM(MF))/3 (12)
In the above equation, the comprehensive score for a document includes a first weight (WC) times the comparative coverage score (CS) of the document, plus a second weight (WR) times the risk score (RF) of the document, plus a third weight (WM) times the market score (MF) of the document, divided by three. In some instances, one or more of the first weight (WC), the second weight (WR), or the third weight (WM) may include a similar value. Additionally, or alternatively, in some instances, each of the first weight (WC), the second weight (WR), and the third weight (WM) may include a unique value.
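By way of illustration and not limitation, the alternative ways of combining the three scores at 1706 (average, highest, lowest, median, or weighted) may be sketched as follows. The function name, the `mode` strings, and the default weights are assumptions made for this sketch.

```python
from statistics import mean, median


def comprehensive_score(coverage, risk, market, mode="average",
                        weights=(1.0, 1.0, 1.0)):
    """Combine the comparative coverage, risk, and market scores for one document."""
    scores = (coverage, risk, market)
    if mode == "average":
        return mean(scores)
    if mode == "highest":
        return max(scores)
    if mode == "lowest":
        return min(scores)
    if mode == "median":
        return median(scores)
    if mode == "weighted":
        wc, wr, wm = weights  # stand in for WC, WR, and WM
        return (wc * coverage + wr * risk + wm * market) / 3
    raise ValueError(f"unknown mode: {mode!r}")


print(comprehensive_score(70, 40, 10))                    # 40
print(comprehensive_score(70, 40, 10, mode="highest"))    # 70
print(comprehensive_score(70, 40, 10, mode="weighted",
                          weights=(2.0, 1.0, 0.0)))       # 60.0
```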
At 1708, a UI is generated that includes one or more of the comprehensive scores. For instance, a UI may be generated such that a comprehensive score for one of the documents is displayed in proximity to the unique document identification number associated with that document. For example, the comprehensive score for a patent may be displayed next to the patent number. In some instances, the UI may be a textual UI or a command-line interface that displays a line of text including at least the comprehensive score and the unique document identification number. In some instances, the UI may include information on documents either to highlight a particular document (e.g., one having a highest comprehensive score out of all the documents in the analyzed corpus), due to limitations of screen real estate such as on mobile devices, to minimize a volume of data transmitted across a network, or for other reasons.
It should further be noted that, in some instances, a comprehensive score can be calculated or adjusted for a document (e.g., a patent) based on other factors. The other factors can include, but are not limited to, a remaining patent term for the patent, litigation history associated with the patent, licensing history associated with the patent, a security interest associated with the patent, an ownership associated with the patent, and/or one or more related patents (e.g., one or more foreign related patents). For example, with regard to the remaining patent term, a set patent term (e.g., ten years) may be used when adjusting the comprehensive score for a patent. For instance, the comprehensive score for the patent may not be adjusted when the remaining patent term corresponds to (e.g., is similar to) the set patent term, but decreased by a given number (e.g., 1, 2, 5, 10, etc.) for each year that the remaining patent term is less than the set patent term and increased by a given number (e.g., 1, 2, 5, 10, etc.) for each year that the remaining patent term is greater than the set patent term.
For a second example, a comprehensive score for a patent may be increased when the patent is already being licensed, and decreased if the patent is not already being licensed and/or if it would be difficult to license the patent. For a third example, a comprehensive score for a patent may be increased when the patent includes one or more related foreign patents that are allowed in one or more foreign jurisdictions, but decreased when the patent does not include one or more related foreign patents.
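By way of illustration and not limitation, the adjustment based on remaining patent term described in the first example above may be sketched as follows. The function and parameter names are assumptions made for this sketch, and the same per-year amount is used for both increases and decreases for simplicity.

```python
def adjust_for_remaining_term(score, remaining_years, set_term_years=10,
                              per_year=1.0):
    """Shift a comprehensive score by per_year for each year above or below the set term."""
    # A positive difference raises the score and a negative difference lowers it.
    return score + per_year * (remaining_years - set_term_years)


print(adjust_for_remaining_term(50, 10, per_year=2))  # 50 (no adjustment)
print(adjust_for_remaining_term(50, 7, per_year=2))   # 44 (three years short)
print(adjust_for_remaining_term(50, 13, per_year=2))  # 56 (three years over)
```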
The computing device(s) 1800 may include one or more processing units 1802 and memories 1804, both of which may be distributed across one or more physical or logical locations. The processing unit(s) 1802 may include any combination of central processing units (CPUs), graphical processing units (GPUs), single core processors, multi-core processors, processor clusters, application-specific integrated circuits (ASICs), programmable circuits such as Field Programmable Gate Arrays (FPGA), Application-specific Standard Products (ASSPs), System-on-a-chip systems (SOCs), Complex Programmable Logic Devices (CPLDs), and the like. One or more of the processing unit(s) 1802 may be implemented in software or firmware in addition to hardware implementations. Software or firmware implementations of the processing unit(s) 1802 may include computer- or machine-executable instructions written in any suitable programming language to perform the various functions described. Software implementations of the processing unit(s) 1802 may be stored in whole or part in the memories 1804.
The memories 1804 are representative of any number of forms of memory including both persistent and non-persistent memory. In some instances, the memories 1804 may include computer-readable media in the form of volatile memory, such as random access memory (RAM) 1806 and/or non-volatile memory, such as read only memory (ROM) 1808 or flash RAM. RAM 1806 includes, but is not limited to, integrated circuits, phase change memory (PRAM), static random-access memory (SRAM), dynamic random-access memory (DRAM), and other types of RAM. ROM 1808 includes erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), flash memory, and NAND flash. Memories 1804 of the computing device(s) 1800 may also include removable storage, non-removable storage, and/or local storage 1810 to provide long- or short-term storage of computer-readable instructions, data structures, program modules, and other data.
The memories 1804 are an example of computer-readable media. Computer-readable media includes at least two types of media: computer-readable storage media and communications media. Computer-readable storage media includes volatile and non-volatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules, or other data, RAM 1806, ROM 1808, flash memory or other memory technology, compact disk read-only memory (CD-ROM), digital versatile disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information for access by a computing device. As defined herein, computer-readable storage media does not include transitory media such as modulated data signals and carrier waves.
In contrast, communications media may embody computer-readable instructions, data structures, program modules, or other data in a modulated data signal, such as a carrier wave, or other transmission mechanism. As defined herein, computer-readable storage media and communications media are mutually exclusive.
In some instances, the memories 1804 may include a plurality of databases such as the data repository 102. However, as noted above, in other examples the data repository 102 may be separate from both the memories 1804 and the computing device(s) 1800. The one or more data repositories 102 may contain a collection of patent documents such as issued patents or published patent applications. The collection of patents or patent applications may be defined by, for example, a portfolio of a patent owner, a classification of a taxonomy (e.g., public taxonomy such as a classification system of a patent office or governmental agency, a private taxonomy such as a taxonomy for a private company, a taxonomy set by a standards body or an industry, etc.), results of a search, or any other collection of patent documents.
By way of example and not limitation, the memories 1804 may also include multiple words and/or phrases such as the stop words 108 and the acronyms and abbreviations 110 as shown in
A filtering module 1814 may be present in the memories 1804 and coupled to the one or more processing unit(s) 1802. The filtering module 1814 may modify the data obtained from the data repository 102 to generate a reduced set of data that is the corpus of documents for subsequent analysis. The filtering module 1814 may perform any or all of the method 600 shown in
A pre-processing module 1816 may be present in the memories 1804 and coupled to the one or more processing unit(s) 1802. The pre-processing module 1816 may process document portions such as patent claims prior to determination of breadth, number of portions, and differentiation. This pre-processing may include delimiting individual claims, stemming words to root forms, removing duplicate root forms, and removing stop words 108. The pre-processing module 1816 may perform any or all of method 900 shown in
The pre-processing module 1816 may include stemming logic 1818. The stemming logic 1818 generates root forms of words using a stemming algorithm. A stemming algorithm is a process of linguistic normalization, in which the variant forms of a word are reduced to a common form or a root form. There are many possible stemming algorithms which may be used, including use of a lookup table, suffix stripping, lemmatisation, stochastic algorithms, n-gram analysis, matching algorithms, and the Porter, Porter2, Paice-Husk, and Lovins stemmers. The Porter stemmer follows the algorithm presented in Porter, M. “An algorithm for suffix stripping.” Program 14.3 (1980): 130-137. The stemming logic 1818 may function in part by passing values to an external stemming operation and receiving results back. One technique for implementing this is by using an application program interface (API) to call an external module or computing system that provides stemming functionality. An API is a set of routines, protocols, and tools for building software applications. An API specifies how software components should interact. APIs that provide stemming include EnClout Stemmer, EnClout Term Analysis, and Text-Processing.
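By way of illustration and not limitation, the pre-processing described above might be sketched as follows. The stop-word list, the suffix list, and the function names `toy_stem` and `preprocess` are hypothetical and are not part of the described system. The stemmer is a deliberately simple suffix stripper, not the Porter algorithm, and the sketch assumes stop words are removed before stemming, which is only one possible ordering of the steps.

```python
import re
from typing import List

# Hypothetical stop-word list; the actual stop words 108 are not reproduced here.
STOP_WORDS = frozenset({"a", "an", "the", "of", "to", "and", "or", "in", "for", "by", "is"})

# Toy suffix list, checked in order (longest first). This is NOT the Porter algorithm.
SUFFIXES = ("ations", "ation", "ings", "ing", "ed", "es", "s")


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(claim_text: str) -> List[str]:
    """Lowercase, tokenize, drop stop words, stem, and remove duplicate roots."""
    tokens = re.findall(r"[a-z]+", claim_text.lower())
    roots: List[str] = []
    seen = set()
    for token in tokens:
        if token in STOP_WORDS:
            continue
        root = toy_stem(token)
        if root not in seen:  # keep only the first occurrence of each root form
            seen.add(root)
            roots.append(root)
    return roots


if __name__ == "__main__":
    print(preprocess("A method for filtering documents and filtering the filtered document"))
```

A production implementation would more likely delegate stemming to an established stemmer or external API, as the description contemplates, rather than to a hand-written suffix list.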
An anomaly detection module 1820 may be present in the memories 1804 and coupled to the one or more processing unit(s) 1802. The anomaly detection module 1820 may detect two types of anomalies: anomalies that lead to removal of a patent claim from further analysis and anomalies that result in flagging a patent claim for manual review. The anomaly detection module 1820 may include claim removal logic that is configured to detect and remove deleted claims from the claims under consideration for analysis of breadth, number of portions, and differentiation. Removing deleted claims may include deleting records corresponding to those claims or indicating that the records corresponding to those claims are to be ignored during subsequent analysis. Claim flagging logic may be present in the anomaly detection module 1820 and configured to generate a flag or other indicium that is associated with those claims which have a type of anomaly that warrants further evaluation but not removal.
The anomaly detection module 1820 may reference one or more lists of stop words 108 and/or normative words 1812. The referencing may be done during processing by reading in a list, or the list may be integrated into the code that is performing the anomaly detection. In either implementation, part of the detection may include a comparison between words in a portion of a document and “anomalous” words. This comparison may be implemented in part by use of one or more lookup tables. The lookup tables may be pre-calculated and stored in static program storage, calculated (or “pre-fetched”) as part of a program's initialization phase (memoization), or even stored in hardware in application-specific platforms. In some programmatic implementations, the lookup tables may include function pointers (or offsets to labels) to process the matching input. To improve processing speed, one or more field-programmable gate arrays (FPGA) may use reconfigurable, hardware-implemented lookup tables to provide programmable hardware functionality. For example, and to potentially increase processing speed, a list of default stop words and/or a list of the normative words 1812 could be configured as hardware-implemented lookup tables.
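As a software-only illustration of the lookup-table comparison described above, the following sketch maps “anomalous” words to an action using a dictionary built once at initialization. The table contents, the action labels, and the function name `detect_anomaly` are assumptions made for illustration; the actual normative words 1812 and the markers of deleted claims are not reproduced here.

```python
import re

# Hypothetical lookup table built once at start-up: word -> action.
# "remove" stands in for deleted claims, "flag" for claims needing manual review.
ANOMALY_TABLE = {
    "canceled": "remove",
    "cancelled": "remove",
    "must": "flag",
    "required": "flag",
    "essential": "flag",
    "critical": "flag",
}


def detect_anomaly(claim_text: str) -> str:
    """Return 'remove', 'flag', or 'keep' for a single claim."""
    words = re.findall(r"[a-z]+", claim_text.lower())
    # Each word costs one constant-time dictionary lookup.
    actions = {ANOMALY_TABLE[word] for word in words if word in ANOMALY_TABLE}
    if "remove" in actions:  # removal takes priority over flagging
        return "remove"
    if "flag" in actions:
        return "flag"
    return "keep"


if __name__ == "__main__":
    for claim in ("3. (Canceled)",
                  "1. A device that must include a sensor.",
                  "2. A device comprising a sensor."):
        print(detect_anomaly(claim), "<-", claim)
```

Giving removal priority over flagging is a design choice of this sketch: a claim that is no longer under consideration does not need manual review.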
A breadth calculation module 1822 may be present in the memories 1804 and coupled to the one or more processing unit(s) 1802. The breadth calculation module 1822 may be configured to calculate breadth scores for document portions of documents being analyzed, use the breadth scores to calculate overall breadth scores for documents, and then use the overall breadth scores to calculate comparative breadth scores for the documents. If the document portions are patent claims, then the breadth calculation module 1822 may calculate claim breadth scores for one or more of the independent claims and/or one or more of the dependent claims, and then calculate overall breadth scores for patents using the claim breadth scores. In some instances, this calculation may be performed only for the claims or other document portions that are not removed by either the pre-processing module 1816 or the anomaly detection module 1820.
As described above, in some instances, breadth is based on the “footprint” in which one or more document portions cover an entirety of the subject matter of the corpus of documents. Additionally, or alternatively, in some instances, breadth is based on a word count score and a commonness score. Thus, the breadth calculation module 1822 may include one or both of a word count score calculation module 1824 and a commonness score calculation module 1826. The breadth calculation module 1822 may perform any or all of operations 1004-1022 of method 1000 shown in
The word count score calculation module 1824 may be configured to determine a word count score for a document portion based on a word count for the document portion and a maximum word count for another document portion that has the highest word count. In some instances, the document portion under analysis and the other document portion with the highest word count are both drawn from the same corpus of documents. Thus, the word count score calculation module 1824 may determine a word count for each document portion under analysis and identify which of those document portions has the most words. In some instances, the word count score calculation module 1824 may contain a set of rules for determining word counts for the document portions.
The commonness score calculation module 1826 may be configured to determine a commonness score for the document portion based on the frequencies with which individual words in the document portion occur throughout all of the document portions in the corpus of documents. The commonness score calculation module 1826 may determine a commonness score for each document portion under analysis and identify which of those document portions is the most “common” due to having the highest commonness score. In some instances, the ratio of a document portion's individual commonness score to the highest commonness score may be used to represent the commonness score for that document portion for the purposes of calculating breadth. In some instances, the commonness score calculation module 1826 may contain a set of rules for determining the commonness scores. The breadth calculation module 1822 may combine results generated by the word count score calculation module 1824 and the commonness score calculation module 1826 to generate a breadth score for each document portion.
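The exact formulas are not restated in this part of the description, so the following is only a minimal sketch under stated assumptions: the word count score is taken to be the maximum word count divided by the portion's word count, the raw commonness score is taken to be the mean corpus-wide relative frequency of the portion's words, the commonness score is the raw score divided by the highest raw score, and the two are combined by multiplication. All function names are hypothetical, and every portion is assumed to contain at least one word.

```python
from collections import Counter
from typing import Dict, List

Portions = Dict[str, List[str]]  # portion id -> pre-processed words


def word_count_scores(portions: Portions) -> Dict[str, float]:
    """Assumed form: highest word count divided by each portion's word count."""
    max_count = max(len(words) for words in portions.values())
    return {pid: max_count / len(words) for pid, words in portions.items()}


def commonness_scores(portions: Portions) -> Dict[str, float]:
    """Assumed form: mean relative word frequency, normalized by the highest value."""
    freq = Counter(word for words in portions.values() for word in words)
    total = sum(freq.values())
    raw = {
        pid: sum(freq[word] / total for word in words) / len(words)
        for pid, words in portions.items()
    }
    highest = max(raw.values())
    return {pid: score / highest for pid, score in raw.items()}


def breadth_scores(portions: Portions) -> Dict[str, float]:
    """Assumed combination: product of the two scores."""
    wc = word_count_scores(portions)
    cs = commonness_scores(portions)
    return {pid: wc[pid] * cs[pid] for pid in portions}


if __name__ == "__main__":
    claims = {
        "c1": ["device", "sensor"],
        "c2": ["device", "sensor", "wireless", "antenna"],
    }
    print(breadth_scores(claims))
```

Under these assumptions, a shorter portion made of words that are common across the corpus receives a higher breadth score than a longer portion that uses rarer words.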
A portion count calculation module 1828 may be present in the memories 1804 and coupled to the one or more processing unit(s) 1802. The portion count calculation module 1828 may be configured to calculate comparative portion count scores for documents that are being analyzed. For instance, the portion count calculation module 1828 may determine a respective value corresponding to the number of document portions within each of the documents, and then compare the respective value for each document with the values of the other documents being analyzed to determine respective overall portion scores for the documents. In some instances, when documents include patents and/or patent applications, the portion count calculation module 1828 may give more weight to one or more independent claims or one or more dependent claims when calculating the overall portion count scores. The portion count calculation module 1828 can then use the overall portion count scores of the documents to calculate comparative portion count scores for the documents. The portion count calculation module 1828 may perform any or all of operations 1104-1108 of method 1100 shown in
A differentiation calculation module 1830 may be present in the memories 1804 and coupled to the one or more processing unit(s) 1802. The differentiation calculation module 1830 may be configured to calculate comparative differentiation scores for documents that are being analyzed. For instance, differentiation of document portions may be analyzed based on consideration of word counts and differentiation of words between document portions within a given document. For example, for a given document portion of a given document, the differentiation calculation module 1830 can determine a number of the words within the given document portion. Additionally, the differentiation calculation module 1830 can compare words in the given document portion to words in at least one other document portion (e.g., the broadest document portion) in the given document to determine a number of words in the given document portion that are unique. The differentiation calculation module 1830 can then calculate a differentiation score for the given document portion based on the number of words and the number of unique words. Additionally, the differentiation calculation module 1830 can calculate an overall differentiation score for the given document based on the differentiation scores of one or more of the document portions of the given document. The differentiation calculation module 1830 can then use the overall differentiation scores for the documents to calculate comparative differentiation scores for the documents. The differentiation calculation module 1830 may perform any or all of operations 1204-1218 of method 1200 shown in
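One way the per-portion comparison might look in code is sketched below. The description says only that the score is based on the number of words and the number of unique words. The ratio of unique words to total words, the use of a simple mean for the overall score, and the function names are all assumptions made for illustration.

```python
from typing import Dict, List


def differentiation_score(portion_words: List[str], reference_words: List[str]) -> float:
    """Assumed form: share of the portion's words that do not appear in the reference."""
    if not portion_words:
        return 0.0
    reference = set(reference_words)
    unique_count = sum(1 for word in portion_words if word not in reference)
    return unique_count / len(portion_words)


def overall_differentiation(portions: Dict[str, List[str]], reference_id: str) -> float:
    """Assumed form: mean score of every portion other than the reference portion."""
    reference = portions[reference_id]
    scores = [
        differentiation_score(words, reference)
        for pid, words in portions.items()
        if pid != reference_id
    ]
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    claims = {
        "1": ["device", "sensor"],                         # treated as the broadest claim
        "2": ["device", "sensor", "wireless", "antenna"],
        "3": ["device", "sensor", "battery"],
    }
    print(overall_differentiation(claims, reference_id="1"))
```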
A coverage calculation module 1832 may be present in the memories 1804 and coupled to the one or more processing unit(s) 1802. The coverage calculation module 1832 may be configured to calculate comparative coverage scores for documents that are being analyzed. For instance, the coverage calculation module 1832 may calculate a comparative coverage score for each document based on the comparative breadth score, the comparative portion count score, and the comparative differentiation score for the respective document. In some instances, the coverage calculation module 1832 can calculate the comparative coverage score for a document by taking an average (and/or median, mean, mode, lowest score, highest score, etc.) of the comparative breadth score, the comparative portion count score, and the comparative differentiation score. In some instances, the coverage calculation module 1832 may weigh one or more of the comparative breadth score, the comparative portion count score, and the comparative differentiation score when calculating the comparative coverage score for a document. The coverage calculation module 1832 may perform any or all of operations 1304 and 1306 of method 1300 shown in
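As a minimal sketch of the averaging and weighting described above, the following function computes a weighted mean of the three comparative scores. The function name `coverage_score` and the default equal weights are assumptions. Other aggregates mentioned in the description (median, lowest score, highest score, and so on) would replace the weighted mean.

```python
from typing import Sequence


def coverage_score(
    breadth: float,
    portion_count: float,
    differentiation: float,
    weights: Sequence[float] = (1.0, 1.0, 1.0),
) -> float:
    """Weighted mean of the three comparative scores (equal weights by default)."""
    scores = (breadth, portion_count, differentiation)
    total_weight = sum(weights)
    if len(weights) != len(scores) or total_weight <= 0:
        raise ValueError("expected three weights with a positive sum")
    return sum(w * s for w, s in zip(weights, scores)) / total_weight


if __name__ == "__main__":
    print(coverage_score(80, 60, 40))                     # plain average
    print(coverage_score(80, 60, 40, weights=(2, 1, 1)))  # breadth counted twice
```

The same pattern could serve for any score that combines several component scores by averaging or weighting.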
A risk calculation module 1834 may be present in the memories 1804 and coupled to the one or more processing unit(s) 1802. The risk calculation module 1834 may be configured to calculate risk scores for documents that are being analyzed. For instance, the risk calculation module 1834 may analyze a given document to identify assets that are related to the given document, references cited during prosecution of the given document, claim breadth of claims within the given document, and/or prosecution history associated with the given document. The risk calculation module 1834 can then calculate a risk score for the given document based on one or more of the identified assets, the cited references, the claim breadth, and the prosecution history. For instance, the risk calculation module 1834 may perform any or all of the operations 1404-1412 of
A market calculation module 1836 may be present in the memories 1804 and coupled to the one or more processing unit(s) 1802. The market calculation module 1836 may be configured to calculate market scores for the documents that are being analyzed. For instance, the market calculation module 1836 may analyze a given document to identify an initial classification assigned to the given document. In some instances, when the document is a patent, the initial classification is based on the CPC. The market calculation module 1836 can then identify a market classification for the document using the initial classification. Additionally, the market calculation module 1836 can identify a value associated with the market classification, and calculate a market score for the given document based on the value. For instance, the market calculation module 1836 can calculate a market score for the given document based on the GDP associated with the market classification. The market calculation module 1836 can perform any or all of operations 1604-1610 of
A comprehensive calculation module 1838 may be present in the memories 1804 and coupled to the one or more processing unit(s) 1802. The comprehensive calculation module 1838 may be configured to calculate comprehensive scores for the documents being analyzed. For instance, the comprehensive calculation module 1838 may calculate a comprehensive score for each document based on the comparative coverage score, the risk score, and the market score for the respective document. In some instances, the comprehensive calculation module 1838 can calculate the comprehensive score for a document by taking an average (and/or median, mean, mode, lowest score, highest score, etc.) of the comparative coverage score, the risk score, and the market score. In some instances, the comprehensive calculation module 1838 may weigh one or more of the comparative coverage score, the risk score, and the market score when calculating the comprehensive score for a document. The comprehensive calculation module 1838 may perform any or all of operations 1704 and 1706 of method 1700 shown in
A ranking module 1840 may be present in the memories 1804 and coupled to the one or more processing unit(s) 1802. The ranking module 1840 may be configured to rank the analyzed documents by comparative breadth scores, comparative portion count scores, comparative differentiation scores, comparative coverage scores, risk scores, market scores, and/or comprehensive scores. For example, the ranking module 1840 may rank a number of patents based on the comparative breadth scores, the comparative portion count scores, the comparative differentiation scores, the comparative coverage scores, the risk scores, the market scores, and/or the comprehensive scores.
In an implementation, the ranking module 1840 may additionally bin the results of the ranking into one of a set number of values. One binning implementation is by percentiles. Thus, the top 1% of the analyzed documents in terms of comprehensive scores would all be given a rank of 100. The binning may divide the ranked documents into any number of different bins such as three different bins (e.g., high, medium, and low), 10 different bins, 100 different bins, or more. Thus, instead of 100,000 documents ranked from 1 to 100,000 in terms of final overall scores, with each ranking being unique, each document may have a rank from 1 to 100 with several documents sharing each numerical level.
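The percentile binning described above can be sketched as follows. The function name `percentile_bins` and the handling of ties (documents with equal scores keep their input order) are assumptions of this sketch.

```python
from typing import Dict


def percentile_bins(scores: Dict[str, float], bins: int = 100) -> Dict[str, int]:
    """Map each document to a bin from 1 (lowest) to `bins` (highest)."""
    ordered = sorted(scores, key=scores.get, reverse=True)  # best score first
    count = len(ordered)
    result = {}
    for position, doc_id in enumerate(ordered):
        # With 100 bins, the first 1% of positions fall in bin 100,
        # the next 1% in bin 99, and so on down to bin 1.
        result[doc_id] = bins - (position * bins) // count
    return result


if __name__ == "__main__":
    sample = {f"doc{i}": float(i) for i in range(1000)}
    binned = percentile_bins(sample)
    print(binned["doc999"], binned["doc989"], binned["doc0"])
```

Passing `bins=3` would give the high, medium, and low grouping mentioned above.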
Some of the operations described above include summation, subtraction, multiplication, and/or division. The processing unit(s) 1802 may implement these operations by use of floating point computations. Floating point is a formulaic representation that approximates a real number so as to support a trade-off between range and precision. A number is, in general, represented approximately to a fixed number of significant digits (the significand) and scaled using an exponent in some fixed base; the base for the scaling is normally two, ten, or sixteen. A number that can be represented exactly is of the following form: $\text{significand} \times \text{base}^{\text{exponent}}$, where significand is an integer, base is an integer greater than or equal to two, and exponent is also an integer. The term floating point refers to the fact that a number's radix point (decimal point, or, more commonly in computers, binary point) can “float”; that is, it can be placed anywhere relative to the significant digits of the number. This position is indicated as the exponent component, and thus the floating-point representation is a form of scientific notation.
A floating-point system can be used to represent, with a fixed number of digits, numbers of different orders of magnitude. The result of this dynamic range is that the numbers that can be represented are not uniformly spaced; the difference between two consecutive representable numbers grows with the chosen scale. One example technique for floating point calculation is described in the IEEE 754 Standard. The IEEE 754-2008 version was published in August 2008. The international standard with identical content is published as ISO/IEC/IEEE 60559:2011 “Information technology—Microprocessor Systems—Floating-Point arithmetic.”
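The non-uniform spacing can be observed directly. The sketch below assumes that Python's `float` is an IEEE 754 double-precision number, which is the case on common platforms. The helper names `next_up` and `spacing` are hypothetical, and the bit-increment trick is valid only for positive finite inputs.

```python
import struct


def next_up(x: float) -> float:
    """Next representable double above a positive finite x."""
    # Reinterpret the 64-bit pattern as an integer, add one, and convert back.
    bits = struct.unpack("<q", struct.pack("<d", x))[0]
    return struct.unpack("<d", struct.pack("<q", bits + 1))[0]


def spacing(x: float) -> float:
    """Gap between x and the next larger representable double."""
    return next_up(x) - x


if __name__ == "__main__":
    for value in (1.0, 1024.0, 2.0 ** 52, 1e16):
        print(value, spacing(value))
```

At 1.0 the gap is $2^{-52}$, while at $2^{52}$ the gap has grown to exactly 1, so adding a quantity much smaller than the local gap to a large score may leave the score unchanged.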
A floating-point number consists of two fixed-point components, whose range depends exclusively on the number of bits or digits in their representation. Whereas the components depend linearly on their range, the floating-point range depends linearly on the significand range and exponentially on the range of the exponent component, which gives the number a far wider range. On an example computer system, a ‘double precision’ (64-bit) binary floating-point number has a coefficient of 53 bits (one of which is implied), an exponent of 11 bits, and one sign bit. Positive floating-point numbers in this format have an approximate range of $10^{-308}$ to $10^{308}$, because the range of the exponent is [−1022, 1023] and 308 is approximately $\log_{10}(2^{1023})$. The complete range of the format is from about $-10^{308}$ through $+10^{308}$ (see IEEE 754).
The number of normalized floating-point numbers in a system (B, P, L, U), where B is the base of the system, P is the precision of the system (P digits), L is the smallest exponent representable in the system, and U is the largest exponent used in the system, is $2(B-1)(B^{P-1})(U-L+1)+1$.
There is a smallest positive normalized floating-point number, underflow level $= \mathrm{UFL} = B^{L}$, which has a 1 as the leading digit and 0 for the remaining digits of the significand, and the smallest possible value for the exponent. There is a largest floating-point number, overflow level $= \mathrm{OFL} = (1 - B^{-P})(B^{U+1})$, which has B−1 as the value for each digit of the significand and the largest possible value for the exponent.
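The formulas above can be checked numerically. The sketch below assumes that Python's `float` is an IEEE 754 double, so that the system is (B, P, L, U) = (2, 53, −1022, 1023). The function names are hypothetical. Exact rational arithmetic is used so that the intermediate value $B^{U+1}$ does not overflow.

```python
import sys
from fractions import Fraction


def normalized_count(B: int, P: int, L: int, U: int) -> int:
    """2(B-1)(B^(P-1))(U-L+1)+1; the trailing +1 accounts for zero."""
    return 2 * (B - 1) * B ** (P - 1) * (U - L + 1) + 1


def underflow_level(B: int, L: int) -> Fraction:
    """UFL = B^L, the smallest positive normalized number."""
    return Fraction(B) ** L


def overflow_level(B: int, P: int, U: int) -> Fraction:
    """OFL = (1 - B^-P)(B^(U+1)), the largest finite number."""
    return (1 - Fraction(B) ** (-P)) * Fraction(B) ** (U + 1)


if __name__ == "__main__":
    B, P, L, U = 2, 53, -1022, 1023  # assumed parameters for double precision
    print(float(underflow_level(B, L)) == sys.float_info.min)
    print(float(overflow_level(B, P, U)) == sys.float_info.max)
    # Toy system: 4 significands x 3 exponents x 2 signs, plus zero.
    print(normalized_count(2, 3, -1, 1))
```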
A UI generation module 1842 may be present in the memories 1804 and implemented by the processing unit(s) 1802. The UI generation module 1842 may generate or provide instructions to generate one or more user interfaces such as command-line user interfaces and/or graphical user interfaces (GUIs). A command-line interface (also known as a command language interpreter (CLI), a command-line user interface, a console user interface, or a character user interface (CUI)) is an interface for interacting with a computer program where the user (or client) issues commands to the program in the form of successive lines of text (command lines). The interface is usually implemented with a command line shell, which is a program that accepts commands as text input and converts commands to appropriate operating system functions.
A GUI is a program interface that takes advantage of a computer's graphics capabilities to make the program easier to use. Well-designed GUIs can free a user from learning complex command languages. In some instances, the UI generation module 1842 may generate a GUI such as the UI 120 shown in
The computing device(s) 1800 may include one or more communication interfaces 1844 for receiving and sending information. The communication interfaces 1844 may communicatively couple the computing device(s) 1800 to a communications network using any conventional networking protocol or technology. The computing device(s) 1800 may also include input-output (I/O) components 1846 for receiving input from human operators (e.g., a keyboard) and providing output (e.g., a monitor) to the human operators.
Although the subject matter has been described in language specific to structural features, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features described. Rather, the specific features are disclosed as illustrative forms of implementing the claims.
The terms “a,” “an,” “the” and similar referents used in the context of describing the invention (especially in the context of the following claims) are to be construed to cover both the singular and the plural, unless otherwise indicated herein or clearly contradicted by context. As used in this document, “or” means and/or. For example, “A or B” means A without B, B without A, or A and B. The term “based on” is to be construed to cover both exclusive and nonexclusive relationships. For example, “A is based on B” means that A is based at least in part on B and may be based wholly on B.
Certain embodiments are described herein, including the best mode known to the inventors for carrying out the invention. Of course, variations on these described embodiments will become apparent to those of ordinary skill in the art upon reading the foregoing description. Skilled artisans will know how to employ such variations as appropriate, and the embodiments disclosed herein may be practiced otherwise than specifically described. Accordingly, all modifications and equivalents of the subject matter recited in the claims appended hereto are included within the scope of this disclosure. Moreover, any combination of the above-described elements in all possible variations thereof is encompassed by the invention unless otherwise indicated herein or otherwise clearly contradicted by context.
Furthermore, references have been made to publications, patents, or patent applications (collectively “references”) throughout this specification. Each of the cited references is individually incorporated herein by reference for its particular cited teachings as well as for all that it discloses.