The present invention relates to a document analysis system and the like that analyze document information recorded in a predetermined computer or server.
The background art of the present invention is described below, taking a litigation case or fraud investigation as an example of an investigation case. For crimes and legal disputes related to computers, such as unauthorized access and leakage of classified information, equipment required for investigating and finding the cause of the crime or dispute has conventionally been proposed, together with means and technologies for collecting and analyzing data and electronic records and for clarifying their legal admissibility and competence as evidence.
Particularly, civil litigation in the United States requires eDiscovery (electronic discovery) and the like. All the plaintiffs and defendants of the litigation are responsible for submitting related digital information as evidence. Consequently, digital information stored in computers and servers is required to be submitted as evidence.
With the rapid development and proliferation of IT, most information in today's business is created by computers. Thus, even a single company is inundated with a large amount of digital information.
Consequently, in the course of preparing evidentiary materials for submission to a court, errors such as including classified digital information that is not necessarily related to the litigation tend to occur. Submitting classified document information unrelated to the litigation is itself a problem.
In recent years, techniques pertaining to document information in forensic systems have been proposed in the following Patent Literatures 1 to 3. However, for example, the forensic systems such as those of Patent Literatures 1 to 3 collect enormous amounts of document information on users having used multiple computers and servers.
Classifying whether such enormous amounts of digitized document information are appropriate as evidentiary materials for a litigation requires a user called a reviewer to visually verify and classify the document information piece by piece, which entails enormous effort and cost.
A document classification system for solving the above problems is proposed in Patent Literature 4. Patent Literature 4 discloses a document classification system that obtains digital information recorded in multiple computers or servers, analyzes document information included in the obtained digital information, and classifies the information so as to facilitate use for a litigation, including: an extractor that extracts a document group that is a data set including a predetermined number of documents from the document information; a document display unit that displays the extracted document group on a screen; a classification symbol acceptor that accepts a classification symbol assigned to the displayed document group by a user based on relevance to the litigation; a selector that classifies the extracted document group with respect to each classification symbol, based on the classification symbol, and analyzes and selects a keyword commonly appearing in the classified document group; a database that records the selected keyword; a searcher that searches the document information for the keyword recorded in the database; a score calculator that calculates a score representing relevance between the classification symbol and the document using a search result of the searcher and an analysis result of the selector; and an automatic classifier that automatically assigns the classification symbol, based on a result of the score.
Patent Literature 5 discloses a time-series prediction apparatus including: characteristics obtaining means for obtaining the characteristics of time series from previous time-series data; creation means for creating a regression tree, based on the amount of characteristics obtained by the characteristics obtaining means; current time series characteristics obtaining means for obtaining the amount of characteristics from current time-series data using the same algorithm as that of the characteristics obtaining means; and prediction means for obtaining a predictive value in the future using the amount of characteristics obtained by the current time series characteristics obtaining means and the regression tree created by the creation means.
The document classification system disclosed in Patent Literature 4 analyzes previous events at the stage where a lawsuit has been instituted. Consequently, preventive measures based on prediction of possible future events cannot be taken; for example, measures for preventing escalation into a litigation cannot be taken. The time-series prediction apparatus of Patent Literature 5 does not aim to facilitate analysis of document information used for a litigation.
The present invention has been made in view of the above problem, and has an object to provide a document analysis system, a document analysis method and a document analysis program that predict possible events in the future by analyzing existing data.
To solve the problem, a document analysis system of the present invention is a document analysis system that obtains information recorded in a predetermined computer or server, and analyzes document information including multiple documents included in the obtained information, including: a score calculator that calculates a score that represents a strength of connection of a document extracted from the document information to a classification symbol representing a degree of relevancy between the document information and a litigation or fraud investigation; a phase identifying section that identifies a phase by which a predetermined action to be a cause of the litigation or fraud investigation is classified along with development of the predetermined action, based on the score calculated by the score calculator; and a change estimation unit that estimates change in the phase identified by the phase identifying section, based on temporal transition of the phase.
The document analysis system may further include a score moving average calculator that calculates a moving average of the scores calculated by the score calculator, wherein the change estimation unit estimates change in the phase by calculating a correlation between the moving average calculated by the score moving average calculator and a predetermined pattern.
The document analysis system may further include a presentation unit that presents the change in the phase estimated by the change estimation unit in a manner allowing a user to grasp the change.
The document analysis system may further include a classification symbol assigner that assigns the classification symbol to each of the documents using a keyword and/or text included in the text information.
To solve the problem, a document analysis method of the present invention is a document analysis method that obtains information recorded in a predetermined computer or server, and analyzes document information including multiple documents included in the obtained information, including: a score calculation step of calculating a score that represents a strength of connection of a document extracted from the document information to a classification symbol representing a degree of relevancy between the document information and a litigation or fraud investigation; a phase identification step of identifying a phase by which a predetermined action to be a cause of the litigation or fraud investigation is classified along with development of the predetermined action, based on the score calculated in the score calculation step; and a change estimation step of estimating change in the phase identified in the phase identification step, based on temporal transition of the phase.
To solve the problem, a document analysis program of the present invention is a document analysis program that obtains information recorded in a predetermined computer or server, and analyzes document information including multiple documents included in the obtained information, causing a computer to achieve: a score calculation function of calculating a score that represents a strength of connection of a document extracted from the document information to a classification symbol representing a degree of relevancy between the document information and a litigation or fraud investigation; a phase identifying function of identifying a phase by which a predetermined action to be a cause of the litigation or fraud investigation is classified along with development of the predetermined action, based on the score calculated by the score calculation function; and a change estimation function of estimating change in the phase identified by the phase identifying function, based on temporal transition of the phase.
The document analysis system, the document analysis method and the document analysis program of the present invention can predict possible events in the future by analyzing existing data. Consequently, the document analysis system and the like can take measures that prevent unfavorable situations, such as development to a litigation, for example.
The document analysis system 1 according to the embodiment of the present invention is a system that obtains a large amount of digital information (big data) recorded in multiple computers and servers, and analyzes document information including multiple documents included in the obtained digital information. Here, for example, a litigation, fraud investigation, financial event, meteorological event, or case related to diagnosis and treatment is selected as the investigation case.
The data storage 100 stores, in a digital information storing area 101, digital information obtained from multiple computers or servers for use for analyzing a litigation or fraud investigation. The data storage 100 includes an investigation basis database 103, a keyword database 104, a related term database 105, a score calculation database 106, and a report creation database 107. As described in
The investigation basis database 103 holds a category attribute that indicates which category the case falls into, a company name, a person in charge, a custodian, and the configuration of an investigation or classification input screen. The categories include, for example, litigation cases such as antitrust, patent, the Foreign Corrupt Practices Act (FCPA), and product liability (PL), and fraud investigations such as information leakage and billing fraud.
The keyword database 104 holds a specific classification symbol of a document, a keyword having a close relationship with the specific classification symbol, and keyword correspondence information representing the correspondence relationship between the specific classification symbol and the keyword, which are included in the obtained digital information.
The related term database 105 holds a predetermined classification symbol, a related term including a word having a high appearance frequency in a document assigned the predetermined classification symbol, and related term correspondence information representing the correspondence relationship between the predetermined classification symbol and the related term.
The score calculation database 106 holds a weight for a word included in the document in order to calculate a score that represents the strength of connection between the document and the classification symbol.
The report creation database 107 stores the category, the custodian, and the form of a report defined according to the content of classification work.
The database manager 109 manages updates to the data in the investigation basis database 103, the keyword database 104, the related term database 105, the score calculation database 106, and the report creation database 107. The database manager 109 may be connected to an information storage apparatus 902 via a dedicated connection line or an Internet line 901. In this case, the database manager 109 may update the data in the investigation basis database 103, the keyword database 104, the related term database 105, the score calculation database 106, and the report creation database 107, on the basis of the data stored in the information storage apparatus 902.
The document extractor 112 extracts multiple documents from the document information.
The word searcher 114 searches the document information for the keyword or related term recorded in the database.
The score calculator 116 calculates a score that represents the strength of connection of the document extracted from the document information to the classification symbol representing the degree of relevancy between the document information and the litigation or fraud investigation. The score calculator 116 may calculate the score in a time-series manner. The score calculator 116 may also calculate the score for each of the phases into which a predetermined action that is a cause of the litigation or fraud investigation is classified according to the advancement of the action. A method of calculating the score is described later in detail.
The phase identifying section 122 identifies the phase by which the predetermined action to be a cause of the litigation or fraud investigation is classified along with the development of the predetermined action, according to the score calculated by the score calculator 116.
Here, the predetermined action may be, for example, an action related to a fraudulent act concerning antitrust, patent, the Foreign Corrupt Practices Act, product liability, information leakage, or billing fraud (e.g., attendance at a price adjustment meeting with competitors). The phase is an indicator representing each stage of development of the predetermined action. For example, the phase of “relationship building” is a stage that serves as a precondition of the phase of competition, and is a stage of constructing a relationship with customers and competitors. The phase of “preparation” is a stage of exchanging information related to competition with a competitor (which may be a third party). The phase of “competition” is a stage where a price is presented to a customer, feedback is obtained, and the feedback is communicated to the competitor. For example, a predetermined action of “inquiry from a customer” belongs to the phase of “relationship building”. A predetermined action of “obtainment of production situations of the competitor” belongs to the phase of “preparation”.
The phase identifying section 122 identifies which phase the current state is in, on the basis of the score calculated by the score calculator 116. More specifically, the score calculator 116 calculates the scores corresponding to the respective phases, and the phase identifying section 122 identifies the phase (e.g., the phase whose score has the maximum value) according to a result of comparison of the scores.
Alternatively, ranges of score values may be assigned to the respective phases, and the phase identifying section 122 may identify the phase whose range contains the score. Alternatively, the phase identifying section 122 may identify the maximum likelihood phase, that is, the phase that maximizes the likelihood (a value calculated as the score for each phase) of a model (an observation process or likelihood function) representing the process by which a predetermined action subject (an organization made up of one or more individuals) reaches the predetermined action.
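As a non-limiting illustration of the two simpler identification schemes described above (maximum score among per-phase scores, and score ranges assigned to phases), a minimal Python sketch might look as follows. The function names, the example scores, and the range boundaries are hypothetical and are not taken from the embodiment.

```python
from typing import Dict, List, Optional, Tuple


def phase_by_max_score(phase_scores: Dict[str, float]) -> str:
    """Return the phase whose score is largest (comparison of per-phase scores)."""
    return max(phase_scores, key=lambda phase: phase_scores[phase])


def phase_by_range(score: float,
                   ranges: List[Tuple[float, float, str]]) -> Optional[str]:
    """Return the phase whose half-open range [low, high) contains the score.

    The half-open convention is an assumption made for this sketch.
    """
    for low, high, phase in ranges:
        if low <= score < high:
            return phase
    return None


# Hypothetical per-phase scores and range boundaries.
scores = {"relationship building": 0.2, "preparation": 0.7, "competition": 0.4}
ranges = [(0.0, 0.3, "relationship building"),
          (0.3, 0.6, "preparation"),
          (0.6, 1.0, "competition")]

print(phase_by_max_score(scores))
print(phase_by_range(0.45, ranges))
```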
The change estimation unit 120 estimates change in the phase identified by the phase identifying section 122 on the basis of temporal transition of the phase. More specifically, for example, suppose it is known (by holding time-series information representing the temporal order of phases) that the phase “relationship building” transitions to the phase “preparation” and then develops to the phase “competition”. When the phase identifying section 122 identifies the current phase as “preparation”, the change estimation unit 120 estimates that the subsequent transition is development to the phase “competition”.
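The estimation from a known temporal order of phases can be illustrated by the following minimal Python sketch. It assumes, purely for illustration, that the held time-series information is a simple ordered list of phase names; the names PHASE_ORDER and estimate_next_phase are hypothetical.

```python
from typing import List, Optional

# Assumed temporal order of phases, taken from the example in the text.
PHASE_ORDER: List[str] = ["relationship building", "preparation", "competition"]


def estimate_next_phase(current: str,
                        order: List[str] = PHASE_ORDER) -> Optional[str]:
    """Return the phase expected to follow `current`, or None for the last phase."""
    index = order.index(current)  # raises ValueError for an unknown phase
    if index + 1 < len(order):
        return order[index + 1]
    return None


print(estimate_next_phase("preparation"))
```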
Alternatively, the change estimation unit 120 may estimate change in phase by calculating the correlation between the moving average calculated by the score moving average calculator 140 and a predetermined pattern. Here, the predetermined pattern may be a pattern where the score calculated in a litigation or fraud investigation other than the litigation or fraud investigation concerned changes according to lapse of time.
For example, in the case where analysis related to a previously instituted litigation has been performed in order to submit evidentiary materials in that litigation and the moving average of the score has been calculated, the change estimation unit 120 adopts that moving average as the predetermined pattern, and calculates the correlation between the moving average of the score for the document information to be analyzed this time and the predetermined pattern. In other words, the change estimation unit 120 calculates the degree of coincidence (correlation) therebetween while shifting the elapsed time and/or score. When the correlation therebetween is high, the change estimation unit 120 estimates that the current score will take similar values in the future, in conformity with the predetermined pattern. Consequently, the phase identifying section 122 identifies the future phase on the basis of the score expected in the future.
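One possible way to compute such a degree of coincidence is sketched below in Python. The sketch uses the Pearson correlation coefficient, which is a choice made here for illustration and is not specified in the embodiment; it slides the current moving-average series along the predetermined pattern (shifting the elapsed time), and because the Pearson coefficient is unaffected by a constant offset in the score, a shift of the score is tolerated as well. The function names, the threshold min_corr, and the sample values are hypothetical.

```python
from math import sqrt
from typing import List, Sequence, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length series (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def best_alignment(current: Sequence[float],
                   pattern: Sequence[float]) -> Tuple[int, float]:
    """Slide `current` along `pattern`; return (time shift, correlation) of the best match."""
    if len(current) < 2 or len(current) > len(pattern):
        raise ValueError("current must have 2..len(pattern) points")
    best_shift, best_corr = 0, -2.0
    for shift in range(len(pattern) - len(current) + 1):
        corr = pearson(current, pattern[shift:shift + len(current)])
        if corr > best_corr:
            best_shift, best_corr = shift, corr
    return best_shift, best_corr


def projected_values(current: Sequence[float], pattern: Sequence[float],
                     min_corr: float = 0.8) -> List[float]:
    """If the best correlation is high enough, return the pattern values that follow the match."""
    shift, corr = best_alignment(current, pattern)
    if corr < min_corr:
        return []
    return list(pattern[shift + len(current):])


# Hypothetical moving averages: a pattern from an earlier case and the current series.
pattern = [0.1, 0.1, 0.3, 0.7, 0.9, 0.4]
current = [0.1, 0.3, 0.7]
print(best_alignment(current, pattern))
print(projected_values(current, pattern))
```

The projected values could then be passed to the phase identification step in place of a measured score.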
The score moving average calculator 140 calculates the moving average of scores calculated by the score calculator 116.
The score difference moving average calculator 142 calculates the difference moving average of the scores from the short-term moving average and long-term moving average of the scores.
When a keyword stored in the keyword database 104 is searched for by the word searcher 114 and a document including the keyword is extracted by the document extractor 112, the first automatic classifier 201 automatically assigns a specific classification symbol to the extracted document on the basis of the keyword correspondence information.
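The keyword-based assignment can be pictured with the following minimal Python sketch. It assumes that the keyword correspondence information is a plain mapping from keyword to classification symbol and that matching is a case-insensitive substring search; both are simplifications made for illustration, and the function name is hypothetical.

```python
from typing import Dict, Optional


def classify_by_keyword(text: str,
                        keyword_to_symbol: Dict[str, str]) -> Optional[str]:
    """Return the classification symbol of the first keyword found in the text."""
    lowered = text.lower()
    for keyword, symbol in keyword_to_symbol.items():
        if keyword.lower() in lowered:
            return symbol
    return None  # no keyword found: the document stays unclassified


# Hypothetical keyword correspondence information.
correspondence = {"infringer": "important"}
print(classify_by_keyword("The alleged Infringer replied.", correspondence))
```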
When the documents including the related terms stored in the related term database are extracted from the document information and the scores are calculated on the basis of the evaluated values of the related terms and the number of related terms included in the extracted document, the second automatic classifier 301 automatically assigns the predetermined classification symbol to the documents having a score exceeding a certain value among the documents including the related terms on the basis of the score and the related term correspondence information.
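The threshold-based assignment by the second automatic classifier 301 can be sketched as follows. The way the score is formed here (sum of evaluated value times number of occurrences) is only an assumption for illustration; the embodiment's own score calculation is described later. The function names, evaluated values, and threshold are hypothetical.

```python
from typing import Dict, Optional


def related_term_score(text: str, evaluated_values: Dict[str, float]) -> float:
    """Sum of (evaluated value x number of occurrences) over the related terms."""
    lowered = text.lower()
    return sum(value * lowered.count(term.lower())
               for term, value in evaluated_values.items())


def classify_by_related_terms(text: str, evaluated_values: Dict[str, float],
                              symbol: str, threshold: float) -> Optional[str]:
    """Assign `symbol` only when the score exceeds the threshold."""
    score = related_term_score(text, evaluated_values)
    return symbol if score > threshold else None


# Hypothetical related terms for the classification symbol "product A".
values = {"image coding": 2.0, "encoder": 1.5}
text = "The encoder performs image coding; image coding is fast."
print(related_term_score(text, values))
print(classify_by_related_terms(text, values, "product A", threshold=5.0))
```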
The presentation unit 130 presents the change in phase estimated by the change estimation unit 120, in a manner allowing the user to grasp the change.
The classification symbol accepting and assigning unit 131 accepts the classification symbol assigned by the user on the basis of the relevance to a litigation, and assigns the classification symbol to the documents that have been assigned no classification symbol and extracted from the document information.
The document analyzer 118 analyzes the documents assigned the classification symbol by the classification symbol accepting and assigning unit 131. The document analyzer 118 may analyze not only the documents to which classification symbols accepted from the user have been assigned on the basis of the relevance to the litigation, but also the documents automatically assigned classification symbols by the first automatic classifier 201 and the second automatic classifier 301 on the basis of the keyword, related term and score. The document analyzer 118 may then integrate the two sets of documents to obtain an integrated analysis result. In this case, the third automatic classifier 401 can automatically assign the classification symbol on the basis of the integrated analysis result.
Procedures of classification and investigation work are various procedures including: automatic classification through word search; acceptance of classification and investigation by the user; automatic classification and investigation using the score; automatic classification and investigation where a learning process intervenes; and automatic classification and investigation where quality assurance intervenes. With an advancement history that represents the order and combination of the various types of classification and investigation work, the multiple documents assigned the classification symbols are analyzed by the document analyzer 118, and the report creator 701, described below, may report the analyzed result.
The third automatic classifier 401 automatically assigns the classification symbol to the multiple documents extracted from the document information, on the basis of the result by the document analyzer 118 analyzing the documents assigned the classification symbol by the classification symbol accepting and assigning unit 131.
The tendency information generator 124 generates, for analysis by the document analyzer 118, tendency information that represents the degree of similarity of each document to the documents assigned the classification symbol, on the basis of the types of words included in the document, the number of appearances of the words, and the evaluated values of the words.
The quality inspector 501 compares the classification symbol accepted by the classification symbol accepting and assigning unit 131 with the classification symbol assigned according to the tendency information in the document analyzer 118, and verifies the appropriateness of the classification symbol accepted by the classification symbol accepting and assigning unit 131.
The learning unit 601 learns the weighting of each of keywords and related terms on the basis of the result of document classification process. The learning unit 601 learns the weighting of each keyword or related term on the basis of the first to fourth processing results (described later) according to the expression (2). The learning unit 601 may reflect the learned result in the keyword database 104, the related term database 105, or the score calculation database 106.
The report creator 701 outputs an optimal investigation report on the basis of the result of the document classification process according to the investigation type of the litigation cases or fraud investigation. As described above, the litigation cases include, for example, antitrust, patent, The Foreign Corrupt Practices Act (FCPA), product liability (PL), etc. The fraud investigation may include, for example, information leakage, billing fraud, etc.
The attorney review accepting unit 133 accepts a review by a chief attorney at law or a chief patent attorney in order to improve the qualities of classification and investigation and report and clarify the responsibilities of the classification and investigation and report.
The language determiner (not shown) determines the type of language of the extracted document.
The translator (not shown) translates the extracted document, either upon acceptance of designation by the user or automatically. In this case, it is preferred that the unit of text delimited by the language determiner be set smaller than one sentence so as to support multiple languages within one sentence. Either or both of predictive coding and character coding may be used to determine the language. Furthermore, a process of excluding the headers of HTML (Hyper Text Markup Language) and the like from the targets of translation may be performed.
The score change detector (not shown) detects the time-series change in score calculated by the score calculator 116.
The score change determiner (not shown) determines the degree of relevancy between the investigation case and the extracted document from the time-series change in score detected by the score change detector.
The term “classification symbol” is an identifier used to classify a document, and is an identifier that represents the degree of relevancy to a litigation to facilitate use of the document for the litigation. For example, the symbol may be assigned according to the type of evidence when document information is used as evidence in a litigation.
The term “document” is data including at least one word and is, for example, email, presentation materials, spreadsheet materials, discussion materials, a written contract, an organization chart, a business plan, etc.
The term “word” is a unit of a minimum character string having meaning. For example, the text “the document is data that includes at least one word” includes words “document”, “one”, “at least”, “word”, “includes”, “data”, and “is”.
The term “keyword” is a character string aggregate that has a certain meaning in a certain language. For example, keywords may be selected from text “classify a document” to obtain “document” and “classify”. In this embodiment, keywords such as “infringement”, “litigation” and “Patent publication No. XX” are mainly selected. The “keyword” may be a morpheme.
The term “keyword correspondence information” is information that represents the correspondence relationship between a keyword and a specific classification symbol. For example, when the classification symbol “important” representing an important document in a litigation has a close relationship with a keyword “infringer”, the “keyword correspondence information” may be information that manages the classification symbol “important” and the keyword “infringer” in association with each other.
The term “related term” is a term having an evaluated value of at least a certain value among words having a high appearance frequency common to the documents assigned a predetermined classification symbol. Here, the appearance frequency may be a ratio of appearance of the related term to the total number of words appearing in one document.
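The appearance frequency described above, taken as the ratio of occurrences of a term to the total number of words in one document, can be computed as in the following minimal Python sketch. The tokenization rule (lower-cased runs of letters, digits, and apostrophes) and the restriction to single-word terms are assumptions made for illustration; the function name is hypothetical.

```python
import re


def appearance_frequency(text: str, term: str) -> float:
    """Occurrences of a single-word term divided by the total word count of the document."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if not words:
        return 0.0
    return words.count(term.lower()) / len(words)


# 2 occurrences of "encoder" among 7 words.
print(appearance_frequency("Encoder settings: the encoder uses fast coding", "encoder"))
```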
The term “evaluated value” is a value that represents the amount of information exerted by each word in a certain document. The “evaluated value” may be calculated with reference to the amount of transmitted information. For example, when a predetermined trade name is assigned as a classification symbol, the “related term” may indicate the name of a technical field to which the product belongs, a country where the product is sold, or a trade name similar to that of the product. More specifically, the “related terms” in the case of assigning, as a classification symbol, the trade name of an apparatus that performs an image coding process include “coding process”, “Japan” and “encoder”.
The term “related term correspondence information” is information that represents the correspondence relationship between a related term and a classification symbol. For example, when a classification symbol “product A” which is a trade name related to a litigation has a related term “image coding” which is a function of the product A, the “related term correspondence information” may be information where the classification symbol “product A” and the related term “image coding” are associated with each other and managed.
The term “score” is a value that quantitatively evaluates the strength of connection of a certain document with a specific classification symbol. In each embodiment of the present invention, for example, the score is calculated on the basis of words appearing in a document and the evaluated value of each word using the following expression (1).
[Expression 1]
$$\mathrm{Scr} = \frac{\sum_{i=0}^{N} i \cdot \left(m_i \cdot wgt_i^2\right)}{\sum_{i=0}^{N} i \cdot wgt_i^2} \qquad (1)$$
$\mathrm{Scr}$: Score of document
$m_i$: Appearance frequency of i-th keyword or related term
$wgt_i^2$: Weight of i-th keyword or related term
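A literal reading of expression (1) can be sketched in Python as follows. Two assumptions are made and should be checked against the original expression: the leading factor i is kept exactly as it appears, so the term for i = 0 contributes nothing, and the quantity listed as the weight of the i-th term is passed in directly as one number per term (if that symbol denotes the square of an underlying weight, the inputs would have to be squared first). The function name and sample values are hypothetical.

```python
from typing import Sequence


def document_score(frequencies: Sequence[float],
                   weight_terms: Sequence[float]) -> float:
    """Ratio of index-weighted sums, following expression (1) as printed.

    frequencies[i]  : appearance frequency m_i of the i-th keyword or related term
    weight_terms[i] : weight quantity for the i-th term, as listed under expression (1)
    """
    if len(frequencies) != len(weight_terms):
        raise ValueError("frequencies and weight_terms must have the same length")
    numerator = sum(i * m * w
                    for i, (m, w) in enumerate(zip(frequencies, weight_terms)))
    denominator = sum(i * w for i, w in enumerate(weight_terms))
    if denominator == 0:
        return 0.0  # assumption of this sketch: no weighted terms gives a zero score
    return numerator / denominator


# Hypothetical values: numerator 0 + 1*2*4 + 2*1*1 = 10, denominator 0 + 4 + 2 = 6.
print(document_score([3, 2, 1], [1.0, 4.0, 1.0]))
```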
The document analysis system 1 may extract a word that frequently appears in documents having a common classification symbol assigned by the user. The type of the extracted word included in each document, the evaluated value of each word, and tendency information on the number of appearances may be analyzed on a document-by-document basis, and a common classification symbol may be assigned to documents having the same tendency as the analyzed tendency information among documents where no classification symbol is accepted by the classification symbol accepting and assigning unit 131.
Here, the term “tendency information” is information that represents the degree of similarity of each document to the documents assigned the classification symbol, and is represented by the degree of relevancy to the predetermined classification symbol based on the types of the words included in each document, the number of appearances, and the evaluated values of the words. For example, when a document is similar to a document assigned the predetermined classification symbol in its degree of relevancy to this predetermined classification symbol, the two documents have the same tendency information. Documents including words having the same evaluated values with the same numbers of appearances may be regarded as documents having the same tendency even if the types of included words differ.
The score calculator 116 calculates the score that represents the strength of connection of the document extracted from the document information to the classification symbol representing the degree of relevancy between the document information and the litigation or fraud investigation (S11, score calculation step). Next, the phase identifying section 122 identifies the phase by which the predetermined action to be a cause of the litigation or fraud investigation is classified along with the development of the predetermined action, on the basis of the score calculated by the score calculator 116 (S12, phase identification step). The change estimation unit 120 then estimates change in phase identified by the phase identifying section 122 on the basis of temporal transition of the phase (S13, change estimation step).
The document analysis method according to the embodiment of the present invention is further described.
Each of the documents of cases 1 and 2 includes email or the like. The documents of cases 1 and 2 may be used as cases for optimizing the predictive coding (specifically, for example, sampling, file type classification, etc.). The weights and scores are calculated on the basis of information related to the “responsive” documents. In the embodiment of the present invention, the email documents of the case 1 are written mainly in English, and the email documents of the case 2 are written in both Japanese and English. The email documents in the cases 1 and 2 may be used as subsets.
In the embodiment of the present invention, documents dated from Apr. 1, 2000 to Mar. 31, 2013 are used as the email documents of the case 2.
The document of the case 2 is used as an example, and score time-series analysis is described. First, referring to
Next, on the basis of the score, the moving average of scores is obtained, and characteristics and tendency obtained by analyzing the moving average are discussed. Here, the moving average (MA) is as follows.
Here, $SMA_M$ is the simple moving average of the n scores $\{Scr_M, Scr_{M-1}, \ldots, Scr_{M-(n-1)}\}$, that is, their arithmetic mean $SMA_M = (Scr_M + Scr_{M-1} + \cdots + Scr_{M-(n-1)})/n$, where $Scr_M$ is the score of an email document M.
The simple moving average SMA is calculated with respect to each document (email) M, on the basis of the score $Scr_M$ and the scores $\{Scr_{M-1}, \ldots, Scr_{M-(n-1)}\}$ of the pieces of email whose transmission dates fall within a predetermined number of days before the transmission date of the email M. The predetermined number of days may be appropriately defined. This embodiment defines seven days as a short term, 30 days as a mid-term, and 90 days as a long term.
Use of the simple moving average SMA allows the large fluctuation of the original score values to be smoothed.
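The day-based window described above can be illustrated with the following minimal Python sketch. It assumes that each email is represented by its transmission date and score, and that the window is inclusive at both ends (the email's own date and the date exactly the predetermined number of days earlier); these conventions and the function name are assumptions made for illustration.

```python
from datetime import date, timedelta
from typing import List, Sequence, Tuple


def windowed_sma(emails: Sequence[Tuple[date, float]],
                 days: int) -> List[Tuple[date, float]]:
    """For each email, average the scores of emails sent at most `days` days earlier.

    The email itself, and any other email sent on the same date, is included.
    """
    ordered = sorted(emails, key=lambda email: email[0])
    span = timedelta(days=days)
    averages = []
    for sent, _ in ordered:
        window = [score for other_date, score in ordered
                  if timedelta(0) <= sent - other_date <= span]
        averages.append((sent, sum(window) / len(window)))
    return averages


# Hypothetical scores; 7 days stands for the short term and 30 days for the mid-term.
emails = [(date(2012, 1, 1), 1.0), (date(2012, 1, 5), 3.0), (date(2012, 1, 20), 5.0)]
print(windowed_sma(emails, days=7))
print(windowed_sma(emails, days=30))
```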
Next, the calculation of the difference moving average is described. The difference of moving averages (DMA) is represented as follows.
ΔMA_{M12} = MA_{M1} − MA_{M2} [Expression 3]
MA_{M1}: moving average 1 (short term: e.g., short term (seven days))
MA_{M2}: moving average 2 (long term: e.g., mid-term (30 days))
A positive value of the difference moving average ΔMA_{M12} means that the scores in the immediately preceding term (i.e., the short term) are large. It is then assumed that a relatively large number of pieces of “HOT” email were transmitted in the short term, and that changes to be investigated occurred. Consequently, the difference moving average reveals characteristics and tendencies of the email documents that cannot be obtained through simple comparison of scores. The change in characteristics and tendency described here is detected as an intersection of difference moving average curves, for example.
The main (rising) edge (EDGE) is a site where the difference of moving average (DMA) changes from negative to positive, that is, the intersection between the difference of moving averages (DMA) and the horizontal axis.
The term IN means a region where the difference of moving averages (DMA) is positive.
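The definitions of the difference of moving averages, “EDGE”, and IN can be illustrated with a short sketch. The function names are hypothetical, the two input lists are assumed to be aligned per email, and a change from exactly zero to positive is also treated as an edge, which the text does not specify.

```python
def difference_of_moving_averages(short_ma, long_ma):
    """Element-wise DMA: shorter-term moving average minus longer-term one."""
    return [s - l for s, l in zip(short_ma, long_ma)]


def find_edges_and_in(dma):
    """Return (edges, in_region) as lists of indices into dma.

    edges: indices where the DMA becomes positive after a non-positive value
           (the rising "EDGE").
    in_region: indices where the DMA is positive (the "IN" region).
    """
    edges = []
    in_region = []
    for i, value in enumerate(dma):
        if value > 0:
            in_region.append(i)
            # Assumption: zero counts as "not yet positive".
            if i > 0 and dma[i - 1] <= 0:
                edges.append(i)
    return edges, in_region


short_ma = [1.0, 2.0, 5.0, 6.0, 3.0]
long_ma = [2.0, 3.0, 4.0, 4.0, 4.0]
dma = difference_of_moving_averages(short_ma, long_ma)
print(dma)                     # [-1.0, -1.0, 1.0, 2.0, -1.0]
print(find_edges_and_in(dma))  # ([2], [2, 3])
```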
For the “HOT” email documents of a custodian 1, the presence or absence of redundant pieces of email having the same date and the same score value is examined. Deleting the redundant pieces of email reduces the number of “HOT” email documents from 98 to 86. Only four pieces of email have transmitters that cannot be identified owing to differences in addresses, a number regarded as quantitatively negligible.
Most of the scores of the pieces of “HOT” email of the custodian 1 have values which are not large. However, on the date when these were transmitted, “EDGE” or IN is detected.
The email documents transmitted in and after November 2012 have neither “EDGE” nor “IN”. Consequently, it is estimated that these pieces of email are related to frequent communication between specific persons in the same domain as that of the custodian 1.
Time-series data is described below. The moving average (MA) and the difference of moving averages (DMA) are excellent indicators for finding the basic characteristics and tendency of the time-series data.
The term “EDGE” of the difference of moving averages (DMA) may be an indicator that can detect the point of change in tendency of the score and indicates the presence of a piece of “HOT” email.
Analysis using the moving average (MA) or difference of moving averages (DMA) of score values has a possibility of detecting specific characteristics (e.g., possible “HOT”) in the time-series data. This enables selective dissemination of information (SDI) about a specific custodian or a specific group of custodians.
An example of procedures of executing time-series data analysis is described below.
The time-series data analysis according to the embodiment of the present invention is performed, for example, as part of the document classification process. An example of the document classification process is described below. The document classification process is performed according to a flowchart as shown in
In the first stage, the keyword and the related term are preliminarily updated and registered using a result of a previous classification process (STEP100). At this time, the keyword and the related term are updated and registered together with the keyword correspondence information and the related term correspondence information which are correspondence information on the classification symbol and the keyword or the related term.
On the second stage, a first classification process is executed that extracts a document including the keyword updated and registered in the first stage from the entire document information, refers to the updated keyword correspondence information recorded in the first stage upon finding the document, and assigns the classification symbol corresponding to the keyword (STEP200).
On the third stage, the document including the related term updated and registered in the first stage is extracted from the document information assigned no classification symbol in the second stage, and the score of the document including the related term is calculated. A second classification process is executed that refers to the calculated score and the related term correspondence information updated and registered on the first stage and assigns the classification symbol (STEP300).
On the fourth stage, the classification symbol assigned by the user is accepted with respect to the document information where no classification symbol has been assigned until the third stage, and the classification symbol accepted from the user is assigned to the document information. Next, a third classification process is executed that analyzes the document information assigned the classification symbol accepted from the user, extracts the document assigned no classification symbol on the basis of the analysis result, and assigns the classification symbol to the extracted document. For example, words frequently appearing in documents to which the user has assigned a common classification symbol are extracted; tendency information on the types of the extracted words included in each document, the evaluated value of each word, and the number of appearances may be analyzed on a document-by-document basis; and the common classification symbol is assigned to a document having the same tendency as the tendency information (STEP400).
On the fifth stage, the classification symbol to be assigned to the document to which the user has assigned the classification symbol is determined on the basis of the analyzed tendency information, the determined classification symbol is compared with the classification symbol assigned by the user, and the appropriateness of the classification process is verified (STEP500). A learning process can be performed on the basis of the result of the document classification process as necessary.
Here, the tendency information used in the processes in the fourth and fifth stages is held for each document, represents the degree of similarity to the documents assigned the classification symbol, and is based on the types of the words included in each document, the number of appearances, and the evaluated values of the words. For example, when a document is similar to the document assigned the predetermined classification symbol in degree of relevancy with this predetermined classification symbol, the two documents have the same tendency information. Documents including words having the same evaluated value with the same number of appearances may be regarded as documents having the same tendency even if the types of the included words differ.
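One way to read the last condition is sketched below: each document is reduced to a multiset of (evaluated value, number of appearances) pairs, so that the word types themselves no longer matter. The function names, the whitespace tokenization, and the sample evaluated values are assumptions for illustration only; the embodiment does not prescribe this representation.

```python
from collections import Counter


def tendency_signature(text, evaluated_values):
    """Multiset of (evaluated value, appearance count) pairs for one document.

    evaluated_values: dict mapping a word to its evaluated value (hypothetical).
    Word types are deliberately dropped from the signature.
    """
    counts = Counter(w for w in text.lower().split() if w in evaluated_values)
    return Counter((evaluated_values[w], n) for w, n in counts.items())


def same_tendency(text_a, text_b, evaluated_values):
    """True when both documents yield the same signature."""
    return (tendency_signature(text_a, evaluated_values)
            == tendency_signature(text_b, evaluated_values))


values = {"infringement": 3, "lawsuit": 3, "meeting": 1}
a = "infringement infringement meeting"
b = "lawsuit meeting lawsuit"
c = "lawsuit meeting"
print(same_tendency(a, b, values))  # True: different words, same values and counts
print(same_tendency(a, c, values))  # False: the appearance counts differ
```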
Detailed processing flows in each of the first to fifth stages are described as follows.
A detailed processing flow of the keyword database 104 on the first stage is described with reference to
The keyword database 104 creates a table for management for each classification symbol in consideration of a result of classification of documents in previous litigations, and identifies a keyword corresponding to each classification symbol (STEP111). In the embodiment of the present invention, the identification may be made by analyzing the document assigned each classification symbol, using the number of appearances and evaluated value of each keyword in the document. Alternatively, a method of using the amount of transmitted information held by the keyword, or a method of manual selection by the user may be adopted.
In the embodiment of the present invention, for example, when keywords “infringement” and “patent attorney” are identified as keywords of a classification symbol “important”, keyword correspondence information indicating that the “infringement” and “patent attorney” are keywords having close relationship with the classification symbol “important” is created (STEP112). The identified keyword is registered in the keyword database 104. In this case, the identified keyword and the keyword correspondence information are associated with each other, and recorded in the management table of the classification symbol “important” of the keyword database 104 (STEP113).
Next, a detailed processing flow of the related term database 105 is described with reference to
The related term correspondence information indicating correspondence of the registered related terms to the classification symbols is created (STEP122), and recorded in each management table (STEP123). At this time, in the related term correspondence information, the evaluated value of each related term, and a threshold that serves as a score required to determine the classification symbol are recorded together.
Before actual classification work, the keyword and keyword correspondence information, and the related term and related term correspondence information are updated to the latest ones and registered (STEP113, STEP123).
A detailed processing flow of the first automatic classifier 201 on the second stage is described with reference to
In the first stage (STEP100), the keywords “infringement” and “patent attorney” were registered in the keyword database 104. The first automatic classifier 201 extracts, from the document information, the documents that include these keywords (STEP211). For each extracted document, the management table that records the keyword is referred to according to the keyword correspondence information (STEP212), and the classification symbol “important” is assigned (STEP213).
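The flow of STEP211 to STEP213 can be sketched as a lookup against per-symbol keyword tables. The function name, the dictionary layout standing in for the management tables, and the use of case-insensitive substring matching are assumptions. The sketch also assumes that a document containing any one registered keyword qualifies; replacing `any` with `all` would require every keyword to be present.

```python
def first_classification(documents, keyword_table):
    """Assign classification symbols by keyword lookup.

    documents: dict mapping a document id to its text (hypothetical layout).
    keyword_table: dict mapping a classification symbol to its set of keywords,
                   standing in for the management tables of the keyword database.
    Returns a dict of document id -> list of assigned symbols; documents that
    match no keyword are omitted and remain for the later stages.
    """
    assigned = {}
    for doc_id, text in documents.items():
        lowered = text.lower()
        for symbol, keywords in keyword_table.items():
            # Extract the document if it contains a registered keyword.
            if any(keyword in lowered for keyword in keywords):
                assigned.setdefault(doc_id, []).append(symbol)
    return assigned


documents = {
    1: "Possible infringement of the claims",
    2: "Lunch on Friday?",
}
keyword_table = {"important": {"infringement", "patent attorney"}}
print(first_classification(documents, keyword_table))  # {1: ['important']}
```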
A detailed processing flow of the second automatic classifier 301 on the third stage is described with reference to
In the embodiment of the present invention, the second automatic classifier 301 performs a process of assigning the classification symbols “product A” and “product B” to the document information having been assigned no classification symbol on the second stage (STEP200).
The second automatic classifier 301 extracts documents including the related terms “coding process”, “product a”, “decode” and “product b”, which have been recorded in the related term database 105 on the first stage, from the document information (STEP311). The scores of the extracted documents are calculated by the score calculator 116 using the expression (1) on the basis of the appearance frequencies and evaluated values of the recorded four related terms (STEP312). The score represents the degree of relevancy between each document and the classification symbols “product A” and “product B”.
When the score exceeds the threshold, the related term correspondence information is referred to (STEP313), and an appropriate classification symbol is assigned (STEP314).
For example, when the appearance frequencies of the related terms “coding process” and “product a” and the evaluated value of the related term “coding process” are high and the score representing the degree of relevancy to the classification symbol “product A” exceeds the threshold in a certain document, the document is assigned the classification symbol “product A”.
At this time, when the appearance frequency of the related term “product b” is also high and the score representing the degree of relevancy to the classification symbol “product B” exceeds the threshold, the document is assigned the classification symbol “product B” besides the classification symbol “product A”. On the contrary, when the appearance frequency of the related term “product b” is low and the score representing the degree of relevancy to the classification symbol “product B” does not exceed the threshold, the document is only assigned the classification symbol “product A”.
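The threshold logic of STEP312 to STEP314, including the case where one document receives both classification symbols, can be sketched as follows. The score used here is a plain sum of appearance frequency times evaluated value, which is only a stand-in assumed for this example and is not the expression (1) of the embodiment. The function name, the evaluated values, and the thresholds are likewise hypothetical.

```python
def second_classification(text, related_terms, thresholds):
    """Return every classification symbol whose score exceeds its threshold.

    related_terms: dict mapping a symbol to {related term: evaluated value}.
    thresholds: dict mapping a symbol to the score required to assign it.
    """
    lowered = text.lower()
    symbols = []
    for symbol, terms in related_terms.items():
        # Stand-in score (assumption): frequency x evaluated value, summed.
        score = sum(lowered.count(term) * value for term, value in terms.items())
        if score > thresholds[symbol]:
            symbols.append(symbol)
    return symbols


related_terms = {
    "product A": {"coding process": 2.0, "product a": 1.0},
    "product B": {"decode": 2.0, "product b": 1.0},
}
thresholds = {"product A": 3.0, "product B": 3.0}

doc_1 = "The coding process for product a changed; coding process review"
doc_2 = doc_1 + " and product b must decode, decode again"
print(second_classification(doc_1, related_terms, thresholds))  # ['product A']
print(second_classification(doc_2, related_terms, thresholds))  # ['product A', 'product B']
```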
In the second automatic classifier 301, the evaluated value of the related term is recalculated according to the following expression (2) using the score calculated in STEP432 on the fourth stage, and the evaluated value is weighted (STEP315).
[Expression 4]
wgt_{i,L} = \sqrt{wgt_{i,L-1}^{2} + \gamma_{L}\, wgt_{i,L}^{2} - ∂} = \sqrt{wgt_{i,0}^{2} + \sum_{l=1}^{L} \left( \gamma_{l}\, wgt_{i,l}^{2} - ∂ \right)} (2)
For example, when a certain number of documents occur that have a significantly high appearance frequency of “decode” but a score at or below a certain value, the evaluated value of the related term “decode” is reduced and recorded in the related term correspondence information again.
On the fourth stage, as shown in
A detailed flow of the classification symbol accepting and assigning unit 131 on the fourth stage is described with reference to
The user views a document display screen 11 that is displayed on the document display unit 130 and shown in
Next, a detailed flow of the document analyzer 118 is described with reference to
Furthermore, in consideration of the results analyzed in STEP 422 and STEP 423, the tendency information on the document assigned the classification symbol “important” is analyzed (STEP424).
In
In the embodiment of the present invention, the classification symbol accepting and assigning unit 131 extracts words plotted higher than a straight line R_hot=R_all as the common words with the classification symbol “important”.
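The selection of words lying above the line R_hot=R_all can be sketched as below. The sketch assumes that R_hot is the fraction of the documents assigned “important” by the user that contain a given word, and that R_all is the fraction of all reviewed documents that contain it; the function name and the representation of each document as a set of words are also assumptions.

```python
def common_words(hot_docs, all_docs):
    """Return the words whose R_hot exceeds R_all, in alphabetical order.

    hot_docs: list of word sets for documents assigned "important" by the user.
    all_docs: list of word sets for all reviewed documents (includes hot_docs).
    """
    def doc_ratio(word, docs):
        # Fraction of the given documents that contain the word.
        return sum(word in doc for doc in docs) / len(docs)

    vocabulary = set().union(*hot_docs)
    return sorted(
        word for word in vocabulary
        if doc_ratio(word, hot_docs) > doc_ratio(word, all_docs)
    )


hot_docs = [{"infringement", "meeting"}, {"infringement", "claim"}]
all_docs = hot_docs + [{"meeting", "lunch"}, {"meeting"}]
print(common_words(hot_docs, all_docs))  # ['claim', 'infringement']
```

Here “meeting” appears in half of the “important” documents but in three quarters of all documents, so it falls below the line and is not extracted.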
The processes in STEP421 to STEP424 are executed also to documents assigned the classification symbols “product A” and “product B”, and the tendency information on the documents is analyzed.
Next, a detailed processing flow of the third automatic classifier 401 is described with reference to
The third automatic classifier 401 reflects the classification result in each database using the scores calculated in STEP432 (STEP434). More specifically, a process may be performed that reduces the evaluated values of the keyword and the related term included in the document with a low score while increasing the evaluated values of the keyword and the related term included in the document with a high score.
Furthermore, an example of a detailed processing flow of the third automatic classifier 401 is described with reference to
The third automatic classifier 401 reflects the classification result in each database using the scores calculated in STEP443 (STEP445). More specifically, a process is performed that reduces the evaluated values of the keyword and the related term included in the document with a low score while increasing the evaluated values of the keyword and the related term included in the document with a high score.
As described above, score calculation is performed by both the second automatic classifier 301 and the third automatic classifier 401. When the number of score calculations is high, data items for score calculation may be collectively stored in the score calculation database 106.
A detailed processing flow of the quality inspector 501 on the fifth stage is described with reference to
The classification symbol accepted by the classification symbol accepting and assigning unit 131 is compared with the classification symbol determined in STEP511 (STEP512), and the appropriateness of the classification symbol accepted in STEP411 is verified (STEP513).
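The comparison in STEP512 and STEP513 can be sketched as an agreement check between the two sets of classification symbols. The function name, the dictionary layout, and the use of a simple agreement rate as the measure of appropriateness are assumptions; the embodiment does not state which measure is used.

```python
def verify_classification(user_symbols, determined_symbols):
    """Compare user-assigned symbols with symbols determined from tendency information.

    Both arguments: dict mapping a document id to one classification symbol.
    Returns (agreement_rate, mismatched_ids) over the documents present in both.
    """
    shared = sorted(user_symbols.keys() & determined_symbols.keys())
    mismatched = [d for d in shared if user_symbols[d] != determined_symbols[d]]
    rate = 1 - len(mismatched) / len(shared) if shared else 0.0
    return rate, mismatched


user_symbols = {1: "important", 2: "product A", 3: "important", 4: "product B"}
determined_symbols = {1: "important", 2: "product B", 3: "important", 4: "product B"}
print(verify_classification(user_symbols, determined_symbols))  # (0.75, [2])
```

Documents listed as mismatched are candidates for a second look by the reviewer or for use in the learning process.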
The document analysis system 1 can predict possible events in the future by analyzing existing data. Consequently, the document analysis system 1 can take measures that prevent unfavorable situations, such as development to a litigation, for example.
The control blocks of the document analysis system 1 may be implemented by logic circuits (hardware) formed on an integrated circuit (IC chip) and the like, or by software using a CPU (Central Processing Unit). In the latter case, the document analysis system 1 includes a CPU that executes instructions of a program (control program), which is software implementing each function, ROM (Read Only Memory) or a storage device (which is called a “recording medium”) where the program and various data items are recorded in a manner readable by a computer (or CPU), and RAM (Random Access Memory) where the program is deployed. The computer (or CPU) reads the program from the recording medium and executes the program, thereby achieving the object of the present invention. The recording medium may be a “non-transitory, tangible medium”, for example, a tape, a disk, a card, a semiconductor memory, a programmable logic circuit, etc. The program may be supplied to the computer via any transmission medium (communication network, broadcast waves, etc.) that can transmit the program. The present invention can be achieved in a form of a data signal embedded in carrier waves implemented through electronic transmission of the program.
The present invention is not limited to each of the embodiments, and can be variously changed within a range represented by the claims. Embodiments obtained by appropriately combining pieces of technical means disclosed in different embodiments are also included in the technical scope of the present invention. Furthermore, combination of pieces of technical means disclosed in the embodiments can form new technical characteristics.
A document classification and investigation system that obtains digital information recorded in multiple computers or servers, analyzes document information including multiple documents included in the obtained digital information, and investigates a degree of relevancy between an investigation case and the document through assigning the document a classification symbol representing a degree of relevancy to the investigation case so as to facilitate use for the investigation case, includes: a score calculator that extracts a document from the document information, and calculates a score that represents a strength of connection of the extracted document to the classification symbol in a time-series manner; a score change detector that detects time-series change in score from the calculated score; and a score change determiner that investigates and determines the relevancy between the investigation case and the document from the detected time-series change in the score.
In the document classification and investigation system, the score change detector includes: a score moving average calculator that calculates a moving average of scores; and a score difference moving average calculator that calculates a difference moving average of scores from a short-term moving average and long-term moving average of the scores.
In the document classification and investigation system, the score change determiner investigates and determines the degree of relevancy between the investigation case and the extracted document, based on a point where a sign of the difference of different moving averages changes, or a region where the difference of different moving averages is positive.
A document classification and investigation method that obtains digital information recorded in multiple computers or servers, analyzes document information including multiple documents included in the obtained digital information, and investigates a degree of relevancy between an investigation case and the document through assigning the document a classification symbol representing a degree of relevancy to the investigation case so as to facilitate use for the investigation case, causes a computer to: extract a document from the document information, and calculate a score that represents a strength of connection of the extracted document to the classification symbol in a time-series manner; detect time-series change in score from the calculated score; and investigate the relevancy between the investigation case and the extracted document from the detected time-series change in the score.
The document classification and investigation method calculates a short-term moving average and a long-term moving average of scores by calculating a moving average of scores, and detects time-series change in score by calculating a difference moving average of scores from the short-term moving average and long-term moving average of scores.
The document classification and investigation method investigates and determines the degree of relevancy between the investigation case and the extracted document, based on a point where a sign of the difference of different moving averages changes, or a region where the difference of different moving averages is positive.
A document classification and investigation program that obtains digital information recorded in multiple computers or servers, analyzes document information including multiple documents included in the obtained digital information, and investigates a degree of relevancy between an investigation case and the document through assigning the document a classification symbol representing a degree of relevancy to the investigation case so as to facilitate use for the investigation case, causes a computer to achieve: a function of extracting a document from the document information, and calculating a score that represents a strength of connection of the extracted document to the classification symbol in a time-series manner; a function of detecting time-series change in score from the calculated score; and a function of investigating the relevancy between the investigation case and the extracted document from the detected time-series change in the score.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/JP2014/052578 | 2/4/2014 | WO | 00 |