The present invention relates to a device for analyzing a document, particularly a document to be surveyed and a document group, and to a device, a program, and a method for automatically creating an information analysis report that characterizes the document and the document group.
The number of technical documents, including patent documents and other documents, has steadily increased year by year. In recent years, since the electronic distribution of documents has become possible, automatic searching systems capable of searching only for documents similar to a document to be surveyed among an enormous number of documents have been put into practical use. However, the number of similar documents obtained from a search result is still large, and a person skilled in the art must read and evaluate the similar documents in order to understand the content or nature of the document to be surveyed.
For example, in the “search device for similar document and search method for the same” disclosed in Patent Document 1, index terms contained in a document or a document group for research are compared to index terms included in a document group for comparison, similarity measures are calculated based on the kinds of similar index terms or the occurrence frequencies of the terms, and the documents are output in descending order of similarity measure, starting from the most similar document.
[Patent Document 1] JP-A-11-73415
In an automatic similar document searching system like the one disclosed in Patent Document 1, a list of documents similar to a document to be surveyed is output from a document group for comparison as a search result. The evaluator must then extract and read anywhere from several to several thousand similar documents from that list, evaluate them, and determine the nature of the document to be surveyed based on them. The evaluator therefore has to read a large number of documents before finding an exact expression for the nature of the document to be surveyed.
It is therefore an object of the invention to automatically create an information analysis report that can exactly report about the information of a document to be surveyed without the necessity of human inspection of the contents of the document to be surveyed and an enormous number of documents to be compared.
In order to solve the above disadvantage, a device for automatically creating information analysis report according to the invention creates a report representing characteristics of a document to be surveyed relative to a document to be compared in information analysis of the document to be surveyed, and the device includes input means for receiving input of at least the document to be surveyed, selecting means for selecting a population document group from the information of a document to be compared group stored in a database based on the input document to be surveyed, the population document group being a set of population documents similar to the document to be surveyed, extracting means for extracting characteristic index terms of the document to be surveyed relative to the population documents, creating means for creating an information analysis report representing characteristics of the document to be surveyed based on the population documents and the index terms, and output means for outputting the information analysis report to display means, recording means, or communicating means.
For example, the device further includes calculating means for calculating a similarity relative to the document to be compared, and the selecting means selects population documents based on the result by the calculating means. Further, the calculating means calculates a similarity based on a function value of an occurrence frequency per index term in each document and a document frequency.
The device further includes map creating means for having the population or the index terms distributed in a map state, output data obtaining means for obtaining part of the data of the population or the index terms, fixed comment obtaining means for obtaining a fixed comment corresponding to the content of the map and data, and comment entering means for entering a free comment, and the creating means creates an information analysis report representing characteristics of the document to be surveyed by combining the map, the data and/or the comment.
In a preferred embodiment, the creating means carries out totaling for each of the index terms or prescribed items in the population documents, the totaling including keyword totaling, time-series totaling representing the time-series transition of keywords or prescribed items in the population documents, and/or matrix totaling for a plurality of prescribed items in the population documents and creates an information analysis report including the results of totaling.
More preferably, the creating means creates a portfolio represented by the totaling result of prescribed items in the keywords or the population documents and a matrix of the time-series increase ratio of the totaling result in the time-series totaling, and creates an information analysis report including the portfolio.
In another preferred embodiment, the creating means includes first occurrence value frequency calculating means for calculating a function value of the occurrence frequency of the extracted index term in the document to be compared group, second occurrence value frequency calculating means for calculating a function value of the occurrence frequency of the extracted index term in the population document group, and frequency scatter diagram creating means for creating a frequency scatter diagram including each index term and its positioning data based on a combination of the calculated function value of the occurrence frequency in the document to be compared group and the function value of the occurrence frequency in the population document group for each index term.
According to yet another embodiment, the creating means includes extracting means for extracting the content data and time data of the population documents or the document to be surveyed and the population documents, tree-like diagram creating means for creating a tree-like diagram representing the co-relation between the plurality of documents based on the content data of each document, clustering means for cutting the tree-like diagram according to a prescribed rule and extracting a cluster, and inside cluster arranging means for determining the arrangement of the document group belonging to each cluster in the cluster based on the time data of each document.
More preferably, the clustering means cuts the tree-like diagram to extract a parent cluster, creates a partial tree-like diagram representing the co-relation of the document group belonging to the parent cluster based on the content data of each document belonging to the parent cluster, and cuts the created partial tree-like diagram according to a prescribed rule to extract a descendant cluster.
The clustering means preferably removes from each document vector a vector component whose deviation among a plurality of documents belonging to the parent cluster is smaller than a value determined by a prescribed method in order to create the partial tree-like diagram.
According to a still further embodiment, the creating means includes evaluation value calculating means for calculating an evaluation value in each cluster for each index term; concentration degree calculating means for calculating the sum of the evaluation values in each cluster for each index term in all the clusters, calculating the ratio of the evaluation value in each cluster relative to the sum, calculating a square of each ratio, and calculating the degree of concentration in the distribution of each index term in the clusters, obtained by calculating the sum of the squares of the ratios in all the clusters; share calculating means for calculating the sum of the evaluation values of the index terms in the cluster to be analyzed for all the index terms extracted from each cluster, and calculating the share of each index term in the cluster to be analyzed, obtained by calculating the ratio of the evaluation value of each index term relative to the sum; first inverse calculating means for calculating a function value of the inverse of the occurrence frequency of each index term in the cluster; second inverse calculating means for calculating a function value of the inverse of the occurrence frequency of each index term in all the documents including the cluster; creativity degree calculating means for calculating a creativity degree by a function value of the result of subtracting the result of calculation by the second inverse calculating means from the result of calculation by the first inverse calculating means; and keyword extracting means for extracting keywords based on a combination of a degree of concentration calculated by the concentration degree calculating means, a share calculated for each document group for analysis by the share calculating means, and a creativity degree calculated by the creativity degree calculating means.
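The degree of concentration and the share described above can be illustrated with a minimal Python sketch. The function names, the data layout (a mapping from index term to a list of per-cluster evaluation values), and the sample numbers are assumptions made for illustration only; the text does not specify how the evaluation values themselves are obtained, and the creativity degree is not modeled here.

```python
def concentration(evals_by_cluster):
    """Degree of concentration of one index term.

    Sum, over all clusters, of the squared ratio of the term's
    evaluation value in each cluster to its total over all clusters.
    """
    total = sum(evals_by_cluster)
    if total == 0:
        return 0.0
    return sum((value / total) ** 2 for value in evals_by_cluster)


def shares(evals, cluster):
    """Share of each index term within one cluster to be analyzed.

    Ratio of each term's evaluation value in the cluster to the sum of
    the evaluation values of all index terms in that cluster.
    """
    total = sum(values[cluster] for values in evals.values())
    if total == 0:
        return {term: 0.0 for term in evals}
    return {term: values[cluster] / total for term, values in evals.items()}


# Hypothetical evaluation values: evals[index term][cluster number].
evals = {"red": [3.0, 1.0], "blue": [2.0, 2.0]}

red_concentration = concentration(evals["red"])    # mostly in cluster 0
blue_concentration = concentration(evals["blue"])  # evenly spread
cluster0_shares = shares(evals, 0)
```

In this sketch an index term that occurs in only one cluster has a concentration of 1, while a term spread evenly over k clusters has a concentration of 1/k.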
The device for creating information analysis report according to the invention further includes a web server connected to a network and accepting input of a document to be surveyed from a client connected through the network, a management server that queues said document to be surveyed and requests an analysis server to process the document to be surveyed that is to be processed next, and the analysis server that responds to said request to select a population document group that is a set of population documents similar to the document to be surveyed from information of a document to be compared group stored in a database based on said input document to be surveyed, extract characteristic index terms of said document to be surveyed relative to the population document group, and create an information analysis report representing characteristics of said document to be surveyed.
In order to solve the above-described disadvantage, a program for automatically creating information analysis report creates a report representing characteristics of a document to be surveyed relative to a document to be compared in information analysis of the document to be surveyed and enables a computer to function as input means for accepting input of at least the document to be surveyed, selecting means for selecting a population document group from the information of a document to be compared group stored in a database based on the input document to be surveyed, the population document group being a set of population documents similar to the document to be surveyed, extracting means for extracting characteristic index terms of the document to be surveyed relative to the population documents, creating means for creating an information analysis report representing characteristics of the document to be surveyed based on the population documents and the index terms, and output means for outputting the information analysis report to display means, recording means, or communicating means.
For example, the program further enables the computer to function as calculating means for calculating a similarity relative to the document to be compared, and the selecting means selects population documents based on the result by the calculating means. The calculating means calculates a similarity based on a function value of an occurrence frequency per index term in each document and a document frequency.
For example, the program further enables the computer to function as at least one of map creating means for having the population or the index terms distributed in a map state, output data obtaining means for obtaining part of the data of the population or the index terms, fixed comment obtaining means for obtaining a fixed comment corresponding to the content of the map and data, and comment entering means for entering a free comment, the creating means creating an information analysis report representing characteristics of the document to be surveyed by combining the map, the data and/or the comment.
In order to solve the above-described disadvantage, a method for automatically creating information analysis report creates a report representing characteristics of a document to be surveyed relative to a document to be compared in information analysis of the document to be surveyed, the method including the steps of accepting input of at least the document to be surveyed, selecting a population document group from the information of a document to be compared group stored in a database based on the input document to be surveyed, the population document group being a set of population documents similar to the document to be surveyed, extracting characteristic index terms of the document to be surveyed relative to the population documents, creating an information analysis report representing characteristics of the document to be surveyed based on the population documents and the index terms, and outputting the information analysis report to display means, recording means, or communicating means.
For example, the method further includes the step of calculating a similarity relative to the document to be compared, wherein in the selecting step, population documents are selected based on the result by the calculating step. Further, in the calculating step, a similarity is calculated based on a function value of an occurrence frequency per index term in each document and a document frequency.
For example, the method further includes a map creating step of having the population or the index terms distributed in a map state, an output data obtaining step of obtaining part of the data of the population or the index terms, a fixed comment obtaining step of obtaining a fixed comment corresponding to the content of the map and data, and a comment entering step of entering a free comment, and in the creating step, an information analysis report representing characteristics of the document to be surveyed is created by combining the map, the data and/or the comment.
According to the invention, based on an input document to be surveyed and documents to be compared and conditions for information analysis, population documents consisting of a document group similar to a document to be surveyed are selected from the documents to be compared, index terms characteristic of the document to be surveyed relative to the population documents are extracted, and an information analysis report representing the characteristics of the document to be surveyed is created based on the population documents and the index terms.
In this way, a person does not have to read the contents of the document to be surveyed and an enormous number of documents to be compared and still an information analysis report that can exactly report about the information of the document to be surveyed can automatically be created.
An information analysis report representing characteristics of the document to be surveyed can also be created by combining a map formed by distributing the population or the index terms, the data of the population or the index terms, and a fixed comment or a free comment according to the content of the map and data.
According to the invention, the document to be surveyed and the documents to be compared are specified and input, a condition for information analysis is input, population documents consisting of a document group similar to the document to be surveyed are selected from the documents to be compared, index terms characteristic of the document to be surveyed relative to the population are extracted, an information analysis report representing the characteristics of the document to be surveyed is created, and the obtained information analysis report is output to display means, recording means or communicating means.
For example, similarities relative to the documents to be compared are calculated, and population documents are selected based on the result of the calculation. In the calculating step, a similarity is calculated based on a function value of the occurrence frequency and a document frequency for each index term in each document.
In this way, an information analysis report that can exactly report about the information of a document to be surveyed can automatically be created without the necessity of human reading of the contents of the document to be surveyed and an enormous number of documents to be compared.
The device further includes map creating means for having the population or the index terms distributed in a map state, output data obtaining means for obtaining part of the data of the population or the index terms, fixed comment obtaining means for obtaining a fixed comment corresponding to the content of the map and data, and comment entering means for entering a free comment, and the creating means creates an information analysis report representing characteristics of the document to be surveyed by combining the map, the data and/or the comment. Therefore, an information analysis report including the map, the population or the index term data, and a fixed comment or free comment according to the contents of the map and the data can be created.
An embodiment of the invention will now be described in detail with reference to the accompanying drawings.
Terms used herein are defined as follows.
d: document to be surveyed (a case related to research. For example, a document such as patent Publication No. ______ or group of such documents)
documents to be compared: all documents P or population documents S
P: all documents (the entire set of documents to be compared including the document to be surveyed d)
N: the number of all the documents P
p: one document in all the documents (N documents exist such as pa, pb, . . . )
S: population documents (part of all the documents P and a document group similar to a document to be surveyed d among all the documents P in the embodiment (including d))
N′: the number of the population documents S (N′<N)
s: one document in the population documents (N′ documents exist such as sa, sb, . . . )
In the drawings, d or (d), P or (P), p or (p), and S or (S) attached to components represent a document to be surveyed, documents to be compared, one document of all the documents, and population documents, respectively, and these characters will hereinafter be attached to components as well as operations for clarification. For example, an index term (d) means an index term in a document to be surveyed d. More specifically, according to the embodiment, it will be assumed that there are x index terms in a document d, represented as d1, d2, d3, . . . , dx. There are ya index terms in a document pa, represented as pa1, pa2, . . . , paya, and a part of or all of these terms would match d's index terms d1, d2, . . . , dx in some cases.
There are yb index terms in a document pb, represented as pb1, pb2, . . . , pbyb, and similarly, a part of or all of these terms would match d's index terms d1, d2, . . . , dx in some cases.
Likewise, there are yy index terms in a document py, represented as py1, py2, . . . , pyyy, and a part of or all of these terms would match d's index terms d1, d2, . . . , dx in some cases.
Note that, among the index terms of the document pa, those that do not coincide with the index terms d1, d2, . . . , dx contribute “0” to the inner product of the document vectors, as will be described. Therefore, only d's index terms d1, d2, . . . , dx are necessary as index terms for processing.
TF Operation
TF calculation stands for Term Frequency calculation and is a calculation to obtain a function value of the count of occurrence frequencies (index term frequencies) of index terms in a document (for example, the document to be surveyed).
DF calculation stands for Document Frequency calculation and is a calculation to obtain the count of the number of hits (document frequencies) when a group of documents to be compared is searched based on index terms included in a certain document.
IDF calculation is, for example, a calculation to obtain the inverse of the result of DF calculation, or the logarithm of the result obtained by multiplying that inverse by the number of documents in P or S. The effect of the logarithm is to expand the intervals of the scale for function values near zero and to compress the intervals for larger values, so that all the values can easily be viewed in one plane.
Functions used in the embodiment will be expressed as follows.
TF(d): the occurrence frequency in d based on d's index terms (d1, . . . , dx). Then, TF(d) can be rewritten into the form of TF(index term; document) as follows.
TF(d1; d): the occurrence frequency based on document d's index term d1 in document d
TF(d2; d): the occurrence frequency based on document d's index term d2 in document d
TF(dx; d): the occurrence frequency based on document d's index term dx in document d
TF(pa): the occurrence frequency in pa based on pa's index terms (pa1, . . . , paya)
Then, TF(pa) can be rewritten in the form of TF(index term; document) as follows.
TF(pa1; pa): the occurrence frequency based on document pa's index term pa1 in document pa
TF(pa2; pa): the occurrence frequency based on document pa's index term pa2 in document pa
TF(paya; pa): the occurrence frequency based on document pa's index term paya in document pa
However, as will be described, as for TF(pa), only the following occurrence frequencies are necessary.
TF(d1; pa): the occurrence frequency based on d's index term d1 in document pa
TF(d2; pa): the occurrence frequency based on d's index term d2 in document pa
TF(dx; pa): the occurrence frequency based on d's index term dx in document pa
TF(d1; pb): the occurrence frequency based on d's index term d1 in document pb
TF(d2; pb): the occurrence frequency based on d's index term d2 in document pb
TF(dx; pb): the occurrence frequency based on d's index term dx in document pb
TF(d1; py): the occurrence frequency based on d's index term d1 in document py
TF(d2; py): the occurrence frequency based on d's index term d2 in document py
TF(dx; py): the occurrence frequency based on d's index term dx in document py
In other words, among document pa's index terms (pa1, . . . , paya), only those matching (d1, . . . , dx) need to be calculated.
TF(pb) is the occurrence frequency in document pb; for example, TF(d1; pb) represents the occurrence frequency based on d's index term d1 in document pb. Likewise, TF(py) is the occurrence frequency in document py; for example, TF(d2; py) represents the occurrence frequency based on d's index term d2 in document py.
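The TF definitions above can be illustrated with a short Python sketch. It assumes, purely for illustration, that each document has already been segmented into a list of index terms and that the "function value" of the count is the raw count itself; the function name and the sample terms are hypothetical.

```python
from collections import Counter


def term_frequencies(document_terms, index_terms):
    """Return TF(term; document) for each of the given index terms.

    Only the index terms of the document to be surveyed d are counted,
    since other terms are not needed for later processing.
    """
    counts = Counter(document_terms)
    return {term: counts[term] for term in index_terms}


# Hypothetical segmented documents.
d_terms = ["red", "blue", "red", "yellow"]   # document to be surveyed d
pa_terms = ["red", "white", "white"]         # one document to be compared pa

index_terms_d = ["red", "blue", "yellow"]    # d1, d2, d3

tf_d = term_frequencies(d_terms, index_terms_d)    # TF(d1; d), TF(d2; d), ...
tf_pa = term_frequencies(pa_terms, index_terms_d)  # TF(d1; pa), TF(d2; pa), ...
```

The term "white" occurs in pa but is not an index term of d, so it does not appear in `tf_pa`.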
DF Operation
DF(P): the document frequency in P based on d's index terms
DF(P) is a value that indicates how frequently the same index terms d1, . . . , dx as the index terms in document d are used in all the documents. For example, if an index term “device” is used in 1/10 of six million documents, DF is 600 thousand.
Similarly, it can be rewritten in the form of DF(index term; all documents) as follows.
DF(d1; P): among the N documents (pa to py) in all of P, the document frequency (the number of documents) with at least one occurrence of d's index term d1
DF(d2; P): among the N documents (pa to py) in all of P, the document frequency (the number of documents) with at least one occurrence of d's index term d2
DF(dx; P): among the N documents (pa to py) in all of P, the document frequency (the number of documents) with at least one occurrence of d's index term dx
As for DF(S), the definition may be written in the same manner, but the detailed description is not provided.
DF(S): the document frequency in S based on d's index term
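As a minimal sketch of the DF definitions above, the following Python code counts, for each index term of d, the number of documents in which the term occurs at least once. The function name and the tiny document set are hypothetical; the same function could be applied to the population documents S to obtain DF(S).

```python
def document_frequencies(documents, index_terms):
    """Return DF(term; documents) for each index term.

    DF is the number of documents containing at least one occurrence
    of the term, regardless of how often it occurs in each document.
    """
    term_sets = [set(terms) for terms in documents]
    return {
        term: sum(1 for terms in term_sets if term in terms)
        for term in index_terms
    }


# Hypothetical set of all documents P (the document d itself is included).
all_documents = [
    ["red", "blue", "red", "yellow"],  # d
    ["red", "white"],                  # pa
    ["white", "blue", "blue"],         # pb
    ["green"],                         # pc
]

df_P = document_frequencies(all_documents, ["red", "blue", "yellow"])
```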
IDF
IDF as will be described is the inverse of the ratio of DF (the document frequency based on d's index term in all the documents P) to N (the number of all the documents), and is represented by its logarithm for equal distribution.
IDF(P): the logarithm of the inverse of DF(P) multiplied by the number of documents: ln [N/DF(P)]
IDF(S): the logarithm of the inverse of DF(S) multiplied by the number of documents: ln [N′/DF(S)]
If, for example, N (the number of all the documents) is six million and DF(d1; P)=six million, in other words, if a certain index term d1 is included in all the documents P, IDF(d1; P)=ln 1=0. If DF(d2; P)=600 thousand, in other words, if a certain index term d2 is included in 1/10 of all the documents, IDF(d2; P)=ln 10≈2.3 (or exactly 1 if a base-10 logarithm is used instead).
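The IDF values for the six-million-document example can be checked with a few lines of Python. This is only a sketch: the function name is hypothetical, and the natural logarithm follows the ln [N/DF(P)] form given above. A value of exactly 1 for a term found in 1/10 of the documents corresponds to a base-10 logarithm.

```python
import math


def idf(num_documents, document_frequency):
    """IDF = ln(N / DF). DF is assumed to be at least 1, which holds for
    d's index terms because d itself is included in P."""
    return math.log(num_documents / document_frequency)


N = 6_000_000

idf_in_all_documents = idf(N, 6_000_000)    # term occurs in every document
idf_in_one_tenth = idf(N, 600_000)          # term occurs in 1/10 of documents
idf_in_one_tenth_base10 = math.log10(N / 600_000)
```

A term that occurs in every document therefore carries no weight, and the weight grows as the term becomes rarer.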
TFIDF and Document Vectors
TFIDF: the product of the function value of TF and the function value of IDF (the inverse of DF), calculated for each index term in a document. This is the numerical value for each index term on which the similarity of documents is based: it increases in proportion to the occurrence frequency of the term in the document and decreases with the term's document frequency, which enters as an inverse function value.
As a simple example, let us now consider multiplication of TF(d) and IDF(P) on a one to one basis. Note however that the calculation is not necessarily limited to the one-to-one basis. For example, the component of the document vector of d may be considered as follows.
TF(d1; d)*IDF(d1; P)
TF(d2; d)*IDF(d2; P)
TF(dx; d)*IDF(dx; P)
The components of the document vector of pa are considered as follows:
TF(d1; pa)*IDF(d1; P)
TF(d2; pa)*IDF(d2; P)
TF(dx; pa)*IDF(dx; P)
Herein, the document vector has as a component the values of index terms obtained by operating TFIDF for each index term in a document.
The components of the vector of document d are represented, for example, as TF(d1; d)*IDF(d1; P), . . . , TF(dx; d)*IDF(dx; P). A component of the vector of the document pa is represented, for example, as TF(dx; pa)*IDF(dx; P). More specifically, the document vectors are as follows.
{document vector of document d}={TF(d1; d)*IDF(d1; P), TF(d2; d)*IDF(d2; P), . . . , TF(dx; d)*IDF(dx; P)}
{document vector of document pa}={TF(d1; pa)*IDF(d1; P), TF(d2; pa)*IDF(d2; P), . . . , TF(dx; pa)*IDF(dx; P)}
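The construction of the two document vectors can be sketched in Python as follows. The sketch assumes the simple one-to-one product TF*IDF with IDF=ln(N/DF); the function name and all TF and DF values are hypothetical sample data, not values from the embodiment.

```python
import math


def tfidf_vector(tf, df, num_documents, index_terms):
    """Build a document vector over d's index terms.

    Each component is TF(term; document) * IDF(term; P), with
    IDF(term; P) = ln(N / DF(term; P)). Terms absent from the
    document have TF 0 and therefore a zero component.
    """
    return [
        tf.get(term, 0) * math.log(num_documents / df[term])
        for term in index_terms
    ]


index_terms = ["red", "blue", "yellow"]        # d1, d2, d3
N = 50                                         # number of all documents
df_P = {"red": 25, "blue": 5, "yellow": 50}    # hypothetical DF(term; P)

tf_d = {"red": 2, "blue": 1, "yellow": 1}      # hypothetical TF(term; d)
tf_pa = {"red": 1}                             # hypothetical TF(term; pa)

vector_d = tfidf_vector(tf_d, df_P, N, index_terms)
vector_pa = tfidf_vector(tf_pa, df_P, N, index_terms)
```

Both vectors share the same component order (d1, . . . , dx), which is what allows them to be compared component by component.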
Similarity Ratio (Similarity Measure)
A similarity indicates the degree of similarity between two documents and is also referred to as a “similarity measure” in this specification. According to the embodiment, a numerical value is obtained as the inner product of two document vectors in order to measure the proximity of the natures of the two documents. For example, the similarity (d, pa; P) of a search document d to a document to be compared pa that belongs to a document to be compared group P is obtained as the inner product of the document vector (d) of the search document d and the document vector (pa) of the document to be compared pa.
{similarity (d, pa; P)}={document vector of document d}·{document vector of document pa}=[{TF(d1; d)*IDF(d1; P)}*{TF(d1; pa)*IDF(d1; P)}+{TF(d2; d)*IDF(d2; P)}*{TF(d2; pa)*IDF(d2; P)}+ . . . +{TF(dx; d)*IDF(dx; P)}*{TF(dx; pa)*IDF(dx; P)}]
The similarity of a document to be compared p: according to the embodiment, the similarity of a search document d to a certain document to be compared p that belongs to a document to be compared group P. The similarity is the inner product of the document vector (d) of the search document d and the document vector (p) of the document to be compared p, that is, the sum of the products of their corresponding components.
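The inner-product similarity can be sketched in Python as below. The vectors are hypothetical TFIDF vectors over the same index terms d1, . . . , dx, and sorting the documents to be compared in descending order of similarity is shown only as one illustrative way of using the result; the function and variable names are assumptions.

```python
def similarity(vector_d, vector_p):
    """Inner product of two document vectors of equal length."""
    if len(vector_d) != len(vector_p):
        raise ValueError("document vectors must have the same length")
    return sum(a * b for a, b in zip(vector_d, vector_p))


# Hypothetical TFIDF vectors over d's index terms (d1, d2, d3).
vector_d = [1.4, 2.3, 0.0]
vectors_P = {
    "pa": [0.7, 0.0, 0.0],
    "pb": [0.0, 2.3, 0.0],
    "pc": [0.0, 0.0, 0.0],   # shares no index term with d
}

scores = {name: similarity(vector_d, vec) for name, vec in vectors_P.items()}

# Documents to be compared, most similar first.
ranked = sorted(scores, key=scores.get, reverse=True)
```

A document that shares no index term with d has a similarity of 0, which is why only d's index terms need to be evaluated in each document to be compared.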
Herein, an index term means a so-called keyword that is segmented from all or part of a document. Index terms may be extracted using a known conventional method or commercially available software, by extracting significant nouns while removing particles and conjunctions. Alternatively, a database of dictionaries (a thesaurus) of index terms may be prepared in advance, and index terms available from the database may be used.
Note that, for a document group consisting of a plurality of search documents, the item to be extracted may be an index term as described above, or it may be a group of terms on the basis of individual documents, IPC classes, a corporation, a group of corporations, an industry, or a year such as a patent application filing year or a patent registration year. In the following description, index terms are mostly used as typical examples.
Device for Automatically Creating Information Analysis Report
As shown in
As shown in
The input device 2 includes a search document d condition input unit 210, a document to be compared P condition input unit 220, and an extraction condition and others input unit 230.
The recording device 3 includes a condition recording unit 310, a work result storage unit 320, and a document storage unit 330. The document storage unit 330 includes an external database and an internal database. The external database means a document database such as IPDL, provided by the Japan Patent Office, or PATOLIS, provided by PATOLIS Corporation. The internal database means a personally compiled database that stores commercially available data such as patent JP-ROM, and also covers a device that reads data from a medium such as an FD (flexible disk), a CD-ROM (compact disk), an MO (magneto-optical disk), or a DVD (digital video disk), an OCR (optical character reader) that reads a document printed or handwritten on paper, and a device that converts read data into electronic data such as text.
The output device 4 includes a map creation condition read out unit 410, a map data obtaining unit 412, a map (graph/table) creation unit 415, a data output condition read out unit 420, an output data obtaining unit 422, a comment condition read out unit 430, a fixed comment acquisition unit 432, a comment addition unit 435, a report creation unit 440 that creates a report by combining a map, data, and a comment, and an output unit 450 that outputs the created report.
In
The functions in the device for automatically creating information analysis report 100 shown in
In the input device 2 shown in
In the processing device 1 shown in
The document to be compared P read out unit 130 reads population documents from the document storage unit 330 based on the reading condition stored in the condition recording unit 310 and transfers the documents to the index term (P) extraction unit 140. The index term (P) extraction unit 140 extracts index terms from documents obtained at the document to be compared P read out unit 130 according to the extraction condition stored in the condition recording unit 310 and stores the extracted index terms in the work result storage unit 320.
In the document to be compared P read out unit 130 and the index term (P) extraction unit 140, it is often the case that a patent publication as a whole, which is one type of document to be compared, is extracted. Once index terms have been segmented, prepared, and stored, they do not have to be segmented again, and the process can be omitted.
The TF(d) calculation unit 121 carries out TF calculation to the calculation result of the index term (d) extraction unit for the document to be surveyed d stored in the work result storage unit 320 based on the condition stored in the condition recording unit 310 to obtain TF (d; d), then stores the result in the work result storage unit 320 or transfers the result directly to the similarity calculation unit 150 or a characteristic index term/similarity in population/frequency scatter diagram/structure calculation unit 180.
The TF (P) calculation unit 141 carries out TF calculation to the calculation result of the index term (P) extraction unit for the documents to be compared P stored in the work result storage unit 320 to obtain TF(d; p) according to the condition stored in the condition recording unit 310, stores the result in the work result storage unit 320 or directly transfers the result to the similarity calculation unit 150 or the characteristic index term/similarity in population/frequency scatter diagram/structure calculation unit 180.
The IDF(P) calculation unit 142 carries out IDF calculation on each of the index terms (d) extracted from the document to be surveyed d stored in the work result storage unit 320 to obtain IDF(d; P) according to the condition stored in the condition recording unit 310, and stores the result in the work result storage unit 320 or transfers the result directly to the similarity calculation unit 150 or to the characteristic index term/similarity in population/frequency scatter diagram/structure diagram/etc calculation unit 180.
The similarity calculation unit 150 obtains the calculation results of the TF(d) calculation unit 121, the TF(P) calculation unit 141, and the IDF(P) calculation unit 142 directly from them or from the work result storage unit 320 based on the conditions stored in the condition recording unit 310. Note that as described above, the calculation result of the TF(d) calculation unit 121 is TF (d; d), the calculation result of the TF(P) calculation unit 141 is TF(d; p), and the calculation result of the IDF(P) calculation unit 142 is IDF(d; P). The similarity calculation unit 150 then calculates the similarities of the documents to be compared P to the document to be surveyed d. The results are attached to the documents to be compared P as their similarity data and are transferred to the work result storage unit 320 or directly to the population document S selection unit 160.
In the calculation of similarities by the similarity calculation unit 150, a calculation typically represented by TFIDF calculation is carried out, and the similarities of the documents to be compared P to the document to be surveyed d are calculated. The TFIDF calculation corresponds to the product of the TF calculation result and the IDF calculation result. An example of a method of calculating the similarities (similarity measures) will be described in detail.
Now, let us assume that d is a document to be surveyed, and p represents an individual document among the documents to be compared P. As a result of processing these documents d and p, assume that the index terms segmented from the document d are “red”, “blue”, and “yellow”. Also, it is assumed that the index terms segmented from the document p are “red” and “white”. In this case, the index term frequency of the index terms in the document d is TF(d), the index term frequency of the index terms segmented from the document p is TF(p), the document frequency of the index terms obtained from the document to be compared group P is DF(P), and the number of all the documents is 50.
In this example, the frequencies are shown in
Those in the boxes in
Document vector d=(1*ln(50/30),2*ln(50/20),4*ln(50/45),0)
Document vector p=(2*ln(50/30),0,0,1*ln(50/13))
Then, similarity measures are calculated. More specifically, by obtaining the inner product of the document vector d and the document vector p, the similarity measure between the document vector d and the document vector p is obtained. Note that the larger the value of the similarity measure between the document vectors, the higher the degree of the similarity between the documents, and in terms of the distance between the document vectors (dissimilarity measure) the smaller the value is, the higher will be the degree of the similarity. The inner product of the document vectors is the sum of the products of the components of the vectors and can therefore be obtained as follows.
(document vector d·document vector p)=1*ln(50/30)*2*ln(50/30)+0+0+0
where the last term of the right side is “0.” More specifically, the component of the inner product for any index term other than the index terms (d) extracted from the document to be surveyed d is “0”, so it is only necessary to carry out the TFIDF calculation for each of the index terms (d). In other words, if an index term is absent on one side, its component of the inner product is “0”; because only the index terms in d are subjected to calculation, the amount of calculation can be reduced.
Under the similarity described above, the more index terms matching d's index terms exist in p, the more components of the inner product are nonzero, so that a high value is obtained as the similarity. The fewer index terms matching d's index terms exist in p, the more components are zero, so that a low value is obtained as the similarity, which is the sum of the components.
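The worked example above can be reproduced with a short sketch. The term frequencies and document frequencies below are read back from the two document vectors given in the text; the assignment of the four vector components to “red”, “blue”, “yellow”, and “white”, in that order, is an assumption, as are the function names.

```python
import math

N_DOCS = 50  # number of all the documents, as stated in the example

# Assumed component order: red, blue, yellow, white.
DF_P = {"red": 30, "blue": 20, "yellow": 45, "white": 13}  # DF(P)
TF_D = {"red": 1, "blue": 2, "yellow": 4}                  # TF(d)
TF_SMALL_P = {"red": 2, "white": 1}                        # TF(p)


def tfidf_vector(tf, df, n_docs):
    """Sparse document vector: TF * ln(N / DF) for each term in the document."""
    return {term: freq * math.log(n_docs / df[term]) for term, freq in tf.items()}


def inner_product(u, v):
    """Only terms present in both vectors contribute; all others are 0."""
    return sum(weight * v[term] for term, weight in u.items() if term in v)


vec_d = tfidf_vector(TF_D, DF_P, N_DOCS)
vec_p = tfidf_vector(TF_SMALL_P, DF_P, N_DOCS)
similarity = inner_product(vec_d, vec_p)  # equals 1*ln(50/30) * 2*ln(50/30)
```

Because the vectors are stored sparsely, the loop in `inner_product` visits only d's index terms, which mirrors the reduction in calculation noted above.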
There are various other methods of calculating similarities. When the similarity calculation relies on the TF(d) calculation unit 121, the TF(P) calculation unit 141, and the IDF(P) calculation unit 142, it may be carried out as described above. Meanwhile, it is understood that if the method of calculating a similarity does not require the TF(d) calculation unit 121, the TF(P) calculation unit 141, and the IDF(P) calculation unit 142, all these units may be omitted, and only the similarity calculation unit 150 may be provided.
The population narrowing unit 151 is used to narrow down a population to be selected based on a selecting condition stored in the condition recording unit 310. For example, the population may be narrowed down to those by applicants with a large number of applications or a smaller number of applications conversely, special IPC, or limited fields of industry. If such narrow-down process is not necessary, the process may be omitted.
The population document S selection unit 160 selects population documents S as many as a number set in the condition from the work result storage unit 320 or directly as a result of the calculation of similarity calculation unit 150 based on the selecting condition stored in the condition recording unit 310 or from the population narrowing unit 151. For example, documents are sorted in the descending order of similarities, documents exactly as many as a necessary number in the condition are selected, and the selected documents are transferred to the work result storage unit 320 or directly to the index term (S) extraction unit 170.
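The selection described above amounts to a sort followed by a cut. The following is a minimal sketch, assuming the similarities are held in a dictionary keyed by a document identifier; the function name and data layout are hypothetical.

```python
def select_population(similarities, n_required):
    """Return the identifiers of the n_required most similar documents.

    similarities: dict mapping document id -> similarity to the document d.
    """
    ranked = sorted(similarities.items(), key=lambda item: item[1], reverse=True)
    return [doc_id for doc_id, _ in ranked[:n_required]]


population_s = select_population({"p1": 0.12, "p2": 0.87, "p3": 0.45}, 2)
```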
The process may proceed directly from the output of the population document S selection unit 160 to the map data obtaining unit 412 or the output data obtaining unit 422, in which case it is understood that the following process is not necessary.
The index term (S) extraction unit 170 extracts index terms (S) from the population documents S obtained from the work result storage unit 320 or as the result of the population document S selection unit 160 based on the condition stored in the condition recording unit 310 and transfers the extracted index terms (S) to the work result storage unit 320 or directly to the IDF(S) calculation unit 171.
The IDF(S) calculation unit 171 carries out IDF calculation on the result obtained from the work result storage unit 320 or directly from the index term (S) extraction unit 170 based on the condition stored in the condition recording unit 310, and stores the result in the work result storage unit 320 or transfers the result directly to the characteristic index term/similarity in population/frequency scatter diagram/structure diagram/etc calculation unit 180.
The characteristic index term/similarity in population/frequency scatter diagram/structure diagram/etc calculation unit 180 obtains, according to the condition stored in the condition recording unit 310, the results of the TF(d) calculation unit 121, the TF(P) calculation unit 141, the IDF(P) calculation unit 142, and the IDF(S) calculation unit 171, either from the work result storage unit 320 or directly from those units. The unit selects population documents and index terms, for example in descending order of similarities or keyword importance degrees, up to the necessary number written in the condition for selection or the number determined from the result of calculation based on the condition. The unit then calculates the frequency scatter diagram (keyword distribution diagram) or the structure diagram and stores the result in the work result storage unit 320.
In the recording device 3 shown in
The document storage unit 330 stores the necessary document data obtained from an external database or an internal database in response to the input device 2 or the processing device 1 and provides the data in response to the request from the processing device 1 or the output device 4.
In the output device 4 shown in
The map data obtaining unit 412 obtains the result of the population document S selection unit 160 and the characteristic index term/similarity in population/frequency scatter diagram/structure diagram/etc calculation unit 180 stored in the work result storage unit 320 as well as the data at the document storage unit 330 based on the conditions read out by the map creation condition read out unit 410 and transmits the results to the work result storage unit 320 or directly to the map (graph/table) creation unit 415.
Using the data from the map data obtaining unit 412, the map (graph/table) creation unit 415 creates a graph, a table, a title, a legend and the like. The result is transmitted to the report creation unit 440.
Based on the condition of the data output condition read out unit 420, the output data obtaining unit 422 obtains the result of the population document S selection unit 160 and the results of the characteristic index term TF(d)IDF(S) calculation unit 180 and the like stored in the work result storage unit 320 together with the data in the document storage unit 330 and sends the results to the work result storage unit or directly to the report creation unit 440.
Based on the condition of the comment condition read out unit 430, the fixed comment acquisition unit 432 obtains data from the work result storage unit 320 and the document storage unit 330 and sends the data to the comment addition unit 435 or directly to the report creation unit 440.
Based on the condition of the comment condition read out unit 430, the comment addition unit 435 prepares data to be added as a comment by an evaluator for the research data d that has been prepared directly from an external input device such as a keyboard or an OCR or prepared in advance in the internal database in the document storage unit 330 and sends the data to the work result storage unit 320 or directly to the report creation unit 440.
The report creation unit 440 obtains the conditions and data output from the map (graph/table) creation unit 415, the output data obtaining unit 422, the fixed comment acquisition unit 432, and the comment addition unit 435 directly or from the work result storage unit 320, shapes a map/data/comment into an optimum form as a paper output and creates an information analysis report. The created information analysis report is transmitted to the output unit 450.
The output unit 450 outputs the information analysis report to the recording means or communicating means. The output unit 450 has an automatic distributing function and outputs a new information analysis report periodically (such as once a month). Alternatively, such a new information analysis report is automatically distributed when the report is greatly changed from the previous one (such as when 10% or more of the content is changed).
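The change-triggered distribution can be illustrated as follows. The text does not say how the amount of change between two reports is measured; this sketch assumes, purely for illustration, that the reports are compared as plain text with the standard `difflib` module and that "changed" means one minus the matching ratio. The function names are hypothetical.

```python
import difflib

CHANGE_THRESHOLD = 0.10  # "10% or more of the content is changed"


def changed_fraction(previous_report, new_report):
    """Assumed measure: 1 - similarity ratio of the two report texts."""
    matcher = difflib.SequenceMatcher(None, previous_report, new_report)
    return 1.0 - matcher.ratio()


def should_redistribute(previous_report, new_report, threshold=CHANGE_THRESHOLD):
    """True when the new report differs enough to be distributed again."""
    return changed_fraction(previous_report, new_report) >= threshold
```

A periodic schedule (such as once a month) would be handled separately from this check, for example by the scheduler of the host system.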
Note that the above described report creation unit 440 can create an information analysis report only of a map and can output the result through the output unit 450.
Now, with reference to
As shown in
Meanwhile, if the condition in step S202 is a condition input for the documents to be compared P, the condition of the documents to be compared P is input in the document P condition input unit 220 (step S220). Then, the input condition is checked by the displayed screen (see
If the condition in step S202 is an extraction condition or any other condition, an extraction condition or the like is input in the extraction condition and others input unit 230 (step S230). Then, the input condition is checked by the displayed screen (see
As shown in
Meanwhile, in step S102, if the document to be read is a document to be compared P, the document to be compared P is read out in the document to be compared P read out unit 130 (step S130). Then, index terms for the document to be compared P are extracted in the index term (P) extraction unit 140 (Step S140). Subsequently, the extracted index terms are each subjected to TF calculation at the TF(P) calculation unit 141 (step S141) and to IDF calculation in the IDF(P) calculation unit 142 (step S142).
Then, based on the TF (d) calculation result of the output of the TF(d) calculation unit 121, the TF(P) calculation result of the output of TF(P) calculation unit 141, and the IDF(P) calculation result of the output of the IDF(P) calculation unit 142, the calculation result for each of the index terms of the document is obtained at the similarity calculation unit 150, and the average of the index terms for example is output to be used as a similarity of the document, so that the calculation of the similarity is carried out (step S150).
If the method of calculating the similarity is not based on TFIDF, a similarity is sometimes obtained by another method from the output of the index term (d) extraction unit 120 for the document to be surveyed d and the output of the index term (P) extraction unit 140 for the documents to be compared P.
Then, in step S151, the population narrowing unit 151 removes unnecessary information. Note that step S151 may be omitted.
Then, the population document S selection unit 160 sorts the documents processed in step S150 in order of similarity, and selects as many population documents S as the number set in the extraction condition and others input unit 230 (step S160).
These kinds of data are sometimes directly used in the map (graph/table) creation unit 415 or the report creation unit 440 in the output device 4.
Then, the index term (S) extraction unit 170 for the population documents S extracts index terms (S) from the population documents S selected in step S160 (step S170).
Then, each of the index terms (d) is subjected to IDF calculation by the IDF (S) calculation unit 171 (step S171).
Then, based on the result of the IDF (S) calculation of each of the index terms (d) in the population documents S in step S171 and the result of the TF(d) calculation of each of the index terms (d) in the document to be surveyed d in step S121, calculation regarding the characteristic index term/similarity in population/frequency scatter diagram/structure diagram etc. is carried out (step S180).
As shown in
If the condition read out from the condition recording unit 310 is a map creation condition (S410) and a map is necessary by the condition (step S411), map data is obtained by the map data obtaining unit 412 from the work result storage unit 320 (step S412). Based on the map creation condition from the map creation condition read out unit 410, a map such as a graph and a table is created (step S415) and sent to the report creation unit 440.
Meanwhile, if the condition to be read out from the condition recording unit 310 is a population data output condition (step S420) and data is necessary by the condition (step S421), output data is obtained from the work result storage unit 320 by the output data obtaining unit 422 (step S422). Then, based on the data output condition of the data output condition read out unit 420, the data is output (step S423) and then sent to the report creation unit 440.
If the condition to be read from the condition recording unit 310 is a comment condition (step S430) and a comment is necessary by the condition (step S431), a frame for adding a comment is prepared by the report creation unit 440. The comment is either input manually with a keyboard or an OCR (step S435) or obtained from comments prepared in advance in the internal database of the document storage unit 330 (step S432), and the comment is sent to the report creation unit 440.
If the condition does not indicate a map in step S411, if the condition is not a condition to output data in step S421, or the condition is not a condition to add a comment in step S431, the process ends each at the points, and the data is not sent to the report creation unit 440.
Based on the condition set in the extraction condition setting screen in the example, the extraction condition and others input unit 230 is set.
As can be understood from
Note that in the information analysis report shown in
In
The following can be seen from
(1) Words in the lower right region of the keyword distribution map have low creativity values and high technicality values. More specifically, the words are used in many documents in the population but used only in a small number of documents in all the documents to be compared. The words in the region should represent the characteristic of the technical field segmented as that of the population. The region is a population characteristic word region.
(2) Words in the upper left region of the keyword distribution map have low technicality values and high creativity values. More specifically, the words are used in many documents in all the documents to be compared but used only in a small number of documents in the population. The words in the region should represent the creativity of the document to be surveyed in the technical field segmented as that of the population. The region is a creative word region.
(3) Words in the upper right region of the keyword distribution map have high values both for technicality and creativity. More specifically, the words are used in only a small number of documents both in all the documents to be compared and in the population. The words in the region should be very technical words rarely used outside the document to be surveyed. The region is a technical word region.
(4) Words in the lower left region of the keyword distribution map have low values both for technicality and creativity. The words are therefore used in many documents in all the documents to be compared and also in many documents in the population. The words in the region should be words generally used in documents irrespective of whether they are from all the documents to be compared or the population. The region is a general word (unnecessary word) region.
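The four regions (1) to (4) above can be expressed as a simple classification. The text does not state where the boundary between "low" and "high" lies on either axis, so the two threshold parameters below are assumptions, as is the function name.

```python
def classify_keyword(technicality, creativity, tech_threshold, crea_threshold):
    """Map a keyword's (technicality, creativity) pair to one of the four regions.

    The thresholds separating "low" from "high" are assumed inputs;
    the text does not define them.
    """
    high_technicality = technicality >= tech_threshold
    high_creativity = creativity >= crea_threshold
    if high_technicality and high_creativity:
        return "technical word"                   # (3) upper right
    if high_technicality:
        return "population characteristic word"   # (1) lower right
    if high_creativity:
        return "creative word"                    # (2) upper left
    return "general word"                         # (4) lower left
```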
From
According to the embodiment, the device for automatically creating information analysis report 100 includes the processing device 1, the input device 2, the recording device 3, and the output device 4. When an information analysis report is created, a document to be surveyed and documents to be compared are specified and input, conditions for information analysis are input, population documents consisting of a document group similar to the document to be surveyed are selected from the documents to be compared, and characteristic index terms in the document to be surveyed relative to the population documents are extracted. Then, based on the population documents and the index terms, an information analysis report representing the characteristic of the document to be surveyed is created, and the created information analysis report is output to the display means, the recording means, or the communicating means.
In this way, an information analysis report that can exactly report about the information of the document to be surveyed can automatically be created without human inspection of the contents of the document to be surveyed and an enormous number of documents to be compared. In addition, an information analysis report having a map, data about the population or index terms, and a fixed comment or a free comment based on the contents of the map and data can be created.
Now, a device for automatically creating information analysis report according to a second embodiment of the invention will be described. The device for automatically creating information analysis report according to the second embodiment basically has the same functions as those of the first embodiment, but the device is connected to a network in particular to carry out processing in response to a request from a client through the network and can transmit the file of an information analysis report obtained as the result of processing to the client through the network.
As shown in
The web server 511 serves as an interface with the client 502 and receives/transmits data from/to the client 502. The web server 511 creates the information of a case on which an information analysis report should be created, i.e., the information of the document to be surveyed (hereinafter referred to as “research case information”) based on the user input transmitted to the web server 511 from the client 502 through the network and provides the management server 512 with the created information.
The management server 512 queues research cases and issues requests to the first analysis server 513 and the second analysis server 514 in the order of input. The management server 512 includes a first queuing mechanism that queues the requests to the first analysis server 513 and a second queuing mechanism that queues the research cases processed by the first analysis server 513 and issues requests to the second analysis server 514.
The first analysis server 513 extracts a population, carries out various kinds of totaling processing, and creates a structure diagram. The second analysis server 514 creates cluster information representing the characteristic of each cluster in the structure diagram.
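The two queuing mechanisms can be pictured as two first-in, first-out queues in series. The sketch below is illustrative only: the class and method names are hypothetical, and the analysis servers are stood in for by plain callables.

```python
from collections import deque


class ManagementServerSketch:
    """Two FIFO queues in series; names are hypothetical, not from the text."""

    def __init__(self):
        self.first_queue = deque()   # cases waiting for the first analysis server
        self.second_queue = deque()  # cases already processed by the first server

    def submit(self, research_case):
        """Queue a research case in the order of input."""
        self.first_queue.append(research_case)

    def dispatch_first(self, first_analysis):
        """Send the oldest case to the first server and queue its result."""
        result = first_analysis(self.first_queue.popleft())
        self.second_queue.append(result)

    def dispatch_second(self, second_analysis):
        """Send the oldest first-stage result to the second server."""
        return second_analysis(self.second_queue.popleft())
```

Keeping the stages in separate queues lets the second stage start on an early case while later cases are still waiting for the first stage.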
Now, processing carried out by the device for automatically creating information analysis report 500 according to the second embodiment will be described. The user operates the client 502 to log in, so that the web server 511 transmits a search screen used to specify a document to be surveyed to the client 502.
If the document to be surveyed is a patent document such as a laid-open publication, the user operates the client 502 and inputs necessary information in the boxes 3701 to 3704. Alternatively, the user may input information to be researched in the text input box 3705.
Note that the box 3706 is used to provide service such as emphasizing similar publications for a period based on an input in the box 3706 in a different color at the time of listing similar publications.
When the user operates the client 502 to turn on a button, information input in each box is transmitted to the web server 511 through the network 501. The web server 511 transmits a check screen to confirm the input of the user to the client 502.
As described above, according to the embodiment, once a document to be surveyed is determined, the research case information is transmitted from the web server 511 to the management server 512. The management server 512 queues research cases by the first queuing mechanism, requests the first analysis server 513 to operate and provides the research case data.
According to the embodiment, for a patent document, its claims and abstract constitute a document to be surveyed. If it is a text input, the input text itself is a document to be surveyed. According to the second embodiment, for example the claims and abstract in each of publications such as JP-ROM are documents to be surveyed.
As a population, 3000 cases are extracted in the descending order of similarity measures to the document to be surveyed among the documents to be compared. The similarity measures are calculated in the same manner as that described in connection with the first embodiment and therefore the description is not provided.
Note that the information of the extracted documents constituting the population or the like is stored in the recording device (not shown) in the first analysis server 513.
Then, the first analysis server 513 carries out totaling processing.
The ranking totaling includes keyword totaling, applicant-related totaling, and IPC related totaling. In the keyword totaling, distribution diagrams as shown in
The first analysis server 513 obtains the information of the population from the recording device and totals the publications of the population for each of the applicants (see
The first analysis server 513 obtains the information of the population from the recording device and totals the number of applications filed by the top 10 applicants based on the number of publications in the population for each filing year and creates a graph representing the number transition (
Furthermore, the first analysis server 513 obtains important keywords (for all the publications) from the recording device and creates a graph representing the accumulation of the yearly use frequencies of the important keywords (for all the publications) (
The first analysis server 513 creates a graph based on the totaling result of the number of applications for each year in the population in which the abscissa represents the number of applications for each year and the ordinate represents the increase ratio compared to the number of applications in the previous year (
Hereinafter, the matrix tabulation will be described. The first analysis server 513 further obtains the information of the population from the recording device, refers to the IPC attached to the applications of the top 10 applicants based on the number of applications in the population, and tabulates the number of applications provided with each IPC group in a matrix form including the rows of applicants and the columns of IPC main groups (see
After various kinds of totaling processing ends, the first analysis server 513 obtains the information of the population from the recording device and calculates inside population similarity measures (step S3904). The inside population similarity measure refers to the similarity (similarity measure) of the document to be surveyed relative to each of the documents that belong to the population.
Furthermore, the first analysis server 513 carries out the process of calculating the coordinates for a frequency scatter diagram (step S3905). As shown in
As shown in
Thereafter, the product of TF(d) (the occurrence frequencies of d's index terms (d1, . . . , dx) in d) and IDF(P) (the logarithm of the number of documents N divided by DF(P): ln [N/DF(P)]), i.e., the document vector (d) is calculated (step S4003). Similarly, the product of TF(P) (the occurrence frequencies of P's index terms (P1, . . . , Pya) in P) and IDF(P), i.e., the document vector (p) is calculated (step S4004).
When the document vector (d) and the document vector (p) are calculated, the inner product of the vectors is obtained as similarity measures (step S4005). Furthermore, a prescribed number of documents in the descending order of similarity measures relative to the document to be surveyed d are extracted from the documents to be compared P as a population S and the information of the documents is stored in the recording device (step S4005). Thereafter, the keyword importance degree DF(S) (the document frequency in S based on S's index terms) is calculated (step S4006).
Thereafter, for each of the index terms (d1, . . . , dx) of the document to be surveyed d, the function value IDF of the document frequency is obtained for the documents to be compared P and the population S (steps S4007 and S4008). In step S4007, IDF(d1; P), IDF(d2; P), . . . , IDF(dx; P) are obtained, and in step S4008, IDF(d1; S), IDF(d2; S), . . . , IDF(dx; S) are obtained. The first analysis server 513 creates a plane by IDF(P) and IDF(S), and for example creates a frequency scatter diagram having the index terms provided in prescribed positions on the plane where the x-axis represents the IDF(P) and the y-axis represents the IDF(S) based on the values of IDF(P) and IDF(S) for each of the index terms (d1, . . . , dx) (step S4009).
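Steps S4007 to S4009 can be sketched as follows, using IDF = ln[N/DF] as in the earlier steps. Each document is represented as a set of index terms, which is an assumed data layout; the function names are hypothetical, and skipping terms whose document frequency is zero is an assumption made only to avoid division by zero.

```python
import math


def idf(term, documents):
    """ln(N / DF) over a document group; None when the term never occurs."""
    df = sum(1 for doc_terms in documents if term in doc_terms)
    return math.log(len(documents) / df) if df else None


def scatter_coordinates(index_terms_d, documents_p, population_s):
    """Return {term: (IDF(term; P), IDF(term; S))} for d's index terms."""
    coordinates = {}
    for term in index_terms_d:
        x = idf(term, documents_p)   # x-axis: IDF(P)
        y = idf(term, population_s)  # y-axis: IDF(S)
        if x is not None and y is not None:
            coordinates[term] = (x, y)
    return coordinates


docs_p = [{"a", "b"}, {"a"}, {"c"}, {"a", "c"}]
docs_s = [{"a", "b"}, {"a"}]
coords = scatter_coordinates(["a", "b"], docs_p, docs_s)
```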
Note that from step S4009, in the frequency scatter diagram (IDF plan view), the index terms are arranged (scattered), while the scattered index terms are sometimes unevenly localized and become less viewable. Therefore, according to the second embodiment, the density of the index terms provided on the plane is inspected, and if the density in a prescribed region exceeds a prescribed value, the first analysis server 513 widens the scale on the axis in the region to expand the region and narrows the scale on the axis in the other region to compress the other region. Therefore, when a region is expanded and the other region is compressed in this way, the first analysis server 513 carries out coordinate transformation (step S4010). The IDF plan view has a rhombus shape, which can look unusual as a phenogram or can be inconvenient in handling. Therefore, the first analysis server 513 may carry out coordinate transformation, so that the plane can be represented in a square form. The information of the frequency scatter diagram is also stored in the recording device in the first analysis server 513.
After the totaling processing ends, the first analysis server 513 carries out the process of creating a patent structure diagram. Hereinafter, the patent structure diagram will be described in detail.
Patent Structure Diagram
Terms to be used in the following paragraphs are defined.
E: document element (Document elements constitute a document group to be analyzed, and individual objects to be treated as a unit for analysis according to the embodiment. According to the embodiment, the document to be surveyed d or a document p in the population corresponds to the element.)
Tree-like diagram: a diagram in which document elements constituting a document group to be analyzed are connected in a tree-like line.
Dendrogram: a tree-like diagram created by hierarchical cluster analysis. The principle of creating it will be briefly described. Based on the degree of dissimilarities (degree of similarities) between document elements that constitute a document group to be analyzed, the document elements having the smallest dissimilarity measure (largest similarity measure) are connected to form a connected body. Then, the connected body and another document element, or the connected body and another connected body are connected one after another in the ascending order of the dissimilarities between them to generate a new connected body. In this way, a hierarchical representation is formed.
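The principle of creating a dendrogram can be sketched in a few lines. The text does not specify how the dissimilarity between a connected body and another element or body is measured; this sketch assumes the smallest pairwise dissimilarity (single linkage), and the function name and input layout are hypothetical.

```python
def build_dendrogram(dissimilarity):
    """Agglomerative clustering over a symmetric dissimilarity matrix.

    Returns the list of merges as (body_a, body_b, connection_height),
    in the order in which the connections are made.
    Assumption: single linkage (smallest pairwise dissimilarity).
    """
    bodies = [(i,) for i in range(len(dissimilarity))]
    merges = []

    def linkage(a, b):
        return min(dissimilarity[i][j] for i in a for j in b)

    while len(bodies) > 1:
        best = None
        for x in range(len(bodies)):
            for y in range(x + 1, len(bodies)):
                height = linkage(bodies[x], bodies[y])
                if best is None or height < best[0]:
                    best = (height, x, y)
        height, x, y = best
        a, b = bodies[x], bodies[y]
        merges.append((a, b, height))
        # Replace the two connected bodies by the new connected body.
        bodies = [c for k, c in enumerate(bodies) if k not in (x, y)] + [a + b]
    return merges


example_merges = build_dendrogram([[0, 1, 4], [1, 0, 3], [4, 3, 0]])
```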
For the ease of representation, the abbreviations are determined as follows.
D: the height of the position of combination (combination distance) of document elements, document element groups, or a document element and a document element group in a tree-like diagram.
α: the height of the cutting position of a tree-like diagram
α*: the cutting height of a tree-like diagram created by <D>+δσD (where −3≦δ≦3). Note that <D> is the average of all the connection heights D in the tree-like diagram, σD is the standard deviation of all the connection heights D.
N: the number of document elements to be analyzed. Unlike the first embodiment, the number refers to the number of objects to be analyzed.
t: the time data of a document element. If for example the document element is a patent document, t refers to any of the filing date, the publication date, the registration date, and the priority date. If the application numbers, the publication numbers and the like are in the order of filing, publication and the like, these application numbers, the publication numbers and the like may be treated as time data. If a document element includes a plurality of documents, the average value, the median value, and the like of the time data of the documents forming the document element may be obtained as the time data of the document element.
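The cutting height α* defined above follows directly from the connection heights D. The sketch below assumes the population standard deviation is meant for σD (the text does not say whether the population or sample form is used); the function name is hypothetical.

```python
import statistics


def cutting_height(connection_heights, delta):
    """alpha* = <D> + delta * sigma_D, with -3 <= delta <= 3.

    Assumption: sigma_D is the population standard deviation.
    """
    if not -3 <= delta <= 3:
        raise ValueError("delta must satisfy -3 <= delta <= 3")
    mean_d = statistics.mean(connection_heights)
    sigma_d = statistics.pstdev(connection_heights)
    return mean_d + delta * sigma_d
```

With δ = 0 the tree is cut at the average connection height; a positive δ raises the cut and yields fewer, larger clusters, while a negative δ lowers it.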
Now, a configuration used to create a patent structure diagram in the first analysis server 513 according to the second embodiment will be described.
The document read out unit 4110 reads out a plurality of document elements to be analyzed from the document storage unit in the recording device 4103. The data of the read out document element group are directly sent to the time data extraction unit 4120 and the index term data extraction unit 4130 and used for processing therein, or sent to the work result storage unit in the recording device 4103 and stored therein.
Note that the data transmitted from the document read out unit 4110 to the time data extraction unit 4120 and the index term data extraction unit 4130 or the work result storage unit may be the entire data including the time data and the content data of the read out document element group. Alternatively, the data may be only bibliographic data used to specify each element of the document element group (such as an application number and a publication number for a patent document). In the latter case, if necessary in subsequent processing, the data of each document element may be read out again from the document storage unit based on the bibliographic data.
The time data extraction unit 4120 extracts the time data of each element from the document element group read out by the document read out unit 4110. The extracted time data is directly sent to the inside cluster element arranging unit 4190 and used for processing therein or sent to the work result storage unit in the recording device 4103 and stored therein.
The index term data extraction unit 4130 extracts the index term data as the content data of each document element from the document element group read out by the document read out unit 4110. The index term data extracted from each of the document elements is directly sent to the similarity measure calculation unit 4140 and used for processing therein or sent to the work result storage unit in the recording device 4103 and stored therein.
The similarity measure calculation unit 4140 calculates similarity measures between the document elements based on the index term data of the document elements extracted by the index term data extraction unit 4130. The calculated similarity measures are directly sent to the tree-like diagram creation unit 4150 and used for processing therein or sent to the work result storage unit in the recording device 4103 and stored therein.
The tree-like diagram creation unit 4150 creates a tree-like diagram of the document element group to be analyzed according to a tree-like diagram creating condition, based on the similarity measures calculated by the similarity measure calculation unit 4140. The created tree-like diagram is sent to the work result storage unit in the recording device 4103 and stored therein. The tree-like diagram is stored, for example, in the form of coordinate value data giving the coordinates of the document elements and the starting points and end points of the individual connecting lines connecting them, or in the form of data representing the connection combinations of the document elements and the positions of combination arranged on a two-dimensional coordinate plane.
The disconnecting condition read out unit 4160 reads out a tree-like diagram disconnecting condition recorded in the condition recording unit in the recording device 4103. The read out disconnecting condition is sent to the cluster extraction unit 4170.
The cluster extraction unit 4170 reads out the tree-like diagram created by the tree-like diagram creation unit 4150 from the work result storage unit in the recording device 4103 and cuts the tree-like diagram based on the disconnecting condition read out by the disconnecting condition read out unit 4160, thereby extracting clusters. Data related to the extracted clusters is sent to the work result storage unit in the recording device 4103 and stored therein. The cluster data includes, for example, information used to specify the document elements that belong to each of the clusters and connection information among the clusters.
The arrangement condition read out unit 4180 reads out for example a document element arrangement condition in a cluster recorded in the condition recording unit in the recording device 4103. The read out arrangement condition is sent to the inside cluster element arranging unit 4190.
The inside cluster element arranging unit 4190 reads out the data of the cluster extracted by the cluster extraction unit 4170 from the work result storage unit in the recording device 4103 and determines the arrangement of document elements in each of the clusters based on the document arrangement condition read out by the arrangement condition read out unit 4180. The document correlation diagram according to the invention is completed by thus determining the arrangement in the cluster. The document correlation diagram is sent to the work result storage unit in the recording device 4103, stored therein, and output as required.
Now, with reference to the flowchart in
The document read out unit 4110 reads out a plurality of document elements to be analyzed from the document storage unit in the recording device 4103 (step S4210). According to the second embodiment, examples of the document elements to be analyzed include population documents or a document to be surveyed and population documents.
Then, the time data extraction unit 4120 extracts the time data of each element from the document element group read out in the document reading step S4210 (step S4220).
The index term data extraction unit 4130 extracts index term data as the content data of each document element from the document element group read out in the document reading step S4210 (step S4230). The index terms are extracted in the same manner as the first embodiment.
Then, the similarity measure calculation unit 4140 calculates similarity measures between the document elements based on the index term data of each of the document elements extracted in the index term data extracting step S4230 (step S4240). The calculation of the similarity measure (similarity) has already been described, and therefore its description is omitted here.
Then, the tree-like diagram creation unit 4150 creates a tree-like diagram of the document element group to be analyzed according to a tree-like diagram creating condition, based on the similarity measures calculated in the similarity measure calculating step S4240 (step S4250). As the tree-like diagram, a dendrogram in which the similarity measures between the document elements are reflected in the heights of the combination positions (combination distances) is desirably created. A specific example of a method of creating such a dendrogram is the known Ward method.
The cutting condition read out unit 4160 then reads out a tree-like diagram cutting condition recorded in the condition recording unit in the recording device 4103 (step S4260).
The cluster extraction unit 4170 then cuts the tree-like diagram created in the tree-like diagram creating step S4250 based on the cutting condition read out in the cutting condition reading step S4260 and extracts clusters (step S4270).
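The cutting in step S4270 can be illustrated by the following sketch. It assumes that the tree-like diagram is held as a list of combinations (side_a, side_b, height), that the combination heights do not decrease toward the root, and that a combination is kept when its height is below the cutting height α; the function name cut_tree and this data format are hypothetical.

```python
def cut_tree(elements, merges, alpha):
    """Return the clusters obtained by cutting the tree at height alpha.

    merges is a list of (side_a, side_b, height), where side_a and side_b
    are the sets of document elements joined at that combination position.
    """
    parent = {e: e for e in elements}

    def find(e):
        # Follow parent links up to the representative of the cluster.
        while parent[e] != e:
            e = parent[e]
        return e

    for side_a, side_b, height in merges:
        # Assumption: combinations below the cutting height stay connected.
        if height < alpha:
            parent[find(min(side_a))] = find(min(side_b))

    clusters = {}
    for e in elements:
        clusters.setdefault(find(e), []).append(e)
    return sorted(clusters.values())


# Made-up tree of four document elements.
tree = [
    (frozenset({"E1"}), frozenset({"E2"}), 1.0),
    (frozenset({"E3"}), frozenset({"E4"}), 2.0),
    (frozenset({"E1", "E2"}), frozenset({"E3", "E4"}), 5.0),
]
clusters = cut_tree(["E1", "E2", "E3", "E4"], tree, alpha=3.0)
```

With the made-up tree above, cutting at α=3.0 leaves the two clusters {E1, E2} and {E3, E4}.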
The arrangement condition read out unit 4180 reads out a document element arrangement condition recorded in the condition recording unit in the recording device 4103 (step S4280).
The inside cluster element arranging unit 4190 then determines the arrangement of the document elements in the cluster extracted in the cluster extracting step S4270 based on the document element arrangement condition read out in the arrangement condition reading step S4280 (step S4290). The structure diagram according to the embodiment is completed by thus determining the arrangement in the cluster. Note that the arrangement condition may be in common for all the clusters. Therefore, if step S4280 is carried out once for one cluster, the step does not have to be carried out again for the other clusters.
The process of creating the structure diagram will be described in detail. According to the embodiment, after parent clusters are extracted by cutting a tree-like diagram at a cutting height a determined by a certain method, a tree-like diagram is created again using only document elements that belong to each of the parent clusters, in order to divide each of the parent clusters into child clusters. At the time of creating the partial tree-like diagram, an index term dimension in which the deviation of the document element vector in the parent cluster takes a value smaller than a value determined by a prescribed method is removed before analysis.
The document read out unit 4110 reads out a plurality of document elements to be analyzed from the document storage unit in the recording device 4103 (step S4310).
The time data extraction unit 4120 extracts time data from each document element in the document group to be analyzed (step S4320).
The index term data extraction unit 4130 extracts index term data from each document element in the document group to be analyzed (step S4330). At this time, the index term data of the oldest element (oldest document element) E1 of the document group is not necessary, as will be described, and therefore the index term data excluding the data of the oldest element is preferably extracted based on the time data extracted in step S4320.
The similarity measure calculation unit 4140 calculates similarity measures among the document elements (step S4340). Also at this time, the similarity measures among the elements excluding the oldest element E1 are calculated.
The tree-like diagram creation unit 4150 then creates a tree-like diagram including the document elements of the document group to be analyzed (step S4350,
The cutting condition read out unit 4160 reads out a cutting condition (step S4360). In this example, the cutting height α, the deviation determining threshold that will be described, and the like are read out.
The cluster extraction unit 4170 carries out cluster extracting. The tree-like diagram is cut at the cutting height α=a (step S4371,
For each of the clusters, an index term dimension in which the deviation between the elements in the cluster other than the oldest elements is a value smaller than a value determined by a prescribed method is removed (step S4375). Assume for example that in the cluster having a document element E2 as the head in
If, for example, the threshold for determining the deviation is set to 10% as the ratio of the standard deviation to the inside cluster average, the index terms wb and we are determined as having small deviations and are removed.
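The removal of index term dimensions with small deviations in step S4375 can be sketched as follows. The function name, the vector format, the use of the population standard deviation, and the numerical values are assumptions for illustration only; the values are not those of the example in the drawings.

```python
from statistics import mean, pstdev


def keep_high_deviation_terms(vectors, threshold=0.10):
    """Return the index terms whose deviation in the cluster is not small.

    vectors maps each document element to a dict {index term: weight}.
    A term is removed when (standard deviation / average) < threshold.
    """
    terms = sorted({t for vec in vectors.values() for t in vec})
    kept = []
    for term in terms:
        values = [vec.get(term, 0.0) for vec in vectors.values()]
        average = mean(values)
        # Assumption: a term with a zero average is treated as having no deviation.
        ratio = pstdev(values) / average if average else 0.0
        if ratio >= threshold:
            kept.append(term)
    return kept


# Made-up index term weights for three elements of one cluster.
cluster_vectors = {
    "E3": {"wa": 1.0, "wb": 5.0, "wc": 0.0},
    "E4": {"wa": 3.0, "wb": 5.0, "wc": 2.0},
    "E5": {"wa": 5.0, "wb": 5.2, "wc": 4.0},
}
kept_terms = keep_high_deviation_terms(cluster_vectors)
```

Here the term wb has nearly the same weight in every element of the cluster, so it does not help to divide the cluster further and is dropped before the partial tree-like diagram is created.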
Then, for each cluster, a partial tree-like diagram including the inside cluster elements other than the oldest element is created (step S4376,
Now, for each cluster, the number of inside cluster elements excluding the oldest element is obtained and compared to a prescribed threshold (such as “3”) (step S4377). Like the document elements E3 to E6 in
Note that if cutting is carried out at the cutting height α* in step S4373 at the time of extracting a descendant cluster, α* may be updated depending on the height D of each combination position in a parent cluster to be cut (variation method) or the initial value of α* may be used (fixed method).
Like the document elements E8 to E10 in
In step S4380, the arrangement condition read out unit reads out the arrangement condition in the cluster. The inside cluster element arranging unit 4190 determines the arrangement of the document element group in the cluster according to the arrangement condition based on the time data of each document element (step S4390,
For example in step S4378, if cutting is carried out at the cutting height α=ax in
For example in step S4378, if cutting is carried out at the cutting height α=ay in
For example in step S4378, if cutting is carried out at the cutting height α=az in
The arrangement condition in the cluster is preferably in the order of occurrence based on the time data in this example, while other arrangements may be applied.
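A minimal sketch of arranging the inside cluster elements in the order of occurrence is given below. It assumes that the time data of a document element consisting of a plurality of documents is the median of the time data of those documents, which is one of the options mentioned in the definition of t; the function names and dates are hypothetical.

```python
from datetime import date
from statistics import median


def element_time(dates):
    """Time data of one document element (median of its documents' dates)."""
    return median(d.toordinal() for d in dates)


def arrange_cluster(cluster):
    """Return the document elements of a cluster, oldest first.

    cluster maps each document element to the list of dates of its documents.
    """
    return sorted(cluster, key=lambda element: element_time(cluster[element]))


# Made-up filing dates.
cluster = {
    "E5": [date(2003, 5, 1)],
    "E3": [date(2001, 1, 10), date(2001, 3, 10)],
    "E4": [date(2002, 7, 7)],
}
order = arrange_cluster(cluster)
```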
Note that in the example used for describing the threshold for determining the deviation, the ratio of the standard deviation to the average is 10%, which is a preferable example for the case in which each document element includes one document. In that case, the determination threshold is preferably in the range from 0% to 10%. Meanwhile, if each document element includes a plurality of documents, a ratio of the standard deviation to the average of the inside cluster document elements of not more than 60%, or not more than 70%, is preferably treated as a small deviation.
The first analysis server 513 carries out the above described processing, so that a patent structure diagram as shown in
Upon receiving the notification of the end of processing from the first analysis server 513, the management server 512 inputs the research cases to a queuing mechanism, issues a request to the second analysis server 514 for the research case to be processed next in the order, and provides information about the research case data and the patent structure diagram.
Creating Cluster Information
Now, the processing for obtaining cluster information will be described.
The first analysis server 513 calculates the importance degree of each keyword based on the use frequency of the keyword (index term) in the document to be surveyed and the use frequency of the keyword (index term) in all the publications. The keywords with importance degrees in a prescribed top range are determined as important keywords. The importance degrees of the keywords or the important keyword information is also stored in the recording device in the first analysis server 513.
The use frequency of each keyword in the document to be surveyed and the use frequency of each keyword in all the publications are quantified and compared, and the degree to which each keyword expresses the technical characteristic of the document to be surveyed is calculated as the “importance degree” of the keyword. Keywords with higher importance degrees more strongly express the characteristic of the document to be surveyed, and therefore the keywords with importance degrees in a prescribed high range will be referred to as important keywords.
Now, the definition of terms and abbreviations used in the following paragraphs will be described. Cluster information includes titles, the number of publications, the total of IPC classes (top five), the total of applicants (top five) and cluster important keywords for each cluster. The important keywords represent the ten most important keywords extracted from all the publications that belong to the cluster and the keywords are divided into the following four kinds.
Technical Region Terms: Terms used in common for other clusters among the cluster important keywords. Such keywords used in common among many clusters are generally keywords that represent the technical region to which the clusters belong.
Main Terms: Among the cluster important keywords excluding the “technical region terms,” those particularly much used in the cluster. The main terms are not much used in other clusters, and often represent the main technical elements of the cluster. The main terms typically distinguish the cluster from other clusters.
Characteristic Terms: It is often the case that the cluster important keywords excluding the “technical region terms” and the “main terms” are keywords related to means or structures. Among all, general terms relatively often used but not much used in the group of publications to be analyzed (with the top 300 similarity measures in all the publications) would be keywords that could suggest characteristic aspects in means or structures. Such keywords are calculated according to a prescribed standard and indicated as “characteristic terms.”
Other Important Terms: Terms that do not correspond to any one of the above three kinds among the cluster important keywords. It is often the case that “the other important terms” are technical terms related to means or structures that do not belong to any of the above three kinds.
Now, the process of extracting such important keywords and obtaining keywords that belong to these kinds will be described. In the following description, as for abbreviations, the parameters used in conjunction with the first analysis server 513 according to the first and second embodiments are denoted by other abbreviations in some examples, while the previously used abbreviations are used in different contexts. Therefore, it is noted that the abbreviations in the following paragraphs are applied only in these paragraphs.
High Frequency Terms: A prescribed number of index terms having large weights, where the weight is evaluated so as to reflect a high occurrence frequency in the document group to be analyzed. For example, such terms are extracted by using GF(E), or a function value including GF(E) as a variable, as the weight of each index term and extracting a prescribed number of terms with the largest values.
E: A document group to be analyzed. As the document group E, for example, a document group constituting an individual cluster obtained when a large number of documents are clustered based on the similarity measures is used. Each document group in a document group set S including a plurality of document groups E is represented as Eu (u=1, 2, . . . , n) (n is the number of document groups).
S: A document group set including a plurality of document groups E, which consists of for example 300 patent documents similar to a patent document or a group of patent documents.
P: All the documents including a set of documents (large document set) including document groups E and a document group set S. As all the documents P, regarding analysis of patent documents, for example, about five million documents including all the patent publications and registered utility model publications issued in the past 10 years in Japan may be used.
N(E) or N(P): The number of documents included in a document group E or a document set P.
D, Dk or D1 to DN(E): Individual documents included in a document group E.
W: The total number of index terms included in a document group E.
w, wi, wj: Individual index terms (i=1, . . . , W, j=1, . . . , W).
Σ{condition H}: To obtain a sum in the range that satisfies the condition H.
Π{condition H}: To obtain a product in the range that satisfies the condition H.
β(w,D): The weight of an index term w in a document D.
C(wi,wj): The degree of co-occurrence in a document group calculated based on the presence/absence of co-occurrence of index terms on a document basis. The presence/absence (1 or 0) of co-occurrence of index terms wi and wj in one document D is summed up for all documents D that belong to a document group E (as weighted by β(wi,D) and β(wj,D)).
g or gh: A “ground” made of high frequency terms having similar co-occurrence degrees with the index terms. The number of grounds is b (h=1, 2, . . . , b).
Co(w,g): Index term-ground co-occurrence degree. The co-occurrence degree C(w,w′) between an index term w and a high frequency term w′ that belongs to a ground g is summed up for all w′ (excluding w) that belong to the ground g.
ak: The title (name) of a document Dk.
s: The character string obtained by connecting the titles ak (k=1, . . . , N(E)); referred to as the title sum.
xk: The appearance ratio of a title. The appearance ratio of each title ak (for the number of documents N(E)) in the title sum s.
mk: The genus of index terms wv (title words) appearing in each title ak.
fk: The appearance ratio of a title term in a title sum s (for the number of documents N(E)).
yk: The average of the appearance ratio of a title word, which is created by dividing a title word appearance ratio fk by the genus mk of index terms wv appearing in each title ak.
τk: a title score. The score is calculated for each of the titles of documents that belong to a document group E in order to determine the order of extracting labels.
T1, T2, . . . : Titles (names) extracted in the descending order of the title scores τk.
k: keyword adaptability, which is calculated to determine the number of extracted labels (that will be described) and indicates the ratio occupied by a keyword in a document group E.
TF(D) or TF(w,D): The occurrence frequency of an index term w in a document D (Term Frequency).
DF(P) or DF(w,P): The document frequency based on an index term w in all the documents P constituting a population (Document Frequency). The document frequency refers to the number of hit documents when a search for the index term is carried out among a plurality of documents.
DF(E) or DF(w,E): The document frequency in a document group E based on an index term w.
DF(w,D): The document frequency in a document D based on an index term w. If the index term w is included in the document D, the frequency is 1 and if not, the frequency is zero.
IDF(P) or IDF(w,P): The logarithm of “the inverse of DF(P)×the total document number N(P) of all the documents.” For example, ln(N(P)/DF(P)).
GF(E) or GF(w,E): The occurrence frequency in a document group E based on an index term w (Global Frequency).
TF*IDF(P): The product of TF(D) and IDF(P), which is operated for each index term in a document.
GF(E)*IDF(P): The product of GF(E) and IDF(P), which is operated for each index term in a document.
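The frequency quantities defined above can be illustrated by the following sketch, in which each document is represented as a list of index terms. The representation, the function names, and the sample data are assumptions for the example; GF(w,E) is interpreted here as the total of TF(w,D) over the documents D of the document group E.

```python
import math


def tf(w, doc):
    """TF(w,D): occurrence frequency of index term w in document D."""
    return doc.count(w)


def df(w, docs):
    """DF(w,P) or DF(w,E): number of documents that contain index term w."""
    return sum(1 for doc in docs if w in doc)


def idf(w, docs):
    """IDF(w,P) = ln(N(P) / DF(w,P)); w must occur in at least one document."""
    return math.log(len(docs) / df(w, docs))


def gf(w, docs):
    """GF(w,E): total occurrence frequency of w in a document group (assumed)."""
    return sum(tf(w, doc) for doc in docs)


# Made-up population P of four documents; E is a document group within P.
P = [["a", "b", "a"], ["b", "c"], ["c", "c", "d"], ["a"]]
E = P[:2]
tf_idf = tf("a", P[0]) * idf("a", P)   # TF*IDF(P) for term "a" in document D1
gf_idf = gf("a", E) * idf("a", P)      # GF(E)*IDF(P) for term "a"
```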
Hereinafter, the structure of a processing device used to extract a keyword will be described with reference to the block diagram in
A document read out unit 4510 reads out from a document storage unit of a recording device 4503 a document group E including a plurality of documents D1 to DN(E) to be analyzed based on a reading condition stored in a condition recording unit in the recording device 4503. The data of the read out document group is directly sent to an index term extraction unit 4520 and used for processing therein or sent to a work result storage unit in the recording device 4503 and stored therein.
Note that the data sent to the index term extraction unit 4520 or the work result storage unit from the document read out unit 4510 may be the entire data including the document data of the read out document group E. Alternatively, the data may be only the bibliographic data (such as application numbers or publication numbers in patent documents) that specifies documents D that belong to the document group E. In the latter case, if necessary in subsequent processing, the data of each document D may be read out again from the document storage unit based on the bibliographic data.
The index term extraction unit 4520 extracts index terms in each document from the document group read out by the document read out unit 4510. The index term data of each of the documents is directly sent to the high frequency term extraction unit 4530 and used for processing therein or sent to the work result storage unit in the recording device 4503 and stored therein.
The high frequency term extraction unit 4530 extracts, from the index terms in each document extracted by the index term extraction unit 4520, a prescribed number of index terms having large weights, the weights being evaluated so as to reflect a high occurrence frequency in the document group E, according to a high frequency term extracting condition stored in the condition recording unit in the recording device 4503.
More specifically, the occurrence frequency GF(E) of each index term in the document group E is calculated. Preferably, the IDF(P) of each index term is also calculated, and the product GF(E)*IDF(P) is obtained. Then, using GF(E) or GF(E)*IDF(P) as the weight of each index term, a prescribed number of index terms having the largest weights are extracted as high frequency terms.
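The selection of high frequency terms can be sketched as follows, assuming that GF(E) and IDF(P) have already been obtained for each index term. The function name, the tie-breaking by term name, and the sample values are hypothetical.

```python
def high_frequency_terms(gf_e, idf_p, count, use_idf=True):
    """Return the `count` index terms with the largest weights.

    The weight is GF(E)*IDF(P) when use_idf is True, otherwise GF(E).
    """
    weight = {w: gf_e[w] * (idf_p[w] if use_idf else 1.0) for w in gf_e}
    # Sort by descending weight; ties are broken by term name (assumption).
    return sorted(weight, key=lambda w: (-weight[w], w))[:count]


# Made-up frequencies for four index terms.
gf_e = {"engine": 10, "the": 50, "valve": 8, "piston": 3}
idf_p = {"engine": 2.0, "the": 0.1, "valve": 3.0, "piston": 4.0}
top_terms = high_frequency_terms(gf_e, idf_p, count=2)
```

In this made-up example, a term that is frequent in the document group but also common in all the documents P (such as "the") is ranked lower when IDF(P) is included in the weight.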
The data of the extracted high frequency terms is directly sent to the high frequency term-index term co-occurrence degree calculation unit 4540 and used for processing therein, or sent to the work result storage unit in the recording device 4503 and stored therein. The calculated GF(E) of the index terms and, if calculated, the IDF(P) of the index terms are preferably sent to the work result storage unit in the recording device 4503 and stored therein.
The high frequency term-index term co-occurrence degree calculation unit 4540 calculates co-occurrence degrees in a document group E based on the presence/absence of the co-occurrence of the high frequency terms extracted by the high frequency term extraction unit 4530 and the index terms extracted by the index term extraction unit 4520 and stored in the work result storage unit. If p index terms are extracted and q high frequency terms are extracted from the p index terms, matrix data of p rows and q columns results.
The co-occurrence degree data calculated by the high frequency term-index term co-occurrence degree calculation unit 4540 is directly sent to a clustering unit 4550 and used for processing therein or sent to the work result storage unit in the recording device 4503 and stored therein.
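The matrix data of p rows and q columns can be illustrated by the following sketch. It assumes that each document is given as a dict mapping the index terms it contains to their weights β(w,D), and that the weighting of a co-occurrence is the product β(wi,D)×β(wj,D); the text only states that the co-occurrence is weighted by these two values, so the product is an assumption, as are the function name and data.

```python
def cooccurrence_matrix(docs, index_terms, high_freq_terms):
    """Return a p x q matrix of co-occurrence degrees C(w, w').

    Rows correspond to the p index terms and columns to the q high
    frequency terms. docs is a list of dicts {index term: beta(w, D)}.
    """
    matrix = []
    for w in index_terms:
        row = []
        for h in high_freq_terms:
            # Sum the weighted presence of co-occurrence over all documents.
            row.append(sum(d[w] * d[h] for d in docs if w in d and h in d))
        matrix.append(row)
    return matrix


# Made-up document group with all weights beta(w, D) equal to 1.
docs = [
    {"a": 1, "b": 1},
    {"a": 1, "c": 1},
    {"a": 1, "b": 1, "c": 1},
]
matrix = cooccurrence_matrix(docs, ["a", "b", "c"], ["a", "b"])
```

The entry for a term paired with itself is computed here as well; such pairs are excluded later when the index term-ground co-occurrence degree Co(w,g) is summed.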
The clustering unit 4550 cluster-analyzes the q high frequency terms according to a clustering condition stored in the condition recording unit in the recording device 4503 based on the co-occurrence degree data calculated by the high frequency term-index term co-occurrence degree calculation unit 4540.
In order to carry out the cluster analysis, similarity measures among the q high frequency terms are calculated based on their co-occurrence degrees with the index terms.
Then, based on the result of calculation of the similarity measures and according to a tree-like diagram creating condition stored in the condition recording unit in the recording device 4503, a tree-like diagram connecting the high frequency terms in a tree-like form is created. As such a tree-like diagram, a dendrogram in which the dissimilarity measures between the high frequency terms are reflected as the height of the connecting position (connecting distance) is desirably created.
Then, according to a tree-like diagram cutting condition recorded in the condition recording unit in the recording device 4503, the created tree-like diagram is cut. As the result of cutting, the q high frequency terms are clustered based on the similarity measures of co-occurrence degree with the index terms. Individual clusters created by the clustering will be referred to as “ground” gh (h=1, 2, . . . , b).
The ground data formed by the clustering unit 4550 is directly sent to an index term-ground co-occurrence degree calculation unit 4560 and used for processing therein or sent to the work result storage unit in the recording device 4503 and stored therein.
The index term-ground co-occurrence degree calculation unit 4560 calculates the co-occurrence degrees between the index terms extracted by the index term extraction unit 4520 and stored in the work result storage unit in the recording device 4503 and the grounds formed by the clustering unit 4550. The co-occurrence degree data calculated for each index term is directly sent to a key(w) calculation unit 4570 and used for processing therein or sent to the work result storage unit in the recording device 4503 and stored therein.
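Following the definition of Co(w,g) given earlier, the calculation can be sketched as below. The function names and the storage of C(w,w') as a dict keyed by unordered term pairs are assumptions for the example.

```python
def pair_cooccurrence(C, w, h):
    """Look up C(w, h) regardless of the order in which the pair is stored."""
    return C.get((w, h), C.get((h, w), 0))


def term_ground_cooccurrence(w, ground, C):
    """Co(w,g): sum of C(w, w') over the high frequency terms w' in ground g,
    excluding w itself."""
    return sum(pair_cooccurrence(C, w, h) for h in ground if h != w)


# Made-up co-occurrence degrees and two grounds.
C = {("a", "b"): 2, ("a", "c"): 2, ("b", "c"): 1}
g1 = ["a", "b"]
g2 = ["c"]
co_a_g1 = term_ground_cooccurrence("a", g1, C)
co_c_g1 = term_ground_cooccurrence("c", g1, C)
```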
The key (w) calculation unit 4570 calculates a key(w) that is the evaluation score of each index term based on the co-occurrence degrees between the index terms and the grounds calculated in the index term-ground co-occurrence degree calculation unit 4560. The calculated key(w) data is directly sent to a Skey(w) calculation unit 4580 and used for processing therein or sent to the work result storage unit in the recording device 4503 and stored therein.
The Skey(w) calculation unit 4580 calculates Skey(w) scores based on the key (w) scores of the index terms calculated by the key(w) calculation unit 4570, the GF(E) of the index terms calculated by the high frequency term extraction unit 4530 and stored in the work result storage unit in the recording device 4503, and the IDF(P) of the index terms. The calculated Skey(w) data are sent to the work result storage unit in the recording device 4503 and stored therein.
An evaluation value calculation unit 4700 reads out the index terms wi in each document extracted by the index term extraction unit 4520 for a document group set S including a plurality of document groups Eu. Alternatively, the evaluation value calculation unit 4700 reads out the Skey(w) of the index terms calculated for each of the document groups Eu by the Skey(w) calculation unit 4580 from the work result storage unit. If necessary, the evaluation value calculation unit 4700 may read out from the work result storage unit the data of each document group Eu read out by the document read out unit 4510 and count the number of documents N(Eu). GF(Eu) and IDF(P) calculated in the process of extracting high frequency terms by the high frequency term extraction unit 4530 may also be read out from the work result storage unit.
The evaluation value calculation unit 4700 calculates an evaluation value A(wi,Eu) based on the occurrence frequency of each index term wi in each of the document groups Eu, according to the read out information. The calculated evaluation values are sent to the work result storage unit and stored therein or directly sent to a concentration degree calculation unit 4710 and a share calculation unit 4720 and used for processing therein.
The concentration degree calculation unit 4710 reads out the evaluation value A(wi,Eu) of each of the index terms wi calculated by the evaluation value calculation unit 4700 in each of the document groups Eu from the work result storage unit or receives the value directly from the evaluation value calculation unit 4700.
The concentration degree calculation unit 4710 calculates, for each index term wi, the concentration degree of the distribution of the index term wi in the document group set S based on the obtained evaluation values A(wi,Eu). For each index term wi, the sum of the evaluation values A(wi,Eu) over all the document groups Eu that belong to the document group set S is calculated, the ratio of the evaluation value A(wi,Eu) in each document group Eu to that sum is calculated, and the squares of these ratios are summed over all the document groups Eu that belong to the document group set S. This sum of squares is the concentration degree. The calculated concentration degrees are sent to the work result storage unit and stored therein.
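The concentration degree described above, that is, the sum over the document groups Eu of the squared ratio of A(wi,Eu) to the total over all Eu, can be sketched as follows. The function name and the handling of an all-zero input are assumptions.

```python
def concentration_degree(evaluations):
    """Concentration degree of one index term wi in the document group set S.

    evaluations is the list of A(wi, Eu) for u = 1, ..., n.
    """
    total = sum(evaluations)
    if total == 0:
        return 0.0  # assumption: a term that never occurs has no concentration
    return sum((a / total) ** 2 for a in evaluations)


# A term found in only one of three document groups is fully concentrated,
# while a term spread evenly over four document groups gives 1/4.
concentrated = concentration_degree([4, 0, 0])
spread = concentration_degree([1, 1, 1, 1])
```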
The share calculation unit 4720 reads out the evaluation value A(wi,Eu) of each index term wi in each document group Eu calculated by the evaluation value calculation unit 4700 from the work result storage unit or directly receives from the evaluation value calculation unit 4700.
The share calculation unit 4720 calculates the share of each index term wi in each document group Eu based on the obtained evaluation value A(wi,Eu). The share is obtained by summing up the evaluation values A(wi,Eu) in the document group Eu over all the index terms wi extracted from the document groups Eu that belong to the above-described document group set S, and calculating the ratio of the evaluation value A(wi,Eu) of each index term wi to that sum. The calculated shares are sent to the work result storage unit and stored therein.
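The share calculation for one document group Eu can be sketched as below. The function name and sample values are hypothetical.

```python
def shares(evaluations):
    """Share of each index term wi in one document group Eu.

    evaluations maps each index term wi to A(wi, Eu); the share is the
    ratio of A(wi, Eu) to the sum of A over all index terms.
    """
    total = sum(evaluations.values())
    return {w: (a / total if total else 0.0) for w, a in evaluations.items()}


# Made-up evaluation values in one document group.
share_in_group = shares({"valve": 6, "engine": 3, "piston": 1})
```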
A first inverse calculation unit 4730 reads out the index terms wi in each document extracted by the index term extraction unit 4520 regarding the document group set S including a plurality of document groups Eu from the work result storage unit.
The first inverse calculation unit 4730 calculates a function value of the inverse of the occurrence frequency of each index term wi in the document group set S (such as normalized IDF(S) that will be described) based on the data of the index terms wi in each document in the read out document group set S. The calculated function value of the inverse of the occurrence frequency in the document group set S is sent to the work result storage unit and stored therein or directly sent to a creativity degree calculation unit 4750.
The second inverse calculation unit 4740 calculates a function value of the inverse of the occurrence frequency in the large document set including the document group set S. As the large document set, all the documents P are used. In this case, IDF(P) calculated in the process of extracting a high frequency term in the high frequency term extraction unit 4530 is read out from the work result storage unit and its function value (such as normalized IDF(P) that will be described) is calculated. The calculated function value of the inverse of the occurrence frequency in the large document set P is sent to the work result storage unit and stored therein or directly sent to the creativity degree calculation unit 4750 and used for processing therein.
The creativity degree calculation unit 4750 reads out the function values of the inverse of each occurrence frequency calculated in the first inverse calculation unit 4730 and the second inverse calculation unit 4740, or directly receives the values from the first inverse calculation unit 4730 and the second inverse calculation unit 4740. GF(E) calculated in the process of extracting high frequency terms in the high frequency term extraction unit 4530 is also read out from the work result storage unit.
The creativity degree calculation unit 4750 calculates, as a creativity degree, a function value of the difference obtained by subtracting the calculation result of the second inverse calculation unit 4740 from the calculation result of the first inverse calculation unit 4730. The function value may be obtained by dividing this difference by the sum of the calculation results of the first inverse calculation unit 4730 and the second inverse calculation unit 4740, or by multiplying it by GF(Eu) in each document group Eu. The calculated creativity degree is sent to the work result storage unit and stored therein.
The keyword extraction unit 4760 reads out, from the work result storage unit, data including Skey(w) calculated by the Skey(w) calculation unit 4580, the concentration degrees calculated by the concentration degree calculation unit 4710, the shares calculated by the share calculation unit 4720, and the creativity degrees calculated by the creativity degree calculation unit 4750.
The keyword extraction unit 4760 extracts keywords based on two or more indexes selected from the four indexes read out, namely Skey(w), the concentration degrees, the shares, and the creativity degrees. The keywords may be extracted, for example, by determining whether the total value of the selected indexes is not less than a prescribed threshold or falls within a prescribed range of ranks, or by categorizing the keywords based on combinations of the selected indexes.
The extracted keyword data is sent to the work result storage unit in the recording device 4503 and stored therein.
Now, the process of extracting keywords will be described with reference to the flowchart in
1. Reading Document
The document read out unit 4510 reads out a document group E including a plurality of documents D1 to DN(E) to be analyzed from the document storage unit in the recording device 4503 (step S4601).
2. Extracting Index Terms
Now, the index term extraction unit 4520 extracts index terms in each document from the document group read out in the document reading step S4601 (step S4602). The index term data in each document may be expressed, for example, by a vector whose components are function values of the number of occurrences of each index term in each document D (index term frequency TF(D)) included in the document group E.
3. Extracting High frequency Terms
Based on the index term data in each document extracted in the index term extracting step S4602, the high frequency term extraction unit 4530 extracts a prescribed number of index terms that have large weights under an evaluation that includes the occurrence frequency in the document group E.
More specifically, GF(E) as the occurrence frequency in the document group E is calculated for each index term (step S4603). In order to calculate the GF(E) of each index term, the index term frequency TF(D) of each index term in each document calculated in the index term extracting step S4602 may be summed up for the documents D1 to DN(E) that belong to the document group E.
For the ease of description, virtual examples of TF(D) and GF(E) when 14 index terms w1 to w14 in total are included in a document group E including six documents D1 to D6 are given in the following table. In the following description, the virtual examples will be referred to as required.
Now, based on the calculated GF(E) of each index term, a prescribed number of index terms with highest occurrence frequencies are extracted (step S4604). The number of extracted high frequency terms is for example ten. In this case, if the tenth and eleventh terms are in the same place in the ranking, the eleventh term is extracted as a high frequency term as well.
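The calculation of GF(E) in step S4603 and the tie-inclusive extraction in step S4604 can be sketched as follows. This is only an illustrative sketch: the toy documents and the function names term_frequencies, global_frequencies, and top_terms_with_ties are assumptions introduced here and are not the virtual examples of the table.

```python
from collections import Counter

# Hypothetical toy data: each document is a list of already extracted index terms.
docs = {
    "D1": ["w1", "w2", "w1", "w3"],
    "D2": ["w1", "w3", "w4"],
    "D3": ["w2", "w4", "w4", "w5"],
}

def term_frequencies(docs):
    """TF(D): number of occurrences of each index term in each document."""
    return {name: Counter(terms) for name, terms in docs.items()}

def global_frequencies(tf):
    """GF(E): TF(D) summed over all documents that belong to the group E."""
    gf = Counter()
    for counts in tf.values():
        gf.update(counts)
    return gf

def top_terms_with_ties(gf, n):
    """Return the n terms with the highest GF(E), plus any term tied with the n-th."""
    ranked = sorted(gf.items(), key=lambda kv: (-kv[1], kv[0]))
    if len(ranked) <= n:
        return [w for w, _ in ranked]
    cutoff = ranked[n - 1][1]
    return [w for w, f in ranked if f >= cutoff]

tf = term_frequencies(docs)
gf = global_frequencies(tf)
high_freq = top_terms_with_ties(gf, 2)
```

With n set to 3 in this toy data, the third and fourth terms share the same GF(E), so four terms are returned, mirroring the handling of the tenth and eleventh terms described above.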
When the high frequency terms are extracted, a prescribed number of index terms with high GF(E)*IDF(P) are preferably extracted by calculating the IDF(P) of each index term. In the following description of the above virtual examples, terms with the highest seven GF(E) are high frequency terms for the ease of description. More specifically, the index terms w1 to w7 are extracted as the high frequency terms.
Note that in order to extract high frequency terms from index terms, it is preferable that unnecessary terms are removed from all the index terms in advance and high frequency terms are extracted from the remaining terms. However, for a Japanese document, for example, terms are segmented differently by different kinds of morpheme analysis software, and therefore it is impossible to create a sufficient unnecessary term list. Therefore, a minimum number of unnecessary terms is preferably removed. An unnecessary term list may include the following examples for patent documents.
Words Meaningless as Keywords
said, the above-described, the, the following, description, claims, claim, patent, number, formula, general, foregoing, as follows, means, characteristic
Terms with Little Importance as Keywords, Unit Symbols, Roman Numerals
entire, range, kind, class, system, for, %, mm, ml, nm, μm, etc.
In this example, general applicability is at issue, and therefore the above-described examples are selected as unnecessary terms, while further terms may be added to the list as required depending on the kind of morpheme analysis software to be used or the field of the document group.
4. Calculating High Frequency Term-Index Term Co-occurrence Degree
Now, the high frequency term-index term co-occurrence degree calculation unit 4540 calculates the degree of co-occurrence between each high frequency term extracted in the above-described high frequency term extracting step S4604 and each index term extracted in the above index term extracting step S4602 (step S4605).
The degree of co-occurrence C(wi,wj) of the index terms wi and wj in the document group E is for example calculated by the following expression.
C(wi,wj)=Σ{D∈E}[β(wi,D)×β(wj,D)×DF(wi,D)×DF(wj,D)] (1)
where β(wi,D) is the weight of an index term wi in the document D, and can be for example any of the following.
β(wi,D)=1
β(wi,D)=TF(wi,D)
β(wi,D)=TF(wi,D)×IDF(wi,P)
Since DF(wi,D) is 1 if the index term wi is included in the document D and zero if it is not included, DF(wi,D)×DF(wj,D) is 1 if the index terms wi and wj co-occur in one document D and zero if they do not. This value, weighted with β(wi,D) and β(wj,D), is calculated for all the documents D that belong to the document group E, and the results are totaled. The totaled result represents the degree of co-occurrence C(wi,wj) of the index terms wi and wj.
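Expression (1) can be sketched as follows for two of the weights listed above, β=1 and β=TF. The toy documents and the names df, cooccurrence, and tf_weight are assumptions introduced for illustration only.

```python
def df(term, doc_terms):
    """DF(w,D): 1 if the index term occurs in the document, otherwise 0."""
    return 1 if term in doc_terms else 0

def cooccurrence(wi, wj, docs, beta=lambda w, d: 1):
    """C(wi,wj) of Expression (1): weighted co-occurrence totaled over all D in E."""
    total = 0
    for doc_terms in docs.values():
        total += (beta(wi, doc_terms) * beta(wj, doc_terms)
                  * df(wi, doc_terms) * df(wj, doc_terms))
    return total

def tf_weight(w, doc_terms):
    """Alternative weight: beta(w,D) = TF(w,D)."""
    return doc_terms.count(w)

# Hypothetical document group E (lists of index terms per document).
docs = {
    "D1": ["w1", "w2", "w1"],
    "D2": ["w1", "w3"],
    "D3": ["w1", "w2", "w3"],
    "D4": ["w3"],
}

c_11 = cooccurrence("w1", "w1", docs)                     # w1 occurs in D1 to D3
c_21 = cooccurrence("w2", "w1", docs)                     # w2 and w1 co-occur in D1 and D3
c_12_tf = cooccurrence("w1", "w2", docs, beta=tf_weight)  # TF-weighted variant
```

With β=1 the result is simply the number of documents in which both terms occur; with β=TF, a document in which a term occurs repeatedly contributes proportionally more.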
Note that in a variation of the above Expression (1), the co-occurrence degree c(wi,wj) in the document D, calculated based on the presence/absence of the co-occurrence of the index terms wi and wj in a sentence, may be used instead of [β(wi,D)×β(wj,D)]. The co-occurrence degree c(wi,wj) in the document D may be calculated for example by the following expression.
c(wi,wj)=Σ{sen∈D}[TF(wi,sen)×TF(wj,sen)] (2)
where sen means each sentence in the document D. If the index terms wi and wj co-occur in a certain sentence, [TF(wi,sen)×TF(wj,sen)] returns a value of at least 1, and it returns zero if they do not. This is carried out for every sentence sen in the document D, and the results are totaled as the degree of co-occurrence c(wi,wj) in the document D.
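The sentence-level co-occurrence degree of Expression (2) can be sketched as follows. The representation of a document as a list of sentences, each being a list of index terms, and the name sentence_cooccurrence are assumptions made for illustration.

```python
def sentence_cooccurrence(wi, wj, sentences):
    """c(wi,wj) of Expression (2): sum over sentences of TF(wi,sen) x TF(wj,sen)."""
    return sum(sen.count(wi) * sen.count(wj) for sen in sentences)

# Hypothetical document D consisting of three sentences.
doc = [
    ["w1", "w2", "w1"],  # w1 twice and w2 once: contributes 2 x 1 for (w1, w2)
    ["w2", "w3"],
    ["w1", "w3"],
]

c_12 = sentence_cooccurrence("w1", "w2", doc)
c_23 = sentence_cooccurrence("w2", "w3", doc)
```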
Based on the above-described virtual examples, the co-occurrence degrees may be calculated from Expression (1) with the weight β(wi,D)=1 as follows. The index term w1 co-occurs with itself in three documents D1 to D3 in total, in other words, the co-occurrence degree C(w1,w1)=3. The index terms w2 and w1 co-occur in two documents D1 and D3 in total, in other words, the co-occurrence degree C(w2,w1)=2. Similarly, if the co-occurrence degree C(wi,wj) is calculated for combinations of any one of the index terms w1 to w14 and any one of the high frequency terms w1 to w7, the following matrix data including 14 rows and 7 columns results.
5. Clustering
Then, the clustering unit 4550 carries out cluster-analysis to the high frequency terms based on the co-occurrence degrees calculated in the high frequency term-index term co-occurrence calculating step S4605.
In order to carry out the cluster analysis, similarity measures between the high frequency terms are first calculated from their co-occurrence degrees with the index terms (step S4606).
In the above-described virtual examples, the result of calculation when correlation coefficients between the 14-dimensional column vectors of the high frequency terms w1 to w7 are employed as the similarity measures is given in the following table.
The lower left half of the table mirrors the upper right half, so one of the two halves is omitted. In the table, for example, any combination of two of the high frequency terms w1 to w4 has a correlation coefficient of more than 0.8. Likewise, any combination of two of the high frequency terms w5 to w7 has a correlation coefficient of more than 0.8. Conversely, for combinations of any of the high frequency terms w1 to w4 and any of the high frequency terms w5 to w7, the correlation coefficients are all less than 0.8.
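A correlation coefficient between two column vectors of co-occurrence degrees can be computed as in the following sketch. The Pearson coefficient is assumed here as the correlation coefficient, and the short vectors col_a, col_b, and col_c are hypothetical values, not the columns of the matrix data in the virtual examples.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length vectors.

    Assumes neither vector is constant (otherwise the denominator is zero).
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical columns of co-occurrence degrees for three high frequency terms.
col_a = [3, 2, 3, 3, 1, 0]
col_b = [2, 2, 2, 2, 0, 0]
col_c = [0, 0, 1, 0, 3, 2]

r_ab = pearson(col_a, col_b)  # similar co-occurrence pattern, high similarity
r_ac = pearson(col_a, col_c)  # opposite pattern, low similarity
```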
Now, based on the result of calculation of the similarity measures, a tree-like diagram in which high frequency terms are connected like a tree is created (step S4607).
As a tree-like diagram, a dendrogram in which dissimilarity measures between the high frequency terms are reflected in the height of the connecting positions (connecting distances) is desirably created. According to the principle of creating such a dendrogram, based on the dissimilarity measures between the high frequency terms, the high frequency terms having the minimum dissimilarity measure (the largest similarity measure) are connected with each other to form a connected body. Then, a connected body is connected to another high frequency term, or a connected body and another connected body are connected, one after another in the ascending order of dissimilarity measures. In this way, a hierarchical representation is formed. The dissimilarity measure between a connected body and another high frequency term or the dissimilarity measure between connected bodies is updated based on the dissimilarities between the high frequency terms. The updating may be carried out for example according to the known Ward method.
Then, the clustering unit 4550 cuts the above-created tree-like diagram (step S4608). For example, the diagram is cut at the position of <D>+δσD where the connecting distance in the dendrogram is D, <D> is the average of D, σD is the standard deviation of D, and δ is given within a range −3≦δ≦3, preferably δ=0.
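The cutting position <D>+δσD can be sketched as follows. The connecting distances in merge_heights are hypothetical, the population standard deviation is assumed for σD (the description does not state which estimator is used), and a connection exactly at the cutting height is treated here as lying below the cut.

```python
from statistics import mean, pstdev

def cut_height(distances, delta=0.0):
    """Cutting position <D> + delta * sigma_D over the connecting distances D."""
    return mean(distances) + delta * pstdev(distances)

def clusters_after_cut(distances, delta=0.0):
    """Number of clusters left after cutting: every connection above the cut is undone."""
    height = cut_height(distances, delta)
    return 1 + sum(1 for d in distances if d > height)

# Hypothetical connecting distances for seven high frequency terms (six connections).
merge_heights = [0.1, 0.2, 0.2, 0.3, 0.4, 2.4]

b = clusters_after_cut(merge_heights, delta=0.0)
```

In this toy data only the last connection lies above the average connecting distance, so cutting with δ=0 leaves two clusters.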
As the result of cutting, the high frequency terms are clustered based on the similarity measures for the co-occurrence degrees with the index terms, and the “ground” gh (h=1, 2, . . . , b) consisting of a high frequency term group that belongs to each cluster is formed. High frequency terms that belong to the same ground gh have higher similarity measures in the co-occurrence degrees with the index terms, and high frequency terms that belong to different grounds gh have low similarity measures in the co-occurrence degrees with the index terms.
The description of the tree-like diagram and the process of cutting it is omitted for the above-described virtual examples; assume instead that two grounds (the number of grounds b=2) are formed, i.e., the ground g1 including the high frequency terms w1 to w4 and the ground g2 including the high frequency terms w5 to w7.
6. Calculating Index Term-Ground Co-Occurrence Degree
Then, the index term-ground co-occurrence degree calculation unit 4560 calculates the degree of co-occurrence Co(w,g) (index term-ground co-occurrence degree) between each index term extracted in the index term extracting step S4602 and each ground formed in the clustering step S4608 (step S4609).
The index term-ground co-occurrence degree Co(w,g) is for example calculated by the following expression.
Co(w,g)=Σ{w′∈g, w′≠w}C(w,w′) (3)
where w′ is a high frequency term that belongs to a certain ground g and refers to a term other than the index term w to be measured for the degree of co-occurrence Co (w,g). The degree of co-occurrence Co(w,g) between the index term w and the ground g is the total of the degrees of co-occurrence C(w,w′) between all w′ and w.
In the above-described virtual examples, the co-occurrence degree Co(w1,g1) between the index term w1 and the ground g1 is represented as follows:
Co(w1,g1)=C(w1,w2)+C(w1,w3)+C(w1,w4)
From Table 3, the value equals 2+3+3=8.
The co-occurrence degree Co (w1, g2) between the index term w1 and the ground g2 is represented as follows:
Co(w1,g2)=C(w1,w5)+C(w1,w6)+C(w1,w7)=1+1+0=2
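The two sums worked out above for the index term w1 can be reproduced with the following sketch of Expression (3). Only the co-occurrence degrees quoted in the text for w1 are used; the dictionary layout and the name ground_cooccurrence are assumptions for illustration.

```python
# Co-occurrence degrees C(w1, w') quoted in the text for the index term w1.
C = {
    ("w1", "w1"): 3, ("w1", "w2"): 2, ("w1", "w3"): 3, ("w1", "w4"): 3,
    ("w1", "w5"): 1, ("w1", "w6"): 1, ("w1", "w7"): 0,
}

g1 = ["w1", "w2", "w3", "w4"]  # ground g1
g2 = ["w5", "w6", "w7"]        # ground g2

def ground_cooccurrence(w, ground, C):
    """Co(w,g) of Expression (3): sum of C(w,w') over w' in g, excluding w itself."""
    return sum(C[(w, wp)] for wp in ground if wp != w)

co_w1_g1 = ground_cooccurrence("w1", g1, C)  # 2 + 3 + 3
co_w1_g2 = ground_cooccurrence("w1", g2, C)  # 1 + 1 + 0
```

Note that C(w1,w1)=3 is excluded from Co(w1,g1) because of the condition w′≠w.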
Similarly, all the index terms w and the grounds g1 and g2 are calculated for the co-occurrence degrees and the result is given in the following table.
Note that the index term-ground co-occurrence degree may be calculated by the following expression rather than the above Co(w,g).
Co′(w,g)=Σ{D∈E}[β(w,D)×DF(w,D)×Θ(Σ{w′∈g, w′≠w}DF(w′,D))] (4)
where Θ(X) is a function that returns 1 if X>0 and 0 if X≦0. According to Θ(Σ{w′∈g, w′≠w}DF(w′,D)), 1 is returned if a document D includes at least one w′, i.e., a high frequency term that belongs to the ground g and is other than the index term w to be measured for the co-occurrence degree, while zero is returned if no such term is included. DF(w,D) returns 1 if the index term w to be measured for the co-occurrence degree is included in the document D and returns zero if it is not. Multiplying DF(w,D) by Θ(X) returns 1 if w and any w′ that belongs to the ground g co-occur in the document D and zero if there is no co-occurrence. This is multiplied by the weight β(w,D) defined above, and the total of the results for all the documents D that belong to the document group E is Co′(w,g).
The index term-ground co-occurrence degree Co(w,g) in Expression (3) is created by totaling the presence/absence (1 or 0) of co-occurrence of w with w′ in D with a weight β(w,D)×β(w′,D) for all E (C(w,w′)) and totaling the results for w′ in g. On the other hand, the index term-ground co-occurrence degree Co′ (w,g) in Expression (4) is created by totaling the presence/absence (1 or 0) of co-occurrence of w and any w′ in g in D with a weight β(w, D) for all E.
Therefore, in either case, the larger the number of documents D in which the index term co-occurs with high frequency terms, the larger the resulting index term-ground co-occurrence degree. The index term-ground co-occurrence degree Co(w,g) in Expression (3) changes with the number of w′ in the ground g that co-occur with the index term w, while the index term-ground co-occurrence degree Co′(w,g) in Expression (4) changes based only on the presence/absence of w′ in the ground g that co-occurs with the index term w, independently of increase/decrease in the number of w′. When Co(w,g) in Expression (3) is used, it is preferable that the weight β(w,D)=1, while when Co′(w,g) in Expression (4) is used, it is preferable that the weight β(w,D)=TF(w,D).
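The alternative degree Co′(w,g) of Expression (4) can be sketched as follows. The toy documents and the names theta, ground_cooccurrence_alt, beta_one, and beta_tf are assumptions for illustration.

```python
def theta(x):
    """Theta(X): 1 if X > 0, otherwise 0."""
    return 1 if x > 0 else 0

def ground_cooccurrence_alt(w, ground, docs, beta):
    """Co'(w,g) of Expression (4), totaled over all documents D in E."""
    total = 0
    for doc_terms in docs.values():
        df_w = 1 if w in doc_terms else 0
        # Number of ground terms w' (other than w) present in this document.
        others = sum(1 for wp in ground if wp != w and wp in doc_terms)
        total += beta(w, doc_terms) * df_w * theta(others)
    return total

def beta_one(w, doc_terms):
    """Weight beta(w,D) = 1."""
    return 1

def beta_tf(w, doc_terms):
    """Weight beta(w,D) = TF(w,D), stated above as preferable for Expression (4)."""
    return doc_terms.count(w)

# Hypothetical document group E and ground g1.
docs = {
    "D1": ["w8", "w8", "w1", "w2"],
    "D2": ["w8", "w5"],
    "D3": ["w1", "w2"],
    "D4": ["w8"],
}
g1 = ["w1", "w2", "w3", "w4"]

co_alt_tf = ground_cooccurrence_alt("w8", g1, docs, beta_tf)
co_alt_one = ground_cooccurrence_alt("w8", g1, docs, beta_one)
```

Document D1 contains two ground terms (w1 and w2) together with w8, yet it is counted once, which illustrates that Co′(w,g) does not depend on how many w′ co-occur.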
7. Calculating Key(w)
Then, the key(w) calculation unit 4570 calculates key(w) that is the evaluation score of each index term based on the co-occurrence degree between each index term and the ground calculated in the index term-ground co-occurrence degree calculating step S4609 (step S4610).
For example, key(w) is calculated by the following expression:
key(w)=1−Π{1≦h≦b}[1−Co(w,gh)/F(gh)] (5)
where F(gh)=Σ{w∈E}Co(w,gh) by definition. This is the total of the co-occurrence degrees Co(w,gh) between the index terms w and the ground gh for all the index terms w. Co(w,gh) is divided by F(gh), the result is subtracted from 1, the differences are multiplied together for all the grounds gh (h=1, 2, . . . , b), and the product is subtracted from 1 to obtain key(w).
Note that as the index term-ground co-occurrence degree, Co(w,g) in the above-described Expression (3) is used, while Co′(w,g) in Expression (4) may be used as described above.
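Expression (5) can be sketched as follows. The co-occurrence degrees in Co and the term names "a", "b", and "c" are hypothetical and are not the values of the virtual examples; key_score is an illustrative name.

```python
def key_score(w, Co, grounds):
    """key(w) of Expression (5): 1 - product over grounds of [1 - Co(w,gh)/F(gh)]."""
    product = 1.0
    for g in grounds:
        F = sum(Co[g].values())  # F(gh): total of Co(w,gh) over all index terms
        product *= 1 - Co[g][w] / F
    return 1 - product

# Hypothetical index term-ground co-occurrence degrees Co[g][w].
Co = {
    "g1": {"a": 8, "b": 6, "c": 2},  # F(g1) = 16
    "g2": {"a": 2, "b": 0, "c": 6},  # F(g2) = 8
}
grounds = ["g1", "g2"]

keys = {w: key_score(w, Co, grounds) for w in ["a", "b", "c"]}
```

In this toy data the term "b" co-occurs with only one ground and receives the lowest score, while "a" and "c", which co-occur with both grounds, score higher.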
From Table 4, F(gh) is calculated in the above virtual examples as follows.
Similarly, key (w) is calculated for all the index terms as shown in the following table.
The column in the right end of the table indicates the ranking of key(w) when they are arranged in the descending order.
In order to describe the characteristic of key(w), the same content in Table 2 is indicated with the document frequency DF(E) of each of the index terms and the ranking of key(w) described above as follows:
As can be understood from the table, the ranking of key (w) is greatly affected by the ranking of the document frequencies DF(E) in the document group E. For example, the index term w8 with the maximum DF(E) corresponds to key(w) in the first rank, and the index term w4 with the next largest DF(E) corresponds to key (w) in the second rank, and thereafter, the same applies to the index terms w3, w5, w6, and the like.
Index terms with larger document frequencies DF(E) in the document group E can co-occur with high frequency terms in a larger number of documents. Therefore, a larger index term-ground co-occurrence degree Co(w,g) or Co′(w,g) can be obtained. This is considered to be the reason why the ranking of key(w) is greatly affected by the ranking of DF(E).
Note that if the weight β(w,D) used for calculation of the co-occurrence degrees is replaced by TF(w,D), the ranking of key(w) is considered to be more affected by the ranking of the global frequency GF(E) in the document group E.
As can be understood from comparison among index terms w9 to w14 in Tables 3 and 7, those co-occurring with high frequency terms covering a larger number of grounds have greater key(w). For example, a high frequency term co-occurring with index terms w10 to w13 covers two grounds, while a high frequency term co-occurring with the index terms w9 and w14 is localized to one ground. The index terms w10 to w13 have greater key(w) than the index terms w9 and w14.
As can be understood from comparison among index terms w10 to w13 in Tables 3 and 7, those co-occurring with more high frequency terms tend to have greater key(w). For example, the index term w12 co-occurring with the largest number of high frequency terms among w10 to w13 has the largest key(w), and w11 co-occurring with the second largest number of high frequency terms has the second largest key(w).
Note that the following expression may be used as the evaluation score of each index term instead of the above key(w).
key′(w)=(1/Φ)×(1/b)×Σ{1≦h≦b}Co(w,gh) (6)
where Φ is an appropriate normalization constant, for example Φ=Σ{1≦h≦b}F(gh), and F(gh) is as defined for the above Expression (5).
key′(w) is created by multiplying the average of the co-occurrence degrees Co(w,gh) between the index terms w and the grounds gh in all grounds gh(h=1, . . . , b) by the constant (1/Φ).
The following expression may also be used as the evaluation score of each index term instead of key(w).
key″(w)=(1/b)×Σ{1≦h≦b}[Co(w,gh)/F(gh)] (7)
key″(w) is created by dividing the co-occurrence degree Co(w,gh) between the index term w and the ground gh by F(gh) and obtaining the average over all the grounds gh (h=1, . . . , b).
When the product in key(w) in Expression (5) is expanded and the small higher-order terms O[(Co(w,gh)/F(gh))^2] are ignored, the following is established.
key(w)≈Σ{1≦h≦b}[Co(w,gh)/F(gh)]
Therefore, key″(w)≈(1/b)key(w) is established.
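The approximation can be checked numerically with the following sketch, which takes the ratios Co(w,gh)/F(gh) directly as input. The ratio values and the function names are hypothetical, chosen small so that the second-order term is negligible.

```python
def key_from_ratios(ratios):
    """key(w) of Expression (5), given r_h = Co(w,gh)/F(gh) for each ground."""
    product = 1.0
    for r in ratios:
        product *= 1 - r
    return 1 - product

def key_double_prime(ratios):
    """key''(w): average of Co(w,gh)/F(gh) over all grounds."""
    return sum(ratios) / len(ratios)

# Hypothetical ratios for b = 2 grounds.
ratios = [0.02, 0.03]
b = len(ratios)

exact = key_from_ratios(ratios)          # 1 - 0.98 * 0.97
approx = b * key_double_prime(ratios)    # first-order approximation
second_order = approx - exact            # equals the product of the two ratios
```

For two grounds the neglected term is exactly the product of the two ratios, here 0.0006, which is small compared with key(w) itself.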
8. Calculating Skey(w)
Then, in the Skey(w) calculation unit 4580, the Skey(w) score is calculated based on the key(w) score of each index term calculated in the key(w) calculating step S4610 and on the GF(E) and IDF(P) of each index term calculated in the high frequency term extracting step S4604 (step S4611).
The Skey(w) score is calculated by the following expression.
Skey(w)=GF(w,E)×ln[key(w)×N(P)/DF(w,P)]=GF(w,E)×[ln key(w)+IDF(P)] (8)
A large value is provided for GF(w,E) of a term occurring very often in a document group E, and a large value is provided for IDF(P) of a term rare in all the documents P and unique to the document group E. As described above, key(w) is affected by DF(E), and a large value is provided to key(w) of a term that co-occurs with a larger number of grounds. The larger the values for GF(w,E), IDF(P), and key(w) are, the larger will be Skey(w).
TF*IDF, often used as a weight for an index term, is the product of the index term frequency TF and IDF, which is the logarithm of the inverse of the occurrence ratio DF(P)/N(P) of the index term in a document set. IDF effectively reduces the contribution of an index term occurring at a high rate in the document set and can give a high weight to an index term occurring locally in particular documents. However, the value can sometimes be raised merely because the document frequency is small. As will be described, the Skey(w) score remedies this disadvantage.
In a document group E to be analyzed, when the probability of the occurrence of a document including an index term w is P(A), the probability of the occurrence of a document including a ground (an index term that belongs to a ground) is P(B), and the probability of the occurrence of a document including both an index term w and a ground (the percentage that they co-occur in a document) is P(A∩B), the following expressions are established.
P(A)=DF(w,E)/N(E)
P(A∩B)=key(w)
Therefore, in the document group E, the probability (conditional probability) of the co-occurrence of a selected document including an index term w with a ground is represented as follows:
P(B|A)=P(A∩B)/P(A)=key(w)×N(E)/DF(w,E) (9)
Furthermore, if the assumption of uniformity (IDF(E)=IDF(P)) is adopted and the logarithm of the conditional probability is taken, the following expression is obtained.
ln P(B|A)=ln[key(w)×N(P)/DF(w,P)]=ln key(w)+IDF(P) (10)
The value is equal to IDF(P) if key(w)=1. At the limit DF→0, N(P)/DF(w,P)→∞ while key(w)→0; therefore, by taking the product of N(P)/DF(w,P) and key(w), the disadvantage that the IDF value is singularly raised when the DF value is small can be mitigated. The Skey(w) score in Expression (8) is the product of GF(w,E) and ln key(w)+IDF(P) in Expression (10), and can therefore be regarded as GF(E)*IDF(P) corrected by the degree of co-occurrence.
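The relation between Skey(w) and GF(E)*IDF(P) described above can be sketched as follows. The form used is the one the description states, namely the product of GF(w,E) and ln key(w)+IDF(P) with IDF taken as a natural logarithm; the numeric inputs and the names idf and skey are assumptions for illustration, and key(w) is assumed to be greater than zero.

```python
from math import log, isclose

def idf(n_docs, doc_freq):
    """IDF(P): natural logarithm of N(P)/DF(w,P)."""
    return log(n_docs / doc_freq)

def skey(gf, key, n_docs, doc_freq):
    """Skey(w) = GF(w,E) x [ln key(w) + IDF(P)]; requires key > 0."""
    return gf * (log(key) + idf(n_docs, doc_freq))

# Hypothetical values: GF(w,E)=10, N(P)=1000, DF(w,P)=10.
plain = 10 * idf(1000, 10)           # GF(E)*IDF(P) without correction
full = skey(10, 1.0, 1000, 10)       # key(w)=1 leaves GF(E)*IDF(P) unchanged
reduced = skey(10, 0.5, 1000, 10)    # weaker co-occurrence lowers the score
```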
Note that in the calculation of Skey(w) by Expression 8, instead of key(w) in Expression 5, key′(w) in Expression 6 or key″(w) in Expression 7 may be used as described above.
When key″ (w) in Expression 7 is used, the Skey (w) score is represented as Skey(key″), while when key (w) in Expression 5 is used, the Skey(w) score is represented as Skey(key), and then they can be compared as follows.
Skey(key)−Skey(key″)=GF(w,E)×[ln key(w)−ln key″(w)]≈GF(w,E)×ln b
Therefore, the behaviors of Skey(w) using key″ (w) in Expression (7) and Skey(w) using key(w) in Expression (5) substantially match excluding the difference in the number of grounds b, and as long as the number of grounds b is not large, the ranking of Skey(w) scores is not greatly affected.
9. Calculating Evaluation Value
When Skey(w) has been calculated, the evaluation value calculation unit 4700 calculates an evaluation value A(wi,Eu) based on a function value of the occurrence frequency of the index term wi in each document group Eu, for each document group Eu and each index term wi (step S4612).
As the evaluation value A(wi,Eu), for example, the above Skey(w) may be used as it is, or Skey(w)/N(Eu) or GF(E)*IDF(P) may be used. For example, for each document group Eu and each index term wi, the following data is obtained. Note that for the ease of description, the number W of kinds of index terms equals 5 and the number of document groups n equals 3.
10. Calculating Concentration Degree
Then, the concentration degree calculation unit 4710 calculates the degree of concentration for each index term wi as follows (step S4613).
For each index term wi, the sum of the evaluation values A(wi,Eu) over all the document groups Eu that belong to the document group set S, in other words, the sum Σ{1≦u≦n}A(wi,Eu), is calculated, and the ratio of the evaluation value A(wi,Eu) in each document group Eu relative to the sum is calculated for each document group Eu and each index term wi as follows.
A(wi,Eu)/Σ{1≦u≦n}A(wi,Eu)
The sum of squares of the ratios over all the document groups Eu that belong to the document group set S for each index term wi represents the concentration degree of the index term wi in the document group set S.
Σ{1≦u≦n}[A(wi,Eu)/Σ{1≦v≦n}A(wi,Ev)]^2
This is represented as follows in the example of the above table, and the degree of concentration for each index term wi is calculated.
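The concentration degree can be sketched as follows with hypothetical evaluation values for n=3 document groups; these are not the values of the table referred to above, and the name concentration is illustrative. The sum of the evaluation values is assumed to be greater than zero.

```python
from math import isclose

def concentration(values):
    """Sum of squared ratios A(wi,Eu)/sum_u A(wi,Eu) over the document groups Eu."""
    total = sum(values)
    return sum((a / total) ** 2 for a in values)

# Hypothetical evaluation values A(wi,E1), A(wi,E2), A(wi,E3) for three index terms.
only_one_group = concentration([4, 0, 0])  # term found in a single group
two_groups = concentration([2, 2, 0])      # term split evenly over two groups
all_groups = concentration([1, 1, 1])      # term spread evenly over all groups
```

The value is 1 when a term is confined to one document group and falls to 1/n when the term is spread evenly over all n groups.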
11. Calculating Shares
Then, the share calculation unit 4720 calculates the share of each index term wi in each document group Eu as follows (step S4614).
In each document group Eu, the sum of the evaluation values A(wi,Eu) of all the index terms wi extracted from the above-described document group set S, i.e., Σ{1≦i≦W}A(wi,Eu), is calculated. The ratio of the evaluation value A(wi,Eu) of each index term wi relative to the sum is calculated as A(wi,Eu)/Σ{1≦i≦W}A(wi,Eu). This is represented as follows in the example of the above table, and the share of each index term wi in each document group Eu is determined.
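The share calculation can be sketched as follows for one document group Eu with W=5 index terms. The evaluation values are hypothetical, not those of the table referred to above, and the name shares is illustrative.

```python
from math import isclose

def shares(evaluation):
    """Share of each index term: A(wi,Eu) divided by the sum over all index terms."""
    total = sum(evaluation.values())
    return {w: a / total for w, a in evaluation.items()}

# Hypothetical evaluation values A(wi,Eu) in one document group Eu.
A_u = {"w1": 4, "w2": 3, "w3": 2, "w4": 1, "w5": 0}

share_u = shares(A_u)
```

By construction the shares of all index terms in one document group add up to 1.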
12. Calculating Creativity Degrees
Then, values representing the creativity degrees of the index terms wi are calculated as follows.
The first inverse calculation unit 4730 calculates a function value of the inverse of the occurrence frequency of each index term wi in the document group set S (step S4615).
As the occurrence frequency in the document group set S, a document frequency DF(S) for example is used. As a function value of the inverse of the occurrence frequency, the inverse document frequency IDF(S) in the document group set S or a value (normalized IDF(S)) created by normalizing IDF(S) by all index terms extracted from a document group Eu to be analyzed is used as a particularly preferable example. Herein, IDF(S) is the logarithm of “the inverse of DF(S)×the document number N(S) in the document group set S.” An example of the normalization includes the use of a deviation value. The normalization is carried out to sort out the distribution, so that the creativity degree based on the combination with IDF(P) described above can be more easily calculated.
Then, the second inverse calculation unit 4740 calculates a function value of the inverse of the occurrence frequency of each index term wi in the large document set P including the document group set S (step S4616).
As a function value of the inverse of the occurrence frequency, IDF(P) or a value (normalized IDF(P)) created by normalizing IDF(P) by all index terms extracted from the document group Eu to be analyzed is used as a particularly preferable example. An example of the normalization includes the use of a deviation value. The normalization is carried out to sort out the distribution, so that the creativity degree based on the combination with IDF(S) described above can be more easily calculated.
Then, the creativity degree calculation unit 4750 calculates a function value of {the function value of IDF(S)−the function value of IDF(P)} for each index term wi as a creativity degree (step S4617). If only IDF(S) and IDF(P) are used for calculating the creativity degree, one value is created for each index term wi as the creativity degree. If the normalized IDF(S) or the normalized IDF(P) normalized by the document group Eu, or GF(Eu) is separately used as a weight, the creativity degree is calculated for each document group Eu and for each index term wi.
The creativity degree is particularly preferably provided as DEV in the following expression:
DEV=normalized GF(Eu)×{normalized IDF(S)−normalized IDF(P)}/{normalized IDF(S)+normalized IDF(P)}
The normalized GF(Eu) as the first factor of DEV is created by normalizing the global frequency GF(Eu) of each index term wi in the document group Eu to be analyzed by all the index terms extracted from the document group Eu to be analyzed.
When normalization is carried out so that the normalized IDF(S)>0 and the normalized IDF(P)>0, the second factor of DEV is positive if the normalized value of IDF in the document group set S is greater than the normalized value of IDF in the large document set P and negative if it is smaller. If IDF in the document group set S is large, that means the term is rare in the document group set S. Among such rare terms in the document group set S, terms with small IDF in the large document set P including the document group set S have creativity when they are used in the field related to the document group set S, even if the terms are often used in other fields. Because it is divided by {normalized IDF(S)+normalized IDF(P)}, the second factor of DEV is in the range from −1 to +1, which makes comparison among different document groups Eu easier.
Since DEV is in proportion with the normalized GF(Eu), it takes a larger value for a term with a higher frequency in the document group.
When the document group set S includes a plurality of document groups Eu (u=1, 2, . . . ) in particular, and the creativity degree ranking is created with each document group Eu as the document group to be analyzed, index terms common in the document group set S are placed lower in the ranking and terms characteristic of each document group Eu are placed higher in the ranking, which is advantageous for grasping the characteristic of each document group Eu.
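The creativity degree DEV described above, i.e., the normalized GF(Eu) multiplied by a second factor divided by {normalized IDF(S)+normalized IDF(P)}, can be sketched as follows. The deviation value is assumed here to be the usual standard score 50+10×(x−mean)/σ with the population standard deviation; this definition, the numeric inputs, and the names deviation_value and dev are assumptions not stated in the description.

```python
from math import isclose
from statistics import mean, pstdev

def deviation_value(xs):
    """Assumed normalization: 50 + 10 * (x - mean) / population standard deviation."""
    m, s = mean(xs), pstdev(xs)
    return [50 + 10 * (x - m) / s for x in xs]

def dev(norm_gf, norm_idf_s, norm_idf_p):
    """DEV = normalized GF(Eu) x (nIDF(S) - nIDF(P)) / (nIDF(S) + nIDF(P))."""
    return norm_gf * (norm_idf_s - norm_idf_p) / (norm_idf_s + norm_idf_p)

# Hypothetical normalized values for two index terms in one document group Eu.
rare_in_s = dev(60.0, 60.0, 40.0)    # rarer in S than in P: positive DEV
common_in_s = dev(60.0, 40.0, 60.0)  # more common in S than in P: negative DEV

normalized = deviation_value([1.0, 2.0, 3.0])
```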
13. Extracting Keywords
Then, the keyword extraction unit 4760 extracts keywords based on at least two indexes selected from the four indexes obtained in the foregoing steps, namely Skey(w), the degree of concentration, the share, and the creativity degree (step S4618).
Preferably, all the four indexes, Skey(w), the degree of concentration, the share, and the creativity degree, are used to sort the index terms wi in the document group Eu into “unimportant terms” and important terms, and to sort the important terms further into “technical region terms,” “main terms,” “creative terms,” and “other important terms.” A particularly preferable method of sorting is as follows.
For the first determination, Skey(w) is used. In each document group Eu, the descending ranking of Skey(w) is created, and keywords below a prescribed place in the ranking are determined as “unimportant terms” and removed from the range of keyword extraction. Keywords within the prescribed range of places are important terms in each document group Eu, and are therefore determined as “important terms.” These terms are further sorted in the following determinations.
For the second determination, the degree of concentration is used. Terms with low concentration degrees are scattered over the entire document group set, and can therefore be understood as widely representing the technical area to which the document group to be analyzed belongs. Therefore, the ascending ranking of the concentration degrees is created in the document group set S, and terms within a prescribed place in the ranking are determined as “technical region terms.” Among the important terms in each document group Eu, keywords coinciding with the technical region terms are sorted as “technical region terms” in the document group Eu.
For the third determination, the share is used. Terms with high shares have greater shares in the document group to be analyzed than other terms, and therefore can be understood as terms well explaining the document group to be analyzed (main terms). Therefore, the share descending ranking for the important terms that are not sorted by the second determination is created in each document group Eu, and terms within a prescribed place in the ranking are determined as “main terms.”
For the fourth determination, the creativity degree is used. In each document group Eu, the creativity degree descending ranking for the important terms that are not sorted by the third determination is created and terms within a prescribed place in the ranking are determined as “creative terms.” The remaining important terms are determined as “other important terms.”
The determination process can be represented by a table as follows:
In the foregoing determination, Skey(w) is used as the index of the importance degree in the first determination, but another index indicating the importance degree in the document group may be used instead. For example, GF(E)*IDF(P) may be used.
In the foregoing determination, the four indexes, the importance degree, the degree of concentration, the share, and the creativity degree are used, but any two or more of these indexes may be used to sort the index terms.
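The four determinations described above can be illustrated by the following minimal sketch. It is only one possible reading of the procedure: the function name `sort_terms`, the dictionary inputs, and the cut-off ranks `n_important`, `n_region`, `n_main`, and `n_creative` (standing in for the “prescribed places” in the rankings) are assumptions introduced for illustration and do not appear in the description.

```python
def sort_terms(skey, concentration, share, creativity,
               n_important, n_region, n_main, n_creative):
    """Sort the index terms of one document group Eu into categories.

    skey, share, creativity: dict term -> score within Eu.
    concentration: dict term -> concentration degree in the document group set S.
    The n_* cut-offs are hypothetical stand-ins for the prescribed places.
    """
    labels = {}

    # First determination: descending Skey(w) ranking.
    ranked = sorted(skey, key=skey.get, reverse=True)
    for w in ranked[n_important:]:
        labels[w] = "unimportant"
    remaining = ranked[:n_important]

    # Second determination: ascending concentration ranking over S.
    region = set(sorted(concentration, key=concentration.get)[:n_region])
    for w in remaining:
        if w in region:
            labels[w] = "technical region"
    remaining = [w for w in remaining if w not in region]

    # Third determination: descending share ranking of what is left.
    remaining.sort(key=share.get, reverse=True)
    for w in remaining[:n_main]:
        labels[w] = "main"
    remaining = remaining[n_main:]

    # Fourth determination: descending creativity ranking of what is left.
    remaining.sort(key=creativity.get, reverse=True)
    for w in remaining[:n_creative]:
        labels[w] = "creative"
    for w in remaining[n_creative:]:
        labels[w] = "other important"
    return labels
```

Because each determination only looks at the terms left over from the previous one, a term receives exactly one label.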
As described above, the keywords are sorted using the four indexes, the importance degree, the degree of concentration, the share, and the creativity degree. Eventually, cluster information including the title, the number of publications, the total of IPC classes (top five), and the total of applicants (top five) for each cluster and the important keywords in the clusters is stored in the recording device in the second analysis server 514, and provided to the management server 512. The management server 512 provides the result of processing by the second analysis server 514 to the file creating server 516.
The flow of cluster information output by the management server 512, the second analysis server 514, and the file creating server 516 will be described.
The second analysis server 514 carries out processing to output IDF information (step S4702). More specifically, the second analysis server 514 operates as follows.
(1) It obtains the result of leaving spaces between keywords in each publication based on the list of publications created at the time of outputting a structure diagram included in the file received from the management server 512.
(2) It calculates IDF (for the population) and IDF (for all the publications) for each keyword obtained in the above (1).
(3) It creates a file including a file holding the values obtained in the above (2) (such as a CSV file) and all the files included in the file (Zip file) received from the management server 512 and returns the file to the management server 512 (step S4703).
The management server 512 further transfers a file (such as a Zip file) including the result of processing by the first analysis server 513 and the IDF information in step S4702 again to the second analysis server 514 (step S4704).
Upon receiving the file, the second analysis server 514 outputs keyword attributes and main applicant information (step S4705). More specifically, the second analysis server 514 operates as follows:
(1) It obtains the degrees of concentration for each keyword and the ranking of degrees of concentration.
(2) It obtains the following values for each cluster and each keyword attached to a cluster.
Creativity Degree and Creativity Degree Ranking (for which the IDF Information is referred to).
(3) It obtains main applicants, the number of applications, and the ranking of main applicants for each cluster.
(4) It obtains main IPC sub groups for each cluster, the number of publications, and the ranking of main IPC sub groups for each cluster.
(5) It creates a file in a form that includes each file holding the values obtained in the above (1) to (4) (such as a CSV file) and all the files in the file (Zip file) received from the management server 512, and returns the file to the management server 512 (step S4706).
The management server 512 transfers a file (such as a Zip file) including the results of processing by the first analysis server 513 and the second analysis server 514 to the file creating server 516 (step S4707).
The file creating server 516 creates a cluster information file based on the received file (step S4708). More specifically, the file creating server 516 operates as follows:
(1) Based on the values calculated in step S4705 in the second analysis server 514, it determines which category (“technical region,” “main aspects (main terms),” “creative aspects (creative terms),” and “others”) the keywords attached to the clusters belong to and sets the keywords to their appropriate items (categories).
(2) It sets information about main applicants in the clusters and the main IPC sub groups to the items.
(3) After carrying out the above (1) and (2) for each cluster, it creates a table form file in which the keywords and the like are set in the boxes, creates a file in a form that includes the table form file and all the files included in the file (Zip file) received from the management server 512, and returns the file to the management server 512 (step S4709).
In this way, the management server 512 can obtain the final file (Zip file) including all the results of processing. The management server 512 transfers the final file to the web server 511. The web server 511 creates a mail having the file received from the management server 512 as an attached file and transmits the mail to the client 502.
Referring to
The web server can serve as an interface with a client and receives and transmits data from and to a client. The web server creates information about a case on which an information analysis report is to be created, in other words, information about a document to be surveyed (hereinafter referred to as “research case information”) based on a user input and applies the information to the management server.
The management server queues research cases and requests the analysis server in the order of input. The management server has a queuing mechanism to request the analysis server.
The analysis server carries out processing such as population extraction, various totaling processing, and creating the structure diagram and clustering information.
As shown in
Similarly to the second embodiment, as shown in
If the document to be surveyed is a patent document such as laid-open publication, the user operates the client 502 to input necessary information to the boxes 3701 to 3704. Alternatively, the user may input information to be researched in the text input box 3705.
Note that the box 3706 is used to provide service such as emphasizing similar publications for a period based on an input in the box 3706 in a different color at the time of listing similar publications.
When the web server receives the document to be surveyed information and the content selecting information input by the client operated by the user, the web server identifies the case based on the received document to be surveyed information and the content selecting information and transmits the case to the management server. The management server determines the presence/absence of a preceding case being processed by the analysis server and stands by if there is a preceding case. On the other hand if there is no preceding case, the case is input to the analysis server. According to the embodiment, once the document to be surveyed is determined, the research case information is transmitted to the management server from the web server. The management server queues research cases by the queuing mechanism, requests the analysis server for the research case to be processed next and provides the research case data.
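The queuing behavior of the management server can be pictured with the following minimal sketch. The class name `ManagementServer`, its methods, and the `analyze` callable that stands in for the analysis server are all hypothetical; the description only states that research cases are queued and passed to the analysis server in the order of input.

```python
from collections import deque


class ManagementServer:
    """Hypothetical stand-in for the management server's queuing mechanism."""

    def __init__(self, analyze):
        self._analyze = analyze  # assumed callable representing the analysis server
        self._queue = deque()    # research cases waiting, in the order of input

    def submit(self, case):
        """Queue a research case received from the web server."""
        self._queue.append(case)

    def process_next(self):
        """Pass the oldest queued case to the analysis server.

        Returns the analysis result, or None if no case is waiting.
        """
        if not self._queue:
            return None
        return self._analyze(self._queue.popleft())
```

A later case therefore stands by until every preceding case has been handed over.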
As shown in
The database server obtains all the publications from an all publication database (DB) and creates index terms for all the publications (all publication keywords).
The analysis server obtains research case index terms extracted by the database server at the time of carrying out thread processing. Then, the process of totaling the use frequencies of the research case index terms in the documents is carried out. In this way, the analysis server obtains the result of research case index term totaling processing.
Then, the analysis server starts to create a population. The database server responds to a request to start creating a population from the analysis server to calculate all publication similarities based on the created index terms for each of the documents included in all the publications and the obtained result of totaling the research case index terms. The similarity calculation is the same as that described in connection with the first embodiment and therefore the description is not provided. A research case similar population is created from a document group of 3000 documents having the largest all publication similarity ratios. The database server returns the research case similar population to the analysis server. In this way, the analysis server obtains the research case similar population.
The analysis server carries out totaling processing and obtains at least one of the totaling results of the ranking of similarities in the similar document population, the number of documents in the similar document population for each document attribute included in the bibliographic information of the document to be surveyed, the transition of the number of documents in the similar document population or various rankings for each of the document attributes, and an index document frequency scatter diagram.
Similarly to the second embodiment, the analysis server carries out, as totaling, ranking totaling (step S3901), time-series totaling (step S3902), and matrix tabulation (step S3903).
As shown in
The analysis server obtains information about the population from the recording device and totals the publications of the population on an applicant basis (see
The analysis server obtains information about the population from the recording device and totals the number of applications by top 10 applicants based on the number of filed applications for each filing year and creates a graph representing the transition of the numbers (
Furthermore, the analysis server obtains important keywords (for all the publications) from the recording device and creates a graph representing the accumulation of the yearly use frequencies of the important keywords (for all the publications) (
The analysis server creates a graph based on the totaled result of the number of applications for each year in the population in which the abscissa represents the number of publications for each year and the ordinate represents the increase ratio obtained by comparison to the number of applications in the previous year (
Hereinafter, the matrix tabulation will be described. The analysis server further obtains the information of the population from the recording device and refers to the IPC attached to the applications of the top ten applicants based on the number of applications in the population to create the number of applications provided with the IPC groups into a table in a matrix form including the rows of applicants and the columns of IPC main groups for each applicant and based on the applications by each applicant (see
Although not shown, after various kinds of totaling processing ends, the analysis server may obtain the information of the population from the recording device and calculate the inside population similarity measures (step S3904). The inside population similarity measure is the similarity (similarity measure) between the document to be surveyed and each of the documents that belong to the population.
The analysis server carries out the process of calculating coordinates for a frequency scatter diagram (step S3905). As shown in
As shown in
Thereafter, the document vector (d) is calculated as the product of TF(d) (the occurrence frequencies of d's index terms (d1, . . . , dx) in d) and IDF(P) (the logarithm of the number of documents N multiplied by the inverse of DF(P): ln [N/DF(P)]) (step S4003). Similarly, the document vector (p) is calculated as the product of TF(P) (the occurrence frequencies of P's index terms (P1, . . . , Pya) in P) and IDF(P) (step S4004).
When the document vector (d) and the document vector (p) are calculated, the inner product of the vectors is obtained as similarity measures (step S4005). Furthermore, a prescribed number of documents are extracted from the documents to be compared P as a population S in the descending order of similarity measures relative to the document to be surveyed d and the information of the documents is stored in the recording device (step S4005). Thereafter, the keyword importance degree DF(S) (the document frequency in S based on S's index terms) is calculated (step S4006).
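The vector calculation, the inner product, and the extraction of the population can be sketched as follows. This is an illustration under stated assumptions, not the implementation of the embodiment: documents are assumed to be already split into lists of index terms, and the function names `idf`, `tfidf_vector`, `similarity`, and `build_population` are hypothetical.

```python
import math
from collections import Counter


def idf(term, doc_sets):
    """IDF(P) = ln[N / DF(P)]; taken as 0 for a term absent from P (assumption)."""
    df = sum(1 for d in doc_sets if term in d)
    return math.log(len(doc_sets) / df) if df else 0.0


def tfidf_vector(tokens, doc_sets):
    """Document vector: TF of each index term multiplied by IDF(P)."""
    return {t: f * idf(t, doc_sets) for t, f in Counter(tokens).items()}


def similarity(vec_a, vec_b):
    """Inner product of two sparse document vectors."""
    return sum(w * vec_b.get(t, 0.0) for t, w in vec_a.items())


def build_population(d_tokens, P, size):
    """Indexes of the `size` documents in P most similar to d, best first."""
    doc_sets = [set(p) for p in P]
    vd = tfidf_vector(d_tokens, doc_sets)
    scored = [(similarity(vd, tfidf_vector(p, doc_sets)), i)
              for i, p in enumerate(P)]
    scored.sort(key=lambda s: (-s[0], s[1]))  # descending similarity
    return [i for _, i in scored[:size]]
```

The sketch recomputes IDF(P) for each term on demand for brevity; with a large document set the document frequencies would more sensibly be computed once and cached.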
Thereafter, for each of the index terms (d1, . . . , dx) of the document to be surveyed d, the function value IDF of the document frequency is obtained for the documents to be compared P and the population S (steps S4007 and S4008). In step S4007, IDF(d1; P), IDF(d2; P), . . . , IDF(dx; P) are obtained, and in step S4008, IDF(d1; S), IDF(d2; S), . . . , IDF(dx; S) are obtained. The analysis server creates a plane by IDF(P) and IDF(S), and for example creates a frequency scatter diagram having the index terms provided in prescribed positions on the plane where the x-axis represents the IDF(P) and the y-axis represents the IDF(S) based on the values of IDF(P) and IDF(S) for each of the index terms (d1, . . . , dx) (step S4009).
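The coordinates of the frequency scatter diagram can be sketched as follows, with IDF(P) on the x-axis and IDF(S) on the y-axis. The function names are hypothetical, documents are assumed to be given as sets of index terms, and skipping a term that does not occur in P or in S is an assumption made here because its IDF would be undefined.

```python
import math


def idf(term, doc_sets):
    """ln[N / DF]; None when the term occurs in no document of the set."""
    df = sum(1 for d in doc_sets if term in d)
    return math.log(len(doc_sets) / df) if df else None


def scatter_coordinates(index_terms, P, S):
    """Map each index term of d to the point (IDF(P), IDF(S))."""
    coords = {}
    for term in index_terms:
        x, y = idf(term, P), idf(term, S)
        if x is not None and y is not None:
            coords[term] = (x, y)
    return coords
```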
Note that from step S4009, in the frequency scatter diagram (IDF plan view), the index terms are arranged (scattered), while the scattered index terms are sometimes unevenly localized and become less viewable. Therefore, according to the second embodiment, the density of the index terms provided on the plane is inspected, and if the density in a prescribed region exceeds a prescribed value, the analysis server widens the scale on the axis in the region to expand the region and narrows the scale on the axis in the other region to compress the other region. Therefore, when a region is expanded and the other region is compressed, the analysis server carries out coordinate transformation (step S4010). The IDF plan view has a diamond shape, which can look unusual as a phenogram or can be inconvenient in handling. Therefore, the analysis server may carry out coordinate transformation, so that the plane can be represented in a square form. The information of the frequency scatter diagram is also stored in the recording device in the analysis server.
The analysis server creates a tree-like diagram based on the similarity measures of the documents included in the similar document population and carries out clustering to create a structure diagram. The analysis server also creates the clustering information of the structure diagram including the document to be surveyed based on the created structure diagram data.
As shown in
A more detailed description of the way of creating a patent structure diagram will be the same as that given in connection with the second embodiment and therefore the description is omitted. In this example, with reference to the flowchart in
The document read out unit 4110 reads out a plurality of document elements to be analyzed from the document storage unit of the recording device 4103 (step S4210). According to the embodiment, examples of the document elements to be analyzed include population documents or a document to be surveyed and population documents.
Then, the time data extraction unit 4120 extracts the time data of each element from the document element group read out in the document reading step S4210 (step S4220).
The index term data extraction unit 4130 extracts index term data as the content data of each document element from the document element group read out in the document reading step S4210 (step S4230). The index terms are extracted in the same manner as the first embodiment.
Then, the similarity calculation unit 4140 calculates similarity measures between the document elements based on the index term data of each of the document elements extracted in the index term data extracting step S4230 (step S4240). The similarity measure (similarity) calculation has been described and therefore the description is omitted.
Then, the tree-like diagram creation unit 4150 creates a tree-like diagram of the document element group to be analyzed based on the similarity measures calculated in the similarity measure calculating step S4240 (step S4250). As the tree-like diagram, a dendrogram in which the similarity measures between the document elements are reflected on the height of the connection positions (connection distances) is desirably used. A specific example of a method of creating such a dendrogram is the known Ward method.
The cutting condition read out unit 4160 then reads out a tree-like diagram cutting condition recorded in the condition recording unit in the recording device 4103 (step S4260).
The cluster extraction unit 4170 then cuts the tree-like diagram created in the tree-like diagram creating step S4250 based on the cutting condition read out in the cutting condition reading step S4260 and extracts clusters (step S4270).
The arrangement condition read out unit 4180 reads out a document element arrangement condition in the cluster recorded in the condition recording unit in the recording device 4103 (step S4280).
The inside cluster element arranging unit 4190 then determines the arrangement of the document elements in the cluster extracted in the cluster extracting step S4270 based on the document element arrangement condition read out in the arrangement condition reading step S4280 (step S4290). The structure diagram according to the embodiment is completed by thus determining the arrangement in the cluster. Note that the arrangement condition may be in common for all the clusters. Therefore, if step S4280 is carried out once for one cluster, the step does not have to be carried out again for the other clusters.
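Steps S4250 to S4270 can be illustrated with the following minimal sketch of Ward agglomeration followed by a cut. Several points are assumptions made for illustration: the document elements are represented by numeric vectors compared with squared Euclidean distance, the connection height is taken to be the increase in the within-cluster sum of squares caused by a merge, and the function name `ward_clusters` and the parameter `cut_height` are hypothetical. Because the Ward heights never decrease as merging proceeds, stopping at the first merge whose height exceeds `cut_height` gives the same clusters as building the whole dendrogram and cutting it at that height.

```python
def ward_clusters(vectors, cut_height):
    """Group vectors by Ward's criterion and cut at cut_height.

    Returns a sorted list of clusters, each a sorted list of element indexes.
    """
    # Each cluster is (member indexes, centroid).
    clusters = [([i], list(v)) for i, v in enumerate(vectors)]

    def merge_cost(a, b):
        # Increase in within-cluster sum of squares if a and b are merged.
        na, nb = len(a[0]), len(b[0])
        d2 = sum((x - y) ** 2 for x, y in zip(a[1], b[1]))
        return na * nb / (na + nb) * d2

    while len(clusters) > 1:
        cost, i, j = min((merge_cost(clusters[i], clusters[j]), i, j)
                         for i in range(len(clusters))
                         for j in range(i + 1, len(clusters)))
        if cost > cut_height:
            break  # the next connection lies above the cutting height
        a, b = clusters[i], clusters[j]
        na, nb = len(a[0]), len(b[0])
        centroid = [(na * x + nb * y) / (na + nb) for x, y in zip(a[1], b[1])]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append((a[0] + b[0], centroid))

    return sorted(sorted(members) for members, _ in clusters)
```

The pairwise search is quadratic per merge, which is adequate for a sketch but would need a more efficient update rule for a population of thousands of documents.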
More specifically, the process of creating the structure diagram will be described. According to the embodiment, after parent clusters are extracted by cutting a tree-like diagram at a cutting height α determined by a certain method, a tree-like diagram is created again using only document elements that belong to each of the parent clusters in order to divide each of the parent clusters into child clusters. At the time of creating the partial tree-like diagram, an index term dimension in which the deviation of the component of the document element vector in the parent cluster takes a value smaller than a value determined by a prescribed method is removed before analysis.
As the analysis server carries out the above described processing, the patent structure diagram as shown in
Now, the process of obtaining cluster information will be described. First, the terms and abbreviations used in the following paragraphs are defined. Cluster information includes, for each cluster, the title, the number of publications, the total of IPC classes (top five), the total of applicants (top five), and the cluster important keywords. The important keywords represent the ten most important keywords extracted from all the publications that belong to the cluster, and the keywords are divided into the following four kinds for display.
Technical Region Terms: Among the cluster important keywords, those used in common for other clusters. Keywords used in common among many clusters are generally keywords that represent the technical region to which the clusters belong.
Main Terms: Among the cluster important keywords excluding the “technical region terms,” those particularly used for the cluster. The main terms are not much used for other clusters, and often represent the main technical elements of the cluster. The main terms typically distinguish the cluster from other clusters.
Characteristic Terms: It is often the case that the cluster important keywords excluding the “technical region terms” and the “main terms” are keywords related to means or structures. Among these, terms that are much used in general but not much used in the group of publications to be analyzed (with the top 300 all publication similarity measures) would be keywords that could suggest characteristic aspects in means or structures. Such keywords are calculated according to a prescribed standard and indicated as “characteristic terms.”
Other Important Terms: Terms among the cluster important keywords that do not correspond to any one of the above three kinds. It is often the case that the “other important terms” are technical terms that do not belong to any of the above three kinds and are related to means and structures.
Hereinafter, the configuration of a processing device used to extract a keyword will be described with reference to the block diagrams in
The document read out unit 4510 reads out a document group E including a plurality of documents D1 to DN(E) to be analyzed from the document storage unit in the recording device 4503 based on a reading condition stored in the condition recording unit in the recording device 4503. The data of the read out document group is directly sent to the index term extraction unit 4520 to be used for processing therein and sent to the work result storage unit in the recording device 4503 to be stored therein.
Note that the data sent to the index term extraction unit 4520 or the work result storage unit from the document read out unit 4510 may be the entire data including the document data of the read out document group E. Alternatively, the data may be only the bibliographic data (such as application numbers or publication numbers in patent documents) used to specify each of the documents D that belong to the document group E. In the latter case, if necessary in subsequent processing, the data of each document D may be read out again from the document storage unit based on the bibliographic data.
The index term extraction unit 4520 extracts index terms in each document from the document group read out by the document read out unit 4510. The index term data of each of the documents is directly sent to the high frequency term extraction unit 4530 to be used for processing therein and sent to the work result storage unit in the recording device 4503 to be stored therein.
The high frequency term extraction unit 4530 extracts a prescribed number of index terms having large weights, where the evaluation of the weight includes how high the occurrence frequency in the document group E is, based on the index terms in each document extracted by the index term extraction unit 4520 and according to a high frequency term extracting condition stored in the condition recording unit in the recording device 4503.
More specifically, for each index term, the occurrence frequency GF(E) in the document group E is calculated. Preferably, the IDF(P) of each index term is also calculated, and GF(E)*IDF(P), the product of GF(E) and IDF(P), is calculated. Then, using the calculated GF(E) or GF(E)*IDF(P) as the weight of each index term, a prescribed number of index terms having the largest weights are extracted as high frequency terms.
The data of the extracted high frequency terms is directly sent to the high frequency term-index term co-occurrence degree calculation unit 4540 to be used for processing therein and also sent to the work result storage unit in the recording device 4503 to be stored therein. The calculated GF(E) of the index terms and, where calculated, the IDF(P) of the index terms are preferably sent to the work result storage unit in the recording device 4503 and stored therein.
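The weighting by GF(E)*IDF(P) can be sketched as follows. The function name `high_frequency_terms` and its parameters are hypothetical, GF(E) is taken here to be the total number of occurrences of a term across the documents of E, and a term absent from P is given an IDF of 0; all three are assumptions made for illustration.

```python
import math
from collections import Counter


def high_frequency_terms(E, P, count):
    """Return the `count` index terms of E with the largest GF(E)*IDF(P).

    E: list of documents, each a list of index terms (repeats allowed).
    P: list of documents of the whole set, each a set of index terms.
    """
    gf = Counter(term for doc in E for term in doc)  # GF(E)

    def idf_p(term):
        df = sum(1 for d in P if term in d)
        return math.log(len(P) / df) if df else 0.0

    weight = {term: freq * idf_p(term) for term, freq in gf.items()}
    # Largest weight first; ties are broken alphabetically for determinism.
    return sorted(weight, key=lambda t: (-weight[t], t))[:count]
```

Multiplying by IDF(P) pushes down terms that occur in nearly every document of P, so a term that is merely common everywhere does not crowd out terms that are frequent specifically in E.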
The high frequency term-index term co-occurrence degree calculation unit 4540 calculates co-occurrence degrees in the document group E based on the presence/absence of co-occurrence on a document basis between the high frequency terms extracted by the high frequency term extraction unit 4530 and the index terms extracted by the index term extraction unit 4520 and stored in the work result storage unit. If p index terms are extracted and q high frequency terms are extracted from the p index terms, matrix data of p rows and q columns results.
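The p-row, q-column matrix can be pictured with the following sketch. Taking the co-occurrence degree to be the number of documents in which both terms are present is an assumption made here for illustration, as are the function name `cooccurrence_matrix` and the representation of each document as a set of index terms.

```python
def cooccurrence_matrix(docs, index_terms, high_freq_terms):
    """Build the p x q matrix of document-level co-occurrence counts.

    docs: list of documents of E, each a set of index terms.
    Row order follows index_terms (p rows); column order follows
    high_freq_terms (q columns).
    """
    return [[sum(1 for d in docs if w in d and g in d)
             for g in high_freq_terms]
            for w in index_terms]
```

Each column of the result describes one high frequency term by its pattern of co-occurrence with all index terms, which is the data on which the subsequent cluster analysis operates.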
The co-occurrence degree data calculated by the high frequency term-index term co-occurrence degree calculation unit 4540 is directly sent to a clustering unit 4550 and used for processing therein or sent to the work result storage unit in the recording device 4503 and stored therein.
The clustering unit 4550 cluster-analyzes q high frequency terms according to a clustering condition stored in the condition recording unit in the recording device 4503 based on the co-occurrence degree data calculated by the high frequency term-index term co-occurrence degree calculation unit 4540.
In order to carry out the cluster analysis, similarity measures for the co-occurrence degrees between the q high frequency terms and the index terms are calculated.
Then, based on the result of calculation of the similarity measures and according to a tree-like diagram creating condition stored in the condition recording unit in the recording device 4503, a tree-like diagram connecting the high frequency terms in a tree-like form is created. As such a tree-like diagram, a dendrogram in which the dissimilarity measures between the high frequency terms are reflected as the height of the connecting positions (connecting distances) is desirably created.
Then, according to a tree-like diagram cutting condition recorded in the condition recording unit in the recording device 4503, the created tree-like diagram is cut. As the result of cutting, the q high frequency terms are clustered based on the similarity measures for the co-occurrence degree with the index terms. Individual clusters created by the clustering will be referred to as “grounds” gh (h=1, 2, . . . , b).
The ground data formed by the clustering unit 4550 is directly sent to an index term-ground co-occurrence degree calculation unit 4560 and used for processing therein or sent to the work result storage unit in the recording device 4503 and stored therein.
The index term-ground co-occurrence degree calculation unit 4560 calculates the co-occurrence degrees between the index terms extracted by the index term extraction unit 4520 and stored in the work result storage unit in the recording device 4503 and the grounds formed by the clustering unit 4550. The co-occurrence data calculated for each index term is directly sent to a key(w) calculation unit 4570 and used for processing therein or sent to the work result storage unit in the recording device 4503 and stored therein.
The key(w) calculation unit 4570 calculates key(w) that is the evaluation score of each index term based on the co-occurrence degrees of the index terms and the grounds calculated by the index term-ground co-occurrence degree calculation unit 4560. The calculated key (w) data is directly sent to a Skey(w) calculation unit 4580 and used for processing therein or sent to the work result storage unit in the recording device 4503 and stored therein.
The Skey(w) calculation unit 4580 calculates Skey(w) scores based on the key (w) scores of the index terms calculated by the key(w) calculation unit 4570, the GF(E) of the index terms and the IDF(P) of the index terms calculated by the high frequency term extraction unit 4530 and stored in the work result storage unit in the recording device 4503. The calculated Skey(w) data is sent to the work result storage unit in the recording device 4503 and stored therein.
An evaluation value calculation unit 4700 reads index terms wi in each document extracted by the index term extraction unit 4520 regarding a set of document groups S including a plurality of document groups Eu. Alternatively, the evaluation value calculation unit 4700 reads out the Skey(w) of the index terms calculated for each of the document groups Eu by the Skey(w) calculation unit 4580 from the work result storage unit. If necessary, the evaluation value calculation unit 4700 may read out the data of each document group Eu read out by the document read out unit 4510 from the work result storage unit and count the number of documents N(Eu). GF(Eu) and IDF(P) calculated in the process of extracting high frequency terms by the high frequency term extraction unit 4530 may be read out from the work result storage unit.
The evaluation value calculation unit 4700 calculates an evaluation value A(wi,Eu) based on the occurrence frequency of each index term wi in each of the document groups Eu according to the read out information. The calculated evaluation values are sent to the work result storage unit and stored therein or directly sent to a concentration degree calculation unit 4710 and a share calculation unit 4720 and used for processing therein.
The concentration degree calculation unit 4710 reads out the evaluation value A(wi,Eu) for each of the index terms wi in each of the document groups Eu calculated by the evaluation value calculation unit 4700 from the work result storage unit or receives the value directly from the evaluation value calculation unit 4700.
The concentration degree calculation unit 4710 calculates the concentration degree of the distribution of each of the index terms wi in the document group set S based on the obtained evaluation value A(wi,Eu). The concentration degree is obtained as follows: for each index term wi, the sum of the evaluation values A(wi,Eu) over all the document groups Eu that belong to the document group set S is calculated; the ratio of the evaluation value A(wi,Eu) in each document group Eu to that sum is calculated; and the squares of the ratios are summed over all the document groups Eu that belong to the document group set S. The calculated concentration degrees are sent to the work result storage unit and stored therein.
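The concentration degree described above is a sum of squared ratios and can be sketched as follows. The function name `concentration` and the list input are hypothetical, and returning 0 when every evaluation value is 0 is an assumption made to avoid division by zero.

```python
def concentration(evaluations):
    """Concentration degree of one index term wi over the document group set S.

    evaluations: the values A(wi, Eu), one per document group Eu in S.
    """
    total = sum(evaluations)
    if total == 0:
        return 0.0  # assumption: a term with no evaluation has no concentration
    return sum((a / total) ** 2 for a in evaluations)
```

A term spread evenly over four document groups gives 0.25, while a term found in only one group gives 1.0, so low values indicate terms scattered over the whole set.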
The share calculation unit 4720 reads out the evaluation value A(wi,Eu) of each index term wi in each document group Eu calculated by the evaluation value calculation unit 4700 from the work result storage unit or directly receives the value from the evaluation value calculation unit 4700.
The share calculation unit 4720 calculates the share of each index term wi in each document group Eu based on the obtained evaluation value A(wi,Eu). The share is obtained by summing up the evaluation values A(wi,Eu) over all the index terms wi extracted from each document group Eu that belongs to the above-described document group set S, and calculating the ratio of the evaluation value A(wi,Eu) of each index term wi relative to the sum. The calculated share is sent to the work result storage unit and stored therein.
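The share calculation can be sketched in the same way. The function name `shares` and the dictionary input are hypothetical, and a share of 0 is assumed when all evaluation values in the document group are 0.

```python
def shares(evaluations):
    """Share of each index term within one document group Eu.

    evaluations: dict mapping index term wi -> A(wi, Eu) for that group.
    """
    total = sum(evaluations.values())
    return {term: (a / total if total else 0.0)
            for term, a in evaluations.items()}
```

Unlike the concentration degree, which compares one term across document groups, the share compares the terms with one another inside a single document group.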
The first inverse calculation unit 4730 reads out the index term wi in each document extracted in the index term extraction unit 4520 regarding the document group set S including a plurality of document groups Eu from the work result storage unit.
The first inverse calculation unit 4730 calculates a function value of the inverse of the occurrence frequency of each index term wi in the document group set S (such as normalized IDF(S) that will be described) based on the data of the index terms wi in each document in the read out document group set S. The calculated function value of the inverse of the occurrence frequency in the document group set S is sent to the work result storage unit and stored therein or directly sent to a creativity degree calculation unit 4750 and used for processing therein.
The second inverse calculation unit 4740 calculates a function value of the inverse of the occurrence frequency in a large document set including the document group set S. As the large document set, all the documents P are used. In this case, IDF(P) calculated in the process of extracting a high frequency term in the high frequency term extraction unit 4530 is read out from the work result storage unit and a function value thereof (such as normalized IDF(P) that will be described) is calculated. The calculated function value of the inverse of the occurrence frequency in the large document set P is sent to the work result storage unit and stored therein or directly sent to the creativity degree calculation unit 4750 and used for processing therein.
The creativity degree calculation unit 4750 reads out the function values of the inverses of the occurrence frequencies calculated in the first inverse calculation unit 4730 and the second inverse calculation unit 4740 from the work result storage unit or directly receives the values from the first inverse calculation unit 4730 and the second inverse calculation unit 4740. GF(E) calculated in the process of extracting a high frequency term in the high frequency term extraction unit 4530 is also read out from the work result storage unit.
The creativity degree calculation unit 4750 calculates, as a creativity degree, a function value of the difference obtained by subtracting the calculation result of the second inverse calculation unit 4740 from the calculation result of the first inverse calculation unit 4730. The function value may be obtained by taking this difference and dividing it by the sum of the calculation results of the first inverse calculation unit 4730 and the second inverse calculation unit 4740, or by multiplying the result by GF(Eu) in each document group Eu. The calculated creativity degree is sent to the work result storage unit and stored therein.
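The two variants of the creativity degree can be sketched as follows. The sketch assumes that the normalized IDF(S) and normalized IDF(P) values have already been computed, since their normalization is described elsewhere; the function name `creativity_degree` and the parameters `normalize` and `gf` are hypothetical.

```python
def creativity_degree(norm_idf_s, norm_idf_p, normalize=True, gf=None):
    """Creativity degree from two inverse-frequency function values.

    norm_idf_s: value from the first inverse calculation unit (set S).
    norm_idf_p: value from the second inverse calculation unit (set P).
    normalize:  if True, divide the difference by the sum of the two values.
    gf:         optional GF(E_u) weight to multiply into the result.
    """
    value = norm_idf_s - norm_idf_p
    if normalize:
        total = norm_idf_s + norm_idf_p
        # Avoid division by zero when both inputs are zero.
        value = value / total if total else 0.0
    if gf is not None:
        value *= gf
    return value


# Hypothetical inputs: the term is rarer in S than in P.
ratio_form = creativity_degree(0.75, 0.25)
weighted_form = creativity_degree(0.75, 0.25, normalize=False, gf=4)
```

In the ratio form the value lies between -1 and 1 for non-negative inputs, which makes terms comparable across document groups; the GF-weighted form instead emphasizes terms that occur often in the document group Eu.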
The keyword extraction unit 4760 reads out various kinds of data including Skey(w) calculated by the Skey(w) calculation unit 4580, the concentration degrees calculated by the concentration degree calculation unit 4710, the shares calculated by the share calculation unit 4720, and the creativity degrees calculated by the creativity degree calculation unit 4750 from the work result storage unit.
The keyword extraction unit 4760 extracts keywords based on two or more indexes selected from the four indexes read out, namely Skey(w), the concentration degrees, the shares, and the creativity degrees. The keywords may be extracted, for example, by determining whether the total value of the selected indexes is not less than a prescribed threshold or falls within a prescribed range of ranks. The extracted keyword data is sent to the work result storage unit in the recording device 4503 and stored therein. Thereafter, clustering information is created based on combinations of the selected indexes and the keywords extracted for each of the indexes.
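The threshold and rank criteria can be sketched as follows. The sketch assumes that the selected index values are simply added for each term, without rescaling, which the specification does not state; the function name `extract_keywords`, the index names, and the sample values are hypothetical.

```python
def extract_keywords(index_values, selected, threshold=None, top_n=None):
    """Extract keywords from the total of the selected indexes.

    index_values: {term: {index_name: value}} (assumed layout).
    selected:     names of two or more indexes to add together.
    threshold:    keep terms whose total is not less than this value.
    top_n:        keep only terms ranked within the top_n places.
    """
    totals = {
        term: sum(values[name] for name in selected)
        for term, values in index_values.items()
    }
    # Descending by total; ties are broken alphabetically for determinism.
    ranked = sorted(totals, key=lambda term: (-totals[term], term))
    if threshold is not None:
        ranked = [term for term in ranked if totals[term] >= threshold]
    if top_n is not None:
        ranked = ranked[:top_n]
    return ranked


# Hypothetical index values for three index terms.
values = {
    "laser": {"skey": 0.9, "share": 0.5},
    "lens": {"skey": 0.4, "share": 0.3},
    "resin": {"skey": 0.2, "share": 0.1},
}
by_threshold = extract_keywords(values, ["skey", "share"], threshold=0.6)
by_rank = extract_keywords(values, ["skey", "share"], top_n=1)
```

Because the four indexes may have different scales, a practical implementation would probably normalize each index before adding them; that step is omitted here.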
More specifically, the keyword extraction unit 4760 creates clustering information based on the extracted keywords and on at least two indexes selected from the four indexes obtained in the foregoing steps, namely Skey(w), the degrees of concentration, the shares, and the creativity degrees.
Preferably, using all the four indexes Skey(w), the degrees of concentration, the shares, and the creativity degrees, the index terms wi in the document group Eu are sorted into “unimportant terms” and, among important terms, into “technical region terms,” “main terms,” “creative terms,” and “other important terms,” and the clustering information is created accordingly. A particularly preferable method of sorting is as follows.
For the first determination, Skey(w) is used. In each document group Eu, the descending ranking of Skey(w) is created, and keywords below a prescribed place in the ranking are determined as “unimportant terms” and removed from the range of keyword extraction. Keywords within the prescribed range of the ranking are important terms in each document group Eu, and are therefore determined as “important terms.” These terms are further sorted in the following determinations.
For the second determination, the degree of concentration is used. Terms with low concentration degrees are scattered across the entire document group set, and can therefore be understood as widely representing the technical area to which the document group to be analyzed belongs. Therefore, the ascending ranking of the concentration degrees in the document group set S is created, and terms whose place in the ranking is within a prescribed place, that is, those with the lowest concentration degrees, are determined as “technical region terms.” Among the important terms in each document group Eu, keywords that coincide with the technical region terms are sorted as “technical region terms” in the document group Eu.
For the third determination, the share is used. Terms with high shares account for a larger portion of the document group to be analyzed, and can therefore be understood as terms that explain the document group well (main terms). Therefore, in each document group Eu, the descending ranking of shares is created for the important terms that are not sorted by the second determination, and terms within a prescribed range in the ranking are determined as “main terms.”
For the fourth determination, the creativity degree is used. In each document group Eu, the creativity degree descending ranking for the important terms that are not sorted by the third determination is created and terms within a prescribed range in the ranking are determined as “creative terms.” The remaining important terms are determined as “other important terms.”
The determination process described above can be represented by Table 11.
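The four determinations can also be sketched in code. The following is a minimal illustration that assumes each prescribed place is given as a simple count of top-ranked terms; the function name `classify_terms`, the parameter names, and the sample values are hypothetical and do not appear in the specification.

```python
def classify_terms(group_terms, concentration,
                   n_important, n_region, n_main, n_creative):
    """Sort the index terms of one document group E_u into categories.

    group_terms:   {term: {"skey": .., "share": .., "creativity": ..}}
    concentration: {term: concentration degree in the document group set S}
    n_*:           assumed cut-off places for each ranking.
    """
    labels = {}
    # First determination: descending Skey(w) ranking.
    by_skey = sorted(group_terms, key=lambda t: -group_terms[t]["skey"])
    for term in by_skey[n_important:]:
        labels[term] = "unimportant term"
    # Second determination: lowest concentration degrees in the set S.
    region = set(sorted(concentration, key=concentration.get)[:n_region])
    remaining = []
    for term in by_skey[:n_important]:
        if term in region:
            labels[term] = "technical region term"
        else:
            remaining.append(term)
    # Third determination: descending share among unsorted important terms.
    remaining.sort(key=lambda t: -group_terms[t]["share"])
    for term in remaining[:n_main]:
        labels[term] = "main term"
    remaining = remaining[n_main:]
    # Fourth determination: descending creativity degree among the rest.
    remaining.sort(key=lambda t: -group_terms[t]["creativity"])
    for term in remaining[:n_creative]:
        labels[term] = "creative term"
    for term in remaining[n_creative:]:
        labels[term] = "other important term"
    return labels


# Hypothetical index values for one document group.
group_terms = {
    "optical": {"skey": 0.9, "share": 0.10, "creativity": 0.0},
    "laser": {"skey": 0.8, "share": 0.40, "creativity": 0.1},
    "resin": {"skey": 0.7, "share": 0.05, "creativity": 0.6},
    "holder": {"skey": 0.6, "share": 0.04, "creativity": 0.2},
    "said": {"skey": 0.1, "share": 0.01, "creativity": 0.0},
}
concentration = {"optical": 0.2, "laser": 0.7, "resin": 0.9,
                 "holder": 0.8, "said": 0.5}
labels = classify_terms(group_terms, concentration,
                        n_important=4, n_region=1, n_main=1, n_creative=1)
```

Each determination only considers the terms left unsorted by the previous one, so every index term receives exactly one category.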
In the foregoing determination, Skey(w) is used as the index of importance degree in the first determination, but another index indicating the importance degree in the document group may be used instead. For example, GF(E)*IDF(P) may be used.
In the foregoing determination, the four indexes, namely the importance degree, the degree of concentration, the share, and the creativity degree, are used, but any two or more of these indexes may be used to sort the index terms.
As described above, the keywords are sorted using the four indexes, the importance degree, the degree of concentration, the share, and the creativity degree. Eventually, cluster information including titles, the number of publications, the total of IPC classes (top five), and the total of applicants (top five) for each cluster and the important keywords in the clusters is stored in the recording device in the analysis server and provided to the management server.
Upon receiving the report, the web server creates an end notification indicating the end of the processing and transmits it to the client.
In response to a request from the client, the web server distributes a log-in screen to the client. When the client logs in, the web server carries out authentication; if the authentication is not successful, the client is returned to the log-in screen. On the other hand, if the authentication is successful, the web server distributes a purchased report list screen to the client.
In response to a report output request from the client, the web server transfers the report to the client. The client thus obtains the report, and can then display it on the display, store it in the recording device, or print it from a printer or the like.
The invention is applicable to a device for automatically creating an information analysis report that analyzes a document or document group to be surveyed and displays its characteristics, to a program for automatically creating an information analysis report, and to a method of automatically creating an information analysis report.
Number | Date | Country | Kind |
---|---|---|---|
2005-127118 | Apr 2005 | JP | national |

Filing Document | Filing Date | Country | Kind | 371c Date |
---|---|---|---|---|
PCT/JP2006/308669 | 4/25/2006 | WO | 00 | 10/25/2007 |