The present invention relates generally to systems for compiling electronic documents and, more particularly, to systems for linking and integrating data amongst a plurality of electronic documents to produce a graph-structured data model, or knowledge graph, of the electronic documents.
Data is often modeled in order to, among other things, optimize searching, enhance analytics tools, and facilitate the integration of new content. For instance, graph data modeling, or graph modeling, is commonly applied to a collection of electronic documents to allow for the analysis and visualization of connections (i.e., relationships) between the various individual documents (i.e., nodes).
As an example, in the accompanying drawings, there is shown a weighted graphical model 111 constructed from a collection of scientific articles 113, with each article represented as a node and with weighted edges representing the relationships between individual articles.
A weighted graphical model (e.g., model 111) can be used to identify document clusters which, in turn, may represent distinct topics. For instance, in graphical model 111, articles 113-1 through 113-5 together form a first document cluster, or sub-graph, 121-1 and articles 113-6 through 113-9 form a second document cluster, or sub-graph, 121-2. Furthermore, semantic metadata relating to, inter alia, the confidence, or certainty, of the accuracy of information contained in the data model 111 can, in turn, be utilized to generate a knowledge graph.
As one useful application of graph modeling electronic documents (e.g., scientific articles), the identification of specific topics from the graphical model can be used to identify key opinion leaders for each topic based on the authorship of such documents. In other words, the author(s) associated with the most relevant articles within a particular cluster may be construed as key opinion leaders with respect to that particular topic or field. The identification of key opinion leaders for a particular topic (e.g., COVID-19 transmissibility) can be used to establish a collaboration network, wherein key opinion leaders who are identified through the graphical model can serve as collaborators or peer reviewers in subsequent scientific articles on that specific topic.
Despite the usefulness set forth above, the applicant has recognized a notable shortcoming associated with the identification of key opinion leaders through the graph modeling of electronic documents. Notably, the applicant has recognized that certain inconsistencies in the spelling of the name of an author among different electronic documents often occur due to, inter alia, (i) name misspellings, (ii) the use of abbreviations, and (iii) variances in the conversion of native language characters to Latin-based characters. Furthermore, although certain unique identification numbers/codes are often utilized to verify authorship (e.g., an ORCID identification number issued to academic authors and contributors by ORCID, Inc. of Bethesda, Md.), these types of identification codes are often (i) incorrectly entered or (ii) not assigned to certain individuals.
In view thereof, it is an object of the present invention to provide a novel method for generating a graphical model of a plurality of electronic documents.
It is another object of the present invention to provide a method of the type as described above which establishes connections between individual electronic documents that contain related data.
It is yet another object of the present invention to provide a method of the type as described above which establishes connections between individual electronic documents with related document identifiers, such as authorship.
It is still another object of the present invention to provide a method of the type as described above which identifies potential spelling variances between identifiers associated with the various individual electronic documents.
It is yet still another object of the present invention to provide a method of the type as described above which determines the likelihood of relatedness between document identifiers with variances in spelling.
It is another object of the present invention to provide a method of the type as described above which incorporates connections in the graphical model between the document identifiers when determined to be related, or linked, despite variances in spelling.
It is yet another object of the present invention to provide a method of the type as described above which is inexpensive to implement, can be efficiently processed, and is readily scalable.
Accordingly, as one feature of the present invention, there is provided a computer-implemented method for generating a graphical model of a plurality of electronic documents, each electronic document comprised of data which includes identifying information, the identifying information including authorship, the method comprising the steps of (a) ingesting the data from each of the plurality of electronic documents, (b) constructing a base graphical model using the data from the plurality of electronic documents, (c) disambiguating any relatedness of identifying information between select pairs of the plurality of electronic documents, and (d) calculating a degree of belief of relatedness of identifying information between select pairs of electronic documents, wherein the degree of belief of relatedness of identifying information between select pairs of electronic documents is incorporated into the base graphical model.
Various other features and advantages will appear from the description to follow. In the description, reference is made to the accompanying drawings which form a part thereof, and in which is shown by way of illustration, an embodiment for practicing the invention. The embodiment will be described in sufficient detail to enable those skilled in the art to practice the invention, and it is to be understood that other embodiments may be utilized and that structural changes may be made without departing from the scope of the invention. The following detailed description is therefore, not to be taken in a limiting sense, and the scope of the present invention is best defined by the appended claims.
In the drawings, like reference numerals represent like parts.
Referring now to the drawings, there is shown a novel, computer-implemented method for generating a graphical model of a plurality of electronic documents, the method being identified generally herein by reference numeral 211.
In the description that follows, method 211 is illustrated verifying the accuracy of the listed authorship for electronic documents in the graphical model. However, it should be noted that the principles of the present invention are not limited to incorporating information related to the certainty of authorship into a graphical model. Rather, it is to be understood that method 211 could be utilized to integrate a degree of belief, or confidence, of the veracity of any type of information, or data, included in the graphical model without departing from the spirit of the present invention.
As defined herein, the term “document” denotes any electronic record, or work. In the description that follows, documents are represented primarily as articles, such as scientific publications. However, it is to be understood that use of the term “document” herein is not intended to be limited to scientific publications or other similar types of articles. Rather, use of the term “document” is meant to encompass any/all forms of electronic records (e.g., arbitrary, text-based, information records) derived from any source, including literature, online news stories, and even database records, without departing from the spirit of the present invention.
As seen in the drawings, method 211 preferably comprises a data ingestion step 213, a graph construction step 215, an author name disambiguation step 217, and a degree of belief calculation step 219, each of which will be described in further detail below.
As referenced above, data ingestion step 213 involves acquiring, processing, and storing data from a set of electronic documents into a designated data pipeline. The frequency of data acquisition and ingestion is preferably dependent upon the volume and release dates of the electronic documents in the designated pipeline. As previously noted, the ingested data from the set of electronic documents is then subsequently utilized for data graph modeling.
Preferably, data ingestion step 213 is implemented entirely through a cloud computing services platform, thereby only requiring compute resources when processing data. For example, an Amazon Web Services (AWS) cloud computing services platform could be utilized to implement step 213, thereby allowing for an optimized selection and configuration of web services tools. For instance, data could be acquired using Python programming scripts, processed using AWS-based Elastic MapReduce (EMR), and stored in a column-oriented file format (e.g., Apache Parquet) on the AWS-based Simple Storage Service (S3).
However, it should be noted that the use of a primarily AWS-based cloud computing services platform is provided for illustrative purposes only. Rather, step 213 could be similarly implemented using alternative cloud computing services platforms, such as the Microsoft Azure cloud computing services platform, without departing from the spirit of the present invention.
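By way of a non-limiting illustration, the following Python sketch shows how the acquisition portion of step 213 might store raw document records in S3 for downstream EMR processing; the bucket name, key layout, and record schema are hypothetical and provided for explanatory purposes only.

```python
# Hypothetical sketch only: bucket name, key layout, and record schema are
# illustrative assumptions, not part of the disclosed system.
import json

import boto3

s3 = boto3.client("s3")

def ingest_document(doc: dict, bucket: str = "example-ingest-bucket") -> None:
    """Store one raw document record in S3 for downstream EMR processing."""
    key = f"raw/articles/{doc['id']}.json"
    s3.put_object(Bucket=bucket, Key=key, Body=json.dumps(doc).encode("utf-8"))

ingest_document({"id": "art-0001", "title": "An example article", "authors": ["R. S. Baric"]})
```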
Referring now to the drawings, data ingestion step 213 is shown in further detail.
As seen in the drawings, data ingestion step 213 is preferably implemented as a multi-stage process.
In the first stage of step 213, reference data (e.g., journal and other general article reference data) is ingested into the data pipeline and stored in corresponding data tables.
In the second stage of step 213, article data (i.e., the content of each electronic document) is ingested into the data pipeline. As a novel feature of ingestion process 213, the present invention is designed to support updates to article data previously ingested into the pipeline (e.g., the ingestion of revised versions of previously ingested articles).
Thereafter, through a consolidation process 253, data tables 245 and 251 are combined to create an intermediate data table 255 of article fragments, which represents the partially processed documents. Each record in table 255 preferably includes specifically extracted document properties, plus front and back matter fragments (e.g., the metadata and references). Where multiple overlapping sources are used for a single entity/node type, a further consolidation/disambiguation step is required (not illustrated).
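Assuming, for illustration, that the ingested tables are stored as column-oriented (e.g., Parquet) files keyed on a document identifier, consolidation process 253 could be sketched in PySpark as follows; the paths and column names are hypothetical.

```python
# Hedged PySpark sketch of consolidation process 253. Paths and column
# names (doc_id, front_matter, back_matter) are illustrative assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("consolidation").getOrCreate()

properties = spark.read.parquet("s3://example-bucket/tables/article_properties/")  # cf. table 245
fragments = spark.read.parquet("s3://example-bucket/tables/raw_fragments/")        # cf. table 251

# Each consolidated record carries the extracted document properties plus
# the front and back matter fragments (metadata and references).
article_fragments = (
    properties
    .join(fragments, on="doc_id", how="left")
    .select("doc_id", "title", "authors", "front_matter", "back_matter")
)
article_fragments.write.mode("overwrite").parquet(
    "s3://example-bucket/tables/article_fragments/")  # cf. table 255
```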
As referenced above, graph construction step 215 involves the creation of a base graph model using the data tables generated from data ingestion step 213, the model allowing for the visualization of relationships (shown as vectors, or edges) between various nodes (e.g., authors, article content, and general article reference data). Graph construction step 215 and disambiguation step 217 are preferably implemented using any suitable distributed data processing framework, such as Apache Spark. As such, graph data can be exported in a suitable format for a graph database management system such as Neo4j.
Referring now to the drawings, graph construction step 215 is shown in further detail, in which the consolidated data tables produced by ingestion step 213 are processed into graph node tables and graph edge tables 269.
It should be noted that graph edge tables 269 may represent, inter alia, (i) relationships between different node types (e.g., contributions, or linking, between article nodes and author nodes), (ii) relationships between nodes of the same type (e.g., a citation reference, or linking, between multiple article nodes), and (iii) connections to reference data nodes (e.g., linking a scientific article with a particular journal in which it was published).
For optimal performance, edge construction preferably relies on well-known identifying information, or identifiers, whenever available. The use of well-known identifiers eliminates the need to perform a look-up of the target at write time. Depending on the quality of the input data source and/or implementor preferences or constraints, integrity checking may be performed at construction time, prior to projection of the data, or delegated to a downstream graph database management system.
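As one possible sketch of such identifier-keyed edge construction, the PySpark fragment below builds contribution edges directly from author identifiers carried on each consolidated record, applying a simple integrity check at construction time; the schema and identifier field are assumptions made for illustration.

```python
# Illustrative only: assumes each article row carries an array of author
# structs, each bearing a well-known identifier (e.g., an ORCID iD).
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("edge-construction").getOrCreate()
articles = spark.read.parquet("s3://example-bucket/tables/article_fragments/")  # assumed location

contribution_edges = (
    articles
    .select("doc_id", F.explode("authors").alias("author"))
    # Integrity check at construction time: drop rows lacking an identifier,
    # since no look-up of the target is performed at write time.
    .where(F.col("author.orcid").isNotNull())
    .select(
        F.col("doc_id").alias("src"),
        F.col("author.orcid").alias("dst"),
        F.lit("CONTRIBUTED_BY").alias("relationship"),
    )
)
contribution_edges.write.mode("overwrite").parquet(
    "s3://example-bucket/tables/edges/contributions/")
```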
Author name disambiguation step 217 is a multi-step process which is designed to ensure, or verify, that the proper author is associated with each electronic document in the graphical model. As noted previously, the applicant has recognized that certain inconsistencies in the spelling of the name of an author amongst different document resources often result in the incorrect identification of an individual as a document author. As a result, the accuracy of a graph model generated for a collection of electronic documents can be significantly compromised. Accordingly, the process by which disambiguation step 217 serves to ensure, or certify, that proper authorship is associated with a collection of scientific articles forms a critical aspect of the present invention.
Disambiguation step 217 is split into the following sequence of phases: (i) a linking phase, in which author node records are processed to identify similarities in author names and to analyze paths occurring within a collaboration graph, (ii) a clustering phase, in which a similarity graph is constructed using author nodes and similar person edges produced from the linking phase, and (iii) a refinement phase, in which clustering results are examined to resolve author ambiguities created through the use of, inter alia, homonyms and synonyms. Each phase referenced above will be described in further detail below.
Referring now to the drawings, linking phase 277 of disambiguation step 217 is shown in further detail, in which author node records are analyzed and similar person edge rows are constructed in a similarity edge table 279.
The collaboration graph construction technique, in which author nodes are connected through co-authored articles, is represented on the left-hand side of the corresponding drawing.
Self-citation is one example of a common pattern of relationships which can be identified through a graphing process. A self-citation occurs when an author cites another document previously written by the same author, a practice which is common in the scientific community. Through self-citation graphing, the author of a citing article can be linked to the author of the cited article. Through additional filtering and comparison of the author names, preferably using the last name and a first name initial, author synonyms can be discovered and, in turn, used to construct similarity edges between the citing author and the cited author. This is represented generally on the right-hand side of the corresponding drawing.
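A minimal PySpark sketch of the self-citation pattern follows, assuming hypothetical author and citation tables; the last name/first initial comparison mirrors the filtering described above.

```python
# Illustrative sketch: links the author of a citing article to the author of
# the cited article when last name and first initial agree. Table locations
# and column names are assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("self-citation-linking").getOrCreate()
authors = spark.read.parquet("s3://example-bucket/tables/author_nodes/")
citations = spark.read.parquet("s3://example-bucket/tables/citation_edges/")

citing, cited = authors.alias("a"), authors.alias("b")

similar_person_edges = (
    citations
    .join(citing, F.col("a.doc_id") == F.col("citing_doc"))
    .join(cited, F.col("b.doc_id") == F.col("cited_doc"))
    .where(
        (F.col("a.author_id") != F.col("b.author_id"))
        & (F.col("a.last_name") == F.col("b.last_name"))
        & (F.substring("a.first_name", 1, 1) == F.substring("b.first_name", 1, 1))
    )
    .select(
        F.col("a.author_id").alias("src"),
        F.col("b.author_id").alias("dst"),
        F.lit("SIMILAR_PERSON_SELF_CITATION").alias("relationship"),
    )
)
```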
Fuzzy name matching is another example of a common pattern of relationships which can be improved through the application of a graphing process. Within the collaboration graph, communities, or cliques, can be detected. Specifically, in this case, linking phase 277 runs a connected components algorithm, which is an implementation of the alternating large-star/small-star algorithm. Once the components have been allocated, the giant component is removed from consideration; the remaining components have frequently been found to be highly cohesive. Candidates (i.e., authors) are then considered that are within the same component and share the same last name. If the initials and forename match exactly, the candidate is discarded, since it should have already been identified through exact name (hash) matching. If a candidate pair passes a high name-similarity threshold, a name proximity similarity edge row is constructed in table 279. If the pair passes a lower name-similarity threshold and the affiliation of the author passes a secondary threshold, an author name and affiliation proximity similarity edge row is constructed in table 279, as sketched below.
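The two-threshold test on candidate pairs could be expressed, in simplified form, as follows; the threshold values and the stdlib similarity measure (difflib) are stand-ins for whatever proximity metric an implementation actually employs.

```python
# Simplified stand-in for the two-threshold candidate test; threshold values
# and the similarity metric are illustrative assumptions.
from difflib import SequenceMatcher
from typing import Optional

NAME_HIGH, NAME_LOW, AFFIL_MIN = 0.92, 0.80, 0.85  # assumed thresholds

def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def similarity_edge_type(cand_a: dict, cand_b: dict) -> Optional[str]:
    """Classify a candidate pair (same component, same last name) into an
    edge row type for table 279, or None when no edge should be built."""
    name_sim = similarity(cand_a["full_name"], cand_b["full_name"])
    if name_sim >= NAME_HIGH:
        return "NAME_PROXIMITY"
    if name_sim >= NAME_LOW and similarity(cand_a["affiliation"], cand_b["affiliation"]) >= AFFIL_MIN:
        return "NAME_AND_AFFILIATION_PROXIMITY"
    return None
```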
Fields of study matching is another example of a common pattern of relationships which can be identified through a graphing process. With fields of study matching, graphical paths are used to construct “fields of study” vectors that represent the specific topics on which an author has published. These fields of study vectors are then compared for candidate matches to reinforce similarity edges.
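One simple way to realize such a comparison, assuming topic labels extracted per article, is to aggregate each author's topics into a weight vector and score candidate pairs by cosine similarity, as in the illustrative sketch below.

```python
# Illustrative fields-of-study comparison; the topic labels are hypothetical.
import math
from collections import Counter

def fos_vector(article_topics: list) -> Counter:
    """Aggregate the topic labels of an author's articles into one vector."""
    vec = Counter()
    for topics in article_topics:
        vec.update(topics)
    return vec

def cosine(u: Counter, v: Counter) -> float:
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

a = fos_vector([["virology", "epidemiology"], ["virology"]])
b = fos_vector([["virology", "genomics"]])
print(cosine(a, b))  # a higher score reinforces an existing similarity edge
```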
Finally, linking phase 277 includes a mechanism to support certain known corrections in authorship, whereby records of verified authorship can be applied directly to construct corresponding similarity edge rows in table 279.
As referenced briefly above, upon completion of the linking phase of disambiguation step 217, a clustering phase is undertaken to construct a similarity graph using author nodes and similar person edges produced from the linking phase. Referring now to the drawings, the clustering phase is shown in further detail, in which a table 301 of author nodes and the similar person edges produced from linking phase 277 serve as inputs to a clustering process 305.
Preferably, an iterative graph algorithm is executed as part of process 305 to identify clusters of potentially common authors. Specifically, any graph clustering algorithm (such as a connected components algorithm) can be run to allocate clusters to each author node. The clusters are then processed and the name of the distinct author is selected on the basis of a general criterion of utility such as the longest name amongst all the names in the cluster or the most frequent occurrence of a name if all names are of approximately the same length.
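The name-selection rule described above might be sketched as follows; the margin used to decide that names are “of approximately the same length” is an assumed parameter.

```python
# Sketch of the canonical-name rule; length_margin is an assumption.
from collections import Counter

def select_distinct_name(names: list, length_margin: int = 2) -> str:
    longest = max(names, key=len)
    # If every name is within a small margin of the longest, treat the
    # lengths as approximately equal and prefer the most frequent name.
    if all(len(longest) - len(n) <= length_margin for n in names):
        return Counter(names).most_common(1)[0][0]
    return longest

print(select_distinct_name(["R. S. Baric", "Ralph S. Baric", "Ralph Stephen Baric"]))
# -> "Ralph Stephen Baric" (the longest name in the cluster)
```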
It is important for downstream consumers of the data pipeline that distinct authors maintain stable identifiers (e.g., to allow for the augmentation of data). In other words, as articles within the data pipeline are added and/or removed, clusters can grow, shrink, or remain the same, and new clusters may be formed or existing clusters deleted. Accordingly, the members, or components, within each cluster may migrate between clusters, form a new cluster, or be permanently deleted.
Therefore, the clustering phase is preferably designed with logic to ensure that cluster identifiers remain stable. Referring now to the drawings, the author clusters identified by clustering process 305 are stored in an author cluster table 313.
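One way such stability logic could work, offered purely as an assumption-laden sketch rather than the disclosed mechanism, is to let each newly computed cluster inherit the identifier of the previous cluster with which it shares the most members, minting new identifiers only for genuinely new clusters.

```python
# Assumption-laden sketch of stable cluster identifiers; the actual logic of
# the clustering phase is not reproduced here.
import itertools

def assign_stable_ids(new_clusters: dict, prev_clusters: dict) -> dict:
    """Map temporary cluster keys to stable identifiers.

    new_clusters / prev_clusters map identifiers to sets of member author ids.
    """
    fresh = itertools.count()
    mapping, claimed = {}, set()
    for tmp_key, members in new_clusters.items():
        best = max(
            (pid for pid in prev_clusters if pid not in claimed),
            key=lambda pid: len(prev_clusters[pid] & members),
            default=None,
        )
        if best is not None and prev_clusters[best] & members:
            mapping[tmp_key] = best          # cluster persists under its old id
            claimed.add(best)
        else:
            mapping[tmp_key] = f"new-{next(fresh)}"  # genuinely new cluster
    return mapping
```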
Thereafter, the identified author clusters in table 313 are processed to produce distinct, or verified, author nodes. Specifically, a distinct author construction process 315 is applied to the clusters in table 313 to produce a distinct author node table 317. Subsequently, process 315 produces a disambiguated edge table 319 that links cluster members (i.e., author nodes defined in table 301) with the distinct author nodes defined in table 317, thereby facilitating the reconciliation of authorship in the author nodes. Additionally, process 315 can be used to produce edge tables 321 which link the distinct author nodes defined in table 317 to other entities, such as articles, topics, collaborators, and the like.
As the final phase of disambiguation step 217, an optional refinement phase is undertaken in which clustering results are examined to resolve author ambiguities. Notably, due to the use of author homonyms and synonyms, clustering results can suffer from lumping and splitting errors. The identification of such errors can be accomplished using a classifier model trained on a labelled data set, as will be explained further below.
Specifically, as part of the refinement phase, a decision-tree classifier is trained using labelled data and, in turn, is utilized to refine author clusters. In other words, the decision-tree classifier is used to make predictions on whether cluster data (i.e., cluster members) represents the same person. Based on the pair-level predictions, which can be interpreted as a distance measure, the cluster members can be re-clustered using a distributed clustering algorithm to produce refined clusters.
Referring now to the drawings, the process for training the decision-tree classifier is shown, which utilizes labelled input data 401 and a table 403 of features data from the similarity model.
Input data 401 is preferably labelled data that was intentionally withheld from the similarity, or cluster, model. Model training is computationally intensive in nature and involves (i) a preparation step 405, in which input data 401 is split into training pair data 407 and test pair data 409, (ii) a training process 411, in which features data 403 from the similarity model and training pair data 407 are used to create a trained classifier model 413, and (iii) a testing process 415, in which data from trained classifier model 413 and similarity model features table 403 is utilized to evaluate the clustering, or linking, of the ‘gold standard’ test pair data 409. If the results of evaluation process 415 exceed those of the previous model, the refined similarity, or cluster, model is deployed.
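An illustrative scikit-learn rendering of this train/test flow is shown below; the feature matrix, split ratio, and depth parameter are assumptions, and the F1 score stands in for whatever evaluation metric process 415 actually applies.

```python
# Illustrative training/evaluation flow; split ratio, tree depth, and the
# evaluation metric are assumptions.
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score

def train_refinement_classifier(X, y, previous_score: float):
    """X holds similarity-model features per author pair; y labels whether
    each pair refers to the same person."""
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
    model = DecisionTreeClassifier(max_depth=8, random_state=0).fit(X_train, y_train)
    score = f1_score(y_test, model.predict(X_test))
    # Deploy the refined model only when it exceeds the previous one.
    return (model, score) if score > previous_score else (None, score)
```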
The trained decision-tree classifier 413 is then utilized to predict, pairwise, whether members of a cluster refer to the same person. Pair predictions are then processed to split clusters when necessary.
Referring now to the drawings, the refinement phase described above is shown in additional detail.
As the final step of process 211, an inferencing, or degree of belief (DoB) calculation, step 219 is implemented in order to infer the level of confidence in matched, or linked, authors; the application of the inferencing is not limited solely to authors, however, and can be determined along multiple dimensions for other node or edge types within the knowledge graph. In turn, any author matching inferences are incorporated as additional information, or knowledge, into the graph model and can thereby ensure proper authorship of each electronic document. Using belief calculation step 219, probabilities within the graphical model can be fed into inference algorithms to determine the likelihood of other relationships being true.
Referring now to the drawings, an illustrative example of calculating a degree of belief measure is shown.
As an alternative example of calculating a degree of belief measure, the confidence in the volume of output created by an author may be calculated using the following formula:
$$\frac{1 - e^{-\alpha(x - \beta)}}{1 + e^{-\alpha(x - \beta)}}$$
where alpha (α) and beta (β) are controllable parameters; two sensible starting values are α = β = 2. The variable x is the logarithm of the number of identified matches (or “duplicates”) in excess of a sensible upper limit on the publications that an individual can produce within the specified time period. For values of x close to zero, the function is approximately 1, whereas for larger values it drops rapidly (exponentially) toward zero.
In terms of defining values, after processing metadata, all articles are identified in which a particular author is listed, either identically or according to an accepted set of variations for equivalence (e.g., Ralph Stephen Baric == R. S. Baric == Ralph A. Baric). These articles will span a time period of T years. Assuming an upper limit on articles published per year (PPYL), x is defined as:
$$x = \log_{10}\left(\text{Number of articles} - \text{PPYL} \times T\right)$$
PPYL is initially provided with a value of 50 and the results are examined. Because that value is high, only the most prolific authors would exceed it. Depending on the outcome, the value can be recalibrated as needed, as can the alpha and beta parameters of the DoB calculation equation.
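As a worked numeric sketch of the formula above, using the stated starting values α = β = 2 and PPYL = 50 (the guard for a non-positive excess, where the logarithm is undefined, is an added assumption):

```python
# Worked example of the DoB formula; the guard clause for a non-positive
# excess is an added assumption, since log10 is undefined there.
import math

def degree_of_belief(num_articles: int, years: float, ppyl: float = 50.0,
                     alpha: float = 2.0, beta: float = 2.0) -> float:
    excess = num_articles - ppyl * years
    if excess <= 0:
        return 1.0  # assumed: output volume within the plausible limit
    x = math.log10(excess)                 # x = log10(articles - PPYL * T)
    u = math.exp(-alpha * (x - beta))
    return (1 - u) / (1 + u)

print(degree_of_belief(600, 2))  # 500 excess articles over 2 years -> x ~ 2.7
```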
The invention described in detail above is intended to be merely exemplary and those skilled in the art shall be able to make numerous variations and modifications to it without departing from the spirit of the present invention. All such variations and modifications are intended to be within the scope of the present invention as defined in the appended claims.
The present application claims the benefit under 35 U.S.C. § 119(e) of U.S. Provisional Patent Application No. 63/216,564, which was filed on Jun. 30, 2021, in the names of Haralambos Marmanis et al., the disclosure of which is incorporated herein by reference.