Systems and Methods for Medical Topic Discovery Based on Large-Scale Machine Learning

Information

  • Patent Application
  • 20200027556
  • Publication Number
    20200027556
  • Date Filed
    December 01, 2018
  • Date Published
    January 23, 2020
  • CPC
    • G16H50/20
    • G06N20/00
  • International Classifications
    • G16H50/20
    • G06N20/00
Abstract
A machine learning system that includes a plurality of machine learning processors maintains a topic matrix that represents the relevancies or measures of prevalence of a plurality of medical topics among a plurality of clinical documents. Each processor in the system is configured to determine at least one local sufficient factor group for a document included in the plurality of documents, and to send the at least one local sufficient factor group to one or more other processors in the system. Each processor is further configured to receive at least one remote sufficient factor group from another processor in the system, and to process the local sufficient factor group together with the remote sufficient factor group to obtain the topic matrix. The remote sufficient factor group or groups are determined by other processors in the system for another document included in the plurality of documents.
Description
TECHNICAL FIELD

The present disclosure generally relates to machine learning for healthcare, and more particularly, to large-scale machine learning systems and methods that process clinical documents to derive informative matrices that represent the relevancies or measures of prevalence of a plurality of medical topics among a plurality of clinical documents.


BACKGROUND

With the widespread adoption of electronic health records (EHR) systems, and the rapid development of new technologies such as high-throughput medical imaging devices, low-cost genome profiling systems, networked and even wearable sensors, mobile applications, and the rich accumulation of medical knowledge/discoveries in databases, a tsunami of medical and healthcare data has emerged. It was estimated that 153 exabytes (one exabyte equals one billion gigabytes) of healthcare data were produced in 2013, and an estimated 2,314 exabytes will be produced in 2020, an overall rate of increase of at least 48 percent annually.


In addition to the sheer volume, the complexity of healthcare data is also overwhelming. Such data includes clinical notes, medical images, lab values, vital signs, etc., coming from multiple heterogeneous modalities including text, images, tabular data, time series, graphs, and so on. This rich clinical data is becoming an increasingly important source of holistic and detailed information for both healthcare providers and receivers. Collectively analyzing and digesting the rich information generated from multiple sources; uncovering the health implications, risk factors, and mechanisms underlying the heterogeneous and noisy data records at both the individual patient and whole population levels; and making clinical decisions including diagnosis, triage, and treatment thereupon are now routine activities expected to be conducted by medical professionals including physicians, nurses, pharmacists, and so on.


As the amount and complexity of medical data rapidly grow, these activities are becoming increasingly difficult for human experts. The information overload makes medical analytics and decision-making time consuming, error-prone, suboptimal, and less transparent. As a result, physicians, patients, and hospitals suffer a number of pain points, quality-wise and efficiency-wise. For example, in terms of quality, 250,000 Americans die each year from medical errors, which has become the third leading cause of death in the United States. Twelve million Americans are misdiagnosed each year. Preventable medication errors impact more than 7 million patients and cost almost $21 billion annually. Fifteen to twenty-five percent of patients are readmitted within 30 days, and readmissions are costly (e.g., $41.3 billion in 2011). In terms of inefficiency, patients wait on average 6 hours in emergency rooms. Nearly 400,000 patients wait 24 hours or more. Physicians spend only 27 percent of their office day on direct clinical face time with patients. The U.S. healthcare system wastes $750 billion annually due to unnecessary services, inefficient care delivery, excess administrative costs, etc.


The advancement of machine learning (ML) technology opens up opportunities for next generation computer-aided medical data analysis and data-driven clinical decision making, where machine learning algorithms and systems can be developed to automatically and collectively digest massive medical data such as electronic health records, images, behavioral data, and the genome, to make data-driven and intelligent diagnostic predictions. A machine learning system can automatically analyze multiple sources of information with rich structure, uncover the medically meaningful hidden concepts from low-level records to aid medical professionals to easily and concisely understand the medical data, and create a compact set of informative diagnostic procedures and treatment courses and make healthcare recommendations thereupon.


It is therefore desirable to leverage the power of machine learning in automatically distilling insights from large-scale heterogeneous data for automatic smart data-driven medical predictions, recommendations, and decision-making, to assist physicians and hospitals in improving the quality and efficiency of healthcare. It is further desirable to have machine learning algorithms and systems that turn the raw clinical data into actionable insights for clinical applications. One such clinical application relates to discovering medical topics from large-scale texts.


When applying machine learning to healthcare applications, several fundamental issues may arise, including:


1) How to better capture infrequent patterns: At the core of ML-based healthcare is discovering the latent patterns (e.g., topics in clinical notes, disease subtypes, phenotypes) underlying the observed clinical data. Under many circumstances, the frequency of patterns is highly imbalanced. Some patterns have very high frequency while others occur less frequently. Existing ML models lack the capability of capturing infrequent patterns. Known convolutional neural networks, for example, do not perform well on infrequent patterns. Such a deficiency of existing models possibly results from the design of the objective function used for training. For example, a maximum likelihood estimator rewards itself by modeling the frequent patterns well, as they are the major contributors to the likelihood function. Infrequent patterns, on the other hand, contribute much less to the likelihood, so it is not very rewarding to model them well and they tend to be ignored. Yet infrequent patterns are of crucial importance in clinical settings. For example, many infrequent diseases are life-threatening, and it is critical to capture them.


2) How to alleviate overfitting: In certain clinical applications, the number of medical records available for training is limited. For example, when training a diagnostic model for an infrequent disease, typically there is no access to a sufficiently large number of patient cases due to the rareness of this disease. Under such circumstances, overfitting easily happens, wherein the trained model works well on the training data but generalizes poorly on unseen patients. It is critical to alleviate overfitting.


3) How to improve interpretability: Being interpretable and transparent is a must for an ML model to be willingly used by human physicians. Oftentimes, the patterns extracted by existing ML methods have a lot of redundancy and overlap, which makes them ambiguous and difficult to interpret. For example, in computational phenotyping from EHRs, it is observed that the phenotypes learned by standard matrix and tensor factorization algorithms have much overlap, causing confusion such as two similar treatment plans being learned for the same type of disease. It is necessary to make the learned patterns distinct and interpretable.


4) How to compress model size without sacrificing modeling power: In clinical practice, making a timely decision is crucial for improving patient outcomes. To achieve time efficiency, the size (specifically, the number of weight parameters) of ML models needs to be kept small. However, reducing the model size, which accordingly reduces the capacity and expressivity of the model, typically sacrifices modeling power and performance. It is technically appealing but challenging to compress model size without losing performance.


5) How to efficiently learn large-scale models: In certain healthcare applications, both the model size and data size are large, incurring substantial computation overhead that exceeds the capacity of a single machine. It is necessary to design and build distributed systems to efficiently train such models.


Discovering medical topics from clinical documents has many applications, such as consumer medical search, mining FDA drug labels, and investigating drug repositioning opportunities, to name a few. In practice, the clinical text corpus can contain millions of documents and the medical dictionary can comprise hundreds of thousands of terminologies. These large-scale documents contain rich medical topics, whose number can be in the tens of thousands. Efficiently discovering so many topics from such a large dataset is computationally challenging.


SUMMARY

In one aspect of the disclosure, a machine learning system that includes a plurality of machine learning processors maintains a topic matrix that represents the relevancies or measures of prevalence of a plurality of medical topics among a plurality of clinical documents. Each processor in the system is configured to determine at least one local sufficient factor group for a document included in the plurality of documents, and to send the at least one local sufficient factor group to one or more other processors in the system. Each processor is further configured to receive at least one remote sufficient factor group from another processor in the system. The remote sufficient factor group or groups are determined by other processors in the system for another document included in the plurality of documents. Each processor processes its local sufficient factor group together with the remote sufficient factor group or groups it receives to obtain the topic matrix.


In another aspect of the disclosure, a method of creating a topic matrix that represents a prevalence of each of a plurality of medical topics among a plurality of clinical documents includes determining at least one local sufficient factor group for one or more documents included in the plurality of clinical documents using a first processor included in a machine learning system comprising a plurality of machine learning processors. The method further includes sending the at least one local sufficient factor group from the first processor to one or more second processors in the system. The method also includes receiving, at the first processor, at least one remote sufficient factor group from a second processor in the system, and processing the local sufficient factor group together with the remote sufficient factor group to obtain the topic matrix at the first processor. The at least one remote sufficient factor group is determined by the second processor for another document included in the plurality of clinical documents.


It is understood that other aspects of methods and systems will become readily apparent to those skilled in the art from the following detailed description, wherein various aspects are shown and described by way of illustration.





BRIEF DESCRIPTION OF THE DRAWINGS

Various aspects of apparatuses and methods will now be presented in the detailed description by way of example, and not by way of limitation, with reference to the accompanying drawings, wherein:



FIG. 1 is a block diagram of a medical topic discovery system implemented by a large-scale peer-to-peer distributed machine learning system including a plurality of processors.



FIG. 2 is a block diagram of a sufficient factor broadcasting model adopted by the peer-to-peer distributed machine learning system of FIG. 1.



FIG. 3 is a block diagram of a random multicast model adopted by the peer-to-peer distributed machine learning system of FIG. 1.



FIG. 4 is an illustration of an algorithm, referred to as Algorithm 3, that implements a sufficient factor selection process used by the peer-to-peer distributed machine learning system of FIG. 1.



FIG. 5 is an illustration of an algorithm, referred to as Algorithm 4, that implements a sufficient factor group transformation process used by the peer-to-peer distributed machine learning system of FIG. 1.



FIG. 6 is an expression graph representing the parsing of an expression related to a sufficient factor identification process used by the peer-to-peer distributed machine learning system of FIG. 1.



FIG. 7 is a block diagram of a software stack included in the processors of the peer-to-peer distributed machine learning system of FIG. 1.



FIG. 8 is a flowchart of a method of creating a topic matrix that represents a prevalence of each of a plurality of medical topics among a plurality of clinical documents.



FIG. 9 is a block diagram of an apparatus, e.g., machine, processor, or worker, included in the large-scale peer-to-peer distributed machine learning system of FIG. 1 that implements the method of FIG. 8.





DETAILED DESCRIPTION

With reference to FIG. 1, a medical topic discovery system 100 configured in accordance with the large-scale distributed machine learning system disclosed herein includes a plurality of machine learning processors 102, 104, 106. The processors 102, 104, 106, also referred to herein as machines or workers, perform individual machine learning tasks and share the results of these tasks with other processors in the system 100 in order to maintain a shared topic matrix 110 that represents the relevancies or measures of prevalence of different medical topics addressed in a corpus of clinical documents 112. For ease of illustration, only three machine learning processors 102, 104, 106 are shown in FIG. 1. The system 100 may, however, include many additional processors. Some of the concepts and features described herein are included in Diversity-promoting and Large-scale Machine Learning for Healthcare, a thesis submitted by Pengtao Xie in August 2018 to the Machine Learning Department, School of Computer Science, Carnegie Mellon University, which is hereby incorporated by reference in its entirety.


Each processor 102, 104, 106 in the system 100 is configured to determine at least one local sufficient factor group (LSFG) for a document included in the corpus of clinical documents 112, and to send the local sufficient factor group to one or more other processors in the system. For example, as shown in FIG. 1, a first processor 102 may determine and send a local sufficient factor group LSFG1 to the other processors 104, 106 in the system. The local sufficient factor group includes two sufficient factors, each corresponding to a vector representing a measure, e.g., association strength, between words in the clinical document and a medical topic addressed in the clinical document. These sufficient factor vectors may be obtained as described below.


Each processor 102, 104, 106 in the system 100 is further configured to receive at least one remote sufficient factor group (RSFG) from another processor in the system, and to process its local sufficient factor group together with the received remote sufficient factor group to obtain the topic matrix 110. Each remote sufficient factor group is determined by other processors in the system for another document included in the corpus of clinical documents 112.


Continuing with the example shown in FIG. 1, the first processor 102 may receive a remote sufficient factor group RSFG2 from the second processor 104 in the system 100, and a remote sufficient factor group RSFGn from the nth processor 106 in the system. Like the local sufficient factor group determined by the first processor 102, each remote sufficient factor group determined by other processors 104, 106 in the system includes two sufficient factors, each corresponding to a vector representing a measure, e.g., association strength, between words in another clinical document and a medical topic addressed in this other clinical document.


It is noted that the “local” and “remote” nomenclature used in describing the system 100 is relative to each individual processor 102, 104, 106. More specifically, a sufficient factor group determined by a processor is: 1) a “local” sufficient factor group for that processor and 2) a “remote” sufficient factor group for all other processors in the system.


Continuing with FIG. 1, in some configurations, an individual processor 102, 104, 106 may determine more than one local sufficient factor group. For example, a processor may determine a plurality of local sufficient factor groups for a corresponding plurality of clinical documents included in the corpus of clinical documents 112. In such cases, where a processor has determined multiple sufficient factor groups, the processor may be further configured to select and send a subset of the plurality of sufficient factor groups to the one or more other processors in the system.


In some configurations, instead of being sent to all other processors in the system 100, a local sufficient factor group may be sent by a processor 102, 104, 106 to a select subset of other processors in the system. To this end, the processors 102, 104, 106 are further configured to randomly select, from among a plurality of other processors in the system, the subset of other processors to which to send the local sufficient factor group.


Upon receipt of one or more remote sufficient factor groups, a processor 102 processes its local sufficient factor group together with the remote sufficient factor group or groups received from the other processors 104, 106 in the system 100. Accordingly, the processor 102 is configured to convert each of the local sufficient factor group and the remote sufficient factor group or groups into a corresponding update matrix, and to apply each update matrix to the topic matrix using a projection operation. In one configuration, the processor 102 converts each of the local sufficient factor group and the remote sufficient factor group or groups into a corresponding update matrix by obtaining an outer product of the sufficient factors that respectively define the local sufficient factor group and the remote sufficient factor group. The outcome of this process is an updated or present state of the topic matrix 110.
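For illustration only, the following sketch shows how a processor might reconstruct rank-one update matrices from its local and remote sufficient factor groups and apply them to its replica of the topic matrix. The function and variable names (apply_sufficient_factor_groups, topic_matrix, lr) are hypothetical, and the nonnegativity clipping merely stands in for whatever projection operation a particular configuration uses.

    import numpy as np

    def apply_sufficient_factor_groups(topic_matrix, local_sfg, remote_sfgs, lr=0.1):
        """Convert each sufficient factor group (u, v) into a rank-one update matrix
        u v^T and apply it to the local replica of the topic matrix."""
        for u, v in [local_sfg] + list(remote_sfgs):
            topic_matrix -= lr * np.outer(u, v)   # reconstruct and apply the update matrix
        # illustrative projection step; the actual projection operation may differ
        np.clip(topic_matrix, 0.0, None, out=topic_matrix)
        return topic_matrix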


The medical topic discovery system 100 thus described may efficiently process a large corpus of clinical documents 112 by dividing the initial task of document processing among the processors in the system and sharing the results of the document processing, e.g., the sufficient factor groups, across the system. The sharing of results enables the creation of a topic matrix through an iterative approach, where an initial topic matrix derived from an initial set of documents is updated as additional sets of documents are processed by the system 100.


Having thus described the general configuration and operation of a medical topic discovery system 100, following is a description of a large-scale distributed learning architecture that may be used to implement the medical topic discovery system.


Large-Scale Distributed Learning


With continued reference to FIG. 1, the medical topic discovery system 100 may be implemented in the form of a large-scale peer-to-peer distributed machine learning system. The system 100 relies on machine-learned models that are parameterized by matrices having the sufficient factor (SF) property. The system significantly reduces communication and computation costs. For efficient communication, the system uses: 1) sufficient factor broadcasting to transfer small-sized vectors among machines for the synchronization of matrix-form parameters, 2) random multicast, where each machine randomly selects a subset of machines to communicate with in each clock, and 3) sufficient factor selection, which selects a subset of the most representative sufficient factors to communicate. These characteristics of the system greatly reduce the number of network messages and the size of each message. For efficient computation, the system uses: 1) sufficient factors to represent parameter matrices and 2) a sufficient-factor-aware approach for matrix-vector multiplication, which reduces the cost from being quadratic in the matrix dimensions to being linear.


Sufficient Factor Property


The system invokes a mathematical property of a large family of machine learning models that admits the following optimization formulation:

(P)        min_W  (1/N) Σ_{i=1}^{N} f_i(W a_i) + h(W)        (Eq. 1)


The model is parametrized by a matrix W ∈ R^{J×D}. The loss function f_i(.) is typically defined over a set of training samples {(a_i, b_i)}_{i=1}^{N}, with the dependence on b_i being suppressed. f_i(.) is allowed to be either convex or nonconvex, smooth or nonsmooth (with a subgradient everywhere). Examples include the l2 loss and the multiclass logistic loss, amongst others. The regularizer h(W) is assumed to admit an efficient proximal operator prox_h(.). For example, h(.) could be an indicator function of convex constraints, or the l1-, l2-, or trace-norm, to name a few. The vectors a_i and b_i can represent observed features, supervised information (e.g., class labels in classification, response values in regression), or even unobserved auxiliary information (such as sparse codes in sparse coding) associated with data sample i. The key property exploited below arises from the matrix-vector multiplication W a_i. The optimization problem (P) can be used to represent a rich set of machine learning models, such as sparse coding.


Sparse coding learns a dictionary of basis vectors from data, so that the data can be re-represented sparsely (and thus efficiently) in terms of the dictionary. In sparse coding, W is the dictionary matrix, a_i are the sparse codes, b_i is the input feature vector, and f_i(.) is a quadratic function. To prevent the entries in W from becoming too large, each column W_k must satisfy ∥W_k∥_2 ≤ 1. In this case, h(W) is an indicator function which equals 0 if W satisfies the constraints and equals ∞ otherwise. See, e.g., Bruno A Olshausen and David J Field. Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 1997, the disclosure of which is incorporated by reference.
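As a concrete and purely illustrative example of such a proximal operator, the indicator function above corresponds to projecting each dictionary column back onto the unit l2 ball; the helper name below is hypothetical.

    import numpy as np

    def prox_unit_column_norm(W):
        """Project each column of W onto the unit l2 ball, i.e., the proximal operator
        of the indicator function h(W) for the constraint ||W_k||_2 <= 1."""
        norms = np.linalg.norm(W, axis=0)
        return W / np.maximum(norms, 1.0)   # columns with norm <= 1 are left unchanged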


To solve the optimization problem (P), it is common to employ either proximal stochastic gradient descent (SGD) or stochastic dual coordinate ascent (SDCA) both of which are popular and well-established parallel optimization techniques. See, e.g., Trishul Chilimbi, Yutaka Suzue, Johnson Apacible, and Karthik Kalyanaraman. Project Adam: building an efficient and scalable deep learning training system. In USENIX Symposium on Operating Systems Design and Implementation, 2014, and Cho-Jui Hsieh, Kai-Wei Chang, Chih-Jen Lin, S Sathiya Keerthi, and Sellamanickam Sundararajan. A dual coordinate descent method for large-scale linear SVM. In International conference on machine learning, 2008, the disclosures of which are incorporated by reference.


Proximal SGD: In proximal SGD, a stochastic estimate of the gradient, ΔW, is first computed over one data sample (or a mini-batch of samples) in order to update W via W ← W − η ΔW (where η is the learning rate). Following this, the proximal operator prox_ηh(.) is applied to W. Notably, the stochastic gradient ΔW in (P) can be written as the outer product of two vectors, ΔW = u v^T, where u = ∂f_i(W a_i, b_i)/∂(W a_i) and v = a_i, according to the chain rule. Later below, it is shown that this low-rank structure of ΔW can reduce the amount of communication among machines in a large-scale system.
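A minimal sketch of this decomposition, assuming the simple l2 loss f_i(W a_i) = 0.5*||W a_i − b_i||^2 (an example chosen for clarity rather than drawn from the disclosure); the function names and the prox argument are illustrative.

    import numpy as np

    def sgd_sufficient_factors(W, a_i, b_i):
        """For f_i(W a_i) = 0.5 * ||W a_i - b_i||^2, the stochastic gradient is
        (W a_i - b_i) a_i^T, i.e., the outer product of the two factors below."""
        u = W @ a_i - b_i   # u = d f_i / d (W a_i)
        v = a_i             # v = a_i
        return u, v

    def proximal_sgd_step(W, a_i, b_i, eta, prox):
        u, v = sgd_sufficient_factors(W, a_i, b_i)
        W = W - eta * np.outer(u, v)   # gradient step reconstructed from the sufficient factors
        return prox(W)                 # then apply the proximal operator of h

Only u and v would need to be communicated; the outer product is re-formed wherever the update is applied.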


Stochastic DCA: SDCA applies to problems (P) where f_i(.) is convex and h(.) is strongly convex (e.g., when h(.) contains the squared l2 norm). SDCA solves the dual problem of (P) via stochastic coordinate ascent on the dual variables. Introducing the dual matrix U = [u_1, . . . , u_N] ∈ R^{J×N} and the data matrix A = [a_1, . . . , a_N] ∈ R^{D×N}, the dual problem of (P) can be written as:

(D)        min_U  (1/N) Σ_{i=1}^{N} f_i*(−u_i) + h*((1/N) U A^T)        (Eq. 2)


f_i*(.) and h*(.) are the Fenchel conjugate functions of f_i(.) and h(.), respectively.


The primal-dual matrices W and U are connected by W = ∇h*(Z), where the auxiliary matrix Z := (1/N) U A^T.
Algorithmically, the dual matrix U, the primal matrix W, and the auxiliary matrix Z are updated. In every iteration, a random data sample i is picked and the stochastic update Δu_i is computed by minimizing (D) while holding {u_j}_{j≠i} fixed. The dual variable is updated via u_i ← u_i − Δu_i, the auxiliary variable via Z ← Z − Δu_i a_i^T, and the primal variable via W = ∇h*(Z). Similar to stochastic gradient descent, the update of Z is also the outer product of two vectors, Δu_i and a_i, which can be exploited to reduce communication cost.


Sufficient Factor Property in SGD and SDCA: In both SGD and SDCA, the parameter matrix update can be computed as the outer product of two vectors, which are referred to herein as sufficient factors. The set of sufficient factors that are generated with respect to one data example and that atomically produce a parameter update is referred to as a sufficient factor group (SFG). This property can be leveraged to improve the communication efficiency of distributed machine learning systems. Instead of communicating updated parameter matrices among machines, the sufficient factors are communicated among the machines in the form of a sufficient factor group and the update matrices are reconstructed locally at each machine. Because the sufficient factors are much smaller in size, synchronization costs can be dramatically reduced.


Peer-to-Peer Communication Based on Sufficient Factors


As just mentioned, the sufficient factor property may be leveraged to reduce communication cost. To ensure the consistency among different copies or replicas of the parameter matrix, the parameter matrix updates computed at different machines need to be exchanged. One popular system architecture that enables this is parameter server (PS), which consists of a server machine that maintains a shared state of the parameter matrix and a set of worker machines each having a local cache of the parameter matrix. In parameter server, the parameter updates computed at worker machines are aggregated at the server and applied to the shared state of the parameter matrix that is maintained at the server. The server subsequently sends the shared state of the parameter matrix to worker machines and the worker machines refresh their local cache of the parameter matrix to match the shared state of the parameter matrix that is maintained at the server. When parameter server is used to train matrix-parametrized models, updated parameter matrices—which could contain billions of elements—are transferred, incurring substantial communication overhead.


A large-scale peer-to-peer (P2P) distributed machine learning system may be used in place of the parameter server framework, where the system design is driven by the sufficient factor property. The large-scale peer-to-peer distributed learning architecture is a decentralized system that executes data-parallel distributed training of matrix-parameterized machine learning models. The decentralized system runs on a group of worker machines connected via a peer-to-peer network. Unlike client-server architectures, including the parameter server, machines in the decentralized system play equal roles without server/client asymmetry, and every pair of machines can communicate. Each machine holds one shard of the data and a replica of the model parameters. Machines synchronize their model replicas to ensure consistency by exchanging parameter (pre-)updates via network communication. Under this general framework, the decentralized system applies a battery of system-algorithm co-designs to achieve efficiency in communication and fault tolerance.


For efficient communication, a feature of the decentralized system is to represent the parameter update matrices by their corresponding sufficient factors, which can be understood as “pre-updates”, meaning that the actual update matrices must be computed on each machine upon receiving fresh sufficient factors, and the update matrices themselves are never transmitted. Since the size of sufficient factors is much smaller than matrices, the communication cost can be substantially reduced. Under a peer-to-peer architecture, in addition to avoiding transmitting update matrices, the decentralized system can also avoid transmitting parameter matrices, while still achieving synchrony. Besides, random multicast, under which each machine sends sufficient factors to a randomly-chosen subset of machines, is leveraged to reduce the number of messages. Sufficient factor selection, which chooses a subset of representative sufficient factors to communicate, is used to further reduce the size of each message.


The decentralized system uses incremental sufficient factor checkpointing for fault tolerance, motivated by the fact that the parameter state can be represented as a dynamically growing set of sufficient factors. Machines continuously save the new sufficient factors computed in each logical time onto stable storage. To recover a parameter state, the decentralized system transforms the saved sufficient factors into a matrix. Compared with checkpointing parameter matrices, saving vectors requires much less disk input/output and does not require the application program to halt. Besides, the parameters can be rolled back to the state at any logical time.


In programming abstraction, the sufficient factors are explicitly exposed such that system-level optimizations based on sufficient factors can be exploited. The decentralized system is able to automatically identify the symbolic expressions representing sufficient factors and updates, relieving users' burden to manually specify them.


The decentralized system supports two consistency models: bulk synchronous parallel (BSP) and staleness synchronous parallel (SSP). Bulk synchronous parallel sets a global barrier at each clock; a worker cannot proceed to the next clock until all workers reach this barrier. Staleness synchronous parallel allows workers to have different paces as long as their difference in clock is no more than a user-defined staleness threshold.


Sufficient Factor Broadcasting


With reference to FIG. 2, leveraging the sufficient factor property of the update matrix in problems (P) and (D), a sufficient factor broadcasting (SFB) computation model that supports efficient (low-communication) distributed learning of the parameter matrix W is used. Consider a setting with P workers, each of which holds a data shard and a copy of the parameter matrix W. Stochastic updates to W are generated via proximal SGD or SDCA, and communicated between machines to ensure parameter consistency.


In proximal SGD, on every iteration, each worker p computes sufficient factors (up, vp), based on one data sample xi=(ai, bi) in the worker's data shard. The worker then broadcasts (up, vp) to all other workers. Once all P workers have performed their broadcast (and have thus received all sufficient factors), they re-construct the P update matrices (one per data sample) from the P sufficient factors, and apply them to update their local copy of W. Finally, each worker applies the proximal operator proxh(.). When using SDCA, the above procedure is instead used to broadcast sufficient factors for the auxiliary matrix Z, which is then used to obtain the primal matrix W=∇h*(Z).


In FIG. 2, the SFB operation is performed by four workers. The workers compute their respective sufficient factors (u_1, v_1), . . . , (u_4, v_4), which are then broadcast to the other three workers. Each worker p uses all four sufficient factor pairs to exactly reconstruct the update matrices ΔW_q = u_q v_q^T (q = 1, . . . , 4), and updates its local copy of the parameter matrix: W_p ← W_p − Σ_{q=1}^{4} u_q v_q^T. While the above description reflects synchronous execution, it is easy to extend to (bounded) asynchronous execution.


Since an update to the parameter matrix, referred to as an update matrix (UM), can be computed from a few sufficient factors, sending an update matrix from machine A to machine B can equivalently be done by first transferring the sufficient factors from A to B, then producing the update matrix from the sufficient factors received at B.


The communication cost of transmitting sufficient factors is O(J+K) which is linear in matrix dimensions while that of transmitting update matrices is O(JK) which is quadratic in matrix dimensions. Hence sufficient factor transfer, instead of update matrix transfer, can greatly reduce communication overhead. The transformation from sufficient factors to an update parameter matrix is mathematically exact, without compromising computational correctness.
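To make the difference concrete, consider hypothetical dimensions J = 10,000 and K = 50,000 (illustrative numbers, not from the disclosure):

    # Illustrative arithmetic only; the dimensions are hypothetical.
    J, K = 10_000, 50_000
    update_matrix_entries = J * K        # O(JK): 500,000,000 values per update
    sufficient_factor_entries = J + K    # O(J+K): 60,000 values per update
    print(update_matrix_entries // sufficient_factor_entries)  # ~8333x fewer values transmitted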


In parameter server, the one-sided communication cost from worker machines to the server can be reduced by transmitting sufficient factors. In this case, each worker machine sends new sufficient factor groups to the server, where the received sufficient factor groups are transformed to update matrices to update the shared state of the parameter matrix. However, since the parameter matrix cannot be computed from a few sufficient factors, from the server to worker machines the newly-updated parameters need to be sent as a matrix, which still incurs high communication overhead. To avoid transmitting parameter matrices, the large-scale distributed machine learning system disclosed herein adopts a decentralized peer-to-peer architecture, where worker machines synchronize their parameter replicas by exchanging updates in the form of sufficient factors. In each clock, each worker machine computes sufficient factor groups and broadcasts them to other worker machines. Meanwhile, each worker machine converts the sufficient factor groups received remotely into update matrices which are subsequently added to the parameter replica resident in the worker machine. This computation model is referred to as sufficient factor broadcasting. Unlike parameter server, the decentralized peer-to-peer architecture of the large-scale distributed machine learning system disclosed herein does not maintain the shared state of the parameter matrix and can avoid transmitting matrices.


While the transfer of sufficient factor groups among peer-to-peer machines greatly reduces communication cost, such transfer increases computation overhead because each sufficient factor group is converted into the same update at each of the peer-to-peer machines; thus the sufficient factor group is converted multiple times. However, in-memory computation is usually much more efficient than inter-machine network communication, especially with the advent of graphics processing unit (GPU) computing, so the reduction in communication cost outweighs the increase in computation overhead.


Random Multicast


While the peer-to-peer transfer of sufficient factor groups greatly reduces the size of each message from a matrix to a few vectors, a limitation of such transfer is that a large number of sufficient factor groups needs to be sent from each machine in the system to every other machine in the system, which makes the number of messages per clock quadratic in the number of machines P. To address this issue, the large-scale distributed machine learning system disclosed herein adopts random multicast. During random multicast, in each clock cycle, each machine randomly selects Q (where Q < P−1) machines to which to send one or more sufficient factor groups. This reduces the number of messages sent per clock cycle from O(P^2) to O(PQ).



FIG. 3 shows an example of random multicast. In each iteration t, an update U_p^t generated by machine p is sent only to machines that are directly connected with p (and the update U_p^t takes effect at iteration t+1). The effect of U_p^t is indirectly and eventually transmitted to every other machine q, via the updates generated by machines sitting between p and q in the topology. This happens at iteration t+τ, for some delay τ>1 that depends on Q and the locations of p and q in the network topology. Consequently, the P machines will not have the exact same parameter image W, even under bulk synchronous parallel execution, yet this does not empirically compromise algorithm accuracy as long as Q is not too small.


Two random selection methods are provided. One is uniform selection: each machine has the same probability of being selected. The other is prioritized selection, for load-balancing purposes. Each machine is assigned a priority score based on its progress (measured by clock). A machine with faster progress (higher priority) is selected with higher probability and receives more sufficient factors from slower machines. It then spends more compute cycles consuming these remote sufficient factors, which slows down its computation of new sufficient factors and gives the slower machines a grace period to catch up.
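The following sketch illustrates one way such a per-clock target selection could be implemented; the function name, the (progress + 1) weighting, and the clocks argument are assumptions made for illustration.

    import numpy as np

    def select_multicast_targets(my_rank, clocks, Q, prioritized=False, rng=None):
        """Choose Q of the other machines to receive this clock's sufficient factor groups."""
        rng = rng or np.random.default_rng()
        peers = [p for p in range(len(clocks)) if p != my_rank]
        if prioritized:
            progress = np.array([clocks[p] for p in peers], dtype=float)
            probs = (progress + 1.0) / (progress + 1.0).sum()  # faster machines chosen more often
        else:
            probs = None                                       # uniform selection
        return rng.choice(peers, size=Q, replace=False, p=probs)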


Unlike a deterministic multicast topology, where each machine communicates with a fixed set of machines throughout the application run, random multicast provides several benefits. First, dynamically changing the topology in each clock gives every two machines a chance to communicate directly, which facilitates more symmetric synchronization. Second, random multicast is more robust to network connection failures since the failure of a network connection between two machines will not affect their communication with another one. Third, random multicast makes resource elasticity simpler to implement because adding and removing machines require minimal coordination with existing ones, unlike a deterministic topology which must be modified every time a worker joins or leaves.


Sufficient Factor Selection


In machine learning practice, parameter updates are usually computed over a small batch (whose size typically ranges from tens to hundreds) of examples. At each clock, a batch of K training examples is selected and a parameter update is generated with respect to each example. When represented as matrices, these K updates can be aggregated into a single matrix to communicate, so the communication cost is independent of K. However, this is not the case in sufficient factor transfer: a batch of K sufficient factor groups cannot be aggregated into a single sufficient factor group; instead, they are transferred individually. Therefore, communication cost grows linearly with K.


To alleviate this cost, the large-scale distributed machine learning system disclosed herein provides for sufficient factor selection (SFS). During sufficient factor selection, a machine chooses a subset of C sufficient factor groups from a batch of K sufficient factor groups (where C < K) to send to other machines. The chosen subset of C sufficient factor groups corresponds to the sufficient factor groups that best represent the entire batch.


An efficient sampling-based algorithm called joint matrix column subset selection (JMCSS) performs sufficient factor selection. Given the P matrices X^{(1)}, . . . , X^{(P)}, where X^{(p)} stores the p-th sufficient factor of all sufficient factor groups, JMCSS selects a subset of non-redundant column vectors from each matrix to approximate the entire matrix. The selection of columns in different matrices is tied together, i.e., if the i-th column is selected in one matrix, the i-th column of every other matrix must be selected as well to atomically form a sufficient factor group. Let I = {i_1, . . . , i_C} index the selected sufficient factor groups and let S_I^{(p)} be a matrix whose columns are taken from X^{(p)} and indexed by I. The goal is to find the selection I that minimizes the approximation error Σ_{p=1}^{P} ∥X^{(p)} − S_I^{(p)} (S_I^{(p)})^† X^{(p)}∥^2, where (S_I^{(p)})^† is the pseudo-inverse of S_I^{(p)}.


Finding the exact solution of this problem is NP-hard. To address this issue, a sampling-based method (Algorithm 3), which is an adaptation of the iterative norm sampling algorithm, is used. This algorithm is shown in FIG. 4. Let S^{(p)} be a dynamically growing matrix that stores the column vectors selected from X^{(p)}, and let S_t^{(p)} denote the state of S^{(p)} at iteration t. Accordingly, X^{(p)} is dynamically shrinking and its state is denoted by X_t^{(p)}. At the t-th iteration, an index i_t is sampled and the i_t-th column vectors are taken out of {X^{(p)}}_{p=1}^{P} and added to {S^{(p)}}_{p=1}^{P}. The index i_t is sampled in the following way. First, the squared L2 norm of each column vector in {X_{t−1}^{(p)}}_{p=1}^{P} is computed. Then i_t (1 ≤ i_t ≤ K+1−t) is sampled with probability proportional to Π_{p=1}^{P} ∥x_{i_t}^{(p)}∥_2^2, where x_{i_t}^{(p)} denotes the i_t-th column of X_{t−1}^{(p)}.


Then a back-projection is utilized to transform X_t^{(p)}: X_t^{(p)} ← X_t^{(p)} − S_t^{(p)} (S_t^{(p)})^† X_t^{(p)}. After C iterations, the selected sufficient factors contained in {S^{(p)}}_{p=1}^{P} are obtained and packed into sufficient factor groups, which are subsequently sent to other machines. Under JMCSS, the aggregated update generated from the C sufficient factor groups is close to that computed from the entire batch. Hence sufficient factor selection does not compromise parameter-synchronization quality.
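A minimal sketch of this iterative norm-sampling selection for the common case of sufficient factor groups with two factors (P = 2); this is an illustrative adaptation rather than the exact Algorithm 3 of FIG. 4, and the function name is hypothetical.

    import numpy as np

    def jmcss_select(X_u, X_v, C, rng=None):
        """X_u, X_v: matrices whose K columns hold the u / v factors of K SFGs.
        Returns the indices of C jointly selected sufficient factor groups."""
        rng = rng or np.random.default_rng()
        Xu = np.asarray(X_u, dtype=float).copy()
        Xv = np.asarray(X_v, dtype=float).copy()
        remaining = list(range(X_u.shape[1]))
        selected = []
        for _ in range(C):
            # sample with probability proportional to the product of squared column norms
            scores = (np.linalg.norm(Xu, axis=0) ** 2) * (np.linalg.norm(Xv, axis=0) ** 2)
            j = rng.choice(len(remaining), p=scores / scores.sum())
            selected.append(remaining.pop(j))
            Xu = np.delete(Xu, j, axis=1)
            Xv = np.delete(Xv, j, axis=1)
            # back-project the residual columns onto the complement of the selected ones
            Su = np.asarray(X_u, dtype=float)[:, selected]
            Sv = np.asarray(X_v, dtype=float)[:, selected]
            Xu -= Su @ np.linalg.pinv(Su) @ Xu
            Xv -= Sv @ np.linalg.pinv(Sv) @ Xv
        return selected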


The selection of sufficient factors is pipelined with their computation and communication to increase throughput. Two FIFO queues (denoted by A and B) containing sufficient factors are utilized for coordination. The computation thread adds newly-computed sufficient factors into queue A. The selection thread dequeues sufficient factors from A, executes the selection and adds the selected sufficient factors to queue B. The communication thread dequeues sufficient factors from B and sends them to other machines. The three modules operate asynchronously: for each one, as long as its input queue is not empty and output queue is not full, the operation continues. The two queues can be concurrently accessed by their producer and consumer.
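One possible shape of this pipeline, sketched with Python's standard queue module; the loop bodies, queue sizes, and the select_fn/send_fn callables are placeholders rather than the disclosed engine's actual interfaces.

    import queue

    queue_a = queue.Queue(maxsize=256)   # newly computed sufficient factors
    queue_b = queue.Queue(maxsize=256)   # selected sufficient factors awaiting transmission

    def selection_loop(select_fn):
        # runs in its own thread: consume computed batches, keep the representative SFGs
        while True:
            batch = queue_a.get()            # blocks while queue A is empty
            for sfg in select_fn(batch):
                queue_b.put(sfg)             # blocks while queue B is full

    def communication_loop(send_fn):
        # runs in its own thread: multicast selected SFGs to peer machines
        while True:
            send_fn(queue_b.get())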


Sufficient Factor Representation of Parameters


First, a sufficient-factor representation (SFR) of the parameters is presented. At clock T, the parameter state W_T is mathematically equal to W_0 + Σ_{t=1}^{T} ΔW_t, where ΔW_t is the update matrix computed at clock t and W_0 is the initialization of the parameters. As noted earlier, ΔW_t can be computed from a sufficient factor group G_t: ΔW_t = h(G_t), using a transformation h. To initialize the parameters, a sufficient factor group G_0 is randomly generated and W_0 = h(G_0) is set. To this end, the parameter state can be represented as W_T = Σ_{t=0}^{T} h(G_t), using a set of sufficient factor groups. The sufficient-factor representation can be leveraged to reduce computation cost. First of all, since no parameter matrix needs to be maintained, there is no need to explicitly compute the update matrix in each clock, which otherwise incurs O(JK) cost.


Second, in most matrix-parameterized models, a major computation workload is to multiply the parameter matrix by a vector, whose cost is quadratic in matrix dimensions. This cost is reduced by executing the multiplication in a sufficient-factor-aware way. The details are given in the following subsection.


Incremental Sufficient Factor Checkpoint


Based on the SF representation (SFR) of parameters and inspired by asynchronous and incremental checkpointing methods, the large-scale distributed machine learning system disclosed herein provides an incremental sufficient factor checkpoint (ISFC) mechanism for fault tolerance and recovery: each machine continuously saves the new sufficient factor groups computed in each clock to stable storage and restores the parameters from the saved sufficient factor groups when a machine failure happens. Unlike existing systems which checkpoint large matrices, saving small vectors consumes much less disk bandwidth. To reduce the frequency of disk writes, the sufficient factor groups generated after each clock are not immediately written onto the disk, but are staged in host memory. When a large batch of sufficient factor groups has accumulated, the large-scale distributed machine learning system disclosed herein writes them together.


Incremental sufficient factor checkpointing does not require the application program to halt while checkpointing the sufficient factors. The IO thread reads the sufficient factors and the computing thread writes the parameter matrix, so there is no read/write conflict. In contrast, in matrix-based checkpointing, the IO thread reads the parameter matrix, which requires the computation thread to halt to ensure consistency, incurring a waste of compute cycles.


Incremental sufficient factor checkpointing is able to roll back the parameters to the state at any clock. To obtain the state at clock T, the large-scale distributed machine learning system disclosed herein collects the sufficient factor groups computed up to T and transforms them into a parameter matrix. This granularity is much more fine-grained than checkpointing parameter matrices: since saving large matrices to disk is time-consuming, a matrix-checkpointing system can only afford to perform a checkpoint periodically, and the parameter states between two checkpoints are lost. The restore(T) API is used for recovery, where T is a user-specified clock to which the parameters are to be rolled back. The default T is the latest clock.
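The following is a minimal sketch of such a checkpoint/restore cycle; the file layout, the assumption that h(G) is a plain outer product with any scaling already folded into the factors, and the function names are all illustrative.

    import numpy as np

    def checkpoint_sfgs(path, clock, sfgs):
        """Append one staged batch of (u, v) sufficient factor groups to stable storage."""
        np.savez_compressed(f"{path}/sfgs_clock_{clock:08d}.npz",
                            us=np.stack([u for u, _ in sfgs]),
                            vs=np.stack([v for _, v in sfgs]))

    def restore(path, clocks, shape):
        """Rebuild the parameter state by replaying all SFGs saved up to the requested clock."""
        W = np.zeros(shape)
        for clock in clocks:
            data = np.load(f"{path}/sfgs_clock_{clock:08d}.npz")
            for u, v in zip(data["us"], data["vs"]):
                W += np.outer(u, v)   # assumes h(G) = u v^T
        return W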


Sufficient-Factor-Aware Multiplication and Tree Rewriting


In multiclass logistic regression (MLR), each sufficient factor group contains two sufficient factors u, v, whose outer product u v^T produces a parameter update. Consequently, the sufficient-factor representation of the parameter state W_T is Σ_{t=0}^{T} u_t v_t^T. The multiplication between W_T and a vector x can be computed in the following way: W_T x = (Σ_{t=0}^{T} u_t v_t^T) x = Σ_{t=0}^{T} u_t (v_t^T x), which first calculates the inner product v_t^T x between v_t and x, then multiplies the inner product with u_t. The computation cost is O(T(J+K)), which is linear in the matrix dimensions and grows with T. As another example, suppose each sufficient factor group contains two sufficient factors and the parameter update is computed as ΔW = u u^T − v v^T. Then W_T is represented as Σ_{t=0}^{T} (u_t u_t^T − v_t v_t^T), and W_T x can be computed as Σ_{t=0}^{T} u_t (u_t^T x) − v_t (v_t^T x), whose cost is O(T(J+K)) as well. When T is small, sufficient-factor-aware multiplication is highly efficient.
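A minimal sketch of sufficient-factor-aware multiplication for the MLR case above; the function name is illustrative.

    import numpy as np

    def sf_aware_matvec(sfgs, x):
        """sfgs: list of (u_t, v_t) pairs representing W_T = sum_t u_t v_t^T.
        Computes W_T @ x in O(T(J+K)) without ever materializing W_T."""
        result = np.zeros_like(sfgs[0][0], dtype=float)
        for u, v in sfgs:
            result += u * (v @ x)   # inner product first, then scale u
        return result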


The large-scale distributed machine learning system disclosed herein may use a multiplication tree to perform sufficient-factor-aware multiplication. A multiplication tree is rewritten from an updating tree built by parsing the compute_update function, which is either defined by users or automatically identified by the system. At the leaf nodes of the updating tree are sufficient factors and at the internal nodes are operations. An in-order traversal of the updating tree transforms the sufficient factors into an update matrix: at each internal node, the associated operation is applied to the data objects (either sufficient factors or matrices) at its two children. The update matrix is obtained at the root.


Given this updating tree, it is rewritten into a multiplication tree. For each subtree of the updating tree, if the operation at the root is a vector outer product and the children of the root are two sufficient factors sv0 and sv1, then the subtree is transformed into a new tree with three layers: at the root is scalar-vector multiplication; the two children of the root are sv0 and a vector inner-product node; and the two children of the inner-product node are sv1 and x (the vector involved in W_T*x). The “+” and “−” operations representing matrix addition/subtraction in the updating tree are replaced with vector addition/subtraction in the multiplication tree.


To compute W_T*x, where W_T is represented by T+1 sufficient factor groups, the sufficient factors of each sufficient factor group, together with x, are fed into the leaf nodes of the multiplication tree, and an in-order traversal is performed to obtain a vector at the root. W_T*x is then obtained by adding up the vectors generated from all sufficient factor groups.


Programming Model


The programming model of the large-scale distributed machine learning system disclosed herein provides a data abstraction called a sufficient factor group and two user-defined functions that generate and consume sufficient factor groups to update model parameters. Each sufficient factor group contains a set of sufficient factors that are generated with respect to one data example and that atomically produce a parameter update. The sufficient factors are immutable and dense, and their default type is float. Inside a sufficient factor group, each sufficient factor has an index.


To program an application for execution by the large-scale distributed machine learning system disclosed herein, users specify two functions: (1) compute_svg, which takes the current parameter state and one data example as inputs and computes vectors that collectively form a sufficient factor group; and (2) compute_update, which takes a sufficient factor group and produces a parameter update. These two functions are invoked by the disclosed engine to perform data-parallel distributed machine learning: each of the P machines holds one shard of the training data and a replica of the parameters; different parameter replicas are synchronized across machines to retain consistency (consistency means different replicas are encouraged to be as close as possible).


Every machine executes a sequence of operations iteratively: in each clock, a small batch of training examples is randomly selected from the data shard and compute_svg is invoked to compute a sufficient factor group with respect to each example; the sufficient factor groups are then sent to other machines for parameter synchronization; and compute_update is invoked to transform locally-generated sufficient factor groups and remotely-received sufficient factor groups into updates which are subsequently added to the parameter replica. The execution semantics (per-clock) of the disclosed engine is shown in FIG. 5, as Algorithm 4. Unlike existing systems which directly compute parameter updates from training data, the large-scale distributed machine learning system disclosed herein breaks this computation into two steps and explicitly exposes the intermediate sufficient factors to users, which enables SF-based system-level optimizations to be exploited.


The following shows how these two functions are implemented in multiclass logistic regression. The inputs of the compute_svg function are the parameter replica Parameters and a data example Data, and the output is an SFG. A sufficient factor group is declared via SFG([d1, . . . , dJ]), where dj is the length of the j-th SF. In multiclass logistic regression, a sufficient factor group contains two sufficient factors: the first is the difference between the prediction vector softmax(W * smp.feats) and the label vector smp.label; the second is the feature vector smp.feats. The update matrix is computed as the outer product of the two sufficient factors.

    def compute_svg(Parameters W, Data smp):
        svg = SFG([W.nrows, W.ncols])
        x = softmax(W * smp.feats) - smp.label
        svg.sv[0] = x
        svg.sv[1] = smp.feats
        return svg

    def compute_update(SFG svg):
        return outproduct(svg.sv[0], svg.sv[1])


Automatic Identification of Sufficient Factors and Updates


When machine learning models are trained using gradient descent or quasi-Newton algorithms, the computation of sufficient factor groups and updates can be automatically identified by the disclosed engine, which relieves users from writing the two functions compute_svg and compute_update. The only input required from users is a symbolic expression of the loss function, which is in general much easier to program compared with the two functions. Note that this is not an extra burden: in most machine learning applications, users need to specify this loss function to measure the progress of execution.


The identification procedure of sufficient factors depends on the optimization algorithm (either gradient descent or quasi-Newton) specified by the users for minimizing the loss function. For both algorithms, automatic differentiation techniques are needed to compute the gradients of variables. Given the symbolic expression of the loss function, such as f = crossentropy(softmax(W*x), y) in multiclass logistic regression, the disclosed engine first parses the expression into an expression graph as shown in FIG. 6. In the figure, circles denote variables, including terminals such as W, x, y and intermediate ones such as a = W*x and b = softmax(a); boxes denote operators applied to variables. According to their inputs and outputs, operators can be categorized into different types, shown in the table included in FIG. 6. Given the expression graph, the large-scale distributed machine learning system disclosed herein uses automatic differentiation to compute the symbolic expression of the gradient ∂f/∂z of f with respect to each unknown variable z (either a terminal or an intermediate one). The computation is executed recursively in the backward direction of the graph. For example, in FIG. 6, to obtain ∂f/∂a, ∂f/∂b is first computed, then it is transformed into ∂f/∂a using an operator-specific matrix A. For a type-2 operator (e.g., softmax) in the table in FIG. 6, A_ij = ∂b_j/∂a_i.

If W is involved in a type-5 operator (see the table in FIG. 6), which takes W and a vector x as inputs and produces a vector a, and the gradient descent algorithm is used to minimize the loss function, then the sufficient factor group contains two sufficient factors which can be automatically identified: one is ∂f/∂a and the other is x. Accordingly, the update of W can be automatically identified as the outer product of the two sufficient factors.


If quasi-Newton methods are used to learn machine learning models parameterized by a vector x, the large-scale distributed machine learning system disclosed herein can automatically identify the sufficient factors of the update of the approximated Hessian matrix W. First of all, automatic differentiation is applied to compute the symbolic expression of the gradient g(x) = ∂f/∂x. To identify the sufficient factors at clock k, the states x_{k+1} and x_k of the parameter vector at clocks k+1 and k are plugged into g(x) to calculate a vector y_k = g(x_{k+1}) − g(x_k). Another vector s_k = x_{k+1} − x_k is also computed. Then, based on s_k, y_k, and W_k (the state of W at clock k), the sufficient factors, which depend on the specific quasi-Newton algorithm instance, can be identified. For BFGS, the procedure is: (1) set y_k ← y_k/√(y_k^T s_k); (2) compute v_k = W_k s_k; (3) set v_k ← v_k/√(s_k^T v_k). The sufficient factors are then identified as y_k and v_k, and the update of W_k is computed as y_k y_k^T − v_k v_k^T. For DFP, the procedure is: (1) set s_k ← s_k/√(y_k^T s_k); (2) compute v_k = W_k y_k; (3) set v_k ← v_k/√(y_k^T v_k). The sufficient factors are then identified as s_k and v_k, and the update of W_k is computed as s_k s_k^T − v_k v_k^T.
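For illustration, the BFGS branch of the procedure above could be sketched as follows, assuming g is the gradient function produced by automatic differentiation and that the curvature terms are positive; the function names are hypothetical.

    import numpy as np

    def bfgs_sufficient_factors(W_k, x_k, x_k1, g):
        """Return (y_k, v_k) such that the update of W_k is y_k y_k^T - v_k v_k^T."""
        s_k = x_k1 - x_k
        y_k = g(x_k1) - g(x_k)
        y_k = y_k / np.sqrt(y_k @ s_k)   # step (1): normalize y_k by sqrt(y_k^T s_k)
        v_k = W_k @ s_k                  # step (2): v_k = W_k s_k
        v_k = v_k / np.sqrt(s_k @ v_k)   # step (3): normalize v_k by sqrt(s_k^T v_k)
        return y_k, v_k

    def apply_quasi_newton_update(W_k, y_k, v_k):
        return W_k + np.outer(y_k, y_k) - np.outer(v_k, v_k)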


Implementation


With reference to FIG. 7, the large-scale distributed machine learning system disclosed herein is a decentralized system, where workers are symmetric, each running the same software stack, which is conceptually divided into three layers: (1) a machine learning application layer including machine learning programs implemented on top of the disclosed engine, such as multiclass logistic regression, topic models, deep learning models, etc.; (2) a service layer for automatic identification of sufficient factors, sufficient factor selection, fault tolerance, etc.; and (3) a peer-to-peer communication layer for sufficient factor transfer and random multicast.


The major modules in the disclosed engine include: (1) an interpreter that automatically identifies the symbolic expressions of sufficient factors and parameter updates; (2) a sufficient factor generator that selects training examples from the local data shard and computes a sufficient factor group for each example using the symbolic expressions of sufficient factors produced by the interpreter; (3) a sufficient factor selector that chooses a small subset of the most representative sufficient factors out of those computed by the generator for communication; (4) a communication manager that transfers the sufficient factors chosen by the selector using broadcast or random multicast and receives remote sufficient factors; (5) an update generator which computes update matrices from locally-generated and remotely-received sufficient factors and updates the parameter matrix; and (6) a central coordinator for periodic centralized synchronization, parameter-replica rotation, and elasticity.


Heterogeneous computing: The programming interface of the disclosed system exposes a rich set of operators, such as matrix multiplication, vector addition, and softmax, through which users write their machine learning programs. To support heterogeneous computing, each operator has a CPU implementation and a GPU implementation built upon highly optimized libraries. In the GPU implementation, the disclosed engine performs kernel fusion, which combines a sequence of kernels into a single one to reduce the number of kernel launches, each of which bears a large overhead. The disclosed engine generates a dependency graph of operators by parsing the user's program and traverses the graph to fuse consecutive operators into one CUDA kernel.


Elasticity: The large-scale distributed machine learning system disclosed herein is elastic to resource adjustment: adding new machines or preempting existing machines does not interrupt the current execution. To add a new machine, the central coordinator executes the following steps: (1) launching the disclosed engine and application program on the new machine; (2) averaging the parameter replicas of the existing machines and placing the averaged parameters on the new machine; (3) taking a chunk of training data from each existing machine and assigning the data to the new machine; and (4) adding the new machine into the peer-to-peer network. When an existing machine is preempted, it is removed from the peer-to-peer network and its data shard is redistributed to the other machines.
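
An in-process sketch of the coordinator's add-machine steps (2)-(4) is given below; the data structures (lists of replicas, per-machine shards, and ring members) are simplifications chosen for illustration, since step (1), launching the engine on a new machine, cannot be reproduced in a few lines.

    import numpy as np

    # In-process sketch of the coordinator's add-machine steps (illustrative):
    # replicas is a list of parameter matrices, shards is a list of per-machine data
    # lists, and ring is the list of machine ids forming the peer-to-peer network.

    def add_machine(replicas, shards, ring):
        # (2) place the average of the existing parameter replicas on the new machine
        replicas.append(sum(replicas) / len(replicas))
        # (3) take a chunk of training data from each existing machine
        new_shard = []
        for shard in shards:
            take = len(shard) // (len(shards) + 1)
            new_shard.extend(shard[:take])
            del shard[:take]
        shards.append(new_shard)
        # (4) add the new machine into the peer-to-peer network
        ring.append(len(ring))

    replicas = [np.ones((4, 3)), 3 * np.ones((4, 3))]
    shards = [list(range(10)), list(range(10, 20))]
    ring = [0, 1]
    add_machine(replicas, shards, ring)   # the new machine now holds the averaged replica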


Periodic centralized synchronization: Complementary to the peer-to-peer decentralized parameter synchronization, the large-scale distributed machine learning system disclosed herein performs a centralized synchronization periodically. The central coordinator sets a global barrier every R clocks. When all workers reach this barrier, the coordinator calls the AllReduce(average) interface to average the parameter replicas and set each replica to the average. After that, workers perform decentralized synchronization until the next barrier. Centralized synchronization effectively removes the discrepancy among parameter replicas accumulated during decentralized execution, and it does not incur substantial communication cost since it is invoked only periodically.
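
A sketch of this periodic synchronization using mpi4py is shown below; the period R, the replica shape, and the placement of the barrier are illustrative, and an in-place Allreduce followed by a division stands in for the AllReduce(average) interface.

    import numpy as np
    from mpi4py import MPI

    # Sketch of the periodic centralized synchronization (illustrative): every R clocks
    # all workers stop at a barrier, sum their parameter replicas with an in-place
    # Allreduce, and reset the local replica to the average. Run under mpirun.

    comm = MPI.COMM_WORLD
    R = 10                                        # synchronization period (assumed)
    W_local = np.random.rand(4, 3)                # this worker's parameter replica

    for clock in range(100):
        # ... decentralized sufficient-factor exchange and local updates here ...
        if (clock + 1) % R == 0:
            comm.Barrier()                                     # global barrier set by the coordinator
            comm.Allreduce(MPI.IN_PLACE, W_local, op=MPI.SUM)  # sum the replicas in place
            W_local /= comm.Get_size()                         # each replica becomes the average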


Rotation of parameter replicas: The large-scale distributed machine learning system disclosed herein adopts data parallelism, where each worker has access to one shard of the data. Since computation is usually much faster than communication, the updates computed locally are much more frequent than those received remotely. This renders the updating of parameters imbalanced: a parameter replica is updated more frequently based on the local data residing on the same machine than based on the data shards on other machines. This is another cause of out-of-synchronization. To address this issue, the large-scale distributed machine learning system disclosed herein performs parameter-replica rotation, which enables each parameter replica to explore all data shards on different machines. Logically, the machines are connected via a ring network. Parameter replicas rotate along the ring periodically (every S iterations) while each data shard stays on the same machine during the entire execution. The parameters are rotated rather than the data because the size of the parameters is much smaller than that of the data. A centralized coordinator sets a barrier every S iterations. When all workers reach the barrier, it invokes the Rotate API, which triggers the rotation of the parameter replicas.
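
The ring rotation can be sketched with mpi4py as follows; the period S, the replica shape, and the use of a combined send/receive are illustrative choices rather than the disclosed implementation.

    import numpy as np
    from mpi4py import MPI

    # Sketch of parameter-replica rotation along a logical ring (illustrative): every S
    # iterations each worker passes its replica to the next machine on the ring and
    # receives the replica of the previous one; data shards never move. Run under mpirun.

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()
    S = 5                                         # rotation period (assumed)
    W_replica = np.random.rand(4, 3)              # parameter replica currently on this machine

    for it in range(50):
        # ... local updates using the data shard that stays on this machine ...
        if (it + 1) % S == 0:
            comm.Barrier()                        # coordinator barrier every S iterations
            dst = (rank + 1) % size               # next machine on the ring
            src = (rank - 1) % size               # previous machine on the ring
            W_replica = comm.sendrecv(W_replica, dest=dst, source=src)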


Data prefetching: The loading of training data from CPU to GPU is overlapped with the sufficient factor generator via a data queue. The next batches of training examples are prefetched into the queue while the generator is processing the current one. In certain applications, each training example is associated with a data-dependent variable (DDV). For instance, in a topic model, each document has a topic proportion vector. The states of the DDVs need to be maintained throughout execution. Training examples and their DDVs are stored in consecutive host/device memory for locality and are prefetched together. At the end of a clock, the GPU buffer storing the examples is immediately ready for overwriting. The DDVs are swapped from GPU memory to host memory, which is pipelined using a DDV queue.
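
A host-memory-only sketch of the prefetching pipeline using a bounded queue and a background thread is shown below; the batch contents, queue depth, and sentinel convention are illustrative simplifications of the CPU-to-GPU pipeline described above.

    import queue
    import threading
    import numpy as np

    # Host-memory sketch of overlapping data loading with sufficient-factor generation
    # via a bounded queue (illustrative; the disclosed engine overlaps CPU-to-GPU copies).

    def prefetcher(batches, data_queue):
        for batch in batches:
            data_queue.put(batch)                 # blocks while the queue is full
        data_queue.put(None)                      # sentinel: no more batches

    batches = [np.random.rand(32, 100) for _ in range(20)]   # made-up training batches
    data_queue = queue.Queue(maxsize=4)                      # the next batches wait here
    threading.Thread(target=prefetcher, args=(batches, data_queue), daemon=True).start()

    while True:
        batch = data_queue.get()                  # the current batch is already loaded
        if batch is None:
            break
        # ... compute sufficient factors for this batch while the next ones are prefetched ...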


Hardware/Software-Aware Sufficient Factor Transfer


The large-scale distributed machine learning system disclosed herein provides a communication library for efficient message broadcasting. It contains a collection of broadcast methods designed for different hardware and software configurations, including (1) whether the communication is CPU-to-CPU or GPU-to-GPU; (2) whether InfiniBand is available; (3) whether the consistency model is bulk synchronous parallel or staleness synchronous parallel.


CPU-to-CPU, bulk synchronous parallel: In this case, the MPI_Allgather routine is used to perform all-to-all broadcast. In each clock, it gathers the sufficient factors computed by each machine and distributes them to all machines. MPI_Allgather is a blocking operation (i.e., control does not return to the application until the receiving buffer is ready with the sufficient factors from all machines). This is in accordance with the bulk synchronous parallel consistency model, where the execution cannot proceed to the next clock until all machines reach the global barrier.
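
A sketch of this step with mpi4py's pickle-based allgather is shown below; the contents of the sufficient factor groups are placeholders.

    from mpi4py import MPI

    # Sketch of the bulk-synchronous all-to-all broadcast of sufficient factors using
    # allgather (illustrative; the payload below is a placeholder). The call blocks
    # until every rank has contributed, matching the bulk synchronous parallel model.

    comm = MPI.COMM_WORLD
    local_sf_groups = [("u_vector_placeholder", "v_vector_placeholder")]
    all_sf_groups = comm.allgather(local_sf_groups)
    # all_sf_groups[r] now holds the sufficient factor groups computed by rank r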


CPU-to-CPU, staleness synchronous parallel: Under staleness synchronous parallel, each machine is allowed to compute and broadcast sufficient factors at a different pace. To enable this, the all-to-all broadcast is decomposed into multiple one-to-all broadcasts. Each machine separately invokes the MPI_Bcast routine to broadcast its messages to the others. MPI_Bcast is a blocking operation: the next message cannot be sent until the current one finishes. This guarantees that the sufficient factors are received in order: sufficient factors generated at clock t arrive earlier than those generated at clock t+1. This ordering is important for the correctness of machine learning applications: the updates generated earlier should be applied first.


CPU-to-CPU, bulk synchronous parallel, InfiniBand: An all-gather operation is executed by leveraging the Remote Direct Memory Access (RDMA) feature provided by InfiniBand, which supports zero-copy networking by enabling the network adapter to transfer data directly to or from application memory, without going through the operating system. The recursive doubling (RD) algorithm is used to implement all-gather, where pairs of processes exchange their sufficient factors via point-to-point communication. In each iteration, the sufficient factors collected during all previous iterations are included in the exchange. RDMA is used for the point-to-point transfers during the execution of recursive doubling.


CPU-to-CPU, staleness synchronous parallel, InfiniBand: Each machine performs one-to-all broadcast separately, using the hardware supported broadcast (HSB) in InfiniBand. HSB is topology-aware: packets are duplicated by the switches only when necessary, so network traffic is reduced by avoiding cases in which multiple identical packets travel through the same physical link. The limitation of hardware supported broadcast is that messages can be dropped or arrive out of order, which compromises the correctness of the machine learning execution. To retain reliability and in-order delivery, another protocol layer is added on top of hardware supported broadcast, in which (1) receivers send ACKs back to the root machine to confirm message delivery; (2) a message is re-transmitted using point-to-point reliable communication if no ACK is received before a timeout; and (3) receivers use a continuous clock counter to detect out-of-order messages and put them in order.


GPU-to-GPU: To reduce the latency of inter-machine sufficient factor transfer between two GPUs, the GPUDirect RDMA provided by CUDA is used, which allows network adapters to directly read from or write to GPU device memory, without staging through host memory. Between two network adapters, the sufficient factors are communicated using the methods listed above.


Similar to broadcast, several multicast methods tailored to different system configurations are provided.


CPU-to-CPU: MPI group communication primitives are used for CPU-to-CPU multicast. In each clock, MPI_Comm_split is invoked to split the communicator MPI_COMM_WORLD into a target group (containing the selected machines) and a non-target group. Then the message is broadcast to the target group, as sketched below.
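
The sketch below illustrates this pattern with mpi4py; the choice of root, the random half-sized receiver set, and the message payload are illustrative assumptions, and the example should be run with at least two ranks.

    import random
    from mpi4py import MPI

    # Sketch of random multicast via communicator splitting (illustrative). The root
    # picks a random subset of receivers, every rank splits MPI.COMM_WORLD into a
    # target group and a non-target group, and the message is broadcast only within
    # the target group.

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()
    root = 0

    if rank == root:
        receivers = random.sample([r for r in range(size) if r != root],
                                  k=max(1, (size - 1) // 2))
        targets = sorted(receivers + [root])
    else:
        targets = None
    targets = comm.bcast(targets, root=root)          # all ranks learn the target set

    color = 0 if rank in targets else 1               # 0 = target group, 1 = non-target group
    group_comm = comm.Split(color, key=rank)
    if color == 0:
        message = ["sufficient", "factors"] if rank == root else None
        message = group_comm.bcast(message, root=0)   # root keeps the smallest rank, hence 0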


CPU-to-CPU, InfiniBand: The efficient but unreliable multicast supported by InfiniBand at the hardware level is combined with a reliable point-to-point network protocol. InfiniBand groups the selected machines into a single multicast address and sends the message to it. A point-to-point re-transmission is issued if no ACK is received before a timeout. Since the selection of receivers is random, a machine does not receive messages from another machine in consecutive clocks, making it difficult to detect out-of-order messages. A simple approach is therefore adopted: a message is discarded if it arrives late.


GPU-to-GPU: GPUDirect remote direct memory access is used to copy buffers from GPU memory to the network adapter. Then the communication between network adapters is handled using the two methods given above.


Medical Topic Discovery Based on Large-Scale Distributed Learning with Sufficient Factors


As generally described above with reference to FIG. 1, medical topic discovery on a large corpus of clinical documents 112 may be efficiently performed by dividing the initial task of document processing among the processors 102, 104, 106 in the system and sharing the results of the document processing, e.g., the sufficient factor groups, across the system. The sharing of results enables the creation of a topic matrix through an iterative approach, where an initial topic matrix derived from an initial set of documents is updated as additional sets of documents are processed by the system 100.


Creation of the topic matrix by the medical topic discovery system 100 involves initial document processing steps that provide a measure of association between words and medical topics in documents. In one approach to document processing, referred to as topic models, clinical documents are represented by un-ordered sets of words. A “topic” is thus a set of words that tend to co-occur, and it represents word co-occurrence patterns that are shared across multiple documents. A “word” in a document may correspond to a medical condition, a symptom, patient demographics, etc. See, e.g., Devendra S. Sachan, Pengtao Xie, and Eric P. Xing, Effective use of bidirectional language modeling for medical named entity recognition, arXiv preprint arXiv:1711.07908, 2017.


The system 100 of FIG. 1 may be implemented by a large-scale peer-to-peer distributed machine learning system including a plurality of processors configured in accordance with the models and features described above with reference to FIGS. 2-8. In the system 100 of FIG. 1, the topics addressed in medical documents may be learned as follows. Each medical document is represented with a bag-of-words vector d ∈ RV, where V is the vocabulary size. Each document is assumed to be an approximate linear combination of K topics: d≈Wθ. W ∈ RV×K is the topic matrix, where each topic is represented with a sufficient factor in the form of a vector w ∈ RV. wv≥0 denotes the association strength between the v-th word and this topic, and Σv=1V wv=1. θ ∈ RK are the linear combination weights satisfying θk≥0 and Σk=1K θk=1, where θk denotes how relevant topic k is to the document.
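
A toy numeric illustration of this representation is given below; the four-word vocabulary, the two topics, and all weights are invented for illustration only.

    import numpy as np

    # Toy illustration of d ≈ W θ (all numbers invented). Vocabulary of V = 4 clinical
    # words and K = 2 topics; each column of W is a topic, i.e., nonnegative word
    # weights that sum to one, and θ gives the document's topic proportions.

    vocab = ["cough", "fever", "insulin", "glucose"]
    W = np.array([[0.5, 0.0],        # "cough"   weighted toward topic 0
                  [0.5, 0.0],        # "fever"   weighted toward topic 0
                  [0.0, 0.4],        # "insulin" weighted toward topic 1
                  [0.0, 0.6]])       # "glucose" weighted toward topic 1
    theta = np.array([0.25, 0.75])   # this document is mostly about topic 1

    d_approx = W @ theta                                      # approximate bag-of-words vector
    assert np.all(W >= 0) and np.allclose(W.sum(axis=0), 1)   # each topic lies on the simplex
    assert theta.min() >= 0 and np.isclose(theta.sum(), 1)    # so do the topic proportions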


Given the unlabeled documents {di}i=1N, the topics in these documents are learned by solving the following problem:














min{θi}i=1N, W (1/2)Σi=1N∥di−Wθi∥22
s.t. Wkv≥0, Σv=1V Wkv=1 for k=1, . . . , K and v=1, . . . , V;
θik≥0, Σk=1K θik=1 for i=1, . . . , N and k=1, . . . , K.   (Eq. 3)







where {θi}i=1N denotes all the linear coefficients.


This problem can be solved by alternating between {θi}i=1N and W. The N sub-problems defined on {θi}i=1N can be solved independently by each machine based on the data shard (e.g., clinical documents) that it holds. The sub-problem defined on W is solved using the large-scale distributed machine learning system disclosed herein. Each machine in the system maintains a local copy, or replica, of the parameter state W, i.e., the topic matrix, and the various copies among the machines are synchronized to ensure convergence. The projected stochastic gradient descent algorithm is applied, which iteratively performs two steps: (1) stochastic gradient descent, and (2) projection onto the probability simplex.


In the first step, the stochastic gradient matrix computed over one document can be written as the outer product of two vectors: (Wθi−diiT. In other words, this problem has a sufficient factor property and fits into the sufficient factor broadcasting framework described above. In each iteration, on the transmitter side, each machine computes a sufficient factor group and sends the sufficient factor groups to other machines in the system. On the receiver side, each machine converts the sufficient factor groups it receives into gradient matrices which are applied to the receiver's local copy of the state of the topic matrix to update the topic matrix. A projection operation follows the gradient descent update.
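
The following Python/NumPy sketch puts these pieces together for a single sufficient factor group; the simplex-projection routine is a standard sort-based algorithm chosen here for illustration (the disclosure does not mandate a particular projection method), and the learning rate and dimensions are made-up values.

    import numpy as np

    def project_to_simplex(w):
        """Euclidean projection onto the probability simplex (standard sort-based
        algorithm; one reasonable choice, not necessarily the disclosed one)."""
        u = np.sort(w)[::-1]
        css = np.cumsum(u)
        rho = np.nonzero(u + (1.0 - css) / np.arange(1, len(w) + 1) > 0)[0][-1]
        tau = (1.0 - css[rho]) / (rho + 1)
        return np.maximum(w + tau, 0.0)

    def sufficient_factor_group(W, d_i, theta_i):
        """The stochastic gradient over one document is outer(W @ theta_i - d_i, theta_i),
        so these two vectors form the sufficient factor group."""
        return W @ theta_i - d_i, theta_i

    def apply_sf_group(W, u, v, lr=0.05):
        """Gradient step from a (local or remote) sufficient factor group, followed by
        projection of every topic column onto the probability simplex."""
        W = W - lr * np.outer(u, v)
        return np.apply_along_axis(project_to_simplex, 0, W)

    # Illustrative usage with random data (V words, K topics).
    V, K = 6, 3
    rng = np.random.default_rng(0)
    W = np.full((V, K), 1.0 / V)                  # local replica of the topic matrix
    d_i = rng.random(V); d_i /= d_i.sum()         # bag-of-words vector of one document
    theta_i = np.full(K, 1.0 / K)                 # its current topic proportions
    u, v = sufficient_factor_group(W, d_i, theta_i)   # local group, sent to other machines
    W = apply_sf_group(W, u, v)                        # the same routine handles remote groups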



FIG. 8 is a flowchart of a method of creating a topic matrix that represents a prevalence of each of a plurality of medical topics among a plurality of clinical documents. The method may be performed, for example, by a system of processors 102, 104, 106 shown in FIG. 1 and configured in accordance with the machine learning and sufficient factors models and features described above, including those shown in FIGS. 2-7.


At block 802, a first processor 102 included in a machine learning system 100 comprising a plurality of machine learning processors 102, 104, 106 determines at least one local sufficient factor group for one or more documents included in the plurality of clinical documents. The local sufficient factor group comprises two sufficient factors, each corresponding to a vector representing a measure of association between words in the document and a medical topic.


At block 804, the first processor 102 sends the at least one local sufficient factor group to one or more second processors 104, 106 in the system. In one embodiment, a plurality of local sufficient factor groups are determined for a corresponding plurality of documents. In this case, the first processor 102 selects and sends a subset of the plurality of local sufficient factor groups to the one or more other processors. In another embodiment, the first processor 102 randomly selects, from among a plurality of second processors in the system, the one or more second processors 104, 106 to which to send the local sufficient factor group.


At block 806, the first processor 102 receives at least one remote sufficient factor group from a second processor 104, 106 in the system. The at least one remote sufficient factor group is determined by the second processor for another document included in the plurality of clinical documents. The remote sufficient factor group comprises two sufficient factors, each corresponding to a vector representing a measure of association between words in the other document and a medical topic.


At block 808, the first processor 102 processes the local sufficient factor group together with the remote sufficient factor group to obtain the topic matrix. In one embodiment, the first processor 102 processes the sufficient factor groups by converting each of the local sufficient factor group and the remote sufficient factor group into a corresponding update matrix, and applying each update matrix to the topic matrix using a projection operation. Each of the local sufficient factor group and the remote sufficient factor group is converted into a corresponding update matrix by obtaining an outer product of the sufficient factors that respectively define the local sufficient factor group and the remote sufficient factor group.


The method of FIG. 8 may be performed by each processor 102, 104, 106 in the system. In this case, each processor shares its sufficient factor group or groups with other processors in the system and each processor is able to process its own sufficient factors with those it receives, to thereby obtain and maintain a common, shared topic matrix.


As noted above, the topic matrix represents a prevalence of each of a plurality of medical topics among a plurality of clinical documents. In an example practical application of the topic matrix, a medical topic search inquiry may be input to one of the processors in the system 100 through a user interface, and the topic matrix may be accessed to obtain a listing of the documents most relevant to the medical topic.
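
One simple way such an inquiry could be served is sketched below; the ranking strategy (score topics by the query words' weights in the topic matrix, then rank documents by their proportions of the best-matching topic), the function name, and the inputs are illustrative assumptions rather than a feature recited by the disclosure.

    import numpy as np

    # Illustrative sketch of serving a medical topic search inquiry: score each learned
    # topic against the query words using the topic matrix W (V words x K topics), pick
    # the best-matching topic, then rank documents by their proportion of that topic.
    # The function name and inputs (vocab, W, thetas) are placeholders.

    def rank_documents(query_words, vocab, W, thetas, top_n=5):
        word_idx = [vocab.index(w) for w in query_words if w in vocab]
        topic_scores = W[word_idx, :].sum(axis=0)     # how strongly each topic covers the query
        k = int(np.argmax(topic_scores))              # the most relevant topic
        doc_scores = thetas[:, k]                     # relevance of topic k to each document
        return np.argsort(doc_scores)[::-1][:top_n]   # indices of the most relevant documents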



FIG. 9 is a schematic block diagram of an apparatus 900. The apparatus 900 may correspond to one or more processors or machines of the medical topic discovery system of FIG. 1 configured to enable the method of FIG. 8. The apparatus 900 may be embodied in any number of processor-driven devices, including, but not limited to, a server computer, a personal computer, one or more networked computing devices, an application-specific circuit, a minicomputer, a microcontroller, and/or any other processor-based device and/or combination of devices.


The apparatus 900 may include one or more processing units 902 configured to access and execute computer-executable instructions stored in at least one memory 904. The processing unit 902 may be implemented as appropriate in hardware, software, firmware, or combinations thereof. Software or firmware implementations of the processing unit 902 may include computer-executable or machine-executable instructions written in any suitable programming language to perform the various functions described herein. The processing unit 902 may include, without limitation, a central processing unit (CPU), a digital signal processor (DSP), a reduced instruction set computer (RISC) processor, a complex instruction set computer (CISC) processor, a microprocessor, a microcontroller, a field programmable gate array (FPGA), a System-on-a-Chip (SOC), or any combination thereof. The apparatus 900 may also include a chipset (not shown) for controlling communications between the processing unit 902 and one or more of the other components of the apparatus 900. The processing unit 902 may also include one or more application-specific integrated circuits (ASICs) or application-specific standard products (ASSPs) for handling specific data processing functions or tasks.


The memory 904 may include, but is not limited to, random access memory (RAM), flash RAM, magnetic media storage, optical media storage, and so forth. The memory 904 may include volatile memory configured to store information when supplied with power and/or non-volatile memory configured to store information even when not supplied with power. The memory 904 may store various program modules, application programs, and so forth that may include computer-executable instructions that upon execution by the processing unit 902 may cause various operations to be performed. The memory 904 may further store a variety of data manipulated and/or generated during execution of computer-executable instructions by the processing unit 902.


The apparatus 900 may further include one or more interfaces 906 that may facilitate communication between the apparatus and one or more other apparatuses in the system 100 or an apparatus outside the system. For example, the interface 906 may be configured to transmit/receive sufficient factors to/from other processors or machines in the medical topic discovery system 100. The interface 906 may also be configured to receive one or more of clinical documents or vector representations of such documents from a corpus of clinical documents 112 stored in a database.


Communication may be implemented using any suitable communications standard. For example, a LAN interface may implement protocols and/or algorithms that comply with various communication standards of the Institute of Electrical and Electronics Engineers (IEEE), such as IEEE 802.11, while a cellular network interface may implement protocols and/or algorithms that comply with various communication standards of the Third Generation Partnership Project (3GPP) and 3GPP2, such as 3G and 4G (Long Term Evolution), and of the Next Generation Mobile Networks (NGMN) Alliance, such as 5G.


The memory 904 may further include an operating system module (O/S) 908 that may be configured to manage hardware resources such as the interface 906 and provide various services to applications executing on the apparatus 900.


The memory 904 stores additional program modules such as: (1) an interpreter module 910 that automatically identifies the symbolic expressions of sufficient factors and parameter updates; (2) a sufficient factor generator module 912 that selects training examples from a local data shard and computes a sufficient factor group for each example using the symbolic expressions of sufficient factors produced by the interpreter module 910; (3) a sufficient factor selector module 914 that chooses a small subset of the most representative sufficient factors out of those computed by the SF generator module 912 for communication; (4) a communication manager module 916 that transfers the sufficient factors chosen by the SF selector module 914 using broadcast or random multicast and receives remote sufficient factors; (5) an update generator module 918 which computes update matrices from locally-generated and remotely-received sufficient factors and updates the topic matrix; and (6) a central coordinator module 920 for periodic centralized synchronization, parameter-replica rotation, and elasticity. Each of these modules includes computer-executable instructions that when executed by the processing unit 902 cause various operations to be performed, such as the operations described above.


The apparatus 900 and modules disclosed herein may be implemented in hardware or software that is executed on a hardware platform. The hardware or hardware platform may be a general purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic component, discrete gate or transistor logic, discrete hardware components, or any combination thereof, or any other suitable component designed to perform the functions described herein. A general-purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing components, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP, or any other such configuration.


Software shall be construed broadly to mean instructions, instruction sets, code, code segments, program code, programs, subprograms, software modules, applications, software applications, software packages, routines, subroutines, objects, executables, threads of execution, procedures, functions, etc., whether referred to as software, firmware, middleware, microcode, hardware description language, or otherwise. The software may reside on a computer-readable medium. A computer-readable medium may include, by way of example, a magnetic storage device (e.g., hard disk, floppy disk, magnetic strip), an optical disk (e.g., compact disk (CD), digital versatile disk (DVD)), a smart card, a flash memory device (e.g., card, stick, key drive), random access memory (RAM), read only memory (ROM), programmable ROM (PROM), erasable PROM (EPROM), electrically erasable PROM (EEPROM), a general register, or any other suitable non-transitory medium for storing software.


While various embodiments have been described above, they have been presented by way of example only, and not by way of limitation. Likewise, the various diagrams may depict an example architectural or other configuration for the disclosure, which is done to aid in understanding the features and functionality that can be included in the disclosure. The disclosure is not restricted to the illustrated example architectures or configurations, but can be implemented using a variety of alternative architectures and configurations.


In this document, the terms “module” and “engine” as used herein refer to software, firmware, hardware, and any combination of these elements for performing the associated functions described herein. Additionally, for purposes of discussion, the various modules are described as discrete modules; however, as would be apparent to one of ordinary skill in the art, two or more modules may be combined to form a single module that performs the associated functions according to embodiments of the invention.


In this document, the terms “computer program product”, “computer-readable medium”, and the like may be used generally to refer to media such as memory storage devices or storage units. These, and other forms of computer-readable media, may be involved in storing one or more instructions for use by a processor to cause the processor to perform specified operations. Such instructions, generally referred to as “computer program code” (which may be grouped in the form of computer programs or other groupings), when executed, enable the computing system to perform the specified operations.


Terms and phrases used in this document, and variations thereof, unless otherwise expressly stated, should be construed as open ended as opposed to limiting. As examples of the foregoing: the term “including” should be read as meaning “including, without limitation” or the like; the term “example” is used to provide exemplary instances of the item in discussion, not an exhaustive or limiting list thereof; and adjectives such as “conventional,” “traditional,” “normal,” “standard,” and “known,” and terms of similar meaning, should not be construed as limiting the item described to a given time period or to an item available as of a given time. Instead, these terms should be read to encompass conventional, traditional, normal, or standard technologies that may be available or known now or at any time in the future.


Additionally, memory or other storage, as well as communication components, may be employed in embodiments of the invention. It will be appreciated that, for clarity purposes, the above description has described embodiments of the invention with reference to different functional units and processors. However, it will be apparent that any suitable distribution of functionality between different functional units, processing logic elements or domains may be used without detracting from the invention. For example, functionality illustrated to be performed by separate processing logic elements or controllers may be performed by the same processing logic element or controller. Hence, references to specific functional units are only to be seen as references to suitable means for providing the described functionality, rather than indicative of a strict logical or physical structure or organization.


Furthermore, although individually listed, a plurality of means, elements or method steps may be implemented by, for example, a single unit or processing logic element. Additionally, although individual features may be included in different claims, these may possibly be advantageously combined. The inclusion in different claims does not imply that a combination of features is not feasible and/or advantageous. Also, the inclusion of a feature in one category of claims does not imply a limitation to this category, but rather the feature may be equally applicable to other claim categories, as appropriate.


The various aspects of this disclosure are provided to enable one of ordinary skill in the art to practice the present invention. Various modifications to exemplary embodiments presented throughout this disclosure will be readily apparent to those skilled in the art. Thus, the claims are not intended to be limited to the various aspects of this disclosure, but are to be accorded the full scope consistent with the language of the claims. All structural and functional equivalents to the various components of the exemplary embodiments described throughout this disclosure that are known or later come to be known to those of ordinary skill in the art are expressly incorporated herein by reference and are intended to be encompassed by the claims. Moreover, nothing disclosed herein is intended to be dedicated to the public regardless of whether such disclosure is explicitly recited in the claims. No claim element is to be construed under the provisions of 35 U.S.C. § 112, sixth paragraph, unless the element is expressly recited using the phrase “means for” or, in the case of a method claim, the element is recited using the phrase “step for.”

Claims
  • 1. A machine learning system for creating a topic matrix that represents a prevalence of each of a plurality of medical topics among a plurality of clinical documents, the system comprising a plurality of machine learning processors, each processor configured to: determine at least one local sufficient factor group for one or more documents included in the plurality of clinical documents; send the at least one local sufficient factor group to one or more other processors in the system; receive at least one remote sufficient factor group from another processor in the system, the at least one remote sufficient factor group being determined by the other processor for another document included in the plurality of clinical documents; and process the local sufficient factor group together with the remote sufficient factor group to obtain the topic matrix.
  • 2. The system of claim 1, wherein the local sufficient factor group comprises two sufficient factors, each corresponding to a vector representing a measure between words in the document and a medical topic.
  • 3. The system of claim 1, wherein the remote sufficient factor group comprises two sufficient factors, each corresponding to a vector representing a measure between words in the other document and a medical topic.
  • 4. The system of claim 1, wherein: the processor determines a plurality of local sufficient factor groups for a corresponding plurality of documents, and the processor is further configured to select and send a subset of the plurality of local sufficient factor groups to the one or more other processors in the system.
  • 5. The system of claim 1, wherein the processor sends the at least one local sufficient factor group by being further configured to randomly select, from among a plurality of other processors in the system, the one or more other processors to which to send the local sufficient factor group.
  • 6. The system of claim 5, wherein sufficient factor groups are randomly selected based on joint matrix column subset selection.
  • 7. The system of claim 1, wherein the processor processes the local sufficient factor group together with the remote sufficient factor group by being further configured to: convert each of the local sufficient factor group and the remote sufficient factor group into a corresponding update matrix; and apply each update matrix to the topic matrix using a projection operation.
  • 8. The system of claim 7, wherein the processor converts each of the local sufficient factor group and the remote sufficient factor group into a corresponding update matrix by being further configured to obtain an outer product of the sufficient factors that respectively define the local sufficient factor group and the remote sufficient factor group.
  • 9. A method of creating a topic matrix that represents a prevalence of each of a plurality of medical topics among a plurality of clinical documents, the method comprising: determining, at a first processor included in a machine learning system comprising a plurality of machine learning processors, at least one local sufficient factor group for one or more documents included in the plurality of clinical documents; sending, from the first processor, the at least one local sufficient factor group to one or more second processors in the system; receiving, at the first processor, at least one remote sufficient factor group from a second processor in the system, the at least one remote sufficient factor group being determined by the second processor for another document included in the plurality of clinical documents; and processing, at the first processor, the local sufficient factor group together with the remote sufficient factor group to obtain the topic matrix.
  • 10. The method of claim 9, wherein the local sufficient factor group comprises two sufficient factors, each corresponding to a vector representing a measure between words in the document and a medical topic.
  • 11. The method of claim 9, wherein the remote sufficient factor group comprises two sufficient factors, each corresponding to a vector representing a measure between words in the other document and a medical topic.
  • 12. The method of claim 9, wherein a plurality of local sufficient factor groups are determined for a corresponding plurality of documents, and further comprising: selecting, at the first processor, and sending, from the first processor, a subset of the plurality of local sufficient factor groups to the one or more second processors in the system.
  • 13. The method of claim 9, wherein sending the at least one local sufficient factor group comprises: randomly selecting, from among a plurality of second processors in the system, the second processor to which to send the local sufficient factor group.
  • 14. The method of claim 13, wherein sufficient factor groups are randomly selected based on joint matrix column subset selection.
  • 15. The method of claim 9, wherein processing the local sufficient factor group together with the remote sufficient factor group comprises: converting each of the local sufficient factor group and the remote sufficient factor group into a corresponding update matrix; and applying each update matrix to the topic matrix using a projection operation.
  • 16. The method of claim 15, wherein converting each of the local sufficient factor group and the remote sufficient factor group into a corresponding update matrix comprises obtaining an outer product of the sufficient factors that respectively define the local sufficient factor group and the remote sufficient factor group.
CROSS-REFERENCE TO RELATED APPLICATION(S)

This application claims the benefit of and priority to 1) U.S. Provisional Patent Application Ser. No. 62/699,385, filed Jul. 17, 2018, for “Diversity-Promoting and Large-Scale Machine Learning for Healthcare”, and 2) U.S. Provisional Patent Application Ser. No. 62/756,024, filed Nov. 5, 2018, for “Diversity-Promoting and Large-Scale Machine Learning for Healthcare”, the entire disclosures of which are incorporated herein by reference. This application has subject matter in common with: 1) U.S. patent application Ser. No. 16/038,895, filed Jul. 18, 2018, for “A Machine Learning System for Measuring Patient Similarity”, 2) U.S. patent application Ser. No. 15/946,482, filed Apr. 5, 2018, for “A Machine Learning System for Disease, Patient, and Drug Co-Embedding, and Multi-Drug Recommendation”, 3) U.S. Patent Application Ser. No. _____, filed _____, for “Systems and Methods for Predicting Medications to Prescribe to a Patient Based on Machine Learning”, 4) U.S. Patent Application Ser. No. _____, filed _____, for “Systems and Methods for Automatically Tagging Concepts to, and Generating Text Reports for, Medical Images Based on Machine Learning”, and 5) U.S. Patent Application Ser. No. _____, filed _____, for “Systems and Methods for Automatically Generating International Classification of Disease Codes for a Patient Based on Machine Learning”, the entire disclosures of which are incorporated herein by reference.

Provisional Applications (2)
Number Date Country
62699385 Jul 2018 US
62756024 Nov 2018 US