RELIABLE SUPERVISED MACHINE LEARNING USING INTERVAL ARITHMETIC

TECHNICAL FIELD

The present invention is generally directed to systems and/or methods for reliable supervised machine learning, more particularly, to an improved supervised machine learning system characterized by a novel application of interval arithmetic for model parameter determination, more particularly, to model parameter determination of an objective function via a branch-and-bound interval arithmetic methodology for iterative bisecting a search region for model parameter space characterized by reduced system memory demands in furtherance of finding a global minimum of the objective function.

BACKGROUND

Machine learning is a field of study wherein systems are devised to learn without being explicitly programmed. A well-known formalism on the topic by Professor Tom Mitchel is often cited: “A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.” (Mitchel, Machine Learning, McGraw-Hill International Editions Computer Science Series, 1^st Ed.).

Supervised machine learning is the process of training a machine to make predictions based upon patterns found in data, with “learning” predicated upon the machine, test data (i.e., labeled test data), and the training (i.e., the “exercise” of prediction refinement - machine tuning). Wikipedia notes that "[s]upervised learning is the machine learning task of learning a function that maps an input to an output based on example input-output pairs. It infers a function from labeled training data consisting of a set of training examples. In supervised learning, each example is a pair consisting of an input object (typically a vector) and a desired output value (also called the supervisory signal). A supervised learning algorithm analyzes the training data and produces an inferred function, which can be used for mapping new examples."

Mathematically, the machine of the supervised machine consists of a model. Artificial neural networks, among other things, have proven to be powerful models for machine learning. Three common types of artificial neural networks are Convolutional Neural Network (CNN), Long Short Term Memory (LSTM), and MultiLayer Perceptron (MLP). They are popular because they have proven to be effective, namely CNN for image recognition, LSTM for predicting time-series data, and MLP for a wide range of general classification and regression tasks. MLP is also regarded as a “universal function approximator” since it can learn to approximate any function to within a certain error, thus making MLP an industry workhorse.

Supervised machine learning requires a training process where the machine is first provided with training data. The machine is modeled as a parameterized function f_θ: Rⁿ → R^m, and the purpose of training the machine, given training data x_i ∈ Rⁿ consisting of i ∈ 1, 2,..., p samples, is to determine a set θ of model parameters that minimizes an objective function E(θ), namely, the error between the desired model outputs, as represented by the labels y_i ∈ R^m, and the actual model outputs f_θ(x_i), i.e.,

$E (θ) = \frac{1}{m p} {\sum_{i = 1}^{p} ‖f_{θ} (x i) - y i‖}^{2}$

Labels on the training data are provided by human domain experts and provide a “ground truth” starting point or basis from which the machine can learn. For example, training data may consist of radiological images, with labels provided by a team of doctors wherein each image is classified as representing either a healthy or diseased condition or state. For MLP, the set θ of model parameters consists of all weights and biases in the various layers of the artificial neural network. Once training is complete and the best model parameters are found, the machine can be deployed into the field where it can analyze new data and make predictions without the assistance of any labels.

In practice, the objective function is usually nonlinear, having multiple critical points and local minima. The number of model parameters can be very large, often in the thousands or millions, and typical training data will also consist of thousands or millions of samples. From a numerical and computational perspective, all of these factors combine to make training the machine a very difficult and challenging task.

Current training methodologies commence training by initializing the model parameters with random numbers, and thereafter an iterative process begins. The first iteration calculates the gradient of the objective function and updates the model parameters in the opposite direction of the gradient. This has the effect of “tuning” the model parameters by a small amount in the direction that causes the objective function to decrease by the largest amount. Then another iteration occurs and the gradient of the objective function is calculated again, this time with updated model parameters from the first iteration, with the model parameters updated a second time. The process is repeated until the gradient eventually vanishes at a critical point, which will hopefully be the global minimum of the objective function.

The key qualifier is the word “hopefully,” because the gradient, by definition, only provides local information about the objective function, and the iterative process described begins from a random starting point. As a consequence, no guarantee is provided that the process will converge to the global minimum. For example, if the random starting point is near a local minimum, the process will likely converge to that critical point even if it is not the global minimum.

This method, known as gradient descent backpropagation, uses the gradient to “descend” the mathematical landscape of the objective function by adjusting model parameters in small increments (i.e., the “backpropagation”) in order to find a point of minimum value. This is similar to dropping a ball at a random location on top of a hill, then watching it roll down the hill along the path of steepest descent. That path, as well as the final resting place of the ball, will usually depend on where the ball was dropped, since the starting point may cause the ball to get stuck in a depression (i.e., a local minimum) and never reach the bottom of the hill (i.e., the global minimum).

Calculating the gradient can be expensive since it requires taking an average of all training data samples. An acceleration method called stochastic gradient descent backpropagation uses only a random subset of training data samples to calculate the gradient. Obviously, such a calculation will be an approximation of the true gradient. The assumption is that approximation errors average out over time, allowing the method to converge with fewer calculations than if all training data samples are used, particularly if the random subset is significantly smaller, as is oftentimes the case.

Stochastic gradient descent backpropagation is significantly faster than its non-stochastic counterpart. However, neither variation of the method provides any guarantee that the global minimum will ever be found. This fact is fundamental to the underlying mathematics. In almost every case, except for the most trivial examples, the randomly chosen starting point results in not being able to find the global minimum. Instead, the method gets stuck in one of many local minimums.

Gradient descent backpropagation and stochastic gradient descent backpropagation are based on floating-point arithmetic, a type of computer arithmetic that uses finite approximations of real numbers to perform mathematical computations. Gradient descent backpropagation training methods are made practical by the tremendous amount of floating-point performance provided by modern Graphics Processing Units (GPUs). Linear algebra operations such as matrix multiplication, dot product, and vector sum are at the heart of artificial neural networks, where evaluating a single sample of training data often requires calculations on matrices and vectors consisting of thousands, if not millions, of floating-point elements. GPUs are very effective at parallelizing these types of calculations, since the linear arrangement of data in memory aligns with the order that floating-point operations on individual matrix or vector elements are performed. The resulting data parallelism is accomplished by specialized circuitry of the GPU hardware wherein portions of the floating-point calculations of a linear algebra operation are performed in parallel.

Ben-Nun & Hoefler, Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis (2018), set forth a comprehensive survey of the state-of-the-art with regard to theoretical and mathematical formulations of artificial neural networks as a broad and general class of nonlinear global optimization problems in relation to their practical use in artificial intelligence, machine learning, and deep learning. The “gradient descent zoo” described by the authors is a testimony to the tremendous amount of research and development effort that has been invested into making gradient descent a faster and more reliable process. Current advances in supervised machine learning were made feasible through a combination of algorithms (i.e., models/neural networks) and hardware (e.g., GPUs).

In light of the foregoing, the fact nonetheless remains; there are fundamental inherent shortcomings to de facto gradient descent methods. First, from each random starting point, the gradient descent procedure may converge to a different unique local minimum (i.e., gradient descent is not repeatable). Second, each run of the gradient descent procedure is, by definition, a chain of sequential steps that cannot easily be conducted in parallel (i.e., gradient descent is non-parallel). Third, even after running a very large amount of guesses and comparing all of the answers to select the best local minimum, running a new guess may produce a “better” local minimum (i.e., gradient descent is uncertain).

Advances promised through supervised machine learning can only be realized if there is high confidence in the results produced. Reduction in human and computer time and costs will provide faster research turnarounds through a more efficient use of data science, informatics, and information technology. In furtherance thereof, and notionally, the application of interval arithmetic to nonlinear globalization is known, having been set forth by Ratcscheck & Voller (see e.g., What can interval analysis do for global optimization? Journal of Global Optimization 1, pp.111-130, June 1991).

Instead of initializing model parameters with random numbers, each model parameter is initialized as an interval having a lower and upper bound, the initialized elements of the model parameters forming an axis-aligned parallelotope (i.e., a “box”) that defines a search region for the entire model parameter space. While such initialization guarantees a convergence to a global minimum (i.e., “the” global minimum), a powerful advantage relative to gradient descent approaches to supervised machine learning, data storage and/or processing remains an Achilles heel - the weak link if you will.

Heretofore known interval based supervised machine learning teachings have yet to address the sheer magnitude and complexity of the resulting nonlinear global optimization problems. The functions to be minimized, for example, routinely consist of, on the low end, 10⁴ variables, with 10⁹ variables not unheard of. By comparison, known teachings contemplate the minimization of functions of only a few variables (see e.g., Ratz, Box Splitting Strategies for the Interval Gauss-Seidel Step in a Global Optimization Method. Computing, Vol. 53, pp. 337-353, Springer Verlag, 1994, or, Eriksson & Lindstrom, A Parallel Interval Method Implementation for Global Optimization Using Dynamic Load Balancing. Reliable Computing 1(1), pp. 77-91, 1995).

Moreover, the branch-and-bound procedure that characterizes an interval arithmetic approach to nonlinear global optimization requires maintaining a queue of unprocessed interval boxes, which is often sorted based on certain criteria, e.g., unprocessed interval boxes associated with the lowest bounds on the minimization function may be kept at the top of the queue (see e.g., Hansen & Walster, Global Optimization Using Interval Analysis. 2^nd ed., revised and expanded, CRC Press, 2003, or, Leclerc, Parallel Interval Global Optimization and Its Implementation in C++. Interval Computations, Volume 3, 1993). If the number of variables to be minimized is fairly small, practical implementations can store the unprocessed interval boxes in containers provided by the standard libraries of popular programming languages such as C++ or Java. As the number of variables increases by orders of magnitude, such straightforward approaches quickly lead to memory exhaustion, even on modern computers with gigabytes of memory owing to the memory demands growing at a geometric rate with processing times become dominated by the container subroutines that manage the queue, inducing poor cache performance and fatal system crashes.

Thus, in light of the forgoing and in the context of supervised machine learning, it remains desirable and advantageous to greatly reduce the time and cost of training a machine while achieving both reproducibility and accuracy. Moreover, and more particularly, work remains outstanding in connection to global optimization problem solving, more particularly, in connection to systems and/or methods for model parameter determination of an objective function via a branch-and-bound interval arithmetic methodology for iterative bisecting a search region of model parameter space characterized by reduced system memory demands and/or improved data processing parallelism in furtherance of finding a global minimum of the objective function in connection to machine learning.

SUMMARY OF THE INVENTION

An interval arithmetic based system and method for solving a global optimization problem, in furtherance of, for example, supervised machine learning, is contemplated and provide. More particularly, a system/method characterized by a bisection indexing scheme for a parameter domain of an objective function and an attendant work stealing scheme is advantageously provided and disclosed.

The computer system is characterized by a central processing unit and a modal interval processing unit operably linked thereto, the modal processing unit characterized by a plurality of modal interval processing units, and memory accessible by the central processing unit. Memory components of the memory advantageously include a bisection context, a bisection queue, session data and training data. The bisection queue comprises a collection of bisection records characterized by integer fields (e.g., Axis, Item), Axis being an index to a particular array in the bisection context and Item being a record associated with the particular array element. The bisection context includes a plurality of arrays (e.g., Domain, Bottom, Top, and Subdomain), Domain and Subdomain being interval arrays, Bottom and Top being integer arrays, the arrays having a number of elements equal to a number of variables of an objective function to be minimized.

A preferred, robust, non-limiting method, contemplates receiving and storing a representation of the objective function in memory of the computer system. Each model parameter of model parameters of the objective function are initialized as an interval having a lower and upper bound such that initialized elements of the model parameters form an axis aligned parallelotope that delimits a search region S for an entirety of model parameter space. Domain is initialized such that every variable of the objective function has a corresponding interval domain value in Domain, the bisection queue being empty. Search region S in the Domain is divided/subdivided into subdomains. Thereafter, in connection to search region S dividing, a parallelization process is executed, the process characterized by workers in a work stealing scheme.

Each worker performs a depth-first search on a unique subset of the search region S while a breadth-first search is performed by workers in the work stealing scheme. For each worker, integers in Bottom represent a bisection status for Domain, the bisection status using an indexing scheme to record an interval subset restriction in each domain value in Domain. Each integer item in Top represents a subset of its corresponding item in Bottom, intervals in Subdomain representing a subset of the search region S in Domain. A bisection state is maintained via interaction between the bisection context the bisection queue, operations of the interactions characterized by Reset, Push, Pop, Remove and Cleanup operations.

As a nonlinear optimization solver, the contemplated system is relevant to the field of artificial intelligence, machine learning, and deep learning. In particular, the contemplated system abandons current industry trends and paradigms, providing an entirely different approach to training machines. Existing systems and methods are based on a process known as gradient descent, which uses floating-point arithmetic. The well-understood and documented pain points of gradient descent can be reduced to its random “trial by error” nature as well as its inherently sequential computing process which makes it difficult to parallelize. The contemplated system abandons gradient descent in favor of a reliable process of “proof by elimination” based on interval arithmetic. Unlike gradient descent, the instant process is massively parallel and easy to scale. As a result, the contemplated system overcomes all of the known pain points of gradient descent, making it a significant improvement over the prior art. More specific features and advantages obtained in view of the summarized features will become apparent with reference to the drawing figures and DETAILED DESCRIPTION OF THE INVENTION.

BRIEF DESCRIPTION OF THE DRAWINGS

All figures have been prepared, and are included to facilitate and/or enhance an understanding of the basic teachings of the contemplated embodiments, and/or the concepts underlying same, and are incorporated in and constitute a part of this specification. While the drawings illustrate embodiments and context with respect thereto, and together with the description serve to explain principles of embodiments, other embodiments and many of the intended advantages of the disclosed systems, subsystems, devices, mechanisms, methods, operations, etc. will be readily appreciated as they become better understood by reference to the following detailed description and figures.

FIGS. 1-13 are provided herewith wherein:

FIG. 1 schematically depicts a preferred, non-limiting integrated system for training artificial intelligence applications;

FIG. 2 schematically depicts contemplated, non-limiting elements of/for the integrated system of FIG. 1, namely, an accelerator on one hand, FIG. 2(a), and a cluster comprised of central processing unit (CPU) servers and FIG. 2(a) accelerators on the other hand, FIG. 2(b);

FIG. 3 depicts a branch-and-bound process associated with the instant training for finding a global minimum of an objective function wherein each model parameter is initialized as an interval having lower and upper bounds, initialized elements forming an axis-aligned parallelotope that defines a search region, the search region interatively bisected;

FIG. 4 schematically depicts process virtual address space for the contemplated process methodology;

FIG. 5 graphically represents an advantageous, non-limiting bisection indexing scheme part-and-parcel of the instant training system/method;

FIG. 6 schematically depicts a relatedness of/for process virtual address space memory components (FIG. 4), the parameter domain, and the indexing scheme (FIG. 5) in connection with a Push operation, namely conditions/relationships before (FIG. 6(a)) and after (FIG. 6(b)) as to the depicted operation;

FIGS. 7-9 each schematically depicts a relatedness of/for process virtual address space memory components (FIG. 4), the parameter domain, and the indexing scheme (FIG. 5) in connection with a Pop operation, namely conditions/relationships before (FIGS. 7(a), 8(a), 9(a)), during (FIGS. 8(b), 9(b), 9(c)), and after (FIGS. 7(b), 8(b), 9(c) and 9(d)), as the case may be, as to the depicted operation;

FIG. 10 schematically depicts a relatedness of/for process virtual address space memory components (FIG. 4), the parameter domain, and the indexing scheme (FIG. 5) in connection with a Remove operation, namely conditions/relationships before (FIG. 10(a)) and after (FIG. 10(b)) as to the depicted operation;

FIG. 11 schematically depicts a relatedness of/for process virtual address space memory components (FIG. 4), the parameter domain, and the indexing scheme (FIG. 5) in connection with a Cleanup operation, namely conditions/relationships before (FIG. 11(a)) and after (FIG. 11(b)) as to the depicted operation;

FIG. 12 depicts a static structure diagram of an attendant, non-limiting worker and manager processes for contemplated system operations; and,

FIG. 13 depicts a sequence diagram of events of the worker and manager processes of FIG. 12.

DETAILED DESCRIPTION OF THE INVENTION

In advance of particulars for contemplated and advantageous systems and methods characterized by a novel application of interval arithmetic to tackle nonlinear global optimization problems, for example, and without limitation, artificial intelligence, machine learning, and deep learning to name but a few, several preliminary observations are warranted.

First, Applicant has developed an innovative supervised machine learning system that replaces conventional and heretofore known systems and methods, based upon gradient descent backpropagation, with a novel approach based upon modal interval arithmetic. From its IEEE 1788 Working Group Paper captioned Introduction to Modal Intervals (August 2009), to its innovations in relation to improved utility in and for computational systems and methods as evidenced by USPs 7,554,540, 9,529,778, 9,588,736, and 9,934,198, each of which is incorporated herein by reference in their entireties, Applicant has leveraged its earlier work in addressing the shortcomings of known nonlinear global optimization approaches. Via the ability of the contemplated system/method(s) to employ multiple processors, multiple computers, or a combination of both, significant operational speed-ups are attainable.

Second, the contemplated and hereinafter described system and/or method perform the same linear algebra operations as current supervised learning methods, the difference being that elements of the matrices and vectors are modal intervals, not floating point values. While GPUs do not support modal interval arithmetic, current Applicant implementations of the system/methods emulate modal interval arithmetic operations in software on a central processing unit (CPU). In keeping with, among others, the teaching of the `736 & `198 patents, work continues on a modal interval arithmetic hardware accelerator, namely, a modal processing unit (MPU) . In addition to computational speeds in keeping with floating point calculations, the MPU will provide data parallelism in the linear algebra operations for improved performance.

Third, heretofore unseen task parallelism is achieved for an interval arithmetic approach to model training by reimagining system memory utilization in combination with a novel bisection indexing scheme whose codes populates separate memory elements of the system and is thus overarching and correlative relative to same, greatly reducing memory requirements. Moreover, a hybrid parallelization is particularly well suited to support the domain bisections and operations attendant to the contemplated machine learning.

Finally, the following description commences with a presentation of a preferred non-limiting system/system particulars with reference to FIGS. 1 & 2. Thereafter, a preferred non-limiting machine training methodology is taken up with reference to FIGS. 3-11, with particulars for worker and manager processes for contemplated system operations set forth in connection to FIGS. 12 & 13.

With general reference to FIGS. 1 & 2, schematic depictions of advantageous, non-limiting integrated systems/system elements for training artificial intelligence applications are set forth. System 20, FIG. 1, is notionally comprised of stacked hardware and software components or elements, namely, modal interval hardware (e.g., modal processing units (MPUs) 22 as indicated), a MPU driver 24, as circumstances may warrant, and software 26, with users advantageously, but not necessarily, able to develop supervised machine learning with such system. MPUs are hardware accelerators for performing interval arithmetic (see Applicant’s earlier cited `736 patent). The MPU hardware accelerator provides GPU-like performance for modal interval arithmetic. MPUs can be installed side-by-side with GPUs for backward compatibility.

Notionally, modal interval hardware, e.g., chips advantageously manufactured by original equipment manufacturers (OEMs), will be used and leveraged by cloud service providers (CSPs), the CSPs installing/integrating the subject machine training system into their cloud infrastructure to provide interval machine learning to their end users. Alternately, a fully integrated hardware and software technology stack may be provided, more particularly, a technology stack that operates in the cloud infrastructure of the CSPs. End users access the technology stack via the cloud to train their artificial intelligence applications.

With reference to FIG. 2, a training accelerator 36, FIG. 2(a), embodied, for example and without limitation, in a standard 1U rack mountable form factor is contemplated, the accelerator characterized by modal interval hardware, namely, one or more MPUs 22. A training cluster 30, FIG. 2(b), is likewise contemplated, without limitation, the cluster characterized by central processing unit servers 32 and the plural training accelerators 36.

Training accelerator components generally include one or more network interfaces, memory, and one or more MPUs. In a preferred embodiment, the MPUs are hardware circuits capable of performing large amounts modal interval arithmetic operations very quickly, preferably in a single instruction, multiple data (SIMD) format and performance factor. The MPUs are connected to memory. In the preferred embodiment, all of the MPUs share a common virtual address space which is comprised of physical high bandwidth memory (HBM). Also connected to the memory is one or more network interfaces that provide high speed, low latency access to the memory via an external network.

Also contemplated are variations within the system where essential components may take on different form factors or be replaced by different technology that serves the same functional purpose. For example, and without limitation, the rack mountable form factor of the training accelerator may be suitably replaced or substituted for/by/with a discrete card that connects directly to a motherboard of a CPU server via standard or proprietary bus protocols such as PCI Express, CCIX, or OpenCAPI. Moreover, and without limitation, a system that replaces the aforementioned discrete training accelerator card may be suitably replaced with an MPU directly integrated on the CPU itself, so that the combined function of the integrated CPU/MPU completely replaces the need for a discreet training accelerator at all. In every such system variation, the depicted FIG. 2 components and functions remain essentially the same; only the various form factors and/or technology used to manufacture, connect, or assemble the components are different.

It should be noted that a particular variation of a preferred embodiment includes, for example and without limitation, a system wherein the functionality of the MPUs are emulated by the CPUs. A disadvantage of this variant is that without MPUs to perform large amounts of modal interval arithmetic operations very quickly, such exemplary system can be very slow. Due to the astronomically huge amounts of computation required for most practical artificial intelligence, machine learning, and deep learning applications, it is believed that systems characterized by essential MPU hardware, as elsewhere disclosed by Applicant, will be required to effectuate meaningful reliable machine training/learning.

In advance of particulars of the contemplated training methodology, it is to be noted that a physical abstraction of the system, via Message Passing Interface (MPI) protocol, is believed desirable and advantageous. MPI is a de-facto communication protocol standard in the high performance computing industry that abstracts the physical cluster into a conceptual paradigm for parallel computing. Moreover, in addition to the abstraction, MPI provides the mechanism necessary to realize the abstraction by managing the underlying physical cluster components.

In the MPI paradigm, individual CPU computing resources are abstracted into the concept of a process, and each process has a unique identification number. Processes may send and receive messages to each other, and messages are addressed by process identification number.

By default, an MPI application associates a process with every CPU computing resource in the cluster. For example, in a scenario wherein the cluster comprises ten computers, each with sixteen processing cores and each core supporting two hardware threads, the cluster is fairly characterized by three-hundred-twenty hardware processors, the application thusly consists of three-hundred-twenty processes. Any process can send or receive messages in relation to any other process, regardless of which hardware processor the process may actually be running on within the system. Thus, MPI essentially abstracts away all of the underlying physical details of the system in order to provide a simpler conceptual view of communication between processes. Because MPI is such a robust and well-supported industry protocol, the preferred embodiment assumes MPI is used to abstract the system as described above. However, other protocols and physical mechanisms that can provide a functional equivalent are also contemplated, the system/system operations suitably adapted.

It should be noted that MPI only abstracts the communication between CPU computing resources. The communication between CPUs and MPUs is realized by a separate mechanism. For example, with reference to the system of FIG. 2, more particularly FIG. 2(a) wherein the exemplary training accelerator is in a rack-mountable form factor and connected to the CPU servers via the network, the preferred mechanism may be, but is not limited to, remote direct memory access (RDMA) over Converged Ethernet (i.e., RoCE). If the training accelerator is a discrete card that connects directly to the motherboard of a CPU server via standard or proprietary bus protocols such as PCI Express, CCIX, or OpenCAPI, the mechanism for CPU/MPU communication will be defined by the electrical and mechanical engineering specifications of the protocol. Similarly, if the MPU is directly integrated on the CPU itself, the mechanism will most likely be in the form of selector signals issued to the MPU from the CPU instruction stream.

Having provided a system or system/component overview, attention is next directed to the task of finding the global minimum of an objective function using interval arithmetic, namely, the process of ascertaining model parameters via training through the supervised learning process. The aim, advantageously using modal interval arithmetic and a novel conceptualization of process virtual address space in combination with an attendant bisection indexing scheme, is to deterministically find a set of model parameters that minimizes E(θ). An overview of correct and deterministic training immediately follows, with particulars of/for advantageous, desirable and non-limiting process virtual address space allocations/use(s), a bisection indexing scheme, and interplay between, for and among each of contemplated memory components of the process virtual address space, as well as the relationship (s) for between and/or among the contemplated memory components and the bisection indexing scheme with regard to training operations/training data processing are set forth hereinafter.

With reference now to FIG. 3, there is represented an objective function F, model training commencing with initialization of each model parameter as an interval having a lower and upper bound so that the initialized elements of the model parameters form an axis-aligned paralletope (i.e., a “box”) that defines a search region S, the horizontal axis, for the entire model parameter space. If X ⊆ S is such a box, then for every X, modal interval arithmetic is used to compute an interval enclosure E(X) of the objective function.

If the minimum bound of the interval enclosure is greater than the least upper bound of the interval enclosure of any other box, as represented by the line L, this is proof that no global minimum exists in that box (e.g., areas 41), so it can be discarded/deleted. Otherwise, X is bisected into two boxes that are added to a queue and processed in subsequent iterations. This iterative process continues until a user-specified tolerance is reached or the queue is empty.

If the number of model parameters is large, which is usually the case for supervised machine learning, a naive implementation of the method described above will perform very poorly, as the computational complexity can become exponential in the number of model parameters. Developing practical implementations involves reducing the computational complexity. For example, one method widely known in the interval literature is the monotonicity test.

If the gradient of the objective function is computed with modal interval arithmetic for a box, then this is very significant, because the interval enclosure of the gradient represents information about the global topology of the objective function over the box. If any interval component of the gradient does not contain zero, this is proof that no critical points of the objective function are contained in the box (e.g., areas 43), so it can be safely discarded. Performing the monotonicity test at the beginning of each iteration has the effect of eliminating large portions of the search space that cannot possibly contain any solution, dramatically reducing computational complexity and accelerating convergence to the global minimum.

Such assessment of the model parameter space always yields correct global minimums because of the reliable and deterministic training method utilized is a “computational proof by elimination.” In other words, the existence of the global minimum is proven to exist in a box that has been rigorously verified by the modal interval arithmetic to be a more optimal solution than any other. In addition to obtaining a deterministic global minimum ab initio, the need for hyperparameters is practically eliminated since the approach only requires a user-specified tolerance that indicates when the iterative bisection process has reached the desired limit.

Having set forth an overview of Applicant’s interval arithmetic approach to non-linear global optimization, and made reference to a queue of yet-to-be processed box portions resulting from of the interval bisection methodology, it is to be again reiterated that the queue grows in exponential fashion. Moreover, with the model parameters numbering from 10³, on the low end, to 10¹², and the training data commonly comprised of at least an order of magnitude more examples than trainable parameters (i.e., number of model parameters x10), the sheer magnitude of memory requirements to support reliable deterministic training is staggering and well beyond present abilities. That said, underlying Applicant’s construct is the notion that each and every process in the system has its own virtual address space (VAS).

With reference now to FIG. 4, advantageous, non-limiting memory components of a process are indicated with reference to process virtual address space 50, namely, bisection context 60, bisection queue 70, training data 80, and session data 90. Bisection context is further characterized by at least four arrays, which for the sake of the instant description are named Domain 62, Bottom 64, Top 66, and Subdomain 68. Domain and Subdomain are arrays of modal intervals. Bottom and Top are arrays of integers. All of the arrays in bisection context 60 have the same number of elements, the number of those elements equal to the number of function variables, i.e., one element in each array corresponds to a variable of the function to be minimized.

Each variable of the function is initialized as an interval having a lower and upper bound, forming an axis-aligned parallelotope that defines a global search region for the entire domain Ω of the function to be minimized. Upon system initialization, the Domain array in the bisection context is initialized with this global search region so that every function variable has a corresponding interval “domain value” in the Domain array.

After system initialization, the Domain array and the training data are never modified. It is therefore advantageous, although not required, that the virtual address of these memory components be mapped to a physical memory location shared by many processes. This is easy to do on most operating systems, for example, using memory mapped files. It is advantageous because the training data, in particular, can be very large, and allowing many processes to share a read-only instance can significantly reduce the total amount of required memory. The ability to do this successfully may depend, for example, on the contemplated variations of the adopted system (i.e., one or more shown or described heretofore, or suitably adapted).

While an overview of the optimization process, wherein the global search region in the Domain array is bisected into smaller subdomains until the global minimum is proven to exist in a box found via proof-by-elimination, has been set forth with reference to FIG. 3, a hybrid parallelization method of the process is believed desirable and advantageous, and is realized in a work-stealing scheme (see e.g., Prell, Embracing Explicit Communication in Work-Stealing Runtime Systems, Ph.D. Dissertation, University of Bayreuth, 2016).

Each process doing work (a “worker”) sends a work stealing request. All of the steal requests are forwarded amongst the processes. A process for managing the work stealing scheme (i.e., the “manager”) responds to the first steal request it receives by sending the global search region in a message to the requesting worker; then the manager forwards all subsequent steal requests. Once a response to a steal request is received by a worker, the worker begins a depth-first search, placing unprocessed boxes at the top of its queue. As steal requests from other workers arrive, the worker sends messages containing unprocessed boxes from the bottom of its queue to the requesting workers. In the contemplated, non-limiting hybrid parallelization method, each worker performs a depth-first search on a different subset of the global search region. At the same time, a breadth-first search is effectively performed by the workers via the work stealing process.

For each worker, integers in the Bottom array of the bisection context represent a bisection status for the Domain array. The bisection status uses an indexing scheme to record how each domain value in the Domain array is restricted to an interval subset of itself.

With reference now to FIG. 5, a desirable, advantageous non-limiting indexing scheme 100 for a single domain value is depicted, codes 102 representing labeled bisection item elements (i.e., one of sixty-three intervals/interval subsets), with the parent/child element relationships set forth in the upper figure portion. A positive integer (i.e., an “item”) encodes which interval subset of the domain value (e.g., 1-63) is recorded in the bisection status. Root item 1 represents the entire domain value. An item i can be bisected into left child 2i and right child 2i + 1; and, any item (except the root item) can recover its parent as [i/2]. The “depth” of item i can also be recovered as m= [log₂(i)], and the “size” of depth m, i. e., the number of items in the indexing scheme at depth m, is n = 2^m.

Integers in the Top array also represent a bisection status for the Domain array. However, each item in the Top array is an item that represents a subset of its corresponding item in the Bottom array. Under this restriction, the bisection status in the Top array is a subset of the bisection status in the Bottom array.

Intervals in the Subdomain array represent a subset of the global search region in the Domain array, as recorded by the bisection status in the Top array. Specifically, if i is an item in the Top array, n is the size of the depth at item i, and [a,b] is the corresponding element in the Domain array, then the corresponding element in the Subdomain array is [u,v], where

$\begin{array}{l} u = a + \frac{i - n}{n} \cdot (b - a) \\ v = a + \frac{i - n + 1}{n} \cdot (b - a) \end{array}$

The bisection queue is a collection of bisection records. Each bisection record comprises two integer fields, e.g., Axis 72 and Item 74. The Axis field is an index to a particular array element in the bisection context, and the Item field is a record associated with that particular array element. The bisection queue is a double-ended queue, so bisection records can be added or removed from the top or the bottom of the queue. At system initialization, the bisection queue is empty.

In order to support the “obnoxious case” of nonlinear global optimization using interval arithmetic when the function to be minimized consists of thousands or millions of variables (see Neumaier, Complete Search in Continuous Global Optimization and Constraint Satisfaction, Acta Numerica 2004 (A. Iserles, ed.), Cambridge University Press 2004), the bisection context and bisection queue act as a state machine that maintains a bisection state. The essential operations on the bisection state are Reset, Push, Pop, Remove, and Cleanup, each taken up seriatim.

The Reset operation occurs when an idle process receives work to perform. The work arrives in a message containing an array of integers that are copied into the Bottom array of the bisection context. The Reset operation assumes, as a precondition, the work has already arrived and is copied from the message into the Bottom array; the operation also requires the bisection queue to be empty. The Reset operation copies the contents of the Bottom array into the Top array, so that both arrays are identical, then updates each element of the Subdomain array as previously described.

In connection to further contemplated operations, namely, Push, Pop, Remove and Clean-up, reference is generally and broadly made to FIGS. 6-11. The relatedness of/for process virtual address space memory components, the parameter domain and the disclosed indexing scheme are schematically set forth as to the contemplated operations.

With reference to FIG. 6, notionally, the Push operation adds or inserts a bisection record (→) at the “top” of bisection queue 70 as is appreciated with comparison of FIG. 6(a) with FIG. 6(b), so as to form a stack of yet to be processed intervals, more particularly, and effectively, an accumulatingly catalogued, or mapped stack of yet to be processed intervals. Two Push operations or queue additions are depicted (FIG. 6(b)). This operation is a depth first, stepwise, methodical descent through the interval domain bisection/domain array.

Bisection queue 70 is represented as having recorded/memorialized the X-Y parameter domain bisections, namely, bisections corresponding to/with domain indexing scheme 100 depicted adjacent the axes of the illustrative parameter domain 110, figure right, a status of bisection context 60 likewise indicated, more particularly, elements of/for Bottom 64 and Top 66 integer arrays are correspondingly indicated.

Each queue entry is characterized by Axis 72 and Item 74 elements, the latter element being a code 102 of the indexing scheme 100 representing the bisection status for Domain array 62 of bisection context 60 (FIG. 5). As illustrated, interval subset in X parameter space, indicated by code 3, has been bisected into a subset pair characterized by codes 6 & 7 (FIG. 6(a)), subset “6” bisected into a further subset pair characterized by codes 12 & 13 (FIG. 6(b)) with parent subset “3” thereafter negated (i.e., having been processed without detection of a “solution,” the code associated with this subdomain is negated for the sake of the bisection queue (i.e., a signed negative is registered in Item field 74 of the bisection queue 70), each of those further subset pairs “pushed” for later processing (i.e., codes 12 & 13 added to ((→)) Item field 74 of the bisection queue 70.

With reference to FIGS. 7 & 8, Pop operations are illustrated, namely, those associated with “active” and “stale” processing respectively. Notionally, the Pop operation, in contradistinction to the Push operation, is intended to be a selective ascension of/through the interval domain bisection/domain array.

The Pop operation requires, as a precondition, a non-empty bisection queue 70 (i.e., there is at least a single yet to be processed interval resulting from interval bisection). The operation contemplates the examination of the bisection record at the top of the bisection queue. As a threshold matter, the state or condition of the bisection queue element is assessed, more particularly, does the item, i.e., domain subset, have the capacity to contain a solution, or has it been determined that the domain subset at the top of the queue is not capable of containing a solution (i.e., has the Item element been negated, characterized as being stale, via representation in the item array as a signed negative code). The following description proceeds in connection to the representations of FIGS. 7-9.

With reference to FIG. 7, an illustrative Pop operation is shown to the extent that Item field 74 of bisection queue 70 is not negative. As to the operation, the Item field is copied into Top array 66 of bisection context 60 at the element specified by Axis field 72. The corresponding element in Subdomain array 60 (FIG. 4) is updated, the Item field is negated (i.e., made “stale”), and the bisection record is allowed to remain at the top of bisection queue 70.

With reference to each of FIGS. 8 & 9, alternate Pop operations are shown to the extent that Item field 74 of bisection queue 70 is a signed negative code, i.e., the bisection record is “stale” and a parent item is recovered from the absolute value of the Item field. As to the operation upon the recovered parent, the code of Item field 74 is copied into Top array 66 of bisection context 60 at the element specified by Axis field 72. The corresponding element in the Subdomain array 60 (FIG. 4) is updated, the empty bisection record is erased from the top of the bisection queue, and another Pop operation is performed on the “new” top of the bisection queue.

With reference to FIG. 10, the Remove operation requires, as preconditions, that bisection queue 70 is not empty and that the bisection record at the bottom of the queue is not stale (FIG. 10(a)). The Remove operation erases the bisection record at the bottom of the bisection queue after Item field 74 is copied into Bottom array 64 of bisection context 60 at the element specified by Axis field 72. Thereafter (FIG. 10(b)), Bottom array 64 is sent in a message to another worker of the system in furtherance of further, parallel processing.

With reference to FIG. 11, there is represented a Cleanup operation, an operation for which there are no preconditions. If the bisection record at the bottom of bisection queue 70 is stale, it is erased after the absolute value of Item field 74 is copied into Bottom array 64 of bisection context 60 at the element specified by Axis field 72 of bisection queue 70. The operation is repeated until bisection queue 70 becomes empty, or the bisection record at the bottom of the queue is not stale. After each Pop or Remove operation is completed, an implicit Cleanup operation always occurs.

By virtue of the bisection state maintained between the bisection queue and the bisection context, the present invention only requires a fixed amount of memory for the arrays in the bisection context. A small amount of memory is also needed for the bisection queue. However, even if the bisection queue grows very large, the total amount of memory required for the bisection queue is still small owing to the size of each bisection record (two integers). Because of the bisection state, and the operations upon it, the contemplated system can perform a branch-and-bound process to solve nonlinear global optimization problems using interval arithmetic when the function to be minimized consists of thousands or millions of variables.

Having set forth at least an overview of each of the process virtual address space, a bisection indexing scheme, and advantageous contemplated non-limiting operations attendant to model parameter determination of an objective function leveraging the memory elements of the virtual address space and indexing scheme, an overview of the previously noted work stealing scheme hereinafter follows. More particularly, and contextually, a preferred, non-limiting worker and manager scheme is set forth with reference to FIGS. 12 & 13 wherein worker and manager processes and events sequence as to same are depicted, respectively.

With reference to FIGS. 12 & 13, a static structure diagram of structure 200 associated with an advantageous worker and manager processes is depicted. A monotonicity test can significantly improve the overall performance of the optimization process. This requires calculating interval bounds on the gradient of the function. Context 202 is the base class for bisection contexts, and Gradient 204 is a derived bisection context. Gradient 204 includes an additional array of intervals in the bisection context for the gradient of the function to be stored in. Search 206 is derived from Gradient 204, and the Search bisection context adds the notion of the bisection queue.

All of the worker processes in the system (Worker 208) own a unique instance of Search class 206. At system initialization, the Search class is instantiated with references to the Function 210 and Criteria 212 interfaces. The elementCount method in the Domain 214 base class returns the number of function variables, which can be used to dimension arrays in the bisection context. Function 210 provides the evaluate method, which accepts a pointer to the Gradient and Subdomain arrays, evaluates the function based on the current state of the variables in the Subdomain array, updates the Gradient array, and returns an interval bound on the range of the function. Criteria 212 provides user termination criteria and the update method, which accepts a pointer to the Top array. The update method is invoked by Search 206 whenever a new local minimum is found.

There is one manager process in the system, namely Manager 216. Manager serves double-duty by receiving messages from workers that invoke the update method (see FIG. 13; worker process 208' and manager process 216'). The Top and Subdomain arrays of Manager’s bisection context store the best local minimum found by all of the workers; and, the Bottom array stores the bisection state of the global search region, which is specified during system initialization. This is why Manager owns a unique instance of the Solution class 218, i.e., since Manger never requests work, it does not need a Gradient array in the bisection context, nor a bisection queue. Manager invokes the save method of the Output interface 220 each time a new best local minimum is received from a worker.

Under normal conditions, the update method is invoked very infrequently because a large amount of computation is typically required by a worker to find a new local minimum. This means that synchronization caused by multiple workers attempting to update the manager is, almost without fail, never a practical performance issue, even in a system with hundreds or thousands of workers.

The evaluate method is the essential connection point between CPUs and MPUs in the system, as the method invokes an evaluation of the function. In the context of artificial intelligence, machine learning, and deep learning, the function being minimized is a loss function defined by the training data and the artificial neural network. In “big data” applications, the amount of training data can be gigantic. It is not uncommon that a single invocation of evaluate may result in trillions or even quadrillions of modal interval arithmetic operations. For this reason, a preferred embodiment of the system is characterized by hardware MPU circuits capable of performing large amounts modal interval arithmetic operations very quickly, preferably in a single instruction, multiple data (SIMD) format and performance factor.

Additionally, evaluation of the artificial neural network, especially its gradient, often requires session data (see FIG. 4), which is memory reserved for storing intermediate calculations used in the evaluation of the artificial neural network. As mentioned earlier, the training data is read-only. This provides an opportunity to store the training data in a physical memory technology that is optimized for read-only access, and for a single copy of the training data to be shared amongst several processes. Session data, on the other hand, must be writable; and, unlike the training data, session data cannot be shared amongst processes. Each worker requires its own session data.

Because evaluation of the function is generally a very computationally intensive process, the hybrid parallelization method which has been described herein is coarsely parallel. In fine-grained parallelism, the total amount of time to complete each task is usually very small. This places unique demands on the work stealing process, often requiring a “steal-half” scheme wherein a worker responds to a steal request by sending as many as half of its queued tasks in keeping with the earlier cited work of Prell.

In coarse-grained parallelism, the time to complete each task is lengthy; and, the “steal-half” scheme usually adds little value because the default level of parallelism is already very high. Indeed, the present invention generally performs very well without using a “steal-half” scheme. This is why in the description of the preferred embodiment, workers respond to steal requests by sending only the bisection state associated with the bisection record at the bottom of the bisection queue. However, it is also contemplated that workers could respond to steal requests by sending N bisection states associated with N bisection records in the bisection queue.

Last but not least, communication of a least upper bound between workers is contemplated. Again noting that the least upper bound (the “LUB”; see/cross reference L, FIG. 3) is a conservative estimate of the best local minimum found so far, and although the LUB is just a single number, it plays a very important role in the “proof by elimination” method of the branch-and-bound process. The LUB is used as a threshold to determine which subsets of the global search region can be discarded. As the LUB becomes a more refined estimate of the true global minimum, greater amounts of the global search region can be discarded. This reduces the total amount of work needed to find the global minimum, speeding up the total amount of time required to find the global minimum.

If the LUB is shared amongst all of the workers, dramatic speed improvements can be gained. The reason for this is that when a worker improves the estimate of the LUB, all of the other workers can delete any queued bisection records which can be proven to be associated with a bounded range of the function that is strictly greater than the LUB. The ability to delete queued bisection records can grow exponentially in the number of workers, leading to a super-linear speedup. This is highly advantageous, because it means the speedup multiple grows faster than the multiple of workers.

Modern MPI implementations support remote direct memory access (RDMA), a direct memory access from the memory of one computer into that of another without involving the operating system on either computer. This permits high-throughput, low-latency networking, which is especially useful in massively parallel computer clusters. For example, the MPI_Fetch_and_op command allows a process to update a 32-bit or 64-bit memory address of another process, even if the other process is running on another computer across the network.

This mechanism is used for communicating the LUB between processes, with a very simple background data dissemination protocol (e.g., the “gossip protocol”) . A copy of the LUB is stored in each process as a 32-bit or 64-bit value. As each worker processes the bisection records in its bisection queue, the worker reads and writes its own local copy of the LUB. Periodically, the worker (i.e., the “origin”) will also use the MPI_Fetch_and_op command to update the LUB of a select group of other workers (i.e., the “target group”). As a result, each worker in the target group (i.e., the “target”) has its LUB overwritten by the LUB from either the origin or the target, whichever has the lesser value. In other words, the minimum LUB is propagated from the origin to each target in the target group. Each target will periodically do the same thing, propagating the minimum LUB to a different target group of workers, and so on.

The parameters of the gossip protocol are twofold. One parameter is the minimum amount of time that must elapse before a worker will attempt to propagate the LUB to its target group. The other parameter is a selection of workers for all of the different target groups. Many methods of selection are contemplated, including, but not limited to, random selection, nearest physical topology, and nearest virtual topology.

By using RDMA and a simple background data dissemination gossip protocol, network communication is minimized and synchronization bottlenecks are eliminated, with high probability that the best estimate of the LUB will eventually propagate to every process in the system. Since each worker is constantly improving its own copy of the LUB, it does not hurt the reliability of the overall optimization process if there is possibly a better LUB found by another worker that it cannot see. The system only stands to gain the advantage of reducing the total amount of work, and thus increasing the overall speedup, if the better LUB is eventually propagated to all of the other workers.

What has been described and depicted herein are preferred, non-limiting embodiments of Applicant’s subject matter. Since the elements of the system and/or devices or mechanisms disclosed herein may be embodied in other specific forms without departing from the spirit or general characteristics thereof, some of which forms have been indicated, the embodiments described and depicted herein/with are to be considered in all respects illustrative and not restrictive. Moreover, while nominal operational steps or sequences and/or a protocol have been set forth in relation to a desirable, and advantageous processing methodology, referenced, contemplated sequences/protocols are not so limited. Accordingly, the scope of the subject invention is as defined in the language of the appended claims, and includes not insubstantial equivalents thereto.

RELIABLE SUPERVISED MACHINE LEARNING USING INTERVAL ARITHMETIC

Information

Publication Number

Date Filed

Date Published

Inventors

Original Assignees

CPC

International Classifications

Abstract

Description

Claims

Parent Case Info

PCT Information

Provisional Applications (1)