Various embodiments concern processing units with hardware architectures suitable for artificial intelligence and machine learning processes, as well as computational systems capable of employing the same.
Historically, artificial intelligence (AI) and machine learning (ML) processes have been implemented by computational systems (or simply “systems”) that execute sophisticated software using conventional processing units, such as central processing units (CPUs) and graphics processing units (GPUs). While the hardware architectures of these conventional processing units are able to execute the necessary computations, actual performance is slow relative to desired performance. Simply put, performance is impacted because too much data and too many computations are required.
This impact on performance can have significant ramifications. As an example, if performance suffers to such a degree that delay occurs, then AI and ML processes may not be implementable in certain situations. For instance, even delays of less than one second may prevent implementation of AI and ML processes where timeliness is necessary, such as for automated driving systems where real-time AI and ML processing affects passenger safety. Another real-time example is military targeting systems, where friend-or-foe decisions must be made and acted upon before loss of life occurs. Any scenario where real-time decisions can impact life, safety, or capital assets is an application where faster AI and ML processing is needed.
Entities have historically attempted to address this impact on performance by increasing the computational resources that are available to the system. There are several drawbacks to this approach, however. First, increasing the computational resources may be impractical or impossible. This is especially true if the AI and ML processes are intended to be implemented by systems that are included in computing devices such as mobile phones, tablet computers, and the like. Second, increasing the computational resources will lead to an increase in power consumption. The power available to a system can be limited (e.g., due to battery constraints), so limiting power consumption is an important aspect of developing new technologies.
The patent or application file contains at least one drawing executed in color. Copies of this patent or patent application publication with color drawing(s) will be provided by the Office upon request and payment of the necessary fees.
Features of the technology described herein will become more apparent to those skilled in the art from a study of the Detailed Description in conjunction with the drawings. Various embodiments are depicted in the drawings for the purpose of illustration. However, those skilled in the art will recognize that alternative embodiments may be employed without departing from the principles of the present disclosure. Accordingly, although specific embodiments are shown in the drawings, the technology is amenable to various modifications.
Introduced here is an artificial intelligence (AI) system designed for machine learning (ML). As further discussed below, the system may be based on a neuromorphic computational model that learns spatial patterns in inputs using data structures called Sparse Distributed Representations (SDRs). At a high level, an SDR may be representative of a sparse, high-dimensional bit vector whose unique mathematical properties can be leveraged by the system. Note that the term “bit vector” may be used synonymously with the terms “bit array,” “bit map,” “bit set,” and “bit string.”
One of the more interesting challenges in AI is the problem of knowledge representation. Representing information and relationships in a form that computing devices can handle has proven to be difficult with traditional approaches focused on computer science. The underlying problem is that knowledge generally cannot be defined as, or divided into, discrete pieces of information with well-defined relationships. To address this problem, SDRs can be used in an effort to emulate the biological intelligence of the human brain. Generally, an SDR includes hundreds or thousands of bits, and at any given point in time, a small percentage of the bits are ones while the remaining bits are zeros. At a high level, the bits are meant to correspond to neurons in a human brain, where a one represents a relatively active neuron and a zero represents a relatively inactive neuron. An important feature of SDRs is that each bit has meaning. Therefore, the bits that are “active” in a given representation will encode a corresponding set of semantic attributes of what is meant to be represented. Rather than labeling each bit, the meaning of each bit can be learned.
As further discussed below, the neuromorphic computational model can be executed entirely in software, for example, on conventional processing units, such as central processing units (CPUs) and graphics processing units (GPUs), or specialized processing units, such as neural processing units (NPUs). Accordingly, the approaches introduced here could be implemented through the execution—by a conventional processing unit and/or a specialized processing unit—of instructions in a non-transitory medium. Note that while embodiments may be described in the context of software, features of those embodiments may be similarly applicable to firmware and hardware.
Note that the system may operate in multiple modes, namely, a training mode and an inferencing mode, as described below.
While in the training mode, the system “learns” patterns in data that is provided as input. This data is commonly referred to as “training data.” The training mode is considered to be “supervised” when the training data includes, or is accompanied by, labels for particular outputs. Conversely, the training mode is considered to be “unsupervised” when the training data does not include any labels. Thus, if no labels are available, the system may learn in an unsupervised manner, discovering the appropriate relationships between inputs and outputs entirely on its own. Results obtained during the training mode generally are not provided back to a host processing unit. Instead, the “output” of the training mode may be a trained neuromorphic computational model (or simply “trained model”) that is learned through study of the training data.
While in the inferencing mode, the system processes real-world data, rather than training data, through the software pipeline 100. Real-world data does not include labels, and therefore the system produces an output (also called an “inference” or “prediction”) based on relationships learned by studying the training data during the training mode. Said another way, the software pipeline 100—with the trained model learned in the training mode—can be used to predict appropriate labels for the real-world data. As part of the inferencing mode, the system can identify patterns in the real-world data and then create an appropriate SDR. Such an approach allows the classifier 106 to properly identify the labels that correspond to the input and provide those results to the host processing unit. Furthermore, the inferencing mode may allow the system to predict and learn changes in data patterns associated with the labels (e.g., to detect and address data “drift” over time).
Continuous learning may be optionally permitted to let the trained model continue learning even as it is making inferences (e.g., based on analysis of real-world data or testing data). This may occur in scenarios where (i) the learning and inferencing rates of the trained model are similar and (ii) the trained model is capable of learning in unsupervised mode. Continuous learning allows the software pipeline 100 to learn new emerging classes, as well as track “drift” in the definition of learned classes, in real time.
Neuromorphic machine intelligence is a branch of AI in which models are derived based on a mathematical modeling of the cerebral neocortex of the human brain. These models can employ a form of data representation that is observed in nature by producing SDRs. With an SDR, a corresponding object is represented using a data structure (e.g., a binary vector) that is large but is relatively sparse. Said another way, the data structure may include hundreds or thousands of entries, though only a small fraction (e.g., less than 5, 2, 1, or 0.5 percent) may be set bits. One important property of SDRs is that each position in the data structure has semantic meaning and represents a pseudo-orthogonal dimension in a high-dimensional space. Another important property of SDRs is that the degree of overlap between a pair of SDRs is indicative of (e.g., proportional to) the degree of semantic similarity of the pair of objects represented by the pair of SDRs.
On the other hand, various problems and benchmarks in the field of AI represent objects as feature vectors. The term “feature vector” is generally used to refer to an ordered set of feature values of a corresponding object. The feature values are themselves represented in common datatypes used in computer science, such as integers, floating point numbers, characters, and the like. These representations are called “dense representations” because the focus tends to be the most efficient storage of these feature values. Consequently, these representations are typically short in length with no restrictions on sparsity. Additionally, the bit positions have no independent semantic meaning. Instead, the values in all of the bit positions are considered in combination to infer the value of a feature.
In order to solve standard problems and benchmarks in AI with neuromorphic models, an encoder (e.g., encoder 102 of the software pipeline 100) can be used to convert dense feature vectors into SDRs.
There are two known methods of encoding, namely, (i) linear encoding and (ii) random distributed encoding. With linear encoding, all of the b set bits are placed contiguously in an SDR of length l. Given a feature value f, a bucket identifier B can be calculated using min-max scaling followed by normalization to an integer value between 0 and l−b.
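A minimal sketch of linear encoding, consistent with the description above, is provided below. The function name, parameter names, and the clamping of the bucket identifier are illustrative assumptions rather than part of the disclosure.

def linear_encode(f, f_min, f_max, l, b):
    # Min-max scale the feature value f, then normalize it to a bucket
    # identifier B in the range 0 .. l-b (i.e., l-b+1 buckets in total).
    num_buckets = l - b + 1
    scaled = (f - f_min) / (f_max - f_min)            # min-max scaling to [0, 1]
    bucket = min(int(scaled * num_buckets), num_buckets - 1)
    sdr = [0] * l
    sdr[bucket:bucket + b] = [1] * b                  # b contiguous set bits
    return bucket, sdr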
Random distributed encoding generates a neighborhood around each bucket identifier B in such a way that neighborhoods of adjacent bucket identifiers have significant overlap. An l-bit encoding can then be generated based on the neighborhood of the bucket identifier. In contrast to linear encoding, the b set bits can be interspersed throughout the length of the SDR.
Linear encoding and random distributed encoding represent two ends of the spectrum with respect to the desired resolution and feature-SDR overlap characteristics. Linear encoding is more restrictive, offering a resolution of only l−b+1 buckets. However, linear encoding is very efficient with near constant runtime overhead. Additionally, there is no accidental overlap between the SDRs of different feature values.
To overcome the challenge posed by balancing resolution and accidental overlap, the encoder introduced here may be adaptive in nature. The adaptive encoder provides a means to balance resolution with the likelihood of accidental overlap. The adaptive encoder can ensure that all of the set bits in the SDR provided to the NPU as input are always present within a span of s bits. Notice that by setting s=b, the adaptive encoder can work like a linear encoder. On the other extreme, by setting s=l, the adaptive encoder can work like a random distributed encoder. As s increases from b towards l, the number of buckets increases combinatorially, but so does the chance of accidental overlap. This effect is discussed in greater detail below.
Another feature of the adaptive encoder is the ability to algorithmically generate SDRs in a way that the SDRs of adjacent buckets will differ in exactly one position. This minimizes the nonlinearity in the overlap characteristics of the SDRs of buckets in a neighborhood but does not eliminate the nonlinearity completely. By combining linear encoding with random distributed encoding, the adaptive encoder can minimize irregularities and keep accidental overlap to a minimum. Finally, the runtime complexity of the underlying algorithm may be proportional to the number of set bits, and therefore lies between that of linear encoding and random distributed encoding.
One innovation is to limit the b set bits in the SDR to a window spanning s bits. For convenience, s may be referred to as the “window span.” By doing this, a total of C(s, b) bucket identifiers (i.e., the number of ways of choosing b set-bit positions from s positions) can be encoded using SDRs whose set bits belong to the same window. Next, the window can shift by one position and the process can restart. In this manner, a total of (l−s+1)×C(s, b) bucket identifiers can be encoded. In some embodiments, the window span is predefined (i.e., programmed into memory and unchangeable). In other embodiments, the window span is a user-defined value where b≤s≤l.
This approach to adaptively encoding bucket identifiers can proceed by identifying the window w in which a given bucket identifier lies and the offset position o within the window w. The underlying algorithm can initially generate an s-bit encoding of the offset o where b bits are set. Then, the underlying algorithm can shift the encoding w−1 times to generate the SDR. The b bits within a window can be set in such a way that the bit positions of adjacent bucket identifiers have a known overlap score (e.g., b−1). Furthermore, the SDR of the last bucket identifier of a window and the first bucket identifier of the next window may also differ in exactly one location.
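A minimal sketch of this window-and-offset decomposition is provided below, assuming 0-based bucket identifiers and window indices. The lexicographic ordering of offset encodings is an illustrative assumption; the ordering described above, in which adjacent bucket identifiers differ in exactly one bit position, would require a combinatorial (Gray-code-like) enumeration instead.

from itertools import combinations

def adaptive_encode(bucket_id, l, s, b):
    # Precompute the s-bit encodings of all C(s, b) offsets within a window.
    offset_codes = list(combinations(range(s), b))
    per_window = len(offset_codes)                 # C(s, b) buckets per window
    num_windows = l - s + 1
    assert 0 <= bucket_id < num_windows * per_window
    w = bucket_id // per_window                    # window in which the bucket lies
    o = bucket_id % per_window                     # offset position within the window
    sdr = [0] * l
    for pos in offset_codes[o]:
        sdr[w + pos] = 1                           # shift the window encoding by w positions
    return sdr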
As an example, with an SDR of length 16, 3 set bits, and a window span of 5, a total of C(5, 3)=10 bucket identifiers can be encoded in the first window spanning the initial 5 bits. The window can slide up to 12 times, allowing a resolution of 120 possible bucket identifiers. When the window span equals the number of set bits (i.e., s=b), the adaptive encoder can behave similarly to a linear encoder. When the window span equals the length of the SDR (i.e., s=l), the adaptive encoder can behave similarly to a random distributed encoder.
Accidental overlap can be significantly reduced in comparison to pure random distributed encoding, yet it can still exist. For example, overlap between SDRs of buckets from the same window can be zero, when the span is more than two times the number of set bits b. However, overlap between SDRs of buckets from the next window can be as high as b−1, and the maximum possible overlap with the SDRs of the following window may be b−2 (and so on). Thus, the accidental overlap may become zero between bucket identifiers that are more than b windows apart.
To further smooth the attenuation, a hybrid approach can be adopted. A second SDR of length l with b set bits can be generated for a given bucket identifier by linearly encoding its window w. The system can then append the two SDRs (i.e., the first SDR from adaptive encoding and the second SDR from linear encoding) to generate a composite SDR of length 2l bits, in which 2b bits are set. An example of encoding in accordance with this hybrid approach, with an SDR length of 32 bits and 6 set bits, is shown in
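Building on the adaptive_encode sketch above, the hybrid approach can be illustrated as follows. The helper below is an assumption-laden sketch rather than the disclosed implementation; it simply concatenates the adaptive encoding of the bucket identifier with a linear encoding of its window index.

from math import comb

def composite_encode(bucket_id, l, s, b):
    adaptive_sdr = adaptive_encode(bucket_id, l, s, b)   # first SDR (adaptive encoding)
    w = bucket_id // comb(s, b)                          # window index of the bucket
    linear_sdr = [0] * l                                 # second SDR (linear encoding of w)
    linear_sdr[w:w + b] = [1] * b
    return adaptive_sdr + linear_sdr                     # 2l bits with 2b set bits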
The overlap between the composite SDRs of bucket identifiers belonging to the same window lies between b bits (due to the linear encoding) and 2b−1 bits (with 0 to b−1 bits due to the adaptive encoding). The overlap between buckets lying in the next window lies between b−1 bits (due to the linear encoding) and 2b−2 bits (with 0 to b−1 bits due to the adaptive encoding). And the overlap between buckets lying two windows apart lies between b−2 bits (due to the linear encoding) and 2b−4 bits (with 0 to b−2 bits due to the adaptive encoding). This pattern continues, with the overlap attenuating until it becomes zero between bucket identifiers that are more than b windows apart.
The runtime complexity of the underlying algorithm executed by the system can be represented as O(b). However, when C(s, b) is reasonably small, the s-bit encodings of all offsets within a window can be precalculated and stored in memory. The rest of the operations (e.g., the shift operation and linear encoding of the window) can be performed with constant runtime complexity. In either case, the runtime complexity of the underlying algorithm allows encoding to occur in real time for highly accelerated computational systems.
Representing objects and class signatures as sparse high-dimensional vectors (e.g., hyperdimensional vectors) is done by some computational systems, where each vector can be represented as the collection of indices of its set bits. This type of representation allows for compaction of storage, as the vectors are typically very sparse with less than 2 percent of bits set. For example, a 1,024-bit vector with 40 set bits can be represented using 128 bytes in the native bit-vector format or 80 bytes when storing just the indices of set bits, where each index is represented using 2 bytes. This results in a 37.5 percent reduction in space required.
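A minimal sketch of this compact, index-based representation, and of computing an overlap score from it, is shown below; the function names are illustrative assumptions.

def to_indices(bit_vector):
    # Store only the indices of set bits (compact format).
    return [i for i, bit in enumerate(bit_vector) if bit]

def overlap(indices_a, indices_b):
    # Overlap score = number of common set-bit positions.
    return len(set(indices_a) & set(indices_b))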
This approach can be employed to gain significant benefits in the system introduced here. The software pipeline (e.g., the software pipeline 100) can employ this compact, index-based representation for the SDRs that are passed between its stages.
The computational complexity of the three stages of the software pipeline is summarized as follows.
The first stage (also called the “encoding stage”) can execute on a host processing unit efficiently. The host processing unit could be a CPU, for example. The encoding stage may be carefully designed to have a runtime complexity of O(b), where b is the number of set bits in the input SDR (iSDR). The output may already be sorted, and therefore can avoid the overhead of O(b log b) for sorting the result. Similarly, the output may already be in the compact format that the second stage (also called the “learning stage”) ingests. Otherwise, the compaction may add an overhead cost of O(l), where l is the length of the iSDR and b ≪ l. With this low complexity and faster operating clock frequency, the encoding stage can keep pace with the subsequent stages of the software pipeline.
The learning stage—in which a model is learned through analysis of training data—can be executed on an NPU, which may be a subcomponent of the NNP. The model can include a collection of neurons and feedforward synapses, which are intended to mimic the pyramidal neurons and proximal synapses in the neocortex of the human brain. The feedforward synapses can be connected to various offset locations in the iSDRs. Based on the feedforward synapses and set bits in the iSDR, the NPU can compute a weighted overlap score for each neuron. The weighted overlap scores can then be used to identify “winning neurons.” The indices of the winning neurons may represent the output SDR (oSDR) of the learning stage.
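The following sketch illustrates the weighted-overlap computation and winner selection described above, assuming the synaptic strengths are stored as a dense matrix; boosting and the streaming, symbol-by-symbol processing performed by the NPU hardware are not modeled here.

import numpy as np

def learning_stage(isdr_indices, synapse_strength, num_winners):
    # synapse_strength: (num_neurons, l) array of synaptic strength values,
    # where a zero entry means the synapse is not connected.
    scores = synapse_strength[:, isdr_indices].sum(axis=1)   # weighted overlap per neuron
    winners = np.argsort(scores)[-num_winners:]              # highest-scoring neurons
    return sorted(winners.tolist())                          # oSDR as indices of winning neurons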
The NPU can be connected to the host processing unit (e.g., a CPU) using a streaming interconnect, such as a PCI Express (PCIe) interface, that allows the passage of iSDRs and oSDRs between the NPU and host processing unit. The compact nature of iSDRs and oSDRs can reduce the streaming bandwidth and onboard memory requirements by roughly 37 percent as discussed above. More importantly, the compact nature may align the processing of the simple pattern-matching neurons with the NPU architecture. This allows a very dense and efficient realization of neurons in the NPU, which enables reduced latency, die area, and energy requirements while increasing throughput and efficiency. Each neuron may have Synaptic Strength Value Memory (SSVM) to track the connection strength of the synapses. By broadcasting only the unordered SDR bits to all of the neurons in parallel using the multiple instruction, single data (MISD) architecture of the NPU, each neuron can respond to those set bits that it has been programmed to recognize. The NPU can then calculate an overlap score for each neuron to determine the “winning neurons.”
In embodiments where “winning neurons” are determined, passing an iSDR as an unordered collection of indices of its set bits is particularly beneficial. There is no requirement that the collection be ordered. The indices can be changed into a symbol stream, where each symbol corresponds to an index in the collection. After processing the last symbol of the iSDR, the overlap scores of the neurons can be captured and additional logic circuitry may be used to efficiently identify the “winning neurons.” The overall efficiency of this embodiment allows for the execution of thousands of neurons on a single computing device, at very low power, in very small die size, and with high clock speeds. This combination can produce performance gains that are several orders of magnitude better than conventional solutions.
The output SDR (oSDR) of the learning stage can also be an iSDR for subsequent instances of the learning stage without any modifications. The oSDR may also be represented as an unordered collection of set bits. Therefore, models can be sequentially arranged in any hierarchical order to provide higher order processing capabilities with very little overhead. oSDRs that are represented in this manner can also be used in the third stage (also called the “classifying stage”) for learning class signatures and matching against previously learned class signatures for inference purposes. Matching against prior class signatures can be significantly accelerated by comparing only the set bits in an oSDR to those in the class signatures. Each class signature can be maintained as an ordered collection of indices of its set bits. This reduces the complexity to O(b log l_sign), where l_sign is the length of the signature SDRs.
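A minimal sketch of this accelerated matching is shown below, assuming the class signature is kept as a sorted list of set-bit indices so that each set bit of the oSDR can be located by binary search.

from bisect import bisect_left

def overlap_with_signature(osdr_indices, signature_indices):
    # signature_indices must be sorted; each oSDR set bit is located by binary
    # search, giving on the order of b log(len(signature_indices)) comparisons.
    count = 0
    for idx in osdr_indices:
        pos = bisect_left(signature_indices, idx)
        if pos < len(signature_indices) and signature_indices[pos] == idx:
            count += 1
    return count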
Set forth below is a discussion on how to handle missing values using SDRs and a cortical artificial intelligence system (or simply “system”) that is intuitive to understand, simple to compute, and portable from one application to another. The method for handling missing feature values is an important part of the strategy for semantically encoding various data types into the SDR to provide a meaningful input to the model. Handling missing feature values may be a part of all encoding strategies for iSDRs to which the model is applied and represents the strategy of encoding null semantic information for an individual field where the system recognizes the absence of data.
This approach combines a cortical processing model based on several neocortical concepts, an understanding of its mathematical properties and of its data representation (i.e., the SDR), and practical methods for encoding missing feature values using null semantics.
The first fundamental concept is the set of mathematical properties of the sparse hyperdimensional data representation called the SDR. As mentioned above, an SDR can capture combinations of subtle semantics that represent the input data. The high dimensionality and low sparsity provide mathematical guarantees that two arbitrary SDRs will be spatially distant from one another—and likely very spatially distant from one another—unless those SDRs are noisy variants of each other. As an example, it can be mathematically shown that, for a 2,048-bit input SDR with 40 set bits representing encoded semantics, the chance of misclassification is less than e^−15 even when 50 percent of the expected 40 set bits are missing in the SDR.
The second fundamental concept relates to the use of an unsupervised brain-inspired learning algorithm (or simply “algorithm”) that can identify combinations of semantics that occur concurrently and/or frequently. The algorithm can be built using an array of neurons that are intended to represent the “pyramidal neurons” in the neocortex of the human brain. These neurons can be connected to a subset of iSDR bit positions. Therefore, each of the iSDR bit positions may be connected to very few neurons, and the effect of a null value in a given field of an iSDR may be limited to only those neurons and not propagated throughout the model, like it would in a conventional ML model.
One aspect that arises from the properties of the SDR and the working of the model is that missing values are generally not a concern for the system. A zero-value bit (i.e., a 0-bit) in the input SDR does not necessarily signify that the value is zero. It merely signifies that information about that semantic is missing or absent in the input SDR. The mathematical properties of the input SDR and the working of the NPU make the system tolerant to a large number of missing one-value bits (i.e., 1-bits).
This leads to the method of encoding missing feature values in traditional AI datasets for computation in the model. A variety of encoding techniques can be used to encode each feature individually into feature SDRs, which can then be converted into a composite input SDR using simple operations such as concatenation. When the value for a feature is missing, it can be encoded as a feature SDR with no 1-bits. This corresponds to the second native interpretation of a 0-bit in an SDR (i.e., that the information is missing). As long as the number of missing feature values is not very large, the model can continue to learn and infer the correct patterns in the training data.
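A minimal sketch of this encoding strategy is shown below. The parameter names and the use of None to mark a missing value are illustrative assumptions.

def encode_record(feature_values, feature_encoders, feature_lengths):
    # Encode each feature into its own feature SDR and concatenate the feature
    # SDRs into a composite input SDR. A missing value (None) is encoded as an
    # all-zero feature SDR of the same length, i.e., null semantics.
    composite = []
    for value, encode, length in zip(feature_values, feature_encoders, feature_lengths):
        if value is None:
            composite.extend([0] * length)
        else:
            composite.extend(encode(value))
    return composite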
While encoding input SDR bits with zeros is simple in concept, the combination of the mathematical properties of the sparse hyperdimensional data representation and the operations of a sparsely connected, neocortically inspired learning algorithm permits handling missing data in an intuitive fashion while continuing to enable highly accurate predictions.
The output data from the NPU can be processed to determine the “winning neurons” and to format these outputs into the proper form for the oSDR. Whether it is a single- or multi-NPU system, this function can be performed in various ways. For example, this function can be performed via execution of software by the device driver of the NPU, or this function can be performed by an on-board microprocessor or a hardware module that is added to the NPU. In some embodiments, the device driver processes outputs produced by the NPU and constructs oSDRs.
At a high level, the NPU may be a digital representation of a feedforward neuron and can be embodied in software, firmware, or hardware. Examples of hardware-based NPUs are described in U.S. application Ser. No. 17/531,576, which is incorporated by reference herein in its entirety. Regardless of its implementation, for each iSDR, the NPU can calculate the overlap count (OLC) between its synapses and the iSDR. The overall system can include hundreds or thousands of NPUs that compute their respective OLCs independently. This provides an opportunity to parallelize this operation, especially in hardware, giving high efficiency and speed. To generate the final oSDR, the NPUs can be ordered based on their OLCs. The NPUs with the highest OLCs can be declared “winners” for that iSDR. This winner selection process can also be streamlined for efficient hardware implementation.
In the inferencing mode, the identifiers of the winning NPUs can serve as the oSDR. By restricting the winning NPUs to a small percentage of the overall number of NPUs (e.g., 0.5, 1, or 5 percent), the desired sparsity of the oSDR can be attained. In the inferencing mode, the oSDR can be fed to the next stage of the software pipeline, namely, the classifying stage. The processing of the next iSDR can start immediately. However, in the learning stage, an additional step may be required. The synaptic strengths of the winning NPUs may need to be adjusted, for example, in accordance with Hebbian enforcement rules, to reflect the learning of the NPU after processing the iSDR.
The NPU may employ a value-sorting algorithm (or simply “sorting algorithm”) that produces a list of potential winning neurons based on an analysis of the OLCs. In some instances, this list may not be the list of actual winning neurons. For example, a multi-NPU system can identify more potential winning neurons than the system is designed to produce. Furthermore, the automatic updating of Synaptic Strength Values (SSVs), stored in the Synaptic Strength Memory (SSM), and the automatic adjusting of Boost Factors (BFs) may not occur until the NPU is notified of its actual winning neurons.
Systems designed for AI and ML may incorporate more than one NPU. For example, two or more NPUs can be designed into such a system. These multiple NPUs may be on a single printed circuit board assembly (PCBA), or these multiple NPUs may be on multiple PCBAs. For example, the system might include two PCIe NPU PCBAs, and each PCBA may include eight NPUs, so the entire system may include sixteen NPUs. The approach described below addresses the problems of determining which neurons are the actual winning neurons, notifying the various NPUs in the system of their winning neurons, if any, and constructing and transmitting the oSDR to the host processing unit.
A system might contain any number of NPUs. However, the system can also specify the number of winning neurons that can be identified during the processing of any given iSDR. In a multi-NPU system, one way to guarantee the delivery of the desired number of winning neurons is to allow each NPU to output up to the total desired number of winning neurons. The output of these NPUs, known as potential winning neurons, may be multiple times the total number of winning neurons desired by the system. All of the potential winning neurons can be collected and sorted, in order of count value, and the desired number of winning neurons can be chosen from the top count values in the ordered list.
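A minimal sketch of collecting and sorting the potential winning neurons is shown below; the tuple layout and function name are illustrative assumptions.

def select_winners(potential_winners, num_winners):
    # potential_winners: list of (olc, npu_index, neuron_index) tuples collected
    # from all NPUs, with each NPU contributing up to num_winners candidates.
    ranked = sorted(potential_winners, key=lambda t: t[0], reverse=True)
    winners = ranked[:num_winners]
    # Group the true winners by NPU so that each NPU can be notified and can
    # update its SSVs and BFs, as appropriate.
    winners_by_npu = {}
    for olc, npu_index, neuron_index in winners:
        winners_by_npu.setdefault(npu_index, []).append(neuron_index)
    return winners, winners_by_npu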
Each NPU can then be notified of its neurons, if any, that were determined to be actual or true winners. Each NPU can be prepared to update SSVs and BFs, as necessary. Note that the data output by each NPU can include more than just the OLC value. The data may also include information to permit correctly identifying the neuron, as shown in
Data corresponding to potential winning neurons can have another field added, so as to identify the NPU that produced the data.
NPUs with NPU indices corresponding to the actual winning neurons, namely, with values of npu_index(1), npu_index(2), and npu_index(4), can be notified that each has one winning neuron. These NPUs can then update their respective SSVs and BFs, as appropriate.
The true winning neurons can then be processed, as necessary, to become the oSDR to be transmitted to the host processing unit. In this example, the three winning neurons with values of npu_index(1), npu_index(2), and npu_index(4) can be processed or encoded, as necessary, to become the oSDR.
As mentioned above, the model may be a neuromorphic learning model that processes a sparse bit vector—called the iSDR—as input and then generates another sparse bit vector—called the oSDR—as output. One contribution of the model is that the oSDR is more readily classifiable than the iSDR (and at lower computational complexity). During the training mode, the oSDRs can be used to build class definitions for inferencing thereafter. Conversely, during the inferencing mode (or testing mode following the training mode), the oSDRs can be used to predict the class of each sample. In the case of continuous learning, every sample can be used for testing and training, and therefore both operations may be performed. These operations can be carried out during the last stage of the software pipeline, namely, the classifying stage.
The supervised classification technique described below includes the concept of subclasses, creating fine-grained decision boundaries within classes and allowing the overall software pipeline to meet the requirements of sophisticated AI applications. In developing the classifier (e.g., the classifier 106 of the software pipeline 100), several considerations were taken into account.
These considerations take into account the mathematical properties of SDRs, which dictate that the probability of significant overlap between any two unrelated SDRs should be infinitesimally small. Also, by the very nature and design of the model, the oSDRs that are generated for samples of the same class should be very similar in their set bits. The intuitive and effective characteristics of oSDRs output by the system allows the system to meet the five criteria specified above.
To illustrate the working of the classifier, an illustrative example is described in the context of the training data shown in
During the training mode, a data structure can be created for each class by the classifier. For example, for each class, the classifier may create a lookup table that is representative of a histogram of set-bit locations (i.e., offsets) in the oSDRs of that class. Each column in the histogram can correspond to an offset position in the oSDRs. The height h_ij of column j depicts the number of times that offset position j was set in the oSDRs of training samples from class i. For example, referring to
Pseudocode for an example of an algorithm that could be employed by the classifier to produce signature SDRs is presented below. The pseudocode includes two subroutines, namely, a first subroutine (i.e., calc_histogram) for calculating histograms and a second subroutine (i.e., calc_sign_SDRs) for calculating signature SDRs. In batch mode, the runtime complexity of the second subroutine is:
O((|Training Set| × oSDR_num_set_bits) + (num_classes × oSDR_len)),
where |Training Set| denotes the number of samples in the training data, oSDR_num_set_bits denotes the number of set bits in each oSDR, num_classes denotes the number of classes, and oSDR_len denotes the length of each oSDR. The first part of the sum comes from the runtime complexity of the first subroutine. The second part of the sum comes from the complexity of calculating the signature SDRs designated as the array sign_SDRs. Typically, the first part of the sum is much larger than the second part of the sum for large amounts of training data (e.g., with many samples).
In streaming mode, the algorithm can be changed slightly to update the histogram and threshold of only the class that the sample belongs to. This can be accomplished in O(oSDR_num_set_bits) complexity. Similarly, the update to the array can be limited to the signature SDR of only the class to which the sample belongs. This can be accomplished in O(oSDR_len) complexity. Therefore, the overall complexity of handling each sample in the training data may be O(oSDR_len), which is proportional to the length of the oSDR.
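The referenced pseudocode is not reproduced above; the following is a hedged Python sketch that is consistent with the description of the two subroutines. The rule used to derive each signature SDR from its histogram (keeping the most frequently set offsets, up to a multiplier of the oSDR set-bit count) is an illustrative assumption.

def calc_histogram(training_osdrs, labels, num_classes, osdr_len):
    # One histogram (a list of column heights) per class; each oSDR is given
    # as the collection of indices of its set bits.
    hist = [[0] * osdr_len for _ in range(num_classes)]
    for osdr, cls in zip(training_osdrs, labels):
        for offset in osdr:
            hist[cls][offset] += 1
    return hist

def calc_sign_SDRs(hist, osdr_num_set_bits, multiplier=1):
    # Retain the most frequently set offsets of each class as its signature SDR.
    sign_SDRs = []
    for class_hist in hist:
        ranked = sorted(range(len(class_hist)), key=lambda j: class_hist[j], reverse=True)
        keep = [j for j in ranked[:multiplier * osdr_num_set_bits] if class_hist[j] > 0]
        sign_SDRs.append(sorted(keep))
    return sign_SDRs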
For a test oSDR, an overlap score can be calculated for each class by counting the common offsets of the test oSDR with the signature SDR of each class. The class with the highest overlap score can be declared the winner. Ties can be reported and/or resolved arbitrarily. Pseudocode for an algorithm for determining the classification of an oSDR created for a test sample is below. The runtime complexity of the algorithm is O(num_classes × oSDR_num_set_bits).
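A minimal sketch of this classification step, consistent with the description above, is as follows; tie handling is simplified to reporting all tied classes.

def classify(test_osdr, sign_SDRs):
    # Overlap score of the test oSDR with the signature SDR of each class.
    scores = [len(set(test_osdr) & set(sig)) for sig in sign_SDRs]
    best = max(scores)
    winners = [cls for cls, score in enumerate(scores) if score == best]
    return winners[0] if len(winners) == 1 else winners    # report ties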
In the case of continuous learning, the runtime complexity of determining the predicted class of a test oSDR and updating the signature SDR of the predicted class is O(num_classes × oSDR_num_set_bits) + O(oSDR_len) = O(num_classes × oSDR_num_set_bits). As the number of classes increases, the runtime can grow proportionally, though the runtime could be accelerated with custom hardware. Fortunately, this problem is parallelizable, and the runtime complexity can be decreased to O((num_classes × oSDR_num_set_bits)/p), where p is the number of threads that can be executed in parallel.
The approach to classification introduced here can satisfy the conditions set forth above. The approach is able to maintain a definition of a class as its signature, which is defined as an SDR. By choosing a suitable multiplier value, the sparsity of the signature SDRs can be held sufficiently low, guaranteeing low-shot and noise-resilient learning capabilities as mathematical properties. These signature SDRs can be easily explained in terms of combinations of ranges of feature values that the corresponding classes are receptive to, using the synaptic map of the neuromorphic model. This explainability is described further below. The approach can support continuous learning with low runtime complexity and memory requirements as discussed above. Moreover, learning a new class and its signature SDR can be done instantaneously, incrementally, and independently of the definitions determined for existing classes. Similarly, if the definition of an existing class begins to drift—as signified by drift in the set bits of the oSDRs of samples belonging to the existing class—then the signature SDR can be automatically updated to account for the drift. Again, the computation may be incremental and independent of the definition of other classes, allowing for continuous learning in real time.
Below, the fundamentals of classification—as performed using an NPU—are set forth in the context of a set of training data, as well as the approach to building the histograms. Here, a method for dealing with subclasses is also presented. For the purpose of illustration, the training data shown in
The number of entries that are included in a class's subclasses can be used as an indicator of the strength of each subclass. The number of entries is representative of the number of oSDRs that have contributed to each subclass.
Processing subsequent training oSDRs belonging to the new subclass involves several steps. First, a training oSDR may be compared with the signature SDRs of existing subclasses, which in this case is Subclass 0. The subclass whose signature SDR has the highest overlap with the training oSDR can be chosen if the OLC value is higher than an entry threshold, which may be defined by a user. The entry threshold can be expressed as a percentage of the number of set bits in the training oSDR. In this example, the entry threshold is set to 66 percent, which leads to a minimum OLC value of 2 (i.e., the entry threshold multiplied by the number of set bits in the training oSDRs, or 0.66×3, rounded up).
The signature of Class 0 can be updated after the addition. While updating the signature, the support for each offset location from members of that subclass can be considered. Generally, the offsets with the highest strength are included. If, while adding offsets to the signature, the sparsity increases beyond a stipulated amount, then ties can be broken with a bias towards retaining the old signature. This scenario is illustrated in the case of offsets 2 and 7, both of which are supported by one member of the subclass. However, offset 2 is retained in the signature because of the bias.
Continuing in this fashion, the remaining training oSDRs can be processed to form and update subclasses.
Pseudocode for the algorithm for forming subclasses is presented below. Here, sign_SDR denotes an array of signature SDRs whose element from the ith row and jth column captures the signature SDR of the jth subclass of class i, and is denoted as sign_SDR_ij. The element sign_SDR_ij is in turn defined as a linked list of oSDR_len tuples. Each tuple can have two parameters—offset and strength—that denote an offset in the signature SDR and its associated strength. In other words, the tuple {k, strength_ij[k]} corresponds to the kth offset in the signature SDR and stores its strength (i.e., strength_ij[k]). The tuples in the linked list can be arranged in ascending order in terms of the strength parameter. This helps in the efficient addition and deletion of offsets in the signature SDR.
In batch mode, the runtime complexity of the algorithm subroutine (i.e., that of the training phase) is O(|training_set| × num_subclasses_avg × oSDR_num_set_bits). Here, |training_set| denotes the number of samples in the training data, num_subclasses_avg denotes the average number of subclasses per class, and oSDR_num_set_bits denotes the number of set bits in each oSDR. In the streaming mode, the algorithm can be changed slightly to handle a single oSDR at a time. In this situation, the runtime complexity of the algorithm subroutine is O(num_subclasses_avg × oSDR_num_set_bits) for each oSDR.
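Since the referenced pseudocode is not reproduced above, the following is a hedged sketch of how a single training oSDR might be assigned to a subclass of its labeled class in streaming mode. The data layout (a set of offsets per subclass) and the omission of the strength-based signature update are illustrative simplifications.

import math

def supervised_train_sample(osdr, cls, subclasses, entry_threshold):
    # subclasses[cls] is a list of dicts, each with a 'signature' (set of
    # offsets) and a 'count' (number of contributing training oSDRs).
    min_olc = math.ceil(entry_threshold * len(osdr))
    best, best_olc = None, -1
    for sub in subclasses[cls]:
        olc = len(set(osdr) & sub['signature'])
        if olc > best_olc:
            best, best_olc = sub, olc
    if best is None or best_olc < min_olc:
        # No sufficiently similar subclass exists, so create a new one.
        subclasses[cls].append({'signature': set(osdr), 'count': 1})
    else:
        best['count'] += 1
        # The strength-based update of the subclass signature (retaining the
        # offsets with the highest strength) is omitted here for brevity.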
For a testing oSDR, an OLC value can be calculated for each subclass of every class. The subclass with the highest overlap can then be identified. If the OLC value is higher than a minimum threshold—referenced by test_min_overlap—then the parent class of the corresponding subclass is declared as the winner. The minimum threshold could be defined by a user or determined by the system. For example, the testing oSDR T={3,8,13} has the highest overlap score of three with Subclass 1 of Class 0, as shown in
Pseudocode of the algorithm for calculating OLC values of a testing oSDR with respect to the signature SDRs of different subclasses is below. The runtime complexity of the algorithm subroutine is O(num_classes × num_subclasses_avg × oSDR_num_set_bits).
As oSDR_num_set_bits tends to be a small number, the complexity of the algorithm subroutine mostly depends on the total number of subclasses. When the number of subclasses becomes substantial (e.g., exceeds several dozen), specialized hardware may be sought to accelerate processing. The algorithm subroutine is parallelizable, and therefore the runtime can be minimized to O((num_classes × num_subclasses_avg × oSDR_num_set_bits)/p), where p is the number of parallel threads that are utilized.
The approach to classification introduced here can satisfy the requirements of sophisticated AI applications. A definition for each subclass of each class in the training data can be maintained as a signature, for example, as defined in the SDR format. Choosing a low sparsity for the signature SDRs guarantees low-shot and noise-resilient learning capabilities as mathematical properties. Moreover, the signature SDRs can be easily explained in terms of combinations of ranges of feature values that the subclass definition is sensitive to, using the synaptic map of the neuromorphic model. The classification approach can also support continuous learning with low runtime complexity and memory requirements, as mentioned above. Learning a new class or subclass and its signature SDR can be done instantaneously, incrementally, and independently of the definitions learned for existing classes or subclasses. Similarly, if the definition of a subclass drifts—as indicated by drift in the set bits of the oSDRs of samples belonging to the subclass—then the signature SDR can be automatically updated based on the drift. Again, the computation may be incremental and independent of the definitions of other classes or subclasses, allowing for continuous learning in real time.
The first two stages of the software pipeline may be entirely unsupervised. In the preceding description of the third stage, the activities of the classifier are supervised. This, in effect, makes the entire software pipeline supervised. An unsupervised classification technique, which allows the software pipeline to train with samples in an unsupervised way, can be useful for continuous learning, especially where the model is designed and instructed to learn even when deployed “in the wild.”
For the purposes of classification, training oSDRs can be clustered by class, and a testing oSDR can be classified by identifying the cluster it lies closest to. The classifier can group all of the training oSDRs from a given class into clusters, one for each subclass. For example, in the Modified National Institute of Standards and Technology (MNIST) database of handwritten digits, there are many styles of any given handwritten digit. Each style can potentially be characterized as its own subclass. Further, each subclass definition can be represented with a unique signature SDR that captures the dominant semantics of the members of that subclass.
In the continuous learning mode, the system may not only predict the class of the testing oSDR serving as a sample, but also incrementally train on the same. This approach to training on testing samples can be completed in an unsupervised manner as the testing samples are not labeled. The continuous learning allows the system to track changes in the class and subclass definitions, or the emergence of new classes and subclasses, while deployed in a production environment or an end application. To support this, the system can both update known subclass definitions and create new subclass definitions under an unknown class.
The system may initially be trained in a supervised manner, for example, with the supervised_train subroutine described above. After deployment, the system can switch to an unsupervised mode of continuous learning, for example, through the use of the predict_and_train subroutine described below. Unlike the supervised_train subroutine, each testing oSDR can be matched against all of the existing subclasses of the existing classes and the unknown class. The subclass whose signature SDR has the highest overlap with the testing oSDR may be chosen as the prediction if and only if the OLC value is above the entry threshold. In the event that the OLC value exceeds the entry threshold, the corresponding class label p can be returned as the prediction.
Note that the predicted class could alternatively be unknown. In this scenario, the system may prompt the user (e.g., via an interface) and provide the characteristics of the subclass with which the testing oSDR had the highest overlap. This is enabled by the explainable nature of the model. The user may be permitted to provide a label for the subclass after unsupervised learning is complete. Said another way, the system may receive input indicative of a label provided by the user for the subclass. The label could be one of the known labels, in which case the input may be an indication of a selection from among the known labels, or the label could be entirely new. Thereafter, the subclass can be moved from the unknown class to the appropriate class based on the user-specified label. Alternatively, if the user does not provide a label, then no further action may be required for the testing oSDR. If the OLC value is lower than the entry threshold, then a new subclass of the unknown class can be created and a prediction of “unknown” can be returned. The handling of this prediction may be identical to the explanation set forth above.
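The predict_and_train subroutine referenced above is not reproduced in this text; the sketch below is a hedged approximation of the behavior described, using a set-of-offsets representation for signature SDRs and omitting the incremental update of a matched subclass definition.

import math

def predict_and_train(test_osdr, subclasses, entry_threshold):
    # subclasses: dict mapping a class label (including "unknown") to a list
    # of subclass signature SDRs, each stored as a set of offsets.
    min_olc = math.ceil(entry_threshold * len(test_osdr))
    best_label, best_olc = "unknown", -1
    for label, signatures in subclasses.items():
        for signature in signatures:
            olc = len(set(test_osdr) & signature)
            if olc > best_olc:
                best_label, best_olc = label, olc
    if best_olc >= min_olc:
        # The matched subclass definition could also be updated incrementally
        # here (omitted for brevity).
        return best_label
    # Otherwise, create a new subclass of the unknown class.
    subclasses.setdefault("unknown", []).append(set(test_osdr))
    return "unknown"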
The supervised classification technique set forth below improves upon the aforementioned classification techniques in several respects, notably where all training oSDRs are clustered based on their class labels and mutual similarity. With this classification technique, the classifier still generates a signature SDR to define a cluster and, like the aforementioned classification techniques, the signature SDR allows the software pipeline to meet the requirements of sophisticated AI applications. Here, however, the signature SDR is richer in semantic information, allowing fewer clusters or signatures per class. Consequently, the runtime of matching a new training oSDR (while training) or a new testing oSDR (while inferencing) against all existing signature SDRs is reduced significantly.
With the aforementioned classification techniques, the classifier can use the same number of set bits in the signature SDRs as in the oSDRs. Because the signature SDRs are very sparse, the amount of information captured in the signatures is limited. With this classification technique, the classifier can address this limitation by using a greater number of set bits in the signature SDR. When a subclass is first created, the signature SDR can have the same number of set bits as the oSDR. Thereafter, the number of set bits can adaptively increase, based on the subsequent members of the subclass, up to a maximum threshold. By increasing the number of set bits in the signature SDR, the chances of having high overlap with subsequent similar oSDRs increase. Consequently, the number of subclasses decreases, which has a significant effect on the runtime complexity of the algorithm executed by the system. With the aforementioned classification techniques, the number of generated signatures is directly proportional to the number of samples included in the training data. Each new sample is matched against all existing signatures, and therefore the runtime is quadratically dependent on the number of samples. With this classification technique, the number of signatures can be reduced by >8× for a given dataset, without any adverse effects to the third-wave properties. This leads to significant decreases in training runtime. In inferencing mode, each sample can be compared against all learned signatures, and an 8× reduction in the number of signatures leads to a commensurate reduction in inferencing runtime.
For succinctness in describing this classification technique, the following notations are used:
At step 2602, the system can identify the best subclass to add to. To accomplish this, the system can order the subclasses based on overlap with oSDR_i. Without loss of generality, assume that the mth subclass with signature sSDR_m is the subclass with the highest overlap. If two or more subclasses are tied for the highest score, then the tie can be broken by looking at the histograms of those subclasses. First, the overlap between oSDR_i and the offsets of columns with nonzero histogram heights is calculated, and the subclass with the highest overlap thus obtained is declared the winner. See how oSDR_7 is handled in the example set forth below. The generating and updating of the histograms are also further discussed below. If the tie persists, then for each tied subclass, the heights of the overlapping columns can be summed, and the subclass with the highest sum can be declared the winner. See how oSDR_10 is handled in the example set forth below. If the tie persists, then the tie can be broken arbitrarily.
At step 2603, the system can update the subclass signature. If no subclass was chosen in step 2601, then a new subclass can be created with sSDR = oSDR_i. Otherwise, if the subclass with sSDR_m is identified as the best subclass to which oSDR_i should be added, then the system can update the histogram of the subclass by incrementing the height of the columns corresponding to the offsets in oSDR_i by one. Thus, h_mo can be incremented by one for each offset o in oSDR_i.
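A hedged sketch of this subclass-update step is shown below. The exact rule for growing a signature toward the strongest histogram columns (up to v_max set bits) is an illustrative assumption, and the histogram-based tie-breaking described above is omitted.

def add_to_best_subclass(osdr, subclasses, c_min, v_max, osdr_len):
    # Each subclass is a dict with a 'signature' (set of offsets) and a
    # 'hist' (list of osdr_len column heights).
    best, best_olc = None, -1
    for sub in subclasses:
        olc = len(set(osdr) & sub['signature'])
        if olc > best_olc:
            best, best_olc = sub, olc
    if best is None or best_olc < c_min:
        # No sufficiently similar subclass exists: create one with sSDR = oSDR.
        hist = [0] * osdr_len
        for offset in osdr:
            hist[offset] = 1
        subclasses.append({'signature': set(osdr), 'hist': hist})
        return
    for offset in osdr:                       # update the histogram of the subclass
        best['hist'][offset] += 1
    # Grow the signature toward the strongest columns, up to v_max set bits.
    nonzero = [j for j in range(osdr_len) if best['hist'][j] > 0]
    nonzero.sort(key=lambda j: best['hist'][j], reverse=True)
    best['signature'] = set(nonzero[:v_max])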
For a testing oSDR, an overlap score can be calculated for each subclass of every class. The subclass with the highest OLC value can then be identified. If the OLC value is higher than an overlap threshold, then the parent class of the corresponding subclass is declared the winner. Otherwise, the system may return “unknown classification.”
Assume, for example, that all of the oSDRs are 16 bits long and have 4 set bits. Note that while the number of set bits need not be constant, a constant number has been selected to simplify the explanation. In such a scenario, l=16 and w_i=4. For entry into a subclass, an oSDR must have at least 3 common bits (i.e., c_min=3, or 75 percent overlap with the set bits in the oSDR). Again, for simplicity, this example concerns subclass generation for samples belonging to the same class. In this example, the maximum allowable number of set bits in a signature SDR is set to 6 (i.e., v_max=6).
This classification technique maintains a definition for each subclass of every class in the training data as a signature defined in the SDR format. Choosing a low maximum sparsity for the signature SDRs guarantees low-shot and noise-resilient capabilities as mathematical properties. These signature SDRs can be easily explained, in terms of combinations of ranges of feature values to which the subclass definitions are sensitive, using the synaptic map of the neuromorphic model. This classification technique can also support continuous learning with low runtime complexity and memory requirements as mentioned above. Learning a new class or subclass and its signature SDR can be done immediately, incrementally, and independently of the definitions of existing classes or subclasses. This may happen if, during the inferencing stage, a testing oSDR does not sufficiently overlap any of the known signature SDRs. In such a scenario, a new subclass of the unknown type can be created with the oSDR set as its signature SDR. Similarly, if the definition of a class or subclass drifts, then the signature SDR can be automatically updated to account for the drift. Again, the computation may be incremental and independent of the definition of other classes or subclasses, allowing for continuous learning in real time.
One of the key advantages of the system introduced here is the ability to explain results that are achieved. Simply put, the system can explain what was learned during training and why learning occurred. Similarly, the system can explain why a particular result was obtained.
As discussed above, the system—embodied as a software pipeline—has three processing elements, each of which is responsible for a different stage of processing. At a high level, an encoder may be responsible for creating input SDRs (iSDRs) from the raw data obtained as input, a processing unit (e.g., NPU) may be responsible for processing the iSDRs to learn patterns in the raw data, and a classifier may be responsible for taking output SDRs (oSDRs) produced by the processing unit and then creating signatures of the patterns found by the processing unit.
Each stage of the software pipeline is human understandable in the forward and backward directions, making the software pipeline “explainable” in terms of answering two important questions in ML, namely, why the machine learned what it did from the training data and why the model classified an input represented by testing data in a certain way.
At a high level, the encoder is representative of a transform function that transforms raw data into a bit vector in SDR format. Parameters—which may be user specified—for the encoder can be stored in the model state file, so that a user can understand, at any point in time, the parameters used in the transform and therefore can reverse the transform. The encoder may be configured to take continuous data (e.g., integers or floating point numbers) or discrete data (e.g., string variable or categorical variable). Regardless of data type, the encoder can convert raw data provided as input to a discrete spatial representation that is largely or entirely void of endian order by placing binary set bits (i.e., “ones”) in discrete “buckets.” The number of binary set bits in a bucket may be at least one, and there will typically be some amount of set-bit overlap between adjacent buckets. The overlap exists to reinforce semantic similarity (e.g., the number 1.2 is more semantically similar to the numbers 1.1 and 1.3 than to the number 5.9, and therefore the number 1.2 will be encoded into a bucket that has positional set-bit overlap with the buckets for the numbers 1.1 and 1.3). The discretization of data may create a small amount of uncertainty when reversing the transform applied by the encoder.
As discussed above, the NPU can learn spatial patterns in data through Hebbian-like learning using the biological concepts of neurons and synapses. Importantly, neurons whose synapses are best connected to the positions of the set bits of the input SDR (iSDR) can be selected to represent the pattern found in the iSDR and incorporated into an output SDR (oSDR) that is supplied to the classifier. The synaptic connections for each neuron can be stored in the model state file, making it possible to understand which bits were set in the iSDR for a given neuron. Because the identity of the neurons is known in the oSDR, it is possible to determine the iSDR set-bit positions from a group of neurons in the oSDR. However, there may be some uncertainty in the translation of the oSDR to the iSDR because not all “connected” synapses of a given neuron in the oSDR may have been connected to a set bit in the iSDR. The purpose of learning in the NPU is to tune the synaptic connections in response to the patterns of set bits seen in the iSDRs, and thus the synaptic connections of a trained model should be a very close, though not necessarily perfect, match to the bit patterns seen in the iSDR.
The classifier can learn classification signatures from the oSDRs produced by the NPU. Each oSDR can be evaluated for one of two actions: whether to use the oSDR to create a new signature or to include the oSDR in an existing signature. A threshold of Hamming distance may be used to determine which of these actions should be taken. In the scenario where no signature exists (i.e., the first oSDR), a new signature can be automatically created from that oSDR. When an oSDR is included with an existing signature, it may or may not alter the existing signature to some degree. In training mode, a classification label (e.g., provided in the raw data) can be presented to the classifier along with the oSDRs when training is supervised. When training is unsupervised, the oSDRs can be presented to the classifier without any labels. Finally, each class of data object (e.g., according to its label when supervised, or according to its clustering when unsupervised) can have more than one signature definition. For example, a given class may be associated with multiple signatures, each of which is representative of, and corresponds to, a different subclass of the given class.
In the present disclosure, it is explained how to use class signatures and subclass signatures to explain what the model has learned from the raw data, as well as to explain why the model made a particular classification for a given input. The former deals with training and understanding the trained model, while the latter deals with inferencing and understanding how a data observation aligns with the trained model.
To understand what the model has learned during training, class signatures and/or subclass signatures can be processed to find the original raw data ranges by using the known model parameters along with certain user parameters associated with explainability. Examples of model parameters include (i) neuron identifiers contained in the signature, as obtained from the classifier; (ii) synaptic connections for each signature, as obtained from the NPU; and (iii) set bits, sparsity, and window span used for encoding, as obtained from the encoder. Examples of user parameters include (i) the synapse threshold that defines a threshold percentage for synapses that are common between neurons and (ii) the bucket threshold that defines a threshold percentage for buckets that are common between connected synapses.
The explanation process may begin with a class signature or subclass signature from which a list of neuron identifiers can be collected. The neuron identifiers from the signature correspond to neurons in the NPU, where the synapses for each of the neurons can be identified and mapped onto a histogram by synapse number, as shown in
By applying the synapse threshold and determining a surviving set of synaptic connections, the pattern of synaptic connections learned by the NPU can be identified. Thereafter, attention may be turned to the encoder, converting the surviving synaptic connections to encoded bucket boundaries and then to raw data.
The remaining synapses can be processed using the settings of the encoder to get back to the raw data. To accomplish this, the encoder parameters can initially be used to recreate the bucket boundaries for each feature in the raw data. The term “bucket,” as used here, refers to the smallest discrete unit of encoding in the iSDR, though it often represents a range of continuous numbers from raw data.
The encoder bucket boundaries can be understood from the encoder model parameters of field width (i.e., the number of bit positions in an encoded field) and the number of binary set bits used during encoding. For a linear encoder, the number of buckets can be given as:
Number of Buckets=Field Width−Set Bits+1.
If, for example, the field width equals 300 bits and the number of set bits is 15, then the bucket width equals 15, the number of buckets equals 286 (i.e., 300−15+1), and the bucket overlap is 14 bit positions (i.e., 15−1).
With the bucket count and width identified, a graph can be constructed (e.g., by the NPU or host processing unit) that maps the number of synaptic connections for each bucket. The first bucket would sum the number of synapses connected in bit positions 1-15, the second bucket would sum the number of connections in bit positions 2-16, and so on until the last bucket summed the connections in bit positions 286-300.
With the synapse-to-bucket mapping completed, a final filter can be applied to remove unwanted or noninformative peaks from the results. This user parameter is called the bucket threshold, and it can be set at any level that yields the desired results. Continuing with the same example and setting the bucket threshold at 13, the system can obtain the final list of informative encoder buckets.
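A minimal sketch of the synapse-to-bucket mapping and bucket-threshold filter, using the linear-encoder geometry described above, is shown below. The function name and the use of an absolute count for the bucket threshold (rather than a percentage) are illustrative assumptions.

def synapses_to_buckets(synapse_positions, field_width, set_bits, bucket_threshold):
    # Linear-encoder geometry: bucket i (1-based) covers bit positions i .. i+set_bits-1,
    # so a synapse at position pos contributes to buckets pos-set_bits+1 .. pos.
    num_buckets = field_width - set_bits + 1
    counts = [0] * num_buckets
    for pos in synapse_positions:                     # 1-based positions of surviving synapses
        first = max(1, pos - set_bits + 1)
        last = min(num_buckets, pos)
        for bucket in range(first, last + 1):
            counts[bucket - 1] += 1
    # Keep only the buckets whose synapse count meets the bucket threshold.
    return [b + 1 for b, count in enumerate(counts) if count >= bucket_threshold]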
With the final list of encoder buckets, the system can now transform the buckets into raw data ranges. Using the stored encoder settings and minimum/maximum range values for each feature, the system can complete the process of explaining a learned signature in terms of the raw data that created it.
In addition to understanding what the model learned from the data, a user can also use the system to understand why a new observation has been classified by the model in a certain way.
Another method for displaying the results is to use a heat map for synaptic connections as shown in
The foregoing description of various embodiments of the claimed subject matter has been provided for the purposes of illustration and description. It is not intended to be exhaustive or to limit the claimed subject matter to the precise forms disclosed. Many modifications and variations will be apparent to one skilled in the art. Embodiments were chosen and described in order to best describe the principles of the invention and its practical applications, thereby enabling those skilled in the relevant art to understand the claimed subject matter, the various embodiments, and the various modifications that are suited to the particular uses contemplated.
Although the Detailed Description describes certain embodiments and the best mode contemplated, the technology can be practiced in many ways no matter how detailed the Detailed Description appears. Embodiments may vary considerably in their implementation details, while still being encompassed by the specification. Particular terminology used when describing certain features or aspects of various embodiments should not be taken to imply that the terminology is being redefined herein to be restricted to any specific characteristics, features, or aspects of the technology with which that terminology is associated. In general, the terms used in the following claims should not be construed to limit the technology to the specific embodiments disclosed in the specification, unless those terms are explicitly defined herein. Accordingly, the actual scope of the technology encompasses not only the disclosed embodiments, but also all equivalent ways of practicing or implementing the embodiments.
The language used in the specification has been principally selected for readability and instructional purposes. It may not have been selected to delineate or circumscribe the subject matter. It is therefore intended that the scope of the technology be limited not by this Detailed Description, but rather by any claims that issue on an application based hereon. Accordingly, the disclosure of various embodiments is intended to be illustrative, but not limiting, of the scope of the technology as set forth in the following claims.
This application claims priority to U.S. Provisional Application No. 63/227,590, titled “Explainable Machine Learning (ML) and Artificial Intelligence (AI) Methods and Systems Using Encoders, Neural Processing Units (NPUs), and Classifiers” and filed on Jul. 30, 2021, which is incorporated by reference herein in its entirety. This application is related to U.S. application Ser. No. 17/531,576, titled “Neural Processing Units (NPUs) and Computational Systems Employing the Same” and filed on Nov. 19, 2021, which is also incorporated by reference herein in its entirety.