Priority is claimed in the application data sheet to the following patents or patent applications, each of which is expressly incorporated herein by reference in its entirety:
The present invention relates to the field of artificial intelligence and machine learning, specifically to deep learning models for processing and generating data across various domains, including but not limited to language, time series, images, and audio.
In recent years, deep learning models have achieved remarkable success in numerous fields, such as natural language processing (NLP), computer vision, and speech recognition. One of the most prominent architectures is the Transformer. Transformers have become the foundation for state-of-the-art language models like BERT and GPT. Transformers typically process input data, such as text, by first converting tokens into dense vector representations using an embedding layer. Positional encoding is then added to preserve the order of the tokens. The embedded inputs are processed through self-attention mechanisms and feed-forward layers to capture dependencies and generate outputs.
However, the reliance on embedding and positional encoding layers limits the flexibility of Transformers in handling diverse data types beyond language. Moreover, the use of dense vector representations can be computationally intensive and memory-inefficient, especially for large-scale models.
What is needed is a new neural network model that can operate at a higher level of abstraction, using more compact and expressive representations that can efficiently capture the underlying patterns in the data. By removing the embedding and positional encoding layers from a Transformer, deep learning models can more efficiently process vast amounts of diverse information. The modified Transformer system should be flexible enough to handle various data modalities beyond just text and should enable seamless transfer learning across different languages and domains.
Accordingly, the inventor has conceived and reduced to practice a system and method for latent space dynamics with full-core joint learning. A Latent Transformer LCM system introduces an approach to data processing and generation by combining the power of Variational Autoencoders (VAEs) and Transformers. The system consists of several key components: a codeword allocator, which prepares and converts the input data into codewords; a codebook generation subsystem, which creates and maintains a codebook mapping the input data to codewords; a VAE encode subsystem, which compresses the codewords into a lower-dimensional latent space representation; a Latent Transformer subsystem, which processes the latent space vectors using a modified Transformer architecture without embedding and positional encoding layers; and a VAE decode subsystem, which reconstructs or generates data from the processed latent vectors. By leveraging the compressed latent space representation and the attention mechanism of the Transformer, the Latent Transformer LCM system can efficiently process and generate data across multiple modalities, opening up new possibilities for various applications. By operating directly on input vectors and input latent space vectors, the Latent Transformer LCM system allows for the removal of the embedding layer and positional encoding layer found in traditional transformer systems.
According to a preferred embodiment, a deep learning system with a latent transformer core and a latent dynamics analyzer, comprising one or more computers with executable instructions that, when executed, cause the deep learning system to: receive a plurality of input vectors; generate a plurality of latent space vectors by processing the plurality of input vectors through a variational autoencoder's encoder; replicate the latent space vectors to create two latent space vector copies; process a first copy of the latent space vectors through a latent transformer to generate predictions; process a second copy of the latent space vectors through a latent dynamics analyzer to derive equations of motion for the latent space vectors; generate output vectors by passing the plurality of generated predictions through a variational autoencoder's decoder; and use the derived equations of motion to update a plurality of attention mechanisms within the latent transformer or to detect anomalies in latent space dynamics, is disclosed.
According to another preferred embodiment, a method for a deep learning system with a latent transformer core and a latent dynamics analyzer, comprising the steps of: receiving a plurality of input vectors; generating a plurality of latent space vectors by processing the plurality of input vectors through a variational autoencoder's encoder; replicating the latent space vectors to create two latent space vector copies; processing a first copy of the latent space vectors through a latent transformer to generate predictions; processing a second copy of the latent space vectors through a latent dynamics analyzer to derive equations of motion for the latent space vectors; generating output vectors by passing the plurality of generated predictions through a variational autoencoder's decoder; and using the derived equations of motion to update a plurality of attention mechanisms within the latent transformer or to detect anomalies in latent space dynamics, is disclosed.
According to another preferred embodiment, a non-transitory, computer-readable storage media having computer-executable instructions embodied thereon that, when executed by one or more processors of a computing system employing a deep learning system with a latent transformer core and a latent dynamics analyzer, cause the computing system to: receive a plurality of input vectors; generate a plurality of latent space vectors by processing the plurality of input vectors through a variational autoencoder's encoder; replicate the latent space vectors to create two latent space vector copies; process a first copy of the latent space vectors through a latent transformer to generate predictions; process a second copy of the latent space vectors through a latent dynamics analyzer to derive equations of motion for the latent space vectors; generate output vectors by passing the plurality of generated predictions through a variational autoencoder's decoder; and use the derived equations of motion to update a plurality of attention mechanisms within the latent transformer or to detect anomalies in latent space dynamics, is disclosed.
According to an aspect of an embodiment, the latent dynamics analyzer comprises: a temporal encoding layer; a neural ordinary differential equation (ODE) module; a symbolic regression network; an equation decoder; and a physics-informed regularization module.
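For illustration only, the following is a minimal PyTorch sketch of how a latent dynamics analyzer's neural ODE module might derive approximate equations of motion for latent space vectors. The class name, layer sizes, and the use of fixed-step Euler integration (standing in for a dedicated ODE solver, the symbolic regression network, the equation decoder, and the physics-informed regularization module) are assumptions made for the sketch, not the disclosed implementation.

```python
import torch
import torch.nn as nn

class LatentDynamicsAnalyzer(nn.Module):
    """Illustrative sketch: approximates equations of motion dz/dt = f(z) for
    latent space vectors with a small neural ODE integrated by Euler steps."""
    def __init__(self, latent_dim: int, hidden_dim: int = 64):
        super().__init__()
        # Neural ODE module: parameterizes the latent velocity field f(z)
        self.ode_func = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, latent_dim),
        )

    def forward(self, z0: torch.Tensor, steps: int = 10, dt: float = 0.1) -> torch.Tensor:
        # Roll the latent state forward in time with fixed-step Euler integration
        trajectory = [z0]
        z = z0
        for _ in range(steps):
            z = z + dt * self.ode_func(z)
            trajectory.append(z)
        return torch.stack(trajectory, dim=1)   # (batch, steps + 1, latent_dim)

# Hypothetical usage: compare one-step predictions against an observed latent
# trajectory; large residuals can be flagged as anomalies in latent dynamics.
analyzer = LatentDynamicsAnalyzer(latent_dim=32)
z_seq = torch.randn(8, 20, 32)                               # observed latent trajectory
z_next_pred = z_seq[:, :-1] + 0.1 * analyzer.ode_func(z_seq[:, :-1])
anomaly_score = (z_next_pred - z_seq[:, 1:]).norm(dim=-1)    # per-step prediction residual
```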
According to an aspect of an embodiment, the input vectors may contain a plurality of appended metadata.
According to an aspect of an embodiment, the executable instructions further cause the system to: generate alerts or signals when substantial changes in the underlying system dynamics are detected.
According to an aspect of an embodiment, the input vectors comprise market data, and the system is configured to analyze and predict financial market behavior.
According to an aspect of an embodiment, the executable instructions further cause the system to: perform end-to-end training of the entire system by computing a total loss function.
The inventor has conceived, and reduced to practice, a system and method for latent space dynamics with full-core joint learning. The Latent Transformer Large Codeword Model (LCM) system processes, analyzes, and generates data across various domains, including time series, text, images, and more. At its core, the system utilizes a combination of codeword allocation, Variational Autoencoder (VAE) encoding, and transformer-based learning to capture and leverage the underlying patterns, dependencies, and relationships within the data. The system begins by collecting a plurality of inputs and converting them into sourceblocks, which are discrete units of information that capture the essential characteristics of the data. These sourceblocks are then assigned codewords based on a codebook generated by a dedicated subsystem, creating a compressed and efficient representation of the input data. The codewords are further processed to create input vectors, which include a truncated data set, a sequence of zeros, and optionally, a metadata portion that provides additional context about the data type and characteristics.
The input vectors are then passed through a VAE encoder subsystem, which maps them into a lower-dimensional latent space, capturing the essential features and patterns in a compact representation. The latent space vectors serve as the input to a transformer-based learning component, which leverages self-attention mechanisms to uncover and learn the complex relationships and dependencies between the vectors. By analyzing the relationships in the latent space, the transformer can generate accurate predictions or outputs, particularly for tasks involving sequential or time-dependent data. The system can also incorporate metadata information to establish more targeted and context-aware relationships, enhancing the quality and accuracy of the generated results. Through iterative processing and learning, the Latent Transformer LCM system becomes a powerful tool for various data-driven applications, enabling efficient compression, analysis, prediction, and generation of data across multiple domains.
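For illustration only, the following is a minimal PyTorch sketch of the data flow described above: codeword input vectors are compressed by a VAE encoder into latent space vectors, processed by a transformer core that operates directly on those vectors (with no embedding or positional encoding layers), and reconstructed by a VAE decoder. All module names, dimensions, and layer choices are assumptions made for the sketch.

```python
import torch
import torch.nn as nn

class VAEEncoder(nn.Module):
    """Maps codeword input vectors to latent space vectors (illustrative sizes)."""
    def __init__(self, in_dim, latent_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU())
        self.mu = nn.Linear(256, latent_dim)
        self.logvar = nn.Linear(256, latent_dim)
    def forward(self, x):
        h = self.net(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization
        return z, mu, logvar

class LatentTransformer(nn.Module):
    """Transformer core operating directly on latent vectors -- no embedding
    or positional encoding layers."""
    def __init__(self, latent_dim, nhead=4, num_layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=latent_dim, nhead=nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
    def forward(self, z_seq):
        return self.encoder(z_seq)

class VAEDecoder(nn.Module):
    """Reconstructs or generates output vectors from processed latent vectors."""
    def __init__(self, latent_dim, out_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, out_dim))
    def forward(self, z):
        return self.net(z)

# Hypothetical end-to-end pass over a batch of codeword input vectors
x = torch.randn(8, 16, 1000)                         # (batch, sequence, codeword vector)
enc, core, dec = VAEEncoder(1000, 64), LatentTransformer(64), VAEDecoder(64, 1000)
z, mu, logvar = enc(x)
out = dec(core(z))                                   # reconstructed / generated vectors
```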
One or more different aspects may be described in the present application. Further, for one or more of the aspects described herein, numerous alternative arrangements may be described; it should be appreciated that these are presented for illustrative purposes only and are not limiting of the aspects contained herein or the claims presented herein in any way. One or more of the arrangements may be widely applicable to numerous aspects, as may be readily apparent from the disclosure. In general, arrangements are described in sufficient detail to enable those skilled in the art to practice one or more of the aspects, and it should be appreciated that other arrangements may be utilized and that structural, logical, software, electrical and other changes may be made without departing from the scope of the particular aspects. Particular features of one or more of the aspects described herein may be described with reference to one or more particular aspects or figures that form a part of the present disclosure, and in which are shown, by way of illustration, specific arrangements of one or more of the aspects. It should be appreciated, however, that such features are not limited to usage in the one or more particular aspects or figures with reference to which they are described. The present disclosure is neither a literal description of all arrangements of one or more of the aspects nor a listing of features of one or more of the aspects that must be present in all arrangements.
Headings of sections provided in this patent application and the title of this patent application are for convenience only, and are not to be taken as limiting the disclosure in any way.
Devices that are in communication with each other need not be in continuous communication with each other, unless expressly specified otherwise. In addition, devices that are in communication with each other may communicate directly or indirectly through one or more communication means or intermediaries, logical or physical.
A description of an aspect with several components in communication with each other does not imply that all such components are required. To the contrary, a variety of optional components may be described to illustrate a wide variety of possible aspects and in order to more fully illustrate one or more aspects. Similarly, although process steps, method steps, algorithms or the like may be described in a sequential order, such processes, methods and algorithms may generally be configured to work in alternate orders, unless specifically stated to the contrary. In other words, any sequence or order of steps that may be described in this patent application does not, in and of itself, indicate a requirement that the steps be performed in that order. The steps of described processes may be performed in any order practical. Further, some steps may be performed simultaneously despite being described or implied as occurring non-simultaneously (e.g., because one step is described after the other step). Moreover, the illustration of a process by its depiction in a drawing does not imply that the illustrated process is exclusive of other variations and modifications thereto, does not imply that the illustrated process or any of its steps are necessary to one or more of the aspects, and does not imply that the illustrated process is preferred. Also, steps are generally described once per aspect, but this does not mean they must occur once, or that they may only occur once each time a process, method, or algorithm is carried out or executed. Some steps may be omitted in some aspects or some occurrences, or some steps may be executed more than once in a given aspect or occurrence.
When a single device or article is described herein, it will be readily apparent that more than one device or article may be used in place of a single device or article. Similarly, where more than one device or article is described herein, it will be readily apparent that a single device or article may be used in place of the more than one device or article.
The functionality or the features of a device may be alternatively embodied by one or more other devices that are not explicitly described as having such functionality or features. Thus, other aspects need not include the device itself.
Techniques and mechanisms described or referenced herein will sometimes be described in singular form for clarity. However, it should be appreciated that particular aspects may include multiple iterations of a technique or multiple instantiations of a mechanism unless noted otherwise. Process descriptions or blocks in figures should be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps in the process. Alternate implementations are included within the scope of various aspects in which, for example, functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those having ordinary skill in the art.
As used herein, “sourceblock” refers to a semantically meaningful unit of text that is derived from the input data through a process called syntactic splitting. Syntactic splitting involves breaking down the input text into smaller chunks along syntactic boundaries, such as those between words or tokens. These resulting chunks, or sourceblocks, serve as the basic units of representation in LCMs, replacing the traditional word or subword tokens used in Large Language Models (LLMs). Each sourceblock is then assigned a unique codeword from a codebook, which allows for efficient compression and processing of the text data. By preserving syntactic and semantic information within sourceblocks, LCMs aim to capture the inherent structure and meaning of the language more effectively while achieving higher compression ratios compared to LLMs.
As used herein, “machine learning core” refers to the central component responsible for processing and learning from the codeword representations derived from the input data. This core can consist of one or more machine learning architectures, working individually or in combination, to capture the patterns, relationships, and semantics within the codeword sequences. Some common architectures that can be employed in the machine learning core of LCMs include but are not limited to transformers, variational autoencoders (VAEs), recurrent neural networks (RNNs), convolutional neural networks (CNNs), and attention mechanisms. These architectures can be adapted to operate directly on the codeword representations, with or without the need for traditional dense embedding layers. The machine learning core learns to map input codeword sequences to output codeword sequences, enabling tasks such as language modeling, text generation, and classification. By leveraging the compressed and semantically rich codeword representations, the machine learning core of LCMs can potentially achieve more efficient and effective learning compared to traditional token-based models. The specific choice and configuration of the machine learning architectures in the core can be tailored to the characteristics of the input data and the desired output tasks, allowing for flexibility and adaptability in the design of LCMs.
As used herein, “codeword” refers to a discrete and compressed representation of a sourceblock, which is a meaningful unit of information derived from the input data. Codewords are assigned to sourceblocks based on a codebook generated by a codebook generation system. The codebook contains a mapping between the sourceblocks and their corresponding codewords, enabling efficient representation and processing of the data. Codewords serve as compact and encoded representations of the sourceblocks, capturing their essential information and characteristics. They are used as intermediate representations within the LCM system, allowing for efficient compression, transmission, and manipulation of the data.
The system is fed a data input 100, which represents the raw data that needs to be processed and analyzed. This data can come from various sources and domains, such as time series, text, images, or any other structured or unstructured format. The data input 100 is fed into a data preprocessor 110, which is responsible for cleaning, transforming, and preparing the data for further processing. The data preprocessor 110 may perform tasks such as normalization, feature scaling, missing value imputation, or any other necessary preprocessing steps to ensure the data is in a suitable format for the machine learning core 120.
Once the data is preprocessed, it is passed to a latent transformer machine learning core 120. The machine learning core 120 employs advanced techniques such as self-attention mechanisms and multi-head attention to learn the intricate patterns and relationships within the data. It operates in a latent space, where the input data is encoded into a lower-dimensional representation that captures the essential features and characteristics. By working in this latent space, the machine learning core 120 can efficiently process and model the data, enabling it to generate accurate and meaningful outputs.
The generated outputs from the machine learning core 120 are then passed through a data post processor 130. The data post processor 130 is responsible for transforming the generated outputs into a format that is suitable for the intended application or user. It may involve tasks such as denormalization, scaling back to the original data range, or any other necessary post-processing steps to ensure the outputs are interpretable and usable.
The processed outputs are provided as a generated output 190, which represents the final result of the latent transformer LCM system. The generated output 190 can take various forms, depending on the specific task and domain. It could be predicted values for time series forecasting, generated text for language modeling, synthesized images for computer vision tasks, or any other relevant output format.
To train and optimize the latent transformer machine learning core 120, the system includes a machine learning training system 600. The training system 600 is responsible for updating the parameters and weights of the machine learning core 120 based on the observed performance and feedback. The training system 600 receives outputs from the machine learning core 120 and processes them to be reinserted back through the machine learning core 120 as a testing and training data set. After processing the testing and training data set, the machine learning core 120 may output a testing and training output data set. This output may be passed through a loss function 607. The loss function 607 may be employed to measure the discrepancy between the generated outputs and the desired outcomes. The loss function 607 quantifies the error or dissimilarity between the predictions and the ground truth, providing a signal for the system to improve its performance.
The training process is iterative, where the system generates outputs, compares them to the desired outcomes using the loss function 607, and adjusts the parameters of the machine learning core 120 accordingly.
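For illustration only, a minimal sketch of one such training iteration is shown below in PyTorch. The choice of mean squared error as the loss function 607, the Adam optimizer, and the tensor shapes are assumptions made for the sketch.

```python
import torch
import torch.nn as nn

# Stand-ins for the machine learning core 120 and loss function 607 (illustrative only)
core = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 64))
loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(core.parameters(), lr=1e-3)

for step in range(100):
    inputs = torch.randn(32, 64)          # testing and training data set
    targets = torch.randn(32, 64)         # desired outcomes / ground truth
    predictions = core(inputs)
    loss = loss_fn(predictions, targets)  # discrepancy between outputs and targets
    optimizer.zero_grad()
    loss.backward()                       # signal used to adjust the core's parameters
    optimizer.step()
```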
Through the iterative training process, the latent transformer machine learning core 120 learns to capture the underlying patterns and relationships in the data, enabling it to generate accurate and meaningful outputs. The training process aims to minimize the loss and improve the system's performance over time, allowing it to adapt and generalize to new and unseen data.
The data preprocessor 110 receives the raw input data and applies a series of transformations and operations to clean, normalize, and convert the data into a format that can be efficiently processed by the subsequent components of the system. The preprocessing pipeline includes, but is not limited to, subcomponents such as a data tokenizer, a data normalizer, a codeword allocator, and a sourceblock generator. A data tokenizer 111 is responsible for breaking down the input data into smaller, meaningful units called tokens. The tokenization process varies depending on the type of data being processed. For textual data, the tokenizer may split the text into individual words, subwords, or characters. For time series data, the tokenizer may divide the data into fixed-length windows or segments. The goal of tokenization is to convert the raw input into a sequence of discrete tokens that can be further processed by the system.
A data normalizer 112 is responsible for scaling and normalizing the input data to ensure that it falls within a consistent range. Normalization techniques, such as min-max scaling or z-score normalization, are applied to the data to remove any biases or variations in scale. Normalization helps in improving the convergence and stability of the learning process, as it ensures that all features or dimensions of the data contribute equally to the learning algorithm. A codeword allocator 113 assigns unique codewords to each token generated by the data tokenizer 111. Additionally, codewords may be directly assigned to sourceblocks that are generated from inputs rather than from tokens. The codewords are obtained from a predefined codebook, which is generated and maintained by the codebook generation system 140. The codebook contains a mapping between the tokens and their corresponding codewords, enabling efficient representation and processing of the data. The codeword allocator 113 replaces each token, sourceblock, or input with its assigned codeword, creating a compressed and encoded representation of the input data.
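For illustration only, the following is a minimal sketch of the two normalization techniques named above (min-max scaling and z-score normalization) as the data normalizer 112 might apply them; the NumPy implementation and the sample values are assumptions made for the sketch.

```python
import numpy as np

def min_max_scale(x: np.ndarray) -> np.ndarray:
    """Scale each feature to the [0, 1] range (min-max scaling)."""
    return (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0) + 1e-12)

def z_score(x: np.ndarray) -> np.ndarray:
    """Standardize each feature to zero mean and unit variance (z-score normalization)."""
    return (x - x.mean(axis=0)) / (x.std(axis=0) + 1e-12)

# Hypothetical usage on a small batch of raw feature vectors
raw = np.array([[10.0, 200.0], [12.0, 180.0], [11.0, 220.0]])
normalized = z_score(raw)   # all features now contribute on a comparable scale
```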
A sourceblock generator 114 combines the codewords assigned by the codeword allocator 113 into larger units called sourceblocks. Sourceblocks are formed by grouping together a sequence of codewords based on predefined criteria, such as a fixed number of codewords or semantic coherence. The formation of sourceblocks helps in capturing higher-level patterns and relationships within the data, as well as reducing the overall sequence length for more efficient processing by the latent transformer machine learning core 120.
A codebook generation system 140 is a component that works in conjunction with the data preprocessor 110. It is responsible for creating and maintaining the codebook used by the codeword allocator 113. The codebook is generated based on the statistical properties and frequency of occurrence of the tokens in the training data. It aims to assign shorter codewords to frequently occurring tokens and longer codewords to rare tokens, optimizing the compression and representation of the data.
After the data has undergone the preprocessing steps performed by the data preprocessor 110, the resulting output is the latent transformer input 115. The latent transformer input 115 represents the preprocessed and encoded data that is ready to be fed into the latent transformer machine learning core 120 for further processing and learning.
When dealing with time series prediction, the codeword allocator 113 may take a sequence of time series data points as input. In one example the input sequence consists of 1000 data points. The codeword allocator 113 performs the necessary data preparation steps to create a suitable input vector for the autoencoder. It truncates the last 50 data points from the input sequence, resulting in a sequence of 950 elements. This truncated sequence represents the historical data that will be used to predict the future values. The codeword allocator 113 then creates a 1000-element vector, where the first 950 elements are the truncated sequence, and the last 50 elements are filled with zeros. This input vector serves as the input to the Variational Autoencoder Encoder Subsystem 150, which compresses the data into a lower-dimensional latent space representation.
By performing this data preparation step, the codeword allocator 113 ensures that the input data is in a format that is compatible with the autoencoder's training process. During training, the autoencoder learns to reconstruct the complete 1000-element sequence from the truncated input vector. By setting the last 50 elements to zero, the autoencoder is forced to learn the patterns and dependencies in the historical data and use that information to predict the missing values. This approach enables the Latent Transformer LCM system to effectively handle time series prediction tasks by leveraging the power of autoencoders and the compressed latent space representation.
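For illustration only, the following is a minimal sketch of this preparation step for the 1000-point example above: the last 50 data points are replaced with zeros so that the autoencoder learns to predict them from the 950 historical values. The function name and NumPy implementation are assumptions made for the sketch.

```python
import numpy as np

def prepare_input_vector(series: np.ndarray, total_len: int = 1000, horizon: int = 50) -> np.ndarray:
    """Build the autoencoder input: historical values followed by zeros in the
    positions the model is trained to predict (the last `horizon` elements)."""
    assert len(series) == total_len
    vector = np.zeros(total_len, dtype=series.dtype)
    vector[: total_len - horizon] = series[: total_len - horizon]   # 950 historical points
    return vector

# Hypothetical usage with a 1000-point series
series = np.sin(np.linspace(0, 20 * np.pi, 1000))
input_vector = prepare_input_vector(series)   # last 50 entries are zero
target_vector = series                        # the autoencoder reconstructs all 1000 points
```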
The codeword allocator 113 may split the incoming data input 100 into meaningful units called sourceblocks. This process, known as semantic splitting, aims to capture the inherent structure and patterns in the data. The allocator 113 may employ various techniques to identify the optimal sourceblocks, such as rule-based splitting, statistical methods, or machine learning approaches. In one embodiment, the codeword allocator 113 may utilize Huffman coding to split the data into sourceblocks. The Huffman coding-based allocator enables efficient and semantically meaningful splitting of the input data into sourceblocks. Huffman coding is a well-known data compression algorithm that assigns variable-length codes to symbols based on their frequency of occurrence. In the context of the LCM, the Huffman coding-based allocator adapts this principle to perform semantic splitting of the input data.
With Huffman coding, the allocator 113 starts by analyzing the input data and identifying the basic units of meaning, such as words, phrases, or subwords, depending on the specific data modality and the desired level of granularity. This process may not be necessary for numerical or time series data sets. These basic units form the initial set of sourceblocks. The codeword allocator 113 then performs a frequency analysis of the sourceblocks, counting the occurrences of each sourceblock in the input data. Based on the frequency analysis, the allocator 113 constructs a Huffman tree, which is a binary tree that represents the probability distribution of the sourceblocks. The Huffman tree is built by iteratively combining the two least frequent sourceblocks into a single node, assigning binary codes to the branches, and repeating the process until all sourceblocks are included in the tree. The resulting Huffman tree has the property that sourceblocks with higher frequencies are assigned shorter codes, while sourceblocks with lower frequencies are assigned longer codes.
The Huffman coding-based codeword allocator 113 then uses the constructed Huffman tree to perform semantic splitting of the input data. It traverses the input data and matches the sequences of symbols against the sourceblocks represented in the Huffman tree. When a sourceblock is identified, the allocator 113 assigns the corresponding Huffman code to that sourceblock, effectively compressing the data while preserving its semantic structure. The use of Huffman coding for semantic splitting offers several advantages. It allows for variable-length sourceblocks, enabling the codeword allocator 113 to capture meaningful units of varying sizes. This is particularly useful for handling data with different levels of complexity and granularity, such as text with compound words or images with hierarchical structures.
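For illustration only, the following is a minimal sketch of the frequency analysis and Huffman tree construction described above, using Python's standard heapq module. Splitting on whitespace stands in for the syntactic/semantic splitter, and the sample text is an assumption made for the sketch.

```python
import heapq
from collections import Counter

def huffman_codebook(sourceblocks):
    """Build variable-length binary codewords from sourceblock frequencies.
    Frequent sourceblocks receive shorter codes; rare ones receive longer codes."""
    freq = Counter(sourceblocks)
    # Each heap entry: (frequency, tie-breaker, [(sourceblock, code), ...])
    heap = [(f, i, [(s, "")]) for i, (s, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    if len(heap) == 1:                        # degenerate single-symbol case
        return {heap[0][2][0][0]: "0"}
    counter = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)     # two least frequent nodes
        f2, _, right = heapq.heappop(heap)
        merged = [(s, "0" + c) for s, c in left] + [(s, "1" + c) for s, c in right]
        heapq.heappush(heap, (f1 + f2, counter, merged))
        counter += 1
    return dict(heap[0][2])

# Hypothetical usage: whitespace splitting stands in for the syntactic splitter
text = "the prince and the princess and the estates of the family"
codebook = huffman_codebook(text.split())
encoded = "".join(codebook[s] for s in text.split())   # compressed, prefix-free bitstring
```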
After the sourceblock generation process, the codeword allocator 113 assigns a unique codeword to each sourceblock. The codewords are discrete, compressed representations of the sourceblocks, designed to capture the essential information in a compact form. The codeword allocator can use various mapping schemes to assign codewords to sourceblocks, such as hash functions, lookup tables, or learned mappings. For example, a simple approach could be to use a hash function that maps each sourceblock to a fixed-length binary code. Alternatively, another approach may involve learning a mapping function that assigns codewords based on the semantic similarity of the sourceblocks.
The codebook generation subsystem 140 is responsible for creating and maintaining the codebook, which is a collection of all the unique codewords used by the LCM. The codebook can be generated offline, before the actual processing begins, or it can be updated dynamically as new sourceblocks are encountered during processing. The codebook generation subsystem can use various techniques to create a compact and efficient codebook, such as frequency-based pruning, clustering, or vector quantization. The size of the codebook can be adjusted based on the desired trade-off between compression and information preservation. Going back to the War and Peace example, the string of sourceblocks ['Well', ',', 'Prince', ',', 'so', 'Gen', 'oa', 'and', 'Luc', 'ca', 'are', 'now', 'just', 'family', 'estates', 'of', 'the', 'Buon', 'apar', 'tes', '.'] may be given codewords such as [12, 5, 78, 5, 21, 143, 92, 8, 201, 45, 17, 33, 49, 62, 87, 11, 2, 179, 301, 56, 4], where each sourceblock is assigned a unique codeword, which is represented as an integer. The mapping between sourceblocks and codewords is determined by the codebook generated by the LCM system.
Once the input data is allocated codewords, it is passed through the Variational Autoencoder Encoder Subsystem 150. This subsystem utilizes a VAE encoder to compress the codewords into a lower-dimensional latent space representation. The VAE encoder learns to capture the essential features and variations of the input data, creating compact and informative latent space vectors. The machine learning training system 600 is responsible for training the VAE encoder using appropriate objective functions and optimization techniques.
The latent space vectors generated by the VAE encoder are then fed into the Latent Transformer Subsystem 170. This subsystem is a modified version of the traditional Transformer architecture, where the embedding and positional encoding layers are removed. By operating directly on the latent space vectors, the Latent Transformer can process and generate data more efficiently, without the need for explicit embedding or positional information. The Transformer Training System 171 is used to train the Latent Transformer, leveraging techniques such as self-attention and multi-head attention to capture dependencies and relationships within the latent space.
The Latent Transformer comprises several key components. Latent space vectors may be passed directly through a multi-head attention mechanism. The multi-head attention mechanism, which is the core building block of the Transformer, allows the model to attend to different parts of the input sequence simultaneously, capturing complex dependencies and relationships between codewords. Feed-forward networks are used to introduce non-linearity and increase the expressive power of the model. Residual connections and layer normalization are employed to facilitate the flow of information and stabilize the training process.
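For illustration only, the following is a minimal PyTorch sketch of a single latent transformer layer assembled from the components named above (multi-head self-attention, a feed-forward network, residual connections, and layer normalization), applied directly to latent space vectors. The dimensions and class name are assumptions made for the sketch.

```python
import torch
import torch.nn as nn

class LatentTransformerBlock(nn.Module):
    """One latent transformer layer: multi-head self-attention and a feed-forward
    network, each wrapped with a residual connection and layer normalization.
    Latent space vectors are attended to directly -- there is no embedding layer
    and no positional encoding."""
    def __init__(self, latent_dim: int = 64, nhead: int = 4, ff_dim: int = 256):
        super().__init__()
        self.attn = nn.MultiheadAttention(latent_dim, nhead, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(latent_dim, ff_dim), nn.ReLU(), nn.Linear(ff_dim, latent_dim))
        self.norm1 = nn.LayerNorm(latent_dim)
        self.norm2 = nn.LayerNorm(latent_dim)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        attn_out, _ = self.attn(z, z, z)     # self-attention over the latent sequence
        z = self.norm1(z + attn_out)         # residual connection + layer norm
        z = self.norm2(z + self.ff(z))       # feed-forward, residual + layer norm
        return z

# Hypothetical usage on a batch of latent vector sequences
z = torch.randn(8, 16, 64)                   # (batch, sequence length, latent_dim)
out = LatentTransformerBlock()(z)
```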
The Latent Transformer-based core can be implemented using an encoder-decoder architecture. The encoder processes the input codewords and generates contextualized representations, while the decoder takes the encoder's output and generates the target codewords or the desired output sequence. The encoder and decoder are composed of multiple layers of multi-head attention and feed-forward networks, allowing for deep and expressive processing of the codeword representations.
One of the key advantages of the Transformer in the LCM architecture is its ability to capture long-range dependencies between codewords. Unlike recurrent neural networks (RNNs), which process the input sequentially, the Transformer can attend to all codewords in parallel, enabling it to effectively capture relationships and dependencies that span across the entire input sequence. This is useful for processing long and complex data sequences, where capturing long-range dependencies is crucial for understanding the overall context. Another advantage of the Transformer-based core is its parallelization capability. The self-attention mechanism in the Transformer allows for efficient parallel processing of the codewords on hardware accelerators like GPUs. This parallelization enables faster training and inference times, making the LCM architecture suitable for processing large amounts of data in real-time applications.
The Latent Transformer-based core also generates contextualized representations of the codewords, where each codeword's representation is influenced by the surrounding codewords in the input sequence. This contextualization allows the model to capture the semantic and syntactic roles of the codewords based on their context, enabling a deeper understanding of the relationships and meanings within the data. The scalability of the Transformer-based core is another significant advantage in the LCM architecture. By increasing the number of layers, attention heads, and hidden dimensions, the Transformer can learn more complex patterns and representations from large-scale datasets. This scalability has been demonstrated by models like GPT-3, which has billions of parameters and can perform a wide range of tasks with impressive performance.
After being processed by the Latent Transformer, the latent space vectors are passed through the Variational Autoencoder Decode Subsystem 180. The VAE decoder takes the processed latent vectors and reconstructs the original data or generates new data based on the learned representations. The machine learning training subsystem 600 is responsible for training the VAE decoder to accurately reconstruct or generate data from the latent space. In some embodiments, the Decode Subsystem 180 may be used to create time series predictions about a particular data input.
The reconstructed or generated data is then output 190, which can be in the same format as the original input data or in a different modality altogether. This flexibility allows the Latent Transformer LCM to handle various tasks, such as data compression, denoising, anomaly detection, and data generation, across multiple domains.
Moreover, the modular design of the system enables each subsystem to be trained independently or jointly, depending on the specific requirements and available resources. The machine learning training system 600 may provide the necessary mechanisms to optimize the performance of each component and ensure the overall effectiveness of the Latent Transformer LCM.
The input to the Latent Transformer Subsystem 170 is provided by a VAE Encoder Subsystem 150. The VAE Encoder Subsystem 150 is responsible for encoding the preprocessed input data into a lower-dimensional latent space representation. An input is passed through the VAE Encoder Subsystem 150, which learns to compress the data into a compact latent space representation while preserving the essential features and characteristics of the input. Latent space vectors produced by the VAE Encoder Subsystem 150 may be further processed by an expander 151, which increases the dimensionality of the input data to a point where the vectors can be efficiently processed by the Latent Transformer Subsystem 170.
The latent space representation generated by the VAE Encoder Subsystem 150 serves as the input to the Latent Transformer Subsystem 170. The Latent Transformer Subsystem 170 operates in this latent space, leveraging the compressed and informative representation to learn the complex patterns and relationships within the data. By working in the latent space, the Latent Transformer Subsystem 170 can efficiently process and model the data, capturing the intricate dependencies and generating accurate and meaningful outputs.
Once the Latent Transformer Subsystem 170 has processed the latent space representation, the generated output is passed through the VAE Decoder Subsystem 180. The VAE Decoder Subsystem 180 is responsible for decoding the latent space representation back into the original data space. Prior to processing by the VAE Decoder Subsystem 180, Latent Transformer Subsystem outputs may be passed through a compressor 152, which compresses them back to the original size they had before being processed by the expander 151. The VAE Decoder Subsystem 180 learns to reconstruct the original data from the latent space representation, ensuring that the generated output is coherent and meaningful.
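For illustration only, the following is a minimal PyTorch sketch of the expander 151 / compressor 152 pair wrapped around the latent transformer core; the use of simple linear projections and the specific dimensions are assumptions made for the sketch.

```python
import torch
import torch.nn as nn

# Illustrative expander 151 / compressor 152 pair: the expander raises the latent
# dimensionality for efficient processing by the transformer, and the compressor
# restores the original latent size expected by the VAE decoder.
latent_dim, model_dim = 32, 128
expander = nn.Linear(latent_dim, model_dim)
compressor = nn.Linear(model_dim, latent_dim)

layer = nn.TransformerEncoderLayer(d_model=model_dim, nhead=4, batch_first=True)
core = nn.TransformerEncoder(layer, num_layers=2)

z = torch.randn(8, 16, latent_dim)             # latent vectors from the VAE encoder
processed = compressor(core(expander(z)))      # back to the VAE decoder's expected size
```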
The reconstructed output from the VAE Decoder Subsystem 180 is provided as the generated output 190. The generated output 190 represents the final result of the Latent Transformer LCM system, which can take various forms depending on the specific task and domain. It could be predicted values for time series forecasting, generated text for language modeling, synthesized images for computer vision tasks, or any other relevant output format.
The VAE Encoder Subsystem 150 and VAE Decoder Subsystem 180 play large roles in the overall functioning of the Latent Transformer LCM system. The VAE Encoder Subsystem 150 enables the system to learn a compressed and informative representation of the input data in the latent space, while the VAE Decoder Subsystem 180 ensures that the generated output is coherent and meaningful by reconstructing it back into the original data space. The combination of these subsystems allows the Latent Transformer Subsystem 170 to focus on learning the complex patterns and relationships within the data, leading to accurate and context-aware outputs.
The specific architectures and parameters of the VAE Encoder Subsystem 150, Latent Transformer Subsystem 170, and VAE Decoder Subsystem 180 can be customized and adapted based on the characteristics and requirements of the input data and the specific task at hand. The modular design of the system allows for flexibility and extensibility, enabling the integration of different architectures, attention mechanisms, and training techniques to optimize the performance and efficiency of the Latent Transformer LCM system.
Exemplary pseudocode for a latent transformer using PyTorch may be found in APPENDIX A.
An output formatter 131 is responsible for converting the generated output into a specific format required by the application or user. It applies formatting rules and conventions to enhance the readability, coherence, and usability of the generated output. For example, in the case of generated text, the output formatter 131 may apply capitalization, punctuation, or line breaks to improve the clarity and structure of the text. In the case of generated time series data, the output formatter 131 may convert the values into the desired unit of measurement or apply specific formatting conventions to ensure consistency with the expected output format.
A filtering and thresholding subsystem 132 applies specific criteria or thresholds to filter or select the most relevant or reliable generated outputs. It helps to refine the generated output based on predefined rules, constraints, or user preferences. For example, in a recommendation system, the filtering and thresholding subsystem 132 may filter out generated recommendations that fall below a certain relevance threshold or exclude items that have already been recommended to the user. This subsystem ensures that only the most pertinent and valuable outputs are presented to the user or passed on for further processing.
An output validation and evaluation subsystem 133 assesses the quality and performance of the generated output against predefined metrics or ground truth data. It applies validation techniques to ensure that the generated output meets the expected criteria and conforms to the desired characteristics. This subsystem may include automatic evaluation methods, such as calculating similarity scores, perplexity, or domain-specific metrics, to measure the accuracy, coherence, or effectiveness of the generated output. By continuously monitoring and evaluating the generated output, the output validation and evaluation subsystem 133 provides valuable insights for model improvement and fine-tuning.
An error handling and anomaly detection subsystem 134 identifies and handles any errors, anomalies, or unexpected patterns in the generated output. It incorporates techniques for detecting and correcting syntactic or semantic errors, identifying out-of-distribution samples, or flagging potential issues that require human intervention. This subsystem plays a critical role in maintaining the quality and reliability of the generated output by proactively identifying and addressing any problems or inconsistencies. It helps to prevent the propagation of errors downstream and ensures that the generated output is trustworthy and dependable.
The data post processor 130 works seamlessly with the other components of the Latent Transformer LCM system to deliver high-quality and reliable generated outputs. It receives the generated output from the Latent Transformer Machine Learning Core 120, which has learned the underlying patterns, relationships, and dependencies within the input data. The post-processing subsystems within the data post processor 130 then refine, format, validate, and ensure the quality of the generated output, making it suitable for the intended application or user.
The specific configuration and parameters of each subsystem within the Data Post Processor 130 can be customized and adapted based on the requirements of the application domain and the nature of the generated output. The modular design of the post-processor allows for the integration of additional subsystems or the modification of existing ones to meet the specific needs of the task at hand.
The codebook is an important component of the codebook-based homomorphic compression system. According to the embodiment, it is a collection of codewords, where each codeword corresponds to a sourceblock in the input. The codebook may be generated based on the frequency distribution of the inputs, assigning shorter codewords to more frequently occurring inputs and longer codewords to less frequent inputs. There are several techniques for generating the codebook, with the goal of minimizing the average codeword length while maintaining the uniqueness of the codewords. Two common techniques are Huffman coding 202 and arithmetic coding 203. Huffman coding 202 is a variable-length coding technique that assigns codewords based on the frequency of occurrence of each symbol (sourceblock). It constructs a binary tree, known as the Huffman tree, where each leaf node represents a symbol and the path from the root to the leaf determines the codeword. More frequent symbols are assigned shorter codewords, while less frequent symbols receive longer codewords. Huffman coding guarantees an optimal prefix code, meaning no codeword is a prefix of any other codeword. For example, consider the quantized temperature data from the previous example. Let's say the frequency distribution of the intervals is as follows:
Using Huffman coding, the codebook generation subsystem 140 can generate the following codebook:
The most frequent input (Sourceblock 4) receives the shortest codeword (11), while the least frequent input (Sourceblock 0) receives the longest codeword (1100).
Arithmetic coding 203 is another entropy coding technique that assigns codewords to sourceblocks based on their probability distribution. Unlike Huffman coding, arithmetic coding does not assign fixed codewords to symbols. Instead, it represents the entire message as a single fractional number between 0 and 1. The interval [0, 1) is recursively divided based on the probabilities of the symbols, and the final codeword is a binary fraction that falls within the subinterval corresponding to the entire message. Arithmetic coding achieves near-optimal compression rates but requires more computational complexity compared to Huffman coding. For example, using the same quantized temperature data and frequency distribution as before, arithmetic coding would assign subintervals to each symbol based on their probabilities:
To encode a message sequence like [Sourceblock 4, Sourceblock 2, Sourceblock 1], arithmetic coding would recursively subdivide the interval [0, 1) based on the probabilities of the symbols, resulting in a final subinterval. The codeword would be a binary fraction that lies within this final subinterval.
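For illustration only, the following is a minimal sketch of the recursive interval subdivision described above. The symbol probabilities are assumptions standing in for the quantized-interval frequency table, which is not reproduced in this excerpt.

```python
# Illustrative arithmetic-coding interval subdivision; the probabilities below
# are assumed values, not the frequency table from the temperature example.
probs = {"Sourceblock 0": 0.05, "Sourceblock 1": 0.15, "Sourceblock 2": 0.20,
         "Sourceblock 3": 0.25, "Sourceblock 4": 0.35}

def arithmetic_interval(message, probs):
    """Return the final [low, high) subinterval identifying the message sequence."""
    # Cumulative probability ranges for each symbol on [0, 1)
    cum, ranges = 0.0, {}
    for symbol, p in probs.items():
        ranges[symbol] = (cum, cum + p)
        cum += p
    low, high = 0.0, 1.0
    for symbol in message:
        span = high - low
        sym_low, sym_high = ranges[symbol]
        low, high = low + span * sym_low, low + span * sym_high
    return low, high

low, high = arithmetic_interval(["Sourceblock 4", "Sourceblock 2", "Sourceblock 1"], probs)
codeword = (low + high) / 2    # any binary fraction inside [low, high) identifies the message
```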
According to an embodiment, an encoder component 201 is present and configured to implement one or more deep learning techniques for generating codewords for quantized data. Deep learning techniques can be employed to generate effective codewords for the quantized data. One approach is to use deep learning-based autoencoder models to learn compact and meaningful representations of the quantized data. Autoencoders are neural network architectures that consist of an encoder and a decoder, where the encoder learns to compress the input data into a lower-dimensional latent space, and the decoder reconstructs the original data from the latent representation.
Here are a few exemplary deep learning encoding techniques that can be implemented for creating codewords of the quantized data, according to an embodiment. Convolutional autoencoders (CAEs) leverage convolutional neural networks (CNNs) in the encoder and decoder parts of the autoencoder. CNNs are particularly effective in capturing spatial dependencies and hierarchical features in data, making them well-suited for encoding structured data such as images or time series. In the context of the codebook-based homomorphic compression, a CAE can be trained on the quantized data. The encoder part of the CAE learns to compress the quantized data into a compact latent representation, which serves as the codeword. The decoder part learns to reconstruct the quantized data from the codeword. As an example, consider the use of a CAE for encoding quantized sensor data. The quantized data is represented as a 2D matrix, where each row corresponds to a sensor reading, and each column represents a time step. The CAE encoder consists of convolutional layers followed by pooling layers, which gradually reduce the spatial dimensions of the input and extract meaningful features. The output of the encoder is a compact latent representation, which serves as the codeword. The CAE decoder consists of upsampling layers and convolutional layers, which reconstruct the original quantized data from the codeword.
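For illustration only, the following is a minimal PyTorch sketch of such a CAE for a one-channel matrix of quantized sensor readings; the channel counts, kernel sizes, and input dimensions are assumptions made for the sketch.

```python
import torch
import torch.nn as nn

class ConvAutoencoder(nn.Module):
    """Illustrative CAE: the encoder compresses a quantized 2D matrix
    (sensor readings x time steps) into a latent codeword; the decoder
    reconstructs the matrix from that codeword."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                                # halve spatial dimensions
            nn.Conv2d(8, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.decoder = nn.Sequential(
            nn.Upsample(scale_factor=2), nn.Conv2d(16, 8, kernel_size=3, padding=1), nn.ReLU(),
            nn.Upsample(scale_factor=2), nn.Conv2d(8, 1, kernel_size=3, padding=1),
        )

    def forward(self, x):
        codeword = self.encoder(x)          # compact latent representation
        return self.decoder(codeword), codeword

# Hypothetical usage: 16 sensors x 64 time steps, one channel
x = torch.randn(4, 1, 16, 64)
reconstruction, codeword = ConvAutoencoder()(x)
```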
Another form of deep learning coding includes recurrent autoencoders (RAEs). Recurrent autoencoders utilize recurrent neural networks (RNNs) in the encoder and decoder parts of the autoencoder. RNNs are well-suited for processing sequential data, such as time series or natural language, as they can capture temporal dependencies and context. An RAE can be used to encode quantized sequential data. The encoder part of the RAE consists of recurrent layers, such as Long Short-Term Memory (LSTM) or Gated Recurrent Unit (GRU) layers, which process the input sequence and generate a fixed-length latent representation, serving as the codeword. The decoder part of the RAE takes the codeword and reconstructs the original quantized sequence. For example, consider using an RAE for encoding quantized audio data. The quantized audio signal is represented as a sequence of amplitude values. The RAE encoder consists of LSTM layers that process the input sequence and generate a fixed-length latent representation, which serves as the codeword. The RAE decoder, also consisting of LSTM layers, takes the codeword and reconstructs the original quantized audio sequence.
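For illustration only, the following is a minimal PyTorch sketch of such an LSTM-based RAE for a sequence of quantized amplitude values; the hidden size, sequence length, and decoding strategy (repeating the codeword at every step) are assumptions made for the sketch.

```python
import torch
import torch.nn as nn

class RecurrentAutoencoder(nn.Module):
    """Illustrative RAE: an LSTM encoder maps a quantized amplitude sequence to a
    fixed-length codeword (the final hidden state); an LSTM decoder unrolls that
    codeword back into a sequence."""
    def __init__(self, hidden_dim: int = 32):
        super().__init__()
        self.encoder = nn.LSTM(input_size=1, hidden_size=hidden_dim, batch_first=True)
        self.decoder = nn.LSTM(input_size=hidden_dim, hidden_size=hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, 1)

    def forward(self, x):                                # x: (batch, seq_len, 1)
        _, (h, _) = self.encoder(x)
        codeword = h[-1]                                 # fixed-length latent representation
        # Feed the codeword at every step to reconstruct the sequence
        repeated = codeword.unsqueeze(1).repeat(1, x.size(1), 1)
        decoded, _ = self.decoder(repeated)
        return self.out(decoded), codeword

# Hypothetical usage on quantized audio amplitudes
x = torch.randn(4, 100, 1)
reconstruction, codeword = RecurrentAutoencoder()(x)
```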
Another form of deep learning coding includes variational autoencoders (VAEs). Variational autoencoders extend the concept of autoencoders by introducing a probabilistic framework. VAEs learn to encode the input data into a probability distribution in the latent space, rather than a single point. The encoder part of the VAE learns to map the input data to the parameters of a probability distribution (e.g., mean and variance of a Gaussian distribution), and the decoder part learns to reconstruct the original data from samples drawn from this distribution. A VAE can be used to generate codewords that capture the underlying probability distribution of the quantized data. The encoder part of the VAE learns to map the quantized data to the parameters of a probability distribution in the latent space. The codewords are then obtained by sampling from this distribution. The decoder part of the VAE learns to reconstruct the original quantized data from the sampled codewords. Consider an example of using a VAE for encoding quantized image data. The quantized images are fed into the VAE encoder, which learns to map each image to the parameters of a Gaussian distribution in the latent space. The codewords are obtained by sampling from this distribution. The VAE decoder takes the sampled codewords and reconstructs the original quantized images.
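For illustration only, the following is a minimal PyTorch sketch of the probabilistic framework described above: the encoder maps quantized (flattened) image data to the mean and log-variance of a Gaussian in latent space, codewords are sampled from that distribution, and the decoder reconstructs from the sample. The layer sizes and 28x28 input shape are assumptions made for the sketch.

```python
import torch
import torch.nn as nn

class SimpleVAE(nn.Module):
    """Illustrative VAE: encodes quantized data to a Gaussian distribution in
    latent space and reconstructs from a sampled codeword."""
    def __init__(self, in_dim: int = 784, latent_dim: int = 16):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU())
        self.mu = nn.Linear(128, latent_dim)
        self.logvar = nn.Linear(128, latent_dim)
        self.dec = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, in_dim))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # sampled codeword
        return self.dec(z), mu, logvar

# Hypothetical usage on flattened 28x28 quantized images
x = torch.rand(8, 784)
recon, mu, logvar = SimpleVAE()(x)
```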
Another form of deep learning coding includes deep belief networks (DBNs). Deep Belief Networks are generative models that consist of multiple layers of restricted Boltzmann machines (RBMs). DBNs can learn hierarchical representations of the input data by training each layer in an unsupervised manner, followed by fine-tuning the entire network using supervised learning. DBNs can be used to generate codewords that capture the hierarchical structure of the quantized data. The DBN is trained on the quantized data, and the activations of the hidden layers serve as the codewords. The hierarchical nature of DBNs allows for capturing complex patterns and dependencies in the data. Consider an example of using a DBN for encoding quantized text data. The quantized text is represented as a binary vector, where each element corresponds to the presence or absence of a specific word. The DBN is trained on the quantized text data, and the activations of the hidden layers serve as the codewords. The DBN learns to capture the hierarchical structure and semantic relationships in the text data.
These are just a few examples of deep learning encoding techniques that can be explored for creating codewords of the quantized data in a LCM. The choice of the specific deep learning architecture depends on the nature of the data and the desired properties of the codewords. It's important to note that the deep learning encoding process should be designed to generate codewords that are suitable for homomorphic operations. The codewords should exhibit certain properties, such as being compatible with the homomorphic encryption scheme's plaintext space and allowing for efficient homomorphic computations.
During the training process of the deep learning models, the objective function should be designed to capture the desired properties of the codewords, such as minimizing the reconstruction error while ensuring the codewords are suitable for homomorphic operations. Additionally, regularization techniques can be employed to encourage sparsity or other desirable properties in the codewords. Once the deep learning models are trained, the encoder part can be used to generate codewords for new quantized data. The generated codewords can then be used in the codebook-based homomorphic compression scheme, enabling efficient and privacy-preserving computations on the compressed data.
Experimental evaluation and performance analysis can be conducted to assess the effectiveness of the deep learning encoding techniques in generating codewords that achieve good compression ratios, maintain low approximation errors, and enable efficient homomorphic operations. The choice of the deep learning architecture and hyperparameters can be fine-tuned based on the specific requirements and characteristics of the data.
According to the aspect, a codebook library 204 is present and configured to store a plurality of codewords (i.e., a codebook) generated by one or more of the techniques described herein. When it comes to storing the codewords and codebook in the codebook-based homomorphic compression system, several database systems and data storage solutions can be considered. The choice of the storage system depends on factors such as the size of the codebook, the frequency of updates, the retrieval and query requirements, and the overall system architecture. In some implementations, key-value stores may be used. Key-value stores are a type of NoSQL database that provide a simple and efficient way to store and retrieve data based on a unique key. Examples of key-value stores include Redis, Memcached, and Amazon DynamoDB. For storing the codewords and codebook, key-value stores can be used to store each codeword as a key-value pair, where the key represents the codeword, and the value represents the corresponding data or metadata associated with the codeword. The codebook can be stored as a collection of key-value pairs, allowing for fast retrieval of codewords based on their keys. Key-value stores offer high performance, low latency, and scalability, making them suitable for scenarios where fast retrieval of codewords is critical.
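For illustration only, the following is a minimal sketch of storing codebook entries as key-value pairs in Redis using the redis-py client; it assumes a locally running Redis instance, and the key naming scheme and JSON serialization are assumptions made for the sketch.

```python
import json
import redis  # assumes a locally running Redis instance and the redis-py client

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Store each codeword as a key-value pair: key = codeword, value = sourceblock metadata
codebook = {12: "Well", 5: ",", 78: "Prince", 21: "so"}
for codeword, sourceblock in codebook.items():
    r.set(f"codebook:{codeword}", json.dumps({"sourceblock": sourceblock}))

# Fast retrieval of a sourceblock by its codeword key
entry = json.loads(r.get("codebook:78"))   # {"sourceblock": "Prince"}
```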
Document databases, such as MongoDB or Couchbase, store data as flexible, semi-structured documents in formats like JSON or BSON. They provide a schema-less design and allow for easy modification of the data structure. For storing the codewords and codebook, document databases can be used to store each codeword as a document, along with its associated data or metadata. The codebook can be stored as a collection of documents, where each document represents a codeword and its related information. Document databases offer flexibility in terms of data structure, allowing for easy addition or modification of codeword attributes. They also provide querying capabilities based on document fields, enabling efficient retrieval of codewords based on specific criteria.
Relational databases, such as MySQL, PostgreSQL, or Oracle, can also be used to store the codewords and codebook. In a relational database, the codewords can be stored in a table with columns representing the codeword and its associated data or metadata. The codebook can be stored in a separate table, with each row representing a codeword and its corresponding information. Relational databases provide structured querying capabilities using SQL, allowing for efficient retrieval and filtering of codewords based on specific conditions. Relational databases offer strong consistency, ACID properties, and support for complex queries, making them suitable for scenarios where data integrity and structured querying are important.
Graph databases, such as Neo4j or Amazon Neptune, store data as nodes and edges in a graph structure. They are designed to efficiently handle complex relationships and connections between data entities. For storing the codewords and codebook, graph databases can be used to represent the relationships between codewords and their associated data or metadata. Each codeword can be represented as a node in the graph, with edges connecting related codewords or linking codewords to their corresponding data. Graph databases provide efficient traversal and querying capabilities based on the graph structure, allowing for fast retrieval of connected codewords and exploration of relationships between codewords.
Distributed key-value stores, such as Apache Cassandra or Apache HBase, are designed to handle large-scale data and provide high scalability and fault tolerance. They distribute data across multiple nodes in a cluster, allowing for horizontal scaling. For storing the codewords and codebook, distributed key-value stores can be used to store codewords as key-value pairs, similar to regular key-value stores. The codebook can be partitioned and distributed across multiple nodes in the cluster, enabling high scalability and performance. Distributed key-value stores offer eventual consistency, high write throughput, and the ability to handle large volumes of data, making them suitable for scenarios where scalability and fault tolerance are critical.
The VAE Encoder Subsystem 150 takes a codeword vector input 300 as its input. This codeword vector is generated by the codeword allocator 113, which converts the raw input data into a sequence of codewords based on the codebook maintained by the codebook generation subsystem 140. The codeword vector represents the input data in a compact and discrete form, capturing the essential information and structure of the original data. Inside the VAE Encode Subsystem 150, the codeword vector input 300 undergoes a series of transformations to map it into the latent space. The encoder architecture typically consists of multiple layers of neural networks, such as fully connected layers or convolutional layers, depending on the nature of the input data.
A layer of the encoder takes the codeword vector and applies a linear transformation to project it into a higher-dimensional space. This transformation is learned during the training process and helps to capture the complex patterns and relationships within the input data. The output of this layer may be passed through a non-linear activation function, such as the rectified linear unit (ReLU), to introduce non-linearity and enhance the representational power of the encoder.
As the codeword vector input 300 progresses through the subsequent layers of the encoder, the dimensionality of the representation is gradually reduced. Each layer applies a linear transformation followed by a non-linear activation function, allowing the encoder to learn hierarchical features and abstract representations of the input data.
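The encoder described above may be illustrated, in a non-limiting manner, by the following PyTorch sketch; the initial expansion size, the layer widths, and the latent dimensionality are illustrative assumptions only.

    import torch
    import torch.nn as nn

    class CodewordVAEEncoder(nn.Module):
        """Maps a codeword vector to a latent sample plus its mean and log-variance.
        All dimensions here are illustrative assumptions."""
        def __init__(self, input_dim=1000, expand_dim=2048, hidden_dim=256, latent_dim=50):
            super().__init__()
            self.backbone = nn.Sequential(
                nn.Linear(input_dim, expand_dim), nn.ReLU(),   # initial projection into a higher-dimensional space
                nn.Linear(expand_dim, hidden_dim), nn.ReLU(),  # progressive reduction toward the latent space
            )
            self.to_mu = nn.Linear(hidden_dim, latent_dim)      # mean of q(z|x)
            self.to_logvar = nn.Linear(hidden_dim, latent_dim)  # log-variance of q(z|x)

        def forward(self, x):
            h = self.backbone(x)
            mu, logvar = self.to_mu(h), self.to_logvar(h)
            # Reparameterization trick: sample z while keeping gradients.
            z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
            return z, mu, logvar

    encoder = CodewordVAEEncoder()
    z, mu, logvar = encoder(torch.randn(8, 1000))  # batch of 8 codeword vectors
    print(z.shape)  # torch.Size([8, 50])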
The VAE Encoder Subsystem 150 in the Latent Transformer LCM system can be trained independently or jointly with the other machine learning components, such as the Latent Transformer Subsystem 170 and the VAE Decode Subsystem 180. The flexibility in training allows for optimizing the VAE encoder based on specific requirements and available resources. When trained individually, the VAE encoder can focus on learning the optimal compression and representation of the input codeword vectors in the latent space. The Encoder Training System 151 is responsible for updating the encoder's parameters using techniques like gradient descent and backpropagation, minimizing the reconstruction loss and the KL divergence. Individual training enables the encoder to specialize in mapping the input data to a meaningful latent space representation.
On the other hand, joint training of the VAE encoder 150 with the Latent Transformer 170 and VAE decoder 180 allows for end-to-end optimization of the entire system. By training all components simultaneously, the VAE encoder 150 can learn to generate latent space vectors that are well-suited for processing by the Latent Transformer and decoding by the VAE decoder 180. Joint training enables the system to capture the dependencies and interactions between the different components, leading to improved overall performance. However, joint training may be more computationally intensive and require careful coordination between the training systems. The choice between individual or joint training depends on factors such as the complexity of the data, the desired performance, and the available computational resources. Experimentation and evaluation can help determine the most suitable training approach for a given scenario.
Once the VAE Encoder Subsystem 150 is trained, it can map the input codeword vector to a lower-dimensional latent space representation. This latent space vector captures the essential features and characteristics of the input data in a compressed form. The dimensionality of the latent space vector is typically much smaller than the original codeword vector, allowing for efficient storage and processing.
The latent space vector output 320 serves as the input to the Latent Transformer Subsystem 170, which further processes and generates data based on the learned latent space representation. By compressing the input data into a compact latent space, the VAE Encoder Subsystem 150 enables the Latent Transformer LCM system to handle large-scale and complex datasets efficiently, while preserving the essential information and structure of the data.
Latent space vectors possess the property of continuous differentiability. This means that the latent space formed by these vectors is a smooth and continuous manifold, allowing for smooth interpolation and gradual transitions between different points in the latent space. The continuous differentiability of latent space vectors has important implications for the similarity and relatedness of the outputs generated by the LCM system. In the latent space, outputs that are more proximate to one another, i.e., closer in terms of their latent vector representations, tend to exhibit higher levels of similarity. This is because the VAE Encoder Subsystem 150 learns to map similar input data points to nearby regions in the latent space, capturing their shared characteristics and underlying patterns.
As a result, when the Latent Transformer Subsystem 170 operates on the latent space vectors and generates outputs, the proximity of the latent vectors directly influences the similarity of the generated outputs. Outputs corresponding to latent vectors that are close to each other in the latent space are more likely to share common features, styles, or semantics. This property enables smooth interpolation between different outputs, allowing for the generation of intermediate or blended results that exhibit gradual variations along the latent space. The continuous differentiability of latent space vectors also facilitates the learning and optimization process of the LCM system. During training, the gradients can be computed and propagated smoothly through the latent space, enabling efficient updates of the model parameters. This allows the system to learn meaningful and coherent representations of the input data, capturing the underlying structure and relationships.
Moreover, the proximity-based similarity of latent space vectors opens up possibilities for various applications and use cases. For example, in the context of image generation, interpolating between latent vectors of different images can lead to the generation of smooth transitions or morphs between the corresponding visual contents. Similarly, in the domain of text generation, interpolating between latent vectors of different sentences or paragraphs can result in the generation of semantically coherent and gradually varying textual outputs. The continuous differentiability and proximity-based similarity of latent space vectors in the LCM system provide a powerful tool for exploring and manipulating the generated outputs. By navigating and interpolating within the latent space, users can discover novel and meaningful variations of the data, generate diverse and creative outputs, and gain insights into the underlying structure and relationships captured by the model.
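As a non-limiting illustration of the interpolation behavior described above, the following sketch linearly blends two latent vectors; the latent dimensionality and step count are assumptions, and decoding each intermediate vector would be performed by the VAE decoder.

    import torch

    def interpolate_latents(z_a, z_b, steps=5):
        """Linearly interpolate between two latent vectors. Because the latent
        space is smooth, decoding each intermediate point tends to yield
        gradually varying outputs (illustrative usage only)."""
        alphas = torch.linspace(0.0, 1.0, steps).unsqueeze(1)  # (steps, 1)
        return (1 - alphas) * z_a + alphas * z_b               # (steps, latent_dim)

    z_a, z_b = torch.randn(50), torch.randn(50)   # two latent vectors (assumed size 50)
    blend = interpolate_latents(z_a, z_b)
    print(blend.shape)  # torch.Size([5, 50])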
In the Variational Autoencoder (VAE) Encoder and Decoder subsystems of the Latent Transformer Large Codeword Model (LCM) system, the shape of the tensors undergoes transformations as they are compressed and decompressed. The VAE Encoder Subsystem 150 is responsible for compressing the input data into a lower-dimensional latent space representation, while the VAE Decoder Subsystem 180 decompresses the latent representation back into the original data space. The specific shape and dimensionality of the tensors at each stage of the encoding and decoding process can be adjusted based on the goals and requirements of the system.
The VAE Encoder Subsystem 150 takes the preprocessed input data, which is typically in the form of a high-dimensional vector or tensor, and applies a series of transformations to reduce its dimensionality. The shape of the tensor at each layer of the VAE Encoder Subsystem 150 can be customized based on the desired level of compression and the complexity of the input data. For example, after passing through the first layer of the encoder, the expanded input vector may be reduced to a tensor with 1000 elements. This compression step aims to capture the most salient features and patterns in the input data while reducing its dimensionality. The subsequent layers of the encoder can further compress the tensor, reducing it to even lower dimensions, such as 50 or 10 elements, depending on the specific training parameters and the desired level of compression.
The choice of the target dimensionality for the latent space representation depends on various factors, such as the nature of the input data, the complexity of the patterns and relationships to be captured, and the available computational resources. A smaller latent space dimensionality can lead to higher compression rates and more efficient processing, but it may also result in a loss of information and reduced expressiveness. On the other hand, a larger latent space dimensionality allows for more detailed and nuanced representations but may require more computational resources and longer training times.
Once the input data is compressed into the latent space representation, it is passed through the Latent Transformer Subsystem 170, where the self-attention mechanisms and multi-head attention operate on the compressed representation. The Latent Transformer Subsystem 170 learns the underlying patterns, relationships, and dependencies within the latent space, enabling it to generate accurate and context-aware outputs. If the shape of the latent space representation is not large enough to be effectively processed by the Latent Transformer Subsystem 170, the latent space vectors may be processed by an expander 151, which increases the dimensionality of the vector allowing for a richer and more expressive representation.
The generated output from the Latent Transformer Subsystem 170 is then fed into the VAE Decoder Subsystem 180, which is responsible for decompressing the latent representation back into the original data space. The VAE Decoder Subsystem 180 applies a series of transformations to gradually increase the dimensionality of the tensor, eventually reconstructing it into the desired output shape. Similar to the encoding process, the shape of the tensor at each layer of the VAE Decoder Subsystem 180 can be customized based on the desired output characteristics and the requirements of the application.
The flexibility in tensor shapes throughout the encoding and decoding process allows the Latent Transformer LCM system to adapt to various data types, input sizes, and output requirements. By adjusting the compression and decompression parameters, the system can be optimized for different goals, such as achieving high compression rates, preserving important details, or generating outputs with specific dimensions or characteristics.
The ability to customize the tensor shapes in the VAE Encoder and Decoder subsystems enables the Latent Transformer LCM system to handle a wide range of data modalities and tasks, from time series forecasting and language modeling to image generation and beyond. It provides the flexibility to tailor the system to the specific needs of each application, balancing the trade-offs between compression, expressiveness, and computational efficiency.
The illustrated Latent Transformer comprises an Encoder and a Decoder. The Encoder takes latent space vector inputs and processes them through a stack of layers (represented as dashed box 420). Each layer consists of: multi-head attention, which allows the model to attend to different parts of the input sequence; add and norm, which applies residual connection and layer normalization; feed forward, which is a fully connected feed-forward network; and add and norm which is another residual connection and layer normalization.
The power of the transformer model lies in the self-attention mechanism. This mechanism contributes to accelerated learning compared to traditional models such as long short-term memory models. Self-attention empowers the transformer model with the remarkable capability to meticulously scrutinize distinct segments of a given sequence or even encompass the entire contextual essence of a sentence. This profound contextual awareness enables the model to make predictions with an elevated degree of accuracy and relevance.
Contrary to a standard transformer architecture, in a Latent Transformer, an input embedding layer and a positional encoding layer are not necessary. This is because rather than processing data inputs, a Latent Transformer processes latent space vectors which have been processed by a Variational Autoencoder encoder.
This latent space representation captures the essential features and characteristics of the input data, including both the content and positional information. By encoding the input data into a compact latent vector, the VAE effectively combines the roles of the embedding layer and positional encoding layer. The latent vectors generated by the VAE encoder already contain the necessary information for the Transformer to process and learn from, without the need for explicit embedding or positional encoding. This streamlined approach simplifies the Transformer architecture and reduces the computational overhead associated with maintaining separate embedding and positional encoding layers. As a result, the Latent Transformer LCM system can efficiently process and generate data in the latent space, leveraging the power of the Transformer architecture while benefiting from the compressed representation learned by the VAE.
The Encoder utilizes a multi-head attention mechanism 424 which allows the Encoder to attend to different parts of the input sequence and capture dependencies between vectors. The attention mechanism computes three matrices: Query (Q), Key (K), and Value (V). The Query, Key, and Value matrices are obtained by linearly projecting the input embeddings using learned weight matrices. The attention scores are computed by taking the dot product of the Query matrix with the transpose of the Key matrix, followed by scaling and applying a softmax function. The attention scores determine the importance of each vector in the input sequence for a given position. The Value matrix is then multiplied with the attention scores to obtain the weighted sum of the values, which forms the output of the attention mechanism. Multi-Head Attention splits the Query, Key, and Value matrices into multiple heads, allowing the model to attend to different aspects of the input simultaneously. The outputs from each head are concatenated and linearly projected to obtain the final output of the Multi-Head Attention layer 424.
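The attention computation described above may be illustrated by the following non-limiting sketch of scaled dot-product attention; the batch, sequence, and head sizes are assumptions, and in a multi-head configuration this operation is applied per head before the head outputs are concatenated and linearly projected.

    import math
    import torch
    import torch.nn.functional as F

    def scaled_dot_product_attention(Q, K, V):
        """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, as described above."""
        d_k = Q.size(-1)
        scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # compatibility of queries and keys
        weights = F.softmax(scores, dim=-1)                # importance of each position
        return weights @ V                                 # weighted sum of the values

    # Illustrative shapes: batch of 2 sequences, 10 latent vectors, head size 64.
    Q = torch.randn(2, 10, 64)
    K = torch.randn(2, 10, 64)
    V = torch.randn(2, 10, 64)
    print(scaled_dot_product_attention(Q, K, V).shape)  # torch.Size([2, 10, 64])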
In the Latent Transformer LCM system, the number of attention heads used by the Encoder can be adjusted based on the complexity and nature of the relationships within the input data. The attention mechanism allows the Encoder to focus on different aspects of the input and capture dependencies between elements at various positions. When dealing with datasets where the relationships between elements are weaker or more subtle, increasing the number of attention heads can be beneficial. By having more attention heads, the Encoder can learn and capture a wider range of patterns and dependencies within the data. Each attention head can attend to different parts of the input sequence, allowing the model to capture fine-grained relationships and nuances that may be difficult to detect with fewer attention heads. This is particularly useful when working with complex or heterogeneous datasets, where the relationships between elements may not be immediately apparent. By increasing the number of attention heads, the Latent Transformer LCM system can more effectively learn and represent the underlying structure and dependencies in the data, leading to improved performance and generalization. However, it's important to strike a balance, as having an excessive number of attention heads can increase computational complexity and may lead to overfitting. Experimentation and evaluation on specific tasks can help determine the optimal number of attention heads for a given dataset and desired outcome.
After the Multi-Head Attention layer, a residual connection is applied, followed by Layer Normalization at add and norm 423. The residual connection adds the input embeddings to the output of the attention layer, helping the model learn faster and deeper. Layer Normalization normalizes the activations across the features, stabilizing the training process.
The Feed Forward layer 422 is a fully connected neural network applied to each position of the Encoder's hidden states. It consists of two linear transformations with a Rectified Linear Unit (ReLU) activation function in between. The purpose of the Feed Forward layer is to introduce non-linearity and increase the model's capacity to learn complex representations. The output of the Feed Forward layer has the same dimensionality as the input embeddings. A residual connection and Layer Normalization 421 are applied after the Feed Forward layer.
The Encoder layers 420 are stacked Nx times, where N is a hyperparameter that determines the depth of the Encoder. Each layer follows the same structure: Multi-Head Attention, Add & Norm, Feed Forward, and Add & Norm. By stacking multiple Encoder layers, the model can capture hierarchical and long-range dependencies in the input sequence. The output of the final Encoder layer represents the encoded input sequence, which is then passed to the Decoder for generating the output sequence.
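A non-limiting sketch of such an encoder stack, operating directly on latent vectors with no embedding or positional encoding layers, follows; the model width, number of heads, and depth are illustrative assumptions, and the stock PyTorch encoder layer stands in for the layers 420.

    import torch
    import torch.nn as nn

    d_model, n_heads, n_layers = 64, 4, 6   # illustrative hyperparameters
    layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                       dim_feedforward=256, batch_first=True)
    encoder = nn.TransformerEncoder(layer, num_layers=n_layers)  # layers stacked Nx times

    latent_seq = torch.randn(2, 16, d_model)  # (batch, sequence of latent vectors, d_model)
    encoded = encoder(latent_seq)             # fed in directly; no embedding or positional encoding
    print(encoded.shape)                      # torch.Size([2, 16, 64])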
The Decoder generates the output probabilities. It has a similar structure to the Encoder, with a few additions. The Decoder takes output vectors and processes them through a stack of layers (represented as dashed box 450). The latent space vector output layer 430 takes the previous output vectors (shifted right by one position) and processes them through a plurality of layers.
The masked multi-head attention 451 mechanism prevents the model from attending to future vectors. This layer performs self-attention on the Decoder's input sequence. It allows the Decoder to attend to different parts of its own input sequence. The attention is “masked” to prevent the Decoder from attending to future vectors, ensuring that the predictions are based only on the previously generated vectors. Multi-head attention splits the input into multiple heads, allowing the model to attend to different aspects of the input simultaneously.
After the masked multi-head attention, a residual connection is applied, followed by layer normalization via add and norm 452. The residual connection adds the input to the output of the attention layer, helping the model learn faster and deeper. Layer normalization normalizes the activations across the features, stabilizing the training process.
The multi-head attention 453 layer performs attention between the Decoder's hidden states and the Encoder's output. It allows the Decoder to attend to relevant parts of the input sequence based on the Encoder's representations. The attention weights are computed based on the compatibility between the Decoder's hidden states and Encoder's outputs.
In the Latent Transformer LCM system, the number of attention heads used by the Decoder can be adjusted based on the complexity and nature of the relationships within the input data. The attention mechanism allows the Decoder to focus on different aspects of the input and capture dependencies between elements at various positions. When dealing with datasets where the relationships between elements are weaker or more subtle, increasing the number of attention heads can be beneficial. By having more attention heads, the Decoder can learn and capture a wider range of patterns and dependencies within the data. Each attention head can attend to different parts of the input sequence, allowing the model to capture fine-grained relationships and nuances that may be difficult to detect with fewer attention heads. This is particularly useful when working with complex or heterogeneous datasets, where the relationships between elements may not be immediately apparent. By increasing the number of attention heads, the Latent Transformer LCM system can more effectively learn and represent the underlying structure and dependencies in the data, leading to improved performance and generalization. However, it's important to strike a balance, as having an excessive number of attention heads can increase computational complexity and may lead to overfitting. Experimentation and evaluation on specific tasks can help determine the optimal number of attention heads for a given dataset and desired outcome.
Another add and norm 454 layer is then followed by feed forward network 455. This is a fully connected feed-forward network applied to each position of the Decoder's hidden states. It consists of two linear transformations with a Rectified Linear Unit (ReLU) activation in between. The feed forward layer helps the model capture non-linear interactions and increases the model's capacity.
Another add and norm 456 layer is followed by linear 460 and softmax 470 layers. The final hidden states of the Decoder are passed through a linear transformation to project them into the vocabulary space. Vocabulary space refers to the set of all unique codewords or words that the model can generate or predict. In the context of language models, the vocabulary is a predefined set of codewords that the model is trained on and can output. When the Decoder's final hidden states are passed through a linear transformation, they are projected into a vector space with the same dimensionality as the size of the vocabulary. Each dimension in this space corresponds to a specific codeword in the vocabulary.
A softmax function is applied to the projected values (vectors) to generate output probabilities over the vocabulary. The softmax function normalizes the values so that they sum up to 1, representing a probability distribution over the vocabulary. Each probability indicates the likelihood of a specific vector being the next output vector. The vector with the highest probability is selected as the next output vector. During the model's training, the objective is to maximize the probability of the correct next vector given the input sequence and the previously generated vectors. The model learns to assign higher probabilities to the vectors that are more likely to appear based on the context. At inference time, the vector with the highest probability in the vocabulary space is selected as the next output vector. This process is repeated iteratively, with the generated vector being fed back into the Decoder as input for the next step, until a stopping criterion is met (e.g., reaching a maximum length or generating an end-of-sequence vector). The size and composition of the vocabulary can vary depending on the specific task and the data the model is trained on. It can include words, sub-words, or even characters, depending on the codeword strategy used.
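The projection and selection step described above may be illustrated by the following non-limiting sketch; the model width and vocabulary size are assumptions, and greedy selection is shown for simplicity.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    d_model, vocab_size = 64, 1024             # illustrative sizes
    to_vocab = nn.Linear(d_model, vocab_size)  # projects hidden states into the vocabulary space

    hidden = torch.randn(1, d_model)           # final decoder hidden state for one position
    logits = to_vocab(hidden)                  # one score per codeword in the vocabulary
    probs = F.softmax(logits, dim=-1)          # probabilities sum to 1 over the vocabulary
    next_codeword = torch.argmax(probs, dim=-1)  # greedy choice of the next output
    print(next_codeword.item())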
The Decoder layers 450 can be stacked Nx times, allowing the model to capture complex dependencies and generate coherent output sequences.
This transformer architecture allows the model to process input sequences, capture long-range dependencies, and generate output sequences based on the encoded input and the previously generated codewords.
Another type of variation is the auto-regressive model, which features the use of only the decoder portion of the transformer architecture. In autoregressive architectures, the decoder portion of the transformer is retained and the encoder portion is not used after model pre-training. Auto-regressive models are a class of models that generate outputs by predicting the next element based on the previously generated elements. In the context of the Transformer architecture and language modeling, auto-regressive models are commonly used for tasks such as text generation, machine translation, and language understanding.
Auto-regressive models generate outputs sequentially, one element at a time. In the case of language modeling, the model predicts the next word or vector based on the previous words or vectors in the sequence. The prediction of the next element is conditioned on the previously generated elements. The model learns the conditional probability distribution P(x_t | x_1, x_2, ..., x_{t-1}), where x_t is the element at position t, and x_1, x_2, ..., x_{t-1} are the previously generated elements. The Transformer architecture, particularly the Decoder component, is well-suited for auto-regressive modeling. The Decoder generates the output sequence one element at a time, conditioned on the previously generated elements and the encoded input sequence from the Encoder. In the Transformer Decoder, the self-attention mechanism is masked to prevent the model from attending to future positions during training. This masking ensures that the model relies only on the previously generated elements to make predictions, following the auto-regressive property. During training, the Transformer Decoder uses a technique called teacher forcing. Instead of feeding the model's own predictions as input for the next step, the ground truth target sequence is used. This helps the model learn to generate the correct output sequence based on the input sequence and the previous target vectors. During inference or generation, the Transformer Decoder generates the output sequence one element at a time. At each step, the model takes the previously generated elements as input and predicts the next element. This process continues until a stopping criterion is met, such as reaching a maximum sequence length or generating an end-of-sequence vector. Auto-regressive models, including the Transformer, have achieved state-of-the-art performance in language modeling tasks. They excel at capturing the statistical properties and dependencies in sequential data, making them effective for generating coherent and fluent text.
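A non-limiting sketch of the causal masking and step-by-step generation loop described above follows; the model interface, the start element, and the stopping rule are placeholders assumed for illustration.

    import torch

    def causal_mask(seq_len):
        """Upper-triangular boolean mask; True marks future positions that may not be attended to."""
        return torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

    def generate(model, start, max_len=50, eos_id=None):
        """Greedy auto-regressive decoding: each step conditions only on prior outputs.
        `model` is an assumed placeholder mapping a (1, t) sequence to (1, t, vocab) logits."""
        seq = [start]
        for _ in range(max_len):
            x = torch.tensor(seq).unsqueeze(0)              # (1, t) previously generated elements
            logits = model(x, mask=causal_mask(x.size(1)))  # masked self-attention inside the model
            next_id = int(logits[0, -1].argmax())           # predict the next element
            seq.append(next_id)
            if eos_id is not None and next_id == eos_id:    # stopping criterion
                break
        return seq

    print(causal_mask(4))  # illustrates which positions are blocked at each step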
While text generation is the most suitable use case for auto-regressors, they perform exceptionally well on a wide variety of tasks. Most modern LLMs are auto-regressors, including, for example, the popular GPT series of LLMs and XLNet.
The third variation of the transformer model is the sequence-to-sequence model which utilizes both the encoder and decoder portions of the transformer and can be trained in multiple ways. One of the methods is span corruption and reconstruction. These models are, generally, best suited for language translation. The T5 and BART family of models are examples of sequence-to-sequence models.
The Generated Vector Response or Prediction 500 is a lower-dimensional representation that encodes the necessary information for reconstructing or generating the desired output. It contains the learned patterns, relationships, and variations that the Latent Transformer has captured from the input data. The VAE Decoder Subsystem 180 takes this generated vector as input and maps it back to the original data space, producing the final output 190. The decoder architecture typically comprises multiple layers of neural networks, such as fully connected layers or deconvolutional layers, depending on the nature of the output data.
The decoder starts by applying a linear transformation to the generated vector, projecting it into a higher-dimensional space. This transformation helps to expand the compressed representation and prepare it for the subsequent decoding steps. The output of this layer is then passed through a non-linear activation function, such as the rectified linear unit (ReLU), to introduce non-linearity and increase the expressiveness of the decoder. As the generated vector progresses through the subsequent layers of the decoder, the dimensionality of the representation is gradually increased. Each layer applies a linear transformation followed by a non-linear activation function, allowing the decoder to reconstruct the fine-grained details and structure of the output data. In the case of sequence-to-sequence tasks, such as time series prediction or language translation, the VAE Decoder Subsystem 180 may incorporate recurrent neural networks (RNNs) or attention mechanisms to generate the output sequence step by step. The decoder can attend to different parts of the generated vector and the previously generated outputs to produce coherent and contextually relevant results.
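A non-limiting PyTorch sketch of such a decoder follows; the layer widths and the output dimensionality are illustrative assumptions.

    import torch
    import torch.nn as nn

    class VAEDecoder(nn.Module):
        """Expands a generated latent/prediction vector back toward the original data space.
        All dimensions here are illustrative assumptions."""
        def __init__(self, latent_dim=50, hidden_dim=256, output_dim=1000):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(latent_dim, hidden_dim), nn.ReLU(),  # project into a higher-dimensional space
                nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),  # reconstruct finer structure
                nn.Linear(hidden_dim, output_dim),             # final output in the data space
            )

        def forward(self, z):
            return self.net(z)

    decoder = VAEDecoder()
    output = decoder(torch.randn(8, 50))  # batch of 8 generated latent vectors
    print(output.shape)                   # torch.Size([8, 1000])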
During the training process, the VAE Decoder Subsystem 180 learns to minimize the reconstruction loss between the generated output and the target output. It aims to produce outputs that closely match the desired or expected results based on the learned latent space representations. The Decoder Training System 181 is responsible for updating the decoder's parameters using techniques like gradient descent and backpropagation, optimizing the decoder's ability to generate accurate and meaningful outputs. Once the VAE Decoder Subsystem 180 is trained, it can map the Generated Vector Response or Prediction 500 back to the original data space, producing the final output 190. The output can be in various forms, such as reconstructed input data, predicted future sequences, or generated samples, depending on the specific task and application. The flexibility of the VAE Decoder Subsystem 180 allows it to handle various types of output data, such as time series, images, or text. By adapting the decoder architecture and training process to the specific requirements of the task, the Latent Transformer LCM system can generate high-quality outputs that capture the essential characteristics and variations of the target data.
At the model training stage, a plurality of training data 601 may be received by the generative AI training system 650. Data preprocessor 602 may receive the input data (e.g., codeword vector inputs, latent space vector representations) and perform various data preprocessing tasks on the input data to format the data for further processing. For example, data preprocessing can include, but is not limited to, tasks related to data cleansing, data deduplication, data normalization, data transformation, handling missing values, feature extraction and selection, mismatch handling, and/or the like. Data preprocessor 602 may also be configured to create a training dataset, a validation dataset, and a test dataset from the plurality of input data 601. For example, a training dataset may comprise 80% of the preprocessed input data, the validation dataset 10%, and the test dataset may comprise the remaining 10% of the data. The preprocessed training dataset may be fed as input into one or more machine and/or deep learning algorithms 603 to train a predictive model.
During model training, training output 604 is produced and used to measure the accuracy and usefulness of the predictive outputs. During this process a parametric optimizer 605 may be used to perform algorithmic tuning between model training iterations. Model parameters and hyperparameters can include, but are not limited to, bias, train-test split ratio, learning rate in optimization algorithms (e.g., gradient descent), choice of optimization algorithm (e.g., gradient descent, stochastic gradient descent, or Adam optimizer, etc.), choice of activation function in a neural network layer (e.g., Sigmoid, ReLU, Tanh, etc.), the choice of cost or loss function the model will use, number of hidden layers in a neural network, number of activation units in each layer, the drop-out rate in a neural network, number of iterations (epochs) in training the model, number of clusters in a clustering task, kernel or filter size in convolutional layers, pooling size, batch size, the coefficients (or weights) of linear or logistic regression models, cluster centroids, and/or the like. Parameters and hyperparameters may be tuned and then applied to the next round of model training. In this way, the training stage provides a machine learning training loop.
In some implementations, various accuracy metrics may be used by the machine learning training system 600 to evaluate a model's performance. Metrics can include, but are not limited to, word error rate (WER), word information loss, speaker identification accuracy (e.g., single stream with multiple speakers), inverse text normalization and normalization error rate, punctuation accuracy, timestamp accuracy, latency, resource consumption, custom vocabulary, sentence-level sentiment analysis, multiple languages supported, cost-to-performance tradeoff, and personal identifying information/payment card industry redaction, to name a few. In one embodiment, the system may utilize a loss function 607 to measure the system's performance. The loss function 607 compares the training outputs with an expected output and determines how the algorithm needs to be changed in order to improve the quality of the model output. During the training stage, all outputs may be passed through the loss function 607 on a continuous loop until the algorithms 603 are in a position where they can effectively be incorporated into a deployed model 615.
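By way of illustration only, a training objective of the kind referenced above may combine a reconstruction term with a KL divergence term, as in the following non-limiting sketch; the weighting factor and the commented-out optimization step, which assumes the encoder and decoder sketches shown earlier, are illustrative.

    import torch
    import torch.nn.functional as F

    def vae_loss(x, x_hat, mu, logvar, beta=1.0):
        """Reconstruction loss plus KL divergence to a standard normal prior."""
        recon = F.mse_loss(x_hat, x, reduction="mean")
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return recon + beta * kl

    # Illustrative single training step (encoder/decoder from the earlier sketches assumed):
    # optimizer = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)
    # z, mu, logvar = encoder(batch)
    # loss = vae_loss(batch, decoder(z), mu, logvar)
    # optimizer.zero_grad(); loss.backward(); optimizer.step()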
The test dataset can be used to test the accuracy of the model outputs. If the training model is establishing correlations that satisfy a certain criterion, such as, but not limited to, the quality of the correlations and the amount of restored lost data, then it can be moved to the model deployment stage as a fully trained and deployed model 610 in a production environment making predictions based on live input data 611 (e.g., codeword vector inputs, latent space vector representations). Further, model correlations and restorations made by the deployed model can be used as feedback and applied to model training in the training stage, wherein the model is continuously learning over time using both training data and live data and predictions. A model and training database 606 is present and configured to store training/test datasets and developed models. Database 606 may also store previous versions of models.
According to some embodiments, the one or more machine and/or deep learning models may comprise any suitable algorithm known to those with skill in the art including, but not limited to: LLMs, generative transformers, transformers, supervised learning algorithms such as: regression (e.g., linear, polynomial, logistic, etc.), decision tree, random forest, k-nearest neighbor, support vector machines, Naïve-Bayes algorithm; unsupervised learning algorithms such as clustering algorithms, hidden Markov models, singular value decomposition, and/or the like. Alternatively, or additionally, algorithms 603 may comprise a deep learning algorithm such as neural networks (e.g., recurrent, convolutional, long short-term memory networks, etc.).
In some implementations, the machine learning training system 600 automatically generates standardized model scorecards for each model produced to provide rapid insights into the model and training data, maintain model provenance, and track performance over time. These model scorecards provide insights into model framework(s) used, training data, training data specifications such as chip size, stride, data splits, baseline hyperparameters, and other factors. Model scorecards may be stored in database(s) 606.
In a step 710, convert the plurality of inputs into a plurality of sourceblocks. Once the inputs are collected, they are converted into a plurality of sourceblocks. Sourceblocks are discrete units of information that capture the essential characteristics and patterns within the input data. The conversion process may involve techniques such as segmentation, tokenization, or feature extraction, depending on the nature of the input data. For example, in the case of text data, the inputs can be converted into sourceblocks by breaking them down into individual words, subwords, or phrases. For time series data, sourceblocks can be created by dividing the input into fixed-length windows or using techniques like sliding windows or overlapping segments.
In a step 720, assign codewords to each sourceblock based on a dictionary generated by a codebook generation subsystem. After converting the inputs into sourceblocks, each sourceblock is assigned a unique codeword based on a dictionary generated by a codebook generation subsystem. The codebook is a component of the Latent Transformer LCM system that maps the sourceblocks to their corresponding codewords. The codebook generation subsystem employs techniques such as clustering, vector quantization, or learned embedding spaces to create a compact and efficient representation of the sourceblocks. Each codeword serves as a discrete and compressed representation of the associated sourceblock, capturing its essential information and characteristics.
In a step 730, process the plurality of codewords through a variational autoencoder encoder system to create a plurality of latent space vectors. Once the codewords are assigned, they are processed through a variational autoencoder (VAE) encoder system. The VAE encoder takes the codewords as input and maps them into a lower-dimensional latent space representation. The encoder consists of multiple layers of neural networks that learn to compress the codewords into compact and informative latent space vectors. The latent space vectors capture the underlying structure, patterns, and variations present in the input data, while reducing the dimensionality and noise. The VAE encoder learns to generate a probabilistic distribution over the latent space, allowing for the sampling of new latent vectors during the generation process.
In a step 740, process the plurality of latent space vectors through a latent transformer, which leverages learned relationships between latent space vectors to generate a plurality of responses or predictions. The latent space vectors generated by the VAE encoder are then processed through a latent transformer. The latent transformer is a specialized neural network architecture that learns the relationships and dependencies between the latent space vectors. It employs self-attention mechanisms to capture the contextual information and long-range dependencies within the latent space. The latent transformer leverages these learned relationships to generate a plurality of responses or predictions based on the input latent vectors. It can perform tasks such as sequence-to-sequence prediction, data generation, or anomaly detection, depending on the specific application and training objectives.
In a step 750, decode the plurality of responses or predictions through a variational autoencoder decode subsystem. The generated responses or predictions from the latent transformer are in the form of latent space vectors. To obtain the final output, these latent vectors are passed through a variational autoencoder (VAE) decode subsystem. The VAE decoder takes the latent vectors as input and maps them back to the original data space. It consists of multiple layers of neural networks that learn to reconstruct the sourceblocks or generate new data based on the latent representations. The decoder aims to produce outputs that closely resemble the desired or expected results, utilizing the information captured in the latent space.
In a step 760, output the decoded plurality of responses or predictions. The decoded responses or predictions are outputted as the final result of the Latent Transformer LCM system. These outputs can take various forms, such as reconstructed input data, predicted future sequences, or generated samples, depending on the specific task and application. The outputted responses or predictions leverage the learned relationships and patterns captured by the latent transformer and the VAE decoder, providing meaningful and coherent results.
Throughout the method, the Latent Transformer LCM system learns to compress the input data into a compact latent space representation, capture the underlying relationships and dependencies, and generate accurate and contextually relevant responses or predictions. The combination of the VAE encoder, latent transformer, and VAE decoder enables the system to handle a wide range of data types and perform various tasks, such as data compression, anomaly detection, sequence prediction, and data generation. The training process involves optimizing the parameters of the VAE encoder, latent transformer, and VAE decoder using techniques such as gradient descent and backpropagation. The system learns to minimize the reconstruction loss between the input data and the decoded outputs, while also capturing the relevant patterns and relationships in the latent space.
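The overall flow of steps 710 through 760 may be summarized by the following non-limiting sketch, in which each callable is a placeholder for the corresponding subsystem rather than a definitive implementation.

    import torch

    def latent_transformer_lcm_pipeline(raw_inputs, allocator, encoder, transformer, decoder):
        """End-to-end sketch of steps 710-760; `allocator`, `encoder`, `transformer`,
        and `decoder` are assumed placeholders for the codeword allocator, VAE encoder,
        latent transformer, and VAE decoder subsystems described above."""
        codewords = allocator(raw_inputs)    # steps 710-720: sourceblocks -> codewords
        z, mu, logvar = encoder(codewords)   # step 730: codewords -> latent space vectors
        predictions = transformer(z)         # step 740: learned latent relationships -> predictions
        outputs = decoder(predictions)       # step 750: latent predictions -> data space
        return outputs                       # step 760: decoded responses or predictions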
To prepare the time series data for processing by the VAE Encode Subsystem 150, the codeword allocator 113 performs a specific data arrangement. It creates a time series input vector 820 by combining a portion of the original time series data points with a set of truncated data points and a sequence of zeros. Let's consider an example where the time series input vector 820 consists of 1000 elements. In this case, the codeword allocator 113 takes the original time series data and selects the most recent 950 data points. These 950 data points form the truncated time series data points 800 and represent the known or observed values up to a certain point in time.
The codeword allocator 113 then appends a sequence of 50 zeros 810 to the truncated time series data points 800. These zeros serve as placeholders for the future or unknown values that the system aims to predict. By combining the truncated data points and the zeros, the codeword allocator 113 creates the entire time series input vector 820 with a total of 1000 elements. The time series input vector 820 is then fed into the VAE Encode Subsystem 150. The VAE Encode Subsystem 150 takes the input vector and maps it into a lower-dimensional latent space representation. It learns to compress the time series data into a compact and informative latent space vector while capturing the underlying patterns, trends, and dependencies present in the data.
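The arrangement of 950 truncated data points followed by 50 zero placeholders may be illustrated by the following non-limiting sketch; the lengths follow the example above, and the use of zero as the padding value is assumed from the text.

    import torch

    def build_time_series_input(series, known=950, horizon=50):
        """Builds a 1000-element input vector: the most recent `known` observations
        followed by `horizon` zero placeholders for the values to be predicted."""
        truncated = series[-known:]                        # truncated time series data points 800
        zeros = torch.zeros(horizon, dtype=series.dtype)   # placeholders 810 for future values
        return torch.cat([truncated, zeros])               # full time series input vector 820

    series = torch.randn(5000)                # illustrative historical series
    vec = build_time_series_input(series)
    print(vec.shape, vec[-50:].abs().sum())   # torch.Size([1000]) tensor(0.)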
The latent space vector generated by the VAE Encode Subsystem 150 is subsequently processed by the Latent Transformer Subsystem 170. The Latent Transformer leverages its self-attention mechanisms and learned relationships between latent space vectors to make predictions or generate responses based on the input data. In the context of time series prediction, the Latent Transformer focuses on predicting the values corresponding to the 50 zeros appended to the time series input vector. By analyzing the patterns and dependencies in the truncated time series data points, the Latent Transformer generates a prediction or forecast for the future values.
The predicted values are then passed through the VAE Decode Subsystem 180, which maps the latent space predictions back to the original data space. The VAE Decode Subsystem reconstructs the complete time series, including the predicted values for the 50 zeros. The reconstructed time series, along with the predicted future values, is outputted as the final result. This output provides valuable insights and forecasts for the time series data, enabling users to make informed decisions and take appropriate actions based on the predicted future trends.
The specific number of truncated data points and zeros in the time series input vector can be adjusted based on the specific requirements and characteristics of the time series data. The choice of these values depends on factors such as the desired forecast horizon, the temporal resolution of the data, and the available historical data.
By leveraging the Codeword Allocator 113 to create the time series input vector and combining it with the power of the VAE Encode Subsystem 150 and the Latent Transformer Subsystem 170, the Latent Transformer LCM system enables effective time series prediction and forecasting. It learns to capture the complex patterns, trends, and dependencies in the time series data and generates accurate predictions for future values, providing valuable insights and supporting decision-making processes.
The codeword allocator 113 receives a plurality of data points 800 as input, which can represent various types of information such as time series data, text, images, or any other structured or unstructured data. It processes the input data and creates an input vector 820 that combines a portion of the original data points with truncated data points and a sequence of zeros.
In the embodiment, the codeword allocator 113 has the ability to append metadata markers 900 to the input vector 820. These metadata markers provide valuable information about the data being processed, allowing the Latent Transformer to learn more comprehensive and context-aware relationships between the latent space vectors.
The metadata markers 900 can include a wide range of information, such as data type, temporal information, data source, data characteristics, and domain-specific metadata. For instance, the metadata markers can specify whether the input data is time series, text, images, or any other relevant data type. In the case of time series data, the metadata markers can include timestamps or temporal indicators associated with each data point, enabling the Latent Transformer to capture sequential dependencies and temporal patterns more effectively.
Additionally, the metadata markers can indicate the source or origin of the data, such as the specific sensor, device, or database from which the data was collected, allowing the Latent Transformer to learn source-specific patterns and characteristics. Furthermore, the metadata markers can provide information about the statistical properties or characteristics of the data, such as the mean, variance, or distribution type, assisting the Latent Transformer in understanding the underlying data distribution and making more informed predictions.
The codeword allocator 113 appends these metadata markers 900 to the input vector 820 alongside the truncated data points 800 and zeros 810, resulting in a rich combination of data points, truncated values, zeros, and metadata information. This input vector 820 is then fed into the VAE Encode Subsystem 150, which maps it into a lower-dimensional latent space representation, capturing the underlying patterns, dependencies, and metadata information in the latent space vector.
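By way of illustration only, the following non-limiting sketch appends a small block of metadata markers to an input vector; the particular markers chosen (a data-type identifier, a source identifier, and a timestamp) and their scalar encoding are assumptions made for clarity.

    import torch

    def append_metadata(input_vec, data_type_id, source_id, timestamp):
        """Appends a small metadata block (here three scalars) to the input vector.
        The choice and encoding of the markers are illustrative, not prescribed."""
        markers = torch.tensor([float(data_type_id), float(source_id), float(timestamp)])
        return torch.cat([input_vec, markers])

    vec = torch.randn(1000)                          # data points + truncated values + zeros
    augmented = append_metadata(vec, data_type_id=1, # e.g., 1 = time series (assumed encoding)
                                source_id=7, timestamp=1717.0)
    print(augmented.shape)  # torch.Size([1003])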
The Latent Transformer Subsystem 170 then processes the latent space vector, leveraging its self-attention mechanisms and learned relationships to make predictions or generate responses based on the input data. By incorporating metadata markers 900 into the input vector 820, the Latent Transformer can learn more robust and context-aware relationships between the latent space vectors. The metadata information provides additional guidance and context to the Latent Transformer, enabling it to capture complex patterns, dependencies, and domain-specific characteristics more effectively. For example, in a financial forecasting task, the metadata markers may include information about the company, industry, or economic indicators, allowing the Latent Transformer to incorporate this contextual information into its predictions. Similarly, in a text generation task, the metadata markers may include information about the genre, topic, or sentiment of the text, enabling the Latent Transformer to generate more coherent and contextually relevant responses.
The inclusion of metadata markers 900 enhances the expressiveness and adaptability of the Latent Transformer LCM system, allowing it to process and learn from a wide range of data types and incorporate relevant metadata information to improve the accuracy and contextual understanding of the generated predictions or responses. The specific types and formats of the metadata markers 900 can be tailored to the requirements and characteristics of the data being processed, with the codeword allocator 113 designed to extract and append the most relevant and informative metadata based on domain knowledge and the specific task at hand.
By leveraging the power of metadata markers 900 in conjunction with data points, truncated values, and zeros, the Latent Transformer LCM system can learn more comprehensive and robust relationships between the latent space vectors, enabling it to generate more accurate and context-aware predictions or responses across a wide range of applications, including time series forecasting, text generation, image synthesis, and more.
In a step 1010, the collected inputs are converted into a plurality of sourceblocks. Sourceblocks are discrete units of information that capture the essential characteristics and patterns within the input data. The conversion process may involve techniques such as segmentation, tokenization, or feature extraction, depending on the nature of the input data. For example, in the case of text data, the inputs can be converted into sourceblocks by breaking them down into individual words, subwords, or phrases. For time series data, sourceblocks can be created by dividing the input into fixed-length windows or using techniques like sliding windows or overlapping segments.
In a step 1020, assign codewords to each sourceblock based on a dictionary generated by a codebook generation subsystem. The codebook is a component of the Latent Transformer LCM system that maps the sourceblocks to their corresponding codewords. The codebook generation subsystem employs techniques such as clustering, vector quantization, or learned embedding spaces to create a compact and efficient representation of the sourceblocks. Each codeword serves as a discrete and compressed representation of the associated sourceblock, capturing its essential information and characteristics.
In a step 1030, an input vector is created using the assigned codewords. This step is particularly relevant for tasks involving prediction or forecasting, such as time series prediction. The input vector includes a truncated data set, which represents the known or observed values up to a certain point in time. The truncated data set may be followed by a sequence of zeros, which serve as placeholders for the future or unknown values that the system aims to predict. The combination of the truncated data set and the zeros forms the complete input vector.
In a step 1040, process the input vector through a VAE encoder subsystem to generate a latent space vector representation of the input vector. The VAE encoder subsystem is a component of the Latent Transformer LCM system, responsible for mapping the input vector into a lower-dimensional latent space. The VAE encoder learns to compress the input data while capturing the underlying patterns, dependencies, and essential features in the latent space vector. By encoding the input vector into a compact latent representation, the VAE encoder enables efficient processing and learning by the subsequent components of the system.
In a step 1050, a transformer is used to learn relationships between the latent space vector representations. The transformer architecture, with its self-attention mechanism, is well-suited for capturing long-range dependencies and complex interactions within the data. By learning the relationships between the latent space vectors, the transformer can uncover patterns, correlations, and dependencies that may not be apparent in the original input space. These learned relationships can be leveraged to determine the values of the zero portion in the next input vector, enabling the system to make predictions or generate future values based on the truncated data set.
The transformer learns to attend to relevant information from the latent space vectors and propagate that information through its layers to generate meaningful predictions. By iteratively processing the input vectors and learning from the relationships between the latent space representations, the transformer can capture the underlying dynamics and patterns in the data, enabling accurate predictions of the unknown values.
The combination of codeword assignment, VAE encoding, and transformer learning enables the Latent Transformer LCM system to effectively process and predict data across various domains. The method leverages the power of compressed representations, latent space learning, and self-attention to uncover complex patterns and generate accurate predictions.
In a step 1110, the collected inputs are converted into a plurality of sourceblocks. Sourceblocks are discrete units of information that capture the essential characteristics and patterns within the input data. The conversion process may involve techniques such as segmentation, tokenization, or feature extraction, depending on the nature of the input data. For example, in the case of text data, the inputs can be converted into sourceblocks by breaking them down into individual words, subwords, or phrases. For time series data, sourceblocks can be created by dividing the input into fixed-length windows or using techniques like sliding windows or overlapping segments.
In a step 1120, assign codewords to each sourceblock based on a dictionary generated by a codebook generation subsystem. The codebook is a component of the Latent Transformer LCM system, as it maps the sourceblocks to their corresponding codewords. The codebook generation subsystem employs techniques such as clustering, vector quantization, or learned embedding spaces to create a compact and efficient representation of the sourceblocks. Each codeword serves as a discrete and compressed representation of the associated sourceblock, capturing its essential information and characteristics.
In a step 1130, an input vector is created using the assigned codewords, along with additional components. The input vector includes a truncated data set, which represents the known or observed values up to a certain point in time. The truncated data set is followed by a sequence of zeros, which serve as placeholders for the future or unknown values that the system aims to predict. In addition to the truncated data set and zeros, the input vector also includes a metadata portion. The metadata portion contains relevant information about the input data, such as the data type, timestamp, source, or any other contextual details that can aid in the learning and prediction process.
In a step 1140, process the input vector through a VAE encoder subsystem to generate a latent space vector representation of the input vector. The VAE encoder subsystem is a critical component of the Latent Transformer LCM system, responsible for mapping the input vector into a lower-dimensional latent space. The VAE encoder learns to compress the input data while capturing the underlying patterns, dependencies, and essential features in the latent space vector. By encoding the input vector into a compact latent representation, the VAE encoder enables efficient processing and learning by the subsequent components of the system.
In a step 1150, a transformer is used to learn relationships between the latent space vector representations. The transformer architecture, with its self-attention mechanism, is well-suited for capturing long-range dependencies and complex interactions within the data. By learning the relationships between the latent space vectors, the transformer can uncover patterns, correlations, and dependencies that may not be apparent in the original input space. These learned relationships can be leveraged to determine the values of the zero portion in the next input vector, enabling the system to make predictions or generate future values based on the truncated data set.
In a step 1160, relationships established by the transformer are based on the metadata portion of each input vector. The metadata portion corresponds to the data type of the plurality of inputs, providing contextual information about the nature and characteristics of the data. By considering the metadata during the learning process, the transformer can establish more meaningful and targeted relationships between the latent space vectors. For example, if the metadata indicates that the input data is time series, the transformer can focus on capturing temporal dependencies and patterns specific to time series data. Similarly, if the metadata represents different categories or classes of data, the transformer can learn class-specific relationships and distinguish between different data types.
The incorporation of metadata in the learning process enhances the ability of the Latent Transformer LCM system to capture and leverage domain-specific knowledge and characteristics. By establishing relationships based on the metadata, the transformer can generate more accurate and context-aware predictions or outputs. The metadata acts as an additional guide, helping the transformer to focus on the most relevant aspects of the data and improve the quality of the learned representations.
The system maintains components found in the original Latent Transformer Core: a VAE encoder subsystem 150, a latent transformer subsystem 170, and a VAE decoder subsystem 180. These components work in concert to process input data, transform it into a latent space representation, and generate output based on learned patterns and relationships. The expander 151 and compressor 152 modules continue to play their roles in adjusting the dimensionality of the data as it flows through the system.
Introduced in this figure is a latent space replicator 1200. This component creates a “T-branch” in the data flow, allowing the system to simultaneously process the latent space vectors through the main pipeline and analyze their dynamics. The latent space replicator 1200 duplicates each latent vector produced by the VAE encoder subsystem 150, sending one copy to the latent transformer subsystem 170 for traditional processing and another to the newly introduced latent dynamics analyzer core 1210.
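A minimal sketch of the replicator's behavior, under assumed interfaces (the class name and the metadata handling are illustrative):

```python
import torch

class LatentSpaceReplicator(torch.nn.Module):
    """Sketch of the 'T-branch': duplicate each latent vector so one copy feeds
    the latent transformer and the other feeds the latent dynamics analyzer."""

    def forward(self, z, metadata=None):
        z_for_transformer = z
        z_for_dynamics = z.clone()          # independent copy for the analyzer branch
        meta_copy = None if metadata is None else dict(metadata)
        return (z_for_transformer, metadata), (z_for_dynamics, meta_copy)
```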
A latent dynamics analyzer core 1210 is designed to extract and analyze the underlying dynamics of the latent space representations. By processing sequences of latent vectors, this component aims to uncover the equations of motion that govern the evolution of these vectors over time. This is particularly valuable for understanding complex, non-linear systems such as financial markets or other dynamic environments where the underlying rules may not be immediately apparent.
The output of the latent dynamics analyzer core is represented as derived vector equations 1220. These equations provide a mathematical description of how the latent vectors change and interact over time, offering insights into the system's behavior that go beyond mere pattern recognition. For example, in the context of financial market analysis, these equations might capture the intricate relationships between various market factors, revealing how they influence each other and drive overall market dynamics. See APPENDIX B for exemplary pseudocode in PyTorch for a system that includes both a latent transformer and a latent dynamics analyzer.
The derived vector equations can be used in multiple ways to enhance the system's performance and provide valuable insights. They can be fed back into the latent transformer subsystem 170 to improve its attention mechanisms, allowing it to focus on the most relevant aspects of the input data based on the learned dynamics. Additionally, these equations can be used to detect anomalies in the latent space trajectories, identifying unusual market behaviors or regime changes that might not be apparent from the raw data alone.
An example of how this system could be applied to market trend analysis would involve feeding in historical market data, such as stock prices, trading volumes, and economic indicators. The VAE encoder subsystem would transform this data into latent space vectors, capturing the essential features and relationships within the market data. As these vectors are processed through both the main pipeline and the latent dynamics analyzer core, the system would not only generate short-term predictions (via the latent transformer and VAE decoder) but also derive equations describing the underlying market dynamics.
These derived equations might reveal, for instance, how changes in trading volume relate to price movements under different market conditions, or how the influence of certain economic indicators on stock prices varies over time. By analyzing these equations and comparing them across different time periods, analysts could identify shifts in market regimes, such as transitions from bull to bear markets or the emergence of new correlations between previously unrelated factors.
The VAE encoder subsystem 150 can be enhanced to not only generate latent space vectors but also to produce and maintain associated metadata for each vector. This metadata can include crucial information such as timestamps, data source identifiers, and other contextual information relevant to the input data. For instance, in the case of financial market data, the metadata might include the specific time and date of the market snapshot, the set of assets or indicators included in the data, and perhaps even market sentiment scores derived from concurrent news analysis.
As the VAE encoder 150 processes the input data, it can embed this metadata into an auxiliary space that is coupled with, but distinct from, the main latent space. This approach allows the system to preserve important contextual information without directly influencing the dimensionality or structure of the primary latent space representations. The metadata can be thought of as “tags” attached to each latent vector, providing a rich set of additional information that can be leveraged by other components of the system.
When the latent space replicator 1200 creates copies of the latent vectors, it also duplicates this associated metadata. As a result, both the latent transformer subsystem and the latent dynamics analyzer core have access to this valuable contextual information. This is particularly crucial for the latent dynamics analyzer, as it enables more sophisticated temporal encoding of the vector sequences.
Within the latent dynamics analyzer core 1210, a temporal encoding layer can utilize the metadata to create more informative and context-aware encodings. Rather than relying solely on the order of vectors in a sequence, it can incorporate precise timing information, allowing it to capture complex temporal patterns and irregularities in the data. For example, it could account for varying time intervals between data points, weekend effects in financial data, or other temporal nuances that might be critical for understanding the system's dynamics.
The metadata can also play a significant role in enhancing the latent transformer subsystem's attention mechanisms. By incorporating this additional information, the attention mechanism can become more selective and context-aware. For instance, when processing financial data, the transformer could learn to pay more attention to latent vectors associated with high-volatility periods or to give more weight to more recent data points in its predictions. This metadata-enhanced attention mechanism allows the system to adapt its focus dynamically based on the specific context of each input sequence.
Furthermore, the derived vector equations produced by the latent dynamics analyzer can be used to update parameters within the latent transformer core, ensuring more reliable predictions. These equations essentially provide a model of how the latent space evolves over time. By incorporating this knowledge into the transformer's architecture, the system can guide the transformer's learning process and constrain its predictions to be consistent with the observed dynamics of the system.
For example, if the derived equations indicate a strong cyclical pattern in certain components of the latent space, this information could be used to adjust the transformer's internal representation or to modify its output layer to better capture and predict these cycles. Similarly, if the equations reveal strong dependencies between specific dimensions of the latent space, this could be reflected in the structure of the transformer's feed-forward layers or in the initialization of its weight matrices.
The integration of these vector equations into the latent transformer core can be implemented as a form of dynamic regularization. During the training process, the transformer's outputs can be compared not only to the target data but also to what would be expected based on the current set of derived equations. This creates a feedback loop where the transformer is continuously guided to make predictions that are both accurate in terms of the raw data and consistent with the understood dynamics of the system.
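The dynamic-regularization idea can be sketched as a combined loss; the form below is an illustrative assumption (a single explicit Euler step of the derived dynamics compared against the observed next latent vector), not the disclosed formulation:

```python
import torch.nn.functional as F

def dynamics_regularized_loss(pred, target, z_seq, dynamics_fn, lam=0.1):
    """pred/target : model outputs and ground truth in data space.
    z_seq       : sequence of latent vectors, shape [T, D].
    dynamics_fn : callable implementing the derived equations of motion, dz/dt = f(z).
    lam         : weight of the consistency term (an assumed hyperparameter)."""
    data_loss = F.mse_loss(pred, target)
    z_t, z_next = z_seq[:-1], z_seq[1:]
    z_step = z_t + dynamics_fn(z_t)            # one Euler step, assuming unit time spacing
    consistency = F.mse_loss(z_step, z_next)   # agreement with the derived dynamics
    return data_loss + lam * consistency
```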
Moreover, as new data is processed and the latent dynamics analyzer refines its understanding of the system's behavior, the vector equations can be updated. These updates can then trigger corresponding adjustments in the latent transformer core, allowing the entire system to adapt to evolving dynamics in the underlying data. This adaptive capability is particularly valuable when dealing with non-stationary systems like financial markets, where the relationships between variables and the overall behavior of the system can change over time.
By leveraging metadata and derived vector equations in this manner, the system achieves a synergy between its predictive capabilities (embodied in the latent transformer) and its understanding of underlying dynamics (captured by the latent dynamics analyzer). This integration allows for more nuanced, context-aware, and dynamically adaptive processing of complex, time-varying data, ultimately leading to more robust and interpretable predictions and insights.
The system's ability to continually update and refine these equations as new data is processed allows for real-time monitoring of market dynamics. Sudden changes in the form or parameters of the derived equations could signal important shifts in market behavior, potentially providing early warnings of market turns or the breakdown of established relationships between market factors.
The first stage of processing occurs in the temporal encoding layer 1310. This layer doesn't simply arrange the vectors in sequence; rather, it combines the temporal information from the metadata with each of the vector representations. By incorporating precise timing data and other contextual cues, the temporal encoding layer 1310 transforms the discrete sequence of vectors into a continuous-time representation that captures the nuanced evolution of the system's state. This temporally-enriched representation sets the stage for the subsequent components to uncover the dynamic laws governing the system's behavior. From the temporal encoding layer 1310, the enhanced vector representations flow into a neural ODE module 1320. Instead of treating the system's behavior as a series of discrete steps, the neural ODE module learns to approximate the continuous-time dynamics of the latent space. It does this by training a neural network to represent the derivative of the system's state with respect to time. This approach allows the module to capture smooth, realistic trajectories in the latent space, even for systems with complex, nonlinear dynamics.
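A simplified sketch of such a neural ODE module, using a fixed-step Euler integrator over metadata-derived timestamps (the layer sizes, names, and integrator choice are illustrative assumptions; a library such as torchdiffeq could supply adaptive solvers instead):

```python
import torch
import torch.nn as nn

class LatentODEFunc(nn.Module):
    """Neural network approximating the time derivative dz/dt of the latent state."""
    def __init__(self, latent_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + 1, hidden), nn.Tanh(),
            nn.Linear(hidden, latent_dim),
        )

    def forward(self, t, z):
        # condition on time so irregular sampling (weekends, holidays) can be handled
        t_col = t * torch.ones(z.shape[0], 1)
        return self.net(torch.cat([z, t_col], dim=-1))

def integrate(ode_func, z0, times):
    """Fixed-step Euler integration of z0 over the (possibly irregular) timestamps."""
    zs = [z0]
    for t0, t1 in zip(times[:-1], times[1:]):
        zs.append(zs[-1] + (t1 - t0) * ode_func(t0, zs[-1]))
    return torch.stack(zs)          # trajectory of shape [len(times), batch, latent_dim]
```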
The output of the neural ODE module 1320, a learned representation of the system's continuous-time dynamics, then feeds into the symbolic regression network 1330. This network performs a transformation, attempting to distill the learned dynamics into explicit mathematical expressions. It's akin to a mathematician observing a complex system and writing down equations to describe its behavior. The symbolic regression network 1330 generates candidate expressions, exploring a vast space of possible mathematical formulations to find those that best capture the observed dynamics. These candidate expressions then pass through the equation decoder 1340, which acts as an interpreter, translating the abstract symbolic representations into concrete, human-readable mathematical equations. This step is vital for the interpretability of the system, bridging the gap between the black-box nature of neural networks and the explicit, understandable form of mathematical equations.
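One simple stand-in for this symbolic regression stage is a SINDy-style sparse regression over a library of candidate terms; the sketch below is an assumed simplification for illustration and is not the network architecture described above:

```python
import numpy as np

def fit_symbolic_dynamics(Z, dZdt, thresh=0.05):
    """Express dz/dt as a sparse linear combination of candidate terms.

    Z    : array [T, D] of latent vectors
    dZdt : array [T, D] of estimated time derivatives
    """
    # candidate library: constant, linear, and pairwise quadratic terms
    terms = [np.ones((Z.shape[0], 1)), Z]
    names = ["1"] + [f"z{i}" for i in range(Z.shape[1])]
    for i in range(Z.shape[1]):
        for j in range(i, Z.shape[1]):
            terms.append((Z[:, i] * Z[:, j])[:, None])
            names.append(f"z{i}*z{j}")
    theta = np.hstack(terms)
    coefs, *_ = np.linalg.lstsq(theta, dZdt, rcond=None)
    coefs[np.abs(coefs) < thresh] = 0.0          # crude sparsification step
    equations = []
    for d in range(Z.shape[1]):                  # one human-readable equation per dimension
        parts = [f"{c:+.3f}*{n}" for c, n in zip(coefs[:, d], names) if c != 0]
        equations.append(f"dz{d}/dt = " + (" ".join(parts) if parts else "0"))
    return equations
```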
However, the process doesn't end with the generation of equations. The physics-informed regularization module 1350 serves as a sophisticated filter, assessing the plausibility and consistency of the derived equations. This module incorporates domain-specific knowledge and fundamental physical principles to ensure that the equations not only fit the observed data but also adhere to known laws and constraints of the system being modeled. For instance, in a financial context, this module might enforce principles of conservation of money or known market microstructure effects.
The interplay between these components is not simply a linear flow but a dynamic, iterative process. The physics-informed regularization module 1350 provides feedback that influences the symbolic regression network and equation decoder, guiding them towards more physically realistic and consistent equations. This feedback loop ensures that the derived equations are not just mathematically sound but also aligned with the fundamental nature of the system being analyzed.
The culmination of this intricate process is the output of derived vector equations 1220. These equations represent a distilled understanding of the system's dynamics, captured in a form that is both mathematically precise and humanly interpretable. They describe how different components of the latent space interact and evolve over time, providing insights into the underlying mechanisms driving the system's behavior.
For example, in a financial market context, these equations might reveal how changes in trading volume relate to price movements, or how the influence of economic indicators on asset prices varies under different market conditions. The equations could capture complex, nonlinear relationships that are not apparent from simple statistical analysis of the raw data.
The derived equations are not static outputs but dynamic entities that evolve as the system processes more data. As new latent vectors are fed into the analyzer, the equations are continuously refined and updated, allowing the system to adapt to changing dynamics in the underlying data. This adaptive capability is particularly valuable when dealing with non-stationary systems like financial markets, where the relationships between variables and the overall behavior of the system can change over time.
The latent dynamics analyzer core, through its intricate interplay of components, transforms abstract latent space representations into concrete, actionable insights about the system's behavior. It bridges the gap between the powerful but often opaque world of neural networks and the clear, interpretable realm of mathematical equations, providing a unique tool for understanding and predicting the behavior of complex, dynamic systems.
Central to this training process is the forecasting accuracy loss 1400 component. This loss function serves as the primary metric for evaluating the system's performance, focusing specifically on the accuracy of its time series predictions. By using forecasting accuracy as the main criterion, the system is driven to optimize not just for reconstruction or representation quality, but for its ability to make accurate predictions about future states of the input data.
The backpropagation process in this end-to-end training setup is particularly intricate due to the complex interactions between components. Starting from the forecasting accuracy loss, gradients are computed and propagated backwards through the entire system. This includes the VAE decoder subsystem 180, the latent transformer subsystem 170, the VAE encoder subsystem 150, and even the latent dynamics analyzer core 1210.
What makes this process unique is how it handles the parallel branches of the system. The latent space replicator 1200 creates two paths for the latent vectors: one through the main prediction pipeline and another through the latent dynamics analyzer. During backpropagation, gradients flow back through both of these paths, allowing both branches to be optimized simultaneously.
The latent transformer subsystem 170 receives gradients directly related to prediction accuracy, pushing it to refine its attention mechanisms and internal representations to better capture predictive patterns in the data. Simultaneously, the latent dynamics analyzer core 1210 receives gradients that encourage it to derive vector equations that not only describe the observed dynamics but also contribute to improved forecasting accuracy. This dual optimization creates an interplay between prediction and understanding. The system isn't just learning to make accurate predictions; it's learning to make accurate predictions while also deriving interpretable equations that explain those predictions. This balance is key to creating a system that is both powerful and transparent.
The VAE encoder subsystem 150 plays a central role in this process. As gradients flow back to it from both the prediction pipeline and the latent dynamics analyzer, it learns to create latent representations that are simultaneously good for prediction and amenable to equation derivation. This dual pressure often results in latent spaces that capture more meaningful and structured representations of the underlying data.
The end-to-end training process also allows for the integration of various regularization terms and constraints. For example, the physics-informed regularization in the latent dynamics analyzer can influence the gradients flowing back through the system, encouraging the entire model to respect known physical laws or domain-specific constraints. The derived vector equations 1220 from the latent dynamics analyzer can be used to inform and constrain the training of the latent transformer subsystem. By incorporating these equations into the loss function or using them to regularize the transformer's behavior, the system can ensure that its predictions are not only accurate but also consistent with the understood dynamics of the system.
This holistic training approach allows the system to find a delicate balance between different objectives. It must generate accurate predictions, create meaningful and structured latent representations, and derive interpretable equations, all while respecting physical constraints and domain knowledge. The end result is a system that not only performs well but does so in a way that is more interpretable, reliable, and grounded in the underlying dynamics of the data it processes. Through this end-to-end training process, the system becomes more than the sum of its parts. It becomes a cohesive, integrated model that leverages the strengths of each component to create a powerful tool for analyzing and predicting complex, dynamic systems. Whether applied to financial markets, climate systems, or any other domain with complex temporal dynamics, this end-to-end trained system offers a unique combination of predictive power and analytical insight.
In a step 1510, the system duplicates the latent space vectors into two copies. This step, performed by the latent space replicator, creates a “T-branch” in the data flow, enabling parallel processing of the latent vectors for both prediction and dynamics analysis. This duplication is key to the system's ability to simultaneously generate predictions and analyze underlying dynamics.
In a step 1520, one copy of the latent space vectors is sent through a latent transformer for primary processing and prediction. The latent transformer, leveraging its self-attention mechanisms and learned relationships, processes these vectors to generate short-term predictions or responses. In the context of financial markets, this could involve predicting next-day stock prices or identifying potential market trends.
In a step 1530, another copy of the latent space vectors is sent through a latent dynamics analyzer for generating derived equations of motion for the latent space vectors. This step involves the complex process of analyzing the temporal evolution of latent vectors to uncover the underlying dynamics of the system. The latent dynamics analyzer, with its components like the neural ODE module and symbolic regression network, works to derive mathematical equations that describe how the latent space evolves over time. For a financial market application, these equations might capture how different market factors interact and influence each other over time.
In a step 1540, the system generates the primary output by processing the latent transformer's results through a VAE decoder. This step translates the predictions made in the latent space back into the original data space, providing interpretable outputs. In our financial market example, this could be the actual predicted stock prices or market indicators.
In a step 1550, the system utilizes the derived vector equations from the latent dynamics analyzer to enhance its performance. This step is where the system begins to leverage its deeper understanding of the underlying dynamics to improve its overall functionality. The derived equations provide a mathematical model of how the latent space evolves, which can be used to refine predictions and gain insights into the system's behavior.
In a step 1560, the system uses the derived vector equations to inform the latent transformer's attention mechanisms of any proposed modifications, detect any anomalies in latent space dynamics, or provide interpretable insights into the underlying system behavior. This step showcases the synergy between the predictive capabilities of the latent transformer and the analytical power of the latent dynamics analyzer. For instance, in a financial market context, the system might use the derived equations to adjust its attention to specific market factors during periods of high volatility, identify unusual market behaviors that deviate from the learned dynamics, or provide analysts with mathematical descriptions of how different economic factors are interacting in the current market regime.
This method encapsulates the core functionality of the enhanced system, demonstrating how it combines predictive modeling with dynamic analysis to provide a more comprehensive understanding and forecasting capability for complex, time-varying systems.
In a step 1610, the system processes the input through a temporal encoding layer to capture temporal relationships between successive latent space vectors. This step is crucial for understanding how the system's state evolves over time. The temporal encoding layer doesn't just arrange vectors in sequence; it incorporates metadata about timing and context to create a rich, temporally-aware representation. In the financial market example, this could involve encoding information about the time intervals between market snapshots, accounting for factors like trading hours, weekends, or holidays that might affect the temporal dynamics of the market.
In a step 1620, the system feeds the temporally encoded vectors into a neural ODE module to learn a continuous-time representation of the latent space dynamics. This step represents a shift from discrete-time to continuous-time modeling of the system's behavior. The neural ODE module learns to approximate the derivative of the system's state with respect to time, allowing it to capture smooth, realistic trajectories in the latent space. For complex systems like financial markets, this could reveal subtle, continuous trends that might be missed by traditional discrete-time analysis.
In a step 1630, the system passes the output of the neural ODE module to a symbolic regression network, which generates candidate symbolic expressions for the equations of motion. This step is akin to a mathematician observing the system's behavior and attempting to write down governing equations. The symbolic regression network explores a vast space of possible mathematical formulations to find those that best describe the observed dynamics. In the market analysis example, this might involve generating equations that relate changes in asset prices to factors like trading volume, market sentiment, or economic indicators.
In a step 1640, the system uses an equation decoder to convert the symbolic expressions into human-readable equations. This crucial step bridges the gap between the abstract representations learned by the neural networks and explicit, interpretable mathematical formulations. It transforms the candidate expressions into a form that analysts and decision-makers can understand and reason about. For instance, it might produce an equation showing how the rate of change of a stock's price depends on its current price, trading volume, and overall market trend.
In a step 1650, the system applies physics-informed regularization to ensure the physical consistency and plausibility of any learned equations. This step acts as a sophisticated filter, incorporating domain-specific knowledge and fundamental principles to refine the derived equations. In a financial context, this might involve ensuring that the equations respect principles like conservation of money, or known effects of market microstructure. It helps prevent the system from learning equations that fit the data but violate known laws or constraints of the system being modeled.
In a step 1660, the system outputs a final derived vector equation that describes the motion of latent space vectors. This equation represents a distilled understanding of the system's dynamics, captured in a form that is both mathematically precise and humanly interpretable. In our financial market example, this final equation might describe how different components of the market interact and evolve over time, providing insights into the underlying mechanisms driving market behavior. It could reveal complex, nonlinear relationships between various market factors, offering a deep understanding of market dynamics that goes beyond simple statistical correlations.
This method encapsulates the sophisticated process by which the Latent Dynamics Analyzer transforms sequences of latent vectors into meaningful, interpretable equations of motion. It combines the power of neural networks, the flexibility of symbolic regression, and the constraints of domain-specific knowledge to produce a unique tool for understanding and predicting the behavior of complex, dynamic systems.
In a step 1710, the system processes the input sequences through a VAE encoder to generate latent space vectors. This step transforms the raw input data into a compact, lower-dimensional representation in the latent space. The VAE encoder learns to distill the essential features and patterns from the input data. In the financial market example, this might involve encoding complex market states into latent vectors that capture underlying market dynamics and relationships between various factors.
In a step 1720, the system passes the latent space vectors through both a latent transformer and a latent dynamics analyzer. This parallel processing is a key feature of the enhanced system. The latent transformer, with its self-attention mechanisms, processes the vectors to generate predictions, while the latent dynamics analyzer works to derive equations of motion describing the evolution of the latent space. For financial data, the transformer might focus on predicting short-term price movements, while the dynamics analyzer could be uncovering longer-term trends and relationships.
In a step 1730, the system generates predictions by decoding the output of the latent transformer using a VAE decoder. This step translates the predictions made in the latent space back into the original data space, producing concrete, interpretable forecasts. In the context of financial markets, this could be specific price predictions for stocks or other financial instruments.
In a step 1740, the system computes the forecasting accuracy loss by comparing the generated predictions with the target outputs. This step quantifies how well the system is performing its primary task of time series forecasting. The choice of loss function here (e.g., mean squared error, mean absolute error) can significantly impact what the system learns to prioritize in its predictions.
In a step 1750, the system calculates additional loss terms, including reconstruction loss from the VAE, simplicity loss for the derived equations, and physics consistency loss from the latent dynamics analyzer. These additional loss terms ensure that the system isn't just optimizing for prediction accuracy, but also for other important qualities. The reconstruction loss encourages the VAE to create faithful latent representations. The simplicity loss pushes the latent dynamics analyzer to derive equations that are not unnecessarily complex. The physics consistency loss ensures that the derived equations respect known laws or constraints of the system being modeled.
In a step 1760, the system combines the various loss terms into a total loss function, weighting each component appropriately. This step is crucial for balancing the different objectives of the system. The weights assigned to each loss term will determine the trade-offs between prediction accuracy, interpretability of the derived equations, and adherence to physical constraints. For a financial application, one might weight the forecasting accuracy loss heavily, while still maintaining significant weights on the other terms to ensure the system produces interpretable and physically plausible results.
In a step 1770, the system backpropagates the total loss through the entire system, updating parameters in the VAE encoder, VAE decoder, latent transformer, and latent dynamics analyzer. This step is where the magic of end-to-end training happens. By propagating gradients through all components simultaneously, the system learns to work as a cohesive whole. Each component adapts not just to perform its individual task better, but to work in harmony with the other components. For example, the VAE encoder might learn to create latent representations that are not only good for reconstruction but also particularly suitable for the transformer's prediction task and the dynamics analyzer's equation derivation.
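A compressed sketch of one such end-to-end training step is shown below; the component interfaces and loss weights are assumptions made for illustration only:

```python
import torch
import torch.nn.functional as F

def training_step(batch, target, encoder, transformer, dynamics, decoder,
                  optimizer, w=(1.0, 0.1, 0.01, 0.01)):
    """Assumed interfaces: encoder returns (latent vectors, reconstruction loss, KL term);
    dynamics returns (derived equations, simplicity loss, physics-consistency loss)."""
    w_fc, w_rec, w_simp, w_phys = w

    z, recon_loss, kl = encoder(batch)            # step 1710: latent space vectors
    z_pred = transformer(z)                       # branch 1: prediction in latent space
    _eqs, simplicity, phys = dynamics(z)          # branch 2: derived dynamics and its losses

    forecast = decoder(z_pred)                    # step 1730: back to data space
    total = (w_fc * F.mse_loss(forecast, target)  # step 1740: forecasting accuracy loss
             + w_rec * (recon_loss + kl)          # step 1750: additional loss terms
             + w_simp * simplicity
             + w_phys * phys)                     # step 1760: weighted combination

    optimizer.zero_grad()
    total.backward()                              # step 1770: gradients flow down both branches
    optimizer.step()
    return total.item()
```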
This end-to-end training method allows the system to find an optimal balance between accurate prediction, meaningful representation, and interpretable dynamics modeling. It leverages the strengths of each component while ensuring they all work together towards the common goal of understanding and forecasting complex time series data. The result is a powerful, integrated system capable of both accurate predictions and deep, interpretable insights into the underlying dynamics of the system it models.
This market data is then fed into the VAE encoder subsystem 150, which compresses the high-dimensional financial data into a compact latent space representation. In this context, the latent space might capture abstract concepts like “market sentiment,” “sector trends,” or “macroeconomic conditions” that are not directly observable but emerge from the complex interactions of various market factors.
The latent space replicator 1200 creates two copies of these latent vectors, enabling parallel processing for both prediction and dynamic analysis. One copy is sent to the latent transformer subsystem 170 for generating short-term market predictions, while the other is directed to the latent dynamics analyzer core 1210.
The latent dynamics analyzer core 1210 is where the system begins to uncover the deeper, underlying dynamics of the market. By analyzing sequences of latent vectors over time, it derives vector equations 1220 that describe the motion and evolution of the market's state in the latent space. These equations might capture complex relationships such as how changes in trading volume relate to price movements under different market conditions, or how the influence of economic indicators on stock prices varies over time.
A spectral analysis module 1810 takes input from both the latent transformer and the latent dynamics analyzer to compute the spectral characteristics of the market's behavior over different time scales. By analyzing the frequency components of the market's dynamics, this module can identify cyclical patterns, resonances, and characteristic frequencies that might not be apparent in the time domain.
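One way such spectral characteristics might be computed is a per-dimension power spectrum of the latent trajectory over a time window; the sketch below is illustrative only:

```python
import numpy as np

def spectral_signature(z_seq, dt=1.0):
    """z_seq: array [T, D] of latent vectors sampled at interval dt.
    Returns frequency bins and the power spectrum of each latent dimension."""
    z = z_seq - z_seq.mean(axis=0, keepdims=True)     # remove the DC component
    power = np.abs(np.fft.rfft(z, axis=0)) ** 2       # power per frequency bin, per dimension
    freqs = np.fft.rfftfreq(z_seq.shape[0], d=dt)
    return freqs, power                               # power has shape [T//2 + 1, D]
```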
A temporal analyzer 1820 then examines how these spectral characteristics evolve over time. It might, for example, compare the spectral signatures of the market across different quarters or years. This analysis can reveal how the fundamental dynamics of the market are changing over longer time scales.
A change detector 1830 is responsible for identifying significant shifts in market behavior. It continuously compares the current market dynamics (as described by the derived vector equations and spectral characteristics) to historical patterns. When it detects substantial deviations from the norm, it can trigger alerts or adjust the system's behavior.
For instance, the derived vector equations might reveal a sudden change in how interest rate movements affect stock prices across different sectors. The change detector could identify this as a potential regime shift in the market, prompting a reevaluation of trading strategies or risk models.
An output visualizer 1840 is responsible for presenting the insights and predictions generated by the system in an interpretable form. This might include visualizations of latent space trajectories, phase space plots of market dynamics, or interactive dashboards showing how different market factors are interacting over time.
One powerful application of the derived vector equations in this context could be in stress testing and scenario analysis. Financial institutions could use these equations to simulate how the market might behave under extreme conditions. For example, they could input a scenario of rapid interest rate increases and use the equations to model how this would propagate through different sectors of the market, affecting stock prices, bond yields, and currency exchange rates. The equations could capture non-linear effects and complex interactions that might be missed by traditional statistical models.
Moreover, the spectral analysis and change detection capabilities of this system could be used for early warning of potential market crashes or bubbles. By identifying unusual patterns in the spectral characteristics of market behavior or detecting significant changes in the underlying dynamic equations, the system could alert analysts to emerging risks or opportunities long before they become apparent in traditional market indicators.
This market-focused adaptation of the Latent Transformer system represents a powerful tool for deep market analysis. By combining predictive modeling, dynamic analysis, and spectral characterization, it offers a comprehensive approach to understanding and forecasting financial market behavior. The derived vector equations, in particular, provide a unique bridge between the abstract world of latent space representations and the concrete realities of market dynamics, enabling sophisticated analysis and simulation capabilities that go far beyond traditional financial modeling techniques.
In a step 1910, the system encodes the preprocessed market data into latent space vectors and associated metadata using a VAE encoder. This step transforms the complex, high-dimensional market data into a more compact and abstract representation. The latent vectors might capture underlying market states or conditions that are not directly observable. For instance, a single latent vector could encode information about overall market sentiment, sector-specific trends, and macroeconomic conditions. The associated metadata might include timestamps, data source identifiers, and confidence scores for the encoded information.
In a step 1920, the system replicates the latent space vectors, sending one copy to a latent transformer and another copy to a latent dynamics analyzer. This parallel processing allows the system to simultaneously generate short-term predictions and analyze long-term dynamics. The latent transformer focuses on making immediate forecasts, while the latent dynamics analyzer works on understanding the underlying rules governing market behavior.
In a step 1930, the system processes the latent space vectors through the latent transformer to generate short-term market predictions and through the latent dynamics analyzer to derive equations of motion for market dynamics. The latent transformer might produce predictions for next-day stock prices or short-term trend directions. Meanwhile, the latent dynamics analyzer could generate equations describing how different market factors interact over time. For example, it might produce an equation showing how changes in interest rates affect stock prices across different sectors, taking into account factors like trading volume and market sentiment.
In a step 1940, the system computes spectral characteristics of the latent space representations for overlapping time periods. This step involves analyzing the frequency components of the market's behavior in the latent space. It might reveal cyclical patterns in market behavior, such as seasonal trends or longer-term economic cycles. For instance, it could identify characteristic frequencies in trading patterns or recurring relationships between different market sectors.
In a step 1950, the system compares the spectral characteristics and derived equations across consecutive time periods to identify significant shifts in market behavior. This comparative analysis is crucial for detecting changes in the fundamental dynamics of the market. For example, it might reveal that the relationship between interest rates and stock prices has changed, indicating a potential shift in the overall economic environment. Or it could show that certain cyclical patterns have broken down, suggesting a possible regime change in the market.
In a step 1960, the system generates alerts or signals when substantial changes in market dynamics are detected, potentially triggering trading strategy adjustments or risk management actions. This step translates the analytical insights into actionable information. For instance, if the system detects a significant change in the equations governing the relationship between commodity prices and currency exchange rates, it might trigger an alert suggesting a reevaluation of currency hedging strategies. Or if it identifies an unusual pattern in the spectral characteristics of stock market behavior, it could signal an increased risk of market volatility, prompting risk managers to adjust their models or traders to modify their positions.
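An illustrative change-detection rule is sketched below; the distance measure and thresholds are assumptions that would need calibration against historical data:

```python
import numpy as np

def detect_regime_shift(spec_prev, spec_curr, coef_prev, coef_curr,
                        spec_tol=0.25, coef_tol=0.25):
    """Flag a potential regime shift when the spectral signature or the
    derived-equation coefficients move by more than a relative tolerance
    between consecutive analysis windows."""
    spec_shift = np.linalg.norm(spec_curr - spec_prev) / (np.linalg.norm(spec_prev) + 1e-12)
    coef_shift = np.linalg.norm(coef_curr - coef_prev) / (np.linalg.norm(coef_prev) + 1e-12)
    alert = spec_shift > spec_tol or coef_shift > coef_tol
    return alert, {"spectral_shift": spec_shift, "coefficient_shift": coef_shift}
```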
This method encapsulates a sophisticated approach to market analysis that goes beyond traditional statistical techniques. By leveraging the power of latent space representations, dynamic equation derivation, and spectral analysis, it provides a comprehensive framework for understanding and predicting complex market behavior. The combination of short-term predictive power and long-term dynamic analysis makes this method particularly valuable for financial institutions seeking to navigate the complexities of global markets.
Encoder 2010 may utilize a lossy compression module 2011 to perform lossy compression on a received dataset 2001a-n. The type of lossy compression implemented by lossy compression module 2011 may be dependent upon the data type being processed. For example, for SAR imagery data, High Efficiency Video Coding (HEVC) may be used to compress the dataset. In another example, if the data being processed is time-series data, then delta encoding may be used to compress the dataset. The encoder 2010 may then send the compressed data as a compressed data stream to a decoder 2020 which can receive the compressed data stream and decompress the data using a decompression module 2021.
The decompression module 2021 may be configured to perform data decompression on a compressed data stream using an appropriate data decompression algorithm. The decompressed data may then be used as input to a neural upsampler 2022 which utilizes a trained neural network to restore the decompressed data to nearly its original state 2005 by taking advantage of the information embedded in the correlation between the two or more datasets 2001a-n.
Deformable convolution is a type of convolutional operation that introduces spatial deformations to the standard convolutional grid, allowing the convolutional kernel to adaptively sample input features based on the learned offsets. It's a technique designed to enhance the modeling of spatial relationships and adapt to object deformations in computer vision tasks. In traditional convolutional operations, the kernel's positions are fixed and aligned on a regular grid across the input feature map. This fixed grid can limit the ability of the convolutional layer to capture complex transformations, non-rigid deformations, and variations in object appearance. Deformable convolution aims to address this limitation by introducing the concept of spatial deformations. Deformable convolution has been particularly effective in tasks like object detection and semantic segmentation, where capturing object deformations and accurately localizing object boundaries are important. By allowing the convolutional kernels to adaptively sample input features from different positions based on learned offsets, deformable convolution can improve the model's ability to handle complex and diverse visual patterns.
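As a concrete illustration (the channel sizes and the offset-predicting convolution below are assumptions), deformable convolution is available in torchvision as DeformConv2d, where an ordinary convolution predicts the per-position sampling offsets consumed by the deformable kernel:

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformableBlock(nn.Module):
    def __init__(self, in_ch=32, out_ch=32, k=3):
        super().__init__()
        # two offsets (dx, dy) for each of the k*k kernel positions
        self.offset_pred = nn.Conv2d(in_ch, 2 * k * k, kernel_size=k, padding=k // 2)
        self.deform = DeformConv2d(in_ch, out_ch, kernel_size=k, padding=k // 2)

    def forward(self, x):
        offsets = self.offset_pred(x)        # learned, input-dependent sampling offsets
        return self.deform(x, offsets)       # sample the input off the regular grid

feat = torch.randn(1, 32, 64, 64)            # hypothetical feature map
out = DeformableBlock()(feat)                # -> shape [1, 32, 64, 64]
```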
According to an embodiment, the network may be trained as a two stage process, each utilizing specific loss functions. During the first stage, a mean squared error (MSE) function is used in the I/Q domain as a primary loss function for the AI deblocking network. The loss function of the SAR I/Q channel LSAR is defined as:
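A plausible reconstruction of this loss (notation assumed), with $I_i$ and $Q_i$ denoting the ground-truth in-phase and quadrature values, $\hat{I}_i$ and $\hat{Q}_i$ the corresponding network outputs, and $N$ the number of samples, is:

$$L_{SAR} = \frac{1}{N}\sum_{i=1}^{N}\left[\left(I_i - \hat{I}_i\right)^2 + \left(Q_i - \hat{Q}_i\right)^2\right]$$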
Moving to the second stage, the network reconstructs the amplitude component and computes the amplitude loss using MSE as follows:
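Under the same assumed notation, with amplitudes $A_i = \sqrt{I_i^2 + Q_i^2}$ and $\hat{A}_i = \sqrt{\hat{I}_i^2 + \hat{Q}_i^2}$, a plausible form of the amplitude loss is:

$$L_{amp} = \frac{1}{N}\sum_{i=1}^{N}\left(A_i - \hat{A}_i\right)^2$$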
To calculate the overall loss, the network combines the SAR loss and the amplitude loss, incorporating a weighting factor, a, for the amplitude loss. The total loss is computed as:
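Using the weighting factor a on the amplitude term, a plausible form of the combined objective is:

$$L_{total} = L_{SAR} + a\,L_{amp}$$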
The weighting factor value may be selected based on the dataset used during network training. In an embodiment, the network may be trained using two different SAR datasets: the National Geospatial-Intelligence Agency (NGA) SAR dataset and the Sandia National Laboratories Mini SAR Complex Imagery dataset, both of which feature complex-valued SAR images. In an embodiment, the weighting factor is set to 0.0001 for the NGA dataset and 0.00005 for the Sandia dataset. By integrating both the SAR and amplitude losses in the total loss function, the system effectively guides the training process to simultaneously address the removal of the artifacts and maintain the fidelity of the amplitude information. The weighting factor, a, enables the AI deblocking network to balance the importance of the SAR loss and the amplitude loss, ensuring comprehensive optimization of the network during the training stages. In some implementations, diverse data augmentation techniques may be used to enhance the variety of training data. For example, techniques such as horizontal and vertical flips and rotations may be implemented on the training dataset. In an embodiment, model optimization is performed using MSE loss and the Adam optimizer with a learning rate initially set to 1×10−4 and decreased by a factor of 2 at epochs 100, 200, and 250, with a total of 300 epochs. In an implementation, the patch size is set to 256×256, with each batch containing 16 images.
Both branches first pass through a pixel unshuffling layer 2111, 2121 which implements a pixel unshuffling process on the input data. Pixel unshuffling is a process used in image processing to reconstruct a high-resolution image from a low-resolution image by rearranging or “unshuffling” the pixels. The process can involve the following steps, low-resolution input, pixel arrangement, interpolation, and enhancement. The input to the pixel unshuffling algorithm is a low-resolution image (i.e., decompressed, quantized SAR I/Q data). This image is typically obtained by downscaling a higher-resolution image such as during the encoding process executed by encoder 110. Pixel unshuffling aims to estimate the original high-resolution pixel values by redistributing and interpolating the low-resolution pixel values. The unshuffling process may involve performing interpolation techniques, such as nearest-neighbor, bilinear, or more sophisticated methods like bicubic or Lanczos interpolation, to estimate the missing pixel values and generate a higher-resolution image.
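In PyTorch, one concrete realization of such an unshuffling layer is nn.PixelUnshuffle, which rearranges r×r blocks of spatial pixels into additional channels; the shapes below are illustrative for two-channel I/Q input:

```python
import torch
import torch.nn as nn

unshuffle = nn.PixelUnshuffle(downscale_factor=2)   # r = 2
iq = torch.randn(1, 2, 128, 128)                    # decompressed I/Q data (illustrative shape)
blocks = unshuffle(iq)                              # -> shape [1, 8, 64, 64]
```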
The output of the unshuffling layers 2111, 2121 may be fed into a series of layers which can include one or more convolutional layers and one or more parametric rectified linear unit (PRELU) layers. A legend is depicted for both
A PRELU layer is an activation function used in neural networks. The PRELU activation function extends the ReLU by introducing a parameter that allows the slope for negative values to be learned during training. The advantage of PRELU over ReLU is that it enables the network to capture more complex patterns and relationships in the data. By allowing a small negative slope for the negative inputs, the PRELU can learn to handle cases where the output should not be zero for all negative values, as is the case with the standard ReLU. In other implementations, other non-linear functions such as tanh or sigmoid can be used instead of PRELU.
After passing through a series of convolutional and PRELU layers, both branches enter the resnet 2130 which further comprises more convolutional and PRELU layers. The frequency domain branch is slightly different than the pixel domain branch once inside ResNet 2130, specifically the frequency domain is processed by a transposed convolutional (TConv) layer 2131. Transposed convolutions are a type of operation used in neural networks for tasks like image generation, image segmentation, and upsampling. They are used to increase the spatial resolution of feature maps while maintaining the learned relationships between features. Transposed convolutions aim to increase spatial dimensions of feature maps, effectively “upsampling” them. This is typically done by inserting zeros (or other values) between existing values to create more space for new values.
Inside ResBlock 2130 the data associated with the pixel and frequency domains are combined back into a single stream by using the output of the Tconv 2131 and the output of the top branch. The combined data may be used as input for a channel-wise transformer 2200. In some embodiments, the channel-wise transformer may be implemented as a multi-scale attention block utilizing the attention mechanism. For more detailed information about the architecture and functionality of channel-wise transformer 2200 refer to
A first path may process input data through a position embedding module 2230 comprising series of convolutional layers as well as a Gaussian Error Linear Unit (GeLU). In traditional recurrent neural networks or convolutional neural networks, the order of input elements is inherently encoded through the sequential or spatial nature of these architectures. However, in transformer-based models, where the attention mechanism allows for non-sequential relationships between tokens, the order of tokens needs to be explicitly conveyed to the model. Position embedding module 2230 may represent a feedforward neural network (position-wise feedforward layers) configured to add position embeddings to the input data to convey the spatial location or arrangement of pixels in an image. The output of position embedding module 2230 may be added to the output of the other processing path the received input signal is processed through.
A second path may process the input data. It may first be processed via a channel-wise configuration and then through a self-attention layer 2220. The signal may be copied/duplicated such that a copy of the received signal is passed through an average pool layer 2210 which can perform a downsampling operation on the input signal. It may be used to reduce the spatial dimensions (e.g., width and height) of feature maps while retaining the most important information. Average pooling functions by dividing the input feature map into non-overlapping rectangular or square regions (often referred to as pooling windows or filters) and replacing each region with the average of the values within that region. This functions to downsample the input by summarizing the information within each pooling window.
Self-attention layer 2220 may be configured to provide an attention mechanism to AI deblocking network 2023. The self-attention mechanism, also known as intra-attention or scaled dot-product attention, is a fundamental building block used in various deep learning models, particularly in transformer-based models. It plays a crucial role in capturing contextual relationships between different elements in a sequence or set of data, making it highly effective for tasks involving sequential or structured data like complex-valued SAR I/Q channels. Self-attention layer 2220 allows each element in the input sequence to consider other elements and weigh their importance based on their relevance to the current element. This enables the model to capture dependencies between elements regardless of their positional distance, which is a limitation in traditional sequential models like RNNs and LSTMs.
The input 2201 and downsampled input sequence is transformed into three different representations: Query (Q), Key (K), and Value (V). These transformations (wV, wK, and wQ) are typically linear projections of the original input. For each element in the sequence, the dot product between its Query and the Keys of all other elements is computed. The dot products are scaled by a factor to control the magnitude of the attention scores. The resulting scores may be normalized using a softmax function to get attention weights that represent the importance of each element to the current element. The Values (V) of all elements are combined using the attention weights as coefficients. This produces a weighted sum, where elements with higher attention weights contribute more to the final representation of the current element. The weighted sum is the output of the self-attention mechanism for the current element. This output captures contextual information from the entire input sequence.
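A minimal sketch of this scaled dot-product computation (the projection dimensions are illustrative):

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_self_attention(x, w_q, w_k, w_v):
    """x: [batch, seq_len, d_model]; w_q, w_k, w_v: linear projections."""
    q, k, v = w_q(x), w_k(x), w_v(x)
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5   # scaled dot products
    weights = F.softmax(scores, dim=-1)                     # attention weights
    return weights @ v                                      # weighted sum of the values

d = 64
x = torch.randn(2, 16, d)
w_q, w_k, w_v = (torch.nn.Linear(d, d) for _ in range(3))
out = scaled_dot_product_self_attention(x, w_q, w_k, w_v)   # -> shape [2, 16, 64]
```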
The output of the two paths (i.e., position embedding module 2230 and self-attention layer 2220) may be combined into a single output data stream xout 2202.
In an embodiment, financial time-series data 2310a-n may comprise (but is not limited to) stock prices, economic indicators, market indices, interest rates, bond yields, currency exchange rates, trade balances, commodities prices, inflation, options and futures data, sentiment analysis, credit ratings, mergers and acquisitions data, real estate prices, and VIX data. There are various sources of financial time-series data that provide information on market prices, economic indicators, and other financial variables. Some common sources include, but are not limited to, financial data providers (e.g., companies specializing in financial data that offer comprehensive datasets covering a wide range of asset classes, such as Alpha Vantage, which offers a free API for accessing historical and real-time market data), stock exchanges, central banks, government agencies, financial news websites, investing websites, the World Bank, Federal Reserve Economic Data, and/or the like.
There are several common data formats used for storing and transmitting financial time-series data, and which may be used in various implementations of the disclosed system and methods. These formats are designed to efficiently represent the vast amount of information generated through various financial services across various industries. One such format which may be processed by system 2300 is comma-separated values (CSV). CSV is a simple and widely used text format where each row represents a data entry, and columns are separated by commas. It's easy to read, edit, and widely supported by various data analysis tools. In another embodiment, the financial time-series data may be formatted according to JavaScript Object Notation (JSON), which is a lightweight data interchange format that is easy for humans to read and write. It's commonly used for representing structured data, and its flexibility makes it suitable for financial time-series data. In yet another embodiment, the financial time-series data may be processed in a Hierarchical Data Format version 5 (HDF5). HDF5 is a file format and set of tools for managing complex data. It supports the efficient storage of large and diverse datasets, making it suitable for financial time-series data with many variables. These are merely exemplary data formats which may be implemented in some embodiments and do not represent all possible formats which may be processed by system 2300.
The financial time-series data may be received at a data compressor 2320 which is present and configured to utilize one or more data compression methods on received financial data 2310a-n. Compression techniques are commonly used on financial time-series data to reduce storage requirements, speed up data transmission, and improve overall efficiency. According to an embodiment, the compression technique may be implemented as Run-Length Encoding (RLE) which is a simple compression technique that replaces sequences of identical elements with a single value and a count of the number of occurrences. In financial time-series data, where consecutive observations often have the same value, RLE can be effective in reducing redundancy. In yet another embodiment, the compression technique may be implemented as delta encoding which involves storing the difference between consecutive data points rather than the absolute values. In financial time-series data, where changes in values may be relatively small over time, delta encoding can result in more compact storage.
In an embodiment, the data may be compressed via differential pulse code modulation (DPCM). DPCM is a form of delta encoding that quantizes the difference between each data point and a predicted value based on the previous data point. It is commonly used in audio and video compression and can be adapted for financial time-series data. The provided compression techniques are exemplary only and are in no way limiting to the possible compression techniques which may be used in an embodiment of the disclosed system. The choice of compression technique depends on factors such as the nature of the data, the specific requirements of the application, and the trade-off between compression ratio and computational complexity. Different techniques may be suitable for different types of financial time-series data, and a combination of methods may be employed in practice. Lossy compression algorithms may filter or smooth the data to reduce redundancy or noise. While this can result in higher compression, it may lead to the loss of some information, especially in regions with lower sequencing quality.
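An illustrative lossy delta encoding of a price series is sketched below; the quantization step is an assumed parameter, and the quantization is what makes the scheme lossy:

```python
import numpy as np

def delta_encode(x, step=0.01):
    """Store the first value plus quantized first differences."""
    q = np.round(np.diff(x) / step).astype(np.int32)
    return x[0], q

def delta_decode(first, q, step=0.01):
    """Reconstruct the series; quantization error makes the round trip inexact."""
    return np.concatenate([[first], first + np.cumsum(q * step)])

prices = np.array([101.23, 101.25, 101.22, 101.30])
first, deltas = delta_encode(prices)
restored = delta_decode(first, deltas)     # close to, but not exactly, the original prices
```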
Financial time-series data compressed by data compressor 2320 may then be sent to a data decompressor 2330 which can utilize one or more data decompression methods known to those with skill in the art. The output of data decompressor 2330 is a financial data stream(s) of decompressed data which is missing information due to the lossy nature of the compression/decompression methods used. The decompressed financial data stream(s) may be passed to neural upsampler 2340 which can utilize a trained neural network to restore most of the “lost” information associated with the decompressed financial data stream(s) by leveraging the learned correlation(s) between and among the various financial datasets. The output of neural upsampler 2340 is restored financial data 2350.
According to various embodiments, system 2300 utilizes a trained neural upsampler to leverage correlations in the received two or more financial datasets 2310a-n in order to restore lost data. In an aspect, neural upsampler 2340 may comprise a series of recurrent neural network layers, pooling layers, an n-channel transformer, and/or convolutional layers as described herein. In an embodiment, neural upsampler 2340 may be trained on a training dataset comprising a corpus of compressed financial data, wherein the compressed financial data is correlated. The neural upsampler may be trained to generate, as output, financial data which is close to its original state prior to undergoing lossy data compression. The financial data which was used to create the training dataset may be kept and used to validate the training output of the neural upsampler; in this way, the neural upsampler can be trained to generate output which nearly matches the original, uncompressed financial data.
Financial time series datasets can be correlated in various ways, reflecting relationships and interactions in the broader economic and financial environment. The following are some ways in which distinct financial time-series datasets can be correlated, which may be learned and leveraged by a trained neural upsampler 2340 to restore financial data which has been processed via lossy compression/decompression. For example, exchange rates can be correlated with trade balances: a country with a trade surplus may experience appreciation in its currency, while a trade deficit could lead to depreciation. As another example, stock prices and the VIX typically exhibit a negative correlation; during periods of market uncertainty or decline, the VIX tends to rise as investors seek protection, leading to lower stock prices. Yet another correlation that can be found among financial time-series datasets is that stock prices are often correlated with corporate earnings: positive earnings reports can lead to higher stock prices, while disappointing earnings may result in stock market declines.
More examples of financial correlations which may be leveraged in one or more embodiments include interest rates and real estate prices, unemployment rates and consumer spending, inflation rates and gold prices, government bond yields and stock prices, oil prices and airline stocks, technology stocks and semiconductor sales, credit ratings and corporate bond yields, GDP (gross domestic product) growth and stock market performance, consumer confidence and retail sales, and/or the like. Of course, the financial time-series datasets may also be correlated temporally, such as, for example, the lagged influence on markets of an interest rate increase or decrease by a central bank. A neural upsampler can be trained to use these learned correlations among financial datasets to restore lost data.
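For illustration, the sketch below estimates the lead/lag correlation between two synthetic series of the kind described above (e.g., a policy rate and an equity index); the data, the coupling strength, and the lag are assumptions fabricated solely to demonstrate the computation.

```python
# Illustrative only: measuring the lead/lag correlation between two hypothetical
# financial series, the kind of relationship a trained upsampler could exploit.
import numpy as np

def lagged_correlation(a: np.ndarray, b: np.ndarray, lag: int) -> float:
    """Pearson correlation of a[t] with b[t + lag]."""
    if lag > 0:
        a, b = a[:-lag], b[lag:]
    elif lag < 0:
        a, b = a[-lag:], b[:lag]
    return float(np.corrcoef(a, b)[0, 1])

rng = np.random.default_rng(0)
rates = rng.normal(size=500).cumsum()                           # synthetic "policy rate"
stocks = -0.6 * np.roll(rates, 5) + rng.normal(scale=2.0, size=500)  # lagged, inverse link

best = max(range(-10, 11), key=lambda k: abs(lagged_correlation(rates, stocks, k)))
print(best, lagged_correlation(rates, stocks, best))            # strongest lead/lag relationship
```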
The latent dynamics analyzer system can be synergistically combined with neural upsampling techniques for financial time-series data, enhancing both data compression efficiency and dynamic system analysis capabilities. This integration begins with the neural upsampler serving as a preprocessing step, restoring information lost during lossy compression of financial data. The restored, richer dataset then feeds into the latent transformer core's VAE Encoder Subsystem, enabling the creation of more informative and representative latent space vectors. These enhanced vectors, capturing more nuanced financial relationships, allow the Latent Transformer Subsystem to learn and model more complex and accurate dynamics of the financial system.
The latent dynamics analyzer core benefits significantly from this upsampled data, deriving more precise equations of motion for the latent space vectors, which in turn provide a more accurate representation of the underlying financial system dynamics. This improved modeling capability extends to various aspects of financial analysis, from understanding the intricate effects of interest rate changes on stock prices across different sectors to modeling the evolution of currency exchange rates in response to economic indicators.
Moreover, the insights gained from the latent dynamics analyzer can be fed back to optimize the neural upsampling process, creating a virtuous cycle of continuous improvement. The latent dynamics analyzer's understanding of system dynamics can guide the upsampler to focus on restoring the most dynamically relevant information, further enhancing the overall system performance. This powerful combination of neural upsampling and the LDA system offers superior capabilities in anomaly detection and forecasting for financial markets. The upsampled data provides a clearer picture of normal market behavior, while the LDA's dynamic modeling allows for more accurate predictions of future states.
Deviations from expected dynamics can be more readily identified, potentially providing earlier and more reliable signals of market anomalies or regime changes. Ultimately, this integration creates a robust framework for efficient data compression, transmission, and detailed analysis of financial system dynamics, paving the way for significant advancements in financial modeling, risk management, and predictive analytics.
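One possible, non-authoritative way to realize the anomaly-detection idea above is sketched below: a simple linear one-step model stands in for the latent dynamics analyzer's derived equations of motion, and time steps whose one-step prediction residual exceeds a threshold are flagged. The linear model, the threshold rule, and the synthetic latent trajectory are all assumptions.

```python
# Hedged sketch: flag latent states that deviate from the states predicted by a
# fitted one-step dynamics model (a stand-in for the derived equations of motion).
import numpy as np

def fit_linear_dynamics(Z: np.ndarray) -> np.ndarray:
    """Least-squares fit of z[t+1] ~= A @ z[t] over a latent trajectory Z of shape (T, d)."""
    X, Y = Z[:-1], Z[1:]
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return W.T                                  # A such that z[t+1] ~= A @ z[t]

def anomaly_scores(Z: np.ndarray, A: np.ndarray) -> np.ndarray:
    """Norm of the one-step prediction residual at each time step."""
    pred = Z[:-1] @ A.T
    return np.linalg.norm(Z[1:] - pred, axis=1)

Z = np.cumsum(np.random.default_rng(1).normal(size=(200, 8)), axis=0)  # stand-in latent trajectory
A = fit_linear_dynamics(Z)
scores = anomaly_scores(Z, A)
threshold = scores.mean() + 3 * scores.std()
anomalies = np.where(scores > threshold)[0] + 1   # indices of flagged time steps
```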
In another embodiment directed to genomic data, a neural upsampler which has been trained on compressed genomic data is present and configured to restore genomic data which has undergone lossy data compression and decompression by leveraging the correlations between the genomic datasets. A non-exhaustive list of genomic data correlations that may be used by an embodiment of the system and method can include genetic variation and linkage disequilibrium, and haplotype blocks.
The two or more genomic datasets may be processed by a data compressor 2320 employing a lossy compression method. The lossy compression method may implement a lossy compression algorithm appropriate for compressing genomic data. The choice of compression implementation may be based on various factors including, but not limited to, the type of data being processed, the computational resources and time required, and the use case of the upsampler. Exemplary genomic data compression techniques which may be used include, but are not limited to, quality score quantization, reference-based compression, subsampling, and genomic data transformation. The compressed genomic data may be stored in a database and/or transmitted to an endpoint. The compressed genomic data may be sent to a data decompressor 2330 which may employ a lossy decompression technique on the compressed genomic data. The decompressed data may be sent to the neural upsampler which can restore the decompressed data to nearly its original state by leveraging the genetic variation (and/or other) correlations between the genomic datasets. The compressed data is received by data decompressor 2330 at step 2401. At data decompressor 2330, the compressed data may be decompressed via a lossy decompression algorithm at step 2402.
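As a hedged illustration of one of the named techniques, the sketch below bins quality scores to a small set of representative values; the bin edges and example scores are assumptions and not taken from the disclosure.

```python
# Illustrative sketch of quality-score quantization, a lossy compression step
# mentioned above; bin edges are assumptions chosen only for demonstration.
def quantize_quality_scores(scores, bins=(0, 10, 20, 30, 40)):
    """Map each quality score to the lower edge of its bin."""
    out = []
    for s in scores:
        edge = max(b for b in bins if b <= s)
        out.append(edge)
    return out

print(quantize_quality_scores([2, 11, 27, 38]))  # [0, 10, 20, 30]
```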
A neural upsampler for restoration of financial time-series data (e.g., sequences of observations of financial market variables such as stock prices, interest rates, exchange rates, and other economic indicators) received from two or more data channels may be trained using two or more datasets comprising compressed financial time-series data which is substantially correlated. For example, the two or more datasets may comprise financial time-series data related to unemployment rates and consumer spending. In various embodiments, each channel of the received financial time-series data may be fed into its own neural network comprising a series of convolutional, recurrent, ReLU, and/or pooling layers which can be used to learn latent correlations in the feature space that can be used to restore data which has undergone lossy compression. A multi-channel transformer may be configured to receive the output that each of the neural networks produces, learn from the latent correlations in the feature space, and produce reconstructed financial time-series data. At step 2403, the decompressed financial time-series data may be used as input to the trained neural upsampler configured to restore the lost information of the decompressed financial time-series data. The neural upsampler can process the decompressed data to generate as output restored financial time-series data at step 2404.
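A minimal sketch of the multi-channel layout described above follows, assuming PyTorch, two input channels, and small layer sizes chosen for illustration; each per-channel encoder feeds a shared transformer encoder whose output is projected back to the channel dimension.

```python
# Sketch under assumed layer sizes: one small convolutional/pooling encoder per
# data channel, followed by a transformer over the concatenated channel features.
import torch
import torch.nn as nn

class ChannelEncoder(nn.Module):
    def __init__(self, hidden: int = 32):
        super().__init__()
        self.conv = nn.Conv1d(1, hidden, kernel_size=3, padding=1)
        self.pool = nn.AvgPool1d(kernel_size=2)

    def forward(self, x):                          # x: (batch, time)
        h = torch.relu(self.conv(x.unsqueeze(1)))  # (batch, hidden, time)
        return self.pool(h).transpose(1, 2)        # (batch, time/2, hidden)

class MultiChannelUpsampler(nn.Module):
    def __init__(self, n_channels: int = 2, hidden: int = 32):
        super().__init__()
        self.encoders = nn.ModuleList(ChannelEncoder(hidden) for _ in range(n_channels))
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(hidden, n_channels)

    def forward(self, channels):                   # list of (batch, time) tensors
        feats = [enc(c) for enc, c in zip(self.encoders, channels)]
        fused = self.transformer(torch.cat(feats, dim=1))  # concatenate along the time axis
        return self.head(fused)                    # reconstructed multi-channel output

model = MultiChannelUpsampler()
out = model([torch.randn(4, 64), torch.randn(4, 64)])      # shape (4, 64, 2)
```

Concatenating the per-channel features before the transformer is one design choice that lets attention relate observations across channels; other fusion schemes could equally be used.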
The exemplary computing environment described herein comprises a computing device 10 (further comprising a system bus 11, one or more processors 20, a system memory 30, one or more interfaces 40, one or more non-volatile data storage devices 50), external peripherals and accessories 60, external communication devices 70, remote computing devices 80, and cloud-based services 90.
System bus 11 couples the various system components, coordinating operation of and data transmission between those various system components. System bus 11 represents one or more of any type or combination of types of wired or wireless bus structures including, but not limited to, memory busses or memory controllers, point-to-point connections, switching fabrics, peripheral busses, accelerated graphics ports, and local busses using any of a variety of bus architectures. By way of example, such architectures include, but are not limited to, Industry Standard Architecture (ISA) busses, Micro Channel Architecture (MCA) busses, Enhanced ISA (EISA) busses, Video Electronics Standards Association (VESA) local busses, and Peripheral Component Interconnect (PCI) busses, also known as Mezzanine busses, or any selection of, or combination of, such busses. Depending on the specific physical implementation, one or more of the processors 20, system memory 30 and other components of the computing device 10 can be physically co-located or integrated into a single physical component, such as on a single chip. In such a case, some or all of system bus 11 can be electrical pathways within a single chip structure.
Computing device may further comprise externally-accessible data input and storage devices 12 such as compact disc read-only memory (CD-ROM) drives, digital versatile discs (DVD), or other optical disc storage for reading and/or writing optical discs 62; magnetic cassettes, magnetic tape, magnetic disk storage, or other magnetic storage devices; or any other medium which can be used to store the desired content and which can be accessed by the computing device 10. Computing device may further comprise externally-accessible data ports or connections 12 such as serial ports, parallel ports, universal serial bus (USB) ports, and infrared ports and/or transmitter/receivers. Computing device may further comprise hardware for wired or wireless communication with external devices such as IEEE 1394 (“Firewire”) interfaces, IEEE 802.11 wireless interfaces, BLUETOOTH® wireless interfaces, and so forth. Such ports and interfaces may be used to connect any number of external peripherals and accessories 60 such as visual displays, monitors, and touch-sensitive screens 61, USB solid state memory data storage drives (commonly known as “flash drives” or “thumb drives”) 63, printers 64, pointers and manipulators such as mice 65, keyboards 66, and other devices 67 such as joysticks and gaming pads, touchpads, additional displays and monitors, and external hard drives (whether solid state or disc-based), microphones, speakers, cameras, and optical scanners.
Processors 20 are logic circuitry capable of receiving programming instructions and processing (or executing) those instructions to perform computer operations such as retrieving data, storing data, and performing mathematical calculations. Processors 20 are not limited by the materials from which they are formed or the processing mechanisms employed therein, but are typically comprised of semiconductor materials into which many transistors are formed together into logic gates on a chip (i.e., an integrated circuit or IC). The term processor includes any device capable of receiving and processing instructions including, but not limited to, processors operating on the basis of quantum computing, optical computing, mechanical computing (e.g., using nanotechnology entities to transfer data), and so forth. Depending on configuration, computing device 10 may comprise more than one processor. For example, computing device 10 may comprise one or more central processing units (CPUs) 21, each of which itself has multiple processors or multiple processing cores, each capable of independently or semi-independently processing programming instructions based on technologies like complex instruction set computer (CISC) or reduced instruction set computer (RISC). Further, computing device 10 may comprise one or more specialized processors such as a graphics processing unit (GPU) 22 configured to accelerate processing of computer graphics and images via a large array of specialized processing cores arranged in parallel. Further, computing device 10 may comprise one or more specialized processors such as intelligent processing units, field-programmable gate arrays, or application-specific integrated circuits for specific tasks or types of tasks. The term processor may further include: neural processing units (NPUs) or neural computing units optimized for machine learning and artificial intelligence workloads using specialized architectures and data paths; tensor processing units (TPUs) designed to efficiently perform matrix multiplication and convolution operations used heavily in neural networks and deep learning applications; application-specific integrated circuits (ASICs) implementing custom logic for domain-specific tasks; application-specific instruction set processors (ASIPs) with instruction sets tailored for particular applications; field-programmable gate arrays (FPGAs) providing reconfigurable logic fabric that can be customized for specific processing tasks; and processors operating on emerging computing paradigms such as quantum computing, optical computing, mechanical computing (e.g., using nanotechnology entities to transfer data), and so forth. Depending on configuration, computing device 10 may comprise one or more of any of the above types of processors in order to efficiently handle a variety of general purpose and specialized computing tasks. The specific processor configuration may be selected based on performance, power, cost, or other design constraints relevant to the intended application of computing device 10.
System memory 30 is processor-accessible data storage in the form of volatile and/or nonvolatile memory. System memory 30 may be either or both of two types: non-volatile memory and volatile memory. Non-volatile memory 30a is not erased when power to the memory is removed, and includes memory types such as read only memory (ROM), electronically-erasable programmable memory (EEPROM), and rewritable solid state memory (commonly known as “flash memory”). Non-volatile memory 30a is typically used for long-term storage of a basic input/output system (BIOS) 31, containing the basic instructions, typically loaded during computer startup, for transfer of information between components within computing device, or a unified extensible firmware interface (UEFI), which is a modern replacement for BIOS that supports larger hard drives, faster boot times, more security features, and provides native support for graphics and mouse cursors. Non-volatile memory 30a may also be used to store firmware comprising a complete operating system 35 and applications 36 for operating computer-controlled devices. The firmware approach is often used for purpose-specific computer-controlled devices such as appliances and Internet-of-Things (IoT) devices where processing power and data storage space is limited. Volatile memory 30b is erased when power to the memory is removed and is typically used for short-term storage of data for processing. Volatile memory 30b includes memory types such as random-access memory (RAM), and is normally the primary operating memory into which the operating system 35, applications 36, program modules 37, and application data 38 are loaded for execution by processors 20. Volatile memory 30b is generally faster than non-volatile memory 30a due to its electrical characteristics and is directly accessible to processors 20 for processing of instructions and data storage and retrieval. Volatile memory 30b may comprise one or more smaller cache memories which operate at a higher clock speed and are typically placed on the same IC as the processors to improve performance.
There are several types of computer memory, each with its own characteristics and use cases. System memory 30 may be configured in one or more of the several types described herein, including high bandwidth memory (HBM) and advanced packaging technologies like chip-on-wafer-on-substrate (CoWoS). Static random access memory (SRAM) provides fast, low-latency memory used for cache memory in processors, but is more expensive and consumes more power compared to dynamic random access memory (DRAM). SRAM retains data as long as power is supplied. DRAM is the main memory in most computer systems and is slower than SRAM but cheaper and more dense. DRAM requires periodic refresh to retain data. NAND flash is a type of non-volatile memory used for storage in solid state drives (SSDs) and mobile devices and provides high density and lower cost per bit compared to DRAM with the trade-off of slower write speeds and limited write endurance. HBM is an emerging memory technology that provides high bandwidth and low power consumption which stacks multiple DRAM dies vertically, connected by through-silicon vias (TSVs). HBM offers much higher bandwidth (up to 1 TB/s) compared to traditional DRAM and may be used in high-performance graphics cards, AI accelerators, and edge computing devices. Advanced packaging and CoWoS are technologies that enable the integration of multiple chips or dies into a single package. CoWoS is a 2.5D packaging technology that interconnects multiple dies side-by-side on a silicon interposer and allows for higher bandwidth, lower latency, and reduced power consumption compared to traditional PCB-based packaging. This technology enables the integration of heterogeneous dies (e.g., CPU, GPU, HBM) in a single package and may be used in high-performance computing, AI accelerators, and edge computing devices.
Interfaces 40 may include, but are not limited to, storage media interfaces 41, network interfaces 42, display interfaces 43, and input/output interfaces 44. Storage media interface 41 provides the necessary hardware interface for loading data from non-volatile data storage devices 50 into system memory 30 and storing data from system memory 30 to non-volatile data storage device 50. Network interface 42 provides the necessary hardware interface for computing device 10 to communicate with remote computing devices 80 and cloud-based services 90 via one or more external communication devices 70. Display interface 43 allows for connection of displays 61, monitors, touchscreens, and other visual input/output devices. Display interface 43 may include a graphics card for processing graphics-intensive calculations and for handling demanding display requirements. Typically, a graphics card includes a graphics processing unit (GPU) and video RAM (VRAM) to accelerate display of graphics. In some high-performance computing systems, multiple GPUs may be connected using NVLink bridges, which provide high-bandwidth, low-latency interconnects between GPUs. NVLink bridges enable faster data transfer between GPUs, allowing for more efficient parallel processing and improved performance in applications such as machine learning, scientific simulations, and graphics rendering. One or more input/output (I/O) interfaces 44 provide the necessary support for communications between computing device 10 and any external peripherals and accessories 60. For wireless communications, the necessary radio-frequency hardware and firmware may be connected to I/O interface 44 or may be integrated into I/O interface 44. Network interface 42 may support various communication standards and protocols, such as Ethernet and Small Form-Factor Pluggable (SFP). Ethernet is a widely used wired networking technology that enables local area network (LAN) communication. Ethernet interfaces typically use RJ45 connectors and support data rates ranging from 10 Mbps to 100 Gbps, with common speeds being 100 Mbps, 1 Gbps, 10 Gbps, 25 Gbps, 40 Gbps, and 100 Gbps. Ethernet is known for its reliability, low latency, and cost-effectiveness, making it a popular choice for home, office, and data center networks. SFP is a compact, hot-pluggable transceiver used for both telecommunication and data communications applications. SFP interfaces provide a modular and flexible solution for connecting network devices, such as switches and routers, to fiber optic or copper networking cables. SFP transceivers support various data rates, ranging from 100 Mbps to 100 Gbps, and can be easily replaced or upgraded without the need to replace the entire network interface card. This modularity allows for network scalability and adaptability to different network requirements and fiber types, such as single-mode or multi-mode fiber.
Non-volatile data storage devices 50 are typically used for long-term storage of data. Data on non-volatile data storage devices 50 is not erased when power to the non-volatile data storage devices 50 is removed. Non-volatile data storage devices 50 may be implemented using any technology for non-volatile storage of content including, but not limited to, CD-ROM drives, digital versatile discs (DVD), or other optical disc storage; magnetic cassettes, magnetic tape, magnetic disc storage, or other magnetic storage devices; solid state memory technologies such as EEPROM or flash memory; or other memory technology or any other medium which can be used to store data without requiring power to retain the data after it is written. Non-volatile data storage devices 50 may be non-removable from computing device 10 as in the case of internal hard drives, removable from computing device 10 as in the case of external USB hard drives, or a combination thereof, but computing device will typically comprise one or more internal, non-removable hard drives using either magnetic disc or solid state memory technology. Non-volatile data storage devices 50 may be implemented using various technologies, including hard disk drives (HDDs) and solid-state drives (SSDs). HDDs use spinning magnetic platters and read/write heads to store and retrieve data, while SSDs use NAND flash memory. SSDs offer faster read/write speeds, lower latency, and better durability due to the lack of moving parts, while HDDs typically provide higher storage capacities and lower cost per gigabyte. NAND flash memory comes in different types, such as Single-Level Cell (SLC), Multi-Level Cell (MLC), Triple-Level Cell (TLC), and Quad-Level Cell (QLC), each with trade-offs between performance, endurance, and cost. Storage devices connect to the computing device 10 through various interfaces, such as SATA, NVMe, and PCIe. SATA is the traditional interface for HDDs and SATA SSDs, while NVMe (Non-Volatile Memory Express) is a newer, high-performance protocol designed for SSDs connected via PCIe. PCIe SSDs offer the highest performance due to the direct connection to the PCIe bus, bypassing the limitations of the SATA interface. Other storage form factors include M.2 SSDs, which are compact storage devices that connect directly to the motherboard using the M.2 slot, supporting both SATA and NVMe interfaces. Additionally, technologies like Intel Optane memory combine 3D XPoint technology with NAND flash to provide high-performance storage and caching solutions.
Non-volatile data storage devices 50 may store any type of data including, but not limited to, an operating system 51 for providing low-level and mid-level functionality of computing device 10, applications 52 for providing high-level functionality of computing device 10, program modules 53 such as containerized programs or applications, or other modular content or modular programming, application data 54, and databases 55 such as relational databases, non-relational databases, object oriented databases, NoSQL databases, vector databases, knowledge graph databases, key-value databases, document oriented data stores, and graph databases.
Applications (also known as computer software or software applications) are sets of programming instructions designed to perform specific tasks or provide specific functionality on a computer or other computing devices. Applications are typically written in high-level programming languages such as C, C++, Scala, Erlang, GoLang, Java, Rust, and Python, which are then either interpreted at runtime or compiled into low-level, binary, processor-executable instructions operable on processors 20. Applications may be containerized so that they can be run on any computer hardware running any known operating system. Containerization of computer software is a method of packaging and deploying applications along with their operating system dependencies into self-contained, isolated units known as containers. Containers provide a lightweight and consistent runtime environment that allows applications to run reliably across different computing environments, such as development, testing, and production systems, facilitated by container runtimes such as containerd.
The memories and non-volatile data storage devices described herein do not include communication media. Communication media are means of transmission of information such as modulated electromagnetic waves or modulated data signals configured to transmit, not store, information. By way of example, and not limitation, communication media includes wired communications such as sound signals transmitted to a speaker via a speaker wire, and wireless communications such as acoustic waves, radio frequency (RF) transmissions, infrared emissions, and other wireless media.
External communication devices 70 are devices that facilitate communications between computing device and either remote computing devices 80, or cloud-based services 90, or both. External communication devices 70 include, but are not limited to, data modems 71 which facilitate data transmission between computing device and the Internet 75 via a common carrier such as a telephone company or internet service provider (ISP), routers 72 which facilitate data transmission between computing device and other devices, and switches 73 which provide direct data communications between devices on a network or optical transmitters (e.g., lasers). Here, modem 71 is shown connecting computing device 10 to both remote computing devices 80 and cloud-based services 90 via the Internet 75. While modem 71, router 72, and switch 73 are shown here as being connected to network interface 42, many different network configurations using external communication devices 70 are possible. Using external communication devices 70, networks may be configured as local area networks (LANs) for a single location, building, or campus, wide area networks (WANs) comprising data networks that extend over a larger geographical area, and virtual private networks (VPNs) which can be of any size but connect computers via encrypted communications over public networks such as the Internet 75. As just one exemplary network configuration, network interface 42 may be connected to switch 73 which is connected to router 72 which is connected to modem 71 which provides access for computing device 10 to the Internet 75. Further, any combination of wired 77 or wireless 76 communications between and among computing device 10, external communication devices 70, remote computing devices 80, and cloud-based services 90 may be used. Remote computing devices 80, for example, may communicate with computing device through a variety of communication channels 74 such as through switch 73 via a wired 77 connection, through router 72 via a wireless connection 76, or through modem 71 via the Internet 75. Furthermore, while not shown here, other hardware that is specifically designed for servers or networking functions may be employed. For example, secure socket layer (SSL) acceleration cards can be used to offload SSL encryption computations, and transmission control protocol/internet protocol (TCP/IP) offload hardware and/or packet classifiers on network interfaces 42 may be installed and used at server devices or intermediate networking equipment (e.g., for deep packet inspection).
In a networked environment, certain components of computing device 10 may be fully or partially implemented on remote computing devices 80 or cloud-based services 90. Data stored in non-volatile data storage device 50 may be received from, shared with, duplicated on, or offloaded to a non-volatile data storage device on one or more remote computing devices 80 or in a cloud computing service 92. Processing by processors 20 may be received from, shared with, duplicated on, or offloaded to processors of one or more remote computing devices 80 or in a distributed computing service 93. By way of example, data may reside on a cloud computing service 92, but may be usable or otherwise accessible for use by computing device 10. Also, certain processing subtasks may be sent to a microservice 91 for processing with the result being transmitted to computing device 10 for incorporation into a larger processing task. Also, while components and processes of the exemplary computing environment are illustrated herein as discrete units (e.g., OS 51 being stored on non-volatile data storage device 50 and loaded into system memory 30 for use) such processes and components may reside or be processed at various times in different components of computing device 10, remote computing devices 80, and/or cloud-based services 90. Infrastructure as Code (IaC) tools like Terraform can be used to manage and provision computing resources across multiple cloud providers or hyperscalers. This allows for workload balancing based on factors such as cost, performance, and availability. For example, Terraform can be used to automatically provision and scale resources on AWS spot instances during periods of high demand, such as for surge rendering tasks, to take advantage of lower costs while maintaining the required performance levels. In the context of rendering, tools like Blender can be used for object rendering of specific elements, such as a car, bike, or house. These elements can be approximated and roughed in using techniques like bounding box approximation or low-poly modeling to reduce the computational resources required for initial rendering passes. The rendered elements can then be integrated into the larger scene or environment as needed, with the option to replace the approximated elements with higher-fidelity models as the rendering process progresses.
In an implementation, the disclosed systems and methods may utilize, at least in part, containerization techniques to execute one or more processes and/or steps disclosed herein. Containerization is a lightweight and efficient virtualization technique that allows applications and their dependencies to be packaged and run in isolated environments called containers. One of the most popular container runtimes is containerd, which is widely used in software development and deployment. Containerization, particularly with open-source technologies like containerd and container orchestration systems like Kubernetes, is a common approach for deploying and managing applications. Containers are created from images, which are lightweight, standalone, and executable packages that include application code, libraries, dependencies, and runtime. Images are often built from a containerfile or similar, which contains instructions for assembling the image. Containerfiles are configuration files that specify how to build a container image; they include commands for installing dependencies, copying files, setting environment variables, and defining runtime configurations. Systems like Kubernetes natively support containerd as a container runtime. Container images can be stored in repositories, which can be public or private. Organizations often set up private registries for security and version control using tools such as Harbor, JFrog Artifactory and Bintray, GitLab Container Registry, or other container registries. Containers can communicate with each other and the external world through networking. Containerd provides a default network namespace, but can be used with custom network plugins. Containers within the same network can communicate using container names or IP addresses.
Remote computing devices 80 are any computing devices not part of computing device 10. Remote computing devices 80 include, but are not limited to, personal computers, server computers, thin clients, thick clients, personal digital assistants (PDAs), mobile telephones, watches, tablet computers, laptop computers, multiprocessor systems, microprocessor based systems, set-top boxes, programmable consumer electronics, video game machines, game consoles, portable or handheld gaming units, network terminals, desktop personal computers (PCs), minicomputers, mainframe computers, network nodes, virtual reality or augmented reality devices and wearables, and distributed or multi-processing computing environments. While remote computing devices 80 are shown for clarity as being separate from cloud-based services 90, cloud-based services 90 are implemented on collections of networked remote computing devices 80.
Cloud-based services 90 are Internet-accessible services implemented on collections of networked remote computing devices 80. Cloud-based services are typically accessed via application programming interfaces (APIs) which are software interfaces which provide access to computing services within the cloud-based service via API calls, which are pre-defined protocols for requesting a computing service and receiving the results of that computing service. While cloud-based services may comprise any type of computer processing or storage, common categories of cloud-based services 90 include serverless logic apps, microservices 91, cloud computing services 92, and distributed computing services 93.
Microservices 91 are collections of small, loosely coupled, and independently deployable computing services. Each microservice represents a specific computing functionality and runs as a separate process or container. Microservices promote the decomposition of complex applications into smaller, manageable services that can be developed, deployed, and scaled independently. These services communicate with each other through well-defined application programming interfaces (APIs), typically using lightweight protocols like HTTP, protocol buffers, or gRPC, or message queues such as Kafka. Microservices 91 can be combined to perform more complex or distributed processing tasks. In an embodiment, Kubernetes clusters with containerized resources are used for operational packaging of the system.
Cloud computing services 92 are the delivery of computing resources and services over the Internet 75 from a remote location. Cloud computing services 92 provide additional computer hardware and storage on an as-needed or subscription basis. Cloud computing services 92 can provide large amounts of scalable data storage, access to sophisticated software and powerful server-based processing, or entire computing infrastructures and platforms. For example, cloud computing services can provide virtualized computing resources such as virtual machines, storage, and networks; platforms for developing, running, and managing applications without the complexity of infrastructure management; and complete software applications over public or private networks or the Internet on a subscription, alternative licensing, consumption, or ad-hoc marketplace basis, or a combination thereof.
Distributed computing services 93 provide large-scale processing using multiple interconnected computers or nodes to solve computational problems or perform tasks collectively. In distributed computing, the processing and storage capabilities of multiple machines are leveraged to work together as a unified system. Distributed computing services are designed to address problems that cannot be efficiently solved by a single computer, that require large-scale computational power, or that must support highly dynamic variance or uncertainty in compute, transport, or storage resources over time, requiring constituent system resources to be scaled up and down. These services enable parallel processing, fault tolerance, and scalability by distributing tasks across multiple nodes.
Although described above as a physical device, computing device 10 can be a virtual computing device, in which case the functionality of the physical components herein described, such as processors 20, system memory 30, network interfaces 40, NVLink or other GPU-to-GPU high bandwidth communications links and other like components can be provided by computer-executable instructions. Such computer-executable instructions can execute on a single physical computing device, or can be distributed across multiple physical computing devices, including being distributed across multiple physical computing devices in a dynamic manner such that the specific, physical computing devices hosting such computer-executable instructions can dynamically change over time depending upon need and availability. In the situation where computing device 10 is a virtualized device, the underlying physical computing devices hosting such a virtualized computing device can, themselves, comprise physical components analogous to those described above, and operating in a like manner. Furthermore, virtual computing devices can be utilized in multiple layers with one virtual computing device executing within the construct of another virtual computing device. Thus, computing device 10 may be either a physical computing device or a virtualized computing device within which computer-executable instructions can be executed in a manner consistent with their execution by a physical computing device. Similarly, terms referring to physical components of the computing device, as utilized herein, mean either those physical components or virtualizations thereof performing the same or equivalent functions.
The skilled person will be aware of a range of possible modifications of the various aspects described above. Accordingly, the present invention is defined by the claims and their equivalents.
Relationship | Number | Date | Country
---|---|---|---
Parent | 18737906 | Jun 2024 | US
Child | 18885750 | | US
Parent | 18427716 | Jan 2024 | US
Child | 18885750 | | US
Parent | 18410980 | Jan 2024 | US
Child | 18427716 | | US
Parent | 18537728 | Dec 2023 | US
Child | 18410980 | | US