Aspects of the present disclosure relate to storage devices. Specifically, aspects of the present disclosure relate to flash storage devices that include a hardware accelerator.
Solid state storage, such as flash storage, is quickly becoming the most popular type of digital storage for computer applications. The use of Universal Serial Bus (USB) based flash storage devices, also known as thumb drives, has reached near ubiquitous usage. The USB protocol has been updated to increase transfer speeds, allowing a greater range of uses for storage in computer systems. Another protocol, Peripheral Component Interconnect Express (PCIe), is also ubiquitous for high-speed interconnection of devices to a computer system. PCIe provides high transfer speed connections to computer systems and even provides some hot swapping capabilities, but has long been used to permanently connect certain types of devices, like graphics cards and hard drives, to the computer system. Unlike USB, the PCIe hardware interface uses large double-sided edge connectors, with pinouts for full size cards as large as 49 pins and with the smallest double-sided connector having 18 pins. USB-C (currently the most recent iteration of the USB protocol), on the other hand, uses a 12-pin connector. A new standard for PCIe interconnect, M.2, has also recently reached widespread adoption. M.2 provides a smaller form factor, 59-pin, high speed physical interface for PCIe. Additionally, a new protocol for solid state storage, NVM Express (NVMe), has reached widespread adoption. NVMe allows for greater parallelism in communication with solid state storage, providing increased transfer speeds. NVMe also enables other capabilities for connected devices, such as NVM Express over Fabrics (NVMe-oF), which allows the use of transport protocols such as TCP to connect to devices attached to a computer system's NVMe physical interface. These new protocols and standards have paved the way for new capabilities that could be implemented on storage devices to improve the functionality of the attached computer system.
It is within this context that aspects of the present disclosure arise.
The teachings of the present disclosure can be readily understood by considering the following detailed description in conjunction with the accompanying drawings, in which:
Although the following detailed description contains many specific details for the purposes of illustration, anyone of ordinary skill in the art will appreciate that many variations and alterations to the following details are within the scope of the invention. Accordingly, examples of embodiments of the invention described below are set forth without any loss of generality to, and without imposing limitations upon, the claimed invention.
According to aspects of the present disclosure a storage device may improve the function of a connected computer system with the addition of a hardware accelerator.
The improved storage device 101 may communicate with a computer system 102 through a hardware interface 105. The computer system 102 may include a storage device host controller 108. The storage device host controller 108 may include control blocks for the storage device as an I/O device 110 and, optionally, control blocks for the storage device as a network device 109. The I/O device control block 110 may allow the storage device 101 to communicate with the file system 115 of the computer system 102. In some optional implementations the network device control block 109 may communicate with the hardware accelerator 103 in the improved storage device 101. The network device control block 109 may present the improved storage device as a device connected to the computer system over a network connection 113, instead of as a device connected through a hardware interface 105, and may enable communication with the file system 115 of the computer system as an emulated standard network interface. Additionally, the network device control block 109 may allow the improved storage device 101 to communicate over a network connection 113 through a network bridge 111. The network bridge 111 may bridge the connection between the emulated standard network interface in the internal hardware of the computer system and the external network connection 113. Bridging between the network and the improved storage device allows the hardware accelerator to be connected to a network, such as the internet, using the inbuilt routing hardware of the computer. This may allow other computers to access the hardware accelerator and/or other components of the improved storage device while protecting the computer system and any separate internal or dedicated network to which it may be connected.
By way of example, in some implementations, routing rules may be configured to allow the accelerator 103 to access a wide area network, such as the Internet, while blocking access to the dedicated network of the computer system 102. The routing rules may act as the equivalent of a virtual private network (VPN) that isolates traffic from the accelerator 103 to the Internet from internal traffic for the computer system 102. Both the computer system and the accelerator can talk to the Internet, but outgoing traffic cannot access the computer system's separate network. This is an important feature in implementations where it is desirable or even necessary to allow the computer system 102 to provide ingress traffic to the accelerator 103 and let the accelerator respond to such traffic, but it is also desirable or necessary to prevent the accelerator from monitoring or initiating a connection to the computer system's separate network.
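By way of illustration only, such routing rules may be sketched as a simple packet-forwarding check; the network names, address ranges, and helper function below are hypothetical assumptions and not part of any particular implementation:

```python
from ipaddress import ip_address, ip_network

# Hypothetical address ranges; the accelerator's bridged interface and the
# computer system's separate internal network are assumptions for illustration.
ACCELERATOR_NET = ip_network("192.168.100.0/24")   # emulated network interface
INTERNAL_NET = ip_network("10.0.0.0/8")            # computer system's dedicated network


def forward_allowed(src: str, dst: str) -> bool:
    """Return True if the network bridge may forward a packet.

    Traffic between the accelerator and the wide area network is permitted,
    but traffic originating from the accelerator toward the internal network
    is dropped, mirroring the VPN-like isolation described above.
    """
    src_ip, dst_ip = ip_address(src), ip_address(dst)
    if src_ip in ACCELERATOR_NET and dst_ip in INTERNAL_NET:
        return False  # accelerator may not initiate connections to the internal network
    return True       # everything else (e.g., accelerator <-> Internet) is forwarded


# Example: the accelerator may reach the Internet but not the internal network.
assert forward_allowed("192.168.100.5", "93.184.216.34") is True
assert forward_allowed("192.168.100.5", "10.1.2.3") is False
```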
The computer system may include computer hardware components 112 such as a CPU, memory, local storage, or a GPU. The computer hardware components 112 may operate together to carry out computer functions and may communicate with the improved storage device during operation. For example and without limitation, the computer hardware components may receive data from the improved storage device, send requests for data to the improved storage device, send storage requests to the improved storage device, and send data to be stored in the improved storage device. Additionally, in some implementations the computer hardware components may operate with the storage device host controller 108 to allow messages from a network connection to reach the improved storage device 101.
An improved feature of the storage device 101 is the hardware accelerator 103. The hardware accelerator 103 may be configured to offload some processes traditionally performed by the computer hardware components 112 to the hardware accelerator. The hardware accelerator 103 may include a specialized integrated circuit (IC) such as a Graphics Processing Unit accelerator (GPU accelerator) IC, a Neural Processing Unit (NPU) IC, or a Field Programmable Gate Array (FPGA) IC. For example and without limitation, the hardware accelerator 103 may index data from the computer system 102 stored in the flash storage 104 by generating index entries for the data from the computer system that is stored in the flash memory 104, improving data access times for the computer system and reducing the amount of data that must be transferred between the flash memory and the computer system 102. Additionally, the hardware accelerator 103 may perform other functions such as encryption of data, decryption of data, compression of data, decompression of data, etc. Compression of data may include a corresponding process of encoding the data, and decompression of data may include a corresponding process of decoding the data. Some examples of compression algorithms that may be used by the hardware accelerator include, without limitation, Lempel-Ziv-Welch (LZW), entropy coding algorithms, run length encoding, etc. Additionally, the storage device with the hardware accelerator may be able to read the contents of data provided to it by the computer system and choose how to handle the data based on the type of data. The type of data may include file type, arrangement of data, size of data, etc. For example and without limitation, the improved storage device may store image data using a first index and store audio data using a second index, thus reducing search time for both audio and image data. In some alternative implementations the improved storage system may be able to read and search data within data arrays; for example and without limitation, the improved storage device may be able to search data arrays such as row-based data array files or column-based data array files for data queries sent by the computer system to the hardware device. An example of a column-based array file type that may be used in some implementations of the present disclosure is the Apache Parquet format file type. It should be understood that the arrays may be an implementation of vector oriented storage. Vector oriented storage may use one or more machine learning models for dimensional reduction and tokenization of the data; the resulting data is referred to as an embedding. The embeddings may be stored in the flash memory in an index with reference to the location of the original data. Thus, vector oriented databases may be specialized for the Single Instruction Multiple Data (SIMD) architecture used in GPUs and some FPGAs, which may be more efficient at machine learning tasks. As such, in some vector based file types the block sizes of the different sections are of a fixed size, so the hardware accelerator could programmatically determine where to point each core to independently scan data without prior knowledge of the other blocks. In some alternative implementations data in the flash memory may be assets such as videos, images, audio, game scripting, etc., and the hardware accelerator may create derivative assets from base assets stored in the flash memory 104.
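By way of illustration only, the following minimal sketch shows how a hardware accelerator might maintain a first index for image data and a second index for audio data so that each can be searched without scanning the other; the data structures and helper names are hypothetical assumptions, not a definitive implementation:

```python
from dataclasses import dataclass, field


@dataclass
class IndexEntry:
    key: str          # e.g., a file name or content hash
    block: int        # starting flash block of the stored data
    length: int       # number of blocks occupied


@dataclass
class TypedIndex:
    # One index per data type; illustrative structure only.
    image_index: dict = field(default_factory=dict)
    audio_index: dict = field(default_factory=dict)

    def add(self, data_type: str, entry: IndexEntry) -> None:
        index = self.image_index if data_type == "image" else self.audio_index
        index[entry.key] = entry

    def lookup(self, data_type: str, key: str):
        index = self.image_index if data_type == "image" else self.audio_index
        return index.get(key)


# Storing image data under the first index and audio data under the second
# reduces the search space for either type of request.
idx = TypedIndex()
idx.add("image", IndexEntry("texture_rock.png", block=128, length=64))
idx.add("audio", IndexEntry("ambient_wind.wav", block=2048, length=256))
print(idx.lookup("audio", "ambient_wind.wav"))
```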
The GPU type hardware accelerator 201, hereinafter referred to as a GPU accelerator, may perform functions to assist the operation of a computer system's GPU. For example and without limitation, the GPU accelerator 201 may perform data storage using an index and retrieval using the index. In some implementations, the data stored may be texture data indexed by at least a Level of Detail (LOD); the hardware accelerator may be configured to retrieve the correct texture data based on an LOD provided by the computer system. In some implementations the texture data indexed by at least LOD may be MIP maps stored in the flash memory, and the GPU accelerator may retrieve the correct resolution texture based on the request from the computer system. In some implementations, the GPU accelerator may also perform other graphics processing functions such as graphics pipeline operations. For example and without limitation, the GPU accelerator may be configured to perform shading and filtering operations on images in the flash memory. According to an aspect of the present disclosure the GPU accelerator may be configured to allow a specialized language similar to OpenGL to operate the hardware accelerator and perform tasks. This may provide software developers the ability to define how to efficiently comb through data stored in the flash memory. An aspect of some implementations of the present disclosure is that the GPU accelerator may be able to see and use the flash memory on the improved storage device in the same way a GPU on the computer can access memory located in the GPU (sometimes referred to as VRAM). In some other alternative implementations, the GPU accelerator may include a hardware codec and/or may be configured to perform video encoding and decoding. Additionally, the GPU accelerator may be configured to perform data-heavy manipulation and filtering. In some implementations the GPU accelerator may be configured to implement one or more trained machine learning models. The GPU accelerator, along with other components of the storage device, may include a flexible open storage language that enables custom data filters to be generated by users and executed with the GPU accelerator. The GPU accelerator may be configured to execute custom filters that manipulate data arrays stored in the flash memory and provide the manipulated data to the computer system, thus reducing processing workload for the computer system. The filters executed by the GPU accelerator may perform multiplication, addition, subtraction, division, removal, copying, formatting, algorithmic operations, etc. on data stored in the flash memory. In some implementations the computer system may request specific data and the GPU accelerator may filter the arrays stored in the flash for the specific data, thus reducing the amount of data transferred to the computer system and eliminating the need for the computer system itself to filter the data.
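For example and without limitation, such a custom filter might be applied to column arrays on the accelerator so that only matching rows are returned to the computer system, as in the following minimal sketch; the column names and filter expression are hypothetical assumptions:

```python
import numpy as np


def run_filter(columns: dict, expression) -> dict:
    """Apply a user-supplied filter expression to column arrays in place of the host.

    Only the rows that satisfy the expression are returned, so far less data
    needs to be transferred back to the computer system.
    """
    mask = expression(columns)               # boolean mask computed on the accelerator
    return {name: col[mask] for name, col in columns.items()}


# Hypothetical columnar data, as might be scanned out of a column-based array file.
data = {
    "timestamp": np.arange(0, 1_000_000),
    "temperature": np.random.default_rng(0).normal(20.0, 5.0, 1_000_000),
}

# Custom filter: only rows where temperature exceeds 30 degrees.
hot = run_filter(data, lambda c: c["temperature"] > 30.0)
print(len(hot["timestamp"]), "matching rows returned to the host")
```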
The storage controller 203 may be any storage controller suitable in the art; for example and without limitation, the storage controller may be an NVMe controller, a PCIe controller, a USB controller, or similar. Similarly, the hardware interface 205 may be any hardware interface compatible with the storage controller. Suitable hardware interfaces include, for example and without limitation, a USB connector, an M.2 connector, a PCIe edge connector, etc. The flash storage may be any suitable solid-state storage.
According to an aspect of the present disclosure the NPU 401 may read the flash memory 402 and be configured to generate derivative assets from base assets stored in the flash memory. The NPU 401 may be loaded with Neural Network (NN) data such as an NN model with corresponding weights, transition values and the like. The NN data may cause the NPU to implement a pre-trained neural network model for asset generation; for example and without limitation, the NPU may implement a generative type pre-trained neural network such as an autoencoder type model, a diffusion type model, etc. Examples of autoencoder type models include transformer type models such as Generative Pre-trained Transformers (GPT), Bidirectional Encoder Representations from Transformers (BERT), Language Model for Dialogue Applications (LaMDA), etc. In some implementations the NPU may implement other neural network models. For example and without limitation, the other neural network models may be Neural Radiance Field (NeRF) models. Aspects of the present disclosure additionally implement a custom language allowing developers to provide the hardware controller with custom instructions for machine learning models. As such, there is no limitation on machine learning models that may be implemented by the improved storage device. Furthermore, aspects of the present disclosure are not limited to generative machine learning models or unsupervised machine learning models and may also apply to other ML models such as supervised learning models, wherein the model is trained with developer supervision and then frozen; the trained model is then used without further adjustment. For example and without limitation, the improved storage device may enable a developer to write a custom script to represent an inference model and use the hyper-parameters from the training to scan data in the flash memory. The inference action may be in the form of a data request from the computer to the hardware accelerator to use the inference model to “query” the data using the hyper-parameters as a filter expression.
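As a minimal illustration of this idea, a frozen model's trained parameters (referred to above as hyper-parameters) might act as a filter expression over records scanned from the flash memory; the model, its parameters, threshold, and feature layout in the sketch below are hypothetical assumptions:

```python
import numpy as np

# Hypothetical frozen parameters produced by prior supervised training.
WEIGHTS = np.array([0.8, -0.3, 1.2])
BIAS = -0.5
THRESHOLD = 0.5


def sigmoid(x: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-x))


def inference_query(records: np.ndarray) -> np.ndarray:
    """Score each record with the frozen model and keep those above threshold.

    The "query" is the inference itself: the trained parameters act as the
    filter expression applied to data scanned out of flash.
    """
    scores = sigmoid(records @ WEIGHTS + BIAS)
    return records[scores > THRESHOLD]


records = np.random.default_rng(1).normal(size=(10_000, 3))  # scanned feature rows
print(inference_query(records).shape, "records match the inference query")
```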
The NPU 401 implementing the neural network model may be configured to use base asset data 403 stored in the flash memory to generate one, two, or more derivative assets. For example and without limitation, the derivative asset may be event scripting or character dialog, the neural network model may be a pre-trained generative model (such as a large language model like GPT) trained with a machine learning algorithm to generate event scripting and/or character dialog, and the base asset may be a prompt for generation of a specific event and/or character dialog. When the computer system calls the base asset with a generative command, the NPU may generate event scripting or character dialog from the prompt. This may allow the computer system to store a prompt, which is only a small amount of data, in the flash memory and generate a large amount of data for the scripting and/or character dialog.
In another example the neural network model may be a pre-trained generative model for images and the base asset may be an image, a video, one or more frames from a video, or a text prompt stored in the flash memory.
When the computer system calls the base asset with a generative command, the NPU may generate a different derivative image, video, frames from a video, or model (depending on the request and the type of machine learning model) from the base asset using a pre-trained machine learning model (such as a diffusion model) trained with a machine learning algorithm. In yet another implementation the NPU may implement a deep learning NN model, such as a NeRF model, with a machine learning algorithm, and the base asset may include two or more image views of a scene or an object. When the computer requests the base asset with a generative command, the NPU may generate a three-dimensional representation of the base asset.
According to another aspect of the present disclosure the improved storage controller may include a System on a Chip (SoC) that includes a CPU that may run Linux and may be configured to view data on the flash storage as a native file system for the SoC. Thus, the hardware accelerator and SoC may “mount” the flash memory as an attached storage drive to the SoC; accelerated reads coming from the network interface may then perform special file operations at the filesystem level and are not limited to reads and writes at the raw memory block level. As such, the improved storage device may be referred to as file aware, as the SoC enables filesystem level operations on data stored in the flash memory. Additionally, any specialized data format may still reside on an EXT4/ZFS/NTFS low level format provided through the storage controller data interface. The SoC can read the file like an attached drive and then process the data from a raw memory perspective using the filesystem as a guide to the raw data. The file system on the improved storage device may be augmented with index data so that when the SoC opens a file, it may use the index and would not be required to scan the entire flash memory to determine where the actual data blocks reside, e.g., after the directory/node traversal occurs. In some implementations the index may not be needed to support the customized data format, but it may be used to “cache” the file storage layout, allowing the hardware accelerator to assign cores to memory ranges. In some implementations an incoming read request to the hardware accelerator may use the filename, and the hardware accelerator may locate an index informing the hardware accelerator of the memory segments containing the desired file; the hardware accelerator may then return the results to the file read request after the location of the desired file in the flash memory has been scanned by the hardware accelerator. The O/S on the SoC may allow the device to have native support for filesystems or use standard libraries (for example and without limitation, a Java Parquet file reader).
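By way of illustration only, such an index-assisted read may be sketched as follows, with a cached layout index mapping a filename to the memory segments holding the file's data; the index structure and flash-access helper are hypothetical placeholders for the SoC filesystem layer:

```python
# Hypothetical cached file-layout index: file name -> list of (start block, block count) extents.
FILE_INDEX = {
    "textures/rock.dat": [(1024, 16), (4096, 8)],
}


def read_block_range(start: int, count: int) -> bytes:
    """Stand-in for reading raw blocks from the flash memory (512-byte blocks assumed)."""
    return bytes(count * 512)  # illustrative only


def indexed_read(filename: str) -> bytes:
    """Serve a file read using the cached layout index.

    Because the index records where the file's data blocks reside, the
    accelerator can point directly at those memory ranges instead of
    scanning the entire flash or re-traversing the directory tree.
    """
    extents = FILE_INDEX.get(filename)
    if extents is None:
        raise FileNotFoundError(filename)
    return b"".join(read_block_range(start, count) for start, count in extents)


payload = indexed_read("textures/rock.dat")
print(len(payload), "bytes returned for the read request")
```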
The hardware accelerator 602 may read the data and organize the data according to an index. For example, in the case of data corresponding to textures of different LOD, each index may correspond to a different LOD. The hardware accelerator may then send a write command 605 to the flash storage 603 according to the index. In some implementations the write command 605 may be relayed through the storage controller. Additionally, in this implementation, the hardware accelerator 602 also sends a write request for an entry into an index 606 for the data stored in the flash memory 603.
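By way of illustration only, this write path may be sketched as follows, with the data stored first and an index entry recording its location written second; the flash and index interfaces are hypothetical stand-ins for the storage controller commands:

```python
class IndexedWriter:
    """Illustrative write path: store the data, then record an index entry for it."""

    def __init__(self):
        self.flash = {}        # stand-in for flash blocks, keyed by block number
        self.index = {}        # stand-in for the on-flash index: key -> (start block, block count)
        self.next_block = 0

    def write(self, key: str, data: bytes, block_size: int = 512) -> None:
        start = self.next_block
        blocks = (len(data) + block_size - 1) // block_size
        for i in range(blocks):
            self.flash[start + i] = data[i * block_size:(i + 1) * block_size]
        self.next_block += blocks
        # Second write: the index entry describing where the data now resides.
        self.index[key] = (start, blocks)


writer = IndexedWriter()
writer.write("lod2/texture_rock", b"\x00" * 4096)
print(writer.index["lod2/texture_rock"])  # -> (0, 8)
```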
In some implementations the indexing of data may be file and/or data type selective. In such implementations the hardware accelerator may read the data or metadata to determine the file type and/or data type of the data and send indexed data store requests based on the determined file type and/or data type. To perform indexing of data, in some implementations the hardware accelerator may keep an internal register representation of indexes stored in the flash memory. Alternatively, the hardware accelerator may read the index from the flash memory before organizing the data according to the index.
The computer system 601 may request to read data 607 from the improved storage device 602. This read request 607 may pass to the hardware accelerator 602. In some implementations the read request may be initially received by the storage controller before passing to the hardware accelerator 602. After receiving the read request, the hardware accelerator may selectively send a read request for the indexed data corresponding to the index for the data to be read 608 to the flash memory 603. Selectively sending the read request may include reading the data to determine whether the file type and/or data type of the data to be read is stored with an index or, if there are multiple indices for different data types or file types, which index should be read to determine the location of the data to be read. The flash memory 603 sends the index data, including the index location 609 for the data to be read, back to the hardware accelerator 602. The hardware accelerator 602 then uses the index location to send a read request 610 to the flash memory 603 for the data at the location indicated by the index. In response the flash memory 603 may send the requested indexed data 611 back to the hardware accelerator. Alternatively, the flash memory 603 may send the requested indexed data to the storage controller instead. Once received at the hardware accelerator 602 or storage controller, the data may be sent 612 to the computer system 601.
The computer system 701 may request to read texture data at a specific LOD 707 from the improved storage device 702. This read request 707 may pass to the GPU accelerator 702. In some implementations the read request may be initially received by the storage controller before passing to the GPU accelerator 702. After receiving the read request, the GPU accelerator may selectively send a read request for the texture data indexed by LOD 708 to the flash memory 703. In some implementations the GPU accelerator may examine the request to determine if the request is for texture data. The flash memory 703 sends the texture data 709 back to the GPU accelerator 702. Alternatively, the flash memory may send the requested texture data to the storage controller instead. Once received at the GPU accelerator 702 or storage controller, the data may be sent 710 to the computer system 701.
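By way of illustration only, LOD-indexed texture retrieval may be sketched as follows, with only the MIP level matching the requested LOD read from flash; the index layout, block addresses, and fallback rule are hypothetical assumptions:

```python
# Hypothetical index: texture name -> {LOD level: (start block, block count)} for stored MIP maps.
TEXTURE_INDEX = {
    "rock_albedo": {0: (0, 64), 1: (64, 16), 2: (80, 4)},  # LOD 0 is full resolution
}


def read_blocks(start: int, count: int) -> bytes:
    """Stand-in for a flash read through the storage controller (512-byte blocks assumed)."""
    return bytes(count * 512)


def fetch_texture(name: str, lod: int) -> bytes:
    """Return only the MIP level matching the requested LOD.

    The GPU accelerator selects the correct resolution from the index, so the
    computer system never transfers MIP levels it does not need.
    """
    levels = TEXTURE_INDEX[name]
    # Fall back to the coarsest stored level if the exact LOD is not stored.
    level = lod if lod in levels else max(levels)
    start, count = levels[level]
    return read_blocks(start, count)


print(len(fetch_texture("rock_albedo", lod=1)), "bytes for the requested LOD")
```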
The NPU 802 may receive the request for an asset 804 and initiate the required NN models to generate a derivative asset from the requested asset. Alternatively, the NPU 802 may be pre-configured to generate a derivative asset from the base asset. In either case the NPU 802 sends a request 805 to the flash memory 803 for the base asset. The flash memory 803 then sends the base asset 806 to the NPU 802. After reception of the base asset the NPU 802 generates a derivative asset using the trained machine learning model. Once the derivative asset has been generated the NPU 802 sends the derivative asset 807 to the computer system 801.
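By way of illustration only, this sequence may be sketched as follows, with a toy function standing in for the pre-trained generative model that an actual NPU would execute; the asset names and model call are hypothetical assumptions:

```python
def read_base_asset(flash: dict, asset_id: str) -> str:
    """Stand-in for the NPU's request to the flash memory for the base asset."""
    return flash[asset_id]


def toy_model(prompt: str) -> str:
    """Placeholder for a pre-trained generative model (e.g., a large language model)."""
    return f"[generated dialog derived from: {prompt}]"


# Hypothetical flash contents: a small stored prompt acting as the base asset.
flash_memory = {"quest_intro_prompt": "Write a short greeting from the village elder."}

# Sequence: receive request -> fetch base asset from flash -> generate -> return.
base_asset = read_base_asset(flash_memory, "quest_intro_prompt")
derivative_asset = toy_model(base_asset)
print(derivative_asset)  # sent back to the computer system in place of a larger stored asset
```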
In other alternative implementations the base asset may be (or may include) a text description of an image and the neural network may generate a derivative in the form of one or more images derived from the text description. In yet other alternative implementations the base asset may be (or may include) a text description of a digital object and the neural processor may generate a derivative asset in the form of a digital representation of the object in three dimensions from the text description. In yet other alternative implementations the base asset may be (or may include) an image of a digital object and the neural processor may generate a derivative asset in the form of a three dimensional representation of the image of the digital object from the image of the digital object. In still other implementations, the flash memory may contain a script file including instructions for generation of the derivative asset and the neural processor may use the script file to generate the derivative asset from the base asset.
According to aspects of the present disclosure, an NPU may implement machine learning models to generate derivative assets of base assets stored in memory. An example of a machine learning model that may be trained to generate a derivative asset from a base asset is a diffusion model. A diffusion model trains a neural network to predict an image based on a distribution of noise. Initially a diffusion model is trained to remove noise that is added to a clean image; once fully trained, the model is tasked with generating an image from random noise. For more information about diffusion models see: Yang, Ling, “Diffusion Models: A Comprehensive Survey of Methods and Applications,” arXiv:2209.00796, published Sep. 2, 2022, available at: https://arxiv.org/abs/2209.00796, the contents of which are incorporated herein by reference. Another generative model is an autoencoder. An autoencoder is a type of neural network layout having encoder networks, which take part in dimensional reduction and output embeddings, and decoder networks, which predict a synthetic output using the embeddings. The autoencoder neural network outputs feature length embeddings and the decoder includes a neural network that uses those feature length image embeddings to generate one or more synthetic assets. For more information on autoencoder asset generation see: Chang, Huiwen, “Muse: Text-To-Image Generation via Masked Generative Transformers,” arXiv:2301.00704, published Jan. 2, 2023, available at: https://arxiv.org/abs/2301.00704, the contents of which are incorporated herein by reference. Yet another generation method is Neural Radiance Fields, which generate a three-dimensional representation from multiple image views; more information about NeRFs can be found at: Mildenhall, Ben, “NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis,” arXiv:2003.08934, published Mar. 19, 2020, available at: https://arxiv.org/abs/2003.08934, the contents of which are incorporated herein by reference.
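By way of illustration only, the denoising training objective described above may be sketched as follows; the predictor is reduced to a placeholder and the noise schedule is simplified, so this is not a full diffusion implementation:

```python
import numpy as np

rng = np.random.default_rng(0)


def noisy_sample(clean: np.ndarray, noise_level: float):
    """Corrupt a clean image with Gaussian noise; return the noisy image and the noise."""
    noise = rng.normal(size=clean.shape)
    return clean + noise_level * noise, noise


def denoising_loss(predict_noise, clean: np.ndarray, noise_level: float) -> float:
    """Training objective: the network learns to predict the noise that was added."""
    noisy, noise = noisy_sample(clean, noise_level)
    predicted = predict_noise(noisy, noise_level)
    return float(np.mean((predicted - noise) ** 2))


def placeholder_predictor(noisy: np.ndarray, noise_level: float) -> np.ndarray:
    """Placeholder for the neural network that would be trained to minimize the loss."""
    return np.zeros_like(noisy)


clean_image = rng.random((32, 32))  # stand-in for a clean training image
print(denoising_loss(placeholder_predictor, clean_image, noise_level=0.5))
```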
As may be appreciated from the preceding discussion, a storage device may improve the function of a connected computer system with the addition of a hardware accelerator that at least selectively indexes data stored in flash memory and may selectively retrieve data stored in the flash memory, reducing the amount of data that needs to be read and/or retrieved from the flash memory by the computer system. Additionally, an improved storage device having an NPU may generate multiple different derivative assets with a generative machine learning model implemented by the NPU. This reduces the amount of data the computer system needs to store in and/or retrieve from the flash memory to generate different assets, because a single base asset may be used to generate multiple different assets.
While the above is a complete description of the preferred embodiment of the present invention, it is possible to use various alternatives, modifications and equivalents. Therefore, the scope of the present invention should be determined not with reference to the above description but should, instead, be determined with reference to the appended claims, along with their full scope of equivalents. Any feature described herein, whether preferred or not, may be combined with any other feature described herein, whether preferred or not. In the claims that follow, the indefinite article “A”, or “An” refers to a quantity of one or more of the item following the article, except where expressly stated otherwise. The appended claims are not to be interpreted as including means-plus-function limitations, unless such a limitation is explicitly recited in a given claim using the phrase “means for.”