Embodiments described herein generally relate to the field of data processing, and more particularly relates to methods and systems of automated/controlled transferring of big data computations from centralized systems to at least one of messaging systems and data collection systems.
Conventionally, big data is a term for data sets that are so large or complex that traditional data processing applications are not sufficient. Challenges of large data sets include analysis, capture, data curation, search, sharing, storage, transfer, visualization, querying, updating, and information privacy.
In one big data example, if a prior approach issues a SQL query in one place to run against a large amount of data located in another place, it creates a significant amount of network traffic, which can be slow and costly. However, such an approach can utilize a predicate pushdown to push parts of the SQL query down to the storage layer, and thus filter out some of the data.
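A minimal PySpark sketch of this idea follows, purely as an illustration; it assumes a Spark installation, and the file path and column names (country, user, url) are hypothetical. For supported sources such as Parquet, the filter can be applied at the scan so that much of the non-matching data is never transferred.

# Hypothetical illustration of predicate pushdown: the filter on `country`
# can be evaluated at the storage layer for supported columnar sources,
# so far less data reaches the query engine.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pushdown-example").getOrCreate()

visits = spark.read.parquet("hdfs:///logs/visits")   # hypothetical path
us_visits = visits.filter(visits.country == "US")    # candidate for pushdown
us_visits.select("user", "url").show()

# us_visits.explain() would list the pushed filters for supported sources.
spark.stop()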
For one embodiment of the present invention, methods and systems of automated/controlled transfer of big data computations from centralized systems to distributed nodes (e.g., at least one of messaging systems and data collection systems) are disclosed. In one embodiment, a system comprises storage to store data and a plurality of servers coupled to the storage. The plurality of servers perform at least one of ingest, transform, and serve stages on the data. A sub-system (e.g., a compiler component) performs program analysis on the computation and automatically detects the computations of the program that can be transferred from centralized systems to at least one of messaging systems and data collection systems, which utilize at least one of software and at least one processing unit (e.g., an in-line accelerator) to perform the transferred program.
Other features and advantages of embodiments of the present invention will be apparent from the accompanying drawings and from the detailed description that follows below.
Methods, systems, and apparatuses for automated/controlled pushing of big data computations from centralized systems to at least one of messaging systems and data collection systems are described.
In the following description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the present invention. It will be apparent, however, to one skilled in the art that the present invention can be practiced without these specific details. In other instances, well-known structures and devices are shown in block diagram form in order to avoid obscuring the present invention.
Reference in the specification to “one embodiment” or “an embodiment” means that a particular feature, structure or characteristic described in connection with the embodiment is included in at least one embodiment of the present invention. Thus, the appearances of the phrase “in one embodiment” appearing in various places throughout the specification are not necessarily all referring to the same embodiment. Likewise, the appearances of the phrase “in another embodiment,” or “in an alternate embodiment” appearing in various places throughout the specification are not necessarily all referring to the same embodiment.
The following glossary of terminology and acronyms serves to assist the reader by providing a simplified quick-reference definition. A person of ordinary skill in the art may understand the terms as used herein according to general usage and definitions that appear in widely available standards and reference books.
HW: Hardware.
SW: Software.
I/O: Input/Output.
DMA: Direct Memory Access.
CPU: Central Processing Unit.
FPGA: Field Programmable Gate Arrays.
CGRA: Coarse-Grain Reconfigurable Accelerators.
GPGPU: General-Purpose Graphical Processing Units.
MLWC: Many Light-weight Cores.
ASIC: Application Specific Integrated Circuit.
PCIe: Peripheral Component Interconnect express.
CDFG: Control and Data-Flow Graph.
FIFO: First In, First Out.
NIC: Network Interface Card.
HLS: High-Level Synthesis.
KPN: Kahn Process Networks (KPN) is a distributed model of computation (MoC) in which a group of deterministic sequential processes communicate through unbounded FIFO channels. The process network exhibits deterministic behavior that does not depend on various computation or communication delays. A KPN can be mapped onto any accelerator (e.g., an FPGA based platform) for embodiments described herein.
Dataflow analysis: An analysis performed by a compiler on the CDFG of the program to determine dependencies between a write operation on a variable and the subsequent operations that might depend on the written value.
Accelerator: a specialized HW/SW component that is customized to run an application or a class of applications efficiently.
In-line accelerator: An accelerator for I/O-intensive applications that can send and receive data without CPU involvement. If an in-line accelerator cannot finish the processing of an input data, it passes the data to the CPU for further processing.
Bailout: The process of transitioning the computation associated with an input from an in-line accelerator to a general purpose instruction-based processor (i.e. general purpose core).
Continuation: A kind of bailout that causes the CPU to continue the execution of an input data, which was being processed on an accelerator, from right after the bailout point.
Rollback: A kind of bailout that causes the CPU to restart the execution of an input data on an accelerator from the beginning or some other known location with related recovery data like a checkpoint.
Gorilla++: A programming model and language with both dataflow and shared-memory constructs as well as a toolset that generates HW/SW from a Gorilla++ description.
GDF: Gorilla dataflow (the execution model of Gorilla++).
GDF node: A building block of a GDF design that receives an input, may apply a computation kernel on the input, and generates corresponding outputs. A GDF design consists of multiple GDF nodes. A GDF node may be realized as a hardware module or a software thread or a hybrid component. Multiple nodes may be realized on the same virtualized hardware module or on a same virtualized software thread.
Engine: A special kind of component (e.g., a GDF node) that contains computation.
Infrastructure component: Memory, synchronization, and communication components.
Computation kernel: The computation that is applied to all input data elements in an engine.
Data state: A set of memory elements that contains the current state of computation in a Gorilla program.
Control State: A pointer to the current state in a state machine, stage in a pipeline, or instruction in a program associated with an engine.
Dataflow token: A component's input/output data elements.
Kernel operation: An atomic unit of computation in a kernel. There might not be a one-to-one mapping between kernel operations and the corresponding realizations as states in a state machine, stages in a pipeline, or instructions running on a general purpose instruction-based processor.
Accelerators can be used for many big data systems that are built from a pipeline of subsystems including data collection and logging layers, a messaging layer, a data ingestion layer, a data enrichment layer, a data store layer, and an intelligent extraction layer. Usually, data collection and logging layers are performed on many distributed nodes. Messaging layers are also distributed. However, ingestion, enrichment, storing, and intelligent extraction happen at central or semi-central systems. In many cases, ingestion and enrichment need a significant amount of data processing, and large quantities of data need to be transferred from the distributed data collection, logging, and messaging layers to the central systems for that processing.
Examples of data collection and logging layers are web servers that record website visits by a plurality of users. Other examples include sensors that record a measurement (e.g., temperature, pressure) or security devices that record special packet transfer events. Examples of a messaging layer include simple copying of the logs, or the use of more sophisticated messaging systems (e.g., Kafka, Nifi). Examples of ingestion layers include extract, transform, load (ETL) tools, which refer to a process in database usage and particularly in data warehousing. These ETL tools extract data from data sources, transform the data for storing in a proper format or structure for the purposes of querying and analysis, and load the data into a final target (e.g., database, data store, data warehouse). An example of a data enrichment layer is adding geographical information or user data through databases or key value stores. A data store layer can be a simple file system or a database. An intelligent extraction layer usually uses machine learning algorithms to learn from past behavior to predict future behavior.
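Purely as an illustration of such an enrichment step (not any specific embodiment), the following Python sketch joins a log record with user attributes from a key-value store; the field names and lookup data are hypothetical.

# Hypothetical enrichment step: attach user attributes to a raw log record
# by looking them up in a key-value store (a dict stands in for the store here).
user_store = {"JOE": {"age": 34, "region": "EMEA"}}   # hypothetical key-value data

def enrich(record: dict) -> dict:
    """Return a copy of the record augmented with user attributes, if known."""
    enriched = dict(record)
    enriched.update(user_store.get(record.get("user", ""), {}))
    return enriched

print(enrich({"user": "JOE", "url": "http://example.com"}))
# {'user': 'JOE', 'url': 'http://example.com', 'age': 34, 'region': 'EMEA'}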
Typically, data collection and logging layers are performed on many distributed nodes. Messaging layers are also distributed. However, ingestion, enrichment, storing, and intelligent extraction happen at central or semi-central systems. In many cases, ingestion and enrichment need a significant amount of data processing, and certain operations, such as filtering and aggregation, dramatically decrease the volume of the data. The present design pushes or transfers at least a portion (or the entire amount) of the computation associated with ingestion, enrichment, and possibly intelligent extraction to a distributed messaging layer or even a distributed data collection layer to reduce network congestion, reduce the transfer of data across networks, and reduce overall processing time.
The present design automatically detects computations that are conventionally performed during ingestion/enrichment or intelligent extraction. In one example, these computations are pushed or transferred to the messaging layer and performed there, or are even performed on the data collection layer while the data is being logged or while the data is in motion. In another example, these computations are pushed or transferred to edge devices and performed there. As a result, the present design leads to a much lower volume of data being transferred to the central systems, as well as increased utilization of the distributed resources and edge devices rather than centralized resources.
In one example, the auto transfer feature 508 functions in a manner similar to the auto transfer features 308 and 408. The auto transfer feature 508 can cause computations that are normally performed with resources 504 of the system 502 to be transferred and performed with devices of the event producers 540 or resources 512 of the distributed nodes 510 that filter the collected data to generate a reduced set of data instead of data 550 and data 552. The auto transfer feature 508 may also be located on the distributed nodes 510 and the event producers 540.
A machine 750 (e.g., server 750) performs the operations 705-707, 716-718, and 727-729. The machine 750 includes an I/O processing unit 751 (e.g., network interface card 751) having an in-line accelerator 752. The machine 750 also includes storage 756, general purpose instruction-based processor 757, and memory 758. A data path 759 illustrates the data flow of machine 750 for stage 701. For example, data is read from a source storage 705 of storage 756 (e.g., operation 753), and computations 706 (e.g., operation 754) and shuffle write operations 707 (e.g., operation 755) are performed by the in-line accelerator 752. The outputs 709-710 are sent to a second stage 712 via a network connection 760. The machines 730 and 750 can be a distributed resource (e.g., a messaging system, a data collection system, a collect and delivery stage, an event producers stage) for performing computations that are typically performed with a centralized resource.
There are many ways to implement this idea of automatic extraction of the computation that can be moved from a centralized system to a messaging/logging layer. One simple approach applies to cases in which ingestion/enrichment is done using a framework (e.g., Spark). Since computation is split between stages, shuffling between different partitions of the data occurs only between stages. If the source of the first stage is an external stream of data from a messaging system, the computations in the first stage can be done independently without requiring any shuffle (if there is any computation at all). Therefore, the present design automatically moves that computation to a messaging system (e.g., Kafka) or even a logging/data collection system (e.g., web servers, sensors).
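One possible way to express this detection, shown below only as a hedged sketch and not as the disclosed compiler's actual algorithm, is to walk the ordered operations of a job and cut at the first operation that requires a shuffle; the operation names are illustrative.

# Hypothetical sketch: split a pipeline at the first shuffle boundary.
# Operations before the boundary are narrow (per-record) and can be offloaded
# to a distributed messaging/logging node; the rest stays on the central system.
WIDE_OPS = {"reduceByKey", "groupByKey", "join", "repartition"}  # illustrative set

def split_at_shuffle(ops):
    """ops: ordered list of operation names; returns (offloadable, central)."""
    for i, op in enumerate(ops):
        if op in WIDE_OPS:
            return ops[:i], ops[i:]
    return ops, []

offload, central = split_at_shuffle(["parse", "filter", "map", "reduceByKey", "save"])
print(offload)   # ['parse', 'filter', 'map']  -> push to the messaging layer
print(central)   # ['reduceByKey', 'save']     -> keep on the centralized system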
For example, a data ingestion engine runs an ETL process on the incoming data and, for a log line like the following,
JOE Mar. 12, 2013 HTTP://domainname.com
creates a corresponding JSON (JavaScript Object Notation) entry.
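The exact JSON layout is not reproduced here; purely for illustration, a mapping along the following lines could be produced (the field names are hypothetical):

# Hypothetical mapping of the raw log line above to a JSON entry.
import json

raw = "JOE Mar. 12, 2013 HTTP://domainname.com"
user, rest = raw.split(" ", 1)
date, url = rest.rsplit(" ", 1)

entry = json.dumps({"user": user, "date": date, "url": url})
print(entry)  # {"user": "JOE", "date": "Mar. 12, 2013", "url": "HTTP://domainname.com"}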
In one example of an implementation in Spark, the operations (0) to (3), up to the point where a shuffle is required, can be detected and offloaded to the messaging system (e.g., Kafka).
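The numbered Spark listing itself is not reproduced here. As an illustrative reconstruction only, assuming a PySpark batch job with hypothetical paths and field layout, the offloadable steps (0) to (3) might look like the following:

# Hypothetical PySpark sketch; steps (0)-(3) are narrow, per-record operations
# that run before any shuffle and are therefore candidates for offloading to
# the messaging or logging layer. Paths and field layout are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("etl-sketch").getOrCreate()
sc = spark.sparkContext

lines = sc.textFile("hdfs:///incoming/logs")                        # (0) read raw log lines
fields = lines.map(lambda l: l.split(" ", 1))                       # (1) split user from the rest
entries = fields.map(lambda f: {"user": f[0],                       # (2) build JSON-like entries
                                "payload": f[1] if len(f) > 1 else ""})
useful = entries.filter(lambda e: e["user"] != "")                  # (3) drop empty records

# The first shuffle appears only below; everything above could instead run on
# the messaging/collection layer (e.g., Kafka brokers or web servers).
counts = useful.map(lambda e: (e["user"], 1)).reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("hdfs:///output/user_counts")
spark.stop()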
Thus, a compiler can extract this computation and push it to the servers of the messaging system or even to the web servers.
The operations of method 800 may be executed by a compiler component, a data processing system, a machine, a server, a web appliance, a centralized system, a distributed node, or any system, which includes an in-line accelerator. The in-line accelerator may include hardware (circuitry, dedicated logic, etc.), software (such as is run on a general purpose computer system or a dedicated machine or a device), or a combination of both. In one embodiment, a compiler component performs the operations of method 800.
At operation 802, the method includes automatically detecting computations to be performed on a centralized resource for ingestion and enrichment operations or intelligent extractions. At operation 804, the method determines whether the detected computations to be performed on the centralized resource can be transferred to a distributed resource of a messaging system or data collection system. At operation 806, the method transfers the detected computations to be performed on the centralized resource to a distributed resource of a messaging system or data collection system when possible (e.g., when desired to reduce latency and network congestion). At operation 808, the method performs the detected computations that have been transferred to the distributed resource of the messaging system or data collection system. In one example, a distributed resource having an I/O processing unit of a machine (e.g., server) and an in-line accelerator is configured for a stage of operations to read the transferred data from the storage, to perform computations on the data, and to shuffle a result of the computations to generate a set of shuffled data to be transferred to a subsequent stage of operations.
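A hedged, purely illustrative sketch of operations 802-808 follows; the class and function names are hypothetical and do not correspond to any disclosed API.

# Hypothetical orchestration of method 800: detect candidate computations (802),
# check whether they can run on a distributed resource (804), transfer them (806),
# and execute them there (808). All names are illustrative.
from typing import Callable, List

class Computation:
    def __init__(self, name: str, needs_shuffle: bool,
                 run: Callable[[list], list] = lambda batch: batch):
        self.name = name
        self.needs_shuffle = needs_shuffle
        self.run = run

def detect_candidates(pipeline: List[Computation]) -> List[Computation]:
    """Operation 802: computations with no shuffle dependency are candidates."""
    return [c for c in pipeline if not c.needs_shuffle]

def can_transfer(comp: Computation, node: dict) -> bool:
    """Operation 804: a stand-in feasibility check for the distributed resource."""
    return node.get("has_accelerator", False) or node.get("has_cpu_headroom", False)

def transfer_and_run(pipeline, node, batch):
    """Operations 806/808: run transferable computations on the distributed node."""
    for comp in detect_candidates(pipeline):
        if can_transfer(comp, node):
            batch = comp.run(batch)
    return batch

pipeline = [Computation("filter", False, lambda b: [r for r in b if r]),
            Computation("aggregate", True)]
print(transfer_and_run(pipeline, {"has_accelerator": True}, ["a", "", "b"]))  # ['a', 'b']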
In another embodiment, the computations to be transferred can be transferred to a messaging system having in-line hardware for significantly higher performance without impacting any other resources in a system.
In an embodiment, in-line accelerator 911 is coupled to multiple I/O interfaces (not shown in the figure). In an embodiment, input data elements are received by I/O interface 912 and the corresponding output data elements generated as the result of the system computation are sent out by I/O interface 912. In an embodiment, I/O data elements are directly passed to/from in-line accelerator 911. In processing the input data elements, in an embodiment, in-line accelerator 911 may be required to transfer the control to general purpose instruction-based processor 920. In an alternative embodiment, in-line accelerator 911 completes execution without transferring the control to general purpose instruction-based processor 920. In an embodiment, in-line accelerator 911 has a master role and general purpose instruction-based processor 920 has a slave role.
In an embodiment, in-line accelerator 911 partially performs the computation associated with the input data elements and transfers the control to other accelerators or the main general purpose instruction-based processor in the system to complete the processing. The term “computation” as used herein may refer to any computer task processing including, but not limited to, any of arithmetic/logic operations, memory operations, I/O operations, and offloading part of the computation to other elements of the system such as general purpose instruction-based processors and accelerators. In-line accelerator 911 may transfer the control to general purpose instruction-based processor 920 to complete the computation. In an alternative embodiment, in-line accelerator 911 performs the computation completely and passes the output data elements to I/O interface 912. In another embodiment, in-line accelerator 911 does not perform any computation on the input data elements and only passes the data to general purpose instruction-based processor 920 for computation. In another embodiment, general purpose instruction-based processor 920 may have in-line accelerator 911 take control and complete the computation before sending the output data elements to the I/O interface 912.
In an embodiment, in-line accelerator 911 may be implemented using any device known to be used as an accelerator, including but not limited to a field-programmable gate array (FPGA), a coarse-grained reconfigurable architecture (CGRA), general-purpose computing on graphics processing units (GPGPU), many light-weight cores (MLWC), a network processor, an I/O processor, and an application-specific integrated circuit (ASIC). In an embodiment, I/O interface 912 may provide connectivity to other interfaces that may be used in networks, storages, cameras, or other user interface devices. I/O interface 912 may include receive first in first out (FIFO) storage 913 and transmit FIFO storage 914. FIFO storages 913 and 914 may be implemented using SRAM, flip-flops, latches, or any other suitable form of storage. The input packets are fed to the in-line accelerator through receive FIFO storage 913, and the generated packets are sent over the network by the in-line accelerator and/or general purpose instruction-based processor through transmit FIFO storage 914.
In an embodiment, I/O processing unit 910 may be a Network Interface Card (NIC). In an embodiment of the invention, in-line accelerator 911 is part of the NIC. In an embodiment, the NIC is on the same chip as general purpose instruction-based processor 920. In an alternative embodiment, the NIC 910 is on a separate chip coupled to general purpose instruction-based processor 920. In an embodiment, the NIC-based in-line accelerator receives an incoming packet, as input data elements, through I/O interface 912, processes the packet, and generates the response packet(s) without involving general purpose instruction-based processor 920. Only when in-line accelerator 911 cannot handle the input packet by itself is the packet transferred to general purpose instruction-based processor 920. In an embodiment, in-line accelerator 911 communicates with other I/O interfaces, for example, storage elements, through direct memory access (DMA) to retrieve data without involving general purpose instruction-based processor 920.
In-line accelerator 911 and the general purpose instruction-based processor 920 are coupled to shared memory 943 through private cache memories 941 and 942, respectively. In an embodiment, shared memory 943 is a coherent memory system. The coherent memory system may be implemented as a shared cache. In an embodiment, the coherent memory system is implemented using multiple caches with a coherency protocol in front of a higher capacity memory such as a DRAM.
Processing data by forming two paths of computation on in-line accelerators and general purpose instruction-based processors (or multiple paths of computation when there are multiple acceleration layers) has many other applications apart from low-level network applications. For example, most emerging big-data applications in data centers have been moving toward scale-out architectures, a technology for scaling processing power, memory capacity and bandwidth, as well as persistent storage capacity and bandwidth. These scale-out architectures are highly network-intensive and can therefore benefit from in-line acceleration. These applications, however, have a dynamic nature requiring frequent changes and modifications. Therefore, it is highly beneficial to automate the process of splitting an application into a fast-path that can be executed by an in-line accelerator and a slow-path that can be executed by a general purpose instruction-based processor, as disclosed herein.
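For illustration only, the following toy sketch mimics the fast-path/slow-path split in software; in practice the fast path would be realized by an in-line accelerator in hardware, and the packet fields and handler names here are hypothetical.

# Hypothetical fast-path/slow-path dispatch: the fast path (standing in for an
# in-line accelerator) handles the simple, common case; anything it cannot
# finish is "bailed out" to the slow path (the general purpose processor).
def fast_path(packet: dict):
    """Handle only the simple case; return None to signal a bailout."""
    if packet.get("kind") == "counter":
        return {"ack": packet["id"], "value": packet["value"] + 1}
    return None  # bailout: the fast path cannot finish this packet

def slow_path(packet: dict):
    """General purpose handling of everything the fast path rejected."""
    return {"ack": packet["id"], "note": f"handled in software: {packet.get('kind')}"}

def process(packet: dict):
    return fast_path(packet) or slow_path(packet)

print(process({"id": 1, "kind": "counter", "value": 41}))  # served by the fast path
print(process({"id": 2, "kind": "complex-query"}))         # bails out to the slow path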
While embodiments of the invention are shown as two accelerated and general-purpose layers throughout this document, it is appreciated by one skilled in the art that the invention can be implemented to include multiple layers of in-line computation with different levels of acceleration and generality. For example, an in-line FPGA accelerator can be backed by in-line many-core hardware. In an embodiment, the in-line many-core hardware can be backed by a general purpose instruction-based processor.
Data processing system 1202, as disclosed above, includes a general purpose instruction-based processor 1227 and an in-line accelerator 1226. The general purpose instruction-based processor may be one or more general purpose instruction-based processors or processing devices (e.g., microprocessor, central processing unit, or the like). More particularly, data processing system 1202 may be a complex instruction set computing (CISC) microprocessor, reduced instruction set computing (RISC) microprocessor, very long instruction word (VLIW) microprocessor, a general purpose instruction-based processor implementing other instruction sets, or general purpose instruction-based processors implementing a combination of instruction sets. The in-line accelerator may be one or more special-purpose processing devices such as an application specific integrated circuit (ASIC), a field programmable gate array (FPGA), a digital signal processor (DSP), a network processor, many light-weight cores (MLWC), or the like. Data processing system 1202 is configured to implement the data processing system for performing the operations and steps discussed herein.
The exemplary computer system 1200 includes a data processing system 1202, a main memory 1204 (e.g., read-only memory (ROM), flash memory, dynamic random access memory (DRAM) such as synchronous DRAM (SDRAM) or Rambus DRAM (RDRAM), etc.), a static memory 1206 (e.g., flash memory, static random access memory (SRAM), etc.), and a data storage device 1216 (e.g., a secondary memory unit in the form of a drive unit, which may include a fixed or removable computer-readable storage medium), which communicate with each other via a bus 1208. The storage units disclosed in computer system 1200 may be configured to implement the data storing mechanisms for performing the operations and steps discussed herein. Memory 1206 can store code and/or data for use by processor 1227 or in-line accelerator 1226. Memory 1206 includes a memory hierarchy that can be implemented using any combination of RAM (e.g., SRAM, DRAM, DDR RAM), ROM, flash, and magnetic and/or optical storage devices. Memory may also include a transmission medium for carrying information-bearing signals indicative of computer instructions or data (with or without a carrier wave upon which the signals are modulated).
Processor 1227 and in-line accelerator 1226 execute various software components stored in memory 1204 to perform various functions for system 1200. In one embodiment, the software components include operating system 1205a, compiler component 1205b having an auto transfer feature (e.g., 308, 408, 508, 608), and communication module (or set of instructions) 1205c. Furthermore, memory 1206 may store additional modules and data structures not described above.
Operating system 1205a includes various procedures, sets of instructions, software components and/or drivers for controlling and managing general system tasks and facilitates communication between various hardware and software components. A compiler is a computer program (or set of programs) that transforms source code written in a programming language into another computer language (e.g., target language, object code). A communication module 1205c provides communication with other devices utilizing the network interface device 1222 or RF transceiver 1224.
The computer system 1200 may further include a network interface device 1222. In an alternative embodiment, the data processing system disclosed herein is integrated into the network interface device 1222. The computer system 1200 also may include a video display unit 1210 (e.g., a liquid crystal display (LCD), LED, or a cathode ray tube (CRT)) connected to the computer system through a graphics port and graphics chipset, an input device 1212 (e.g., a keyboard, a mouse), a camera 1214, and a Graphic User Interface (GUI) device 1220 (e.g., a touch-screen with input & output functionality).
The computer system 1200 may further include an RF transceiver 1224 that provides frequency shifting, converting received RF signals to baseband and converting baseband transmit signals to RF. In some descriptions, a radio transceiver or RF transceiver may be understood to include other signal processing functionality such as modulation/demodulation, coding/decoding, interleaving/de-interleaving, spreading/despreading, inverse fast Fourier transforming (IFFT)/fast Fourier transforming (FFT), cyclic prefix appending/removal, and other signal processing functions.
The data storage device 1216 may include a machine-readable storage medium (or more specifically a computer-readable storage medium) on which is stored one or more sets of instructions embodying any one or more of the methodologies or functions described herein. The disclosed data storing mechanism may be implemented, completely or at least partially, within the main memory 1204 and/or within the data processing system 1202 of the computer system 1200, the main memory 1204 and the data processing system 1202 also constituting machine-readable storage media.
The computer-readable storage medium 1224 may also be used to store one or more sets of instructions embodying any one or more of the methodologies or functions described herein. While the computer-readable storage medium 1224 is shown in an exemplary embodiment to be a single medium, the term “computer-readable storage medium” should be taken to include a single medium or multiple media (e.g., a centralized or distributed database, and/or associated caches and servers) that store the one or more sets of instructions. The term “computer-readable storage medium” shall also be taken to include any medium that is capable of storing or encoding a set of instructions for execution by the machine and that causes the machine to perform any one or more of the methodologies of the present invention. The term “computer-readable storage medium” shall accordingly be taken to include, but not be limited to, solid-state memories, and optical and magnetic media.
The above description of illustrated implementations of the invention, including what is described in the Abstract, is not intended to be exhaustive or to limit the invention to the precise forms disclosed. While specific implementations of, and examples for, the invention are described herein for illustrative purposes, various equivalent modifications are possible within the scope of the invention, as those skilled in the relevant art will recognize.
These modifications may be made to the invention in light of the above detailed description. The terms used in the following claims should not be construed to limit the invention to the specific implementations disclosed in the specification and the claims. Rather, the scope of the invention is to be determined entirely by the following claims, which are to be construed in accordance with established doctrines of claim interpretation.
This application claims the benefit of U.S. Provisional Application No. 62/385,196, filed on Sep. 8, 2016, the entire contents of which are hereby incorporated by reference. This application is related to U.S. Non-Provisional application Ser. No. 15/215,374, filed on Jul. 20, 2016, the entire contents of which are hereby incorporated by reference. This application is related to U.S. Non-Provisional application Ser. No. ______, filed on Sep. 8, 2017, the entire contents of which are hereby incorporated by reference.