© 2022 Airbiquity Inc. A portion of the disclosure of this patent document contains material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure, as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever. 37 CFR § 1.71(d).
Embodiments of the present disclosure relate to the field of continuous data processing, and in particular, to methods and apparatuses associated with continuous data processing using modular pipelines.
Continuous data processing includes capturing data when it becomes available, filtering or modifying it, and then forwarding it on for further processing, all done one piece of data at a time. This is different from aggregate data processing, such as is done with databases or files, where the full data set is known before processing starts and where the data might be processed with multiple passes. In other words, while continuous data processing might be able to retain some values or portions of previous pieces of data while processing a current piece of data, it cannot refer to pieces of data that have not yet become available. Moreover, this processing is continuous, meaning there may be no effective ‘end’ to the data processing and the processing may continue for as long as data is available. Finally, the timing and amount of data to process may not be predictable, and the processing may occur either occasionally or often, processing one piece or many pieces of data at a time. These facts mean many existing tools and methods of processing large amounts of data cannot be applied to continuous data processing.
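By way of illustration only, the following minimal sketch (in C++, with invented names) shows the one-piece-at-a-time character of continuous processing: the processor may retain state derived from earlier data (here, a running mean) but can never consult data that has not yet arrived.

```cpp
#include <cstdio>

// Hypothetical continuous processor: its state summarizes past data only;
// there is no way to look ahead to data that has not yet arrived.
struct RunningMean {
    double sum = 0.0;
    long count = 0;
    // Called once per piece of data, as it becomes available.
    double consume(double sample) {
        sum += sample;
        ++count;
        return sum / count;  // derived from past and present data only
    }
};

int main() {
    RunningMean m;
    const double stream[] = {4.0, 8.0, 6.0};  // stands in for an unbounded source
    for (double s : stream)
        std::printf("sample=%.1f mean-so-far=%.2f\n", s, m.consume(s));
}
```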
The following is a summary of the invention in order to provide a basic understanding of some aspects of the invention. This summary is not intended to identify key/critical elements of the invention or to delineate the scope of the invention. Its sole purpose is to present some concepts of the invention in a simplified form as a prelude to the more detailed description that is presented later.
Some known methods of continuous data processing on devices with one or more digital processors may result in non-reusable and non-modular code with limited interconnectivity between devices. Various embodiments of continuous data processing with modularity may include processors, pipelines, and modules to provide a conceptual and runtime framework that enables modular and re-usable software development and cross-platform and highly interconnected implementations. Various embodiments may provide a simple conceptual framework for understanding and implementing highly complex data analytic systems with:
Additional aspects and advantages of this invention will be apparent from the following detailed description of preferred embodiments, which proceeds with reference to the accompanying drawings.
In the following detailed description, reference is made to the accompanying drawings that form a part hereof wherein like numerals designate like parts throughout, and in which is shown by way of illustration embodiments that may be practiced. It is to be understood that other embodiments may be utilized and structural or logical changes may be made without departing from the scope of the present disclosure. Therefore, the following detailed description is not to be taken in a limiting sense, and the scope of embodiments is defined by the appended claims and their equivalents.
Various operations may be described as multiple discrete actions or operations in turn, in a manner that is most helpful in understanding the claimed subject matter. However, the order of description should not be construed as to imply that these operations are necessarily order dependent. In particular, these operations may not be performed in the order of presentation. Operations described may be performed in a different order than the described embodiment. Various additional operations may be performed and/or described operations may be omitted in additional embodiments.
The terms “substantially,” “close,” “approximately,” “near,” and “about” generally refer to being within +/−10% of a target value. Unless otherwise specified, the use of the ordinal adjectives “first,” “second,” “third,” etc., to describe a common object merely indicates that different instances of like objects are being referred to, and is not intended to imply that the objects so described must be in a given sequence, either temporally, spatially, in ranking, or in any other manner.
For the purposes of the present disclosure, the phrases “A and/or B” and “A or B” mean (A), (B), or (A and B). For the purposes of the present disclosure, the phrase “A, B, and/or C” means (A), (B), (C), (A and B), (A and C), (B and C), or (A, B, and C).
The description may use the phrases “in an embodiment,” or “in embodiments,” which may each refer to one or more of the same or different embodiments. Furthermore, the terms “comprising,” “including,” “having,” and the like, as used with respect to embodiments of the present disclosure, are synonymous.
Continuous data processing may be performed as close to the source—the place where that data first becomes available for digital processing—as possible. There are a variety of advantages to this, including the likelihood that this digital processor may have limited external connectivity incapable of handling the raw data if sent at high rates. However, in many cases the digital processor at this source also has limited computing resources and must perform such processing without stalling the capture of data (resulting in data loss) or interfering with other processing. Therefore, it is preferable to perform continuous data processing using minimal computing resources, and without affecting whatever else the digital processing device is doing.
Software for this kind of continuous processing has been created as one-off implementations that may be specific to the digital processor and its environment, and that are monolithic rather than modular. At best, code reusability is achieved using libraries for heavy mathematical processing or system resources; but the structure of such applications is single-pathed along the lines of (1) read data, (2) process data, possibly using library functions, (3) if (2) results in output, send data, and (4) go to (1).
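A minimal sketch of this single-pathed structure follows; the helpers are hypothetical stand-ins, with stdin playing the role of the device and stdout the role of the uplink.

```cpp
#include <iostream>
#include <optional>
#include <string>

static std::optional<std::string> read_data() {  // (1) read one datum
    std::string line;
    if (!std::getline(std::cin, line)) return std::nullopt;
    return line;
}

static std::optional<std::string> process(const std::string& in) {  // (2)
    if (in.empty()) return std::nullopt;  // filtering step: drop empty data
    return "processed:" + in;
}

static void send_data(const std::string& out) {  // (3) forward the result
    std::cout << out << '\n';
}

int main() {
    while (auto in = read_data()) {  // (4) go to (1)
        if (auto out = process(*in)) send_data(*out);
    }
}
```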
Achieving a higher level of modularity requires a supporting framework of some kind and a scheme for creating and managing the data types, the processing modules, the inputs and outputs, and so on. For the specific purpose of continuous data processing there are few such framework implementations extant, none of which meets all of these criteria. In the specific case of performing continuous data processing on digital processors with limited resources there are even fewer.
Various embodiments described herein may include one or more of the following features:
One extant large-volume data processing model, called ‘MapReduce’ (https://en.wikipedia.org/wiki/MapReduce), may be applied to continuous data processing. MapReduce processes large amounts of data by ‘mapping’ a large data set to a single function that ‘reduces’ that data into an output. Note that MapReduce has a limitation similar to that of continuous data processing—the reduce function cannot access later data while processing current data.
MapReduce works by first limiting the entire data set to a smaller data set to be reduced; this is the ‘map procedure’, which filters and sorts the data. The data is then sent to the ‘reduce function’, which summarizes the data or otherwise uses it to create the output data. MapReduce is very effective at processing large amounts of data by breaking up the mapped data, distributing the ‘reducing operation’ across a number of computers, and processing that data in parallel. It also encourages the creation of very simple reduce functions, as opposed to large and complex implementations.
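A toy illustration of this shape follows (names and data are invented): the map procedure filters and sorts the data set, and a single reduce function summarizes the mapped data. A production implementation would shard the mapped data across machines and reduce the shards in parallel.

```cpp
#include <algorithm>
#include <cstdio>
#include <iterator>
#include <numeric>
#include <vector>

int main() {
    std::vector<int> data = {7, -3, 2, 9, -1, 4};

    // Map procedure: filter (keep non-negative values) and sort.
    std::vector<int> mapped;
    std::copy_if(data.begin(), data.end(), std::back_inserter(mapped),
                 [](int v) { return v >= 0; });
    std::sort(mapped.begin(), mapped.end());

    // Reduce function: summarize the mapped data (here, a simple sum).
    int sum = std::accumulate(mapped.begin(), mapped.end(), 0);
    std::printf("reduced sum = %d\n", sum);  // prints 22
}
```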
However, applying MapReduce on its own to continuous data processing does not provide modularity by itself; rather it is simply another way of creating monolithic applications not much different than the existing common practice. There is also the problem of a single reduce function either being too simplistic or overly complex for many use cases. And, in fact, most ‘Big Data’ MapReduce implementations performing anything but the simplest calculations are usually done using multiple passes; where the output data from one pass is then MapReduced again in another pass using a different map procedure and reduce function.
Another extant model of data processing is the ‘data pipeline’ (https://en.wikipedia.org/wiki/Pipeline_(computing)). A data pipeline may include multiple modules, where the data is read into one module and that module's output is read into the next module and so on.
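The following sketch, with invented modules, shows this strictly linear arrangement: each module reads only the output of the module directly before it.

```cpp
#include <cstdio>
#include <functional>
#include <vector>

using Module = std::function<double(double)>;

int main() {
    std::vector<Module> pipeline = {
        [](double v) { return v * 9.0 / 5.0 + 32.0; },  // module 1: Celsius to Fahrenheit
        [](double v) { return v < 0.0 ? 0.0 : v; },     // module 2: clamp negatives
        [](double v) { return v / 10.0; },              // module 3: rescale
    };
    double datum = 21.5;  // one piece of data from the source
    for (const auto& m : pipeline)
        datum = m(datum);  // strictly module-to-next-module
    std::printf("pipeline output: %.2f\n", datum);  // sink receives 7.07
}
```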
Combining these models together may result in the following data processing model:
However, this simple pipeline model is limited by (a) being entirely linear and (b) not providing enough ‘mapping’. Hewing too closely to this model results in highly specialized and tightly coupled (meaning, ‘less re-usable’) modules, and the easiest way to apply it is to create a single module that does everything, i.e., the monolithic approach. There are other issues as well, including the fact that one digital processor might have many data types available to process, something not suited to a single pipeline, and that creating a single way to do the filtering/mapping step might limit the use cases to which it can be applied.
Some embodiments described herein may apply the concept of data mapping to multiple places throughout the pipeline model, instead of only at the input site.
First, modules in a pipeline are not required to process the output from the module directly before them. Instead, the pipeline may map incoming data from the source to any module in the pipeline, not just the first module, and may map output from any module to the input of any other module further down the pipeline. Also, output from any module may be mapped out of the pipeline to a sink as well as to a subsequent module. This enables the creation of modules with more generalized implementations, entirely decoupled from other modules aside from the data types they input and output. And, in this model, pipelines represent a single arbitrarily complex unit of data analysis composed out of simple modules.
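A sketch of this routing idea follows, with hypothetical modules and routes: the source feeds more than one module (not just the first), one module's output skips over the next module entirely, and selected outputs are also mapped out of the pipeline to a sink.

```cpp
#include <cstdio>
#include <functional>
#include <map>
#include <set>
#include <vector>

int main() {
    std::vector<std::function<int(int)>> modules = {
        [](int v) { return v + 1; },  // module 0
        [](int v) { return v * 2; },  // module 1
        [](int v) { return v - 3; },  // module 2
    };
    // Producer -1 denotes the source; consumers must lie later in the pipeline.
    std::map<int, std::vector<int>> routes = {
        {-1, {0, 1}},  // source feeds modules 0 and 1, not just the first
        {0, {2}},      // module 0's output skips module 1 entirely
    };
    std::set<int> sink_taps = {1, 2};  // these outputs also leave the pipeline

    // Cascade one datum through the mapping.
    std::function<void(int, int)> emit = [&](int producer, int value) {
        if (producer >= 0 && sink_taps.count(producer))
            std::printf("sink <- module %d: %d\n", producer, value);
        auto it = routes.find(producer);
        if (it == routes.end()) return;
        for (int consumer : it->second)
            emit(consumer, modules[consumer](value));
    };
    emit(-1, 10);  // prints: sink <- module 2: 8, then sink <- module 1: 20
}
```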
Second, the data mapping to, within and out of a pipeline is not limited by data type: the pipeline inputs may be any data type available, as may the internal pipeline processing and outputs—so long as the module inputting the data is implemented to do so.
Third, incoming data may be mapped to more than one pipeline at a time, where each pipeline is doing a different kind of processing—possibly with different data types from different capabilities of the digital processor—and outgoing data may be mapped to multiple outputs for further processing. In other words, multiple process inputs may be mapped to multiple pipelines and the pipeline outputs may be mapped to multiple process outputs—all at the same time and without interfering with each other.
Various embodiments described herein may include a “Pipeline Data Processor,” which may contain:
This pipeline data processor may provide a basic framework for performing continuous data processing via arbitrary modules, where the modules implement a simple interface but may perform their function in any way appropriate to the data and the digital processor they are running on. Thus, a source module may get raw data from a CAN (controller area network) bus, a file or queue, a network connection, or any other capability of the digital processor. In the same way a sink module may send the processed data out in any number of ways and pipeline analytic modules may use any capability of the digital processor as needed to perform their function.
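The sketch below illustrates this separation with an invented interface: the framework drives any source through the same simple interface, while each concrete source obtains data however its platform allows (a queue here, and a stand-in for a CAN bus read).

```cpp
#include <cstdio>
#include <optional>
#include <queue>
#include <string>

struct Source {  // the simple, uniform module interface
    virtual std::optional<std::string> next() = 0;
    virtual ~Source() = default;
};

struct QueueSource : Source {  // e.g., fed by another local process
    std::queue<std::string> q;
    std::optional<std::string> next() override {
        if (q.empty()) return std::nullopt;
        std::string v = q.front();
        q.pop();
        return v;
    }
};

struct FakeCanSource : Source {  // stand-in for a real CAN bus capability
    int frames = 2;
    std::optional<std::string> next() override {
        if (frames-- <= 0) return std::nullopt;
        return "can-frame";  // a real module would read the bus here
    }
};

static void drive(Source& s) {  // the framework sees only the interface
    while (auto d = s.next()) std::printf("got: %s\n", d->c_str());
}

int main() {
    QueueSource qs;
    qs.q.push("queued-datum");
    FakeCanSource cs;
    drive(qs);
    drive(cs);
}
```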
Some embodiments described herein may feature easy creation of pipelines and pipeline data processors composed from a number of small and adaptable modules. Various embodiments described herein may provide a conceptual framework for understanding how modules operate and are composed into data processing systems; provide a runtime framework to host pipeline data processors composed from modules to implement these data processing systems; and provide a cross-platform library for selected system functionality module developers may use to create cross-platform implementations.
Various embodiments may host a pipeline data processor in a sandbox and provide it with a cross-platform library for accessing basic digital processor capabilities. Resource usage may be limited via the sandbox, and modules may be made cross-platform via the library. Various embodiments may provide limiting factors and encourage module re-usability while also providing a way to do things outside of those limitations if required by the use case. A module developer can implement to the framework and the cross-platform library, and the same code will run in various hardware and/or software environments.
Some embodiments may implement the pipeline data processor and/or the sandbox in ways appropriate to the digital processor and its capabilities, which means multi-threading and process control may be utilized, if provided (or implemented without those capabilities, if not). Resource requirements may be reduced on low resource systems and/or data processing may be throttled, if required.
Since the individual modules may be simple implementations and the mapping rules may be very simple to implement, the resource usage of a pipeline data processor consisting of pipelines and modules may be comparable to that of the same use case implemented as a difficult-to-maintain monolithic system.
Finally, on large interconnected systems with multiple digital processors the source module may provide the raw data (and may also perform minimal processing to reduce network usage on some devices), then send that partially processed data to one or more other devices in the system for further processing; spreading the processing load out like some ‘MapReduce’ implementations, but using a framework designed to provide significantly more control over how and where that processing occurs (e.g., in multiple devices on the same edge client, multiple co-operating edge clients, or a mix of edge clients and cloud servers).
In cases where a sink (e.g., sinks A and B), a source (e.g., sources A and B), or a pipeline analytic module (e.g., modules A-C) requires access to system resources not mediated by the pipeline data processor host 10, arbitrary external libraries 25-27 may be compiled into the pipeline data processor 15 to provide those capabilities.
Modules written to use only the cross-platform APIs 12 may run on various platforms (e.g., any system with enough resources for those modules). In various embodiments, the cross-platform APIs 12 may include signal APIs for capturing system data (such as CAN Bus signals), message bus APIs for sending and receiving data over the internal client network, or other host and system APIs to support the pipeline data processor(s) 15 and modules (which may include runtime logging or profiling functionality).
In various embodiments, a source module (e.g., source A or source B in this example) may use a map function to map incoming data of a continuous data stream to different reducers selected from different ones of the pipeline analytic modules. Referring briefly to
In various embodiments, map functions may be generalized implementations that are data-driven by a description of the mappings (e.g., pipeline description language code), or may be runtime-optimized by generating mapping-specific source code from a description of the mappings. In various embodiments, mapping may be by data type and/or by which module is outputting the data, not by key as in some other mapping approaches.
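A sketch of a generalized, data-driven map function follows; the one-line-per-route description syntax is invented for illustration (it is not the pipeline description language itself), and dispatch is keyed by data type rather than by key.

```cpp
#include <cstdio>
#include <map>
#include <sstream>
#include <string>

int main() {
    // Invented description syntax: "<data-type> -> <module>" per line.
    std::string description =
        "speed -> averager\n"
        "speed -> threshold\n"
        "gps -> geofence\n";

    // Parse the description into routes keyed by data type.
    std::multimap<std::string, std::string> routes;
    std::istringstream in(description);
    std::string type, arrow, module;
    while (in >> type >> arrow >> module)
        routes.emplace(type, module);

    // Map an incoming datum, by its type, to every subscribed module.
    auto dispatch = [&](const std::string& data_type, double value) {
        auto [lo, hi] = routes.equal_range(data_type);
        for (auto it = lo; it != hi; ++it)
            std::printf("%s(%g) -> %s\n", data_type.c_str(), value,
                        it->second.c_str());
    };
    dispatch("speed", 88.0);  // reaches both 'averager' and 'threshold'
    dispatch("gps", 47.6);    // reaches only 'geofence'
}
```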
Referring again to
In various embodiments, the one or more hardware processors may include a general purpose integrated circuit (e.g., a general purpose CPU) or an application specific integrated circuit (ASIC), or combinations thereof.
In one embodiment, the system 100 may be used for vehicle edge data processing. The system 100 may pre-process data generated by vehicle sensors on the “edge,” before the data is transmitted to the cloud over a communications network. In such an example, a source module may provide data originating in the vehicle (e.g., vehicle sensor data, in a raw or pre-processed state), and a sink module may provide the pipeline-processed data to the cloud. The pipeline processing may transform the vehicle sensor data to reduce transmission expenses, avoid data throttling, preserve data privacy, optimize distributed data analysis involving the cloud, or the like.
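One such transformation, sketched below with invented names and an illustrative 0.5 threshold, is a dead-band filter that forwards a sensor reading only when it changes meaningfully, reducing the volume a sink must transmit to the cloud.

```cpp
#include <cmath>
#include <cstdio>
#include <optional>

struct DeadBand {
    double threshold;
    std::optional<double> last;
    std::optional<double> consume(double v) {
        if (last && std::fabs(v - *last) < threshold)
            return std::nullopt;  // suppress a near-duplicate reading
        last = v;
        return v;  // forward a meaningful change
    }
};

int main() {
    DeadBand f{0.5, std::nullopt};
    const double raw[] = {20.0, 20.1, 20.2, 21.0, 21.1};  // raw sensor stream
    for (double v : raw)
        if (auto out = f.consume(v))
            std::printf("uplink <- %.1f\n", *out);  // only 20.0 and 21.0 survive
}
```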
One vehicle edge data processing implementation may include one or more pipeline data processors running on one or more first devices and a data agent (similar to
In vehicle edge data processing cases, the pipeline data processor 15 may be implemented using embedded vehicle hardware devices, such as an ECU (electronic control unit). Embedded vehicle hardware devices may include single purpose I/O hardware, which may utilize a hardware processor that includes both general purpose CPU(s) and ASICs.
The system 100 may be used in various applications and in the field of vehicle data processing or other fields. In some embodiments, the pipeline processing may perform dynamic distributed processing on server farms, in which extremely large volumes of data are processed in multiple ways through multiple steps. In some embodiments, the pipeline processing may perform advanced media (e.g., video, audio, 3D) processing distributed across multiple cores of a CPU or multiple devices for better throughput. In some embodiments, the pipeline processing may perform message-based operating system processing.
Pipeline data processors 215-217 may communicate with each other to enable distributed data processing on the client, or where data from separate devices is combined for an edge computing application. Pipeline data processors 215-217 may also send data to local external processes or to processes on another device D. In the illustrated example, a message bus subscriber of the device D may be associated with any upload process described herein, caching data and uploading it to cloud server(s).
Communication over the network 219 may be performed using a message bus based on publish/subscribe semantics. The message bus may be specific to the client, its internal networks, and the contained devices. Available message bus protocols include MQTT (MQ Telemetry Transport), SOME/IP (Scalable service-Oriented MiddlewarE over IP), and other protocols with similar semantics. To enable message bus data interchange, there may be source and sink modules that are configured to publish or subscribe to selected data messages using the appropriate message bus protocol.
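The sketch below illustrates the publish/subscribe semantics with an invented in-process bus standing in for a real MQTT or SOME/IP stack: a sink module publishes pipeline output to a topic, and a source module elsewhere subscribes to that topic.

```cpp
#include <cstdio>
#include <functional>
#include <map>
#include <string>

struct MessageBus {  // invented stand-in for an MQTT/SOME/IP client
    std::multimap<std::string, std::function<void(const std::string&)>> subs;
    void subscribe(const std::string& topic,
                   std::function<void(const std::string&)> cb) {
        subs.emplace(topic, std::move(cb));
    }
    void publish(const std::string& topic, const std::string& payload) {
        auto [lo, hi] = subs.equal_range(topic);
        for (auto it = lo; it != hi; ++it) it->second(payload);
    }
};

int main() {
    MessageBus bus;
    // Source module on one pipeline data processor: subscribes to messages.
    bus.subscribe("vehicle/speed/avg", [](const std::string& msg) {
        std::printf("source received: %s\n", msg.c_str());
    });
    // Sink module on another pipeline data processor: publishes its output.
    bus.publish("vehicle/speed/avg", "42.0");
}
```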
In this example, the pipeline data processors 215-217 are implemented on different hardware processors distributed over different devices (e.g., devices A-C) interconnected using external connectivity (a client internal network or some other message bus). In other examples, one or more pipeline data processors may be implemented on a hardware processor of a single device. For example, different pipeline analytic modules (e.g., reducers) of the same pipeline data processor may be implemented on different cores of that hardware processor (e.g., one module per core). In another example, the pipeline analytic modules of one pipeline data processor may be implemented on one core and the pipeline analytic modules of another pipeline data processor may be implemented on another core.
In various embodiments, the package 300 includes one or more pipeline data processors in pipeline description language (PDL) or some other descriptive form, along with build and deployment information. The package 300 may be a PDL package consumable by a build system to output all of the binaries and other assets required for installation on a client.
Each pipeline data processor may operate independently, and in some cases may be targeted to a separate computing device on the client (e.g., devices that communicate with each other over their external interfaces). However, data may be forwarded from a message bus sink module on one pipeline data processor to a message bus source module on another pipeline data processor, allowing for distributed data processing on the client.
A data agent application may be required on a device with an Internet connection, and may be pre-installed or installed with the pipeline data processor host/pipeline data processor. The data agent may be responsible for caching data until a connection to the cloud is available. In various embodiments, a singular system-wide data agent may also provide message brokering capabilities, if needed by a Message Bus (MBUS) protocol. The data agent may collect output data specified in the package from sink modules in the pipeline data processor and securely transmit it to cloud server(s) for retention and further processing.
In an unrestricted out-of-order group of modules (not shown), data may be provided by a source, may be processed by each module in a subgrouping following multiple paths (in contrast to a simple pipeline, where there may be only a single path), and may then exit the group at a sink. The multiple paths of the unrestricted out-of-order group of modules may include multiple inputs and outputs.
Although an unrestricted out-of-order group of modules may be more flexible than a simple pipeline, it may be a collection of modules connected together arbitrarily (with single entry and exit points), which may require a monolithic implementation. This may restrict code re-use and composability, or restrict the data types modules can operate on. Also, such a group of modules is an open-ended network graph (rather than a pipeline). In an open-ended network graph, modules act as nodes (e.g., vertices) and data paths are the connections (e.g., edges) between them, which may result in extreme and arbitrary complexity and require a monolithic implementation.
In a forward-only pipeline, data comes in via a source module and may be processed by each pipeline analytic module in the pipeline following multiple paths. These multiple paths are constrained to later modules in the pipeline or to the sink module (e.g., no backward chaining); the data may then exit the pipeline at a sink.
This style of pipeline provides more flexibility than the simple pipeline, while avoiding the complexity of an unrestricted out-of-order group of modules. Modules may be composed in any way appropriate to their expected data inputs and outputs, but the inputs are required to be output by another module earlier in the pipeline and the outputs must be consumed by another module later in the pipeline.
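The forward-only constraint is simple to check mechanically, as the following sketch shows (the position-pair representation of routes is an invented illustration): every connection must point from an earlier position to a strictly later one.

```cpp
#include <cstdio>
#include <utility>
#include <vector>

int main() {
    // Positions: 0 = source, 1..3 = analytic modules, 4 = sink.
    std::vector<std::pair<int, int>> routes = {
        {0, 1}, {1, 3}, {1, 4}, {3, 2},  // {3, 2} chains backward
    };
    for (auto [from, to] : routes) {
        bool ok = to > from;  // no backward (or self) chaining allowed
        std::printf("%d -> %d : %s\n", from, to, ok ? "ok" : "REJECTED");
    }
}
```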
In various embodiments, pipeline data flows may be controlled by the pipeline data processor, which may initiate the flow for each source by calling a heartbeat function. In this example, the source may use signal APIs to fetch signal data and then pass the data to the pipeline analytic module A, which passes the data, in turn, to pipeline analytic modules B and C.
Pipeline analytic module B may pass its output directly to the sink module, which may use message bus APIs to send the data out as a message. Pipeline analytic module C may send its output to pipeline analytic module D, which then may send its output to the sink to be sent as a message.
This entire processing cascade may occur from the heartbeat function of the source and within a single thread of execution. The pipeline data processor may drive data through all pipelines in a single thread, which may reduce implementation complexity. The modules of the pipeline data processor may themselves independently run separate threads, performing their own thread synchronization so that pipeline interactions still execute in the same thread as the pipeline data processor.
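The sketch below, with invented module names, shows this synchronous cascade: the pipeline data processor calls the source's heartbeat, and the entire flow (source to A, then B and C, then D, then the sink) completes on that one thread before the heartbeat returns.

```cpp
#include <cstdio>

static void sink(const char* from, int v) { std::printf("message from %s: %d\n", from, v); }
static void module_d(int v) { sink("D", v + 1); }
static void module_c(int v) { module_d(v * 3); }
static void module_b(int v) { sink("B", v * 2); }
static void module_a(int v) { module_b(v); module_c(v); }  // fans out to B and C

static void source_heartbeat() {
    int signal = 5;    // stand-in for a signal-API fetch
    module_a(signal);  // the entire cascade runs before this call returns
}

int main() {
    // The pipeline data processor drives all pipelines from a single thread.
    for (int beat = 0; beat < 2; ++beat)
        source_heartbeat();
}
```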
The multiple-pipeline data processor combines the constrained flexibility of forward-only pipelines with the ability to share data sources and sinks between them. Although it is possible to create highly complicated processors, the conceptual complexity is advantageously limited because each pipeline may operate as a separate unit of processing.
Multiple pipeline data processors of various embodiments described herein may correspond with a PDL model for pipelines, in which:
The data processing modules described herein may be combinable to make up a single unit of computing (e.g., a single unit of edge computing). Each data processing module may include standalone code, embeddable in software, for an individual purpose. The standalone code may take in data, put out data, and may have an API that the standalone code of the other data processing modules (for other individual purposes) can reference to consume data from it or to pass data to it.
Most of the equipment discussed above comprises hardware and associated software. For example, the typical continuous data processing system is likely to include one or more hardware processors and software executable on those hardware processors to carry out the operations described. We use the term software herein in its commonly understood sense to refer to programs or routines (subroutines, objects, plug-ins, etc.), as well as data, usable by a machine or hardware processor. As is well known, computer programs generally comprise instructions that are stored in machine-readable or computer-readable storage media. Some embodiments of the present invention may include executable programs or instructions that are stored in machine-readable or computer-readable storage media, such as a digital memory. We do not imply that a “computer” in the conventional sense is required in any particular embodiment. For example, various processors, embedded or otherwise, may be used in equipment such as the components described herein.
Memory for storing software again is well known. In some embodiments, memory associated with a given processor may be stored in the same physical device as the processor (“on-board” memory); for example, RAM or FLASH memory disposed within an integrated circuit microprocessor or the like. In other examples, the memory comprises an independent device, such as an external disk drive, storage array, or portable FLASH key fob. In such cases, the memory becomes “associated” with the digital processor when the two are operatively coupled together, or in communication with each other, for example by an I/O port, network connection, etc. such that the processor can read a file stored on the memory. Associated memory may be “read only” by design (ROM) or by virtue of permission settings, or not. Other examples include but are not limited to WORM, EPROM, EEPROM, FLASH, etc. Those technologies often are implemented in solid state semiconductor devices. Other memories may comprise moving parts, such as a conventional rotating disk drive. All such memories are “machine readable” or “computer-readable” and may be used to store executable instructions for implementing the functions described herein.
A “software product” refers to a memory device in which a series of executable instructions are stored in a machine-readable form so that a suitable machine or processor, with appropriate access to the software product, can execute the instructions to carry out a process implemented by the instructions. Software products are sometimes used to distribute software. Any type of machine-readable memory, including without limitation those summarized above, may be used to make a software product. That said, it is also known that software can be distributed via electronic transmission (“download”), in which case there typically will be a corresponding software product at the transmitting end of the transmission, or the receiving end, or both.
Having described and illustrated the principles of the invention in a preferred embodiment thereof, it should be apparent that the invention may be modified in arrangement and detail without departing from such principles. We claim all modifications and variations coming within the spirit and scope of the following claims.