Businesses and agencies often need or desire to perform complex analysis of large amounts of continuously changing data. The data may be centralized or distributed. Current data analysis systems are inefficient at performing sophisticated data analysis, such as machine learning and graph processing, under these conditions.
The detailed description refers to the following drawings, in which like numerals refer to like items, and in which:
Businesses and agencies may be faced with a need, or may have a desire, to analyze very large quantities of data using complex queries. Some systems are designed to query large databases. Other systems are designed to construct complex queries on databases. However, these systems are not capable of both scaling to query massive amounts of data and performing complex queries such as machine learning, graph processing or dynamic programming. In addition, the data to be queried may be dynamic, meaning the data changes frequently.
Disclosed herein are systems and methods that support complex queries of large, dynamic data sets. The systems and methods provide a distributed programming platform for continuously analyzing data. The continuous analytics aspect of the systems and methods, where applications constantly refine their analysis as new data arrives, is useful in many applications, such as user recommendation systems, link analysis, and financial modeling. In addition, unlike batch or single-point processing, the herein disclosed continuous analytics may use partial re-execution, low-latency turnaround, and transitive propagation of changes to dependent tasks. Thus, the herein disclosed systems and methods address at least three specific problems presented by current data: scale, complexity, and dynamics. The systems and methods allow writing complex (statistical, machine learning) queries on large (terabyte size and larger), continuously changing data sets, and allow those queries to be re-executed quickly across a set of dependent tasks. The systems can support incremental processing of data such that, when new data arrives, new results can generally be obtained without restarting the computation from scratch.
More specifically, the systems and methods provide efficient and fast access to large data sets, acquire data from these data sets, divide the data into abstractions referred to herein as distributed arrays, distribute the arrays and the processing tasks among a number of processing platforms, and update the data processing as new data arrives at the large data sets. In an example, the systems and methods extend currently available systems by using language primitives, as add-ons, for scalability, distributed parallelism and continuous analytics. In particular, the systems include the constructs darray and onchange to express those parts of data analysis that may be executed, or re-executed, when data changes. In an aspect, the systems ensure, even though the data is dynamic, that the processes “see” a consistent view of the data. For example, using the methods, if a data analysis process states y=f(x), then y is recomputed automatically whenever x changes. Such continuous analytics methods allow data updates to trigger automatic recalculation of only those parts of the process that transitively depend on the updated data.
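The recomputation behavior of y=f(x) described above may be pictured, as a non-limiting sketch only, with a simple observer pattern. The Cell class and its on_change and update methods below are hypothetical Python analogues used for explanation; they are not the disclosed darray and onchange constructs, which operate on distributed array partitions rather than single in-memory values.

    # Minimal Python analogue of change-driven recomputation: when x is
    # updated, every registered callback runs, so y = f(x) is refreshed
    # automatically and the change propagates to anything depending on y.
    class Cell:
        def __init__(self, value=None):
            self.value = value
            self._subscribers = []          # callbacks to run on change

        def on_change(self, callback):
            self._subscribers.append(callback)

        def update(self, new_value):
            self.value = new_value
            for callback in self._subscribers:
                callback(new_value)         # transitive propagation

    x = Cell(2)
    y = Cell()
    x.on_change(lambda v: y.update(v * v))  # declare y = f(x)
    x.update(3)                             # y.value becomes 9 without restarting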
As noted above, continuous analytics may be important to businesses and agencies, and many complex analytics are transformations on multi-dimensional arrays. For example, in an Internet product or service delivery system, user recommendations or ratings may play a vital marketing role, and product and service offers may be updated as new customer ratings are added to a ratings dataset. Many examples of such Internet-based systems exist, including Internet-based book stores, online movie delivery systems, hotel reservation services, and similar product and service systems. Other examples include online advertisers, who may sell advertisement opportunities through an auction system, and social network sites. All of these businesses or applications have three characteristics. First, they analyze large amounts of data—from ratings of millions of users to processing links for billions of Web pages. Second, they continuously refine their results by analyzing newly arriving data. Third, they implement complex processes—matrix decomposition, eigenvalue calculation, for example—on data that is incrementally appended or updated. For example, Web page ranking applications and anomaly detection applications calculate eigenvectors of large matrices, recommendation systems implement matrix decomposition, and genome sequencing and financial applications primarily involve array manipulation. Thus, the expression of large sets of data elements in arrays, and the subsequent analysis of the data elements based on these arrays, makes the complex analysis mentioned above not only feasible, but also efficient.
Continuous analytics implies that processing may be “always on”: results are calculated and refined with low latency. Continuous analytics imposes additional challenges compared to simply scaling analytics to a cluster and processing terabytes of data. First, only a few portions of the input data may change; hence only the affected parts of the process should be re-executed.
Current batch processing analytics systems cannot efficiently address such partial computations. Second, since the data is dynamic, it is difficult to express and enforce that distributed processes are run on a consistent view of the data. Finally, programming primitives that support continuous analytics should be able to do so without exposing low-level programming details like message passing.
In
The storage driver 120 communicates between the storage layer 100 and the worker layer 140, which includes workers 142, each of which in turn includes processing devices, communications interfaces, and computer readable mediums, and each of which stores and executes a continuous analytics program 144. The continuous analytics program 144 may include a subset of the programming of a larger continuous analytics program that is maintained in the program layer 200. The workers 142 may be distributed or centralized.
The storage driver 120 reads input data, handles incremental updates, and saves output data. The storage driver 120 may export an interface that allows programs and distributed arrays in the program layer 200, and hence the workers 142 and master 160, to register callbacks on data. Such callbacks notify the different components of the program when new data enters a data store 110 or existing data is modified during incremental processing.
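As a non-limiting illustration of registering callbacks on data, the sketch below uses hypothetical Python names (StorageDriver, register_callback, notify); it mirrors the notification behavior described above rather than the actual interface exported by the storage driver 120.

    # Hypothetical sketch: components register callbacks keyed by table name;
    # the driver notifies them when new data enters or existing data changes.
    from collections import defaultdict

    class StorageDriver:
        def __init__(self):
            self._callbacks = defaultdict(list)

        def register_callback(self, table, callback):
            self._callbacks[table].append(callback)

        def notify(self, table, change):
            for callback in self._callbacks[table]:
                callback(change)

    driver = StorageDriver()
    driver.register_callback("ratings",
                             lambda change: print("re-run tasks for", change))
    driver.notify("ratings", {"customer": 42, "score": 5})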
The storage driver 120 also provides for transactional-based changes to data stored in the data stores 110. For example, if a user recommendation file for a hotel chain is to be changed based on a new recommendation from a specific hotel customer, all of the data related to that customer's new recommendation is entered into the appropriate table in the appropriate data store 110. More specifically, if the new recommendation includes three distinct pieces of data, all three pieces of data are entered, or none of the three pieces of data is entered; i.e., the data changes occur atomically. The transactional basis for changing data is required due to the possibility that multiple sources may be writing to and modifying the same data file.
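The all-or-nothing behavior may be pictured with the following non-limiting sketch, in which a recommendation is staged against a copy of an in-memory table and made visible only if every piece is present; a production data store would rely on its own transaction and locking machinery, which this sketch does not attempt to reproduce.

    # Illustrative atomic update: either all pieces of the new recommendation
    # become visible together, or the table is left unchanged.
    def apply_recommendation(table, pieces):
        if any(value is None for value in pieces.values()):
            return table                    # incomplete: enter nothing
        staged = dict(table)                # stage the change on a copy
        staged.update(pieces)
        return staged                       # all pieces appear at once

    ratings = {}
    ratings = apply_recommendation(
        ratings, {"customer": 42, "hotel": "H1", "score": 5})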
The storage driver 120, as explained below, is notified when data in the storage layer 100 changes, through modification, addition, or subtraction, for example, and in turn notifies the master 160 or workers 142 of the changes.
The master 160 acts as the control thread for execution of program layer 200 programs. The master 160 distributes tasks to workers 142 and receives the results of the task execution from the workers 142. The master 160 and workers 142 form a logical unit. However, in an embodiment, the master 160 and the workers 142 may execute on different physical machines or servers. Thus, the master 160 executes a control and distribution program that distributes tasks associated with a continuous analytics program. The master 160 further receives inputs from the workers 142 when tasks are completed. Finally, the master 160 may re-distribute tasks among the workers 142.
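As a non-limiting sketch of this control-and-distribution role, the Python fragment below uses a process pool to stand in for the set of workers and a plain function for the master; the names master and worker_task are illustrative and do not correspond to any particular claimed component.

    # The master farms array partitions out to workers and collects results.
    from concurrent.futures import ProcessPoolExecutor

    def worker_task(partition):
        return sum(partition)               # each worker executes its task

    def master(partitions):
        with ProcessPoolExecutor() as workers:
            return list(workers.map(worker_task, partitions))

    if __name__ == "__main__":
        print(master([[1, 2], [3, 4], [5, 6]]))   # [3, 7, 11]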
The program layer 200 includes a basic analytics program 210 (see
Using update 226 not only triggers the corresponding onchange tasks but also binds the tasks to the data that the tasks should process. That is, the update construct 226 creates a version vector that succinctly describes the state of the array, including the versions of partitions that may be distributed across machines. This version vector is sent to all waiting tasks. Each task fetches the data corresponding to the version vector and, thus, executes on a programmer-defined, consistent view of the data.
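The role of the version vector may be illustrated by the following non-limiting sketch, in which each partition of a distributed array carries a version number and an update hands the same snapshot of those numbers to every waiting task; the class and method names are hypothetical.

    # One version number per partition; update() captures a version vector
    # and passes the same snapshot to all waiting tasks, so every task
    # executes on the same consistent view of the data.
    class DistributedArray:
        def __init__(self, num_partitions):
            self.versions = [0] * num_partitions
            self.waiting_tasks = []

        def update(self, partition_id):
            self.versions[partition_id] += 1
            snapshot = tuple(self.versions)      # the version vector
            for task in self.waiting_tasks:
                task(snapshot)

    ratings = DistributedArray(num_partitions=4)
    ratings.waiting_tasks.append(lambda vv: print("fetch data at versions", vv))
    ratings.update(partition_id=2)               # versions (0, 0, 1, 0)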
The runtime of the continuous analytics program 220 may create tasks on workers 142 for parallel execution. That is, multiple workers execute the same or different tasks on multiple array partitions. The continuous analytics program 220 includes foreach construct 228 to execute such tasks in parallel. The foreach construct 228 may invoke a barrier at the end of each task execution to ensure all other parallel tasks finish before additional or follow-on tasks are started. Thus, foreach construct 228 brings each of the parallel workers 142 to the same ending point with respect to the parallel tasks before any of the parallel workers 142 begins another task. Users can remove the barrier by setting an argument in the foreach construct 228 to false.
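The barrier semantics of the foreach construct 228 may be pictured with the non-limiting Python sketch below; the foreach function and its barrier argument are illustrative stand-ins for the behavior described above, not the disclosed construct itself.

    # All parallel tasks reach the same ending point before follow-on work
    # starts; passing barrier=False lets the caller skip that wait.
    from concurrent.futures import ThreadPoolExecutor, wait

    def foreach(indices, body, barrier=True):
        pool = ThreadPoolExecutor()
        futures = [pool.submit(body, i) for i in indices]
        if barrier:
            wait(futures)                   # wait for every parallel task
        pool.shutdown(wait=barrier)         # without the barrier, return early
        return futures

    futures = foreach(range(4), lambda i: i * i)
    print([f.result() for f in futures])    # [0, 1, 4, 9]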
In