Massive data sets are stored and analyzed by companies to retrieve valuable information. For example, Internet search logs, Internet content collected by crawlers, and click streams collected from Internet services result in large-scale data sets that need to be analyzed in order to retrieve their wealth of information. The information obtained from such large data sets can be used to detect changes in Internet user patterns and fraudulent activity, and to support service quality and novel Internet features.
As a result of the size of these data sets, traditional parallel database solutions can be prohibitively expensive. In an attempt to reduce the costs associated with such analysis, large-scale distributed storage and processing systems that are comprised of large clusters of commodity servers have been created. Because of the scale and parallelism of such distributed computing systems, it is challenging to design a programming model that efficiently and effectively utilizes the resources while achieving parallelism.
Embodiments of the present invention relate to systems, methods and computer storage media for providing Structured Computations Optimized for Parallel Execution (SCOPE) that facilitate analysis of a large-scale dataset. SCOPE includes, among other features, an extract command for extracting data bytes from a data stream and structuring the data bytes as data rows having strictly defined columns. The data rows support a range of data types, and SCOPE is not limited to handling only a few select data types. SCOPE also includes a process command that specifies data rows as an input. SCOPE also includes a reduce command that identifies data rows as an input as well as a reduce key on which the reduction is based. SCOPE additionally includes a combine command that identifies two data row sets that are to be combined based on an identified join condition. Additionally, SCOPE includes a select command that leverages the SQL and C# languages to create an expressive script that is capable of analyzing large-scale data sets in a parallel computing environment.
This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.
Illustrative embodiments of the present invention are described in detail below with reference to the attached drawing figures, which are incorporated by reference herein and wherein:
The subject matter of embodiments of the present invention is described with specificity herein to meet statutory requirements. However, the description itself is not intended to limit the scope of this patent. Rather, the inventors have contemplated that the claimed subject matter might also be embodied in other ways, to include different steps or combinations of steps similar to the ones described in this document, in conjunction with other present or future technologies.
Embodiments of the present invention relate to systems, methods and computer storage media for providing Structured Computations Optimized for Parallel Execution (SCOPE) that facilitate analysis of a large-scale dataset. SCOPE includes, among other features, an extract command for extracting data bytes from a data stream and structuring the data bytes as data rows having strictly defined columns. The data rows support a range of data types, and SCOPE is not limited to handling only a few select data types. SCOPE also includes a process command that specifies data rows as an input. SCOPE also includes a reduce command that identifies data rows as an input as well as a reduce key on which the reduction is based. SCOPE additionally includes a combine command that identifies two data row sets that are to be combined based on an identified join condition. Additionally, SCOPE includes a select command that leverages the SQL and C# languages to create an expressive script that is capable of analyzing large-scale data sets in a parallel computing environment.
Accordingly, in one aspect, the present invention provides computer-readable media comprising computer-executable instructions for providing Structured Computations Optimized for Parallel Execution (SCOPE) that facilitate analysis of a large-scale dataset. The steps include interpreting an extract scripting command, the extract scripting command specifying an extract data source from which an extractor, identified in the extract scripting command, extracts one or more extract data rows, to generate computer-executable instructions for applying the extractor to the data source to generate the one or more extract data rows. The steps also include interpreting a process scripting command, the process scripting command specifying one or more process input data rows from which a processor, identified in the process scripting command, generates one or more process output data rows, to generate computer-executable instructions for applying the processor to the one or more process input data rows to generate the one or more process output data rows. The steps further include interpreting a reduce scripting command, the reduce scripting command specifying one or more reduce input data rows from which a reducer, identified in the reduce scripting command, generates one or more reduce output data rows, to generate computer-executable instructions for applying the reducer to the one or more reduce input data rows to generate the one or more reduce output data rows, wherein the computer-executable instructions for applying the reducer to the one or more reduce input data rows guarantee that all rows in the one or more reduce input data rows that match a reduce key identified in the reduce scripting command are processed by a single call to the reducer.
Additionally, the steps include interpreting a combine scripting command, the combine scripting command specifying a join condition, one or more first combine input data rows and one or more second combine input data rows from which a combiner, identified in the combine scripting command, generates one or more combine output data rows, to generate computer-executable instructions for applying the combiner to the first and the second combine input data rows in view of the join condition to generate the one or more combine output data rows.
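By way of illustration only, the reduce guarantee described in this aspect (every row that matches a given reduce key is handled by a single call to the reducer) might be modeled as in the following Python sketch. The names `apply_reducer`, `count_per_key`, and the dictionary-based row representation are hypothetical and are not part of SCOPE; an actual implementation would partition and sort rows across many machines rather than group them in memory.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, Iterable, Iterator, List

Row = Dict[str, object]  # hypothetical row representation: column name -> value


def apply_reducer(
    rows: Iterable[Row],
    key_columns: List[str],
    reducer: Callable[[tuple, List[Row]], Iterable[Row]],
) -> Iterator[Row]:
    """Group rows by the reduce key, then call the reducer once per key."""
    groups: Dict[Hashable, List[Row]] = defaultdict(list)
    for row in rows:
        key = tuple(row[c] for c in key_columns)
        groups[key].append(row)
    for key, group in groups.items():
        # Exactly one reducer call sees every row that shares this key.
        yield from reducer(key, group)


def count_per_key(key: tuple, group: List[Row]) -> Iterable[Row]:
    """Example reducer: emit one output row holding the group size."""
    yield {"query": key[0], "count": len(group)}


log = [
    {"query": "scope", "latency_ms": 12},
    {"query": "dryad", "latency_ms": 30},
    {"query": "scope", "latency_ms": 8},
]
result = list(apply_reducer(log, ["query"], count_per_key))
```

In this sketch, `result` holds one output row per distinct value of the `query` column, because the reducer is invoked once per key.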
In another aspect, the present invention provides a method for providing Structured Computations Optimized for Parallel Execution (SCOPE) that facilitate analysis of a large-scale dataset. The method includes receiving, at a SCOPE computing cluster, a SCOPE script that includes one or more scripting commands that identify one or more input data rows, wherein the SCOPE script includes at least one extract scripting command for extracting the one or more input data rows from at least a portion of the large-scale dataset, further wherein the at least a portion of the large-scale dataset is identified in the SCOPE script. The method also includes compiling the SCOPE script at the SCOPE computing cluster to generate an execution plan. The method further includes storing the execution plan on a computer-readable storage medium. The method also includes generating a computational graph describing the execution at the SCOPE computing cluster. The method additionally includes storing the computational graph on a computer-readable storage medium. The method also includes executing the execution plan at the SCOPE computing cluster to provide a SCOPE plan that facilitates analysis of a large-scale dataset, wherein the execution of the execution plan includes extracting the one or more input data rows from a plurality of file extents distributed across a plurality of computing devices associated with the SCOPE computing cluster, wherein the plurality of file extents are associated with one or more data streams.
A third aspect of the present invention provides computer-readable media comprising computer-executable instructions for interpreting a Structured Computations Optimized for Parallel Execution (SCOPE) script that facilitates analysis of a large-scale dataset. The steps include interpreting the SCOPE script with reference to a library of computer-executable commands for generating a program that can be executed across a plurality of processors. The library includes an extract command for extracting one or more data rows from a data source. The library also includes a process command for taking one or more process input data rows and producing a plurality of process output data rows. The library additionally includes a reduce command for taking one or more reduce input data rows and producing a plurality of reduce output data rows, wherein all of the one or more reduce input data rows associated with a reduce key identified in the SCOPE script are processed in a single call. The library also includes a combine command for combining two sets of row data that share a set of combine keys identified in the SCOPE script. Additionally, the library includes a select command for manipulating row data of one or more identified data sources as row data, wherein the one or more identified data sources include data streams, row data, files, and databases, further wherein the manipulation of the row data includes at least one of the following: transforming the row data, adding a column to the row data, removing a column from the row data, filtering the row data, grouping the row data, aggregating the row data, and joining the row data. The structure of the computer-executable commands in the library is compatible with SQL and C# syntax. The steps also include calling a compiler to compile the SCOPE script, wherein an execution plan results from compiling the SCOPE script. The steps additionally include storing the execution plan on one or more computer-readable media.
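As a minimal sketch of the combine behavior described in this aspect, the following Python example pairs two row sets that share a set of combine keys. It assumes an in-memory hash join on key equality, which is only one possible strategy; the function name `combine`, the dictionary-based rows, and the sample column names are hypothetical and do not reflect SCOPE syntax.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List

Row = Dict[str, object]  # hypothetical row representation: column name -> value


def combine(left: Iterable[Row], right: Iterable[Row], keys: List[str]) -> Iterator[Row]:
    """Yield one merged row for every left/right pair that agrees on all keys."""
    index: Dict[tuple, List[Row]] = defaultdict(list)
    for r in right:
        index[tuple(r[k] for k in keys)].append(r)
    for l in left:
        for r in index.get(tuple(l[k] for k in keys), []):
            merged = dict(r)
            merged.update(l)  # left columns win if both sides share a non-key name
            yield merged


users = [
    {"user_id": 1, "country": "US"},
    {"user_id": 2, "country": "DE"},
]
clicks = [
    {"user_id": 1, "url": "a"},
    {"user_id": 1, "url": "b"},
    {"user_id": 3, "url": "c"},
]
joined = list(combine(clicks, users, ["user_id"]))
```

Rows whose key has no partner in the other row set (user 2 and the click by user 3 here) produce no output in this sketch, which corresponds to inner-join behavior.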
Having briefly described an overview of embodiments of the present invention, an exemplary operating environment suitable for implementing embodiments hereof is described below.
Referring to the drawings in general, and initially to
Embodiments may be described in the general context of computer code or machine-useable instructions, including computer-executable instructions such as program modules, being executed by a computer or other machine, such as a personal data assistant or other handheld device. Generally, program modules including routines, programs, objects, modules, data structures, and the like, refer to code that performs particular tasks or implements particular abstract data types. Embodiments may be practiced in a variety of system configurations, including hand-held devices, consumer electronics, general-purpose computers, specialty computing devices, etc. Embodiments may also be practiced in distributed computing environments where tasks are performed by remote-processing devices that are linked through a communications network.
With continued reference to
Computing device 100 typically includes a variety of computer-readable media. By way of example, and not limitation, computer-readable media may comprise Random Access Memory (RAM); Read Only Memory (ROM); Electronically Erasable Programmable Read Only Memory (EEPROM); flash memory or other memory technologies; CDROM, digital versatile disks (DVD) or other optical or holographic media; magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, carrier waves or any other medium that can be used to encode desired information and be accessed by computing device 100.
Memory 112 includes computer-storage media in the form of volatile and/or nonvolatile memory. The memory may be removable, non-removable, or a combination thereof. Exemplary hardware devices include solid-state memory, hard drives, optical-disc drives, etc. Computing device 100 includes one or more processors that read data from various entities such as memory 112 or I/O modules 120. Presentation module(s) 116 present data indications to a user or other device. Exemplary presentation modules include a display device, speaker, printing module, vibrating module, and the like. I/O ports 118 allow computing device 100 to be logically coupled to other devices including I/O modules 120, some of which may be built in. Illustrative modules include a microphone, joystick, game pad, satellite dish, scanner, printer, wireless device, and the like.
With reference to
Among other components not shown, the system 200 may include a client 204, servers 206, 208, and 210, a SCOPE computing cluster 212, and a data store 240. Each of the components shown in
The system 200 is suited to implement SCOPE scripts created by users of the client 204. For example, a user, such as a developer, system administrator, or programmer that is responsible for preparing a script that is able to perform large-scale data analysis, utilizes the client 204 to create a SCOPE script that is to be compiled and executed by the SCOPE computing cluster (e.g., SCOPE computing cluster 212) to analyze data stored in one or more data stores (e.g., the data store 240) by servers (e.g., servers 206, 208, and 210). In an exemplary embodiment, the client 204 is a computing device such as the computing device 100 discussed with respect to
The servers 206-210 are computing devices utilized in implementing embodiments of the present invention. In an exemplary embodiment, the servers 206-210 each include one or more processors that are utilized in a configuration conducive for parallel computing. Expanding on this embodiment, the servers 206-210 are arranged in clusters as part of a distributed computing environment. Additionally, as will be discussed later, the servers 206-210 are incorporated in the SCOPE computing cluster 212 in an exemplary embodiment. Therefore, in the previous exemplary embodiment, reference to the SCOPE computing cluster 212 includes a reference to the servers 206-210. It is understood that servers 206-210 are merely representative, and not limiting as to the scope of the present application. For example, while servers 206-210 represent three distinct servers, in reality hundreds or thousands of individual servers, racks, clusters, and pools may be implemented to effectively perform embodiments of the present invention.
In an additional exemplary embodiment, the servers 206-210 are commodity-type servers that are typically less expensive than specialized or custom computing devices. Additionally, servers 206-210 typically include one or more disk stores (e.g., computer-readable media) that are directly attached to each of the servers 206-210. Further, the disk stores of a distributed computing environment are distributed among a plurality of the servers associated with the distributed computing environment. For example, a large-scale data set that is identified in a SCOPE script may be distributed among the servers 206-210, such that the SCOPE script, when executed, utilizes the servers 206-210 in parallel to analyze the data stored among the servers 206-210.
The SCOPE computing cluster 212 is a computing cluster for implementing aspects of a SCOPE script. In an exemplary embodiment, the SCOPE computing cluster 212 includes a library of computer-executable commands 214, a receiving component 216, an export component 218, and a job manager 220. The job manager 220 includes a computational graph generator 222, a runtime component 224, and a compiler component 225. The computational graph generator 222 includes a plurality of operations that may be utilized when generating a computational graph. The operations include, but are not limited to, a filter 226 operation, a read 228 operation, a join 230 operation, a partition 232 operation, an aggregate 234 operation, a cross 236 operation, and an output 238 operation.
The receiving component 216 functions to receive a SCOPE script. For example, the receiving component 216, in an exemplary embodiment, receives a SCOPE script, by way of the network 202, from the client 204. The export component 218 functions to communicate information from the SCOPE computing cluster 212, the job manager 220, or other components related to the processing of a SCOPE script. For example, the export component 218 may communicate, to the client 204, information resulting from the compilation of a SCOPE script. Additionally, the export component 218 may communicate results, data, or other outputs that result from a SCOPE script. In an additional exemplary embodiment, the receiving component 216 and the export component 218 function as means for transmitting, broadcasting, and/or communicating data, information, and/or indications to and from the SCOPE computing cluster 212.
The library of computer-executable commands 214 (library 214) includes one or more computer-executable commands that are utilized when interpreting a SCOPE script. In an exemplary embodiment, the commands of the library 214 are utilized, in part, to generate a program that can be executed across a plurality of processors, for example, in a distributed computing environment that is functioning in a parallel configuration.
The commands included within the library 214 include, but are not limited to, an extract command, a process command, a reduce command, a combine command, an output command, and a select command. The commands will be discussed in greater detail at a later point. Additionally, the library 214 in an exemplary embodiment includes commands that facilitate the interpretation of custom commands included in a SCOPE script. In yet an additional exemplary embodiment, the library 214 includes commands that facilitate importing outputs from one or more scripts. For example, an import script extends view functionality across scripts, which allows for the output of a script to be utilized by additional scripts that include an import command.
The compiler component 225 compiles a SCOPE script. In an exemplary embodiment, the compiler component 225 parses a SCOPE script, checks the syntax of the script, and resolves names within the script. Additionally, the compiler component 225, in an embodiment, tracks column definitions and the renaming of columns and data of row data sets. Further, for commands included in a SCOPE script that is compiled by the compiler component 225, the compiler component 225 checks that the columns are properly defined by the identified inputs. In an example, the compilation results in an internal parse tree, which can be translated directly into a physical execution plan. In this example, the physical execution plan results from utilizing default plans for each command in the SCOPE script.
In an exemplary embodiment, a physical execution plan is a specification of a distributed computing environment job. For example, such a plan may be utilized by COSMOS, a product of the Microsoft Corporation of Redmond, Wash. The physical execution plan may be represented by a computational graph. In an exemplary embodiment, the computational graph is a Directed Acyclic Graph (DAG). A DAG can describe a data flow with each vertex representing a program and each edge representing a data channel. In this example, a vertex program is a serial program composed from SCOPE runtime physical operators. These runtime physical operators may in turn call user-defined functions. In an exemplary embodiment, the operators within a vertex program are executed in a pipelined fashion, similar to a query execution in a traditional database system.
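The pipelined execution of operators within a single vertex program can be illustrated, under simplifying assumptions, with Python generators: each operator pulls rows from the operator before it, so a row flows through the whole chain before the next row is read. The operator names (`filter_op`, `project_op`), the `vertex_program` helper, and the sample data are hypothetical and are not SCOPE runtime operators.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Row = Dict[str, object]  # hypothetical row representation
Operator = Callable[[Iterable[Row]], Iterator[Row]]


def filter_op(predicate: Callable[[Row], bool]) -> Operator:
    """Physical operator that keeps rows satisfying a user-defined predicate."""
    def run(rows):
        for row in rows:
            if predicate(row):
                yield row
    return run


def project_op(columns: List[str]) -> Operator:
    """Physical operator that keeps only the named columns."""
    def run(rows):
        for row in rows:
            yield {c: row[c] for c in columns}
    return run


def vertex_program(operators: List[Operator]) -> Operator:
    """Compose operators into one serial program that runs as a pipeline."""
    def run(rows):
        stream = iter(rows)
        for op in operators:
            stream = op(stream)
        return stream
    return run


def read_log(trace: List[int]) -> Iterator[Row]:
    """Input channel that records which rows have been read so far."""
    for i in range(5):
        trace.append(i)
        yield {"id": i, "ok": i % 2 == 0}


program = vertex_program([filter_op(lambda r: r["ok"]), project_op(["id"])])
reads: List[int] = []
output = program(read_log(reads))
first = next(output)               # pulls only as much input as is needed
reads_after_first = list(reads)    # snapshot of the input consumed so far
rest = list(output)                # drains the remaining rows
```

Because the operators are chained lazily, the first output row is available after a single input row has been read, rather than after the whole input has been filtered.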
The job manager 220, in an exemplary embodiment, constructs a specified graph and schedules the execution of a SCOPE script. In this example, a vertex becomes runnable when the identified inputs are available. The execution environment monitors the state of the vertices and channels. The execution environment in this example also schedules runnable vertices for execution, determines where to run a vertex, establishes the resources required to run a vertex, and initiates the vertex program.
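The rule that a vertex becomes runnable once its identified inputs are available can be sketched as a simple scheduling loop. The following Python example is an assumption-laden model, not the job manager's actual algorithm: it ignores placement and resource decisions, and the `schedule` function and vertex names are hypothetical.

```python
from typing import Dict, List, Set


def schedule(inputs_of: Dict[str, List[str]]) -> List[List[str]]:
    """Return successive waves of runnable vertices for a DAG.

    inputs_of maps each vertex to the vertices whose output it consumes.
    """
    done: Set[str] = set()
    pending: Set[str] = set(inputs_of)
    waves: List[List[str]] = []
    while pending:
        # A vertex is runnable when every one of its inputs has completed.
        runnable = sorted(
            v for v in pending if all(i in done for i in inputs_of[v])
        )
        if not runnable:
            raise ValueError("graph has a cycle or an unknown input")
        waves.append(runnable)
        done.update(runnable)
        pending.difference_update(runnable)
    return waves


graph = {
    "extract_a": [],
    "extract_b": [],
    "filter_a": ["extract_a"],
    "join": ["filter_a", "extract_b"],
    "output": ["join"],
}
waves = schedule(graph)
```

Vertices that appear in the same wave have no dependency on one another, so in this model they could be dispatched to different machines at the same time.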
Continuing with this example, the translation into an execution plan from a SCOPE script is performed by traversing the parse tree from the bottom up. For each operator, SCOPE has a default implementation rule. For example, implementation of a simple filtering operation is a vertex program using a built-in physical operation of SCOPE, the filter operation. The filter operation is followed with a function that implements the filtering predicate.
In an exemplary translation, the compiler component 225 combines adjacent vertices with physical operators that can be easily pipelined into super vertices. There are four possible relationships between any two adjacent vertices: 1:1, 1:n, n:1, and n:m. One of the heuristics that SCOPE utilizes, in an embodiment, is to combine two vertices with a 1:1 relationship. For example, if a “filter” is followed by a “sort,” SCOPE is able to combine the two operators into a single super vertex, which executes as a “filter”+“sort” in a pipelined fashion. Additionally, a vertex with an n:1 relationship can be pipelined into a 1:1 relationship or a 1:n relationship. Similarly, a 1:1 vertex may be pipelined into a 1:1 relationship or a 1:n relationship.
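A minimal sketch of the 1:1 heuristic follows, assuming a linear chain of vertices in which each edge is labeled with its relationship. Only the 1:1 case is modeled; the `fuse_chain` function and the vertex names are hypothetical, and the other pipelining rules mentioned above are not covered.

```python
from typing import List


def fuse_chain(vertices: List[str], edges: List[str]) -> List[List[str]]:
    """Group a chain of vertices into super vertices.

    edges[i] is the relationship ("1:1", "1:n", "n:1" or "n:m") between
    vertices[i] and vertices[i + 1].
    """
    if len(edges) != max(len(vertices) - 1, 0):
        raise ValueError("expected one edge between each pair of adjacent vertices")
    if not vertices:
        return []
    supers: List[List[str]] = [[vertices[0]]]
    for vertex, relationship in zip(vertices[1:], edges):
        if relationship == "1:1":
            supers[-1].append(vertex)   # pipeline into the current super vertex
        else:
            supers.append([vertex])     # data is repartitioned: start a new one
    return supers


plan = fuse_chain(
    ["extract", "filter", "sort", "aggregate", "output"],
    ["1:1", "1:1", "n:1", "1:1"],
)
```

Here the "filter" and "sort" vertices end up in the same super vertex as "extract", matching the "filter"+"sort" example, while the n:1 edge into "aggregate" starts a new super vertex.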
Returning to
In an exemplary embodiment, the computational graph generator 222 generates a computational graph that can also be displayed by a display device. For example, a computational graph that results from a SCOPE script is then communicated to the client 204 where a visual representation of the computational graph is then displayed on a display device of the client 204. In yet an additional exemplary embodiment, the computational graph is stored on a computer-readable medium, such as the data store 240.
The runtime component 224 is a runtime system that provides services for a running program but is not part of the operating system. In an exemplary embodiment, the computational graph generated by the computational graph generator 222 is provided to the runtime component 224. In an exemplary, but not limiting, embodiment, the runtime component 224 utilizes a Dryad system available from the Microsoft Corporation of Redmond, Wash.
Dryad is a general-purpose runtime for execution of parallel data applications. Typically, an application written for Dryad is modeled as a DAG, but as previously discussed, the program may be modeled as a generic computational graph. The DAG, in this example, defines the dataflow of the application, and the vertices of the graph define the operations that are to be performed on the data. Computational vertices are written using sequential constructs, devoid of any concurrency or mutual exclusion semantics. The Dryad runtime, in this embodiment, parallelizes the dataflow graph by distributing the computational vertices across various execution engines, which can be multiple processor cores on the same computer or different physical computers connected by a network, as in a cluster. Scheduling of the computational vertices on the available hardware, in an exemplary embodiment, is handled by the Dryad runtime, without any explicit intervention by a developer of the application or an administrator of the network. The flow of data from one computational vertex to another is implemented by using communication “channels” between the vertices, which in a physical implementation are realized by TCP/IP streams, shared memory, or temporary files in an exemplary embodiment.
In additional embodiments, the runtime component 224 is utilized to interpret intermediate code compiled from a development environment, such as a SCOPE script. In this example, the SCOPE script requires the runtime component 224 in order to be executed. It is understood that the requirement of a runtime component in the previous embodiment is merely exemplary and not limiting as to the scope of the present application. Intermediate code is typically code that is a result of compilation, but not executable at the machine level. Therefore, the runtime component acts as a service process that provides the framework for execution of the intermediate code and then provides the structure for it to run on an operating system.
As previously mentioned, the SCOPE computing cluster 212, in exemplary embodiments, includes a plurality of servers, such as servers 206-210, to facilitate the execution of a SCOPE script. For example, the SCOPE computing cluster 212, in an exemplary embodiment, receives a SCOPE script from the client 204 at the receiving component 216. The SCOPE script is then compiled by the compiler component 225, which utilizes commands found in the library of computer-executable commands 214 to develop, in part, an execution plan. Additionally, a computational graph is generated by the computational graph generator 222, which is then provided as an input to the runtime component 224 to execute the execution plan. Continuing with this exemplary embodiment, the execution plan is executed in parallel among a plurality of servers of the SCOPE computing cluster 212, including servers 206-210. Further, the data stream that is analyzed as a result of the execution plan is pulled from a plurality of servers, such as servers 206-210, and a data store, such as the data store 240. It is understood that the data streams, databases, and data files that include the data that is to be analyzed by the SCOPE script may be stored in a distributed environment with multiple extents spread across the plurality of servers and data stores. Additionally, the extents may be stored in a redundant and reliable manner that is maintained by a distributed computing system, such as COSMOS, previously discussed.
The various scripting commands to be discussed later are interpreted by one or more computing devices, such as the SCOPE computing cluster 212, the client 204, and the servers 206-210. In an exemplary embodiment, the interpretation of scripting commands located in a SCOPE script is done by the compiler component 225. In yet another exemplary embodiment, the interpretation of the scripting command in a SCOPE script is done by the runtime component 224 as a result of interpreting intermediate code. Therefore, the interpretation of a scripting command, which facilitates manipulating data identified in the scripting command, is performed by one or more components of
Accordingly, any number of components may be employed to achieve the desired functionality within the scope of embodiments of the present invention. Although the various components of
SCOPE, in an embodiment, is a declarative and extensible scripting language to be utilized for analyzing large-scale data sets. SCOPE allows for ease of use without requiring any explicit parallelism, while being amenable to efficient parallel execution on large clusters of computing devices. In several embodiments, SCOPE utilizes features compatible with the SQL language, while allowing expressions compatible with the C# language. The ability to incorporate C#-compatible expressions in a script for analysis of large-scale data sets allows existing C# expression libraries and custom C# expressions to compute functions and scalar values or to manipulate whole row sets.
In an exemplary embodiment, a SCOPE script consists of a sequence of commands. Traditionally, but not always, commands are data transformation operators that take one or more data rows (row set) as input, perform an operation on the data, and then output a data row set. In yet an additional exemplary embodiment, a command utilizes the output row set of a previous command as an input. However, SCOPE commands can also take named inputs and a user can name an output of a command using, among other options, an assignment. In an exemplary embodiment, the output of a command may be utilized one or more times by subsequent commands.
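The command-chaining behavior described above (a command consumes the previous command's output by default, may instead take named inputs, and may name its own output for reuse) can be modeled roughly as follows. This Python sketch is not SCOPE syntax; the `Script` class, its `run` method, and the sample commands are hypothetical names chosen for illustration.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]
RowSet = List[Row]


class Script:
    """Runs commands in sequence, tracking named and most recent row sets."""

    def __init__(self) -> None:
        self.named: Dict[str, RowSet] = {}
        self.last: RowSet = []

    def run(self, command: Callable[..., RowSet], *inputs: str, name: str = "") -> "Script":
        # With no named inputs, the command consumes the previous output.
        args = [self.named[i] for i in inputs] if inputs else [self.last]
        self.last = command(*args)
        if name:
            self.named[name] = self.last  # assignment: the output can be reused
        return self


def load(_previous: RowSet) -> RowSet:
    return [{"q": "a", "n": 3}, {"q": "b", "n": 1}, {"q": "c", "n": 5}]


def popular(rows: RowSet) -> RowSet:
    return [r for r in rows if r["n"] >= 3]


def names(rows: RowSet) -> RowSet:
    return [{"q": r["q"]} for r in rows]


def count(rows: RowSet) -> RowSet:
    return [{"total": len(rows)}]


script = Script()
script.run(load, name="hits").run(popular, name="top").run(names)
names_out = script.last
script.run(count, "top")   # second use of the named row set "top"
count_out = script.last
```

The row set named `top` is consumed twice in this sketch: once implicitly, as the previous output fed to `names`, and once explicitly by name in the call that runs `count`.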
SCOPE is able to provide a scripting language that is conducive for executing on distributed data storage and processing systems. A typical distributed computing system includes a plurality of clusters that consist of hundreds or thousands of commodity computing devices that are connected by a high-bandwidth network. One exemplary difficulty with such a distributed computing network is designing a programming model that enables users to easily write programs that can effectively and efficiently utilize all resources in the distributed computing network while achieving maximum parallelism.
One solution that has been employed for a programming model in a distributed computing network is a Map-Reduce model. The Map-Reduce model requires the programmer to provide a map function that performs grouping and a reduce function that performs an aggregation. This model is limited. The users are forced to map their applications to the Map-Reduce model in order to achieve parallelism. For some applications, this mapping is very unnatural. Users are required to provide implementations for the map and reduce functions, even for simple operations. Additionally, in more complex applications that require multiple stages of Map-Reduce, there are multiple evaluation strategies and execution orders that may be selected by the user, which can result in suboptimal selection and lead to performance degradation by orders of magnitude. Further, a Map-Reduce model is typically limited to handling only string data types for both the key and the value. In addition, a Map-Reduce model requires data to be analyzed in a data stream, which results in an environment that is not intuitive to a user when developing the script. SCOPE overcomes these deficiencies in a variety of ways as identified in this disclosure.
SCOPE is able to handle a variety of data types and is not limited to a select few data types. For example, SCOPE can handle data types of string, integer, long, float, double, Boolean, DateTime, and byte[]. As previously discussed, SCOPE utilizes rows to pass information. A row object consists of a set of columns, which are strongly typed to a supported data type. In addition to supporting at least the previously mentioned data types, SCOPE can also support nullable versions of the supported data types. While a few data types have been specified, SCOPE is not limited to the enumerated data types. Users of SCOPE are able to add additional data types to satisfy their requirements.
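To illustrate what strongly typed columns with optional nullability might look like in practice, the following Python sketch validates a row against a declared schema. The `Column` and `Schema` classes and the sample column names are hypothetical; Python types stand in for the data types listed above, and the check shown is only one possible form of validation.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict


@dataclass(frozen=True)
class Column:
    name: str
    dtype: type             # the single data type the column accepts
    nullable: bool = False  # whether None is an allowed value


class Schema:
    def __init__(self, *columns: Column) -> None:
        self.columns = columns

    def validate(self, row: Dict[str, object]) -> None:
        """Raise if the row does not match the declared columns and types."""
        expected = {c.name for c in self.columns}
        if set(row) != expected:
            raise ValueError(
                f"columns {sorted(row)} do not match schema {sorted(expected)}"
            )
        for column in self.columns:
            value = row[column.name]
            if value is None:
                if not column.nullable:
                    raise TypeError(f"column {column.name!r} is not nullable")
            elif type(value) is not column.dtype:
                raise TypeError(
                    f"column {column.name!r} expects {column.dtype.__name__}, "
                    f"got {type(value).__name__}"
                )


schema = Schema(
    Column("query", str),
    Column("clicks", int),
    Column("seen_at", datetime, nullable=True),
)
schema.validate({"query": "scope", "clicks": 3, "seen_at": None})  # passes
```

A row such as `{"query": "scope", "clicks": "3", "seen_at": None}` would be rejected by `validate`, because the `clicks` value is a string rather than an integer.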
The utilization of rows to handle data is fundamentally different from relying on data streams. The utilization of rows allows for row-level communication that provides a level of validation and system-level safety. For example, utilizing rows allows for validation at run time. This validation, in an embodiment, results from a user being able to view a computational graph that is generated from a SCOPE script. The user can therefore view the graph to validate the execution plan. Compile-time checking, in an embodiment, is performed at the client computing device. In an additional embodiment, the compile-time validation is conducted at the SCOPE computing cluster.
An additional advantage of SCOPE's utilization of rows is from a programming perspective. Because rows have defined schemas that include columns with defined data types, it is easier to program against rows than against a stream that lacks such definition. Further advantages of rows include their ability to facilitate creation of super vertices, as previously discussed, and their ability to allow objects to be passed through a system without requiring serialization to streams and deserialization from streams. An additional advantage of SCOPE's utilization of rows is the prevention of user error in data manipulation and data input, because validation of the data, script, intermediate code, final code, and execution plan can be conducted at compile time, at run time, or both. This is possible, in part, because of the strong type associated with the columns of the data rows. A stream, on the other hand, can include arbitrary and dynamic schemas that are not conducive to such validation.
The distributed software layer diagram 300 includes a SCOPE script 302. As previously discussed, a SCOPE script, in an exemplary embodiment, is written in a high-level scripting language for writing data analysis jobs. The SCOPE script 302 is compiled by the SCOPE compiler 304 to result in an efficient parallel execution plan. The execution plan, in this embodiment, is then utilized by a SCOPE runtime 306 in concert with a distributed computing execution environment 308 to provide automatic handling of fault tolerance, data partitioning, resource management, and parallelism. A computational graph is created that represents the data flow in terms of processes and edges.
A distributed computing storage system 310, in an exemplary embodiment, is an append-only file system that stores large quantities of data. In this example, the system is optimized for sequential inputs and outputs. Continuing with this example, a file is composed of a sequence of extents that are units of space allocation. Data within the extents comprise a sequence of append blocks. The block boundaries are defined by application appends that may include a collection of application defined records. Distributed computing files 312 are the data files and streams manipulated by the SCOPE script and derivatives thereof.
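The file layout described above (a file as a sequence of extents, each extent as a sequence of append blocks) can be sketched in Python as follows. This is only an illustrative model: the `AppendOnlyFile` class and the fixed `extent_capacity` are assumptions made for the sketch and are not details given for the storage system 310.

```python
from typing import List


class AppendOnlyFile:
    """Toy model of an append-only file made of extents that hold append blocks."""

    def __init__(self, extent_capacity: int) -> None:
        # Assumed fixed extent size in bytes; the real allocation policy is not specified.
        self.extent_capacity = extent_capacity
        self.extents: List[List[bytes]] = []

    @staticmethod
    def _used(extent: List[bytes]) -> int:
        return sum(len(block) for block in extent)

    def append(self, block: bytes) -> None:
        """Add one application-defined block; block boundaries are preserved."""
        if len(block) > self.extent_capacity:
            raise ValueError("block is larger than an extent")
        needs_new_extent = (
            not self.extents
            or self._used(self.extents[-1]) + len(block) > self.extent_capacity
        )
        if needs_new_extent:
            self.extents.append([])  # allocate a new unit of space
        self.extents[-1].append(block)

    def read_all(self) -> bytes:
        """Sequential read of every block in append order."""
        return b"".join(block for extent in self.extents for block in extent)


f = AppendOnlyFile(extent_capacity=8)
for block in (b"abcd", b"efg", b"hij"):
    f.append(block)
```

There is deliberately no method for overwriting or deleting a block, which mirrors the append-only, sequentially read design described above.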
An extract command is used to extract “rows” from a given input source. The input generally is comprised of textual or binary data. In an embodiment, it is an extractor's responsibility to decide how to translate a stream into rows. Therefore, in an exemplary embodiment, an extract command takes in an arbitrary stream and converts the stream into a sequence of row data, such as a row data 406. The data rows follow a schema identified in the extract clause. The parsing of a data stream and the construction of data rows is performed by an extractor in an exemplary embodiment. The extractor may be user written or a built-in extractor of SCOPE.
In yet another exemplary embodiment, the extractor implements at least two methods, a produces method and an extract method. The produces method is called at compile time. In this example, the produces method describes an output schema given the requested columns identified in the extract command. Also in an exemplary embodiment, the output schema contains name/type pairs, where the type can be any supported data type (e.g., integer, float, and double). The second method, the extract method, is called at run time. In an exemplary embodiment, the extract method utilizes the IEnumerable<Row> syntax of C# to take an input stream and yield rows. Stated differently, in this embodiment, the extract method translates a byte stream to rows.
An output command 408 is used to write data to a data stream 410, a file, or any other data sink. In an exemplary embodiment, the output command is the only way that data can exit the system. Formatting a row for output is done by calling the specified outputter. The outputter can be a SCOPE provided outputter, or in an additional embodiment, the outputter is a user-defined outputter, such as an outputter created from extending the C# class “outputter.” In an additional embodiment, if an outputter is not specified, a default outputter is utilized.
Therefore, an exemplary SCOPE script for extracting columns A-E from a data stream identified as “sample.in” utilizing an extractor identified as “MyExtractor” is represented as follows in one exemplary embodiment.
EXTRACT A,B,C,D,E
FROM "sample.in"
USING MyExtractor;
The row data that is extracted from sample.in may then be further manipulated with additional commands, or it may be output back to a data stream utilizing the output command. In an exemplary embodiment, the row data is written to a data stream identified as “sample.out” utilizing the following command.
OUTPUT TO "sample.out";
As previously discussed, in an embodiment, an extractor is either a built-in extractor or a custom extractor. One example of an extractor, identified as MyExtractor, is as follows.
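The C# listing for MyExtractor is not reproduced in this text. As a rough stand-in, the two-method shape of an extractor can be sketched in Python, with a generator playing the role of the C# IEnumerable<Row> iterator. The tab-separated input format, the all-string schema, the skipping of malformed lines, and the method signatures are assumptions made for this sketch only.

```python
import io
from typing import BinaryIO, Dict, Iterator, List, Tuple

Row = Dict[str, str]


class MyExtractor:
    """Hypothetical extractor for tab-separated text (illustrative, not the original)."""

    def produces(self, requested_columns: List[str]) -> List[Tuple[str, type]]:
        # Compile-time step: describe the output schema as name/type pairs.
        # Assumption: every column is treated as a string.
        return [(name, str) for name in requested_columns]

    def extract(self, stream: BinaryIO, requested_columns: List[str]) -> Iterator[Row]:
        # Run-time step: translate the byte stream into rows, one per line.
        for raw_line in stream:
            fields = raw_line.decode("utf-8").rstrip("\r\n").split("\t")
            if len(fields) != len(requested_columns):
                continue  # assumption: silently skip malformed lines
            yield dict(zip(requested_columns, fields))


columns = ["A", "B", "C", "D", "E"]
extractor = MyExtractor()
sample = io.BytesIO(b"1\t2\t3\t4\t5\nbad line\n6\t7\t8\t9\t10\n")
rows = list(extractor.extract(sample, columns))
```

Because `extract` is a generator, rows are produced lazily as the stream is read, so the whole input never needs to be held in memory at once.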
The process command takes an arbitrary number of rows (data rows) and produces an arbitrary number of rows in return. The process command operates without regard for the order in which rows are passed to it. In an exemplary embodiment, the process command is utilized for filtering data, such as removing rows that do not meet specified criteria. The process command is also useable for adding or removing columns from input data rows. Additionally, the process command is useable for transforming data of input data rows. For example, the process command may take columns A, B, and C and produce columns A, B, C, and D, where D is some defined function of A, B, and C.
The following is an exemplary portion of an exemplary SCOPE script that includes a process command.
PROCESS
PRODUCE A, B, C, D, E
USING MyProcessor;
The process command produces A, B, C, D, and E from an input of data rows utilizing a processor. The processor, similar to the previously described extractor, can be built into SCOPE or can be a custom processor. In an exemplary embodiment, the actual work of the process command is done by the processor. The processor retrieves one input row at a time, performs some computation on the row, and outputs zero to multiple rows.
In an exemplary embodiment, the process command is a flexible command that allows a user to implement processing that is difficult or impossible to express in SQL alone. The process command is capable of returning multiple rows per input row, which in one embodiment allows for unnesting capabilities. For example, the process command can break an input search string into a series of words and return one row for each of these words.
An exemplary processor that may be called by a process command is as follows.
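The original processor listing is not reproduced in this text. The following Python sketch is a hypothetical analogue that illustrates the unnesting case described above: one input row carrying a search string yields one output row per word. The class name, column names, and method signatures are assumptions for illustration.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

Row = Dict[str, object]


class WordSplitProcessor:
    """Hypothetical processor that emits one row per word of a query string."""

    def produces(self, input_schema: List[Tuple[str, type]]) -> List[Tuple[str, type]]:
        # Compile-time step: declare the output schema as name/type pairs.
        return [("query", str), ("word", str)]

    def process(self, rows: Iterable[Row]) -> Iterator[Row]:
        # Run-time step: read one input row at a time and yield zero or more rows.
        for row in rows:
            query = str(row["query"])
            for word in query.split():
                yield {"query": query, "word": word}


processor = WordSplitProcessor()
input_rows = [{"query": "big data"}, {"query": ""}, {"query": "scope"}]
output_rows = list(processor.process(input_rows))
```

The empty query produces no output rows, which shows the "zero to multiple rows" behavior; the same shape also covers filtering, where a row is simply not yielded.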
Similar to the two methods discussed with respect to the extractor, the processor also implements two methods, the produces method and the process method. The produces method and the process method function as previously described; that is, the produces method is called at compile time and the process method is called at run time.
For example, a row set 602 (set of data rows) is input to the reduce command 606. The row set 602 includes a number of rows and a number of columns. One of the columns of the row set 602 includes a key value for each of the rows that comprise the row set 602. The column with the keys, key column 604, indicates at least three unique keys in this example. The input is grouped based on the keys and fed into the reduce command one grouping at a time. The reduce command 606 generates an output data set 608. The output data set 608 is reduced based on the keys 610.
In an exemplary embodiment, the reduce command takes as input a row set that has been grouped on a grouping column that is specified in an “ON” clause of the reduce command. The reduce command continues by processing each group and outputs zero to multiple rows per group. The reduce function is called once per group. In some embodiments, the reducer of a reduce command may require the rows within each group to be sorted on specific columns. This can be achieved with a presort clause, which avoids having to sort the input within the reducer.
The following is an exemplary reduce command that may be included in a SCOPE script.
REDUCE
ON key
PRODUCE key, A, B, C, D
USING MyReducer;
In an exemplary embodiment, the reduce command utilizes a reducer to reduce the input data rows to the output data rows. A reducer, similar to the previously discussed extractor, includes built-in reducers and user generated reducers. The following is an exemplary reducer that may be called by a reduce command.
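The original reducer listing is not reproduced in this text. As a hypothetical Python analogue, the sketch below groups input rows on a key column and calls a reducer once per group. The `CountReducer` class and the `run_reduce` driver are illustrative names, and sorting followed by `itertools.groupby` is merely one simple way to form the groups in a single process.

```python
from itertools import groupby
from operator import itemgetter
from typing import Dict, Iterable, Iterator, List, Tuple

Row = Dict[str, object]


class CountReducer:
    """Hypothetical reducer that emits one row per group with the group's row count."""

    def produces(self) -> List[Tuple[str, type]]:
        # Compile-time step: declare the output schema.
        return [("key", str), ("count", int)]

    def reduce(self, key: object, group: Iterable[Row]) -> Iterator[Row]:
        # Run-time step: called once per group; may yield zero or more rows.
        yield {"key": key, "count": sum(1 for _ in group)}


def run_reduce(rows: Iterable[Row], key_column: str, reducer: CountReducer) -> Iterator[Row]:
    """Group rows on key_column and feed the reducer one group at a time."""
    ordered = sorted(rows, key=itemgetter(key_column))  # groupby needs adjacent keys
    for key, group in groupby(ordered, key=itemgetter(key_column)):
        yield from reducer.reduce(key, group)


data = [{"key": k, "A": i} for i, k in enumerate(["x", "y", "x", "z", "y", "x"])]
reduced = list(run_reduce(data, "key", CountReducer()))
```

With three distinct keys in the input, the reducer is invoked three times, matching the grouping behavior described for the key column.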
Similar to the two methods discussed with respect to the extractor, the reducer also implements two methods, the produces method and the reduce method. The produces method and the reduce method function as previously described; that is, the produces method is called at compile time and the reduce method is called at run time.
For example, a first input row set 702 having keys 706 and a second input row set 704 having keys 708 are the inputs to the combine command 710. The combine command 710 then combines the first input row set 702 and the second input row set 704 based on the keys 706 and 708. The output of the combine command 710 is a row set 712. The row set 712 is combined based on the keys of the input row sets. Additionally, the key may be referenced as a joint condition. For example, a joint condition for combination includes table1.A==table2.A. The joint condition, in this example, combines rows when the value (key) of a column A in a first table is equivalent to the value (key) of a column A in a second table. In an additional embodiment of the present invention, the joint condition is not a typical equality condition, but rather an expression. Therefore, SCOPE is not limited to utilizing only equality conditions, but instead may rely on one or more expressions. In this embodiment, the semantics and runtime behavior will differ when an expression is utilized in place of an equality condition.
In an exemplary embodiment, the two input data sets must be grouped in the same manner for the combiner to receive the groups as an input. The combiner, in this example, then processes the rows within each matching group to produce output rows. This particular example allows for partitioning and distributed processing of the inputs.
The following is an exemplary combine command that may be included in a SCOPE script.
table1=EXTRACT A,B,C,D
FROM "vol1/users/brams/sample.in"
USING DefaultTextExtractor;
table2=EXTRACT A,B,C,D
FROM "vol1/users/brams/sample.in"
USING DefaultTextExtractor;
COMBINE table1 WITH table2
ON table1.A==table2.A
USING MyCombiner;
In an exemplary embodiment, the combine command utilizes a combiner to combine the input data sets based on a joint condition. The combiner, similar to the extractor, can be either a SCOPE provided combiner or a user created combiner. The following is an exemplary combiner.
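The original combiner listing is not reproduced in this text. The Python sketch below is a hypothetical analogue for the equality case table1.A==table2.A: both inputs are grouped on column A, and the combiner receives each pair of matching groups. `PairCombiner`, `group_by`, `run_combine`, and the output column names are illustrative assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List

Row = Dict[str, object]


def group_by(rows: Iterable[Row], column: str) -> Dict[object, List[Row]]:
    """Group rows by the value of one column, preserving first-seen key order."""
    groups: Dict[object, List[Row]] = defaultdict(list)
    for row in rows:
        groups[row[column]].append(row)
    return groups


class PairCombiner:
    """Hypothetical combiner that pairs every left row with every right row of a group."""

    def combine(self, left_group: List[Row], right_group: List[Row]) -> Iterator[Row]:
        for left in left_group:
            for right in right_group:
                yield {"A": left["A"], "left_B": left["B"], "right_B": right["B"]}


def run_combine(left: Iterable[Row], right: Iterable[Row], column: str,
                combiner: PairCombiner) -> Iterator[Row]:
    """Group both inputs the same way and hand matching groups to the combiner."""
    left_groups = group_by(left, column)
    right_groups = group_by(right, column)
    for key, left_group in left_groups.items():
        if key in right_groups:  # only keys present on both sides are combined
            yield from combiner.combine(left_group, right_groups[key])


table1 = [{"A": 1, "B": "x"}, {"A": 2, "B": "y"}]
table2 = [{"A": 1, "B": "p"}, {"A": 1, "B": "q"}, {"A": 3, "B": "r"}]
combined = list(run_combine(table1, table2, "A", PairCombiner()))
```

Because each key's groups are processed independently, the work for different keys could be placed on different machines, which is the partitioning property noted above.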
Additional commands that are functional in a SCOPE script include a select command, a join command, and an import command. The select command is patterned after an SQL select statement, thus allowing SCOPE to leverage the SQL language. A select command is capable of performing a variety of services, such as transforming data, adding columns, removing columns, filtering data, grouping data, and joining data. The select command can join multiple inputs utilizing inner and outer joins. Multiple aggregation functions are supported with a SCOPE select command. Examples include COUNT, COUNTIF, MIN, MAX, SUM, AVG, STDEV, VAR, FIRST, and LAST. It is understood that while multiple aggregation functions have been listed, additional functions are within the scope of the present invention. The select command provides expressiveness to the SCOPE language by leveraging SQL, C#, and/or .NET expressions. Exemplary select commands include, but are not limited to, the following select commands in accordance with embodiments of the present invention. A transform select command may be provided in SCOPE as:
SELECT A.Substring(0,3) AS ShortA,
B+C AS Z.
An exemplary select command for grouping may be represented in SCOPE as the following:
SELECT Query, COUNT( ) AS Count WHERE Query.StartsWith("a")
HAVING Count>50.
An exemplary select command for joining may be represented in SCOPE as the following:
SELECT a.A, b.B
FROM a,b
WHERE a.A==b.A.
It is understood by those with ordinary skill in the art that the previously discussed select command examples are merely representative and not limiting as to the scope of the present invention. For example, the select command is functional to provide more complex operations that allow SCOPE to be a dynamic, flexible, and powerful scripting language.
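To make the intended semantics of the grouping example concrete, the following Python sketch performs the same three steps on in-memory rows: filter on the prefix (WHERE), count rows per Query value (COUNT), and keep only groups above a threshold (HAVING). The function name and parameters are hypothetical, and the sketch assumes that the grouping is on the Query column, which the select example implies but does not state.

```python
from collections import Counter
from typing import Dict, Iterable, List


def count_queries(rows: Iterable[Dict[str, str]], prefix: str = "a",
                  threshold: int = 50) -> List[Dict[str, object]]:
    """Count rows per Query value for queries starting with prefix; keep counts > threshold."""
    # WHERE Query.StartsWith(prefix), then group on Query and COUNT per group.
    counts = Counter(row["Query"] for row in rows if row["Query"].startswith(prefix))
    # HAVING Count > threshold.
    return [{"Query": query, "Count": count}
            for query, count in counts.items() if count > threshold]


log = [{"Query": q} for q in ["apple", "apple", "avocado", "banana", "apple"]]
frequent = count_queries(log, prefix="a", threshold=1)
```

The threshold is lowered to 1 in the usage lines only so that the small sample produces a visible result.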
Additionally, in an embodiment, the select command is able to leverage all functions and operators of C# to make those functions and operators available in SCOPE. Users of SCOPE are also able to write their own functions. In an embodiment, the definition of a user-defined function is included within the SCOPE script. The following example illustrates the use of C# string functions and shows how to write a user-defined function. In the following example, columns A, B, and C are all of type string and, consequently, any of the C# string functions may be utilized in this embodiment. The user-defined function in the following example, “StringOccurs”, counts the number of occurrences of a given pattern string in an input string. The example is as follows:
In the above example, the expression A+C denotes string concatenation because both operands are strings. The C# function “Trim” strips white space from the beginning and the end of the string. The user-defined function StringOccurs is included in the SCOPE script in a section delimited by #CS and #ENDCS. It is understood that the previous example of a user-defined function within a SCOPE script is merely for explanatory purposes and is not intended to be limiting as to the scope of the present invention.
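The behavior described for the three expressions (concatenation of A and C, trimming of B, and counting pattern occurrences) can be sketched in Python as follows. This is a hypothetical analogue rather than the original C# listing: the function names, the output column names, the pattern "xyz", and the choice to count non-overlapping occurrences are all assumptions made for illustration.

```python
from typing import Dict


def string_occurs(text: str, pattern: str) -> int:
    """Count non-overlapping occurrences of pattern in text (assumed semantics)."""
    if not pattern:
        return 0  # assumption: an empty pattern never matches
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + len(pattern))
    return count


def transform(row: Dict[str, str]) -> Dict[str, object]:
    """Apply the three expressions to one row whose columns A, B, and C are strings."""
    return {
        "AC": row["A"] + row["C"],                  # string concatenation
        "B": row["B"].strip(),                      # analogue of C# Trim
        "Hits": string_occurs(row["C"], "xyz"),     # hypothetical pattern
    }


result = transform({"A": "ab", "B": "  padded  ", "C": "xyz-xyz"})
```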
The processor identifier 904 is an identifier of a processor that will be called as a result of the process scripting command 900. The “MyProcessor” previously discussed with respect to
The reducer identifier 1004 is an identifier of a reducer to be called as a result of the reduce scripting command 1000. The “MyReducer” discussed with respect to
As indicated at a block 1404, the SCOPE script is compiled. In an exemplary embodiment, the SCOPE script is compiled at a SCOPE computing cluster. However, in additional embodiments the SCOPE script is compiled at a client device. A result of compiling the SCOPE script is the generation of an execution plan, as indicated at a block 1406. The execution plan is stored in a computer-readable storage medium, as indicated at a block 1408. As shown at a block 1410, a computational graph is generated. In an exemplary embodiment the computational graph is generated by the computational graph generator 222 previously discussed with respect to
Many different arrangements of the various components depicted, as well as components not shown, are possible without departing from the spirit and scope of the present invention. Embodiments of the present invention have been described with the intent to be illustrative rather than restrictive. Alternative embodiments will become apparent to those skilled in the art that do not depart from its scope. A skilled artisan may develop alternative means of implementing the aforementioned improvements without departing from the scope of the present invention.
It will be understood that certain features and subcombinations are of utility, may be employed without reference to other features and subcombinations, and are contemplated within the scope of the claims. Not all steps listed in the various figures need be carried out in the specific order described.