This invention relates to the generation of a reusable data processing program based on a user's manipulation of a tabular representation of data.
Complex computations can be expressed as a data flow through a directed graph, with components of the computation being associated with the vertices of the graph and data flows between the components corresponding to links (arcs, edges) of the graph. In some cases, the computations associated with a component are described in human-readable form referred to as “business rules.” Business rules include a set of criteria used to transform data from one format to another, make determinations about data, or generate new data based on a set of input data.
Referring to
Referring to
In some examples, dataflow graphs are executable computer programs that include vertices (representing data processing components or datasets) connected by directed links (representing flows of work elements, i.e., data) between the vertices. For example, such an environment is described in more detail in U.S. Pat. No. 7,716,630, titled “Managing Parameters for Graph-Based Applications,” incorporated herein by reference. A system for executing such graph-based computations is described in U.S. Pat. No. 5,966,072, titled “EXECUTING COMPUTATIONS EXPRESSED AS GRAPHS,” incorporated herein by reference. Dataflow graphs made in accordance with this system provide methods for getting information into and out of individual processes represented by graph components, for moving information between the processes, and for defining a running order for the processes. This system includes algorithms that choose interprocess communication methods from any available methods (for example, communication paths according to the links of the graph can use TCP/IP or UNIX domain sockets, or use shared memory to pass data between the processes). A dataflow graph, as referred to herein, is an executable computer program.
While the business rule development paradigm described above uses a tabular user interface to provide a user with a comprehensive view of the different conditions defining a business rule, the tabular user interface does not provide an overall view of the transformed records that result from applying the business rule to a set of input records.
Aspects described herein relate to an alternative and improved paradigm for defining a transformation based on user manipulation of data records in a tabular user interface. Data records are displayed to the user in a tabular user interface. The user's manipulations of the data records in the tabular user interface (e.g., adding or removing columns, filtering the data records, and defining computations based on the data records) are aggregated and together form an aggregate transformation. What the user sees in the tabular interface is an up-to-date representation of a set of input records as transformed by the aggregate transformation. When the user is satisfied with the transformed data records displayed in the tabular user interface, they can export the aggregate transformation (sometimes referred to as a set of “final transformations”) as a reusable data processing program for processing other input data.
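The aggregation of user manipulations into a single transformation can be illustrated with a minimal sketch. It assumes, purely for illustration, that records are Python dictionaries and that each data transformation step is a function from a list of records to a list of records. The names filter_step, add_field_step, choose_fields_step, and apply_steps, as well as the sample records, are hypothetical and are not taken from the described system.

```python
from typing import Callable, Dict, List

Record = Dict[str, object]
Step = Callable[[List[Record]], List[Record]]


def filter_step(predicate: Callable[[Record], bool]) -> Step:
    """Build a step that keeps only the records satisfying the predicate."""
    return lambda records: [r for r in records if predicate(r)]


def add_field_step(name: str, expression: Callable[[Record], object]) -> Step:
    """Build a step that adds a computed field to every record."""
    return lambda records: [{**r, name: expression(r)} for r in records]


def choose_fields_step(fields: List[str]) -> Step:
    """Build a step that keeps only the listed fields, in the listed order."""
    return lambda records: [{f: r[f] for f in fields} for r in records]


def apply_steps(steps: List[Step], input_records: List[Record]) -> List[Record]:
    """Apply the steps in order; together they act as one aggregate transformation."""
    records = input_records
    for step in steps:
        records = step(records)
    return records


# Hypothetical input records, used only for illustration.
input_records = [
    {"name": "A", "population": 100, "area": 10},
    {"name": "B", "population": 50, "area": 25},
]

# Each user manipulation appends one step to the aggregate transformation.
steps: List[Step] = []
steps.append(add_field_step("density", lambda r: r["population"] / r["area"]))
steps.append(filter_step(lambda r: r["density"] > 5))
steps.append(choose_fields_step(["name", "density"]))

# The tabular view would show this result; the input records are left unchanged.
transformed = apply_steps(steps, input_records)
```

Because the steps are always reapplied to the original input records, the displayed result stays consistent with the current list of steps, and the same list can later be applied to other input data.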
In a general aspect, a method for developing a reusable data processing program including a set of data transformation steps by displaying a set of records and iteratively enabling a user to select one or more data transformation steps, iteratively applying the data transformation steps to the records, and iteratively displaying the transformed records includes accessing a number of input records, rendering a representation of the number of transformed records in a user interface, the number of transformed records determined by applying the set of data transformation steps to the number of input records, and receiving first user input as the user manipulates the representation of the number of transformed records using the user interface, the first user input including one or more data transformation steps. For each data transformation step of the one or more data transformation steps, the method includes adding the data transformation step to the set of data transformation steps, updating the number of transformed records, including applying the set of data transformation steps to the number of input records, and rendering a representation of the transformed number of transformed records in the user interface. The method further includes receiving second user input causing export of the reusable data processing program based at least in part on the set of data transformation steps, the reusable data processing program being applicable to one or more pluralities of records different from the number of input records.
Aspects may include one or more of the following features.
The set of data transformation steps may include a number of data transformation steps. The number of data transformation steps may be applied sequentially according to an order specified by the user. The method may include rendering a representation of the set of data transformation steps in the user interface during development of the reusable data processing program. The representation of the set of data transformation steps may display the data transformation steps in a list ordered according to the order specified by the user. The representation of the set of data transformation steps may include a dataflow graph representation of the data transformation steps. The method may include receiving third user input causing removal of one or more data transformation steps from the set of data transformation steps. The method may include receiving third user input causing modification of one or more data transformation steps from the set of data transformation steps.
The user interface may include a tabular interface, and the representation of the number of transformed records may be rendered in the tabular interface. The user interface may include a list interface where the set of data transformation steps is rendered as a list in the list interface. The set of data transformation steps rendered in the list interface may be ordered according to an order of application of the data transformation steps to the number of input records. The method may include receiving third user input to change the order of application of the set of data transformation steps. The method may include interacting with a data transformation step using the list interface to modify the data transformation step. The method may include interacting with a data transformation step using the list interface to remove the data transformation step from the set of data transformation steps.
The set of data transformation steps may include one or more of a filter data transformation step, an add field data transformation step, and a choose fields data transformation step. The set of data transformation steps may include a filter data transformation step. Causing export of the reusable data processing program may include compiling the set of data transformation steps to form the reusable data processing program. Causing export of the reusable data processing program may include forming a dataflow graph representation of the set of data transformation steps to form the reusable data processing program.
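As a rough illustration of forming a dataflow graph representation from a sequentially applied set of steps, the sketch below builds a linear chain of vertices and directed links. It is only one possible reading: the function name to_linear_graph, the dictionary layout, and the "input" and "output" vertex names are assumptions, and step names are assumed to be unique so they can serve as vertex identifiers.

```python
from typing import Dict, List


def to_linear_graph(step_names: List[str],
                    source: str = "input",
                    sink: str = "output") -> Dict[str, list]:
    """Chain a source, the ordered steps, and a sink into a linear graph."""
    vertices = [source, *step_names, sink]
    # Each directed link carries records from one vertex to the next.
    links = list(zip(vertices, vertices[1:]))
    return {"vertices": vertices, "links": links}


# Hypothetical step names, used only for illustration.
graph = to_linear_graph(["Add Field", "Filter"])
```

A real export would attach the transformation logic to each vertex; this sketch captures only the topology implied by the order of application.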
The method may include computing a data profile for the number of transformed records and rendering a representation of the data profile in the user interface. The second user input may be received upon determining that a data profile for the number of transformed records is in accordance with a predetermined data profile.
The predetermined data profile or predetermined profile rule may specify an allowable range for some characteristics of the data profile. The method may include computing a data quality for the number of transformed records and rendering a representation of the data quality in the user interface. The data quality may include at least one of counts of valid values, invalid values, NULL values, distinct values, unique values, and/or maximum and minimum values.
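The data quality counts and the comparison against a predetermined data profile can be sketched as follows. This is a minimal illustration under stated assumptions: "distinct" is taken to mean the number of different non-NULL values and "unique" the number of values that occur exactly once, NULL is modeled as Python None, and the names profile_field and check_profile are hypothetical.

```python
from typing import Dict, List, Tuple

Record = Dict[str, object]


def profile_field(records: List[Record], field: str) -> Dict[str, object]:
    """Compute simple per-field profile characteristics."""
    values = [r.get(field) for r in records]
    non_null = [v for v in values if v is not None]
    counts: Dict[object, int] = {}
    for v in non_null:
        counts[v] = counts.get(v, 0) + 1
    return {
        "null_count": len(values) - len(non_null),
        "distinct_count": len(counts),  # different values present
        "unique_count": sum(1 for c in counts.values() if c == 1),  # values seen once
        "minimum": min(non_null) if non_null else None,
        "maximum": max(non_null) if non_null else None,
    }


def check_profile(profile: Dict[str, object],
                  allowed: Dict[str, Tuple[float, float]]) -> List[str]:
    """Return one warning per characteristic outside its allowable range."""
    warnings = []
    for key, (low, high) in allowed.items():
        value = profile[key]
        if value is None or not (low <= value <= high):
            warnings.append(
                f"{key}={value} is outside the allowable range [{low}, {high}]"
            )
    return warnings


# Hypothetical transformed records and profile rule, used only for illustration.
records = [{"Density": 10}, {"Density": 2}, {"Density": None}, {"Density": 10}]
profile = profile_field(records, "Density")
warnings = check_profile(profile, {"null_count": (0, 0), "maximum": (0, 1000)})
```

In this sketch an empty warnings list would correspond to the condition under which the export input is accepted.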
In another general aspect, a system for developing a reusable data processing program including a set of data transformation steps by displaying a set of records and iteratively enabling a user to select one or more data transformation steps, iteratively applying the data transformation steps to the records, and iteratively displaying the transformed records includes a first input for accessing a number of input records, an output for rendering a representation of the number of transformed records in a user interface, the number of transformed records determined by applying the set of data transformation steps to the number of input records, and a second input for receiving first user input as the user manipulates the representation of the number of transformed records using the user interface, the first user input including one or more data transformation steps. The system includes one or more processors configured to, for each data transformation step of the one or more data transformation steps, perform the steps of adding the data transformation step to the set of data transformation steps, updating the number of transformed records, including applying the set of data transformation steps to the number of input records, and rendering a representation of the transformed number of transformed records in the user interface. The system further includes a third input for receiving second user input causing export of the reusable data processing program based at least in part on the set of data transformation steps, the reusable data processing program being applicable to one or more pluralities of records different from the number of input records.
In another general aspect, a non-transitory computer-readable medium stores instructions for causing a computing system to implement a method for developing a reusable data processing program including a set of data transformation steps by displaying a set of records and iteratively enabling a user to select one or more data transformation steps, iteratively applying the data transformation steps to the records, and iteratively displaying the transformed records. The instructions cause the computing system to access a number of input records, render a representation of the number of transformed records in a user interface, the number of transformed records determined by applying the set of data transformation steps to the number of input records, and receive first user input as the user manipulates the representation of the number of transformed records using the user interface, the first user input including one or more data transformation steps. For each data transformation step of the one or more data transformation steps, the instructions cause the computing system to add the data transformation step to the set of data transformation steps, update the number of transformed records, including applying the set of data transformation steps to the number of input records, and render a representation of the transformed number of transformed records in the user interface. The instructions further cause the computing system to receive second user input causing export of the reusable data processing program based at least in part on the set of data transformation steps, the reusable data processing program being applicable to one or more pluralities of records different from the number of input records.
In another general aspect, a system for developing a reusable data processing program including a set of data transformation steps by displaying a set of records and iteratively enabling a user to select one or more data transformation steps, iteratively applying the data transformation steps to the records, and iteratively displaying the transformed records includes means for accessing a number of input records, means for rendering a representation of the number of transformed records in a user interface, the number of transformed records determined by applying the set of data transformation steps to the number of input records, means for receiving first user input as the user manipulates the representation of the number of transformed records using the user interface, the first user input including one or more data transformation steps, and means for processing configured to, for each data transformation step of the one or more data transformation steps, add the data transformation step to the set of data transformation steps, update the number of transformed records, including applying the set of data transformation steps to the number of input records, and render a representation of the transformed number of transformed records in the user interface. The system also includes means for receiving second user input causing export of the reusable data processing program based at least in part on the set of data transformation steps, the reusable data processing program being applicable to one or more pluralities of records different from the number of input records.
In another general aspect, a method for developing a reusable data processing program includes accessing a number of input records, rendering a representation of the number of input records in one or more user interfaces, receiving a set of one or more data transformation steps, applying the set of data transformation steps to the number of input records to obtain a number of transformed records, rendering a representation of the number of transformed records in the one or more user interfaces, receiving first user input as the user manipulates the representation of the number of transformed records using the one or more user interfaces, the first user input including one or more data transformation steps. For each data transformation step of the one or more data transformation steps of the first user input, the method adds the data transformation step to the set of data transformation steps to update the set of data transformation steps, updates the number of transformed records, including applying the updated set of data transformation steps to the number of input records to obtain an updated number of transformed records, and renders a representation of the updated number of transformed records in the one or more user interfaces. The method further includes receiving second user input causing export of the reusable data processing program, said exported program being based at least in part on the updated set of data transformation steps, the reusable data processing program being applicable to one or more pluralities of records different from the number of input records.
Among other advantages, aspects advantageously provide a graphical shortcut for making settings in data transformations, to instruct a computer to transform data records in a particular way. This graphical shortcut can involve a tabular user interface for developing and/or adjusting complex transforms for data processing programs. The graphical shortcut allows one to choose data processing conditions, such as directly via the tabular form in which the records are displayed, without having to cycle through the code of the transform every time a transform step needs to be changed, added, or removed. This saves computing resources and makes the performing of the change, addition, or removal of the transform step efficient and reliable.
Furthermore, aspects save time and computing resources while ensuring proper execution of the resulting data processing program. For example, providing an environment for developing rules in a tabular interface advantageously lets the user see the result of applying the rules to data in real time. This can make more efficient use of computing resources because seeing results helps users immediately recognize errors in their code without resorting to trial and error, which can be computationally wasteful. A tighter feedback cycle between making edits and seeing results can help users catch logic and conceptual errors much more quickly. In some aspects, the interface works on a subset of data (e.g., 500 records). The logic may therefore run more often to give the user a tighter feedback loop, but it runs on small datasets that are cached during a session, which reduces extraction costs from the source system (e.g., a cloud provider such as Amazon S3 is paid only once for reading the data, not once per edit). Some aspects also optimize the data processing program by, for example, merging steps, pushing down logic to the source system, and reordering logic to be more efficient.
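The session-level caching of a small sample can be sketched as below. This is an illustrative assumption about how such caching might look, not a description of the actual system: the class name SampleCache, the fetch callback, and the fake_source function standing in for a source system are all hypothetical.

```python
from typing import Callable, Dict, List, Optional

Record = Dict[str, object]


class SampleCache:
    """Fetch a bounded sample once per session and reuse it for every edit."""

    def __init__(self, fetch: Callable[[int], List[Record]], sample_size: int = 500):
        self._fetch = fetch
        self._sample_size = sample_size
        self._sample: Optional[List[Record]] = None
        self.fetch_count = 0  # how many times the source system was read

    def records(self) -> List[Record]:
        if self._sample is None:
            # Only the first request reaches the source system.
            self._sample = self._fetch(self._sample_size)
            self.fetch_count += 1
        return self._sample


def fake_source(limit: int) -> List[Record]:
    """Hypothetical stand-in for an expensive read from a source system."""
    return [{"id": i} for i in range(limit)]


cache = SampleCache(fake_source, sample_size=500)
for _ in range(3):  # three user edits, each re-running the logic
    sample = cache.records()
```

After three edits the source has been read once, which is the cost saving the paragraph above describes.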
Aspects also advantageously ensure that proper execution of the resulting data processing program is more likely. Software testing and debugging is a difficult problem: even the most well-developed software includes bugs. Aspects described herein provide a view of the output data as the software is developed, facilitating the quick and easy discovery of errors in the software (i.e., based on visual inspection, the user knows almost immediately that the output data is different from what they expected). As a result, software developed using the invention is more likely to have error-free execution. That is, the interface provides a powerful debugging tool, where the tabular interface acts like a probe in a debugger, presenting results that help users intuitively recognize and resolve errors in the data processing program.
Some aspects advantageously obviate the need for a user to manually build a dataflow graph. Building of the dataflow graph is done behind the scenes, so the user doesn't have to spend time and waste computing resources configuring the layout of the dataflow graph. This amounts to a graphical shortcut, where program development in the tabular view significantly reduces the processing load on the underlying computing hardware (e.g., by reducing graphical processing load from dragging and dropping components, wiring components together, rearranging components, switching between development views and runtime views, etc.).
Yet other aspects guide users to reduce erroneous manipulation of the tabular representation of data, to reduce errors in the transforms, and to obtain a resulting data processing program that is more likely to be executable and properly applied to data records. The tabular view is a constrained programming environment that, by its very nature, aims to reduce the number of ways a user could introduce errors into the program. Compared to spreadsheet programs, aspects operate on semantically meaningful concepts like “data fields” rather than interface elements like “spreadsheet cells.” Among other benefits, this prevents a class of error that arises in spreadsheets when a formula is not replicated correctly between cells (not to mention saving the effort of having to replicate formulas into many cells). For example, a spreadsheet formula might use a relative reference when it should use an absolute one or vice versa, or a formula might not have been copied properly onto inserted rows. Furthermore, showing test data while expressions/transformations are being constructed can also help catch logic errors without waiting to run the expressions/transformations on the full dataset.
Per-field profiling could also be used to catch some classes of logic errors that arise from misconceptions about input data. For example, the data profile can help the user notice that a numeric value is probably drawn from a constrained set of values, so that the user does not attempt to perform mathematical operations on it. That type of information might not be immediately obvious from seeing the first few dozen values, but the data profile would make it clear.
Being able to test transformations on very large datasets also advantageously facilitates the identification of outliers in results. Local spreadsheets run into performance issues when run on extremely large datasets, so in practice users write formulas in smaller spreadsheet files and then migrate them to larger ones, whereas the transformations described herein are implemented in graphs and can be applied to the dataset from the beginning.
Other features and advantages of the invention are apparent from the following description and from the claims.
Referring to
The iterative development method 326 includes a first step 328 where the user 320 adds, removes, or modifies a data transformation by manipulating test data displayed in the tabular view 321 (e.g., by adding or removing fields and/or filtering the data). The result of the first step 328 is a set of “working” transformations 330, which are fed back to the user interface 322 and displayed to the user 320 in a transformation history view 323 (e.g., as a list of transformations, described in greater detail below).
The iterative development method 326 includes a second step 332 where the set of working transformations 330 is applied to working test data 336 to form transformed test data 334. As part of the second step 332, the transformed test data 334 is also fed back to the user interface 322, where it is displayed to the user 320 in the tabular view 321. Optionally, as part of the second step 332, the transformed test data 334 can be processed in a data analyzer 337 to generate a profile/quality data 339 for the transformed test data 334. The profile/quality data 339 is displayed to the user 320 in the user interface 322.
As the user manipulates data in the tabular view 321, the first and second steps 328, 332 of the iterative development method 326 are repeated and the user interface 322 is repeatedly updated to reflect the current working transformations 330 and transformed test data 334. The user 320 can view the data profile 339 of the transformed test data 334 at any point during the iterative development method 326 to ensure the data 334 conforms to a desired data profile. For example, the data profile obtained from the transformed data can be checked against a predetermined data profile to recognize any errors. In some examples, the errors are communicated as warnings to the user, including information guiding the user on how to fix each error. Eventually, in a third step of the iterative development method 326, when the user is satisfied with the state of the transformed test data 334 displayed in the tabular view 321, they export the set of working transformations 330 as the set of final transformations 324. The set of final transformations 324 is exported in a form that is usable to transform data other than the working test data 336. One example of such a form is a component for use in a dataflow graph.
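The repeated first and second steps and the eventual export can be sketched as a small session object. This is a hedged illustration only: the class DevelopmentSession and its methods are hypothetical names, steps are assumed to be functions over lists of dictionary records, and the exported "program" is modeled as a plain Python callable rather than a dataflow graph component.

```python
from typing import Callable, Dict, List

Record = Dict[str, object]
Step = Callable[[List[Record]], List[Record]]


class DevelopmentSession:
    """Keep working transformations and transformed test data in sync."""

    def __init__(self, working_test_data: List[Record]):
        self.working_test_data = working_test_data
        self.working_transformations: List[Step] = []
        self.transformed_test_data: List[Record] = list(working_test_data)

    def _refresh(self) -> None:
        # Second step: reapply every working transformation to the test data.
        records = self.working_test_data
        for step in self.working_transformations:
            records = step(records)
        self.transformed_test_data = records

    def add(self, step: Step) -> None:
        # First step: the user adds a transformation.
        self.working_transformations.append(step)
        self._refresh()

    def remove(self, index: int) -> None:
        # First step: the user removes a transformation.
        del self.working_transformations[index]
        self._refresh()

    def export(self) -> Step:
        # Third step: snapshot the working transformations as final ones.
        final_transformations = tuple(self.working_transformations)

        def program(records: List[Record]) -> List[Record]:
            for step in final_transformations:
                records = step(records)
            return records

        return program


# Hypothetical test data and transformation, used only for illustration.
session = DevelopmentSession([{"x": 1}, {"x": 5}])
session.add(lambda rs: [r for r in rs if r["x"] > 2])
program = session.export()
other_result = program([{"x": 3}, {"x": 0}])  # data other than the test data
```

Taking a snapshot at export time means later edits in the session do not alter a program that has already been exported.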
Referring to
Referring to
In
Next, the Active View Data component 454 is configured by the user to process data from the Country data source 452 to generate density data that is written to the Density data sink 456. To do so, the user double clicks on the Active View Data component 454 to open the component. Throughout the remainder of this example, the user interacting with an element of the user interface is indicated by the outline of the element being shown in a bolded line. For example, in
Referring to
The tabular view 321 displays the test data 334 in a tabular form, with rows 558 corresponding to record numbers (e.g., records 1-7) and columns 560 corresponding to fields of the records (e.g., “country name,” “country code,” etc.). Field values for particular records are displayed in cells 562 at the intersections of particular rows and columns (e.g., the code for record 3 is “MK”). The user 320 can scroll through the tabular view 321 to view the test data 334 (note that no transformations are included in the set of working transformations of
The field selection menu 319 includes a checkbox 325 for each field in the original working test data. Each checkbox can be toggled to select whether its associated field is displayed in the tabular view 321. In
The buttons 556 are associated with a set of data transformations or other operations that can be applied to the working test data 336. In some examples, the buttons include an “Add Filter” button 563 that allows the user to add a Filter transformation to the set of working transformations 330, an “Add Field” button 564 to add an “Add Field” transformation to the set of working transformations, an “Add Aggregation” button 590 that allows the user to add an “Add Aggregation” transformation to the set of working transformations, a “Show Profile” button 591 that allows the user to view a data profile of transformed test data 334, and a “Show Data Quality” button 592 that allows the user to view results of a data quality analysis of the transformed test data 334. The transformations and operations associated with the buttons 556 are described in greater detail below. A “View Graph” button 567 causes a dataflow graph associated with the set of working transformations to be displayed to the user.
In
In
The Name field 575 requires the user to specify a name for the new column. In this case the user has chosen “Density” as the name for the new column. The Data Type field 574 allows the user to choose a data type for the new column from a list of data types such as Number, String, and Boolean data types (or the user can choose to automatically detect the data type). In this case, the user has chosen automatic detection of the data type, which results in Number being the data type for the new column.
The Expression field 576 requires the user to specify an expression (e.g., a calculation based on values of one or more fields in a record), which is used to populate the values in the new column. In this case, the user has specified the expression to calculate the population density of the countries as:
“=round(Country.population/Country.area)”
(i.e., the population of the country divided by the area of the country, rounded to the nearest integer value). When finished, the user clicks the save button to return to the user interface 322.
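The effect of this “Add Field” transformation can be approximated in a short sketch. The field names population and area come from the expression above; everything else is assumed for illustration, including the function names add_field and detect_data_type, the simple type detection rule, and the sample country records, which are invented values rather than real data. Note also that Python's built-in round rounds exact halves to the nearest even integer, which may differ from the rounding rule of the expression language described here.

```python
from typing import Callable, Dict, List

Record = Dict[str, object]


def add_field(records: List[Record], name: str,
              expression: Callable[[Record], object]) -> List[Record]:
    """Return new records with one extra column computed per record."""
    return [{**r, name: expression(r)} for r in records]


def detect_data_type(values: List[object]) -> str:
    """Guess a column type from its values (hypothetical detection rule)."""
    if all(isinstance(v, bool) for v in values):
        return "Boolean"
    if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in values):
        return "Number"
    return "String"


# Invented sample records, not real country data.
countries = [
    {"name": "Examplia", "population": 1000, "area": 40},
    {"name": "Samplestan", "population": 900, "area": 7},
]

# Counterpart of "=round(Country.population/Country.area)".
with_density = add_field(
    countries, "Density", lambda r: round(r["population"] / r["area"])
)
density_type = detect_data_type([r["Density"] for r in with_density])
```

Because the expression is defined once per field rather than once per cell, every record in the column is computed by the same logic.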
Referring to
In
Referring to
In some examples, when the user is satisfied with the transformed test data 334, as shown in the tabular view 321, they can click the save button to export the set of working transformations as the set of final transformations 324. Clicking the save button returns the user to the dataflow graph development environment 436 of
Referring to
Referring to
Referring to
In
In the aggregation Key field 794, the user 320 has selected the cust_ID field as the aggregation Key. The user has populated the aggregation Expression field 795 with the Expression:
“rollup_sum(purchase_details.charge_amt)”
indicating that the aggregation transformation is a rollup aggregation that determines a sum of charge_amt values for each unique cust_ID. In
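A rollup aggregation of this kind can be sketched as follows. The field names cust_ID and charge_amt are taken from the text; the Python function rollup_sum below is a hypothetical stand-in that only mimics the described behavior, and the sample purchase records are invented for illustration.

```python
from typing import Dict, List

Record = Dict[str, object]


def rollup_sum(records: List[Record], key: str, value_field: str) -> List[Record]:
    """Sum value_field over all records sharing the same key value."""
    totals: Dict[object, float] = {}
    for r in records:
        totals[r[key]] = totals.get(r[key], 0) + r[value_field]
    # One output record per unique key, in first-seen order.
    return [{key: k, value_field: total} for k, total in totals.items()]


# Invented purchase records, used only for illustration.
purchase_details = [
    {"cust_ID": 1, "charge_amt": 10.0},
    {"cust_ID": 2, "charge_amt": 5.5},
    {"cust_ID": 1, "charge_amt": 2.5},
]

rolled_up = rollup_sum(purchase_details, key="cust_ID", value_field="charge_amt")
```

The output has one record per unique cust_ID, so the number of rows shown in the tabular view shrinks after the aggregation is added.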
Referring to
In
In
In
In some examples, a more complete characterization of the data quality is accessed by clicking the “Show Data Quality” button 592 of the user interface 322. Discussion of that characterization of the data quality is beyond the scope of this invention and is not discussed further herein.
In some examples, the user may need to modify or remove transformations from the set of working transformations 330. The user can do so in the user interface 322 by interacting with the list of transformations in the transformation history view 323 (e.g., by clicking the modify or remove buttons associated with the transformations in the list). Furthermore, there may be situations where the user wants to reorder the transformations in the transformation history view. In such cases, the user can, for example, drag the transformations to modify the order shown in the transformation history view.
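Reordering and removal matter because the transformations are applied sequentially, so a different order can give a different result. The sketch below illustrates this under assumptions: the helpers move_step, remove_step, and run are hypothetical, and the “Keep first two records” step is an invented transformation chosen only because its result depends on its position.

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, object]
NamedStep = Tuple[str, Callable[[List[Record]], List[Record]]]


def move_step(steps: List[NamedStep], old_index: int, new_index: int) -> List[NamedStep]:
    """Return a new list with one step dragged to a new position."""
    reordered = list(steps)
    reordered.insert(new_index, reordered.pop(old_index))
    return reordered


def remove_step(steps: List[NamedStep], index: int) -> List[NamedStep]:
    """Return a new list without the step at the given position."""
    return [s for i, s in enumerate(steps) if i != index]


def run(steps: List[NamedStep], records: List[Record]) -> List[Record]:
    """Apply the steps in list order."""
    for _label, fn in steps:
        records = fn(records)
    return records


# Hypothetical steps and data, used only for illustration.
steps: List[NamedStep] = [
    ("Keep first two records", lambda rs: rs[:2]),
    ("Filter: x > 1", lambda rs: [r for r in rs if r["x"] > 1]),
]
data = [{"x": 1}, {"x": 2}, {"x": 3}]

original_order = run(steps, data)
reordered_steps = move_step(steps, 1, 0)  # drag the filter to the top
new_order = run(reordered_steps, data)
```

Returning new lists instead of editing in place makes it straightforward to recompute the transformed test data after each change to the history view.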
It should be appreciated that the types of transformations described in the example above are only an example of the transformations that may be available in the user interface 322 and that other transformations may be available to the user.
In the examples described above, the data being accessed by the data processing programs are shown as datasets (e.g., databases or other sets of data stored on disk or in memory). However, it should be appreciated that flows of data can also be used to develop the data processing programs and can be processed by the data processing programs in a runtime setting.
The step of exporting the set of final transformations 324 may be a compilation step that translates the set of final transformations into a lower-level programming language such as assembly language, object code, or machine code to create an executable program. Alternatively, the exporting step may translate the set of final transformations into Ab Initio's DML programming language, into an Ab Initio dataflow graph, or into an Ab Initio “EZ Graph” which is an easily modifiable and optimizable computational graph (described in U.S. Patent Pub. 2021-0232579, the contents of which are incorporated herein by reference). Finally, the exporting step may export the set of transformations without any translation.
The data processing program exported from the iterative development method described above is not only usable to process the working test data but is also usable to process other, real-world data in both batch and streaming applications.
The approaches described above can be implemented, for example, using a programmable computing system executing suitable software instructions, or they can be implemented in suitable hardware such as a field-programmable gate array (FPGA) or in some hybrid form. For example, in a programmed approach the software may include procedures in one or more computer programs that execute on one or more programmed or programmable computing systems (which may be of various architectures such as distributed, client/server, or grid), each including at least one processor, at least one data storage system (including volatile and/or non-volatile memory and/or storage elements), and at least one user interface (for receiving input using at least one input device or port, and for providing output using at least one output device or port). The software may include one or more modules of a larger program, for example, that provides services related to the design, configuration, and execution of data processing graphs. The modules of the program (e.g., elements of a data processing graph) can be implemented as data structures or other organized data conforming to a data model stored in a data repository.
The software may be stored in non-transitory form, such as being embodied in a volatile or non-volatile storage medium, or any other non-transitory medium, using a physical property of the medium (e.g., surface pits and lands, magnetic domains, or electrical charge) for a period of time (e.g., the time between refresh periods of a dynamic memory device such as a dynamic RAM). In preparation for loading the instructions, the software may be provided on a tangible, non-transitory medium, such as a CD-ROM or other computer-readable medium (e.g., readable by a general or special purpose computing system or device), or may be delivered (e.g., encoded in a propagated signal) over a communication medium of a network to a tangible, non-transitory medium of a computing system where it is executed. Some or all of the processing may be performed on a special purpose computer, or using special-purpose hardware, such as coprocessors or field-programmable gate arrays (FPGAs) or dedicated, application-specific integrated circuits (ASICs). The processing may be implemented in a distributed manner in which different parts of the computation specified by the software are performed by different computing elements. Each such computer program is preferably stored on or downloaded to a computer-readable storage medium (e.g., solid state memory or media, or magnetic or optical media) of a storage device accessible by a general or special purpose programmable computer, for configuring and operating the computer when the storage device medium is read by the computer to perform the processing described herein. The inventive system may also be considered to be implemented as a tangible, non-transitory medium, configured with a computer program, where the medium so configured causes a computer to operate in a specific and predefined manner to perform one or more of the processing steps described herein.
A number of embodiments of the invention have been described. Nevertheless, it is to be understood that the foregoing description is intended to illustrate and not to limit the scope of the invention, which is defined by the scope of the following claims. Accordingly, other embodiments are also within the scope of the following claims. For example, various modifications may be made without departing from the scope of the invention. Additionally, some of the steps described above may be order independent, and thus can be performed in an order different from that described.
This application claims the benefit of U.S. Provisional Application No. 63/472,445 filed Jun. 12, 2023, the content of which is incorporated herein in its entirety.
Number | Date | Country
---|---|---
63472445 | Jun 2023 | US