The present invention relates to coordination of processing and interchange of data between modules in a mixed software environment.
Rapid, efficient analysis of data continues to be essential in both the civilian and the military sphere. Typical sensor-driven processing systems collect large amounts of data that must be analyzed and processed to provide the necessary information to perform a mission, complete a task, and/or achieve a goal. However, the data that is ingested into the data domain often are in different formats, with different metadata, or with other properties that can hinder the data-processing system's ability to quickly and accurately process the data to provide a usable result. In addition, the processed data often is not an end product in itself but is in turn input into other systems, and so in order to be useful, the processed data often must be in a form that is compatible with the next system in the data-processing chain.
For example, the Navy collects a myriad of data from numerous sensor sources, both during missions specifically intended to collect data and during other operations. This data is collected from sources such as sonobuoys, sidescan and multibeam sonar, fathometers, electro-optical imaging, and manually collected measurements, and includes underwater environmental data such as seafloor characteristics, ocean properties, and atmospheric properties. In addition, the data to be analyzed can be both historical data, i.e., previously collected data representing the environment at a previous point in time, and dynamic, real-time data, representing the environment at or near the time of analysis. The contents of a particular data set, both the type of physical measurement represented by the data (e.g., bathymetry, temperature, salinity, etc.) and its geospatial representation (points, lines, polygons, etc.), are “intrinsic” properties of the data, while the binary storage format (comma-separated text, NetCDF, ESRI Shape, SVG, etc.), file name, and file organization are generally considered to be “extrinsic” properties. See Erich Gamma et al., Design Patterns: Elements of Reusable Object-Oriented Software (1995). Thus, each data set can be characterized by the physical measurements it represents, its spatial representation, and its data storage format.
The Navy's post-mission analysis (PMA) of the data that it collects has a basic three-stage process. First, the sensor data is ingested from a raw, sometimes proprietary format. Second, the ingested data goes through one or more analysis steps, which may involve a human operator or an automated processing algorithm, and result in one or more derived data products. Finally, the data product(s) are exported to an external source, archived, or posted in a discoverable form. See John P. Stenbit, “Department of Defense Net-Centric Data Strategy,” May 9, 2003, Department of Defense memorandum, http://www.dod.mil/nii/org/cio/doc/Net-Centric-Data-Strategy-2003-05-092.pdf.
Combining historical and dynamic data to generate a useful product, such as a representative environment, requires advanced data fusion techniques. See Michael Harris et al., “Environmental Data Collection, Sensor to Decision Aid,” in Sixth International Symposium on Technology and the Mine Problem, May 9-13, 2004. Early PMA systems used a tool-chain of software programs. Each software component was specifically bound not only to a binary file format but also to the representation of data within that format. An expert operator, aware of the capabilities of each software program, would be required to manually execute the program to generate the desired product.
For example, bathymetry (water depth) soundings are geometrically represented as a series of points in three dimensions: latitude, longitude, and depth. These data can be stored in a file such as a comma-separated text file, whose format preserves the data content but relies on the operator to retain the data context. The bathymetry data can then be input into a bathymetry interpolation program that accepts comma-separated values and applies a tide correction shift to the point values to produce a gridded bathymetry product with evenly spaced, averaged bathymetry values over a given geographic area. A different environmental parameter, sea surface temperature, may be encoded in the same format as point values in comma-separated text and also input into the bathymetry interpolation program. The bathymetry interpolation process has no way to discriminate between the input types, yet applying it to the temperature data creates nonsense output. The discrimination between these two environmental parameters is left to the operator, slowing down the process considerably and limiting the ability of the system to rapidly, efficiently, and accurately process large sets of data.
Given this view of the data, a methodology for describing these intrinsic and extrinsic properties and allowing the processing components to self-describe their input-output interfaces greatly enhances the level of processing automation, error control, and context-awareness in the PMA software system.
This summary is intended to introduce, in simplified form, a selection of concepts that are further described in the Detailed Description. This summary is not intended to identify key or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter. Instead, it is merely presented as a brief overview of the subject matter described and claimed herein.
The present invention provides fully automated methods for coordinating the processing and analysis of data from disparate data sources to ensure that the data input to a data processing tool represents the proper physical measurements, has the proper spatial representation, and is in the proper file format to permit the data processing tool to produce the logically correct output. The present invention also permits the fully automated integration of multiple data processing tools, for example, individual data processing modules in an integrated data processing system, into a single platform that can process and analyze data from disparate data sources to produce appropriate output.
In accordance with the present invention, a data set can include a set of self-described data specifications defining the data in the data set. The data specifications can include a definition of a data type, e.g., a physical measurement represented by the data in the data set, a definition of a spatial representation of the physical measurement represented by the data, and a definition of a file storage format in which the data is stored on a medium.
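By way of illustration only, the three data specifications described above can be sketched as a small immutable record carried alongside the data values. The names `DataSpec`, `DataSet`, `measurement`, `representation`, and `storage_format`, and the sample values, are hypothetical and are not part of the disclosure.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DataSpec:
    """Hypothetical record of the three data specifications."""
    measurement: str      # physical measurement, e.g. "bathymetry"
    representation: str   # spatial representation, e.g. "points"
    storage_format: str   # file storage format, e.g. "csv"


@dataclass
class DataSet:
    """A data set that carries its own specifications with its values."""
    spec: DataSpec
    values: list = field(default_factory=list)


# Illustrative soundings: (latitude, longitude, depth).
soundings = DataSet(
    spec=DataSpec("bathymetry", "points", "csv"),
    values=[(30.1, -88.5, 12.4), (30.2, -88.4, 13.0)],
)
sst = DataSet(spec=DataSpec("sea_surface_temperature", "points", "csv"))

# Same representation and storage format, different physical measurement.
print(soundings.spec == sst.spec)
```

Because the specifications travel with the data, two comma-separated point files that look alike on disk remain distinguishable by their declared physical measurement.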
The data set is available for processing by a data processing system which can include one or more data processing tools. In accordance with the present invention, each data processing tool can have a set of self-described input specifications defining the input data that it will accept for processing. Thus, for example, a data processing tool can include self-described input specifications defining the acceptable data type, e.g., a physical measurement represented by acceptable input data, a spatial representation of acceptable input data, and a file storage format of acceptable input data. The set of input specifications can be incorporated into the data processing tool itself or can be externalized as a portable configuration file that is operatively associated with the data processing tool.
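As one possible sketch of an externalized set of input specifications, a portable configuration file could be read into a set of acceptable triples. The XML layout below (a `tool` element with `input` children and the attributes `measurement`, `representation`, and `format`) is an assumption made for illustration; the disclosure does not fix a particular schema in this passage.

```python
import xml.etree.ElementTree as ET

# Hypothetical configuration; element and attribute names are assumptions.
CONFIG = """
<tool name="bathymetry_interpolator">
  <input measurement="bathymetry" representation="points" format="csv"/>
  <input measurement="bathymetry" representation="points" format="netcdf"/>
</tool>
"""


def load_input_specs(xml_text):
    """Return the set of (measurement, representation, format) triples."""
    root = ET.fromstring(xml_text.strip())
    return {
        (e.get("measurement"), e.get("representation"), e.get("format"))
        for e in root.findall("input")
    }


specs = load_input_specs(CONFIG)
print(("bathymetry", "points", "csv") in specs)
print(("sea_surface_temperature", "points", "csv") in specs)
```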
The data processing tool also can have a set of output specifications defining the data specifications of the output data that it will produce, including definitions of the output data type, e.g., the physical measurement represented by the output data, the spatial representation of the output data, and the file format of the output data. The output specifications may be the same as the input specifications or they may be different, thus enabling the conversion of input data having a first set of data specifications into output data having a second set of data specifications.
In addition, in some embodiments, the data processing tool can comprise one module in an integrated data processing system containing multiple modules, where the output data from a first data processing tool may in turn be input into a second data processing tool which also has defined the characteristics of the input data that it will accept, and so on until the original data set is processed to its final output.
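The chaining of modules described above can be sketched as a check that each tool's output specifications match the next tool's input specifications. In this minimal sketch, compatibility is assumed to mean exact equality of the triples, and the tool names (`tide_correct`, `gridder`, `contour`) and their specifications are hypothetical.

```python
from typing import NamedTuple


class Spec(NamedTuple):
    measurement: str
    representation: str
    storage_format: str


class Tool(NamedTuple):
    name: str
    input_spec: Spec
    output_spec: Spec


def chain_is_valid(tools):
    """True if each tool's output spec equals the next tool's input spec."""
    return all(a.output_spec == b.input_spec for a, b in zip(tools, tools[1:]))


# Hypothetical tools for illustration only.
tide_correct = Tool("tide_correct",
                    Spec("bathymetry", "points", "csv"),
                    Spec("bathymetry", "points", "csv"))
gridder = Tool("gridder",
               Spec("bathymetry", "points", "csv"),
               Spec("bathymetry", "grid", "chrtr"))
contour = Tool("contour",
               Spec("bathymetry", "grid", "chrtr"),
               Spec("bathymetry", "lines", "shapefile"))

print(chain_is_valid([tide_correct, gridder, contour]))
print(chain_is_valid([tide_correct, contour]))
```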
In other embodiments, at any stage in the processing of the data set, an appropriate data processing tool having input specifications that are compatible with the data specifications of the data set can be automatically selected from among multiple possible data processing tools in the data processing system.
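A minimal sketch of such automatic selection follows, assuming each tool is described by the set of triples it accepts. The function `select_tool` and the tool names are hypothetical, and a practical system might rank several matching tools rather than return the first.

```python
def select_tool(data_spec, tools):
    """Return the name of the first tool that accepts data_spec, else None."""
    for name, accepted in tools.items():
        if data_spec in accepted:
            return name
    return None


# Hypothetical tools, each mapped to the set of triples it accepts.
TOOLS = {
    "bathymetry_interpolator": {("bathymetry", "points", "csv")},
    "temperature_mapper": {("sea_surface_temperature", "points", "csv")},
}

print(select_tool(("sea_surface_temperature", "points", "csv"), TOOLS))
print(select_tool(("salinity", "points", "csv"), TOOLS))
```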
The present invention can also provide a method for validating a data set for use with a data processing tool. Because a data processing tool can include a set of input specifications defining the physical measurement, the spatial representation, and the file storage format of data that it will accept for processing, a data set having a set of data specifications that is not compatible with the processing tool's input specifications can be rejected by the data processing tool, thus preventing the unnecessary processing of data that would generate nonsense or otherwise unusable results. In other embodiments, a data set having one or more data specifications which are not compatible with the corresponding input specifications of a data processing tool can be automatically converted into a revised data set whose data values remain unchanged but having revised data specifications that are compatible with those of the data processing tool so that the data can be processed rather than be rejected.
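The accept, convert, or reject behavior described above might be sketched as follows. The conversion table, the `IncompatibleData` exception, and the function names are assumptions; in particular, this sketch only revises the declared specification and leaves the data values untouched, whereas a practical converter would also re-encode the stored file.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class DataSet:
    measurement: str
    representation: str
    storage_format: str
    values: tuple


class IncompatibleData(Exception):
    """Raised when a data set can be neither accepted nor converted."""


# Hypothetical conversions: (specification name, old value) -> new value.
CONVERSIONS = {("storage_format", "csv"): "netcdf"}


def spec_of(ds):
    return (ds.measurement, ds.representation, ds.storage_format)


def accept_or_convert(ds, accepted):
    """Return ds if accepted, a converted copy if possible, else raise."""
    if spec_of(ds) in accepted:
        return ds
    for (field_name, old), new in CONVERSIONS.items():
        if getattr(ds, field_name) == old:
            candidate = replace(ds, **{field_name: new})
            if spec_of(candidate) in accepted:
                return candidate  # data values are carried over unchanged
    raise IncompatibleData(f"{spec_of(ds)} is not accepted")


ACCEPTED = {("bathymetry", "points", "netcdf")}
raw = DataSet("bathymetry", "points", "csv", ((30.1, -88.5, 12.4),))
converted = accept_or_convert(raw, ACCEPTED)
print(converted.storage_format, converted.values == raw.values)
```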
In some cases, the data specifications of a data set can automatically be set based on the source of the data. In other cases, if one or more of the data specifications is set, the remaining specifications can automatically be set, for example, to ensure that the set of data specifications as a whole will enable the data to be processed and produce useful output.
The present invention also includes one or more data processing tools, including a tool that can examine the input specifications of a data processing tool in a data processing system and the data specifications of a data set received for input into the data processing system and accept, reject, or modify the data set so that only a data set that can produce appropriate output is processed by the data processing system.
The aspects and features of the present invention summarized above can be embodied in various forms. The following description shows, by way of illustration, combinations and configurations in which the aspects and features can be put into practice. It is understood that the described aspects, features, and/or embodiments are merely examples, and that one skilled in the art may utilize other aspects, features, and/or embodiments or make structural and functional modifications without departing from the scope of the present disclosure.
The present invention provides a computer-implemented method for automatically organizing the processing of data from disparate data sources using disparate data processing tools to ensure that the resulting output is logically coherent and useful from an operational viewpoint. As will be appreciated by one skilled in the art, a method for automatically facilitating the processing of data in accordance with the present invention can be accomplished by executing one or more sequences of instructions contained in computer-readable program code read into a memory of one or more general or special-purpose computers configured to execute the instructions, wherein a data set can be input into a data processing tool and transformed into useful output data, where both the input and the output data represent a physical measurement and where both the input and the output data as well as the data processing tool have a specific set of parameters defining, for example, the type of physical measurement represented by the data, the spatial representation of the physical measurement within the data set, and the data storage format for storage of the data on a physical medium.
The present invention provides fully automated methods for coordinating the processing and analysis of data from disparate data sources to ensure that the data input to a data processing tool represents the proper physical measurements, has the proper spatial representation of those measurements, and is in the proper file format to permit the data processing tool to produce logically correct output. The present invention also permits the fully automated integration of multiple data processing tools, for example, individual data processing tools in a larger data processing system, into a single platform that can process and analyze data from disparate data sources to produce appropriate output.
In accordance with the present invention, a data set can include a set of data specifications defining the data in the data set. The data specifications can include a definition of intrinsic properties of the data set such as the physical objects that are the subject of the data set or a definition of extrinsic properties of the data set such as a software package in which the data set is used, or a combination of both types of properties. Thus, in an exemplary embodiment described herein, the data specifications of a data set can define a physical measurement represented by the data, a spatial representation of the physical measurement represented by the data, and the file storage format in which the data is stored on a medium.
The data set is available for processing by a data processing system which can include one or more data processing tools. In accordance with the present invention, each data processing tool can have a set of input specifications defining the intrinsic and extrinsic properties of input data that it will accept for processing. Thus, an exemplary data processing tool described herein can include input specifications defining the physical measurement represented by acceptable input data, the spatial representation of acceptable input data, and the file storage format of acceptable input data.
The data processing tool also can have a set of output specifications defining the output data that it will produce, such as definitions of the physical measurement represented by the output data, the spatial representation of the output data, and the file format of the output data. The output specifications may be the same as the input specifications or they may be different, thus enabling the conversion of input data having a first set of characteristics into output data having a second set of characteristics.
Thus, the present invention includes two main aspects. The first is a specification of the intrinsic and extrinsic properties of the data in a data set, referred to herein as the data specifications of the data set. The second is the definition by a data processing tool of the intrinsic and extrinsic properties of data that are acceptable for processing by the tool. In accordance with the present invention, a data processing tool can compare the data specifications of the data set with the input specifications of the data processing tool and take one or more actions as a result of that comparison.
It should be noted that not all relations between a physical measurement, a spatial representation, and a data storage format are valid. An example of a nonsense triplet would be sea temperature as polygonal areas stored in CHRTR format. Such a triplet would not allow the data set to provide useful results because the data storage format CHRTR cannot encode polygon shapes.
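A check for such nonsense triplets can be sketched with a table of the representations each storage format is able to encode. The table contents and the function name `is_valid_triplet` are illustrative assumptions; only the CHRTR and polygon example is taken from the description above.

```python
# Hypothetical capability table: representations each format can encode.
FORMAT_REPRESENTATIONS = {
    "chrtr": {"grid"},
    "shapefile": {"points", "lines", "polygons"},
    "csv": {"points"},
}


def is_valid_triplet(measurement, representation, storage_format):
    """True if the storage format can encode the given representation.

    The measurement is not constrained in this sketch; a fuller table
    could also restrict which measurements each format may hold.
    """
    return representation in FORMAT_REPRESENTATIONS.get(storage_format, set())


print(is_valid_triplet("sea_temperature", "polygons", "chrtr"))
print(is_valid_triplet("bathymetry", "grid", "chrtr"))
```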
The set of physical measurements, geospatial representations of those physical measurements, and file formats that can be accepted for processing by a data processing tool and the permissible combinations thereof comprise a set of input specifications of the data processing tool. The input specifications can be included as part of the data processing tool itself or can be implemented using a type-driven plug-in system. This data type signature of the data processing tool is declared and stored within the type-driven plug-in system. The system later uses this input-output signature to recall plug-ins that satisfy a processing need.
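One way such a type-driven plug-in system could be sketched is a registry keyed by each plug-in's declared input-output signature. The decorator `plugin`, the function `recall`, and the placeholder plug-in body are hypothetical; the description mentions C++ with Qt as one implementation option, and Python is used here only for brevity.

```python
from collections import defaultdict

# Registry mapping (input triple, output triple) -> list of plug-ins.
_REGISTRY = defaultdict(list)


def plugin(input_spec, output_spec):
    """Decorator that declares and stores a plug-in's type signature."""
    def register(func):
        _REGISTRY[(input_spec, output_spec)].append(func)
        return func
    return register


def recall(input_spec, output_spec):
    """Return plug-ins whose declared signature satisfies the need."""
    return list(_REGISTRY.get((input_spec, output_spec), []))


POINTS_CSV = ("bathymetry", "points", "csv")
GRID_CHRTR = ("bathymetry", "grid", "chrtr")


@plugin(POINTS_CSV, GRID_CHRTR)
def grid_bathymetry(points):
    """Placeholder body: mean depth only, standing in for real gridding."""
    return sum(p[2] for p in points) / len(points)


matches = recall(POINTS_CSV, GRID_CHRTR)
print([f.__name__ for f in matches])
```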
For example, in some embodiments in accordance with the present invention, the inspection and selection of an appropriate data set can be performed by a generic module interface that allows for the following basic capabilities:
Inspection of input specifications of a data processing tool
Inspection of output specifications of the data processing tool
Inspection of data specifications of data for input into the data processing tool
Execution of the data processing tool with appropriate data
Collection of output data
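The capabilities listed above can be sketched as an abstract interface that every module implements, together with a small driver that inspects the specifications before executing the tool and collecting its output. The class and method names (`ProcessingModule`, `input_specs`, `output_spec`, `execute`, `run`) and the toy `DepthAverager` module are assumptions for illustration.

```python
from abc import ABC, abstractmethod


class ProcessingModule(ABC):
    """Hypothetical generic module interface."""

    @abstractmethod
    def input_specs(self):
        """Return the set of triples the tool accepts."""

    @abstractmethod
    def output_spec(self):
        """Return the triple describing the data the tool produces."""

    @abstractmethod
    def execute(self, values):
        """Run the tool and return its output values."""


class DepthAverager(ProcessingModule):
    """Toy module: averages the depth of (lat, lon, depth) points."""

    def input_specs(self):
        return {("bathymetry", "points", "csv")}

    def output_spec(self):
        return ("bathymetry", "points", "csv")

    def execute(self, values):
        return [sum(d for _, _, d in values) / len(values)]


def run(module, data_spec, values):
    """Inspect specs, then execute and collect output only if compatible."""
    if data_spec not in module.input_specs():
        raise TypeError(f"{data_spec} not accepted by {type(module).__name__}")
    return module.output_spec(), module.execute(values)


spec, out = run(DepthAverager(), ("bathymetry", "points", "csv"),
                [(30.1, -88.5, 10.0), (30.2, -88.4, 14.0)])
print(spec, out)
```

The invoking application in this sketch depends only on the interface, which reflects the decoupling from the external process described in this section.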
This methodology allows a significant decoupling of the invoking application from the external process without requiring the operator to retain or exercise knowledge of the intrinsic properties of the input/output data types. The execution environment, for example, the Navy's Environmental Post-Mission Analysis (EPMA) system, can use the knowledge of the specification, and use an inspection of each module to enforce type compatibility at a logical level. Such an interface can be implemented using the C++ programming language using the Qt 4.3 open source library to facilitate compiled library loading, though any programming language supporting interface constructs and run-time loading of external libraries can be used, such as the Java programming language.
For example, the input specifications can be set forth in a portable XML file containing the specifications such as the exemplary set of input specifications illustrated in
In some embodiments, the data processing tool can also include a similar set of data specifications defining, for example, combinations of data type, representation, and storage format of data that are output by the data processing tool, and this definition of the output data can be used to determine whether that output data can be input into another data processing tool in an integrated data processing system.
In addition, in some embodiments, such a data processing tool can comprise an intermediate tool that can be used to convert an otherwise incompatible data set, e.g., one having an incompatible representation specification, to a data set having a set of data specifications that is compatible with the data processing system.
Exemplary logic flows for various exemplary embodiments of methods for self-describing data processing in accordance with the present invention are illustrated in
In a first exemplary embodiment illustrated in
As shown in
At step 302, a data set for input into the data processing tool can be received, the data set including a set of data specifications defining the data in the data set, where the data specifications include a physical measurement represented by the data in the data set, a geospatial representation of the physical measurement represented by the data, and a data storage format in which the data in the data set is stored on a storage medium. At step 303, the processor can compare the input specifications to the data specifications of the data set and at step 304 can inquire whether the data specifications of the data set are compatible with the input specifications of the data processing tool. If the answer at step 304 is “Yes,” the data set can be accepted for processing and at step 305 can be passed to the data processing tool for processing. If, however, the answer at step 304 is “No,” the data set can be rejected at step 306 and will not be processed by the data processing tool. This is especially critical in cases where incompatible data would otherwise be unconditionally processed by the tool, corrupting subsequent stages and silently producing incorrect output. In some embodiments, at optional step 307, an error message can be displayed on a display operatively connected to the processor, where the error message can provide information to a user regarding the rejection of the data set and the reasons for the rejection, such as identifying the incompatible input specification and data specification and/or providing suggestions for a remedy.
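Steps 303 through 307 can be sketched as a single comparison function that either accepts the data set or rejects it with a message naming the offending specification. The function name `check_data_set`, the message wording, and the rule used to identify the incompatible specification are assumptions made for this sketch.

```python
def check_data_set(input_specs, data_spec):
    """Compare specs (steps 303-304) and return (accepted, message)."""
    names = ("physical measurement", "geospatial representation",
             "data storage format")
    if data_spec in input_specs:
        return True, "accepted"          # step 305: pass to the tool
    # Step 306: reject. Step 307: name each specification whose value
    # appears in none of the tool's acceptable triples.
    problems = [
        f"{name} '{value}' is not accepted"
        for i, (name, value) in enumerate(zip(names, data_spec))
        if all(spec[i] != value for spec in input_specs)
    ]
    return False, "; ".join(problems) or "combination is not accepted"


ok, msg = check_data_set({("bathymetry", "points", "csv")},
                         ("sea_surface_temperature", "points", "csv"))
print(ok, msg)
```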
As noted above, in some embodiments, the data processing tool can also include a set of output data specifications defining data that is output from the data processing tool, the set of output data specifications including a specification of a physical measurement represented by the output data, a specification of a geospatial representation of the physical measurement represented by the output data, and a specification of a data storage format for storage of the output data. In this embodiment, in some cases the data set will be accepted for processing by the data processing tool only if it can produce data compatible with the output specifications.
In some embodiments, the data processing tool can also be a part of an integrated data processing system made up of multiple data processing modules, where the output from one data processing tool in an integrated data processing system can be input for another data processing tool in the system. In such embodiments, the output data set from the first data processing tool can have a set of output data specifications as defined by the first data processing tool and the second data processing tool can have a set of its own input specifications, which, like the input specifications of the first data processing tool, can define acceptable data that can be input into the tool for processing. The output data specifications of the output data set from the first data processing tool can then be inspected and compared to the input specifications of the second data processing tool in a manner similar to that described with respect to
In another embodiment of a method for data processing in accordance with the present invention, an appropriate data processing tool can be automatically selected from several possible tools in a data processing system based on the data specifications in a data set. An exemplary processing flow in accordance with this embodiment is illustrated in
As shown in
At step 402 shown in
Another embodiment of the present invention, which can be executed using the exemplary logic flow illustrated in
As shown in
As noted above, in some embodiments of the present invention, the data specifications of the data set can be automatically populated based on a source of the data set. For example, the “Bathymetry Attributed Grid” or “BAG” format has a specific naming convention. All BAG files possess a “.bag” extension and contain bathymetry measurements represented as a grid of points. Any data file that is suffixed with “.bag” can be inferred to have this specific type triple. In addition, as noted above, not all combinations of data specifications make logical sense. For example, an acoustic image representation of the seafloor can neither be represented by polygon features nor be persisted in a “Shapefile” format. In some such cases, only one data specification of the data set need be received by the computer, and the other data specifications can be automatically set so that they are compatible with the data processing tool. Formally, this is possible only when a particular value of a data specification occurs once and only once in the set of possible data type triples. The incidence of unique data specification values, and thus of automatic type inference, can be improved by refining the granularity of the data specification values. For example, “bathymetry” may be persisted in several formats and geometric representations. “Fused Bathymetry,” however, implies a specific format, i.e., “CHRTR,” and a specific representation, i.e., “gridded.”
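The inference rule described above, under which a full triple can be recovered only when a specification value occurs in exactly one known triple, can be sketched as follows. The table `KNOWN_TRIPLES` and the function names are hypothetical, and treating the file suffix as the storage format value is an assumption of this sketch.

```python
# Hypothetical set of known triples (measurement, representation, format).
KNOWN_TRIPLES = [
    ("bathymetry", "points", "csv"),
    ("bathymetry", "grid", "bag"),
    ("fused_bathymetry", "grid", "chrtr"),
]


def infer_triple(position, value, triples=KNOWN_TRIPLES):
    """Return the full triple if value occurs in exactly one known triple."""
    matches = [t for t in triples if t[position] == value]
    return matches[0] if len(matches) == 1 else None


def infer_from_filename(filename):
    """Infer the triple from the file suffix, used as the storage format."""
    suffix = filename.rsplit(".", 1)[-1].lower()
    return infer_triple(2, suffix)


print(infer_from_filename("survey_042.bag"))
print(infer_triple(0, "fused_bathymetry"))
print(infer_triple(0, "bathymetry"))
```

In this sketch the coarse value “bathymetry” appears in two triples and so cannot be resolved, while the finer-grained value “fused_bathymetry” resolves to a single triple.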
The present invention also can include a data input tool for carrying out a method for receiving and inputting a data set into a data processing system in accordance with one or more aspects described herein. Such a data input tool can include a data processing tool definition module which includes a set of input specifications defining acceptable data that can be input into a data processing tool and a data inspection module configured to inspect the data set to determine whether the data specifications of the data set are compatible with the input specifications of the data processing tool, so that the data set can be passed for processing by the data processing system only if all the data specifications of the data set are compatible with the input specifications of the data processing tool.
Thus, the method of the present invention abstracts the execution of both external and internal processes, while providing a programming language-independent method to conduct run-time interface inspection for complex data types. Though this technique was developed to support geo-spatial data types stored as binary formatted files, the type specification system is easily extended to other problem domains.
There are few facilities in place to support this level of context awareness, either in programming language constructs or in operating system-level features. Programming languages generally provide type-checking facilities for function invocation using basic data types such as integer, string, and real-valued number, or aggregations of these basic types as structures. The operating system itself gives the ability to name files, and file suffixes are typically used as an indicator of a file's format. However, a file suffix does not address the intrinsic properties of the data that the file describes, as does the method of the present invention.
It should be noted that one or more aspects of a system and method for self-describing data processing as described herein can be accomplished by one or more processors executing one or more sequences of one or more computer-readable instructions read into a memory of one or more computers from volatile or non-volatile computer-readable media capable of storing and/or transferring computer programs or computer-readable instructions for execution by one or more computers. Volatile media can include a memory such as a dynamic memory in a computer. Non-volatile computer readable media that can be used can include a compact disk, hard disk, floppy disk, tape, magneto-optical disk, PROM (EPROM, EEPROM, flash EPROM), SRAM, SDRAM, or any other magnetic medium; punch card, paper tape, or any other physical medium such as a chemical or biological medium.
Although particular embodiments, aspects, and features have been described and illustrated, it should be noted that the invention described herein is not limited to only those embodiments, aspects, and features. It should be readily appreciated that modifications may be made by persons skilled in the art, and the present application contemplates any and all modifications within the spirit and scope of the underlying invention described and claimed herein. For example, although the present invention has been described in terms of an exemplary set of data specifications and input/output specifications, it will be readily apparent to one skilled in the art that many other types of data specifications and input/output specifications are possible, and all such other types of data specifications and/or input/output specifications may be used as appropriate in the present invention. In addition, one skilled in the art would readily appreciate that the methodology described in the present disclosure generalizes to an arbitrary development environment and computer programming language, such as C++, Java, and Python. The scope of this methodology also generalizes to higher level architectures, from a desktop application to a networked service-oriented architecture. All such embodiments are also contemplated to be within the scope and spirit of the present disclosure.
This application claims the benefit of priority based on U.S. Provisional Patent Application No. 61/172,807 filed on Apr. 27, 2009, the entirety of which is hereby incorporated by reference into the present application.
Number | Date | Country
---|---|---
61172807 | Apr 2009 | US