The present invention relates generally to data pattern recognition and extraction. More particularly, the invention relates to a method and computer program product for data pattern recognition and extraction.
With increasing competition in the corporate world, companies are constantly striving to improve their market strategies. In one aspect, the efficient sharing and analysis of performance or market figures is essential to making sound business decisions.
In many situations, however, data is not readily available in a single document nor is it in a format that is easily analyzable. It is desired, for example, to have the data in a single database-compatible document, wherein interactive queries can be utilized to quickly and easily find specific data in the document. From another perspective, it is very important that any data extraction and/or consolidation method or computer program product require little configuration time from the part of the user.
For example, in a spreadsheet having defined rows and columns, such as an Excel spreadsheet, one or more data tables may be available. Data in the tables may or may not be related. However, it is desired, for example, to be able to merge related data in order to obtain a high-level understanding of the data comprised in the tables.
What is needed therefore is a method and a computer program product to extract data from one or more data files, and to consolidate the extracted data in database-compatible output formats. Further, a data extraction method and computer program product that reduce the data extraction configuration time are also needed.
The present invention relates to a method and a computer program product for data pattern recognition and extraction.
In one aspect of the invention, there is provided a computer implemented method for manually and/or automatically configuring a data extraction from one or more input files. A user selects one or more input files for data extraction. In one embodiment, a user interface of the present invention allows the user to manually specify configuration parameters for the data extraction. In another embodiment, the present invention provides a plurality of heuristics to automatically detect data extraction areas located in one or more input files, automatically identify a layout type for each extraction area, and generate one or more data extraction outputs according to user-defined or pre-configured report types. Further, the present invention comprises additional heuristics to merge data extracted from multiple extraction areas whenever the extracted data is logically related.
In another aspect of the present invention, the configuration parameters of a data extraction are converted into metadata, and associated with the input file of the data extraction. For subsequent data extractions from an updated version of the input file, the metadata is used to automatically extract data from the updated input file according to the previously configured data extraction, without the need for a manual re-configuration of the data extraction.
The invention can be practiced with, for example and without limitation, spreadsheets having defined rows and columns, such as Excel spreadsheets.
Further embodiments, features, and advantages of the present invention, as well as the structure and operation of the various embodiments of the present invention, are described in detail below with reference to the accompanying drawings.
The accompanying drawings, which are incorporated herein and form a part of the specification, illustrate the present invention and, together with the description, further serve to explain the principles of the invention and to enable a person skilled in the pertinent art to make and use the invention.
The present invention will be described with reference to the accompanying drawings. The drawing in which an element first appears is typically indicated by the leftmost digit(s) in the corresponding reference number.
Overview
The present invention provides a method and a computer program product for automated data pattern recognition and extraction. In an embodiment of the present invention, the computer program product includes an execution module and a user interface.
The execution module comprises a plurality of sub-modules including sub-modules to identify table areas in a tabular data file, sub-modules to identify rows and columns in table areas, and sub-modules to extract data from the table areas. In another embodiment, the execution module also includes sub-modules to aggregate data extracted from one or multiple data files.
The user interface serves to customize a data extraction according to an extraction strategy provided by a user. In one embodiment of the present invention, the user interface receives a plurality of user inputs or configuration parameters that are used to configure a data extraction. The configuration parameters are relayed by the user interface to the execution module, which executes a data extraction based on the received configuration parameters. The result of the data extraction is then relayed back to the user through the user interface. Additional features of the execution module and the user interface will be set forth in the description that follows.
In the example of
Typically, a tabular data file includes a plurality of data cells organized in rows and columns that define a plurality of data axes in the tabular data file. Data axes are used to classify data in the tabular data file. In one embodiment of the present invention, data axes in a tabular data file include at least one of a value, text, or time axis. Similarly, non-empty data cells in the tabular data file have value, text, or time data types. As can be understood by a person skilled in the art(s), a non-empty data cell having a value type belongs to a value data axis, for example. In another embodiment, a data axis comprises both empty and non-empty data cells. In another embodiment of the present invention, a tabular data file has a flat or a matrix layout. In a flat layout, each data column is associated with a single data axis and holds data of a common type. Further, a header row of the tabular data file defines names for the data axes in the file. In a matrix layout, however, each data column of value type is associated with a plurality of data axes.
Referring back to
In step 106, the user selects between a manual or an automatic data extraction configuration.
If a manual configuration is selected in step 106, process 100 branches to step 108. In step 108, the user manually configures the data extraction. In an embodiment, the user interface receives user configuration parameters specifying a data extraction area and data axes for the data extraction. Data is then extracted, in step 110, according to the configuration parameters received from the user in step 108.
If an automatic configuration is selected in step 106, process 100 branches to step 116. In step 116, configuration parameters that define a data extraction are automatically detected by the execution module. Data will be then automatically extracted from the one or more input data files. It is to be noted here that when an automatic configuration is selected, tasks relating to the data extraction are performed by a computer process without intervening user instructions. Additional features and advantages of each of the manual and automatic configuration will be set forth in the description that follows.
In steps 112 and 118, the user can preview the result of the data extraction.
In an embodiment, the user refines configuration parameters upon previewing the result of the data extraction. In another embodiment, the user modifies the type of the data extraction configuration after previewing the result of the data extraction. In an exemplary embodiment, the user selects a manual data extraction configuration upon previewing the result of a previous automatic data extraction configuration.
If, after previewing the results of the data extraction in step 112 and/or 118, the user finds the result of the data extraction acceptable, the user saves the result of the data extraction in step 114 or 120. The extracted data is saved into an output file specified by the user. In an embodiment, the output file has a flat layout compatible with database use.
In another aspect of the present invention, upon saving the result of a data extraction, the configuration parameters of the data extraction are converted into metadata which is associated with the input data file. In an embodiment of the present invention, for each subsequent data extraction from an updated version of the input file, the metadata is used to automatically extract data from the updated input file according to the previously configured data extraction. In other words, a data extraction re-configuration is not needed for a future data extraction from an updated version of the input file. According to the present invention, an updated version of the input file can have modified values and/or formatting of data cells, new rows corresponding to new text axes, and new columns corresponding to new time and/or value axes.
Manual Data Extraction
In step 502, the user inputs configuration parameters to specify a data extraction area. The data extraction area defines an area of data cells from which data will be extracted. In an embodiment of the present invention, specifying the data extraction can be done by either specifying a range of data cells or an extraction starting cell. In specifying a range of data cells, the data extraction area corresponds to the specified range. In specifying a starting cell, the data extraction area corresponds to all data cells below and to the right of the starting cell.
In step 504, the user selects whether a data transposition is to be employed in the data extraction.
Referring back to
In another aspect, the present invention provides several optional features relating to shaping the structure of the output of the data extraction. For example, in cases where the text axis comprises a logical hierarchy, the present invention provides user options to reflect that hierarchy in the data extraction output. In embodiments of the present invention, text indentation and cell formatting information in a text axis are used to detect the different levels of hierarchy in the text axis.
At the end of the data axes configuration step, the data extraction is fully configured to be executed. As described earlier, a preview feature of the present invention allows the user to preview the expected data extraction output before completing the extraction. In embodiments of the present invention, various pre-defined output formats are also available for selection by the user at the beginning of the configuration.
Automatic Data Extraction
In another aspect, the present invention provides a method for automatically extracting data from an input file. According to this aspect, the invention provides a plurality of heuristics to automatically detect data extraction areas located in one or more input files, automatically identify data axes within each detected extraction area, and generate one or more extraction outputs according to user-defined or pre-configured report types. Further, the present invention comprises additional heuristics to merge data extracted from multiple extraction areas whenever the extracted data is logically related.
Once table areas are detected in the input file in step 1310, the exact range of each table area is identified in step 1320. In an embodiment of the present invention, starting at a non-empty cell of a table area, adjacent cells (top, bottom, left, right) are evaluated to find a non-empty cell. If an adjacent non-empty cell is found, the evaluation continues at that cell. Otherwise, the recursion stops. Cells determined non-empty at the end of the recursion define the exact range of the table area. At each point of the recursion, non-empty cell positions corresponding to the top left, top right, bottom left, and bottom right positions are updated and recorded. In another embodiment, if the input file is of a rich file format such as a Microsoft Excel document, for example, range names, if defined, can be used to detect the exact range of a table area in the input file. Typically, range names are user-defined names that refer to a range of data cells.
Having identified the table areas in steps 1310 and 1320, the next step 1330 is to identify a layout type for each table area in the input file. As, typically, a table area comprises header cells and data cells, identifying a layout type for the table area amounts to determining the different cell areas in the table.
According to an embodiment of the present invention, each table area in the input file is evaluated against a plurality of table layout types. An evaluation score is calculated for each layout type, and the layout type corresponding to the maximum evaluation score is identified as a layout type for the table area. Typically, a number of layout parameters are associated with each layout type. Layout parameters are parameters that characterize a layout. In a flat table layout, for example, a “header row” is one of the characteristic layout parameters. As can be understood by a person skilled in the art(s), a number of layout parameters may be general to more than one layout type, while others are layout-specific. We refer to general layout parameters as first level layout parameters, and to layout-specific parameters as second level layout parameters in what follows. Additional features of the heuristic method used for table area layout recognition will now be described.
In a first step, a set of first level layout parameters are located in the table area. In an embodiment, the set of first level layout parameters include a top left data cell. Typically, the top left data cell indicates a starting cell for a header row in the table area. In an embodiment of the present invention, locating a first level layout parameter in a table area is done by evaluating each cell in the table area against a set of rules associated with the layout parameter. Typically, the set of rules associated with a layout parameter characterize location and formatting information of the parameter. However, although first level layout parameters are general to all layout types, the evaluation rules associated with them can be different from one layout type to another. Tables 2 and 3 illustrate an example of layout parameter rules for the “top left data cell” in a flat and a matrix layout, respectively.
In an embodiment of the present invention, when locating a first level layout parameter, each data cell in the table area is evaluated, in parallel, with respect to the layout parameter rules of each of the plurality of layout types. As a result, for each layout type, a weight is calculated for each data cell in the table area with respect to the layout parameter. The weight represents a likelihood that the data cell corresponds to the layout parameter in the table area. In an embodiment, data cells having a weight higher than a pre-defined threshold are remembered as potential candidates for the layout parameter. A data cell having the maximum weight is selected as the layout parameter. As it is apparent to a person skilled in the art(s), the data cell corresponding to the maximum weight may or may not be the same data cell for all layout types.
In a second step of the layout recognition heuristic, a set of second level layout parameters are located in the table area. The objective is that, in locating the second level layout parameters of a layout type, the layout type can be either validated or removed from consideration as a potential layout type for the table area.
Typically, first and second level layout parameters allow the full characterization of a data extraction area in a table area according to a layout type. Second level layout parameters, however, are not independent from first level layout parameters. In a flat layout type, for example, the “header row ending cell” is one of second level layout parameters of the layout type. The “header row ending cell” is related, however, to the “top left data cell”, a first level layout parameter. This is true because the “top left data cell” defines the starting cell of a header row, and therefore, both parameters should occur on the same row level.
As a result of this dependence between first level and second level layout parameters, locating second level layout parameters of a layout type must be done based on a set of first level layout parameters. Accordingly, when locating a second level layout parameter, a set of first level layout parameters is assumed, and the second level layout parameter is located based on it. As a result, only data cells that may correspond to the second level layout will be considered. For example, in a flat layout type, when locating a “header row ending cell”, only data cells located on the same row level as the “top left data cell” will be evaluated.
In an embodiment of the present invention, the data cells corresponding to the maximum weight for first level layout parameters, in the first step, are initially assumed. Based on them, second level layout parameters are located in the table area using a rule-based evaluation approach similar to the one described above with respect to first level parameters. If second level layout parameters are successfully located based on the assumed first level layout parameters, the data cells corresponding to the maximum weights for both the first and second level layout parameters are selected as first and second level layout parameters for the table area. If second level layout parameters could not be located, however, the location process is repeated with respect to the data cells having the second highest weights for first level layout parameters in the first step. As can be understood by a person skilled in the art(s), the process of locating second level layout parameters based on first level layout parameters can be recursively repeated until second level layout parameters are successfully located or the list of potential candidates for first level layout parameters is exhausted. Also, as is apparent to a person skilled in the art, second level layout parameters are successfully located when the evaluation process results in a set of data cells having evaluation weights higher than a pre-defined threshold.
After locating first and second level layout parameters for each layout type in a table area, a match probability is calculated for each layout type. The match probability is defined as the probability that the layout type matches the layout of the table area. The layout type having the maximum match probability is selected as the layout for the table area.
At this point of the automatic data extraction process, table areas have been detected and their extraction areas and layout types identified. Referring back to
In step 1350, data is extracted from each of the table areas identified in the previous steps of the process. When data is extracted from a plurality of table areas in the input file, the extracted data may be merged together in the data extraction output. In an embodiment, data extracted from multiple table areas are analyzed to determine logical associations among them. In another embodiment, names of data axes are compared to determine if the data can be related.
In step 1360, an output of the automatic data extraction is generated. The output may comprise one or more tables depending on whether or not data is merged in the previous step. In an embodiment, generated output tables have a flat layout type compatible with database use. As described earlier with regard to the manual data extraction configuration, a preview feature of the present allows the user the ability to preview the expected data extraction output before completing the extraction. In embodiments of the present invention, various pre-defined output formats are also available for selection by the user at the beginning of the configuration.
While various embodiments of the present invention have been described above, it should be understood that they have been presented by way of example only, and not limitation. It will be apparent to persons skilled in the relevant art that various changes in form and detail can be made therein without departing from the spirit and scope of the invention. Thus, the breadth and scope of the present invention should not be limited by any of the above-described exemplary embodiments, but should be defined only in accordance with the following claims and their equivalents.