The subject matter of this invention relates generally to data retrieval. More specifically, aspects of the present invention provide a tool for extracting schema from spreadsheets.
As information technology has improved in popularity, its usefulness as a way to store and retrieve data has become widely appreciated. Computers offer the ability to store data utilizing a fraction of the physical space required by paper-based storage solutions. In addition, access to the computer-based data can significantly reduce the retrieval time for the data.
To facilitate computer-based storage, several different types of storage paradigms have been developed. As can be appreciated, these paradigms can differ significantly with respect to characteristics such as simplicity of use and availability. For example, database-type storage solutions can offer interlinked data and/or or indexing that facilitate accessing and/or interpreting data. However, the time and knowledge needed to initialize the database-type storage solutions may be prohibitive for some users. In contrast, simple table-based data storage solutions, such as spreadsheets, provide a medium with greater ease of use for less sophisticated users, but this can sometimes come at the expense of data accessibility.
The inventors of the present invention have discovered that the current way of accessing data in table-based storage solutions such as spreadsheets can be improved. Specifically, the flexibility that allows a user to utilize a spreadsheet in many different ways can provide difficulties in attempting to access the data stored therein without human intervention. For example, because users are not required to define fields for data, to use standardized data constructs, and/or to provide a data definition that can be accessed by others, it becomes difficult for someone accessing the data to interpret the data that has been retrieved. To this extent, there is no way, given a set of unknown spreadsheets, to query the spreadsheets for a desired dataset. Furthermore, even though two different spreadsheets may have related information, a spreadsheet created by one individual may have a different format, different data types, different naming conventions, etc., that make using the spreadsheets in conjunction with one another a challenge.
In general, aspects of the present invention provide a tool for extracting schema from a spreadsheet. In an embodiment, a set of data that is stored in an uncataloged tabular format, such as a spreadsheet, is retrieved. The structure of the retrieved set of data is surveyed to determine the dataset schema thereof. Then, data elements within the dataset schema are analyzed to obtain information regarding the data elements. Based on dataset schema and the element information, an interface can be constructed that allows remote access to the set of data.
A first aspect of the invention provides a method for extracting spreadsheet schema, comprising: retrieving a set of data stored in an uncataloged tabular format; surveying a structure of the set of data to determine a dataset schema of the set of data; analyzing data elements within the dataset schema to obtain element information; and constructing an interface using the dataset schema and the element information for remotely accessing the set of data.
A second aspect of the invention provides a system for extracting spreadsheet schema, comprising at least one computer device that performs a method, comprising: retrieving a set of data stored in an uncataloged tabular format; surveying a structure of the set of data to determine a dataset schema of the set of data; analyzing data elements within the dataset schema to obtain element information; and constructing an interface using the dataset schema and the element information for remotely accessing the set of data.
A third aspect of the invention provides a computer program product stored on a computer readable storage medium, which, when executed performs a method for extracting spreadsheet schema, comprising: retrieving a set of data stored in an uncataloged tabular format; surveying a structure of the set of data to determine a dataset schema of the set of data; analyzing data elements within the dataset schema to obtain element information; and constructing an interface using the dataset schema and the element information for remotely accessing the set of data.
A fourth aspect of the invention provides a method for deploying an application for extracting spreadsheet schema, comprising: providing a computer infrastructure being operable to: retrieve a set of data stored in an uncataloged tabular format; survey a structure of the set of data to determine a dataset schema of the set of data; analyze data elements within the dataset schema to obtain element information; and construct an interface using the dataset schema and the element information for remotely accessing the set of data.
Still yet, any of the components of the present invention could be deployed, managed, serviced, etc., by a service provider who offers to implement the teachings of this invention in a computer system.
Embodiments of the present invention also provide related systems, methods and/or program products.
These and other features of this invention will be more readily understood from the following detailed description of the various aspects of the invention taken in conjunction with the accompanying drawings in which:
The drawings are not necessarily to scale. The drawings are merely schematic representations, not intended to portray specific parameters of the invention. The drawings are intended to depict only typical embodiments of the invention, and therefore should not be considered as limiting the scope of the invention. In the drawings, like numbering represents like elements.
As indicated above, aspects of the present invention provide a tool for extracting schema from a spreadsheet. In an embodiment, a set of data that is stored in an uncataloged tabular format, such as a spreadsheet, is retrieved. The structure of the retrieved set of data is surveyed to determine the dataset schema thereof. Then, data elements within the dataset schema are analyzed to obtain information regarding the data elements. Based on dataset schema and the element information, an interface can be constructed that allows remote access to the set of data.
Turning to the drawings,
Computing device 104 is shown including a processing component 106 (e.g., one or more processors), a memory 110, a storage system 118 (e.g., a storage hierarchy), an input/output (I/O) component 114 (e.g., one or more I/O interfaces and/or devices), and a communications pathway 112. In general, processing component 106 executes program code, such as schema extraction program 140, which is at least partially fixed in memory 110. To this extent, processing component 106 may comprise a single processing unit, or be distributed across one or more processing units in one or more locations.
Memory 110 also can include local memory, employed during actual execution of the program code, bulk storage (storage 118), and/or cache memories (not shown) which provide temporary storage of at least some program code in order to reduce the number of times code must be retrieved from bulk storage 118 during execution. As such, memory 110 may comprise any known type of temporary or permanent data storage media, including magnetic media, optical media, random access memory (RAM), read-only memory (ROM), a data cache, a data object, etc. Moreover, similar to processing component 116, memory 110 may reside at a single physical location, comprising one or more types of data storage, or be distributed across a plurality of physical systems in various forms.
While executing program code, processing component 106 can process data, which can result in reading and/or writing transformed data from/to memory 110 and/or I/O component 114 for further processing. Pathway 112 provides a direct or indirect communications link between each of the components in computer system 102. I/O component 114 can comprise one or more human I/O devices, which enable a human user 120 to interact with computer system 102 and/or one or more communications devices to enable a system user 120 to communicate with computer system 102 using any type of communications link.
To this extent, schema extraction program 140 can manage a set of interfaces (e.g., graphical user interface(s), application program interface, and/or the like) that enable human and/or system users 120 to interact with schema extraction program 140. Users 120 could include system administrators and/or clients who need to query and/or provide query and/or other access to a tabular dataset 200 (
In any event, computer system 102 can comprise one or more computing devices 104 (e.g., general purpose computing articles of manufacture) capable of executing program code, such as schema extraction program 140, installed thereon. As used herein, it is understood that “program code” means any collection of instructions, in any language, code or notation, that cause a computing device having an information processing capability to perform a particular action either directly or after any combination of the following: (a) conversion to another language, code or notation; (b) reproduction in a different material form; and/or (c) decompression. To this extent, schema extraction program 140 can be embodied as any combination of system software and/or application software. In any event, the technical effect of computer system 102 is to provide processing instructions to computing device 104 in order to remediate a migration-related failure.
Further, schema extraction program 140 can be implemented using a set of modules 142-148. In this case, a module 142-148 can enable computer system 102 to perform a set of tasks used by schema extraction program 140, and can be separately developed and/or implemented apart from other portions of schema extraction program 140. As used herein, the term “component” means any configuration of hardware, with or without software, which implements the functionality described in conjunction therewith using any solution, while the term “module” means program code that enables a computer system 102 to implement the actions described in conjunction therewith using any solution. When fixed in a memory 110 of a computer system 102 that includes a processing component 106, a module is a substantial portion of a component that implements the actions. Regardless, it is understood that two or more components, modules, and/or systems may share some/all of their respective hardware and/or software. Further, it is understood that some of the functionality discussed herein may not be implemented or additional functionality may be included as part of computer system 102.
When computer system 102 comprises multiple computing devices 104, each computing device 104 can have only a portion of schema extraction program 140 fixed thereon (e.g., one or more modules 142-148). However, it is understood that computer system 102 and schema extraction program 140 are only representative of various possible equivalent computer systems that may perform a process described herein. To this extent, in other embodiments, the functionality provided by computer system 102 and schema extraction program 140 can be at least partially implemented by one or more computing devices that include any combination of general and/or specific purpose hardware with or without program code. In each embodiment, the hardware and program code, if included, can be created using standard engineering and programming techniques, respectively.
Regardless, when computer system 102 includes multiple computing devices 104, the computing devices can communicate over any type of communications link. Further, while performing a process described herein, computer system 102 can communicate with one or more other computer systems using any type of communications link. In either case, the communications link can comprise any combination of various types of wired and/or wireless links; comprise any combination of one or more types of networks; and/or utilize any combination of various types of transmission techniques and protocols.
As discussed herein, schema extraction program 140 enables computer system 102 to extract spreadsheet schema. To this extent, schema extraction program 140 is shown including a dataset retrieval module 142, a dataset structure survey module 144, a data element analyzer module 146, and an interface constructor module 148.
Computer system 102, executing dataset retrieval module 142, retrieves a tabular dataset 152, where tabular dataset 152 is a set of data stored in a tabular format. Retrieval of tabular dataset 152 can be performed using any solution now known or later developed, including, but not limited from retrieval from a storage system 118, over a local area or wide area network, or the like, or creation by user 120. In any case, tabular dataset 152, as retrieved by dataset retrieval module 142 can be an uncataloged set of data. Specifically, tabular dataset 152 does not require the inclusion and/or association of interlinking data, indices, metadata or other external links into the data, interfaces, or other access tools in order to be utilized by schema extraction program 140.
Referring now to
In any event, once tabular dataset 200 has been retrieved, dataset structure survey module 144, as executed by computer system 102, can survey a structure of tabular dataset 200. This survey can be performed based only on the data that is found within tabular dataset 200, and thus without external access aids. For example, one or more rectangular areas within tabular dataset 200 can be identified. Each identified rectangular area can be an area within tabular dataset 200 that has contiguous data, that is, data elements 210 that contain data. In order to identify these rectangular areas, a line-by-line scan of tabular dataset 200 can be performed. This scan can be performed, similar to a computer graphics scan, by treating tabular dataset 200 as a two dimensional array and using a scan-line inspired algorithm to determine non-intersecting rectangles. Such a scan-line inspired algorithm can work as follows: scan-line 302 (
Even though empty rows (columns) may not require further processing, empty rows (columns) can be particularly important as they can be used to identify the boundaries between the rectangular data-containing areas. If an empty row is identified, then the algorithm can conclude that any future rectangles will not intersect with any rectangles identified thus far (due to the empty row) and therefore the algorithm can mark all previously identified rectangles as complete. For non-empty rows, whenever a non-empty cell in the tabular dataset is identified, it can be used to define a new rectangle initially only containing the single identified non-empty cell. Then, the algorithm can test whether there is any adjacent rectangle (in the same row) that is adjacent to the newly created rectangle, and if this is the case the two rectangles can be merged into one (thereby extending the boundary of the previously identified rectangle).
The algorithm can also consider the case in which a rectangle is adjacent or overlaps with a previously identified rectangle in one of the previous rows. This consideration can involve at least four different cases to identify overlaps, including: (a) whether a previously identified rectangle is adjacent on the upper row of the current rectangle with the left or right column of the other rectangle within the boundaries of the current rectangle; (b) whether a previously identified rectangle is adjacent on the lower row of the current rectangle with the left or right column of the other rectangle within the boundaries of the current rectangle; (c) whether a previously identified rectangle is adjacent on the left column of this rectangle with the upper or lower row of the other rectangle within the boundaries of the current rectangle; and/or (d) whether the previously identified rectangle is adjacent on the right column of this rectangle with the upper or lower row of the other rectangle within the boundaries of the current rectangle. If any of these four cases applies, the two rectangles can be merged into one. The algorithm terminates when all the rows (and columns) in the tabular dataset are processed.
Turning now to
In any event, as shown in
The information returned by line-by-line scan performed using scan-line 302 can also be used to determine type information for the data elements within a particular rectangular area 310a-f. For example, a set of known data types can be created based on the data identified in the tabular dataset and their correspondence with well-known data types used in computing environments (e.g. strings, integers, floats, dates, times). Popular tabular datasets (e.g. commercially available spreadsheets) often have data types that are used specifically with a particular product, and these can used as an initial type system. Alternatively, data types can be imported. These known data types can be imported from any source, including, but not limited to from previous analysis of other spreadsheets. Data elements within the rectangular area 310a-f can then be compared with these data types to attempt to determine whether the data types correspond.
Turning now to
The information returned by line-by-line scan performed using scan-line 402 can also be used to determine a set of header identifiers within a particular rectangular area 410a-f. For example, contents of data locations within rectangular area 410a-f, particularly data locations adjacent to the border, can be analyzed to determine whether they contain textual data. If these data locations are found to contain textual data, the data can be analyzed to determine whether it corresponds to common values for known header identifiers. For example, values such as “name”, “date”, “amount”, “cost”, and the like, if found in these data locations could be determined as being header identifiers. In an embodiment, the textual data can be compared with an external source, such as a dictionary, an ontology, and/or the like. Further, if multiple linear arrays of header identifiers are found in a single rectangular area 410a-f, a type hierarchy can be created by relying on the merging attributes of the data locations within the rectangular area 410a-f.
Referring back to
Referring again to
As such, the interface constructed by interface constructor module 148 provides users 120 a tool to access and understand the data within tabular dataset 152 that would otherwise be unavailable. Further, this data could be combined with data from other such datasets 152 and/or with more structured data such as from one or more structured databases, thus providing greater accessibility to existing data. In this way, a user 120 can issue a structured query without knowledge of the data in the tabular dataset 152 and receive in return data elements in the tabular dataset 152 that satisfy the structured query. Further, the evaluating of such a structured query with respect to the tabular dataset can return a trigger interface to iterate over the data elements that satisfy the structured query This trigger interface could offer the ability to iterate one-by-one over all the elements of the tabular dataset that satisfy a query. In more detail, the interface could provide methods to the user to return the size of the answer set to the query (say, a size( )method), as well as methods to test whether the answer set is empty (say, a is Empty( )method), and also methods to get the first answer in the answer set (say, a getFirst( )method). Also, the user could be able to use a next( )method to get the next answer after the current one, until all the answers in the answer set have been processed. In this manner, and with such an interface, the user will be able to get all the answers to a query, starting from the first
Turning now to
Turning now to
While shown and described herein as a method and system for extracting spreadsheet schema, it is understood that aspects of the invention further provide various alternative embodiments. For example, in one embodiment, the invention provides a computer program fixed in at least one computer-readable medium, which when executed, enables a computer system to extract spreadsheet schema. To this extent, the computer-readable medium includes program code, such as schema extraction program 140 (
In another embodiment, the invention provides a method of providing a copy of program code, such as schema extraction program 140 (
In still another embodiment, the invention provides a method of generating a system for remediating a migration-related failure. In this case, a computer system, such as computer system 120 (
The terms “first,” “second,” and the like, if and where used herein do not denote any order, quantity, or importance, but rather are used to distinguish one element from another, and the terms “a” and “an” herein do not denote a limitation of quantity, but rather denote the presence of at least one of the referenced item. The modifier “approximately”, where used in connection with a quantity is inclusive of the stated value and has the meaning dictated by the context, (e.g., includes the degree of error associated with measurement of the particular quantity). The suffix “(s)” as used herein is intended to include both the singular and the plural of the term that it modifies, thereby including one or more of that term (e.g., the metal(s) includes one or more metals).
The foregoing description of various aspects of the invention has been presented for purposes of illustration and description. It is not intended to be exhaustive or to limit the invention to the precise form disclosed, and obviously, many modifications and variations are possible. Such modifications and variations that may be apparent to an individual in the art are included within the scope of the invention as defined by the accompanying claims.