Information management systems are used extensively throughout business to manage information and operations. For example, multi-national companies and other organizations (e.g., governmental) or enterprises typically have many individual office sites that have respective information systems that manage local operations. The enterprises generally have a need to be able to collect the information from each of the office sites to be able to create enterprise-level views or reports for management, investors, and government regulators. Large enterprises typically allow the local operations to create and/or manage their respective information management systems. Even when large enterprises have certain central information management systems, it is still quite common that regional operations operate independently from other regional operations with regard to information management. As a result, for large enterprises to collect and manage the massive amount of management and operational data that is typical in today's business world, large-scale information management systems with data repositories that have many tables are utilized. In some cases, a common data repository, such as SAP, Oracle, or other commercial data repository is used. In other cases, multiple, different data repositories are used within the same enterprise. In either case, many tables are often created by the data repositories so that certain processing functions, such as searching, operate faster. The reason for the speed increase is generally due having less data within each table (i.e., “thinner” tables”).
More particularly, processing large-scale data repositories that include one or more different data structures for storing data is time-consuming. Such large-scale data repositories may include upwards of 100,000 tables. Currently, a process for pointing or creating connections to each table within the large-scale data repositories so that data in each of the tables can be accessed is a manual process, and may take five minutes or more for connecting to each table. Both extract, transform, and load (ETL) and Big Data approaches require manual efforts to set-up each individual connection to each table. Moreover, as data repositories and/or enterprises grow and add more data repositories, the difficulty of keeping track of which table connections have been created for future enterprise-level data collection efforts grows.
The principles of the present invention provide for a system and process that allows for identifying tables within a large-scale data repository and automatically creating connections to the tables so that a data collection process may utilize those connections. In identifying the tables, two process options are possible, including undirected and directed processes. The undirected process may be performed by the system receiving a connection parameters of a data repository, and automatically determining a list of all tables in the data repository across schemas. The directed process provides a specific list of tables in the data repository to use. In performing the undirected process, a connection string to each data repository may be used to establish access to system-level metadata of the one or more data repositories. A catalog may be read from each data repository in which enterprise data, such as management and operational data, may be accessed, where the catalog(s) provide a list of tables of each data repository. Connection strings to the tables may be created and stored in a data file or other storage location. The directed process may be performed by processing specific table information of a subset of tables in a data repository as provided via an input file, and creating connection strings for each of the tables listed in the input file. It should be understood that the principles of the present invention may be applied to one or more data repositories in a parallel or serial manner.
One embodiment of a method for automatically establishing connections to tables within a data repository may include establishing a pointer to a data repository inclusive of a plurality of tables. Connection data for the respective tables, which are inclusive of data values and metadata associated with the data values, may be automatically created. In one embodiment, system-level metadata may be accessed within a catalog of the data repository, and the system-level metadata may include the connection data. The connection data may be inclusive of a table ID, table type, and/or other information associated with the tables that provides the ability to create a connection to a table. Inclusive of metadata for the respective tables may be a current version of the metadata so that as tables and metadata associated with the tables change, that the metadata associated with the tables may be retained at each date and time that the process is performed. Connection data associated with a plurality of tables may be identified within the system-level metadata. In an alternative embodiment, a subset of the tables may be provided to the process in a file so that connections to the subset of tables may be automatically created at run-time.
Another embodiment of a method for automatically establishing connections to tables within a data repository may include establishing a pointer to a data repository inclusive of a plurality of tables. Connections for the respective tables inclusive of connection parameters and metadata for each of the tables may be automatically created. Yet another embodiment of a method for automatically establishing connections to tables within a data repository may include establishing a pointer to a data repository inclusive of a plurality of tables. A catalog of the data repository may be accessed. Table names may be read from the catalog, and connections for the respective tables.
A more complete understanding of the method and apparatus of the present invention may be obtained by reference to the following Detailed Description when taken in conjunction with the accompanying Drawings wherein:
With regard to
Because enterprises, especially large, multi-national enterprises, often use different enterprise management systems with data repositories, such as SAP, Oracle, and other enterprise-scale information management systems, management of respective information management systems, naming conventions, data structures, and a variety of other aspects of managing information of the enterprise 100 may vary. As a result of local management of information management systems, the ability for enterprise-level information that enables senior management to quickly access and generate enterprise-level information (e.g., total sales revenue) is typically not possible. It is not uncommon for such enterprise-level information to take upwards of 9 to 12 months to be generated as a result of enterprise management systems having hundreds or thousands of tables being used within one or more data repositories to manage operational data of the enterprise. Such limitations are often a result of legacy information management systems being utilized by large enterprises, but such limitations may also exist due to the nature of enterprise information management systems, legacy or current.
With regard to
The illustrative network configuration shows four levels of computing systems. It should be understood that many more levels and layers of complexity of network architecture is often used within large enterprises, which creates a data repository, such as a relational database, that has hundreds or thousands of tables in which information generated by each facility of an organization is stored. As an example, in the case of a multi-national retail operation, such as a retail store, sales of clothing and other products may be generated in vast amounts. Moreover, information of customers may be collected and stored for a variety of different purposes, including returns, marketing, demographic assessment, and many other reasons, as understood in the art. The reason for a data repository utilizing hundreds or thousands of tables is to be able to provide for fast querying or perform other functions in a fast manner within the information management system. In other words, many data repositories of information management systems utilize relatively “shallow” tables, but use the many tables so as to increase the speed of searching for data contained therein. As a result, however, the ability to (i) keep track of names and locations of tables, (ii) know what data is in each table, and (iii) aggregate the data from each of the tables can be incredibly time consuming and difficult. In addition, and as previously described, the ability to connect to each table to be able to access and collect the data from each table can be very time consuming.
With regard to
As understood in the art, and as previously described, the number of data repositories of a relatively large enterprise may range upwards of 100,000, and be used to store operational data of the enterprise, such as an accounting firm, retail chain, or otherwise. It should also be understood that not all enterprises use a single or common information technology platform for managing operational and other data, which compounds the problem of creating connections to and extracting data from the tables. Moreover, as a result of utilizing the principles of the present invention, the time to create connections to each of the tables may be reduced from five minutes to milliseconds, and an entire set of tables may be reduced from weeks or months to a few minutes, which is a significant reduction in both time and expense.
With regard to
Two subsets of table connection parameters 404 shown provide table IDs (“<objectName>”) and connections at a schema level (“<DataSourceAlias>”), and be provided to a process to expedite identification of tables that are a subset of tables in the catalog(s) of the data repository. This process can be used if a user knows which specific tables within a data repository are going to be used for gathering data rather than creating connections to all of the tables, a subset of connections may be provided by the table connection parameters 404. The subset of table connection parameters 404 may be created automatically utilizing a data crawling process, as provided in co-pending U.S. patent application having Ser. No. 14/297,543 filed Jun. 5, 2014 entitled “System and Method for Finding and Inventorying Data from Multiple, Distinct Data Repositories,” the contents of which are hereby incorporated by reference in their entirety. The connection parameters 402 may use an XPATH format or any other link or connection format, as understood in the art. Connection data, which may include certain components of the subset of table connection parameters 404 or table information extracted from catalogs of the data repository, may be created, stored, and used during runtime. The connection data, which may be created automatically utilizing the principles of the present invention, may include code that is utilized during runtime to cause a process or system to access tables associated with the connection data.
With regard to
In the “undirected” path, the process may start at step 504, where database connection strings and tables may be obtained from an input table. In response to obtaining the connection strings, the module 504 may parse the connection strings or otherwise use identifiers to identify table connection parameters. Such an input table may include subsets of table connection parameters, such as the table connection parameters 404 shown in
In one embodiment, table level metadata documents may be created at step 512 and record level documents may be created at step 514 using the connections to the individual tables created at step 510. These documents may be utilized for generating enterprise-level information, such as overall business accounting information (e.g., total sales revenue), internal information (e.g., demographic landscape of all employees), etc. It should be understood that steps 512 and 514 are illustrative and that any additional and/or alternative processes for using the data of the tables may be utilized in accordance with the principles of the present invention.
In summary, the use of the process 500 may be used to (i) identify all data repositories in which tables used to manage data of an enterprise exist, (ii) create connections to each of the tables in the data repositories, (iii) build an inventory of the tables of interest, which may include a subset of the tables or all of the tables, and (iv) collect and process data values and metadata from each of the tables to generate information desired (e.g., total sales of an enterprise).
With regard to
In addition to creating connections, the principles of the present invention may provide for creating data files associated with each table or one or more data files that is associated with multiple tables. The data file(s) may be used to copy data, both data values and metadata, contained in each of the tables. The data file(s) may be schema-less, and the data may have a schema-less format. Moreover, the data file(s) may not have a data format or be formatted. By copying the data from the tables into separate data file(s), faster processing of the data may be performed.
The previous description is of a preferred embodiment for implementing the invention, and the scope of the invention should not necessarily be limited by this description. The scope of the present invention is instead defined by the following claims.