The present disclosure relates generally to enterprise data handling. More specifically, the present disclosure relates to generation and storage of bulk data in a unified and compressed form. For example, in one embodiment, statistical analysis system data may be unified and compressed for ease of storage and/or access.
A summary of certain embodiments disclosed herein is set forth below. It should be understood that these aspects are presented merely to provide the reader with a brief summary of these certain embodiments and that these aspects are not intended to limit the scope of this disclosure. Indeed, this disclosure may encompass a variety of aspects that may not be set forth below.
Generally speaking, embodiments provided herein relate to systems and methods for creating, storing, and/or using bulk data (such as statistical analysis system data) in an efficient manner. While the following discussion refers to statistical analysis system (SAS) data, the current approaches could be used with any bulk data, especially bulk data where processing the data relates to reading data tables in their entirety, such as bulk data associated with analytical and reporting processes. In certain embodiments, a statistical analysis system (SAS®) data step view (e.g., a compiled machine language program) may be associated with compressed payload data that may be self-extracting upon opening the SAS® data step view with the SAS® software. The SAS® data step view may appear to a user as a data source. However, it is actually an executable program that transparently renders data as it is read, so that the compressed data may appear to the user as a table of rows and columns. SAS® data step views typically transform data from external sources. However, by appending payload data to the SAS® data step view, local payload data may be transformed into meaningful formatted data by the SAS® data step view.
Additionally and/or alternatively, metadata may be captured that may enable non-SAS-specific (e.g., “generic”) hosts and/or clients to re-create bulk data in a manner interpretable by the non-SAS® specific hosts.
Accordingly, the techniques and systems provided herein may greatly improve operation of computer systems, such as systems designed to render data for analytical purposes (e.g., systems executing Statistical Analysis System (SAS®) data step views, data hosts, general purpose computers, etc.). In some embodiments, parallel compression and/or decompression of portions of the data may positively impact processing time for the compression and/or decompression processes. For example, a quad-core processor complex running four parallel decompression processes, each on a different core, may be more than three times faster than a single decompression process running on the same quad-core processor complex. Further, by generating the metadata, storage and rendering of self-describing bulk data (e.g., tabular data) may be available to a wide variety of applications.
Various refinements of the features noted above may exist in relation to various aspects of the present disclosure. Further features may also be incorporated in these various aspects as well. These refinements and additional features may exist individually or in any combination. For instance, various features discussed below in relation to one or more of the illustrated embodiments may be incorporated into any of the above-described aspects of the present disclosure alone or in any combination. The brief summary presented above is intended only to familiarize the reader with certain aspects and contexts of embodiments of the present disclosure without limitation to the claimed subject matter.
These and other features, aspects, and advantages of the present disclosure will become better understood when the following detailed description is read with reference to the accompanying drawings in which like characters represent like parts throughout the drawings, wherein:
One or more specific embodiments will be described below. In an effort to provide a concise description of these embodiments, not all features of an actual implementation are described in the specification. It should be appreciated that in the development of any such actual implementation, as in any engineering or design project, numerous implementation-specific decisions must be made to achieve the developers' specific goals, such as compliance with system-related and business-related constraints, which may vary from one implementation to another. Moreover, it should be appreciated that such a development effort might be complex and time consuming, but would nevertheless be a routine undertaking of design, fabrication, and manufacture for those of ordinary skill having the benefit of this disclosure. Further, the current embodiments may be implemented by one or more computer-processors that implement one or more machine-readable instructions stored on a tangible, non-transitory, machine-readable medium and/or by specialized circuitry designed to implement the discussed features.
The information age has brought about rapid advancements in telecommunication, hardware-based computing, software, and other data related activities. Thus, the current information-based society has resulted in the generation of a vast amount of valuable digital resources with more and more data consumption by customers, vendors, and electronic devices. For example, many organizations may retain a significant amount of analytics data within the organizations for use in business intelligence and other statistical analysis functions. Data collection has increased, and will continue to increase, exponentially. Unfortunately, as more data is collected, storage requirements grow correspondingly and may overwhelm available storage capacity. Further, as the data is increasingly relied upon for business intelligence and other analytical functions, rapid access is beneficial. It is now recognized that pre-processing of data may delay access to such data. Additionally, as statistical analysis becomes more widespread, it may be desirable to enable a multitude of statistical analysis engines to access data payloads. However, traditional statistical analysis data files may only be interpretable by a single statistical analysis engine.
Accordingly, as discussed above, new techniques may be implemented to efficiently store and use data analysis tools, such as statistical analysis system (SAS) tools. By way of introduction,
The SAS® host/client 12 may generate a SAS® data step view and request that the SAS® data step view be stored in a data store 16 (a tangible, non-transitory, machine-readable medium) by executing a view creation request 18 (e.g., executing a macro that is interpretable by the SAS® host/client 12). Upon invocation of such a request 18, the storage and use service 20 may implement a generalization service 22 that stores metadata indicating the variable characteristics that may be used to render the payload data in a generic format for non-SAS® (e.g., “generic”) hosts/clients 14 (e.g., hosts/clients that do not run SAS® software).
Further, as mentioned above, the data payload may include a vast amount of data that may rapidly deplete storage capacity of the data store 16. Accordingly, compression services 24 may compress the payload for efficient storage of the payload data, such that storage capacity of the data store 16 may be less depleted upon saving the view and associated data payload to the data store 16.
Additionally, because SAS® data step views utilize external data files, the SAS® data step views may only be valid when the data files are present and accessible. Unfortunately, there are few mechanisms to ensure that the SAS® data step views are located with their corresponding payload data files. Accordingly, unification services 26 may combine the metadata, the SAS® data step view, and the payload into a single unified file, such that all three are bound together, reducing the likelihood that any one of them is moved without the others.
As illustrated, the output of the services 20 may be a unified and compressed SAS® data file 28. As will be discussed in more detail below, the unified and compressed SAS® data file 28 may be a single file that includes: the SAS® data step view, the compressed payload data associated with the SAS® data step view, and metadata that may be used by the generic host/client 14 to construct a generic (e.g., non-SAS) view of the payload data.
For example, the SAS® host/client 12 may access the unified and compressed SAS® data file 28 by providing a view request 30 to open the view 32. As will be discussed in more detail below, the request 30 may trigger automatic decompression of the payload data, such that it may be used in conjunction with the SAS® data step view.
Similarly, the generic host/client 14 may access the unified and compressed SAS® data file 28 by providing a view request 34 to open or otherwise access the unified and compressed SAS® data file 28. As will be discussed in more detail below, the request 34 may trigger automatic decompression of the payload data, such that it may be used in conjunction with the metadata to reconstruct a generic view 36 that is interpretable by the generic host/client 14.
As illustrated, the generalization service 22 may receive the compressible payload 42 (block 64) as input and also may receive metadata 44 (e.g., a table) relating to the variable characteristics of the SAS® data step view 40 (block 66). For example, the variable characteristics of the SAS® data step view 40 may include the names, types, lengths, formats, labels, etc. of the variables described in the SAS® data step view 40.
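The variable characteristics listed above can be pictured as a small, serializable record per variable. The following is a minimal sketch only: the class name `VariableMeta`, the helper `metadata_to_json`, the JSON encoding, and the sample variables are all hypothetical and are not taken from the disclosure.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass(frozen=True)
class VariableMeta:
    """One row of a hypothetical metadata table describing a view variable."""
    name: str     # variable name
    type: str     # "num" or "char" (assumed type vocabulary)
    length: int   # storage length in bytes
    format: str   # display format string
    label: str    # human-readable label


def metadata_to_json(variables: List[VariableMeta]) -> str:
    """Serialize the metadata so that a non-SAS client could read it."""
    return json.dumps([asdict(v) for v in variables], indent=2)


# Illustrative variables only; not drawn from any real data set.
variables = [
    VariableMeta("account_id", "char", 12, "$12.", "Account identifier"),
    VariableMeta("balance", "num", 8, "DOLLAR12.2", "Current balance"),
]

if __name__ == "__main__":
    print(metadata_to_json(variables))
```

A text encoding such as JSON is chosen here only because it is readable by a wide range of clients; any self-describing encoding could serve the same purpose.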
Further, the compressible payload 42 and the metadata 44 may be published by a broadcaster service 39 to the compression service 24 (e.g., parallel compression service) to compress the payload (block 68). For example, the compression service 24, which may be hosted on a multi-core processor complex, may divide the compressible payload data 42 into divisions or blocks of data. These blocks may be compressed in parallel by multiple compression functions 45 implemented on the processors of the multi-core processor complex. For example, in some embodiments, four compression functions 45 may run individually on independent cores of a quad-core processor complex. In some embodiments, a single core may implement multiple instances of the compression functions. By implementing parallel compression, the compression processing time of the often expansive payload data may be greatly reduced. For example, by implementing four parallel compression functions 45 on a quad-core processor, compression may be more than three times as fast as with a single compression function.
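The block-wise parallel compression described above might be sketched as follows. This is an illustration under stated assumptions, not the disclosed implementation: zlib stands in for whatever compression function is actually used, the names `split_into_blocks` and `compress_parallel` are hypothetical, and the block size and worker count are arbitrary defaults. A thread pool is used for brevity; a process pool could be substituted.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor
from typing import List


def split_into_blocks(payload: bytes, block_size: int) -> List[bytes]:
    """Divide the payload into consecutive blocks of at most block_size bytes."""
    return [payload[i:i + block_size] for i in range(0, len(payload), block_size)]


def compress_parallel(payload: bytes, block_size: int = 1 << 20,
                      workers: int = 4) -> List[bytes]:
    """Compress each block independently using a pool of workers.

    Executor.map() returns results in input order, so the compressed
    blocks come back in the same order as the original blocks.
    """
    blocks = split_into_blocks(payload, block_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(zlib.compress, blocks))
```

Because every block is compressed on its own, each block can later be decompressed without reference to any other block, which is what makes parallel decompression possible.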
Any number of separate compression functions may run in parallel. For example, a server may have one board with two quad-core processors. Each core may run in hyper-threaded mode, causing it to appear as two processors to an operating system of the server. This may yield about 40% more throughput under Linux than running the same core as a serial processor. In this example, the server has 1 board × 2 chips/board × 4 cores/chip × 2 logical CPUs/core = 16 logical CPUs. Thus, to maximize compression parallelism, 16 parallel compression functions may be implemented to support multiple simultaneous uses. Further, as CPU breadth increases, so may the number of parallel compression functions.
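The logical CPU arithmetic in the example above can be checked directly. In the sketch below, `logical_cpus` and `default_worker_count` are hypothetical helper names; `os.cpu_count()` is the standard-library call that reports the number of logical CPUs visible to the operating system.

```python
import os


def logical_cpus(boards: int, chips_per_board: int,
                 cores_per_chip: int, logical_per_core: int) -> int:
    """Multiply out the hardware topology into a logical CPU count."""
    return boards * chips_per_board * cores_per_chip * logical_per_core


# The example server: 1 board, 2 chips per board, 4 cores per chip,
# 2 logical CPUs per hyper-threaded core.
example_workers = logical_cpus(1, 2, 4, 2)


def default_worker_count() -> int:
    """Use the logical CPU count reported by the OS, falling back to 1."""
    return os.cpu_count() or 1
```

On the example server, `example_workers` evaluates to 16, matching the number of parallel compression functions suggested in the text.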
As illustrated by the multiple arrows 46, the multiple compression functions 45 may yield compressed blocks of data. The blocks may then be reassembled after the compression functions have completed (e.g., by the unification service 26), resulting in compressed payload data 42′.
For example, the unification service 26 may generate a unified and compressed file 28 that includes the SAS® data step view 40, the metadata 44, and the combined blocks or segments of compressed data 42′. In one embodiment, the unification service 26 may append the compressed payload data 42′ to the end of the SAS® data step view 40 and associate the metadata 44 with the SAS® data step view 40 and/or the appended compressed payload data 42′ (block 70). As will be discussed in more detail below, because of particular features of the SAS® host/client 12, the appending of the compressed payload data 42′ to the SAS® data step view 40 will not affect the ability of the SAS® host/client 12 to access/execute the SAS® data step view 40. Thus, a single unified and compressed output file 28 may be generated by the unification service 26 and stored in the data store 16.
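One way to picture a single unified file is shown below. The layout is purely an assumption made for illustration: the view bytes come first, the compressed payload is appended after them, and the metadata plus a fixed-size trailer recording the three lengths follow. The disclosure does not specify this trailer, the `UCF1` marker, or the JSON encoding of the metadata, and the names `unify` and `split_unified` are hypothetical.

```python
import json
import struct
from typing import Tuple

MAGIC = b"UCF1"                      # hypothetical marker, not from the source
TRAILER = struct.Struct("<4sQQQ")    # magic, view length, payload length, metadata length


def unify(view: bytes, compressed_payload: bytes, metadata: dict) -> bytes:
    """Append the compressed payload and metadata after the view bytes."""
    meta = json.dumps(metadata).encode("utf-8")
    trailer = TRAILER.pack(MAGIC, len(view), len(compressed_payload), len(meta))
    return view + compressed_payload + meta + trailer


def split_unified(blob: bytes) -> Tuple[bytes, bytes, dict]:
    """Recover the view, the compressed payload, and the metadata from one blob."""
    magic, view_len, payload_len, meta_len = TRAILER.unpack(blob[-TRAILER.size:])
    body = blob[:-TRAILER.size]
    if magic != MAGIC or len(body) != view_len + payload_len + meta_len:
        raise ValueError("not a unified file in the assumed layout")
    view = body[:view_len]
    payload = body[view_len:view_len + payload_len]
    metadata = json.loads(body[view_len + payload_len:].decode("utf-8"))
    return view, payload, metadata
```

Keeping the view at the very start of the file mirrors the statement that the payload is appended to the end of the view, so a reader that only consumes the leading view bytes is unaffected by what follows.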
Turning now to the usage of the unified and compressed SAS® file,
When the SAS® host/client 12 and/or the generic host/client 14 attempts to access the unified and compressed SAS® data step view file 28, the compressed payload 42′ may be decompressed by the parallel decompression service 94. The compressed payload 42′ may be divided into portions 96. Similar to the compression discussed above, the portions 96 may be decompressed in parallel (e.g., via separate decompression functions running on separate processor cores of a multi-core processor complex). Upon decompression of the portions 96, the decompressed payload data may be interleaved by an interleaver 97, such that the decompressed data is merged. The merged decompressed payload data may be used by the SAS® host/client 12, the generic host/client 14 and/or the SAS® data step view 40, resulting in an expected image of data 101.
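The decompress-then-merge flow described above might look like the following sketch. It assumes that each portion was compressed independently with zlib and that merging means concatenating the decompressed portions in their original order; both are assumptions for illustration, and `decompress_parallel` is a hypothetical name.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor
from typing import List


def decompress_parallel(portions: List[bytes], workers: int = 4) -> bytes:
    """Decompress the portions concurrently and merge them in order.

    Executor.map() yields results in input order regardless of which
    worker finishes first, so joining them restores the original bytes.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return b"".join(pool.map(zlib.decompress, portions))


if __name__ == "__main__":
    rows = b"".join(b"row %d\n" % i for i in range(1000))
    portions = [zlib.compress(rows[i:i + 4096]) for i in range(0, len(rows), 4096)]
    assert decompress_parallel(portions) == rows
```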
Because the SAS® host/client 12 is a system running SAS® software, the SAS® host/client 12 is able to access the SAS® data step view 40 in a native format of the file 28. However, because the SAS® data step view 40 is interpretable by the SAS® software and the generic host/client 14 is not running SAS® software, a translator 98 may utilize the metadata 44 in conjunction with the SAS® data step view 40 and the payload 42 to construct a generic view 36 (e.g., a self-expanding rendition, such as a modified unified and compressed file 28′) that is interpretable by the generic host/client 14, using the single unified and compressed SAS® data step view file 28.
In some embodiments, a process monitoring and management service 102 may monitor the decompression services 94. From time to time, these decompression services may not shut down properly (e.g., resulting in UNIX zombie processes). Accordingly, the process monitoring and management service 102 may detect improperly shut down decompression processes and terminate execution of these decompression services 94.
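A monitoring pass of this kind could be sketched as below, assuming the decompression workers are child processes tracked as `subprocess.Popen`-like objects together with their start times. The function name `clean_up_workers` and the timeout policy are hypothetical; the disclosure does not state how improperly shut down processes are identified.

```python
import time
from typing import List, Optional, Tuple


def clean_up_workers(workers: List[Tuple[object, float]], max_seconds: float,
                     now: Optional[float] = None):
    """Reap finished workers and terminate those running past max_seconds.

    Each entry is (process, start_time), where process offers poll(),
    kill(), wait(), and pid in the manner of subprocess.Popen.
    Returns (still_running, removed_pids).
    """
    now = time.monotonic() if now is None else now
    still_running, removed = [], []
    for proc, started_at in workers:
        if proc.poll() is not None:
            # The worker has exited; with Popen, poll() collects its exit
            # status, so it does not linger as a zombie.
            continue
        if now - started_at > max_seconds:
            proc.kill()
            proc.wait()  # collect the exit status of the killed worker
            removed.append(proc.pid)
        else:
            still_running.append((proc, started_at))
    return still_running, removed
```

Calling such a function periodically keeps the list of tracked workers limited to processes that are both alive and within their time budget.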
Based upon these variables and/or attributes of the variables, the payload data may be reconstructed in a manner interpretable by the generic host/client 14 of
The payload data, reconstructed in the generic host/client interpretable form, may then be utilized by the generic host/client 14 (block 136). For example, the generic host/client 14 may access a generic object model that is in an expected format of the generic host/client 14, such that the variables and/or variable attributes may be used by the generic host/client 14 in a generic view 36.
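To illustrate how variable metadata could drive reconstruction into a generic object model, the sketch below decodes a decompressed payload into a list of dictionaries. The record layout is an assumption made only for this example: fixed-width records, character variables stored as space-padded ASCII, and numeric variables stored as 8-byte little-endian IEEE doubles. The function name `reconstruct_rows` and the metadata keys are hypothetical.

```python
import struct
from typing import Dict, List, Union

Value = Union[str, float]


def reconstruct_rows(payload: bytes, variables: List[dict]) -> List[Dict[str, Value]]:
    """Turn fixed-width records into dictionaries keyed by variable name.

    Each entry of variables needs "name", "type" ("num" or "char"), and
    "length" (bytes); "num" variables are assumed to have length 8.
    """
    record_len = sum(v["length"] for v in variables)
    if record_len == 0 or len(payload) % record_len != 0:
        raise ValueError("payload size does not match the metadata")
    rows = []
    for start in range(0, len(payload), record_len):
        offset, row = start, {}
        for v in variables:
            raw = payload[offset:offset + v["length"]]
            if v["type"] == "num":
                row[v["name"]] = struct.unpack("<d", raw)[0]
            else:
                row[v["name"]] = raw.decode("ascii").rstrip()
            offset += v["length"]
        rows.append(row)
    return rows
```

A list of dictionaries is used as the generic object model here because most general-purpose clients can consume it directly; other targets, such as a columnar table, would follow the same metadata-driven pattern.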
Turning now to a discussion of rendering a view via the SAS® host/client 12 of
Upon access of the file 28 by the SAS® host/client 12 and/or the generic host/client 14, the compressed payload 42′ of the file 28 of
As mentioned above, when the generic host/client 14 of
A view may then be rendered based upon the SAS® data step view 40, the decompressed payload 42, and/or the reconstructed payload data of block 156 (block 158). For example, the generic object model constructed in block 156, the SAS® data step view 40 of
The piped stream of data may have many applications. In certain embodiments, full-size data tables, such as SAS® data files, may be replaced with like-named SAS® data step views, such that users and existing programs accustomed to using particular tables utilize the view instead. Such an implementation may require little to no front-end changes to software and/or retraining of software users. In some embodiments, when available storage capacity reaches a low threshold, the full-sized tables may be automatically replaced with the like-named views, freeing up additional storage. Alternatively, in some embodiments, the replacement may be automated or otherwise triggered without a low threshold (e.g., as part of a comprehensive space-management program for storage of such tables).
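The low-storage trigger mentioned above could be checked with a test such as the following. The 10% free-space threshold and the function names are hypothetical; `shutil.disk_usage` is the standard-library call that reports total, used, and free bytes for the file system containing a path.

```python
import shutil


def is_below_threshold(free_bytes: int, total_bytes: int,
                       min_free_fraction: float) -> bool:
    """Return True when the free fraction has dropped under the threshold."""
    return total_bytes > 0 and free_bytes / total_bytes < min_free_fraction


def storage_is_low(path: str, min_free_fraction: float = 0.10) -> bool:
    """Check the file system holding path against the free-space threshold."""
    usage = shutil.disk_usage(path)
    return is_below_threshold(usage.free, usage.total, min_free_fraction)
```

A space-management job might call `storage_is_low` on the data store's directory and, when it returns True, begin replacing full-sized tables with like-named views.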
As may be appreciated, by applying the current techniques, SAS® data may be easily accessed, while increasing processing efficiencies, increasing storage capacity, and creating mechanisms for generic hosts/clients to make use of bulk payload data. Thus data storage costs may be reduced, while workforce throughput may increase. By implementing these techniques as SAS® data step views are accessed, the compression/decompression of the SAS® payload data may have relatively little impact on the graphical user interface experience of opening SAS® data step views, while offering significant performance and/or storage capacity improvement.
While only certain features of disclosed embodiments have been illustrated and described herein, many modifications and changes will occur to those skilled in the art. It is, therefore, to be understood that the appended claims are intended to cover all such modifications and changes as fall within the true spirit of the present disclosure.
This application claims priority to and the benefit of U.S. Provisional Application No. 62/337,683, entitled “UNIFIED AND COMPRESSED STATISTICAL ANALYSIS DATA,” filed May 17, 2016, which is hereby incorporated by reference in its entirety for all purposes.
Number | Date | Country |
---|---|---|
62337683 | May 2016 | US |