The following disclosure relates to the field of cloud computing, and more particularly, to data management on a cloud computing platform.
Organizations may need to transfer data from legacy data storage to the cloud for broader consumption and usage. In some cases, a legacy data storage supports row-based file formats, and data files are exported to the cloud in the row-based file format. However, data analysis on files in a row-based file format may not be as efficient as on files in other file formats.
Embodiments described herein are a data management system and associated methods configured to handle or process data files in row-based file formats. A data management system as described herein is configured to ingest a data file in a row-based file format to a cloud computing platform, along with an associated schema for the data file. The data management system is further configured to transform the data file to a columnar file format using the schema uploaded with the data file. A technical benefit is that the data file in columnar file format may be more efficiently analyzed by data analysis tools on the cloud computing platform.
In an embodiment, a data management system comprises processing resources and storage resources provisioned on a cloud computing platform to implement a data management service. The processing resources are configured to cause the data management system at least to receive a first Application Programming Interface (API) request from a client application to store a data file in a row-based file format, and process a first request body of the first API request to identify a schema associated with the data file. The schema indicates column information to construct one or more columns from data in the data file. The processing resources are configured to further cause the data management system at least to identify a cloud-based storage resource on the cloud computing platform to store the data file, transmit an API response to the client application with a resource identifier of the cloud-based storage resource, receive a second API request from the client application to store the data file at the cloud-based storage resource, process a second request body of the second API request to identify the data file, and store the data file and the schema at the cloud-based storage resource.
In an embodiment, a data management system comprises processing resources and storage resources provisioned on a cloud computing platform to implement a data management service. The processing resources are configured to cause the data management system at least to extract a data file in a row-based file format from a cloud-based storage resource, and extract a schema associated with the data file from the cloud-based storage resource. The schema indicates column information to construct one or more columns from data in the data file. The processing resources are configured to further cause the data management system at least to transform the data file in the row-based file format into a columnar file format based on the schema, and load the data file in the columnar file format to a cloud-based centralized repository.
Other embodiments may include computer readable media, other systems, or other methods as described below.
The above summary provides a basic understanding of some aspects of the specification. This summary is not an extensive overview of the specification. It is intended to neither identify key or critical elements of the specification nor delineate any scope of particular embodiments of the specification, or any scope of the claims. Its sole purpose is to present some concepts of the specification in a simplified form as a prelude to the more detailed description that is presented later.
Some embodiments of the present disclosure are now described, by way of example only, and with reference to the accompanying drawings. The same reference number represents the same element or the same type of element on all drawings.
The figures and the following description illustrate specific exemplary embodiments. It will thus be appreciated that those skilled in the art will be able to devise various arrangements that, although not explicitly described or shown herein, embody the principles of the embodiments and are included within the scope of the embodiments. Furthermore, any examples described herein are intended to aid in understanding the principles of the embodiments, and are to be construed as being without limitation to such specifically recited examples and conditions. As a result, the inventive concept(s) is not limited to the specific embodiments or examples described below, but by the claims and their equivalents.
An organization, such as a company, health care organization, educational organization, governmental organization, etc., may generate and/or collect a large volume of data that is stored in data storage 120. Oftentimes, the data is stored in data silos 122, which are data repositories controlled by one department, business unit, etc., and typically isolated from the rest of the organization. To derive valuable insights and get a holistic view of the data, the organization may desire that the data be merged into a centralized repository. Thus, the organization may want to migrate the data from legacy data storage 120 to the cloud for improved consumption of the data, such as through data management service 100. As will be described in more detail below, some file formats supported by legacy data storage 120 may not be desirable for data analysis or other processing, so data management service 100 supports conversion of the data to another format when stored in a centralized repository.
In an embodiment, a data management system is implemented on cloud computing platform 112 to provide the data management service 100.
In an embodiment, data management system 200 may include or implement a data collector 202, a data converter 204, and a data analyzer 206. Data collector 202 is configured to input, upload, or ingest data for the data management service 100 from external devices or systems (e.g., legacy data storage 120 or associated controllers) over a network connection, such as by exchanging messages, files, etc. Data collector 202 may use or provide an Application Programming Interface (API) 208 to interact with client applications implemented at external systems, such as legacy data storage 120. Data collector 202 is configured to store the ingested data in cloud-based storage (e.g., storage resources 232). Operations or functions performed by data collector 202 may generally be referred to as an ingestion phase of data into data management service 100.
Data converter 204 is configured to convert or transform data files stored in cloud-based storage from a native or legacy file format to another file format, and store the data files in a cloud-based centralized repository 250 on the cloud computing platform 112. For example, a native file format supported by legacy data storage 120 may comprise a row-based file format. One example of a row-based file format is a delimited file format, which is a collection of records arranged in rows, and individual data values or fields are separated by column delimiters within the rows. One example of a delimited file format is a comma-separated values (CSV) file format, which is a row-based file format where individual data values or fields are separated by commas within the rows. A row-based file format, such as CSV, may provide a challenge for data analysis or other data processing as column information for the data in the file may not be reasonably identifiable. Thus, data converter 204 is configured to convert or transform data files in a row-based file format to a columnar file format or column-based file format. In a columnar file format, data is stored by column instead of by row. Examples of columnar file formats are Apache Parquet (referred to generally herein as Parquet), Optimized Row Columnar (ORC), etc. In some cases, columnar file formats have become the standard in centralized repositories for fast analytics workloads as opposed to row-based file formats. Columnar file formats can significantly reduce the amount of data that needs to be fetched by accessing columns that are relevant to the workload. Operations or functions performed by data converter 204 may generally be referred to as a transformation phase of data in data management service 100.
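By way of illustration only, the difference between a row-based layout and a columnar layout may be sketched as follows. This is a minimal sketch using in-memory Python structures rather than an actual Parquet or ORC writer; the sample data, the column names, and the function `rows_to_columns` are assumptions introduced for the example and are not part of the embodiments.

```python
import csv
import io

# Hypothetical sample data in a row-based (CSV) layout.
CSV_TEXT = "id,name,amount\n1,alice,10.5\n2,bob,7.25\n3,carol,3.0\n"


def rows_to_columns(csv_text):
    """Pivot row-based CSV text into a column-oriented dict of lists."""
    reader = csv.DictReader(io.StringIO(csv_text))
    columns = {name: [] for name in reader.fieldnames}
    for row in reader:
        for name, value in row.items():
            columns[name].append(value)
    return columns


columns = rows_to_columns(CSV_TEXT)

# A workload that needs only the "amount" column reads one list,
# instead of scanning every field of every row.
total = sum(float(value) for value in columns["amount"])
```

In this sketch, a query over a single column touches only the values of that column, which is the property that allows columnar file formats to reduce the amount of data fetched.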
Data management system 200 may include various other components, applications, etc., not specifically illustrated.
In an embodiment, data collector 202 is configured to input or upload a schema for a data file in a row-based file format 302 during the ingestion phase. Data collector 202 stores the schema in cloud-based storage with the data file.
Further, during the ingestion phase, data collector 202 is configured to upload or otherwise ingest a schema 406 associated with the data file 402 for storage at cloud-based storage resource 420. Schema 406 may be provisioned or pre-provisioned, such as by the organization 400 (or another entity), prior to uploading the data file 402.
In an embodiment, an API 208 may be defined or provided to ingest data files 402, and the schemas 406 associated with the data files 402.
Data collector 202 receives the API request 602 requesting upload of the data file 402, and processes the request body 610 of the API request 602 to identify the schema 406 associated with the data file 402. Data collector 202 provisions or identifies a cloud-based storage resource 420 on cloud computing platform 112 for the data file 402, and transmits an API response 604 (e.g., a first API response) to client application 620 with a resource identifier (ID) 612 (e.g., a Uniform Resource Identifier (URI)) of the cloud-based storage resource 420 for the data file 402. Data collector 202 may temporarily store the schema 406 associated with the data file 402.
Client application 620 receives the API response 604, and processes the API response 604 to identify the resource identifier 612 of the cloud-based storage resource 420 for the data file 402. In an embodiment, client application 620 transmits another API request 606 (e.g., a second API request) to data collector 202 requesting storage of the data file 402 at cloud-based storage resource 420. Client application 620 includes, inserts, or passes the data file 402 in a request body 614 of the API request 606, along with any other desired information. Data collector 202 receives the API request 606 requesting storage of the data file 402, and stores the data file 402 and the schema 406 at cloud-based storage resource 420. Data collector 202 may then reply with an API response 608. One technical benefit is that a schema 406 is stored together with the data file 402 so that other processes may locate and access the schema 406 when processing the data file 402.
The following describes an example of an ingestion phase of a data file 402 into data management service 100. Data collector 202 receives an API request 602 to store a data file 402 in a row-based file format 302 from client application 620 (step 1002), and processes a request body 610 of the API request 602 to identify a schema 406 associated with the data file 402 (step 1004). Data collector 202 provisions or identifies a cloud-based storage resource 420 on cloud computing platform 112 for the data file 402 (step 1006), and transmits an API response 604 to client application 620 with a resource identifier 612 (e.g., URI) of the cloud-based storage resource 420 for the data file 402 (step 1008). At this time, data collector 202 may temporarily store the schema 406 associated with the data file 402. Data collector 202 receives another API request 606 from the client application 620 to store the data file 402 at the cloud-based storage resource 420 (step 1010), and processes a request body 614 of the API request 606 to identify the data file 402 (step 1012). Data collector 202 then stores the data file 402 and the schema 406 at the cloud-based storage resource 420 (step 1014), which is identifiable based on the resource identifier 612. Technical benefits of method 1000 are that the schema 406 is uploaded in an API call to the cloud computing platform 112, and is stored together with the data file 402 so that other processes may locate and access the schema 406 when processing the data file 402.
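By way of illustration only, the two-request exchange of steps 1002 through 1014 may be sketched with an in-memory stand-in for the data collector. The class name `DataCollector`, the request-body keys `schema` and `data_file`, the `storage://` identifier scheme, and the response fields are hypothetical and chosen only for this sketch; an actual embodiment would expose these operations over an API of the cloud computing platform.

```python
import uuid


class DataCollector:
    """In-memory sketch of the two-request ingestion exchange (hypothetical)."""

    def __init__(self):
        self._pending_schemas = {}  # resource ID -> schema held temporarily
        self.storage = {}           # resource ID -> stored data file and schema

    def request_upload(self, request_body):
        # Steps 1002-1004: receive the first request and identify the schema.
        schema = request_body["schema"]
        # Step 1006: provision or identify a storage resource (assumed URI form).
        resource_id = f"storage://ingest/{uuid.uuid4()}"
        # Temporarily hold the schema until the data file arrives.
        self._pending_schemas[resource_id] = schema
        # Step 1008: respond with the resource identifier.
        return {"resource_id": resource_id}

    def store_file(self, resource_id, request_body):
        # Steps 1010-1012: receive the second request and identify the data file.
        data_file = request_body["data_file"]
        schema = self._pending_schemas.pop(resource_id)
        # Step 1014: store the data file and the schema together.
        self.storage[resource_id] = {"data_file": data_file, "schema": schema}
        return {"status": "stored"}


# Client-side usage of the sketch.
collector = DataCollector()
schema = {"columns": [{"name": "id", "type": "int"}]}
first_response = collector.request_upload({"schema": schema})
resource_id = first_response["resource_id"]
second_response = collector.store_file(resource_id, {"data_file": b"id\n1\n2\n"})
```

In this sketch, the schema received in the first request is held only until the second request arrives, at which point the data file and the schema are written to the same storage entry.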
In an embodiment, data converter 204 is configured to convert a data file 402 in row-based file format 302 to columnar file format 322 using the schema 406 associated with the data file 402. As described above, the schema 406 associated with the data file 402 indicates column information to construct one or more columns from data in the data file 402. Thus, data converter 204 uses the schema 406 as a template or blueprint to transform the data file 402 into columnar file format 322 with one or more columns 326 specified by the schema 406.
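By way of illustration only, the use of a schema as a template for constructing columns may be sketched as follows. The schema layout (a `columns` list with `name` and `type` entries), the supported type names, and the function `transform_with_schema` are assumptions made for this sketch; the embodiments do not prescribe a particular schema encoding, and an actual converter would write a columnar file such as Parquet rather than an in-memory dictionary.

```python
import csv
import io

# Hypothetical schema: ordered column names and types for the data file.
SCHEMA = {
    "columns": [
        {"name": "id", "type": "int"},
        {"name": "name", "type": "string"},
        {"name": "amount", "type": "float"},
    ]
}

# Assumed mapping from schema type names to Python conversions.
CASTS = {"int": int, "float": float, "string": str}


def transform_with_schema(csv_text, schema, has_header=True):
    """Build typed columns from row-based text, using the schema as a blueprint."""
    reader = csv.reader(io.StringIO(csv_text))
    if has_header:
        next(reader, None)  # header names are ignored; the schema is authoritative
    specs = schema["columns"]
    columns = {spec["name"]: [] for spec in specs}
    for line_number, row in enumerate(reader, start=1):
        if not row:
            continue  # skip blank lines
        if len(row) != len(specs):
            raise ValueError(f"row {line_number}: expected {len(specs)} fields")
        for spec, raw in zip(specs, row):
            columns[spec["name"]].append(CASTS[spec["type"]](raw))
    return columns


typed_columns = transform_with_schema("id,name,amount\n1,alice,10.5\n2,bob,7.25\n", SCHEMA)
```

Because the column names and types come from the schema rather than from inference over the data, each value is converted to the declared type and a malformed row is rejected rather than silently misinterpreted.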
AWS analytics services 1320 comprise analytics services for data, such as data movement, data storage, log analytics, business intelligence (BI), machine learning (ML), etc. One type of AWS analytics service 1320 is AWS Lake Formation 1322, which creates secure data lakes 1324 making data available for wide-ranging analytics. A data lake 1324 is a centralized repository that allows a customer to store structured and unstructured data. In formation of a data lake 1324, AWS Lake Formation 1322 collects data from multiple data sources, and moves the data into the data lake 1324 in its original format. A data lake 1324 uses Amazon S3 1312 as its primary storage platform.
Another type of AWS analytics service 1320 is AWS Glue 1326, which is a serverless data integration service that discovers, prepares, moves, and integrates data from multiple sources for analytics, machine learning (ML), and application development. AWS Glue 1326 may process data stored in Amazon S3 1312 when forming a data lake 1324. For example, AWS Glue 1326 may prepare data for analysis through automated extract, transform, and load (ETL) processes. The architecture for AWS Glue 1326 is disclosed in more detail below.
Another type of AWS analytics service 1320 is Amazon Athena 1328, which is a serverless, interactive analytics service that supports open-table and file formats. Amazon Athena 1328 is configured to query data from a variety of data sources (e.g., a data lake), and analyze the data and/or build applications.
Data management service 100 provides a data ingestion layer responsible for ingesting data into AWS storage services 1310, such as Amazon S3 1312, a data warehouse, etc.
In general, when a file is uploaded to an AWS storage service 1310, the file may be stored within an S3 bucket 1314 of Amazon S3 1312. More particularly, the file is stored as an S3 object within an S3 bucket 1314. Thus, when uploaded to Amazon S3 1312, for example, the CSV file 1402 is stored as an S3 object 1420 within an S3 bucket 1314. The S3 object 1420 consists of the file data (i.e., CSV file 1402, data file 402, etc.) and metadata 1403 (META) that describes the file data.
The CSV file 1402 may contain headers for data in the file, but does not describe the schema of the data within the CSV file 1402. Thus, a schema 406 is defined for the CSV file 1402, such as by the organization 400 or entity uploading the CSV file 1402. As above, the schema 406 indicates a column structure or column information to construct one or more columns from data in the CSV file 1402. The data ingestion layer of the data management service 100 also allows for uploading of the schema 406 for the CSV file 1402. For example, multiple APIs may be defined in the AWS environment 1300, and an API 1414 used in the data ingestion layer may allow the schema 406 to be uploaded or transferred to an AWS storage service 1310 along with the CSV file 1402, much as described above.
Multiple CSV files 1402 and associated schemas 406 may be uploaded in a similar manner to S3 bucket 1314 or other AWS storage services 1310. AWS Lake Formation 1322 may then collect the data from S3 bucket 1314 or other AWS storage services 1310, and move the data into the data lake 1324. Although such data has been moved to the data lake 1324, the data in its native format (e.g., CSV) may not be conducive to AWS analytics services 1320. Thus, AWS Lake Formation 1322 may facilitate transformation of the data to another format. For example, AWS Glue 1326 may be used to convert CSV files to another format, such as Parquet. However, other types of file conversion are considered herein.
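By way of illustration only, storing a schema alongside file data within a single stored object may be sketched as follows. This sketch models an object as file data plus string-valued metadata and does not call any AWS API; the class `StoredObject`, the metadata key `schema`, the object key, and the JSON encoding of the schema are assumptions made for the example.

```python
import json
from dataclasses import dataclass, field


@dataclass
class StoredObject:
    """Hypothetical stand-in for a stored object: file data plus metadata."""
    key: str
    body: bytes
    metadata: dict = field(default_factory=dict)


def put_with_schema(key, csv_bytes, schema):
    """Store the file data and carry the schema as serialized metadata."""
    return StoredObject(key=key, body=csv_bytes, metadata={"schema": json.dumps(schema)})


def read_schema(stored_object):
    """Recover the schema from the metadata of the stored object."""
    return json.loads(stored_object.metadata["schema"])


schema = {"columns": [{"name": "id", "type": "int"}, {"name": "name", "type": "string"}]}
stored = put_with_schema("incoming/customers.csv", b"id,name\n1,alice\n", schema)
recovered_schema = read_schema(stored)
```

A later process that retrieves the object may thereby obtain the file data and the schema in one place, without a separate lookup.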
A general workflow for AWS Glue 1326 is as follows. First, the AWS Glue data catalog 1510 is populated with table definitions. AWS Glue 1326 allows a user to select a crawler 1514, which is a program that connects to a data store 1516 (data source 1502 or data target 1550), progresses through a prioritized list of classifiers to extract metadata, and then creates metadata tables 1512 in the AWS Glue data catalog 1510. A user may also populate the AWS Glue data catalog 1510 with manually-created tables 1512. Next, a user defines a job (e.g., a Glue job, an ETL job, etc.) that describes the transformation of data from the data source 1502 to the data target 1550. To create a job, a user selects a table 1512 from the AWS Glue data catalog 1510, and the job uses this table definition to access the data source 1502 and interpret the format of the data. The user also selects a table 1512 or location from the AWS Glue data catalog 1510 to be the data target 1550 of the job. AWS Glue 1326 uses transform engine 1504 to convert data from a source format to a target format based on script 1506. Transform engine 1504 performs operations such as copying data, renaming columns, and filtering data to transform the data.
Next, the job is run to transform the data. The job may be run on demand, or start based on a schedule, an event-based trigger, etc. Script 1506 comprises code that extracts data from a data source, transforms the data, and loads the transformed data into a data target. Thus, when the job runs for an ETL operation, script 1506 extracts data from data source 1502 (e.g., a data file), transforms the data, and loads the data to the data target 1550.
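By way of illustration only, the extract, transform, and load stages of such a job may be sketched as follows. This sketch uses plain Python dictionaries as stand-ins for the data source and the data target and does not use the AWS Glue libraries; the function names, the record fields, and the example data are assumptions made for the example.

```python
def extract(source, key):
    """Extract: read records from the data source (returns a copy)."""
    return list(source[key])


def rename_columns(rows, mapping):
    """Transform: rename columns according to a mapping of old to new names."""
    return [{mapping.get(name, name): value for name, value in row.items()} for row in rows]


def filter_rows(rows, predicate):
    """Transform: keep only the records that satisfy the predicate."""
    return [row for row in rows if predicate(row)]


def load(target, key, rows):
    """Load: write the transformed records to the data target."""
    target[key] = rows


def run_job(source, target, key, mapping, predicate):
    """Run the stages in order and report how many records were loaded."""
    rows = extract(source, key)
    rows = rename_columns(rows, mapping)
    rows = filter_rows(rows, predicate)
    load(target, key, rows)
    return len(rows)


# Hypothetical source data and an initially empty target.
source = {"orders": [{"cust": "a", "amt": 5}, {"cust": "b", "amt": 0}]}
target = {}
loaded = run_job(source, target, "orders", {"amt": "amount"}, lambda row: row["amount"] > 0)
```

In this sketch the job leaves the source records unchanged and writes only the renamed, filtered records to the target.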
Script 1506 is configured to transform or convert the CSV file 1402 into a Parquet file 1602 based on the schema 406 associated with the CSV file 1402. In this example, instead of pulling a table 1512 from AWS Glue data catalog 1510 for transforming CSV file 1402, script 1506 transforms the CSV file 1402 based on the schema 406 uploaded with the CSV file 1402 and stored in the same S3 object 1420. As described above, the schema 406 describes column information to construct one or more columns from data in the CSV file 1402. Thus, script 1506 is able to transform certain data of the CSV file 1402 into columns of the Parquet file 1602 based on the schema 406. One technical benefit is that transform engine 1504 is able to accurately define one or more columns of data in Parquet file 1602 based on the schema 406. Script 1506 is further configured to store the Parquet file 1602 in the data target 1550.
The ETL operation 1600 may perform a similar operation on multiple CSV files 1402 as described above to convert the CSV files 1402 to Parquet files 1602. Thus, the data target 1550 (e.g., data lake 1324) may store many Parquet files 1602 that are available for processing via other AWS services.
After conversion, data management service 100 may run a crawler 1514 of AWS Glue 1326 to create or update a table in AWS Glue data catalog 1510 from the Parquet file 1602.
In an embodiment, AWS Glue 1326 may process the schema status indicator 504 in the schema 406 to determine whether to run the crawler 1514 on the Parquet file 1602. As described above, the schema status indicator 504 is a value, flag, or other indication of whether the schema 406 is new, updated, etc. When the schema 406 is new or updated, AWS Glue 1326 may run the crawler 1514 on the Parquet file 1602. When the schema 406 is not new or updated, AWS Glue 1326 may omit running the crawler 1514 on the Parquet file 1602. One technical benefit is that, because there may be a cost involved in running a crawler 1514, the crawler 1514 is run only in instances where a schema 406 is new or updated.
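By way of illustration only, the decision of whether to run a crawler based on a schema status indicator may be sketched as follows. The schema field name `status`, the values `new`, `updated`, and `unchanged`, the file paths, and the `CountingCrawler` stand-in are assumptions made for this sketch; the sketch does not invoke an actual AWS Glue crawler.

```python
# Assumed indicator values that warrant a crawler run.
RUN_STATUSES = {"new", "updated"}


class CountingCrawler:
    """Hypothetical crawler stand-in that records the paths it was run on."""

    def __init__(self):
        self.crawled_paths = []

    def run(self, path):
        self.crawled_paths.append(path)


def maybe_run_crawler(schema, crawler, path):
    """Run the crawler only when the schema is marked as new or updated."""
    if schema.get("status") in RUN_STATUSES:
        crawler.run(path)
        return True
    return False


crawler = CountingCrawler()
ran_for_new = maybe_run_crawler({"status": "new"}, crawler, "lake/customers.parquet")
ran_for_unchanged = maybe_run_crawler({"status": "unchanged"}, crawler, "lake/orders.parquet")
```

In this sketch, a schema without a recognized indicator value is treated as unchanged, so the crawler is skipped by default; an embodiment may choose the opposite default where catalog freshness is more important than crawler cost.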
Embodiments disclosed herein can take the form of software, hardware, firmware, or various combinations thereof.
Computer readable storage medium 2212 can be an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor device. Examples of computer readable storage medium 2212 include a solid-state memory, a magnetic tape, a removable computer diskette, a random access memory (RAM), a read-only memory (ROM), a rigid magnetic disk, and an optical disk. Current examples of optical disks include compact disk—read only memory (CD-ROM), compact disk—read/write (CD-R/W), and DVD.
Processing system 2200, being suitable for storing and/or executing the program code, includes at least one processor 2202 coupled to program and data memory 2204 through a system bus 2250. Program and data memory 2204 can include local memory employed during actual execution of the program code, bulk storage, and cache memories that provide temporary storage of at least some program code and/or data in order to reduce the number of times the code and/or data are retrieved from bulk storage during execution.
Input/output or I/O devices 2206 (including but not limited to keyboards, displays, pointing devices, etc.) can be coupled either directly or through intervening I/O controllers. Network adapter interfaces 2208 may also be integrated with the system to enable processing system 2200 to become coupled to other data processing systems or storage devices through intervening private or public networks. Modems, cable modems, IBM Channel attachments, SCSI, Fibre Channel, and Ethernet cards are just a few of the currently available types of network or host interface adapters. Display device interface 2210 may be integrated with the system to interface to one or more display devices, such as printing systems and screens for presentation of data generated by processor 2202.
The following clauses and/or examples pertain to further embodiments or examples. Specifics in the examples may be used anywhere in one or more embodiments. The various features of the different embodiments or examples may be variously combined with some features included and others excluded to suit a variety of different applications. Examples may include subject matter such as a method, means for performing acts of the method, at least one machine-readable medium including instructions that, when performed by a machine cause the machine to perform acts of the method, or of an apparatus or system according to embodiments and examples described herein.
Some embodiments pertain to Example 1 that includes a data management system comprising processing resources and storage resources provisioned on a cloud computing platform to implement a data management service, the processing resources configured to cause the data management system at least to extract a data file in a row-based file format from a cloud-based storage resource, and extract a schema associated with the data file from the cloud-based storage resource, where the schema indicates column information to construct one or more columns from data in the data file. The processing resources are configured to further cause the data management system at least to transform the data file in the row-based file format into a columnar file format based on the schema, and load the data file in the columnar file format to a cloud-based centralized repository.
Example 2 includes the subject matter of Example 1, where the schema was uploaded to the data management service with the data file.
Example 3 includes the subject matter of Examples 1 and 2, where the row-based file format comprises a delimited file format.
Example 4 includes the subject matter of Examples 1-3, where the columnar file format comprises Apache Parquet file format.
Example 5 includes the subject matter of Examples 1-4, where the cloud computing platform comprises an AWS environment. The processing resources are configured to further cause the data management system at least to extract the data file in the row-based file format from an Amazon S3 object in an S3 bucket, extract the schema from the Amazon S3 object, transform the data file into the columnar file format based on the schema extracted from the Amazon S3 object, and load the data file in the columnar file format to another S3 bucket of the cloud-based centralized repository.
Example 6 includes the subject matter of Examples 1-5, where the processing resources are configured to further cause the data management system at least to extract the schema from metadata of the Amazon S3 object.
Example 7 includes the subject matter of Examples 1-6, where the cloud-based centralized repository comprises a data lake created in the AWS environment.
Example 8 includes the subject matter of Examples 1-7, where the processing resources are configured to further cause the data management system at least to determine whether to run a crawler on the data file in the columnar file format to create or update a table in an AWS Glue data catalog based on a schema status indicator in the schema.
Example 9 includes the subject matter of Examples 1-8, where the processing resources are configured to further cause the data management system at least to run the crawler on the data file in the columnar file format to create or update the table in the AWS Glue data catalog when the schema status indicator indicates that the schema is new or updated.
Example 10 includes the subject matter of Examples 1-9, where the processing resources are configured to further cause the data management system at least to access the table in the AWS Glue data catalog created or updated from the data file in the columnar file format, and query the data file in the columnar file format based on the table.
Some embodiments pertain to Example 11 that includes a method of implementing a data management service on a cloud computing platform. The method comprises extracting a data file in a row-based file format from a cloud-based storage resource, and extracting a schema associated with the data file from the cloud-based storage resource, where the schema indicates column information to construct one or more columns from data in the data file. The method further comprises transforming the data file in the row-based file format into a columnar file format based on the schema, and loading the data file in the columnar file format to a cloud-based centralized repository.
Example 12 includes the subject matter of Example 11, where the schema was uploaded to the data management service with the data file.
Example 13 includes the subject matter of Examples 11 and 12, where the cloud computing platform comprises an AWS environment. Extracting the data file comprises extracting the data file in the row-based file format from an Amazon S3 object in an S3 bucket. Extracting the schema associated with the data file comprises extracting the schema from the Amazon S3 object. Transforming the data file comprises transforming the data file into the columnar file format based on the schema extracted from the Amazon S3 object. Loading the data file comprises loading the data file in the columnar file format to another S3 bucket of the cloud-based centralized repository.
Example 14 includes the subject matter of Examples 11-13, where extracting the schema comprises extracting the schema from metadata of the Amazon S3 object.
Example 15 includes the subject matter of Examples 11-14, further comprising determining whether to run a crawler on the data file in the columnar file format to create or update a table in an AWS Glue data catalog based on a schema status indicator in the schema.
Example 16 includes the subject matter of Examples 11-15, further comprising running the crawler on the data file in the columnar file format to create or update the table in the AWS Glue data catalog when the schema status indicator indicates that the schema is new or updated.
Example 17 includes the subject matter of Examples 11-16, further comprising accessing the table in the AWS Glue data catalog created or updated from the data file in the columnar file format, and querying the data file in the columnar file format based on the table.
Some embodiments pertain to Example 18 that includes a non-transitory computer readable medium embodying programmed instructions executed by a processor, where the instructions direct the processor to implement a method of implementing a data management service on a cloud computing platform. The method comprises extracting a data file in a row-based file format from a cloud-based storage resource, and extracting a schema associated with the data file from the cloud-based storage resource, where the schema indicates column information to construct one or more columns from data in the data file. The method further comprises transforming the data file in the row-based file format into a columnar file format based on the schema, and loading the data file in the columnar file format to a cloud-based centralized repository.
Example 19 includes the subject matter of Example 18, where the schema was uploaded to the data management service with the data file.
Example 20 includes the subject matter of Examples 18 and 19, where the cloud computing platform comprises an AWS environment. Extracting the data file comprises extracting the data file in the row-based file format from an Amazon S3 object in an S3 bucket. Extracting the schema associated with the data file comprises extracting the schema from the Amazon S3 object. Transforming the data file comprises transforming the data file into the columnar file format based on the schema extracted from the Amazon S3 object. Loading the data file comprises loading the data file in the columnar file format to another S3 bucket of the cloud-based centralized repository.
Although specific embodiments were described herein, the scope of the invention is not limited to those specific embodiments. The scope of the invention is defined by the following claims and any equivalents thereof.