The present disclosure generally relates to processing large data sets and, more particularly, to systems and methods for a data ingestion process that moves data, including streaming data events, from a variety of data sources and different table schemas to a data warehouse in a file format that can be stored and analyzed in an efficient and traceable manner.
Some “big data” use-cases and environments may use tables having different schemas, which may present problems when processing data, particularly streaming data. Additional complexities may be encountered if there is a file count limit for the big data store solution. For example, a count limit may be reached by the processing of too many files, even if the files are each small in size. Furthermore, data volumes to be processed may not be of a stable size. Another concern may be the searchability of the data. Data stored in the data store should preferably be searchable, without further undue processing, based on criteria in a logical manner. Given these and other aspects of a distributed system, multiple different types of errors might typically be encountered during the performance of processing jobs. As such, error handling may be complicated by one or more of the foregoing areas of concern.
Accordingly, there exists a need for a distributed processing system and method solution that, in some aspects, can efficiently handle streaming data events having a variety of different data structure schemas, exhibit a minimal need for data recovery mechanisms, remain optimized for query and purge operations, and reduce a level of compaction used to efficiently store data in the data store.
The following description is provided to enable any person in the art to make and use the described embodiments. Various modifications, however, will remain readily apparent to those in the art.
In some aspects of the present disclosure, a distributed file system may be used to store “big data”. In one embodiment, an Apache Hadoop® cluster comprising Apache HDFS® and Apache Hive® ACID tables may be used to store streaming data to leverage storage and query effectiveness and scalability characteristics of the Hadoop cluster. The streaming data, however, may not be formatted for efficient storage and/or querying. Instead, the streaming data might be formatted and configured for efficient and reliable streaming performance. Accordingly, some embodiments herein include aspects and features, including an ingestion process, to efficiently and reliably transform the streaming data into data structures (e.g., data tables) that can be stored and further analyzed.
Prior to discussing the features of the ingestion process(es) herein, a number of aspects will be introduced. In some instances, the streaming data that may be received and processed by methods and systems herein may be Apache Kafka® data events, messages, or records. Kafka is run as a cluster on one or more servers that can serve a plurality of datacenters. A Kafka cluster stores streams of records produced by publishers for consumption by consumer applications. A Kafka publisher publishes a stream of records or messages to one or more Kafka topics, and a Kafka consumer subscribes to one or more Kafka topics. A Kafka data stream is organized in a category or feed name, called a topic, to which the messages are stored and published. Kafka topics are divided into a number of partitions that contain messages in an ordered, immutable sequence. Each topic may have multiple partitions. Multiple consumers may be needed or desired to read from the same topic to, for example, keep pace with the publishing rate of producers of the topic. Consumers may be organized into consumer groups; within a consumer group, each partition of a topic is consumed by only one consumer, although one consumer in the group might consume multiple partitions of the same topic. That is, each consumer in a consumer group might receive messages from a different subset of the partitions in a particular topic.
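The partition-to-consumer relationship described above can be illustrated with a hypothetical round-robin assignment. Kafka's actual partition assignment strategies are configurable; this sketch only shows the key property that each partition maps to exactly one consumer in a group, while one consumer may own several partitions:

```python
from collections import defaultdict

def assign_partitions(partitions, consumers):
    """Round-robin assignment of a topic's partitions to the consumers
    in a single consumer group: each partition goes to exactly one
    consumer, but one consumer may own several partitions."""
    assignment = defaultdict(list)
    for i, partition in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(partition)
    return dict(assignment)

# Three partitions of one topic shared by two consumers in one group.
assignment = assign_partitions([0, 1, 2], ["consumer-a", "consumer-b"])
# consumer-a owns partitions 0 and 2; consumer-b owns partition 1.
```

A second consumer group subscribing to the same topic would receive its own independent copy of every partition's messages; the exclusivity above applies only within a group.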
For each topic, a Kafka cluster maintains a partitioned log 200 as illustratively depicted in
Operation 310 may include a target file converter to transform the Kafka events to the desired, target format that is optimized for querying, analysis, and reporting. Operation 310 may include moving the files from staging folders of streaming writer 305 to staging folders of the target format. In one embodiment, the files are moved from the HDFS staging folders to Hive staging folders or data structures. An Extraction, Transform, Load (ETL) process is further performed to transform the Kafka data into target ACID tables. In some instances, the Kafka data may be imported into the ACID tables using a Hive query operation. Upon completion of importing the Kafka data into the target formatted data structures (e.g., Hive ACID tables), the processed files can be archived. In accordance with a configurable setting, the archived files may be purged after a period of time. Referring to
At operation 410, the writing of the data messages is committed to the first data structure, in a transactional manner. The transactional nature of the writing and commitment thereof may obviate a need for a data recovery mechanism regarding operation 405. In this manner, additional resources and/or complexity need not be devoted to a data recovery process.
Proceeding to operation 415, the data messages in the first data structure may be moved to a staging area for a second data structure of a second data storage format (e.g., Hive ACID). As seen by the present example, the first data structure of the first data storage format (e.g., HDFS) is not the same as the second data structure of the second data storage format (e.g., Hive ACID).
At operation 420, the data messages are transformed to the second data structure of the second data storage format. In this example, the Kafka messages are transformed into Hive ACID tables. Hive ACID tables are configured and optimized for efficient storage and querying of the data therein. As such, the Kafka messages may be queried and otherwise searched and manipulated in a fast and efficient manner when stored in Hive ACID tables as a result of the transformation operation 420.
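For illustration only, the import of staged Avro data into a transactional (ACID) table at operation 420 might resemble the following HiveQL, shown here as Python string literals; the table names, columns, and staging path are hypothetical:

```python
# Hypothetical HiveQL for operation 420. The staged Avro files are
# exposed as an external staging table, and the rows are inserted into
# a transactional (ACID) ORC table that is optimized for querying.
create_staging = """
CREATE EXTERNAL TABLE topic_a_staging (event_id STRING, payload STRING)
STORED AS AVRO
LOCATION '/staging/topic_a';
"""

create_target = """
CREATE TABLE topic_a_acid (event_id STRING, payload STRING)
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');
"""

import_rows = """
INSERT INTO TABLE topic_a_acid
SELECT event_id, payload FROM topic_a_staging;
"""
```

The `transactional` table property is what makes the target table an ACID table in Hive; the external staging table merely overlays a schema on the files already written to the staging location.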
At operation 425, the data messages may be archived after the transformation of operation 420. In some aspects, the archived records may be purged or otherwise discarded after a defined period of time. The defined period of time might correspond to a service level agreement (SLA) or some other determining factor.
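For illustration, a retention-based purge of the archived records (operation 425) might look like the following sketch; the file names and the 30-day retention period are hypothetical stand-ins for an SLA-derived setting:

```python
import datetime

def purge_expired(archived, retention_days, now):
    """Drop archive entries older than the configured retention period
    (e.g., one dictated by an SLA). `archived` maps file name ->
    archival timestamp; returns only the surviving entries."""
    cutoff = now - datetime.timedelta(days=retention_days)
    return {name: ts for name, ts in archived.items() if ts >= cutoff}

now = datetime.datetime(2024, 1, 31)
archived = {
    "TopicA_p1-o100-200.avro": datetime.datetime(2024, 1, 30),
    "TopicA_p1-o0-100.avro": datetime.datetime(2023, 12, 1),
}
survivors = purge_expired(archived, retention_days=30, now=now)
# Only the recently archived file survives a 30-day retention window.
```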
At operation 525, a determination is made whether a predefined threshold size for the buffer 520 is satisfied. If not, then process 500 returns to 505 and a next Kafka message 510 is consumed. When the predefined threshold size for the in-memory buffer 520 is satisfied by the merged multiple Kafka messages, the process proceeds from operation 525 to operation 530 where a schema of the Kafka messages is checked for compatibility with the processing system implementing process 500. In some instances and use-cases, the schema compatibility check at operation 530 may reduce or eliminate schema-related compatibility issues at later processing operations. If a schema compatibility issue is determined at 530, then the type of error is determined at 580 to determine how to handle the exception. If the exception or error is fatal, as determined at operation 585, then the worker thread executing process 500 stops. If the exception is not fatal but a retry threshold limit has been satisfied as determined at operation 590, then the process also ends. If the exception is not fatal and the retry threshold limit has not been satisfied at 590, then the process proceeds to operation 595 where the current worker thread may be stopped and process 500 may be rescheduled with a new worker thread.
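The buffering of operations 505-525 can be sketched as a simple size-thresholded merge loop; the byte-size threshold and message representation below are assumptions for illustration:

```python
def fill_buffer(messages, threshold_bytes):
    """Merge consumed messages into an in-memory buffer until the
    configured size threshold is satisfied (cf. operation 525); returns
    the merged batch and any messages left for the next cycle."""
    buffer, size = [], 0
    remaining = list(messages)
    while remaining and size < threshold_bytes:
        msg = remaining.pop(0)
        buffer.append(msg)
        size += len(msg)
    return buffer, remaining

batch, rest = fill_buffer([b"k1", b"k2", b"k3", b"k4"], threshold_bytes=5)
# The first three 2-byte messages satisfy the 5-byte threshold; the
# fourth message waits for the next cycle.
```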
In the event there is a schema compatibility issue at 535 that is not fatal, then an incompatible Kafka message error 545 may be recorded in a log such as, for example, a write-ahead log (WAL) 550 and the process will continue to 505 to process new Kafka messages. In the event the Kafka messages are determined to be compatible at operation 535, then the messages in the buffer can be merged and written to a log (e.g., a WAL) at 540.
Returning to
Referring to
Prior to committing the WAL and Kafka messages as a transaction at 565, the WAL is checked for errors at 580 to determine an exception type, if any. If the exception or error is fatal, as determined at operation 585, then the worker thread executing process 500 stops. If the exception is not fatal but a retry threshold limit has been satisfied as determined at operation 590, then the process ends. If the exception is not fatal but a retry threshold limit has not been satisfied at 590, then the process proceeds to operation 595 where the current worker thread may be stopped and a process 500 may be rescheduled with a new worker thread. The WAL is committed at operation 575 and the process will return to 505 to process new Kafka messages.
At a second step 710, the state of the HDFS log may be marked as committed by, for example, renaming it to a final name to make it consumable by a next step. For example, a file “TopicA_partition1.avro.staging” may be renamed to “TopicA_partition1-o100-200.avro”, where the “100-200” refers to the offset range of the Avro file, from 100 to 200. In this example, the files are formatted as an Avro record that includes data structure definitions with the data, which might facilitate applications processing the data, even as the data and/or the applications evolve over time. The renaming of the WAL HDFS filename may be monitored for an exception or error. In this second step, the HDFS WAL is marked as a final state to ensure the record is written exactly once to HDFS (i.e., no duplicate records) for the next step in the conversion process of
Continuing to step three of process 700, the Kafka offset range of the log record committed at operation 710 is acknowledged. For example, offset 200 of TopicA_partition1 is acknowledged as already written to the WAL. If there is a failure to acknowledge the Kafka offset, then the particular log may be deleted or otherwise discarded at operation 725. In the event there is an exception in any of steps 705, 710, and 715, there may be no need for an additional rollback of the streaming Kafka data; instead, the worker thread executing process 700 may quit and be restarted, as shown at 730. Continuing to step four, the log (e.g., WAL) of the partition is closed after the acknowledgement of step three, operation 715.
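A minimal sketch of the write-then-rename commit of steps one and two follows, with a local file system standing in for HDFS (whose rename is likewise atomic within a directory); the topic, partition, and offset values mirror the example above but are otherwise hypothetical:

```python
import os
import tempfile

def commit_wal(directory, topic, partition, start, end, payload):
    """Sketch of steps one and two of process 700: write the batch to a
    '.staging' write-ahead log, then rename it to its final
    offset-ranged name so that downstream conversion sees only
    committed files. Step three would then acknowledge offset `end`
    back to Kafka; on failure before the rename, the staging file can
    simply be discarded and the worker restarted."""
    staging = os.path.join(
        directory, f"{topic}_partition{partition}.avro.staging")
    final = os.path.join(
        directory, f"{topic}_partition{partition}-o{start}-{end}.avro")
    with open(staging, "wb") as f:
        f.write(payload)       # step one: write to the staging WAL
    os.rename(staging, final)  # step two: mark committed via rename
    return final

with tempfile.TemporaryDirectory() as d:
    name = commit_wal(d, "TopicA", 1, 100, 200, b"records")
    committed = os.path.basename(name)
# committed == "TopicA_partition1-o100-200.avro"
```

Because the rename either fully succeeds or fully fails, a crash between the two steps leaves only a staging file behind, which is never consumed, preserving the exactly-once property described above.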
In accordance with process 700, some embodiments herein may operate to implement a Kafka-HDFS streaming writer as a transactional process, using exactly-once semantics. In some embodiments, process 700 is not limited to Kafka messages/events. In some embodiments, process 700 may be implemented in systems with a queue that employs ever-increasing commit offsets. Aspects of process 700 support load balancing and scalability in some embodiments. For example, load balancing may be achieved via Kafka consumer groups by adding more HDFS writer processes to handle an additional load. Alternatively or additionally, the number of worker thread pools may be increased. For scalability, the HDFS writer of process 700 may support horizontal scaling without a need for recovery, where an upgrade might be achieved by killing a node and upgrading it.
Referring to, for example, processes 500 and 700, error handling is incorporated into some embodiments herein. In general, exceptions or errors may be classified into three categories: fatal errors, non-fatal errors, and incompatible schema detected errors. Fatal errors generally cannot be recovered from, and the executing process should immediately cease. Examples of a fatal error may include an out of memory (OOM) error, a failed to write to WAL error, and a failed to output log error. A non-fatal error may generally be recoverable and may be addressed by quitting the current worker thread and retrying the process (e.g., after a predetermined or configurable period of time). However, if a maximum number of retries has been attempted (i.e., a retry threshold), then the process may be stopped. A non-fatal error might include, for example, a failed to write to HDFS error, an HDFS replication error, and a Kafka consumer timeout error. For an incompatible schema detected error, the Kafka message cannot be discarded and will need to be written to an error log (e.g., a WAL) as soon as possible.
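For illustration only, this three-way classification and its retry handling might be sketched as follows; the error identifiers, action labels, and retry policy are assumptions introduced for the sketch:

```python
# Hypothetical error identifiers for the three categories described
# above: fatal, non-fatal (retryable), and incompatible schema.
FATAL = {"out_of_memory", "wal_write_failed", "output_log_failed"}
NON_FATAL = {"hdfs_write_failed", "hdfs_replication",
             "kafka_consumer_timeout"}

def handle_error(error, retries, max_retries):
    """Map an error to an action: fatal errors stop the process,
    non-fatal errors reschedule on a new worker thread until the retry
    threshold is hit, and incompatible-schema errors are logged to the
    WAL so the offending message is not discarded."""
    if error in FATAL:
        return "stop"
    if error in NON_FATAL:
        return "reschedule" if retries < max_retries else "stop"
    return "log_to_wal"  # incompatible schema: record, keep consuming

actions = [
    handle_error("out_of_memory", 0, 3),        # fatal
    handle_error("hdfs_write_failed", 1, 3),    # non-fatal, retryable
    handle_error("hdfs_write_failed", 3, 3),    # retry threshold hit
    handle_error("incompatible_schema", 0, 3),  # schema error
]
# actions == ["stop", "reschedule", "stop", "log_to_wal"]
```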
In a first step of the ACID-converter process, the Avro data files are moved to a staging area 840 as a staging table. A staging table 842, 844, and 846 is established for each of the records 832, 834, and 836, respectively, from ingestion area 830. The staging tables (i.e., data structures) are formatted as Avro files in the example of
In a second step of
In a third step of
In some aspects,
In some aspects, various applications and process jobs herein may be controlled and implemented by a workflow scheduler. In one embodiment herein, the workflow scheduler may be Apache Oozie® scheduler for Hadoop. However, the present disclosure is not limited to the Oozie scheduler and other schedulers may be used, depending for example on a platform and/or application. For example, a cleaner application may be periodically executed against the archive layer to free HDFS space and other jobs may be executed to implement a batch ingestion workflow. In some aspects,
In some aspects, metrics associated with some of the processes herein may be tracked and monitored for compliance with one or more predetermined or configurable standards. The standard may be defined in a SLA for an implementation of the processes disclosed herein. In some embodiments, a start time, an end time, and a duration associated with the jobs executed in performing processes disclosed herein may be monitored and compared to a predetermined nominal time for each. In some embodiments, a scheduler may track any of the times not met and further report, via an email or other reporting mechanism, a notification of a job end time that is not met. Other error handling and reporting strategies may be used in some other embodiments.
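As a sketch of such monitoring, assuming hypothetical job names and nominal durations expressed in seconds:

```python
def check_sla(jobs, nominal_seconds):
    """Compare each job's observed duration against its nominal SLA
    time and collect the names of jobs that should trigger a
    notification (e.g., an email from the scheduler)."""
    late = []
    for name, (start, end) in jobs.items():
        if (end - start) > nominal_seconds.get(name, float("inf")):
            late.append(name)
    return late

# Observed (start, end) times versus nominal durations, in seconds.
jobs = {"acid_convert": (0, 900), "archive_purge": (0, 50)}
nominal = {"acid_convert": 600, "archive_purge": 120}
late = check_sla(jobs, nominal)
# Only the conversion job, at 900 s against a 600 s budget, is late.
```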
System 1100 includes processor(s) 1110 operatively coupled to communication device 1120, data storage device 1130, one or more input devices 1140, one or more output devices 1150, and memory 1160. Communication device 1120 may facilitate communication with external devices, such as a data server and other data sources. Input device(s) 1140 may comprise, for example, a keyboard, a keypad, a mouse or other pointing device, a microphone, a knob or switch, an infra-red (IR) port, a docking station, and/or a touch screen. Input device(s) 1140 may be used, for example, to enter information into system 1100. Output device(s) 1150 may comprise, for example, a display (e.g., a display screen), a speaker, and/or a printer.
Data storage device 1130 may comprise any appropriate persistent storage device, including combinations of magnetic storage devices (e.g., magnetic tape and hard disk drives), flash memory, optical storage devices, Read Only Memory (ROM) devices, etc., while memory 1160 may comprise Random Access Memory (RAM), Storage Class Memory (SCM), or any other fast-access memory.
Ingestion engine 1132 may comprise program code executed by processor(s) 1110 (and within the execution engine) to cause system 1100 to perform any one or more of the processes described herein. Embodiments are not limited to execution by a single apparatus. Schema dataset 1134 may comprise schema definition files and representations thereof including database tables and other data structures, according to some embodiments. Data storage device 1130 may also store data and other program code 1138 for providing additional functionality and/or which are necessary for operation of system 1100, such as device drivers, operating system files, etc.
All systems and processes discussed herein may be embodied in program code stored on one or more non-transitory computer-readable media. Such media may include, for example, a floppy disk, a CD-ROM, a DVD-ROM, a Flash drive, magnetic tape, and solid state Random Access Memory (RAM) or Read Only Memory (ROM) storage units. Embodiments are therefore not limited to any specific combination of hardware and software.
Embodiments described herein are solely for the purpose of illustration. Those in the art will recognize other embodiments may be practiced with modifications and alterations to that described above.