1. Technical Field
The present teaching relates to methods, systems, and programming for data processing. Particularly, the present teaching is directed to methods, systems, and programming for monitoring data quality and dependency.
2. Discussion of Technical Background
Advances in the Internet have made a tremendous amount of information accessible to users located anywhere in the world. This introduces new challenges in data processing for “big data,” where a data set can be so large or complex that traditional data processing applications are inadequate. In big data processing, users can easily lose track of the quality of the data for the applications they are interested in.
Conventional approaches for monitoring data quality in a database require a user to input a query directed to the database. When the user wants to monitor data in a plurality of data sources in a big data system, the user has to input a plurality of queries, each of which corresponds to a data source, which is time-consuming for the user. The user also has to learn different query languages for different types of data sources. In addition, there is no easy way for the user to obtain the interrelationships among different jobs running on a same cluster or on different clusters in the big data system.
Therefore, there is a need to develop techniques to monitor data quality to overcome the above drawbacks.
The present teaching relates to methods, systems, and programming for data processing. Particularly, the present teaching is directed to methods, systems, and programming for monitoring data quality and dependency.
In one example, a method, implemented on a machine having at least one processor, storage, and a communication platform connected to a network for monitoring data in a plurality of data sources of heterogeneous types is disclosed. A request is received for monitoring data in the data sources of heterogeneous types. One or more metrics are determined based on the request. The request is converted into one or more queries based on the one or more metrics. Each of the one or more queries is directed to at least one of the data sources of heterogeneous types. A monitoring task is created for monitoring the data in the data sources based on the one or more queries in response to the request.
In another example, a system, having at least one processor, storage, and a communication platform connected to a network for monitoring data in a plurality of data sources of heterogeneous types is disclosed. The system comprises a user request receiver, a metrics determiner, a query generator, and a monitoring task generator. The user request receiver is configured for receiving a request for monitoring data in the data sources of heterogeneous types. The metrics determiner is configured for determining one or more metrics based on the request. The query generator is configured for converting the request into one or more queries based on the one or more metrics. Each of the one or more queries is directed to at least one of the data sources of heterogeneous types. The monitoring task generator is configured for creating a monitoring task for monitoring the data in the data sources based on the one or more queries in response to the request.
Other concepts relate to software for implementing the present teaching on monitoring data in a plurality of data sources of heterogeneous types. A software product, in accord with this concept, includes at least one machine-readable non-transitory medium and information carried by the medium. The information carried by the medium may be executable program code data, parameters in association with the executable program code, and/or information related to a user, a request, content, or information related to a social group, etc.
In one example, a machine-readable, non-transitory and tangible medium having information recorded thereon for monitoring data in a plurality of data sources of heterogeneous types is disclosed. The information, when read by the machine, causes the machine to perform the following. A request is received for monitoring data in the data sources of heterogeneous types. One or more metrics are determined based on the request. The request is converted into one or more queries based on the one or more metrics. Each of the one or more queries is directed to at least one of the data sources of heterogeneous types. A monitoring task is created for monitoring the data in the data sources based on the one or more queries in response to the request.
Additional novel features will be set forth in part in the description which follows, and in part will become apparent to those skilled in the art upon examination of the following and the accompanying drawings or may be learned by production or operation of the examples. The novel features of the present teachings may be realized and attained by practice or use of various aspects of the methodologies, instrumentalities and combinations set forth in the detailed examples discussed below.
The methods, systems, and/or programming described herein are further described in terms of exemplary embodiments. These exemplary embodiments are described in detail with reference to the drawings. These embodiments are non-limiting exemplary embodiments, in which like reference numerals represent similar structures throughout the several views of the drawings, and wherein:
In the following detailed description, numerous specific details are set forth by way of examples in order to provide a thorough understanding of the relevant teachings. However, it should be apparent to those skilled in the art that the present teachings may be practiced without such details. In other instances, well known methods, procedures, systems, components, and/or circuitry have been described at a relatively high-level, without detail, in order to avoid unnecessarily obscuring aspects of the present teachings.
The present disclosure describes method, system, and programming aspects of monitoring data, realized as a specialized and networked system by utilizing one or more computing devices (e.g., mobile phone, personal computer, etc.) and network communications (wired or wireless). The method and system as disclosed herein aim at monitoring data in an effective and efficient manner.
Data quality has different meanings for different people. For some people, data quality means how the values for a particular feature are statistically distributed. For others, data quality means how the distribution changes over time. For still others, data quality means how features from different data sources are correlated, e.g. matching, overlapping, etc. The system disclosed in the present teaching may automatically access and monitor data quality from different data sources, based on a user's request which can indicate what data quality means for the user and what data to monitor. The system may determine some metrics based on the request, and convert the request into some queries based on the metrics. For heterogeneous types of data sources, the user does not have to know any query language regarding the data sources. The system can generate and optimize the queries automatically.
For example, when a user wants to monitor data from Hive, HDFS (Hadoop Distributed File System), and PIG, the user can just send a request to specify some metrics, without knowing query or programming languages such as Hive QL or Java. The system in the present teaching may convert the request to some queries, each of which is directed to at least one of Hive, HDFS, and PIG. The system can optimize the queries to make them efficient and effective, e.g. based on the data structure in each of the data sources.
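By way of illustration only, the following Python sketch shows one possible way such a request could be converted into per-source queries; the request fields, metric names, and generated query text are assumptions made for this example and are not part of the present teaching.

```python
# Illustrative sketch only: how a single monitoring request might be converted
# into per-source queries without the user writing any query language.
# The metric names, source types, and table/path fields are hypothetical.

def request_to_queries(request):
    """Convert a monitoring request into one query per (metric, data source)."""
    queries = []
    for target in request["targets"]:
        for metric in target["metrics"]:
            if target["type"] == "hive":
                # Hive tables can be checked with generated HiveQL.
                queries.append({
                    "source": target["name"],
                    "language": "hiveql",
                    "text": f"SELECT {metric}({target['column']}) "
                            f"FROM {target['name']}",
                })
            elif target["type"] == "hdfs":
                # HDFS feeds have no query language; record an operation
                # the executor knows how to run against the file system.
                queries.append({
                    "source": target["path"],
                    "language": "hdfs-op",
                    "text": f"{metric} {target['path']}",
                })
    return queries

# Example request: count rows of a Hive table and count files of an HDFS feed.
example = {"targets": [
    {"type": "hive", "name": "page_views", "column": "*", "metrics": ["count"]},
    {"type": "hdfs", "path": "/data/feeds/clicks", "metrics": ["file_count"]},
]}
print(request_to_queries(example))
```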
Usually, the request may be related to monitoring data periodically. Accordingly, the system can create a monitoring task based on the optimized queries. The system can then store the monitoring task and run it periodically. The user can input the request by e.g. selecting one or more jobs in some clusters of a data system that includes different data sources. The terms “job,” “table,” and “task” here may be used interchangeably to mean a Hive table, an Oozie job, or an HDFS feed. The request may indicate some alert conditions for generating alerts or warnings related to the monitoring. The system may generate an alert and send it to the user if one of the alert conditions is met after the monitoring task is executed.
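For illustration, a monitoring task created from such a request might be recorded in a form similar to the following sketch; all field names, the example e-mail address, and the schedule value are hypothetical.

```python
# A hedged sketch of the kind of monitoring task record the system might
# create from a request; the schema shown here is assumed for illustration.

example_task = {
    "name": "p13n_magazine_hourly_tbl",       # job/table selected by the user
    "queries": [
        # automatically generated and optimized; the user never writes these
        {"source": "hive",
         "text": "SELECT count(*) FROM p13n_magazine_hourly_tbl"},
    ],
    "schedule": "hourly",                      # the task runs periodically
    "alert_conditions": [
        {"field": "age", "type": "range", "min": 0, "max": 200},
    ],
    "notify": ["user@example.com"],            # where alerts are sent
}
```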
In one example, the user can select one of the existing monitoring tasks provided by the system to generate a monitoring request. In another example, the user can share monitoring tasks with other users. For example, a user can select one of the monitoring tasks shared by other users to generate a monitoring request. A user can determine whether a monitoring task of his/hers is shared or not. A user can also determine a group of users with whom to share his/her monitoring task(s). The system can provide a user interface for the user to input the request, determine metrics, determine alert conditions, determine sharing scope, etc.
Thus, the user does not need to know any query languages or write any queries to monitor data. Once the user sends a request, the system can perform the monitoring task periodically, based on automatically generated queries. The user does not need to worry about the monitoring unless an alert is received from the system.
In addition, the system can generate a data dependency graph that can reflect and track the overall status and health of data processing jobs, e.g. big data pipelines. In one example, the data dependency graph includes a first set of nodes each of which represents a data source, a second set of nodes each of which represents a data processing job, e.g. a running pipeline step, and a set of arrows each of which connects two nodes and represents a dependency relationship between the two nodes. For example, if an arrow starts from a node representing a data source and ends at a node representing a running pipeline step, the system has determined that the running pipeline step depends on the data source, e.g. consumes data from the data source. In this manner, the data dependency graph generated by the system can provide a visual illustration of data relationships among the data sources and running pipeline steps. Based on the data dependency graph, the user can easily understand the potential impact if the user wants to modify any data, add a running pipeline step, or delete a running pipeline step. As such, the data dependency graph can enable more efficient data monitoring, troubleshooting, resource allocation, and system operations on a big data system.
Additional novel features will be set forth in part in the description which follows, and in part will become apparent to those skilled in the art upon examination of the following and the accompanying drawings or may be learned by production or operation of the examples. The novel features of the present teachings may be realized and attained by practice or use of various aspects of the methodologies, instrumentalities and combinations set forth in the detailed examples discussed below.
Individual users 108 may be of different types such as users connected to the network 110 via desktop computers 108-1, laptop computers 108-2, a built-in device in a motor vehicle 108-3, or a mobile device 108-4. An individual user 108 may send a request to the data source monitoring engine 104 via the network 110 for monitoring data in the data system 106. Based on the request, the data source monitoring engine 104 may generate a monitoring task and execute it periodically to monitor data in the data system 106. The data source monitoring engine 104 may generate and send an alert to the user if a pre-determined alert condition is met after the monitoring task is executed.
More often than not, a corporate user 102 can send a request to the data source monitoring engine 104 via the network 110 for monitoring data in the data system 106. The corporate user 102 may represent a company, a corporation, a group of users, an entity, etc. For example, a company that is an Internet service provider may want to monitor data related to online activities of users of the Internet service provided by the company. In that case, the data may be stored in the data system 106 as various types, e.g. in databases like Hive, HBase, Oozie, HDFS, etc. This may be because users' online activities can include different types of actions and hence be related to different and heterogeneous types of data.
The data source monitoring engine 104 may receive a request for monitoring data in the data system 106, from either a corporate user 102 or an individual user 108. The data source monitoring engine 104 can determine metrics based on the request and convert the request into one or more queries based on the metrics. The data source monitoring engine 104 can also optimize the queries that may be directed to data sources of heterogeneous types. The data source monitoring engine 104 may generate a monitoring task based on the optimized queries and execute it periodically to monitor data in the data system 106. Based on the request, the data source monitoring engine 104 can also generate one or more alert conditions associated with the monitoring task, such that the data source monitoring engine 104 can generate and send an alert to the user if one of the alert conditions is met after the monitoring task is executed.
The data dependency analyzing engine 105 may collect information from different data sources in the data system 106 and information from different data processing jobs, e.g. running pipeline steps. The data dependency analyzing engine 105 may determine dependency relationships among the data sources and running jobs to generate a data dependency graph. The data dependency graph may include nodes representing data sources, nodes representing running jobs, and arrows, each of which connects two nodes and represents a dependency relationship between the two nodes. The data dependency graph may be generated either periodically or upon request. The data dependency analyzing engine 105 can provide the data dependency graph to a user for the user's better understanding of the data in the data system 106.
The content sources 112 include multiple content sources 112-1, 112-2 . . . 112-3, such as vertical content sources. A content source 112 may correspond to a website hosted by an entity, whether an individual, a business, or an organization such as USPTO.gov, a content provider such as cnn.com or Yahoo.com, a social network website such as Facebook.com, or a content feed source such as Twitter or blogs. A corporate user 102, e.g. a company that maintains a web site and/or runs a search engine, may access information from any of the content sources 112-1, 112-2 . . . 112-3.
The user request receiver 302 in this example obtains a request for managing a monitoring task. The request may come from an individual user 108 or a corporate user 102. The user request receiver 302 may keep receiving requests and send them to the user identifier 304 for user identification.
The user identifier 304 in this example identifies the user based on the request. The user identifier 304 can send an identity of the user to the user authorization unit 306 for user authorization. The user authorization unit 306 in this example determines authorization information for the user and determines whether the user should be authorized for monitoring data. For example, a lower level corporate user of a company may only monitor a limited set of data related to the company, while a higher level corporate user may monitor all data related to the company. If the user authorization unit 306 determines that the user is not authorized to monitor data indicated in the request, the user authorization unit 306 can send an instruction to the user interface generator 310 to deny the user's request. If the user authorization unit 306 determines that the user is authorized to monitor data indicated in the request, the user authorization unit 306 can send another instruction to the monitoring task managing unit 308 to process the user's request for data monitoring.
The monitoring task managing unit 308 in this example receives the authorization information and the request from the user authorization unit 306 and identifies data sources based on the authorization information and the request, from the data system 106. The monitoring task managing unit 308 determines existing tasks and tables in the data sources associated with the user, from the monitoring task database 309 where existing monitoring tasks are stored. The monitoring task managing unit 308 can then retrieve the tasks and tables and provide them to the user via a user interface generated by the user interface generator 310.
The user interface generator 310 in this example may receive an instruction from the user authorization unit 306 to deny the user's request. In that case, the user is not authorized to monitor data indicated in the request. Thus, the user interface generator 310 may generate a user interface to indicate that the request is denied. The user interface may include reasons that the request is denied, e.g. “data monitoring regarding such data is not open to users at your level.” The user interface may also include an option for the user to input another request, with an instruction like “Please enter another request by selecting from the following tables.”
The user interface generator 310 in this example may also receive a message from the monitoring task managing unit 308 to provide the tasks and tables to the user. In that case, the user is authorized to monitor data indicated in the request. Thus, the user interface generator 310 may generate a user interface to provide the tasks and tables from data sources associated with the user. Through the user interface, the user may input selections for monitoring data of his/her interest. The selections may be based on existing monitoring tasks, shared monitoring tasks, and/or tables or columns associated with the user.
As illustrated in
Like existing monitoring tasks, each of the shared monitoring tasks may include information about monitor name, type, monitor entity, schedule, last run time, and operation options related to the monitoring task. The operation options for each shared monitoring task may include a “Dashboard” button 1318, a “Cron jobs” button 1317, a “view” button 1316, and a “Subscribe” button 1314. By clicking on the “view” button 1316, the user can access detailed information related to the shared monitoring task. By clicking on the “Dashboard” button 1318, the user can access previous execution results associated with the shared monitoring task. The user can also subscribe to the shared monitoring task, by clicking on the “Subscribe” button 1314.
Each of the shared monitoring tasks is associated with a username. The “Shared Monitors” section 1310 can include a search box 1312 for the user to search shared monitoring tasks, e.g. by username, by monitor name, or by type. Through the user interface 1300, the user can select one or more tasks from the existing monitoring tasks and/or shared monitoring tasks to monitor data in the data system 106.
As illustrated in
The user interface 1400 may also include a “Checks” section 1410 which includes information about pre-defined metrics for the monitoring task. The user can modify the pre-defined metrics shown in the “Checks” section 1410, such that the monitoring task “p13n_magazine_hourly_tbl” may be associated with modified metrics.
The table specified in
The user may also change the selections with respect to the columns 1414 that are included in the table. For example, metric “count” is currently selected in
The user can also specify alert conditions under the columns section 1414. In one case, the user can input min and max values under “Alerts on Range Violation” 1416, regarding some field and metric. For example, the user may specify the min value for a field of “age” to be 0 and the max value for the field of “age” to be 200. Then, if the system finds a record with an “age” value outside the range of 0 to 200 after executing the monitoring task, the system will generate an alert and send it to the email addresses specified in the “Basic Info” section 1404.
In another case, the user can input a percentage number under “Alerts on Average Violation” 1418, regarding some field and metric. For example, a corporate user that is an Internet web site provider may input 20% under “Alerts on Average Violation” 1418 for a field of “views”. This may be because the corporate user expects the number of views of its web site in a time period to differ from the average number of views by no more than 20%. There can be different ways to define the average number of views. In one example, if the time period mentioned above is an hour, the average number of views can be calculated by averaging the numbers of views in a plurality of hours before the time period. In another example, if the time period mentioned above is an hour, the average number of views can be calculated by averaging the numbers of views at the same time of day on days before the time period. For example, if the time period mentioned above is 5:00 PM to 6:00 PM today, the average number of views may be calculated by averaging the numbers of views during 5:00 PM to 6:00 PM yesterday, during 5:00 PM to 6:00 PM the day before yesterday, and during 5:00 PM to 6:00 PM three days before today. Then, if the number of views based on an execution of the monitoring task differs from the average number of views by more than 20%, the system will generate an alert and send it to the email addresses specified in the “Basic Info” section 1404.
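As a minimal illustration of the two alert conditions described above, the following sketch checks a range violation and an average violation; the threshold values and field names are assumed for this example rather than taken from any actual implementation.

```python
# A minimal sketch of the two alert conditions described above, with
# hypothetical field values and thresholds; not the actual alerting logic.

def range_violation(value, min_value, max_value):
    """True if a single record's value falls outside the allowed range."""
    return value < min_value or value > max_value

def average_violation(current, history, max_deviation):
    """True if the current value deviates from the historical average
    by more than the allowed fraction (e.g. 0.2 for 20%)."""
    average = sum(history) / len(history)
    return abs(current - average) > max_deviation * average

# An "age" outside 0..200 triggers a range alert.
print(range_violation(-3, 0, 200))                       # True -> alert

# Hourly views compared against the same hour on the previous three days.
print(average_violation(1300, [950, 1000, 1050], 0.2))   # True -> alert
```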
An alert may be a warning about an error, e.g. when a person's age is recorded as a value below 0. An alert may also indicate an unexpectedly good result, e.g. when a web site's views increase more than expected compared to the average number of views. Based on the set alert conditions, the system can send an alert to a user, either for the user to notice and correct an error or for the user to notice and analyze an unexpectedly good result.
Referring back to
The user input analyzer 312 in this example can receive and analyze user inputs via the user interface provided to the user by the user interface generator 310. The input may include the user's selection of metrics, input about alert condition parameters, requests for monitoring task results, and/or other information related to data monitoring. The user input analyzer 312 may analyze and sort out the inputs, and send the analyzed inputs to the monitoring task managing unit 308 for managing the monitoring task and to the task result reporter 318 for task result reporting.
The monitoring task managing unit 308 in this example may receive the analyzed inputs from the user input analyzer 312, and generate and/or update a monitoring task associated with the user based on the analyzed inputs, e.g. the user specified metrics and alert conditions for data monitoring. The monitoring task managing unit 308 may store the monitoring task associated with some metadata and the user's personal information into the monitoring task database 309 where information about different monitoring tasks associated with different users can be stored. The metadata may include information associated with the monitoring task, e.g. information about alert conditions, partition conditions, schedule of the monitoring task, etc. The user's personal information may include the user's user ID, the user's authority level, other users associated with the user, etc. In one embodiment, the monitoring task managing unit 308 may send alert conditions associated with a monitoring task to the task result reporter 318 for generating an alert when one of the alert conditions is met.
The monitoring task scheduler 314 in this example can schedule different monitoring tasks stored in the monitoring task database 309 for execution. In one embodiment, the monitoring task scheduler 314 may determine all monitoring tasks to be executed in the next time period, e.g. the next hour, based on the schedule information of the monitoring tasks stored in the monitoring task database 309. The monitoring task scheduler 314 may retrieve the monitoring tasks to be executed in the next time period from the monitoring task database 309 and store them in a task queue, in a sequence according to their respective running schedules. According to a timer, the monitoring task scheduler 314 can extract the next monitoring task in the task queue when the scheduled time comes, and send it to the monitoring task executor 316 for execution.
The monitoring task executor 316 in this example executes monitoring tasks received from the monitoring task scheduler 314. In one embodiment, the monitoring task executor 316 sends a task request to the monitoring task scheduler 314 when the monitoring task executor 316 has an idle processor for performing a monitoring task. The monitoring task scheduler 314 may send the monitoring task executor 316 the next monitoring task in the task queue, either upon the request or when the scheduled running time for the next monitoring task comes. After the monitoring task executor 316 receives the monitoring task, it will execute the task based on the metrics associated with the task, and generate a task result accordingly. In one embodiment, the monitoring task executor 316 may store the task result into the monitoring task database 309 in association with the monitoring task. In another embodiment, the monitoring task executor 316 may send the task result to the task result reporter 318 for generating a task result report. In yet another embodiment, the monitoring task scheduler 314 may merely send information (e.g. a task ID) about the next monitoring task to the monitoring task executor 316, and the monitoring task executor 316 can retrieve the next monitoring task from the monitoring task database 309 for execution, based on the information.
The task result reporter 318 in this example analyzes one or more alert conditions associated with an executed monitoring task and determines whether any of the alert conditions is met based on the executed task result. The task result reporter 318 may receive the result of the executed monitoring task from the monitoring task executor 316. The task result reporter 318 may obtain the alert conditions associated with the executed monitoring task either from the monitoring task managing unit 308 or from the monitoring task database 309. If the task result reporter 318 determines that one of the alert conditions is met, the task result reporter 318 may generate an alert accordingly. The task result reporter 318 may generate multiple alerts, each of which is triggered by a different alert condition associated with the executed monitoring task. The task result reporter 318 may then store the generated alert(s) associated with the executed monitoring task in the monitoring task database 309, and/or send the generated alert(s) to the user, e.g. by sending an email to the email addresses listed in the “Basic Info” section 1404 in
The task result reporter 318 in this example may also generate a result report or summary associated with a monitoring task, either periodically or upon request from a user. In one embodiment, the user input analyzer 312 receives a user request, via a user interface, for a result report regarding a monitoring task, and forwards the request to the task result reporter 318. The task result reporter 318 then retrieves results from previous executions of the monitoring task from the monitoring task database 309, based on the request. For example, the task result reporter 318 may retrieve results of the monitoring task executed during the last three months or during the last year. The task result reporter 318 can then generate a result summary based on the retrieved results and send it to the user in response to the user request.
In another embodiment, the task result reporter 318 may retrieve results for a monitoring task and generate a result summary for the task periodically or according to a timer. For example, the task result reporter 318 may generate a result summary for a monitoring task every week or every month, and send it to one or more users associated with the monitoring task.
Based on the authorization information, at 407, it is determined whether the user is authorized for monitoring the data associated with the user request. If so, the process goes to 410, where existing jobs or tables in data sources associated with the user are determined, e.g. based on the user request. Otherwise, the process goes to 408, where the user request is denied, e.g. by providing a denial message to the user.
At 412, the jobs or tables are provided to the user via a user interface, such that the user can provide inputs to select or modify metrics for monitoring data. At 414, user inputs are received via the user interface and analyzed to determine metrics selected by the user. At 416, a monitoring task associated with the user is generated or updated, e.g. based on the metrics selected or modified by the user. At 418, the monitoring task is stored into a database. At 420, alert conditions are generated and stored in the database associated with the monitoring task.
At 422, various monitoring tasks in the database are scheduled with a task queue. The task queue may include e.g. monitoring tasks to be executed in the next hour, in a sequence according to their respective running schedules. At 424, the monitoring tasks in the task queue are executed, e.g. one by one according to their respective running schedules, to generate task results. At 426, task results of the executed tasks are stored into the database, each associated with a corresponding monitoring task. At 428, an alert is generated when a result of a monitoring task meets an alert condition, and is sent to the user associated with the monitoring task. At 430, a result summary is generated and sent to a user, either periodically or upon request from the user.
The data source identifier 502 in this example receives user authorization information associated with a user and a request, e.g. from the user authorization unit 306. If the user authorization information indicates that the user is authorized to monitor data associated with the request, the data source identifier 502 may determine data sources associated with the user based on the request. For example, the data source identifier 502 may determine that the user is authorized to monitor data from a Hive database. In one embodiment, the user requests to monitor some data but can be authorized to monitor only a subset of the data requested. This may be because the user's authority level is low such that he/she is not authorized to monitor some types of databases or some types of tables in a database.
The data source identifier 502 may retrieve information about the data to be monitored by the user, e.g. information about the tables in the identified data sources. The data source identifier 502 can then send the information about the data and the identified data sources to the existing task extractor 504 for task extraction.
The existing task extractor 504 in this example extracts, from the monitoring task database 309, existing monitoring tasks associated with the identified data sources and/or associated with the user. In one embodiment, the existing task extractor 504 may also extract monitoring tasks associated with other users and shared with the user, from the monitoring task database 309. The existing task extractor 504 may then provide, to the user, information about tables in the data sources, existing and/or shared tasks, etc.
The metrics determiner 512 in this example receives analyzed user input. The user input may be provided by a user via a user interface, to indicate the user's selection and/or modification related to a monitoring task. The metrics determiner 512 may determine metrics for monitoring data based on the analyzed user input. For example, the metrics may include one or more of the metrics illustrated in
The query generator 514 in this example receives the metrics from the metrics determiner 512, and generates queries based on the metrics and the data sources associated with the metrics. For example, when there are two metrics each of which is associated with a different type of data source, the query generator 514 may generate two queries each of which is associated with one of the two metrics and based on a different query language.
The query generator 514 may also optimize the generated queries, e.g. based on the metrics. For example, when there are multiple queries generated for multiple metrics that have some common features and/or are related to a same data source, the query generator 514 may merge the multiple queries into one or two simple queries. The query generator 514 may then send the queries to the monitoring task generator 520 for monitoring task generation.
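The following sketch illustrates, under assumed metric and table names, the kind of merging described above, in which several metrics directed to the same table are combined into a single generated query rather than one query per metric.

```python
# Hedged sketch of the optimization described above: metrics that target the
# same Hive table are merged into one generated query instead of one query
# per metric. Table and column names are hypothetical.

from collections import defaultdict

def merge_hive_queries(metric_requests):
    """Group (table, column, metric) requests by table and emit one
    HiveQL statement per table that computes all requested metrics."""
    by_table = defaultdict(list)
    for table, column, metric in metric_requests:
        by_table[table].append(f"{metric}({column})")
    return {table: f"SELECT {', '.join(exprs)} FROM {table}"
            for table, exprs in by_table.items()}

requests = [
    ("p13n_magazine_hourly_tbl", "*", "count"),
    ("p13n_magazine_hourly_tbl", "age", "avg"),
    ("p13n_magazine_hourly_tbl", "age", "max"),
]
# One merged query instead of three separate ones.
print(merge_hive_queries(requests))
```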
As such, the system can convert the user request into one or more queries that are automatically generated and optimized by the system. The user does not need to know any query language or input any query.
The metadata generator 516 in this example receives the analyzed user input, and generates metadata related to the metrics, e.g. based on the analyzed user input. The metadata may include e.g. information under the “Basic Info” section 1404 in
The sharing configuration unit 518 in this example receives metadata from the metadata generator 516, and determines sharing configuration based on the metadata. The sharing configuration may indicate whether the user wants to share a monitoring task with other users. The sharing configuration may also indicate a list of users with whom the user wants to share a monitoring task. In one embodiment, the sharing configuration unit 518 may determine sharing configuration based on the user's personal information or historical behavior. For example, a lower level corporate user may have to share all monitoring tasks with a higher level corporate user in a same company, due to a pre-determined rule. In another example, if a user has never shared any monitoring task with any other user, the sharing configuration unit 518 may give a default sharing configuration for the user to avoid sharing any new monitoring tasks. After determining the sharing configuration, the sharing configuration unit 518 may then send the sharing configuration to the monitoring task generator 520 for monitoring task generation.
The monitoring task generator 520 in this example receives queries from the query generator 514 and sharing configuration from the sharing configuration unit 518. The monitoring task generator 520 can generate or update a monitoring task based on the queries and sharing configuration. The monitoring task generator 520 may then store the monitoring task associated with the user and the metadata, in the monitoring task database 309.
In one example, the monitoring task generator 520 can generate a new monitoring task associated with queries generated based on the user's input and associated with the pre-determined sharing configuration. The user's input may be received e.g. via the user interface 1400 in
In another example, the monitoring task generator 520 can update an existing monitoring task associated with queries generated based on the user's input and associated with updated sharing configuration. Some of the user's input may be received e.g. via the user interface 1300 in
In one embodiment, the monitoring task generator 520 may send information about the monitoring task to the alert condition generator 530 for generating alert conditions. The alert condition generator 530 in this example receives analyzed user input and information about the monitoring task. The alert condition generator 530 can generate alert conditions associated with the monitoring task based on the analyzed user input. The user input may be received e.g. via the user interface 1400 in
The alert condition generator 530 may store the alert conditions 532 in the monitoring task managing unit 308 or store the alert conditions into the monitoring task database 309 (not shown). In either case, each alert condition is associated with a monitoring task, such that after the monitoring task is executed, the system can retrieve the associated alert condition and determine whether an alert should be generated based on the alert condition.
At 612, queries are automatically generated and optimized based on the metrics. At 614, metadata related to the metrics are generated. At 616, sharing configuration is determined based on the metadata. At 618, a monitoring task is generated or updated based on the queries and sharing configuration. At 620, the monitoring task is stored associated with the user, the metadata, and/or the sharing configuration. At 622, alert conditions associated with the monitoring task are generated. At 624, the alert conditions are stored associated with the monitoring task.
The active task determiner 702 in this example can determine newly active monitoring tasks in the monitoring task database 309 in a given time period. Different monitoring tasks stored in the monitoring task database 309 may have different running schedules for execution, e.g. once every day at 12:00 PM, twice every day at 9:00 AM and 5:00 PM, once every hour, once every week, etc. The active task determiner 702 may determine which monitoring tasks are scheduled to be executed in a time period, e.g. the next hour from the current time, based on the time information provided by the timer 703. The determined monitoring tasks can be referred to as active tasks in the monitoring task database 309 for the time period. In the above example, the active task determiner 702 may determine newly active tasks in the monitoring task database 309 once every hour.
After determining the newly active tasks, the active task determiner 702 may retrieve them from the monitoring task database 309. In one case, there is only one newly active task in a time period. In another case, there is no active task in a time period. The active task determiner 702 may then send the retrieved active task(s) to the task ranking unit 704 for task ranking.
The task ranking unit 704 in this example receives the retrieved task(s) from the active task determiner 702 and ranks them, e.g. based on their respective scheduled execution times. For example, the task ranking unit 704 may assign a higher ranking to a monitoring task that is scheduled to be executed in 5 minutes and assign a lower ranking to a monitoring task that is scheduled to be executed in 10 minutes. The task ranking unit 704 may send the ranked active monitoring tasks to the task queue generator/updater 706 for task queue generation.
The task queue generator/updater 706 in this example receives the ranked active tasks from the task ranking unit 704 and generates or updates the task queue 708, e.g. using the ranked active tasks. In one example, the system has just initiated the data monitoring, and the task queue generator/updater 706 may generate the task queue 708 and feed the task queue 708 with the ranked active tasks in order of their respective rankings, e.g. from higher ranked tasks to lower ranked tasks. The task queue 708 may follow a FIFO (first in first out) rule, such that a higher ranked task will be extracted from the task queue 708 and executed before a lower ranked task. In another example, after the system has executed monitoring tasks for some time, the task queue generator/updater 706 may update the task queue 708 and feed the task queue 708 with the newly ranked active tasks in order of their respective rankings, e.g. from higher ranked tasks to lower ranked tasks. In this case, the task queue 708 may also follow a FIFO rule, such that previous active tasks (if any) in the task queue 708 will be extracted and executed before the newly ranked active tasks are extracted and executed.
In one embodiment, the active task determiner 702 may not retrieve the active monitoring tasks, but just retrieve some metadata about the active monitoring tasks from the monitoring task database 309. The task ranking unit 704 may rank the newly active tasks based on the metadata that may include information about the scheduled execution times for the newly active tasks, and generate a sequence of task IDs corresponding to the newly active tasks. The task queue generator/updater 706, after receiving the sequence of task IDs, can retrieve the newly active tasks from the monitoring task database 309 and update the task queue 708 accordingly.
The task extractor 710 in this example may extract monitoring tasks from the task queue 708, either according to the timer 703 or upon request, and send the extracted tasks for execution. In one embodiment, the task extractor 710 may receive a task request from the monitoring task executor 316 when the monitoring task executor 316 has an idle processor to execute a monitoring task. In response to the task request, the task extractor 710 may extract the next queued monitoring task from the task queue 708 and send it to the monitoring task executor 316 for execution. In another embodiment, the task extractor 710 may extract the next queued monitoring task from the task queue 708 according to the time information provided by the timer 703. For example, at one minute before the scheduled execution time of the next queued monitoring task, the task extractor 710 may extract the next queued monitoring task and send it to the monitoring task executor 316 for execution. In yet another embodiment, after the task extractor 710 receives a task request from the monitoring task executor 316, the task extractor 710 may wait until some time (e.g. one minute) before the scheduled execution time of the next queued monitoring task to extract the next queued monitoring task from the task queue 708 and send it to the monitoring task executor 316 for execution.
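The scheduling behavior described above may be illustrated by the following sketch, in which tasks active in the next period are ranked by their scheduled times, placed into a FIFO queue, and handed out one at a time; the task records and time values are hypothetical.

```python
# A small sketch of the scheduling behavior described above; not the actual
# scheduler. Task contents and the time representation are assumed.

from collections import deque

def build_task_queue(tasks, window_start, window_end):
    """Select tasks scheduled inside the window, earliest first (highest rank)."""
    active = [t for t in tasks if window_start <= t["scheduled_at"] < window_end]
    return deque(sorted(active, key=lambda t: t["scheduled_at"]))

def next_task(queue):
    """FIFO extraction: the earliest-scheduled active task runs first."""
    return queue.popleft() if queue else None

tasks = [
    {"id": "hive_row_count", "scheduled_at": 905},   # e.g. minutes since midnight
    {"id": "hdfs_feed_size", "scheduled_at": 910},
    {"id": "weekly_report",  "scheduled_at": 2400},  # outside the next hour
]
queue = build_task_queue(tasks, window_start=900, window_end=960)
print(next_task(queue)["id"])   # hive_row_count runs before hdfs_feed_size
```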
The executed task determiner 902 in this example obtains a request for a monitoring result summary associated with a user. The request may come from the user and be carried in the analyzed user input. The request may also come from the timer 903, when a scheduled time comes for generating the result summary. For example, the system may periodically generate a result summary for a monitoring task and send it to users associated with the task. The timer 903 may be synchronized with the timer 703.
The executed task determiner 902 may determine an executed task based on the request. The executed task determiner 902 can send information about the executed task to the executed metrics determiner 904. In one embodiment, the executed task determiner 902 may determine multiple executed tasks based on the request, and send information about each of the executed tasks to the executed metrics determiner 904. In that case, a result summary may be generated for each of the executed tasks.
The executed metrics determiner 904 in this example determines one or more metrics associated with the executed task received from the executed task determiner 902. In one embodiment, the executed metrics determiner 904 determines one or more metrics associated with each of the executed tasks received from the executed task determiner 902. The executed metrics determiner 904 can then send the determined metric(s) to the result summary unit 906 for generating the result summary.
The result summary unit 906 in this example can receive the determined metrics from the executed metrics determiner 904 and retrieve historical results corresponding to each of the metrics from the monitoring task database. For example, a monitoring task may be related to monitoring the number of visitors on a web site every day. The result summary unit 906 may retrieve the visitor numbers in the past three months for the web site.
The result summary unit 906 may also retrieve previous alerts generated for the executed task. Referring to the above example about visitor numbers, an alert condition may be set so that an alert is triggered when any daily visitor number differs from the average daily visitor number in the past three months by more than 50%. Then, the system might have generated one or more alerts, each of which was triggered when the alert condition was met. The result summary unit 906 can retrieve information about each alert previously generated, including information about generation date and time, alert title, alert reason, etc.
The result summary unit 906 may generate a result summary based on the retrieved historical results and/or previously generated alerts associated with the executed task. The result summary unit 906 can provide the result summary to the user upon request from the user, or to user(s) subscribed to the executed task upon request from the timer 903 when the scheduled time for result summary comes. The result summary can be provided to the user via one or more user interfaces.
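For illustration only, a result summary might be assembled along the lines of the following sketch; the storage layout and field names are assumptions made for this example, not the actual schema of the monitoring task database.

```python
# Illustrative sketch of assembling a result summary from stored history and
# previously generated alerts; the storage layout and field names are assumed.

def build_result_summary(task_id, results_store, alerts_store):
    """Collect historical metric values and prior alerts for one task."""
    history = results_store.get(task_id, [])
    alerts = alerts_store.get(task_id, [])
    values = [r["value"] for r in history]
    return {
        "task": task_id,
        "runs": len(history),
        "latest": values[-1] if values else None,
        "average": sum(values) / len(values) if values else None,
        "alerts": alerts,          # e.g. date/time, title, reason per alert
    }

results_store = {"daily_visitors": [{"value": v} for v in (980, 1010, 2100)]}
alerts_store = {"daily_visitors": [
    {"time": "2016-07-01 09:00", "reason": "deviates from average by > 50%"},
]}
print(build_result_summary("daily_visitors", results_store, alerts_store))
```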
As illustrated in
The user interface 1500 may also include a “Show Alerts” button 1514. In one embodiment, the user can access alerts generated for the monitoring task by clicking on the “Show Alerts” button 1514. In another embodiment, the user can access alerts generated for the metric “matched” 1510 by clicking on the “Show Alerts” button 1514 and another “Show Alerts” button (not shown) can be clicked to access alerts generated for the metric “less_than_one” 1520. In yet another embodiment, after the user clicks on the “Show Alerts” button 1514, a message “No Alert For this Dashboard” is displayed to the user when no alert has been generated for the metric “matched” 1510.
As illustrated in
The user interface 1600 may also include a “Hide Alerts” button 1630. In one embodiment, the user can hide alerts generated for the monitoring task by clicking on the “Hide Alerts” button 1630. In another embodiment, the user can hide alerts generated for the metric associated with the alerts by clicking on the “Hide Alerts” button 1630 and other “Hide Alerts” buttons can be clicked to hide alerts generated for the other metrics associated with the monitoring task. In either embodiment, the “Hide Alerts” button 1630 will disappear and a “Show Alerts” button will be displayed.
In one embodiment, the system may have a default setting to display the alerts to the user when the result summary is first provided until the user clicks on the “Hide Alerts” buttons. In another embodiment, the system may have a default setting to hide the alerts from the user when the result summary is first provided until the user clicks on the “Show Alerts” buttons.
In one embodiment, the alert records 1610 may be displayed to the user in the user interface 1600 that is different from the user interface 1500 in
Referring back to
The task result reporter 318 in
The result analyzer 908 can then analyze the results with the alert conditions. Based on the analysis, the result analyzer 908 can determine whether an alert condition is met and whether an alert needs to be generated accordingly. If one or more alert conditions are met, the result analyzer 908 may send information about the results and the alert conditions to the alert generator 910 for alert generation. If no alert condition is met, the result analyzer 908 may send information to the alert generator 910 for generating a no alert message.
The alert generator 910 in this example receives information from the result analyzer 908. If the information indicates that one or more alert conditions are met, the alert generator 910 can generate an alert for each of the met alert conditions associated with the task, e.g. in the form of a record or a message. Each alert may include information about the date and time of the alert generation, the type of alert condition violated, the reasons for the alert, etc. The alert generator 910 can store the alert in association with the metric and/or the monitoring task into the monitoring task database 309. As such, the alert becomes one of the historical alerts displayed to a user associated with the metric when a user wants to see the historical alerts, e.g. by clicking on the “Show Alerts” button 1514 in
If the information received from the result analyzer 908 indicates that no condition is met for a metric, the alert generator 910 can generate a no alert message associated with the metric and store the no alert message into the monitoring task database 309 in association with the metric. As such, when a user clicks on a “Show Alerts” button for the metric, the system can provide the no alert message to the user. The no alert message may be e.g. “No Alert for this Metric.”
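The analyze-then-alert flow described above may be illustrated by the following sketch, in which each alert condition is checked against a result and either an alert record or a no alert message is produced; the condition format and record fields are assumed for this example.

```python
# A hedged sketch of the analyze-then-alert flow described above; the condition
# format and alert record fields are illustrative, not the actual schema.

from datetime import datetime

def analyze_result(metric, value, conditions):
    """Return alert records for every violated condition, or a no-alert note."""
    alerts = []
    for cond in conditions:
        if cond["type"] == "range" and not (cond["min"] <= value <= cond["max"]):
            alerts.append({
                "metric": metric,
                "generated_at": datetime.now().isoformat(timespec="seconds"),
                "condition": "range",
                "reason": f"value {value} outside [{cond['min']}, {cond['max']}]",
            })
    if not alerts:
        return [{"metric": metric, "message": "No Alert for this Metric"}]
    return alerts

# An "age" of 230 violates the 0..200 range condition and produces one alert.
print(analyze_result("age", 230, [{"type": "range", "min": 0, "max": 200}]))
```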
As discussed above, when the result summary unit 906 generates a result summary, the result summary may include information about alerts and/or no alert messages stored in association with a monitoring task.
At 1020, results of an executed task associated with the user are received. At 1022, alert conditions associated with the executed task are received. At 1024, the results are analyzed with the alert conditions. At 1025, it is determined whether any alert condition is met, e.g. based on the analysis of the results with the alert conditions.
If an alert condition is met, the process goes to 1026, where an alert corresponding to the alert condition is generated and stored in association with the executed task and/or a metric of the executed task, e.g. in the monitoring task database 309. Then at 1028, the alert is sent to the user, e.g. via emails, phone calls, text messages, online chats, video calls, etc. The process then ends at 1030. In one embodiment, the process goes back from 1028 to 1020 to receive results of another executed task.
Otherwise, if no alert condition is met, the process goes to 1030 and ends. In one embodiment, the process goes back from 1025 to 1020 to receive results of another executed task, if no alert condition is met. In another embodiment, if no alert condition is met, a no alert message is generated and stored in association with the executed task and/or a metric of the executed task, e.g. in the monitoring task database 309.
The pipeline crawler 1102 in this example is configured for collecting information of running pipelines. For example, on Hadoop, the pipeline crawler 1102 can obtain runtime information of pipelines from the Oozie server in the data system 106. The pipeline crawler 1102 may collect job information periodically based on time information from the timer 1103. The pipeline crawler 1102 may also collect job information upon a request, e.g. a request from the data/job relationship determiner 1106. In one embodiment, the timer 1103 may be synchronized with the timer 703 and/or the timer 903. The pipeline crawler 1102 may send the collected job information to the data/job relationship determiner 1106 for determining data/job relationships.
The data source crawler 1104 in this example is configured for collecting information of data sources, e.g. grid data sources like HDFS feeds, Hive tables, HBase tables in the data system 106. The data source crawler 1104 may collect data information periodically based on time information from the timer 1103. The data source crawler 1104 may also collect data information upon a request, e.g. a request from the data/job relationship determiner 1106.
The data source crawler 1104 may send the collected data information to the data/job relationship determiner 1106 for determining data/job relationships.
The data/job relationship determiner 1106 in this example receives job information from the pipeline crawler 1102 and receives data information from the data source crawler 1104. In one embodiment, the job information and the data information are associated with a same cluster that includes pipeline jobs and data sources. A pipeline job may consume data from a data feed and/or produce data into a data feed. In another embodiment, the job information and the data information are associated with multiple clusters.
The data/job relationship determiner 1106 can determine relationships among different pipeline steps and data sources. For example, a pipeline step may read data from a data source, process the data to generate some new data, and store the new data into another data source. In another example, a data source may provide data to a plurality of running pipeline steps at the same time. The data/job relationship determiner 1106 may send all of these determined relationships to the dependency graph generator 1108 for generating a dependency graph.
The dependency graph generator 1108 in this example receives the determined relationships among the jobs and data sources, and generates a dependency graph based on the determined relationships. The dependency graph can be a virtual representation that reflects, and can be used to track, the overall status and health of big data pipelines. For example, the dependency graph may include nodes that represent data feeds/sources and pipeline steps, and directed links among the nodes that record how individual pipeline steps consume and/or produce data feeds. These graph elements, i.e. nodes and directed links, may also be associated with job statistics information, based on which advanced analytics and monitoring capabilities on pipelines can be implemented. Thus, the dependency graph can provide an overall picture of the producer-consumer relationships among different grid jobs and data sources.
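For illustration, the following sketch builds such a graph from assumed consume/produce records; the node and edge representation shown here is one possible choice made for this example, not the actual implementation.

```python
# A minimal sketch, under an assumed record format, of turning crawled
# consume/produce relationships into a directed dependency graph whose nodes
# are data sources and jobs and whose edges record the direction of data flow.

def build_dependency_graph(relationships):
    """relationships: (job, data_source, 'consumes' | 'produces') triples."""
    nodes, edges = set(), []
    for job, source, kind in relationships:
        nodes.add(("job", job))
        nodes.add(("data", source))
        if kind == "consumes":
            edges.append((("data", source), ("job", job)))   # data -> job
        else:
            edges.append((("job", job), ("data", source)))   # job -> data
    return nodes, edges

crawled = [
    ("job_1720", "feed_1722", "produces"),
    ("job_1730", "feed_1722", "consumes"),
    ("job_1730", "feed_1732", "produces"),
]
nodes, edges = build_dependency_graph(crawled)
print(edges)
```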
The user can select any node in the user interface 1700 to view information about the node. The system can highlight the node selected by the user. For example, as illustrated in
Advanced analytics can be performed based on the dependency information provided in the dependency graph in the user interface 1700. In one example, a job 1720 writes data into a data source 1722, and a job 1730 reads data from the data source 1722 and writes data into data sources 1732, 1734, 1736, 1738. Based on this dependency information, the user can determine that if there is any error in the job 1720, data in the data source 1722 may be impacted. Hence, the job 1730 and the data sources 1732, 1734, 1736, 1738 may also be impacted. As such, the user can predict error propagation in the data processing based on the dependency graph. On the other hand, if the user finds an error in the data source 1736, the user may track back along the dependency chain to check whether any error has happened in the job 1730, in the data source 1722, or in the job 1720. As such, the user can find the root cause of an error during data processing, based on the dependency graph.
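The impact analysis described above may be illustrated by the following sketch, which walks the directed links forward to predict downstream impact and backward to list upstream root-cause candidates; the edge records reuse the example reference numerals above and are otherwise hypothetical.

```python
# Illustrative sketch of the impact analysis described above: follow edges
# forward to predict what an error may propagate to, and backward to list
# where a detected error may have originated. Edge data is hypothetical.

from collections import deque

def reachable(edges, start, reverse=False):
    """Breadth-first walk over directed edges from a starting node."""
    step = {}
    for src, dst in edges:
        key, val = (dst, src) if reverse else (src, dst)
        step.setdefault(key, []).append(val)
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in step.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

edges = [("job_1720", "feed_1722"), ("feed_1722", "job_1730"),
         ("job_1730", "feed_1736")]
print(reachable(edges, "job_1720"))                 # downstream impact
print(reachable(edges, "feed_1736", reverse=True))  # upstream root-cause candidates
```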
In one embodiment, a job may read data from a data source and write data into the same data source. For example, the job 1730 consumes and produces the data source 1732. In another embodiment, a data source may be consumed by multiple jobs at the same time. For example, the data source 1740 is consumed by the jobs 1741, 1742, 1743 at the same time or in a same time period.
The user interface 1700 includes a menu bar 1702 which indicates that the user is under monitoring and dependency mode. The user interface 1700 also includes a search box 1704 for the user to search job name or path in a cluster.
Similar to the user interface 1700, the user interface 1800 includes a dependency graph that comprises a plurality of black nodes, a plurality of grey nodes, and a plurality of directed links. Each grey node in the user interface 1800 represents a data source in the cluster. Each black node in the user interface 1800 represents a pipeline step or a job that consumes or produces one of the data sources represented by a grey node. Each directed link in the user interface 1800 represents a dependency relationship from one node to another.
Similar to the user interface 1700, the user can select any node in the user interface 1800 to view information about the node. The system can highlight the node selected by the user. For example, as illustrated in
The dependency graph in the user interface 1800 visualizes the dependency relationship among pipelines and data sources, such that a user can easily understand a dependency in the cluster without writing any queries. As illustrated in
Referring back to
The request analyzer 1110 in
The dependency graph retriever 1112 in this example retrieves the dependency graph from the dependency graph database 1109 based on the information received from the request analyzer 1110. The dependency graph retriever 1112 may then provide the dependency graph to the user.
In one embodiment, multiple dependency graphs have been generated, at different times in the past, for the cluster requested by the user, and the user does not specify which one of the dependency graphs is requested. In this case, by default, the dependency graph retriever 1112 can retrieve the most recently generated dependency graph associated with the cluster.
In another embodiment, the request indicates that the user wants a real time dependency graph associated with the cluster. In this case, the request analyzer 1110 may send a message to the data/job relationship determiner 1106, such that the data/job relationship determiner 1106 can request job information and data information associated with the cluster from the pipeline crawler 1102 and the data source crawler 1104 respectively. The data/job relationship determiner 1106 can then determine dependency relationships among different jobs and data sources in the cluster in real time. Based on the determined dependency relationships, the dependency graph generator 1108 can generate the real time dependency graph. The real time dependency graph may be stored into the dependency graph database 1109 and/or retrieved by the dependency graph retriever 1112 and provided to the user.
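As a non-limiting sketch of the retrieval behavior described in the two preceding embodiments, the logic may be summarized as follows; graph_db and builder are hypothetical stand-ins for the dependency graph database 1109 and for the crawler/generator components, respectively, and are used only for illustration:

```python
def retrieve_dependency_graph(request, graph_db, builder):
    # `request` is assumed to be a mapping with at least a "cluster" entry and an
    # optional "real_time" flag; `graph_db` and `builder` are hypothetical
    # interfaces standing in for the dependency graph database and the
    # crawler/generator components described above.
    if request.get("real_time"):
        # Build the dependency graph on demand from freshly crawled job and data
        # information, then optionally persist it for later retrieval.
        graph = builder.build(request["cluster"])
        graph_db.store(request["cluster"], graph)
        return graph
    # Default behavior: return the most recently generated graph for the cluster.
    return graph_db.latest(request["cluster"])
```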
At 1212, a request for a dependency graph is received from a user and analyzed. At 1214, the dependency graph is retrieved from the database based on the request. At 1216, the dependency graph is provided to the user, e.g. via a user interface as shown in
To implement various modules, units, and their functionalities described in the present disclosure, computer hardware platforms may be used as the hardware platform(s) for one or more of the elements described herein (e.g., the data source monitoring engine 104 and/or the data dependency analyzing engine 105 and/or other components of systems 100 and 200 described with respect to
The computer 2000, for example, includes COM ports 2050 connected to and from a network connected thereto to facilitate data communications. The computer 2000 also includes a central processing unit (CPU) 2020, in the form of one or more processors, for executing program instructions. The exemplary computer platform includes an internal communication bus 2010, program storage and data storage of different forms, e.g., disk 2070, read only memory (ROM) 2030, or random access memory (RAM) 2040, for various data files to be processed and/or communicated by the computer, as well as possibly program instructions to be executed by the CPU. The computer 2000 also includes an I/O component 2060, supporting input/output flows between the computer and other components therein such as user interface elements 2080. The computer 2000 may also receive programming and data via network communications.
Hence, aspects of the methods of data monitoring, as outlined above, may be embodied in programming. Program aspects of the technology may be thought of as “products” or “articles of manufacture,” typically in the form of executable code and/or associated data that is carried on or embodied in a type of machine-readable medium. Tangible non-transitory “storage” type media include any or all of the memory or other storage for the computers, processors or the like, or associated modules thereof, such as various semiconductor memories, tape drives, disk drives and the like, which may provide storage at any time for the software programming.
All or portions of the software may at times be communicated through a network such as the Internet or various other telecommunication networks. Such communications, for example, may enable loading of the software from one computer or processor into another, for example, from a management server or host computer of a data source monitoring engine into the hardware platform(s) of a computing environment or other system implementing a computing environment or similar functionalities in connection with data monitoring. Thus, another type of media that may bear the software elements includes optical, electrical and electromagnetic waves, such as used across physical interfaces between local devices, through wired and optical landline networks and over various air-links. The physical elements that carry such waves, such as wired or wireless links, optical links or the like, also may be considered as media bearing the software. As used herein, unless restricted to tangible “storage” media, terms such as computer or machine “readable medium” refer to any medium that participates in providing instructions to a processor for execution.
Hence, a machine-readable medium may take many forms, including but not limited to, a tangible storage medium, a carrier wave medium or physical transmission medium. Non-volatile storage media include, for example, optical or magnetic disks, such as any of the storage devices in any computer(s) or the like, which may be used to implement the system or any of its components as shown in the drawings. Volatile storage media include dynamic memory, such as a main memory of such a computer platform. Tangible transmission media include coaxial cables, copper wire and fiber optics, including the wires that form a bus within a computer system. Carrier-wave transmission media may take the form of electric or electromagnetic signals, or acoustic or light waves such as those generated during radio frequency (RF) and infrared (IR) data communications. Common forms of computer-readable media therefore include, for example: a floppy disk, a flexible disk, hard disk, magnetic tape, any other magnetic medium, a CD-ROM, DVD or DVD-ROM, any other optical medium, punch cards, paper tape, any other physical storage medium with patterns of holes, a RAM, a PROM and EPROM, a FLASH-EPROM, any other memory chip or cartridge, a carrier wave transporting data or instructions, cables or links transporting such a carrier wave, or any other medium from which a computer may read programming code and/or data. Many of these forms of computer-readable media may be involved in carrying one or more sequences of one or more instructions to a physical processor for execution.
Those skilled in the art will recognize that the present teachings are amenable to a variety of modifications and/or enhancements. For example, although the implementation of various components described above may be embodied in a hardware device, it may also be implemented as a software-only solution, e.g., an installation on an existing server. In addition, the data monitoring as disclosed herein may be implemented as firmware, a firmware/software combination, a firmware/hardware combination, or a hardware/firmware/software combination.
While the foregoing has described what are considered to constitute the present teachings and/or other examples, it is understood that various modifications may be made thereto and that the subject matter disclosed herein may be implemented in various forms and examples, and that the teachings may be applied in numerous applications, only some of which have been described herein. It is intended by the following claims to claim any and all applications, modifications and variations that fall within the true scope of the present teachings.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/CN2015/075876 | 4/3/2015 | WO | 00 |