The present disclosure relates to computing systems, and, in particular, to methods, systems, and computer program products for predicting the performance of a data processing system in performing an analysis of a big data dataset.
Big data is a term often used to describe data sets of structured and/or unstructured data that are so large or complex that they are difficult to process using traditional data processing applications. Data sets tend to grow to such large sizes because data are increasingly gathered by cheap and numerous information-generating devices. Big data can be characterized by the 3Vs: the extreme volume of the data, the variety of types of data, and the velocity at which the data is processed. Although big data does not refer to any specific quantity or amount of data, the term is often used in referring to petabytes or exabytes of data. Big data datasets can be processed using various analytic and algorithmic tools to reveal meaningful information that may have applications in a variety of different disciplines including government, manufacturing, health care, retail, real estate, finance, and scientific research.
In some embodiments of the inventive subject matter, a method comprises performing operations as follows on a processor: receiving a big data dataset comprising new active data; receiving a request to predict a level of performance with respect to a performance parameter of a data processing system in analyzing the new active data; selecting a machine learning algorithm from a plurality of machine learning algorithms based on the performance parameter to obtain a selected machine learning algorithm; selecting a group of historical metadata from a plurality of groups of historical metadata of datasets that have previously been analyzed using the data processing system to provide a selected group of historical metadata; applying the selected machine learning algorithm to the selected group of historical metadata to generate a model of the selected group of historical metadata; obtaining metadata of the new active data; applying the model to the metadata of the new active data to generate a prediction of the level of performance with respect to the performance parameter; and configuring the data processing system for analyzing the new active data based on the prediction.
In other embodiments of the inventive subject matter, a system comprises a processor and a memory coupled to the processor, which comprises computer readable program code embodied in the memory that when executed by the processor causes the processor to perform operations comprising: receiving a big data dataset comprising new active data; receiving a request to predict a level of performance with respect to a performance parameter of a data processing system in analyzing the new active data; selecting a machine learning algorithm from a plurality of machine learning algorithms based on the performance parameter to obtain a selected machine learning algorithm; selecting a group of historical metadata from a plurality of groups of historical metadata of datasets that have previously been analyzed using the data processing system to provide a selected group of historical metadata; applying the selected machine learning algorithm to the selected group of historical metadata to generate a model of the selected group of historical metadata; obtaining metadata of the new active data; applying the model to the metadata of the new active data to generate a prediction of the level of performance with respect to the performance parameter; and configuring the data processing system for analyzing the new active data based on the prediction.
In still other embodiments of the inventive subject matter, a computer program product comprises a tangible computer readable storage medium comprising computer readable program code embodied in the medium that when executed by a processor causes the processor to perform operations comprising: receiving a big data dataset comprising new active data; receiving a request to predict a level of performance with respect to a performance parameter of a data processing system in analyzing the new active data; selecting a machine learning algorithm from a plurality of machine learning algorithms based on the performance parameter to obtain a selected machine learning algorithm; selecting a group of historical metadata from a plurality of groups of historical metadata of datasets that have previously been analyzed using the data processing system to provide a selected group of historical metadata; applying the selected machine learning algorithm to the selected group of historical metadata to generate a model of the selected group of historical metadata; obtaining metadata of the new active data; applying the model to the metadata of the new active data to generate a prediction of the level of performance with respect to the performance parameter; and configuring the data processing system for analyzing the new active data based on the prediction.
It is noted that aspects described with respect to one embodiment may be incorporated in different embodiments although not specifically described relative thereto. That is, all embodiments and/or features of any embodiments can be combined in any way and/or combination. Moreover, other methods, systems, articles of manufacture, and/or computer program products according to embodiments of the inventive subject matter will be or become apparent to one with skill in the art upon review of the following drawings and detailed description. It is intended that all such additional systems, methods, articles of manufacture, and/or computer program products be included within this description, be within the scope of the present inventive subject matter, and be protected by the accompanying claims. It is further intended that all embodiments disclosed herein can be implemented separately or combined in any way and/or combination.
Other features of embodiments will be more readily understood from the following detailed description of specific embodiments thereof when read in conjunction with the accompanying drawings, in which:
In the following detailed description, numerous specific details are set forth to provide a thorough understanding of embodiments of the present disclosure. However, it will be understood by those skilled in the art that the present invention may be practiced without these specific details. In some instances, well-known methods, procedures, components and circuits have not been described in detail so as not to obscure the present disclosure. It is intended that all embodiments disclosed herein can be implemented separately or combined in any way and/or combination. Aspects described with respect to one embodiment may be incorporated in different embodiments although not specifically described relative thereto. That is, all embodiments and/or features of any embodiments can be combined in any way and/or combination.
As used herein, a “service” includes, but is not limited to, a software and/or hardware service, such as cloud services in which software, platforms, and infrastructure are provided remotely through, for example, the Internet. A service may be provided using Software as a Service (SaaS), Platform as a Service (PaaS), and/or Infrastructure as a Service (IaaS) delivery models. In the SaaS model, customers generally access software residing in the cloud using a thin client, such as a browser, for example. In the PaaS model, the customer typically creates and deploys the software in the cloud sometimes using tools, libraries, and routines provided through the cloud service provider. The cloud service provider may provide the network, servers, storage, and other tools used to host the customer's application(s). In the IaaS model, the cloud service provider provides physical and/or virtual machines along with hypervisor(s). The customer installs operating system images along with application software on the physical and/or virtual infrastructure provided by the cloud service provider.
As used herein, the term “data processing facility” includes, but is not limited to, a hardware element, firmware component, and/or software component. A data processing system may be configured with one or more data processing facilities.
Some embodiments of the inventive subject matter stem from a realization that big data datasets may differ in a variety of ways, including the traditional 3V characteristics of volume, variety, and velocity as well as other characteristics, such as variability (e.g., data inconsistency), veracity (quality of the data), and complexity. As a result, a data processing environment used to analyze or process one big data dataset may be less suitable for analyzing or processing a different big data dataset. Some embodiments of the inventive subject matter may provide the operators of a big data analysis data processing system a prediction of how well the data processing system may perform in analyzing a big data dataset with respect to one or more performance parameters. The performance parameters may include, but are not limited to, the time of execution for performing an analysis, a probability of success (e.g., determining a pattern in the big data dataset), the amount of processor resources used in performing the analysis, and the amount of memory resources used in performing the analysis.
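The performance parameters enumerated above could be captured in a simple request structure. The following is a minimal illustrative sketch; the parameter names, class name, and fields are assumptions for illustration and are not prescribed by the disclosure.

```python
from dataclasses import dataclass

# Illustrative set of performance parameters, mirroring the four listed above.
PERFORMANCE_PARAMETERS = {
    "execution_time",       # time of execution for performing an analysis
    "success_probability",  # probability of, e.g., determining a pattern
    "cpu_usage",            # processor resources used in the analysis
    "memory_usage",         # memory resources used in the analysis
}


@dataclass
class PredictionRequest:
    """A request to predict a level of performance for analyzing a dataset."""
    dataset_id: str
    performance_parameter: str

    def __post_init__(self):
        # Reject parameters outside the supported set early.
        if self.performance_parameter not in PERFORMANCE_PARAMETERS:
            raise ValueError(
                f"unknown performance parameter: {self.performance_parameter}"
            )
```

A request for, say, `"execution_time"` would then be routed to the prediction machinery described below.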
Some embodiments of the inventive subject matter may provide a Decision Support System (DSS) for generating the prediction of how well a data processing system may perform in analyzing a given big data dataset, which can then be used to configure the data processing system for improved performance. The DSS may generate the performance prediction in response to a new prediction request for a new big data dataset, based on historical job data corresponding to previous big data datasets that have been analyzed and on the various machine learning algorithms that have been used in predicting the performance of analyzing previous big data datasets, whose accuracy has been evaluated against actual results.
Although described herein with respect to evaluating the performance of a data processing system for analyzing big data datasets, it will be understood that embodiments of the present inventive subject matter are not limited thereto and may be applicable to evaluating the performance of data processing systems generally with respect to a variety of different tasks.
The performance prediction generated by the DSS big data environment advisor 105 may be used as a basis for configuring a data processing system to analyze the new active data in the big data dataset. Configuring a data processing system may involve various operations including, but not limited to, adjusting the processing, memory, networking, and other resources associated with the data processing system. Configuring the data processing system may also involve scheduling which jobs are run at certain times and/or re-assigning jobs between the data processing system and other data processing systems. In addition, the particular analytic tools and applications that are used to process the big data dataset may be selected to enhance efficiency.
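One way the configuration step could be driven by a prediction is a simple threshold check per performance parameter, as in this sketch. The parameter names, thresholds, and recommended actions are all assumptions chosen for illustration.

```python
def recommend_actions(predicted, thresholds):
    """Compare predicted performance levels against per-parameter thresholds.

    Both arguments map a parameter name to a numeric value; any parameter
    whose prediction exceeds its threshold yields a recommended
    configuration change before the job is submitted.
    """
    actions = []
    if predicted.get("execution_time", 0.0) > thresholds.get("execution_time", float("inf")):
        actions.append("reschedule job or add processing resources")
    if predicted.get("memory_usage", 0.0) > thresholds.get("memory_usage", float("inf")):
        actions.append("increase memory allocation")
    return actions
```

An empty result would indicate the data processing system is already adequately configured for the predicted workload.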
Although
Referring now to
As shown in
The data classification module 325 may be configured to collect metadata corresponding to the analysis jobs performed previously on other big data datasets by various data processing systems and data processing system configurations, including the data processing system targeted for a current active data dataset. The algorithm mapping module 330 may be configured to select a machine learning algorithm from a plurality of machine learning algorithms that may be the most accurate in determining a prediction for the performance of a data processing system in analyzing a current active data dataset. This selection may be made based on one or more previous predictions with respect to various data processing systems and data processing system configurations. The prediction engine module 335 may be configured to generate a prediction of the performance of a data processing system with respect to one or more performance parameters in response to a request identifying the one or more performance parameters and new active data forming part of a big data dataset to be analyzed. The prediction engine module 335 may select a group of historical metadata (i.e., metadata for data that has already been analyzed by one or more data processing systems) that most closely matches the metadata of the new active data to be analyzed from the data classification module 325 and may select a machine learning algorithm that is the most efficient at generating a prediction for the particular performance parameter(s) from the algorithm mapping module 330. The prediction engine module 335 may then apply the particular machine learning algorithm received from the algorithm mapping module 330 to the group of historical metadata to build a prediction model, which may be an equation, graph, or other mechanism for specifying a relationship between the data points in the group of historical metadata.
The prediction model may then be applied to the metadata of the new active data to generate a prediction of the level of performance with respect to one or more performance parameters in analyzing the new active data on the data processing system. The data center management interface module 340 may be configured to communicate changes to a configuration of a data processing system based on the prediction generated by the prediction engine module 335. The DSS big data environment advisor data processing system 105 may be integrated as part of a data center management system or may be a stand-alone system that communicates with a data center management system over a network or suitable communication connection.
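The prediction-engine flow described in the preceding two paragraphs could be sketched as follows: select the closest-matching group of historical metadata, then fit a model to it. All function names, the distance measure, and the choice of one-variable least squares as the fitted model are assumptions for illustration; the disclosure leaves the matching criterion and algorithm open.

```python
import math


def select_group(new_meta, groups):
    """Pick the group of historical metadata whose centroid lies closest to
    the metadata of the new active data (Euclidean distance over the
    features present in new_meta)."""
    def distance(centroid):
        return math.sqrt(sum((centroid[k] - new_meta[k]) ** 2 for k in new_meta))
    return min(groups, key=lambda g: distance(g["centroid"]))


def fit_model(points):
    """Fit a one-variable least-squares line y = a*x + b to (x, y) points,
    standing in for whichever machine learning algorithm the algorithm
    mapping module selects. Returns the model as a callable."""
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return lambda x: a * x + b
```

The callable returned by `fit_model` plays the role of the "equation" form of the prediction model; applying it to a metadata feature of the new active data yields the predicted performance level.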
Although
Computer program code for carrying out operations of data processing systems discussed above with respect to
Moreover, the functionality of the DSS big data environment advisor data processing system 105, the data processing system 200 of
The data processing apparatus of
The algorithm mapping module 330 provides a library of possible machine learning algorithms that can be used in generating a model for predicting the performance of a data processing system in analyzing a big data dataset. Different machine learning algorithms may generate better models than others depending on the particular performance parameter of interest. Thus, the algorithm mapping module 330 may maintain information on the accuracy of the resulting performance predictions when various machine learning algorithms were previously used for various performance parameters. The algorithm mapping module 330 may provide to the prediction engine 335 the machine learning algorithm that has resulted in the most accurate predictions for a particular performance parameter at block 435. The algorithm mapping module 330 may also provide one or more default machine learning algorithms when no historical prediction accuracy data is available for a particular performance parameter. Various machine learning algorithms can be used in accordance with embodiments of the inventive subject matter, including, but not limited to, kernel density estimation, K-means, kernel principal components analysis, linear regression, nearest neighbors, non-negative matrix factorization, support vector machines, dimensionality reduction, fast singular value decomposition, and decision tree.
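The accuracy bookkeeping attributed to the algorithm mapping module could be implemented along the lines of the following sketch: record the prediction error per (performance parameter, algorithm) pair, and return the historically most accurate algorithm, falling back to a default when no history exists. The class name, method names, and the mean-absolute-error criterion are illustrative assumptions.

```python
from collections import defaultdict


class AlgorithmMapper:
    """Tracks historical prediction accuracy per performance parameter and
    machine learning algorithm, and recommends the best-performing one."""

    def __init__(self, default="linear_regression"):
        self.default = default
        # (parameter, algorithm) -> list of absolute prediction errors
        self.errors = defaultdict(list)

    def record(self, parameter, algorithm, predicted, actual):
        """Record the outcome of a past prediction once actual results exist."""
        self.errors[(parameter, algorithm)].append(abs(predicted - actual))

    def best_algorithm(self, parameter):
        """Return the algorithm with the lowest mean absolute error for the
        given parameter, or the default when no history is available."""
        candidates = {
            alg: sum(errs) / len(errs)
            for (p, alg), errs in self.errors.items()
            if p == parameter
        }
        if not candidates:
            return self.default  # block 435's default-algorithm path
        return min(candidates, key=candidates.get)
```

Evaluating accuracy against actual results, as the disclosure describes, corresponds to calling `record` after each completed analysis job.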
The remaining blocks of
A model or prediction model is generated at block 470 based on the selected machine learning algorithm at block 455 and the selected group of historical metadata at block 465. In accordance with various embodiments of the inventive subject matter, the model may be an equation, graph, or other construct/mechanism for specifying a relationship between the data points in the group of historical metadata. For example, if linear regression is chosen as the machine learning algorithm, an equation may be generated that best fits the data points in the group of historical metadata. The resulting model is output at block 475. The prediction engine module 335 applies the model obtained at block 475 to the metadata of the new active data at block 480 to generate a prediction 485 of the level of performance with respect to the requested performance parameter. For example, if the performance parameter is the time of execution for performing an analysis, the makespan value may be computed by applying the model generated by the machine learning algorithm to the metadata of the new active data of the big data dataset to be analyzed. The prediction 485 can be used to configure the data processing system for analyzing the big data dataset comprising the new active data. For example, various thresholds may be defined for one or more parameters that, when compared to the predicted performance level, provide an indication that changes need to be made to the data processing system before the big data dataset is provided to the data processing system for analysis, to improve the performance of the data processing system.
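The makespan example above, with a linear-regression model over several metadata features, reduces to evaluating a linear equation at block 480. In this sketch the feature names and coefficient values are invented for illustration; a real model's coefficients would come from the fit at block 470.

```python
def predict_makespan(coefficients, intercept, metadata):
    """Apply a fitted linear model to the metadata of the new active data:
    makespan = intercept + sum(coefficient_i * feature_i)."""
    return intercept + sum(coefficients[f] * v for f, v in metadata.items())
```

For instance, with coefficients of 0.5 per gigabyte of volume and 10 per variety class, an intercept of 30, and new active data of 100 GB across 3 data types, the predicted makespan is 30 + 0.5*100 + 10*3 = 110 time units.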
In some embodiments of the inventive subject matter, to improve the accuracy of the prediction, rather than using a single machine learning algorithm that is considered the most accurate for generating a prediction for a particular performance parameter, an ensemble methodology may be used where multiple machine learning algorithms are applied to the selected group of historical metadata to generate a plurality of models. The plurality of models may then be applied to the metadata of the new active data to generate a plurality of predictions, which can then be processed using an ensemble methodology to provide a final prediction. The ensemble methodology may be used when the models generated by the machine learning algorithms are independent of each other. In accordance with various embodiments of the inventive subject matter, the ensemble methods may include, but are not limited to, Bayes optimal classifier, bagging, boosting, Bayesian parameter averaging, Bayesian model combination, bucket of models, and stacking.
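The ensemble step described above can be sketched as simple (optionally weighted) model averaging, which is the aggregation step underlying methods such as bagging. Each "model" here is any callable mapping metadata to a predicted value; the function name and weighting scheme are illustrative assumptions, and the listed methods (stacking, Bayesian model combination, etc.) would combine the individual predictions differently.

```python
def ensemble_predict(models, metadata, weights=None):
    """Apply each independently generated model to the metadata of the new
    active data and combine the predictions into a final prediction by a
    weighted average (unweighted by default)."""
    predictions = [model(metadata) for model in models]
    if weights is None:
        weights = [1.0] * len(predictions)
    total = sum(weights)
    return sum(w * p for w, p in zip(weights, predictions)) / total
```

Weights could, for example, reflect each model's historical accuracy for the requested performance parameter, so that more reliable models dominate the final prediction.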
Some embodiments of the inventive subject matter may provide a DSS that can assist users of a big data analysis center in configuring their data processing system for a particular big data analysis task to meet, for example, requirements of service level agreements. Unexpected alerts and breakdowns may be reduced as a data processing system may be better configured to process a big data analysis job before the job starts. As big data is by definition resource intensive in terms of the amount and complexity of the data to be analyzed, even minor improvements in data processing system performance can result in large savings in terms of cost, resource usage, and time. A prediction of the performance of a data processing system, according to embodiments of the inventive subject matter, is generated in a technology-agnostic manner and uses ensemble approaches of machine learning, progressive clustering, and online learning. Moreover, the DSS described herein is self-tuning, improving the historical metadata group selection used in model generation based on newly arriving metadata corresponding to new big data analysis jobs.
In the above description of various embodiments of the present disclosure, aspects of the present disclosure may be illustrated and described herein in any of a number of patentable classes or contexts including any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof. Accordingly, aspects of the present disclosure may be implemented entirely in hardware, entirely in software (including firmware, resident software, micro-code, etc.), or in a combined software and hardware implementation that may all generally be referred to herein as a "circuit," "module," "component," or "system." Furthermore, aspects of the present disclosure may take the form of a computer program product comprising one or more computer readable media having computer readable program code embodied thereon.
Any combination of one or more computer readable media may be used. The computer readable media may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), an appropriate optical fiber with a repeater, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable signal medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Scala, Smalltalk, Eiffel, JADE, Emerald, C++, C#, VB.NET, Python or the like, conventional procedural programming languages, such as the “C” programming language, Visual Basic, Fortran 2003, Perl, COBOL 2002, PHP, ABAP, dynamic programming languages such as Python, Ruby and Groovy, or other programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider) or in a cloud computing environment or offered as a service such as a Software as a Service (SaaS).
Aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable instruction execution apparatus, create a mechanism for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer readable medium that can direct a computer, other programmable data processing apparatus, or other devices to function in a particular manner, such that the instructions stored in the computer readable medium produce an article of manufacture including instructions which, when executed, cause a computer to implement the function/act specified in the flowchart and/or block diagram block or blocks. The computer program instructions may also be loaded onto a computer, other programmable instruction execution apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatuses, or other devices to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various aspects of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The terminology used herein is for the purpose of describing particular aspects only and is not intended to be limiting of the disclosure. As used herein, the singular forms “a”, “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “comprises” and/or “comprising,” when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. As used herein, the term “and/or” includes any and all combinations of one or more of the associated listed items. Like reference numbers signify like elements throughout the description of the figures.
The corresponding structures, materials, acts, and equivalents of any means or step plus function elements in the claims below are intended to include any disclosed structure, material, or act for performing the function in combination with other claimed elements as specifically claimed. The description of the present disclosure has been presented for purposes of illustration and description, but is not intended to be exhaustive or limited to the disclosure in the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the disclosure. The aspects of the disclosure herein were chosen and described in order to best explain the principles of the disclosure and the practical application, and to enable others of ordinary skill in the art to understand the disclosure with various modifications as are suited to the particular use contemplated.