Aspects of this application relate to data analytics. More particularly, aspects of this application relate to analyzing log data in a distributed system.
Traditionally, the problems of preventative maintenance and failure diagnosis have been addressed using single node processing technologies, including relational databases and the like. As the volume of data increases with developing technology, these traditional techniques become less computationally feasible.
Many entities utilize log mining, the analysis of logs produced by a number of pieces of equipment. Log mining is used to improve customer service, perform predictive maintenance, diagnose problems, and identify failures. As technology advances, logs are created by numerous components within complex systems. The data contained in these logs is voluminous and requires advanced techniques for efficient processing and analysis of the log data.
Improved systems and methods are needed which address these shortcomings.
A system for data mining log record data includes a distributed data processing system comprising a plurality of distributed nodes, each node having at least one of a storage device and a data processing device. A distributed data analytics processor selects certain log records from a set of log records and stores the selected log records in a plurality of the distributed nodes. The distributed data analytics processor performs data analysis on the selected log records stored in the plurality of distributed nodes. The distributed nodes may be arranged in a cluster. The distributed nodes are in communication with one another through a distributed file system. The distributed data analytics processor is configured to at least in part control the communication between a first distributed node and a second distributed node.
In one embodiment, the distributed data analytics processor is configured to add a new node and to store a portion of the selected log records in a storage device of the new node.
According to another embodiment, the distributed data analytics processor is configured to produce information corresponding to suggested predictive patterns representative of information contained in the set of log records. The suggested predictive patterns may relate to at least one instance of a specific failure of an operation unit, or the suggested predictive patterns may relate to a normal utilization model of an operational unit.
In another embodiment, the distributed data analytics processor is configured to perform analysis on the selected log record data using the processing device of a plurality of the distributed nodes simultaneously in parallel.
In another embodiment, a method for performing data mining on log record data comprises the steps of, in a distributed data analytics processor, selecting a plurality of log records from a set of log records, storing the selected log records in a plurality of distributed nodes, and performing data analysis on the selected log records, wherein the analysis is performed in part by at least one data processing device associated with one of the plurality of distributed nodes.
In another embodiment, a non-transitory computer readable medium stores computer instructions that, when executed by a processor, cause the processor to perform the steps of, in a distributed data analytics processor, selecting a plurality of log records from a set of log records; storing the selected log records in a plurality of distributed nodes; and performing data analysis on the selected log records, wherein the analysis is performed in part by at least one data processing device associated with one of the plurality of distributed nodes.
The foregoing and other aspects of the present invention are best understood from the following detailed description when read in connection with the accompanying drawings. For the purpose of illustrating the invention, there is shown in the drawings embodiments that are presently preferred, it being understood, however, that the invention is not limited to the specific instrumentalities disclosed. Included in the drawings are the following Figures:
Big Data technologies and paradigms, which operate on distributed systems, may be utilized to efficiently perform log mining tasks on multiple nodes in parallel, thereby making log mining feasible on large-scale data. Methods according to embodiments of this disclosure are applicable across multiple domains; by way of non-limiting example, logs relating to medical, mobility, and power-generation systems may be mined.
The utility of log mining for failure diagnosis, predictive maintenance and other applications is well recognized in industry. Until recently, traditional technologies such as relational databases, single node processing, and the like have been used to address the problems of preventative maintenance and diagnosis. But due to the growth of data volume in recent times, these approaches have become less and less computationally feasible.
Embodiments of the invention utilize Big Data technologies to efficiently scale up log mining. All stages of the process are considered, including data storage, data pre-processing and data mining itself.
Conventional log mining typically involves objects and processes. Objects define or contain information about physical objects, and the information relating to or contained in objects provides data for further processing. Processing includes operations that construct problem instances, which typically correspond to a time window reflecting operational information for a given duration within a specific operational unit. Identifying information is extracted from the data associated with the objects and mapped to instances. Instances correspond to events, which are used for analysis and for future actions taken in response to those events. Analysis is implemented to process the data and produce useful information that addresses a recognized business need. By way of one non-limiting example, event instances relating to failures of a particular type of machine may be extracted from general log information. Information collected during and near the timeframes of the identified failures may reveal patterns, which in turn can be used to identify imminent failures or to suggest service intervals for ongoing maintenance.
Objects include physical equipment or operational units, such as a machine, controller, actuator, and the like. Numerous operational units may be arranged or associated to create a system of operational units. Each operational unit includes processes or utilities for monitoring operation of the operational unit and generating log information based on various operational states of the operational unit. Log information may be saved in the form of log records, which are stored in log files. Alternatively, the log information may be used to populate a database containing the log information. Records containing the log information may also be treated as objects. Service notifications or other information that serves to identify a failure may also be stored in files or entered into a database to create additional objects representative of system or equipment states. For example, information relating to predictions of specific failure types, or other information needed for supervised scenarios, may be stored for later retrieval and analysis.
The methods and manufactures described herein may be used to perform system log data mining across multiple operational platforms. A high-level data model of a system for applying the methods of the present invention is shown in
Still referring to
Aspects of the present invention allow for entities 101, 103, 105, 107, 109, and 111 to be identified by other attributes.
Referring now to
Service or maintenance notifications (or tickets) may be used by a Customer Service Center for recording detailed information on performed maintenance or repairs. An example of a notification 300 is illustrated in
In conventional systems, the data associated with the log files and notifications shown in
The data schema in the relational database is fixed, thereby reducing flexibility. Further, while running queries on small stores of data is not problematic, these operations create bottlenecks when they are performed on large or massive data stores. In an environment where systems are increasingly interconnected, data is shared across distributed networks. As systems become more interconnected, they include more devices, each generating log information at a substantial rate. The data stores containing this information therefore become massive. For example, a query may require extraction of portions of a data store that has become very large as the overall data store grows. Moreover, complex analysis is difficult to perform in a single node environment, requiring additional exportation of data. Data exportation requires additional resources, such as additional storage and the configuration of additional network systems for connecting the analysis tools to the extracted data.
As shown in
The processors 420 may include one or more central processing units (CPUs), graphical processing units (GPUs), or any other processor known in the art. More generally, a processor as used herein is a device for executing machine-readable instructions stored on a computer readable medium, for performing tasks and may comprise any one or combination of, hardware and firmware. A processor may also comprise memory storing machine-readable instructions executable for performing tasks. A processor acts upon information by manipulating, analyzing, modifying, converting or transmitting information for use by an executable procedure or an information device, and/or by routing the information to an output device. A processor may use or comprise the capabilities of a computer, controller or microprocessor, for example, and be conditioned using executable instructions to perform special purpose functions not performed by a general purpose computer. A processor may be coupled (electrically and/or as comprising executable components) with any other processor enabling interaction and/or communication there-between. A user interface processor or generator is a known element comprising electronic circuitry or software or a combination of both for generating display images or portions thereof. A user interface comprises one or more display images enabling user interaction with a processor or other device.
Continuing with reference to
The computer system 410 also includes a disk controller 440 coupled to the system bus 421 to control one or more storage devices for storing information and instructions, such as a magnetic hard disk 441 and a removable media drive 442 (e.g., floppy disk drive, compact disc drive, tape drive, and/or solid state drive). Storage devices may be added to the computer system 410 using an appropriate device interface (e.g., a small computer system interface (SCSI), integrated device electronics (IDE), Universal Serial Bus (USB), or FireWire).
The computer system 410 may also include a display controller 465 coupled to the system bus 421 to control a display or monitor 466, such as a cathode ray tube (CRT) or liquid crystal display (LCD), for displaying information to a computer user. The computer system includes an input interface 460 and one or more input devices, such as a keyboard 462 and a pointing device 461, for interacting with a computer user and providing information to the processors 420. The pointing device 461, for example, may be a mouse, a light pen, a trackball, or a pointing stick for communicating direction information and command selections to the processors 420 and for controlling cursor movement on the display 466. The display 466 may provide a touch screen interface which allows input to supplement or replace the communication of direction information and command selections by the pointing device 461.
The computer system 410 may perform a portion or all of the processing steps of embodiments of the invention in response to the processors 420 executing one or more sequences of one or more instructions contained in a memory, such as the system memory 430. Such instructions may be read into the system memory 430 from another computer readable medium, such as a magnetic hard disk 441 or a removable media drive 442. The magnetic hard disk 441 may contain one or more data stores and data files used by embodiments of the present invention. Data store contents and data files may be encrypted to improve security. The processors 420 may also be employed in a multi-processing arrangement to execute the one or more sequences of instructions contained in system memory 430. In alternative embodiments, hard-wired circuitry may be used in place of or in combination with software instructions. Thus, embodiments are not limited to any specific combination of hardware circuitry and software.
As stated above, the computer system 410 may include at least one computer readable medium or memory for holding instructions programmed according to embodiments of the invention and for containing data structures, tables, records, or other data described herein. The term “computer readable medium” as used herein refers to any medium that participates in providing instructions to the processors 420 for execution. A computer readable medium may take many forms including, but not limited to, non-transitory, non-volatile media, volatile media, and transmission media. Non-limiting examples of non-volatile media include optical disks, solid state drives, magnetic disks, and magneto-optical disks, such as magnetic hard disk 441 or removable media drive 442. Non-limiting examples of volatile media include dynamic memory, such as system memory 430. Non-limiting examples of transmission media include coaxial cables, copper wire, and fiber optics, including the wires that make up the system bus 421. Transmission media may also take the form of acoustic or light waves, such as those generated during radio wave and infrared data communications.
The computing environment 400 may further include the computer system 410 operating in a networked environment using logical connections to one or more remote computers, such as remote computing device 480. Remote computing device 480 may be a personal computer (laptop or desktop), a mobile device, a server, a router, a network PC, a peer device or other common network node, and typically includes many or all of the elements described above relative to computer system 410. When used in a networking environment, computer system 410 may include modem 472 for establishing communications over a network 471, such as the Internet. Modem 472 may be connected to system bus 421 via user network interface 470, or via another appropriate mechanism.
Network 471 may be any network or system generally known in the art, including the Internet, an intranet, a local area network (LAN), a wide area network (WAN), a metropolitan area network (MAN), a direct connection or series of connections, a cellular telephone network, or any other network or medium capable of facilitating communication between computer system 410 and other computers (e.g., remote computing device 480). The network 471 may be wired, wireless or a combination thereof. Wired connections may be implemented using Ethernet, Universal Serial Bus (USB), RJ-6, or any other wired connection generally known in the art. Wireless connections may be implemented using Wi-Fi, WiMAX, and Bluetooth, infrared, cellular networks, satellite or any other wireless connection methodology generally known in the art. Additionally, several networks may work alone or in communication with each other to facilitate communication in the network 471.
Processes according to embodiments of the present disclosure perform log mining in three steps. First, the data to be analyzed is stored for processing. The data may be exported from a distributed operations system and moved onto a cluster of one or more distributed storage devices. The storage devices may further include processing abilities. For example, the data to be analyzed may be transferred to a storage device associated with a computer workstation or server. Further, a plurality of storage devices may be organized as a cluster. Each storage device or workstation in the cluster is configured to operate in communication with every other storage device or workstation in the cluster via a distributed operating system. For example, one distributed operating system that may be used in embodiments of the present invention is HADOOP®, provided by THE APACHE™ SOFTWARE FOUNDATION. Data for analysis may be stored using a distributed file system, which is optimized for distributed systems and processing. By way of non-limiting example, the HADOOP DISTRIBUTED FILE SYSTEM (HDFS™) may be used to store the data for analysis in each of the distributed storage devices.
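By way of illustration only, the following is a minimal sketch of this storage step, assuming logs have already been exported as text files and a Hadoop/Spark cluster is available; the HDFS paths and application name are hypothetical and not part of the disclosed method.

```python
# A minimal sketch of the storage step, assuming an existing Hadoop/Spark
# cluster; the HDFS paths and application name are illustrative only.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("LogIngest").getOrCreate()

# Read the exported raw log lines from the distributed file system; each node
# reads the blocks stored locally, so the load itself proceeds in parallel.
raw_logs = spark.read.text("hdfs:///logs/fleet/raw/*.log")

# Persist the records in a columnar format for efficient later selection.
raw_logs.write.mode("overwrite").parquet("hdfs:///logs/fleet/raw_parquet")
```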
Second, pre-processing is performed on the collected data. In log mining, the units of analysis tend to be time windows associated with specific machines. For example, in a scenario where the desire is to identify or learn patterns, or to build a predictive model for predicting a failure of a specific type F, each occurrence of failure F within the fleet or system is associated with a time window corresponding to the machine on which the failure F occurred. These failures may be identified in notification records, which identify a type of failure and the machine affected. To provide meaningful predictive information, log data may be identified and selected for an affected machine in a time window occurring at some pre-determined time period prior to the identified failure. For completeness, additional records containing “negative” data, that is, data from time periods far removed from the time of failure F, may be identified for further analysis. In another embodiment, the goal may be the development of a normal utilization model for particular types of equipment or for normal utilization under specific operating conditions. By identifying specific machines and dates/times of interest, appropriate log data may be identified for later analysis.
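One possible realization of this window selection is sketched below in PySpark, assuming the log records and failure notifications have already been parsed into tables with machine, timestamp, and failure-type columns; the column names, paths, and the seven-day window are assumptions made for illustration.

```python
# A sketch of pre-failure window selection; schemas, column names, paths, and
# the seven-day window are assumptions made for illustration.
from pyspark.sql import SparkSession, functions as sf

spark = SparkSession.builder.appName("WindowSelect").getOrCreate()
logs = spark.read.parquet("hdfs:///logs/fleet/parsed")            # machine_id, ts, event_code
notifications = spark.read.parquet("hdfs:///logs/fleet/notices")  # machine_id, fail_ts, fail_type

# Build a pre-failure window [fail_ts - 7 days, fail_ts) for each failure of type F.
windows = (notifications.filter(sf.col("fail_type") == "F")
           .withColumn("win_start", sf.col("fail_ts") - sf.expr("INTERVAL 7 DAYS")))

# Keep only log records from the affected machine that fall inside its window.
positives = (logs.join(windows, "machine_id")
             .filter((sf.col("ts") >= sf.col("win_start")) &
                     (sf.col("ts") < sf.col("fail_ts"))))
```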
After the data selected for analysis has been identified, an appropriate formulation may be developed, for example, representing instances as a sequence of events, or as vectors in a high-dimensional attribute space for a set of attributes (e.g., statistics on event occurrences). Pre-processing of the analysis data may be performed using distributed paradigms, which allow the selection of records from the equipment units of interest that fall within a determined time window, [T1, T2], on a specified date. For instance, a key may be defined from an operational unit and a failure date. This identifies records corresponding to the failure on a specific operational unit and groups them into appropriate instances for analysis. Depending on the analysis method chosen, the output after identifying records of interest may be the selected original records themselves. In another embodiment, aggregated statistics for the selected time window may be returned.
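Continuing the sketch above, the selected records might be grouped into analysis instances keyed by operational unit and failure date, here returning aggregated event-count statistics per instance; the variable and column names carry over from the previous sketch and remain illustrative.

```python
# Group the selected records into instances keyed by (machine_id, failure
# date) and return per-instance event-occurrence counts; names are illustrative
# and reuse the "positives" DataFrame from the preceding sketch.
instances = (positives
             .groupBy("machine_id",
                      sf.to_date("fail_ts").alias("fail_date"),
                      "event_code")
             .count()
             .groupBy("machine_id", "fail_date")
             .pivot("event_code")      # one column of counts per event type
             .sum("count")
             .fillna(0))
```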
When the data which is to be further analyzed has been identified, analytic processing is performed in parallel across multiple nodes, converting the log information or event notifications into a mathematical representation that is compatible with a machine learning algorithm 505. The machine learning algorithm receives the log information, which has been processed to identify maintenance and failure conditions, and may perform additional calculations or analysis to determine factors that lead to or result from various states captured in the log data. The analytics are performed across different data nodes simultaneously via the distributed processing system. This allows multiple processing resources to work together across separate data locations or nodes, thereby reducing the time required to process massive volumes of data without overloading any single processing node, and it prevents the lags and bottlenecks associated with the processing of large volumes of data. By using embodiments described in this specification, data analysis that traditionally would require approximately one week to perform on a conventional single node configuration may instead be processed and completed in a matter of hours.
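One way to produce such a mathematical representation, continuing the earlier sketches and assuming a label column has been added to each instance (e.g., 1 for a pre-failure window, 0 for a “negative” window), is to assemble the per-instance statistics into MLlib feature vectors; the column names are assumptions for illustration.

```python
# A sketch of converting per-instance statistics into MLlib feature vectors;
# the "label" column and the exclusion list are assumptions for illustration,
# and "instances" carries over from the preceding sketch.
from pyspark.ml.feature import VectorAssembler

feature_cols = [c for c in instances.columns
                if c not in ("machine_id", "fail_date", "label")]
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
training_data = assembler.transform(instances)
```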
The machine learning algorithm is executed on the converted data to create or generate learned models or statistics of interest 507. Generated models are representative of the system from which the original log data was acquired and provide additional information relating to processes or occurrences that lead to failures or maintenance issues. The generated models or statistics are stored in the distributed storage system 508 for later retrieval. The output models and/or statistics may be used at a later time to predict maintenance needs under normal use, or to predict future failures based on prior patterns leading to such failures.
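As one illustration of this learning and storage step, a logistic-regression classifier from Spark MLlib could be trained on the assembled vectors and written back to the distributed file system; the model choice, label column, and output path are assumptions for the sketch, not a prescription of the method.

```python
# A sketch of learning a predictive model and storing it in the distributed
# storage system; logistic regression, the "label" column, and the output path
# are illustrative, and "training_data" carries over from the earlier sketch.
from pyspark.ml.classification import LogisticRegression

lr = LogisticRegression(featuresCol="features", labelCol="label")
model = lr.fit(training_data)

# Persist the learned model for later use in failure prediction or
# maintenance scheduling.
model.write().overwrite().save("hdfs:///logs/fleet/models/failure_F_lr")
```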
Referring now to
Given a collection of instances identified or obtained by the pre-processing described above, various approaches for modeling and pattern mining may be employed. For example, unsupervised methods may be used for anomaly detection. Common patterns may be extracted and used to build a profile of the stored data. Patterns may be identified based on established vectors, item sets, or sequential patterns. Likewise, supervised techniques may also be employed to perform functions such as failure type diagnosis, preventative maintenance and/or failure prediction, and discriminative sequential pattern mining. These approaches should be appropriate for application to large volumes of data. By way of example, APACHE™ SPARK™ MLlib provides scalable implementations of clustering analyses and supervised classification methods for vector data. Other languages may also be used to implement methods according to embodiments of the disclosure. For example, methods may be implemented in R (e.g., using rmr) or JAVA®, among other programming languages.
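For instance, an unsupervised profile of the kind described above might be built with KMeans clustering from Spark MLlib, with each instance's distance from its assigned cluster center serving as a simple anomaly score; the parameter values are illustrative, and the feature vectors reuse those assembled in the earlier sketch.

```python
# A sketch of unsupervised profiling with KMeans from Spark MLlib; k and seed
# are illustrative, and "training_data" carries over from the earlier sketch.
from pyspark.ml.clustering import KMeans

kmeans = KMeans(featuresCol="features", k=5, seed=42)
km_model = kmeans.fit(training_data)

# Each instance is assigned to its nearest cluster; distance to that center
# (computed separately) can then serve as a simple anomaly score.
profiled = km_model.transform(training_data)   # adds a "prediction" column
```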
Implementations of scalable sequential pattern mining methods are also provided in various embodiments of the present disclosure. Consider that all sequential pattern mining methods begin from a collection of sequences (e.g., instances) and candidate patterns are evaluated against sequences in the collection. In distributed operating platforms, the collection is distributed across nodes in files or in memory (e.g. within distributed storage devices). Using distributed operating platforms, the evaluation of patterns may proceed in parallel. Distributed processing systems allow access to the data in distributed nodes through the distributed file system. Additionally, processing may be divided among separate nodes allowing the evaluation, processing and analysis of the distributed data to occur simultaneously across nodes. The distributed processing platform then combines the results of the parallel processing into a single output.
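As one concrete possibility, Spark MLlib's PrefixSpan implementation distributes exactly this kind of candidate-pattern evaluation across the cluster; the toy sequences below stand in for instances produced by the pre-processing step and are purely illustrative.

```python
# A sketch of distributed sequential pattern mining with PrefixSpan from
# Spark MLlib; the event sequences shown are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.ml.fpm import PrefixSpan

spark = SparkSession.builder.appName("SeqPatterns").getOrCreate()

# Each row is one instance: an ordered sequence of event itemsets.
sequences = spark.createDataFrame(
    [([["E1"], ["E3", "E4"], ["E7"]],),
     ([["E1"], ["E7"]],),
     ([["E3", "E4"], ["E7"]],)],
    ["sequence"])

ps = PrefixSpan(minSupport=0.5, maxPatternLength=5)
# Candidate patterns are evaluated in parallel across the cluster's nodes.
frequent = ps.findFrequentSequentialPatterns(sequences)
frequent.show()
```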
An executable application, as used herein, comprises code or machine readable instructions for conditioning the processor to implement predetermined functions, such as those of an operating system, a context data acquisition system or other information processing system, for example, in response to user command or input. An executable procedure is a segment of code or machine readable instruction, sub-routine, or other distinct section of code or portion of an executable application for performing one or more particular processes. These processes may include receiving input data and/or parameters, performing operations on received input data and/or performing functions in response to received input parameters, and providing resulting output data and/or parameters.
A graphical user interface (GUI), as used herein, comprises one or more display images, generated by a display processor and enabling user interaction with a processor or other device and associated data acquisition and processing functions. The GUI also includes an executable procedure or executable application. The executable procedure or executable application conditions the display processor to generate signals representing the GUI display images. These signals are supplied to a display device which displays the image for viewing by the user. The processor, under control of an executable procedure or executable application, manipulates the GUI display images in response to signals received from the input devices. In this way, the user may interact with the display image using the input devices, enabling user interaction with the processor or other device.
The functions and process steps herein may be performed automatically or wholly or partially in response to user command. An activity (including a step) performed automatically is performed in response to one or more executable instructions or device operation without user direct initiation of the activity.
The system and processes of the figures are not exclusive. Other systems, processes and menus may be derived in accordance with the principles of the invention to accomplish the same objectives. Although this invention has been described with reference to particular embodiments, it is to be understood that the embodiments and variations shown and described herein are for illustration purposes only. Modifications to the current design may be implemented by those skilled in the art, without departing from the scope of the invention. As described herein, the various systems, subsystems, agents, managers and processes can be implemented using hardware components, software components, and/or combinations thereof. No claim element herein is to be construed under the provisions of 35 U.S.C. 112, sixth paragraph, unless the element is expressly recited using the phrase “means for.”
This application claims the benefit of priority under 35 U.S.C. §119(e) to U.S. Provisional Patent Application Ser. No. 62/266,859 entitled, “LOG MINING WITH BIG DATA”, filed Dec. 14, 2015, which is incorporated by reference herein in its entirety.