Segmenting machine data into events

Information

  • Patent Grant
  • 12130842
  • Patent Number
    12,130,842
  • Date Filed
    Friday, March 3, 2023
  • Date Issued
    Tuesday, October 29, 2024
Abstract
Methods and apparatus consistent with the invention provide the ability to organize and build understandings of machine data generated by a variety of information-processing environments. Machine data is a product of information-processing systems (e.g., activity logs, configuration files, messages, database records) and represents the evidence of particular events that have taken place and been recorded in raw data format. In one embodiment, machine data is turned into a machine data web by organizing machine data into events and then linking events together.
Description
BACKGROUND OF THE INVENTION
1. Field of the Invention

This invention relates generally to information organization and understanding, and more particularly to the organization and understanding of machine data.


2. Description of the Related Art

Information systems invariably generate vast amounts and wide varieties of machine data (e.g., activity logs, configuration files, messages, database records) whose value is widespread. Troubleshooting systems, detecting operational trends, catching security problems and measuring business performance, for example, typically require the organization and understanding of machine data. But the overwhelming volume, different and changing formats, and overall complexity of machine data create substantial difficulty for software developers, system administrators and business people who want to make sense of it and gain insight into information system behavior. The problem is compounded by the fact that information systems, and the machine data they generate, continue to grow in complexity and size.


Consider for example an information system environment for web-based applications consisting of web servers, application servers, databases and networks. Each information system component is constantly logging its own machine data documenting its activities. System administrators need to access and comprehend the machine data from one or more components to find and fix problems during operations. Security analysts want to understand patterns of machine data behavior from network devices to identify potential security threats. Business people are interested in tracing the machine data across components to follow the paths and activities customers perform when purchasing products or services.


Today, people generally attempt to comprehend information system behavior by manually looking at and trying to piece together machine data using the knowledge from one or more individuals about one or more systems. Individuals typically have specific technology domain expertise like networking, operating systems, databases, web servers or security. This expertise can also be in specific application domains like finance, healthcare, or communications. Manual approaches can be effective when considering small amounts of machine data in a single domain, but humans are easily overwhelmed as the size, variety and dynamic nature of the machine data grows.


Automated approaches, like homegrown scripts, data analysis programs, and data warehousing software, by contrast, can work with large amounts of machine data. But organizing different types of frequently changing data and formats can be troublesome, generally requiring specific methods for each type of data and necessitating modification of methods when the data formats change or new types of data are encountered. Automated approaches to building understanding from machine data are typically limited to finding simple, predefined relationships between known data elements.


Generally machine data is organized today by relying on predefined data schemas and predetermined algorithms for parsing and categorizing data. In current approaches, what data elements exist in a machine data set and how the data elements are classified generally must be known ahead of time. How the data is cleansed, parsed and categorized is defined algorithmically in advance for different types of data formats resulting in systems that are brittle, expensive to implement, and have numerous functional shortcomings. For example, unexpected types of data are typically ignored. As a result, data categorization usefulness degrades quickly and unexpected data and behaviors are not observed or recorded. Given the inherent dynamic nature of information systems and the machine data they generate, current organization methods have limited applicability.


Building understanding from machine data is inherently subjective and depends on the task, scope of data and skill level of people using a solution. Deriving specific, useful meanings from large quantities of machine data can require expertise in one or more domains and knowledge of how data from one domain relates to data from another domain. Current methods of deriving meaning from machine data are generally based on building simple pair-wise relationships (A→B) between predetermined data elements using data values. More advanced techniques may be able to find predetermined multi-data element relationships (A→B→C), provided the data elements are described in advance, requiring the availability of multiple domain experts to configure and continuously manage a solution.


Conventional methods, whether human or automated, of organizing and understanding machine data across multiple information systems and domains suffer from an inability to effectively keep up with changing machine data and are constrained by limited data relationships, making these methods difficult, time consuming, expensive and often ineffective.


There exists, therefore, a need to develop other techniques for organizing and deriving understanding from machine data.


SUMMARY OF THE INVENTION

Methods and apparatus consistent with the invention address these and other needs by turning machine data (MD) into a machine data web (MDW). A MDW is created by organizing MD into events representing discrete activities, and dynamically linking events together representing larger, more complex activities. Much like the World Wide Web is a hyperlinked information space of documents and web sites, a MDW is an interconnected information space of information system events and activities. The MDW can be searched, browsed, navigated, and analyzed as a proxy for the information-processing environment itself. Unlike the WWW's HTML documents and hyperlinks, however, the events organized from machine data, and the links between these events, do not generally exist and must be manufactured through the processing and analysis of MD.


In one implementation, MD is organized into events using a collection of techniques including, but not limited to, aggregating a MD collection into discrete events, extracting important entities from an event's data, segmenting an event's data into tokens, and classifying events into like categories. An important aspect is the ability to continuously learn and adapt, keeping up with changes in the MD. In the example of a web-based application information system environment, data sources and data formats can be constantly changing. For example, new web servers and network components can be added and old ones removed as the application requires more capacity or reconfiguration.


In another aspect, knowledge or understanding is built from the organized MD as events are connected to one another by dynamically constructing links using a number of techniques, including but not limited to the analysis of event data values, timing, patterns, and statistics. One advantage of the MDW is that it can learn new types of links as they occur and build paths by chaining multiple links together. Another advantage is the ability to preserve integrity by reconstructing the original MD from the MDW events. Dynamic construction of links and paths through multiple machine data sources enables a system administrator working on a web-based application information system to follow the sequence of activities from the web server to the application and eventually the database in order to locate the source of a problem.





BRIEF DESCRIPTION OF THE DRAWINGS

The invention has other advantages and features which will be more readily apparent from the following detailed description, when taken in conjunction with the accompanying drawings:



FIG. 1 is a diagram of an example information-processing environment suitable for use with an MDW.



FIG. 2 is a flow diagram of one example of creation of an MDW according to the invention.



FIG. 3 is a flow diagram of one example of MD organization according to the invention.



FIG. 4 is a flow diagram of one example of MD understanding according to the invention.



FIG. 5 is a diagram illustrating access to an MDW.





DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

In the example of FIG. 1, the information-processing environment includes hardware and software components such as computers, routers, databases, operating systems and applications in a distributed configuration for processing information. Each component may be producing MD 110, and there may be many MD sources and large quantities of MD across multiple technology and application domains. For example, a computer may be logging operating system events, a router may be auditing network traffic events, a database may be cataloging database reads and writes or schema changes, and an application may be sending the results of one application call to another across a message queue. In this embodiment, individual IT personnel—who may reside in different data centers, companies, or even geographies—typically manage specific technology and application domains. Aspects of the invention will be described with respect to the information-processing environments in FIG. 1, but the invention can also be used with other information-processing environments.



FIG. 2 represents one approach 200 to building a MDW 290 from MD 110. This approach includes an organization process 235 and an understanding process 275. During the organization process 235, the MD 110 is organized into collections of discrete events 250, referred to herein as event data (ED). Events 250 represent units of system activity. Examples of events 250 include a web server servicing an HTTP “get” request from a web browser, an application server servicing an API call, or a database updating records in a table. Collections of events 250 can describe larger system activities, such as an attempt to update a customer record or submit an order. One of the challenges in organizing 235 MD 110 into events 250 is that MD generally has little formal structure and typically includes not much more than a time stamp common across different sources of MD and different types of events. MD 110 is also subject to changes in environment configurations. For example, changing the verbosity level in a web server configuration file can dramatically increase or decrease the amount of information included in an HTTP “get” event found in the web server's log file.


During the understanding process 275, ED 250 is analyzed to create dynamic links between events and build the MDW 290. As an example, consider that a log from a web server may contain specific types of events 250 with specific event data, but a log from an application server or database may contain different events 250 and event data specific to its own domain. A system administrator may, for example, locate the web server event by looking for a session ID found in a web server log, locate the application server event by finding a process ID in the message queue, and locate a database table update event by searching for a transaction ID in the database audit trail. All three sources may contain events 250 that are part of a larger system activity, yet there is no obvious or explicit common structure or data shared among the MD 110 produced by each system. Common structure is manufactured across the three sources by analyzing the event data 250 so that connections between events can be identified. In one implementation, patterns of event behavior are recorded in real-time and identified, for example, as frequently occurring or infrequently occurring. Frequent patterns identify typical system processes and well-known links. Infrequent patterns identify deviations or anomalies and less well-known links. Contrast this with the world of the web, where hyperlinks are part of the formal, common structure of HTML, the language for building most web pages. Building links by hand for large volumes of ED 250 is not an option for complex information-processing environments.


Machine Data Organization Process



FIG. 3 is a flow diagram of one implementation 235 of the MD organization process of FIG. 2. In this implementation, there are several steps including collection 305, source identification 315, aggregation 325, extraction 335, segmentation 345, and classification 355. Through these steps, MD 110 is collected from the information-processing environment and organized into ED 250 for the MD understanding process. For convenience, the technology that implements each step will be referred to as a module. That is, the “collection module” is the technology that collects MD. In one implementation, the modules are all implemented as software.


Collection


In the collection step 305, the MD 110 may be collected directly from its original source or consolidated over a number of sources. Machine data 110 can, and often does, arrive out of order. Collection 305 of MD 110 can be performed based on standard approaches to data access, for example, reading log files, examining message bus traffic, becoming a sink for logging systems like Syslog, or connecting to database auditing systems. Parts of the collection module can be situated in different locations, preferably with access to the MD 110.


Source Identification—Classification into Domains


Given the repetitive, yet dynamic, nature of MD 110, an effective organization process 235 (such as shown in FIG. 3) preferably will learn about data formats and structure automatically. In one implementation, learning is separated into different domains based on the source of MD 110. Domains can be general system types, such as log files, message bus traffic, and network management data, or specific types, such as output of a given application or technology—Sendmail logging data, Oracle database audit data, and J2EE messaging. An MDW can include a mix of general domains and specific domains.


In this example organization process 235, the domain for a given source of MD is identified 315 so that domain specific organization methods can be applied. Domains are determined through a learning process. The learning process uses collections of MD from well-known domains as input and creates a source signature 312 for each domain. In one implementation, source signatures 312 are generated from representative samples of MD 110 by creating a hash table mapping punctuation characters to their frequency. While the tokens and token values can change in a MD collection, in this particular implementation, the signature 312 generated by the frequency of punctuation is quite stable, and reliable within a specific domain. Other implementations could use functions of the punctuation and tokens, such as the frequencies of the first punctuation character on a line, or the first capitalized term on a line. Given that source signatures 312 can be large and hard to read, signatures can have a corresponding label in the form of a number or text that can be machine generated or human assigned. For example, the source signature 312 for an Apache web server log might be programmatically assigned the label “205”, or a user can assign the label “Apache Server Log”.
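By way of illustration only, the punctuation-frequency signature described above might be sketched in Python as follows. The function name `source_signature`, the sample log lines, and the choice to normalize raw counts into relative frequencies (so that samples of different sizes are comparable) are assumptions made for this sketch and are not taken from the description.

```python
import string
from collections import Counter


def source_signature(sample_lines):
    """Map each punctuation character to its relative frequency in the sample.

    Normalizing by the total punctuation count is an assumption of this
    sketch; the description only calls for a mapping to frequency.
    """
    counts = Counter(
        ch for line in sample_lines for ch in line if ch in string.punctuation
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {ch: n / total for ch, n in counts.items()}


# Hypothetical sample resembling a web server access log.
sample = [
    '10.0.0.1 - - [03/Mar/2023:10:29:03] "GET /index.html" 200',
    '10.0.0.2 - - [03/Mar/2023:10:29:04] "GET /about.html" 404',
]
signature = source_signature(sample)
```

Because token values such as addresses and paths do not enter the mapping, two samples from the same kind of source would be expected to yield similar signatures even when their contents differ.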


In one embodiment, clustering is used to classify 315 collected MD 110 into domains according to their source signatures 312. As collections of MD 110 are encountered, each collection's signature is matched to the set of known source signatures 312 by performing a nearest-neighbor search. If the distance of the closest matching signature 312 is within a threshold, the closest matching signature 320's domain is assumed to be the domain of the source. If no best match can be found, a new source signature 312 can be created from the sample signature and a new source domain created. Alternatively, a default source domain can be used. In one implementation, the distance between two signatures is calculated by iterating over the union of attributes of the two signatures, with the total signature distance being the average of distances for each attribute. For each attribute A, the value of A on Signature1 and Signature2, V1 and V2, are compared and a distance is calculated. The distance for attribute A is the square of (V1−V2)*IDF, where IDF is log(N/|A|), N is the number of signatures, and |A| is the number of signatures with attribute A.
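The distance calculation and nearest-neighbor match can be sketched as below. This is a minimal illustration under stated assumptions: signatures are dictionaries of attribute to value, "the square of (V1−V2)*IDF" is read as the square of the whole product, a missing attribute is treated as the value 0, and N and |A| are counted over the known signatures plus the sample so that |A| is never zero. The names `signature_distance` and `classify_source` are hypothetical.

```python
import math


def signature_distance(sig1, sig2, population):
    """Average per-attribute distance over the union of attributes."""
    attrs = set(sig1) | set(sig2)
    if not attrs:
        return 0.0
    n = len(population)
    total = 0.0
    for attr in attrs:
        # |A|: number of signatures in the population that have this attribute.
        holders = sum(1 for sig in population if attr in sig)
        idf = math.log(n / holders) if holders else 0.0
        diff = sig1.get(attr, 0.0) - sig2.get(attr, 0.0)
        total += (diff * idf) ** 2
    return total / len(attrs)


def classify_source(sample_sig, known, threshold):
    """Return the label of the nearest known signature, or None if too far.

    `known` maps a domain label to its source signature.
    """
    population = list(known.values()) + [sample_sig]
    best_label, best_dist = None, float("inf")
    for label, sig in known.items():
        dist = signature_distance(sample_sig, sig, population)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label if best_dist <= threshold else None


# Hypothetical known domains and an incoming sample.
known_signatures = {
    "web server log": {".": 0.5, ":": 0.5},
    "mail log": {"@": 0.6, "<": 0.4},
}
domain = classify_source({".": 0.45, ":": 0.55}, known_signatures, threshold=0.01)
```

A return value of None corresponds to the case in which no best match is found, where a new source domain or a default domain would be used. Note that under this reading an attribute present in every signature has an IDF of zero and contributes nothing to the distance.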


Source Identification—Classification as Text/Binary


Some MD 110 sources are non-textual or binary and cannot be easily processed unless a known process is available to convert the binary MD into textual form. To classify a source as textual or binary, a sample MD collection is analyzed. Textual MD can also have embedded binary MD, such as a memory dump, and the classification preferably identifies it as such. In one implementation, the textual/binary classification works as follows. The sample is a set of lines of data, where a line is defined as the data between new lines (i.e., ‘\n’), carriage-returns (i.e., ‘\r’), or their combination (i.e., ‘\r\n’). For each line, if the line's length is larger than some large threshold, such as 2k characters, or if the line contains a character with an ASCII value of zero (0), a count of Binary-looking lines is incremented. Otherwise, if the line's length is shorter than a length that one would expect most text lines to be below, such as 256 characters, a count of Text-looking lines is incremented. If the number of Text-looking lines is twice as numerous as the Binary-looking lines (other ratios can be used depending on the context), the source is classified as text. Otherwise, the source is classified as binary.
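The line-counting heuristic above can be sketched as follows. The function name and parameter names are hypothetical, the thresholds are the example values from the text (2k is taken as 2048 characters), and "twice as numerous" is read as "at least twice as many".

```python
import re


def classify_text_or_binary(data, long_line=2048, short_line=256, ratio=2.0):
    """Classify a sample as 'text' or 'binary' by counting line shapes."""
    # A line is the data between '\n', '\r', or '\r\n'.
    lines = re.split(r"\r\n|\r|\n", data)
    binary_looking = 0
    text_looking = 0
    for line in lines:
        if len(line) > long_line or "\x00" in line:
            binary_looking += 1
        elif len(line) < short_line:
            text_looking += 1
        # Lines of intermediate length count toward neither total.
    return "text" if text_looking >= ratio * binary_looking else "binary"
```

The ratio is exposed as a parameter because, as noted above, other ratios can be used depending on the context.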


Aggregation of Machine Data into Raw Events


When the source signature 320 for a collection of MD has been identified 315, the corresponding aggregation rules are applied 325 to the MD collection. Aggregation rules describe the manner in which MD 110, from a particular domain, is organized 325 into event data 330 by identifying the boundaries of events within a collection of MD, for example, how to locate a discrete event by finding its beginning and ending. In one implementation, the method of aggregation 325 learns, without prior knowledge, by grouping together multiple lines from a sample of MD 110. Often MD 110 contains events 330 that are anywhere from one to hundreds of lines long that are somehow logically grouped together.


The MD collection may be known a priori, or may be classified, as single-line type (i.e., containing only single-line events) or multi-line type (i.e., possibly containing multi-line events) prior to performing aggregation. For those MD collections that are classified as single-line type, aggregation 325 is simple: single-line type MD collections are broken on each line, with each line treated as a separate event. Multi-line type MD collections are processed 325 for aggregation. In one implementation, a MD collection is classified as a multi-line type if 1) there is a large percentage of lines that start with spaces or are blank (e.g., if more than 5% of the lines start with spaces or are blank), or 2) there are too many varieties of punctuation characters in the first N punctuation characters. For example, if the set of the first three punctuation characters found on each line has more than five patterns (e.g., ‘:::’, ‘!:!’, ‘,,,’, ‘:..’, ‘( )*’), the collection might be classified as multi-line.
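The two multi-line tests can be sketched as below, using the example thresholds from the text. The function name `is_multiline_collection` and its parameter names are assumptions made for illustration.

```python
import string


def is_multiline_collection(lines, blank_fraction=0.05, n_punct=3, max_patterns=5):
    """Return True if the sample looks like it contains multi-line events."""
    if not lines:
        return False
    # Test 1: too many lines that are blank or start with a space.
    indented_or_blank = sum(
        1 for line in lines if not line.strip() or line[0] == " "
    )
    if indented_or_blank / len(lines) > blank_fraction:
        return True
    # Test 2: too many distinct patterns among the first N punctuation
    # characters found on each line.
    patterns = set()
    for line in lines:
        punct = [ch for ch in line if ch in string.punctuation]
        patterns.add("".join(punct[:n_punct]))
    return len(patterns) > max_patterns
```

For instance, a sample consisting of stack traces, in which continuation lines are indented, would be flagged by the first test, whereas a sample in which every line begins with the same time-stamp layout would pass both tests and be treated as single-line.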


Another aspect of aggregation methods 325 is the ability to learn, and codify into rules, what constitutes a break between lines and therefore the boundary between events, by analyzing a sample of MD. For example, in one implementation, an aggregation method 325 compares every two-line pair looking for statistically similar structures (e.g., use of white space, indentation, and time-stamps) to quickly learn which two belong together and which two are independent. In one implementation, aggregation 325 works as follows. For each line, first check if the line starts with a time-stamp. If so, then break. Typically, lines starting with a time-stamp are the start of a new event. For lines that do not start with a time-stamp, combine the current line with the prior line to see how often the pair of lines occurs, one before the other, as a percentage of total pairs in the MD sample. Line signatures are used in place of lines, where a line signature is a more stable version of a line, immune to simple numeric and textual changes. In this implementation, signatures can be created by converting a line into a string that is the concatenation of leading white space, any punctuation on the line, and the first word on the line. The line “10:29:03 Host 191.168.0.1 rebooting:normally” is converted to “::...:Host”.
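A line signature of this kind might be computed as in the following sketch. It assumes that "the first word" means the first run of alphabetic characters, so that a leading time-stamp is skipped, and that every punctuation character on the line is kept in order. The function name `line_signature` is hypothetical.

```python
import re
import string


def line_signature(line):
    """Concatenate leading white space, all punctuation, and the first word."""
    leading = line[: len(line) - len(line.lstrip())]
    punct = "".join(ch for ch in line if ch in string.punctuation)
    word = re.search(r"[A-Za-z]+", line)
    return leading + punct + (word.group(0) if word else "")


# Two lines that differ only in their numeric values share one signature.
first = line_signature("10:29:03 Host 191.168.0.1 rebooting:normally")
second = line_signature("11:45:17 Host 10.2.3.4 rebooting:normally")
```

The point of the sketch is the stability property: changes to times, addresses, and other values leave the signature unchanged, while a change in indentation or punctuation layout produces a different signature.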


Now this current line signature can be concatenated with the previous line signature (i.e., signature1 combined with signature2) and used as a combined key into a table of break rules. The break rule table maps the combined key to a break rule, which determines whether there should be a ‘break’, or not, between the two lines (i.e., whether they are part of different events or not). Break rules can have confidence levels, and a more confident rule can override a less confident rule. Break rules can be created automatically by analyzing the co-occurrence data of the two lines and what percent of the time their signatures occur adjacently. If the two line signatures highly co-occur, a new rule would recommend no break between them. Alternatively, if they rarely co-occur, a new rule would recommend a break between them. For example, if line signature A is followed by line signature B greater than 20% of the time A is seen, then a break rule might be created to recommend no break between them. Rules can also be created based on the raw number of line signatures that follow/precede another line signature. For example, if a line signature is followed by say, ten different line signatures, create a rule that recommends a break between them. If there is no break rule in the break rule table, the default behavior is to break and assume the two lines are from different events. Processing proceeds by processing each two-line pair, updating line signature and co-occurrence statistics, and applying and learning corresponding break rules. At regular intervals, the break rule table is written out to the hard disk or permanent storage.
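The learning and application of break rules can be sketched as follows. This is a simplified illustration, not the described implementation: it learns rules in one pass over a sample rather than incrementally, it omits confidence levels and the rule based on the number of distinct following signatures, and it assumes a hypothetical HH:MM:SS time-stamp format. All function names are assumptions.

```python
import re
import string
from collections import Counter

# Hypothetical time-stamp format: HH:MM:SS at the start of a line.
TIMESTAMP = re.compile(r"^\d{2}:\d{2}:\d{2}")


def line_signature(line):
    """Leading white space + punctuation + first alphabetic word."""
    leading = line[: len(line) - len(line.lstrip())]
    punct = "".join(ch for ch in line if ch in string.punctuation)
    word = re.search(r"[A-Za-z]+", line)
    return leading + punct + (word.group(0) if word else "")


def learn_break_rules(lines, no_break_ratio=0.20):
    """Build a table mapping (previous signature, current signature) to a rule."""
    sigs = [line_signature(line) for line in lines]
    seen = Counter(sigs[:-1])  # how often each signature precedes another line
    pairs = Counter(
        (prev, cur)
        for prev, cur, line in zip(sigs, sigs[1:], lines[1:])
        if not TIMESTAMP.match(line)  # time-stamped lines always break
    )
    return {
        pair: "no_break" if count / seen[pair[0]] > no_break_ratio else "break"
        for pair, count in pairs.items()
    }


def aggregate(lines, rules):
    """Group lines into events; the default with no rule is to break."""
    events = []
    prev_sig = None
    for line in lines:
        sig = line_signature(line)
        starts_event = (
            prev_sig is None
            or TIMESTAMP.match(line) is not None
            or rules.get((prev_sig, sig), "break") == "break"
        )
        if starts_event:
            events.append([line])
        else:
            events[-1].append(line)
        prev_sig = sig
    return events


sample = [
    "10:29:03 ERROR request failed",
    "  at handler.process",
    "  at server.dispatch",
    "10:29:04 INFO request ok",
    "10:29:05 ERROR request failed",
    "  at handler.process",
    "  at server.dispatch",
]
events = aggregate(sample, learn_break_rules(sample))
```

In this sample the indented continuation lines follow the error line every time it is seen, so a no-break rule is learned and each error is aggregated with its two continuation lines into a single event.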


Extraction of Entities


Following aggregation 325 and before event segmentation 345, various extraction methods 335 can be applied to identify semantic entities 340 within the data. In one implementation, search trees or regular expressions can be applied to extract and validate, for example, IP addresses or email addresses. The goal of extraction 335 is to assist the segmentation process 345 and provide semantic value to the data.
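As an illustration of regular-expression extraction with validation, the sketch below finds candidate IPv4 addresses and email addresses in an event and validates the address candidates with the Python standard library `ipaddress` module. The patterns are deliberately simple, the entity names are hypothetical, and the approach is only one of many possible extraction methods.

```python
import ipaddress
import re

# Deliberately simple candidate patterns; real data may need stricter ones.
IP_CANDIDATE = re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b")
EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")


def extract_entities(event_text):
    """Return (entity type, value, start offset) tuples found in the event."""
    entities = []
    for match in IP_CANDIDATE.finditer(event_text):
        try:
            # Validation step: reject candidates such as 999.1.1.1.
            ipaddress.IPv4Address(match.group(0))
        except ValueError:
            continue
        entities.append(("ip_address", match.group(0), match.start()))
    for match in EMAIL.finditer(event_text):
        entities.append(("email_address", match.group(0), match.start()))
    return entities


found = extract_entities(
    "10:29:03 Host 191.168.0.1 mail from bob.smith@corp.com rejected 999.1.1.1"
)
```

Recording the offset of each entity allows a later segmentation step to treat the extracted span as a unit rather than splitting it at its internal punctuation.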


Segmentation of Events


Segmentation 345 rules describe how to divide event data 330 into segments (also known as tokens 350). It is important to note at this point that segments 350 have little semantic value, unless an extracted entity 340 has been applied. In one implementation a segmentation rule 345 examines possible separators or punctuation within the event 330, for example, commas, spaces or semicolons. An important aspect of segmentation 345 is the ability to not only identify individual segments 350, but also to identify overlapping segments 350. For example, the text of an email address, “bob.smith@corp.com”, can be broken 345 into individual and overlapping segments 350; <bob.smith>, <@> and <corp.com> can be identified as individual segments, and <<bob.smith><@><corp.com>> can also be identified as an overlapping segment. In one implementation, segmentation 345 uses a two-tier system of major and minor breaks. Major breaks are separators or punctuation that bound the outermost segment 350. Examples include spaces, tabs, and new lines. Minor breaks are separators or punctuation that break larger segments 350 into sub-segments 350, for example periods, commas, and equal signs. In one implementation, more complex separators and punctuation combinations are used to handle complex segmentation tasks 345, for example handling Java exceptions in an application server log file.
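The two-tier scheme of major and minor breaks can be sketched as below, using the example separators named in the text. The function name `segment` is hypothetical, and the sketch emits each outermost segment followed by its sub-segments so that overlapping segments are all available; it does not attempt the entity-aware handling shown in the email address example.

```python
import re

MAJOR_BREAKS = re.compile(r"[ \t\n]+")  # spaces, tabs, new lines
MINOR_BREAKS = re.compile(r"[.,=]")     # periods, commas, equal signs


def segment(event_text):
    """Return outermost segments, each followed by its sub-segments."""
    segments = []
    for outer in MAJOR_BREAKS.split(event_text):
        if not outer:
            continue
        segments.append(outer)
        inner = [part for part in MINOR_BREAKS.split(outer) if part]
        if len(inner) > 1:
            # The sub-segments overlap the outer segment they came from.
            segments.extend(inner)
    return segments


tokens = segment("user=bob host=web1.corp.com")
```

Keeping both tiers means a later search for either the whole value "web1.corp.com" or a part such as "corp" can match the same event.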


Classification of Event Types


In the embodiment of FIG. 3, the final step of the organization process 235 is the classification 355 of events 350 into event types. Examples of event types include a web server HTTP “get,” an application server database “connect,” or an email server “send mail attempt.” In one implementation, an event signature 352 is generated for each event type. One method for generating an event signature 352 is to build a hierarchical scheme for identifying particular types of events based on the overall event structure 330, segmentation 350, segment values 350, and extracted entities 340. The purpose of the event signature 352 is to identify a type of event regardless of the situation. In this way a particular type of event can have the same signature 352 in multiple MDWs. For example, a mail server's send mail attempt generally has the same signature 352 in every MDW regardless of the information-processing environment.


In one implementation a hierarchical event signature {v1, v2, v3, . . . vn} 352 is constructed from a list of successively more specific hash functions {f1( ), f2( ), f3( ), . . . fn( )}, where each fn( ) produces a value representing a level of the hierarchy. The event signature 352 is most useful when each successive function is more specific. For example, in one embodiment, the following function list represents a 9 level event signature 352, from most general to most specific:

    • f1( ): firstCharType—returns alpha, numeric, white space, other, depending on the type of the first character of the event.
    • f2( ): headwhitespace—returns the number of spaces/tabs at the beginning of the event.
    • f3( ): firstpunc—returns the first punctuation character of the event.
    • f4( ): firstImportantKeywords—returns a hash value of the first word in the event that is an important keyword, where there is a list of known important terms.
    • f5( ): firstKnownWord—returns the first word in the event that is a known keyword, where there is a list of known terms.
    • f6( ): importantKeyword—returns the list of all hash values of important keywords that are found in the event.
    • f7( ): firstUnknownWord—returns the first word in the event that is not a known keyword.
    • f8( ): headPunc—returns the first 10 punctuation characters in the event, removing duplicates.
    • f9( ): allPunc—returns all punctuation in the event.


In this implementation, the event signature 352 is a traversal through a hierarchy of possible values. Given that event signatures 352 can be large and hard to read, an event signature can have a corresponding label in the form of a number or text that can be machine generated or human assigned. For example, an email server “send mail attempt” event might be programmatically assigned the label “500”, but a user can assign the label “send mail attempt”.
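A reduced form of the hierarchical signature can be sketched as follows. Only the five levels that need no keyword lists (f1, f2, f3, f8 and f9) are shown, the Python function names are hypothetical renderings of the names in the list above, and f8 is read as "take the first 10 punctuation characters, then drop duplicates".

```python
import string


def first_char_type(event):
    """f1: type of the first character of the event."""
    if not event:
        return "other"
    ch = event[0]
    if ch.isalpha():
        return "alpha"
    if ch.isdigit():
        return "numeric"
    if ch.isspace():
        return "white space"
    return "other"


def head_whitespace(event):
    """f2: number of spaces/tabs at the beginning of the event."""
    return len(event) - len(event.lstrip(" \t"))


def _punctuation(event):
    return [ch for ch in event if ch in string.punctuation]


def first_punc(event):
    """f3: first punctuation character of the event."""
    punct = _punctuation(event)
    return punct[0] if punct else ""


def head_punc(event):
    """f8: first 10 punctuation characters, duplicates removed, order kept."""
    return "".join(dict.fromkeys(_punctuation(event)[:10]))


def all_punc(event):
    """f9: all punctuation in the event."""
    return "".join(_punctuation(event))


# Ordered from most general to most specific.
LEVELS = [first_char_type, head_whitespace, first_punc, head_punc, all_punc]


def event_signature(event):
    """Return one value per level of the hierarchy."""
    return tuple(level(event) for level in LEVELS)


sig = event_signature("10:29:03 Host 191.168.0.1 rebooting:normally")
```

Because the levels are ordered from general to specific, two events can be compared by the length of the prefix their signatures share: a match on only the first levels suggests a loosely similar structure, while a match on every level suggests the same event type.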


Machine Data Understanding Process



FIG. 4 is a flow diagram of one implementation 275 of the MD understanding process shown in FIG. 2. During the understanding process 275, knowledge about how events relate to one another is discovered from the event data 250. This knowledge is valuable in understanding the behavior of the underlying information-processing environment. Links 410, representing relationships between events 250, are useful, among other things, for finding connections and causality where little or no common structure exists. For example, in an email-messaging information-processing environment, an event 250 may exist in the message transfer agent (MTA) indicating the receipt of a message from a sender, another event 250 may exist in the spam filtering software documenting that the sender is known and the message is safe to forward to a user's mailbox, and finally the mailbox authentication may contain an event 250 showing that the user attempted to log in to their mailbox and retrieve their mail. These three events 250 may contain no common structure other than a timestamp. However, the three events 250 are connected as part of a larger email messaging activity. In one implementation of the understanding process 275, several techniques are applied including linking 405, which creates connections 410 between events 250; path construction 415, to build more complex, multi-link connections 420; and analysis 425, which records historical data 492 and generates statistics 494 about the MDW.


Linking Events


By analyzing event data 250 and possible link hints 402 from external systems or human input, links 410 can be created 405. An important feature of the MDW approach is the ability to create 405 link relationships 410 dynamically and learn new possible link relationships on the fly. A number of methods can be used in the analysis of ED 250 to create 405 links 410, including, but not limited to, value analysis, statistical analysis, timing analysis, and the evaluation of link hints 402. These methods can be used individually or in combination with one another. From our previous example, perhaps the link 410 between the MTA and the spam filter events 250 is a value association between the MTA message ID and the spam filter article ID, or the link 410 between the spam filter and the user email retrieval 250 is an associative mail box name. All three events 250 might be tied together, for example by observing a timing pattern that occurs over and over again with statistically relevant frequency.


In one implementation, link analysis 405 takes place by creating a co-occurrence table with an entry for pairs of event types or event data values that occur within a predetermined window of each other. In one aspect, windows are bounded by a window threshold taking the form of time (e.g. 10 minutes), event types (e.g. 50 unique event types), or event instances (e.g. 1000 events). The value of the co-occurrence table entry is the distance between the pair (time, event types, or event instances). Pairs that co-occur often enough and meet a distance standard deviation threshold are deemed relevant and reliable links. For example, assume that an event 250 of type A occurred 50 times, an event of type B occurred 40 times, an event of type A was followed by an event of type B 20% of the time, and the standard deviation of their distance was less than 5.0 (a predetermined threshold), then a link 410 is created between events 250 of type A and type B (represented as A→B). Standard deviation thresholds are based on a function of window thresholds and may change based on the time to complete analysis or the number of desired results. Window thresholds may change based on data density and time available to complete the analysis.
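A time-windowed version of this co-occurrence analysis might be sketched as follows. The sketch assumes events are given as (time in seconds, event type) pairs, uses only the time form of the window threshold, counts at most one following event of each type per preceding event, and uses the population standard deviation. The function name `discover_links` and its parameters are hypothetical; the default thresholds echo the example values in the text.

```python
from collections import Counter, defaultdict
from statistics import pstdev


def discover_links(events, window=600.0, min_follow_ratio=0.20, max_stdev=5.0):
    """Return {(type_a, type_b): mean distance} for pairs deemed reliable links."""
    events = sorted(events)
    type_counts = Counter(event_type for _, event_type in events)
    distances = defaultdict(list)  # the co-occurrence table
    for i, (time_a, type_a) in enumerate(events):
        followed = set()
        for time_b, type_b in events[i + 1:]:
            if time_b - time_a > window:
                break  # outside the window threshold
            if type_b != type_a and type_b not in followed:
                followed.add(type_b)
                distances[(type_a, type_b)].append(time_b - time_a)
    links = {}
    for (type_a, type_b), observed in distances.items():
        follow_ratio = len(observed) / type_counts[type_a]
        if follow_ratio >= min_follow_ratio and pstdev(observed) < max_stdev:
            links[(type_a, type_b)] = sum(observed) / len(observed)
    return links


# Hypothetical data: B reliably follows A by 2 seconds; C follows erratically.
history = []
for k in range(5):
    base = k * 1000
    history += [(base, "A"), (base + 2, "B"), (base + 10 + 100 * k, "C")]
links = discover_links(history)
```

In this data only A→B survives: C also follows A within the window every time, but the distance varies too widely to meet the standard deviation threshold.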


Path Construction by Chaining Linked Events


Paths 420 are multi-link collections representing a chain of linked events 410. Paths 420 often represent a higher level of information system behavior, possibly spanning multiple systems, applications or data centers. Paths 420 are useful, for example, for following more complex activities or transactions through one or more systems. In our email example, a path 420 could be the receiving or sending of an email including three or more events 250 and two or more links 410. Similar to links 410, paths 420 are created 415 by analyzing event data 250,410 and possible path hints 412 from external systems or human input. An important feature is the ability to create paths 420 dynamically and learn new possible paths on the fly.


Paths 420 are built by chaining together 415 event links 410, using a number of methods. In one implementation, paths 420 are discovered as chains of transitive links 410. For example, given previously discovered links 410 A→B, B→C, A→C, and C→A, transitive composition yields the following five event paths 420: A→B→C, B→C→A, A→C→A, C→A→B and C→A→C. These paths 420 can also be combined to make larger and larger path chains. In one aspect, certain restrictions are applied 415 to reduce combinatorial explosion. One restriction might involve the elimination of cycles and repetitions. For example, one rule 415 might be that A→C and C→A cannot be combined to create A→C→A. In a second possible restriction 415, for A→B and B→C to be combined there must be an A→C link 410, with the average distance of A→C being approximately equal to the sum of the average distances of A→B and B→C. In addition, the standard deviation of the distance for A→C must be approximately equal to the standard deviations of A→B and B→C. Finally, paths 420 that are rotations of other paths can be removed, keeping the most reliable path. For example, given paths 420 A→B→C and C→A→B, if the standard deviation of the distance between C→A is greater than the standard deviation of the distance between B→C then A→B→C would be kept and C→A→B removed.
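By way of a non-limiting sketch, the chaining of links into two-link paths, the cycle restriction, and the removal of rotations might look as follows in Python. The function names, the representation of a link as a (source, destination) pair, and the choice to rank rotations by the sum of their link standard deviations are assumptions of this sketch; the summed ranking is merely one rule that agrees with the A→B→C versus C→A→B example above.

```python
def chain_links(links):
    """Compose every pair of links X->Y and Y->Z into a path (X, Y, Z)."""
    ordered = sorted(links)
    paths = []
    for a, b in ordered:
        for c, d in ordered:
            if b == c:
                paths.append((a, b, d))
    return paths


def drop_cycles(paths):
    """Restriction: discard paths that revisit an event type (e.g. A->C->A)."""
    return [p for p in paths if len(set(p)) == len(p)]


def canonical_rotation(path):
    """Pick one representative for all rotations of a path."""
    return min(path[i:] + path[:i] for i in range(len(path)))


def drop_rotations(paths, link_stdev):
    """Among rotations of the same path, keep the one whose links have the
    smallest total distance standard deviation (assumed reliability rule)."""
    best = {}
    for p in paths:
        key = canonical_rotation(p)
        cost = sum(link_stdev[(p[i], p[i + 1])] for i in range(len(p) - 1))
        if key not in best or cost < best[key][0]:
            best[key] = (cost, p)
    return [p for _, p in best.values()]


# Hypothetical links and standard deviations, for illustration only.
link_stdev = {("A", "B"): 1.0, ("B", "C"): 2.0, ("A", "C"): 1.5, ("C", "A"): 4.0}
all_paths = chain_links(link_stdev.keys())
kept = drop_rotations(drop_cycles(all_paths), link_stdev)
```

Longer chains could be obtained by repeatedly composing the resulting paths with further links, subject to the same restrictions.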


Like the WWW and HTML hyperlinks, event links 410 and paths 420 can be represented as a uniform resource locator (URL). In one implementation, a link 410 from one event 250 to another is represented by the following URL: “mdw://<name of MDW>/<link type>/<link value>/<event 1>/<event 2>”. A link 410 can resolve to one of several destinations including, but not limited to, an event type, an event instance or an event segment within an event instance.
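The URL form above might be produced and read back as in the following Python sketch. The helper names link_url and parse_link_url, the dictionary keys, the percent-encoding of each segment, and the sample values are assumptions of this sketch; the description specifies only the order of the five segments.

```python
from urllib.parse import quote, unquote

PREFIX = "mdw://"
FIELDS = ("mdw_name", "link_type", "link_value", "event_1", "event_2")


def link_url(mdw_name, link_type, link_value, event_1, event_2):
    """Build mdw://<name of MDW>/<link type>/<link value>/<event 1>/<event 2>."""
    parts = (mdw_name, link_type, link_value, event_1, event_2)
    # Percent-encode each segment so that "/" inside a value cannot be
    # mistaken for a segment separator (an assumption of this sketch).
    return PREFIX + "/".join(quote(str(p), safe="") for p in parts)


def parse_link_url(url):
    """Split an mdw:// link URL back into its five named segments."""
    if not url.startswith(PREFIX):
        raise ValueError("not an mdw:// URL")
    segments = [unquote(s) for s in url[len(PREFIX):].split("/")]
    if len(segments) != len(FIELDS):
        raise ValueError("expected five path segments")
    return dict(zip(FIELDS, segments))


# Hypothetical values, for illustration only.
url = link_url("prod", "temporal", "A->B", "evt/1", "evt 2")
fields = parse_link_url(url)
```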


Analysis of the MDW


In addition to links 410 and paths 420, another aspect of the MDW understanding process 275 is the ability to generate 425 historical information 492 about itself, for example, statistics 494 for event, event type, link or path occurrences. One aspect of historical data 492 regarding the MDW is that it can reveal historical behavior of the information-processing environment itself.


Accessing the MDW



FIG. 5 illustrates one approach to accessing the elements of the machine data web 290, including its data and dynamic relationships, through an application-programming interface (API). In one embodiment, the MDW 290 and corresponding technology infrastructure are Internet-based. The API includes commands to post data 510 to the MDW infrastructure 290 including, but not limited to, MD, events, segments, source signatures, link hints, and path hints. In the same embodiment, the API also includes commands to get data 520 from the MDW 290 including, but not limited to, the original MD, events, segments, source signatures, links, and paths. Utilizing the MDW API, a variety of applications and systems can take advantage of an advanced organization and understanding of machine data.
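Purely as an illustration of the post and get commands described above, the following Python sketch models them with an in-memory store. The class name InMemoryMDW, the method names, and the string identifiers for each kind of data are hypothetical; the description does not define a concrete API surface, transport, or storage layer.

```python
# Kinds of data that may be posted to or retrieved from the MDW, following
# the lists in the description. The identifiers themselves are assumptions.
POSTABLE = {"md", "events", "segments", "source_signatures", "link_hints", "path_hints"}
GETTABLE = {"md", "events", "segments", "source_signatures", "links", "paths"}


class InMemoryMDW:
    """Toy stand-in for the MDW infrastructure; not an actual implementation."""

    def __init__(self):
        self._store = {kind: [] for kind in POSTABLE | GETTABLE}

    def post(self, kind, item):
        """Accept data supplied by an external application or system."""
        if kind not in POSTABLE:
            raise ValueError(f"cannot post {kind!r}")
        self._store[kind].append(item)

    def get(self, kind):
        """Return a copy of the stored data of the requested kind."""
        if kind not in GETTABLE:
            raise ValueError(f"cannot get {kind!r}")
        return list(self._store[kind])


mdw = InMemoryMDW()
mdw.post("events", {"type": "A", "timestamp": 0.0})
mdw.post("link_hints", ("A", "B"))
stored_events = mdw.get("events")
```

In this sketch hints can be posted but not retrieved, while links and paths can be retrieved but not posted, mirroring the asymmetry between the two command lists.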


The MDW can be implemented in many different ways. In one approach, each box in FIGS. 2, 3 and 4 is implemented in software as a separate process. All of the processes can run on a single machine or they can be divided up to run on separate logical or physical machines. In alternate embodiments, the invention is implemented in computer hardware, firmware, software, and/or combinations thereof. Apparatus of the invention can be implemented in a computer program product tangibly embodied in a machine-readable storage device for execution by a programmable processor. Method steps of the invention can be performed by a programmable processor executing a program of instructions to perform functions of the invention by operating on input data and generating output. The invention can be implemented advantageously in one or more computer programs. Each computer program can be implemented in a high-level procedural or object-oriented programming language or in assembly or machine language if desired; in any case, the language can be a compiled or interpreted language. Any of the foregoing can be supplemented by, or incorporated in, ASICs (application-specific integrated circuits) and other forms of hardware.


Although the detailed description contains many specifics, these should not be construed as limiting the scope of the invention but merely as illustrating different examples and aspects of the invention. It should be appreciated that the scope of the invention includes other embodiments not discussed in detail above. For example, not all of the steps shown are required in every implementation, and they may be implemented in ways other than the examples given above. The order of the steps may also be changed in certain cases. Various other modifications, changes and variations which will be apparent to those skilled in the art may be made in the arrangement, operation and details of the method and apparatus of the present invention disclosed herein without departing from the spirit and scope of the invention.

Claims
  • 1. A method comprising: identifying a first source signature for a first source of machine data and a second source signature for a second source of machine data; receiving machine data; comparing a first portion of the machine data with the first source signature and a second portion of the machine data with the second source signature; based on comparing the first portion of the machine data with the first source signature and the second portion of the machine data with the second source signature, determining the first portion of the machine data is associated with the first source of machine data and the second portion of the machine data is associated with the second source of machine data; based on determining the first portion of the machine data is associated with the first source of machine data, segmenting the first portion of the machine data into at least one first event, wherein segmenting the first portion of the machine data into the at least one first event comprises determining a particular starting point in the first portion of the machine data and a particular ending point in the first portion of the machine data for the at least one first event; based on determining the second portion of the machine data is associated with the second source of machine data, segmenting the second portion of the machine data into at least one second event, wherein segmenting the second portion of the machine data into at least one second event comprises determining a particular starting point in the second portion of the machine data and a particular ending point in the second portion of the machine data for the at least one second event; identifying, in real time, a pattern that associates the at least one first event with the at least one second event; and providing, to a computing system, information associated with the pattern that associates the at least one first event with the at least one second event.
  • 2. The method of claim 1, wherein the first source signature comprises a first source label and the second source signature comprises a second source label.
  • 3. The method of claim 1, wherein the first source of machine data is associated with a first rule and the second source of machine data is associated with a second rule.
  • 4. The method of claim 1, wherein the first source of machine data is associated with a first rule and the second source of machine data is associated with a second rule, wherein segmenting the first portion of the machine data into the at least one first event is based on application of the first rule to the first portion of the machine data, wherein segmenting the second portion of the machine data into the at least one second event is based on application of the second rule to the second portion of the machine data.
  • 5. The method of claim 1, wherein the at least one first event includes at least a portion of the first portion of the machine data, wherein the at least one second event includes at least a portion of the second portion of the machine data, wherein the first portion of the machine data includes the first source signature and the second portion of the machine data includes the second source signature.
  • 6. The method of claim 1, wherein identifying the first source signature and the second source signature comprises: generating the first source signature and the second source signature.
  • 7. The method of claim 1, wherein identifying the first source signature and the second source signature comprises: obtaining a collection of machine data from a plurality of sources of machine data, the plurality of sources of machine data comprising the first source of machine data and the second source of machine data; and generating, for each of the plurality of sources of machine data, a respective source signature.
  • 8. The method of claim 1, wherein identifying the first source signature and the second source signature comprises: generating one or more hash tables.
  • 9. The method of claim 1, wherein identifying the first source signature and the second source signature comprises: obtaining the first source signature and the second source signature via an application programming interface.
  • 10. The method of claim 1, wherein one or more of the first source signature or the second source signature are based on one or more of punctuation or a token.
  • 11. The method of claim 1, further comprising: clustering the machine data based on the first source signature and the second source signature.
  • 12. The method of claim 1, wherein comparing the first portion of the machine data with the first source signature and the second portion of the machine data with the second source signature comprises: performing a nearest-neighbor search.
  • 13. The method of claim 1, wherein comparing the first portion of the machine data with the first source signature and the second portion of the machine data with the second source signature comprises: performing a nearest-neighbor search; and determining a first distance between the first source signature and a third signature of the first portion of the machine data and a second distance between the second source signature and a fourth signature of the second portion of the machine data.
  • 14. The method of claim 1, wherein comparing the first portion of the machine data with the first source signature and the second portion of the machine data with the second source signature comprises: performing a nearest-neighbor search; and determining a first distance between the first source signature and a third signature of the first portion of the machine data and a second distance between the second source signature and a fourth signature of the second portion of the machine data, wherein determining the first portion of the machine data is associated with the first source of machine data is based on the first distance and determining the second portion of the machine data is associated with the second source of machine data is based on the second distance.
  • 15. The method of claim 1, wherein a data format of the first portion of the machine data and a data format of the second portion of the machine data are different data formats.
  • 16. The method of claim 1, further comprising: outputting the first source signature and the second source signature via an application programming interface.
  • 17. The method of claim 1, further comprising: comparing a third portion of the machine data with a third source signature for a third source of machine data; and based on comparing the third portion of the machine data with the third source signature, determining the third portion of the machine data is associated with a third source of machine data.
  • 18. The method of claim 1, further comprising: comparing a third portion of the machine data with a third source signature for a third source of machine data; based on comparing the third portion of the machine data with the third source signature, determining the third portion of the machine data is not associated with a third source of machine data; and generating a fourth source signature for the third portion of the machine data.
  • 19. One or more non-transitory computer-readable storage media, storing one or more sequences of instructions, which when executed by one or more processors cause the one or more processors to: identify a first source signature for a first source of machine data and a second source signature for a second source of machine data; receive machine data; compare a first portion of the machine data with the first source signature and a second portion of the machine data with the second source signature; based on comparing the first portion of the machine data with the first source signature and the second portion of the machine data with the second source signature, determine the first portion of the machine data is associated with the first source of machine data and the second portion of the machine data is associated with the second source of machine data; based on determining the first portion of the machine data is associated with the first source of machine data, segment the first portion of the machine data into at least one first event, wherein to segment the first portion of the machine data into the at least one first event, execution of the one or more sequences of instructions by the one or more processors causes the one or more processors to determine a particular starting point in the first portion of the machine data and a particular ending point in the first portion of the machine data for the at least one first event; based on determining the second portion of the machine data is associated with the second source of machine data, segment the second portion of the machine data into at least one second event, wherein to segment the second portion of the machine data into at least one second event, the execution of the one or more sequences of instructions by the one or more processors causes the one or more processors to determine a particular starting point in the second portion of the machine data and a particular ending point in the second portion of the machine data for the at least one second event; identify, in real time, a pattern that associates the at least one first event with the at least one second event; and provide, to a computing system, information associated with the pattern that associates the at least one first event with the at least one second event.
  • 20. A system comprising: a memory containing computer-executable instructions; and a processing device configured to execute the computer-executable instructions to cause the system to: identify a first source signature for a first source of machine data and a second source signature for a second source of machine data; receive machine data; compare a first portion of the machine data with the first source signature and a second portion of the machine data with the second source signature; based on comparing the first portion of the machine data with the first source signature and the second portion of the machine data with the second source signature, determine the first portion of the machine data is associated with the first source of machine data and the second portion of the machine data is associated with the second source of machine data; based on determining the first portion of the machine data is associated with the first source of machine data, segment the first portion of the machine data into at least one first event, wherein to segment the first portion of the machine data into the at least one first event, execution of the computer-executable instructions by the processing device causes the system to determine a particular starting point in the first portion of the machine data and a particular ending point in the first portion of the machine data for the at least one first event; based on determining the second portion of the machine data is associated with the second source of machine data, segment the second portion of the machine data into at least one second event, wherein to segment the second portion of the machine data into at least one second event, the execution of the computer-executable instructions by the processing device causes the system to determine a particular starting point in the second portion of the machine data and a particular ending point in the second portion of the machine data for the at least one second event; identify, in real time, a pattern that associates the at least one first event with the at least one second event; and provide, to a computing system, information associated with the pattern that associates the at least one first event with the at least one second event.
Parent Case Info

This application claims benefit as a continuation of U.S. patent application Ser. No. 17/447,408, filed Sep. 10, 2021, which claims benefit as a continuation of U.S. patent application Ser. No. 16/399,146, filed Apr. 30, 2019, now U.S. Pat. No. 11,119,833, issued Sep. 14, 2021, which claims benefit as a continuation of U.S. patent application Ser. No. 14/611,189, filed Jan. 31, 2015, now U.S. Pat. No. 10,318,553, issued Jun. 11, 2019, which claims benefit as a continuation of U.S. patent application Ser. No. 14/170,228, filed Jan. 31, 2014, now U.S. Pat. No. 9,128,916, issued Sep. 8, 2015, which claims benefit as a continuation of U.S. patent application Ser. No. 13/664,109, filed Oct. 30, 2012, now U.S. Pat. No. 8,694,450, issued Apr. 8, 2014, which claims benefit as a continuation of U.S. patent application Ser. No. 13/099,268, filed May 2, 2011, now U.S. Pat. No. 8,589,321, issued Nov. 19, 2013, which claims benefit as a continuation of U.S. patent application Ser. No. 11/459,632, filed Jul. 24, 2006, now U.S. Pat. No. 7,937,344, issued May 3, 2011, which claims benefit of U.S. Provisional Patent Application No. 60/702,496, filed Jul. 25, 2005, the entire contents of the aforementioned are hereby incorporated by reference as if fully set forth herein, under 35 U.S.C. § 120. The applicant(s) hereby rescind any disclaimer of claim scope in the parent application(s) or the prosecution history thereof and advise the USPTO that the claims in this application may be broader than any claim in the parent application(s).

US Referenced Citations (198)
Number Name Date Kind
5613113 Goldring Mar 1997 A
6137470 Sundstrom et al. Oct 2000 A
6212494 Bouguraev Apr 2001 B1
6272531 Shrader Aug 2001 B1
6470384 Brien et al. Oct 2002 B1
6611825 Billheimer et al. Aug 2003 B1
6658487 Smith Dec 2003 B1
6701305 Holt et al. Mar 2004 B1
6728728 Spiegler et al. Apr 2004 B2
6801938 Bookman et al. Oct 2004 B1
6836894 Hellerstein et al. Dec 2004 B1
6906709 Larkin et al. Jun 2005 B1
6978274 Gallivan et al. Dec 2005 B1
7003781 Blackwell Feb 2006 B1
7134081 Fuller, III et al. Nov 2006 B2
7376969 Njemanze et al. May 2008 B1
7526769 Watts, Jr. et al. Apr 2009 B2
7607169 Njemanze et al. Oct 2009 B1
7616666 Schultz Nov 2009 B1
7783655 Barabas et al. Aug 2010 B2
7797309 Waters Sep 2010 B2
7805482 Schiefer Sep 2010 B2
7895383 Gregg et al. Feb 2011 B2
7926099 Chakravarty et al. Apr 2011 B1
7937344 Baum et al. May 2011 B2
7962483 Thomas Jun 2011 B1
7979362 Zhao et al. Jul 2011 B2
8112425 Baum et al. Feb 2012 B2
8196150 Downing et al. Jun 2012 B2
8346777 Auerbach Jan 2013 B1
8577847 Blazejewski et al. Nov 2013 B2
8589321 Baum et al. Nov 2013 B2
8615773 Bishop et al. Dec 2013 B2
8661062 Jamail et al. Feb 2014 B1
8694450 Baum et al. Apr 2014 B2
8751529 Zhang et al. Jun 2014 B2
8788525 Neels et al. Jul 2014 B2
8806361 Noel et al. Aug 2014 B1
8943056 Baum et al. Jan 2015 B2
8954450 Trahan et al. Feb 2015 B2
9020976 Ahmed et al. Apr 2015 B2
9043717 Noel et al. May 2015 B2
9092411 Barabas et al. Jul 2015 B2
9128916 Baum et al. Sep 2015 B2
9158811 Choudhary et al. Oct 2015 B1
9215240 Merza et al. Dec 2015 B2
9280594 Baum et al. Mar 2016 B2
9286413 Coates et al. Mar 2016 B1
9292590 Baum et al. Mar 2016 B2
9298805 Baum et al. Mar 2016 B2
9317582 Baum et al. Apr 2016 B2
9361357 Baum et al. Jun 2016 B2
9363149 Chauhan et al. Jun 2016 B1
9384261 Baum et al. Jul 2016 B2
9516052 Chauhan et al. Dec 2016 B1
9848008 Chauhan et al. Dec 2017 B2
10127258 Lamas et al. Nov 2018 B2
10157089 Ahmad et al. Dec 2018 B2
10237292 Chauhan et al. Mar 2019 B2
10242086 Baum et al. Mar 2019 B2
10250628 Chauhan et al. Apr 2019 B2
10254934 Chauhan et al. Apr 2019 B2
10255312 Swan et al. Apr 2019 B2
10318553 Baum et al. Jun 2019 B2
10318555 Baum et al. Jun 2019 B2
10324957 Baum et al. Jun 2019 B2
10339162 Baum et al. Jul 2019 B2
10425300 Vlachogiannis et al. Sep 2019 B2
10540321 Miller Jan 2020 B2
10891281 Baum et al. Jan 2021 B2
11010214 Baum et al. May 2021 B2
11036566 Baum et al. Jun 2021 B2
11036567 Baum et al. Jun 2021 B2
11192295 Warfield et al. Aug 2021 B2
11119833 Baum Sep 2021 B2
11126477 Baum et al. Sep 2021 B2
11204817 Baum et al. Dec 2021 B2
11599400 Baum Mar 2023 B2
11663244 Baum et al. May 2023 B2
20020046248 Drexler Apr 2002 A1
20020069223 Goodisman et al. Jun 2002 A1
20020078381 Farley et al. Jun 2002 A1
20020157017 Mi et al. Oct 2002 A1
20020174083 Hellerstein et al. Nov 2002 A1
20020198984 Goldstein et al. Dec 2002 A1
20030014408 Robertson Jan 2003 A1
20030023593 Schmidt Jan 2003 A1
20030041264 Black et al. Feb 2003 A1
20030056200 Li et al. Mar 2003 A1
20030084349 Friedrichs et al. May 2003 A1
20030126613 McGuire Jul 2003 A1
20030154396 Godwin et al. Aug 2003 A1
20030169925 Polonowski Sep 2003 A1
20030182310 Charnock et al. Sep 2003 A1
20030208485 Castellanos Nov 2003 A1
20030236766 Fortuna et al. Dec 2003 A1
20040024773 Stoffel et al. Feb 2004 A1
20040030703 Bourbonnais et al. Feb 2004 A1
20040098668 Vehkomaki May 2004 A1
20040122656 Abir Jun 2004 A1
20040167908 Wakefield et al. Aug 2004 A1
20040167911 Wakefield et al. Aug 2004 A1
20040215599 Apps et al. Oct 2004 A1
20040250134 Kohler et al. Dec 2004 A1
20050022207 Grabarnik et al. Jan 2005 A1
20050044208 Jones et al. Feb 2005 A1
20050044406 Stute Feb 2005 A1
20050060562 Bhattacharya et al. Mar 2005 A1
20050076067 Bakalash et al. Apr 2005 A1
20050086188 Hillis et al. Apr 2005 A1
20050089048 Chittenden et al. Apr 2005 A1
20050102292 Tamayo et al. May 2005 A1
20050108256 Wakefield et al. May 2005 A1
20050108630 Wasson et al. May 2005 A1
20050131935 O'Leary et al. Jun 2005 A1
20050172162 Takahashi et al. Aug 2005 A1
20050182736 Castellanous et al. Aug 2005 A1
20050198234 Leib et al. Sep 2005 A1
20050210027 Aggarwal et al. Sep 2005 A1
20050222810 Buford et al. Oct 2005 A1
20050223027 Lawrence et al. Oct 2005 A1
20050256956 Littlefield Nov 2005 A1
20050262193 Mamou et al. Nov 2005 A1
20050283680 Kobayashi et al. Dec 2005 A1
20060004691 Sifry Jan 2006 A1
20060069717 Mamou et al. Mar 2006 A1
20060101034 Murphy May 2006 A1
20060117091 Justin Jun 2006 A1
20060167825 Sayal Jul 2006 A1
20060173878 Bley Aug 2006 A1
20060174024 Chi et al. Aug 2006 A1
20060179025 Bechtel et al. Aug 2006 A1
20060195297 Kubota et al. Aug 2006 A1
20060230004 Handley Oct 2006 A1
20060230306 Richards et al. Oct 2006 A1
20060245641 Viola et al. Nov 2006 A1
20060248106 Milne et al. Nov 2006 A1
20060259519 Yakushev et al. Nov 2006 A1
20060265406 Chkodrov Nov 2006 A1
20060294086 Rose et al. Dec 2006 A1
20070022072 Kao et al. Jan 2007 A1
20070067323 Vandersluis Mar 2007 A1
20070073743 Bammi et al. Mar 2007 A1
20070118491 Baum et al. May 2007 A1
20070234426 Khanolkar et al. Oct 2007 A1
20080040191 Chakravarty et al. Feb 2008 A1
20080077572 Boyle et al. Mar 2008 A1
20080126408 Middleton May 2008 A1
20080148280 Stillwell et al. Jun 2008 A1
20080222654 Xu et al. Sep 2008 A1
20080294663 Heinley et al. Nov 2008 A1
20090157596 Couch et al. Jun 2009 A1
20090199118 Sabato et al. Aug 2009 A1
20090287630 Kaiser Nov 2009 A1
20100229112 Ergan et al. Sep 2010 A1
20110016123 Pandey et al. Jan 2011 A1
20110119100 Ruhl et al. May 2011 A1
20110179017 Meyers et al. Jul 2011 A1
20110208743 Baum et al. Aug 2011 A1
20120078925 Behar et al. Mar 2012 A1
20120290972 Yook et al. Nov 2012 A1
20130054596 Baum et al. Feb 2013 A1
20130097662 Pearcy et al. Apr 2013 A1
20130227689 Pietrowicz et al. Aug 2013 A1
20130239124 Ahmad et al. Sep 2013 A1
20130246925 Ahuja et al. Sep 2013 A1
20140019458 Walton Jan 2014 A1
20140082513 Mills et al. Mar 2014 A1
20140092095 Higgins et al. Apr 2014 A1
20140149438 Baum et al. Apr 2014 A1
20140237337 Baum et al. Aug 2014 A1
20150128267 Gupta et al. May 2015 A1
20150142842 Baum et al. May 2015 A1
20150143522 Baum et al. May 2015 A1
20150149460 Baum et al. May 2015 A1
20150154250 Baum et al. Jun 2015 A1
20150227612 Baum et al. Aug 2015 A1
20150227613 Baum et al. Aug 2015 A1
20150227614 Baum et al. Aug 2015 A1
20150293685 Chen et al. Oct 2015 A1
20150295778 Hsiao et al. Oct 2015 A1
20150295779 Ching et al. Oct 2015 A1
20150295780 Hsiao et al. Oct 2015 A1
20150295796 Hsiao et al. Oct 2015 A1
20150317377 Baum et al. Nov 2015 A1
20150324581 Singla et al. Nov 2015 A1
20160156667 Baum et al. Jun 2016 A1
20160255108 Baum et al. Sep 2016 A1
20170031565 Chauhan et al. Feb 2017 A1
20170034196 Chauhan et al. Feb 2017 A1
20170048264 Chauhan et al. Feb 2017 A1
20170063920 Thomas et al. Mar 2017 A1
20180069887 Chauhan et al. Mar 2018 A1
20180159885 Baum et al. Jun 2018 A1
20180189328 Frazier et al. Jul 2018 A1
20190098106 Mungel et al. Mar 2019 A1
20190251099 Baum et al. Aug 2019 A1
20190258651 Baum et al. Aug 2019 A1
Foreign Referenced Citations (4)
Number Date Country
2003308229 Oct 2003 JP
WO 2000079415 Dec 2000 WO
WO 2002027443 Apr 2002 WO
WO 2007014268 Feb 2007 WO
Non-Patent Literature Citations (155)
Entry
Softpanorama, “Sendmail Log Formats” (Jul. 28, 2019), pp. 1-9 [retrieved from https://softpanorama.org/Mail/Sendmail/sendmail_logs_format.shtml]. (Year: 2019).
Mark P. Mattson, “Superior pattern processing is the essence of the evolved human brain” (Aug. 2014), pp. 1-17 [retrieved from https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4141622/pdf/fnins-08-00265.pdf]. (Year: 2014).
IBM, “Apache HTTP Server Version 2.2” (2007), pp. 1-6 [retrieved from https://publib.boulder.ibm.com/httpserv/manual70/logs.html]. (Year: 2007).
Agichtein et al., “Mining Reference Tables for Automatic Text Segmentation,” (Aug. 2004), ACM SIGKDD International Conference On Knowledge Discovery and Data Mining, pp. 20-29.
Agrawal et al., “Mining Sequential Patters,” (Aug. 6, 2002), Proceedings of 11th International Conference on Data Engineering, pp. 3-14.
Antunes et al., “Temporal Data Mining: An Overview” (2001), pp. 1-15 [retrieved from http://www.dcc.fc.up.pt/˜Itorgo/AIFTSA/Proceedings/AO.pdf]. (Year: 2001).
Bounsaythip, C., et al., “Overview of Data Mining for Customer Behavior Modeling”, VTT Information Technology, Research Report TTEI-2001-18, dated Jun. 29, 2001, 59 pages.
Cooley et al. “Data Preparation for Mining World Wide Web Browsing Patterns”, dated Feb. 1999, Knowledge and Information System, vol. 1, issue 1, 28 pages.
Cooley, R. et al., “Web Usage Mining: Discovery and Application of Interesting Patterns from Web Data,” (May 2000), University of Minnesota, pp. 1-170 [retrieved from http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.411.6912&rep=repl &type=pdf].
Fu et al., “Mining Navigation History for Recommendation,” Proceeding of the 5th International Conference on Intelligent User Interfaces, 2000, pp. 106-112.
Gerardo et al., “Association Rule Discovery In Data Mining By Implementing Principal Component Analysis”, International Conference On AI, Simulation, And Planning In High Autonomy Systems, (AIS 2004), pp. 50-60 [retrieved from https://link.springer.com/chapter/10.1007/978-3-540-30583-5_6].
Han, E. et al., “Text Categorization Using Weight Adjusted k-Nearest Neighbor Classification”, PAKDD, dated Mar. 20, 1999, 12 pages.
Harms et al., “Discovering Representative Episodal Association Rules From Event Sequences Using Frequent Closed Episode Sets and Event Constraints,” (Aug. 7, 2002), Proceedings of The 2001 IEEE International Conference On Data Mining, downloaded from gttps://ieeexplore.ieee.org/abstract/document/989576/, pp. 603-606.
Hellerstein et al., “Discovering Actionable Patterns in Event Data,” (2002), IBM Systems Journal, vol. 41, Issue 3, pp. 475-493 [retrieved from http://ieeexplore.ieee.org/document/5386872].
Kirk, D., “Windows Notepad: Insert Time And Date Into Text Or Log File,” (Jun. 6, 2005), pp. 104, [retrieved from http://www.tech-recipes.com/rx/909/windows-notepad-insert-time-and-date-into-text-or-log-file/].
Kryszkiewicz, M., “Fast Discovery of Representative Association Rule,” (Feb. 26, 1999), International Conference On Rough Sets and Current Trends in Computing, pp. 214-222.
Lee et al., “Data Mining Approaches for Intrusion Detection,” (Jan. 26-29, 1998), Proceedings of The 7th USENIX Security Symposium, pp. 1-15 [retrieved from https://usenix.org/legacy/publications/library/proceedings/sec98/full_papers/lee/lee.pdf].
Lee, W., “Mining In A Data-Flow Environment: Experience In Network Intrusion Detection” (1999), Proceeding Of The Fifth ACM SIGKDD International Conference on Knowledge Discover And Data Mining, 1999 pp. 114-124.
Ma et al., “Mining Mutually Dependent Patterns”, Proceeding Of The IEEE International Conference On Data Mining, pp. 409-416, Nov. 29-Dec. 2, 2001.
Mannila et al., “Discovery Frequent Episodes in Sequences,” (1995), KDD-95 Proceedings, pp. 210-215 [retrieved from https://www.aaai.org/Papers/KDD/1995/KDD95-024.pdf].
Mastsuo, Y., “Keyword Extraction from a Single Document Using Word Co-occurrence Statistical Information”, Proceedings of the 16th International F.A.I.R.S.C., dated May 2003, 13 pages.
Pei et al., “Mining Access Patterns Efficiently from Web Logs”, dated 2000, Knowledge Discovery and Data Mining Lecture Notes in Computer Science, vol. 1805, 12 pages.
Peng et al., “Mining Logs Files For Data-Driven System Management”, Acm Sigkdd Explorations Newsletter—Natural Language Processing And Text Mining, vol. 7, Issue 1, pp. 44-51, Jun. 2005.
Prewett, J., “Analyzing Cluster Log Files Using Logsurfer,” 12 (2003), Proceedings of the 4th Annual Conference on Linux, pp. 1-12 [retrieved from http://citeseerx.ist.psu,edu/viewdoc/download?doi+10.1.1.5.8610&rep+rep1&type+pdf].
Rouillard, J., “Real-Time log file analysis using the Simple Event Correlator (SEC)” (Nov. 2004), Proceedings of LISA, pp. 1-36, [retrieved from http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.5.8610&rep1&type=pdf].
Russell, S. J., et al., “Artificial Intelligence: A Modern Approach, 2nd Edition”, Pearson Education, Inc., dated 2003, pp. 733-739.
Soderland, S., “Learning Information Extraction Rules for Semi-Structured and Free Text,” (Feb. 1999). Machine Learning, vol. 34, Issue 103, pp. 223-272.
Srivastava et al., “Web Usage Mining: Discovery and Applications of Usage Patterns from Web Data”, dated Jan. 2000, ACM SIGKDD Explorations Newsletter, vol. 1, issue 2, pp. 12-23.
Stamatatos, E. et al., “Text Genre Detection Using Common Word Frequencies”, Proceedings of the 18th International Conference on Computational Linguistics, (2000), vol. 2, p. 808-814.
Stubblebine, T., “Regular Expression Pocket Reference,” (Aug. 2003), O'Reilly Media, Inc., pp. 1-93.
Vaarandi, R., “A Data Clusterig Algorithm for Mining Patterns From Event Logs,” (Dec. 19, 2003), Proceedings Of The 3rd IEEE Workshop On IP Operations & Management (IPOM 2003), pp. 119-126.
Witten I. H., et al., “Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations,” Morgan Kaufmann Publishers, 2000, pp. 72-76, 114-118, 193-201.
International Search Report, re PCT Application No. PCT/2006/029019, dated Aug. 3, 2008.
International Preliminary Report on Patentability, re PCT Application No. PCT/2006/029019, dated Jan. 29, 2008.
Bitincka, Ledion et al., “Optimizing Data Analysis with a Semi-structured Time Series Database,” self-published, first presented at “Workshop on Managing Systems via Log Analysis and Machine Learning Techniques (SLAML)”, Vancouver, British Columbia, Oct. 3, 2010.
Blyth, Microsoft Operations Manager 2000, in 68 pages/slides.
Carasso, David, “Exploring Splunk,” published by CITO Research, New York, NY, Apr. 2012.
Chabowski, “The Mainframe versus the Server Farm—A Comparison,” (Apr. 24, 2017) [retrieved from https://www.suse.com/c/mainframe-versus-server-farm-comparison/].
Chilukuri, Symptom Database Builder for Autonomic Computing, IEEE, International Conference on Autonomic and Autonomous Systems, Silicon Valley, CA, USA Jul. 19-21, 2006, in 11 pages.
Conorich, “Monitoring Intrusion Detection Systems: From Data to Knowledge,” Enterprise Security Architecture, pp. 19-30, May/Jun. 2004.
Cuppens, “Real Time Intrusion Detection,” RTO Meeting Proceedings 101, North Atlantic Treaty Organisation, Research and Technology Organisation, Papers presented at the RTO Information Systems Technology Panel (IST) Symposium held in Estoril, Portugal, May 27-28, 2002.
Debar, A revised taxonomy for intrusion-detection systems, IBM Research Division, Zurich Research Laboratory 2000, in 18 pages.
Galassi et al., “Learning Regular Expressions From Noisy Sequences”, International Symposium On Abstraction, Reformulation, And Approximation (SARA 2005), pp. 92-106, Jul. 18-21, 2007.
GFI Launches GFI LANguard Security Event Log Monitor 3.0, Intrado GlobeNewswire, Jun. 10, 2002.
GFI's New LANguard S.E.L.M. 4 Combats Intruders—Help Net Security, https://www.helpnetsecurity.com/2002/12/05/gfis-new-languard-selm-4-combats-intruders/. In two pages, 2002.
Girardin, et al., “A Visual Approach for Monitoring Logs,” USENIX Technical Program—Paper—Proceedings of the 12th Systems Administration Conference (LISA '98), in 13 pages.
Gomez, et al., “Using Lamport's Logical Clocks to Consolidate Log Files from Different Sources,” A. Bui et al. (Eds.): IICA 2005, LNCS 3908, pp. 126-133, 2006.
Gorton, “Extending Intrusion Detection with Alert Correlation and Intrusion Tolerance,” Thesis For The Degree of Licentiate of Engineering. Technical Report No. 27 L. Department of Computer Engineering Chalmers University of Technology, Goteborg, Sweden 2003.
Han, E. et al., “Text Categorization Using Weight Adjusted k-Nearest Neighbor Classification”, PAKDD, 2001, LNAI 2035, pp. 53-65 (2001) [retrieved from http://www.springerlink.com/index/25gnd0jb6nklffhh.pdf on Sep. 23, 2010].
Helmer, et al., Lightweight agents for intrusion detection, Department of Computer Science, Iowa State University 2003.
Jakobson, et al., “Real-time telecommunication network management: extending event correlation with temporal constraints,” Springer Science+Business Media Dordrecht 1995.
Kent, et al., “Recommendations of the National Institute of Standards and Technology,” Guide to Computer Security Log Management, Special Publication 800-92, Computer Security Division, Information Technology Laboratory, National Institute of Standards and Technology (NIST), Sep. 2006.
Kim, et al., “A Case Study on the Real-time Click Stream Analysis System,” CIS 2004, LNCS 3314, pp. 788-793, 2004.
Kwok, Investigating IBM Tivoli Intelligence ThinkDynamic Orchestrator (ITITO) And IBM Tivoli Provisioning Manager (ITPM), Electrical & Computer Engineering Department University of Waterloo, Ontario, Canada, Apr. 2006.
Lin et al., “Efficient Adaptive-Support Association Rule Mining For Recommender Systems”, Data Mining And Knowledge Discovery, vol. 6, Issue 1, pp. 83-105, Jan. 2012.
Luiijf, et al., Intrusion Detection Introduction and Generics, TNO Physics and Electronics Laboratory 2003, Session I: Real Time Intrusion Detection, Overview and Practical Experience, RTO Meeting Proceedings 101, Estoril, Portugal, May 27-28, 2002.
Manoel, et al., “Problem Determination Using Self-Managing Autonomic Technology,” IBM/Redbooks, Jun. 2005. (412 pages).
Microsoft Operations Manager, MOM 2005 Frequently Asked Questions, https://web.archive.org/web/20050830095611/http://www.microsoft.com/mom/evaluation/faqs/default.mspx. Published Aug. 25, 2004.
Microsoft Unveils New Microsoft Operations Manager 2000, Enterprise-Class Event and Performance Management Of Windows-Based Servers and Applications, May 9, 2001 in 4 pages.
Nguyen, et al., “Sense & Response Service Architecture (SARESA): An Approach towards a Real-time Business Intelligence Solution and its use for a Fraud Detection Application,” DOLAP '05, Nov. 4-5, 2005, Bremen, Germany. ACM 1-59593-162-7/05/0011.
Nihuo Software, “Log File Sample Explained,” (Nov. 2, 2007), pp. 1-3 [retrieved from https://web.archive.org/web/20071102044914/https://www.loganalyzer.net/log-analysis-tutoring-file-sample-explain.html].
SLAML 10 Reports, Workshop On Managing Systems via Log Analysis and Machine Learning Techniques, ;login: Feb. 2011 Conference Reports.
Splunk Cloud 8.0.2004 User Manual, available online, retrieved May 20, 2020 from docs.splunk.com.
Splunk Enterprise 8.0.0 Overview, available online, retrieved May 20, 2020 from docs.splunk.com.
Splunk Quick Reference Guide, updated 2019, available online at https://www.splunk.com/pdfs/solution-guides/splunk-quick-reference-guide.pdf, retrieved May 20, 2020.
Stearley, J., “Towards Informatic Analysis Of Syslogs”, 2004 IEEE International Conference On Cluster Computing (IEEE Cat. No. 04EX935), pp. 309-318, 2004.
Tierney, et al., “The NetLogger Methodology for High Performance Distributed Systems Performance Analysis,” IEEE HPDC-7 '98, Jul. 28-31, 1998, Chicago, Illinois.
Tozzi, CI, “What Makes Mainframes Different? Mainframe vs. Server,” (Apr. 12, 2018) [retrieved from https://blog.syncsort.com/2018/04/mainframe/mainframes-different-mainframe-vs-server/].
Valeur, et al., “A Comprehensive Approach to Intrusion Detection Alert Correlation,” IEEE Transactions On Dependable and Secure Computing, vol. 1, No. 3, pp. 146-169, Jul.-Sep. 2004.
Wikipedia, “Mainframe Computer,” (Nov. 1, 2019) [retrieved from https://en.wikipedia.org/w/index.php?title=Mainframe_computer].
Witten, I. H., et al., “Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations”, Morgan Kaufmann Publishers, dated 2000, pp. 80-82.
Wu, “Collecting Task Data in Event-Monitoring Systems,” University of Waterloo, Ontario, Canada 2004.
Yurcik, et al., “UCLog+: A Security Data Management System for Correlating Alerts, Incidents, and Raw Data From Remote Logs,” Escuela Superior Politécnica del Litoral (ESPOL) University of Illinois at Urbana-Champaign, Jul. 2006.
United States Patent and Trademark Office, U.S. Appl. No. 14/611,228, Final Office Action dated Aug. 6, 2015.
United States Patent and Trademark Office, U.S. Appl. No. 17/447,404, filed Sep. 10, 2021.
United States Patent and Trademark Office, U.S. Appl. No. 14/530,686, Notice of Allowance dated Jan. 13, 2016.
United States Patent and Trademark Office, U.S. Appl. No. 14/611,188, Final Office Action dated Aug. 5, 2015.
United States Patent and Trademark Office, U.S. Appl. No. 14/611,188, Notice of Allowance dated Dec. 30, 2015.
United States Patent and Trademark Office, U.S. Appl. No. 14/611,189, Non-Final Office Action dated Apr. 11, 2016.
United States Patent and Trademark Office, U.S. Appl. No. 14/611,189, Final Office Action dated Aug. 15, 2016.
United States Patent and Trademark Office, U.S. Appl. No. 14/611,189, Advisory Action dated Oct. 26, 2016.
United States Patent and Trademark Office, U.S. Appl. No. 14/611,191, Final Office Action dated Aug. 10, 2015.
United States Patent and Trademark Office, U.S. Appl. No. 14/611,191, Final Office Action dated Jan. 21, 2016.
United States Patent and Trademark Office, U.S. Appl. No. 14/611,191, Notice of Allowance dated Mar. 4, 2016.
United States Patent and Trademark Office, U.S. Appl. No. 14/611,228, Final Office Action dated Dec. 17, 2015.
United States Patent and Trademark Office, U.S. Appl. No. 14/691,135, Non-Final Office Action dated Jul. 23, 2015.
United States Patent and Trademark Office, U.S. Appl. No. 14/691,135, Notice of Allowance dated Dec. 18, 2015.
United States Patent and Trademark Office, U.S. Appl. No. 14/691,163, Non-Final Office Action dated Jul. 17, 2015.
United States Patent and Trademark Office, U.S. Appl. No. 14/691,163, Notice of Allowance dated Jan. 8, 2016.
United States Patent and Trademark Office, U.S. Appl. No. 14/691,195, Non-Final Office Action dated Jul. 24, 2015.
United States Patent and Trademark Office, U.S. Appl. No. 14/691,195, Notice of Allowance dated Dec. 18, 2015.
United States Patent and Trademark Office, U.S. Appl. No. 15/011,622, Final Office Action dated Mar. 13, 2007.
United States Patent and Trademark Office, U.S. Appl. No. 15/011,622, Non-Final Office Action dated Sep. 1, 2016.
United States Patent and Trademark Office, U.S. Appl. No. 15/011,622, Advisory Action dated Jan. 12, 2018.
United States Patent and Trademark Office, U.S. Appl. No. 15/011,622, Non-Final Office Action dated May 29, 2018.
United States Patent and Trademark Office, U.S. Appl. No. 15/011,625, Non-Final Office Action dated Jul. 9, 2016.
United States Patent and Trademark Office, U.S. Appl. No. 15/011,625, Final Office Action dated Jan. 18, 2017.
United States Patent and Trademark Office, U.S. Appl. No. 15/011,625, Advisory Action dated Mar. 31, 2017.
United States Patent and Trademark Office, U.S. Appl. No. 15/011,625, Advisory Action dated Jan. 17, 2018.
United States Patent and Trademark Office, U.S. Appl. No. 15/011,625, Non-Final Office Action dated May 29, 2018.
United States Patent and Trademark Office, U.S. Appl. No. 15/143,581, Non-Final Office Action dated Jun. 16, 2016.
United States Patent and Trademark Office, U.S. Appl. No. 15/143,581, Final Office Action dated Dec. 5, 2016.
United States Patent and Trademark Office, U.S. Appl. No. 15/143,581, Final Office Action dated Jan. 11, 2018.
United States Patent and Trademark Office, U.S. Appl. No. 15/143,581, Non-Final Office Action dated Jun. 26, 2018.
United States Patent and Trademark Office, U.S. Appl. No. 15/143,581, Notice of Allowance dated Nov. 28, 2018.
United States Patent and Trademark Office, U.S. Appl. No. 15/421,304, Non-Final Office Action dated Apr. 30, 2018.
United States Patent and Trademark Office, U.S. Appl. No. 15/421,304, Non-Final Office Action dated Mar. 8, 2017.
United States Patent and Trademark Office, U.S. Appl. No. 15/421,304, Advisory Action dated Jan. 11, 2018.
United States Patent and Trademark Office, U.S. Appl. No. 15/421,304, Final Office Action dated Sep. 18, 2017.
United States Patent and Trademark Office, U.S. Appl. No. 15/421,304, Notice of Allowance dated Jan. 29, 2019.
United States Patent and Trademark Office, U.S. Appl. No. 11/459,632, Non-Final Office Action dated Mar. 11, 2010.
United States Patent and Trademark Office, U.S. Appl. No. 11/459,632, Final Office Action dated Oct. 4, 2010.
United States Patent and Trademark Office, U.S. Appl. No. 11/459,632, Notice of Allowance dated Jan. 18, 2011.
United States Patent and Trademark Office, U.S. Appl. No. 13/099,268, Final Office Action dated Apr. 26, 2013.
United States Patent and Trademark Office, U.S. Appl. No. 13/099,268, Notice of Allowance dated Sep. 24, 2013.
United States Patent and Trademark Office, U.S. Appl. No. 14/170,228, Non-Final Office Action dated Jul. 3, 2014.
United States Patent and Trademark Office, U.S. Appl. No. 14/170,228, Advisory Action dated Jan. 20, 2015.
United States Patent and Trademark Office, U.S. Appl. No. 14/170,228, Notice of Allowance dated Apr. 30, 2015.
United States Patent and Trademark Office, U.S. Appl. No. 14/266,831, Non-Final Office Action dated Jun. 16, 2014.
United States Patent and Trademark Office, U.S. Appl. No. 14/266,831, Notice of Allowance dated Nov. 12, 2014.
United States Patent and Trademark Office, U.S. Appl. No. 14/530,686, Non-Final Office Action dated Jul. 31, 2015.
Abad et al., “Log Correlation for Intrusion Detection: A Proof of Concept” (Dec. 8-12, 2003), Proceedings of the 19th Annual Computer Security Applications Conference, pp. 1-10 [retrieved from http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=1254330].
Fu et al., “Mining Navigation History For Recommendation”, Proceedings Of The 5th International Conference On Intelligent User Interfaces, pp. 106-112, 2000.
Stearley, J., “Towards Informatic Analysis Of Syslogs”, 2004 IEEE International Conference On Cluster Computing (IEEE Cat. No. 04EX935), pp. 1-10, 2004.
Sheth et al., “Semantic Content Management for Enterprises and the Web” (2002), IEEE Internet Computing, pp. 1-19 [retrieved from https://pdfs.semanticscholar.org/f209/6eadab5f101919310eec17174d1bfb5a3fb2.pdf].
Sheth et al., “Semantic Association Identification and Knowledge Discovery for National Security Applications” (Jan.-Mar. 2005), Journal of Database Management, vol. 16, Issue 1, pp. 33-53 [retrieved from https://lsdis.cs.uga.edu/lib/download/SAA+2004-PISTA.pdf].
Anyanwu et al., “SemRank: ranking complex relationship search results on the semantic web” (May 10-14, 2005), Proceedings of the 14th international conference on World Wide Web, pp. 117-127 [retrieved from http://dl.acm.org/citation.cfm?id=1060766].
U.S. Appl. No. 11/459,632, filed Jul. 24, 2006, Baum et al.
U.S. Appl. No. 13/099,268, filed May 2, 2011, Baum et al.
U.S. Appl. No. 13/664,109, filed Oct. 30, 2012, Baum et al.
U.S. Appl. No. 14/170,228, filed Jan. 31, 2014, Baum et al.
U.S. Appl. No. 14/266,831, filed May 1, 2014, Baum et al.
U.S. Appl. No. 14/530,686, filed Oct. 31, 2014, Baum et al.
U.S. Appl. No. 14/611,171, filed Jan. 30, 2015, Baum et al.
U.S. Appl. No. 14/611,228, filed Jan. 31, 2015, Baum et al.
U.S. Appl. No. 14/611,191, filed Jan. 31, 2015, Baum et al.
U.S. Appl. No. 14/611,189, filed Jan. 31, 2015, Baum et al.
U.S. Appl. No. 14/611,188, filed Jan. 31, 2015, Baum et al.
U.S. Appl. No. 14/691,195, filed Apr. 20, 2015, Baum et al.
U.S. Appl. No. 14/691,163, filed Apr. 20, 2015, Baum et al.
U.S. Appl. No. 14/691,135, filed Apr. 20, 2015, Baum et al.
U.S. Appl. No. 15/011,622, filed Jan. 31, 2016, Baum et al.
U.S. Appl. No. 15/011,625, filed Jan. 31, 2016, Baum et al.
U.S. Appl. No. 15/143,581, filed Apr. 30, 2016, Baum et al.
U.S. Appl. No. 15/885,753, filed Jan. 31, 2018, Baum et al.
U.S. Appl. No. 16/264,638, filed Jan. 31, 2019, Baum et al.
U.S. Appl. No. 16/399,146, filed Apr. 30, 2019, Baum et al.
U.S. Appl. No. 16/399,169, filed Apr. 30, 2019, Baum et al.
U.S. Appl. No. 16/399,136, filed Apr. 30, 2019, Baum et al.
U.S. Appl. No. 17/447,404, filed Sep. 10, 2021, Baum et al.
U.S. Appl. No. 17/447,408, filed Sep. 10, 2021, Baum et al.
U.S. Appl. No. 17/448,196, filed Sep. 20, 2021, Baum et al.
U.S. Appl. No. 15/421,304, filed Jan. 31, 2017, Baum et al.
U.S. Appl. No. 16/398,104, filed Apr. 29, 2019, Baum et al.
Related Publications (1)
Number Date Country
20230205791 A1 Jun 2023 US
Provisional Applications (1)
Number Date Country
60702496 Jul 2005 US
Continuations (7)
Relation Number Date Country
Parent 17447408 Sep 2021 US
Child 18178417 US
Parent 16399146 Apr 2019 US
Child 17447408 US
Parent 14611189 Jan 2015 US
Child 16399146 US
Parent 14170228 Jan 2014 US
Child 14611189 US
Parent 13664109 Oct 2012 US
Child 14170228 US
Parent 13099268 May 2011 US
Child 13664109 US
Parent 11459632 Jul 2006 US
Child 13099268 US