Determining timestamps to be associated with events in machine data

Information

  • Patent Grant
  • 11526482
  • Patent Number
    11,526,482
  • Date Filed
    Monday, January 31, 2022
  • Date Issued
    Tuesday, December 13, 2022
  • CPC
    • G06F16/2272
    • G06F16/2228
    • G06F16/2291
    • G06F16/2322
    • G06F16/248
    • G06F16/2477
    • G06F16/24568
    • G06F16/24575
    • G06F16/24578
    • G06F16/951
  • Field of Search
    • CPC
    • G06F16/2272
    • G06F16/2228
    • G06F16/2291
    • G06F16/2322
    • G06F16/24568
    • G06F16/24575
    • G06F16/24578
    • G06F16/2477
    • G06F16/248
    • G06F16/951
    • G06F16/1873
    • G06F16/219
  • International Classifications
    • G06F16/00
    • G06F16/22
    • G06F16/248
    • G06F16/951
    • G06F16/23
    • G06F16/2458
    • G06F16/2455
    • G06F16/2457
Abstract
Methods and apparatus are disclosed to automatically timestamp events within streaming machine data. The streaming machine data is broken into a set of events using breaking rules. Each event can be analyzed by iterating over known time stamp format patterns from a list of known time stamp format patterns to determine whether a matching pattern exists in the event. When an individual event broken out from the streaming machine data includes time information according to at least one known time stamp format pattern of the list of known time stamp format patterns, a timestamp can be created for the event by extracting a time value from the event using the matching pattern determined to exist in the event.
Description
BACKGROUND
Field

This invention relates generally to information organization, search, and retrieval and more particularly to time series data organization, search, and retrieval.


Description of the Related Art

Time series data are sequences of time stamped records occurring in one or more usually continuous streams, representing some type of activity made up of discrete events. Examples include information processing logs, market transactions, and sensor data from real-time monitors (supply chains, military operation networks, or security systems). The ability to index, search, and present relevant search results is important to understanding and working with systems emitting large quantities of time series data.


Existing large scale search engines (e.g., Google and Yahoo web search) are designed to address the needs of less time sensitive types of data and are built on the assumption that only one state of the data needs to be stored in the index repository, for example, URLs in a Web search index, records in a customer database, or documents as part of a file system. Searches for information generally retrieve only a single copy of information based on keyword search terms: a collection of URLs from a Website indexed a few days ago, customer records from close of business yesterday, or a specific version of a document.


In contrast, consider an example of time series data from a typical information processing environment, shown in FIG. 1. Firewalls, routers, web servers, application servers and databases constantly generate streams of data in the form of events occurring perhaps hundreds or thousands of times per second. Here, historical data value and the patterns of data behavior over time are generally as important as current data values. Existing search solutions generally have little notion of time-based indexing, searching or relevancy in the presentation of results and don't meet the needs of time series data.


Compared to full text search engines, which organize their indices so that retrieving documents with the highest relevance scores is most efficient, an engine for searching time series data preferably would organize the index so that access to various time ranges, including less recent time ranges, is efficient. For example, unlike for many modern search engines, there may be significantly less benefit for a time series search engine to cache the top 1000 results for a particular keyword.


On the other hand, given the repetitive nature of time series data, opportunities for efficiency of index construction and search optimization are available. However, indexing time series data is further complicated because the data can be collected from multiple, different sources asynchronously and out of order. Streams of data from one source may be seconds old and data from another source may be interleaved with other sources or may be days, weeks, or months older than other sources. Moreover, data source times may not be in sync with each other, requiring adjustments in time offsets post indexing. Furthermore, time stamps can have an almost unlimited number of formats making identification and interpretation difficult. Time stamps within the data can be hard to locate, with no standard for location, format, or temporal granularity (e.g., day, hour, minute, second, sub-second).


Searching time series data typically involves the ability to restrict search results efficiently to specified time windows and other time-based metadata such as frequency, distribution of inter-arrival time, and total number of occurrences or class of result. Keyword-based searching is generally secondary in importance but can be powerful when combined with time-based search mechanisms. Searching time series data requires a whole new way to express searches. Search engines today allow users to search by the most frequently occurring terms or keywords within the data and generally have little notion of time based searching. Given the large volume and repetitive characteristics of time series data, users often need to start by narrowing the set of potential search results using time-based search mechanisms and then, through examination of the results, choose one or more keywords to add to their search parameters. Timeframes and time-based metadata like frequency, distribution, and likelihood of occurrence are especially important when searching time series data, but difficult to achieve with current search engine approaches. Try to find, for example, all stories referring to the “Space Shuttle” between the hours of 10 AM and 11 AM on May 10, 2005 or the average number of “Space Shuttle” stories per hour the same day with a Web-based search engine of news sites. With a focus on when data happens, time-based search mechanisms and queries can be useful for searching time series data.


Some existing limited applications of time-based search exist in specific small-scale domains. For example, e-mail search is available today in many mainstream email programs and web-based email services. However, searches are limited to simple time functions like before, after, or time ranges; the data sets are generally small scale and highly structured from a single domain; and the real-time indexing mechanisms are append only, usually requiring the rebuilding of the entire index to interleave new data.


Also unique to the cyclicality of time series data is the challenge of presenting useful results. Traditional search engines typically present results ranked by popularity and commonality. Contrary to this, for time series data, the ability to focus on data patterns and infrequently occurring, or unusual results may be important. To be useful, time series search results preferably would have the ability to be organized and presented by time-based patterns and behaviors. Users need the ability to see results at multiple levels of granularity (e.g., seconds, minutes, hours, days) and distribution (e.g., unexpected or least frequently occurring) and to view summary information reflecting patterns and behaviors across the result set. Existing search engines, on the other hand, generally return text results sorted by key word density, usage statistics, or links to or from documents and Web pages in attempts to display the most popular results first.


In one class of time series search engine, it would be desirable for the engine to index and allow for the searching of data in real-time. Any delay between the time data is collected and the time it is available to be searched is to be minimized. Enabling real-time operation against large, frequently changing data sets can be difficult with traditional large-scale search engines that optimize for small search response times at the expense of rapid data availability. For example, Web and document search engines typically start with a seed and crawl to collect data until a certain amount of time elapses or a collection size is reached. A snapshot of the collection is saved and an index is built, optimized, and stored. Frequently accessed indices are then loaded into a caching mechanism to optimize search response time. This process can take hours or even days to complete depending on the size of the data set and density of the index. Contrast this with a real-time time series indexing mechanism designed to minimize the time between when data is collected and when the data is available to be searched. The ability to insert, delete and reorganize indices, on the fly as data is collected, without rebuilding the index structure is essential to indexing time series data and providing real-time search results for this class of time series search engines.


Other software that is focused on time series, e.g., log event analyzers such as Sawmill or Google's Sawzall can provide real-time analysis capabilities but are not search engines per se because they do not provide for ad hoc searches. Reports must be defined and built in advance of any analysis. Additionally, no general keyword-based or time-based search mechanisms are available. Other streaming data research projects (including the Stanford Streams project and products from companies like StreamBase Systems) can also produce analysis and alerting of streaming data but do not provide any persistence of data, indexing, time-based, or keyword-based searching.


There exists, therefore, a need to develop other techniques for indexing, searching and presenting search results from time series data.


SUMMARY

Methods and apparatus consistent with the invention address these and other needs by allowing for the indexing, searching, and retrieval of time series data using a time series search engine (TSSE). In one implementation, one aspect of TSSEs is the use of time as a primary mechanism for indexing, searching, and/or presentation of search results. A time series search language (TSSL) specific to time-based search mechanisms is used to express searches in human readable form and results are presented using relevancy algorithms specific to time series data. Search expression and results presentation are based on key concepts important to searching time series data including but not limited to time windows, frequency, distribution, patterns of occurrences, and related time series data points from multiple, disparate sources.


In one aspect of the invention, multiple sources of time series data are organized and indexed for searching and results are presented upon user or machine initiated searches. In another aspect, a time series search engine (TSSE) includes four parts: (1) a time stamp process; (2) an indexing process; (3) a search process; and (4) a results presentation process.


In one aspect of the invention, a computer-implemented method for time searching data includes the following steps. Time series data streams are received. One example of time series data streams includes server logs and other types of machine data (i.e., data generated by machines). The time series data streams are time stamped to create time stamped events. The time stamped events are time indexed to create time bucketed indices, which are used to fulfill search requests. Time series search requests are executed, at least in part, by searching the time bucketed indices.


In certain implementations, time stamping the time series data streams includes aggregating the time series data streams into events and time stamping the events. For example, the events may be classified by domain and then time stamped according to their domain. In one approach, for events that are classified in a domain with a known time stamp format, the time stamp is extracted from the event. However, for events that are not classified in a domain with a known time stamp format, the time stamp is interpolated.


In another aspect of the invention, time bucketed indices are created by assigning the time stamped events to time buckets according to their time stamps. Different bucket policies can be used. For example, the time buckets may all have the same time duration, or may have different time durations. In addition, time buckets may be instantiated using a lazy allocation policy. The time stamped events may also be segmented, and the segments used to determine time bucket indices. Various forms of indexing, including hot indexing, warm indexing and speculative indexing, may also be used.


The creation of time bucket indices facilitates the execution of time series searches. In one approach, a time series search request is divided into different sub-searches for the affected time buckets, with each sub-search executed across the corresponding time bucket index.


Other aspects of the invention include software, computer systems and other devices corresponding to the methods described above, and applications for all of the foregoing.





BRIEF DESCRIPTION OF THE DRAWINGS

The invention has other advantages and features which will be more readily apparent from the following detailed description of the invention and the appended claims, when taken in conjunction with the accompanying drawings, in which:



FIG. 1 (prior art) is a diagram of time series data environments.



FIG. 2 is a diagram of a time series search engine according to the invention.



FIG. 3 is a diagram of a time stamp process suitable for use with the time series search engine of FIG. 2.



FIG. 4 is a diagram of an event aggregation process suitable for use with the time stamp process of FIG. 3.



FIG. 5 is a diagram of an indexing process suitable for use with the time series search engine of FIG. 2.



FIG. 6 is a diagram of a search process suitable for use with the time series search engine of FIG. 2.



FIG. 7 is a diagram of a results presentation process suitable for use with the time series search engine of FIG. 2.





The figures depict embodiments of the present invention for purposes of illustration only. One skilled in the art will readily recognize from the following discussion that alternative embodiments of the structures and methods illustrated herein may be employed without departing from the principles of the invention described herein.


DETAILED DESCRIPTION


FIG. 1 illustrates different examples of time series data environments with potentially large numbers of data sources and streams of time series data across multiple domains. In this figure, the first picture represents an information-processing environment with time series data from web servers, application servers, and databases in the form of server logs. The second picture is a typical market-trading environment with transactions between multiple buyers and sellers and between two or more markets. Time series data is generated in the form of transaction records representing, for example, the intention to trade or the final settlement of the trade. In the third picture, a real-time monitoring environment is depicted with multiple sensors producing time series data in the form of recorded measurements. All three of these environments are examples of potential applications for the TSSE.


Aspects of the invention will be described with respect to the first picture in FIG. 1, the information-processing environment, but the invention can also be used with other time series data environments and applications including the other environments shown in FIG. 1.



FIG. 2 illustrates one approach 200 to architecting a TSSE. Time series data streams 205 arrive synchronously or asynchronously from multiple sources, multiple searches 255 are expressed by users and/or other systems, and results sets 275 are presented through a variety of mechanisms including, for example, application programming interfaces and web-based user interfaces.


The arrival of time series data streams 205 at the TSSE 200 can be effected by having the TSSE gather them directly or by having a user-supplied script collect, preprocess, and deliver them to a default TSSE collection point. This architecture preferably tolerates data arriving late and temporally out of order. Currently, most sources of time series data are not designed for sophisticated processing of the data, so the TSSE typically will collect or be fed raw time series data that are close to their native form. The TSSE can be situated in different locations so long as it has access to the time series data. For example, one copy of the TSSE can be run on a single central computer or multiple copies can be configured in a peer-to-peer set-up with each copy working on the same time series data streams or different time series data streams.



FIG. 2 depicts an example TSSE 200 with four major processes: time stamp process 210, index process 220, search process 230 and presentation process 240. The time stamp process 210 turns raw time series data 205 into time stamped events 215 to be fed to the indexing process 220. Following our information processing example, raw logs 205 from multiple web servers, application servers and databases might be processed by the time stamp process 210 to identify individual events 215 within the various log formats and properly extract time and other event data. The event data 215 is used by the index process 220 to build time bucketed indices 225 of the events. These indices 225 are utilized by the search process 230 which takes searches 255 from users or systems, decomposes the searches, and then executes a search across a set of indices.


For example, a user might want to locate all the events from a particular web server and a particular application server occurring within the last hour and which contain a specific IP address. In addition, the search process 230 may choose to initiate the creation of meta events 237 at search time to handle time-based and statistical summary indices useful in searching through repetitive, temporal data. For example, meta events 237 may represent averages, means, or counts of actual events or more sophisticated pattern based behavior. In this case a user might want to search to find all the events occurring with a frequency of three per minute.


Upon completion, the search process 230 hands results from the selected indices 235 to the presentation process 240 which merges result sets, ranks results, and feeds the results 275 to an API or user interface for presentation.


Time Stamp Process


The purpose of process 210, shown in FIG. 2 as part of an exemplary implementation 200 of a TSSE, is to acquire streaming time series data, identify individual events within the stream, and assign a time stamp to each event. An example block diagram of the time stamp process 210 is shown in FIG. 3 and includes several steps: event aggregation 310, domain identification 320, time extraction 330, and time interpolation 340. Time series data streams 205 are received as input to the time stamp process 210 and then processed into individual time stamped events 215.


Event Aggregation


Step 310 in the time stamp process 210 of FIG. 3 aggregates the streaming time series data 205 into individual events 315. In our information-processing example, web server time series data streams may have a single line per event and be easy to identify. However, an application server time series data stream may contain single events with a large number of lines making identification of individual events within the stream difficult.


In one implementation, event aggregation 310 uses feature extraction (e.g., leading punctuation, significant words, white space, and breaking characters) and machine learning algorithms to determine where the event boundaries are. FIG. 4 is a diagram of an event aggregation process suitable for use with the time stamp process of FIG. 3.


Source Identification-Classification into Domains


Given the repetitive, yet dynamic, nature of the time series data 205 in our information processing example (which data will be referred to as machine data 205 or MD 205), an effective aggregation process 310 (such as shown in FIG. 4) preferably will learn about data formats and structure automatically. In one implementation, learning is separated into different domains based on the source of MD 205. Domains can be general system types, such as log files, message bus traffic, and network management data, or specific types, such as output of a given application or technology—Sendmail logging data, Oracle database audit data, and J2EE messaging.


In this example event aggregation process 310, the domain for a given source of MD is identified 415 so that domain specific organization methods can be applied. Domains are determined through a learning process. The learning process uses collections of MD from well-known domains as input and creates a source signature 412 for each domain. In one implementation, source signatures 412 are generated from representative samples of MD 205 by creating a hash table mapping punctuation characters to their frequency. While tokens and token values can change in MD collection, in this particular implementation, the signature 412 generated by the frequency of punctuation is quite stable, and reliable within a specific domain. Other implementations could use functions of the punctuation and tokens, such as the frequencies of the first punctuation character on a line, or the first capitalized term on a line. Given that source signatures 412 can be large and hard to read, signatures can have a corresponding label in the form of a number or text that can be machine generated or human assigned. For example, the source signature 412 for an Apache web server log might be programmatically assigned the label “205”, or a user can assign the label “Apache Server Log”.
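The punctuation-frequency signature described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `source_signature` is hypothetical, Python's `string.punctuation` is assumed to define the punctuation set, and the frequencies are normalized to relative frequencies so that samples of different sizes are comparable (the text does not say whether raw counts or relative frequencies are used).

```python
from collections import Counter
import string


def source_signature(sample_lines):
    """Build a hash table mapping punctuation characters to their frequency.

    Assumption: frequencies are relative (they sum to 1.0) rather than raw counts.
    """
    counts = Counter(
        ch for line in sample_lines for ch in line if ch in string.punctuation
    )
    total = sum(counts.values())
    if total == 0:
        return {}  # sample contains no punctuation at all
    return {ch: n / total for ch, n in counts.items()}


# Hypothetical sample of machine data from one source
signature = source_signature(["10:29:03 Host up", "10:29:04 Host 191.168.0.1 down"])
```

Because the signature ignores letters and digits, it tends to stay the same when token values such as host names or counters change from one collection to the next.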


In one embodiment, clustering is used to classify 415 collected MD 205 into domains according to their source signatures 412. As collections of MD 205 are encountered, each collection's signature is matched to the set of known source signatures 412 by performing a nearest-neighbor search. If the distance of the closest matching signature 412 is within a threshold, the closest matching signature 420's domain is assumed to be the domain of the source. If no best match can be found, a new source signature 412 can be created from the sample signature and a new source domain created. Alternatively, a default source domain can be used. In one implementation, the distance between two signatures is calculated by iterating over the union of attributes of the two signatures, with the total signature distance being the average of distances for each attribute. For each attribute A, the value of A on Signature1 and Signature2, V1 and V2, are compared and a distance is calculated. The distance for attribute A is the square of (V1−V2)*IDF, where IDF is the log(N/|A|), where N is the number of signatures, and |A| is the number of signatures with attribute A.
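A minimal sketch of the distance calculation and nearest-neighbor match described above follows. Several points are assumptions made for illustration: signatures are dictionaries mapping attributes to values, an attribute missing from a signature has the value 0, the phrase "the square of (V1−V2)*IDF" is read as ((V1−V2)·IDF)², and N and |A| are counted over the known signatures plus the sample being classified (which avoids division by zero for attributes seen only in the sample). The names `signature_distance` and `classify_source` are hypothetical.

```python
import math


def signature_distance(sig1, sig2, population):
    """Average per-attribute distance over the union of attributes."""
    attributes = set(sig1) | set(sig2)
    if not attributes:
        return 0.0
    n = len(population)
    total = 0.0
    for attr in attributes:
        # |A|: number of signatures in the population that have this attribute
        doc_freq = sum(1 for sig in population if attr in sig)
        idf = math.log(n / max(doc_freq, 1))
        diff = sig1.get(attr, 0.0) - sig2.get(attr, 0.0)
        total += (diff * idf) ** 2  # one reading of "square of (V1-V2)*IDF"
    return total / len(attributes)


def classify_source(sample_sig, known_signatures, threshold):
    """Return the label of the nearest known signature, or None if too far."""
    population = list(known_signatures.values()) + [sample_sig]
    best_label, best_distance = None, float("inf")
    for label, sig in known_signatures.items():
        distance = signature_distance(sample_sig, sig, population)
        if distance < best_distance:
            best_label, best_distance = label, distance
    return best_label if best_distance <= threshold else None
```

A return value of None corresponds to the case in which no match is close enough, where a new source domain could be created or a default domain used. Note that under this formula an attribute present in every signature has an IDF of zero and therefore does not contribute to the distance.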


Source Identification—Classification as Text/Binary


Some MD 205 sources are non-textual or binary and cannot be easily processed unless a known process is available to convert the binary MD into textual form. To classify a source as textual or binary, a sample MD collection is analyzed. Textual MD can also have embedded binary MD, such as a memory dump, and the classification preferably identifies it as such. In one implementation, the textual/binary classification works as follows. The sample is a set of lines of data, where a line is defined as the data between new lines (i.e., ‘\n’), carriage returns (i.e., ‘\r’), or their combination (i.e., ‘\r\n’). For each line, if the line's length is larger than some large threshold, such as 2 k characters, or if the line contains a character with an ASCII value of zero (0), a count of Binary-looking lines is incremented. Otherwise, if the line's length is shorter than a length that one would expect most text lines to be below, such as 256 characters, a count of Text-looking lines is incremented. If the number of Text-looking lines is twice as numerous as the Binary-looking lines (other ratios can be used depending on the context), the source is classified as text. Otherwise, the source is classified as binary.
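The text/binary heuristic above can be sketched as follows. This is an illustrative reading only: the function name and parameter names are hypothetical, the sample is assumed to be available as a Python string, "2 k characters" is taken as 2048, and "twice as numerous" is interpreted as "at least twice as many".

```python
import re


def classify_text_or_binary(sample, long_line=2048, short_line=256, ratio=2.0):
    """Classify a sample of machine data as 'text' or 'binary'."""
    # A line is the data between '\n', '\r', or '\r\n'.
    lines = re.split(r"\r\n|\n|\r", sample)
    binary_looking = 0
    text_looking = 0
    for line in lines:
        if len(line) > long_line or "\x00" in line:
            binary_looking += 1  # very long line or embedded NUL character
        elif len(line) < short_line:
            text_looking += 1  # short line, as most text lines are
    # Assumption: "twice as numerous" means at least `ratio` times as many.
    return "text" if text_looking >= ratio * binary_looking else "binary"
```

Lines whose length falls between the two thresholds count toward neither total, which matches the description above.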


Aggregation of Machine Data into Raw Events


When the source signature 420 for a collection of MD has been identified 415, the corresponding aggregation rules are applied 425 to the MD collection. Aggregation rules describe the manner in which MD 205, from a particular domain, is organized 425 into event data 315 by identifying the boundaries of events within a collection of MD, for example, how to locate a discrete event by finding its beginning and ending. In one implementation, the method of aggregation 425 learns, without prior knowledge, by grouping together multiple lines from a sample of MD 205. Often MD 205 contains events 315 that are anywhere from one to hundreds of lines long that are somehow logically grouped together.


The MD collection may be known a priori, or may be classified, as single-line type (i.e., containing only single-line events) or multi-line type (i.e., possibly containing multi-line events) prior to performing aggregation. For those MD collections that are classified as single-line type, aggregation 425 is simple: single-line type MD collections are broken on each line, with each line treated as a separate event. Multi-line type MD collections are processed 425 for aggregation. In one implementation, a MD collection is classified as a multi-line type if 1) there is a large percentage of lines that start with spaces or are blank (e.g., if more than 5% of the lines start with spaces or are blank), or 2) there are too many varieties of punctuation characters in the first N punctuation characters. For example, if the set of the first three punctuation characters found on each line has more than five patterns (e.g., ‘:::’, ‘!:!’, ‘,,,’, ‘:..’, ‘( )*’), the collection might be classified as multi-line.
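The two multi-line tests above can be sketched as follows. The function name `is_multiline` and its parameters are hypothetical; the sketch also assumes that a leading tab counts as starting "with spaces" and that Python's `string.punctuation` defines the punctuation set.

```python
import string


def is_multiline(lines, blank_fraction=0.05, n_punct=3, max_patterns=5):
    """Heuristically decide whether a collection may contain multi-line events."""
    if not lines:
        return False
    # Test 1: a large share of lines are blank or start with white space.
    indented_or_blank = sum(
        1 for line in lines if line.strip() == "" or line[:1].isspace()
    )
    if indented_or_blank / len(lines) > blank_fraction:
        return True
    # Test 2: too many distinct patterns among the first N punctuation characters.
    patterns = set()
    for line in lines:
        first_punct = [ch for ch in line if ch in string.punctuation][:n_punct]
        patterns.add("".join(first_punct))
    return len(patterns) > max_patterns
```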


Another aspect of aggregation methods 425 is the ability to learn, and codify into rules, what constitutes a break between lines and therefore the boundary between events, by analyzing a sample of MD. For example, in one implementation, an aggregation method 425 compares every two-line pair looking for statistically similar structures (e.g., use of white space, indentation, and time-stamps) to quickly learn which two belong together and which two are independent. In one implementation, aggregation 425 works as follows. For each line, first check if the line starts with a time-stamp. If so, then break. Typically, lines starting with a time-stamp are the start of a new event. For lines that do not start with a time-stamp, combine the current line with the prior line to see how often the pair of lines occurs, one before the other, as a percentage of total pairs in the MD sample. Line signatures are used in place of lines, where a line signature is a more stable version of a line, immune to simple numeric and textual changes. In this implementation, signatures can be created by converting a line into a string that is the concatenation of leading white space, any punctuation on the line, and the first word on the line. The line “10:29:03 Host 191.168.0.1 rebooting:normally” is converted to “::...:Host.”
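A line signature of the kind described above can be sketched as follows. The function name `line_signature` is hypothetical, and the sketch assumes that "the first word" means the first run of ASCII letters on the line (so a leading time such as 10:29:03 is not treated as a word) and that Python's `string.punctuation` defines the punctuation set.

```python
import re
import string


def line_signature(line):
    """Concatenate leading white space, all punctuation, and the first word."""
    leading_ws = line[: len(line) - len(line.lstrip())]
    punctuation = "".join(ch for ch in line if ch in string.punctuation)
    match = re.search(r"[A-Za-z]+", line)  # assumption: a word is a run of letters
    first_word = match.group(0) if match else ""
    return leading_ws + punctuation + first_word


# Two lines that differ only in numeric values share one signature.
sig_a = line_signature("10:29:03 Host 191.168.0.1 rebooting:normally")
sig_b = line_signature("11:02:59 Host 10.0.0.7 rebooting:normally")
```

Because digits and all words after the first are discarded, the signature is unaffected by changes to times, addresses, and most message text.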


Now this current line signature can be concatenated with the previous line signature (i.e., signature1 combined with signature2) and used as a combined key into a table of break rules. The break rule table maps the combined key to a break rule, which determines whether there should be a ‘break’, or not, between the two lines (i.e., whether they are part of different events or not). Break rules can have confidence levels, and a more confident rule can override a less confident rule. Break rules can be created automatically by analyzing the co-occurrence data of the two lines and what percent of the time their signatures occur adjacently. If the two line signatures highly co-occur, a new rule would recommend no break between them. Alternatively, if they rarely co-occur, a new rule would recommend a break between them. For example, if line signature A is followed by line signature B greater than 20% of the time A is seen, then a break rule might be created to recommend no break between them. Rules can also be created based on the raw number of line signatures that follow/precede another line signature. For example, if a line signature is followed by say, ten different line signatures, create a rule that recommends a break between them. If there is no break rule in the break rule table, the default behavior is to break and assume the two lines are from different events. Processing proceeds by processing each two-line pair, updating line signature and co-occurrence statistics, and applying and learning corresponding break rules. At regular intervals, the break rule table is written out to the hard disk or permanent storage.
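The break rule table can be sketched as a dictionary keyed by pairs of adjacent line signatures. This is a simplified illustration with hypothetical names: it learns rules only from the 20% co-occurrence example given above, it omits confidence levels and the rule based on the number of distinct following signatures, and it takes the "starts with a time-stamp" check as a caller-supplied function `starts_event`.

```python
from collections import Counter


def learn_break_rules(signatures, no_break_ratio=0.20):
    """Map (previous_signature, current_signature) to True (break) or False."""
    pair_counts = Counter(zip(signatures, signatures[1:]))
    seen_as_previous = Counter(signatures[:-1])
    rules = {}
    for (prev, cur), count in pair_counts.items():
        # If cur follows prev more than 20% of the time prev is seen, do not break.
        rules[(prev, cur)] = not (count / seen_as_previous[prev] > no_break_ratio)
    return rules


def aggregate(lines, signatures, rules, starts_event=lambda line: False):
    """Group lines into events using the break rule table."""
    events = []
    for i, line in enumerate(lines):
        if i == 0 or starts_event(line):
            events.append([line])  # a time-stamped line starts a new event
            continue
        # Default when no rule exists: break, assuming different events.
        should_break = rules.get((signatures[i - 1], signatures[i]), True)
        if should_break:
            events.append([line])
        else:
            events[-1].append(line)
    return events
```

In this sketch the time-stamp check is applied before the table lookup, mirroring the order given earlier, so a repeated header line still starts a new event even when its signature frequently follows the preceding one.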


Time Stamp Identification


Once the incoming time series stream 205 has been aggregated 310 into individual events 315, the events and their event data are input into a time stamp identification step 320 which determines whether or not the time series event data contains tokens that indicate a match to one of a collection of known time stamp formats. If so, the event is considered to have a time stamp from a known domain and extraction 330 is performed. Otherwise, interpolation 340 is performed.


Time Stamp Extraction


If a known domain has been identified for an event, the event 315 is taken as input to a time stamp extraction step 330 where the time stamp from the raw event data is extracted and passed with the event to the indexing process 220. In an exemplary implementation, this timestamp extraction 330 occurs by iterating over potential time stamp format patterns from a dynamically ordered list in order to extract a time to be recorded as the number of seconds that have passed since the Unix epoch (0 seconds, 0 minutes, 0 hour, Jan. 1, 1970 coordinated universal time) not including leap seconds. Additionally, the implementation takes into account time zone information and normalizes the times to a common offset. To increase performance, the ordering of this list is determined using a move-to-front algorithm, wherein whenever a match is found the matching pattern is moved to the beginning of the list. In such an implementation, the most frequently occurring patterns are checked earliest and most often, improving performance. The move-to-front lists may be maintained either for all time series data sources together, on a per-source basis (to take advantage of the fact that the formats in a single source are likely to be similar), or in some other arrangement.
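The move-to-front ordering described above can be sketched as follows. The class name, the two example patterns, and the choice to treat time stamps without zone information as UTC are assumptions made for illustration; a real deployment would carry a much longer pattern list.

```python
import re
from datetime import datetime, timezone


class TimestampExtractor:
    """Try known time stamp patterns in order; move a matching pattern to the front."""

    def __init__(self, patterns):
        # Each entry is (compiled regular expression, strptime format string).
        self.patterns = list(patterns)

    def extract(self, event):
        """Return seconds since the Unix epoch, or None if no pattern matches."""
        for i, (regex, fmt) in enumerate(self.patterns):
            match = regex.search(event)
            if not match:
                continue
            try:
                parsed = datetime.strptime(match.group(0), fmt)
            except ValueError:
                continue  # looked like a time stamp but did not parse
            if parsed.tzinfo is None:
                # Assumption: time stamps without zone information are UTC.
                parsed = parsed.replace(tzinfo=timezone.utc)
            self.patterns.insert(0, self.patterns.pop(i))  # move to front
            return parsed.timestamp()
        return None


# Hypothetical pattern list: an ISO-like format and an Apache-style format.
extractor = TimestampExtractor([
    (re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}"), "%Y-%m-%d %H:%M:%S"),
    (re.compile(r"\d{2}/[A-Za-z]{3}/\d{4}:\d{2}:\d{2}:\d{2} [+-]\d{4}"),
     "%d/%b/%Y:%H:%M:%S %z"),
])
```

Keeping one extractor per source, rather than one shared extractor, corresponds to the per-source arrangement of move-to-front lists mentioned above.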


Time Stamp Interpolation


In the case where the event did not contain a time stamp from a known domain, then a timestamp is assigned to the event based on its context. In one implementation, the time stamp is linearly interpolated 340 from the time stamps of the immediately preceding and immediately following events 315 from the same time series data stream. If these events also contain no time stamps from a known domain, further earlier and/or later events can be used for the interpolation. The time stamp extraction module 330 automatically stores the time stamp of every hundredth event (or some other configurable period) from each time series data stream in order to facilitate time stamp interpolation 340. In another implementation, time stamps are interpolated 340 based on the time associated with the entire time series data stream 205 including acquisition time, creation time or other contextual meta time data.
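Linear interpolation between neighboring events can be sketched as follows. The sketch assumes the events of one stream are held in arrival order, with None marking events that lack a known time stamp, and that interpolation is linear in event position. When an event has known neighbors on only one side, the nearest known time stamp is copied; that fallback is an assumption, not something stated above.

```python
def interpolate_timestamps(timestamps):
    """Fill None entries by linear interpolation between known neighbors."""
    result = list(timestamps)
    known = [i for i, value in enumerate(timestamps) if value is not None]
    if not known:
        return result  # nothing to interpolate from
    for i, value in enumerate(timestamps):
        if value is not None:
            continue
        before = [k for k in known if k < i]
        after = [k for k in known if k > i]
        if before and after:
            lo, hi = before[-1], after[0]
            fraction = (i - lo) / (hi - lo)
            result[i] = timestamps[lo] + fraction * (timestamps[hi] - timestamps[lo])
        elif before:
            result[i] = timestamps[before[-1]]  # assumption: copy nearest earlier
        else:
            result[i] = timestamps[after[0]]  # assumption: copy nearest later
    return result
```

Searching outward for the nearest known time stamps also covers the case in which the immediately adjacent events have no time stamp either.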


Indexing Process


Returning to FIG. 2, in the indexing process 220, indexes are created based on incoming event data 215. The indexing process 220 organizes and optimizes the set of indices in an online fashion as they are extended with more events. An example TSSE indexing process 220 is shown in FIG. 5 and includes, in one implementation, several steps including bucketing 510, segmenting 520, archival 530, allocation 540, insertion 550, committing to secondary storage 560, merging buckets in secondary storage 570, and expiring buckets in secondary storage 580.


Time Bucketing


Events indexed by the TSSE are often queried, updated, and expired using time-based operators. By hashing the components of the index over a set of buckets organized by time, the efficiency and performance of these operators can be significantly improved. The final efficiency of the bucketing will, of course, depend on the hardware configuration, the order in which the events arrive, and how they are queried, so there is not a single perfect bucketing policy.


In one implementation, buckets with a uniform extent are used. For example, each time bucket can handle one hour's worth of data. Alternate policies might vary the bucket extents from one time period to another. For example, a bucketing policy may specify that the buckets for events from earlier than today are three hour buckets, but that the buckets for events occurring during the last 24 hours are hashed by the hour. In the information processing example, a bucket might cover the period Jan. 15, 2005 12:00:00 to Jan. 15, 2005 14:59:59. In order to improve efficiency further, buckets are instantiated using a lazy allocation policy (i.e., as late as possible) in primary memory (i.e., RAM). In-memory buckets have a maximum capacity and, when they reach their limit, they will be committed to disk and replaced by a new bucket. Bucket storage size is another element of the bucketing policy and varies along with the size of the temporal extent. Finally, bucket policies typically enforce that buckets (a) do not overlap, and (b) cover all possible incoming time stamps.


Step 510 in indexing an event by time is to identify the appropriate bucket for the event based on the event's time stamp and the index's bucketing policy. Each incoming event 215 is assigned 510 to the time bucket where the time stamp from the event matches the bucket's temporal criteria. In one implementation, we use half-open intervals, defined by a start time and an end time where the start time is an inclusive boundary and the end time is an exclusive boundary. We do this so that events occurring on bucket boundaries are uniquely assigned to a bucket. Following our example in the information processing environment, a database server event with the time stamp of Jan. 15, 2005 12:00:01 might be assigned to the above-mentioned bucket.
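The half-open bucket assignment described above might be sketched as follows, assuming a uniform bucket extent aligned to the Unix epoch. The function name `bucket_for` and the epoch alignment are assumptions of the example.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def bucket_for(timestamp, extent=timedelta(hours=1)):
    """Return the half-open interval [start, end) containing `timestamp`."""
    index = (timestamp - EPOCH) // extent   # timedelta // timedelta -> int
    start = EPOCH + index * extent
    return start, start + extent
```

With a three-hour extent, an event stamped Jan. 15, 2005 12:00:01 falls in the bucket starting at 12:00:00 and ending (exclusively) at 15:00:00, while an event stamped exactly 15:00:00 falls in the next bucket.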


Segmentation


Once an appropriate bucket has been identified 510 for an event, the raw event data is segmented 520. A segment (also known as a token) is a substring of the incoming event text and a segmentation 520 is the collection of segments implied by the segmentation algorithm on the incoming event data. A segment substring may overlap another substring, but if it does, it must be contained entirely within that substring. We allow this property to apply recursively to the containing substring, so that the segment hierarchy forms a tree on the incoming text.


In one implementation, segmentation 520 is performed by choosing two mutually exclusive sets of characters called minor breakers and major breakers. Whenever a breaking character, minor or major, is encountered during segmentation of the raw data, segments are emitted corresponding to any sequence of bytes that has at least one major breaker on one end of the sequence. For example, if, during segmentation, a minor breaking character is found, then a segment corresponding to the sequence of characters leading from the currently encountered minor breaker back to the last major breaker encountered is recorded. If a major breaker was encountered, then the sequence of characters leading back to either the last major breaker or the last minor breaker, whichever occurred most recently, determines the next segment to be recorded.


Segmentation 520 rules describe how to divide event data into segments 525 (also known as tokens). In one implementation, a segmentation rule examines possible separators or punctuation within the event, for example, commas, spaces or semicolons. An important aspect of segmentation is the ability to not only identify individual segments 525, but also to identify overlapping segments. For example, the text of an email address, “bob.smith@corp.com”, can be broken into individual and overlapping segments: <bob.smith>, <@> and <corp.com> can be identified as individual segments, and <<bob.smith><@><corp.com>> can also be identified as an overlapping segment. As described above, in one implementation, segmentation 520 uses a two-tier system of major and minor breaks. Major breaks are separators or punctuation that bound the outermost segment 525. Examples include spaces, tabs, and new lines. Minor breaks are separators or punctuation that break larger segments into subsegments, for example periods, commas, and equal signs. In one implementation, more complex separators and punctuation combinations are used to handle complex segmentation tasks 520, for example handling Java exceptions in an application server log file.


As an example of segmentation in our information-processing example, IP addresses could be broken down using white space as major breakers and periods as minor breakers. Thus, the segments for the raw text “192.168.1.1” could be:

    • “192”
    • “192.168”
    • “192.168.1”
    • “192.168.1.1”
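A minimal Python sketch that reproduces the four segments listed above is given below. The breaker sets `MAJOR` and `MINOR` and the function name `segment` are assumptions of the example; other segments that the major/minor rule permits (such as a trailing piece bounded by a minor breaker and a major breaker) are left out for brevity.

```python
MAJOR = set(" \t\n")   # assumed major breakers: white space
MINOR = set(".,=")     # assumed minor breakers


def segment(text):
    """Emit, at every breaker, the text running back to the last major breaker."""
    segments = []
    start = 0  # index just after the most recent major breaker
    # A trailing space acts as a sentinel major breaker that flushes the tail.
    for pos, ch in enumerate(text + " "):
        if ch in MINOR or ch in MAJOR:
            if pos > start:
                segments.append(text[start:pos])
            if ch in MAJOR:
                start = pos + 1
    return segments
```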


In another implementation, certain segments may represent known entities that can be labeled and further understood algorithmically or by human added semantics. For example, in the above representation, “192.168.1.1” may be understood to be an IP address. Named entity extraction can be algorithmically performed in a number of ways. In one implementation, the segment values or segment form from the same segment across multiple events is compared to an entity dictionary of known values or known forms.


In another implementation, entity extraction techniques are used to identify semantic entities within the data. In one implementation, search trees or regular expressions can be applied to extract and validate, for example, IP addresses or email addresses. The goal of extraction is to assist the segmentation process 520 and provide semantic value to the data.
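For instance, extraction and validation of IP addresses with a regular expression might look like the following sketch, which uses Python's standard `re` and `ipaddress` modules. The function name `extract_ipv4` and the candidate pattern are assumptions of the example.

```python
import ipaddress
import re

# Candidate dotted quads; validation happens separately below.
CANDIDATE = re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b")


def extract_ipv4(text):
    """Return the substrings of `text` that are valid IPv4 addresses."""
    found = []
    for match in CANDIDATE.finditer(text):
        try:
            ipaddress.IPv4Address(match.group(0))
        except ValueError:
            continue  # e.g. an octet greater than 255
        found.append(match.group(0))
    return found
```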


Archiving and Indexing Events


At this point in the process, incoming events have time stamps 215, segments 525, and a time bucket 515 associated with them. To create the persistent data structures that will be used later to perform lookups in the search process, we store the raw data of the event with its segmentation, create indices that map segments and time stamps to offsets in the event data store, and compute and store metadata related to the indices.


Because the TSSE tolerates, in near real time, both the arrival of new events and new searches, the system preferably is careful in managing access to disk. For the indexes, this is accomplished by splitting index creation into two separate phases: hot indexing and warm indexing. Hot indexes are managed entirely in RAM, are optimized for the smallest possible insert time, are not searchable, and do not persist. “Warm” indexes are searchable and persistent, but immutable. When hot indexes need to be made searchable or need to be persistent, they are converted into warm indexes.


In the implementation shown in FIG. 5, a hot index 555 contains a packed array of segments, a packed array of event addresses and their associated time stamps, and a postings list that associates segments with their time stamped event addresses. For performance reasons, the packed arrays can have hash tables associated with them to provide for quick removal of duplicates. When incoming events are being indexed, each segment of the event is tested for duplication using the segment array and its associated hash. The event address is also tested for duplication, against the event address array and its associated hash. If either of the attributes is a duplicate, then the instance of that duplicate that has already been inserted into the packed array is used. Otherwise, the new segment or event address is copied into the appropriate table 550 and the associated hash table is updated. As events are inserted into the hot index, the space associated with each of the packed arrays gets used. A hot slice is considered to be “at capacity” when one of its packed arrays fills up or when one of its hash tables exceeds a usage threshold (e.g., if more than half of the hash table is in use). Once a hot index reaches capacity 540, it cannot accept more segments for indexing. Instead it is converted to a warm index, committed to disk 560, and replaced with a new empty hot index.
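The duplicate testing and capacity check described above might be sketched as follows. The class name `HotIndex`, its attribute names, and the simple element-count capacity are assumptions of the example; Python lists and dictionaries stand in for the packed arrays and their hash tables.

```python
class HotIndex:
    """In-memory index: packed arrays plus hash tables for duplicate removal."""

    def __init__(self, capacity=1024):
        self.capacity = capacity
        self.segments = []      # packed array of segments
        self.segment_ids = {}   # hash: segment -> position in `segments`
        self.events = []        # packed array of (address, time stamp)
        self.event_ids = {}     # hash: (address, time stamp) -> position
        self.postings = {}      # segment id -> list of event ids

    def at_capacity(self):
        return (len(self.segments) >= self.capacity
                or len(self.events) >= self.capacity)

    @staticmethod
    def _intern(value, array, table):
        """Reuse an existing entry if present, else append a new one."""
        if value not in table:
            table[value] = len(array)
            array.append(value)
        return table[value]

    def insert(self, address, timestamp, segments):
        if self.at_capacity():
            raise OverflowError("hot index full; convert to a warm index")
        event_id = self._intern((address, timestamp), self.events, self.event_ids)
        for seg in segments:
            seg_id = self._intern(seg, self.segments, self.segment_ids)
            plist = self.postings.setdefault(seg_id, [])
            if event_id not in plist:
                plist.append(event_id)
```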


Another feature of this particular system is speculative indexing. Based on earlier indexing processes, new time buckets can be initialized using all or part of a representative, completed bucket as an exemplar. In other words, by keeping around copies of data that may reasonably be expected to occur in a time bucket, we can improve indexing performance by speculatively initializing parts of the hot index. In one embodiment, the speculative indexing is performed by copying the packed array of segments and its associated hash table from an earlier hot index. The hot slice is then populated as usual with the exception that the segment array is already populated and ready for duplicate testing. Because of the highly regular language and limited vocabulary of machines, the hit rate associated with this speculation can be very good.


The searching process (as described in the next section) allows the user to search on segments, segment prefixes, and segment suffixes. To accommodate these search types, in one implementation, the segments array can be sorted and then stored as a blocked front coded lexicon (hereafter called “the forward lexicon”). This data structure makes it possible to perform segment and segment prefix lookups efficiently while still achieving a reasonable amount of compression of the segment text. When a search is being performed on a particular segment, the offset of the segment in the forward lexicon is used as an efficient way to look up metadata associated with the queried-for segment in other associated tables.


To handle suffix lookups, a blocked front coded lexicon can be created on the same collection of segments after they have been string-reversed (hereafter called “the reverse lexicon”). Also, a map is populated that converts the offset of a reversed segment in the reverse lexicon to the equivalent non-reversed segment's offset in the forward lexicon (hereafter called “the reverse-forward map”). When performing suffix lookups, the offset in the reverse lexicon is used as an offset into the reverse-forward map. The value stored at that position in the map is the appropriate offset to use for the other metadata arrays in the warm index.
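The relationship between the forward lexicon, the reverse lexicon, and the reverse-forward map can be sketched as below. For brevity the sketch uses plain sorted lists and binary search rather than blocked front coding, so it illustrates only the lookup logic and not the compression; all function names are assumptions of the example.

```python
from bisect import bisect_left


def build_lexicons(segments):
    """Return (forward lexicon, reverse lexicon, reverse-forward map)."""
    forward = sorted(set(segments))
    reverse = sorted(s[::-1] for s in forward)
    position = {s: i for i, s in enumerate(forward)}
    # reverse_forward[i] is the forward offset of the segment at reverse[i]
    reverse_forward = [position[r[::-1]] for r in reverse]
    return forward, reverse, reverse_forward


def _prefix_range(lexicon, prefix):
    lo = bisect_left(lexicon, prefix)
    hi = lo
    while hi < len(lexicon) and lexicon[hi].startswith(prefix):
        hi += 1
    return range(lo, hi)


def prefix_lookup(forward, prefix):
    """Forward-lexicon offsets of segments starting with `prefix`."""
    return list(_prefix_range(forward, prefix))


def suffix_lookup(reverse, reverse_forward, suffix):
    """Forward-lexicon offsets of segments ending with `suffix`."""
    hits = _prefix_range(reverse, suffix[::-1])
    return sorted(reverse_forward[i] for i in hits)
```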


The warm index provides a list of event offsets for each segment indexed, preferably in an efficient manner. In one implementation, this can be done by maintaining an array of compressed postings lists and an associated array of offsets to the beginning of each of those compressed postings lists. The postings lists are maintained in segment offset order, so when a lookup is performed, the segment ID can be used to find the appropriate entry of the postings lists offsets array. The values in the postings lists entries are the offsets that should be used to look up events in the packed array of event addresses.


Finally, statistical metadata can be provided for each indexed segment (e.g., the first and last time of occurrence of the segment, the mean inter-arrival time, and the standard deviation of the inter-arrival time).
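As an illustration, the statistics named above might be computed for one segment as follows. The function name `segment_stats` is hypothetical, and the use of the population (rather than sample) standard deviation is an assumption of the example.

```python
from statistics import mean, pstdev


def segment_stats(times):
    """Summarize the occurrence times (epoch seconds) of one segment."""
    ordered = sorted(times)
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]  # inter-arrival times
    return {
        "first": ordered[0],
        "last": ordered[-1],
        "mean_gap": mean(gaps) if gaps else None,
        "stdev_gap": pstdev(gaps) if gaps else None,
    }
```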


During the course of the indexing process, it is possible that a single time bucket will be filled and committed to disk 560 several times. This will result in multiple, independently searchable indices in secondary storage for a single time span. In an exemplary implementation, there is a merging process 570 that takes as input two or more warm indices and merges them into a single warm index for that time bucket. This is a performance optimization and is not strictly required for searching.


Expiring Events


Furthermore, over a long period of time, it is possible that applying the indexing process 220 to time series data will cause a large amount of persistent data to accumulate. The indexing process, therefore, preferably contains an expiration process 580 that monitors the database for time buckets to be deleted based on user-provided preferences. In one implementation, these preferences might include a trailing time window (“events older than 3 months need not be returned in search results”), a time range (“events earlier than January 1 of this year need not be returned in search results”), a maximum number of events (“no more than 1 million events need be returned in search results”), or a maximum total size for the index (“return as many useful search results as possible while consuming no more than 100 GB of disk”). A process periodically wakes up and tests the collection of warm slices for any slices that meet the expiration criterion. Upon expiration, a warm index file and its associated raw event data and segmentation are moved out of the active index. The index file need not necessarily be deleted. In one implementation, the index file could be streamed to less expensive offline storage.
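Two of the preferences above, a trailing time window and a maximum total index size, might be tested by the periodic process along the following lines. The function name `expired_buckets` and the dictionary keys `name`, `end`, and `size` are assumptions of the example.

```python
def expired_buckets(buckets, now, max_age=None, max_total_bytes=None):
    """Return the names of buckets that meet an expiration criterion.

    Each bucket is a dict with 'name', 'end' (epoch seconds) and 'size'
    (bytes). Buckets are examined newest first so that, under a size
    budget, the oldest data is the first to expire.
    """
    expired = []
    total = 0
    for bucket in sorted(buckets, key=lambda b: b["end"], reverse=True):
        total += bucket["size"]
        too_old = max_age is not None and now - bucket["end"] > max_age
        too_big = max_total_bytes is not None and total > max_total_bytes
        if too_old or too_big:
            expired.append(bucket["name"])
    return expired
```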


Search Process


An example TSSE search process is shown in FIG. 6 and includes several methods for parsing 610 a search phrase, issuing multiple sub-searches 625 in order to satisfy the incoming parent search, using sub-search results 635 to prune searches, and merging 640 sub-search results into a final set of search results for presentation to the user.


Time Series Search Language


During search processing, incoming search phrases 255 are parsed 610 according to a time series search language (TSSL) in order to generate annotated parse trees 615. An exemplary TSSL language syntax includes a series of modifiers or commands taking the format name::value. Some modifiers may have default values and some can only be used once, while some can appear several times in the same search with different values. Examples include the following:

    • average::value—calculate the average number of events using the value time frame.
    • page::value—present search results by value. Value can be seconds, minutes, hours, days, weeks or months or any other metadata element, for example, source or event type.
    • count::—calculate the total number of events.
    • daysago::value—search for events within the last value days.
    • index::value—the index to search: main, default, history, or another index defined by the TSSE.
    • hoursago::value—search for events within the last value hours.
    • eventtype::value—search for events with an event type or tag that matches the specified value.
    • host::value—search for events whose hostname was set to the specified value. This is the host that logged the event, not necessarily the host that generated the event.
    • maxresults::value—the maximum number of results to return.
    • minutesago::value—search for events within the last value minutes.
    • related::value—search for events with segment values (e.g., 404 or username) matching one or more in the current event.
    • similar::value—search for events with a similar event type to the current event.
    • sourcetype::value—search for events with a given sourcetype of value.
    • unexpected::value—search for events that lie outside observed patterns in the index by the specified value of 0 (expected) to 9 (most unexpected).


Modifiers can be combined with keywords, wildcard characters, literal strings, quoted phrases and Boolean operators, such as AND, OR, NOT. Parentheses can be used to nest search and sub-search phrases together. An example search phrase might be “sourcetype::mysql*sock* NOT (started OR (host::foo OR host::BAR)) maxresults::10 (eventtype::baddb OR eventtype::?8512-3) daysago::30”.
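A first parsing step, separating name::value modifiers from the remaining keywords, operators and parentheses, might be sketched as follows. The function name `split_phrase` is hypothetical, and the sketch does not handle quoted phrases or build a parse tree.

```python
import re

MODIFIER = re.compile(r"^([a-z]+)::(.*)$")
TOKEN = re.compile(r"[()]|[^\s()]+")   # a parenthesis, or a run of other text


def split_phrase(phrase):
    """Split a search phrase into (modifiers, other tokens)."""
    modifiers, terms = [], []
    for token in TOKEN.findall(phrase):
        match = MODIFIER.match(token)
        if match:
            modifiers.append((match.group(1), match.group(2)))
        else:
            terms.append(token)
    return modifiers, terms
```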


In one implementation, a custom parser 610 handles the Boolean operators “NOT” and “OR” and defaults to “AND”. This implementation also handles using parentheses to disambiguate the language when there are several operators. Otherwise, it associates left-to-right. The implementation also supports special search operators that are indicated using a domain specifier followed by a demarcation element. For example, searching for “source::1234”, might indicate that the searcher (human or system) wants to restrict results to events that were received from a particular source ID.


Incoming search phrases may also trigger ad hoc computation 612 based on a map of special keywords. For example, a special search string might be used to indicate that a search is to be stored and reissued on a periodic basis or to request a list of sources. In this case, the search string would be stored in a table on disk along with a schedule specifying the schedule on which the search should be reissued. Depending on the results of the search when executed, additional actions may be triggered. For example, an email alert might be sent, an RSS feed might be updated, or a user-supplied script might be executed. Another example of a search that triggers ad hoc computation 612 is one that is indicated to be saved for later use, but not to be reissued on a periodic basis.


Assuming that the search parser 610 determined that an annotated syntax tree 615 should be created for the search string, the next component, the search execution engine 620 will use the annotated syntax tree 615 to issue sub-searches 625 to the time bucketed indices 565. Each sub-search 625 is targeted at an individual time bucket 565. Time buckets are queried in the order that is most advantageous to pruning given the sort order for the results. For example, if search results are sorted in reverse chronological order, then the sub-search for the most recent time bucket will be issued first. This allows the search execution engine 620 to examine the results 635 of the sub-search before proceeding with additional (expensive) sub-searches 625. For example, if a particular sub-search returns enough results 635, then it is not necessary to proceed with additional sub-searches 625.


Once enough result sets 637 have been accumulated to satisfy the parent search, another module will take the results and merge 640 them into a single result set 235, 237 that satisfies the search. This merging process, in one implementation, performs a merge sort on the results from each of the buckets to keep them in the order required for the presentation process.
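The pruning of sub-searches and the subsequent merge might be sketched together as follows, assuming results are wanted in reverse chronological order. The function name `run_search`, the (time stamp, text) event tuples, and the `matches` predicate are assumptions of the example; `heapq.merge` from the Python standard library performs the merge of the already-sorted result sets.

```python
import heapq


def run_search(buckets, matches, limit):
    """Search buckets newest first, stopping once `limit` results are found.

    `buckets` is ordered newest first; each bucket is a list of
    (timestamp, text) events that is itself sorted newest first.
    """
    result_sets = []
    found = 0
    for bucket in buckets:
        hits = [event for event in bucket if matches(event)]
        result_sets.append(hits)
        found += len(hits)
        if found >= limit:
            break  # enough results: skip the remaining sub-searches
    merged = heapq.merge(*result_sets, key=lambda e: e[0], reverse=True)
    return list(merged)[:limit]
```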


Presentation Process


The final process in an exemplary implementation of our example TSSE is the preparation of search results for presentation 240, as shown in FIG. 7. Unlike current large-scale search engines that present non-interactive results ordered by keyword relevance ranking, this example TSSE can present results organized by time, event relationships, and keyword relevance ranking.


Time Based Presentation


Unique to the challenge of indexing and searching time series data is the presentation of results using time as a primary dimension 710. Because existing large-scale search engines do not organize information by time, the presentation of time-based results is not a consideration. However, a primary benefit of a TSSE is the ability to index, search and present time series data chronologically. Results can be presented by aggregating and summarizing search results based on discrete time ranges or based on statistical calculations.


For example, the example TSSL can specify to see results for only a particular time frame and/or to see results presented by seconds, minutes, hours, days, weeks or months. In this way the search window can be limited to a timeframe and the results can be constructed for optimal viewing based on the density of the expected result set returned from a search. The search “192.168.169.100 hoursago::24 page::seconds”, will return time series events including the keyword “192.168.169.100” that occurred within the last 24 hours and will summarize the display results by seconds. In an exemplary implementation of a TSSE, summarization can include both aggregated display lines summarizing the events for the summary window and/or paging the results by the summary window. In the example above, each page of the search results presentation may include one second in time. Examples include but are not limited to:

    • Ability to scroll/page through the data (n) results at a time by count.
    • Ability to scroll/page through the data by time: next/previous second, minute, hour, day, year.
    • Ability to specify max count per timeframe.
    • Ability to get next (n) results within a paged time frame—(within a second) get next 100.
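Paging results by a summary window, as in the page::seconds example above, might be sketched as follows. The function name `page_by` and the (time stamp, text) event tuples are assumptions of the example.

```python
from itertools import groupby


def page_by(events, seconds_per_page=1):
    """Group (timestamp, text) events into pages of a fixed time width.

    Returns a list of (page start time, [texts]) in chronological order.
    """
    ordered = sorted(events, key=lambda e: e[0])
    pages = groupby(ordered, key=lambda e: e[0] // seconds_per_page)
    return [(key * seconds_per_page, [text for _, text in group])
            for key, group in pages]
```

Calling `page_by(events, 60)` would instead produce one page per minute.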


Metadata Presentation


In addition to time-based presentation 710, an example TSSE preferably is able to present additional aggregation and summarization of results by metadata characteristics 720, such as, data source, data source type, event type, or originating host machine. In this way, results can be not only organized by time, but also refined by metadata aggregation and summarization. The search “192.168.169.100 page::source” will present all the results with “192.168.169.100” and put each data source containing results on a separate page. Examples include but are not limited to:

    • Original physical location of the data source.
    • Original physical machine, sensor etc. generating the data.
    • Type of data source as dynamically assigned by the indexing process.
    • Type of event as dynamically assigned by the indexing process.


Zoom Control


Because time and certain metadata parameters (e.g., machine IP addresses) can be continuous, an example TSSE user interaction model can include the ability to move from small increments of time (seconds or minutes) or metadata parameters (different classes of IP addresses) using a zoom control 730. This zoom control can be combined with other metadata search parameters to enable the rapid movement through large amounts of data. Examples include but are not limited to:

    • Ability to zoom in and out around a given time from any second(s) to minute(s), hour(s), etc.
    • Ability to zoom in to second resolution around 12:15 AM Jun. 3, 2005, for a specific data source type and physical machine location.


Presentation Density Control


Given the different types of users (humans and systems) and the varying types of time series data and events (e.g., single line events a few bytes in size, to multiple line events several megabytes in size), it is useful to be able to specify the density of the results. In one implementation, the presentation density can be controlled 740 to return and/or display only the raw data without any metadata in a simple ASCII text format. Alternatively, the same results can be returned and/or displayed with full metadata as rich XML.


Implementation


The TSSE can be implemented in many different ways. In one approach, each box shown in the various figures is implemented in software as a separate process. All of the processes can run on a single machine or they can be divided up to run on separate logical or physical machines. In alternate embodiments, the invention is implemented in computer hardware, firmware, software, and/or combinations thereof. Apparatus of the invention can be implemented in a computer program product tangibly embodied in a machine-readable storage device for execution by a programmable processor; and method steps of the invention can be performed by a programmable processor executing a program of instructions to perform functions of the invention by operating on input data and generating output. The invention can be implemented advantageously in one or more computer programs that are executable on a programmable system including at least one programmable processor coupled to receive data and instructions from, and to transmit data and instructions to, a data storage system, at least one input device, and at least one output device. Each computer program can be implemented in a high-level procedural or object-oriented programming language, or in assembly or machine language if desired; and in any case, the language can be a compiled or interpreted language. Suitable processors include, by way of example, both general and special purpose microprocessors. Generally, a processor will receive instructions and data from a read-only memory and/or a random access memory. Generally, a computer will include one or more mass storage devices for storing data files; such devices include magnetic disks, such as internal hard disks and removable disks; magneto-optical disks; and optical disks. 
Storage devices suitable for tangibly embodying computer program instructions and data include all forms of non-volatile memory, including by way of example semiconductor memory devices, such as EPROM, EEPROM, and flash memory devices; magnetic disks such as internal hard disks and removable disks; magneto-optical disks; and CD-ROM disks. Any of the foregoing can be supplemented by, or incorporated in, ASICs (application-specific integrated circuits) and other forms of hardware.


Therefore, although the detailed description contains many specifics, these should not be construed as limiting the scope of the invention but merely as illustrating different examples and aspects of the invention. It should be appreciated that the scope of the invention includes other embodiments not discussed in detail above. Various modifications, changes and variations which will be apparent to those skilled in the art may be made in the arrangement, operation and details of the method and apparatus of the present invention disclosed herein without departing from the spirit and scope of the invention as defined in the appended claims. Therefore, the scope of the invention should be determined by the appended claims and their legal equivalents.

Claims
  • 1. A computer-implemented method comprising: breaking streaming machine data into a set of events, each event in the set of events including a portion of the machine data, wherein a subset of events in the set of events includes time information, and wherein each event of the set of events is broken out from the streaming machine data according to one or more breaking rules; determining whether each event broken out from the streaming machine data according to the one or more breaking rules include time information by: for each event of the set of events, iterating over known time stamp format patterns from a list of known time stamp format patterns to determine whether a matching pattern exists in the event, wherein each time stamp format pattern in the list of known time stamp format patterns represents a pattern that may occur in the event and indicates a location in the event from which a time stamp may be extracted; and responsive to determining that an individual event broken out from the streaming machine data according to the one or more breaking rules includes time information according to at least one known time stamp format pattern of the list of known time stamp format patterns, generating a timestamp for the individual event by: extracting a time value from the time information of the individual event using the matching pattern determined to exist in the individual event and associating the time value to the individual event as the timestamp for the individual event.
  • 2. The computer-implemented method of claim 1, further comprising: responsive to determining that one or more additional events of the set of events broken out from the streaming machine data according to the one or more breaking rules do not include time information according to at least one of the list of known time stamp format patterns, generating a timestamp for the one or more additional events based at least partly on at least one other event in the set of events.
  • 3. The computer-implemented method of claim 2, wherein the at least one other event in the set of events is an event preceding the one or more additional events.
  • 4. The computer-implemented method of claim 2, wherein generating the timestamp for the one or more additional events comprises assigning a time value of a time stamp of the at least one other event a time value for the time stamp for the one or more additional events.
  • 5. The computer-implemented method of claim 2, wherein the time stamp for the one or more additional events includes a same time value as a time stamp for the at least one other event.
  • 6. The computer-implemented method of claim 1, wherein the computer-implemented method is implemented in parallel by a plurality of computing devices forming a distributed computing system.
  • 7. The computer-implemented method of claim 1, further comprising outputting the set of events, including the individual event, to a search engine system configured to search the set of events based on timestamps within the set of events.
  • 8. The computer-implemented method of claim 1, further comprising outputting the set of events, including the individual event, to a search engine system configured to index the set of events for searching based at least partly on timestamps within the set of events.
  • 9. The computer-implemented method of claim 1, wherein the set of events is associated with a data source from which the streaming machine data was retrieved, and wherein the list of known time stamp format patterns is associated with the data source.
  • 10. The computer-implemented method of claim 1, wherein the one or more breaking rules are based at least partly on a character pattern identified within the streaming machine data.
  • 11. The computer-implemented method of claim 1, further comprising normalizing the timestamp for the individual event according to a specified time zone.
  • 12. The computer-implemented method of claim 1, wherein the timestamp is a Unix timestamp.
  • 13. The computer-implemented method of claim 1, further comprising identifying the one or more breaking rules according to a source of the streaming machine data.
  • 14. A system, comprising: non-transitory computer-readable media including computer-executable instructions; and a processor configured to execute the computer-executable instructions, wherein execution of the computer-executable instructions causes the system to: break streaming machine data into a set of events, each event in the set of events including a portion of the machine data, wherein a subset of events in the set of events includes time information, and wherein each event of the set of events is broken out from the streaming machine data according to one or more breaking rules; determine whether each event broken out from the streaming machine data according to the one or more breaking rules include time information by causing the system to: for each event of the set of events, iterate over known time stamp format patterns from a list of known time stamp format patterns to determine whether a matching pattern exists in the event, wherein each time stamp format pattern in the list of known time stamp format patterns represents a pattern that may occur in the event and indicates a location in the event from which a time stamp may be extracted; and responsive to determining that an individual event broken out from the streaming machine data according to the one or more breaking rules includes time information according to at least one known time stamp format pattern of the list of known time stamp format patterns, generate a timestamp for the individual event by causing the system to: extract a time value from the time information of the individual event using the matching pattern determined to exist in the individual event and associating the time value to the individual event as the timestamp for the individual event.
  • 15. The system of claim 14, wherein execution of the computer-executable instructions further causes the system to: responsive to determining that one or more additional events broken out from the streaming machine data according to the one or more breaking rules do not include time information according to at least one of the list of known time stamp format patterns, generate a timestamp for the one or more additional events based at least partly on at least one other event in the set of events.
  • 16. The system of claim 14, wherein execution of the computer-executable instructions further causes the system to output the set of events, including the individual event, to a search engine system configured to search the set of events based on timestamps within the set of events.
  • 17. The system of claim 14, wherein execution of the computer-executable instructions further causes the system to output the set of events, including the individual event, to a search engine system configured to index the set of events for searching based at least partly on timestamps within the set of events.
  • 18. The system of claim 14, wherein the set of events is associated with a data source from which the streaming machine data was retrieved, and wherein the list of known time stamp format patterns is associated with the data source.
  • 19. The system of claim 14, wherein execution of the computer-executable instructions further causes the system to identify the one or more breaking rules for use in breaking out the set of events from the streaming machine data according to a source of the streaming machine data.
  • 20. One or more non-transitory computer-readable media including computer-executable instructions that, when executed, cause a computing system to: break streaming machine data into a set of events, each event in the set of events including a portion of the machine data, wherein a subset of events in the set of events includes time information, and wherein each event of the set of events is broken out from the streaming machine data according to one or more breaking rules; for each event broken out from the streaming machine data according to one or more breaking rules, iterate over known time stamp format patterns from a list of known time stamp format patterns to determine whether a matching pattern exists in the event, wherein each time stamp format pattern in the list of known time stamp format patterns represents a pattern that may occur in the event and indicates a location in the event from which a time stamp may be extracted; and responsive to determining that an individual event broken out from the streaming machine data according to the one or more breaking rules includes time information according to at least one known time stamp format pattern of the list of known time stamp format patterns, extract a time value from the time information of the individual event using the matching pattern determined to exist in the individual event and associate the time value to the individual event as a timestamp for the individual event.
  • 21. The one or more non-transitory computer-readable media of claim 20, wherein execution of the computer-executable instructions further causes the computing system to: responsive to determining that one or more additional events broken out from the streaming machine data according to the one or more breaking rules do not include time information according to at least one of the list of known time stamp format patterns, generate a timestamp for the one or more additional events based at least partly on at least one other event in the set of events.
  • 22. The one or more non-transitory computer-readable media of claim 20, wherein execution of the computer-executable instructions further causes the computing system to output the set of events, including the individual event, to a search engine system configured to search the set of events based on timestamps within the set of events.
  • 23. The one or more non-transitory computer-readable media of claim 20, wherein execution of the computer-executable instructions further causes the computing system to output the set of events, including the individual event, to a search engine system configured to index the set of events for searching based at least partly on timestamps within the set of events.
  • 24. The one or more non-transitory computer-readable media of claim 20, wherein the set of events is associated with a data source from which the streaming machine data was retrieved, and wherein the list of known time stamp format patterns is associated with the data source.
  • 25. The one or more non-transitory computer-readable media of claim 20, wherein execution of the computer-executable instructions further causes the computing system to identify the one or more breaking rules for use in breaking out the set of events from the streaming machine data according to a source of the streaming machine data.
  • 26. A system comprising: non-transitory computer-readable media including computer-executable instructions; and a processor configured to execute the computer-executable instructions, wherein execution of the computer-executable instructions causes the system to: break streaming machine data into a set of events, each event in the set of events including a portion of the machine data, wherein a subset of events in the set of events includes time information, and wherein each event of the set of events is identified from the streaming machine data according to one or more breaking rules; for each event identified from the streaming machine data according to one or more breaking rules, iterate over known time stamp format patterns from a list of known time stamp format patterns to determine whether a matching pattern exists in the event, wherein each time stamp format pattern in the list of known time stamp format patterns represents a pattern that may occur in the event and indicates a location in the event from which a time stamp may be extracted; and responsive to a determination that an individual event identified from the streaming machine data according to the one or more breaking rules includes time information in at least one known time stamp format pattern of the list of known time stamp format patterns, extract a time value from the time information of the individual event using the matching pattern determined to exist in the individual event and associate the time value to the individual event as a timestamp for the individual event.
  • 27. The system of claim 26, wherein execution of the computer-executable instructions further causes the system to: responsive to determining that one or more additional events identified from the streaming machine data according to the one or more breaking rules do not include time information in at least one of the list of known time stamp format patterns, generate a timestamp for the one or more additional events based at least partly on at least one other event in the set of events.
  • 28. The system of claim 26, wherein execution of the computer-executable instructions further causes the system to output the set of events, including the individual event, to a search engine system configured to search the set of events based on timestamps within the set of events.
  • 29. The system of claim 26, wherein execution of the computer-executable instructions further causes the system to output the set of events, including the individual event, to a search engine system configured to index the set of events for searching based at least partly on timestamps within the set of events.
  • 30. The system of claim 26, wherein the set of events is associated with a data source from which the streaming machine data was retrieved, and wherein the list of known time stamp format patterns is associated with the data source.
  • 31. The system of claim 26, wherein execution of the computer-executable instructions further causes the system to identify the one or more breaking rules for use in breaking out the set of events from the streaming machine data according to a source of the streaming machine data.
  • 32. One or more non-transitory computer-readable media including computer-executable instructions that, when executed, cause a computing system to: break streaming machine data into a set of events, each event in the set of events including a portion of the machine data, wherein a subset of events in the set of events includes time information, and wherein each event of the set of events is broken out from the streaming machine data according to one or more breaking rules; determine whether each event broken out from the streaming machine data according to the one or more breaking rules includes time information by causing the computing system to: for each event of the set of events, iterate over known time stamp format patterns from a list of known time stamp format patterns to determine whether a matching pattern exists in the event, wherein each time stamp format pattern in the list of known time stamp format patterns represents a pattern that may occur in the event and indicates a location in the event from which a time stamp may be extracted; and responsive to determining that an individual event broken out from the streaming machine data according to the one or more breaking rules includes time information in at least one known time stamp format pattern of the list of known time stamp format patterns, generate a timestamp for the individual event by causing the computing system to: extract a time value from the time information of the individual event using the matching pattern determined to exist in the individual event and associating the time value to the individual event as the timestamp for the individual event.
  • 33. The one or more non-transitory computer-readable media of claim 32, wherein execution of the computer-executable instructions further causes the computing system to: responsive to determining that one or more additional events broken out from the streaming machine data according to the one or more breaking rules do not include time information in at least one of the list of known time stamp format patterns, generate a timestamp for the one or more additional events based at least partly on at least one other event in the set of events.
  • 34. The one or more non-transitory computer-readable media of claim 32, wherein execution of the computer-executable instructions further causes the computing system to output the set of events, including the individual event, to a search engine system configured to search the set of events based on timestamps within the set of events.
  • 35. The one or more non-transitory computer-readable media of claim 32, wherein execution of the computer-executable instructions further causes the computing system to output the set of events, including the individual event, to a search engine system configured to index the set of events for searching based at least partly on timestamps within the set of events.
  • 36. The one or more non-transitory computer-readable media of claim 32, wherein the set of events is associated with a data source from which the streaming machine data was retrieved, and wherein the list of known time stamp format patterns is associated with the data source.
  • 37. A computer-implemented method, comprising: breaking streaming machine data into a set of events, each event in the set of events including a portion of the machine data, wherein a subset of events in the set of events includes time information, and wherein each event of the set of events is identified from the streaming machine data according to one or more breaking rules; for each event identified from the streaming machine data according to one or more breaking rules, iterating over known time stamp format patterns from a list of known time stamp format patterns to determine whether a matching pattern exists in the event, wherein each time stamp format pattern in the list of known time stamp format patterns represents a pattern that may occur in the event and indicates a location in the event from which a time stamp may be extracted; and responsive to determining that an individual event identified from the streaming machine data according to the one or more breaking rules includes time information in at least one known time stamp format pattern of the list of known time stamp format patterns, extracting a time value from the time information of the individual event using the matching pattern determined to exist in the individual event and associating the time value to the individual event as a timestamp for the individual event.
  • 38. The computer-implemented method of claim 37, further comprising: responsive to determining that one or more additional events identified from the streaming machine data according to the one or more breaking rules do not include time information in at least one of the list of known time stamp format patterns, generating a timestamp for the one or more additional events based at least partly on at least one other event in the set of events.
  • 39. The computer-implemented method of claim 37, further comprising outputting the set of events, including the individual event, to a search engine system configured to search the set of events based on timestamps within the set of events.
  • 40. The computer-implemented method of claim 37, further comprising outputting the set of events, including the individual event, to a search engine system configured to index the set of events for searching based at least partly on timestamps within the set of events.
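
For illustration only, the steps recited in the independent claims (breaking the stream into events, iterating over a list of known time stamp format patterns, extracting a time value, and associating it with the event) can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the patented implementation: the one-event-per-line breaking rule, the two entries in KNOWN_PATTERNS, the UTC interpretation of extracted times, and the function names are all hypothetical. The fallback that reuses the previous event's timestamp is just one possible reading of the dependent claims that generate a timestamp "based at least partly on at least one other event", and the integer result corresponds to the Unix timestamp of claim 12.

```python
import re
from datetime import datetime, timezone

# Hypothetical list of known time stamp format patterns. Each entry pairs a
# regex (which locates the time information in the event) with a strptime
# format (which is used to extract the time value).
KNOWN_PATTERNS = [
    (re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}"), "%Y-%m-%d %H:%M:%S"),
    (re.compile(r"\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2}"), "%m/%d/%Y %H:%M:%S"),
]


def break_events(stream_text):
    """Assumed breaking rule: every non-empty line is one event."""
    return [line for line in stream_text.splitlines() if line.strip()]


def extract_timestamp(event):
    """Iterate over the known patterns; return a Unix timestamp or None."""
    for regex, fmt in KNOWN_PATTERNS:
        match = regex.search(event)
        if match:
            # Assumption: times in the data are expressed in UTC.
            parsed = datetime.strptime(match.group(0), fmt)
            return int(parsed.replace(tzinfo=timezone.utc).timestamp())
    return None


def timestamp_events(stream_text):
    """Return (timestamp, event) pairs for every event in the stream."""
    stamped = []
    previous = None
    for event in break_events(stream_text):
        ts = extract_timestamp(event)
        if ts is None:
            # Assumed fallback: borrow the timestamp of the preceding event.
            ts = previous
        else:
            previous = ts
        stamped.append((ts, event))
    return stamped


if __name__ == "__main__":
    sample = (
        "2006-10-05 12:00:00 server started\n"
        "  continuation line without time information\n"
        "10/05/2007 00:00:01 job done\n"
    )
    for ts, event in timestamp_events(sample):
        print(ts, event)
```

In this sketch an event with no matching pattern before any timestamped event keeps a timestamp of None; a fuller treatment could instead look ahead to a later event or fall back to the time of receipt.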
RELATED APPLICATIONS

The present application is a continuation of U.S. patent application Ser. No. 17/125,807, filed Dec. 17, 2020; which is a continuation of U.S. patent application Ser. No. 15/963,740, filed Apr. 26, 2018; which is a continuation of U.S. patent application Ser. No. 15/661,260, filed on Jul. 27, 2017; which is a continuation of U.S. patent application Ser. No. 15/420,938, filed on Jan. 31, 2017; which is a continuation of U.S. patent application Ser. No. 14/611,170, filed on Jan. 30, 2015; which is a continuation of U.S. patent application Ser. No. 13/353,135, filed on Jan. 18, 2012; which is a continuation of U.S. patent application Ser. No. 11/868,370, filed Oct. 5, 2007; which claims priority under 35 U.S.C. § 119(e) to U.S. Provisional Patent Application No. 60/828,283, filed Oct. 5, 2006. The subject matter of all of the foregoing is incorporated herein by reference in its entirety.

US Referenced Citations (474)
Number Name Date Kind
4739398 Thomas et al. Apr 1988 A
4956774 Shibamiya et al. Sep 1990 A
5121443 Tomlinson Jun 1992 A
5276629 Reynolds Jan 1994 A
5347540 Karrick Sep 1994 A
5414838 Kolton et al. May 1995 A
5613113 Goldring Mar 1997 A
5627886 Bowman May 1997 A
5737600 Geiner Apr 1998 A
5745693 Knight et al. Apr 1998 A
5745746 Jhingran et al. Apr 1998 A
5751965 May et al. May 1998 A
5761652 Wu et al. Jun 1998 A
5847972 Eick Dec 1998 A
5951541 Simpson et al. Sep 1999 A
5953439 Ishihara et al. Sep 1999 A
5960434 Schimmel Sep 1999 A
5966704 Furegati et al. Oct 1999 A
6021437 Chen et al. Feb 2000 A
6088717 Reed et al. Jul 2000 A
6115705 Larson Sep 2000 A
6137283 Williams et al. Oct 2000 A
6212494 Boguraev Apr 2001 B1
6285997 Carey et al. Sep 2001 B1
6341176 Shirasaki et al. Jan 2002 B1
6345283 Anderson Feb 2002 B1
6363131 Beidas et al. Mar 2002 B1
6449618 Blott Sep 2002 B1
6490553 Van Thong et al. Dec 2002 B2
6496831 Baulier et al. Dec 2002 B1
6516189 Frangione et al. Feb 2003 B1
6598078 Ehrlich et al. Jul 2003 B1
6598087 Dixon, III et al. Jul 2003 B1
6604114 Toong et al. Aug 2003 B1
6658367 Conrad Dec 2003 B2
6658487 Smith Dec 2003 B1
6662176 Brunet et al. Dec 2003 B2
6678674 Saeki Jan 2004 B1
6725287 Loeb et al. Apr 2004 B1
6751228 Okamura Jun 2004 B1
6760903 Morshed et al. Jul 2004 B1
6763347 Zhang Jul 2004 B1
6768994 Howard et al. Jul 2004 B1
6789046 Murstein et al. Sep 2004 B1
6801938 Bookman et al. Oct 2004 B1
6816830 Kempe Nov 2004 B1
6907422 Predovic Jun 2005 B1
6907545 Ramadei Jun 2005 B2
6920468 Cousins Jul 2005 B1
6951541 Desmarais Oct 2005 B2
6980963 Hanzek Dec 2005 B1
6993246 Pan et al. Jan 2006 B1
7003781 Blackwell et al. Feb 2006 B1
7035925 Nareddy et al. Apr 2006 B1
7069176 Swaine et al. Jun 2006 B2
7076547 Black Jul 2006 B1
7084742 Haines Aug 2006 B2
7085682 Heller et al. Aug 2006 B1
7127456 Brown et al. Oct 2006 B1
7134081 Fuller, III et al. Nov 2006 B2
7146416 Yoo et al. Dec 2006 B1
7184777 Diener et al. Feb 2007 B2
7231403 Howitt et al. Jun 2007 B1
7301603 Chen et al. Nov 2007 B2
7321891 Ghazal Jan 2008 B1
7376752 Chudnovsky et al. May 2008 B1
7376969 Njemanze et al. May 2008 B1
7379999 Zhou et al. May 2008 B1
7395187 Duyanovich et al. Jul 2008 B2
7406399 Furem et al. Jul 2008 B2
7437266 Ueno et al. Oct 2008 B2
7454761 Roberts et al. Nov 2008 B1
7457872 Aton et al. Nov 2008 B2
7493304 Day et al. Feb 2009 B2
7523191 Thomas et al. Apr 2009 B1
7526769 Watts, Jr. et al. Apr 2009 B2
7546234 Deb et al. Jun 2009 B1
7546553 Bozak et al. Jun 2009 B2
7565425 Van Vleet et al. Jul 2009 B2
7580938 Pai et al. Aug 2009 B1
7580944 Zhuge et al. Aug 2009 B2
7593953 Malalur Sep 2009 B1
7600029 Mashinsky Oct 2009 B1
7616666 Schultz Nov 2009 B1
7617314 Bansod et al. Nov 2009 B1
7620697 Davies Nov 2009 B1
7627544 Chkodrov Dec 2009 B2
7653742 Bhargava et al. Jan 2010 B1
7673340 Cohen Mar 2010 B1
7680916 Barnett et al. Mar 2010 B2
7685109 Ransil et al. Mar 2010 B1
7689394 Furem et al. Mar 2010 B2
7747641 Kim et al. Jun 2010 B2
7783655 Barabas et al. Aug 2010 B2
7783750 Casey et al. Aug 2010 B1
7797309 Waters Sep 2010 B2
7809131 Njemanze et al. Oct 2010 B1
7810155 Ravi Oct 2010 B1
7818313 Tsimelzon et al. Oct 2010 B1
7827182 Panigrahy Nov 2010 B1
7856441 Kraft et al. Dec 2010 B1
7885954 Barsness et al. Feb 2011 B2
7895167 Berg et al. Feb 2011 B2
7895383 Gregg et al. Feb 2011 B2
7925678 Botros et al. Apr 2011 B2
7926099 Chakravarty et al. Apr 2011 B1
7934003 Carusi et al. Apr 2011 B2
7937164 Samardzija et al. May 2011 B2
7937344 Baum et al. May 2011 B2
7962489 Chiang et al. Jun 2011 B1
7970934 Patel Jun 2011 B1
7974728 Lin Jul 2011 B2
7979362 Zhao et al. Jul 2011 B2
7979439 Nordstrom et al. Jul 2011 B1
7991758 Beeston Aug 2011 B2
8005992 Pichumani et al. Aug 2011 B1
8031634 Artzi et al. Oct 2011 B1
8046749 Owen et al. Oct 2011 B1
8073806 Garg et al. Dec 2011 B2
8112425 Baum et al. Feb 2012 B2
8196150 Downing et al. Jun 2012 B2
8200527 Thompson et al. Jun 2012 B1
8301603 Kan et al. Oct 2012 B2
8321448 Zeng et al. Nov 2012 B2
8346777 Auerbach et al. Jan 2013 B1
8401710 Budhraja et al. Mar 2013 B2
8438170 Koran et al. May 2013 B2
8577847 Blazejewski et al. Nov 2013 B2
8589375 Zhang et al. Nov 2013 B2
8601112 Nordstrom Dec 2013 B1
8615773 Bishop et al. Dec 2013 B2
8635130 Smith et al. Jan 2014 B1
8645390 Oztekin et al. Feb 2014 B1
8683467 Bingham et al. Mar 2014 B2
8707194 Jenkins Apr 2014 B1
8751529 Zhang et al. Jun 2014 B2
8788525 Neels et al. Jul 2014 B2
8793118 Srinivasa et al. Jul 2014 B2
8813039 Maczuba Aug 2014 B2
8825623 Samuelson et al. Sep 2014 B2
8898189 Nakano et al. Nov 2014 B2
8904299 Owen et al. Dec 2014 B1
8904389 Bingham et al. Dec 2014 B2
8909622 Emigh et al. Dec 2014 B1
8914601 Lethin et al. Dec 2014 B1
8924376 Lee Dec 2014 B1
8990184 Baum et al. Mar 2015 B2
9002854 Baum et al. Apr 2015 B2
9020976 Ahmed et al. Apr 2015 B2
9037555 Genest et al. May 2015 B2
9043185 Bender May 2015 B2
9047352 Dong et al. Jun 2015 B1
9092411 Barabas et al. Jul 2015 B2
9128995 Fletcher Sep 2015 B1
9130832 Boe et al. Sep 2015 B1
9130860 Boe et al. Sep 2015 B1
9146954 Boe et al. Sep 2015 B1
9146962 Boe et al. Sep 2015 B1
9152726 Lymperopoulos et al. Oct 2015 B2
9158811 Choudhary et al. Oct 2015 B1
9164786 Bingham et al. Oct 2015 B2
9208463 Bhide et al. Dec 2015 B1
9215240 Merza et al. Dec 2015 B2
9256501 Rahut Feb 2016 B1
9286413 Coates et al. Mar 2016 B1
9294361 Choudhary et al. Mar 2016 B1
9317582 Baum et al. Apr 2016 B2
9384203 Seiver et al. Jul 2016 B1
9420050 Sakata et al. Aug 2016 B1
9465713 Tonouchi Oct 2016 B2
9495187 Bingham et al. Nov 2016 B2
9514175 Swan et al. Dec 2016 B2
9521047 Alekseyev et al. Dec 2016 B2
9590877 Choudhary et al. Mar 2017 B2
9594789 Baum et al. Mar 2017 B2
9740755 Lamas et al. Aug 2017 B2
9747316 Baum et al. Aug 2017 B2
9767197 Agarwal et al. Sep 2017 B1
9785417 Crossley et al. Oct 2017 B2
9785714 Gabriel Oct 2017 B2
9792351 Hernandez-Sherrington et al. Oct 2017 B2
9922065 Swan et al. Mar 2018 B2
9922066 Swan et al. Mar 2018 B2
9922067 Baum et al. Mar 2018 B2
9928262 Baum et al. Mar 2018 B2
9996571 Baum et al. Jun 2018 B2
10019496 Bingham et al. Jul 2018 B2
10127258 Lamas et al. Nov 2018 B2
10157089 Ahmad et al. Dec 2018 B2
10216779 Swan et al. Feb 2019 B2
10242039 Baum et al. Mar 2019 B2
10255136 Salapura et al. Apr 2019 B2
10255312 Swan et al. Apr 2019 B2
10262018 Swan et al. Apr 2019 B2
10318541 Bingham et al. Jun 2019 B2
10346357 Bingham et al. Jul 2019 B2
10353957 Bingham et al. Jul 2019 B2
10417203 Ramnarayanan et al. Sep 2019 B2
10425300 Vlachogiannis et al. Sep 2019 B2
10503732 Kim et al. Dec 2019 B2
10540321 Miller Jan 2020 B2
10592522 Bingham et al. Mar 2020 B2
10614132 Bingham et al. Apr 2020 B2
10678767 Baum et al. Jun 2020 B2
10740313 Baum et al. Aug 2020 B2
10747742 Baum et al. Aug 2020 B2
10776441 Echeverria Sep 2020 B1
10877986 Bingham et al. Dec 2020 B2
10877987 Bingham et al. Dec 2020 B2
10891281 Baum et al. Jan 2021 B2
10977233 Swan et al. Apr 2021 B2
10997191 Bingham et al. May 2021 B2
11192295 Warfield et al. Aug 2021 B2
11144526 Swan et al. Oct 2021 B2
11238048 Breeden Feb 2022 B1
11249971 Baum et al. Feb 2022 B2
20010044795 Cohen et al. Nov 2001 A1
20020042821 Muret et al. Apr 2002 A1
20020046248 Drexler Apr 2002 A1
20020069223 Goodisman et al. Jun 2002 A1
20020078381 Farley et al. Jun 2002 A1
20020080810 Casals Jun 2002 A1
20020129137 Mills, III et al. Sep 2002 A1
20020154175 Abello et al. Oct 2002 A1
20020157017 Mi et al. Oct 2002 A1
20020169735 Kil et al. Nov 2002 A1
20020173911 Brunet et al. Nov 2002 A1
20020198984 Goldstein et al. Dec 2002 A1
20030014399 Hansen et al. Jan 2003 A1
20030018435 Jenner et al. Jan 2003 A1
20030037034 Daniels et al. Feb 2003 A1
20030041264 Black et al. Feb 2003 A1
20030074401 Connell et al. Apr 2003 A1
20030084349 Friedrichs et al. May 2003 A1
20030110186 Markowski et al. Jun 2003 A1
20030120593 Bansal et al. Jun 2003 A1
20030126613 McGuire Jul 2003 A1
20030141879 Wilsher Jul 2003 A1
20030154192 Laborde et al. Aug 2003 A1
20030171977 Singh et al. Sep 2003 A1
20030182310 Charnock et al. Sep 2003 A1
20030204698 Sachedina et al. Oct 2003 A1
20030208472 Pham Nov 2003 A1
20030212699 Denesuk et al. Nov 2003 A1
20040024773 Stoffel et al. Feb 2004 A1
20040034795 Anderson et al. Feb 2004 A1
20040049693 Douglas Mar 2004 A1
20040057536 Kasper, II et al. Mar 2004 A1
20040073534 Robson Apr 2004 A1
20040122656 Abir Jun 2004 A1
20040143602 Ruiz Jul 2004 A1
20040169688 Burdick et al. Sep 2004 A1
20040170392 Lu et al. Sep 2004 A1
20040194141 Sanders Sep 2004 A1
20040243618 Malaney et al. Dec 2004 A1
20040254919 Giuseppini Dec 2004 A1
20040267691 Vasudeva Dec 2004 A1
20050010564 Metzger et al. Jan 2005 A1
20050015624 Ginter Jan 2005 A1
20050021736 Carusi Jan 2005 A1
20050022207 Grabarnik et al. Jan 2005 A1
20050033803 Vleet et al. Feb 2005 A1
20050044406 Stute Feb 2005 A1
20050055357 Campbell Mar 2005 A1
20050071379 Kekre et al. Mar 2005 A1
20050076067 Bakalash et al. Apr 2005 A1
20050080806 Doganata et al. Apr 2005 A1
20050114331 Wang et al. May 2005 A1
20050114707 Destefano et al. May 2005 A1
20050125807 Brady, Jr. et al. Jun 2005 A1
20050138111 Aton et al. Jun 2005 A1
20050172162 Takahashi et al. Aug 2005 A1
20050177372 Wang et al. Aug 2005 A1
20050203888 Woosley Sep 2005 A1
20050223027 Lawrence et al. Oct 2005 A1
20050235356 Wang Oct 2005 A1
20050256956 Lttlefield et al. Nov 2005 A1
20050259776 Kinser et al. Nov 2005 A1
20050273281 Wall et al. Dec 2005 A1
20050273614 Ahuja et al. Dec 2005 A1
20050289540 Nguyen Dec 2005 A1
20060004691 Sifry Jan 2006 A1
20060004731 Seibel et al. Jan 2006 A1
20060004909 Takuwa et al. Jan 2006 A1
20060026164 Jung Feb 2006 A1
20060031216 Semple et al. Feb 2006 A1
20060048101 Krassovsky et al. Mar 2006 A1
20060059238 Slater et al. Mar 2006 A1
20060069717 Mamou et al. Mar 2006 A1
20060085163 Nader Apr 2006 A1
20060085399 Carmel et al. Apr 2006 A1
20060143175 Urkainczk et al. Jun 2006 A1
20060149558 Kahn et al. Jul 2006 A1
20060153097 Schultz et al. Jul 2006 A1
20060161816 Gula et al. Jul 2006 A1
20060173878 Bley Aug 2006 A1
20060184529 Berg et al. Aug 2006 A1
20060184615 Park et al. Aug 2006 A1
20060197766 Raz Sep 2006 A1
20060197768 Van Hook et al. Sep 2006 A1
20060198359 Fok et al. Sep 2006 A1
20060212242 Levine et al. Sep 2006 A1
20060218278 Uyama Sep 2006 A1
20060218279 Yamaguchi et al. Sep 2006 A1
20060224254 Rumi et al. Oct 2006 A1
20060224583 Fikes et al. Oct 2006 A1
20060229931 Fligler Oct 2006 A1
20060248106 Milne et al. Nov 2006 A1
20060259519 Yakushev et al. Nov 2006 A1
20060265406 Chkodrov et al. Nov 2006 A1
20060294086 Rose et al. Dec 2006 A1
20070027612 Barfoot et al. Feb 2007 A1
20070033632 Baynger et al. Feb 2007 A1
20070038603 Guha Feb 2007 A1
20070038889 Wiggins Feb 2007 A1
20070043562 Holsinger et al. Feb 2007 A1
20070043704 Raub et al. Feb 2007 A1
20070067323 Vandersluis Mar 2007 A1
20070067575 Morris Mar 2007 A1
20070073519 Long Mar 2007 A1
20070073743 Bammi et al. Mar 2007 A1
20070074147 Wold Mar 2007 A1
20070100873 Yako et al. May 2007 A1
20070112754 Haigh May 2007 A1
20070113031 Brown et al. May 2007 A1
20070124437 Chervets May 2007 A1
20070130171 Hanckel et al. Jun 2007 A1
20070156786 May et al. Jul 2007 A1
20070156789 Semerdzhiev et al. Jul 2007 A1
20070192300 Reuther et al. Aug 2007 A1
20070255529 Biazette et al. Nov 2007 A1
20070283194 Villella Dec 2007 A1
20080021994 Grelewicz et al. Jan 2008 A1
20080027961 Arlitt et al. Jan 2008 A1
20080077558 Lawrence et al. Mar 2008 A1
20080083314 Hayashi et al. Apr 2008 A1
20080126408 Middleton May 2008 A1
20080134209 Bansal et al. Jun 2008 A1
20080148280 Stillwell et al. Jun 2008 A1
20080184110 Barsness et al. Jul 2008 A1
20080215546 Baum Sep 2008 A1
20080222654 Xu et al. Sep 2008 A1
20080270799 Yamaguchi et al. Oct 2008 A1
20080279113 Kalliola Nov 2008 A1
20080319975 Morris et al. Dec 2008 A1
20090003219 Beacham et al. Jan 2009 A1
20090083314 Maim Mar 2009 A1
20090119257 Waters May 2009 A1
20090138435 Mannion et al. May 2009 A1
20090157596 Couch et al. Jun 2009 A1
20090172014 Huetter Jul 2009 A1
20090172666 Yahalom Jul 2009 A1
20090177692 Chagoly et al. Jul 2009 A1
20090182866 Watanabe et al. Jul 2009 A1
20090192982 Samuelson et al. Jul 2009 A1
20090204380 Kato et al. Aug 2009 A1
20090237404 Cannon, III et al. Sep 2009 A1
20090259628 Farrell Oct 2009 A1
20090271511 Peracha Oct 2009 A1
20100179953 Kan et al. Jul 2010 A1
20100205212 Quadracci et al. Aug 2010 A1
20100223619 Jaquet et al. Sep 2010 A1
20100235338 Gabriel Sep 2010 A1
20100250712 Ellison et al. Sep 2010 A1
20100268797 Pyrik et al. Oct 2010 A1
20100332661 Tameshige et al. Dec 2010 A1
20110016123 Pandey et al. Jan 2011 A1
20110055256 Phillips et al. Mar 2011 A1
20110161851 Barber et al. Jun 2011 A1
20110179160 Liu et al. Jul 2011 A1
20110238687 Karpuram et al. Sep 2011 A1
20110261055 Wong et al. Oct 2011 A1
20110298804 Hao et al. Dec 2011 A1
20110307905 Essey et al. Dec 2011 A1
20110314148 Peterson et al. Dec 2011 A1
20120022707 Budhraja Jan 2012 A1
20120036484 Zhang et al. Feb 2012 A1
20120078925 Behar et al. Mar 2012 A1
20120120078 Hubbard et al. May 2012 A1
20120124503 Coimbatore et al. May 2012 A1
20120130774 Ziv et al. May 2012 A1
20120174097 Levin Jul 2012 A1
20120197928 Zhang et al. Aug 2012 A1
20120197934 Zhang Aug 2012 A1
20120216135 Wong et al. Aug 2012 A1
20120221314 Bourlatchkov et al. Aug 2012 A1
20120278292 Zahavi et al. Nov 2012 A1
20120284713 Ostermeyer et al. Nov 2012 A1
20120311153 Morgan et al. Dec 2012 A1
20120311475 Wong et al. Dec 2012 A1
20120317266 Abbott Dec 2012 A1
20120323941 Chkodrov et al. Dec 2012 A1
20120323970 Larson et al. Dec 2012 A1
20120331553 Aziz et al. Dec 2012 A1
20130007261 Dutta et al. Jan 2013 A1
20130030764 Chatterjee et al. Jan 2013 A1
20130055092 Cannon, III et al. Feb 2013 A1
20130097157 Ng et al. Apr 2013 A1
20130124714 Bendar May 2013 A1
20130158950 Cohen et al. Jun 2013 A1
20130204948 Zeyliger et al. Aug 2013 A1
20130239111 Bingham et al. Sep 2013 A1
20130239124 Ahmad et al. Sep 2013 A1
20130247042 Bingham et al. Sep 2013 A1
20130247043 Bingham et al. Sep 2013 A1
20130247044 Bingham et al. Sep 2013 A1
20130247133 Price et al. Sep 2013 A1
20130262347 Dodson Oct 2013 A1
20130262656 Cao et al. Oct 2013 A1
20130300747 Wong et al. Nov 2013 A1
20130332594 Dvir Dec 2013 A1
20130346615 Gondi Dec 2013 A1
20140019458 Walton Jan 2014 A1
20140040306 Gluzman et al. Feb 2014 A1
20140075029 Lipchuk et al. Mar 2014 A1
20140173029 Varney et al. Jun 2014 A1
20140195667 Ketchum et al. Jul 2014 A1
20140214888 Marquardt et al. Jul 2014 A1
20140280894 Reynolds et al. Sep 2014 A1
20140283083 Gula et al. Sep 2014 A1
20140324862 Bingham et al. Oct 2014 A1
20140351217 Bostock Nov 2014 A1
20150026167 Neels et al. Jan 2015 A1
20150143180 Dawson May 2015 A1
20150149879 Miller et al. May 2015 A1
20150154269 Miller et al. Jun 2015 A1
20150169654 Chen Jun 2015 A1
20150178342 Seering Jun 2015 A1
20150213631 Vander Broek Jul 2015 A1
20150293954 Hsiao et al. Oct 2015 A1
20150295778 Hsiao et al. Oct 2015 A1
20150295780 Hsiao et al. Oct 2015 A1
20150295796 Hsiao et al. Oct 2015 A1
20150379065 Yoshizawa et al. Dec 2015 A1
20160019215 Murphey et al. Jan 2016 A1
20160034555 Rahut et al. Feb 2016 A1
20160034566 Rahut Feb 2016 A1
20160055225 Xu et al. Feb 2016 A1
20160070736 Swan et al. Mar 2016 A1
20160125314 Mulukutla et al. May 2016 A1
20160127465 Barstow May 2016 A1
20160140128 Swan et al. May 2016 A1
20160360382 Gross et al. Dec 2016 A1
20170114810 Angerhausen et al. Apr 2017 A1
20170139962 Baum et al. May 2017 A1
20170139963 Baum et al. May 2017 A1
20170147616 Ramanarayanan et al. May 2017 A1
20170169082 Bingham et al. Jun 2017 A1
20170169134 Bingham et al. Jun 2017 A1
20170169137 Bingham et al. Jun 2017 A1
20170213007 Moturu et al. Jul 2017 A1
20170255639 Bingham et al. Sep 2017 A1
20170255683 Bingham et al. Sep 2017 A1
20170255711 Bingham et al. Sep 2017 A1
20170286499 Bingham Oct 2017 A1
20180218045 Pal Aug 2018 A1
20180293280 Svec Oct 2018 A1
20180293327 Miller Oct 2018 A1
20180307532 Di Balsamo Oct 2018 A1
20180341596 Teotia et al. Nov 2018 A1
20190065549 Pang Feb 2019 A1
20190098106 Mungel et al. Mar 2019 A1
20190147084 Pal et al. May 2019 A1
20190171630 Swan et al. Jun 2019 A1
20190179815 Bingham et al. Jun 2019 A1
20190205449 Erickson et al. Jul 2019 A1
20190213180 Swan et al. Jul 2019 A1
20190303365 Bingham et al. Oct 2019 A1
20200174986 Baum et al. Jun 2020 A1
20210103575 Baum Apr 2021 A1
20210248122 Baum Aug 2021 A1
20210248123 Baum Aug 2021 A1
20220035775 Sriharsha Feb 2022 A1
20220036177 Sriharsha Feb 2022 A1
Foreign Referenced Citations (8)
Number Date Country
1 480 100 Nov 2004 EP
2003-308229 Oct 2003 JP
10-0745-483 Aug 2007 KR
WO 1997038376 Oct 1997 WO
WO 2000079415 Dec 2000 WO
WO 2002027443 Apr 2002 WO
WO 2003096220 Nov 2003 WO
WO 2008043082 Apr 2008 WO
Non-Patent Literature Citations (102)
Entry
Baum et al., U.S. Appl. No. 14/611,170, filed Jan. 30, 2015.
Baum et al., U.S. Appl. No. 14/815,980, filed Aug. 1, 2015.
Swan et al., U.S. Appl. No. 14/929,248, filed Oct. 30, 2015.
Swan et al., U.S. Appl. No. 15/007,176, filed Jan. 26, 2016.
Swan et al., U.S. Appl. No. 15/008,425, filed Jan. 27, 2016.
Swan et al., U.S. Appl. No. 15/008,428, filed Jan. 27, 2016.
Swan et al., U.S. Appl. No. 15/339,887, filed Oct. 1, 2016.
Swan et al., U.S. Appl. No. 15/339,953, filed Nov. 1, 2016.
Baum et al., U.S. Appl. No. 15/421,416, filed Jan. 31, 2017.
Baum et al., U.S. Appl. No. 15/420,985, filed Jan. 31, 2017.
Baum et al., U.S. Appl. No. 15/420,938, filed Jan. 31, 2017.
Baum et al., U.S. Appl. No. 15/421,068, filed Jan. 31, 2017.
Baum et al., U.S. Appl. No. 15/661,260, filed Jul. 27, 2017.
Baum et al., U.S. Appl. No. 15/661,268, filed Jul. 27, 2017.
Baum et al., U.S. Appl. No. 15/661,286, filed Jul. 27, 2017.
Baum et al., U.S. Appl. No. 15/883,552, filed Jan. 30, 2018.
Baum et al., U.S. Appl. No. 15/883,588, filed Jan. 30, 2018.
Baum et al., U.S. Appl. No. 15/963,740, filed Apr. 26, 2018.
Swan et al., U.S. Appl. No. 15/885,806, filed Jan. 31, 2018.
Swan et al., U.S. Appl. No. 16/264,571, filed Jan. 31, 2019.
Swan et al., U.S. Appl. No. 16/264,618, filed Jan. 31, 2019.
Swan et al., U.S. Appl. No. 16/264,587, filed Jan. 31, 2019.
Baum et al., U.S. Appl. No. 16/264,610, filed Jan. 31, 2019.
Baum et al., U.S. Appl. No. 16/779,552, filed Jan. 31, 2020.
Baum et al., U.S. Appl. No. 17/125,807, filed Dec. 17, 2020.
Baum et al., U.S. Appl. No. 17/243,966, filed Apr. 29, 2021.
Baum et al., U.S. Appl. No. 17/243,967, filed Apr. 29, 2021.
Bounsaythip, C., et al., “Overview of Data Mining for Customer Behavior Modeling,” VTT Information Technology, Research Report TTE1-2001-18, Jun. 29, 2001, 59 pages, Retrieved from the Internet: <http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.22.3279&rep=rep1&type=pdf> [Sep. 20, 2010].
Chaudhuri, S., “An Overview Of Query Optimization In Relational Systems,” Proceedings Of The 1998 ACM SIGACT-SIGMOD-SIGART Symposium On Principles Of Database Systems, pp. 34-43, 1998.
Cooley, R., et al., “Data Preparation for Mining World Wide Web Browsing Patterns,” Knowledge and Information Systems 1, Springer-Verlag, 1999, 25 pages, Retrieved from the Internet: <http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.161.5884&rep=rep1&type=pdf> [Sep. 21, 2010].
Dell, Inc., “Foglight For Virtualization, Free Edition,” 2013 Dell, Datasheet-Foglight4Virtual-FreeEd-US-KS-2013-06-12.
Forlizzi, L. et al., “A data model and data structures for moving objects databases,” SIGMOD Record, ACM, New York, NY, vol. 29, No. 2, Jun. 1, 2000, pp. 319-330, XP002503045.
Gleed, Kyle, Viewing ESXi Logs from the DCUI, Jun. 2012, Retrieved from the Internet: <URL:https://blogs.vmware.com/vsphere/author/kyle_gleed>.
Golden, et al., “In Search of Meaning for Time Series Subsequence Clustering: Matching Algorithms Based on a New Distance Measure.” 15th ACM International Conference on Information and Knowledge Management, pp. 347-356, Nov. 2006.
Graefe, G., “Query Evaluation Techniques for Large Databases,” ACM Computing Surveys, vol. 25, No. 2, pp. 73-170, Jun. 1993.
Han, E. H., et al., “Text Categorization Using Weight Adjusted k-Nearest Neighbor Classification,” PAKDD 2001, LNAI 2035, pp. 23-65, Retrieved from the Internet: <http://springlerlink.com/index25gnd0jb6nklffhh.pdf> [Sep. 23, 2010].
Hoke, Evan et al., “InteMon: Continuous Mining of Sensor Data in Large-scale Self-* Infrastructures”, ACM SIGOPS Operating Systems Review, vol. 40 Issue 3, Jul. 2006, ACM Press, 7 pgs.
Kolovson, C. et al., “Indexing Techniques for Historical Databases,” Proceedings of the Fifth International Conference on Data Engineering, Los Angeles, CA, IEEE Comput. Soc., Feb. 6-10, 1989, pp. 127-137.
Leung, T.Y.C. et al., “Query processing for temporal databases,” Proceedings of the International Conference On Data Engineering, Los Angeles, IEEE, Comp. Soc. Press, vol. Conf. 6, Feb. 5-9, 1990, pp. 200-208, XP010018207.
Matsuo, Y., “Keyword Extraction from a Single Document Using Word Co-Occurrence Statistical Information,” Proceedings of the Sixteenth International Florida Artificial Intelligence Research Society Conference, May 12-14, 2003, 5 pages, Retrieved from the Internet: <http://www.aaai.org/Papers/FLAIRS/2003/Flairs03-076.pdf> [Sep. 27, 2010].
Russell, S. J., et al., “Artificial Intelligence: A Modern Approach, 2nd Edition,” Pearson Education, Inc. 2003, pp. 733-739.
Srivastava, J. et al., “Web Usage Mining: Discovery and Application of Usage Patterns from Web Data,” ACM SIGKDD Explorations Newsletter, vol. 1, Issue 2, Jan. 2000, pp. 12-23, http://www.portal.acm.org/citation.cfm?is=846188, retrieved Sep. 21, 2010.
Stamatatos, E., et al., “Text Genre Detection Using Common Word Frequencies,” Proceedings of the 18th International Conference on Computational Linguistics, vol. 2, 2000, pp. 808-814, Retrieved from the Internet: <http://portal.acm/citation.cfm?id=992763> [Sep. 23, 2010].
vCenter Operations Manager 5.7.2 retrieved from vmware.com/support/pubs/vcops-pubs.html on Sep. 30, 2013, 2 pages.
VMware, Inc., “VMware vCenter Operations Manager Documentation, vCenter Operations Manager 5.7,” http://www.vmware.com/support/pubs/vcops-pubs.html, 1 page, Apr. 4, 2013.
Witten, et al., “Algorithms: the basic methods,” Data Mining Practical Machine Learning Tools and Techniques with Java Implementations, pp. 79-118, 2000.
Witten, et al., “Numeric Prediction,” Data Mining Practical Machine Learning Tools and Techniques with Java Implementations, pp. 201-227, 2000.
Witten, I. H., et al., “Inferring Rudimentary Rules,” Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, Morgan Kaufmann Publishers, 2000, pp. 80-82, 114-118, 210-218, 220-227, 329-335.
Chinese Patent Office, Application No. 200780044899.5, Official Communication, dated Mar. 9, 2011.
Chinese Patent Office, Application No. 200780044899.5, Official Communication, dated Feb. 23, 2012.
Chinese Patent Office, Application No. 201210293010.X Foreign Office Action dated Oct. 27, 2015.
Chinese Patent Office, Application No. 201210293010.X Pending Claims as of Oct. 27, 2015.
Chinese Patent Office, Application No. 201210293010.X, Official Communication, dated Oct. 6, 2014.
Chinese Patent Office, Application No. 201210293010.X, Official Communication, dated Apr. 9, 2015.
European Patent Office, Application No. 07853813.9, Extended Search Report dated Dec. 11, 2009.
European Patent Office, Application No. 07853813.9, Examination Report dated Mar. 15, 2010.
European Patent Office, Application No. 07853813.9, Foreign Official Communication dated Apr. 9, 2013.
European Patent Office, Application No. 07853813.9, Foreign Office Action dated Nov. 24, 2015.
European Patent Office, Application No. 07853813.9, Pending Claims as of Nov. 24, 2015.
European Patent Office, Application No. 07853813.9, Summons to Oral Proceedings dated Mar. 20, 2019.
European Patent Office, Application No. 12159074.9, Extended Search Report dated Jul. 4, 2012.
European Patent Office, Application No. 12159074.9, Foreign Office Action dated Jun. 23, 2015.
European Patent Office, Application No. 12159074.9, Pending Claims as of Jun. 23, 2015.
European Patent Office, Application No. 12159074.9, Pending Claims as of Dec. 6, 2017.
European Patent Office, Application No. 12159074.9, Summons to Oral Proceedings dated Dec. 6, 2017.
European Extended Search Report dated May 3, 2019, Application No. 18203898.4.
International Patent Office, Application No. PCT/US2006/029019, International Search Report and Written Opinion, dated Aug. 3, 2007.
International Patent Office, Application No. PCT/US2007/080616, International Search Report and Written Opinion, dated Jun. 12, 2008.
Japanese Patent Office, Application No. 2009-531632, Official Communication dated Jul. 27, 2012.
Korean Patent Office, Application No. 10-2009-7009378, Official Communication, dated Jun. 21, 2012.
Korean Patent Office, Application No. 10-2012-7009888, Official Communication, dated Feb. 15, 2013.
Bitincka, Ledion et al., “Optimizing Data Analysis with a Semi-structured Time Series Database,” self-published, first presented at “Workshop on Managing Systems via Log Analysis and Machine Learning Techniques (SLAML)”, Vancouver, British Columbia, October 3, 2010.
Blyth, Microsoft Operations Manager 2000, in 68 pages/slides.
Carasso, David, “Exploring Splunk,” published by CITO Research, New York, NY, Apr. 2012.
Chilukuri, Symptom Database Builder for Autonomic Computing, IEEE, International Conference on Autonomic and Autonomous Systems, Silicon Valley, CA, USA Jul. 19-21, 2006, in 11 pages.
Conorich, “Monitoring Intrusion Detection Systems: From Data to Knowledge,” Enterprise Security Architecture, May/Jun. 2004.
Cuppens, “Real Time Intrusion Detection,” RTO Meeting Proceedings 101, North Atlantic Treaty Organisation, Research and Technology Organisation, Papers presented at the RTO Information Systems Technology Panel (IST) Symposium held in Estoril, Portugal, May 27-28, 2002.
Debar, A revised taxonomy for intrusion-detection systems, IBM Research Division, Zurich Research Laboratory 2000, in 18 pages.
Dell, Inc., “Foglight For Virtualization, Free Edition,” http://www.quest.com/foglight-for-virtualization-free-edition/, 1 page, published prior to Apr. 30, 2013.
GFI Launches GFT LANguard Security Event Log Monitor 3.0, Intrado GlobeNewswire, Jun. 10, 2002.
GFI's New LANguard S.E.L.M. 4 Combats Intruders—Help Net Security, https://www.helpnetsecurity.com/2002/12/05/gfis-new-languard-selm-4-combats-intruders/. In two pages, 2002.
Girardin, et al., “A Visual Approach for Monitoring Logs,” USENIX Technical Program - Paper - Proceedings of the 12th Systems Administration Conference (LISA '98), in 13 pages.
Gomez, et al., “Using Lamport's Logical Clocks to Consolidate Log Files from Different Sources,” A. Bui et al. (Eds.): IICA 2005, LNCS 3908, pp. 126-133, 2006.
Gorton, “Extending Intrusion Detection with Alert Correlation and Intrusion Tolerance,” Thesis For The Degree of Licentiate of Engineering. Technical Report No. 27 L. Department of Computer Engineering Chalmers University of Technology, Goteborg, Sweden 2003.
Helmer, et al., Lightweight agents for intrusion detection, Department of Computer Science, Iowa State University 2003.
Jakobson, et al., “Real-time telecommunication network management: extending event correlation with temporal constraints,” Springer Science+Business Media Dordrecht 1995.
Kent, et al., “Recommendations of the National Institute of Standards and Technology,” Guide to Computer Security Log Management, Special Publication 800-92, Computer Security Division, Information Technology Laboratory, National Institute of Standards and Technology (NIST), Sep. 2006.
Kim, et al., “A Case Study on the Real-time Click Stream Analysis System,” CIS 2004, LNCS 3314, pp. 788-793, 2004.
Kwok, Investigating IBM Tivoli Intelligence ThinkDynamic Orchestrator (ITITO) And IBM Tivoli Provisioning Manager (ITPM), Electrical & Computer Engineering Department University of Waterloo, Ontario, Canada, Apr. 2006.
Luiijf, et al., Intrusion Detection Introduction and Generics, TNO Physics and Electronics Laboratory 2003, Session I: Real Time Intrusion Detection, Overview and Practical Experience, RTO Meeting Proceedings 101, Estoril, Portugal, May 27-28, 2002.
Manoel, et al., “Problem Determination Using Self-Managing Autonomic Technology,” IBM/Redbooks, Jun. 2005. (412 pages).
Microsoft Operations Manager, MOM 2005 Frequently Asked Questions, https://web.archive.org/web/20050830095611/http://www.microsoft.com/mom/evaluation/faqs/default.mspx. Published Aug. 25, 2004.
Microsoft Unveils New Microsoft Operations Manager 2000, Enterprise-Class Event and Performance Management Of Windows-Based Servers and Applications, May 9, 2001 in 4 pages.
Nguyen, et al., “Sense & Response Service Architecture (SARESA): An Approach towards a Real-time Business Intelligence Solution and its use for a Fraud Detection Application,” DOLAP '05, Nov. 4-5, 2005, Bremen, Germany. ACM 1-59593-162-7/05/0011.
SLAML 10 Reports, Workshop On Managing Systems via Log Analysis and Machine Learning Techniques, ;login: Feb. 2011 Conference Reports.
Splunk Cloud 8.0.2004 User Manual, available online, retrieved May 20, 2020 from docs.splunk.com.
Splunk Enterprise 8.0.0 Overview, available online, retrieved May 20, 2020 from docs.splunk.com.
Splunk Quick Reference Guide, updated 2019, available online at https://www.splunk.com/pdfs/solution-guides/splunk-quick-reference-guide.pdf, retrieved May 20, 2020.
Tierney, et al., “The NetLogger Methodology for High Performance Distributed Systems Performance Analysis,” IEEE HPDC-7'98, Jul. 28-31, 1998 at Chicago, Illinois.
Valeur, et al., “A Comprehensive Approach to Intrusion Detection Alert Correlation,” IEEE Transactions On Dependable and Secure Computing, vol. 1, No. 3, Jul.-Sep. 2004.
Wu, “Collecting Task Data in Event-Monitoring Systems,” University of Waterloo, Ontario, Canada 2004.
Yurcik, et al., “UCLog+ : A Security Data Management System for Correlating Alerts, Incidents, and Raw Data From Remote Logs,” Escuela Superior Politecnica del Litoral (ESPOL) University of Illinois at Urbana-Champaign, Jul. 2006.
Related Publications (1)
Number Date Country
20220156244 A1 May 2022 US
Provisional Applications (1)
Number Date Country
60828283 Oct 2006 US
Continuations (7)
Number Date Country
Parent 17125807 Dec 2020 US
Child 17589818 US
Parent 15963740 Apr 2018 US
Child 17125807 US
Parent 15661260 Jul 2017 US
Child 15963740 US
Parent 15420938 Jan 2017 US
Child 15661260 US
Parent 14611170 Jan 2015 US
Child 15420938 US
Parent 13353135 Jan 2012 US
Child 14611170 US
Parent 11868370 Oct 2007 US
Child 13353135 US