Many applications generate real-time streaming data, such as online financial transactions, IT operations management, and sensor networks. This streaming data has many dimensions (time, location, objects), and each dimension can be hierarchical in nature.
Given such streaming data, it is often desirable to analyze multiple pattern queries that exist at various abstraction levels in real-time.
Example embodiments include apparatus, systems, and methods that provide event pattern analysis over multi-dimensional data in real-time in order to compute one hierarchical event pattern query from another. A cost for this computation is also generated.
Example embodiments analyze vast amounts of multi-dimensional sequence data being streamed into data warehouses or databases. For example, many data warehouses include large amounts of multi-dimensional application data that exhibits logical sequential ordering among individual data items, such as radio-frequency identification (RFID) data and sensor data. Example embodiments utilize an E-Cube to integrate complex event processing (CEP) and online analytical processing (OLAP) techniques to provide pattern analysis functionalities. An E-Cube model is composed of cuboids that associate patterns and dimensions at certain abstraction levels. As one example, the E-Cube differs from a traditional data cube in that the E-Cube aggregates queries over dimensions and patterns. This model leverages OLAP techniques in databases to allow users to navigate or explore the data at different abstraction levels while simultaneously supporting real-time multi-dimensional sequence data analysis. Furthermore, CEP is used for pattern matching in a variety of applications, ranging from RFID tracking for supply chain management to real-time intrusion detection. Example embodiments use E-Cubes to integrate OLAP and CEP techniques for timely real-time multi-dimensional pattern analysis over event streams.
For purposes of illustration, an example embodiment of the E-Cube is discussed in connection with hurricane tracking. Example embodiments, however, can be utilized for pattern detection among event streams in numerous other applications. By way of example, numerous applications generate real-time streaming data, such as applications associated with online financial transactions, information technology (IT) operations management, sensor networks, radio frequency identification (RFID) technology, etc. It is often desirable to analyze this streaming data and determine multiple pattern queries that exist at different abstraction levels in real-time. Consider an RFID tracking system used to track mass movement of people and goods during natural disasters. Terabytes of RFID data could be generated by such a tracking system. Facing a huge volume of RFID data, emergency personnel need to perform pattern detection on various dimensions at different granularities in real-time. In particular, one may need to monitor the movement of people and the traffic patterns of needed resources (e.g., water and blankets) at different levels of abstraction to ensure fast and optimized relief efforts.
Example embodiments utilize an E-Cube to process and query large volumes of streaming sequence data in real-time at various abstraction levels, such as the data being generated by the RFID tracking system 100. The E-Cube processes workloads of complex pattern detection queries at multiple levels of abstraction over extremely high-speed event streams while making effective use of central processing unit (CPU) resources. Systems and methods utilize the E-Cube to compute one hierarchical event pattern query from another hierarchical event pattern query and determine a cost (such as a CPU cost) of such an evaluation.
Example embodiments utilize an E-Cube hierarchy to build a directed acyclic graph H where each node corresponds to a pattern query qi and each edge corresponds to a pair-wise refinement relationship between two pattern queries. Each directed edge <qi, qj> is labeled with the label “concept” if qi <c qj, the label “pattern” if qi <p qj, or both labels, to indicate the refinement relationship between the two queries qi and qj.
A pattern query qi can be rolled up into another pattern query qj by either changing one or more positive (negative) event types to a coarser (finer) level along the event concept hierarchy of that event type, changing the pattern to a coarser level, or both.
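The labeled graph described above can be pictured with a minimal sketch. The class name `ECubeHierarchy`, its methods, and the query names `q1` to `q3` are hypothetical and chosen only for illustration; they are not taken from the embodiments.

```python
from dataclasses import dataclass, field


@dataclass
class ECubeHierarchy:
    """Directed graph H: nodes are pattern queries, edges are refinements."""

    # edges[(qi, qj)] holds the set of labels on the directed edge <qi, qj>.
    edges: dict = field(default_factory=dict)

    def add_refinement(self, qi, qj, label):
        # Only the two refinement labels named in the text are accepted.
        if label not in ("concept", "pattern"):
            raise ValueError("label must be 'concept' or 'pattern'")
        self.edges.setdefault((qi, qj), set()).add(label)

    def labels(self, qi, qj):
        # An absent edge simply has no labels.
        return self.edges.get((qi, qj), set())


# Hypothetical workload: q1 refines to q2 by pattern, and to q3 by both.
h = ECubeHierarchy()
h.add_refinement("q1", "q2", "pattern")
h.add_refinement("q1", "q3", "concept")
h.add_refinement("q1", "q3", "pattern")
```

An edge carrying both labels corresponds to a simultaneous concept and pattern refinement between the two queries.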
With example embodiments, an E-Cube is an E-Cube hierarchy where each pattern query is associated with its query result instances. Each individual pattern query along with its result instances in E-Cube is called an E-cuboid.
Example embodiments extend OLAP operations with pattern-drill-down, pattern-roll-up, concept-roll-up, and concept-drill-down for pattern queries in an E-Cube hierarchy. OLAP-like operations on E-Cubes allow users to navigate from one E-cuboid to another in the E-Cube. As one example, the operation pattern-drill-down(qm, list[Typeij, Poskj]) applied to qm inserts a list of n event types with the event type Typeij into the position Poskj of qm (1≤j≤n). As another example, the operation concept-drill-down(qm, list[(Typemj, Typenj), Poskj]) applied to qm drills down a list of event types from Typemj to Typenj (Typemj>cTypenj) at the position Poskj of qm (1≤j≤n). As yet another example, the operation pattern-roll-up(qm, list[Typeij, Poskj]) applied to qm deletes a list of n event types with the event type Typeij from the position Poskj of qm (1≤j≤n). As yet another example, the operation concept-roll-up(qm, list[(Typemj, Typenj), Poskj]) applied to qm rolls up a list of event types from Typemj to Typenj (Typemj<cTypenj) at the position Poskj of qm (1≤j≤n).
These concepts are illustrated with regard to
The results of pattern-drill-down (pattern-roll-up) can be computed by a general-to-specific (specific-to-general) reuse with only pattern changes. The results of concept-drill-down (concept-roll-up) can be computed by a general-to-specific (specific-to-general) evaluation with only concept changes.
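The four navigation operations can be sketched on queries represented as plain lists of event type names. This is only an illustration under stated assumptions: positions are taken to be 0-based list indices, the concept hierarchy `PARENT` and its type names are hypothetical, and negative event types are not modeled.

```python
# Hypothetical event concept hierarchy, child type -> parent type.
PARENT = {"Dallas": "Texas", "Austin": "Texas"}


def is_finer(fine, coarse):
    """True if `fine` lies strictly below `coarse` in the concept hierarchy."""
    while fine in PARENT:
        fine = PARENT[fine]
        if fine == coarse:
            return True
    return False


def pattern_drill_down(query, inserts):
    """Insert (event_type, position) pairs into a copy of the query."""
    result = list(query)
    # Work from the highest position down so earlier positions stay valid.
    for event_type, pos in sorted(inserts, key=lambda i: i[1], reverse=True):
        result.insert(pos, event_type)
    return result


def pattern_roll_up(query, deletes):
    """Delete (event_type, position) pairs from a copy of the query."""
    result = list(query)
    for event_type, pos in sorted(deletes, key=lambda d: d[1], reverse=True):
        if result[pos] != event_type:
            raise ValueError("type at position does not match")
        del result[pos]
    return result


def concept_drill_down(query, changes):
    """Replace a coarse type by a finer one at each given position."""
    result = list(query)
    for (coarse, fine), pos in changes:
        if result[pos] != coarse or not is_finer(fine, coarse):
            raise ValueError("not a valid drill-down")
        result[pos] = fine
    return result


def concept_roll_up(query, changes):
    """Replace a fine type by a coarser one at each given position."""
    result = list(query)
    for (fine, coarse), pos in changes:
        if result[pos] != fine or not is_finer(fine, coarse):
            raise ValueError("not a valid roll-up")
        result[pos] = coarse
    return result
```

In this sketch, each roll-up undoes the corresponding drill-down when given the same type and position arguments.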
Hierarchical instance stacks (HIS) hold event instances processed by the E-Cube. HIS provides shared storage of events across different concept and pattern abstraction levels. Each instance is stored in a single stack, namely the stack of the finest event type in the E-Cube hierarchy, even though it may semantically match multiple event types in an event type concept hierarchy. HIS is populated with event instances as the stream data is consumed. The stack-based query evaluation can be extended to access event instances in hierarchical stacks instead of flat stacks.
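As a rough illustration of this storage scheme, the sketch below keeps one stack per finest event type and answers a request for a coarser type by walking down the concept hierarchy. The class name, the `children` mapping, and the type names are hypothetical, and event instances are reduced to bare timestamps.

```python
from collections import defaultdict


class HierarchicalInstanceStacks:
    """One stack per finest event type; coarser types read their descendants."""

    def __init__(self, children):
        # children maps a coarse type to its direct sub-types (assumed input).
        self.children = children
        self.stacks = defaultdict(list)

    def push(self, finest_type, timestamp):
        # Each instance is stored exactly once, under its finest type.
        self.stacks[finest_type].append(timestamp)

    def instances(self, event_type):
        # Collect instances of this type and of every finer type below it.
        found = list(self.stacks.get(event_type, []))
        for child in self.children.get(event_type, []):
            found.extend(self.instances(child))
        return sorted(found)


his = HierarchicalInstanceStacks({"Texas": ["Dallas", "Austin"]})
his.push("Dallas", 1)
his.push("Austin", 2)
his.push("Dallas", 3)
```

A query on the coarse type sees all three instances, while a query on a fine type sees only its own stack, so no instance is duplicated across abstraction levels.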
Example embodiments utilize E-Cubes to produce query results quickly and improve computational efficiency by sharing results among queries in a unified query plan. Instead of processing each pattern in the E-Cube hierarchy independently using a stack-based strategy, example embodiments compute one pattern from other previously computed patterns within the E-Cube hierarchy.
Concept and pattern relationships between queries identified by the E-Cube model are used to promote reuse and to reduce redundant computations among queries.
Given a workload of pattern queries, the E-Cube model translates the pattern queries into an E-Cube hierarchy H, and then designs a strategy to determine an optimal evaluation ordering for the queries in the E-Cube hierarchy such that the total execution cost is minimized. To achieve this objective of finding an optimal overall execution strategy for completing the workload captured by the E-Cube hierarchy, example embodiments consider three choices when evaluating each query qi in H as follows:
A parent-child relationship can be due to either pattern changes or concept changes. The model considers two orthogonal aspects, namely, (1) abstraction direction: drill down vs. roll up in the E-Cube hierarchy, and (2) refinement type: pattern or concept refinement.
The query reuse can be done in the following ways:
1. General-to-specific with only pattern changes;
2. General-to-specific with only concept changes;
3. General-to-specific with simultaneous pattern and concept changes;
4. Specific-to-general with only pattern changes;
5. Specific-to-general with only concept changes; and
6. Specific-to-general with simultaneous pattern and concept changes.
In order to assist in discussing the example use cases, definitions are provided for the following terms:
(1) Ccompute(qi|qj) is the evaluation cost for query qi based on the evaluation results for qj.
(2) Ccompute(qi) is the cost of computing results for a query qi independently.
(3) |Si| is the number of tuples of type Ei that are in a time window TWP. This can be estimated as RateE*TWP*PE.
(4) TWP is the time window specified in a pattern query P.
(5) RateE is the rate of primitive events for the event type E.
(6) PE is the selectivity of the single-class predicates for event class E. This is the product of selectivity of each single-class predicate of E.
(7) PtEi, Ej is the selectivity of the implicit time predicate of subsequence (Ei, Ej). The default value is set to ½.
(8) PEi, Ej is the selectivity of multi-class predicates between event class Ei and Ej. If Ei and Ej do not have predicates, this value is set to 1.
(9) |RE| is the number of results for the composite event E.
(10) Ctype is the unit cost to check type of one event instance.
(11) qi.length is the number of event types in a query qi.
(12) NumE is the number of total events received so far.
(13) NumRE is the number of relevant events received of the types in query set Q.
(14) Caccess is the cost of accessing one event.
(15) Capp is the unit cost of appending one event to a stack and setting up pointers for the event.
(16) Cct is the unit cost to compare a timestamp of one event instance with another one.
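As a small worked illustration of definitions (3) and (6), the estimate |Si| = RateE*TWP*PE can be computed as follows. The function names and the sample numbers are hypothetical.

```python
from math import prod


def single_class_selectivity(predicate_selectivities):
    """PE: product of the selectivities of the single-class predicates of E."""
    # With no predicates the product is 1, i.e. nothing is filtered out.
    return prod(predicate_selectivities)


def estimated_stack_size(rate, time_window, predicate_selectivities=()):
    """|S_i| estimated as RateE * TWP * PE."""
    return rate * time_window * single_class_selectivity(predicate_selectivities)


# Hypothetical inputs: 100 events per time unit, a window of 10 time units,
# and two single-class predicates that each pass half of the events.
size = estimated_stack_size(100, 10, [0.5, 0.5])
```

With these sample inputs, a quarter of the 1000 events arriving in the window are expected to remain on the stack.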
Reuse Case 1: General-to-Specific with Pattern Changes
Considering only pattern changes, the computation of the lower level query can be optimized by reusing results from the upper level query. Given queries qi and qj (qi>pqj) in a pattern hierarchy and the results of qi, the results for qj can be constructed in one of two sharing cases. In case I (differ by positive types), the results of qi are joined with the events of the positive types listed in qj but not in qi. In case II (differ by negative types), the results of qi that do not satisfy the sequence constraints formed by the negative event types listed in qj but not in qi are filtered out. The pseudo-code for general-to-specific evaluation guided by the pattern hierarchy is shown below:
For case I above, the costs for the compute operation depend on two factors, namely (1) if pointers exist between joining events and (2) if the re-used result is ordered or not on the joining event type. Assume two pattern queries qi=SEQ(Ei, Ej, Ek) and qj=SEQ(Ei, Ej, Ek, Em, En) differ by two positive event types Em and En. Also, assume pointers exist between events of type Em and En. To compute qj, results are constructed for SEQ(Em, En) by an efficient stack-based join. These results will by default be sorted by En's timestamp. These results are then joined with qi results using the most appropriate join method.
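The join step of case I can be sketched for the example in which the extra positive types are appended at the end of the pattern. This is a simplified nested-loop join over tuples of timestamps, not the join method the embodiments would select; the function name `seq_join` and the window handling are assumptions made for illustration.

```python
def seq_join(left_results, right_results, window):
    """Join q_i results with SEQ(E_m, E_n) results into q_j results.

    Each result is a tuple of timestamps in sequence order. A pair joins
    when the left result ends before the right result starts and the
    combined sequence still fits inside the time window.
    """
    joined = []
    for left in left_results:
        for right in right_results:
            in_order = left[-1] < right[0]
            in_window = right[-1] - left[0] <= window
            if in_order and in_window:
                joined.append(left + right)
    return joined


# Hypothetical timestamps: one SEQ(Ei, Ej, Ek) result and three
# SEQ(Em, En) candidates, with a window of 10 time units.
qi_results = [(1, 2, 3)]
em_en_results = [(4, 5), (2, 6), (4, 20)]
qj_results = seq_join(qi_results, em_en_results, window=10)
```

Only the first candidate survives: the second starts before the reused result ends, and the third falls outside the window.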
The definitions provided above show the factors used in the cost estimation in Equation 1 shown below:
For case II, assume two pattern queries qi=SEQ(Em, En) and qj=SEQ(Em, !Ek, En) differ by one negative event type Ek. Each qi result can be returned for qj if no Ek events are found within the corresponding interval between its Em and En events. The cost formula is shown in Equation 2 below:
$$C_{compute}(q_j \mid q_i).gp = |S_m| * |S_n| * Pt_{E_m,E_n} * P_{E_m,E_n} * (1 - Pt_{E_m,E_k} * P_{E_k,E_n})$$
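A minimal sketch of case II follows, assuming qi results are pairs of (Em timestamp, En timestamp) and the negative Ek events are given as a list of timestamps. The function names are hypothetical; the cost function simply evaluates the expression of Equation 2 for supplied parameter values.

```python
from bisect import bisect_right


def filter_negation(results, negative_timestamps):
    """Keep the (t_m, t_n) results with no negative event strictly between."""
    neg = sorted(negative_timestamps)
    kept = []
    for t_m, t_n in results:
        # Index of the first negative event that occurs strictly after t_m.
        idx = bisect_right(neg, t_m)
        if idx == len(neg) or neg[idx] >= t_n:
            kept.append((t_m, t_n))
    return kept


def cost_gp_negative(s_m, s_n, pt_mn, p_mn, pt_mk, p_kn):
    """Evaluate Equation 2 for the given stack sizes and selectivities."""
    return s_m * s_n * pt_mn * p_mn * (1 - pt_mk * p_kn)


# Hypothetical data: two SEQ(Em, En) results and two Ek events.
kept = filter_negation([(1, 5), (6, 9)], [3, 9])
cost = cost_gp_negative(10, 20, 0.5, 1.0, 0.5, 0.5)
```

The result (1, 5) is dropped because an Ek event at time 3 falls inside its interval, while (6, 9) is kept because the Ek event at time 9 does not fall strictly between 6 and 9.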
Besides this computation sharing, online pattern filtering can also be achieved and thus potentially save the computation costs of qj completely (Ccompute(qj)). Specifically, if a pattern qi is at a coarser level than a pattern qj, and a matching attempt with qi fails, then there is no need to carry out the evaluation for qj. That is, qj will also fail since it is stricter.
Example 1: Given pattern queries q3 at 130, q6 at 160, and q7 at 170 in
Reuse Case 2: General-to-Specific with Concept Changes
Considering only concept changes, composite results constructed from events of the highest event concept level are a super-set of the results of the pattern queries below it in an E-Cube hierarchy. The lower level query can be computed by reusing and further filtering the upper query results.
Given two pattern queries qi and qj with only concept changes (qi>c qj) on positive event types, a cost model is formulated in Equation 3 shown below:
$$C_{compute}(q_j \mid q_i).gc = |R_{q_i}| * C_{type} * q_i.length$$
For each result of qi, the event types for the constructed composite event instances are interpreted to determine which of them indeed match a given lower level type. The strategy becomes less efficient as the number of results to be re-interpreted increases.
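The re-interpretation step can be sketched as a filter over the upper-level results. The sketch assumes each event carries its finest type, uses a hypothetical concept hierarchy `PARENT`, and counts one type check per event so that the counter mirrors the |Rqi|*qi.length factor of Equation 3.

```python
# Hypothetical event concept hierarchy, child type -> parent type.
PARENT = {"Dallas": "Texas", "Austin": "Texas"}


def is_a(event_type, target_type):
    """True if event_type equals target_type or lies below it."""
    while event_type != target_type and event_type in PARENT:
        event_type = PARENT[event_type]
    return event_type == target_type


def reinterpret(results, target_types):
    """Filter upper-level results down to those matching the finer types.

    Each result is a tuple of (finest_type, timestamp) events. Returns the
    kept results and the number of type checks performed.
    """
    kept, type_checks = [], 0
    for result in results:
        # Check every event in the result, one C_type unit per event.
        matches = [is_a(ev_type, tgt)
                   for (ev_type, _), tgt in zip(result, target_types)]
        type_checks += len(matches)
        if all(matches):
            kept.append(result)
    return kept, type_checks


# Hypothetical results of SEQ(Texas, T), re-interpreted for SEQ(Dallas, T).
upper = [(("Dallas", 1), ("T", 2)), (("Austin", 3), ("T", 4))]
lower, checks = reinterpret(upper, ("Dallas", "T"))
```

Every upper-level result is inspected even if it is discarded, which is why the approach becomes less attractive as the number of results grows.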
Example 2: In
Consider two pattern queries qi=SEQ(Em, !Ek1, En) and qj=SEQ(Em, !Ek, En) with only concept changes (qi>cqj) on negative event types, where Ek is a super concept of Ek1 in the event concept hierarchy. To facilitate query sharing, qj is rewritten into the expression shown in Equation 4 below:
$$\mathrm{SEQ}(E_m, !E_k, E_n) = \mathrm{SEQ}(E_m, !E_{k1} \wedge \ldots \wedge !E_{kn}, E_n)$$
Each qi result can be returned for qj if no Ek2, Ek3, . . . , Ekn events are found at the specified position in the query.
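This filtering can be sketched as follows, assuming qi results are pairs of (Em timestamp, En timestamp) that already exclude Ek1 events, and that the remaining sub-concepts Ek2 through Ekn are supplied as a mapping from type name to timestamps. The function and type names are hypothetical.

```python
def refine_negative_concept(qi_results, sibling_events):
    """Keep q_i results with no sibling-type event between E_m and E_n.

    sibling_events maps each remaining sub-concept of E_k (E_k2 .. E_kn)
    to the timestamps of its instances.
    """
    kept = []
    for t_m, t_n in qi_results:
        blocked = any(
            t_m < t < t_n
            for stamps in sibling_events.values()
            for t in stamps
        )
        if not blocked:
            kept.append((t_m, t_n))
    return kept


# Hypothetical data: two q_i results and events of two sibling types.
qj_results = refine_negative_concept(
    [(1, 5), (6, 9)],
    {"Ek2": [3], "Ek3": [10]},
)
```

Because the reused results already satisfy the negation on Ek1, only the sibling types need to be checked to obtain the results for the coarser negation on Ek.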
Example 3: In
Reuse Case 3: General-to-Specific with Concept & Pattern Refinement
Given qi and qj in an E-Cube hierarchy with simultaneous concept and pattern changes (qi>cpqj), the cost to compute the child qj from the parent qi corresponds to Equation 5 below:
The idea is to consider this as a two-step process that composes the strategies for concept and then pattern-based reuse (or, vice versa) effectively with minimal cost.
Reuse Case 4: Specific-to-General with Pattern Changes
Given queries qi and qj (qi>pqj) in a pattern hierarchy and the results of qj, qi can be computed by reusing the qj results and unioning them with the delta results not captured by qj. The compute operation includes two key factors, namely, result reuse and delta result computation. The pseudo-code for the specific-to-general evaluation is below:
In general, assume qi=SEQ(Ei, Ej, Ek) is refined by an extra event Em into qj=SEQ(Ei, Em, Ej, Ek). qj results are reused for qi and SEQ(Ei, !Em, Ej, Ek) results are the delta results. The cost model is given in Equation 6 below:
$$C_{compute}(q_i \mid q_j).sp = |R_{q_j}| * C_{type} * q_j.length + |S_k| * |S_j| * Pt_{E_j,E_k} * P_{E_j,E_k} + |S_k| * |S_j| * Pt_{E_j,E_k} * P_{E_j,E_k} * |S_i| * P_{E_i,E_j} * P_{E_i,E_j} * (1 - P_{E_i,E_j} * P_{E_m,E_j} * P_{E_i,E_j} * P_{E_m,E_j})$$
This specific-to-general computation for a pattern hierarchy needs to check the non-existence of a possibly long intermediate pattern during delta result computation when two queries differ by more than one event type. In some cases, these overhead costs may outweigh the benefits of such partial reuse. When two queries differ by negative event types, the specific-to-general method is similar to the above, except that delta result computation must also compute the additional sequence results that were filtered out of the specific query due to the existence of events of negative types.
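The two factors of the positive-type case, result reuse and delta result computation, can be sketched for qi=SEQ(Ei, Ej, Ek) and qj=SEQ(Ei, Em, Ej, Ek). The sketch is a brute-force illustration over timestamps that ignores time windows and predicates; the function name is hypothetical, and it stands in for, rather than reproduces, the stack-based evaluation of the embodiments.

```python
from itertools import product


def specific_to_general(qj_results, s_i, s_m, s_j, s_k):
    """Compute SEQ(Ei, Ej, Ek) from SEQ(Ei, Em, Ej, Ek) results plus a delta.

    qj_results holds (t_i, t_m, t_j, t_k) tuples; s_i, s_m, s_j and s_k are
    the timestamps of the stored instances of each event type.
    """
    # Result reuse: drop the E_m event from every q_j result.
    reused = {(t_i, t_j, t_k) for t_i, _t_m, t_j, t_k in qj_results}

    # Delta results: SEQ(Ei, !Em, Ej, Ek), i.e. sequences q_j never built
    # because no E_m event occurred between the E_i and E_j events.
    delta = set()
    for t_i, t_j, t_k in product(s_i, s_j, s_k):
        if t_i < t_j < t_k and not any(t_i < t_m < t_j for t_m in s_m):
            delta.add((t_i, t_j, t_k))

    return sorted(reused | delta)


# Hypothetical instances: the E_i event at time 4 has no E_m event before E_j.
qi_results = specific_to_general([(1, 2, 5, 6)], [1, 4], [2], [5], [6])
```

Here (1, 5, 6) comes from reuse and (4, 5, 6) from the delta computation, which together equal a direct evaluation of SEQ(Ei, Ej, Ek) on the same instances.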
Example 4:
ReuseSubpatternResult. q3 is computed from the results of q6 by subtracting subsequences composed of positive event types G, A and T. For example, in
ComputeDeltaResults. Some sequences may not have been constructed for q6 due to the non-existence of events of type D. Such sequence results, however, are constructed for q3. In this case, each instance of type T has one pointer to an A event for q3 and another pointer to a D event for q6. Hence, for a T event that does not point to any D event, an inference is made that a sequence involving this T event would not have been constructed for q6. This T event thus should trigger its sequence construction for q3 by a stack-based join. If one T event points to both an A and a D event, then the A and D events may still not satisfy the time constraints. If the timestamp of the A event is greater than the timestamp of the D event, sequence construction is triggered by such T event for q3. In
Reuse Case 5: Specific-to-General with Concept Changes
The result set of a higher concept abstraction level is a super set of the results of pattern queries below it. Thus an upper level query can be computed in part by reusing the lower level query results. The lower level pattern query is computed first. Then these results are also returned for the upper level pattern. In addition, the events of the higher event type concept level not captured by the lower queries are also constructed. Such specific-to-general computation requires no extra interpretation costs as compared to the general-to-specific evaluation. Given two pattern queries qi and qj with only concept changes (qi>cqj), a cost model is formulated by Equation 7 below:
$$C_{compute}(q_i \mid q_j).sc = C_{compute}(q_i) - C_{compute}(q_j)$$
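The reuse behind Equation 7 can be sketched for a two-event pattern in which only the first event type is rolled up. The sketch assumes hypothetical type names, treats instances as bare timestamps, and ignores time windows; only the instances of the coarser type that the lower-level query did not cover are joined afresh.

```python
def seq2(first, second):
    """All (a, b) timestamp pairs with a occurring before b."""
    return [(a, b) for a in first for b in second if a < b]


def upper_from_lower(lower_results, uncovered_first, second):
    """Upper-level results: reused lower results plus newly built ones."""
    return sorted(lower_results + seq2(uncovered_first, second))


def cost_sc(cost_qi, cost_qj):
    """Equation 7: cost of q_i given that q_j has already been computed."""
    return cost_qi - cost_qj


# Hypothetical instances: "Dallas" and "Austin" both roll up to "Texas".
dallas, austin, t = [1, 5], [2], [3, 6]
lower = seq2(dallas, t)                      # results of SEQ(Dallas, T)
upper = upper_from_lower(lower, austin, t)   # results of SEQ(Texas, T)
```

The reused results need no re-interpretation, which is why the remaining cost is simply the independent cost of the upper query minus the cost already paid for the lower query.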
Example 5:
Reuse Case 6: Specific-to-General with Concept & Pattern
Given qi and qj in an E-Cube hierarchy with simultaneous concept and pattern changes (qi>cpqj), one intermediate query p is found with either only concept or pattern changes from qj so that query p minimizes Equation 8 below:
As above, results are computed in two stages from qj to p and from p to qi by using specific-to-general evaluation with first only pattern and then only concept changes or vice versa effectively with minimal cost.
Example embodiments thus allow for results sharing across queries and also include a cost model to compute the cost of such execution. These costs can be input to an optimizer that can then create an optimal plan to execute a large set of queries.
According to block 400, event patterns are analyzed in multi-dimensional data.
According to block 410, based on analysis of the event patterns, a hierarchical event pattern query is computed from another hierarchical event pattern query.
One example embodiment utilizes an E-Cube to perform the computations. For example, an E-Cube model is built of multi-dimensional data with cuboids that aggregate the multi-dimensional data over both patterns and dimensions. The E-Cube model integrates both complex event processing (CEP) and online analytical processing (OLAP) techniques to perform pattern analysis over event streams in the multi-dimensional data.
According to block 420, the hierarchical event pattern query is executed on the multi-dimensional data.
After the query is executed, results of the query are provided to a computer and/or user. For example, the results of the query are displayed on a display, stored in a computer, or provided to another software application.
In one embodiment, the processing unit includes a processor (such as a central processing unit, CPU, microprocessor, application-specific integrated circuit (ASIC), etc.) for controlling the overall operation of the memory 530 (such as random access memory (RAM) for temporary data storage, read only memory (ROM) for permanent data storage, and firmware). The processing unit 550 communicates with memory that stores instructions to execute or assist in executing methods discussed herein.
Blocks discussed herein can be automated and executed by a computer or electronic device. The term “automated” means controlled operation of an apparatus, system, and/or process using computers and/or mechanical/electrical devices without the necessity of human intervention, observation, effort, and/or decision.
The methods in accordance with example embodiments are provided as examples, and examples from one method should not be construed to limit examples from another method. Further, methods discussed within different figures can be added to or exchanged with methods in other figures. Further yet, specific numerical data values (such as specific quantities, numbers, categories, etc.) or other specific information should be interpreted as illustrative for discussing example embodiments. Such specific information is not provided to limit example embodiments.
In some example embodiments, the methods illustrated herein and data and instructions associated therewith are stored in respective storage devices, which are implemented as computer-readable and/or machine-readable storage media, physical or tangible media, and/or non-transitory storage media. These storage media include different forms of memory including semiconductor memory devices such as DRAM or SRAM, Erasable and Programmable Read-Only Memories (EPROMs), Electrically Erasable and Programmable Read-Only Memories (EEPROMs) and flash memories; magnetic disks such as fixed, floppy and removable disks; other magnetic media including tape; and optical media such as Compact Disks (CDs) or Digital Versatile Disks (DVDs). Note that the instructions of the software discussed above can be provided on one computer-readable or machine-readable storage medium, or alternatively, can be provided on multiple computer-readable or machine-readable storage media distributed in a large system having possibly plural nodes. Such computer-readable or machine-readable medium or media is (are) considered to be part of an article (or article of manufacture). An article or article of manufacture can refer to any manufactured single component or multiple components.