1. Field of the Invention
The present disclosure relates to a method and apparatus for generating a workflow graph. More particularly, the present disclosure relates to a computer-based method and apparatus for automatically identifying a workflow graph from empirical data of a process using probabilistic analysis.
2. Background Information
Over time, individuals and organizations implicitly or explicitly develop processes to support complex, repetitive activities. In this context, a process is a set of tasks that must be completed to reach a specified goal. Examples of goals include manufacturing a device, hiring a new employee, organizing a meeting, completing a report, and others. Companies are strongly motivated to optimize business processes along one or more of several possible dimensions, such as time, cost, or output quality.
Many business processes can be modeled with workflows. As used herein, a workflow is a model of a set of tasks with order constraints that govern the sequence of execution of the tasks. A workflow can be represented with a workflow graph, which, as referred to herein, is a representation of a workflow as a directed graph, where nodes represent tasks and edges represent order constraints and/or task dependencies. Traditionally, in business processes where workflows are utilized, the workflows are designed beforehand with the intent that tasks will be carried out in accordance with the workflow. However, businesses often carry out their activities without the benefit of a formal workflow to model their processes. In such instances, development of a workflow could provide a better understanding of the business processes and provide a step towards optimization of those processes. However, development of a workflow by hand based on human observations can be a formidable task.
U.S. Pat. No. 6,038,538 to Agrawal, et al., discloses a computer-based method and apparatus that constructs models from logs of past, unstructured executions of given processes using transitive reduction of directed graphs.
The present inventors have observed a further need for a computer-implemented method and system for identifying a workflow based on an analysis of the underlying empirical data associated with the execution of tasks in actual processes used in business, manufacturing, testing, etc., that is straightforward to implement and that operates efficiently.
The present disclosure describes systems and methods that can automatically generate a workflow and an associated workflow graph from empirical data of a process using a layer-building approach that is straightforward to implement and that executes efficiently. The systems and methods described herein are useful for, among other things, providing workflow graphs to improve the understanding of processes used in business, manufacturing, testing, etc. Improved understanding of such processes can facilitate optimization of those processes. For example, given a workflow model for a given process discovered as disclosed herein, the tasks of the workflow model can be adjusted (e.g., orders and/or dependencies of tasks can be changed) and the impact of such adjustments can be evaluated based on simulation data.
According to one exemplary embodiment, a method for generating a workflow graph comprises obtaining data corresponding to multiple instances of a process, the process including a set of tasks, the data including information about order of occurrences of the tasks; analyzing the occurrences of the tasks to identify order constraints among the tasks; partitioning a set of nodes representing tasks into a series of subsets, such that no node of a given subset is constrained to precede any other node of the given subset unless said pair of nodes are conditionally independent given one or more nodes in an immediately preceding subset, and such that no node of a following subset is constrained to precede any node of the given subset; and connecting one or more nodes of each subset to one or more nodes of each adjacent subset with an edge based upon the order constraints and based upon conditional independence tests applied to subsets of nodes, thereby constructing a workflow graph representative of the process wherein nodes represent tasks and nodes are connected by edges.
According to another exemplary embodiment, a system for generating a workflow graph comprises a processing system and a memory coupled to the processing system, wherein the processing system is configured to execute the above-noted steps.
According to another exemplary embodiment, a computer-readable medium comprises executable instructions for generating a workflow graph, wherein the executable instructions comprise instructions adapted to cause a processing system to execute the above-noted steps.
The present disclosure describes exemplary methods and systems for finding an underlying workflow of a process and for generating a corresponding workflow graph, given a set of cases, where each case is a particular instance of the process represented by a set of tasks. In addition to deriving a workflow from scratch, the approach can be used to compare an abstract process design or specification to the derived empirical workflow (i.e., a model of how the process is actually carried out).
Graph Model Overview
To illustrate some basic concepts and terminology utilized in connection with the graph model associated with the subject matter disclosed herein, a simple example will be described. Input data used for identifying a workflow is a set of cases (also referred to as a set of instances). Each case (or instance) is a particular observation of an underlying process, represented as an ordered sequence of tasks. A task as referred to herein is a function to be performed. A task can be carried out by any entity, e.g., humans, machines, organizations, etc. Tasks can be carried out manually, with automation, or with a combination thereof. A task that has been carried out is referred to herein as an occurrence of the task. For example, two cases (C1 and C2) for a process of ordering and eating a meal from a fast food restaurant might be:
(C1) stand in line, order food, order drink, pay bill, receive meal order, eat meal at restaurant (in that order);
(C2) stand in line, order drink, order food, pay bill, receive meal order, eat meal at home (in that order).
Data corresponding to a collection of cases may be referred to herein as a case log file, a case log, or a workflow log.
As reflected above, data for cases can be represented as triples (instance, task, time). In this example, triples are sorted first by instance, then by time. Exact time need not be represented; sequence order reflecting relative timing is sufficient (as illustrated in this example). Of course, actual time could be represented if desired, and further, both a start time and an end time could be represented in a case log.
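By way of illustration only, the triple representation described above can be sketched as follows. The variable and function names (`triples`, `group_cases`) and the abbreviated sample data are hypothetical and are not taken from any particular embodiment; the time field here is simply a sequence index.

```python
from collections import defaultdict

# Hypothetical case log as (instance, task, time) triples.
# The time value is only a relative sequence index, not a clock time.
triples = [
    ("C2", "order drink", 2),
    ("C1", "stand in line", 1),
    ("C1", "order food", 2),
    ("C2", "stand in line", 1),
    ("C1", "order drink", 3),
    ("C2", "order food", 3),
]

def group_cases(triples):
    """Sort triples by instance, then by time, and return
    a dict mapping each instance to its ordered list of tasks."""
    cases = defaultdict(list)
    for instance, task, _time in sorted(triples, key=lambda t: (t[0], t[2])):
        cases[instance].append(task)
    return dict(cases)

cases = group_cases(triples)
```

Sorting on the pair (instance, time) yields one ordered task sequence per case, which is the form assumed by the order analysis described later herein.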
For simplicity, each task can be treated as granular, meaning that it cannot be decomposed, and the time required to complete a task need not be modeled. With such treatment, there are no overlapping tasks. Task overlap can be modeled by treating the task start and the task end as separate sub-tasks in the graph model. Any more complex task can be broken down into sub-tasks in this manner. In general, task decomposition may be desirable if there are important dependency relations to capture between one or more of the sub-tasks and some other external task.
The case log file provides the primary components—tasks and order data—for deriving a workflow from empirical data. A goal is to derive a workflow graph that correctly models dependency constraints between tasks in the process. Since dependency constraints are not directly observed in data of the type illustrated above, order constraints serve as the natural surrogate for them. Some order constraints will reflect true dependency constraints, some will simply represent standard practice, and some will occur by chance. As a general matter, a process expert can distinguish between these situations based upon a review of the output workflow produced by the methods described herein in view of some understanding of the underlying process.
The framework for the graph model involves layer-by-layer graph building. Each graph is built up from layers of nodes. A node is a minimal graph unit and simply represents a task. Nodes are connected via edges that denote temporal relationships between tasks. Three basic operations can link together nodes or more complex graphs: the sequence operation, the AND operation, and the OR operation.
The sequence operation (→) links a series of graphs together with strict order constraints. For example, consider the following nodes: SL=stand in line, PB=pay bill, and RM=receive meal. Then graph G1=SL→PB, graph G2=PB→RM, and graph G3=SL→PB→RM are all valid sequence graphs, because SL always precedes PB, which always precedes RM. Similarly, graph G4=G1→RM and graph G5=SL→G2 are valid sequence graphs with one level of nesting, and the graphs G3, G4, and G5 are functionally equivalent. The sequence operation (→) between a pair of graphs indicates that the parent graph (on the left) always precedes the child graph (on the right), e.g., SL→PB in the example above. Such ordering requirements may also be described herein using an order constraint symbol (<), e.g., SL<PB.
When used to describe connections between nodes or graphs herein, the sequence operation reflects a strict order constraint, as noted above. However, it will be appreciated that the sequence operation (→) may also be used herein in describing the particular order between actual occurrences of tasks. In such usage, the sequence operation does not necessarily reflect a strict order constraint for those tasks generally, but instead simply represents an observed order for that occurrence. As will be discussed elsewhere herein, an analysis of the sequences of actual occurrences of tasks can be used to determine whether strict order constraints are generally applicable for given types of tasks.
Nodes in the graph are linked together by order constraints. In practice, the order constraints encoded will sometimes indicate dependency structure (e.g., the task on the right cannot be done before the task on the left), but not always. Order constraints in a process may result from many reasons: tradition, habit, efficiency, or too few observed cases. As noted previously, a process expert with some understanding of the underlying process can determine whether order constraints represent true task dependency or not.
The graph model includes nodes that represent tasks that are not subject to strict sequential order. Non-sequential task structure is modeled with a branching operator, which may also be referred to herein as a split node. Branches have a start or split point and an end or join point. Between the start and end points are two or more parallel threads of nodes that can be executed. Each of these parallel threads of nodes can be referred to as a “branch.” Two types of branching operation—the AND operation and the OR operation—are described below. Thus, split nodes can be AND nodes or OR nodes. Each operation can be considered a sub-graph. For all branches stemming from such an operation, there are no ordering links between branches.
More formally, a workflow graph G is a tuple <N, E> where N denotes a non-empty set of nodes (or vertices) and E denotes a collection of ordered pairs of nodes. A node is associated with a unique label and can be any one of the following classes:
An edge, characterizing a temporal constraint, in its most abstract form is an ordered pair of nodes of the form (Source node, Target node), wherein the task represented by the source node needs to finish before the task represented by the target node can begin. This is graphically denoted as (Source-node→Target-Node). Source nodes and target nodes are also referred to herein as parent nodes and child nodes, respectively.
Less formally, split nodes are meant to represent the points where choices are made (e.g., where one of several mutually exclusive tasks is chosen) or where multiple parallel threads of tasks will be spawned. Join nodes are meant to represent points of synchronization. That is, a join node is a task J that, before allowing the execution of any of its children, waits for the completion of all active threads that have J as an endpoint. This property can be referred to as a synchronization property.
For example, referring to the fast food cases C1 and C2 above, the tasks “order food” and “order drink” (or nodes representing those tasks) can happen in either order. Unordered graphs are partitioned into separate branches using the AND operation. More formally, the AND operation is a branching operation, where all branches must be executed to complete the process. The branches can be executed in parallel (simultaneously), meaning there are no order restrictions on the component graphs or their sub-graphs. The parallel nature of these tasks is reflected in their representation in the graph of
The graph model also includes tasks that are associated with mutually exclusive events. In the fast food example, it can be assumed that it is not possible to both “eat meal at restaurant” and “eat meal at home” for a given meal. Mutually exclusive graphs are partitioned into separate branches using the OR operation. More formally, the OR operation is a branching operation, where exactly one of the branches will be executed to complete the process.
The example of
The approaches described herein also address incomplete cases. An incomplete case is a process instance where one or more of the tasks in the process are not observed. This can happen for a number of reasons. For example, the process might have been stopped prior to completion, such that no tasks were carried out after the stopping point. Alternatively or in addition, there may have been measurement or recording errors in the system used to create the case logs. This ability of the approaches described herein to address such cases makes the present approaches quite robust.
Extraneous tasks and ordering errors can also be addressed by methods described herein. An extraneous task is a task recorded in the log file, but which is not actually part of the process logged. Extraneous tasks may appear when the recording system makes a mistake, either by recording a task that didn't happen or by assigning the wrong instance label to a task that did happen. An ordering error means that the case log has an erroneous task sequence, such as (A→B) when the true order of the tasks is (B→A). An ordering error may occur if there is an error in the time clock of the recording system or if there is a delay of variable length between when a task happens and when it is recorded, for example.
Extraneous tasks and ordering errors can be addressed, for example, using an algorithm that identifies order constraints that are unusual and that ignores those cases in developing the workflow. For example, if the case log for a process includes the sequence A→B (i.e., task A precedes task B) for 27 cases (instances) and the sequence B→A for two cases, this may indicate an ordering error or an extraneous instance of A or B in those two unusual cases. Eliminating those two cases from further consideration in a workflow analysis may be desirable. Alternatively, as another example, the data could be retained and simply analyzed from a statistical perspective such that if the quantity R=(# of times A occurs before B)/(total # of instances) exceeds a predetermined threshold (e.g., a threshold of 0.7, 0.8, 0.9, etc.), then an order constraint of A<B can be presumed.
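By way of illustration only, the statistical treatment described above can be sketched as follows. The function names `order_ratio` and `presume_order` are hypothetical. The sketch assumes that the denominator of R counts only those instances in which both tasks are observed, and that each task occurs at most once per case; other conventions are possible.

```python
def order_ratio(cases, a, b):
    """Fraction of the cases containing both a and b in which a occurs before b.
    Assumption: the denominator counts only cases where both tasks appear."""
    before = total = 0
    for tasks in cases:
        if a in tasks and b in tasks:
            total += 1
            if tasks.index(a) < tasks.index(b):
                before += 1
    return before / total if total else None

def presume_order(cases, a, b, threshold=0.9):
    """Presume the order constraint a<b when the ratio R exceeds the threshold."""
    r = order_ratio(cases, a, b)
    return r is not None and r > threshold

# The example from the text: A→B in 27 cases, B→A in two cases.
cases = [["A", "B"]] * 27 + [["B", "A"]] * 2
ratio = order_ratio(cases, "A", "B")  # 27/29, roughly 0.93
```

With the 27-to-2 split from the example above, R is approximately 0.93, so an order constraint A<B would be presumed at thresholds of 0.7, 0.8, or 0.9.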
As a general matter, it is convenient to assume under the graph model that the workflow graph is acyclical. This is a reasonable assumption in many cases. Nevertheless, various real-world processes involve cyclic activities. In this regard, a cyclic sub-graph is a segment of a graph where one or more tasks are repeated in the process, such as illustrated in the example of
Optional tasks can also be addressed by the approaches described herein. An optional task is a task that is not always executed and has no alternative task (e.g., OR operation) such as illustrated in the example of
Optional tasks present an ambiguity. If a given task is not observed, one does not know whether it is optional or whether there is a measurement error, or both. One way to address this consideration is to assign a threshold for measurement error. Thus, if a task is missing at a rate higher than the threshold, then it is considered to be an optional task. Modeling optional tasks with such node probabilities is attractive since including probabilities is also helpful for quantifying measurement error. It will be appreciated that probabilities for missing/optional tasks in a simple OR branch (i.e., all branches consist of a single node) cannot be estimated accurately without a priori knowledge of how to distribute the missing probability mass over the different nodes.
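By way of illustration only, the threshold approach described above can be sketched as follows. The names `missing_rate` and `is_optional`, the default threshold value, and the sample data are hypothetical. The sketch does not distinguish optional tasks from tasks in an OR branch, which would require the exclusivity information discussed elsewhere herein.

```python
def missing_rate(cases, task):
    """Fraction of cases in which the task is not observed."""
    return sum(task not in tasks for tasks in cases) / len(cases)

def is_optional(cases, task, error_threshold=0.05):
    """Treat the task as optional when it is missing more often than
    measurement error alone (the hypothetical threshold) would explain."""
    return missing_rate(cases, task) > error_threshold

# Hypothetical data: "add dessert" is absent from 6 of 10 cases.
cases = ([["order food", "pay bill"]] * 6
         + [["order food", "add dessert", "pay bill"]] * 4)
```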
The workflow discovery algorithms described herein assume that branches are either independent or mutually exclusive to facilitate efficient operation, and the use of the two basic branching operations (OR and AND) in that context excludes various types of complex dependency structures from analysis. Stated differently, ordering links between nodes in different branches should be avoided. Of course, real-world systems can exhibit complex dependencies, such as illustrated in the example of
In view of the likelihood of task uncertainty, workflows can be modeled in accordance with approaches disclosed herein using a probabilistic framework. This can be done efficiently by decomposing the joint probability distribution of tasks into a series of conditional probability distributions (of smaller dimension), where this factorization into smaller conditional probability distributions follows the dependencies specified in the workflow. This decomposition is somewhat similar to Bayesian network decomposition of a joint probability distribution.
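By analogy with the standard Bayesian network factorization, and offered only as an illustrative assumption about the form such a decomposition could take, the joint distribution over tasks T_1, ..., T_n might be written as a product of conditional distributions, where Pa(T_i) is notation introduced here for the parent nodes of T_i in the workflow graph:

```latex
P(T_1, T_2, \ldots, T_n) \;=\; \prod_{i=1}^{n} P\bigl(T_i \mid \mathrm{Pa}(T_i)\bigr)
```

Each factor involves only a task and its parents, so each conditional distribution is of smaller dimension than the full joint distribution.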
With the foregoing overview in mind, exemplary embodiments of workflow discovery algorithms will now be described.
Any suitable technique for generating a case log file can be used, such as conventional methods known to those of ordinary skill in the art. Such case log files can be generated, for instance, by automated analysis (e.g., automated reasoning over free text) of documents and electronic files relating to procurement, accounts receivable, accounts payable, electronic mail, facsimile records, memos, reports, etc. Case log files can also be generated by data logging of automated processes (such as in an assembly line), etc.
An example of a hypothetical case file is illustrated in
At step 120, the processing system analyzes occurrences of tasks to identify sequence order relationships among the tasks. For example, the processing system can examine the data of the multiple cases to determine, for instance, whether a task identified as task A always occurs before a task labeled as task B in the cases where A and B are observed together. If so, an order constraint A<B can be recorded in any suitable data structure. If task A occurs before task B in some instances and after task B in other instances, an entry indicating that there is no order constraint for the pair A, B can be recorded in the data structure (e.g., “none” can be recorded). If task A is not observed with task B in any instances, an entry indicating such (e.g., “false”) can be recorded in the data structure. This analysis is carried out for all pairings of tasks, and order constraints among the tasks are thereby determined.
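By way of illustration only, the pairwise analysis of step 120 can be sketched as follows. The function name `order_constraints` and the dictionary data structure are hypothetical; the entries “none” and “false” follow the examples given above. The sketch assumes that each task occurs at most once per case.

```python
from itertools import combinations

def order_constraints(cases):
    """Return a dict mapping each unordered task pair (a, b) to
    "a<b", "b<a", "none" (both orders observed), or
    "false" (the pair is never observed together)."""
    tasks = sorted({t for case in cases for t in case})
    result = {}
    for a, b in combinations(tasks, 2):
        a_first = b_first = 0
        for case in cases:
            if a in case and b in case:
                if case.index(a) < case.index(b):
                    a_first += 1
                else:
                    b_first += 1
        if a_first == 0 and b_first == 0:
            result[(a, b)] = "false"
        elif b_first == 0:
            result[(a, b)] = f"{a}<{b}"
        elif a_first == 0:
            result[(a, b)] = f"{b}<{a}"
        else:
            result[(a, b)] = "none"
    return result

# Fast food cases C1 and C2, using hypothetical abbreviations:
# SL=stand in line, OF=order food, OD=order drink, PB=pay bill,
# RM=receive meal, ER=eat at restaurant, EH=eat at home.
C1 = ["SL", "OF", "OD", "PB", "RM", "ER"]
C2 = ["SL", "OD", "OF", "PB", "RM", "EH"]
constraints = order_constraints([C1, C2])
```

For these two cases, the pair (OD, OF) is recorded as “none” because both orders are observed, and the pair (EH, ER) is recorded as “false” because the two tasks never appear in the same case.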
An exemplary result of the analysis carried out at step 120 is illustrated in
Further inspection of the ordering summary of
Thus, one exemplary algorithm for identifying order constraints is as follows:
Another exemplary algorithm “GetOrderingOracle” can identify order constraints by comparing occurrence data to a predetermined threshold, such as follows:
Algorithm GetOrderingOracle
Input: a workflow log L, and a predetermined threshold θ
Output: an ordering oracle for L
1. For every pair of tasks Ti, Tj that appears in the log
2. Return O.
The value of θ can be application dependent and can be determined using measures familiar to those skilled in the art (e.g., likelihood of the data), or can be determined empirically by analyzing past data for a given process where order constraints are already known, for example. Other approaches for identifying order constraints will be apparent to those of skill in the art.
At step 130, the processing system can initialize a set of nodes G to represent the set of tasks and can initialize an empty workflow graph H. The set of nodes can then be placed into the graph layer-by-layer, for example, such as described below.
At step 140, the processing system can analyze the order constraints to identify nodes from the set G that have no preceding nodes (i.e., there are no other nodes constrained to precede them based on the order constraints) and assign them to a current subset. The current subset can also be viewed as a current layer in the layer-by-layer approach for building the workflow graph. The nodes of the current subset could actually be removed from the set G, or they could be appropriately flagged in a data structure in any suitable fashion. For example, these nodes can be removed from G, and they can be inserted into the workflow graph H, meaning that they are now mathematically associated with the workflow graph H.
It should be noted in this regard that the processing system is analyzing nodes that symbolically or mathematically represent types of tasks, as opposed to the actual occurrences of tasks, along with corresponding order constraints. As noted previously, the actual occurrences of tasks are instances of tasks actually carried out as reflected by the empirical data in the case log file.
At step 145, the processing system can determine whether a current subset has multiple nodes, and if so, designates one or more split nodes (e.g., AND, OR) to precede the multiple nodes. Such split nodes do not represent actual observable tasks, but rather provide a mechanism for connecting nodes and/or groups of nodes. The processing system can identify whether such split nodes are AND nodes or OR nodes simply by examining the order constraint matrix (or suitable data structure) to determine whether the nodes for those tasks are exclusive (e.g., labeled as “Excl”). If a pair of nodes is designated mutually exclusive, they are joined with an OR split operator, otherwise the pair is joined with an AND split operator. The label “hidden” in this regard is merely a convenient descriptor reflecting the fact that such split nodes do not correspond to observable tasks, that is, they are “hidden” in the observable task data.
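By way of illustration only, the choice between an AND split and an OR split for a pair of nodes can be sketched as follows. The function name `split_type` and the dictionary form of the order constraint data are hypothetical; the label “Excl” follows the example given above.

```python
def split_type(a, b, constraints):
    """Return "OR" when the pair is labeled mutually exclusive ("Excl")
    in the constraint data, and "AND" otherwise."""
    label = constraints.get((a, b), constraints.get((b, a)))
    return "OR" if label == "Excl" else "AND"

# Hypothetical constraint entries for two pairs of tasks.
constraints = {
    ("eat at restaurant", "eat at home"): "Excl",
    ("order food", "order drink"): "none",
}
```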
At step 150, the processing system analyzes order constraints of unassigned nodes (e.g., the remaining nodes of set G that have not been removed or assigned) to identify nodes among them that have no preceding nodes (i.e., there are no other nodes constrained to precede them based on the order constraints) or that pass a conditional independence test with respect to those preceding nodes, and assigns them to a next subset. The next subset can be viewed as a next layer in the layer-by-layer graph building approach. The nodes of the next subset could actually be removed from the set G, or they could be appropriately flagged in a data structure in any suitable fashion. For example, these nodes can be removed from G, and they can be inserted into the workflow graph H, meaning that they are now mathematically associated with the workflow graph H. For example, the algorithm “GetNextBlanket” described later herein can be used to assign nodes to a next subset. In this manner, for example, the processing system can partition a set of nodes representing tasks into a series of subsets, such that no node of a given subset is constrained to precede any other node of the given subset unless said pair of nodes is conditionally independent given one or more nodes in an immediately preceding subset, and such that no node of a following subset is constrained to precede any node of the given subset.
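By way of illustration only, the partitioning of nodes into successive subsets can be sketched as follows. This is a simplified sketch that uses order constraints alone and omits the conditional independence test; it is not the algorithm “GetNextBlanket”. The function name `partition_layers` and the representation of order constraints as a set of pairs are hypothetical.

```python
def partition_layers(nodes, precedes):
    """Partition nodes into layers using order constraints only.
    precedes is a set of (a, b) pairs meaning a is constrained to precede b.
    Each layer holds the unassigned nodes that have no unassigned predecessor."""
    remaining = set(nodes)
    layers = []
    while remaining:
        layer = {n for n in remaining
                 if not any((m, n) in precedes for m in remaining if m != n)}
        if not layer:
            raise ValueError("order constraints contain a cycle")
        layers.append(sorted(layer))
        remaining -= layer
    return layers

# Order constraints for part of the fast food example (hypothetical abbreviations).
precedes = {("SL", "OF"), ("SL", "OD"), ("SL", "PB"), ("SL", "RM"),
            ("OF", "PB"), ("OD", "PB"), ("OF", "RM"), ("OD", "RM"),
            ("PB", "RM")}
layers = partition_layers(["SL", "OF", "OD", "PB", "RM"], precedes)
```

In this example the tasks “order food” and “order drink” fall into the same layer because neither is constrained to precede the other.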
At step 160 the processing system connects nodes in the current subset with nodes in the next subset via directed edges. An exemplary approach for carrying out this step will be described in detail in connection with
At step 170 the processing system redefines the next subset as the current subset, and at step 180, determines whether any unassigned nodes remain, e.g., whether the set G has more nodes remaining in it. If the answer to the query at step 180 is yes, the process 100 proceeds back to step 150. If the answer to the query at step 180 is no, the process 100 proceeds to step 190, wherein the processing system executes a final join operation to connect the nodes of the current subset (i.e., which is now the final subset) to other nodes with edges. For example, the processing system could join the nodes of the current subset to a single end node via edges, or it could join the nodes of the current subset together such that one of those nodes is the single end node. Join nodes are added in a nested fashion such that all the branches of each unterminated split node are connected with a corresponding join node. For example, the two branches in the OR node in
Thus, at the completion of step 190, a workflow graph representative of the process has been constructed, wherein the graph is representative of the identified relationships between the nodes of the identified subsets, and wherein the nodes are connected by edges. In such a workflow graph, branches are joined at various levels of nesting using the OR and AND branching operators (split operators) representative of the relationships between nodes, and nodes are connected with edges based on the stored order constraints. It will be appreciated that a graph as referred to herein is not limited to a pictorial representation of a workflow process but includes any representation, whether visual or not, that possesses the mathematical constructs of nodes and edges. In any event, a visual representation of such a workflow graph can be communicated to one or more individuals, displayed on any suitable display device, such as a computer monitor, and/or printed using any suitable printer, so that the workflow graph may be reviewed and analyzed by a human process expert or other interested individual(s) to facilitate an understanding of the process. For example, by assessing the workflow graph generated for the process, such individuals may become aware of process bottlenecks, unintended or undesirable orderings or dependencies of certain tasks, or other deficiencies in the process. With such an improved understanding, the process can be adjusted as appropriate to improve its efficiency.
As noted above, an exemplary process for connecting nodes as indicated at step 160 of
At step 250, the processing system inserts one or more join nodes between nodes of set A and set S if the size of set A is greater than one (i.e., if there is more than one node in set A). The insertion can be done, for example, by executing the algorithm “HiddenJoins” shown below. The joins can be considered “hidden” in the sense that they do not represent observable tasks in the case log.
Algorithm HiddenJoins
Input: H, a workflow graph;
At step 260, if the size of set S is greater than one (i.e., there is more than one node in set S), the processing system inserts one or more split nodes (e.g., AND, OR) between nodes of sets A and S (or between a final node descendent from set A and nodes of set S). The insertion can be done, for example, by executing the algorithm “HiddenSplits” shown below. The splits can be considered “hidden” in the sense that they do not represent observable tasks in the case log.
Algorithm HiddenSplits
Input: H, a workflow graph;
At step 270, the processing system marks all the nodes in the set S as “selected.” At step 280, the processing system determines whether there are any unselected nodes remaining in the next subset (as that subset is currently defined under the present iteration). If the answer to the query at step 280 is yes, the process returns to step 220. If the answer to the query at step 280 is no, the process 200 returns to process 100 at step 170.
As noted above, an exemplary process for adding an edge to graph H connecting nodes T and N, where T is an ancestor of N, depending upon an independence test (step 210 of
At step 330 the processing system carries out a conditional independence test involving node N and pairs of nodes T1, T2 in set AC. Namely, for each pair of nodes T1, T2 in set AC, the processing system evaluates whether T1 and N are independent given the presence of T2 and whether T2 and N are independent given the presence of T1. If T1 and N are independent given the presence of T2, the processing system removes the node T1 from AC (or flags T1 as “unavailable” or with some other suitable designation). If T2 and N are independent given the presence of T1, the processing system removes the node T2 from AC (or flags T2 as “unavailable” or with some other suitable designation). For example, the independence test can be carried out using the exemplary algorithm “GetIndependenceOracle” shown below. Although the steps of the algorithm suggest that the algorithm is carried out for every task Tk that appears in the case log, it will be appreciated that the algorithm can simply be called as necessary to evaluate particular triples of nodes.
Algorithm GetIndependenceOracle
Input: a workflow log L, a threshold θ (e.g., application dependent);
Output: an independence oracle for L
1. For every task Tk that appears in the log
2. Return I.
In a variation on the algorithm above, the conditional independence test can utilize the Chi-squared test (more formally written as χ2 test) instead of the G-squared test, both of which are well known in the art. This variation differs only in how the empirical values (Oi,j) and the expected values (Ei,j) are combined in step xiv above, as will be appreciated by those skilled in the art.
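By way of illustration only, the difference between the two test statistics can be sketched as follows, using the standard textbook definitions G² = 2·Σ O·ln(O/E) and χ² = Σ (O−E)²/E. This sketch is not the algorithm “GetIndependenceOracle”. The function names are hypothetical, and the expected values are assumed to be computed from the row and column totals of a non-empty contingency table of observed counts.

```python
import math

def expected_counts(observed):
    """Expected cell counts under independence: row total * column total / n.
    Assumes the table contains at least one observation."""
    rows = [sum(r) for r in observed]
    cols = [sum(c) for c in zip(*observed)]
    n = sum(rows)
    return [[r * c / n for c in cols] for r in rows]

def g_squared(observed):
    """G-squared statistic: 2 * sum of O * ln(O / E) over non-empty cells."""
    expected = expected_counts(observed)
    return 2.0 * sum(o * math.log(o / e)
                     for o_row, e_row in zip(observed, expected)
                     for o, e in zip(o_row, e_row) if o > 0)

def chi_squared(observed):
    """Chi-squared statistic: sum of (O - E)^2 / E over cells with E > 0."""
    expected = expected_counts(observed)
    return sum((o - e) ** 2 / e
               for o_row, e_row in zip(observed, expected)
               for o, e in zip(o_row, e_row) if e > 0)

# Hypothetical 2x2 tables of co-occurrence counts for two tasks.
independent = [[10, 10], [10, 10]]
dependent = [[20, 0], [0, 20]]
```

Both statistics are zero for the first table and large for the second. In either variation, the resulting statistic would be compared against a threshold to decide whether independence is accepted.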
At step 340, for each remaining ancestor node T of N in AC (i.e., not removed or flagged “unavailable”), a directed edge is added connecting each node T to node N in graph H. At step 350, the processing system determines whether there remain any unselected nodes in the next subset. If the answer to the query is yes, the process 300 returns to step 310. If the answer to the query is no, the process continues to step 360. At step 360, for each node N in the next subset without an ancestor in the current subset, the processing system identifies a node T in the current subset that co-occurs most often with the node N and adds an edge connecting that node T with node N in graph H. This “no ancestor” circumstance can occur because it is possible to remove all potential ancestors from the set AC at step 330 if the conditions set forth at step 330 are satisfied. In a variation of this embodiment, it is possible to terminate step 330 before removing the final node from set AC, in which case step 360 could be eliminated.
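By way of illustration only, the selection made at step 360 can be sketched as follows. The function name `most_cooccurring` and the sample data are hypothetical. Ties are resolved here in favor of the first candidate listed, which is an arbitrary choice made for this sketch.

```python
def most_cooccurring(cases, n, candidates):
    """Return the candidate node that appears together with node n
    in the largest number of cases (first candidate wins ties)."""
    def cooccurrences(t):
        return sum(1 for case in cases if t in case and n in case)
    return max(candidates, key=cooccurrences)

# Hypothetical cases: T1 appears with N twice, T2 appears with N once.
cases = [["T1", "T2", "N"], ["T1", "N"], ["T2"]]
chosen = most_cooccurring(cases, "N", ["T1", "T2"])
```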
At step 370, the processing system adds and/or deletes edges between nodes of the current subset and the next subset as necessary to ensure that the nodes in every pair from the next subset either (1) have no parents in common or (2) have exactly the same parents. This step is carried out to maintain a workflow graph that is consistent with the overall graph model, i.e., to avoid ordering links between nodes in different branches.
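By way of illustration only, the condition enforced at step 370 can be checked as sketched below. The sketch only tests the condition; it does not add or delete edges. The function name `parents_consistent` and the dictionary representation of parent sets are hypothetical.

```python
from itertools import combinations

def parents_consistent(parents):
    """parents maps each node of the next subset to its set of parent nodes.
    Return True if every pair of nodes has either no parents in common
    or exactly the same parents."""
    for x, y in combinations(parents, 2):
        px, py = parents[x], parents[y]
        if px & py and px != py:
            return False
    return True

same = {"X": {"A"}, "Y": {"A"}}          # identical parents
disjoint = {"X": {"A"}, "Y": {"B"}}      # no parents in common
mixed = {"X": {"A", "B"}, "Y": {"A"}}    # partial overlap violates the condition
```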
An exemplary approach for generating a workflow graph from a case log file has been described above in connection with various figures and algorithms. An exemplary algorithm written in pseudo-code with calls to other algorithms for generating a workflow graph will be further described below. The main algorithm is called “LearnOrderedWorkflow” and is shown below. It will be appreciated that the subset CurrentBlanket referred to in the algorithm corresponds to the “current subset” referred to above and that the subset NextBlanket referred to in the algorithm corresponds to the “next subset” referred to above. It will also be appreciated by those skilled in the art that various steps illustrated in
Algorithm LearnOrderedWorkflow
Input: O, an ordering oracle for a set T of tasks;
I, an independence oracle for T;
Output: a workflow graph H
The algorithm LearnOrderedWorkflow aims to recover a workflow representative of data of the log file. The algorithm is an iterative layer building algorithm that exploits the data in two ways to establish the layers (subsets) and the links between the successive layers. First, it exploits the data to establish an ordering of tasks (i.e., which tasks co-occur, which tasks are mutually exclusive, which tasks occur before other tasks or in parallel to other tasks). Second, it uses the data to establish conditional independence of two variables X and Y given a third variable Z, denoted mathematically as (X⊥Y|Z), to establish certain types of temporal relationships between tasks.
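As a rough illustration of the conditional independence relation (X⊥Y|Z), the following Python sketch estimates it from a case log by comparing P(X, Y | Z) with P(X | Z)·P(Y | Z) over the cases in which Z occurred. This is an assumption-laden simplification rather than the disclosed procedure: the function name, the fixed tolerance tol, and the representation of each case as a set of task identifiers are all hypothetical, and a practical implementation would more likely use a statistical test.

```python
def conditionally_independent(cases, x, y, z, tol=0.05):
    """Estimate whether x and y are independent given that z occurred.

    cases -- list of sets, each holding the tasks recorded in one case
    tol   -- hypothetical tolerance on |P(x,y|z) - P(x|z) * P(y|z)|
    """
    given = [c for c in cases if z in c]
    if not given:
        # Assumption for this sketch: with no evidence, report independence.
        return True
    n = len(given)
    p_x = sum(x in c for c in given) / n
    p_y = sum(y in c for c in given) / n
    p_xy = sum(x in c and y in c for c in given) / n
    return abs(p_xy - p_x * p_y) <= tol
```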
Two types of information are derived from the case log: information about the order of the tasks that can be derived directly from the event sequences, and information about the conditional independences of the tasks. These types of information are derived by two procedures which generate two data structures (referred to as oracles): an ordering oracle, and an independence oracle.
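The kind of ordering information that can be read directly from event sequences can be sketched as follows. This Python example is illustrative only and does not reproduce the GetOrderingOracle procedure; the names order_counts and relation and the string labels returned are hypothetical, and the sketch assumes each task appears at most once per case.

```python
from collections import Counter
from itertools import combinations


def order_counts(cases):
    """cases: list of event sequences (task ids in order of occurrence).

    Returns (cooccur, before), where cooccur[frozenset({a, b})] counts the
    cases containing both tasks and before[(a, b)] counts the cases in
    which a occurred before b.
    """
    cooccur, before = Counter(), Counter()
    for seq in cases:
        for i, j in combinations(range(len(seq)), 2):
            a, b = seq[i], seq[j]
            if a == b:
                continue  # repeated task; outside the assumption of this sketch
            before[(a, b)] += 1
            cooccur[frozenset((a, b))] += 1
    return cooccur, before


def relation(a, b, cooccur, before):
    """Classify a pair of tasks from the counts gathered above."""
    together = cooccur[frozenset((a, b))]
    if together == 0:
        return "exclusive"      # the tasks never co-occur
    if before[(a, b)] == together:
        return "a_before_b"     # a precedes b in every shared case
    if before[(b, a)] == together:
        return "b_before_a"
    return "parallel"           # observed in both orders
```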
The LearnOrderedWorkflow algorithm accepts as input an ordering oracle O and an independence oracle I, and produces as output a workflow graph H. It will be appreciated that in a variation, the algorithm can call procedures for generating the ordering information and independence information as needed instead of calculating and storing that information for all nodes of the set of nodes at the outset. The workflow graph H is recovered layer-by-layer using information from the ordering oracle and the independence oracle. The algorithm works by iteratively adding child nodes to a partially built graph (corresponding to the partially built workflow graph H) in a specific order. It begins by using the ordering oracle to detect nodes that have no parents (and serve as the “root causes” of all other measurable tasks, i.e., nodes that do not have any measurable ancestors). Such nodes are identified in Step 3 of the LearnOrderedWorkflow procedure. If there is more than one measurable node as a “root cause”, explicit branching nodes (e.g., AND-splits, OR-splits) are added to the graph. This is accomplished by the HiddenSplits procedure (corresponding to step 5 of the LearnOrderedWorkflow procedure). Essentially, this procedure assembles the current layer into a partial workflow graph. The remaining steps of the LearnOrderedWorkflow procedure (Steps 7a-7f) involve iteratively identifying successive layers in the workflow graph and appending them to the current version of the workflow. This process continues until all visible nodes have been accounted for in the recovered workflow.
At each iteration (Steps 7a-7f), a set of nodes called CurrentBlanket is determined. This set of nodes contains all of the “leaves” and only the “leaves” of the current workflow graph H, i.e., all the task nodes that do not have any children in H. The initial choice of nodes for CurrentBlanket is exactly the set of root causes. The next step is to find which measurable tasks should be added to H. The algorithm builds the workflow graph by selecting only a set of tasks NextBlanket such that:
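The detection of root causes can be illustrated with a few lines of Python. The sketch assumes, hypothetically, that the ordering oracle has been reduced to a set of pairs (a, b) meaning that task a is constrained to precede task b; the function name root_causes is likewise introduced only for the example.

```python
def root_causes(tasks, precedes):
    """Return the tasks that no other task is constrained to precede.

    tasks    -- iterable of task ids
    precedes -- set of (a, b) pairs: a is constrained to precede b
    """
    has_predecessor = {b for (_, b) in precedes}
    return {t for t in tasks if t not in has_predecessor}
```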
The procedure GetNextBlanket (below) returns a set corresponding to these properties. Identifying which nodes in NextBlanket should be descendants of which nodes in CurrentBlanket is accomplished by the Dependencies procedure.
It is possible that between nodes in CurrentBlanket and nodes in NextBlanket there are hidden join/split nodes. Such nodes are added to H by the InsertLatents algorithm (below).
As noted previously, Steps 7a-7f in the LearnOrderedWorkflow procedure are repeated until all observable tasks are placed in the workflow graph H. To complete the workflow graph, step 8 of LearnOrderedWorkflow ensures that all nodes are synchronized with a final end node. If an end node is not visible, multiple threads will remain open if not joined. This is accomplished by a call to the HiddenJoins procedure (step 8).
Exemplary algorithms for HiddenSplits, HiddenJoins, GetIndependenceOracle (which can generate the independence oracle “I” called in the algorithm above), and GetOrderingOracle (which can generate the ordering oracle “O” called in the algorithm above) have already been described herein. Exemplary algorithms for GetNextBlanket, Dependencies, and InsertLatents called in the main algorithm are provided below.
The GetNextBlanket algorithm (below) identifies suitable nodes of the next layer (or next subset) for the layer-by-layer building of the workflow graph. The GetNextBlanket procedure focuses on the subset of nodes in the remaining set of nodes G referred to previously. The GetNextBlanket procedure can iterate over all pairs of nodes (T1, T2) in G such that node T1 has no parents and such that T1 precedes T2 (meaning that T1 is constrained to precede T2). The GetNextBlanket procedure can also be implemented to iterate over pairs of nodes (T1, T2) in G such that node T1 has no parents, such that T1 precedes T2, and such that the iterations occur over pairs of nodes for which there are no intervening nodes evident from the order constraints of the ordering oracle. If the nodes T1 and T2 can co-occur with any task Ti in the current layer (current subset) and T1 and T2 are conditionally independent given task Ti, then the order constraint for T1 to precede T2 is removed (as otherwise this will result in unwanted loops). Mutually exclusive tasks are directly identifiable from the ordering oracle (as the pair of such tasks will never co-occur and consequently no edge will be inserted in the set G).
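The constraint-removal rule described above can be sketched in Python as follows. The sketch is a hedged reading of the prose and not the GetNextBlanket listing itself: it interprets “any task Ti” as “at least one task Ti”, evaluates “T1 has no parents” against the constraints as they stood before any removal, and takes the co-occurrence and independence oracles as hypothetical callables can_cooccur(a, b) and independent(a, b, given).

```python
def prune_order_constraints(constraints, current_layer, can_cooccur, independent):
    """Drop order constraints (t1, t2) that the independence oracle explains away.

    constraints   -- set of (t1, t2) pairs in G: t1 is constrained to precede t2
    current_layer -- tasks of the current subset
    can_cooccur   -- callable (a, b) -> bool            (hypothetical oracle interface)
    independent   -- callable (a, b, given) -> bool     (hypothetical oracle interface)
    """
    has_parent = {b for (_, b) in constraints}
    kept = set()
    for t1, t2 in constraints:
        removable = t1 not in has_parent and any(
            can_cooccur(t1, ti) and can_cooccur(t2, ti) and independent(t1, t2, ti)
            for ti in current_layer
        )
        if not removable:
            kept.add((t1, t2))
    return kept
```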
Algorithm GetNextBlanket
Input: CurrentBlanket, a set of tasks in the current layer (current subset)
While the GetNextBlanket procedure (above) identifies the tasks in the next layer (next subset), it does not indicate which tasks in the current layer are ancestors of the tasks in the newly identified next layer. This is performed by the Dependencies procedure. It is worth noting that the independence oracle needs only to consider conditioning on positive values of a single node T2 (step 2a of Dependencies).
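The pruning of candidate ancestors can be illustrated with the following Python sketch. The rule it applies is an assumption inferred from the worked example later in this description (where task 8 is independent of task 7 conditioned on task 6, so that only edge 6→8 is retained): a candidate ancestor is dropped when the node is independent of it given some other candidate. The function name direct_ancestors and the callable independent(node, a, given) are hypothetical.

```python
def direct_ancestors(node, candidates, independent):
    """Keep only candidates not screened off from node by another candidate.

    node        -- a task in the next layer
    candidates  -- candidate ancestor tasks in the current layer
    independent -- callable (node, a, given) -> bool (hypothetical oracle interface)
    """
    return {
        a
        for a in candidates
        if not any(independent(node, a, b) for b in candidates if b != a)
    }
```

Under this rule every candidate can be eliminated, which corresponds to the “no ancestor” circumstance handled at step 360.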
Algorithm Dependencies
Input: CurrentBlanket, a subset of a set T of nodes;
The algorithm InsertLatents (below) can introduce required nodes between two layers (subsets) of nodes representing observable tasks, as called by the main algorithm LearnOrderedWorkflow (above).
Algorithm InsertLatents
Input a workflow graph H;
In another exemplary embodiment, the possibility of measurement error is addressed. For each node T representing a task that is measurable, the possibility that T is not recorded in a particular instance (or case) even though T happened can be accounted for. That is, let TM be a binary variable such that TM=1 if task T is recorded to happen. Then, the following measurement model is provided:
Measurement variables are proxies for the nodes representing actual tasks and allow for errors in recording. Even allowing the possibility of measurement error, the methods described herein can robustly reconstruct a workflow graph.
Additional considerations regarding how to avoid generating invalid workflow graphs, which may arise from anomalies in the data (such as statistical mistakes), will now be discussed. A first consideration involves how to avoid cycles. As noted previously, one approach for addressing cycles is to identify cyclic tasks with pattern recognition and replace the data corresponding to cyclic tasks with a pseudo-task. As another approach, if a cycle is detected in the ordering oracle, the weakest link Ti→Tj in the cycle (according to the frequency of occurrence of (Ti, Tj) in the dataset, where Ti precedes Tj) can simply be removed. This procedure can be iterated until no cycles remain.
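The weakest-link approach to removing cycles can be sketched in Python as follows. The sketch is illustrative only; the function names find_cycle and break_cycles and the representation of the ordering information as a dictionary of pair frequencies are assumptions made for the example.

```python
def find_cycle(edges):
    """Return the edges of one directed cycle as a list, or None if acyclic."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, []).append(b)
        graph.setdefault(b, [])
    state = {}  # node -> 1 while on the current DFS path, 2 when finished
    path = []

    def visit(node):
        state[node] = 1
        path.append(node)
        for nxt in graph[node]:
            if state.get(nxt) == 1:
                # Back edge found: the cycle is the tail of the current path.
                cyc = path[path.index(nxt):] + [nxt]
                return list(zip(cyc, cyc[1:]))
            if nxt not in state:
                found = visit(nxt)
                if found:
                    return found
        path.pop()
        state[node] = 2
        return None

    for node in graph:
        if node not in state:
            found = visit(node)
            if found:
                return found
    return None


def break_cycles(freq):
    """freq: dict (Ti, Tj) -> number of cases in which Ti precedes Tj.

    Repeatedly removes the least frequent edge of a detected cycle and
    returns the remaining, acyclic set of edges.
    """
    edges = set(freq)
    while True:
        cycle = find_cycle(edges)
        if cycle is None:
            return edges
        edges.discard(min(cycle, key=lambda e: freq[e]))
```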
A second consideration involves how to guarantee that splits and joins are suitably nested. Appropriate nesting can be accomplished by modifying the ordering and independence oracles, if necessary. For example, if the independence oracle links the current and next layers (subsets) in such a way that the ancestral relations between nodes in the two layers create join nodes that are not nested within previous split nodes (as decided by procedure Dependencies), edges can be added to the graph or removed until the resulting workflow graph has a properly nested structure. First, either graph M1 or M2 in HiddenJoins and HiddenSplits should be examined to determine if either is disconnected. If neither is disconnected, edges can be removed from M1 starting from the least frequent observed pairs until M1 is disconnected.
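The removal of edges from M1 in order of increasing frequency can be sketched as follows. For the purposes of this Python example only, M1 is treated as an undirected graph whose edges carry the observed frequency of the corresponding pair of tasks; the function names and that representation are assumptions, not part of the disclosed HiddenJoins or HiddenSplits procedures.

```python
def is_connected(nodes, edges):
    """True if the undirected graph (nodes, edges) has a single component."""
    nodes = set(nodes)
    if not nodes:
        return True
    adj = {n: set() for n in nodes}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        for m in adj[stack.pop()]:
            if m not in seen:
                seen.add(m)
                stack.append(m)
    return seen == nodes


def disconnect(nodes, pair_freq):
    """Remove the least frequent edges until the graph becomes disconnected.

    pair_freq -- dict (a, b) -> observed frequency of the pair
    Returns the set of edges that remain.
    """
    remaining = set(pair_freq)
    for edge in sorted(pair_freq, key=pair_freq.get):
        if not is_connected(nodes, remaining):
            break
        remaining.discard(edge)
    return remaining
```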
This is not enough, however, to guarantee consistency with the graph model. As a further step, another algorithm GetParseTree can be called to identify any other edges that should be added. GetParseTree (below) obtains a parse tree from a partially built workflow graph.
Algorithm GetParseTree
Input: a set of nodes S;
Let Parents(V, G) represent the set of parents of node V in graph G, and LeastCommonAncestor(S, PT) represent the node T in tree PT that is a common ancestor of all elements in S and has no descendant that is also an ancestor of all elements in S. Notice that if S contains only one element s, then LeastCommonAncestor(S, PT)=s. The level of T in PT is the size of the longest path from T to one of its descendants in S, where the size of a path is the number of edges in this path.
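The definitions of LeastCommonAncestor and level can be made concrete with a short Python sketch. The parse tree is represented here, purely as an assumption for the example, by a dictionary mapping each node to its parent (the root maps to None), and each node is treated as an ancestor of itself so that the single-element case behaves as stated above. The function names are hypothetical.

```python
def ancestors_inclusive(node, parent):
    """Chain from node up to the root, node first."""
    chain = [node]
    while parent.get(chain[-1]) is not None:
        chain.append(parent[chain[-1]])
    return chain


def least_common_ancestor(s, parent):
    """Deepest node of the tree that is an ancestor of every element of s."""
    chains = [ancestors_inclusive(v, parent) for v in s]
    common = set(chains[0]).intersection(*chains[1:])
    # Walking up from any element, the first common node is the deepest one.
    for node in chains[0]:
        if node in common:
            return node
    return None


def level(t, s, parent):
    """Number of edges on the longest path from t down to a descendant in s."""
    best = 0
    for v in s:
        chain = ancestors_inclusive(v, parent)
        if t in chain:
            best = max(best, chain.index(t))
    return best
```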
A further structural consideration is necessary to avoid generating invalid graphs. Namely, in the procedure Dependencies, for each pair of observable tasks either the tasks do not have any parent in common in AncestralGraph, or the tasks have exactly the same parents. Also, each task in NextBlanket has at least one parent in AncestralGraph. Finally, let PT be the parse tree for CurrentBlanket. For any node T0 in NextBlanket, it follows that if LeastCommonAncestor(Parents(T0, AncestralGraph), PT) has a level of at least 2, then T0 is a child of every element from Leaves(LeastCommonAncestor(Parents(T0, AncestralGraph), PT), PT) in AncestralGraph.
If, during the execution of the main algorithm, any of the above conditions fails, then a valid workflow graph will not be generated. In such a case, the following modification of the algorithm Dependencies can be implemented.
Algorithm Dependencies2
Input: G, the current workflow graph
Thus, it will be appreciated that various conditions that might otherwise prevent generating a valid workflow graph can be addressed by the methods described herein.
Hardware Overview
Computer system 1300 may be coupled via bus 1302 to a display 1312 for displaying information to a computer user. An input device 1314, including alphanumeric and other keys, is coupled to bus 1302 for communicating information and command selections to processor 1304. Another type of user input device is cursor control 1315, such as a mouse, a trackball, or cursor direction keys for communicating direction information and command selections to processor 1304 and for controlling cursor movement on display 1312.
The exemplary methods described herein can be implemented with computer system 1300 for deriving a workflow from empirical data (case log files) such as described elsewhere herein. Such processes can be carried out by a processing system, such as processor 1304, by executing sequences of instructions and by suitably communicating with one or more memory or storage devices such as memory 1306 and/or storage device 1310 where derived workflow can be stored and retrieved, e.g., in any suitable database. The processing instructions may be read into main memory 1306 from another computer-readable medium, such as storage device 1310. However, the computer-readable medium is not limited to devices such as storage device 1310. For example, the computer-readable medium may include a floppy disk, a flexible disk, hard disk, magnetic tape, or any other magnetic medium, a CD-ROM, any other optical medium, a RAM, a PROM, an EPROM, a FLASH-EPROM, any other memory chip or cartridge, or any other medium from which a computer can read, containing an appropriate set of computer instructions that would cause the processor 1304 to carry out the techniques described herein. The processing instructions may also be read into main memory 1306 via a modulated wave or signal carrying the instructions, e.g., a downloadable set of instructions. Execution of the sequences of instructions causes processor 1304 to perform process steps previously described herein. In alternative embodiments, hard-wired circuitry may be used in place of or in combination with software instructions to implement the exemplary methods described herein. Moreover, the process steps described elsewhere herein may be implemented by a processing system comprising a single processor 1304 or comprising multiple processors configured as a unit or distributed across multiple machines.
Thus, embodiments of the invention are not limited to any specific combination of hardware circuitry and software, and a processing system as referred to herein may include any suitable combination of hardware and/or software whether located in a single location or distributed over multiple locations.
Computer system 1300 can also include a communication interface 1316 coupled to bus 1302. Communication interface 1316 provides a two-way data communication coupling to a network link 1320 that is connected to a local network 1322 and the Internet 1328. It will be appreciated that data and workflows derived therefrom can be communicated between the Internet 1328 and the computer system 1300 via the network link 1320. Communication interface 1316 may be an integrated services digital network (ISDN) card or a modem to provide a data communication connection to a corresponding type of telephone line. As another example, communication interface 1316 may be a local area network (LAN) card to provide a data communication connection to a compatible LAN. Wireless links may also be implemented. In any such implementation, communication interface 1316 sends and receives electrical, electromagnetic or optical signals which carry digital data streams representing various types of information.
Network link 1320 typically provides data communication through one or more networks to other data devices. For example, network link 1320 may provide a connection through local network 1322 to a host computer 1324 or to data equipment operated by an Internet Service Provider (ISP) 1326. ISP 1326 in turn provides data communication services through the “Internet” 1328. Local network 1322 and Internet 1328 both use electrical, electromagnetic or optical signals which carry digital data streams. The signals through the various networks and the signals on network link 1320 and through communication interface 1316, which carry the digital data to and from computer system 1300, are exemplary forms of modulated waves transporting the information.
Computer system 1300 can send messages and receive data, including program code, through the network(s), network link 1320 and communication interface 1316. In the Internet 1328 for example, a server 1330 might transmit a requested code for an application program through Internet 1328, ISP 1326, local network 1322 and communication interface 1316. In accordance with the present disclosure, one such downloadable application can provide for deriving a workflow and an associated workflow graph as described herein. Program code received over a network may be executed by processor 1304 as it is received, and/or stored in storage device 1310, or other non-volatile storage for later execution. In this manner, computer system 1300 may obtain application code in the form of a modulated wave. The computer system 1300 may also receive data over a network, wherein the data can correspond to multiple instances of a process to be analyzed in connection with approaches described herein.
Components of the invention may be stored in memory or on disks in a plurality of locations in whole or in part and may be accessed synchronously or asynchronously by an application and, if in constituent form, reconstituted in memory to provide the information used for processing information relating to occurrences of tasks and generating workflow graphs as described herein.
An example of how LearnOrderedWorkflow works will now be described for hypothetical data. Assume for now that the hypothetical graph G in
Suppose that a directionality graph G is given in
In the initial step, the set CurrentBlanket will contain tasks {1, 2, 3, 4, 5}. The HiddenSplits algorithm will work as follows: two graphs, M1 and M2, will be created based on O and tasks {1, 2, 3, 4, 5}. These graphs are shown in
Consider the new call for HiddenSplitStep (see HiddenSplits algorithm herein) with argument S={1, 2, 3}. The corresponding graphs M1 and M2 are now shown in
From the ordering graph illustrated in
The algorithm now performs the insertion of possible latents between {1, 2, 3, 4, 5} and {6, 7, 12}. There is only one set Siblings in InsertLatents, {6, 7, 12}, and one AncestralSet, {1, 2, 3, 4, 5}. When inserting hidden joins for elements in AncestralSet, the algorithm will perform an operation analogous to the previous example of InsertHiddenSplits, but with arrows directed in the opposite way. The modification is shown in
The algorithm proceeds to add more observable tasks in the next cycle of LearnOrderedWorkflow. The candidates are {8, 9, 10, 11}. By inspection of
When determining direct dependencies, the algorithm first selects {6, 7} as the possible ancestors of {8, 9}. Since 8 and 7 are independent conditioned on 6, and 9 and 6 are independent conditioned on 7, only edges 6→8 and 7→9 are allowed. Analogously, the same will happen to 8→10 and 9→11. Graph H, after introducing all observable tasks, is shown in
While this invention has been particularly described and illustrated with reference to particular embodiments thereof, it will be understood by those skilled in the art that changes in the above description or illustrations may be made with respect to form or detail without departing from the spirit or scope of the invention.
This application claims the benefit under 35 U.S.C. § 119(e) of U.S. Provisional Patent Application No. 60/709,434 “Method and Apparatus for Probabilistic Workflow Mining” filed Aug. 19, 2005, the entire contents of which are incorporated herein by reference.
Number | Date | Country
---|---|---
60709434 | Aug 2005 | US