Embodiments of the invention generally relate to information technology, and, more particularly, to visual analytic tools.
Over time, a patient's medical condition can often evolve in complex and seemingly unpredictable ways. Moreover, variations in symptoms and diagnoses can often be observed within a population of patients, even when those patients are battling the same underlying disease. Similarly, a range of procedures, medications, and other interventions may be used by clinicians as they work to find treatment plans that yield the desired patient outcomes.
For this reason, scientists have long studied how variations in care and disease progression can lead to different outcomes. The most formal studies in this area often use randomized controlled trials (RCTs). While results from RCTs offer statistical rigor and serve as the “gold standard” for evidence-based medicine, they are expensive and time-consuming to conduct. This can make the process slow and cumbersome when working to generate and explore new hypotheses. As a result, researchers have begun to take advantage of the growing repositories of observational data stored in electronic medical record (EMR) systems. For example, a number of platforms have been developed to analyze and make available vast volumes of this electronic data for ad hoc analyses.
A common type of retrospective study conducted using observational data is temporal event analysis. This type of investigation represents each patient's medical history as a sequence of time-stamped events. The temporal properties of these events, such as sequence and timing, are then analyzed to see how they impact a patient's eventual outcome. A variety of techniques have been used to gain insights from this sort of clinical event sequence data, ranging from data mining systems to interactive visualization-based tools.
While such mining-based and visualization-based methods have proven useful, they both suffer from significant limitations. First, mining-based methods often identify short snippets of frequently occurring patterns. The context in which these patterns occur, however, is typically lost. This makes it hard to answer many meaningful questions, such as “Do the patterns typically appear early or late in an episode?” and “Does the importance of a pattern change at different stages of an episode?”
In contrast, visualization-based methods can illustrate episodes from start to finish, making clear the context surrounding intermediate events. Visualization methods, however, are typically limited to a small number of events or event types before becoming so complex that they are difficult if not impossible to interpret.
A need therefore exists for improved visual analytics techniques that combine both mining and visualization-based techniques to overcome the limitations outlined above.
In one aspect of the present invention, techniques for visual analytics are provided that combine both mining and visualization-based techniques. An exemplary computer-implemented method can include steps of obtaining an episode definition comprising a sequence of timestamped events for an entity that satisfy one or more constraints, wherein the episode definition comprises at least a starting milestone event, an ending milestone event and an outcome measure; translating the episode definition to a formal query; obtaining matching data that satisfies the formal query from a data repository for a plurality of entities, wherein for each of the entities, the matching data comprises a plurality of timestamped events comprising at least the starting milestone event and the ending milestone event; performing temporal pattern mining on the matching data to identify one or more event subsequence patterns that occur in a set of input episodes with a support value above a threshold; applying a statistical pattern analyzer to the identified event subsequence patterns to identify one or more correlations between the identified event subsequence patterns and the outcome measure that provide an indication of a degree of informativeness of a given pattern in terms of predicting an episode outcome; and visualizing one or more of the identified correlations, wherein at least one of the steps is performed by at least one hardware device.
According to further aspects of the invention, the episode definition optionally comprises one or more of milestone events, preconditions, an outcome measure and temporal constraints. The preconditions can specify one or more constraints that must be satisfied prior to a starting milestone. The episode definition can be interactively specified by a user.
According to further aspects of the invention, the temporal pattern mining comprises a frequent pattern mining. The frequent pattern mining can be applied to an overall event sequence returned by the formal query, and to each intermediate event sequence occurring between sequential milestone events.
According to further aspects of the invention, the visualization comprises visualizing one or more of a cohort overview, a milestone timeline and a mined pattern diagram. The exemplary milestone timeline illustrates a sequence of milestone events defining an overall episode. The exemplary mined pattern diagram visualizes a set of mined patterns on two axes reflecting positive and negative coverage and optionally provides animation for temporal comparison.
Another aspect of the invention or elements thereof can be implemented in the form of an article of manufacture tangibly embodying computer readable instructions which, when implemented, cause a computer to carry out a plurality of method steps, as described herein. Furthermore, another aspect of the invention or elements thereof can be implemented in the form of an apparatus including a memory and at least one processor that is coupled to the memory and configured to perform noted method steps. Yet further, another aspect of the invention or elements thereof can be implemented in the form of means for carrying out the method steps described herein, or elements thereof; the means can include hardware module(s) or a combination of hardware and software modules, wherein the software modules are stored in a tangible computer-readable storage medium (or multiple such media).
These and other objects, features and advantages of the present invention will become apparent from the following detailed description of illustrative embodiments thereof, which is to be read in connection with the accompanying drawings.
Aspects of the present invention provide improved visual analytics techniques that combine both mining and visualization-based techniques to overcome the limitations outlined above. According to one aspect of the invention, improved visual analytics techniques are provided that combine visual episode query tools to interactively specify episode definitions, on-demand data analytics that perform pattern mining to help discover important intermediate events within an episode, and dynamic information visualization capabilities to allow interactive exploration and analysis of clinical event sequence data. The query capabilities allow users to intuitively and quickly retrieve cohorts of patients that satisfy complex clinical episodes of interest. The disclosed visual analytics system then automatically leverages event pattern mining algorithms to uncover important events within the returned cohort. Finally, another aspect of the invention provides an interactive visual interface that lets users answer a range of interesting questions. The disclosed interactive visualization techniques identify events that impact outcome and how those associations change over time.
While the present invention is illustrated in the context of exemplary patients and clinical episodes, the present invention can be applied in any context where visual analytics are needed to combine both pattern mining and temporal event visualization-based techniques, as would be apparent to a person of ordinary skill in the art.
As shown in
Visual Query Module
The exemplary visual query module 110 has two features: (1) an easy-to-use user interface component enabling the definition of a clinical episode specification, and (2) a query engine that converts the episode specification to an executable query and retrieves matching patient data from a clinical data warehouse.
Each episode specification 200 has at least two milestone events 210-1 and 210-N to represent the start and end of the episode 200. For instance, in the earlier example, the onset of angina would be the start milestone 210-1 and heart failure would be the end milestone 210-N. In addition, intermediate milestones, such as milestone events 210-2 and 210-3, can be included to encode additional constraints. For example, an arrhythmia could be included as an intermediate milestone 210 to consider only patients who suffered from an irregular heartbeat prior to heart failure. Finally, time gaps can be included to ensure temporal constraints (e.g., at least two years between milestones).
Preconditions are a set of constraints, if any, that must be satisfied prior to the starting milestone. For example, a precondition could specify that only patients with a diagnosis of diabetes prior to the onset of angina be included.
The outcome measure specifies the way to evaluate the eventual result of an episode 200. Continuing the heart failure example, the outcome measure for a patient could be, for instance, the presence of an eventual heart valve replacement procedure. The outcome measure definition is a critical element in the episode specification because the pattern mining algorithms look for event patterns within an episode that have strong correlations with good (or bad) outcomes.
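By way of illustration only, the elements of an episode specification described above (milestones, time gaps, preconditions and an outcome measure) could be held in a small data structure such as the following. This is a minimal sketch in Python; the class and field names (Milestone, EpisodeSpec, min_gap_days, outcome_event) are hypothetical and are not taken from the exemplary implementation.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class Milestone:
    event_type: str
    # Hypothetical field: minimum number of days since the previous milestone
    # (None means no time-gap constraint).
    min_gap_days: Optional[int] = None


@dataclass(frozen=True)
class EpisodeSpec:
    milestones: List[Milestone]          # ordered; first = start, last = end
    outcome_event: str                   # event whose presence defines the outcome
    preconditions: List[str] = field(default_factory=list)

    def __post_init__(self):
        # An episode needs at least a start and an end milestone.
        if len(self.milestones) < 2:
            raise ValueError("an episode needs a start and an end milestone")

    @property
    def start(self) -> Milestone:
        return self.milestones[0]

    @property
    def end(self) -> Milestone:
        return self.milestones[-1]


# The running example: angina -> arrhythmia -> heart failure, restricted to
# patients with a prior diabetes diagnosis, with valve replacement as outcome.
spec = EpisodeSpec(
    milestones=[
        Milestone("angina"),
        Milestone("arrhythmia"),
        Milestone("heart failure", min_gap_days=730),
    ],
    outcome_event="heart valve replacement",
    preconditions=["diabetes"],
)
```

Keeping the milestones in an ordered list makes the start and end milestones implicit in position, so intermediate milestones can be added or removed without changing the rest of the structure.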
In one exemplary embodiment, once the user has finished defining the episode specification via the user interface 300, the visual query specification is translated into a formal query, expressed, for example, in Structured Query Language (SQL), that retrieves matching patient event episodes from the patient data repository 105. Generally, the query returns all patients having events (in the proper order) that satisfy the episode specification. Except for the step of translating to SQL, the exemplary visual analytics system 100 is independent of the underlying data source, thereby allowing for easy migration between data sources.
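One possible way to translate an ordered list of milestones into a formal query is sketched below, using Python's standard sqlite3 module. The sketch assumes a hypothetical single table events(patient_id, event_type, day) and applies one minimum gap to every pair of consecutive milestones; the table layout, the function name build_episode_query and the uniform gap are illustrative assumptions, and preconditions and outcome measures are omitted for brevity.

```python
import sqlite3
from typing import List, Tuple


def build_episode_query(milestones: List[str], min_gap_days: int = 0) -> Tuple[str, list]:
    """Build a parameterized self-join: one alias of `events` per milestone."""
    if len(milestones) < 2:
        raise ValueError("need at least a start and an end milestone")
    joins, join_params, where, where_params = [], [], [], []
    for i, event_type in enumerate(milestones):
        where.append(f"e{i}.event_type = ?")
        where_params.append(event_type)
        if i > 0:
            # Each milestone must occur strictly after the previous one and
            # at least min_gap_days later.
            joins.append(
                f"JOIN events e{i} ON e{i}.patient_id = e0.patient_id "
                f"AND e{i}.day > e{i - 1}.day AND e{i}.day - e{i - 1}.day >= ?"
            )
            join_params.append(min_gap_days)
    sql = (
        "SELECT DISTINCT e0.patient_id FROM events e0 "
        + " ".join(joins)
        + " WHERE " + " AND ".join(where)
        + " ORDER BY e0.patient_id"
    )
    # Placeholders are bound in textual order: JOIN clauses first, then WHERE.
    return sql, join_params + where_params


# Toy repository (hypothetical data): only patient 1 has all three
# milestones in the required order.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (patient_id INTEGER, event_type TEXT, day INTEGER)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        (1, "angina", 0), (1, "arrhythmia", 100), (1, "heart failure", 900),
        (2, "arrhythmia", 0), (2, "angina", 50), (2, "heart failure", 60),
        (3, "angina", 10), (3, "heart failure", 20),
    ],
)
milestones = ["angina", "arrhythmia", "heart failure"]
sql, params = build_episode_query(milestones)
matching = [row[0] for row in conn.execute(sql, params)]
```

Because only the query builder knows about SQL, the rest of such a pipeline can work on plain lists of patient identifiers and events, which is consistent with the data-source independence noted above.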
Temporal Pattern Mining
As previously indicated, the pattern mining module 400 performs temporal pattern mining.
As shown in
Thereafter, during step 420, the pattern mining module 400 detects frequent event patterns using the Frequent Pattern Miner. The exemplary frequent pattern miner is responsible for detecting event subsequences that frequently occur in a set of input episodes 200. The miner defines a pattern as “frequent” based on the percentage of the input episodes in which the pattern appears, referred to herein as the pattern's support. As indicated above, the miner looks for patterns with a support value above a threshold. In one preferred embodiment, the support value is configurable. Users can also specify a minimum pattern length that can be any integer value greater than or equal to one. The pattern discovery employed by the exemplary pattern mining module 400 is based on a bitmap representation-based Sequential PAttern Miner (SPAM) (see, e.g., Jay Ayres et al., “Sequential PAttern Mining Using a Bitmap Representation,” Proc. of the 8th ACM SIGKDD Int'l Conf. on Knowledge Discovery and Data Mining, 429-35 (2002), incorporated by reference herein) which uses a search strategy that integrates a depth-first traversal of the search space with effective pruning mechanisms. The SPAM algorithm has been proven to be faster than traditional pattern mining approaches by an order of magnitude, especially when applied to relatively long episodes. The SPAM algorithm takes as input a set of event sequences (i.e., the episode data) and a user-specified support value, and produces as output a set of frequent patterns. The user-supplied minimum length threshold is then applied to filter out patterns that are too short.
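The notions of support, support threshold and minimum pattern length can be illustrated with the following simplified sketch. It is not the SPAM algorithm: it omits the bitmap representation and simply performs a depth-first extension of candidate patterns, pruning any candidate whose support falls below the threshold. The function names, the max_length cap, and the choice to keep patterns whose support equals the threshold are assumptions made for illustration.

```python
from typing import Dict, List, Sequence, Tuple


def contains(episode: Sequence[str], pattern: Sequence[str]) -> bool:
    """True if `pattern` occurs in `episode` as an ordered subsequence."""
    it = iter(episode)
    return all(event in it for event in pattern)


def support(episodes: List[Sequence[str]], pattern: Sequence[str]) -> float:
    """Fraction of episodes in which the pattern appears."""
    return sum(contains(ep, pattern) for ep in episodes) / len(episodes)


def mine_frequent_patterns(
    episodes: List[Sequence[str]],
    min_support: float,
    min_length: int = 1,
    max_length: int = 4,
) -> Dict[Tuple[str, ...], float]:
    alphabet = sorted({event for ep in episodes for event in ep})
    frequent: Dict[Tuple[str, ...], float] = {}

    def extend(prefix: Tuple[str, ...]) -> None:
        for event in alphabet:
            candidate = prefix + (event,)
            s = support(episodes, candidate)
            if s < min_support:
                # Prune: extending an infrequent pattern cannot raise its support.
                continue
            frequent[candidate] = s
            if len(candidate) < max_length:
                extend(candidate)

    extend(())
    # The minimum-length filter is applied after mining, as described above.
    return {p: s for p, s in frequent.items() if len(p) >= min_length}


episodes = [["A", "B", "C"], ["A", "C"], ["B", "A", "C"], ["A", "B"]]
patterns = mine_frequent_patterns(episodes, min_support=0.75, min_length=2)
```

In this toy input, the pattern ("A", "C") appears in three of the four episodes, so it is the only pattern of length two or more that reaches a support of 0.75.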
Generally, the Statistical Pattern Analyzer looks for correlations between the mined patterns and the episode specification's outcome measure. The exemplary pattern mining module 400 employs the Statistical Pattern Analyzer during step 430 to form a bag-of-pattern (BoP) representation matrix for each episode from the identified set of frequent patterns. More formally, given a set of n frequent patterns, the BoP representation is an n-dimensional vector, where the i-th element of that vector stores the frequency of the i-th pattern within the corresponding episode. If there are m episodes (corresponding to m distinct patients), then an m×n episode-pattern matrix X=[x1, x2, . . . , xn] is constructed whose (j,i)-th element indicates the number of times the i-th pattern appeared in the j-th episode. Thus, its i-th column xi summarizes the frequency of the i-th pattern in all m episodes. An m-dimensional episode outcome vector y can also be constructed, such that yj is the outcome of the j-th episode. In the binary case, yj ∈ {+1, −1}, with +1 representing a positive outcome and −1 representing a negative outcome. Given this formulation, statistics are computed measuring the correlation between each xi and y to measure the informativeness of the i-th pattern in terms of predicting an episode's outcome. For example, the Pearson correlation, P-value (to measure the significance of a correlation), information gain, and odds ratio can be computed.
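The construction of the episode-pattern matrix X and two of the statistics named above can be sketched as follows. The sketch makes several assumptions that are not stated in the description: a pattern's frequency within an episode is counted as the number of non-overlapping, greedy left-to-right occurrences; the odds ratio is computed on pattern presence versus absence with a 0.5 continuity correction to avoid division by zero; and the P-value and information gain are omitted. All function names are hypothetical.

```python
from math import sqrt
from typing import List, Sequence


def count_occurrences(episode: Sequence[str], pattern: Sequence[str]) -> int:
    """Count non-overlapping, greedy occurrences of a non-empty pattern."""
    count, k = 0, 0
    for event in episode:
        if event == pattern[k]:
            k += 1
            if k == len(pattern):
                count, k = count + 1, 0
    return count


def bop_matrix(episodes, patterns) -> List[List[int]]:
    """m x n matrix: row j is the bag-of-pattern vector of episode j."""
    return [[count_occurrences(ep, p) for p in patterns] for ep in episodes]


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # a constant column carries no correlation information
    return cov / sqrt(vx * vy)


def odds_ratio(xs: Sequence[int], ys: Sequence[int]) -> float:
    """Odds of a positive outcome with vs. without the pattern (+0.5 correction)."""
    a = sum(1 for x, y in zip(xs, ys) if x > 0 and y == 1)
    b = sum(1 for x, y in zip(xs, ys) if x > 0 and y == -1)
    c = sum(1 for x, y in zip(xs, ys) if x == 0 and y == 1)
    d = sum(1 for x, y in zip(xs, ys) if x == 0 and y == -1)
    return ((a + 0.5) * (d + 0.5)) / ((b + 0.5) * (c + 0.5))


episodes = [["A", "C", "A", "C"], ["A", "C"], ["B"], ["B", "A"]]
y = [1, 1, -1, -1]                      # +1 positive, -1 negative outcome
patterns = [("A", "C"), ("B",)]
X = bop_matrix(episodes, patterns)
columns = list(zip(*X))                 # column i is x_i
r = [pearson(col, y) for col in columns]
odds = [odds_ratio(col, y) for col in columns]
```

Here the first pattern occurs only in episodes with a positive outcome and the second only in episodes with a negative outcome, so the two columns correlate with y in opposite directions.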
During step 440, the Statistical Pattern Analyzer performs a statistical analysis for the correlation of each pattern with outcomes. Finally, the pattern mining module 400 provides the results to the exemplary graphical user interface 300 during step 450.
Interactive Visualization
As previously indicated, once the pattern mining module 400 has completed, the results are passed to the interactive visualization module 500.
As shown in
The interactive visualization module 500 initially aggregates the event sequence data between each milestone, including outcome and timing, during step 510. Thereafter, the interactive visualization module 500 generates a flow graph layout and color coding during step 520 and renders the flow graph during step 530.
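The aggregation of step 510 could, under some simplifying assumptions, resemble the following sketch. It assumes each episode is a time-ordered list of (day, event type) pairs plus a binary outcome, locates each milestone greedily at its first occurrence after the previous milestone, and reports the mean duration of each milestone-to-milestone segment together with the fraction of positive outcomes in the cohort. The function names and record layout are hypothetical.

```python
from statistics import mean
from typing import Dict, List, Tuple


def split_by_milestones(events: List[Tuple[int, str]], milestone_types: List[str]) -> List[Dict]:
    """Split one time-ordered episode into the segments between milestones."""
    segments, current, start_day, k = [], [], None, 0
    for day, event_type in events:
        if k < len(milestone_types) and event_type == milestone_types[k]:
            if k > 0:
                segments.append({"duration": day - start_day, "events": current})
            start_day, current, k = day, [], k + 1
        elif 0 < k < len(milestone_types):
            current.append(event_type)  # intermediate event between milestones
    if k < len(milestone_types):
        raise ValueError("episode does not contain all milestones in order")
    return segments


def aggregate_edges(episodes: List[Dict], milestone_types: List[str]):
    n_edges = len(milestone_types) - 1
    durations = [[] for _ in range(n_edges)]
    for ep in episodes:
        for i, seg in enumerate(split_by_milestones(ep["events"], milestone_types)):
            durations[i].append(seg["duration"])
    edges = [
        {
            "from": milestone_types[i],
            "to": milestone_types[i + 1],
            "patients": len(durations[i]),
            "mean_days": mean(durations[i]),
        }
        for i in range(n_edges)
    ]
    positive_rate = mean(1 if ep["outcome"] == 1 else 0 for ep in episodes)
    return edges, positive_rate


milestone_types = ["angina", "arrhythmia", "heart failure"]
cohort = [
    {"outcome": 1, "events": [(0, "angina"), (30, "beta blocker"),
                              (100, "arrhythmia"), (150, "ecg"), (400, "heart failure")]},
    {"outcome": -1, "events": [(10, "angina"), (60, "arrhythmia"), (70, "ecg"),
                               (80, "statin"), (260, "heart failure")]},
]
edges, positive_rate = aggregate_edges(cohort, milestone_types)
```

The per-segment event lists returned by split_by_milestones are also the natural input for mining each intermediate event sequence separately.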
The exemplary interactive visualization module 500 retrieves pattern statistics for the selected edge (or overall sequence if no edge is selected) during step 540. An incremental rendering of the event pattern scatter plot (animate entry/exit/change of individual events) is generated during step 550. Finally, the interactive visualization module 500 listens for an edge selection event during step 560.
As indicated above, the exemplary interactive visualization module 500 provides a cohort overview.
As discussed hereinafter,
As indicated above, the exemplary interactive visualization module 500 also provides a milestone timeline. Generally, the milestone timeline visualization illustrates the sequence of milestone events 210 that define the overall episode 200. As shown in
From the overall episode shown in
As indicated above, the exemplary interactive visualization module 500 also provides a mined pattern diagram. As shown in
The size of each pattern's circle represents the information gain with larger circles being more meaningful, and the color of the circle represents the odds ratio. The exemplary embodiment adopts the same exemplary green-to-yellow-to-red color gradient used in the timeline 210 to encode the odds ratio. As a result, large red circles represent mined patterns that tend to lead to poor outcomes while large green circles represent patterns that led to good outcomes. Circles can be selected via mouse clicks to retrieve more information about the pattern. Upon selection, a sidebar can be displayed to the right of the scatter plot showing both the sequence of events that forms that pattern as well as the full set of statistics computed by the mining algorithm.
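The visual encoding described above can be sketched as two small mapping functions. The sketch assumes that an odds ratio greater than one indicates an association with a poor outcome (and is therefore drawn toward red), that the gradient is interpolated linearly in RGB over the logarithm of the odds ratio, and that circle area rather than radius scales with information gain. The specific colors, the clamping range and the radius limits are arbitrary illustrative values.

```python
from math import log, sqrt

# Hypothetical end points and mid point of the green-to-yellow-to-red gradient.
GREEN, YELLOW, RED = (0, 160, 0), (255, 220, 0), (220, 0, 0)


def lerp(c1, c2, t):
    """Linear interpolation between two RGB triples, t in [0, 1]."""
    return tuple(round(a + (b - a) * t) for a, b in zip(c1, c2))


def odds_ratio_color(odds_ratio: float, max_abs_log: float = 2.0) -> str:
    """Map a positive odds ratio to a hex color; 1.0 maps to yellow."""
    x = max(-max_abs_log, min(max_abs_log, log(odds_ratio)))
    t = (x + max_abs_log) / (2 * max_abs_log)   # 0 = green, 1 = red
    if t <= 0.5:
        rgb = lerp(GREEN, YELLOW, t / 0.5)
    else:
        rgb = lerp(YELLOW, RED, (t - 0.5) / 0.5)
    return "#{:02x}{:02x}{:02x}".format(*rgb)


def circle_radius(info_gain: float, max_gain: float,
                  min_r: float = 3.0, max_r: float = 20.0) -> float:
    """Radius grows with the square root of the gain so that area tracks gain."""
    frac = 0.0 if max_gain <= 0 else max(0.0, min(1.0, info_gain / max_gain))
    return min_r + (max_r - min_r) * sqrt(frac)
```

Working on the logarithm of the odds ratio keeps the scale symmetric, so that odds ratios of 2 and 0.5 sit equally far from the yellow mid point.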
Coupled with the milestone timeline 210, the pattern diagram 630 provides hierarchical access to a complex set of mined pattern statistics. Users can select a region of the episode (i.e., intermediate episodes 220) via the timeline to see the corresponding set of patterns in the pattern diagram 630. They can then select one of those patterns to see the lowest level of information including the events in the pattern and detailed statistics such as p-values.
An important feature of the mined pattern diagram 630 is its support for temporal comparison. The significance of an event pattern can vary between different stages of an episode. For example, a specific pattern may be present in the overall episode 200, but without statistical significance with respect to outcome. Meanwhile, that same pattern may have a very strong association with outcome during an early intermediate episode 220 despite having absolutely no correlation with outcome later in time.
To help users understand these temporal changes in pattern significance, the exemplary mined pattern diagram 630 adopts animated transitions whenever the milestone timeline selection changes. Upon any such change, the pattern diagram component compares the “before” and “after” pattern sets and computes three distinct sets: incoming patterns, outgoing patterns, and remaining patterns. Incoming patterns are patterns that only exist in the newly selected portion of the episode. Circles representing these patterns are added to the diagram. Outgoing patterns are patterns that only exist in the previously selected portion of the episode. Circles for these patterns are removed from the diagram. Most critical are the remaining patterns. The circles for these patterns are animated to new locations, colors and sizes to reflect the change in statistics for the patterns. Therefore, as users click from early to late term intermediate episodes 220, the bubble chart shows via animation the trajectory of a pattern as it becomes more (or less) significant and/or prevalent. If an individual pattern is selected (as in
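The computation of the incoming, outgoing and remaining pattern sets amounts to set differences and an intersection over the pattern keys, as in the following minimal sketch. It assumes each pattern set is a dictionary mapping a pattern (a tuple of event types) to its statistics; the function name and the statistics shown are hypothetical.

```python
from typing import Dict, Tuple

Pattern = Tuple[str, ...]


def diff_pattern_sets(before: Dict[Pattern, dict], after: Dict[Pattern, dict]):
    """Classify patterns for an animated transition between two selections."""
    # Only in the newly selected portion: circles to add.
    incoming = {p: after[p] for p in after.keys() - before.keys()}
    # Only in the previously selected portion: circles to remove.
    outgoing = {p: before[p] for p in before.keys() - after.keys()}
    # In both: circles to animate from the old statistics to the new ones.
    remaining = {p: (before[p], after[p]) for p in before.keys() & after.keys()}
    return incoming, outgoing, remaining


before = {("A", "C"): {"odds_ratio": 1.1}, ("B",): {"odds_ratio": 0.7}}
after = {("A", "C"): {"odds_ratio": 3.2}, ("D",): {"odds_ratio": 1.9}}
incoming, outgoing, remaining = diff_pattern_sets(before, after)
```

Keeping both the old and the new statistics for each remaining pattern gives the renderer the start and end states it needs to interpolate position, color and size.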
One exemplary implementation comprises a web-based application, making it easily deployable to large user populations. The system uses Servlet technology, which is supported by the open-source Apache Tomcat server and a number of commercial offerings (e.g., IBM WebSphere). The server-side functionality is implemented in Java. The exemplary implementation connects to ICDA data sources (see, e.g., D. Gotz et al., “ICDA: A Platform for Intelligent Care Delivery Analytics,” AMIA Annual Symposium Proc., American Medical Informatics Association (2012), incorporated by reference herein), which are based on widely used standards such as ICD, CPT, and NDC.
Client-side functionality is developed using standard web technologies and allows access through any modern web browser. In addition to HTML, CSS, and JavaScript technologies, the exemplary implementation adopts the Dojo toolkit for user interface widgets. D3.js is used as a visualization toolkit on which to build the custom visualizations. The visualizations therefore rely on SVG as the underlying rendering technology.
Data adapters are provided to connect the prototype to two data sources, each with a somewhat different set of available clinical event types. As users interact with the query interface to add event constraints to an episode specification, type-ahead find is used to constrain the selections to only the event types present within the data. This allows users to quickly see what event types are available in a given dataset without deep prior knowledge of the data source.
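Constraining selections to the event types present in the data can be sketched as a case-insensitive prefix lookup over a sorted list, as shown below. The exemplary implementation performs this in the browser; the sketch uses Python only to illustrate the lookup logic, and the class name TypeAhead, the suggest method and the limit parameter are hypothetical.

```python
from bisect import bisect_left
from typing import Iterable, List


class TypeAhead:
    """Prefix lookup restricted to the event types found in a dataset."""

    def __init__(self, event_types: Iterable[str]):
        # Sort by lower-cased key so that matches for a prefix are contiguous.
        self._keys = sorted((t.lower(), t) for t in set(event_types))

    def suggest(self, prefix: str, limit: int = 10) -> List[str]:
        p = prefix.lower()
        start = bisect_left(self._keys, (p,))   # first key >= prefix
        out: List[str] = []
        for key, original in self._keys[start:]:
            if not key.startswith(p) or len(out) == limit:
                break
            out.append(original)
        return out


lookup = TypeAhead(["Hypertension", "Hypothyroidism", "Angina", "Heart failure"])
suggestions = lookup.suggest("hyp")
```

Because the index is built from the dataset itself, an event type that is absent from the connected data source can never be offered as a suggestion.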
The disclosed method allows users to perform a wide range of ad hoc visual analysis tasks over event sequence data. Three exemplary use cases are discussed to show the types of investigations that the disclosed visual analytics system 100 supports.
One Pattern Over Time
This use case, shown in
All Patterns Over Time
Another use case investigates a cohort of hypothyroidism patients using ICD codes. As instructed by the user, the exemplary visual analytics system 100 retrieves a cohort of patients who progress from obesity, to hypertension, to type-2 diabetes, to hypothyroidism. The outcome event of interest, found in 11.6% of the cohort, is a diagnosis of anemia. As one may expect, the group is mostly women with ages ranging from 53-95. It can be shown that there is an interesting change in the observed patterns as the user moves from the first to the last intermediate episodes. At the start of the progression, there are very few (e.g., seven) common patterns found (and all have negative associations). In the middle period, there are more patterns though with only weak correlations with outcome. This can be illustrated in the way the circles cluster along the diagonal of the mined pattern diagram. For this particular analysis, the strong indicators for anemia are not evident until the third and final intermediate episode where the number of frequent patterns grows significantly and the odds ratios grow quite large.
Comparing Two Patterns
Another use case investigates a group of hypertensive patients using Hierarchical Condition Category (HCC) data. An exemplary episode specification is given that requires a sequence of four milestone conditions: hypertension, followed by hypertensive heart disease, followed by angina, followed by heart infection/inflammation. The outcome measure is specified as cardio-respiratory failure and shock. Episodes are retrieved for a cohort of patients matching this specification with just over 7% having negative outcomes. A large number of patterns (with minimum length of 1) are found in the overall sequence. The very same pattern can be found in the first intermediate episode, and the significance was even stronger than in the overall data. Further analysis shows that while the patients suffering arrhythmias during this early stage were among those most likely to have a bad outcome, there was another subgroup that had much better outcomes. Patients with endocrine/metabolic disorders in the first intermediate episode had much better outcomes (hence the green circle in the chart). However, comparing p-values for these two patterns, it can be seen that only arrhythmias has a sufficiently small p-value to be statistically significant. The endocrine/metabolic disorder pattern had a p-value of 0.094, which is greater than the commonly used 0.05 threshold for significance. Nonetheless, it remains a possible factor that could serve as a hypothesis for additional investigation.
Various aspects of the invention provide an exploratory visual analytics system for clinical episode analysis that combines a graphical query interface, event pattern mining and interactive visualization.
The techniques depicted in
Additionally, the techniques depicted in
An aspect of the invention or elements thereof can be implemented in the form of an apparatus including a memory and at least one processor that is coupled to the memory and configured to perform exemplary method steps.
Additionally, an aspect of the present invention can make use of software running on a general purpose computer or workstation. With reference to
Accordingly, computer software including instructions or code for performing the methodologies of the invention, as described herein, may be stored in associated memory devices (for example, ROM, fixed or removable memory) and, when ready to be utilized, loaded in part or in whole (for example, into RAM) and implemented by a CPU. Such software could include, but is not limited to, firmware, resident software, microcode, and the like.
A data processing system suitable for storing and/or executing program code will include at least one processor 702 coupled directly or indirectly to memory elements 704 through a system bus 710. The memory elements can include local memory employed during actual implementation of the program code, bulk storage, and cache memories which provide temporary storage of at least some program code in order to reduce the number of times code must be retrieved from bulk storage during implementation.
Input/output or I/O devices (including but not limited to keyboards 708, displays 706, pointing devices, and the like) can be coupled to the system either directly (such as via bus 710) or through intervening I/O controllers (omitted for clarity).
Network adapters such as network interface 714 may also be coupled to the system to enable the data processing system to become coupled to other data processing systems or remote printers or storage devices through intervening private or public networks. Modems, cable modems and Ethernet cards are just a few of the currently available types of network adapters.
As used herein, including the claims, a “server” includes a physical data processing system (for example, system 712 as shown in
As will be appreciated by one skilled in the art, aspects of the present invention may be embodied as a system, method or computer program product. Accordingly, aspects of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software and hardware aspects that may all generally be referred to herein as a “circuit,” “module” or “system.” Furthermore, aspects of the present invention may take the form of a computer program product embodied in one or more computer readable medium(s) having computer readable program code embodied thereon.
Any combination of one or more computer readable medium(s) may be utilized. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the “C” programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).
Aspects of the present invention are described below with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer readable medium that can direct a computer, other programmable data processing apparatus, or other devices to function in a particular manner, such that the instructions stored in the computer readable medium produce an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
It should be noted that any of the methods described herein can include an additional step of providing a system comprising distinct software modules embodied on a computer readable storage medium; the modules can include, for example, any or all of the components detailed herein. The method steps can then be carried out using the distinct software modules and/or sub-modules of the system, as described above, executing on a hardware processor 702. Further, a computer program product can include a computer-readable storage medium with code adapted to be implemented to carry out at least one method step described herein, including the provision of the system with the distinct software modules.
In any case, it should be understood that the components illustrated herein may be implemented in various forms of hardware, software, or combinations thereof, for example, application specific integrated circuit(s) (ASICs), functional circuitry, an appropriately programmed general purpose digital computer with associated memory, and the like. Given the teachings of the invention provided herein, one of ordinary skill in the related art will be able to contemplate other implementations of the components of the invention.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used herein, the singular forms “a,” “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “comprises” and/or “comprising,” when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of another feature, integer, step, operation, element, component, and/or group thereof.
The corresponding structures, materials, acts, and equivalents of all means or step plus function elements in the claims below are intended to include any structure, material, or act for performing the function in combination with other claimed elements as specifically claimed.
The descriptions of the various embodiments of the present invention have been presented for purposes of illustration, but are not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, the practical application or technical improvement over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.