This is a Non-Provisional patent application of U.S. Provisional Application No. 62/478,529, filed in the United States on Mar. 29, 2017, entitled, “State Transition Network Analysis of Multiple One-Dimensional Time Series,” the entirety of which is hereby incorporated by reference.
The present invention relates to a system for encoding semantic information from a collection of time series as a multilayer network and, more particularly, to a system for encoding semantic information from a collection of time series as a multilayer network for state transition network analysis.
Visibility networks may be used to analyze individual time series in a similar way using a state transition network (see Literature Reference No. 1 in the List of Incorporated Literature References). Visibility networks can be used to simplify complicated time series information, but they have several disadvantages. For instance, visibility networks do not generalize to multiple/multidimensional time series. Furthermore, they do not account for different time scales and do not specifically encode semantic features of interest. Additionally, visibility networks treat every piece of data the same way. Moreover, recurrence networks (as described in Literature Reference No. 4) and coarse-graining methods (described in Literature Reference No. 5) can be used to encode multiple time series, but lack the ability to encode semantic features.
Thus, a continuing need exists for a method of encoding specific semantic features that may be specialized to specific datasets and account for multiple time scales.
The present invention relates to a system for encoding semantic information from a collection of time series as a multilayer network and, more particularly, to a system for encoding semantic information from a collection of time series as a multilayer network for state transition network analysis. The system comprises one or more processors and a non-transitory computer-readable medium having executable instructions encoded thereon such that when executed, the one or more processors perform multiple operations. The system generates a collection of time series from social media data related to an event of interest. The collection of time series is partitioned into a set of time intervals, and semantic features are extracted from the set of time intervals as a set of semantic intervals. The semantic features are encoded into a multilayer network, resulting in an encoded network. A set of subgraphs of the multilayer network are transformed into a state transition network. The encoded network is analyzed using the state transition network. Using the analyzed encoded network, an alert related to a prediction of a future event of interest is generated.
In another aspect, each layer of the multilayer network represents a distinct time series, and wherein the multilayer network comprises nodes, which each represent an event of interest, and colored edges, wherein different colored edges represent different time-scales.
In another aspect, in encoding the semantic features into the multilayer network, the system defines a node in the multilayer network for each semantic interval. A timestamp is assigned to each semantic interval. Timestamps of each pair of nodes are compared to determine a difference in timestamps between the pair of nodes. Each pair of nodes is linked with an edge having a color determined based on the difference in timestamps.
In another aspect, in transforming a set of subgraphs of the multilayer network into a state transition network, the system divides the multilayer network into a set of time intervals. A subgraph of the multilayer network restricted to each time interval is generated, wherein each subgraph is comprised of nodes and edges linking the nodes. For each subgraph, a number of edges of each type is determined. The number of edges is compared against a threshold value, and a state is assigned to each subgraph. The state transition network is generated, wherein each state is represented by a state node and each transition from one state to another is represented by an edge.
In another aspect, a number of occurrences of a state determines a size of the state node in the state transition network, and a number of transitions determines a size of a corresponding edge in the state transition network.
In another aspect, using the state transition network, a random walker method with memory is used to probabilistically select a next state in the encoded network to yield a prediction of a future event of interest.
In another aspect, the alert is sent electronically via at least one of email, text message, and a social media network.
Finally, the present invention also includes a computer program product and a computer implemented method. The computer program product includes computer-readable instructions stored on a non-transitory computer-readable medium that are executable by a computer having one or more processors, such that upon execution of the instructions, the one or more processors perform the operations listed herein. Alternatively, the computer implemented method includes an act of causing a computer to execute such instructions and perform the resulting operations.
The file of this patent or patent application publication contains at least one drawing executed in color. Copies of this patent or patent application publication with color drawing(s) will be provided by the Office upon request and payment of the necessary fee.
The objects, features and advantages of the present invention will be apparent from the following detailed descriptions of the various aspects of the invention in conjunction with reference to the following drawings, where:
The present invention relates to a system for encoding semantic information from a collection of time series as a multilayer network and, more particularly, to a system for encoding semantic information from a collection of time series as a multilayer network for state transition network analysis. The following description is presented to enable one of ordinary skill in the art to make and use the invention and to incorporate it in the context of particular applications. Various modifications, as well as a variety of uses in different applications will be readily apparent to those skilled in the art, and the general principles defined herein may be applied to a wide range of aspects. Thus, the present invention is not intended to be limited to the aspects presented, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
In the following detailed description, numerous specific details are set forth in order to provide a more thorough understanding of the present invention. However, it will be apparent to one skilled in the art that the present invention may be practiced without necessarily being limited to these specific details. In other instances, well-known structures and devices are shown in block diagram form, rather than in detail, in order to avoid obscuring the present invention.
The reader's attention is directed to all papers and documents which are filed concurrently with this specification and which are open to public inspection with this specification, and the contents of all such papers and documents are incorporated herein by reference. All the features disclosed in this specification, (including any accompanying claims, abstract, and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise. Thus, unless expressly stated otherwise, each feature disclosed is one example only of a generic series of equivalent or similar features.
Furthermore, any element in a claim that does not explicitly state “means for” performing a specified function, or “step for” performing a specific function, is not to be interpreted as a “means” or “step” clause as specified in 35 U.S.C. Section 112, Paragraph 6. In particular, the use of “step of” or “act of” in the claims herein is not intended to invoke the provisions of 35 U.S.C. 112, Paragraph 6.
Before describing the invention in detail, first a list of cited references is provided. Next, a description of the various principal aspects of the present invention is provided. Finally, specific details of various embodiments of the present invention are provided to give an understanding of the specific aspects.
The following references are cited and incorporated throughout this application. For clarity and convenience, the references are listed herein as a central resource for the reader. The following references are hereby incorporated by reference as though fully set forth herein. The references are cited in the application by referring to the corresponding literature reference number, as follows:
Various embodiments of the invention include three “principal” aspects. The first is a system for encoding semantic information for state transition network analysis. The system is typically in the form of a computer system operating software or in the form of a “hard-coded” instruction set. This system may be incorporated into a wide variety of devices that provide different functionalities. The second principal aspect is a method, typically in the form of software, operated using a data processing system (computer). The third principal aspect is a computer program product. The computer program product generally represents computer-readable instructions stored on a non-transitory computer-readable medium such as an optical storage device, e.g., a compact disc (CD) or digital versatile disc (DVD), or a magnetic storage device such as a floppy disk or magnetic tape. Other, non-limiting examples of computer-readable media include hard disks, read-only memory (ROM), and flash-type memories. These aspects will be described in more detail below.
A block diagram depicting an example of a system (i.e., computer system 100) of the present invention is provided in
The computer system 100 may include an address/data bus 102 that is configured to communicate information. Additionally, one or more data processing units, such as a processor 104 (or processors), are coupled with the address/data bus 102. The processor 104 is configured to process information and instructions. In an aspect, the processor 104 is a microprocessor. Alternatively, the processor 104 may be a different type of processor such as a parallel processor, application-specific integrated circuit (ASIC), programmable logic array (PLA), complex programmable logic device (CPLD), or a field programmable gate array (FPGA).
The computer system 100 is configured to utilize one or more data storage units. The computer system 100 may include a volatile memory unit 106 (e.g., random access memory (“RAM”), static RAM, dynamic RAM, etc.) coupled with the address/data bus 102, wherein the volatile memory unit 106 is configured to store information and instructions for the processor 104. The computer system 100 further may include a non-volatile memory unit 108 (e.g., read-only memory (“ROM”), programmable ROM (“PROM”), erasable programmable ROM (“EPROM”), electrically erasable programmable ROM (“EEPROM”), flash memory, etc.) coupled with the address/data bus 102, wherein the non-volatile memory unit 108 is configured to store static information and instructions for the processor 104. Alternatively, the computer system 100 may execute instructions retrieved from an online data storage unit such as in “Cloud” computing. In an aspect, the computer system 100 also may include one or more interfaces, such as an interface 110, coupled with the address/data bus 102. The one or more interfaces are configured to enable the computer system 100 to interface with other electronic devices and computer systems. The communication interfaces implemented by the one or more interfaces may include wireline (e.g., serial cables, modems, network adaptors, etc.) and/or wireless (e.g., wireless modems, wireless network adaptors, etc.) communication technology.
In one aspect, the computer system 100 may include an input device 112 coupled with the address/data bus 102, wherein the input device 112 is configured to communicate information and command selections to the processor 104. In accordance with one aspect, the input device 112 is an alphanumeric input device, such as a keyboard, that may include alphanumeric and/or function keys. Alternatively, the input device 112 may be an input device other than an alphanumeric input device. In an aspect, the computer system 100 may include a cursor control device 114 coupled with the address/data bus 102, wherein the cursor control device 114 is configured to communicate user input information and/or command selections to the processor 104. In an aspect, the cursor control device 114 is implemented using a device such as a mouse, a track-ball, a track-pad, an optical tracking device, or a touch screen. The foregoing notwithstanding, in an aspect, the cursor control device 114 is directed and/or activated via input from the input device 112, such as in response to the use of special keys and key sequence commands associated with the input device 112. In an alternative aspect, the cursor control device 114 is configured to be directed or guided by voice commands.
In an aspect, the computer system 100 further may include one or more optional computer usable data storage devices, such as a storage device 116, coupled with the address/data bus 102. The storage device 116 is configured to store information and/or computer executable instructions. In one aspect, the storage device 116 is a storage device such as a magnetic or optical disk drive (e.g., hard disk drive (“HDD”), floppy diskette, compact disk read only memory (“CD-ROM”), digital versatile disk (“DVD”)). Pursuant to one aspect, a display device 118 is coupled with the address/data bus 102, wherein the display device 118 is configured to display video and/or graphics. In an aspect, the display device 118 may include a cathode ray tube (“CRT”), liquid crystal display (“LCD”), field emission display (“FED”), plasma display, or any other display device suitable for displaying video and/or graphic images and alphanumeric characters recognizable to a user.
The computer system 100 presented herein is an example computing environment in accordance with an aspect. However, the non-limiting example of the computer system 100 is not strictly limited to being a computer system. For example, an aspect provides that the computer system 100 represents a type of data processing analysis that may be used in accordance with various aspects described herein. Moreover, other computing systems may also be implemented. Indeed, the spirit and scope of the present technology is not limited to any single data processing environment. Thus, in an aspect, one or more operations of various aspects of the present technology are controlled or implemented using computer-executable instructions, such as program modules, being executed by a computer. In one implementation, such program modules include routines, programs, objects, components and/or data structures that are configured to perform particular tasks or implement particular abstract data types. In addition, an aspect provides that one or more aspects of the present technology are implemented by utilizing one or more distributed computing environments, such as where tasks are performed by remote processing devices that are linked through a communications network, or such as where various program modules are located in both local and remote computer-storage media including memory-storage devices.
An illustrative diagram of a computer program product (i.e., storage device) embodying the present invention is depicted in
Described is a system and method for encoding semantic information from a collection of time series as a multilayer network, and for analyzing the encoded network via a state transition network. One purpose of the invention is to characterize the occurrence of rare, large scale events through analysis of semantic abstractions in data. The semantic abstractions consist of a collection of time series from which a multilayer network is built whose subgraphs are analyzed and transformed into a state transition network. The state transition network can be used to model and predict behavior of the underlying system. Some non-limiting examples of systems that can be modeled include online social networks, such as Twitter, human mobility networks, and economic networks. For each of these, it is desirable to predict large scale events, such as nationwide protests, epidemics, and recessions, respectively.
The system described herein extends the method of analyzing time series via a state transition network in a non-obvious way to a higher dimension by changing the method by which the time series are encoded to a network. Instead of using a visibility-based method, semantic information is encoded at multiple time scales to more directly represent the dynamics of an underlying spreading process. In other words, a collection of time series is encoded rather than a single time series. Many methods for encoding time series to a network, such as visibility graphs (see Literature Reference Nos. 2 and 3), only apply to individual time series. The system according to embodiments of the present disclosure accounts for multiple time scales and encodes semantic features instead of abstracting the time series directly, allowing for more specialization and adaptation to particular purposes.
Due to the method of encoding specific semantic features, the technique described herein may be specialized to specific datasets. In particular, the percolation processes underlying opinion diffusion or disease can be encoded from a collection of time series into a multilayer network and modeled via state transitions.
The input to the system described herein is a collection of related time series, a non-limiting example of which is protest data, which can be collected from Twitter, or other online media sources, by scraping all tweets containing a protest-related hashtag. Most of these tweets can be geocoded to a particular city, and using this geocoding one can construct a time series of the number of protest-related tweets occurring in each city on each day. In one embodiment, this collection of time series is encoded as a multilayer directed network with colored edges, which tracks the geographical spread of activity over time. As can be appreciated by one skilled in the art, since edges represent different time scales, it is necessary to distinguish types of edges via coloring or another type of labeling, unless only a single time scale is used.
(3.1) Semantic Feature Extraction
The first step in constructing the multilayer network is to define, for each time series, a collection of events and value abstractions that comprise the relevant semantic features. Each point in the time series is assigned a value of either low or high, the low points being those that are less than one standard deviation above the mean of the entire time series.
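The low/high value abstraction described above can be illustrated with a minimal Python sketch. The function names `label_points` and `high_intervals` are hypothetical, the use of the population standard deviation is an assumption (the text does not say which estimator is used), and grouping consecutive high points into intervals is one plausible way to obtain events, not necessarily the method of the disclosed system.

```python
from statistics import mean, pstdev


def label_points(series):
    """Label each point 'low' or 'high'.

    A point is 'low' if it is less than one standard deviation above the
    mean of the whole series (population standard deviation is assumed).
    """
    cutoff = mean(series) + pstdev(series)
    return ["low" if x < cutoff else "high" for x in series]


def high_intervals(series):
    """Return (start_index, end_index) pairs for maximal runs of 'high' points."""
    labels = label_points(series)
    intervals, start = [], None
    for i, label in enumerate(labels):
        if label == "high" and start is None:
            start = i                      # a run of high activity begins
        elif label == "low" and start is not None:
            intervals.append((start, i - 1))  # the run ended at the previous point
            start = None
    if start is not None:                  # series ended during a high run
        intervals.append((start, len(labels) - 1))
    return intervals


# Hypothetical daily tweet counts for one city.
daily_counts = [1, 1, 1, 10, 12, 1, 1, 9, 1]
print(high_intervals(daily_counts))
```

Each returned interval could then serve as one node of the layer corresponding to that time series.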
(3.2) Allen's Temporal Relations
It is straightforward to compare the values of individual numbers, but slightly more complicated to compare intervals which comprise many different values. Allen's temporal relations (see Literature Reference No. 7) allow one to compare time intervals in a sensible manner. There are 7 relations which are enumerated in the table in
(3.3) Additional Relations
In many scenarios, it may be beneficial to have a more fine-grained approach than that offered by Allen's temporal relations. On Twitter, for example, activity often spreads very quickly and it is important to distinguish between events which occur hours apart as opposed to days apart. Several additional relations are therefore defined and used to construct a directed, colored graph. In order to define the relations, a set of parameters needs to be set. There is a collection of times:
t1, . . . , tl and a corresponding collection of colors c1, c2, . . . , cl with t0 set equal to 0.
If event E2 occurs after event E1 and the gap between them is greater than ti−1 but less than ti, then E1 is linked to E2 with a directed edge that is colored ci.
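As a rough illustration of this coloring rule, the following Python sketch links pairs of events according to the gap between their timestamps. The names `edge_color` and `build_edges`, the example time scales, and the color labels are all hypothetical. The strict inequalities follow the wording above, so a gap exactly equal to some ti produces no edge here; how the system treats such boundary cases is not stated.

```python
def edge_color(gap, times, colors):
    """Return the color c_i such that t_(i-1) < gap < t_i, or None if no bracket fits."""
    lower = 0  # t_0 is set equal to 0
    for t_i, c_i in zip(times, colors):
        if lower < gap < t_i:
            return c_i
        lower = t_i
    return None


def build_edges(events, times, colors):
    """Link every ordered pair of events whose gap falls in some bracket.

    events maps an event name to its timestamp; the result is a list of
    directed, colored edges (source, target, color).
    """
    edges = []
    for src, t_src in events.items():
        for dst, t_dst in events.items():
            color = edge_color(t_dst - t_src, times, colors)
            if color is not None:
                edges.append((src, dst, color))
    return edges


# Hypothetical time scales in hours and their colors.
times = [6, 24, 72]
colors = ["red", "green", "blue"]
events = {"A": 0, "B": 3, "C": 30}
print(build_edges(events, times, colors))
```

Because the gap must be positive, edges always point from the earlier event to the later one.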
(3.4) Defining States
The multilayer network as constructed consists of nodes which each represent events and directed, colored links which represent relations between the events as defined by the colors. Therefore, to each node an interval of time is associated in which there is relatively high activity in a city. Suppose all nodes whose intervals co-occur with a specified interval of time, for instance Jan. 1, 2015 to Jan. 10, 2015, are collected. One may preserve all relations which exist between these events, and create a subgraph on the corresponding nodes. This subgraph represents all activity occurring within that 10-day period. Therefore, if one wishes to assign states to specified periods of time, it is natural to use these subgraphs.
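A minimal Python sketch of restricting the network to a window of time follows. The names `overlaps` and `window_subgraph` are hypothetical, and "co-occur" is interpreted here as any overlap between a node's interval and the window, which is an assumption.

```python
def overlaps(interval, window):
    """True if the closed interval and the closed window share any point in time."""
    (start, end), (a, b) = interval, window
    return start <= b and end >= a


def window_subgraph(node_intervals, edges, window):
    """Keep the nodes whose intervals overlap the window and the edges among them.

    node_intervals maps a node name to its (start, end) interval; edges is a
    list of (source, target, color) triples.
    """
    nodes = {n for n, iv in node_intervals.items() if overlaps(iv, window)}
    kept = [(u, v, c) for u, v, c in edges if u in nodes and v in nodes]
    return nodes, kept


# Hypothetical events, with times given as day numbers.
node_intervals = {"A": (1, 3), "B": (8, 12), "C": (15, 20)}
edges = [("A", "B", "red"), ("B", "C", "blue")]
print(window_subgraph(node_intervals, edges, (1, 10)))
```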
If there tend to be a small number of nodes, then the number of possible subgraphs may be so small that it makes sense for each distinct subgraph to represent a state. However, if there are too many distinct subgraphs, the number of distinct states will be so large that few, if any, will occur more than once. In that case it makes more sense to use properties of the subgraphs to determine the corresponding state. One possibility is to use common network measures, such as degree distribution, diameter, mixing time, and modularity. The approach according to embodiments of the present disclosure is to simply use a value abstraction of the number of links of each color. That is, the system determines whether the number of links of each color is low, medium, or high.
(3.5) State Transition Network
Suppose the events span a period of time from Ts to Te. In order to build a state transition network which represents this entire period of time, first break it up into intervals of fixed length c and assign a state to each interval. The collection of intervals is:
T={(Ts,Ts+c),(Ts+c,Ts+2c), . . . ,(Ts+floor((Te−Ts)/c)c,Te)}
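The collection of intervals T can be generated with a short Python sketch such as the one below. The function name `partition` is hypothetical. One assumption is made: when (Te−Ts) is an exact multiple of c, the final interval in the formula would be empty, so this sketch omits it.

```python
import math


def partition(t_start, t_end, c):
    """Split (t_start, t_end) into consecutive intervals of length c.

    The last interval is shorter when the span is not a multiple of c.
    """
    k = math.floor((t_end - t_start) / c)
    intervals = [(t_start + i * c, t_start + (i + 1) * c) for i in range(k)]
    if t_start + k * c < t_end:            # leftover partial interval
        intervals.append((t_start + k * c, t_end))
    return intervals


print(partition(0, 25, 10))
```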
In order to assign a state to an interval (a,b) from the set of intervals T, first collect all events whose temporal midpoints lie within that interval. That is, if an event occurs over the time interval (e1,e2), then the temporal midpoint (e1+e2)/2 must lie within the interval (a,b). Having collected all such events, the relations between them are determined, and the corresponding subgraph is built with colored, directed edges. The number of links of each color is counted and stored in a vector (n1, n2, . . . , nl). Each entry of this vector is compared against a threshold vector, an example of which is (1,100,1000), and transformed into a value abstraction. With this threshold vector, values between 1 and 100 would be considered “low,” values between 100 and 1000 would be considered “medium,” and values greater than 1000 would be considered “high.” The numerical vector (305,5,800,6000,8000) would therefore be transformed into (medium,low,medium,high,high). Finally, for the sake of brevity, readability, and extendibility, this vector is represented more simply as (2,1,2,3,3) by replacing “low” with “1,” “medium” with “2,” and “high” with “3.”
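The state assignment can be sketched in Python as follows. The names `events_in_interval` and `abstract_counts` are hypothetical. Three assumptions are made: the midpoint is computed as the center (e1 + e2)/2 of the event's interval, a count below the lowest threshold is mapped to 0, and a count exactly equal to a threshold is placed in the higher bracket. The text does not specify how these boundary cases are handled.

```python
from bisect import bisect_right


def events_in_interval(events, a, b):
    """Keep the events (e1, e2) whose temporal midpoint lies strictly within (a, b)."""
    return [(e1, e2) for e1, e2 in events if a < (e1 + e2) / 2 < b]


def abstract_counts(counts, thresholds=(1, 100, 1000)):
    """Map each per-color link count to 1 (low), 2 (medium), or 3 (high).

    bisect_right counts how many thresholds are less than or equal to the
    value, so 0 means the count is below the lowest threshold.
    """
    return tuple(bisect_right(thresholds, n) for n in counts)


print(events_in_interval([(0, 4), (8, 14), (9, 10)], 0, 10))
print(abstract_counts((305, 5, 800, 6000, 8000)))
```

The tuple returned by `abstract_counts` plays the role of the state for that interval, so two intervals with the same tuple are in the same state.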
After this process is complete, each interval Ti from T will have been assigned a state Si. Each pair of states (Si,Si+1), therefore, represents a transition from state Si to state Si+1. In the state transition network, each state S is represented by a node, and each transition (S,U) is represented by an edge. Moreover, each node S is weighted by the number of times the state S occurs, and the edge (S,U) is weighted proportionally to the number of times the transition (S,U) occurs. In order to build the state transition network, after a vector of value abstractions has been assigned to each time interval, the number of occurrences of each state S is counted along with the number of transitions between each pair of states (S,U). The number of occurrences of a state determines the size of the state node in the state transition network, while the number of transitions determines the size of the corresponding edge in the state transition network.
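The counting step can be sketched in a few lines of Python. The function name `build_stn` and the example state sequence are hypothetical; states are represented as tuples of value abstractions, which is an assumption about the data layout.

```python
from collections import Counter


def build_stn(states):
    """Build node and edge weights of a state transition network.

    states is the chronological list of states S_1, S_2, ... assigned to the
    consecutive time intervals.
    """
    node_weights = Counter(states)                   # occurrences of each state
    edge_weights = Counter(zip(states, states[1:]))  # occurrences of each transition
    return node_weights, edge_weights


# Hypothetical sequence of states with two edge colors each.
states = [(1, 1), (2, 1), (1, 1), (2, 1), (3, 3)]
nodes, edges = build_stn(states)
print(nodes)
print(edges)
```

The node and edge weights can be passed directly to a plotting routine to set node and edge sizes.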
(3.6) Notable Events/Protest Events
Periods of high, sustained activity or which correspond to real-world events are especially noteworthy, and it is desirable to characterize them within the framework of the state transition network. In order to incorporate real world events and exogenous shocks, one can replace states in specified intervals with special, additional states. For example, the GSR (Gold Standard Report) data details real-world protest events.
(3.7) Random Walker Model
The most straightforward way to model the underlying system using the state transition network is with a random walker. Given any starting state S, the next state is predicted by having the random walker randomly choose an adjacent state with probability proportional to the weights on the outgoing edges. However, this model is overly simplistic. For instance, it does not account for the fact that a long period of inactivity is more likely to persist than a short period of inactivity. Similarly, a spike of high activity may be more likely to die out quickly compared to sustained activity.
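A minimal Python sketch of such a memoryless random walker is given below. The names `next_state` and `walk`, the `seed` parameter, and the example edge weights are hypothetical; the weights are assumed to be stored as a mapping from (source, target) pairs to transition counts.

```python
import random


def next_state(current, edge_weights, rng=random):
    """Choose a successor of current with probability proportional to edge weight.

    Returns None if current has no outgoing edges.
    """
    options = [(dst, w) for (src, dst), w in edge_weights.items() if src == current]
    if not options:
        return None
    targets, weights = zip(*options)
    return rng.choices(targets, weights=weights, k=1)[0]


def walk(start, edge_weights, steps, seed=None):
    """Simulate up to `steps` transitions starting from `start`."""
    rng = random.Random(seed)
    path = [start]
    for _ in range(steps):
        nxt = next_state(path[-1], edge_weights, rng)
        if nxt is None:   # dead end: no observed transition out of this state
            break
        path.append(nxt)
    return path


# Hypothetical transition counts between three states.
edge_weights = {("a", "b"): 3, ("b", "a"): 1, ("b", "c"): 1}
print(walk("a", edge_weights, steps=5, seed=0))
```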
(3.8) Random Walker with Finite Memory
A random walker knows only its current state, and essentially predicts the next state Si+1 using only the current state Si. A walker which “remembers” many previous states may be better at realistically predicting future activity. To improve the accuracy, one could therefore instead attempt to predict the state Si+1 using the states Si, Si−1, Si−2, . . . , Si−h.
(3.9) State Sequences as Early Warning Signals
By counting all state sequences of some specified length d, at any given time the previous d−1 states can be used to place a probabilistic prediction on the following state. This corresponds directly to the method the random walker with memory uses to probabilistically choose the next state. Specifically, count the number of sequences whose first d−1 entries are the previous d−1 states as observed in the system, and predict the probability of state S as the proportion of those sequences whose dth entry is S. In particular, with the Brazil protest data this yields a prediction of a protest event. Hence, the sequences of length d ending in the protest event may be used as early warning signals.
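The sequence-counting prediction can be sketched in Python as follows. The names `sequence_model` and `predict` and the single-letter example states are hypothetical, and returning an empty result for a history that was never observed is an assumption.

```python
from collections import Counter, defaultdict


def sequence_model(states, d):
    """For every run of d-1 consecutive states, count which state followed it."""
    model = defaultdict(Counter)
    for i in range(len(states) - d + 1):
        history = tuple(states[i:i + d - 1])   # first d-1 entries of the sequence
        model[history][states[i + d - 1]] += 1  # the d-th entry
    return model


def predict(model, recent):
    """Return {state: probability} for the next state given the last d-1 states."""
    followers = model.get(tuple(recent))
    if not followers:
        return {}                              # history never observed
    total = sum(followers.values())
    return {state: n / total for state, n in followers.items()}


# Hypothetical state history; d = 3 means two states of memory.
history = ["a", "b", "a", "b", "a", "c"]
model = sequence_model(history, d=3)
print(predict(model, ["b", "a"]))
```

If one of the predicted states is a special protest state, its probability given the recent history serves as the early warning signal.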
One advantage of the invention described herein with respect to previous work in this area is that multiple time series can be analyzed. This is critical in many applications in which a single time series does not represent the underlying system well. Additionally, multiple time scales are employed. This is particularly important in opinion diffusion, in which social media and real-life interactions provide for vastly different time scales. Further, multiple semantic features may be extracted, although only value abstractions are employed in the system according to embodiments of this disclosure. Non-limiting examples of systems which can apply the described process include online social networks, collective human activity in an urban environment, and collective activity of vehicles or aircraft. A single aggregated feature of all individual elements in such systems is unlikely to be sufficient to usefully describe its state at a given time, so it is necessary to consider a large number of features. For instance, rather than considering the number of vehicles which experienced a specific failure in each month, the system according to embodiments of the present disclosure can consider the number of vehicles of each model type which experienced that failure, resulting in a higher dimensional time series.
A potential commercial application of this invention is to model the diffusion of failures in a system which may occur on different time scales. This system may be a collection of individual vehicles or aircraft which may experience a variety of related failures, or it may be a more complex system of interrelated entities which can rarely undergo a cascade of smaller failures leading to a large, unexpected failure. For instance, in an online social network, an individual failure may correspond to a user retweeting a hashtag that is related to a social movement, which could culminate in a more widespread failure in the form of a large protest. Similarly, individual failures in vehicles could lead to a recall. Additionally, individuals becoming sick could result in an epidemic.
In the event that a recall on a consumer product (e.g., vehicle component) is predicted, an alert (e.g., email, text message) can be automatically sent to experts who can verify that a recall should be issued, possibly preventing deaths and saving money resulting from lawsuits. Similarly, in the case of an epidemic, early action can be taken to reduce medical costs through such actions as emergency declaration, quarantine, requesting medical aid, and developing a vaccine. Furthermore, in the case of protests resulting from a social movement, users (e.g., officials or other interested parties) who receive an early warning via an automatically generated alert (
Finally, while this invention has been described in terms of several embodiments, one of ordinary skill in the art will readily recognize that the invention may have other applications in other environments. It should be noted that many embodiments and implementations are possible. Further, the following claims are in no way intended to limit the scope of the present invention to the specific embodiments described above. In addition, any recitation of “means for” is intended to evoke a means-plus-function reading of an element and a claim, whereas, any elements that do not specifically use the recitation “means for”, are not intended to be read as means-plus-function elements, even if the claim otherwise includes the word “means”. Further, while particular method steps have been recited in a particular order, the method steps may occur in any desired order and fall within the scope of the present invention.
Number | Name | Date | Kind |
---|---|---|---|
20140136186 | Adami | May 2014 | A1 |
Number | Date | Country | |
---|---|---|---|
62478529 | Mar 2017 | US |