1. Field
The present disclosure relates generally to social network analytics.
2. Background
Social network analysis software facilitates quantitative analysis of social networks by describing network features via numerical or visual representation. Social networks may include groups such as families, groups of individuals identifying themselves as friends, project teams, classrooms, sports teams, legislatures, nation-states, and memberships on networking websites such as TWITTER® or FACEBOOK®.
Some social network analysis software can generate social network features from raw social network data formatted as an edge list, an adjacency list, or an adjacency matrix, also called a socio-matrix. These social network features may be presented using a visualization. Some social network analysis software can also perform predictive analysis. Predictive analysis, such as peer influence modeling or contagion modeling, may use social network phenomena, such as a tie, to predict social network outcomes. An example of predictive analysis is to use individual-level phenomena to predict the formation of a tie or edge.
When analyzing many social networks, an analyst may desire to include many different parameters simultaneously, though in some cases this type of analysis is impossible due to the lack of available techniques. For example, simultaneously including different types of relationships, different topics of discussion, different roles, properties of the people and organizations involved, and states of the social network at different times may be useful when performing social network analysis, but is currently impossible. In other words, to date, no single social network analysis tool can incorporate all of the above aspects in a single representation. A tight coupling of content and topics of discussion with social networks is currently unavailable, but may be desirable because such information can shed light on social network data. These capabilities may be particularly desirable for social networks that involve communication between the participants, such as those constituted by or supported by social media. Social media include but are not limited to FACEBOOK®, TWITTER®, and GOOGLE PLUS®.
Thus, certain problems in social network analysis remain unsolved. For example, there is no approach or data visualization tool that can incorporate all of the above aspects in a single representation and provide a unified solution to the depiction of the social network. Another related problem is that current technologies are unable to represent and summarize multiple types of relationships in a temporal sequence simultaneously. For example, available tools do not provide a view of time, topics, and ranked importance of entities in the social network.
The illustrative embodiments provide for a method. The method includes processing social network data using one or more processors to establish a tensor model of the social network data, the tensor model having at least an order of four. The method also includes decomposing the tensor model using the one or more processors into a plurality of principal factors. The method also includes synthesizing, using the one or more processors, and from a subset of the plurality of principal factors, a summary tensor representing a plurality of relationships among a plurality of entities in the tensor model, such that a synthesis of relationships is formed and stored in one or more non-transitory computer readable storage media. The method also includes identifying, using the one or more processors and further using one of the summary tensor and a single principal factor in the subset, at least one parameter selected from the group consisting of: a correlation among the plurality of entities, a similarity between two of the plurality of entities, and a time-based trend of changes in the synthesis of relationships. The method also includes communicating the at least one parameter.
The illustrative embodiments also provide for a system. The system includes a modeler configured to establish a tensor model of social network data, the tensor model having at least an order of four. The system also includes a decomposer configured to decompose the tensor model into a plurality of principal factors. The system also includes a synthesizer configured to synthesize, from a subset of the plurality of principal factors, a summary tensor representing a plurality of relationships among a plurality of entities in the tensor model, such that a synthesis of relationships is formed and stored in one or more non-transitory computer readable storage media. The system also includes a correlation engine configured to identify, using one of the summary tensor and a single principal factor in the subset, at least one parameter selected from the group consisting of: a correlation among the plurality of entities, a similarity between two of the plurality of entities, and a time-based trend of changes in the synthesis of relationships. The system also includes an output device configured to communicate the at least one parameter.
The novel features believed characteristic of the illustrative embodiments are set forth in the appended claims. The illustrative embodiments, however, as well as a preferred mode of use, further objectives and features thereof, will best be understood by reference to the following detailed description of an illustrative embodiment of the present disclosure when read in conjunction with the accompanying drawings, wherein:
The illustrative embodiments provide several useful functions. For example, the illustrative embodiments provide for a multi-dimensional mathematical model which synthesizes multiple relationships in a social network, together with topics of discussion, to reveal hidden or latent links, correlations, and trends in social network relationships.
The illustrative embodiments also recognize and take into account that social network relationships and content in social media may be mathematically modeled using tensors. Relationships between nodes, such as people, organizations, locations, and other entities can be represented simultaneously using tensors. The illustrative embodiments provide techniques to mathematically decompose these tensors to simultaneously reveal topics, themes, and characteristics of the relationships of these entities in a temporal sequence.
The illustrative embodiments solve the previously unsolved issue of finding latent interactions in social network data. Examples of latent interactions in social network data include but are not limited to non-obvious trends or relationships in data, events, people, places, and relationships, possibly over temporal periods. One way of finding such latent interactions, proposed herein, is to find two entities that are both highly weighted in a significant principal factor. Another way of finding such latent interactions is to compare two entities of the same type, such as two persons or two terms or topics, using one of a number of distance or similarity metrics applied to sub-tensors representing the two entities.
As used herein, the following terms have the following definitions:
“Social network information” includes information relating to a social network, such as relationships between people and other entities that play a role in social relations or interactions among people, as well as information that describes how entities in a social network are connected to words and objects.
“Social network information”, for example, includes information posted on social media Web sites such as FACEBOOK® and TWITTER®. “Social network information” may also include information that originates outside of a social network, such as an online social network, but that in some way relates to persons or entities associated with persons.
An “entity” is an object in an abstract sense. An “entity” is not necessarily animate. Examples of entities include a person, a group of persons (distinguished from the members of the group), a social organization, a thing, a place, an event, a document, a word, an idea, or any other concept that may be identified as an abstract or concrete object.
A “document” is any unit of text for analysis, such as a TWEET® on TWITTER®, a single paragraph of a larger document, a blog, a text message, an entry in a database, a string of alphanumeric characters or symbols, a whole document, a text file, a label extracted from multi-media content, or any other unit of text for analysis.
A “tensor” is a multi-dimensional array of numbers.
An “order” of a tensor is the number of dimensions, also known as ways or modes. Equivalently, an “order” of a tensor is the number of indices required to address or represent a single entry or number in the tensor. A one-dimensional tensor has an order of one and may be represented by a series of numbers. Thus, for example, a vector is a tensor of order one. A two-dimensional tensor has an order of two and may be represented by a two-dimensional array of numbers, which in a simplistic example may be a tic-tac-toe board. A three-dimensional tensor has an order of three and may be represented by a three-dimensional array of numbers, which in a simple example may be visualized as a large cube made up of smaller cubes, with each smaller cube representing a number entry. A simple way of visualizing an order three tensor might be to visualize a RUBIK'S CUBE®, where the tensor constitutes numbers associated with each component cube. A four-dimensional tensor has an order of four and may be represented by a four-dimensional array of numbers, which in some, but not all, cases may be thought of as a series of three-dimensional tensors. Tensors may have an order greater than four. The above examples of specific orders of tensors are for example and ease of understanding only, and are not necessarily limiting on the claimed inventions. Tensors of order three or higher may be called high-order tensors.
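As a non-limiting illustration of tensor order, the following Python sketch, which assumes the NumPy library purely for illustration (NumPy is not named in this disclosure), constructs arrays of orders one through four and reports the number of indices each requires:

```python
import numpy as np

vector = np.array([1.0, 2.0, 3.0])   # order one: one index per entry
matrix = np.zeros((3, 3))            # order two: two indices per entry
cube = np.zeros((3, 3, 3))           # order three: three indices per entry
tensor4 = np.zeros((2, 3, 4, 5))     # order four: four indices per entry

for t in (vector, matrix, cube, tensor4):
    print(t.ndim)                    # prints 1, 2, 3, 4: the order of each tensor
```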
A “cell” is the location in a tensor of a single entry or number. A cell is identified or addressed by a set of integers called indices. A third-order tensor has three indices, a fourth-order tensor has four indices, and so on.
A “sub-tensor” of a tensor is a tensor of lower order extracted from the original tensor by holding one or more indices of the first tensor constant and letting all the others vary. For example, a third-order tensor may be extracted from a tensor of order four by holding a single index constant and letting all others vary.
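The following minimal Python sketch, again assuming NumPy and hypothetical dimensions, illustrates extracting sub-tensors by holding one or more indices constant and letting the others vary:

```python
import numpy as np

# Hypothetical order-four tensor: entity x entity x relation x time.
T = np.random.rand(10, 10, 5, 8)

# Hold the time index constant (t = 3) to extract an order-three sub-tensor.
sub3 = T[:, :, :, 3]
print(sub3.shape)  # (10, 10, 5)

# Hold the relation and time indices constant for an order-two sub-tensor.
sub2 = T[:, :, 2, 3]
print(sub2.shape)  # (10, 10)
```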
A “column rank” of a matrix is the maximum number of linearly independent column vectors of the matrix. A “row rank” of a matrix is the maximum number of linearly independent row vectors. A result of fundamental importance in linear algebra is that the column rank and the row rank are always equal. Thus, the “rank” of a matrix is either one of the column rank or the row rank. The rank of a matrix can be computed through numerical algorithms. An example of such an algorithm is singular value decomposition (SVD), defined below.
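As a minimal sketch of computing rank numerically, assuming NumPy and an arbitrary example matrix, the singular values of the matrix may be counted above a small tolerance:

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0],   # a multiple of the first row
              [0.0, 1.0, 1.0]])

# SVD-based rank: count singular values above a numerical tolerance.
s = np.linalg.svd(A, compute_uv=False)
rank = int(np.sum(s > 1e-10 * s.max()))
print(rank, np.linalg.matrix_rank(A))  # both print 2
```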
The definition of tensor rank is analogous to the definition of matrix rank. A high-order tensor is “rank one” if it can be written as an outer-product of vectors. Thus, each entry of a rank one tensor is the product of the entries of the corresponding vector cells. The PARAFAC algorithm, defined below, decomposes a tensor as a sum of rank one tensors.
An “outer-product rank”, or simply a “rank”, of a tensor is defined as the smallest number of rank-one tensors that generate the tensor as their sum. A tensor has an “outer-product rank” r if it can be written as a sum of r and no fewer outer-products of vectors in the corresponding space.
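The following sketch, assuming NumPy and arbitrary example vectors, forms a rank-one order-three tensor as an outer-product of three vectors and checks that each entry is the product of the corresponding vector entries:

```python
import numpy as np

a = np.array([1.0, 2.0])
b = np.array([3.0, 4.0, 5.0])
c = np.array([6.0, 7.0])

# A rank-one order-three tensor is the outer-product of three vectors:
# entry (i, j, k) equals a[i] * b[j] * c[k].
T = np.einsum('i,j,k->ijk', a, b, c)
print(T.shape)                           # (2, 3, 2)
print(T[1, 2, 0] == a[1] * b[2] * c[0])  # True
```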
A “matrix decomposition” is a factorization of a matrix into a product of matrices. Many different matrix decompositions exist; each finds use among a particular class of applications. A matrix decomposition can also be expressed as a sum of vector outer-products, as illustrated in the accompanying figures.
A “Singular Value Decomposition (SVD)” refers to both a mathematical theory and a matrix decomposition algorithm that expresses an arbitrary matrix as a product of three matrices: an orthogonal matrix, a diagonal matrix, and another orthogonal matrix, as illustrated in the accompanying figures.
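A minimal SVD sketch, assuming NumPy and a random example matrix, verifies that the matrix is recovered as the product of the three matrices named above:

```python
import numpy as np

A = np.random.rand(4, 3)
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# U and Vt have orthonormal columns and rows; s holds the singular values.
# A is recovered as the product of the three matrices.
print(np.allclose(A, U @ np.diag(s) @ Vt))  # True
```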
“Principal component analysis (PCA)” is a mathematical procedure that uses a linear transformation to convert a set of observations of possibly correlated variables into a set of values of orthogonal variables called “principal components”. This transformation is defined in such a way that the first principal component has the largest possible variance, and each succeeding component in turn has the highest variance possible under the constraint that it be orthogonal to the preceding components. PCA can be implemented through a variety of algorithms, and is most commonly implemented via SVD-based algorithms. The first few principal components usually retain most of the variation in the data. This fact leads to the idea of reducing the matrix to a matrix with fewer directions in vector space, and to low rank approximation methods in data analysis. PCA and low rank approximations can be interpreted mathematically as performing an orthogonal projection onto such a vector space.
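As a non-limiting sketch of SVD-based PCA, assuming NumPy and random example data, the observations are centered, decomposed, and projected onto the leading components:

```python
import numpy as np

X = np.random.rand(100, 6)       # 100 observations of 6 possibly correlated variables
Xc = X - X.mean(axis=0)          # center each variable

# SVD-based PCA: the rows of Vt are the principal directions.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt[:2].T           # project observations onto the first two components
explained = s**2 / np.sum(s**2)  # fraction of variance retained per component
print(scores.shape, explained[:2].sum())
```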
“Tensor decomposition” is a factorization of a tensor into a product of matrices and tensors, or a sum of rank-one tensors, each being an outer-product of vectors. The result of a tensor decomposition can often be used to identify correlations among different factors or attributes in a high-order tensor. There are many different tensor decompositions via different algorithms. Two particular tensor decompositions can be considered high-order extensions of the matrix SVD: the PARAFAC (Parallel Factorization) and the HOSVD (Higher-Order SVD, also known as the Tucker decomposition), as illustrated in the accompanying figures.
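As an illustration only, a PARAFAC decomposition may be computed with an off-the-shelf library; the sketch below assumes the third-party TensorLy package, which is not named in this disclosure, and a small random tensor:

```python
import numpy as np
import tensorly as tl
from tensorly.decomposition import parafac

T = tl.tensor(np.random.rand(10, 10, 5))   # entity x entity x relation

# PARAFAC/CP: express T as a sum of `rank` rank-one tensors.
weights, factors = parafac(T, rank=3)
print([f.shape for f in factors])          # [(10, 3), (10, 3), (5, 3)]

# Reconstruct the low-rank approximation from the factors.
approx = tl.cp_to_tensor((weights, factors))
print(approx.shape)                        # (10, 10, 5)
```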
A “principal factor” refers to a set of vectors whose outer-product is a rank-one tensor which may result from tensor decomposition. A principal factor can be viewed as a projection of a tensor onto tensor space with only one direction that combines information from all of the dimensions of the original tensor. The projection is used to focus on information of interest.
A “summary tensor” is a tensor of lower rank that is a projection of a tensor with a higher rank. For example, a summary tensor may be constructed from one or more principal factors. A summary tensor reduces noise in the tensor of higher rank with respect to information of interest by retaining only directions in the underlying tensor space of greater importance. In some cases, a summary tensor may contain a subset of information taken from a larger tensor, but important information usually is not lost when the projection is performed. A summary tensor will have the same order as the original tensor, but with a lower rank.
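The following sketch, assuming NumPy, a third-order example, and hypothetical factor vectors, synthesizes a summary tensor from a subset of rank-one principal factors; the result has the same order as the original tensor but rank at most equal to the number of retained factors:

```python
import numpy as np

def summary_tensor(factors, keep):
    """Synthesize a summary tensor from a subset of principal factors.

    factors: list of (a, b, c) vector triples, one per rank-one principal factor
    keep:    indices of the principal factors to retain
    """
    a0, b0, c0 = factors[0]
    S = np.zeros((len(a0), len(b0), len(c0)))
    for r in keep:
        a, b, c = factors[r]
        S += np.einsum('i,j,k->ijk', a, b, c)   # add one rank-one tensor
    return S  # same order as the original tensor, lower rank

factors = [(np.random.rand(10), np.random.rand(10), np.random.rand(5))
           for _ in range(4)]
print(summary_tensor(factors, keep=[0, 2]).shape)  # (10, 10, 5)
```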
The flow begins when the process receives input regarding social media data (operation 102). The process then determines the types of entities, features, and relations to represent using tensors (operation 104). Optionally, the process may partition data by temporal periods (operation 106). In this case, also optionally, the process may represent each temporal period as a separate tensor (operation 108). Alternatively, also optionally, the process may represent all of the data as a single tensor, with or without time as one of the dimensions (operation 110).
Whether the process went to operation 108 or operation 110, the process then may apply appropriate tensor decomposition techniques (operation 112). In an illustrative embodiment, a single technique may be used. Tensor decomposition techniques are more fully described elsewhere herein.
In addition to representing true relations between entities, such as family or friendship, or business or communication ties, the illustrative embodiments also allow for the representation of non-relational attributes, for example, biometric features like eye color or height, or type of organization, into the same tensor representation by recasting them as the relation of matching on that characteristic. This feature allows for better assessment of the similarity of entities, likely or potential grouping of entities, or possible hidden ties between entities.
Non-relational attributes can be categorical like eye-color or numerical like height. In the former case, one way of representing categorical attributes is binary representation. For example, if two people have the same eye color, the cell representing their intersection in the “matched eye color” relation will have a 1 and otherwise will have a 0. For numerical non-relational attributes like height, the cell representing their intersection in the “matched height” relation may have a 1 if their heights are both in the same height range or are within a certain distance of each other; otherwise, the value in the cell may be 0.
An alternative way of representing non-relational attributes is as non-binary values. In this case, if two people share a rare value for an attribute, their intersection cell will receive a higher value than if they share a common value for that attribute. For example, two people with blue eyes in a geographical region where most people have darker eyes will get a higher value for their match than two people with darker eyes. Similarly, people who are close to each other in height but share an extreme height, either tall or short, will get a higher value in the cell for their intersection in the “shared height relation” than two people who are close to each other with an average height.
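A minimal sketch of both representations, assuming NumPy, hypothetical eye-color data, and an inverse-frequency weighting chosen purely for illustration, follows:

```python
import numpy as np

colors = ['brown', 'brown', 'blue', 'green', 'blue']  # hypothetical eye color per entity
n = len(colors)
freq = {c: colors.count(c) / n for c in set(colors)}  # how common each value is

binary = np.zeros((n, n))
weighted = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        if i != j and colors[i] == colors[j]:
            binary[i, j] = 1.0                      # binary: 1 for a match, else 0
            weighted[i, j] = 1.0 / freq[colors[i]]  # rarer shared values score higher
print(weighted)
```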
The illustrative embodiments shown in the accompanying figures are not meant to imply physical or architectural limitations to the manner in which different illustrative embodiments may be implemented.
System 200 may be used to identify latent interactions in social network data. System 200 may use modeler 202 to establish tensor model 204 of social network data 206. Tensor model 204 may be at least an order of four in some illustrative embodiments. However, tensor model 204 may have different orders in different illustrative embodiments. In an illustrative embodiment, tensor model 204 may be a four-dimensional tensor comprising a time-based sequence of three-dimensional tensors.
In an illustrative embodiment, establishing tensor model 204 may include incorporating both relationships among entities and non-relational attributes of the entities into a single tensor representation, wherein the entities are in the tensor model. For example, biometric features such as eye color or height may be correlated in a single tensor representation to the type of organization to which the persons having those characteristics belong.
In another example, without limitation, the illustrative embodiments contemplate that tensor model 204 correlates an identification phrase of a third-party social network service with topics of discussion. An example of such an identification phrase may be a TWITTER HASHTAG® on the TWITTER® social network service. Known social network analysis techniques do not blend content analysis and relationships in the social network in this manner. Thus, in a non-limiting example, the determined parameter discussed below may consist of the correlation among the plurality of entities, wherein the plurality of entities comprises an identification phrase of a third-party social network service and a topic of discussion.
System 200 may also include decomposer 208 in communication with modeler 202. In an illustrative embodiment, decomposer 208 may be implemented using the same processor that implements modeler 202. Decomposer 208 may be configured to decompose tensor model 204 into plurality of principal factors 210. In other illustrative embodiments, decomposer 208 may be configured to decompose tensor model 204 into a single principal factor. Plurality of principal factors 210 may include subset of principal factors 212. Subset of principal factors 212 contains fewer principal factors than plurality of principal factors 210. Subset of principal factors 212 could be single principal factor 214.
System 200 may also include synthesizer 216 in communication with decomposer 208. Synthesizer 216 may be the same functional entity as modeler 202 in some illustrative embodiments. Synthesizer 216 may be configured to synthesize, from subset of principal factors 212, summary tensor 218. Summary tensor 218 may represent plurality of relationships 220 among plurality of entities 222 in tensor model 204. In this manner, a synthesis of relationships 224 is formed and stored in one or more non-transitory computer readable storage media 226.
System 200 may also include correlation engine 228. Correlation engine 228 may be configured to identify, using one of summary tensor 218 and single principal factor 214, at least one parameter 230 selected from the group consisting of: correlation 232 among plurality of entities 222, similarity 234 between two of plurality of entities 222, and time-based trend of changes 236 in synthesis of relationships 224. Time-based trend of changes 236 may be modeled by overlapping time windows of tensor model 204 to approximate sequencing in tensor model 204, as described further below with respect to weighted overlapping time windows.
System 200 may also include output device 238. Output device 238 may be configured to communicate parameter 230. Communicating parameter 230 may include communication of parameter 230 to some other device or software application, display of parameter 230 on a display, storing of parameter 230 in non-transitory computer readable storage media 226, and other transmission of parameter 230. Other forms of communication exist. In an illustrative embodiment, modeler 202, decomposer 208, synthesizer 216, correlation engine 228, and output device 238 are all embodied as a computer system, and possibly as a single computer system.
In an illustrative embodiment, decomposer 208 may have other functions. For example, decomposer 208 may be configured to receive a specification of a first entity modeled in tensor model 204. In this case, decomposer 208 may be configured to select single principal factor 214 that assigns a large weight to the first entity. As used herein, the term “large weight” means a weight that is among a specified number of the largest weights assigned to entities in single principal factor 214, or a weight in single principal factor 214 that is larger than a predetermined threshold.
Decomposer 208 may also be configured to identify a second entity modeled in the tensor model that is related to the first entity. Identifying the second entity may be based on the second entity being assigned a second weight in single principal factor 214, wherein the second weight is large. In other words, decomposer 208 may rank the second entity within a specified number of the largest-weighted entities in single principal factor 214, or decomposer 208 may determine that the weight assigned to the second entity is larger than a predetermined threshold. The second entity may be of the same type as the first entity or it may be of a different type. For example, the first entity may be a person and the second entity may be another person, a topic, or a time period.
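A minimal sketch of both selection rules, assuming NumPy and hypothetical per-entity loadings from a single principal factor, follows:

```python
import numpy as np

def related_entities(loadings, first_entity, k=5, threshold=None):
    """Return entities with 'large' weights in one principal factor's loadings:
    either the top-k by weight or all entities above a predetermined threshold."""
    w = np.abs(loadings)
    if threshold is not None:
        idx = np.flatnonzero(w >= threshold)       # rule 1: above a threshold
    else:
        idx = np.argsort(w)[::-1][:k]              # rule 2: among the k largest
    return [int(i) for i in idx if i != first_entity]

loadings = np.random.rand(20)   # hypothetical weight per entity in one factor
print(related_entities(loadings, first_entity=3, k=5))
```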
In an illustrative embodiment, plurality of relationships 220 may include a relationship between a document and a word, phrase, or string. The word, phrase, or string may be an identification phrase of a third party social network service, such as for example a TWITTER HASHTAG®.
In an illustrative embodiment, parameter 230 may consist of a similarity between two of plurality of entities 222. In this case, correlation engine 228 may be further configured to identify latent interaction 240 in social network data 206. Examples of latent interaction 240 in social network data 206 include but are not limited to non-obvious trends or relationships in data, events, people, places, and relationships, possibly over temporal periods.
Correlation engine 228 may perform this identification by comparing first sub-tensor 242 of summary tensor 218 to second sub-tensor 244 of summary tensor 218. First sub-tensor 242 may represent one of a first entity or a first complex entity. Second sub-tensor 244 may represent one of a second entity or a second complex entity. A “complex entity” may be, in a non-limiting example, an entity in plurality of entities 222 at a particular time period, represented by an N−2 dimensional sub-tensor of the original tensor model, as opposed to just an entity or just a time period, which is represented by an N−1 dimensional sub-tensor of the original tensor model. In this case, “N” is the dimensionality of both the original tensor (tensor model 204) and the corresponding summary tensor (summary tensor 218). Summary tensor 218 and tensor model 204 may have a same tensor order. In an illustrative embodiment, comparing may use one of a distance metric or a similarity metric, or both. Other comparing techniques may be used.
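As a non-limiting sketch, assuming NumPy, a random order-four summary tensor, and cosine similarity as the chosen metric, sub-tensors for two entities (and for two complex entities at a fixed time period) may be compared as follows:

```python
import numpy as np

def cosine_similarity(sub_a, sub_b):
    """Flatten two sub-tensors of the summary tensor and compare them with
    a cosine similarity metric (one of several possible metrics)."""
    a, b = sub_a.ravel(), sub_b.ravel()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

S = np.random.rand(10, 10, 5, 8)   # hypothetical order-four summary tensor (N = 4)

# N-1 dimensional sub-tensors for entities 3 and 7 (first index held constant).
print(cosine_similarity(S[3], S[7]))

# N-2 dimensional sub-tensors: the same entities at time period 0 ("complex entities").
print(cosine_similarity(S[3, :, :, 0], S[7, :, :, 0]))
```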
System 200 may be implemented using one or more data processing systems, such as data processing system 1500 described below.
The illustrative embodiments shown in the following flowchart are not meant to imply physical or architectural limitations to the manner in which different illustrative embodiments may be implemented. Flowchart 300 illustrates a method for identifying latent interactions in social network data.
In an illustrative embodiment, flowchart 300 may begin when the process processes social network data using one or more processors to establish a tensor model of the social network data, the tensor model having at least an order of four (operation 302). The process may then decompose the tensor model using the one or more processors into a plurality of principal factors (operation 304). The process may then synthesize, using the one or more processors, and from a subset of the plurality of principal factors, a summary tensor representing a plurality of relationships among a plurality of entities in the tensor model, such that a synthesis of relationships is formed and stored in one or more non-transitory computer readable storage media (operation 306).
The process may then identify, using the one or more processors and further using one of the summary tensor and a single principal factor in the subset, at least one parameter selected from the group consisting of: a correlation among the plurality of entities, a similarity between two of the plurality of entities, and a time-based trend of changes in the synthesis of relationships (operation 308). The process may then communicate the at least one parameter (operation 310).
The process may terminate thereafter in some illustrative embodiments. In other illustrative embodiments, the process may be varied or expanded. For example, a relationship in the plurality of relationships may be established by a commonality among two entities represented in the tensor model.
In an illustrative embodiment, the plurality of relationships may include a relationship between a first person and a second person. In another illustrative embodiment, the plurality of relationships may include a relationship between a person or an organization and a non-person object or event. In yet another illustrative embodiment, the plurality of relationships include a relationship between a document and a word, phrase, or string. In still another illustrative embodiment, the word, phrase, or string comprises an identification phrase of a third party social network service.
The method described in flowchart 300 may be varied in other illustrative examples.
The process of identifying may be varied further. For example, the parameter may consist of the similarity between two of the plurality of entities. In this case, identifying may further include comparing a first sub-tensor of the summary tensor, representing one of a first entity or a first complex entity, to a second sub-tensor of the summary tensor, representing one of a second entity or a second complex entity, wherein comparing uses one of a distance metric or a similarity metric.
In this example, the first sub-tensor may be a first N−1 sub-tensor relative to the summary tensor and the second sub-tensor may be a second N−1 sub-tensor relative to the summary tensor. “N” may be a dimensionality of the tensor model. The first sub-tensor and the second sub-tensor have a same tensor order.
The method described with respect to flowchart 300 may be varied still further.
In an illustrative embodiment, the tensor model may be a four-dimensional tensor comprising a time-based sequence of three-dimensional tensors. In an illustrative embodiment, the at least one parameter may be the time-based trend of changes. In this case, the time-based trend of changes may be modeled by overlapping time windows of the tensor model to approximate sequencing in the tensor model.
In an illustrative embodiment, establishing the tensor model may include incorporating relationships among entities, non-relational attributes of the entities, or both into a single tensor representation, wherein the entities are in the tensor model. In still another illustrative embodiment, the at least one parameter may consist of the correlation among the plurality of entities. In this case, the plurality of entities may consist of an identification phrase of a third-party social network service and a topic of discussion.
Flowchart 300 is an example only and does not limit the other illustrative embodiments described herein.
Broadly speaking, flow 400 illustrates a process for characterizing entities in a social network by using tensor representation and decomposition of heterogeneous data. Flow 400 begins with receiving data 402. Data 402 may be from a multi-source heterogeneous social network, which may be represented abstractly as dots and arrows.
After receiving data 402, the process may transform or represent data 402 as tensor model 406. Each two-dimensional array may represent an entity-by-entity comparison for a particular type of relationship. Individual cell entries may represent different facts about the two entities. For example, the number “7” 408 in “phone call” array 410 may represent that a first entity and a second entity are associated with 7 phone calls. However, this number may represent other aspects of the relationship, such as a weighting of the importance of the phone calls rather than a number of phone calls. Thus, the illustrative embodiments are not limited to this example.
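A minimal sketch of such a tensor model, assuming NumPy and hypothetical entity and relation names, follows; the “phone call” layer holds the value 7 for one ordered pair of entities, as in the example above:

```python
import numpy as np

entities = ['entity_1', 'entity_2', 'entity_3']    # hypothetical entity names
relations = ['phone call', 'email', 'friendship']  # one layer per relation type
e = {name: i for i, name in enumerate(entities)}
r = {name: i for i, name in enumerate(relations)}

# Order-three tensor model: entity x entity x relation.
T = np.zeros((len(entities), len(entities), len(relations)))
T[e['entity_1'], e['entity_2'], r['phone call']] = 7  # 7 phone calls, per the example
T[e['entity_2'], e['entity_1'], r['email']] = 2
print(T[:, :, r['phone call']])                       # the "phone call" array
```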
After transforming or representing data 402 as tensor model 406, the process may use one or more mathematical techniques to decompose tensor model 406. In the illustrative embodiment of flow 400, one or more of the tensor decomposition techniques described elsewhere herein may be used.
Flow 500 is similar to flow 400, but illustrates an example in which data 502 comprises posts exchanged among a community of bloggers.
After receiving data 502, the process may transform or represent data 502 as tensor model 504. Tensor model 504 may provide a layer for each term or terms, with each layer representing how often one blogger communicates to another blogger using the term which that layer represents. For example, the number “7” 506 in “term 1” array 508 may represent that a first blogger addresses 7 posts containing the corresponding term to a second blogger. The term represented by “term 1” array 508 could be, for example, a verb representing a certain action or move in a football game, or the name of a player, but could be any term. Number “7” 506 may represent other aspects of the relationship, such as, instead, a weighting of an importance of the term. Thus, the illustrative embodiments are not limited to this example.
After transforming or representing data 502 as tensor model 504, the process may use one or more mathematical techniques to decompose tensor model 504. In the illustrative embodiment of flow 500, one or more of the tensor decomposition techniques described elsewhere herein may be used.
In summary, in some illustrative embodiments, tensor model 604 may be constructed by a sequence of tensors representing underlying dynamic social networks 602 at each time instance along temporal axis 600. In this example, tensor decompositions, performed during tensor analysis 606, may be performed on each tensor in the sequence. Changes in data in dynamic social networks 602 may be analyzed by comparing the results of each sequential tensor decomposition over time. However, this example is not limiting of the illustrative embodiments for the reasons given below.
As indicated above, data may be from different networks. Data from different sources, including different social media such as FACEBOOK® and TWITTER®, could be represented in a tensor. However, in most cases one network is represented at different points in time in order to represent the relationships among the same individuals. Theoretically, individuals may be on multiple networks; thus, a given tensor could possibly represent complex relationships among the same set of people across multiple networks.
Although, mathematically, the same set of individuals may be involved at each point in time, the relationship between any particular pair may be null at any given time. If an individual has only zeros in all the cells representing relationships with other individuals during a given time period, the individual is, effectively, not part of the network at that time period. In this sense, by adding or removing non-zero values in these cells over time, an order four tensor can represent the growth or shrinkage of a network over time.
In different illustrative embodiments, a distinction may be drawn between analyzing a sequence of three-dimensional tensors and analyzing a single higher-order tensor that includes time as one of its dimensions.
Accordingly, tensor analysis 606, which possibly may be tensor decomposition using principal component factor analysis, need not necessarily be simply a series of PARAFAC analyses on individual three-dimensional tensors for each time period; the analysis may instead operate on a single higher-order tensor.
Whatever mathematical technique is used, temporal change graph 608 may be produced as a result of tensor analysis 606. Temporal change graph 608 may show a score on a three-dimensional grid versus time and topics. Thus, for example, at a particular time, a particular topic may have a higher or a lower score. The score represents the intensity or importance of the discussion of the topic in a blog, which may reflect in part the relative frequency with which the topic was discussed, but also incorporates the effect of other correlated parameters.
In any case, a latent interaction may be tracked, such as tracking a trend in a change in the score over time for a particular topic or blogger. This information may allow an analyst, for example, to make future predictions regarding the topic of interest, to assess and recommend law enforcement, business, or military actions as appropriate, to draw conclusions regarding individuals discussing the topics, or to come to whatever conclusion the user considers helpful.
In an illustrative embodiment, the time-based trend of changes may be modeled by overlapping time windows of the tensor model to approximate sequencing in the tensor model. When the data is partitioned into separate time periods, the time periods may overlap so that data from the end of one period is included in the next time period. (The terms “time period” and “time window” are synonymous as used herein.) This technique of overlapping time periods has the effect of eliminating sharp boundaries between the time periods, as well as tying the different time periods together into a kind of sequence. Without this technique, the time periods may be unrelated and unordered, like any other category in the tensor model.
Additionally, overlapping time periods may be weighted, as illustrated in the examples described below.
Traditionally, known latent semantic analysis techniques face challenges in temporal analysis. For example, a time window may be chosen such that features reflect only one time window. Furthermore, when using known latent semantic analysis, interesting features spanning multiple time windows can be lost. Furthermore, in traditional analysis, while temporal periods form a “sequence” mathematically, each time window in the tensor model is independent, with no connection between time windows.
The illustrative embodiments use weighted overlapping time windows to address these problems. Use of weighted overlapping time windows is described in more detail below.
Two different weighted time overlaps are described in the examples below.
Time window 800 shows a two hour time window, with an overlap of one hour and a 0.5 weight factor. In considering this time window, let “Hour A” be any particular hour time period shown on the timeline in time window 800. The entities or topics (features) in the hour before Hour A (that is Hour (A−1)) are down weighted by a factor of 0.5. The features in Hour A are set to those of Hour A, or receive full weight. With the overlapping time window, information from the hour before is no longer arbitrarily disregarded, thereby overcoming limitations in known techniques. Furthermore, each time period is now linked with the previous time period by incorporating information from it, thereby ensuring that interesting features spanning multiple time windows are not lost.
The actual time window and weighting may be determined for each application. Thus, for example, time window 802 shows a three-hour window with a one hour overlap. The features in Hour A−2 are down weighted by a factor of 0.5 and the features in Hour A and Hour A−1 receive full weight.
Many other variations are possible. Different weightings than those shown are possible. Multiple weightings may be used in each time window. For example, in time window 800 the weighting in Hour A could be 0.9 and the weighting in Hour A−1 could be 0.4. Longer or shorter time windows are possible, with different overlaps. Time may be expressed in units other than hours, such as shorter units (seconds or minutes, for example) or longer units (days, weeks, months, or years, for example). Thus, the illustrative embodiments shown in these examples do not limit the claimed inventions.
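A minimal sketch of this weighting scheme, assuming NumPy and hypothetical hourly feature counts, reproduces both the two-hour window with a 0.5-weighted overlap and the three-hour variant:

```python
import numpy as np

def weighted_windows(hourly, window=2, weight=0.5):
    """Weighted overlapping time windows: each window spans `window` hours;
    the oldest overlapped hour is down-weighted by `weight`, and the more
    recent hours receive full weight (per the two-hour example above)."""
    out = []
    for t in range(len(hourly)):
        w = hourly[t].astype(float)
        for back in range(1, window):
            if t - back >= 0:
                factor = weight if back == window - 1 else 1.0
                w = w + factor * hourly[t - back]
        out.append(w)
    return np.stack(out)

hours = np.random.randint(0, 5, size=(6, 4))   # 6 hours x 4 features (e.g., term counts)
print(weighted_windows(hours, window=2).shape)  # two-hour window, 0.5 overlap weight
print(weighted_windows(hours, window=3).shape)  # three-hour window, one hour overlap
```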
In an illustrative embodiment, model 902 may be used to characterize an entity of interest, Entity A 900.
Model 902 shows three different graphs: graph 904, graph 906, and graph 908. Graph 904 shows a commenter factor that indicates a score associated with various entities. Graph 906 shows an addressee factor that indicates a score associated with an entity. Graph 908 shows a term factor that indicates a score associated with a particular term associated with posts or blogs related to Entity A 900. More or fewer graphs may be present. Model 902 may also be presented in formats other than graphs.
Model 902 may be a result of performing a three-way commenter-addressee-term tensor factorization using PARAFAC. In a public blog or comment forum, people (commenters) may post comments and sometimes directly address their comments to other people posting (addressees). Model 902 shows that Entity A 900 may be associated with unusual term usage, with terms of interest shown in graph 908. In this illustrative embodiment, model 902 may show the radical behavior of Entity A 900 and the connection of Entity A 900 to an organization of interest 910, such as, for example, a terrorist organization.
An analyst may use the information shown in model 902 to take certain actions. For example, the analyst may report the findings shown by model 902 to proper authorities for further investigation of Entity A 900. However, if model 902 reflected discussion in a scientific field, then perhaps model 902 may show that further investigation in a particular scientific enquiry may be of interest. Still differently, if model 902 reflected discussion about a product, the analyst may report that certain marketing actions may be recommended to increase sales of the product. Thus, the illustrative embodiments are not limited to the examples described above.
Like model 902, the matrices described below illustrate how social network data may be represented for tensor analysis.
Each matrix in the following examples represents a single type of relation between pairs of entities.
For example, matrix 1102 shows email relationships among different entities. In this example, entity m emails entity 1 eight times in a time period, as indicated by the number 8 in the cell at the intersection of the row for entity m and the column for entity 1. The time period might be a week, as shown, but could be any time period. Likewise, entity 1 emails entity m once per time period, as shown by the corresponding cell. Matrix 1102 is an example of a non-binary, asymmetrical matrix.
In another example, matrix 1104 illustrates the relationships of various entities with respect to eye color. For example, entity 1 and entity 2 have the same eye color, specifically brown eyes. The fact that any non-zero number is entered into a cell means that the entities for the intersecting column and row have the same eye color, and the particular value reflects the weight of that match. In this case, the number “1” refers to the weight assigned to a match for brown eyes. Likewise, entity x and entity m have blue eyes, a rarer match that has a larger weight represented by the number “3”. Matrix 1104 is an example of a non-binary symmetrical matrix.
In still another example, matrix 1106 illustrates the relationships of various entities with respect to height. Thus, matrix 1106 shows re-representing properties as relationships.
The fact that a non-zero number is in a cell illustrates that two entities share a common range of heights. The value of the number entry in a cell may correspond to a weight of the cell for a match. Thus, for example, a higher number may correspond to a match on a rare feature, such as “blue eyes” in certain geographical regions for persons of a given height, or perhaps corresponds to a rare range of heights for persons in a geographical region.
Matrix 1106 is another example of a non-binary symmetrical matrix. The weights are what make matrix 1106 non-binary.
Although four matrices are shown in this example, more or fewer matrices, representing more or fewer types of relations, may be used.
Four-dimensional tensor 1200 is an example of a four-dimensional tensor that may be represented as a series of three-dimensional tensors, such as a series of three-dimensional tensors representing the same network as it varies over time. Note, however, that not all mathematical algorithms operating on different representations of four-dimensional tensors produce the same results. Nevertheless, tensor 1200 illustrates a four-dimensional tensor describing an entity-by-entity-by-feature-by-time set of relationships that incorporates temporal information into a heterogeneous social network.
Although a four-dimensional tensor, represented as a series of three-dimensional tensors over time, is shown in this example, the illustrative embodiments are not limited to this example. Tensors of other orders, and tensors with dimensions other than time, may be used to form an overall representation of relationships and attributes.
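As a non-limiting sketch, assuming NumPy and random three-dimensional snapshots, a series of entity-by-entity-by-feature tensors may be stacked along a time axis to form an order-four tensor:

```python
import numpy as np

# One three-dimensional snapshot (entity x entity x feature) per time period.
snapshots = [np.random.rand(10, 10, 5) for _ in range(8)]

# Stack along a new, final axis to form an order-four tensor:
# entity x entity x feature x time.
T4 = np.stack(snapshots, axis=-1)
print(T4.shape)  # (10, 10, 5, 8)
```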
Specifically, the process of principal component analysis may be carried out in stages. First, the data may be organized into a matrix. Second, the principal components of the matrix may be computed, for example using SVD-based algorithms. Third, the process may project data onto the principal components. In this manner the most relevant data may be found. Fourth, the principal components may be interpreted. The result of interpretation may be a model, including a multi-dimensional tensor representing information of greatest interest.
A “principal factor” is a tensor analogue to a “principal component” of a matrix. Principal factors are analogous to principal components, but the terms are not identical. A principal factor is a projection of a tensor onto a tensor space with only one direction. Just as principal components can be used to focus on important information in a matrix, a principal factor may be used to focus on information of interest in a higher order tensor. A principal factor may be derived using one or more mathematical approximation techniques operating on the set of data. A summary tensor may be constructed from one or more principal factors.
Singular Value Decomposition (SVD) 1400 may be, for example, the mathematical matrix decomposition technique applied in the matrix analyses described above.
The two mathematical tensor decomposition techniques described above, parallel factor analysis (PARAFAC) 1402 and higher-order singular value decomposition (HOSVD) 1404, may be applied to the tensor models described elsewhere herein.
Turning now to an illustration of a data processing system in accordance with an illustrative embodiment, data processing system 1500 may be used to implement the illustrative embodiments described above. In this illustrative example, data processing system 1500 includes communications fabric 1502, which provides communications between processor unit 1504, memory 1506, persistent storage 1508, communications unit 1510, input/output (I/O) unit 1512, and display 1514.
Processor unit 1504 serves to execute instructions for software that may be loaded into memory 1506. Processor unit 1504 may be a number of processors, a multi-core processor, or some other type of processor, depending on the particular implementation. A number, as used herein with reference to an item, means one or more items. Further, processor unit 1504 may be implemented using a number of heterogeneous processor systems in which a main processor is present with secondary processors on a single chip. As another illustrative example, processor unit 1504 may be a symmetric multi-processor system containing multiple processors of the same type.
Memory 1506 and persistent storage 1508 are examples of storage devices 1516. A storage device is any piece of hardware that is capable of storing information, such as, for example, without limitation, data, program code in functional form, and/or other suitable information either on a temporary basis and/or a permanent basis. Storage devices 1516 may also be referred to as computer readable storage devices in these examples. Memory 1506, in these examples, may be, for example, a random access memory or any other suitable volatile or non-volatile storage device. Persistent storage 1508 may take various forms, depending on the particular implementation.
For example, persistent storage 1508 may contain one or more components or devices. For example, persistent storage 1508 may be a hard drive, a flash memory, a rewritable optical disk, a rewritable magnetic tape, or some combination of the above. The media used by persistent storage 1508 may also be removable. For example, a removable hard drive may be used for persistent storage 1508.
Communications unit 1510, in these examples, provides for communications with other data processing systems or devices. In these examples, communications unit 1510 is a network interface card. Communications unit 1510 may provide communications through the use of either or both physical and wireless communications links.
Input/output (I/O) unit 1512 allows for input and output of data with other devices that may be connected to data processing system 1500. For example, input/output (I/O) unit 1512 may provide a connection for user input through a keyboard, a mouse, and/or some other suitable input device. Further, input/output (I/O) unit 1512 may send output to a printer. Display 1514 provides a mechanism to display information to a user.
Instructions for the operating system, applications, and/or programs may be located in storage devices 1516, which are in communication with processor unit 1504 through communications fabric 1502. In these illustrative examples, the instructions are in a functional form on persistent storage 1508. These instructions may be loaded into memory 1506 for execution by processor unit 1504. The processes of the different embodiments may be performed by processor unit 1504 using computer implemented instructions, which may be located in a memory, such as memory 1506.
These instructions are referred to as program code, computer usable program code, or computer readable program code that may be read and executed by a processor in processor unit 1504. The program code in the different embodiments may be embodied on different physical or computer readable storage media, such as memory 1506 or persistent storage 1508.
Program code 1518 is located in a functional form on computer readable media 1520 that is selectively removable and may be loaded onto or transferred to data processing system 1500 for execution by processor unit 1504. Program code 1518 and computer readable media 1520 form computer program product 1522 in these examples. In one example, computer readable media 1520 may be computer readable storage media 1524 or computer readable signal media 1526. Computer readable storage media 1524 may include, for example, an optical or magnetic disk that is inserted or placed into a drive or other device that is part of persistent storage 1508 for transfer onto a storage device, such as a hard drive, that is part of persistent storage 1508. Computer readable storage media 1524 may also take the form of a persistent storage, such as a hard drive, a thumb drive, or a flash memory, that is connected to data processing system 1500. In some instances, computer readable storage media 1524 may not be removable from data processing system 1500.
Alternatively, program code 1518 may be transferred to data processing system 1500 using computer readable signal media 1526. Computer readable signal media 1526 may be, for example, a propagated data signal containing program code 1518. For example, computer readable signal media 1526 may be an electromagnetic signal, an optical signal, and/or any other suitable type of signal. These signals may be transmitted over communications links, such as wireless communications links, optical fiber cable, coaxial cable, a wire, and/or any other suitable type of communications link. In other words, the communications link and/or the connection may be physical or wireless in the illustrative examples.
In some illustrative embodiments, program code 1518 may be downloaded over a network to persistent storage 1508 from another device or data processing system through computer readable signal media 1526 for use within data processing system 1500. For instance, program code stored in a computer readable storage medium in a server data processing system may be downloaded over a network from the server to data processing system 1500. The data processing system providing program code 1518 may be a server computer, a client computer, or some other device capable of storing and transmitting program code 1518.
The different components illustrated for data processing system 1500 are not meant to provide architectural limitations to the manner in which different embodiments may be implemented. The different illustrative embodiments may be implemented in a data processing system including components in addition to or in place of those illustrated for data processing system 1500. Other components can be varied from the illustrative examples shown.
In another illustrative example, processor unit 1504 may take the form of a hardware unit that has circuits that are manufactured or configured for a particular use. This type of hardware may perform operations without needing program code to be loaded into a memory from a storage device to be configured to perform the operations.
For example, when processor unit 1504 takes the form of a hardware unit, processor unit 1504 may be a circuit system, an application specific integrated circuit (ASIC), a programmable logic device, or some other suitable type of hardware configured to perform a number of operations. With a programmable logic device, the device is configured to perform the number of operations. The device may be reconfigured at a later time or may be permanently configured to perform the number of operations. Examples of programmable logic devices include, for example, a programmable logic array, programmable array logic, a field programmable logic array, a field programmable gate array, and other suitable hardware devices. With this type of implementation, program code 1518 may be omitted because the processes for the different embodiments are implemented in a hardware unit.
In still another illustrative example, processor unit 1504 may be implemented using a combination of processors found in computers and hardware units. Processor unit 1504 may have a number of hardware units and a number of processors that are configured to run program code 1518. With this depicted example, some of the processes may be implemented in the number of hardware units, while other processes may be implemented in the number of processors.
As another example, a storage device in data processing system 1500 is any hardware apparatus that may store data. Memory 1506, persistent storage 1508, and computer readable media 1520 are examples of storage devices in a tangible form.
In another example, a bus system may be used to implement communications fabric 1502 and may be comprised of one or more buses, such as a system bus or an input/output bus. Of course, the bus system may be implemented using any suitable type of architecture that provides for a transfer of data between different components or devices attached to the bus system. Additionally, a communications unit may include one or more devices used to transmit and receive data, such as a modem or a network adapter. Further, a memory may be, for example, memory 1506, or a cache, such as found in an interface and memory controller hub that may be present in communications fabric 1502.
The different illustrative embodiments can take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment containing both hardware and software elements. Some embodiments are implemented in software, which includes but is not limited to forms, such as, for example, firmware, resident software, and microcode.
Furthermore, the different embodiments can take the form of a computer program product accessible from a computer usable or computer readable medium providing program code for use by or in connection with a computer or any device or system that executes instructions. For the purposes of this disclosure, a computer usable or computer readable medium can generally be any tangible apparatus that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
The computer usable or computer readable medium can be, for example, without limitation an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, or a propagation medium. Non-limiting examples of a computer readable medium include a semiconductor or solid state memory, magnetic tape, a removable computer diskette, a random access memory (RAM), a read-only memory (ROM), a rigid magnetic disk, and an optical disk. Optical disks may include compact disk-read only memory (CD-ROM), compact disk-read/write (CD-R/W), and DVD.
Further, a computer usable or computer readable medium may contain or store a computer readable or usable program code such that when the computer readable or usable program code is executed on a computer, the execution of this computer readable or usable program code causes the computer to transmit another computer readable or usable program code over a communications link. This communications link may use a medium that is, for example without limitation, physical or wireless.
A data processing system suitable for storing and/or executing computer readable or computer usable program code will include one or more processors coupled directly or indirectly to memory elements through a communications fabric, such as a system bus. The memory elements may include local memory employed during actual execution of the program code, bulk storage, and cache memories which provide temporary storage of at least some computer readable or computer usable program code to reduce the number of times code may be retrieved from bulk storage during execution of the code.
Input/output or I/O devices can be coupled to the system either directly or through intervening I/O controllers. These devices may include, for example, without limitation, keyboards, touch screen displays, and pointing devices. Different communications adapters may also be coupled to the system to enable the data processing system to become coupled to other data processing systems or remote printers or storage devices through intervening private or public networks. Modems and network adapters are just a few non-limiting examples of the currently available types of communications adapters.
The description of the different illustrative embodiments has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the embodiments in the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art. Further, different illustrative embodiments may provide different features as compared to other illustrative embodiments. The embodiment or embodiments selected are chosen and described in order to best explain the principles of the embodiments, the practical application, and to enable others of ordinary skill in the art to understand the disclosure for various embodiments with various modifications as are suited to the particular use contemplated.
This application was made with United States Government support under contract number N00014-09-C-0082 awarded by the United States Office of Naval Research. The United States Government has certain rights in this application.