The present invention relates to clustering of a large number of objects. In particular, the invention is directed to segmentation of a social graph representing a large number of tracked users of social networks.
Informed marketing models rely on analyzing massive data, pertinent to identifiable objects, acquired from a variety of sources, one of which is social media. The data fed to a market model may be segmented according to various criteria, where objects of cohesive or similar characteristics are grouped in identifiable clusters. Several methods of data clustering are known in the art. There are, however, several challenges pertaining to computational complexity, selection of appropriate descriptors of objects, and selection of segmentation criteria that suit marketing objectives.
The invention provides a method of clustering a plurality of objects representing tracked users of a network. The method is implemented using at least one processor configured to perform processes of initializing K centroids, one for each of a specified number K of clusters, K>1, and assigning each object to one of the clusters according to measures of affinity of the object to each of the centroids.
The process of assigning an object to a cluster is based on determining with respect to each of the K centroids: an angular-affinity measure; a radial distance; a radial-affinity measure based on the radial distance; and a composite affinity measure based on the radial-affinity measure and the angular-affinity measure. The object is assigned to a selected cluster having a centroid of highest composite affinity measure.
The centroid of the selected cluster is updated to account for inclusion of the object.
According to one aspect of the invention, there is provided a method of clustering a plurality of objects comprising:
In the method described above, said each object is characterized by a respective vector of descriptors and said updating comprises steps of:
The method further comprises:
The method further comprises executing multiple cycles of said selecting, evaluating, identifying, assigning, and updating for said each object with the predetermined order for any cycle differing from the predetermined order for any other cycle of the multiple cycles.
The method further comprises, for each cycle of said multiple cycles:
The method further comprises:
The method further comprises:
The method further comprises determining each of said radial-affinity measure, said angular-affinity measure, and said composite affinity measure as a normalized value bounded between 0 and 1.0.
In the method described above:
said angular affinity measure is determined as a dot product of vectors (C/∥C∥) and (P/∥P∥);
said radial affinity measure is determined as a ratio ∥P∥/(∥P∥+D); and
said composite affinity measure is a weighted sum of said angular affinity measure and said radial affinity measure;
where C denotes a centroid vector of said each centroid, P denotes an object vector of said each object, ∥C∥ denotes magnitude of C, ∥P∥ denotes magnitude of P, and D denotes the Euclidean distance ∥(P−C)∥.
Alternatively, in the method described above:
said angular affinity measure is determined as a dot product of vectors (C/∥C∥) and (P/∥P∥);
said radial affinity measure is determined as (1−D/D*) for D<D* and 0.0 otherwise;
According to another aspect of the invention, there is provided a system of clustering a plurality of objects comprising:
The system further comprises means for characterizing said each object by a respective vector of descriptors, said processor readable instructions further cause said at least one processor to:
The system further comprises means for assigning to said each object a respective weight, said processor readable instructions further causing said at least one processor to establish said predetermined order as a descending order according to weight.
In the system described above, said processor readable instructions further cause said at least one hardware processor to execute multiple cycles of assigning said plurality of objects to said set of clusters with the predetermined order of selecting objects for any cycle differing from the predetermined order for any other cycle of the multiple cycles.
In the system described above, said processor readable instructions further cause said at least one processor to:
In the system described above, said processor readable instructions further cause said at least one processor to:
In the system described above, said processor readable instructions further cause said at least one processor to:
The system further comprises computer executable instructions causing the at least one hardware processor to determine each of said radial-affinity measure, said angular-affinity measure, and said composite affinity measure as a normalized value bounded between 0 and 1.0.
In the system described above, said processor readable instructions further cause said at least one hardware processor to:
Alternatively, in the system described above, said processor readable instructions further cause said at least one processor to:
According to yet another aspect of the invention, there is provided a method of clustering a plurality of objects comprising:
storing descriptor vectors of N objects of said plurality of objects in a memory device; and configuring at least one hardware processor to perform processes of:
In the method described above, said each segmentation comprises:
In the method described above, said allocating comprises:
In the method described above, said allocating comprises:
The method further comprises maintaining object-assignment records indicating for said each selected object:
In the method described above, said M independent segmentations are executed sequentially. Alternatively, said M independent segmentations may be executed concurrently.
According to a further aspect of the invention, there is provided a system for clustering a plurality of objects comprising:
In the system described above, the computer executable instructions cause said at least one hardware processor to:
In the system described above, the computer executable instructions cause said at least one hardware processor to:
In the system described above, the computer executable instructions cause said at least one hardware processor to:
In the system described above, the computer executable instructions further cause said at least one hardware processor to maintain object-assignment records indicating for said each selected object:
In the system described above, the computer executable instructions further cause said at least one hardware processor to execute said M independent segmentations sequentially.
Alternatively, the system comprises means for executing said M independent segmentations concurrently.
According to one more aspect of the invention, there is provided a method of clustering a plurality of objects comprising:
employing at least one hardware processor for:
The method further comprises, upon completion of said each cycle:
The method further comprises:
In the above method, said updating comprises revising said centroid vector of said each cluster to equal a summation of said phase-specific vector sum and said cycle-specific vector sum divided by a total number of objects assigned to said each cluster upon completion of said each cycle.
According to one more aspect of the invention, there is provided a system for clustering a plurality of objects comprising:
a hardware processor and a memory device having computer executable instructions stored thereon for execution by the hardware processor, causing the hardware processor to:
In the system described above, the computer executable instructions, upon completion of said each cycle, further cause the processor to:
In the system described above, the computer executable instructions further cause the processor to:
In the system described above, the computer executable instructions further cause the processor to revise said centroid vector of said each cluster to equal a summation of said phase-specific vector sum and said cycle-specific vector sum divided by a total number of objects assigned to said each cluster upon completion of said each cycle.
According to yet one more aspect of the invention, there is provided a method of clustering a plurality of objects comprising:
employing a hardware processor for:
The method further comprises, upon completion of said each cycle, updating a centroid vector of said each cluster.
The method further comprises:
According to yet one more aspect of the invention, there is provided a system for clustering a plurality of objects comprising:
a hardware processor and a memory device having computer executable instructions stored thereon for execution by the processor, causing the processor to:
In the system described above, upon completion of said each cycle, the computer executable instructions further cause the processor to update a centroid vector of said each cluster.
In the above system, the computer executable instructions further cause the processor to:
Thus, improved methods and systems for clustering a plurality of objects have been provided. Methods and systems of the present invention provide the following advantages: better selection of segmentation criteria that suit marketing objectives; more accurate grouping of objects of common traits; and a significant reduction in the computational effort of finding optimal or near-optimal segmentation of objects through proper selection of search domains and through recognizing and avoiding redundant calculations.
Embodiments of the present invention will be further described with reference to the accompanying exemplary drawings, in which:
A social graph is represented by tracked data relevant to users of a communication network in general and the Internet in particular. Noting that a communication network is not necessarily limited to serve human users, a tracked user is herein termed “an object” and is represented by a multidimensional vector quantifying a set of descriptors of the object.
The K-means method is based on determining a distance, however defined, between an object and the centroids of the K clusters and associating the object with the nearest cluster. The number K is judiciously selected. To realize statistically significant clustering, a lower bound, q, q>1, of the number of objects per cluster may be predefined. Thus, an upper bound of the number K may be determined as ⌊N/q⌋, N being the number of tracked objects. Initially, the clusters are empty sets and each cluster is associated with a respective centroid “seed”. Each object of the population is then assigned to one of the clusters and updated values of the K centroids are determined. Several cycles of assigning objects to clusters and then updating the K centroids may take place until some convergence criterion is satisfied.
The values of the initial centroids may have a significant effect on the number of cycles needed. If the K initial centroids are too close to each other, the objects of any pair of clusters are likely to be spread, and interposed, over the global multidimensional space of the descriptors instead of being segregated. Consequently, the updated centroids would drift slowly towards steady-state values during successive update cycles. Thus, the initial values of the centroids are preferably selected to be distant from each other to realize fast convergence of an iterative centroid-refinement process.
The criterion of assigning an object to a cluster may be based on the radial (Euclidean) distances of the object to the K centroids or the angular displacements of the vector representing the object from the K vectors representing the centroids. The angle between a vector representing an object and a vector representing a centroid is a planar angle having values between 0 and π/2 radians. Since the cosine of an angle between 0 and π/2 is a monotone function of the angle, the cosine of the angle may be used for comparing angular-displacement values of an object from the K centroids. With all vectors representing the population of objects normalized to a magnitude of 1.0, the cosine of the angular displacement between any two vectors is simply the dot product of the two vectors.
Whether clustering is based on radial measures or angular measures, the descriptors may be individually normalized. For example, a descriptor representing the age of a tracked user may be normalized by dividing the age of each tracked user by the mean age or median age of the population under consideration (40 years for Canada and the U.S.A.). Likewise, a descriptor representing annual income may be divided by the mean income of the population under consideration. The vectors representing the population of objects (tracked users) may further be normalized to a magnitude of 1.0 for specific purposes.
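The two normalizations described above can be sketched as follows. This is a minimal illustration only: the function names and the sample population (age in years, annual income) are assumptions introduced here and are not part of the described method.

```python
import math

def normalize_descriptors(vectors):
    """Divide each descriptor (column) by its mean over the population."""
    dims = len(vectors[0])
    means = [sum(v[i] for v in vectors) / len(vectors) for i in range(dims)]
    return [[v[i] / means[i] for i in range(dims)] for v in vectors]

def unit(v):
    """Scale a vector to a magnitude of 1.0 (assumes a non-zero vector)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(a, b):
    """Cosine of the angle between a and b: dot product of the unit vectors."""
    return sum(x * y for x, y in zip(unit(a), unit(b)))

# Hypothetical population: [age in years, annual income]
population = [[30.0, 40000.0], [50.0, 60000.0], [40.0, 80000.0]]
scaled = normalize_descriptors(population)  # each descriptor now has mean 1.0
```

After descriptor normalization, age and income contribute on comparable scales; without it, the income descriptor would dominate both the distances and the angles.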
In accordance with the present invention, an object may be assigned to a cluster based on a composite measure of affinity which takes into account both the angular displacement and radial distance between the object and the centroid of the cluster.
Consider a population of N objects, each object representing a tracked user and being represented by a ν-dimensional vector Pj, 0≤j<N. The N objects are to be grouped into K clusters. The centroid of each cluster is represented by a ν-dimensional vector Cj, 0≤j<K. Consider three clustering approaches:
Radial-affinity clustering;
Angular-affinity clustering; and
Composite radial-angular-affinity clustering.
Following selection of K initial-state centroids (K centroid seeds), two basic processes are iteratively performed regardless of the clustering criterion. The first process allocates each object to a cluster, and the second process refines the centroids.
The process of allocating objects to clusters entails N×K basic affinity computations with each basic affinity computation requiring at least ν multiplications and ν additions. The total number of multiplications is at least N×K×ν; naturally N>>K, and typically K>>ν. (For example: N=1,000,000, K=100, ν=8.)
The process of refining the centroids entails N×ν additions and at least K×(ν+1) multiplications or divisions.
Thus, the process of allocating objects to clusters is by far more computationally intensive.
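The operation counts above can be checked for the example values quoted in the text (N=1,000,000, K=100, ν=8). The short calculation below is only a worked illustration; the variable names are chosen here for readability.

```python
# Example values quoted in the text
N, K, nu = 1_000_000, 100, 8

allocation_mults = N * K * nu      # at least N*K*nu multiplications per allocation pass
refinement_adds = N * nu           # N*nu additions per centroid refinement
refinement_mults = K * (nu + 1)    # at least K*(nu+1) multiplications or divisions

print(allocation_mults)   # 800000000
print(refinement_adds)    # 8000000
print(refinement_mults)   # 900
```

For these values the allocation pass costs roughly K times as many arithmetic operations as the refinement pass.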
The distance between an object (vector P) and a cluster is defined as the distance between the object and the centroid (vector C) of the cluster. A normalized array C is denoted ć.
Array [p1, p2, . . . , pν]T represents a ν-dimensional object vector P;
Array [c1, c2, . . . , cν]T represents a ν-dimensional centroid vector C; and
Array [χ1, χ2, . . . , χν]T represents a ν-dimensional normalized centroid vector ć.
The cosine of the planar angle between the object vector P and the centroid vector C is determined from the scalar product of P and C:
$$P \cdot C = p_1 \times c_1 + p_2 \times c_2 + \cdots + p_\nu \times c_\nu.$$
If vectors P and C are normalized to a magnitude of unity, then the scalar product is the cosine of the angle. If only vector C is normalized, then the scalar product is the magnitude of vector P times the cosine of the angle. Since the magnitude of P is a common factor in the search for the nearest (most similar) centroid, normalizing the object vectors would be unnecessary.
The square of the distance D between the object and the centroid may be determined as:
$$D^2 = (p_1 - c_1)^2 + (p_2 - c_2)^2 + \cdots + (p_\nu - c_\nu)^2.$$
If ν is a large number, and if the scalar product P·C has already been computed, then D² may be determined from the identity:
$$D^2 = \lVert P \rVert^2 + \lVert C \rVert^2 - 2 \times P \cdot C.$$
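The identity can be checked numerically. The sketch below compares the direct coordinate-wise computation with the computation that reuses an already available scalar product; the function names and sample vectors are illustrative assumptions.

```python
def dot(a, b):
    """Scalar product of two vectors of equal dimension."""
    return sum(x * y for x, y in zip(a, b))

def squared_distance_direct(p, c):
    """Sum of squared coordinate differences."""
    return sum((x - y) ** 2 for x, y in zip(p, c))

def squared_distance_from_dot(p, c, p_dot_c):
    """|P|^2 + |C|^2 - 2 P.C, reusing a precomputed scalar product."""
    return dot(p, p) + dot(c, c) - 2.0 * p_dot_c

P = [1.0, 2.0, 3.0]
C = [2.0, 0.0, 1.0]
direct = squared_distance_direct(P, C)               # 1 + 4 + 4
via_dot = squared_distance_from_dot(P, C, dot(P, C))  # 14 + 5 - 2*5
print(direct, via_dot)  # 9.0 9.0
```

In practice the squared magnitudes would be computed once per object and once per centroid update and then reused, so that only the scalar product varies per object-centroid pair.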
Consider a single centroid represented by ν-dimensional vector C.
Radial Affinity
(Vectors P and C are not normalized)
$$D^2 = (p_1 - c_1)^2 + (p_2 - c_2)^2 + \cdots + (p_\nu - c_\nu)^2$$
(ν multiplications and 2×ν−1 additions or subtractions).
If the N objects are grouped according to objects' distances from centroids, then only the square of object-centroid distance need be determined.
Angular Affinity
(Vector P is not normalized; vector C is normalized to ć=(χ1, χ2, . . . , χν)T, ∥ć∥=1.0)
$$P \cdot \acute{c} = p_1 \times \chi_1 + p_2 \times \chi_2 + \cdots + p_\nu \times \chi_\nu$$
(ν multiplications and ν−1 additions).
If the N objects are grouped according to objects' planar angles from centroids, then only the dot-product of object vector P and normalized centroid vector ć need be determined.
The ν descriptors are preferably normalized so that each descriptor has a mean value of 1.0. However, the object vectors need not be normalized to a magnitude of 1.0 since each object individually selects one of the K centroids based on comparing values of a monotone function of the angles. The monotone function may be the cosine function of an angle or the cosine function multiplied by ∥P∥. An updated centroid is a natural vector (not normalized) determined according to natural vectors of constituent objects of the cluster. Normalized centroid vectors are needed, however, to isolate the effect of the varying magnitudes of centroid vectors. Upon determining a normalized centroid vector from a natural centroid vector, the natural centroid vector is still retained to be used in a succeeding update. Updating a centroid is preferably performed as a recursive process that does not require processing all constituent objects as illustrated in
Based on the angular-affinity criterion, selection of a cluster for an object is based only on vectors' directions.
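A minimal sketch of angular-affinity selection follows. It normalizes each centroid only, leaves the object vector as is, and picks the centroid with the largest scalar product. The function names and the sample centroids are assumptions made for illustration.

```python
import math

def normalize(c):
    """Return c scaled to unit magnitude (assumes a non-zero vector)."""
    norm = math.sqrt(sum(x * x for x in c))
    return [x / norm for x in c]

def nearest_by_angle(p, centroids):
    """Index of the centroid maximizing P . c_hat, i.e. |P| * cos(angle)."""
    best_j, best_score = -1, float("-inf")
    for j, c in enumerate(centroids):
        score = sum(x * y for x, y in zip(p, normalize(c)))
        if score > best_score:
            best_j, best_score = j, score
    return best_j

# Hypothetical natural (not normalized) centroids in two dimensions
centroids = [[10.0, 1.0], [1.0, 1.0], [0.5, 5.0]]
print(nearest_by_angle([9.0, 9.0], centroids))  # 1
```

For the object [9, 9], the Euclidean-nearest centroid would be the first one, yet the angular criterion selects the second because only direction matters.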
The hybrid radial-angular clustering process requires determining both the angular affinity and radial affinity. The angular affinity is determined as the dot product of the object vector and normalized centroid as discussed above. The radial affinity is determined as a function of the Euclidean distance between the object and the natural centroid. Since the angular affinity P·ć has already been determined, the Euclidean distance can be determined based on the values ∥P∥² and ∥C∥², which are retained for frequent use.
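Hybrid Radial-Angular Affinity
(Vector P is not normalized; vector C is normalized to ć=(χ1, χ2, . . . , χν)T, ∥ć∥=1.0)
$$P \cdot \acute{c} = p_1 \times \chi_1 + p_2 \times \chi_2 + \cdots + p_\nu \times \chi_\nu$$
$$D^2 = \lVert P \rVert^2 + \lVert C \rVert^2 - 2 \times (P \cdot \acute{c}) \times \lVert C \rVert$$
(ν multiplications, ν+1 additions, 1 multiplication)
(Vectors P and C are not normalized)
$$P \cdot C = p_1 \times c_1 + p_2 \times c_2 + \cdots + p_\nu \times c_\nu$$
$$D^2 = \lVert P \rVert^2 + \lVert C \rVert^2 - 2 \times (P \cdot C)$$
(ν multiplications, ν+1 additions, 1 division)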
An object may be assigned to a cluster based on a composite measure of affinity which takes into account both the angular affinity and radial affinity of the object and the centroid of the cluster. Let Θj, 0≤Θj≤π/2, denote the angular displacement of an object P from centroid Cj, and Dj denote the radial distance from object P to centroid Cj, 1≤j≤K. The angular affinity Ωj of the object to centroid Cj may be defined as the cosine of Θj which is bounded between 0.0 and 1.0. The distances from the object to the K centroids may vary significantly (even with descriptor normalization to a mean value of unity). It is desirable however to define a measure of radial affinity to be also bounded.
A first measure of radial affinity of the object P to a centroid Cj may be determined as:
δj=∥P∥/(∥P∥+Dj).
Thus, δj=1.0 if Dj=0.0 (for a centroid that coincides with the object in the ν-dimensional space). δj decreases as Dj increases.
An affinity index Sj reflecting both the angular affinity and radial affinity of an object to centroid Cj may be defined as:
$$S_j = \alpha \times \Omega_j + \beta \times \delta_j, \quad 0.0 \le \alpha \le 1.0,\ 0.0 \le \beta \le 1.0,\ \alpha + \beta = 1.0.$$
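The first composite measure can be sketched as a single function. This is an illustration under stated assumptions: the function name, the default weight α=0.5, and the sample vectors are introduced here, and the vectors are assumed to be non-zero with non-negative descriptors.

```python
import math

def composite_affinity(p, c, alpha=0.5):
    """Weighted sum of angular affinity cos(theta) and radial affinity
    |P| / (|P| + D), with beta = 1 - alpha."""
    beta = 1.0 - alpha
    norm_p = math.sqrt(sum(x * x for x in p))
    norm_c = math.sqrt(sum(x * x for x in c))
    omega = sum(x * y for x, y in zip(p, c)) / (norm_p * norm_c)
    d = math.sqrt(sum((x - y) ** 2 for x, y in zip(p, c)))
    radial = norm_p / (norm_p + d)
    return alpha * omega + beta * radial

print(composite_affinity([3.0, 4.0], [3.0, 4.0]))  # coincident centroid
print(composite_affinity([3.0, 4.0], [6.0, 8.0]))  # same direction, D = 5
```

A centroid that coincides with the object scores 1.0; a centroid in the same direction at distance D=∥P∥ keeps full angular affinity but only half the radial affinity.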
A second measure of radial affinity of the object P to a centroid Cj may be determined as:
Δj=(1−Dj/D*) for Dj≤D* and Δj=0.0 for Dj>D*;
where D* is the sum of the mean value μ and the standard deviation σ (or 2×σ) of the K radial distances between the object and the K centroids.
Thus, Δj is bounded between 0.0 and 1.0, where a value of zero corresponds to a centroid of a radial distance from the object exceeding a predefined threshold and a value of 1.0 corresponds to a centroid that coincides with the object.
An affinity index Śj reflecting both the angular affinity and radial affinity of an object to centroid Cj may be defined as:
$$\acute{S}_j = \alpha \times \Omega_j + \beta \times \Delta_j$$
where α and β are weighting factors: 0≤α≤1.0, 0≤β≤1.0, and α+β=1.0.
The mean value μ and standard deviation σ of the distances D1 to D5 are 10.55 and 4.11, respectively. Thus, D*, selected as μ+σ, equals 14.66. The measures of radial affinity Δ1, Δ2, Δ3, Δ4, and Δ5 are then determined as 0.485, 0.527, 0.466, 0.117, and 0.0.
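The threshold-based radial affinity can be sketched as follows. The distances used here are hypothetical values chosen so that the arithmetic is easy to follow; they are not the distances of the example above. The use of the population standard deviation and the multiplier name h are likewise assumptions of this sketch.

```python
import statistics

def radial_affinities(distances, h=1.0):
    """Return (D*, [Delta_j]) with D* = mu + h*sigma over the K distances.
    Assumes at least one non-zero distance so that D* > 0."""
    mu = statistics.fmean(distances)
    sigma = statistics.pstdev(distances)  # population standard deviation (assumption)
    d_star = mu + h * sigma
    deltas = [1.0 - d / d_star if d <= d_star else 0.0 for d in distances]
    return d_star, deltas

# Hypothetical distances from one object to K = 8 centroids (mean 5, sigma 2)
d_star, deltas = radial_affinities([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
```

Because D* is derived from the object's own distances to the K centroids, the threshold adapts to each object instead of being a single global constant.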
$$S_j = \alpha \times \Omega_j + \beta \times \delta_j, \quad 0 \le \alpha \le 1.0,\ 0 \le \beta \le 1.0,\ \alpha + \beta = 1.0$$
δj=∥P∥/(∥P∥+Dj), 0≤j<K.
The assignment of objects to clusters is determined in an iterative execution of a global computation cycle 1110. In a global computation cycle, a cluster-selection procedure 1120 is executed to select a cluster for each object, assign the object to the selected cluster, and then update the centroid of the selected cluster.
The cluster-selection procedure 1120 comprises applying for each of the K clusters processes of:
The value of the composite radial-angular similarity measure may be retained for each cluster if the process of selecting a preferred cluster for the object takes into consideration other factors. Otherwise, only one composite radial-angular affinity measure is retained (process 1170) based on comparison of results relevant to successive clusters.
Upon completion of the cluster-selection procedure 1120 for all of the K clusters, a cluster is selected. The centroid of the selected cluster is then updated (process 1180) as illustrated in
$$\acute{S}_j = \alpha \times \Omega_j + \beta \times \Delta_j, \quad 0 \le \alpha \le 1.0,\ 0 \le \beta \le 1.0,\ \alpha + \beta = 1.0.$$
Δj=(1−Dj/D*) for Dj≤D* and Δj=0.0 for Dj>D*;
where D* is the sum of the mean value μ and the standard deviation σ of the K radial distances {D1, D2 . . . DK} between the object and the K centroids. The value of D* may be selected according to other criteria.
The assignment of objects to clusters is determined in an iterative execution of a global computation cycle 1210. In a global computation cycle, process 1220 is applied to determine for each of the K clusters:
The values Ωj and Dj are retained for each cluster (process 1242). Upon determining Dj for all clusters (1≤j≤K), the mean value μ and the standard deviation σ of the K radial distances {D1, D2, . . . , DK} between the object and the K centroids can be determined (process 1250). With D* defined as D*=μ+σ (or generally D*=μ+h×σ, h being a positive real number), the radial-affinity measure Δj can be determined for each cluster j.
Cluster-selection procedure 1260 comprises applying for each of the K clusters processes of:
The value of the composite radial-angular affinity measure may be retained for each cluster if the process of selecting a preferred cluster for the object takes into consideration other factors. Otherwise, only one composite radial-angular affinity measure is retained (process 1270) based on comparison of results relevant to successive clusters.
Upon completion of the cluster-selection procedure 1260 for all of the K clusters, a cluster is selected and the centroid of the selected cluster is updated (process 1280) as illustrated in
Processes 1130, 1140, 1150, and 1160 described with reference to
Process 1130 determines the angular affinity Ωj of the object to centroid Cj. The centroid vector Cj is normalized to a magnitude of unity. The resulting normalized centroid vector ć is denoted:
ć=(χ1, χ2, . . . , χν)T, ∥ć∥=1.0.
The angular-affinity measure Ωj is determined as:
Ωj=P·ć=p1×χ1+p2×χ2+ . . . +pν×χν.
Process 1140 determines the radial distance Dj from the object to centroid Cj. The square of distance may be determined from the Cartesian representation of the object vector P and the candidate-centroid vector as:
$$D^2 = (p_1 - c_1)^2 + (p_2 - c_2)^2 + \cdots + (p_\nu - c_\nu)^2.$$
However, where ν>>1, and since the values of ∥P∥², ∥Cj∥², ∥Cj∥, and P·ć have already been determined, the square of the distance may be determined as:
$$D^2 = \lVert P \rVert^2 + \lVert C \rVert^2 - 2 \times (P \cdot \acute{c}) \times \lVert C \rVert.$$
Process 1150 determines the radial affinity as:
δj=∥P∥/(∥P∥+Dj), 0≤j<K.
Process 1160 determines the composite radial-angular affinity measure as:
$$S_j = \Omega_j + \beta \times \delta_j,$$
where β (β>0.0) is a design parameter.
The currently computed value Sj is compared with the last encountered highest value S*. If Sj is less than or equal to S* (step 1360), a subsequent cluster, if any, is considered (step 1312) as a new candidate. Otherwise, if Sj is larger than S* (step 1360), the index k* of the optimal centroid is set to equal the index j of the current candidate cluster, the value S* is set to equal Sj (step 1370), and a subsequent cluster, if any, is considered (step 1312) as a new candidate.
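The comparison against the running best value can be sketched as a single scan over the candidate clusters. The function name is hypothetical, and starting S* at 0.0 with a null index is an assumption of this sketch.

```python
def select_cluster(affinities):
    """Scan composite affinity values S_j in candidate order and keep the
    running best (k*, S*). A later candidate replaces the current best only
    if its value is strictly larger, so ties keep the earlier cluster."""
    k_star, s_star = None, 0.0  # assumed initial state: no cluster, S* = 0.0
    for j, s_j in enumerate(affinities):
        if s_j > s_star:
            k_star, s_star = j, s_j
    return k_star, s_star

print(select_cluster([0.2, 0.9, 0.9, 0.4]))  # (1, 0.9)
```

Only one value S* and one index k* need to be retained during the scan, which avoids storing all K composite values per object.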
Upon selecting an object (process 1402), the index j of the candidate cluster is set to equal 0 (process 1410). An index j of a candidate cluster of centroid Cj is updated in step 1412. If the index exceeds the total number K of clusters, the computation of the radial distances between the object and the K centroids is considered complete and process 1440 is executed to determine an upper bound of a radial distance.
Process 1420 determines the angular affinity Ωj of the object to centroid Cj according to the steps of process 1130 of
Process 1430 determines the radial distance Dj from the object to centroid Cj according to the steps of process 1140 of
Process 1440 determines the mean value μ and the standard deviation σ of the K radial distances {D1, D2, . . . , DK} between the object and the K centroids. With D* defined as D*=μ+σ (or D*=μ+h×σ, h>0.0), the radial-affinity measure Δj can be determined for each cluster j (process 1460).
An initial value of the highest affinity measure S* is set to 0.0, the index k* of the cluster of highest affinity measure is initialized as a null value (0 for example), and the index j of the candidate cluster is set to equal 0 (process 1450). An index j of a candidate cluster of centroid Cj is updated in step 1452. If the index exceeds the total number K of clusters, the cluster-selection process is considered complete (step 1456) and the object under consideration is assigned to the selected cluster (step 1490).
Process 1460 determines the radial affinity as:
Δj=(1−Dj/D*) for Dj≤D* and Δj=0.0 for Dj>D*;
where D* is the sum of the mean value μ and the standard deviation σ, or μ+(2×σ), of the K radial distances {D1, D2 . . . DK} between the object and the K centroids as determined in process 1440.
Process 1470 determines the composite radial-angular affinity measure as:
$$S_j = \alpha \times \Omega_j + \beta \times \Delta_j, \quad 0.0 \le \alpha \le 1.0,\ 0.0 \le \beta \le 1.0,\ \alpha + \beta = 1.0.$$
The currently computed value Sj is compared with the last encountered highest value S*. If Sj is less than or equal to S* (step 1475), a subsequent cluster, if any, is considered (step 1452) as a new candidate. Otherwise, if Sj is larger than S* (step 1475), the index k* of the optimal centroid is set to equal the index j of the current candidate cluster and the value S* is set to equal Sj (step 1480) and a subsequent cluster, if any, is considered (step 1452) as a new candidate.
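The two-pass structure of this procedure can be sketched as below: a first pass collects Ωj and Dj for all K centroids, D* is then derived from the K distances, and a second pass evaluates the composite measure and keeps the best cluster. The function name, default weights, population standard deviation, and sample data are assumptions of this sketch; it also assumes non-zero vectors and D*>0.

```python
import math
import statistics

def assign(p, centroids, alpha=0.5, h=1.0):
    """Return (k*, S*) for object p using threshold-based radial affinity."""
    beta = 1.0 - alpha
    norm_p = math.sqrt(sum(x * x for x in p))
    omegas, dists = [], []
    for c in centroids:  # first pass: angular affinity and radial distance
        norm_c = math.sqrt(sum(x * x for x in c))
        p_dot_c = sum(x * y for x, y in zip(p, c))
        omegas.append(p_dot_c / (norm_p * norm_c))
        d2 = norm_p ** 2 + norm_c ** 2 - 2.0 * p_dot_c
        dists.append(math.sqrt(max(d2, 0.0)))  # guard against rounding below zero
    d_star = statistics.fmean(dists) + h * statistics.pstdev(dists)
    k_star, s_star = None, 0.0
    for j, (omega, d) in enumerate(zip(omegas, dists)):  # second pass
        delta = 1.0 - d / d_star if d <= d_star else 0.0
        s = alpha * omega + beta * delta
        if s > s_star:
            k_star, s_star = j, s
    return k_star, s_star

# Hypothetical centroids and object
k, s = assign([0.9, 0.1], [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
```

The second pass is needed because D* depends on all K distances, so no composite value can be finalized until the first pass is complete.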
At the beginning of each global computation cycle, each cluster contains a single hypothetical centroid which would be a seeded value at the start of the first global computation cycle or a computed centroid of objects assigned to the cluster in a previous global computation cycle. The centroid seeds of the K clusters are judiciously selected as described above.
To determine an updated value of the centroid based on the newly assigned object, a straightforward approach is to retain pointers to vectors representing the objects so far assigned to the cluster and determine the centroid vector after each new allocation to the cluster as the mean value of the accumulated object vectors. However, this would be computationally intensive for clusters of large object memberships. Alternatively, the centroid vector may be determined recursively as described below.
Initially, the cluster is empty but is assigned a vector v* which may be a centroid seed or a centroid vector determined in a previous global computation cycle. The value of an update counter of a cluster is denoted “t”; initially, t=0 and a vector sum Q is set to equal v*. The update counter assumes values t of 1, 2, . . . for subsequent assignments of object vectors v1, v2, . . . to the cluster.
The centroid vector C of a cluster is determined recursively. With initial values t←0 and Q←v*, the value of C at t=1, when an object vector v1 is added to the cluster, is determined as (v*+v1)/2, and the value of C at t=2, when an object vector v2 is added to the cluster, is determined as (v*+v1+v2)/3, and so on. Thus, with each addition of a vector v to the cluster, the value of C can be determined from the recursion:
t←(t+1);
Q←(Q+v); and
C←Q/(t+1).
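The recursion can be sketched as a small class that keeps only the update counter t and the running vector sum Q. The class and attribute names are illustrative assumptions.

```python
class Cluster:
    """Recursive centroid update: t <- t+1, Q <- Q+v, C <- Q/(t+1)."""

    def __init__(self, seed):
        self.t = 0                   # update counter, initially 0
        self.q = list(seed)          # vector sum Q, initialized to v*
        self.centroid = list(seed)   # centroid equals v* before any assignment

    def add(self, v):
        """Account for a newly assigned object vector v."""
        self.t += 1
        self.q = [a + b for a, b in zip(self.q, v)]
        self.centroid = [a / (self.t + 1) for a in self.q]
        return self.centroid

cluster = Cluster([0.0, 0.0])
first = cluster.add([2.0, 4.0])    # (v* + v1) / 2
second = cluster.add([4.0, 2.0])   # (v* + v1 + v2) / 3
print(first, second)  # [1.0, 2.0] [2.0, 2.0]
```

Each update costs one vector addition and one vector scaling, independent of how many objects the cluster already holds. Note that, as in the recursion, the initial vector v* counts as one term of the average.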
The processes illustrated in
Thus, the invention provides a method of clustering a plurality of objects according to a clustering criterion. The method comprises configuring at least one hardware processor to perform processes of generating a set of K centroids, K>1, assigning each centroid to a respective cluster of a set of K clusters, then assigning objects, selected in a predetermined order, to one of the clusters based on affinity measures to the clusters. For a selected object, the method performs processes of evaluating a composite affinity measure to each centroid of the K centroids based on a radial-affinity measure and an angular-affinity measure and identifying a particular centroid of highest composite affinity measure. The selected object is then assigned to a particular cluster corresponding to the particular centroid and the particular centroid is updated to account for inclusion of the selected object. Identifiers of objects assigned to each cluster are stored for use in a marketing model.
Each object is characterized by a respective vector of descriptors. Updating the particular centroid comprises steps of maintaining a count of current objects assigned to the particular cluster, maintaining a vector sum of vectors of descriptors of the current objects, and determining an updated centroid as the vector sum divided by the count of current objects.
Optionally, each object may be assigned a respective weight and the predetermined order of allocating objects to respective clusters is selected as a descending order according to weight.
A cycle of allocating each object to a respective centroid may be repeated until the centroids are stabilized. A large number of cycles may be actuated. Preferably, the predetermined order of selecting objects for allocation to a cluster differs from one cycle to another. For each cycle, a respective pseudo-random sequence of distinct integers is generated. The integers correspond to memory addresses of vectors of descriptors of the plurality of objects. Thus, the predetermined order is established according to the respective pseudo-random sequence.
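A cycle-specific selection order can be sketched with a seeded pseudo-random permutation. In this sketch the integers are list indices standing in for memory addresses of the descriptor vectors; the function name, the seeding scheme, and the base seed value are assumptions.

```python
import random

def cycle_order(n_objects, cycle_index, base_seed=12345):
    """Return a pseudo-random permutation of 0..n_objects-1 for one cycle.
    Seeding with base_seed + cycle_index makes each cycle reproducible
    while giving different cycles different generator states."""
    rng = random.Random(base_seed + cycle_index)
    order = list(range(n_objects))
    rng.shuffle(order)
    return order

order_0 = cycle_order(1000, 0)
order_1 = cycle_order(1000, 1)
```

Every object is visited exactly once per cycle, and re-running a given cycle index reproduces the same order, which helps when comparing results across runs.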
The method further comprises steps of maintaining object-assignment records indicating for each object an identifier of a cluster to which each object is assigned as well as a corresponding composite affinity measure.
Optionally, several initial cycles of allocating each object to a respective centroid are actuated for all objects of the plurality of objects then several succeeding cycles of allocating objects to respective centroids are actuated for only each object of a composite affinity measure below a specified level.
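Restricting later cycles to objects of low composite affinity can be sketched using the object-assignment records. The record layout (a mapping from object identifier to a pair of cluster identifier and composite affinity), the function name, and the sample level of 0.8 are assumptions of this sketch.

```python
def objects_to_revisit(records, level):
    """Return identifiers of objects whose recorded composite affinity
    is below the specified level.

    records: {object_id: (cluster_id, composite_affinity)}
    """
    return [obj for obj, (_cluster, s) in records.items() if s < level]

# Hypothetical records after the initial cycles
records = {0: (1, 0.95), 1: (0, 0.40), 2: (1, 0.70)}
pending = objects_to_revisit(records, level=0.8)
print(pending)  # [1, 2]
```

Objects already at or above the level keep their assignment, so succeeding cycles evaluate affinities only for the remaining objects.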
The method further comprises determining an overall number of changes of object assignments to clusters during a cycle of allocating each object to a respective centroid.
While a ratio of the overall number of changes to a total number of objects of the plurality of objects exceeds a predefined threshold, the cycle of allocating each object to a respective centroid is repeated. The number of times the cycle is actuated may be limited to a predefined number.
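The stopping rule can be sketched as a driver loop around a cycle function that reports how many objects changed clusters. The function names, the threshold of 0.01, the limit of 50 cycles, and the sample change counts are all assumptions used for illustration.

```python
def run_until_stable(run_cycle, n_objects, threshold=0.01, max_cycles=50):
    """Repeat allocation cycles while the ratio of assignment changes to
    the total number of objects exceeds threshold, up to max_cycles.
    run_cycle(i) performs cycle i and returns its number of changes."""
    cycles = 0
    while cycles < max_cycles:
        changes = run_cycle(cycles)
        cycles += 1
        if changes / n_objects <= threshold:
            break
    return cycles

# Hypothetical change counts reported by successive cycles
changes_per_cycle = [500, 120, 30, 5, 0]
executed = run_until_stable(lambda i: changes_per_cycle[i], n_objects=1000)
print(executed)  # 4
```

With 1000 objects, the ratios are 0.5, 0.12, 0.03, and 0.005, so the loop stops after the fourth cycle.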
Preferably, each of the radial-affinity measure, the angular-affinity measure, and the composite affinity measure is determined as a normalized value bounded between 0 and 1.0.
According to a first implementation of the cluster-selection process, the angular affinity measure of an object to a centroid is determined as a dot product of vectors (C/∥C∥) and (P/∥P∥), the radial affinity measure of the object to the centroid is determined as a ratio ∥P∥/(∥P∥+D), and the composite affinity measure is a weighted sum of the angular affinity measure and the radial affinity measure, where C denotes a centroid vector of the centroid, P denotes an object vector of the object, ∥C∥ denotes magnitude of C, ∥P∥ denotes magnitude of P, and D denotes the Euclidean distance ∥(P−C)∥.
According to a second implementation of the cluster-selection process, the angular affinity measure of an object to a centroid is determined as a dot product of vectors (C/∥C∥) and (P/∥P∥), the radial affinity measure of the object to the centroid is determined as a ratio F defined as F=(1−D/D*) for D<D* and F=0 otherwise, and the composite affinity measure is a weighted sum of the angular affinity measure and the radial affinity measure, where C denotes a centroid vector of the centroid, P denotes an object vector of the object, ∥C∥ denotes magnitude of C, ∥P∥ denotes magnitude of P, D denotes the Euclidean distance ∥P−C∥, and D* is a predefined distance threshold, D*>0.
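The radial measure of the second implementation differs only in how the distance is normalized. A minimal sketch, with the hypothetical function name `radial_affinity_thresholded` and the distance threshold passed as `d_star`:

```python
import math


def radial_affinity_thresholded(c, p, d_star):
    """Return F = 1 - D/D* when D < D*, and 0 otherwise, with D = ||P - C||."""
    if d_star <= 0:
        raise ValueError("d_star must be positive")
    d = math.sqrt(sum((a - b) ** 2 for a, b in zip(p, c)))
    return 1.0 - d / d_star if d < d_star else 0.0
```

The value falls linearly from 1.0 at the centroid to 0 at distance D*, and remains 0 beyond it.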
The clustering method described above with reference to
The tendency of objects to divide naturally into locked objects and free objects may be exploited to reduce the computational effort of clustering massive data. As numerous cycles are actuated, the gradual designation of more objects as locked objects reduces the computational effort.
During each cycle of phase-1, 1721, the content of the cluster is divided into a set 1761 of locked objects and a set 1771 of free objects. During each cycle of phase-1, the affinity level of each object of the set 1771 of free objects to each of K centroids is determined. The centroid C(1) may shift during each cycle of phase-1.
Likewise, during each cycle of phase-2, 1722, the content of the cluster is divided into a set 1762 of locked objects and a set 1772 of free objects. During each cycle of phase-2, the affinity level of each object of the set 1772 of free objects to each of K centroids is determined. The proportion of locked objects during phase-2 is higher than the proportion of locked objects during phase-1. The centroid C(1) may shift during each cycle of phase-2.
The trend continues during phase-3, 1723, phase-4, 1724, etc., where the proportion of locked objects (1763, 1764, . . . ) continues to increase and affinity levels to K-centroids are computed for only free objects (1773, 1774, . . . ).
During a first phase covering a number of cycles (the first five cycles for example), the initial global affinity threshold of 1.0 is maintained. Thus, every object in every cluster is considered a free object and is allowed to look for a better cluster; an object is considered a free object only if its computed affinity to a respective cluster is less than a current value of the global affinity threshold. At the end of each cycle, the affinity threshold may be modified.
During each of subsequent phases, each phase covering a respective number of cycles, the global affinity threshold is reduced according to a predetermined rule. For example, the global affinity threshold may be multiplied by 0.8 for each new phase. A flexible means is to provide an array of global affinity thresholds where each entry corresponds to a cycle index. For example, phase-0 may cover cycles of indices 0 to 4, phase-1 may cover cycles of indices 5 to 9, and so on as indicated in the table below.
Thus, following each cycle, process 1840 determines if the global affinity threshold is to be updated. If an update is due, process 1850 determines a new global affinity threshold either according to a rule, such as assigning a value based on cycle index according to a predetermined formula, or by indexing an array similar to the exemplary array above.
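Both options (a rule and an indexed array) can be sketched together by precomputing the array from the rule. The values below are illustrative assumptions following the examples given in the text (an initial threshold of 1.0, five cycles per phase, a factor of 0.8 per phase); they are not a reproduction of the referenced table, and the function name is hypothetical.

```python
def threshold_schedule(num_cycles, cycles_per_phase=5, initial=1.0, factor=0.8):
    """Return a list indexed by cycle; entry i is the global affinity threshold for cycle i."""
    return [initial * factor ** (cycle // cycles_per_phase)
            for cycle in range(num_cycles)]


# Cycles 0 to 4 form phase-0, cycles 5 to 9 form phase-1, and so on.
schedule = threshold_schedule(num_cycles=15)
```

An object would then be treated as free during cycle i only if its recorded affinity is below `schedule[i]`.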
With an update of the global affinity threshold, process 1860 is actuated to determine for each cluster a respective count (“locked object count”, denoted L*) of the number of objects to be locked to respective clusters and a respective sum of descriptor vectors of all locked objects (“locked vector sum”, denoted Q*). These values will be used in process 1180 to determine updated centroids. With each new cycle 1110 of the multiple cycles of
The clustering method described above with reference to
Thus, the invention provides a method of clustering a plurality of objects comprising: determining for every object of the plurality of objects a respective characterizing object vector; initializing a singular affinity of every object to exceed 1.0; initializing a set of clusters of objects as empty sets; and assigning a centroid with a respective centroid vector to each cluster of the set of clusters.
During each phase of successive phases, a phase-specific affinity threshold is determined and a predefined number of cycles is actuated, performing for each cycle processes of: determining an affinity level of each object having a respective singular affinity below the phase-specific affinity threshold to each cluster according to the respective characterizing object vector and the respective centroid vector; and assigning said each object to a specific cluster corresponding to highest affinity level.
Upon completion of each cycle the singular affinity of each object is revised to equal a corresponding highest affinity level and the centroid vector of each cluster is updated.
For each cluster, and preceding each phase, a respective phase-specific vector sum of object vectors of specific objects assigned to the cluster is determined, each of the specific objects having a singular affinity not less than said phase-specific affinity threshold.
During each cycle, for each cluster, a cycle-specific vector sum of object vectors of all objects assigned to said each cluster is determined.
The process of updating a centroid vector of a cluster comprises equating the centroid vector to a summation of the phase-specific vector sum and the cycle-specific vector sum divided by a total number of objects assigned to the cluster upon completion of said each cycle.
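The centroid update can be sketched as below. The sketch rests on an interpretive assumption: the phase-specific sum covers the locked objects of the cluster, and the cycle-specific sum covers only the objects re-evaluated during the cycle (the free objects), so that no object is counted twice. The function and parameter names are hypothetical.

```python
def update_centroid(locked_sum, locked_count, free_vectors):
    """Return (locked_sum + sum of free_vectors) / total count, or None if the cluster is empty.

    locked_sum   -- phase-specific vector sum of the locked objects
    locked_count -- number of locked objects
    free_vectors -- object vectors assigned to the cluster during this cycle
    """
    dim = len(locked_sum)
    cycle_sum = [0.0] * dim
    for vec in free_vectors:  # cycle-specific vector sum
        for i, x in enumerate(vec):
            cycle_sum[i] += x
    total = locked_count + len(free_vectors)
    if total == 0:
        return None  # an empty cluster has no defined centroid in this sketch
    return [(locked_sum[i] + cycle_sum[i]) / total for i in range(dim)]


# Two locked objects summing to (4, 2) plus two free objects.
centroid = update_centroid([4.0, 2.0], 2, [[1.0, 1.0], [1.0, 3.0]])
```

Because the locked sum is computed once per phase, each cycle only needs to accumulate the vectors of the free objects.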
Whether or not the feature of locking objects to clusters is activated, the computational effort may be reduced by limiting the search domain. During a cycle of centroid update, an object of a specific cluster may consider migrating to a cluster of high affinity to the specific cluster. The inter-cluster affinity may be determined in terms of respective inter-centroid affinity. After completion of a centroid-update cycle, the affinity of each pair of centroids may be determined. With K centroids, the number of computations of inter-cluster affinity levels is (K×(K−1))/2, which is significantly smaller than the number of computations N×K of object-cluster affinity levels since N is typically much larger than K. With 1,000,000 objects (N=1,000,000) and 100 clusters (K=100), for example, the number of computations of an affinity level of each object to each centroid would be 10^8, while the number of computations of inter-cluster affinity levels would be 4950. Additionally, a centroid pair of low affinity, below a predefined lower bound, may be eliminated.
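A limited search domain may be sketched as a neighbor list per cluster. The sketch assumes, for illustration only, that inter-centroid affinity is measured as the cosine of the angle between centroid vectors; the description does not fix the measure, and the names `neighbor_clusters` and `lower_bound` are hypothetical.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine of the angle between two non-zero vectors (assumed affinity measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def neighbor_clusters(centroids, lower_bound):
    """Return (neighbors, pairs_evaluated); pairs below lower_bound are eliminated."""
    neighbors = {k: [] for k in range(len(centroids))}
    pairs_evaluated = 0
    for i, j in combinations(range(len(centroids)), 2):  # K*(K-1)/2 pairs
        pairs_evaluated += 1
        if cosine(centroids[i], centroids[j]) >= lower_bound:
            neighbors[i].append(j)
            neighbors[j].append(i)
    return neighbors, pairs_evaluated


neighbors, pairs = neighbor_clusters([[1.0, 0.0], [1.0, 0.1], [0.0, 1.0]], lower_bound=0.9)
```

An object of cluster k would then be compared only against centroid k and the centroids listed in `neighbors[k]`, rather than against all K centroids.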
The compound effect of activating the feature of locking objects to clusters and limiting the search domain can be a significant reduction of the overall computational effort.
Table-I identifies neighboring clusters of each of the 25 clusters of
Thus, the invention provides a method of clustering a plurality of objects comprising: determining for each object of the plurality of objects a respective characterizing object vector;
Upon completion of each cycle, updating a centroid vector of said each cluster.
The method further comprises setting the respective phase-specific search domain to be the set of clusters for an initial phase of said successive phases. Upon completion of each phase, performing steps of
According to the first method of expediting clustering processes, described above with reference to
Thus, as illustrated in
According to the second method of expediting clustering processes, described above with reference to
Thus, as illustrated in
The first and second methods of expediting clustering processes may be combined resulting in further decrement of the requisite processing effort as illustrated in
Alternatively, instead of object-constellation assignment based on determining the affinity level of each of the N objects to each of the constellation centers, object-constellation assignment may be based on the already tracked information of
Thus, the invention provides a method of segmenting a plurality of objects based on performing multiple independent segmentation processes where each segmentation process produces a set of object clusters. The method comprises steps of storing descriptor vectors of N objects of the plurality of objects in a memory device, and configuring at least one hardware processor to perform processes of generating a plurality of distinct sets of K centroid seeds, 3<2K<N, generating a plurality of distinct pseudo-random sequences of N non-repeating integers corresponding to memory addresses of descriptor vectors, and executing M independent segmentation processes of the N objects, M>2.
Each segmentation process is based on composite radial-angular affinity. A segmentation process starts with a respective one of the sets of K centroid seeds then selects objects for allocation to clusters according to a respective pseudo-random sequence. Each segmentation produces a respective set of K centroids.
Executing the M segmentation processes produces a plurality of centroids which are, in turn, segmented into K constellations, starting with any of the sets of K centroids as K constellation seeds and assigning each of the remaining centroids to one of the K constellations. Upon formation of the constellations, each object selected according to a respective pseudo-random sequence is allocated to a respective constellation according to constituent centroids of the constellations.
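The formation of constellations can be sketched as follows. The sketch assumes that each remaining centroid joins the seed to which it has the highest affinity, and it uses cosine similarity as a stand-in for the affinity measure; both choices, and the function names, are assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Stand-in affinity: cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def form_constellations(centroid_sets, affinity=cosine):
    """Group M sets of K centroids into K constellations.

    The first set supplies the K seeds; every centroid of the remaining
    sets joins the constellation of the seed with the highest affinity.
    """
    seeds = centroid_sets[0]
    constellations = [[seed] for seed in seeds]
    for centroid_set in centroid_sets[1:]:
        for centroid in centroid_set:
            best = max(range(len(seeds)), key=lambda k: affinity(seeds[k], centroid))
            constellations[best].append(centroid)
    return constellations


# M = 3 segmentation runs, K = 2 centroids each (made-up values).
centroid_sets = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.1, 0.9], [0.9, 0.2]],
    [[0.8, 0.1], [0.2, 1.0]],
]
constellations = form_constellations(centroid_sets)
```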
The M independent segmentation processes may be executed sequentially or concurrently.
Each segmentation comprises assigning each centroid seed to a respective cluster of a set of K clusters and, for each object selected according to a respective pseudo-random sequence, performing processes of: evaluating a composite affinity measure to each centroid of the K centroids; identifying a particular centroid of highest composite affinity measure; and assigning each selected object to a particular cluster corresponding to the particular centroid. The composite affinity measure is based on a radial-affinity measure and an angular-affinity measure to each centroid. The particular centroid is then updated to account for inclusion of each selected object.
According to an implementation, allocating an object to a respective constellation comprises steps of determining a center of each constellation of the K constellations, based on the constituent centroids of the constellations, and determining an affinity measure of the object to the center. The object is assigned to the constellation to whose center the object has the highest affinity measure.
According to another implementation, allocating an object to a respective constellation comprises steps of identifying a specific centroid of the plurality of centroids to which the object has highest composite affinity measure and selecting a constellation containing the specific centroid.
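This second way of allocating an object to a constellation can be sketched as a lookup over all centroids. The sketch assumes a flat list of centroids with a parallel list `constellation_of` giving the constellation index of each centroid, and it uses cosine similarity as a stand-in for the composite affinity measure; these names and the stand-in measure are illustrative assumptions.

```python
import math


def cosine(a, b):
    """Stand-in for the composite affinity measure (assumption for this sketch)."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def assign_to_constellation(obj, centroids, constellation_of, affinity=cosine):
    """Return the constellation containing the centroid of highest affinity to obj."""
    best = max(range(len(centroids)), key=lambda i: affinity(centroids[i], obj))
    return constellation_of[best]


centroids = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
constellation_of = [0, 0, 1, 1]  # centroid index -> constellation index
chosen = assign_to_constellation([0.2, 1.0], centroids, constellation_of)
```

In the example, the object is closest in angle to the fourth centroid, so it is allocated to constellation 1.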
The method further comprises maintaining object-assignment records indicating for each selected object an identifier of a cluster to which each selected object is assigned and a corresponding composite affinity measure.
Systems and apparatus of the embodiments of the invention may be implemented as any of a variety of suitable circuitry, such as one or more microprocessors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field programmable gate arrays (FPGAs), discrete logic, software, hardware, firmware or any combinations thereof. When modules of the systems of the embodiments of the invention are implemented partially or entirely in software, the modules contain a memory device for storing software instructions in a suitable, non-transitory computer-readable storage medium, and software instructions are executed in hardware using one or more processors to perform the techniques of this disclosure.
It should be noted that methods and systems of the embodiments of the invention and data sets described above are not, in any sense, abstract or intangible. Instead, the data is necessarily presented in a digital form and stored in a physical data-storage computer-readable medium, such as an electronic memory, mass-storage device, or other physical, tangible, data-storage device and medium. It should also be noted that the currently described data-processing and data-storage methods cannot be carried out manually by a human analyst, because of the complexity and vast numbers of intermediate results generated for processing and analysis of even quite modest amounts of data. Instead, the methods described herein are necessarily carried out by electronic computing systems having processors on electronically or magnetically stored data, with the results of the data processing and data analysis digitally stored in one or more tangible, physical, data-storage devices and media.
Although specific embodiments of the invention have been described in detail, it should be understood that the described embodiments are intended to be illustrative and not restrictive. Various changes and modifications of the embodiments shown in the drawings and described in the specification may be made within the scope of the following claims without departing from the scope of the invention in its broader aspect.
The present application is a national entry of PCT/IB2018/057019 filed Sep. 13, 2018, which claims the benefit of provisional application 62/558,085 filed on Sep. 13, 2017, entitled “Composite Radial-Angular Clustering of a Large Scale Social Graph”, the entire content of both applications being incorporated herein by reference.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/IB2018/057019 | 9/13/2018 | WO | 00 |
Number | Date | Country |
---|---|---|
62558085 | Sep 2017 | US |