The present invention relates to clustering of a large number of objects. In particular, the invention is directed to selection of centroid seeds for efficient segmentation of a social graph representing a large number of tracked users of social networks.
Finding a globally optimal segmentation of a population of a large number of objects, exceeding 10,000 for example, may require prohibitively extensive computational effort. Using the K-means method with a predefined objective function, an attained segmentation of a population under consideration into K clusters, K being a specified integer exceeding unity, corresponds to a local minimum of the objective function.
For a particular population of objects, and for: a given number of clusters; a particular affinity-measure definition; and a particular rule for assigning an object to a cluster; the contents of the steady-state clusters are not unique. The segmentation rule attempts to maximize a metric of overall object-centroid affinity. However, a person skilled in the art is well aware that, for a large number of objects, a global maximum metric is generally not attainable, except by lucky coincidence. The contents of the clusters are heavily dependent on the initial selection of the set of clusters and, to a lesser extent, on the sequential order in which the objects—or candidate descriptor vectors in general—are considered. Additionally, the segmentation computational effort strongly depends on the initial selection of the set of clusters.
The objective of the invention is to provide methods of segmenting objects of a vast social graph into clusters of objects for enhancing marketing intelligence. An initial set of clusters each populated with a single centroid is used to start the segmentation process. The segmentation process assigns objects to clusters according to affinity measures of each object to centroids of the clusters and rules based on the affinity measures. The objects, and consequently the centroids, are represented as descriptor vectors in a multi-dimensional descriptor space. The addition of an object to a cluster naturally changes the position of the centroid of the cluster in the multi-dimensional descriptor space. Consequently, the segmentation process has to be repeated numerous times to redefine the centroids until steady-state descriptor vectors of the centroids are reached.
A judicious selection of the initial centroid set can result in creating clusters of improved distinctive contents as well as reducing the segmentation computational effort. The judicious selection according to the present invention is based on finding mutually repulsing centroids based on predefined affinity thresholds.
The methods of the present invention, together with the methods disclosed in U.S. Provisional Application 62/558,085 (filed on Sep. 12, 2017, entitled “Composite Radial-Angular Clustering of a Large-Scale Social Graph”), aim at minimizing a first metric of global inter-centroid affinity and subsequently maximizing a second metric of global object-centroid affinity.
In accordance with an aspect, the invention provides a method of generating a set of centroids of a plurality of objects. The method comprises processes of specifying a target number of centroids and employing a processor to execute instructions for: obtaining, for each object of the plurality of objects, a respective characterizing vector of v variables, v>1; determining for each variable of the v variables respective moments based on obtained characterizing vectors; repeating a procedure of generating a centroid until the target number of centroids is attained, and storing the set of centroids for starting a segmentation process of the plurality of objects.
The procedure for generating a centroid comprises processes of generating v random cumulative-probability values and for each variable, accessing a respective software module providing a deduced value of the variable corresponding to a respective one of the random cumulative-probability values, the deduced value being an element of a vector representing a new centroid of the set of centroids, the respective software module being configured to evaluate a respective probability distribution function tailored to the respective moments.
The process of obtaining, for each object of the plurality of objects, a respective characterizing vector of v variables further comprises processes of: assigning v weights to the v variables, each weight being variable specific and bounded to positive values not exceeding 1.0; and normalizing each of the v variables so that: a minimum value of each variable equals 0.0; and a maximum value of each variable equals a corresponding variable-specific weight.
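The weighted normalization described above may be sketched as follows. This is a minimal illustration only: the function name normalize_weighted is hypothetical, and using the observed minimum and maximum of each variable as the shifting and scaling bounds is an assumption.

```python
def normalize_weighted(vectors, weights):
    """Shift and scale each variable so that it spans [0.0, weight]."""
    if any(not (0.0 < w <= 1.0) for w in weights):
        raise ValueError("each weight must be positive and not exceed 1.0")
    columns = list(zip(*vectors))            # one column per variable
    lows = [min(col) for col in columns]
    spans = [max(col) - min(col) for col in columns]
    return [
        tuple(
            w * (x - lo) / span if span else 0.0   # constant variable maps to 0.0
            for x, lo, span, w in zip(vec, lows, spans, weights)
        )
        for vec in vectors
    ]
```

For example, with weights (1.0, 0.5) the first variable is mapped onto the range 0.0 to 1.0 and the second onto the range 0.0 to 0.5, so that the second variable contributes less to distance-based affinity.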
The method further comprises selecting the respective probability distribution function as one of: a Gamma distribution; a Weibull distribution; and a piecewise linear distribution. The respective moments comprise at least a first moment and a second moment. The type of the respective probability distribution function may be user defined.
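By way of illustration only, the moment-based generation of centroids may be sketched for the Weibull option. The sketch assumes that the distribution is tailored to the first two moments by matching the mean and standard deviation of each variable, with the shape parameter found by bisection; the function names, the bisection range, and the iteration count are hypothetical choices and are not taken from the described software modules.

```python
import math
import random


def fit_weibull(mean, std):
    """Return (shape, scale) of a Weibull distribution matching mean and std."""
    target = (std / mean) ** 2  # squared coefficient of variation

    def cv_squared(shape):
        g1 = math.gamma(1.0 + 1.0 / shape)
        g2 = math.gamma(1.0 + 2.0 / shape)
        return g2 / (g1 * g1) - 1.0

    lo, hi = 0.1, 50.0  # assumed search range; cv_squared decreases with shape
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if cv_squared(mid) > target:
            lo = mid
        else:
            hi = mid
    shape = 0.5 * (lo + hi)
    scale = mean / math.gamma(1.0 + 1.0 / shape)
    return shape, scale


def weibull_inverse_cdf(u, shape, scale):
    """Value at which the Weibull cumulative probability equals u, 0 <= u < 1."""
    return scale * (-math.log1p(-u)) ** (1.0 / shape)


def generate_centroids(moments, target_count, rng):
    """moments holds one (mean, std) pair per variable."""
    params = [fit_weibull(mean, std) for mean, std in moments]
    centroids = []
    while len(centroids) < target_count:
        # One random cumulative-probability value per variable.
        centroids.append(tuple(
            weibull_inverse_cdf(rng.random(), shape, scale)
            for shape, scale in params
        ))
    return centroids
```

A call such as generate_centroids([(1.0, 0.5), (2.0, 2.0)], 5, random.Random(0)) would return five two-variable vectors. A Gamma or piecewise linear distribution could be substituted by replacing the fitting and inverse functions.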
In accordance with another aspect, the invention provides a method of generating centroids of a plurality of objects. The method comprises specifying an affinity threshold and employing a processor to execute instructions for: acquiring a descriptor vector of v variables, v>1, for each object of the plurality of objects; initializing a centroid set to include an object of the plurality of objects; and performing for each object of the plurality of objects a procedure for deciding whether the object qualifies as a centroid. The procedure comprises determining an affinity measure to each centroid of the centroid set based on a descriptor vector of the each object and a descriptor vector of the each centroid and selecting the each object as a centroid to be added to the centroid set subject to ascertaining that the affinity measure to the each centroid is less than the affinity threshold. Thereby, the method creates a set of uniformly spaced centroids for use in automated intelligent-marketing systems.
The process of acquiring a descriptor vector comprises normalizing the v variables so that a value of each variable is within a predefined range.
In one implementation, normalizing the v variables comprises scaling the variables so that a mean value of each variable equals 1.0. In another implementation, normalizing the v variables comprises shifting and scaling the variables so that a minimum value and a maximum value of each variable equal 0.0 and 1.0 respectively. In a further implementation, normalizing the v variables comprises shifting and scaling the variables so that a minimum value of each variable equals 0.0 and a maximum value of each variable equals a respective variable-specific positive upper bound not exceeding 1.0.
Performing the procedure for determining whether the object qualifies as a centroid is terminated subject to ascertaining that the set of centroids contains a number of centroids equal to a predefined upper bound.
The method further comprises generating non-repeating randomly sequenced indices of objects of the plurality of objects; and selecting objects of the plurality of objects at indices corresponding to the randomly sequenced indices.
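The selection procedure of this aspect, including the randomly sequenced indices and the upper bound on the number of centroids, may be sketched as follows. The radial affinity form 1/(1+d), d being the Euclidean distance, is an assumption made for illustration (the normalized form is detailed in the referenced Provisional Application), and the function names are hypothetical.

```python
import math
import random


def radial_affinity(a, b):
    # Assumed form: 1.0 at zero distance, approaching 0.0 as distance grows.
    return 1.0 / (1.0 + math.dist(a, b))


def select_centroids(objects, threshold, max_centroids,
                     affinity=radial_affinity, rng=None):
    """Select objects whose affinity to every chosen centroid is below threshold."""
    if not objects:
        return []
    rng = rng or random.Random()
    # Non-repeating randomly sequenced indices of the objects.
    order = rng.sample(range(len(objects)), len(objects))
    centroids = [objects[order[0]]]          # the set starts with one object
    for index in order[1:]:
        if len(centroids) >= max_centroids:  # predefined upper bound reached
            break
        candidate = objects[index]
        if all(affinity(candidate, c) < threshold for c in centroids):
            centroids.append(candidate)
    return centroids
```

Because each object is compared only with centroids already selected, the outcome generally depends on the order of examination, which is why the order is randomized.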
The process of determining an affinity measure comprises computing a radial-affinity level and an angular-affinity level between each object and each centroid, and computing the affinity measure as a function of the radial-affinity level and the angular-affinity level. The function may be selected as a weighted sum of the radial-affinity level and the angular-affinity level.
In one embodiment, the process of ascertaining that the affinity measure to each centroid is less than the affinity threshold comprises verifying that: the radial-affinity level is less than the radial-affinity threshold; and the angular-affinity level is less than the angular-affinity threshold.
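The two ways of applying the affinity threshold, as a composite weighted sum or as separate radial and angular tests, may be sketched as follows. The radial form 1/(1+d) and the use of cosine similarity as the angular level are assumptions made for illustration; the normalized levels themselves are defined in the referenced Provisional Application. All function names and the equal default weights are hypothetical.

```python
import math


def radial_level(a, b):
    # Assumed form: 1.0 at zero Euclidean distance, decaying toward 0.0.
    return 1.0 / (1.0 + math.dist(a, b))


def angular_level(a, b):
    # Assumed form: cosine of the angle, computed from the dot product.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a, norm_b = math.hypot(*a), math.hypot(*b)
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def composite_affinity(a, b, w_radial=0.5, w_angular=0.5):
    """Weighted sum of the radial-affinity and angular-affinity levels."""
    return w_radial * radial_level(a, b) + w_angular * angular_level(a, b)


def below_both_thresholds(a, b, radial_threshold, angular_threshold):
    """True only if both levels are below their respective thresholds."""
    return (radial_level(a, b) < radial_threshold
            and angular_level(a, b) < angular_threshold)
```

With these assumed forms, the orthogonal vectors (1, 0) and (0, 1) have an angular level of zero and a radial level of about 0.41, so the pair passes a dual test with both thresholds set to 0.5.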
In accordance with a further aspect, the invention provides a method of creating centroids of a plurality of objects. The method comprises specifying an affinity threshold and employing a processor to execute instructions for acquiring, for each object of the plurality of objects, a respective characterizing vector of v variables, v>1, and deducing for each variable a respective cumulative distribution function to produce v cumulative distribution functions. The instructions further cause the processor to execute processes of initializing a centroid set as an empty set, generating a succession of descriptor vectors each comprising v variables, and performing for each descriptor vector of the succession of descriptor vectors a procedure for descriptor-vector election as a centroid vector.
The procedure comprises processes of determining an affinity measure to each centroid of the centroid set based on the each descriptor vector and a descriptor vector of each centroid, and assigning the each descriptor vector to the centroid set as a centroid subject to ascertaining that the affinity measure to the each centroid is less than the affinity threshold.
Thus, the method creates a set of uniformly spaced centroids for use in automated intelligent-marketing systems.
The process of generating a succession of descriptor vectors comprises randomly indexing an inverse of a cumulative distribution function of each variable of the v variables to determine v variable values forming a descriptor vector of the succession of descriptor vectors.
In one implementation, the process of acquiring the respective characterizing vector of v variables comprises normalizing each of the v variables to be within a predefined range.
In another implementation, the process of acquiring the respective characterizing vector of v variables comprises assigning for each variable a respective variable-specific weight greater than 0.0 and not exceeding 1.0, then shifting and scaling each of the variables so that: a minimum value of each variable equals 0.0; and a maximum value of each variable equals a corresponding variable-specific weight.
The affinity measure to the empty centroid set is assigned a value of zero.
The method terminates performing the procedure for descriptor vector election as a centroid vector upon determining that a count of centroids of the set of centroids equals a predefined upper bound.
The process of determining an affinity measure comprises computing a radial-affinity level and an angular-affinity level between each descriptor vector and each centroid, and computing the affinity measure as a function of the radial-affinity level and the angular-affinity level. The function may be formed as a weighted sum of the radial-affinity level and the angular-affinity level.
In one implementation, the process of specifying an affinity threshold comprises itemizing the affinity threshold as a radial-affinity threshold and an angular-affinity threshold. Accordingly, the process of determining an affinity measure comprises computing a radial affinity level and an angular-affinity level between the each descriptor vector and each centroid. Subsequently, ascertaining that the affinity measure to each centroid is less than the affinity threshold comprises verifying that the radial-affinity level is less than the radial-affinity threshold and the angular-affinity level is less than the angular-affinity threshold.
In accordance with a further aspect, the invention provides a method of creating centroids of a plurality of objects. The method comprises specifying a target number of centroids and an affinity threshold, and defining bounds of v variables, v>1, each object of the plurality of objects being characterized by a respective vector of descriptors of the v variables within the bounds. A processor is employed to execute instructions for generating a maximal centroid set comprising a maximum attainable number of centroids selected from the plurality of objects conditional on an affinity measure of each centroid to each other centroid being less than the affinity threshold. Where the maximum attainable number differs from the target number, the instructions further cause the processor to execute processes of iteratively tuning the affinity threshold and generating the centroid set until the maximum attainable number equals the target number or a predefined permissible number of iterations is reached. The maximal centroid set is stored for starting a segmentation process of the plurality of objects.
Tuning the affinity threshold comprises increasing the affinity threshold subject to a determination that the maximum attainable number is less than the target number, or decreasing the affinity threshold subject to a determination that the maximum attainable number exceeds the target number.
Generating a centroid set comprises initializing the centroid set as an empty set of zero count of centroids and performing for each object processes of: determining an affinity measure to each centroid of the centroid set; and adding the each object to the centroid set, updating the count of centroids, subject to ascertaining that the affinity measure to each centroid is less than the affinity threshold. When all objects are considered, the count of centroids becomes the maximum attainable number of centroids. In one implementation, the affinity measure is determined as a composite radial-angular affinity measure formulated as a function of a radial-affinity level and an angular affinity level and the affinity threshold is determined as a specific value of the composite radial-angular affinity measure.
Alternatively, generating the centroid set comprises initializing the centroid set as an empty set of zero count of centroids and performing for each object processes of: determining a radial affinity level and an angular affinity level to each centroid of the centroid set; and adding the each object to the centroid set, updating the count of centroids, subject to ascertaining that the radial affinity level to the each centroid is less than a predefined radial threshold and the angular affinity level to the each centroid is less than the angular threshold. When all objects are considered, the count of centroids becomes the maximum attainable number of centroids.
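The iterative tuning of a single affinity threshold toward a target number of centroids may be sketched as follows. The affinity form 1/(1+d), the initial adjustment step, and the halving of the step after each iteration are assumptions made for illustration; the function names are hypothetical.

```python
import math


def affinity(a, b):
    # Assumed radial form: 1.0 at zero distance, decaying toward 0.0.
    return 1.0 / (1.0 + math.dist(a, b))


def maximal_centroid_set(objects, threshold):
    """Examine every object once; keep those repelled by all chosen centroids."""
    centroids = []                      # empty set, zero count of centroids
    for obj in objects:
        if all(affinity(obj, c) < threshold for c in centroids):
            centroids.append(obj)
    return centroids


def tune_threshold(objects, target, threshold, step=0.1, max_iterations=20):
    """Adjust the threshold until the attainable count equals the target."""
    centroids = maximal_centroid_set(objects, threshold)
    for _ in range(max_iterations):     # permissible number of iterations
        if len(centroids) == target:
            break
        if len(centroids) < target:
            threshold += step           # a higher threshold admits more centroids
        else:
            threshold -= step           # a lower threshold admits fewer centroids
        step *= 0.5                     # assumed damping of the adjustment
        centroids = maximal_centroid_set(objects, threshold)
    return centroids, threshold
```

For five objects spaced one unit apart on a line, an initial threshold of 0.3 yields two centroids; with a target of three, one upward adjustment to 0.4 yields the objects at positions 0, 2, and 4.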
In accordance with a further aspect, the invention provides a method of creating centroids of a plurality of objects. The method comprises specifying a target number of centroids, a radial threshold, and an angular threshold, and defining bounds of v variables, v>1, each object of the plurality of objects being characterized by a respective vector of descriptors of the v variables within the bounds. A processor is employed to execute instructions for generating a maximal centroid set comprising a maximum attainable number of centroids selected from the plurality of objects conditional on a radial affinity level of each centroid to each other centroid being less than the radial threshold and an angular affinity level of each centroid to each other centroid being less than the angular threshold. Upon determining that the maximum attainable number of centroids differs from the target number, the instructions cause the processor to execute processes of iteratively tuning the radial threshold and the angular threshold, and generating the centroid set until the maximum attainable number equals the target number or a predefined permissible number of iterations is reached. The generated maximal centroid set is stored for use in a segmentation process of the plurality of objects.
Tuning the radial threshold and the angular threshold comprises increasing at least one of the radial and the angular thresholds subject to a determination that the maximum attainable number is less than the target number, or decreasing at least one of the radial and the angular thresholds subject to a determination that the maximum attainable number exceeds the target number.
Generating the centroid set comprises initializing a centroid set as an empty set of zero count of centroids and performing for each object processes of: determining a radial affinity level and an angular affinity level to each centroid of the centroid set; and adding the each object to the centroid set and updating the count of centroids subject to ascertaining that the radial affinity level to each centroid is less than the radial threshold and the angular affinity level to each centroid is less than the angular threshold. When all objects are considered, the count of centroids becomes the maximum attainable number of centroids.
The method further comprises determining the radial threshold as a mean value of a radial lower bound and a radial upper bound, and determining the angular threshold as a mean value of an angular lower bound and an angular upper bound.
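Determining each threshold as the mean of a lower bound and an upper bound suggests a bisection search, which may be sketched as follows. Moving both thresholds together and replacing a bound with the current threshold after each iteration are assumptions; the text requires only that at least one threshold be adjusted. The affinity forms and all function names are likewise hypothetical.

```python
import math


def radial_level(a, b):
    return 1.0 / (1.0 + math.dist(a, b))            # assumed form


def angular_level(a, b):
    dot = sum(x * y for x, y in zip(a, b))          # assumed cosine form
    norm_a, norm_b = math.hypot(*a), math.hypot(*b)
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_set(objects, radial_threshold, angular_threshold):
    centroids = []
    for obj in objects:
        if all(radial_level(obj, c) < radial_threshold
               and angular_level(obj, c) < angular_threshold
               for c in centroids):
            centroids.append(obj)
    return centroids


def tune_dual(objects, target, radial_bounds, angular_bounds, max_iterations=30):
    """Bisect both thresholds; max_iterations must be at least 1."""
    r_lo, r_hi = radial_bounds
    a_lo, a_hi = angular_bounds
    for _ in range(max_iterations):
        r_t = 0.5 * (r_lo + r_hi)       # mean of radial lower and upper bounds
        a_t = 0.5 * (a_lo + a_hi)       # mean of angular lower and upper bounds
        centroids = build_set(objects, r_t, a_t)
        if len(centroids) == target:
            break
        if len(centroids) < target:
            r_lo, a_lo = r_t, a_t       # too few centroids: raise thresholds
        else:
            r_hi, a_hi = r_t, a_t       # too many centroids: lower thresholds
    return centroids, r_t, a_t
```

The count of centroids produced by the sequential procedure is not guaranteed to vary monotonically with the thresholds, so the search is bounded by a permissible number of iterations rather than relying on convergence.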
In accordance with a further aspect, the invention provides an apparatus for generating a set of centroids of a plurality of objects. The apparatus comprises a memory device storing processor executable instructions causing a processor to determine a target number of centroids; obtain, for each object of the plurality of objects, a respective characterizing vector of v variables, v>1; and determine for each variable of the v variables respective moments based on obtained characterizing vectors. The instructions cause the processor to generate v random cumulative-probability values and, for each variable, access a respective software module providing a deduced value of each variable corresponding to a respective one of the random cumulative-probability values, the deduced value being an element of a vector representing a new centroid of the set of centroids, the respective software module being configured to evaluate a respective probability distribution function tailored to the respective moments. The instructions cause the processor to repeat generating a new centroid until the target number of centroids is attained. The set of centroids is stored in a storage medium for starting a segmentation process of the plurality of objects.
In accordance with a further aspect, the invention provides an apparatus for generating centroids of a plurality of objects. The apparatus comprises a memory device storing processor executable instructions causing a processor to determine an affinity threshold, acquire a descriptor vector of v variables, v>1, for each object of the plurality of objects, and initialize a centroid set to include an object of the plurality of objects. The instructions cause the processor to determine, for each object of the plurality of objects, an affinity measure to each centroid of the centroid set as a function of a descriptor vector of the each object and a descriptor vector of each centroid. An object is added as a centroid to the centroid set subject to ascertaining that the affinity measure to each centroid is less than the affinity threshold. Thus, the apparatus creates a set of uniformly spaced centroids for use in automated intelligent-marketing systems.
In accordance with a further aspect, the invention provides an apparatus for creating centroids of a plurality of objects. The apparatus comprises a memory device storing processor executable instructions causing a processor to obtain an affinity threshold, acquire, for each object of the plurality of objects, a respective characterizing vector of v variables, v>1, and deduce for each variable a respective cumulative distribution function to produce v cumulative distribution functions.
The instructions further cause the processor to initialize a centroid set as an empty set, generate a succession of descriptor vectors each comprising v variables, and determine, for each descriptor vector of the succession of descriptor vectors, an affinity measure to each centroid of the centroid set as a function of the each descriptor vector and a descriptor vector of each centroid. A descriptor vector is assigned to the centroid set as a centroid subject to ascertaining that the affinity measure to each centroid is less than the affinity threshold. Thus, the apparatus creates a set of uniformly spaced centroids for use in automated intelligent-marketing systems.
In accordance with a further aspect, the invention provides an apparatus for creating centroids of a plurality of objects. The apparatus comprises a memory device storing processor executable instructions causing a processor to: obtain from a user a target number of centroids and an affinity threshold; acquire bounds of v variables, v>1, each object of the plurality of objects being characterized by a respective vector of descriptors of the v variables within the bounds; and generate a centroid set comprising a maximum attainable number of centroids selected from the plurality of objects conditional on an affinity measure of each centroid to each other centroid being less than the affinity threshold.
Where the maximum attainable number differs from the target number, the instructions cause the processor to iteratively tune the affinity threshold, and generate a corresponding centroid set until the maximum attainable number equals the target number or a predefined permissible number of iterations is reached.
The maximal centroid set is stored for starting a segmentation process of the plurality of objects.
In accordance with a further aspect, the invention provides an apparatus for creating centroids of a plurality of objects. The apparatus comprises a memory device storing processor executable instructions causing a processor to: obtain from a user a target number of centroids, a radial threshold, and an angular threshold; acquire bounds of v variables, v>1, each object of the plurality of objects being characterized by a respective vector of descriptors of the v variables within the bounds; and generate a maximal centroid set comprising a maximum attainable number of centroids selected from the plurality of objects conditional on a radial affinity level of each centroid to each other centroid being less than the radial threshold; and an angular affinity level of each centroid to each other centroid being less than the angular threshold.
Where the maximum attainable number differs from the target number, the instructions cause the processor to iteratively tune the radial threshold and the angular threshold, and generate a corresponding centroid set until the maximum attainable number equals the target number or a predefined permissible number of iterations is reached.
The maximal centroid set is stored for starting a segmentation process of the plurality of objects.
To generate a maximal centroid set, the instructions cause the processor to: initialize a centroid set as an empty set of zero count of centroids, and for each object: determine a radial affinity level and an angular affinity level to each centroid of the centroid set; and add the each object to the centroid set and update the count of centroids subject to a determination that the radial affinity level to each centroid is less than the radial threshold and the angular affinity level to each centroid is less than the angular threshold.
When all objects are considered, the count of centroids becomes the maximum attainable number of centroids.
Embodiments of the present invention will be further described with reference to the accompanying exemplary drawings, in which:
Processor: The term processor refers to a single hardware processor or an assembly of hardware processors which may be operated concurrently either independently, according to a pipelined arrangement, or according to other multi-processing arrangements.
Radial-affinity level: The radial-affinity level of an object to a centroid (or vice versa) is determined as a function of the Euclidean distance between a descriptor vector characterizing the object and a descriptor vector characterizing the centroid. The radial-affinity level may be normalized so that the affinity level is 1.0 if the Euclidean distance is zero and the affinity level approaches zero as the Euclidean distance increases. Details of computation of a normalized radial-affinity level are provided in Provisional Application 62/558,085, filed on Sep. 12, 2017, entitled “Composite Radial-Angular Clustering of a Large-Scale Social Graph”.
Angular-affinity level: The angular-affinity level of an object to a centroid (or vice versa) is determined as a function of the dot product of a descriptor vector characterizing the object and a descriptor vector characterizing the centroid. Options of computation of a normalized angular-affinity level are provided in the aforementioned Provisional Application.
Composite radial-angular affinity measure: A composite radial-angular affinity measure of an object to a centroid (or vice versa) is a function (such as a weighted sum) of the radial-affinity level and the angular-affinity level defined above.
Radial-affinity threshold: The term refers to a maximum permissible radial-affinity level of an object to a centroid.
Angular-affinity threshold: The term refers to a maximum permissible angular-affinity level of an object to a centroid.
Radial threshold: A specific value of a radial-affinity measure.
Angular threshold: A specific value of an angular-affinity measure.
Maximal centroid set: A set of centroids containing the maximum attainable number of centroids selected from a plurality of objects conditional on an affinity measure of each centroid to each other centroid being less than the affinity threshold.
Mutually repulsing centroids: With each centroid represented as a multi-dimensional descriptor vector, a centroid set is said to comprise mutually repulsing centroids if the radial-affinity level of each centroid to each other centroid is less than a predefined radial-affinity threshold and/or if the angular-affinity level of each centroid to each other centroid is less than a predefined angular-affinity threshold. The centroids of the centroid set are also considered to be mutually repulsing if the composite radial-angular affinity measure of each centroid pair is less than a predefined composite threshold.
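The definition of mutually repulsing centroids amounts to a test over all centroid pairs, which may be sketched as follows. The function names are hypothetical, and the example affinity 1/(1+d) is an assumed radial form; any radial, angular, or composite affinity function may be supplied in its place.

```python
import math
from itertools import combinations


def example_affinity(a, b):
    # Assumed radial form used only to exercise the check.
    return 1.0 / (1.0 + math.dist(a, b))


def mutually_repulsing(centroids, affinity, threshold):
    """True if every pair of centroids has an affinity below the threshold."""
    return all(affinity(a, b) < threshold for a, b in combinations(centroids, 2))
```

A set containing a single centroid satisfies the test trivially, since it has no pairs.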
μp={Γ(p,0)+Γ(p,1)+ . . . +Γ(p, N−1)}/N.
Preferably, the values of the descriptors are normalized; hereinafter, all descriptors are considered to be normalized.
In accordance with a first-mode normalization criterion, the variables (descriptor values) are normalized so that the mean value of each descriptor is 1.0. Thus, the normalized value 306 of a descriptor 302 of index p of an object of index q, denoted γ(p,q), is determined as:
γ(p, q)=Γ(p,q)/μp, 1≤p≤v, 0≤q<N.
The standard deviation 312 of the normalized values of a descriptor 302(p) is denoted σp, 1≤p≤v.
In accordance with a second-mode normalization criterion, the variables (descriptor values) are normalized so that the minimum value of each descriptor is zero and the maximum value is 1.0. Thus, the normalized value 306 of a descriptor 302 of index p of an object of index q is determined as: γ(p, q)=(Γ(p,q)−ap)/(bp−ap), 1≤p≤v, 0≤q<N, where ap and bp are the lower bound and upper bound, respectively, of a descriptor of index p.
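The first-mode and second-mode normalization criteria may be sketched as follows, with raw[q][p] standing for Γ(p,q). The function names are hypothetical, and in the second mode the bounds ap and bp are supplied by the caller, since the text does not state how they are obtained.

```python
def normalize_first_mode(raw):
    """Divide each descriptor by its mean so that the normalized mean is 1.0."""
    n = len(raw)
    means = [sum(column) / n for column in zip(*raw)]      # the values of mu_p
    return [tuple(x / m for x, m in zip(vector, means)) for vector in raw]


def normalize_second_mode(raw, lower, upper):
    """Map each descriptor from [a_p, b_p] onto [0.0, 1.0]."""
    return [
        tuple((x - a) / (b - a) for x, a, b in zip(vector, lower, upper))
        for vector in raw
    ]
```

For two objects with raw descriptors (2, 10) and (4, 30), the first mode divides by the means 3 and 20, and the second mode with bounds (2, 10) and (4, 30) maps the objects to (0, 0) and (1, 1).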
A set of descriptor values 740 corresponding to a predefined number W, W>>1, of equidistant samples 722 of each cumulative distribution function 720 is determined and stored in arrays 750. Each array 750 corresponds to a variable (a descriptor type) and stores descriptor values ranging from Xp(0) to Xp(W−1), 1≤p≤v. As illustrated, descriptor values d1, d2, d3, and d4 corresponding to a selected cumulative-distribution index H are stored in respective arrays 750. A descriptor vector of v descriptors is generated by randomly selecting one descriptor value from each of the v arrays 750.
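The arrays of descriptor values at equidistant cumulative-probability samples, and the random selection of one value from each array, may be sketched as follows. Deriving the values from an empirical distribution with linear interpolation between ordered observations is an assumption, as are the function names; W must be at least 2.

```python
import random


def quantile_table(values, w):
    """Descriptor values at w equidistant cumulative-probability levels."""
    ordered = sorted(values)
    n = len(ordered)
    table = []
    for i in range(w):
        position = i * (n - 1) / (w - 1)   # fractional rank for level i/(w-1)
        low = int(position)
        high = min(low + 1, n - 1)
        fraction = position - low
        table.append(ordered[low] + fraction * (ordered[high] - ordered[low]))
    return table


def random_descriptor_vector(tables, rng):
    """Pick one value at random from each of the v arrays."""
    return tuple(rng.choice(table) for table in tables)
```

Because each array is sampled independently, the generated vectors follow the individual distribution of each variable but do not preserve correlations between variables.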
A set of descriptor values 840 corresponding to a predefined number W, W>>1, of equidistant samples 722 of each cumulative distribution function 820 is determined and stored in arrays 850. Each array 850 corresponds to a variable (a descriptor type) and stores descriptor values ranging from Xp(0) to Xp(W−1), 1≤p≤v. As illustrated, descriptor values d1, d2, d3, and d4 corresponding to a selected cumulative-distribution index H are stored in respective arrays 850. A descriptor vector of v descriptors is generated by randomly selecting one descriptor value from each of the v arrays 850.
A set of descriptor values 940 corresponding to a predefined number W, W>>1, of equidistant samples 722 of each complementary function 920 is determined and stored in arrays 950. Each array 950 corresponds to a variable (a descriptor type) and stores descriptor values ranging from Up(0) to Up(W−1), 1≤p≤v. As illustrated, descriptor values d1, d2, d3, and d4 corresponding to a selected cumulative-distribution index G are stored in respective arrays 950. A descriptor vector of v descriptors is generated by randomly selecting one descriptor value from each of the v arrays 950.
The centroids may be generated based on the individual descriptor vectors of the tracked object as illustrated in
Each variable of the v variables may be normalized according to the first-mode normalization criterion as illustrated in
The centroids may be determined according to a single affinity threshold (radial or angular) as illustrated in
In an initialization process 1602, a centroid set is initialized as an empty set with a zero centroid count. An object from a plurality of objects is selected as a centroid. Each object of the plurality of objects is characterized by a respective descriptor vector.
In a process 1610, the selected object is added to the centroid set and the centroid count is increased. Process 1620 determines whether a predefined upper bound K* of the number of centroids has been reached. If so, process 1670 communicates the centroid set to a subsequent process. Otherwise, process 1622 determines whether all tracked objects have been examined for consideration as potential centroids. If all tracked objects have been examined, process 1670 communicates the centroid set to the subsequent process. Otherwise, process 1630 examines another object from the plurality of tracked objects and process 1640 determines the object's affinity to each selected centroid. If the object's affinity to any centroid equals or exceeds a predefined affinity threshold, the object is disqualified; otherwise, the examined object qualifies as a new centroid. Process 1650 logically removes the examined object, whether selected as a centroid or not, from the plurality of objects. Process 1650 inherently takes place if the objects of the plurality of objects are examined sequentially. Process 1660 proceeds to process 1610 to add the examined object to the centroid set and increase the centroid count if the examined object is qualified. Otherwise, process 1660 proceeds to process 1630 to select a new object for examination. Process 1620 terminates the buildup of the centroid set if the number of centroids reaches the predefined upper bound K* and process 1622 terminates the expansion of the centroid set when all objects have been examined.
In an initialization process 1702, a centroid set is initialized as an empty set with a zero candidate count and a zero centroid count. A descriptor vector is generated from a deduced distribution and selected as a centroid.
In a process 1710, the descriptor vector is added to the centroid set and the centroid count is increased. Process 1720 determines whether a predefined upper bound K* of the number of centroids has been reached. If so, process 1770 communicates the centroid set to a subsequent process. Otherwise, process 1722 determines whether a sufficient number N* of candidate centroids has been generated. If a sufficient number of candidate centroids has been generated and examined, process 1770 communicates the centroid set to the subsequent process. Otherwise, process 1730 generates another candidate centroid from the deduced probability functions and increases the candidate count.
Process 1740 determines the candidate's affinity to each selected centroid. If the candidate's affinity to any centroid equals or exceeds a predefined affinity threshold, the candidate is disqualified; otherwise, the examined candidate qualifies as a new centroid. Process 1760 proceeds to process 1710 to add the examined candidate to the centroid set and increase the centroid count if the examined candidate is qualified. Otherwise, process 1760 leads to process 1730 to generate a new centroid candidate (a new descriptor vector) for examination. Process 1720 terminates the expansion of the centroid set if the number of centroids reaches the predefined upper bound K* and process 1722 terminates the expansion of the centroid set when a user-defined sufficient number N* of candidates (descriptor vectors) have been examined.
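The election of generated descriptor vectors as centroids, bounded by K* centroids and N* candidates, may be sketched as follows. The candidate generator is passed in as a callable; the uniform generator in the usage note and the affinity form 1/(1+d) are stand-ins assumed for illustration and do not represent the deduced distributions. The function names are hypothetical.

```python
import math
import random


def radial_affinity(a, b):
    return 1.0 / (1.0 + math.dist(a, b))    # assumed form


def elect_centroids(generate, affinity, threshold, max_centroids, max_candidates):
    """Generate candidates until K* centroids are elected or N* are examined."""
    centroids = []
    candidate_count = 0
    while len(centroids) < max_centroids and candidate_count < max_candidates:
        candidate = generate()              # a new descriptor vector
        candidate_count += 1
        # Affinity to an empty centroid set is treated as zero, so the first
        # candidate is always elected.
        if all(affinity(candidate, c) < threshold for c in centroids):
            centroids.append(candidate)
    return centroids
```

A usage example under these assumptions: with rng = random.Random(1) and generate = lambda: (rng.random(), rng.random()), the call elect_centroids(generate, radial_affinity, 0.8, 4, 1000) elects up to four vectors in the unit square that are pairwise more than 0.25 apart.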
Thus, the invention provides a method of generating centroids of a plurality of objects. The method comprises specifying an affinity threshold and employing a processor to execute instructions for: acquiring a descriptor vector of v variables, v>1, for each object of the plurality of objects; initializing a centroid set to include an object of the plurality of objects; and performing for each object of the plurality of objects a procedure for deciding whether the object qualifies as a centroid. The procedure comprises determining an affinity measure to each centroid of the centroid set based on a descriptor vector of the each object and a descriptor vector of the each centroid and selecting the each object as a centroid to be added to the centroid set subject to ascertaining that the affinity measure to the each centroid is less than the affinity threshold. Thereby, the method creates a set of uniformly spaced centroids for use in automated intelligent-marketing systems.
The process of acquiring a descriptor vector comprises normalizing the v variables so that a value of each variable is within a predefined range.
In one implementation, normalizing the v variables comprises scaling the variables so that a mean value of each variable equals 1.0. In another implementation, normalizing the v variables comprises shifting and scaling the variables so that a minimum value and a maximum value of each variable equal 0.0 and 1.0 respectively. In a further implementation, normalizing the v variables comprises shifting and scaling the variables so that a minimum value of each variable equals 0.0 and a maximum value of each variable equals a respective variable-specific positive upper bound not exceeding 1.0.
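The three normalization options above may be illustrated per variable as follows. The function names are hypothetical, and the handling of degenerate inputs (zero mean, constant variable) is an assumption not addressed in the description.

```python
from typing import List, Sequence


def scale_to_unit_mean(values: Sequence[float]) -> List[float]:
    """Scale one variable so that its mean value equals 1.0."""
    mean = sum(values) / len(values)
    if mean == 0:
        raise ValueError("a variable with zero mean cannot be scaled to unit mean")
    return [x / mean for x in values]


def shift_and_scale(values: Sequence[float], upper: float = 1.0) -> List[float]:
    """Shift and scale one variable so that its minimum is 0.0 and its maximum is `upper`.

    With upper=1.0 this is the second option; with a variable-specific
    upper bound not exceeding 1.0 it is the third option.
    """
    if not 0.0 < upper <= 1.0:
        raise ValueError("upper bound must be positive and not exceed 1.0")
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("a constant variable cannot be rescaled")
    return [upper * (x - lo) / (hi - lo) for x in values]


unit_mean = scale_to_unit_mean([1.0, 2.0, 3.0])
unit_range = shift_and_scale([2.0, 4.0, 6.0])
half_range = shift_and_scale([2.0, 4.0, 6.0], upper=0.5)
```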
Performing the procedure for determining whether the object qualifies as a centroid is terminated subject to ascertaining that the set of centroids contains a number of centroids equal to a predefined upper bound.
The method further comprises generating non-repeating randomly sequenced indices of objects of the plurality of objects; and selecting objects of the plurality of objects at indices corresponding to the randomly sequenced indices.
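Non-repeating randomly sequenced indices amount to a random permutation of the object indices. A minimal sketch, in which the function name and the optional seed are illustrative assumptions:

```python
import random
from typing import List, Optional


def random_index_sequence(n: int, seed: Optional[int] = None) -> List[int]:
    """Return the indices 0..n-1 in random order, each appearing exactly once."""
    rng = random.Random(seed)
    return rng.sample(range(n), n)


order = random_index_sequence(10, seed=1)
# Objects would then be examined as objects[order[0]], objects[order[1]], and so on.
```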
The process of determining an affinity measure comprises computing a radial affinity level and an angular-affinity level between each object and each centroid, and computing the affinity measure as a function of the radial-affinity level and the angular-affinity level. The function may be selected as a weighted sum of the radial-affinity level and the angular-affinity level.
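A weighted-sum composite may be sketched as follows. The definitions of the two levels used here are assumptions made for illustration only: the radial level is taken as 1/(1+Euclidean distance) and the angular level as the cosine of the angle between the two vectors. The weight `w` and all function names are hypothetical.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def radial_affinity(a: Vector, b: Vector) -> float:
    """Assumed radial level: 1.0 for coincident vectors, decreasing with distance."""
    return 1.0 / (1.0 + math.dist(a, b))


def angular_affinity(a: Vector, b: Vector) -> float:
    """Assumed angular level: cosine of the angle between the vectors (0.0 for a zero vector)."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def composite_affinity(a: Vector, b: Vector, w: float = 0.5) -> float:
    """Weighted sum of the radial and angular levels, with weight w on the radial part."""
    return w * radial_affinity(a, b) + (1.0 - w) * angular_affinity(a, b)


same = composite_affinity((1.0, 0.0), (1.0, 0.0))
orthogonal = composite_affinity((1.0, 0.0), (0.0, 1.0))
```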
In one embodiment, the process of ascertaining that the affinity measure to each centroid is less than the affinity threshold comprises verifying that: the radial-affinity level is less than the radial-affinity threshold; and the angular-affinity level is less than the angular-affinity threshold.
In an initialization process 1602, a centroid set is initialized as an empty set with a zero centroid count. An object from a plurality of objects is selected as a centroid. Each object of the plurality of objects is characterized by a respective descriptor vector.
In a process 1610, the selected object is added to the set of centroids and the centroid count is increased. Process 1620 determines whether a predefined upper bound K* of the number of centroids has been reached. If so, process 1670 communicates the centroid set to a subsequent process. Otherwise, process 1622 determines whether all tracked objects have been examined for consideration as potential centroids. If all tracked objects have been examined, process 1670 communicates the centroid set to the subsequent process. Otherwise, process 1630 examines another object from the plurality of tracked objects and process 1840 determines the object's radial affinity to each selected centroid.
Process 1850 logically removes the examined object, whether selected as a centroid or not, from the plurality of objects. Process 1850 inherently takes place if the objects of the plurality of objects are examined sequentially.
If the object's radial affinity to any centroid equals or exceeds a predefined radial-affinity threshold, the object is disqualified and process 1860 proceeds to process 1630 to select another object. Otherwise, process 1860 proceeds to process 1845 to determine the object's angular affinity to the centroid set. If the angular affinity to any centroid equals or exceeds a predefined angular-affinity threshold, process 1865 proceeds to process 1630 to select another object. Otherwise, process 1865 proceeds to process 1610 to add the examined object to the centroid set and increase the centroid count. Process 1620 terminates the expansion of the centroid set if the number of centroids reaches the predefined upper bound K* and process 1622 terminates the expansion of the centroid set when all objects have been examined.
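The two-stage test of processes 1840 to 1865 may be sketched as a predicate that evaluates the angular affinity only when the radial test has passed. The affinity definitions below are assumptions (the same hypothetical radial and angular levels as in any earlier sketch would serve), and `qualifies` is a hypothetical name.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def radial_affinity(a: Vector, b: Vector) -> float:
    """Assumed radial level: higher for nearer vectors."""
    return 1.0 / (1.0 + math.dist(a, b))


def angular_affinity(a: Vector, b: Vector) -> float:
    """Assumed angular level: cosine of the angle between the vectors."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def qualifies(
    obj: Vector,
    centroids: Sequence[Vector],
    radial_threshold: float,
    angular_threshold: float,
) -> bool:
    """Return True only if both affinity levels to every centroid are below their thresholds."""
    # Processes 1840 and 1860: radial test against every selected centroid.
    if any(radial_affinity(obj, c) >= radial_threshold for c in centroids):
        return False
    # Processes 1845 and 1865: angular test, reached only when the radial test passes.
    if any(angular_affinity(obj, c) >= angular_threshold for c in centroids):
        return False
    return True


current = [(1.0, 0.0)]
far_and_divergent = qualifies((0.0, 5.0), current, 0.5, 0.9)
too_close = qualifies((1.1, 0.0), current, 0.5, 0.9)
same_direction = qualifies((5.0, 0.0), current, 0.5, 0.9)
```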
In an initialization process 1702, a centroid set is initialized as an empty set with a zero candidate count and a zero centroid count. A descriptor vector is generated from a deduced distribution and selected as a centroid.
In a process 1710, the descriptor vector is added to the set of centroids and the centroid count is increased. Process 1720 determines whether a predefined upper bound K* of the number of centroids has been reached. If so, process 1770 communicates the centroid set to a subsequent process. Otherwise, process 1722 determines whether a sufficient number N* of candidate centroids have been generated. If a sufficient number of candidate centroids has been generated and examined, process 1770 communicates the centroid set to the subsequent process. Otherwise, process 1730 generates another candidate centroid from deduced probability functions and increases the candidate count.
Process 1940 determines the candidate's radial affinity to each selected centroid. If the candidate's radial affinity to any centroid equals or exceeds a predefined radial-affinity threshold, the candidate is disqualified and process 1960 leads to process 1730 to generate a new centroid candidate (a new descriptor vector) for examination. Otherwise, process 1960 proceeds to process 1945 to determine the candidate's angular affinity to the centroid set. If the angular affinity to any centroid equals or exceeds a predefined angular-affinity threshold, process 1965 proceeds to process 1730 to generate a new centroid candidate (a new descriptor vector) for examination. Otherwise, process 1965 proceeds to process 1710 to add the examined descriptor vector to the centroid set and increase the centroid count. Process 1720 terminates the expansion of the centroid set if the number of centroids reaches the predefined upper bound K*, and process 1722 terminates the expansion of the centroid set when a sufficient number N* of candidates (descriptor vectors) have been examined.
Thus, the invention provides a method (
The procedure comprises processes of determining an affinity measure to each centroid of the centroid set based on the each descriptor vector and a descriptor vector of each centroid, and assigning the each descriptor vector to the centroid set as a centroid subject to ascertaining that the affinity measure to the each centroid is less than the affinity threshold.
Thus, the method creates a set of uniformly spaced centroids for use in automated intelligent-marketing systems.
The process of generating a succession of descriptor vectors comprises randomly indexing an inverse of a cumulative distribution function of each variable of the v variables to determine v variable values forming a descriptor vector of the succession of descriptor vectors.
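Randomly indexing an inverse cumulative distribution function may be sketched as follows, assuming each variable's cumulative distribution is tabulated as ascending (value, cumulative probability) points and interpolated linearly. The table layout and the names `inverse_cdf` and `draw_descriptor_vector` are assumptions made for illustration.

```python
import bisect
import random
from typing import List, Sequence, Tuple

# One table per variable: ascending values and matching cumulative probabilities from 0.0 to 1.0.
CdfTable = Tuple[Sequence[float], Sequence[float]]


def inverse_cdf(values: Sequence[float], cum_probs: Sequence[float], u: float) -> float:
    """Return the variable value whose cumulative probability is u (piecewise-linear inverse)."""
    i = bisect.bisect_left(cum_probs, u)  # first index with cum_probs[i] >= u
    if i == 0:
        return values[0]
    if i >= len(cum_probs):
        return values[-1]
    p0, p1 = cum_probs[i - 1], cum_probs[i]  # p0 < u <= p1, so p1 - p0 > 0
    x0, x1 = values[i - 1], values[i]
    return x0 + (x1 - x0) * (u - p0) / (p1 - p0)


def draw_descriptor_vector(tables: Sequence[CdfTable], rng: random.Random) -> List[float]:
    """Draw one random cumulative value per variable and map each through its inverse CDF."""
    return [inverse_cdf(vals, probs, rng.random()) for vals, probs in tables]


table = ([0.0, 10.0, 20.0], [0.0, 0.5, 1.0])
vector = draw_descriptor_vector([table, table], random.Random(3))
```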
In one implementation, the process of acquiring the respective characterizing vector of v variables comprises normalizing each of the v variables to be within a predefined range.
In another implementation, the process of acquiring the respective characterizing vector of v variables comprises assigning for each variable a respective variable-specific weight greater than 0.0 and not exceeding 1.0, then shifting and scaling each of the variables so that: a minimum value of each variable equals 0.0; and a maximum value of each variable equals a corresponding variable-specific weight.
The affinity measure to the empty centroid set is assigned a value of zero.
The method terminates performing the procedure for descriptor vector election as a centroid vector upon determining that a count of centroids of the set of centroids equals a predefined upper bound.
The process of determining an affinity measure comprises computing a radial affinity level and an angular-affinity level between each descriptor vector and each centroid, and computing the affinity measure as a function of the radial-affinity level and the angular-affinity level. The function may be formed as a weighted sum of the radial-affinity level and the angular-affinity level.
In one implementation, the process of specifying an affinity threshold comprises itemizing the affinity threshold as a radial-affinity threshold and an angular-affinity threshold. Accordingly, the process of determining an affinity measure comprises computing a radial affinity level and an angular-affinity level between the each descriptor vector and each centroid. Subsequently, ascertaining that the affinity measure to each centroid is less than the affinity threshold comprises verifying that the radial-affinity level is less than the radial-affinity threshold and the angular-affinity level is less than the angular-affinity threshold.
Process 2024 counts the bisection cycles. Process 2030 determines a maximum attainable number L of centroids corresponding to a given inter-centroid single affinity constraint using the method of
In a first bisection cycle, process 2020 determines Δ* as 0.5 and process 2030 determines that the number of attainable centroids is four (L=4). Since L<K, process 2050 increases Δmin from 0.0 to Δ*, which is currently 0.5.
In a second bisection cycle, process 2020 determines Δ* as 0.75 and process 2030 determines that the number of attainable centroids is nine (L=9). Since L<K, process 2050 increases Δmin from 0.5 to Δ*, which is currently 0.75.
In a third bisection cycle, process 2020 determines Δ* as (0.75+1.0)/2, which is 0.875, and process 2030 determines that the number of attainable centroids is seventeen (L=17). Since L>K, process 2060 decreases Δmax from 1.0 to Δ*, which is currently 0.875.
In a fourth bisection cycle, process 2020 determines Δ* as (0.75+0.875)/2, which is 0.8125, and process 2030 determines that the number of attainable centroids is fourteen (L=14).
Since L>K, process 2060 decreases Δmax from 0.875 to Δ*, which is currently 0.8125.
In a fifth bisection cycle, process 2020 determines Δ* as (0.75+0.8125)/2, which is 0.78125, and process 2030 determines that the number of attainable centroids is eleven (L=11). Since L<K, process 2050 increases Δmin from 0.75 to Δ*, which is currently 0.78125.
In a sixth bisection cycle, process 2020 determines Δ* as (0.78125+0.8125)/2, which is 0.796875, and process 2030 determines that the number of attainable centroids is twelve (L=12). Since L=K, processes 2040 and 2060 lead to process 2080 and the latest centroid set determined in process 2030 is used for starting segmentation of the plurality of objects into K clusters.
It is possible that equality of the number L of attainable centroids to the target number K of centroids is never reached: as the bisection cycles continue, the number L may oscillate indefinitely between a number L1 that is less than K and a number L2 that is higher than K. For this reason, process 2022 limits the number of bisection cycles to a predefined value β. After β bisection cycles, the search interval {Δmax−Δmin} is reduced to 2^(−β) of the range of affinity levels. For β=20, for example, the search interval is reduced to less than one millionth of the range of affinity levels and the centroid set of L1 centroids or the centroid set of L2 centroids may be selected. For example, with a target of 100 centroids, the number of attainable centroids (process 2030) may oscillate between 98 and 101, in which case the latter may be preferred.
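The bisection cycles described above may be replayed with the following sketch. The function `tune_threshold` is a hypothetical name, and the lookup table `worked_example` simply reproduces the attainable counts quoted in the six cycles for a target of K=12; in practice the count would come from running the centroid-selection procedure at each candidate threshold.

```python
from typing import Callable, Tuple


def tune_threshold(
    count_centroids: Callable[[float], int],
    k: int,
    beta: int,
    lo: float = 0.0,
    hi: float = 1.0,
) -> Tuple[float, int, int]:
    """Bisect the affinity threshold until the attainable count equals k or beta cycles elapse.

    Returns (threshold, attainable count, cycles used).
    """
    if beta < 1:
        raise ValueError("beta must be at least 1")
    for cycle in range(1, beta + 1):  # process 2022 caps the number of cycles
        mid = (lo + hi) / 2.0  # process 2020
        attained = count_centroids(mid)  # process 2030
        if attained == k:
            return mid, attained, cycle
        if attained < k:
            lo = mid  # process 2050: raise the lower bound
        else:
            hi = mid  # process 2060: lower the upper bound
    return mid, attained, beta  # no exact match: report the last evaluated threshold


# Attainable counts quoted in the worked example above.
worked_example = {0.5: 4, 0.75: 9, 0.875: 17, 0.8125: 14, 0.78125: 11, 0.796875: 12}
result = tune_threshold(worked_example.__getitem__, k=12, beta=20)
```

Returning the last evaluated threshold when the cap is reached is one possible policy; selecting whichever of L1 or L2 is preferred would require keeping both candidate sets.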
Thus, the invention provides a method (FIG. 20 and
Tuning the affinity threshold comprises increasing the affinity threshold subject to a determination that the maximum attainable number is less than the target number, or decreasing the affinity threshold subject to a determination that the maximum attainable number exceeds the target number.
Generating a centroid set comprises initializing the centroid set as an empty set of zero count of centroids and performing for each object processes of: determining an affinity measure to each centroid of the centroid set; and adding the each object to the centroid set, updating the count of centroids, subject to ascertaining that the affinity measure to each centroid is less than the affinity threshold. When all objects are considered, the count of centroids becomes the maximum attainable number of centroids. In one implementation, the affinity measure is determined as a composite radial-angular affinity measure formulated as a function of a radial-affinity level and an angular affinity level and the affinity threshold is determined as a specific value of the composite radial-angular affinity measure.
Alternatively, generating the centroid set comprises initializing the centroid set as an empty set of zero count of centroids and performing for each object processes of: determining a radial affinity level and an angular affinity level to each centroid of the centroid set; and adding the each object to the centroid set, updating the count of centroids, subject to ascertaining that the radial affinity level to the each centroid is less than a predefined radial threshold and the angular affinity level to the each centroid is less than the angular threshold. When all objects are considered, the count of centroids becomes the maximum attainable number of centroids.
Process 2210 initializes a lower bound Δmin and an upper bound Δmax of inter-centroid radial-affinity thresholds, a lower bound Ωmin and an upper bound Ωmax of inter-centroid angular-affinity thresholds, and sets a bisection counter to zero. Process 2220 starts a sequence of bisection cycles by determining a candidate value Δ* of the inter-centroid radial-affinity constraint as the mid value between the lower bound Δmin and the upper bound Δmax.
Process 2230 determines a number L of attainable centroids corresponding to current values of the inter-centroid radial-affinity constraint Δ* and angular-affinity constraint Ω* using the method of
Process 2225 determines a candidate value Ω* of the inter-centroid angular-affinity constraint as the mid value between the lower bound Ωmin and the upper bound Ωmax. Process 2222 limits the number of iterative bisection-search cycles to a value β, β>1, so that the relative smallest search interval ε=2^(−β) is negligibly small; for example, setting β=16 yields ε≈0.0000153. Process 2224 counts the bisection cycles.
Process 2235 determines a number L of attainable centroids corresponding to current values of the inter-centroid radial-affinity constraint Δ* and angular-affinity constraint Ω* using the method of
With the inter-centroid radial affinity or angular affinity normalized to vary between 0.0 and 1.0, process 2210 initializes the lower bound Δmin to equal 0.0, the upper bound Δmax to equal 1.0, the lower bound Ωmin to equal 0.0 and the upper bound Ωmax to equal 1.0. The initial angular-affinity threshold Ω* is set to equal 0.5 and a bisection counter is initialized to equal 0.
In a first bisection cycle, process 2220 determines Δ* as 0.5 and process 2230 determines that the number of attainable centroids is three (L=3) based on the current thresholds Δ* of 0.5 (determined in process 2220) and Ω* of 0.5 (initialized in process 2210). Since L is less than K, process 2250 increases Ωmin from 0.0 to Ω*, which is currently 0.5, and proceeds to process 2225. Process 2225 determines a new value of Ω* as (Ωmin+Ωmax)/2, which is 0.75. Process 2235 determines the number of attainable centroids to be seven (L=7). Since L is less than K, process 2245 proceeds to process 2255, which increases Δmin from the current value of 0.0 to the current value of Δ*, which is 0.5.
In a second bisection cycle, process 2220 determines Δ* as (Δmin+Δmax)/2, which is (0.5+1.0)/2, or 0.75, and process 2230 determines that the number of attainable centroids, with Δ*=0.75 and Ω*=0.75, is nine (L=9). Since L is less than K, process 2250 increases Ωmin from 0.5 to Ω*, which is currently 0.75.
Process 2225 determines Ω* as (0.75+1.0)/2, which is 0.875, and process 2235 determines that the number of attainable centroids, with Δ*=0.75 and Ω*=0.875, is eleven (L=11). Since L is less than K, process 2255 increases Δmin from 0.5 to Δ*, which is currently 0.75.
In a third bisection cycle, process 2220 determines Δ* as (0.75+1.0)/2, which is 0.875, and process 2230 determines that the number of attainable centroids, with Δ*=0.875 and Ω*=0.875, is fifteen (L=15). Since L is greater than K, process 2270 decreases Ωmax from 1.0 to Ω*, which is currently 0.875.
Process 2225 determines Ω* as (0.75+0.875)/2, which is 0.8125, and process 2235 determines that the number of attainable centroids is twelve (L=12). Since L=K, processes 2245 and 2265 lead to process 2280 and the latest centroid set determined in process 2235 is used for starting segmentation of the plurality of objects into K clusters.
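The alternating radial and angular bisection may be replayed with the sketch below. The name `tune_two_thresholds` is hypothetical, and the lookup table reproduces the counts quoted in the three cycles above for K=12. The branch that lowers Δmax after the second evaluation is an assumption made by symmetry with process 2270, since the worked example never exercises it.

```python
from typing import Callable, Tuple


def tune_two_thresholds(
    count_centroids: Callable[[float, float], int],
    k: int,
    beta: int,
) -> Tuple[float, float, int]:
    """Alternately bisect the radial threshold (delta) and the angular threshold (omega)."""
    if beta < 1:
        raise ValueError("beta must be at least 1")
    d_lo, d_hi = 0.0, 1.0  # bounds of the radial threshold (process 2210)
    w_lo, w_hi = 0.0, 1.0  # bounds of the angular threshold (process 2210)
    omega = 0.5  # initial angular threshold
    for _ in range(beta):  # process 2222 caps the number of cycles
        delta = (d_lo + d_hi) / 2.0  # process 2220
        attained = count_centroids(delta, omega)  # process 2230
        if attained == k:
            return delta, omega, attained
        if attained < k:
            w_lo = omega  # process 2250
        else:
            w_hi = omega  # process 2270
        omega = (w_lo + w_hi) / 2.0  # process 2225
        attained = count_centroids(delta, omega)  # process 2235
        if attained == k:
            return delta, omega, attained
        if attained < k:
            d_lo = delta  # process 2255
        else:
            d_hi = delta  # assumed by symmetry; not exercised in the worked example
    return delta, omega, attained


# Attainable counts quoted in the worked example, keyed by (delta, omega).
quoted_counts = {
    (0.5, 0.5): 3, (0.5, 0.75): 7,
    (0.75, 0.75): 9, (0.75, 0.875): 11,
    (0.875, 0.875): 15, (0.875, 0.8125): 12,
}
outcome = tune_two_thresholds(lambda d, w: quoted_counts[(d, w)], k=12, beta=16)
```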
Thus, the invention provides a method (
Tuning the radial threshold and the angular threshold comprises increasing at least one of the radial and the angular thresholds subject to a determination that the maximum attainable number is less than the target number, or decreasing at least one of the radial and the angular thresholds subject to a determination that the maximum attainable number exceeds the target number.
Generating the centroid set comprises initializing a centroid set as an empty set of zero count of centroids and performing for each object processes of: determining a radial affinity level and an angular affinity level to each centroid of the centroid set; and adding the each object to the centroid set and updating the count of centroids subject to ascertaining that the radial affinity level to each centroid is less than the radial threshold and the angular affinity level to each centroid is less than the angular threshold. When all objects are considered, the count of centroids becomes the maximum attainable number of centroids.
The method further comprises determining the radial threshold as a mean value of a radial lower bound and a radial upper bound, and determining the angular threshold as a mean value of an angular lower bound and an angular upper bound.
Processes 2620 generate the centroids. Process 2622 generates v random numbers, each bounded between 0.0 and 1.0, inclusive, each generated random number representing a cumulative distribution value. Process 2624 determines values of v variables (representing a new centroid) corresponding to the v random cumulative distribution values as illustrated in
A first representation 2710 corresponds to raw descriptor vectors A, B, and C, based on raw values of the two variables, having values of {8.0, 0.0}, {2.5, 5.0}, and {6.0, 8.0}, respectively. Descriptor vector “A” may represent a centroid while descriptor vectors “B” and “C” may represent object-B and object-C, respectively. The (unnormalized) radial-affinity levels 2712 of object-B and object-C with respect to the centroid, based on descriptor vectors “B” and “C”, are 7.43 and 8.25, respectively. The corresponding angular-affinity levels 2714 of object-B and object-C with respect to the centroid are 0.600 and 0.447.
A second representation 2720 corresponds to weighted descriptor vectors A*, B*, and C*, where a weight of 0.5 is applied to the second variable of each descriptor. Thus, A*, B*, and C* have values of {8.0, 0.0}, {2.5, 2.5}, and {6.0, 4.0}, respectively. The (unnormalized) radial-affinity levels 2722 of object-B and object-C with respect to the centroid, based on descriptor vectors “B*” and “C*”, are 6.04 and 4.47, respectively. The corresponding angular-affinity levels 2724 of object-B and object-C with respect to the centroid are 0.832 and 0.707.
Generally, applying a weight of a value less than 1.0 to a variable lessens the contribution of the variable to the overall process of centroid selection. Thus, variable-specific weights may be applied according to perceived importance of each of the v variables.
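The effect of the weight on the second representation may be checked with a short sketch. It assumes that the unnormalized radial-affinity level is the Euclidean distance between descriptor vectors, which agrees with the quoted values for the weighted vectors; `apply_weights` is a hypothetical helper name.

```python
import math
from typing import List, Sequence


def apply_weights(vector: Sequence[float], weights: Sequence[float]) -> List[float]:
    """Multiply each variable of a descriptor vector by its variable-specific weight."""
    return [x * w for x, w in zip(vector, weights)]


weights = (1.0, 0.5)  # weight of 0.5 applied to the second variable only
a_star = apply_weights((8.0, 0.0), weights)  # centroid
b_star = apply_weights((2.5, 5.0), weights)  # object-B
c_star = apply_weights((6.0, 8.0), weights)  # object-C

# Assumed radial level: Euclidean distance to the centroid, rounded to two decimals.
radial_b = round(math.dist(b_star, a_star), 2)
radial_c = round(math.dist(c_star, a_star), 2)
```

Halving the second variable brings object-C closer to the centroid than object-B, reversing their order relative to the unweighted representation.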
To generate one descriptor vector 3030, a set 3032 of four random numbers r1, r2, r3, and r4 is generated, each representing a respective value of a cumulative probability of one of the variables (hence bounded between 0.0 and 1.0). Corresponding values v1, v2, v3, and v4 of the four variables are then determined to form a descriptor vector {v1, v2, v3, v4}.
To generate another descriptor vector 3040, a set 3042 of four random numbers r5, r6, r7, and r8 is generated, each representing a respective value of a cumulative probability of one of the variables. Corresponding values u1, u2, u3, and u4 of the four variables are then determined to form another descriptor vector {u1, u2, u3, u4}.
Thus, the invention provides yet another method (
The procedure for generating a centroid comprises processes of generating v random cumulative-probability values and for each variable, accessing a respective software module providing a deduced value of the variable corresponding to a respective one of the random cumulative-probability values, the deduced value being an element of a vector representing a new centroid of the set of centroids, the respective software module being configured to evaluate a respective probability distribution function tailored to the respective moments.
The process of obtaining, for each object of the plurality of objects, a respective characterizing vector of v variables further comprises processes of: assigning v weights to the v variables, each weight being variable specific and bounded to positive values not exceeding 1.0; and normalizing each of the v variables so that: a minimum value of each variable equals 0.0; and a maximum value of each variable equals a corresponding variable-specific weight.
The method further comprises selecting the respective probability distribution function as one of: a Gamma distribution; a Weibull distribution; and a piecewise linear distribution. The respective moments comprise at least a first moment and a second moment. The type of the respective probability distribution function may be user defined.
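As one illustration of tailoring a distribution to the first two moments, the parameters of a Gamma distribution can be obtained by moment matching. The sketch assumes the moments are supplied as raw moments E[X] and E[X²]; the function name is hypothetical, and evaluation of the inverse cumulative distribution function is left out.

```python
from typing import Tuple


def gamma_parameters_from_moments(first_moment: float, second_moment: float) -> Tuple[float, float]:
    """Return (shape, scale) of the Gamma distribution matching raw moments E[X] and E[X^2].

    For a Gamma distribution, mean = shape * scale and variance = shape * scale**2.
    """
    variance = second_moment - first_moment ** 2
    if first_moment <= 0.0 or variance <= 0.0:
        raise ValueError("moments must describe a positive mean and a positive variance")
    shape = first_moment ** 2 / variance
    scale = variance / first_moment
    return shape, scale


shape, scale = gamma_parameters_from_moments(2.0, 8.0)  # mean 2.0, variance 4.0
```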
The processes illustrated in
Systems of the embodiments of the invention may be implemented as any of a variety of suitable circuitry, such as one or more microprocessors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field programmable gate arrays (FPGAs), discrete logic, software, hardware, firmware or any combinations thereof. When modules of the systems of the embodiments of the invention are implemented partially or entirely in software, the modules contain a memory device for storing software instructions in a suitable, non-transitory computer-readable storage medium, and software instructions are executed in hardware using one or more processors to perform the techniques of this disclosure.
Numerous specific details have been set forth in the following description in order to provide a thorough understanding of the invention. However, the invention may be practiced according to the claims without some or all of these specific details. For the purpose of clarity, technical material that is known in the technical fields related to the invention has not been described in detail so that the invention is not unnecessarily obscured.
It should be noted that data and data output from the systems and methods described herein are not, in any sense, abstract or intangible. Instead, the data is necessarily digitally encoded and stored in a physical data-storage computer-readable medium, such as an electronic memory, mass-storage device, or other physical, tangible, data-storage device and medium. It should also be noted that the currently described data-processing and data-storage methods cannot be carried out manually by a human analyst, because of the complexity and vast numbers of intermediate results generated for processing and analysis of even quite modest amounts of data. Instead, the methods described herein are necessarily carried out by electronic computing systems on electronically or magnetically stored data, with the results of the data processing and data analysis digitally encoded and stored in one or more tangible, physical, data-storage devices and media.
Although specific embodiments of the invention have been described in detail, it should be understood that the described embodiments are intended to be illustrative and not restrictive. Various changes and modifications of the embodiments shown in the drawings and described in the specification may be made within the scope of the following claims without departing from the scope of the invention in its broader aspect.
The present application claims the benefit from U.S. provisional application 62/580,388 filed on Nov. 1, 2017, entitled “Mutually repulsing centroids for segmenting a vast social graph”, the entire content of which is incorporated herein by reference.
Filing Document | Filing Date | Country | Kind
---|---|---|---
PCT/IB2018/058585 | 11/1/2018 | WO | 00

Number | Date | Country
---|---|---
62580388 | Nov 2017 | US