SYSTEMS AND METHODS FOR AGGLOMERATIVE CLUSTERING

Information

  • Patent Application
  • Publication Number
    20250028752
  • Date Filed
    July 19, 2023
  • Date Published
    January 23, 2025
  • CPC
    • G06F16/355
    • G06F16/31
    • G06F40/295
    • G06F40/30
  • International Classifications
    • G06F16/35
    • G06F16/31
    • G06F40/295
    • G06F40/30
Abstract
A computerized system and method may provide a robust, automated clustering procedure, including handling of outlier points, which may involve measuring and/or quantifying degrees of relevance and/or generality for a plurality of input entities. In some embodiments, a clustering procedure may be used, e.g., to generate a hierarchical, multi-tiered taxonomy of such entities. In some embodiments, a computerized system comprising a processor, and a memory, may be used for calculating a distance between nodes for each of a plurality of pairs of nodes, where the pairs may comprise a plurality of input entities and/or initial clusters; selecting one or more of the pairs based on the calculated distances; and merging one or more of the selected pairs, which may include a common node, into one or more final clusters. Some embodiments of the invention may allow routing interactions between remotely connected computer systems based on an automatically generated taxonomy.
Description
FIELD OF THE INVENTION

The invention relates generally to clustering-algorithm-based automatic generation of a taxonomy, for example from a constituent set of entities or words.


BACKGROUND OF THE INVENTION

Clustering and/or organizing entities, items, or terms according to various similarity or relatedness measures has countless applications in a variety of technological areas, such as, for example, the automatic, computer-based generation of documents and text previously assumed to require human intelligence and/or intuition, and, more generally, the analysis of large amounts of documents using natural language processing (NLP) techniques. Current cluster analysis procedures and approaches allow grouping terms according to a similarity measure or score, with groups or clusters subsequently labeled by a human user. However, there is a need for novel systems, protocols, and approaches that allow automatically organizing terms in more complex and informative structures in a robust manner.


SUMMARY OF THE INVENTION

Embodiments may automatically cluster a plurality of entities (which may be, e.g., words extracted from a plurality of documents) and enable robust handling of outlier points, for example, based on measuring or quantifying degrees of relevance and/or generality for the entities.


Embodiments may generate taxonomies which may describe, for example, intricate semantic relationships between a plurality of terms placed in multiple tiers or categories of a semantic hierarchy.


A computerized system and method may calculate a distance between nodes for each of a plurality of pairs of nodes, where the pairs may comprise a plurality of input entities and/or initial clusters; select one or more of the pairs based on the calculated distances; and merge one or more of the selected pairs, which may include a common node, into one or more final clusters.


Some embodiments of the invention may automatically generate a domain taxonomy based on measuring and/or quantifying degrees of generality for entities within the domain under consideration. A computerized system comprising a processor, and a memory including a plurality of entities may be used for calculating generality scores for a plurality of input nodes (where nodes may include, for example, entities or clusters of entities), selecting exemplars based on the scores, and clustering unselected nodes under the exemplars to produce a multi-tiered, hierarchical taxonomy structure among nodes.


In some embodiments of the invention, entities may correspond to documents and/or text files and/or to words or terms extracted from such documents or text files.


Some embodiments of the invention may allow categorizing interactions among remotely connected computers using an automatically generated domain taxonomy, e.g., within a contact center environment. In this context, documents describing interactions between remotely connected computers may be considered as input entities, from which words may be extracted and clustered as described herein. Some embodiments may accordingly offer routing interactions between remotely connected computer systems based on an automatically generated taxonomy.





BRIEF DESCRIPTION OF THE DRAWINGS

Non-limiting examples of embodiments of the disclosure are described below with reference to figures attached hereto. Dimensions of features shown in the figures are chosen for convenience and clarity of presentation and are not necessarily shown to scale. The subject matter regarded as the invention is particularly pointed out and distinctly claimed in the concluding portion of the specification. The invention, however, both as to organization and method of operation, together with objects, features and advantages thereof, can be understood by reference to the following detailed description when read with the accompanied drawings. Embodiments of the invention are illustrated by way of example and not limitation in the figures of the accompanying drawings, in which like reference numerals indicate corresponding, analogous or similar elements, and in which:



FIG. 1 is a high-level block diagram of an exemplary computing device which may be used with embodiments of the present invention.



FIG. 2 depicts an example hierarchical-clustering-based organization of words that may be produced by embodiments of the invention applying previous methods and approaches.



FIG. 3 is a graphical representation of an example automatically generated, hierarchical domain taxonomy that may be generated using some embodiments of the invention.



FIG. 5 depicts an example affinity-propagation-based word clustering algorithm incorporating word generality indices according to some embodiments of the invention.



FIG. 6 is an example visualization of an automatically generated, hierarchical domain taxonomy that may be generated using some embodiments of the invention.



FIG. 7 is a block diagram of remotely connected computer systems according to some embodiments of the present invention.



FIG. 8 is a high-level flow diagram illustrating an example procedure for organizing call center interactions according to a taxonomy established by some embodiments of the invention.



FIGS. 9A-9C show an example cluster merging process according to some embodiments of the invention.



FIGS. 10A-10B show an example cluster batch closure process according to some embodiments of the invention.



FIG. 11 shows an illustration of an example connectivity support value calculation according to some embodiments of the invention.



FIG. 12 depicts an example agglomerative batch closure procedure incorporating a plurality of clustering conditions and/or criteria according to some embodiments of the invention.



FIG. 13 is a flow diagram illustrating an example method for an automatic clustering of nodes according to some embodiments of the invention.





It will be appreciated that for simplicity and clarity of illustration, elements shown in the figures have not necessarily been drawn accurately or to scale. For example, the dimensions of some of the elements can be exaggerated relative to other elements for clarity, or several physical components can be included in one functional block or element.


DETAILED DESCRIPTION

In the following detailed description, numerous specific details are set forth in order to provide a thorough understanding of the invention. However, it will be understood by those skilled in the art that the present invention can be practiced without these specific details. In other instances, well-known methods, procedures, and components, modules, units and/or circuits have not been described in detail so as not to obscure the invention.


Embodiments of the invention may automatically generate a hierarchical, multi-tiered taxonomy, for example, based on measuring and/or quantifying degrees of generality for a plurality of input entities—which may be, for example, a plurality of words extracted from a corpus of documents—as further described herein. In some embodiments, a computerized system comprising a processor, and a memory including a plurality of entities such as documents or text files, may be used for extracting words from a plurality of documents; calculating generality scores for the extracted words; selecting some of the extracted words to serve as exemplars based on the scores; and clustering unselected words under appropriate exemplars to produce or output a corresponding taxonomy. Some embodiments of the invention may further allow categorizing interactions among remotely connected computers using a domain taxonomy, and/or routing interactions between remotely connected computer systems based on the taxonomy as described herein.


Embodiments of the invention may include an agglomerative batch closure (ABC) clustering procedure or protocol which may include, for example, generality and/or relevance score-based, and/or K-nearest neighbors (KNN) based connectivity constraints—for example in order to allow robust forming and merging of clusters of points or entities (such as, for example, word or document clusters) and desirable handling of outlier points as described herein, leading to clean and informative separation of terms, topics, and the like. In some embodiments, a computerized system comprising a processor, and a memory may be used for calculating a distance between nodes for each of a plurality of pairs of nodes, the pairs comprising one or more of a plurality of entities and initial clusters; selecting, by the processor, one or more of the pairs based on the calculated distances; and merging, by the processor, one or more of the selected pairs including a common node into one or more final clusters.



FIG. 1 shows a high-level block diagram of an exemplary computing device which may be used with embodiments of the present invention. Computing device 100 may include a controller or processor 105 that may be, for example, a central processing unit processor (CPU), a chip or any suitable computing or computational device, an operating system 115, a memory 120, a storage 130, input devices 135 and output devices 140.


Operating system 115 may be or may include any code segment designed and/or configured to perform tasks involving coordination, scheduling, arbitration, supervising, controlling or otherwise managing operation of computing device 100, for example, scheduling execution of programs. Memory 120 may be or may include, for example, a Random Access Memory (RAM), a read only memory (ROM), a Dynamic RAM (DRAM), a Synchronous DRAM (SD-RAM), a double data rate (DDR) memory chip, a Flash memory, a volatile memory, a non-volatile memory, a cache memory, a buffer, a short term memory unit, a long term memory unit, or other suitable memory units or storage units. Memory 120 may be or may include a plurality of, possibly different memory units. Memory 120 may store for example, instructions (e.g. code 125) to carry out a method as disclosed herein, and/or data such as queries, documents, interactions, etc.


Executable code 125 may be any executable code, e.g., an application, a program, a process, task or script. Executable code 125 may be executed by controller 105 possibly under control of operating system 115. For example, executable code 125 may be one or more applications that perform methods as disclosed herein, for example those of FIGS. 2-13, according to embodiments of the present invention. In some embodiments, more than one computing device 100 or components of device 100 may be used for multiple functions described herein. For the various modules and functions described herein, one or more computing devices 100 or components of computing device 100 may be used. Devices that include components similar or different to those included in computing device 100 may be used, and may be connected to a network and used as a system. One or more processor(s) 105 may be configured to carry out embodiments of the present invention by for example executing software or code. Storage 130 may be or may include, for example, a hard disk drive, a floppy disk drive, a Compact Disk (CD) drive, a CD-Recordable (CD-R) drive, a universal serial bus (USB) device or other suitable removable and/or fixed storage unit. Data such as documents or interactions may be stored in a storage 130 and may be loaded from storage 130 into a memory 120 where they may be processed by controller 105. In some embodiments, some of the components shown in FIG. 1 may be omitted.


Input devices 135 may be or may include a mouse, a keyboard, a touch screen or pad or any suitable input device. It will be recognized that any suitable number of input devices may be operatively connected to computing device 100 as shown by block 135. Output devices 140 may include one or more displays, speakers and/or any other suitable output devices. It will be recognized that any suitable number of output devices may be operatively connected to computing device 100 as shown by block 140. Any applicable input/output (I/O) devices may be connected to computing device 100, for example, a wired or wireless network interface card (NIC), a modem, printer or facsimile machine, a universal serial bus (USB) device or external hard drive may be included in input devices 135 and/or output devices 140.


Embodiments of the invention may include one or more article(s) (e.g. memory 120 or storage 130) such as a computer or processor non-transitory readable medium, or a computer or processor non-transitory storage medium, such as for example a memory, a disk drive, or a USB flash memory, encoding, including or storing instructions, e.g., computer-executable instructions, which, when executed by a processor or controller, carry out methods disclosed herein. Procedures and protocols described herein may thus be performed using a computer system such as computing device 100, or, additionally or alternatively, using a plurality of remotely connected computer systems, such as for example one or more devices such as computer devices 100 connected over a communication network.


Embodiments of the invention may take as input a plurality of entities and consider them as nodes or points, and group or cluster a plurality of such nodes or points according to the principles and procedures outlined herein.


In some embodiments, entities considered as or included in nodes may be or may describe for example terms, words, or sentences which may be extracted from or identified within a set or corpus of documents (which may also be referred to as a “domain”). Term extraction may be performed based on various conditions or constraints, such as for example a combination of occurrence data, e.g. the number of times the term occurs in the set of documents, along with various filtering mechanisms. Embodiments of the invention may thus search and subsequently extract or retrieve a plurality of entities based on such conditions and/or criteria, filtering principles or mechanisms, as well as appropriate word extraction procedures known in the art.


In some embodiments, the words extracted may be used as training data for a vector embedding model (e.g. a Word2Vec process), which may be used to calculate or produce vector representations or embeddings for a plurality of entities considered by embodiments of the invention as further described herein.
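
By way of a non-limiting sketch, training such an embedding model might look as follows; the use of the gensim library, the parameter values, and the toy tokenized_documents input are illustrative assumptions rather than requirements of any embodiment:

```python
# A minimal sketch of producing vector embeddings for extracted terms with
# gensim's Word2Vec; parameters and inputs are illustrative assumptions.
from gensim.models import Word2Vec

# tokenized_documents: list of token lists, one per document in the corpus
tokenized_documents = [
    ["tv", "channel", "pixelation"],
    ["internet", "download", "speed"],
]

model = Word2Vec(
    sentences=tokenized_documents,
    vector_size=100,   # dimensionality of the embeddings
    window=5,          # context window for co-occurrence
    min_count=1,       # keep rare terms in this toy example
)

vector = model.wv["internet"]  # embedding for a single extracted term
```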


It should generally be noted that while terms or words extracted from documents are used herein as a particular example for entities which may be taken as input by some embodiments of the invention—additional and/or alternative entities may be considered by different embodiments. Thus, entities such as terms or words should be considered merely as a non-limiting example. In this context, terms such as “nodes”, “points”, “entities”, “words”, and the like, may be used interchangeably throughout the present document.


A domain as referred to herein may be or may correspond to a dataset or repository from which entities may be extracted or identified. Thus, in some embodiments a domain may be, e.g., a corpus of documents from which a plurality of words or terms may be extracted.


A lexicon or domain lexicon as referred to herein may be or may include a set of entities such as terms, words, or other items which may, for example, be collected or extracted from a plurality of data items, such as a domain or corpus of text documents and/or files as described herein. A domain lexicon may be organized in a dedicated data structure, such as a table or a JSON data object, and may include a corresponding plurality of attributes describing the entities (such as, for example, a number of occurrences of a given word in the data items based on which the domain was established). In some embodiments of the invention, a domain lexicon may be established to correspond to a particular domain based on input data items provided by or for that domain, such as data items received from or describing a given remote computer, or a plurality of remote computers (which may belong to or be associated with, for example, an organization or a plurality of organizations).


A taxonomy, or domain taxonomy (when applied to a specific domain), as referred to herein may be or may include a multi-tiered, hierarchical structure of entities, items, terms, or words (e.g., extracted from and/or describing a domain), where similar terms are clustered or grouped together, and where terms move from general to specific across different levels of the hierarchical structure. Some example taxonomies are provided herein. However, one skilled in the art may recognize that additional or alternative particular forms and formats of taxonomies, including various levels of hierarchy among clusters of entities, may be used in different embodiments of the invention.


A vector embedding or representation as used herein may be or may describe, for example, an ordered list of values and/or numbers. A given term may, for example, have or be associated with a 5-dimensional vector such as [1.1, 2.1, 3.1, 4.1, 5.1] (which may, e.g., be normalized to unit norm). Various vectors of different dimensionalities and value types (including, for example, binary “true/false” like values) may be used as part of vector representations or embeddings in different embodiments of the invention.


Vector embeddings or representations may be calculated for the entities considered by some embodiments of the invention. For example, a Word2Vec model, or another suitable model or procedure, may be used to produce for each entity an embedding or vector representation (e.g. in a metric space). In some embodiments, each unique entity such as a term or cluster (such as for example a cluster of words) may be assigned a corresponding unique vector. Various vector embedding models which may be used to generate vector representations or embeddings for entities and/or words are known in the art.


Given a lexicon including a plurality of underlying entities or a term “vocabulary” (which may include, for example, multi-word terms, as well as individual words) and suitable vector embeddings or representations of these terms, e.g., in a metric space—embodiments of the invention may use various clustering algorithms to cluster or group relevant terms according to their semantic similarity, which may prove useful, for example, for constructing a taxonomy. For example, the known k-means algorithm or another suitable algorithm may be applied to the associated vector embeddings or representations, for example together with additional constraints and conditions (e.g., specifying a number of final clusters) as disclosed herein. Alternative clustering algorithms and procedures, or a plurality of such procedures, may be used in different embodiments of the invention.


For a given entity such as a word, term, node, or a cluster formed of constituent entities (e.g., following a given clustering iteration) embodiments of the invention may calculate a vector representation or embedding for that entity based on its various properties or attributes. Some example properties or attributes for entities which may be used in this context are further discussed herein (e.g., generality and/or relevance scores), but alternative or additional attributes may be used in different embodiments. In the case of a cluster (such as, e.g., a cluster of words), such a representation may, e.g., be defined as equal to the centroid of the cluster, or the centroid of the constituent term vectors, which may for example be calculated as the mean or average of these vectors—although other procedures for calculating a cluster vector may be used in different embodiments of the invention. Based on the vectors or embeddings generated, embodiments of the invention may determine whether entities or clusters should be further linked, grouped, or clustered together.
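
As a minimal illustration of the centroid-based cluster representation described above (assuming numpy and plain, unweighted averaging; a relevance-weighted variant is discussed later in connection with dynamic contextual weighting):

```python
import numpy as np

def cluster_centroid(member_vectors: np.ndarray) -> np.ndarray:
    # Plain centroid: the mean of the constituent term vectors.
    return member_vectors.mean(axis=0)

# Toy example: three 2-dimensional member embeddings of one cluster.
members = np.array([[0.1, 0.9], [0.2, 0.8], [0.0, 1.0]])
centroid = cluster_centroid(members)  # -> array([0.1, 0.9])
```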


To determine if entities or clusters of entities may be clustered or linked, embodiments may compare a pair of embeddings or representations—in some embodiments this comparison or measure may be termed or referred to as a distance. For example, some embodiments may use or include the cosine similarity measure as a distance measure, which may indicate similarity between two non-zero vector representations or embeddings S1 and S2 using the cosine of the angle between them:










$$\mathrm{sim}(S_1, S_2) \;=\; \frac{S_1 \cdot S_2}{\lVert S_1 \rVert \, \lVert S_2 \rVert} \qquad \text{(Eq. 1)}$$







Eq. 1 may output scores between 0.00 (no similarity) and 1.00 (full similarity or identity). Embodiments of the invention may calculate similarity scores and link or group two entities if, for example, a similarity score exceeding a predetermined threshold (such as for example sim (S1,S2)≥0.70) is calculated based on the corresponding vector representations or embeddings. Some embodiments may store calculated similarity scores in an affinity matrix as further described herein. Additional or alternative measures and/or formulas of similarity may be used in different embodiments of the invention.
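
A minimal sketch of Eq. 1 and of populating an affinity matrix, assuming numpy and toy embeddings (the 0.70 linking threshold is the example value from the text):

```python
import numpy as np

def cosine_similarity(s1: np.ndarray, s2: np.ndarray) -> float:
    # Eq. 1: cosine of the angle between two non-zero embeddings.
    return float(np.dot(s1, s2) / (np.linalg.norm(s1) * np.linalg.norm(s2)))

def affinity_matrix(vectors: np.ndarray) -> np.ndarray:
    # Pairwise similarity scores for all entity embeddings.
    n = len(vectors)
    sims = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            sims[i, j] = cosine_similarity(vectors[i], vectors[j])
    return sims

vectors = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
sims = affinity_matrix(vectors)
linked = sims >= 0.70  # example predetermined linking threshold
```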


Some embodiments of the invention may include or involve additional conditions or criteria that may be applied to similarity scores or measures, or distance measures, such as for example finding and using the kth most similar vector to set a similarity threshold such that vectors or embeddings found less similar than that threshold may not be linked to a given entity or cluster. Other types of thresholds may be used in different embodiments. Different thresholds may be adaptive in the sense that they may be tuned, for example in the beginning of a given clustering iteration, according to various performance considerations. For example, a similarity threshold may be tuned such that each entity is connected to no less than one and no more than three other entities at each clustering iteration. Other ranges, tuning, or adaptiveness thresholds and measures may be used in different embodiments of the invention.
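
One way such an adaptive, kth-neighbor-based threshold might be sketched (the averaging rule below is an assumption; the text only requires that the threshold be tuned per iteration):

```python
import numpy as np

def adaptive_threshold(sims: np.ndarray, k: int = 3) -> float:
    # Use the k-th highest off-diagonal similarity of each row, averaged over
    # rows, as this iteration's linking threshold, so that each entity links
    # to roughly k other entities at most.
    s = sims.copy()
    np.fill_diagonal(s, -np.inf)             # ignore self-similarity
    kth_per_row = np.sort(s, axis=1)[:, -k]  # k-th highest similarity per row
    return float(kth_per_row.mean())
```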


In this context, one skilled in the art would generally recognize that different formulas and/or measures for similarity or distance, as well as conditions and/or criteria may be included or combined with the different schemas, procedures and/or protocols described herein to produce or provide different embodiments of the invention.



FIG. 2 depicts an example hierarchical-clustering-based organization of words that may be produced by embodiments applying previous methods and approaches. Embodiments employing or utilizing the so-called Hierarchical Clustering algorithm (or HC) may produce various artifacts such as the ‘hierarchy’ of clusters in levels or tiers 210, 220, and 230 of FIG. 2, where thick lines denote the separation between clusters in a given level among the three levels depicted, and where subsequent levels or tiers (starting from the highest level 210) describe smaller (mid-level, 220) and smaller (lower level, 230) sub-clusters for each cluster found in the preceding level. Since HC is based merely on semantic similarity as measured by, e.g., word embedding distances (themselves typically largely dependent on the dataset used for the training of a corresponding Word2Vec model), a resulting grouping using such an algorithm may not distinguish between general and specific terms, grouping, for example, “CNBC” (specific) with “News” (more general), while on the other hand grouping “CNN” with “Fox” (both equally specific) as in lower tier 230. Additionally, as may be seen in the middle tier 220, the general word “TV” was placed in (what a human being may identify, looking from the outside, as) the “TV Channels” group which includes “Channel”, “Disney”, “CNN”, “Fox”, “CNBC”, and “News”, despite the equally legitimate claim of the “TV Display” group including “Display”, “Color”, and “Pixelation” (which may indicate, for example, that in the training set, the words ‘TV’ and ‘Channel’ co-occurred more frequently than ‘TV’ and ‘Display’).


Despite managing to group the input set of terms, HC does not provide, on its own, any names or labels for the resulting groups or clusters (even for top tier 210, where a subjective, human-interpretation-based division between, e.g., “Internet-related” words and “TV-related” words may seem unambiguous). Providing such labels or names may thus require manual intervention by a human user, which may become, in many cases, an undesirable performance bottleneck (in addition to involving various subjective biases and corresponding errors). Another common shortcoming or drawback from which previous clustering approaches (such as, e.g., HC) often suffer is the requirement to manually specify the desired number of output clusters as input to the clustering procedure. Having limited a priori information regarding a given domain lexicon, specifying or assigning such a value may not be a trivial task, and offering a semantically meaningful clustering output for essentially different input datasets, or corpora of documents, would be difficult to achieve.


Some embodiments of the invention may thus improve prior technology by allowing automatic hierarchical clustering, grouping, or generally categorizing or organizing of a group of nodes under a particular node or “topic” which describes or summarizes them. In some embodiments, a topic (otherwise referred to as an “exemplar”, or cluster title herein) may be considered a subject relating to and/or representing and/or describing the plurality of underlying terms.



FIG. 3 is a graphical representation of an example automatically generated, hierarchical domain taxonomy that may be generated using some embodiments of the invention. By requiring, as in some embodiments, that words go from general to specific, which may be achieved according to the various methods and procedures described herein, not only are the groups or clusters at each of tiers 310, 320, 330, and 340 more homogeneous (e.g., with respect to their level of generality/specificity) and well distinguished from one another compared to, e.g., the clustering results described in tiers 210, 220, 230 in FIG. 2—but, in addition, terms at each tier may, for example, automatically be used as labels for the terms underneath them. For example, it can be seen that terms included in tier 310 (“Internet”, “TV”) may serve as labels or titles for terms included in tier 320 (“Speed” under “Internet”; “Channel” and “Display” under “TV”), and that similar hierarchies are satisfied among subsequent, lower tiers 330, 340. A taxonomy produced or provided by some embodiments of the invention may thus include hierarchical, tiered structures such as, for example, illustrated in FIG. 3.


Some embodiments of the invention may include or provide an ABC procedure or protocol, e.g., to ensure that clusters (such as for example the ones depicted in FIG. 3) are kept cohesive, and that outlier data points or entities are handled desirably.



FIGS. 9A-9C show an example cluster merging or combining process according to some embodiments of the invention. Some clustering protocols and procedures such as, e.g., HC may allow successively merging or combining a pair of clusters, e.g., for which a distance within a predetermined threshold is measured or calculated, at any clustering step or iteration. While an unmediated, intuitive inspection of the spatial distribution of the data points shown in FIG. 9A may indicate that points found to the left of data point 910 should belong to a first cluster, and that points found to the right of data point 910 should belong to a second, different cluster, subsequent iterations starting from an initial cluster including two points 920 (which may, e.g., be characterized by having the shortest inter-point distance in the dataset) may add data point 910 to that cluster (FIG. 9B), then possibly add an additional data point to the right of the middle point (FIG. 9C), and so forth, producing output clusters that do not offer an informative picture of the input dataset, or that miss essential characteristics of it. Such an outcome may, in some cases, be particularly undesirable, e.g., in some attempts at generating taxonomies, where forming and merging clusters must be performed in an accurate and robust manner to allow hierarchical, clean, and informative separation of terms, topics, and the like. Hence, some embodiments of the invention may allow carefully handling particular data points such as outliers or boundary points (such as, e.g., point 910 in the above example).


‘Batch Closure’ as used herein (e.g., in the context of the ABC framework) may describe or refer to a clustering protocol or procedure where n closest pairs or “batches” of clusters are located or identified, for example, before initial clusters are merged to form larger clusters. Following such identification, the group or batch of identified clusters may be merged by forming a “transitive closure” of each linked group.



FIGS. 10A-10B show an example cluster batch closure process according to some embodiments of the invention. Embodiments may calculate distances between nodes for a plurality of pairs of nodes, where nodes may be or may include various entities, words, or clusters (such as, e.g., “initial” clusters, which may have been established throughout previous clustering iterations). A plurality of pairs of nodes (such as, e.g., four pairs, or n=4, in the present example) may then be identified and selected based on calculated distances (for example where the distance between the two members of each pair is calculated and found shortest among all existing pairs, although other distance-based conditions or criteria may be used in different embodiments) to form four links 1010a-d. Linked members or batches may subsequently be “transitively” merged; that is, two or more of the identified pairs containing a common node or member may be merged to form a larger cluster, such as, for example, the three-member cluster 1020 in FIG. 10B, which is formed from two identified pairs in FIG. 10A that contain a common member (links 1010a-b).


Some embodiments of the invention may thus perform subsequent batch closure iterations on entities, words, or initial clusters until final clusters are formed according to, or based on, a plurality of conditions or criteria. For example, in some embodiments, points or words may no longer be added to clusters including a number of words above a predetermined threshold. In yet another example, a word calculated or measured to be above a predetermined distance from a given cluster may not be added to that cluster. In some embodiments, the value for the allowed maximum batch size that may be formed, e.g., per a given clustering iteration, may be modified and/or altered according to various accuracy and/or performance considerations.
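A minimal batch closure sketch over a distance matrix, assuming numpy and a simple union-find structure for the transitive merging (function names and the n_pairs default are illustrative):

```python
import numpy as np

def batch_closure(dist: np.ndarray, n_pairs: int = 4) -> list[set[int]]:
    # Select the n_pairs closest node pairs, then merge linked pairs
    # transitively, so pairs sharing a common node form one cluster
    # (cf. FIGS. 10A-10B).
    n = dist.shape[0]
    parent = list(range(n))                    # union-find forest

    def find(a: int) -> int:
        while parent[a] != a:
            parent[a] = parent[parent[a]]      # path compression
            a = parent[a]
        return a

    # Enumerate all pairs, keep the n_pairs with the shortest distances.
    pairs = [(dist[i, j], i, j) for i in range(n) for j in range(i + 1, n)]
    for _, i, j in sorted(pairs)[:n_pairs]:
        parent[find(i)] = find(j)              # transitive merge

    clusters: dict[int, set[int]] = {}
    for node in range(n):
        clusters.setdefault(find(node), set()).add(node)
    return list(clusters.values())
```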


In some embodiments, different constraints may be applied to different iterations at different stages of the ABC procedure. For example, merging pairs of nodes as part of the ABC procedure may be performed based on various ‘connectivity constraints’, which may be imposed to restrict what particular points or clusters, or types of points or clusters, are allowed to be merged together. In one example, constraints may be applied such that two clusters may be merged only if there is a pair of points, one in each cluster, such that one (or possibly both) is in the set of k-nearest neighbors (KNNs) of the other (for example such that each data point is placed in the same cluster as its KNN). Other example constraints may include, e.g., a maximum batch size or number of nodes allowed to be merged together (for example based on a predetermined threshold of a batch size of 5 distinct nodes), and/or comparing calculated distances between nodes to a predetermined threshold (for example using affinity and/or connectivity matrices as further described herein).


In some embodiments of the invention, a connectivity matrix may be calculated or, e.g., derived or transformed from an affinity or similarity matrix, which may contain pairwise similarity scores (as may be calculated by some embodiments of the invention, e.g., using Eq. 1 herein; see also further discussion herein) between pairs of the words, entities, or clusters input into the clustering procedure. Connectivity matrices may be used to identify, for example, the KNNs of each point, e.g., points whose distance to a given point is below a predetermined threshold or within a specific interval.


Some embodiments of the invention may use, e.g., weighted statistical parameters, attributes, or characteristics of the similarity matrix in transforming an affinity matrix into a connectivity matrix. For example, an embodiment may use a threshold of T=μ+k·σ—where μ is the mean and σ is the standard deviation of similarity values in the affinity matrix, and k is an additional parameter or weight (which may, e.g., be set to equal unity)—as a connectivity determining criterion (see also example affinity and connectivity matrices in Tables 1-2). In one example, a point may be clustered with another point or added to a cluster if a similarity score higher than T is calculated based on the vectors representations for the point(s) and/or the cluster (which may be based, e.g., on the centroid of the cluster as noted herein). In some embodiments, k may be set to a negative value, e.g., in case the forming of looser clusters, and an accordingly lower clustering threshold, are desirable.


Example affinity and connectivity matrices for three words {W1, W2, W3} may be seen in Tables 1-2, respectively.














TABLE 1

T = 0.5 + 1.2(0.2) = 0.74

            W1        W2        W3
W1          1.00      0.30      *0.75*
W2          0.30      1.00      0.30
W3          *0.75*    0.30      1.00

Given μ=0.5, k=1.2, and σ=0.2, a connectivity threshold may be calculated as T=μ+k·σ=0.74. An additional constraint may be added in some embodiments of the invention, which may require a similarity score smaller than 1.00 for connectivity (as a similarity score of 1.00 may only describe the similarity of an entity to itself). Thus, in an affinity matrix such as described by Table 1, words W1 and W3, for which a pairwise similarity score of 0.75 is calculated, may be considered connected (marked with asterisks in Table 1), while words W1 and W2, as well as W2 and W3, may be considered unconnected. An affinity matrix such as, e.g., the one shown in Table 1 may be converted or transformed into a connectivity matrix, for example, by setting similarity scores satisfying connectivity conditions and/or criteria to 1, and the remaining ones to 0, as shown in Table 2.













TABLE 2

            W1        W2        W3
W1          0         0         1
W2          0         0         0
W3          1         0         0

Alternative procedures for calculating similarity and/or connectivity matrices may be used in different embodiments of the invention.
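
A sketch of the affinity-to-connectivity transformation described above; note that μ and σ are estimated here from the off-diagonal entries of the toy matrix itself (the text's worked example instead takes μ=0.5 and σ=0.2 as given), yet the resulting connectivity reproduces Table 2:

```python
import numpy as np

def connectivity_matrix(affinity: np.ndarray, k: float = 1.2) -> np.ndarray:
    # T = mu + k * sigma over off-diagonal similarities; scores of exactly
    # 1.00 are excluded, as they only describe an entity's self-similarity.
    off_diag = affinity[~np.eye(len(affinity), dtype=bool)]
    threshold = off_diag.mean() + k * off_diag.std()
    return ((affinity > threshold) & (affinity < 1.0)).astype(int)

affinity = np.array([
    [1.00, 0.30, 0.75],
    [0.30, 1.00, 0.30],
    [0.75, 0.30, 1.00],
])
print(connectivity_matrix(affinity))  # W1-W3 connected, as in Table 2
```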


In order to decide whether two nodes or clusters C1 and C2 should be merged, embodiments may calculate a ‘connectivity support’ value or index based on a connectivity matrix, e.g., as the total number of connections from C1 to C2 (which may include, for example, the number of nodes within C1 that are connected to nodes within C2), or as a ratio of the connections possible or available to a plurality of clusters and/or data points. For example, a connectivity support threshold may be set as T=X where X is an integer (and may, e.g., equal 3, meaning there should be at least three connections between nodes or words in C1 and ones in C2). Connections between clusters may exist in different forms, for example between one node in C1 and three nodes in C2, or between three nodes in C1 and two nodes in C2, and so forth, as may be reflected, e.g., in a connectivity matrix such as for example depicted in Table 2 (for illustration purposes, one may think of W1 as a node in C1, and of W2 and W3 as nodes in C2; based merely on Table 2, it may be determined that there is only one connection between C1 and C2, and given T=3>1, C1 and C2 should not be connected). By performing node or cluster merging based on connectivity support values (which may, for example, amount to preventing clusters having a calculated connectivity support value below a predetermined threshold from merging, or determining that clusters having a connectivity support above a predetermined threshold should be merged), embodiments of the invention may allow clusters to be formed and kept cohesive, while allowing the ABC procedure or protocol to automatically terminate or end, e.g., without having to specify a predetermined number of output clusters prior to its execution. In some embodiments, a threshold may, for example, be set to a default value of Tc=k/10 (such that, e.g., the “looser” the connectivity matrix is, the stricter the support level required for cluster formation). It should be noted that in some examples, as well as in the one considered herein, clusters such as C1 and C2 may also be singleton clusters, e.g., containing or corresponding to a single entity or point.



FIG. 11 shows an illustration of an example connectivity support value calculation according to some embodiments of the invention. It can be seen, for example, that each of the five entities {G, H, I, J, K} in cluster C2 is among the KNNs (given μ as the mean and σ as the standard deviation of similarity values as described herein) of at least one of the six entities {A, B, C, D, E, F} in cluster C1, as indicated by black arrows and set notation in box 1110. The total number of k-nearest neighbors found for entities in cluster C1 is 9 (see box 1110, and note double counting, as some entities in C2 are KNNs of more than one entity in C1). The connectivity support value for C1 and C2 may thus be calculated, for example, with reference to the combinatorial space of possible connections between the constituent entities, or with reference to having all entities in C2 as KNNs of all entities of C1, which amounts to the product of the numbers of entities in each cluster, e.g., |C1|·|C2|=6·5=30. The connectivity support value for C1 and C2 would then be 9/30=0.3. Given k=1.2, Tc=1.2/10=0.12<0.3. Thus, clusters C1 and C2 may be merged by embodiments of the invention in this particular example. One skilled in the art would recognize, however, that additional or alternative terms and formulas may be used to calculate different connectivity support values in different embodiments of the invention.
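
The connectivity support calculation of FIG. 11 might be sketched as follows; the particular KNN link assignments below are illustrative placeholders chosen only to reproduce the figure's totals (9 links between a 6-member and a 5-member cluster):

```python
def connectivity_support(knn_links: dict[str, set[str]],
                         c1: set[str], c2: set[str]) -> float:
    # Ratio of observed KNN connections from C1 into C2 to the maximum
    # possible number of connections |C1| * |C2|.
    connections = sum(len(knn_links.get(a, set()) & c2) for a in c1)
    return connections / (len(c1) * len(c2))

c1 = {"A", "B", "C", "D", "E", "F"}
c2 = {"G", "H", "I", "J", "K"}
knn_links = {                          # hypothetical link assignments
    "A": {"G", "H"}, "B": {"H"}, "C": {"I", "J"},
    "D": {"J"}, "E": {"G", "K"}, "F": {"K"},
}
support = connectivity_support(knn_links, c1, c2)  # 9 / 30 = 0.3
merge = support > 1.2 / 10                         # Tc = k/10 = 0.12 -> merge
```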


In order to correctly handle outlier data points and avoid potential errors or undesirable clustering results that may be associated with them, embodiments of the invention may apply “dynamic contextual weighting” techniques, which may include or involve, for example, iteratively calculating or recalculating a centroid for merged clusters by using weighted scores or metrics describing each of the cluster members or points at each iteration, or once in a plurality of iterations. The metric, score, or rank of each point may, e.g., be generated based on an appropriate ranking algorithm, such as for example the graph-based TextRank procedure or algorithm (thus, calculated scores may be, e.g., TextRank relevancy scores, or generality scores as described herein).


A dynamic contextual weighting procedure may thus allow clustering points or cluster members according to how salient or relevant they are with respect to the rest of the cluster members during or before a particular iteration in the clustering process or procedure. In such manner, applying an appropriate ranking procedure or algorithm such as for example TextRank (and/or a procedure involving or including the calculation of generality indices as described herein), embodiments of the invention may ensure that the most representative or relevant members of that cluster have the biggest influence on the calculated or recalculated centroid, while the less relevant points with respect to that cluster have less influence or weight. Members or nodes may then be ejected or removed from clusters (e.g., from initial clusters received as input for a given ABC iteration). In some embodiments, points which receive, e.g., a relevancy score below a predetermined threshold (which may be set, for example, as a certain number of standard deviations below the cluster mean score) may be ejected from that cluster, which may allow correcting for outliers and wrongly clustered points. Ejected points may be put aside, and for example be omitted from the final clustering result or output, or added to the cluster closest or most similar to them, following a clustering iteration or given a final clustering result or output. For example, in some embodiments of the invention, distances or similarity scores, e.g., between each of the ejected nodes and a plurality of final clusters received after a clustering iteration, or a plurality of clustering iterations, may be calculated (e.g., based on Eq. 1 herein), and ejected nodes may be added to the relevant final clusters based on the calculated distances (for example, each ejected node may be added to the cluster to which it is found to be most similar).
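
A minimal sketch of the weighting and ejection steps, assuming numpy, precomputed relevance scores (e.g., TextRank values), and a cutoff of n_std standard deviations below the cluster mean:

```python
import numpy as np

def weighted_centroid_and_ejections(vectors: np.ndarray,
                                    relevance: np.ndarray,
                                    n_std: float = 2.0):
    # Relevance-weighted centroid: the most relevant members influence the
    # centroid most; members far below the mean relevance are ejected.
    centroid = np.average(vectors, axis=0, weights=relevance)
    cutoff = relevance.mean() - n_std * relevance.std()
    ejected = np.where(relevance < cutoff)[0]
    return centroid, ejected
```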


Clusters provided or received following a given clustering iteration (using, e.g., the ABC protocol or procedure) may generally be considered as input for a subsequent clustering iteration, and subsequent iterations may follow and a plurality of clustering operations may be repeated, e.g., until final clusters are received—for example when the clustering protocol or procedure reaches a stop block based on various clustering constraints and criteria such as provided herein, and/or as generally known in the art.


For the purpose of, e.g., taxonomy building, in which robust, hierarchical clustering is desirable, different embodiments of the invention may include, for example, combinations of batch closure iterations and merging of clusters based on or according to connectivity constraints and/or connectivity support values, including, e.g., dynamic contextual weighting procedures and/or additional input preferences for merging clusters and/or identifying and handling outlier entities or points. Embodiments of the invention may thus: merge batches of cluster pairs (or perform transitive closures of such pairs) at each given clustering iteration, which may result in added stability and fewer required iterations compared to previous solutions and approaches; perform contextualized (for example TextRank and/or word generality index based) weighted centroid linkage, for example in order to establish more robust cluster centroids and handle outlier or border points in an improved manner; and support various connectivity constraints for maintaining stable, cohesive clusters and allowing for early stopping without predefining a number of clusters as discussed herein.


One skilled in the art may recognize, however, that in different embodiments of the invention, various additional and/or alternative clustering conditions or criteria may be used or incorporated into the ABC procedure or protocol, or to different iterations of such protocol. For example, different embodiments of the invention may include or involve a more personalized ranking protocol or algorithm, which may for example include or involve the TextRank procedure and/or generality indices as described herein, alongside additional attributes and corresponding constraints relating to the points or entities considered in a given clustering procedure.



FIG. 12 depicts an example agglomerative batch closure procedure incorporating a plurality of clustering conditions and/or criteria according to some embodiments of the invention. Embodiments of the invention may start the procedure by receiving a lexicon or set of entities, items, or points to be clustered, as well as clusters which have been formed in previous clustering iterations (step 1210). For existing clusters, embodiments may then execute a ranking or scoring procedure or algorithm such as, e.g., the TextRank procedure as discussed herein, on entities or points within a cluster, and eject points with scores below a predetermined threshold from the cluster (step 1220). Subsequently, embodiments may calculate or recalculate a centroid for each cluster from the remaining points, for example while considering TextRank relevancy scores of the remaining points as weights as explained herein (step 1230). Embodiments may then calculate a new affinity matrix for existing points and clusters and identify batches or pairs of clusters and/or entities which may be merged by transitive closure, for example based on a similarity threshold as described herein (step 1240). Step 1240 may include, for example, the calculation of a connectivity matrix, e.g., as described herein (step 1250). Embodiments may subsequently calculate connectivity support values and remove pairs having a value below a predetermined threshold from the remaining ABC steps (step 1260). Embodiments may then perform the transitive closures or mergers of the remaining cluster pairs having connectivity support values above the predetermined threshold (step 1270). If closures or mergers were performed, embodiments may go back to step 1220 and perform additional ABC iterations. If no mergers were performed, some embodiments of the invention may further handle outlier points, such as points ejected in step 1220, and add them to the clusters found closest or most similar to them (step 1280). Subsequently, the procedure may terminate (step 1290). In some embodiments, the calculation of a connectivity matrix in step 1250 may be performed before or in the absence of steps 1220-1240, for example in cases where there are no existing clusters or where all clusters are singleton clusters, and in cases where the affinity matrix is already given and need not be calculated. Additional or alternative ABC procedures and protocols and/or constituent steps may be included in different embodiments of the invention.
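
The FIG. 12 loop, in a compact and heavily simplified form; this is a toy, self-contained rendering under stated assumptions, not the claimed procedure: member relevance is approximated by similarity to the cluster centroid (standing in for TextRank), and candidate pairs are taken to be every connected pair in the connectivity matrix (standing in for the n closest batches and the separate support filter):

```python
import numpy as np

def cos(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def abc_clustering(vectors: np.ndarray, k: float = 1.2, n_std: float = 2.0):
    # Start from singleton clusters of row indices into `vectors`.
    clusters = [[i] for i in range(len(vectors))]
    ejected: list[int] = []

    while len(clusters) > 1:
        # Steps 1220-1230: score members, eject low scorers, re-center.
        centroids = []
        for c in clusters:
            cen = vectors[c].mean(axis=0)
            rel = np.array([cos(vectors[i], cen) for i in c])
            if len(c) > 2:                      # only eject from real clusters
                cut = rel.mean() - n_std * rel.std()
                ejected.extend(i for i, r in zip(c, rel) if r < cut)
                c[:] = [i for i, r in zip(c, rel) if r >= cut]
            w = np.clip([cos(vectors[i], cen) for i in c], 1e-6, None)
            centroids.append(np.average(vectors[c], axis=0, weights=w))

        # Steps 1240-1250: affinity matrix over clusters, then connectivity.
        m = len(clusters)
        sims = np.array([[cos(centroids[a], centroids[b]) for b in range(m)]
                         for a in range(m)])
        off = sims[~np.eye(m, dtype=bool)]
        conn = (sims > off.mean() + k * off.std()) & (sims < 1.0)

        # Steps 1260-1270: transitively merge connected cluster pairs.
        parent = list(range(m))
        def find(a: int) -> int:
            while parent[a] != a:
                parent[a] = parent[parent[a]]
                a = parent[a]
            return a
        merged = False
        for a in range(m):
            for b in range(a + 1, m):
                if conn[a, b] and find(a) != find(b):
                    parent[find(a)] = find(b)
                    merged = True
        if not merged:
            break                               # no closures: terminate
        groups: dict[int, list[int]] = {}
        for idx, c in enumerate(clusters):
            groups.setdefault(find(idx), []).extend(c)
        clusters = list(groups.values())

    # Step 1280: attach each ejected outlier to its most similar cluster.
    for i in ejected:
        best = max(range(len(clusters)),
                   key=lambda j: cos(vectors[i],
                                     vectors[clusters[j]].mean(axis=0)))
        clusters[best].append(i)
    return clusters
```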


Embodiments may calculate a plurality of scores or grades which may, for example, describe various relationships between the different entities and/or clusters considered, and which may be used as part of different clustering operations and procedures (e.g., in order to produce a taxonomy such as that depicted in FIG. 3). For example, embodiments may calculate or measure entity, node, or cluster generality and/or relevance scores or ranks and related indices according to the principles and formulas provided herein, for example in order to select cluster titles or exemplars.


One informative indicator for calculating or measuring, for example, word generality in a given corpus of documents may be the number of separate documents in which a given word occurs. One example document is a contact center interaction. Such an indicator may generally be considered as the frequency of occurrence of an entity (such as, e.g., a word) in a plurality of data or information items (which may be, e.g., documents). Thus, a document frequency (DF) index may be calculated, e.g., as a count of the documents including the word (or, e.g., a logarithm of this number or value) by embodiments of the invention given an input domain or corpus of documents. DF may be considered an informative measure in addition to, or separately from, the total word frequency. While more general or abstract terms may appear across a large number of documents, they might not appear as frequently within any individual document. For example, in a document or file which may describe, e.g., an interaction between remote computers, or between humans operating remote computing systems, such as a caller and a contact center agent, some more specific or concrete words may dominate as the conversation develops and becomes more detailed in nature (and may revolve, e.g., around a specific technical issue described by corresponding terms, such as “cellular network”, “download speed”, and the like, compared to less specific words such as “internet”). Thus, in some embodiments, DF may be calculated, e.g., based on parts of the documents considered, and/or on the frequency of occurrence in a given document, for example by omitting the first n lines (which may be, in some cases, associated with less informative contents), or by not considering terms appearing fewer than n times within the document. Additional or alternative conditions or criteria for calculating a frequency of occurrence or DF indices may be used in different embodiments of the invention.
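
A small sketch of a DF index with the optional filters mentioned above (log scaling with add-one smoothing is an assumption; the text only suggests "a logarithm of this number"):

```python
import math

def document_frequency(term: str, documents: list[str],
                       skip_lines: int = 0, min_count: int = 1) -> float:
    # Count documents containing the term, optionally skipping the first
    # skip_lines lines of each document and requiring at least min_count
    # in-document occurrences; return a log-scaled value.
    hits = 0
    for doc in documents:
        body = "\n".join(doc.splitlines()[skip_lines:]).lower()
        if body.split().count(term.lower()) >= min_count:
            hits += 1
    return math.log(1 + hits)  # add-one smoothing avoids log(0)
```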


Another indicator may be established based on the idea that the more “general” an entity such as a word may be, the more contexts in which it may occur, and hence the more co-words or joint entities it may have. A co-word may be defined and/or identified as a word which occurs in the same grammatical phrase as the word under consideration, and/or is found within a predetermined distance (such as, e.g., separated by at most 5 words within a sentence) from that word. For example, ‘channel’ would be a co-word of ‘TV’ since they frequently occur together in multi-word phrases such as ‘TV channel’, ‘I changed channels on the TV’, etc. Similar or equivalent general definitions may be formulated and used for non-word entities and joint entities (for example, based on a distance or similarity of a given entity and/or its attributes within a database or repository from those of other entities and/or their attributes within the same repository or database). Co-words may generally be identified in cases where they are linked to a given word by a dependency parser, where various such dependency parsers may be defined or used in different embodiments of the invention (and may include, for example, ‘-’, ‘on the’, and ‘with the’, as well as broader or more abstract grammatical relationships such as subject-object and the like, for example, based on various grammatical standards for subjecthood such as nominal subjecthood or nsubj, and the like). More generally, co-words may be considered a particular kind of joint entities for a given entity, that is, entities that appear in conjunction with that particular entity (for example, joint entities may be defined and/or identified by being included in at least a minimum number or percentage of data or information items which also include the entity under consideration, although other definitions may be used). In some embodiments of the invention, a joint-entity index such as for example a co-word count (CWC) index, which may be, for example, a logarithm of the number of different or distinct co-words found for a given word within a set of documents, may be calculated. The calculated CWC index for a given entity or word may be compared to a predetermined threshold of minimum co-words. Such a threshold may reflect the minimum co-occurrence threshold for a word to be considered ‘general’ by embodiments of the invention. In some embodiments, a “sliding window”, e.g., of predefined length, may be used to define or capture co-words found before or after a given word, for example without requiring a particular dependency parser. Additional or alternative conditions or criteria for capturing co-words or joint entities and calculating CWC or joint-entity indices may be used in different embodiments of the invention.
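
A sliding-window co-word capture and CWC index might be sketched as follows (the window size and add-one smoothing are illustrative assumptions):

```python
import math

def co_words(term: str, documents: list[list[str]],
             window: int = 5) -> set[str]:
    # Capture every distinct word found within `window` positions of the
    # term, with no dependency parser required.
    found: set[str] = set()
    for tokens in documents:
        for pos, tok in enumerate(tokens):
            if tok == term:
                lo, hi = max(0, pos - window), pos + window + 1
                found.update(t for t in tokens[lo:hi] if t != term)
    return found

def cwc_index(term: str, documents: list[list[str]],
              window: int = 5) -> float:
    # CWC: log of the number of distinct co-words found for the term.
    return math.log(1 + len(co_words(term, documents, window)))
```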


A joint-entity-spread or co-word spread (CWS) index may follow on from the CWC index but go a step further: in addition to the number of distinct co-words, which may be relevant for capturing the generality of a word appearing in multiple contexts, the diversity of these contexts may be taken into account, based on, e.g., calculating how semantically ‘spread out’ the different co-words found for a given word are. More generally, a joint-entity-spread index may be based on a distance or dissimilarity of each joint entity from the given entity. For example, consider a certain word with a large number of tightly knit co-words, and a second word having the same number of co-words, but with the latter's co-words being more varied and diverse. The latter word may accordingly be considered more general. To measure or calculate the co-word spread for a given word w, the mean similarity of the word's vector embedding to the respective vector embeddings of each of its co-words xi (i=1, 2, . . . , n) may be calculated by embodiments of the invention as, for example:










$$\mathrm{CWS}(w) \;=\; \frac{\sum_{i=1}^{n} \mathrm{Sim}(w, x_i)}{n}\,; \qquad x_i \in \mathrm{co\_words}(w),\; i = 1, 2, \ldots, n \qquad \text{(Eq. 2)}$$






where Sim may, for example, be defined as in Eq. 1. Additional or alternative similarity measures and formulas for calculating joint-entity-spread or CWS indices may be used in different embodiments of the invention.


Another measure for the generality of a word may involve, given a certain multi-word phrase, finding a primary or ‘head’ word with respect to which other word(s) are secondary or ancillary. For example, in the phrase ‘TV channel’, one may intuitively recognize that ‘TV’ is the headword, informing a general subject or domain, while ‘channel’ is the particular aspect of that domain being focused on. A relative weighted frequency of occurrence, or relative DF (RDF), score or index of an entity or word based on, or with respect to, the (average) DF scores of its joint entities or co-words may be used as a measure of such a characteristic or attribute by embodiments of the invention. In some embodiments, the RDF of a word i may be defined and calculated as, for example:










$$\mathrm{RDF}_i \;=\; \frac{\log(\mathrm{DF}_i)}{\tfrac{1}{n}\sum_{j \in \mathrm{co\_words}(i)} \log(\mathrm{DF}_j)} \qquad \text{(Eq. 3)}$$






In some embodiments, a variant may include the average of the DFjs—weighted by their respective co-occurrence frequencies with i, although additional or alternative variants may also be used. A high relative DF score (e.g., above a predetermined threshold) may indicate that, at least in the contexts in which it appears, a given word may be the head word and hence of a more general nature than its co-words. The generality of a word may accordingly still be recognized despite having a relatively low global DF value or score.
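
Eq. 3's basic (unweighted) variant might be computed as follows, assuming precomputed DF counts and co-word sets (the toy values are made up for illustration):

```python
import math

def rdf_index(term: str, df: dict[str, int],
              co_words: dict[str, set[str]]) -> float:
    # Eq. 3: log(DF) of the term divided by the mean log(DF) of its co-words.
    neighbors = co_words[term]
    mean_log_df = sum(math.log(df[j]) for j in neighbors) / len(neighbors)
    return math.log(df[term]) / mean_log_df

df = {"tv": 120, "channel": 90, "display": 40}
co = {"tv": {"channel", "display"}}
print(rdf_index("tv", df, co))  # > 1 suggests 'tv' heads its contexts
```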


Some or all of the above indices, factors or components may be used and/or combined to obtain an overall generality index or score for a given word, which may be calculated by various embodiments of the invention. Since the outputs of each of the above calculations may be of a different scale, some embodiments may include normalizing the values or scores for each category across the domain vocabulary—for example, by or based on dividing each one by the maximum calculated value for that index category, resulting in values between 0 and 1, relative to that maximum value. An overall word generality index (WGI), which may be used as a generality score, for a given word w may thus be defined and calculated by embodiments of the invention, for example, according to the following example equation:










$$\mathrm{WGI}_w \;=\; \alpha(\mathrm{DF}_w) \cdot \beta(\mathrm{RDF}_w) \cdot \gamma(\mathrm{CWC}_w) \cdot \delta(\mathrm{CWS}_w) \qquad \text{(Eq. 4)}$$







where α, β, γ, δ may be coefficients or weights that may be assigned to each of the scores or values considered herein, which may be included in the WGI calculated by some embodiments of the invention. In some embodiments, all weights may be equal by default (e.g., set to 1). Additional or alternative normalization or weighting may be included in different embodiments of the invention.
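
The normalization and weighted combination of Eq. 4 might be sketched as follows (the raw index values below are made-up toy numbers; all weights default to 1 as in the text):

```python
def wgi(raw: dict[str, dict[str, float]],
        weights: dict[str, float] | None = None) -> dict[str, float]:
    # Eq. 4: per-word product of the four indices, each first normalized by
    # its maximum across the vocabulary (yielding values in [0, 1]).
    weights = weights or {"DF": 1.0, "RDF": 1.0, "CWC": 1.0, "CWS": 1.0}
    maxima = {idx: max(scores[idx] for scores in raw.values())
              for idx in ("DF", "RDF", "CWC", "CWS")}
    out = {}
    for word, scores in raw.items():
        value = 1.0
        for idx in ("DF", "RDF", "CWC", "CWS"):
            value *= weights[idx] * scores[idx] / maxima[idx]
        out[word] = value
    return out

raw_indices = {
    "tv":      {"DF": 120, "RDF": 2.1, "CWC": 4.0, "CWS": 0.55},
    "channel": {"DF": 90,  "RDF": 0.9, "CWC": 3.2, "CWS": 0.40},
}
generality = wgi(raw_indices)  # 'tv' scores higher on every index here
```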


Embodiments of the invention may cluster a plurality of entities (such as words and terms extracted from or identified within a plurality of documents as described herein) according to generality and/or relevance scores, indices or metrics as described herein. For example, generality and/or relevance scores (such as for example WGIs or TextRank scores as described herein), may be considered or incorporated as a priori conditions or preferences into various clustering and/or grouping protocols and procedures, e.g., to enable selecting ‘exemplars’, or cluster titles, as part of the clustering of entities into a multi-tiered taxonomy.


In some embodiments of the invention, which may include or involve clustering approaches and techniques such as, e.g., Affinity Propagation (AP), words may be selected from the original set (e.g., through an iterative process) to serve as exemplars or representatives for other words (such that exemplars may for example be used as cluster labels, topics or titles for the latter words). As part of such a procedure, some embodiments may, for example, select the exemplars or cluster titles or labels and/or perform clustering operations, e.g., based on an affinity matrix as described herein. Following the selection of exemplars for a given set or cluster of nodes, embodiments of the invention may group or cluster the remaining, unselected nodes under the selected exemplar or exemplars, and for example iteratively repeat various steps included in the clustering procedure to automatically generate a domain taxonomy, e.g., as further described herein.


In other embodiments, a priori input preferences or predefined constraints (including, for example, various upper/lower threshold values for calculated indices that may be applied such that, e.g., if an index calculated for a given word is below a predetermined threshold, then it may be set to zero) may be combined or integrated into a clustering method or algorithm such as for example the AP algorithm. Thus, exemplars selected as part of a clustering algorithm or procedure (e.g., when a clustering algorithm hits a stop block, and/or upon convergence of an iterative process, e.g., until appropriate convergence criteria are met, as known in the art) may be ‘representative’ of the other cluster members, taking into account not only their similarities to the other cluster members or affinity matrix values, but also the a priori preferences supplied as input, such as for example some or all of the above word generality measures or metrics.


For example, based on WGI scores input to a clustering algorithm or procedure as a priori preferences or preconditions, more ‘general’ words (e.g., for which WGI scores above a threshold of 0.8 were calculated) may be chosen or selected as cluster exemplars per clustering iteration. Such a precondition may lead to clustering results possessing some of the desired characteristics of a taxonomy discussed herein. In another example, clusters or pairs of nodes of less general words (e.g., for which WGI scores below a threshold of 0.3 were calculated) may not be merged with clusters or pairs of nodes containing more general words (e.g., WGI>0.5) in a given clustering or ABC iteration. Additional examples may be based on, e.g., RDF scores indicating that less frequent words (e.g., characterized by DF<30 and RDF>5; or, e.g., DF<0.8 and RDF>0.7 in a case where scores may be normalized with respect to other cluster members as demonstrated herein in the context of probabilistic selection of exemplars) are, in fact, more general than their more frequent counterparts (e.g., DF>30 and RDF<1; or, e.g., DF>0.8 and RDF<0.2 when normalized scores are considered). Thus, less frequent words may be chosen as exemplars based on a priori conditions incorporating such RDF scores.


In some embodiments of the invention, WGIs may be used in ABC clustering iterations, e.g., as part of calculating or recalculating a centroid or contextualized embedding for a given cluster, for example in order to give more weight to more general terms within a given cluster. Thus, relevancy scores (such as, e.g., TextRank scores) may comprise or include generality scores such as the WGIs described herein. In one example, WGIs may be used instead of, e.g., TextRank relevancy scores for weighting different entities within a cluster. In another example, WGIs may be combined with TextRank scores, e.g., such that TextRank scores may be normalized using WGIs, or vice versa, e.g., in a manner similar to that demonstrated herein with regard to normalizing affinity values using WGIs. In this context, one skilled in the art may recognize that various normalization or weighting formulas may be used in different embodiments of the invention.


Those skilled in the art would recognize that additional embodiments of the invention may be considered or realized in various example contexts and scenarios where the calculation of generality or relevance of words and/or entities may be considered or incorporated into clustering protocols and procedures as predetermined conditions or criteria, for example to form a hierarchical, multi-tiered taxonomy as described herein.


In some embodiments, word generality metrics or values of the preferences or thresholds input to a clustering procedure or algorithm may be further normalized, weighted or scaled, e.g., based on values or elements included in the affinity or connectivity matrices and/or a plurality of arithmetic operations. Conversely, similarity, affinity or connectivity matrix values may be scaled or normalized based on word generality metrics or values. In some embodiments, constraints or conditions applied to, for example, statistical parameters derived from WGI scores or related metrics, such as, e.g., a median WGI score or the range of all calculated WGI scores, may be input as the preference of each word to a clustering procedure. In one example, the interval or range [MIN-WGI, MAX-WGI] for clustered entities may be used as a normalization or scaling factor S in, e.g., (1/S)(affinity_value), which may normalize affinity values to account for more or fewer clusters as part of a particular clustering procedure. Similar scaling or normalization procedures may be introduced, e.g., to scale WGI scores based on affinity or similarity values, and alternative or additional such procedures may be used in different embodiments of the invention.
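One possible reading of the scaling above, sketched for illustration only, under the assumption that the factor S is taken as the width of the range [MIN-WGI, MAX-WGI]:

```python
import numpy as np

def scale_affinity(affinity: np.ndarray, wgi: dict) -> np.ndarray:
    # Use the width of the WGI range [MIN-WGI, MAX-WGI] as scaling factor S
    # in (1/S) * affinity_value; guard against a degenerate zero-width range.
    values = list(wgi.values())
    s = (max(values) - min(values)) or 1.0
    return affinity / s
```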


Similarly, in some embodiments of the invention, WGI scores may be normalized and used as probabilistic factors for choosing an exemplar. For example, in a cluster including terms A, B, C, and given WGI(A)=0.8, WGI(B)=0.7, and WGI(C)=0.5, the probability of choosing term A as an exemplar for the cluster by the AP algorithm may be P(A)=0.8/(0.8+0.7+0.5)=40%, and the corresponding probabilities for terms B and C may be P(B)=35%, and P(C)=25%.
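For illustration, a minimal sketch of such probabilistic exemplar selection; `wgi` is a hypothetical mapping from the cluster's terms to their WGI scores:

```python
import random

def choose_exemplar(wgi: dict) -> str:
    # Sample an exemplar with probability proportional to normalized WGI;
    # e.g., WGI(A)=0.8, WGI(B)=0.7, WGI(C)=0.5 yields P(A)=40%, P(B)=35%,
    # P(C)=25%, matching the example above.
    terms = list(wgi)
    return random.choices(terms, weights=[wgi[t] for t in terms], k=1)[0]
```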



FIG. 4 illustrates an example clustering procedure incorporating word generality indices according to some embodiments of the invention. In step 410, entities, terms or words may be clustered based on semantic similarity, e.g., as may be calculated based on a vector embedding model as described herein (such as for example with regard to the ABC protocol). Words or entities within a cluster may then be selected and identified as representative of the cluster based on WGI scores or indices, to serve as cluster labels, titles, or names: for example, the word within a cluster for which the highest WGI is calculated may be chosen by some embodiments of the invention as a label, title or name for that cluster (step 420). The chosen or selected entities or words may subsequently be removed from the cluster which they were chosen to represent (step 430). In some embodiments of the invention, entities or words chosen as titles may not be removed and thus remain as members of the cluster.


In another example, exemplars or cluster labels may only be removed before further clustering a given cluster into sub-clusters, so that they may not reappear in a lower level of the hierarchy. For example, given ‘internet’ as the exemplar of the cluster including {internet, speed, download, upload}, ‘internet’ may be removed when breaking this cluster into sub-clusters {speed} and {download, upload}. Each exemplar or label may thus appear in one level in the hierarchy, and it may be removed such that the next most general terms in the cluster (e.g., having the next highest WGI scores) may then serve as the exemplars of the sub-clusters, e.g., in a subsequent level in the hierarchy.
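A minimal, illustrative sketch of this label-then-remove recursion; `split` is a hypothetical callback performing one clustering pass (e.g., an AP run) on the remaining members and returning its sub-clusters:

```python
def build_taxonomy(cluster: set, wgi: dict, split) -> dict:
    # Label a non-empty cluster with its highest-WGI member, remove the
    # label from the members, and recurse on each sub-cluster so that each
    # exemplar appears at exactly one level of the hierarchy.
    label = max(cluster, key=lambda w: wgi[w])   # most general word
    rest = cluster - {label}                     # exemplar removed
    if len(rest) <= 1:
        return {label: sorted(rest)}
    return {label: [build_taxonomy(sub, wgi, split) for sub in split(rest)]}
```

With the example above, {'internet', 'speed', 'download', 'upload'} would be labeled 'internet' and the recursion would continue on {speed} and {download, upload}.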


In some embodiments of the invention, affinity matrix values and WGI scores may be input simultaneously as a priori conditions into a clustering algorithm (such as, e.g., the AP algorithm), which may then determine, on the basis of both inputs, both which terms are to serve as exemplars and which terms should be clustered together (e.g., such that each term is simply clustered together with its nearest exemplar). For example, embodiments may first normalize or scale WGI scores by affinity matrix values as described herein, then select exemplars based on the normalized or scaled WGI scores, and then cluster each of the remaining words with the exemplar closest or most similar to it. In other words, each term may be clustered or linked with its nearest exemplar (which can be used, e.g., as a cluster title as described herein).



FIG. 5 depicts an example affinity-propagation-based word clustering algorithm incorporating word generality indices according to some embodiments of the invention. In step 510, a plurality of words or terms may be input to the algorithm. Input words may be, for example, the N nouns (e.g., where N=50) appearing the largest number of times in a given domain lexicon (which may include for example thousands of documents). WGIs may then be calculated for each of the input words, for example according to Eq. 4 (step 520). An AP clustering procedure may then be executed on the input words while using or employing calculated WGIs as clustering “preferences” (step 530). In one example, the AP procedure may include normalizing or scaling calculated WGI scores using affinity matrix values as described herein. In another example, the clustering procedure or algorithm may refrain from including, in the same cluster, two or more words for which a WGI higher than a predetermined threshold was calculated. In yet another example, the algorithm may cluster together a plurality of words only if the average WGI calculated based on the WGIs for all words under consideration is below a predetermined threshold, or between two such thresholds. Different conditions and criteria, as well as quantitative and statistical parameters, may be used in different embodiments of the invention, and such conditions may also be applied to underlying indices such as for example the DF, CWC, CWS, and RDF indices considered herein. As part of step 530, words may be chosen as “exemplars” or identified as representative of a given cluster as described herein (for example, an exemplar being the word having a scaled WGI score above a predetermined threshold). Each of the remaining words or terms may then be clustered or linked with the exemplar closest to them, and exemplars may subsequently be removed from the cluster and serve as cluster titles or labels as described herein (step 540). One skilled in the art would recognize that different steps, workflows and clustering techniques may be used in different procedures and/or algorithms according to different embodiments of the invention.
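For illustration only, one way to realize a step such as step 530 with scikit-learn's AffinityPropagation, assuming a precomputed similarity (affinity) matrix and per-word WGI scores supplied as the AP preferences; in practice the WGI values may first need to be scaled to the range of the similarity values, as discussed above:

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

def ap_with_wgi(words: list, similarity: np.ndarray, wgi: np.ndarray):
    # Pass per-word WGI scores as AP 'preference' values so that more
    # general words are more likely to be selected as exemplars.
    ap = AffinityPropagation(affinity="precomputed", preference=wgi,
                             random_state=0)
    labels = ap.fit_predict(similarity)
    exemplars = [words[i] for i in ap.cluster_centers_indices_]
    # Each remaining word is linked with its nearest exemplar.
    return exemplars, {w: exemplars[l] for w, l in zip(words, labels)}
```

The `preference` parameter is scikit-learn's standard hook for biasing exemplar selection, which is one natural place to inject generality scores under these assumptions.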


Methods, procedures, and approaches provided herein may be combined with various additional techniques and procedures, such as for example, different clustering algorithms (which may include, for example, both “soft” and “hard” clustering approaches) and associated techniques (relating, e.g., to a plurality of ABC iterations as described herein, calculating and ranking text relevance or generality scores, and/or to verifying, analyzing, or ensuring robustness of a clustering result or output) to provide different embodiments of the invention.


Additional/alternative embodiments of the invention may use or employ the generated taxonomy as part of various computer-based procedures and protocols, including, but not limited to, additional and/or different entity clustering and classification procedures, search protocols, and the like.


In some embodiments of the invention, additional entities may be received following the calculation of vector representations for entities or nodes (e.g., by a Word2vec model), and/or following the clustering of at least some nodes or entities as described herein. In such embodiments, the additional entities may themselves be clustered (e.g., separately from the previously clustered entities) based on preceding calculations and/or clustering operations. For example, once a domain taxonomy such as for example the one depicted in FIG. 3 has been calculated, embodiments of the invention may receive a plurality of documents as additional entities. Embodiments may then search the contents of the received documents for terms included in the taxonomy, and for example group or cluster documents including the terms “Fox” and “CNN” in a single group, e.g., even if none of the documents contains any terms such as “News”, “Channel”, and the like. Those skilled in the art would recognize many additional or alternative examples in which embodiments of the invention may use or utilize past calculations and/or past clustering results to, e.g., separately cluster or categorize subsequent, additionally received entities. In particular, similar embodiments may be used in the context of a contact center as further discussed herein.
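A minimal sketch of such term-based grouping of additionally received documents, for illustration only; `docs` and `clusters` are hypothetical inputs (document texts, and previously generated taxonomy clusters mapping a label to its member terms):

```python
def group_documents(docs: dict, clusters: dict) -> dict:
    # Documents mentioning any member term of a cluster (e.g., 'Fox' or
    # 'CNN') are grouped under that cluster's label, even if the label
    # itself (e.g., 'News') never appears in them.
    groups = {label: [] for label in clusters}
    for doc_id, text in docs.items():
        tokens = set(text.lower().split())
        for label, terms in clusters.items():
            if tokens & {t.lower() for t in terms}:
                groups[label].append(doc_id)
    return groups
```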


In another example, a plurality of search results for an input query may be provided by embodiments of the invention based on a generated taxonomy or corresponding vector representations for a plurality of entities or terms. For instance, embodiments may receive “Fox” as an input query, search a database or corpus of documents and find no documents containing the term “Fox”. However, based on a taxonomy such as the one depicted in FIG. 3, embodiments may further search for documents containing the terms “CNN” and “CNBC”, which were previously represented and/or clustered as similar to “Fox” (in the context of media providers), and provide such documents as search results for the input query. In this context, similarly to word or term extraction techniques and protocols, various search procedures which may be included or used in some embodiments of the invention are known in the art.
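For illustration, a minimal sketch of such taxonomy-based query expansion, assuming previously generated clusters are available as a mapping from labels to member terms:

```python
def expand_query(query: str, clusters: dict) -> set:
    # Expand an input query with terms previously clustered as similar to
    # it; e.g., 'Fox' may be expanded to {'Fox', 'CNN', 'CNBC'} so that
    # documents containing the related terms can be returned as results.
    expanded = {query}
    for terms in clusters.values():
        if query in terms:
            expanded |= set(terms)
    return expanded
```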


Various outputs, such as, e.g., clusters and taxonomies produced or provided by embodiments of the invention, may be stored in various formats, such as for example tables, graph databases, JSON files, and the like. Those skilled in the art would recognize that various data formats may allow or enable, e.g., clustering additional, newly received entities based on a previously generated taxonomy, or providing search results based on such a taxonomy as described herein.
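As one illustrative possibility, a taxonomy represented as a nested mapping (cluster title to sub-clusters or member terms) may be persisted to and reloaded from a JSON file; the structure shown is hypothetical:

```python
import json

# A hierarchical taxonomy serializes naturally to JSON and can be reloaded
# later, e.g., to cluster newly received entities or answer queries.
taxonomy = {"internet": {"speed": [], "download": ["upload"]}}
with open("taxonomy.json", "w") as f:
    json.dump(taxonomy, f, indent=2)
with open("taxonomy.json") as f:
    reloaded = json.load(f)
```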


Two sets of clustering results shown herein may illustrate how embodiments of the invention, using approaches, techniques and procedures as described herein, may improve the quality of hierarchical clustering—for example in the context of creating a domain taxonomy. For example, given a plurality of input entities such as a corpus of documents containing a plurality of words, systems and methods based on, for example, the standard Affinity Propagation clustering procedure may result in the following output groups or clusters:

    • bill, payment, balance, statement, total, mail
    • charge, fee, cost
    • connection, signal, network, setting, wire, wireless, test
    • customer, loyalty
    • credit, card, debit, digit, social
    • home, house
    • password, store, id, application, user, reference
    • information, access, update, detail, info, record
    • internet, speed, basic, data
    • message, screen, text, error, page
    • equipment, modem, device, router
    • name, list, family
    • order, approval
    • phone, line, mobile, security, computer, port
    • plan, package, price, offer, promotion, tax, discount, rate, contract, deal, bundle
    • button, power, light, remote, control, voice, program, movie
    • support, department, agent, representative, supervisor, technician, appointment
    • tv, box, cable, channel, stream, room, video, play, address, code, zip, area, apartment, city, location, verification


In contrast, a clustering procedure incorporating some or all of the techniques, protocols and constraints provided herein may result, for example, in the following output:

    • name, information, password, user, id, list, family, address, code, area, zip, apartment, error, location, city, digit, verification, social, info
    • customer, support, department, store, agent, representative, supervisor, record, loyalty, reference
    • bill, charge, payment, order, credit, card, balance, statement, total, debit, approval, mail, page
    • plan, package, fee, price, offer, tax, promotion, detail, discount, rate, contract, cost, deal, bundle
    • device, equipment, modem, access, technician, signal, update, power, light, network, appointment, setting, wire, router, wireless, test
    • internet, connection, home, speed, basic, data
    • tv, box, cable, channel, remote, button, screen, stream, control, program, video, room, voice, play, movie
    • phone, line, mobile, message, text, security, computer, port, application


      where words in bold may, for example, be chosen as cluster titles, labels or exemplars as discussed herein, and may thus be selected, e.g., as representative of a given cluster. Accordingly, clusters may be merged or organized in a hierarchical manner to form a domain taxonomy such as, e.g., the example taxonomy depicted in FIG. 3.


It should be noted that in some embodiments of the invention, exemplars may be removed from the cluster they are chosen to represent or describe, while in other embodiments exemplars may be kept as entities or nodes within the relevant cluster.


Terms, clusters, and taxonomies produced or provided by embodiments of the invention may be displayed in an appropriate format and/or visualization such as, e.g., a graph, a report, and the like.



FIG. 6 is an example visualization of an automatically generated, hierarchical domain taxonomy according to some embodiments of the invention. It may be seen that clusters may include additional clusters in a hierarchical manner, which may be achieved based on word generality or relevance scores as calculated or quantified by embodiments of the invention, as well as, for example, based on the various clustering protocols discussed herein. In some embodiments, visualizations may be included in reports, which may be sent to various parties of interest, such as for example a supervisor responsible for or interested in tracking or monitoring, e.g., contact center activity (see additional discussions regarding some example uses of taxonomies in contact center environments herein). It should be noted that other visualization types and frameworks may be used in different embodiments of the invention.


Taxonomies produced by embodiments of the invention may be used in organizations such as call centers, which may create and/or document and/or store “interactions”, which may be represented, e.g., as transcripts. Such interaction data and/or corresponding transcripts may be or may describe conversations or data exchanged between, typically, an agent or representative (usually human) of the company and a customer. Interactions may include, for example, voice, audio or video recordings of conversations, and/or other data such as text, e-mail or instant messaging exchanges. Interactions may be converted from one format to another, and may include more than one different format of data: e.g., an interaction may include an audio conversation and/or a text version of that conversation created by, for example, automatic speech recognition (ASR). Text versions of interactions may be stored and searched.



FIG. 7 is a block diagram of remotely connected computer systems according to some embodiments of the present invention. While FIG. 7 shows such a system in the context of a contact center, embodiments of the invention may be used in other contexts. Incoming interactions 20 (e.g. conversations, telephone calls, interactive voice response interactions, etc.) among people 3 (e.g., customers) and agents 5 may enter a contact center 10 and be routed for example by a PBX (private branch exchange) 25 or other equipment to relevant systems, such as interactive voice response (IVR) block or processor 32, Internet sessions or web block 34 and voice interactions block or recorder 30. People 3 may operate external user equipment 4 to communicate with agents 5 via contact center 10; and agents 5 may operate agent terminals 6 for that communication and other purposes. Incoming interactions 20 may be pre-processed and may enter the system as text data, or may be converted to text via ASR module 22.


User equipment 4, agent terminals 6 and user terminals 8 may include computing or telecommunications devices such as personal computers or other desktop computers, conventional telephones, cellular telephones, portable or tablet computers, smart or “dumb” terminals, etc., and may include some or all of the components such as a processor shown in FIG. 1.


Interaction data or documents may be stored, e.g., in files and/or databases. For example, logger 40, menus logger 42, and web-page logger 44 may record information related to interactions, such as the content or substance of interactions (e.g. recordings and/or transcripts of telephone calls) and metadata (e.g. telephone numbers used, customer identification (ID), etc.). In the case that documents other than interactions are used, other databases may be used. The data from contact center 10 may be output, sent or exported to an analysis center 50, which may be part of contact center 10, or external to and/or remotely located from contact center 10.


Analysis center 50 may perform functions such as those shown in FIGS. 2-6 and 8-13 herein, and may include for example embedding module 52, which may for example include the Word2vec model and related clustering operations discussed herein. Analysis center 50 may communicate with one or more user terminals 8 to for example provide visualizations (such as for example the one provided in FIG. 6).


One or more networks 12 may connect equipment or modules not physically co-located, for example connecting external user equipment 4 to contact center 10, and contact center 10 to analysis center 50 and agent terminals 6. Agent terminals 6 may thus be physically remote from user equipment 4. Networks 12 may include for example telephone networks, the Internet, or other networks. While in FIG. 7 contact center 10 is shown passing data to analysis center 50, these modules may communicate via a network such as networks 12.


Web block 34 may support web interactions over the Internet (e.g., operate web pages which may be executed in part on user equipment), IVR block 32 may provide menus and other information to customers and obtain selections and other information from customers, and recorder 30 may process or record voice sessions with customers. It may be appreciated that contact center 10 presented in FIG. 7 is not limiting and may include any blocks and infrastructure needed to handle voice, text (SMS (short message service), WhatsApp messages, chats, etc.), video, and any other type of interaction with customers.


Each of the modules and equipment such as contact center 10, ASR module 22, PBX 25, IVR block 32, voice interactions block or recorder 30, menus logger 42, web block 34, analysis center 50, external user equipment 4, agent terminals 6, user terminals 8 and other modules discussed herein may be or include a computing device such as included in FIG. 1, although various units among these modules may be combined into one computing device. Agent terminals 6 and user equipment 4 may be remote or physically separate computer systems communicating and/or connected over networks 12.


Some embodiments of the invention may be used, for example, to organize or categorize a corpus or plurality of documents describing, e.g., interactions between customers/users and agents in a call or contact center or in a plurality of call centers. For example, hundreds of customer interactions handled by a wide variety of call centers belonging to multiple, different industries, may automatically be organized and/or sorted by embodiments of the invention into corresponding taxonomies, which may include a wide range of words and terms describing, for example, various different products, customer reported issues, and use-cases.


Additionally or alternatively, interactions may be categorized, sorted, or associated among themselves according to, or based on, a previously generated taxonomy. In this context, different protocols and procedures may be used in different embodiments of the invention, such as for example ones demonstrated herein for receiving and clustering additional entities following previous calculations or clustering procedures. Additional steps or repetitions of steps such as, e.g., extracting words from documents, calculating generality scores or metrics, selecting nodes as exemplars, and clustering nodes under the selected exemplars may also be introduced for the interactions under consideration. In another example, once a taxonomy has been built or previously generated, a vector embedding model (which may be, e.g., different from the model already used for generating vector embeddings for words and/or documents in a given domain, as described herein) may subsequently be used by some embodiments to create contextualized, semantic embedding vectors for each word or term in the context of the generated taxonomy, for example by combining or concatenating embeddings describing related words, or words pertaining to the same cluster. One skilled in the art would recognize, however, that different procedures and protocols for categorizing interactions may be performed by different embodiments of the invention based on previously executed clustering operations and previously generated taxonomies.



FIG. 8 is a flow diagram depicting an example procedure for organizing call center interactions according to a taxonomy established by some embodiments of the invention. In step 810, interactions are stored after being executed in the call center (such as a multi-component call center system as, e.g., depicted in FIG. 7). By extracting terms or words from the interactions, a domain lexicon may subsequently be built (step 820). A word, phrase or document vector embedding model (such as for example a Word2vec model as described herein) may be trained based on the resulting lexicon (step 830). A taxonomy may then be automatically generated according to the protocols and procedures described herein (step 840). Optionally, a user may subsequently edit the generated taxonomy, for example in order to modify the automatically generated clustering or grouping of terms according to user preferences (step 850).


Using a taxonomy generated according to some or all of the principles and procedures outlined herein, any given call, or part of an interaction (such as for example particular phrases, parts of a conversation, etc.), may automatically be indexed, sorted or decomposed into its main topics, keywords, and the like. Some embodiments of the invention may further group various other words or entities into the various topics and sub-topics in the taxonomy. For example, particular user-reported issues, e.g., in a technical support call, may be categorized alongside their commonly associated agent responses, based on words or terms included in the call and the corresponding taxonomy (step 860). In addition, by grouping or aggregating words from different interactions, which may for example be associated with different (e.g., unrelated) call centers and/or industries, an overall statistical summary of words or terms which may be recognized as related topics and/or reported issues, as well as of their respective proportions, may be built or constructed (step 870). In addition, embodiments of the invention may further monitor trends, or perform semantic topic monitoring in incoming interactions or calls, e.g., based on or according to a generated taxonomy and/or corresponding historic calculations and clustering procedures (step 880). In such manner, embodiments of the invention may offer insight into interaction trends relating to, e.g., which user or customer reported issues are most statistically common, and to shifts and changes in such commonly reported issues over periods of time (which may be performed, in some embodiments, by comparing past calculations and/or scores and/or clustering results and/or taxonomies to one another).
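For illustration, a minimal sketch of building such a statistical summary; `interactions` and `topic_terms` are hypothetical inputs (term sets extracted from individual interactions, and taxonomy topics mapped to their member terms):

```python
from collections import Counter

def topic_summary(interactions: list, topic_terms: dict) -> dict:
    # Count, across interactions, how often each taxonomy topic is touched
    # (i.e., any of its member terms appears), and report proportions,
    # e.g., for trend monitoring over successive time windows.
    counts = Counter()
    for terms in interactions:
        for topic, vocab in topic_terms.items():
            if terms & vocab:
                counts[topic] += 1
    total = sum(counts.values()) or 1
    return {topic: n / total for topic, n in counts.items()}
```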


In some embodiments of the invention, contact center activity—such as for example the routing of interactions from users 3 to agents 5 by PBX 25, the recording of interactions by recorder 30, and the like—may be modified, altered or optimized (for example, dynamically, in real time) according to, or performed based on, a generated taxonomy. For example, a contact center system (such as for example the one depicted in FIG. 7) may be configured to route incoming interactions to particular agents based on the agents' expertise or skills matching topics included in or associated with the incoming interaction. This may be done, e.g., in case a previously generated taxonomy associates the agents' skills (which may, e.g., be extracted from agent records, where various skills may be stored as keywords associated with a given agent ID) with incoming interaction topics. Such association may take place, for example, based on the words describing the topic and the agent's skills having been clustered under the same exemplar in a given taxonomy. Additionally or alternatively, the system may be configured to record more or fewer interactions within a given timeframe where a particular topic or issue is clustered under a particular exemplar which may be, e.g., manually labeled as less important according to storage space saving considerations. In this context, different techniques of extracting agent skills and interaction topics, or of associating skills and topics with appropriate keywords, may be used in different embodiments of the invention. One skilled in the art would recognize that alternative utilizations of taxonomies in the context of contact center interactions may be offered by different embodiments of the invention.
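A minimal, illustrative sketch of such exemplar-based routing; `exemplar_of` is a hypothetical mapping from a word to the taxonomy exemplar it was clustered under, and `agent_skills` maps agent IDs to skill keywords:

```python
from typing import Optional

def route_interaction(topic_terms: set, agent_skills: dict,
                      exemplar_of: dict) -> Optional[str]:
    # Route an incoming interaction to an agent whose skill keywords share
    # a taxonomy exemplar with the interaction's topic terms.
    topics = {exemplar_of[t] for t in topic_terms if t in exemplar_of}
    for agent_id, skills in agent_skills.items():
        if topics & {exemplar_of[s] for s in skills if s in exemplar_of}:
            return agent_id
    return None  # e.g., fall back to default routing
```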


Embodiments of the invention improve call center and interaction routing technology by providing an effective and semantically sensitive approach for automatically categorizing interactions, which further enables dynamic optimization and management of contact center activity—such as the routing of interactions based on frequently reported topics and subtopics. Those skilled in the art would recognize that similar or equivalent improvements may be offered by embodiments of the invention in contexts, systems, and environments different from those associated with a call or contact center. Embodiments more generally offer an improvement to clustering procedures and approaches by allowing the automatic organization of clustered entities (including, but not limited to, words, phrases and terms) in complex and informative structures, which may be hierarchical and/or multi-tiered as described herein, while having desirable semantically significant and statistically robust qualities (as reflected, for example, in the differences between the relationship and hierarchy among tiers 210, 220, and 230 in FIG. 2, and those among tiers 310, 320, 330, 340 in FIG. 3 as described herein, the differences being associated with some of the technological improvements offered by some embodiments of the invention).



FIG. 13 is a flow diagram illustrating an example method for an automatic clustering of nodes according to some embodiments of the invention. In step 1310, embodiments may calculate distances between nodes for a plurality of pairs of nodes (such as for example for each of these pairs), where the pairs or underlying nodes may include, e.g., a plurality of entities and/or initial clusters. Embodiments may then select pairs based on the calculated distances (step 1320), and merge the selected pairs, which may include a common node or member, into final clusters (step 1330).


One skilled in the art will realize the invention may be embodied in other specific forms without departing from the spirit or essential characteristics thereof. The foregoing embodiments are therefore to be considered in all respects illustrative rather than limiting of the invention described herein. Scope of the invention is thus indicated by the appended claims, rather than by the foregoing description, and all changes that come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein.


In the foregoing detailed description, numerous specific details are set forth in order to provide an understanding of the invention. However, it will be understood by those skilled in the art that the invention can be practiced without these specific details. In other instances, well-known methods, procedures, and components, modules, units and/or circuits have not been described in detail so as not to obscure the invention. Some features or elements described with respect to one embodiment can be combined with features or elements described with respect to other embodiments.


Although embodiments of the invention are not limited in this regard, discussions utilizing terms such as, for example, “processing,” “computing,” “calculating,” “determining,” “establishing”, “analyzing”, “checking”, or the like, can refer to operation(s) and/or process(es) of a computer, a computing platform, a computing system, or other electronic computing device, that manipulates and/or transforms data represented as physical (e.g., electronic) quantities within the computer's registers and/or memories into other data similarly represented as physical quantities within the computer's registers and/or memories or other information non-transitory storage medium that can store instructions to perform operations and/or processes.


The term set when used herein can include one or more items. Unless explicitly stated, the method embodiments described herein are not constrained to a particular order or sequence. Additionally, some of the described method embodiments or elements thereof can occur or be performed simultaneously, at the same point in time, or concurrently.

Claims
  • 1. A method of clustering nodes, the method comprising: calculating, by a computer processor, a distance between nodes for each of a plurality of pairs of nodes, the pairs comprising one or more of a plurality of entities and initial clusters; selecting, by the processor, one or more of the pairs based on the calculated distances; and merging, by the processor, one or more of the selected pairs including a common node into one or more final clusters.
  • 2. The method of claim 1, wherein the merging is performed based on one or more connectivity constraints, the constraints based on at least one of: a maximum batch size, k-nearest neighbors of one or more of the nodes, and comparing the calculated distances to a predetermined threshold.
  • 3. The method of claim 1, comprising: calculating, by the processor, a similarity matrix for one or more of the pairs; transforming, by the processor, the similarity matrix into a connectivity matrix, the transforming based on weighted statistical characteristics of the similarity matrix; and calculating, by the processor, a connectivity support value for one or more of the pairs based on the connectivity matrix; and wherein the merging is performed based on the calculated connectivity support values.
  • 4. The method of claim 1, comprising: calculating, by the processor, a score for one or more nodes within a given initial cluster; calculating, by the processor, a centroid for the initial cluster based on the calculated scores; and ejecting, by the processor, one or more of the nodes from the initial cluster based on the calculated centroid.
  • 5. The method of claim 4, comprising: calculating, by the processor, one or more second distances between each of the ejected nodes and one or more of the final clusters; and adding, by the processor, one or more of the ejected nodes to one or more of the final clusters based on one or more of the second distances.
  • 6. The method of claim 1, comprising: providing, by the processor, a plurality of search results for an input query based on the final clusters.
  • 7. The method of claim 1, wherein one or more of the entities include one or more words extracted from one or more documents.
  • 8. The method of claim 4, wherein the scores comprise one or more generality indices, the indices comprising one or more of: a frequency of occurrence for one or more of the entities, identifying one or more of the entities as joint-entities of a given entity, a distance of each joint-entity from the given entity, and a weighted frequency of occurrence for a given entity based on frequencies of occurrence of one or more other entities.
  • 9. A computerized system for clustering nodes, the system comprising: a memory, and a computer processor configured to: calculate a distance between nodes for each of a plurality of pairs of nodes, the pairs comprising one or more of a plurality of entities and initial clusters; select one or more of the pairs based on the calculated distances; and merge one or more of the selected pairs including a common node into one or more final clusters.
  • 10. The computerized system of claim 9, wherein the merging is performed based on one or more connectivity constraints, the constraints based on at least one of: a maximum batch size, k-nearest neighbors of one or more of the nodes, and comparing the calculated distances to a predetermined threshold.
  • 11. The computerized system of claim 9, wherein the processor is to: calculate a similarity matrix for one or more of the pairs; transform the similarity matrix into a connectivity matrix, the transforming based on weighted statistical characteristics of the similarity matrix; and calculate a connectivity support value for one or more of the pairs based on the connectivity matrix; and wherein the merging is performed based on the calculated connectivity support values.
  • 12. The computerized system of claim 9, wherein the processor is to: calculate a relevancy score for one or more nodes within a given initial cluster; calculate a centroid for the initial cluster based on the calculated relevancy scores; and eject one or more of the nodes from the initial cluster based on the calculated centroid.
  • 13. The computerized system of claim 12, wherein the processor is to: calculate one or more second distances between each of the ejected nodes and one or more of the final clusters; and add one or more of the ejected nodes to one or more of the final clusters based on one or more of the second distances.
  • 14. The computerized system of claim 9, wherein the processor is to provide a plurality of search results for an input query based on the final clusters.
  • 15. The computerized system of claim 9, wherein one or more of the entities include one or more words extracted from one or more documents.
  • 16. The computerized system of claim 12, wherein the scores comprise one or more generality indices, the indices comprising one or more of: a frequency of occurrence for one or more of the entities, identifying one or more of the entities as joint-entities of a given entity, a distance of each joint-entity from the given entity, and a weighted frequency of occurrence for a given entity based on frequencies of occurrence of one or more other entities.
  • 17. A method for categorizing interactions using an automatically generated domain taxonomy, the method comprising: in a computerized system comprising a processor and a memory, and connected by a network to one or more remote computers: extracting, by the processor, a plurality of words from one or more documents; calculating, by the processor, a distance between nodes for each of a plurality of pairs of nodes, the pairs comprising one or more of the words and initial clusters; selecting, by the processor, one or more of the pairs based on the calculated distances; merging, by the processor, one or more of the selected pairs including a common node into one or more final clusters; ranking, by the processor, one or more nodes within one or more of the final clusters; selecting, by the processor, one or more of the ranked nodes as cluster titles; iteratively repeating the extracting of a plurality of words, the calculating of a distance between nodes, the selecting of one or more of the pairs, the merging of one or more of the selected pairs, the ranking of one or more nodes, and the selecting of one or more of the ranked nodes, until one or more convergence criteria are met, wherein the criteria are based on at least one of: a maximum cluster size, and a maximum number of pairs including a common node; and automatically generating a taxonomy comprising one or more of the clusters and the titles from one or more iterations, the taxonomy organized in a hierarchical structure.
  • 18. The method of claim 17, comprising: providing a plurality of search results for an input query based on the taxonomy.
  • 19. The method of claim 17, wherein one or more of the documents describe one or more interactions, the interactions routed using a private branch exchange to one or more of the remote computers.
  • 20. The method of claim 19, comprising: routing, by the private branch exchange, one or more of the interactions to a remote computer among the one or more remote computers based on the taxonomy.