The present invention is illustrated by way of example and not limitation in the figures of the accompanying drawings in which like references indicate similar elements.
In the following detailed description of embodiments of the invention, reference is made to the accompanying drawings in which like references indicate similar elements, and in which is shown by way of illustration specific embodiments in which the invention may be practiced. These embodiments are described in sufficient detail to enable those skilled in the art to practice the invention, and it is to be understood that other embodiments may be utilized and that logical, mechanical, electrical, functional, and other changes may be made without departing from the scope of the present invention. The following detailed description is, therefore, not to be taken in a limiting sense, and the scope of the present invention is defined only by the appended claims.
Furthermore, category data can be sparse, which means that the category data has a large dimensionality. In one embodiment, the category data is sparse because the categories are discrete and lack a natural similarity measure between them. Examples of category data include electronic program guide (EPG) data and content metadata. The data system 10 includes an input processing module 9 to preprocess and load the category data 11 from database inputs 8A-N.
The category data 11 is grouped into clusters, and/or classified into folders by the clustering/classification module 12. Details of the clustering and classification performed by module 12 are below. The output of the clustering/classification module 12 is an organizational data structure 13, such as a cluster tree or a dendrogram. A cluster tree may be used as an indexed organization of the category data or to select a suitable cluster of the data.
Many clustering applications require identification of a specific layer within a cluster tree that best describes the underlying distribution of patterns within the category data. In one embodiment, organizational data structure 13 includes an optimal layer that contains a unique cluster group containing an optimal number of clusters.
A data analysis module 14 may use the folder-based classifiers and/or classifiers generated by clustering operations for automatic recommendation or selection of content. The data analysis module 14 may automatically recommend or provide content that may be of interest to a user or may be similar or related to content selected by a user. In one embodiment, a user identifies multiple folders of category data records that categorize specific content items, and the data analysis module 14 assigns category data records for new content items to the appropriate folders based on similarity.
A user interface 15 also shown in
Clustering is a process of organizing category data into a plurality of clusters according to some similarity measure among the category data. The module 12 clusters the category data by using one or more clustering processes, including seed based hierarchical clustering, order-invariant clustering, and subspace bounded recursive clustering. In one embodiment, the clustering/classification module 12 merges clusters in a manner independent of the order in which the category data is received.
In one embodiment, the group of folders created by the user may act as a classifier such that new category data records are compared against the user-created group of folders and automatically sorted into the most appropriate folder. In another embodiment, the clustering/classification module 12 implements a folder-based classifier based on user feedback. The folder-based classifier automatically creates a collection of folders, and automatically adds and deletes folders to or from the collection. The folder-based classifier may also automatically modify the contents of other folders not in the collection.
In one embodiment, the clustering/classification module 12 may augment the category data prior to or during clustering or classification. One method of augmentation is to impute attributes of the category data. The augmentation may reduce any sparseness of the category data while increasing the overall quality of the category data to aid the clustering and classification processes.
Although shown in
Furthermore, database input module 9 comprises database dimension reduction module 15. As stated above, category datasets can be sparse. Reducing the dimensionality of the datasets improves the efficiency and quality of the modules using the datasets, because the datasets become denser and easier to search and/or process. In one embodiment, database dimension reduction module 15 reduces the dimensionality of category dataset 11 by modifying the term relations between the terms in category dataset 11 and the content. A term relation is data that defines the relationship between a term in category data 11 and the one or more particular pieces of content associated with that term. In another embodiment, database dimension reduction module 15 reduces the dimensionality of category dataset 11 by reducing the number of terms in category dataset 11.
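As a non-limiting illustration, a term relation of this kind may be sketched in Python as a set of (term, content) pairs, the dimensionality being the number of distinct terms; the term names and content identifiers below are hypothetical and are not drawn from any particular category dataset:

# Hypothetical sketch of a term relation: a set of (term, content) pairs.
# The terms and content identifiers below are illustrative only.
term_relation = {
    ("dog", "program_1"),
    ("cat", "program_1"),
    ("umbrella", "program_2"),
    ("dog", "program_3"),
}

def dimensionality(relation):
    """Number of distinct terms, i.e., the dimensionality of the relation."""
    return len({term for term, _content in relation})

print(dimensionality(term_relation))  # 3 distinct terms in this toy example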
In one embodiment, database input module 9 extracts category data for a particular piece of content from content metadata. Content metadata is information that describes content used by data system 10.
Category data for a particular piece of content is one or more terms that describe the different categories associated with the piece of content. As illustrated in
In
One embodiment of database dimension reduction module 15 that reduces the dimensionality of category dataset 11 uses a category dataset ontology.
Ontology 200 comprises entity 202, object 204, living thing 206, organism 208, animal 210, chordate 212, vertebrate 214, mammal 216, placental 218, carnivore 220, canine 222, dog 224, feline 226, cat 228, artifact 230, covering 232, protective covering 234, shelter 236, canopy 238, and umbrella 240. Entity 202 is the topmost hypernym of category dataset ontology 200. The topmost hypernym in the ontology is also known as the root node and is the most generic term in the ontology tree. Thus, in ontology 200, entity 202 is the most generic term and every term below entity 202 “is an” entity. Object 204 is a hypernym of living thing 206 and artifact 230, as well as of the nouns 208-228 and 232-240 below living thing 206 and artifact 230, respectively. The structure of category dataset ontology 200 illustrates two main branches: one branch relating living things 206 and the other branch relating artifacts 230. The living thing 206 branch comprises the following hypernyms (from more generic to more specific): organism 208, animal 210, chordate 212, vertebrate 214, mammal 216, placental 218, and carnivore 220. Under carnivore 220, ontology 200 splits into two branches: canine 222/dog 224 and feline 226/cat 228. Canine 222 is a hypernym of dog 224. Similarly, feline 226 is a hypernym of cat 228.
The second main branch of ontology 200 comprises the following hypernyms (from more generic to more specific): artifact 230, covering 232, protective covering 234, shelter 236, canopy 238, and umbrella 240.
In one embodiment, database dimension reduction module 15 uses ontology 200 to determine relatedness between categorical terms. In one embodiment, term relatedness is determined by the closeness of terms in ontology 200, measured by counting the number of hops needed to get from one term to the other. For example, dog 224 and cat 228 are closer to each other than to umbrella 240: based on ontology 200, cat 228 is four hops away from dog 224, whereas umbrella 240 is sixteen hops away from dog 224.
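For illustration only, the hop-counting measure of relatedness described above may be sketched in Python, assuming ontology 200 is encoded as child-to-hypernym links; the sketch reproduces the distances noted above, namely four hops between dog 224 and cat 228 and sixteen hops between dog 224 and umbrella 240:

from collections import deque

# Hypothetical encoding of ontology 200 as child -> hypernym (parent) links.
HYPERNYM = {
    "object": "entity", "living thing": "object", "organism": "living thing",
    "animal": "organism", "chordate": "animal", "vertebrate": "chordate",
    "mammal": "vertebrate", "placental": "mammal", "carnivore": "placental",
    "canine": "carnivore", "dog": "canine", "feline": "carnivore", "cat": "feline",
    "artifact": "object", "covering": "artifact", "protective covering": "covering",
    "shelter": "protective covering", "canopy": "shelter", "umbrella": "canopy",
}

def hops(a, b):
    """Count the number of hops between two terms in the ontology tree."""
    # Build an undirected adjacency list from the hypernym links.
    neighbors = {}
    for child, parent in HYPERNYM.items():
        neighbors.setdefault(child, set()).add(parent)
        neighbors.setdefault(parent, set()).add(child)
    # Breadth-first search from a to b.
    queue, seen = deque([(a, 0)]), {a}
    while queue:
        term, dist = queue.popleft()
        if term == b:
            return dist
        for nxt in neighbors[term]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None

print(hops("dog", "cat"))       # 4 hops
print(hops("dog", "umbrella"))  # 16 hops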
This degree of relatedness can be used to group terms by sub-attributes. Each term in ontology 200 can have attributes. Grouping the terms by relatedness associates sub-attributes with a term. In one embodiment, each term in the group is related to the other terms as a sub-attribute. Using a group size limitation restricts the total number of terms within each group (e.g., how far to traverse an ontology to group terms). The term “bound” is used herein to refer to a group size limitation or a size limitation on a hierarchy sub-tree. Hierarchy sub-trees are described further below.
For example, a grouping of categorical dataset attributes could be:
(Recreation, Pachinko, Fun Entertainment, Encore, Swimming, Skating, Gymnastics, Hunting, Fishing, Tennis, Basketball, Golf, Soccer, Baseball, Athletics) (1)
(Tofu, Food, Diet, Vitamin, Sushi, Soup, Pudding, Dessert, Chocolate, Beverage) (2)
Using this grouping, database dimension reduction module 15 adds to a term attributes corresponding to each other term in the group. Furthermore, the groups have intuitive meaning and a smaller set of values. For example, group (1) could be seen as an attribute for types of recreation, while group (2) could be seen as a set of food attributes. An algorithm can distinguish the terms based on the semantically related terms. Of course, alternate embodiments could have different results. For example, alternate embodiments can employ different ontologies with more, fewer, and/or different classes and structures. One embodiment of grouping terms by sub-attributes is further described in
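Purely by way of example, the addition of sub-attributes from such groupings may be sketched in Python as follows; the groups shown are abbreviated, hypothetical versions of groups (1) and (2) above, and the function name is illustrative only:

# Illustrative groups corresponding to groupings (1) and (2) above (abbreviated).
GROUPS = [
    {"Recreation", "Swimming", "Tennis", "Golf", "Baseball"},   # a recreation-like group
    {"Tofu", "Food", "Sushi", "Dessert", "Beverage"},           # a food-like group
]

def sub_attributes(term):
    """Return the other members of the term's group as its sub-attributes."""
    for group in GROUPS:
        if term in group:
            return group - {term}
    return set()

print(sorted(sub_attributes("Tennis")))
# ['Baseball', 'Golf', 'Recreation', 'Swimming']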
In an alternate embodiment, instead of splitting term attributes into sub-attributes, database dimension reduction module 15 replaces terms with more generic terms. Replacing multiple terms with a generic term reduces the overall dimensionality of category data 11. In one embodiment, terms are mapped onto a hypernym of the term. In other alternate embodiments, a term is mapped onto another related term that is not a hypernym. Term replacement results in fewer terms in ontology 200, as several terms are mapped onto the same abstract term. The resulting data is much less sparse, because each term is associated with more multimedia data and proportionately more terms are associated with each piece of multimedia data. The degree of abstraction can be controlled by specifying the desired statistical properties of the resulting terms. In one embodiment, the degree of abstraction is controlled by mapping the generic term to each leaf node in ontology 200 directly below the generic term. For example, an EPG dataset (Brother, Sister, Grandchild, Baby, Infant, Son, Daughter, Husband, Mother, Parent, Father) is mapped onto ‘relative’, and an EPG dataset (Hunting, Fishing, Gymnastics, Basketball, Tennis, Golf, Soccer, Football, Baseball) is mapped onto ‘sport’. As with grouping terms by sub-attribute, term replacement modifies the term relation because one or more terms in category data 11 are replaced by generic terms. One embodiment of replacing terms is further described in
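As a non-limiting illustration, term replacement may be sketched in Python as a mapping from specific terms onto the generic terms ‘relative’ and ‘sport’ noted above; the content identifiers and function name are hypothetical:

# Hypothetical mapping of specific EPG terms onto more generic terms,
# following the 'relative' and 'sport' examples above.
GENERIC = {}
GENERIC.update({t: "relative" for t in
    ("Brother", "Sister", "Grandchild", "Baby", "Infant", "Son",
     "Daughter", "Husband", "Mother", "Parent", "Father")})
GENERIC.update({t: "sport" for t in
    ("Hunting", "Fishing", "Gymnastics", "Basketball", "Tennis",
     "Golf", "Soccer", "Football", "Baseball")})

def replace_terms(relation):
    """Rewrite each (term, content) pair using the generic term, if one exists."""
    return {(GENERIC.get(term, term), content) for term, content in relation}

toy_relation = {("Brother", "show_1"), ("Sister", "show_1"), ("Golf", "show_2")}
print(sorted(replace_terms(toy_relation)))
# [('relative', 'show_1'), ('sport', 'show_2')]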
At block 306, method 300 uses the term list subsets to generate new term relations. As stated above, a term relation relates particular pieces of content to a term in category dataset 11. By using the term list subsets, method 300 reduces the dimensionality of the term relation. Reducing the term relation dimensionality allows for more efficient searching and other allocation of machine resources when processing category datasets. Term relation dimensionality reduction is further described in
At block 308, method 300 modifies the term relation. In one embodiment, method 300 replaces the old term relation with the new, reduced dimension term relation. In another embodiment, method 300 updates the old relation with the new reduced dimension term relation.
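One possible way in which a term list subset could be used to produce a reduced-dimension term relation that replaces the old relation is sketched below in Python; the data and function name are hypothetical, and embodiments of method 300 are not limited to this sketch:

# Hypothetical sketch: restrict an existing term relation to a term list subset,
# producing a new, lower-dimensional relation that replaces the old one.
def reduce_relation(relation, term_subset):
    """Keep only the (term, content) pairs whose term survives in the subset."""
    return {(term, content) for term, content in relation if term in term_subset}

old_relation = {("dog", "show_1"), ("cat", "show_1"), ("umbrella", "show_2")}
new_relation = reduce_relation(old_relation, {"dog", "cat"})
old_relation = new_relation  # replace the old term relation with the reduced one
print(sorted(old_relation))  # [('cat', 'show_1'), ('dog', 'show_1')]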
Method 400 further executes an outer processing loop (blocks 406-422) to create term list subsets based on the sorted term list. At block 408, method 400 creates a new term subset, Sx. In one embodiment, method 400 creates an empty set for Sx.
Furthermore, method 400 executes an inner processing loop (blocks 410-418) to add terms to the subset based on the term frequency. At block 412, method 400 determines if the sum of the frequencies of the terms in Sx is less than percentage px. If so, at block 414, method 400 adds ty to Sx. Otherwise, method 400 sets the term list, T, to be the set difference between the old term list, T, and the term subset, Sx. This gives a new term list T that does not contain any of the terms listed in Sx. The inner processing loop ends at block 418. At block 420, method 400 outputs Sx. The outer processing loop ends at block 422.
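For illustration only, the outer and inner processing loops of method 400 may be sketched in Python as follows, assuming the term frequencies and per-subset percentages shown; the names used are hypothetical:

# Hypothetical sketch of the frequency-based partitioning of method 400:
# terms are sorted by frequency, and each subset S_x accumulates terms until
# the sum of their frequencies reaches the percentage p_x assigned to it.
def partition_terms(frequencies, percentages):
    """frequencies: {term: relative frequency}; percentages: budget per subset."""
    terms = sorted(frequencies, key=frequencies.get, reverse=True)  # sorted term list T
    subsets = []
    for p_x in percentages:                    # outer loop: one subset per percentage
        s_x, total = [], 0.0                   # new, empty term subset S_x
        for term in terms:                     # inner loop: add terms while under p_x
            if total < p_x:
                s_x.append(term)
                total += frequencies[term]
            else:
                break
        terms = [t for t in terms if t not in s_x]  # T := T \ S_x
        subsets.append(s_x)
    return subsets

freqs = {"movie": 0.4, "sports": 0.3, "news": 0.2, "opera": 0.1}
print(partition_terms(freqs, [0.5, 0.5]))
# [['movie', 'sports'], ['news', 'opera']]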
At block 504, method 500 builds a hierarchy of terms. A hierarchy of terms describes the inter-relation of terms. For instance, category dataset ontology 200 is an example of a hierarchy of terms. In this embodiment, the hierarchy of terms is a hypernym term hierarchy. Building a term hierarchy is further described in
At block 506, method 500 generates subclasses from the hierarchy of terms. Sub-classing groups the terms by the sub-attribute of the term. Generating subclasses is further described in
Method 600 further executes a processing loop (blocks 606-612) to add terms in the hierarchy to the subtree, S. At block 606, method 600 creates a node for the current term, ti, being processed. Method 600 adds the node to subtree S at block 610. The processing loop ends at block 612. Method 600 returns the subtree S at block 614.
Method 700 further executes a processing loop (blocks 706-718) to link the terms in subtree S to hypernyms of those terms in terminological hierarchy H. At block 706, method 700 sets h to be a hypernym of the current node. Method 700 then determines if h is a node of T. If not, at block 712, method 700 creates node n for h and links the current node to the new node n. Method 700 sets the current node to node n at block 716. Execution proceeds to block 718, where the processing loop ends.
If h is a node of T, method 700 breaks out of the processing loop and links the current node to the node in T at block 720. At block 722, method 700 returns tree T.
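As a non-limiting illustration, the cooperation of methods 600 and 700 may be sketched in Python as follows, assuming the terminological hierarchy H is available as a table of hypernym links; the terms shown are hypothetical and drawn loosely from ontology 200:

# Hypothetical sketch of methods 600 and 700: create a node for each term (600),
# then follow each term's hypernym chain, adding nodes until a node already in
# the tree T is reached (700). HYPERNYM plays the role of terminological hierarchy H.
HYPERNYM = {"dog": "canine", "canine": "carnivore", "cat": "feline",
            "feline": "carnivore", "carnivore": "placental"}

def build_tree(terms):
    parent = {t: None for t in terms}          # method 600: one node per input term
    for term in terms:                         # method 700: link terms to hypernyms
        current = term
        while current in HYPERNYM:
            h = HYPERNYM[current]
            if h in parent:                    # h is already a node of T: link and stop
                parent[current] = h
                break
            parent[current] = h                # otherwise create node n for h and link
            parent[h] = None
            current = h
    return parent

print(build_tree(["dog", "cat"]))
# {'dog': 'canine', 'cat': 'feline', 'canine': 'carnivore',
#  'carnivore': 'placental', 'placental': None, 'feline': 'carnivore'}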
If the number of terms in the hierarchy is greater than the bound, method 800 further executes a processing loop (blocks 808-812) to generate additional subclasses from the term hierarchy. At block 808, method 800 executes the loop for each sub-hierarchy in the term hierarchy. In one embodiment, a sub-hierarchy is a leaf directly below the hierarchy root node. In alternate embodiments, method 800 partitions the term hierarchy into different sub-hierarchies (e.g., removing sub-trees from the hierarchy tree, etc.). At block 810, method 800 generates subclasses from the new sub-hierarchies. The processing loop ends at block 812.
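For illustration only, the recursive subclass generation of method 800 may be sketched in Python as follows, assuming the term hierarchy is available as a parent-to-children table and the bound is a maximum number of terms per subclass; the names are hypothetical:

# Hypothetical sketch of method 800: if a term hierarchy holds more terms than
# the bound, recurse into each sub-hierarchy below its root; otherwise the
# hierarchy's terms form one subclass (group of sub-attributes).
def generate_subclasses(node, children, bound, out):
    terms = subtree_terms(node, children)
    if len(terms) <= bound:
        out.append(terms)                      # small enough: emit as one subclass
    else:
        for child in children.get(node, []):   # otherwise recurse per sub-hierarchy
            generate_subclasses(child, children, bound, out)
    return out

def subtree_terms(node, children):
    terms = [node]
    for child in children.get(node, []):
        terms.extend(subtree_terms(child, children))
    return terms

children = {"carnivore": ["canine", "feline"],
            "canine": ["dog"], "feline": ["cat"]}
print(generate_subclasses("carnivore", children, bound=3, out=[]))
# [['canine', 'dog'], ['feline', 'cat']]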
Method 900 further executes an inner processing loop (blocks 914-918) to generate the term relations for T* for each term ty in Tx. At block 916, method 900 includes the term relation (ty, Ax) in T*. In one embodiment, Tx is the set of terms and T* is the set of mappings. For example, a term relation can be (term, abstract term). The inner and outer processing loops end at blocks 918 and 920, respectively.
At block 922, method 900 outputs T*, which represents the new list of terms. At block 924, method 900 computes R*, which is the changes in the term relation from T to T*. Computing the new relation R* is further described in
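By way of a non-limiting illustration, the construction of the mappings T* in method 900 may be sketched in Python as follows; the subclasses and abstract terms shown are hypothetical:

# Hypothetical sketch of method 900: for each subclass T_x with abstract term A_x,
# emit the mapping (t_y, A_x) for every term t_y in T_x, producing T*.
def build_mappings(subclasses):
    """subclasses: {abstract term A_x: iterable of terms T_x}. Returns T*."""
    t_star = []
    for a_x, t_x in subclasses.items():        # outer loop over subclasses
        for t_y in t_x:                        # inner loop over terms in T_x
            t_star.append((t_y, a_x))          # term relation (term, abstract term)
    return t_star

print(build_mappings({"sport": ["Golf", "Tennis"], "relative": ["Brother"]}))
# [('Golf', 'sport'), ('Tennis', 'sport'), ('Brother', 'relative')]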
Method 1000 further executes a processing loop (blocks 1006-1012) to assign individual term relations (d, r) to a relation subclass. At block 1008, method 1000 sets s equal to the subclass containing r. In one embodiment, r is in a single sub-class. At block 1010, method 1000 adds (d, s) to R′. The processing loop ends at block 1012. Method 1000 returns the modified relation R′ at block 1014.
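As a non-limiting illustration, the computation of the modified relation R′ in method 1000 may be sketched in Python as follows; the data shown is hypothetical:

# Hypothetical sketch of method 1000: each term relation (d, r) is rewritten as
# (d, s), where s is the subclass (here, the abstract term) that contains r.
def modify_relation(relation, subclass_of):
    r_prime = set()
    for d, r in relation:                      # processing loop over (d, r) pairs
        s = subclass_of[r]                     # s := the subclass containing r
        r_prime.add((d, s))                    # add (d, s) to R'
    return r_prime

subclass_of = {"Golf": "sport", "Tennis": "sport", "Brother": "relative"}
relation = {("show_1", "Golf"), ("show_1", "Tennis"), ("show_2", "Brother")}
print(sorted(modify_relation(relation, subclass_of)))
# [('show_1', 'sport'), ('show_2', 'relative')]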
The following descriptions of
In practice, the methods described herein may constitute one or more programs made up of machine-executable instructions. Describing the method with reference to the flowchart in
The web server 1208 is typically at least one computer system which operates as a server computer system and is configured to operate with the protocols of the World Wide Web and is coupled to the Internet. Optionally, the web server 1208 can be part of an ISP which provides access to the Internet for client systems. The web server 1208 is shown coupled to the server computer system 1210, which itself is coupled to web content 1240, which can be considered a form of media database. It will be appreciated that while two computer systems 1208 and 1210 are shown in
Client computer systems 1212, 1216, 1224, and 1226 can each, with the appropriate web browsing software, view HTML pages provided by the web server 1208. The ISP 1204 provides Internet connectivity to the client computer system 1212 through the modem interface 1214 which can be considered part of the client computer system 1212. The client computer system can be a personal computer system, a network computer, a Web TV system, a handheld device, or other such computer system. Similarly, the ISP 1206 provides Internet connectivity for client systems 1216, 1224, and 1226, although as shown in
Alternatively, as well-known, a server computer system 1228 can be directly coupled to the LAN 1222 through a network interface 1234 to provide files 1236 and other services to the clients 1224, 1226, without the need to connect to the Internet through the gateway system 1220. Furthermore, any combination of client systems 1212, 1216, 1224, 1226 may be connected together in a peer-to-peer network using LAN 1222, Internet 1202 or a combination as a communications medium. Generally, a peer-to-peer network distributes data across a network of multiple machines for storage and retrieval without the use of a central server or servers. Thus, each peer network node may incorporate the functions of both the client and the server described above.
Network computers are another type of computer system that can be used with the embodiments of the present invention. Network computers do not usually include a hard disk or other mass storage, and the executable programs are loaded from a network connection into the memory 1308 for execution by the processor 1304. A Web TV system, which is known in the art, is also considered to be a computer system according to the embodiments of the present invention, but it may lack some of the features shown in
It will be appreciated that the computer system 1300 is one example of many possible computer systems, which have different architectures. For example, personal computers based on an Intel microprocessor often have multiple buses, one of which can be an input/output (I/O) bus for the peripherals and one that directly connects the processor 1304 and the memory 1308 (often referred to as a memory bus). The buses are connected together through bridge components that perform any necessary translation due to differing bus protocols.
It will also be appreciated that the computer system 1300 is controlled by operating system software, which includes a file management system, such as a disk operating system, which is part of the operating system software. One example of an operating system software with its associated file management system software is the family of operating systems known as Windows® from Microsoft Corporation of Redmond, Wash., and their associated file management systems. The file management system is typically stored in the non-volatile storage 1314 and causes the processor 1304 to execute the various acts required by the operating system to input and output data and to store data in memory, including storing files on the non-volatile storage 1314.
In the foregoing specification, the invention has been described with reference to specific exemplary embodiments thereof. It will be evident that various modifications may be made thereto without departing from the broader spirit and scope of the invention as set forth in the following claims. The specification and drawings are, accordingly, to be regarded in an illustrative sense rather than a restrictive sense.
This patent application is related to the co-pending U.S. patent application, entitled “CLUSTERING AND CLASSIFICATION OF CATEGORICAL DATA”, attorney docket no. 080398.P649, application Ser. No. ______. The related co-pending application is assigned to the same assignee as the present application.