1. Field of Invention
This invention is directed to systems and methods for organizing data by hierarchical clustering of the data.
2. Description of Related Art
Data is stored in various ways, such as, for example, in media files as media data. Media data may be media streams or files, such as, for example, audio, video, graphic and/or text streams or files. One exemplary form of media data is digital photographs. The affordability of high quality digital cameras has enabled digital photography to proliferate, allowing millions to easily take and store digital photographs. These digital photographs are often stored as digital photograph data files.
Media data files usually include several different parts. For example, a digital photograph data file may include image data recorded in a particular file format, such as, for example, the JPEG format. Along with the image data, certain information about the image data is typically stored as meta-data in the resulting digital photograph data file and is associated with the image data. The associated meta-data is separate and distinct from the underlying image data. One exemplary format is the exchangeable image file format (Exif), which is often used as the format for the header information that is stored as part of the JPEG image data file. Examples of stored meta-data in the Exif format include the file name, one or more timestamps, such as the time the data was created or the time when the last change to the image file occurred, short descriptions of the image data, and the GPS location of the place where the image data was obtained.
Many techniques have been created for managing digital photograph data files and other such rapidly accumulating data files. For simple data files, one such technique involves placing such data files into specific folders depending on a topic that each such data file is associated with. Another technique involves manually organizing one's contact information into a given file directory within a personal computer database. The user reviews the content and determines the placement of the specific contact information in a file directory, and any sub-categories, such as friends, business contact, school contact, and the like.
Even such simple data as contact information written in a particular format, such as the format used in Microsoft Word®, contains two features. The name of the data record that identifies the data can be called a scalar feature that condenses the information that is contained within the record. The actual contents of the record, such as the name of the contact, the contact's address, or other data pertaining to that specific contact, are more detailed and can be called vector features.
One way to organize data files is for a user to actually examine the content of each data file and/or the name of that data file, and subsequently manually determine an appropriate location of that data file within a specific file directory structure, such as a folder labeled with an appropriate topic descriptor. Placing and gathering data files into specific locations organizes the data files into specific relationships. However, when, for example, tens of thousands of photographs have to be organized, manually organizing each data file becomes nearly impossible. The difficulty is amplified when the content of each data file is complicated, such as, for example, when the content is image data.
This invention provides systems and methods for efficiently organizing data based on meta-data or other ordered information within data files.
This invention separately provides systems and methods for organizing data files by clustering related data files based on organizing meta-data of a data file.
This invention separately provides systems and methods for extracting the meta-data of a data file.
This invention separately provides systems and methods for organizing the data files based on the meta-data of the data files.
This invention separately provides systems and methods for organizing desired data files for browsing and/or retrieval.
In various exemplary embodiments of the systems and methods according to this invention, a desired set of data files is organized by examining a set of meta-data, where each meta-data element of the meta-data is extracted from, or at least has been associated with, a particular data file. In various exemplary embodiments, a structure within the set of meta-data is assessed by obtaining a desired range of values of an element of the meta-data for analyzing the meta-data elements, then comparing the values for that element of the meta-data for all or a subset of the data files.
In various exemplary embodiments, the meta-data elements of the set of meta-data are clustered using the assessed structure of the set of meta-data. The structure of the set of meta-data includes boundaries that delineate each cluster of meta-data element values from other clusters. In various exemplary embodiments, the value of one meta-data element of one data file is compared to the value of that meta-data element of another data file in the clusters based on the range value to determine the similarity or dissimilarity between the compared data files.
In various exemplary embodiments, the data is organized using a comparison between all possible pairs of data or a subset of all possible pairs of data. In various exemplary embodiments, the compared similarity or dissimilarity is given a numerical value corresponding to a placement of the clusters of the meta-data elements and their corresponding data files. In various exemplary embodiments, the placement of the clusters is checked for greater accuracy. In various exemplary embodiments, the data files are organized more efficiently, and at lower computational cost, than when low level features are generated to construct content-based similarity measures.
These and other features and advantages of this invention are described in, or apparent from, the following detailed description of various exemplary embodiments of the method and apparatus according to this invention.
Various exemplary embodiments of this invention will be described in detail, with reference to the following figures, wherein:
The following detailed description of various exemplary embodiments of systems and methods according to this invention is focused on organizing desired data based on processing of meta-data corresponding to a data file. However, it should be appreciated that this invention is not limited to only the disclosed exemplary embodiments. In general, this invention can be used with any method or apparatus that organizes multitudes of data using corresponding meta-data.
As shown in
In step S400, a value for a parameter K is selected. Next, in step S500, the meta-data is organized hierarchically as desired. Operation then continues to step S600, where operation of the method ends.
It should be appreciated that, in various exemplary embodiments, the extracted meta-data element may be organized chronologically, if, for example, the at least one extracted element of the meta-data includes a timestamp element. Alternatively, the meta-data element may be organized alphabetically if the at least one extracted element of the meta-data includes a file name or some other text string. In still other various exemplary embodiments, the meta-data element may be organized numerically if the at least one extracted meta-data element of the meta-data includes numerical data. In yet other various exemplary embodiments, the at least one extracted meta-data element of the meta-data may define a location, such as, for example, GPS data. It should be appreciated that any other appropriate meta-data element, in addition to or in place of the time, alphabetical, numerical and/or positional meta-data elements described above, can be used as an organizing characteristic. It should also be appreciated that any known or later-developed way of ordering or organizing the values of the selected meta-data element(s) may be used to organize the data files into a desired order.
In various exemplary embodiments, each extracted meta-data element is given a desired identification, or indexed. As a result, in such exemplary embodiments, each data file is thus identified based not on the actual value of the organizing meta-data element in terms of the time, name, or location, but by the location of the value of that meta-data element, within the set of data files. In other words, as an example, a set of data files are organized chronologically based on the values of a timestamp meta-data element. However, the data files are then identified, or indexed, by the order they are located in the set of data files in view of the time values of the timestamp meta-data elements, not by the absolute time values of the timestamp meta-data elements. Nevertheless, the meta-data element for each data file continues to retain its absolute value, which can be compared later.
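The ordering and indexing described above can be illustrated with a minimal Python sketch. The function name `index_by_timestamp`, the file names, and the timestamps are hypothetical and chosen only for illustration; the sketch assumes the organizing meta-data element is a timestamp.

```python
from datetime import datetime


def index_by_timestamp(files):
    """Sort (name, timestamp) pairs chronologically and assign order indices.

    Returns a list of (index, name, timestamp) tuples. The absolute
    timestamp is retained so that it can still be compared later.
    """
    ordered = sorted(files, key=lambda item: item[1])
    return [(i, name, ts) for i, (name, ts) in enumerate(ordered)]


# Hypothetical file names and timestamps, for illustration only.
photos = [
    ("c.jpg", datetime(2003, 7, 4, 18, 30)),
    ("a.jpg", datetime(2003, 7, 4, 9, 15)),
    ("b.jpg", datetime(2003, 7, 4, 9, 17)),
]
indexed = index_by_timestamp(photos)
for i, name, ts in indexed:
    print(i, name, ts.isoformat())
```

After this step each data file is referred to by its position in the ordered set, while its absolute timestamp remains available for the later pair-wise comparisons.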
In various exemplary embodiments, the parameter K has a numerical value. The input value for the parameter K may be a default value or a desired value. In various exemplary embodiments, the parameter K is a value that determines the clustering sensitivity to pair-wise comparisons between the selected meta-data elements of each pair of data files in the set or a subset of pairs of data files in the set. Therefore, larger values of parameter K represent comparisons that result in coarser clustering of the data files. In other words, larger values of the parameter K require values for the meta-data that are further apart from each other to fall into separate clusters. On the other hand, smaller values for the parameter K can be tailored to integrate or emphasize specific features of the meta-data that become more or less apparent at either greater or lower values for the parameter K.
For example, a smaller value for the parameter K is typically more appropriate for a meta-data element having values that are very finely spaced, or features of meta-data that become more apparent at smaller differences. In contrast, a larger value for the parameter K is typically more appropriate for a meta-data element having values that are very coarsely spaced, or features of meta-data that become more apparent at greater differences. Consequently, the desired value for the parameter K will differ depending on the type of meta-data, the spacing of the meta-data, and the number of meta-data elements in the set. Therefore, in various exemplary embodiments, a plurality of values for the parameter K are used to fully analyze and compare the meta-data. Thus, in various exemplary embodiments according to this invention, no assumptions are made regarding an a priori distribution of the input set of meta-data elements. Various exemplary types of meta-data that can be analyzed and/or compared using such values for the parameter K include, for example, low level image features, GPS data, timestamps in hours, months, and/or years.
As shown in
The list of values for the parameter K corresponds to the values for the parameter K selected in step S400. In various exemplary embodiments, a list containing a plurality of different values for the parameter K can be generated automatically, for example, randomly, can be based on a quick scan of the meta-data values, or can be manually input.
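As one possible illustration of generating the list from a quick scan of the meta-data values, the following Python sketch spaces candidate values geometrically between the smallest non-zero gap and the total spread of the values. This heuristic, the function name `candidate_k_values`, and the sample values are assumptions made for illustration and are not prescribed by this description.

```python
def candidate_k_values(values, count=5):
    """Return `count` geometrically spaced K values, coarsest first.

    Assumed heuristic: span the range between the smallest non-zero gap
    between neighbouring values and the total spread of the values.
    """
    ordered = sorted(values)
    gaps = [b - a for a, b in zip(ordered, ordered[1:]) if b > a]
    if not gaps:
        return [1.0]  # all values identical; any positive K behaves alike
    lo, hi = min(gaps), ordered[-1] - ordered[0]
    if count < 2 or hi <= lo:
        return [float(hi)]
    ratio = (hi / lo) ** (1.0 / (count - 1))
    return [hi / ratio ** n for n in range(count)]


# Hypothetical timestamps in minutes.
minutes = [0, 2, 5, 600, 604, 3000]
k_values = candidate_k_values(minutes)
print(k_values)
```

The values are returned from largest to smallest so that a later analysis can proceed from a coarser scale to a finer scale.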
In step S530, each of the values for the parameter K in the list is used to obtain a similarity value SK for each pair of indexed meta-data elements in the list:
where:
The collection of the similarity value SK for each compared pair of meta-data elements using a particular value for the parameter K can be expressed as a similarity matrix.
In other words, the meta-data for the ith and jth data files can be compared based on the parameter K to obtain the similarity value SK for the values ti and tj of the meta-data elements of the ith and jth data files. As the t value is the actual value of the meta-data, in one exemplary embodiment, t can be a time in minutes if the meta-data is a timestamp.
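A minimal Python sketch of such a pair-wise comparison follows. Because the similarity equation itself is not reproduced in this text, the sketch assumes an exponential form, exp(-|ti - tj| / K), which is one form consistent with the behavior described for the parameter K; the function name and the sample values are likewise hypothetical.

```python
import math


def similarity_matrix(t, k):
    """Pair-wise similarity of ordered scalar values t at scale k.

    Assumed form: exp(-|t_i - t_j| / k), which is 1.0 on the diagonal and
    approaches 0 for values much further apart than k.
    """
    return [[math.exp(-abs(ti - tj) / k) for tj in t] for ti in t]


# Hypothetical timestamps in minutes, already ordered and indexed.
minutes = [0, 2, 5, 600, 604]
s = similarity_matrix(minutes, k=60.0)
for row in s:
    print(" ".join(f"{value:.3f}" for value in row))
```

Under this assumed form, a larger value for the parameter K makes widely spaced timestamps look more similar, which is consistent with larger values producing coarser clusters.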
The type of actual value of the meta-data elements that can be used to obtain a similarity value SK need not be a scalar value such as time. Other types of meta-data elements can be used to obtain the similarity value SK. In various exemplary embodiments, content-based feature vectors may also be used together with or in place of the meta-data. In this case, the similarity value is:
where vi and vj are actual vectors for the selected meta-data element of the ith and jth data files. Other suitable types of values and equations may be used in various other exemplary embodiments. Operation then continues to step S540.
In step S540, a novelty score vK is obtained for each element along the main diagonal of the similarity matrix SK that has been generated for a particular value for the parameter K. One way to obtain the novelty score vK is to use a matched filter technique to correlate a kernel along the main diagonal S(i,i) of the similarity matrix SK(i,j). That is, in various exemplary embodiments, the novelty score vK is determined only along the diagonal of the similarity matrix SK. To find the actual boundaries between the groups of meta-data, in various exemplary embodiments, a Gaussian tapered 11×11 checkerboard kernel g is used to calculate the novelty score vK(i) as:
where vK(i) is the novelty score for the ith element of the similarity matrix SK for a particular value for the parameter K and the Gaussian tapered 11×11 checkerboard kernel g.
In Eq. (3), the values of the kernel indices l and n each range between −5 and +5 because an 11×11 matrix is used. In various exemplary embodiments, other sized matrices may be used, such as, for example, a 9×9 matrix, where the kernel indices range between −4 and +4. To obtain the novelty score vK, any desired sized checkerboard kernel may be used.
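The kernel correlation can be sketched in Python as follows. The sketch assumes a sign-based checkerboard with a zero centre row and column, a Gaussian taper whose width is half the kernel half-width, and zero padding beyond the edges of the similarity matrix; none of these details is fixed by this description, and the function names are hypothetical.

```python
import math


def _sign(x):
    return (x > 0) - (x < 0)


def checkerboard_kernel(half_width=5, sigma=None):
    """Return a (2 * half_width + 1)-square Gaussian tapered checkerboard.

    Entries are positive where the two offsets share a sign and negative
    where they differ; the centre row and column are zero.
    """
    if sigma is None:
        sigma = half_width / 2.0  # assumed taper width
    offsets = range(-half_width, half_width + 1)
    return [[_sign(l) * _sign(n) * math.exp(-(l * l + n * n) / (2 * sigma ** 2))
             for n in offsets] for l in offsets]


def novelty_scores(s, kernel):
    """Correlate the kernel along the main diagonal of similarity matrix s."""
    size, h = len(s), len(kernel) // 2
    scores = []
    for i in range(size):
        total = 0.0
        for l in range(-h, h + 1):
            for n in range(-h, h + 1):
                r, c = i + l, i + n
                if 0 <= r < size and 0 <= c < size:  # assumed zero padding
                    total += s[r][c] * kernel[l + h][n + h]
        scores.append(total)
    return scores


# Idealized similarity matrix: two groups of six mutually similar items.
group = [0] * 6 + [1] * 6
s = [[1.0 if a == b else 0.0 for b in group] for a in group]
scores = novelty_scores(s, checkerboard_kernel())
print([round(v, 2) for v in scores])
```

With `half_width=5` the kernel is 11×11; passing `half_width=4` gives the 9×9 alternative. In this idealized example the score is largest at the two indices that straddle the change from the first group to the second.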
By using a checkerboard kernel, a full analysis need not be performed. Rather, only the strip around the main diagonal with the same width as the kernel need be obtained, so that the computational complexity grows only linearly with the number of data files. It should be noted that comparisons of only a subset of pairs of data, rather than all possible pairs of data, may be used in any pair-wise comparisons. In general, using only a subset of all possible pairs results in substantial computational savings with minimal performance degradation.
When the novelty scores vK are determined for the various values of the parameter K, several peaks in the novelty score appear. It should be noted that different peaks appear for different values of the parameter K. Because the values for the parameter K represent a range of structure, the different values for the parameter K allow the similarity matrices SK to reveal structures at different resolutions. The peaks in the novelty scores vK, in turn, indicate a hierarchical set of boundaries between contiguous groups of data having similar or closer meta-data element values than other groups, i.e., clusters. That is, each peak in the novelty scores vK marks a boundary between groups with similar meta-data values, and thus indicates a cluster of meta-data values that is separable from other clusters. Operation then continues to step S550.
In step S550, a boundary list for each different value of the parameter K is obtained, first by locating all the peaks in the novelty score vK for each value of the parameter K, and then by enforcing a hierarchical structure on the detected boundaries. In various exemplary embodiments, the analysis to obtain a boundary list is done from a coarser scale to a finer scale, that is, for decreasing values for the parameter K, using each value in the list of values of the parameter K. All the peaks in the novelty scores vK for each value of the parameter K are then collected to build a hierarchical set of peak values or boundaries using a boundary list BK={b1, . . . bnk} that will include all boundaries detected. That is, all boundaries detected at coarse scales, or greater values of the parameter K, will be included in the boundary list for all finer scales, or lesser values of the parameter K. It is assumed that boundaries between groups that are further apart, obtained at coarser scales, still exist at finer scales.
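The coarse-to-fine inheritance of boundaries can be sketched in Python as follows. The function name `enforce_hierarchy`, the dictionary layout, and the peak indices are hypothetical and serve only to illustrate the nesting.

```python
def enforce_hierarchy(peaks_by_k):
    """Build nested boundary lists from per-scale peak indices.

    `peaks_by_k` maps each value of K to the peak indices detected at that
    scale. Scales are visited from the largest K (coarsest) to the smallest,
    and every boundary found at a coarser scale is carried into all finer
    scales.
    """
    boundaries = {}
    inherited = set()
    for k in sorted(peaks_by_k, reverse=True):
        inherited |= set(peaks_by_k[k])
        boundaries[k] = sorted(inherited)
    return boundaries


# Hypothetical peak indices detected at three scales.
peaks = {1000.0: [40], 100.0: [12, 41], 10.0: [5, 12, 30]}
nested = enforce_hierarchy(peaks)
for k in sorted(nested, reverse=True):
    print(k, nested[k])
```

Each finer-scale list is a superset of every coarser-scale list, which is the hierarchical property described above.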
The boundaries are located where the novelty score vK is at a local maximum value, that is, where the correlation of the kernel with the similarity matrix along its main diagonal peaks. Another way of locating the maxima or minima of the novelty score is, for example, to obtain a derivative of Eq. (3). The operation then continues to step S560.
In step S560, a determination is made whether all the values for the parameter K in the list have been used to determine the boundaries by obtaining the similarity value SK, the novelty score vK, and the boundary list BK for each value of the parameter K. If not, the operation returns to step S520. Otherwise, operation continues to step S570.
In step S570, the detected boundaries represented by the list of boundaries BK are used to obtain a confidence score C(BK), which represents the results of the clustering that have been ranked for each level in the hierarchy of the detected boundaries. The confidence score C(BK) is based on the average within-class similarity and the between-class dissimilarity as represented by:
where:
As shown above, the first sum quantifies the average within-class similarity between the data files within each cluster. The second sum quantifies the average between-class similarity between the data files in adjacent clusters, and is negated to quantify the between-cluster dissimilarity. The rates of change of the first sum and the second sum vary depending on the value of the parameter K. Therefore, for a plurality of values for the parameter K, one value will allow the confidence score C(BK) to be maximized. Consequently, operation continues to step S580, where the boundary list BK for the value of the parameter K that maximizes the confidence score C(BK) is obtained. Then, the operation proceeds to step S590, where the operation returns to step S600. Other types of statistical measures can be used to obtain the confidence score C(BK), such as the Bayes information criterion (BIC). Some examples of the Bayes information criterion are set forth in D. Heckerman, “A tutorial on learning with Bayesian networks”, Technical Report MSR-TR-95-06, Microsoft Research, Redmond, Wash. (1995, Revised 1996); S. Chen et al., “Speaker, environment and channel change detection and clustering via the Bayesian information criterion”, DARPA Speech Recognition Workshop (1998); and S. Renals et al., “Audio Information Access from Meeting Room” (April, 2003), each of which is incorporated herein by reference in its entirety.
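A minimal Python sketch of a confidence score of this kind, and of selecting the value of the parameter K that maximizes it, is given below. The confidence equation is not reproduced in this text, so the sketch is only one reading of the description: it sums the mean similarity within each cluster and subtracts the mean similarity between adjacent clusters. It further assumes an exponential similarity, exp(-|ti - tj| / K), and boundary lists that begin at 0 and end at the number of data files. All names and sample values are hypothetical.

```python
import math


def similarity_matrix(t, k):
    # Assumed exponential similarity between ordered scalar values.
    return [[math.exp(-abs(a - b) / k) for b in t] for a in t]


def confidence(s, bounds):
    """Mean within-cluster similarity minus mean adjacent-cluster similarity.

    `bounds` is an increasing list that starts at 0 and ends at len(s);
    cluster l covers the half-open index range [bounds[l], bounds[l + 1]).
    """
    def mean_block(rows, cols):
        total = sum(s[i][j] for i in rows for j in cols)
        return total / (len(rows) * len(cols))

    clusters = [range(a, b) for a, b in zip(bounds, bounds[1:])]
    within = sum(mean_block(c, c) for c in clusters)
    between = sum(mean_block(a, b) for a, b in zip(clusters, clusters[1:]))
    return within - between


def best_scale(t, bounds_by_k):
    """Return the K whose boundary list gives the highest confidence."""
    scored = {k: confidence(similarity_matrix(t, k), b)
              for k, b in bounds_by_k.items()}
    return max(scored, key=scored.get)


# Hypothetical timestamps (minutes) and per-scale boundary lists.
minutes = [0, 2, 5, 600, 604, 608]
bounds_by_k = {1000.0: [0, 6], 60.0: [0, 3, 6], 1.0: [0, 1, 3, 6]}
best = best_scale(minutes, bounds_by_k)
print(best, bounds_by_k[best])
```

In this small example the intermediate scale scores highest: at the coarsest scale the single cluster mixes dissimilar items, and at the finest scale the within-cluster similarity drops.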
One exemplary use of systems and methods according to this invention involves organizing digital photographs into time-based events by hierarchical clustering. With the proliferation of digital cameras, the number of digital photographs accumulating on personal computers is growing rapidly. Individual digital image files, which are typically in the JPEG image file format, include a wealth of meta-data in the digital files, typically stored in a standard exchangeable image file format (Exif). Such meta-data includes a timestamp that indicates when the photograph was taken or when subsequently re-saved or modified. Nevertheless, because a plurality of meta-data may be recorded with the image file, such information as the original timestamp, or any subsequent modified timestamp, may be separately recorded as meta-data and can be individually extracted and analyzed using various exemplary embodiments of systems and methods according to this invention.
In one exemplary embodiment, a collection of 512 photographs was used. All of the photographs had timestamps (meta-data), and were placed manually into meaningful folders, i.e., specific events, by a photographer. This manual clustering of these photographs will be referred to in the following discussion as the ground truth clustering.
The Exif header for each photograph was first processed to extract the timestamp for that photograph. The extracted timestamps were then ordered chronologically using any basic time unit, such as minutes. Once the timestamps were chronologically ordered, each timestamp, and thus each corresponding photograph, was given an index or time order number or value, and was thereafter referred to by this index, rather than by the absolute time value of the timestamp.
After the initial processing to extract the timestamps and organize the photographs, the structure of the collection of timestamps was assessed by building a similarity matrix Sk.
A checkerboard pattern along the main diagonal of the similarity matrix Sk shown in
As shown in
As discussed above, different features become more apparent at different values of the parameter K. In the corresponding novelty scores vK, the boundary points vary considerably depending on the scale of the analysis, i.e., value of the parameter K. In
In
The technique is based on the assumption that detected event boundaries must, at some scale, or for some value of the parameter K, approach a maximum novelty score. For each value of the parameter K, the peaks in the novelty score vK that indicate a boundary are detected by analysis of the first difference. Using a given threshold score avoids detecting spurious peaks that may appear, for example, because of an unusually long gap in the time values in photographs that are of the same event. Such a given threshold score may be used as a minimum threshold score. For example, a novelty score which is greater than 5 can be selected as a peak in each contiguous region.
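Peak detection by analysis of the first difference, combined with a minimum threshold score, can be sketched in Python as follows. The function name `find_peaks`, the handling of flat tops, and the sample scores are assumptions for illustration; the threshold of 5 follows the example above.

```python
def find_peaks(scores, threshold=5.0):
    """Return indices of local maxima whose score exceeds `threshold`.

    A local maximum is taken to be a point where the first difference
    changes from positive to non-positive, so only the first index of a
    flat top is reported.
    """
    peaks = []
    for i in range(1, len(scores) - 1):
        rising = scores[i] - scores[i - 1] > 0
        falling = scores[i + 1] - scores[i] <= 0
        if rising and falling and scores[i] > threshold:
            peaks.append(i)
    return peaks


# Hypothetical novelty scores for one value of the parameter K.
scores = [0.1, 0.4, 6.2, 0.3, 0.2, 5.5, 7.9, 7.9, 1.0, 4.0, 0.5]
boundaries = find_peaks(scores)
print(boundaries)
```

The local maximum at index 9 is ignored because its score of 4.0 does not exceed the threshold, which is how the threshold suppresses spurious peaks.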
This confidence measure C(BK) depends explicitly on both the number of detected clusters and the values of the parameter K.
As shown in
In general, the data source 200 shown in
The data source 200 and/or the data sink 220 can be integrated with the data organizing system 100. Additionally, the data organizing system 100 may be integrated with devices providing additional functions in addition to the data source 200 and/or the data sink 220, in a larger system that performs multiple functions, such as a digital camera that automatically organizes the captured photographs into folders.
Each of the respective one or more user input device(s) 106 may be one or any combination of multiple input devices, such as a keyboard, a mouse, a joy stick, a trackball, a touch pad, a touch screen, a pen-based system, a microphone and associated voice recognition software, or any other known or later-developed device for inputting data and/or user commands to the data organizing system 100. It should be understood that the one or more user input device(s) 106, of
Each of the links 104, 108, 210 and 230 connecting the display device 102, the one or more user input device(s) 106, the data source 200, and the data sink 220 to the data organizing system 100 can be a signal line, a direct cable connection, a modem, a local area network, a wide area network, an intranet, the Internet, any other distributed processing network, or any other known or later developed connection device or structure. It should be appreciated that any of these links 104, 108, 210 and 230 may include wired or wireless portions. In general, each of the links 104, 108, 210 and 230 can be implemented using any known or later-developed connection system or structure usable to connect the respective devices to the data organizing system 100. It should be understood that the links 104, 108, 210 and 230 do not need to be of the same type.
As shown in
Various embodiments of the data organizing system 100 can be implemented as software executing on a programmed general purpose computer, a special purpose computer, a microprocessor or the like. It should also be understood that each of the circuits, routines, and/or applications shown in
The meta-data extracting circuit, routine, or application 140 extracts at least one meta-data element associated with a data file. At least one element of the meta-data of each data file is extracted from the plurality of data files to be organized. Data files such as digital image files, which are typically in the JPEG image file format, include a wealth of meta-data in the digital files, typically stored in a standard exchangeable image file format (Exif). Such extractable meta-data includes a timestamp that indicates when the photograph was taken or when subsequently re-saved or modified.
The meta-data organizing circuit, routine, or application 150 organizes the extracted meta-data elements into a desired order based on values for the extracted meta-data elements. The extracted meta-data elements are organized using any desired organizing characteristic, such as the chronological, alphabetical, numerical and/or positional characteristic, and can be ordered based on an assigned identification value, or index.
The similarity value determining circuit, routine, or application 160 determines, for at least one of the at least one parameter value, a similarity value for at least two of the plurality of data files using at least some of the extracted meta-data elements and that parameter value. Therefore, the similarity value determining circuit, routine, or application 160 compares the meta-data for at least a pair of data files using the parameter value to obtain the similarity value of each such pair of the data files.
The novelty value determining circuit, routine, or application 170 determines, for at least some of the data files, at least one novelty value for each such data file based on the plurality of similarity values. That is, the novelty value determining circuit, routine, or application 170 determines the novelty value based on the similarity values for a desired number of data files.
The data dividing circuit, routine, or application 180 divides at least some of the data files into groups based on the extracted meta-data elements and an input parameter value. In various exemplary embodiments, the data dividing circuit, routine, or application 180 divides the at least some of the data files into groups based on the extracted meta-data elements and an input parameter value by determining at least one boundary location between ones of the plurality of data files based on the at least one novelty value determined for at least some of the data files, and determining, for at least some of the determined boundary locations, the at least one parameter value that maximizes the confidence value.
The confidence value determining circuit, routine, or application 190 determines, for at least some of the determined boundary locations, a confidence value for that boundary location.
In operation, the data organizing system 100 inputs or otherwise obtains a plurality of data files, each with its corresponding meta-data, and may input the value for the input parameter, from the data source 200 over the link 210, and/or reads one or more data files from the memory 130. The input parameter may also be input through the user input device 106. If obtained from the data source 200, the input/output interface 110 inputs the data files and/or the input parameter, and, under the control of the controller 120, forwards any appropriate data files to the meta-data extracting circuit, routine, or application 140.
The meta-data extracting circuit, routine, or application 140 extracts at least one meta-data element associated with at least some of the input data files. The meta-data extracting circuit, routine, or application 140 then, under the control of the controller 120, stores the extracted meta-data elements to the memory 130, or outputs the extracted meta-data elements directly to the meta-data organizing circuit, routine, or application 150. The meta-data organizing circuit, routine, or application 150 inputs, under control of the controller 120, the extracted meta-data elements and organizes the extracted meta-data elements into a desired order based on values for the extracted meta-data elements. The meta-data organizing circuit, routine, or application 150 then, under the control of the controller 120, stores the ordered extracted meta-data to the memory 130 or outputs the ordered extracted meta-data elements directly to the similarity value determining circuit, routine, or application 160.
The similarity value determining circuit, routine, or application 160 inputs, under control of the controller 120, the ordered meta-data elements and/or the corresponding data files and determines, for at least one of the at least one parameter value, a similarity value for at least one pair of two of the plurality of data files using at least some of the extracted meta-data elements and/or the contents of those data files and that parameter value. The similarity value determining circuit, routine, or application 160 then, under the control of the controller 120, stores the determined similarity values to the memory 130 or outputs the determined similarity values directly to the novelty value determining circuit, routine, or application 170.
The novelty value determining circuit, routine, or application 170 inputs, under control of the controller 120, at least some of the similarity values and determines, for each of a number of data files associated with the input similarity values, at least one novelty value for each such data file based on similarity values for that data file and a desired number of surrounding data files. The novelty value determining circuit, routine, or application 170, then, under the control of the controller 120, stores the determined novelty values to the memory 130 or outputs the determined novelty values directly to the data dividing circuit, routine, or application 180.
The data dividing circuit, routine, or application 180 inputs, under control of the controller 120, at least some of the novelty values and divides the corresponding data files into groups by determining at least one boundary location between various ones of the plurality of data files based on the at least one novelty value determined for at least some of the data files. The data dividing circuit, routine, or application 180, then, under the control of the controller 120, stores the determined boundary location to the memory 130 or outputs the determined boundary location to the confidence value determining circuit, routine, or application 190.
The confidence value determining circuit, routine, or application 190 inputs, under control of the controller 120, one or more boundary locations, and determines, for at least some of the determined boundary locations, a confidence value for that boundary location. The confidence value determining circuit, routine, or application 190, then, under the control of the controller 120, stores the determined confidence value to the memory 130, or outputs the determined confidence value to the data dividing circuit, routine, or application 180. The data dividing circuit, routine, or application 180 then determines the at least one parameter value that maximizes the confidence value for at least some of the determined boundary locations. Therefore, in operation of the data organizing system 100, at least some of the read/received data files are organized into groups based on the ordered extracted meta-data elements and/or the corresponding contents of the data files and the input parameter value. The divided, and thus organized, data files can then be further stored in the memory 130, output to the data sink 220 and/or displayed on the display device 102.
While
Alternatively, the data organizing system 100 may be a separate device including the meta-data extracting circuit, routine or application 140, the meta-data organizing circuit, routine or application 150, the similarity value determining circuit, routine or application 160, the novelty value determining circuit, routine or application 170, the data dividing circuit, routine or application 180, the confidence value determining circuit, routine or application 190, the controller 120, the memory 130, and/or the input/output interface 110. Furthermore, although shown as separate circuits, routines, and/or applications, the meta-data extracting circuit, routine, or application 140, the meta-data organizing circuit, routine, or application 150, the similarity value determining circuit, routine, or application 160, the novelty value determining circuit, routine, or application 170, the data dividing circuit, routine, or application 180, and the confidence value determining circuit, routine, or application 190 may themselves be integrated together in various combinations.
While this invention has been described in conjunction with the exemplary embodiments outlined above, various alternatives, modifications, variations, improvements, and/or substantial equivalents, whether known or that are or may be presently unforeseen, may become apparent to those having at least ordinary skill in the art. Accordingly, the exemplary embodiments of the invention, as set forth above, are intended to be illustrative, not limiting. Various changes may be made without departing from the spirit and scope of the invention. Therefore, the claims as filed and as they may be amended are intended to embrace all known or later-developed alternatives, modifications, variations, improvements, and/or substantial equivalents.
This non-provisional application claims the benefit of U.S. Provisional Application No. 60/515,713, filed on Oct. 31, 2003. The disclosure of the prior application is incorporated herein by reference in its entirety.
Number | Date | Country
---|---|---
60515713 | Oct 2003 | US