In the following description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the present invention. It may be evident, however, to one skilled in the art that the present invention may be practiced without these specific details. In other instances, well-known structures and devices are shown in block diagram form in order to facilitate description of the present invention. It is to be understood that the present invention may be implemented in various forms of hardware, software, firmware, special purpose processors, or a combination thereof. Preferably, the present invention is implemented in software as an application program tangibly embodied on a program storage device. The application program may be uploaded to, and executed by, a machine comprising any suitable architecture. Preferably, the machine is implemented on a computer platform having hardware such as one or more central processing units (CPU), a random access memory (RAM), and input/output (I/O) interface(s). The computer platform also includes an operating system and micro instruction code. The various processes and functions described herein may either be part of the micro instruction code or part of the application program (or a combination thereof) which is executed via the operating system. In addition, various other peripheral devices may be connected to the computer platform such as an additional data storage device and a printing device. It is to be further understood that, because some of the constituent system components and method steps depicted in the accompanying Figures are preferably implemented in software, the actual connections between the system components (or the process steps) may differ depending upon the manner in which the present invention is programmed. 
Given the teachings of the present invention provided herein, one of ordinary skill in the related art will be able to contemplate these and similar implementations or configurations of the present invention.
Here, an embodiment of the present invention to analyze data is presented. For clarity's sake, a level of abstraction is maintained that is common and well-known to those skilled in the related art; for instance, sets and maps are represented as, or approximated by, data on an information system.
To illustrate how frequency or probability is handled in the present invention, a data structure called frequency count is herein disclosed. It is a concrete way to model the simple counting probability measures on a set. In this embodiment, all data is represented as a frequency count on some set.
In the following, for any set A, a frequency count on A means data that keeps track of members of A and the number of times each occurs. It is treated as a subset of A×N, where N={1,2,3, . . . } is the set of natural numbers, such that no member of A appears more than once. The set of frequency counts on A is denoted by Freq(A). Thus a frequency count on A, i.e., a member F of Freq(A), is a set of pairs (a,n), where a is a member of A and n is a natural number, such that if (a,n) is in F, no other pair of the form (a,m) is in F. These pairs in frequency counts are hereinafter called particles. For a member a of A and a frequency count F on A, the count of a, denoted by countF(a), is defined to be n if there is a particle of the form (a,n) in F, and 0 otherwise; mass(F), the mass of F, is defined as the sum of countF(a) over all a in A; and PF(a), the probability of a, is defined as countF(a) divided by mass(F). The support supp(F) of F is defined to be the subset of A that consists of the members a with countF(a)>0. The entropy H(F) of F is defined by H(F) = −Σ PF(a) log2 PF(a), where the sum is taken over all a in supp(F).
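By way of illustration only, the definitions above may be sketched in Python by modelling a frequency count as a dictionary from members to positive counts. The dictionary representation and the function names (count, mass, prob, supp, entropy) are assumptions made for this sketch and are not part of the disclosure.

```python
from math import log2

# Assumption: a frequency count on A is a dict {member of A: count >= 1},
# so "no member appears more than once" holds by construction.

def count(F, a):
    """count_F(a): n if (a, n) is a particle of F, else 0."""
    return F.get(a, 0)

def mass(F):
    """mass(F): sum of all counts."""
    return sum(F.values())

def prob(F, a):
    """P_F(a) = count_F(a) / mass(F)."""
    return count(F, a) / mass(F)

def supp(F):
    """supp(F): members with a positive count."""
    return {a for a, n in F.items() if n > 0}

def entropy(F):
    """H(F) = -sum over supp(F) of P_F(a) * log2 P_F(a)."""
    return -sum(prob(F, a) * log2(prob(F, a)) for a in supp(F))

F = {"red": 2, "green": 1, "blue": 1}
print(mass(F), prob(F, "red"), entropy(F))  # 4 0.5 1.5
```

A frequency count with a single particle has entropy 0, the lowest possible value.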
The following should be noted for later reference:
[FC I] From two frequency counts F on A and G on B, another frequency count (the product) F×G on A×B may be generated as follows: F×G is a subset of (A×B)×N that consists of particles ((a,b),nm) for all combinations of particles (a,n) in F and (b,m) in G. This corresponds to the product probability measure.
[FC II] When there is a map f:A→B, a map f*:Freq(A)→Freq(B) of frequency counts is defined as follows: For a frequency count F, f*(F) is a subset of B×N that consists of particles (b,n) such that at least one particle (a,m) in F with b=f(a) exists and n is the sum of the m's in all such particles (a,m). In other words, the set f*(F) is made by adding (f(a),m) for all (a,m) in F and then replacing (b,i) and (b,j) with the same b by (b,i+j) until there are no distinct particles that have the same first component. This corresponds to the induced probability measure.
[FC III] If A⊃B, then Freq(A)⊃Freq(B), i.e., a frequency count on B is automatically a frequency count on A. When A⊃B and F is a frequency count on A, the restriction F|B of F to B is a frequency count on B (and therefore on A) that consists of all the particles (a,n) in F such that a is in B.
[FC IV] Two frequency counts F and G on A are said to be equivalent if there is a number m>0 such that countF(a)=m countG(a) for all a in A. If F and G are equivalent, various properties hold: mass(F)=m mass(G), supp(F)=supp(G), PF(a)=PG(a) for all a in A, and H(F) =H(G).
[FC V] For a set A, the standard frequency count St(A) on A is defined as the subset of A×N consisting of one particle (a,1) for each a in A. Note that, according to this definition and [FC I], St(A)×St(B) is identical to St(A×B).
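By way of illustration only, the operations [FC I] through [FC V] may be sketched in Python as follows. The dictionary representation of a frequency count and the function names (fc_product, pushforward, restrict, equivalent, standard) are assumptions made for this sketch. Exact fractions are used in the equivalence test so that a non-integral factor m is handled as well.

```python
from collections import Counter
from fractions import Fraction
from itertools import product as cartesian

# Assumption: a frequency count is a dict {member: positive int count}.

def fc_product(F, G):
    """[FC I] particles ((a, b), n*m) for every (a, n) in F and (b, m) in G."""
    return {(a, b): n * m for a, n in F.items() for b, m in G.items()}

def pushforward(f, F):
    """[FC II] f*(F): counts of particles with the same image are summed."""
    out = Counter()
    for a, m in F.items():
        out[f(a)] += m
    return dict(out)

def restrict(F, B):
    """[FC III] F|B: keep only particles whose member lies in B."""
    return {a: n for a, n in F.items() if a in B}

def equivalent(F, G):
    """[FC IV] True if count_F(a) = m * count_G(a) for one m > 0 and all a."""
    if set(F) != set(G):
        return False
    return len({Fraction(F[a], G[a]) for a in F}) <= 1

def standard(A):
    """[FC V] St(A): one particle (a, 1) per member of A."""
    return {a: 1 for a in A}

F = {1: 2, 2: 1, 3: 4, 4: 1}
print(pushforward(lambda a: a % 2, F))   # {1: 6, 0: 2}
print(restrict(F, {1, 2}))               # {1: 2, 2: 1}
print(equivalent({"a": 2, "b": 4}, {"a": 1, "b": 2}))  # True
A, B = {"p", "q"}, {0, 1, 2}
print(fc_product(standard(A), standard(B)) == standard(cartesian(A, B)))  # True
```

The last line checks the identity noted in [FC V], namely that St(A)×St(B) is identical to St(A×B).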
All the primitive maps listed from [PM I] onward are included in the set of primitive maps.
Based on the loaded data and the primitive maps, other data and maps are generated to explore the possibilities of various sets that characterize the data. In the beginning, there is the input data represented as a frequency count on sets. Thus the system begins by trying possible maps that can be applied to the sets. The result of applying such maps to existing data is new data. More specifically, the process keeps the following data structures:
As the process continues, more members are added to FC, SETS and MAPS, in one of the following ways:
[D I] If a pair of frequency counts F and G are already in FC, F×G may be added to FC (see [FC I].) Similarly for three or more frequency counts.
[D II] If any map in MAPS can be applied to some map(s) in MAPS (e.g., [PM III], [PM IV], [PM V], [PM VI], and [PM XII]), the resulting map may be added to MAPS. For instance, a pair of maps may be chosen and either their product or, if applicable, their concatenation may be added to MAPS; more generally, any map may be applied to other maps and the result added to MAPS.
[D III] A subset of a set in SETS can be added to SETS. A frequency count may be restricted to a subset. An inverse image of a subset can be added to SETS. For a subset B of A, the subset classifier map subsetB:A→bool (defined by subsetB(a)=true if a∈B and false otherwise) may be added to MAPS.
[D IV] If a frequency count F on a set A is in FC and a map f:A→B is in MAPS, f*(F) may be added to FC (see [FC II]). If this rule is used to add a frequency count, FC also records the map that was used.
Note that the sets can be considered to make a directed graph structure by taking sets as nodes and maps as edges. The frequency counts on the sets can also be considered to make a directed graph structure by taking frequency counts as nodes and maps as edges.
These maps and data can be explored and added to the data structures in various orders. For instance, a breadth-first search order could be used in the directed graph structure mentioned above. In this embodiment, a stochastic search algorithm is used:
Exploration Algorithm
Outline
Stochastically execute one of the actions from 1 to 6 below:
Details
Each frequency count, set, and map in FC, SETS, and MAPS is assigned an integral weight. In the beginning, the input data has the weight 1000, others are all given the weight of 100.
For each frequency count or map, a set of eligible objects is defined as follows: For a frequency count F on a set A, its set EO(F) of eligible objects consists of all the frequency counts in FC and all proper subsets of A in SETS. For a map f:A→B, its set EO(f) of eligible objects consists of all maps in MAPS to which f can be applied, all proper subsets of B in SETS, and all frequency counts on A.
Each time the exploration algorithm is invoked, a frequency count, a set, or a map is chosen at random from FC, SETS, and MAPS (201). The probability of choosing an object is proportional to its weight, except in the case of a set, where it is proportional to 200 divided by the number of members in the set.
If a frequency count F on a set A is chosen, another frequency count G or a proper subset B of A is chosen from EO(F) with a probability proportional to its weight (202). If G on a set C is chosen, F×G is added to FC and A×C to SETS (203). F×G is given the weight equal to the larger of the weights of F and G. A×C is given the weight equal to the larger of the weights of A and C. If B is chosen, F|B is added to FC (204) and given the weight equal to the larger of the weights of F and B.
If a set A is chosen, its subset B is randomly chosen and added to SETS and given the weight of 100. The subset map subsetB:A→bool is also added to MAPS with the weight of 100 (205).
If a map f:A→B is chosen, a frequency count F on A, a proper subset C of B, or a map g is chosen from EO(f) with a probability proportional to its weight (206). If a frequency count F is chosen, f*(F) is added to FC (207) and given a weight equal to the larger of the weights of f and F. If a proper subset C of B is chosen, f⁻¹(C) is added to SETS (208) and given the same weight as C; if a map g is chosen, f(g) is added to MAPS (209) and given a weight equal to the larger of the weights of f and g.
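By way of illustration only, the weighted choice of step (201) and the rule for the weight of a derived object may be sketched in Python as follows. The registry layout (name, kind, weight, payload), the object names, and the function names are hypothetical and serve only this sketch; the standard library function random.choices performs the draw proportional to the given weights.

```python
import random

# Hypothetical registry entries: (name, kind, weight, payload).
# For a set, payload is the set itself; otherwise it is unused here.

def selection_weight(kind, weight, payload):
    """Weight used for the top-level choice (201)."""
    if kind == "set":
        return 200 / len(payload)   # assumes a non-empty set
    return weight

def choose_object(registry, rng):
    """Pick one entry with probability proportional to its selection weight."""
    weights = [selection_weight(k, w, p) for _, k, w, p in registry]
    return rng.choices(registry, weights=weights, k=1)[0]

def derived_weight(w1, w2):
    """Weight of F×G, F|B, f*(F) or f(g): the larger of the two parents."""
    return max(w1, w2)

registry = [
    ("Im", "fc", 1000, None),            # input data starts at weight 1000
    ("Dom", "set", 100, {0, 1, 2, 3}),   # selection weight 200 / 4 = 50
    ("ev", "map", 100, None),            # all other objects start at 100
]
rng = random.Random(0)
picked = choose_object(registry, rng)
print(picked[0], derived_weight(1000, 100))
```

With these three entries the input data is chosen with probability 1000/1150, the set with probability 50/1150, and the map with probability 100/1150.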
Particle Record
When the input data is received and represented as a frequency count, a particle record (311) is created for each particle in the frequency count and stored in the particles record (310); the type (308) is set to explicit. The sum of the count fields (313) of the particle records in the particles record (310) is stored in the mass field (309).
When a result of applying a map f to a frequency count F on a set A is added to FC, in the record (302) that is created in FC for the result, the type is set to explicit. If the number of particles in F is more than MAXPARTICLE, only MAXPARTICLE particles are stochastically chosen with the probability proportional to their count; otherwise, all particles in F are chosen. For each chosen particle (a,n), the member f(a) is computed. If an explicit particle record (311) with the member field (312) containing f(a) is already there, its count field (313) is increased by n; otherwise, an explicit particle record (311) is created with the member field (312) containing f(a) and the count field (313) set to n.
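By way of illustration only, the capped computation of f*(F) described above may be sketched in Python as follows. The value given to MAXPARTICLE, the choice to draw particles without replacement, and the function names are assumptions made for this sketch; the text fixes none of them.

```python
import random
from collections import Counter

MAXPARTICLE = 3  # hypothetical value; the text does not fix one

def sample_particles(F, k, rng):
    """Draw k distinct particles, each draw proportional to count."""
    remaining = dict(F)
    chosen = {}
    for _ in range(k):
        members = list(remaining)
        weights = [remaining[a] for a in members]
        a = rng.choices(members, weights=weights, k=1)[0]
        chosen[a] = remaining.pop(a)   # remove so it cannot be drawn again
    return chosen

def capped_pushforward(f, F, rng, cap=MAXPARTICLE):
    """f*(F), computed from at most `cap` particles of F."""
    particles = F if len(F) <= cap else sample_particles(F, cap, rng)
    explicit = Counter()           # plays the role of the particles record
    for a, n in particles.items():
        explicit[f(a)] += n        # existing record: add n; else create with n
    return dict(explicit)

rng = random.Random(0)
small = {1: 2, 2: 5}
large = {1: 2, 2: 5, 3: 1, 4: 7, 5: 3}
print(capped_pushforward(lambda a: a % 2, small, rng))  # {1: 2, 0: 5}
print(capped_pushforward(lambda a: a % 2, large, rng))  # depends on the draw
```

When the cap applies, the mass of the result is the mass of the chosen particles only, so it is smaller than mass(F).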
In this embodiment, the method iterates the Exploration Algorithm and then checks for patterns (data and map) in the frequency counts in FC. This is done by calculating the entropy H(F) for any frequency count F that has been updated in the current iteration, if any. The entropy is normalized by subtracting it from the entropy of the frequency count that is created by sending, by the same map that created F, the standard frequency count on the original set. Thus, if a frequency count F on A is created by sending the frequency count G on B by a map f:B→A, i.e., F=f*(G), the quantity J(f,F)=H(f*(St(B)))−H(F) is computed. When a frequency count with J(f,F) higher than a threshold value is found, the map f and the frequency count G that led to F are marked as a pattern and used (e.g., output, backtracked) in the later stages; the map and the frequency count also each have their weight increased by 100. The threshold value should be determined according to the application and other factors, such as the available resources. As a benchmark of the presence of patterns other than J(f,F), another possibility is the relative entropy (also known as Kullback-Leibler divergence). For two frequency counts F and G, the relative entropy D(F,G) is the sum of −PF(a) log2[PF(a)/PG(a)] for all a in supp(G), terms with PF(a)=0 being taken as 0. With this sign convention, D(F,G) is the negative of the customary Kullback-Leibler divergence whenever supp(F) is contained in supp(G), so that a low value indicates that F is far from G. Instead of finding a high J(f,F), a low D(F,f*(St(B))) may be looked for.
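By way of illustration only, the quantity J(f,F) may be computed as in the following Python sketch. The dictionary representation of a frequency count, the function names, the example map, and the threshold value are assumptions made for this sketch.

```python
from collections import Counter
from math import log2

def pushforward(f, F):
    """f*(F): counts of particles with the same image are summed."""
    out = Counter()
    for a, m in F.items():
        out[f(a)] += m
    return dict(out)

def standard(A):
    """St(A): one particle (a, 1) per member of A."""
    return {a: 1 for a in A}

def entropy(F):
    """H(F) in bits; every stored count is assumed positive."""
    total = sum(F.values())
    return -sum((n / total) * log2(n / total) for n in F.values())

def J(f, G, B):
    """J(f, F) = H(f*(St(B))) - H(F), where F = f*(G) and G is on B."""
    return entropy(pushforward(f, standard(B))) - entropy(pushforward(f, G))

B = range(8)
f = lambda b: b % 4
G = {0: 3, 4: 5}          # both members are sent to 0 by f
THRESHOLD = 1.0           # hypothetical; to be chosen per application
score = J(f, G, B)
print(score, score > THRESHOLD)  # 2.0 True
```

Here f*(St(B)) is uniform over four values (entropy 2 bits) while f*(G) has a single particle (entropy 0), so the pair would be marked as a pattern under the assumed threshold.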
In computing the entropy of various frequency counts, various relationships are employed to reduce the computation cost:
When a frequency count F with low entropy is found, a process of idealization takes place. That is a process of creating another frequency count F′ by removing some particles from F so that its entropy would be even lower.
Next, the particles still left in F′ are backtracked. Let the map that produced F be f:A→B, i.e., F=f*(G) for some frequency count G on the set A. A particle (b,n) in F′ is made by combining the particles of the form (f(a),m_a), where (a,m_a) is a particle in G with f(a)=b (see [FC II]). Let f*⁻¹(F′) be the inverse image of F′ by f, which is the restriction of G to f⁻¹(supp(F′)) (see [FC III]). That is, (a,m) in G belongs to f*⁻¹(F′) if and only if countF′(f(a))>0. If f has been made by concatenating more than one map, e.g., f=f_1∘f_2∘ . . . ∘f_k, there will be a series of frequency counts such as f_k*⁻¹(F′), (f_{k−1}∘f_k)*⁻¹(F′), and so on. These frequency counts are added to FC along with the information as to how they were created (e.g., the idealization, the taking of the inverse image) and with the same weight as that of F. They are then treated in the same way as other frequency counts in FC.
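By way of illustration only, idealization followed by backtracking may be sketched in Python as follows. The text does not specify which particles are removed during idealization; keeping only the highest-count particles is an assumption made for this sketch, as are the function names.

```python
from collections import Counter

def pushforward(f, F):
    """f*(F): counts of particles with the same image are summed."""
    out = Counter()
    for a, m in F.items():
        out[f(a)] += m
    return dict(out)

def idealize(F, keep):
    """Hypothetical rule: keep only the `keep` highest-count particles."""
    top = sorted(F.items(), key=lambda p: p[1], reverse=True)[:keep]
    return dict(top)

def backtrack(f, G, F_prime):
    """Inverse image of F' by f: restriction of G to f^-1(supp(F'))."""
    return {a: m for a, m in G.items() if f(a) in F_prime}

G = {1: 2, 2: 1, 3: 4, 4: 1}
f = lambda a: a % 2
F = pushforward(f, G)             # {1: 6, 0: 2}
F_prime = idealize(F, keep=1)     # {1: 6}
print(backtrack(f, G, F_prime))   # {1: 2, 3: 4}
```

The result contains exactly the particles of G whose members are sent by f into the support of F′, which is the part of the data responsible for the pattern.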
Finally, if a frequency count F in FC is on a set of maps, i.e., a set that is of the form A→B for some sets A and B, and if relatively few members of the set have high counts, one or more members of A→B with high counts may be added to MAPS.
The maps that were found as patterns may be used as indicators of useful characteristics or parameters of the original data. As such, they are the output of the embodiment. The part of the data that causes a specific map to be a pattern is found by backtracking and may also be output.
This embodiment can be used to analyze various kinds of data. The following examples are intended to illustrate but not limit the use to which this embodiment may be put.
Data
In this embodiment, an image is loaded from a file in any available image file format and represented in the following way.
The color space is denoted by Col. For a color image, it is generally a three dimensional real vector space. If the image is a grayscale image, Col is the set of real numbers. For images with larger spectrum Col might be a vector space of higher dimensions. Here, the only assumption is that it is a real vector space.
The image domain is denoted by Dom and assumed to be some finite subset of a d-dimensional Euclidean space EDom. For instance, an ordinary bitmap image has a domain of m×n lattice points in a 2-dimensional Euclidean space. For other kinds of images, such as 3D medical image data, the dimension would be higher.
An image generally gives colors at each point in the domain. Thus an image can be considered a map from Dom to Col, that is, a member of the set Dom→Col. This embodiment represents the input image by a frequency count on Dom→Col. That is, the initial data is a frequency count Im in Freq(Dom→Col) that contains one particle (im,1), where im:Dom→Col is the map that sends each pixel position to the color in the image.
Primitive Maps
In addition to the general primitive maps, primitive maps specifically useful for image data may be added. For instance, if the image is in pixels, as is usually the case, the neighbor relationship between pixels may be useful. This is put in the system as a primitive map Nb:Dom×Dom→bool that gives true whenever two members of Dom are neighboring pixels. Another example would be various kinds of filters that are known in the related art of image processing, e.g., a wavelet filter.
Derived Data and Maps
Some examples of simpler maps and data that the method may add to MAPS and FC are:
A. Color frequency
B. Color difference and position difference frequency
Patterns
The frequency count ev*(Im×St(Dom)) on Col obtained in A2 would have small entropy when there are not too many colors used. If the whole image is one color, it would have entropy of 0, the lowest possible value.
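By way of illustration only, the frequency count ev*(Im×St(Dom)) on Col may be computed as in the following Python sketch. It is assumed here that ev is the evaluation map sending a pair (im,p) to im(p); the tiny example images, the use of color names as members of Col, and the function names are likewise assumptions.

```python
from collections import Counter
from math import log2

def standard(A):
    """St(A): one particle (a, 1) per member of A."""
    return {a: 1 for a in A}

def fc_product(F, G):
    """F×G: particles ((a, b), n*m)."""
    return {(a, b): n * m for a, n in F.items() for b, m in G.items()}

def pushforward(f, F):
    """f*(F): counts of particles with the same image are summed."""
    out = Counter()
    for a, m in F.items():
        out[f(a)] += m
    return dict(out)

def entropy(F):
    total = sum(F.values())
    return -sum((n / total) * log2(n / total) for n in F.values())

def ev(pair):
    """Assumed evaluation map: (im, p) -> im(p)."""
    image, p = pair
    return image(p)

def color_frequency(pixels):
    """ev*(Im×St(Dom)) for an image given as {pixel position: color}."""
    def im(p):                      # the map im: Dom -> Col
        return pixels[p]
    Im = {im: 1}                    # one particle (im, 1)
    return pushforward(ev, fc_product(Im, standard(pixels)))

mixed = {(0, 0): "white", (0, 1): "white", (1, 0): "white", (1, 1): "black"}
plain = {p: "white" for p in mixed}
print(color_frequency(mixed) == {"white": 3, "black": 1})  # True
print(entropy(color_frequency(plain)) == 0)                # True
```

The one-color image gives a single particle and hence entropy 0, in agreement with the statement above.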
The frequency count added in B6 on Col×VDom would have small entropy when there are many pairs of pixels that have the same particular color difference and are separated by the same vector. If, for instance, there are horizontal lines of one color, there would be a relatively high concentration of particles (particles with high counts) with color difference 0 and horizontal vectors, giving the frequency count lower entropy.
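By way of illustration only, a frequency count of color differences and position differences may be computed as in the following Python sketch. The sketch loops directly over ordered pixel pairs instead of composing primitive maps; that shortcut, the grayscale values, the (row, column) coordinates, and the function name are assumptions.

```python
from collections import Counter

def difference_frequency(pixels):
    """Counts of (color difference, displacement) over all ordered pixel pairs."""
    out = Counter()
    for (r1, c1), v1 in pixels.items():
        for (r2, c2), v2 in pixels.items():
            out[(v1 - v2, (r1 - r2, c1 - c2))] += 1
    return dict(out)

# 3x3 grayscale image made of three horizontal lines (rows), one value each.
pixels = {(r, c): [0, 5, 9][r] for r in range(3) for c in range(3)}
D = difference_frequency(pixels)
print(D[(0, (0, 1))])   # 6
print(all(diff == 0 for (diff, (dr, dc)) in D if dr == 0))  # True
```

Every horizontal displacement occurs only with color difference 0, so the counts concentrate on a few particles, as described above for images with horizontal lines.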
A data matrix is a rectangular array with N rows and D columns, the rows giving different observations or individuals and the columns giving different attributes or variables. Each variable can have a value that is a member of some set, which we call here the value set. For instance, if the variable can only take an integral number, the value set is the set of integers. If the variable can take any number, the value set is the set of real numbers. Or if the variable can take the value of “yes” or “no”, the value set can be the set of Booleans.
Let the D variables be denoted by a1,a2, . . . ,aD and the sets in which the variables take values by X1,X2, . . . ,XD, respectively. Then, each observation gives a member in the set X1×X2× . . . ×XD. The input data in the form of a data matrix is represented in this embodiment as a frequency count on X1×X2× . . . ×XD with each observation contributing a single count in one particle. Thus, the mass of the frequency count is N.
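By way of illustration only, a data matrix may be turned into a frequency count as in the following Python sketch, in which each row is a tuple in X1×X2 and identical rows share one particle. The example rows and the function name pushforward are assumptions made for this sketch.

```python
from collections import Counter

# Hypothetical data matrix with N = 4 observations and D = 2 variables.
rows = [
    (34, "yes"),
    (51, "no"),
    (34, "yes"),
    (27, "no"),
]

# Each observation contributes a single count to the particle for its row.
F = dict(Counter(rows))
print(F[(34, "yes")], sum(F.values()) == len(rows))  # 2 True

def pushforward(f, F):
    """f*(F): counts of particles with the same image are summed."""
    out = Counter()
    for a, m in F.items():
        out[f(a)] += m
    return dict(out)

# Sending F along the projection to the second variable gives its marginal counts.
answers = pushforward(lambda row: row[1], F)
print(answers)  # {'yes': 2, 'no': 2}
```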
Thus a method and apparatus have been disclosed to arrange given data so that high-dimensional data can be analyzed more effectively and patterns within the data can be discovered more readily. It is applicable in a wide variety of industries, where more and more data are collected and it is increasingly important to find the relevant information in a vast pile of data. The areas in which the present invention is useful include the case of a large number of genes and relatively few patients with a given genetic disease, and the case of images, which can easily have a million dimensions (pixels).
While only certain preferred features of the invention have been illustrated and described herein, many modifications and changes will occur to those skilled in the art. For instance, concepts such as sets and maps, which have been used herein to explain the present invention, have many equivalent or similar concepts in diverse disciplines: e.g., function, type, method, etc. Terminology such as set and map can be avoided entirely if one wishes; the whole invention can be described in terms of data and subroutines. Such superficial differences are, however, not real differences.
It is, therefore, to be understood that the appended claims are intended to cover all such modifications, changes and differences of terminologies as fall within the true spirit of the invention.
Filing Document | Filing Date | Country | Kind | 371c Date
---|---|---|---|---
PCT/IB05/52570 | 8/1/2005 | WO | 00 | 2/1/2007

Number | Date | Country
---|---|---
60592911 | Aug 2004 | US