Claims
- 1. A method for processing information in a data set that contains samples of at least two classes using an empirical risk minimization model, wherein each sample in the data set has an importance score, comprising:
    a. selecting, from the data set, samples of a first class labeled with class label +1 and of a second class labeled with class label −1;
    b. prescribing an empirical risk minimization model using the selected samples, with an objective function and a plurality of constraints that adequately describe the solution of a classifier to separate the selected samples into the first class and the second class;
    c. modifying the empirical risk minimization model to include terms that individually limit the influence of each sample relative to its importance score in the solution of the empirical risk minimization model; and
    d. solving the modified empirical risk minimization model to obtain the corresponding classifier to separate the samples into the first class and the second class.
- 2. The method of claim 1, wherein each sample comprises measurements of expression levels of a plurality of selected known or unknown biological entities in a corresponding biological specimen.
- 3. The method of claim 2, wherein the class of a sample as identified by class label+1 or −1, respectively, indicates the phenotype of the biological specimen comprising the presence, absence, or severity of a plurality of biological conditions or perturbations associated with the biological specimen, respectively.
- 4. The method of claim 2, wherein a biological entity comprises a plurality of cellular constituents.
- 5. The method of claim 1, wherein the importance score of a sample represents the relative trustworthiness of the sample.
- 6. The method of claim 5, further comprising the step of obtaining an importance score for each sample from information related to the data set.
- 7. The method of claim 6, wherein the step of obtaining an importance score for each sample from information related to the data set further comprises the steps of:
    a. deriving a statistical classification model using the data set;
    b. applying the classification model to a sample to generate an output; and
    c. comparing the output to the known class label of the sample to compute an importance score for the sample, such that a sample that is misclassified or considered an outlier based on the output of said classification model is assigned a relatively lower importance score.
- 8. The method of claim 1, further comprising the steps of:
    a. introducing a new sample x;
    b. applying the classifier on x and obtaining an output y; and
    c. assigning the sample x to the class corresponding to the label +1 or −1 based on a cutoff on y.
- 9. A method for processing information in a data set of m samples xi, i=1, 2, . . . , m with the corresponding class membership labels ci, i=1, 2, . . . , m, ∈{−1, +1}, each sample comprising measurements of n variables, comprising:
    a. assigning to each sample $x_i$ a relative importance score $p_i \ge 0$, $p_i$ representing the trustworthiness of sample $x_i$;
    b. minimizing $\frac{1}{2}\, v \cdot v + \sum_{i=1}^{m} p_i \xi_i$ subject to $c_i (v \cdot x_i + b) \ge 1 - \xi_i$, $i = 1, 2, \ldots, m$, to obtain a solution comprising $v$ and $b$, wherein $\xi_i$ represents a non-negative error for the $i$th constraint; and
    c. constructing an n-dimensional unit vector $d = v / \lVert v \rVert = (d_1\ d_2\ \ldots\ d_n)^T$ from the solution that identifies a direction along which the samples are best separated into a first class labeled as +1 and a second class labeled as −1, respectively, for the set of assigned importance scores $p_1, p_2, \ldots, p_m$.
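As a non-limiting illustration of steps (b) and (c) of claim 9, the following sketch approximately minimizes the weighted objective by subgradient descent on its equivalent hinge-loss form. The claims do not prescribe a solver; the function names `fit_weighted_svm` and `unit_direction`, the step-size schedule, the iteration count, and the toy data are all assumptions made for this example only.

```python
import math

def fit_weighted_svm(X, c, p, epochs=2000, lr=0.01):
    """Approximately minimise 0.5*v.v + sum_i p_i*xi_i subject to
    c_i*(v.x_i + b) >= 1 - xi_i and xi_i >= 0, using subgradient descent
    on the equivalent form 0.5*v.v + sum_i p_i*max(0, 1 - c_i*(v.x_i + b)).
    The solver choice is an assumption, not part of the claims."""
    n = len(X[0])
    v = [0.0] * n
    b = 0.0
    for t in range(1, epochs + 1):
        step = lr / math.sqrt(t)          # decaying step size (assumed)
        grad_v = list(v)                  # gradient of 0.5*v.v
        grad_b = 0.0
        for x, ci, pi in zip(X, c, p):
            margin = ci * (sum(vj * xj for vj, xj in zip(v, x)) + b)
            if margin < 1.0:              # sample has a positive error xi_i
                for j in range(n):
                    grad_v[j] -= pi * ci * x[j]
                grad_b -= pi * ci
        v = [vj - step * gj for vj, gj in zip(v, grad_v)]
        b -= step * grad_b
    return v, b

def unit_direction(v):
    """Step (c): d = v / ||v|| (requires a non-zero v)."""
    norm = math.sqrt(sum(vj * vj for vj in v))
    return [vj / norm for vj in v]

# Hypothetical toy data: two variables, four samples, equal importance scores.
X = [[2.0, 0.0], [3.0, 1.0], [-2.0, 0.0], [-3.0, -1.0]]
c = [+1, +1, -1, -1]
p = [1.0, 1.0, 1.0, 1.0]
v, b = fit_weighted_svm(X, c, p)
d = unit_direction(v)
```

Because each importance score multiplies only the error term of its own sample, lowering a score reduces the pull that sample can exert on the solution, which is the effect the claim attributes to the scores.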
- 10. The method of claim 8, wherein the absolute value of the kth element in the n-dimensional unit vector d, |dk|, k=1, 2, . . . , n, corresponds to a relative significance measure of the kth variable in the separation of the samples into the first and second classes.
- 11. The method of claim 8, wherein the sign of the kth element in the n-dimensional unit vector d, sign(dk), k=1, 2, . . . , n, indicates whether the corresponding kth variable is up-regulated or down-regulated with respect to the data class labeled as +1.
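As a non-limiting illustration of the significance and sign readings recited in claims 10 and 11, the sketch below ranks variables by |dk| and labels each by the sign of dk. The function name `rank_variables`, the zero-based indices, the "neutral" label for a zero element, and the example vector are assumptions made for this example.

```python
def rank_variables(d):
    """Return (index, |d_k|, direction) tuples sorted by decreasing |d_k|.
    Indices are zero-based here, whereas the claims count k from 1."""
    rows = []
    for k, dk in enumerate(d):
        # Sign of d_k read relative to the class labeled +1.
        direction = "up" if dk > 0 else "down" if dk < 0 else "neutral"
        rows.append((k, abs(dk), direction))
    return sorted(rows, key=lambda r: r[1], reverse=True)

# Hypothetical unit vector with three variables.
ranking = rank_variables([0.6, -0.8, 0.0])
```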
- 12. The method of claim 9, 10 or 11, wherein each sample comprises n measurements of expression levels of a plurality of selected known or unknown biological entities in a corresponding biological specimen.
- 13. The method of claim 10 or 11, wherein the kth variable comprises measurements of expression levels of the kth selected known or unknown biological entity across all samples in the data set.
- 14. The method of claim 9, 10 or 11, wherein the class of a sample as identified by class label +1 or −1, respectively, indicates the phenotype of the biological specimen comprising the presence, absence, or severity of a plurality of biological conditions or perturbations associated with the biological specimen.
- 15. The method of claim 12, wherein a biological entity comprises a plurality of cellular constituents.
- 16. The method of claim 12, wherein a biological entity comprises a gene, a protein, or a fragment of a protein.
- 17. The method of claim 9, further comprising the steps of:
    a. introducing a new sample $x = (x_1\ x_2\ \ldots\ x_n)^T$;
    b. computing a scalar value $y = d \cdot x = \sum_{j=1}^{n} d_j x_j$; and
    c. assigning the sample x to the class corresponding to the label +1 if $y > y_c$ and to the class corresponding to the label −1 if $y \le y_c$, respectively, wherein $y_c$ is a scalar cutoff value on y.
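As a non-limiting illustration of claim 17, the assignment rule reduces to a dot product followed by a threshold test. The function name `classify` and the example values are assumptions; the claims leave the choice of the cutoff yc open, so it is a required argument here.

```python
def classify(d, x, y_c):
    """Project x onto d and compare the scalar y with the cutoff y_c.
    Returns +1 if y > y_c, otherwise -1 (y == y_c goes to class -1)."""
    y = sum(dj * xj for dj, xj in zip(d, x))
    return +1 if y > y_c else -1

# Hypothetical direction, samples, and cutoff.
label_a = classify([1.0, 0.0], [0.5, 3.0], y_c=0.0)
label_b = classify([1.0, 0.0], [0.0, 3.0], y_c=0.0)
```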
- 18. The method of claim 9, further comprising the steps of:
    a. selecting a pair of constant positive values for parameter $\sigma$ and parameter $C$, respectively;
    b. selecting a positive function $\Phi(t_1, t_2)$ that has a range [0, 1], and is monotonically decreasing with respect to its first variable $t_1$ and monotonically increasing with respect to its second variable $t_2$;
    c. computing a $\delta_i$ for each sample $x_i$, $i = 1, \ldots, m$, $\delta_i$ being a quantitative measure of discrepancy between $x_i$'s known class membership and information extracted from the data set;
    d. choosing the set of assigned importance scores $p_1, p_2, \ldots, p_m$ in the form of $p_i = C\,\Phi(\delta_i, \sigma)$, $i = 1, \ldots, m$; and
    e. minimizing $\frac{1}{2}\, v \cdot v + \sum_{i=1}^{m} p_i \xi_i$ subject to the constraints $c_i (v \cdot x_i + b) \ge 1 - \xi_i$, $i = 1, 2, \ldots, m$, to obtain a solution comprising $v$, wherein $\xi_i$ represents a non-negative error for the $i$th constraint.
- 19. The method of claim 18, wherein the first class has a class mean M1 and the second class has a class mean M2, and δi for each sample xi, i=1, . . . , m, is the shortest distance between the sample xi and the line passing through, and thereby defined by, M1 and M2.
- 20. The method of claim 18, wherein the function has the form of $\Phi(\delta_i, \sigma) = \exp(-\delta_i^2 / \sigma^2)$, $i = 1, \ldots, m$.
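As a non-limiting illustration of claims 18 to 20 taken together, the sketch below computes δi as the distance from each sample to the line through the two class means and then sets pi = C·exp(−δi²/σ²). The function names and the toy data are assumptions made for this example, and the class means are assumed to be distinct.

```python
import math

def class_mean(X, c, label):
    """Mean vector of the samples carrying the given class label."""
    rows = [x for x, ci in zip(X, c) if ci == label]
    return [sum(r[j] for r in rows) / len(rows) for j in range(len(rows[0]))]

def distance_to_line(x, m1, m2):
    """Shortest distance from x to the line through m1 and m2 (m1 != m2)."""
    u = [b - a for a, b in zip(m1, m2)]            # line direction
    w = [xj - a for xj, a in zip(x, m1)]           # x relative to m1
    t = sum(wj * uj for wj, uj in zip(w, u)) / sum(uj * uj for uj in u)
    resid = [wj - t * uj for wj, uj in zip(w, u)]  # component off the line
    return math.sqrt(sum(r * r for r in resid))

def importance_scores(X, c, C, sigma):
    """p_i = C * exp(-delta_i**2 / sigma**2), per claims 18 to 20."""
    m1 = class_mean(X, c, +1)
    m2 = class_mean(X, c, -1)
    return [C * math.exp(-distance_to_line(x, m1, m2) ** 2 / sigma ** 2)
            for x in X]

# Hypothetical data: class means are (2, 0) and (-2, 0), so the line is the
# first axis and delta_i is the absolute value of the second coordinate.
X = [[2.0, 1.0], [2.0, -1.0], [-2.0, 3.0], [-2.0, -3.0]]
c = [+1, +1, -1, -1]
p = importance_scores(X, c, C=2.0, sigma=1.0)
```

Samples lying far from the line joining the class means receive scores close to zero, while samples on the line receive the full value C.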
- 21. The method of claim 18, further comprising:
    a. introducing a new sample $x = (x_1\ x_2\ \ldots\ x_n)^T$;
    b. computing a scalar value $y = d \cdot x = \sum_{j=1}^{n} d_j x_j$; and
    c. assigning the sample x to the class corresponding to the label +1 if $y > y_c$ and to the class corresponding to the label −1 if $y \le y_c$, respectively, wherein $y_c$ is a scalar cutoff value on y.
- 22. The method of claim 18, for a pair of σ and C, further comprising a step of performing a backward stepwise variable selection procedure which comprises the steps of:
    a. assigning each variable in the data set an initial temporary significance score of zero;
    b. computing a temporary significance score for each variable in the data set based on the absolute value of the corresponding element in $d = v / \lVert v \rVert$ from the solution and the variable's temporary significance score;
    c. finding the variable in the data set with the smallest temporary significance score;
    d. assigning the temporary significance score of the variable as its final significance score and removing it from the data set to be used in future iterations;
    e. repeating steps (b)-(d) until all variables in the data set have been assigned a final significance score; and
    f. constructing a vector $s = (s_1\ s_2\ \ldots\ s_n)$, wherein $s_k$, $k = 1, \ldots, n$, represents a computed final significance score for the kth variable of the n variables in the separation of the samples into the first and second classes.
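As a non-limiting illustration of the backward stepwise procedure of claim 22, the sketch below removes one variable per iteration. Two points are assumptions made for this example: the claim does not state how |dk| is combined with the previous temporary score, so the score is simply accumulated by addition; and the solver is passed in as a hypothetical callable `fit_direction`, with a normalized difference of class means used as a stand-in so that the block stays short.

```python
import math

def mean_difference_direction(X, c):
    """Stand-in for the claimed solver (an assumption): the unit vector
    along the difference of the two class means."""
    n = len(X[0])
    pos = [x for x, ci in zip(X, c) if ci == +1]
    neg = [x for x, ci in zip(X, c) if ci == -1]
    diff = [sum(r[j] for r in pos) / len(pos) - sum(r[j] for r in neg) / len(neg)
            for j in range(n)]
    norm = math.sqrt(sum(v * v for v in diff))
    return [v / norm for v in diff]

def backward_stepwise_scores(X, c, fit_direction):
    n = len(X[0])
    remaining = list(range(n))      # variables still in the data set
    temp = [0.0] * n                # step (a): temporary scores start at zero
    final = [None] * n
    while remaining:                # step (e): repeat until all are scored
        sub = [[x[k] for k in remaining] for x in X]
        d = fit_direction(sub, c)
        for pos, k in enumerate(remaining):
            temp[k] += abs(d[pos])  # step (b): assumed additive update
        worst = min(remaining, key=lambda k: temp[k])   # step (c)
        final[worst] = temp[worst]  # step (d): fix the score and remove
        remaining.remove(worst)
    return final                    # step (f): the vector s

# Hypothetical data with three variables; the third carries no class signal.
X = [[3.0, 1.0, 0.0], [3.0, 1.0, 0.0], [-3.0, -1.0, 0.0], [-3.0, -1.0, 0.0]]
c = [+1, +1, -1, -1]
s = backward_stepwise_scores(X, c, mean_difference_direction)
```

Under the additive assumption, variables that survive more iterations accumulate larger scores, so the order of removal is reflected in the final vector s.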
- 23. The method of claim 22, wherein each sample comprises n measurements of expression levels of a plurality of selected known or unknown biological entities in a corresponding biological specimen.
- 24. The method of claim 22, wherein the kth variable comprises measurements of expression levels of the kth selected known or unknown biological entity across all samples in the data set.
- 25. The method of claim 22, wherein the class of a sample identified by class label+1 or −1, respectively, indicates the phenotype of the biological specimen comprising the presence, absence, or severity of a plurality of biological conditions or perturbations associated with the biological specimen, respectively.
- 26. The method of claim 23 or 24, wherein a biological entity comprises a plurality of cellular constituents.
- 27. The method of claim 23 or 24, wherein a biological entity comprises a gene, a protein, or a fragment of a protein.
- 28. The method of claim 18, for a pair of σ and C, further comprising a step of performing a component analysis procedure to determine q unit vectors, q≦min{m, n}, as projection vectors to a q dimensional component space, wherein the performing step comprises the following steps of:
    a. setting k=n;
    b. obtaining unit vector $d = v / \lVert v \rVert$ from the solution using a current data set;
    c. projecting the samples onto a (k−1) dimensional subspace perpendicular to the unit vector d, and renaming these projections as the current data set;
    d. saving d as a projection vector and setting k=k−1; and
    e. repeating steps (b)-(d) until q projection vectors have been determined.
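As a non-limiting illustration of the component analysis procedure of claim 28, the sketch below finds a direction, removes that direction from every sample, and repeats. Several points are assumptions made for this example: the solver is an illustrative subgradient routine named `hinge_direction`; the importance scores are held fixed across iterations; and the projected samples are kept in the original n coordinates (x − (x·d)d) rather than re-expressed in k−1 coordinates.

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def hinge_direction(X, c, p, epochs=500, lr=0.01):
    """Illustrative solver (an assumption): subgradient descent on
    0.5*v.v + sum_i p_i*max(0, 1 - c_i*(v.x_i + b)); returns d = v/||v||."""
    v, b = [0.0] * len(X[0]), 0.0
    for t in range(1, epochs + 1):
        step = lr / math.sqrt(t)
        gv, gb = list(v), 0.0
        for x, ci, pi in zip(X, c, p):
            if ci * (dot(v, x) + b) < 1.0:     # sample has a positive error
                gv = [g - pi * ci * xj for g, xj in zip(gv, x)]
                gb -= pi * ci
        v = [vj - step * g for vj, g in zip(v, gv)]
        b -= step * gb
    norm = math.sqrt(dot(v, v))
    return [vj / norm for vj in v]

def component_vectors(X, c, p, q, fit_direction=hinge_direction):
    current = [list(x) for x in X]             # the current data set
    directions = []
    for _ in range(q):                         # step (e): until q vectors
        d = fit_direction(current, c, p)       # step (b)
        directions.append(d)                   # step (d): save d
        # Step (c): project onto the subspace perpendicular to d.
        current = [[xj - dot(x, d) * dj for xj, dj in zip(x, d)]
                   for x in current]
    return directions

# Hypothetical toy data: two variables, four samples, equal importance scores.
X = [[2.0, 1.0], [2.0, 3.0], [-2.0, -1.0], [-2.0, -3.0]]
c = [+1, +1, -1, -1]
p = [1.0, 1.0, 1.0, 1.0]
D = component_vectors(X, c, p, q=2)
```

Since every later direction is fitted to samples that no longer have any component along the earlier ones, the saved projection vectors come out mutually perpendicular up to rounding.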
- 29. The method of claim 28, wherein each sample comprises n measurements of expression levels of a plurality of selected known or unknown biological entities in a corresponding biological specimen.
- 30. The method of claim 28, wherein the kth variable comprises measurements of expression levels of the kth selected known or unknown biological entity across all samples in the data set.
- 31. The method of claim 28, wherein the class of a sample identified by class label+1 or −1, respectively, indicates the phenotype of the biological specimen comprising the presence, or absence, or severity of a plurality of biological conditions or perturbations associated with the biological specimen, respectively.
- 32. The method of claim 29 or 30, wherein a biological entity comprises a plurality of cellular constituents.
- 33. The method of claim 29 or 30, wherein a biological entity comprises a gene, a protein, or a fragment of a protein.
- 34. A system for processing information in a data set having n variables and m samples xi, i=1, 2, . . . , m with the corresponding class membership labels ci, i=1, 2, . . . , m ∈{−1,+1}, comprising:
    a. an input device for receiving the information; and
    b. a processing unit communicating with the input device and performing the steps of:
        i. assigning to each sample $x_i$ a relative importance score $p_i \ge 0$, $p_i$ representing the trustworthiness of sample $x_i$;
        ii. minimizing $\frac{1}{2}\, v \cdot v + \sum_{i=1}^{m} p_i \xi_i$ subject to the constraints $c_i (v \cdot x_i + b) \ge 1 - \xi_i$, $i = 1, 2, \ldots, m$, to obtain a solution comprising $v$ and $b$, wherein $\xi_i$ represents a non-negative error for the $i$th constraint; and
        iii. constructing an n-dimensional unit vector $d = v / \lVert v \rVert = (d_1\ d_2\ \ldots\ d_n)^T$ from the solution to identify a direction along which the samples are best separated into a first class labeled as +1 and a second class labeled as −1 for the set of assigned importance scores $p_1, p_2, \ldots, p_m$.
- 35. The system of claim 34, wherein the absolute value of the kth element in the n-dimensional unit vector d, |dk|, k=1, 2, . . . , n, corresponds to a relative significance measure of the kth variable in the separation of the samples into the first and second classes.
- 36. The system of claim 34, wherein the sign of the kth element in the n-dimensional unit vector d, sign(dk), k=1, 2, . . . , n, indicates whether the corresponding kth variable is up-regulated or down-regulated with respect to the data class labeled as +1.
- 37. The system of claim 34, 35 or 36, wherein each sample comprises n measurements of expression levels of a plurality of selected known or unknown biological entities in a corresponding biological specimen.
- 38. The system of claim 35 or 36, wherein the kth variable comprises measurements of expression levels of the kth selected known or unknown biological entity across all samples in the data set.
- 39. The system of claim 34, 35 or 36, wherein the class of a sample identified by class label+1 or −1, respectively indicates the phenotype of the biological specimen comprising the presence, absence, or severity of a plurality of biological conditions or perturbations associated with the biological specimen, respectively.
- 40. The system of claim 37, wherein a biological entity comprises a plurality of cellular constituents.
- 41. The system of claim 37, wherein a biological entity comprises a gene or a protein or a fragment of a protein.
- 42. The system of claim 34, further comprising:
    a. introducing a new sample $x = (x_1\ x_2\ \ldots\ x_n)^T$;
    b. computing a scalar value $y = d \cdot x = \sum_{j=1}^{n} d_j x_j$; and
    c. assigning the sample x to the class corresponding to the class label +1 if $y > y_c$ and to the class corresponding to the class label −1 if $y \le y_c$, respectively, wherein $y_c$ is a scalar cutoff value on y.
- 43. The system of claim 34, wherein the processing unit comprises a microprocessor.
- 44. The system of claim 34, wherein the input device comprises a microprocessor interface.
- 45. The system of claim 34, wherein the input device further comprises at least one device selected from the group of a GUI, a scanner, a CD-ROM, a diskette, a computer coupled to a network, and a networking device.
- 46. The system of claim 34, further comprising an output device coupled to the processing unit.
- 47. The system of claim 34, wherein the output device comprises at least one device selected from the group of a GUI, a printer, a CD-ROM, a diskette, a computer coupled to a network, and a networking device.
CROSS-REFERENCE TO RELATED APPLICATION
[0001] This application claims priority to U.S. Provisional Application Serial No. 60/290,526, which was filed on May 11, 2001, in the United States Patent and Trademark Office and is incorporated herein by reference in its entirety.
Provisional Applications (1)
| Number | Date | Country |
|---|---|---|
| 60290526 | May 2001 | US |