Claims
- 1. A method for analyzing gene expression data, the method comprising:
(a) organizing data pertaining to a plurality of samples into a b-tree comprising a plurality of levels, each level comprising a plurality of leaf nodes;
(b) defining a plurality of attributes for filtering the data at each level of the b-tree;
(c) distributing the data among the plurality of leaf nodes according to the plurality of attributes;
(d) grouping the leaf nodes according to their corresponding attributes;
(e) defining a control sample set and an experimental sample set;
(f) performing a t-test comparing the experimental sample set with the control sample set; and
(g) generating a table of t-test results.
- 2. The method of claim 1, wherein the t-test results and information for identifying the gene expression data are stored in a relational database.
- 3. The method of claim 2, wherein the relational database comprises a plurality of relational tables comprising:
the table of t-test results; and
at least one second table of the plurality of relational tables comprising a plurality of labels for associating the plurality of t-test results with descriptions of the comparisons to which they correspond.
- 4. The method of claim 1, further comprising:
defining a different control sample set and a different experimental sample set;
repeating steps (e) and (f) for a plurality of pairwise comparisons; and
entering the results of each step (f) into the table.
- 5. The method of claim 1, wherein the plurality of attributes comprise structural and morphological characteristics of gene expression data.
- 6. The method of claim 1, wherein the plurality of attributes is selected from the group consisting of organ site, diagnosis, disease, stage of disease, demographic and donor data.
- 7. The method of claim 6, wherein donor data is from a human donor and is selected from the group consisting of height, weight, race, date of birth, cause of death, age at death, exercise habits, diet profile, sleeping habits, smoking habits, alcohol habits, and drug habits.
- 8. The method of claim 6, wherein donor data is from an animal donor and is selected from the group consisting of strain, genetic modification and treatment information.
- 9. The method of claim 1, further comprising filtering the data to prune data sets that are smaller than a pre-determined minimum sample size.
- 10. The method of claim 1, further comprising filtering the data to prune outlier samples.
- 11. The method of claim 1, further comprising filtering the data to prune data sets that fail to meet pre-determined quality control criteria.
- 12. The method of claim 1, wherein step (d) comprises performing a simple search to compare attributes by using a search grammar comprising an array of references to sub-arrays, each sub-array comprising at least one search term.
- 13. The method of claim 1, wherein the t-test results are encoded according to a three-state scheme where up-regulation relative to the control sample is assigned a first symbol, down-regulation relative to the control sample is assigned a second symbol different from the first symbol and no change relative to the control sample is assigned a third symbol different from the first and second symbols.
- 14. The method of claim 13, wherein the first symbol is +1, the second symbol is −1, and the third symbol is 0.
- 15. The method of claim 13, further comprising operating on the encoded t-test results using a similarity search algorithm to measure a level of similarity between a plurality of different experimental sample sets.
- 16. The method of claim 15, wherein the similarity search algorithm comprises the kappa statistic.
- 17. The method of claim 15, further comprising ranking the encoded t-test results according to the level of similarity.
- 18. A computerized storage and retrieval system of biological information comprising:
a stored database containing records pertaining to a plurality of gene regulation events for each of a plurality of control samples and experimental samples, wherein the database comprises a plurality of relational tables, each relational table having a means for linking to at least one other relational table of the plurality; the plurality of relational tables comprising:
a first table of the plurality of relational tables comprising a plurality of gene regulation event categories into which at least some of the gene regulation events are grouped, the first table comprising results of a plurality of comparisons of selected control samples and selected experimental samples, wherein the selected control samples and selected experimental samples are results of a b-tree analysis;
at least one second table of the plurality of relational tables comprising a plurality of labels for associating the plurality of comparisons with descriptions of the comparisons; and
a user interface allowing a user to selectively view information regarding the plurality of gene regulation events.
- 19. The system of claim 18, further comprising a third table of the plurality of relational tables comprising one or more manually-generated comparisons.
- 20. The system of claim 18, wherein the plurality of relational tables further comprises one or more tables selected from the group consisting of a control vocabulary table for selecting one or more areas of analysis, a comparison table for describing a nature of each comparison, a context table for describing the organization of the b-tree, a sample set path table describing a pathway used to navigate the b-tree, and a gene family table for describing a gene type within which the sample sets fall.
- 21. The system of claim 18, further comprising a database for storing encoded data, wherein the encoded data comprises the results of the b-tree analysis compared by performing a t-test then encoded according to a three-state scheme where up-regulation in the selected experimental sample relative to the selected control sample is assigned a first symbol, down-regulation is assigned a second symbol different from the first symbol and no change is assigned a third symbol different from the first and second symbols.
- 22. The system of claim 21, wherein the first symbol is +1, the second symbol is −1, and the third symbol is 0.
- 23. The system of claim 21, further comprising a similarity search algorithm for operating on the encoded data to measure a level of similarity between a plurality of different experimental sample sets.
- 24. The system of claim 23, wherein the similarity search algorithm comprises the kappa statistic.
- 25. The system of claim 23, wherein the user interface outputs a report comprising a ranking of the encoded data according to the level of similarity.
- 26. A database system having a plurality of internal records, the database comprising:
a first plurality of records specifying gene regulation events for a plurality of samples;
a second plurality of records specifying comparison results from comparison sets comprising selected experiment sample sets and selected control sample sets, wherein a first portion of the plurality of samples is designated sample control sets and a second portion of the plurality of samples is designated experiment sample sets, wherein the comparison results are derived from b-tree analysis of the comparison sets;
a third plurality of records specifying comparison context comprising data describing how a comparison set was selected and analyzed; and
a fourth plurality of records comprising a plurality of links for associating the first, second and third plurality of records.
- 27. The database of claim 26, wherein the statistical analysis of the comparison sets comprises a t-test.
- 28. The database of claim 27, further comprising a plurality of encoded records wherein the comparison results are encoded in a three-state scheme corresponding to up-regulation, down-regulation and no change.
- 29. The database of claim 28, wherein the encoded records are acted on by a similarity search algorithm.
- 30. The database of claim 29, wherein the similarity search algorithm is the kappa statistic.
- 31. The database of claim 26, wherein the comparison context comprises a description of the b-tree organization.
- 32. The database of claim 31, wherein the b-tree organization comprises a hierarchical organization of a plurality of levels, each level corresponding to an attribute selected for filtering the first plurality of records.
- 33. The database of claim 26, further comprising a fifth plurality of records comprising manually-generated comparisons, wherein the fourth plurality of records further comprises a link for associating the fifth plurality of records with the first, second and third pluralities of records.
- 34. The database of claim 26, wherein the first plurality of records further includes a regulation event identifier associated with each gene regulation event.
- 35. The database of claim 26, wherein the first, second, third, fourth and fifth pluralities of records are arranged in a relational format comprising a plurality of relational tables.
- 36. The database of claim 35, wherein the plurality of relational tables comprises at least an event table, a comparison table, a control vocabulary table, a context table, a sample table, a manually-generated comparison table and a gene family table.
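For orientation only, the flow recited in claims 1 and 13 to 16 (grouping samples by attributes, comparing an experimental set with a control set by t-test, encoding the result in three states, and measuring similarity with the kappa statistic) might be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the "b-tree" is approximated by a dictionary keyed on the path of attribute values, the t-test is taken to be Welch's variant, significance is approximated by a fixed cutoff `t_crit` on the t statistic rather than a p-value, the kappa statistic is taken to be Cohen's kappa, and every function name, field name and data value below is hypothetical.

```python
from collections import Counter, defaultdict
from math import sqrt
from statistics import mean, variance


def group_by_path(samples, attributes):
    """Group samples by their attribute-value path (one attribute per level)."""
    groups = defaultdict(list)
    for sample in samples:
        path = tuple(sample[a] for a in attributes)
        groups[path].append(sample)
    return dict(groups)


def welch_t(experimental, control):
    """Welch's t statistic; assumes each set has >= 2 values and nonzero variance."""
    se = sqrt(variance(experimental) / len(experimental)
              + variance(control) / len(control))
    return (mean(experimental) - mean(control)) / se


def encode(t, t_crit=2.0):
    """Three-state code: +1 up-regulation, -1 down-regulation, 0 no change."""
    if t > t_crit:
        return 1
    if t < -t_crit:
        return -1
    return 0


def compare(groups, exp_path, ctl_path, genes, t_crit=2.0):
    """Encode one experimental-vs-control comparison, one code per gene."""
    codes = []
    for gene in genes:
        exp = [s["expr"][gene] for s in groups[exp_path]]
        ctl = [s["expr"][gene] for s in groups[ctl_path]]
        codes.append(encode(welch_t(exp, ctl), t_crit))
    return codes


def cohen_kappa(a, b):
    """Cohen's kappa between two equal-length vectors of three-state codes."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[c] * count_b[c] for c in (-1, 0, 1)) / (n * n)
    if expected == 1.0:  # both vectors constant and identical
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical data: three control and three experimental liver samples.
samples = [
    {"organ": "liver", "diagnosis": "normal", "expr": {"G1": 5.0, "G2": 7.0, "G3": 4.0}},
    {"organ": "liver", "diagnosis": "normal", "expr": {"G1": 5.2, "G2": 7.1, "G3": 4.4}},
    {"organ": "liver", "diagnosis": "normal", "expr": {"G1": 4.8, "G2": 6.9, "G3": 3.6}},
    {"organ": "liver", "diagnosis": "tumor", "expr": {"G1": 9.0, "G2": 3.0, "G3": 4.1}},
    {"organ": "liver", "diagnosis": "tumor", "expr": {"G1": 9.3, "G2": 3.2, "G3": 3.7}},
    {"organ": "liver", "diagnosis": "tumor", "expr": {"G1": 8.7, "G2": 2.8, "G3": 4.2}},
]
groups = group_by_path(samples, ["organ", "diagnosis"])
codes = compare(groups, ("liver", "tumor"), ("liver", "normal"), ["G1", "G2", "G3"])
print(codes)  # [1, -1, 0]
```

Encoded vectors from two different comparisons can then be passed to `cohen_kappa` and the comparisons ranked by the resulting value, in the manner of claim 17. A production analysis would more likely derive p-values from a statistics library than rely on a fixed cutoff.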
RELATED APPLICATIONS
[0001] This application claims priority to U.S. Provisional Applications Ser. No. 60/331,182, filed Nov. 9, 2001, Ser. No. 60/388,745, filed Jun. 17, 2002, and Ser. No. 60/390,608, filed Jun. 21, 2002, each entitled “An Automated Computer-Based Algorithm for Organizing and Mining Gene Data Derived from Biological Samples with Complex Clinical Attributes”, and Ser. No. 60/412,156, filed Sep. 19, 2002, entitled “Comparative Gene Expression Database and Algorithm for Generating Same”, each of which is incorporated herein by reference in its entirety.
PCT Information
| Filing Document | Filing Date | Country | Kind |
| --- | --- | --- | --- |
| PCT/US02/35454 | 11/4/2002 | WO | |
Provisional Applications (4)
| Number | Date | Country |
| --- | --- | --- |
| 60331182 | Nov 2001 | US |
| 60388745 | Jun 2002 | US |
| 60390608 | Jun 2002 | US |
| 60412156 | Sep 2002 | US |