The invention relates to methods for classifying samples based on alterations in network modularity. The methods may be useful for the diagnosis, prognosis and monitoring of a biological state such as a disease state.
Genome-scale technologies are being utilized to understand complex diseases such as cancer1. In particular, transcriptome analyses have been extensively applied as molecular diagnostic and prognostic tools in breast cancer. This has revealed clusters of gene expression signatures, such as the 70 gene prognostic2, Luminal/Basal3 and Wound4 signatures that have prognostic value. Interestingly, these different signatures have little overlap, yet when used to examine the same set of patients, they yield comparable prognostic results. This has led to the suggestion that each signature is capturing a portion of the alterations in the global transcriptome that result in poor prognosis in breast cancer5.
High throughput technologies have also been applied to the development of proteome wide maps of protein-protein interaction networks (interactomes). Interactome data has subsequently been employed to identify proteins associated with the breast cancer tumor suppressor BRCA1, thus identifying the centrosome component HMMR, a polymorphism of which is associated with breast cancer risk6. Furthermore, integration of the interactome with the 70 gene expression signature was recently employed to expand the signature, resulting in increased prognostic performance in breast cancer7.
There remains a need in the art for new and effective methods to diagnose disease, provide an evaluation of disease progression and prognosis, as well as to identify new methods and compositions for use in distinguishing between disease states.
We have demonstrated that human protein-protein interaction networks or interactomes are composed of hub proteins that are co-expressed with their interacting partners only in some tissues (intermodular hubs) and hubs that are more frequently co-expressed with their partners (intramodular hubs). Significant differences in domain, linear motifs and phosphorylation site structure were observed between the hub classes, and signalling domains were more often found in intermodular hub proteins which are more frequently associated with oncogenesis. We also found that alterations in network modularity of the interactome are associated with different biological states. Using methods developed and described by the inventors herein, it is possible to identify hubs that can significantly discriminate between biological states.
The inventors also investigated how altered gene expression profiles in a disease state (e.g. breast cancer) disturb the global organization of the human interactome. They found that the modular assembly of the human interactome is altered as a function of disease outcome and they demonstrate that analysis of dynamic network modularity predicts disease states. The methods rely on measurements of co-expression levels of protein hubs and interacting partners. These levels are subjected to a polynomial analysis that yields a result indicative of prognosis, likelihood of recurrence, or the likelihood of responding to therapy.
Broadly stated, the present invention relates to a method of identifying hubs that significantly correlate with a class distinction between samples. In an aspect, the invention relates to a method of identifying hubs and their interacting partners that significantly correlate with a class distinction between samples comprising sorting hubs and their interacting partners (also referred to as “interactors”) by the degree to which their presence or co-expression in the samples correlates with the class distinction, and determining whether the correlation is stronger than expected by chance. A hub whose expression correlates with a class distinction more strongly than expected by chance is an informative hub. The class distinction can be a known class and in an embodiment the class distinction is a biological state, in particular a disease state. A known class can also be a set of subjects, in particular subjects with a favourable prognosis or subjects with an unfavourable prognosis. Sorting hubs and interacting partners by the degree to which their co-expression in samples correlates with a class distinction can be carried out using conventional correlation analyses.
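By way of illustration only, the sorting and chance-comparison steps described above can be sketched as a label-permutation test. The separation statistic (absolute difference in class means), the permutation test itself, the function names and all data values are assumptions made for this sketch; the specification requires only a conventional correlation analysis and a determination that the correlation is stronger than expected by chance.

```python
import random
from statistics import mean
from typing import Dict, List, Sequence, Tuple


def class_separation(values: Sequence[float], labels: Sequence[int]) -> float:
    """Absolute difference in mean co-expression score between class 0 and class 1."""
    class0 = [v for v, lab in zip(values, labels) if lab == 0]
    class1 = [v for v, lab in zip(values, labels) if lab == 1]
    return abs(mean(class0) - mean(class1))


def permutation_p_value(values: Sequence[float], labels: Sequence[int],
                        n_perm: int = 1000, seed: int = 0) -> float:
    """Fraction of label shufflings that separate the classes at least as well."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = class_separation(values, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if class_separation(values, shuffled) >= observed:
            hits += 1
    # Add-one correction avoids reporting a p-value of exactly zero
    return (hits + 1) / (n_perm + 1)


def rank_hubs(scores_by_hub: Dict[str, List[float]],
              labels: Sequence[int]) -> List[Tuple[str, float]]:
    """Sort hubs from most to least significant association with the classes."""
    ranked = [(hub, permutation_p_value(values, labels))
              for hub, values in scores_by_hub.items()]
    return sorted(ranked, key=lambda item: item[1])


# Hypothetical per-sample co-expression scores for two hubs (10 samples)
labels = [0] * 5 + [1] * 5  # e.g. 0 = favourable, 1 = unfavourable prognosis
scores = {
    "HUB_A": [2.0, 2.1, 1.9, 2.2, 2.0, 0.1, 0.0, -0.1, 0.2, 0.1],
    "HUB_B": [1.0, 2.0, 1.0, 2.0, 1.0, 2.0, 1.0, 2.0, 1.0, 2.0],
}
ranking = rank_hubs(scores, labels)
```

In this sketch a hub such as the hypothetical HUB_A, whose scores differ sharply between the two classes, sorts ahead of HUB_B, whose scores do not, and would be a candidate informative hub.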
In an aspect, the invention relates to a method of identifying hubs and their interacting partners that significantly discriminate among biological states, in particular disease states, comprising obtaining a reference data set that can be clustered into different biological states and into interactions comprising hubs and their interacting partners characteristic of each biological state, and assessing differences in interactions for each biological state to identify informative hubs that significantly discriminate between the biological states; and optionally confirming informative hubs by searching for the hubs in databases of scientific literature for the biological states.
In an aspect, the invention provides a method for determining a biological state through the discovery and analysis of discriminatory data patterns or network signatures of co-expression of hubs and their interacting partners. Analytical methods are utilized to discover hidden discriminatory patterns or network signatures of co-expression of hubs and their interacting partners that are a subset of a larger reference data set and that classify a biological state. The methods of the invention may be used to distinguish two or more biological states in a reference data set and the resulting discriminatory patterns or reference network signatures may be used to classify unknown or test samples.
The invention provides sets of informative hubs and interacting partners and network signatures that distinguish classes, in particular biological states, more particularly disease states, and uses therefor. The invention also provides computer-readable data media or databases comprising informative hubs and interacting partners and network signatures that distinguish classes.
The invention further provides a method for distinguishing a class, in particular a biological state, more particularly a disease state, in a sample by determining differences in co-expression of informative hubs and their interacting partners in a sample from the subject compared with a standard or model. The methods may be used in the diagnosis, prognosis or monitoring of a disease, or to assess treatments or drug responsiveness.
The invention relates to a method of characterizing or classifying a sample from a subject (e.g. a biological sample), by detecting or quantitating in the sample amounts or levels of informative hubs and their interactors that are characteristic of a class, in particular a biological state, more particularly a disease state, the method comprising assaying for differential co-expression of the hubs and their interactors in the sample. The invention also relates to a method of characterizing or classifying a biological state, in particular a disease state, of a subject by detecting or quantitating in a sample from the subject amounts or levels of informative hubs and their interactors that are characteristic of a biological state, in particular a disease state, the method comprising assaying for differential co-expression of the hubs and their interactors in the sample. Co-expression of the hubs and their interactors can be assayed using techniques known in the art. The invention pertains to a method for classifying a sample obtained from an individual into a class (e.g. favorable or poor prognosis) comprising assessing the sample for co-expression of informative hubs and their interacting partners and classifying the sample as a function of expression of informative hubs and interacting partners with respect to a model.
In another aspect, a method for generating reference network signatures characteristic of biological states is provided, which comprises: (a) obtaining a reference data set that can be clustered into different biological states and which comprises expression data for hubs and their interacting partners; (b) clustering hubs and interacting partners by biological states and assessing differences in each interaction between a hub and interacting partners between biological states to identify informative hubs and their interacting partners that significantly discriminate between the biological states; and (c) obtaining reference network signatures of the co-expression of informative hubs and interacting partners characteristic of the biological states. In another aspect, such a method further comprises comparing the reference network signature with a network signature of the informative hubs and interacting partners in a sample from a patient to characterize or classify the biological state of the patient.
In a variety of aspects of the methods described herein, the biological state is a disease state. In certain aspects, the disease state is cancer. In other aspects, the cancer is breast cancer. In another aspect, a method for screening a subject for a disease or disease stage or classifying a disease or disease stage in a subject is provided which comprises (a) obtaining a biological sample from a subject; (b) detecting the amount of co-expression of hubs and interacting partners characteristic of the disease or disease stage in the sample; and (c) comparing the amount detected to a predetermined standard or model. In embodiments, detection of amounts of co-expression of hubs and interacting partners associated with the disease or disease stage that differs significantly from the standard or model indicates the disease or disease stage. In other embodiments, detection of amounts of co-expression of hubs and interacting partners associated with the disease or disease stage that are substantially similar to the standard or model indicates the disease or disease stage.
In another aspect, a method for classifying a breast cancer patient according to prognosis is provided comprising: (a) comparing the levels of co-expression of hubs and interacting partners characteristic of breast cancer prognosis in a sample from the patient to levels of co-expression of the hubs and interacting partners in a reference population; and (b) classifying the patient according to prognosis of the breast cancer based on the similarity between the levels of co-expression in the sample and the reference population.
In such a method, step (b) can include determining whether the similarity exceeds one or more predetermined threshold values of similarity. In another embodiment, this method further comprises assigning a therapeutic regimen to the patient.
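As a non-limiting sketch of step (b) with a predetermined threshold, similarity between a patient's co-expression levels and those of a reference population can be computed and compared to a cut-off. The use of Pearson correlation as the similarity measure, the threshold value of 0.5, the function names and the numerical values are all assumptions chosen for illustration.

```python
from math import sqrt
from typing import Sequence, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def classify_by_similarity(sample: Sequence[float],
                           reference: Sequence[float],
                           threshold: float = 0.5) -> Tuple[bool, float]:
    """Return (exceeds_threshold, similarity) for a sample against a reference."""
    similarity = pearson(sample, reference)
    return similarity >= threshold, similarity


# Hypothetical co-expression levels for four hub/interacting partner pairs
poor_prognosis_reference = [1.0, -0.5, 0.8, -1.2]
patient_1 = [0.9, -0.4, 1.0, -1.0]
patient_2 = [-1.0, 0.6, -0.7, 1.1]

match_1, similarity_1 = classify_by_similarity(patient_1, poor_prognosis_reference)
match_2, similarity_2 = classify_by_similarity(patient_2, poor_prognosis_reference)
```

Here the hypothetical patient_1 would be classified with the reference population and patient_2 would not; in practice the threshold would be selected from the reference data rather than fixed in advance.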
In another aspect, a method of categorizing drug responsiveness in a population comprises (a) determining the expression levels of hubs and interacting partners for individuals in the population; (b) identifying a first group of individuals in the population that have a substantially similar response to the drug; and (c) clustering the hubs and interacting partners by the drug response of the first group to generate a reference network signature indicating drug responses for the first group of individuals. In another embodiment, this method further comprises the steps of (d) identifying a second group of individuals having a substantially similar response to the drug which differs from the drug response of the first group; and (e) clustering the hubs and interacting partners by the drug response of the second group to generate a reference network signature indicating drug responses for the second group of individuals. In another embodiment, one may repeat steps (d) and (e) one or more times for an additional group of individuals having a substantially similar drug response that differs from other groups.
In another aspect, a method for assigning an individual to one of a plurality of categories in a clinical trial comprises determining for the individual co-expression of hubs and interacting partners in a sample from the individual; producing a network signature of informative hubs and their interacting partners; comparing the network signature with reference network signatures of reference populations that have different clinical categories; and assigning the individual to a category in the clinical trial based on correlation of the network signature with one or more reference network signatures.
In another aspect, a business method is provided for obtaining regulatory review of a drug comprising: (a) determining hubs and their interacting partners that significantly discriminate among responders and non-responders to the drug; (b) using results from step (a) to determine whether a patient would benefit from administration of the drug; and (c) combining information from prior regulatory filings for the drug in combination with information from step (b) to support a new drug approval regulatory filing.
In other aspects, this invention provides computer systems, computer programs, computer-readable data media and laboratory robots or evaluating devices for implementing the methods described herein.
In another aspect, a method for diagnosing a subject for the presence of a biological state, a disease or disease stage comprises: (a) obtaining a biological sample from the subject; (b) detecting the expression levels of hub proteins and their interacting partners in the sample; (c) determining the relative expression of the hub proteins and their interacting partners in the sample; and (d) comparing the subject's relative expression to a standard or model, wherein a significant difference between the subject's relative expression and the standard or model indicates the biological state, disease or disease stage.
In another aspect, a method for diagnosing a subject for the presence of a biological state, a disease or disease stage comprises: (a) obtaining a biological sample from the subject; (b) detecting the expression levels of hub proteins and their interacting partners in the sample; (c) determining the relative expression of the hub proteins and their interacting partners in the sample; and (d) comparing the subject's relative expression to a standard or model, wherein substantial similarity between the subject's relative expression and the standard or model indicates the biological state, disease or disease stage.
In another aspect, a method for diagnosing a subject for the presence of a biological state, a disease or disease stage comprises: (a) obtaining a biological sample from the subject; (b) detecting the expression levels of a hub protein and an interacting partner in the sample; (c) determining the relative expression of the hub protein and the interacting partner in the sample; and (d) comparing the subject's relative expression to a standard or model, wherein a significant difference or substantial similarity between the subject's relative expression and the standard or model indicates the biological state, disease or disease stage.
In another aspect, a method for generating a network signature identifying a biological state, a disease or disease stage, comprises: (a) obtaining gene expression levels from a reference population having two or more different biological states, diseases or disease stages; (b) dividing the reference population gene expression levels into two or more groups, each group characteristic of one said different biological state, disease or disease stage; and (c) assessing differences in relative gene expression levels between hub proteins and interacting partners in the groups to identify hub proteins whose expression relative to their interacting partners is characteristic of one said different biological state, disease or disease stage.
In another aspect, a method for generating a network signature identifying a biological state, a disease or disease stage, comprises: (a) obtaining gene expression levels from a reference population having two different biological states, diseases or disease stages; (b) dividing the reference population gene expression levels into two groups, each group characteristic of a different biological state, disease or disease stage; and (c) assessing differences in relative gene expression levels between a hub protein and an interacting partner in the groups to identify a hub protein whose expression relative to an interacting partner is characteristic of a biological state, disease or disease stage.
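Steps (a) to (c) of the signature-generating methods above can be sketched as follows, purely for illustration. The sketch takes relative expression to be the hub level minus the interacting partner level, summarises each group by its mean, and flags pairs whose group means differ by at least a chosen amount. The group names, gene names, cut-off and expression values are hypothetical, and the simple mean difference stands in for whatever statistical assessment is actually selected.

```python
from statistics import mean
from typing import Dict, List, Tuple

Sample = Dict[str, float]   # gene name -> expression level
Pair = Tuple[str, str]      # (hub, interacting partner)


def network_signature(groups: Dict[str, List[Sample]],
                      interactions: Dict[str, List[str]]) -> Dict[Pair, Dict[str, float]]:
    """Mean relative expression (hub minus partner) per pair and per group."""
    signature: Dict[Pair, Dict[str, float]] = {}
    for hub, partners in interactions.items():
        for partner in partners:
            signature[(hub, partner)] = {
                group: mean(s[hub] - s[partner] for s in samples)
                for group, samples in groups.items()
            }
    return signature


def informative_pairs(signature: Dict[Pair, Dict[str, float]],
                      group_a: str, group_b: str,
                      min_difference: float) -> List[Pair]:
    """Pairs whose mean relative expression differs between the two groups."""
    return [pair for pair, means in signature.items()
            if abs(means[group_a] - means[group_b]) >= min_difference]


# Hypothetical reference population already divided into two groups (step (b))
reference_groups = {
    "good_outcome": [{"HUB1": 2.0, "P1": 1.0, "P2": 2.0},
                     {"HUB1": 3.0, "P1": 2.0, "P2": 3.0}],
    "poor_outcome": [{"HUB1": 2.0, "P1": 2.0, "P2": 2.0},
                     {"HUB1": 1.0, "P1": 1.0, "P2": 1.0}],
}
interactome = {"HUB1": ["P1", "P2"]}

signature = network_signature(reference_groups, interactome)
selected = informative_pairs(signature, "good_outcome", "poor_outcome", 0.5)
```

In this hypothetical data only the HUB1/P1 pair changes its relative expression between the groups, so only that pair is retained in the signature.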
In another aspect, a system comprises a computer processor capable of processing gene expression data for hub proteins and their interacting partners, an input device, an output device, and a memory capable of storing computer-readable instructions, wherein the contents of the memory comprises computer-readable instructions that if executed are capable of directing the computer to: (a) receive gene expression level data from a biological sample from a subject; (b) determine the relative expression of hub proteins and their interacting partners in the sample; (c) compare the relative expression to a standard or model; and (d) output an indication of the presence of a biological state, a disease or disease stage, likelihood thereof, or prognosis therefor.
In another aspect, a system comprises a computer processor capable of processing gene expression data for a hub protein and its interacting partners, an input device, an output device, and a memory capable of storing computer-readable instructions, wherein the contents of the memory comprises computer-readable instructions that if executed are capable of directing the computer to: (a) receive gene expression level data from a biological sample from a subject; (b) determine the relative expression of a hub protein and an interacting partner in the sample; (c) compare the relative expression to a standard or model; and (d) output an indication of the presence of a biological state, a disease or disease stage, likelihood thereof, or prognosis therefor.
In another aspect, a system comprises a computer processor capable of processing gene expression data for hub proteins and their interacting partners, an input device, an output device, and a memory capable of storing computer-readable instructions, wherein the contents of the memory comprises computer-readable instructions that if executed are capable of directing the computer to: (a) receive gene expression level data from a reference population having two or more different biological states, diseases or disease stages; (b) divide reference population gene expression levels into two or more groups, each group characteristic of a different biological state, disease or disease stage; (c) determine the relative gene expression of hub proteins and their interacting partners in the groups; (d) assess differences in relative gene expression levels between hub proteins and their interacting partners in the groups to identify hub proteins whose expression relative to their interacting partners is characteristic of a biological state, disease or disease stage; and (e) output a network signature useful in identifying a biological state, disease or disease stage.
In another aspect, a system comprises a computer processor capable of processing gene expression data for a hub protein and its interacting partners, an input device, an output device, and a memory capable of storing computer-readable instructions, wherein the contents of the memory comprises computer-readable instructions that if executed are capable of directing the computer to: (a) receive gene expression level data from a reference population having two different biological states, diseases or disease stages; (b) divide reference population gene expression levels into two groups, each group characteristic of one said different biological state, disease or disease stage; (c) determine the relative gene expression of a hub protein and an interacting partner in the groups; (d) assess differences in relative gene expression levels between a hub protein and an interacting partner in the groups to identify a hub protein whose expression relative to an interacting partner is characteristic of one said different biological state, disease or disease stage; (e) repeat (c) and (d) for additional interacting partners with the hub protein, and for additional hub proteins and their interacting partners; and (f) output a network signature useful in identifying a biological state, disease or disease stage.
In another aspect, a computer-readable medium comprises computer-readable code that if executed is configured to: (a) compare the relative expression of hub proteins and their interacting partners detected in a subject's sample to a standard or model characteristic of a biological state, disease or disease stage; and (b) provide an indication of a biological state, disease or disease stage in the subject based upon the comparison.
In another aspect, a computer-readable medium comprises computer-readable code that if executed is configured to: (a) compare the relative expression of a hub protein and an interacting partner detected in a subject's sample to a standard or model characteristic of a biological state, disease or disease stage; and (b) provide an indication of a biological state, disease or disease stage in the subject based upon the comparison.
In another aspect, a computer-readable medium comprises computer-readable code that if executed is configured to: (a) receive gene expression level data from a reference population having two or more different biological states, diseases or disease stages; (b) divide reference population gene expression levels into two or more groups, each group characteristic of a different biological state, disease or disease stage; (c) determine the relative gene expression of hub proteins and their interacting partners in the groups; (d) assess differences in relative gene expression levels between hub proteins and their interacting partners in the groups to identify hub proteins whose expression relative to their interacting partners is characteristic of a biological state, disease or disease stage; and (e) provide a network signature useful in identifying a biological state, disease or disease stage.
In another aspect, a computer-readable medium comprises computer-readable code that if executed is configured to: (a) receive gene expression level data from a reference population having two different biological states, diseases or disease stages; (b) divide reference population gene expression levels into two groups, each group characteristic of one said different biological state, disease or disease stage; (c) determine the relative gene expression of a hub protein and an interacting partner in the groups; (d) assess differences in relative gene expression levels between a hub protein and an interacting partner in the groups to identify a hub protein whose expression relative to an interacting partner is characteristic of one said different biological state, disease or disease stage; (e) repeat (c) and (d) for additional interacting partners with the hub protein, and for additional hub proteins and their interacting partners; and (f) provide a network signature useful in identifying a biological state, disease or disease stage.
Other objects, features and advantages of the present invention will become apparent from the following detailed description. It should be understood, however, that the detailed description and the specific examples while indicating preferred embodiments of the invention are given by way of illustration only, since various changes and modifications within the spirit and scope of the invention will become apparent to those skilled in the art from this detailed description.
The invention will now be described in relation to the drawings in which:
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The following definitions supplement those in the art and are directed to the present application and are not to be imputed to any related or unrelated case. Although any methods and materials similar or equivalent to those described herein can be used in the practice of the invention, particular materials and methods are described herein.
Numerical ranges recited herein by endpoints include all numbers and fractions subsumed within that range (e.g., 1 to 5 includes 1, 1.5, 2, 2.75, 3, 3.90, 4, and 5). In another embodiment, all fractions or integers between and including the two numbers are included in the range. It is also to be understood that all numbers and fractions thereof are presumed to be modified by the term “about.” The term “about” means plus or minus 0.1 to 50%, 5-50%, or 10-40%, preferably 10-20%, more preferably 10% or 15%, of the number to which reference is being made. As used herein and in the appended claims, the singular forms “a”, “an”, and “the” include plural reference unless the context clearly dictates otherwise. Thus, for example, reference to an “interacting partner” is a reference to one or more interacting partners and equivalents thereof known to those skilled in the art, and so forth. Further, various embodiments in the specification or claims are presented using “comprising” language. In certain embodiments, a related embodiment may also be described using “consisting of” or “consisting essentially of” language.
“Biological state” includes without limitation a healthy state, a disease state, a potential disease state, a stage of a disease, prognosis of a disease, a physiological state, drug responsive or drug non-responsive state, toxicity of one or more drugs, toxicity state, biological state of an organ, presence of a pathogen (e.g. a virus), and the like.
A “reference data set” generally comprises quantitative data for putative informative hubs and interacting partners for a reference population and data characterizing different class distinctions (e.g. biological states, in particular disease states) in the reference population. Reference data sets can be from published data, clinical or test data or from samples from a reference population. One skilled in the art can readily determine an appropriate reference population based on particular applications of methods of the invention. A reference data set generally includes data relating to two or more different class distinctions. In aspects of the invention, a reference data set includes data concerning two or more different health states of a reference population (e.g. healthy state versus disease state). Reference populations can be selected on a variety of criteria based on the particular application of methods of the invention. Examples of criteria include health state, disease state, age, gender, drug use, genetic similarity, ethnicity, or other criteria. A reference population can be focused on a particular criterion or contain a variety of individuals having more than one state. The number of individuals to be included in a reference population to obtain a statistically useful determination can be readily determined by one skilled in the art. A reference population may generally contain tens, hundreds, or thousands of reference individuals or samples depending on the particular application.
A “network signature” refers to the level or amount of co-expression of one or more hubs and their interacting partners in a given population or sample at one or more time points. A “reference” network signature is a profile of a particular set of hubs (e.g. informative hubs) and their interacting partners that is characteristic of a particular class (e.g. biological state). For example, a reference network signature that quantitatively describes the expression of hubs and their interacting partners in breast cancer (see Example 1) can be used for determining prognosis in individual breast cancer patients. Reference network signatures may be generated using a reference data set. In certain embodiments, a network signature includes a complete network or subnetworks, i.e., a skeleton network. In one embodiment, a network signature includes a profile of all hubs identified using the algorithms or code contained herein. A skeleton network is a spanning tree (i.e., a tree composed of n−1 edges that connects all n vertices in the network) formed by the edges with the highest betweenness centralities. The remaining edges in the network are shortcuts. A skeleton network can be identified using published methods48.
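The skeleton network defined above can be illustrated with a minimal sketch that computes edge betweenness centrality and then keeps the highest-scoring edges that form a spanning tree. The choice of a Brandes-style breadth-first computation for unweighted graphs and a Kruskal-style tree construction is an assumption made for this sketch and is not necessarily the published method referred to above; the four-node graph and all function names are hypothetical.

```python
from collections import deque
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str]


def edge_key(u: str, v: str) -> Edge:
    """Canonical (sorted) name for an undirected edge."""
    return (u, v) if u <= v else (v, u)


def edge_betweenness(adj: Dict[str, Set[str]]) -> Dict[Edge, float]:
    """Edge betweenness for an unweighted, undirected graph."""
    score = {edge_key(u, v): 0.0 for u in adj for v in adj[u]}
    for source in adj:
        # Breadth-first search that counts shortest paths from the source
        dist = {source: 0}
        paths = {source: 1.0}
        preds: Dict[str, List[str]] = {node: [] for node in adj}
        order: List[str] = []
        queue = deque([source])
        while queue:
            u = queue.popleft()
            order.append(u)
            for v in sorted(adj[u]):
                if v not in dist:
                    dist[v] = dist[u] + 1
                    paths[v] = 0.0
                    queue.append(v)
                if dist[v] == dist[u] + 1:
                    paths[v] += paths[u]
                    preds[v].append(u)
        # Accumulate path shares from the farthest nodes back to the source
        delta = {node: 0.0 for node in order}
        for w in reversed(order):
            for u in preds[w]:
                share = paths[u] / paths[w] * (1.0 + delta[w])
                score[edge_key(u, w)] += share
                delta[u] += share
    # Every undirected path was counted once from each endpoint
    return {edge: value / 2.0 for edge, value in score.items()}


def skeleton_network(adj: Dict[str, Set[str]]) -> List[Edge]:
    """Spanning tree built greedily from the highest-betweenness edges."""
    score = edge_betweenness(adj)
    parent = {node: node for node in adj}

    def find(node: str) -> str:
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    skeleton: List[Edge] = []
    for edge in sorted(score, key=lambda e: (-score[e], e)):
        root_u, root_v = find(edge[0]), find(edge[1])
        if root_u != root_v:  # the edge joins two components, so no cycle
            parent[root_u] = root_v
            skeleton.append(edge)
    return skeleton


# Hypothetical network: a triangle A-B-C with a bridge C-D
graph = {"A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B", "D"}, "D": {"C"}}
betweenness = edge_betweenness(graph)
skeleton = skeleton_network(graph)
shortcuts = sorted(set(betweenness) - set(skeleton))
```

For this four-vertex graph the skeleton contains three edges (n−1), and the remaining A-B edge is a shortcut in the sense used above.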
In an embodiment, network signatures are comprised of 2, 3, 4, 5, 10, 15, 20, 25, 50, or more hubs or hub/interacting partner sets. The informative hubs and interacting partners that are used in network signatures can be hubs and interacting partners that exhibit increased expression over normal samples or decreased expression versus normal samples. The particular set of informative hubs and interacting partners used to create a network signature can be, for example, the hubs and interacting partners that exhibit the greatest degree of differential co-expression, or they can be any set of informative hubs and interacting partners that exhibit some degree of differential co-expression and provide sufficient power to accurately classify a sample. The hubs and interacting partners selected are those that have been determined to be differentially expressed in for example a disease, different disease state, drug-responsiveness, or drug-sensitive sample, relative to a normal sample or different disease state or drug-responsiveness and confer power to classify the sample. By comparing samples from patients with reference network signatures, the patient's susceptibility to a particular disease, prognosis, disease state, drug-responsiveness, or drug-resistance can be determined. In another embodiment a subset of a network signature includes only a portion of the network signature minimally necessary to distinguish the biological state, disease or disease stage thereof.
In yet another embodiment, a network signature is formed by the relative expression or pattern of relative expression of at least one, and preferably more than one, hub protein and one, or preferably more than one, of each hub's interacting partner proteins, which relative expression or pattern is characteristic of a disease, i.e., is changed from the relative expression of the hub/interacting partners in the healthy, non-disease state. In one embodiment, the network signature is formed by the relative expression of at least 5 hub protein/interacting partner protein sets. In one embodiment, the network signature is formed by the relative expression of at least 10, at least 20, at least 40, at least 50, at least 70, at least 100, at least 200, at least 300 or at least 500 or more hub protein/interacting partner protein sets. The network signature can take many forms, e.g., it can be identified as a number, a series of numbers, or graphs, e.g., bar graphs or curves.
A “reference” or “standard” or “model” thus refers to a network signature or a subset of a network signature that characterizes a particular biological state. As used herein, for example, a reference or standard or model may in one embodiment be a network signature characteristic of a healthy, disease-free state in a reference population. In another embodiment, the “reference” or “standard” or “model” is a network signature characteristic of the presence of a particular disease at a designated stage of disease, e.g., stage I cancers, in a reference population. In another embodiment, the “reference” or “standard” or “model” is a network signature characteristic of a reference population having a disease that had a poor outcome. In another embodiment, the “reference” or “standard” or “model” is a network signature characteristic of a reference population having a disease that had a good outcome, e.g., survival for a selected number of years post-diagnosis. In yet another embodiment, the reference, standard or model may be a network signature formed of disease-characteristic hubs/interacting partners from a single subject at a particular time. These latter references are particularly useful in assessing progression of the disease or monitoring efficacy of therapeutic intervention. For example, the single reference subject may be the same subject being monitored for disease progression or therapeutic efficacy.
The generation of a network signature requires a method for assaying or quantitating the expression of hubs and interacting partners in samples. The expression levels of genes encoding the hubs and interacting partners or gene products, e.g., proteins, may be assayed in samples. Methods are currently available to one of skill in the art to quickly determine the expression level of several gene products from samples. Hybridization assays can be used to rapidly determine expression of gene products in samples. Microarrays or gene chips comprising short oligonucleotides complementary to mRNA products chemically attached to a solid support can be used for a rapid determination of gene expression in samples. Microarrays are commercially available, for example from Affymetrix, Santa Clara, Calif. Alternatively, methods are known to one skilled in the art for a variety of immunoassays to detect protein expression products. Some aspects of the invention may use spectrometric data of components of the hubs and interacting partners obtained from any spectrometric or chromatographic technique including without limitation resonance spectroscopy, mass spectroscopy, and optical spectroscopy. Examples of spectrometric platforms include MS, NMR, liquid chromatography, gas chromatography, high performance liquid chromatography, capillary electrophoresis, and any known form of mass spectrometry in low or high resolution mode such as LC-MS, GC-MS, CE-MS, LC-UV, MS-MS, MSn, etc. The methods described herein are not limited by the particular process selected to detect or quantify expression levels of the genes or gene products, including the hubs and their interacting partners. One of skill in the art may readily select a suitable conventional method for same.
The term “relative expression” as used herein refers to the interrelationship of the expression of one or more hubs with the expression of each of their interacting partners. Relative expression is generally the hub expression level minus interactor expression level. The relative expression may be a numerical or graphical representation of the interrelationship or pattern created by correlating the expression level of a hub protein with the expression level of one or preferably more of its interacting partner(s) in one or more samples. The correlation of these expression levels relative to each other in the hub/interacting partner complexes can cause a change in the network signature characteristic of a particular biological state, disease or disease stage.
“Correlation analysis” refers to a correlation-based similarity analysis, including an analysis using Pearson's correlation coefficient (PCC) or the related Spearman's rho and Kendall's tau, all of which are known in the art.
“Disease” refers to any disorder, disease, condition, syndrome or combination of manifestations or symptoms recognized or diagnosed as a disorder which may be correlated with or characterized by co-expression of a subset of hubs and their interacting proteins in an interactome. The invention has application in any disease in which changes in the patterns of informative hubs and their interacting proteins allow it to be distinguished from a non-diseased state. Therefore, diseases that have a genetic component in which the genetic abnormality is expressed, diseases in which the expression of drug toxicity is observed, or diseases in which the levels of molecules in the body are affected may be studied by the present invention.
Exemplary diseases include, for example, cancer, cardiovascular diseases including heart failure, hypertension and atherosclerosis, respiratory diseases, renal diseases, gastrointestinal diseases including inflammatory bowel diseases such as Crohn's disease and ulcerative colitis, hepatic, gallbladder and bile duct diseases, including hepatitis and cirrhosis, hematologic diseases, metabolic diseases, endocrine and reproductive diseases, including diabetes, bone and bone mineral metabolism diseases, immune system diseases including autoimmune diseases such as rheumatoid arthritis, lupus erythematosus, and other autoimmune diseases, musculoskeletal and connective tissue diseases, including arthritis, infectious diseases and neurological diseases such as Alzheimer's disease, Huntington's disease and Parkinson's disease.
Although the invention is generic, embodiments of the invention provide for diagnosis or prognosis of various cancers including but not limited to carcinomas, melanomas, lymphomas, sarcomas, blastomas, leukemias, myelomas, osteosarcomas, neural tumors, and cancer of organs such as the breast, ovary, and prostate. A particular embodiment of the invention relates to the discovery and use of relative expression, or co-expression patterns, of hubs and interacting partners that reflect the current or future biological state of an organ or tissue.
“Hub” refers to a protein that interacts with two or more interacting partners, preferably 3, 4, 5, 6, 7, 8, 9, or 10 or more interacting partners. A significant or informative hub is a hub that significantly discriminates between classes, in particular biological states, more particularly disease states. In aspects of the invention, the hubs are intermodular hubs. In an embodiment, an informative or significant hub displays significantly altered PCC as a function of disease state, in particular disease outcome. In an embodiment, the informative or significant hubs display significantly altered PCC as a function of breast cancer disease outcome. Examples of such breast cancer outcome informative hubs include without limitation one or more of the BASC complex, MAP3K1, GRB2, SHC and SRC, estrogen signaling (ESR1), the DNA damage response (BRCA1, RAD51, MRE11), proteasome components and ribosomal components.
“Interactome” refers to sets of molecular interactions in cells, in particular protein-protein interaction networks.
“Intermodular hubs” refers to classes of hubs in the human interactome that display low correlation of co-expression with their partners. Intermodular hubs may generally be characterized by one or more of the following: (a) less molecular functional similarity with their interactors compared to intramodular hubs; (b) interact between functional modules; (c) important for global network connectivity; (d) greater average sequence length than intramodular hubs; (e) higher modularity compared to intramodular hubs; (f) lower globularity than intramodular hubs; (g) linear motifs are significantly over-represented compared with intramodular hubs; and (h) enriched in domains associated with cell signaling, in particular tyrosine kinase, PDZ and Gα domains.
“Intramodular hubs” refers to classes of hubs in the human interactome that display relatively higher correlation of co-expression compared with intermodular hubs. Intramodular hubs may generally be characterized by one or more of the following: (a) greater molecular functional similarity with their interactors compared to intermodular hubs; (b) act as key components within more functionally homogenous modules; (c) lower average sequence length than intermodular hubs; (d) greater globularity than intermodular hubs; and (e) linear motifs are significantly under-represented compared with intermodular hubs.
“Pearson Correlation Coefficient” or “PCC” refers to the measure of the correlation between two variables and in particular reflects the degree of linear relationship between the two variables. The PCC is typically denoted by r. In the context of the present invention, the variables include the expression data for a hub and its interactors, and the PCC of each interaction of a hub may be determined as follows:
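Let $X_{I,j}$ denote the expression data for the interactor and $X_{H,j}$ the expression data for the hub in tissue $j$. Then

$$r = \frac{\sum_{j=1}^{n} (X_{I,j} - \bar{X}_I)(X_{H,j} - \bar{X}_H)}{(n-1)\, s_I s_H}$$

where I is an interactor of hub H and j denotes the expression data for the hub or interactor in each of n tissues, $\bar{X}_I$ and $\bar{X}_H$ are the corresponding means over the n tissues, and the summation is over all tissues (j=1, 2, 3 . . . n). $s_I s_H$ is the product of the (sample) standard deviations of the expression data for the hub and interactor.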
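As a minimal sketch of the per-interaction PCC just defined, the following Python computes r for each interaction of one hub across n tissues and then averages the values for that hub. The function names, the placeholder partner names and the expression values are hypothetical illustrations, not taken from the specification.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient r of two equal-length profiles.

    Algebraically the same as sum((x-mx)*(y-my)) / ((n-1) * s_x * s_y)
    when s_x and s_y are sample standard deviations.
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two profiles of equal length >= 2")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("constant profile: r is undefined")
    return sxy / sqrt(sxx * syy)


def hub_pccs(hub_profile, interactor_profiles):
    """Return {interactor name: r} for every interaction of one hub."""
    return {name: pearson(profile, hub_profile)
            for name, profile in interactor_profiles.items()}


# Hypothetical expression values across n = 5 tissues.
hub = [1.0, 2.0, 3.0, 4.0, 5.0]
partners = {
    "partner_1": [2.0, 4.0, 6.0, 8.0, 10.0],  # rises with the hub
    "partner_2": [5.0, 4.0, 3.0, 2.0, 1.0],   # falls as the hub rises
}
pccs = hub_pccs(hub, partners)
avg_pcc = sum(pccs.values()) / len(pccs)
```

A hub whose partners track it closely in every tissue would give an average near 1; in this toy input the two interactions cancel and the average is 0.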
In respect to analytical methods of the invention to identify informative hubs the PCC may be defined as follows:

$$r_A = \frac{\sum_{i=1}^{n_A} (I_{A,i} - \bar{I}_A)(H_{A,i} - \bar{H}_A)}{(n_A-1)\, s_{I_A} s_{H_A}}, \qquad r_D = \frac{\sum_{i=1}^{n_D} (I_{D,i} - \bar{I}_D)(H_{D,i} - \bar{H}_D)}{(n_D-1)\, s_{I_D} s_{H_D}}$$

where I and H denote the expression of an interactor and a hub, respectively, and A is a first class (e.g. biological state) and D is a second class (e.g. biological state). The summations are over the number of samples/individuals in each group ($n_A$ and $n_D$), and $s_{I_A} s_{H_A}$ and $s_{I_D} s_{H_D}$ are the products of the standard deviations of the hub and the interactor expression for the first biological state and second biological state respectively.
The term “sample” and the like mean a material known or suspected of expressing or containing one or more hubs and interacting partners. A sample can be used directly as obtained from the source or following a pretreatment to modify the character of the sample. In aspects of the invention, a sample is representative of the expression levels of informative hubs and interacting partners. A “biological sample” is a sample derived from any biological source, such as tissues, extracts, or cell cultures, including cells (e.g. tumor cells), cell lysates, and physiological fluids, such as, for example, blood or subpopulations thereof (e.g. white blood cells, erythrocytes), plasma, serum, saliva, ocular lens fluid, cerebrospinal fluid, sweat, urine, fecal matter, tears, bronchial lavage, swabbings, milk, ascites fluid, nipple aspirate, needle aspirate, synovial fluid, peritoneal fluid, lavage fluid, and the like. The sample can be obtained from animals, preferably mammals, most preferably humans. Samples can be from a single individual or pooled prior to analysis. The sample can be treated prior to use, such as preparing plasma from blood, diluting viscous fluids, and the like. Methods of treatment can involve filtration, distillation, extraction, concentration, inactivation of interfering components, the addition of reagents, and the like.
In embodiments of methods of the invention, the sample is a mammalian tissue sample. In another embodiment the sample is a human physiological fluid. In a particular embodiment, the sample is human serum. In a further embodiment, the sample is white blood cells or erythrocytes.
The samples that may be analyzed in accordance with the invention include polynucleotides, for example from clinically relevant sources, preferably expressed RNA or a nucleic acid derived therefrom (cDNA or amplified RNA derived from cDNA that incorporates an RNA polymerase promoter). The target polynucleotides can comprise RNA, including, without limitation total cellular RNA, poly(A)+ messenger RNA (mRNA) or fraction thereof, cytoplasmic mRNA, or RNA transcribed from cDNA (i.e., cRNA; see, for example, Linsley & Schelter, or U.S. Pat. No. 5,545,522, 5,891,636, 5,716,785 or 6,271,002). Methods for preparing total and poly(A)+ RNA are well known in the art, and are described generally, for example, in Sambrook et al., (1989, Molecular Cloning—A Laboratory Manual (2nd Ed.), Vols. 1-3, Cold Spring Harbor Laboratory, Cold Spring Harbor, N.Y.) and Ausubel et al, eds. (1994, Current Protocols in Molecular Biology, vol. 2, Current Protocols Publishing, New York). RNA may be isolated from eukaryotic cells by procedures involving lysis of the cells and denaturation of the proteins contained in the cells. Additional steps may be utilized to remove DNA. Cell lysis may be achieved with a nonionic detergent, followed by microcentrifugation to remove the nuclei and hence the bulk of the cellular DNA. (See Chirgwin et al., 1979, Biochem. 18:5294-5299). Poly(A)+RNA can be selected using oligo-dT cellulose (see Sambrook et al., 1989, Molecular Cloning—A Laboratory Manual (2nd Ed.), Vols. 1-3, Cold Spring Harbor Laboratory, Cold Spring Harbor, N.Y.). In the alternative, RNA can be separated from DNA by organic extraction, for example, with hot phenol or phenol/chloroform/isoamyl alcohol.
It may be desirable to enrich mRNA with respect to other cellular RNAs, such as transfer RNA (tRNA) and ribosomal RNA (rRNA). Most mRNAs contain a poly(A) tail at their 3′ end allowing them to be enriched by affinity chromatography, for example, using oligo(dT) or poly(U) coupled to a solid support, such as cellulose or Sephadex™ (see Ausubel et al., eds., 1994, Current Protocols in Molecular Biology, vol. 2, Current Protocols Publishing, New York). Bound poly(A)+mRNA is eluted from the affinity column using 2 mM EDTA/0.1% SDS.
The terms “subject”, “individual” or “patient” refer, interchangeably, to a warm-blooded animal such as a mammal. In particular, the terms refer to a human. A subject, individual or patient may be afflicted with or suspected of having or being pre-disposed to a disease as described herein. The terms also include animals bred for food, as pets, or for study, including horses, cows, sheep, poultry, fish, pigs, cats, dogs, zoo animals, goats, apes (e.g. gorilla or chimpanzee), and rodents such as rats and mice.
The present invention relates to a method of identifying hubs that significantly correlate with a class distinction between samples. A method of the invention may involve sorting hubs and their interacting partners or interactors by the degree to which their presence or co-expression in the samples correlates with the class distinction, and determining whether the correlation is stronger than expected by chance. A hub whose expression correlates with a class distinction more strongly than expected by chance is an informative hub. The class distinction can be a known class and in an embodiment the class distinction is a biological state, in particular a disease state. A known class can also be a set of subjects, in particular subjects with a favourable prognosis or subjects with an unfavourable prognosis. Conventional correlation analyses can be used to sort hubs and interacting partners. In aspects of the invention, each hub is assessed for the difference in Pearson correlation coefficient and an average co-expression of each interaction for a hub can be calculated, i.e., an estimate of the difference in correlation of each interaction around a hub between groups or samples is calculated.
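One way to picture the "stronger than expected by chance" assessment is a label-permutation test on the average difference in PCC around a hub. The sketch below is an illustration under stated assumptions: the use of the mean absolute difference |r_A − r_D| as the hub statistic, the permutation scheme, the function names and the data are all hypothetical choices, not the procedure fixed by the specification.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def hub_difference(hub, partners, labels):
    """Mean |r_A - r_D| over all interactions of one hub.

    labels[i] is "A" or "D" for sample i (assumed class names).
    """
    idx_a = [i for i, g in enumerate(labels) if g == "A"]
    idx_d = [i for i, g in enumerate(labels) if g == "D"]
    diffs = []
    for profile in partners.values():
        r_a = pearson([profile[i] for i in idx_a], [hub[i] for i in idx_a])
        r_d = pearson([profile[i] for i in idx_d], [hub[i] for i in idx_d])
        diffs.append(abs(r_a - r_d))
    return sum(diffs) / len(diffs)


def permutation_p_value(hub, partners, labels, n_perm=1000, seed=0):
    """Observed hub statistic and the fraction of label shuffles reaching it."""
    rng = random.Random(seed)
    observed = hub_difference(hub, partners, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # keeps group sizes, breaks the class link
        if hub_difference(hub, partners, shuffled) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)


# Hypothetical data: 4 samples of class A followed by 4 of class D.
labels = ["A"] * 4 + ["D"] * 4
hub = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
partners = {"partner_1": [1.1, 2.0, 3.2, 3.9, 8.0, 7.1, 5.9, 5.2]}
observed, p_value = permutation_p_value(hub, partners, labels, n_perm=200)
```

Here the partner is co-expressed with the hub in class A and anti-correlated in class D, so the observed statistic is close to its maximum of 2. With real data a multiple-testing correction across hubs would also be needed, which this sketch omits.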
Methods of the invention, for the purpose of determining the state of a sample or subject based upon hubs and their interacting partners or interactors or network signatures for the sample and for one or more reference populations, can include linear, non-linear, and/or multivariate calculations from fields including mathematics, statistics and/or computer science. Such calculations may proceed in two phases: (a) an overall computation involving training and/or estimation using data from the reference population(s), and (b) a simpler computation for an individual using the results of phase (a). The end result of such calculations is to provide one or more qualitative or quantitative indicators of the class or state of a sample or subject. Examples of calculations which may be used in the methods of the present invention include discriminant analysis, classification analysis, multiple discriminant analysis, cluster analysis, and affinity propagation analysis.
In an aspect, the invention relates to a method of identifying hubs and their interacting partners that significantly discriminate among biological states, in particular disease states. Such methods may comprise obtaining a reference data set that can be clustered into different biological states and into interactions comprising hubs and their interacting partners characteristic of each biological state; assessing differences in interactions for each biological state to identify informative hubs that significantly discriminate between the biological states; and optionally confirming informative hubs by searching for such hubs in databases of scientific literature for the biological states. In an aspect, the invention relates to a method of identifying hubs that discriminate between biological states, in particular disease states, comprising: (a) obtaining a reference data set that can be clustered into different biological states and which comprises expression data for genes encoding putative hubs and their interacting partners; (b) clustering the identified hubs and interacting partners by biological states and assessing differences in each interaction between a hub and interacting partners between biological states to identify informative hubs that significantly discriminate between the biological states; and, optionally, (c) confirming the informative hubs by searching for the hubs in databases of scientific literature for the biological states. The clustering analysis in a method of the invention may be carried out using an affinity propagation algorithm (see Example 1).
Databases of scientific literature which can be searched in methods of the invention include without limitation PubMed and other databases available through the National Center for Biotechnology Information.
In an aspect, the invention provides a method for determining a biological state through the discovery and analysis of discriminatory data patterns or network signature of co-expression of hubs and their interacting partners. The data can be from health data, clinical data or from a biological sample. Analytical methods are utilized to discover hidden discriminatory patterns or a network signature of co-expression of hubs and their interacting partners that are a subset of a larger data set and that classify a biological state. The methods of the invention may be used to distinguish two or more biological states in a reference data set and the resulting discriminatory patterns or reference network signatures may be used to classify unknown or test samples.
In an aspect the invention provides a method for generating reference network signatures characteristic of biological states or comprising hubs and their interacting partners that discriminate between biological states, comprising: (a) obtaining a reference data set that can be clustered into different biological states and which comprises expression data for hubs and their interacting partners; (b) clustering the identified hubs and interacting partners by biological states and assessing differences in each interaction between a hub and interacting partners between biological states to identify informative hubs and their interacting partners that significantly discriminate between the biological states; and (c) obtaining reference network signatures of the co-expression of informative hubs and interacting partners characteristic of the biological states or comprising hubs and their interacting partners that discriminate between biological states.
Methods of the invention for generating a network signature may further comprise preparing a subnetwork signature, in particular a skeleton network signature.
The invention provides sets of informative hubs and interactors and network signatures that distinguish classes, in particular biological states, more particularly disease states, and uses therefor. The invention also provides microarrays comprising genes encoding informative hubs and their interacting partners. The invention further provides computer-readable data media or databases comprising informative hubs and interactors and network signatures that distinguish classes.
The invention also provides a method for distinguishing a class, in particular a biological state, more particularly a disease state, in a sample by determining differences in co-expression of informative hubs and their interactors or network signatures in a sample from the subject compared with a standard or model.
In aspects of the invention, methods are provided for detecting the presence of a disease (e.g. cancer) in a sample, the absence of a disease in a sample, the stage or grade of the disease, and other characteristics of diseases that are relevant to prevention, diagnosis, characterization, and therapy of diseases or drug responsiveness in a patient, for example, the benign or malignant nature of a cancer, the metastatic potential of a cancer, and the indolence or aggressiveness of a cancer. Methods are also provided for assessing the efficacy or responsiveness of a therapy for a disease, monitoring the progression of a disease, determining the prognosis of a patient, selecting an agent or therapy for treating or inhibiting a disease, treating a patient afflicted with a disease, inhibiting a disease in a patient, and assessing the disease (e.g. carcinogenic) potential of a test compound.
In an aspect, the invention relates to a method of characterizing or classifying a sample from a patient (e.g. a biological sample), by detecting or quantitating in the sample amounts or levels of informative hubs and their interactors that are characteristic of the disease, the method comprising assaying for differential co-expression of the hubs and their interactors in the sample. The expression levels of hubs and interacting partners may be determined by isolating and determining the level of transcribed nucleic acids. Alternatively or additionally, the levels of co-expression of the polypeptides may be determined. Co-expression of the hubs and their interactors can be assayed using techniques known in the art, such as microarrays or mass spectroscopy of the components of the hubs and interacting partners or genes encoding same extracted from the sample.
The invention pertains to a method for classifying a sample obtained from an individual into a class (e.g. favorable or poor prognosis) comprising assessing the sample for co-expression of informative hubs and their interacting partners and classifying the sample as a function of expression of informative hubs and interacting partners with respect to a model.
In an aspect, the invention provides a method for characterizing or classifying a disease state in a subject comprising: (a) obtaining a sample from a subject; (b) producing a sample network signature of informative hubs and their interactors in the sample; and (c) comparing the sample network signature with a reference network signature to characterize the disease state in the subject.
In an aspect, a method is provided for characterizing a disease sample by detecting co-expression of informative hubs and interacting partners in the sample comprising: (a) obtaining a sample from a subject; (b) measuring levels of co-expression of informative hubs and interacting partners in the sample; and (c) comparing the levels with amounts measured for a standard or model.
In an embodiment of the invention, a method is provided for detecting breast cancer in a subject comprising: (a) obtaining a sample from the subject; (b) measuring levels of co-expression of hubs and their interacting partners characteristic of breast cancer in the sample; and (c) comparing the levels with levels detected for a standard or model.
In an embodiment, the invention relates to classifying a breast cancer patient according to prognosis comprising: (a) comparing the levels of co-expression of hubs and interacting partners characteristic of breast cancer in a sample from the patient to levels of co-expression of the hubs and interacting partners in a reference population; and (b) classifying the patient according to prognosis of the breast cancer based on the similarity between the levels of co-expression in the sample and the reference population. In a specific embodiment, step (b) comprises determining whether the similarity exceeds one or more predetermined threshold values of similarity.
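The similarity-and-threshold step can be sketched as follows. This is an illustration only: using Pearson correlation as the similarity measure, representing each reference population by a single vector of co-expression values, the threshold of 0.5, and all names and numbers are assumptions made for the example.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def classify_by_similarity(sample_signature, references, threshold=0.5):
    """Return (label, similarity) for the most similar reference signature.

    The label is None when even the best similarity is below the
    predetermined threshold, i.e. the sample is left unclassified.
    """
    scores = {label: pearson(sample_signature, ref)
              for label, ref in references.items()}
    best = max(scores, key=scores.get)
    if scores[best] < threshold:
        return None, scores[best]
    return best, scores[best]


# Hypothetical reference signatures: one value per hub/interactor pair.
references = {
    "good": [0.8, 0.6, -0.2, 0.1],
    "poor": [-0.5, -0.4, 0.7, 0.9],
}
label_1, score_1 = classify_by_similarity([0.7, 0.5, -0.1, 0.2], references)
label_2, score_2 = classify_by_similarity([0.1, -0.1, 0.1, -0.1], references)
```

The first sample closely follows the "good" reference and is assigned to it; the second resembles neither reference and falls below the threshold, so it is returned unclassified.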
In a further embodiment, the methods further comprise assigning a therapeutic regimen to the diagnosed subject, e.g., a breast cancer patient. In an embodiment, the invention provides a method for assigning a therapeutic regimen to a patient comprising classifying the patient as having a poor prognosis or good prognosis on the basis of co-expression of informative hubs and interacting partners and assigning the patient a therapeutic regimen comprising no adjuvant chemotherapy if the patient is classified as having a good prognosis or comprising chemotherapy if the patient has a poor prognosis.
In embodiments of the methods of the invention for breast cancer diagnosis or prognosis, the hubs are informative hubs, in particular one or more informative hubs chosen from or selected from the group consisting of the BASC complex, MAP3K1, GRB2, SHC and SRC, estrogen signaling (ESR1), the DNA damage response (BRCA1, RAD51, MRE11), proteasome components and ribosomal components.
Still another embodiment of a method for diagnosing a subject for the presence of a biological state, a disease or disease stage comprises: (a) obtaining a biological sample from the subject; (b) detecting the expression levels of a hub protein and interacting partner(s) in the sample; (c) determining the relative expression of a hub protein and interacting partner(s) in the sample; (d) comparing the subject's relative expression to a standard or model. Such a standard or model, in one embodiment, is a network signature characteristic of a biological state, a disease or disease stage in a reference population. In one embodiment, the relative expression is determined for each significant hub in each subject, as described in Example 1. In one embodiment the algorithm to measure the difference in co-expression of the hubs and each interacting protein of those hubs found to be significant uses the following equation:
$$\mathrm{InteractionDiff} = I_n - H$$
where the difference is taken of the expression of each of n interactors, In, from each significant hub, H, and all significant hubs are evaluated. Patient data are then clustered using the affinity propagation44 algorithm. In another embodiment, the standard or model is a subject-specific network signature of the same subject generated from a temporally earlier biological sample. In aspects, a significant difference between the subject's relative expression and the standard or model indicates a biological state, disease or disease stage, or can identify whether therapeutic intervention is necessary or, if currently administered, is efficacious. In other aspects, a significant similarity between the subject's relative expression and the standard or model indicates a biological state, disease or disease stage, or can identify whether therapeutic intervention is necessary or, if currently administered, is efficacious. In such a method, one may repeat step (c) for additional interacting partners with the hub protein, and for additional hub proteins and their interacting partners, to generate a subject-specific network signature useful in identifying the biological state, disease or disease stage. In still another embodiment, the steps (b), (c) and/or (d) may transform the expression levels of a hub protein and an interacting partner, or relative expression, into numerical or graphical form. This may be done by a suitably programmed computer or processor. For example, see the code in Example 3. In another embodiment, this method can assist in predicting likelihood of recurrence of a disease, depending upon the selection of the standard or model.
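The InteractionDiff calculation described above can be sketched for a small set of samples as below. The network layout, the placeholder identifiers (HUB_A, P1, and so on) and the expression values are hypothetical; only the subtraction of hub expression from interactor expression follows the equation in the text.

```python
def interaction_diffs(expression, network):
    """Return {(hub, interactor): I_n - H} for one sample.

    expression maps a gene name to its expression level in the sample;
    network maps each significant hub to a list of its interactors.
    """
    diffs = {}
    for hub, interactors in network.items():
        for partner in interactors:
            diffs[(hub, partner)] = expression[partner] - expression[hub]
    return diffs


def signature_rows(samples, network):
    """Build one feature row per sample in a fixed column order."""
    columns = [(h, p) for h in sorted(network) for p in sorted(network[h])]
    rows = {}
    for sample_id, expression in samples.items():
        diffs = interaction_diffs(expression, network)
        rows[sample_id] = [diffs[c] for c in columns]
    return columns, rows


# Hypothetical network of significant hubs and per-sample expression.
network = {"HUB_A": ["P1", "P2"], "HUB_B": ["P3"]}
samples = {
    "patient_1": {"HUB_A": 2.0, "P1": 3.5, "P2": 1.0, "HUB_B": 0.5, "P3": 0.5},
    "patient_2": {"HUB_A": 1.0, "P1": 1.0, "P2": 4.0, "HUB_B": 2.0, "P3": 0.0},
}
columns, rows = signature_rows(samples, network)
```

Each row is then a numerical form of one patient's network signature, and the rows could be handed to a clustering routine (for example an affinity propagation implementation) to group patients.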
In another embodiment, a method for generating a network signature identifying a biological state, a disease or disease stage is performed by (a) obtaining gene expression levels from a reference population having at least two different biological states, diseases or disease stages; (b) dividing the reference population gene expression levels into groups, each group characteristic of one different biological state, disease or disease stage; and (c) assessing differences in relative gene expression levels between a hub protein and an interacting partner in the groups to identify a hub protein whose expression relative to an interacting partner is characteristic of one said different biological state, disease or disease stage. In one embodiment, the method includes centering of the expression levels of (a) and/or (b). In one embodiment, the centering may be median centering. In certain embodiments of this method, step (c) is repeated for additional interacting partners with the hub protein, and for additional hub proteins and their interacting partners, to generate a network signature useful in identifying a biological state, disease or disease stage. In still other embodiments of this method, step (c) includes (i) matching each expression level to a hub protein or an interacting partner protein of the hub protein; (ii) obtaining the Pearson correlation coefficient (r) for each hub protein using the following equation:
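$$r_A = \frac{\sum_{i=1}^{n_A} (I_{A,i} - \bar{I}_A)(H_{A,i} - \bar{H}_A)}{(n_A-1)\, s_{I_A} s_{H_A}}, \qquad r_D = \frac{\sum_{i=1}^{n_D} (I_{D,i} - \bar{I}_D)(H_{D,i} - \bar{H}_D)}{(n_D-1)\, s_{I_D} s_{H_D}}$$

wherein:
“I” denotes the amount of expression of an interacting partner,
“H” denotes the amount of expression of a hub protein,
“A” denotes the group of subjects having one biological state, disease or disease stage,
“D” denotes the group of subjects having a different biological state, disease or disease stage,
“nA or nD” denotes the number of subjects in each group, and
“sI and sH” denote the standard deviations of the interacting partner and hub protein expression within each group.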
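The centering and grouping steps that precede the correlation calculation can be sketched as below. The specification does not state whether median centering is applied per gene or per sample; this example assumes per-gene centering across all samples, and the function names, group labels and values are hypothetical.

```python
from statistics import median


def median_center(expression):
    """Subtract each gene's median across samples from its values.

    expression maps a gene name to a list with one value per sample.
    """
    centered = {}
    for gene, values in expression.items():
        m = median(values)
        centered[gene] = [v - m for v in values]
    return centered


def split_by_group(expression, labels):
    """Split per-gene value lists into {group: {gene: [values]}}."""
    groups = {}
    for gene, values in expression.items():
        for value, group in zip(values, labels):
            groups.setdefault(group, {}).setdefault(gene, []).append(value)
    return groups


# Hypothetical expression for one hub and one partner in four subjects.
labels = ["A", "A", "D", "D"]
expression = {"HUB_A": [1.0, 3.0, 2.0, 6.0], "P1": [4.0, 5.0, 9.0, 7.0]}
centered = median_center(expression)
groups = split_by_group(centered, labels)
```

Under this per-gene assumption the centering does not change a Pearson correlation computed across samples, but it does change difference-based quantities such as the hub-versus-interactor relative expression.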
The invention also provides a method of assessing whether a patient is afflicted with or has a pre-disposition for a disease, in particular cancer, the method comprising comparing: (a) levels of co-expression of hubs and their interacting partners characteristic of the disease in a sample from the patient; and (b) reference levels of co-expression of hubs and their interacting partners characteristic of the disease in samples of the same type obtained from normal patients not afflicted with the disease, patients afflicted with the disease or at a different stage in the disease. In an embodiment, altered co-expression levels relative to the reference levels are an indication that the patient is afflicted with the disease. In another embodiment, substantially similar co-expression levels relative to the reference levels are an indication that the patient is afflicted with the disease.
In a further aspect, a method for screening a subject for a disease or disease stage is provided comprising (a) obtaining a biological sample from a subject; (b) detecting the amount of co-expression of hubs and interacting partners characteristic of the disease in the sample; and (c) comparing the amount detected to a predetermined standard or model. In an embodiment, detection of amounts of co-expression of hubs and interacting partners associated with the disease that differ significantly from the standard or model indicates the disease or disease stage. In another embodiment, detection of amounts of co-expression of hubs and interacting partners associated with the disease that are substantially similar to a standard or model indicates the disease or disease stage.
The invention provides a method for detection, diagnosis or prediction of a disease in a subject comprising: obtaining a sample of blood, plasma, serum, urine or saliva or a tissue sample from the subject; subjecting the sample to a procedure to measure levels of co-expression of hubs and interacting partners characteristic of the disease; and detecting, diagnosing, or predicting the disease by comparing the levels of co-expression of the hubs and interacting partners to the levels obtained from a control subject with no disease.
The invention also provides a method for assessing the aggressiveness or indolence of a cancer (e.g. staging), the method comprising comparing: (a) levels of co-expression of hubs and interacting proteins characteristic of the aggressiveness or indolence of the cancer in a patient sample; and (b) levels of co-expression of the hubs and interacting proteins in a standard or model.
In an embodiment, a significant difference between the co-expression levels in the sample and the standard or model is an indication that the cancer is aggressive or indolent. In another embodiment, substantially similar co-expression levels in the sample and the standard or model are an indication that the cancer is aggressive or indolent.
In an aspect, the invention provides a method for determining whether a cancer has metastasized or is likely to metastasize in the future, the method comprising comparing: (a) levels of co-expression of hubs and interacting partners characteristic of metastasis or likelihood thereof in a patient sample; and (b) levels (or non-metastatic levels) of the co-expression of hubs and interacting proteins in a standard or model.
In an embodiment, a significant difference between the levels in the patient sample and the standard or model is an indication that the cancer has metastasized or is likely to metastasize in the future. In an embodiment, substantially similar levels in the patient sample and the standard or model is an indication that the cancer has metastasized or is likely to metastasize in the future.
In another aspect, the invention provides a method for monitoring the progression of a disease, in particular cancer, in a patient, the method comprising: (a) detecting levels of co-expression of hubs and interacting proteins characteristic of the disease in a sample from the patient at a first time point; (b) repeating step (a) at a subsequent point in time; and (c) comparing the levels detected in (a) and (b), and therefrom monitoring the progression of the disease.
The invention contemplates a method for determining the effect of an environmental factor on a disease comprising comparing levels of co-expression of hubs and interacting proteins in the sample in the presence and absence of the environmental factor.
The methods of the invention may include the step of assigning a numerical value depending on whether the expression levels of hubs and interacting partners fall within or outside a reference network signature or levels for a standard or model. For example, a numerical value of 0 can be assigned to a sample if the expression levels are within the reference network signature or levels for a standard or model, and a positive value can be assigned where the expression levels are outside the reference network signature or levels for a standard or model. A positive value in some embodiments indicates a perturbed expression profile. As the number of hubs and interacting partners having expression levels outside the reference network signature or the levels for a standard or model increases, the assigned value will correspondingly increase. A sample or subject having a perturbed expression profile may indicate a disease state, a predisposition to developing a disease, a prognosis associated with a disease, or treatment of a disease and such a perturbed health state may be used to estimate the course of a disease. In some embodiments (e.g. where the standard or model represents a desirable category or classification), a positive value may indicate a favorable or normal profile which in the context of a disease or disease state may indicate the absence of a disease state or a predisposition to developing a disease, or a favorable prognosis or treatment of a disease.
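The scoring step described above can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that the reference network signature is represented as a per-gene range of acceptable expression levels; the function name, the gene labels and the ranges are hypothetical and are not taken from the Examples.

```python
def perturbation_score(sample_levels, reference_ranges):
    """Count hubs/interacting partners whose level lies outside the reference.

    sample_levels: dict mapping a gene label to its measured level.
    reference_ranges: dict mapping a gene label to a (low, high) tuple
    (a hypothetical representation of a reference network signature).
    Returns 0 when every level is within range; the value grows by one
    for each hub or partner that falls outside its range.
    """
    score = 0
    for gene, level in sample_levels.items():
        low, high = reference_ranges[gene]
        if not (low <= level <= high):
            score += 1
    return score


# Hypothetical reference ranges and two hypothetical samples.
reference_ranges = {"HUB1": (0.2, 0.8), "PARTNER_A": (0.1, 0.5), "PARTNER_B": (0.3, 0.9)}
within = {"HUB1": 0.5, "PARTNER_A": 0.3, "PARTNER_B": 0.4}
perturbed = {"HUB1": 0.95, "PARTNER_A": 0.3, "PARTNER_B": 0.1}

print(perturbation_score(within, reference_ranges))
print(perturbation_score(perturbed, reference_ranges))
```

A score of 0 corresponds to a profile within the reference, and larger scores correspond to a greater number of hubs and partners outside it.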
The invention further relates to a method of assessing the potential efficacy of a therapy for inhibiting a disease in a patient. A method of the invention comprises comparing: (a) levels of co-expression of hubs and interacting proteins characteristic of the disease in a first sample obtained from the patient prior to providing at least a portion of the therapy to the patient; and (b) levels of co-expression of hubs and interacting proteins characteristic of the disease in a second sample obtained from the patient following therapy. In an embodiment, a significant difference between the levels of co-expression of hubs and interacting proteins in the second sample relative to the first sample is an indication that the therapy is efficacious for inhibiting the disease. In another embodiment, substantially similar levels of co-expression of hubs and interacting proteins in the second sample relative to the first sample is an indication that the therapy is efficacious for inhibiting the disease. The “therapy” may be any therapy for treating the disease, including but not limited to therapeutics, radiation, immunotherapy, gene therapy, and surgical removal of tissue. Therefore, the method can be used to evaluate a patient before, during, and after therapy.
The methods of the invention can be used to categorize or subcategorize drug responses in a population based on co-expression levels of hubs and interacting partners. A network signature can be generated using the methods of the invention that correlates network modularity and drug responses (e.g. changes in a sign or symptom of a disease). Methods of the invention for classifying a population by drug response can be used to stratify drug responses into, for example responder categories. These categories may be useful for predicting the effectiveness of a treatment, including the appropriate dosage or patient subpopulations for a treatment, or for optimizing a therapeutic regimen. The methods of the invention allow an early determination of drug responsiveness and evaluation of patients prior to an overt or full display of a drug response. These methods also permit a prediction of patient responsiveness as a companion diagnostic with other known diagnostic agents.
Thus, the invention provides a method of categorizing drug responsiveness in a population comprising (a) determining the expression levels of hubs and interacting partners for individuals in the population; (b) identifying a first group of individuals in the population that have a substantially similar response to the drug; (c) clustering the hubs and interacting partners by the drug response of the first group to generate a reference network signature indicating drug responses for the first group of individuals. A substantially similar response to a drug can refer to individuals having overt manifestations or indications that can be objectively determined by a physician (e.g. signs of a disease or a test result) or are based on subjective symptoms described by the individual. The method can further include the steps of (d) identifying a second group of individuals having a substantially similar response to the drug which differs from the drug response of the first group; and (e) clustering the hubs and interacting partners by the drug response of the second group to generate a reference network signature indicating drug responses for the second group of individuals. The method can further include optionally repeating steps (d) and (e) one or more times for an additional group of individuals having a substantially similar drug response that differs from other groups. In another aspect, this method may be used to determine how a particular drug or therapeutic, preadministered to a population, affects the network signature for a particular disease or disease state.
The invention also provides a method of predicting a drug response in an individual comprising (a) determining expression levels of hubs and interacting partners in a sample from the individual; (b) producing a network signature of informative hubs and their interacting partners; and (c) comparing the network signature with a reference network signature of drug responses to predict the drug response in the individual. In an embodiment, a network signature of the individual that is within or substantially similar to the reference network signature indicates that the individual has or will have a substantially similar response to the drug as the reference population used for the reference network signature.
The invention further provides a method for assigning an individual to one of a plurality of categories in a clinical trial comprising determining for the individual co-expression of hubs and interacting partners in a sample from the individual; producing a network signature of informative hubs and their interacting partners; comparing the network signature with reference network signatures of reference populations that have different clinical categories; and assigning the individual to a category in the clinical trial based on correlation of the network signature with one or more reference network signatures.
The invention also provides pharmacogenetic methods for determining suitable treatment regimens for diseases, in particular cancer, and methods for treating cancer patients, based around selection of patients according to the methods of the invention.
A method of the invention that provides a network signature may be used as a readout in animal model based screening methods for new therapeutic approaches and compounds. In an aspect of the present invention, a network signature is utilized to predict the efficacy of potential new treatments in animal models for disease states.
The present invention also provides a method for evaluating the efficacy of, or validating or predicting the utility of an animal model of a disease for elucidating strategies, pathways, processes and guiding the development of hypotheses for testing in a target animal. The method may comprise comparing a network signature generated for an animal model of a disease using a method of the invention and a network signature of a population of the target animal suffering from the disease.
The methods of the invention may further employ other data along with the network modularity signature. For example, in classifying a disease state, such data may include, without limitation, patient age, stage of disease, molecular or genetic subtype and other like data.
Methods of the invention may be used in diagnostic methods performed in a physician's office or in a clinical laboratory. They can also be used in remote diagnostic methods in which the step of measuring the co-expression of hubs and interacting partners is separated from the step of analyzing the co-expression in reference to a standard or model or reference network signature. The measurement and analysis steps may be coordinated via a network such as the internet.
In an aspect, the invention relates to methods for assigning a sample to a prognostic class and methods for classifying a sample obtained from a subject in a prognostic class using a method or scheme described herein. Once a sample from a subject is classified in a prognostic class, then a healthcare provider can determine the proper course of treatment for the subject.
The invention provides a business method for obtaining regulatory review of a drug comprising: (a) determining hubs and their interacting partners that significantly discriminate among responders and non-responders to the drug; (b) using results from step (a) to determine whether a patient would benefit from administration of the drug; and (c) combining information from prior regulatory filings for the drug in combination with information from the association in step (b) to support a new drug approval regulatory filing. This method in one embodiment is performed by a suitably programmed computer processor. In one embodiment, the method employs all or a portion of the code defined in Example 3. In a business method of the invention, the prior regulatory filings may be filed in the United States or in a country outside of the United States. A business method of the invention may further comprise marketing the drug with a diagnostic test, wherein the diagnostic test stratifies a patient population that displays a network signature that supports a treatment regimen with the drug, and stratifies the patient population so that a subset of the patient population that is likely to benefit from treatment with the drug is identified. The method may identify a subset of a population comprising individuals for whom results from the diagnostic test predict no adverse event if treated with the drug or predict an efficacious response if treated with the drug. The business method may further comprise the step of collecting royalties from sales of the drug.
In certain embodiments, any or all of the methods described herein are computer-implemented and thus the invention provides computer systems, computer programs, computer-readable data media and laboratory robots or evaluating devices for any of the methods of the invention.
In one embodiment, a system comprises a computer processor capable of processing gene expression data for a hub protein and its interacting partners, an input device, an output device, and a memory capable of storing computer-readable instructions, wherein the contents of the memory comprises computer-readable instructions that if executed are capable of directing the computer to: (a) receive gene expression levels data from a biological sample from a subject; (b) determine the relative expression of a hub protein and an interacting partner in the sample; (c) compare the relative expression to a standard or model; and (d) output an indication of the presence of a biological state, a disease or disease stage, likelihood thereof, or prognosis therefor. In another embodiment, this system directs the computer to repeat step (b) and/or (c) for additional interacting partners with the hub protein, and for additional hub proteins and their interacting partners. In some embodiments, steps (b) and (c) are performed with multiple hubs and interacting partners. In one embodiment, the resulting output indication is a network signature or subset thereof characteristic of a biological state, a disease, or a disease stage. In one embodiment of this system, the computer-readable instructions comprise the computer program of Example 3.
In another embodiment, a system comprises a computer processor capable of processing gene expression data for a hub protein and its interacting partners, an input device, an output device, and a memory capable of storing computer-readable instructions, wherein the contents of the memory comprises computer-readable instructions that if executed are capable of directing the computer to: (a) receive gene expression level data from a reference population having two different biological states, diseases or disease stages; (b) divide reference population gene expression levels into two groups, each group characteristic of one different biological state, disease or disease stage; (c) determine the relative gene expression of a hub protein and an interacting partner in the groups; (d) assess differences in relative gene expression levels between a hub protein and an interacting partner in the groups to identify a hub protein whose expression relative to an interacting partner is characteristic of one different biological state, disease or disease stage; (e) optionally repeat steps (c) and/or (d) for additional interacting partners with the hub protein, and for additional hub proteins and their interacting partners; and (f) output a network signature useful in identifying a biological state, disease or disease stage. In one embodiment of this system, the computer-readable instructions comprise the computer program of Example 3. In an embodiment, steps (c) and (d) are performed with multiple hubs and interacting partners.
In another embodiment, a computer-readable medium comprises computer-readable code that if executed is configured to: (a) compare the relative expression of a hub protein and an interacting partner detected in a subject's sample to a standard or model characteristic of a biological state, disease or disease stage; and (b) provide an indication of a biological state, disease or disease stage in the subject based upon the comparison. This computer-readable medium, in certain embodiments, contains computer-readable code configured for additional interacting partners with the hub protein, and for additional hub proteins and their interacting partners. In one embodiment of this medium, the computer-readable code comprises the computer program of Example 3.
In another embodiment, a computer-readable medium comprises computer-readable code that if executed is configured to: (a) receive gene expression level data from a reference population having two different biological states, diseases or disease stages; (b) divide reference population gene expression levels into two groups, each group characteristic of one different biological state, disease or disease stage. For example, in one embodiment, one group is composed of poor outcome subjects having or being treated for a cancer and the other group is composed of good outcome subjects successfully treated for the cancer. Successful treatment can include a disease-free state or survival with the disease for a significant period of time, post-diagnosis. Additional steps which the medium is configured to execute are: (c) determine the relative gene expression of a hub protein and interacting partners in the groups; (d) assess differences in relative gene expression levels between a hub protein and an interacting partner in the groups to identify a hub protein whose expression relative to an interacting partner is characteristic of one biological state, disease or disease stage; (e) optionally repeat steps (c) and/or (d) for additional interacting partners with the hub protein, and for additional hub proteins and their interacting partners; and (f) provide a network signature (or a subset thereof) useful in identifying a biological state, disease or disease stage. In an embodiment of this medium, steps (c) and (d) are performed with multiple hubs and interacting partners. In one embodiment of this medium, the computer-readable code comprises the computer program of Example 3.
In an aspect, the invention pertains to a method for use in a computer system for classifying at least one sample obtained from an individual. The method comprises providing a model which correlates classes (e.g. biological states) and co-expression of hubs and their interacting partners; assessing a sample for co-expression of hubs and their interacting partners; and using the model to classify the sample by comparing the co-expression of informative hubs and their interacting partners to the model to thereby obtain a classification. The methods further comprise cross-validation of the model by eliminating or withholding samples used to build the model; building a cross-validation model for classifying without the eliminated samples; and using the cross-validation model to classify the eliminated samples into a winning class by comparing the co-expression values of hubs and their interacting partners of the eliminated samples to the cross-validation model. The methods may further comprise filtering out any hub and interacting partner co-expression values in the sample that exhibit an insignificant change and normalizing the co-expression values. The method may also comprise providing an output indicating the classes.
The invention also relates to a computer apparatus for classifying a sample into a class, wherein the sample is obtained from a subject, wherein the apparatus comprises a source of co-expression values of hubs and their interacting partners in the sample, a processor routine executed by a digital processor coupled to receive the gene co-expression values from the source, the processor routine determining classification of the sample by comparing the co-expression values of the sample to a model built to correlate the co-expression values with co-expression of hubs and interacting partners characteristic of the class; and an output assembly coupled to the digital processor for providing an indication of the classification of the sample.
Another aspect of the invention provides a computer apparatus for constructing a model for classifying at least one sample to be tested having hub and interacting partner co-expression values, wherein the apparatus comprises a source of hub and interacting partner co-expression values from two or more samples belonging to two or more classes, the source being a series of hub and interacting partner co-expression values for the samples; a processor routine executed by a digital processor, coupled to receive the hub and interacting partner co-expression values from the source, the processor routine determining hubs and interacting partners for classifying the sample, and constructing the model with a portion of the informative or relevant hubs and interacting partners using a correlation scheme. The apparatus can further comprise a filter coupled between the source and the processor routine for filtering out any of the hubs and interacting partners that are not significant. The output assembly can be a graphical representation which may be in colour.
The invention also provides a machine readable computer assembly for classifying a sample into a class, wherein the sample is obtained from an individual, wherein the computer assembly comprises a source of hub and interacting partner co-expression values of the sample, a processor routine executed by a digital processor, coupled to receive the co-expression values from the source, the processor routine determining classification of the sample by comparing the co-expression values of hubs and interacting partners in the sample to a model; and an output assembly coupled to the digital processor for providing an indication of the classification of the sample. The invention also provides a machine readable computer assembly for constructing a model for classifying at least one sample to be tested having hub and interacting partner co-expression values, wherein the computer assembly comprises a source of co-expression values from two or more samples belonging to two or more classes the source being a series of hub and interacting partner co-expression values for the samples, a processor routine executed by a digital processor coupled to receive the co-expression values of the vectors from the source, the processor routine determining relevant hub and interacting partners from the co-expression values for classifying the sample and constructing the model with a portion of the relevant hub and interacting partners by using a correlation analysis.
The invention further provides a kit for performing a method of the invention. A kit may comprise a microarray for assaying levels of informative hubs and interacting partners and a computer system for comparing the levels with a standard, model or reference network signature. The computer system may comprise a processor and a memory encoding one or more programs coupled to the processor wherein the one or more programs cause the processor to perform a method comprising computing the aggregate differences of co-expression between the sample and a reference population or a method comprising determining the correlation of co-expression of the hubs and interacting partners to the co-expression in a reference population. In an aspect, the kit is able to distinguish samples from patients with a good disease prognosis from samples from patients with poor prognosis. Thus, the invention provides a kit for determining whether a sample is derived from a subject having a good prognosis or a poor prognosis comprising at least one microarray comprising genes encoding hubs and interacting partners characteristic of prognosis of a disease and a computer readable medium having recorded thereon programs for determining the similarity of the co-expression of informative hubs and interacting partners in the sample to that in a reference population of individuals having a good prognosis or a poor prognosis wherein one or more programs cause a computer to perform a method comprising computing the aggregate differences in co-expression of the informative hubs and interacting partners between the sample and the reference population or a method comprising determining the correlation of the co-expression in the sample to the co-expression in the reference population.
All of the above methods and compositions may be utilized in combination with other known diagnostic reagents, compositions and methods to identify biological states, diseases and disease states, or to predict the likelihood of particular responsiveness of a subject to therapeutic regimens, or the likelihood of recurrence of a disease or the degree of severity of disease or biological state. The methods described herein may be used to confirm diagnoses made utilizing other methods and reagents or to assist in differential diagnoses of biological states, diseases and disease stages. One of skill in the art may select from among all known diagnostic reagents and methods for combination with the methods described herein.
The following non-limiting examples are illustrative of the present invention:
The following materials and methods were used in the study described in the Examples.
Data Integration to Determine PCC of Co-Expression in Interaction Networks
A method analogous to that previously described was used13. The complete interactome from OPHID9 as well as subsets of interactions interologue mapped from yeast to man41 or just literature curated interactions11 was downloaded as well as expression data from 79 human tissues8. Hubs were selected as those with greater than 5 interactions, as these proteins are in the top 15% of the degree distribution of the network. For each hub the average PCC of co-expression for each interaction and the hub was assessed using a similar algorithm as previously described13. Random re-assignment of the expression values to nodes in the network was used to ascertain if the observed network was nonrandom. The network was visualized using Cytoscape 2.5.142.
GO Functional Similarity of Hubs and their Interactors
Semantic similarity between hubs and their interactors was calculated by combining the similarity scores between the GO terms annotated to each protein. Lin GO similarity measures were used to compute GO term similarity using the GraSM approach, where for each term of each of the proteins only the most similar term of the other protein is used to compute a composite average43.
Topological Network Analysis
Betweenness and Characteristic Pathlength of networks were calculated with previously described algorithms via the tYNA web interface19. When assessing network robustness to hub removal, an equivalent number of intermodular and intramodular hubs were removed from the network in order of descending clustering coefficient. To validate that the two hub classes are distinct, length, phosphorylation, linear motifs, globularity, and domain architecture were investigated (see Supplemental Methods below). These were either computed directly from the hub sequence or by mapping to the appropriate database. Significance levels were computed by sampling (see Supplemental Methods below).
Distribution of Hub Types by Human Disease Phenotypes
Entries in OMIM24 for each hub gene were extracted and subsequently manually curated for hubs that were 1) associated with cancer, malignancy or metastasis, or 2) found to be involved in oncogenic translocation fusions.
Network Analysis Between Breast Tumour Samples
To determine the essential network misregulated between breast cancer patient outcomes (alive without disease vs. dead from disease), a non-parametric algorithm was used to sample hub behaviour between groups of samples. Briefly, the absolute difference between the two groups in the PCC of a hub and each of its interactions was calculated, and the same quantity was calculated for 1000 random re-assignments of patients into equally sized groups. P-value cut-off and degree cut-off for hubs were optimized as a function of accuracy during cross validation runs. Patients were clustered using an affinity propagation algorithm44. Kaplan-Meier survival curves for the groups defined by the algorithm were drawn from patient survival data using SPSS v14.0.
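The non-parametric comparison described above can be sketched in Python as follows. This is an illustrative sketch only, not the program of Example 3: the function names are hypothetical, the per-hub statistic is assumed to be the sum over interactors of the absolute PCC difference, and each random re-assignment is assumed to preserve the two original group sizes.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant profile has no defined correlation
    return sxy / math.sqrt(sxx * syy)


def hub_difference(hub, partners, group_a, group_b):
    """Sum over interactors of |PCC in group A - PCC in group B|."""
    total = 0.0
    for partner in partners:
        r_a = pearson([hub[i] for i in group_a], [partner[i] for i in group_a])
        r_b = pearson([hub[i] for i in group_b], [partner[i] for i in group_b])
        total += abs(r_a - r_b)
    return total


def permutation_p_value(hub, partners, group_a, group_b, n_perm=1000, seed=0):
    """Fraction of random patient re-assignments at least as extreme as observed."""
    rng = random.Random(seed)
    observed = hub_difference(hub, partners, group_a, group_b)
    patients = list(group_a) + list(group_b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(patients)
        perm_a = patients[:len(group_a)]
        perm_b = patients[len(group_a):]
        if hub_difference(hub, partners, perm_a, perm_b) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Hypothetical data: the hub is co-expressed with its partner in group A
# (patients 0-5) and anti-correlated with it in group B (patients 6-11).
hub = [1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6]
partner = [1, 2, 3, 4, 5, 6, 6, 5, 4, 3, 2, 1]
group_a, group_b = list(range(6)), list(range(6, 12))

observed = hub_difference(hub, [partner], group_a, group_b)
p = permutation_p_value(hub, [partner], group_a, group_b)
print(observed, p)
```

The add-one correction in the returned p-value keeps the estimate strictly positive; whether the original analysis used such a correction is not stated.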
A classification algorithm was trained to identify patterns in expression of genes interacting with the hub that were predictive of prognosis and the ability of the algorithm to predict the patient outcome was assessed using 5-fold cross-validation. Specifically, the patient network data and clinical outcome were partitioned into five approximately equally-sized portions; the algorithm was trained on four of these portions, holding out one of the portions for testing. To test the algorithm, only the gene expression data for patients in the hold-out set was provided and its predictions of clinical outcome compared with the actual outcomes for these patients. This procedure was repeated for each hold-out set, amassing unbiased outcome predictions for every patient. To measure the variability in predictions, the 5-fold cross validation procedure was repeated three times with different random partitions of the data. The algorithm first identifies hubs based on their number of neighbours, k, and then assigns each a score, p, equal to the significant difference of hub correlation with its interactors between alive patients and those who died of disease when compared to a random distribution. The algorithm then selects a subset of the hubs by applying a cutoff to p; subtracts the hub expression level from those of all its interactors; and clusters the hub-subtracted expression levels of interactor genes using affinity propagation44. To evaluate the accuracy of the algorithm, the hub-subtracted expression levels of patients in the hold-out set are clustered along with the patients in the training set and the predicted probability of a poor outcome in these patients is set to be the proportion of patients from the training set in their cluster who experienced a poor outcome. 
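Two steps of the procedure above, the hub subtraction and the outcome prediction for a held-out patient, can be sketched in Python. The clustering itself (affinity propagation) is not shown and the cluster assignments are taken as given; the function names and the example data are hypothetical.

```python
def hub_subtract(expression, hub, interactors):
    """Subtract the hub expression level from that of each interactor.

    expression: dict mapping a gene label to its level in one patient.
    Returns the hub-subtracted interactor levels in the order given.
    """
    return [expression[gene] - expression[hub] for gene in interactors]


def predicted_poor_outcome(cluster_of, training_outcome, test_patient):
    """Proportion of training patients in the test patient's cluster
    who experienced a poor outcome (1 = poor, 0 = good).

    cluster_of: dict mapping every patient (training and hold-out) to a
    cluster label produced by a clustering step that is not shown here.
    training_outcome: dict mapping training patients only to 0 or 1.
    Returns None when the cluster contains no training patients.
    """
    cluster = cluster_of[test_patient]
    members = [p for p in training_outcome if cluster_of[p] == cluster]
    if not members:
        return None
    return sum(training_outcome[p] for p in members) / len(members)


# Hypothetical example: one patient's expression and a toy clustering.
patient_expression = {"HUB": 2.0, "I1": 3.5, "I2": 1.0}
subtracted = hub_subtract(patient_expression, "HUB", ["I1", "I2"])

cluster_of = {"t1": 0, "t2": 0, "t3": 0, "t4": 1, "held_out": 0}
training_outcome = {"t1": 1, "t2": 1, "t3": 0, "t4": 0}
probability = predicted_poor_outcome(cluster_of, training_outcome, "held_out")
print(subtracted, probability)
```

In a 5-fold setting, these two functions would be applied once per hold-out portion, with the outcomes of the hold-out patients withheld from `training_outcome`.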
The performance of this classifier was calculated using different thresholds for p and minimum hub degree (k), and it was found that the best performance of test set classification was achieved when k=7 and p=0.09 was used for training set parameters (
Supplemental Methods: Data Integration to Determine PCC of Co-Expression in Interaction Networks
A method analogous to that previously described was used13. The complete interactome from STRING10 or OPHID9 as well as subsets of interactions interologue mapped from yeast to man41 or just literature curated interactions11 was downloaded as well as expression data from 79 human tissues8. Duplicate gene expression spots from the GeneAtlas data for a particular gene were averaged. A degree (k) cut off of greater than or equal to 5 was used since this represents the highest 15% of the degree distribution of hubs. For each hub the average PCC of co-expression for each interaction and the hub was assessed using a similar algorithm as previously described13. The entire OPHID database9 and the human GeneAtlas expression data8 were downloaded, and gene expression data were matched to protein interactions via NCBI gene IDs. The Pearson Correlation Coefficient of each interaction of each hub was calculated by:
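Let $X_{I,j}$ denote the expression level of interactor I in tissue j.
Let $X_{H,j}$ denote the expression level of hub H in tissue j. Then

$$r_{I,H} = \frac{\sum_{j=1}^{n} (X_{I,j} - \bar{X}_I)(X_{H,j} - \bar{X}_H)}{(n-1)\, s_I s_H}$$

where I is an interactor of hub H and j denotes the expression data for the hub or interactor in each of n tissues, and the summation is over all tissues (j=1, 2, 3 . . . n). $\bar{X}_I$ and $\bar{X}_H$ are the mean expression levels across the n tissues, and $s_I s_H$ is the product of the standard deviations of the expression data for the hub and interactor. The average over all $n_H$ interactors for hub H was taken as:

$$\bar{r}_H = \frac{1}{n_H} \sum_{I=1}^{n_H} r_{I,H}$$

where $r_{I,H}$ is the correlation of each interaction across n tissues. The network was visualized using Cytoscape 2.5.142.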
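The average hub correlation described above can be sketched in Python using only the standard library. The sketch assumes sample standard deviations with an n-1 normalisation, which is an assumption about the convention used; the function names and the tissue profiles are hypothetical.

```python
import statistics


def pearson(x, y):
    """Pearson correlation of two expression profiles across n tissues.

    Uses sample standard deviations, so the covariance sum is divided
    by (n - 1) * s_x * s_y (an assumed normalisation convention).
    """
    n = len(x)
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    s_x, s_y = statistics.stdev(x), statistics.stdev(y)
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return cross / ((n - 1) * s_x * s_y)


def average_hub_pcc(hub_profile, interactor_profiles):
    """Mean of the hub-interactor correlations over all interactors of a hub."""
    correlations = [pearson(hub_profile, p) for p in interactor_profiles]
    return sum(correlations) / len(correlations)


# Hypothetical profiles across four tissues.
hub_profile = [1.0, 2.0, 3.0, 4.0]
co_expressed = [[2.0, 4.0, 6.0, 8.0], [1.5, 2.5, 3.5, 4.5]]
mixed = [[2.0, 4.0, 6.0, 8.0], [4.0, 3.0, 2.0, 1.0]]

print(average_hub_pcc(hub_profile, co_expressed))
print(average_hub_pcc(hub_profile, mixed))
```

A hub whose partners all track its expression gives an average near 1, whereas a hub with both correlated and anti-correlated partners gives an average near 0.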
Supplemental Methods: Selection of a Cut-Off Point Between Inter- and Intramodular Hubs
The probability density of the average PCC represents the underlying frequencies of hub average PCCs. Therefore, the cut off was chosen as the local minimum of the frequency distribution between the two peaks of the maxima frequency. Hubs within +/−0.5 standard deviations of the average PCC were excluded as they could not be unambiguously described as either inter or intramodular hubs.
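The choice of a cut-off at the local minimum between two peaks can be sketched in Python as follows. The sketch assumes the frequency distribution is available as a binned histogram and ignores peaks in the first and last bins; the binning, the function name and the counts are hypothetical.

```python
def cutoff_between_peaks(centres, counts):
    """Return the bin centre at the lowest count between the two tallest peaks.

    centres: bin centres of a histogram of hub average PCC values.
    counts: number of hubs in each bin.
    A peak is an interior bin higher than its left neighbour and at
    least as high as its right neighbour (a simplifying assumption).
    """
    peaks = [
        i for i in range(1, len(counts) - 1)
        if counts[i] > counts[i - 1] and counts[i] >= counts[i + 1]
    ]
    if len(peaks) < 2:
        raise ValueError("distribution does not show two interior peaks")
    tallest = sorted(peaks, key=lambda i: counts[i], reverse=True)[:2]
    first, second = sorted(tallest)
    trough = min(range(first, second + 1), key=lambda i: counts[i])
    return centres[trough]


# Hypothetical bimodal histogram of average PCC values.
centres = [-0.1, 0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
counts = [1, 5, 9, 4, 2, 6, 10, 3]
print(cutoff_between_peaks(centres, counts))
```

With real data a smoothed density estimate would likely be preferable to raw bin counts, since small fluctuations can create spurious local maxima.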
Supplemental Methods: Random Reassignment of Expression Data
Random reassignment of the expression data was taken by randomly shuffling the expression data gene labels. This method of random reassignment retains the topological network structure of the interactome during the randomization.
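A minimal Python sketch of this label shuffling is given below. It assumes the expression data are held as a mapping from gene label to expression profile; the function name and the data are hypothetical. Because only the profiles are permuted among the existing labels, the interaction network itself is left untouched.

```python
import random


def shuffle_expression_labels(expression, seed=None):
    """Randomly re-assign expression profiles to gene labels.

    expression: dict mapping a gene label to its expression profile.
    Returns a new dict with the same labels, each paired with a profile
    drawn without replacement from the original set of profiles.
    """
    rng = random.Random(seed)
    genes = list(expression)
    profiles = [expression[gene] for gene in genes]
    rng.shuffle(profiles)
    return dict(zip(genes, profiles))


# Hypothetical expression data for three genes across two tissues.
expression = {"G1": [1.0, 2.0], "G2": [3.0, 4.0], "G3": [5.0, 6.0]}
shuffled = shuffle_expression_labels(expression, seed=1)
print(shuffled)
```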
Supplemental Methods: Topological Network Analysis
Betweenness and Characteristic Pathlength (CPL) of networks, which measure their connectivity, were calculated with previously described algorithms using tYNA19. Betweenness of a node n is defined as the number of node pairs (n1,n2) for which the shortest path from n1 to n2 passes through node n, where the graph is undirected and a shortest path is not counted as passing through its end nodes. CPL reflects the connectivity across the network and is defined as the median value of the minimum pathlengths required to go from node n1 to n2. A custom Python script was used to employ the batch version of tYNA by looping over all hub proteins. To attack the network, intermodular and intramodular hubs were removed in descending order of clustering coefficient. This network attacking method is similar to the one used to interrogate intermodular and intramodular hub behaviour in the yeast proteome as previously described13. The clustering coefficient is defined as:
where E is the set of edges in the graph, n is a node and ON(n) is the set of nodes such that for each n′ in ON(n), n′ ≠ n and there is at least one edge from n′ to n.
Then:
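As a minimal sketch only, and assuming the standard local clustering coefficient (the fraction of pairs of nodes in ON(n) that are themselves joined by an edge), the coefficient and the descending-order attack can be written in Python as follows. The function names and the toy edge list are assumptions of this sketch and are not taken from tYNA.

```python
from itertools import combinations

def neighbours(edges, n):
    """ON(n): nodes other than n joined to n by at least one edge."""
    out = set()
    for a, b in edges:
        if a == n and b != n:
            out.add(b)
        elif b == n and a != n:
            out.add(a)
    return out

def clustering_coefficient(edges, n):
    """Fraction of neighbour pairs of n that are directly connected."""
    on = neighbours(edges, n)
    k = len(on)
    if k < 2:
        return 0.0  # no neighbour pairs exist
    linked = sum(1 for u, v in combinations(sorted(on), 2)
                 if v in neighbours(edges, u))
    return linked / (k * (k - 1) / 2)

def attack_order(edges, hubs):
    """Hubs sorted by descending clustering coefficient (removal order)."""
    return sorted(hubs, key=lambda h: clustering_coefficient(edges, h),
                  reverse=True)

# Toy undirected network (illustrative only).
edges = [("H1", "a"), ("H1", "b"), ("a", "b"),
         ("H2", "c"), ("H2", "d"), ("H1", "H2")]
order = attack_order(edges, ["H2", "H1"])
```

In the toy network, one of the three neighbour pairs of H1 is connected, so H1 is removed before H2, none of whose neighbours interact with one another.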
Supplemental Methods: Biochemical Features of Human Hub Proteins
In order to avoid sampling biases and over-counting of features (linear motifs, domains, etc.) associated with the hub classes, a redundancy reduction of both the intramodular and intermodular hub sets was performed. This was done using the CD-HIT algorithm by comparing all protein sequences within a hub class to all other sequences within the same class and removing any member of the class with more than 90% sequence similarity to any other member. To validate that the two hub classes are biologically distinct, length, phosphorylation sites46 and other linear motifs21, globularity47, and domain architecture22 were investigated within the redundancy-reduced hub classes. The hub classes were analyzed by splitting them into three partitions (intermodular hubs, intramodular hubs and unknown, where unknown are hubs that could not confidently be assigned as intermodular hubs or intramodular hubs). Sets of Python and Perl scripts, BLAST and the databases mentioned below were utilized to perform analysis of the following biochemical features of the hub proteins. These features were either predicted from the hub protein sequence or mapped from the mentioned databases. Significance levels were assessed by sampling as described below.
a) Phosphorylation Sites.
First, all hub proteins were mapped to Phospho.ELM (v6, 2006) by reciprocal BLAST searches. A cutoff of 100 was used for the bitscore and the second-best hit was required to be 50 below the best match. Subsequently, the number of known phosphorylation sites within a hub was extracted from Phospho.ELM. Significant differences between intermodular and intramodular hubs were determined by sampling 10e6 times from the combined hub set and determining whether the mean number of sites for a hub class was significantly higher or lower than would be expected if there were no two distinct classes. Secondly, the NetworKIN algorithm was used to predict the number of phosphorylation sites for which kinases could be assigned. It was previously shown that, even without experimentally validated phosphorylation sites, this algorithm can predict novel or potential sites with highly significant enrichment (compared to random)46. Thus, the Python version of NetworKIN was used to predict the number of sites for each hub, and sampling was subsequently performed as described above to determine significance levels.
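The sampling procedure can be illustrated with the following Python sketch. It assumes a one-sided comparison (whether the class mean is higher than expected), sampling without replacement, and a reduced number of samples for speed; the function name `sampling_p_value` and the site counts are invented for illustration and are not data from the study.

```python
import random

def sampling_p_value(class_counts, combined_counts, n_samples=10000, seed=0):
    """Fraction of random draws whose mean is at least the observed class mean.

    Each draw takes len(class_counts) hubs, without replacement, from the
    combined hub set, which models the absence of two distinct classes.
    """
    rng = random.Random(seed)
    k = len(class_counts)
    observed = sum(class_counts) / k
    hits = 0
    for _ in range(n_samples):
        draw = rng.sample(combined_counts, k)
        if sum(draw) / k >= observed:
            hits += 1
    return hits / n_samples

# Invented phosphorylation-site counts per hub.
intermodular = [9, 8, 7, 10]
intramodular = [1, 0, 2, 1]
p_inter = sampling_p_value(intermodular, intermodular + intramodular)
```

A small value indicates that a mean as high as that of the intermodular class is rarely obtained when class membership is ignored; the test for a lower-than-expected mean is obtained by reversing the comparison.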
b) Linear Motifs.
The literature-curated data set of experimentally validated instances of linear motifs from the ELM21 database was used. The set was matched (using BLAST as above) to the hub sequences and subsequently the number of ELM instances in each sequence was determined. The significance of differences between intermodular and intramodular hubs was estimated by sampling as described above.
c) Domain Architecture.
The domain architecture of hub proteins was determined by searching the SMART22 set of Hidden Markov Models (HMMs) against the hub sequences. This was performed by a custom-built search pipeline using Python scripts as clients for a text pipeline at the SMART web server (EMBL, Heidelberg). Hand-annotated lists of domains involved in signaling were used to discriminate architectural differences between the hub classes. These lists were primarily based on the annotation within SMART with some additional curation. Sampling was used to estimate the significance of different domain compositions of the two hub classes as described above. This pipeline was also used to determine the number of residues residing in known globular domains (in contrast to predicted globular regions as below).
d) Globularity and Disorder.
Two previously published algorithms for detecting intrinsic protein disorder from sequence (GlobPlot, DisEMBL) were used. Both of these algorithms were deployed using pipeline versions written in Python. The number of residues residing in disordered regions was counted and the significance of the difference between the hub classes was determined by sampling as above.
Supplemental Methods: Gene Ontology Similarity Between Hubs and their Interactors
Semantic similarity between protein pairs was calculated by combining the similarity scores between the GO terms annotated to each protein. Lin-GraSM similarity measures were used to compute GO term similarity45. These measures are based on the concept of information content (IC), which was calculated for each term according to the expression:
$$IC_c = -\log_2(f_c)$$
where $f_c$ is the frequency with which the term c is annotated within the UniProt database. The IC values were normalized by dividing by the scale maximum. Lin-GraSM similarity between two terms is given by a ratio between the terms' average IC and that of their disjunctive common ancestors:
All terms of the first protein are paired with each term of the second one, and all similarity scores are used to produce an average:
$$SSW_{AVG} = \operatorname{Avg}_{i,j}\left[\mathrm{sim}(\mathrm{term}_i, \mathrm{term}_j)\right]$$
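The information content and the all-pairs averaging can be sketched in Python as below. The sketch assumes that "scale maximum" means the largest IC among the terms considered, and it takes the term-to-term similarity as a caller-supplied function; the placeholder `toy_sim` is not the Lin-GraSM measure, and the term labels and frequencies are invented.

```python
import math

def information_content(freq):
    """IC of a term annotated with relative frequency `freq` (0 < freq <= 1)."""
    return -math.log2(freq)

def normalised_ic(freqs):
    """IC of each term divided by the largest IC in the set (assumed scale maximum)."""
    ic = {t: information_content(f) for t, f in freqs.items()}
    top = max(ic.values())
    return {t: v / top for t, v in ic.items()}

def average_similarity(terms_a, terms_b, sim):
    """Average of sim(term_i, term_j) over every pair of terms of the two proteins."""
    scores = [sim(a, b) for a in terms_a for b in terms_b]
    return sum(scores) / len(scores)

def toy_sim(a, b):
    """Placeholder similarity used only to exercise the averaging step."""
    return 1.0 if a == b else 0.5

ic = normalised_ic({"t1": 0.5, "t2": 0.25})
score = average_similarity(["t1", "t2"], ["t1"], toy_sim)
```

Any term similarity returning values on a common scale can be substituted for `toy_sim` without changing the averaging step.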
Supplemental Methods: Distribution of Hub Types by Human Disease Phenotypes
Entries in OMIM24 for each hub were extracted and subsequently manually curated for 1) hubs associated with cancer, malignancy or metastasis and 2) hubs found to be involved in oncogenic translocation fusions. Equally, hubs were extracted from the census of cancer genes25. Hubs associated with cancer were normalized for the frequency of each hub type, and significant differences in the distribution of hubs between cancer and non-cancer genes were determined by Fisher's exact test.
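For illustration, a two-sided Fisher's exact test on a 2×2 table of hub type against cancer association can be computed directly from the hypergeometric distribution, as in the following Python sketch. The function name and the counts are invented for the example and are not results of the study.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the probabilities of all tables with the same margins that are
    no more probable than the observed table.
    """
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    return sum(prob(x) for x in range(lo, hi + 1)
               if prob(x) <= p_obs * (1 + 1e-9))

# Invented counts: rows = intermodular / intramodular hubs,
# columns = cancer-associated / not cancer-associated.
p = fisher_exact_two_sided(3, 1, 1, 3)
```

The small relative tolerance in the comparison guards against floating-point ties between equally probable tables.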
Supplemental Methods: Network Analysis of Breast Tumour Samples
To determine the hubs that significantly discriminate between patients who are alive without disease and those who are dead of disease, a non-parametric test was established. First, the original patient data26 was filtered to remove patients who were alive with disease (patients who had metastases but had not died from breast cancer at the last time of follow-up) and patients whose death could not be confirmed to be from disease (patients who died without metastases). This filtering resulted in a cohort of 255 patients (from 296 in the original study26), 181 alive without disease and 74 dead of disease. The expression data were median-centered and expression values were matched with the protein-protein interaction data by mapping to NCBI geneID. Each hub was assessed for the difference of the PCC of each interaction by the following equation:
where I and H denote the expression of an interactor and a hub respectively and A is the group of patients who are alive without disease whereas D is the group of patients who died of disease. The summations are over the number nA or nD of patients in each group, and sIAsHA and sIDsHD are the products of the standard deviations of the hub and the interactor expression for the alive and dead groups respectively. The average of the absolute value of rA,D for the hub and each of its interactors is given by:
where n is the number of interactors for a given hub. This metric gives an estimate of the difference in correlation of each interaction around a hub between the two groups (alive without disease vs. dead of disease). To determine whether the deviation in correlation between the two groups is significant, patients were randomly reassigned to the two groups 1000 times and the AverageHubDiff was recalculated each time. The p-value of each hub was then given as the number of random reassignments for which the random AverageHubDiff was greater than the real AverageHubDiff, divided by 1000.
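The permutation test described above can be sketched in Python as follows. The sketch assumes that expression values are stored per gene as lists indexed by patient, and that the per-interaction difference is the PCC in the alive group minus the PCC in the dead group; function names and toy values are hypothetical.

```python
import math
import random

def pearson(x, y):
    """Standard Pearson correlation; assumes non-constant inputs."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def average_hub_diff(hub, interactors, expr, alive, dead):
    """Mean over interactors of |PCC in alive group - PCC in dead group|."""
    diffs = []
    for i in interactors:
        r_a = pearson([expr[i][p] for p in alive], [expr[hub][p] for p in alive])
        r_d = pearson([expr[i][p] for p in dead], [expr[hub][p] for p in dead])
        diffs.append(abs(r_a - r_d))
    return sum(diffs) / len(diffs)

def permutation_p_value(hub, interactors, expr, alive, dead, n_perm=1000, seed=0):
    """Real AverageHubDiff and the fraction of random regroupings exceeding it."""
    rng = random.Random(seed)
    real = average_hub_diff(hub, interactors, expr, alive, dead)
    patients = list(alive) + list(dead)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(patients)  # random reassignment, group sizes preserved
        a, d = patients[:len(alive)], patients[len(alive):]
        if average_hub_diff(hub, interactors, expr, a, d) > real:
            exceed += 1
    return real, exceed / n_perm

# Toy data: the partner tracks the hub in patients 0-3 and opposes it in 4-7.
expr = {"HUB": [1, 2, 3, 4, 1, 2, 3, 4], "P1": [1, 2, 3, 4, 4, 3, 2, 1]}
real, p_value = permutation_p_value("HUB", ["P1"], expr, [0, 1, 2, 3], [4, 5, 6, 7])
```

In the toy data the correlation flips from +1 to −1 between the groups, so the real AverageHubDiff is at its maximum of 2 and random regroupings rarely exceed it.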
To evaluate whether the genes in the significant hubs have been previously implicated in breast cancer pathology, the number of publications for the included hubs was examined by searching the PubMed database using the NCBI gene name and “breast cancer”. This measure was corrected for the total number of publications by searching the PubMed database for the NCBI gene name of each included hub alone. The ratio of breast cancer publications to total publications for the included hubs was compared against the same ratio for an equivalent number of excluded hubs (hubs with P≧0.91), thereby assessing prevalence in the breast cancer literature while controlling for the total number of publications for those genes.
Supplemental Methods: Assessment of Individual Patients
To evaluate the dynamic network properties of each significant hub in each patient the algorithm was adapted to measure the difference in co-expression of the hubs and each interactor of those hubs found to be significantly different between patients dead of disease and alive without disease using the following equation:
$$\mathrm{InteractionDiff} = I_n - H$$
where the difference is taken of the expression of each of n interactors, $I_n$, from each significant hub, H, and all significant hubs are evaluated.
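A minimal Python sketch of this per-patient feature construction is shown below. The function name `interaction_diffs`, the gene labels and the expression values are assumptions made for illustration only.

```python
def interaction_diffs(patient_expr, significant_hubs):
    """Interactor expression minus hub expression for every hub-interactor pair.

    `patient_expr` maps gene -> expression value in one patient;
    `significant_hubs` maps hub -> list of its interactors.
    """
    features = {}
    for hub, interactors in significant_hubs.items():
        for i in interactors:
            features[(hub, i)] = patient_expr[i] - patient_expr[hub]
    return features

# One hypothetical patient with one significant hub and two interactors.
patient = {"HUB": 0.2, "P1": 0.5, "P2": -0.1}
hubs = {"HUB": ["P1", "P2"]}
feats = interaction_diffs(patient, hubs)
```

The resulting values, taken in a fixed pair order across patients, form the per-patient vectors supplied to the clustering step.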
Patient data were then clustered using the affinity propagation44 algorithm, with the set of expression differences of significant hubs and their interactors as inputs, in a 5-fold cross-validation strategy. Briefly, the patients were randomly assigned to five approximately equal groups. Four of the five groups were used to train the algorithm, including hub selection and affinity propagation clustering of the training set. The test group was then clustered using the training set probability groups. The performance of the algorithm at correctly categorizing the test set patients was evaluated by plotting the sensitivity against 1 − specificity at all possible probability cut-offs. To determine which cut-offs should be used for hub degree (k) and for the p-value of significant hubs, 3 runs of 5-fold cross-validation were performed at several p-value cut-offs and degree cut-offs. To evaluate which p-value cut-off to use for selecting hubs for clustering, the algorithm performance was assessed across an array of p-value cut-offs and degree cut-offs (
For generation of Kaplan-Meier curves, patients were assigned a prognosis probability based on the frequency of training set patients in each cluster who were alive without disease or dead of disease. Patients with a probability of poor outcome of >0.4 were assigned to the poor prognosis group, as this cut-off consistently resulted in the highest predictive performance. The prognosis probabilities were further tested in binary logistic regression models with other clinical covariates including tumour grade, tumour size, number of positive lymph nodes and patient age to control for differences in tumour sample at the time of excision. Cut-offs for the regression equation were evaluated and the one giving the highest accuracy of prediction was used (probability >0.4).
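The assignment of a prognosis probability from cluster composition can be sketched in Python as follows, assuming cluster labels are already available for training and test patients. The function names, the outcome labels "dead" and "alive", and the patient identifiers are hypothetical.

```python
def cluster_poor_outcome_probability(train_clusters, train_outcomes):
    """Per cluster, the fraction of training patients who died of disease."""
    totals, dead = {}, {}
    for patient, cluster in train_clusters.items():
        totals[cluster] = totals.get(cluster, 0) + 1
        if train_outcomes[patient] == "dead":
            dead[cluster] = dead.get(cluster, 0) + 1
    return {c: dead.get(c, 0) / totals[c] for c in totals}

def assign_prognosis(test_clusters, cluster_prob, cutoff=0.4):
    """Label a test patient 'poor' when its cluster probability exceeds the cutoff."""
    return {p: ("poor" if cluster_prob[c] > cutoff else "good")
            for p, c in test_clusters.items()}

# Invented training clusters and outcomes.
train_clusters = {"t1": 0, "t2": 0, "t3": 1, "t4": 1, "t5": 1}
train_outcomes = {"t1": "dead", "t2": "alive", "t3": "alive",
                  "t4": "alive", "t5": "alive"}
prob = cluster_poor_outcome_probability(train_clusters, train_outcomes)
groups = assign_prognosis({"x": 0, "y": 1}, prob)
```

The default cutoff of 0.4 follows the value stated above; other cutoffs can be passed to trace out the sensitivity and specificity trade-off.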
To investigate global alterations in interactome assembly, it was first sought to determine if biological context manifested by changes in gene expression affect the structure of the interactome. To do so, genome-wide expression data taken from 79 human tissues8 with a large set of hub proteins (defined as proteins having 5 or more interacting partners) taken from both literature-curated and high throughput (HTP) sources9 (
Modular structure in interactomes has been proposed to confer higher order function to the network, such that intermodular hubs provide for temporally and spatially restricted linkages to intramodular hubs that in turn fulfill specific functions, often as multi-subunit macromolecular machines14, 15. For example, most components of the 26S proteasome show highly correlated expression, and function together to mediate protein degradation (
Intermodular hubs, by providing dynamic structure to modular interactomes, have also been proposed to be critical for global network connectivity and regulation. To test this in the human network, the interactome was attacked by removing either intermodular hubs or intramodular hubs in descending order of clustering coefficient and betweenness of the resulting network was analyzed19. Betweenness is a measure of information flow through networks, with high betweenness reflecting multiple paths between all nodes and low betweenness few pathways connecting network nodes. Betweenness also measures the centrality of a node in a network thus expressing its importance as an intersection between all parts of the network. In a biological framework betweenness measures how functional complexes communicate with each other. In the human interactome, selective removal of intermodular hubs resulted in rapid decay of betweenness in the network when compared to removal of intramodular hubs (
The full compendium of human interactions is not known, leading to the suggestion that topological features such as modularity may be artefacts of analyzing incomplete datasets20. Although analysis of three different datasets of human interactions all revealed evidence of modularity, it was sought to assess whether there were distinct biochemical and genetic features that might distinguish hub types. On average, intermodular hub proteins have a greater amino acid sequence length than intramodular hub proteins (Mann-Whitney U-test, P<0.005,
Next the types of domains present in intermodular or intramodular hubs were explored. Domains associated with cell signaling (as defined in the SMART Database22) were found to be significantly enriched in intermodular hubs (binomial sign test, P<0.001), compared to non-signaling domains, which are evenly distributed between the hub types (
To explore this organization the well-characterized RAS subnetwork was examined. This revealed RAS to be an intramodular hub, with most of its highly correlated partners representative of regulators of RAS activity, such as RALGDS and SOS (
Disturbance of Network Modularity is Associated with Breast Cancer Outcome
The analysis of the human interactome suggests that intermodular hubs are enriched for signaling domains and control global connectivity and information flow within the network (for example, betweenness and CPL). During oncogenic transformation rewiring of signaling networks has been proposed to drive the phenotypic alterations associated with tumour progression whilst maintaining the robust features of the network14. Given the key role of intermodular hubs in coordinating signaling within the interactome, it was considered whether there are differences in the association of hub type with cancer by querying the OMIM24 for association of intermodular and intramodular hubs with cancer. This revealed that mutations in intermodular hubs were associated with cancer phenotypes more frequently than intramodular hubs (Fisher's exact test, P<0.05,
To examine whether transitions in hub status (i.e. alterations in modularity) are associated with poor prognosis in cancer, a well-described cohort of sporadic breast cancer patients26 was used. Significant differences in the average PCC of hubs and their interacting partners were first sought between patients who were disease-free after extended follow-up and those who died of disease. This revealed 256 hubs that displayed significantly altered PCC as a function of disease outcome. One of the hubs identified in this analysis was BRCA1, which is mutated in a subset of familial breast cancers. Analysis of BRCA1 modularity revealed high correlation of co-expression with its partners in tumours with good outcome, compared to reduced correlation in poor outcomes (
Next, protein interactions between all the significant hubs identified in this analysis were examined. This uncovered a highly inter-connected “circuit” that contains many hub proteins known to be important for the pathogenesis of breast malignancies (
The inventors determined that the altered dynamic network modularity that was identified provides a prognostic signature in breast cancer patient tumour samples. To develop an algorithm to assess hub behaviour in individual patients, the relative expression of hubs with each of their interacting partners was taken. The hubs that were significantly different between patients who survived and those who died from disease were then identified. In turn, the relative expression for hubs and their partners was used in an affinity propagation clustering algorithm to generate a probability of poor prognosis for each patient. The algorithm was employed in a 5-fold cross-validation strategy in which 4/5 of the patient data was randomly selected as a training set with subsequent testing on the hold-out set. In this strategy, the hub selection process was performed on the training set within the cross-validation loop to avoid over-fitting. Triplicate runs were performed using three different randomized test sets and the average performance was analyzed using receiver operating characteristic (ROC) curves. This revealed a typical area under the curve (AUC) value of 0.711 (
Efforts to map the human protein-protein network are in their infancy and current physical maps likely reflect only a small fraction of the full interactome. Therefore, assay performance was assessed as a function of interactome complexity, by analyzing networks in which hubs were randomly removed. This revealed that removal of hubs reduced assay performance (
The “poor outcome” probabilities were used next to group patients into two prognostic groups. Probability of prognosis was set at greater than or equal to 0.4 since at this cut off the algorithm consistently yielded the highest accuracy of prediction. Analysis of these two groups revealed the 5-year survival was significantly different (Mantel-Cox Log Rank test, nominal P<0.001) with only 44% of patients possessing the poor prognosis modularity signature expected to survive disease free for more than 5 years (
Finally, the cross-validation analysis was repeated using a separate cohort of breast cancer patients (TransBIG33). Strikingly, the algorithm showed comparable, if not improved, performance compared to the original breast cancer patient cohort (AUC 0.718-0.827;
A study has been conducted utilizing the fractal nature of the human protein-protein interaction network. Previous examinations of real world networks revealed that many complex networks display fractal behavior. The networks are self similar regardless of scale. To determine if the human protein-protein interaction network is indeed fractal, published methods47 were applied.
The 3 conditions that are required to be satisfied to define a fractal network were met with the human protein-protein interaction network identified in Example 1. Those conditions are:
(1) The number of boxes needed to cover the original network, the skeleton, and the Random Spanning Tree (RST) exhibits a power-law relationship to the size of the box. A skeleton network is a network that has been trimmed of many edges but retains the edges with the highest betweenness centrality. A random spanning tree (RST) is also a network trimmed of many edges but, unlike the skeleton, no choice is made with regard to the edges that remain, as long as all the nodes can be connected to the network via the remaining edges.
(2) The number of boxes needed to cover the original and the skeleton is almost the same.
(3) The fractal dimension (power coefficient of the best fitting power function) of the Random Spanning Tree (RST) is almost the same as the fractal dimension of the original network.
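As a sketch of how the power coefficient in conditions (1) and (3) might be estimated, the following Python code fits a straight line to the logarithm of the box count against the logarithm of the box size. The box counts are synthetic, and the 10% tolerance used for "almost the same" is an assumption of this sketch, not a value from the cited methods.

```python
import math

def fractal_dimension(box_sizes, box_counts):
    """Negative least-squares slope of log(box count) against log(box size)."""
    xs = [math.log(size) for size in box_sizes]
    ys = [math.log(count) for count in box_counts]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return -slope

def almost_same(a, b, rel_tol=0.1):
    """Assumed criterion for two fractal dimensions being 'almost the same'."""
    return math.isclose(a, b, rel_tol=rel_tol)

# Synthetic box counts following N = 1024 * size**-2.
sizes = [1, 2, 4, 8]
d_original = fractal_dimension(sizes, [1024, 256, 64, 16])
d_rst = fractal_dimension(sizes, [1000, 260, 62, 17])
similar = almost_same(d_original, d_rst)
```

A network whose box counts do not fall on a straight line in log-log coordinates would fail condition (1) regardless of the fitted slope, so goodness of fit should be inspected alongside the coefficient.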
Furthermore, synthetic networks with properties similar to, but deliberately different from, those of the real human interaction network did not display fractal properties as defined above. For example, one such synthetic network had an equivalent number of nodes but a Gaussian rather than scale-free distribution of node degrees.
The human interaction network that was previously used with the prediction algorithm was found to display fractal properties. Thus, it was hypothesized that other self-similar subnetworks (i.e., the skeleton network or RST) are sufficient to predict the outcome of the breast cancer patients using the algorithm described herein. Therefore, the previously described algorithm (i.e. Example 1) was applied. Instead of using the full interaction network, subset networks of the RST or skeleton were used. Based on measuring the area under the receiver operating characteristic curve of the 5-fold cross-validation runs, the predictive power of the algorithm was equivalent whether the whole network or the skeleton network was used. This suggests that the information contained within the whole network is embedded in the simplified skeleton network. Conversely, when the RST was used as the interaction network data, the predictive power was greatly reduced. This suggests that the power necessary for making predictions on biological outcome (e.g., breast cancer patient outcome) is lost when the whole network is trimmed using an RST.
This example suggests that instead of using the whole human interaction network to perform the prediction described in previous iterations of the algorithm as in Example 1, the method can be performed with similar accuracy and provide the same predictions simply by use of the skeleton network.
An example of computer code useful to implement the methods described herein is reproduced below:
APCLUSTER uses affinity propagation (Frey and Dueck, Science, 2007) to identify data clusters, using a set of real-valued pair-wise data point similarities as input. Each cluster is represented by a data point called a cluster center, and the method searches for clusters so as to maximize a fitness function called net similarity. The method is iterative and stops after maxits iterations (default of 500; see below for how to change this value) or when the cluster centers stay constant for convits iterations (default of 50). The command apcluster(s,p,'plot') can be used to plot the net similarity during operation of the algorithm.
For N data points, there may be as many as N^2−N pair-wise similarities (note that the similarity of data point i to k need not be equal to the similarity of data point k to i). These may be passed to APCLUSTER in an N×N matrix s, where s(i,k) is the similarity of point i to point k. In fact, only a smaller number of relevant similarities are needed for APCLUSTER to work. If only M similarity values are known, where M<N^2−N, they can be passed to APCLUSTER in an M×3 matrix s, where each row of s contains a pair of data point indices and a corresponding similarity value: s(j,3) is the similarity of data point s(j,1) to data point s(j,2).
APCLUSTER automatically determines the number of clusters, based on the input p, which is an N×1 matrix of real numbers called preferences. p(i) indicates the preference that data point i be chosen as a cluster center. A good choice is to set all preference values to the median of the similarity values. The number of identified clusters can be increased or decreased by changing this value accordingly. If p is a scalar, APCLUSTER assumes all preferences are equal to p. The fitness function (net similarity) used to search for solutions equals the sum of the preferences of the data centers plus the sum of the similarities of the other data points to their data centers. The identified cluster centers and the assignments of other data points to these centers are returned in idx. idx(j) is the index of the data point that is the cluster center for data point j. If idx(j) equals j, then point j is itself a cluster center. The sum of the similarities of the data points to their cluster centers is returned in dpsim, the sum of the preferences of the identified cluster centers is returned in expref and the net similarity (sum of the data point similarities and preferences) is returned in netsim.
A specific example of this code is illustrated below:
[idx,netsim,dpsim,expref]=apcluster(s,p,'NAME',VALUE,...)
The following parameters can be set by providing name-value pairs, e.g., apcluster(s,p,'maxits',1000):
This code is copyrighted by Brendan J. Frey and Delbert Dueck (2006).
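The quantities returned by the clustering (dpsim, expref and netsim) can be illustrated independently of the APCLUSTER code. The following Python sketch is not the Frey and Dueck implementation; it only evaluates the fitness function for a given assignment, uses 0-based indices rather than the 1-based indices of the original, and its function name and toy values are assumptions.

```python
def net_similarity(s, p, idx):
    """Return (netsim, dpsim, expref) for cluster assignments `idx`.

    s[j][k] is the similarity of point j to point k, p[k] is the preference
    of point k, and idx[j] is the cluster center assigned to point j.
    """
    n = len(idx)
    centers = sorted(set(idx))
    # Similarities of non-center points to their assigned centers.
    dpsim = sum(s[j][idx[j]] for j in range(n) if idx[j] != j)
    # Preferences of the points chosen as centers.
    expref = sum(p[k] for k in centers)
    return dpsim + expref, dpsim, expref

# Three points, all assigned to point 1 as the single cluster center.
s = [[0, -1, -4],
     [-1, 0, -2],
     [-4, -2, 0]]
p = [-2, -2, -2]
netsim, dpsim, expref = net_similarity(s, p, [1, 1, 1])
```

Lowering the preferences makes each additional cluster center more costly in netsim, which is why the preference value controls the number of clusters found.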
In summary, using dynamic network principles, specific alterations in the modularity of the human interactome that were associated with poor outcome in breast cancer were elucidated. Rather than defining a series of isolated hubs, it was found that most hubs identified in this analysis were components of an interconnected network that had modules associated with MAPK, Estrogen and DNA damage signaling, all of which have been implicated in breast cancer. The presence of these components in a dynamic network suggests they coordinate tumour activity related to poor outcome. Proteasome and RNA processing were the other two major modules identified in this network. Consistent with the notion that aberrant organization of modules is important in cancer progression, many components of the proteasome are associated with aberrant expression and copy number abnormalities (CNAs) in breast cancer tumours and cell lines39, 40. Moreover, low level CNA genes with significant dosage effects in breast cancer were found to be associated with RNA processing and metabolism40. These results suggest that alterations in the modularity of networks associated with cellular metabolism are important targets in breast cancer progression. The impact of altered modularity on breast cancer outcome defined in this study provides compelling impetus for the systematic development of multi-modal therapies aimed at targeting multiple nodes in this altered network, rather than individual hubs.
Employing a network modularity signature led to clustering of patients into prognostic groups more accurately than previous microarray investigations of breast cancer samples26. For example, in the current analysis the prognosis accuracy was 76.1%, compared to 64% accuracy in previous studies with the same patient sample26. The positive predictive value of the analysis is 81.25%, with a sensitivity of 86.1%. This increase in accuracy was not restricted to the optimized cutoffs employed during clustering (p≦0.09 and k≧7), as similar increases in prognostic accuracy (73.3%) were observed for naïve settings (k≧5 and p≦0.05), suggesting that the parameters have not been overfit. Indeed, analysis of a distinct cohort revealed similar, if not enhanced, performance. The favourable performance of the classification algorithms further suggests that changes in network modularity are a defining feature of tumour phenotype that, in turn, determines patient prognosis.
A network modularity signature was able to predict outcome in breast cancer without taking into consideration molecular subtype3. The molecular subtype signature may also be incorporated into the modularity analysis as well as other mechanisms controlling network dynamics, such as alterations in protein levels and phosphorylation-dependent changes in protein-protein interactions.
The present invention is not to be limited in scope by the specific embodiments described herein, since such embodiments are intended as but single illustrations of one aspect of the invention and any functionally equivalent embodiments are within the scope of this invention. Indeed, various modifications of the invention in addition to those shown and described herein will become apparent to those skilled in the art from the foregoing description and accompanying drawings. Such modifications are intended to fall within the scope of the appended claims.
All publications, patents and patent applications referred to herein, as well as priority document U.S. Provisional Patent Application No. 61/104,328, International Patent Application No. PCT/CA2009/001449, and parent U.S. patent application Ser. No. 13/123,138 are incorporated by reference in their entirety to the same extent as if each individual publication, patent or patent application was specifically and individually indicated to be incorporated by reference in its entirety. All publications, patents and patent applications mentioned herein are incorporated herein by reference for the purpose of describing and disclosing the methodologies, reagents, etc. which are reported therein which might be used in connection with the invention. Nothing herein is to be construed as an admission that the invention is not entitled to antedate such disclosure by virtue of prior invention.
This application is a continuation of pending U.S. patent application Ser. No. 13/123,138, filed Jul. 5, 2011, which is a 371 of International Patent Application No. PCT/CA2009/001449, filed Oct. 9, 2009 (now expired), which claims the benefit of U.S. Provisional Application No. 61/104,328, filed Oct. 10, 2008 (now expired).
Number | Date | Country
---|---|---
61104328 | Oct 2008 | US

Relation | Number | Date | Country
---|---|---|---
Parent | 13123138 | Jul 2011 | US
Child | 14747820 | | US