The present invention relates generally to an array of nucleic acid molecules, the expression profiles of which characterise the anatomical origin of a cell or population of cells within the large intestine. More particularly, the present invention relates to an array of nucleic acid molecules, the expression profiles of which characterise the proximal or distal origin of a cell or population of cells within the large intestine. The expression profiles of the present invention are useful in a range of applications including, but not limited to, determining the anatomical origin of a cell or population of cells which have been derived from the large intestine. Still further, since the progression of a normal cell towards a neoplastic state is often characterised by phenotypic de-differentiation, the method of the present invention also provides a means of identifying a cellular abnormality based on the expression of an incorrect expression profile relative to that which should be expressed by the subject cells when considered in light of their anatomical location within the colon. Accordingly, this aspect of the invention provides a valuable means of identifying the existence of large intestine colon cells, these being indicative of an abnormality within the large intestine such as the onset or predisposition to the onset of a condition such as a colorectal neoplasm.
Bibliographic details of the publications referred to by author in this specification are collected alphabetically at the end of the description.
The reference to any prior art in this specification is not, and should not be taken as, an acknowledgment or any form of suggestion that that prior art forms part of the common general knowledge in Australia.
Adenomas are benign tumours of epithelial origin which are derived from glandular tissue or exhibit clearly defined glandular structures. Some adenomas show recognisable tissue elements, such as fibrous tissue (fibroadenomas), while others, such as bronchial adenomas, produce active compounds giving rise to clinical syndromes. Tumours in certain organs, including the pituitary gland, are often classified by their histological staining affinities, for example eosinophil, basophil and chromophobe adenomas.
Adenomas may become malignant and are then termed adenocarcinomas. Accordingly, adenocarcinomas are defined as malignant epithelial tumours arising from glandular structures, which are constituent parts of most organs of the body. This term is also applied to tumours showing a glandular growth pattern. These tumours may be sub-classified according to the substances that they produce, for example mucus secreting and serous adenocarcinomas, or to the microscopic arrangement of their cells into patterns, for example papillary and follicular adenocarcinomas. These carcinomas may be solid or cystic (cystadenocarcinomas). Each organ may produce tumours showing a variety of histological types, for example the ovary may produce both mucinous and cystadenocarcinoma. In general, the overall incidence of carcinoma within an adenoma is approximately 5%. However, this is related to size and although it is rare in adenomas of less than 1 centimetre, it is estimated at 40 to 50% among villous lesions which are greater than 4 centimetres. Adenomas with higher degrees of dysplasia have a higher incidence of carcinoma. Once a sporadic adenoma has developed, the chance of a new adenoma occurring is approximately 30% within 26 months.
Colorectal adenomas represent a class of adenomas which are exhibiting an increasing incidence, particularly in more affluent countries. The causes of adenoma, and its shift to adenocarcinoma, are still the subject of intensive research. To date it has been speculated that in addition to genetic predisposition, environmental factors (such as diet) play a role in the development of this condition. Most studies indicate that the relevant environmental factors relate to high dietary fat, low fibre and high refined carbohydrates.
Colonic adenomas are localised proliferations of dysplastic epithelium which are initially flat. They are classified by their gross appearance as either sessile (flat) or pedunculated (having a stalk). While small adenomas (less than 0.5 millimetres) exhibit a smooth tan surface, pedunculated adenomas have a head with a cobblestone or lobulated red-brown surface. Sessile adenomas exhibit a more delicate villous surface. Pedunculated adenomas are more likely to be tubular or tubulovillous while sessile lesions are more likely to be villous. Sessile adenomas are most common in the cecum and rectum while overall pedunculated adenomas are equally split between the sigmoid-rectum and the remainder of the large intestine.
Adenomas are generally asymptomatic, therefore rendering difficult their early diagnosis and treatment. It is technically impossible to predict the presence or absence of carcinoma based on the gross appearance of adenomas, although larger adenomas are thought to exhibit a higher incidence of concurrent malignancy than smaller adenomas. Sessile adenomas exhibit a higher incidence of malignancy than pedunculated adenomas of the same size. Some adenomas result in the production of microscopic stool blood loss. However, since stool blood can also be indicative of non-adenomatous conditions and obstructive symptoms are generally not observed in the absence of malignant change, the accurate diagnosis of adenoma is rendered difficult without the application of highly invasive procedures such as biopsy analysis. Accordingly, there is an on-going need to elucidate not only the causes of adenoma and its shift to malignancy but to develop more informative diagnostic protocols, in particular protocols which will enable the rapid, routine and accurate diagnosis of adenoma and adenocarcinoma at an early stage, such as the pre-malignant stage. To this end, studies of colorectal adenocarcinoma have suggested a variable incidence, histopathology and prognosis between proximal and distal tumours.
In terms of pursuing this line of investigation, the advent of gene expression profiling has led to an improved understanding of intestinal mucosa development. For example, regulation of transcription factors involved in producing and maintaining the radial-axis balance from the crypt base to the lumen and those giving rise to epithelial cell differentiation are now better understood as a result of microarray gene expression analysis. [Peifer, 2002, Nature 420: 274-5, 277; Traber, 1999, Adv Exp Med Biol 470:1-14]. Similarly, understanding has improved of the developmentally programmed genetic events within the embryonic gut, especially those molecular control mechanisms responsible for regional epithelium differences between the small intestine and large intestine. [de Santa Barbara et al., 2003, Cell Mol Life Sci 60:1322-1332; Park et al., 2005, Genesis 41:1-12]. On the other hand, little is known about the proximal-distal gene expression variation along the longitudinal axis of the large intestine. [Bates et al., 2002, Gastroenterology 122:1467-1482]. Epidemiologic studies of colorectal adenocarcinoma suggest support for variable incidence, histopathology, and prognosis between proximal and distal tumours. [Bonithon-Kopp and Benhamiche, 1999, Eur J Cancer Prev 8 Suppl 1:S3-12; Bufill, 1990, Ann Intern Med 113:779-788; Deng et al., 2002, Br J Cancer 86:574-579; Distler and Holt, 1997, Dig Dis 15:302-311]. Thus an understanding of location-specific variation could provide valuable insight into those diseases that have characteristic distribution patterns along the colorectum, including colorectal cancer. [Birkenkamp-Demtroder et al., 2005, Gut 54:374-384; Caldero et al., 1989, Virchows Arch A Pathol Anat Histopathol 415:347-356; Garcia-Hirschfeld Garcia et al., 1999, Rev Esp Enferm Dig 91:481-488].
The colorectum (also termed the large intestine) is often divided for clinical convenience into six anatomical regions starting from the terminal region of the ileum: the cecum; the ascending colon; the transverse colon; the descending colon; the sigmoid colon; and the rectum. Alternatively, these segments may be grouped to divide the large intestine into a two region model comprising the proximal and distal large intestine. The proximal (“right”) region is generally taken to include the cecum, ascending colon, and the transverse colon while the distal (“left”) region includes the splenic flexure, the descending colon, the sigmoid flexure and the rectum. This division is supported by the distinct embryonic ontogenesis of these regions whose junction is two thirds along the transverse colon and also by the distinct arterial supply to each region. While the proximal large intestine develops from the embryonic midgut and is supplied by the superior mesenteric artery, the distal large intestine forms from the embryonic hindgut and is supplied by the inferior mesenteric artery. [Yamada and Alpers, 2003, Textbook of Gastroenterology, 2 Vol. Set.] A comprehensive review of proximal-distal differences is provided in [Iacopetta, 2002, Int J Cancer 101:403-408].
In work leading up to the present invention it has been determined that a panel of genes are differentially expressed between the proximal and distal sections of the human large intestine. Accordingly, this has enabled the development of means for determining whether a large intestine derived cell of interest is of proximal origin or distal origin. Samples of normal large intestine derived cells or tissues can therefore be routinely characterised in terms of their anatomical origin within the large intestine. Still further, since most disease conditions are characterised by some change in phenotypic profile or gene transcription of the diseased cells, this being particularly true of cells which are predisposed to or have become neoplastic, the method of the present invention provides a convenient means of identifying abnormal cells or cells which are predisposed to becoming abnormal. More particularly, where a cell of known large intestine anatomical origin expresses one or more genes or profiles of genes which are not characteristic of that location, the cell is classified as abnormal and may then undergo further analysis to elucidate the nature of that abnormality.
Throughout this specification and the claims which follow, unless the context requires otherwise, the word “comprise”, and variations such as “comprises” and “comprising”, will be understood to imply the inclusion of a stated integer or step or group of integers or steps but not the exclusion of any other integer or step or group of integers or steps.
As used herein, the term “derived from” shall be taken to indicate that a particular integer or group of integers has originated from the species specified, but has not necessarily been obtained directly from the specified source. Further, as used herein the singular forms of “a”, “an” and “the” include plural referents unless the context clearly dictates otherwise.
Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
One aspect of the present invention is directed to a method for determining the anatomical origin of a cell or cellular population derived from the large intestine of an individual, said method comprising measuring the level of expression of one or more genes selected from:
In another aspect there is provided a method for determining the anatomical origin of a cell or cellular population derived from the large intestine of an individual, said method comprising measuring the level of expression of one or more genes selected from:
In another aspect, the present invention provides a method for determining the anatomical origin of a cell or cellular population derived from the large intestine of an individual, including:
The present invention also provides a detection method for determining the anatomical origin of a cell or cellular population derived from the large intestine of an individual, including:
Preferably, the step of accessing first expression data includes accessing third expression data of which said first expression data is a subset, and the method includes processing said third expression data to select a subset of the third expression data corresponding to a subset of genes differentially expressed either alone or in combination along the proximal-distal axis of said large intestine, the selected subset being said first expression data.
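By way of non-limiting illustration only, the selection of the first expression data from the third expression data may be sketched as follows. The sketch assumes that log2 expression values are available for reference samples of known proximal and distal origin, and it selects genes by a simple difference in mean expression. The selection criterion, the cut-off min_abs_diff, the function names and the placeholder gene identifiers G1 to G3 are all assumptions made for the purpose of the example and are not taken from this specification.

```python
from statistics import mean


def select_differential_genes(proximal, distal, min_abs_diff=1.0):
    """Return genes whose mean log2 expression differs between regions.

    proximal, distal: dicts mapping a gene identifier to a list of log2
    expression values measured in reference samples of that region.
    min_abs_diff is an assumed cut-off, not a value taken from the text.
    """
    selected = []
    for gene in proximal:
        diff = mean(proximal[gene]) - mean(distal[gene])
        if abs(diff) >= min_abs_diff:
            selected.append(gene)
    return sorted(selected)


def subset_expression(third_expression_data, genes):
    """Keep only the selected genes; the result plays the role of the
    'first expression data'."""
    return {gene: third_expression_data[gene] for gene in genes}


# Hypothetical reference measurements (log2 scale) for placeholder genes.
proximal_ref = {"G1": [8.0, 8.2, 7.8], "G2": [5.0, 5.1, 4.9], "G3": [3.0, 3.2, 2.8]}
distal_ref = {"G1": [5.0, 5.2, 4.8], "G2": [5.0, 4.9, 5.1], "G3": [6.0, 6.1, 5.9]}

# Hypothetical full-profile measurement of one test sample.
third_expression_data = {"G1": 7.9, "G2": 5.0, "G3": 3.1}

selected_genes = select_differential_genes(proximal_ref, distal_ref)
first_expression_data = subset_expression(third_expression_data, selected_genes)
print(selected_genes)
print(first_expression_data)
```

In this sketch the gene G2, which shows essentially the same mean expression in both regions, is excluded from the first expression data. A statistical test could be substituted for the simple difference in means without changing the overall structure.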
The present invention also provides a method for determining the anatomical origin of a cell or cellular population derived from the large intestine of an individual, including:
The present invention also provides a detection method for determining the anatomical origin of a cell or cellular population derived from the large intestine of an individual, including:
The present invention also provides a method for determining the anatomical origin of a cell or cellular population derived from the large intestine of an individual, including:
The present invention also provides a method for determining the anatomical origin of a cell or cellular population derived from the large intestine of an individual, including:
The present invention also provides a detection system having components for executing any one of the above methods.
The present invention also provides a computer-readable storage medium having stored thereon program instructions for executing any one of the above methods.
The present invention also provides a detection system, including:
In another aspect there is provided a method of determining the onset or predisposition to the onset of a cellular abnormality or a condition characterised by a cellular abnormality in the large intestine, said method comprising determining, in accordance with one of the methods hereinbefore described, the proximal-distal gene expression profile of a biological sample derived from a known proximal or distal origin in the large intestine wherein the detection of a gene expression profile which is inconsistent with the normal proximal-distal large intestine gene expression profile is indicative of the abnormality of the cell or cellular population expressing said profile.
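By way of non-limiting illustration only, the comparison of a sample of known origin against the normal profile for that origin may be sketched as follows. The sketch assumes that a normal range has already been established for each marker in each region. The marker names MARKER_A and MARKER_B, the numerical ranges and the function name are hypothetical and are not taken from this specification.

```python
def inconsistent_markers(sample, known_origin, normal_ranges):
    """List markers whose level falls outside the normal range for the
    region from which the sample is known to have been taken."""
    expected = normal_ranges[known_origin]
    return sorted(
        gene
        for gene, (low, high) in expected.items()
        if not low <= sample[gene] <= high
    )


# Hypothetical normal ranges (arbitrary expression units) per region.
normal_ranges = {
    "proximal": {"MARKER_A": (80.0, 120.0), "MARKER_B": (5.0, 15.0)},
    "distal": {"MARKER_A": (10.0, 30.0), "MARKER_B": (60.0, 100.0)},
}

# Hypothetical biopsy known to have been taken from the proximal region.
biopsy = {"MARKER_A": 25.0, "MARKER_B": 10.0}

flagged = inconsistent_markers(biopsy, "proximal", normal_ranges)
is_abnormal = bool(flagged)  # any inconsistent marker flags the sample
print(flagged, is_abnormal)
```

In this sketch a single out-of-range marker is sufficient to flag the sample for further analysis. A stricter or more lenient rule, such as a minimum number of inconsistent markers, is equally possible.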
A related aspect of the present invention provides a nucleic acid array, which array comprises a plurality of:
a is a graphical representation of a typical example of the first and second principal components generated by applying principal component analysis (PCA) to all 44,928 probesets of the Discover data set, revealing little, if any, structure;
b is a graph of the first and second principal components generated by applying PCA to a subset of 115 probesets that are each differentially expressed in tissue samples from the cecum and rectum (i.e., the extreme proximal and distal ends of the large intestine), revealing two classes corresponding to the proximal and distal portions of the large intestine;
The present invention is predicated, in part, on the elucidation of gene expression profiles which characterise the anatomical origin of a cell or cellular population from the large intestine in terms of a proximal origin versus a distal origin. This finding has now facilitated the development of routine means of characterising, in terms of its anatomical origin, a cellular population derived from the large intestine. Still further, since some cellular disorders are characterised by a change in the gene expression profile of the diseased cell relative to a corresponding normal cell, the present invention also provides a means of routinely screening large intestine cells, which have been derived from a known anatomical location within the large intestine, for any changes to the gene expression profile which they would be expected to express based on that particular location. Where the correct gene expression profile is not observed, the cell is exhibiting an abnormality and should be further assessed by way of diagnosing the specifics of the abnormality. In particular, it would be appreciated by the person of skill in the art that neoplastic cells, or cells predisposed thereto, sometimes undergo de-differentiation, this being evidenced by a change to the gene expression phenotype of the cell to a less differentiated phenotype. Accordingly, any change to the gene expression profile characteristic of a large intestine cell of proximal or distal origin may be indicative of the onset or predisposition to the onset of a large intestine neoplasm, such as an adenoma or an adenocarcinoma. Also provided by the present invention are nucleic acid arrays, such as microarrays, for use in the method of the invention.
Accordingly, one aspect of the present invention is directed to a method for determining the anatomical origin of a cell or cellular population derived from the large intestine of an individual, said method comprising measuring the level of expression of one or more genes selected from:
As detailed hereinbefore, the method of the present invention is predicated on the determination that distal versus proximal location of a cell within the large intestine can now be ascertained by virtue of gene expression profiles which are unique to the cells of each of these locations. Accordingly, reference to determining the “anatomical origin” or “anatomical location” of a cell or cellular population “derived from the large intestine” should be understood as a reference to determining whether the cell in issue originates from the distal region of the large intestine or the proximal region of the large intestine. Further to this, by “origin” or “location” is meant the location of the cell or cells under investigation either just prior to the time that the cell was harvested from the large intestine or, where the cell has naturally detached from the large intestine (e.g. where it has sloughed off and is found in a stool sample), at the time immediately prior to the cell detaching from the large intestine. Without limiting the present invention to any one theory or mode of action, the large intestine has no digestive function, as such, but absorbs large amounts of water and electrolytes from the undigested food passed on from the small intestine. At regular intervals, peristaltic movements move the dehydrated contents (faeces) towards the rectum. For clinical convenience the large intestine is generally divided into six anatomical regions commencing after the terminal region of the ileum—these being:
These segments can also be grouped to divide the large intestine into a two region model comprising the proximal and distal large intestine. The proximal region is generally understood to include the cecum and ascending colon while the distal region includes the splenic flexure, the descending colon, the sigmoid flexure and the rectum. This division between the proximal and distal region of the large intestine is thought to occur approximately two thirds along the transverse colon. This division is supported by the distinct embryonic ontogenesis of these regions whose junction is two thirds along the transverse colon and also by the distinct arterial supply to each region. Accordingly, tissues of the transverse colon may be either proximal or distal depending on which side of this junction corresponds to their point of origin. It would be appreciated that although the method of the present invention may not necessarily indicate from which part of the proximal or distal large intestine a cell originated, it will provide valuable information in relation to whether the tissue is of proximal origin or distal origin. While the proximal large intestine develops from the embryonic midgut and is supplied by the superior mesenteric artery, the distal large intestine forms from the embryonic hindgut and is supplied by the inferior mesenteric artery.
Accordingly, reference to the “proximal” region of the large intestine should be understood as a reference to the section of the large intestine comprising the cecum and ascending colon, while reference to the “distal” region of the large intestine should be understood as a reference to the splenic flexure, descending colon, sigmoid flexure and rectum. The transverse colon comprises both proximal and distal regions, the relative proportions of which will depend on where the junction of the proximal and distal tissue occurs. Specifically, the tissue of the transverse colon can be from either the proximal or distal region depending on the relative distance between the hepatic and splenic flexures.
In accordance with the present invention, it has been determined that the genes detailed in paragraphs (i) and (ii), above, are modulated, in terms of differential changes to their levels of expression depending on whether the cell expressing that gene is located in the proximal region of the large intestine or the distal region of the large intestine. For ease of reference, these genes and their mRNA transcripts are depicted in italicised text while their protein expression products are depicted in non-italicised text. These genes are collectively referred to as “location markers”.
Each of the genes detailed in sub-paragraphs (i) and (ii), above, would be well known to the person of skill in the art, as would their encoded protein expression products. The identification of these genes as markers of colorectal (large intestine) cell location occurred by virtue of differential expression analysis using Affymetrix HG133A or HG133B gene chips. To this end, each gene chip is characterised by approximately 45,000 probe sets which detect the RNA transcribed from approximately 35,000 genes. On average, approximately 11 probe pairs detect overlapping or consecutive regions of the RNA transcript of a single gene. In general, the genes whose RNA transcripts are identifiable by the Affymetrix probes are well known and characterised genes. However, to the extent that some of the probes detect RNA transcripts which are not yet defined, these genes are indicated as “the gene or genes detected by Affymetrix probe x”. In some cases a number of genes may be detectable by a single probe. This is also indicated where appropriate. It should be understood, however, that this is not intended as a limitation as to how the expression level of the subject gene can be detected. In the first instance, it would be understood that the subject gene transcript is also detectable by other probes which would be present on the Affymetrix gene chip. The reference to a single probe is merely included as an identifier of the gene transcript of interest. In terms of actually screening for the transcript, however, one may utilise a probe directed to any region of the transcript and not just to the terminal 600 bp transcript region to which the Affymetrix probes are generally directed.
Reference to each of the genes detailed above and their transcribed and translated expression products should therefore be understood as a reference to all forms of these molecules and to fragments, mutants or variants thereof. As would be appreciated by the person of skill in the art, some genes are known to exhibit allelic variation between individuals. Accordingly, the present invention should be understood to extend to such variants which, in terms of the present diagnostic applications, achieve the same outcome despite the fact that minor genetic variants between the actual nucleic acid sequences may exist between individuals. The present invention should therefore be understood to extend to all RNA (eg mRNA, primary RNA transcript, miRNA, tRNA, rRNA etc), cDNA and peptide isoforms which arise from alternative splicing or any other mutation, polymorphic or allelic variation. It should also be understood to include reference to any subunit polypeptides such as precursor forms which may be generated, whether existing as a monomer, multimer, fusion protein or other complex.
Without limiting the present invention to any one theory or mode of action, although each of the genes hereinbefore described is differentially expressed, either singly or in combination, as between the cells of the distal and proximal large intestine, and is therefore diagnostic of the anatomical origin of any given cell sample, the expression of some of these genes exhibited particularly significant levels of sensitivity, specificity, positive predictive value and/or negative predictive value. Accordingly, in a preferred embodiment, one would screen for and assess the expression level of one or more of these genes.
The present invention therefore preferably provides a method for determining the anatomical origin of a cell or cellular population derived from the large intestine of an individual, said method comprising measuring the level of expression of one or more genes selected from:
Preferably, said genes are ETNK1 and/or GBA3 and/or PRAC.
The detection method of the present invention can be performed on any suitable biological sample. To this end, reference to a “biological sample” should be understood as a reference to any sample of biological material derived from an animal such as, but not limited to, cellular material, biofluids (eg. blood), faeces, tissue biopsy specimens, surgical specimens or fluid which has been introduced into the body of an animal and subsequently removed (such as, for example, the solution retrieved from an enema wash). The biological sample which is tested according to the method of the present invention may be tested directly or may require some form of treatment prior to testing. For example, a biopsy or surgical sample may require homogenisation prior to testing or it may require sectioning for in situ testing of the qualitative expression levels of individual genes.
Alternatively, a cell sample may require permeabilisation prior to testing. Further, to the extent that the biological sample is not in liquid form, (if such form is required for testing) it may require the addition of a reagent, such as a buffer, to mobilise the sample.
To the extent that the location marker gene is present in a biological sample, the biological sample may be directly tested or else all or some of the nucleic acid material present in the biological sample may be isolated prior to testing. In yet another example, the sample may be partially purified or otherwise enriched prior to analysis. For example, to the extent that a biological sample comprises a very diverse cell population, it may be desirable to enrich for a sub-population of particular interest. It is within the scope of the present invention for the target cell population or molecules derived therefrom to be pretreated prior to testing, for example, inactivation of live virus or being run on a gel. It should also be understood that the biological sample may be freshly harvested or it may have been stored (for example by freezing) prior to testing or otherwise treated prior to testing (such as by undergoing culturing).
The choice of what type of sample is most suitable for testing in accordance with the method disclosed herein will be dependent on the nature of the situation. Preferably, said sample is a faecal sample, enema wash, surgical resection or tissue biopsy.
As detailed hereinbefore, the present invention is designed to characterise a cell or cellular population, which is derived from the large intestine, in terms of its anatomical origin within the large intestine. Accordingly, reference to “cell or cellular population” should be understood as a reference to an individual cell or a group of cells. Said group of cells may be a diffuse population of cells, a cell suspension, an encapsulated population of cells or a population of cells which take the form of tissue.
Reference to “expression” should be understood as a reference to the transcription and/or translation of a nucleic acid molecule. In this regard, the present invention is exemplified with respect to screening for location markers taking the form of RNA transcripts (eg primary RNA, mRNA, miRNA, tRNA, rRNA). Reference to “RNA” should be understood to encompass reference to any form of RNA, such as primary RNA, mRNA, miRNA, tRNA or rRNA. Without limiting the present invention in any way, the modulation of gene transcription leading to increased or decreased RNA synthesis will also correlate with the translation of some of these RNA transcripts (such as mRNA) to produce an expression product. Accordingly, the present invention also extends to detection methodology which is directed to screening for modulated levels or patterns of expression of the location marker expression products as an indicator of the proximal or distal origin of a cell or cellular population. Although one method is to screen for mRNA transcripts and/or the corresponding protein expression product, it should be understood that the present invention is not limited in this regard and extends to screening for any other form of location marker such as, for example, a primary RNA transcript. It is well within the skill of the person of skill in the art to determine the most appropriate screening target for any given situation. Preferably, the protein expression product is the subject of analysis.
Reference to “nucleic acid molecule” should be understood as a reference to both deoxyribonucleic acid molecules and ribonucleic acid molecules. The present invention therefore extends to both directly screening for mRNA levels in a biological sample or screening for the complementary cDNA which has been reverse-transcribed from an mRNA population of interest. It is well within the skill of the person of skill in the art to design methodology directed to screening for either DNA or RNA. As detailed above, the method of the present invention also extends to screening for the protein expression product translated from the subject mRNA.
The method of the present invention is predicated on the correlation of the expression levels of the location markers of a biological sample with the normal proximal and distal levels of these markers. The “normal level” is the level of marker expressed by a cell or cellular population of proximal origin in the large intestine and the level of marker expressed by a cell or cellular population of distal origin. Accordingly, there are two normal level values which are relevant to the detection method of the present invention. It would be appreciated that these normal level values are calculated based on the expression levels of large intestine derived cells which do not exhibit an abnormality or predisposition to an abnormality which would alter the expression levels or patterns of these markers.
The normal level may be determined using tissues derived from the same individual who is the subject of testing. However, it would be appreciated that this may be quite invasive for the individual concerned and it is therefore likely to be more convenient to analyse the test results relative to a standard result which reflects individual or collective results obtained from healthy individuals, other than the patient in issue. This latter form of analysis is in fact the preferred method of analysis since it enables the design of kits which require the collection and analysis of a single biological sample, being a test sample of interest. The standard results which provide the proximal and distal normal reference levels may be calculated by any suitable means which would be well known to the person of skill in the art. For example, a population of normal tissues can be assessed in terms of the level of expression of the location markers of the present invention, thereby providing a standard value or range of values against which all future test samples are analysed. It should also be understood that the proximal and distal normal reference levels may be determined from the subjects of a specific cohort and for use with respect to test samples derived from that cohort. Accordingly, there may be determined a number of standard values or ranges which correspond to cohorts which differ in respect of characteristics such as age, gender, ethnicity or health status. Said “normal level” may be a discrete level or a range of levels. The results of biological samples which are tested are preferably assessed against both the proximal and distal normal reference levels. 
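By way of non-limiting illustration only, the calculation of proximal and distal normal reference ranges from a population of normal tissues may be sketched as follows. The sketch assumes that a range of the mean plus or minus two sample standard deviations is an acceptable definition of the normal range. That definition, the function names, the placeholder marker MARKER_X and the numerical values are assumptions made for the purpose of the example and are not taken from this specification.

```python
from statistics import mean, stdev


def reference_range(values, n_sd=2.0):
    """Return (low, high) as mean +/- n_sd sample standard deviations."""
    centre = mean(values)
    spread = stdev(values)
    return (centre - n_sd * spread, centre + n_sd * spread)


def build_reference(samples_by_region, n_sd=2.0):
    """Build a per-region, per-marker table of normal ranges.

    samples_by_region maps a region name to a list of samples, each
    sample being a dict of marker name to expression level.
    """
    reference = {}
    for region, samples in samples_by_region.items():
        markers = samples[0].keys()
        reference[region] = {
            marker: reference_range([s[marker] for s in samples], n_sd)
            for marker in markers
        }
    return reference


# Hypothetical measurements from healthy individuals (arbitrary units).
healthy = {
    "proximal": [{"MARKER_X": 10}, {"MARKER_X": 12}, {"MARKER_X": 14}],
    "distal": [{"MARKER_X": 30}, {"MARKER_X": 32}, {"MARKER_X": 34}],
}

reference = build_reference(healthy)
print(reference["proximal"]["MARKER_X"])
print(reference["distal"]["MARKER_X"])
```

Where cohort-specific reference levels are desired, build_reference could be called once for each cohort (for example, for each age band) and the resulting tables stored separately.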
An increase in the expression of the genes of group (i), hereinbefore defined, relative to normal distal levels is indicative of the test tissue being of proximal origin while an increase in the expression of the genes of group (ii), hereinbefore defined, relative to normal proximal levels is indicative of the tissue being of distal origin. It would also be appreciated, however, that one may also approach the defined correlative step by analysing the results which are obtained from the point of view of determining whether the result obtained is the same as a normal proximal or distal level, thereby indicating that the test sample is of the same origin as the normal reference level sample against which it has been assessed.
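The correlative step described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that each normal reference level is held as a (low, high) range and that a test value is assigned to whichever single range contains it. The function name `classify_origin` and the numeric ranges are hypothetical and are not taken from this specification.

```python
from typing import Tuple

Range = Tuple[float, float]


def classify_origin(level: float, proximal: Range, distal: Range) -> str:
    """Assign a test expression level to the normal reference range containing it.

    Returns "indeterminate" when the level falls in neither range or in both.
    """
    in_proximal = proximal[0] <= level <= proximal[1]
    in_distal = distal[0] <= level <= distal[1]
    if in_proximal and not in_distal:
        return "proximal"
    if in_distal and not in_proximal:
        return "distal"
    return "indeterminate"


# Hypothetical reference ranges (arbitrary expression units) for one marker.
NORMAL_PROXIMAL = (8.0, 10.0)
NORMAL_DISTAL = (4.0, 6.0)

print(classify_origin(9.1, NORMAL_PROXIMAL, NORMAL_DISTAL))
print(classify_origin(7.0, NORMAL_PROXIMAL, NORMAL_DISTAL))
```

A range-based comparison is only one way of expressing a "normal level"; a discrete level with a tolerance, or a cohort-specific range, would fit the same interface.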
It should be understood that the “individual” who is the subject of testing may be any primate. Preferably the primate is a human.
As detailed hereinbefore, it should be understood that although the present invention is exemplified with respect to the detection of nucleic acid molecules, it also encompasses methods of detection based on testing for the expression product of the subject location markers. The present invention should also be understood to mean methods of detection based on identifying either protein product or nucleic acid material in one or more biological samples. However, it should be understood that some of the location markers may correlate to genes or gene fragments which do not encode a protein expression product. Accordingly, to the extent that this occurs it would not be possible to test for an expression product and the subject marker must be assessed on the basis of nucleic acid expression profiles.
The term “protein” should be understood to encompass peptides, polypeptides and proteins. The protein may be glycosylated or unglycosylated and/or may contain a range of other molecules fused, linked, bound or otherwise associated to the protein such as amino acids, lipids, carbohydrates or other peptides, polypeptides or proteins. Reference herein to a “protein” includes a protein comprising a sequence of amino acids as well as a protein associated with other molecules such as amino acids, lipids, carbohydrates or other peptides, polypeptides or proteins.
The location marker proteins of the present invention may be in multimeric form meaning that two or more molecules are associated together. Where the same protein molecules are associated together, the complex is a homomultimer. An example of a homomultimer is a homodimer. Where at least one marker protein is associated with at least one non-marker protein, then the complex is a heteromultimer such as a heterodimer.
Reference to a “fragment” should be understood as a reference to a portion of the subject nucleic acid molecule. This is particularly relevant with respect to screening for modulated RNA levels in stool samples since the subject RNA is likely to have been degraded or otherwise fragmented due to the environment of the gut. One may therefore actually be detecting fragments of the subject RNA molecule, which fragments are identified by virtue of the use of a suitably specific probe.
In another aspect, the present invention provides a method for determining the anatomical origin of a cell or cellular population derived from the large intestine of an individual, including:
The present invention also provides a detection method for determining the anatomical origin of a cell or cellular population derived from the large intestine of an individual, including:
Preferably, the step of accessing first expression data includes accessing third expression data of which said first expression data is a subset and the method includes processing said third expression data to select a subset of the third expression data corresponding to a subset of genes differentially expressed either alone or in combination along the proximal-distal axis of said large intestine, the selected subset being said first expression data.
Preferably, the method includes processing said further expression data and said multivariate classification data to generate said proximal-distal origin data representing said proximal-distal origin.
Most preferably, the selected expression data corresponds to genes selected from:
The present invention also provides a method for determining the anatomical origin of a cell or cellular population derived from the large intestine of an individual, including:
Preferably, the method includes processing said second expression data and said classification data to generate proximal-distal origin data representing said location.
Preferably, said kernel method includes a support vector machine (SVM).
More preferably, said classification data is representative of genes selected from:
Still more preferably, said classification data is representative of a subset of 13 genes.
Most preferably, said 13 genes are
PRAC,
CCL11,
FRZB or the gene or genes detected by Affymetrix probe number: 203698_s_at,
GDF15 or the gene or genes detected by Affymetrix probe number: 221577_x_at,
CLDN8,
SEC6L1 or the gene or genes detected by Affymetrix probe number: 221577_x_at,
GBA3 or the gene or genes detected by Affymetrix probe number: 279954_s_at,
DEFA5,
SPINK5,
OSTalpha,
ANPEP or the gene or genes detected by Affymetrix probe number: 202888_s_at, and
MUC5.
The present invention also provides a detection method for determining the anatomical origin of a cell or cellular population derived from the large intestine of an individual, including:
Preferably, said step of accessing first expression data includes accessing third expression data of which said first expression data is a subset, and the method includes processing said third expression data to select a subset of the third selected expression data corresponding to a subset of genes differentially expressed along the proximal-distal axis of said at least one large intestine, the selected subset being said first expression data.
Preferably, the selected expression data corresponds to genes selected from:
The present invention also provides a method for determining the anatomical origin of a cell or cellular population derived from the large intestine of an individual, including:
Preferably, said canonical variate analysis includes profile analysis.
Preferably, said subset of genes includes genes selected from:
The present invention also provides a method for determining the anatomical origin of a cell or cellular population derived from the large intestine of an individual, including:
Advantageously, said processing may include processing said training data with GeneRave.
Preferably, said subset of genes includes genes selected from:
Advantageously, said subset of genes may include 7 genes.
Preferably, said 7 genes are SEC6L1, PRAC, SPINK5, SEC6L1, ANPEP, DEFA5, and CLDN8.
In another preferred embodiment, said subset of genes are one or more of the following subsets:
Reference to “proximal-distal origin” should be understood as a reference to cells or expression data of either a proximal origin or a distal origin. Reference to “cells or cellular subpopulations”, “large intestine”, “proximal”, “distal”, “origin”, “location”, “gene” and “expression” should be understood to have the same meaning as hereinbefore provided.
The present invention also provides a detection system having components for executing any one of the above methods.
The present invention also provides a computer-readable storage medium having stored thereon program instructions for executing any one of the above methods.
The present invention also provides a detection system, including:
As detailed hereinbefore, the method of the present invention is useful for identifying abnormal cells on the basis that a cell of distal or proximal origin which is not expressing the gene expression profile characteristic of that anatomical origin is exhibiting an abnormal expression profile and should therefore undergo further analysis to determine the full extent and nature of the subject abnormality. For example, some colorectal adenoma or adenocarcinoma cells may exhibit an incorrect proximal-distal large intestine expression profile due to the de-differentiation events which are characteristic of the neoplastic transformation of these cells.
Accordingly, in another aspect there is provided a method of determining the onset or predisposition to the onset of a cellular abnormality or a condition characterised by a cellular abnormality in the large intestine, said method comprising determining, in accordance with one of the methods hereinbefore described, the proximal-distal gene expression profile of a biological sample derived from a known proximal or distal origin in the large intestine wherein the detection of a gene expression profile which is inconsistent with the normal proximal-distal large intestine gene expression profile is indicative of the abnormality of the cell or cellular population expressing said profile.
Reference to “gene expression profile” should be understood as a reference to the univariate or multivariate gene expression results hereinbefore described. For example, the “profile” may correlate to the expression level of one or more marker genes as hereinbefore discussed or the result of the multivariate analysis of the genes and/or gene sets hereinbefore described. Accordingly, reference to “proximal-distal gene expression profile” is a reference to the gene expression profile characteristic of cells of proximal large intestine origin and that of cells of distal large intestine origin.
It would be appreciated that the cells which are the subject of analysis in the context of the present invention are of known proximal or distal origin. This information may be determined by any suitable method but is most conveniently satisfied by isolating the biological sample from a defined location in the large intestine via a biopsy. However, other suitable methods of harvesting or otherwise determining the anatomical origin of the biological sample are not excluded.
The abnormality of a cell or cellular population of the biological sample is based on the detection of a gene expression profile which is inconsistent with that of the profile which would normally characterise a cell of its particular proximal or distal origin. By “inconsistent” is meant that the expression level of one or more of the genes which are analysed is not consistent with that which is typically observed in a normal control.
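As a minimal sketch of the consistency check described above, the following assumes that a normal reference range is available for each marker at each origin and reports the markers whose measured level falls outside the range expected for the known origin of the sample. The marker names, ranges and the function name `inconsistent_markers` are hypothetical placeholders, not values from this specification.

```python
from typing import Dict, List, Tuple

# reference[marker][origin] -> (low, high) normal range for that origin
Reference = Dict[str, Dict[str, Tuple[float, float]]]


def inconsistent_markers(known_origin: str,
                         levels: Dict[str, float],
                         reference: Reference) -> List[str]:
    """Return markers whose level lies outside the normal range for known_origin."""
    flagged = []
    for marker, level in levels.items():
        low, high = reference[marker][known_origin]
        if not (low <= level <= high):
            flagged.append(marker)
    return flagged


# Hypothetical reference ranges and test sample (arbitrary expression units).
REFERENCE: Reference = {
    "MARKER_A": {"proximal": (8.0, 10.0), "distal": (4.0, 6.0)},
    "MARKER_B": {"proximal": (3.0, 5.0), "distal": (7.0, 9.0)},
}
sample = {"MARKER_A": 9.2, "MARKER_B": 8.1}

# A sample biopsied from a known proximal site; a non-empty result suggests
# the profile is inconsistent with that origin and warrants further analysis.
print(inconsistent_markers("proximal", sample, REFERENCE))
```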
The method of the present invention is useful as a one-off test or as an on-going monitor of those individuals thought to be at risk of the development of disease or as a monitor of the effectiveness of therapeutic or prophylactic treatment regimes such as the ablation of diseased cells which are characterised by an abnormal gene expression profile. In these situations, mapping the modulation of location marker expression levels or expression profiles in any one or more classes of biological samples is a valuable indicator of the status of an individual or the effectiveness of a therapeutic or prophylactic regime which is currently in use. Accordingly, the method of the present invention should be understood to extend to monitoring for the modulation of location marker levels or expression profiles in an individual relative to a normal level (as hereinbefore defined) or relative to one or more earlier gene marker levels or expression profiles determined from a biological sample of said individual.
Means of testing for the subject expressed location markers in a biological sample can be achieved by any suitable method, which would be well known to the person of skill in the art, such as but not limited to:
A person of ordinary skill in the art could determine, as a matter of routine procedure, the appropriateness of applying a given method to a particular type of biological sample.
Without limiting the present invention in any way, and as detailed above, gene expression levels can be measured by a variety of methods known in the art. For example, gene transcription or translation products can be measured. Gene transcription products, i.e., RNA, can be measured, for example, by hybridization assays, run-off assays, Northern blots, or other methods known in the art.
Hybridization assays generally involve the use of oligonucleotide probes that hybridize to the single-stranded RNA transcription products. Thus, the oligonucleotide probes are complementary to the transcribed RNA expression product. Typically, a sequence-specific probe can be directed to hybridize to RNA or cDNA. A “nucleic acid probe”, as used herein, can be a DNA probe or an RNA probe that hybridizes to a complementary sequence. One of skill in the art would know how to design such a probe such that sequence specific hybridization will occur. One of skill in the art will further know how to quantify the amount of sequence specific hybridization as a measure of the amount of gene expression for the gene that was transcribed to produce the specific RNA.
The hybridization sample is maintained under conditions that are sufficient to allow specific hybridization of the nucleic acid probe to a specific gene expression product. “Specific hybridization”, as used herein, indicates near exact hybridization (e.g., with few if any mismatches). Specific hybridization can be performed under high stringency conditions or moderate stringency conditions. In one embodiment, the hybridization conditions for specific hybridization are high stringency. For example, certain high stringency conditions can be used to distinguish perfectly complementary nucleic acids from those of less complementarity. “High stringency conditions”, “moderate stringency conditions” and “low stringency conditions” for nucleic acid hybridizations are explained on pages 2.10.1-2.10.16 and pages 6.3.1-6.3.6 in Current Protocols in Molecular Biology (Ausubel, F. et al., “Current Protocols in Molecular Biology”, John Wiley & Sons, (1998), the entire teachings of which are incorporated by reference herein). The exact conditions that determine the stringency of hybridization depend not only on ionic strength (e.g., 0.2×SSC, 0.1×SSC), temperature (e.g., room temperature, 42° C., 68° C.) and the concentration of destabilizing agents such as formamide or denaturing agents such as SDS, but also on factors such as the length of the nucleic acid sequence, base composition, percent mismatch between hybridizing sequences and the frequency of occurrence of subsets of that sequence within other non-identical sequences. Thus, equivalent conditions can be determined by varying one or more of these parameters while maintaining a similar degree of identity or similarity between the two nucleic acid molecules. Typically, conditions are used such that sequences at least about 60%, at least about 70%, at least about 80%, at least about 90% or at least about 95% or more identical to each other remain hybridized to one another.
By varying hybridization conditions from a level of stringency at which no hybridization occurs to a level at which hybridization is first observed, conditions that will allow a given sequence to hybridize (e.g., selectively) with the most complementary sequences in the sample can be determined.
Exemplary conditions that describe the determination of wash conditions for moderate or low stringency conditions are described in Kraus, M. and Aaronson, S., 1991. Methods Enzymol., 200:546-556; and in Ausubel et al., Current Protocols in Molecular Biology, John Wiley & Sons, (1998). Washing is the step in which conditions are usually set so as to determine a minimum level of complementarity of the hybrids. Generally, starting from the lowest temperature at which only homologous hybridization occurs, each 1° C. by which the final wash temperature is reduced (holding SSC concentration constant) allows an increase by 1% in the maximum mismatch percentage among the sequences that hybridize. Generally, doubling the concentration of SSC results in an increase in Tm of about 17° C. Using these guidelines, the wash temperature can be determined empirically for high, moderate or low stringency, depending on the level of mismatch sought. For example, a low stringency wash can comprise washing in a solution containing 0.2×SSC/0.1% SDS for 10 minutes at room temperature; a moderate stringency wash can comprise washing in a pre-warmed (42° C.) solution containing 0.2×SSC/0.1% SDS for 15 minutes at 42° C.; and a high stringency wash can comprise washing in a pre-warmed (68° C.) solution containing 0.1×SSC/0.1% SDS for 15 minutes at 68° C. Furthermore, washes can be performed repeatedly or sequentially to obtain a desired result as known in the art. Equivalent conditions can be determined by varying one or more of the parameters given as an example, as known in the art, while maintaining a similar degree of complementarity between the target nucleic acid molecule and the primer or probe used (e.g., the sequence to be hybridized).
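The two rules of thumb stated above (1° C. of wash temperature per 1% of tolerated mismatch, and about 17° C. of Tm per doubling of SSC concentration) can be expressed as a short sketch. The function names are hypothetical, and extending the doubling rule to arbitrary fold changes through a base-2 logarithm is an assumption of this sketch rather than a statement of the specification. In practice the wash temperature is still determined empirically.

```python
import math


def wash_temperature_for_mismatch(t_homologous_c: float,
                                  max_mismatch_pct: float) -> float:
    """Final wash temperature (deg C) tolerating up to max_mismatch_pct mismatch.

    t_homologous_c is the temperature at which only homologous hybrids remain;
    each 1 deg C reduction allows about 1% more mismatch (SSC held constant).
    """
    if max_mismatch_pct < 0:
        raise ValueError("mismatch percentage must be non-negative")
    return t_homologous_c - max_mismatch_pct


def tm_shift_for_ssc_change(fold_change: float) -> float:
    """Approximate Tm shift (deg C) for a fold change in SSC concentration.

    Assumption: the 'about 17 deg C per doubling' rule is applied
    logarithmically, so halving the SSC concentration gives about -17 deg C.
    """
    if fold_change <= 0:
        raise ValueError("fold change must be positive")
    return 17.0 * math.log2(fold_change)


# Example: tolerate up to 10% mismatch starting from a 68 deg C wash.
print(wash_temperature_for_mismatch(68.0, 10.0))
# Example: moving from 0.1x SSC to 0.2x SSC (a doubling).
print(tm_shift_for_ssc_change(0.2 / 0.1))
```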
A related aspect of the present invention provides a nucleic acid array, which array comprises a plurality of:
wherein the level of expression of said nucleic acid is indicative of the proximal-distal origin of a cell or cellular subpopulation derived from the large intestine.
Reference herein to a low stringency at 42° C. includes and encompasses from at least about 1% v/v to at least about 15% v/v formamide and from at least about 1M to at least about 2M salt for hybridisation, and at least about 1M to at least about 2M salt for washing conditions. Alternative stringency conditions may be applied where necessary, such as medium stringency, which includes and encompasses from at least about 16% v/v to at least about 30% v/v formamide and from at least about 0.5M to at least about 0.9M salt for hybridization, and at least about 0.5M to at least about 0.9M salt for washing conditions, or high stringency, which includes and encompasses from at least about 31% v/v to at least about 50% v/v formamide and from at least about 0.01M to at least about 0.15M salt for hybridization, and at least about 0.01M to at least about 0.15M salt for washing conditions. In general, washing is carried out at Tm=69.3+0.41 (G+C) % [19]=−12° C. However, the Tm of a duplex DNA decreases by 1° C. with every increase of 1% in the number of mismatched base pairs (Bonner et al (1973). J. Mol. Biol. 81:123).
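By way of a worked illustration only, the relationship Tm = 69.3 + 0.41 (G+C)% and the 1° C. decrease per 1% of mismatched base pairs can be computed as follows. The function names and the example sequence are hypothetical. The sketch implements only these two relationships and makes no attempt to model salt or formamide concentration.

```python
def gc_percent(sequence: str) -> float:
    """Percentage of G and C bases in a nucleotide sequence."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    gc = sum(1 for base in seq if base in "GC")
    return 100.0 * gc / len(seq)


def melting_temperature(sequence: str, mismatch_pct: float = 0.0) -> float:
    """Tm (deg C) from 69.3 + 0.41 * (G+C)%, less 1 deg C per 1% mismatch."""
    if mismatch_pct < 0:
        raise ValueError("mismatch percentage must be non-negative")
    return 69.3 + 0.41 * gc_percent(sequence) - mismatch_pct


# Hypothetical 8-mer with 50% G+C content.
example = "GCGCATAT"
print(gc_percent(example))
print(melting_temperature(example))        # perfectly matched duplex
print(melting_temperature(example, 5.0))   # duplex with 5% mismatched base pairs
```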
A library or array of nucleic acid or protein markers provides rich and highly valuable information. Further, two or more arrays or profiles (information obtained from use of an array) of such sequences are useful tools for comparing a test set of results with a reference, such as another sample or stored calibrator. In using an array, individual nucleic acid members typically are immobilized at separate locations and allowed to react for binding reactions. Primers associated with assembled sets of markers are useful for either preparing libraries of sequences or directly detecting markers from other biological samples.
A library (or array, when referring to physically separated nucleic acids corresponding to at least some sequences in a library) of gene markers exhibits highly desirable properties. These properties are associated with specific conditions, and may be characterized as regulatory profiles. A profile, as termed here refers to a set of members that provides diagnostic information of the tissue from which the markers were originally derived. A profile in many instances comprises a series of spots on an array made from deposited sequences.
A characteristic patient profile is generally prepared by use of an array. An array profile may be compared with one or more other array profiles or other reference profiles. The comparative results can provide rich information pertaining to disease states, developmental state, receptiveness to therapy and other information about the patient.
Another aspect of the present invention provides a diagnostic kit for assaying biological samples comprising an agent for detecting one or more proximal-distal markers and reagents useful for facilitating the detection by the agent in the first compartment. Further means may also be included, for example, to receive a biological sample. The agent may be any suitable detecting molecule.
The present invention is further described by the following non-limiting examples:
Materials and Methods
Gene Expression Data
To explore variation of human gene expression along the non-neoplastic large intestine, we used gene expression data collected using the Affymetrix (Santa Clara, Calif. USA) GeneChip® oligonucleotide microarray system described in [Lipshutz et al., 1999, Nat Genet 21:20-24]. The data are two independent Affymetrix (Santa Clara, Calif. USA) Human Genome 133 GeneChip datasets: a large commercial microarray database of HGU-133 A&B chip data for ‘discovery’, and a smaller HGU-133 Plus 2.0 microarray data set generated by us for ‘validation’.
The larger data set was analyzed to identify gene expression patterns and the independently derived second expression set was used to validate these patterns. Thus, the first data set was mined for hypothesis generation while the second set was used for hypothesis testing.
The data for this study are oligonucleotide microarrays hybridized to labelled cRNA synthesized from poly-A mRNA transcripts isolated from colorectal tissue specimens. The Affymetrix platform that we use is designed to quantify target mRNA transcripts using a panel of 11 perfect match 25 bp oligonucleotide probes (and 11 mismatch probes), called a probeset. To determine the biological relevance of probeset binding intensity, we have annotated the resulting probeset lists using the most current Affymetrix metafiles and BioConductor libraries available. We note that there are multiple probesets on the microarray platform theoretically reactive to any given target ‘gene’. As our focus is to explore transcript expression dynamics along the large intestine, and not to elucidate the underlying genomic mechanisms, we do not explore this phenomenon further. Nevertheless, this fundamental annotation detail should be considered when interpreting the biological relevance of these data and we caution the reader (and other researchers using these techniques) to be wary of the dangers of using the terms ‘genes’ and ‘probeset’ interchangeably.
‘Discovery’ Data Set
Gene expression and clinical descriptions for 184 colorectal tissue specimens were purchased from GeneLogic Inc. (Gaithersburg, Md., USA).
Individual tissue microarray data were selected with the following characteristics: non-neoplastic colorectal mucosa (confirmed by histology) from otherwise healthy tissue specimen (i.e. no evidence of inflammation or other disease at specimen site) with an anatomically-identifiable site of resection designated as one of: cecum, ascending colon, descending colon, sigmoid colon, or rectum.
For each tissue selected from the GeneLogic database, we received electronic files of raw data containing a total of 44,928 probesets (HGU133A and HGU133B, combined), experimental and clinical descriptors for each tissue, and digitally archived microscopy images of the histology preparations. Each data record was manually assessed for clinical consistency and a sample of records was randomly chosen for histopathology audit using digitally archived histology images. A quality control analysis was performed to identify and remove arrays not meeting essential quality control measures as defined by the manufacturer [Affymetrix, 2001; Wilson and Miller, 2005, Bioinformatics].
Gene expression levels were calculated by both Microarray Suite (MAS) 5.0 (Affymetrix) and the Robust Multichip Average (RMA) normalization techniques [Affymetrix, 2001; Hubbell et al., 2002, Bioinformatics 18:1585-1592; Irizarry et al., 2003, Nucleic Acids Res 31:e15]. MAS normalized data were used for performing standard quality control routines and the final data set was normalized with RMA for all subsequent analyses.
‘Validation’ Data Set
The colorectal specimens in the ‘validation’ set were collected from a tertiary referral hospital tissue bank in metropolitan Adelaide (Repatriation General Hospital and Flinders Medical Centre). The tissue bank and this project were approved by the Research and Ethics Committee of the Repatriation General Hospital and patient consent was received for each tissue studied. Following surgical resection, specimens were placed in a sterile receptacle and collected from theatre. The time from operative resection to collection from theatre was variable but not more than 30 minutes. Samples, approximately 125 mm3 (5×5×5 mm) in size, were taken from the macroscopically normal tissue as far from pathology as possible, defined both by colonic region as well as by distance either proximal or distal to the pathology. Tissues were placed in cryovials, then immediately immersed in liquid nitrogen and stored at −150° C. until processing.
Frozen samples were processed by the authors using standard protocols and commercially available kits. Briefly, frozen tissues were homogenized using a carbide bead mill (Mixer Mill MM 300, Qiagen, Melbourne, Australia) in the presence of chilled Promega SV RNA Lysis Buffer (Promega, Sydney, Australia) to neutralize RNase activity. Homogenized tissue lysates for each tissue were aliquoted to convenient volumes and stored at −80° C. Total RNA was extracted from tissue lysates using the Promega SV Total RNA system according to manufacturer's instructions and integrity was assessed visually by gel electrophoresis.
To measure relative expression of mRNA transcripts, tissue RNA samples were analyzed using Affymetrix HG U133 Plus 2.0 GeneChips (Affymetrix, Santa Clara, Calif. USA) according to the manufacturer's protocols [Affymetrix, 2004]. Biotin labelled cRNA was prepared using 5 μg (1.0 μg/μL) total RNA (approx. 1 μg mRNA) with the “One-Cycle cDNA” kit (incorporating a T7-oligo (dT) primer) and the GeneChip IVT labelling kit. In vitro transcribed cRNA was fragmented (20 μg) and analyzed for quality control purposes by spectrophotometry and gel electrophoresis prior to hybridization. Finally, a hybridization cocktail was prepared with 15 μg of cRNA (0.5 μg/μL) and hybridized to HG U133 Plus 2.0 microarrays for 16 h at 45° C. in an Affymetrix Hybridization Chamber 640. Each cRNA sample was spiked with standard eukaryotic hybridization controls for quality monitoring.
Hybridized microarrays were stained with streptavidin phycoerythrin and washed with a solution containing biotinylated anti-streptavidin antibodies using the Affymetrix Fluidics Station 450. Finally, the stained and washed microarrays were scanned with the Affymetrix Scanner 3000.
The Affymetrix software package was used to transform raw microarray image files to digitized format. As for the Discovery set above, gene expression levels for the validation data set were generated using MAS 5.0 (Affymetrix) for quality control purposes and with the RMA normalization method for expression data.
Statistical Analysis
As shown in
In the described embodiment, the detection system also includes C++ modules 1008 to provide C++ language support, including C++ libraries, and an R module 1012 providing support for the R statistical programming language and the MASS library described in [Venables and Ripley, 2002] and available from the CRAN open source repository at http://cran.r-project.org. The system also includes the BioConductor software application 1010 available from http://www.bioconductor.org, which, together with the profile analyzer 1004 and principal component analyzer 1006, is implemented in the R programming language, as described at http://www.r-project.org. The SVM 1002 is implemented in the C++ programming language. The classifier module 1007 is the GeneRave application, as described at http://www.bioinformatics.csiro.au/products.shtml and references provided therein. The system also includes the Microarray Suite (MAS) 5.0 1014, and the Robust Multichip Average (RMA) normalization application 1016, both available from Affymetrix, and described at http://www.affymetrix.com. The software applications are executed under control of a standard operating system 1018, such as Linux or MacOS 10.4, and the computer system includes standard computer hardware components, including at least one processor 1022, random access memory 1024, a keyboard 1026, a standard pointing device such as a mouse 1028, and a display 1030, all of which are interconnected via a system bus 1032, as shown.
The detection methods include classification methods of the general form of
Furthermore, it will be apparent to those skilled in the art that the resulting classifier or discriminating function represented by the initially generated classification data can be adjusted based on decision theoretic principles to improve the classification outcomes and their utility. For example, a prior belief in the probability of outcomes can be incorporated, and/or a decision surface can be modified based on the different costs of misclassification cases. These and other relevant methods of decision theory, minimizing loss functions, and cost of misclassification are described in [Krzanowski and Marriott, 1995].
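The two adjustments mentioned above, incorporating a prior belief and weighting the costs of misclassification, can be sketched as follows for a two-class (proximal versus distal) classifier that outputs a probability of proximal origin. The function names, the prior values and the cost values are hypothetical; this is a generic Bayes decision rule offered for illustration and is not asserted to be the adjustment used in the described embodiment.

```python
def adjust_for_prior(p_proximal: float,
                     training_prior: float,
                     new_prior: float) -> float:
    """Re-weight a classifier's probability of 'proximal' for a new prior.

    training_prior is the proportion of proximal samples the classifier was
    trained on; new_prior is the prior belief to be incorporated instead.
    """
    w_proximal = p_proximal * new_prior / training_prior
    w_distal = (1.0 - p_proximal) * (1.0 - new_prior) / (1.0 - training_prior)
    return w_proximal / (w_proximal + w_distal)


def minimum_cost_decision(p_proximal: float,
                          cost_false_proximal: float,
                          cost_false_distal: float) -> str:
    """Choose the call with the lower expected misclassification cost."""
    expected_cost_if_proximal = (1.0 - p_proximal) * cost_false_proximal
    expected_cost_if_distal = p_proximal * cost_false_distal
    if expected_cost_if_proximal <= expected_cost_if_distal:
        return "proximal"
    return "distal"


# Hypothetical values: the classifier reports 0.6, but wrongly calling a
# distal sample proximal is taken to be four times as costly as the converse.
p = adjust_for_prior(0.6, training_prior=0.5, new_prior=0.5)
print(minimum_cost_decision(p, cost_false_proximal=1.0, cost_false_distal=1.0))
print(minimum_cost_decision(p, cost_false_proximal=4.0, cost_false_distal=1.0))
```

With equal costs the rule reduces to choosing the more probable class; unequal costs shift the decision surface toward the cheaper error.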
For all statistical analysis, we used open source software available from BioConductor (www.bioconductor.org) for the R statistics environment (R being an open source implementation of the S statistical analysis environment) [Gautier et al., 2004, Bioinformatics 20:307-315; Gentleman et al., 2004, Genome Biol 5:R80].
The linear methods used to generate and process linear and non-linear combinations of gene expression levels, including linear regression, multiple linear regression, linear discriminant analysis, logistic regression, generalized linear models, and principal components analysis, are all described in [Hastie, 2001], for example. These methods are implemented in R.
Gene expression gradients were analyzed using three analytical techniques. First, we compared the gene expression variation of individual genes along the large intestine in the usual univariate manner. Next, we further explored those particular genes exhibiting statistically significant expression differences with linear models to compare dichotomous (proximal vs. distal) expression change with a gradual (multi-segment) model of change. Finally, we applied multivariate techniques to understand subtle genome-wide expression variance along the proximal-distal axis. Such genome-wide expression variances were interrogated using non-parametric methods as described in [Ripley, 1996], including nearest neighbor methods.
Individual Gene Expression Maps
Univariate Differential Expression
Differentially expressed gene transcripts between the proximal and distal large intestine were identified using a moderated t-test implemented in the ‘limma’ Bioconductor library for R [Smyth, 2005]. Significance estimates (p-values) were corrected to adjust for multiple hypothesis testing (MHT) using the conservative Bonferroni correction. The subset of tissues limited to the cecum vs. the rectum was similarly tested.
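The Bonferroni correction referred to above multiplies each p-value by the number of hypotheses tested and caps the result at 1. A minimal sketch follows; the function name and the example p-values are hypothetical, and the moderated t-test itself (performed with ‘limma’ in R) is not reproduced here.

```python
from typing import List


def bonferroni_adjust(p_values: List[float]) -> List[float]:
    """Bonferroni-adjusted p-values: min(1, p * number_of_tests)."""
    n_tests = len(p_values)
    return [min(1.0, p * n_tests) for p in p_values]


# Hypothetical raw p-values for three probesets.
raw = [0.0001, 0.02, 0.5]
adjusted = bonferroni_adjust(raw)
significant = [p < 0.05 for p in adjusted]
print(adjusted)
print(significant)
```

With 44,928 probesets on the discovery platform, a raw p-value would need to fall below roughly 0.05/44,928 to remain significant at the 0.05 level after this correction.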
Gene transcripts identified to be differentially expressed were also evaluated in the ‘Validation’ specimens on a probeset-by-probeset basis using modified t-tests. To assess the significance of the total number of differential probesets that were likewise differential in the validation data, the number of ‘validated’ probesets was compared to a null distribution estimated using a Monte Carlo simulation.
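The comparison of an observed count against a simulated null distribution can be sketched as below. The null model used here is an assumption made for illustration: each candidate probeset is taken to 'validate' by chance, independently, with probability equal to the per-test significance threshold. The specification does not state the null model that was actually simulated, and the function names, counts and threshold are hypothetical.

```python
import random
from typing import List


def simulate_null_counts(n_probesets: int, alpha: float,
                         n_simulations: int, seed: int = 0) -> List[int]:
    """Simulate how many probesets validate by chance in each simulation.

    Assumed null model: every probeset independently passes with probability alpha.
    """
    rng = random.Random(seed)
    counts = []
    for _ in range(n_simulations):
        passed = sum(1 for _ in range(n_probesets) if rng.random() < alpha)
        counts.append(passed)
    return counts


def empirical_p_value(observed: int, null_counts: List[int]) -> float:
    """Proportion of null simulations at least as extreme as the observed count.

    One is added to numerator and denominator so the estimate is never zero.
    """
    at_least = sum(1 for count in null_counts if count >= observed)
    return (at_least + 1) / (len(null_counts) + 1)


# Hypothetical example: 100 candidate probesets, 30 of which validated.
null = simulate_null_counts(n_probesets=100, alpha=0.05, n_simulations=1000)
print(empirical_p_value(30, null))
```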
Multi-Segment Large Intestine vs. Two-Segment Large Intestine Model Comparison
To evaluate the nature of inter-segment gene expression variation we analyzed differentially expressed probesets for relative fit to linear models in a multi-segment vs. a two-segment framework. The goal of this analysis is to explore whether the intersegment expression of probesets that are known to be differentially expressed between the terminal ends of the large intestine is better modelled by a five-segment linear model that approximates a continual gradation or by a simpler, dichotomous ‘proximal’ vs. ‘distal’ gradient. As our data are only identified by colorectal segment designation and not by a continuous measurement along the length of the large intestine, we approximate the continuous model using the tissue segment location. We chose probesets that are differentially expressed between the most terminal segments (cecum and rectum) in order to maximize the likelihood of identifying transcripts that vary along the proximal-distal axis of the large intestine.
We first modelled the expression of these probesets along the proximal-distal axis of the large intestine using a five factor robust linear model according to an indicator matrix defined by the colorectal segment for each tissue. For this model each tissue was assigned by biopsy location to one of: cecum, ascending, descending, sigmoid, or rectum. (For reasons described below, transverse tissues were not included in this analysis.) This five segment model was then compared to a two-factor robust linear model with a design matrix corresponding to the theoretical proximal and distal regions of the large intestine. The same data were used for both model comparisons, however for the two segment model, the first factor (corresponding to the proximal tissues) included all of the tissues from the cecum and ascending colon while the second factor (corresponding to the distal large intestine) included all tissues from the descending, sigmoid and rectum segments.
When comparing these distinct models for each probeset, we used an F-test to evaluate the hypothesis Ha that the improved fit (reduced regression residual) provided by the more complex five-segment model was significantly better than that of the simpler two-segment model. A non-significant residual reduction indicates a failure to reject the null hypothesis.
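The nested-model comparison can be sketched in Python as follows. This is an illustration only: it fits ordinary least-squares group means rather than the robust linear models used in the analysis, the expression values are invented, and the function names are hypothetical. It returns the F statistic and its degrees of freedom; a p-value would be read from the F distribution with those degrees of freedom.

```python
from statistics import fmean

PROXIMAL = {"cecum", "ascending"}  # remaining segments are treated as distal


def rss(values_by_group):
    """Residual sum of squares when each group is fitted by its own mean."""
    total = 0.0
    for vals in values_by_group.values():
        mu = fmean(vals)
        total += sum((v - mu) ** 2 for v in vals)
    return total


def nested_f(expr_by_segment):
    """F statistic comparing the segment model with the two-region model."""
    n = sum(len(v) for v in expr_by_segment.values())
    k_full = len(expr_by_segment)  # five segments in the text
    two = {"proximal": [], "distal": []}
    for seg, vals in expr_by_segment.items():
        two["proximal" if seg in PROXIMAL else "distal"].extend(vals)
    rss_full, rss_reduced = rss(expr_by_segment), rss(two)
    df_num, df_den = k_full - 2, n - k_full
    f_stat = ((rss_reduced - rss_full) / df_num) / (rss_full / df_den)
    return f_stat, df_num, df_den


# Invented log-expression values for two hypothetical probesets.
step = {"cecum": [5.0, 5.2], "ascending": [5.1, 4.9], "descending": [8.0, 8.2],
        "sigmoid": [7.9, 8.1], "rectum": [8.0, 8.2]}
gradient = {"cecum": [1.0, 1.2], "ascending": [2.0, 2.2], "descending": [3.0, 3.2],
            "sigmoid": [4.0, 4.2], "rectum": [5.0, 5.2]}
f_step, df_num, df_den = nested_f(step)
f_gradient, _, _ = nested_f(gradient)
```

A dichotomous probeset such as `step` yields a small F statistic because the extra segment terms barely reduce the residual, whereas a graded probeset such as `gradient` yields a large one.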
Multivariate Gene Expression Pattern Mapping
Results
Gene Expression Data Collection
Discovery and Validation Data Sets
A discovery data set was generated using data from the hybridization of cRNA to Affymetrix HG U133A/B GeneChip microarrays that were purchased from GeneLogic Inc.
Data from 184 normal tissues meeting inclusion criteria and quality assurance criteria for the HG U133A/B GeneChip were analyzed and used for hypothesis generation. The tissues comprised segment subsets as follows: 29 cecum, 45 ascending, 13 descending, 54 sigmoid, and 43 rectum. For each tissue, 44,928 probe sets were background corrected and normalized using RMA preprocessing.
To construct the ‘validation’ data set, 19 HG U133 Plus2.0 GeneChips were hybridized to labelled cRNA prepared from 8 proximal tissue specimens and 11 distal specimens. Due to stringent quality control parameters for tissue and GeneChip acceptability, this validation data set did not include sufficient tissues to explore multiple segment models. Each microarray measured transcript expression for 54,675 probe sets.
The theoretical juncture between the proximal and distal large intestine is approximately two thirds the length of the transverse colon measured from the hepatic flexure. [Yamada and Alpers, 2003, supra] As sample data were not specific for distance along the transverse colon, these tissues were excluded from the discovery analysis.
Gene Variation Along the Large Intestine
Individual Gene Expression Changes
Univariate Differential Expression
To explore the ‘natural’ dividing point between the anatomical segments of the large intestine, we measured the absolute number of probeset expression changes when the hypothetical ‘divide’ was moved stepwise from cecum to rectum.
A total of 206 probesets, corresponding to approximately 154 known gene targets, were expressed at a significantly higher level in either the proximal or the distal colorectal samples relative to the other region (Bonferroni corrected p<0.05). Of these 206 probesets, 31 (15.0%) were also differentially expressed in the validation data with a significant difference (31/206, p<<0.05 by Monte Carlo estimation).
A total of 115 probesets were differentially expressed between tissues selected only from the cecum (n=29) and the rectum (n=43). While 102 (89%) of these probesets are included in the 206 probesets differing between proximal and distal large intestine described above, the cecum vs. rectum gene expression is useful, in principle, to isolate those transcripts that are different between the most terminal ends of the large bowel. In this subset, 28 probesets (24.3%) were likewise differentially expressed in the rectum vs. the cecum in the validation data (28/115, p<10^-5 by Monte Carlo estimation).
Differentially expressed probesets and difference statistics for probesets with elevated expression in proximal and distal tissues are shown in Tables 1 and 2, respectively.
Multi-Segment Gene Expression Models
An analysis for differential expression was also made for all five inter-segment transitions in order from the cecum to the rectum (i.e. cecum vs. ascending, ascending vs. transverse, etc.). Interestingly, no transcript was differentially expressed to a significant degree between any two adjoining segments (moderated t-test; p <0.05).
To explore the precise nature of these gene transcript expression changes, we built and compared robust linear models fitted to the expression data based on location for each tissue sample. Two robust linear models of univariate probeset expression were compared for each of the 115 probesets differentially expressed between the two terminal segments of the large intestine, the cecum and rectum. In particular, we queried whether the expression of those transcripts that were differentially expressed between these terminal segments was better explained (in terms of residual fit) by a simple two-segment model or by the more descriptive five-segment model.
Of the 115 differentially expressed probesets, the analysis failed to reject the null hypothesis that a complex model does not significantly improve model fit to the observed gene expression data in 65 (57%) of cases (F-test, p >0.05). Thus, more than half of these differentially expressed transcripts along the large intestine are satisfactorily modelled by the two-segment expression model whereby expression is dichotomous and defined by proximal vs. distal location. The most differentially expressed probeset between the cecum and rectum is the transcript for PRAC. A comparison of the two-segment and multi-segment models for this transcript is shown in
For the remaining 50 (43%) probesets, the null hypothesis was rejected (p<0.05), suggesting that a five-factor model dependent on segment location in fact improves the predictive effectiveness of such transcripts' expression along the proximal-distal axis in a significant manner. Inspection of these models confirms that most models are monotonically increasing or monotonically decreasing in tissues progressing along the large intestine.
Interestingly, 41 (82%) of the 50 multi-segment models show a gradual increase across the large intestine while only 9 models (18%) indicate a gradual decrease from proximal to distal expression (shown in
Patterns of Gene Expression Along the Large Intestine
In addition to analyses of individual gene changes along the large intestine, we used multivariate analytical techniques to explore patterns of gene changes along the proximal-distal axis.
Supervised Principal Components Analysis
To visualize and explore the structure of expression variability at an organ level, principal component analysis (PCA) and a variant of PCA known as Supervised PCA were applied to the gene expression data using the principal component analyzer (PCA) 1006 of the detection system. PCA is described in [Venables and Ripley, 2002], and was implemented in R. A detailed description of supervised PCA can be found in [Bair et al., 2004].
Initially, expression data representing gene expression of all 44,928 probesets of the ‘Discovery’ data set were processed by the PCA module 1006 using principal components analysis (PCA). PCA is a standard method for simplifying a multi-dimensional data set by generating linear transformations of the data set dimensions to reduce the number of dimensions. The transformed data is provided as principal component data representing a sorted set of “principal components”, such that the first principal component has the greatest variance, the second principal component the second greatest variance, and so on. The result of applying PCA to the complete data set includes the multivariate or principal component data shown in
To investigate whether a subset of all genes could be used to generate one or more principal components indicative of tissue location, the expression data was analyzed by supervised PCA. As described in [Bair et al, 2004], supervised PCA is similar to standard principal components analysis but uses only a subset of the features/genes (usually selected by some univariate means) to generate the principal components. In this case, the set of genes differentially expressed between the cecum and rectum (i.e., the extreme ends of the large intestine) were selected for PCA analysis. However, other forms of feature selection could alternatively be used. Specifically, a reduced data matrix was generated by including only the 115 probesets that are differentially expressed between tissue samples taken from the cecum and rectum, but for all 184 normal tissues from all segments of the large intestine. Standard PCA was then performed on this feature specific data. As shown in
Although the principal component data could be used to predict the origin of cells based on expression of genes from these cells, other analysis methods are preferred for this task, as described below.
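The supervised PCA procedure described above (select a subset of features, then run standard PCA on all samples) can be sketched in pure Python. This is a minimal illustration under several assumptions: features are ranked by absolute mean difference between the two end segments instead of by the moderated t-test, only the first principal component is extracted (by power iteration), and the small data matrix is invented.

```python
import math
import random


def select_features(rows, labels, class_a, class_b, k):
    """Indices of the k features with the largest |mean difference|."""
    a = [i for i, lab in enumerate(labels) if lab == class_a]
    b = [i for i, lab in enumerate(labels) if lab == class_b]

    def mean(idx, j):
        return sum(rows[i][j] for i in idx) / len(idx)

    ranked = sorted(range(len(rows[0])),
                    key=lambda j: abs(mean(a, j) - mean(b, j)), reverse=True)
    return sorted(ranked[:k])


def first_pc_scores(rows, n_iter=100, seed=0):
    """Scores of each sample on the first principal component."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    rng = random.Random(seed)
    v = [rng.gauss(0, 1) for _ in range(p)]
    for _ in range(n_iter):
        # Apply X^T X to v without forming the p x p covariance matrix.
        s = [sum(x[j] * v[j] for j in range(p)) for x in centered]
        w = [sum(centered[i][j] * s[i] for i in range(n)) for j in range(p)]
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    return [sum(x[j] * v[j] for j in range(p)) for x in centered]


# Invented expression matrix: 6 tissues x 4 probesets.
labels = ["cecum", "cecum", "ascending", "sigmoid", "rectum", "rectum"]
rows = [[2.0, 5.0, 9.0, 1.0], [2.2, 5.3, 8.8, 1.2], [2.5, 4.9, 8.5, 0.9],
        [6.8, 5.1, 3.4, 1.1], [7.0, 5.2, 3.0, 1.0], [7.1, 4.8, 3.1, 1.1]]

selected = select_features(rows, labels, "cecum", "rectum", k=2)
reduced = [[r[j] for j in selected] for r in rows]  # all tissues, selected probesets
scores = first_pc_scores(reduced)
```

Feature selection uses only the cecum and rectum samples, while the principal component is computed from every tissue, mirroring the reduced data matrix described in the text.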
Profile Analysis (Canonical Variate Analysis)
Expression patterns along the gut were also analyzed by the profile analyzer 1004 using Profile Analysis to visualize inter versus intra-segment expression variation. As described in [Kiiveri, 1992], profile analysis is a modification of standard canonical variate analysis suited to cases where the number of variables exceeds the number of observations. The method models the p×p within-class covariance matrix Σw via a factor analytic model [Kiiveri, 1992] with a relatively low number of independent factors. Permutation tests are used to determine the significance of each term (i.e. gene) in each of the canonical variates. By including only significant terms, profile analysis provides a feature selection capability. This method is generally useful as an exploratory tool to characterize the class variation structure. Canonical variate analysis is implemented in the R MASS library, as described in [Venables and Ripley, 2002]. Profile Analysis was implemented in a proprietary library in R, as described in [Kiiveri 1992].
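For orientation, the standard canonical variate criterion and a common factor-analytic form for the within-class covariance can be written as below. The symbols $\Sigma_b$ (between-class covariance), $a$ (canonical vector), $\Lambda$ (a $p \times k$ loading matrix with $k$ much smaller than $p$) and $\Psi$ (a diagonal matrix) are assumptions introduced here; the exact parameterisation used in [Kiiveri, 1992] may differ.

```latex
\[
  a^{*} \;=\; \arg\max_{a \neq 0}\;
  \frac{a^{\top} \Sigma_b \, a}{a^{\top} \Sigma_w \, a},
  \qquad
  \Sigma_w \;\approx\; \Lambda \Lambda^{\top} + \Psi
\]
```

When the number of variables $p$ exceeds the number of observations, the sample estimate of $\Sigma_w$ is singular, which is why a low-rank-plus-diagonal model of this kind is useful.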
Given a priori knowledge of segment labels for tissues, profile analysis attempts to identify the limited gene transcript subspace that provides maximum inter-class separation of each of the five segments of the large intestine while minimizing the intraclass (i.e., within each segment) variance. The results of profile analysis of the complete data set include the canonical variable data shown in
Support Vector Machines
While the multivariate methods described above are useful for investigating gene expression variation along the large intestine, supervised machine learning was used to identify genes that are also predictive of tissue location in a robust manner, and to identify the smallest subsets of probesets/genes that can be used to predict tissue location with a low cross-validated error rate.
In the described embodiment, the particular form of machine learning used is a support vector machine (SVM), as provided by the SVM module 1002; however, it will be apparent to the skilled addressee that other kernel methods could alternatively be used. As described in [Scholkopf, 2004], kernel methods are extensions of linear methods whereby the variables are mapped to another space where the essential features of this mapping are captured by a simple kernel. Kernel methods can be particularly advantageous in cases where the observations are linearly separable in the kernel space but not in the original data space.
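The remark about separability in the kernel space can be made concrete with a toy example. The one-dimensional data, the quadratic feature map and the threshold below are all hypothetical and chosen only to illustrate the idea; they are unrelated to the expression data.

```python
def feature_map(x):
    """Explicit quadratic map phi(x) = (x, x^2)."""
    return (x, x * x)


def kernel(x, y):
    """Kernel equal to the inner product of mapped points: xy + x^2 y^2."""
    a, b = feature_map(x), feature_map(y)
    return a[0] * b[0] + a[1] * b[1]


points = [-2.0, -1.0, 1.0, 2.0]
labels = [1, 0, 0, 1]  # outer points vs. inner points


def separable_by_threshold(values, labels):
    """True if some single cut on the values puts each class on its own side."""
    pairs = sorted(zip(values, labels))
    ordered = [lab for _, lab in pairs]
    changes = sum(1 for u, v in zip(ordered, ordered[1:]) if u != v)
    return changes <= 1


in_original_space = separable_by_threshold(points, labels)
in_mapped_space = separable_by_threshold([feature_map(x)[1] for x in points], labels)
```

No single cut on x separates the outer points from the inner ones, but a cut on the second mapped coordinate (x squared) does, which is the situation in which a kernel method has an advantage over a purely linear one.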
The SVM 1002 determines the combination of features (gene transcripts) that maximally separates the observations (i.e., tissues) along a class-decision boundary, using standard SVM methodology, as described in [Cristianini and Shawe-Taylor, 2000].
Specifically, the support vector machine (SVM) 1002 was used to generate classification data representing the smallest sub-set of probesets from the complete data set whose expression enables the maximum separation of cells originating from the cecum and rectum. The SVM 1002 was trained using a linear kernel and the classification data generated at each iteration was evaluated using 10-fold cross-validation. The lowest contributing gene transcripts from each subset of transcripts were recursively eliminated to identify the smallest set of transcripts with high prediction accuracy.
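The recursive elimination loop with cross-validation can be sketched as follows. To stay self-contained, this sketch substitutes a simple diagonal linear discriminant for the linear-kernel SVM, uses 5-fold rather than 10-fold cross-validation, and runs on a small synthetic data set; all function names are hypothetical.

```python
import random


def fit_linear(rows, labels, feats):
    """Per-feature weights and midpoints of a diagonal linear discriminant."""
    pos = [r for r, y in zip(rows, labels) if y == 1]
    neg = [r for r, y in zip(rows, labels) if y == 0]
    weights, mids = {}, {}
    for j in feats:
        mp = sum(r[j] for r in pos) / len(pos)
        mn = sum(r[j] for r in neg) / len(neg)
        ss = sum((r[j] - mp) ** 2 for r in pos) + sum((r[j] - mn) ** 2 for r in neg)
        weights[j] = (mp - mn) / (ss / (len(rows) - 2) + 1e-9)
        mids[j] = (mp + mn) / 2
    return weights, mids


def predict(model, row):
    weights, mids = model
    return 1 if sum(w * (row[j] - mids[j]) for j, w in weights.items()) > 0 else 0


def cv_error(rows, labels, feats, k=5, seed=0):
    """k-fold cross-validated misclassification rate."""
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    errors = 0
    for fold in range(k):
        test = idx[fold::k]
        train = [i for i in idx if i not in test]
        model = fit_linear([rows[i] for i in train], [labels[i] for i in train], feats)
        errors += sum(predict(model, rows[i]) != labels[i] for i in test)
    return errors / len(rows)


def recursive_elimination(rows, labels, k=5):
    """Drop the lowest-weight feature at each step, recording the CV error."""
    feats = list(range(len(rows[0])))
    path = []
    while feats:
        path.append((list(feats), cv_error(rows, labels, feats, k)))
        weights, _ = fit_linear(rows, labels, feats)
        feats.remove(min(feats, key=lambda j: abs(weights[j])))
    return path


# Synthetic data: feature 0 tracks the class, features 1 and 2 do not.
labels = [i % 2 for i in range(20)]
rows = [[3.0 * (i % 2) + 0.1 * (i % 5), (i * 7 % 11) / 10.0, (i * 3 % 5) / 10.0]
        for i in range(20)]
path = recursive_elimination(rows, labels)
best_feats, best_error = min(path, key=lambda step: (step[1], len(step[0])))
```

The final line picks the subset with the lowest cross-validated error, breaking ties in favour of fewer features, which corresponds to looking for the smallest set of transcripts with high prediction accuracy.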
The cross-validated SVM error rate as a function of the number of probesets included in the model (as they were successively eliminated) is shown in
To measure the utility of this model in an independent data set, the classification data for the thirteen feature model was tested for proximal vs. distal prediction performance in the validation data. Using a traditional linear discriminant analysis model built with these 13 probesets, the eight proximal and eleven distal tissues were predicted with 100% accuracy.
Classifier Model
As an alternative to the SVM 1002, a classifier 1007 was also used to process the complete expression data from tissue samples taken from known locations along the proximal-distal axis of the large intestine to identify combinations of genes that can be used to identify the origin of a cell or cell population of unknown origin along the large intestine. In the described embodiment, the linear GeneRave classifier was used, as described at http://www.bioinformatics.csiro.au/overview.shtml. GeneRave is preferred in cases where the number of variables exceeds the number of observations. However, it will be apparent to those skilled in the art that other classifiers could be alternatively used, including non-linear classifiers and classifiers based on regularized logistic regression.
As described in [Kiiveri 2002], the GeneRave classifier 1007 generates classification data representing linear combinations of expression levels to identify subsets of genes that can be used to accurately identify the location of a sample of unknown location. GeneRave 1007 uses a Bayesian network model to select genes by eliminating genes that in linear combination with other genes do not have any correlation with the location from which corresponding tissue samples were taken.
The result of the GeneRave analysis of the complete data set is classification data corresponding to a set of 7 genes whose expression levels can be used to accurately identify the origin of a corresponding cell along the proximal-distal axis of the large intestine. The 7 genes are SEC6L1, PRAC, SPINK5, SEC6L1, ANPEP, DEFA5, and CLDN8.
Discussion
A Map of Gene Differential Expression Along the Large Intestine
Univariate expression analysis identified 206 probesets corresponding to 154 unique gene targets that are differentially expressed between the normal proximal and normal distal large intestine regions in human adults. A subset of 115 probesets (89% common to the proximal vs. distal list) is likewise differentially expressed between the terminal colorectal segments of the cecum and rectum. Interestingly, we found no transcripts that were expressed significantly differently between any two adjacent segments.
To estimate the validity of these findings, we have also measured the expression change of these gene transcripts in an independent set of microarray data. Thirty-one (31) of the 206 differentially expressed probesets in our initial discovery data set of 184 colorectal tissue samples were also differentially expressed in the validation data of 19 specimens.
Using a Monte Carlo simulation, we showed that such a large number of probesets differential in both datasets is extremely unlikely.
Nearly all (28/31, 90%) of these ‘validated’ transcripts were likewise differentially expressed between the two terminal segments of the cecum and rectum. 57 of 154 (37%) corresponding gene targets were confirmed to be differentially expressed between the proximal and distal large intestine by independent means.
Differential Transcript Expression for Individual Genes
The most significantly differential probeset we observed in our discovery data was against the gene transcript for PRAC. PRAC is highly expressed in the distal large intestine relative to the proximal tissues. Further, PRAC appears to be expressed in a low-high pattern along the large intestine with a sharp expression change occurring between the ascending and descending colorectal specimens.
We found eight (8) probesets corresponding to seven (7) HOX genes to be differentially expressed between the proximal and distal large intestine. The 39 members of the mammalian homeobox gene family consist of highly conserved transcription factors that specify the identity of body segments along the anterior-posterior axis of the developing embryo [Hostikka and Capecchi, 1998, Mech Dev 70:133-145; Kosaki et al., 2002, Teratology 65:50-62]. The four groups of HOX gene paralogues are expressed in an anterior to posterior sequence, e.g. from HOXA1 to HOX13. [Montgomery et al., 1999, Gastroenterology 116:702-731] It has been found that lower-numbered HOX genes are expressed at higher levels in the proximal tissues (HOXD3, HOXD4, HOXB6, HOXC6 and HOXA9), while the higher-numbered genes are more highly expressed in the distal large intestine (HOXB13 and HOXD13).
Interestingly, there was a conspicuous absence in our findings of some gene transcripts that have been previously shown to be differentially expressed along the proximal-distal axis. Our data do not demonstrate a significant expression gradient for the caudal homeobox genes CDX1 or CDX2, transcription factors that have been shown to be involved in intestine pattern development across a range of vertebrates. (Chalmers et al., 2000) (James et al., 1994) (Silberg et al., 2000) In particular, CDX2 is believed to play a role in maintaining the colonic phenotype in the adult large intestine and was recently shown to be present at relatively high concentrations in the proximal large intestine but absent in the distal large intestine (James et al., 1994) (Silberg et al., 2000). Neither statistical analysis nor visual inspection of probeset expression for this gene shows differential expression along the large intestine in our data (data not shown).
We observed significant differential transcript expression for a number of the solute-carrier transport genes. While probeset expression for SLC2A10, SLC13A2, and SLC28A2 is higher in the distal large intestine, expression of the solute carrier family members SLC9A3, SLC14A2, SLC16A1, SLC20A1, SLC23A3, and SLC37A2 is higher in the proximal tissues.
Our results show that probesets against all three of the five members of the chromosome 7q22 cluster of membrane-bound mucins previously believed to be expressed in large intestine, MUC11, MUC12 and MUC17, are differentially expressed at higher levels in the distal gut [Byrd and Bresalier, 2004, Cancer Metastasis Rev 23:77-99; Williams et al., 1999, Cancer Res 59:4083-4089; Gum et al., 2002, Biochem Biophys Res Commun 291:466-475]. We also confirmed this differential expression pattern for MUC12 and MUC17 in the independent validation data. Previous reports have also raised the question of whether the genomic sequences for MUC11 and MUC12 are from closely related or perhaps even the same gene. [Byrd and Bresalier, 2004, supra] Correlation analysis of MUC11 and MUC12 probesets shows a strong, positive correlation at the lower end of the probeset expression range with a weaker correlation as expression increases (data not shown). This correlation profile could be due to increased variability at higher expression levels or, possibly, because the expression levels in the distal large intestine (where they are higher) reflect a distinct transcriptional control.
In addition, while previous research has suggested that the secreted, gel-forming mucin MUC5B is only weakly expressed in the large intestine [Byrd and Bresalier, 2004, supra], our results show that probesets reactive to this transcript are expressed higher in the distal large intestine as for the membrane-bound mucins.
Some of the expression patterns we report here for humans have been shown to be similarly patterned in the gastrointestinal tracts of rodent models. However, a number of specific genes previously shown to be differentially expressed along the large intestines of mice and rats were not found to be so expressed by us. Such gene transcript targets include carbonic anhydrase IV (Fleming et al., 1995), solute carrier family 4 member 1 (alias AE1) (Rajendran et al., 2000), CD36/fatty acid translocase (Chen et al., 2001), and toll-like receptor 4 (Ortega-Cava et al., 2003). On the other hand, our data are in agreement with earlier studies of expression of aquaporin-8 (AQP8), a gene whose expression product is suspected to be involved in water absorption in the normal rat large intestine (Calamita et al., 2001). We observe that AQP8 is expressed at a significantly higher level in the proximal human large intestine compared to the distal tissues (p<0.006, data not shown). The family of claudin tight junction proteins may also play a role in maintaining the water barrier integrity in the large intestine (Jeansonne et al., 2003). We found that claudin-8 (CLDN8) is much more highly expressed in the distal colorectal tissues. Conversely, claudin-15 (CLDN15), which is also believed to be localized in the tight junction fibrils, was expressed at a higher level in the proximal colorectal tissues (Colegio et al., 2002).
The Nature of Gene Expression Change Along the Large Intestine
While one goal of this work was to understand which gene transcripts are differentially expressed along the large intestine, a second aim was to explore the nature of these expression changes along the proximal-distal axis in region or segment-specific detail.
We observed two broad patterns of statistically significant transcript expression change along the colorectum. The major pattern is described by those 65 gene transcripts that were well fitted by a two-segment expression model. We suggest that the expression of these transcripts is dichotomous in nature—elevated in the proximal segments and decreased in distal segments, or vice-versa.
Such data are consistent with the conventional anatomical view that the ‘natural’ divide between the proximal and distal large intestine occurs between the ascending and descending colon. This finding is contrary to a recent report by Komuro et al. that a breakpoint between the descending and sigmoid colon yields the largest differential expression (Komuro et al., 2005). However, we note that in addition to analyzing this pattern in colorectal cancer specimens, Komuro et al. also chose to include the transverse colon in their analysis. We intentionally excluded tissues from that segment to avoid the possible confounding effect related to the predicted midgut-hindgut fusion point approximately two-thirds the length of the transverse colon.
A second set of 50 transcripts do not display a dichotomous change, but rather show a significant improvement in fit by applying the expression data to a five-segment model supporting a more gradual expression gradient moving along the large intestine from the cecum to the rectum.
These two characteristic expression patterns hint that gene expression along the proximal-distal axis is perhaps coordinated by two underlying systems of organization.
We observed that the majority of differentially expressed transcripts in the adult normal tissues measured here are expressed in a pattern that is consistent with a midgut vs. hindgut pattern of embryonic development. Further, multivariate methods including supervised PCA and canonical variate analysis also suggest that the primary source of variation among these data is explained by the proximal vs. distal divide. In a recent study Glebov et al. found that the number of genes differentially expressed between the ascending and descending colon in the adult is substantially larger than the number of genes likewise identified in 17-24 week old fetal large intestines. Glebov et al. hypothesize that the gene expression pattern of the adult large intestine is possibly set concurrently with expression of the adult colonic phenotype at ˜30 weeks gestation or perhaps even in response to post-natal luminal contents of the gastrointestinal tract. While we did not explore gene expression in the fetal large intestine, we observe patterns of expression in the adult that support an embryonic origin consistent with the midgut-hindgut fusion.
Most of those transcripts that exhibit a gradual expression change between the cecum and rectum exhibit a prototypical pattern of increased expression moving from the cecum to the rectum. This pattern is not observed in the midgut-hindgut differential transcripts where the number of transcripts elevated proximally is approximately equal to the number elevated in the distal region. We propose that the characteristic distally increasing pattern in those transcripts could be a function of extrinsic factors in comparison to the intrinsically defined midgut-hindgut pattern. Such factors could include the effect of luminal contents that move in a unidirectional manner from the cecum to the rectum and/or the regional changes in microflora along the large intestine. Further work will be required to investigate whether such extrinsic controls are working in a positive manner of inducing transcriptional activity or through a reduced transcriptional silencing.
Gene Expression Changes in Concert Along the Large Intestine
To explore the expression of genes in concert along the large intestine, we also applied principal component analysis and profile analysis to these expression data. There is strong evidence for a proximal versus distal gene expression pattern with these multivariate visualization techniques. Furthermore, profile analysis, which simultaneously maximizes inter-segment expression differences while attempting to shrink the intra-segment variance, suggests that the same set of genes that account for the variability between the cecum and the rectum also best separate the individual segments. Though these multivariate results do not exclude a subtle proximal-distal gradient, the apparent bimodal nature of these multivariate plots suggests that the major source of expression variation in these tissues is consistent with a midgut- vs. hindgut-derived pattern.
A Smaller Set of Genes can be Informative
Finally, the sophisticated classification method of support vector machines is used to select a subset of informative probesets that can be used to provide a stable, robust classification of proximal versus distal tissues. Probesets ‘selected’ by the SVM 1002 are a subset of the differential transcripts identified by univariate methods, above. By evaluating this 13-transcript model in the independent validation set, the robustness of these predictors is further demonstrated.
Those skilled in the art will appreciate that the invention described herein is susceptible to variations and modifications other than those specifically described. It is to be understood that the invention includes all such variations and modifications. The invention also includes all of the steps, features, compositions and compounds referred to or indicated in this specification, individually or collectively, and any and all combinations of any two or more of said steps or features.
Conclusions
Our work suggests that transcript abundance, and perhaps transcriptional regulation, follows two broad patterns along the proximal-distal axis of the large intestine. The dominant pattern is a dichotomous expression pattern consistent with the midgut-hindgut embryonic origins of the proximal and distal gut. Transcripts that follow this pattern are roughly equally split into those that are elevated distally and those elevated proximally. The second pattern we observe is characterized by a gradual change in transcript levels from the cecum to the rectum, nearly all of which exhibit increasing expression toward the distal tissues. We propose that transcripts that exhibit the dichotomous midgut-hindgut patterns are likely to reflect the intrinsic embryonic origins of the large intestine while those that exhibit a gradual change reflect extrinsic factors such as luminal flow and microflora changes. Taken together, these patterns constitute a gene expression map of the large intestine. This is the first such map of an entire human organ.
indicates data missing or illegible when filed
Cuff, M. A., D. W. Lambert and S. P. Shirazi-Beechey. 2002. Substrate-induced regulation of the human colonic monocarboxylate transporter, MCT1. J Physiol 539:361-371.
Number | Date | Country | Kind |
---|---|---|---|
60/802312 | May 2006 | US | national |
Filing Document | Filing Date | Country | Kind | 371c Date |
---|---|---|---|---|
PCT/AU07/00703 | 5/22/2007 | WO | 00 | 8/6/2009 |