The present invention relates to nucleic acids and more particularly to methods of equalizing the representation of nucleic acids in a population of nucleic acid molecules.
Approximately 10,000-20,000 genes are thought to be expressed within living cells, depending upon the specific cell type. RNAs corresponding to different genes can be present in different levels in cells. For example, transcripts from as few as 10-15 genes may represent 10-15% of cellular mRNA by mass. In addition to these highly abundant transcripts, another 1000-2000 genes encode moderately abundant transcripts, which can account for up to 50% of cellular mRNA mass. Transcripts from the remaining genes fall into the low abundance class.
Because many genes are identified by isolating complementary DNA (cDNA) corresponding to an RNA sequence, a significant problem can arise because of differences in the levels at which specific RNAs are present in cell types. The most abundant sequences can be repeatedly sampled, while the lowest abundance class may be rarely, if ever, sampled.
Several normalization and subtractive hybridization protocols have been developed to help overcome this problem. These techniques can be technically difficult to perform, and they can fail to detect cDNAs corresponding to rare transcripts.
The invention is based in part on the discovery of novel procedures for equalizing, or normalizing, the representation of nucleic acids in a sample of nucleic acids in which different nucleic acids are initially present in the sample in unequal amounts.
Accordingly, in one aspect the invention provides a method of screening a population of nucleic acid sequences. The method includes providing a population of nucleic acid sequences, partitioning the population into one or more subpopulations of nucleic acids, and identifying a first nucleic acid sequence having an increased level in the subpopulation relative to its level in the starting population of nucleic acids. The first nucleic acid is then compared to a reference nucleic acid sequence or sequences. The absence of the first nucleic acid sequence in the reference nucleic acid or nucleic acid sequences indicates the first nucleic acid is a novel nucleic acid sequence.
The RNA can be derived from a plant, a single-celled animal, a multi-cellular animal, a bacterium, a virus, a fungus, or a yeast. If desired, the RNA can also be partitioned prior to synthesizing cDNA.
Among the advantages of the methods are that they eliminate, or minimize, redundant identification and characterization of identical nucleic acid sequences in a population of nucleic acids.
In some embodiments, the cDNA is synthesized to selectively generate cDNA species that are enriched for those sequences oriented towards the 5′-terminus of the cDNA. In other embodiments, the cDNA is synthesized to enrich for those sequences oriented towards the 3′-terminus of the cDNA.
In some embodiments, the population is normalized by digesting the cDNAs with one or more restriction endonucleases, in different reaction vessels, so as to generate segregated multiple partitions. Preferably, each specific digested cDNA-fragment will occur in only one partition.
In some embodiments, the cDNAs are partitioned by physical methods, which may optionally follow the restriction endonuclease digestion. The physical methods separate the cDNAs as a function of their terminal nucleotide sequences, overall length and migratory pattern on a sizing matrix that possesses the ability to separate molecules as a function of their physical and/or biochemical properties.
In other embodiments, the cDNAs are partitioned during subsequent PCR-based amplification of adapter-ligated cDNA fragments that have been digested with one or more restriction endonucleases.
In other embodiments, the cDNAs are partitioned by screening the original mixture of cDNAs so as to remove those sequences that have already been characterized. Screening occurs using partitioned subtraction, whereby the original cDNAs are brought into contact with a prepared subtraction library of known sequence in such a way that any sequence contained within the original library that is complementary to any element of the subtraction library is removed or suppressed.
cDNA sequences may also be partitioned by determining the size of each cDNA fragment prior to sequencing, or by biasing for the formation of larger-fragment PCR products by lariat formation. In this method, a bias for the larger fragment within the PCR reaction is introduced to allow efficient preferential amplification of longer fragments. Alternatively, partitioning may occur by preferentially amplifying 5′ terminal or 3′ terminal sequences of mRNA molecules.
If desired, the amplified cDNAs may be fractionated by separating the amplified cDNAs on a sizing matrix that separates molecules as a function of their physical and/or biochemical properties and excising individual cDNA fragments from said sizing matrix. The excised cDNA fragments are then inserted into a recombinant vector, or further amplified.
In some embodiments, the restriction endonuclease is a restriction endonuclease that possesses a recognition sequence 4 to 8 basepairs in length and produces either a 5′- or 3′-terminal overhang 0 to 6 basepairs in length.
In some embodiments, the identified sequence is subjected to computational analysis. The computational analysis can include querying, or searching, a nucleotide sequence database to identify sequences that match, or the absence of any sequences that match. The database includes a plurality of known nucleotide sequences of nucleic acids that may be present in the sample.
Preferably, the nucleic acid database comprises substantially all the known, expressed nucleic acid sequences derived from a group comprising a plant, a single-celled animal, a multi-cellular animal, a bacterium, a virus, a fungus, or a yeast.
In some embodiments, sizing includes diluting and re-amplifying the cDNAs, fractionating the re-amplified cDNAs by use of one or more sizing matrixes that separate the molecules as a function of their physical and/or biochemical characteristics, and physically dividing or cutting the sizing matrixes into a plurality of sections, wherein each section is comprised of one or more cDNAs of similar molecular weight or size. The cDNAs are eluted from each of the sizing matrix sections, ligated into a cloning vector and transformed into a host, e.g. a bacterial host. A plurality of the transformed host colonies are selected so as to ensure a statistically-accurate representation of the cDNAs originally contained within the sizing matrix sections. The inserts from this plurality of colonies are recovered and their molecular weight or size is determined. A plurality of insert DNAs is then identified, wherein each successive insert has a molecular weight or size that is within a 0.2 basepair window, and only those DNA species that fall within the 0.2 basepair window are subsequently subjected to nucleotide sequencing.
As utilized herein, the term “normalized” is defined as a mixture of mRNAs (or cDNAs thereof) in which the copy number of highly abundant mRNA species is reduced relative to its copy number in a starting population of nucleic acids, and the copy number of a less abundant mRNA species has been enriched relative to the copy number of the latter mRNA in the starting population.
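The definition above can be illustrated with a minimal sketch. It assumes that "copy number" is compared as a fraction of the total population before and after treatment, which is one possible reading of the definition; the gene names and counts are hypothetical and serve only as an illustration.

```python
def relative_abundance(counts):
    """Convert raw copy numbers into fractions of the whole population."""
    total = sum(counts.values())
    if total <= 0:
        raise ValueError("population must contain at least one molecule")
    return {name: n / total for name, n in counts.items()}


def is_normalized(start, final, abundant, rare):
    """True if the abundant species fell and the rare species rose (as fractions).

    Assumption: 'reduced' and 'enriched' are judged on relative abundance.
    """
    s, f = relative_abundance(start), relative_abundance(final)
    return f[abundant] < s[abundant] and f[rare] > s[rare]


# Hypothetical copy numbers before and after a partitioning step.
starting = {"geneA": 9000, "geneB": 990, "geneC": 10}
treated = {"geneA": 500, "geneB": 300, "geneC": 200}
print(is_normalized(starting, treated, abundant="geneA", rare="geneC"))
```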
Among the advantages provided by the present invention is that multiple partitioning strategies function in a synergistic manner so as to ameliorate unnecessary, redundant sequencing of the same sequence(s), while concomitantly enhancing the sequencing of rarer sequences.
The partition strategies disclosed herein also normalize cDNA abundance by separating the cDNA sequences into multiple partitions possessing minimal sequence overlap. In addition, the various partitioning strategies are performed so as to assure that substantially all cDNAs are sampled. An additional normalization effect may be obtained by separating the resulting DNA fragments based upon their overall size (i.e., size fractionation). Moreover, it is also possible to normalize the abundance of the cDNAs to an even greater degree by the use of one of several disclosed pre-characterization methods.
All technical and scientific terms used herein have the same meanings commonly understood by one of ordinary skill in the art to which this invention belongs. Although any methods and materials similar or equivalent to those described herein can be used in the practice of the present invention, the preferred methods and materials are now described. The citation or identification of any reference within this application shall not be construed as an admission that such reference is available as prior art to the present invention. All publications mentioned herein are incorporated herein in their entirety by reference.
The present invention provides methods for identifying nucleic acids in a population of nucleic acid samples. It is based in part on normalizing the representation of sequences that may be initially present in different levels in the population of nucleic acid sequences. The normalization takes place by one or more methods of partitioning the nucleic acid population.
A schematized overview of the invention is shown in
Any population of RNA molecules can be used as long as the population contains, or is suspected of containing, two or more distinct RNA molecules. The population can be isolated from a starting sample using standard methods for isolating RNA. The RNA population can be isolated from, e.g., an entire organism or multiple organisms, or from a tissue or cell of an organism. The RNA can also be isolated from, e.g., cultured cells, such as eukaryotic or prokaryotic cells grown in vitro. If desired, the RNA can be mRNA (e.g., polyA+ RNA), or stable RNAs (e.g., ribosomal RNA, transfer RNA, or small nuclear RNA). The input RNA or cDNA can be a subpopulation containing the 5′ end of RNA molecules (110), a subpopulation having internal regions of starting RNA molecules (112), or subpopulations containing the 3′ end of the cDNA molecules (114).
The selected population or subpopulation is next subjected to a normalization analysis (200). The normalization analysis includes one or more partitioning steps that decrease the relative amount of sequences that are abundant in the starting population of nucleic acids and increase the relative representation of sequences that are rare in the starting population of nucleic acids. A partitioning step can take place before or after mRNA is converted to cDNA. A partitioning step can also take place following amplification of a cDNA. Unless stated otherwise, any partitioning method described herein can be used in conjunction with one or more additional partitioning methods. Examples of suitable partitioning steps are provided below.
In some embodiments, cDNA molecules are subjected to digestion with restriction enzymes, after which adapter oligonucleotides are ligated to the digestion products, and the resulting products amplified.
The amplified products are next sequenced (300). Sequencing can be performed by any method known in the art. The compiled sequence data are then assembled (400), and the sequence generated is compared to known sequences, e.g., sequences in publicly available databases.
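The comparison step can be sketched as follows. This is a minimal illustration that assumes the known sequences are held in memory and that a match means an exact substring on either strand; a practical search would more likely use an alignment tool against a public database. The function names and toy sequences are hypothetical.

```python
from typing import Sequence


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(seq.upper()))


def is_novel(query: str, references: Sequence[str]) -> bool:
    """True if neither the query nor its reverse complement occurs in any reference."""
    query = query.upper()
    rc = reverse_complement(query)
    return not any(query in ref.upper() or rc in ref.upper() for ref in references)


# Hypothetical toy reference set, for illustration only.
known = ["ATGGCCATTGTAATGGGCCGC", "TTGACCGGTAAC"]
for candidate in ("GCCATTGTA", "GTTACCGGT", "CCCCCCCCC"):
    print(candidate, "novel" if is_novel(candidate, known) else "already known")
```

Checking both strands matters because a cDNA fragment may be sequenced in either orientation.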
The methods herein described are therefore useful for identifying genes, e.g., expressed genes in an organism of interest, e.g., a human. The sequence information obtained is particularly useful for identifying genes transcribed at low levels, or generating low levels of steady state transcripts. The methods can also be used, e.g., to identify secreted proteins for potential therapeutic use and/or for drug targets; identify variations within the human genome, such as single nucleotide polymorphisms (SNPs); identify differences between normal and diseased tissue; and analyze differential gene expression in different tissues and/or species.
Partitioning Prior to cDNA Synthesis
One approach to normalize levels of mRNA from a given sample, e.g. a given cell or tissue type, is to arbitrarily separate a starting population of RNA molecules into many smaller subpopulations, or collections. In general, a greater number of partitions increases the likelihood that a given partition will lack a sequence or sequences that are abundant in the starting population of nucleic acid sequences. This method therefore allows for access to sequences that are expressed in very low copy number.
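The effect of the number of partitions can be sketched numerically. The model below is an assumption made for illustration, not part of the described method: each copy of a transcript is assigned independently and uniformly at random to one of the partitions, so the probability that a given partition contains no copy is (1 - 1/k)^c for c copies and k partitions.

```python
def prob_partition_lacks(copies: int, partitions: int) -> float:
    """Probability that one given partition receives none of `copies` molecules.

    Assumes independent, uniform random assignment of each molecule.
    """
    if copies < 0 or partitions < 1:
        raise ValueError("copies must be >= 0 and partitions >= 1")
    return (1.0 - 1.0 / partitions) ** copies


# Hypothetical transcript present at 50 copies in the starting sample.
for k in (10, 100, 1000):
    print(k, "partitions:", round(prob_partition_lacks(copies=50, partitions=k), 4))
```

Under this model the probability rises steadily with the number of partitions, which is consistent with the qualitative statement above.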
Alternatively, RNA populations can be isolated from different cell types. This partitioning strategy is based on the premise that different tissues tend to express different subsets of genes. Thus, RNA sequences can be partitioned by sequencing multiple different cDNA libraries extracted from one or more tissues within the body. However, the partitioning will not typically be complete, because many genes are expressed in more than one tissue type.
Synthesis and Amplification of cDNA Molecules
Typically, partitioning is performed on cDNA populations that have been modified for subsequent analysis. The modifications may include: (i) digesting the cDNA with at least one restriction endonuclease; (ii) ligating an adapter oligonucleotide to one or more of the termini of the digestion products; and (iii) amplifying the ligated products, e.g., in PCR-mediated amplification. These methods are particularly suited to cDNA molecules that have been constructed from the 5′, internal, and 3′ subpopulations of RNA molecules as described above. These manipulations are collectively known as SeqCalling™ chemistry. In preferred embodiments, cDNA is generated from populations of RNA molecules that have been divided into subpopulations containing 5′ ends of transcripts, populations of molecules containing internal regions of RNA molecules, or subpopulations containing 3′ ends of RNA molecules.
A. Construction and Amplification of cDNA Subpopulation Enriched for the 5′ Ends of mRNA Molecules
5′-enriched cDNA synthesis generates cDNA species that are enriched for those sequences oriented towards the 5′-terminus of the cDNA, and in which a specific oligonucleotide sequence is ligated to the 5′-terminus. Approaches for generating cDNAs specifically enriched in transcript 5′ ends are often based on the synthesis of a homopolymeric (e.g., dG or dA) tail by the enzyme terminal deoxynucleotidyl transferase (TdT) subsequent to the synthesis of the first cDNA strand. Second strand synthesis is then primed by the use of a complementary homo-oligonucleotide primer sequence. See e.g., Frohman, et al., 1988. Proc. Natl. Acad. Sci. USA 85: 8998-9002; Delort, et al., 1989. Nucl. Acids Res. 17: 6439-6448; Loh, et al., 1989. Science 243: 217-220; Belyavsky, et al., 1989. Nucl. Acids Res. 17: 2919-2932; Ohara, et al., 1989. Proc. Natl. Acad. Sci. USA 86: 5673-5677.
Alternatively, amplification can exploit the 5′-terminal cap structure present in eukaryotic mRNAs (see e.g., Furuichi & Miura, 1975. Nature 253: 374-375; Banerjee, 1980. Microbiol. Rev. 44: 175-205; Shatkin, 1985. Cell 40: 223-224). However, mRNA preparations generally include a mixture of both capped and non-capped mRNA species. The non-capped mRNAs are thought to be primarily the result of degradation within the cell or during the isolation procedure. An alternative approach to enrich for full-length mRNAs is to purify capped mRNA using affinity reagents. These reagents include naturally occurring proteins that bind the cap structure (see e.g., Edery, et al., 1995. Mol. Cell. Biol. 15: 3363-3371); anti-cap antibodies (see e.g., Bochnig, et al., 1987. Eur J Biochem. 68: 460-467); and chemical modification of the cap, followed by selection for the modified cap structure (see e.g., Carninci, et al., 1996. Genomics 37: 327-336). In addition, 5′-oligo capping can also be used, in which specific oligonucleotide sequences are selectively added to 5′-capped mRNAs prior to first strand cDNA synthesis. Subsequent synthesis of the second strand is primed by an oligonucleotide that is complementary to the modified cap sequence. See e.g., Maruyama & Sugano, 1994. Gene 138: 171-174; Suzuki, et al., 1997. Gene 200: 149-156; Fromont-Racine, et al., 1993. Nucl. Acids Res. 21: 1683-1684; U.S. Pat. No. 5,597,713.
An alternative method for isolating RNA molecules containing a capped 5′ end is shown in
In the first step in
In the third step in
Although first strand synthesis occurs for both types of mRNA species (i.e., full-length and truncated/degraded), only those mRNAs with the appropriate sequence ligated to the 5′-terminus (i.e., full-length mRNAs) contain a priming site for subsequent second strand synthesis. Thus, cDNAs derived from the full-length mRNAs are selectively amplified.
Preferably, a thermostable enzyme for second strand synthesis in a non-thermal cycled temperature profile is used to ensure more stringent priming of the second strand reaction compared to a non-thermostable enzyme.
A double-stranded cDNA prepared with an adapter containing an oligonucleotide sequence (nR plus “signature sequence”) ligated to the 5′-terminus is digested with a restriction endonuclease as shown in
A representation of the distribution of clones derived using 5′ enriched synthesis with respect to the region of the gene they include is shown in
B. Construction and Amplification of cDNA Subpopulations Enriched for the Interior Regions of RNA Molecules
To generate relatively short cDNA fragments from the interior regions of an RNA molecule, i.e., from a region not containing the 5′ or 3′ terminus, the following procedure is used.
RNA is purified using any standard procedure (see e.g., Berger, 1987. Methods Enzymol. 152: 215-219) and cDNA is synthesized according to standard protocols, such as random oligomer or oligo-dT primed synthesis (see, e.g., Gubler & Hoffman, 1983, Gene 25: 263-269, Okayama & Berg, 1982, Mol. Cell Biol. 2: 161-170).
The cDNA is initially digested with a pair of restriction endonucleases. Although any enzyme pair that generates distinct 5′-terminus overhangs is acceptable, a preferred embodiment utilizes enzymes that possess a 4-8 basepair (bp) recognition site yielding a 0-6 bp 5′-terminal overhang, and a more preferred embodiment utilizes enzymes that possess a 6 bp recognition sequence and generate a 4 bp 5′-terminus overhang. One form of manipulation for generating internal fragments is shown in
Two suitable 24 nucleotide adapter molecules can be generated from RA24 [SEQ ID NO:9]; RC24 [SEQ ID NO:10]; JA24 [SEQ ID NO:11]; or JC24 [SEQ ID NO:12]. The adapters are generated by annealing the RA24, RC24, JA24 or JC24 24-mer oligonucleotides [SEQ ID NOs:9-12, respectively] with 12-mer oligonucleotides possessing sequences that are complementary to the last 8 nt of the 3′-terminus of the 24-mer and the 4 bp overhang. The sequences of these primers and other primers described herein are provided in Table 1.
These 4 bp overhang sequences are chosen so as to be complementary to the overhangs that are generated by the restriction endonuclease digestions. In addition, the last 3′-terminal nucleotide of the 24-mer adapter (i.e., A or C) is selected such that a functional restriction endonuclease recognition site is not re-generated when the adapter anneals to the digested cDNA.
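The choice of the last 3′-terminal adapter nucleotide can be checked with a short sketch. The example assumes BamHI (recognition sequence GGATCC, cut after the first G to leave a GATC 5′ overhang) purely for illustration, since no particular enzyme is named here; the adapter sequences are hypothetical and are not the RA24, RC24, JA24 or JC24 sequences. Only the top strand is modeled: the adapter's 3′ end is joined directly to the fragment's 5′ end, which begins with the overhang.

```python
def regenerates_site(adapter_top: str, fragment_top: str, site: str) -> bool:
    """True if joining adapter 3' end to fragment 5' end recreates `site` across the junction."""
    junction = len(adapter_top)
    joined = (adapter_top + fragment_top).upper()
    site = site.upper()
    # Only windows that include bases from both sides of the junction are examined.
    first = max(0, junction - len(site) + 1)
    last = min(junction - 1, len(joined) - len(site))
    return any(joined[i:i + len(site)] == site for i in range(first, last + 1))


BAMHI = "GGATCC"                 # assumed example enzyme
fragment = "GATCCTTAGC"          # hypothetical digested end: overhang GATC, then insert
for adapter in ("ACGTTGCAG", "ACGTTGCAA", "ACGTTGCAC"):  # hypothetical adapter 3' ends
    print(adapter[-1], regenerates_site(adapter, fragment, BAMHI))
```

In this toy case an adapter ending in G restores the site, while one ending in A or C does not, so ligated products ending in A or C would not be re-cut by enzyme still present in the reaction.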
Following ligation of the adapters, the restriction endonucleases are heat-inactivated, and the reaction mixture is PCR amplified.
Internal fragments may alternatively be generated using a second type of adapter, which results in longer amplified fragments (also referred to as “Long Internal Chemistry” or “Long Chemistry”). This method is similar to short chemistry, except that all adapters possess an additional common sequence on their 5′-termini. This technique suppresses the amplification of small fragments while concomitantly increasing the amplification of longer fragments. The subsequent PCR amplification with the “X” and “J” primers results in production of both a “hetero” adapter fragment (i.e., “RX-JR”) and “homo” adapter fragments (i.e., “RX-XR” and “RJ-JR”), which are unstable in a host and are rarely observed following the cloning process.
The effectiveness of enriching for internal fragments is shown in
Following digestion, adapters are ligated to these 5′-terminal overhangs. The primers are longer relative to primers used to generate short fragments. Two specific pairs of adapter molecules that can be used in long chemistry synthesis include RXC [SEQ ID NO:2]; RXA [SEQ ID NO:3]; RJC [SEQ ID NO:4]; or RJA [SEQ ID NO:5]. The adapters are generated by annealing RXC, RXA, RJC or RJA oligonucleotides [SEQ ID NOs:2-5, respectively] with 12-mer oligonucleotides possessing sequences that are complementary to the last 8 nt of the 3′-terminus of the 24-mer and the 4 bp overhang. These 4 bp overhang sequences are chosen so as to be complementary to the overhangs that are generated by the restriction endonuclease digestions. In addition, the last 3′-terminal nucleotide of the 24-mer adapter (i.e., A or C) is selected such that a functional restriction endonuclease recognition site is not re-generated when the adapter anneals to the digested cDNA.
Following the ligation of the adapters, the restriction endonucleases are heat inactivated and the reaction mixture is PCR amplified. While the sequences of the two adapters are distinct, they nevertheless possess common 5′ sequences that allow the formation of lariat or pan-handle structures that function to suppress PCR-mediated amplification of the shorter fragments.
C. cDNA Synthesis of Molecules Enriched for 3′ Ends
3′-enriched cDNA synthesis generates cDNAs that are enriched for the sequences oriented towards the 3′-terminus of the cDNA. This is accomplished by synthesis of the first strand using a specific oligonucleotide sequence that has been modified to contain an adapter sequence at its 5′-terminus [SEQ ID NO:14]. Following first-strand cDNA synthesis with the primer, standard cDNA synthesis protocols are utilized as illustrated in
The 3′-enriched cDNA is digested with one restriction endonuclease. Although any enzyme that generates a distinct 5′-terminus overhang is acceptable, it is generally most preferred to utilize an enzyme that possesses a 6 bp recognition site yielding a 4 bp 5′-terminal overhang. Following digestion, an adapter is then ligated to these 5′-terminal overhangs. These adapters are generated from the JA24 [SEQ ID NO:11] or JC24 [SEQ ID NO:12] 24-mer annealed with 12-mer oligonucleotides possessing sequences that are complementary to the last 8 nt of the 3′-terminus of the 24-mer and the 4 bp overhang. These 4 bp overhang sequences are chosen so as to be complementary to the overhangs that are generated by the restriction endonuclease digestions. In addition, the last 3′-terminal nucleotide of the 24-mer adapter (i.e., A or C) is selected such that a functional restriction endonuclease recognition site is not re-generated when the adapter anneals to the digested cDNA.
Following the ligation of the adapters, the restriction endonucleases are heat inactivated and the reaction mixture is PCR amplified.
Longer fragments enriched for the 3′-ends can be obtained by ligating a longer primer to cDNA molecules that have been digested with a restriction enzyme. Any enzyme that generates a distinct 5′-terminus overhang can be used. It is generally preferred to utilize an enzyme that possesses a 6 bp recognition site yielding a 4 bp 5′-terminal overhang. Following digestion, an adapter is then ligated to the 5′-terminal overhangs. Acceptable adapters are generated from the JA24 [SEQ ID NO:11] or JC24 [SEQ ID NO:12] 24-mer annealed with 12-mer oligonucleotides possessing sequences that are complementary to the last 8 nt of the 3′-terminus of the 24-mer and the 4 bp overhang. These 4 bp overhang sequences are chosen so as to be complementary to the overhangs that are generated by the restriction endonuclease digestion. In addition, the last 3′-terminal nucleotide of the 24-mer adapter (i.e., A or C) is selected such that a functional restriction endonuclease recognition site is not regenerated when the adapter anneals to the digested cDNA.
While the sequences of the two adapters are distinct, they possess common 5′ sequences that allow the formation of structures that suppress PCR-mediated amplification of the shorter fragments.
Following the ligation of the adapters, the restriction endonucleases are heat inactivated and the reaction mixture is PCR amplified.
The cDNA fragments prepared as above can be size-fractionated, e.g., by electrophoretic fractionation on agarose or polyacrylamide gels, or other types of gels comprised of a similar material. The cDNA fragments may then be physically excised in defined size ranges (i.e., as identified by size markers) and recovered from the excised gel fragments. Additionally, if the quantities of isolated cDNA fragments are low, they can be amplified, e.g., by PCR amplification. For example, if the cDNA fragments are generated by Long Internal SeqCalling™ Chemistry protocol, they are amplified with J23 [SEQ ID NO:6] and X22 [SEQ ID NO:15] primers (either before or after fractionation) prior to cloning, as these cDNAs cannot be efficiently cloned into E. coli. Similarly, if the cDNA fragments are generated by Long 5′ SeqCalling™ Chemistry protocol, they can be amplified by J23 [SEQ ID NO:6] and RS [SEQ ID NO:1] oligonucleotides (either before or after fractionation) prior to cloning, as these products cannot be efficiently cloned into E. coli.
When PCR amplification is used to amplify fragments, conditions are preferentially chosen to minimize non-productive hybridization events. It has been observed that DNA re-hybridization during the PCR amplification process (designated the “Cot effect”; see e.g., Mathieu-Daude, et al., 1996. Nucl. Acids Res. 24: 2080-2084) can inhibit amplification. This effect is particularly evident during later PCR amplification cycles, when a substantial concentration of the amplified product has accumulated and the primer concentration has been depleted. As a result, amplification in the later PCR cycles typically follows non-linear dynamics.
By manipulating PCR amplification reaction conditions, it is possible to markedly enhance the “Cot effect” by inserting a slow-annealing step between the denaturation and re-naturation steps in each PCR amplification cycle. The slow-annealing temperature is chosen so as to be above the primer-template melting temperature (Tm), but at or below the template-template Tm, thus favoring template-template annealing over template-primer annealing. For example, an 85-75° C. decrease in temperature at a 10° C./minute gradient can be utilized.
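The example ramp can be tabulated with a small sketch. The code assumes a linear ramp and a 10-second reporting interval; both the function name and the interval are illustrative choices rather than parameters of the described protocol.

```python
def ramp_schedule(start_c: float, end_c: float, rate_c_per_min: float, step_s: float = 10.0):
    """Return (seconds, temperature in deg C) points for a linear cooling ramp."""
    if rate_c_per_min <= 0 or step_s <= 0 or start_c < end_c:
        raise ValueError("expected a cooling ramp with positive rate and step")
    total_s = (start_c - end_c) / rate_c_per_min * 60.0
    points = []
    t = 0.0
    while t < total_s:
        points.append((t, start_c - rate_c_per_min * t / 60.0))
        t += step_s
    points.append((total_s, float(end_c)))  # always report the end of the ramp
    return points


# The 85 to 75 deg C ramp at 10 deg C per minute from the example above.
for seconds, temp in ramp_schedule(85, 75, 10):
    print(f"{seconds:5.0f} s  {temp:5.1f} C")
```

At this rate the slow-annealing step adds one minute to every cycle, which is worth accounting for when estimating total run time.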
Partitioning Methods
One or more of the following techniques, or combinations of these techniques, can be used to normalize the abundance of RNA (or their cDNA counterpart) species within a given cell or tissue sample.
(i) Partitioning by Restriction Endonuclease Digestion
A cDNA library can be partitioned into many different sets of fragments by digestion with different restriction enzyme pairs. Fragmentation of the same cDNA library with different sets of restriction enzymes, in different reaction vessels, results in segregated multiple partitions, i.e., each specific fragment will occur in only one partition. The digested fragments can be analyzed further, e.g., by direct sequencing, by cloning of the digested fragments followed by sequencing, or by a combination of these techniques.
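Partitioning by enzyme pair can be sketched in silico. The code below makes several simplifying assumptions that are labeled here and in the comments: each enzyme is treated as cutting at the first base of its recognition sequence, only the top strand is modeled, and a fragment belongs to a partition when it lies between adjacent sites of the two different enzymes of a pair. The enzymes (BamHI, EcoRI, HindIII) and the sequence are illustrative choices, not ones named in this description.

```python
import re


def site_positions(seq: str, site: str):
    """Start positions of every occurrence of `site` (overlaps allowed)."""
    return [m.start() for m in re.finditer(f"(?={re.escape(site)})", seq)]


def partition_fragments(seq: str, site_a: str, site_b: str):
    """Fragments bounded by adjacent sites of two different enzymes.

    Simplification: the cut is placed at the start of each recognition sequence.
    """
    cuts = sorted([(p, "A") for p in site_positions(seq, site_a)] +
                  [(p, "B") for p in site_positions(seq, site_b)])
    return [seq[p1:p2] for (p1, e1), (p2, e2) in zip(cuts, cuts[1:]) if e1 != e2]


BAMHI, ECORI, HINDIII = "GGATCC", "GAATTC", "AAGCTT"
cdna = "TTGGATCCAAAAGAATTCCCCCAAGCTTGGGGGATCCTT"   # hypothetical sequence

print("BamHI/EcoRI  :", partition_fragments(cdna, BAMHI, ECORI))
print("BamHI/HindIII:", partition_fragments(cdna, BAMHI, HINDIII))
```

Because each enzyme pair defines a different set of fragment boundaries, the two digests of the same molecule yield different fragments, which is the basis for placing them in separate partitions.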
If desired, the cDNA is digested into fragments of a length that is convenient for sequencing. Preferably, multiple different partitions, e.g., 10-100, 20-750, or 50-250 partitions are obtained.
(ii) Partitioning by Fragment Size or Other Physical Property
Partitioning can also be performed using other separation methods that separate DNA molecules according to their physical characteristics. The methods can include, e.g., separation based on physical and/or biochemical properties (i.e., molecular weight/size, terminal nucleotide sequences, exact migratory pattern, and the like). Separation methods can include, e.g., gel electrophoresis, including agarose or polyacrylamide gel electrophoresis, high pressure liquid chromatography (HPLC), preparative-scale capillary electrophoresis, and similar methodologies.
In one embodiment, unique cDNAs that represent unique (i.e., not previously sequenced) fragments are selected based on their presence in a characteristic restriction enzyme fragment. In this process, a cDNA population is digested with restriction endonucleases, fractionated, and fragments in a desired size range are recovered. The recovered fragments are then ligated to a vector and transformed into an appropriate host, e.g., E. coli. Rather than being directly sequenced following the selection process, the DNA fragments are isolated and separated, e.g., sized using one or more sizing matrixes that separate the molecules as a function of their physical or biochemical properties. The embodiment is thus referred to as “clone sizing”. Those recombinant clones that have an insert with characteristics not present in a reference database are determined to contain a unique DNA fragment. Preferably, only unique fragments are subsequently sequenced.
For example, a DNA fragment that is sized in this way possesses two pieces of information that serve as a unique identifier: (i) the identity of the restriction endonuclease used to generate the fragment, and (ii) the size of the fragment. With these two pieces of information, fragments are picked for subsequent nucleotide sequencing by searching for a specific fragment within a 0.2 basepair window. If a fragment is present in the window, the E. coli clone containing the fragment is re-arrayed on a liquid handling robot such as a Tecan Genesis or Packard Multiprobe device, and sequenced. When multiple fragments are present within the 0.2 bp window, only one is selected to be sequenced. Thus, by use of this sizing filter, sequencing of identical fragments is significantly lowered.
By sizing individual fragments and comparing the observed size to previously determined sequences, i.e., using a “sizing filter”, only fragments of unique lengths need to be sequenced.
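A sizing filter of this kind can be sketched as follows. The sketch assumes that the 0.2 bp window means an absolute size difference of at most 0.2 bp between fragments generated with the same enzyme(s), and that clones are considered in the order given; both are interpretations made for illustration. The clone names, enzyme labels and sizes are hypothetical.

```python
def select_for_sequencing(clones, known=(), window_bp=0.2):
    """Pick one clone per (enzyme, size window) not already represented.

    clones: iterable of (clone_id, enzyme, size_bp)
    known:  iterable of (enzyme, size_bp) for previously sequenced fragments
    """
    seen = {}  # enzyme label -> sizes already represented
    for enzyme, size in known:
        seen.setdefault(enzyme, []).append(size)
    chosen = []
    for clone_id, enzyme, size in clones:
        sizes = seen.setdefault(enzyme, [])
        if any(abs(size - s) <= window_bp for s in sizes):
            continue  # a fragment in this window is already represented
        sizes.append(size)
        chosen.append(clone_id)
    return chosen


clones = [
    ("c1", "BamHI/EcoRI", 152.3),
    ("c2", "BamHI/EcoRI", 152.4),    # same window as c1
    ("c3", "BamHI/EcoRI", 310.0),    # matches a previously sequenced fragment
    ("c4", "BglII/HindIII", 152.3),  # same size as c1 but different enzymes
]
print(select_for_sequencing(clones, known=[("BamHI/EcoRI", 310.1)]))
```

Keying the comparison on both the enzyme identity and the size reflects the two pieces of information that together identify a fragment.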
To pre-size large numbers of fragments, the fragments can be initially pooled as a function of their expected size, so as to ensure that any fragment occurs in at least three individual pools.
Size fractionation may be accomplished in a number of ways. One commonly utilized method is electrophoretic fractionation on agarose or polyacrylamide gels, or other types of gels comprised of a similar material. The cDNA fragments may then be physically excised in defined size ranges (i.e., as identified by size markers) and recovered from the excised gel fragments. Additionally, if the quantities of isolated cDNA fragments are low, they can be PCR amplified at this stage. For example, if the cDNA fragments are generated by Long Internal SeqCalling™ Chemistry protocol, described above, they must be amplified with J23 and X22 primers (either before or after fractionation) prior to cloning, as these cDNAs cannot be efficiently cloned into E. coli. Similarly, if the cDNA fragments are generated by Long 5′ SeqCalling™ Chemistry protocol, described above, they must be amplified by J23 and RS oligonucleotides (either before or after fractionation) prior to cloning, as these products cannot be efficiently cloned into E. coli.
(iii) Partitioning Based on Hybridization
Screening can be performed using a variety of methods that rely on hybridization between a probe sequence or sequences and a cDNA library. Members of the library containing a homologous sequence are then removed from the library. For example, a cDNA library can be brought into contact with a prepared library of known sequence in such a way that any sequence contained within the substrate library that is complementary to any element of the subtraction library is removed or suppressed. This method obviates re-characterizing, e.g., re-sequencing, already characterized members of the cDNA population.
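The logic of subtraction can be sketched computationally. The code below is an in silico analogy only: it assumes that "hybridizes" can be approximated by sharing an exact stretch of at least k bases on either strand, which is far cruder than real hybridization behavior. The function names, the value of k and the sequences are hypothetical.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(seq.upper()))


def subtract(library, subtraction_library, k=8):
    """Keep library members sharing no k-mer (either strand) with the subtraction library."""
    probe_kmers = set()
    for s in subtraction_library:
        s = s.upper()
        for strand in (s, reverse_complement(s)):
            for i in range(len(strand) - k + 1):
                probe_kmers.add(strand[i:i + k])
    kept = []
    for seq in library:
        u = seq.upper()
        if not any(u[i:i + k] in probe_kmers for i in range(len(u) - k + 1)):
            kept.append(seq)
    return kept


library = ["ATGGCCATTGTAATGG", "CCATTACAATGGCCAT", "TTTTTTTTCCCCCCCC"]
already_characterized = ["GGCCATTGTAAT"]
print(subtract(library, already_characterized))
```

Both strands of each subtraction sequence are indexed so that a library member is removed regardless of the orientation in which it was cloned.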
(iv) Amplification-Associated Partitioning
Partitioning can also be performed in association with amplification. In particular, partitioning can be carried out during PCR amplification of adapter-ligated cDNA fragments described above. During PCR-mediated amplification of mixtures of cDNA fragments, short fragments tend to be preferentially amplified relative to large fragments. PCR conditions can be adjusted to favor the formation of larger fragments within the PCR reaction to allow efficient preferential amplification of longer fragments.
Normally, two different primers are used in PCR amplification to prime the enzymatic activity of the polymerase at each terminus of the target sequence. Conversely, if primers with identical 5′ sequences are used, there is a tendency for the fragments to form lariat or pan-handle structures, due to intra-strand hybridization, which interferes with the amplification process. Because the probability of the two ends of a polymer (i.e., cDNA fragment) finding one another is inversely proportional to a fractional power of the polymer length, short fragments tend to form these lariat structures more readily than do longer ones. Accordingly, this effect is exploited in the amplification of long cDNA fragments. See U.S. Pat. No. 5,565,340, whose disclosure is incorporated herein by reference, in its entirety.
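By way of non-limiting illustration, the length dependence described above can be expressed numerically as follows. The source states only that the end-to-end encounter probability varies inversely with a fractional power of the fragment length; the exponent of 3/2 used here is the classical value for an ideal flexible chain and is an assumption, as is the function name. Short, stiff DNA fragments may deviate from this idealized scaling.

```python
def relative_closure_probability(length_bp, reference_length_bp, exponent=1.5):
    """Ratio of end-to-end encounter probabilities, P(length) / P(reference).

    Assumes P(L) is proportional to L ** (-exponent). The default exponent
    of 1.5 is an ideal-chain assumption, not a value taken from the source.
    """
    if length_bp <= 0 or reference_length_bp <= 0:
        raise ValueError("lengths must be positive")
    return (reference_length_bp / length_bp) ** exponent


# Under the assumed scaling, a 400 bp fragment closes on itself far less
# readily than a 100 bp fragment, so it is less strongly suppressed.
ratio = relative_closure_probability(400, 100)
```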
Long fragment amplification can be enhanced using DNA fragments to which long adapter sequences have been ligated, as described above. Amplification is dependent upon a number of factors that can alter the ratio of a linear adapter structure, which is permissive for amplification, to a lariat-loop structure, which suppresses amplification. The equilibrium constant associated with the formation of the suppressive and the permissive structures, and, therefore, the efficiency of suppression of particular DNA fragments during PCR, is primarily a function of the following factors: (i) differences in melting temperature of suppressive and permissive structures; (ii) position of the primer sequence within the adapter; (iii) the length of the target DNA fragments; (iv) PCR primer concentration; and (v) primary structure.
Analysis of Partitioned cDNA Molecules
Partitioned cDNA molecules are next analyzed by comparing the sequences to a reference nucleic acid or nucleic acids. To facilitate analysis of partitioned cDNA molecules, they can, if not subcloned previously, be ligated into an appropriate vector and transformed into cells by any applicable method.
The reference nucleic acid or nucleic acids can be any fragment for which sufficient information is available to unambiguously identify the partitioned cDNA molecule. The reference nucleic acid or nucleic acids can therefore be part of, e.g., sequence databases, or databases of other characteristics that unambiguously identify a nucleic acid. Examples of such characteristics include, e.g., a compilation of fragment sizes associated with specific restriction enzymes for a particular gene. In some embodiments, partitioned nucleic acids will be sequenced. The partitioned sequences can be sequenced by any method known in the art, and the resulting sequence data are analyzed by computer-based systems.
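By way of non-limiting illustration, a comparison against a compilation of restriction fragment sizes can be sketched as follows. The function names, the reference entries "geneA" and "geneB", and the tolerance are hypothetical. The example uses the EcoRI recognition site GAATTC, which is cut after the first G, and treats each sequence as a single linear strand.

```python
def digest_fragment_sizes(sequence, site, cut_offset):
    """Fragment sizes (bp) after cutting a linear sequence at every site.

    cut_offset is the position of the cut within the recognition site,
    e.g. 1 for EcoRI (G^AATTC).
    """
    seq, site = sequence.upper(), site.upper()
    cuts = []
    start = seq.find(site)
    while start != -1:
        cuts.append(start + cut_offset)
        start = seq.find(site, start + 1)
    bounds = [0] + cuts + [len(seq)]
    return [b - a for a, b in zip(bounds, bounds[1:]) if b > a]


def matching_references(observed_sizes, reference_fingerprints, tolerance_bp=2):
    """Names of reference entries whose sorted fragment sizes match the observed ones."""
    obs = sorted(observed_sizes)
    hits = []
    for name, ref_sizes in reference_fingerprints.items():
        ref = sorted(ref_sizes)
        if len(ref) == len(obs) and all(
            abs(o - r) <= tolerance_bp for o, r in zip(obs, ref)
        ):
            hits.append(name)
    return hits


fragment = "AAAA" + "GAATTC" + "CCCCCC" + "GAATTC" + "TT"
sizes = digest_fragment_sizes(fragment, "GAATTC", cut_offset=1)
reference = {"geneA": [5, 7, 12], "geneB": [10, 14]}  # hypothetical entries
hits = matching_references(sizes, reference, tolerance_bp=0)
```

An empty list of hits in such a comparison would indicate that the fragment is not represented in the reference compilation.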
Suitable databases include publicly available databases that comprehensively record all observed DNA sequences. Such databases include, e.g., GenBank from the National Center for Biotechnology Information (Bethesda, Md.), the EMBL Data Library at the European Bioinformatics Institute (Hinxton Hall, UK) and databases from the National Center for Genome Research (Santa Fe, N. Mex.). However, any database containing entries for the sequences likely to be present in a sample to be analyzed is usable in the further steps of the computer methods. Methods of searching databases are described in detail in, e.g., U.S. Pat. No. 5,871,697, whose disclosure is incorporated herein by reference, in its entirety.
Table 1 below summarizes the various primers and adapters disclosed herein.
The invention will be further described in the following examples, which do not limit the scope of the invention described in the claims. Examples 1-6 collectively describe the synthesis and amplification of cDNA subfractions enriched for the 5′ terminal sequences of mRNA molecules. Example 7 describes clone sizing.
For each reaction, 2.5 μg mRNA (do not exceed 3 μg total) is added to H2O so as to provide a total volume of 73.5 μl. This mixture is then heated to 65° C. for 10 minutes, and quick-cooled on ice. The CLAP Cocktail (see below) is made as follows:
For each reaction:
For each reaction:
For each reaction:
For each reaction:
Total reaction volume 60 μl
For each reaction:
For each reaction:
For each reaction:
While the 2 hour incubation described in Example 3 is progressing, prepare one Boehringer-Mannheim Quick-Spin G-50 column per reaction as follows:
For each reaction:
For each reaction:
3) Heat samples as per protocol below, making sure that the sample tubes are placed in the thermocycler only after it has reached >80° C.
For each reaction:
SeqCalling™ Chemistry products generated in any of Examples 1-6 are diluted and re-amplified. Fractionation is then performed by electrophoresing the re-amplified sample on an agarose gel using MetaPhor agarose (FMC). After the electrophoresis, the gel is physically cut into a total of 48 fractions. Twenty-four of the fractions are derived from a 4% MetaPhor gel and correspond to the lower molecular weight fractions, whereas the other 24 fractions are derived from a 3% MetaPhor gel and correspond to the upper molecular weight fractions.
Following elution of the DNA from the gel fractions, the DNA fragments are ligated into the TOPO-TA cloning vector (Invitrogen). These plasmids are then transformed into E. coli. The transformed bacterial cells are plated onto petri dishes and grown to a size that allows automated colony picking. A suitable number of colonies per fraction is selected so as to ensure a statistically accurate representation of the DNA fragments contained within the fraction (i.e., suitable numbers of picked colonies per fraction are 48 or 96). Following incubation of the selected clones, the fragments contained within the individual clones are sized using the proprietary MegaBACE system, or an equivalent. Sizing is performed with multiple clones per lane. This multiplexing allows sizing to be performed in a cost- and time-efficient manner. The multiplexing is performed with a liquid handling robot (e.g., Matrix PlateMate). After running the multiplexed fragments on the MegaBACE system, and correlating the size of each fragment with the E. coli clone containing the insert, the fragments are analyzed to determine suitability for sequencing.
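By way of non-limiting illustration, the relationship between the number of colonies picked per fraction and the likelihood of sampling a given fragment can be estimated as follows. The sketch assumes that each pick is an independent draw and that the fragment of interest makes up a fixed fraction of the clones in the gel fraction; the function names, the 5% abundance, and the 95% confidence level are hypothetical and are not taken from the disclosure.

```python
import math


def detection_probability(clone_fraction, picks):
    """Probability that a fragment is picked at least once.

    clone_fraction: assumed share of colonies carrying the fragment (0 to 1).
    picks:          number of colonies picked from the fraction.
    """
    return 1.0 - (1.0 - clone_fraction) ** picks


def picks_needed(clone_fraction, confidence=0.95):
    """Smallest number of picks giving at least the requested detection probability."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - clone_fraction))


p48 = detection_probability(0.05, 48)  # 48 colonies per fraction
p96 = detection_probability(0.05, 96)  # 96 colonies per fraction
n95 = picks_needed(0.05, confidence=0.95)
```

Under these assumptions, doubling the number of picked colonies from 48 to 96 raises the chance of recovering a fragment present in 5% of the colonies from roughly 91% to above 99%.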
The effect of using a clone sizing step on the complexity, i.e., the representation of rare transcripts, of the resulting clones, is shown in
Although particular embodiments have been disclosed herein in detail, this has been done by way of example for purposes of illustration only, and is not intended to be limiting with respect to the scope of the appended claims that follow. In particular, it is contemplated by the inventor that various substitutions, alterations, and modifications may be made to the invention without departing from the spirit and scope of the invention as defined by the claims. For example, the selection of the specific tissue(s) or cell line(s) that is to be utilized in the practice of the present invention is believed to be a matter of routine for a person of ordinary skill in the art with knowledge of the embodiments described herein.
This application claims priority to provisional application U.S. Ser. No. 60/115,109, filed Jan. 8, 1999, and non-provisional application U.S. Ser. No. 09/417,386, filed Oct. 13, 1999, which are incorporated herein by reference in their entirety.
Number | Date | Country |
---|---|---|
60115109 | Jan 1999 | US |

Relation | Number | Date | Country |
---|---|---|---|
Parent | 10407519 | Apr 2003 | US |
Child | 11592067 | Nov 2006 | US |
Parent | 09417386 | Oct 1999 | US |
Child | 11592067 | Nov 2006 | US |