Deep sequencing analysis is increasingly important for providing quantitative assessment of gene expression in both diagnostics and research. To make a high quality cDNA library for deep sequencing of mRNA and other non-structural regulatory RNAs, the RNAs of interest must be enriched from total RNA which contains approximately 98% structural RNAs, including rRNA and tRNA.
Currently, there are two major techniques to enrich for non-structural RNAs: 1) positive selection (e.g., affinity selection of RNAs comprising Poly(A) tails) and 2) negative selection (e.g., using methods which remove rRNAs, which represent approximately 90% of total RNA). Both positive and negative selection techniques employ nucleic acid oligonucleotides anchored to resin to capture the target RNA. Importantly, contamination of samples with fragments of structural RNAs necessitates additional purification steps and greatly reduces the depth of useful sequence data that can be obtained using these existing oligo-based enrichment protocols. Moreover, such additional purification steps are expensive and require large amounts of starting total RNA owing to their low efficiency.
In addition, existing protocols for mRNA cDNA cloning are biased against detection of the 5′ most sequences. Moreover, many genes possess distinct isoforms that employ alternative transcription start sites, and in many cases these isoforms are functionally distinct. Thus, current protocols for mRNA sequencing tend to lack important information about specific mRNA isoforms that are expressed.
The development of a quick and efficient enzymatic enrichment method to enrich for capped RNAs that can be used to obtain deep-sequencing libraries of high quality and that are compatible with liquid-based assays so that they can be processed in a high-throughput manner, would be of great benefit.
The instant invention is based, at least in part, on the identification of novel methods for the rapid enzymatic enrichment of capped RNAs.
The instant methods dramatically improve the sensitivity of RNA detection and allow for the rapid detection of the full diversity of alternative 5′ ends in a given transcriptome. Using the instant methods, far less RNA is required than in traditional methods and costly secondary purification steps are eliminated. The method also allows for cloning other capped RNAs, in addition to mRNA (such as primary miRNA or primary piRNA, which are usually not recovered well using existing positive and negative selection methods, if at all). The instant methods also allow for more complete profiling of functional mRNA than methods based on poly(A) selection, because while all functional mRNA contains a cap, functional mRNA may or may not contain a poly(A) tail. In addition, negative selection procedures cannot remove tRNAs, thus removal of tRNA must be performed later, usually by size selection, which also removes some functional capped RNAs of small size.
The instant methods employ a series of enzymatic steps that eliminate contaminating RNA sequences and dramatically enrich for RNAs that contain the nuclease resistant 5′ cap structure that is a feature of mRNAs and several other endogenous RNA species. This procedure dramatically improves the sensitivity of RNA detection and allows the full diversity of alternative 5′ ends to be analyzed. The method also allows for an easy way to process samples in a high-throughput manner.
Specifically, the instant methods comprise a multi-step enzymatic procedure to enrich for capped RNAs. In one step, RNAs which are resistant to 5′ monophosphate-dependent exonuclease are obtained by treatment with 5′-3′ monophosphate-dependent exonuclease. This step removes 28S, 16S, and 5.8S rRNA, but not 5S and mature tRNA, all of which contain 5′ monophosphate. In a subsequent step, the resulting RNAs are treated with a phosphatase to remove exposed 5′ phosphate groups because the 5′-3′ monophosphate-dependent exonuclease cannot destroy all RNAs with 5′ monophosphate, especially highly structured RNAs. This step renders 5S RNA, tRNA, partially degraded rRNAs and triphosphorylated RNA, such as pre-tRNAs, unamenable to subsequent linker ligation, which requires a 5′ phosphate. In the next step, RNAs which are sensitive to decapping enzyme are obtained by treatment with a decapping enzyme to expose the alpha-phosphate in the cap, (Gppp-RNA). The resulting exposed monophosphate from the cap is the only monophosphate available for ligation, the RNAs with monophosphate phosphate (rRNA) or triphoshphate (tRNA) having been destroyed by exonuclease or rendered unamenable to cloning due to dephosphorylation by phophatase in the previous steps.
In one embodiment, the RNA may be ligated with an RNA ligase, which only recognizes 5′ phosphorylated RNA as an acceptor molecule. In one embodiment, cDNA may be obtained by RT PCR using random primers, e.g., chimeric solexa primer-random primers. The resulting cDNA may be subjected to size selection by PAGE gel purification to obtain cDNA within a desired size range. In one embodiment, cDNA may be amplified by PCR using DNA oligonucleotides designed according to the 5′ linker and the primer sequence in the RT reaction. By changing linkers and primers, the above procedure can be applied to many deep-sequencing platforms.
The instant methods provide many advantages over the prior art. As set forth above, most mRNA-sequence analysis uses oligo-based purification to enrich for mRNA using positive or negative selection, both of which involve column purification and require large amounts of total RNA. In contrast, the instant method invention uses an enzymatic approach to enrich for capped RNA, thus avoiding tedious and less efficient column purification procedures. The instant method also results in less RNA degradation than can occur using oligo-based purifications. Moreover, the sequences cloned using the instant methods are anchored at the 5′ end of capped RNAs, thus simplifying the bioinformatics analysis to compare samples.
Using the instant method, it is possible to clone mRNA from very little starting material, e.g., 500 ng or less, an amount lower than can be used with most available techniques. The instant methods can be used, e.g., to sequence mRNA from very small amounts of sample like clinical samples, to sequence partially degraded samples, to sequence pri-miRNA or other capped RNA, to enrich mRNA without column purification (which increases speed and reduces cost) and to process samples in a high-throughput manner using liquid handling robots. The RNAs enriched for using the claimed methods also are more representative of functional mRNAs in eukaryotes than those enriched for using current methods.
Moreover, other protocols which enrich for capped sequences, specifically the CAGE protocol, have a known bias with a nonspecific G at the most 5′ end of the CAGE tags, which is attributed to the template-free 5′-extension during the first-strand cDNA synthesis. This introduces erroneous mapping of CAGE tags. In addition, CAGE uses 5′ linker with random sequence to do 5′ ligation, thus introducing mutations. This does not occur using the instant methods. In addition, in C. elegans, 70% of all functional transcripts get spliced to a leader sequence (SL). The SL is 22 nucleotides long. CAGE can only sequence the first 20 nucleotides of any particular transcript. Thus, the information obtained from CAGE is meaningless 70% of the time in that organism.
Moreover, in contrast to other prior art methods, the instant enzymatic process effectively eliminates rRNA. The combination of phosphatase and tobacco acid phosphatase does not remove ribosomal RNA. rRNA competes with mRNA in the following RT and PCR steps. Most importantly, rRNA can be degraded during the ligation step, which is usually carried out overnight. Even a very tiny amount of degradation, e.g. 1-2%, can give significant contamination with rRNA or other structural RNA, because those RNAs are about 98% of the total RNA.
Of particular importance are the downstream applications made possible by the instant methods. Non-limiting examples include the identification of novel 5′ ends of non-annotated gene transcripts as described in the Examples infra, as well as confirmation of information relating to annotated transcripts, by overlaying reads onto annotated gene transcript information. Moreover, the instant methods can be applied to the identification of novel alternative splice sites (thereby identifying novel alternatively spliced transcripts) at the 5′ end or even within a transcriptome depending on the sequencing platform, and can also be used to quantify mRNA expression levels across different samples in an experiment (e.g., developmental stage transcriptome analysis as described in the Examples section infra). The instant methods can also be applied to identify pri-miRNA and pri/pre-piRNA and other non-polyadenylated but capped RNA. The instant methods can also be applied to samples that have partially degraded due to preparation or storage.
Accordingly, in one aspect, the invention is directed to a method of enriching for capped RNA present in a starting RNA sample containing rRNA comprising: contacting a starting RNA sample 5′ monophosphate-dependent exonuclease to obtain a population of 5′ monophosphate-dependent exonuclease resistant RNAs depleted of 28S, 16S, and 5.8S rRNA; contacting the population of 5′ monophosphate-dependent exonuclease resistant RNAs with at least one phosphatase to obtain a population of phosphatase-resistant RNAs depleted of RNAs having exposed 5′ phosphate groups; and contacting the population of phosphatase-resistant RNAs with a decapping enzyme to obtain a population of RNAs having an exposed alpha-phosphate in the 5′ Gppp cap, to thereby enrich for capped RNA present in a starting RNA sample containing rRNA.
In one embodiment, the starting RNA sample is total cellular RNA. In one embodiment, the starting RNA sample comprises 500 ng or less of RNA.
In one embodiment, the starting RNA sample includes polyA+ and polyA− RNA.
In one embodiment, the starting RNA sample comprises degraded RNA.
In one embodiment, the method further comprises sequencing the population of RNAs having an exposed alpha-phosphate in the 5′ Gppp cap.
In one embodiment, the method further comprises the step of contacting the population of RNAs having an exposed alpha-phosphate in the cap with an RNA ligase to obtain a population of ligated RNAs.
In one embodiment, the method further comprises subjecting the population of ligated RNAs to RT PCR using random primers to obtain a cDNA library.
In one embodiment, the method further comprises subjecting the cDNA library to a size selection procedure.
In one embodiment, the method further comprises amplifying the cDNA library by PCR.
In one embodiment, the method further comprises treating the starting RNA sample with a polynucleotide kinase prior to step a) to phosphorylate the 5′ terminus of degraded RNA.
In one embodiment, the method further comprises sequencing 100 bases or fewer of the cDNA members of the library.
In one embodiment, the method further comprises sequencing between 50 and 125 bases of the cDNA members of the library.
In one embodiment, the invention pertains to a composition obtained using a method of the invention.
In another aspect, the invention pertains to a RNA mixture comprising RNA molecules comprising transcriptional start sites, wherein the RNA molecules comprise monophosphorylated 5′ termini or hydroxyl 5′ termini, wherein the mixture is substantially free of 28S, 16S, and 5.8S ribosomal RNA, RNA with triphosphorylated 5′ termini, and 5′ m7G capped RNA.
In one embodiment, the mixture comprises RNAs substantially free of poly-A at the 3′ terminus.
In one embodiment, the mixture is substantially free of RNA with poly-A at the 3′ terminus.
In one embodiment, the mixture comprises RNA with a size of less than 200 nucleotides. In another embodiment, the mixture comprises RNA with a size of between about 50 nucleotides and about 200 nucleotides. In another embodiment, RNA mixture comprises a substantially pure population of RNA molecules having a size of between about 130 nucleotides and 170 nucleotides.
In one embodiment, the invention pertains to a cDNA library generated from an RNA mixture of the invention.
In another aspect, the invention pertains to a kit comprising: a first component comprising a 5′ monophosphate-dependent exonuclease, a second component comprising a phosphatase, and a third component comprising a decapping enzyme.
In one embodiment, the kit further comprises a control sample.
In one embodiment, the kit further comprises instructions for use of the kit to enrich for capped RNAs.
In another aspect, the invention pertains to methods for identifying transcriptional start (TS) sites.
Before further description of the invention, certain definitions are included here:
As used herein, “nucleic acid” refers to DNA, RNA and derivatives thereof. The terms “DNA” and “RNA” refer to deoxyribonucleic acid and ribonucleic acid, respectively. The term “mRNA” refers to messenger RNA. The term rRNA refers to ribosomal RNA. The term “tRNA” refers to transfer RNA.
“polyA+RNA” refers to RNA comprising a poly(A) tail which consists of multiple adenosine monophosphates; in other words, it is a stretch of RNA that has only adenine bases at the 3′ end of an RNA molecule. “polyA-RNA” refers to RNA molecules lacking a polyA tail.
The term “cap” with respect to RNAs refers to the cap found on the 5′ end of an mRNA molecule which consists of a guanine nucleotide connected to the mRNA via an unusual 5′ to 5′ triphosphate linkage. This guanosine is methylated on the 7 position directly after capping in vivo by a methyl transferase. It is also referred to as a 7-methylguanylate cap, abbreviated m7G or m7Gppp. Capped RNA refers to an RNA comprising a 5′ cap and is also referred to as Gppp-RNA.
As used herein, the term MicroRNA (miRNA) refers to RNA molecules that are processed from small hairpin RNA (shRNA) precursors that are produced from miRNA genes. miRNAs are 21-23 nucleotides in length and through the RNA-induced silencing complex they target and silence mRNAs containing imperfectly complementary sequence. Animal miRNAs are initially transcribed as part of one arm of an 80 nucleotide RNA stem-loop that in turn forms part of a several hundred nucleotides long miRNA precursor termed a primary miRNA (pri-miRNA). Animal miRNAs are initially transcribed as pri-miRNA by DNA-dependent RNA polymerase II, then processed by Drosha as pre-miRNA, a stem-loop structure molecule, and finally cut by Dicer to form mature miRNA.
The term Piwi-interacting RNAs (piRNA) refers to small RNA species that are processed from single-stranded precursor RNAs. They are 21-35 nucleotides in length and form complexes with the piwi protein. Pri-piRNAs are long primary transcripts of piRNA molecules.
As used herein, the term “21U” refers to a 21 nt transcript starting with U (i.e., not limited to the 21U-RNA genes depicted in the Figures). However, “21U-RNA” or a gene name starting with “21ur” followed by a suffix is intended to refer to a 21U-RNA gene.
Where a method disclosed herein refers to “amplifying” a nucleic acid, the term “amplifying” refers to a process in which the nucleic acid is exposed to at least one round of extension, replication, or transcription in order to increase (e.g., exponentially increase) the number of copies (including complimentary copies) of the nucleic acid. The process can be iterative including multiple rounds of extension, replication, or transcription. Various nucleic acid amplification techniques are known in the art, such as PCR amplification or rolling circle amplification.
A “primer” as used herein refers to a nucleic acid that is capable of hybridizing to a complimentary nucleic acid sequence in order to facilitate enzymatic extension, replication or transcription.
As used herein the term “transcriptional start site” refers to the site of an mRNA molecule at which transcription begins.
As used herein the term “exposed alpha-phosphate” refers to the 5′ phosphate which was linked to the guanine nucleotide in the cap structure.
As used herein, the term “ligated RNAs” refers to RNAs which have been ligated with at 5′ phosphoryl-terminus through the formation of a 3′→5′ phosphodiester bond, with hydrolysis of ATP to AMP and PPi.
As used herein, the term “RT-PCR” refers to Reverse transcription polymerase chain reaction. RT-PCR is a variant of polymerase chain reaction (PCR), a laboratory technique commonly used in molecular biology to generate many copies of a DNA sequence. In RT-PCR an RNA strand is first reverse transcribed into its DNA complement (complementary DNA, or cDNA) using the enzyme reverse transcriptase, and the resulting cDNA is amplified using traditional PCR or real-time PCR.
As used herein, the term PCR refers to a method of amplifying nucleic acid molecules which relies on thermal cycling, consisting of cycles of repeated heating and cooling of the reaction for DNA melting and enzymatic replication of the DNA. Primers (short DNA fragments) containing sequences complementary to the target region along with a DNA polymerase are key components to enable selective and repeated amplification. As PCR progresses, the DNA generated is itself used as a template for replication, setting in motion a chain reaction in which the DNA template is exponentially amplified.
A cDNA library is a combination of cloned cDNA (complementary DNA) fragments inserted into a collection of host cells, which together constitute some portion of the transcriptome of the organism. A cDNA library, in the context of the present invention, can also refer to the product obtained from reverse transcription of a RNA mixture comprising RNA molecules comprising transcriptional start sites, wherein the RNA molecules comprise monophosphorylated 5′ termini or hydroxyl 5′ termini, wherein the mixture is substantially free of 28S, 16S, and 5.8S ribosomal RNA, RNA with triphosphorylated 5′ termini, and 5′ m7G capped RNA.
As used herein, the term “random primers” refers to short segments of single-stranded DNA (ssDNA) called oligonucleotides, or oligos for short. These oligos are only 8 nucleotides long (octamers) and they consist of every possible combination of bases which means there must be 48=65,536 different combinations in the mixture. Because every possible octamer is present, these primers can bind to any section of DNA.
As used herein, the term “size selection” refers to subjecting a population of nucleic acid molecules to a process that allows for selection of molecules having a desired size range. One such process is gel electrophoresis, in which smaller molecules move through the gel more quickly than larger molecules. The inclusion of reference markers allows for selection of molecules of a specific size range.
The term “deep sequencing” refers to a method of sequencing a plurality of nucleic acids in parallel. See e.g., Bentley et al., Nature, 456:53-59, 2008. Sequencing depth refers to the total number of all the sequences reads or base pairs represented in a single sequencing experiment or series of experiments.
As used herein, the term “tag” refers to a non-target nucleic acid component that provides a means of addressing a nucleic acid fragment to which it is joined. For example, in preferred embodiments, a tag comprises a nucleotide sequence that allows for identifying, recognition, and/or molecular or biochemical manipulation of the DNA to which the tag is attached (e.g., by providing a site for annealing an oligonucleotide, such as a primer for extension by a DNA polymerase, or an oligonucleotide for capture or for a ligation reaction). The process of adding the tag to a DNA molecule can also be referred to as “tagging” and DNA that undergoes tagging or that contains a tag is referred to as “tagged” (e.g., “tagged DNA”). A “sequencing tag domain” or a “sequencing tag” refers to a tag domain that exhibits a sequence for the purpose of facilitating sequencing of the DNA fragment on which the tag is present.
As used herein, the term “read” refers to a sequence of nucleotides within a cDNA (i.e, a short subsequence of a cDNA sequence), where said cDNA is a copy of all or a portion of an RNA molecule enriched according to the methods of the invention for rapid enzymatic enhancement of capped RNAs. Exemplary reads are of cDNA sequences generated according to the CapSeq methods of the invention. In exemplary embodiments, a “read” comprises between about 50 and about 125 nucleotides, preferably 100 nucleotides or fewer, e.g., 20, 30, 40, 50, 60 70, 80, 90 or 100 nucleotides. In exemplary embodiments, a “read” comprises about 50-100 nucleotides, for example, about 70-90 nucleotides, e.g., 70-80 or 80-90 nucleotides. A read can be a sequence derived directly from a cDNA or can result from analysis and/or modification, e.g., trimming, of a cDNA sequence. A “read” generated according to the methods of the invention for rapid enzymatic enhancement of capped RNAs or CapSeq protocols of the invention can also be referred to as a “5′ sequence tag.”
A. Starting RNA Samples
Starting RNA samples for enrichment of capped sequences can be obtained from a population of cells. Such starting RNA samples comprise a population of RNA molecules, i.e., a plurality of RNA molecules. Exemplary cells from which starting RNA samples can be obtained include prokaryotic and eukaryotic cells and include, for example, bacterial cells, animal cells, fungal cells and plant cells. In one embodiment, the cell is a mammalian cell, e.g., a human cell. In one embodiment, the cell is obtained from an organ or an organism. In another embodiment, the starting material can be an organism (e.g., C. elegans) or lysate thereof. The cell may be from a clinical sample, e.g., may be obtained from a subject that suffers from a disease or disorder.
The present invention is not limited by the nature of the sample. Samples include, but are not limited to, cell lysates, provided that nucleases that can affect nucleic acids of interest are removed or inhibited in said lysates, mixtures of nucleic acids (e.g., unpurified, partially purified, etc.), environmental samples, etc. Samples can comprise nucleic acids from a single-cell organism or, if an organism comprises multiple cells, from one or more cells. Samples comprising nucleic acids from multiple cells can comprise cells from one or more types of cells, including cells from different organisms, and/or different cells from the same organism. The present invention finds use with prokaryotic nucleic acid molecules, eukaryotic nucleic acid molecules, or with a mixture of both eukaryotic and prokaryotic nucleic acid molecules. Prokaryotic mRNAs lack caps.
In some preferred embodiments of the present invention, the sample comprises (or contains) RNA. In one embodiment, the starting RNA sample comprises total cellular RNA (e.g., structural RNA (such as rRNA and tRNA) as well as mRNA) and includes both polyA+ and polyA− RNA. RNA may be obtained from a cell using techniques known in the art. Typically, the cell is lysed and a starting RNA sample is obtained using known RNA isolation techniques. Exemplary methods include, for example, a MasterPure RNA Purification Kit, an ArrayPure Nano-Scale RNA Purification Kit, or a MasterPure Yeast RNA Purification Kit (all of which are from EPICENTRE), or using a kit from another commercial source, or using a “home-brew” method known in the art. Alternatively, in some embodiments a sample of the invention that comprises or contains RNA can contain a subfraction of total RNA obtained by any method known in the art, such as, but without limitation, a subfraction based on size (e.g., by purification on an agarose or polyacrylamide gel, or by column purification, including by HPLC), or a subfraction obtained by salt precipitation (e.g., using precipitation with 0.5-2.5 M LiCl (Barlow et al., Biochem. Biophys. Res. Comm. 13: 61, 1963); Cathala et al, DNA 2: 329, 1983) or 2.5 M ammonium acetate). In some embodiments, a sample comprising RNA can also contain DNA.
One preferred method of the invention comprises a method for enriching for an RNA having a 5′-cap in a biological sample comprising prokaryotic RNA, eukaryotic RNA or both prokaryotic and eukaryotic RNA and at least one undesired nucleic acid, e.g., structural RNA.
In different embodiments of this method, the RNA having a 5′-cap is selected from the group consisting of: (i) prokaryotic mRNA; (ii) eukaryotic mRNA, including polyadenylated and non-polyadenylated eukaryotic mRNA; (iii) a mixture of both prokaryotic and eukaryotic mRNA; (iv) eukaryotic snRNA; (v) eukaryotic pre-micro RNA; and (vi) prokaryotic or eukaryotic primary RNA transcripts of unknown function.
The instant enzymatic method of enriching for capped RNA requires only small amounts of RNA, e.g., 5000 ng or less of starting RNA. In one embodiment, a starting RNA sample comprises 5000 ng or less RNA, 2500 ng or less RNA, 1000 ng or less RNA, 750 ng or less RNA, 500 ng or less RNA, 250 ng or less RNA, or 100 ng or less RNA.
The instant enzymatic method can also be used to enrich for capped RNAs in a degraded preparation. In one embodiment, a starting RNA population comprises degraded RNA.
In one embodiment, a starting RNA population for use in the instant methods comprises mRNA as well as other capped but less abundant pri-miRNA and/or pri-piRNAs or other non-coding regulatory RNAs.
Starting RNA samples may or may not undergo further manipulation steps prior to the enzymatic enrichment of capped RNAs set forth below. For example, if a large amount of degraded RNA is present in the starting RNA sample, the starting RNA sample can be treated with a polynucleotide kinase prior to enzymatic enrichment for capped RNAs to phosphorylate the 5′ end of degraded RNA. This phosphorylation will render RNA degradation fragments with 5′ OH groups sensitive to 5′ monophosphate-dependent exonuclease.
B. Enzymatic Enrichment of Capped RNAs
The abundance of structural RNAs in the cell creates a challenge for monitoring the expression of Pol II RNA transcripts. Roughly 95% to 98% of cellular transcripts are structural RNAs (rRNA and tRNA) that can result in a high level of noise in RNA-seq experiments. These structural transcripts are 5′-monophosphorylated, and most are sensitive to enzymatic treatment with 5′-to-3′ nucleases, such as Terminator™ exonuclease. Nevertheless, 5′ to 3′ exonuclease treatments cannot completely remove a large molar excess (relative to mRNA) of 5S RNA, tRNA and other partially-degraded rRNA fragments. The presence of these structural RNA fragments has necessitated the use of tedious, inefficient and costly hybridization-based methods to enrich for mRNAs prior to cDNA library construction (Vivancos et al., Genome Res, 20: 989-999, 2010; Wang et al., Nat Rev Genet, 10: 57-63, 2009).
The present invention features a purely enzymatic approach that facilitates cloning the 5′ ends of Pol II transcripts from small quantities of tissue. The presence of a 5′ phosphate is required for ligation mediated by RNA ligase. Removal of the phosphate from 5S RNA, tRNA and partially degraded rRNA, for example, after 5′ monophosphate-dependent (or 5′ to 3′) exonuclease, e.g., Terminator, treatment can prevent cloning of such RNAs during library construction. Therefore, phosphatase, e.g., calf intestine phosphatase (CIP) was used to remove 5′ phosphates from RNA samples, following digestion with 5′ monophosphate-dependent (or 5′ to 3′) exonuclease (
The basic steps of the protocol for enzymatic enrichment of capped RNAs are set forth below. As is set forth herein, it will be understood that additional steps may be performed before or after these steps. In addition, it is noted that manufacturers of enzymes that may be used in this method provide suggested concentrations of enzymes and reaction conditions (including, e.g., appropriate buffers and incubation temperatures and times). It will be understood that while the appended Examples set forth the specific experimental conditions used to demonstrate that the instant method enriches for several different types of capped RNAs, the specific enzymes and reaction conditions used may be varied by the skilled artisan, e.g., based on manufacturers' instructions.
1. 5′ Monophosphate-Dependent Exonuclease
In one embodiment, a starting RNA sample is contacted with a 5′ monophosphate-dependent exonuclease to obtain a population of 5′ monophosphate-dependent exonuclease resistant RNAs depleted of 28S, 16S, and 5.8S rRNA.
One exemplary such 5′ monophosphate-dependent exonuclease is the terminator exonuclease commercially available from Epicenter. Another exemplary such nuclease is XRN-1 which is commercially available from NEB.
One preferred method of the invention comprises a method for enriching for an RNA having a 5′-cap in a biological sample comprising prokaryotic RNA, eukaryotic RNA or both prokaryotic and eukaryotic RNA and at least one undesired nucleic acid, the method comprising treating the sample with purified 5′ exoribonuclease under conditions in which the 5′ exoribonuclease is active and for sufficient time so that the undesired nucleic acid is digested and the sample is enriched for RNA having a 5′-cap. The enzymatic reaction can be carried out until the starting sample is substantially depleted of 28S, 16S, and 5.8S rRNA. This can be readily determined by one of ordinary skill in the art by testing aliquots of reaction mixture for the presence of the molecules to be depleted. Enzyme activity of a 5′ exonuclease can be measured using a number of different methods. Without limitation, suitable methods that can be used for assaying activity and determining relative activity using RNA substrates with a 5′-triphosphate, a 5′-cap, or a 5′-monophosphate are described by Stevens and Poole (J. Biol. Chem. 270: 16063, 1995).
The 5′ monophosphate-dependent exonuclease reaction conditions can be optimized by titrating enzyme or substrate RNA, or by performing time course experiments. The time course for digestion can be performed using total RNA as starting material, followed by visualization on a 5% denaturing PAGE gel with SYBR gold or ethidium bromide staining. In the event that the reaction has reached completion, 26S and 18S rRNAs should no longer be visible on the gel. To ensure that capped RNAs are intact, a radio-labeled 5′ capped RNA made by in vitro transcription using methods known in the art can be added to the digestion mixture and monitored using a PAGE gel. Alternatively, the reaction could be monitored using real-time PCR. In this case, cDNA can be produced using random primers in a reverse transcriptase reaction, followed by real-time PCR to compare the amounts of rRNA and mRNA.
2. Phosphatase
The 5′ monophosphate-dependent exonuclease resistant RNAs are subsequently contacted with at least one phosphatase to obtain a population of phosphatase-resistant RNAs depleted of RNAs having exposed 5′ phosphate groups.
One exemplary such phosphatase is Alkaline Phosphatase, e.g., calf intestinal phosphatase (CIP) which catalyzes the removal of 5′ phosphate groups from DNA, RNA, ribo- and deoxyribonucleoside triphosphates. CIP is commercially available from NEB, Promega, Invitrogen, and other sources, such as Thermo Scientific and Affymetrix.
This step removes all exposed phosphates. Other phosphatases that can be employed include, e.g., bacterial alkaline phosphatase and HK phosphatase, which are also commercially available. Other substitutes include APex™ heat-labile alkaline phosphatase (Epicenter) and TSAP alkaline phosphatase (Promega).
The enzymatic reaction can be carried out until the sample being treated is substantially depleted of exposed phosphates, e.g. including 5S RNA, tRNA, partially degraded rRNAs and triphosphorylated RNA like pre-tRNAs. Phosphatases are very robust enzymes and it is rare for the reaction not to go to completion.
The phosphatase treatment step can be monitored by adding a 5′ radio-labeled substrate to the reaction, followed by monitoring with PAGE gel analysis using art-recognized methods.
In one embodiment, during phosphatase digestion, DNase I may optionally be added to remove contaminated genomic DNA. Although this step is not critical, it can increase the quality of the resulting library.
3. Decapping Enzyme
The phosphatase-treated RNAs are subsequently contacted with a decapping enzyme. One such decapping enzyme, Tobacco Acid Pyrophosphatase (TAP), hydrolyzes the phosphoric acid anhydride bonds in the triphosphate bridge of the cap structure found in most eukaryotic mRNA, releasing the cap nucleoside and generating a 5′-monophosphorylated terminus on the RNA molecule. Similarly, TAP digests the triphosphate group at the 5′ end of prokaryotic transcripts, generating an RNA molecule with a 5′-monophosphorylated terminus. TAP is commercially available from many sources (e.g., Epicentre). Other enzymes that catalyze cap processing to GDP and monophosphates can also be substituted for TAP, e.g., human Dcp2 and yeast Dcp1.
The enzymatic reaction can be carried out until the RNAs are substantially enriched for the presence of a 5′-monophosphorylated terminus on the RNA molecules. This can be readily determined by one of ordinary skill in the art by testing aliquots of reaction mixture for the presence of the desired terminus or for removal of caps.
The decapping step can be monitored by adding a 5′ radiolabeled capped substrate made by in vitro transcription to the reaction, followed by sample treatment with TAP and CIP simultaneously. If the reaction has proceeded to completion, the labeled phosphate will be released from the RNA. This can be easily quantified using a PAGE gel with art-recognized methods.
This three step enzymatic process results in a population of 5′ monophosphorylated RNAs that are substantially enriched for (formerly) capped RNAs. The resulting population may optionally be subject to further manipulation steps, e.g., as set forth below.
C. Ligation or Dephosphorylation
In one embodiment, the population of de-capped RNAs having an exposed alpha-phosphate obtained using the enzymatic protocol set forth herein is contacted with an RNA ligase to obtain a population of ligated RNAs. In one embodiment, the ligase is a T4 RNA ligase. Ligases (e.g., T4 ligases) are commercially available from many resources, e.g., Takara, Ambion, Invitrogen, or NEB.
In another embodiment, the population of RNAs having an exposed alpha-phosphate obtained using the enzymatic protocol set forth herein is dephosphorylated by contacting the population with a phosphatase, e.g., APex™ Heat-Labile Alkaline Phosphatase, for end-labeling.
D. cDNA Libraries
In one embodiment, the population of ligated RNAs is subjected to RT PCR using random primers to obtain a cDNA library. Methods for using RT-PCR to obtain cDNA libraries are well known in the art. In one embodiment, the cDNA library may be size selected, e.g., using gel electrophoresis, to obtain nucleic acid molecules of a desired size range.
In one embodiment, the cDNA library may be amplified by PCR using techniques well known in the art. It is also noted that amplification-free protocols have been developed so that amplification is not required prior to sequencing. For example, commercially available single-molecule sequencing technologies, e.g., from Helicos and Pacific Biosciences do not require PCR amplification before sequencing. The Helicos system can even directly sequence RNAs without cDNA library construction. Therefore, in one embodiment, RNAs may be directly sequenced.
In an exemplary embodiment, the enriched RNA populations can be used to generate libraries, referred to herein as “CapSeq libraries.” In this embodiment, a 5′-adapter is ligated and first-strand cDNA is synthesized using a 3′-adapter linked to a random octamer, thus avoiding a second ligation step. Second-strand cDNA is amplified from the first strand cDNA using linear PCR with a primer mapped to the 5′ linker, and the first- and second-strand cDNAs are size-fractionated. Finally, libraries are amplified using PCR and subject to deep-sequencing. This procedure can typically be performed with 0.5-2 μg of total RNA, and the entire process can easily be completed within e.g., about 2 days.
This “CapSeq” protocol provides a convenient approach for 5′-oriented RNA sequencing. The protocol can be performed on small quantities of total RNA without the need for affinity- or hybridization-based purification steps.
E. Sequencing
In one embodiment, a population of molecules obtained using the instant methods, e.g., RNA or cDNA derived therefrom, may be sequenced. The RNAs obtained using the instant cap-enrichment process are particularly well suited for analysis of the transcriptome and, in one embodiment, RNA or cDNA molecules synthesized from these cap-enriched RNAs can be sequenced using known methods.
Illustrative non-limiting examples of nucleic acid sequencing techniques include, but are not limited to, chain terminator (Sanger) sequencing and dye terminator sequencing. Chain terminator sequencing uses sequence-specific termination of a DNA synthesis reaction using modified nucleotide substrates. Extension is initiated at a specific site on the template DNA by using a short radioactive, or other labeled, oligonucleotide primer complementary to the template at that region. The oligonucleotide primer is extended using a DNA polymerase, standard four deoxynucleotide bases, and a low concentration of one chain terminating nucleotide, most commonly a di-deoxynucleotide. This reaction is repeated in four separate tubes with each of the bases taking turns as the di-deoxynucleotide. Limited incorporation of the chain terminating nucleotide by the DNA polymerase results in a series of related DNA fragments that are terminated only at positions where that particular di-deoxynucleotide is used. For each reaction tube, the fragments are size-separated by electrophoresis in a slab polyacrylamide gel or a capillary tube filled with a viscous polymer. The sequence is determined by reading which lane produces a visualized mark from the labeled primer as you scan from the top of the gel to the bottom.
Dye terminator sequencing alternatively labels the terminators. Complete sequencing can be performed in a single reaction by labeling each of the di-deoxynucleotide chain-terminators with a separate fluorescent dye, which fluoresces at a different wavelength.
A set of methods referred to as “next-generation sequencing” techniques have emerged as alternatives to Sanger and dye-terminator sequencing methods (Voelkerding et al., Clinical Chem., 55: 641-658, 2009; MacLean et al., Nature Rev. Microbiol., 7: 287-296, 2009; each herein incorporated by reference in their entirety). Most current methods describe the use of next-generation sequencing technology for de novo sequencing of whole genomes to determine the primary nucleic acid sequence of an organism. In addition, the instant methods are particularly well suited to targeted re-sequencing (deep sequencing) and allow for sensitive mutation detection within a population of wild-type sequence. Recent publications describing the use of bar code primer sequences permit the simultaneous sequencing of multiple samples during a typical sequencing run including, for example: Margulies et al., Nature, 437: 376-80, 2005; Mikkelsen et al., Nature, 448: 553-60, 2007; McLaughlin et al., ASHG Annual Meeting, 2007; Shendure et al., Science, 309: 1728-32, 2005; Harris et al., Science, 320: 106-9, 2008; Simen et al., 16th International HIV Drug Resistance Workshop, Barbados, 2007; Thomas et al., Nature Med., 12: 852-855, 2006; Mitsuya et al., J. Vir., 82: 10747-10755, 2008; Binladen et al., PLoS ONE, 2: e197, 2007; and Hoffmann et al., Nuc. Acids Res., 35: e91, 2007, all of which are herein incorporated by reference.
Next-generation sequencing (NGS) methods share the common feature of massively parallel, high-throughput strategies, with the goal of lower costs in comparison to older sequencing methods. These methods are particularly well suited to deep sequencing. NGS methods can be broadly divided into those that require template amplification and those that do not. Amplification-requiring methods include pyrosequencing commercialized by Roche as the 454 technology platforms (e.g., GS 20 and GS FLX), the Solexa platform commercialized by Illumina, and the Supported Oligonucleotide Ligation and Detection (SOLID) platform commercialized by Applied Biosystems. Non-amplification approaches, also known as single-molecule sequencing, are exemplified by the HeliScope platform commercialized by Helicos BioSciences, and emerging platforms commercialized by VisiGen and Pacific Biosciences, respectively.
In pyrosequencing (Voelkerding et al., Clinical Chem., 55: 641-658, 2009; MacLean et al., Nature Rev. Microbiol., 7: 287-296; U.S. Pat. No. 6,210,891; U.S. Pat. No. 6,258,568; each herein incorporated by reference in its entirety), template DNA is fragmented, end-repaired, ligated to adaptors, and clonally amplified in-situ by capturing single template molecules with beads bearing oligonucleotides complementary to the adaptors. Each bead bearing a single template type is compartmentalized into a water-in-oil microvesicle, and the template is clonally amplified using a technique referred to as emulsion PCR. The emulsion is disrupted after amplification and beads are deposited into individual wells of a picotitre plate functioning as a flow cell during the sequencing reactions. Ordered, iterative introduction of each of the four dNTP reagents occurs in the flow cell in the presence of sequencing enzymes and luminescent reporter such as luciferase. In the event that an appropriate dNTP is added to the 3′ end of the sequencing primer, the resulting production of ATP causes a burst of luminescence within the well, which is recorded using a CCD camera. It is possible to achieve read lengths greater than or equal to 400 bases, and 1×106 sequence reads can be achieved, resulting in up to 500 million base pairs (Mb) of sequence.
In the Solexa/Illumina platform (Voelkerding et al., Clinical Chem., 55: 641-658, 2009; MacLean et al., Nature Rev. Microbiol., 7: 287-296; U.S. Pat. No. 6,833,246; U.S. Pat. No. 7,115,400; U.S. Pat. No. 6,969,488; each herein incorporated by reference in its entirety), sequencing data are produced in the form of shorter-length reads. In this method, single-stranded fragmented DNA is end-repaired to generate 5′-phosphorylated blunt ends, followed by Klenow-mediated addition of a single A base to the 3′ end of the fragments. A—addition facilitates addition of T—overhang adaptor oligonucleotides, which are subsequently used to capture the template-adaptor molecules on the surface of a flow cell that is studded with oligonucleotide anchors. The anchor is used as a PCR primer, but because of the length of the template and its proximity to other nearby anchor oligonucleotides, extension by PCR results in the “arching over” of the molecule to hybridize with an adjacent anchor oligonucleotide to form a bridge structure on the surface of the flow cell. These loops of DNA are denatured and cleaved. Forward strands are then sequenced with reversible dye terminators. The sequence of incorporated nucleotides is determined by detection of post-incorporation fluorescence, with each fluor and block removed prior to the next cycle of dNTP addition. Sequence read length ranges from 36 nucleotides to over 50 nucleotides, with overall output exceeding 1 billion nucleotide pairs per analytical run.
Sequencing nucleic acid molecules using SOLiD technology (Voelkerding et al., Clinical Chem., 55: 641-658, 2009; MacLean et al., Nature Rev. Microbiol., 7: 287-296; U.S. Pat. No. 5,912,148; U.S. Pat. No. 6,130,073; each herein incorporated by reference in their entirety) also involves fragmentation of the template, ligation to oligonucleotide adaptors, attachment to beads, and clonal amplification by emulsion PCR. Following this, beads bearing template are immobilized on a derivatized surface of a glass flow-cell, and a primer complementary to the adaptor oligonucleotide is annealed. However, rather than utilizing this primer for 3′ extension, it is instead used to provide a 5′ phosphate group for ligation to interrogation probes containing two probe-specific bases followed by 6 degenerate bases and one of four fluorescent labels. In the SOLiD system, interrogation probes have 16 possible combinations of the two bases at the 3′ end of each probe, and one of four fluors at the 5′ end. Fluor color and thus identity of each probe corresponds to specified color-space coding schemes. Multiple rounds (usually 7) of probe annealing, ligation, and fluor detection are followed by denaturation, and then a second round of sequencing using a primer that is offset by one base relative to the initial primer. In this manner, the template sequence can be computationally re-constructed, and template bases are interrogated twice, resulting in increased accuracy. Sequence read length averages 35 nucleotides, and overall output exceeds 4 billion bases per sequencing run.
In certain embodiments, nanopore sequencing is employed (see, e.g., Astier et al., J Am Chem Soc. 2006 Feb. 8; 128(5):1705-10, herein incorporated by reference). The theory behind nanopore sequencing has to do with what occurs when the nanopore is immersed in a conducting fluid and a potential (voltage) is applied across it: under these conditions a slight electric current due to conduction of ions through the nanopore can be observed, and the amount of current is exceedingly sensitive to the size of the nanopore. If DNA molecules pass (or part of the DNA molecule passes) through the nanopore, this can create a change in the magnitude of the current through the nanopore, thereby allowing the sequences of the DNA molecule to be determined.
HeliScope by Helicos BioSciences (Voelkerding et al., Clinical Chem., 55: 641-658, 2009; MacLean et al., Nature Rev. Microbiol., 7: 287-296; U.S. Pat. No. 7,169,560; U.S. Pat. No. 7,282,337; U.S. Pat. No. 7,482,120; U.S. Pat. No. 7,501,245; U.S. Pat. No. 6,818,395; U.S. Pat. No. 6,911,345; U.S. Pat. No. 7,501,245; each herein incorporated by reference in their entirety) is the first commercialized single-molecule sequencing platform. This method does not require clonal amplification. Template DNA is fragmented and polyadenylated at the 3′ end, with the final adenosine bearing a fluorescent label. Denatured polyadenylated template fragments are ligated to poly(dT) oligonucleotides on the surface of a flow cell. Initial physical locations of captured template molecules are recorded by a CCD camera, and then label is cleaved and washed away. Sequencing is achieved by addition of polymerase and serial addition of fluorescently-labeled dNTP reagents. Incorporation events result in fluor signal corresponding to the dNTP, and signal is captured by a CCD camera before each round of dNTP addition. Sequence read length ranges from 25-50 nucleotides, with overall output exceeding 1 billion nucleotide pairs per analytical run. Other emerging single molecule sequencing methods real-time sequencing by synthesis using a VisiGen platform (Voelkerding et al., Clinical Chem., 55: 641-658, 2009; U.S. Pat. No. 7,329,492; U.S. patent application Ser. No. 11/671,956; U.S. patent application Ser. No. 11/781,166; each herein incorporated by reference in their entirety) in which immobilized, primed DNA template is subjected to strand extension using a fluorescently-modified polymerase and florescent acceptor molecules, resulting in detectible fluorescence resonance energy transfer (FRET) upon nucleotide addition. Another real-time single molecule sequencing system developed by Pacific Biosciences (Voelkerding et al., Clinical Chem., 55: 641-658, 2009; MacLean et al., Nature Rev. Microbiol., 7: 287-296; U.S. Pat. No. 7,170,050; U.S. Pat. No. 7,302,146; U.S. Pat. No. 7,313,308; U.S. Pat. No. 7,476,503; all of which are herein incorporated by reference) utilizes reaction wells 50-100 nm in diameter and encompassing a reaction volume of approximately 20 zeptoliters (10.times.10.sup.-21 L). Sequencing reactions are performed using immobilized template, modified phi29 DNA polymerase, and high local concentrations of fluorescently labeled dNTPs. High local concentrations and continuous reaction conditions allow incorporation events to be captured in real time by fluor signal detection using laser excitation, an optical waveguide, and a CCD camera.
In one embodiment, libraries of tagged DNA fragments can be made from target DNA for use in, e.g., deep sequencing and amplification methods using transposon compositions. Exemplary such methods are known in the art (see, e.g., US application 20100120098. For example, linear ssDNA fragments or tagged circular ssDNA fragments (and amplification products thereof) from target DNA comprising any dsDNA of interest (including double-stranded cDNA prepared from RNA) for analysis.
A. Identification of 5′ Ends
In an exemplary aspect, the instant methods can be used to effectively clone the 5′ ends of as of yet unannotated transcripts (e.g., pri-lin-4 as demonstrated in the instant Examples). Thus, the instant methods effectively identify 5′ ends and allow analysis of the transcriptome to map 5′ start sites. In addition, as will be recognized by one of ordinary skill in the art, by anchoring at the 5′ end and using gene specific primers, the instant methods can be used to determine, e.g., which start sites correlate with a given isoform of a molecule. Moreover, whether the newly identified 5′ ends actually correspond to transcripts of the gene being studied (e.g., pri-lin-4) can be assessed with art-recognized techniques, e.g., Northern blot analysis to confirm the size of the transcript. In the event of low expression of the transcript, another option is RT-PCR, followed by nested PCR with gene-specific primers. The latter is the routine process of 5′ RACE or 3′RACE followed by normal sequencing to obtain the ends of a gene.
B. Identifying Argonaute (AGO)-Associated Small RNAs
The methods of the invention can be used to explore the expression and biogenesis of small RNAs in a model organism, in particular, small RNAs believed to be involved in gene regulation. Of particular interest are genes that associate with Argonaute (AGO) proteins, such proteins playing a known role in sequence-directed gene regulation by association with a variety of classes of small RNAs.
The Argonaute (AGO) proteins associate with small RNAs to form sequence-directed gene regulatory complexes that are deeply conserved in eukaryotes (Hutvagner et al., Nat Rev Mol Cell Biol, 9: 22-32, 2008). Most organisms encode multiple functionally-distinct AGO family members. These AGOs are loaded with a diversity of small RNA cofactors, produced through a similarly diverse repertoire of small-RNA biogenesis mechanisms (Siomi et al., Nature, 457: 396-404, 2009). AGO-associated small RNAs include micro (mi) RNAs and short-interfering (si) RNAs that are processed from doublestranded (ds) RNA precursors by the RNase III-related enzyme Dicer (Bernstein et al., Nature, 409: 363-366, 2001).
In some organisms, AGO-associated small-RNA species are produced, independent of Dicer, by RNA-dependent RNA Polymerase (RdRP) (Gu et al., Mol Cell., 36: 231-244, 2009; Pak et al., Science, 315: 241-244, 2007; Sijen et al., Science, 315: 244-247, 2007). Still others, such as the piRNAs of animals, are produced, at least in part, through Pol II transcription (Cecere et al., Mol Cell, 47(5): 734-45, 2012; Saito et al., Nature, 461: 1296-1299, 2009).
1. Piwi-Interacting RNAs
Many piRNA species originate from large genomic clusters and direct Piwi-dependent transposon silencing, heterochromatin modification and germ cell maintenance (Aravin et al., Science, 318: 761-764, 2007a; Batista et al., Mol Cell., 31: 67-78, 2008; Brennecke et al., Cell, 128: 1089-1103, 2007; Das et al., Mol Cell., 31: 79-90, 2008; Lin, Science, 316: 397, 2007). In flies and mammals, transposon-directed piRNAs typically map to both strands and are produced by a “ping-pong” amplification cycle, whereby sense piRNAs direct Piwi-dependent cleavage of a primary transcript to generate the 5′ ends of antisense piRNAs and vice versa (Aravin et al., Science, 316: 744-747, 2007b; Brennecke et al., Cell, 128: 1089-1103, 2007; Gunawardane et al., Science, 315: 1587-1590, 2007; Houwing et al., Cell, 129: 69-82, 2007). Recent work suggests that Piwi-bound precursor piRNAs are trimmed by a 3′-to-5′exonuclease and then methylated on the 2′-OH of the 3′ end residue of the mature piRNA (Kawaoka et al., Mol Cell., 43: 1015-1022, 2011). In mice an abundant class of piRNAs, patchytene piRNAs, originates from large genomic clusters (Aravin et al., Nature, 442: 203-207, 2006). These piRNAs are not generated by the ping-pong cycle, but instead appear to be processed directly from a single-strand precursor by an unknown mechanism.
The C. elegans piRNAs, known as 21U-RNAs, are an abundant class of germline-expressed small RNAs that interact with the PIWI ortholog PRG-1 (Batista et al., Mol Cell., 31: 67-78, 2008; Das et al., Mol Cell., 31: 79-90, 2008; Ruby et al., Cell, 127: 1193-1207, 2006). These Piwi-interacting (pi) RNAs are a class of germline-expressed small RNAs that have been linked to epigenetic programming in metazoan. Similar to mammalian pachytene piRNAs, 21U-RNAs are diverse in sequence and the overwhelming majority lack perfectly-complementary RNA targets. The C. elegans piRNAs (21U-RNAs) are defined by more than 15,000 genomically-encoded species. Unlike mammalian piRNAs, however, 21U-RNAs do not appear to be processed from long RNA precursors. Instead, they derive from individual gene-like loci that are dispersed within two large clusters on chromosome IV (Cecere et al., Mol Cell., 47(5): 734-45, 2012; Ruby et al., Cell, 127: 1193-1207, 2006). Within these clusters, more than 15,000 distinct 21U-RNAs are expressed from both strands and reside within introns and intergenic regions, but are rarely found in coding regions (Batista et al., Mol Cell., 31: 67-78, 2008; Ruby et al., Cell, 127: 1193-1207, 2006). The presence of a conserved 8 nucleotide (nt) motif and A/T-rich region upstream of each 21URNA led Ruby et al. (Cell, 127: 1193-1207, 2006) to suggest that 21U-RNAs are independently expressed loci. Consistent with this idea, a recent study identified Forkhead-family transcription factors that associate with the 8 nt motif and whose activity was correlated with 21U-RNA expression (Cecere et al., Mol Cell., 47(5): 734-45, 2012).
The methods of the invention are particularly useful in understanding the biogenesis and expression of piRNAs as these small RNAs are suggested to play a role in gene regulation. Examples 10-18 demonstrate the use of the methods of the invention (methods that enrich the 5′ ends of Pol II transcripts) to understand the origin of C. elegans 21U-RNAs. Examples 10-18 demonstrate that a species of capped-short (cs) RNA is frequently expressed bidirectionally at Pol II loci in C. elegans. Interestingly, at annotated 21U-RNA loci, csRNAs originate precisely 2 nt upstream of the mature piRNA species, suggesting that csRNAs are piRNA precursors. In addition, it is shown that csRNAs associated with TS sites genome-wide define a second class of 21U-RNA loci, and nearly double the number of piRNA species available for genome surveillance. Examples 10-18 demonstrate that the methods of the invention have general utility in TS site identification and 5′ anchored RNA-expression profiling.
C. Identification of TS Sites
To illustrate the general utility of this approach it is demonstrated herein that CapSeq can be used to identify TS sites of genes expressed, e.g., genes in mouse testes, including candidate TS sites for primary miRNA transcripts and piRNA clusters. In order to identify TS sites (or candidate TS sites, one or a population of 5′ sequence tags generated using the methods of the invention cap be mapped to genes within a database of interest.
5′ sequence tags of the invention can be instrumental in understanding gene regulation, determination of transcriptional start sites, understanding the biogenesis of short non-coding RNAs, etc. 5′ sequence tags can be subject to bioinformational analysis and organized into databases representative of a variety of cell types, species, developmental stages, etc.).
In general, a 5′ sequence tag results from quick sequencing of a cloned cDNA. The cDNAs used for 5′ sequence tag generation are typically individual clones from a cDNA library generated using the rapid enzymatic enhancement of capped RNAs or CapSeq protocols of the invention. The resulting cDNA sequence is not necessarily a perfect copy (or complement) of the RNA species.
5′ sequence tags (or “reads”) can be mapped to specific chromosome locations using physical mapping techniques, such as radiation hybrid mapping, Happy mapping, or FISH. Alternatively, if the genome of the organism that originated the 5′ sequence tag has been sequenced, one can align the 5′ sequence tag sequence to that genome using a computer. Reads from a cDNA library can be limited, for example, by removing reads mapping to genes or RNAs not of interest. For example, reads mapping to structural RNAs can be removed from a sample or population of reads intended to be the subject of further analysis.
In this respect, 5′ sequence tag are a valuable tool to identify predicted transcription start (TS) sites throughout the genome, which leads to the prediction and/or understanding of gene transcripts, their regulation, and ultimately their function. Moreover, the situation in which the 5′ sequence tags are obtained (tissue, organ, organism, cell type, disease state, developmental state, etc.) gives information on the conditions in which the corresponding transcript is acting. Ultimately, 5′ sequence tag (or portions thereof) can be used to design probes and/or primers based on the nucleotide sequence contained therein, for example, in better understanding several important aspect of gene regulation.
Transcription start (TS) sites identified by mapping 5′ sequence tags of the invention can be considered “candidate” TS sites, and subject to further analysis, for example, bioinformational analysis, sequence analysis, comparison with data generated using other comparable or compatible sequencing approaches, etc. In exemplary embodiments of the invention, candidate TS sites can be mapped to sequences in databases including, but not limited to miRNA databases, whole genome databases, non-coding RNA databases, etc.
D. Clinical and Related Uses
The protocols of the invention are particularly useful for analyzing clinical samples, e.g., for gene expression analysis, because Capseq is rapid and uses very little starting material, which is often limiting when using clinical samples, e.g., tissue, cell, body fluid samples and the like. The methods of the invention also provide added benefit based in the fact that the methods are liquid-based. Thus the skilled artisan can process multiple samples in a parallel fashion.
The protocols of the invention are also particularly well-suited for processing samples (biological samples) in which the nucleic acid, e.g., RNA, is partially degraded. This is often the case when using, for example, certain biopsy samples or clinical samples. Using the methods of the invention, analysis of degraded RNA is possible. By contrast, other art-recognized methods are not suited for use with degraded RNA samples.
The protocols of the invention are also particularly suited for detecting alternative promoter usage or mRNA 5′ ends from cancer samples, embryonic stem cells, and the like.
In essence, the methods of the invention can replace the art-recognized CAGE methodology and can also be used as a common RNA-seq way to compare gene levels (e.g., by comparing the tag portions of reads of the invention).
Comparing to other art-recognized protocol, RNA-seq (poly A purification) is biased for poly A containing RNA while using Ribo-minus requires the use of different kits to remove rRNA for different organisms. By contrast, CapSeq can clone all mRNA and is not subject to species difference.
The invention also pertains to compositions made using the enzymatic method of enriching for capped RNAs described herein.
In one embodiment, the invention pertains to an RNA mixture comprising RNA molecules comprising transcriptional start sites, wherein the RNA molecules comprise monophosphorylated 5′ termini or hydroxyl 5′ termini, wherein the mixture is substantially free of ribosomal RNA, RNA with triphosphorylated 5′ termini, and 5′ m7G capped RNA.
Because the instant method does not involve a poly A+ selection step, in one embodiment, the mixture comprises RNAs substantially free of poly-A at the 3′ terminus.
In a further embodiment, the mixture is substantially free of RNA with poly-A at the 3′ terminus.
In one embodiment, the mixture comprises RNA with a size of less than 200 nucleotides. In one embodiment, the mixture comprises RNA with a size of between about 50 nucleotides and about 200 nucleotides. As used in this context, the term about refers to +/−5 nucleotides. For example, in one embodiment, the mixture comprises RNA with a size of between 45 and 205 nucleotides. In another embodiment, the mixture comprises RNA with a size of between about 75 and 200 nucleotides. In another embodiment, the mixture comprises RNA with a size of between about 75 and 175 nucleotides. In another embodiment, the mixture comprises RNA with a size of between about 75 and 150 nucleotides. In another embodiment, the mixture comprises RNA with a size of between about 75 and 125 nucleotides. In another embodiment, the mixture comprises RNA with a size of between about 75 and 100 nucleotides. In another embodiment, the mixture comprises RNA with a size of between about 50 and 175 nucleotides. In another embodiment, the mixture comprises RNA with a size of between about 50 and 175 nucleotides. In another embodiment, the mixture comprises RNA with a size of between about 50 and 150 nucleotides. In another embodiment, the mixture comprises RNA with a size of between about 50 and 125 nucleotides. In another embodiment, the mixture comprises RNA with a size of between about 50 and 100 nucleotides. In another embodiment, the mixture comprises RNA with a size of between about 50 and 75 nucleotides.
In one embodiment, the RNA mixture comprises a substantially pure population of RNA molecules having a size of between about 130 nucleotides and 170 nucleotides.
The present invention also provides kits or systems for performing any of the methods of the invention, including any of the steps of said methods.
For example, in one embodiment, the present invention provides a kit comprising: a) a first component comprising a 5′ monophosphate-dependent exonuclease, a second component comprising a phosphatase, and a third component comprising a decapping enzyme. In one embodiment, the kit further comprises a control sample. In one embodiment, the kit further comprises instructions for use of the kit to enrich for capped RNAs.
All publications, patents, and patent applications cited herein, whether supra or infra, are hereby incorporated by reference in their entirety. **************
The above disclosure generally describes the present disclosure, which is further exemplified by the following examples. These specific examples are described solely for purposes of illustration, and are not intended to limit the scope of this disclosure. Although specific targets, terms, and values have been employed herein, such targets, terms, and values will likewise be understood as exemplary and non-limiting to the scope of this disclosure.
Conventional RNAseq strategies often require large amounts of RNA (e.g., on the order of at least 5 to 10 μg of total RNA, but often much higher amounts), as well as positive or negative oligo-based selection steps. Although commercially available kits claim that 1 μg or less RNA can be used for these purposes, these amounts are on the border of the detection limit, putting into question the quality of the resulting cDNA library. Such enrichment steps often require purification columns, which can be inefficient and bias the population of enriched products. The present invention provides for a fast, efficient, and sensitive enzymatic method for eliminating contaminating RNA sequences and enriching for RNAs that contain 5′ cap structures.
The method, in brief, entails purification of total RNA from a target cell population. As shown in
Terminator (Epicentre; TER51020), CIP (NEB; MO290S), T4 RNA ligase (Takara 2050A), SUPERaseIn™ (Ambion; AM2696), TAP (Epicentre; T19050), DNase I (Ambion; AM2222), Superscript III (Invitrogen; 18080, with 10× buffer without Mg2+), ExTaq (Takara; RR001B), Taq (Roche), pCR 2.1 TOPO (Invitrogen; K4500-02), RNase A (Ambion; AM2270), RNase H (Ambion; AM2292), Phase lock gel heavy (5PRIME), spin-x 0.45 μm (Costar; 19442-758), 10 bp DNA marker (Invitrogen 10821-015), phenol/chloroform pH 6-8 (for DNA, pH 7-8 is better), glycoblue (Ambion; AM9515), glycogen (Ambion; AM9510). With respect to oligos, the reverse transcription oligo (RT oligo) was 5′-CAGAAGACGGCATACGANNNNNNNN-3′, the solexa 3 oligo was 5′-CAAGCAGAAGACGGCATACGA, the CMo13279 oligo was 5′-GTTCTACAGTCCGACGATC-3′, the CMo13278 oligo was 5′-AATGATACGGCGACCACCGACAGGTTCAGAGTTCTACAGTCCGACGATC-3′, the M13 forward primer was 5′-GTAAAACGACGGCCAG-3′, the M13 reverse primer was 5′-CAGGAAACAGCTATGAC-3′, and the T7 primer was 5′-TAATACGACTCACTATAGGG-3′.
The 5′ linker was a DNA/RNA hybrid oligo (RNA underscored, otherwise DNA):
Precipitation involved adding 1/10th volume of sodium acetate and 20 μg glycogen or 30 μg glycoblue. To this mixture was added 1-1.5× isopropanol or 3-4× ethanol. The resulting mixture was chilled at −20° C. for 30 min and centrifuged @ 20,000 g for 15 min at 4° C.
Phase lock involved centrifuging the sample at 13,000×g for 2 minutes. This was followed by mixing phenol/chloroform with equal volume of RNA/DNA (at least 100 μL each). The mixture was mixed by finger tapping, followed by centrifugation at 13,000×g for 4 minutes. The salt concentration was no more than 0.5M.
Gels were made with Biorad cassette 345-9902 and the Biorad criterion system was used to run the PAGE gels. Gels were cut to be in small pieces, and cracked in 1.5 mL siliconized tubes by grinding the gel against the wall. RNA/DNA elution buffer (0.8 mL; 10 mM Tris-Cl pH 7.5, 1 mM EDTA, and 0.3M NaCl) was added to elute overnight. The spin-x column was used to remove gel. The resulting product was precipitated as described above with isopropanol without additional salt.
RNA can be isolated from desired tissue or whole organisms (e.g., C. elegans) using art-recognized techniques or commercially available kits (e.g., Qiagen).
Terminator™ 5′-phosphate-dependent exonuclease (Epicentre, Madison, Wis.) treatment involved setting up a 20 μL reaction with the following components:
This mixture was incubated at 30° C. for 2 hours.
The following components were added to the Terminator treatment mixture from above.
This mixture was incubated at 37° C. for 30 minutes, followed by phenol/chloroform/glycoblue precipitation using art-recognized techniques.
The following components were added to the precipitated mixture from the CIP treatment above.
This mixture was incubated at 37° C. for 1 hour, followed by phenol/chloroform extraction without glycoblue (since glycoblue was added in the previous step).
The following components were added to the pellet obtained from the precipitate after TAP treatment:
This mixture was incubated at 15° C. for 6 hours, followed by 4° C. overnight. The next day, the mixture was subjected to phenol/chloroform extraction without glycoblue.
The 5′ linker barcodes were as follows: embryonic (barcodeA), L1 developmental stage (barcodeB), L3 developmental stage (barcodeC), and YA developmental stage (barcodeD).
The following components were added to the pellet obtained from the precipitate after the 5′ ligation step:
This mixture was incubated at 65° C. for 5 minutes, followed by incubation on ice for 2 minutes. The following was added to this mixture:
This mixture was incubated at 15° C. for 10 minutes, 25° C. for 10 minutes, 50° C. for 50 minutes, and then 85° C. for 5 minutes. Following this, 1 μL of RNase A and 1 μL of RNase H were added to the mixture, which was subsequently incubated at 37° C. for 20 minutes.
Linear PCR to Amplify cDNA
The following components were added to the mixture from the RT reaction above:
The PCR conditions were as follows:
One cycle: 94° C. for 60 seconds
12 cycles: 94° C. for 20 seconds
This mixture was phenol/chloroform extracted without addition of glycoblue. This was followed by 15% PAGE (with 7M urea) gel purification to obtain PCR products of 130-170 nucleotides (total RNA containing 5S and 5.8S rRNAs were used as markers).
Purified linear PCR products from the step above were used for a test PCR reaction. This step, although optional, can be carried out to identify the optimal PCR cycles for amplification. Such optimization greatly increases the quality of the cDNA library. PCR products at different cycles are sampled to determine when enough product is obtained without over-cycling.
First, gel purified PCR products were precipitated with glycoblue and resuspended in 20 μL of water. The test PCR reaction was set up as follows:
The PCR conditions were as follows:
One cycle: 94° C. for 20 seconds
15 cycles: 94° C. for 20 seconds
The above PCR reaction was carried out in two tubes for each sample, followed by purification of PCR products using 8% PAGE without urea. A 10 bp marker was used to locate the position of PCR products of the correct size range (i.e., 130-170 nucleotides).
Samples from the gel purification above were resuspended in 15 μL Tris-Cl (pH 7.5) one at a time to avoid denaturing, as DNA denatures when totally dried. The resuspended samples were subjected to TA cloning as follows:
This mixture was incubated at 72° C. for 15 minutes. Following this, 1 μL of salt buffer and 0.5 μL of enzyme was added. This mixture was incubated at room temperature for 30 minutes, followed by transformation with TOP10 competent cells. Transformed cells were streaked onto ampicillin and β-galactosidase containing plates.
Colonies of transformed bacteria on the streaked plates were subjected to colony PCR as follows:
PCR conditions were as follows:
One cycle: 94° C. for 120 seconds
15 cycles: 94° C. for 20 seconds
Alternative to manual sequencing, the purified PCR products were subjected to Solexa sequencing (Illumina) using Solexa sequencing tags in accordance with the manufacturer's instructions. The sequence obtained was 75 nucleotides long from single end sequencing.
RNA was extracted from C. elegans from the following developmental stages: embryonic, L1, L3, and young adult (YA). RNA from each of these stages were processed using the method as described in Examples 1-3, followed by deep sequencing with Illumina Solexa sequencing technology.
C. elegans were grown at 20° C. on E. coli strain OP50 as a food source and harvested at different developmental stages, as indicated. Worms were then dounced with phenol (pH6.7-7) with a metal douncer to expose RNA. The water phase of the extract was precipitated with 0.3M sodium acetate (NaAc) pH 5.2 and 1 volume of isopropanol at −20° C. for at least 20 minutes. The RNA was precipitated by centrifugation at 20,000×g for 15 minutes at 4° C., washed once with 70% cold ethanol, and dissolved in water. To clean the RNA further, another equal volume phenol extraction was carried out as described above.
Genome information (WS215) was downloaded from the publicly available WormBase database. The mapping was performed using bowtie in which only 1 mismatch at most for each 20 nt was allowed. That is, e.g., if the read is 72 nucleotides long, the number of mismatches cannot exceed 3. Before matching, the splice leader (SL) was searched for each read, and removed if present, using a custom PERL script. Given barcoding issues (i.e., a certain percentage of barcodes are shorter than expected) and issues related to removing mutated SL sequences (some SL's are larger than others), the resulting reads could have 3′ ends off by 1-2nt than expected. Accordingly, matching reads beginning at the same position on the same genome strand are combined and the end of the combined reads is represented by the end of the longest reads among them using a custom PERL script. Each sample was normalized to 5 million mRNA sequences and visualized using a generic genome browser.
Using the procedure described in Example 4, the 5′ ends of drh-3 transcripts obtained were compared with annotated drh-3 (D2005.5) transcripts. drh-3 encodes a RNA helicase-like protein which shares similarity to dicer-related helicases. As shown in
Using the procedure described in Examples 4 and 5, the 5′ ends of csr-1, pgl-1, and msp-76 transcripts obtained were compared with the annotated csr-1 and pgl-1 transcripts.
As shown in
Using the procedure described in Examples 4 and 5, the 5′ ends of transcripts for the small RNA genes K02E2.6 (i.e., K02E2.6.1 and K02E2.6.2) and F55C9.3 obtained were compared with the respective annotated sequences.
Using the procedure described in Examples 4 and 5, the 5′ ends of transcripts for pri-lin-4 and the pri-mir-42-43-44-45 and pri-mir-229-64-65-66 clusters obtained were compared with the respective annotated sequences.
Similar results were seen for the pri-mir-42-43-44-45 and pri-mir-229-64-65-66 clusters (
Besides these examples, the instant method detected potential pri-mRNAs for most of the well annotated miRNAs in C. elegans.
Using the procedure described in Examples 4 and 5, the 5′ ends of transcripts for pri/pre-21 ur-3338 obtained were compared with the respective annotated sequences.
In order to further explore the expression and biogenesis of 21U-RNAs, the CapSeq protocol was used to enrich 5′sequence tags of approximately 70 to 90 nucleotides in length for RNA Polymerase II (Pol II) transcripts. In addition, a previously-described enzymatic treatment (referred to here as CIP-TAP cloning; Affymetrix/Cold Spring Harbor Laboratory ENCODE Transcriptome Project, 2009) was used to enrich for capped-small (cs) RNAs like those previously found associated with promoter regions of Pol II genes in a variety of organisms (Haussecker et al., Nat Struct Mol Biol, 15: 714-721, 2008; Nechaev et al., Science, 327: 335-338, 2010; Seila et al., Cell Cycle, 8: 2557-2564, 2009; Taft et al., Cell Cycle, 8: 2332-2338, 2009). Together the data in the following Examples define candidate TS sites throughout the genome, including TS sites for rare transcripts such as primary-miRNAs (pri-miRNAs) and for pre-mRNAs associated with thousands of transpliced mRNAs. CapSeq reads usually mapped sense relative to annotated genes and often appeared to coincide with the 5′ends of the corresponding mature Pol II products. In contrast, csRNAs were frequently bidirectional upstream of protein-coding genes, with antisense-oriented csRNAs positioned ˜150 bps upstream of the sense-oriented csRNAs.
Using the CapSeq approach, we generated 5 libraries from 3 different developmental stages (L1, L3, and adult) and obtained ˜61 million reads that mapped to the C. elegans genome, including 46 million that mapped to non-structural RNAs. Visual inspection using the genome browser software “Gbrowse” revealed that most CapSeq reads are indeed enriched at the 5′ ends of genes transcribed by RNA Pol II (
To estimate the fidelity with which CapSeq defines the actual 5′ ends of transcripts, as opposed to internally truncated RNAs, the degradation rate of the spliced leader SL1, a 22 nt 5′ leader sequence trans-spliced to many C. elegans genes, was analyzed (Allen et al., Genome Res, 21: 255-264, 2011). For this analysis, five highly-expressed SL1 trans-spliced genes were chosen: alh-8, rps-24, rps-3, rps-29 and rps-15. Complete or partial SL1 reads appended with sequences that perfectly matched to the first 30 nt of each gene were analyzed. For technical and computational reasons, considered sequences with at least the last 3 nts of SL1 were considered, and sequences missing only the first nt of SL1 were excluded (see Experimental Procedures following Example 18). It was found that the average frequency of cloning a 5′-truncated product at each position (nt 3-20, relative to the last nt) of SL1 was approximately 1 in 15,000 per nucleotide relative to intact SL1 reads.
As an additional and more general approach to place an upper limit on the frequency of cloning degradation products by CapSeq, reads were considered that map to all 6,967 trans-spliced transcripts both annotated by WormBase and confirmed by the above data set. Next were compared the number of non-SL reads that start at each position of the transcript to the total number of SL-containing reads. This analysis revealed that, for every SL read detected, a non-SL read (potential degradation product) occurred at an average rate of 1 in 17,000 at each position along a transcript. Together, these data indicate that CapSeq strongly enriches for cloning the 5′-most nt of capped transcripts.
A total of 16,784 unique SL trans-splice sites (
Analysis of C. elegans CapSeq and CIP-TAP data, containing lists of trans-splice sites, transcription start sites, sense csRNAs derived from protein coding genes, and antisense csRNAs derived from protein-coding genes was conducted. Also analyzed were lists of the transcription start sites for pri-miRNAs.
Among 5,711 new trans-splice sites, 4,186 were associated with protein coding genes and mapped to splice acceptor sites within 500 nt upstream of annotated 5′ ends of transcripts (31%), to annotated exon-exon cis-splice-acceptor sites (37%), and to previously non-annotated splice-acceptor sites within annotated exons (25%) or annotated introns (6%). In addition to identifying new trans-splice sites, the analysis also revealed the relative abundance of alternative SL usage for each locus (Table 1A). For example, it was found that 25% of all unique trans-splice sites identified in this study use multiple spliced leaders.
The transcription start (TS) sites for C. elegans genes are poorly mapped. Most gene annotations in WormBase simply indicate the 5′ end of the trans-spliced exon, or the position of the AUG codon. The 5′ ends of many non-SL CapSeq reads mapped near, but did not coincide precisely with, the 5′ ends of annotated transcripts (
Using the CapSeq protocol, it was found that, for CapSeq reads detected at a frequency of one read per 10 million total reads or greater, approximately 60% of the corresponding genomic loci exhibited a YR motif. This prevalence of the YR motif is significantly higher (P-value=0) than the 25% probability of a YR motif occurring by chance in the genome. The percentage of candidate TS sites with YR motifs increased to 80% when the cutoff was increased from 1 read per 10 million to 100 reads per 10 million, suggesting that reads lacking a YR motif represent either degradation-derived reads or TS sites that are less-frequently utilized. Using a cutoff of one CapSeq read per 10 million total reads and a requirement for a YR motif, the CapSeq data predicted approximately 64,000 candidate TS sites genome wide.
In order to pair candidate TS sites with existing annotations, CapSeq reads within a 1000 nt window upstream of gene annotations were considered. This distance was chosen as a conservative upper limit to allow for the possibility of non-annotated 5′ exon sequence or long distances between the TS site and the first splice acceptor site required for trans-splicing. For most genes the 3′ limit was arbitrarily set at a distance of 200 nt downstream of the annotated 5′ end. However, in order to reduce the chance of scoring degradation products as TS sites, the 3′ limit was reduced to 100 nt for very abundantly transcribed genes whose total read counts exceeded 1000 reads per ten million. Using these criteria, candidate TS sites could be assigned to more than 50% of annotations in WS215, including 52% of annotated protein-coding genes (10,667 genes), 15% of annotated pseudogenes (226), 54% of annotated non-coding RNAs (137), 74% of snoRNA loci (102), and 37% of snRNA genes (42). It was found that a typical gene has multiple candidate TS sites separated by several to sometimes hundreds of nucleotides (
19,828 candidate TS sites were also identified that could not be readily paired with annotations based on the above criteria. These defined 12,457 clusters where candidate TS sites were found to reside within a 100 nt interval typical of annotated Pol II genes. The majority of these (84%) were separated from other annotations or from each other by greater than 1 kb. These findings suggest that there are many as yet non-annotated Pol II loci in the C. elegans genome and/or that many of the existing annotations are separated from their actual 5′ ends by greater than the arbitrarily set 1 kb limit used for this analysis.
Recent studies have identified csRNAs associated with promoters and TS sites (Haussecker et al., Nat Struct Mol Biol, 15: 714-721, 2008; Nechaev et al., Science, 327: 335-338, 2010; Seila et al., Cell Cycle, 8: 2557-2564, 2009; Taft et al., Cell Cycle, 8: 2332-2338, 2009). To identify csRNAs in C. elegans, an 18-40 nt size fraction was gel-purified from whole RNA. In order to select against the recovery of abundant un-capped small RNA species, including 22G-RNAs, miRNAs and mature piRNAs, the sample was treated with CIP to remove 5′ mono- or tri-phosphates, thus making them inaccessible to 5′ ligation (
As expected, deep-sequencing of the CIP-PNK sample revealed abundant 22G-RNAs, miRNAs and piRNAs, but very few small RNAs that mapped to Pol II promoters (
Interestingly, bidirectional, divergent csRNAs were observed upstream of many genes (
At bidirectional loci, the antisense csRNAs were located an average of ˜150 nt upstream of the sense csRNAs. In contrast, long-capped RNA reads from CapSeq were almost exclusively collinear with the downstream gene (
As with long capped RNAs cloned by CapSeq, csRNAs appeared to be specific to Pol II promoters, as they were not significantly enriched at Pol I or Pol III promoters (
The abundant 22G-RNAs of C. elegans are synthesized by RNA-dependent RNA polymerases (RdRP), utilizing the mature mRNA as a template (CIP-PNK sample in
miRNAs are sequentially processed from primary transcripts (pri-miRNAs) synthesized by Pol II. Drosha processes pri-miRNAs into stem-loop precursors (pre-miRNAs) that are exported to the cytoplasm and processed by Dicer into mature miRNAs (Hutvagner et al., Nat Rev Mol Cell Biol, 9: 22-32, 2008). To identify candidate TS sites for miRNA genes, both csRNAs and CapSeq reads mapping upstream of annotated miRNAs were analyzed. Because many miRNAs are co-expressed in a single primary transcript (Lau et al., Science, 294: 858-862, 2001), there are only about 100 unique miRNA loci annotated in the C. elegans genome. At least 1 candidate TS site was identified for 64 of the 100 annotated miRNA loci corresponding to 83 individual mature miRNAs.
It was found that CapSeq and csRNA reads that mapped upstream of the pre-miRNAs frequently shared the same 5′ end and, as with other Pol II loci, were often clustered within a short interval (
To identify potential precursor 21U-RNA transcripts (pre-21Us), a search was conducted for CapSeq and csRNA reads mapping to the 9079 unique, non-overlapping 21U-RNAs annotated in WS215. Strikingly, it was found that capped-short RNA reads identified by CIP-TAP were much more strongly correlated with 21U-RNAs than were capped long RNA reads identified by CapSeq. For example, CapSeq reads mapping to only 217 annotated 21U-RNA loci were identified, whereas candidate csRNAs mapped to approximately 6,000 of the 9,079 21U-RNA loci. Interestingly, a very strong bias for the mature 21U-RNA species to map 2 nt downstream of CapSeq loci was observed (44/217, p-value<1.2E-14;
Table 2C 21U-RNA related analyses. 21U-like RNAs, new 21U-RNAs, and 21U-RNAs with −2 CapSeq reads
Reads with 5′ ends that align 2 nt upstream of mature 21U-RNAs (−2 csRNAs) were enriched approximately 60-fold by CIP-TAP cloning relative to the CIP-PNK cloning method, and peaked at 25-26 nt (
Next, experiments were conducted to determine if partially overlapping 21U-RNA species are processed from a single or from multiple independent transcripts that share a promoter. Of approximately 6,000 annotated 21U-RNAs that partially overlap, a subset of 2,301 distinct 21U-RNA pairs were analyzed (
When the 2,301 21U-RNA pairs were analyzed for the presence of YR motifs, it was found that, for about half of the pairs (1,130), it is impossible for YR motifs to exist for both sister 21U-RNAs because the mature 21U-RNA 5′ ends are separated by only 1 nt (
While analyzing the CIP-TAP data, it was noticed that there were many abundant csRNA-producing loci within the 21U-RNA clusters on LGIV for which mature 21U-RNAs were not detected. Altogether, 2,309 csRNA-producing loci were identified that fail to produce mature 21U-RNAs.
The csRNA reads obtained from these loci were similar in both size and abundance to those derived from canonical 21U-RNA loci (
Given the large number of 21U-like loci, it was reasoned that polymorphisms in C. elegans wild-isolates might convert the +3 residue to a U at one or more of these loci. Consistent with this possibility two 21U-RNAs (IV+17159702-17159722 and IV+1590356315903583) were identified that were cloned from JU1580 (Felix et al., PLoS Biol, 9: e1000586, 2011) and CB4856 respectively, but not from N2, and that mapped to a 21U-like loci. In both cases, independent deep-sequencing data confirmed that the wild-isolates contain SNPs in the corresponding 21U-like loci that change the +3 residue of the csRNA to a U. Together, these findings indicate that a U at position +3 of a csRNA is important for 21U-RNA processing and/or stability.
In previous studies, the computational identification of 21U-RNAs required both the YRNT motif, which we now recognize as a transcription start site, as well as a larger motif containing a conserved 8 nt consensus (CTGTTTCA) positioned 40 nt upstream (
Although the majority of the newly defined 21U-RNAs were derived from atypical loci, it was found that they exhibit the same 2′-O-methyl modification found on the 3′ ends canonical 21U-RNA species, as they were enriched in a previously published 3′ end oxidization experiment (
These newly identified 21U-species nearly double the total number of Piwi-associated small RNAs in C. elegans, and include several extremely abundant 21U-RNAs. The single most abundant 21U-RNA derives from a highly expressed csRNA locus on the X chromosome (
To confirm that CapSeq can identify TS sites from other species, a pilot study was performed using mouse testis RNA. As shown in
Reads mapping to mouse piRNA clusters were also analyzed (
The data in Examples 10-18 interestingly show that the majority of 21U-RNA loci produce csRNAs but do not produce longer transcripts defined by CapSeq reads. The majority of 21U-RNA-associated csRNAs are unidirectional and originate precisely 2 nucleotides upstream of the corresponding mature 21U-RNA species. These findings suggest that csRNAs are processed into piRNAs by removing the cap plus two nucleotides, and by trimming the 3′ end. These data show that the 8 nucleotide motif upstream of canonical 21U-RNAs is not required for 21U-RNA biogenesis. Rather, these findings suggest that any germline-expressed csRNA that contains a +3U can give rise to a piRNA that is loaded onto the Piwi Argonaute, including thousands of csRNAs associated with protein-coding and other Pol II loci. These findings reveal a role for promoter-associated csRNAs in piRNA biogenesis and uncover a species of piRNA associated with TS sites of Pol II genes throughout the genome, nearly doubling the repertoire of piRNA species in C. elegans.
The above examples inform the skilled artisan on several important aspects relating to the expression of C. elegans piRNAs. The analysis employed two approaches, CapSeq and CIP-TAP, both of which enrich for the 5′ ends of Pol 11 transcripts. The CapSeq protocol, designed to select for long-capped RNAs, identified only ˜200 sequences that overlap annotated piRNAs, but detected thousands of candidate TS sites for other Pol II genes. The CIP-TAP protocol, designed to detect capped-small RNAs, identified thousands of candidate piRNA precursor transcripts that average 26 nt in length and initiate 2 nt upstream of the mature piRNA species. In addition, CIP-TAP identified csRNAs that were associated with many other Pol II promoters, where they were frequently oriented divergently, with the sense csRNA often corresponding to a major TS site for the corresponding longer transcript detected by CapSeq.
Strikingly, germline-expressed csRNAs genome-wide that contain a U at the +3 position were found to be processed into piRNAs (21U-RNAs) and to associate with the Piwi Argonaute PRG-1. These findings indicate that the U in the YRNU motif is important for 21U-RNA stability, processing, or Piwi Argonaute loading, and that the YR is important for efficient transcription initiation (See Model
An Enzymatic Approach for 5′-End Anchored Transcription Profiling
Transcription profiling by deep sequencing has become an increasingly important tool for following gene expression. The CapSeq protocol described here facilitates transcription profiling by using a series of three enzymatic treatments that dramatically enrich for the 5′ ends of Pol II transcripts. Because this approach does not require affinity purification to remove structural RNA contaminants, it is usually performed on relatively small quantities of RNA, and aside from one gel purification for size selection, the entire procedure is carried out in a PCR tube. Importantly, the CapSeq procedure anchors clones at the 5′ cap of Pol II transcripts, and thus can clone RNAs with or without poly (A) tails. Thus CapSeq provides a quantitative way to profile a diversity of Pol II transcripts, while providing insights on alternative transcription-initiation sites, which may be of potential developmental significance.
The studies on C. elegans have enabled measurement of the fidelity with which CapSeq recovers the 5′ ends of transcripts. The presence of SL sequences on mRNA clones provides a convenient and objective marker for bona fide 5′ ends. By measuring the frequency of truncated clones lacking a full-length SL sequence, it was estimated that degraded mRNA clones were recovered in the CapSeq protocol at a frequency of less than one in 15,000 per position.
Genome-Wide Identification of Pol II TS Sites
The data described here provide the first systematic and comprehensive look at the TS sites of Pol II transcripts in C. elegans. The trans-splicing of SL sequences to the 5′ ends of many mature transcripts confounds the identification of TS sites in C. elegans. Consequently, only a handful of TS sites for C. elegans Pol II transcripts have previously been identified (Allen et al., Genome Res, 21: 255-264, 2011; Morton et al., RNA, 17: 327-337, 2011). By using CapSeq to clone capped transcripts from several different stages of development, candidate TS sites have been identified for approximately 50% of the annotated protein-coding genes in C. elegans. In addition, 5′ ends for other Pol II transcripts have been identified that are typically under-represented in poly(A)-selected RNA-seq studies, including snRNAs, snoRNAs, SL RNA precursors, and histone mRNAs. In keeping with predictions from previous studies (Allen et al., Genome Res, 21: 255-264, 2011), it was found that an overall 70% of annotated protein-encoding genes had trans-spliced forms. Because of the abundance of SL-containing reads, these findings provide a comprehensive measure of alternative spliced-leader usage for most genes and useful data for refining the prediction of SL splice-acceptor sites.
This analysis also enabled the identification of candidate TS sites for many miRNA primary transcripts. Altogether, by combining OP-TAP and CapSeq data, it was possible to predict TS sites for 60% of the annotated C. elegans miRNAs. Surprisingly, it found that expression levels, as inferred from read counts for pri-miRNAs, were comparable to that of many abundant protein-coding genes. In contrast, pri-miRNA transcripts were very rarely detected in data from a previous study that used poly(A) selection RNA-Seq protocols (Lamm et al., Genome Res, 21: 265-275, 2011), suggesting that either pri-miRNAs lose their poly(A) tail more rapidly than their 5′ cap, or perhaps lack a poly(A) tail entirely. It was concluded that CapSeq and CIP-TAP can be used to quantify the activity of a wide variety of Pol II genes. The approach described here can readily be extended to produce a comprehensive profile of C. elegans TS sites. Finally, sequencing of the mouse testes CapSeq library also revealed a strong enrichment for RNA 5′ ends and a YR motif around mouse TS sites of Pol II genes, including miRNAs and piRNAs. As such, it is predicted that CapSeq will be a valuable tool for the identification of Pol II TS sites, and for transcription profiling in a wide variety of organisms.
Capped Small RNAs are Associated with Promoters in C. elegans
csRNAs have thus been identified in C. elegans that map to TS sites of Pol II genes, including protein-coding, miRNA and other non-coding RNA genes. Like the longer reads recovered by CapSeq, csRNAs exhibit a consensus Pol II initiator element yYRyyy. Indeed the 5′ ends of csRNAs often coincide with the 5′ ends of CapSeq reads. However, unlike CapSeq reads, csRNAs are frequently bidirectional at promoters, with divergent csRNAs separated by an average of approximately 150 nt. This finding is consistent with the idea that many eukaryotic promoters are intrinsically bidirectional (Seila et al., Cell Cycle, 8: 2557-2564, 2009). In general, for csRNA and CapSeq reads that share a common 5′ end, the abundance of csRNA reads was proportional to the abundance of CapSeq reads, suggesting that csRNAs might be associated with Pol II initiation at active promoters. Despite their correlation with active gene expression, the above analysis suggests that csRNAs are relatively low abundance transcripts compared to other small RNAs. Based on the CIP-TAP cloning experiments, it is estimated that csRNAs represent less than 1% of the total small RNAs in adult C. elegans.
Capped-small RNAs that flank the TS sites of active promoters have been identified in mammals and Drosophila (Core et al., Science, 322: 1845-1848, 2008; Haussecker et al., Nat Struct Mol Biol, 15: 714-721, 2008; Seila et al., Science, 322: 1849-1851, 2008; Yamamoto et al., BMC Genomics, 8: 67, 2007). The above data suggest that csRNAs are most similar to PASRs, which were enriched using a CIP-TAP cloning method and had 5′ ends that frequently coincided with those of capped RNAs identified by CAGE (Affymetrix/Cold Spring Harbor Laboratory ENCODE Transcriptome Project, 2009; Kapranov et al., Science, 8, 316, 5830: 1484-8, 2007; Kapranov et al., Nature, 457, 7232:1028-32, 2009). Although the biogenesis and function of PASRs remains unknown, it has been speculated that PASRs might reflect Pol II pausing, or premature termination, or that they are processed from promoter-associated long-capped RNAs (Affymetrix/Cold Spring Harbor Laboratory ENCODE Transcriptome Project, 2009; Nechaev et al., Science, 327: 335-338, 2010). The above data suggest that csRNAs are independent transcripts. If csRNAs are derived from random 3′ degradation of long capped RNAs, it might have been expected to observe a broader size range continuing up to 32 nt (the largest size able to be sequenced in this experiment). The size of C. elegans csRNAs is similar to the estimated size of approximately 28 nts of nascent RNA that can be accommodated in the Pol II exit canal (Andrecka et al., Proc Nati Acad Sci USA, 105: 135-140, 2008; Chen et al., Proc Natl Acad Sci USA, 106: 127-132, 2009; Proudfoot et al., Cell, 108: 501-512, 2002). This size is also similar to that of csRNAs found associated with promoter-proximal pausing of Pol II, which occurs at many genes throughout metazoan genomes (Nechaev et al., Science, 327: 335-338, 2010; Rasmussen et al., J Mol Biol, 252: 522-535, 1995).
C. elegans piRNAs are Processed from Capped Small RNAs
Here it has been shown that csRNA loci genome wide can give rise to 21U-RNAs in approximate proportion to their abundance. The only requirement for 21U-RNA production was the presence of a U residue at the +3 position of the csRNA. These findings are consistent with a model in which csRNAs are precursors for 21U-RNA production (
When multiple csRNAs were produced at 21U-RNA loci, they typically shared the same orientation and their 5′ ends were often separated by not more than 5 bp. In contrast, other Pol II loci, such as protein-coding genes, produced sense-oriented CapSeq reads at high abundance, and multiple relatively low abundance csRNAs that were often oriented in both directions, with the antisense csRNA divergently expressed about 150 nt upstream of the sense csRNAs (
Recent studies have shown that PRG-1 and its piRNA cofactors provide an important first line of defense in a surveillance pathway that distinguishes self from non-self (Ashe et al., Cell, 150: 88-99, 2012; Lee et al., Cell, 150: 7887, 2012; Shirayama et al., Germline Cell, 150: 65-77, 2012). Importantly, PRG-1/piRNA complexes function in a context that does not require perfect base-pairing, greatly increasing the repertoire of potential target RNAs in C. elegans. The findings described here add to the amazing variety of piRNA biogenesis mechanisms, and identify a new type of piRNA that nearly doubles the number of piRNA species available for genome defense in C. elegans. The finding that promoter-associated small RNAs are processed and loaded onto an Argonaute also raises the intriguing possibility that Argonaute-small RNA pathways might regulate promoter activity directly.
Worm Strains
The Bristol N2 strain of C. elegans was used in this study and cultured essentially as described (Brenner, Genetics, 77: 71-94, 1974).
RNA Cloning and Sequencing
RNA was extracted using TRI Reagent (MRC, Inc.) or phenol. For CapSeq, 0.5-2μ of total RNA was treated with Terminator exonuclease (Epicentre) to degrade rRNAs, calf intestine phosphatase (CIP, NEB) to remove 5′ phosphates, tobacco acid pyrophosphatase (TAP) to remove 5′ caps, and the resulting long-capped RNAs were ligated to a 5′ adapter. First strand cDNA was primed using a pool of random octamers containing a common 5′ sequence corresponding to the 3′ adapter oligo. The first strand cDNA was size selected and then amplified using Illumina adapter oligos. Details are provided below. Small RNA libraries were prepared essentially as described (Gu et al., Mol Cell., 36: 231-244, 2009), with some modifications. Briefly, gel-purified 18-40 nt small RNAs were dephosphorylated using CIP and then ligated to a 3′ adapter. The small RNAs were then treated with polynucleotide kinase (PNK, NEB), to add a 5′ monophosphate, or with TAP, to remove a 5′ cap and leave a 5′ monophosphate. The resulting RNAs were ligated to barcoded 5′ adapters and libraries were amplified using Illumina adapter oligos. Details are provided below. Libraries were sequenced using an Illumina Genome Analyzer II or HiSeq instrument at the UMass Medical School Deep Sequencing Core (Worcester, Mass.).
Bioinformatics
Sequences were processed and mapped to the genome using custom PERL (5.10.1) scripts, Bowtie 0.12.7 (Langmead et al., Genome Biol, 10: R25, 2009) and blastn (2.2.25). For C. elegans experiments, reads were aligned to the C. elegans genome (WormBase release WS215), Repbase 15.10 (Jurka et al., Cytogenet Genome Res, 110: 462-467, 2005), and miRBase 16 (Kozomara et al., Nucleic Acids Res 39: D152-157, 2011). For mouse cloning experiments, reads were aligned to the mouse genome assembly NCBIM37 (Ensembl 67), miRBase 18 and the non-coding RNA database fRNAdb 3.4 (Mituyama et al., Nucleic Acids Res, 37: D89-92, 2009). The Generic Genome Browser (Gbrowse 1.70, Generic Model Organism Database (GMOD)) was used to visualize the alignments.
Immunoprecipitation
The PRG-1 IP was performed as described previously (Gu et al., Mol Cell., 36: 231-244, 2009). Small RNAs were extracted from IP and input and cloned using a TAP cloning protocol, as described (Gu et al., Mol Cell., 36: 231-244, 2009).
Accession Numbers
Illumina data are available from GEO under the series number GSE40053.
CapSeq Protocol
To destroy 5.8S, 18S, and 26S rRNA, 0.5-2 μg of total RNA was treated with 0.1 U/μl Terminator exonuclease (Epicentre, TER51020) in 20 μl volume containing 1U/μl of SUPERase In™ (Ambion, AM2696) and 1× buffer A at 30° C. for 1-2 hr. The efficiency of the above reaction could be monitored by resolving the sample on a 5% denaturing PAGE gel visualized by Ethidium Bromide. To de-phosphorylate tRNA and 5S rRNA, the above reaction was diluted into 100 μl with 10 μl of 10× DNase I buffer, 20 U more of SUPERase In™, 30 U of CIP (NEB, MO290S), 5 U of DNase I (Ambion, AM2222), and H2O, and incubated at 37° C. for 30 minutes. The RNA was extracted using phenol/chloroform with phase-lock column (5PRIM, 2302830), and precipitated with 0.3M NaAc (pH 5.2) and at least 1 volume of isopropanol plus 30 μg of glycoblue (Ambion, AM9515), a procedure called as RNA cleanup. To expose the 5′ phosphate in the cap structure, the RNA was treated with 0.25 U/μl TAP (tobacco acid pyrophosphatase, Epicentre, T19050) in 10 μl reaction containing 1U/μl SUPERase In™ and 1× TAP buffer at 37° C. for 1 hour, and then cleaned up, but without additional glycoblue because glycoblue was already added in the previous step.
The RNA was ligated with 5 μM of barcoded 5′ linkers in 10 μl reaction containing 1× buffer, 1 U/μl of SUPERase In™, 2 U/μl of T4 RNA ligase (Takara, 2050A), 10% DMSO, and 0.1 μg/μl BSA at 15° C. for 8 hours, and cleaned up without additional glycoblue.
To make cDNA, the precipitated RNA was annealed with 50 pmole of RT oligo plus 10 nmole of dNTP (each nucleotide) in a 13 μl reaction at 65° C. for 5 min, and then chilled on ice for at least 2 minutes. The sample was incubated with 5 U/μl Superscript III (Invitrogen, 18080), 10 mM DTT, 1 U/μl SUPERase In™, and 1× buffer, all of which brought in 7 μl volume to the previous annealing reaction. The RT was incubated at 15, 25, and 37° C. each for 15 min, and then at 50° C. for 30 min to finish the RT reaction. The RT was heat-inactivated at 85° C. for 5 min. To destroy RNA, 1 μl each of RNase A (Ambion, AM2270) and RNase H (Ambion, AM2292) is added to the RT reaction at 37° C. for 20 min. To increase the cDNA quantity (optional), the RT reaction was diluted into 100 μl PCR reaction containing 1× PCR buffer, 0.01 U/μl ExTaq (Takara, RROO1B), 0.05 μM of oligo CM013729, 0.2 mM dNTP, and H2O, and a linear PCR was performed for 10 cycles with condition 94° C. 20 s, 55° C. 20 s, and 72° C. 30 s. The cDNA of desired size (here—130 to 170 nt for most samples or −50-130 nt for ya0217, including the linker), visualized using SYBRGold (Invitrogen, S-11494) with 5S and 5.8S RNA from the total RNA as size marker, was purified from a 15% denaturing PAGE gel. The cDNA was eluted using TE buffer (10 mM Tris pH 7.5, 1 mM EDTA) containing 0.3 M NaCl for 6 hours to overnight with constant vortexing, filtered through 0.45 μm Spin-x column (Costar, 19442-758), and precipitated with isopropanol and glycoblue, as above. The whole procedure after PCR above was defined as PAGE gel purification.
To obtain the optimal PCR cycle number to produce the cDNA, a testing PCR was performed in 50 μl reaction containing 1× buffer, 0.2 uM each of oligo CM013279 and solexa3sh, 20% of the eluted linear PCR product, 0.25 min dNTP and 0.025 U/μl ExTaq, for 15 cycles with condition 94° C. 20 s, 52° C. 20 s, and 72° C. 30 s. And then 5 μl each of oligo solexa3 and CM013278, each at 10 μM stock concentration, was added. 3 μl of PCR product was sampled at 3, 6, 9 and 12 more cycles, and then resolved on a 8% native PAGE gel visualized using Ethidium Bromide with 10 bp DNA marker (Invitrogen 10821-015). The optimal PCR cycle number was defined as the one at which the PCR reaction produces the cDNA of the desired size (here 130 to 170 nt for most samples or 50-130 for ya0217) without obvious bulged products (diffusive band running much more slowly). This was the condition to make the final cDNA amplicons.
To check the quality of the library, TA cloning followed by colony PCR was performed to obtain individual cDNA species sequenced using the traditional method. Finally, single end 76 or 100 nt sequence was obtained using Illumina Genome Analyzer or Hi-Seq.
Oligo Used in CapSeq:
Small RNA Cloning:
A ligation dependent method was used, as described (Gu et al., Mol Cell., 36: 231-244, 2009) to make small RNA libraries. PRG-1 IP was performed using the same antibody and method, as described (Batista et al., Mol Cell., 31: 67-78, 2008; Gu et al., Mol Cell., 36: 231-244, 2009). Unlike CIP-PNK cloning or ligation independent cloning, TAP cloning, which clones much less degradation products, was used to clone RNA from PRG-1 IP, the oxidized sample for enriching for the 3′ modified RNA (Gu et al., Mol Cell., 36: 231-244, 2009), and the corresponding control samples. To clone capped small RNA, 18-40 nt RNA was first purified from 40 μg of total RNA using a 15% denaturing PAGE gel, and then treated with 0.5 U/μl CIP in 100 μl reaction containing 0.5 U/μl SUPERase In™ and 1×CIP buffer at 37° C. for 1 hour. The dephosphorylated RNA was cleaned up, ligated with a 3′ linker, gel-purified, and split into two parts: one quarter treated with PNK and the rest treated with TAP. The treatment and the following cloning protocol were described previously (Gu et al., Mol Cell., 36: 231-244, 2009). Except the previously published oxidized sample (the first step of beta-elimination) and its control, other samples were cloned using 5′ 4nt-barcoded linker. Single-end 36nt sequence was obtained using illumina genome analyzer.
Bioinformatics Analysis:
Genome and annotations used include WormBase release WS215, Repbase 15.10, and miRBase release 16 for C. elegans, and NCBIM37.67, miRBase 18 and non-coding RNA database fRNAdb 3.4 for mouse. Bowtie 0.12.7, blastn 2.2.25 and custom PERL (5.10.1) scripts were used to map and analyze the sequence. Gbrowse 1.70 was used to visualize the alignments. In general, all analyses were performed using custom PERL scripts.
A. Mapping CapSeq Reads to C. elegans Genome
Single-end reads of 75 nt (L10831, L30831 and YA0831) or 100 nt (avr0217 and ya0217) were obtained using Solexa Genome Analyzer and Hi-Seq system respectively. Reads were debarcoded, and then trimmed off the 3′ linker starting with TCGTATGCC. If there was no such sequence, an incomplete 3′ linker (TCGTATGC, TCGTATG, TCGTAT, TCGTA, TCGT, TCG or TC) at the very 3′ end was searched and removed in a non-recursive way. To avoid the mutations introduced by random priming in RT, the last 8 nt of the read was further trimmed off after removal of the 3′ linker. The processed RNA read with size of at least 17 nt was subject to a custom PERL script to identify and remove the splicing leader or its variants with up to 2 mutations. Then the non-SL-containing part was mapped to C. elegans genome WS215 and annotations using Bowtie 0.12.7 with parameters: -n 2-e 180 -a --best --strata -m 200. ‘-e 180’ defines the maximum mutations allowed, here 6; ‘-a --best --strata’ only returns the best matches; and ‘-m 200’ only reports RNA read with less than 200 best matches. The mutation rate allowed for the alignment was 0 for reads 17-19 nt long, 1 for 19-21 nt, 2 for 22-24 nt, 3 for 25-49 nt, 4 for 50-74 nt, and 5-6 for >=75 nt. The unmatched RNA reads of size >=40 nt were mapped to genome WS215 partially using blastn with parameter e 0.01. A PERL script was used to search the unaligned part of the RNA within 1000 nt flanking the reads which satisfied the splicing criterion, and only the best exon-exon candidates were obtained. A PERL script was used to obtain the histogram of the start loci of mapped reads, after being normalized to 10 million of total reads mapped to the sense strand of protein coding genes.
B. Mapping Small RNA and RNA-Seq to C. elegans
The C. elegans RNA-seq data using directional cloning strategy was obtained from GEO GSE22410 (Lamm et al., Genome Res, 21: 265-275, 2011). In other cases, a custom PERL script was used to de-barcode the sample and remove the 3′ linker CTGTAG and beyond. If the reads do not contain a complete 3′ linker, the incomplete 3′ linker CTGTA/CTGT/CTG/CT at the very 3′ end was removed in a non-recursive way. RNA at least 17 nt was used in the following analysis.
Bowtie 0.12.7 was used to match RNA to the annotations and genome with parameter -v 3 -a --best --strata -m 400. A custom PERL pipeline was used to perform the post-matching analysis, and the mutation rate allowed for RNA at least 17 nt long was: 0 mismatch for size 17-18nt, 1 for 19 to 21 nt, 2 for 22 to 24 nt, and 3 for longer than 24 nt. The bowtie parameter ‘-a --best strata’ only reports the best matches. If an RNA sequence was mapped to several genomic loci, then the reads of the RNA were split evenly to each locus. To account for the different sequencing volume, the small RNA data was normalized to 5 million of non-structural reads mapped, and the RNA-seq (mRNA) data was normalized to 10 million of sense protein coding reads. A PERL script was used to draw the scatter plot in
Analysis of C. elegans CapSeq Quality-Enrichment for Capped RNA
To estimate the enrichment for reads mapped to the very 5′ end (long capped RNA), relative to those mapped to the rest of the genes (potentially non-capped but cloned anyway), two analyses were performed.
In analysis 1, five annotated SL1-containing genes, alh-8, rps-24, rps-3, rps-29 and rps-15 were used to identify and quantify reads perfectly matched to the first 30 nt of each gene plus the complete or partial SL1. Each sample was normalized to 10 million of sense protein coding reads. Because debarcoding allowed one mutation, SL1 missing the first nucleotide (position 21 relative to the last nt of SL1) could be caused by debarcoding, thus ignored. It was also required that there be at least 3nt for the identity of SL1. Therefore, only positions 3-20, relative to the last nt of SL1, were used to calculate the degradation rate for each position of SL1, 1/15297 on average.
In analysis 2, only considered 6967 of the annotated trans-spliced genes were considered which were also confirmed by the data described herein. All sense reads mapped to these protein coding genes were obtained, and then the number of SL-containing reads (T-SL), which represented some, if not all, of the long capped RNA reads, was compared with the number of non-SL-containing reads (T-non-SL), which could be non-capped RNA or non-trans-spliced long capped RNA reads. Assuming all the non-SL-containing reads as non-capped reads, the relative enrichment for capped reads is equal to (T-SL/T-non-SL)*1500. ‘1500’ is the average gene size. This resulted in ˜17,000 fold enrichment for full-size SL1 reads (position 22 relative to the last nt) over any other positions, a rate consistent with that in analysis 1. Analysis 2 would underestimate the enrichment ratio because some non-SL-containing reads are capped RNAs.
Mapping SL-Containing Reads to Protein Coding Genes
CapSeq samples included were avr0217, L10831, L30831, ya0217, and YA0831. Reads mapped to tRNAs and rRNAs were removed completely. SL-containing, uniquely matched, and 5′ perfectly matched RNAs with size of at least 20 nt after removal of SL was used in this analysis. For each mapped locus, represented by the start site of the mapped RNA, the nearby loci from 1 nt upstream to 1 nt downstream were scanned, and then the original locus was removed if at least one of the nearby loci had 10 fold more reads mapped. RNA reads were first normalized to 10 million of sense protein coding reads, and a histogram for the start sites of mapped RNAs was generated for each sample. The five histograms were combined and loci with less than 1 read were removed.
Every genomic position within a gene was assigned with a type, including start, exon and intron. For a gene containing several transcripts, the type ‘start’ had priority over ‘exon’, and ‘exon’ had priority over ‘intron’, if a position was associated with multiple types. To associate an SL-containing RNA with a gene, a PERL script searched within the annotated genes. If failing, the script then searched for the nearest transcripts within 1-500nt downstream the start site of the mapped RNA. If still failing, the script searched within 1-500 nt upstream of the start site. If still failing, the script labeled the start site of this RNA as ‘NA’, meaning it cannot find a gene associated with the RNA.
Identification of Pol II Start Sites for Protein Coding Genes Using CapSeq
CapSeq samples included were avr0217, L10831, L30831, ya0217, and YA0831. After removal of reads mapped to the structural RNAs, only uniquely mapped, 5′ perfectly matched, and non-SL containing reads with size >=30 nt were included in the following analysis. RNA reads were first normalized to 10 million of sense protein coding reads and a histogram for the start sites of mapped reads was generated for each sample. The histograms were combined for the five samples, and loci with less than 5 reads were removed. 944 genes, each with 1000 or more reads out of ˜7 million of sense reads in YA0831, were defined as the top genes. For these genes, only loci mapped within a 200 nt region from 100 nt upstream to 100 nt downstream the annotated start sites were considered, because the read number cutoff 1/10 million for the start loci could bring in some degradation products from genes with the most abundant reads. For each locus, the nearby loci from 1 nt upstream to 1 nt downstream were scanned, and the original locus was removed if at least one of the nearby loci had at least 10 fold reads mapped.
Then a script searched the nearest gene for each locus within a range from 200 nt upstream to 1000 nt downstream. If the start site of a transcript was covered by at least some RNA reads mapped, the transcript was labeled as ‘yes’ for coverage by CapSeq. Or it was labeled as ‘NA’ or Not′ in the last column. The final output position of the CapSeq start locus was relative to the start site of the transcript assigned as described above, with negative number indicating upstream.
To analyze the motif around the start site, nt occurrence from −50 (upstream) to +50 around each start locus was summed up. The weight of each start site was set up as 1, rather than the mapped reads, thus avoiding bias caused by loci with the most abundant reads.
Identification of csRNAs Upstream Protein Coding Genes
Sense csRNAs and anti-sense csRNAs were defined separately using CIP-TAP sample. Background noise filtered included reads mapped to sense 21U-RNAs, sense/anti-sense reads mapped to tRNAs/rRNAs/miRNAs, anti-sense reads mapped to other structural RNAs such as snRNAs/snoRNAs, and anti-sense reads mapped to protein coding genes/pseudogenes. Only uniquely matched reads outside the two chromosome IV regions with canonical 21U-RNAs were considered. Furthermore, the included reads were enriched at least 10 fold in CIP-TAP sample, as compared to those in CIP-PNK sample, and had at least 2 reads per 5 million of non-structural reads in CIP-TAP. A script searched sense csRNAs within a region from 500 nt to 1 nt upstream any SL-containing transcripts or from 50 nt upstream to 50 nt downstream any non-SL-containing transcripts, and stops if reaching the upstream genes in the same direction.
To identify the anti-sense csRNAs, the 200 nt region from 100 to 300 nt upstream every sense csRNA was scanned, and the scanning process stopped before the upstream transcripts.
Analysis of the Motif and Size of 22G-RNA
Included in this analysis were anti-sense RNAs mapped to protein coding genes with 1st nt perfectly matched and at least 1 per million non-structural reads in CIP-PNK sample. In the motif analysis, the position was referred to the start site of mapped 22G-RNA locus with negative number indicating upstream positions.
Analysis of the Expression Levels of csRNAs and lcRNAs at TS Sites
Sense csRNAs and lcRNAs at TS sites for protein coding genes were compared, and then the 4344 loci having both csRNAs and lcRNAs were used to draw a scatter plot.
Analysis of the Ratio of Trans-Spliced Genes in C. elegans Genome
The number of SL-containing genes and the number of non-SL-containing genes were obtained from the above analysis 4 and 5. The total number of genes mapped with or without SL was 14337, among which 10276 genes had 5′ end reads with SL, 10679 genes had 5′ end reads without SL, and the two datasets shared 6618 genes.
Analysis of csRNA Derived from miRNA Loci
All miRNA loci were visually checked to find upstream csRNA loci using Gbrowse, and 51 csRNA regions upstream miRNAs were hand-picked. RNA reads uniquely mapped to these loci were normalized to 5 million of non-structural reads. A histogram of the RNA start site was generated. Then removed were start sites with less than 2 reads or with nearby start sites (−2 to 2) having more than 5 fold reads. These start sites were used in the motif analysis, and the reads mapped to these sites were used in the size analysis.
Prediction of the Pri-miRNA Start Sites
All miRNA loci were visually screened for regions usually less than 500 nt starting upstream individual miRNAs or clusters of miRNAs, which contained both CapSeq or csRNA loci. 66 of such regions were hand-picked from Gbrowse. Only uniquely mapped RNAs without any mutation at the first nucleotide were used. Included in the analysis were the five Capseq samples aforementioned, each of which was normalized to 10 million of sense protein coding reads, and the CIP-TAP sample, which was normalized to 5 million of non-structural RNAs. A combined histogram of CapSeq start sites was generated and any start loci with less than 1 reads/10 million sense protein coding reads was removed. A histogram of CIP-TAP start site was also generated, and any start site with less than 1 read per 5 million reads was removed. For each miRNA, the most abundant start locus with YR motif (R, the first nt of the mapped RNAs) from either histogram was assigned as the start sites for the pri-miRNA respectively. If several loci were identified from the same histogram, the closest locus was selected. After obtaining both start sites for each pri-miRNAs, one from CapSeq and the other from CIP-TAP sample, the distance between the two sites was analyzed.
Comparison of CapSeq and RNA-Seq
In this comparison, we randomly chose 5 non-SL-containing protein coding genes with size of 2156, 1573, 2560, 1382, and 860 nt for kin-31, glc-1, rpl-12, glh-1, and cel-1, and 5 miRNAs, each of which represent the first miRNA, if in the miRNA clusters, with genomic size of 897, 941, 580, 548, and 402 nt for mir-35, mir-61, mir-54, mir-58, and mir-229. The above strategy eliminates the complexity caused by trans-splicing and dramatically reduces the size difference between individual mRNAs and miRNAs. Using CapSeq and CIP-TAP data visualized by Gbrowse, we expanded the coding region of each miRNA or mRNA, as annotated in WS215, to include the upstream capped RNA region. Only uniquely mapped reads with size of >=30 nt falling within the desired region were considered in the analysis. 11 samples of directional RNAseq (Lamm et al., Genome Res, 21: 265-275, 2011), and 5 samples of CapSeq were combined respectively. In this way, both RNA-seq and CapSeq were considered as mixed stage samples because at least both contain L1, L3 and YA. CapSeq reads basically represented capped RNAs, and RNA-seq RNAs mostly represented non-capped RNAs. Assuming there was no difference between RNA enrichment methods used in CapSeq (enzymatic ribo-minus) and RNA-seq (poly A selection), the capped RNAs and non-capped RNAs should correlate positively, regardless of the source, here miRNAs or mRNAs. However, miRNAs as a whole were almost completely depleted from RNA-seq, as compared to mRNAs, strongly suggesting that the enrichment method could cause this bias.
The coordinate uses for each gene is as below:
Definition of Unique 21U-RNAs
A custom PERL script was used to identify all annotated 21U-RNAs which were mapped to unique genomic loci without overlapping with each other. A genomic locus is defined as a combination of chromosome, strand and start position. The total number of such 21U-RNA is 9079 out of 15073 in WS215.
Identification of CapSeq Reads Overlapping with 21U-RNAs
Uniquely matched reads without mutation at the 1st nt were overlapped with the unique 21U-RNAs. Included were the CapSeq reads overlapping with only one 21U-RNA on the same strand. The relative position of the 5′ end of CapSeq read was referred to the 5′ end (+1) of the 21U-RNA overlapped, and a negative number represented upstream. For the CapSeq reads starting 2 nt upstream the annotated 21U-RNAs, annotated non-21U-RNA genes were searched within 20 nt upstream to 200 nt downstream the CapSeq reads.
Analysis of csRNAs Overlapping with 21U-RNAs
The RNAs overlapping with only one unique 21U-RNA on the same strand were analyzed using CIP-TAP and CIP-PNK samples. To analyze the correlation of 21U-RNA and csRNA, the reads mapped to each unique 21U-RNA were obtained from CIP-PNK and CIP-TAP samples respectively. Only included were reads mapped to the starting positions of 21U-RNAs in CIP-PNK and those mapped 2 nt upstream the annotated starting positions of 21U-RNAs in CIP-TAP. The reads were normalized to 5 million of non-structural reads. To draw the figure using log 2 scale, we only considered 21U-RNA loci with at least 1 reads in either CIP-PNK or CIP-TAP sample. Furthermore, if one of the two numbers was less than 1, we assigned that number to 1 to avoid drawing negative number using log 2 scale.
Identification of 21U-Like RNA Loci
To identify loci similar to 21U-RNAs but failing to generate 21U-RNAs, we used CIP-TAP sample with restraints either from inside or outside. Included in this analysis were the reads uniquely mapped and located on chromosome IV within the two intervals of 4500000-7000000 and 13500000-17200000. The inside filter was to filter out sense RNA reads mapped to 21U-RNA loci, anti-sense and sense reads mapped to miRNA/tRNA/rRNA loci, and anti-sense reads mapped to protein coding genes and pseudogenes. The outside restraints included PRG-1 IP and CIP-PNK. Any read enriched 5× or more in the PRG-1 IP, as compared to the input, was considered as a 21U-RNA candidate and was removed from CIP-TAP sample. However, we cannot directly filter out these PRG-1 IP RNA loci, represented by the start site, because 21U-RNAs were usually located 2 nt downstream the potential csRNAs cloned in CIP-TAP sample. Therefore these IP loci were shifted 2nt upstream first, and then used as a filter. CIP-PNK sample was used in two ways to remove potential noises. First, CIP-PNK sample served as control for CIP-TAP, and included in the analysis were RNAs in the CIP-TAP sample with at least 1 read/million of non-structural reads and enriched at least 10 fold as compared to CIP-PNK sample. Second, CIP-PNK sample was used to filter out mapped loci in CIP-TAP near which there were at least two other loci in CIP-PNK, because 22G-RNAs, the major background noise in CIP-TAP sample, were usually overlapped with each other. The nearby regions were defined as the regions from position −25 to −5 for upstream and +6 to 26 for downstream the start sites of the mapped reads in CIP-TAP. The motif and size analysis was performed using the −2,300 loci obtained this way, and the position in the motif analysis was referred to the start sites of csRNAs.
Identification of Novel 21U-RNAs in Wild Isolates
A custom PERL script searched 21U-RNA candidates cloned in the wild isolates using criteria: 1) the 21U-RNA locus had a ‘U’ at +1 position in the wild isolates but other nucleotides in N2, as annotated in WS215; 2) the 21U-RNA has YR motif; 3) 21U-RNA had at least 5 reads after being normalized to 5 million of non-structural reads, and was mapped to a unique locus; 4) the 21U-RNA was mapped to chromosome IV within the two intervals of 4500000-7000000 or 13500000-17200000.
Identification of New 21U-RNAs
New 21U-RNAs were defined using PRG-1 IP with the inside restraints and outside restraints. The inside restraints removed reads mapped to tRNAs or rRNAs, RNAs of non-20/21-U, and RNAs with mutation at the 1st position or with less than 1 read per 5 million nonstructural reads in PRG-1 IP. The outside restraints remove the reads without a YR motif in which Y is 3 nt upstream the RNA start sites, and the reads enriched less than 5 fold, as compared to the input sample. csRNAs were defined as the mapped RNA loci enriched at least 10 fold in the CIP-TAP sample over CIP-PNK sample.
Analysis of the Percentage of Non-Canonical 21 U-RNAs
This analysis was different from analysis 18 above because the new 21U-RNAs defined there could contain many canonical 21U-RNAs missed in the published list due to insufficient sequencing depth. In this analysis, such 21U-RNAs were removed. Each small RNA sample was normalized to 5 million of non-structural reads, and Capseq YA0831 was normalized to 10 million of sense protein coding reads. U-21nt RNA, starting with U and 21 nt long, was obtained as the RNA reads uniquely matched and enriched at least 5 fold in CIP-TAP sample, as compared to CIP-PNK sample. rRNAs and tRNAs were removed beforehand. These RNAs are divided into 4 groups: 1) mapped within the two regions of 4500000-7000000 and 13500000-17200000 on chromosome IV and annotated as 21U-RNAs already; 2) mapped within the two regions of 4500000-7000000 and 13500000-17200000 on chromosome IV, but not annotated as 21U-RNAs; 3) mapped to other regions on chromosome IV or other chromosomes, but annotated as 21U-RNAs; 4) mapped to other regions on chromosome IV or other chromosomes, but not annotated as 21U-RNAs. Group 2 and group 4 U-21nt RNAs must have YR motif. Group 1 represented the canonical 21U-RNAs, while group 4 represented the non-canonical 21U-RNAs. Non-SL-containing CapSeq RNAs in YA0831 and csRNAs enriched in CIP-TAP sample were searched for overlapping with U-21nt RNAs in group 1 and group 4. csRNAs were RNAs uniquely matched and enriched at least 5 fold in CIP-TAP sample, as compared to CIP-PNK sample. CapSeq RNAs were RNAs uniquely matched, containing no SL sequence, and perfectly matched at the first nucleotide in YA0831. Both csRNAs and CapSeq RNAs started 2 nt upstream U-21nt RNAs in either group 1 or group 4.
Mouse Capseq Analysis
Genome and annotations NCBIM37.67 were obtained from ftp.ensembl.org, CAGE data was obtained from http://fantom31p.gsc.riken.jp/cage/download/mm5/, non-coding RNA database fRNAdb v3.4 was obtained from http://www.ncrna.org/frnadb/, and miRNA database miRBase release 18 was obtained from miRBase. These annotations were mapped to mouse genome using a custom PERL script plus Bowtie. Bowtie was also used to map reads of >=19 nt long to the genome and annotations with parameters: -n 2 -e 180 -a --best --strata -m 200. A more stringent mutation filter was used to exclude non-specific matches: 0 mismatch for reads 19-24 nt long, 1 for 25-29 nt, 2 for 30-39 nt, 3 for 40-49nt, 4 for 50-59 nt, 5 for 60-69, and 6 for >=70 nt. A custom PERL script was used to obtain the start locus histogram of the mapped reads, after being normalized to 10 million of non-structural RNA reads. And all the alignments were visualized using Gbrowse 1.70.
Mapping Small RNAs to Mouse Genome
The mouse Mili IP data was obtained from GEO GSM475280 (Robine et al., Curr Biol, 19: 2066-2076, 2009), and then reads of size >=18 nt was mapped with Bowtie 0.12.7, using parameters ‘-n 2 -e 180-a--best-strata -m 200’. The mutation rate allowed was as described above for mouse CapSeq analysis. The post-bowtie analysis was the same as for the C. elegans small RNA.
Analysis of the Mouse CapSeq Quality—the Rate of Capped RNA
The range of the potential start site for each transcript annotated was defined as the region from position −20 to 21 relative to +1, the annotated start position. To account for alternative splicing, which could annotate the same state sites on one transcript as the non-start sites on the other transcript, the start sites had priority over non-start sites, thus being removed from the non-start sites. The total number of genomic positions for each class of sites, either start or non-start, and the reads matched, were used to estimate the overall enrichment for capped RNAs.
Motif Analysis of Mouse CapSeq and CAGE
Included in the analysis were RNA reads uniquely matched without any mismatch at the first position, after removal of RNAs mapped to tRNA, rRNA, snRNA and snoRNA. A histogram of the RNA start sites was made. The start sites analyzed had at least 1 reads out of 1 million of non-structural RNA reads, and there were no other start sites 1nt upstream or downstream with 10 fold more reads. The nucleotide occurrence around each start site (−50 to 50) was used to calculate the overall frequency table. To avoid bias from the start sites with the abundant reads, the weight for each start site was set to 1. The log2 ratio of foreground rate/background rate represented the relative enrichment for each nucleotide at each position. Here the background rate for A/G/C/U was based on the nucleotide frequency of all genomic regions from position −200 to +200 relative to each start site, and the foreground frequency for each nucleotide at each position was calculated using the −50 to 50 frequency table above.
For the CAGE data, if the start position of the RNA had a mismatch, the start position was reset as the position +2, as long as this position was perfectly matched. Or the RNA was discarded.
Prediction of TS Sites for Mouse Pri-miRNAs
Mouse testis samples at 4 weeks and 6 months were used in this analysis. Reads mapped to more than 3 genomic loci, or mapped with mutation at position +1, or mapped to snRNA, snoRNA, rRNA, tRNA, mRNA and piRNA were excluded. Reads of different length were combined according to the start sites, and then the start sites with less than 1 read per 5 million non-structural RNA reads were excluded. Also excluded were the start sites with less than 1/10th reads of the flanking start sites either 1nt upstream or downstream. The start sites were then assigned to the annotated pre-miRNAs as long as the start sites are within 1 kb upstream of the pre-miRNAs. Totally, 134 pre-miRNAs were assigned with TS sites, and these TS sites were enriched for YR motif (data now shown.)
This invention was made with government support under Grant No. NIH 2 RO1 GM058800 and a Grant from the Howard Hughes Medical Institute. The government has certain rights in the invention.
Number | Date | Country | |
---|---|---|---|
61551247 | Oct 2011 | US |
Number | Date | Country | |
---|---|---|---|
Parent | PCT/US2012/061978 | Oct 2012 | US |
Child | 14260718 | US |