The present invention relates to a method to study the transcriptome complexity in C. elegans or other species using reads from direct RNA sequencing or other sequencing method with ultra-long size.
Alternative splicing is hallmarks of eukaryotic transcriptomes. Over 90% of human genes show evidence of for alternative splicing. It plays a key role not only in cell fate specification, but also in development of higher organisms with sophisticated tissues, organs and developmental processes by expanding the complexity of transcriptome and thus of proteome. Aberrant splicing has been frequently linked to various diseases, including cancer, ageing, diabetes, abnormal nutritional response and neuronal disorders. In addition, the transcriptome is further subjected to various base modifications with different biological implications. Systematic detection of such modifications and understanding of their in vivo roles remains a significant challenge.
Identification of all type of transcripts produced by a genome is crucial for understanding the functional complexity of normal development and disease progression, but remains a challenging task even in an organism with a relatively small genome. For example, to facilitate annotation of the transcriptome of C. elegans or C. briggsae with a genome size of approximately 100 Mb, various datasets have been used, including ESTs, full-length cDNAs and RNA sequencing of (RNA-seq) of cDNA fragments using massively parallel sequencing. Short RNA-seq reads, typically shorter than 200 nt, have played a leading role in transcriptome annotation during the past decade. However, it is difficult to reconstruct and quantify alternative transcripts using short reads, which is further complicated by a requirement of an amplification step. Clearly, the ability to produce longer reads using native RNA molecule without amplification would minimize perturbation of transcript integrity, permitting capturing of full-length RNA molecules, which would be ideal for elucidating transcriptome complexity, including alternative splicing, alternative transcriptional start and ending, as well as the underlying biology. To this end, synthetic long-read RNA sequencing has been introduced, which relies on sub-pooling of full-length cDNAs followed by sequence amplification, fragmentation and assembly to produce a long read. The method has been shown to be able to recover many novel isoforms in humans and mice. However, the amplification and reverse transcription steps make it problematic for quantification and detection of native modifications. The current method of choice for profiling RNA methylation is RNA immunoprecipitation using modification-specific antibodies followed by reverse transcription and massively parallel sequencing. However, it provides poor resolution in terms of modification site. Third generation sequencing technology, for example, Pacific Biosciences (PacBio) RSII platform, is able to produce long read and detect DNA methylation based on polymerase kinetics during DNA synthesis, but a reverse transcription step is required for the sequencing RNA molecule. Therefore, direct sequencing of native RNA molecules is still not feasible.
Recently, Oxford Nanopore Technology (ONT) has developed a direct sequencing method for both DNA and RNA based on changes in ion current profile when a nucleotide passes through a nanopore. Due to its ultra-long read length, it has been adopted for many applications, including resolving repeats within human Y chromosome centromeres, improving the existing genome assembly, the rapid on-site sequencing of pathogens, and detecting 5-methylcytosine (5mC) in the genomes of humans and yeast. Direct sequencing of single-molecule native RNA is expected to benefit transcript integrity by getting rid of the steps for reverse transcription and amplification. The DNA modifications detected with ONT are highly correlated with those from the bisulfate sequencing-based method. Because ONT relies on the change of profile in electric current to differentiate nucleotide bases, with appropriate positive and negative training data sets, the platform may be able to detect known or unknown modifications in native RNA molecules without any pre-treatment step.
Given a relatively high error rate of the long reads, using them to define transcriptome complexity is not trivial. Several methods have been developed to call transcript isoforms with a reference genome using long reads, including ToFu and SQANTI, which were designed for PacBio cDNA reads. These methods depend heavily on existing splicing junctions to classify the reads into representative isoforms, which may compromise the potential of the long read in defining novel splicing junction. Therefore, they demand precise junctions for each individual read track. To satisfy this requirement, the junctions must be pre-corrected for each read using existing junctions or massively parallel sequencing reads (referred to as short reads hereafter). A method for calling transcript isoforms without a reference genome has also been developed. However, the method suffers from a higher false positive rate, and is problematic in handling close paralogs, which are often associated with short reads. Importantly, with decreasing costs of third generation sequencing, it has become increasingly desirable to define the transcriptome complexity of an existing genome using long reads only. However, a method capable of meeting this challenge is still lacking.
RNA modifications are emerging as a significant player not only in regulation of rRNAs and tRNAs, but also in post-transcriptional regulation of mRNAs. More than 150 RNA modifications are known, but the true potential of only a few of these has recently been revealed at transcriptome scale, which is mainly due to rapid development in detection technology based on high-throughput sequencing. For example, transcriptome wide RNA modification is mainly achieved by coupling antibody immunoprecipitation or chemical treatments to massively parallel sequencing. However, these techniques suffer from low resolution or limited capacity for generalization. A more straightforward method for detection of transcriptome wide modification is necessary.
The C. elegans genome is one of the best characterized metazoan genomes due to its homozygosity and lack of gaps. The 5′ end of most of its mRNAs carries a unique SL that is derived from independent loci, making it straightforward to evaluate the completeness of mRNA transcripts purified using oligo-d(T) magnetic beads. To examine the potential of ONT RNA sequencing in defining transcriptome complexity, direct sequencing of poly(A) tailed RNAs from different developmental stages of C. elegans is first performed in the present invention. Then a novel method for de novo identification of alternative splicing events by using the mapping tracks of full-length RNA transcripts is provided to identify 57,000 novel isoforms that are absent in the current annotation. Putative stage-specific expression of isoforms which was independent of the stage-specific expression of genes is also detected in the present invention. Finally, coding sequence-specific candidate RNA modification in all types of nucleotides is also shown in the present invention.
Accordingly, it is an objective of the present invention applying a direct RNA sequencing method with ultra-long reads using Oxford Nanopore Technologies to study the transcriptome complexity in C. elegans. Approximately six million reads using native poly(A) tailed mRNAs from three developmental stages, with average read lengths ranging from 900 to 1,100 nt are generated in the present invention. Around half of the reads represent full-length transcripts. To utilize the full-length transcripts in defining transcriptome complexity, a method is provided to classify the long reads as the same as existing transcripts or as a novel transcript using sequence mapping tracks rather than existing intron/exon structures, which identifies roughly 57,000 novel isoforms and recover at least 26,000 out of the 33,500 existing isoforms. Notably, the sets of gene with differential expression versus differential isoform usage over development are largely different, implying a fine-tuned regulation at isoform level during development. An unexpected increase in putative RNA modification in all bases in the coding region relative to the UTR is observed, suggesting their possible roles in translation. The ONT long reads and the method for read classification are expected to deliver new insights into RNA processing and modification and their underlying biology in the future.
In a first aspect of the present invention, there is provided a method for identifying different isoforms of RNA from a genome comprising: providing one or more nucleotide sequence reads from a nucleotide from an organism, wherein said one or more nucleotide sequence reads comprise at least one isoform; obtaining a reference sequence from an annotated database, wherein the reference sequence includes one or more gene loci having an exon and an intron; clustering one or more sequence tracks by a distance score Score 1 to obtain a first group of sequence tracks and a second group of sequence tracks; merging sequence tracks in the first group of sequence tracks if the distance score Score 1 is below a percentage of a first value; clustering sequence tracks in the second group of sequence tracks by a mutual distance score Score 2; and merging sequence tracks in the second group of sequence tracks if the mutual distance score Score 2 is below a percentage of a second value so as to generate one or more isoforms of the RNA with respect to one or more gene loci.
In a first embodiment of the first aspect of the present invention, there is provided a method wherein said clustering one or more sequence tracks by the distance score Score 1 comprises: constructing one or more sequence tracks by performing a read-mapping between the one or more nucleotide sequence reads and the one or more gene loci; excluding the sequence tracks having a first overlap to the exon below a certain amount of nucleotide, or excluding the sequence tracks having a second overlap to the antisense of the one or more gene loci; computing the distance score Score 1, wherein the distance score Score 1 is computed by
wherein Aexon is an exon sequence of a first sequence track in the first group of sequence tracks, Bexon is an exon sequence of a second sequence track in the first group of sequence tracks, (Aexon∩Bexon) is the amount of shared exon sequence between Aexon and Bexon, (Aexon∪Bexon) is the amount of pooled exon sequence between Aexon and Bexon, Aintron is an intron sequence of the first sequence track in the first group of sequence tracks, Bintron is an intron sequence of the second sequence track in the first group of sequence tracks, (Aintron∩Bintron) is the amount of shared intron sequence between Aintron and Bintron, (Aintron∪Bintron) is the amount of pooled intron sequence between Aintron and Bintron and weight is a ratio of an average intron length to an average exon length of the reference sequence.
In a second embodiment of the first aspect of the present invention, there is provided a method wherein the certain amount of nucleotide according to the first embodiment is approximately from 10 to 100 nt.
In a third embodiment of the first aspect of the present invention, there is provided a method wherein if the distance score Score 1 according to the first embodiment is below the percentage of the first value, said merging sequence tracks in the first group of sequence tracks comprises: subtracting the length of the first sequence track and the length of the second sequence track in the first group of sequence tracks to obtain a third value; if the third value is over zero, merging the second sequence track with the first sequence track.
In a fourth embodiment of the first aspect of the present invention, there is provided a method wherein the ratio according to the first embodiment is approximately from 0.1 to 0.9.
In a fifth embodiment of the first aspect of the present invention, there is provided a method wherein the distance score Score 1 according to the third embodiment is below the percentage of approximately 5% of the first value.
In a sixth embodiment of the first aspect of the present invention, there is provided a method wherein said clustering sequence tracks in the second group of sequence tracks by the mutual distance score Score 2 comprises: computing the mutual distance score Score 2, wherein the mutual distance score Score 2 is computed by
wherein Aexon is an exon sequence of a first sequence track in the second group of sequence tracks, Bexon is an exon sequence of a second sequence track in the second group of sequence tracks, (Aexon∩Bexon) is the amount of shared exon sequence between Aexon and Bexon, Aintron is an intron sequence of the first sequence track in the second group of sequence tracks, Bintron is an intron sequence of the second sequence track in the second group of sequence tracks, (Aintron∩Bintron) is the amount of shared intron sequence between Aintron and Bintron, and weight is a ratio of an average intron length to an average exon length of the reference sequence.
In a seventh embodiment of the first aspect of the present invention, there is provided a method wherein if the mutual distance score Score 2 according to the sixth embodiment is below the percentage of the second value, said merging sequence tracks in the second group of sequence tracks comprises: subtracting the first sequence track and the second sequence track in the second group of sequence tracks to obtain a fourth value; if the fourth value is over zero, merging the second sequence track with the first sequence track.
In an eighth embodiment of the first aspect of the present invention, there is provided a method wherein the ratio according to the sixth embodiment is approximately from 0.1 to 0.9.
In a ninth embodiment of the first aspect of the present invention, there is provided a method wherein the mutual distance score Score 2 according to the seventh embodiment is below the percentage of approximately 1% of the second value.
In a second aspect of the present invention, there is provided a method for identifying different RNA isoforms using long reads of RNA sequencing comprising assigning sequence tracks to a given gene locus based on long-read mapping against a reference genome wherein existing isoforms from the reference genome are also included as a sequence track, excluding long-read mappings that show few overlaps with existing exon from the existing isoforms or are in antisense to the given gene locus, clustering the sequence tracks based on a distance score Score 1 wherein the distance score Score 1
wherein the amount of shared sequence in nucleotide between exons or introns from each sequence track is calculated as (Aexon∩Bexon) or (Aintron∩Bintron), and the total number of nucleotide (Aexon∪Bexon) or (Aintron∩Bintron) is calculated as pooled sequences in nucleotide between the same two exons or introns, and wherein the weight is based on the ratio of average intron length versus the exon length of the reference genome, with a default value of 0.5, merging the sequence tracks with a cut-off based on the distance score Score 1 between the sequence tracks, wherein the sequence track with a first length in summed exons of the isoforms other than the referenced isoforms is treated as a first subread and merged with the sequence track with a second length in the summed exons of said isoforms, wherein said second length is longer than the first length; merging the sequence tracks if the distance score Score 1 is lower than 5% and retaining one of the other isoforms with the biggest size of summed exons from each group along with the existing isoforms for subsequent isoform calling, wherein the remaining sequence tracks are assigned as a second subreads to be used for expression quantification and intron/exon boundary correction, clustering the retained sequence tracks based on their mutual distance score Score 2 wherein the mutual distance score
merging the sequence track with a first length in the summed exons of the isoforms obtained after said clustering, being assigned as a third subreads, into a track with a second length in the summed exons of said isoforms if their mutual distance score Score 2 is lower than 1% so as to generate a one or more resulting isoforms for the given gene locus, wherein said second length is longer than the first length; correcting the resulting sequence tracks for intron/exon junctions to result in novel isoforms.
In a first embodiment of the second aspect of the present invention, there is provided a method wherein for said intron/exon boundary correction, the junctions from the second subreads are aligned against the junctions defined for the isoform by the longest full-length long-read.
In a second embodiment of the second aspect of the present invention, there is provided a method wherein a minor shift of no more than 5% change in distance score in exon-intron boundary caused by read errors are permitted to reduce over calling of different RNA isoforms.
Other aspects and advantages of the present invention will be apparent to those skilled in the art from a review of the ensuing description.
Throughout the present specification, unless the context requires otherwise, the word “comprise” or variations such as “comprises” or “comprising”, will be understood to imply the inclusion of a stated integer or group of integers but not the exclusion of any other integer or group of integers. It is also noted that in this disclosure and particularly in the claims and/or paragraphs, terms such as “comprises”, “comprised”, “comprising” and the like can have the meaning attributed to it in U.S. Patent law; e.g., they can mean “includes”, “included”, “including”, and the like; and that terms such as “consisting essentially of” and “consists essentially of” have the meaning ascribed to them in U.S. Patent law, e.g., they allow for elements not explicitly recited, but exclude elements that are found in the prior art or that affect a basic or novel characteristic of the present invention.
Furthermore, throughout the present specification and claims, unless the context requires otherwise, the word “include” or variations such as “includes” or “including”, will be understood to imply the inclusion of a stated integer or group of integers but not the exclusion of any other integer or group of integers.
In the methods of preparation described herein, the steps can be carried out in any order without departing from the principles of the invention, except when a temporal or operational sequence is explicitly recited. Recitation in a claim to the effect that first a step is performed, and then several other steps are subsequently performed, shall be taken to mean that the first step is performed before any of the other steps, but the other steps can be performed in any suitable sequence, unless a sequence is further recited within the other steps. For example, claim elements that recite “Step A, Step B, Step C, Step D, and Step E” shall be construed to mean step A is carried out first, step E is carried out last, and steps B, C, and D can be carried out in any sequence between steps A and E, and that the sequence still falls within the literal scope of the claimed process. A given step or sub-set of steps can also be repeated. Furthermore, specified steps can be carried out concurrently unless explicit claim language recites that they be carried out separately. For example, a claimed step of doing X and a claimed step of doing Y can be conducted simultaneously within a single operation, and the resulting process will fall within the literal scope of the claimed process.
Other definitions for selected terms used herein may be found within the detailed description of the present invention and apply throughout. Unless otherwise defined, all other technical terms used herein have the same meaning as commonly understood to one of ordinary skill in the art to which the invention belongs.
The present invention will be described in detail through the following embodiments with appending drawings. It should be understood that the specific embodiments are provided for an illustrative purpose only, and should not be interpreted in a limiting manner.
To evaluate the potential of direct RNA sequencing in identifying novel splicing isoforms, poly(A) tailed RNAs were first purified in the present invention. Most of these RNAs were expected to be mRNAs encoding proteins. Then direct RNA sequencing using portable MinION devices was performed and a total of approximately six million long reads from three developmental stages, i.e., embryo (EMB), L1 larva (L1) and young adult (YA) was generated. For each stage, at least 1.6 million reads with an average read length of 1118, 908 and 925 nt for EMB, L1 and YA were produced, respectively (Table 1). These reads are substantially longer than the short reads produced by massively parallel sequencing, which are typically less than 200 nt in length. The direct RNA sequencing reads were referred as long reads hereafter. It was expected that a subset of these reads represent full-length transcripts derived from the C. elegans genome.
1The Reads number is the raw read number, including that of the ENO2 internal control.
2Based on Wormbase (WS260), containing a total of 33,501 protein-coding isoforms.
The reads were mapped against the C. elegans genome using minimap2 through “split-read” alignment, which implements “concave gap cost” for long insertions and deletions to accommodate intron skipping. Taking into account of mismatches and small insertions and deletions (indels) against the C. elegans genome, the overall read accuracy is roughly 85% (Table 1). Despite this relatively low read accuracy, the percentage of long reads that can be mapped back to the C. elegans genome, referred to as mappability hereafter, is 99.7%, indicating the high specificity of the long reads, consistent with previous mapping results using other types of long reads including PacBio cDNA reads. Despite a relatively low read quality score (average q score =11) of the long read compared with the short reads, this mappability is significantly higher than the short reads that are routinely used in RNA-seq, the mappability of which is around 80%. The substantially elevated mappability over an extended genomic interval provides an advantage in identification of novel splicing isoforms. Consistent with this, when the long reads were mapped against existing annotated spliced exons and UTRs and non-coding transcripts in the WormBase WS260, the overall mappability dropped to 76.7%. The nearly perfect mappability against the C. elegans genome contrasts sharply with a much lower mappability against its annotated transcripts, suggesting a possibility of novel transcripts that are currently missing in the WormBase, highlighting the value of the long reads in defining novel splicing isoforms.
To examine to what extent the long reads represent a full-length transcript, the long reads were classified into two categories, i.e., full-length and partial-length reads. Given that the mRNAs were purified using oligo-d(T) beads, most of them had intact 3′ ends. Therefore, the full-length reads were defined as those that span at least 95% of the length of their best hit of an existing transcript as described previously, or those that carry a splicing leader (SL) at 5′ end. The remaining reads were defined as partial-length reads (
To independently evaluate the capability of Nanopore long read in recovering full-length transcript, the percentage of full-length transcripts encoded by mitochondrial genes are calculated, which also carry a poly(A) tail but no intron in C. elegans, and it is expected few complications associated with splicing. Nearly 80% of the mitochondrial transcripts were full-length, although for ND5 the total was less than 70% (
Current methods for identifying alternative splicing events using long reads mainly depend on predefined exon-intron junctions, leading to a high false positive rate in calling novel isoforms. For example, if an existing junction is inaccurately predicted, it will overwrite any isoforms defined by the long read mapping. In addition, existing methodologies for using long read to identify isoforms are not designed for quantifying expression level. Given the extremely high mappability of our long reads against the high-quality C. elegans genome, these full-length reads hold promise for de novo identification of intron-exon junctions, alternative promoter and polyadenylation site, as well as variation in UTR, which were collectively referred to as transcriptome complexity.
To take advantage of the long reads in defining transcriptome complexity, a new method, called TrackCluster, is provided to take full advantage of the mapping tracks of the full-length long reads to de novo construct a transcript isoform and determine its expression level. Using a customized classifier, TrackCluster either de novo identifies an isoform using a full-length transcript or groups the transcript with an existing isoform by their similarity score (
To demonstrate the performance of “TrackCluster”, simulation datasets were generated for gene unc-52, for which 17 isoforms are currently annotated in the WormBase. For each isoform, 10-300 long reads were randomly generated. To mimic the characteristics of our actual reads, the reads with around 85% accuracy (3.9% mismatch, 6.1% deletion and 5% insertion) were generated, around 65% of which were expected to be in full length based on the statistics of our Nanopore reads. In addition, 23% of the full-length reads were marked with an SL signal. In 100 simulations, a FDR smaller than 5% in terms of novel isoform calling and a variation smaller than 10% in terms of isoform quantification were achieved.
With approximately three million full-length long reads and approximately another three million partial-length long reads, 169,804 splicing junctions were totally identified. 150,591 (88.7%) out of those junctions were identical to those annotated in the WormBase (WS260). 4,537 (23.6%) of the remaining junctions (19,213) were also identified by meta-analysis of RNA-seq data (
It is worth noting that nearly half of the members of the category of “fusion gene” (defined as fusion between two existing separate transcripts) belong to an operon (
To examine whether different categories of novel isoforms demonstrated differential expression level, their accumulated expression level from the three developmental stages were plotted. The category involving UTR extensions demonstrated a higher expression level than the remaining categories (
To help illustrate the use of long read coverage in identifying novel isoforms, three novel isoforms of unc-52 that were identified by TrackCluster with the highest coverage of long reads were depicted. Also shown were the three existing isoforms that have the highest level of similarity to the newly identified isoforms (
In a few cases, the long reads were able to recover missing sequence within the C. elegans genome. For example, the long read derived from tsr-1 locus indicated absence of an exon along with its flanking sequences (
The polyadenylation sites (PAS) from long reads form an independent resource for identification of its motif. The PAS site were determined as described (
The capability to unambiguously assign isoforms using the long reads only permitted quantification of stage-specific expression not only at gene level, but also at isoform level. To evaluate whether stage-specific expression at gene level is contributed by differential expression of their isoforms, the isoforms from three developmental stages, i.e., EMB, L1 and YA, using TrackCluster were quantified. Despite a relatively high correlation of expression between gene levels measured with RNA-seq and the long reads (
In one of the embodiment illustrating the power of the long reads in delineating stage-specific expression of isoform, the stage-specific read tracks along with the novel isoforms identified by the long reads as well as the existing isoforms for gene efhd-1 was plotted, which is an ortholog of human efhd-1 (EF-hand domain family member D1) (
One unexpected observation was that the long reads derived from EMB were significantly longer in size than those of the remaining stages for both raw reads and mapped reads (Mann—Whitney U test, p<1e-15) (
To explore the capability of direct RNA-seq to identify modifications in RNA molecules, all of the possible modified bases via the deviation of their ion current profile from that of known unmodified nucleotides was first identified by using Tombo. It worked by computing the possibility of a possible modification on each site for every read, and output the fraction of a possible modification on the site out of all input reads. The modification was detected by reproducible deviations of ion current profile of a base in question from that of an unmodified base without knowledge of exact chemical identity of the modification (
To investigate whether the modification in cytosine was contributed by 5mC only, 5mC was detected and quantified its ratio in RNA using a well-established method for 5mC detection. The pattern of 5mC is similar to that of total cytosine modifications but at a much smaller scale (
The capability of direct sequencing of full-length transcripts provides a key advantage over massively parallel sequencing-based RNA-seq analysis in several ways. First, direct sequencing of a full-length RNA transcript makes it straightforward to identify transcript isoform with numerous exons. Second, it enables profiling of developmental stage-specific or cell-specific expression of isoforms, which is problematic using RNA-seq, but may provide important insights into developmental processes. Finally, it holds promise to define RNA modifications de novo without extra treatment step, for example, by measuring the deviation in ion current profile from that of wild-type RNA. In the present invention, direct RNA sequencing of C. elegans poly-(A) tailed RNAs derived from three developmental stages was performed by using ONT technology, and it was provided a pipeline for identifying isoform variants based on read track, allowing straightforward characterization of transcriptome complexity in a stage-specific way.
Approximately 75% out of the existing 33,500 isoforms were recovered, and identified another 57,000 novel isoforms with approximately 6 million long reads. It is likely that the actual number of novel isoforms may be substantially higher due to the following reasons. First, our sampling depth was not as high as those of analyses using RNA-seq both spatially and temporally. For example, three developmental stages were sampled in the present invention, whereas RNA-seq has been performed for tens of developmental stages in different cell or tissue types. Second, simultaneous support by at least five full-length reads were demanded to define a novel isoform. If this requirement was reduced to two full-length long reads, approximately 20,000 more candidate novel isoforms could be identified (data not shown). Third, the ratio of full-length reads may also be underestimated. This was because our definition of full-length was based on alignment against an existing transcripts. Many transcript isoforms may not exist in the WormBase that are currently annotated. For example, the full-length ratio of mitochondrial reads that carry no intron was 78%, whereas the ratio of nuclear mRNAs was only 49% (
One category of novel isoform output by TrackCluster is “intron retention”, defined as a novel transcript that carries a well-established intron supported by various RNA-seq data (
It is possible that at least a fraction of the long reads could be artifacts. This is because that these reads contain sequences derived from different parts of a single chromosome or from different chromosomes. However, the chromosome assembly in the relevant regions seems to be intact, as judged by a lack of repetitive sequences or gaps in these regions. It was speculated that these reads could be chimeric ones created during adaptor ligation, i.e., two separate reads were ligated together. It was estimated that about 0.5% of the long reads were likely to be the results of such artefacts.
Despite the ultra-long read length offered by Nanopore direct RNA-seq, a few notable caveats might limit its application in the following respects. First, its sequence throughput is substantially lower than conventional RNA-seq, translating into a much higher sequencing cost per nucleotide. Currently, only one to two GB of data of RNA sequences can be generated per flow cell, whereas over 10 GB of data of DNA sequences can be produced using the same flow cell. This low throughput significantly inhibits its application in research areas that are heavily dependent on gene expression profiling, which demand an especially high coverage for these low-abundance transcripts. Second, the relatively high error rate in read sequences is problematic during alignment in some cases. A customized alignment algorithm must be used to accommodate these errors. Third, Nanopore direct RNA-seq is known to be deficient in calling the very last bases that it sequences. This could have contributed to our lower than expected percentage of calling of SL-containing transcripts. Given that Nanopore direct RNA-seq produces sequence from 3′ to the 5′ end, it was speculated that the underrepresentation of SL signals was partially due to incomplete sequence at the 5′ end of the long reads, which inhibited reliable calling of SL signals, typically only 22 nt in length. Fourth, methodology for detecting RNA base is in its infancy and under active development. A more robust method is needed to reliably detect the RNA base modification and its chemical identity. Any putative RNA base modifications reported in this study could be artifact resulting from various noises. Functional characterization of these modifications is not warranted until they are independently validated. Finally, Nanopore direct RNA-seq demands a large amount of starting RNAs in a magnitude of ˜100 microgram. This limits its use in single-cell analysis. Future development should focus on adaptation of Nanopore direct RNA-seq to small amount of starting materials, which would maximize its potential in identifying novel isoforms.
Taken together, with the newly provided classifying method in the present invention, the long reads generated by ONT greatly facilitate the unambiguous resolution of alternative splicing events. The reads also hold great potential in de novo identification of RNA modifications, which is expected to catalyse the functional characterization of the new isoforms and modifications. Given the evidences of conserved splicing events between nematode and mammals, part of the splicing events were expected to be conserved across species.
Animals were as described and synchronized as embryo, L1 and young adult following a previous study. Briefly, C. elegans (N2) was cultured on NGM plates with OP50 E. coli at 20° C. Gravid adult worms were treated with bleach to isolate embryos. The embryos were incubated in M9 buffer without food at room temperature for 12 h to hatch and arrest at the L1 stage for harvesting. Part of the starved, synchronized L1 larvae were fed with OP50 and cultivated at 20° C. till adulthood to be harvested for RNA preparation. Animals of different stages, i.e., embryo, L1 larva and young adult, were collected and total RNAs were extracted using TRIzol (Invitrogen) following the manufacturer's instructions. Approximately 100 μg total RNAs were extracted for each sample. Around 900 ng of poly(A) tailed mRNAs was purified using Dynabeads™ mRNA Purification Kit (Invitrogen) based on the user's manual for each library preparation. Nanopore sequencing libraries were constructed using Direct RNA sequencing kit (cat #SQK-RNA001). The libraries were loaded onto Nanopore R9.4.1 flow cell (cat #FLO-MIN106) and sequenced on MinION acquired from Oxford Nanopore Technologies. The software used for sequencing was MINKNOW 2.1 with base-caller, Albacore (v2.0.1).
The resulting SAM files were sorted and indexed with “samtools” (v2.1) by sequence coordinate. For visualization on genome browser, they were converted to bigGenePred format (https://genome.ucsc.edu/goldenpath/help/examples/bigGenePred.as) using customized script in TrackCluster package. The coverage track was generated by using “bedtools (2.24)”.
The present invention is to provide the pipeline “TrackCluster” to identify novel isoforms by clustering of isoforms based on their similarity in track structure, i.e., combination of intron/exon and their positions. Briefly, after mapping, each read mapped to the reference genome was converted to a read track in bigGenePred format.
Tracks from each locus were subjected to three rounds of filtering steps to generate novel isoforms (
If the similarity score was higher than 95% between tracks, the one with a shorter length in summed exons was treated as a subread and merged to the one with a longer length. This step served to cluster the tracks with similar structures into distinct groups.
Third, for the remaining tracks that showed a similarity score lower than 95% (equivalent to the distance score 1 is lower than 5%) between each other, a pairwise similarity score 2 was calculated as follows.
In sequence track A and B, the amount of shared sequence between exons from each track was calculated in nucleotide(nt)(Aexon∩Bexon) and then the smaller nt number of two summed exon in track A and track B, min(Aexon, Bexon), was also selected. The number of intron sequences were similarly calculated as in exon sequences and normalized with a weight of 0.5. If the similarity score 2 was higher than 99% (equivalent to the distance score 2 is lower than 1%) between tracks, the one with a shorter length in summed exons was treated as a subread and merged to the one with a longer length. This step served to merge the tracks from partial-length long reads with the one defined by a full-length read. Representative isoform(s) for a locus were/was generated as a result.
TrackCluster was first run using full-length long reads, and then with the remaining reads. There are three purposes for doing this in the present invention. First is to reduce data processing time. Second is to determine the expression level of isoforms using all long reads as described below. Third is to recover the isoforms that are potentially missed by the cutoff. A fraction of isoforms was recovered with a “truncated” 5′ end (“5′ missing” or “UTR truncations” (
To quantify isoform for each locus, the subreads for each representative isoform was counted. If one subread showed 99% identity to with multiple representative isoforms (number =N), the count for each of these isoforms was counted as 1/N.
Previous study showed that Nanopore direct RNA reads was truncated a few nucleotides in the 5′ end, which made the determination of SL problematic. To maximize the possibility to recover an SL, a customized script was written as part of TrackCluster, which used Smith-Waterman (SW) alignment algorithm to detect putative SL signal by aligning the very first 22 nucleotides of the long reads against seven SL sequences. Reads with SW scores over 11 were treated as SL-containing reads. Simulation suggested that FDR was lower than 20% using these parameters and cut-off.
PAS motif was identified as described. A 50 nt region immediately upstream of poly(A) sites were scanned for all possible hexamer sequences to identify the top 50 over-represented motifs. The over-represented motifs were then scanned against the sequences of 14-24 nt (19 +/−5 nt) upstream of a PAS to obtain occurrence of the motifs within these regions. The count of motifs with same composition of nucleotides, for example, AATAAA, AAATAA and ATAAAA, were not merged as described.
Modification of the RNA sequences were identified with Tombo package version 1.4. The models of “5mC” and “de novo” were implemented separately to detect possible modification in each read. The score on each site indicated the fraction of a possible modification on a given site. For plotting the modification coverage along gene body and UTRs, the modification coverage was normalized for each isoform using “w0” method with a bin size of 5 nucleotides with EnrichedHeatmap. Only the isoforms with both 5′ and 3′ UTR longer than 50 nucleotides were used in the calculation.
The poly-(A) lengths of each read were calculated using Nanopolish. The raw current signal from the 3′ unaligned ends of reads were extracted to estimate the length of poly-(A) tail, which was deduced by the duration of the signal.
An isoform was defined as differentially expressed between stages when the change of its relative abundance (percentage of read count) out of all the transcripts within the same locus is greater than 20% across stage. A gene was defined as differentially expressed between stages when the fold change of its abundance of combined transcripts (read count per million) is greater than four across stage. Only genes supported by at least five long reads were used for the subsequent statistical analyses. Difference in the lengths of long reads between different developmental stages were calculated using Mann—Whitney U test implemented in R 3.4.4.
All raw and processed sequencing data generated in this study have been submitted to the NCBI Gene Expression Omnibus (GEO; https://www.ncbi.nlm.nih.gov/geo/) under accession number GSE130044. The source code for TrackCluster has been deposited in the Github, and is available on https://github.com/Runsheng/trackcluster. All the isoforms and tracks of the long reads can be accessed through the following link (https://genome.ucsc.edu/cgibin/hgTracks?db=cell&hubUrl=https://raw.github.com/runsheng/na notrack/master/hub.txt).
The present invention provide a method for identifying different isoforms using long reads of RNA sequencing. The potential applications of the present invention include, but are not limited to, detection of fusion transcript or other types of abnormal transcripts as a diagnostic marker for cancer cells/tissues, or genetic diseases. The present method yields a much higher sensitivity as opposed to the existing methods mostly based on Next Generation Sequencing (NGS).
The present application claims priority from the U.S. provisional patent application Ser. No. 62/931,839 filed Nov. 7th, 2019, and the disclosure of which is incorporated herein by reference in its entirety.
Number | Date | Country | |
---|---|---|---|
62931839 | Nov 2019 | US |