Embodiments of the subject matter disclosed herein generally relate to a method for RNA-seq quantification for estimating an abundance level of functional mRNAs for determining genes with deleterious transcriptional aberrations.
RNA-seq has been adopted to study a wide range of biological problems. Recent developments have shown the promising use of RNA sequencing (RNA-seq) data in clinical diagnostics. Importantly, as transcriptomics (i.e., a set of all RNA molecules in one cell or a population of cells) helps improve the interpretation of functional consequences of underlying genetic variants, the use of RNA-seq data can lead to the detection of causative genes of genetic disorders that genome-based diagnostics alone has failed to identify.
A step in the application of RNA-seq data is the use of appropriate quantification methods suitable for specific downstream analysis [1], [2]. In transcriptome-based diagnostics, RNA-seq data are typically quantified to detect splicing outliers through analysis of differential splicing events between cases and controls [3], [4]. Splicing is a form of RNA processing in which a newly made precursor messenger RNA (pre-mRNA) transcript is transformed into a mature messenger RNA (mRNA). During splicing, introns (i.e., non-coding regions) are removed and exons (e.g., coding regions) are joined together. For those eukaryotic genes that contain introns, splicing is usually required in order to create an mRNA molecule that can be translated into a protein. A splicing outlier corresponds to a gene that is aberrantly expressed with abnormal splicing patterns and thus, such an mRNA molecule is an undesirable one.
However, a large number of unannotated splicing junctions, often found in the RNA-seq data, makes the discovery of the causative genes solely from the quantification of splicing junctions challenging. A splicing junction can be annotated or unannotated. An annotated splicing junction is a junction found in a normal mRNA molecule for which there is prior information, and thus, such a junction is known and expected to happen in a certain fashion. However, an unannotated splicing junction is not known, and there is no prior information about it.
Furthermore, the degradation of aberrant transcripts via the activation of the nonsense-mediated decay (NMD) pathway can complicate the prediction of causative genes from the splicing outliers. The NMD is an mRNA quality-control mechanism that typifies all eukaryotes examined to date. The NMD surveys the newly synthesized mRNAs and degrades those that harbor a premature termination codon (PTC), thereby preventing the production of truncated proteins that could result in disease in humans. To take into account the degradation of such deleterious variants, gene expression outliers can be identified via differential expression analysis between cases and controls. A gene expression outlier is undesirable in the gene expression process.
Indeed, the integration of these two separate types of transcriptional aberration analysis has shown to increase the diagnostic rate [5], [6]. However, because the existing RNA-seq quantifications used for aberrant expression analysis in these previous studies focus on the overall abundance levels of all transcripts [5] to [7], they substantially overestimate the abundance level of the functional mRNAs (i.e., mRNAs that produce proteins) when nonfunctional, aberrant, transcripts with splicing error are present. For this reason, the traditional methods underestimate the signal for detecting true causative genes.
Thus, there is a need for a new method or measure for determining a more accurate abundance level of the transcriptional aberrations.
According to an embodiment, there is a method for analysis of transcriptional aberrations and molecular diagnostic of genetic diseases. The method includes a step of receiving ribonucleic acid, RNA, related data, a step of calculating a probability λt of an error-free splicing for a coding transcript t based on the RNA data, a step of calculating the count-per-million (CPM) normalized xt for the coding transcript t based on the RNA data, a step of calculating an omega index based on a product of the probability λt and the CPM normalized xt for a gene g of the human genome, and a step of determining that the gene g is a candidate for a genetic disorder.
According to another embodiment, there is a computing device for analysis of transcriptional aberrations and molecular diagnostic of genetic diseases. The computing device includes an interface configured to receive ribonucleic acid, RNA, related data, and a processor connected to the interface and configured to calculate a probability λt of an error-free splicing for a coding transcript t based on the RNA data, calculate the count-per-million (CPM) normalized xt for the coding transcript t based on the RNA data, calculate an omega index based on a product of the probability λt and the CPM normalized xt for a gene g of the human genome, and determine that the gene g is a candidate for a genetic disorder.
For a more complete understanding of the present invention, reference is now made to the following descriptions taken in conjunction with the accompanying drawings, in which:
The following description of the embodiments refers to the accompanying drawings. The same reference numbers in different drawings identify the same or similar elements. The following detailed description does not limit the invention. Instead, the scope of the invention is defined by the appended claims.
Reference throughout the specification to “one embodiment” or “an embodiment” means that a particular feature, structure or characteristic described in connection with an embodiment is included in at least one embodiment of the subject matter disclosed. Thus, the appearance of the phrases “in one embodiment” or “in an embodiment” in various places throughout the specification is not necessarily referring to the same embodiment. Further, the particular features, structures or characteristics may be combined in any suitable manner in one or more embodiments.
According to an embodiment, a novel measure, called herein the omega measure or the omega index, is introduced and this measure evaluates transcriptional aberrations by focusing on the portions of the functional mRNAs rather than the overall mRNA abundance levels. This means that the omega measure is an RNA-seq quantification with the ability to more accurately estimate the abundance level of functional mRNAs, which is believed to be more suited for the analysis of true gene expression and splicing outliers. The omega measure (observed mRNA's error-free gene-level abundance) is a between-sample RNA-seq quantification that estimates the gene-level abundances of splicing-error-free mRNA transcripts annotated to code for functional proteins. The term “code” herein is used to distinguish over the mRNAs that are not resulting into a functional protein. By using clinically established cases of genetic disorders, the inventors have demonstrated, as discussed later, that the novel measure omega is a practical application in the genome field capable to intensify the signal for deleterious transcriptional aberrations, to narrow down the search space for the prediction of causative genes.
The omega measure estimates the relative abundance of the mRNA transcripts produced through error-free splicing processes, i.e., functional mRNA transcripts.
The method computes the probability of error-free splicing events for each pair of consecutive exons in each coding transcript based on observed splicing and intron retention events as illustrated in
By using the splice junction counts and the intron retention counts values as weights in the count-per-million (CPM) measure of the mRNA transcripts for each coding gene G, the method is able to compute in step 130 the omega measure 132.
The omega measure allows one to partition the expression level of each coding gene G into an annotated value 410, corresponding to the omega measure, as illustrated in
Assuming that the functional mRNAs' abundances play an important role in controlling the phenotypic state of the cell, the omega measure is expected to differentiate RNA-seq samples based on their phenotypic characteristics. Indeed, clustering analysis of the GTEx data (see the Consortium G, (2013), The Genotype-Tissue Expression (GTEx) project. Nature genetics 45: 580-585) showed that the omega measure outperformed the transcript-per-million (TPM) quantification in differentiating RNA-seq samples with different tissue types, as discussed later.
The method for calculating the omega measure 132 illustrated in
To count the frequency of each splicing junction in sub-step 116, the method computed the coverage of the observed introns from the uniquely mapped split reads of sub-step 114, using SAMtools (see, for example, Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, et al., (2009), The sequence alignment/map format and SAMtools. Bioinformatics (Oxford, England) 25: 2078-2079) and BEDTools (see, for example, Quinlan A R, Hall I M, (2010), BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics (Oxford, England) 26: 841-842).
To classify each observed junction as annotated or unannotated in sub-step 112, the method computes in this embodiment a distance of each observed splicing junction from the closest donor and acceptor sites using the location of the annotated exons in GENCODE 25. To analyze in sub-step 118 the intron retention events, the method uses SAMtools and BEDTools to count the number of unsplit reads, which are determined in sub-step 117 based on the input 104 in step 100, within annotated intronic regions and multiples the counts by the ratio of the read length to the corresponding intron size.
Given these measures, a Markovian assumption is made and the probability λt of splicing-error free, transcription rate for each coding transcript t is computed in sub-step 112, by using the formula:
where Jt is a set of splice junctions for transcript t, ai is the count of the annotated splice junction i (calculated in sub-step 116), ui is the count of the unannotated splice junction i (calculated in sub-step 116), iri is the normalized count for the intron retention (calculated in sub-step 118) within the annotated splicing region, and ε is a small constant.
The step 120 discussed above, which achieves the transcript count normalization, includes a sub-step 124 of determining the transcript counts and a sub-step 126 of mapping the RNA-seq reads to a reference transcriptome for hg38 (GEN-CODE 25) using kallisto (Bray N L, Pimentel H, Melsted P, Pachter L, (2016), Near-optimal probabilistic RNA-seq quantification. Nature biotechnology 34: 525-527) to estimate the transcript counts. To compute the CPM, i.e., the normalized counts for coding transcripts, the method selects those transcripts that are annotated as protein-coding from chromosomes 1-22 and X and normalizes their counts so that the sum is fixed to 1 million.
To compute the TPM, for comparison purposes, the counts of the selected transcripts were divided by their lengths and then normalized the length-normalized counts to have the sum to be 1 million. To obtain the gene-level CPM and TPM for each coding gene, the normalized values of its coding transcripts were summed up. The normalized count for the annotated coding transcripts was calculated in sub-step 128, based on the results from sub-steps 112 and 122.
Based on the annotated transcript rate from sub-step 112 and the CPM from sub-step 122, the omega index is defined as:
where Tg is a set of coding transcripts annotated for gene g, λt is the annotated, error-free splicing for transcript t, calculated in sub-step 112, and xt is the CPM for the coding transcript t calculated in sub-step 122.
Based on the method illustrated in
To compare the prediction power of the omega measure with the existing measures, the following definition has been used to determine the low expression outliers. The gene-level mRNA abundance levels of the cases and the GTEx dataset were quantified using the traditional TPM measure and the novel omega measure, and they were then log-transformed with base 10. For each gene g in the coding gene set G, let Q1g and Q2g be the 1st quartile and the 2nd quartile of the log-transformed gene-level mRNA abundance distribution of the GTEx cohort for a given cell type, respectively, and yg be the log-transformed mRNA abundance level of the gene g from a case. Then, the low expression outliers are computed using ag defined as follows:
Using a threshold parameter θ, the low expression outliers are defined to be a subset of G in which each gene g satisfies ag≥θ. Genes with no detectable promoter activity, i.e., gene-level TPM=0, are excluded from the pool of potential expression outliers.
Based on this measure, the prediction of the omega index has been tested as now discussed. RNA samples of either fibroblasts or lymphocytes from five patients with clinically established autosomal recessive diseases were prepared. The samples were sequenced for obtaining the RNA-seq. The quality of each RNA sample was determined based on its RNA Integrity Number (RIN) and ensured that RNA-seq data were generated from a high-quality RNA sample (RIN 8.0) for each case. Then the sequencing libraries were prepared and paired-end 150 bp reads were generated. The GTEx RNASeq samples were downloaded from publicly available data (e.g., Consortium G, (2013), The Genotype-Tissue Expression (GTEx) project. Nature genetics 45: 580-585) for the blood and skin tissue types from the Database of Genotypes and Phenotypes with RIN≥6 (n=1,572) and transformed into the fastq format using SRA Toolkit (see, e.g., https://www.ncbi.nlm.nih.gov/sra/docs/toolkitsoft). These RNA-seq datasets consist of paired-end 75 bp reads.
To investigate the value of the novel RNA-seq quantification in clinical diagnostics, the omega measure and TPM were applied to the five clinically established cases of distinct autosomal recessive disorders with known causal transcriptional aberrations as a case study. For this comparison, the GTEx RNA-seq data was used as control data (skin fibroblast: n=305 and lymphocyte: n=b 134) and the low expression outliers were determined based on equation (3), out of 19,688 coding genes in the five diagnosed cases. Here, the desirable outcome from the RNA-seq quantification is the selection of a small set of expression outliers that contain the true disease gene. Because each of these cases has a large number of unique splicing junction sites that are not found in the corresponding cell type of the GTEx cohort, as illustrated in
Note that in
The inventors have found that, in all five cases, the omega index was able to generate smaller sets of outliers that contain the true causal genes than the TPM method, as shown in
In this regard, the comparison of the mRNA abundances of the causative genes between the omega measure and the TPM method highlights that the consideration of the error-free transcripts allowed the omega measure to strengthen the discriminability of the disease-causing gene between the cases and the controls. While the above discussion illustrated the application of the omega measure to the autosomal recessive diseases, this quantification is also expected to be suitable for molecular diagnostics of other types of genetic diseases, as well as complex diseases such as cancer.
A method for analysis of transcriptional aberrations and molecular diagnostic of genetic diseases is now discussed with regard to
In one application, the step of calculating a probability λt of an error-free splicing for a coding transcript t includes calculating a ratio of (1) a count of an annotated splice junction and (2) a sum of (i) the count of the annotated splice junction, (ii) a count of unannotated splice junction, and (iii) a normalized count of an intron reduction within the annotated splicing region. The ratio for the junction i is multiplied with corresponding ratios of other junctions that belong to a set of splicing junctions for the transcript t to calculate the probability λt.
The step of calculating the count-per-million (CPM) normalized xt for the coding transcript t may include determining the transcript counts; selecting that transcripts that are annotated as protein-coding to obtain coding transcript counts; and normalizing the coding transcript counts so that a sum of the coding transcript counts is one million. The step of calculating the omega index includes calculating a product of the probability λt and the CPM normalized xt for each transcript t, which is part of a set Tg of coding transcripts annotated for the gene g. In one application, a transcript is determined to be annotated by calculating a distance of each observed splicing junction from a closest donor and acceptor sites using a location of an annotated exon of the RNA.
The above-discussed procedures and methods may be implemented in a computing device as illustrated in
Exemplary computing device 900 suitable for performing the activities described in the embodiments may include a server 901. Such a server 901 may include a central processor (CPU) 902 coupled to a random access memory (RAM) 904 and to a read-only memory (ROM) 906. ROM 906 may also be other types of storage media to store programs, such as programmable ROM (PROM), erasable PROM (EPROM), etc. Processor 902 may communicate with other internal and external components through input/output (I/O) circuitry 908 and bussing 910 to provide control signals and the like. Processor 902 carries out a variety of functions as are known in the art, as dictated by software and/or firmware instructions.
Server 901 may also include one or more data storage devices, including hard drives 912, CD-ROM drives 914 and other hardware capable of reading and/or storing information, such as DVD, etc. In one embodiment, software for carrying out the above-discussed steps may be stored and distributed on a CD-ROM or DVD 916, a USB storage device 918 or other form of media capable of portably storing information. These storage media may be inserted into, and read by, devices such as CD-ROM drive 914, disk drive 912, etc. Server 901 may be coupled to a display 920, which may be any type of known display or presentation screen, such as LCD, plasma display, cathode ray tube (CRT), etc. A user input interface 922 is provided, including one or more user interface mechanisms such as a mouse, keyboard, microphone, touchpad, touch screen, voice-recognition system, etc.
Server 901 may be coupled to other devices, such as sources, detectors, DNA and RNA decoders, etc. The server may be part of a larger network configuration as in a global area network (GAN) such as the Internet 928, which allows ultimate connection to various landline and/or mobile computing devices.
The disclosed embodiments provide a method for estimating the abundance level of functional mRNAs in a between-sample RNA-seq quantification, for analysis of transcriptional aberrations and molecular diagnostic of genetic diseases. It should be understood that this description is not intended to limit the invention. On the contrary, the embodiments are intended to cover alternatives, modifications and equivalents, which are included in the spirit and scope of the invention as defined by the appended claims. Further, in the detailed description of the embodiments, numerous specific details are set forth in order to provide a comprehensive understanding of the claimed invention. However, one skilled in the art would understand that various embodiments may be practiced without such specific details.
Although the features and elements of the present embodiments are described in the embodiments in particular combinations, each feature or element can be used alone without the other features and elements of the embodiments or in various combinations with or without other features and elements disclosed herein.
This written description uses examples of the subject matter disclosed to enable any person skilled in the art to practice the same, including making and using any devices or systems and performing any incorporated methods. The patentable scope of the subject matter is defined by the claims, and may include other examples that occur to those skilled in the art. Such other examples are intended to be within the scope of the claims.
This application claims priority to U.S. Provisional Patent Application No. 62/813,301, filed on Mar. 4, 2019, entitled “NEW METHOD TO QUANTIFY AND EVALUATE THE ACTIVITY OF GENES USING RNA-SEQ DATA,” the disclosure of which is incorporated herein by reference in its entirety.
Number | Date | Country | |
---|---|---|---|
62813301 | Mar 2019 | US |