The present invention relates to providing information of transcript (e.g. mRNA) abundances based on next generation sequencing (NGS) reads.
Next generation sequencing technology produces a large amount of short reads when sequencing a nucleic acid sample. An essential step in next generation sequencing is the library preparation or library prep for short. This process takes mRNA or cDNA as input and produces a library of short cDNA fragments, each corresponding to a section of an mRNA molecule. These fragments are then sequenced by an NGS sequencer, usually not in their entirety but partially at their start and/or at their end. This results in short sequences of nucleotides which are called reads and are most commonly stored by the NGS sequencer as sequences of a group of four ASCII characters such as A, C, G, T or 0, 1, 2, 3, representing the nucleobases of the genetic code. In order to infer which mRNA molecules were present in the original sample, the reads are mapped onto a reference genome.
Next generation sequencing has been employed in a variety of genome mapping procedures (US 2013/110410 A1) or DNA identification methods, e.g. by using a mapped genome to associate sequence reads to a certain organism variant (WO 2009/085412 A1).
WO 2009/091798 A1 describes a method for obtaining a profile of a transcriptome of an organism, the method comprising: sequencing one or more cDNA molecules to obtain sequencing reads; aligning each of the sequencing reads to a reference sequence. However, unknown to prior methods a major problem underlying transcriptome analysis using short sequence reads is the alignment step in case of transcript variants, such as isoforms which differ in sequence deviations, such as e.g. strain difference of a gene, point mutations, or splice variants of one protein. It is usually difficult to align short sequence reads correctly to one transcript variant.
The most common method to assemble transcript sequencing data based on sequence reads is the “Cufflinks” method (Trapnell et al. 2010). Cufflinks constructs a parsimonious set of transcripts that “explain” the reads observed in an RNA-Seq experiment. It does so by reducing the comparative assembly problem to a problem in maximum matching in bipartite graphs. In essence, Cufflinks implements a constructive proof of Dilworth's Theorem by constructing a covering relation on the read alignments, and finding a minimum path cover on the directed acyclic graph for the relation. Using this statistical method, Cufflinks can estimate the abundances of the transcript isoforms present in the sample, either using a known reference annotation, or after an ab-initio assembly of the transcripts using only the reference genome. Cufflinks uses a statistical model of paired-end sequencing experiments to derive a likelihood for the abundances of a set of transcripts given a set of fragments. This likelihood function can be shown to have a unique maximum, which Cufflinks finds using a numerical optimization algorithm. The program then multiplies these probabilities to compute the overall likelihood that one would observe the fragments in the experiment, given the proposed abundances on the transcripts. Because Cufflinks' statistical model is linear, the likelihood function has a unique maximum value, which is found by Cufflinks with a numerical optimization algorithm.
Roberts et al. (2011) relates to methods of improving RNA-Seq expression estimates by correcting fragment bias. Wen-Ping et al. (2007) describes mixture modelling of transcript abundance classes in natural populations.
Prior methods failed to differentiate correctly between transcript variants and to obtain the correct transcript amount or abundance in relation to other transcripts. As shown by comparison herein, even the Cufflinks method failed in several experiments to arrive at the correct transcript abundance information.
It is a goal of the present invention to provide improved methods that allow a more accurate assessment of transcript abundances.
The present invention provides a method of estimating transcript abundances comprising the steps of
a) obtaining transcript fragment sequencing data from a potential mixture of transcripts within a genetic locus of interest,
b) assigning said fragment sequencing data to genetic coordinates of said locus of interest thereby obtaining a data set of fragment genetic coordinate coverage, said coverage for each genetic coordinate combined forming a coverage envelope curve (also referred to as total coverage histogram or histogram of total),
c) setting a number of transcripts of said mixture,
d) pre-setting a probability distribution function of modelled genetic coverage for each transcript i, with i denoting the numerical identifier for a transcript, wherein said probability distribution function is composed of the mathematical product of a weight factor αi of said transcript i and the sum of at least 2 probability subfunctions j, with j denoting the numerical identifier for a probability subfunction, each probability subfunction j being independently weighted by a weight factor βi,j,
e) adding the probability distribution functions for each transcript to obtain a sum function,
f) fitting the sum function to the coverage envelope curve thereby optimizing the values for αi and βi,j to increase the fit,
g) repeating steps e) and f) until a pre-set convergence criterion has been fulfilled, thereby obtaining the estimated transcript abundance for each transcript of the mixture given by the weight factor αi as optimized after the convergence criterion has been fulfilled.
The invention further provides a computer program product employing said method, e.g. containing machine code to perform or assist in said methods and steps on a computer. The computer program product can be provided on any kind of memory device. Also provided is a system, e.g. a computer device, programmed to assist in performing the steps of the inventive method. Calculation steps are usually performed without assistance of an operator. Input and setting steps may be assisted by the program or system, e.g. by suggesting an option proposal for the number and type of probability subfunctions in step d). Of course, the program or system may also be performed with default parameters without further input from an operator.
The following detailed description and preferred embodiments apply to all aspects of the invention and can be combined with each other without restriction, except were explicitly indicated. Preferred embodiments and aspects are defined in the claims.
The present invention employs a numerical method to obtain transcript abundance information from a sample of transcript fragment sequences.
The method aligns (NGS) reads, generally referred to as transcript fragment sequences, to a reference sequence, such as a reference genome, to obtain genetic coverage information (step b). Prior statistical tools used for this purpose often make unrealistic assumptions about the nature of the observed data and the estimates of the transcript concentrations are therefore inaccurate. Some of the most widely used tools, such as Cufflinks, assume that the distribution of reads along a transcript is uniform, which contradicts current mRNA-Seq protocols. The invention provides a statistical model that is capable of learning simultaneously the bias of the read distribution along a transcript and the transcript abundances. For this purpose the read or fragment distributions of the transcripts are modelled by mixtures of functions which are trained together with the transcript abundances in the fitting step. The methods used in the fitting step can be inferred from prior maximisation or minimisation procedures, such as a maximum likelihood framework with the expectation maximization algorithm. Since the total probability distribution of the reads in the inventive model is a mixture of mixtures, this model is called the Mix2 (read: Mixquare) model. It is shown in the following that the Mix2 model is very versatile and can be adjusted to different structures inherent in the data by optional parameter tying. In particular, the methods used to obtain transcript abundances can be suitable for transcript related probability distributions. In experiments it is shown that the Mix2 model achieves considerably better estimates for the transcript abundances than the statistical model used in the Cufflinks program. Even when starting from inaccurate transcript annotations the Mix2 model can learn the correct annotation from the data and produces abundance estimates far superior to those of the prior art. Due to the superior learning capability during the fitting step, the starting parameters, e.g. selected during the assignment step a) or during the selection of (e.g. random) probability distribution functions in step d), are not crucial. Even the assumed number of transcripts can be different. It may well be that a wrong transcript annotation or transcript number assumption will be corrected by e.g. fitting the probability distribution function for one or more transcripts to an abundance of zero. The weight factor alpha, which can be modelled as a probability, will represent the abundance of a transcript after convergence.
As used herein “genetic locus of interest” does not limit the method to one continuous sequence stretch of a gene on a chromosome. It generally refers to one or more sections of genetic sequences. A genetic locus can be associated with genomic coordinates, providing position information for sequence reads (transcript fragment sequences). A genetic “position” or “coordinate” is used herein to refer to a nucleotide numerically identified on a reference sequence with a certain distance from the start of the reference sequence. A genetic coordinate can be expressed in genomic coordinates or in transcript coordinates if the genetic coordinate is compatible with the transcript, which is the case if the genetic coordinate is located within any of the transcript exons. Compatible genomic coordinates are transformed into transcript coordinates by calculating the relative distance from the transcript start and subtracting the length of the introns preceding the genomic coordinate. Transcript coordinates can be converted into genomic coordinates by reverting this process. Probability distributions defined on the coordinates of one transcript may be converted to probability distributions on the transcript coordinates of another transcript by shrinking, stretching and shifting of the probability subfunctions.
A nucleobase type (e.g. A, T/U, G, C) may or may not be associated with a genomic coordinate or position. Usually, one type of nucleobase is associated with each genetic coordinate for each transcript. However, for each genetic coordinate different overlapping transcripts may differ in the nucleobase constitution since the present invention can be used to differentiate point mutations, i.e. one or more nucleobases may differ between different transcripts, especially if the sample contained nucleic acid molecules of different organisms or of different alleles. Even in the case of possible mismatches between the transcripts, their fragments and a locus of interest, common genetic coordinates can be allocated. Step b), assigning fragment sequences to genetic coordinates, may comprise sequence comparison or alignment between these sequences, which provides a genetic coordinate. A sequence comparison is well known in the art (e.g. by the open source cufflinks method mentioned) comprising the comparison of the nucleotides. The inventive method can be used on data of different strains or organisms with deviating sequences on a given genetic locus of interest, or may be used to differentiate between different splice variants, i.e. different transcripts which can be distinguished by different combinations of exon sequences. Thus mismatches during assignment step a) can be allowed or disallowed. Such mismatches are preferably at most 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 or 11 mismatches per 100 bases.
The present invention can model the abundance of individual transcripts in a mixture of transcripts (e.g. as the relative amount, which corresponds to a probability). The transcript fragment sequencing data may comprise at least 2, preferably 3, 4, 5, 6, 7, 8, 9, 10, or at least or at most 15, at least or at most 20, at least 25, at least or at most 30, at least or at most 40, transcript sequences. The number of transcripts can be selected e.g. in a selection step performed on a nucleic acid sample by e.g. amplifying or removing nucleic acids. Such removed or amplified nucleic acids may contain a common sequence stretch, such as a sequence corresponding to an oligonucleotide probe, anchor sequence or primer sequence of choice. One or more genomic loci may be selected in this way. In particular preferred embodiments of the invention transcript variants of one gene are investigated. Nevertheless in general, “transcript” as used herein may refer to any nucleic acid or its sequence of any gene or gene combination and any variant thereof, in particular mRNA or cDNA sequence variants thereof.
Preferably the transcript fragment sequences of said transcript fragment sequencing data have a length of 5 to 800 nucleotides, preferably of 6 to 600 nucleotides, of 7 to 400 nucleotides, of 8 to 200 nucleotides or of 9 to 150 nucleotides, even more preferred of 10 to 100 nucleotides, especially preferred of 12 to 70 nucleotides.
The number of transcript fragment sequences may be at least or at most 100, at least or at most 500, at least or at most 1000, at least or at most 5000, at least or at most 10000. Preferably at least or at most 10, or at least or at most 20, at least or at most 50, for each transcript. In combination or as an alternative, the number of transcript fragment sequences may be at most 400000, at most 300000, at most 200000, at most 100000 or at most 50000.
The transcript length of at least one or more, e.g. all, transcripts may be e.g. 100 to 1000000 nucleotides, preferably 1000 to 100000 nucleotides or 2000 to 10000 nucleotides.
In preferred embodiments, the genetic locus of interest comprises one or more isoform, e.g. encoding transcript sequences, of one or more genes or genetic elements, preferably comprises at least 2, 3, 4 or more splice variants of one gene or genetic element. It may comprise one or more other splice variants of another gene or genetic element. In addition or alternatively to splice variants, it may comprise different alleles. In preferred embodiments, the gene or genetic element encodes a protein (e.g. mRNA), but also non protein-coding transcripts, such as regulatory or catalytic RNA, including microRNA, snoRNA or rRNA, as well as their precursors, in particular pre-microRNA or prerRNA.
As used herein “gene” and “genetic elements” relate to genetic nucleotides with a sequence that is transcribed to form one or more transcripts.
As used herein “isoform” is used to relate to a particular variant of a transcript. Transcripts of one “gene” or “genetic elements” may differ, e.g. in case of splice variants, thus giving rise to different isoforms. Other isoform variations can be caused by different alleles or different sources of the transcript material, e.g. in a mixture of different microorganisms or strains.
The step of setting a number of transcripts may comprise obtaining pre-annotated sequence data from the genetic locus of interest and setting the number of transcripts to at least the number of different transcript sequences, including splice variants counting as different transcript sequences, expected from said genetic locus of interest. As said before, the set number of transcripts may exceed the actual number of transcripts (which may or may not be known exactly) since incorrect transcripts may be removed during fitting of the sum function, resulting in one or more weight factors alphas converging to zero. Usually, a genetic locus of interest is known, e.g. from the cDNA generation step of any one of the nucleic acid selection steps. Using annotated genetic data, it is possible to arrive at a starting number for the amount of transcripts, imax. For each transcript, an abundance will be modelled by setting the probability distribution function in step d). This function contains a weight factor alpha, which corresponds to the abundance of the transcript in the mixture after convergence of the sum and probability distribution fitting, an iterative process. The mixture of the transcripts is the first mixture of the inventive model. Each alpha can be modified separately during the fitting process. Thus, setting a number of transcripts may be a rough estimate. For example, it may be 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, at least any one of 16, 20, 30, 40 or more.
A second mixture is mathematically contained in each probability distribution function. The probability distribution function is composed of 2 or more (e.g. 2, 3, 4, 5, 6, 7, 8, 9, 10 or more) probability subfunctions or “blocks”. It is essential to have at least 2 probability subfunctions in order to be able to model asymmetric read distributions over the length of the transcript (transcript coordinate range). The total amount of probability subfunctions per transcript is jmax. Jmax is usually around 4-8. Too high amounts of probability subfunctions may lead to overtraining and reduced computational efficiency.
The probability distribution function is the sum of the probability subfunctions. Noteworthy, the factor alpha weighs the entire probability distribution function and thus each underlying probability subfunction equally. Within the probability distribution function, the probability subfunctions are weighted separately by a weight factor βi,j. (betai,j). Of note, for each transcript, i.e. each probability distribution function, the betas may be independent or dependent factors (“tied”). In case of independencies, each beta can be individually modified during the fitting step. It is also possible to tie the betas between the different probability distribution functions, such as the first beta, or generally any given betaj, will be the same for each probability distribution function (e.g. beta1,j=beta2,j). This reduces the number of modelled weight factors beta to just jmax, instead of imax×jmax which simplifies the fitting process and reduces computational resource usage. This simplification works best if for each transcript a similar fragment coverage distribution can be expected, e.g. in a given sequence portion always larger abundances are expected than in other portions—for two or more, e.g. all, transcripts (the “tied” transcripts). This assumption is usually true for transcripts of about the same or similar length, such as transcripts having a length (in nucleobases) of 0.3 to 3.2, preferably of 0.5 to 2.1, of another tied transcript. Of course, as will be more clear in the following, other parameters of the probability subfunctions can also be tied in a similar fashion between different transcripts or also between the probability subfunctions of a given transcript (e.g. when shifting the entire probability distribution function for a transcript transversally in the genetic coordinate direction). Also, in step d) it is possible to normalize the probability subfunctions of a transcript to about the same maximum height or maximum value (usually modelled as a probability or probability contribution of the part to the total of a probability distribution function) to each other.
According to a further option for tying of parameters, it is also possible to only tie one or more parameter, e.g. one or more beta, in a group of transcripts, i.e. each probability distribution functions—but not through all transcripts, i.e. each probability distribution function. Such a group may be defined by expected similar shapes of the sum functions of the probability distribution functions. E.g. a group may be defined by similar lengths and/or similar GC-content. A similar length is e.g. when all members of the group have a length within +/−50%, preferably +/−35%, more preferred +/−30%, of a mean size of all members. A similar GC-content may e.g. be within +/−10%, preferably +/−5%, of a mean GC-content of all members.
“About” as used herein may refer to the same value or a value differing by +/−10% of the given value.
The sum of the probability subfunctions for a transcript i constitute the probability distribution function for the transcript I, with the given weight factors. As probability subfunction any mathematical function can be selected, with the requirement that the sum of these probability subfunctions can form a probability function. Since the probability subfunctions form the basis for a computational model that can be fitted and optimized, simple functions with few parameters are preferred. The probability subfunction may e.g. comprise or consist of 1, 2, 3, 4, 5, 6, 7, 8, 9 or 10 function parameters determining the shape of the function in dependence of the genetic coordinate. The probability subfunctions j are preferably constituted of positive values for each genetic coordinate. Preferably the probability subfunction is an aperiodic function and/or especially preferred the probability subfunction is a density function or probability function, but is referred to herein as probability subfunction or “block” to distinguish it from the sum of the probability subfunctions, which is referred to as the probability distribution function (for a given transcript). It usually contains a maximum at a certain genetic coordinate and steadily decreases for positive and negative genetic coordinates removed from said maximum. Preferably the probability subfunction has only one maximum. Example probability subfunctions j are selected from a Gaussian function, a square shape function, a triangle shape function, preferably a Gaussian function.
The genetic coordinates or the range thereof is preferably the same for each transcript. To this end the probability distribution function for a transcript can be shrinked or extended in the genetic coordinate direction, or shifted. Of course such a transformation can be reversed to again arrive at the correct source coordinate in the reference sequence, e.g. the genome. However such a transformation may be beneficial for the abundance modelling algorithm.
This shrinking, extending and shifting can be used to obtain or improve the knowledge of the correct (or more likely) transcript length. In a preferment, a transcript length estimate is obtained by this step, which is used to determine the numbers (e.g. amount or concentration) of fragments. Fragment numbers can be estimated from FPKM (fragments per kilo base pair per million reads) values, based on the relationship: FPKM=number of fragments/transcript length (×weight factor). Hence, the number of fragments correlates to: FPKM×transcript length.
Transcript length estimates can be obtained by setting the transcript start and end positions according to the optionally shrunken or extended and/or shifted probability distribution functions of the sum function after convergence of step e). Due to the nature of the sum function being composed of probability distribution functions, which may not have a defined start and end position, but only probability values for that, the start and end can be estimated by e.g. as the genetic coordinate where the partial area up to (for start) or starting from (for end) said genetic coordinate is a fraction of the whole area under the curve (of the sum function or the first or last probability distribution function). This fraction can be a value between 1% to 10%. A skilled practitioner can easily test suitable values to determine such an area cut-off dependent on the shape of the used probably distribution function. It can be easily tested using model nucleic acids with a known start and end position and setting up the inventive function with other start and end positions and letting the inventive algorithm train improved start and end positions as described. Example values used for cut-off values are 0.5%, 1%, 2,%, 3%, 4%, 5%, 6%, 7%, 8%, 9% or 10%.
According to a further option, the fragment length distribution or average fragment length is calculated using the inventive model functions after convergence of step e).
In particular, the genetic coordinate for one or more, preferably each transcript (and its probability distribution function) may correspond to nucleotide positions in a genome, optionally transformed to omit genetic regions not of interest, wherein preferably said genetic regions not of interest do not contain coverage by said transcript fragment sequencing data —such as introns for which no read or sequence fragment aligns. Thus a continuous probability distribution function can be modelled with no intermittent zero distribution. Such regions not of interest can be removed from the individual probability distribution functions and additionally or alternatively also from the coverage envelope curve (to which the sum function of the various probability distribution functions is fitted during each iteration step g)). The envelope curve represents the histogram of the combined sequence data coverage.
Thus preferably the inventive method further comprises a step b2) comprising removing genetic coordinate positions with introns from said coverage envelope curve. This results in obtaining a modified coverage envelope curve, which replaces the use of the coverage envelope curve in step f) and preferably all other steps—except of course where the probability distribution functions are again referred back to the original fragment sequencing data or the genetic or genomic coordinates of the locus of interest.
The genetic coordinate coverage is not required to use the entire sequence information of a read in order to model the abundance. It is also sufficient to use only a limited number of nucleotides of the read sequence, e.g. just the first one or start site information. The fragment genetic coordinate coverage can contain the count of at least one nucleotide for each fragment sequence assigned to a genetic coordinate, preferably wherein at least one nucleotide comprises the fragment start site or the entire fragment sequence. “Comprises” as used herein shall be understood as an open definition, allowing further members as in containing. “Consisting” on the other hand is considered as a closed definition without further elements of the consisting definition feature. Thus “comprising” is a broader definition and contains the “consisting” definition. Any definitions herein using the “comprising” language may also be read with a consisting limitation in a special embodiment of the invention.
The two or more probability subfunctions, which are the components of the probability distribution function, are preferably evenly distributed among the genomic coordinate range of said probability distribution function (of a transcript). In general, preferably the probability subfunctions for a transcript comprise maxima each at a different genetic coordinate. In a particular preferred embodiment of the invention, the maxima of the probability subfunctions for each transcript are separated at least about 1/jmax times the difference between the first and last nucleotide of the genetic coordinate as assigned to the transcript in step a), optionally after transformation as described above (which may provide a modified genetic coverage, such as after removal of the introns). Thus the probability subfunctions can be distributed evenly along the genetic coordinate covered by a transcript (and hence the probability density function modeling the coverage). Of course, it is also possible to deviate from the optimal even distribution. Thus the maximum location on the genetic coordinate may deviate e.g. 0%-50%, preferably 10%40%, of the transcript length or the transcript length divided by jmax, from the even distribution maximum for each probability subfunction.
Thus, in step d) the probability subfunctions for a transcript can be positioned or shifted in the genetic coordinate to cover the entire length of a transcript with a positive value.
In preferred embodiments the method comprises determining sequence reads of at least one transcript, preferably mRNA or cDNA, wherein said reads comprise the sequence of fragments of said transcript, to provide said transcript fragment sequencing data. The determination step can be performed by any method known in the art, such as PCR sequencing. Such method include Maxam-Gilbert sequencing, Chain-termination methods, Shotgun sequencing, Bridge PCR, Massively parallel signature sequencing (MPSS), Polony sequencing, pyrosequencing, Illumina (Solexa) sequencing, SOLiD sequencing, Ion semiconductor sequencing, DNA nanoball sequencing, Heliscope single molecule sequencing, Single molecule real time (SMRT) sequencing, Nanopore DNA sequencing, Sequencing by hybridization, Sequencing with mass spectrometry, Microfluidic Sanger sequencing, Microscopy-based techniques, RNAP sequencing, In vitro virus high-throughput sequencing.
The inventive method may alternatively or in addition comprise the step of feeding transcript fragment sequencing data to a computational device capable of performing the inventive method, e.g. in case of using pre-prepared sequence data. Said computation device will then perform the inventive method to arrive at an abundance information with the feeded sequencing data.
The abundance information can be displayed on an output device, such as a video screen, a printer or a computer readable medium, e.g. as described further below.
Fitting in step f) can be performed by the expectation maximization algorithm. The expectation maximization algorithm has been described by Dempster et al., 1977. Of course any other maximization or minimization algorithm can be adapted to fit the inventive sum function to the envelope curve. In essence the weights alpha i and beta i,j—and optionally any further parameters which define the probability subfunctions—are modified to minimize the difference between the sum function (the model) and the coverage envelope curve (the real data based on the fragment sequences), e.g. by varying the parameters of the probability subfunctions. Many algorithms exist to make such a minimization or variations to find a minimum, especially after several iterations by repeating the calculation steps e) and f).
Such an additional parameter of a probability subfunction is e.g. the full width at half maximum value or, especially in case of gauss functions, the standard deviation value (sigma i,j). These are parameters of function width. Preferably the function width at any genetic coordinate separate from the maximum of the probability subfunction (e.g. preferably the full width at half maximum value or standard deviation) for each probability subfunction for a transcript i is about identical, preferably is identical. Alternatively the ratio of the width to the maximum value is about identical, preferably is identical, for each probability subfunction for a transcript i. This “ties” the function width similarly as described above for the betas between the different probability distribution functions of the transcripts.
In general, in preferred embodiments, at least 1, 2, 3, 4, 5, or 6 parameters of the probability subfunctions within the group of probability subfunctions of a transcript and/or within the group of probability subfunctions with the same identifier j for separate transcripts, are tied, e.g. modified in combination with each other, during the fitting step f).
The last step is convergence of the fitting. A convergence criterion can be set by an operator. It can be a maximum adjustment of any one of the parameters αi or βi,j or a combination thereof, e.g. all us or all βs, or all us and βs, or a maximum adjustment in the probability distribution function, e.g. as in equation (15) in the examples. For example, a convergence criterion can be if the log likelihood increase according to equation (21) is below 0.5. Any convergence criterion can be selected by an operator to achieve the desired quality.
In a preferment, the inventive method is further combined with other bias reduction methods. E.g. the obtained estimated transcript abundance of step e) is further weighted by a bias factor, thereby obtaining weighted estimated transcript abundance. This bias factor can take into account that certain transcripts are increased or decreased during wet-chemical transcript or fragment generation. Some fragments, for example, fragments starting with C may be overrepresented and are preferable weighted by a bias factor lower than 1 thereby decreasing the weighted estimated transcript abundance in comparison to the estimated transcript abundance of step e). The chemical reason why fragments starting with C are overrepresented lies in primer kinetics: primers anneal better with templates starting with C as compared to templates starting with G or T. Therefore, fragments starting with G and/or T may be weighted by a bias factor greater than the bias factor for fragments starting with C. In general, a skilled practitioner can obtain such bias factors for any parameter where a systematic bias exist, which may depend on any polynucleic acid property, and include an accordant bias factor an weigh the result of step 3) accordingly.
The invention further relates to a computer readable memory device comprising a computer program product for performing a method according to the invention on a computer or for supporting a method according to the invention by a computer, especially any one of steps a), b), b2) c), d), e), f) and g) may be performed on a computer. Any inventive method or step can be performed as computer-implemented method. Even the usually wet-chemistry step of determining sequence reads may be assisted by a computer, e.g. to control and obtain data from an automated or semi-automated sequence reader. The computer program product or memory device may also be provided with a read generation component that obtains short reads from a sample, such as a sequencer, preferably a sequencer comprising a computer component. For example, computer readable media can include but are not limited to magnetic storage devices (e.g., hard disk, floppy disk, magnetic strips, . . . ), optical disks (e.g., compact disk (CD), digital versatile disk (DVD), . . . ), smart cards, and flash memory devices (e.g., card, stick, key drive, . . . ).
The present invention is further illustrated by the following figures and examples, without being limited to these embodiments of the invention, with—each element being combinable with any other embodiment of the invention. Especially any one of the formulas provided below can be used separately, individually or in combination to describe or define a step in the inventive method.
In order to infer which mRNA molecules were present in the original sample, NGS reads are mapped onto a reference genome with known methods, such as the Burrows-Wheeler transform. For each read this gives a set of genetic coordinates, potentially including information about splice sites. The mapping process is visualized in
Genetic loci such as genes can contain a number of different, possibly overlapping, regions that each generate their own mRNA. Such regions are called transcripts. The histogram in the genetic locus h(locus)(r) is therefore a mixture of the histograms associated with each individual transcript in the locus, i.e.
where h(r|t=i) is the histogram associated with transcript t=i and i is the name of a transcript, for instance, Apoe.
This is the proportion of counts in the total histogram hlocus(r) that belong to transcript t=i. If hlocus(r) is, for instance, the histogram of the fragment start sites then αi is the percentage of fragments that were generated by transcript t=i. In this case, αi is called the abundance of transcript t=i and is directly correlated to the concentration of transcript t=i within the mRNA sample. In
The decomposition of the histogram hlocus(r) in equation (1) is usually not known and the derivation of the αi requires elaborate mathematical machinery. In order to make the problem of finding the decomposition (1) mathematically tractable, (1) is reformulated in a probabilistic framework, i.e.
where plocus(r) is the total probability of observing fragment r in the locus, p(r|t=i) is the probability of observing fragment r in the locus given transcript t=i and αi is the probability that transcript t=i is observed in the locus. If the probability distributions plocus(r) and p(r|t=i) are derived from the histograms by normalizing them to sum to one then (3) is a direct result of (1) and (2). In order to emphasize the fact that the methods developed in the following are applicable in a general probabilistic setting and not just to the field of next generation sequencing, the subscript “locus” will be replaced by “total”, hence
p
(locus)(r)=p(total)(r) (4)
In the estimation of the transcript probabilities αi the shape of the transcript distributions is of fundamental importance. Current methods such as Cufflinks (Trapnell et al., 2010), however, do not model these distributions accurately. Instead, for the fragment start site distribution Cufflinks uses a model that only depends on the distribution of the fragment lengths. The default model for the fragment lengths in Cufflinks is a Gaussian with mean 200 and a standard deviation of 80. Together with the other assumptions in Cufflinks this implies that for a read length of 200 bp's and a transcript of 3000 bp's the fragment start site distribution and the coverage of the transcript are modeled by the functions shown in
In comparison,
The distributions observed in a genetic location often exhibit visual clues as to which transcripts are contained in the genetic location and their probability. Consider, again, the example in
Since the p(r|t=i) are scaled and shifted versions of pavg(r), they are given by
p(r|t=i)=pavg(λir−νi) (5)
where Δi and νi are the scale and shift parameters of transcript t=i. This means that a transcript t=i only has two parameters λi and νi that are unique to the transcript. All other parameters, i.e. the parameters of pavg(r), are tied between the different p(r|t=i). This highlights a central idea of the method described in the following that structures that are common to different transcripts, such as ptotal(r), can be estimated by appropriately tying the parameters of the p(r|t=i). In the following the p(r|t=i) will be modeled by a mixture of functions and the decomposition of ptotal(r) in (3) is therefore a mixture of mixtures of functions. In order to increase readability this model will be referred to as the inventive model. The following sections give a general introduction to the inventive model, also called the Mix2 model, and discuss several of its variants. In addition, experiments show that the inventive model, the Mix2 model, can reliably estimate the probability distributions p(r|t=i) and produces estimates for the αi that are considerably more accurate than those of the Cufflinks model.
The genetic axis is the sequence of base pairs that have been sequenced for an organism, which usually starts at zero or one and can reach up to several hundred million base pairs long, depending on the complexity of the organism. In addition, the genetic axis is usually subdivided into chromosomes or contigs. The genetic axis is visualized at the top of
Points on the genetic axis that lie within an exon of a transcript can also be referenced within the coordinate system of this transcript. A transcript coordinate is the distance from the start of the transcript minus the length of the preceding introns and is therefore a number between 1 and l(t). Thus for a position P in exon ei of a transcript the transcript and genome coordinates Ptrans can be converted as follows,
Consider, for instance, the position 4500 in genetic coordinates in
In order to simplify notation, the “genome” and “trans” subscripts will be dropped in the following whenever convenient.
Taking the sum of the transcript probabilities in genome coordinates yields the total probability distribution ptotal(r) for positions in the entire genetic locus of interest, which is shown in
A fragment is a continuous sequence within a transcript.
Similar to a transcript, a fragment r therefore consists of a sequence of intervals on the genetic axis, r=(rint1, . . . , rintK), where rinti=[s(rinti),e(rinti)] is the i-th interval with start s(rinti) and end e(rinti). A fragment r is compatible with a transcript if its start and end are located within exons of the transcript and the gaps between adjacent intervals [e(rinti)+1,s(rinti+1)−1] are introns of the transcript, i.e.
s(int1)εexoni (9)
e(intK)εexonk (10)
for some i≦k, and
[e(intk)+1,s(intk+1)−1]ε{[e(exoni)+1,s(exoni+1)−1]:i=1, . . . ,N−1}∇k=1, . . . ,K−1 (11)
If a fragment r is compatible with a transcript it can be converted into the interval [T(s(rint1)),T(e(rintK)] in transcript coordinates. The gaps between intervals of the fragment are therefore removed in the transformation to transcript coordinates. As in the previous section, the transform of a fragment r to transcript coordinates will be denoted by T. Consider, for instance, the fragment ([2000,3000],[4000,4500]) and the three transcripts in
Probability distributions on fragments in transcript coordinates ptrans(rtrans|t=i) can therefore be converted into probability distributions on fragments in genomic coordinates as follows.
As before, the “genome” and “trans” subscripts will be dropped whenever convenient. Instead of using its start and end, an interval can also be represented by its start s(r) and length l(r). This allows the following factorizations of probability distributions in transcript coordinates.
p
trans(r|t=i)=ptrans(s(r)|t=i,l(r))ptrans(l(r)|t=i) (13)
and
p
trans(r|t=i)=ptrans(l(r)t=i,s(r))ptrans(s(r)|t=i) (14)
These factorizations are convenient for NGS data as the global distribution of the fragment lengths can usually be inferred from bioanalyzer traces of the library. The factorization in (13) is used by Cufflinks, which additionally assumes that a fragment of a given length can be placed anywhere on the transcript with equal probability. In contrast, in the experiments in section 4 the Mix2 model uses the factorization in (14) which allows a more efficient estimation of the transcript specific variability of the fragment start sites.
The model that is described in the following uses mixtures of mixtures of functions and will therefore be called the Mix2 model.
In the following, r can represent both a fragment and a position. However, for convenience, r will always be referred to as a fragment. The probability of observing a particular fragment r in the genetic locus ptotal(r) is a sum of the probabilities of observing the fragment for a transcript weighted by the probability that the transcript generates a fragment. Hence the ptotal(r) is given by the following mixture of probability distributions.
As described in section 2, if r is compatible with t=i then p(r|t=i)=trans(T(r)|t=i) and p(r|t=i)=0 otherwise. The method assumes that the probability distributions ptrans(r|t=i) are mixtures, i.e.
Here Mi is the number of mixture components and the ptrans(r|t=i,b=j) are probability distributions. The βi,j≧0 are weights which sum to one, i.e.
As before, the variable t is meant to represent a transcript, whereas the newly introduced variable b is meant to represent a “building block”, also referred to as probability subfunction in general. In this model it is possible to calculate the posterior probability that transcript t=i and block b=j were observed given a fragment r. This posterior probability is given by
Note that p(r|t=i,b=j)=0 and hence p(t=i,b=j|r)=0 if fragment r is incompatible with transcript t=i. The posterior probabilities for t and b are given by
The model assumes furthermore that some of the parameters of ptotal(r) are tied between different transcripts (probability distribution functions) and blocks (probability subfunctions). This can include both the mixture weights βi,j as well as any parameters Θ that determine the set of probability subfunctions of the probability distributions ptrans(r|t=i,b=j), i=1, . . . , N, j=1, . . . , M. The tying might be applied to all transcripts and blocks combined or within groups of transcripts and blocks. Since tying the parameters of probability distributions ptrans(r|t=i) implies certain similarities between the ptrans(r|t=i), parameter tying should only be applied to transcripts which exhibit the kind of similarity implied by the type of parameter tying. If, for instance, the parameter tying implies that the ptrans(r|t=i) are derived from a single pavg(r) by scaling and shifting, as has been suggested in the introduction, then the parameter tying should only be applied to transcripts for which the ptrans(r|t=i) are scaled and shifted versions of one another. For some library preps the latter is only valid for transcripts with lengths in a certain range. Transcripts whose lengths lie within different ranges should therefore not apply a parameter tying which requires the p(r|t=i) to be scaled and shifted versions of one another.
For a data set R of fragments r in a genetic region, the transcript specific distributions p(r|t=i) and the transcript probabilities αi of the Mix2model are estimated by maximizing the likelihood of the data set R under the model ptotal(r). The likelihood function of the parameters αi,βi,j,Θ is therefore given by
The maximization of (21) is a constrained non-linear optimization problem for which approximate solutions can be found with a number of different optimization methods. The optimization method used in the following is the expectation maximization (EM) algorithm. The EM algorithm is an iterative procedure that finds a local maximum of the likelihood function. In order to obtain a local maximum close to the global optimum it maybe necessary to try the EM algorithm several times with different initializations for the model parameters. The result that yields the highest likelihood is then chosen as the solution to the optimization problem. The EM algorithm can be used to estimate the αi without any assumptions about the form of the ptrans(r|t=i) or the tying of their parameters. The EM update formula for the which is also used by Cufflinks, is given by
Here |R| is the number of fragments in data set R, αi(n+1) is the estimate for αi after the n+1-th iteration and p(n)(t=i|r) is the posterior probability of observing transcript t=i given fragment r, estimated with the parameters from the n-th iteration. Since in this case the optimization problem is concave, it is not necessary to repeat the EM algorithm with different initializations. For concave problems there is a single global optimum to which the EM algorithm converges always. The model parameters are estimated by optimizing the fit of the model ptotal(r) to the observed data R. However, the ultimate goal is to find a good fit to the ptrans(r|t=i) and therefore a good estimate of the transcript probabilities αi. The optimization of ptotal(r) per se does not imply a good estimate of the αi. Only if the parameters of the ptrans(r|t=i) are tied appropriately will the optimization of ptotal(r) result in a good estimate of the transcript probabilities αi. The next section introduces several variants of the Mix2 model which use different tying strategies. Two of those variants will be studied in section 4 where they will be shown to yield considerably more accurate estimates for the αi than the Cufflinks model, for the case of scaled and shifted transcript probability distributions p(r|t=i).
The mathematical foundations of the Mix2 model in the last section are fairly general. This section discusses a number of concrete realizations of the Mix2 model and highlights their various advantages and limitations.
3.2.1 Tying within a Single Group of Parameters
The simplest Mix2 model discussed in this section ties only the weights βi,j for i=1, . . . , N between different transcripts. Thus the set of parameters of this model is {αi,βj: i=1, . . . , N, j=1, . . . , M} and n trans (r|t=i) is given by
Since this Mix2 model only ties parameters within a single group of parameters, namely the βj, this model will be called the 1tie-Mix2model in the following. The probability of observing a fragment r under the 1tie-Mix2 model is given by
In equation (24) the order of the sums can be interchanged which yields
As previously, summands in (26) are zero if fragment r is not compatible with transcript t=i. Since p(r|b=j) is a probability distribution equation (25) shows that, given the αi, the problem of estimating the βj is conceptually the same as the estimation of the αi. Thus the EM algorithm can be applied and gives the following update formula for βj
This implies that both the αi and the βj of the 1tie-Mix2 model can be trained simultaneously by iterative application of the formulas (22) and (27). In principle, the EM algorithm for the 1tie-Mix2 model can converge to different solutions depending on the initialization of the αi and βi since, in comparison to the Cufflinks model, the likelihood function of the 1tie-Mix2 model is not concave. In the experiments in Section 4, however, a single initialization of the αi and βi was sufficient in order to converge to a satisfactory result. The probability distributions ptrans(r|t=i,b=j) for the 1tie-Mix2 model have to be chosen to fit the data. Thus previous knowledge about the structure of the data is necessary. If the observed ptrans(r|t=i) are scaled versions of one another for different t=i, then the ptrans(r|t=i,b=j) should be scaled versions for different t=i as well. In addition, for the same b=j but different t=i the ptrans(r|t=i,b=j) have to lie within regions of the transcript coordinate system which are regulated by the same scaling factor βj. The latter can be achieved if the correct start and length of the transcripts are known. In this case the a)r trans (r|t=i,b=j) can be placed at equidistant positions along the transcript coordinate system of t=i. If the correct start and length are not known it is not immediately clear how to position the ptrans(r|t=i,b=j). The model in the next section avoids this restriction by learning the transcript start and length from the data.
3.2.2 Tying within 5 Groups of Parameters
The model in the last section relies on the knowledge of the correct start sites and lengths of the transcripts. This section discusses a model that positions and scales the D trans (r|t=i,b=j) automatically. Since this model ties parameters within 5 groups of parameters it will be called the 5tie-Mix2 model in the following. The 5tie-Mix2 model uses Gaussians for the distributions ptrans(r|t=i,b=j), whose internal parameters, namely their means and standard deviations αi,j are derived from the set of parameters Θ={: i=1, . . . , N, j=1, . . . , M}. In particular the μi,j and σi,j are given by
μi,j=+ (28)
σi,j= (29)
The and are tied between different transcripts, hence =μj and =σj, whereas the and are tied between different blocks, hence =λi and =νi. As a result equations (28) and (29) reduce to
μi,j=λiμj+νi (30)
σi,j=λiσj (31)
The pdf for transcript t=i and block b=j are therefore given by
As before, the βi,j are tied for block j between the transcripts i=1, . . . , N and hence) βi,j=βj. The probability distribution Σj=1Mβjgμ
where the union in the sum of the denominator in (33) extends over all the transcripts in the locus. The αi can then be derived from the αi,as follows
Equation (35) amounts to a rescaling of the continuous model weights alpha(i,R). The effect of the rescaling is not very pronounced for slowly varying probability distributions ptrans(r|t=i) and for genetic loci in which all the transcripts have a sensible length. Thus in the experiments described later the effect of the rescaling formula is negligible. In situations where the locus and the ptrans(r|t=i) conform to these requirements the alpha(i,R) can therefore be directly used as the transcript probabilities αi, which will be done in Section 4. The λi are scaling parameters that account for the different lengths of the transcripts, whereas the νi account for the different relative offsets of the transcript start sites from the transcript annotation. Since these two parameters can correct for incorrect start sites and lengths in the transcript annotation it is possible to remove the definition of start and end all together from the transcript annotations. For instance, the start sites of all the transcripts can be set to the same value, which is smaller than all the start sites of the transcripts in the locus. Likewise, the end sites of all the transcripts can be set to the same value, which is larger than all the end sites of the transcripts in the locus. This extension of the transcript annotations is illustrated in
The parameters of the model in this section can be efficiently trained with the EM algorithm. The EM update formulas for the means, variances and offset parameters are given below.
Calculating the derivative of the auxiliary function of the EM algorithm with respect to the scale parameter λi leads to the following quadratic equation.
In most practical circumstances equation (39) has one positive and one negative solution. Since λi is a scale parameter it has to be positive and therefore only the positive solution of (39) is of interest. The experiments in section 4 show that the 5tie-Mix2 model can indeed correct imprecise transcript annotations and yields in this case much more accurate estimates for the αi than models which rely on correct transcript annotations.
3.2.3 Tying within 2 Groups of Parameters
As the previous models, the model in this section ties the βi,j across transcripts for block j and therefore βi,j=βj. In addition the model in this section ties the from the 5tie-Mix2 model but not across transcripts for a given block but more generally across sets of pairs of (i,j), i=1, . . . , N, j=1, . . . , M. For instance, if N≧2, M≧3 then a possible set is L={(1,1), (1,2), (2,3)} and hence ===σL. Since the model in this section only ties parameters within two groups it will be called the 2tie-Mix2 model. Thus for the 2tie-Mix2 model
σi,j=λiσL(i,j)εL (40)
The motivation for the 2tie-Mix2 model is the following. Consider the situation depicted in
The EM update formula for the other parameters of the model is the same as in the previous section. In the situation in
3.2.4 Tying within 6 Groups of Parameters
A straight-forward extension to the 5tie-Mix2 model is given by the introduction of another parameter which scales the . In particular
σi,j= (42)
Thus, in comparison to the 5tie-Mix2 model, the model in this section scales and independently and since parameters are tied within 6 groups of parameters it is called the 6tie-Mix2 model. For the 6tie-Mix2 model the are tied for the same transcript between different blocks, hence =κi. In the 6tie-Mix2 model the λi and κi play a role similar to the means μj and standard deviations σj and can be trained with the following formulas, which have a similar structure as equations (36) and (37)
The advantage of the 6tie-Mix2 model over the 5tie-Mix2 model is that it can model differences in the smoothness between different transcripts. However, the increased number of parameters can also lead to bad model convergence and should therefore only be used if the NGS data exhibits the variance in smoothness implied by the 6tie-Mix2 model.
In the previous sections the means μi,j and standard deviations σi,j of the Gaussians ptrans(r|t=i,b=j) were derived from the μj and σj via the transcript specific affine linear transformations of λ given by equations (30) and (31). This section generalizes this concept and allows the μi,j and σi,j to be generated by polynomials in λ, i.e.
The EM update formula for μj(k0) is given by the following expression.
The EM update formulas for the remaining model parameters lead to non-linear equations which can themselves only be solved by iterative procedures.
In the experiments in this section, the probability ptrans(r|t=i,b=j) is factorized as follows
p
trans(r|t=i,b=j)=ptrans(s(r)|t=i,b=j)ptrans(l(r)t=i,b=j,s(r)) (48)
where s(r) and l(r) are the start and length of fragment r. In addition, the probability ptrans(l(r)|t=i,b=j, s(r)) is assumed to depend only on s(r) and the length of the transcript l(t). Therefore (48) reduces to
p
trans(r|t=i,b=j)ptrans(s(r)|t=i,b=j)ptrans(l(t),s(r)) (49)
As a consequence, the mixture model (16) is a mixture model for the start site distribution ptrans(s(r)|t=i) multiplied by the probability distribution for the fragment length, i.e.
Thus, the convergence of the model to the correct ptrans(r|t=i) can be evaluated by checking the convergence of the distributions Σj=1M
Two types of experiments are discussed in this section. The first type is a detailed analysis of the convergence of the Mix2 model for a set of 10000 fragments where the fragment start sites were randomly drawn from a superposition of the p(r|t=i) in
The second type of experiment in this section repeats the first type for 60 different sets of weights α1, α2, α3 and compares the α1, α2, α3 estimated by the Mix2 model to the true weights. The former experiments provide a detailed insight into the convergence of the Mix2 models and their parameters, while the latter experiments focus on the overall accuracy of the weights α1, α2, α3 estimated by the Mix2 model. The set of 60 weights where chosen by setting α3 to each of the values 0.2, 0.4, 0.6, 0.8. For each value of α3, 15 different values of αi were chosen in equidistant intervals between 0 and 1−α3 and α2 was set to α2=1−α1-α3. For each set of weights the Kullback-Leibler (KL) divergence was calculated between the true and the estimated weights and the average of the KL divergence over all sets of weights was used as a measure to quantify the accuracy of Cufflinks and the Mix2 model.
4.1.1 Convergence of the 1Tie-Mix2 Model for α1=0.28, α2=0.32, α3=0.4
As mentioned before, the data used to estimate the models in this section was generated by sampling the fragment start sites from a superposition of the p(r|t=i) in
and progressing in steps of
where both numbers were rounded to the nearest integer. Thus for transcript 2 with a length of 2702 bp's, for instance, this results in means at 169, 507, 845, 1183, 1521, 1859, 2197, 2535. The standard deviation of each Gaussian was set to be equal to the first mean. Hence for transcript 2 the standard deviation of each Gaussian was 169. The initial βj in the model were set to ⅛. The resulting initialization of p(r|t=i) for transcript 2 together with the weighted blocks βjp(s(r)|t=i,b=j) is depicted in
The initial abundances in the model were set to α1=α2=α3=⅓. The EM algorithm was applied until either the difference between the αi and βj between subsequent iterations was below 0.001 or the increase in the log likelihood between subsequent iterations was below 0.5. With these termination conditions the EM algorithm converged after 20 iterations.
Finally,
This section discusses experiments with the 60 different sets of weights α1, α2, α3, which were chosen according to the procedure described at the beginning of section 4. Thus α3 takes the values 0.2, 0.4, 0.6, 0.8 and αi and α2 are distributed in equidistant intervals between 0 and 1-α3. For each set of weights p(r|t=i) was initialized for transcript 2 as in
As can be seen from Table 3, this leads overall to an average KL divergence of 0.12368 for the Cufflinks model and an average KL divergence of 3.6369e-04 for the 1tie-Mix2 model. Thus, in terms of the average KL divergence the accuracy of the 1tie-Mix2 model exceeds that of the Cufflinks model by 3 orders of magnitude.
The model in this section estimates transcript specific offset and scaling parameters νi and λi and does therefore not rely on the correct transcript annotation. Thus, in order to demonstrate the potential of this model it is trained with three incorrect transcript annotations. These annotations can be seen in
For the other models studied in this section, i.e. the Cufflinks and 1tie-Mix2 models, which cannot correct the incorrect annotation one can expect that the estimated α2 is too high, since transcript two gains more fragments by its increased length than it loses due to its shifted start. Similarly, it can be expected that αi is more strongly underestimated than α3 since transcript one loses more fragments at its end than transcript three at its start.
4.2.1 Convergence of the 5Tie-Mix2 Model for α1=0.28, α2=0.32, α3=0.4
The correction of the incorrect transcript annotation during the EM algorithm can also be seen in
4.2.2 Comparison Between the 5Tie-Mix2, 1-Tie-Mix2 and Cufflinks Models on 60 Sets of Weights
This section discusses experiments with the 60 different sets of weights αi, α2, α3, and the 5tie-Mix2, 1tie-Mix2 and Cufflinks models. The incorrect transcript annotations in
In
4.2.3 Comparison with the 1par-Mix2Model
Since the 1tie-Mix2 model was much better than the Cufflinks model in the experiments with correct transcript annotation in Section 4.1.2, it is worth to compare the 5tie-Mix2 model and the 1tie-Mix2 model both with the incorrect transcript annotation.
The experiments in the last section compared the 1tie-Mix2 and the 5tie-Mix2 model with the Cufflinks model both with correct and incorrect transcript annotations. The results of the experiments on the set of 60 parameter sets αi, α2, α3 are summarized in Table 9. These results show that with a correct transcript annotation the 1tie-Mix2 model is far superior to the Cufflinks model, while with an incorrect transcript annotation the 5tie-Mix2 model outperforms both the 1tie-Mix2 and the Cufflinks model. This implies that the use of the appropriate Mix2 model improves the accuracy of the abundance estimates considerably.
Cufflinks is a program that implements methods for transcript assembly and transcript abundance estimation. A detailed description of the implementation can be found in Trapnell et al., 2010. In Cufflinks, transcript abundances, which correspond to the αi according to the invention, are estimated via the factorization in equation (13), where ptrans(s(r)|t=i,l(r)) is uniform and ptrans(l(r)|t=i) is independent of transcript t=i. In contrast to the Mix2 model, Cufflinks does not learn the distribution of the fragment start sites s(r) from the data and depends furthermore on the availability of a correct transcript annotation.
In Roberts et al., 2011, which implements an extension to the Cufflinks abundance estimation, ptrans(r|t=i) is factorized as in equation (13) and ptrans(s(r)|t=i,l(r)) is a normalized product of sequence specific and positional weights, which are defined for each position in transcript coordinates. This results in a large number of parameters that need to be trained even for a single transcript. In order to make the model in the experiments in Roberts et al., 2011 computationally tractable the positional weights have to be restricted to step functions only 5 of which are estimated and shared between all transcripts. In contrast to the model in Roberts et al., 2011, the inventive model does not estimate all the probabilities ptrans(s(r)|t=i) individually, but the parameters of the mixtures (16). As a result, far fewer parameters need to be estimated for the inventive model and its parameter estimation is therefore more robust and computationally tractable than that of the model in Roberts et al., 2011. In addition, the model in (Roberts et al., 2011) requires accurate starting estimates for the transcript abundances and its weights are therefore estimated on single isoform genes. Hence the model in Roberts et al., 2011 is not applicable to the data used in the experiments shown herein. The inventive model, on the other hand, does not require accurate starting values for the transcript abundances and can therefore be trained on multi-isoform genes. In addition, the model in Roberts et al., 2011 requires a correct transcript annotation unlike the inventive model.
In Wu et al., 2011 the bias in the distribution of fragments is modeled for the exons of a gene by assigning a single weight to each exon and transcript. In contrast to the inventive model, this restricts the ptrans(s(r)|t=i) to be constant on each exon. In addition, the weights in Wu et al., 2011 are estimated partly by heuristic methods outside a probabilistic framework. The work in Wu et al., 2011 further differs from the inventive model in that it uses Poisson distributions for the probability of read counts on exons, whereas the inventive model uses a mixture model for the probability of a fragment given a transcript. In addition, the work in Wu et al., 2011 relies on correct transcript annotations.
The work in Li et al., 2010 corrects for a sequence specific bias and is therefore similar to the sequence specific weights in Roberts et al., 2011. It also resembles the work in Wu et al., 2011 in its usage of Poisson distributions to model count data. In contrast to Wu et al., 2011, the count data in Li et al., 2010 are not exon based but is given by the coverage. The work in Li et al., 2010 differs from the inventive model and the models in Roberts et al., 2011 and Wu et al., 2011 in that it does not estimate the transcript abundances.
In Glaus et al., 2012 a Bayesian model is used for the probability of observing reads generated by a transcript. This model uses the same sequence specific and positional weights as Roberts et al., 2011 and a Gibbs sampling procedure to estimate the transcript abundances. Thus, in comparison to the inventive model, the parameter estimation in Glaus et al., 2012 is performed within a Bayesian rather than a maximum likelihood framework. The model in Glaus et al., 2012 also differs from the inventive model in that it does not use any mixtures for the p(r|t=i) and requires correct transcript annotations.
The models in Li et al., 2010 & 2011 are Bayesian models with a structure similar to the model in Glaus et al., 2012. In contrast to Glaus et al., 2012, however, the transcript abundances in Li et al., 2010 & 2011 are trained within a maximum likelihood framework, in particular with the EM algorithm. The remaining model parameters, such as p(s(r)|t=i), are either derived in advance or are estimated with heuristic methods. This is in contrast to the inventive model which estimates all its parameters within a maximum likelihood framework. In addition, the models in Li et al., 2010 & 2011 do not use mixtures to model the ptrans(r|t=i) and require correct transcript annotations.
According to the invention the expectation maximization algorithm was used to estimate the inventive model. The general framework of the expectation maximization algorithm was developed in Dempster et al., 1977, while the concrete EM update formulas for the inventive model were derived as described herein.
It was mentioned in the paragraph following equation (20) that tying the parameters of the Mix2 model implies a certain similarity between the p(r|t=i) for different transcripts t=i. Therefore, parameters of the Mix2 model should only be tied between transcripts t=i, which exhibit this similarity. If transcripts are separated into different groups such that only the transcripts within a group share their parameters, then the EM update formulas of the Mix2 model (27), (36) and (37) need to be modified. In the following, each transcript t=i has an associated group g=k which can be retrieved via the function G(i)=k. Then the EM update formulas of the parameters βk,j,μk,j and σk,j within group g=k are given as follows.
Using the Lagrange method to enforce the constraint
and taking the derivative with respect to βk,j leads to
which after some rearrangement results in
For the remaining parameters which are tied between transcripts, i.e. μk,j and σk,j, the update formulas (36) and (37) need to be modified by replacing the sum over the complete set of transcripts t=1, . . . , N with the sum over the transcripts within the group g=k, i.e.
Placing each transcript in a genomic locus into its own group removes all parameter tying and leads, as mentioned before, to inaccurate estimates for fragment distributions and transcript abundances. When placing transcripts into different groups it is therefore important to strike the right balance between separation of dissimilar fragment distributions p(r|t=i) and retaining the tying of a sufficient number of parameters to ensure the stability of the Mix2 model. A sensible requirement, for instance, is that each group contains at least two transcripts. Alternatively, in the presence of three or more groups it is sensible to require that at most one group contains a single transcript.
In the experiments discussed in section 4 the transcript specific fragment probability ptrans(r|t=i,b=j) was factorized as in (48) and the probability distribution of the fragment lengths ptrans(l(r)|t=i,b=j,s(r)) was assumed to depend only on the fragment start s(r) and the length l(t=i) of transcript t=i. In addition, ptrans(l(r)|l(t=i),s(r)) was assumed to be given. These assumptions are unnecessary if ptrans(l(r)|t=i,b=j,s(r)) is estimated from the data set R, which can be done within the framework of the Mix2 model. For this purpose, as for the distributions of the fragment starts s(r), ptrans(l(r)|t=i,b=j,s(r)) is written as a mixture, i.e.
where bs=j is the hidden variable that was previously denoted by b=j. Here bs is a mnemonic for “building block of fragment start”, while the hidden variable bl is a mnemonic for “building block of fragment length”. It is sensible to assume that ptrans(l(r)|t=i,bs=j,s(r)) is independent of bs=j and thus (57) reduces to
Combining (15), (48) and (58) yields the following expression for the probability of a fragment
where the following holds
Hence (59) is a Mix2 model where the mixture weights of ptrans(r|t=i) are the product of βj and γk and the mixture components of ptrans(r|t=i) are the product of the conditional probability distributions in (60).
In addition to (15) and (25), the following can be derived from (59)
And as a consequence, similar to (22) and (27), γk can be estimated with the EM algorithm as follows
For the fragment length building blocks ptrans(l(r)|t=i,bl=k,s(r)) one can, for instance, use Gaussians whose means are equidistantly distributed either between s(r) and l(t=i) or 1 and l(t=i) and which are either normalized within a discrete or a continuous one-dimensional probability space. If distributions on a continuous one-dimensional probability space are chosen then their internal parameters such as means, standard deviations, shift and scale parameters can be estimated similarly to these parameters for the probability distributions of the fragment starts ptrans(s(r)|t=i).
The 5tie-Mix2 model estimates the shift and scale parameters νi and λi. If νi is set to 0 and is not updated during the EM algorithm, then the start of transcript t=i remains unchanged. If the end point of a transcript is to be fixed, the 5tie-Mix2 model needs to be modified slightly.
If Li is the length of transcript t=i according to the annotation then for a fixed end point of transcript t=i the following holds
This yields the following EM update formula for νi
From which λi(n+1) can be derived according to (65), i.e.
The Mix2 parameter tying discussed in the experiments in Section 4 implements a model for the positional fragmentation bias, i.e. the bias related to the fragment start within a transcript. Other kinds of bias, like the sequence specific bias, might be described with other models such as the variable length hidden Markov model (VLMM) in [6]. Typically, models for non-positional bias compare the observed frequency of nucleotide sequences to their frequency under the null-hypothesis of unbiased data. This comparison can be used to derive a probability distribution p(m=c|r) over the multiplicity m=c of fragment r in the absence of any bias, given that a single copy of r is observed in the biased data. The distribution p(m=c|r) can then be used to computationally remove the non-positional bias from the data by weighting each fragment r in the EM update formulas of the Mix2 model by the expected multiplicity of r. For the EM update formula of the abundances αi, for instance, this leads to
It should be noted that the distribution p(m=c|r) cannot be estimated by maximizing the likelihood of the data set R under the Mix2 model, as this would lead to infinite expected values of p(m=c|r). For the combination of the Mix2 model with other bias models, as discussed in this section, it is therefore important to estimate p(m=c|r) outside the maximum likelihood framework of the Mix2 model.
The concentration of a transcript in RNA-Seq is usually measured with the FPKM (fragments per kilo base pair per million reads) or RPKM (reads per kilo base pair per million reads) measures, where for transcript t=i the latter is given by
For the FPKM measure the length l(t=i) of transcript t=i in (69) is replaced by an adjusted transcript length {tilde over (l)}(t=i) [7]. Thus, in order to obtain accurate FPKM and RPKM values it is important to use the accurate transcript length l(t=i) in (69). The 5tie-Mix2 model can correct inaccurate transcript annotations and therefore yields transcript lengths of higher precision, which results in an improved accuracy of the RPKM and FPKM values. Different methods can be employed to obtain an estimate for the transcript length based on the 5tie-Mix2 model. The transcript start can, for instance, be estimated as the position x where
∫−∞xp(s(r)|t=i)ds(r)=cs (70)
with 0<cs<1. Typically, cs, will be a small positive value such as cs=0.05. Likewise, the transcript end can be estimated as the position where (70) holds with cs, replaced by a value ce with 0<ce<1, which will be close to 1, e.g. ce=0.95. Similarly, (70), with cs, and ce, can be applied to any mixture component p(r|t=i,b=j). In this case, the mixture component used to determine the transcript start will, typically, be concentrated in a region of small s(r), whereas the mixture component used to determine the transcript end will be concentrated in a region of large s(r). Another method to obtain an estimate for the length of transcript t=i from the 5tie-Mix2 model is to multiply the length of the transcript annotation by the λi after convergence of the Mix2 model.
Number | Date | Country | Kind |
---|---|---|---|
13175774.2 | Jul 2013 | EP | regional |
14170767.9 | Jun 2014 | EP | regional |
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/EP2014/064310 | 7/4/2014 | WO | 00 |