The present invention relates to providing information of transcript (e.g. mRNA) copy numbers based on fragmentation based sequencing methods, such as next generation sequencing (NGS).
Next generation sequencing technology produces a large amount of short reads when sequencing a nucleic acid sample. An essential step in next generation sequencing is the library preparation or library prep for short. This process takes mRNA or cDNA as input and produces a library of short cDNA fragments, each corresponding to a section of an mRNA molecule. These fragments are then sequenced by an NGS sequencer, usually not in their entirety but partially at their start and/or at their end. This results in short sequences of nucleotides which are called reads and are most commonly stored by the NGS sequencer as sequences of a group of four ASCII characters such as A, C, G, T or 0, 1, 2, 3, representing the nucleobases of the genetic code. In order to infer which mRNA molecules were present in the original sample, the reads are mapped onto a reference genome.
Next generation sequencing has been employed in a variety of genome mapping procedures (US 2013/110410 A1) or DNA identification methods, e.g. by using a mapped genome to associate sequence reads to a certain organism variant (WO 2009/085412 A1).
WO 2009/091798 A1 describes a method for obtaining a profile of a transcriptome of an organism, the method comprising: sequencing one or more cDNA molecules to obtain sequencing reads; aligning each of the sequencing reads to a reference sequence.
The estimate of the number of copies of a transcript based on the fragment sequence synthesis (FSS) is a within-sample normalization method since it normalizes the read counts of different transcripts from a single sample. Within-sample normalization is one aspect of the FPKM [5] and length adjusted FPKM [9] values, which they achieve through dividing the number of reads assigned to a transcript by the transcript length and adjusted transcript length. Both FPKM and length adjusted FPKM values cannot correct for variations in the distribution of fragments between different transcripts. Furthermore, FPKM and adjusted length FPKM values are no estimates of the number of copies of a transcript but are abstract measures correlated to the latter. Other normalization methods in the literature such as [1, 2, 8] are between sample normalization methods which account for the technical and/or biological variability in the number of reads generated by a transcript or gene.
Despite improvements in read assembly, methods to determine transcript copy quantities based on fragmentation data are still inaccurate and further improvements are needed. It is a goal of the present invention to provide improved methods that allow a more accurate assessment of transcript abundances.
The present invention provides a method for determining an amount of nucleic acid molecule copies of a preselected nucleic acid molecule in a sample, e.g. in a fragmentation based method, comprising the steps of
The invention further provides a computer program product employing said method, e.g. containing machine code to perform or assist in said methods and steps on a computer. The computer program product can be provided on any kind of memory device. Also provided is a system, e.g. a computer device, programmed to assist in performing the steps of the inventive method. Calculation steps are usually performed without assistance of an operator. Input and setting steps may be assisted by the program or system, e.g. by suggesting an option proposal for the number random step repetitions, if desired. Of course, the program or system may also be performed with default parameters without further input from an operator.
The following detailed description and preferred embodiments apply to all aspects of the invention and can be combined with each other without restriction, except were explicitly indicated. Preferred embodiments and aspects are defined in the claims.
The present invention is based on a new method that allows estimation of nucleic acid molecule copy amounts in a sample. The nucleic acid molecule copies are substantially identical nucleic acid molecules, having the same nucleic acid sequence. E.g., the nucleic acid molecule can be selected from a DNA, e.g. cDNA, or RNA molecule, preferably a transcript, especially an mRNA. In this case, the present application also refers to transcript copy numbers, however, the concept is described for transcript copies but also applies for any other nucleic acid molecule and such disclosure shall not be considered to be limited to analysing transcript copies. The amount, e.g. a quantity, of the preselected nucleic acid molecule may also be referred to herein as “copy number”.
The invention can determine the amount or concentration of a given nucleic acid molecule and its copies in a sample. A sample may comprise also other nucleic acid molecules with other sequences. However, the present invention is suitable to distinguish between such nucleic acids and the nucleic acid molecule to be the subject of the present method, which is hence referred to as “preselected” nucleic acid molecule. Usually the preselected nucleic acid molecule has a known nucleic acid sequence, e.g. known from deposited database sequence data, e.g. at the NCBI or EBI.
Any nucleic acid molecule can be used with any length, which may be fragmented. Preferably, the preselected nucleic acid molecule has 500 to 10000000 nucleotides in length, preferably 1000 to 1000000 nucleotides in length, even more preferred 5000 to 500000 nucleotides in length.
The present invention is based on the analysis of nucleic acid fragment data and is usually used in the context of a fragmentation based sequencing method, such as next generation sequencing. It is possible to supplement such methods with the inventive method to determine nucleic acid copy numbers (i.e. amounts). It is noteworthy, that the present invention is not dependent on fragment sequence data and can work with or without sequence information of the fragments. E.g. the virtual fragments may not be defined by a nucleic acid sequence but simply by the indication of the start and end position on the genetic coordinate (or alternatively, the start or end position and the length in nucleotides, which is equivalent). Essentially, during the inventive method steps d) to g), the nucleic acid sequence information of the virtual fragments is not needed anymore and may not be used during the calculation in any one of these steps, especially in step d1) and/or e). Of course, alternatively, if the nucleic acid sequence of a fragment is known, by using the reference sequence of the preselected nucleic acid of interest it is also possible to calculate the position information for such a fragment and vice-versa to determine the nucleic acid sequence of any virtual fragment for which the position data is known. Matching nucleotides according to step e) can be simply determined with the position information (i.e. start and end position) of the virtual fragments.
Instead (or in addition), the present invention uses the fragmentation profile of the (preselected) nucleic acid molecule when determining the copy number. To this goal, in most embodiments the present invention uses statistical methods, which may result in a probability for copy numbers, which may be a likelihood distribution of copy numbers.
Fragments can be generated by any known means. Preferred is a random fragmentation, such as a sequence non-specific fragmentation, although some bias for particular nucleotide sections with preferred (but not exclusive) cutting sites may exist, such as nucleic acid molecule portions with reduced stability that are easily cut. Random fragmentation is to be distinguished from specific fragmentation, e.g. by using restriction enzymes, which cut exclusively at defined restriction sites. Preferably the fragmentation is by physical means, in particular preferred by sonication, shearing or elevated temperature. Another option to generate fragments is to amplify partial copies with partial sequences of the preselected nucleic acid molecule. Random fragments can be generated in this way by e.g. using random primers, such as random hexamers. Sequence dependent fragmenting can be controlled by selecting template portion specific primers. In another embodiment only partial sequences are generated during sequencing, e.g. by random priming during sequencing. This approach will simultaneously result in fragment nucleic acid sequences provided in step d2). Thus partial sequences of fragments can be generated, which is sufficient for the methods of the present invention. Amplification of fragments and fragment sequencing may result in fragments in other concentrations as would have occurred by fragmenting the preselected nucleic acid molecule itself, however any biases created in this way are conveniently cancelled during the inventive method, e.g. in step e) since the assembly of fragments into fragment sequences used in this step automatically results in a reduction of the bias.
The fragments of step a) and/or the virtual fragments of step d) may have an average size of 40 to 2000 nucleotides in length, preferably 60 to 1000 nucleotides in length, especially preferred 80 to 800 nucleotides in length. The fragment sizes for step a) (fragmenting) and d) (a step in the analysis) may be selected dependent on the respective other step or independently. During step a) fragmentation sizes may be controlled by e.g. the number of different restriction enzymes, the temperature, sonification volume, shear stress and /or the length of treatment, especially in case of physical means.
According to step b) the amount of fragments is determined. The amount can be represented by an absolute count of fragments, a concentration or a molar quantity, in absolutes or relative to other (e.g. the total) nucleic acids in the sample. The reference to a volume or other nucleic acids is essentially non-consequential when considered in further calculations. According to the invention it is usually most convenient to operate with a fragment quantity or a fraction thereof, which then of course needs to be considered and multiplied in the final copy number determination.
The fragmentation profile is an estimate for the distribution of the fragment location over the length of the preselected nucleic acid molecule. Locations over the length of a nucleic acid molecule are represented as genetic coordinates, which for mRNA transcripts extend from 1 to the number of the last (3′) nucleotide. The fragmentation profile is then a distribution of fragment coverage over a genetic coordinate for said preselected nucleic acid molecule. This distribution or profile is not required to be an exact representation of each and every fragment of the preselected nucleic acid molecule that provides an estimate of differences in the fragment coverage over the genetic coordinate. It is possible that fragment coverage is uniform over the entire length of the nucleic acid molecule, but it is also possible that particular portions of the nucleic acid molecule are over- or under-represented by fragments, e.g. in certain profiles the 3′ end and/or 5′ end are over-represented in comparison to the middle portion. If not considered, such different profiles may distort a correct estimate of the copy numbers. Fragment coverage may be represented by a coverage envelope curve (also referred to as total coverage histogram or histogram of total) to the individual fragment coverage values.
To calculate the fragment coverage it is not required to consider each and every nucleotide for each and every fragment. It is sufficient to consider e.g. fragment start site distributions or a distribution of at least one nucleotide of each fragment at a preselected position from a fragment 3′ or 5′ terminus on the genetic coordinate. The “at least one nucleotide” can e.g. be the 3′ terminal nucleotide (start site) or any other nucleotide, e.g. the nucleotide at position 1, 2, 3, 4, 5, 6, 7, 8, 9 or 10 or any combination thereof from either the 3′ end and/or 5′ end of the fragment. The distribution may of course include all nucleotides of the fragments. This (these) nucleotide(s) should be the same position(s) for all fragments under consideration.
Several methods exist to determine the distribution of fragment coverage over a genetic coordinate. One method is disclosed in European patent application EP 13175774.2 (incorporated herein by reference). The method may comprise assigning fragment sequencing data to genetic coordinates of a preselected nucleic acid molecule thereby obtaining a data set of fragment genetic coordinate coverage. A fragment distribution can be modeled by e.g. pre-setting a probability distribution function of modeled genetic coverage for a nucleic acid molecule, wherein said probability distribution function is defined by a weight factor αi of said nucleic acid molecule multiplied with the sum of at least 2 (e.g. 2, 3, 4, 5, 6, 7, 8 or more) probability subfunctions j, with j denoting the numerical identifier for a probability subfunction, each probability subfunction j being independently weighted by a weight factor βj. The probability subfunctions can e.g. be simple functions like an aperiodic function, preferably a Gaussian function, a square shape function, a triangle shape function, especially preferred a Gaussian function. In general, the probability subfunction is constituted of positive values for each genetic coordinate. Preferably it is a density function. The sum of such modeled subfunctions can be the distribution of fragment coverage as used according to the inventive method.
As is apparent in view of the suitability of such modeled functions, the inventive fragment coverage may also be an approximation or local average, deviating from the exact count of fragment coverage. Such an approach may e.g. be an average number over 2 to 600, preferably 10 to 500, more preferred 20 to 400, even more preferred 30 to 300, or 40 to 200, or 50 to 100, adjacent nucleic acid nucleotides of the preselected nucleic acid molecule.
It is also possible to determine the distribution of fragment coverage over a genetic coordinate by using a reference nucleic acid. The reference nucleic acid is fragmented, the coverage of fragments over a genetic coordinate of said reference nucleic acid is determined, thereby obtaining a reference coverage. If the reference molecule differs in length to the preselected nucleic acid molecule, it is possible to fit, e.g. stretch or shrink, the reference coverage to the genetic coordinate of the preselected nucleic acid molecule. Reference nucleic acid molecules can e.g. be a reference, which fragments do not align with human or mammalian nucleic acid molecules. Such non-aligning reference nucleic acid molecules have been made available by the ERCC (Qu et al., Journal of Biomolecular Techniques 2011, 22, supplement, S46) and are provided by Lifetechnologies as ERCC spikes (Lifetechnologies, USA, Catalog no. 4456740, 4456739). The reference nucleic acid should resemble the fragmentation behavior of the preselected nucleic acid molecule and have e.g. similar length, such as a length with less than 10 times, less than 8 times, less than 6 times, less than 4 times or less than 2 times differences in length. Preferably the reference nucleic acid has a similar GC-content as the preselected nucleic acid molecule, such as a GC content which differs by less than 20%, less than 50%, less than 10%, less than 8%, less than 6% in the GC-content. The reference nucleic acid may be present in the same sample, as the preselected nucleic acid molecule, or it may be in a different sample. If in a different sample, the different sample is preferably handled similarly as the sample with the preselected nucleic acid molecule, e.g. by using the same experimental protocol, to ensure a similar fragmentation condition. The reference nucleic acid molecule may be labeled so as to distinguish it from other nucleic acids in the sample. Such labels may include radio labels or fluorescence labels or preferably nucleic acid barcodes, which may be transferred to the fragments as disclosed in WO 2011/070155.
A further possibility to provide a distribution of fragment coverage over the genetic coordinate for the preselected nucleic acid molecule comprises isolating the preselected nucleic acid molecule from the sample, generating fragments and analysing the fragment distribution in view of the sequence of the preselected nucleic acid molecule. Instead of isolating the preselected nucleic acid molecule it is also possible to label the preselected nucleic acid molecule so that the fragments can be distinguished and analysed separately from other nucleic acids or fragments potentially present in the sample.
With the distribution of fragment coverage determined for the preselected nucleic acid molecule as described above, it is possible to artificially generate virtual fragments of said preselected nucleic acid molecule. This is a conceptual step (which is indicated by “virtual”) that is best performed on a computer. A virtual fragment has the information to define a fragment of the preselected nucleic acid molecule. This information can be simply a start and end position of the fragment on the genetic coordinate of said preselected nucleic acid molecule. It is possible, but not necessary to use the nucleic acid sequence of such as fragment. With the knowledge of the nucleic acid sequence of the preselected nucleic acid molecule it is possible to obtain the nucleic acid sequence from the position data and vice versa. The virtual fragments shall have a fragment distribution resembling the distribution of fragment coverage of the preselected nucleic acid molecule. I.e. the distribution of fragment coverage of the virtual fragments of step d1) and of the fragments of step c) is substantially the same. The distribution of fragment coverage of the virtual fragments can be generated by selecting suitable fragments, and controlled by again generating such a distribution by the same methods as described above for the fragments of the preselected nucleic acid molecule. The generation of virtual fragments is essentially a random process. At a random position in the genomic coordinates of the preselected nucleic acid molecule a fragment with random length is extracted. The length of the virtual fragments preferably resembles the length distribution of fragments obtained by the used fragmenting method employed in step a). Random virtual fragment generation does not mean that the parameter of a fragment are purely by chance, but usually the random generation of position parameters (such as start position and length or end position) may be weighted by the fragment distribution obtained in step c), e.g. a start site (or end site) and length distribution. Preferably the amount of artificially generated virtual fragments is sufficient so that generation of any additional virtual fragments has a small or negligible effect on the amount of assembled fragments obtained in step f), e.g. when the calculated result converges.
In preferred embodiments the inventive method further comprises a step of providing a distribution of fragment lengths of the fragments of step a) and wherein in step d1) the fragments are generated so that the plurality of generated virtual fragment sequences has substantially the distribution of fragment lengths. A distribution of fragment lengths can be determined by the same methods as described above for providing the distribution of fragment coverage or by studying a bioanalyser trace. It may result in a probability distribution function for fragment lengths.
“Substantially” and “about” as used herein may refer to the same value or a value differing by +/−10% of the given value, or in the case of distribution functions, identical functions or functions with +/−10% on average per coordinate, preferably also at most +/−10% per coordinate.
In preferred embodiments of the invention it is possible to repeat this step d1) of generating a plurality of virtual fragments at least once, thereby obtaining a different set of generated virtual fragments for each repetition of step d1). Repetitions of the generation of virtual fragments leads to different sets of pluralities of artificial fragments, and allows a better statistical analysis of fragment representation, especially during fragment assembly in the next step e). Further repetitions can be performed in a statistically relevant and reliable amount. In preferred embodiments at least 20, preferably at least 50, even more preferred at least 200, random sets of generated virtual fragment sequences are generated.
In addition or alternatively, it is possible to continuously generate virtual fragments according to step d1) and assemble them according to step e), preferably until the amount of virtual fragments per virtual copy of step f) converges. It shall be noted that steps d) and e) can be performed repeatedly for each additional virtual fragment generated in step d), e.g. d) and e) may be performed cyclically. Preferably at least 10, especially preferred at least 20, at least 30, at least 40, at least 50, at least 60, at least 60, at least 80, at least 100, at least 125, at least 150, at least 200, at least 300, at least 400, or at least 500, virtual fragments per nucleotide in length of the preselected nucleic acid molecule are generated. The amount, in particular the mean value, of virtual fragments per virtual copy converges, when said amount remains substantially the same, preferably does not change by more than 5%, preferably not more than 3%, even more preferred not more than 2% in particular preferred not more than 1%.
In case the sequence information of the fragment is known e.g. after sequencing the fragments or via obtaining sequence information of the virtual fragments, it is also possible to simply use these sequences in the next step e)—but preferred is the use of position data of the virtual fragments. As is common for fragment sequencing data, it is possible that sequence information is only available for a fraction of the generated fragments. E.g. it is possible that in step d2) 1% to 100%, preferably 2% to 80%, even more preferred 3% to 40% of the fragments are sequenced. It is also possible that only less than 50%, less than 40%, less than 30%, less than 20%, or less than 10%, of the fragments are sequenced.
The next step in the inventive method is the assembly of the artificial (d1) or non-artificial (d2) virtual fragments of any step d). This leads to fragment sequences, i.e. multiple series of virtual fragments, where a fragment sequence consists of at least one virtual fragment. Each virtual fragment within a virtual fragment can have an associated index indicating its position within the fragment sequence, i.e. a series of virtual fragments. Thus a “fragment sequence” of this (also conceptual step) is a series or a succession of fragments - not to be confused with nucleic acid sequences of fragments, which may not be used at all in the conceptual steps.
In the assembly step a fragment can be inserted at a specified position into a fragment sequence if the fragments start and end match or the resulting fragment sequence satisfies the assembly condition. This condition might refer to the relative position of the genome coordinates of the fragments within the fragment sequence or to the similarity of the nucleotide sequences of the fragments within the sequence. The nucleotide sequences of the fragments are available for fragments obtained in a) and sequenced in d2) but might not be available for artificially generated virtual fragments in d1). In preferred embodiments, the assembly condition is that two virtual fragments match as adjacent within one fragment sequence when the start in genome coordinates (or position) of a subsequent fragment follows immediately after the end in genome coordinates (or position) of a preceeding virtual fragment (exact match) or the start of a subsequent virtual fragment and the end of a preceeding virtual fragment are located within a window of +/−1 to +/−200 nucleotides (inexact match). This “window” can be seen as an overlap or gap imprecision, that surprisingly can provided faster and/or better results within the inventive method. This window is preferably a gap or overlap of the start (subsequent virtual fragment) to end (of the subsequent virtual fragment) positions of 1 to 150 nucleotides, e.g. 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, e.g. up to 120, up to 100, up to 80, up to 60, up to 40, up to 30 or up to 25 nucleotides. Another condition for an inexact match is an overlap or gap imprecision of at most 10% of the average virtual fragment length in nucleotides. In fact, allowing an inexact match improves performance of the inventive method, without substantially altering the outcome in step f) and g) of an assembly condition with exact match. If the nucleotide sequence of the fragments is known the assembly condition might require that parts of the nucleotide sequence of two fragments within one sequence adjacent in terms of their sequence indices are similar or identical potentially after reversion or reverse complementation of the nucleotide sequences. These or other conditions may be combined in the assembly condition. If the nucleotide sequence of the fragments is known their location in genome coordinates can be inferred by mapping their nucleotide sequence to a reference genome. This allows to determine whether assembly conditions referring to the relative location of the fragments within the fragment sequence are met.
The assembly step may be random. Two types of randomness can be included, the random generation of subsequently processed virtual fragments (or alternatively, if a plurality of fragments is generated at once and later assembled, the order of virtual fragments to be assembled can be random) and/or for a given virtual fragment the insertion into a fragment sequence randomly selected from the set of all fragment sequences previously generated by the assembly step, whose extension by the fragment meets the assembly condition.
The assembly step is repeated for all randomly or non-randomly generated virtual fragments until all fragments have been assembled or fail to match any fragment sequence or other non-assembled virtual fragments. Each fragment sequence or non-assembled virtual fragment in the final set of fragment sequences represents a virtual copy. For the inventive method, only the number of fragments per virtual copy, i.e. the number of fragments assembled to form a virtual copy is required. This number is calculated for each resulting virtual copy, which provides step f) the amount, e.g. an average or a distribution or the quantity, of assembled fragments from step d) per virtual copy obtained in step e).
In a particular embodiment in case of random factors in step e), preferably the assembly step e) is repeated with a different order of fragments preferably wherein at least 10 different fragments, or at least 100, especially at least 1000, different (e.g. random) fragments sequence sets are generated. Such different results may be further considered when calculating the average or distribution of fragments per virtual copy.
The amount of copies determined in step g) is a simple comparison of the amount of fragments as determined in step b) and the result of the calculated virtual fragments per virtual copy as determined in step f)—in view of step e). The amount of copies may be an absolute value or a relative value, dependent on the removal of any calculation bias and in dependence on the method used to generate fragments and the selected amount of virtual fragments as mentioned above. Given the statistical nature of the inventive method, the amount may be given as an average or a probability distribution of the quantity or concentration. A simple calculation to estimate the result is by providing the ratio of the fragment amount of step b) to the amount (or average) of annealed fragments per virtual copy of step f). It is also possible to estimate the probability of the copies of a preselected nucleic acid molecule from a distribution in step f) and the fragment amount of step b), e.g. using a posterior probability estimation. Any such method to transform the fragment amount with the amount of annealed fragments per virtual copy, wherein the latter could be a probability function itself, resulting in a probability function for the determined copy amount of the preselected nucleic acid molecule is possible.
The invention further relates to a computer readable memory device comprising a computer program product for performing a method according to the invention on a computer or for supporting a method according to the invention by a computer, especially any one of steps b), c), d1), d2), e), f) and g) may be assisted with a computer. Especially steps c), d1), e), f) and g) may be performed on a computer. Even the usually wet-chemistry step a) and/or b) of fragmenting the preselected nucleic acid molecule or step d2) of determining the nucleic acid sequence may be assisted by a computer, e.g. to control and obtain data from an automated or semi-automated sequence reader. The computer program product or memory device may also be provided with a sequence generation component that obtains short sequencing reads from a sample, such as a sequencer, preferably a sequencer comprising a computer component. For example, computer readable media can include but are not limited to magnetic storage devices (e.g., hard disk, floppy disk, magnetic strips, . . . ), optical disks (e.g., compact disk (CD), digital versatile disk (DVD), . . . ), smart cards, and flash memory devices (e.g., card, stick, key drive, . . . ).
The present invention is further illustrated by the following figures and examples, without being limited to these embodiments of the invention, with—each element being combinable with any other embodiment of the invention. Especially any one of the formulas provided below can be used separately, individually or in combination to describe or define a step in the inventive method.
One of the main purposes of next generation sequencing (NGS) is to determine the concentration, i.e. the number of copies, of a preselected nucleic acid of interest, such as a transcript within an mRNA sample, which is used for representative purposes only in the examples. Of course any example can be performed with any nucleic acid of interest. Concentration determination is typically done by mapping the reads (which are essentially fragments defined by their nucleic acid sequence) from an NGS experiment onto a reference genome and normalizing the number of reads that mapped within the definition of the transcripts. The intention of the normalization is to obtain a value which is directly correlated to the concentration of the transcripts and which is identical for transcripts with equal concentration. One of the most common normalization methods results in the so-called “fragments per kilo base per million reads” (FPKM) measure. For a given transcript t, this measure divides the number of reads r(t) that map within the definition of transcript t by the total number of reads in the NGS experiment |R| and multiplies by one million. This value, which is interpreted as the expected value of reads that map within the definition of the transcript for an NGS experiment with a total number of one million reads, is then divided by the length of the transcript l(t) and multiplied by 1000. Hence the FPKM measure of transcript t is given by
Dividing by the length of the transcript l(t) in (1) accounts for the fact that longer transcripts generate more reads and thus their read counts are higher than those of shorter transcripts with identical concentration. However, for two transcripts t1 and t2 of different length but same concentration their FPKM values are identical only if the following holds
This implies that in order for the FPKM measure to be a proper measure of transcript concentration the number of reads r(t) has to be linearly correlated to the length of the transcript l(t) and that for constant concentration the length l(t) is the only factor that influences the number of reads r(t). This is, however, an unrealistic assumption if one takes into consideration that each fragment originates from the fragmentation of a single copy of a transcript and thus from a sequence of fragments which obeys certain constraints. One such constraint is that the start of a fragment does not occur before the end of the previous fragment in the sequence. Another constraint is that the probability distribution of the fragments is given by p(r|t=i). Consider, for instance, two transcripts t1 and t2 with the same length l(t1)=l(t2) of 2000 base pairs, where the fragment start sites of transcript t1 have a strong 5′ bias, as shown by the solid line in
Under this assumption typical fragment sequences generated for single copies of transcripts t1 and t2 are shown in
Commonly used concentration measures, such as the FPKM measure, do not have any intuitive meaning by themselves but are abstract values which are somehow correlated to the concentration, i.e. the number of copies, of a transcript. In contrast, the method developed in this section directly calculates the probability p(c=n|t=i) that n copies of transcript t=i are present in an mRNA sample. The probability p(c=n|t=i) can be factorized as follows.
Here p(f=m|t=i) is the probability of observing m fragments given that the transcript is t=i and p(c=n|t=i,f=m) is the probability of observing n copies of transcript t=i given that m fragments were produced by transcript t=i. The probability p(f=m|t=i) is the distribution over the abundances of transcript t=i, which is provided by abundance estimating methods such as Cufflinks. Some of these methods return a single maximum likelihood estimate αi for the abundance of transcript t=i in which case p(f=αi|t=i)=1 and p(f=m|t=i)=0 for m≠αi. Since p(f=m|t=i) is provided by already existing tools the goal of the method described here is the calculation of p(c=n|t=i,f=m). Applying Bayes formula yields
Here N is the maximal possible number of copies of t=i and p(c=n) is the prior knowledge about the number of copies of transcript t=i. If no prior knowledge is available then (4) reduces to
Assuming that the fragmentation of n copies of transcript t=i is the repetition of n independent identically distributed processes yields
And the probability of observing m fragments in total generated by n copies of transcript t=i is therefore given by
Thus the main problem in estimating p(c=n|t=i,f=m) is the estimation of p(f=m|t=i,c=1) and the efficient calculation of p(f=m|t=i,c=n) from p(f=m|t=i,c=1). In order to obtain a good estimate for p(f=m|t=i,c=1) it is necessary to find fragment sequences which resemble those generated by fragmenting a single copy of transcript t=i. Such fragment sequences, which are illustrated in
As mentioned in the discussion of
In current NGS experiments the single copy fragment sequences (SCFS's) of transcript t=i are not recorded. Therefore, only the statistics p(r|t=i) of the fragment locations irrespective of their SCFS's can be inferred from the data. Thus in current NGS experiments p(r|t=i) forms the basis of the estimate of the abundance of transcript t=i. In the future, however, one can envisage that some information might be retained about the SCFS's of transcript t=i and that it will become possible to reconstruct the SCFS's of transcript t=i. An important consequence of the ability to reconstruct the SCFS's of transcript t=i is that it allows to count the copies of transcript t=i and therefore provides an exact value of the concentration of transcript t=i. An SCFS might, for instance, be recorded if each fragment of an SCFS has connectors at its start and end, where the connector at the end of one fragment fits to the connector at the start of the successive fragment in the SCFS. Here the connectors can be part of the sequence of transcript t=i or sequences artificially inserted at the start and end of a fragment. This situation is depicted on the right side of
A simplified version of a memory-preserving fragmentation, which partially implements the concept of start and end connectors, can be realized by isolating individual copies of transcript t=i, for instance by applying a limiting dilution. After isolation a single barcode is ligated to the start of each fragment, which uniquely identifies the copy of transcript t=i. Then the sequence of fragments generated by fragmenting a single copy of transcript t=i can be reconstructed by aligning fragments with the same barcode to a reference genome, which resolves the order of the fragments within the sequence.
In general, due to inaccuracies in the data multiple reconstructions of the SCFS's from the fragments generated by a memory-preserving or even perfect-memory fragmentation will be possible and the reconstruction with the highest quality has to be selected.
It has already been noted that the fragment sequence synthesis (FSS) introduced in section 1.1 and the fragment sequence reconstruction (FSR) introduced in section 1.2 share many common features. It is the purpose of this section to describe those features and to develop a general framework which contains the FSS and FSR as special cases.
As indicated by
As mentioned previously, the selection of the fragment sequence to be extended by rn has to generate a fragment sequence which obeys the topological constraints of the SCFS's. Hence the probability of selecting a fragment sequence such that extension by rn yields a fragment sequence which violates the topological constraints is zero. It is possible to arrive at such a selection criterion by defining the topological constraints and giving the same probability of selection to all fragment sequences whose extension by rn yields a sequence which obeys the topological constraints. Thus, as indicated in the central box in
The previous section discussed the common features of the fragment sequence synthesis and the fragment sequence reconstruction. It is the purpose of this section to formalize these features in a mathematical framework which contains the FSS and FSR as special cases.
Fragments are elements of a set F and are assembled by the fragment assembly algorithm (FAA) into sequences of fragments a=(r1, . . . , rn), where r≦1, ri∈F and the length of a will be denoted by l(a). The set of sequences that are generated by the FAA is denoted by A(F) and its elements a∈A(F) are called assemblies. For a fragment r∈F and a sequence a∈A(F) there might exist an extension of a by r into another sequence b∈A(F). This defines a relation on A(F) which will be denoted by a→b. It is also possible that more than one sequence can be simultaneously extended by a fragment r∈F into another sequence. This can, for instance, be the case if a1=(r1, . . . rn) and a2=(r1′, . . . , rm′) and r connects a1 and a2 as follows r1, . . . rn, r1′, . . . rn′)∈A(F). In general therefore, the relation a→b is defined between finite subsets a⊂A(F) and b∈A(F). There is a special sequence which is the empty sequence aØ=0. The empty sequence is extendable by any fragment r∈F and fulfills aØr→(r). Thus (r) is the extension of the empty sequence aØ by r. Since l(a)>0 for a∈A(F), aØ∉A(F), which can also be stated by saying that the FAA does not produce empty sequences. If the FAA is supposed to discard certain fragments this can be achieved by keeping the state of the FAA fixed between successive iterations. Each iteration of the FAA generates a multiset of fragment sequences. In comparison to a set a multiset can contain the same element multiple times and is therefore specified by its elements and their multiplicity. A multiset of assemblies can therefore be written as follows
A={(a,mA(a)):a∈A(F), mA(a)∈Z0} (8)
where mA(a) is the multiplicity of a. The set of all multisets of assemblies will be denoted by AS(F). Each multiset of assemblies has an associated multiset of fragments which will be denoted by F(A) and is given by
Thus the multiplicity of r∈F in F(A) is the number of times the fragment occurs in assemblies a∈A . For a→b it is assumed that
F(b)=F(a)∪r (10)
Hence given a,b∈A(F) such that ar→b then fragment r∈F is uniquely defined. The underlying set of a multiset A will be denoted by U(A) and is given by
U(A)={a∈A(F): mA(a)>0} (11)
If assemblies are drawn randomly with uniform distribution from a multiset of assemblies A then the probability of observing assembly a∈U(A(F)) is given by
where |A|=Σa∈U(A)mA(a) is the number of elements in A. Similarly, the probability of observing an assembly of length L is given by
For A∈AS(F) and r∈F let X(A,r) denote the multiset of extensions a→b where a∈A unified with the empty assembly aØ.
The multiplicity
is the number of possibilities, including the multiplicity of a, mA(a), how a can be extended by r to yield b. Given a and r there might be more than one way to construct b. Consider, for instance, the sequences a=( . . . , ri,r,ri+1, . . . ) and b=( . . . , ri,r,r,ri+1, . . . ) then b can be derived by extension from a by inserting r immediately after ri or immediately before ri+1. In the FAA's studied here, however, there is exactly one way how a can be extended by r to yield b and therefore
According to (12), the probability of a uniformly random extension ar→b in X(A,r) is given by
This probability will be called the canonical extension probability. Hence, if the extension probability is the canonical then
The probability in (16) is the extension probability of the FAA as defined in the previous section. The extension of the multiset of assemblies A via
will be denoted by
and is given by the following multiset of assemblies
Thus the subset a⊂A∪aØ of the assemblies which are extended is removed from the multiset A while the assembly b that results from extension is added to the multiset. This implies
Since aØ∉A equation (18) implies
and thus
If B can be derived from A by extension then the following must hold
a=A\B⊂A∉aØ
b=B\A∈A(F)
r=F(B)\F(A)∈F
B=EXT(A,ar→b) (22)
Thus the extension
by which B is derived from A is unique and the extension probability can be uniquely defined on sets A,B∈AS(F) as follows.
If A,B∈AS(F) satisfy (22) then by (12) and (19) the probability of observing assembly x∈B, p(x|B), can be derived from the probability p(x|A) as follows
for fragments r∈F which do not satisfy (22) the following holds
Since each r∈F is extendable due to the presence of aØ in X(A,r) the following holds
and thus an FAA is a Markov chain on the countably infinite state space AS(F). Since according to (12) and (13) every A∈AS(F) has an associated probability distribution over the assemblies a∈A and assembly lengths L, an FAA has associated hidden Markov models, where the observables are the assemblies a∈A(F) and assembly lengths l(a), respectively, and the hidden variables are the multisets of assemblies A∈AS(F). Summing up the discussion so far, the fragment assembly algorithm (FAA) can be formally defined as follows.
Fragment assembly algorithm: A fragment assembly algorithm is a Markov chain which is defined by the following tuple
where
is a relation defined between the set of finite subsets of A(F), 2A(F), and A(F) which indicates that b∈A(F) is an extension of a⊂A(F) by r∈F.
is a probability distribution which is called the extension probability. p(a→b|A,r) is defined for each A∈AS(F) and r∈F on a⊂A, b∈A and r∈F and satisfies
for a→b∉U(X(A,r)).
A special case of an FAA can be defined by the following tuple
where
The state space of the fragment assembly algorithm is the countably infinite set AS(F) and the probability of a transition from state A∈AS(F) to state B∈AS(F) is given by
where
is defined in (23). The initial state of the fragment assembly algorithm is given by
Thus for the initial state the probability of a multiset of assemblies p(A) with a single underlying fragment r equals the probability of the fragment p(r) while p(A)=0 for sets of assemblies with more than one underlying fragment. An FAA gives rise to hidden Markov models where the observables are assemblies a∈A(F) and assembly lengths L=1, . . . , N and the hidden variable is the multiset of assemblies A∈AS(F).
The following examples show different sets of assemblies AS(F), where F is the set of intervals on a finite set of numbers.
Here a sequence of fragments a=(r1, . . . , rn) is an assembly a∈A(F) if
end(ri)<start(ri+1)−S (32)
for i=1, . . . n−1 and some S≧0. Hence a→b∈X(A,r) means that r can be inserted between successive elements of a or alternatively before r1 or after rn such that equation (32) is fulfilled for the resulting sequence b. A fragment r connects two sequences a1=(r1, . . . , nr) and a2=(ri′, . . . rm′) and therefore
if end(r)start<start(r)−S and end(r)<start(r1′)−S. If S=0 then the assemblies are contiguous, since there are no gaps between successive fragments. The case S=0 will therefore be referred to as assemblies of contiguous fragments.
It is furthermore possible to express a preference for certain extensions
by modifying the canonical extension probability. This can, for instance, be done by weighting the canonical extension probability and subsequent renormalization. Such a weight might for instance indicate a preference for sequences where the fragment r is far away from its neighbors in b, e.g. by weighting the canonical extension probability as follows
with the necessary modifications for r inserted before the start or after the end of a and a=(a1,a2).
Here a sequence of fragments a=(r1, . . . , rn) is an assembly a∈A(F) if
end(ri)<start(ri+1)+S (34)
for i=1, . . . , n−1 and some S>0. The same observations, with the necessary modifications, as in the case of assemblies of non-overlapping fragments also apply here. In comparison, however, sequences of potentially overlapping fragments can contain the same fragment multiple times. The concept of assemblies of potentially overlapping fragments contradicts the minimal topological constraint mentioned before, which requires that the start of a fragment does not occur before the end of the previous fragment. It will, however, be seen to be convenient for computational purposes.
In this case a sequence of fragments a=(r1, . . . , rn) is an assembly a∈A(F) if
|end (ri)−start(ri−1)|<B (35)
for i=1, . . . , n−1 and some B>0. An assembly can, as in the previous example, contain the same fragment multiple times.
The above examples show that there exist many variants of assembly sets A(F) and extension probabilities
arms therefore many variants of the FAA. It is furthermore easy to generate new FAA's from already existing FAA's by modifying their assembly sets A(F) or extension probabilities
Section 2.3 will employ FAA's with assemblies of contiguous fragments and will show that FAA's with assemblies of fragments with bounded distance can be used to efficiently approximate their fragment sequence synthesis.
Let An denote the set of multisets A∈AS(F) which can be generated in the n-th iteration of an FAA. Due to equation (31) and the fact that exactly one fragment is added to a multiset A∈AS(F) in each iteration the following holds.
A
n
={A∈AS(F):|F(A)|=n+1} (36)
Therefore An∩An+1=Ø and hence an FAA can be interpreted as a branching process with no overlap between successive generations. Let An ∈An then the following holds
where p(n)(An) is the probability of multiset An ∈An after the n-th iteration. Equation (37) holds since p(n)(A)=0 for A∉An. Let b∈A(F) be an assembly then the probability of b in the n+1-th iteration is given as follows
where p(b|An+1) is given by (12). Substituting (37) into (38) yields the following equation.
Due to (18), (20) and (12) p(x|B) can be derived from p(x∈A) as follows.
If F is the set of intervals of a finite set of numbers then l(a) is bounded for all a∈A(F) and |A(F)|<∞, |An|→∞ for An∈An and n→∞ and therefore
p(x|An+1)→p(x|An)n→∞ (41)
Hence for n→∞ the right hand side of equation (39) converges to
Let p(n)(L) denote the probability of observing an assembly a∈A(F) of length l(a)=L in the n-th iteration, then the following can be derived from equation (42).
Equations (42) and (43) show that the difference between probabilities of successive iterations of the fragment sequence synthesis converges to zero. Thus any FAA whose termination condition is based on the difference between successive probabilities will eventually terminate. However, equations (42) and (43) are only a necessary but not sufficient condition for the convergence of the probabilities. In principle it is possible that the difference between successive probabilities becomes arbitrarily small but that the probabilities do not converge. This, however, is only possible if the probabilities are not monotonic, which beyond a certain iteration has not been observed in the experiments described here. In the following therefore if |p(n+1)(L)−p(n)(L)|<ε for all L and a certain ε>0 then p(n)(L) will be interpreted as the distribution p(f=L|t=i,c=1).
Whereas the previous section discussed the theoretical aspects of the fragment assembly algorithm this section focuses on some practical aspects of the fragment sequence synthesis (FSS). In particular this section deals with the question how many sampled fragments are necessary in order to obtain a good estimate of p(f=m|t=i,c=1) and how the speed of the convergence of the FSS can be increased.
Consider the fragment distribution p(r|t=i) whose fragment start site and length distributions are derived from the distributions in
The estimation of PCT in p(f=m|t=i,c=1converges for the p(r|t=i) derived from the distributions in
Instead of requiring an exact match at the start and end of) fragments, i.e. end(ri)+1=start(ri+1), one can study assemblies of fragments with bounded distance, i.e. |start(ri+1)−end(ri)|≦B. This removes some of the short assemblies from the assembly list while retaining all the fragments in the fragment sample.
The same holds true for the Gaussian start site distribution.
The estimation of p(c=n|t=i,f=m) proceeds via the application of Bayes theorem in (5) and requires therefore the calculation of p(f=m|t=i,c=n) in (7). While decomposing m into all sums of n summands might be possible for small m it becomes increasingly impractical for larger m. In this case one can apply the central limit theorem which states that given the probability distribution p(f=m|t=i,c=1) the distribution of the sum of n copies p(f=m|t=i,c=n) converges to a normal distribution for n→∞. The mean of this normal distribution is given by
μn=nμ1 (44)
and the standard deviation is given by
σn=√{square root over (n)}σ1 (45)
where μ1 and σ1 are the mean and standard deviation of p(f=m|t=i,c=1). Thus the probability p(f=m|t=i,c=n) converges as follows
This section applies the methods developed in the previous sections for the estimation of p(c=m|t=i,f=m) to the data of an NGS experiment. For this NGS experiment 500 ng RNA input of Universal Human Reference RNA [7] were sequenced as part of a larger sample on an Illumina HiSeq 2000 chemistry version 3, producing 50 base pair long single end reads. The SENSE kit revision date 11.02.2013 [6] was used to prepare the library. SENSE is a stranded library preparation protocol and the start of a fragment can therefore be easily determined for each read. The length distribution of the resulting fragments is depicted in
where F(i) is the probability of observing a fragment of length i, which is given by the distribution in 18. The number of reads mapping to the annotation of the ERCC transcripts is given in the fifth column. Substituting this number into the definition of the FPKM value (1) together with the transcript length, adjusted transcript length and total number of mapping reads in the experiment, which is 3,215,638, results in the values in the last two columns of Table 3.
The estimated number of molecules is plotted against the true number of molecules in
After RNA fragmentation in an RNA-Seq experiment the library preparation, sequencing and data processing pipeline changes the number of fragments originally generated for a transcript, t=i. The proportion of fragments which arrive at the end of this pipeline represent the pipeline efficiency, which has to be taken into account when estimating p(f=m|t=i,c=1). If the probability p(f=m|f0=m0,t=i,c=1) is known that m fragments arrive at the end of the pipeline given that m0 entered the pipeline, then the probability distribution of the number of fragments belonging to a single copy of a transcript at the end of the pipeline is given as follows.
Here the probability distribution p(f0=m0|t=i,c=1) can, as usual, be estimated by the fragment assembly algorithm. In order to increase readability, the dependency of the probability distributions on t=i and c=1 will in the following be dropped from the notation. For the purpose of estimating the number of transcript copies with the central limit theorem for large values of m only the expected value and variance of p(f=m) are needed. These are given by the following equations.
Consider, for instance, the situation were the pipeline retains fragments with probability q and discards them with probability 1-q. This corresponds to a Bernoulli trial and therefore the following holds.
As a result, the expected value and variance of p(f=m) are given by.
E(f)=q E(f0) (51)
Var(f)=qE(f0)+q2(Var(f0)−E(f0)) (52)
Intuitively one might expect that retaining a fraction of q fragments from the original set of fragments should lead to a correction of the estimated number of copies by a factor of 1/q. In comparison to scaling the estimate of the number of transcript copies by 1/q, however, the binomial distribution (48) introduces a non-linear correction which is most prominent for small f and variances Var(f0) which differ strongly from E(f0). This can be seen from
p(f=m|f0=m0)=δqm
where δqm
E(f)=q E(f0)
Var(f)=q2Var (f0) (54)
This section describes an experiment which was carried out on NGS data derived from a SQUARE library preparation [10]. A total RNA equivalent of 100 ng RNA was used as input with 1% spiked in ERCC's. The library was sequenced on a MiSeq NGS sequencer producing 577,010 single-end reads with a length of 60 bps. 72,511 of these reads mapped to the ERCC annotation.
This document developed the fragment assembly algorithm (FAA) and its two special cases the fragment sequence synthesis (FSS) and fragment sequence reconstruction (FSR). With additional information on the connectivity of fragments the FSR can be used to reconstruct the original fragment sequences generated by a fragmentation process. As a result the FSR allows a direct estimate of the number of copies of a transcript by counting the reconstructed fragment sequences belonging to the transcript. The FSS, on the other hand, can be used to estimate p(f=m|t=i,c=1) the probability distribution of the number of fragments generated by a single copy of transcript t=i. It was shown that for assemblies of fragments with bounded distance the FSS leads to rapidly converging estimates of p(f=m|t=i,c=1) which can be used as approximations of p(f=m|t=i,c=1) for assemblies with contiguous fragments. From estimates of p(f=m|t=i,c=1) estimates of p(f=m|t=i,c=n) can be derived either via decomposing m into all possible sums consisting of n summands or by application of the central limit theorem. In experiments on ERCC spike-in's of an NGS data set it was shown that application of the central limit theorem in conjunction with the FSS yields estimates of p(c=n|t=i,f=m) which have a higher correlation with the true number of molecules of a transcript than both the FPKM and length adjusted FPKM values. In addition, fitting a linear regression with intercept to the true and estimated number of molecules yields a slope close to 1, which implies that the method described here produces in fact a proper estimate of the number of molecules of a transcript.
Number | Date | Country | Kind |
---|---|---|---|
13191170.3 | Oct 2013 | EP | regional |
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/EP2014/073463 | 10/31/2014 | WO | 00 |