The present invention relates to a bioinformatic method for determining the presence of a viral contamination in a sample. Particularly, the present invention relates to determining the type of viral contamination where such viral contamination is present. Such method can be used in various applications in which the absence of viral contamination is a critical quality attribute either for the final product or during any intermediate steps in or as raw material for a production process.
Viral contaminations, or adventitious contaminations, can occur in any environment and in any type of materials. Such materials can be for example a final or intermediate product in a process producing a biomolecule of interest or in raw materials for a production process to produce an active pharmaceutical agent or a cell bank.
The presence of harmful viral agents should be avoided in many commercially available products. This is particularly true for any products that are for human or animal consumption such as foods or medications. For example, many health authorities around the world require strict control of production processes to minimize viral contaminations. In many instances such health authorities require the final product, particularly if the final product is a medical product, to be free of viral contaminations.
Analytical methods have been developed for determining viral contaminations. They frequently involve testing for the presence of a particular viral contaminant by exposing a sample (or multiple samples simultaneously) to a determinant for a suspected viral contaminant, for example a reporter antibody that recognizes a particular viral contaminant. Such analytical methods are laborious and require the availability of a good reporter determinant for the viral contaminant with a very low limit of detection. In addition, such methods are limited in that only the presence or absence of a suspected viral contamination is determined.
Thus, there is a need for alternative detection methods for determining viral contaminants in a sample which are less cumbersome and laborious and which determine the presence or absence of a viral contamination regardless of the nature of a suspected viral contaminant.
The current invention provides a solution to the above-described problems by sequencing all DNA and RNA, if present, in a sample to be tested for viral contamination and comparing the resulting sequencing reads to a viral database. Such a method is independent of the nature or type of suspected viral contamination. The method is rapid, can be carried out using high-throughput sequencing (HTS, also known as Next-Generation Sequencing, massively parallel sequencing or deep sequencing) and requires a single sample. Moreover, the method of the present invention can be automated in production processes for commercial production of biomolecules. In such production processes the method of the current invention can be utilized in in-process control. In-process control for determining viral contamination could improve the reliability of production processes and provide control over, and insight into, where within the production process the viral contamination occurred. Another advantage of the method of the present invention is that if a viral contamination is detected, the nature/type of viral contamination can be readily identified.
In one embodiment, the present invention provides a method for determining viral contamination in a sample wherein sequence data is obtained through HTS, the method comprising the steps of:
In another embodiment, the present invention provides a method for determining viral contamination of a sample in the production process of a biologic molecule of interest, wherein sequence data is obtained through HTS, the method comprising the steps of:
In yet another embodiment, the current invention provides a method for determining viral contamination in a sample wherein sequence data is obtained through HTS, the method comprising the steps of:
In another embodiment, the present invention provides a method of product release in or from a production process, the method comprising:
As described above, reliable, easy-to-use (even as in-process control) and rapid analytical methods for determining viral contaminations are currently not available. The present invention provides a solution wherein viral contamination and the identity of such viral contamination, if present, can be determined. The method of the present invention utilizes high-throughput sequencing techniques wherein the sequencing reads are compared and aligned with a viral database. Such methods allow for a more rapid determination of viral contamination and simultaneously identify the viral contaminants if one or more such viral contaminants are present. The method of the present invention can be used to determine viral contaminants in a variety of products, for example a final product of a production process or raw materials to be used in a production process, but also intermediate products (such as for in-process control).
As such in one embodiment the present invention provides a method for determining viral contamination in a sample wherein sequence data is obtained through HTS, the method comprising the steps of:
In the method of the present invention any high-throughput sequencing (HTS) technique can be used; preferably the method uses short-read HTS methods.
The viral database for use in the method of the present invention preferably comprises viral genome sequences organized by genome and viral families. The viral families (or taxonomic groups of other rank) are preferably organized such that viruses of the same taxonomic family are grouped together. Such grouping can also apply to the sequences of segmented genomes, which are grouped together in the viral database. As such, the identity of the viral contamination, if any, can be more readily determined.
Some sequencing reads obtained with the sequencing methods used in the present invention align with sequences in the viral database but could also be non-viral (having similarity with non-viral sequences). To account for such sequencing reads, a mask can be applied to the remaining sequences from step c in the method of the present invention. Such a mask either fully subtracts such sequencing reads from the remaining sequencing reads from step c of the method or applies a discounted value to them. When a set of sequence coverage metrics is calculated, sequencing reads that are discounted as a result of the mask contribute a reduced value to the calculated coverage metrics compared with when no mask is applied (unmasked coverage metrics).
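By way of non-limiting illustration, the two behaviours of the mask (full subtraction or a discounted value) can be sketched as a per-read weight. The interval representation of reads and masked regions, the function names and the `discount` parameter are assumptions made for this sketch only; they are not taken from the description.

```python
def read_weight(read_start, read_end, mask_regions, discount=0.0):
    """Return the weight a read contributes to the coverage metrics.

    Coordinates are 0-based, half-open [start, end) on one viral genome.
    A read overlapping any masked region gets weight `discount`
    (0.0 means the read is fully subtracted); other reads count as 1.0.
    """
    for m_start, m_end in mask_regions:
        if read_start < m_end and m_start < read_end:
            return discount
    return 1.0


def weighted_read_count(reads, mask_regions, discount=0.0):
    """Sum the per-read weights for a list of (start, end) alignments."""
    return sum(read_weight(s, e, mask_regions, discount) for s, e in reads)


# Hypothetical alignments and mask on a single viral genome.
reads = [(0, 100), (150, 250), (400, 500)]
mask = [(200, 300)]
print(weighted_read_count(reads, []))         # unmasked count
print(weighted_read_count(reads, mask))       # masked reads subtracted
print(weighted_read_count(reads, mask, 0.5))  # masked reads discounted
```

With `discount=0.0` the sketch reproduces full subtraction; any value between 0 and 1 gives the reduced contribution described above.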
The determination of viral contamination and identity, if any viral contamination is present, is preferably carried out by the steps of: a.) calculating a set of sequencing coverage metrics for each viral genome sequence after alignment of the remaining sequencing reads against the viral database, b.) discarding all viral genomes not surpassing one or more preset minimal sequence coverage metric values, c.) identifying and reporting any viral families that present at least one candidate positive signal, d.) identifying for each reported family a virus with the most complete and intense signal, and e.) reporting both a list of positive viral families and a best match in each positive family.
As described above, such sequence coverage metrics can be calculated while excluding all viral genome regions (sequencing reads within a viral genome region) that overlap with sequences that have previously been observed in reference samples of a same biologic background. A reference sample of a same biologic background, as described herein, is a reference sample that does not contain any viral contamination and that has the same biologic background as the sample being tested for viral contamination. Such a reference sample of a same biologic background is preferably a reference sample with known absence of viral contamination that has been produced in a same production process using the same biologic material, for example host cell, as the sample for which the presence or absence of viral contamination is being tested. In addition, the biologic material referred to herein may be from the host cell or may refer to the plasmid sequence where the plasmid is introduced into the host cell for expressing the biologic molecule of interest. The plasmid contains some plasmid-specific sequences and the sequence of the biologic molecule of interest. Such biologic material may also be sequence material related to a recombinant cell line under testing (for example in a cell bank or production process).
The method can be used for determining the presence of viral contamination in a variety of different samples, such as a final product, raw material or intermediate product in or for a production process. Preferably the production process is a production process for a biologic molecule. In such a production process, where the final product is a biologic molecule, the process may use a host cell to express the biologic molecule. In processes wherein a host cell is used to express the biologic molecule, the process of the present invention includes the subtraction of any sequencing reads that are aligned with a host cell genome prior to alignment of the sequencing reads to the viral database as in step b of the method of the current invention. In the flow-chart of
The method of the present invention can be used in a method for product release. In such method for product release in or from a production process the method comprises determining the presence or absence of viral contamination in a sample by using the method of the current invention which includes the steps of:
The following examples are illustrative and are not meant to be limiting the scope of the invention.
The raw data produced by NGS sequencers are analyzed and the method provides a determination on the presence or absence of viral contamination.
The raw data generated by the sequencer are converted into FASTQ files. Where a single run includes data from different samples (multiplexed runs), in this initial step the reads are assigned to each sample.
After the generation of the FASTQ files, the adapter sequences are removed from the reads in a process called “trimming”. This step is required to filter out low-quality reads and to “clean” the data, because part of the sequences generated during the sequencing process might contain adapters used for the sequencing itself.
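By way of non-limiting illustration, trimming and quality filtering can be sketched as follows. The sketch assumes exact adapter matching, Phred+33 quality encoding and placeholder thresholds (`min_len`, `min_mean_q`); the adapter string in the usage example is likewise a placeholder. Dedicated trimming tools typically also tolerate mismatches in the adapter, which this sketch does not.

```python
def trim_adapter(seq, qual, adapter):
    """Cut the read at the first exact occurrence of the adapter."""
    idx = seq.find(adapter)
    if idx == -1:
        return seq, qual
    return seq[:idx], qual[:idx]


def mean_phred(qual, offset=33):
    """Mean base quality of a quality string (Phred+33 assumed)."""
    if not qual:
        return 0.0
    return sum(ord(c) - offset for c in qual) / len(qual)


def clean_reads(records, adapter, min_len=20, min_mean_q=20):
    """Trim adapters, then drop reads that are too short or low quality.

    `records` is an iterable of (name, sequence, quality) tuples.
    """
    kept = []
    for name, seq, qual in records:
        seq, qual = trim_adapter(seq, qual, adapter)
        if len(seq) >= min_len and mean_phred(qual) >= min_mean_q:
            kept.append((name, seq, qual))
    return kept


# Hypothetical reads: r1 carries the adapter, r2 has very low quality.
records = [
    ("r1", "ACGTACGTAGATCGGAAG", "I" * 18),
    ("r2", "ACGTACGTACGT", "#" * 12),
]
print(clean_reads(records, adapter="AGATCGGAAG", min_len=5))
```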
Optionally, the pipeline can subsample the reads in order to use only part of the available reads for data analysis. This optional step can be used to assess the method performance at different levels of sequencing throughput (i.e. different number of sequencing reads generated).
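By way of non-limiting illustration, the optional subsampling can be sketched as a seeded random selection of reads, so that the same subset is obtained when the analysis is repeated at a given throughput level. The function name and the `seed` parameter are assumptions made for this sketch.

```python
import random


def subsample_reads(reads, n, seed=0):
    """Return `n` reads chosen at random without replacement.

    The original read order is preserved and the result is reproducible
    for a given seed. If `n` is not smaller than the number of reads,
    all reads are returned unchanged.
    """
    reads = list(reads)
    if n >= len(reads):
        return reads
    rng = random.Random(seed)
    keep = sorted(rng.sample(range(len(reads)), n))
    return [reads[i] for i in keep]


# Hypothetical read names; evaluate the method at a reduced depth.
all_reads = [f"read{i}" for i in range(100)]
subset = subsample_reads(all_reads, 10, seed=1)
print(len(subset))
```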
2. Host Cell Subtraction (which is Optional)
To remove sequences derived from the cell genome, the method can align all the reads against the reference genomic sequence of the host cell using a sequence aligner. Then, all the reads aligning against the host genome are excluded from the analysis and only the unaligned reads are used for the subsequent steps. This step is optional but can be useful when a mask file is not available, or it can be used to further investigate positive samples, excluding false positives due to low specificity.
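By way of non-limiting illustration, the selection of unaligned reads after host alignment can be sketched as follows. The sketch assumes that the aligner output is available as SAM-format text and that reads are single-end; it relies only on the SAM FLAG bit 0x4, which marks an unmapped read. The function name is hypothetical.

```python
UNMAPPED = 0x4  # SAM FLAG bit: the read did not align to the reference


def unaligned_read_names(sam_lines):
    """Return the names of reads that did not align to the host genome."""
    names = set()
    for line in sam_lines:
        if line.startswith("@"):  # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        name, flag = fields[0], int(fields[1])
        if flag & UNMAPPED:
            names.add(name)
    return names


# Hypothetical host alignment: r1 and r3 map to the host, r2 does not.
sam = [
    "@HD\tVN:1.6",
    "r1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    "r3\t16\tchr2\t50\t60\t4M\t*\t0\t0\tACGT\tIIII",
]
print(unaligned_read_names(sam))
```

Only the reads returned by this selection would be passed on to the alignment against the viral database.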
The reads generated in step 1 (or step 2 if host subtraction is carried out) are then aligned against a database including both the plasmid sequence and the viral database using an open-source sequence aligner, generating an intermediate alignment file. The resulting alignment file is further processed to make sure that i) secondary alignments (i.e. reads that align equally well to multiple locations) are treated as primary (i.e. considered in the downstream processing) and ii) alignments shorter than 75 base pairs are discarded. At the end of this process an alignment file in BAM format is generated. This file contains various information including: i) name of the read, ii) location of the alignment on the reference database (with sequence name), iii) quality of the alignment, iv) quality of the read and v) name of the sequence of the reference genome.
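By way of non-limiting illustration, the post-processing of the alignment file can be sketched as follows. The sketch assumes that alignment length is measured as the number of bases in CIGAR match operations (M, = and X) and that "treating a secondary alignment as primary" corresponds to clearing the SAM secondary flag bit (0x100). Both are assumptions for this sketch, as is the simplified record layout.

```python
import re

SECONDARY = 0x100        # SAM FLAG bit: secondary alignment
MIN_ALIGNED_BP = 75      # minimum alignment length stated in the text
_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def aligned_length(cigar):
    """Number of read bases aligned to the reference (M, = and X ops)."""
    return sum(int(n) for n, op in _CIGAR_OP.findall(cigar) if op in "M=X")


def filter_alignments(records, min_len=MIN_ALIGNED_BP):
    """Drop short alignments and keep secondary ones as primary.

    `records` is an iterable of (read_name, flag, reference, pos, cigar).
    """
    kept = []
    for name, flag, ref, pos, cigar in records:
        if aligned_length(cigar) < min_len:
            continue
        kept.append((name, flag & ~SECONDARY, ref, pos, cigar))
    return kept


# Hypothetical alignments against placeholder viral sequences.
records = [
    ("r1", 0, "virusA", 10, "100M"),
    ("r1", 256, "virusB", 5, "80M20S"),   # secondary, kept as primary
    ("r2", 0, "virusA", 50, "60M40S"),    # only 60 aligned bases, dropped
]
for rec in filter_alignments(records):
    print(rec)
```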
In this step the method disregards all the alignments against the plasmid sequence and calculates several coverage metrics for each viral genome.
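By way of non-limiting illustration, one such coverage metric, the percentage of a viral genome covered by at least one read (1× coverage %), can be sketched both without and with exclusion of masked regions. The interval representation and the choice to remove masked positions from both numerator and denominator are assumptions made for this sketch; the description does not specify the formula.

```python
def positions(intervals):
    """Set of 0-based positions covered by half-open [start, end) intervals."""
    pos = set()
    for start, end in intervals:
        pos.update(range(start, end))
    return pos


def coverage_1x_pct(genome_len, alignments, mask=()):
    """Percentage of non-masked genome positions covered by >= 1 read.

    With an empty mask this is the unmasked metric. Intervals are assumed
    to lie within [0, genome_len).
    """
    covered = positions(alignments)
    masked = positions(mask)
    usable = genome_len - len(masked)
    if usable <= 0:
        return 0.0
    return 100.0 * len(covered - masked) / usable


# Hypothetical 1000 bp viral genome with three alignments and one mask.
alignments = [(0, 100), (50, 150), (400, 500)]
mask = [(400, 600)]
print(coverage_1x_pct(1000, alignments))        # unmasked: 25.0
print(coverage_1x_pct(1000, alignments, mask))  # masked: 18.75
```

In this example the signal that falls entirely within the masked region no longer contributes, so the masked value is lower than the unmasked one.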
All four parameters are calculated twice for every virus: once counting the entire length of the viral sequences included in the database (“unmasked” metrics) and a second time excluding all the viral regions that are described in the Mask file (“masked” metrics). See
The coverage metrics calculated in the previous step are used to discriminate background noise from potential signals. Preset cutoff values (for example previously determined through empirical evidence) are used to exclude all the viral signals that do not pass the defined cutoffs, thereby selecting the positive candidate signals.
Using virus details present in the database, the method determines which viral groups (e.g. taxonomic families) contain at least one candidate positive signal. These viral groups are added to the final report and the list of positive viral families constitutes the primary result of the method. In addition, for each positive viral group, the method identifies which viral genome is the closest match to the actual viral contaminant in the sample (“Best match”). For each positive viral group, the best match reported by the method is the virus with the highest 1× Coverage % (unmasked). In case of ties between two sequences, the method selects the signal with the highest number of mapping reads (unmasked).
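By way of non-limiting illustration, the application of cutoffs can be sketched as follows. The metric names, the cutoff values and the use of a "greater than or equal to" comparison are placeholders chosen for this sketch; actual cutoffs would be those preset for the method.

```python
def candidate_positives(metrics, cutoffs):
    """Keep only the viral genomes whose metrics reach every cutoff.

    `metrics` maps a virus name to a dict of metric values;
    `cutoffs` maps a metric name to its minimum accepted value.
    """
    return {
        virus: values
        for virus, values in metrics.items()
        if all(values[name] >= minimum for name, minimum in cutoffs.items())
    }


# Hypothetical metrics for three placeholder viral genomes.
metrics = {
    "virusA": {"coverage_1x_pct": 12.0, "mapped_reads": 40},
    "virusB": {"coverage_1x_pct": 0.5, "mapped_reads": 300},
    "virusC": {"coverage_1x_pct": 30.0, "mapped_reads": 3},
}
cutoffs = {"coverage_1x_pct": 5.0, "mapped_reads": 10}  # placeholder values
print(sorted(candidate_positives(metrics, cutoffs)))  # ['virusA']
```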
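By way of non-limiting illustration, the grouping of candidate positive signals by viral family and the selection of the best match, including the tie-break on the number of mapping reads, can be sketched as follows. The record layout, field names and family names are placeholders for this sketch.

```python
from collections import defaultdict


def best_match_per_family(candidates):
    """Map each positive family to its best-matching virus.

    The best match has the highest unmasked 1x coverage %; ties are
    broken by the highest unmasked number of mapping reads.
    """
    by_family = defaultdict(list)
    for cand in candidates:
        by_family[cand["family"]].append(cand)
    return {
        family: max(
            members,
            key=lambda c: (c["coverage_1x_pct"], c["mapped_reads"]),
        )["virus"]
        for family, members in by_family.items()
    }


# Hypothetical candidate positive signals (unmasked metrics).
candidates = [
    {"virus": "virusA1", "family": "familyA", "coverage_1x_pct": 80.0, "mapped_reads": 500},
    {"virus": "virusA2", "family": "familyA", "coverage_1x_pct": 80.0, "mapped_reads": 900},
    {"virus": "virusB1", "family": "familyB", "coverage_1x_pct": 15.0, "mapped_reads": 60},
]
report = best_match_per_family(candidates)
print(sorted(report))  # positive families: ['familyA', 'familyB']
print(report)
```

Here the two familyA genomes tie on coverage, so the one with more mapping reads is reported as the best match.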
Number | Date | Country | Kind |
---|---|---|---|
21183572.3 | Jul 2021 | EP | regional |
22160023.2 | Mar 2022 | EP | regional |
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/EP2022/068346 | 7/1/2022 | WO |