The present invention relates to a method for detecting an integration pattern of a virus in a host genome, tools for performing the method and applications thereof.
The integration of viral DNA into the host genome is a defining feature of the retroviral life cycle, irreversibly linking provirus and cell. This intimate association facilitates viral persistence and replication in somatic cells, and with integration into germ cells bequeaths the provirus to subsequent generations. Considerable effort has been expended to understand patterns of proviral integration, both from a basic virology standpoint, and due to the use of retroviral vectors in gene therapy1. The application of next generation sequencing (NGS) over the last ˜10 years has had a dramatic impact on our ability to explore the landscape of retroviral integration for both exogenous and endogenous retroviruses. Methods based on ligation mediated PCR and Illumina sequencing have facilitated the identification of hundreds of thousands of insertion sites in exogenous viruses such as Human T-cell leukemia virus-1 (HTLV-1)2 and Human immunodeficiency virus (HIV-1)3-6. These techniques have shown that in HTLV-12, Bovine Leukemia Virus (BLV)7 and Avian Leukosis Virus (ALV)8 integration sites are not random, pointing to clonal selection. In HIV-1 it has also become apparent that provirus integration can drive clonal expansion3,4,8,8, magnifying the HIV-1 reservoir and placing a major road block in the way of a complete cure.
Current methods based on short-read (high throughput) sequencing identify the insertion point, but the provirus itself is largely unexplored. Whether variation in the provirus influences the fate of the clone remains difficult to investigate. Using long range PCR it has been shown that proviruses in HTLV-1 induced Adult T-cell leukemia (ATL) are frequently (˜45%) defective10, although the abundance of defective proviruses within asymptomatic HTLV-1 carriers has not been systematically investigated. Recently, there has been a concerted effort to better understand the structure of HIV-1 proviruses in the latent reservoir. Methods such as Full-Length Individual Proviral Sequencing (FLIPS) have been developed to identify functional proviruses11 but without identifying the provirus integration site. More recently matched integration site and proviral sequencing (MIP-Seq) has allowed the sequence of individual proviruses to be linked to integration site in the genome6. However, this method relies on whole genome amplification of isolated HIV-1 genomes, with separate reactions to identify the integration site and sequence the associated provirus6. As a result, this method is quite labor intensive limiting the number of proviruses one can reasonably interrogate.
Retroviruses are primarily associated with the diseases they provoke through the infection of somatic cells. Over the course of evolutionary time they have also played a major role in shaping the genome. Retroviral invasion of the germ line has occurred multiple times, resulting in the remarkable fact that endogenous retrovirus (ERV)-like elements comprise a larger proportion of the human genome (8%) than protein coding sequences (˜1.5%)12. With the availability of multiple vertebrate genome assemblies, much of the focus has been on comparison of ERVs between species. However, single genomes represent a fraction of the variation within a species, prompting some to take a population approach to investigate ERV-host genome variation13. While capable of identifying polymorphic ERVs in the population, approaches relying on conventional paired-end libraries and short reads cannot capture the sequence of the provirus beyond the first few hundred bases of the proviral long terminal repeat (LTR), leaving the variation within uncharted.
In contrast to retroviruses, papillomaviruses do not integrate into the host genome as part of their lifecycle. Human papillomavirus (HPV) is usually present in the cell as a multi copy circular episome (˜8 kb in size), however in a small fraction of infections, it can integrate into the host genome leading to the dysregulation of the viral oncogenes E6 and E714. Genome wide profiling of HPV integration sites via capture probes and Illumina sequencing has also identified hotspots of integration indicating that disruption of host genes may also play a role in driving clonal expansion15. As a consequence, HPV integration is a risk factor for the development of cervical carcinoma16.
HPV accounts for >95% of cervical carcinoma and ˜70% of oropharyngeal carcinoma52. While infection with a high-risk HPV strain (HPV16 & HPV18) is generally necessary for the development of cervical cancer, it is not sufficient41. The progression towards cancer is driven by a combination of both viral and host factors; as a result, a greater understanding of both is required to identify high risk infections41.
The HPV vaccine will cut the rate of cervical cancer in vaccinated women by ˜75%; however, it will take 20 to 30 years for the full impact of vaccination to become apparent64. Additionally, vaccination uptake varies widely, with the Belgian French speaking community only having a 36% uptake in 201865. As a consequence, HPV induced cervical cancer will remain a major health issue in the medium term and the cause of a nontrivial number of cancers into the foreseeable future.
The centrality of HPV integration in carcinogenesis makes a deeper understanding of the process a priority, both to understand the basic biology behind HPV induced cervical cancer, but also because of its potential as a biomarker to identify high risk cases sooner. The study of HPV integration is hampered by the unpredictability of the breakpoint sites in the integrated HPV genome. This limits the applicability of approaches based on ligation mediated PCR and short read sequencing. Techniques such as real-time PCR can identify HPV infections, but cannot identify integrations associated with clonal expansion. Biotin capture probes and Illumina sequencing have provided an unbiased genome wide picture of integration sites in cervical carcinomas, hinting at potential hot spots of integration15. However, this technique is not suited to exploring precancerous stages, where only a small fraction of the cells carries integrated virus. Looking beyond integration sites, work on HPV16 using a targeted sequencing approach has shown that conservation of the HPV E7 gene is critical for carcinogenesis66.
The application of NGS, and of Sanger sequencing before it, has had a large impact on our understanding of both exogenous and endogenous proviruses. The development of long-read sequencing, linked-read technologies and associated computational tools17 has the potential to address questions inaccessible to short reads. Groups investigating Long interspersed nuclear elements-1 (LINE-1) insertions16 and the koala retrovirus, KoRV19 have highlighted this potential and described techniques utilizing the Oxford Nanopore and PacBio platforms to investigate insertion sites and retroelement structure.
To more fully exploit the potential of long reads we developed Pooled CRISPR Inverse PCR sequencing (PCIP-seq), a method that leverages selective cleavage of circularized DNA fragments carrying proviral DNA/integrated viral DNA with CRISPR guide RNAs or a pool of CRISPR guide RNAs, followed by inverse long-range PCR and multiplexed sequencing, such as on the Oxford Nanopore MinION platform. Using this approach, we can now simultaneously identify the integration site and track clone abundance while also sequencing the provirus/viral DNA inserted at that position. We have successfully applied the technique to the retroviruses HTLV-1, HIV-1 and BLV, endogenous retroviruses in cattle and sheep as well as HPV18 and HPV16.
In an aspect, the invention provides a method for detecting an integration pattern of human papillomavirus (HPV) in genomic DNA of a subject, said method comprising:
(a) fragmenting genomic DNA isolated from a sample of the subject;
(b) circularizing the DNA fragments to generate circular DNA;
(c) removing non-circularized DNA fragments;
(d) linearizing the circular DNA using an RNA-guided DNA endonuclease and at least one guide RNA or at least one pool of guide RNAs, which target a region in the viral genome, to generate linearized DNA molecules;
(e) amplifying the linearized DNA molecules by an inverse amplification reaction using a pair of primers arranged about and oriented outwardly with respect to the linearization site;
(f) sequencing the amplified DNA;
(g) mapping the sequenced DNA to human genomic DNA sequence; and
(h) optionally mapping the sequenced DNA to the HPV genome.
The invention also provides for a kit for detecting an integration pattern of human papillomavirus (HPV) in genomic DNA of a subject according to the method of the invention, said kit comprising:
A further aspect relates to a method for monitoring the progression of a human papillomavirus (HPV) infection in a subject comprising:
A further aspect relates to a method for assessing a risk of having or developing a cancer in a subject comprising:
The teaching of the application is illustrated by the following Figures which are to be considered as illustrative only and do not in any way limit the scope of the claims.
As used herein, the singular forms “a”, “an”, and “the” include both singular and plural referents unless the context clearly dictates otherwise.
The terms “comprising”, “comprises” and “comprised of” as used herein are synonymous with “including”, “includes” or “containing”, “contains”, and are inclusive or open-ended and do not exclude additional, non-recited members, elements or method steps. The terms also encompass “consisting of” and “consisting essentially of”, which enjoy well-established meanings in patent terminology.
The recitation of numerical ranges by endpoints includes all numbers and fractions subsumed within the respective ranges, as well as the recited endpoints. This applies to numerical ranges irrespective of whether they are introduced by the expression “from . . . to . . . ” or the expression “between . . . and . . . ” or another expression.
The terms “about” or “approximately” as used herein when referring to a measurable value such as a parameter, an amount, a temporal duration, and the like, are meant to encompass variations of and from the specified value, such as variations of +/−10% or less, preferably +/−5% or less, more preferably +/−1% or less, and still more preferably +/−0.1% or less of and from the specified value, insofar such variations are appropriate to perform in the disclosed invention. It is to be understood that the value to which the modifier “about” or “approximately” refers is itself also specifically, and preferably, disclosed.
Whereas the terms “one or more” or “at least one”, such as one or more members or at least one member of a group of members, are clear per se, by means of further exemplification, the terms encompass inter alia a reference to any one of said members, or to any two or more of said members, such as, e.g., any ≥3, ≥4, ≥5, ≥6 or ≥7 etc. of said members, and up to all said members. In another example, “one or more” or “at least one” may refer to 1, 2, 3, 4, 5, 6, 7 or more.
The discussion of the background to the invention herein is included to explain the context of the invention. This is not to be taken as an admission that any of the material referred to was published, known, or part of the common general knowledge in any country as of the priority date of any of the claims.
Throughout this disclosure, various publications, patents and published patent specifications are referenced by an identifying citation. All documents cited in the present specification are hereby incorporated by reference in their entirety. In particular, the teachings or sections of such documents herein specifically referred to are incorporated by reference.
Unless otherwise defined, all terms used in disclosing the invention, including technical and scientific terms, have the meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. By means of further guidance, term definitions are included to better appreciate the teaching of the invention. When specific terms are defined in connection with a particular aspect of the invention or a particular embodiment of the invention, such connotation or meaning is meant to apply throughout this specification, i.e., also in the context of other aspects or embodiments of the invention, unless otherwise defined.
In the following passages, different aspects or embodiments of the invention are defined in more detail. Each aspect or embodiment so defined may be combined with any other aspect(s) or embodiment(s) unless clearly indicated to the contrary. In particular, any feature indicated as being preferred or advantageous may be combined with any other feature or features indicated as being preferred or advantageous.
Reference throughout this specification to “one embodiment” or “an embodiment” means that a particular feature, structure or characteristic described in connection with the embodiment is included in at least one embodiment of the present invention. Thus, appearances of the phrases “in one embodiment” or “in an embodiment” in various places throughout this specification are not necessarily all referring to the same embodiment, but may. Furthermore, the particular features, structures or characteristics may be combined in any suitable manner, as would be apparent to a person skilled in the art from this disclosure, in one or more embodiments. Furthermore, while some embodiments described herein include some but not other features included in other embodiments, combinations of features of different embodiments are meant to be within the scope of the invention, and form different embodiments, as would be understood by those in the art. For example, in the appended claims, any of the claimed embodiments can be used in any combination.
For general methods relating to the invention, reference is made inter alia to well-known textbooks, including, e.g., “Molecular Cloning: A Laboratory Manual, 4th Ed.” (Green and Sambrook, 2012, Cold Spring Harbor Laboratory Press), “Current Protocols in Molecular Biology” (Ausubel et al., 1987).
Provided herein is a method for detecting an integration pattern of a virus in genomic DNA of a subject, said method comprising:
(a) fragmenting genomic DNA isolated from a sample of the subject;
(b) circularizing the DNA fragments to generate circular DNA;
(c) removing non-circularized DNA fragments;
(d) optionally linearizing the circular DNA using an RNA-guided DNA endonuclease and at least one guide RNA or at least one pool of guide RNAs, which target a region in the viral genome to generate linearized DNA molecules;
(e) amplifying the circular DNA or the linearized DNA molecules by an inverse amplification reaction using a pair of primers arranged about and oriented outwardly with respect to the linearization site;
(f) sequencing the amplified DNA;
(g) mapping the sequenced DNA to genomic DNA sequence of the subject; and
(h) optionally mapping the sequenced DNA to the viral genome.
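Steps (b), (d) and (e) can be pictured with a toy string model. The following is a minimal sketch only: the sequences, the 6-nt primers and the helper names (revcomp, linearize, amplify) are invented for illustration and are not part of the disclosed method. It shows that outward-facing primers give no product on the original linear fragment, whereas after circularization and cleavage inside the viral sequence the product carries viral sequence at both ends and the host flanks in between.

```python
# Toy model of circularization, linearization and inverse amplification.
# All sequences and helper names are hypothetical, for illustration only.

def revcomp(seq: str) -> str:
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def linearize(circle: str, cut: int) -> str:
    """Open a circular molecule (a string whose two ends are joined) at index `cut`."""
    return circle[cut:] + circle[:cut]

def amplify(template: str, fwd_primer: str, rev_primer: str):
    """Return the product on a linear template, or None if the primers do not face each other."""
    start = template.find(fwd_primer)
    end = template.find(revcomp(rev_primer))
    if start == -1 or end == -1 or end < start:
        return None
    return template[start:end + len(rev_primer)]

# A sheared fragment: host flank, integrated viral DNA, host flank.
host_left, host_right = "GGGGGAAAAA", "CCCCCTTTTT"
viral = "ACGTACCTGATTGGCCAATT"
cut_in_virus = 10  # position of the linearization site within the viral DNA

fragment = host_left + viral + host_right
# Primers flank the linearization site and point away from it.
fwd_primer = viral[cut_in_virus:cut_in_virus + 6]
rev_primer = revcomp(viral[cut_in_virus - 6:cut_in_virus])

# Treat `fragment` as circular (ends ligated), then cut inside the virus.
linear = linearize(fragment, len(host_left) + cut_in_virus)
amplicon = amplify(linear, fwd_primer, rev_primer)
print(amplicon)
```

In this model the junction between host_right and host_left inside the amplicon is the ligation point created in step (b); mapping the amplicon to the host genome in step (g) would place the two flanks on either side of the integration site.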
As used herein, the terms “integration pattern” or “viral integration pattern” refer to the pattern of viral DNA that is integrated in host genomic DNA. The term may refer to a visualized DNA pattern comprising viral DNA and host genomic DNA, as well as to information quantified by or correlated with such DNA pattern. Non-limiting examples of information quantified by, or correlated with, an integration pattern include the presence or absence of integrated viral DNA; the number of viral integration sites in host genomic DNA or the average number of such integrations; the insertion site(s) of viral DNA in the host genome; mutations (e.g. deletions, duplications, SNPs, etc.) in the viral DNA integrations; the size in kb of viral DNA integrations into host genomic DNA; the number of viral genomes integrated at each integration site; the number of viral integration sites per cellular genome; the mean number of viral genomes integrated per integration site (or the mean size of integration sites); the maximum number of viral genomes integrated per integration site (or the maximum size of integration sites); the minimum number of viral genomes integrated per integration site (or the minimum size of integration sites); the number of viral genomes integrated per cellular genome; and any combinations thereof.
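By way of illustration, several of the quantities listed above can be derived from a simple table of detected integration sites. The following is a minimal sketch, assuming a hypothetical record layout (chromosome, position, number of viral genome copies at the site) and invented values; it is not a prescribed data format.

```python
# Hypothetical integration-site records: (chromosome, position, viral genome copies).
# The layout and the values are assumptions made for this example only.
sites = [
    ("chr8", 128_000_000, 1),
    ("chr3", 45_500_000, 3),
    ("chr17", 7_600_000, 2),
]

def summarize_integration_pattern(records):
    """Compute a few of the summary quantities named in the definition above."""
    copies = [n for _, _, n in records]
    return {
        "integrated_dna_present": len(records) > 0,
        "n_integration_sites": len(records),
        "total_viral_genomes": sum(copies),
        "mean_genomes_per_site": sum(copies) / len(copies) if copies else 0.0,
        "max_genomes_per_site": max(copies, default=0),
        "min_genomes_per_site": min(copies, default=0),
    }

summary = summarize_integration_pattern(sites)
print(summary)
```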
The method of the invention allows detection of the integration of viruses such as retroviruses that integrate into a host cell genome as part of their lifecycle, as well as viruses such as papillomaviruses that do not integrate into a host cell genome as part of their lifecycle. The virus may be a DNA virus or an RNA virus. DNA viruses include, for example, human papillomavirus (HPV); RNA viruses include, for example, human T-lymphotropic virus (HTLV, particularly HTLV-1), human immunodeficiency virus (HIV) and bovine leukemia virus (BLV). In embodiments, the virus is a retrovirus. In further embodiments, the retrovirus is an exogenous retrovirus such as HTLV, in particular HTLV-1, HIV or BLV. In further embodiments, the retrovirus is an endogenous retrovirus. In other embodiments, the virus is HPV. In further embodiments, said HPV is a high risk HPV such as a HPV strain 16, 18, 31, 33, 35, 39, 45, 51, 55, 56, 58, 59 or 66, preferably a HPV strain 18 or a HPV strain 16.
“Integrated viral DNA” refers to a complete or partial genome of a virus that is integrated into a host cell chromosome. “Episomal viral DNA” refers to non-integrated viral DNA, i.e., viral DNA that has not integrated into a host cell chromosome. “Provirus” refers to viral DNA, in particular retroviral DNA, that is integrated into the DNA of a host cell as a stage of virus replication, or a state that persists over longer periods of time as either inactive viral infections or an endogenous viral element.
The terms “subject” and “host” and “patient” are used interchangeably and refer to a human or non-human animal that is tested for the presence of integrated viral DNA. The host is not particularly limited as long as the virus infects and viral nucleic acid is integrated into the genome. Preferably, the host is a mammal, most preferably a human. Hosts may be domestic animals such as cows, horses, pigs, sheep, goats and chickens. In preferred embodiments, the subject is a human. In embodiments, the subject is an ovine. In embodiments, the subject is a bovine.
The term “sample” generally refers to a material of biological origin that includes cells. Samples can include, e.g., an in vitro cell culture or tissue obtained from a subject as defined herein. Samples can be purified or semi-purified to remove certain constituents (e.g., extracellular constituents or non-target cell populations). In embodiments, the sample comprises cervical or vaginal epithelial cells, such as wherein the sample is a pap smear. In embodiments, the sample comprises oropharyngeal epithelial cells, such as wherein the sample is an oropharyngeal swab. In embodiments, the sample comprises peripheral blood mononuclear cells (PBMC), in particular CD4+ T cells, such as wherein the sample is a blood sample, e.g. a whole blood sample. In embodiments, the sample is a sperm sample. Isolation of DNA from the samples can be carried out by standard methods.
In step (a) genomic DNA of the subject is fragmented. In embodiments, fragmenting the genomic DNA of the subject comprises shearing the genomic DNA, thereby producing (sheared) DNA fragments. Shearing of the genomic DNA may occur e.g. by acoustic or mechanical means as known to the skilled person. In further embodiments, shearing of the genomic DNA of the subject is followed by end-repair of the sheared DNA fragments.
In embodiments, the (sheared) DNA fragments have an average size of about the size of the viral genome. In particular embodiments, the (sheared) DNA fragments have an average size of between 6000 and 10000 basepairs (bp), preferably between 7000 and 9000 bp, more preferably about 8000 bp.
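As a simple illustration of this size criterion, measured fragment lengths can be checked against the 6000 to 10000 bp window. The following sketch assumes that a list of fragment lengths is available (for example from a fragment analyzer); the function name and the example values are hypothetical.

```python
def mean_size_in_window(fragment_lengths_bp, low=6000, high=10000):
    """Return the mean fragment size and whether it lies within [low, high] bp."""
    mean_bp = sum(fragment_lengths_bp) / len(fragment_lengths_bp)
    return mean_bp, low <= mean_bp <= high

# Invented example lengths in basepairs.
mean_bp, ok = mean_size_in_window([7500, 8200, 8300])
print(mean_bp, ok)
```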
In step (b) of the method (sheared) DNA fragments are circularized. Circularization or intramolecular ligation of the DNA fragments may be achieved by incubation of the DNA fragments in the presence of a DNA ligase, e.g. T4 DNA ligase, as known to the skilled person, thereby generating circular DNA.
Step (c) of the method encompasses removal of remaining linear DNA. In embodiments, non-circularized DNA is removed by digestion. Selective digestion of non-circularized or linear DNA may be achieved using an appropriate selective DNase as commercially available (e.g. Plasmid-Safe™ ATP-Dependent DNase (Epicentre)).
Preferably, the circular DNA is linearized in step (d) before the amplification step (e), which improves the efficiency of the amplification reactions. Linearization of the circular DNA can be achieved using an RNA-guided DNA endonuclease, such as a CRISPR-Cas system as known to the skilled person, and corresponding guide RNAs. In particular embodiments, the RNA-guided DNA endonuclease is a Cas9 endonuclease.
In order to achieve selective linearization of circular DNA that comprises integrated viral DNA and host DNA, guide RNA(s) are used that target a region of the viral DNA. Preferably, the “linearization site”, i.e. the region in the viral DNA that is targeted by a guide RNA or a pool of guide RNAs, comprises a region of the viral genome that is prone to integration in host DNA. For example, for HPV, a linearization site may comprise E6 gene and/or E7 gene. For retroviruses, a linearization site may be adjacent to a 5′LTR or adjacent to a 3′LTR.
Particular guide RNA targeting domains and pools of guide RNA targeting domains are provided in Table 1. The sequences set forth in SEQ ID NO:7-79 refer to oligonucleotide sequences used for synthesizing the guide RNAs. These sequences comprise a “targeting domain” as well as accessory sequences required by the kit, in particular the EnGen® sgRNA Synthesis Kit (New England Biolabs), for synthesizing the guide RNA, which elements can be identified by the skilled person. By way of example, oligonucleotide sequences encoding HPV18 and HPV16 gRNAs and their corresponding targeting domain and flanking PAM site (underlined) are summarized in the below table. With “targeting domain” is meant herein a sequence that is capable of hybridizing to a sequence in the region of the viral DNA that is targeted by the guide RNA (i.e. in the linearization site of the viral DNA). With “PAM site” is meant herein a protospacer adjacent motif as is known in the art. When reference is made to a guide RNA comprising a sequence set forth in any one of SEQ ID NO:7-79, a guide RNA comprising the targeting domain of said sequence is envisaged, i.e. the sequence without the sequence TTCTAATACGACTCACTATA (SEQ ID NO:244) at the 5′ end and without the sequence GTTTTAGAGCTAGA (SEQ ID NO:245) at the 3′ end. When reference is made to a guide RNA comprising a sequence set forth in any one of SEQ ID NO:232-243, a guide RNA comprising the targeting domain of said sequence is envisaged, i.e. the sequence without the NGG sequence at the 3′ end. As will be appreciated by the skilled person, the guide RNA comprises, in addition to a targeting domain, a tracr and a tracr mate as known in the art, wherein the tracr and tracr mate may be provided as a chimeric molecule. The guide RNA is an RNA molecule and will therefore comprise the base uracil (U), while the oligonucleotide encoding the gRNA molecule comprises the base thymine (T).
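The relationship between a synthesis oligonucleotide, its targeting domain and the corresponding RNA sequence can be sketched as follows. The two accessory sequences are those recited above (SEQ ID NO:244 and SEQ ID NO:245); the 20-nt target used in the example is invented and is not one of the sequences of Table 1, and the function names are hypothetical.

```python
FIVE_PRIME_ACCESSORY = "TTCTAATACGACTCACTATA"   # SEQ ID NO:244
THREE_PRIME_ACCESSORY = "GTTTTAGAGCTAGA"        # SEQ ID NO:245

def targeting_domain_from_oligo(oligo: str) -> str:
    """Strip the 5' and 3' accessory sequences from a synthesis oligonucleotide."""
    if not (oligo.startswith(FIVE_PRIME_ACCESSORY) and oligo.endswith(THREE_PRIME_ACCESSORY)):
        raise ValueError("oligonucleotide lacks the expected accessory sequences")
    return oligo[len(FIVE_PRIME_ACCESSORY):-len(THREE_PRIME_ACCESSORY)]

def targeting_domain_from_pam_sequence(seq: str) -> str:
    """Strip a 3' NGG PAM from a target sequence listed together with its PAM."""
    if seq[-2:] != "GG":
        raise ValueError("sequence does not end in an NGG PAM")
    return seq[:-3]

def as_rna(dna: str) -> str:
    """Write a DNA sequence as RNA (T replaced by U)."""
    return dna.replace("T", "U")

# Invented 20-nt target, for illustration only.
example_target = "GACCTTGGAACTGACTGATC"
example_oligo = FIVE_PRIME_ACCESSORY + example_target + THREE_PRIME_ACCESSORY

print(targeting_domain_from_oligo(example_oligo))
print(as_rna(example_target))
```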
To improve cleavage of a linearization site, more than one guide RNA targeting said linearization site can be used. As used herein, a “pool of guide RNAs” refers to a set of guide RNAs that target a defined region of the viral DNA, i.e. the linearization site. It is to be understood that each guide RNA within a pool of guide RNAs may be capable of hybridizing to different, non-overlapping or partially overlapping, sequences within said linearization site. A pool of guide RNAs may comprise at least 2 or at least 3 guide RNAs, preferably at least 3 guide RNAs, more preferably between 3 and 10 or between 3 and 8 guide RNAs, such as 3, 4, 5, 6, 7 or 8 guide RNAs.
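One way to shortlist candidate target sequences for such a pool is to scan the chosen linearization region for protospacers followed by an NGG PAM on either strand. The sketch below assumes a Cas9-type endonuclease with an NGG PAM and a 20-nt protospacer; the protospacer length, the function names and the toy region are assumptions for illustration, and any real guide selection would also need to consider specificity against the host genome.

```python
def revcomp(seq: str) -> str:
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def find_protospacers(region: str, length: int = 20):
    """List (strand, protospacer) pairs whose 3' neighbour is an NGG PAM."""
    hits = []
    for strand, seq in (("+", region), ("-", revcomp(region))):
        # The last usable start leaves room for the protospacer plus 3 nt of PAM.
        for i in range(len(seq) - length - 2):
            pam = seq[i + length:i + length + 3]
            if pam[1:] == "GG":
                hits.append((strand, seq[i:i + length]))
    return hits

# Toy region: one protospacer followed by a TGG PAM on the forward strand.
toy_region = "A" * 20 + "TGG" + "CCC"
candidates = find_protospacers(toy_region)
print(candidates)
```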
The circular DNA may be linearized using a first guide RNA or a first pool of guide RNAs, which target a first region of the viral DNA, and at least one other guide RNA or at least one other pool of guide RNAs, which target one or more non-overlapping regions of the viral DNA. When targeting more than one linearization site, a more complete integration pattern may be obtained (e.g. more integration sites may be detected).
Accordingly, in embodiments, a first portion of the circular DNA is linearized using a first guide RNA or a first pool of guide RNAs that target a first region of the viral DNA to generate a first set of linearized DNA molecules; and
a second portion of the circular DNA is linearized using a second guide RNA or a second pool of guide RNAs that target a second region of the viral DNA to generate a second set of linearized DNA molecules,
wherein the first region and the second region of the viral DNA do not overlap.
In embodiments of the method for detecting an integration pattern of a retrovirus in genomic DNA of a subject, a first portion of the circular DNA is linearized using a first guide RNA or a first pool of guide RNAs that target a region of the viral DNA adjacent to the 5′ long terminal repeat (LTR) to generate a first set of linearized DNA molecules; and
a second portion of the circular DNA is linearized using a second guide RNA or a second pool of guide RNAs that target a region of the viral DNA adjacent to the 3′LTR to generate a second set of linearized DNA molecules.
In embodiments of the method for detecting an integration pattern of a HPV in genomic DNA of a subject, a first portion of the circular DNA is linearized using a first guide RNA or a first pool of guide RNAs that target a first region of the viral DNA comprising E6 gene and/or E7 gene to generate a first set of linearized DNA molecules; and
a second portion of the circular DNA is linearized using a second guide RNA or a second pool of guide RNAs that target a second region of the viral DNA to generate a second set of linearized DNA molecules, wherein said first and second regions of the viral DNA do not overlap.
In the amplification step (e), the circular DNA or preferably the linearized DNA molecules are amplified by an inverse amplification reaction using a pair of primers arranged about and oriented outwardly with respect to the linearization site. In particular, a primer pair is used comprising a forward primer capable of hybridizing to a viral DNA sequence in a 3′ flanking region of the viral DNA region targeted by the guide RNA or the pool of guide RNAs and a reverse primer capable of hybridizing to a viral DNA sequence in a 5′ flanking region of the viral DNA region targeted by the guide RNA or the pool of guide RNAs.
Particular primer pairs corresponding to the guide RNA targeting domains or pools of guide RNA targeting domains of Table 1 are provided in Table 2. The primers in Table 2 may comprise a tail, in particular a tail consisting of the sequence TTTCTGTTGGTGCTGATATTGC (SEQ ID NO:246) or the sequence ACTTGCCTGTCGCTCTATCTTC (SEQ ID NO:247). When reference is made herein to a primer comprising a sequence set forth in any one of SEQ ID NO:80-127, the tailed primer as well as a corresponding primer without the tail or with another tail are envisaged herein.
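The relationship between tailed and untailed primers can be sketched as follows. The two tail sequences are those recited above (SEQ ID NO:246 and SEQ ID NO:247); placing the tail at the 5′ end of the primer is an assumption made for this example, and the virus-specific primer sequence and function names are invented.

```python
TAIL_A = "TTTCTGTTGGTGCTGATATTGC"  # SEQ ID NO:246
TAIL_B = "ACTTGCCTGTCGCTCTATCTTC"  # SEQ ID NO:247

def add_tail(primer: str, tail: str) -> str:
    """Prepend a tail to a virus-specific primer (5' placement is assumed here)."""
    return tail + primer

def strip_tail(primer: str) -> str:
    """Return the virus-specific part of a primer, removing a known tail if present."""
    for tail in (TAIL_A, TAIL_B):
        if primer.startswith(tail):
            return primer[len(tail):]
    return primer

# Invented virus-specific primer, for illustration only.
core_primer = "GGATCCTTGACAGTCA"
tailed = add_tail(core_primer, TAIL_A)
print(tailed)
print(strip_tail(tailed))
```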
Preferably, each set of linearized DNA molecules (i.e. linearized DNA molecules generated by one guide RNA or one pool of guide RNAs as described herein and thus characterized by cleavage in a defined linearization site) is amplified in a separate amplification reaction using an appropriate pair of primers arranged about and oriented outwardly with respect to the linearization site.
In further embodiments, the linearization step and the amplification step may be carried out in a single solution, wherein a guide RNA or a pool of guide RNAs and a corresponding pair of primers are multiplexed.
In preferred embodiments, said amplification reaction comprises a long range amplification reaction such as a long range PCR. As used herein, “long range PCR” refers to a method to amplify DNA fragments of increased size, typically of more than 3-5 kb, using a modified DNA polymerase or high-fidelity DNA polymerase. DNA polymerases for long range PCR are known to the skilled person and are commercially available.
In further embodiments, tailed primers are used in the amplification reaction and the amplicons are subjected to a second amplification reaction using a set of indexing primers, thereby generating indexed amplification products. This facilitates multiplexed sequencing of the amplified DNA.
Particular methods are provided herein for detecting an integration pattern of a retrovirus in genomic DNA of a subject, said method comprising:
(a) fragmenting genomic DNA isolated from a sample of the subject;
(b) circularizing the DNA fragments to generate circular DNA;
(c) removing non-circularized DNA fragments;
(d) linearizing the circular DNA using an RNA-guided DNA endonuclease and at least one guide RNA or at least one pool of guide RNAs, which target a region in the viral genome adjacent to the 5′ long terminal repeat (LTR) or adjacent to the 3′LTR to generate linearized DNA molecules;
(e) amplifying the linearized DNA molecules by an inverse amplification reaction using a pair of primers arranged about and oriented outwardly with respect to the linearization site;
(f) sequencing the amplified DNA;
(g) mapping the sequenced DNA to genomic DNA sequence of the subject; and
(h) optionally mapping the sequenced DNA to the viral genome.
In further embodiments of the method for detecting an integration pattern of a retrovirus in genomic DNA of a subject, the linearization of the circular DNA comprises linearizing a first portion of the circular DNA using a first guide RNA or a first pool of guide RNAs, preferably a first pool of guide RNAs, which target a region of the viral DNA adjacent to the 5′ long terminal repeat (LTR) to generate a first set of linearized DNA molecules, and
linearizing a second portion of the circular DNA using a second guide RNA or a second pool of guide RNAs, preferably a second pool of guide RNAs, which target a region of the viral DNA adjacent to the 3′LTR to generate a second set of linearized DNA molecules; and
the amplification of the linearized DNA molecules comprises amplifying the first set of linearized DNA molecules using a first pair of primers arranged about and oriented outwardly with respect to the viral DNA region adjacent to the 5′ LTR targeted by the first guide RNA or the first pool of guide RNAs,
and amplifying the second set of linearized DNA molecules using a second pair of primers arranged about and oriented outwardly with respect to the viral DNA region adjacent to the 3′ LTR targeted by the second guide RNA or the second pool of guide RNAs.
A further aspect relates to a kit for performing the method described herein, said kit comprising:
In further embodiments, the kit comprises:
a second pair of primers arranged about and oriented outwardly with respect to a second linearization site in the viral DNA defined by said second guide RNA or said second pool of guide RNAs.
Particular kits are provided herein for the detection of an integration pattern of a HPV in genomic DNA of a subject according to the method disclosed herein, said kit comprising:
In other embodiments, said kit comprises:
In further embodiments, said kit for the detection of an integration pattern of a HPV comprises:
In particular embodiments, said second region of the viral DNA comprises a region of the viral DNA comprising L1 gene or a region of the viral DNA adjacent to L1 gene.
Particular embodiments for the guide RNAs, pools of guide RNAs and primer pairs are as described above for the method. Particular combinations of guide RNA targeting domains or pools of guide RNA targeting domains and primer pairs are described in Tables 1 and 2.
The kit may also contain reagents, e.g., buffers, enzymes and other necessary reagents, for performing the method described above. In particular embodiments, the kit further comprises an RNA-guided DNA endonuclease. In particular embodiments, the kit further comprises a DNA polymerase, preferably a DNA polymerase for long range PCR.
The various components of the kit may be present in separate containers or certain compatible components may be pre-combined into a single container, as desired.
The herein disclosed aspects and embodiments of the invention are further supported by the following non-limiting examples.
Samples
Both the BLV infected sheep7 and HTLV-1 samples7,20 have been previously described. Briefly, the sheep were infected with the molecular clone pBLV34421, following the experimental procedures approved by the University of Saskatchewan Animal Care Committee based on the Canadian Council on Animal Care Guidelines (Protocol #19940212). The HTLV-1 samples7,20 were obtained with informed consent following the institutional review board-approved protocol at the Necker Hospital, University of Paris, France, in accordance with the Declaration of Helsinki. The BLV bovine samples were natural infections, obtained from commercially kept adult dairy cows in Alberta, Canada. Sampling was approved by VSACC (Veterinary Sciences Animal care Committee) of the University of Calgary: protocol number: AC15-0159. The bovine 571 used for ERV identification was collected as part of this cohort. The two sheep samples used for Jaagsiekte sheep retrovirus (enJSRV) identification were the BLV infected ovine samples (220 & 221 (032014)), with a PVL of 3.8 and 16% respectively. PBMCs were isolated using standard Ficoll-Hypaque separation. The DNA for the bovine Mannequin was extracted from sperm, while the DNA for bovine 10201e6 was extracted from whole blood using standard procedures. The HIV-1 U1 cell line DNA sequenced without dilution was provided by Dr. Carine Van Lint, IBMM, Gosselies, Belgium. The HIV-1 U1 cell line dilutions in Jurkat were generated at Ghent University Hospital.
HPV material was prepared from PAP smears obtained from HPV-infected patients at the CHU Liege University hospital. Both patients were PCR positive for HPV18, HPV18_PY was classified as having Atypical Squamous Cell of Undetermined Significance (ASC-US), while HPV18_PX was classified as having Atypical Glandular Cells (AGC). Patients provided written informed consent and the study was approved by the Comité d'Ethique Hospitalo-Facultaire Universitaire de Liege (Reference number: 2019/139). No statistical test was used to determine adequate sample size and the study did not use blinding.
PCIP-Seq
Total genomic DNA isolation was carried out using the Qiagen AllPrep DNA/RNA/miRNA kit (BLV, HTLV-1 and HPV infected individuals) or the Qiagen DNeasy Blood & Tissue Kit (HIV-1 patients) according to the manufacturer's protocol. High molecular weight DNA was sheared to ˜8 kb using Covaris g-Tubes™ (Woburn, Mass.) or a Megaruptor (Diagenode), followed by end-repair using the NEBNext EndRepair Module (New England Biolabs). Intramolecular circularization was achieved by overnight incubation at 16° C. with T4 DNA Ligase. Remaining linear DNA was removed with Plasmid-Safe-ATP-Dependent DNAse (Epicentre, Madison Wis.). Guide RNAs were designed using chopchop (http://chopchop.cbu.uib.no/index.php). The EnGen™ sgRNA Template Oligo Designer (http://nebiocalculator.neb.com/#!/sgrna) provided the final oligo sequence. Oligos were synthesized by Integrated DNA Technologies (IDT). Oligos were pooled and guide RNAs synthesized with the EnGen sgRNA Synthesis kit, S. pyogenes (New England Biolabs). Selective linearization reactions were performed with the Cas-9 nuclease, S. pyogenes (New England Biolabs). (See Example 3 for the rationale behind the use of CRISPR-Cas9 to cleave the circular DNA.) PCR primers flanking the cut sites were designed using primer3 (http://bioinfo.ut.ee/primer3/). Primers were tailed to facilitate the addition of Oxford Nanopore indexes in a subsequent PCR reaction. The linearized fragments were PCR amplified with LongAmp Taq DNA Polymerase (New England Biolabs) and purified using 1× AmpureXP beads (Beckman Coulter). A second PCR added the appropriate Oxford Nanopore index. PCR products were visualized on a 1% agarose gel, purified using 1× AmpureXP beads and quantified on a Nanodrop spectrophotometer.
Indexed PCR products were multiplexed and Oxford Nanopore libraries prepared with either the Ligation Sequencing Kit 1D (SQK-LSK108) or the 1D^2 Sequencing Kit (SQK-LSK308) (only the 1D reads were used). The resulting libraries were sequenced on Oxford Nanopore MinION R9.4 or R9.5 flow cells respectively. The endogenous retrovirus libraries were base called using albacore 2.3.1; all other PCIP-seq libraries were base called with Guppy 3.1.5 (https://nanoporetech.com) using the “high accuracy” base calling model. For the endogenous retrovirus libraries, demultiplexing was carried out via porechop (https://github.com/rrwick/Porechop) using the default settings. The HIV, HTLV-1, BLV and HPV PCIP-seq libraries were subjected to a more stringent demultiplexing with the guppy_barcoder (https://nanoporetech.com) tool using the --require_barcodes_both_ends option. The output was also passed through porechop; again, barcodes were required on both ends, adapter sequence was trimmed and reads with middle adapters were discarded. Oligos used can be found in Tables 1 and 2.
Identification of Proviral Integration Sites in PCIP-Seq
Reads were mapped with Minimap255 to the host genome with the proviral genome as a separate chromosome. In-house R-scripts were used to identify integration sites (IS). Briefly, chimeric reads that partially mapped to at least one extremity of the proviral genome were used to extract virus-host junctions and shear sites. Junctions within a 200 bp window were clustered together to form an “IS cluster”, compensating for sequencing/mapping errors. The IS retained corresponded to the position supported by the highest number of virus-host junctions in each IS cluster. Clone abundance was estimated based on the number of reads supporting each IS cluster. Reads sharing the same integration site and same shear site were considered PCR duplicates. Custom software, code description and detailed outline of the workflow are available on Github: https://github.com/GIGA-AnimalGenomics-BLV/PCIP.
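By way of non-limiting illustration, the clustering step described above can be sketched in Python as follows. This is not the in-house R code referred to above; the function name, the input tuple layout and the greedy rule (a cluster is closed once a junction lies more than 200 bp from the first junction of the cluster) are assumptions made for the example.

```python
from collections import Counter

def cluster_integration_sites(junctions, window=200):
    """Group virus-host junctions into IS clusters.

    junctions: iterable of (chrom, junction_pos, shear_pos), one per chimeric read.
    The greedy windowing rule below is an assumption of this sketch.
    """
    clusters, current = [], []
    for chrom, pos, shear in sorted(junctions):
        # Start a new cluster on a new chromosome or beyond the window.
        if current and (chrom != current[0][0] or pos - current[0][1] > window):
            clusters.append(current)
            current = []
        current.append((chrom, pos, shear))
    if current:
        clusters.append(current)

    results = []
    for members in clusters:
        counts = Counter(pos for _, pos, _ in members)
        # Retain the position supported by the most junctions (ties: smallest position).
        best_pos, _ = max(counts.items(), key=lambda kv: (kv[1], -kv[0]))
        results.append({
            "chrom": members[0][0],
            "position": best_pos,
            "reads": len(members),  # raw support, used for clone abundance
            # Reads sharing IS and shear site are treated as PCR duplicates.
            "unique_shear_sites": len({shear for _, _, shear in members}),
        })
    return results
```

In this sketch, `reads` corresponds to the read support of an IS cluster and `unique_shear_sites` to the support remaining after PCR duplicates are collapsed.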
Measure of Proviral Load (PVL) and Identification of Proviral Integration Sites (Illumina)
PVLs and integration sites of HTLV-1- and BLV-positive individuals were determined as previously described in Rosewick et al 20177 and Artesi et al 201720. PVL represents the percentage of infected cells, considering a single proviral integration per cell. Total HIV-1 DNA content of CD4 T-cell DNA isolates was measured by digital droplet PCR (ddPCR, QX200 platform, Bio-Rad, Temse, Belgium), as described by Rutsaert et al.56 The DNA was subjected to a restriction digest with EcoRI (Promega, Leiden, The Netherlands) for one hour, and diluted 1:2 in nuclease free water. HIV-1 DNA was measured in triplicate using 4 μL of the diluted DNA as input into a 20 μL reaction, while the RPP30 reference gene was measured in duplicate using 1 μL as input. Primers and probes are summarized in Table 3. Thermocycling conditions were as follows: 95° C. for 10 min, followed by 40 cycles of 95° C. for 30 s and 56° C. for 60 s, followed by 98° C. for 10 min. Data was analyzed with the ddpcRquant analysis software57.
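As a hedged illustration of the definition of PVL given above (percentage of infected cells, assuming a single proviral integration per cell), the following Python sketch converts viral and reference gene copy counts into a percentage. The function name and parameters are hypothetical, and the assumption of two reference gene copies per diploid cell is made for the example only; it is not a description of the ddpcRquant workflow.

```python
def proviral_load_percent(viral_copies, reference_copies,
                          viral_input_ul=1.0, reference_input_ul=1.0,
                          reference_copies_per_cell=2):
    """Percentage of infected cells, assuming one provirus per infected cell.

    Copy counts are normalized by the template volume used for each assay,
    since the viral and reference targets may be measured with different inputs.
    Two reference copies per cell is an assumption of this sketch.
    """
    viral_per_ul = viral_copies / viral_input_ul
    cells_per_ul = reference_copies / reference_input_ul / reference_copies_per_cell
    return 100.0 * viral_per_ul / cells_per_ul
```

For example, 40 viral copies measured from 4 μL of template and 2,000 reference copies measured from 1 μL correspond to 10 viral copies and 1,000 cells per μL, i.e. a PVL of 1%.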
Variant Calling
After PCR duplicate removal, proviruses with an IS supported by more than 10 reads were retained for further processing. SNPs were identified using LoFreq22 with default parameters; only SNPs with an allele frequency of >0.6 in the provirus associated with the insertion site were considered. We also called variants on proviruses supported by more than 10 reads without PCR duplicate removal (this greatly increased the number of proviruses examined). These data were used to explore the number of proviruses carrying the Tax 303 variant. Deletions were called on proviruses supported by more than 10 reads without PCR duplicate removal using in-house R scripts. Briefly, samtools pileup58 was used to compute coverage and deletions at base resolution. We used the changepoint detection algorithm PELT59 to identify genomic windows showing an abrupt change in coverage. Windows that showed at least a 4-fold increase in the frequency of deletions (absence of a nucleotide for that position within a read) were flagged as deletions and visually confirmed in IGV80.
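The two filters described above can be sketched in Python as follows, purely for illustration. The record layout, the function names, the use of the median deletion frequency as the baseline and the `floor` parameter are assumptions of the example. The candidate windows are taken as given (for instance from a changepoint detection step); PELT itself is not implemented here.

```python
from statistics import median

def filter_snps(calls, min_reads=10, min_af=0.6):
    """Keep SNPs with allele frequency > min_af in proviruses with > min_reads reads."""
    return [c for c in calls if c["reads"] > min_reads and c["af"] > min_af]

def flag_deletion_windows(del_freq, windows, fold=4.0, floor=0.01):
    """Flag windows whose mean deletion frequency is >= fold x the baseline.

    del_freq: per-base fraction of reads lacking a nucleotide at that position.
    windows:  (start, end) pairs, 0-based half-open, supplied by the caller.
    The baseline (median across the provirus, with a small floor) is an
    assumption of this sketch.
    """
    baseline = max(median(del_freq), floor)
    flagged = []
    for start, end in windows:
        mean_freq = sum(del_freq[start:end]) / (end - start)
        if mean_freq >= fold * baseline:
            flagged.append((start, end))
    return flagged
```

Windows flagged in this way would still require visual confirmation, as described above.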
HIV-1 Proviral Sequences
Sequences of the two major proviruses integrated in chr2 (SEQ ID NO:5) and chrX (SEQ ID NO:4) of the U1 cell line were generated by initially mapping the reads from both platforms to the HIV-1 provirus, isolate NY5 (GenBank: M38431.1), where the 5′LTR sequence is appended to the end of the sequence to produce a full-length HIV-1 proviral genome reference. The sequence was then manually curated to produce the sequence for each provirus. To check for recombination, reads of selected clones were mapped to the sequence from the chrX provirus and the patterns of SNPs examined to determine if the variants matched the chrX or chr2 proviruses.
Endogenous Retroviruses
The sequence of bovine APOB ERV (SEQ ID NO:6) was generated by PCR amplifying the full length ERV with LongAmp Taq DNA Polymerase (New England Biolabs) from a Holstein suffering from cholesterol deficiency. The resultant PCR product was sequenced on the Illumina platform as described below. It was also sequenced with an Oxford Nanopore MinION R7 flow cell as previously described29. Full length sequence of the element was generated via manual curation. Guide RNAs and primer pairs were designed using this ERV reference. For the Ovine ERV we used the previously published enJSRV-7 sequence40 as a reference to design PCIP-seq guide RNAs and PCR primers.
As the ovine and bovine genomes contain sequences matching the ERV, mapping ERV PCIP-seq reads back to the reference genome creates a large pileup of reads in these regions. To avoid this, prior to mapping to the reference we first used BLAST61 to identify the regions in the reference genome containing sequences matching the ERV; we then used BEDtools62 to mask those regions. The appropriate ERV reference was then added as an additional chromosome in the reference.
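The masking step can be illustrated with the following minimal Python sketch, which replaces the bases of each matching region with N. The function names are hypothetical, and the interval convention (0-based, half-open, as in BED files) is an assumption of the example; the sketch stands in for, and does not reproduce, the BLAST and BEDtools commands used.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching (start, end) intervals (0-based, half-open)."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

def mask_sequence(seq, intervals):
    """Return seq with every base inside the given intervals replaced by 'N'."""
    parts, prev = [], 0
    for start, end in merge_intervals(intervals):
        parts.append(seq[prev:start])      # unmasked stretch before the hit
        parts.append("N" * (end - start))  # hard-masked ERV-like region
        prev = end
    parts.append(seq[prev:])
    return "".join(parts)
```

Because the masked sequence keeps its original length, coordinates of hybrid reads mapped to the flanking host DNA remain comparable with the unmasked reference.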
PCR validation and Illumina Sequencing
Clone specific PCR products were generated by placing primers in the flanking DNA as well as inside the provirus. LongAmp Taq DNA Polymerase (New England Biolabs) was used for amplification following the manufacturer's guidelines. Resultant PCR products were sheared to ˜400 bp using the Bioruptor Pico (Diagenode) and Nextera XT indexes added as previously described29. Illumina PCIP-seq libraries were generated in the same manner. Sequencing was carried out on either an Illumina MiSeq or NextSeq 500. Clone specific PCR products sequenced on Nanopore were indexed by PCR, multiplexed and libraries prepared using the Ligation Sequencing Kit 1D (SQK-LSK108) and sequenced on a MinION R9.4 flow cell. Oligos used can be found in Tables 4-7.
BLV References
The sequence (SEQ ID NO:1) of the pBLV344 provirus was generated via a combination of Sanger and Illumina based sequencing with manual curation of the sequence to produce a full length proviral sequence. The consensus BLV sequences for the bovine samples 1439 & 1053 (SEQ ID NO:3,2) were generated by first mapping the PCIP-seq Nanopore reads to the pBLV344 provirus. We then used Nanopolish63 to create an improved consensus. PCIP-seq libraries sequenced on the Illumina and Nanopore platforms were mapped to this improved consensus, visualized in IGV and manually corrected.
Genome References Used
Sheep=OAR3.1; Cattle=UMD3.1; Human=hg38; For HTLV-1 integration sites hg19 was used; HPV18=GenBank: AY262282.1; Sequences of the exogenous and endogenous proviruses can be found in SEQ ID NO:1-SEQ ID NO:6.
Data Availability
Sequence data that support the findings of this study have been deposited in the European Nucleotide Archive (ENA) hosted by the European Bioinformatics Institute (EMBL-EBI) and are accessible through study accession number PRJEB34495. All other relevant data are available within the article and its Supplementary Information files or from the corresponding authors upon reasonable request.
Code Availability
The code and a detailed outline of the PCIP-seq analysis workflow are publicly available on Github: https://github.com/GIGA-AnimalGenomics-BLV/PCIP
The genome size of the viruses targeted ranged from 6.8 to 9.7 kb, therefore we chose to shear the DNA to ˜8 kb in length. In most cases this creates two fragments for each provirus, one containing the 5′ end with host DNA upstream of the insertion site and the second with the 3′ end and downstream host DNA. Depending on the shear site the amount of host and proviral DNA in each fragment will vary (
Pooled CRISPR Inverse PCR sequencing (PCIP-seq) leverages long reads on the Oxford Nanopore MinION platform to sequence the insertion site and its associated provirus. The technique was applied to natural infections produced by three exogenous retroviruses, HTLV-1, BLV and HIV-1 as well as endogenous retroviruses in both cattle and sheep. The high efficiency of the method facilitated the identification of tens of thousands of insertion sites in a single sample. Thousands of SNPs and dozens of structural variants within proviruses were observed. While initially developed for retroviruses the method has also been successfully extended to DNA extracted from HPV positive PAP smears, where it could assist in identifying viral integrations associated with clonal expansion. An overview of the applications tested herein is provided in Table 8.
It is established practice to linearize plasmids (generally via cutting with a restriction enzyme) prior to their use as template in PCR. It is believed that this avoids supercoiling and thereby increases PCR efficiency67. Following the same logic, we speculated that linearizing our circularized DNA could also increase PCR efficiency.
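To make the geometry of the linearization step concrete, the following toy Python sketch models a sheared fragment as a string, circularizes it conceptually and cuts the circle at a position inside the viral portion. The function name and the toy sequences are hypothetical and serve only to illustrate the principle.

```python
def linearize_circle(fragment, cut_index):
    """Model self-ligation of a linear fragment followed by a single cut.

    The circle is opened at cut_index (0-based offset in the original
    fragment), so the bases on either side of the cut become the two ends
    of the new linear molecule.
    """
    return fragment[cut_index:] + fragment[:cut_index]

# Toy fragment: host DNA (lower case) followed by proviral DNA (upper case).
host, viral = "hhhh", "VVVVVV"
linear = linearize_circle(host + viral, len(host) + 2)
# linear == "VVVVhhhhVV": viral sequence now flanks the host DNA on both sides.
```

In the linearized molecule the viral sequences on either side of the cut site lie at the two termini, so primers that point away from the cut site in the proviral map face each other across the host DNA and the virus-host junction.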
Following clean up and elution in ˜40 μl of H2O we took an equal volume (3 μl) of each library and indexed them via PCR, in a 50 μl reaction volume and using 8 cycles. Again, following clean up, an equal volume of library was pooled and a nanopore library (LSK-109) was prepared and sequenced on a r9.4 flow cell. Base calling and demultiplexing was carried out as described in Example 1. The results are outlined in Table 9. In addition the coverage of the resultant reads is shown in
Table 9 shows that libraries prepared with the CRISPR cut generally produced more raw reads and a much larger fraction of them is composed of the desired chimeric reads containing proviral and host DNA. The CRISPR cut libraries also identified a large number of integration sites. The comparison with an Illumina based library prepared from the same timepoint, using ˜4 μg of template, shows that PCIP can identify more integration sites. This experiment also shows that only libraries with a size distribution that mirrors that observed in the sheared DNA should be sequenced; libraries with a preponderance of shorter fragments mainly represent nonspecific amplification.
Adult T-cell leukemia (ATL) is an aggressive cancer induced by HTLV-1. It is generally characterized by the presence of a single dominant malignant clone, identifiable by a unique proviral integration site. We and others have developed methods based on ligation mediated PCR and Illumina sequencing to simultaneously identify integration sites and determine the abundance of the corresponding clones2,7. We initially applied PCIP-seq to two HTLV-1 induced cases of ATL, both previously analyzed with our Illumina based method (ATL27 & ATL10020). In ATL100 both methods identify a single dominant clone, with >95% of the reads mapping to a single insertion site on chr18 (
In the case of ATL2, PCIP-seq showed three major proviruses located on chr5, chr16 and chr1, each responsible for ˜33% of the HTLV-1/host hybrid reads. We had previously established that these three proviruses are in a single clone via examination of the T-cell receptor gene rearrangement7. However, it is interesting to note that this was not initially obvious using our Illumina based method as the proviral insertion site on chr1 falls within a repetitive element (LTR) causing many of the reads to map to multiple regions in the genome. If multi mapping reads are filtered out, the chr1 insertion site accounted for 13.7% of the remaining reads, while retaining multi mapping produces values closer to reality (25.4%). In contrast the long reads from PCIP-seq allow unambiguous mapping and closely matched the expected 33% for each insertion site (
The samples utilized above represent a best-case scenario, with ˜100% of cells infected and a small number of major clones. We next applied PCIP-seq to four samples from BLV infected sheep (experimental infection21) and three cattle (natural infection) to explore its performance on polyclonal and low proviral load (PVL) samples and compared PCIP-seq to our previously published Illumina method7. PCIP-seq revealed all samples to be highly polyclonal (
Comparison of the results showed a significant overlap between the two methods. When we consider insertion sites supported by more than three reads in both methods (larger clones, more likely to be present in both samples), in the majority of cases >50% of the insertion sites identified in the Illumina data were also observed via PCIP-seq (Table 10). These results show the utility of PCIP-seq for insertion site identification, especially considering the advantages long reads have in repetitive regions of the genome.
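The comparison described above can be sketched as follows. The data layout (a dictionary mapping each insertion site to its read support), the function name and the 200 bp matching tolerance are assumptions of the example; the tolerance used for the actual comparison is not stated here.

```python
def shared_fraction(illumina, pcip, min_reads=3, tolerance=200):
    """Fraction of Illumina insertion sites also observed by PCIP-seq.

    illumina, pcip: dicts mapping (chrom, position) -> supporting reads.
    Only sites with more than min_reads reads are considered in each set.
    The positional tolerance is an assumption of this sketch.
    """
    ill_sites = [site for site, n in illumina.items() if n > min_reads]
    pcip_sites = [site for site, n in pcip.items() if n > min_reads]
    if not ill_sites:
        return 0.0
    hits = sum(
        1 for chrom, pos in ill_sites
        if any(c == chrom and abs(p - pos) <= tolerance for c, p in pcip_sites)
    )
    return hits / len(ill_sites)
```

The pairwise search is quadratic in the number of sites, which is adequate for a sketch; sorting sites per chromosome would be preferable for samples with many thousands of insertion sites.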
Portions of the proviruses with more than ten supporting reads (PCR duplicates removed) were examined for SNPs with LoFreq22. For the four sheep samples, the variants were called relative to the pBLV344 provirus (used to infect the animals). For the bovine samples 1439 and 1053 custom consensus BLV sequences were generated for each and the variants were called in relation to the appropriate reference (SNPs were not called in 560). Across all the samples 3,209 proviruses were examined, 934 SNPs were called and 680 (21%) of the proviruses carried one or more SNPs (Table 11).
We validated 10 BLV SNPs in the ovine samples and 15 in the bovine via clone specific long-range PCR and Illumina sequencing. For Ovine 221, which was sequenced twice over a two-year interval, we identified and validated three instances where the same SNP and provirus were observed at both time points. We noted a small number of positions in the BLV provirus prone to erroneous SNP calls. By comparing allele frequencies from bulk Illumina and Nanopore data these problematic positions could be identified. For example, we observed a number of BLV proviruses in all the samples that had an apparent SNP at position 8213. When we looked at this position in reads mapped to the provirus without first sorting based on insertion site (referred to as bulk) we saw a C called 36 and 38% of the time respectively in the Nanopore data. In the bulk Illumina data, generated from the same sample, we saw the C is called 0% of the time indicating a technical artifact. As a consequence, SNPs from this position were excluded.
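The cross-platform check described above can be expressed as a simple filter, sketched below in Python. The function name and both thresholds are hypothetical choices made for the example; only the allele frequencies used in the usage note are taken from the text.

```python
def flag_platform_artifacts(nanopore_af, illumina_af,
                            min_nanopore=0.2, max_illumina=0.01):
    """Return positions where a variant allele is frequent in bulk Nanopore
    data but essentially absent from bulk Illumina data of the same sample.

    nanopore_af, illumina_af: dicts mapping proviral position -> bulk
    frequency of the variant allele. Both thresholds are assumptions.
    """
    return sorted(
        pos for pos, af in nanopore_af.items()
        if pos in illumina_af
        and af >= min_nanopore
        and illumina_af[pos] <= max_illumina
    )
```

With the bulk frequencies reported here (position 8213: 36% in Nanopore and 0% in Illumina; position 8154: 2% and 1%), this filter flags position 8213 only.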
Approximately half of the SNPs (47.1% sheep, 51.6% cattle) were found in multiple proviruses. Generally, SNPs found at the same position in multiple proviruses were concentrated in a single individual, indicating their presence in a founder provirus or via a mutation in the very early rounds of viral replication. For example, in animal 233 we found 16 proviruses (provirus inclusion was based on the less stringent criteria of >10 reads covering the position, not filtered for PCR duplicates) carrying a T-to-C transition within the Tax ORF at position 8154, this variant does not change the amino acid. Illumina and Nanopore bulk sequencing from the same sample show C is called at a 2% frequency in Nanopore, while with Illumina C is called at a 1% frequency. This indicates that the SNPs observed in these proviruses are not a technical artifact. Alternatively, a variant may also rise in frequency due to increased fitness of clones carrying a mutation in that position. In this instance, we would expect to see the same position mutated in multiple individuals. One potential example is found in the first base of codon 303 (position 8155) of the viral protein Tax, a potent viral transactivator, stimulator of cellular proliferation and highly immunogenic23. A variant was observed at this position in five proviruses for sheep 233 and three for sheep 221 as well as one provirus from bovine 1439 (
Patterns of provirus-wide APOBEC3G25 induced hypermutation (G-to-A) were not observed in BLV. However, three proviruses (two from sheep 233 and one in bovine 1053) showed seven or more A-to-G transitions, confined to a ˜70 bp window in the first half of the U3 portion of the 3′LTR. The pattern of mutation, as well as their location in the provirus suggests the action of RNA adenosine deaminases 1 (ADAR1)26,27.
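A cluster of this kind can be searched for with a sliding window over the SNP calls of a provirus, as in the following illustrative Python sketch. The function name and the input layout are hypothetical, the defaults simply mirror the figures given above (seven transitions, ˜70 bp), and the calls are assumed to be expressed on the proviral plus strand.

```python
def has_a_to_g_cluster(snps, min_count=7, window=70):
    """True if at least min_count A-to-G changes fall within `window` bp.

    snps: iterable of (position, ref_base, alt_base) for a single provirus.
    """
    positions = sorted(pos for pos, ref, alt in snps if ref == "A" and alt == "G")
    left = 0
    for right in range(len(positions)):
        # Shrink the window until it spans fewer than `window` bases.
        while positions[right] - positions[left] >= window:
            left += 1
        if right - left + 1 >= min_count:
            return True
    return False
```

Other substitution types are ignored by the filter, so a provirus carrying scattered G-to-A changes would not be reported.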
Proviruses were also examined for structural variants (SVs) using a custom script and via visualization in IGV (see Example 1). Between the sheep and bovine samples, we identified 66 deletions and 3 tandem duplications, with sizes ranging from 15 bp to 4,152 bp, with a median of 113 bp (Table 12).
We validated 14 of these via clone specific PCR. As seen in
Despite the effectiveness of combination antiretroviral therapy (cART) in suppressing HIV-1 replication, cART is not capable of eliminating latently infected cells, ensuring a viral rebound if cART is suspended30. This HIV-1 reservoir represents a major obstacle to an HIV cure31 making its exploration a priority. However, this task is complicated by its elusiveness, with only ˜0.1% of CD4+ T cells carrying integrated HIV-1 DNA32. To see if PCIP-seq could be applied to these extremely low proviral loads we initially carried out dilution experiments using U133, a HIV-1 cell line containing replication competent proviruses34. PCIP-seq on undiluted U1 DNA found the major insertion sites on chr2 and chrX (accounting for 47% & 41% of the hybrid reads respectively) and identified the previously reported variants that disrupt Tat function35 in both proviruses. In the chr2 provirus a T-to-C changes ATG to ACG and the first methionine to a threonine. In the chrX provirus an A-to-T changes CAT to CTT replacing a histidine at position 13 with a leucine. In addition to the two major proviruses we identified an additional ˜700 low abundance insertion sites (Table 8) including one on chr19 (0.8%) reported by Symons et al 201734 that is actually a product of recombination between the major chrX and chr2 proviruses, and one on chr7 (chr7: 100.5). Identification of the chr7: 100.5 & chr19: 34.9 proviruses as the products of recombination between the major chrX and chr2 proviruses was shown by mapping proviral reads from all four proviruses to a full length proviral genome (the sequence (SEQ ID NO:4) of the chrX provirus was used as the reference). This allowed the identification of SNPs and sequences derived from the chr2 and chrX proviruses, respectively. We then serially diluted U1 DNA in Jurkat cell line DNA. PCIP-seq was carried out with 5 μg of template DNA where U1 represents 0.1% and 0.01% of the total DNA. We also processed 5 μg of Jurkat DNA in parallel as a negative control.
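The amino acid consequences of the two Tat substitutions described above (ATG to ACG and CAT to CTT) can be checked against the standard genetic code, as in the following short Python sketch. The helper name is hypothetical; the codon table is the standard one, written with bases in T, C, A, G order.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codon -> one-letter amino acid ('*' marks a stop).
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def substitute(codon, offset, new_base):
    """Return (mutated codon, original amino acid, new amino acid)."""
    mutated = codon[:offset] + new_base + codon[offset + 1:]
    return mutated, CODON_TABLE[codon], CODON_TABLE[mutated]

print(substitute("ATG", 1, "C"))  # chr2 provirus: start codon Met -> Thr
print(substitute("CAT", 1, "T"))  # chrX provirus: His -> Leu
```

Both changes affect the second base of the codon and are non-synonymous, in agreement with the loss of the initiating methionine and the histidine to leucine replacement described above.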
The three PCIP-seq libraries were prepared using the same guides and primers. Following sequencing and demultiplexing the Jurkat negative control produced 12,137 reads, Jurkat+U1 0.01% produced 234,421 reads and Jurkat+U1 0.1% 252,913 reads. The resultant reads were mapped to the human genome. We were able to detect the major proviruses on chr2 and chrX in both dilutions (Table 8). The reads were also mapped to the HIV-1 genome. No pure HIV-1 reads or chimeric HIV-1/host reads mapping to HIV-1 were observed in the Jurkat negative control (Table 14). In Jurkat+U1 0.01% samples 12.6% of the reads were chimeric HIV-1/host; in Jurkat+U1 0.1% this rose to 43.2%.
ERVs in the genome can be present as full length, complete provirus, or more commonly as solo-LTRs, the products of non-allelic recombination37. At the current time conventional short read sequencing, using targeted or whole genome approaches, cannot distinguish between the two classes. Examining full length ERVs would provide a more complete picture of ERV variation, while also revealing which elements can produce de novo ERV insertions. As PCIP-seq targets inside the provirus we can preferentially amplify full length ERVs, opening this type of ERV to study in larger numbers of individuals. As a proof of concept we targeted the class II bovine endogenous retrovirus BERVK2, known to be transcribed in the bovine placenta38. We applied the technique to three cattle, of which one (10201e6) was a Holstein suffering from cholesterol deficiency, an autosomal recessive genetic defect recently ascribed to the insertion of a 1.3 kb LTR in the APOB gene39. PCIP-seq clearly identified the APOB ERV insertion in 10201e6, whereas no reads were seen mapping to this position in libraries from the other two cattle (Mannequin & 571). In contrast to previous reports39 PCIP-seq shows it to be a full-length element. We identified a total of 67 ERVs, with 8 present in all three samples (Table 15).
We validated three ERVs via long range PCR and Illumina sequencing. We did not find any with an identical sequence to the APOB ERV, although the ERV BTA3_115.3 has an identical LTR sequence, highlighting that the sequence of the LTR cannot be used to infer the complete sequence of the ERV.
We also adapted PCIP-seq to amplify the Ovine endogenous retrovirus Jaagsiekte sheep retrovirus (enJSRV), a model for retrovirus-host co-evolution40. The PCIP-seq reads were mapped to the reference genome (OAR3) where sequences matching enJSRV had been masked out, preventing reads from multiple proviruses mapping to these positions. Hybrid reads in the unique flanking sequence allowed us to determine the sequence of the proviruses present at these locations. Using two sheep (220 & 221) as template we identified a total of 48 enJSRV proviruses (33 in 220 and 38 in 221, with 22 common to both) and of these ˜54% were full length (Table 16).
We validated seven proviruses via long-range PCR and Illumina sequencing.
The majority of HPV infections clear or are suppressed within 1-2 years41, however a minority evolve into cancer, and these are generally associated with integration of the virus into the host genome. This integration into the host genome is not part of the viral lifecycle and the breakpoint in the viral genome can occur at any point across its 8 kb circular genome16. As a consequence the part of the viral genome found at the virus host breakpoint varies considerably, making the identification of integration sites difficult using existing approaches16. The long reads employed by PCIP-seq mean that even when the breakpoint is a number of kb away from the position targeted by primers we should still capture the integration site. As a proof of concept, we applied PCIP-seq to two HPV18 positive cases (HPV18_PX and HPV18_PY) using 4 μg of DNA extracted from Pap smear material. We identified 55 integration sites in HPV18_PX and 19 integration sites in HPV18_PY (Table 17).
In HPV18_PY the vast majority of the reads only contained HPV sequences, the integration sites identified were defined by single reads, suggesting little or no clonal expansion (Table 8). In HPV18_PX most integration sites were again defined by a single read, however there were some exceptions (Table 17). HPV18_PX had integrated copies of HPV18 on chr21 and chr3 (
PCR with primers spanning positions α-β and α-γ, showed that a genomic rearrangement had occurred in this clonally expanded cell (
As regards HPV integrations, we identified six patients where integration is associated with a pronounced clonal expansion, four, including HPV18_PX, were infected with HPV18 and two with HPV16.
The second patient had an integration of HPV18 within an intron of LRRC49 (histology=low grade squamous intraepithelial lesion). From the next two clonally expanded integrations (both HPV18), samples from two time points were available. The first had an integration in the LAPTM4B gene, the integration was found in both samplings and in the second it appears that episomal HPV18 has been cleared (
The last clonally expanded integrations were found in a seventy-one-year-old patient, integration was observed in three different positions in the genome, all were observed in two samplings 5 months apart (
As regards HPV16, we identified two samples with clonally expanded integrations. The first was observed in a 53-year-old with a low-grade squamous intraepithelial lesion, the HPV16 genome had integrated ˜2.5 kb upstream of the KRT5 gene. No episomal HPV16 DNA was observed in this sample. The integrated HPV genome contains a ˜3 kb deletion that does not overlap with the E6 and E7 genes. The second HPV16 sample has an integration in intron 4 of the POFUT1 gene. Again, the inserted viral genome contains a large deletion (˜5.5 kb) that does not overlap with E6 and E7. In contrast to the other HPV16 sample the majority (˜75%) of the HPV16 reads in this patient were still derived from episomal HPV16.
Discussion
In the present report we describe how PCIP-seq can be utilized to identify insertion sites while also sequencing parts of, and in some cases the entire associated provirus, and confirm this methodology is effective with a number of different retroviruses as well as HPV. For insertion site identification, the method was capable of identifying more than ten thousand BLV insertion sites in a single sample, using ˜4 μg of template DNA. Even in samples with a PVL of 0.66%, it was possible to identify hundreds of insertion sites with only 1 μg of DNA as template. The improved performance of PCIP-seq in repetitive regions further highlights its utility, strictly from the standpoint of insertion site identification. In addition to its application in research, high throughput sequencing of retrovirus insertion sites has shown promise as a clinical tool to monitor ATL progression20. Illumina based techniques require access to a number of capital-intensive instruments. In contrast PCIP-seq libraries can be generated, sequenced and analyzed with the basics found in most molecular biology labs, moreover, preliminary results are available just minutes after sequencing begins45. As a consequence, the method may have use in a clinical context to track clonal evolutions in HTLV-1 infected individuals, especially as the majority of HTLV-1 infected individuals live in regions of the world with poor biomedical infrastructure.
One of the common issues raised regarding Oxford Nanopore data is read accuracy. Early versions of the MinION had read identities of less than 60%47, however the development of new pores and base calling algorithms make read identities of ˜90% achievable. Accuracy can be further improved by generating a consensus from multiple reads, making accuracies of ˜99.4%48 possible. Recently Greig et al49 compared the performance of Illumina and Oxford Nanopore technologies for SNP identification in two isolates of Escherichia coli. They found that after accounting for variants observed at 5-methylcytosine motif sequences only ˜7 discrepancies remained between the platforms. It should be noted that as PCIP-seq sequences PCR amplified DNA, errors generated by base modifications will be avoided. Despite these improvements in accuracy, Nanopore specific errors can be an issue at some positions. Comparison with Illumina data is helpful in the identification of problematic regions and custom base calling models may be a way to improve accuracy in such regions48. Additionally, PCIP-seq libraries could equally be sequenced using long reads on the Pacific Biosciences platform or via 10× Genomics linked reads on Illumina if high single molecule accuracy is required17. In the current study we focused on SNPs observed in clonally expanded BLV proviruses. For viruses such as HIV-1, which have much lower proviral loads, more caution will be required as the majority of proviral sequences will be generated from a single provirus, making errors introduced by PCR more of an issue.
When analyzing SNPs from BLV, the most striking result was the presence of recurrent mutations at the first base of codon 303 in the viral protein Tax, a central player in the biology of both HTLV-146 and BLV50. It has previously been reported that this mutation causes an E-to-K amino acid substitution which ablates the transactivator activity of the Tax protein23. Collectively, these observations suggest this mutation confers an advantage to clones carrying it, possibly contributing to immune evasion, while retaining Tax protein functions that contribute to clonal expansion. However, there is a cost to the virus, as this mutation prevents infection of new cells due to the loss of Tax mediated transactivation of the proviral 5′LTR, making it an evolutionary dead end. It will be interesting to see if PCIP-seq can provide a tool to identify other examples of variants that increase the fitness of the provirus in the context of an infected individual but hinder viral spread to new hosts. Additionally, the technique could be used to explore the demographic features of the proviral population within and between hosts, how these populations evolve over time and how they vary.
A second notable observation is the cluster of A-to-G transitions observed within a ˜70 bp window in the 3′LTR. Similar patterns have been ascribed to ADAR1 hypermutation in a number of viruses26, including the close BLV relatives HTLV-2 and simian T-cell leukemia virus type 3 (STLV-3)51. Given the small number of hypermutated proviruses observed, it appears to be a minor source of variation in BLV, although it will be interesting to see if this holds for different retroviruses and at different time points during infection.
In the current study we focused our analysis on retroviruses and ERVs. However, as this methodology is potentially applicable to a number of different targets we extended its use to HPV as a proof of concept. It is estimated that HPV is responsible for >95% of cervical carcinoma and ˜70% of oropharyngeal carcinoma52. While infection with a high-risk HPV strain (HPV16 & HPV18) is generally necessary for the development of cervical cancer, it is not sufficient and the majority of infections resolve without adverse consequences41. The use of next-generation sequencing has highlighted the central role HPV integration plays in driving the development of cervical cancer16. Our results show that PCIP-seq can be applied to identify HPV integration sites in early precancerous samples. This opens up the possibility of generating a more detailed map of HPV integrations as well as potentially providing a biomarker to identify HPV integrations on the road to cervical cancer.
Other potential applications include determining the insertion sites and integrity of retroviral vectors54 and detecting transgenes in genetically modified organisms. We envision that in addition to the potential applications outlined above many other novel targets/questions could be addressed using this method.
This application is a national phase entry under 35 U.S.C. § 371 of International Patent Application PCT/EP2020/084557, filed Dec. 3, 2020, designating the United States of America and published in English as International Patent Publication WO 2021/110878 on Jun. 10, 2021, which claims the benefit under Article 8 of the Patent Cooperation Treaty to U.S. Patent Application Ser. No. 62/942,972, filed Dec. 3, 2019, the entireties of which are hereby incorporated by reference.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/EP2020/084557 | 12/3/2020 | WO | |

Number | Date | Country |
---|---|---|
62942972 | Dec 2019 | US |