AAV DELIVERY OF NUCLEOBASE EDITORS

BACKGROUND

Precise genome targeting technologies using the CRISPR/Cas9 system have recently been explored in a wide range of applications, including gene therapy. A major limitation to the application of Cas9 and Cas9-based genome-editing agents in gene therapy is the size of Cas9 (>4 kb), impeding its efficient delivery via recombinant adeno-associated virus (rAAV).

SUMMARY

Point mutations represent the majority of known pathogenic human genetic variants¹. To enable the direct installation or correction of point mutations in living cells, base editors (or “nucleobase editors”) were developed, which are engineered proteins that directly convert a target base pair to a different base pair without creating double-stranded DNA breaks^2-4. Cytidine base editors (CBEs) such as BE4max^3,5-7 catalyze the conversion of target C.G base pairs to T.A, while adenine base editors (ABEs) such as ABEmax^4,6 convert target A.T base pairs to G.C. While CBEs and ABEs are both widely used and work robustly in many cultured mammalian cell systems², the efficient delivery of base editors into live animals remains a challenge, despite promising initial studies^8-10. A major impediment to the delivery of base editors in animals has been an inability to package base editors in adeno-associated virus (AAV), an efficient and widely used delivery agent that remains the only FDA-approved in vivo gene therapy vector¹¹. The large size of the DNA encoding base editors (5.2 kb for base editors containing S. pyogenes Cas9, not including any guide RNA or regulatory sequences) precludes packaging in AAV, which has a genome packaging size limit of ≤5 kb^12,13.

To bypass this packaging size limit and deliver base editors (or “nucleobase editors”) using AAVs, a split-base editor dual AAV strategy^14,15 was devised, in which the CBE or ABE is divided into an N-terminal and C-terminal half. Each nucleobase editor half is fused to half of a fast-splicing split-intein. Following co-infection by AAV particles expressing each nucleobase editor-split intein half, protein splicing in trans reconstitutes full-length nucleobase editor. Unlike other approaches utilizing small molecules¹⁶ or sgRNA¹⁷ to bridge split Cas9, intein splicing removes all exogenous sequences and regenerates a native peptide bond at the split site, resulting in a single reconstituted protein identical in sequence to the unmodified nucleobase editor.
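The split-and-splice logic described above can be sketched as a toy string model. This is only an illustration under stated assumptions: the "editor" and both intein halves are short, made-up placeholder strings, not real sequences, and the function names are hypothetical. The check that the C-terminal half begins with Cys, Ser, or Thr reflects general intein chemistry rather than anything stated in this paragraph.

```python
from typing import Tuple

# Toy model of split-intein reconstitution. All sequences are short,
# made-up placeholders (assumptions), not real editor or intein sequences.
INTEIN_N = "nnnn"  # hypothetical stand-in for the intein-N half
INTEIN_C = "cccc"  # hypothetical stand-in for the intein-C half


def split_with_inteins(protein: str, split_index: int) -> Tuple[str, str]:
    """Return (N-terminal half + intein-N, intein-C + C-terminal half)."""
    n_half, c_half = protein[:split_index], protein[split_index:]
    # Assumption from general intein chemistry: the first residue after the
    # split site is Cys, Ser, or Thr.
    if c_half[:1] not in ("C", "S", "T"):
        raise ValueError("C-terminal half should begin with C, S, or T")
    return n_half + INTEIN_N, INTEIN_C + c_half


def trans_splice(n_fusion: str, c_fusion: str) -> str:
    """Remove both intein halves and join the flanking halves directly."""
    if not n_fusion.endswith(INTEIN_N) or not c_fusion.startswith(INTEIN_C):
        raise ValueError("fusions lack the expected intein halves")
    return n_fusion[: -len(INTEIN_N)] + c_fusion[len(INTEIN_C):]


editor = "MKDEVECLTPGR"  # placeholder "full-length editor"
n_fusion, c_fusion = split_with_inteins(editor, editor.index("C"))
reconstituted = trans_splice(n_fusion, c_fusion)
print(reconstituted == editor)
```

In this model the spliced product contains no intein-derived characters, mirroring the statement that splicing regenerates a protein identical in sequence to the unmodified editor.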

Split-intein CBEs and split-intein ABEs were developed and integrated into optimized dual AAV genomes to enable efficient base editing in somatic tissues of therapeutic relevance, including liver, heart, muscle, retina, and brain. The resulting AAVs were used to achieve base editing efficiencies at test loci for both CBEs and ABEs that, in each of these tissues, meet or exceed therapeutically relevant editing thresholds for the treatment of some human genetic diseases at AAV dosages that are known to be well-tolerated in humans. By integrating these developments, dual AAV split-intein nucleobase editors were used to treat a mouse model of Niemann-Pick disease type C (e.g., type C1), a debilitating disease that affects the central nervous system (CNS), resulting in correction of the causal mutation in CNS tissue, and an increase in the animal's lifespan. In addition, dual AAV split-intein nucleobase editors were used to treat a mouse model of congenital deafness, resulting in correction of the causal mutation in vivo.

Accordingly, in some aspects, described herein are nucleic acid molecules, compositions, recombinant AAV (rAAV) particles, kits, and methods for delivering a Cas9 protein or a base editor (or “nucleobase editor”) to cells, e.g., via rAAV vectors. Typically, a Cas9 protein or a nucleobase editor is “split” into an N-terminal portion and a C-terminal portion. The N-terminal portion and the C-terminal portion of a Cas9 protein or a nucleobase editor may each be fused to one member of the intein system. The resulting fusion proteins, when delivered on separate vectors (e.g., separate rAAV vectors) into one cell and co-expressed, may be joined to form a complete and functional Cas9 protein or nucleobase editor (e.g., via intein-mediated protein splicing). Further provided herein is empirical testing of regulatory elements in the delivery vectors for high expression levels of the split Cas9 protein or the nucleobase editor.

Some aspects of the present disclosure provide nucleic acid molecules encoding an N-terminal portion of a nucleobase editor fused at its C-terminus to a first intein sequence, wherein the nucleic acid molecule is operably linked to a first promoter, further comprising a nucleic acid segment encoding a guide RNA (gRNA) operably linked to a second promoter, wherein the direction of transcription of the nucleic acid segment is reversed relative to the direction of transcription of the nucleic acid molecule. Further provided are nucleic acid molecules encoding a C-terminal portion of a nucleobase editor fused at its N-terminus to a second intein sequence, wherein the nucleic acid molecule is operably linked to a third promoter, and further comprising a nucleic acid segment encoding a guide RNA (gRNA) operably linked to a fourth promoter, wherein the direction of transcription of the nucleic acid segment is reversed relative to the direction of transcription of the nucleic acid molecule.

In some embodiments, the disclosed nucleic acid molecules further comprise i) a transcriptional terminator, optionally wherein the transcriptional terminator is the transcriptional terminator from a bGH gene, hGH gene, or SV40 gene, and ii) a woodchuck hepatitis posttranscriptional regulatory element (WPRE) inserted 5′ of the transcriptional terminator. In certain embodiments, the WPRE is a truncated WPRE sequence. In certain embodiments, the truncated WPRE sequence comprises W3, as first reported in Choi, J. H., et al. (2014), Mol. Brain 7: 17, incorporated by reference herein. In certain embodiments, the WPRE is a full-length WPRE. In certain embodiments, the first and/or third promoters comprise a Cbh promoter. In certain embodiments, the second and/or fourth promoters comprise a U6 promoter.

Other aspects of the present disclosure provide compositions comprising: (i) a first nucleotide sequence encoding an N-terminal portion of a Cas9 protein fused at its C-terminus to an intein-N; and (ii) a second nucleotide sequence encoding an intein-C fused to the N-terminus of a C-terminal portion of the Cas9 protein, wherein at least one of the first nucleotide sequence and second nucleotide sequence is operably linked to a first promoter, wherein at least one of the first nucleotide sequence and second nucleotide sequence comprises at its 3′ end a gRNA nucleic acid segment encoding a guide RNA (gRNA) operably linked to a second promoter, and wherein the direction of transcription of the gRNA nucleic acid segment is reversed relative to the direction of transcription of the at least one nucleotide sequence.

In some embodiments, the Cas9 protein is a catalytically inactive Cas9 (dCas9) or a Cas9 nickase (nCas9), and the first nucleotide sequence of (i) and/or the second nucleotide sequence of (ii) further comprises a nucleotide sequence encoding a nucleobase modifying enzyme fused to the N-terminus of the N-terminal portion of the Cas9 protein.

In some embodiments, the nucleobase modifying enzyme (or nucleobase modification domain) is a deaminase. In some embodiments, the deaminase is a cytosine deaminase. In some embodiments, the deaminase is an adenosine deaminase. In some embodiments, the second nucleotide sequence of (ii) further comprises a nucleotide sequence encoding a uracil glycosylase inhibitor (UGI) fused at the 3′ end of the second nucleotide sequence. In some embodiments, the first nucleotide sequence of (i) further comprises a nucleotide sequence encoding a uracil glycosylase inhibitor (UGI) at the 5′ end of the first nucleotide sequence. In some embodiments, the UGI comprises the amino acid sequence of any one of SEQ ID NOs: 299-302.

In some embodiments, the first nucleotide sequence and the second nucleotide sequence are on different vectors. In some embodiments, each of the different vectors is a genome of a recombinant adeno-associated virus (rAAV). In some embodiments, each vector is packaged in a rAAV particle. In some aspects, the present disclosure provides rAAV particles comprising a first nucleic acid molecule (e.g., encoding an N-terminal portion of a nucleobase editor or Cas9 protein fused at its C-terminus to an intein-N) as described herein. rAAV particles comprising a second nucleic acid molecule (e.g., encoding an intein-C fused to the N-terminus of a C-terminal portion of the Cas9 protein or nucleobase editor) as described herein are also provided. In some embodiments, the N-terminal portion of the Cas9 protein and the C-terminal portion of the Cas9 protein are joined together to form the Cas9 protein. The disclosed rAAV particles may comprise both a first nucleic acid molecule and a second nucleic acid molecule as described herein.

In another aspect, host cells comprising the compositions described herein are provided. The disclosed cells may comprise any of the disclosed nucleic acid molecules, rAAV vectors, or rAAV particles described herein.

Some aspects of the present disclosure provide compositions comprising: (i) a first nucleotide sequence encoding an N-terminal portion of a nucleobase editor fused at its C-terminus to an intein-N; and (ii) a second nucleotide sequence encoding an intein-C fused to the N-terminus of a C-terminal portion of the nucleobase editor. Further provided herein are kits comprising any of the compositions described herein.

In some embodiments, any of the nucleobase editors of the disclosure comprises a cytosine deaminase fused to the N-terminus of a catalytically inactive Cas9 or a Cas9 nickase. In some embodiments, the cytosine deaminase is selected from the group consisting of: APOBEC1, APOBEC3, AID, and pmCDA1. In some embodiments, the nucleobase editor further comprises a uracil glycosylase inhibitor (UGI).

Still other aspects of the present disclosure provide methods comprising contacting a cell with any of the compositions described herein, wherein the contacting results in the delivery of the first nucleotide sequence and the second nucleotide sequence into the cell, and wherein the N-terminal portion of the nucleobase editor and the C-terminal portion of the nucleobase editor are joined to form a nucleobase editor.

Still other aspects of the present disclosure provide methods comprising administering to a subject in need thereof a therapeutically effective amount of any of the compositions described herein. In some embodiments, the subject has a disease or disorder (e.g., a genetic disease). In particular embodiments, the disease or condition is Niemann-Pick disease type C (NPC). In other embodiments, the disease or condition is congenital deafness. In some embodiments, the disease or disorder is selected from the group consisting of: cystic fibrosis, phenylketonuria, epidermolytic hyperkeratosis (EHK), chronic obstructive pulmonary disease (COPD), Charcot-Marie-Tooth disease type 4J, neuroblastoma (NB), von Willebrand disease (vWD), myotonia congenita, hereditary renal amyloidosis, dilated cardiomyopathy, hereditary lymphedema, familial Alzheimer's disease, prion disease, chronic infantile neurologic cutaneous articular syndrome (CINCA), and desmin-related myopathy (DRM).

The details of certain embodiments of the invention are set forth in the Detailed Description of Certain Embodiments, as described below. Other features, objects, and advantages of the invention will be apparent from the Definitions, Examples, Figures, and Claims.

BRIEF DESCRIPTION OF THE DRAWINGS

The accompanying drawings, which constitute a part of this Application, illustrate several embodiments of the invention and together with the description, serve to explain the principles of the invention.

FIGS. 1A-1C are graphs showing a “split nucleobase editor” for delivery into cells using recombinant adeno-associated virus (rAAV) vectors. FIG. 1A is a schematic representation of how the nucleobase editor is split into two portions. FIG. 1B shows that AAV-delivered split nucleobase editor can undergo protein splicing upon expression of the two halves in cells to form a complete nucleobase editor that has comparable activity to a nucleobase editor expressed as a whole. FIG. 1C shows the formation of a complete nucleobase editor from the two halves via protein splicing mediated by DnaE intein.

FIG. 2 shows that U118 cells were efficiently transfected by AAV2 containing nucleic acids encoding mCherry. Different viral titers were tested (2.5-10 μl at 4.5×10¹¹ vg/ml*) and all resulted in efficient transfection of U118 cells. *vg/ml means viral genome-containing particles per milliliter.
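As a quick arithmetic check on the range quoted for FIG. 2, the following sketch converts a volume and a titer into total viral genomes. It assumes the titer unit is viral genomes per milliliter (vg/ml); the function name is hypothetical and chosen only for illustration.

```python
def viral_genomes(volume_ul: float, titer_vg_per_ml: float) -> float:
    """Convert a volume in microliters and a titer in vg/ml to total vg."""
    # 1 ml = 1000 microliters, so divide the volume by 1000 first.
    return volume_ul / 1000.0 * titer_vg_per_ml


# Volumes and titer taken from the FIG. 2 description.
low_dose = viral_genomes(2.5, 4.5e11)
high_dose = viral_genomes(10.0, 4.5e11)
print(f"{low_dose:.3e} to {high_dose:.3e} vg")
```

Under this assumption the tested volumes correspond to roughly 1.1×10⁹ to 4.5×10⁹ vg.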

FIGS. 3A-3B are graphs showing high-throughput sequencing (HTS) results of nucleobase editing by rAAV-delivered split nucleobase editor in U118 and HEK cells. Lipid-transfected nucleobase editor was used as a control. A sgRNA targeting R37 in the PRNP gene was used, and the PRNP gene locus was sequenced. FIG. 3A shows the HTS reads, and FIG. 3B summarizes the base editing results.

FIG. 4 is a graph showing the optimization of the transcriptional terminator used in the AAV constructs encoding the split nucleobase editor. Transcriptional terminators of different sizes and origins were tested. bGH transcriptional terminator is relatively short and efficiently terminates transcription comparably to longer terminator sequences. It was therefore chosen to be used in the downstream experiments.

FIGS. 5A-5B are graphs showing the results of nucleobase editing with long-term (up to 15 days) transduction of AAV encoding the split nucleobase editor in mouse astrocytes expressing human ApoE4 cDNA. The target bases are in the codons for arginine 112 and arginine 158 in ApoE4, each of which is converted to a cysteine codon upon base editing. FIG. 5A shows that the editing of arginine 158 increases over time when the mouse astrocytes were transduced at 10¹⁰ vg, while editing of arginine 112 remained minimal. The nucleotide sequence 3′ of the codon for arginine 158 features a flanking NGG PAM allowing for high activity by SpCas9 (with guide sequence GAAGCGCCTGGCAGTGTACC, SEQ ID NO: 348), while the nucleotide sequence 3′ of the codon for arginine 112 contains a flanking NAG PAM which does not allow for high activity (with guide sequence GACGTGCGCGGCCGCCTGGTG, SEQ ID NO: 349). FIG. 5B shows cells transduced with rAAV encoding mCherry at 10¹⁰ vg (control).

FIG. 6 is a schematic representation of the optimization of the nuclear localization signal in AAV constructs encoding the split nucleobase editor. The nuclear localization signal controls nuclear import, which must occur for reconstituted nucleobase editor to associate with genomic DNA as a prerequisite for editing, and is a potential rate-limiting step in the process. This schematic shows that the NLS (and NLS optimization) is critical for the nucleobase editor to be imported into the nucleus.

FIG. 7 is a graph showing the results of base editing using different rAAV split nucleobase editor constructs containing different nuclear localization signals (NLS).

FIGS. 8A-8B are graphs showing the editing of the DNMT1 gene in dissociated mouse cortical neurons using an AAV-encoded split nucleobase editor.

FIGS. 9A-9B are graphs showing the editing of the DNMT1 gene in the mouse Neuro-2a cell line using either an AAV-encoded split nucleobase editor or a lipid-transfected, DNA-encoded nucleobase editor.

FIGS. 10A-10F show the development of split-intein cytosine and adenine base editors (or nucleobase editors). FIG. 10A is a schematic representation of the intein reconstitution strategy. Two separately encoded protein fragments fused to split-intein halves splice to reconstitute full-length protein following co-expression. FIG. 10B is a graph showing lipofection of intact BE3, split BE3 with the Npu split-intein site between E573/C574 or K637/T638, or split BE3 with the Cfa split-intein site between E573/C574 into HEK293T cells followed by high-throughput sequencing of six test loci to determine base editing efficiency. FIG. 10C is a graph comparing average editing data in FIG. 10B, normalized to BE3 levels (dotted line). BE3-normalized editing at each locus (black dots) was averaged. FIG. 10D is a graph showing “BEmax” optimization of nuclear localization signals and codon usage increases editing efficiency at six standard loci. BE3.9max and BE4max show comparable editing efficiencies. FIG. 10E is a graph comparing average editing data in FIG. 10D, normalized to BE4 levels (dotted line). FIG. 10F is a graph showing lipofection of ABEmax (left bar) or Npu-split E573/C574 ABEmax (right bar) into NIH 3T3 cells for generation of a split-intein adenosine nucleobase editor. In FIG. 10B and FIG. 10D, dots represent values and bars represent mean+SD of n=3 independent biological replicates. Dots in FIG. 10C and FIG. 10E represent locus averages.

FIGS. 11A-11E show the optimization of split-intein nucleobase editor AAVs. FIG. 11A contains images showing GFP expression three weeks after injection of 1×10¹¹ vg of GFP-NLS-bGH, GFP-NLS-W3-bGH, or GFP-NLS-WPRE-bGH into six-week-old C57BL/6 mice. Representative images of horizontal brain slices show hippocampus and neocortex. Top panels show DAPI and EGFP signals overlaid; bottom panels show EGFP signal only. The scale bar represents 500 μm. FIG. 11B is a graph showing transcriptional regulatory element optimization. Total GFP signal measured by ImageJ from mice injected as described in FIG. 11A. See methods for a detailed description of imaging and analysis procedures. FIG. 11C is a graph showing the number of GFP-positive cells per horizontal brain slice from the mice described in FIG. 11A. GFP-positive cells were identified by ilastik/CellProfiler as described in the image analysis section of the Methods of Example 3. FIG. 11D is a schematic of v3, v4, and v5 AAV variants. Arrows indicate direction of U6 promoter transcription. The CBE3.9 coding sequence consists of rAPOBEC1, spCas9 D10A nickase, and UGI. Small white boxes in v3 are non-essential backbone sequences removed in v4 and v5 AAV. See FIG. 17 for the schematic of v5 AAV-ABEmax. FIG. 11E is a graph showing cytosine base editing efficiencies in NIH 3T3 cells following a 14-day incubation with v3 AAV, v4 AAV, and v5 AAV. Dots and bars in FIG. 11B and FIG. 11C represent individual replicates and mean+SD of n=2-3 animals, 3-6 slices per animal. Darkened circles and error bars in FIG. 11E represent mean±SD. Dots in FIG. 11E represent values for independent biological replicates (n=3-4).

FIGS. 12A-12D show that systemic injection of v5 AAV9 editors results in cytosine and adenine base editing in heart, muscle, and liver. FIG. 12A is a schematic showing that six-week-old C57BL/6 mice were treated by retro-orbital injection of 2×10¹² vg total of v5 AAV9. After 4 weeks, organs were harvested and genomic DNA of unsorted cells was sequenced. FIG. 12B is a graph showing cytosine base editing by v5 AAV CBE3.9max in the indicated organs. FIG. 12C is a graph showing adenine base editing by v5 AAV ABEmax in the indicated organs. FIG. 12D is a graph comparing adenine base editing from v5 AAV-mediated ABEmax (grey bars) and from trans-mRNA splicing (white bars). Bars represent mean+SD of n=3 animals.

FIGS. 13A-13F show AAV-mediated cytosine and adenine base editing in the central nervous system by two delivery routes. FIG. 13A is a schematic of P0 intraventricular injections. P0 C57BL/6 mice were co-injected with 4×10¹⁰ vg total of v5 CBE3.9max or ABEmax AAV targeting DNMT1 and 1×10¹⁰ vg Cbh-KASH-GFP. Sorting for GFP-positive cells enriches for triply transduced cells. Tissue was harvested 3-4 weeks after injection, and cortex and cerebellum were separated. Cortical tissue comprises neocortex and hippocampus. For each tissue, nuclei were dissociated and analyzed as unsorted (all nuclei) or GFP-positive populations for DNA sequencing. FIG. 13B is a graph showing percent GFP-positive nuclei measured by flow cytometry following P0 injection. FIG. 13C is a graph showing cytosine base editing efficiency following P0 v5 CBE3.9max AAV injection in cortex and cerebellum at DNMT1 for unsorted nuclei (left bars) and GFP-positive nuclei (right bars). FIG. 13D is a graph showing adenine base editing efficiency following P0 v5 ABEmax AAV9 injection in cortex and cerebellum at DNMT1 for unsorted nuclei (left bar) and GFP-positive nuclei (right bar). FIG. 13E is a schematic of retro-orbital injections. Brains from 9-week-old C57BL/6 mice were harvested 4 weeks after injection with 4×10¹² vg total v5 CBE3.9max or ABEmax AAV targeting DNMT1 and 2×10¹¹ vg KASH-GFP AAV, then processed and analyzed as described in FIG. 13A. FIG. 13F is a graph showing cytosine base editing in unsorted (left bar) and GFP-positive (right bar) cortical and cerebellar cells following the procedure described in FIG. 13E. Bars represent mean+SD. Black dots represent individual animals (n=3-4).

FIGS. 14A-14F show AAV-mediated cytosine and adenine base editing in the retina following sub-retinal injections of 2-week-old Rho-Cre; Ai9 mice. FIG. 14A is a schematic of sub-retinal injections. Two-week-old Rho-Cre; Ai9 mice were treated by sub-retinal injection of 1×10⁹ to 1×10¹⁰ vg total of v5 CBE3.9max or v5 ABEmax AAV targeting DNMT1. For each group, at least three eyes were injected. Three weeks after injection, injected retinas were sorted into GFP-negative/tdTomato-positive (rod photoreceptors not transduced with GFP), tdTomato-positive/GFP-positive (transduced rods), GFP-positive/tdTomato-negative (marker transduced non-rod), and double-negative populations (unmarked non-rods, not shown). FIG. 14B is a graph showing the percentage of GFP transduced rod photoreceptors or non-rod retinal cells following subretinal injection of AAV mix of PHP.B-CBE, Anc80-CBE and Anc80-ABE AAV, respectively. The dose of AAV-GFP is 2×10⁹ vg for PHP.B-CBE mix, 3.3×10⁸ vg for Anc80-CBE mix and 4.5×10⁸ vg for Anc80-ABE mix. FIG. 14C contains images showing the expression of tdTomato in the rod photoreceptor cells of Rho-Cre; Ai9 mice (left panel). Retinal transduction of PHP.B-GFP (middle panel) or Anc80-GFP (right panel) at 5×10⁹ vg. Scale bar=20 μm. FIG. 14D is a graph showing cytosine base editing by v5 CBE3.9max PHP.B AAV in injected retinas. Editing percentage in all rods was inferred as ((editing % in GFP transduced rods)*(number of transduced rods)+(editing % in unmarked rods)*(number of unmarked rods))/total rods. This calculation was repeated for non-rods. FIG. 14E is a graph showing cytosine base editing by v5 CBE3.9max Anc80 AAV in photoreceptors and other retinal cells. Editing efficiencies in all rods and all non-rods were inferred as described for FIG. 14D. FIG. 14F is a graph showing adenine base editing by v5 ABEmax Anc80 AAV in photoreceptors. All GFP-positive cells were pooled in this experiment, resulting in a single GFP-positive population containing tdTomato-positive and tdTomato-negative cells (hashed bar). Bars represent mean+SD. Black dots represent individual eyes (n=3-4).
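The inferred editing percentage given for FIG. 14D is a cell-count-weighted average of the two sorted populations. A minimal sketch of that calculation follows; the function name and the example numbers are hypothetical and are not data from the figures.

```python
def inferred_editing_pct(
    edit_pct_transduced: float,
    n_transduced: int,
    edit_pct_unmarked: float,
    n_unmarked: int,
) -> float:
    """Weighted-average editing % across transduced and unmarked cells."""
    total = n_transduced + n_unmarked
    if total == 0:
        raise ValueError("no cells counted")
    # ((editing % in transduced) * (number transduced)
    #  + (editing % in unmarked) * (number unmarked)) / total
    return (
        edit_pct_transduced * n_transduced + edit_pct_unmarked * n_unmarked
    ) / total


# Made-up example: 40% editing in 2,000 transduced rods and
# 5% editing in 8,000 unmarked rods.
all_rods_pct = inferred_editing_pct(40.0, 2000, 5.0, 8000)
print(all_rods_pct)
```

The same function would apply unchanged to the non-rod populations, since the text states that the calculation was repeated for non-rods.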

FIGS. 15A-15H show base editing of NPC1^I1061T in the mouse CNS. FIG. 15A is a schematic of the NPC1 locus highlighting the mutation in exon 21, the protospacer and PAM sequence targeted, and the desired CBE-mediated reversion of I1061T. The scale bar represents 5 kilobases. FIG. 15B is a Kaplan-Meier plot of homozygous NPC1^I1061T mice injected with 4×10¹⁰ vg total of v5 CBE3.9max AAV9 targeting NPC1^I1061T (blue; n=7), untreated homozygous NPC1^I1061T mice (red; n=12), and NPC1^I1061T heterozygous animals (black; n=14). FIG. 15C is a Kaplan-Meier plot of NPC1^I1061T mice injected with 1×10¹¹ vg total v5 CBE3.9max AAV9 targeting NPC1^I1061T (blue; n=5), with data from the other two cohorts replotted from FIG. 15B. FIG. 15D is a graph showing cortical and cerebellar base editing in P0 animals injected with v5 AAV9 targeting NPC1^I1061T. Lighter bars report editing in unsorted or GFP-positive cells following injection of n=3 mice with 4×10¹⁰ vg (2×10¹⁰ vg of each split nucleobase editor half); darker bars correspond to editing following injection of 1×10¹¹ vg (5×10¹⁰ vg of each split nucleobase editor half). FIG. 15E is a graph showing base editing to the precisely corrected wild-type allele shown in FIG. 15A. Lighter bars indicate the frequency of alleles that are corrected to the wild-type sequence; darker bars replotted from FIG. 15D indicate total C.G-to-T.A editing in the T1061 codon (“ACA”) in FIG. 15A. FIG. 15F is a graph showing precisely corrected (wild-type) alleles as a percentage of all edited alleles. In FIG. 15B and FIG. 15C, tick marks indicate animal deaths. Bars represent mean+SD. Dots represent individual animals (n=3-5). FIG. 15G shows immunofluorescent measurements of calbindin and DAPI staining in midline sagittal cerebellar slices from P98-P105 mice. Calbindin is indicated as the darker stain, and DAPI is indicated as the lighter stain. Images were taken using an Eclipse Ti microscope (Nikon). Wild-type, n=3 mice, 15 images; NPC1^I1061T untreated, n=2 mice, 6 images; NPC1^I1061T AAV-CBE, n=2 mice, 10 images. Untreated vs. treated, two-sided t-test, p=0.0005. FIG. 15H shows immunofluorescent measurements of CD68+ tissue area. Images are representative CD68-stained midline sagittal cerebellar slices from P98-P105 mice. EGFP-KASH labeled cells are indicated with the (^) symbol, CD68+ labeled cells are indicated with the (>) symbol, and DRAQ5 signal is indicated with the (*) symbol. The untreated mice were uninjected and did not express GFP. In the quantification of CD68+ tissue area, each point represents the average per mouse. Wild-type, n=3 mice, 15 images; NPC1^I1061T untreated, n=2 mice, 6 images; NPC1^I1061T AAV-CBE, n=2 mice, 10 images. Untreated vs. treated, two-sided t-test, p=0.0005. The middle subpanel reports base editing to the precisely corrected wild-type allele shown in FIG. 15A from the 1×10¹¹ vg injections. Lighter bars indicate the frequency of alleles that are corrected to the wild-type sequence; replotted darker bars indicate total C.G-to-T.A editing of the T1061 codon (“ACA”) in FIG. 15A. The right subpanel shows precisely corrected (wild-type) alleles as a percentage of all edited alleles in mice injected with 1×10¹¹ vg. In FIG. 15B, tick marks indicate animal deaths. In all other panels, bars represent mean+SD. Dots represent individual mice. Scale bars represent 200 μm. Statistical tests for immunofluorescence are two-sided t-tests without multiple comparison corrections.

FIGS. 16A-16F show the development of split-intein S. aureus CBEs. FIG. 16A contains graphs showing editing performance in HEK293T cells of seven split S. aureus nucleobase editors with intein insertions between K534/C535, Y537/S538, Q501/T502, N484/S485, L431/S432, R453/S454, or Q457/S458. For each of the six endogenous genomic test sites, 16 bases of the protospacer, numbered with the PAM starting at position 21, are shown on the X axis. Unsplit S. aureus BE3 (saBE3) data are shown as black stars; seven split-intein CBEs are shown as shaded circles. Note that APOBEC1 exhibits an anti-GpC preference. FIG. 16B contains bar graphs of editing efficiency at the most highly edited C for each site. Shading patterns correspond to the shading patterns of the circles shown in FIG. 16A. FIG. 16C is a graph showing the average editing across the six genomic sites, normalized to unsplit saBE3 editing (dotted line). FIG. 16D shows a sample Western blot of S. pyogenes nucleobase editor expression (BE3.9max and Npu-BE3.9max) in HEK293T cells. The lanes to the left of the ladder have been stained against FLAG. The lanes to the right are the same samples stained against HA. The FLAG-stained lanes are co-stained against GAPDH loading control. Untagged BE3.9max is shown in the first lane; other samples are tagged as indicated. This representative blot is one of three biological replicates. FIGS. 16E-16F show editing at the HEK3 locus by the tagged editor constructs. The bars in FIG. 16E correspond to the lanes shown on the Western blot; the bars in FIG. 16F show additional conditions measuring the effect of tagging on editing efficiency. NpuC1A constructs are split-intein constructs containing the inactivating Npu N-terminal C1A mutation. In FIG. 16A and FIGS. 16E-16F, dots are mean+SD of n=3 independent biological replicates. In FIG. 16B and FIG. 16C, bars represent mean+SD. In FIG. 16B, dots represent values from independent biological replicates (n=3). Dots in FIG. 16C represent average editing at each of n=6 tested sites.

FIG. 17 is a schematic of v5 AAV ABEmax constructs. Arrows indicate direction of U6 promoter transcription. The ABEmax coding sequence consists of wild-type and evolved tadA monomers followed by spCas9 D10A nickase. The U6-sgRNA cassette was omitted from the N-terminal construct to avoid exceeding the AAV packaging limit.

FIGS. 18A-18C show CBE- and ABE-mediated editing in six organs following systemic injection of v5 AAV9 nucleobase editors. FIG. 18A is a graph showing cytosine base editing by v5 AAV CBE3.9max in organs poorly transduced by AAV9. The dotted line indicates the detection threshold of 0.1% editing. FIG. 18B is a graph comparing adenine base editing from v5 AAV-mediated ABEmax (grey bars, right) and from trans-mRNA splicing (white bars, left). Bars represent mean+SD of n=3 animals. FIG. 18C shows a comparison of cytosine base editing mediated by v5 AAV-SaBE3.9max with that of previously reported constructs, which were modified to replace the liver-specific P3 promoter with Cbh and to replace the Pah sgRNA with PCSK9-targeting sgRNA. Bars to the left of the dotted line report editing in livers of mice injected retro-orbitally with 1×10¹¹ vg total; bars to the right report a dose of 1×10¹² vg total. Bars represent mean+SD of n=3 mice.

FIGS. 19A-19B show the transduction of cerebellar Purkinje cells by P0 intracerebroventricular injections. FIG. 19A is a schematic of P0 intraventricular injections. P0 L7-GFP mice were injected with 5×10¹⁰ vg of PHP.B Cbh-mCherry-NLS. Brains were prepared for imaging following a three-week incubation. Visible cerebellar cells fall into three categories: GFP-positive, mCherry-negative=untransduced Purkinje cells; GFP-negative, mCherry-positive=transduced non-Purkinje cells; and GFP-positive, mCherry-positive=transduced Purkinje cells. The overlap of EGFP and mCherry, which are shaded in light grey and dark grey, respectively, produces white nuclei in transduced Purkinje cells. FIG. 19B contains sample cerebellar images from horizontally sliced hemispheres of injected L7-GFP mice. Left panel shows EGFP and mCherry signals overlaid; center and right panels respectively show EGFP and mCherry only. The scale bar represents 500 μm.

FIGS. 20A-20B show indel-subtracted AAV-mediated cytosine and adenine base editing in the retina following sub-retinal injections of 2-week-old C57BL/6 mice. Indel-containing datasets (solid bars) are reproduced from FIGS. 14D-14E for clarity. FIG. 20A is a graph showing cytosine base editing by v5 CBE3.9max PHP.B AAV in photoreceptors and other retinal cells. Diagonal-striped bars represent data re-analyzed after discarding indel-containing reads. Editing percentage was then calculated by dividing the number of T.A-containing reads by the original total read number. Removal of indel-containing reads was manually verified. The inferred editing percentages were calculated as in FIGS. 14A-14F: the editing percentage in all rods was inferred as ((editing % in transduced rods)*(number of transduced rods)+(editing % in unmarked rods)*(number of unmarked rods))/total rods. This calculation was repeated for non-rods. FIG. 20B is a graph showing cytosine base editing by v5 CBE3.9max Anc80 AAV in photoreceptors and other retinal cells. Indel removal was performed and editing efficiencies in all rods and all non-rods were inferred as described for FIG. 20A. Bars represent mean+SD. Black dots represent individual eyes (n=3).
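The indel-subtracted calculation described for FIG. 20A can be sketched as follows. This is a simplified illustration, not the analysis pipeline used for the figures: each read is reduced to two hypothetical boolean flags, and the read counts in the example are made up.

```python
from typing import Iterable, NamedTuple


class Read(NamedTuple):
    has_indel: bool  # read contains an insertion or deletion
    has_edit: bool   # target C.G is read as T.A


def indel_subtracted_editing_pct(reads: Iterable[Read]) -> float:
    """Editing % after discarding indel-containing reads.

    The denominator is the original total read number, as described
    for FIG. 20A, not the number of reads left after filtering.
    """
    all_reads = list(reads)
    if not all_reads:
        raise ValueError("no reads supplied")
    kept = [r for r in all_reads if not r.has_indel]
    edited = sum(1 for r in kept if r.has_edit)
    return 100.0 * edited / len(all_reads)


# Made-up example: 60 clean edited reads, 10 edited reads that also
# carry an indel, and 30 clean unedited reads (100 reads in total).
example_reads = (
    [Read(False, True)] * 60 + [Read(True, True)] * 10 + [Read(False, False)] * 30
)
pct = indel_subtracted_editing_pct(example_reads)
print(pct)
```

Because the original total is kept as the denominator, the indel-subtracted value can only be equal to or lower than the unfiltered editing percentage.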

FIGS. 21A-21D show the prolonged expression of a nucleobase editor. FIG. 21A is a graph showing editing in NPC1^I1061T/+ mice injected at P0 with 1×10¹¹vg v5 CBE3.9max AAV9. The shaded area and dotted line indicate that in unedited heterozygous animals, 50% of HTS reads are expected to contain a T.A. Brains were harvested and sequenced at P29 after sorting into unsorted (left bar) or GFP-positive (right bar) cells. The darker bars represent unsorted and GFP-positive cells harvested at P110. FIG. 21B is a graph showing the percent of edited cells inferred from the percent of T.A-containing reads. The percent of edited cells was calculated as 2*(% T.A−50). Bars represent mean+SD. Dots represent individual animals (n=3). FIG. 21C shows the cerebellar Cas9/EGFP staining in a P110 mouse injected at P0 with v5 AAV-CBE and GFP-KASH. Merged images show EGFP in darker shading and Cas9 in lighter shading. The Cas9 antibody is a mouse monoclonal antibody which binds a motif in the C-terminal half of the split editor. The dashed white rectangle indicates the zoomed-in area depicted in the single-channel images. Greyscale images are as labeled. FIG. 21D shows cortical Cas9/EGFP staining in a P110 mouse injected at P0 with v5 AAV-CBE and GFP-KASH. Merged images show EGFP as the darker label and Cas9 as the lighter label. Images in FIG. 21C and FIG. 21D are representative of n=2 mice. The dashed white rectangle indicates the zoomed-in area depicted in the single-channel images. In FIG. 21A and FIG. 21B, bars represent mean+SD. Black dots represent individual mice.
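The conversion used for FIG. 21B, percent of edited cells = 2*(% T.A−50), can be illustrated with a short Python sketch. It assumes, as the description of FIG. 21A implies, that each heterozygous cell carries one allele that already reads T.A and one editable mutant allele, so every edited cell adds half a cell's worth of T.A reads above the 50% baseline. The function name is hypothetical.

```python
def percent_edited_cells(pct_ta_reads):
    """Infer the percent of edited cells in a heterozygous population.

    Assumes one wild-type (T.A) allele and one editable mutant allele per
    cell, so unedited tissue gives 50% T.A-containing reads.
    """
    if not 0.0 <= pct_ta_reads <= 100.0:
        raise ValueError("pct_ta_reads must be between 0 and 100")
    return 2.0 * (pct_ta_reads - 50.0)


if __name__ == "__main__":
    # Hypothetical read percentages, for illustration only.
    for pct in (50.0, 60.0, 75.0):
        print(pct, percent_edited_cells(pct))
```

Values slightly below 50% T.A, which can arise from sequencing noise, yield small negative results under this formula; the sketch does not clamp them.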

FIGS. 22A-22C are tables showing base editing efficiency, indel frequency, and base editing:indel ratio for all in vivo experiments at the DNMT1 locus. All in vivo intein-split experiments were performed with v5 AAV and are listed according to the figure in which they appear. The percentage of reads with C.G to T.A editing (CBE3.9max) or A.T to G.C editing (ABEmax) was divided by the percentage of reads containing indels to generate the base editing:indel ratio. All analyses of HTS data were performed by CRISPResso2 as described in the Methods section of Example 3. CRISPResso2 is publicly available software that provides analyses of genome editing outcomes from deep sequencing data. See Clement et al., Nat Biotechnol. 2019 March; 37(3):224-226, herein incorporated by reference. All values represent mean±SD.
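The base editing:indel ratio described for FIGS. 22A-22C is a simple quotient of two percentages. A minimal Python sketch follows; the function name is hypothetical, the inputs are assumed to be percentages already reported by the HTS analysis, and returning infinity when no indels are detected is an illustrative choice rather than something stated in the disclosure.

```python
import math


def base_edit_to_indel_ratio(pct_base_edited_reads, pct_indel_reads):
    """Ratio of percent base-edited reads to percent indel-containing reads."""
    if pct_base_edited_reads < 0 or pct_indel_reads < 0:
        raise ValueError("percentages must be non-negative")
    if pct_indel_reads == 0:
        # Assumption for illustration: no detectable indels gives an unbounded ratio.
        return math.inf
    return pct_base_edited_reads / pct_indel_reads


if __name__ == "__main__":
    # Hypothetical values, not taken from the tables.
    print(base_edit_to_indel_ratio(30.0, 1.5))
```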

FIG. 23 contains flow cytometry plots exemplifying brain nuclei sorting. Plots show 500,000 events. Nuclei were sequentially gated on the basis of DyeCycle Ruby signal, FSC/SSC ratio, SSC-Width/SSC-height ratio, and GFP/DyeCycle ratio, as shown above. The first column demonstrates the gating strategy on a GFP-negative control sample. The middle column demonstrates the gating strategy on a sample with low transduction (P0 injection, cerebellar tissue), and the right column demonstrates high transduction efficiency (P0 injection, cortical tissue). In all cases, unsorted nuclei correspond to events that pass gates R1, R2, and R3, without sorting on R4.

FIG. 24 contains flow cytometry plots exemplifying retinal cell sorting. Plots show 250,000 events. Cells were sequentially gated on the basis of FSC/SSC ratio, FSC-W/FSC-A, SSC-W/FSC-A, and fluorescence. Cells were sorted four ways on the basis of signal intensity in the PE-Texas Red and GFP channels. The left column illustrates the gating strategy on an untransduced Rho-Cre; Ai9 mouse with tdTomato-positive rod photoreceptors. The right column illustrates the gating strategy on an Rho-Cre; Ai9 mouse co-injected with PHP.B GFP and v5 CBE3.9max.

FIGS. 25A-25B are tables containing primers used to generate sgRNA sequences and amplify genomic DNA. All sgRNA forward primers have 5′-CACC overhangs, and all reverse primers have 5′-AAAC overhangs to generate overhangs for efficient ligation. Primers for gDNA amplification contain bolded 5′ Illumina adapter sequences and 3′ gene-specific sequences (no special formatting).

FIGS. 26A-26U show the recombinant AAV vector construct nucleotide sequences encoding the CBE3.9max, ABEmax, and AID-BE3.9max nucleobase editors evaluated in the Examples. All constructs were cloned in the px601 backbone (F. Zhang), modified to correct an 11-bp deletion in the left ITR. Pseudospacer-containing backbones were cut with Esp3I or BsmBI endonucleases. Primers listed in FIGS. 25A-25B were annealed and ligated with standard molecular biology techniques. Annotations are coded as described in the figure. The U6-sgRNA cassette was omitted from the ABEmax N-terminal constructs to keep the total construct size under the packaging limit.

FIG. 27 shows a Kaplan-Meier plot of homozygous NPC1^I1061T mice injected with 4×10¹² vg total of v5 CBE3.9max. Treated mice (blue; n=5) were injected with 3×10¹² vg PHP.eB and 1×10¹² vg AAV9 targeting NPC1^I1061T; untreated homozygous NPC1^I1061T mice (red; n=9) are shown for comparison. Tick marks indicate animal deaths. Median survival increases from 109 to 120 days, p=0.015 by Mantel-Cox.

FIGS. 28A-28B show cerebellar CD68 staining. FIG. 28A shows representative single-channel images of cerebellar slices stained against EGFP, CD68, and DNA in greyscale. EGFP labels cells transduced with the GFP-KASH AAV transduction marker. CD68 labels reactive microglia, and DRAQ5 labels DNA. The NPC1^I1061T animal in this case was not transduced. Multi-channel images from FIGS. 15A-15H are reproduced for clarity. The dotted white rectangle in the rightmost (treated) column highlights one area that is GFP⁺/CD68⁻. Scale bar is 200 μm. FIG. 28B shows CD68+ cells per mm² in wild-type, treated, and untreated mice. Bars represent mean+SD. Black dots represent individual mice. For FIG. 28A and FIG. 28B, n=3 wild-type, n=2 treated, and n=2 untreated mice.

FIGS. 29A-29D show an off-target analysis of NPC1-targeting sgRNA. FIG. 29A shows the results of CIRCLE-seq using the NPC1-targeting sgRNA and Cas9 to cut gDNA harvested from untreated NPC1^I1061T mouse liver. Note that off-target candidate sequences are aligned to the wild-type C57BL/6 genome; the wild-type NPC1 allele on line 2 is not present in the assay. FIG. 29B shows a CRISPOR off-target analysis of the six sites with the highest predicted Cas9 activity as determined by CFD score, including the on-target site, in descending order. Off-target guide sequences are shown in the left-most column. FIG. 29C shows amplicon sequencing of the three CIRCLE-seq candidate loci from treated, sorted mouse cortical and cerebellar samples shown in FIG. 15F. FIG. 29D shows amplicon sequencing of the top five CRISPOR-predicted Cas9 off-target sites from treated, sorted mouse cortical and cerebellar samples shown in FIG. 15F. In FIGS. 29C-29D, individual cytosines in the protospacer are arrayed on the x-axis, with base 1 the farthest from the PAM and base 20 PAM adjacent, as depicted in FIG. 29A. Light grey bars indicate cerebellar samples; dark grey bars indicate cortical samples. The dotted line indicates the detection threshold of 0.1% editing. Bars represent mean+SD. Black dots represent individual mice (n=4 mice for cerebellar samples; n=5 mice for cortical samples).

FIGS. 30A-30D show how evaluating different nucleobase editors and guide RNA can correct the Tmc1^Y182C/Y182Callele in Baringo MEF cells. FIG. 30A is a schematic of the Tmc1 locus highlighting the c.A545G mutation (red), silent bystander bases, and three candidate guide RNAs that position the target C (directly below “Y/C”) at different protospacer positions (C₈, C₇, C₁₀) and the use of different PAMs (AGG, GGA and TGA). FIG. 30B shows base editing efficiencies for the four CBE-P2A-GFP variants tested with sgRNA1 (where the four CBEs are APOBEC1-BE4max, CDA1-BE4max, evoCDA1-BE4max, or AID-BE4max). Base editing values (blue bars) reflect the correction of the Baringo mutation to the wild-type TMC1 protein coding sequence, with no other non-silent changes or indels. Three days following nucleofection into Baringo MEF cells, GFP positive (GFP+) cells were sorted and genomic DNA was characterized by high-throughput sequencing. FIG. 30C shows base editing efficiencies for three different guide RNAs tested with AID-BE4max variants: AID-BE4max+sgRNA1, AID-VRQR-BE4max+sgRNA2, or AID-VRQR-BE4max+sgRNA3. Three days following nucleofection of these plasmids into Baringo MEF cells, GFP-positive cells were sorted and sequenced by HTS. FIG. 30D shows base editing efficiencies in Baringo MEF cells following a 14-day incubation with dual AAV encoding AID-BE3.9max+sgRNA1 at high (N terminal: 6.1×10⁸vg, C terminal: 8.3×10⁸vg) and low (3.1×10⁷vg, C terminal: 4.2×10⁷vg) doses. Dots, shaded bars, and error bars represent individual biological replicates, mean values, and SEM, respectively (n=3-5).

FIGS. 31A-31F show in vivo base editing of Tmc1^Y182C/Y182Cin Baringo mice, in vitro off-target analysis for sgRNA1, and in vivo analysis of hair-cell stereocilia bundle morphology. FIG. 31A shows the ten most abundant genomic DNA cleavage products (which include the on-target site and nine potential off-target sequences) from Cas9 nuclease+sgRNA1 as identified in vitro by CIRCLE-seq, aligned to the on-target Tmc1 sequence. FIG. 31B shows an editing analysis of the nine candidate off-target sites identified by CIRCLE-seq in MEF cells treated with dual AAV encoding AID-BE3.9max+sgRNA1. The on-target locus, plus the top nine off-target sites identified by CIRCLE-seq, were sequenced by HTS. Dots and bars represent biological replicates and mean±SEM (n=3). FIG. 31C shows the efficiency of AID-BE3.9max+sgRNA1-mediated editing in treated Baringo (Tmc1^Y182C/Y182C; Tmc2^+/+) mice. Mouse inner ears were injected at P1 with 1 μL (3.1×10⁹vg of each AAV) dual AAV encoding AID-BE3.9max+sgRNA1. After 14 days, cochleas were microdissected into base, mid, and apex samples. Genomic DNA was extracted from each sample and sequenced by HTS. Each dot represents the efficiency of generating Tmc1 alleles with wild-type TMC1 protein sequence and no other non-silent mutations or indels, averaging all samples sequenced from one injected cochlea. To obtain Tmc1 mRNA from the cochlea, the cochlea was extracted at P30, isolated RNA, reverse transcribed into cDNA, and analyzed by HTS. Each dot represents the mRNA from one injected cochlea. FIGS. 31D-31F show representative scanning electron microscopy (SEM) images at the apical turn of OHCs and IHCs of wild-type (Tmc1^+/+; Tmc2^+/+) mice (FIG. 31D), untreated Baringo (Tmc1^Y182C/Y182C; Tmc2^+/+) mice (FIG. 31E), and Baringo mice treated with dual AAV encoding AID-BE3.9max+sgRNA1 (FIG. 31F). The organ of Corti samples were imaged by SEM at 4 weeks. Scale bar, 10 μm.

FIGS. 32A-32C show that the inner ear injection of dual AAV encoding AID-BE3.9max+sgRNA1 restores sensory transduction in Tmc1^Y182C/Y182C; Tmc2^Δ/Δ inner hair cells. FIG. 32A shows confocal images of mid-turn cochlear sections excised from P5 Tmc1^Y182C/Y182C; Tmc2^Δ/Δ mouse cochleas. A representative untreated mouse (top panel) or a representative mouse treated with 1 μL (3.1×10⁹ vg of each AAV) of dual AAV encoding AID-BE3.9max+sgRNA1 (bottom panel) are shown. The tissue was cultured for 9-13 days and treated with 5 μM FM1-43 for 10 seconds followed by three full bath exchanges to wash out excess dye. The tissue was mounted and imaged for FM1-43 uptake (light shading) in IHCs and OHCs. All images are 500×150 μm. Scale bar, 50 μm. FIG. 32B is a graph showing the quantification of FM1-43-positive IHCs from untreated and treated mice represented as mean±SD (n=3-4 different mice in each group). FIG. 32C is a graph showing representative families of sensory transduction currents evoked by mechanical displacement of hair bundles recorded from apical IHCs of untreated Tmc1^Y182C/Y182C; Tmc2^Δ/Δ mice at P8 (untreated), from Tmc1^Y182C/Y182C; Tmc2^Δ/Δ mice treated with dual AAV encoding AID-BE3.9max+sgRNA1 at P14 and P18, and from wild-type Tmc1^+/+; Tmc2^+/+ mice at P14-16. Horizontal lines and error bars reflect mean values and SD of 3-4 independent mice and 4-8 hair cells (indicated on top of x-axis), with each dot representing one IHC.

FIGS. 33A-33D show that dual AAV nucleobase editor treatment partially restores auditory function in Baringo (Tmc1^Y182C/Y182C; Tmc2^Δ/Δ) mice. FIG. 33A shows representative sets of ABR waveforms recorded in response to 5.6-kHz tone bursts of varying sound intensity for untreated wild-type mice (left) and wild-type mice treated with dual AAV encoding AID-BE3.9max+sgRNA1 (right). FIG. 33B shows the same as FIG. 33A, but with untreated Baringo mice (left) and Baringo mice treated with 1 μL (3.1×10⁹vg of each AAV) dual AAV encoding AID-BE3.9max+sgRNA1 (right). FIG. 33C shows the mean ABR responses for all four groups (untreated and treated, Baringo and wild-type mice) across all tested frequencies. Untreated Baringo mice (black, n=10) are profoundly deaf, with no detectable ABR threshold (>110 dB, indicated by the upward arrows). Among the treated Baringo mice (n=15) injected with dual AAV encoding AID-BE3.9max+sgRNA1, nine showed ABR response improvements of up to >50 dB (series of overlapping lines associated with “n=9”), while six did not show any rescue (grey line, n=6). Untreated wild-type mice (darker line, n=6) and wild-type mice injected with dual AAV encoding AID BE3.9max+sgRNA1 (lighter line, n=4) show similar ABR thresholds. FIG. 33D shows that the same mice in FIG. 33C were subjected to DPOAE testing. Untreated (black line, n=10) and treated Baringo mice both showed no DPOAE responses under the tested conditions (up to 80 dB). Untreated wild-type mice (darker line, n=6) and wild-type mice injected with dual AAV encoding AID-BE3.9max+sgRNA1 (lighter line, n=4) exhibited normal DPOAE thresholds. All recordings were done at P30. Values and error bars reflect mean±SD for the numbers of mice specified above.

FIG. 34 shows the base editing outcomes from different CBE and sgRNA combinations. The heat map shows the average base editing efficiency by BE4max variants at cytosines surrounding the target nucleotide. The target Tmc1^Y182C/Y182C mutation is at protospacer position 8. Silent bystander cytosines are at positions 1, 10, 15, and 16. Non-silent bystander cytosines are at positions −12, −11, −9, −8, 18, and 23.

FIGS. 35A-35C show Anc80-Cbh-GFP AAV transduction in IHCs and OHCs in wild-type mice. FIG. 35A shows low-magnification images, and FIG. 35B shows high-magnification images, of the entire apical and basal portions of the cochlea of a wild-type mouse injected at P1 with 1 μL of Anc80-Cbh-GFP AAV. The cochlea was harvested at P10, stained with Alexa555-phalloidin, and imaged for Alexa555 and GFP. Scale bar, 50 μm. FIG. 35C shows the quantification of transduction: the total number of hair cells was counted as phalloidin-positive HCs, and transduced hair cells were counted as GFP+ HCs. Values and error bars reflect individual data points and mean±SD from three samples from n=3 different mice in each group.

FIG. 36 shows base editing at on-target and off-target genomic DNA sites identified by CIRCLE-seq using Cas9+sgRNA1. Off-target editing analysis in MEF cells treated with dual AAV encoding AID-BE3.9max+sgRNA1. The top ten sites identified by CIRCLE-seq (the on-target locus and the top nine off-target loci) were sequenced by HTS. The maximum % C.G-to-T.A conversion at any position in the protospacer is shown. No off-target site showed editing levels (red) that were significantly (p<0.1) different than the maximum % C.G-to-T.A of the untreated control (blue). Dots and bars represent biological replicates and mean±SEM (n=3 for AAV-treated samples and n=1 for the untreated samples).

FIGS. 37A-37B show the transduction currents from IHCs and OHCs of Tmc1^Y182C/Y182C; Tmc2^+/+ and Tmc1^Y182C/Y182C; Tmc2^Δ/Δ mice at different time points. FIG. 37A shows representative current traces from IHCs of a Tmc1^Y182C/Y182C; Tmc2^+/+ mouse (P7) and a Tmc1^Y182C/Y182C; Tmc2^Δ/Δ mouse (P6). FIG. 37B shows cellular recordings obtained from the basal and mid-apical regions of IHCs or OHCs at different time points (P6-P27). Horizontal lines and error bars reflect mean values and SD of 3-4 independent mice and 2-8 hair cells (indicated on top of x-axis), with each dot representing one OHC or IHC.

FIGS. 38A-38C show the hair cell morphology in the organ of Corti from Tmc1^Y182C/Y182C; Tmc2^+/+ mice with and without treatment with dual AAV-AID-BE3.9max+sgRNA1. FIG. 38A shows representative, low-magnification images of whole-mount apical and basal turns from Tmc1^Y182C/Y182C; Tmc2^+/+ mice treated with AAV-AID-BE3.9max+sgRNA1 and Tmc1^Y182C/Y182C; Tmc2^+/+ mice without treatment. Samples were stained with Myo7A (lighter shading) to label hair cells. FIG. 38B shows high-magnification images of the same cochleas boxed in FIG. 38A. FIG. 38C is a graph showing the quantification of the number of Myo7A-positive IHCs and OHCs from entire cochleas of three untreated Tmc1^Y182C/Y182C; Tmc2^+/+ mice and four Tmc1^Y182C/Y182C; Tmc2^+/+ mice treated with dual AAV-AID-BE3.9max+sgRNA1 at P1. Dots and bars represent biological replicates and mean±SD.

FIGS. 39A-39C show the hair bundle morphology in the basal turn of the organ of Corti from Tmc1^Y182C/Y182C; Tmc2^+/+ mice with and without treatment with dual AAV-AID-BE3.9max+sgRNA1. Representative scanning electron microscopy images (basal part) of the organ of Corti are shown from wild-type Tmc1^+/+; Tmc2^+/+ mice (FIG. 39A), untreated Tmc1^Y182C/Y182C; Tmc2^+/+ mice (FIG. 39B), and Tmc1^Y182C/Y182C; Tmc2^+/+ mice treated with dual AAV-AID-BE3.9max+sgRNA1 (FIG. 39C). The apical and basal regions of the organ of Corti were imaged at 4 weeks. Scale bar, 10 μm.

DEFINITIONS

As used herein and in the claims, the singular forms “a,” “an,” and “the” include the singular and the plural reference unless the context clearly indicates otherwise. Thus, for example, a reference to “an agent” includes a single agent and a plurality of such agents.

An “adeno-associated virus” or “AAV” is a virus which infects humans and some other primate species. The wild-type AAV genome is a single-stranded deoxyribonucleic acid (ssDNA), either positive- or negative-sensed. The genome comprises two inverted terminal repeats (ITRs), one at each end of the DNA strand, and two open reading frames (ORFs), rep and cap, between the ITRs. The rep ORF comprises four overlapping genes encoding Rep proteins required for the AAV life cycle. The cap ORF comprises overlapping genes encoding capsid proteins: VP1, VP2 and VP3, which interact together to form the viral capsid. VP1, VP2 and VP3 are translated from one mRNA transcript, which can be spliced in two different manners: either a longer or shorter intron can be excised, resulting in the formation of two isoforms of mRNAs: a ~2.3 kb- and a ~2.6 kb-long mRNA isoform. The capsid forms a supramolecular assembly of approximately 60 individual capsid protein subunits into a non-enveloped, T=1 icosahedral lattice capable of protecting the AAV genome. The mature capsid is composed of VP1, VP2, and VP3 (molecular masses of approximately 87, 73, and 62 kDa respectively) in a ratio of about 1:1:10.

rAAV particles may comprise a nucleic acid vector (e.g., a recombinant genome), which may comprise at a minimum: (a) one or more heterologous nucleic acid regions comprising a sequence encoding a protein or polypeptide of interest (e.g., a split Cas9 or split nucleobase editor) or an RNA of interest (e.g., a gRNA), or one or more nucleic acid regions comprising a sequence encoding a Rep protein; and (b) one or more regions comprising inverted terminal repeat (ITR) sequences (e.g., wild-type ITR sequences or engineered ITR sequences) flanking the one or more nucleic acid regions (e.g., heterologous nucleic acid regions). In some embodiments, the nucleic acid vector is between 4 kb and 5 kb in size (e.g., 4.2 to 4.7 kb in size). In some embodiments, the nucleic acid vector further comprises a region encoding a Rep protein. In some embodiments, the nucleic acid vector is circular. In some embodiments, the nucleic acid vector is single-stranded. In some embodiments, the nucleic acid vector is double-stranded. In some embodiments, a double-stranded nucleic acid vector may be, for example, a self-complementary vector that contains a region of the nucleic acid vector that is complementary to another region of the nucleic acid vector, initiating the formation of the double-strandedness of the nucleic acid vector.

As used herein, the term “adenosine deaminase” or “adenosine deaminase domain” refers to a protein or enzyme that catalyzes a deamination reaction of an adenosine (or adenine). The terms are used interchangeably. In certain embodiments, the disclosure provides nucleobase editor fusion proteins comprising one or more adenosine deaminase domains. For instance, an adenosine deaminase domain may comprise a heterodimer of a first adenosine deaminase and a second deaminase domain, connected by a linker. Adenosine deaminases (e.g., engineered adenosine deaminases or evolved adenosine deaminases) provided herein may be enzymes that convert adenine (A) to inosine (I) in DNA or RNA. Such adenosine deaminases can lead to an A:T to G:C base pair conversion. In some embodiments, the deaminase is a variant of a naturally-occurring deaminase from an organism. In some embodiments, the deaminase does not occur in nature. For example, in some embodiments, the deaminase is at least 50%, at least 55%, at least 60%, at least 65%, at least 70%, at least 75%, at least 80%, at least 85%, at least 90%, at least 95%, at least 96%, at least 97%, at least 98%, at least 99%, or at least 99.5% identical to a naturally-occurring deaminase.

In some embodiments, the adenosine deaminase is derived from a bacterium, such as E. coli, S. aureus, S. typhi, S. putrefaciens, H. influenzae, or C. crescentus. In some embodiments, the adenosine deaminase is a TadA deaminase. In some embodiments, the TadA deaminase is an E. coli TadA deaminase (ecTadA). In some embodiments, the TadA deaminase is a truncated E. coli TadA deaminase. For example, the truncated ecTadA may be missing one or more N-terminal amino acids relative to a full-length ecTadA. In some embodiments, the truncated ecTadA may be missing 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, or 20 N-terminal amino acid residues relative to the full-length ecTadA. In some embodiments, the truncated ecTadA may be missing 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, or 20 C-terminal amino acid residues relative to the full-length ecTadA. In some embodiments, the ecTadA deaminase does not comprise an N-terminal methionine. Reference is made to U.S. Patent Publication No. 2018/0073012, published Mar. 15, 2018, which is incorporated herein by reference.

In genetics, the “antisense” strand of a segment within double-stranded DNA is the template strand, and which is considered to run in the 3′ to 5′ orientation. By contrast, the “sense” strand is the segment within double-stranded DNA that runs from 5′ to 3′, and which is complementary to the antisense strand of DNA, or template strand, which runs from 3′ to 5′. In the case of a DNA segment that encodes a protein, the sense strand is the strand of DNA that has the same sequence as the mRNA, which takes the antisense strand as its template during transcription, and eventually undergoes (typically, not always) translation into a protein. The antisense strand is thus responsible for the RNA that is later translated to protein, while the sense strand possesses a nearly identical makeup to that of the mRNA. Note that for each segment of dsDNA, there will possibly be two sets of sense and antisense, depending on which direction one reads (since sense and antisense is relative to perspective). It is ultimately the gene product, or mRNA, that dictates which strand of one segment of dsDNA is referred to as sense or antisense.
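The relationship between the sense strand, the antisense (template) strand, and the mRNA can be made concrete with a short Python sketch. It is illustrative only: the function names and the example sequence are hypothetical, and the sketch handles only unambiguous A, C, G, and T bases.

```python
# Watson-Crick pairing for unambiguous DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def antisense_5_to_3(sense_5_to_3):
    """Return the antisense (template) strand, written 5' to 3'.

    This is the reverse complement of the sense strand.
    """
    seq = sense_5_to_3.upper()
    if set(seq) - set("ACGT"):
        raise ValueError("only A, C, G, and T are supported")
    return seq.translate(_COMPLEMENT)[::-1]


def mrna_from_sense(sense_5_to_3):
    """Return the mRNA, which matches the sense strand with U in place of T."""
    seq = sense_5_to_3.upper()
    if set(seq) - set("ACGT"):
        raise ValueError("only A, C, G, and T are supported")
    return seq.replace("T", "U")


if __name__ == "__main__":
    sense = "ATGGCC"  # hypothetical coding-strand fragment
    print(antisense_5_to_3(sense))  # template strand, 5' to 3'
    print(mrna_from_sense(sense))   # transcript copied from the template
```

Reading the returned antisense string from right to left gives the 3′ to 5′ orientation in which the template strand is conventionally described.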

“Base editing” refers to genome editing technology that involves the conversion of a specific nucleic acid base into another at a targeted genomic locus. In certain embodiments, this can be achieved without requiring double-stranded DNA breaks (DSB), or single stranded breaks (i.e., nicking). To date, other genome editing techniques, including CRISPR-based systems, begin with the introduction of a DSB at a locus of interest. Subsequently, cellular DNA repair enzymes mend the break, commonly resulting in random insertions or deletions (indels) of bases at the site of the DSB. However, when the introduction or correction of a point mutation at a target locus is desired rather than stochastic disruption of the entire gene, these genome editing techniques are unsuitable, as correction rates are low (e.g. typically 0.1% to 5%), with the major genome editing products being indels. In order to increase the efficiency of gene correction without simultaneously introducing random indels, the present inventors previously modified the CRISPR/Cas9 system to directly convert one DNA base into another without DSB formation. See, Komor, A. C., et al., Programmable editing of a target base in genomic DNA without double-stranded DNA cleavage. Nature 533, 420-424 (2016), the entire contents of which is incorporated by reference herein.

The terms “base editor (BE)” and “nucleobase editor,” which are used interchangeably herein, refer to an agent comprising a polypeptide that is capable of making a modification to a base (e.g., A, T, C, G, or U) within a nucleic acid sequence (e.g., DNA or RNA) that converts one base to another (e.g., A to G, A to C, A to T, C to T, C to G, C to A, G to A, G to C, G to T, T to A, T to C, T to G). In some embodiments, the nucleobase editor is capable of deaminating a base within a nucleic acid such as a base within a DNA molecule. In the case of an adenine nucleobase editor, the nucleobase editor is capable of deaminating an adenine (A) in DNA. Such nucleobase editors may include a nucleic acid programmable DNA binding protein (napDNAbp) fused to an adenosine deaminase. Some nucleobase editors include CRISPR-mediated fusion proteins that are utilized in the base editing methods described herein. In some embodiments, the nucleobase editor comprises a nuclease-inactive Cas9 (dCas9) fused to a deaminase which binds a nucleic acid in a guide RNA-programmed manner via the formation of an R-loop, but does not cleave the nucleic acid. For example, the dCas9 domain of the fusion protein may include a D10A and a H840A mutation (which renders Cas9 capable of cleaving only one strand of a nucleic acid duplex), as described in PCT/US2016/058344, which published as WO 2017/070632 on Apr. 27, 2017 and is incorporated herein by reference in its entirety. The DNA cleavage domain of S. pyogenes Cas9 includes two subdomains, the HNH nuclease subdomain and the RuvC1 subdomain. The HNH subdomain cleaves the strand complementary to the gRNA (the “targeted strand”, or the strand in which editing or deamination occurs), whereas the RuvC1 subdomain cleaves the non-complementary strand containing the PAM sequence (the “non-edited strand”). 
The RuvC1 mutant D10A generates a nick in the targeted strand, while the HNH mutant H840A generates a nick on the non-edited strand (see Jinek et al., Science, 337:816-821(2012); Qi et al.,Cell. 28; 152(5):1173-83 (2013)).

In some embodiments, a nucleobase editor is a macromolecule or macromolecular complex that results primarily (e.g., more than 80%, more than 85%, more than 90%, more than 95%, more than 99%, more than 99.9%, or 100%) in the conversion of a nucleobase in a polynucleic acid sequence into another nucleobase (i.e., a transition or transversion) using a combination of 1) a nucleotide-, nucleoside-, or nucleobase-modifying enzyme and 2) a nucleic acid binding protein that can be programmed to bind to a specific nucleic acid sequence.

In some embodiments, the nucleobase editor comprises a DNA binding domain (e.g., a programmable DNA binding domain such as a dCas9 or nCas9) that directs it to a target sequence. In some embodiments, the nucleobase editor comprises a nucleobase modification domain fused to a programmable DNA binding domain (e.g., a dCas9 or nCas9). The terms “nucleobase modifying enzyme” and “nucleobase modification domain,” which are used interchangeably herein, refer to an enzyme that can modify a nucleobase and convert one nucleobase to another (e.g., a deaminase such as a cytidine deaminase or an adenosine deaminase). The nucleobase modifying enzyme of the nucleobase editor may target cytosine (C) bases in a nucleic acid sequence and convert the C to a thymine (T) base. In some embodiments, C to T editing is carried out by a deaminase, e.g., a cytidine deaminase. In some embodiments, A to G editing is carried out by a deaminase, e.g., an adenosine deaminase. Nucleobase editors that can carry out other types of base conversions (e.g., C to G) are also contemplated.

A “split nucleobase editor” refers to a nucleobase editor that is provided as an N-terminal portion (also referred to as a N-terminal half) and a C-terminal portion (also referred to as a C-terminal half) encoded by two separate nucleic acids. The polypeptides corresponding to the N-terminal portion and the C-terminal portion of the nucleobase editor may be combined to form a complete nucleobase editor. In some embodiments, for a nucleobase editor that comprises a dCas9 or nCas9, the “split” is located in the dCas9 or nCas9 domain, at positions as described herein in the split Cas9. Accordingly, in some embodiments, the N-terminal portion of the nucleobase editor contains the N-terminal portion of the split Cas9, and the C-terminal portion of the nucleobase editor contains the C-terminal portion of the split Cas9. Similarly, intein-N or intein-C may be fused to the N-terminal portion or the C-terminal portion of the nucleobase editor, respectively, for the joining of the N- and C-terminal portions of the nucleobase editor to form a complete nucleobase editor.

In some embodiments, a nucleobase editor converts a C to a T. In some embodiments, the nucleobase editor comprises a cytosine deaminase. A “cytosine deaminase”, or “cytidine deaminase,” refers to an enzyme that catalyzes the chemical reaction “cytosine+H₂O→uracil+NH₃” or “5-methyl-cytosine+H₂O→thymine+NH₃.” As it may be apparent from the reaction formula, such chemical reactions result in a C to U/T nucleobase change. In the context of a gene, such a nucleotide change, or mutation, may in turn lead to an amino acid change in the protein, which may affect the protein's function, e.g., loss-of-function or gain-of-function. In some embodiments, the C to T nucleobase editor comprises a dCas9 or nCas9 fused to a cytidine deaminase. In some embodiments, the cytidine deaminase domain is fused to the N-terminus of the dCas9 or nCas9. In some embodiments, the nucleobase editor further comprises a domain that inhibits uracil glycosylase, and/or a nuclear localization signal. Such nucleobase editors have been described in the art, e.g., in Rees & Liu, Nat Rev Genet. 2018; 19(12):770-788 and Koblan et al., Nat Biotechnol. 2018; 36(9):843-846; as well as U.S. Patent Publication No. 2018/0073012, published Mar. 15, 2018, which issued as U.S. Pat. No. 10,113,163; on Oct. 30, 2018; U.S. Patent Publication No. 2017/0121693, published May 4, 2017, which issued as U.S. Pat. No. 10,167,457 on Jan. 1, 2019; PCT Publication No. WO 2017/070633, published Apr. 27, 2017; U.S. Patent Publication No. 2015/0166980, published Jun. 18, 2015; U.S. Pat. No. 9,840,699, issued Dec. 12, 2017; U.S. Pat. No. 10,077,453, issued Sep. 18, 2018; PCT Publication No. WO 2019/023680, published Jan. 31, 2019; PCT Publication No. WO 2018/0176009, published Sep. 27, 2018, PCT Application No PCT/US2019/033848, filed May 23, 2019, PCT Application No. PCT/US2019/47996, filed Aug. 23, 2019; PCT Application No. PCT/US2019/049793, filed Sep. 5, 2019; International Patent Application No. PCT/US2020/028568, filed Apr. 
17, 2020; PCT Application No. PCT/US2019/61685, filed Nov. 15, 2019; PCT Application No. PCT/US2019/57956, filed Oct. 24, 2019; PCT Publication No. PCT/US2019/58678, filed Oct. 29, 2019, the contents of each of which are incorporated herein by reference in their entireties.

In some embodiments, a nucleobase editor converts an A to a G. In some embodiments, the nucleobase editor comprises an adenosine deaminase. An “adenosine deaminase” is an enzyme involved in purine metabolism. It is needed for the breakdown of adenosine from food and for the turnover of nucleic acids in tissues. Its primary function in humans is the development and maintenance of the immune system. An adenosine deaminase catalyzes hydrolytic deamination of adenosine (forming inosine, which base pairs as G) in the context of DNA. There are no known natural adenosine deaminases that act on DNA. Instead, known adenosine deaminase enzymes only act on RNA (tRNA or mRNA). Evolved deoxyadenosine deaminase enzymes that accept DNA substrates and deaminate dA to deoxyinosine have been described, e.g., in PCT Application No. PCT/US2017/045381, filed Aug. 3, 2017, which published as WO 2018/027078; PCT Application No. PCT/US2019/033848, filed May 23, 2019, which published as WO 2019/226953; and PCT Application No. PCT/US2020/028568, filed Apr. 17, 2020; each of which is herein incorporated by reference.

Exemplary adenosine and cytidine nucleobase editors are also described in Rees & Liu, Base editing: precision chemistry on the genome and transcriptome of living cells, Nat. Rev. Genet. 2018; 19(12):770-788; as well as U.S. Patent Publication No. 2018/0073012, published Mar. 15, 2018, which issued as U.S. Pat. No. 10,113,163, on Oct. 30, 2018; U.S. Patent Publication No. 2017/0121693, published May 4, 2017, which issued as U.S. Pat. No. 10,167,457 on Jan. 1, 2019; PCT Publication No. WO 2017/070633, published Apr. 27, 2017; U.S. Patent Publication No. 2015/0166980, published Jun. 18, 2015; U.S. Pat. No. 9,840,699, issued Dec. 12, 2017; and U.S. Pat. No. 10,077,453, issued Sep. 18, 2018, the contents of each of which are incorporated herein by reference in their entireties.

The term “Cas9” or “Cas9 nuclease” refers to an RNA-guided nuclease comprising a Cas9 domain, or a fragment thereof (e.g., a protein comprising an active or inactive DNA cleavage domain of Cas9, and/or the gRNA binding domain of Cas9). A “Cas9 domain,” as used herein, is a protein fragment comprising an active or inactive cleavage domain of Cas9 and/or the gRNA binding domain of Cas9. A “Cas9 protein” is a full length Cas9 protein. A Cas9 nuclease is also referred to sometimes as a casn1 nuclease or a CRISPR (Clustered Regularly Interspaced Short Palindromic Repeat)-associated nuclease. CRISPR is an adaptive immune system that provides protection against mobile genetic elements (viruses, transposable elements, and conjugative plasmids). CRISPR clusters contain spacers, sequences complementary to antecedent mobile elements, and target invading nucleic acids. CRISPR clusters are transcribed and processed into CRISPR RNA (crRNA). In type II CRISPR systems, correct processing of pre-crRNA requires a trans-encoded small RNA (tracrRNA), endogenous ribonuclease 3 (rnc), and a Cas9 domain. The tracrRNA serves as a guide for ribonuclease 3-aided processing of pre-crRNA. Subsequently, Cas9/crRNA/tracrRNA endonucleolytically cleaves linear or circular dsDNA target complementary to the spacer. The target strand not complementary to crRNA is first cut endonucleolytically, then trimmed 3′-5′ exonucleolytically. In nature, DNA-binding and cleavage typically requires protein and both RNAs. However, single guide RNAs (“sgRNA”, or simply “gRNA”) can be engineered so as to incorporate aspects of both the crRNA and tracrRNA into a single RNA species. See, e.g., Jinek M., Chylinski K., Fonfara I., Hauer M., Doudna J. A., Charpentier E. Science 337:816-821(2012), the entire contents of which are hereby incorporated by reference. Cas9 recognizes a short motif in the CRISPR repeat sequences (the PAM or protospacer adjacent motif) to help distinguish self versus non-self. 
Cas9 nuclease sequences and structures are well known to those of skill in the art (see, e.g., “Complete genome sequence of an M1 strain of Streptococcus pyogenes.” Ferretti et al., J. J., McShan W. M., Ajdic D. J., Savic D. J., Savic G., Lyon K., Primeaux C., Sezate S., Suvorov A. N., Kenton S., Lai H. S., Lin S. P., Qian Y., Jia H. G., Najar F. Z., Ren Q., Zhu H., Song L., White J., Yuan X., Clifton S.W., Roe B. A., McLaughlin R. E., Proc. Natl. Acad. Sci. U.S.A. 98:4658-4663(2001); “CRISPR RNA maturation by trans-encoded small RNA and host factor RNase III.” Deltcheva E., Chylinski K., Sharma C. M., Gonzales K., Chao Y., Pirzada Z. A., Eckert M.R., Vogel J., Charpentier E., Nature 471:602-607(2011); and “A programmable dual-RNA-guided DNA endonuclease in adaptive bacterial immunity.” Jinek M., Chylinski K., Fonfara I., Hauer M., Doudna J. A., Charpentier E. Science 337:816-821(2012), the entire contents of each of which are incorporated herein by reference). Cas9 orthologs have been described in various species, including, but not limited to, S. pyogenes and S. thermophilus. Additional suitable Cas9 nucleases and sequences will be apparent to those of skill in the art based on this disclosure, and such Cas9 nucleases and sequences include Cas9 sequences from the organisms and loci disclosed in Chylinski, Rhun, and Charpentier, “The tracrRNA and Cas9 families of type II CRISPR-Cas immunity systems” (2013) RNA Biology 10:5, 726-737; the entire contents of which are incorporated herein by reference. In some embodiments, a Cas9 nuclease comprises one or more mutations that partially impair or inactivate the DNA cleavage domain.

A “split Cas9 protein” or “split Cas9” refers to a Cas9 protein that is provided as an N-terminal portion (which is referred to herein interchangeably as an N-terminal half) and a C-terminal portion (which is referred to herein interchangeably as a C-terminal half) encoded by two separate nucleotide sequences. The polypeptides corresponding to the N-terminal portion and the C-terminal portion of the Cas9 protein may be combined (joined) to form a complete Cas9 protein. A Cas9 protein is known to consist of a bi-lobed structure linked by a disordered linker (e.g., as described in Nishimasu et al., Cell, Volume 156, Issue 5, pp. 935-949, 2014, incorporated herein by reference). In some embodiments, the “split” occurs between the two lobes, generating two portions of a Cas9 protein, each containing one lobe.

A nuclease-inactivated Cas9 domain may interchangeably be referred to as a “dCas9” protein (for nuclease-“dead” Cas9). Methods for generating a Cas9 domain (or a fragment thereof) having an inactive DNA cleavage domain are known (see, e.g., Jinek et al., Science. 337:816-821(2012); Qi et al., “Repurposing CRISPR as an RNA-Guided Platform for Sequence-Specific Control of Gene Expression” (2013) Cell. 28; 152(5):1173-83, the entire contents of each of which are incorporated herein by reference). For example, the DNA cleavage domain of Cas9 is known to include two subdomains, the HNH nuclease subdomain and the RuvC1 subdomain. The HNH subdomain cleaves the strand complementary to the gRNA, whereas the RuvC1 subdomain cleaves the non-complementary strand. Mutations within these subdomains can silence the nuclease activity of Cas9. For example, the mutations D10A and H840A completely inactivate the nuclease activity of S. pyogenes Cas9 (Jinek et al., Science. 337:816-821(2012); Qi et al., Cell. 28; 152(5):1173-83 (2013)). In some embodiments, proteins comprising fragments of Cas9 are provided. For example, in some embodiments, a protein comprises one of two Cas9 domains: (1) the gRNA binding domain of Cas9; or (2) the DNA cleavage domain of Cas9. In some embodiments, proteins comprising Cas9 or fragments thereof are referred to as “Cas9 variants.” A Cas9 variant shares homology to Cas9, or a fragment thereof. For example, a Cas9 variant is at least about 70% identical, at least about 80% identical, at least about 90% identical, at least about 95% identical, at least about 96% identical, at least about 97% identical, at least about 98% identical, at least about 99% identical, at least about 99.5% identical, at least about 99.8% identical, or at least about 99.9% identical to wild type Cas9 (e.g., SpCas9 of SEQ ID NO: 1). 
In some embodiments, the Cas9 variant may have 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, or more amino acid changes compared to wild type Cas9 (e.g., SpCas9 of SEQ ID NO: 1). In some embodiments, the Cas9 variant comprises a fragment of Cas9 (e.g., a gRNA binding domain or a DNA-cleavage domain), such that the fragment is at least about 70% identical, at least about 80% identical, at least about 90% identical, at least about 95% identical, at least about 96% identical, at least about 97% identical, at least about 98% identical, at least about 99% identical, at least about 99.5% identical, or at least about 99.9% identical to the corresponding fragment of wild type Cas9 (e.g., SpCas9 of SEQ ID NO: 1). In some embodiments, the fragment is at least 30%, at least 35%, at least 40%, at least 45%, at least 50%, at least 55%, at least 60%, at least 65%, at least 70%, at least 75%, at least 80%, at least 85%, at least 90%, at least 95%, at least 96%, at least 97%, at least 98%, at least 99%, or at least 99.5% of the amino acid length of a corresponding wild type Cas9 (e.g., SpCas9 of SEQ ID NO: 1).
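By way of non-limiting illustration, the percent identity thresholds recited above can be checked computationally. The following is a minimal sketch that assumes the two amino acid sequences have already been aligned to equal length (with "-" marking gaps) by an alignment tool of the practitioner's choosing; the function names are hypothetical and are not part of this disclosure.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of aligned columns in which both sequences carry the same residue.

    Assumes the inputs are pre-aligned to equal length, with '-' as the gap symbol.
    Columns that are gaps in both sequences are ignored.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must be aligned to the same length")
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b)
               if not (a == "-" and b == "-")]
    if not columns:
        raise ValueError("no aligned columns to compare")
    matches = sum(1 for a, b in columns if a == b)
    return 100.0 * matches / len(columns)


def meets_identity_threshold(aligned_a: str, aligned_b: str,
                             threshold_percent: float) -> bool:
    """True if the variant is at least `threshold_percent` identical to the reference."""
    return percent_identity(aligned_a, aligned_b) >= threshold_percent
```

For example, two aligned ten-residue fragments that differ at a single position are 90% identical under this definition, and so satisfy an "at least about 90% identical" threshold but not an "at least about 95% identical" threshold.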

As used herein, the term “nCas9” or “Cas9 nickase” refers to a Cas9 or a variant thereof, which cleaves or nicks only one of the strands of a target cut site, thereby introducing a nick in a double strand DNA molecule rather than creating a double strand break. This can be achieved by introducing appropriate mutations in a wild-type Cas9 which inactivate one of the two endonuclease activities of the Cas9. Any suitable mutation which inactivates one Cas9 endonuclease activity but leaves the other intact is contemplated. For example, one of the D10A or H840A mutations in the wild-type S. pyogenes Cas9 amino acid sequence, or a D10A mutation in the wild-type S. aureus Cas9 amino acid sequence, may be used to form the nCas9.

The term “cDNA” refers to a strand of DNA copied from an RNA template. cDNA is complementary to the RNA template.

As used herein, the term “circular permutant” refers to a protein or polypeptide (e.g., a Cas9) comprising a circular permutation, which is a change in the protein's structural configuration involving a change in the order of amino acids appearing in the protein's amino acid sequence. In other words, circular permutants are proteins that have altered N- and C-termini as compared to a wild-type counterpart, e.g., the wild-type C-terminal half of a protein becomes the new N-terminal half. Circular permutation (or CP) is essentially the topological rearrangement of a protein's primary sequence, connecting its N- and C-terminus, often with a peptide linker, while concurrently splitting its sequence at a different position to create new, adjacent N- and C-termini. The result is a protein structure with different connectivity, but which often can have a similar overall three-dimensional (3D) shape, and possibly include improved or altered characteristics, including reduced proteolytic susceptibility, improved catalytic activity, altered substrate or ligand binding, and/or improved thermostability. Circular permutant proteins can occur in nature (e.g., concanavalin A and lectin). In addition, circular permutation can occur as a result of posttranslational modifications or may be engineered using recombinant techniques. Such circularly permuted proteins (“CP-napDNAbp”, such as “CP-Cas9” in the case of Cas9), or variants thereof, retain the ability to bind DNA when complexed with a guide RNA (gRNA). See, Oakes et al., “Protein Engineering of Cas9 for enhanced function,” Methods Enzymol, 2014, 546: 491-511 and Oakes et al., “CRISPR-Cas9 Circular Permutants as Programmable Scaffolds for Genome Modification,” Cell, Jan. 10, 2019, 176: 254-267, each of which is incorporated herein by reference.
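The topological rearrangement described above can be illustrated at the level of primary sequence. The following is a minimal sketch, assuming a one-letter amino acid string and a 0-based index for the residue that becomes the new N-terminus; the function name, the toy sequence, and the example linker are hypothetical and chosen only for illustration.

```python
def circular_permutant(sequence: str, new_start: int, linker: str = "") -> str:
    """Return the circularly permuted primary sequence.

    The original N- and C-termini are joined (optionally through `linker`),
    and the chain is reopened so that `sequence[new_start]` becomes the
    new N-terminal residue.
    """
    if not 0 < new_start < len(sequence):
        raise ValueError("new_start must fall strictly inside the sequence")
    c_terminal_part = sequence[new_start:]   # becomes the new N-terminal portion
    n_terminal_part = sequence[:new_start]   # becomes the new C-terminal portion
    return c_terminal_part + linker + n_terminal_part
```

For instance, `circular_permutant("ABCDEFGH", 4)` yields `"EFGHABCD"`: the wild-type C-terminal half becomes the new N-terminal half, and every residue of the original sequence is retained.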

The term “circularly permuted Cas9” refers to a Cas9 protein, or variant thereof (e.g., SpCas9), that occurs as or is engineered as a circular permutant, whereby its N- and C-termini have been topologically rearranged. The instant disclosure contemplates the use of any previously known CP-Cas9 or of a new CP-Cas9, so long as the resulting circularly permuted protein retains the ability to bind DNA when complexed with a guide RNA (gRNA).

As used herein, a “cytosine deaminase” encoded by the CDA gene is an enzyme that catalyzes the removal of an amine group from cytidine (i.e., the base cytosine when attached to a ribose ring) to uridine (C to U) and deoxycytidine to deoxyuridine (C to U). A non-limiting example of a cytosine deaminase is APOBEC1 (“apolipoprotein B mRNA editing enzyme, catalytic polypeptide 1”). Another example is AID (“activation-induced cytosine deaminase”). Under standard Watson-Crick hydrogen bond pairing, a cytosine base hydrogen bonds to a guanine base. When cytidine is converted to uridine (or deoxycytidine is converted to deoxyuridine), the uridine (or the uracil base of uridine) undergoes hydrogen bond pairing with the base adenine. Thus, a conversion of “C” to uridine (“U”) by cytosine deaminase will cause the insertion of “A” instead of a “G” during cellular repair and/or replication processes. Since the adenine “A” pairs with thymine “T”, the cytosine deaminase in coordination with DNA replication causes the conversion of a C.G pairing to a T.A pairing in the double-stranded DNA molecule.
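The pairing logic in the preceding paragraph (C is deaminated to U, U templates an A, and that A in turn templates a T) can be traced with a short simulation. This is an illustrative sketch only: it models base pairing position by position, ignores strand orientation and the cellular repair machinery, and uses hypothetical function names.

```python
# Pairing partner inserted opposite each base when the strand is used as a template.
# U (the product of cytosine deamination) templates an A, like T does.
PAIRING_PARTNER = {"A": "T", "T": "A", "G": "C", "C": "G", "U": "A"}


def deaminate_cytosine(strand: str, position: int) -> str:
    """Convert the C at `position` (0-based) to U."""
    if strand[position] != "C":
        raise ValueError("target position is not a cytosine")
    return strand[:position] + "U" + strand[position + 1:]


def template_copy(strand: str) -> str:
    """Base-by-base pairing partner of `strand` (same index order)."""
    return "".join(PAIRING_PARTNER[base] for base in strand)


def outcome_after_replication(strand: str, position: int) -> tuple[str, str]:
    """Return (new_strand, partner_strand) after deamination and two rounds of copying."""
    edited = deaminate_cytosine(strand, position)
    partner = template_copy(edited)      # an A is inserted opposite the U
    new_strand = template_copy(partner)  # a T is inserted opposite that A
    return new_strand, partner
```

Starting from the strand `"GACT"` with the C at index 2 as the target, the sketch returns `("GATT", "CTAA")`: the original C.G pair at that position has become a T.A pair.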

“CRISPR” is a family of DNA sequences (i.e., CRISPR clusters) in bacteria and archaea that represent snippets of prior infections by viruses that have invaded the prokaryote. The snippets of DNA are used by the prokaryotic cell to detect and destroy DNA from subsequent attacks by similar viruses and effectively compose, along with an array of CRISPR-associated proteins (including Cas9 and homologs thereof) and CRISPR-associated RNA, a prokaryotic immune defense system. In nature, CRISPR clusters are transcribed and processed into CRISPR RNA (crRNA). In certain types of CRISPR systems (e.g., type II CRISPR systems), correct processing of pre-crRNA requires a trans-encoded small RNA (tracrRNA), endogenous ribonuclease 3 (rnc), and a Cas9 protein. The tracrRNA serves as a guide for ribonuclease 3-aided processing of pre-crRNA. Subsequently, Cas9/crRNA/tracrRNA endonucleolytically cleaves linear or circular dsDNA target complementary to the RNA. Specifically, the target strand not complementary to crRNA is first cut endonucleolytically, then trimmed 3′-5′ exonucleolytically. In nature, DNA-binding and cleavage typically requires protein and both RNAs. However, single guide RNAs (“sgRNA”, or simply “gRNA”) can be engineered so as to incorporate aspects of both the crRNA and tracrRNA into a single RNA species, the guide RNA. See, e.g., Jinek M., Chylinski K., Fonfara I., Hauer M., Doudna J. A., Charpentier E. Science 337:816-821(2012), the entire contents of which are hereby incorporated by reference. Cas9 recognizes a short motif in the CRISPR repeat sequences (the PAM or protospacer adjacent motif) to help distinguish self versus non-self. CRISPR biology, as well as Cas9 nuclease sequences and structures, are well known to those of skill in the art (see, e.g., “Complete genome sequence of an M1 strain of Streptococcus pyogenes.” Ferretti et al., J. J., McShan W. M., Ajdic D. J., Savic D. J., Savic G., Lyon K., Primeaux C., Sezate S., Suvorov A. N., Kenton S., Lai H. S., Lin S. P., Qian Y., Jia H. G., Najar F. Z., Ren Q., Zhu H., Song L., White J., Yuan X., Clifton S.W., Roe B. A., McLaughlin R. E., Proc. Natl. Acad. Sci. U.S.A. 98:4658-4663(2001); “CRISPR RNA maturation by trans-encoded small RNA and host factor RNase III.” Deltcheva E., Chylinski K., Sharma C. M., Gonzales K., Chao Y., Pirzada Z. A., Eckert M. R., Vogel J., Charpentier E., Nature 471:602-607(2011); and “A programmable dual-RNA-guided DNA endonuclease in adaptive bacterial immunity.” Jinek M., Chylinski K., Fonfara I., Hauer M., Doudna J. A., Charpentier E. Science 337:816-821(2012), the entire contents of each of which are incorporated herein by reference). Cas9 orthologs have been described in various species, including, but not limited to, S. pyogenes and S. thermophilus. Additional suitable Cas9 nucleases and sequences will be apparent to those of skill in the art based on this disclosure, and such Cas9 nucleases and sequences include Cas9 sequences from the organisms and loci disclosed in Chylinski, Rhun, and Charpentier, “The tracrRNA and Cas9 families of type II CRISPR-Cas immunity systems” (2013) RNA Biology 10:5, 726-737; the entire contents of which are incorporated herein by reference.

The term “deaminase” or “deaminase domain” refers to a protein or enzyme that catalyzes a deamination reaction. In some embodiments, the deaminase is an adenosine (or adenine) deaminase, which catalyzes the hydrolytic deamination of adenine or adenosine. In some embodiments, the adenosine deaminase catalyzes the hydrolytic deamination of adenine or adenosine in deoxyribonucleic acid (DNA) to inosine. In other embodiments, the deaminase is a cytidine (or cytosine) deaminase, which catalyzes the hydrolytic deamination of cytidine or cytosine.

The deaminases provided herein may be from any organism, such as a bacterium. In some embodiments, the deaminase or deaminase domain is a variant of a naturally-occurring deaminase from an organism. In some embodiments, the deaminase or deaminase domain does not occur in nature. For example, in some embodiments, the deaminase or deaminase domain is at least 50%, at least 55%, at least 60%, at least 65%, at least 70%, at least 75%, at least 80%, at least 85%, at least 90%, at least 95%, at least 96%, at least 97%, at least 98%, at least 99%, or at least 99.5% identical to a naturally-occurring deaminase.

As used herein, the term “DNA binding protein” or “DNA binding protein domain” refers to any protein that localizes to and binds a specific target DNA nucleotide sequence (e.g., a gene locus of a genome). This term embraces RNA-programmable proteins, which associate (e.g., form a complex) with one or more nucleic acid molecules (i.e., which includes, for example, guide RNA in the case of Cas systems) that direct or otherwise program the protein to localize to a specific target nucleotide sequence (e.g., DNA sequence) that is complementary to the one or more nucleic acid molecules (or a portion or region thereof) associated with the protein. Exemplary RNA-programmable proteins are CRISPR-Cas9 proteins, as well as Cas9 equivalents, homologs, orthologs, or paralogs, whether naturally occurring or non-naturally occurring (e.g., engineered or modified), and may include a Cas9 equivalent from any type of CRISPR system (e.g., type II, V, VI), including Cpf1 (a type V CRISPR-Cas system) (now known as Cas12a), C2c1 (a type V CRISPR-Cas system), C2c2 (a type VI CRISPR-Cas system), C2c3 (a type V CRISPR-Cas system), dCas9, GeoCas9, CjCas9, Cas12b, Cas12c, Cas12d, Cas12g, Cas12h, Cas12i, Cas13d, Cas14, Argonaute, and nCas9. Further Cas-equivalents are described in Makarova et al., “C2c2 is a single-component programmable RNA-guided RNA-targeting CRISPR effector,” Science 2016; 353(6299), the contents of which are incorporated herein by reference.

The term “DNA editing efficiency,” as used herein, refers to the number or proportion of intended base pairs that are edited. For example, if a nucleobase editor edits 10% of the base pairs that it is intended to target (e.g., within a cell or within a population of cells), then the nucleobase editor can be described as being 10% efficient. Some aspects of editing efficiency embrace the modification (e.g. deamination) of a specific nucleotide within DNA, without generating a large number or percentage of insertions or deletions (i.e., indels). It is generally accepted that editing while generating less than 5% indels (as measured over total target nucleotide substrates) is high editing efficiency. The generation of more than 20% indels is generally accepted as poor or low editing efficiency. Indel formation may be measured by techniques known in the art, including high-throughput screening of sequencing reads.
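The definition above reduces to simple proportions over sequencing reads. The following sketch assumes read counts obtained from high-throughput sequencing of the target site; the function names are hypothetical, and the label returned for indel rates between 5% and 20% is a placeholder, since the text above does not characterize that range.

```python
def editing_efficiency(edited_reads: int, total_reads: int) -> float:
    """Percent of reads carrying the intended base edit."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return 100.0 * edited_reads / total_reads


def indel_percentage(indel_reads: int, total_reads: int) -> float:
    """Percent of reads carrying an insertion or deletion."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return 100.0 * indel_reads / total_reads


def classify_by_indels(indel_percent: float) -> str:
    """Apply the thresholds stated above: <5% indels is high, >20% is low."""
    if indel_percent < 5.0:
        return "high"
    if indel_percent > 20.0:
        return "low"
    return "unclassified"  # the range 5-20% is not characterized in the text
```

For example, 100 edited reads out of 1,000 total reads corresponds to an editing efficiency of 10%, matching the example given in the definition.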

The term “off-target editing frequency,” as used herein, refers to the number or proportion of unintended base pairs, e.g., DNA base pairs, that are edited. On-target and off-target editing frequencies may be measured by the methods and assays described herein, further in view of techniques known in the art, including high-throughput sequencing reads. As used herein, high-throughput sequencing involves the hybridization of nucleic acid primers (e.g., DNA primers) with complementarity to nucleic acid (e.g., DNA) regions just upstream or downstream of the target sequence or off-target sequence of interest. Because the DNA target sequence and the Cas9-independent off-target sequences are known a priori in the methods disclosed herein, nucleic acid primers with sufficient complementarity to regions upstream or downstream of the target sequence and Cas9-independent off-target sequences of interest may be designed using techniques known in the art, such as the PhusionU PCR kit (Life Technologies), Phusion HS II kit (Life Technologies), and Illumina MiSeq kit. Since many of the Cas9-dependent off-target sites have high sequence identity to the target site of interest, nucleic acid primers with sufficient complementarity to regions upstream or downstream of the Cas9-dependent off-target site may likewise be designed using techniques and kits known in the art. These kits make use of polymerase chain reaction (PCR) amplification, which produces amplicons as intermediate products. The target and off-target sequences may comprise genomic loci that further comprise protospacers and PAMs. Accordingly, the term “amplicons,” as used herein, may refer to nucleic acid molecules that constitute the aggregates of genomic loci, protospacers and PAMs. High-throughput sequencing techniques used herein may further include Sanger sequencing and Illumina-based next-generation genome sequencing (NGS).

The term “on-target editing,” as used herein, refers to the introduction of intended modifications (e.g., deaminations) to nucleotides (e.g., adenine) in a target sequence, such as using the nucleobase editors described herein. The term “off-target DNA editing,” as used herein, refers to the introduction of unintended modifications (e.g. deaminations) to nucleotides (e.g. adenine) in a sequence outside the canonical nucleobase editor binding window (i.e., from one protospacer position to another, typically 2 to 8 nucleotides long). Off-target DNA editing can result from weak or non-specific binding of the gRNA sequence to the target sequence.

As used herein, the terms “upstream” and “downstream” are terms of relativity that define the linear position of at least two elements located in a nucleic acid molecule (whether single or double-stranded) that is orientated in a 5′-to-3′ direction. In particular, a first element is upstream of a second element in a nucleic acid molecule where the first element is positioned somewhere that is 5′ to the second element. For example, a SNP is upstream of a Cas9-induced nick site if the SNP is on the 5′ side of the nick site. Conversely, a first element is downstream of a second element in a nucleic acid molecule where the first element is positioned somewhere that is 3′ to the second element. For example, a SNP is downstream of a Cas9-induced nick site if the SNP is on the 3′ side of the nick site. The nucleic acid molecule can be a DNA (double or single stranded), RNA (double or single stranded), or a hybrid of DNA and RNA. The analysis is the same for a single strand nucleic acid molecule and a double strand molecule, since the terms upstream and downstream are in reference to only a single strand of a nucleic acid molecule, except that one needs to select which strand of the double stranded molecule is being considered. Often, the strand of a double stranded DNA which can be used to determine the positional relativity of at least two elements is the “sense” or “coding” strand. In genetics, a “sense” strand is the segment within double-stranded DNA that runs from 5′ to 3′, and which is complementary to the antisense strand of DNA, or template strand, which runs from 3′ to 5′. Thus, as an example, a SNP nucleobase is “downstream” of a promoter sequence in a genomic DNA (which is double-stranded) if the SNP nucleobase is on the 3′ side of the promoter on the sense or coding strand.
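The positional convention above can be stated compactly in terms of coordinates on a single strand. This is a minimal sketch assuming both elements are described by 0-based indices along the same strand written 5′ to 3′ (for double-stranded DNA, the chosen strand, typically the sense strand); the function name is hypothetical.

```python
def relative_position(first_index: int, second_index: int) -> str:
    """Describe where the first element lies relative to the second.

    Both indices are 0-based positions on the same strand, written 5' to 3',
    so a smaller index is closer to the 5' end.
    """
    if first_index < second_index:
        return "upstream"      # first element is 5' of the second
    if first_index > second_index:
        return "downstream"    # first element is 3' of the second
    return "same position"
```

Under this convention, a SNP at index 12 is upstream of a nick site at index 30, and a SNP at index 45 is downstream of that same nick site.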

The term “base edit:indel ratio,” as used herein, refers to the ratio of intended DNA nucleobase modifications (e.g., point mutations or deaminations) to formation of indels.
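As an illustrative sketch of this ratio, assume counts of sequencing reads carrying the intended base edit and reads carrying indels at the same target site. The function name is hypothetical, and returning infinity when no indels are observed is a choice made for this example rather than a convention stated in this disclosure.

```python
def base_edit_to_indel_ratio(base_edit_reads: int, indel_reads: int) -> float:
    """Ratio of reads with the intended base edit to reads with indels."""
    if base_edit_reads < 0 or indel_reads < 0:
        raise ValueError("read counts must be non-negative")
    if indel_reads == 0:
        return float("inf")  # no indels detected; ratio is unbounded
    return base_edit_reads / indel_reads
```

For example, 500 base-edited reads and 10 indel-containing reads give a base edit:indel ratio of 50:1.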

The term “effective amount,” as used herein, refers to an amount of a biologically active agent that is sufficient to elicit a desired biological response. For example, in some embodiments, an effective amount of a nucleobase editor may refer to the amount of the editor that is sufficient to edit a target site nucleotide sequence, e.g., a genome. In some embodiments, an effective amount of a nucleobase editor provided herein, e.g., of a fusion protein comprising a nickase Cas9 domain and a guide RNA may refer to the amount of the fusion protein that is sufficient to induce editing of a target site specifically bound and edited by the fusion protein. As will be appreciated by the skilled artisan, the effective amount of an agent, e.g., a fusion protein, a nuclease, a hybrid protein, a protein dimer, a complex of a protein (or protein dimer) and a polynucleotide, or a polynucleotide, may vary depending on various factors as, for example, on the desired biological response, e.g., on the specific allele, genome, or target site to be edited, on the cell or tissue being targeted, and on the agent being used.

The term “functional equivalent” refers to a second biomolecule that is equivalent in function, but not necessarily equivalent in structure to a first biomolecule. For example, a “Cas9 equivalent” refers to a protein that has the same or substantially the same functions as Cas9, but not necessarily the same amino acid sequence. In the context of the disclosure, the specification refers throughout to “a protein X, or a functional equivalent thereof.” In this context, a “functional equivalent” of protein X embraces any homolog, paralog, fragment, naturally occurring, engineered, circular permutant, mutated, or synthetic version of protein X which bears an equivalent function.

The term “fusion protein” as used herein refers to a hybrid polypeptide which comprises protein domains from at least two different proteins. One protein may be located at the amino-terminal (N-terminal) portion of the fusion protein or at the carboxy-terminal (C-terminal) portion of the fusion protein, thus forming an “amino-terminal fusion protein” or a “carboxy-terminal fusion protein,” respectively. A protein may comprise different domains, for example, a nucleic acid binding domain (e.g., the gRNA binding domain of Cas9 that directs the binding of the protein to a target site) and a nucleic acid cleavage domain or a catalytic domain of a nucleic-acid editing protein. Another example includes a Cas9 or equivalent thereof fused to an adenosine deaminase. Any of the proteins provided herein may be produced by any method known in the art. For example, the proteins provided herein may be produced via recombinant protein expression and purification, which is especially suited for fusion proteins comprising a peptide linker. Methods for recombinant protein expression and purification are well known, and include those described by Green and Sambrook, Molecular Cloning: A Laboratory Manual (4th ed., Cold Spring Harbor Laboratory Press, Cold Spring Harbor, N.Y. (2012)), the entire contents of which are incorporated herein by reference.

Two proteins or protein domains are considered to be “fused” when a peptide bond is formed linking the two proteins or two protein domains. In some embodiments, a linker (e.g., a peptide linker) is present between the two proteins or two protein domains. The term “linker,” as used herein, refers to a chemical group or a molecule linking two molecules or moieties, e.g., two domains of a fusion protein, such as, for example, a nuclease-inactive Cas9 domain and a nucleic acid editing domain (e.g., a deaminase domain). Typically, the linker is positioned between, or flanked by, two groups, molecules, or other moieties and connected to each one via a covalent bond, thus connecting the two. In some embodiments, the linker is an amino acid or a plurality of amino acids (e.g., a peptide or protein). In some embodiments, the linker is an organic molecule, group, polymer, or chemical moiety. In some embodiments, the linker is 5-100 amino acids in length, for example, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 30-35, 35-40, 40-45, 45-50, 50-60, 60-70, 70-80, 80-90, 90-100, 100-150, or 150-200 amino acids in length. Longer or shorter linkers are also contemplated.

The term “guide nucleic acid” or “napDNAbp-programming nucleic acid molecule” or equivalently “guide sequence” refers to the one or more nucleic acid molecules which associate with and direct or otherwise program a napDNAbp protein to localize to a specific target nucleotide sequence (e.g., a gene locus of a genome) that is complementary to the one or more nucleic acid molecules (or a portion or region thereof) associated with the protein, thereby causing the napDNAbp protein to bind to the nucleotide sequence at the specific target site. A non-limiting example is a guide RNA of a Cas protein of a CRISPR-Cas genome editing system. Chemically, guide nucleic acids can be all RNA, all DNA, or a chimera of RNA and DNA. The guide nucleic acids may also include nucleotide analogs. Guide nucleic acids can be expressed as transcription products or can be synthesized.

As used herein, a “guide RNA” can refer to a synthetic fusion of the endogenous bacterial crRNA and tracrRNA that provides both targeting specificity and a scaffold and/or binding ability for Cas9 nuclease to a target DNA. This synthetic fusion does not exist in nature and is also commonly referred to as an sgRNA. However, the term “guide RNA” also embraces equivalent guide nucleic acid molecules that associate with Cas9 equivalents, homologs, orthologs, or paralogs, whether naturally occurring or non-naturally occurring (e.g., engineered or recombinant), and which otherwise program the Cas9 equivalent to localize to a specific target nucleotide sequence. The Cas9 equivalents may include other napDNAbps from any type of CRISPR system (e.g., type II, V, VI), including Cpf1 (a type V CRISPR-Cas system) (now known as Cas12a), C2c1 (a type V CRISPR-Cas system), C2c2 (a type VI CRISPR-Cas system) and C2c3 (a type V CRISPR-Cas system). Further Cas-equivalents are described in Makarova et al., “C2c2 is a single-component programmable RNA-guided RNA-targeting CRISPR effector,” Science 2016; 353(6299), the contents of which are incorporated herein by reference. Exemplary sequences and structures of guide RNAs are provided herein. In addition, methods for designing appropriate guide RNA sequences are provided herein.

A guide RNA is a particular type of guide nucleic acid which is most commonly associated with a Cas protein of a CRISPR-Cas9 system. Functionally, guide RNAs associate with Cas9, directing (or programming) the Cas9 protein to a specific sequence in a DNA molecule that includes a sequence complementary to the protospacer sequence for the guide RNA. A gRNA is a component of the CRISPR/Cas system. Typically, a guide RNA comprises a fusion of a CRISPR-targeting RNA (crRNA) and a trans-activating crRNA (tracrRNA), providing both targeting specificity and scaffolding/binding ability for Cas9 nuclease. A “crRNA” is a bacterial RNA that confers target specificity and requires tracrRNA to bind to Cas9. A “tracrRNA” is a bacterial RNA that links the crRNA to the Cas9 nuclease and typically can bind any crRNA. The sequence specificity of a Cas DNA-binding protein is determined by gRNAs, which have nucleotide base-pairing complementarity to target DNA sequences. The native gRNA comprises a 20 nucleotide (nt) Specificity Determining Sequence (SDS), or spacer, which specifies the DNA sequence to be targeted, and is immediately followed by an 80 nt scaffold sequence, which associates the gRNA with Cas9. In some embodiments, an SDS of the present disclosure has a length of 15 to 100 nucleotides, or more. For example, an SDS may have a length of 15 to 90, 15 to 85, 15 to 80, 15 to 75, 15 to 70, 15 to 65, 15 to 60, 15 to 55, 15 to 50, 15 to 45, 15 to 40, 15 to 35, 15 to 30, or 15 to 20 nucleotides. In some embodiments, the SDS is 20 nucleotides long. For example, the SDS may be 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, or 25 nucleotides long. At least a portion of the target DNA sequence is complementary to the SDS of the gRNA. 
For Cas9 to successfully bind to the DNA target sequence, a region of the target sequence is complementary to the SDS of the gRNA sequence and is immediately followed by the correct protospacer adjacent motif (PAM) sequence (e.g., NGG for Cas9 and TTN, TTTN, or YTN for Cpf1). In some embodiments, an SDS is 100% complementary to its target sequence. In some embodiments, the SDS sequence is less than 100% complementary to its target sequence and is, thus, considered to be partially complementary to its target sequence. For example, a targeting sequence may be 99%, 98%, 97%, 96%, 95%, 94%, 93%, 92%, 91%, or 90% complementary to its target sequence. In some embodiments, the SDS of template DNA or target DNA may differ from a complementary region of a gRNA by 1, 2, 3, 4 or 5 nucleotides.
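The percent-complementarity and mismatch language above can be made concrete with a short calculation. The following is a minimal Python sketch, not part of the disclosed methods. The function name `percent_complementarity`, the 20-nt example SDS, and the convention that the target strand is written 3′ to 5′ (so that facing positions share an index) are illustrative assumptions.

```python
# Watson-Crick partners for an RNA base facing a DNA base.
RNA_TO_DNA_PAIR = {"A": "T", "U": "A", "G": "C", "C": "G"}


def percent_complementarity(sds_rna: str, target_strand_3to5: str) -> float:
    """Return the percentage of SDS positions that base-pair with the target.

    sds_rna is written 5'->3'; target_strand_3to5 is written 3'->5', so
    position i of one string faces position i of the other.
    """
    if len(sds_rna) != len(target_strand_3to5):
        raise ValueError("SDS and target must have the same length")
    paired = sum(
        RNA_TO_DNA_PAIR[rna_base] == dna_base
        for rna_base, dna_base in zip(sds_rna.upper(), target_strand_3to5.upper())
    )
    return 100.0 * paired / len(sds_rna)


# Hypothetical 20-nt SDS, used only for illustration.
EXAMPLE_SDS = "GACGUUACCGGAUCAAGCUU"
# A fully complementary target strand, built position by position.
PERFECT_TARGET = "".join(RNA_TO_DNA_PAIR[base] for base in EXAMPLE_SDS)
# The same target with one mismatched position (first base changed).
ONE_MISMATCH_TARGET = "A" + PERFECT_TARGET[1:]

print(percent_complementarity(EXAMPLE_SDS, PERFECT_TARGET))
print(percent_complementarity(EXAMPLE_SDS, ONE_MISMATCH_TARGET))
```

Under these assumptions, a single mismatch in a 20-nt SDS corresponds to 95% complementarity, and two mismatches to 90%.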

In some embodiments, the guide RNA is about 15-120 nucleotides long and comprises a sequence of at least 10 contiguous nucleotides that is complementary to a target sequence. In some embodiments, the guide RNA is 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, or 120 nucleotides long. In some embodiments, the guide RNA comprises a sequence of 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20 or more contiguous nucleotides that is complementary to a target sequence. Sequence complementarity refers to distinct interactions between adenine and thymine (DNA) or uracil (RNA), and between guanine and cytosine.

As used herein, a “spacer sequence” is the sequence of the guide RNA (~20 nts in length) which has the same sequence (with the exception of uridine bases in place of thymine bases) as the protospacer of the PAM strand of the target (DNA) sequence, and which is complementary to the target strand (or non-PAM strand) of the target sequence.

As used herein, the “target sequence” refers to the ~20 nucleotides in the target DNA sequence that have complementarity to the protospacer sequence in the PAM strand. The target sequence is the sequence that anneals to or is targeted by the spacer sequence of the guide RNA. The spacer sequence of the guide RNA and the protospacer have the same sequence (except the spacer sequence is RNA, and the protospacer is DNA).
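The relationship among the protospacer, the spacer sequence, and the target sequence can be illustrated computationally. The following Python sketch is illustrative only; the function names and the 20-nt example protospacer are hypothetical and are not sequences of the disclosure.

```python
# Translation table giving the Watson-Crick complement of each DNA base.
DNA_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def spacer_from_protospacer(protospacer_dna: str) -> str:
    """Spacer (RNA): same sequence as the protospacer, with U in place of T."""
    return protospacer_dna.upper().replace("T", "U")


def target_from_protospacer(protospacer_dna: str) -> str:
    """Target sequence on the non-PAM strand, written 5'->3'.

    This is the reverse complement of the protospacer on the PAM strand.
    """
    return protospacer_dna.upper().translate(DNA_COMPLEMENT)[::-1]


# Hypothetical protospacer on the PAM strand, written 5'->3'.
EXAMPLE_PROTOSPACER = "GACGTTACCGGATCAAGCTT"

print(spacer_from_protospacer(EXAMPLE_PROTOSPACER))
print(target_from_protospacer(EXAMPLE_PROTOSPACER))
```

In this sketch the spacer and the protospacer differ only in U versus T, while the target sequence is the strand to which the spacer anneals.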

As used herein, the terms “guide RNA core,” “guide RNA scaffold sequence” and “backbone sequence,” which are used interchangeably, refer to the region (or sequence) within the gRNA that is responsible for Cas9 binding. It does not include the 20 nt spacer sequence that is used to guide Cas9 to target DNA. This region is also known as the crRNA/tracrRNA. The guide RNA backbone sequence is separate from the guide sequence, or spacer, region of the guide RNA, which has complementarity to a protospacer of a nucleic acid molecule.

As used herein, the term “protospacer” refers to the sequence (e.g., a ~20 bp sequence) in DNA adjacent to the PAM (protospacer adjacent motif) sequence which shares the same sequence as the spacer sequence of the guide RNA, and which is complementary to the target sequence of the non-PAM strand. The spacer sequence of the guide RNA anneals to the target sequence located on the non-PAM strand. In order to function, Cas9 also requires a specific protospacer adjacent motif (PAM) that varies depending on the bacterial species of the Cas9 gene. The most commonly used Cas9 nuclease, derived from S. pyogenes, recognizes a PAM sequence of NGG that is found directly downstream of the protospacer sequence in the genomic DNA, on the non-target strand. The skilled person will appreciate that the literature in the state of the art sometimes refers to the “protospacer” as the ~20-nt target-specific guide sequence on the guide RNA itself, rather than referring to it as a “spacer” (and that the protospacer (DNA) and the spacer (RNA) have the same sequence). Thus, the term “protospacer” as used herein may be used interchangeably with the term “spacer.” The context of the description surrounding the appearance of either “protospacer” or “spacer” will help inform the reader as to whether the term refers to the gRNA or the DNA sequence. Both usages of these terms are acceptable since the state of the art uses both terms in each of these ways.

A “protospacer adjacent motif” (PAM) is typically a sequence of nucleotides located adjacent to (e.g., within 10, 9, 8, 7, 6, 5, 4, 3, 2, or 1 nucleotide(s) of) a target sequence. A PAM sequence is “immediately adjacent to” a target sequence if the PAM sequence is contiguous with the target sequence (that is, if there are no nucleotides located between the PAM sequence and the target sequence). In some embodiments, a PAM sequence is a wild-type PAM sequence. Examples of PAM sequences include, without limitation, NGG, NGR, NNGRR(T/N), NNNNGATT, NNAGAAW, NGGAG, NAAAAC, AWG, and CC. In some embodiments, a PAM sequence is obtained from Streptococcus pyogenes (e.g., NGG or NGR). In some embodiments, a PAM sequence is obtained from Staphylococcus aureus (e.g., NNGRR(T/N)). In some embodiments, a PAM sequence is obtained from Neisseria meningitidis (e.g., NNNNGATT). In some embodiments, a PAM sequence is obtained from Streptococcus thermophilus (e.g., NNAGAAW or NGGAG). In some embodiments, a PAM sequence is obtained from Treponema denticola (e.g., NAAAAC). In some embodiments, a PAM sequence is obtained from Escherichia coli (e.g., AWG). In some embodiments, a PAM sequence is obtained from Pseudomonas aeruginosa (e.g., CC). Other PAM sequences are contemplated. A PAM sequence is typically located downstream (i.e., 3′) from the target sequence, although in some embodiments a PAM sequence may be located upstream (i.e., 5′) from the target sequence.
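The degenerate PAM notation above (N, R, W, Y) follows the standard IUPAC nucleotide codes, and locating a protospacer that is immediately followed by a PAM can be sketched in a few lines. The following Python example is illustrative only: the function names, the 20-nt protospacer length default, and the example sequence are assumptions, and the sketch scans only the strand supplied (the PAM strand, written 5′ to 3′) with the PAM located 3′ of the protospacer.

```python
# IUPAC nucleotide codes needed for the PAM examples listed above.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG",    # purine
    "Y": "CT",    # pyrimidine
    "W": "AT",    # weak pairing
    "N": "ACGT",  # any base
}


def matches_pam(site: str, pam: str) -> bool:
    """True if the DNA site satisfies the (possibly degenerate) PAM pattern."""
    return len(site) == len(pam) and all(
        base in IUPAC[code] for base, code in zip(site.upper(), pam.upper())
    )


def find_protospacers(pam_strand: str, pam: str = "NGG", length: int = 20):
    """List (start, protospacer, pam_site) where the PAM is immediately 3'."""
    seq = pam_strand.upper()
    hits = []
    for start in range(len(seq) - length - len(pam) + 1):
        site = seq[start + length : start + length + len(pam)]
        if matches_pam(site, pam):
            hits.append((start, seq[start : start + length], site))
    return hits


# Hypothetical PAM-strand sequence: a 20-nt protospacer, then TGG, then ACA.
EXAMPLE_LOCUS = "GACGTTACCGGATCAAGCTT" + "TGG" + "ACA"
print(find_protospacers(EXAMPLE_LOCUS))
```

A PAM located 5′ of the target sequence, as contemplated in some embodiments, would require scanning the window preceding the protospacer instead.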

The term “host cell,” as used herein, refers to a cell that can host, replicate, and transfer a phage vector useful for a continuous evolution process as provided herein. In embodiments where the vector is a viral vector, a suitable host cell is a cell that may be infected by the viral vector, can replicate it, and can package it into viral particles that can infect fresh host cells. A cell can host a viral vector if it supports expression of genes of the viral vector, replication of the viral genome, and/or the generation of viral particles. One criterion to determine whether a cell is a suitable host cell for a given viral vector is to determine whether the cell can support the viral life cycle of a wild-type viral genome that the viral vector is derived from. For example, if the viral vector is a modified M13 phage genome, as provided in some embodiments described herein, then a suitable host cell would be any cell that can support the wild-type M13 phage life cycle. Suitable host cells for viral vectors useful in continuous evolution processes are well known to those of skill in the art, and the disclosure is not limited in this respect. In some embodiments, the viral vector is a phage and the host cell is a bacterial cell. In some embodiments, the host cell is an E. coli cell. Suitable E. coli host strains will be apparent to those of skill in the art, and include, but are not limited to, New England Biolabs (NEB) Turbo, Top10F′, DH12S, ER2738, ER2267, and XL1-Blue MRF′. These strain names are art recognized and the genotypes of these strains have been well characterized. It should be understood that the above strains are exemplary only and that the invention is not limited in this respect. The term “fresh,” as used herein interchangeably with the terms “non-infected” or “uninfected” in the context of host cells, refers to a host cell that has not been infected by a viral vector comprising a gene of interest as used in a continuous evolution process provided herein. 
A fresh host cell can, however, have been infected by a viral vector unrelated to the vector to be evolved or by a vector of the same or a similar type but not carrying the gene of interest.

In some embodiments, the host cell is a prokaryotic cell, for example, a bacterial cell. In some embodiments, the host cell is an E. coli cell. In some embodiments, the host cell is a eukaryotic cell, for example, a yeast cell, a plant cell, an insect cell, or a mammalian cell. In some embodiments, the cell is a human cell. The type of host cell will, of course, depend on the viral vector employed, and suitable host cell/viral vector combinations will be readily apparent to those of skill in the art.

An “intein” is a segment of a protein that is able to excise itself and join the remaining portions (the exteins) with a peptide bond in a process known as protein splicing. Inteins are also referred to as “protein introns.” The process of an intein excising itself and joining the remaining portions of the protein is herein termed “protein splicing” or “intein-mediated protein splicing.” In some embodiments, an intein of a precursor protein (an intein-containing protein prior to intein-mediated protein splicing) comes from two genes. Such an intein is referred to herein as a split intein. For example, in cyanobacteria, DnaE, the catalytic subunit α of DNA polymerase III, is encoded by two separate genes, dnaE-n and dnaE-c. The intein encoded by the dnaE-n gene is herein referred to as “intein-N.” The intein encoded by the dnaE-c gene is herein referred to as “intein-C.”

Other intein systems may also be used. For example, a synthetic intein based on the dnaE intein, the Cfa-N and Cfa-C intein pair, has been described (e.g., in Stevens et al., J Am Chem Soc. 2016 Feb. 24; 138(7):2162-5, incorporated herein by reference). As another example, the naturally split Nostoc punctiforme (Npu) DnaE intein pair has been described (see Zettler, J., Schutz, V. & Mootz, H. D., The naturally split Npu DnaE intein exhibits an extraordinarily high rate in the protein trans-splicing reaction. FEBS letters 583, 909-914 (2009), incorporated herein by reference). Non-limiting examples of intein pairs that may be used in accordance with the present disclosure include: Cfa DnaE intein, Npu DnaE intein, Ssp GyrB intein, Ssp DnaX intein, Ter DnaE3 intein, Ter ThyX intein, Rma DnaB intein and Cne Prp8 intein (e.g., as described in U.S. Pat. No. 8,394,604, incorporated herein by reference).

Exemplary nucleotide and amino acid sequences of inteins are provided below, as SEQ ID NOs: 350-357. In some embodiments, the inteins used in accordance with the disclosed napDNAbp domains (e.g., Cas9 domains) comprise the Npu intein-N comprising the amino acid sequence of SEQ ID NO: 351 and the Npu intein-C comprising the amino acid sequence of SEQ ID NO: 353. In some embodiments, the inteins used in accordance with the disclosed nucleobase editors comprise the Npu intein-N comprising the amino acid sequence of SEQ ID NO: 351 and the Npu intein-C comprising the amino acid sequence of SEQ ID NO: 353. In some embodiments, the inteins used in accordance with the disclosed constructs encoding any of the disclosed napDNAbp domains (e.g., a Cas9 domain) comprise the Npu intein-N DNA comprising the nucleotide sequence of SEQ ID NO: 350 and the Npu intein-C DNA comprising the nucleotide sequence of SEQ ID NO: 352. In some embodiments, the inteins used in accordance with the disclosed constructs encoding any of the disclosed nucleobase editors comprise the Npu intein-N DNA comprising the nucleotide sequence of SEQ ID NO: 350 and the Npu intein-C DNA comprising the nucleotide sequence of SEQ ID NO: 352.

In some embodiments, the intein-N comprises an amino acid sequence that is at least 90%, 95%, 98%, or 99% identical to the amino acid sequence of SEQ ID NO: 351 or 355. In some embodiments, the intein-N comprises an amino acid sequence that differs from the amino acid sequence of SEQ ID NO: 351 or 355 by 1, 2, 3, 4, 5, 6, or 7 amino acids. In some embodiments, the intein-N comprises the amino acid sequence of SEQ ID NO: 351 or 355. In some embodiments, the intein-N used in accordance with the disclosed constructs comprises a nucleotide sequence that is at least 90%, 95%, 98%, or 99% identical to the nucleotide sequence of SEQ ID NO: 350 or 354. In some embodiments, the intein-N used in accordance with the disclosed constructs comprises a nucleotide sequence that differs by 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or 10-15 nucleotides from the nucleotide sequence of SEQ ID NO: 350 or 354.

In some embodiments, the intein-C comprises an amino acid sequence that is at least 90%, 95%, 98%, or 99% identical to the amino acid sequence of SEQ ID NO: 353 or 357. In some embodiments, the intein-C comprises an amino acid sequence that differs from the amino acid sequence of SEQ ID NO: 353 or 357 by 1, 2, 3, 4, or 5 amino acids. In some embodiments, the intein-C comprises the amino acid sequence of SEQ ID NO: 353 or 357. In some embodiments, the intein-C used in accordance with the disclosed constructs comprises a nucleotide sequence that is at least 90%, 95%, 98%, or 99% identical to the nucleotide sequence of SEQ ID NO: 352 or 356. In some embodiments, the intein-C used in accordance with the disclosed constructs comprises a nucleotide sequence that differs by 1, 2, 3, 4, 5, 6, 7, 8, 9, or 10 nucleotides from the nucleotide sequence of SEQ ID NO: 352 or 356.

In particular embodiments, the intein-N comprises the amino acid sequence as set forth in SEQ ID NO: 355. In some embodiments, the intein-C comprises the amino acid sequence as set forth in SEQ ID NO: 357.

DnaE Intein-N DNA:

(SEQ ID NO: 350)

TGCCTGTCATACGAAACCGAGATACTGACAGTAGAATATGGCCTTCTGCC

AATCGGGAAGATTGTGGAGAAACGGATAGAATGCACAGTTTACTCTGTCG

ATAACAATGGTAAATTTATACTCAGCCAGTTGCCCAGTGGCACGACCGGG

GAGAGCAGGAAGTATTCGAATACTGTCTGGAGGATGGAAGTCTCATTAGG

GCCACTAAGGACCACAAATTTATGACAGTCGATGGCCAGATGCTGCCTAT

AGACGAAATCTTTGAGCGAGAGTTGGACCTCATGCGAGTTGACAACCTTC

CTAAT

Npu DnaE N-terminal Protein:

(SEQ ID NO: 351)

CLSYETEILTVEYGLLPIGKIVEKRIECTVYSVDNNGNIYTQPVAQWHDR

GEQEVFEYCLEDGSLIRATKDHKFMTVDGQMLPIDEIFERELDLMRVDNL

PN

DnaE Intein-C DNA:

(SEQ ID NO: 352)

ATGATCAAGATAGCTACAAGGAAGTATCTTGGCAAACAAAACGTTTATGA

TATTGGAGTCGAAAGAGATCACAACTTTGCTCTGAAGAACGGATTCATAG

CTTCTAAT

Npu DnaE C-terminal Protein:

(SEQ ID NO: 353)

MIKIATRKYLGKQNVYDIGVERDHNFALKNGFIASN

Cfa-N DNA:

(SEQ ID NO: 354)

TGCCTGTCTTATGATACCGAGATACTTACCGTTGAATATGGCTTCTTGCC

TATTGGAAAGATTGTCGAAGAGAGAATTGAATGCACAGTATATACTGTAG

ACAAGAATGGTTTCGTTTACACACAGCCCATTGCTCAATGGCACAATCGC

GGCGAACAAGAAGTATTTGAGTACTGTCTCGAGGATGGAAGCATCATACG

AGCAACTAAAGATCATAAATTCATGACCACTGACGGGCAGATGTTGCCAA

TAGATGAGATATTCGAGCGGGGCTTGGATCTCAAACAAGTGGATGGATTG

CCA

Cfa-N Protein:

(SEQ ID NO: 355)

CLSYDTEILTVEYGFLPIGKIVEERIECTVYTVDKNGFVYTQPIAQWHNR

GEQEVFEYCLEDGSIIRATKDHKFMTTDGQMLPIDEIFERGLDLKQVDGL

P

Cfa-C DNA:

(SEQ ID NO: 356)

ATGAAGAGGACTGCCGATGGATCAGAGTTTGAATCTCCCAAGAAGAAGAG

GAAAGTAAAGATAATATCTCGAAAAAGTCTTGGTACCCAAAATGTCTATG

ATATTGGAGTGGAGAAAGATCACAACTTCCTTCTCAAGAACGGTCTCGTA

GCCAGCAAC

Cfa-C Protein:

(SEQ ID NO: 357)

MKRTADGSEFESPKKKRKVKIISRKSLGTQNVYDIGVEKDHNFLLKNGLV

ASN
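The correspondence between an intein DNA listing and its protein listing can be checked by translating the DNA with the standard genetic code. The following Python sketch is illustrative only; it is applied here to the Cfa-C pair (SEQ ID NOs: 356 and 357) as transcribed above, and the helper name `translate` is hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate a coding DNA sequence (5'->3', in frame) into one-letter protein."""
    dna = "".join(dna.split()).upper()  # drop line breaks from the listing
    if len(dna) % 3 != 0:
        raise ValueError("sequence length is not a multiple of three")
    return "".join(CODON_TABLE[dna[i : i + 3]] for i in range(0, len(dna), 3))


# Cfa-C DNA (SEQ ID NO: 356), copied from the listing above.
CFA_C_DNA = """
ATGAAGAGGACTGCCGATGGATCAGAGTTTGAATCTCCCAAGAAGAAGAG
GAAAGTAAAGATAATATCTCGAAAAAGTCTTGGTACCCAAAATGTCTATG
ATATTGGAGTGGAGAAAGATCACAACTTCCTTCTCAAGAACGGTCTCGTA
GCCAGCAAC
"""
# Cfa-C protein (SEQ ID NO: 357), copied from the listing above.
CFA_C_PROTEIN = "MKRTADGSEFESPKKKRKVKIISRKSLGTQNVYDIGVEKDHNFLLKNGLVASN"

print(translate(CFA_C_DNA) == CFA_C_PROTEIN)
```

The same helper may be applied to the other DNA/protein pairs in the listing; no result for those pairs is asserted here.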

Intein-N and intein-C may be fused to the N-terminal portion of the split Cas9 and the C-terminal portion of the split Cas9, respectively, for the joining of the N-terminal portion of the split Cas9 and the C-terminal portion of the split Cas9. For example, in some embodiments, an intein-N is fused to the C-terminus of the N-terminal portion of the split Cas9, i.e., to form a structure of N-[N-terminal portion of the split Cas9]-[intein-N]-C. In some embodiments, an intein-C is fused to the N-terminus of the C-terminal portion of the split Cas9, i.e., to form a structure of N-[intein-C]-[C-terminal portion of the split Cas9]-C. The mechanism of intein-mediated protein splicing for joining the proteins the inteins are fused to (e.g., split Cas9) is known in the art, e.g., as described in Shah et al., Chem Sci. 2014; 5(1):446-461, incorporated herein by reference.
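The fusion architectures N-[N-terminal portion]-[intein-N]-C and N-[intein-C]-[C-terminal portion]-C, and the outcome of trans-splicing, can be modeled at the sequence level. The following Python sketch is illustrative only and treats splicing as simple removal of the intein sequences; it does not model the splicing chemistry. The function names, the toy protein, and the split position are hypothetical; the intein sequences are SEQ ID NOs: 351 and 353 as listed above.

```python
# Npu intein-N (SEQ ID NO: 351) and Npu intein-C (SEQ ID NO: 353).
NPU_INTEIN_N = (
    "CLSYETEILTVEYGLLPIGKIVEKRIECTVYSVDNNGNIYTQPVAQWHDR"
    "GEQEVFEYCLEDGSLIRATKDHKFMTVDGQMLPIDEIFERELDLMRVDNL"
    "PN"
)
NPU_INTEIN_C = "MIKIATRKYLGKQNVYDIGVERDHNFALKNGFIASN"


def make_split_halves(protein: str, split_index: int) -> tuple[str, str]:
    """Build the two fusion polypeptides for a protein split at split_index."""
    n_half = protein[:split_index] + NPU_INTEIN_N  # N-[N-terminal portion]-[intein-N]-C
    c_half = NPU_INTEIN_C + protein[split_index:]  # N-[intein-C]-[C-terminal portion]-C
    return n_half, c_half


def trans_splice(n_half: str, c_half: str) -> str:
    """Sequence-level model of trans-splicing: excise both inteins, join exteins."""
    if not n_half.endswith(NPU_INTEIN_N):
        raise ValueError("N-terminal half does not end with intein-N")
    if not c_half.startswith(NPU_INTEIN_C):
        raise ValueError("C-terminal half does not start with intein-C")
    return n_half[: -len(NPU_INTEIN_N)] + c_half[len(NPU_INTEIN_C) :]


# Toy stand-in for a full-length protein; not a real Cas9 or editor sequence.
TOY_PROTEIN = "NTERMHALF" + "CTERMHALF"
n_half, c_half = make_split_halves(TOY_PROTEIN, split_index=9)
print(trans_splice(n_half, c_half) == TOY_PROTEIN)
```

In this model the spliced product is identical in sequence to the unsplit protein, with no intein residues remaining at the junction.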

The term “mutation,” as used herein, refers to a substitution of a residue within a sequence, e.g., a nucleic acid or amino acid sequence, with another residue; a deletion or insertion of one or more residues within a sequence; or a substitution of a residue within a sequence of a genome in a subject to be corrected. Mutations are typically described herein by identifying the original residue followed by the position of the residue within the sequence and by the identity of the newly substituted residue. Various methods for making the amino acid substitutions (mutations) provided herein are well known in the art, and are provided by, for example, Green and Sambrook, Molecular Cloning: A Laboratory Manual (4th ed., Cold Spring Harbor Laboratory Press, Cold Spring Harbor, N.Y. (2012)). Mutations can include a variety of categories, such as single base polymorphisms, microduplication regions, indels, and inversions, and these categories are not meant to be limiting in any way. Mutations can include “loss-of-function” mutations, which are mutations that reduce or abolish a protein activity. Most loss-of-function mutations are recessive, because in a heterozygote the second chromosome copy carries an unmutated version of the gene coding for a fully functional protein whose presence compensates for the effect of the mutation. There are some exceptions where a loss-of-function mutation is dominant, one example being haploinsufficiency, where the organism is unable to tolerate the approximately 50% reduction in protein activity suffered by the heterozygote. This is the explanation for a few genetic diseases in humans, including Marfan syndrome, which results from a mutation in the gene for the connective tissue protein called fibrillin. Mutations also embrace “gain-of-function” mutations, which are mutations that confer an abnormal activity on a protein or cell that is otherwise not present in a normal condition. 
Many gain-of-function mutations are in regulatory sequences rather than in coding regions, and can therefore have a number of consequences. Because of their nature, gain-of-function mutations are usually dominant. By contrast, many loss-of-function mutations are recessive (e.g., autosomal recessive).
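The substitution notation described above (original residue, position, new residue) can be parsed and applied mechanically. The following Python sketch is illustrative only; the names `Substitution`, `parse_substitution`, and `apply_substitution` are hypothetical, positions are assumed to be 1-based, and the example label “D6A” is applied to the first ten residues of SEQ ID NO: 357 solely for demonstration.

```python
import re
from typing import NamedTuple


class Substitution(NamedTuple):
    original: str     # residue present before the mutation
    position: int     # 1-based position within the sequence
    replacement: str  # residue present after the mutation


_LABEL = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def parse_substitution(label: str) -> Substitution:
    """Parse a label such as 'D6A' into its three components."""
    match = _LABEL.match(label.strip().upper())
    if match is None:
        raise ValueError(f"not a substitution label: {label!r}")
    return Substitution(match.group(1), int(match.group(2)), match.group(3))


def apply_substitution(sequence: str, sub: Substitution) -> str:
    """Return the sequence with the substitution applied, checking the original."""
    index = sub.position - 1
    if not 0 <= index < len(sequence):
        raise ValueError("position lies outside the sequence")
    if sequence[index] != sub.original:
        raise ValueError("original residue does not match the sequence")
    return sequence[:index] + sub.replacement + sequence[index + 1 :]


EXAMPLE_SEQUENCE = "MKRTADGSEF"  # first ten residues of SEQ ID NO: 357
print(apply_substitution(EXAMPLE_SEQUENCE, parse_substitution("D6A")))
```

Checking the stated original residue against the sequence guards against numbering errors, for example when a label was defined relative to a different reference sequence.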

The term “napDNAbp,” which stands for “nucleic acid programmable DNA binding protein,” refers to any protein that may associate (e.g., form a complex) with one or more nucleic acid molecules (i.e., which may broadly be referred to as a “napDNAbp-programming nucleic acid molecule” and includes, for example, guide RNA in the case of Cas systems) which direct or otherwise program the protein to localize to a specific target nucleotide sequence (e.g., a gene locus of a genome) that is complementary to the one or more nucleic acid molecules (or a portion or region thereof) associated with the protein, thereby causing the protein to bind to the nucleotide sequence at the specific target site. The term napDNAbp embraces CRISPR-Cas9 proteins, as well as Cas9 equivalents, homologs, orthologs, or paralogs, whether naturally occurring or non-naturally occurring (e.g., engineered or modified), and may include a Cas9 equivalent from any type of CRISPR system (e.g., type II, V, VI), including Cpf1 (a type V CRISPR-Cas system, now known as Cas12a), C2c1 (a type V CRISPR-Cas system), C2c2 (a type VI CRISPR-Cas system), C2c3 (a type V CRISPR-Cas system), dCas9, GeoCas9, CjCas9, Cas12a, Cas12b, Cas12c, Cas12d, Cas12g, Cas12h, Cas12i, Cas13d, Cas14, Argonaute, and nCas9. Further Cas-equivalents are described in Makarova et al., “C2c2 is a single-component programmable RNA-guided RNA-targeting CRISPR effector,” Science 2016; 353 (6299), the contents of which are incorporated herein by reference. However, the nucleic acid programmable DNA binding proteins (napDNAbps) that may be used in connection with this invention are not limited to CRISPR-Cas systems. The invention embraces any such programmable protein, such as the Argonaute protein from Natronobacterium gregoryi (NgAgo) which may also be used for DNA-guided genome editing. 
The NgAgo-guide DNA system does not require a PAM sequence or guide RNA molecules, which means genome editing can be performed simply by the expression of generic NgAgo protein and introduction of synthetic oligonucleotides on any genomic sequence. See Gao et al., DNA-guided genome editing using the Natronobacterium gregoryi Argonaute. Nature Biotechnology 2016; 34(7):768-73, which is incorporated herein by reference.

In some embodiments, the napDNAbp is an RNA-programmable nuclease which, when in a complex with an RNA, may be referred to as a nuclease:RNA complex. Typically, the bound RNA(s) is referred to as a guide RNA (gRNA). gRNAs can exist as a complex of two or more RNAs, or as a single RNA molecule. gRNAs that exist as a single RNA molecule may be referred to as single-guide RNAs (sgRNAs), though “gRNA” is used interchangeably to refer to guide RNAs that exist as either single molecules or as a complex of two or more molecules. Typically, gRNAs that exist as single RNA species comprise two domains: (1) a domain that shares homology to a target nucleic acid (e.g., and directs binding of a Cas9 (or equivalent) complex to the target); and (2) a domain that binds a Cas9 protein. In some embodiments, domain (2) corresponds to a sequence known as a tracrRNA, and comprises a stem-loop structure. For example, in some embodiments, domain (2) is homologous to a tracrRNA as depicted in FIG. 1E of Jinek et al., Science 337:816-821(2012), the entire contents of which are incorporated herein by reference. Other examples of gRNAs (e.g., those including domain 2) can be found in U.S. Pat. No. 9,340,799, entitled “mRNA-Sensing Switchable gRNAs,” and International Patent Application No. PCT/US2014/054247, filed Sep. 6, 2013, published as WO 2015/035136 and entitled “Delivery System For Functional Nucleases,” the entire contents of each of which are herein incorporated by reference. In some embodiments, a gRNA comprises two or more of domains (1) and (2), and may be referred to as an “extended gRNA.” For example, an extended gRNA will, e.g., bind two or more Cas9 proteins and bind a target nucleic acid at two or more distinct regions, as described herein. The gRNA comprises a nucleotide sequence that complements a target site, which mediates binding of the nuclease:RNA complex to said target site, providing the sequence specificity of the nuclease:RNA complex. 
In some embodiments, the RNA-programmable nuclease is the (CRISPR-associated system) Cas9 endonuclease, for example Cas9 (Csn1) from Streptococcus pyogenes (see, e.g., “Complete genome sequence of an M1 strain of Streptococcus pyogenes.” Ferretti J. J. et al., Proc. Natl. Acad. Sci. U.S.A. 98:4658-4663(2001); “CRISPR RNA maturation by trans-encoded small RNA and host factor RNase III.” Deltcheva E. et al., Nature 471:602-607(2011); and “A programmable dual-RNA-guided DNA endonuclease in adaptive bacterial immunity.” Jinek M. et al., Science 337:816-821(2012), the entire contents of each of which are incorporated herein by reference).

Because the napDNAbp nucleases (e.g., Cas9) use RNA:DNA hybridization to target DNA cleavage sites, these proteins are able to be targeted, in principle, to any sequence specified by the guide RNA. Methods of using napDNAbp nucleases, such as Cas9, for site-specific cleavage (e.g., to modify a genome) are known in the art (see e.g., Cong, L. et al. Multiplex genome engineering using CRISPR/Cas systems. Science 339, 819-823 (2013); Mali, P. et al. RNA-guided human genome engineering via Cas9. Science 339, 823-826 (2013); Hwang, W. Y. et al. Efficient genome editing in zebrafish using a CRISPR-Cas system. Nature Biotechnology 31, 227-229 (2013); Jinek, M. et al. RNA-programmed genome editing in human cells. eLife 2, e00471 (2013); Dicarlo, J. E. et al., Genome engineering in Saccharomyces cerevisiae using CRISPR-Cas systems. Nucleic Acid Res. (2013); Jiang, W. et al. RNA-guided editing of bacterial genomes using CRISPR-Cas systems. Nature Biotechnology 31, 233-239 (2013); the entire contents of each of which are incorporated herein by reference).

The term “nickase” refers to a napDNAbp (e.g., a Cas9) having only a single nuclease activity that cuts only one strand of a target DNA, rather than both strands. Thus, a nickase type napDNAbp does not leave a double-strand break. Exemplary nickases include SpCas9 and SaCas9 nickases. An exemplary nickase comprises a sequence having at least 99%, or 100%, identity to the amino acid sequence of SEQ ID NO: 3 or 11.

A “uracil glycosylase inhibitor (UGI)” refers to a protein that inhibits the activity of uracil-DNA glycosylase. Suitable UGI proteins for use in accordance with the present disclosure include, for example, those published in Wang et al., J. Biol. Chem. 264:1163-1171(1989); Lundquist et al., J. Biol. Chem. 272:21408-21419(1997); Ravishankar et al., Nucleic Acids Res. 26:4880-4887(1998); and Putnam et al., J. Mol. Biol. 287:331-346 (1999), each of which is incorporated herein by reference. Non-limiting, exemplary proteins that may be used as a UGI of the present disclosure and their respective sequences are provided below. In some embodiments, the UGI is a variant of a naturally-occurring UGI from an organism, and the variant does not occur in nature. For example, in some embodiments, the UGI is at least 50%, at least 55%, at least 60%, at least 65%, at least 70%, at least 75%, at least 80%, at least 85%, at least 90%, at least 95%, at least 96%, at least 97%, at least 98%, at least 99%, or at least 99.5% identical to a naturally-occurring UGI from an organism or any UGIs provided herein (e.g., a UGI comprising the amino acid sequence of any one of SEQ ID NOs: 299-302). In some embodiments, the UGI comprises an amino acid sequence that is shorter or longer in length (e.g., by no more than 30%, no more than 25%, no more than 20%, no more than 15%, no more than 10%, no more than 5%, no more than 1% longer or shorter) than any of the UGIs provided herein. In some embodiments, the UGI comprises an amino acid sequence that is shorter or longer in length (e.g., by no more than 20 amino acids, no more than 15 amino acids, no more than 10 amino acids, no more than 5 amino acids, no more than 2 amino acids longer or shorter) than any of the UGIs provided herein.

A “nuclear localization signal” or “NLS” refers to an amino acid sequence that “tags” a protein for import into the cell nucleus by nuclear transport. Typically, this signal consists of one or more short sequences of positively charged lysines or arginines exposed on the protein surface. One or more NLS may be added to the N- or C-terminus of a protein, or internally (e.g., between two protein domains). For example, one or more NLS may be added to the N- or C-terminus of a nucleobase editor, or between the Cas9 and the deaminase in a nucleobase editor. In some embodiments, 1, 2, 3, 4, 5, or more NLS may be added. Nuclear localization sequences are known in the art and would be apparent to the skilled artisan. For example, NLS sequences are described in Plank et al., PCT/EP2000/011690, filed Nov. 23, 2000, the contents of which are incorporated herein by reference for its disclosure of exemplary nuclear localization sequences. In some embodiments, an NLS comprises a bipartite nuclear localization signal comprising an amino acid sequence selected from the group consisting of KRTADGSEFEPKKKRKV (SEQ ID NO: 398), KRPAATKKAGQAKKKK (SEQ ID NO: 344), KKTELQTTNAENKTKKL (SEQ ID NO: 345), KRGINDRNFWRGENGRKTR (SEQ ID NO: 346), RKSGKIAAIVVKRPRK (SEQ ID NO: 347), PKKKRKV (SEQ ID NO: 373) or MDSLLMNRRKFLYQFKNVRWAKGRRETYLC (SEQ ID NO: 374). In some embodiments, a linker is inserted between the Cas9 and the deaminase. In certain embodiments, the NLS comprises the amino acid sequence of SEQ ID NO: 398. In some embodiments, the NLS comprises the amino acid sequence of SEQ ID NO: 344.

An NLS can be classified as monopartite or bipartite. A non-limiting example of a monopartite NLS is the sequence PKKKRKV (SEQ ID NO: 373) in the SV40 Large T-antigen. A “bipartite” NLS typically contains two clusters of basic amino acids, separated by a spacer of about 10 amino acids. One non-limiting example of a bipartite NLS is the NLS of nucleoplasmin, KRPAATKKAGQAKKKK (spacer underlined) (SEQ ID NO: 344). In some embodiments, the NLS used in accordance with the present disclosure is the NLS of nucleoplasmin comprising the amino acid sequence of KRPAATKKAGQAKKKK (SEQ ID NO: 344). Other bipartite NLSs that may be used in accordance with the present disclosure include, without limitation: SV40 bipartite NLS (KRTADGSEFESPKKKRKV (SEQ ID NO: 375), e.g., as described in Hodel et al., J Biol Chem. 2001 Jan. 12; 276(2):1317-25, incorporated herein by reference); Kanadaptin bipartite NLS (KKTELQTTNAENKTKKL (SEQ ID NO: 345), e.g., as described in Hubner et al., Biochem J. 2002 Jan. 15; 361 (Pt 2):287-96, incorporated herein by reference); influenza A nucleoprotein bipartite NLS (KRGINDRNFWRGENGRKTR (SEQ ID NO: 346), e.g., as described in Ketha et al., BMC Cell Biology. 2008; 9:22, incorporated herein by reference); and ZO-2 bipartite NLS (RKSGKIAAIVVKRPRK (SEQ ID NO: 347), e.g., as described in Quiros et al., Nusrat A, ed. Molecular Biology of the Cell. 2013; 24(16):2528-2543, incorporated herein by reference).
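The description of a bipartite NLS as two clusters of basic amino acids separated by a spacer of about 10 amino acids can be expressed as a simple pattern. The following Python sketch is illustrative only. The regular expression (two basic residues, a spacer of 9 to 12 residues matched as short as possible, then a run of at least three basic residues) is a deliberately strict assumption chosen for this example; it is not a general NLS predictor, and not every bipartite NLS listed above fits it.

```python
import re

# Two basic residues, a 9-12 residue spacer (shortest match preferred),
# then a run of three or more basic residues.
BIPARTITE_PATTERN = re.compile(r"([KR]{2})(.{9,12}?)([KR]{3,})")


def split_bipartite(nls: str):
    """Return (first cluster, spacer, second cluster), or None if no match."""
    match = BIPARTITE_PATTERN.search(nls.upper())
    return match.groups() if match else None


NUCLEOPLASMIN_NLS = "KRPAATKKAGQAKKKK"     # SEQ ID NO: 344
SV40_BIPARTITE_NLS = "KRTADGSEFESPKKKRKV"  # SEQ ID NO: 375
SV40_MONOPARTITE_NLS = "PKKKRKV"           # SEQ ID NO: 373

print(split_bipartite(NUCLEOPLASMIN_NLS))
print(split_bipartite(SV40_BIPARTITE_NLS))
print(split_bipartite(SV40_MONOPARTITE_NLS))
```

For the nucleoplasmin NLS, this pattern separates the sequence into the clusters KR and KKKK with the 10-residue spacer PAATKKAGQA, while the monopartite SV40 sequence yields no match.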

The nucleotide sequence encoding an NLS is “operably linked” to the nucleotide sequence encoding a protein to which the NLS is fused (e.g., a Cas9 or a nucleobase editor) when two coding sequences are “in-frame with each other” and are translated as a single polypeptide fusing two sequences.

Nucleic acids of the present disclosure may include one or more genetic elements. A “genetic element” refers to a particular nucleotide sequence that has a role in nucleic acid expression (e.g., promoter, enhancer, terminator) or encodes a discrete product of an engineered nucleic acid (e.g., a nucleotide sequence encoding a guide RNA, a protein and/or an RNA interference molecule).

A “promoter” refers to a control region of a nucleic acid sequence at which initiation and rate of transcription of the remainder of a nucleic acid sequence are controlled. A promoter may also contain sub-regions at which regulatory proteins and molecules may bind, such as RNA polymerase and other transcription factors. Promoters may be constitutive, inducible, activatable, repressible, tissue-specific, or any combination thereof. A promoter drives expression or drives transcription of the nucleic acid sequence that it regulates. Herein, a promoter is considered to be “operably linked” when it is in a correct functional location and orientation in relation to a nucleic acid sequence it regulates to control (“drive”) transcriptional initiation and/or expression of that sequence.

A promoter may be one naturally associated with a gene or sequence, as may be obtained by isolating the 5′ non-coding sequences located upstream of the coding segment of a given gene or sequence. Such a promoter is referred to as an “endogenous promoter.” In some embodiments, a coding nucleic acid sequence may be positioned under the control of a recombinant or heterologous promoter, which refers to a promoter that is not normally associated with the encoded sequence in its natural environment. Such promoters may include promoters of other genes; promoters isolated from any other cell; and synthetic promoters or enhancers that are not “naturally occurring” such as, for example, those that contain different elements of different transcriptional regulatory regions and/or mutations that alter expression through methods of genetic engineering that are known in the art. In addition to producing nucleic acid sequences of promoters and enhancers synthetically, sequences may be produced using recombinant cloning and/or nucleic acid amplification technology, including polymerase chain reaction (PCR).

In some embodiments, promoters used in accordance with the present disclosure are “inducible promoters,” which are promoters that are characterized by regulating (e.g., initiating or activating) transcriptional activity when in the presence of, influenced by or contacted by an inducer signal. An inducer signal may be endogenous or a normally exogenous condition (e.g., light), compound (e.g., chemical or non-chemical compound) or protein that contacts an inducible promoter in such a way as to be active in regulating transcriptional activity from the inducible promoter. Thus, a “signal that regulates transcription” of a nucleic acid refers to an inducer signal that acts on an inducible promoter. A signal that regulates transcription may activate or inactivate transcription, depending on the regulatory system used. Activation of transcription may involve directly acting on a promoter to drive transcription or indirectly acting on a promoter by inactivating a repressor that is preventing the promoter from driving transcription. Conversely, deactivation of transcription may involve directly acting on a promoter to prevent transcription or indirectly acting on a promoter by activating a repressor that then acts on the promoter.

In genetics, a “sense” strand is the segment within double-stranded DNA that runs from 5′ to 3′, and which is complementary to the antisense strand of DNA, or template strand, which runs from 3′ to 5′. In the case of a DNA segment that encodes a protein, the sense strand is the strand of DNA that has the same sequence as the mRNA, which takes the antisense strand as its template during transcription, and eventually undergoes (typically, not always) translation into a protein. The antisense strand is thus responsible for the RNA that is later translated to protein, while the sense strand possesses a nearly identical makeup to that of the mRNA. Note that for each segment of dsDNA, there will possibly be two sets of sense and antisense, depending on which direction one reads (since sense and antisense is relative to perspective). It is ultimately the gene product, or mRNA, that dictates which strand of one segment of dsDNA is referred to as sense or antisense.
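By way of non-limiting illustration, the relationship between the sense strand, the antisense (template) strand, and the mRNA may be sketched as follows. The function names and the short example sequence are hypothetical and are chosen only to make the relationship concrete.

```python
# Illustrative sketch only: the names and the example sequence are assumptions.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of a DNA string written 5' to 3'."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def transcribe_from_template(template_5to3: str) -> str:
    """Return the mRNA (5' to 3') made from a template strand.

    The polymerase reads the template 3' to 5', so the transcript is the
    reverse complement of the template, with U in place of T.
    """
    return reverse_complement(template_5to3).replace("T", "U")


sense = "ATGGCCTAA"                      # hypothetical coding (sense) strand
antisense = reverse_complement(sense)    # template strand, written 5' to 3'
mrna = transcribe_from_template(antisense)
```

In this sketch the mRNA has the same sequence as the sense strand except that uracil replaces thymine, which is the relationship described above.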

The term “subject,” as used herein, refers to an individual organism, for example, an individual mammal. In some embodiments, the subject is a human. In some embodiments, the subject is a non-human mammal. In some embodiments, the subject is a non-human primate. In some embodiments, the subject is a rodent. In some embodiments, the subject is a sheep, a goat, a cow, a cat, or a dog. In some embodiments, the subject is a vertebrate, an amphibian, a reptile, a fish, an insect, a fly, or a nematode. In some embodiments, the subject is a research animal. In some embodiments, the subject is genetically engineered, e.g., a genetically engineered non-human subject. The subject may be of either sex and at any stage of development.

“A subject in need thereof” refers to an individual who has a disease, a sign and/or symptom of a disease, or a predisposition toward a disease, with the purpose to cure, heal, alleviate, relieve, alter, remedy, ameliorate, improve, or affect the disease, the symptom of the disease, or the predisposition toward the disease. In some embodiments, the subject is a mammal. In some embodiments, the subject is a non-human primate. In some embodiments, the subject is human. In some embodiments, the mammal is a rodent. In some embodiments, the rodent is a mouse. In some embodiments, the rodent is a rat. In some embodiments, the mammal is a companion animal. A “companion animal” refers to pets and other domestic animals. Non-limiting examples of companion animals include dogs and cats; livestock, such as horses, cattle, pigs, sheep, goats, and chickens; and other animals, such as mice, rats, guinea pigs, and hamsters.

The term “target site” refers to a sequence within a nucleic acid molecule that is edited by a base editor (BE) or nucleobase editor disclosed herein. The term “target site,” in the context of a single strand, also can refer to the “target strand” which anneals or binds to the spacer sequence of the guide RNA. The target site can refer, in certain embodiments, to a segment of double-stranded DNA that includes the protospacer (i.e., the strand of the target site that has the same nucleotide sequence as the spacer sequence of the guide RNA) on the PAM-strand (or non-target strand) and target strand, which is complementary to the protospacer and the spacer alike, and which anneals to the spacer of the guide RNA, thereby targeting or programming a Cas9 nucleobase editor to target the target site.

A “transcriptional terminator” is a nucleic acid sequence that causes transcription to stop. A transcriptional terminator may be unidirectional or bidirectional. It is comprised of a DNA sequence involved in specific termination of an RNA transcript by an RNA polymerase. A transcriptional terminator sequence prevents transcriptional activation of downstream nucleic acid sequences by upstream promoters. A transcriptional terminator may be necessary in vivo to achieve desirable expression levels or to avoid transcription of certain sequences. A transcriptional terminator is considered to be “operably linked to” a nucleotide sequence when it is able to terminate the transcription of the sequence it is linked to.

The most commonly used type of terminator is a forward terminator. When placed downstream of a nucleic acid sequence that is usually transcribed, a forward transcriptional terminator will cause transcription to abort. In some embodiments, bidirectional transcriptional terminators are provided, which usually cause transcription to terminate on both the forward and reverse strand. In some embodiments, reverse transcriptional terminators are provided, which usually terminate transcription on the reverse strand only.

In prokaryotic systems, terminators usually fall into two categories: (1) rho-independent terminators and (2) rho-dependent terminators. Rho-independent terminators are generally composed of a palindromic sequence that forms a stem loop rich in G-C base pairs followed by several T bases. Without wishing to be bound by theory, the conventional model of transcriptional termination is that the stem loop causes RNA polymerase to pause, and transcription of the poly-A tail causes the RNA:DNA duplex to unwind and dissociate from RNA polymerase.

In eukaryotic systems, the terminator region may comprise specific DNA sequences that permit site-specific cleavage of the new transcript so as to expose a polyadenylation site. This signals a specialized endogenous polymerase to add a stretch of about 200 A residues (polyA) to the 3′ end of the transcript. RNA molecules modified with this polyA tail appear to be more stable and are translated more efficiently. Thus, in some embodiments involving eukaryotes, a terminator may comprise a signal for the cleavage of the RNA. In some embodiments, the terminator signal promotes polyadenylation of the message. The terminator and/or polyadenylation site elements may serve to enhance output nucleic acid levels and/or to minimize read through between nucleic acids.

Terminators for use in accordance with the present disclosure include any terminator of transcription described herein or known to one of ordinary skill in the art. Examples of terminators include, without limitation, the termination sequences of genes such as, for example, the bovine growth hormone terminator, and viral termination sequences such as, for example, the SV40 terminator, spy, yejM, secG-leuU, thrLABC, rrnB T1, hisLGDCBHAFI, metZWV, rrnC, xapR, aspA and arcA terminator. In some embodiments, the termination signal may be a sequence that cannot be transcribed or translated, such as those resulting from a sequence truncation.

A “Woodchuck Hepatitis Virus (WHP) Posttranscriptional Regulatory Element (WPRE)” is a DNA sequence that, when transcribed, creates a tertiary structure enhancing expression. It is commonly used in molecular biology to increase expression of genes delivered by viral vectors. WPRE is a tripartite regulatory element with gamma, alpha, and beta components.

The full WPRE sequence is 609 bp long:

(SEQ ID NO: 376)

GCTTATCGATAATCAACCTCTGGATTACAAAATTTGTGAAAGATTGACTG

GTATTCTTAACTATGTTGCTCCTTTTACGCTATGTGGATACGCTGCTTTA

ATGCCTTTGTATCATGCTATTGCTTCCCGTATGGCTTTCATTTTCTCCTC

CTTGTATAAATCCTGGTTGCTGTCTCTTTATGAGGAGTTGTGGCCCGTTG

TCAGGCAACGTGGCGTGGTGTGCACTGTGTTTGCTGACGCAACCCCCACT

GGTTGGGGCATTGCCACCACCTGTCAGCTCCTTTCCGGGACTTTCGCTTT

CCCCCTCCCTATTGCCACGGCGGAACTCATCGCCGCCTGCCTTGCCCGCT

GCTGGACAGGGGCTCGGCTGTTGGGCACTGACAATTCCGTGGTGTTGTCG

GGGAAATCATCGTCCTTTCCTTGGCTGCTCGCCTATGTTGCCACCTGGAT

TCTGCGCGGGACGTCCTTCTGCTACGTCCCTTCGGCCCTCAATCCAGCGG

ACCTTCCTTCCCGCGGCCTGCTGCCGGCTCTGCGGCCTCTTCCGCGTCTT

CGCCTTCGCCCTCAGACGAGTCGGATCTCCCTTTGGGCCGCCTCCCCGCA

TCGATACCG.

The terms “nucleic acid,” and “polynucleotide,” as used herein, refer to a compound comprising a nucleobase and an acidic moiety, e.g., a nucleotide, or a polymer of nucleotides. Typically, polymeric nucleic acids, e.g., nucleic acid molecules comprising three or more nucleotides are linear molecules, in which adjacent nucleotides are linked to each other via a phosphodiester linkage. In some embodiments, “nucleic acid” refers to individual nucleic acid residues (e.g. nucleotides and/or nucleosides). In some embodiments, “nucleic acid” refers to an oligonucleotide chain comprising three or more individual nucleotide residues. As used herein, the terms “oligonucleotide” and “polynucleotide” can be used interchangeably to refer to a polymer of nucleotides (e.g., a string of at least three nucleotides). In some embodiments, “nucleic acid” encompasses RNA as well as single and/or double-stranded DNA. Nucleic acids may be naturally occurring, for example, in the context of a genome, a transcript, an mRNA, tRNA, rRNA, siRNA, snRNA, a plasmid, cosmid, chromosome, chromatid, or other naturally occurring nucleic acid molecule. On the other hand, a nucleic acid molecule may be a non-naturally occurring molecule, e.g., a recombinant DNA or RNA, an artificial chromosome, an engineered genome (e.g., an engineered viral vector), an engineered vector, or fragment thereof, or a synthetic DNA, RNA, or DNA/RNA hybrid, optionally including non-naturally occurring nucleotides or nucleosides. Furthermore, the terms “nucleic acid,” “DNA,” “RNA,” and/or similar terms include nucleic acid analogs, e.g., analogs having other than a phosphodiester backbone. Nucleic acids can be purified from natural sources, produced using recombinant expression systems and optionally purified, chemically synthesized, etc. 
Where appropriate, e.g., in the case of chemically synthesized molecules, nucleic acids can comprise nucleoside analogs such as analogs having chemically modified bases or sugars, and backbone modifications. A nucleic acid sequence is presented in the 5′ to 3′ direction unless otherwise indicated. In some embodiments, a nucleic acid is or comprises natural nucleosides (e.g. adenosine, thymidine, guanosine, cytidine, uridine, deoxyadenosine, deoxythymidine, deoxyguanosine, and deoxycytidine); nucleoside analogs (e.g., 2-aminoadenosine, 2-thiothymidine, inosine, pyrrolo-pyrimidine, 3-methyl adenosine, 5-methylcytidine, 2-aminoadenosine, C5-bromouridine, C5-fluorouridine, C5-iodouridine, C5-propynyl-uridine, C5-propynyl-cytidine, C5-methylcytidine, 2-aminoadenosine, 7-deazaadenosine, 7-deazaguanosine, 8-oxoadenosine, 8-oxoguanosine, O(6)-methylguanine, and 2-thiocytidine); chemically modified bases; biologically modified bases (e.g., methylated bases); intercalated bases; modified sugars (e.g., 2′-fluororibose, ribose, 2′-deoxyribose, arabinose, and hexose); and/or modified phosphate groups (e.g., phosphorothioates and 5′-N-phosphoramidite linkages).

The terms “protein,” “peptide,” and “polypeptide” are used interchangeably herein, and refer to a polymer of amino acid residues linked together by peptide (amide) bonds. The terms refer to a protein, peptide, or polypeptide of any size, structure, or function. Typically, a protein, peptide, or polypeptide will be at least three amino acids long. A protein, peptide, or polypeptide may refer to an individual protein or a collection of proteins. One or more of the amino acids in a protein, peptide, or polypeptide may be modified, for example, by the addition of a chemical entity such as a carbohydrate group, a hydroxyl group, a phosphate group, a farnesyl group, an isofarnesyl group, a fatty acid group, a linker for conjugation, functionalization, or other modification, etc. A protein, peptide, or polypeptide may also be a single molecule or may be a multi-molecular complex. A protein, peptide, or polypeptide may be just a fragment of a naturally occurring protein or peptide. A protein, peptide, or polypeptide may be naturally occurring, recombinant, or synthetic, or any combination thereof. The term “fusion protein” as used herein refers to a hybrid polypeptide which comprises protein domains from at least two different proteins. One protein may be located at the amino-terminal (N-terminal) portion of the fusion protein or at the carboxy-terminal (C-terminal) portion, thus forming an “amino-terminal fusion protein” or a “carboxy-terminal fusion protein,” respectively. A protein may comprise different domains, for example, a nucleic acid binding domain (e.g., the gRNA binding domain of Cas9 that directs the binding of the protein to a target site) and a nucleic acid cleavage domain or a catalytic domain of a nucleic-acid editing protein. In some embodiments, a protein is in a complex with, or is in association with, a nucleic acid, e.g., RNA or DNA. Any of the proteins provided herein may be produced by any method known in the art.
For example, the proteins provided herein may be produced via recombinant protein expression and purification, which is especially suited for fusion proteins comprising a peptide linker. Methods for recombinant protein expression and purification are well known, and include those described by Green and Sambrook, Molecular Cloning: A Laboratory Manual (4th ed., Cold Spring Harbor Laboratory Press, Cold Spring Harbor, N.Y. (2012)), which is incorporated herein by reference.

The term “subject,” as used herein, refers to an individual organism, for example, an individual mammal. In some embodiments, the subject is a human. In some embodiments, the subject is a non-human mammal. In some embodiments, the subject is a non-human primate. In some embodiments, the subject is a rodent (e.g., mouse, rat). In some embodiments, the subject is a domesticated animal. In some embodiments, the subject is a sheep, a goat, a cow, a cat, or a dog. In some embodiments, the subject is a research animal. In some embodiments, the subject is genetically engineered, e.g., a genetically engineered non-human subject. The subject may be of either sex and at any stage of development.

The term “recombinant” as used herein in the context of proteins or nucleic acids refers to proteins or nucleic acids that do not occur in nature, but are the product of human engineering. For example, in some embodiments, a recombinant protein or nucleic acid molecule comprises an amino acid or nucleotide sequence that comprises at least one, at least two, at least three, at least four, at least five, at least six, or at least seven mutations as compared to any naturally occurring sequence. The fusion proteins (e.g., nucleobase editors) described herein are made by recombinant technology. Recombinant technology is familiar to those skilled in the art.

The term “pharmaceutically-acceptable carrier” means a pharmaceutically-acceptable material, composition or vehicle, such as a liquid or solid filler, diluent, excipient, manufacturing aid (e.g., lubricant, talc, magnesium, calcium or zinc stearate, or stearic acid), or solvent encapsulating material, involved in carrying or transporting the compound from one site (e.g., the delivery site) of the body, to another site (e.g., organ, tissue or portion of the body). A pharmaceutically acceptable carrier is “acceptable” in the sense of being compatible with the other ingredients of the formulation and not injurious to the tissue of the subject (e.g., physiologically compatible, sterile, physiologic pH, etc.).

“A therapeutically effective amount” as used herein refers to the amount of each therapeutic agent (e.g., nucleobase editor, rAAV) described in the present disclosure required to confer therapeutic effect on the subject, either alone or in combination with one or more other therapeutic agents. Effective amounts vary, as recognized by those skilled in the art, depending on the particular condition being treated, the severity of the condition, the individual subject parameters including age, physical condition, size, gender, and weight, the duration of the treatment, the nature of concurrent therapy (if any), the specific route of administration and like factors within the knowledge and expertise of the health practitioner. These factors are well known to those of ordinary skill in the art and can be addressed with no more than routine experimentation. It is generally preferred that a maximum dose of the individual components or combinations thereof be used, that is, the highest safe dose according to sound medical judgment. It will be understood by those of ordinary skill in the art, however, that a subject may insist upon a lower dose or tolerable dose for medical reasons, psychological reasons or for virtually any other reasons. Empirical considerations, such as the half-life, generally will contribute to the determination of the dosage. For example, therapeutic agents that are compatible with the human immune system, such as polypeptides comprising regions from humanized antibodies or fully human antibodies, may be used to prolong half-life of the polypeptide and to prevent the polypeptide being attacked by the host's immune system.

As used herein, the terms “treatment,” “treat,” and “treating” refer to a clinical intervention aimed to reverse, alleviate, delay the onset of, or inhibit the progress of a disease or disorder, or one or more symptoms thereof, as described herein. In some embodiments, treatment may be administered after one or more symptoms have developed and/or after a disease has been diagnosed. In other embodiments, treatment may be administered in the absence of symptoms, e.g., to prevent or delay onset of a symptom or inhibit onset or progression of a disease. For example, treatment may be administered to a susceptible individual prior to the onset of symptoms (e.g., in light of a history of symptoms and/or in light of genetic or other susceptibility factors). Treatment may also be continued after symptoms have resolved, for example, to prevent or delay their recurrence.

As used herein, the term “variant” refers to a protein having characteristics that deviate from what occurs in nature that retains at least one functional (i.e., binding, interaction, or enzymatic) ability and/or therapeutic property thereof. A “variant” is at least about 70% identical, at least about 80% identical, at least about 90% identical, at least about 95% identical, at least about 96% identical, at least about 97% identical, at least about 98% identical, at least about 99% identical, at least about 99.5% identical, or at least about 99.9% identical to the wild type protein. For instance, a variant of Cas9 may comprise a Cas9 that has one or more changes in amino acid residues as compared to a wild type Cas9 amino acid sequence. As another example, a variant of a deaminase may comprise a deaminase that has one or more changes in amino acid residues as compared to a wild type deaminase amino acid sequence, e.g., following ancestral sequence reconstruction of the deaminase. These changes include chemical modifications, including substitutions of different amino acid residues, truncations, covalent additions (e.g., of a tag), and any other mutations. The term also encompasses circular permutants, mutants, truncations, or domains of a reference sequence, and which display the same or substantially the same functional activity or activities as the reference sequence. This term also embraces fragments of a wild type protein.

The level or degree to which the property is retained may be reduced relative to the wild type protein but is typically the same or similar in kind. Generally, variants are overall very similar, and in many regions, identical to the amino acid sequence of the protein described herein. A skilled artisan will appreciate how to make and use variants that maintain all, or at least some, of a functional ability or property.

The variant proteins may comprise, or alternatively consist of, an amino acid sequence which is at least 80%, 85%, 90%, 95%, 96%, 97%, 98%, 99%, or 100%, identical to, for example, the amino acid sequence of a wild-type protein, or any protein provided herein.

By a polypeptide having an amino acid sequence at least, for example, 95% “identical” to a query amino acid sequence, it is intended that the amino acid sequence of the subject polypeptide is identical to the query sequence except that the subject polypeptide sequence may include up to five amino acid alterations per each 100 amino acids of the query amino acid sequence. In other words, to obtain a polypeptide having an amino acid sequence at least 95% identical to a query amino acid sequence, up to 5% of the amino acid residues in the subject sequence may be inserted, deleted, or substituted with another amino acid. These alterations of the reference sequence may occur at the amino- or carboxy-terminal positions of the reference amino acid sequence or anywhere between those terminal positions, interspersed either individually among residues in the reference sequence or in one or more contiguous groups within the reference sequence.
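For illustration only, the percent identity definition above may be expressed as a short calculation over a pair of pre-aligned sequences. The function name, the use of “-” as a gap character, and the example sequences are hypothetical assumptions; the sketch does not perform an alignment and is not the FASTDB program.

```python
# Illustrative sketch only: assumes the two sequences are already aligned
# to equal length, with "-" marking gap positions.
def percent_identity(aligned_query: str, aligned_subject: str) -> float:
    """Percent identity counted per residue of the query sequence."""
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have equal length")
    query_len = sum(1 for q in aligned_query if q != "-")
    if query_len == 0:
        raise ValueError("query contains no residues")
    # Every substituted, inserted, or deleted position counts as an alteration.
    alterations = sum(
        1 for q, s in zip(aligned_query, aligned_subject) if q != s
    )
    return max(0.0, 100.0 * (query_len - alterations) / query_len)


query = "A" * 100                         # hypothetical 100-residue query
subject = "G" * 5 + "A" * 95              # five substitutions
identity = percent_identity(query, subject)
```

In this hypothetical example, five alterations per 100 query residues correspond to the 95% identity threshold discussed above.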

As a practical matter, whether any particular polypeptide is at least 80%, 85%, 90%, 95%, 96%, 97%, 98%, or 99% identical to, for instance, the amino acid sequence of a protein such as a Niemann-Pick C1 (NPC1) protein, can be determined conventionally using known computer programs. A preferred method for determining the best overall match between a query sequence (a sequence of the present invention) and a subject sequence, also referred to as a global sequence alignment, can be determined using the FASTDB computer program based on the algorithm of Brutlag et al. (Comp. App. Biosci. 6:237-245 (1990)). In a sequence alignment the query and subject sequences are either both nucleotide sequences or both amino acid sequences. The result of said global sequence alignment is expressed as percent identity. Preferred parameters used in a FASTDB amino acid alignment are: Matrix=PAM 0, k-tuple=2, Mismatch Penalty=1, Joining Penalty=20, Randomization Group Length=0, Cutoff Score=1, Window Size=sequence length, Gap Penalty=5, Gap Size Penalty=0.05, Window Size=500 or the length of the subject amino acid sequence, whichever is shorter.

If the subject sequence is shorter than the query sequence due to N- or C-terminal deletions, not because of internal deletions, a manual correction must be made to the results. This is because the FASTDB program does not account for N- and C-terminal truncations of the subject sequence when calculating global percent identity. For subject sequences truncated at the N- and C-termini, relative to the query sequence, the percent identity is corrected by calculating the number of residues of the query sequence that are N- and C-terminal of the subject sequence, which are not matched/aligned with a corresponding subject residue, as a percent of the total residues of the query sequence. Whether a residue is matched/aligned is determined by results of the FASTDB sequence alignment. This percentage is then subtracted from the percent identity, calculated by the above FASTDB program using the specified parameters, to arrive at a final percent identity score. This final percent identity score is what is used for the purposes of the present invention. Only residues to the N- and C-termini of the subject sequence, which are not matched/aligned with the query sequence, are considered for the purposes of manually adjusting the percent identity score. That is, only query residue positions outside the farthest N- and C-terminal residues of the subject sequence are considered for this manual correction.
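The manual correction described above may be sketched as simple arithmetic. The function and parameter names are hypothetical, and the numbers in the example (a 100-residue query and a 90-residue subject lacking the first 10 query residues) are illustrative assumptions rather than data from the present disclosure.

```python
# Illustrative sketch only: names and example values are assumptions.
def corrected_percent_identity(reported_identity: float,
                               query_length: int,
                               unmatched_n_terminal: int,
                               unmatched_c_terminal: int) -> float:
    """Subtract the terminal-overhang percentage from the reported identity.

    unmatched_n_terminal / unmatched_c_terminal are the counts of query
    residues lying outside the farthest N- and C-terminal residues of the
    subject sequence that have no matched/aligned subject residue.
    """
    if query_length <= 0:
        raise ValueError("query_length must be positive")
    overhang = unmatched_n_terminal + unmatched_c_terminal
    penalty = 100.0 * overhang / query_length
    return reported_identity - penalty


# Hypothetical case: the aligned 90 residues match perfectly (reported 100%),
# but 10 N-terminal query residues have no subject counterpart.
final_identity = corrected_percent_identity(100.0, 100, 10, 0)
```

In this hypothetical case the 10 unmatched terminal residues amount to 10% of the query, so the final percent identity is 90%. Internal deletions are not passed to this correction because, as stated above, only terminal positions are adjusted manually.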

The term “vector,” as used herein, refers to a nucleic acid that can be modified to encode a gene of interest and that is able to enter into a host cell and replicate within the host cell, and then transfer a replicated form of the vector into another host cell. Exemplary suitable vectors include viral vectors, such as AAV vectors or bacteriophages and filamentous phage, and conjugative plasmids. Additional suitable vectors will be apparent to those of skill in the art based on the instant disclosure.

As used herein the term “wild type” is a term of the art understood by skilled persons and means the typical form of an organism, strain, gene or characteristic as it occurs in nature as distinguished from mutant or variant forms.

DETAILED DESCRIPTION OF CERTAIN EMBODIMENTS

Provided herein are nucleic acid molecules (e.g., vector genomes), compositions (containing, e.g., vectors, recombinant viruses), rAAV particles, and kits comprising nucleic acids encoding split napDNAbp domains (e.g., Cas9 proteins) or nucleobase editors, and methods of delivering a nucleobase editor or a napDNAbp domain into a cell using such nucleic acids. The N-terminal portion and C-terminal portion of a nucleobase editor or a napDNAbp domain are encoded on separate nucleic acids and delivered into a cell, e.g., via recombinant adeno-associated virus (rAAV) particle delivery. In particular embodiments, the N-terminal portion of a nucleobase editor is fused to a first intein, and the C-terminal portion of a nucleobase editor is fused to a second intein. The N-terminal and C-terminal portions may each be encoded on separate nucleic acids and delivered into a cell, e.g., via rAAV particle delivery. The polypeptides corresponding to the N-terminal portion and C-terminal portion of the base editor (or nucleobase editor) may be joined to form a complete nucleobase editor or Cas9 protein, e.g., via intein-mediated protein splicing.

To overcome the packaging size limit and deliver base editors using AAVs, a split-base editor dual AAV strategy was devised, in which the CBE or ABE is divided into an N-terminal portion (or “half”) and a C-terminal half. Each base editor half is fused to half of a fast-splicing split-intein. Following co-infection by AAV particles expressing each base editor-split intein half, protein splicing in trans reconstitutes the full-length base editor. Unlike other approaches utilizing small molecules or sgRNA to bridge split Cas9, intein splicing removes all exogenous sequences and regenerates a native peptide bond at the split site, resulting in a single reconstituted protein (e.g., a protein that is identical in sequence to the unmodified nucleobase editor).
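The split and reconstitution logic described above may be illustrated with a toy string model. The placeholder intein tags, the function names, the example sequence, and the split position are all hypothetical assumptions; the sketch models only the bookkeeping of which segments are removed and joined, and ignores any sequence requirements at the splice junction.

```python
# Toy model only: the tags and the sequence below are placeholders, not
# actual intein or base editor sequences.
INTEIN_N = "[IntN]"
INTEIN_C = "[IntC]"


def split_editor(editor: str, split_index: int):
    """Return (N-terminal half fused to IntN, IntC fused to C-terminal half)."""
    if not 0 < split_index < len(editor):
        raise ValueError("split_index must fall inside the sequence")
    n_fusion = editor[:split_index] + INTEIN_N
    c_fusion = INTEIN_C + editor[split_index:]
    return n_fusion, c_fusion


def trans_splice(n_fusion: str, c_fusion: str) -> str:
    """Remove both intein halves and join the flanking segments."""
    if not n_fusion.endswith(INTEIN_N) or not c_fusion.startswith(INTEIN_C):
        raise ValueError("both intein halves are required for splicing")
    return n_fusion[:-len(INTEIN_N)] + c_fusion[len(INTEIN_C):]


editor = "MKRTADGSEFESPKKKRKV"            # hypothetical placeholder sequence
n_half, c_half = split_editor(editor, 8)
reconstituted = trans_splice(n_half, c_half)
```

In this model the reconstituted sequence is identical to the starting sequence and retains no intein-derived characters, mirroring the statement above that splicing regenerates a single protein identical in sequence to the unmodified nucleobase editor.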

Split-intein CBEs and split-intein ABEs are disclosed that are integrated into dual AAV genomes to enable efficient base editing in somatic tissues of therapeutic relevance, including liver, heart, muscle, retina, and brain. The resulting AAVs were used to achieve base editing efficiencies at test loci for both CBEs and ABEs that, in each of these tissues, meet or exceed therapeutically relevant editing thresholds for the treatment of human genetic diseases at AAV dosages that are known to be well-tolerated in humans. In particular, the disclosed AAV-nucleobase editor vectors achieved editing efficiencies of 59% editing (A.T-to-G.C) among unsorted cells in the cortex, and 48-50% editing (C.G-to-T.A) in photoreceptor cells and mouse embryonic fibroblasts (MEFs). The highest in vivo genome editing efficiencies were observed following injection of ~10¹³-10¹⁴ vector genomes per kilogram weight of subject (vgs/kg), which is a dosage comparable to those currently used in human gene therapy trials. Accordingly, the invention provides split napDNAbp domains (e.g., Cas9 proteins), split nucleobase editors, and nucleic acids and vectors encoding same; as well as cells, compositions, methods, kits, and systems that utilize the disclosed split napDNAbp domains, split nucleobase editors, and vectors.

Aspects of the present disclosure relate to nucleic acid molecules encoding an N-terminal portion of a base editor or nucleobase editor fused at its C-terminus to a first intein sequence, wherein the nucleic acid molecule is operably linked to a first promoter, further comprising a nucleic acid segment encoding a guide RNA (gRNA) operably linked to a second promoter, wherein the direction of transcription of the nucleic acid segment is reversed relative to the direction of transcription of the nucleic acid molecule. These nucleic acid molecules may be comprised within a viral genome, such as an rAAV genome or rAAV vector.

Further provided are nucleic acid molecules encoding a C-terminal portion of a nucleobase editor fused at its N-terminus to an intein sequence, wherein the nucleic acid molecule is operably linked to a first promoter, and further comprising a nucleic acid segment encoding a guide RNA (gRNA) operably linked to a second promoter, wherein the direction of transcription of the nucleic acid segment is reversed relative to the direction of transcription of the nucleic acid molecule. In some embodiments, the first promoter of the nucleic acid molecule encoding the N-terminal portion of the nucleobase editor and the first promoter of the nucleic acid molecule encoding the C-terminal portion of the nucleobase editor comprise the same promoter (i.e., are the same). In other embodiments, these first promoters are different. In some embodiments, the second promoter of the nucleic acid molecule encoding the N-terminal portion of the nucleobase editor and the second promoter of the nucleic acid molecule encoding the C-terminal portion of the nucleobase editor are the same. In other embodiments, these second promoters are different.
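One way to picture the construct architecture described above is as an ordered list of genetic elements, each with a length and a direction of transcription. The sketch below is a minimal illustration under stated assumptions: every element name and length is a hypothetical placeholder except the 609 bp WPRE length given herein, and the 5,000 bp limit is an approximation of the AAV packaging limit noted in the Summary.

```python
# Illustrative sketch only: element names and lengths are placeholders,
# apart from the 609 bp WPRE; the limit approximates the ~5 kb AAV capacity.
from dataclasses import dataclass

PACKAGING_LIMIT_BP = 5000


@dataclass(frozen=True)
class Element:
    name: str
    length_bp: int
    strand: str  # "+" for forward transcription, "-" for reversed


def total_length(elements) -> int:
    return sum(e.length_bp for e in elements)


def fits_packaging_limit(elements, limit_bp: int = PACKAGING_LIMIT_BP) -> bool:
    return total_length(elements) <= limit_bp


def grna_is_reversed(elements, editor_name: str, grna_name: str) -> bool:
    """True when the gRNA cassette is transcribed opposite to the editor half."""
    strands = {e.name: e.strand for e in elements}
    return strands[editor_name] != strands[grna_name]


n_terminal_genome = [
    Element("ITR_5prime", 145, "+"),
    Element("first_promoter", 500, "+"),
    Element("editor_N_half_intein_N", 2400, "+"),
    Element("WPRE", 609, "+"),
    Element("terminator", 200, "+"),
    Element("second_promoter_gRNA", 350, "-"),  # reversed gRNA cassette
    Element("ITR_3prime", 145, "+"),
]
size_bp = total_length(n_terminal_genome)
```

With these placeholder lengths the construct totals 4,349 bp, which falls under the assumed limit; a full-length editor coding sequence of about 5.2 kb would not fit even before regulatory elements are added, which is the motivation for splitting the editor across two genomes.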

Some aspects of the present disclosure relate to compositions comprising (i) a first nucleotide sequence encoding an N-terminal portion of a Cas9 protein fused at its C-terminus to an intein-N; and (ii) a second nucleotide sequence encoding an intein-C fused to the N-terminus of a C-terminal portion of the Cas9 protein, wherein at least one of the first nucleotide sequence and second nucleotide sequence comprises at its 3′ end a gRNA nucleic acid segment encoding a guide RNA (gRNA) operably linked to a second promoter, and wherein the direction of transcription of the gRNA nucleic acid segment is reversed relative to the direction of transcription of the at least one nucleotide sequence. In some embodiments, the first nucleotide sequence and/or second nucleotide sequence is operably linked to a nucleotide sequence encoding at least one bipartite nuclear localization signal (NLS).

Additional aspects of the present disclosure relate to methods of editing using the split nucleobase editors and/or the split Cas9 proteins disclosed herein. In particular embodiments, provided herein are methods of base editing at therapeutically-relevant efficiencies in vivo, such as in murine retina. The methods disclosed herein improve the rate and throughput with which promising base editor targets can be identified in cultured cells and in vivo.

This disclosure describes methods of base editing that may be used for targeted editing of DNA in vitro, e.g., for the generation of mutant cells or animals; for the introduction of targeted mutations, e.g., for the correction of genetic defects in cells ex vivo, e.g., in cells obtained from a subject that are subsequently re-introduced into the same or another subject; and for the introduction of targeted mutations in vivo, e.g., the correction of genetic defects or the introduction of deactivating mutations in disease-associated genes in a subject. As an example, diseases and conditions that can be treated by making an A to G or a C to T mutation may be treated using the base editors provided herein. The base editors described herein may be utilized for the targeted editing of C to T and G to A mutations so as to correct a mutation or restore a normal reading frame in a gene to generate a functional protein. In certain embodiments, the subject has been diagnosed with a disease, disorder, or condition, such as, but not limited to, a disease, disorder, or condition associated with a point mutation in the Tmc1 gene or the NPC1 gene. The methods described herein involve contacting a base editor with a target nucleotide sequence in the genome of an organism, e.g., a human.

In certain embodiments, the methods described above result in cutting (or nicking) one strand of the double-stranded DNA, for example, the strand that includes the thymine (T) of a target A:T nucleobase pair opposite the strand containing the target adenine (A) that is being deaminated. This nicking result serves to direct mismatch repair machinery to the non-edited strand, ensuring that the chemically modified nucleobase is not interpreted as a lesion by the machinery. This nick may be created by the use of an nCas9.

Still further, the present disclosure provides for methods of making the disclosed split nucleobase editors, as well as methods of using the split nucleobase editors or nucleic acid molecules encoding the nucleobase editors in applications including editing a nucleic acid molecule, e.g., a genome. Such methods involve transducing (e.g., via transfection) cells with a plurality of complexes each comprising a portion of a split nucleobase editor (e.g., a nucleobase editor comprising a napDNAbp (e.g., nCas9) domain and a deaminase domain) and/or a gRNA molecule. In some embodiments, the nucleic acid constructs encoding the N-terminal and C-terminal portions of the split nucleobase editor are transfected separately from one another. In certain embodiments, the methods involve the transfection of nucleic acid constructs (e.g., plasmids) that each (or together) encode the components of a complex of split nucleobase editor and a gRNA molecule.

In certain embodiments of the disclosed methods of making the disclosed split nucleobase editors, one or more nucleic acid constructs that encode the split nucleobase editor are transfected into the cell separately from the plasmid that encodes the gRNA molecule. In certain embodiments, these components are encoded on a single construct and transfected together. In other embodiments, the methods disclosed herein involve the introduction into cells of one or more nucleic acid vectors encoding a split nucleobase editor and gRNA molecule that has been expressed and cloned outside of these cells. In some embodiments, these vectors are delivered as part of an rAAV vector.

It should be appreciated that any nucleobase editor, e.g., any of the nucleobase editors provided herein, may be introduced into the cell in any suitable way, either stably or transiently. In some embodiments, a nucleobase editor may be transfected into the cell. In some embodiments, the cell may be transduced or transfected with a nucleic acid construct that encodes a nucleobase editor. For example, a cell may be transduced (e.g., with a virus encoding a nucleobase editor), or transfected (e.g., with a plasmid encoding a nucleobase editor) with a nucleic acid that encodes a nucleobase editor, or the translated nucleobase editor. Such transduction may be a stable or transient transduction. In some embodiments, cells expressing a nucleobase editor or containing a nucleobase editor may be transduced or transfected with one or more gRNA molecules, for example, when the nucleobase editor comprises a Cas9 (e.g., nCas9) domain. In some embodiments, a plasmid expressing one or more portions of a nucleobase editor may be introduced into cells through electroporation, transient (e.g., lipofection) and stable genome integration (e.g., nucleofection and piggybac), viral transduction, or other methods known to those of skill in the art. In particular embodiments, plasmids expressing one or more portions of any of the disclosed nucleobase editors may be delivered to cells through nucleofection.

In some aspects, the disclosed split nucleobase editors are delivered to the cell (or the subject) by use of recombinant AAV (rAAV) particles. In some embodiments, any of the disclosed split nucleobase editors is fused to split intein pairs that are packaged into two separate rAAV particles that, when co-delivered to a cell, reconstitute the functional editor protein. Several other considerations to account for the unique features of base editing are described, including the optimization of second-site nicking targets and properly packaging nucleobase editors into virus vectors, including lentiviruses and rAAV. Accordingly, the disclosure provides dual rAAV vectors and dual rAAV vector particles that comprise expression constructs that encode two portions (or “two halves”) of any of the disclosed nucleobase editors, wherein the encoded nucleobase editor is divided between the two halves at a split site. In some embodiments, the disclosed rAAV vectors encoding the split nucleobase editors may comprise a nucleotide sequence that is at least 60%, at least 65%, at least 70%, at least 75%, at least 80%, at least 85%, at least 90%, at least 95%, at least 96%, at least 97%, at least 98%, at least 99%, or at least 99.5% identical to any one of the sequences depicted in FIGS. 26A-26U.

Accordingly, the present disclosure provides compositions comprising: (i) a first recombinant adeno-associated virus (rAAV) particle comprising a first nucleotide sequence encoding an N-terminal portion of a Cas9 protein fused at its C-terminus to an intein-N; and (ii) a second recombinant adeno-associated virus (rAAV) particle comprising a second nucleotide sequence encoding an intein-C fused to the N-terminus of a C-terminal portion of the Cas9 protein. In some embodiments, at least one of the first nucleotide sequence and second nucleotide sequence comprises at its 3′ end a gRNA nucleic acid segment encoding a guide RNA (gRNA) operably linked to a second promoter, and the direction of transcription of the gRNA nucleic acid segment is reversed relative to the direction of transcription of the at least one nucleotide sequence.

In some aspects, the specification discloses a pharmaceutical composition comprising any one of the presently disclosed complexes of nucleobase editors and gRNA. In other aspects, the present disclosure discloses a pharmaceutical composition comprising one or more polynucleotides encoding the nucleobase editors disclosed herein and one or more polynucleotides encoding a gRNA, or polynucleotides encoding both. The one or more polynucleotides encoding the nucleobase editors and one or more polynucleotides encoding a gRNA may be provided on the same vector, or different vectors (e.g., different rAAV vectors).

napDNAbp Domains

In some aspects, the base editing methods and nucleobase editors described herein involve a nucleic acid programmable DNA binding protein (napDNAbp). Each napDNAbp is associated with at least one guide nucleic acid (e.g., guide RNA), which localizes the napDNAbp to a DNA sequence that comprises a DNA strand (i.e., a target strand) that is complementary to the guide nucleic acid, or a portion thereof (e.g., the protospacer of a guide RNA). In other words, the guide nucleic acid “programs” the napDNAbp (e.g., Cas9 or equivalent) to localize and bind to a complementary sequence. In various embodiments, the napDNAbp can be fused to an adenosine deaminase or a cytosine deaminase disclosed herein. In other aspects, the napDNAbp can be fused to a non-deaminase nucleobase modifying enzyme (or nucleobase modification domain) disclosed herein.

Without being bound by theory, the binding mechanism of a napDNAbp-guide RNA complex, in general, includes the step of forming an R-loop whereby the napDNAbp induces the unwinding of a double-strand DNA target, thereby separating the strands in the region bound by the napDNAbp. The guide RNA spacer then hybridizes to the “target strand.” This displaces a “non-target strand” that is complementary to the target strand, which forms the single strand region of the R-loop. In some embodiments, the napDNAbp includes one or more nuclease activities, which then cut the DNA leaving various types of lesions. For example, the napDNAbp may comprise a nuclease activity that cuts the non-target strand at a first location, and/or cuts the target strand at a second location. Depending on the nuclease activity, the target DNA can be cut to form a “double-stranded break” whereby both strands are cut. In other embodiments, the target DNA can be cut at only a single site, i.e., the DNA is “nicked” on one strand. Exemplary napDNAbp with different nuclease activities include “Cas9 nickase” (“nCas9”) and a deactivated Cas9 having no nuclease activities (“dead Cas9” or “dCas9”).

The below description of various napDNAbps which can be used in connection with the presently disclosed nucleobase editors is not meant to be limiting in any way. The nucleobase editors may comprise the canonical SpCas9, or any ortholog Cas9 protein, or any variant Cas9 protein (including any naturally occurring variant, mutant, or otherwise engineered version of Cas9) that is known or which can be made or evolved through a directed evolutionary or otherwise mutagenic process. In various embodiments, the Cas9 or Cas9 variants have a nickase activity, i.e., only cleave one strand of the target DNA sequence. In other embodiments, the Cas9 or Cas9 variants have inactive nucleases, i.e., are “dead” Cas9 proteins. Other variant Cas9 proteins that may be used are those having a smaller molecular weight than the canonical SpCas9 (e.g., for easier delivery) or having modified or rearranged primary amino acid structure (e.g., the circular permutant formats). The nucleobase editors described herein may also comprise Cas9 equivalents, including Cas12a/Cpf1 and Cas12b proteins which are the result of convergent evolution. The napDNAbps used herein (e.g., SpCas9, Cas9 variant, or Cas9 equivalents) may also contain various modifications that alter/enhance their PAM specificities. Lastly, the application contemplates any Cas9, Cas9 variant, or Cas9 equivalent which has at least 70%, at least 75%, at least 80%, at least 85%, at least 90%, at least 91%, at least 92%, at least 93%, at least 94%, at least 95%, at least 96%, at least 97%, at least 98%, at least 99%, or at least 99.9% sequence identity to a reference Cas9 sequence, such as a reference SpCas9 canonical sequence or a reference Cas9 equivalent (e.g., Cas12a/Cpf1).

The napDNAbp can be a CRISPR (clustered regularly interspaced short palindromic repeat)-associated nuclease. As outlined above, CRISPR is an adaptive immune system that provides protection against mobile genetic elements (viruses, transposable elements and conjugative plasmids). CRISPR clusters contain spacers, sequences complementary to antecedent mobile elements, and target invading nucleic acids. CRISPR clusters are transcribed and processed into CRISPR RNA (crRNA). In type II CRISPR systems, correct processing of pre-crRNA requires a trans-encoded small RNA (tracrRNA), endogenous ribonuclease 3 (rnc) and a Cas9 protein. The tracrRNA serves as a guide for ribonuclease 3-aided processing of pre-crRNA. Subsequently, Cas9/crRNA/tracrRNA endonucleolytically cleaves linear or circular dsDNA target complementary to the spacer. The target strand not complementary to crRNA is first cut endonucleolytically, then trimmed 3′-5′ exonucleolytically. In nature, DNA-binding and cleavage typically requires protein and both RNAs. However, single guide RNAs (“sgRNA”, or simply “gRNA”) can be engineered so as to incorporate aspects of both the crRNA and tracrRNA into a single RNA species. See, e.g., Jinek M. et al., Science 337:816-821 (2012), the entire contents of which are hereby incorporated by reference.

In some embodiments, the napDNAbp directs cleavage of one or both strands at the location of a target sequence, such as within the target sequence and/or within the complement of the target sequence. In some embodiments, the napDNAbp directs cleavage of one or both strands within about 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 25, 50, 100, 200, 500, or more base pairs from the first or last nucleotide of a target sequence. In some embodiments, a vector encodes a napDNAbp that is mutated with respect to a corresponding wild-type enzyme such that the mutated napDNAbp lacks the ability to cleave one or both strands of a target polynucleotide containing a target sequence. For example, an aspartate-to-alanine substitution (D10A) in the RuvC I catalytic domain of Cas9 from S. pyogenes converts Cas9 from a nuclease that cleaves both strands to a nickase (cleaves a single strand). Other examples of mutations that render Cas9 a nickase include, without limitation, H840A, N854A, and N863A in reference to the canonical SpCas9 sequence, or to equivalent amino acid positions in other Cas9 variants or Cas9 equivalents.
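By way of non-limiting illustration only, the substitution notation used above (e.g., D10A, meaning the aspartate at position 10 is replaced by alanine) may be read as in the following Python sketch. The function name and the toy sequence are hypothetical and are chosen solely for illustration; the toy sequence is not any sequence of the present disclosure, and positions are assumed to be 1-based.

```python
import re


def apply_substitution(sequence, mutation):
    """Apply a point substitution written as <wild-type><position><new>, e.g. 'D10A'.

    Positions are 1-based. The wild-type residue is checked so that a
    mutation named against the wrong reference numbering is rejected.
    """
    match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", mutation)
    if match is None:
        raise ValueError(f"unrecognized mutation format: {mutation!r}")
    wild_type, position, new = match.group(1), int(match.group(2)), match.group(3)
    if not 1 <= position <= len(sequence):
        raise ValueError(f"position {position} is outside the sequence")
    if sequence[position - 1] != wild_type:
        raise ValueError(
            f"expected {wild_type} at position {position}, "
            f"found {sequence[position - 1]}"
        )
    return sequence[: position - 1] + new + sequence[position:]


# Toy 12-residue sequence with D at position 10 (illustrative only).
toy = "GGGGGGGGGDGG"
mutated = apply_substitution(toy, "D10A")
print(mutated)
```

The wild-type check reflects the statement above that mutations are named in reference to the canonical SpCas9 sequence; for another Cas9 variant or equivalent, the corresponding position would first have to be identified, which this sketch does not attempt.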

As used herein, the term “Cas protein” refers to a full-length Cas protein obtained from nature, a recombinant Cas protein having a sequence that differs from a naturally occurring Cas protein, or any fragment of a Cas protein that nevertheless retains all or a significant amount of the requisite basic functions needed for the disclosed methods, i.e., (i) possession of nucleic-acid programmable binding of the Cas protein to a target DNA, and (ii) ability to nick the target DNA sequence on one strand. The Cas proteins contemplated herein embrace CRISPR Cas9 proteins, as well as Cas9 equivalents, variants (e.g., Cas9 nickase (nCas9) or nuclease inactive Cas9 (dCas9)), homologs, orthologs, or paralogs, whether naturally occurring or non-naturally occurring (e.g., engineered or recombinant), and may include a Cas9 equivalent from any type of CRISPR system (e.g., type II, V, VI), including Cpf1 (a type V CRISPR-Cas system), C2c1 (a type V CRISPR-Cas system), C2c2 (a type VI CRISPR-Cas system) and C2c3 (a type V CRISPR-Cas system). Further Cas-equivalents are described in Makarova et al., “C2c2 is a single-component programmable RNA-guided RNA-targeting CRISPR effector,” Science 2016; 353(6299), the contents of which are incorporated herein by reference.

The terms “Cas9” or “Cas9 nuclease” or “Cas9 moiety” or “Cas9 domain” embrace any naturally occurring Cas9 from any organism, any naturally-occurring Cas9 equivalent or functional fragment thereof, any Cas9 homolog, ortholog, or paralog from any organism, and any mutant or variant of a Cas9, naturally-occurring or engineered. The term Cas9 is not meant to be particularly limiting and may be referred to as a “Cas9 or equivalent.” Exemplary Cas9 proteins are further described herein and/or are described in the art and are incorporated herein by reference. The present disclosure is unlimited with regard to the particular Cas9 that is employed in the nucleobase editor (BE) of the invention.

As noted herein, Cas9 nuclease sequences and structures are well known to those of skill in the art (see, e.g., “Complete genome sequence of an M1 strain of Streptococcus pyogenes.” Ferretti J. J., McShan W. M., Ajdic D. J., Savic D. J., Savic G., Lyon K., Primeaux C., Sezate S., Suvorov A. N., Kenton S., Lai H. S., Lin S. P., Qian Y., Jia H. G., Najar F. Z., Ren Q., Zhu H., Song L., White J., Yuan X., Clifton S. W., Roe B. A., McLaughlin R. E., Proc. Natl. Acad. Sci. U.S.A. 98:4658-4663 (2001); “CRISPR RNA maturation by trans-encoded small RNA and host factor RNase III.” Deltcheva E., Chylinski K., Sharma C. M., Gonzales K., Chao Y., Pirzada Z. A., Eckert M. R., Vogel J., Charpentier E., Nature 471:602-607 (2011); and “A programmable dual-RNA-guided DNA endonuclease in adaptive bacterial immunity.” Jinek M., Chylinski K., Fonfara I., Hauer M., Doudna J. A., Charpentier E. Science 337:816-821 (2012), the entire contents of each of which are incorporated herein by reference).

The Cas9 protein encoded by the first and second nucleotide sequences is herein referred to as a “split Cas9.” The Cas9 protein is known to have an N-terminal lobe and a C-terminal lobe linked by a disordered linker (e.g., as described in Nishimasu et al., Cell, Volume 156, Issue 5, pp. 935-949, 2014, incorporated herein by reference). In some embodiments, the N-terminal portion of the split Cas9 protein comprises the N-terminal lobe of a Cas9 protein. In some embodiments, the C-terminal portion of the split Cas9 comprises the C-terminal lobe of a Cas9 protein.

In some embodiments, the N-terminal portion of the split Cas9 comprises a portion of any one of SEQ ID NO: 1-129, 143-275, 282-291, 394-397, 435-437, 519-549, and 554-556 that corresponds to amino acids 1-(550-650) in SEQ ID NO: 1. “1-(550-650)” means starting from amino acid 1 and ending anywhere between amino acid 550-650 (inclusive). For example, the N-terminal portion of the split Cas9 may comprise a portion of any one of SEQ ID NOs: 1-129, 143-275, 282-291, 394-397, 435-437, 519-549, and 554-556 that corresponds to amino acids 1-550, 1-551, 1-552, 1-553, 1-554, 1-555, 1-556, 1-557, 1-558, 1-559, 1-560, 1-561, 1-562, 1-563, 1-564, 1-565, 1-566, 1-567, 1-568, 1-569, 1-570, 1-571, 1-572, 1-573, 1-574, 1-575, 1-576, 1-577, 1-578, 1-579, 1-580, 1-581, 1-582, 1-583, 1-584, 1-585, 1-586, 1-587, 1-588, 1-589, 1-590, 1-591, 1-592, 1-593, 1-594, 1-595, 1-596, 1-597, 1-598, 1-599, 1-600, 1-601, 1-602, 1-603, 1-604, 1-605, 1-606, 1-607, 1-608, 1-609, 1-610, 1-611, 1-612, 1-613, 1-614, 1-615, 1-616, 1-617, 1-618, 1-619, 1-620, 1-621, 1-622, 1-623, 1-624, 1-625, 1-626, 1-627, 1-628, 1-629, 1-630, 1-631, 1-632, 1-633, 1-634, 1-635, 1-636, 1-637, 1-638, 1-639, 1-640, 1-641, 1-642, 1-643, 1-644, 1-645, 1-646, 1-647, 1-648, 1-649, or 1-650 of SEQ ID NO: 1. In some embodiments, the N-terminal portion of the split Cas9 protein comprises a portion of any one of SEQ ID NOs: 1-129, 143-275, 282-291, 394-397, 435-437, 519-549, and 554-556 that corresponds to amino acids 1-573 or 1-637 of SEQ ID NO: 1.

In some embodiments, the N-terminal portion of the split Cas9 may comprise a portion of any one of SEQ ID NOs: 1-129, 143-275, 282-291, 394-397, 435-437, 519-549, and 554-556 that corresponds to amino acids 1-430, 1-431, 1-432, 1-433, 1-434, 1-435, 1-436, 1-437, 1-438, 1-439, 1-440, 1-441, 1-442, 1-443, 1-444, 1-445, 1-446, 1-447, 1-448, 1-449, 1-450, 1-451, 1-452, 1-453, 1-454, 1-455, 1-456, 1-457, 1-458, 1-459, 1-460, 1-461, 1-462, 1-463, 1-464, 1-465, 1-466, 1-467, 1-468, 1-469, 1-470, 1-471, 1-472, 1-473, 1-474, 1-475, 1-476, 1-477, 1-478, 1-479, 1-480, 1-481, 1-482, 1-483, 1-484, 1-485, 1-486, 1-487, 1-488, 1-489, 1-490, 1-491, 1-492, 1-493, 1-494, 1-495, 1-496, 1-497, 1-498, 1-499, 1-500, 1-501, 1-502, 1-503, 1-504, 1-505, 1-506, 1-507, 1-508, 1-509, 1-510, 1-511, 1-512, 1-513, 1-514, 1-515, 1-516, 1-517, 1-518, 1-519, 1-520, 1-521, 1-522, 1-523, 1-524, 1-525, 1-526, 1-527, 1-528, 1-529, 1-530, 1-531, 1-532, 1-533, 1-534, 1-535, 1-536, 1-537, 1-538, or 1-539 of SEQ ID NO: 11. In some embodiments, the N-terminal portion of the split Cas9 protein comprises a portion of any one of SEQ ID NOs: 1-129, 143-275, 282-291, 394-397, 435-437, 519-549, and 554-556 that corresponds to amino acids 1-431, 1-453, 1-457, 1-484, 1-501, 1-534, or 1-537 of SEQ ID NO: 11. In certain embodiments, the N-terminal portion of the split Cas9 protein comprises a portion of any one of SEQ ID NOs: 1-129, 143-275, 282-291, 394-397, 435-437, 519-549, and 554-556 that corresponds to amino acids 1-534 of SEQ ID NO: 11.

The C-terminal portion of the split Cas9 can be joined with the N-terminal portion of the split Cas9 to form a complete Cas9 protein. In some embodiments, the C-terminal portion of the Cas9 protein starts from where the N-terminal portion of the Cas9 protein ends. As such, in some embodiments, the C-terminal portion of the split Cas9 comprises a portion of any one of SEQ ID NO: 1-129, 143-275, 282-291, 394-397, 435-437, 519-549, and 554-556 that corresponds to amino acids (551-651)-1368 of SEQ ID NO: 1. “(551-651)-1368” means starting at an amino acid between amino acids 551-651 (inclusive) and ending at amino acid 1368.

For example, the C-terminal portion of the split Cas9 may comprise a portion of any one of SEQ ID NO: 1-129, 143-275, 282-291, 394-397, 435-437, 519-549, and 554-556 that corresponds to amino acid 551-1368, 552-1368, 553-1368, 554-1368, 555-1368, 556-1368, 557-1368, 558-1368, 559-1368, 560-1368, 561-1368, 562-1368, 563-1368, 564-1368, 565-1368, 566-1368, 567-1368, 568-1368, 569-1368, 570-1368, 571-1368, 572-1368, 573-1368, 574-1368, 575-1368, 576-1368, 577-1368, 578-1368, 579-1368, 580-1368, 581-1368, 582-1368, 583-1368, 584-1368, 585-1368, 586-1368, 587-1368, 588-1368, 589-1368, 590-1368, 591-1368, 592-1368, 593-1368, 594-1368, 595-1368, 596-1368, 597-1368, 598-1368, 599-1368, 600-1368, 601-1368, 602-1368, 603-1368, 604-1368, 605-1368, 606-1368, 607-1368, 608-1368, 609-1368, 610-1368, 611-1368, 612-1368, 613-1368, 614-1368, 615-1368, 616-1368, 617-1368, 618-1368, 619-1368, 620-1368, 621-1368, 622-1368, 623-1368, 624-1368, 625-1368, 626-1368, 627-1368, 628-1368, 629-1368, 630-1368, 631-1368, 632-1368, 633-1368, 634-1368, 635-1368, 636-1368, 637-1368, 638-1368, 639-1368, 640-1368, 641-1368, 642-1368, 643-1368, 644-1368, 645-1368, 646-1368, 647-1368, 648-1368, 649-1368, 650-1368, or 651-1368 of SEQ ID NO: 1. In some embodiments, the C-terminal portion of the split Cas9 protein comprises a portion of any one of SEQ ID NO: 1-129, 143-275, 282-291, 394-397, 435-437, 519-549, and 554-556 that corresponds to amino acids 574-1368 or 638-1368 of SEQ ID NO: 1.

In other embodiments, the C-terminal portion of the split Cas9 protein comprises a portion of any one of SEQ ID NO: 1-129, 143-275, 282-291, 394-397, 435-437, 519-549, and 554-556 that corresponds to amino acids 432-1054, 454-1054, 458-1054, 485-1054, 502-1054, 535-1054, or 538-1054 of SEQ ID NO: 11. In certain embodiments, the C-terminal portion of the split Cas9 protein comprises a portion of any one of SEQ ID NO: 1-129, 143-275, 282-291, 394-397, 435-437, 519-549, and 554-556 that corresponds to amino acids 535-1054 of SEQ ID NO: 11.

In other embodiments, the C-terminal portion of the split Cas9 protein comprises a portion of any one of SEQ ID NO: 1-129, 143-275, 282-291, 394-397, 435-437, 519-549, and 554-556 that corresponds to amino acids 432-1054, 454-1054, 458-1054, 485-1054, 502-1054, 535-1054, or 538-1054 of SEQ ID NO: 10. In certain embodiments, the C-terminal portion of the split Cas9 protein comprises a portion of any one of SEQ ID NO: 1-129, 143-275, 282-291, 394-397, 435-437, 519-549, and 554-556 that corresponds to amino acids 535-1054 of SEQ ID NO: 10.
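By way of non-limiting illustration only, the relationship between the N-terminal and C-terminal amino acid ranges recited above (the C-terminal portion starting where the N-terminal portion ends) may be expressed as in the following Python sketch. The function names are hypothetical, and the placeholder sequence merely has the length of the reference protein (1368 residues); it is not the sequence of SEQ ID NO: 1.

```python
def split_ranges(total_length, n_term_end):
    """Return 1-based inclusive (start, end) ranges for the two portions.

    n_term_end is the last residue of the N-terminal portion; the
    C-terminal portion begins at the next residue.
    """
    if not 1 <= n_term_end < total_length:
        raise ValueError("split site must leave both portions non-empty")
    return (1, n_term_end), (n_term_end + 1, total_length)


def split_sequence(sequence, n_term_end):
    """Cut a sequence after residue n_term_end (1-based)."""
    split_ranges(len(sequence), n_term_end)  # validates the split site
    return sequence[:n_term_end], sequence[n_term_end:]


# Placeholder of the right length only; NOT the real reference sequence.
placeholder = "X" * 1368
for site in (573, 637):
    n_range, c_range = split_ranges(len(placeholder), site)
    n_part, c_part = split_sequence(placeholder, site)
    # Joining the two portions regenerates the full-length sequence.
    assert n_part + c_part == placeholder
    print(f"{n_range[0]}-{n_range[1]} / {c_range[0]}-{c_range[1]}")
```

The same arithmetic applies to a 1054-residue reference: a split after residue 534 yields portions corresponding to amino acids 1-534 and 535-1054.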

Further aspects of the present disclosure provide rAAV particles comprising a first nucleic acid molecule (e.g., encoding an N-terminal portion of a nucleobase editor or Cas9 protein fused at its C-terminus to an intein-N) as described herein. rAAV particles comprising a second nucleic acid molecule (e.g., encoding an intein-C fused to the N-terminus of a C-terminal portion of the Cas9 protein or nucleobase editor) as described herein are also provided. The disclosed rAAV particles may comprise both a first nucleic acid molecule and a second nucleic acid molecule as described herein.

Cas9 variants may also be delivered to cells using the methods described herein. For example, a Cas9 variant may also be “split” as described herein. A Cas9 variant may comprise an amino acid sequence that is at least 60%, at least 65%, at least 70%, at least 75%, at least 80%, at least 85%, at least 90%, at least 95%, at least 96%, at least 97%, at least 98%, at least 99%, or at least 99.5% identical to any one of the Cas9 sequences provided herein. In some embodiments, the Cas9 variant comprises an amino acid sequence that is shorter or longer in length (e.g., by no more than 30%, no more than 25%, no more than 20%, no more than 15%, no more than 10%, no more than 5%, no more than 1% longer or shorter) than any of the Cas9 proteins provided herein (e.g., a S. pyogenes Cas9 (SpCas9) (SEQ ID NO: 1), S. pyogenes Cas9 nickase (SpCas9n) (SEQ ID NO: 3), S. aureus Cas9 (SaCas9) (SEQ ID NO: 10), and S. aureus Cas9 nickase (SaCas9n) (SEQ ID NO: 11)). In some embodiments, the Cas9 variant comprises an amino acid sequence that is shorter or longer in length (e.g., by no more than 200 amino acids, no more than 150 amino acids, no more than 100 amino acids, no more than 50 amino acids, no more than 10 amino acids, no more than 5 amino acids, or no more than 2 amino acids longer or shorter) than any of the Cas9 proteins provided herein.

In some embodiments, the N-terminal portion of a split Cas9 comprises an amino acid sequence that is at least 60%, at least 65%, at least 70%, at least 75%, at least 80%, at least 85%, at least 90%, at least 95%, at least 96%, at least 97%, at least 98%, at least 99%, or at least 99.5% identical to the corresponding portion of any one of the Cas9 sequences provided herein (e.g., a SpCas9, SpCas9n, SaCas9, or SaCas9n). In some embodiments, the N-terminal portion of the split Cas9 comprises an amino acid sequence that is shorter or longer in length (e.g., by no more than 30%, no more than 25%, no more than 20%, no more than 15%, no more than 10%, no more than 5%, no more than 1% longer or shorter) than the corresponding portion of any of the Cas9 proteins provided herein. In some embodiments, the N-terminal portion of the split Cas9 comprises an amino acid sequence that is shorter or longer in length (e.g., by no more than 200 amino acids, no more than 150 amino acids, no more than 100 amino acids, no more than 50 amino acids, no more than 10 amino acids, no more than 5 amino acids, or no more than 2 amino acids longer or shorter) than the corresponding portion of any of the Cas9 proteins provided herein.

In some embodiments, the C-terminal portion of a split Cas9 comprises an amino acid sequence that is at least 60%, at least 65%, at least 70%, at least 75%, at least 80%, at least 85%, at least 90%, at least 95%, at least 96%, at least 97%, at least 98%, at least 99%, or at least 99.5% identical to the corresponding portion of any one of the Cas9 sequences provided herein (e.g., the Cas9 sequences of any of SEQ ID NOs: 1, 3, 10, and 11). In some embodiments, the C-terminal portion of the split Cas9 comprises an amino acid sequence that is shorter or longer in length (e.g., by no more than 30%, no more than 25%, no more than 20%, no more than 15%, no more than 10%, no more than 5%, no more than 1% longer or shorter) than the corresponding portion of any of the Cas9 proteins provided herein. In some embodiments, the C-terminal portion of the split Cas9 comprises an amino acid sequence that is shorter or longer in length (e.g., by no more than 200 amino acids, no more than 150 amino acids, no more than 100 amino acids, no more than 50 amino acids, no more than 10 amino acids, no more than 5 amino acids, or no more than 2 amino acids longer or shorter) than the corresponding portion of any of the Cas9 proteins provided herein.

In some embodiments, the Cas9 variant is a dCas9 or nCas9. In some embodiments, the Cas9 protein is selected from S. pyogenes Cas9 (SpCas9) (SEQ ID NO: 1), S. pyogenes Cas9 nickase (SEQ ID NO: 3), S. aureus Cas9 (SaCas9) (SEQ ID NO: 10), and S. aureus Cas9 nickase (SEQ ID NO: 11). In certain embodiments, the Cas9 variant is a VRQR variant of SpCas9 that is compatible with NGA PAM sites.

Accordingly, in some embodiments, the N-terminal portion of the Cas9 protein comprises a portion of any one of SEQ ID NOs: 1-129, 143-275, 282-291, 394-397, 435-437, 519-549, and 554-556 that corresponds to amino acids 1-573 or 1-637 of SEQ ID NO: 1. In some embodiments, the C-terminal portion of the Cas9 protein comprises a portion of any one of SEQ ID NOs: 1-129, 143-275, 282-291, 394-397, 435-437, 519-549, and 554-556 that corresponds to amino acids 574-1368 or 638-1368 of SEQ ID NO: 1. In other embodiments, the N-terminal portion of the Cas9 protein comprises a portion of any one of SEQ ID NOs: 1-129, 143-275, 282-291, 394-397, 435-437, 519-549, and 554-556 that corresponds to amino acids 1-573 or 1-637 of SEQ ID NO: 3. In some embodiments, the C-terminal portion of the Cas9 protein comprises a portion of any one of SEQ ID NOs: 1-129, 143-275, 282-291, 394-397, 435-437, 519-549, and 554-556 that corresponds to amino acids 574-1368 or 638-1368 of SEQ ID NO: 3.

In some embodiments, the N-terminal portion of the Cas9 protein comprises a portion of any one of SEQ ID NOs: 1-129, 143-275, 282-291, 394-397, 435-437, 519-549, and 554-556 that corresponds to amino acids 1-534 of SEQ ID NO: 11. In some embodiments, the C-terminal portion of the Cas9 protein comprises a portion of any one of SEQ ID NOs: 1-129, 143-275, 282-291, 394-397, 435-437, 519-549, and 554-556 that corresponds to amino acids 535-1054 of SEQ ID NO: 11.

In some embodiments, the N-terminal portion of the split Cas9 comprises a mutation corresponding to a D10A mutation in SEQ ID NO: 1. In some embodiments, the N-terminal portion of the split Cas9 comprises a mutation corresponding to a D10A mutation in SEQ ID NO: 1 and the C-terminal portion of the split Cas9 comprises a mutation corresponding to a H840A mutation in SEQ ID NO:1. In some embodiments, the N-terminal portion of the split Cas9 comprises a mutation corresponding to a D10A mutation in SEQ ID NO: 1, and the C-terminal portion of the split Cas9 comprises a histidine at the position corresponding to position 840 in SEQ ID NO:1.

In other embodiments, the N-terminal portion of the split Cas9 comprises a mutation corresponding to a D10A mutation in SEQ ID NO: 10.

In some embodiments, to join the N-terminal portion of the Cas9 protein and the C-terminal portion of the Cas9 protein, an intein system may be used. In some embodiments, the N-terminal portion of the Cas9 is fused to an intein-N. In some embodiments, the intein-N is fused to the C-terminus of the N-terminal portion of the Cas9 to form a structure of NH₂-[N-terminal portion of Cas9]-[intein-N]-COOH. In some embodiments, the intein-N is encoded by the dnaE-n gene. In some embodiments, the intein-N comprises the amino acid sequence as set forth in SEQ ID NO: 351 or 355. In some embodiments, the C-terminal portion of the Cas9 is fused to an intein-C, and the intein-C is fused to the N-terminus of the C-terminal portion of the Cas9 to form a structure of NH₂-[intein-C]-[C-terminal portion of Cas9]-COOH. In some embodiments, the intein-C is encoded by the dnaE-c gene. In some embodiments, the intein-C comprises the amino acid sequence as set forth in SEQ ID NO: 353 or 357.

Other split intein systems may also be used in the present disclosure and are known in the art. For example, in some embodiments, the intein pair comprises an Npu split intein. In certain such embodiments, the intein-N comprises the amino acid sequence of SEQ ID NO: 351. In some embodiments, the intein-C comprises the amino acid sequence of SEQ ID NO: 353.

As described herein, the N-terminal portion of a nucleobase editor comprises the N-terminal portion of a nuclease-inactive Cas9 protein (dCas9) or a Cas9 nickase (nCas9). In some embodiments, the N-terminal portion of a nucleobase editor further comprises a nucleobase modifying enzyme (e.g., nucleases, nickases, recombinases, deaminases, DNA repair enzymes, DNA damage enzymes, dismutases, alkylation enzymes, depurination enzymes, oxidation enzymes, pyrimidine dimer forming enzymes, integrases, transposases, polymerases, ligases, helicases, photolyases, glycosylases, epigenetic modifiers such as methylases, acetylases, methyltransferases, demethylases, etc.). In some embodiments, the nucleobase modifying enzyme is a deaminase (e.g., a cytosine deaminase or an adenosine deaminase, or functional variants thereof). In some embodiments, the nucleobase modifying enzyme is fused to the N-terminus of the N-terminal portion of the split dCas9 or split nCas9. In some embodiments, the N-terminal portion of the nucleobase editor is of the structure: NH₂-[nucleobase modifying enzyme]-[N-terminal portion of dCas9 or nCas9]-COOH. In some embodiments, the N-terminal portion of the nucleobase editor is fused to an intein-N. In some embodiments, the intein-N is fused to the C-terminus of the N-terminal portion of the nucleobase editor.

In some embodiments, the first nucleotide sequence encodes a polypeptide comprising the structure NH₂-[nucleobase modifying enzyme]-[N-terminal portion of dCas9 or nCas9]-[intein-N]-COOH.

In some embodiments, the C-terminal portion of the nucleobase editor comprises the C-terminal portion of a nuclease-inactive Cas9 protein (dCas9) or a Cas9 nickase (nCas9). In some embodiments, the nucleobase modifying enzyme is fused to the C-terminus of the C-terminal portion of the split dCas9 or split nCas9. In some embodiments, the C-terminal portion of the nucleobase editor is of the structure: NH₂-[C-terminal portion of dCas9 or nCas9]-[nucleobase modifying enzyme]-COOH. In some embodiments, the C-terminal portion of the nucleobase editor comprises an intein-C fused to the C-terminal portion of the Cas9 protein. In some embodiments, the intein-C is fused to the N-terminus of the C-terminal portion of the nucleobase editor. In some embodiments, the second nucleotide sequence encodes a polypeptide of the structure: NH₂-[intein-C]-[C-terminal portion of the Cas9 protein]-COOH.
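The fusion architectures described above can be illustrated with a minimal Python sketch. It is an illustration only, under stated assumptions: the intein halves are represented by hypothetical placeholder tags rather than the actual intein-N and intein-C sequences, the editor is a toy string, and the function names `split_editor` and `trans_splice` are invented for this example.

```python
# Hypothetical placeholder tags standing in for the intein halves. The actual
# intein-N / intein-C amino acid sequences are not reproduced in this sketch.
INTEIN_N = "<inteinN>"
INTEIN_C = "<inteinC>"


def split_editor(editor: str, split_index: int) -> tuple:
    """Return the two polypeptides of a split editor.

    first  = NH2-[N-terminal portion of editor]-[intein-N]-COOH
    second = NH2-[intein-C]-[C-terminal portion of editor]-COOH
    """
    if not 0 < split_index < len(editor):
        raise ValueError("split_index must fall inside the editor sequence")
    first = editor[:split_index] + INTEIN_N
    second = INTEIN_C + editor[split_index:]
    return first, second


def trans_splice(n_half: str, c_half: str) -> str:
    """Model splicing in trans: remove both intein halves and join the flanking portions."""
    if not n_half.endswith(INTEIN_N) or not c_half.startswith(INTEIN_C):
        raise ValueError("halves do not carry the expected intein fusions")
    return n_half[: -len(INTEIN_N)] + c_half[len(INTEIN_C):]


# Toy editor sequence (not a real protein); the split index is arbitrary.
toy_editor = "DEAMINASEMDKKYSIGLAIGTNSVGWAV"
n_half, c_half = split_editor(toy_editor, 15)
assert trans_splice(n_half, c_half) == toy_editor
```

In this model the reconstituted string is identical to the unsplit editor, which mirrors the property that intein splicing leaves no exogenous sequence at the split site.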

Non-limiting examples of suitable Cas9 proteins and variants, and nucleobase editors and variants are provided. The disclosure provides Cas9 variants, for example, Cas9 proteins from one or more organisms, which may comprise one or more mutations (e.g., to generate dCas9 or Cas9 nickase). In some embodiments, one or more of the amino acid residues, identified below by an asterisk, of a Cas9 protein may be mutated. In some embodiments, the D10 and/or H840 residues of the amino acid sequence provided in SEQ ID NO: 1, or a corresponding mutation in any of the amino acid sequences provided in SEQ ID NOs: 2-275, 394-397 and 488, are mutated. In some embodiments, the D10 residue of the amino acid sequence provided in SEQ ID NO: 1, or a corresponding mutation in any of the amino acid sequences provided in SEQ ID NOs: 2-275, 394-397 and 488, is mutated to any amino acid residue, except for D. In some embodiments, the D10 residue of the amino acid sequence provided in SEQ ID NO: 1, or a corresponding mutation in any of the amino acid sequences provided in SEQ ID NOs: 2-275, 394-397 and 488, is mutated to an A. In some embodiments, the H840 residue of the amino acid sequence provided in SEQ ID NO: 1, or a corresponding residue in any of the amino acid sequences provided in SEQ ID NOs: 2-275, 394-397 and 488, is an H. In some embodiments, the H840 residue of the amino acid sequence provided in SEQ ID NO: 1, or a corresponding mutation in any of the amino acid sequences provided in SEQ ID NOs: 2-275, 394-397 and 488, is mutated to any amino acid residue, except for H. In some embodiments, the H840 residue of the amino acid sequence provided in SEQ ID NO: 1, or a corresponding mutation in any of the amino acid sequences provided in SEQ ID NOs: 2-275, 394-397 and 488, is mutated to an A. In some embodiments, the D10 residue of the amino acid sequence provided in SEQ ID NO: 1, or a corresponding residue in any of the amino acid sequences provided in SEQ ID NOs: 2-275, 394-397 and 488, is a D.
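As an illustration of the substitution notation used here (e.g., D10A, H840A), the following Python sketch applies slash-separated point mutations to a protein sequence and checks that the expected wild-type residue is present at each position. The helper name `apply_mutations` is hypothetical, and the example uses only the first 20 residues of SEQ ID NO: 1.

```python
import re

# Matches substitutions written as <wild-type residue><1-based position><new residue>.
_MUTATION = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def apply_mutations(sequence: str, mutations: str) -> str:
    """Apply slash-separated substitutions such as "D10A/H840A" to a sequence."""
    residues = list(sequence)
    for token in mutations.split("/"):
        match = _MUTATION.match(token)
        if match is None:
            raise ValueError(f"unrecognized mutation: {token!r}")
        wild_type, position, replacement = match.group(1), int(match.group(2)), match.group(3)
        if not 1 <= position <= len(residues):
            raise ValueError(f"{token}: position is outside the sequence")
        if residues[position - 1] != wild_type:
            raise ValueError(f"{token}: found {residues[position - 1]} at position {position}")
        residues[position - 1] = replacement
    return "".join(residues)


# First 20 residues of SEQ ID NO: 1 (wild type S. pyogenes Cas9).
SPCAS9_N_TERMINUS = "MDKKYSIGLDIGTNSVGWAV"
print(apply_mutations(SPCAS9_N_TERMINUS, "D10A"))  # MDKKYSIGLAIGTNSVGWAV
```

The wild-type check guards against off-by-one numbering errors, which matter when the same notation is applied to Cas9 proteins from other organisms whose residue numbering differs.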

A number of Cas9 sequences from various species were aligned to determine whether corresponding homologous amino acid residues of D10 and H840 of SEQ ID NO: 1 can be identified in other Cas9 proteins, allowing the generation of Cas9 variants with corresponding mutations of the homologous amino acid residues. The alignment was carried out using the NCBI Constraint-based Multiple Alignment Tool (COBALT (accessible at st-va.ncbi.nlm.nih.gov/tools/cobalt)), with the following parameters. Alignment parameters: Gap penalties −11, −1; End-Gap penalties −5, −1. CDD Parameters: Use RPS BLAST on; Blast E-value 0.003; Find Conserved columns and Recompute on. Query Clustering Parameters: Use query clusters on; Word Size 4; Max cluster distance 0.8; Alphabet Regular.
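Once an alignment is available, a residue in one Cas9 protein can be mapped to its homologous residue in another by walking the aligned columns. The Python sketch below shows this mapping under stated assumptions: the two aligned strings are a hand-made illustration built from the first residues of SEQ ID NO: 1 and SEQ ID NO: 13, not COBALT output, and the function name `corresponding_position` is hypothetical.

```python
from typing import Optional


def corresponding_position(ref_aligned: str, other_aligned: str, ref_position: int) -> Optional[int]:
    """Map a 1-based reference position to the 1-based position aligned to it.

    Both inputs are rows of the same alignment, with "-" marking gaps.
    Returns None when the reference residue is aligned to a gap.
    """
    if len(ref_aligned) != len(other_aligned):
        raise ValueError("aligned sequences must have equal length")
    ref_count = other_count = 0
    for ref_char, other_char in zip(ref_aligned, other_aligned):
        if ref_char != "-":
            ref_count += 1
        if other_char != "-":
            other_count += 1
        if ref_char != "-" and ref_count == ref_position:
            return None if other_char == "-" else other_count
    raise ValueError("ref_position is beyond the reference sequence")


# Illustrative hand alignment of N-terminal residues (assumption, not tool output).
sp_row = "MDKKYSIGLDIGTNSVGWAV"   # S. pyogenes Cas9, SEQ ID NO: 1
st1_row = "M-SDLVLGLDIGIGSVGVGI"  # S. thermophilus CRISPR1 Cas9, SEQ ID NO: 13
print(corresponding_position(sp_row, st1_row, 10))  # 9
```

Under this illustrative alignment, D10 of SEQ ID NO: 1 maps to D9 of St1Cas9, which agrees with the D9A designation of the St1Cas9 nickase listed below.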

Examples of Cas9 and Cas9 equivalents are provided as follows; however, these specific examples are not meant to be limiting. The nucleobase editor fusions of the present disclosure may use any suitable napDNAbp, including any suitable Cas9 or Cas9 equivalent.

S. pyogenes Cas9 wild type

(NCBI Reference Sequence: NC_002737.2, UniProt Reference Sequence: Q99ZW2)

(SEQ ID NO: 1)

MDKKYSIGLDIGTNSVGWAVITDEYKVPSKKFKVLGNTDRHSIKKNLIGALLFDSGETAEATRLKRTAR

RRYTRRKNRICYLQEIFSNEMAKVDDSFFHRLEESFLVEEDKKHERHPIFGNIVDEVAYHEKYPTIYHLR

KKLVDSTDKADLRLIYLALAHMIKFRGHFLIEGDLNPDNSDVDKLFIQLVQTYNQLFEENPINASGVDA

KAILSARLSKSRRLENLIAQLPGEKKNGLFGNLIALSLGLTPNFKSNFDLAEDAKLQLSKDTYDDDLDN

LLAQIGDQYADLFLAAKNLSDAILLSDILRVNTEITKAPLSASMIKRYDEHHQDLTLLKALVRQQLPEK

YKEIFFDQSKNGYAGYIDGGASQLEFYKFIKPILEKMDGTEELLVKLNREDLLRKQRTFDNGSIPHQIHL

GELHAILRRQEDFYPFLKDNREKIEKILTFRIPYYVGPLARGNSRFAWMTRKSEETITPWNFEEVVDKGA

SAQSFIERMTNFDKNLPNEKVLPKHSLLYEYFTVYNELTKVKYVTEGMRKPAFLSGEQKKAIVDLLFK

TNRKVTVKQLKEDYFKKIECFDSVEISGVEDRFNASLGTYHDLLKIIKDKDFLDNEENEDILEDIVLTLT

LFEDREMIEERLKTYAHLFDDKVMKQLKRRRYTGWGRLSRKLINGIRDKQSGKTILDFLKSDGFANRN

FMQLIHDDSLTFKEDIQKAQVSGQGDSLHEHIANLAGSPAIKKGILQTVKVVDELVKVMGRHKPENIVI

EMARENQTTQKGQKNSRERMKRIEEGIKELGSQILKEHPVENTQLQNEKLYLYYLQNGRDMYVDQEL

DINRLSDYDVDHIVPQSFLKDDSIDNKVLTRSDKNRGKSDNVPSEEVVKKMKNYWRQLLNAKLITQRK

FDNLTKAERGGLSELDKAGFIKRQLVETRQITKHVAQILDSRMNTKYDENDKLIREVKVITLKSKLVSD

FRKDFQFYKVREINNYHHAHDAYLNAVVGTALIKKYPKLESEFVYGDYKVYDVRKMIAKSEQEIGKA

TAKYFFYSNIMNFFKTEITLANGEIRKRPLIETNGETGEIVWDKGRDFATVRKVLSMPQVNIVKKTEVQ

TGGFSKESILPKRNSDKLIARKKDWDPKKYGGFDSPTVAYSVLVVAKVEKGKSKKLKSVKELLGITIME

RSSFEKNPIDFLEAKGYKEVKKDLIIKLPKYSLFELENGRKRMLASAGELQKGNELALPSKYVNFLYLA

SHYEKLKGSPEDNEQKQLFVEQHKHYLDEIIEQISEFSKRVILADANLDKVLSAYNKHRDKPIREQAENI

IHLFTLTNLGAPAAFKYFDTTIDRKRYTSTKEVLDATLIHQSITGLYETRIDLSQLGGD

S. pyogenes dCas9 (D10A and H840A)

(SEQ ID NO: 2)

MDKKYSIGLAIGTNSVGWAVITDEYKVPSKKFKVLGNTDRHSIKKNLIGALLFDSGETAEATRLKRTAR

RRYTRRKNRICYLQEIFSNEMAKVDDSFFHRLEESFLVEEDKKHERHPIFGNIVDEVAYHEKYPTIYHLR

KKLVDSTDKADLRLIYLALAHMIKFRGHFLIEGDLNPDNSDVDKLFIQLVQTYNQLFEENPINASGVDA

KAILSARLSKSRRLENLIAQLPGEKKNGLFGNLIALSLGLTPNFKSNFDLAEDAKLQLSKDTYDDDLDN

LLAQIGDQYADLFLAAKNLSDAILLSDILRVNTEITKAPLSASMIKRYDEHHQDLTLLKALVRQQLPEK

YKEIFFDQSKNGYAGYIDGGASQEEFYKFIKPILEKMDGTEELLVKLNREDLLRKQRTFDNGSIPHQIHL

GELHAILRRQEDFYPFLKDNREKIEKILTFRIPYYVGPLARGNSRFAWMTRKSEETITPWNFEEVVDKGA

SAQSFIERMTNFDKNLPNEKVLPKHSLLYEYFTVYNELTKVKYVTEGMRKPAFLSGEQKKAIVDLLFK

TNRKVTVKQLKEDYFKKIECFDSVEISGVEDRFNASLGTYHDLLKIIKDKDFLDNEENEDILEDIVLTLT

LFEDREMIEERLKTYAHLFDDKVMKQLKRRRYTGWGRLSRKLINGIRDKQSGKTILDFLKSDGFANRN

FMQLIHDDSLTFKEDIQKAQVSGQGDSLHEHIANLAGSPAIKKGILQTVKVVDELVKVMGRHKPENIVI

EMARENQTTQKGQKNSRERMKRIEEGIKELGSQILKEHPVENTQLQNEKLYLYYLQNGRDMYVDQEL

DINRLSDYDVDAIVPQSFLKDDSIDNKVLTRSDKNRGKSDNVPSLEVVKKMKNYWRQLLNAKLITQRK

FDNLTKAERGGLSELDKAGFIKRQLVETRQITKHVAQILDSRMNTKYDENDKLIREVKVITLKSKLVSD

FRKDFQFYKVREINNYHHAHDAYLNAVVGTALIKKYPKLESEFVYGDYKVYDVRKMIAKSEQEIGKA

TAKYFFYSNIMNFFKTEITLANGEIRKRPLIETNGETGEIVWDKGRDFATVRKVLSMPQVNIVKKTEVQ

TGGFSKESILPKRNSDKLIARKKDWDPKKYGGFDSPTVAYSVLVVAKVEKGKSKKLKSVKELLGITIME

RSSFEKNPIDFLEAKGYKEVKKDLIIKLPKYSLFELENGRKRMLASAGELQKGNELALPSKYVNFLYLA

SHYEKLKGSPEDNEQKQLFVEQHKHYLDEIIEQISEFSKRVILADANLDKVLSAYNKHRDKPIREQAENI

IHLFTLTNLGAPAAFKYFDTTIDRKRYTSTKEVLDATLIHQSITGLYETRIDLSQLGGD

S. pyogenes Cas9 Nickase (D10A)

(SEQ ID NO: 3)

MDKKYSIGLAIGTNSVGWAVITDEYKVPSKKFKVLGNTDRHSIKKNLIGALLFDSGETAEATRLKRTAR

RRYTRRKNRICYLQEIFSNEMAKVDDSFFHRLEESFLVEEDKKHERHPIFGNIVDEVAYHEKYPTIYHLR

KKLVDSTDKADLRLIYLALAHMIKFRGHFLIEGDLNPDNSDVDKLFIQLVQTYNQLFEENPINASGVDA

KAILSARLSKSRRLENLIAQLPGEKKNGLFGNLIALSLGLTPNFKSNFDLAEDAKLQLSKDTYDDDLDN

LLAQIGDQYADLFLAAKNLSDAILLSDILRVNTEITKAPLSASMIKRYDEHHQDLTLLKALVRQQLPEK

YKEIFFDQSKNGYAGYIDGGASQEEFYKFIKPILEKMDGTEELLVKLNREDLLRKQRTFDNGSIPHQIHL

GELHAILRRQEDFYPFLKDNREKIEKILTFRIPYYVGPLARGNSRFAWMTRKSEETITPWNFEEVVDKGA

SAQSFIERMTNFDKNLPNEKVLPKHSLLYEYFTVYNELTKVKYVTEGMRKPAFLSGEQKKAIVDLLFK

TNRKVTVKQLKEDYFKKIECFDSVEISGVEDRFNASLGTYHDLLKIIKDKDFLDNEENEDILEDIVLTLT

LFEDREMIEERLKTYAHLFDDKVMKQLKRRRYTGWGRLSRKLINGIRDKQSGKTILDFLKSDGFANRN

FMQLIHDDSLTFKEDIQKAQVSGQGDSLHEHIANLAGSPAIKKGILQTVKVVDELVKVMGRHKPENIVI

EMARENQTTQKGQKNSRERMKRIEEGIKELGSQILKEHPVENTQLQNEKLYLYYLQNGRDMYVDQEL

DINRLSDYDVDHIVPQSFLKDDSIDNKVLTRSDKNRGKSDNVPSEEVVKKMKNYWRQLLNAKLITQRK

FDNLTKAERGGLSELDKAGFIKRQLVETRQITKHVAQILDSRMNTKYDENDKLIREVKVITLKSKLVSD

FRKDFQFYKVREINNYHHAHDAYLNAVVGTALIKKYPKLESEFVYGDYKVYDVRKMIAKSEQEIGKA

TAKYFFYSNIMNFFKTEITLANGEIRKRPLIETNGETGEIVWDKGRDFATVRKVLSMPQVNIVKKTEVQ

TGGFSKESILPKRNSDKLIARKKDWDPKKYGGFDSPTVAYSVLVVAKVEKGKSKKLKSVKELLGITIME

RSSFEKNPIDFLEAKGYKEVKKDLIIKLPKYSLFELENGRKRMLASAGELQKGNELALPSKYVNFLYLA

SHYEKLKGSPEDNEQKQLFVEQHKHYLDEIIEQISEFSKRVILADANLDKVLSAYNKHRDKPIREQAENI

IHLFTLTNLGAPAAFKYFDTTIDRKRYTSTKEVLDATLIHQSITGLYETRIDLSQLGGD

VRER-nCas9 (D10A/D1135V/G1218R/R1335E/T1337R) S. pyogenes Cas9 Nickase

(SEQ ID NO: 4)

MDKKYSIGLAIGTNSVGWAVITDEYKVPSKKFKVLGNTDRHSIKKNLIGALLFDSGETAEATRLKRTAR

RRYTRRKNRICYLQEIFSNEMAKVDDSFFHRLEESFLVEEDKKHERHPIFGNIVDEVAYHEKYPTIYHLR

KKLVDSTDKADLRLIYLALAHMIKFRGHFLIEGDLNPDNSDVDKLFIQLVQTYNQLFEENPINASGVDA

KAILSARLSKSRRLENLIAQLPGEKKNGLFGNLIALSLGLTPNFKSNFDLAEDAKLQLSKDTYDDDLDN

LLAQIGDQYADLFLAAKNLSDAILLSDILRVNTEITKAPLSASMIKRYDEHHQDLTLLKALVRQQLPEK

YKEIFFDQSKNGYAGYIDGGASQLEFYKFIKPILEKMDGTEELLVKLNREDLLRKQRTFDNGSIPHQIHL

GELHAILRRQEDFYPFLKDNREKIEKILTFRIPYYVGPLARGNSRFAWMTRKSEETITPWNFEEVVDKGA

SAQSFIERMTNFDKNLPNEKVLPKHSLLYEYFTVYNELTKVKYVTEGMRKPAFLSGEQKKAIVDLLFK

TNRKVTVKQLKEDYFKKIECFDSVEISGVEDRFNASLGTYHDLLKIIKDKDFLDNEENEDILEDIVLTLT

LFEDREMIEERLKTYAHLFDDKVMKQLKRRRYTGWGRLSRKLINGIRDKQSGKTILDFLKSDGFANRN

FMQLIHDDSLTFKEDIQKAQVSGQGDSLHEHIANLAGSPAIKKGILQTVKVVDELVKVMGRHKPENIVI

EMARENQTTQKGQKNSRERMKRIEEGIKELGSQILKEHPVENTQLQNEKLYLYYLQNGRDMYVDQEL

DINRLSDYDVDHIVPQSFLKDDSIDNKVLTRSDKNRGKSDNVPSEEVVKKMKNYWRQLLNAKLITQRK

FDNLTKAERGGLSELDKAGFIKRQLVETRQITKHVAQILDSRMNTKYDENDKLIREVKVITLKSKLVSD

FRKDFQFYKVREINNYHHAHDAYLNAVVGTALIKKYPKLESEFVYGDYKVYDVRKMIAKSEQEIGKA

TAKYFFYSNIMNFFKTEITLANGEIRKRPLIETNGETGEIVWDKGRDFATVRKVLSMPQVNIVKKTEVQ

TGGFSKESILPKRNSDKLIARKKDWDPKKYGGFVSPTVAYSVLVVAKVEKGKSKKLKSVKELLGITIME

RSSFEKNPIDFLEAKGYKEVKKDLIIKLPKYSLFELENGRKRMLASARELQKGNELALPSKYVNFLYLA

SHYEKLKGSPEDNEQKQLFVEQHKHYLDEIIEQISEFSKRVILADANLDKVLSAYNKHRDKPIREQAENI

IHLFTLTNLGAPAAFKYFDTTIDRKEYRSTKEVLDATLIHQSITGLYETRIDLSQLGGD

VQR-nCas9 (D10A/D1135V/R1335Q/T1337R) S. pyogenes Cas9 Nickase

(SEQ ID NO: 5)

MDKKYSIGLAIGTNSVGWAVITDEYKVPSKKFKVLGNTDRHSIKKNLIGALLFDSGETAEATRLKRTAR

RRYTRRKNRICYLQEIFSNEMAKVDDSFFHRLEESFLVEEDKKHERHPIFGNIVDEVAYHEKYPTIYHLR

KKLVDSTDKADLRLIYLALAHMIKFRGHFLIEGDLNPDNSDVDKLFIQLVQTYNQLFEENPINASGVDA

KAILSARLSKSRRLENLIAQLPGEKKNGLFGNLIALSLGLTPNFKSNFDLAEDAKLQLSKDTYDDDLDN

LLAQIGDQYADLFLAAKNLSDAILLSDILRVNTEITKAPLSASMIKRYDEHHQDLTLLKALVRQQLPEK

YKEIFFDQSKNGYAGYIDGGASQLEFYKFIKPILEKMDGTEELLVKLNREDLLRKQRTFDNGSIPHQIHL

GELHAILRRQEDFYPFLKDNREKIEKILTFRIPYYVGPLARGNSRFAWMTRKSEETITPWNFEEVVDKGA

SAQSFIERMTNFDKNLPNEKVLPKHSLLYEYFTVYNELTKVKYVTEGMRKPAFLSGEQKKAIVDLLFK

TNRKVTVKQLKEDYFKKIECFDSVEISGVEDRFNASLGTYHDLLKIIKDKDFLDNEENEDILEDIVLTLT

LFEDREMIEERLKTYAHLFDDKVMKQLKRRRYTGWGRLSRKLINGIRDKQSGKTILDFLKSDGFANRN

FMQLIHDDSLTFKEDIQKAQVSGQGDSLHEHIANLAGSPAIKKGILQTVKVVDELVKVMGRHKPENIVI

EMARENQTTQKGQKNSRERMKRIEEGIKELGSQILKEHPVENTQLQNEKLYLYYLQNGRDMYVDQEL

DINRLSDYDVDHIVPQSFLKDDSIDNKVLTRSDKNRGKSDNVPSEEVVKKMKNYWRQLLNAKLITQRK

FDNLTKAERGGLSELDKAGFIKRQLVETRQITKHVAQILDSRMNTKYDENDKLIREVKVITLKSKLVSD

FRKDFQFYKVREINNYHHAHDAYLNAVVGTALIKKYPKLESEFVYGDYKVYDVRKMIAKSEQEIGKA

TAKYFFYSNIMNFFKTEITLANGEIRKRPLIETNGETGEIVWDKGRDFATVRKVLSMPQVNIVKKTEVQ

TGGFSKESILPKRNSDKLIARKKDWDPKKYGGFVSPTVAYSVLVVAKVEKGKSKKLKSVKELLGITIME

RSSFEKNPIDFLEAKGYKEVKKDLIIKLPKYSLFELENGRKRMLASAGELQKGNELALPSKYVNFLYLA

SHYEKLKGSPEDNEQKQLFVEQHKHYLDEIIEQISEFSKRVILADANLDKVLSAYNKHRDKPIREQAENI

IHLFTLTNLGAPAAFKYFDTTIDRKQYRSTKEVLDATLIHQSITGLYETRIDLSQLGGD

EQR-nCas9 (D10A/D1135E/R1335Q/T1337R) S. pyogenes Cas9 Nickase

(SEQ ID NO: 6)

MDKKYSIGLAIGTNSVGWAVITDEYKVPSKKFKVLGNTDRHSIKKNLIGALLFDSGETAEATRLKRTAR

RRYTRRKNRICYLQEIFSNEMAKVDDSFFHRLEESFLVEEDKKHERHPIFGNIVDEVAYHEKYPTIYHLR

KKLVDSTDKADLRLIYLALAHMIKFRGHFLIEGDLNPDNSDVDKLFIQLVQTYNQLFEENPINASGVDA

KAILSARLSKSRRLENLIAQLPGEKKNGLFGNLIALSLGLTPNFKSNFDLAEDAKLQLSKDTYDDDLDN

LLAQIGDQYADLFLAAKNLSDAILLSDILRVNTEITKAPLSASMIKRYDEHHQDLTLLKALVRQQLPEK

YKEIFFDQSKNGYAGYIDGGASQLEFYKFIKPILEKMDGTEELLVKLNREDLLRKQRTFDNGSIPHQIHL

GELHAILRRQEDFYPFLKDNREKIEKILTFRIPYYVGPLARGNSRFAWMTRKSEETITPWNFEEVVDKGA

SAQSFIERMTNFDKNLPNEKVLPKHSLLYEYFTVYNELTKVKYVTEGMRKPAFLSGEQKKAIVDLLFK

TNRKVTVKQLKEDYFKKIECFDSVEISGVEDRFNASLGTYHDLLKIIKDKDFLDNEENEDILEDIVLTLT

LFEDREMIEERLKTYAHLFDDKVMKQLKRRRYTGWGRLSRKLINGIRDKQSGKTILDFLKSDGFANRN

FMQLIHDDSLTFKEDIQKAQVSGQGDSLHEHIANLAGSPAIKKGILQTVKVVDELVKVMGRHKPENIVI

EMARENQTTQKGQKNSRERMKRIEEGIKELGSQILKEHPVENTQLQNEKLYLYYLQNGRDMYVDQEL

DINRLSDYDVDHIVPQSFLKDDSIDNKVLTRSDKNRGKSDNVPSEEVVKKMKNYWRQLLNAKLITQRK

FDNLTKAERGGLSELDKAGFIKRQLVETRQITKHVAQILDSRMNTKYDENDKLIREVKVITLKSKLVSD

FRKDFQFYKVREINNYHHAHDAYLNAVVGTALIKKYPKLESEFVYGDYKVYDVRKMIAKSEQEIGKA

TAKYFFYSNIMNFFKTEITLANGEIRKRPLIETNGETGEIVWDKGRDFATVRKVLSMPQVNIVKKTEVQ

TGGFSKESILPKRNSDKLIARKKDWDPKKYGGFESPTVAYSVLVVAKVEKGKSKKLKSVKELLGITIME

RSSFEKNPIDFLEAKGYKEVKKDLIIKLPKYSLFELENGRKRMLASAGELQKGNELALPSKYVNFLYLA

SHYEKLKGSPEDNEQKQLFVEQHKHYLDEIIEQISEFSKRVILADANLDKVLSAYNKHRDKPIREQAENI

IHLFTLTNLGAPAAFKYFDTTIDRKQYRSTKEVLDATLIHQSITGLYETRIDLSQLGGD

VRQR-nCas9 (D10A/D1135V/G1218R/R1335Q/T1337R) S. pyogenes Cas9 Nickase

(SEQ ID NO: 488)

MDKKYSIGLAIGTNSVGWAVITDEYKVPSKKFKVLGNTDRHSIKKNLIGALLFDSGETAEATRLKRTAR

RRYTRRKNRICYLQEIFSNEMAKVDDSFFHRLEESFLVEEDKKHERHPIFGNIVDEVAYHEKYPTIYHLR

KKLVDSTDKADLRLIYLALAHMIKFRGHFLIEGDLNPDNSDVDKLFIQLVQTYNQLFEENPINASGVDA

KAILSARLSKSRRLENLIAQLPGEKKNGLFGNLIALSLGLTPNFKSNFDLAEDAKLQLSKDTYDDDLDN

LLAQIGDQYADLFLAAKNLSDAILLSDILRVNTEITKAPLSASMIKRYDEHHQDLTLLKALVRQQLPEK

YKEIFFDQSKNGYAGYIDGGASQLEFYKFIKPILEKMDGTEELLVKLNREDLLRKQRTFDNGSIPHQIHL

GELHAILRRQEDFYPFLKDNREKIEKILTFRIPYYVGPLARGNSRFAWMTRKSEETITPWNFEEVVDKGA

SAQSFIERMTNFDKNLPNEKVLPKHSLLYEYFTVYNELTKVKYVTEGMRKPAFLSGEQKKAIVDLLFK

TNRKVTVKQLKEDYFKKIECFDSVEISGVEDRFNASLGTYHDLLKIIKDKDFLDNEENEDILEDIVLTLT

LFEDREMIEERLKTYAHLFDDKVMKQLKRRRYTGWGRLSRKLINGIRDKQSGKTILDFLKSDGFANRN

FMQLIHDDSLTFKEDIQKAQVSGQGDSLHEHIANLAGSPAIKKGILQTVKVVDELVKVMGRHKPENIVI

EMARENQTTQKGQKNSRERMKRIEEGIKELGSQILKEHPVENTQLQNEKLYLYYLQNGRDMYVDQEL

DINRLSDYDVDHIVPQSFLKDDSIDNKVLTRSDKNRGKSDNVPSEEVVKKMKNYWRQLLNAKLITQRK

FDNLTKAERGGLSELDKAGFIKRQLVETRQITKHVAQILDSRMNTKYDENDKLIREVKVITLKSKLVSD

FRKDFQFYKVREINNYHHAHDAYLNAVVGTALIKKYPKLESEFVYGDYKVYDVRKMIAKSEQEIGKA

TAKYFFYSNIMNFFKTEITLANGEIRKRPLIETNGETGEIVWDKGRDFATVRKVLSMPQVNIVKKTEVQ

TGGFSKESILPKRNSDKLIARKKDWDPKKYGGFVSPTVAYSVLVVAKVEKGKSKKLKSVKELLGITIME

RSSFEKNPIDFLEAKGYKEVKKDLIIKLPKYSLFELENGRKRMLASARELQKGNELALPSKYVNFLYLA

SHYEKLKGSPEDNEQKQLFVEQHKHYLDEIIEQISEFSKRVILADANLDKVLSAYNKHRDKPIREQAENI

IHLFTLTNLGAPAAFKYFDTTIDRKQYRSTKEVLDATLIHQSITGLYETRIDLSQLGGD

SaKKH-nCas9 (D10A/E782K/N968K/R1015H) S. aureus Cas9 Nickase

(SEQ ID NO: 7)

MKRNYILGLAIGITSVGYGIIDYETRDVIDAGVRLFKEANVENNEGRRSKRGARRLKRRRRHRIQRVKK

LLFDYNLLTDHSELSGINPYEARVKGLSQKLSEEEFSAALLHLAKRRGVHNVNEVEEDTGNELSTKEQI

SRNSKALEEKYVAELQLERLKKDGEVRGSINRFKTSDYVKEAKQLLKVQKAYHQLDQSFIDTYIDLLE

TRRTYYEGPGEGSPFGWKDIKEWYEMLMGHCTYFPEELRSVKYAYNADLYNALNDLNNLVITRDENE

KLEYYEKFQIIENVFKQKKKPTLKQIAKEILVNEEDIKGYRVTSTGKPEFTNLKVYHDIKDITARKEIIEN

AELLDQIAKILTIYQSSEDIQEELTNLNSELTQLEIEQISNLKGYTGTHNLSLKAINLILDELWHTNDNQIA

IFNRLKLVPKKVDLSQQKEIPTTLVDDFILSPVVKRSFIQSIKVINAIIKKYGLPNDIIIELAREKNSKDAQK

MINEMQKRNRQTNERIEEIIRTTGKENAKYLIEKIKLHDMQEGKCLYSLEAIPLEDLLNNPFNYLVDHIIP

RSVSFDNSFNNKVLVKQEENSKKGNRTPFQYLSSSDSKISYETFKKHILNLAKGKGRISKTKKEYLLEER

DINRFSVQKDFINRNLVDTRYATRGLMNLLRSYFRVNNLDVKVKSINGGFTSFLRRKWKFKKERNKGY

KHHAEDALIIANADFIFKEWKKLDKAKKVMENQMFEEKQAESMPEIETEQEYKEIFITPHQIKHIKDFK

DYKYSHRVDKKPNRKLINDTLYSTRKDDKGNTLIVNNLNGLYDKDNDKLKKLINKSPEKLLMYHHDP

QTYQKLKLIMEQYGDEKNPLYKYYEETGNYLTKYSKKDNGPVIKKIKYYGNKLNAHLDITDDYPNSR

NKVVKLSLKPYRFDVYLDNGVYKFVTVKNLDVIKKENYYLVNSKCYLEAKKLKKISNQAEFIASFYKN

DLIKINGELYRVIGVNNDLLNRIEVNMIDITYREYLENMNDKRPPHIIKTIASKTQSIKKYSTDILGNLYE

VKSKKHPQIIKKG

Streptococcus thermophilus CRISPR1 Cas9 (St1Cas9) Nickase (D9A)

(SEQ ID NO: 8)

MSDLVLGLAIGIGSVGVGILNKVTGEIIHKNSRIFPAAQAENNLVRRTNRQGRRLTRRKKHRRVRLNRL

FEESGLITDFTKISINLNPYQLRVKGLTDELSNEELFIALKNMVKHRGISYLDDASDDGNSSIGDYAQIVK

ENSKQLETKTPGQIQLERYQTYGQLRGDFTVEKDGKKHRLINVFPTSAYRSEALRILQTQQEFNPQITDE

FINRYLEILTGKRKYYHGPGNEKSRTDYGRYRTSGETLDNIFGILIGKCTFYPDEFRAAKASYTAQEFNL

LNDLNNLTVPTETKKLSKEQKNQIINYVKNEKAMGPAKLFKYIAKLLSCDVADIKGYRIDKSGKAEIHT

FEAYRKMKTLETLDIEQMDRETLDKLAYVLTLNTEREGIQEALEHEFADGSFSQKQVDELVQFRKANS

SIFGKGWHNFSVKLMMELIPELYETSEEQMTILTRLGKQKTTSSSNKTKYIDEKLLTEEIYNPVVAKSVR

QAIKIVNAAIKEYGDFDNIVIEMARETNEDDEKKAIQKIQKANKDEKDAAMLKAANQYNGKAELPHSV

FHGHKQLATKIRLWHQQGERCLYTGKTISIHDLINNSNQFEVDHILPLSITFDDSLANKVLVYATANQE

KGQRTPYQALDSMDDAWSFRELKAFVRESKTLSNKKKEYLLTEEDISKFDVRKKFIERNLVDTRYASR

VVLNALQEHFRAHKIDTKVSVVRGQFTSQLRRHWGIEKTRDTYHHHAVDALIIAASSQLNLWKKQKN

TLVSYSEDQLLDIETGELISDDEYKESVFKAPYQHFVDTLKSKEFEDSILFSYQVDSKFNRKISDATIYAT

RQAKVGKDKADETYVLGKIKDIYTQDGYDAFMKIYKKDKSKFLMYRHDPQTPEKVIEPILENYPNKQI

NEKGKEVPCNPFLKYKEEHGYIRKYSKKGNGPEIKSLKYYDSKLGNHIDITPKDSNNKVVLQSVSPWR

ADVYFNKTTGKYEILGLKYADLQFEKGTGTYKISQLKYNDIKKKEGVDSDSEFKFTLYKNDLLLVKDT

ETKEQQLFRFLSRTMPKQKHYVELKPYDKQKFEGGEALIKVLGNVANSGQCKKGLGKSNISIYKVRTD

VLGNQHIIKNEGDKPKLDF

Streptococcus thermophilus CRISPR3 Cas9 (St3Cas9) Nickase (D10A)

(SEQ ID NO: 9)

MTKPYSIGLAIGTNSVGWAVITDNYKVPSKKMKVLGNTSKKYIKKNLLGVLLFDSGITAEGRRLKRTA

RRRYTRRRNRILYLQEIFSTEMATLDDAFFQRLDDSFLVPDDKRDSKYPIFGNLVEEKVYHDEFPTIYHL

RKYLADSTKKADLRLVYLALAHMIKYRGHFLIEGEFNSKNNDIQKNFQDFLDTYNAIFESDLSLENSKQ

LEEIVKDKISKLEKKDRILKLFPGEKNSGIFSEFLKLIVGNQADFRKCFNLDEKASLHFSKESYDEDLETL

LGYIGDDYSDVFLKAKKLYDAILLSGFLTVTDNETEAPLSSAMIKRYNEHKEDLALLKEYIRNISLKTYN

EVFKDDTKNGYAGYIDGKTNQEDFYVYLKNLLAEFEGADYFLEKIDREDFLRKQRTFDNGSIPYQIHLQ

EMRAILDKQAKFYPFLAKNKERIEKILTFRIPYYVGPLARGNSDFAWSIRKRNEKITPWNFEDVIDKESS

AEAFINRMTSFDLYLPEEKVLPKHSLLYETFNVYNELTKVRFIAESMRDYQFLDSKQKKDIVRLYFKDK

RKVTDKDIIEYLHAIYGYDGIELKGIEKQFNSSLSTYHDLLNIINDKEFLDDSSNEAIIEEIIHTLTIFEDRE

MIKQRLSKFENIFDKSVLKKLSRRHYTGWGKLSAKLINGIRDEKSGNTILDYLIDDGISNRNFMQLIHDD

ALSFKKKIQKAQIIGDEDKGNIKEVVKSLPGSPAIKKGILQSIKIVDELVKVMGGRKPESIVVEMARENQ

YTNQGKSNSQQRLKRLEKSLKELGSKILKENIPAKLSKIDNNALQNDRLYLYYLQNGKDMYTGDDLDI

DRLSNYDIDHIIPQAFLKDNSIDNKVLVSSASNRGKSDDFPSLEVVKKRKTFWYQLLKSKLISQRKFDNL

TKAERGGLLPEDKAGFIQRQLVETRQITKHVARLLDEKFNNKKDENNRAVRTVKIITLKSTLVSQFRKD

FELYKVREINDFHHAHDAYLNAVIASALLKKYPKLEPEFVYGDYPKYNSFRERKSATEKVYFYSNIMNI

FKKSISLADGRVIERPLIEVNEETGESVWNKESDLATVRRVLSYPQVNVVKKVEEQNHGLDRGKPKGL

FNANLSSKPKPNSNENLVGAKEYLDPKKYGGYAGISNSFAVLVKGTIEKGAKKKITNVLEFQGISILDRI

NYRKDKLNFLLEKGYKDIELIIELPKYSLFELSDGSRRMLASILSTNNKRGEIHKGNQIFLSQKFVKLLYH

AKRISNTINENHRKYVENHKKEFEELFYYILEFNENYVGAKKNGKLLNSAFQSWQNHSIDELCSSFIGPT

GSERKGLFELTSRGSAADFEFLGVKIPRYRDYTPSSLLKDATLIHQSVTGLYETRIDLAKLGEG

S. aureus Cas9 wild type

(SEQ ID NO: 10)

MKRNYILGLDIGITSVGYGIIDYETRDVIDAGVRLFKEANVENNEGRRSKRGARRLKRRRRHRIQRVKK

LLFDYNLLTDHSELSGINPYEARVKGLSQKLSEEEFSAALLHLAKRRGVHNVNEVEEDTGNELSTKEQI

SRNSKALEEKYVAELQLERLKKDGEVRGSINRFKTSDYVKEAKQLLKVQKAYHQLDQSFIDTYIDLLE

TRRTYYEGPGEGSPFGWKDIKEWYEMLMGHCTYFPEELRSVKYAYNADLYNALNDLNNLVITRDENE

KLEYYEKFQIIENVFKQKKKPTLKQIAKEILVNEEDIKGYRVTSTGKPEFTNLKVYHDIKDITARKEIIEN

AELLDQIAKILTIYQSSEDIQEELTNLNSELTQLEIEQISNLKGYTGTHNLSLKAINLILDELWHTNDNQIA

IFNRLKLVPKKVDLSQQKEIPTTLVDDFILSPVVKRSFIQSIKVINAIIKKYGLPNDIIIELAREKNSKDAQK

MINEMQKRNRQTNERIEEIIRTTGKENAKYLIEKIKLHDMQEGKCLYSLEAIPLEDLLNNPFNYEVDHIIP

RSVSFDNSFNNKVLVKQEENSKKGNRTPFQYLSSSDSKISYETFKKHILNLAKGKGRISKTKKEYLLEER

DINRFSVQKDFINRNLVDTRYATRGLMNLLRSYFRVNNLDVKVKSINGGFTSFLRRKWKFKKERNKGY

KHHAEDALIIANADFIFKEWKKLDKAKKVMENQMFEEKQAESMPEIETEQEYKEIFITPHQIKHIKDFK

DYKYSHRVDKKPNRELINDTLYSTRKDDKGNTLIVNNLNGLYDKDNDKLKKLINKSPEKLLMYHHDP

QTYQKLKLIMEQYGDEKNPLYKYYEETGNYLTKYSKKDNGPVIKKIKYYGNKLNAHLDITDDYPNSR

NKVVKLSLKPYRFDVYLDNGVYKFVTVKNLDVIKKENYYEVNSKCYEEAKKLKKISNQAEFIASFYNN

DLIKINGELYRVIGVNNDLLNRIEVNMIDITYREYLENMNDKRPPRIIKTIASKTQSIKKYSTDILGNLYE

VKSKKHPQIIKKG

S. aureus Cas9 Nickase (D10A)

(SEQ ID NO: 11)

MKRNYILGLAIGITSVGYGIIDYETRDVIDAGVRLFKEANVENNEGRRSKRGARRLKRRRRHRIQRVKK

LLFDYNLLTDHSELSGINPYEARVKGLSQKLSEEEFSAALLHLAKRRGVHNVNEVEEDTGNELSTKEQI

SRNSKALEEKYVAELQLERLKKDGEVRGSINRFKTSDYVKEAKQLLKVQKAYHQLDQSFIDTYIDLLE

TRRTYYEGPGEGSPFGWKDIKEWYEMLMGHCTYFPEELRSVKYAYNADLYNALNDLNNLVITRDENE

KLEYYEKFQIIENVFKQKKKPTLKQIAKEILVNEEDIKGYRVTSTGKPEFTNLKVYHDIKDITARKEIIEN

AELLDQIAKILTIYQSSEDIQEELTNLNSELTQEEIEQISNLKGYTGTHNLSLKAINLILDELWHTNDNQIA

IFNRLKLVPKKVDLSQQKEIPTTLVDDFILSPVVKRSFIQSIKVINAIIKKYGLPNDIIIELAREKNSKDAQK

MINEMQKRNRQTNERIEEIIRTTGKENAKYLIEKIKLHDMQEGKCLYSLEAIPLEDLLNNPFNYEVDHIIP

RSVSFDNSFNNKVLVKQEENSKKGNRTPFQYLSSSDSKISYETFKKHILNLAKGKGRISKTKKEYLLEER

DINRFSVQKDFINRNLVDTRYATRGLMNLLRSYFRVNNLDVKVKSINGGFTSFLRRKWKFKKERNKG

YKHHAEDALIIANADFIFKEWKKLDKAKKVMENQMFEEKQAESMPEIETEQEYKEIFITPHQIKHIKDF

KDYKYSHRVDKKPNRELINDTLYSTRKDDKGNTLIVNNLNGLYDKDNDKLKKLINKSPEKLLMYHHD

PQTYQKLKLIMEQYGDEKNPLYKYYEETGNYLTKYSKKDNGPVIKKIKYYGNKLNAHLDITDDYPNSR

NKVVKLSLKPYRFDVYLDNGVYKFVTVKNLDVIKKENYYEVNSKCYEEAKKLKKISNQAEFIASFYNN

DLIKINGELYRVIGVNNDLLNRIEVNMIDITYREYLENMNDKRPPRIIKTIASKTQSIKKYSTDILGNLYE

VKSKKHPQIIKKG

Streptococcus thermophilus wild type CRISPR3 Cas9 (St3Cas9)

(SEQ ID NO: 12)

MTKPYSIGLDIGTNSVGWAVITDNYKVPSKKMKVLGNTSKKYIKKNLLGVLLFDSGITAEGRRLKRTA

RRRYTRRRNRILYLQEIFSTEMATLDDAFFQRLDDSFLVPDDKRDSKYPIFGNLVEEKVYHDEFPTIYHL

RKYLADSTKKADLRLVYLALAHMIKYRGHFLIEGEFNSKNNDIQKNFQDFLDTYNAIFESDLSLENSKQ

LEEIVKDKISKLEKKDRILKLFPGEKNSGIFSEFLKLIVGNQADFRKCFNLDEKASLHFSKESYDEDLETL

LGYIGDDYSDVFLKAKKLYDAILLSGFLTVTDNETEAPLSSAMIKRYNEHKEDLALLKEYIRNISLKTYN

EVFKDDTKNGYAGYIDGKTNQEDFYVYLKNLLAEFEGADYFLEKIDREDFLRKQRTFDNGSIPYQIHLQ

EMRAILDKQAKFYPFLAKNKERIEKILTFRIPYYVGPLARGNSDFAWSIRKRNEKITPWNFEDVIDKESS

AEAFINRMTSFDLYLPEEKVLPKHSLLYETFNVYNELTKVRFIAESMRDYQFLDSKQKKDIVRLYFKDK

RKVTDKDIIEYLHAIYGYDGIELKGIEKQFNSSLSTYHDLLNIINDKEFLDDSSNEAIIEEIIHTLTIFEDRE

MIKQRLSKFENIFDKSVLKKLSRRHYTGWGKLSAKLINGIRDEKSGNTILDYLIDDGISNRNFMQLIHDD

ALSFKKKIQKAQIIGDEDKGNIKEVVKSLPGSPAIKKGILQSIKIVDELVKVMGGRKPESIVVEMARENQ

YTNQGKSNSQQRLKRLEKSLKELGSKILKENIPAKLSKIDNNALQNDRLYLYYLQNGKDMYTGDDLDI

DRLSNYDIDHIIPQAFLKDNSIDNKVLVSSASNRGKSDDFPSLEVVKKRKTFWYQLLKSKLISQRKFDNL

TKAERGGLLPEDKAGFIQRQLVETRQITKHVARLLDEKFNNKKDENNRAVRTVKIITLKSTLVSQFRKD

FELYKVREINDFHHAHDAYLNAVIASALLKKYPKLEPEFVYGDYPKYNSFRERKSATEKVYFYSNIMNI

FKKSISLADGRVIERPLIEVNEETGESVWNKESDLATVRRVLSYPQVNVVKKVEEQNHGLDRGKPKGL

FNANLSSKPKPNSNENLVGAKEYLDPKKYGGYAGISNSFAVLVKGTIEKGAKKKITNVLEFQGISILDRI

NYRKDKLNFLLEKGYKDIELIIELPKYSLFELSDGSRRMLASILSTNNKRGEIHKGNQIFLSQKFVKLLYH

AKRISNTINENHRKYVENHKKEFEELFYYILEFNENYVGAKKNGKLLNSAFQSWQNHSIDELCSSFIGPT

GSERKGLFELTSRGSAADFEFLGVKIPRYRDYTPSSLLKDATLIHQSVTGLYETRIDLAKLGEG

Streptococcus thermophilus CRISPR1 Cas9 wild type (St1Cas9)

(SEQ ID NO: 13)

MSDLVLGLDIGIGSVGVGILNKVTGEIIHKNSRIFPAAQAENNLVRRTNRQGRRLTRRKKHRRVRLNRL

FEESGLITDFTKISINLNPYQLRVKGLTDELSNEELFIALKNMVKHRGISYLDDASDDGNSSIGDYAQIVK

ENSKQLETKTPGQIQLERYQTYGQLRGDFTVEKDGKKHRLINVFPTSAYRSEALRILQTQQEFNPQITDE

FINRYLEILTGKRKYYHGPGNEKSRTDYGRYRTSGETLDNIFGILIGKCTFYPDEFRAAKASYTAQEFNL

LNDLNNLTVPTETKKLSKEQKNQIINYVKNEKAMGPAKLFKYIAKLLSCDVADIKGYRIDKSGKAEIHT

FEAYRKMKTLETLDIEQMDRETLDKLAYVLTLNTEREGIQEALEHEFADGSFSQKQVDELVQFRKANS

SIFGKGWHNFSVKLMMELIPELYETSEEQMTILTRLGKQKTTSSSNKTKYIDEKLLTEEIYNPVVAKSVR

QAIKIVNAAIKEYGDFDNIVIEMARETNEDDEKKAIQKIQKANKDEKDAAMLKAANQYNGKAELPHSV

FHGHKQLATKIRLWHQQGERCLYTGKTISIHDLINNSNQFEVDHILPLSITFDDSLANKVLVYATANQE

KGQRTPYQALDSMDDAWSFRELKAFVRESKTLSNKKKEYLLTEEDISKFDVRKKFIERNLVDTRYASR

VVLNALQEHFRAHKIDTKVSVVRGQFTSQLRRHWGIEKTRDTYHHHAVDALIIAASSQLNLWKKQKN

TLVSYSEDQLLDIETGELISDDEYKESVFKAPYQHFVDTLKSKEFEDSILFSYQVDSKFNRKISDATIYAT

RQAKVGKDKADETYVLGKIKDIYTQDGYDAFMKIYKKDKSKFLMYRHDPQTPEKVIEPILENYPNKQI

NEKGKEVPCNPFLKYKEEHGYIRKYSKKGNGPEIKSLKYYDSKLGNHIDITPKDSNNKVVLQSVSPWR

ADVYFNKTTGKYEILGLKYADLQFEKGTGTYKISQLKYNDIKKKEGVDSDSEFKFTLYKNDLLLVKDT

ETKEQQLFRFLSRTMPKQKHYVELKPYDKQKFEGGEALIKVLGNVANSGQCKKGLGKSNISIYKVRTD

VLGNQHIIKNEGDKPKLDF

CasX from Sulfolobus islandicus (strain REY15A)

(SEQ ID NO: 14)

MEVPLYNIFGDNYIIQVATEAENSTIYNNKVEIDDEELRNVLNLAYKIAKNNEDAAAERRGKAKKKKG

LEGETTTSNIILPLSGNDKNPWTETLKCYNFPTTVALSEVFKNFSQVKECEEVSAPSFVKPEFYKFGRSP

GMVERTRRVKLEVEPHYLIMAAAGWVLTRLGKAKVSEGDYVGVNVFTPTRGILYSLIQNVNGIVPGIK

PETAFGLWIARKVVSSVTNPNVSVVSIYTISDAVGQNPTTINGGFSIDLTKLLEKRDLLSERLEAIARNAL

SISSNMRERYIVLANYIYEYLTGSKRLEDLLYFANRDLIMNLNSDDGKVRDLKLISAYVNGELIRGEG

CasY from Sulfolobus islandicus (strain REY15A)

(SEQ ID NO: 15)

MEVPLYNIFGDNYIIQVATEAENSTIYNNKVEIDDEELRNVLNLAYKIAKNNEDAAAERRGKAKKKKG

LEGETTTSNIILPLSGNDKNPWTETLKCYNFPTTVALSEVFKNFSQVKECEEVSAPSFVKPEFYLFGRSPG

MVERTRRVKLEVEPHYLIIAAAGWVLTRLGKAKVSEGDYVGVNVFTPTRGILYSLIQNVNGIVPGIKPE

TAFGLWIARKVVSSVTNPNVSVVRIYTISDAVGQNPTTINGGFSIDLTKLLEKRYLLSERLEAIARNALSI

SSNMRERYIVLANYIYEYLTGSKRLEDLLYFANRDLIMNLNSDDGKVRDLKLISAYVNGELIRGEG

Some aspects of the disclosure provide Cas9 domains that have different PAM specificities. Typically, Cas9 proteins, such as Cas9 from S. pyogenes (spCas9), require a canonical NGG PAM sequence to bind a particular nucleic acid region. This may limit the ability to edit desired bases within a genome. In some embodiments, the base editing fusion proteins provided herein may need to be placed at a precise location, for example where a target base is placed within a 4 base region (e.g., an “editing window”), which is approximately 15 bases upstream of the PAM. See Komor, A. C., et al., “Programmable editing of a target base in genomic DNA without double-stranded DNA cleavage” Nature 533, 420-424 (2016), the entire contents of which are hereby incorporated by reference. Accordingly, in some embodiments, any of the fusion proteins provided herein may contain a Cas9 domain that is capable of binding a nucleotide sequence that does not contain a canonical (e.g., NGG) PAM sequence. Cas9 domains that bind to non-canonical PAM sequences have been described in the art and would be apparent to the skilled artisan. For example, Cas9 domains that bind non-canonical PAM sequences have been described in Kleinstiver, B. P., et al., “Engineered CRISPR-Cas9 nucleases with altered PAM specificities” Nature 523, 481-485 (2015); and Kleinstiver, B. P., et al., “Broadening the targeting range of Staphylococcus aureus CRISPR-Cas9 by modifying PAM recognition” Nature Biotechnology 33, 1293-1298 (2015); the entire contents of each are hereby incorporated by reference.
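The dependence of the editing window on PAM position can be sketched in Python as follows. This is an illustration under stated assumptions rather than a description of any particular design tool: it scans only the given strand, assumes a 20-nucleotide protospacer immediately 5′ of the PAM, and places the 4-base window at protospacer positions 4-7 counted from the PAM-distal end (14 to 17 bases upstream of the PAM). The function name `find_editing_windows` and its default parameters are hypothetical.

```python
import re

# Subset of IUPAC nucleotide codes, expressed as regular-expression classes.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "N": "[ACGT]", "R": "[AG]", "Y": "[CT]"}


def find_editing_windows(dna, pam="NGG", protospacer_len=20, window_start=4, window_len=4):
    """Yield (pam_index, window_sequence) for each PAM match on the given strand.

    window_start is the 1-based protospacer position, counted from the
    PAM-distal end, at which the assumed editing window begins.
    """
    dna = dna.upper()
    # A lookahead is used so that overlapping PAM occurrences are all reported.
    pattern = re.compile("(?=(" + "".join(IUPAC[base] for base in pam) + "))")
    for match in pattern.finditer(dna):
        pam_index = match.start()
        protospacer_begin = pam_index - protospacer_len
        if protospacer_begin < 0:
            continue  # not enough upstream sequence for a full protospacer
        begin = protospacer_begin + window_start - 1
        yield pam_index, dna[begin:begin + window_len]


# Toy target: 20-nt protospacer with C at positions 4-7, followed by an AGG PAM.
toy_site = "AAACCCC" + "T" * 13 + "AGG"
print(list(find_editing_windows(toy_site)))  # [(20, 'CCCC')]
```

Passing a different `pam` string shows how a Cas9 domain with altered PAM specificity changes which bases fall inside a window; the PAM strings to use for any given variant are not taken from this disclosure and would need to be supplied by the user.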

For example, a napDNAbp domain with altered PAM specificity may be used, such as a domain with at least 80%, at least 85%, at least 90%, at least 95%, or at least 99% sequence identity with wild type Francisella novicida Cpf1 (SEQ ID NO: 16) (D917, E1006, and D1255), which has the following amino acid sequence:

Wild type Francisella novicida Cpf1

(D917, E1006, and D1255 are bolded and underlined)

(SEQ ID NO: 16)

MSIYQEFVNKYSLSKTLRFELIPQGKTLENIKARGLILDDEKRAKDYKKAKQIIDKYHQFFIEEILSSVCIS

EDLLQNYSDVYFKLKKSDDDNLQKDFKSAKDTIKKQISEYIKDSEKFKNLFNQNLIDAKKGQESDLILW

LKQSKDNGIELFKANSDITDIDEALEIIKSFKGWTTYFKGFHENRKNVYSSNDIPTSIIYRIVDDNLPKFLE

NKAKYESLKDKAPEAINYEQIKKDLAELLTFDIDYKTSEVNQRVFSLDEVFEIANFNNYLNQSGITKFNT

IIGGKFVNGENTKRKGINEYINLYSQQINDKTLKKYKMSVLFKQILSDTESKSFVIDKLEDDSDVVTTMQ

SFYEQIAAFKTVEEKSIKETLSLLFDDLKAQKLDLSKIYFKNDKSLTDLSQQVFDDYSVIGTAVLEYITQ

QIAPKNLDNPSKKEQELIAKKTEKAKYLSLETIKLALEEFNKHRDIDKQCRFEEILANFAAIPMIFDEIAQ

NKDNLAQISIKYQNQGKKDLLQASAEDDVKAIKDLLDQTNNLLHKLKIFHISQSEDKANILDKDEHFYL

VFEECYFELANIVPLYNKIRNYITQKPYSDEKFKLNFENSTLANGWDKNKEPDNTAILFIKDDKYYLGV

MNKKNNKIFDDKAIKENKGEGYKKIVYKLLPGANKMLPKVFFSAKSIKFYNPSEDILRIRNHSTHTKNG

SPQKGYEKFEFNIEDCRKFIDFYKQSISKHPEWKDFGFRFSDTQRYNSIDEFYREVENQGYKLTFENISES

YIDSVVNQGKLYLFQIYNKDFSAYSKGRPNLHTLYWKALFDERNLQDVVYKLNGEAELFYRKQSIPKK

ITHPAKEAIANKNKDNPKKESVFEYDLIKDKRFTEDKFFFHCPITINFKSSGANKFNDEINLLLKEKAND

VHILSID