MITOCHONDRIAL BASE EDITORS AND METHODS FOR EDITING MITOCHONDRIAL DNA

BACKGROUND OF THE INVENTION

Inherited or acquired mutations in mitochondrial DNA (mtDNA) can profoundly impact cell physiology and are associated with a spectrum of human diseases, ranging from rare inborn errors of metabolism and certain cancers to age-associated neurodegeneration and even the aging process itself. Tools for introducing specific modifications to mtDNA are needed both for modeling diseases and for their therapeutic potential. The development of such tools, however, has been constrained in part by the challenge of transporting RNAs into mitochondria, including guide RNAs required to facilitate nucleic acid modification and/or editing using CRISPR-associated proteins.

Each mammalian cell contains hundreds to thousands of copies of circular mtDNA. Homoplasmy refers to a state in which all mtDNA molecules are identical, while heteroplasmy refers to a state in which a cell contains a mixture of wild-type and mutant mtDNA. Current approaches to engineering and/or altering mtDNA rely on RNA-free DNA-binding proteins, such as transcription activator-like effector nucleases (mitoTALENs) and zinc finger nucleases fused to mitochondrial targeting sequences (mitoZFNs), to induce double-strand breaks (DSBs). Upon cleavage, the linearized mtDNA is rapidly degraded, resulting in heteroplasmic shifts to favor uncut mtDNA genomes. As a candidate therapy, however, this approach cannot be applied to homoplasmic mtDNA mutations, since destroying all mtDNA copies is presumed to be harmful. In addition, using DSBs to eliminate heteroplasmic mtDNA mutations, which tend to be functionally recessive, implicitly requires the edited cell to restore its wild-type mtDNA copy number. During this transient period of mtDNA repopulation, the loss of mtDNA copies could cause cellular toxicity resulting in deleterious effects (e.g., apoptosis).
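
By way of illustration only, the heteroplasmic shift and the accompanying loss of mtDNA copies described above can be expressed as simple arithmetic. The following minimal sketch assumes a hypothetical cell with 800 mutant and 200 wild-type mtDNA copies and a hypothetical cleavage step that eliminates 75% of the mutant copies; these numbers, and the function names `heteroplasmy` and `cleave_mutant`, are assumptions chosen for illustration and are not taken from the present disclosure.

```python
from typing import Tuple


def heteroplasmy(mutant: int, wild_type: int) -> float:
    """Return the fraction of mtDNA copies that carry the mutation."""
    total = mutant + wild_type
    if total == 0:
        raise ValueError("cell contains no mtDNA copies")
    return mutant / total


def cleave_mutant(mutant: int, wild_type: int, fraction_cut: float) -> Tuple[int, int]:
    """Model DSB-induced degradation: remove a fraction of mutant copies only.

    Wild-type copies are left untouched, which is an idealization.
    """
    if not 0.0 <= fraction_cut <= 1.0:
        raise ValueError("fraction_cut must be between 0 and 1")
    remaining_mutant = mutant - round(mutant * fraction_cut)
    return remaining_mutant, wild_type


# Hypothetical starting state (illustrative numbers only).
mutant, wild_type = 800, 200
before = heteroplasmy(mutant, wild_type)

# Hypothetical outcome of nuclease treatment.
mutant_after, wild_type_after = cleave_mutant(mutant, wild_type, 0.75)
after = heteroplasmy(mutant_after, wild_type_after)

# Copies the cell must replace to restore its original copy number.
copies_lost = (mutant + wild_type) - (mutant_after + wild_type_after)

print(f"heteroplasmy before: {before:.2f}, after: {after:.2f}")
print(f"mtDNA copies lost: {copies_lost}")
```

In this hypothetical case the mutant fraction falls from 0.80 to 0.50, but only because the cell has lost 600 of its 1,000 mtDNA copies, which illustrates why the repopulation period is a concern. In a homoplasmic cell (no wild-type copies), the same operation would only deplete mtDNA without changing the mutant fraction.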

A favorable alternative to targeted destruction of DNA through DSBs is precision genome editing. The ability to precisely install or correct pathogenic mutations, rather than destroy targeted mtDNA, could accelerate the ability to model mtDNA diseases in cells and animal models, and in principle could also enable therapeutic approaches that correct pathogenic mtDNA and genomic DNA mutations.

Therefore, the development of programmable base editors that are capable of introducing a nucleotide change and/or that could alter or modify the nucleotide sequence at a target site with high specificity and efficiency within DNA, including genomic DNA and mtDNA, would substantially expand the scope and therapeutic potential of genome editing technologies.

SUMMARY OF THE INVENTION

The present disclosure is based on the development of engineered zinc finger domain-containing proteins, engineered double-stranded DNA deaminase A (DddA) variants, and fusion proteins comprising engineered zinc finger domain-containing proteins and/or engineered DddA variants that display increased on-target base editing activity and/or decreased off-target base editing activity, including when acting on mtDNA. Thus, in one aspect, the present disclosure provides engineered zinc finger domain-containing proteins comprising (i) one or more linker motifs, wherein each linker motif independently comprises the amino acid sequence of any one of SEQ ID NOs: 1-24; (ii) one or more α-motifs, wherein each α-motif independently comprises the amino acid sequence of any one of SEQ ID NOs: 25-42 and 346; and (iii) one or more β-motifs, wherein each β-motif independently comprises the amino acid sequence of any one of SEQ ID NOs: 43-138 and 336-345, or an amino acid sequence that is at least 90%, at least 95%, or at least 99% identical to the amino acid sequence of any one of SEQ ID NOs: 43-138 and 336-345. In some embodiments, a zinc finger domain-containing protein comprises the structure [first β-motif]-[first DNA recognition motif]-[first α-motif]-[first linker motif]-[second β-motif]-[second DNA recognition motif]-[second α-motif]-[second linker motif]-[third β-motif]-[third DNA recognition motif]-[third α-motif]. In certain embodiments, each of the first, second, and third β-motifs comprise the same amino acid sequence, each of the first, second, and third α-motifs comprise the same amino acid sequence, and/or each of the first and second linker motifs comprise the same amino acid sequence. 
In some embodiments, a zinc finger domain-containing protein comprises the structure [first β-motif]-[first DNA recognition motif]-[first α-motif]-[first linker motif]-[second β-motif]-[second DNA recognition motif]-[second α-motif]-[second linker motif]-[third β-motif]-[third DNA recognition motif]-[third α-motif]-[third linker motif]-[fourth β-motif]-[fourth DNA recognition motif]-[fourth α-motif]. In certain embodiments, each of the first, second, third, and fourth β-motifs comprise the same amino acid sequence, each of the first, second, third, and fourth α-motifs comprise the same amino acid sequence, and/or each of the first, second, and third linker motifs comprise the same amino acid sequence. In some embodiments, a zinc finger domain-containing protein comprises the structure [first β-motif]-[first DNA recognition motif]-[first α-motif]-[first linker motif]-[second β-motif]-[second DNA recognition motif]-[second α-motif]-[second linker motif]-[third β-motif]-[third DNA recognition motif]-[third α-motif]-[third linker motif]-[fourth β-motif]-[fourth DNA recognition motif]-[fourth α-motif]-[fourth linker motif]-[fifth β-motif]-[fifth DNA recognition motif]-[fifth α-motif]. In certain embodiments, each of the first, second, third, fourth, and fifth β-motifs comprise the same amino acid sequence, each of the first, second, third, fourth, and fifth α-motifs comprise the same amino acid sequence, and/or each of the first, second, third, and fourth linker motifs comprise the same amino acid sequence. 
In some embodiments, a zinc finger domain-containing protein comprises the structure [first β-motif]-[first DNA recognition motif]-[first α-motif]-[first linker motif]-[second β-motif]-[second DNA recognition motif]-[second α-motif]-[second linker motif]-[third β-motif]-[third DNA recognition motif]-[third α-motif]-[third linker motif]-[fourth β-motif]-[fourth DNA recognition motif]-[fourth α-motif]-[fourth linker motif]-[fifth β-motif]-[fifth DNA recognition motif]-[fifth α-motif]-[fifth linker motif]-[sixth β-motif]-[sixth DNA recognition motif]-[sixth α-motif]. In certain embodiments, each of the first, second, third, fourth, fifth, and sixth β-motifs comprise the same amino acid sequence, each of the first, second, third, fourth, fifth, and sixth α-motifs comprise the same amino acid sequence, and each of the first, second, third, fourth, and fifth linker motifs comprise the same amino acid sequence. In some embodiments, any of the zinc finger domain-containing proteins provided herein may comprise an N-terminal cap (e.g., the amino acid sequence MAERP). In some embodiments, any of the zinc finger domain-containing proteins provided herein may comprise a C-terminal cap (e.g., the amino acid sequence HTKIHLR).

Each of the linker, alpha, and beta motifs may comprise or consist of any of the various amino acid sequences provided herein, in any combination with one another. In certain preferred embodiments, the present disclosure provides zinc finger domain-containing proteins that comprise multiple instances of the same linker sequence, the same beta motif sequence, and the same alpha motif sequence, including embodiments in which the zinc finger protein comprises the same sequence for all instances of the linker motif within the protein, the same sequence for all instances of the beta motif within the protein, and the same sequence for all instances of the alpha motif within the protein.

In some embodiments, a zinc finger domain-containing protein comprises one or more linker motifs comprising the amino acid sequence of any one of TGEKP (SEQ ID NO: 1), SGEKP (SEQ ID NO: 13), SGERP (SEQ ID NO: 14), and SGDKP (SEQ ID NO: 17). In certain embodiments, all of the linker motifs present in a zinc finger domain-containing protein each comprise the same amino acid sequence selected from the group consisting of TGEKP (SEQ ID NO: 1), SGEKP (SEQ ID NO: 13), SGERP (SEQ ID NO: 14), and SGDKP (SEQ ID NO: 17).

In some embodiments, a zinc finger domain-containing protein comprises one or more α-motifs comprising the amino acid sequence of any one of HMRTH (SEQ ID NO: 33), HMKIH (SEQ ID NO: 34), HMKVH (SEQ ID NO: 35), HMKTH (SEQ ID NO: 36), and HIRTH (SEQ ID NO: 346). In certain embodiments, all of the α-motifs present in a zinc finger domain-containing protein each comprise the same amino acid sequence selected from the group consisting of HMRTH (SEQ ID NO: 33), HMKIH (SEQ ID NO: 34), HMKVH (SEQ ID NO: 35), HMKTH (SEQ ID NO: 36), and HIRTH (SEQ ID NO: 346).

In some embodiments, a zinc finger domain-containing protein comprises one or more β-motifs comprising the amino acid sequence of any one of YKCNECGKAFN (SEQ ID NO: 51), YKCNECGKSFN (SEQ ID NO: 54), YKCSECGKAFN (SEQ ID NO: 57), YKCEECGKAFN (SEQ ID NO: 63), FKCNECGKAFN (SEQ ID NO: 99), FKCNECGKSFN (SEQ ID NO: 102), FKCSECGKAFN (SEQ ID NO: 105), FKCEECGKAFS (SEQ ID NO: 109), FKCEECGKAFN (SEQ ID NO: 111), FKCEECGKSFN (SEQ ID NO: 114), YACPECGKSFS (SEQ ID NO: 337), and FACDICGRKFA (SEQ ID NO: 345). In certain embodiments, all of the β-motifs present in a zinc finger domain-containing protein each comprise the same amino acid sequence selected from the group consisting of YKCNECGKAFN (SEQ ID NO: 51), YKCNECGKSFN (SEQ ID NO: 54), YKCSECGKAFN (SEQ ID NO: 57), YKCEECGKAFN (SEQ ID NO: 63), FKCNECGKAFN (SEQ ID NO: 99), FKCNECGKSFN (SEQ ID NO: 102), FKCSECGKAFN (SEQ ID NO: 105), FKCEECGKAFS (SEQ ID NO: 109), FKCEECGKAFN (SEQ ID NO: 111), FKCEECGKSFN (SEQ ID NO: 114), YACPECGKSFS (SEQ ID NO: 337), and FACDICGRKFA (SEQ ID NO: 345).

In certain embodiments, the present disclosure provides zinc finger domain-containing proteins in which every β-motif comprises the amino acid sequence FACDICGRKFA (SEQ ID NO: 345), every α-motif comprises the amino acid sequence HIRTH (SEQ ID NO: 346), and every linker motif comprises the amino acid sequence TGEKP (SEQ ID NO: 1). In certain embodiments, every β-motif comprises the amino acid sequence YACPECGKSFS (SEQ ID NO: 337), every α-motif comprises the amino acid sequence HIRTH (SEQ ID NO: 346), and every linker motif comprises the amino acid sequence TGEKP (SEQ ID NO: 1). In certain embodiments, every β-motif comprises the amino acid sequence FKCEECGKAFN (SEQ ID NO: 111), every α-motif comprises the amino acid sequence HIRTH (SEQ ID NO: 346), and every linker motif comprises the amino acid sequence TGEKP (SEQ ID NO: 1). In certain embodiments, every β-motif comprises the amino acid sequence YKCEECGKAFN (SEQ ID NO: 63), every α-motif comprises the amino acid sequence HIRTH (SEQ ID NO: 346), and every linker motif comprises the amino acid sequence TGEKP (SEQ ID NO: 1).
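
For illustration, the repeating [β-motif]-[DNA recognition motif]-[α-motif] structure joined by linker motifs can be sketched as a simple string assembly. The sketch below uses the β-motif FACDICGRKFA (SEQ ID NO: 345), the α-motif HIRTH (SEQ ID NO: 346), and the linker motif TGEKP (SEQ ID NO: 1) recited above. The function name `assemble_zf_array` is hypothetical, and the DNA recognition motifs are shown as the placeholder string "XXXXXXX"; both the placeholder residues and their length are assumptions made for illustration, and the optional N- and C-terminal caps are not modeled.

```python
from typing import List


def assemble_zf_array(recognition_motifs: List[str],
                      beta: str, alpha: str, linker: str) -> str:
    """Build [beta]-[recognition]-[alpha] fingers joined by the linker motif.

    The same beta, alpha, and linker sequences are used for every finger.
    """
    if not 3 <= len(recognition_motifs) <= 6:
        raise ValueError("this sketch covers arrays of three to six zinc fingers")
    fingers = [beta + recognition + alpha for recognition in recognition_motifs]
    # A linker motif sits between consecutive fingers, not after the last one.
    return linker.join(fingers)


BETA = "FACDICGRKFA"   # SEQ ID NO: 345
ALPHA = "HIRTH"        # SEQ ID NO: 346
LINKER = "TGEKP"       # SEQ ID NO: 1

# Placeholder recognition motifs (hypothetical; not real DNA-binding residues).
placeholders = ["XXXXXXX", "XXXXXXX", "XXXXXXX"]

three_finger_array = assemble_zf_array(placeholders, BETA, ALPHA, LINKER)
print(three_finger_array)
```

A three-finger array built this way contains two linker motifs, a four-finger array three, and so on, matching the structures recited above. Real DNA recognition motifs would be chosen per target site and substituted for the placeholders.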

In another aspect, the present disclosure provides fusion proteins comprising any of the zinc finger domain-containing proteins disclosed herein, and an effector protein. In some embodiments, the effector protein comprises nuclease activity, nickase activity, recombinase activity, deaminase activity, methyltransferase activity, methylase activity, acetylase activity, acetyltransferase activity, transcriptional activation activity, transcriptional repression activity, or polymerase activity. In some embodiments, the effector protein is a nucleic acid editing protein, such as a deaminase (e.g., an adenosine deaminase or a cytidine deaminase). In certain embodiments, the effector protein comprises a double-stranded DNA cytidine deaminase (DddA) domain. The fusion proteins provided herein may, in some embodiments, comprise one or more additional domains such as one or more mitochondrial targeting sequences, one or more nuclear export sequences (e.g., the NES of mitogen-activated protein kinase kinase (MAPKK)), one or more nuclear localization sequences, and/or one or more UGI domains. In some embodiments, the zinc finger domain-containing protein and the effector protein are joined by a linker (e.g., a glycine and serine-rich amino acid linker, optionally wherein the linker is about 13 amino acids in length). In certain embodiments, the fusion proteins comprise the structure NH₂-[MTS]-[FLAG tag]-[NES]-[NES]-[first zinc finger domain]-[second zinc finger domain]-[third zinc finger domain]-[optional fourth zinc finger domain]-[optional fifth zinc finger domain]-[optional sixth zinc finger domain]-[linker]-[split DddA]-[UGI]-COOH or NH₂-[MTS]-[FLAG tag]-[NES]-[NES]-[split DddA]-[linker]-[first zinc finger domain]-[second zinc finger domain]-[third zinc finger domain]-[optional fourth zinc finger domain]-[optional fifth zinc finger domain]-[optional sixth zinc finger domain]-[UGI]-COOH.
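
The two domain orders recited above differ only in whether the split DddA and linker follow or precede the zinc finger domains. As a minimal sketch, the following lists the domains from the N-terminus to the C-terminus for either arrangement. The function name `fusion_layout` and its parameters are hypothetical and are used here only to make the ordering explicit; the sketch handles labels only and does not represent amino acid sequences.

```python
from typing import List


def fusion_layout(n_zinc_fingers: int, ddda_first: bool = False) -> List[str]:
    """Return domain labels in N-to-C order for a fusion protein.

    ddda_first=False places [linker]-[split DddA] after the zinc fingers;
    ddda_first=True places [split DddA]-[linker] before them.
    """
    if not 3 <= n_zinc_fingers <= 6:
        raise ValueError("three required and up to three optional zinc finger domains")
    leader = ["MTS", "FLAG tag", "NES", "NES"]
    fingers = [f"zinc finger domain {i}" for i in range(1, n_zinc_fingers + 1)]
    if ddda_first:
        body = ["split DddA", "linker"] + fingers
    else:
        body = fingers + ["linker", "split DddA"]
    # UGI is C-terminal in both arrangements.
    return leader + body + ["UGI"]


print("NH2-" + "-".join(f"[{d}]" for d in fusion_layout(3)) + "-COOH")
print("NH2-" + "-".join(f"[{d}]" for d in fusion_layout(4, ddda_first=True)) + "-COOH")
```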

In another aspect, the present disclosure provides double-stranded DNA cytidine deaminase (DddA) variants comprising a first fragment comprising an amino acid sequence that is at least 80%, at least 85%, at least 90%, at least 95%, or at least 99% identical to the amino acid sequence of SEQ ID NO: 139, and a second fragment comprising an amino acid sequence that is at least 80%, at least 85%, at least 90%, at least 95%, or at least 99% identical to the amino acid sequence of SEQ ID NO: 283, wherein the first fragment comprises one or more amino acid substitutions, truncations, or extensions relative to the amino acid sequence of SEQ ID NO: 139, and/or wherein the second fragment comprises one or more amino acid substitutions, truncations, or extensions relative to the amino acid sequence of SEQ ID NO: 283. The DddA variants provided by the present disclosure may comprise one or more modifications relative to a wild type DddA sequence including, but not limited to, one or more point mutations, and N- and/or C-terminal amino acid truncations and/or extensions.

In some embodiments, the first fragment of a DddA variant comprises one or more amino acid substitutions relative to the amino acid sequence of SEQ ID NO: 139. In some embodiments, the first fragment of a DddA variant comprises an amino acid sequence of any one of SEQ ID NOs: 140-252, or an amino acid sequence at least 80%, at least 85%, at least 90%, at least 95%, or at least 99% identical to the amino acid sequence of any one of SEQ ID NOs: 140-252. In some embodiments, the first fragment of a DddA variant comprises an amino acid substitution at position N18. In certain embodiments, the amino acid substitution is an N18K substitution. In some embodiments, the first fragment of a DddA variant comprises an amino acid substitution at position P25. In certain embodiments, the amino acid substitution is a P25K substitution. In certain embodiments, the amino acid substitution is a P25A substitution.

In some embodiments, the first fragment of a DddA variant comprises an N-terminal amino acid truncation. In some embodiments, the first fragment of a DddA variant comprises an N-terminal amino acid truncation of 1-15 amino acids in length. In certain embodiments, the first fragment of a DddA variant comprises the amino acid sequence of any one of SEQ ID NOs: 253-267.

In some embodiments, the first fragment of a DddA variant comprises a C-terminal amino acid truncation. In some embodiments, the first fragment of a DddA variant comprises a C-terminal amino acid truncation of 1-15 amino acids in length. In certain embodiments, the first fragment of a DddA variant comprises the amino acid sequence of any one of SEQ ID NOs: 268-282.
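
A series of N-terminal or C-terminal truncations of 1-15 amino acids, as described above, can be enumerated mechanically from a parent sequence. The sketch below is illustrative only: the parent string is an arbitrary placeholder rather than SEQ ID NO: 139, and the function names are hypothetical. No correspondence between a given truncation length and a particular SEQ ID NO is implied.

```python
from typing import Dict


def n_terminal_truncations(seq: str, max_len: int = 15) -> Dict[int, str]:
    """Map truncation length k to the sequence lacking its first k residues."""
    if max_len >= len(seq):
        raise ValueError("truncation would remove the whole sequence")
    return {k: seq[k:] for k in range(1, max_len + 1)}


def c_terminal_truncations(seq: str, max_len: int = 15) -> Dict[int, str]:
    """Map truncation length k to the sequence lacking its last k residues."""
    if max_len >= len(seq):
        raise ValueError("truncation would remove the whole sequence")
    return {k: seq[:-k] for k in range(1, max_len + 1)}


# Arbitrary 40-residue placeholder (NOT the sequence of SEQ ID NO: 139).
placeholder = "ACDEFGHIKLMNPQRSTVWY" * 2

n_series = n_terminal_truncations(placeholder)
c_series = c_terminal_truncations(placeholder)

for k in (1, 15):
    print(k, n_series[k], c_series[k])
```

Each call yields fifteen variants, one per truncation length from 1 to 15.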

In some embodiments, the second fragment of a DddA variant comprises a C-terminal amino acid truncation. In some embodiments, the second fragment of a DddA variant comprises a C-terminal amino acid truncation of 1-10 amino acids in length. In certain embodiments, the second fragment of a DddA variant comprises a C-terminal amino acid truncation of 3 amino acids in length. In certain embodiments, the second fragment of a DddA variant comprises the amino acid sequence of any one of SEQ ID NOs: 284-293.

In some embodiments, the second fragment of a DddA variant comprises a C-terminal amino acid extension. In some embodiments, the second fragment of a DddA variant comprises a C-terminal amino acid extension of 1-15 amino acids in length. In certain embodiments, the second fragment of a DddA variant comprises the amino acid sequence of any one of SEQ ID NOs: 294-308.

In some embodiments, a DddA variant further comprises a sequence of charged amino acid residues (e.g., of the amino acid sequence of any one of SEQ ID NOs: 309-334) to weaken the binding affinity of the first fragment and the second fragment of the DddA variant to one another.

In some embodiments, a DddA variant further comprises a catalytically dead second DddA fragment fused to the first DddA fragment. In some embodiments, the catalytically dead second DddA fragment comprises the amino acid sequence of SEQ ID NO: 335, or an amino acid sequence that is at least 80%, at least 85%, at least 90%, at least 95%, or at least 99% identical to the amino acid sequence of SEQ ID NO: 335.

In certain embodiments, the present disclosure provides a DddA variant comprising a first fragment that comprises amino acid substitutions at positions N18 (e.g., an N18K substitution) and P25 (e.g., a P25A or P25K substitution), and a second fragment that comprises a C-terminal amino acid truncation of 3 amino acids in length.
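
The combination described above (substitutions at N18 and P25 in the first fragment, together with a 3-amino acid C-terminal truncation of the second fragment) can be illustrated with a short sketch. The fragment strings below are arbitrary placeholders, not SEQ ID NO: 139 or SEQ ID NO: 283, and the helper name `substitute` is hypothetical. The sketch also assumes that position numbers are 1-based indices into the string supplied; the numbering scheme used for the actual fragments should be confirmed against the sequence listing.

```python
import re


def substitute(seq: str, mutation: str) -> str:
    """Apply a point substitution written as e.g. 'N18K' (1-based position).

    Raises ValueError if the residue at that position is not the expected one.
    """
    match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", mutation)
    if match is None:
        raise ValueError(f"unrecognized mutation string: {mutation!r}")
    expected, position, replacement = match.group(1), int(match.group(2)), match.group(3)
    if not 1 <= position <= len(seq):
        raise ValueError(f"position {position} is outside the sequence")
    if seq[position - 1] != expected:
        raise ValueError(f"expected {expected} at position {position}, "
                         f"found {seq[position - 1]}")
    return seq[:position - 1] + replacement + seq[position:]


# Placeholder first fragment with N at position 18 and P at position 25.
first_fragment = "G" * 17 + "N" + "G" * 6 + "P" + "G" * 5
# Placeholder second fragment; its last three residues will be removed.
second_fragment = "A" * 10 + "CDE"

variant_first = substitute(substitute(first_fragment, "N18K"), "P25A")
variant_second = second_fragment[:-3]  # 3-amino acid C-terminal truncation

print(variant_first)
print(variant_second)
```

Checking the expected wild-type residue before substituting guards against applying a mutation under the wrong numbering scheme.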

In another aspect, the present disclosure provides fusion proteins comprising a programmable DNA binding protein (pDNAbp) and a first or second fragment of any of the DddA variants provided herein. In some embodiments, the programmable DNA binding protein is a nucleic acid-programmable DNA binding protein (napDNAbp), e.g., a Cas9 protein (including Cas9 nickases and nuclease-inactive Cas9 proteins). In some embodiments, the napDNAbp is selected from the group consisting of Cas9, Cas12e, Cas12d, Cas12a, Cas12b1, Cas13a, Cas12c, and Argonaute, and optionally has a nickase activity. In some embodiments, the programmable DNA binding protein is a zinc finger protein, such as any of the zinc finger domain-containing proteins disclosed herein. In some embodiments, the programmable DNA binding protein is a TALE protein. The fusion proteins provided herein may, in certain embodiments, comprise one or more additional domains such as one or more mitochondrial targeting sequences, one or more nuclear export sequences (e.g., the NES of mitogen-activated protein kinase kinase (MAPKK)), one or more nuclear localization sequences, and/or one or more UGI domains. In some embodiments, the pDNAbp and the first or second fragment of the DddA variant are joined by a linker (e.g., a glycine and serine-rich amino acid linker, optionally wherein the linker is about 13 amino acids in length). In certain embodiments, the fusion proteins comprise the structure NH₂-[MTS]-[FLAG tag]-[NES]-[NES]-[first zinc finger domain]-[second zinc finger domain]-[third zinc finger domain]-[optional fourth zinc finger domain]-[optional fifth zinc finger domain]-[optional sixth zinc finger domain]-[linker]-[split DddA]-[UGI]-COOH or NH₂-[MTS]-[FLAG tag]-[NES]-[NES]-[split DddA]-[linker]-[first zinc finger domain]-[second zinc finger domain]-[third zinc finger domain]-[optional fourth zinc finger domain]-[optional fifth zinc finger domain]-[optional sixth zinc finger domain]-[UGI]-COOH.

In another aspect, the present disclosure provides fusion proteins comprising any of the zinc finger domain-containing proteins provided herein and the first or second fragment of any of the DddA variants provided herein.

In another aspect, the present disclosure provides methods for editing a target nucleic acid molecule comprising contacting the target nucleic acid molecule with any of the fusion proteins disclosed herein. The target nucleic acid molecule may comprise, for example, nuclear DNA or mitochondrial DNA. In some embodiments, the contacting is performed in vitro. In some embodiments, the contacting is performed in vivo (e.g., in a subject). In some embodiments, the contacting is performed in a subject that has been diagnosed with a disease or disorder. In some embodiments, the target sequence comprises a genomic sequence associated with a disease or disorder. For example, the target sequence may comprise a point mutation associated with a disease or disorder, such as a T→C point mutation associated with a disease or disorder or an A→G point mutation associated with a disease or disorder. In some embodiments, the step of editing the target nucleic acid results in correction of the point mutation. In some embodiments, the target nucleic acid comprises MT-TK, Nd1, HBB, or MT-TL1. In certain embodiments, the fusion protein used in the methods provided herein comprises the architecture of any of the fusion proteins provided in Table 7, Table 8, and Table 31.

In another aspect, the present disclosure provides polynucleotides encoding any of the zinc finger domain-containing proteins, DddA variants, or fusion proteins provided herein. In another aspect, the present disclosure provides vectors comprising any of the polynucleotides provided herein.

In another aspect, the present disclosure provides cells comprising any of the zinc finger domain-containing proteins, DddA variants, fusion proteins, polynucleotides, or vectors provided herein.

In another aspect, the present disclosure provides kits comprising any of the zinc finger domain-containing proteins, DddA variants, fusion proteins, polynucleotides, vectors, or cells provided herein.

In another aspect, the present disclosure provides pharmaceutical compositions comprising any of the zinc finger domain-containing proteins, DddA variants, fusion proteins, polynucleotides, or vectors provided herein, and a pharmaceutically acceptable excipient.

In another aspect, the present disclosure provides AAVs comprising any of the fusion proteins, polynucleotides, or vectors provided herein.

In some embodiments, any of the zinc finger domain-containing proteins, DddA variants, fusion proteins, polynucleotides, vectors, pharmaceutical compositions, and AAVs provided herein may be for use in medicine. In some embodiments, the present disclosure provides for the use of any of the zinc finger domain-containing proteins, DddA variants, fusion proteins, polynucleotides, vectors, pharmaceutical compositions, and AAVs disclosed herein in the manufacture of a medicament for the treatment of a disease or disorder.

It should be appreciated that the foregoing concepts, and additional concepts discussed below, may be arranged in any suitable combination, as the present disclosure is not limited in this respect. Further, other advantages and novel features of the present disclosure will become apparent from the following detailed description of various non-limiting embodiments when considered in conjunction with the accompanying figures.

BRIEF DESCRIPTION OF THE DRAWINGS

The following drawings form part of the present specification and are included to further demonstrate certain aspects of the present disclosure, which can be better understood by reference to one or more of these drawings in combination with the detailed description of specific embodiments presented herein.

FIGS. 1A-1E: Architectural improvements increase zinc finger double-stranded DNA deaminase cytosine base editor (ZF-DdCBE) editing activity. A schematic of evolution of DddA via PACE is shown in FIG. 1C.

FIG. 2: Schematic of C-terminal ZF-DdCBE architecture.

FIG. 3: Schematic of N- or C-terminal ZF-DdCBE architecture.

FIGS. 4A-4E: Canonical zinc finger scaffolds. Typical consensus sequences for a 3ZF array (FIG. 4A), a 4ZF array (FIG. 4B), a 5ZF array (FIG. 4C), and a 6ZF array (FIG. 4D) are shown. FIG. 4E provides exemplary sequences of the zinc finger proteins shown in FIGS. 4A-4D comprising different variable DNA-binding residues.

FIGS. 5A-5C: Testing of permutations of β-motif, α-motif, and linker motif combinations to find improved ZF scaffolds. X1 represents a single 1ZF protein.

FIGS. 6A-6D: Improvements of variant X1 hold across different ZF array lengths and different sites.

FIG. 7: Schematic representing workflow for finding further improvements for optimized ZF scaffolds.

FIG. 8: Data from searching the human proteome for ZF sequences.

FIGS. 9A-9B: Identification of linker motif consensus sequences.

FIG. 10: Percent C to T editing efficiency for various diverse linker motifs tested to improve ZF activity.

FIG. 11: Percent C to T editing for top linker motifs.

FIGS. 12A-12B: Identification of α-motif consensus sequences.

FIG. 13: Percent C to T editing efficiency for various diverse α-motifs tested to improve ZF activity.

FIG. 14: Percent C to T editing for top α-motifs.

FIGS. 15A-15B: Identification of β-motif consensus sequences.

FIGS. 16A-16D: Percent C to T editing efficiency for various diverse β-motifs tested to improve ZF activity.

FIG. 17: Percent C to T editing for top β-motifs.

FIG. 18: Schematic showing workflow for combining improvements in β-motifs, α-motifs, and linker motifs to produce optimized ZF scaffolds.

FIG. 19: TALE-DdCBEs exhibit minimal off-target editing.

FIG. 20: Amplicon-wide sequencing reveals off-target editing by ZF-DdCBEs.

FIG. 21: Average amplicon-wide percent C to T or G to A editing shows that off-target editing is caused by DddA.

FIG. 22: Architectural differences underlie the discrepancy in DddA off-target editing.

FIGS. 23A-23C: Off-target editing depends on the interaction strength between split deaminase halves.

FIG. 24: Schematic showing tuning of the interaction strength between split deaminase halves.

FIG. 25: Structure of a split double-stranded DNA deaminase, split at amino acid position G1397. Fragments G1397N and G1397C are shown.

FIG. 26: Structures of truncation options for split DddA.

FIG. 27: Percent on-target activity for various N-terminal truncations of DddA-C and C-terminal truncations of DddA-N.

FIG. 28: Percent off-target activity for various N-terminal truncations of DddA-C and C-terminal truncations of DddA-N.

FIG. 29: Percent on-target activity for various C-terminal truncations of DddA-C and C-terminal truncations of DddA-N.

FIG. 30: Percent off-target activity for various C-terminal truncations of DddA-C and C-terminal truncations of DddA-N.

FIG. 31: Maximizing on-target editing and minimizing off-target editing of DddA.

FIG. 32: Minimizing off-target editing of DddA using truncations.

FIG. 33: Alanine scanning mutagenesis of DddA.

FIG. 34: Lysine scanning mutagenesis of DddA.

FIG. 35: Aspartate scanning mutagenesis of DddA.

FIG. 36: Glutamate scanning mutagenesis of DddA.

FIG. 37: Comparison between positively charged mutations (lysine, arginine, and histidine).

FIGS. 38A-38B: Additive combination of single mutations in DddA (FIG. 38A) and single+double mutations in DddA (FIG. 38B). Percent on-target editing and percent off-target editing are shown.

FIG. 39: Effect of combining mutations and truncations on DddA activity. Percent on-target editing and percent off-target editing are shown.

FIGS. 40A-40B: Capping of DddA with a dead deaminase. A schematic of a capped deaminase is provided (FIG. 40A), and percent on-target editing and average amplicon-wide off-target editing for a dead DddA (dDddA) capped DddA are shown.

FIG. 41: Schematic showing the introduction of charged residues into the flexible linker upstream of DddA.

FIGS. 42A-42C: Percent on-target editing and average amplicon-wide off-target editing for DddA variants incorporating positively charged residues into the upstream flexible linker. Data for incorporation of arginine residues (FIG. 42A), lysine residues (FIG. 42B), and histidine residues (FIG. 42C) are shown.

FIGS. 43A-43B: Percent on-target editing and average amplicon-wide off-target editing for DddA variants incorporating negatively charged residues into the upstream flexible linker. Data for incorporation of aspartate residues (FIG. 43A) and glutamate residues (FIG. 43B) are shown.

FIGS. 44A-44D: Data showing on-target editing and off-target editing demonstrate that orthogonal approaches for improving DddA activity can be combined additively.

FIGS. 45A-45B: Specificity-optimized ZF-DdCBEs reduce off-target editing.

FIGS. 46A-46B: ZF β-motif sequences. FIG. 46A shows the most commonly-used sequences in canonical ZF scaffolds. FIG. 46B shows additional newly defined ZF scaffold sequences.

FIGS. 47A-47D: Example ZF proteins comprising one of the newly defined ZF scaffold sequences from FIG. 46B (X1). A 3ZF array (FIG. 47A), a 4ZF array (FIG. 47B), a 5ZF array (FIG. 47C), and a 6ZF array (FIG. 47D) are shown.

FIGS. 48A-48H: Improved ZF scaffolds show increased editing activity at a panel of different target sites.

FIG. 49: ZF scaffolds for additional β-motif sequences.

FIGS. 50A-50C: Percent on-target editing and average off-target editing for specificity-optimized DddA mutants. In FIGS. 50A and 50B, the three rightmost dots represent canonical DddA scaffolds, and gray dots represent a selection of the most promising DddA mutants based on observed activity.

FIG. 51: Mutations and sequences of improved DddA variants.

FIGS. 52A-52E: Optimizing ZF-DdCBEs increases base editing efficiency in mitochondria. FIG. 52A: Architectures of optimized ZF-DdCBEs showing progression from v1 to v8. The components are a mitochondrial targeting signal, FLAG tag, nuclear export signal(s), ZF array with either canonical ZF scaffold (dark grey) or optimized ZF scaffold (light grey), Gly/Ser-rich flexible linker, split DddA deaminase (with or without activity-enhancing mutations and specificity-enhancing mutations) and UGI. FIGS. 52B-52C: Mitochondrial DNA base editing efficiencies of HEK293T cells treated with (FIG. 52B) six optimized ZF-DdCBE pairs used to establish architectural improvements or (FIG. 52C) seven additional optimized ZF-DdCBE pairs.

FIGS. 52D-52E: Comparison of mitochondrial DNA base editing efficiencies of HEK293T cells treated with either ZFD or optimized ZF-DdCBE pairs at genomic target sites chosen by (FIG. 52D) Lim et al.²⁵, or this study (FIG. 52E). For FIGS. 52B-52E, values and errors reflect the mean±s.d. of n=3 independent biological replicates. The editing efficiencies shown are for the most efficiently edited C•G within the spacing region.

FIGS. 53A-53L: High-specificity ZF-DdCBE variants reduce mitochondrial off-target editing. FIG. 53A: Mitochondrial DNA base editing efficiencies within amplicon ND4 of HEK293T cells treated with ND4-DdCBE. FIG. 53B: Mitochondrial DNA base editing efficiencies within amplicon ATP8 of HEK293T cells treated with v7 ZF-DdCBE pair R8-3i-ATP8+4-3i-ATP8. FIG. 53C: Off-target editing efficiencies within mitochondrial off-target amplicon ND5.1 of HEK293T cells treated with ND4-DdCBE, v7 ZF-DdCBE pair R8-3i-ATP8+4-3i-ATP8, or individual components of the v7 ZF-DdCBE architecture. FIGS. 53D-53L: On-target and average off-target editing efficiencies within amplicon ATP8 of HEK293T cells treated with canonical v7 ZF-DdCBE pair R8-3i-ATP8+4-3i-ATP8 (indicated with an arrow) or variants containing (FIG. 53D) DddA^N and DddA^C truncations, (FIG. 53E) Ala, (FIG. 53F) Lys, (FIG. 53G) Asp, or (FIG. 53H) Glu point mutations within DddA^C, (FIG. 53I) Asp or (FIG. 53J) Glu residues upstream or downstream of DddA^N and DddA^C, (FIG. 53K) fused catalytically inactivated DddA^N, or (FIG. 53L) combinations thereof. High-specificity variants HS1 to HS5 are labeled accordingly. For FIGS. 53A-53B and FIGS. 53D-53L, values reflect the mean of n=3 independent biological replicates. For FIG. 53C, values and errors reflect the mean±s.d. of n=3 independent biological replicates. For FIGS. 53D-53L, the editing efficiencies shown are for the most efficiently edited C•G within the spacing region.

FIGS. 54A-54E: ZF-DdCBEs install pathogenic mutations in cultured cells in vitro. FIG. 54A: The m.8340G>A mutation in human MT-TK disrupts the T-arm of mt-tRNA^Lys. FIG. 54B: Mitochondrial DNA base editing efficiencies of HEK293T cells treated with an optimized ZF-DdCBE pair designed to install m.8340G>A. FIG. 54C: The m.7743G>A mutation in mouse Mt-tk disrupts the T-arm of mt-tRNA^Lys. FIG. 54D: Mitochondrial DNA base editing efficiencies of C2C12 cells treated with an optimized ZF-DdCBE pair designed to install m.7743G>A. FIG. 54E: Mitochondrial DNA base editing efficiencies of C2C12 cells treated with an optimized ZF-DdCBE pair designed to install m.3177G>A. For FIGS. 54B, 54D, and 54E, values and errors reflect the mean±s.d. of n=3 independent biological replicates. For each site the DNA spacing region, split DddA orientation, ZF array lengths, and ZF-targeted DNA strands (LT=left top; LB=left bottom; RB=right bottom) are shown, and the cytosine with the highest editing efficiency is colored in light gray.

FIGS. 55A-55B: ZF-DdCBEs enable base editing of nuclear DNA. FIG. 55A: Nuclear DNA base editing efficiencies of HEK293T cells treated with five 3ZF+3ZF nuclear-targeted ZF-DdCBE pairs, or ZF-DdCBE variants with extended ZF arrays. The editing efficiencies shown are for the most efficiently edited C•G within the spacing region. FIG. 55B: Nuclear DNA base editing efficiencies of HEK293T-HBB cells treated with an optimized ZF-DdCBE pair designed to correct the HBB-28(A>G) mutation. The DNA spacing region, split DddA orientation, ZF array lengths, and ZF-targeted DNA strands (LT=left top; RB=right bottom) are shown, and the pathogenic cytosine is colored in light gray. For FIGS. 55A-55B, values and errors reflect the mean±s.d. of n=3 independent biological replicates.

FIGS. 56A-56F: In vivo base editing of pathogenic sites in mtDNA. FIG. 56A: Mitochondrial DNA base editing efficiencies installing m.7743G>A of tissue samples from mice treated with buffer, dAAV-Mt-tk, or AAV-Mt-tk. FIG. 56B: Mitochondrial DNA base editing efficiencies of tissue samples from AAV-Mt-tk-treated mice. FIG. 56C: Off-target editing efficiencies within representative mitochondrial off-target amplicon OT8 of tissue samples from mice treated with buffer, dAAV-Mt-tk, or AAV-Mt-tk. FIG. 56D: Mitochondrial DNA base editing efficiencies installing m.3177G>A of tissue samples from mice treated with buffer or AAV-Nd1. FIG. 56E: Mitochondrial DNA base editing efficiencies of tissue samples from AAV-Nd1-treated mice. FIG. 56F: Off-target editing efficiencies within representative mitochondrial off-target amplicon OT7 of tissue samples from mice treated with buffer, or AAV-Nd1. For FIGS. 56A-56B, values and errors reflect the mean±s.d. of n=4, 4 and 7 for mice treated with buffer, AAV-Mt-tk, or dAAV-Mt-tk, respectively. For FIG. 56C, values reflect the mean of n=4, 4 and 7 for mice treated with buffer, AAV-Mt-tk, or dAAV-Mt-tk, respectively. For FIGS. 56D-56E, values and errors reflect the mean±s.d. of n=4 and 7 for mice treated with buffer or AAV-Nd1, respectively. For FIG. 56F, values reflect the mean of n=4 and 7 for mice treated with buffer or AAV-Nd1, respectively.

FIG. 57: All-protein base editor size comparison. The area of each hexagon is proportional to the length of DNA sequence required to encode that protein. The total AAV packaging capacity of ~4.7 kb is represented proportionally in brown. The total size of DNA encoding a ZF-DdCBE is well below the AAV packaging capacity limit, whereas the total size of DNA encoding a TALE-DdCBE exceeds the packaging limit of a single AAV capsid. The ZF and TALE hexagons represent a six-zinc finger (6ZF) array and an 18-repeat TALE array, respectively.

FIGS. 58A-58E: ZF-DdCBE architecture optimization. FIG. 58A: Initial mitochondrial ZF-DdCBE pairs used to establish v1 to v5 architectural improvements. For each site the DNA spacing region, split DddA orientation, ZF array lengths, and ZF-targeted DNA strands (LB=left bottom, RT=right top) are shown, and the cytosine with the highest editing efficiency is colored in light gray. ZF-DdCBE naming convention follows A+B where A and B specify the left and right ZF, respectively. Nucleotide numbering starts with the first 5′-nucleotide in the spacing region designated position 1. For R8-ATP8+4-ATP8, nucleotide C5 has the highest editing efficiency. FIGS. 58B-58E: Mitochondrial DNA base editing efficiencies of HEK293T cells treated with four ZF-DdCBE pairs testing the effects of: (FIG. 58B) replacing the two-amino acid linker in architecture v1 with a 7- or 13-amino acid Gly/Ser-rich flexible linker, or a 32-amino acid XTEN linker; (FIG. 58C), inserting a FLAG or HA tag immediately downstream of the MTS in architecture v2; (FIG. 58D), adding an additional NES from HIV-1 Rev (NES1), MAPKK (NES2), or MVM NS2 (NES3) to architecture v3, either downstream of the existing internal NES or at the C-terminus of the protein; or (FIG. 58E), moving the location of UGI within the fusion protein to a position N-terminal of the 5ZF array, appending a second copy of UGI to the C-terminus (2×UGI), or expressing a separate mitochondrially targeted UGI in trans using a self-cleaving P2A peptide (with (P2A UGI only) or without (+P2A UGI) removing the C-terminally fused UGI) compared to architecture v3. Values and errors reflect the mean±s.d. of n=3 independent biological replicates. The editing efficiencies shown are for the most efficiently edited C•G within the spacing region.
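The naming and numbering conventions stated for FIG. 58A can be illustrated with a minimal sketch. The function names and the toy spacing-region sequence below are hypothetical and are used only for illustration; the toy sequence is not the actual spacing region of any pair described herein.

```python
def split_pair_name(pair_name):
    """Split an 'A+B' pair name into its left (A) and right (B) ZF names."""
    left, sep, right = pair_name.partition("+")
    if not sep or not left or not right:
        raise ValueError(f"expected a name of the form 'A+B', got {pair_name!r}")
    return left, right


def label_position(spacing_region, position):
    """Label a nucleotide using 1-based numbering from the 5' end of the spacing region.

    Position 1 is the first 5'-nucleotide of the spacing region.
    """
    if not 1 <= position <= len(spacing_region):
        raise ValueError("position lies outside the spacing region")
    return f"{spacing_region[position - 1]}{position}"


# Hypothetical toy spacing region (5'->3'); NOT a sequence from the disclosure.
TOY_SPACING_REGION = "TAATCGTA"

left_zf, right_zf = split_pair_name("R8-ATP8+4-ATP8")
label = label_position(TOY_SPACING_REGION, 5)
```

Under these assumptions, the fifth nucleotide of the toy sequence is labeled in the same style as "C5" in the figure description.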

FIGS. 59A-59I: ZF array length and positioning influences ZF-DdCBE editing efficiency. FIG. 59A: Truncation of 5ZF arrays to create a set of two 4ZFs and a set of three 3ZFs by removing either one or two individual ZFs, respectively, creates four resulting 4ZF+4ZF combinations and nine 3ZF+3ZF combinations derived from the original 5ZF+5ZF ZF-DdCBE pair. FIGS. 59B-59I: Mitochondrial DNA base editing efficiencies of HEK293T cells treated with truncated v5 ZF-DdCBE pairs derived from (FIG. 59B and FIG. 59F) R8-ATP8+4-ATP8, (FIG. 59C and FIG. 59G) R8-ATP8+10-ATP8, (FIG. 59D and FIG. 59H) 9-ND51+R13-ND51, or (FIG. 59E and FIG. 59I) 12-ND51+R13-ND51. For FIGS. 59B-59E, values and errors reflect the mean±s.d. of n=3 independent biological replicates. The editing efficiencies shown are for the most efficiently edited C•G within the spacing region.
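The combination counts given for FIG. 59A can be reproduced with a short sketch. It assumes that each truncated array is a contiguous run of fingers from the parent 5ZF array (i.e., fingers are removed only from the ends), which is consistent with the stated counts of two 4ZFs and three 3ZFs per 5ZF array; the finger labels and function names are hypothetical.

```python
from itertools import product


def contiguous_subarrays(fingers, length):
    """Return every contiguous run of `length` fingers from a parent array."""
    return [tuple(fingers[i:i + length]) for i in range(len(fingers) - length + 1)]


def truncated_pairs(left_fingers, right_fingers, length):
    """Pair every truncated left array with every truncated right array."""
    return list(product(contiguous_subarrays(left_fingers, length),
                        contiguous_subarrays(right_fingers, length)))


# Hypothetical finger labels for a 5ZF+5ZF pair.
LEFT_5ZF = ["L1", "L2", "L3", "L4", "L5"]
RIGHT_5ZF = ["R1", "R2", "R3", "R4", "R5"]

pairs_4zf = truncated_pairs(LEFT_5ZF, RIGHT_5ZF, 4)  # 2 x 2 combinations
pairs_3zf = truncated_pairs(LEFT_5ZF, RIGHT_5ZF, 3)  # 3 x 3 combinations
```

Each 5ZF array yields two 4ZF and three 3ZF truncations, so pairing left and right truncations gives the four 4ZF+4ZF and nine 3ZF+3ZF combinations described for FIG. 59A.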

FIGS. 60A-60E: Design of ZF-DdCBEs at (GNN)_n-rich sites. Design of 3ZF, 4ZF, and 5ZF arrays at (FIG. 60A) ND1 (GNN)_n-rich site 1, (FIG. 60B) COX1 (GNN)_n-rich site 1, (FIG. 60C) COX1 (GNN)_n-rich site 2, (FIG. 60D) COX2 (GNN)_n-rich site 1, and (FIG. 60E) ND6 (GNN)_n-rich site 1. (GNN)_n sequences are underlined, and ZF-targeted DNA sequences are indicated by thick black lines vertically above or below the corresponding DNA sequence.

FIG. 61: Extension of ZF array length improves ZF-DdCBE editing efficiency, but including extended linkers is detrimental. Mitochondrial DNA base editing efficiencies of HEK293T cells treated with 3ZF+3ZF, 4ZF+4ZF, and 5ZF+5ZF v5 ZF-DdCBE pairs targeting ND1 (GNN)_n-rich site 1, COX1 (GNN)_n-rich sites 1 and 2, COX2 (GNN)_n-rich site 1, and ND6 (GNN)_n-rich site 1. To generate the ZF array length series, 3ZF arrays were extended outwards away from the spacing region to create longer 4ZF or 5ZF arrays, all of which share the same split DddA positioning and therefore maintain a fixed spacing region. 4ZF-Ext+4ZF-Ext and 5ZF-Ext+5ZF-Ext reflect ZF-DdCBE pairs in which an extended linker (TGSEKP) was incorporated into each ZF array following ZF3 (the third ZF repeat) in 4ZF and 5ZF arrays, respectively. Values shown reflect the fold-change in editing efficiency for the most efficiently edited C•G within the spacing region for n=3 independent biological replicates, compared to the corresponding 3ZF+3ZF pair. A single data point for 4ZF+4ZF at ND6 (GNN)_n-rich site 1 at a value of 16.0-fold change is omitted from the axes range for clarity.

FIGS. 62A-62K: Defining new ZF scaffolds improves ZF-DdCBE editing efficiency. FIGS. 62A-62D: Secondary structure and amino acid sequence of canonical (FIG. 62A) 3ZF, (FIG. 62B) 4ZF, (FIG. 62C) 5ZF, and (FIG. 62D) 6ZF arrays. FIG. 62E: Amino acid sequences of ZF scaffolds X1 to X8. Different beta-motif, alpha-motif, and linker-motif sequences are colored in grey. FIGS. 62F-62K: Mitochondrial DNA base editing efficiencies of HEK293T cells treated with v5 ZF-DdCBE pairs (FIG. 62F) R8-ATP8+4-ATP8, (FIG. 62G) R8-ATP8+10-ATP8, (FIG. 62H) R8-3i-ATP8+4-3i-ATP8, (FIG. 62I) R8-3i-ATP8+10-3ii-ATP8, (FIG. 62J) 9-ND51+R13-ND51, or (FIG. 62K) 12-ND51+R13-ND51 with either canonical ZF scaffold or ZF scaffolds X1 to X8. For FIGS. 62F-62K, values and errors reflect the mean±s.d. of n=3 independent biological replicates. The editing efficiencies shown are for the most efficiently edited C•G within the spacing region.

FIGS. 63A-63F: Defining new ZF scaffolds derived from the human proteome. FIGS. 63A, 63C, and 63E: Amino acid frequencies at each sequence position from (FIG. 63A) 3,356 unique beta-motifs, (FIG. 63C) 625 unique alpha-motifs, and (FIG. 63E) 549 unique linker motifs in the human proteome. FIGS. 63B, 63D, and 63F: Amino acid frequencies at each sequence position displayed as a sequence logo (top) used to define (FIG. 63B) consensus beta-motif, (FIG. 63D) consensus alpha-motif, and (FIG. 63F) consensus linker motif sequences by applying a 10% frequency cut-off at each sequence position (bottom).
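The frequency cut-off described for FIGS. 63B, 63D, and 63F can be sketched as follows. This is a minimal illustration, not the procedure actually used: it assumes that all motifs in a set have equal length, that frequencies are computed independently at each position, and that a residue is retained when its frequency is at least 10% (whether the cut-off is inclusive is an assumption). The function names and toy motifs are hypothetical.

```python
from collections import Counter


def position_frequencies(motifs):
    """Per-position amino acid frequencies for a set of equal-length motifs."""
    if not motifs:
        raise ValueError("at least one motif is required")
    length = len(motifs[0])
    if any(len(m) != length for m in motifs):
        raise ValueError("all motifs must have the same length")
    total = len(motifs)
    frequencies = []
    for pos in range(length):
        counts = Counter(m[pos] for m in motifs)
        frequencies.append({aa: n / total for aa, n in counts.items()})
    return frequencies


def consensus_residues(motifs, cutoff=0.10):
    """At each position, keep the residues whose frequency meets the cut-off."""
    return [sorted(aa for aa, freq in pos_freqs.items() if freq >= cutoff)
            for pos_freqs in position_frequencies(motifs)]


# Hypothetical two-residue toy motifs; NOT motifs from the human proteome.
TOY_MOTIFS = ["AK"] * 18 + ["GK", "SR"]
allowed = consensus_residues(TOY_MOTIFS)
```

In the toy set, residues seen in only 1 of 20 motifs (5%) fall below the cut-off and are dropped, leaving a single allowed residue at each position.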

FIGS. 64A-64I: Identifying new ZF scaffolds derived from the human proteome that improve ZF-DdCBE editing efficiency. FIGS. 64A-64F: Mitochondrial DNA base editing efficiencies of HEK293T cells treated with v5 ZF-DdCBE pair R8-ATP8+4-ATP8 with either canonical or X1 ZF scaffolds, or ZF scaffolds containing (FIG. 64A) consensus beta-motifs YB1 to YB24, (FIG. 64B) YB25 to YB48, (FIG. 64C) YB49 to YB72, (FIG. 64D) YB73 to YB96, (FIG. 64E) consensus alpha-motifs YA1 to YA18, or (FIG. 64F) consensus linker motifs YL1 to YL24. FIGS. 64G-64I: The editing efficiencies of (FIG. 64G) the ten top-performing consensus beta-motifs, (FIG. 64H) four top-performing consensus alpha-motifs, or (FIG. 64I) four top-performing linker motifs. For FIGS. 64A-64I, values and errors reflect the mean±s.d. of n=3 independent biological replicates. The editing efficiencies shown are for the most efficiently edited C•G within the spacing region.

FIGS. 65A-65C: Identifying new ZF scaffolds derived from ZFN268(F1) and Sp1C that improve ZF-DdCBE editing efficiency. FIG. 65A: Amino acid sequences of ZF scaffolds based on ZF scaffold X1 and containing beta-motifs derived from ZFN268(F1) and Sp1C sequences. Amino acid changes are colored in grey. FIGS. 65B-65C: Mitochondrial DNA base editing efficiencies of HEK293T cells treated with (FIG. 65B) v5 ZF-DdCBE pairs R8-3i-ATP8+4-3i-ATP8, or (FIG. 65C) R8-3i-ATP8+10-3ii-ATP8 with either canonical ZF scaffold or ZF scaffolds from KGKS to VSGRS. For FIGS. 65B-65C, values and errors reflect the mean±s.d. of n=3 independent biological replicates. The editing efficiencies shown are for the most efficiently edited C•G within the spacing region.

FIGS. 66A-66F: Optimized ZF scaffolds increase ZF-DdCBE editing efficiency. FIGS. 66A-66F: Mitochondrial DNA base editing efficiencies of HEK293T cells treated with (FIG. 66A) v5 ZF-DdCBE pairs R8-ATP8+4-ATP8, (FIG. 66B) R8-ATP8+10-ATP8, (FIG. 66C) R8-3i-ATP8+4-3i-ATP8, (FIG. 66D) R8-3i-ATP8+10-3ii-ATP8, (FIG. 66E) 9-ND51+R13-ND51, or (FIG. 66F) 12-ND51+R13-ND51 with either canonical or optimized ZF scaffolds. For FIG. 66A and FIGS. 66C-66F, values and errors reflect the mean±s.d. of n=2 independent biological replicates. For FIG. 66B, values and errors reflect the mean±s.d. of n=3 independent biological replicates. The editing efficiencies shown are for the most efficiently edited C•G within the spacing region.

FIGS. 67A-67D: DddA mutations enhance ZF-DdCBE editing efficiency. FIGS. 67A-67D: Mitochondrial DNA base editing efficiencies of HEK293T cells treated with v5 ZF-DdCBE pairs (FIG. 67A) R8-ATP8+4-ATP8, (FIG. 67B) R8-ATP8+10-ATP8, (FIG. 67C) 9-ND51+R13-ND51, or (FIG. 67D) 12-ND51+R13-ND51 containing combinations of mutations in DddA^N and DddA^C. The triple mutant T1380I, E1396K, T1413I is colored in grey. For FIGS. 67A-67D, values and errors reflect the mean±s.d. of n=3 independent biological replicates. The editing efficiencies shown are for the most efficiently edited C•G within the spacing region.

FIGS. 68A-68G: Optimized ZF scaffolds increase ZF-DdCBE editing efficiency. FIGS. 68A-68G: Mitochondrial DNA base editing efficiencies of HEK293T cells treated with v5 ZF-DdCBE pairs (FIG. 68A) G24-R1b+G32-R1b, (FIG. 68B) G22-R13+G24-R13, (FIG. 68C) G32-R6a+G21-R6a, (FIG. 68D) G36-R6c+G212-R6c, (FIG. 68E) G33-V1+G35-V1, (FIG. 68F) G22-V2+G34-V2, or (FIG. 68G) G33-V5+G36-V5 with either canonical or optimized ZF scaffolds. For FIGS. 68A-68G, values and errors reflect the mean±s.d. of n=3 independent biological replicates. The editing efficiencies shown are for the most efficiently edited C•G within the spacing region.

FIG. 69: Identifying ZF scaffolds that support the highest editing efficiency for ZFD-derived ZF-DdCBEs. Mitochondrial DNA base editing efficiencies of HEK293T cells treated with v7 ZF-DdCBE pairs ND1-Left+ND1-Right, ND2-Left+ND2-Right, ND4L-Left+ND4L-Right, ND4-Left+ND4-Right, ND5-Left+ND5-Right, ND52-Left+ND52-Right, COX1-Left+COX1-Right, COX2-Left+COX2-Right, or CYB-Left+CYB-Right with the indicated optimized ZF scaffolds. Values and errors reflect the mean±s.d. of n=3 independent biological replicates. The editing efficiencies shown are for the most efficiently edited C•G within the spacing region.

FIGS. 70A-70B: Time course of TALE-DdCBE and ZF-DdCBE editing efficiencies. Mitochondrial DNA base editing efficiencies of HEK293T cells treated with (FIG. 70A) TALE-DdCBE pair ND4-DdCBE, or (FIG. 70B) v5 ZF-DdCBE pair R8-3i-ATP8+4-3i-ATP8 with the indicated amount of plasmid DNA. Cells were lysed after the indicated time period. For FIGS. 70A-70B, values and errors reflect the mean±s.d. of n=2 independent biological replicates. The editing efficiencies shown are for the most efficiently edited C•G within the spacing region.

FIG. 71: Amino acid sequences immediately upstream of DddA^N and DddA^C influence non-targeted editing activity. Average non-targeted editing efficiencies within amplicon ATP8 of HEK293T cells treated with DddA^N-UGI and DddA^C-UGI preceded by the indicated sequences. Naming convention follows A/B, where A and B correspond to the amino acid sequences immediately upstream of DddA^N and DddA^C, respectively. Values reflect the mean of n=3 independent biological replicates.

FIGS. 72A-72H: DddA truncation reduces ZF-DdCBE off-target editing. FIG. 72A: Crystal structure of DddA (PDB 6U08) complexed with DddI, the natural protein inhibitor of DddA (not shown). DddA^N and DddA^C are colored in light gray and dark gray, respectively, and have N- and C-termini indicated. FIGS. 72B-72D: (FIG. 72B) C-terminal truncation of DddA^N, (FIG. 72C) N-terminal truncation of DddA^C, and (FIG. 72D) C-terminal truncation of DddA^C are shown with residues incrementally removed colored in white. FIGS. 72E-72H: (FIG. 72E and FIG. 72G) On-target and (FIG. 72F and FIG. 72H) average off-target editing efficiencies within amplicon ATP8 of HEK293T cells treated with canonical v7 ZF-DdCBE pair R8-3i-ATP8+4-3i-ATP8 or variants containing DddA^N and DddA^C truncations. For FIGS. 72E-72H, values reflect the mean of n=3 independent biological replicates. The on-target editing efficiencies shown are for the most efficiently edited C•G within the spacing region.

FIGS. 73A-73B: Shifting the position of the canonical G1397 split site within DddA. FIG. 73A: On-target and average off-target editing efficiencies within amplicon ATP8 of HEK293T cells treated with canonical v7 ZF-DdCBE pair R8-3i-ATP8+4-3i-ATP8 (indicated with an arrow) or variants containing C-terminally extended DddA^N and N-terminally truncated DddA^C. FIG. 73B: Mitochondrial DNA base editing efficiencies of HEK293T cells treated with only a single ZF-DdCBE half (R8-3i-ATP8 from ZF-DdCBE pair R8-3i-ATP8+4-3i-ATP8) carrying canonical DddA^N or C-terminally extended DddA^N variants. Naming convention C+X signifies DddA_C+X^N. For FIG. 73A, values reflect the mean of n=3 independent biological replicates. For FIG. 73B, values and errors reflect the mean±s.d. of n=3 independent biological replicates. The editing efficiencies shown are for the most efficiently edited C•G within the spacing region.

FIGS. 74A-74C: Introducing negative charge at the termini of DddA or capping with catalytically inactivated DddA^N. Architectures of canonical ZF-DdCBEs and ZF-DdCBE variants containing a ZF array, Gly/Ser-rich flexible linker, split DddA deaminase, and UGI (N-terminal mitochondrial targeting signal, FLAG tag, and nuclear export signals are not shown). FIG. 74A: ZF-DdCBE variants are shown in which three, six, or nine residues in the 13-amino acid Gly/Ser-rich flexible linker upstream of DddA^N and DddA^C were mutated to either Glu (E) or Asp (D) residues. ZF-DdCBE variants are also shown in which three, six, or nine Glu (E) or Asp (D) residues were inserted into the Gly/Ser-rich flexible linker downstream of DddA^N. FIG. 74B: Off-target editing efficiencies within mitochondrial off-target amplicon ATP8 of HEK293T cells treated with individual components of the v7 ZF-DdCBE architecture, with or without the DddA catalytically inactivating E1347A mutation. FIG. 74C: ZF-DdCBE variants are shown in which dDddA^N was fused downstream of DddA^C using Gly/Ser-rich flexible linkers, either before or after the UGI domain.

FIGS. 75A-75D: Combining approaches to reduce ZF-DdCBE off-target editing. FIGS. 75A-75D: On-target and average off-target editing efficiencies within amplicon ATP8 of HEK293T cells treated with canonical v7 ZF-DdCBE pair R8-3i-ATP8+4-3i-ATP8 (indicated with an arrow) or (FIG. 75A) variants containing one (grey) or two (black) DddA^C point mutations from the following set: [K5A, R6A, G7A, T9A, V14A, P25A, T12K, V14K, N18K, P25K], (FIG. 75B) variants containing one or two DddA^C point mutations from the following set: [K5A, R6A, G7A, T9A, V14A, P25A, T12K, V14K, N18K, P25K], in combination with either DddA^N or DddA_CΔ3^N, (FIG. 75C) variants containing one or two DddA^C point mutations from the following set: [R6A, G7A, T9A, V14A, P25A, T12K, V14K, N18K, P25K], in combination with either DddA^N and DddA_NΔ5^C, or DddA_CΔ3^N and DddA_NΔC^C, (FIG. 75D) variants containing one, two or three changes in total, selected from any of the four approaches of single point mutations, truncations, electrostatic repulsion, and dDddA^N capping. For FIGS. 75A-75D, values reflect the mean of n=3 independent biological replicates. The on-target editing efficiencies shown are for the most efficiently edited C•G within the spacing region.

FIGS. 76A-76G: v8^HS ZF-DdCBE variants reduce off-target editing. (FIGS. 76A-76G) On-target and average off-target editing efficiencies of HEK293T cells treated with v7 (indicated with an arrow), v8, or v8^HS1 to v8^HS5 ZF-DdCBE pairs (FIG. 76A) G24-R1b+G32-R1b, (FIG. 76B) G22-R13+G24-R13, (FIG. 76C) G32-R6a+G21-R6a, (FIG. 76D) G36-R6c+G212-R6c, (FIG. 76E) G33-V1+G35-V1, (FIG. 76F) G22-V2+G34-V2, or (FIG. 76G) G33-V5+G36-V5. For FIGS. 76A-76G, values reflect the mean of n=3 independent biological replicates. The on-target editing efficiencies shown are for the most efficiently edited C•G within the spacing region.

FIGS. 77A-77I: Comparison between v8^HS1 ZF-DdCBEs and ZFDs. FIGS. 77A-77I: On-target and average off-target editing efficiencies of HEK293T cells treated with ZFDs (indicated with an arrow), v7, v8, or v8^HS1 ZF-DdCBE pairs (FIG. 77A) ND1-Left+ND1-Right, (FIG. 77B) ND2-Left+ND2-Right, (FIG. 77C) ND4L-Left+ND4L-Right, (FIG. 77D) ND4-Left+ND4-Right, (FIG. 77E) ND5-Left+ND5-Right, (FIG. 77F) ND52-Left+ND52-Right, (FIG. 77G) COX1-Left+COX1-Right, (FIG. 77H) COX2-Left+COX2-Right, or (FIG. 77I) CYB-Left+CYB-Right. For FIGS. 77A-77G, values reflect the mean of n=3 independent biological replicates. The on-target editing efficiencies shown are for the most efficiently edited C•G within the spacing region.

FIGS. 78A-78C: Optimized ZF-DdCBEs install m.8340G>A in HEK293T cells. FIG. 78A: Design of 3ZF arrays for ZF-DdCBE-mediated installation of m.8340G>A in human MT-TK. ZF-targeted DNA sequences are indicated by thick black lines vertically above or below the corresponding DNA sequence, and the target cytosine is colored light gray. FIG. 78B: Mitochondrial DNA base editing efficiencies of HEK293T cells treated with v7 ZF-DdCBE pairs with the indicated split DddA orientation (DddA^N/DddA^C signifies that the left ZF array is fused to DddA^N and the right ZF array is fused to DddA^C). FIG. 78C: Mitochondrial DNA base editing efficiencies of HEK293T cells treated with 3ZF+3ZF v7^AGKS ZF-DdCBE pair G21-MT-TK+G23-MT-TK or variants with the left and right ZF array extended to 4ZF or 5ZF as indicated. For FIG. 78B and FIG. 78C, values and errors reflect the mean±s.d. of n=3 independent biological replicates. The on-target editing efficiencies shown are for the most efficiently edited C•G within the spacing region.

FIGS. 79A-79G: Optimized ZF-DdCBEs install m.7743G>A in C2C12 cells. FIG. 79A: 3ZF arrays for ZF-DdCBEs designed to install m.7743G>A in mouse Mt-tk. ZF-targeted DNA sequences are indicated by thick black lines vertically above or below the corresponding DNA sequence, and the target cytosine is colored light gray. FIGS. 79B, 79D, and 79F: Mitochondrial DNA base editing efficiencies of C2C12 cells treated with (FIG. 79B) the top 27 performing v7 ZF-DdCBE pairs from the initial 3ZF+3ZF panel designed to install m.7743G>A, (FIG. 79D) the top 12 performing extended v7 ZF-DdCBE pairs designed to install m.7743G>A, (FIG. 79F) the v7 ZF-DdCBE pair LT51-Mt-tk+RB38-Mt-tk with the indicated optimized ZF scaffolds. FIG. 79C: Extension of ZF arrays from 3ZF to 4ZF, 5ZF, or 6ZF (adding additional ZF repeats to the ZF arrays extending away from the spacing region in order to maintain a fixed deaminase positioning) to test the effects of ZF extension on ZF-DdCBE editing efficiency. FIG. 79E: Mitochondrial DNA base editing efficiencies of C2C12 cells plated on either poly-D-lysine- or collagen-coated plates treated with the indicated ZF-DdCBE pairs. FIG. 79G: On-target and average off-target editing efficiencies of C2C12 cells treated with v7 (indicated with an arrow), v8, or v8^HS1 to v8^HS5 ZF-DdCBE pair LT51-Mt-tk+RB38-Mt-tk. For FIGS. 79D-79F, values and errors reflect the mean±s.d. of n=3 independent biological replicates. For FIG. 79G, values reflect the mean of n=3 independent biological replicates. The on-target editing efficiencies shown are for the most efficiently edited C•G within the spacing region. For FIGS. 79D-79E, all ZF-DdCBE pairs use the split DddA orientation DddA^C/DddA^N.

FIGS. 80A-80G: Optimized ZF-DdCBEs install m.3177G>A in C2C12 cells. FIG. 80A: 3ZF arrays for ZF-DdCBEs designed to install m.3177G>A in mouse Nd1. ZF-targeted DNA sequences are indicated by thick black lines vertically above or below the corresponding DNA sequence, and the target cytosine is colored light gray. FIGS. 80B, 80C, and 80E: Mitochondrial DNA base editing efficiencies of C2C12 cells treated with (FIG. 80B) the top 26 performing v7 ZF-DdCBE pairs from the initial 3ZF+3ZF panel designed to install m.3177G>A, (FIG. 80C) the top 18 performing extended v7 ZF-DdCBE pairs designed to install m.3177G>A, (FIG. 80E) the v7 ZF-DdCBE pair LB510-Nd1+RB54-Nd1 with the indicated optimized ZF scaffolds. FIG. 80D: Mitochondrial DNA base editing efficiencies of C2C12 cells plated on either poly-D-lysine- or collagen-coated plates treated with the indicated ZF-DdCBE pairs. FIG. 80F: On-target and average off-target editing efficiencies of C2C12 cells treated with v7 (indicated with an arrow), v8, or v8^HS1 to v8^HS5 ZF-DdCBE pair LB510-Nd1+RB54-Nd1. FIG. 80G: The m.3177G>A mutation in mouse Nd1 creates a missense E143K mutation. For FIGS. 80B-80E, values and errors reflect the mean±s.d. of n=3 independent biological replicates. For FIG. 80F, values reflect the mean of n=3 independent biological replicates. The on-target editing efficiencies shown are for the most efficiently edited C•G within the spacing region. For FIGS. 80C-80D, all ZF-DdCBE pairs use the split DddA orientation DddA^C/DddA^N.

FIGS. 81A-81C: Converting mitochondrial ZF-DdCBEs into nuclear ZF-DdCBEs. FIGS. 81A-81C: 3ZF arrays for ZF-DdCBEs designed to edit mitochondrial sites, or nuclear sites with high sequence similarity. ZF-targeted DNA sequences are indicated by thick black lines vertically above or below the corresponding DNA sequence, spacing regions are marked with arrows, and the target cytosine(s) edited in mitochondrial DNA with high efficiency are colored light gray.

FIGS. 82A-82B: Correction of a nuclear disease-causing mutation using ZF-DdCBEs. FIG. 82A: 3ZF arrays for ZF-DdCBEs designed to correct human HBB-28(A>G). ZF-targeted DNA sequences are indicated by thick black lines vertically above or below the corresponding DNA sequence, and the target cytosine is colored light gray. FIG. 82B: Nuclear DNA base editing efficiencies of HEK293T-HBB cells treated with nuclear ZF-DdCBE pairs designed to correct HBB-28(A>G). All ZF-DdCBE pairs use the split DddA orientation DddA^N/DddA^C. For FIG. 82B, values and errors reflect the mean±s.d. of n=3 independent biological replicates.

FIGS. 83A-83F: Off-target editing analysis of mice treated with AAV-Mt-tk. FIGS. 83A-83F: Off-target editing efficiencies within mitochondrial off-target amplicon (FIG. 83A) OT1, (FIG. 83B) OT3, (FIG. 83C) OT4, (FIG. 83D) OT10, (FIG. 83E) OT11, or (FIG. 83F) OT12 of tissue samples from mice treated with buffer, dAAV-Mt-tk or AAV-Mt-tk. Values reflect the mean of n=4, 4, and 7 for mice treated with buffer, AAV-Mt-tk, or dAAV-Mt-tk, respectively.

FIGS. 84A-84F: Off-target editing analysis of mice treated with AAV-Nd1. FIGS. 84A-84F: Off-target editing efficiencies within mitochondrial off-target amplicon (FIG. 84A) OT2, (FIG. 84B) OT3, (FIG. 84C) OT5, (FIG. 84D) OT6, (FIG. 84E) OT9, or (FIG. 84F) OT12 of tissue samples from mice treated with buffer or AAV-Nd1. Values reflect the mean of n=4 and 7 for mice treated with buffer or AAV-Nd1, respectively.

FIGS. 85A-85D: Configurations and DNA sequences of spacing regions for the ZF-DdCBE pairs described herein. FIG. 85A: Initial mitochondrial ZF-DdCBE pairs used to establish v1 to v8 architectural improvements. FIG. 85B: Additional mitochondrial ZF-DdCBE pairs used to validate optimized architectures and HS variants. FIG. 85C: ZFD-derived mitochondrial ZF-DdCBE pairs. FIG. 85D: Nuclear ZF-DdCBE pairs. For each site the DNA spacing region, split DddA orientation, ZF array lengths, and ZF-targeted DNA strands (LT, LB, RT, RB=left top, left bottom, right top, right bottom, respectively) are shown, and the cytosine with the highest editing efficiency is colored in light gray. ZF-DdCBE naming convention follows A+B where A and B specify the left and right ZF, respectively. Nucleotide numbering starts with the first 5′-nucleotide in the spacing region designated position 1. For R8-ATP8+4-ATP8, nucleotide C5 has the highest editing efficiency.

FIGS. 86A-86C: ZF-DdCBEs correct the MELAS-causing pathogenic mutation in cultured cells in vitro. FIG. 86A: The m.3243A>G mutation in human MT-TL1 alters the D-loop of mt-tRNA^Leu(UUR). FIGS. 86B-86C: Mitochondrial DNA base editing efficiencies of (FIG. 86B) HEK293T cells or (FIG. 86C) RN164 cybrid 143BTK⁻ cells treated with an optimized ZF-DdCBE pair designed to correct m.3243A>G. Values and errors reflect the mean±s.d. of n=3 independent biological replicates. For each site, the DNA spacing region, split DddA orientation, ZF array lengths, and ZF-targeted DNA strands (LT, RB=left top, right bottom, respectively) are shown, and the cytosine with highest editing efficiency is colored in light gray.

FIGS. 87A-87C: Correction of a mitochondrial disease-causing mutation using ZF-DdCBEs. FIG. 87A: 3ZF arrays for ZF-DdCBEs designed to correct m.3243A>G in human MT-TL1. ZF-targeted DNA sequences are indicated by thick black lines vertically above or below the corresponding DNA sequence, and the target cytosine is colored light gray. FIG. 87B: mtDNA base editing efficiencies of HEK293T cells (encoding wild-type MT-TL1, which lacks the m.3243A>G mutation) treated with v7 ZF-DdCBE pairs designed to correct m.3243A>G. Editing of the adjacent base at position m.3242 (CTC context) is considered a proxy for on-target editing activity. FIG. 87C: mtDNA base editing efficiencies of RN164 cybrid 143BTK⁻ cells homoplasmic for m.3243A>G treated with v7 ZF-DdCBE pair MT-TL1•pB7-LT32/pB6N-RB6458 or variants containing additional mutations in DddA^N. For FIG. 87B, values and errors reflect the mean±s.d. of n=3 independent biological replicates. For FIG. 87C, values and errors reflect the mean±s.d. of n=2 independent biological replicates.

DEFINITIONS

Unless defined otherwise, all technical and scientific terms used herein have the meaning commonly understood by a person skilled in the art to which this invention belongs. The following references provide one of skill with a general definition of many of the terms used in this invention: Singleton et al., Dictionary of Microbiology and Molecular Biology (2nd ed. 1994); The Cambridge Dictionary of Science and Technology (Walker ed., 1988); The Glossary of Genetics, 5th Ed., R. Rieger et al. (eds.), Springer Verlag (1991); and Hale & Margham, The Harper Collins Dictionary of Biology (1991). As used herein, the following terms have the meanings ascribed to them unless specified otherwise.

AAV

An “adeno-associated virus” or “AAV” is a virus which infects humans and some other primate species. The wild-type AAV genome is a single-stranded deoxyribonucleic acid (ssDNA), either positive- or negative-sensed. The genome comprises two inverted terminal repeats (ITRs), one at each end of the DNA strand, and two open reading frames (ORFs), rep and cap, between the ITRs. The rep ORF comprises four overlapping genes encoding Rep proteins required for the AAV life cycle. The cap ORF comprises overlapping genes encoding capsid proteins VP1, VP2, and VP3, which interact to form the viral capsid. VP1, VP2, and VP3 are translated from one mRNA transcript, which can be spliced in two different manners: either a longer or a shorter intron can be excised, resulting in two mRNA isoforms, one ~2.3 kb long and one ~2.6 kb long. The capsid is a supramolecular assembly of approximately 60 individual capsid protein subunits arranged in a non-enveloped, T=1 icosahedral lattice capable of protecting the AAV genome. The mature capsid is composed of VP1, VP2, and VP3 (molecular masses of approximately 87, 73, and 62 kDa, respectively) in a ratio of about 1:1:10.
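The capsid stoichiometry stated above can be worked through as simple arithmetic. The following sketch treats the approximate figures in this definition (60 subunits, a 1:1:10 ratio, and masses of 87, 73, and 62 kDa) as exact solely for illustration; the function and variable names are hypothetical.

```python
def capsid_composition(total_subunits=60, ratio=(1, 1, 10)):
    """Split the capsid subunits among VP1, VP2, and VP3 according to a ratio."""
    unit = total_subunits / sum(ratio)
    return tuple(part * unit for part in ratio)


def capsid_mass_kda(copies, masses_kda=(87, 73, 62)):
    """Approximate total protein mass of the capsid in kDa."""
    return sum(n * m for n, m in zip(copies, masses_kda))


vp1, vp2, vp3 = capsid_composition()           # copies of VP1, VP2, VP3
total_mass = capsid_mass_kda((vp1, vp2, vp3))  # approximate capsid mass in kDa
```

With these inputs, a 1:1:10 ratio over 60 subunits corresponds to about 5 copies each of VP1 and VP2 and about 50 copies of VP3, for an approximate capsid protein mass of 3,900 kDa.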

rAAV particles may comprise a nucleic acid vector (e.g., a recombinant genome), which may comprise at a minimum: (a) one or more heterologous nucleic acid regions comprising a sequence encoding a protein or polypeptide of interest (e.g., a split Cas9 or split nucleobase) or an RNA of interest (e.g., a gRNA), or one or more nucleic acid regions comprising a sequence encoding a Rep protein; and (b) one or more regions comprising inverted terminal repeat (ITR) sequences (e.g., wild-type ITR sequences or engineered ITR sequences) flanking the one or more nucleic acid regions (e.g., heterologous nucleic acid regions). In some embodiments, the nucleic acid vector is between 4 kb and 5 kb in size (e.g., 4.2 to 4.7 kb in size). In some embodiments, the nucleic acid vector further comprises a region encoding a Rep protein. In some embodiments, the nucleic acid vector is circular. In some embodiments, the nucleic acid vector is single-stranded. In some embodiments, the nucleic acid vector is double-stranded. In some embodiments, a double-stranded nucleic acid vector may be, for example, a self-complementary vector that contains a region of the nucleic acid vector that is complementary to another region of the nucleic acid vector, initiating the formation of the double-strandedness of the nucleic acid vector.

Adenosine Deaminase

As used herein, the term “adenosine deaminase” or “adenosine deaminase domain” refers to a protein or enzyme that catalyzes a deamination reaction of an adenosine (or adenine). The terms are used interchangeably. In certain embodiments, the disclosure provides base editor fusion proteins comprising one or more adenosine deaminase domains (for example, fused to any of the zinc finger domain-containing proteins provided herein). For instance, an adenosine deaminase domain may comprise a heterodimer of a first adenosine deaminase and a second deaminase domain, connected by a linker. Adenosine deaminases (e.g., engineered adenosine deaminases or evolved adenosine deaminases) provided herein may be enzymes that convert adenine (A) to inosine (I) in DNA or RNA. Such adenosine deaminases can lead to an A:T to G:C base pair conversion. In some embodiments, the deaminase is a variant of a naturally occurring deaminase from an organism. In some embodiments, the deaminase does not occur in nature. For example, in some embodiments, the deaminase is at least 50%, at least 55%, at least 60%, at least 65%, at least 70%, at least 75%, at least 80%, at least 85%, at least 90%, at least 95%, at least 96%, at least 97%, at least 98%, at least 99%, or at least 99.5% identical to a naturally occurring deaminase.

In some embodiments, the adenosine deaminase is derived from a bacterium, such as E. coli, S. aureus, S. typhi, S. putrefaciens, H. influenzae, or C. crescentus. In some embodiments, the adenosine deaminase is a TadA deaminase. In some embodiments, the TadA deaminase is an E. coli TadA deaminase (ecTadA). In some embodiments, the TadA deaminase is a truncated E. coli TadA deaminase. For example, the truncated ecTadA may be missing one or more N-terminal amino acids relative to a full-length ecTadA. In some embodiments, the truncated ecTadA may be missing 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, or 20 N-terminal amino acid residues relative to the full-length ecTadA. In some embodiments, the truncated ecTadA may be missing 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, or 20 C-terminal amino acid residues relative to the full-length ecTadA. In some embodiments, the ecTadA deaminase does not comprise an N-terminal methionine. Reference is made to U.S. Patent Publication No. 2018/0073012, published Mar. 15, 2018, which is incorporated herein by reference.

Base Editing

“Base editing” refers to genome editing technology that involves the conversion of a specific nucleic acid base into another at a targeted genomic locus (e.g., including in a mtDNA). In certain embodiments, this can be achieved without requiring double-stranded DNA breaks (DSBs) or single-stranded breaks (i.e., nicking). To date, other genome editing techniques, including CRISPR-based systems, begin with the introduction of a DSB at a locus of interest. Subsequently, cellular DNA repair enzymes mend the break, commonly resulting in random insertions or deletions (indels) of bases at the site of the DSB. However, when the introduction or correction of a point mutation at a target locus is desired rather than stochastic disruption of the entire gene, these genome editing techniques are unsuitable, as correction rates are low (e.g., typically 0.1% to 5%), with the major genome editing products being indels. In order to increase the efficiency of gene correction without simultaneously introducing random indels, the present inventors previously modified the CRISPR/Cas9 system to directly convert one DNA base into another without DSB formation. See Komor, A. C., et al., Programmable editing of a target base in genomic DNA without double-stranded DNA cleavage. Nature 533, 420-424 (2016), the entire contents of which are incorporated by reference herein.

Base Editor

The term “base editor (BE)” as used herein, refers to an agent comprising a polypeptide that is capable of making a modification to a base (e.g., A, T, C, G, or U) within a nucleic acid sequence (e.g., mtDNA) that converts one base to another (e.g., A to G, A to C, A to T, C to T, C to G, C to A, G to A, G to C, G to T, T to A, T to C, T to G). In some embodiments, the BE refers to those fusion proteins described herein which are capable of modifying bases directly in mitochondrial DNA. Such BEs can also be referred to herein as “mtDNA base editors” or “mtDNA BEs.” Such BEs can refer to those fusion proteins comprising a programmable DNA binding protein (“pDNAbp”) (e.g., any of the zinc finger domain-containing proteins provided herein, including mitoZFPs, or a CRISPR/Cas9) and a deaminase (such as a double-stranded DNA deaminase (“DddA”)) to precisely install nucleotide changes and/or correct pathogenic mutations in DNA, including mtDNA, rather than destroying the mtDNA with double-strand breaks (DSBs).

In some embodiments, the base editors contemplated herein comprise any of the zinc finger domain-containing proteins provided herein. In some embodiments, the base editors contemplated herein comprise any of the DddA variants provided herein.

In some embodiments, the base editors contemplated herein comprise a nuclease-inactive Cas9 (dCas9) fused to a deaminase which binds a nucleic acid in a guide RNA-programmed manner via the formation of an R-loop, but does not cleave the nucleic acid. For example, the dCas9 domain of the fusion protein may include a D10A and an H840A mutation (which together inactivate the nuclease activity of Cas9, whereas either mutation alone renders Cas9 capable of cleaving only one strand of a nucleic acid duplex), as described in PCT/US2016/058344, which published as WO 2017/070632 on Apr. 27, 2017, and is incorporated herein by reference in its entirety. The DNA cleavage domain of S. pyogenes Cas9 includes two subdomains, the HNH nuclease subdomain and the RuvC1 subdomain. The HNH subdomain cleaves the strand complementary to the gRNA (the “targeted strand,” or the strand in which editing or deamination occurs), whereas the RuvC1 subdomain cleaves the non-complementary strand containing the PAM sequence (the “non-edited strand”). The RuvC1 mutant D10A generates a nick in the targeted strand, while the HNH mutant H840A generates a nick on the non-edited strand (see Jinek et al., Science, 337:816-821(2012); Qi et al., Cell. 28; 152(5):1173-83 (2013)).

BEs that convert a C to T, in some embodiments, comprise a cytidine deaminase (e.g., a double-stranded DNA deaminase or DddA). A “cytidine deaminase” (including those DddAs disclosed herein) refers to an enzyme that catalyzes the chemical reaction “cytosine+H₂O→uracil+NH₃” or “5-methyl-cytosine+H₂O→thymine+NH₃.” As it may be apparent from the reaction formula, such chemical reactions result in a C to U/T nucleobase change. In the context of a gene, such a nucleotide change, or mutation, may in turn lead to an amino acid change in the protein, which may affect the protein's function, e.g., loss-of-function or gain-of-function. In some embodiments, the C to T nucleobase editor comprises a zinc finger protein fused to a cytidine deaminase. In some embodiments, the cytidine deaminase domain is fused to the N-terminus of the zinc finger protein, or to the C-terminus of the zinc finger protein. In some embodiments, the C to T nucleobase editor comprises a Cas9 protein (e.g., an nCas9 or dCas9 protein) fused to a cytidine deaminase. In some embodiments, the cytidine deaminase is fused to the N-terminus of the Cas9 protein, or to the C-terminus of the Cas9 protein.

In some embodiments, the nucleobase editor further comprises a domain that inhibits uracil glycosylase, and/or a nuclear localization signal.

Cas9 domains used in base editing have been described in the following references, the contents of which may be applied in the instant disclosure to modify and/or include in BEs described herein, which can target mtDNA, e.g., in Rees & Liu, Nat Rev Genet. 2018; 19(12):770-788 and Koblan et al., Nat Biotechnol. 2018; 36(9):843-846; as well as U.S. Patent Publication No. 2018/0073012, published Mar. 15, 2018, which issued as U.S. Pat. No. 10,113,163 on Oct. 30, 2018; U.S. Patent Publication No. 2017/0121693, published May 4, 2017, which issued as U.S. Pat. No. 10,167,457 on Jan. 1, 2019; International Publication No. WO 2017/070633, published Apr. 27, 2017; U.S. Patent Publication No. 2015/0166980, published Jun. 18, 2015; U.S. Pat. No. 9,840,699, issued Dec. 12, 2017; U.S. Pat. No. 10,077,453, issued Sep. 18, 2018; International Publication No. WO 2019/023680, published Jan. 31, 2019; International Publication No. WO 2018/176009, published Sep. 27, 2018; International Application No. PCT/US2019/033848, filed May 23, 2019; International Application No. PCT/US2019/47996, filed Aug. 23, 2019; International Application No. PCT/US2019/049793, filed Sep. 5, 2019; U.S. Provisional Application No. 62/835,490, filed Apr. 17, 2019; International Application No. PCT/US2019/61685, filed Nov. 15, 2019; International Application No. PCT/US2019/57956, filed Oct. 24, 2019; U.S. Provisional Application No. 62/858,958, filed Jun. 7, 2019; and International Application No. PCT/US2019/58678, filed Oct. 29, 2019, the contents of each of which are incorporated herein by reference in their entireties.

Exemplary adenine and cytosine base editors are also described in Rees & Liu, Base editing: precision chemistry on the genome and transcriptome of living cells, Nat. Rev. Genet. 2018; 19(12):770-788; as well as U.S. Patent Publication No. 2018/0073012, published Mar. 15, 2018, which issued as U.S. Pat. No. 10,113,163, on Oct. 30, 2018; U.S. Patent Publication No. 2017/0121693, published May 4, 2017, which issued as U.S. Pat. No. 10,167,457 on Jan. 1, 2019; International Publication No. WO 2017/070633, published Apr. 27, 2017; U.S. Patent Publication No. 2015/0166980, published Jun. 18, 2015; U.S. Pat. No. 9,840,699, issued Dec. 12, 2017; and U.S. Pat. No. 10,077,453, issued Sep. 18, 2018, PCT Application PCT/US2017/045381, filed Aug. 3, 2017, which published as WO 2018/027078, and PCT Application No. PCT/US2019/033848, which published as WO 2019/226953, each of which is herein incorporated by reference. Any of the deaminase components of these adenine or cytidine BEs could be modified using a method of directed evolution (e.g., PACE or PANCE) to obtain a deaminase which may use double-stranded DNA as a substrate, and thus, which could be used in the BEs described herein, which are intended, for example, for use in conducting base editing directly on mtDNA, i.e., on a double-stranded DNA target.

Cas9

The term “Cas9” or “Cas9 nuclease” refers to an RNA-guided nuclease comprising a Cas9 domain, or a fragment thereof (e.g., a protein comprising an active or inactive DNA cleavage domain of Cas9, and/or the gRNA binding domain of Cas9). A “Cas9 domain” as used herein, is a protein fragment comprising an active or inactive cleavage domain of Cas9 and/or the gRNA binding domain of Cas9. A “Cas9 protein” is a full length Cas9 protein. A Cas9 nuclease is also referred to sometimes as a casn1 nuclease or a CRISPR (Clustered Regularly Interspaced Short Palindromic Repeat)-associated nuclease. CRISPR is an adaptive immune system that provides protection against mobile genetic elements (viruses, transposable elements, and conjugative plasmids). CRISPR clusters contain spacers, sequences complementary to antecedent mobile elements, and target invading nucleic acids. CRISPR clusters are transcribed and processed into CRISPR RNA (crRNA). In type II CRISPR systems, correct processing of pre-crRNA requires a trans-encoded small RNA (tracrRNA), endogenous ribonuclease 3 (rnc), and a Cas9 domain. The tracrRNA serves as a guide for ribonuclease III-aided processing of pre-crRNA. Subsequently, Cas9/crRNA/tracrRNA endonucleolytically cleaves linear or circular dsDNA target complementary to the spacer. The target strand not complementary to crRNA is first cut endonucleolytically, then trimmed 3′-5′ exonucleolytically. In nature, DNA-binding and cleavage typically requires protein and both RNAs. However, single guide RNAs (“sgRNA”, or simply “gRNA”) can be engineered so as to incorporate aspects of both the crRNA and tracrRNA into a single RNA species. See, e.g., Jinek M., Chylinski K., Fonfara I., Hauer M., Doudna J. A., Charpentier E. Science 337:816-821(2012), the entire contents of which are hereby incorporated by reference. Cas9 recognizes a short motif in the CRISPR repeat sequences (the PAM or protospacer adjacent motif) to help distinguish self versus non-self. 
Cas9 nuclease sequences and structures are well known to those of skill in the art (see, e.g., “Complete genome sequence of an M1 strain of Streptococcus pyogenes.” Ferretti J. J., McShan W. M., Ajdic D. J., Savic D. J., Savic G., Lyon K., Primeaux C., Sezate S., Suvorov A. N., Kenton S., Lai H. S., Lin S. P., Qian Y., Jia H. G., Najar F. Z., Ren Q., Zhu H., Song L., White J., Yuan X., Clifton S. W., Roe B. A., McLaughlin R. E., Proc. Natl. Acad. Sci. U.S.A. 98:4658-4663(2001); “CRISPR RNA maturation by trans-encoded small RNA and host factor RNase III.” Deltcheva E., Chylinski K., Sharma C. M., Gonzales K., Chao Y., Pirzada Z. A., Eckert M. R., Vogel J., Charpentier E., Nature 471:602-607(2011); and “A programmable dual-RNA-guided DNA endonuclease in adaptive bacterial immunity.” Jinek M., Chylinski K., Fonfara I., Hauer M., Doudna J. A., Charpentier E. Science 337:816-821(2012), the entire contents of each of which are incorporated herein by reference). Cas9 orthologs have been described in various species, including, but not limited to, S. pyogenes and S. thermophilus. Additional suitable Cas9 nucleases and sequences will be apparent to those of skill in the art based on this disclosure, and such Cas9 nucleases and sequences include Cas9 sequences from the organisms and loci disclosed in Chylinski, Rhun, and Charpentier, “The tracrRNA and Cas9 families of type II CRISPR-Cas immunity systems” (2013) RNA Biology 10:5, 726-737; the entire contents of which are incorporated herein by reference. In some embodiments, a Cas9 nuclease comprises one or more mutations that partially impair or inactivate the DNA cleavage domain.

A nuclease-inactivated Cas9 domain may interchangeably be referred to as a “dCas9” protein (for nuclease-“dead” Cas9). Methods for generating a Cas9 domain (or a fragment thereof) having an inactive DNA cleavage domain are known (see, e.g., Jinek et al., Science. 337:816-821(2012); Qi et al., “Repurposing CRISPR as an RNA-Guided Platform for Sequence-Specific Control of Gene Expression” (2013) Cell. 28; 152(5):1173-83, the entire contents of each of which are incorporated herein by reference). For example, the DNA cleavage domain of Cas9 is known to include two subdomains, the HNH nuclease subdomain and the RuvC1 subdomain. The HNH subdomain cleaves the strand complementary to the gRNA, whereas the RuvC1 subdomain cleaves the non-complementary strand. Mutations within these subdomains can silence the nuclease activity of Cas9. For example, the mutations D10A and H840A completely inactivate the nuclease activity of S. pyogenes Cas9 (Jinek et al., Science. 337:816-821(2012); Qi et al., Cell. 28; 152(5):1173-83 (2013)). In some embodiments, proteins comprising fragments of Cas9 are provided. For example, in some embodiments, a protein comprises one of two Cas9 domains: (1) the gRNA binding domain of Cas9; or (2) the DNA cleavage domain of Cas9. In some embodiments, proteins comprising Cas9, or fragments thereof, are referred to as “Cas9 variants.” A Cas9 variant shares homology to Cas9, or a fragment thereof. For example, a Cas9 variant is at least about 70% identical, at least about 80% identical, at least about 90% identical, at least about 95% identical, at least about 96% identical, at least about 97% identical, at least about 98% identical, at least about 99% identical, at least about 99.5% identical, at least about 99.8% identical, or at least about 99.9% identical to wild type Cas9 (e.g., SpCas9 of SEQ ID NO: 450). 
In some embodiments, the Cas9 variant may have 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, or more amino acid changes compared to wild type Cas9 (e.g., SpCas9 of SEQ ID NO: 450). In some embodiments, the Cas9 variant comprises a fragment of Cas9 (e.g., a gRNA binding domain or a DNA-cleavage domain), such that the fragment is at least about 70% identical, at least about 80% identical, at least about 90% identical, at least about 95% identical, at least about 96% identical, at least about 97% identical, at least about 98% identical, at least about 99% identical, at least about 99.5% identical, or at least about 99.9% identical to the corresponding fragment of wild type Cas9 (e.g., SpCas9 of SEQ ID NO: 450). In some embodiments, the fragment is at least 30%, at least 35%, at least 40%, at least 45%, at least 50%, at least 55%, at least 60%, at least 65%, at least 70%, at least 75%, at least 80%, at least 85%, at least 90%, at least 95%, at least 96%, at least 97%, at least 98%, at least 99%, or at least 99.5% of the amino acid length of a corresponding wild type Cas9 (e.g., SpCas9 of SEQ ID NO: 450).
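
By way of non-limiting illustration, the percent identity thresholds recited above can be evaluated with a short calculation. The following Python sketch assumes that the two sequences have already been aligned to equal length by a separate tool, with “-” marking gaps, and that identity is computed over all aligned columns. Other conventions for the denominator exist, and the function name percent_identity is hypothetical and not part of this disclosure.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity between two pre-aligned sequences ('-' marks a gap).

    Assumption: the denominator is the number of aligned columns in which
    at least one sequence has a residue. Other conventions exist.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # Skip columns where both sequences carry a gap.
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b)
               if not (a == "-" and b == "-")]
    if not columns:
        raise ValueError("no aligned columns to compare")
    matches = sum(1 for a, b in columns if a == b)
    return 100.0 * matches / len(columns)


# Two 11-residue fragments that differ at a single position.
fragment_wild_type = "SDYDVDHIVPQ"
fragment_variant = "SDYDVDAIVPQ"
print(round(percent_identity(fragment_wild_type, fragment_variant), 1))
```

Under these assumptions, the two 11-residue fragments share 10 of 11 positions, so the variant fragment would satisfy an “at least about 90% identical” threshold but not an “at least about 95% identical” threshold.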

The amino acid sequence of wild type SpCas9 is:

(SEQ ID NO: 450)

MDKKYSIGLDIGTNSVGWAVITDEYKVPSKKFKVLGNTDRHSIKKNLIG

ALLFDSGETAEATRLKRTARRRYTRRKNRICYLQEIFSNEMAKVDDSFF

HRLEESFLVEEDKKHERHPIFGNIVDEVAYHEKYPTIYHLRKKLVDSTD

KADLRLIYLALAHMIKFRGHFLIEGDLNPDNSDVDKLFIQLVQTYNQLF

EENPINASGVDAKAILSARLSKSRRLENLIAQLPGEKKNGLFGNLIALS

LGLTPNFKSNFDLAEDAKLQLSKDTYDDDLDNLLAQIGDQYADLFLAAK

NLSDAILLSDILRVNTEITKAPLSASMIKRYDEHHQDLTLLKALVRQQL

PEKYKEIFFDQSKNGYAGYIDGGASQEEFYKFIKPILEKMDGTEELLVK

LNREDLLRKQRTFDNGSIPHQIHLGELHAILRRQEDFYPFLKDNREKIE

KILTFRIPYYVGPLARGNSRFAWMTRKSEETITPWNFEEVVDKGASAQS

FIERMTNFDKNLPNEKVLPKHSLLYEYFTVYNELTKVKYVTEGMRKPAF

LSGEQKKAIVDLLFKTNRKVTVKQLKEDYFKKIECFDSVEISGVEDRFN

ASLGTYHDLLKIIKDKDFLDNEENEDILEDIVLTLTLFEDREMIEERLK

TYAHLFDDKVMKQLKRRRYTGWGRLSRKLINGIRDKQSGKTILDFLKSD

GFANRNFMQLIHDDSLTFKEDIQKAQVSGQGDSLHEHIANLAGSPAIKK

GILQTVKVVDELVKVMGRHKPENIVIEMARENQTTQKGQKNSRERMKRI

EEGIKELGSQILKEHPVENTQLQNEKLYLYYLQNGRDMYVDQELDINRL

SDYDVDHIVPQSFLKDDSIDNKVLTRSDKNRGKSDNVPSEEVVKKMKNY

WRQLLNAKLITQRKFDNLTKAERGGLSELDKAGFIKRQLVETRQITKHV

AQILDSRMNTKYDENDKLIREVKVITLKSKLVSDFRKDFQFYKVREINN

YHHAHDAYLNAVVGTALIKKYPKLESEFVYGDYKVYDVRKMIAKSEQEI

GKATAKYFFYSNIMNFFKTEITLANGEIRKRPLIETNGETGEIVWDKGR

DFATVRKVLSMPQVNIVKKTEVQTGGFSKESILPKRNSDKLIARKKDWD

PKKYGGFDSPTVAYSVLVVAKVEKGKSKKLKSVKELLGITIMERSSFEK

NPIDFLEAKGYKEVKKDLIIKLPKYSLFELENGRKRMLASAGELQKGNE

LALPSKYVNFLYLASHYEKLKGSPEDNEQKQLFVEQHKHYLDEIIEQIS

EFSKRVILADANLDKVLSAYNKHRDKPIREQAENIIHLFTLTNLGAPAA

FKYFDTTIDRKRYTSTKEVLDATLIHQSITGLYETRIDLSQLGGD

As used herein, the term “nCas9” or “Cas9 nickase” refers to a Cas9 or a variant thereof which cleaves or nicks only one of the strands of a target cut site, thereby introducing a nick in a double-stranded DNA molecule rather than creating a double-strand break. This can be achieved by introducing appropriate mutations in a wild-type Cas9 which inactivate one of the two endonuclease activities of the Cas9. Any suitable mutation which inactivates one Cas9 endonuclease activity but leaves the other intact is contemplated. For example, a D10A or an H840A mutation in the wild-type S. pyogenes Cas9 amino acid sequence, or a D10A mutation in the wild-type S. aureus Cas9 amino acid sequence, may be used to form the nCas9.

The amino acid sequence of SpCas9 nickase is:

(SEQ ID NO: 451)

MDKKYSIGLDIGTNSVGWAVITDEYKVPSKKFKVLGNTDRHSIKKNLIG

ALLFDSGETAEATRLKRTARRRYTRRKNRICYLQEIFSNEMAKVDDSFF

HRLEESFLVEEDKKHERHPIFGNIVDEVAYHEKYPTIYHLRKKLVDSTD

KADLRLIYLALAHMIKFRGHFLIEGDLNPDNSDVDKLFIQLVQTYNQLF

EENPINASGVDAKAILSARLSKSRRLENLIAQLPGEKKNGLFGNLIALS

LGLTPNFKSNFDLAEDAKLQLSKDTYDDDLDNLLAQIGDQYADLFLAAK

NLSDAILLSDILRVNTEITKAPLSASMIKRYDEHHQDLTLLKALVRQQL

PEKYKEIFFDQSKNGYAGYIDGGASQEEFYKFIKPILEKMDGTEELLVK

LNREDLLRKQRTFDNGSIPHQIHLGELHAILRRQEDFYPFLKDNREKIE

KILTFRIPYYVGPLARGNSRFAWMTRKSEETITPWNFEEVVDKGASAQS

FIERMTNFDKNLPNEKVLPKHSLLYEYFTVYNELTKVKYVTEGMRKPAF

LSGEQKKAIVDLLFKTNRKVTVKQLKEDYFKKIECFDSVEISGVEDRFN

ASLGTYHDLLKIIKDKDFLDNEENEDILEDIVLTLTLFEDREMIEERLK

TYAHLFDDKVMKQLKRRRYTGWGRLSRKLINGIRDKQSGKTILDFLKSD

GFANRNFMQLIHDDSLTFKEDIQKAQVSGQGDSLHEHIANLAGSPAIKK

GILQTVKVVDELVKVMGRHKPENIVIEMARENQTTQKGQKNSRERMKRI

EEGIKELGSQILKEHPVENTQLQNEKLYLYYLQNGRDMYVDQELDINRL

SDYDVDAIVPQSFLKDDSIDNKVLTRSDKNRGKSDNVPSEEVVKKMKNY

WRQLLNAKLITQRKFDNLTKAERGGLSELDKAGFIKRQLVETRQITKHV

AQILDSRMNTKYDENDKLIREVKVITLKSKLVSDFRKDFQFYKVREINN

YHHAHDAYLNAVVGTALIKKYPKLESEFVYGDYKVYDVRKMIAKSEQEI

GKATAKYFFYSNIMNFFKTEITLANGEIRKRPLIETNGETGEIVWDKGR

DFATVRKVLSMPQVNIVKKTEVQTGGFSKESILPKRNSDKLIARKKDWD

PKKYGGFDSPTVAYSVLVVAKVEKGKSKKLKSVKELLGITIMERSSFEK

NPIDFLEAKGYKEVKKDLIIKLPKYSLFELENGRKRMLASAGELQKGNE

LALPSKYVNFLYLASHYEKLKGSPEDNEQKQLFVEQHKHYLDEIIEQIS

EFSKRVILADANLDKVLSAYNKHRDKPIREQAENIIHLFTLTNLGAPAA

FKYFDTTIDRKRYTSTKEVLDATLIHQSITGLYETRIDLSQLGGD
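
For illustration only, the substitution that distinguishes the nickase listing (SEQ ID NO: 451) from the wild-type listing (SEQ ID NO: 450) can be located by a position-wise comparison. The Python sketch below compares the eighteenth row of each listing, which appears to be the only row that differs as printed. It assumes that every preceding row holds 49 residues, as the first row does. The names ROW_LENGTH, ROW_INDEX, and substitutions are hypothetical.

```python
ROW_LENGTH = 49  # assumption: residues per printed row of the listing
ROW_INDEX = 18   # eighteenth printed row of each listing (1-based)

# Eighteenth row of SEQ ID NO: 450 and of SEQ ID NO: 451, copied from above.
WILD_TYPE_ROW = "SDYDVDHIVPQSFLKDDSIDNKVLTRSDKNRGKSDNVPSEEVVKKMKNY"
NICKASE_ROW = "SDYDVDAIVPQSFLKDDSIDNKVLTRSDKNRGKSDNVPSEEVVKKMKNY"


def substitutions(reference: str, variant: str, first_position: int):
    """List (position, reference residue, variant residue) for each mismatch."""
    if len(reference) != len(variant):
        raise ValueError("sequences must have equal length")
    return [(first_position + offset, ref, var)
            for offset, (ref, var) in enumerate(zip(reference, variant))
            if ref != var]


# 1-based residue number of the first residue in the compared row.
first_position = (ROW_INDEX - 1) * ROW_LENGTH + 1
differences = substitutions(WILD_TYPE_ROW, NICKASE_ROW, first_position)
print(differences)  # [(840, 'H', 'A')]
```

Under the stated row-length assumption, the single difference falls at residue 840 (histidine to alanine), consistent with the H840A nickase mutation described above.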

Cytidine Deaminase

As used herein, a “cytidine deaminase” encoded by the CDA gene is an enzyme that catalyzes the removal of an amine group from cytidine (i.e., the base cytosine when attached to a ribose ring), converting cytidine to uridine (C to U) and deoxycytidine to deoxyuridine (C to U). A non-limiting example of a cytidine deaminase is APOBEC1 (“apolipoprotein B mRNA editing enzyme, catalytic polypeptide 1”). Another example is AID (“activation-induced cytidine deaminase”). Under standard Watson-Crick hydrogen bond pairing, a cytosine base hydrogen bonds to a guanine base. When cytidine is converted to uridine (or deoxycytidine is converted to deoxyuridine), the uridine (or the uracil base of uridine) undergoes hydrogen bond pairing with the base adenine. Thus, a conversion of “C” to uridine (“U”) by cytidine deaminase will cause the insertion of “A” instead of “G” during cellular repair and/or replication processes. Since the adenine “A” pairs with thymine “T”, the cytidine deaminase in coordination with DNA replication causes the conversion of a C-G pairing to a T-A pairing in the double-stranded DNA molecule.
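
The pairing logic described above can be traced with a toy model. The following Python sketch is a simplification under stated assumptions: it treats replication as position-by-position pairing, ignores strand polarity and repair pathways, and uses hypothetical names (PAIRING, deaminate, paired_strand) that do not correspond to any disclosed method.

```python
# Base paired opposite each template base; uracil templates adenine.
PAIRING = {"A": "T", "T": "A", "C": "G", "G": "C", "U": "A"}


def deaminate(strand: str, index: int) -> str:
    """Replace the cytosine at `index` with uracil (C to U)."""
    if strand[index] != "C":
        raise ValueError("target base is not a cytosine")
    return strand[:index] + "U" + strand[index + 1:]


def paired_strand(template: str) -> str:
    """Base placed opposite each template position (orientation ignored)."""
    return "".join(PAIRING[base] for base in template)


top = "GACT"
bottom = paired_strand(top)              # original partner strand
edited_top = deaminate(top, 2)           # C at index 2 becomes U
new_bottom = paired_strand(edited_top)   # first round: A is placed opposite U
new_top = paired_strand(new_bottom)      # second round: T is placed opposite A

print(top[2], bottom[2])          # C G
print(new_top[2], new_bottom[2])  # T A
```

In this sketch the edited position starts as a C-G pairing and, after two rounds of templated copying, ends as a T-A pairing, while the neighboring positions are unchanged.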

Deaminase

The term “deaminase” or “deaminase domain” refers to a protein or enzyme that catalyzes a deamination reaction. In some embodiments, the deaminase is an adenosine (or adenine) deaminase, which catalyzes the hydrolytic deamination of adenine or adenosine. In some embodiments, the adenosine deaminase catalyzes the hydrolytic deamination of adenine or adenosine in deoxyribonucleic acid (DNA) to inosine. In other embodiments, the deaminase is a cytidine (or cytosine) deaminase, which catalyzes the hydrolytic deamination of cytidine or cytosine. In preferred aspects, the deaminase is a double-stranded DNA deaminase, or is modified, evolved, or otherwise altered to be able to utilize double-strand DNA as a substrate for deamination.

The deaminase embraces the DddA domains described herein and defined below. The DddA is a type of deaminase, but where the activity of the deaminase is against double-stranded DNA, rather than single-stranded DNA, which is the case for deaminases prior to the present disclosure.

The deaminases provided herein may be from any organism, such as a bacterium. In some embodiments, the deaminase or deaminase domain is a variant of a naturally occurring deaminase from an organism. In some embodiments, the deaminase or deaminase domain does not occur in nature. For example, in some embodiments, the deaminase or deaminase domain is at least 50%, at least 55%, at least 60%, at least 65%, at least 70%, at least 75%, at least 80%, at least 85%, at least 90%, at least 95%, at least 96%, at least 97%, at least 98%, at least 99%, or at least 99.5% identical to a naturally occurring deaminase.

DNA Editing Efficiency

The term “DNA editing efficiency,” as used herein, refers to the number or proportion of intended base pairs that are edited. For example, if a base editor edits 10% of the base pairs that it is intended to target (e.g., within a cell or within a population of cells), then the base editor can be described as being 10% efficient. Some aspects of editing efficiency embrace the modification (e.g., deamination) of a specific nucleotide within DNA, without generating a large number or percentage of insertions or deletions (i.e., indels). It is generally accepted that editing while generating less than 5% indels (as measured over total target nucleotide substrates) is high editing efficiency. The generation of more than 20% indels is generally accepted as poor or low editing efficiency. Indel formation may be measured by techniques known in the art, including high-throughput screening of sequencing reads.
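
As a non-limiting illustration of the definition above, editing efficiency and indel frequency can be computed from sequencing read counts. The Python sketch below is a minimal example. The ReadCounts structure, the editing_summary function, and the “unclassified” label for indel frequencies between 5% and 20% are assumptions introduced here, since the text above characterizes only the ranges below 5% and above 20%.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReadCounts:
    total: int   # sequencing reads covering the target site
    edited: int  # reads carrying the intended base conversion
    indel: int   # reads carrying an insertion or deletion


def editing_summary(counts: ReadCounts) -> tuple[float, float, str]:
    """Return (percent edited, percent indels, indel-based label)."""
    if counts.total <= 0:
        raise ValueError("total read count must be positive")
    for value in (counts.edited, counts.indel):
        if not 0 <= value <= counts.total:
            raise ValueError("read counts must lie between 0 and total")
    edited_pct = 100.0 * counts.edited / counts.total
    indel_pct = 100.0 * counts.indel / counts.total
    if indel_pct < 5.0:
        label = "high"          # fewer than 5% indels
    elif indel_pct > 20.0:
        label = "low"           # more than 20% indels
    else:
        label = "unclassified"  # assumption: range not characterized above
    return edited_pct, indel_pct, label


example = ReadCounts(total=1000, edited=100, indel=30)
print(editing_summary(example))  # (10.0, 3.0, 'high')
```

In this example, 100 edited reads out of 1000 correspond to a base editor that is 10% efficient, and the 3% indel frequency falls below the 5% threshold.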

DddA

The term “double-stranded DNA deaminase domain” or “DddA” (or equivalently, DddE) refers to a protein that catalyzes a deamination of a target nucleotide (e.g., C, A, or G) in a double-stranded DNA molecule. References to DddA and double-stranded DNA deaminase are equivalent. In one embodiment, the DddA deaminates a cytidine. Deamination of cytidine results in a uridine (or deoxyuridine in the case of deoxycytidine) and, through replication and/or repair processes, converts the original C:G base pair to a T:A base pair. This change can also be referred to as a “C-to-T” edit because the C of the C:G pair is converted to the T of a T:A pair. DddA, when expressed naturally, can be toxic to biological systems. While the mechanism of action is not clearly documented, one rationale for the observed toxicity is that DddA's activity may cause indiscriminate deamination of cytidine in vivo on double-stranded target DNA (e.g., the cellular genome). Such indiscriminate deaminations may provoke cellular repair responses, including, but not limited to, degradation of genomic DNA. Canonical DddA was described in Mok et al., “A bacterial cytidine deaminase toxin enables CRISPR-free mitochondrial base editing,” Nature, 2020; 583(7817): 631-637 (“Mok et al., 2020”), which is incorporated herein by reference. Canonical DddA was discovered in Burkholderia cenocepacia and is reported in Mok et al., 2020 and in the Protein Data Bank as PDB ID: 6U08. It has the following full-length amino acid sequence (1427 amino acids):

>tr|A0A1V6L4E7|A0A1V6L4E7_9BURK YD repeat (Two copies) OS =

Burkholderia cenocepacia OX = 95486 GN = UE95_03830 PE = 1 SV = 1

(1427 AA-the canonical protein or “canonical DddA”)

(SEQ ID NO: 356)

MYEAARVTDPIDHTSALAGFLVGAVLGIALIAAVAFATFTCGFGVALLAGMMAGIGAQ

ALLSIGESIGKMFSSQSGNIITGSPDVYVNSLSAAYATLSGVACSKHNPIPLVAQGSTNIFI

NGRPAARKDDKITCGATIGDGSHDTFFHGGTQTYLPVDDEVPPWLRTATDWAFTLAGL

VGGLGGLLKASGGLSRAVLPCAAKFIGGYVLGEAFGRYVAGPAINKAIGGLFGNPIDVT

TGRKILLAESETDYVIPSPLPVAIKRFYSSGIDYAGTLGRGWVLPWEIRLHARDGRLWYT

DAQGRESGFPMLRAGQAAFSEADQRYLTRTPDGRYILHDLGERYYDFGQYDPESGRIA

WVRRVEDQAGQWYQFERDSRGRVTEILTCGGLRAVLDYETVFGRLGTVTLVHEDERRL

AVTYGYDENGQLASVTDANGAVVRQFAYTNGLMTSHMNALGFTSSYVWSKIEGEPRV

VETHTSEGENWTFEYDVAGRQTRVRHADGRTAHWRFDAQSQIVEYTDLDGAFYRIKY

DAVGMPVMLMLPGDRTVMFEYDDAGRIIAETDPLGRTTRTRYDGNSLRPVEVVGPDGG

AWRVEYDQQGRVVSNQDSLGRENRYEYPKALTALPSAHIDALGGRKTLEWNSLGKLV

GYTDCSGKTTRTSFDAFGRICSRENALGQRITYDVRPTGEPRRVTYPDGSSETFEYDAAG

TLVRYIGLGGRVQELLRNARGQLIEAVDPAGRRVQYRYDVEGRLRELQQDHARYTFTY

SAGGRLLTETRPDGILRRFEYGEAGELLGLDIVGAPDPHATGNRSVRTIRFERDRMGVLK

VQRTPTEVTRYQHDKGDRLVKVERVPTPSGIALGIVPDAVEFEYDKGGRLVAEHGSNGS

VIYTLDELDNVVSLGLPHDQTLQMLRYGSGHVHQIRFGDQVVADFERDDLHREVSRTQ

GRLTQRSGYDPLGRKVWQSAGIDPEMLGRGSGQLWRNYGYDAAGDLIETSDSLRGSTR

FSYDPAGRLISRANPLDRKFEEFAWDAAGNLLDDAQRKSRGYVEGNRLLMWQDLRFE

YDPFGNLATKRRGANQTQRFTYDGQDRLITVHTQDVRGVVETRFAYDPLGRRIAKTDT

AFDLRGMKLRAETKRFVWEGLRLVQEVRETGVSSYVYSPDAPYSPVARADTVMAEAL

AATVIDSAKRAARIFHFHTDPVGAPQEVTDEAGEVAWAGQYAAWGKVEATNRGVTAA

RTDQPLRFAGQYADDSTGLHYNTFRFYDPDVGRFINQDPIGLNGGANVYHYAPNPVGW

VDPWGLAGSYALGPYQISAPQLPAYNGQTVGTFYYVNDAGGLESKVFSSGGPTPYPNY

ANAGHVEGQSALFMRDNGISEGLVFHNNPEGTCGFCVNMTETLLPENAKMTVVPPEGA

IPVKRGATGETKVFTGNSNSPKSPTKGGC.

Effective Amount

The term “effective amount,” as used herein, refers to an amount of a biologically active agent that is sufficient to elicit a desired biological response. For example, in some embodiments, an effective amount of any of the fusion proteins as described herein, or compositions thereof, may refer to the amount of the fusion proteins sufficient to edit a target nucleotide sequence (e.g., mtDNA). In some embodiments, an effective amount of any of the fusion proteins as described herein, or compositions thereof (e.g., a fusion protein comprising any of the zinc finger domain-containing proteins disclosed herein and any of the DddA variants disclosed herein), may refer to the amount that is sufficient to induce editing of a target nucleotide which is proximal to a target nucleic acid sequence specifically bound and edited by the fusion protein. As will be appreciated by the skilled artisan, the effective amount of an agent (e.g., a fusion protein) may vary depending on various factors such as, for example, the desired biological response; the specific allele, genome, or target site to be edited; the cell or tissue being targeted; and the agent being used.

Fusion Protein

The term “fusion protein” as used herein refers to a hybrid polypeptide which comprises protein domains from at least two different proteins (e.g., a programmable DNA binding protein, such as any of the zinc finger domain-containing proteins disclosed herein, and a deaminase, such as any of the DddA variants disclosed herein). One protein may be located at the amino-terminal (N-terminal) portion of the fusion protein or at the carboxy-terminal (C-terminal) portion of the fusion protein, thus forming an “amino-terminal fusion protein” or a “carboxy-terminal fusion protein,” respectively. A protein may comprise different domains, for example, a nucleic acid binding protein (e.g., a zinc finger domain-containing protein) and a catalytic domain of a nucleic-acid editing protein (e.g., a DddA variant, or a portion of a DddA variant). Any of the proteins provided herein may be produced by any method known in the art. For example, the proteins provided herein may be produced via recombinant protein expression and purification, which is especially suited for fusion proteins comprising a peptide linker. Methods for recombinant protein expression and purification are well known, and include those described by Green and Sambrook, Molecular Cloning: A Laboratory Manual (4th ed., Cold Spring Harbor Laboratory Press, Cold Spring Harbor, N.Y. (2012)), the entire contents of which are incorporated herein by reference.

Lentiviral Vectors

Lentiviral vectors are derived from human immunodeficiency virus-1 (HIV-1). The lentiviral genome consists of single-stranded RNA that is reverse-transcribed into DNA and then integrated into the host cell genome. Lentiviruses can infect both dividing and non-dividing cells, making them attractive tools for gene therapy.

The lentiviral genome is around 9 kb in length and contains three major structural genes: gag, pol, and env. The gag gene is translated into three viral core proteins: 1) matrix (MA) proteins, which are necessary for virion assembly and infection of non-dividing cells; 2) capsid (CA) proteins, which form the hydrophobic core of the virion; and 3) nucleocapsid (NC) proteins, which protect the viral genome by coating and associating tightly with the RNA. The pol gene encodes for the viral protease, reverse transcriptase, and integrase enzymes that are essential for viral replication. The env gene encodes for the viral surface glycoproteins, which are essential for virus entry into the host cell by enabling binding to cellular receptors and fusion with cellular membranes. In some embodiments, the viral glycoprotein is derived from vesicular stomatitis virus (VSV-G). The viral genome also contains regulatory genes, including tat and rev. Tat encodes transactivators critical for activating viral transcription, while rev encodes a protein that regulates the splicing and export of viral transcripts. Tat and rev are the first proteins synthesized following viral integration and are required to accelerate production of viral mRNAs.

To improve the safety of lentivirus, the components necessary for viral production are split across multiple vectors. In some embodiments, the disclosure relates to delivery of a heterologous gene (e.g., transgene) via a recombinant lentiviral transfer vector encoding one or more transgenes of interest flanked by long terminal repeat (LTR) sequences. These LTRs are identical nucleotide sequences that are repeated hundreds or thousands of times and facilitate the integration of the transfer plasmid sequences into the host cell genome. Methods of the current disclosure also describe one or more accessory plasmids. These accessory plasmids may include one or more lentiviral packaging plasmids, which encode the pol and rev genes that are necessary for the replication, splicing, and export of viral particles. The accessory plasmids may also include a lentiviral envelope plasmid, which encodes the genes necessary for producing the viral glycoproteins that will allow the viral particle to fuse with the host cell.

Linker

In various embodiments, the herein disclosed fusion proteins (e.g., base editors comprising, for example, any of the zinc finger domain-containing proteins and DddA variants disclosed herein) or the polypeptides that comprise the fusion proteins (e.g., the zinc finger domain-containing proteins or other pDNAbps, and DddA variants or other deaminases) may be engineered to include one or more linker sequences that join two or more polypeptides (e.g., a pDNAbp and a DddA half) to one another.

The term “linker,” as used herein, refers to a molecule linking two other molecules or moieties. The linker can be an amino acid sequence in the case of a linker joining two fusion proteins. For example, a zinc finger domain-containing protein can be fused to a first or second portion of a DddA, by an amino acid linker sequence. The linker can also be a nucleotide sequence in the case of joining two nucleotide sequences together. In other embodiments, the linker is an organic molecule, group, polymer, or chemical moiety. In some embodiments, the linker is 1-100 amino acids in length, for example, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 30-35, 35-40, 40-45, 45-50, 50-60, 60-70, 70-80, 80-90, 90-100, 100-150, or 150-200 amino acids in length. Longer linkers are also contemplated.

mitoZFP

In various embodiments, the mtDNA base editors embrace fusion proteins comprising a DddA (or inactive fragment thereof) and a mitoZFP domain. A “mitoZFP” refers to a zinc finger DNA binding protein that has been modified to comprise one or more mitochondrial targeting sequences (MTS), as described further herein.

Mitochondrial Targeting Sequence (MTS)

In various embodiments, the base editors or the polypeptides that comprise the base editors (e.g., the pDNAbps (such as zinc finger domain-containing proteins) and DddA) disclosed herein may be engineered to include one or more mitochondrial targeting sequences (MTS) (or mitochondrial localization sequence (MLS)) that facilitate the translocation of a polypeptide into the mitochondria. Such base editors may be referred to herein as mtDNA base editors. MTSs are known in the art, and exemplary sequences are provided herein. In general, MTSs are short peptide sequences (about 3-70 amino acids long) that direct a newly synthesized protein to the mitochondria within a cell. An MTS is usually found at the N-terminus and consists of an alternating pattern of hydrophobic and positively charged amino acids that forms what is called an amphipathic helix. Mitochondrial localization sequences can contain additional signals that subsequently target the protein to different regions of the mitochondria, such as the mitochondrial matrix. One exemplary mitochondrial localization sequence is the mitochondrial localization sequence derived from Cox8, a mitochondrial cytochrome c oxidase subunit VIII. In some embodiments, a mitochondrial localization sequence derived from Cox8 includes the amino acid sequence: MSVLTPLLLRGLTGSARRLPVPRAKIHSL (SEQ ID NO: 357). In some embodiments, the mitochondrial localization sequence derived from Cox8 includes an amino acid sequence that has about 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90% or 95% identity to SEQ ID NO: 357.
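By way of illustration only, the percent identity of a candidate sequence to SEQ ID NO: 357 may be estimated as in the following minimal Python sketch. The sketch assumes the two sequences have already been aligned to equal length without gaps; in practice a sequence alignment tool would typically be used. The function name and the two-substitution variant are hypothetical and are not sequences of the disclosure.

```python
# SEQ ID NO: 357 (Cox8-derived mitochondrial localization sequence), as recited above.
COX8_MTS = "MSVLTPLLLRGLTGSARRLPVPRAKIHSL"


def percent_identity(reference: str, candidate: str) -> float:
    """Percent of positions that match between two pre-aligned, equal-length sequences.

    Assumption: no gaps; a real comparison would first align the sequences.
    """
    if len(reference) != len(candidate):
        raise ValueError("sequences must be pre-aligned to equal length")
    matches = sum(a == b for a, b in zip(reference, candidate))
    return 100.0 * matches / len(reference)


# Hypothetical variant for illustration: residue 10 changed to K, residue 29 changed to A.
variant = COX8_MTS[:9] + "K" + COX8_MTS[10:28] + "A"

print(round(percent_identity(COX8_MTS, variant), 1))
```

Under these assumptions, 27 of the 29 positions match, so the hypothetical variant would fall within the "about 90%" identity embodiment.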

napDNAbp

In various embodiments, the base editors provided herein may comprise pDNAbps that are nucleic acid programmable (e.g., a base editor comprising a napDNAbp such as Cas9 and any of the DddA variants disclosed herein). The term “napDNAbp” which stands for “nucleic acid programmable DNA binding protein” refers to any protein that may associate (e.g., form a complex) with one or more nucleic acid molecules (i.e., which may broadly be referred to as a “napDNAbp-programming nucleic acid molecule” and includes, for example, guide RNA in the case of Cas systems) that direct or otherwise program the protein to localize to a specific target nucleotide sequence (e.g., a gene locus of a genome) that is complementary to the one or more nucleic acid molecules (or a portion or region thereof) associated with the protein, thereby causing the protein to bind to the nucleotide sequence at the specific target site. The term napDNAbp embraces CRISPR-Cas9 proteins, as well as Cas9 equivalents, homologs, orthologs, or paralogs, whether naturally occurring or non-naturally occurring (e.g., engineered or modified), and may include a Cas9 equivalent from any type of CRISPR system (e.g., type II, V, VI), including Cpf1 (a type V CRISPR-Cas system), C2c1 (a type V CRISPR-Cas system), C2c2 (a type VI CRISPR-Cas system), C2c3 (a type V CRISPR-Cas system), dCas9, GeoCas9, CjCas9, Cas12a, Cas12b, Cas12c, Cas12d, Cas12g, Cas12h, Cas12i, Cas13d, Cas14, Argonaute, and nCas9. Further Cas-equivalents are described in Makarova et al., “C2c2 is a single-component programmable RNA-guided RNA-targeting CRISPR effector,” Science 2016; 353(6299), the contents of which are incorporated herein by reference. However, the nucleic acid programmable DNA binding proteins (napDNAbps) that may be used in connection with this invention are not limited to CRISPR-Cas systems.
The invention embraces any such programmable protein, such as the Argonaute protein from Natronobacterium gregoryi (NgAgo), which may also be used for DNA-guided genome editing. The NgAgo-guide DNA system does not require a PAM sequence or guide RNA molecules, which means genome editing can be performed simply by the expression of generic NgAgo protein and introduction of synthetic oligonucleotides on any genomic sequence. See Gao et al., DNA-guided genome editing using the Natronobacterium gregoryi Argonaute. Nature Biotechnology 2016; 34(7):768-73, which is incorporated herein by reference.

In some embodiments, the napDNAbp is an RNA-programmable nuclease, which, when in a complex with an RNA, may be referred to as a nuclease:RNA complex. Typically, the bound RNA(s) is referred to as a guide RNA (gRNA). gRNAs can exist as a complex of two or more RNAs, or as a single RNA molecule. gRNAs that exist as a single RNA molecule may be referred to as single-guide RNAs (sgRNAs), though “gRNA” is used interchangeably to refer to guide RNAs that exist as either single molecules or as a complex of two or more molecules. Typically, gRNAs that exist as single RNA species comprise two domains: (1) a domain that shares homology to a target nucleic acid (e.g., and directs binding of a Cas9 (or equivalent) complex to the target); and (2) a domain that binds a Cas9 protein. In some embodiments, domain (2) corresponds to a sequence known as a tracrRNA and comprises a stem-loop structure. For example, in some embodiments, domain (2) is homologous to a tracrRNA as depicted in FIG. 1E of Jinek et al., Science 337:816-821(2012), the entire contents of which is incorporated herein by reference. Other examples of gRNAs (e.g., those including domain 2) can be found in U.S. Pat. No. 9,340,799, entitled “mRNA-Sensing Switchable gRNAs,” and International Patent Application No. PCT/US2014/054247, filed Sep. 6, 2013, published as WO 2015/035136 and entitled “Delivery System For Functional Nucleases,” the entire contents of each of which are incorporated herein by reference. In some embodiments, a gRNA comprises two or more of domains (1) and (2) and may be referred to as an “extended gRNA.” For example, an extended gRNA will, e.g., bind two or more Cas9 proteins and bind a target nucleic acid at two or more distinct regions, as described herein. The gRNA comprises a nucleotide sequence that complements a target site, which mediates binding of the nuclease/RNA complex to said target site, providing the sequence specificity of the nuclease:RNA complex. 
In some embodiments, the RNA-programmable nuclease is the (CRISPR-associated system) Cas9 endonuclease, for example Cas9 (Csn1) from Streptococcus pyogenes (see, e.g., “Complete genome sequence of an M1 strain of Streptococcus pyogenes.” Ferretti J. J. et al., Proc. Natl. Acad. Sci. U.S.A. 98:4658-4663(2001); “CRISPR RNA maturation by trans-encoded small RNA and host factor RNase III.” Deltcheva E. et al., Nature 471:602-607(2011); and “A programmable dual-RNA-guided DNA endonuclease in adaptive bacterial immunity.” Jinek M. et al., Science 337:816-821(2012), the entire contents of each of which are incorporated herein by reference).

Since the napDNAbp nucleases (e.g., Cas9) use RNA:DNA hybridization to target DNA cleavage sites, these proteins are able to be targeted, in principle, to any sequence specified by the guide RNA. Methods of using napDNAbp nucleases, such as Cas9, for site-specific cleavage (e.g., to modify a genome) are known in the art (see e.g., Cong, L. et al. Multiplex genome engineering using CRISPR/Cas systems. Science 339, 819-823 (2013); Mali, P. et al. RNA-guided human genome engineering via Cas9. Science 339, 823-826 (2013); Hwang, W. Y. et al. Efficient genome editing in zebrafish using a CRISPR-Cas system. Nature Biotechnology 31, 227-229 (2013); Jinek, M. et al. RNA-programmed genome editing in human cells. eLife 2, e00471 (2013); Dicarlo, J. E. et al., Genome engineering in Saccharomyces cerevisiae using CRISPR-Cas systems. Nucleic Acid Res. (2013); Jiang, W. et al. RNA-guided editing of bacterial genomes using CRISPR-Cas systems. Nature Biotechnology 31, 233-239 (2013); the entire contents of each of which are incorporated herein by reference).

Nickase

The term “nickase” refers to a napDNAbp having only a single nuclease activity that cuts only one strand of a target DNA, rather than both strands. Thus, a nickase type napDNAbp does not leave a double-strand break. In some embodiments, any of the base editors disclosed herein may comprise a nickase (such as a Cas9 nickase) fused, for example, to any of the DddA variants disclosed herein.

Nuclear Localization Signal

In various embodiments, the base editors or the polypeptides that comprise the base editors disclosed herein (e.g., the zinc finger domain-containing protein and DddA variant fusions described herein) may be further engineered to include one or more nuclear localization signals.

A nuclear localization signal or sequence (NLS) is an amino acid sequence that tags, designates, or otherwise marks a protein for import into the cell nucleus by nuclear transport. Typically, this signal consists of one or more short sequences of positively charged lysine or arginine residues exposed on the protein surface. Different nuclear localized proteins may share the same NLS. An NLS has the opposite function of a nuclear export signal (NES), which targets proteins out of the nucleus. Thus, a single nuclear localization signal can direct the entity with which it is associated to the nucleus of a cell. Such sequences may be of any size and composition, for example more than 4, 5, 6, 7, 8, 9, 10, 12, 15, 20, or 25 amino acids, but will preferably comprise at least a four to eight amino acid sequence known to function as a nuclear localization signal (NLS).

Nucleic Acid Molecule

The term “nucleic acid,” as used herein, refers to a polymer of nucleotides. The polymer may include natural nucleosides (i.e., adenosine, thymidine, guanosine, cytidine, uridine, deoxyadenosine, deoxythymidine, deoxyguanosine, and deoxycytidine), nucleoside analogs (e.g., 2-aminoadenosine, 2-thiothymidine, inosine, pyrrolo-pyrimidine, 3-methyl adenosine, 5-methylcytidine, C5 bromouridine, C5 fluorouridine, C5 iodouridine, C5 propynyl uridine, C5 propynyl cytidine, C5 methylcytidine, 7 deazaadenosine, 7 deazaguanosine, 8 oxoadenosine, 8 oxoguanosine, O(6)-methylguanine, 4-acetylcytidine, 5-(carboxyhydroxymethyl)uridine, dihydrouridine, methylpseudouridine, 1-methyl adenosine, 1-methyl guanosine, N6-methyl adenosine, and 2-thiocytidine), chemically modified bases, biologically modified bases (e.g., methylated bases), intercalated bases, modified sugars (e.g., 2′-fluororibose, ribose, 2′-deoxyribose, 2′-O-methylcytidine, arabinose, and hexose), or modified phosphate groups (e.g., phosphorothioates and 5′-N-phosphoramidite linkages).

Programmable DNA Binding Protein (pDNAbp)

As used herein, the term “programmable DNA binding protein,” “pDNA binding protein,” “pDNA binding protein domain” or “pDNAbp” refers to any protein that localizes to and binds a specific target DNA nucleotide sequence (e.g., a gene locus of a genome). This term embraces RNA-programmable proteins, which associate (e.g., form a complex) with one or more nucleic acid molecules (i.e., which includes, for example, guide RNA in the case of Cas systems) that direct or otherwise program the protein to localize to a specific target nucleotide sequence (e.g., DNA sequence) that is complementary to the one or more nucleic acid molecules (or a portion or region thereof) associated with the protein. The term also embraces proteins which bind directly to a nucleotide sequence in an amino acid-programmable manner, e.g., zinc finger proteins and TALE proteins. Exemplary RNA-programmable proteins are CRISPR-Cas9 proteins, as well as Cas9 equivalents, homologs, orthologs, or paralogs, whether naturally occurring or non-naturally occurring (e.g., engineered or modified), and may include a Cas9 equivalent from any type of CRISPR system (e.g., type II, V, VI), including Cpf1 (a type V CRISPR-Cas system), C2c1 (a type V CRISPR-Cas system), C2c2 (a type VI CRISPR-Cas system), C2c3 (a type V CRISPR-Cas system), dCas9, GeoCas9, CjCas9, Cas12a, Cas12b, Cas12c, Cas12d, Cas12g, Cas12h, Cas12i, Cas13d, Cas14, Argonaute, and nCas9. Further Cas-equivalents are described in Makarova et al., “C2c2 is a single-component programmable RNA-guided RNA-targeting CRISPR effector,” Science 2016; 353(6299), the contents of which are incorporated herein by reference.

Protein, Peptide, and Polypeptide

The terms “protein,” “peptide,” and “polypeptide” are used interchangeably herein and refer to a polymer of amino acid residues linked together by peptide (amide) bonds. The terms refer to a protein, peptide, or polypeptide of any size, structure, or function. Typically, a protein, peptide, or polypeptide will be at least three amino acids long. A protein, peptide, or polypeptide may refer to an individual protein or a collection of proteins. One or more of the amino acids in a protein, peptide, or polypeptide may be modified, for example, by the addition of a chemical entity such as a carbohydrate group, a hydroxyl group, a phosphate group, a farnesyl group, an isofarnesyl group, a fatty acid group, a linker for conjugation, functionalization, or other modification, etc. A protein, peptide, or polypeptide may also be a single molecule or may be a multi-molecular complex. A protein, peptide, or polypeptide may be just a fragment of a naturally occurring protein or peptide. A protein, peptide, or polypeptide may be naturally occurring, recombinant, or synthetic, or any combination thereof. Any of the proteins provided herein may be produced by any method known in the art. For example, the proteins provided herein may be produced via recombinant protein expression and purification, which is especially suited for fusion proteins comprising a peptide linker. Methods for recombinant protein expression and purification are well known, and include those described by Green and Sambrook, Molecular Cloning: A Laboratory Manual (4th ed., Cold Spring Harbor Laboratory Press, Cold Spring Harbor, N.Y. (2012)), the contents of which are incorporated herein by reference.

Split Site (e.g., of a DddA)

As used herein, the term “split site,” as in a split site of a DddA, refers to a specific peptide bond between any two immediately adjacent amino acid residues in the amino acid sequence of a DddA at which the complete DddA polypeptide is divided into two half portions, i.e., an N-terminal half portion and a C-terminal half portion. The N-terminal half portion of the DddA may be referred to as “DddA-N half” and the C-terminal half portion of the DddA may be referred to as the “DddA-C half.” Alternately, DddA-N half may be referred to as the “DddA-N fragment or portion” and the DddA-C half may be referred to as the “DddA-C fragment or portion.” Depending on the location of the split site, the DddA-N half and the DddA-C half may be the same or different size and/or sequence length. The term “half” does not connote the requirement that the DddA-N and DddA-C portions are identically half of the size and/or sequence length of a complete DddA, or that the split site is required to be at the midpoint of the complete DddA polypeptide. To the contrary, and as noted above, the split site can be between any pair of residues in the DddA polypeptide, thereby giving rise to half portions which are unequal in size and/or sequence length. For example, the split site may be such that the DddA polypeptide is split at amino acid position 1397 of DddA (e.g., as in the DddA variant proteins disclosed herein).

For clarity, as used herein, the term “half” when used in the context of a split molecule (e.g., protein, intein, delivery molecule, nucleic acid, etc.), shall not be interpreted to require, and shall not imply, that the size of the resulting portions (e.g., as “split” or broken into smaller portions) of the molecule are one-half (e.g., ½, 50%) of the original molecule. The term shall be interpreted to be illustrative of the idea that they are portion(s) of a larger molecule that has been broken into smaller fragments (e.g., portions), but that when reconstituted may regain the activity of the molecule as a whole. Thus, by way of example, a half (e.g., portion) may be any portion of the molecule from which it is obtained (e.g., is less than 100% of the whole of the molecule), such that there is at least one additional portion formed (e.g., a second half, other half, second portion), which also is less than 100% of the whole of the molecule. It is important to note that the molecule may be formed into additional portions (e.g., third, fourth, etc., halves (e.g., portions)), and such additional halves do not constitute a molecule larger than or in addition to the whole from which they were derived. Further, it should be noted that in the event there are more than two halves (e.g., two portions) formed from the splitting of a molecule, it may only require two of the portions to reconstitute the activity of the molecule as a whole. By way of example, if an enzyme is split into three halves (e.g., three portions), wherein the catalytic domain of the enzyme possessing the enzymatic activity of interest is only split into two halves (e.g., two portions), only the two portions of the catalytic domain may be necessary to be used to carry out the activity of interest. Thus, when referring to using two halves, it is not necessary that the two halves, together, comprise 100% of the whole of the molecule from which they were derived. 
In certain embodiments, the split site is within a loop region of the DddA.

As used herein, reference to “splitting a DddA at a split site” embraces direct and indirect means for obtaining two half portions of a DddA. In one embodiment, splitting a DddA refers to the direct splitting of a DddA polypeptide at a split site in the protein to obtain the DddA-N and DddA-C half portions. For example, the cleaving of a peptide bond between two adjacent amino acid residues at a split site may be achieved by enzymatic or chemical means. In another embodiment, a DddA may be split by engineering separate nucleic acid sequences, each encoding a different half portion of the DddA. Such methods can be used to obtain expression vectors for expressing the DddA half portions in a cell in order to reconstitute DddA activity.
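As a non-limiting illustration of the indirect approach, the two half portions may be derived computationally from a complete amino acid sequence before the separate coding sequences are designed. The following Python sketch assumes that the split site is the peptide bond immediately after the named residue (so that the N-terminal portion ends with that residue), and that the supplied sequence may begin at a residue number other than 1. The function name, the residue numbering, and the toy sequence are hypothetical and do not represent a DddA sequence.

```python
from typing import Tuple


def split_at(sequence: str, split_after: int, first_residue_number: int = 1) -> Tuple[str, str]:
    """Return (N-terminal portion, C-terminal portion) of a sequence.

    Assumption: the split site is the peptide bond directly after residue
    number `split_after`, using the numbering of the full-length protein.
    `first_residue_number` is the number of the first residue in `sequence`.
    """
    index = split_after - first_residue_number + 1
    if not 0 < index < len(sequence):
        raise ValueError("split site must lie between two residues of the sequence")
    return sequence[:index], sequence[index:]


# Toy ten-residue sequence numbered 101-110, split after residue 103.
n_portion, c_portion = split_at("ACDEFGHIKL", split_after=103, first_residue_number=101)
print(n_portion, c_portion)
```

Consistent with the definition of “half” above, the two portions returned by such a sketch need not be equal in length; they only need to rejoin to the complete sequence.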

Subject

The term “subject,” as used herein, refers to an individual organism, for example, an individual mammal. In some embodiments, the subject is a human. In some embodiments, the subject is a non-human mammal. In some embodiments, the subject is a non-human primate. In some embodiments, the subject is a rodent. In some embodiments, the subject is a sheep, a goat, a cow, a cat, or a dog. In some embodiments, the subject is a vertebrate, an amphibian, a reptile, a fish, an insect, a fly, or a nematode. In some embodiments, the subject is a research animal. In some embodiments, the subject is genetically engineered, e.g., a genetically engineered non-human subject. The subject may be of either sex and at any stage of development.

Substitution

The terms “substitution” and “mutation,” as used herein, refer to a substitution of a residue within a sequence, e.g., a nucleic acid or amino acid sequence, with another residue; a deletion or insertion of one or more residues within a sequence; or a substitution of a residue within a sequence of a genome in a subject to be corrected. Mutations are typically described herein by identifying the original residue followed by the position of the residue within the sequence, and then by the identity of the newly substituted residue. Various methods for making the amino acid substitutions (mutations) provided herein are well known in the art, and are provided by, for example, Green and Sambrook, Molecular Cloning: A Laboratory Manual (4th ed., Cold Spring Harbor Laboratory Press, Cold Spring Harbor, N.Y. (2012)). The terms mutation and substitution can include a variety of categories, such as single base polymorphisms, microduplication regions, indels, and inversions, and are not meant to be limiting in any way. Mutations can include “loss-of-function” mutations, which are mutations that reduce or abolish a protein activity. Most loss-of-function mutations are recessive, because in a heterozygote the second chromosome copy carries an unmutated version of the gene coding for a fully functional protein whose presence compensates for the effect of the mutation. There are some exceptions where a loss-of-function mutation is dominant, one example being haploinsufficiency, where the organism is unable to tolerate the approximately 50% reduction in protein activity suffered by the heterozygote. This is the explanation for a few genetic diseases in humans, including Marfan syndrome, which results from a mutation in the gene for the connective tissue protein called fibrillin. Mutations also embrace “gain-of-function” mutations, which are substitutions that confer an abnormal activity on a protein or cell that is otherwise not present in a normal (wild type) condition.
Many gain-of-function mutations are in regulatory sequences rather than in coding regions, and they can therefore have a number of consequences. For example, a mutation might lead to one or more genes being expressed in the wrong tissues, these tissues gaining functions that they normally lack. Alternatively, the mutation could lead to overexpression of one or more genes involved in control of the cell cycle, thus leading to uncontrolled cell division and hence to cancer. Because of their nature, gain-of-function mutations are usually dominant.
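The naming convention for substitutions described above (original residue, then position, then newly substituted residue) can be illustrated with the following Python sketch. The sketch is offered only as an aid to reading such labels; the class and function names, the label “A3G,” and the four-residue toy sequence are hypothetical, and residue numbering is assumed to start at 1 unless stated otherwise.

```python
import re
from typing import NamedTuple


class Substitution(NamedTuple):
    original: str      # residue present before the change
    position: int      # residue number within the sequence
    replacement: str   # newly substituted residue


# One letter, one or more digits, one letter (e.g., "A3G").
_LABEL = re.compile(r"([A-Za-z])(\d+)([A-Za-z])")


def parse_substitution(label: str) -> Substitution:
    """Parse a label written as original residue + position + new residue."""
    match = _LABEL.fullmatch(label.strip())
    if match is None:
        raise ValueError(f"not an original-position-replacement label: {label!r}")
    original, position, replacement = match.groups()
    return Substitution(original.upper(), int(position), replacement.upper())


def apply_substitution(sequence: str, sub: Substitution, first_residue_number: int = 1) -> str:
    """Return the sequence with the substitution made, checking the original residue."""
    index = sub.position - first_residue_number
    if not 0 <= index < len(sequence):
        raise ValueError("position lies outside the sequence")
    if sequence[index] != sub.original:
        raise ValueError(f"expected {sub.original} at position {sub.position}, found {sequence[index]}")
    return sequence[:index] + sub.replacement + sequence[index + 1:]


sub = parse_substitution("A3G")
print(sub, apply_substitution("MKAL", sub))
```

Checking the original residue before substituting guards against applying a label to a sequence that uses a different numbering scheme.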

Target Site

The term “target site” refers to a sequence within a nucleic acid molecule that is edited by a zinc finger base editor disclosed herein. The target site further refers to the sequence within a nucleic acid molecule to which a base editor binds.

Treatment

The terms “treatment,” “treat,” and “treating,” as used herein, refer to a clinical intervention aimed to reverse, alleviate, delay the onset of, or inhibit the progress of a disease or disorder, or one or more symptoms thereof, as described herein. In some embodiments, treatment may be administered after one or more symptoms have developed and/or after a disease has been diagnosed. In other embodiments, treatment may be administered in the absence of symptoms, e.g., to prevent or delay onset of a symptom or inhibit onset or progression of a disease. For example, treatment may be administered to a susceptible individual prior to the onset of symptoms (e.g., in light of a history of symptoms and/or in light of genetic or other susceptibility factors). Treatment may also be continued after symptoms have resolved, for example, to prevent or delay their recurrence.

Uracil Glycosylase Inhibitor (UGI)

The term “uracil glycosylase inhibitor” or “UGI,” as used herein, refers to a protein that is capable of inhibiting a uracil-DNA glycosylase base-excision repair enzyme. In some embodiments, a UGI domain comprises a wild-type UGI or a UGI as set forth in SEQ ID NO: 351. In some embodiments, the UGI proteins provided herein include fragments of UGI and proteins homologous to a UGI or a UGI fragment. For example, in some embodiments, a UGI domain comprises a fragment of the amino acid sequence set forth in SEQ ID NO: 351. In some embodiments, a UGI fragment comprises an amino acid sequence that comprises at least 60%, at least 65%, at least 70%, at least 75%, at least 80%, at least 85%, at least 90%, at least 95%, at least 96%, at least 97%, at least 98%, at least 99%, or at least 99.5% of the amino acid sequence as set forth in SEQ ID NO: 351. In some embodiments, a UGI comprises an amino acid sequence homologous to the amino acid sequence set forth in SEQ ID NO: 351, or an amino acid sequence homologous to a fragment of the amino acid sequence set forth in SEQ ID NO: 351. In some embodiments, proteins comprising UGI, or fragments of UGI or homologs of UGI, are referred to as “UGI variants.” A UGI variant shares homology to UGI, or a fragment thereof. For example, a UGI variant is at least 70% identical, at least 75% identical, at least 80% identical, at least 85% identical, at least 90% identical, at least 95% identical, at least 96% identical, at least 97% identical, at least 98% identical, at least 99% identical, at least 99.5% identical, or at least 99.9% identical to a wild type UGI or a UGI as set forth in SEQ ID NO: 351. 
In some embodiments, the UGI variant comprises a fragment of UGI, such that the fragment is at least 70% identical, at least 80% identical, at least 90% identical, at least 95% identical, at least 96% identical, at least 97% identical, at least 98% identical, at least 99% identical, at least 99.5% identical, or at least 99.9% identical to the corresponding fragment of wild-type UGI or a UGI as set forth in SEQ ID NO: 351. In some embodiments, the UGI comprises the following amino acid sequence: MTNLSDIIEKETGKQLVIQESILMLPEEVEEVIGNKPESDILVHTAYDESTDENVMLLTSDAPEYKPWALVIQDSNGENKIKML (SEQ ID NO: 351) (P14739|UNGI_BPPB2 Uracil-DNA glycosylase inhibitor), or the same sequence but without the N-terminal methionine.

Other UGI proteins may include those described in Example 6, as follows:

Canonical UGI: TNLSDIIEKETGKQLVIQESILMLPEEVEEVIGNKPESDILVHTAYDESTDENVMLLTSDAPEYKPWALVIQDSNGENKIKML (SEQ ID NO: 358)

UGI2: MTLELQLKHYITNLFNLPKDEKWHCESIEEIADDILPDQYVRLGALSNKILQTYTYYSDTLHESNIYPFILYYQKQLIAIGYIDENHDMDFLYLHNTIMPLLDQRYLLTGGQ (SEQ ID NO: 352)

UGI3: MNKNFDEVKADLRTVTGKKIEFKERLKNILRVQMNQLGFEDSYMIQVQVSSDQEEWVECHENMSLSDFEVMYGNISGEIKRMTVVKYEEANIEKLVELKFEYEYAKAHQEYIRAYTKLMSNTLYGRKPSL (SEQ ID NO: 353)

UGI5: MNEEKMHYRDAIKEVELTMMSLDSHFRTHKEFTDSYLLVLILEDVVGETRVEVSEGLTFDEASYIIGGTSDNILNMHMINYCEKNREEIYKWLKVSRVNTFKSNYAKMLLNTAYGKDLLKGVVK (SEQ ID NO: 354)

UGI7: MNNHFMSIGRNCSKCNNVRLNEDFSKSEEICNECFDKEERFVDSYTLIYITEDETGKRFEAILENQTIEETEIIYGNIIDKIIVWNVILTM (SEQ ID NO: 355)

UGI12: DGNEHWEVHPGLSLSDFEVVYGNNPHQIVKLRLDKEVGGSGGSMVQNDFIDSYTLCWLLRDDSGGGGSMVQNDFIDSYTLCWLLRDDDGNEHWEVHPGLSLSDFEVVYGNNPHQIVKLRLDKEV (SEQ ID NO: 350)

Variant

As used herein, the term “variant” should be taken to mean the exhibition of qualities that have a pattern that deviates from what occurs in nature, e.g., a variant zinc finger protein is a zinc finger protein comprising one or more changes in amino acid residues as compared to a wild type zinc finger protein amino acid sequence. A variant deaminase is a deaminase comprising one or more changes in amino acid residues as compared to a wild type deaminase amino acid sequence. The term “variant” encompasses homologous proteins having at least 75%, at least 80%, at least 85%, at least 90%, at least 95%, or at least 99% identity with a reference sequence and having the same or substantially the same functional activity or activities as the reference sequence. The term also encompasses mutants, truncations, or domains of a reference sequence that display the same or substantially the same functional activity or activities as the reference sequence.

Vector

The term “vector,” as used herein, refers to a nucleic acid that can be modified to encode a gene of interest and that is able to enter a host cell, mutate, and replicate within the host cell, and then transfer a replicated form of the vector into another host cell. Exemplary suitable vectors include viral vectors, such as retroviral vectors or bacteriophages and filamentous phage, and conjugative plasmids. Additional suitable vectors will be apparent to those of skill in the art based on the instant disclosure.

Wild Type

As used herein the term “wild type” is a term of the art understood by skilled persons and means the typical form of an organism, strain, gene, or characteristic as it occurs in nature as distinguished from mutant or variant forms.

Zinc Finger DNA Binding Protein and Zinc Finger Motifs

A “zinc finger DNA binding protein or polypeptide” is a protein or polypeptide that comprises at least one zinc finger motif and is capable of and/or has the property of being able to bind to a DNA molecule in a “programmable manner.” As used herein, a “zinc finger motif” is a polypeptide comprising an amino acid sequence that folds into a three-dimensional structure that is held together and stabilized by the coordinated binding by certain amino acid residues (e.g., cysteine and histidine) in the zinc finger motif to a zinc ion. The amino acid sequence of the zinc finger motif “programs” or determines the sequence of DNA to which it can bind. As used herein, a protein domain that comprises at least one zinc finger motif may be referred to as a “zinc finger domain.” Further, a zinc finger DNA binding protein may be regarded more broadly as a type of “zinc finger domain-containing protein or polypeptide.” A zinc finger domain-containing protein or polypeptide is any protein or polypeptide that comprises at least one zinc finger motif. In certain embodiments, the zinc finger domain-containing protein may comprise an array of two or more zinc finger motifs arranged in a continuous or non-continuous pattern or repeating array (e.g., 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, or 12 or more zinc finger motifs).

Zinc finger DNA binding proteins or polypeptides (which may be referred to more generally as “zinc finger protein or polypeptide” or “ZFP”) can be “engineered” to bind to a predetermined or target nucleotide sequence. Non-limiting examples of methods for engineering zinc finger proteins include sequence design and selection approaches. Such engineered proteins do not occur in nature. Rational criteria for engineering such proteins include application of substitution rules and computerized algorithms for processing information in a database storing information of existing ZFP designs, sequences, and binding data. See, for example, U.S. Pat. Nos. 6,140,081; 6,453,242; 6,534,261; and 6,785,613; see, also WO 98/53058; WO 98/53059; WO 98/53060; WO 02/016536; and WO 03/016496; and U.S. Pat. Nos. 6,746,838; 6,866,997; and 7,030,215, each of which is incorporated herein by reference.

The present application also relates to zinc finger nucleases (“ZFNs”). ZFNs are artificial restriction enzymes generated by fusing a zinc finger DNA-binding protein or domain to a DNA-cleavage domain. Zinc finger DNA-binding domains can be engineered to target specific desired DNA sequences, which enables zinc finger nucleases to target unique sequences within complex genomes.

The DNA-binding domains of individual ZFNs typically contain between three and six individual zinc finger motifs (each containing a β-motif, a DNA recognition motif, and an α-motif as described further herein) and can each recognize between 9 and 18 base pairs. The repeating units of individual zinc finger motifs of the DNA-binding domain can be referred to as a “zinc finger repeat” or “zinc finger array.” Each individual zinc finger motif is typically joined together by a linker motif. If the zinc finger domains are specific for their intended target site, a pair of 3-finger ZFNs that recognize a total of 18 base pairs can, in theory, target a single locus in a mammalian genome. The most straightforward method to generate new zinc finger arrays is to combine smaller zinc finger “modules” of known specificity. The most common modular assembly process involves combining three separate zinc finger motifs that can each recognize a 3 base pair DNA sequence to generate a 3-finger zinc finger array that can recognize a 9 base pair target site.
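As a rough illustration of why an 18 base pair site can, in theory, be unique in a mammalian genome while a 9 base pair site generally cannot, the following sketch estimates how often a site of a given length would be expected to occur by chance. It assumes a random sequence with equal base frequencies, counts both strands, and uses approximate genome lengths (about 3.1 × 10⁹ base pairs for a haploid human nuclear genome and 16,569 base pairs for human mtDNA). The function name and these figures are illustrative assumptions and are not taken from the present disclosure.

```python
def expected_random_hits(site_length_bp, genome_length_bp, strands=2):
    """Expected chance occurrences of one specific site in a random genome.

    Assumes independent, equiprobable bases, which real genomes only approximate.
    """
    return strands * genome_length_bp / 4 ** site_length_bp


HUMAN_NUCLEAR_BP = 3.1e9  # approximate haploid genome size (assumption)
HUMAN_MTDNA_BP = 16_569   # approximate length of human mtDNA (assumption)

for fingers in (3, 6):
    site = 3 * fingers  # each zinc finger motif reads about 3 base pairs
    print(
        f"{site} bp site:",
        f"nuclear ~{expected_random_hits(site, HUMAN_NUCLEAR_BP):.3g},",
        f"mtDNA ~{expected_random_hits(site, HUMAN_MTDNA_BP):.3g}",
    )
```

Under these simplifying assumptions, a 9 base pair site is expected to occur tens of thousands of times in a nuclear genome, whereas an 18 base pair site is expected to occur less than once.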

DETAILED DESCRIPTION

The present disclosure is based on the development by the inventors of engineered zinc finger domain-containing proteins, DddA variants, and fusion proteins comprising the same that display increased on-target base editing activity and/or decreased off-target base editing activity. In particular, the proteins and fusion proteins provided herein may be especially useful for editing mitochondrial DNA due to the small size of zinc finger proteins, as described further herein. Thus, the present disclosure provides zinc finger domain-containing proteins comprising optimized α-, β-, and/or linker motifs, and fusion proteins comprising said zinc finger domain-containing proteins fused to an effector domain (e.g., a deaminase, or any other effector protein including but not limited to those described herein). The present disclosure also provides DddA variants and fusion proteins comprising said DddA variants (for example, fused to a programmable DNA binding protein, such as any of the zinc finger domain-containing proteins disclosed herein, or a CRISPR/Cas9 protein). Methods for editing DNA (including, e.g., genomic DNA and mitochondrial DNA) using the fusion proteins described herein are also provided by the present disclosure. The present disclosure further provides polynucleotides, vectors, cells, kits, and pharmaceutical compositions comprising the zinc finger domain-containing proteins, DddA variants, and fusion proteins described herein.

Zinc Finger Domain-Containing Proteins

In one aspect, the present disclosure provides engineered zinc finger domain-containing proteins. Engineered zinc finger arrays are most commonly constructed based on the sequence of Zif268, a murine transcription factor. As described further herein, the inventors found that zinc finger scaffold sequences with improved activity (for example, improved base editing activity when linked to a deaminase in the context of a fusion protein) could be developed by searching the human proteome for the ZF consensus sequence: x(2)-C-x(2,4)-C-x(12)-H-x(3)-H-x(4,5)-P, where C and H are conserved Cys and His residues that coordinate the Zn²⁺ ion, P is a conserved Pro residue at the end of the linker motif, and x can be any amino acid residue. Through this search, several ZF sequences from the human proteome were discovered, and these sequences were separated and filtered to extract new beta-motif sequences, new alpha-motif sequences, and new linker motif sequences. As described herein, all of the sequences identified within each class were aligned, and an amino acid frequency calculation was performed to determine the frequency at which each amino acid was found at each position within each of the three types of motif sequences. This provided a basis set of amino acids from which to construct new motif sequences. All possible permutations of these sequences were created, which resulted in the creation of new linker motifs, alpha-motifs, and beta-motifs. Sequences for each of these motifs are provided in the following tables.
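By way of illustration only, the consensus search and the per-position frequency calculation described above can be sketched as follows. The sketch reads the consensus as a PROSITE-style pattern in which x(n) denotes n arbitrary residues and x(m,n) denotes m to n arbitrary residues, and rewrites it as a regular expression. The function names and the toy input sequence are hypothetical, and the sketch is not the inventors' actual pipeline.

```python
import re
from collections import Counter

# x(2)-C-x(2,4)-C-x(12)-H-x(3)-H-x(4,5)-P rewritten as a regular expression.
# The lookahead wrapper lets overlapping candidate matches be reported.
ZF_CONSENSUS = re.compile(r"(?=(..C.{2,4}C.{12}H.{3}H.{4,5}P))")


def find_zf_matches(protein):
    """Return (start, matched_sequence) for each match to the consensus."""
    return [(m.start(), m.group(1)) for m in ZF_CONSENSUS.finditer(protein)]


def position_frequencies(aligned_motifs):
    """Count residues at each position of equal-length, pre-aligned motifs."""
    return [Counter(column) for column in zip(*aligned_motifs)]


# Toy input (hypothetical): a cap followed by one finger built from motifs
# listed in this disclosure (beta-motif, recognition motif, alpha-motif, linker).
toy_protein = "MAERP" + "YKCPECGKSFS" + "RSDNLVR" + "HQRTH" + "TGEKP"
hits = find_zf_matches(toy_protein)

# Frequencies over a small, hypothetical set of extracted linker motifs.
linker_frequencies = position_frequencies(["TGEKP", "SGERP", "TGEKP"])
```

In this toy case the single match spans the β-motif, the seven-residue recognition motif, the α-motif, and the linker motif, which shows how the four segments map onto the consensus.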

Zinc finger linker motif sequences disclosed herein include those of SEQ ID NOs: 1-24:

TGEKP (SEQ ID NO: 1)

TGERP (SEQ ID NO: 2)

TGKKP (SEQ ID NO: 3)

TGKRP (SEQ ID NO: 4)

TGDKP (SEQ ID NO: 5)

TGDRP (SEQ ID NO: 6)

TEEKP (SEQ ID NO: 7)

TEERP (SEQ ID NO: 8)

TEKKP (SEQ ID NO: 9)

TEKRP (SEQ ID NO: 10)

TEDKP (SEQ ID NO: 11)

TEDRP (SEQ ID NO: 12)

SGEKP (SEQ ID NO: 13)

SGERP (SEQ ID NO: 14)

SGKKP (SEQ ID NO: 15)

SGKRP (SEQ ID NO: 16)

SGDKP (SEQ ID NO: 17)

SGDRP (SEQ ID NO: 18)

SEEKP (SEQ ID NO: 19)

SEERP (SEQ ID NO: 20)

SEKKP (SEQ ID NO: 21)

SEKRP (SEQ ID NO: 22)

SEDKP (SEQ ID NO: 23)

SEDRP (SEQ ID NO: 24)

In some embodiments, the present disclosure provides zinc finger proteins comprising one or more linker motifs of SEQ ID NOs: 1-24, or one or more linker motifs comprising an amino acid sequence at least 80%, at least 85%, at least 90%, at least 95%, at least 96%, at least 97%, at least 98%, or at least 99% identical to the amino acid sequence of any one of SEQ ID NOs: 1-24. In some embodiments, a zinc finger domain-containing protein comprises one or more linker motifs comprising the amino acid sequence of any one of TGEKP (SEQ ID NO: 1), SGEKP (SEQ ID NO: 13), SGERP (SEQ ID NO: 14), and SGDKP (SEQ ID NO: 17), or one or more linker motifs comprising an amino acid sequence at least 80%, at least 85%, at least 90%, at least 95%, at least 96%, at least 97%, at least 98%, or at least 99% identical to the amino acid sequence of any one of TGEKP (SEQ ID NO: 1), SGEKP (SEQ ID NO: 13), SGERP (SEQ ID NO: 14), and SGDKP (SEQ ID NO: 17). In certain embodiments, all of the linker motifs present in a zinc finger domain-containing protein each comprise the same amino acid sequence selected from the group consisting of TGEKP (SEQ ID NO: 1), SGEKP (SEQ ID NO: 13), SGERP (SEQ ID NO: 14), and SGDKP (SEQ ID NO: 17), or the same amino acid sequence that is at least 80%, at least 85%, at least 90%, at least 95%, at least 96%, at least 97%, at least 98%, or at least 99% identical to the amino acid sequence of any one of TGEKP (SEQ ID NO: 1), SGEKP (SEQ ID NO: 13), SGERP (SEQ ID NO: 14), and SGDKP (SEQ ID NO: 17).
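The 24 linker motifs of SEQ ID NOs: 1-24 are consistent with taking every combination of a small set of residues at each of the five positions. The following sketch enumerates those combinations; the per-position residue sets shown are inferred from the listed sequences and are an assumption, as is the helper name.

```python
from itertools import product


def enumerate_motifs(position_sets):
    """Build every motif obtainable by choosing one residue per position."""
    return ["".join(residues) for residues in product(*position_sets)]


# Residues observed at each linker position in SEQ ID NOs: 1-24 (inferred).
LINKER_POSITION_SETS = [("T", "S"), ("G", "E"), ("E", "K", "D"), ("K", "R"), ("P",)]

linkers = enumerate_motifs(LINKER_POSITION_SETS)
for seq_id, motif in enumerate(linkers, start=1):
    print(f"{motif} (SEQ ID NO: {seq_id})")
```

With the position sets ordered as shown, the enumeration order happens to coincide with the numbering of SEQ ID NOs: 1-24.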

Zinc Finger α-motif sequences disclosed herein include those of SEQ ID NOs: 25-42 and 346:

HQRIH (SEQ ID NO: 25)

HQRVH (SEQ ID NO: 26)

HQRTH (SEQ ID NO: 27)

HQKIH (SEQ ID NO: 28)

HQKVH (SEQ ID NO: 29)

HQKTH (SEQ ID NO: 30)

HMRIH (SEQ ID NO: 31)

HMRVH (SEQ ID NO: 32)

HMRTH (SEQ ID NO: 33)

HMKIH (SEQ ID NO: 34)

HMKVH (SEQ ID NO: 35)

HMKTH (SEQ ID NO: 36)

HKRIH (SEQ ID NO: 37)

HKRVH (SEQ ID NO: 38)

HKRTH (SEQ ID NO: 39)

HKKIH (SEQ ID NO: 40)

HKKVH (SEQ ID NO: 41)

HKKTH (SEQ ID NO: 42)

HIRTH (SEQ ID NO: 346)

In some embodiments, the present disclosure provides zinc finger proteins comprising one or more alpha motifs of SEQ ID NOs: 25-42 and 346, or one or more alpha motifs comprising an amino acid sequence at least 80%, at least 85%, at least 90%, at least 95%, at least 96%, at least 97%, at least 98%, or at least 99% identical to the amino acid sequence of any one of SEQ ID NOs: 25-42 and 346. In some embodiments, a zinc finger domain-containing protein comprises one or more α-motifs comprising the amino acid sequence of any one of HMRTH (SEQ ID NO: 33), HMKIH (SEQ ID NO: 34), HMKVH (SEQ ID NO: 35), HMKTH (SEQ ID NO: 36), and HIRTH (SEQ ID NO: 346), or one or more alpha motifs comprising an amino acid sequence at least 80%, at least 85%, at least 90%, at least 95%, at least 96%, at least 97%, at least 98%, or at least 99% identical to the amino acid sequence of any one of HMRTH (SEQ ID NO: 33), HMKIH (SEQ ID NO: 34), HMKVH (SEQ ID NO: 35), HMKTH (SEQ ID NO: 36), and HIRTH (SEQ ID NO: 346). In certain embodiments, all of the α-motifs present in a zinc finger domain-containing protein each comprise the same amino acid sequence selected from the group consisting of HMRTH (SEQ ID NO: 33), HMKIH (SEQ ID NO: 34), HMKVH (SEQ ID NO: 35), HMKTH (SEQ ID NO: 36), and HIRTH (SEQ ID NO: 346), or the same amino acid sequence that is at least 80%, at least 85%, at least 90%, at least 95%, at least 96%, at least 97%, at least 98%, or at least 99% identical to the amino acid sequence of any one of HMRTH (SEQ ID NO: 33), HMKIH (SEQ ID NO: 34), HMKVH (SEQ ID NO: 35), HMKTH (SEQ ID NO: 36), and HIRTH (SEQ ID NO: 346).

Zinc Finger β-motif sequences disclosed herein include those of SEQ ID NOs: 43-138 and 336-345:

YKCKECGKAFS (SEQ ID NO: 43)

YKCKECGKAFR (SEQ ID NO: 44)

YKCKECGKAFN (SEQ ID NO: 45)

YKCKECGKSFS (SEQ ID NO: 46)

YKCKECGKSFR (SEQ ID NO: 47)

YKCKECGKSFN (SEQ ID NO: 48)

YKCNECGKAFS (SEQ ID NO: 49)

YKCNECGKAFR (SEQ ID NO: 50)

YKCNECGKAFN (SEQ ID NO: 51)

YKCNECGKSFS (SEQ ID NO: 52)

YKCNECGKSFR (SEQ ID NO: 53)

YKCNECGKSFN (SEQ ID NO: 54)

YKCSECGKAFS (SEQ ID NO: 55)

YKCSECGKAFR (SEQ ID NO: 56)

YKCSECGKAFN (SEQ ID NO: 57)

YKCSECGKSFS (SEQ ID NO: 58)

YKCSECGKSFR (SEQ ID NO: 59)

YKCSECGKSFN (SEQ ID NO: 60)

YKCEECGKAFS (SEQ ID NO: 61)

YKCEECGKAFR (SEQ ID NO: 62)

YKCEECGKAFN (SEQ ID NO: 63)

YKCEECGKSFS (SEQ ID NO: 64)

YKCEECGKSFR (SEQ ID NO: 65)

YKCEECGKSFN (SEQ ID NO: 66)

YECKECGKAFS (SEQ ID NO: 67)

YECKECGKAFR (SEQ ID NO: 68)

YECKECGKAFN (SEQ ID NO: 69)

YECKECGKSFS (SEQ ID NO: 70)

YECKECGKSFR (SEQ ID NO: 71)

YECKECGKSFN (SEQ ID NO: 72)

YECNECGKAFS (SEQ ID NO: 73)

YECNECGKAFR (SEQ ID NO: 74)

YECNECGKAFN (SEQ ID NO: 75)

YECNECGKSFS (SEQ ID NO: 76)

YECNECGKSFR (SEQ ID NO: 77)

YECNECGKSFN (SEQ ID NO: 78)

YECSECGKAFS (SEQ ID NO: 79)

YECSECGKAFR (SEQ ID NO: 80)

YECSECGKAFN (SEQ ID NO: 81)

YECSECGKSFS (SEQ ID NO: 82)

YECSECGKSFR (SEQ ID NO: 83)

YECSECGKSFN (SEQ ID NO: 84)

YECEECGKAFS (SEQ ID NO: 85)

YECEECGKAFR (SEQ ID NO: 86)

YECEECGKAFN (SEQ ID NO: 87)

YECEECGKSFS (SEQ ID NO: 88)

YECEECGKSFR (SEQ ID NO: 89)

YECEECGKSFN (SEQ ID NO: 90)

FKCKECGKAFS (SEQ ID NO: 91)

FKCKECGKAFR (SEQ ID NO: 92)

FKCKECGKAFN (SEQ ID NO: 93)

FKCKECGKSFS (SEQ ID NO: 94)

FKCKECGKSFR (SEQ ID NO: 95)

FKCKECGKSFN (SEQ ID NO: 96)

FKCNECGKAFS (SEQ ID NO: 97)

FKCNECGKAFR (SEQ ID NO: 98)

FKCNECGKAFN (SEQ ID NO: 99)

FKCNECGKSFS (SEQ ID NO: 100)

FKCNECGKSFR (SEQ ID NO: 101)

FKCNECGKSFN (SEQ ID NO: 102)

FKCSECGKAFS (SEQ ID NO: 103)

FKCSECGKAFR (SEQ ID NO: 104)

FKCSECGKAFN (SEQ ID NO: 105)

FKCSECGKSFS (SEQ ID NO: 106)

FKCSECGKSFR (SEQ ID NO: 107)

FKCSECGKSFN (SEQ ID NO: 108)

FKCEECGKAFS (SEQ ID NO: 109)

FKCEECGKAFR (SEQ ID NO: 110)

FKCEECGKAFN (SEQ ID NO: 111)

FKCEECGKSFS (SEQ ID NO: 112)

FKCEECGKSFR (SEQ ID NO: 113)

FKCEECGKSFN (SEQ ID NO: 114)

FECKECGKAFS (SEQ ID NO: 115)

FECKECGKAFR (SEQ ID NO: 116)

FECKECGKAFN (SEQ ID NO: 117)

FECKECGKSFS (SEQ ID NO: 118)

FECKECGKSFR (SEQ ID NO: 119)

FECKECGKSFN (SEQ ID NO: 120)

FECNECGKAFS (SEQ ID NO: 121)

FECNECGKAFR (SEQ ID NO: 122)

FECNECGKAFN (SEQ ID NO: 123)

FECNECGKSFS (SEQ ID NO: 124)

FECNECGKSFR (SEQ ID NO: 125)

FECNECGKSFN (SEQ ID NO: 126)

FECSECGKAFS (SEQ ID NO: 127)

FECSECGKAFR (SEQ ID NO: 128)

FECSECGKAFN (SEQ ID NO: 129)

FECSECGKSFS (SEQ ID NO: 130)

FECSECGKSFR (SEQ ID NO: 131)

FECSECGKSFN (SEQ ID NO: 132)

FECEECGKAFS (SEQ ID NO: 133)

FECEECGKAFR (SEQ ID NO: 134)

FECEECGKAFN (SEQ ID NO: 135)

FECEECGKSFS (SEQ ID NO: 136)

FECEECGKSFR (SEQ ID NO: 137)

FECEECGKSFN (SEQ ID NO: 138)

YKCPECGKSFS (SEQ ID NO: 336)

YACPECGKSFS (SEQ ID NO: 337)

YACPECGRSFS (SEQ ID NO: 338)

YACPECDRSES (SEQ ID NO: 339)

YACPECDRSFS (SEQ ID NO: 340)

YACPECDRRES (SEQ ID NO: 341)

YACPVESCDRRFS (SEQ ID NO: 342)

YACPVESCDRSFS (SEQ ID NO: 343)

YACPVESCGKSFS (SEQ ID NO: 344)

FACDICGRKFA (SEQ ID NO: 345)

In some embodiments, the present disclosure provides zinc finger proteins comprising one or more beta motifs of SEQ ID NOs: 43-138 and 336-345, or one or more beta motifs comprising an amino acid sequence at least 80%, at least 85%, at least 90%, at least 95%, at least 96%, at least 97%, at least 98%, or at least 99% identical to the amino acid sequence of any one of SEQ ID NOs: 43-138 and 336-345. In some embodiments, a zinc finger domain-containing protein comprises one or more β-motifs comprising the amino acid sequence of any one of YKCNECGKAFN (SEQ ID NO: 51), YKCNECGKSFN (SEQ ID NO: 54), YKCSECGKAFN (SEQ ID NO: 57), YKCEECGKAFN (SEQ ID NO: 63), FKCNECGKAFN (SEQ ID NO: 99), FKCNECGKSFN (SEQ ID NO: 102), FKCSECGKAFN (SEQ ID NO: 105), FKCEECGKAFS (SEQ ID NO: 109), FKCEECGKAFN (SEQ ID NO: 111), FKCEECGKSFN (SEQ ID NO: 114), YACPECGKSFS (SEQ ID NO: 337), and FACDICGRKFA (SEQ ID NO: 345), or an amino acid sequence at least 80%, at least 85%, at least 90%, at least 95%, at least 96%, at least 97%, at least 98%, or at least 99% identical to any one of YKCNECGKAFN (SEQ ID NO: 51), YKCNECGKSFN (SEQ ID NO: 54), YKCSECGKAFN (SEQ ID NO: 57), YKCEECGKAFN (SEQ ID NO: 63), FKCNECGKAFN (SEQ ID NO: 99), FKCNECGKSFN (SEQ ID NO: 102), FKCSECGKAFN (SEQ ID NO: 105), FKCEECGKAFS (SEQ ID NO: 109), FKCEECGKAFN (SEQ ID NO: 111), FKCEECGKSFN (SEQ ID NO: 114), YACPECGKSFS (SEQ ID NO: 337), and FACDICGRKFA (SEQ ID NO: 345). 
In certain embodiments, all of the β-motifs present in a zinc finger domain-containing protein each comprise the same amino acid sequence selected from the group consisting of YKCNECGKAFN (SEQ ID NO: 51), YKCNECGKSFN (SEQ ID NO: 54), YKCSECGKAFN (SEQ ID NO: 57), YKCEECGKAFN (SEQ ID NO: 63), FKCNECGKAFN (SEQ ID NO: 99), FKCNECGKSFN (SEQ ID NO: 102), FKCSECGKAFN (SEQ ID NO: 105), FKCEECGKAFS (SEQ ID NO: 109), FKCEECGKAFN (SEQ ID NO: 111), FKCEECGKSFN (SEQ ID NO: 114), YACPECGKSFS (SEQ ID NO: 337), and FACDICGRKFA (SEQ ID NO: 345), or the same amino acid sequence that is at least 80%, at least 85%, at least 90%, at least 95%, at least 96%, at least 97%, at least 98%, or at least 99% identical to any one of YKCNECGKAFN (SEQ ID NO: 51), YKCNECGKSFN (SEQ ID NO: 54), YKCSECGKAFN (SEQ ID NO: 57), YKCEECGKAFN (SEQ ID NO: 63), FKCNECGKAFN (SEQ ID NO: 99), FKCNECGKSFN (SEQ ID NO: 102), FKCSECGKAFN (SEQ ID NO: 105), FKCEECGKAFS (SEQ ID NO: 109), FKCEECGKAFN (SEQ ID NO: 111), FKCEECGKSFN (SEQ ID NO: 114), YACPECGKSFS (SEQ ID NO: 337), and FACDICGRKFA (SEQ ID NO: 345).

Thus, in one aspect, the present disclosure provides zinc finger domain-containing proteins comprising (i) one or more linker motifs, wherein each linker motif independently comprises the amino acid sequence of any one of SEQ ID NOs: 1-24; (ii) one or more α-motifs, wherein each α-motif independently comprises the amino acid sequence of any one of SEQ ID NOs: 25-42 and 346; and (iii) one or more β-motifs, wherein each β-motif independently comprises the amino acid sequence of any one of SEQ ID NOs: 43-138 and 336-345, or an amino acid sequence that is at least 90%, at least 95%, or at least 99% identical to the amino acid sequence of any one of SEQ ID NOs: 43-138 and 336-345.
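To give a sense of what the percent identity thresholds recited above mean for motifs this short, the following sketch computes identity position by position over two ungapped sequences of equal length. That definition is a simplifying assumption made for illustration; sequence identity may be defined differently elsewhere in the disclosure (for example, using an alignment algorithm), and the function name is hypothetical.

```python
def percent_identity(seq_a, seq_b):
    """Percent of positions with identical residues (ungapped, equal length)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("this simplified measure requires equal-length sequences")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)


# One substitution in a five-residue linker motif.
linker_identity = percent_identity("TGEKP", "SGEKP")

# One substitution in an eleven-residue beta-motif.
beta_identity = percent_identity("YKCNECGKAFN", "YKCNECGKSFN")
```

Under this simplified measure, a single substitution in a five-residue linker motif gives 80% identity, and a single substitution in an eleven-residue β-motif gives about 90.9% identity, so thresholds of 95% and above would be met only by an exact match to an eleven-residue β-motif.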

Zinc finger proteins consist of repeating subunits of the general structure [β-motif]-[DNA recognition motif]-[α-motif] joined together by a linker motif. Zinc finger proteins generally comprise at least three repeats of this general structure. In some embodiments, a zinc finger protein comprises three repeats of this general structure. In some embodiments, a zinc finger protein comprises four repeats of this general structure. In some embodiments, a zinc finger protein comprises five repeats of this general structure. In some embodiments, a zinc finger protein comprises six repeats of this general structure. In certain embodiments, a zinc finger domain-containing protein comprises any of the following structures:

- [first β-motif]-[first DNA recognition motif]-[first α-motif]-[first linker motif]-[second β-motif]-[second DNA recognition motif]-[second α-motif]-[second linker motif]-[third β-motif]-[third DNA recognition motif]-[third α-motif];
- [first β-motif]-[first DNA recognition motif]-[first α-motif]-[first linker motif]-[second β-motif]-[second DNA recognition motif]-[second α-motif]-[second linker motif]-[third β-motif]-[third DNA recognition motif]-[third α-motif]-[third linker motif]-[fourth β-motif]-[fourth DNA recognition motif]-[fourth α-motif];
- [first β-motif]-[first DNA recognition motif]-[first α-motif]-[first linker motif]-[second β-motif]-[second DNA recognition motif]-[second α-motif]-[second linker motif]-[third β-motif]-[third DNA recognition motif]-[third α-motif]-[third linker motif]-[fourth β-motif]-[fourth DNA recognition motif]-[fourth α-motif]-[fourth linker motif]-[fifth β-motif]-[fifth DNA recognition motif]-[fifth α-motif]; or
- [first β-motif]-[first DNA recognition motif]-[first α-motif]-[first linker motif]-[second β-motif]-[second DNA recognition motif]-[second α-motif]-[second linker motif]-[third β-motif]-[third DNA recognition motif]-[third α-motif]-[third linker motif]-[fourth β-motif]-[fourth DNA recognition motif]-[fourth α-motif]-[fourth linker motif]-[fifth β-motif]-[fifth DNA recognition motif]-[fifth α-motif]-[fifth linker motif]-[sixth β-motif]-[sixth DNA recognition motif]-[sixth α-motif].

Any of the zinc finger domain-containing proteins provided herein may further comprise an N-terminal cap. In some embodiments, an N-terminal cap comprises the amino acid sequence MAERP. Thus, in certain embodiments, a zinc finger domain-containing protein may comprise any of the following structures:

- [N-terminal cap]-[first β-motif]-[first DNA recognition motif]-[first α-motif]-[first linker motif]-[second β-motif]-[second DNA recognition motif]-[second α-motif]-[second linker motif]-[third β-motif]-[third DNA recognition motif]-[third α-motif];
- [N-terminal cap]-[first β-motif]-[first DNA recognition motif]-[first α-motif]-[first linker motif]-[second β-motif]-[second DNA recognition motif]-[second α-motif]-[second linker motif]-[third β-motif]-[third DNA recognition motif]-[third α-motif]-[third linker motif]-[fourth β-motif]-[fourth DNA recognition motif]-[fourth α-motif];
- [N-terminal cap]-[first β-motif]-[first DNA recognition motif]-[first α-motif]-[first linker motif]-[second β-motif]-[second DNA recognition motif]-[second α-motif]-[second linker motif]-[third β-motif]-[third DNA recognition motif]-[third α-motif]-[third linker motif]-[fourth β-motif]-[fourth DNA recognition motif]-[fourth α-motif]-[fourth linker motif]-[fifth β-motif]-[fifth DNA recognition motif]-[fifth α-motif]; or
- [N-terminal cap]-[first β-motif]-[first DNA recognition motif]-[first α-motif]-[first linker motif]-[second β-motif]-[second DNA recognition motif]-[second α-motif]-[second linker motif]-[third β-motif]-[third DNA recognition motif]-[third α-motif]-[third linker motif]-[fourth β-motif]-[fourth DNA recognition motif]-[fourth α-motif]-[fourth linker motif]-[fifth β-motif]-[fifth DNA recognition motif]-[fifth α-motif]-[fifth linker motif]-[sixth β-motif]-[sixth DNA recognition motif]-[sixth α-motif].

Any of the zinc finger domain-containing proteins provided herein may also further comprise a C-terminal cap. In some embodiments, a C-terminal cap comprises the amino acid sequence HTKIHLR. Thus, in certain embodiments, a zinc finger domain-containing protein may comprise any of the following structures:

- [first β-motif]-[first DNA recognition motif]-[first α-motif]-[first linker motif]-[second β-motif]-[second DNA recognition motif]-[second α-motif]-[second linker motif]-[third β-motif]-[third DNA recognition motif]-[third α-motif]-[C-terminal cap];
- [first β-motif]-[first DNA recognition motif]-[first α-motif]-[first linker motif]-[second β-motif]-[second DNA recognition motif]-[second α-motif]-[second linker motif]-[third β-motif]-[third DNA recognition motif]-[third α-motif]-[third linker motif]-[fourth β-motif]-[fourth DNA recognition motif]-[fourth α-motif]-[C-terminal cap];
- [first β-motif]-[first DNA recognition motif]-[first α-motif]-[first linker motif]-[second β-motif]-[second DNA recognition motif]-[second α-motif]-[second linker motif]-[third β-motif]-[third DNA recognition motif]-[third α-motif]-[third linker motif]-[fourth β-motif]-[fourth DNA recognition motif]-[fourth α-motif]-[fourth linker motif]-[fifth β-motif]-[fifth DNA recognition motif]-[fifth α-motif]-[C-terminal cap]; or
- [first β-motif]-[first DNA recognition motif]-[first α-motif]-[first linker motif]-[second β-motif]-[second DNA recognition motif]-[second α-motif]-[second linker motif]-[third β-motif]-[third DNA recognition motif]-[third α-motif]-[third linker motif]-[fourth β-motif]-[fourth DNA recognition motif]-[fourth α-motif]-[fourth linker motif]-[fifth β-motif]-[fifth DNA recognition motif]-[fifth α-motif]-[fifth linker motif]-[sixth β-motif]-[sixth DNA recognition motif]-[sixth α-motif]-[C-terminal cap].

In certain embodiments, any of the zinc finger domain-containing proteins provided herein may comprise both an N-terminal cap (e.g., MAERP) and a C-terminal cap (e.g., HTKIHLR). Thus, in certain embodiments, a zinc finger domain-containing protein may comprise any of the following structures:

- [N-terminal cap]-[first β-motif]-[first DNA recognition motif]-[first α-motif]-[first linker motif]-[second β-motif]-[second DNA recognition motif]-[second α-motif]-[second linker motif]-[third β-motif]-[third DNA recognition motif]-[third α-motif]-[C-terminal cap];
- [N-terminal cap]-[first β-motif]-[first DNA recognition motif]-[first α-motif]-[first linker motif]-[second β-motif]-[second DNA recognition motif]-[second α-motif]-[second linker motif]-[third β-motif]-[third DNA recognition motif]-[third α-motif]-[third linker motif]-[fourth β-motif]-[fourth DNA recognition motif]-[fourth α-motif]-[C-terminal cap];
- [N-terminal cap]-[first β-motif]-[first DNA recognition motif]-[first α-motif]-[first linker motif]-[second β-motif]-[second DNA recognition motif]-[second α-motif]-[second linker motif]-[third β-motif]-[third DNA recognition motif]-[third α-motif]-[third linker motif]-[fourth β-motif]-[fourth DNA recognition motif]-[fourth α-motif]-[fourth linker motif]-[fifth β-motif]-[fifth DNA recognition motif]-[fifth α-motif]-[C-terminal cap]; or
- [N-terminal cap]-[first β-motif]-[first DNA recognition motif]-[first α-motif]-[first linker motif]-[second β-motif]-[second DNA recognition motif]-[second α-motif]-[second linker motif]-[third β-motif]-[third DNA recognition motif]-[third α-motif]-[third linker motif]-[fourth β-motif]-[fourth DNA recognition motif]-[fourth α-motif]-[fourth linker motif]-[fifth β-motif]-[fifth DNA recognition motif]-[fifth α-motif]-[fifth linker motif]-[sixth β-motif]-[sixth DNA recognition motif]-[sixth α-motif]-[C-terminal cap].

Each of the linker, alpha, and beta motifs may comprise or consist of any of the various amino acid sequences provided herein, in any combination with one another. In certain embodiments, the present disclosure provides zinc finger proteins wherein each of the linker motifs present in the protein comprises the same amino acid sequence, each of the alpha-motifs present in the protein comprises the same amino acid sequence, and each of the beta-motifs present in the protein comprises the same amino acid sequence. For example, in some embodiments, the present disclosure provides zinc finger proteins comprising three repeating zinc finger motifs wherein each of the first, second, and third β-motifs comprise the same amino acid sequence, each of the first, second, and third α-motifs comprise the same amino acid sequence, and/or each of the first and second linker motifs comprise the same amino acid sequence. In some embodiments, the present disclosure provides zinc finger proteins comprising four repeating zinc finger motifs wherein each of the first, second, third, and fourth β-motifs comprise the same amino acid sequence, each of the first, second, third, and fourth α-motifs comprise the same amino acid sequence, and/or each of the first, second, and third linker motifs comprise the same amino acid sequence. In some embodiments, the present disclosure provides zinc finger proteins comprising five repeating zinc finger motifs wherein each of the first, second, third, fourth, and fifth β-motifs comprise the same amino acid sequence, each of the first, second, third, fourth, and fifth α-motifs comprise the same amino acid sequence, and/or each of the first, second, third, and fourth linker motifs comprise the same amino acid sequence. 
In some embodiments, the present disclosure provides zinc finger proteins comprising six repeating zinc finger motifs wherein each of the first, second, third, fourth, fifth, and sixth β-motifs comprise the same amino acid sequence, each of the first, second, third, fourth, fifth, and sixth α-motifs comprise the same amino acid sequence, and each of the first, second, third, fourth, and fifth linker motifs comprise the same amino acid sequence.
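The structures set out above, in which every finger shares the same β-motif, α-motif, and linker motif and only the DNA recognition motif varies, can be sketched as a simple concatenation. The function name, its parameters, and the particular choice of motifs in the example are illustrative assumptions and do not correspond to any specific construct of the disclosure.

```python
def assemble_zinc_finger_protein(recognition_motifs, beta, alpha, linker,
                                 n_cap="", c_cap=""):
    """Concatenate [beta]-[recognition]-[alpha] repeats joined by a linker.

    recognition_motifs is given in N-terminal to C-terminal order; the optional
    caps are added at the two ends.
    """
    fingers = [beta + recognition + alpha for recognition in recognition_motifs]
    return n_cap + linker.join(fingers) + c_cap


# Example with three fingers, using motifs listed in this disclosure.
protein = assemble_zinc_finger_protein(
    recognition_motifs=["RSDNLVR", "QSSNLVR", "TSGNLVR"],
    beta="YKCNECGKAFN",   # SEQ ID NO: 51
    alpha="HMRTH",        # SEQ ID NO: 33
    linker="TGEKP",       # SEQ ID NO: 1
    n_cap="MAERP",
    c_cap="HTKIHLR",
)
```

A three-finger protein assembled this way contains two linker motifs, and a six-finger protein contains five, in line with the structures listed above.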

The DNA-binding domains of individual zinc finger proteins typically contain between three and six individual zinc finger motifs (each containing a β-motif, a DNA recognition motif, and an α-motif, as described above) each connected to one another by a linker motif. Each zinc finger protein can typically recognize between 9 and 18 base pairs. For example, a zinc finger protein comprising an array of three zinc finger motifs will typically recognize a nine-nucleotide sequence. A zinc finger protein comprising an array of four zinc finger motifs will typically recognize a twelve-nucleotide sequence. A zinc finger protein comprising an array of five zinc finger motifs will typically recognize a fifteen-nucleotide sequence. And a zinc finger protein comprising an array of six zinc finger motifs will typically recognize an eighteen-nucleotide sequence.

Amino acid sequences of various zinc finger DNA-binding domains that recognize particular three-nucleotide DNA sequences have been characterized and are well known in the art. These variable amino acid sequences generally contain seven amino acid residues that can recognize and interact with (e.g., bind to) specific nucleotide sequences (generally of three nucleotides in length). The seven variable DNA-binding residues (typically numbered from −1 to 6) are inserted in between the beta-motif and the alpha-motif within each individual ZF repeat and vary between each individual ZF repeat depending on the target DNA sequence. The variable DNA-binding residues are therefore distinct from, and do not overlap with, the beta-motif and the alpha-motif sequences. For example, the following seven-amino acid DNA recognition sequences that recognize particular three-nucleotide DNA sequences may be used in the ZF domain-containing proteins described herein:

| Target DNA sequence (5′ to 3′) | ZF amino acid sequence | ZF nucleotide sequence | ZF nt sequence SEQ ID NO: |
| --- | --- | --- | --- |
| AAA | QRANLRA (SEQ ID NO: 753) | cagagagctaatctcagggcc | 816 |
| AAC | DSGNLRV (SEQ ID NO: 754) | gattcagggaatctccgggtt | 817 |
| AAG | RKDNLKN (SEQ ID NO: 755) | cgaaaagataatctgaagaat | 818 |
| AAT | TTGNLTV (SEQ ID NO: 756) | accactggaaacctcacggtg | 819 |
| ACA | SPADLTR (SEQ ID NO: 757) | agtcctgcagatcttacccga | 820 |
| ACC | DKKDLTR (SEQ ID NO: 758) | gacaagaaggatctgacacga | 821 |
| ACG | RTDTLRD (SEQ ID NO: 759) | aggactgatacgctgcgcgat | 822 |
| ACT | THLDLIR (SEQ ID NO: 760) | acccacctggacctcatcaga | 823 |
| AGA | QLAHLRA (SEQ ID NO: 761) | caactcgctcatctgcgagca | 824 |
| AGC | ERSHLRE (SEQ ID NO: 762) | gaacgaagccacctgcgcgaa | 825 |
| AGG | RSDHLTN (SEQ ID NO: 763) | cgcagcgaccatttgactaac | 826 |
| AGT | HRTTLTN (SEQ ID NO: 764) | caccgaacgaccttgactaac | 827 |
| ATA | QKSSLIA (SEQ ID NO: 765) | cagaaatcttctttgatagct | 828 |
| ATC | RRSACRR (SEQ ID NO: 766) | cggagatcagcctgtcgacgc | 829 |
| ATG | RRDELNV (SEQ ID NO: 767) | aggcgggacgaactgaacgtg | 830 |
| ATT | HKNALQN (SEQ ID NO: 768) | cacaaaaatgccttgcaaaac | 831 |
| CAA | QSGNLTE (SEQ ID NO: 769) | caatctggcaatcttacagag | 832 |
| CAC | SKKALTE (SEQ ID NO: 770) | tctaaaaaggcgctgacggag | 833 |
| CAG | RADNLTE (SEQ ID NO: 771) | cgggcggataatctcactgag | 834 |
| CAT | TSGNLTE (SEQ ID NO: 772) | acgagtggaaatcttacggaa | 835 |
| CCA | TSHSLTE (SEQ ID NO: 773) | acgtcccacagtttgaccgaa | 836 |
| CCC | SKKHLAE (SEQ ID NO: 774) | agcaagaaacaccttgcagaa | 837 |
| CCG | RNDTLTE (SEQ ID NO: 775) | aggaatgatactcttaccgag | 838 |
| CCT | TKNSLTE (SEQ ID NO: 776) | acaaagaacagcctcaccgag | 839 |
| CGA | QSGHLTE (SEQ ID NO: 777) | cagtcagggcatctcacggag | 840 |
| CGC | HTGHLLE (SEQ ID NO: 778) | cacacaggccatttgttggag | 841 |
| CGG | RSDKLTE (SEQ ID NO: 779) | cggagtgataaactcaccgaa | 842 |
| CGT | SRRTCRA (SEQ ID NO: 780) | tcacgacgcacctgtagagcg | 843 |
| CTA | QNSTLTE (SEQ ID NO: 781) | cagaattcaactctcaccgaa | 844 |
| CTC | QRHHLVE (SEQ ID NO: 782) | cagcgacaccatttggtcgag | 845 |
| CTG | RNDALTE (SEQ ID NO: 783) | cggaacgatgcacttaccgag | 846 |
| CTT | TTGALTE (SEQ ID NO: 784) | actacaggggctctcactgaa | 847 |
| GAA | QSSNLVR (SEQ ID NO: 785) | cagagtagtaacctggtgagg | 848 |
| GAC | DPGNLVR (SEQ ID NO: 786) | gatcccgggaacctcgttaga | 849 |
| GAG | RSDNLVR (SEQ ID NO: 787) | cgctctgataacctggtcaga | 850 |
| GAT | TSGNLVR (SEQ ID NO: 788) | actagcgggaacctcgtccgg | 851 |
| GCA | QSGDLRR (SEQ ID NO: 789) | caaagcggggacttgagaagg | 852 |
| GCC | DCRDLAR (SEQ ID NO: 790) | gattgccgagatcttgctcgg | 853 |
| GCG | RSDDLVR (SEQ ID NO: 791) | cgctcagatgatctggttcgc | 854 |
| GCT | TSGELVR (SEQ ID NO: 792) | acgtctggggagttggttagg | 855 |
| GGA | QRAHLER (SEQ ID NO: 793) | caaagagcccatctggaaagg | 856 |
| GGC | DPGHLVR (SEQ ID NO: 794) | gatcccggacacttggttcga | 857 |
| GGG | RSDKLVR (SEQ ID NO: 795) | cgcagcgacaaactcgttaga | 858 |
| GGT | TSGHLVR (SEQ ID NO: 796) | acttcaggccatcttgtaaga | 859 |
| GTA | QSSSLVR (SEQ ID NO: 797) | caatcttcctcacttgtgagg | 860 |
| GTC | DPGALVR (SEQ ID NO: 798) | gacccaggggctttggttcgg | 861 |
| GTG | RSDELVR (SEQ ID NO: 799) | cggtcagatgagctggtacgc | 862 |
| GTT | TSGSLVR (SEQ ID NO: 800) | acaagcggctctctcgttaga | 863 |
| TAA | QASNLIS (SEQ ID NO: 801) | caagcctctaacttgattagc | 864 |
| TAC | SRGNLKS (SEQ ID NO: 802) | agcaggggtaacttgaaatcc | 865 |
| TAG | REDNLHT (SEQ ID NO: 803) | cgggaagacaaccttcatacg | 866 |
| TAT | ARGNLRT (SEQ ID NO: 804) | gcacgcgggaacttgcggact | 867 |
| TCA | RSDHLTT (SEQ ID NO: 811) | cgaagtgatcacttgacaacc | 868 |
| TCC | RSDERKR (SEQ ID NO: 806) | cggtcagacgagagaaagcga | 869 |
| TCG | RLRALDR (SEQ ID NO: 807) | cgcttgcgggcgctcgaccga | 870 |
| TCT | RLRDIQF (SEQ ID NO: 808) | agactcagggatatacaattt | 871 |
| TGA | QAGHLAS (SEQ ID NO: 809) | caaggggccacctcgccagc | 872 |
| TGC | APKALGW (SEQ ID NO: 810) | gccccaaaagcactgggctgg | 873 |
| TGG | RSDHLTT (SEQ ID NO: 811) | cggagcgaccatctcactact | 874 |
| TGT | WRDSLLA (SEQ ID NO: 812) | tggcgcgactcccttctcgcg | 875 |
| TTA | QKWPRDS (SEQ ID NO: 813) | cagaagtggcccagggattca | 876 |
| TTC | DNSYLPR (SEQ ID NO: 814) | gacaattcttacttgcccagg | 877 |
| TTG | RKDALRG (SEQ ID NO: 815) | aggaaagatgcgcttagaggg | 878 |
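As an illustrative sketch of how the table above might be used, the following code splits a target site into three-nucleotide units, looks up the corresponding seven-residue DNA recognition motif for each unit, and checks that each listed nucleotide sequence encodes its listed amino acid sequence under the standard genetic code. Only three rows of the table are copied in for brevity. The helper names are hypothetical, and the reversal of motif order for N-terminal to C-terminal placement reflects the way Zif268-type arrays are commonly described as binding (N-terminal finger at the 3′ end of the target strand), which is an assumption not stated in this section.

```python
# Three rows copied from the table above:
# target triplet -> (recognition motif, coding sequence).
TABLE_SUBSET = {
    "GAG": ("RSDNLVR", "cgctctgataacctggtcaga"),
    "GCA": ("QSGDLRR", "caaagcggggacttgagaagg"),
    "GTG": ("RSDELVR", "cggtcagatgagctggtacgc"),
}

# Standard genetic code, built from the conventional t, c, a, g codon ordering.
BASES = "tcag"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna):
    """Translate a coding sequence codon by codon, ignoring any trailing bases."""
    dna = dna.lower()
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def recognition_motifs_for_target(target):
    """Return (triplet, recognition motif) pairs in 5' to 3' target order."""
    target = target.upper()
    if len(target) % 3 != 0:
        raise ValueError("target length must be a multiple of 3 nucleotides")
    triplets = [target[i:i + 3] for i in range(0, len(target), 3)]
    # A KeyError here means the triplet is not in the copied subset.
    return [(t, TABLE_SUBSET[t][0]) for t in triplets]


design = recognition_motifs_for_target("GAGGCAGTG")

# Assumed placement: the N-terminal finger reads the 3'-most triplet.
n_to_c_order = [motif for _, motif in reversed(design)]

# Consistency check between the amino acid and nucleotide columns.
consistent = all(translate(nt) == aa for aa, nt in TABLE_SUBSET.values())
```

A nine-nucleotide target therefore yields three recognition motifs, one per zinc finger motif, and longer targets extend the same lookup to four, five, or six fingers.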

Several methods to generate a zinc finger array of repeating zinc finger units that each recognize a three-nucleotide sequence have been developed and are known in the art. The most straightforward method to generate new zinc finger arrays is to combine individual zinc finger motifs or shorter zinc finger arrays with known DNA specificity (i.e., “zinc finger modules”) to form longer zinc finger arrays that have a particular DNA sequence binding affinity. The concept of obtaining zinc finger DNA binding domains for each of the 64 possible combinations of three-nucleotide sequences and then assembling these domains together to design zinc finger proteins with specificity for any target sequence has been described in the art (see, for example, Pavletich et al. Zinc finger-DNA recognition: crystal structure of a Zif268-DNA complex at 2.1 Å. Science 1991, 252(5007), 809-817, which is incorporated herein by reference). The most common modular assembly process involves combining three separate zinc finger motifs that can each recognize a 3 base pair DNA sequence to generate a zinc finger repeat comprising three zinc finger motifs that can recognize a nine base pair target site. Longer zinc finger arrays that recognize longer target sites can be generated as well, as discussed above. Methods utilizing two zinc finger modules to generate zinc finger arrays comprising up to six individual zinc finger motifs have also been described (see, for example, Shukla et al. Precise genome modification in the crop species Zea mays using zinc finger nucleases. Nature 2009, 459(7245), 437-441, which is incorporated herein by reference). Additionally, variants of the modular assembly approach that take into account the context of neighboring DNA binding domains in the other zinc finger domains within an array have also been described (see, for example, Sander et al. Selection-free zinc finger-nuclease engineering by context-dependent assembly (CoDA). Nat. Methods 2011, 8(1), 67-69, which is incorporated herein by reference).

Methods utilizing phage display to select for zinc finger DNA binding domains that recognize a particular DNA sequence have also been developed, as described, e.g., in Segal et al. Toward controlling gene expression at will: selection and design of zinc finger domains recognizing each of the 5′-GNN-3′ DNA target sequences. PNAS 1999, 96(6), 2758-63; Dreier et al. Development of zinc finger domains for recognition of the 5′-CNN-3′ family DNA sequences and their use in the construction of artificial transcription factors. J. Biol. Chem. 2005, 280(42), 35588-35597; and Dreier et al. Development of zinc finger domains for recognition of the 5′-ANN-3′ family of DNA sequences and their use in the construction of artificial transcription factors. J. Biol. Chem. 2001, 276(31), 29466-29478, the contents of each of which are incorporated herein by reference. Methods utilizing yeast one-hybrid systems, bacterial one-hybrid systems, bacterial two-hybrid systems, and mammalian cells have also been developed. For example, a method known as “OPEN” has been developed to select novel three-zinc finger arrays. OPEN utilizes a bacterial two-hybrid system and combines pre-selected pools of individual zinc fingers that have each been selected to recognize and bind to a particular three-nucleotide DNA sequence. A second round of selection is then utilized to obtain three-zinc finger arrays capable of binding a desired nine-nucleotide DNA sequence. The OPEN system is described further in Maeder et al. Rapid “open-source” engineering of customized zinc finger nucleases for highly efficient gene modification. Molecular Cell 2008, 31(2), 294-301, the contents of which are incorporated herein by reference.

Additional references that describe the selection of DNA binding domains to design zinc finger arrays that recognize particular nucleotide sequences (and that describe zinc finger proteins more generally) include, but are not limited to, Hossain et al. Artificial Zinc Finger DNA Binding Domains: Versatile Tools for Genome Engineering and Modulation of Gene Expression. J. Cell Biochem. 2015, 116(11), 2435-2444; Gupta, R. M. and Musunuru, K. Expanding the genetic editing tool kit: ZFNs, TALENs, and CRISPR-Cas9. J. Clin. Invest. 2014, 124(10), 4154-4161; Collin, J. and Lako, M. Concise Review: Putting a Zinc Finger on Stem Cell Biology: Zinc Finger Nuclease-Driven Targeted Genetic Editing in Human Pluripotent Stem Cells. Stem Cells 2011, 29, 1021-1033; Carroll, D. Genome Engineering With Zinc finger Nucleases. Genetics 2011, 188, 773-782; Yang, X. et al. Strategies for mitochondrial gene editing. Comput. Struct. Biotechnol. J. 2021, 19, 3319-3329; Lim et al. Nuclear and mitochondrial DNA editing in human cells with zinc finger deaminases. Nat. Commun. 2022, 13(366); Elrod-Erickson et al. Zif268 protein-DNA complex refined at 1.6 Å: a model system for understanding zinc finger-DNA interactions. Structure 1996, 4(10), 1171-1180; and Jamieson et al. A zinc finger directory for high-affinity DNA recognition. Proc. Natl. Acad. Sci. USA 1996, 93, 12834-12839, each of which is incorporated by reference herein.

DddA Variants

In some aspects, the present disclosure provides double-stranded DNA deaminase A (DddA) variants. For example, the present disclosure provides DddA variants that exhibit increased on-target editing efficiency and/or decreased off-target editing. As described further herein, the DddA protein is often split into two halves or portions (e.g., at position 1397 of DddA as described herein). The spontaneous reassembly of the two split DddA halves can lead to off-target deamination independent from the on-target site. This can lead to unwanted mutagenesis and increased off-target editing generally if not controlled.

In some embodiments, the DddA variants provided herein are designed to weaken the affinity of the two split DddA halves for one another. Such weakening of the interaction between the two DddA portions allows for fine-tuning of the deaminase activity to eliminate its off-target activity while still preserving high on-target editing efficiency.

In various embodiments involving obtaining a DddA variant by way of one or more methodologies, such as, but not limited to, mutagenesis (e.g., through alanine scanning, lysine scanning, glutamate scanning, and/or aspartate scanning), protein truncation or elongation, and insertion of charged residues into a linker upstream of DddA (e.g., in the context of a fusion protein, such as the base editors described herein), the process may begin with a “starter” protein, such as canonical DddA or a fragment of DddA.

In various embodiments, the starter DddA protein from which variants are derived can be the canonical protein, or a fragment thereof. As reported in Mok et al. 2020, DddA was discovered in Burkholderia cenocepacia and reported in the Protein Data Bank as PDB ID: 6U08, which has the following full-length amino acid sequence (1427 amino acids):

>tr|A0A1V6L4E7|A0A1V6L4E7_9BURK YD repeat (Two copies) OS=Burkholderia cenocepacia OX=95486 GN=UE95_03830 PE=1 SV=1

(SEQ ID NO: 356)

MYEAARVTDPIDHTSALAGFLVGAVLGIALIAAVAFATFTCGFGVALLAGMMAGIGAQ

ALLSIGESIGKMFSSQSGNIITGSPDVYVNSLSAAYATLSGVACSKHNPIPLVAQGSTNIFI

NGRPAARKDDKITCGATIGDGSHDTFFHGGTQTYLPVDDEVPPWLRTATDWAFTLAGL

VGGLGGLLKASGGLSRAVLPCAAKFIGGYVLGEAFGRYVAGPAINKAIGGLFGNPIDVT

TGRKILLAESETDYVIPSPLPVAIKRFYSSGIDYAGTLGRGWVLPWEIRLHARDGRLWYT

DAQGRESGFPMLRAGQAAFSEADQRYLTRTPDGRYILHDLGERYYDFGQYDPESGRIA

WVRRVEDQAGQWYQFERDSRGRVTEILTCGGLRAVLDYETVFGRLGTVTLVHEDERRL

AVTYGYDENGQLASVTDANGAVVRQFAYTNGLMTSHMNALGFTSSYVWSKIEGEPRV

VETHTSEGENWTFEYDVAGRQTRVRHADGRTAHWRFDAQSQIVEYTDLDGAFYRIKY

DAVGMPVMLMLPGDRTVMFEYDDAGRIIAETDPLGRTTRTRYDGNSLRPVEVVGPDGG

AWRVEYDQQGRVVSNQDSLGRENRYEYPKALTALPSAHIDALGGRKTLEWNSLGKLV

GYTDCSGKTTRTSFDAFGRICSRENALGQRITYDVRPTGEPRRVTYPDGSSETFEYDAAG

TLVRYIGLGGRVQELLRNARGQLIEAVDPAGRRVQYRYDVEGRLRELQQDHARYTFTY

SAGGRLLTETRPDGILRRFEYGEAGELLGLDIVGAPDPHATGNRSVRTIRFERDRMGVLK

VQRTPTEVTRYQHDKGDRLVKVERVPTPSGIALGIVPDAVEFEYDKGGRLVAEHGSNGS

VIYTLDELDNVVSLGLPHDQTLQMLRYGSGHVHQIRFGDQVVADFERDDLHREVSRTQ

GRLTQRSGYDPLGRKVWQSAGIDPEMLGRGSGQLWRNYGYDAAGDLIETSDSLRGSTR

FSYDPAGRLISRANPLDRKFEEFAWDAAGNLLDDAQRKSRGYVEGNRLLMWQDLRFE

YDPFGNLATKRRGANQTQRFTYDGQDRLITVHTQDVRGVVETRFAYDPLGRRIAKTDT

AFDLRGMKLRAETKRFVWEGLRLVQEVRETGVSSYVYSPDAPYSPVARADTVMAEAL

AATVIDSAKRAARIFHFHTDPVGAPQEVTDEAGEVAWAGQYAAWGKVEATNRGVTAA

RTDQPLRFAGQYADDSTGLHYNTFRFYDPDVGRFINQDPIGLNGGANVYHYAPNPVGW

VDPWGLAGSYALGPYQISAPQLPAYNGQTVGTFYYVNDAGGLESKVFSSGGPTPYPNY

ANAGHVEGQSALFMRDNGISEGLVFHNNPEGTCGFCVNMTETLLPENAKMTVVPPEGA

IPVKRGATGETKVFTGNSNSPKSPTKGGC.

In various other embodiments, the starter DddA protein can be a split DddA having the following sequences:

Split DddA (DddA-G1397N)

(SEQ ID NO: 283)

GSYALGPYQISAPQLPAYNGQTVGTFYYVNDAGGLESKVFSSGGPTPYPNYANAGHVEGQSALFMRDNGISEGLVFHNNPEGTCGFCVNMTETLLPENAKMTVVPPEG

The split DddA can also include fragments or variants thereof, including amino acid sequences having at least 70%, at least 75%, at least 80%, at least 85%, at least 90%, at least 95%, at least 96%, at least 97%, at least 98%, at least 99%, or 100% sequence identity with DddA of SEQ ID NO: 283.

Split DddA (DddA-G1397C)

(SEQ ID NO: 139)

AIPVKRGATGETKVFTGNSNSPKSPTKGGC.

It has been found that the whole, intact DddA protein is toxic to cells. Thus, in order to utilize DddA in the context of the base editors described herein, DddA may be delivered in an inactive form. One of ordinary skill in the art will appreciate that various methods, techniques, and modifications known in the art can be adapted for reversibly inactivating DddA such that the enzyme may be delivered to a cell in an inactive state, but then become activated inside the cell (or the mitochondria) under one or more conditions, or in the presence of one or more inducing agents, in order to conduct the desired deamination.

In preferred embodiments, DddA (including the DddA variants described herein) may be split into inactive fragments that can be separately delivered to a target deamination site on separate fusion constructs that target each fragment of the DddA to sites positioned on either side of a target edit site.

In some embodiments, the DddA variants provided herein comprise a first portion and a second portion. In some embodiments, the first portion and the second portion together comprise a full length DddA. In some embodiments, the first and second portion comprise less than the full length DddA portion. In some embodiments, the first and second portion independently do not have any, or have minimal, native DddA activity (e.g., deamination activity). In some embodiments, the first and second portion can re-assemble (i.e., dimerize) into a DddA protein with (at least partial) native DddA activity (e.g., deamination activity).

In some embodiments, the first and second portion of the DddA are formed by truncating (i.e., dividing or splitting the DddA protein) at specified amino acid residues (e.g., amino acid residue 1397). In some embodiments, the first portion of a DddA comprises a full-length DddA truncated at its N-terminus. In some embodiments, the second portion of a DddA comprises a full-length DddA truncated at its C-terminus. In some embodiments, additional truncations are performed to either the full-length DddA or to the first or second portions of the DddA. In some embodiments, the first and second portions of a DddA may comprise additional truncations, but the first and second portion can dimerize or re-assemble to restore (at least partially) native DddA activity (e.g., deamination).

In certain embodiments, the DddA can be separated into two fragments by dividing the DddA at a split site. A “split site” refers to a position between two adjacent amino acids (in a wildtype DddA amino acid sequence) that marks a point of division of a DddA. In certain embodiments, the DddA can have at least one split site, such that once divided at that split site, the DddA forms an N-terminal fragment and a C-terminal fragment. The N-terminal and C-terminal fragments can be the same or different sizes (or lengths), wherein the size and/or polypeptide length depends on the location or position of the split site. As used herein, reference to a “fragment” of DddA (or any other polypeptide) can be referred to equivalently as a “portion.” Thus, a DddA that is divided at a split site can form an N-terminal portion and a C-terminal portion. Preferably, the N-terminal fragment (or portion) and the C-terminal fragment (or portion) of DddA do not have deaminase activity on their own, and preferably the N-terminal and C-terminal fragments do have deaminase activity when associated with one another.

In various embodiments, a DddA may be split into two or more inactive fragments by directly cleaving the DddA at one or more split sites. Direct cleaving can be carried out by a protease (e.g., trypsin) or another enzyme or chemical reagent. In certain embodiments, such chemical cleavage reactions can be designed to be site-selective (e.g., Elashal and Raj, “Site-selective chemical cleavage of peptide bonds,” Chemical Communications, 2016, Vol. 52, pages 6304-6307, the contents of which are incorporated herein by reference). In other embodiments, chemical cleavage reactions can be designed to be non-selective and/or occur in a random fashion.

In other embodiments, the two or more inactive DddA fragments can be engineered as separately expressed polypeptides. For instance, for a DddA having one split site, the N-terminal DddA fragment could be engineered from a first nucleotide sequence that encodes the N-terminal DddA fragment (which extends from the N-terminus of the DddA up to and including the residue on the amino-terminal side of the split site). In such an example, the C-terminal DddA fragment could be engineered from a second nucleotide sequence that encodes the C-terminal DddA fragment (which extends from the residue on the carboxy-terminal side of the split site up to and including the natural C-terminus of the DddA protein). The first and second nucleotide sequences could be on the same or different nucleotide molecules (e.g., the same or different expression vectors).
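By way of non-limiting illustration only, the division of a polypeptide at a split site into an N-terminal fragment and a C-terminal fragment can be sketched in Python as follows. The function name `split_at` and the variable names are hypothetical and are used only for this example. The sketch assumes that the fragments of SEQ ID NO: 283 and SEQ ID NO: 139 are contiguous in the parent polypeptide, which is consistent with the C-terminal end of SEQ ID NO: 356.

```python
from typing import Tuple

# Fragment sequences as recited in this disclosure.
DDDA_G1397N = (
    "GSYALGPYQISAPQLPAYNGQTVGTFYYVNDAGGLESKVFSSGGPTPYPNYAN"
    "AGHVEGQSALFMRDNGISEGLVFHNNPEGTCGFCVNMTETLLPENAKMTVVPP"
    "EG"
)  # SEQ ID NO: 283
DDDA_G1397C = "AIPVKRGATGETKVFTGNSNSPKSPTKGGC"  # SEQ ID NO: 139


def split_at(sequence: str, n_terminal_length: int) -> Tuple[str, str]:
    """Divide a polypeptide between two adjacent residues.

    The N-terminal fragment runs up to and including residue
    `n_terminal_length` (1-based); the C-terminal fragment starts at
    the next residue and runs to the natural C-terminus.
    """
    if not 0 < n_terminal_length < len(sequence):
        raise ValueError("split site must lie between two adjacent residues")
    return sequence[:n_terminal_length], sequence[n_terminal_length:]


# Assumption for illustration: the parent is the two fragments joined end to end.
parent = DDDA_G1397N + DDDA_G1397C
n_half, c_half = split_at(parent, len(DDDA_G1397N))

print(len(parent), len(n_half), len(c_half))
```

In this sketch the two "halves" are of unequal length, and the last residue of the N-terminal fragment is the glycine immediately preceding the split site.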

In various embodiments, the N-terminal portion of the DddA variants provided herein may be referred to as “DddA-N half” and the C-terminal portion of the DddA variants provided herein may be referred to as the “DddA-C half.” Reference to the term “half” does not connote the requirement that the DddA-N and DddA-C portions are identically half of the size and/or sequence length of a complete DddA, or that the split site is required to be at the midpoint of the complete DddA polypeptide. To the contrary, and as noted above, the split site can be between any pair of residues in the DddA polypeptide, thereby giving rise to half portions that are unequal in size and/or sequence length. In certain embodiments, the split site is within a loop region of the DddA.

In one aspect, the present disclosure provides DddA variants comprising a first fragment comprising an amino acid sequence that is at least 80%, at least 85%, at least 90%, at least 95%, or at least 99% identical to the amino acid sequence of SEQ ID NO: 139, and a second fragment comprising an amino acid sequence that is at least 80%, at least 85%, at least 90%, at least 95%, or at least 99% identical to the amino acid sequence of SEQ ID NO: 283, wherein the first fragment comprises one or more amino acid substitutions, truncations, or extensions relative to the amino acid sequence of SEQ ID NO: 139, and/or wherein the second fragment comprises one or more amino acid substitutions, truncations, or extensions relative to the amino acid sequence of SEQ ID NO: 283.
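For orientation only, the percent identity thresholds recited above can be related to numbers of substitutions with the following Python sketch. The sketch makes a simplifying assumption that is not a definition used in this disclosure: the two sequences have equal length and are compared position by position without gaps, whereas sequence identity is ordinarily determined from a sequence alignment. The function name `percent_identity` is hypothetical.

```python
CANONICAL_C = "AIPVKRGATGETKVFTGNSNSPKSPTKGGC"   # SEQ ID NO: 139
N18K = "AIPVKRGATGETKVFTGKSNSPKSPTKGGC"          # SEQ ID NO: 183
N18K_P25K = "AIPVKRGATGETKVFTGKSNSPKSKTKGGC"     # N18K and P25K combined


def percent_identity(a: str, b: str) -> float:
    """Ungapped, position-by-position identity for equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("this simplified sketch requires equal-length sequences")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


single = percent_identity(CANONICAL_C, N18K)        # 29 of 30 positions match
double = percent_identity(CANONICAL_C, N18K_P25K)   # 28 of 30 positions match
print(round(single, 1), round(double, 1))
```

Under this simplified measure, a single substitution in the 30-residue fragment leaves 29 of 30 positions identical (about 96.7%), and two substitutions leave 28 of 30 (about 93.3%).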

In some embodiments, the DddA variants provided herein comprise point mutations relative to a wild type DddA sequence. As described further herein, it was hypothesized by the inventors that introduction of individual point mutations in the C-terminal DddA fragment (G1397C) would reduce the interaction interface between the two split DddA halves and weaken the spontaneous reassembly of DddA at off-target sites. Thus, alanine scanning (to remove side chain interactions), lysine scanning (to introduce positive charge), and glutamate and aspartate scanning (to introduce negative charge) were performed. In this way, 120 constructs were tested in which each of the 30 residues in the C-terminal DddA fragment (G1397C) was individually mutated to either Ala, Lys, Glu or Asp. In some embodiments, the present disclosure provides DddA point mutants that exhibit lower off-target editing without an observed decrease in on-target editing, or point mutants that exhibit large reductions in off-target editing with only minor decreases in on-target editing. Such exemplary point mutants include DddA variants with amino acid substitutions at positions A5, A6, A7, A9, A14, A25, K12, K14, K18, K25, D3, D4, D5, D9, D14, DA, D19, D20, D25, D27, E5, E13, E16 and E20.
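As a non-limiting illustration, the alanine, lysine, aspartate, and glutamate scanning described above can be enumerated with the following Python sketch. The function name `scan` and the dictionary `library` are hypothetical. The sketch assumes that a position is skipped when its native residue already equals the scanning residue, since the resulting sequence would be identical to the canonical fragment; under that assumption the enumeration yields 113 distinct sequences.

```python
CANONICAL_C = "AIPVKRGATGETKVFTGNSNSPKSPTKGGC"  # SEQ ID NO: 139


def scan(sequence: str, new_residue: str) -> dict:
    """Return {mutation name: mutant sequence} for a single-residue scan.

    Mutation names use 1-based positions, e.g. "N18K".
    """
    mutants = {}
    for position, native in enumerate(sequence, start=1):
        if native == new_residue:
            continue  # substitution would leave the sequence unchanged
        name = f"{native}{position}{new_residue}"
        mutants[name] = sequence[:position - 1] + new_residue + sequence[position:]
    return mutants


# Alanine, lysine, aspartate, and glutamate scans of the C-terminal fragment.
library = {}
for residue in "AKDE":
    library.update(scan(CANONICAL_C, residue))

print(len(library))
print(library["N18K"])
print(library["P25A"])
```

Each entry differs from the canonical fragment at exactly one position, so any entry can be checked by comparing it against SEQ ID NO: 139 residue by residue.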

Exemplary DddA point mutants provided by the present disclosure include those comprising the following point mutations in the DddA C-terminal fragment G1397C:

Mutation:
Sequence:

Canonical
AIPVKRGATGETKVFTGNSNSPKSPTKGGC (SEQ ID NO: 139)

I2A
AAPVKRGATGETKVFTGNSNSPKSPTKGGC (SEQ ID NO: 140)

P3A
AIAVKRGATGETKVFTGNSNSPKSPTKGGC (SEQ ID NO: 141)

V4A
AIPAKRGATGETKVFTGNSNSPKSPTKGGC (SEQ ID NO: 142)

K5A
AIPVARGATGETKVFTGNSNSPKSPTKGGC (SEQ ID NO: 143)

R6A
AIPVKAGATGETKVFTGNSNSPKSPTKGGC (SEQ ID NO: 144)

G7A
AIPVKRAATGETKVFTGNSNSPKSPTKGGC (SEQ ID NO: 145)

T9A
AIPVKRGAAGETKVFTGNSNSPKSPTKGGC (SEQ ID NO: 146)

G10A
AIPVKRGATAETKVFTGNSNSPKSPTKGGC (SEQ ID NO: 147)

E11A
AIPVKRGATGATKVFTGNSNSPKSPTKGGC (SEQ ID NO: 148)

T12A
AIPVKRGATGEAKVFTGNSNSPKSPTKGGC (SEQ ID NO: 149)

K13A
AIPVKRGATGETAVFTGNSNSPKSPTKGGC (SEQ ID NO: 150)

V14A
AIPVKRGATGETKAFTGNSNSPKSPTKGGC (SEQ ID NO: 151)

F15A
AIPVKRGATGETKVATGNSNSPKSPTKGGC (SEQ ID NO: 152)

T16A
AIPVKRGATGETKVFAGNSNSPKSPTKGGC (SEQ ID NO: 153)

G17A
AIPVKRGATGETKVFTANSNSPKSPTKGGC (SEQ ID NO: 154)

N18A
AIPVKRGATGETKVFTGASNSPKSPTKGGC (SEQ ID NO: 155)

S19A
AIPVKRGATGETKVFTGNANSPKSPTKGGC (SEQ ID NO: 156)

N20A
AIPVKRGATGETKVFTGNSASPKSPTKGGC (SEQ ID NO: 157)

S21A
AIPVKRGATGETKVFTGNSNAPKSPTKGGC (SEQ ID NO: 158)

P22A
AIPVKRGATGETKVFTGNSNSAKSPTKGGC (SEQ ID NO: 159)

K23A
AIPVKRGATGETKVFTGNSNSPASPTKGGC (SEQ ID NO: 160)

S24A
AIPVKRGATGETKVFTGNSNSPKAPTKGGC (SEQ ID NO: 161)

P25A
AIPVKRGATGETKVFTGNSNSPKSATKGGC (SEQ ID NO: 162)

T26A
AIPVKRGATGETKVFTGNSNSPKSPAKGGC (SEQ ID NO: 163)

K27A
AIPVKRGATGETKVFTGNSNSPKSPTAGGC (SEQ ID NO: 164)

G28A
AIPVKRGATGETKVFTGNSNSPKSPTKAGC (SEQ ID NO: 165)

G29A
AIPVKRGATGETKVFTGNSNSPKSPTKGAC (SEQ ID NO: 166)

C30A
AIPVKRGATGETKVFTGNSNSPKSPTKGGA (SEQ ID NO: 167)

A1K
KIPVKRGATGETKVFTGNSNSPKSPTKGGC (SEQ ID NO: 168)

I2K
AKPVKRGATGETKVFTGNSNSPKSPTKGGC (SEQ ID NO: 169)

P3K
AIKVKRGATGETKVFTGNSNSPKSPTKGGC (SEQ ID NO: 170)

V4K
AIPKKRGATGETKVFTGNSNSPKSPTKGGC (SEQ ID NO: 171)

R6K
AIPVKKGATGETKVFTGNSNSPKSPTKGGC (SEQ ID NO: 172)

G7K
AIPVKRKATGETKVFTGNSNSPKSPTKGGC (SEQ ID NO: 173)

A8K
AIPVKRGKTGETKVFTGNSNSPKSPTKGGC (SEQ ID NO: 174)

T9K
AIPVKRGAKGETKVFTGNSNSPKSPTKGGC (SEQ ID NO: 175)

G10K
AIPVKRGATKETKVFTGNSNSPKSPTKGGC (SEQ ID NO: 176)

E11K
AIPVKRGATGKTKVFTGNSNSPKSPTKGGC (SEQ ID NO: 177)

T12K
AIPVKRGATGEKKVFTGNSNSPKSPTKGGC (SEQ ID NO: 178)

V14K
AIPVKRGATGETKKFTGNSNSPKSPTKGGC (SEQ ID NO: 179)

F15K
AIPVKRGATGETKVKTGNSNSPKSPTKGGC (SEQ ID NO: 180)

T16K
AIPVKRGATGETKVFKGNSNSPKSPTKGGC (SEQ ID NO: 181)

G17K
AIPVKRGATGETKVFTKNSNSPKSPTKGGC (SEQ ID NO: 182)

N18K
AIPVKRGATGETKVFTGKSNSPKSPTKGGC (SEQ ID NO: 183)

S19K
AIPVKRGATGETKVFTGNKNSPKSPTKGGC (SEQ ID NO: 184)

N20K
AIPVKRGATGETKVFTGNSKSPKSPTKGGC (SEQ ID NO: 185)

S21K
AIPVKRGATGETKVFTGNSNKPKSPTKGGC (SEQ ID NO: 186)

P22K
AIPVKRGATGETKVFTGNSNSKKSPTKGGC (SEQ ID NO: 187)

S24K
AIPVKRGATGETKVFTGNSNSPKKPTKGGC (SEQ ID NO: 188)

P25K
AIPVKRGATGETKVFTGNSNSPKSKTKGGC (SEQ ID NO: 189)

T26K
AIPVKRGATGETKVFTGNSNSPKSPKKGGC (SEQ ID NO: 190)

G28K
AIPVKRGATGETKVFTGNSNSPKSPTKKGC (SEQ ID NO: 191)

G29K
AIPVKRGATGETKVFTGNSNSPKSPTKGKC (SEQ ID NO: 192)

C30K
AIPVKRGATGETKVFTGNSNSPKSPTKGGK (SEQ ID NO: 193)

A1D
DIPVKRGATGETKVFTGNSNSPKSPTKGGC (SEQ ID NO: 194)

I2D
ADPVKRGATGETKVFTGNSNSPKSPTKGGC (SEQ ID NO: 195)

P3D
AIDVKRGATGETKVFTGNSNSPKSPTKGGC (SEQ ID NO: 196)

V4D
AIPDKRGATGETKVFTGNSNSPKSPTKGGC (SEQ ID NO: 197)

K5D
AIPVDRGATGETKVFTGNSNSPKSPTKGGC (SEQ ID NO: 198)

R6D
AIPVKDGATGETKVFTGNSNSPKSPTKGGC (SEQ ID NO: 199)

G7D
AIPVKRDATGETKVFTGNSNSPKSPTKGGC (SEQ ID NO: 200)

A8D
AIPVKRGDTGETKVFTGNSNSPKSPTKGGC (SEQ ID NO: 201)

T9D
AIPVKRGADGETKVFTGNSNSPKSPTKGGC (SEQ ID NO: 202)

G10D
AIPVKRGATDETKVFTGNSNSPKSPTKGGC (SEQ ID NO: 203)

E11D
AIPVKRGATGDTKVFTGNSNSPKSPTKGGC (SEQ ID NO: 204)

T12D
AIPVKRGATGEDKVFTGNSNSPKSPTKGGC (SEQ ID NO: 205)

K13D
AIPVKRGATGETDVFTGNSNSPKSPTKGGC (SEQ ID NO: 206)

V14D
AIPVKRGATGETKDFTGNSNSPKSPTKGGC (SEQ ID NO: 207)

F15D
AIPVKRGATGETKVDTGNSNSPKSPTKGGC (SEQ ID NO: 208)

T16D
AIPVKRGATGETKVFDGNSNSPKSPTKGGC (SEQ ID NO: 209)

G17D
AIPVKRGATGETKVFTDNSNSPKSPTKGGC (SEQ ID NO: 210)

N18D
AIPVKRGATGETKVFTGDSNSPKSPTKGGC (SEQ ID NO: 211)

S19D
AIPVKRGATGETKVFTGNDNSPKSPTKGGC (SEQ ID NO: 212)

N20D
AIPVKRGATGETKVFTGNSDSPKSPTKGGC (SEQ ID NO: 213)

S21D
AIPVKRGATGETKVFTGNSNDPKSPTKGGC (SEQ ID NO: 214)

P22D
AIPVKRGATGETKVFTGNSNSDKSPTKGGC (SEQ ID NO: 215)

K23D
AIPVKRGATGETKVFTGNSNSPDSPTKGGC (SEQ ID NO: 216)

S24D
AIPVKRGATGETKVFTGNSNSPKDPTKGGC (SEQ ID NO: 217)

P25D
AIPVKRGATGETKVFTGNSNSPKSDTKGGC (SEQ ID NO: 218)

T26D
AIPVKRGATGETKVFTGNSNSPKSPDKGGC (SEQ ID NO: 219)

K27D
AIPVKRGATGETKVFTGNSNSPKSPTDGGC (SEQ ID NO: 220)

G28D
AIPVKRGATGETKVFTGNSNSPKSPTKDGC (SEQ ID NO: 221)

G29D
AIPVKRGATGETKVFTGNSNSPKSPTKGDC (SEQ ID NO: 222)

C30D
AIPVKRGATGETKVFTGNSNSPKSPTKGGD (SEQ ID NO: 223)

A1E
EIPVKRGATGETKVFTGNSNSPKSPTKGGC (SEQ ID NO: 224)

I2E
AEPVKRGATGETKVFTGNSNSPKSPTKGGC (SEQ ID NO: 225)

P3E
AIEVKRGATGETKVFTGNSNSPKSPTKGGC (SEQ ID NO: 226)

V4E
AIPEKRGATGETKVFTGNSNSPKSPTKGGC (SEQ ID NO: 227)

K5E
AIPVERGATGETKVFTGNSNSPKSPTKGGC (SEQ ID NO: 228)

R6E
AIPVKEGATGETKVFTGNSNSPKSPTKGGC (SEQ ID NO: 229)

G7E
AIPVKREATGETKVFTGNSNSPKSPTKGGC (SEQ ID NO: 230)

A8E
AIPVKRGETGETKVFTGNSNSPKSPTKGGC (SEQ ID NO: 231)

T9E
AIPVKRGAEGETKVFTGNSNSPKSPTKGGC (SEQ ID NO: 232)

G10E
AIPVKRGATEETKVFTGNSNSPKSPTKGGC (SEQ ID NO: 233)

T12E
AIPVKRGATGEEKVFTGNSNSPKSPTKGGC (SEQ ID NO: 234)

K13E
AIPVKRGATGETEVFTGNSNSPKSPTKGGC (SEQ ID NO: 235)

V14E
AIPVKRGATGETKEFTGNSNSPKSPTKGGC (SEQ ID NO: 236)

F15E
AIPVKRGATGETKVETGNSNSPKSPTKGGC (SEQ ID NO: 237)

T16E
AIPVKRGATGETKVFEGNSNSPKSPTKGGC (SEQ ID NO: 238)

G17E
AIPVKRGATGETKVFTENSNSPKSPTKGGC (SEQ ID NO: 239)

N18E
AIPVKRGATGETKVFTGESNSPKSPTKGGC (SEQ ID NO: 240)

S19E
AIPVKRGATGETKVFTGNENSPKSPTKGGC (SEQ ID NO: 241)

N20E
AIPVKRGATGETKVFTGNSESPKSPTKGGC (SEQ ID NO: 242)

S21E
AIPVKRGATGETKVFTGNSNEPKSPTKGGC (SEQ ID NO: 243)

P22E
AIPVKRGATGETKVFTGNSNSEKSPTKGGC (SEQ ID NO: 244)

K23E
AIPVKRGATGETKVFTGNSNSPESPTKGGC (SEQ ID NO: 245)

S24E
AIPVKRGATGETKVFTGNSNSPKEPTKGGC (SEQ ID NO: 246)

P25E
AIPVKRGATGETKVFTGNSNSPKSETKGGC (SEQ ID NO: 247)

T26E
AIPVKRGATGETKVFTGNSNSPKSPEKGGC (SEQ ID NO: 248)

K27E
AIPVKRGATGETKVFTGNSNSPKSPTEGGC (SEQ ID NO: 249)

G28E
AIPVKRGATGETKVFTGNSNSPKSPTKEGC (SEQ ID NO: 250)

G29E
AIPVKRGATGETKVFTGNSNSPKSPTKGEC (SEQ ID NO: 251)

C30E
AIPVKRGATGETKVFTGNSNSPKSPTKGGE (SEQ ID NO: 252)

In some embodiments, a DddA variant comprises one or more amino acid substitutions relative to the amino acid sequence of SEQ ID NO: 139 (i.e., the C-terminal fragment of DddA split at position 1397). In some embodiments, a DddA variant comprises the point mutation D20. In some embodiments, a DddA variant comprises the point mutation E20. In some embodiments, a DddA variant comprises the point mutation K18. In some embodiments, a DddA variant comprises the point mutation K25. In some embodiments, a DddA variant comprises a C-terminal fragment comprising an amino acid sequence of any one of SEQ ID NOs: 140-252, or an amino acid sequence at least 80%, at least 85%, at least 90%, at least 95%, or at least 99% identical to the amino acid sequence of any one of SEQ ID NOs: 140-252.

In some embodiments, a DddA variant comprises a C-terminal fragment comprising an amino acid substitution at position N18. In certain embodiments, the amino acid substitution is an N18K substitution. In some embodiments, a DddA variant comprises a C-terminal fragment comprising an amino acid substitution at position P25. In certain embodiments, the amino acid substitution is a P25K substitution. In certain embodiments, the amino acid substitution is a P25A substitution. In certain embodiments, a DddA variant comprises a C-terminal fragment comprising an N18K substitution and a P25K substitution relative to the amino acid sequence of SEQ ID NO: 139. In certain embodiments, a DddA variant comprises a C-terminal fragment comprising an N18K substitution and a P25A substitution relative to the amino acid sequence of SEQ ID NO: 139.

In some embodiments, the DddA variants provided herein comprise truncations and/or extensions of either DddA fragment. As described further herein, it was hypothesized by the inventors that truncation of the N-terminal DddA fragment (G1397N) and/or truncation of the C-terminal DddA fragment (G1397C) would reduce the interaction interface between the two split DddA halves and weaken the spontaneous reassembly of DddA at off-target sites. In some embodiments, the N-terminal DddA fragment (G1397N) is truncated at its C-terminus (e.g., by deletion of 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more than 10 amino acids). In some embodiments, the C-terminal DddA fragment (G1397C) is truncated at its N-terminus (e.g., by deletion of 1-15 amino acids). In some embodiments, the C-terminal DddA fragment (G1397C) is truncated at its C-terminus by deletion of 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, or more than 15 amino acids. In particular, it was found that off-target editing was reduced by truncation of the N-terminal DddA fragment (G1397N) at its C-terminus by deletion of three amino acids without any observed lowering of on-target editing. This produced an even greater effect when combined with truncation of the C-terminal DddA fragment (G1397C) at its N-terminus by deletion of 5 amino acids.

Thus, in some embodiments, a DddA variant provided herein comprises a C-terminal fragment comprising an N-terminal amino acid truncation. In some embodiments, the C-terminal fragment comprises an N-terminal amino acid truncation of 1-15 amino acids in length (e.g., 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, or more than 15 amino acids in length). In some embodiments, a DddA variant comprises a C-terminal fragment comprising the amino acid sequence of any one of SEQ ID NOs: 253-267:

N-Terminal Truncations of G1397C DddA Fragment:

Truncation:
Sequence:

Canonical
AIPVKRGATGETKVFTGNSNSPKSPTKGGC (SEQ ID NO: 139)

NΔ1
_IPVKRGATGETKVFTGNSNSPKSPTKGGC (SEQ ID NO: 253)

NΔ2
__PVKRGATGETKVFTGNSNSPKSPTKGGC (SEQ ID NO: 254)

NΔ3
___VKRGATGETKVFTGNSNSPKSPTKGGC (SEQ ID NO: 255)

NΔ4
____KRGATGETKVFTGNSNSPKSPTKGGC (SEQ ID NO: 256)

NΔ5
_____RGATGETKVFTGNSNSPKSPTKGGC (SEQ ID NO: 257)

NΔ6
______GATGETKVFTGNSNSPKSPTKGGC (SEQ ID NO: 258)

NΔ7
_______ATGETKVFTGNSNSPKSPTKGGC (SEQ ID NO: 259)

NΔ8
________TGETKVFTGNSNSPKSPTKGGC (SEQ ID NO: 260)

NΔ9
_________GETKVFTGNSNSPKSPTKGGC (SEQ ID NO: 261)

NΔ10
__________ETKVFTGNSNSPKSPTKGGC (SEQ ID NO: 262)

NΔ11
___________TKVFTGNSNSPKSPTKGGC (SEQ ID NO: 263)

NΔ12
____________KVFTGNSNSPKSPTKGGC (SEQ ID NO: 264)

NΔ13
_____________VFTGNSNSPKSPTKGGC (SEQ ID NO: 265)

NΔ14
______________FTGNSNSPKSPTKGGC (SEQ ID NO: 266)

NΔ15
_______________TGNSNSPKSPTKGGC (SEQ ID NO: 267)

In some embodiments, a DddA variant provided herein comprises a C-terminal fragment comprising a C-terminal amino acid truncation. In some embodiments, the C-terminal fragment comprises a C-terminal amino acid truncation of 1-15 amino acids in length (e.g., 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, or more than 15 amino acids in length). In some embodiments, a DddA variant comprises a C-terminal fragment comprising the amino acid sequence of any one of SEQ ID NOs: 268-282:

C-Terminal Truncations of G1397C DddA Fragment:

Truncation:
Sequence:

Canonical
AIPVKRGATGETKVFTGNSNSPKSPTKGGC (SEQ ID NO: 139)

CΔ1
AIPVKRGATGETKVFTGNSNSPKSPTKGG_ (SEQ ID NO: 268)

CΔ2
AIPVKRGATGETKVFTGNSNSPKSPTKG__ (SEQ ID NO: 269)

CΔ3
AIPVKRGATGETKVFTGNSNSPKSPTK___ (SEQ ID NO: 270)

CΔ4
AIPVKRGATGETKVFTGNSNSPKSPT____ (SEQ ID NO: 271)

CΔ5
AIPVKRGATGETKVFTGNSNSPKSP_____ (SEQ ID NO: 272)

CΔ6
AIPVKRGATGETKVFTGNSNSPKS______ (SEQ ID NO: 273)

CΔ7
AIPVKRGATGETKVFTGNSNSPK_______ (SEQ ID NO: 274)

CΔ8
AIPVKRGATGETKVFTGNSNSP________ (SEQ ID NO: 275)

CΔ9
AIPVKRGATGETKVFTGNSNS_________ (SEQ ID NO: 276)

CΔ10
AIPVKRGATGETKVFTGNSN__________ (SEQ ID NO: 277)

CΔ11
AIPVKRGATGETKVFTGNS___________ (SEQ ID NO: 278)

CΔ12
AIPVKRGATGETKVFTGN____________ (SEQ ID NO: 279)

CΔ13
AIPVKRGATGETKVFTG_____________ (SEQ ID NO: 280)

CΔ14
AIPVKRGATGETKVFT______________ (SEQ ID NO: 281)

CΔ15
AIPVKRGATGETKVF_______________ (SEQ ID NO: 282)

In some embodiments, a DddA variant provided herein comprises an N-terminal fragment comprising a C-terminal amino acid truncation. In some embodiments, the N-terminal fragment comprises a C-terminal amino acid truncation of 1-10 amino acids in length (e.g., 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more than 10 amino acids in length). In certain embodiments, the N-terminal fragment comprises a C-terminal amino acid truncation of 3 amino acids in length. In some embodiments, a DddA variant comprises an N-terminal fragment comprising the amino acid sequence of any one of SEQ ID NOs: 284-293:

C-Terminal Truncations of G1397N Fragment:

Truncation:
Sequence:

Canonical
GSYALGPYQISAPQLPAYNGQTVGTFYYVNDAGGLESKVFSSGGPTPYPNYANAGHVEGQSALFMRDNGISEGLVFHNNPEGTCGFCVNMTETLLPENAKMTVVPPEG (SEQ ID NO: 283)

CΔ1
GSYALGPYQISAPQLPAYNGQTVGTFYYVNDAGGLESKVFSSGGPTPYPNYANAGHVEGQSALFMRDNGISEGLVFHNNPEGTCGFCVNMTETLLPENAKMTVVPPE_ (SEQ ID NO: 284)

CΔ2
GSYALGPYQISAPQLPAYNGQTVGTFYYVNDAGGLESKVFSSGGPTPYPNYANAGHVEGQSALFMRDNGISEGLVFHNNPEGTCGFCVNMTETLLPENAKMTVVPP__ (SEQ ID NO: 285)

CΔ3
GSYALGPYQISAPQLPAYNGQTVGTFYYVNDAGGLESKVFSSGGPTPYPNYANAGHVEGQSALFMRDNGISEGLVFHNNPEGTCGFCVNMTETLLPENAKMTVVP___ (SEQ ID NO: 286)

CΔ4
GSYALGPYQISAPQLPAYNGQTVGTFYYVNDAGGLESKVFSSGGPTPYPNYANAGHVEGQSALFMRDNGISEGLVFHNNPEGTCGFCVNMTETLLPENAKMTVV____ (SEQ ID NO: 287)

CΔ5
GSYALGPYQISAPQLPAYNGQTVGTFYYVNDAGGLESKVFSSGGPTPYPNYANAGHVEGQSALFMRDNGISEGLVFHNNPEGTCGFCVNMTETLLPENAKMTV_____ (SEQ ID NO: 288)

CΔ6
GSYALGPYQISAPQLPAYNGQTVGTFYYVNDAGGLESKVFSSGGPTPYPNYANAGHVEGQSALFMRDNGISEGLVFHNNPEGTCGFCVNMTETLLPENAKMT______ (SEQ ID NO: 289)

CΔ7
GSYALGPYQISAPQLPAYNGQTVGTFYYVNDAGGLESKVFSSGGPTPYPNYANAGHVEGQSALFMRDNGISEGLVFHNNPEGTCGFCVNMTETLLPENAKM_______ (SEQ ID NO: 290)

CΔ8
GSYALGPYQISAPQLPAYNGQTVGTFYYVNDAGGLESKVFSSGGPTPYPNYANAGHVEGQSALFMRDNGISEGLVFHNNPEGTCGFCVNMTETLLPENAK________ (SEQ ID NO: 291)

CΔ9
GSYALGPYQISAPQLPAYNGQTVGTFYYVNDAGGLESKVFSSGGPTPYPNYANAGHVEGQSALFMRDNGISEGLVFHNNPEGTCGFCVNMTETLLPENA_________ (SEQ ID NO: 292)

CΔ10
GSYALGPYQISAPQLPAYNGQTVGTFYYVNDAGGLESKVFSSGGPTPYPNYANAGHVEGQSALFMRDNGISEGLVFHNNPEGTCGFCVNMTETLLPEN__________ (SEQ ID NO: 293)

In some embodiments, a DddA variant provided herein comprises an N-terminal fragment comprising a C-terminal amino acid extension. In some embodiments, the N-terminal fragment comprises a C-terminal amino acid extension of 1-15 amino acids in length (e.g., 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, or more than 15 amino acids in length). In some embodiments, a DddA variant comprises an N-terminal fragment comprising the amino acid sequence of any one of SEQ ID NOs: 294-308:

C-terminal extensions of G1397N fragment:

Extension:
Sequence:

Canonical
GSYALGPYQISAPQLPAYNGQTVGTFYYVNDAGGLESKVFSSGGPTPYPNYA

NAGHVEGQSALFMRDNGISEGLVFHNNPEGTCGFCVNMTETLLPENAKMTVV

PPEG (SEQ ID NO: 283)

C + 1
GSYALGPYQISAPQLPAYNGQTVGTFYYVNDAGGLESKVFSSGGPTPYPNYA

NAGHVEGQSALFMRDNGISEGLVFHNNPEGTCGFCVNMTETLLPENAKMTVV

PPEGA (SEQ ID NO: 294)

C + 2
GSYALGPYQISAPQLPAYNGQTVGTFYYVNDAGGLESKVFSSGGPTPYPNYA

NAGHVEGQSALFMRDNGISEGLVFHNNPEGTCGFCVNMTETLLPENAKMTVV

PPEGAI (SEQ ID NO: 295)

C + 3
GSYALGPYQISAPQLPAYNGQTVGTFYYVNDAGGLESKVFSSGGPTPYPNYA

NAGHVEGQSALFMRDNGISEGLVFHNNPEGTCGFCVNMTETLLPENAKMTVV

PPEGAIP (SEQ ID NO: 296)

C + 4
GSYALGPYQISAPQLPAYNGQTVGTFYYVNDAGGLESKVFSSGGPTPYPNYA

NAGHVEGQSALFMRDNGISEGLVFHNNPEGTCGFCVNMTETLLPENAKMTVV

PPEGAIPV (SEQ ID NO: 297)

C + 5
GSYALGPYQISAPQLPAYNGQTVGTFYYVNDAGGLESKVFSSGGPTPYPNYA

NAGHVEGQSALFMRDNGISEGLVFHNNPEGTCGFCVNMTETLLPENAKMTVV

PPEGAIPVK (SEQ ID NO: 298)

C + 6
GSYALGPYQI SAPQLPAYNGQTVGTFYYVNDAGGLESKVFSSGGPTPYPNYA

NAGHVEGQSALFMRDNGISEGLVFHNNPEGTCGFCVNMTETLLPENAKMTVV

PPEGAIPVKR (SEQ ID NO: 299)

C + 7
GSYALGPYQISAPQLPAYNGQTVGTFYYVNDAGGLESKVFSSGGPTPYPNYA

NAGHVEGQSALFMRDNGISEGLVFHNNPEGTCGFCVNMTETLLPENAKMTVV

PPEGAIPVKRG (SEQ ID NO: 300)

C + 8
GSYALGPYQISAPQLPAYNGQTVGTFYYVNDAGGLESKVFSSGGPTPYPNYA

NAGHVEGQSALFMRDNGISEGLVFHNNPEGTCGFCVNMTETLLPENAKMTVV

PPEGAIPVKRGA (SEQ ID NO: 301)

C + 9
GSYALGPYQISAPQLPAYNGQTVGTFYYVNDAGGLESKVFSSGGPTPYPNYA

NAGHVEGQSALFMRDNGISEGLVFHNNPEGTCGFCVNMTETLLPENAKMTVV

PPEGAIPVKRGAT (SEQ ID NO: 302)

C + 10
GSYALGPYQISAPQLPAYNGQTVGTFYYVNDAGGLESKVFSSGGPTPYPNYA

NAGHVEGQSALFMRDNGISEGLVFHNNPEGTCGFCVNMTETLLPENAKMTVV

PPEGAIPVKRGATG (SEQ ID NO: 303)

C + 11
GSYALGPYQISAPQLPAYNGQTVGTFYYVNDAGGLESKVFSSGGPTPYPNYA

NAGHVEGQSALFMRDNGISEGLVFHNNPEGTCGFCVNMTETLLPENAKMTVV

PPEGAIPVKRGATGE (SEQ ID NO: 304)

C + 12
GSYALGPYQISAPQLPAYNGQTVGTFYYVNDAGGLESKVFSSGGPTPYPNYA

NAGHVEGQSALFMRDNGISEGLVFHNNPEGTCGFCVNMTETLLPENAKMTVV

PPEGAIPVKRGATGET (SEQ ID NO: 305)

C + 13
GSYALGPYQISAPQLPAYNGQTVGTFYYVNDAGGLESKVFSSGGPTPYPNYA

NAGHVEGQSALFMRDNGISEGLVFHNNPEGTCGFCVNMTETLLPENAKMTVV

PPEGAIPVKRGATGETK (SEQ ID NO: 306)

C + 14
GSYALGPYQISAPQLPAYNGQTVGTFYYVNDAGGLESKVFSSGGPTPYPNYA

NAGHVEGQSALFMRDNGISEGLVFHNNPEGTCGFCVNMTETLLPENAKMTVV

PPEGAIPVKRGATGETKV (SEQ ID NO: 307)

C + 15
GSYALGPYQISAPQLPAYNGQTVGTFYYVNDAGGLESKVFSSGGPTPYPNYA

NAGHVEGQSALFMRDNGISEGLVFHNNPEGTCGFCVNMTETLLPENAKMTVV

PPEGAIPVKRGATGETKVF (SEQ ID NO: 308)
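
By way of non-limiting illustration, the C + n entries of SEQ ID NOs: 294-308 follow a regular pattern: each appends one further residue of the run AIPVKRGATGETKVF to the same fragment ending in PPEG. The following Python sketch reproduces that pattern. The names BASE_FRAGMENT, EXTENSION, and c_terminal_extension are illustrative assumptions introduced here and are not part of the disclosure.

```python
# Fragment shared by every "C + n" entry (everything before the appended residues).
BASE_FRAGMENT = (
    "GSYALGPYQISAPQLPAYNGQTVGTFYYVNDAGGLESKVFSSGGPTPYPNYA"
    "NAGHVEGQSALFMRDNGISEGLVFHNNPEGTCGFCVNMTETLLPENAKMTVV"
    "PPEG"
)

# Residues appended one at a time across SEQ ID NOs: 294-308.
EXTENSION = "AIPVKRGATGETKVF"


def c_terminal_extension(n: int) -> str:
    """Return the fragment extended by the first n residues of EXTENSION."""
    if not 1 <= n <= len(EXTENSION):
        raise ValueError(f"n must be between 1 and {len(EXTENSION)}")
    return BASE_FRAGMENT + EXTENSION[:n]


if __name__ == "__main__":
    # Print the appended tail for each extension length.
    for n in range(1, len(EXTENSION) + 1):
        print(f"C + {n}: ...PPEG{c_terminal_extension(n)[len(BASE_FRAGMENT):]}")
```

In this sketch, c_terminal_extension(2) corresponds to the C + 2 entry and c_terminal_extension(15) to the C + 15 entry.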

In certain embodiments, a DddA variant further comprises a sequence of charged amino acid residues (for example, upstream of the DddA variant, e.g., in a linker joining the DddA variant to a pDNAbp such as a zinc finger domain-containing protein as described herein). As described further herein, the inventors hypothesized that introducing charged residues into the flexible linker between the ZF and the split DddA halves would create electrostatic repulsion that weakens the spontaneous reassembly of DddA at off-target sites. In some embodiments, the charged sequence is GSGGGGSGDDDGS (SEQ ID NO: 319), GSGGGDDDDDDGS (SEQ ID NO: 320), GSDDDDDDDDDGS (SEQ ID NO: 321), GSGGGGSGGSDDD (SEQ ID NO: 316), GSGGGGSDDDDDD (SEQ ID NO: 317), GSGGDDDDDDDDD (SEQ ID NO: 318), GSGGGGSGEEEGS (SEQ ID NO: 313), GSGGGEEEEEEGS (SEQ ID NO: 314), GSEEEEEEEEEGS (SEQ ID NO: 315), GSGGGGSGGSEEE (SEQ ID NO: 310), GSGGGGSEEEEEE (SEQ ID NO: 311), or GSGGEEEEEEEEE (SEQ ID NO: 312). In some embodiments, the charged sequence is SGDDDGS (SEQ ID NO: 326), SGDDDDDDGS (SEQ ID NO: 327), SGDDDDDDDDDGS (SEQ ID NO: 328), DDDGS (SEQ ID NO: 323), DDDDDDGS (SEQ ID NO: 324), or DDDDDDDDDGS (SEQ ID NO: 325). In some embodiments, the sequence of charged amino acid residues comprises the amino acid sequence of any one of SEQ ID NOs: 309-334:

Charged residues upstream or downstream of split DddA to weaken binding affinity between split halves and lower off-target activity:

GSGGGGSGGSGGS (SEQ ID NO: 309)

GSGGGGSGGSEEE (SEQ ID NO: 310)

GSGGGGSEEEEEE (SEQ ID NO: 311)

GSGGEEEEEEEEE (SEQ ID NO: 312)

GSGGGGSGEEEGS (SEQ ID NO: 313)

GSGGGEEEEEEGS (SEQ ID NO: 314)

GSEEEEEEEEEGS (SEQ ID NO: 315)

GSGGGGSGGSDDD (SEQ ID NO: 316)

GSGGGGSDDDDDD (SEQ ID NO: 317)

GSGGDDDDDDDDD (SEQ ID NO: 318)

GSGGGGSGDDDGS (SEQ ID NO: 319)

GSGGGDDDDDDGS (SEQ ID NO: 320)

GSDDDDDDDDDGS (SEQ ID NO: 321)

SGGS (SEQ ID NO: 322)

DDDGS (SEQ ID NO: 323)

DDDDDDGS (SEQ ID NO: 324)

DDDDDDDDDGS (SEQ ID NO: 325)

SGDDDGS (SEQ ID NO: 326)

SGDDDDDDGS (SEQ ID NO: 327)

SGDDDDDDDDDGS (SEQ ID NO: 328)

EEEGS (SEQ ID NO: 329)

EEEEEEGS (SEQ ID NO: 330)

EEEEEEEEEGS (SEQ ID NO: 331)

SGEEEGS (SEQ ID NO: 332)

SGEEEEEEGS (SEQ ID NO: 333)

SGEEEEEEEEEGS (SEQ ID NO: 334)

In some embodiments, the sequence of charged amino acid residues may weaken the binding affinity of the first fragment and the second fragment of the DddA variant to one another.
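
By way of non-limiting illustration, the acidic content of the linkers of SEQ ID NOs: 309-334 can be tallied with a short Python sketch. The approximation used here (each D or E counted as -1 and each K or R as +1, ignoring termini, histidine, and pH effects) is a simplifying assumption introduced for illustration; it is not a measurement and is not part of the disclosure. The names LINKERS, acidic_count, and approx_net_charge are likewise illustrative.

```python
# Linker sequences keyed by SEQ ID NO, copied from the listing above.
LINKERS = {
    309: "GSGGGGSGGSGGS",
    310: "GSGGGGSGGSEEE",
    311: "GSGGGGSEEEEEE",
    312: "GSGGEEEEEEEEE",
    313: "GSGGGGSGEEEGS",
    314: "GSGGGEEEEEEGS",
    315: "GSEEEEEEEEEGS",
    316: "GSGGGGSGGSDDD",
    317: "GSGGGGSDDDDDD",
    318: "GSGGDDDDDDDDD",
    319: "GSGGGGSGDDDGS",
    320: "GSGGGDDDDDDGS",
    321: "GSDDDDDDDDDGS",
    322: "SGGS",
    323: "DDDGS",
    324: "DDDDDDGS",
    325: "DDDDDDDDDGS",
    326: "SGDDDGS",
    327: "SGDDDDDDGS",
    328: "SGDDDDDDDDDGS",
    329: "EEEGS",
    330: "EEEEEEGS",
    331: "EEEEEEEEEGS",
    332: "SGEEEGS",
    333: "SGEEEEEEGS",
    334: "SGEEEEEEEEEGS",
}


def acidic_count(seq: str) -> int:
    """Number of aspartate (D) and glutamate (E) residues in seq."""
    return seq.count("D") + seq.count("E")


def approx_net_charge(seq: str) -> int:
    """Crude side-chain charge estimate: +1 per K/R, -1 per D/E (assumption)."""
    return seq.count("K") + seq.count("R") - acidic_count(seq)


if __name__ == "__main__":
    for seq_id, seq in LINKERS.items():
        print(f"SEQ ID NO: {seq_id}  {seq:<13}  approx. charge {approx_net_charge(seq):+d}")
```

Under this approximation, the uncharged reference linkers (SEQ ID NOs: 309 and 322) score zero, and the remaining linkers score increasingly negative as the acidic run lengthens.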

In some embodiments, a DddA variant further comprises a catalytically dead second DddA fragment fused to the first DddA fragment. As described further herein, DddA can be catalytically inactivated by introduction of an E1347A mutation. In the G1397-split architecture, this mutation lies in the N-terminal DddA fragment (G1397N). The inventors hypothesized that fusing a catalytically inactivated N-terminal DddA fragment (G1397N) adjacent to the C-terminal DddA fragment (G1397C) would cause the inactivated fragment to compete for reassembly and thereby weaken the spontaneous reassembly of catalytically active DddA at off-target sites. Thus, the present disclosure provides ZF-DdCBE constructs in which a catalytically inactivated N-terminal DddA fragment (G1397N) is fused downstream of the C-terminal DddA fragment (G1397C), either before or after the UGI, using flexible linkers of different lengths. In some embodiments, the catalytically dead second DddA fragment comprises the amino acid sequence of SEQ ID NO: 335, or an amino acid sequence that is at least 80%, at least 85%, at least 90%, at least 95%, or at least 99% identical to the amino acid sequence of SEQ ID NO: 335:

Fusion of “Dead” DddA N-Terminal Domain to C-Terminal DddA Fragment to Reduce Off-Target Activity:

Canonical
GSYALGPYQISAPQLPAYNGQTVGTFYYVNDAGGLESKVF

SSGGPTPYPNYANAGHVEGQSALFMRDNGISEGLVFHNN

PEGTCGFCVNMTETLLPENAKMTVVPPEG

(SEQ ID NO: 283)

Dead (E1347A)
GSYALGPYQISAPQLPAYNGQTVGTFYYVNDAGGLESKVF

SSGGPTPYPNYANAGHVAGQSALFMRDNGISEGLVFHNNP

EGTCGFCVNMTETLLPENAKMTVVPPEG

(SEQ ID NO: 335)
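
By way of non-limiting illustration, the relationship between the canonical fragment (SEQ ID NO: 283) and the dead fragment (SEQ ID NO: 335) can be checked with the following Python sketch, which lists the positions at which the two fragments differ and computes their percent identity. The mapping to full-length DddA numbering assumes that the final glycine of each fragment is G1397, consistent with the G1397-split architecture; that assumption, and the names differences, percent_identity, and full_length_number, are introduced here for illustration. The identity calculation is a simple ungapped comparison of equal-length strings, not an alignment.

```python
CANONICAL = (  # SEQ ID NO: 283
    "GSYALGPYQISAPQLPAYNGQTVGTFYYVNDAGGLESKVF"
    "SSGGPTPYPNYANAGHVEGQSALFMRDNGISEGLVFHNN"
    "PEGTCGFCVNMTETLLPENAKMTVVPPEG"
)
DEAD = (  # SEQ ID NO: 335
    "GSYALGPYQISAPQLPAYNGQTVGTFYYVNDAGGLESKVF"
    "SSGGPTPYPNYANAGHVAGQSALFMRDNGISEGLVFHNNP"
    "EGTCGFCVNMTETLLPENAKMTVVPPEG"
)

# Assumption: the last residue of the N-terminal fragment is G1397.
LAST_RESIDUE_NUMBER = 1397


def differences(a: str, b: str) -> list[tuple[int, str, str]]:
    """Return (0-based index, residue in a, residue in b) for each mismatch."""
    if len(a) != len(b):
        raise ValueError("ungapped comparison requires equal-length sequences")
    return [(i, x, y) for i, (x, y) in enumerate(zip(a, b)) if x != y]


def percent_identity(a: str, b: str) -> float:
    """Percent of positions that are identical in an ungapped comparison."""
    return 100.0 * (len(a) - len(differences(a, b))) / len(a)


def full_length_number(index: int, fragment: str) -> int:
    """Map a 0-based fragment index to full-length numbering (see assumption)."""
    return LAST_RESIDUE_NUMBER - len(fragment) + 1 + index


if __name__ == "__main__":
    for index, wild_type, mutant in differences(CANONICAL, DEAD):
        position = full_length_number(index, CANONICAL)
        print(f"{wild_type}{position}{mutant}")
    print(f"{percent_identity(CANONICAL, DEAD):.2f}% identical")
```

Under the stated assumption, the single difference between the two fragments maps to residue 1347, consistent with the E1347A label.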

The changes made in each of the DddA variants provided herein relative to wild-type DddA may be made in any combination with one another. In some embodiments, combining two or more of the point mutations, truncations, extensions, etc. described herein results in a DddA variant with further increased on-target editing activity and/or further decreased off-target editing activity relative to a DddA variant comprising only a single point mutation, truncation, extension, etc. Mutants comprising an N18K mutation, N18K and P25A mutations, or N18K and P25K mutations showed particularly promising increases in activity. Variants comprising a truncation of the three C-terminal amino acids of the N-terminal DddA fragment also showed particularly promising increases in activity, especially in combination with N18K and/or P25A or P25K mutations. Thus, in some embodiments, a DddA variant comprises a C-terminal fragment comprising amino acid substitutions at positions N18 and P25 and an N-terminal fragment comprising a C-terminal amino acid truncation of 3 amino acids in length. In certain embodiments, the C-terminal fragment comprises the amino acid substitutions N18K and P25A, and the N-terminal fragment comprises a C-terminal amino acid truncation of 3 amino acids in length. In certain embodiments, the C-terminal fragment comprises the amino acid substitutions N18K and P25K, and the N-terminal fragment comprises a C-terminal amino acid truncation of 3 amino acids in length.
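
By way of non-limiting illustration, combining substitutions with a C-terminal truncation can be sketched in Python as follows. The helper names apply_substitutions and truncate_c_terminus are assumptions introduced here. The string TOY_FRAGMENT is a hypothetical placeholder with N at position 18 and P at position 25; it is not a DddA sequence and is used only to exercise the substitution helper. The truncation is demonstrated on the N-terminal fragment of SEQ ID NO: 283; whether the resulting string matches any particular disclosed variant is not asserted.

```python
import re

N_TERMINAL_FRAGMENT = (  # SEQ ID NO: 283
    "GSYALGPYQISAPQLPAYNGQTVGTFYYVNDAGGLESKVF"
    "SSGGPTPYPNYANAGHVEGQSALFMRDNGISEGLVFHNN"
    "PEGTCGFCVNMTETLLPENAKMTVVPPEG"
)

# Hypothetical placeholder (NOT a DddA sequence): N at position 18, P at position 25.
TOY_FRAGMENT = "A" * 17 + "N" + "A" * 6 + "P"


def apply_substitutions(seq: str, substitutions: list[str]) -> str:
    """Apply substitutions such as 'N18K' (1-based), checking the original residue."""
    residues = list(seq)
    for substitution in substitutions:
        match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", substitution)
        if match is None:
            raise ValueError(f"unrecognized substitution: {substitution}")
        original, position, replacement = match.group(1), int(match.group(2)), match.group(3)
        if not 1 <= position <= len(residues):
            raise ValueError(f"position {position} is outside the sequence")
        if residues[position - 1] != original:
            raise ValueError(f"expected {original} at position {position}")
        residues[position - 1] = replacement
    return "".join(residues)


def truncate_c_terminus(seq: str, n: int) -> str:
    """Remove the last n residues of seq."""
    if not 0 <= n < len(seq):
        raise ValueError("n must be non-negative and shorter than the sequence")
    return seq[: len(seq) - n]


if __name__ == "__main__":
    print(apply_substitutions(TOY_FRAGMENT, ["N18K", "P25A"]))
    print(truncate_c_terminus(N_TERMINAL_FRAGMENT, 3)[-8:])
```

Checking the original residue before each substitution guards against applying a mutation label to a fragment that uses a different numbering.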

Any of the point mutations, amino acid truncations, extensions, etc. described herein can also be made at corresponding positions in other DddA enzymes and homologs. In various embodiments, the following exemplary DddA enzymes, or variants thereof, can be used to create additional DddA variants comprising the point mutations, amino acid truncations, extensions, etc. described herein, or a sequence (amino acid or nucleotide as the case may be) having at least 70%, 75%, 80%, 85%, 90%, 95%, or 99% sequence identity with any one of the following DddA sequences:

DddA Description
DddA amino acid and/or nucleotide sequence

DddA homolog in Burkholderia gladioli, PROTEIN
>ATF83755.1 hypothetical protein CO712_00910 [Burkholderia gladioli pv. Gladioli]

MYEAARVTDPIEHTSALAGFLVGAVLGIALIAAVAFATFTCGFGVALLAGM

AAGIGAQVLLSLGESIGKMFSSQSGAITLGSPNVYVNGKQAAYATLSSVTCS

KHNPTPLVAQGSTNIFINGKPAARKDDKITCGAAISDGSHDTYFHGGIQTCLP

IDDEVPPWLRTATDWAFALAGLVGGLGGLLKEAGGLSHAVMPCAAKFIGG

YVLGEAASRYVIGPAINSAIGGMFGNPVDVTTGRKILPAESETDYVVPSPMP

VAIRRFYSSDLDYVGTLGRGWVLPWELRLHARDGRLWYTDAQGRESGFPIL

KPGQAAFSEADQRYLTCTPDGRYILHDVGETYYDFGRYEPGSGRIGWVRRIE

DQAGQWCQFERDSRGRVREIQTCGGLLAVLDYEPEHERLAEVSLVSGDQRR

LVVAYGYDENGQMASVTDANGAVVRRFTYADGRMTSHSNALGFTSGYTW

KVIDGTPRVVATHTSEGEAWAFEYDIEGRRTHVRHADGRHAQWRYDAQFQ

IVEYLDFDGRRYGLKYNAAGMPVMLTLPGERTVMFEYDDAGRIVAETDPLG

RTTKTRYDGNSMRPVEIILPDGSAWHAEYDRQGRLLVTRDPLDRENRYEYP

EALSALPVAHVDALGGRKTFEWNRLGELVAYTDCSGKTTRNFFDAFGLPLA

RENALGHRVSFDLRPTGETRRVTYPDGSSESYEYDAAGLMIRHIGLGGRMQ

TLQRNARGQLVEAVDPAGRRTRYHYDAEGRLRELQQAHARYAFAYSAGGR

LVSETRPDGVLRRFEYGEAGDLAALEIVGTADDCAPNDRPVRAIRFERDRM

GNLCVQHTPTEVTRYERDAGGRLLEVASVPTAAGLALGIAPDTLTFEYDKA

GRLSAEHGANGSVQYTLDALDNVLKLALPHEQTLQMLRYGSGHVHQIRHG

DQVVSDFERDDLHRELTRTQGPLTERTAYDLLGRKIWQSAGFQPDALARGQ

GQLWRNYGYDAAGELVESHDSLRGSTQFSYDPAGYLTQRVNTADRQLESF

AWDAAGNLLDDAQRSSRGYVEGNRLRMWQNLRFDYDAFGNLATKLRGAN

QRQQFTYDGQDRLVAVRTQGARGVVETRFAYDPLGRRIAKTDRTLDVRGV

TLREETKRFVWEGLRLAQEVRDTGVSSYVYSPDAPYMPAARVDAVKAEAL

ANAAIDKARQATRIYHFHTDVSGAPQEATNEAGDIVWAGQYSAWGKVAPN

QHAPARIDQPLRYAGQYADDSTELHYNTFRFYDPDVGRFINQDPIGLMGGL

NLYQYAPNSIAWTDWWGLAGSYTLGSYQISAPQLPAYNGQTVGTFYYVND

AGGLESRTFSSGGPTPYPNYANAGHVEGQSALFMRDNGISDGLVFHNNPEG

TCGFCVNMTETLLPENSKLTVVPPEGSIPVKRGATGETRTFTGNSKSPKSPVK

GGC (SEQ ID NO: 361)

DddA homolog in Burkholderia gladioli, DNA
>CO712_00910 NZ_CP023522.1:185368-189645 Burkholderia gladioli pv. Gladioli strain FDAARGOS_389 chromosome 1, complete sequence

GTGTACGAAGCGGCCCGCGTCACGGATCCGATCGAGCACACCAGCGCGC

TGGCCGGCTTCCTGGTGGGCGCCGTGCTCGGTATCGCCCTGATTGCTGCC

GTGGCGTTCGCCACGTTCACCTGCGGCTTCGGCGTGGCACTGCTGGCCGG

CATGGCGGCCGGCATCGGCGCGCAGGTGCTGTTGTCGTTAGGGGAATCG

ATCGGGAAGATGTTCAGTTCGCAATCCGGCGCGATCACGCTCGGCTCGCC

GAACGTCTACGTGAACGGCAAGCAGGCCGCCTACGCCACGCTCAGCAGC

GTGACGTGCAGCAAGCACAACCCGACGCCGCTCGTCGCGCAGGGCTCCA

CCAACATCTTCATCAACGGCAAGCCGGCCGCGCGCAAGGACGACAAGAT

CACCTGCGGCGCGGCCATCTCGGACGGCTCGCACGACACCTACTTCCACG

GAGGCATCCAGACCTGCCTGCCGATCGACGACGAAGTGCCGCCGTGGCT

GCGCACCGCCACCGACTGGGCGTTCGCGCTGGCCGGGCTGGTGGGCGGG

CTCGGCGGCCTACTCAAGGAAGCGGGCGGGCTGTCGCACGCGGTGATGC

CGTGCGCGGCGAAGTTCATCGGCGGCTACGTGCTCGGCGAGGCGGCGAG

CCGCTACGTGATCGGCCCGGCCATCAACAGCGCGATCGGCGGGATGTTC

GGCAACCCGGTAGACGTCACCACTGGGCGCAAGATCCTCCCTGCCGAAT

CGGAAACCGATTACGTCGTGCCCAGCCCGATGCCGGTGGCGATCCGGCG

CTTCTATTCGAGCGACCTCGATTACGTCGGCACGCTTGGGCGCGGCTGGG

TGCTGCCGTGGGAGCTGCGCCTGCACGCGCGTGACGGTCGGCTCTGGTAC

ACCGACGCGCAGGGGCGCGAGAGCGGCTTCCCGATCCTGAAACCGGGCC

AGGCCGCGTTCAGCGAGGCCGATCAGCGCTATCTGACCTGCACGCCGGA

TGGCCGCTACATCCTCCACGACGTCGGCGAAACCTATTACGACTTCGGCC

GCTACGAGCCGGGCTCGGGCCGCATCGGCTGGGTGCGCCGGATCGAGGA

TCAGGCCGGCCAGTGGTGCCAGTTCGAGCGCGACAGCCGTGGCCGCGTG

CGTGAAATCCAGACCTGCGGCGGCTTGCTGGCCGTGCTCGATTACGAGCC

GGAGCACGAGCGGCTCGCCGAGGTGTCGCTCGTCAGCGGCGATCAGCGC

CGCCTCGTCGTGGCCTACGGCTACGACGAAAACGGCCAGATGGCCTCCG

TGACCGACGCGAACGGCGCGGTGGTGCGCCGCTTCACCTATGCCGACGG

GCGCATGACGAGCCATTCGAACGCGCTCGGTTTCACGTCGGGCTATACGT

GGAAGGTCATCGACGGCACGCCGCGAGTGGTCGCCACCCACACCAGCGA

GGGCGAGGCCTGGGCGTTCGAGTACGACATCGAAGGCCGCCGCACCCAT

GTGCGGCATGCCGACGGCCGCCACGCGCAATGGCGCTACGACGCGCAAT

TCCAGATCGTCGAGTACCTCGATTTCGACGGCCGTCGCTACGGGCTCAAG

TACAACGCTGCCGGCATGCCCGTGATGCTGACGCTGCCCGGCGAACGAA

CCGTGATGTTCGAGTACGACGACGCCGGCCGCATCGTCGCCGAAACCGA

TCCCCTCGGCCGCACCACGAAAACGCGCTACGACGGCAACAGCATGCGG

CCCGTCGAGATCATCTTGCCCGACGGCAGCGCCTGGCACGCCGAATACG

ACCGGCAGGGCCGGCTGCTCGTCACCCGTGATCCGCTCGACCGGGAGAA

TCGCTACGAATATCCGGAGGCACTGAGCGCGCTCCCGGTGGCGCATGTC

GATGCGCTGGGCGGGCGCAAGACGTTCGAGTGGAACCGGCTCGGCGAGC

TGGTGGCCTACACCGATTGCTCGGGCAAGACCACGCGCAATTTTTTCGAT

GCATTCGGCCTGCCGCTCGCGCGCGAGAACGCGCTCGGGCACCGCGTGT

CGTTCGATCTGCGCCCGACCGGCGAGACGCGCCGCGTCACCTATCCCGAC

GGCAGTTCCGAAAGCTACGAATACGACGCCGCCGGGCTGATGATCCGGC

ACATCGGGCTGGGCGGCCGGATGCAGACGTTGCAGCGCAATGCGCGCGG

GCAACTCGTCGAGGCGGTCGATCCGGCCGGGCGGCGAACCCGCTACCAC

TACGACGCCGAAGGGCGGCTGCGCGAGCTGCAACAGGCCCACGCGCGCT

ACGCATTCGCGTACAGCGCAGGCGGGCGGCTTGTCAGCGAAACGCGGCC

CGACGGCGTGCTGCGCCGCTTCGAATACGGCGAGGCCGGCGATCTGGCG

GCGCTCGAGATCGTCGGAACGGCCGATGATTGCGCTCCAAACGATCGCC

CGGTTCGCGCGATCCGCTTCGAGCGCGACCGGATGGGTAACCTGTGCGTG

CAGCACACGCCTACCGAGGTGACGCGCTACGAGCGCGACGCCGGCGGCC

GCCTGCTCGAAGTCGCGAGCGTGCCGACCGCGGCCGGACTGGCGCTCGG

CATCGCGCCCGACACGCTGACCTTCGAATACGACAAGGCCGGGCGGCTG

AGCGCCGAACACGGCGCGAACGGCAGCGTCCAGTACACGCTCGACGCGC

TCGACAACGTGTTGAAGCTCGCCTTGCCGCACGAACAGACGCTGCAGAT

GCTGCGCTACGGCTCGGGGCACGTGCACCAGATTCGCCACGGCGACCAG

GTCGTCAGCGATTTCGAGCGCGACGACCTGCATCGCGAGTTGACGCGCA

CGCAGGGCCCCCTGACCGAGCGGACCGCCTACGACCTGCTGGGCCGCAA

GATCTGGCAATCAGCCGGCTTCCAGCCCGACGCGCTTGCGCGTGGGCAG

GGCCAGCTGTGGCGCAACTACGGCTACGACGCCGCCGGGGAACTGGTCG

AGAGCCACGACAGCCTGCGCGGCAGCACGCAGTTCAGCTACGATCCGGC

CGGCTATCTGACGCAGCGCGTGAACACCGCCGACCGGCAGCTCGAATCG

TTCGCCTGGGACGCCGCCGGCAACCTGCTCGACGATGCGCAACGCAGCA

GCCGCGGCTATGTCGAGGGCAACCGGCTGCGCATGTGGCAGAACCTGCG

CTTCGACTACGACGCGTTCGGCAATCTCGCGACCAAGCTGCGCGGCGCG

AATCAGCGCCAGCAGTTCACGTACGATGGGCAGGATCGGCTCGTGGCCG

TGCGCACGCAGGGCGCGCGCGGCGTGGTGGAGACGCGTTTCGCCTACGA

TCCGCTCGGGCGGCGCATCGCCAAGACCGATAGGACACTCGACGTGCGC

GGCGTAACGCTGCGCGAGGAAACGAAGCGGTTCGTATGGGAAGGGCTGC

GGCTCGCGCAGGAGGTGCGCGACACCGGCGTGAGCAGCTACGTGTACAG

CCCGGATGCGCCTTACATGCCCGCGGCGCGGGTCGATGCGGTGAAAGCC

GAAGCGCTCGCAAACGCCGCGATCGACAAGGCCAGACAGGCGACGCGG

ATCTATCACTTTCATACCGATGTGTCGGGCGCACCGCAAGAAGCGACGA

ACGAGGCCGGCGACATTGTTTGGGCCGGCCAATACTCAGCCTGGGGCAA

GGTGGCGCCGAACCAGCATGCCCCAGCCCGGATCGATCAGCCGCTCCGC

TACGCCGGACAATATGCCGATGACAGTACCGAGCTGCACTACAACACGT

TTCGTTTCTACGATCCGGATGTCGGCCGGTTTATCAATCAGGATCCAATC

GGGTTGATGGGGGGGCTGAATCTTTACCAATATGCACCCAACTCAATCGC

GTGGACCGACTGGTGGGGGCTGGCCGGCAGCTATACGCTCGGTTCCTATC

AAATTTCTGCTCCTCAACTTCCCGCCTACAATGGGCAGACTGTTGGGACC

TTCTACTATGTAAACGACGCGGGCGGGCTCGAATCGAGGACATTCTCTTC

TGGAGGGCCGACCCCTTATCCAAATTATGCCAATGCCGGGCACGTGGAA

GGCCAGTCCGCACTGTTCATGAGGGATAACGGAATTTCAGACGGACTGG

TTTTCCACAACAACCCTGAGGGTACTTGCGGATTCTGCGTCAATATGACC

GAAACGCTTTTGCCTGAAAATTCCAAACTTACCGTCGTTCCGCCCGAGGG

CTCGATTCCGGTCAAGCGGGGCGCGACGGGCGAAACGAGAACATTTACA

GGGAACAGCAAGTCTCCGAAGTCCCCTGTCAAAGGAGGATGTTGA (SEQ ID NO: 362)

DddA homolog in Burkholderia glumae LMG 2196, PROTEIN
>AJY63123.1 RHS repeat-associated core domain protein [Burkholderia glumae LMG 2196 = ATCC 33617]

MYEAARVTDPIEHTSALTGFLVGAVLGIALIAAVAFATFTCGFGVALLAGMA

AGIGAQVLLSLGESIGKMFSSQSGAITLGSPNVYVNGKPTAYAMLSSVTCSK

HNPTPLVAQGSTNIFINGKPAARKDDKITCGATISDGSHDTYFHGGTQTCLPI

DDEVPPWLRTATDWAFALAGLVGGLGGLLKEAGGLSRAVMPCAAKFIGGY

VLGEAASRYVVGPAINSAIGGMFGNPVDVTTGRKILLAESETDYVVPSPMPV

AIRRFYSSDLDYVGTLGRGWVLPWELRLHARDGRLWYTDAQGRESGFPML

QPGHAAFSEADQRYLTCTPDGRYILHDLGETYYDFGHYEPGSGRIGWVRRIE

DQAGQWCQFERDSRGRVREIQTCGGLLAVLDYEPEHGRLAGVSLVSGDQR

RLVVAYGYDEHGQMASVTDANGALVRRFTYADGRMTSHSNALGFTSGYT

WQAVGGAPRVVATHTSEGEAWAFEYDIEGRRTHVRHADGRHAQWRYDAQ

FQIVEYLDFDGRRYGLKYNDAGMPVMLTLPGERTVTFEYDDAGRIVAETDP

LGRTTKTRYDGNSRRPVEIIAPDGSAWHAEYDRQGRLLATRDPLDRENRYE

YPKALSALPIAHVDALGGRKTFEWNRLGELVAYTDCSGKTTRNFYDAFGLP

LARENALGHRVTFDLRPTGEARRVTYPDGSTESYEYDAAGLMIRHVGLGGR

TQIALRNARGQIVEAVDPAGRRTCYRYDAEGRLRELQQGHARYAFTYSAGG

RLTSETRPDGVRRRFEYGEAGDLAALDIVGAADDATANDRPVRTIRFERDR

MGNLCAQHTPTEVTRYTRDTGGRLLEVACVPTAAGLALGIAPDTLTFEYDK

AGRLSAEHGANGSVRYTLDALDNVMKLALPHEQTLQMLRYGSGHVHQIRC

GDQVVSDFERDDLHRELTRTQGRLTERTAYDLLGRKIWQSAGFQPDALARG

QGQVWRNYGYDAAGELAESHDSLRGSTQFSYDPAGYLTQRVNTADRQLES

FAWDAAGNLLDDAQRRSRGYVEGNRLRMWQNLRFEYDPFGNLATKLRGA

NQRQQFTYDGQDRLVAVRTQDARGVVETRFAYDPLGRRIAKTDIVRDARG

VALREETKRFVWEGLRLAQEVRDTGVSSYVYSPDAPYTPAARVDAVLAEA

MAAAAIEQARQATRIYHFHTDVSGAPQEATNEAGDIVWAGQYSAWGKVAP

NQHAPARIDQPLRYAGQYADDSTELHYNTFRFYDPDVGRFINQDPIGLMGG

LNLYQYAPNSIAWTDWWGLAGSYTLGSYQISAPQLPAYNGQTVGTFYYVN

GAGGLESRTFSSGGPTPYPNYANAGHVEGQSALFMRDNGISDGLVFHNNPE

GTCGFCVNMTETLLPENSKLTVVPPEGAIPVKRGATGETRTFTGNSKSPKSPV

KGEC (SEQ ID NO: 363)

DddA homolog in Burkholderia glumae LMG 2196, DNA
>KS03_3390 CP009434.1:65330-69607 Burkholderia glumae LMG 2196 = ATCC 33617 chromosome II, complete sequence

GTGTACGAAGCGGCCCGCGTCACCGACCCGATCGAACACACCAGCGCGC

TGACCGGCTTTCTGGTGGGCGCCGTGCTCGGCATTGCCCTGATCGCCGCG

GTGGCGTTCGCCACCTTCACCTGCGGCTTCGGCGTGGCGCTGCTGGCCGG

CATGGCCGCCGGCATCGGCGCGCAGGTGCTGTTGTCGTTAGGAGAATCG

ATCGGGAAGATGTTCAGTTCGCAATCCGGCGCGATCACGCTCGGCTCGCC

GAACGTCTATGTGAACGGCAAGCCGACCGCCTACGCCATGCTCAGCAGC

GTGACGTGCAGCAAGCACAACCCGACGCCGCTCGTCGCGCAGGGGTCCA

CCAACATCTTCATCAACGGCAAGCCGGCCGCCCGCAAGGACGACAAGAT

CACCTGCGGCGCGACCATCTCCGACGGCTCGCACGACACCTATTTCCACG

GCGGCACCCAGACCTGCCTGCCGATCGACGACGAAGTGCCGCCGTGGCT

GCGCACCGCCACCGACTGGGCGTTCGCGCTGGCCGGGCTGGTGGGCGGG

CTCGGCGGCCTGCTCAAGGAAGCGGGCGGGCTGTCGCGCGCGGTGATGC

CGTGCGCGGCGAAGTTCATCGGCGGCTACGTGCTCGGCGAGGCGGCGAG

CCGCTACGTGGTCGGCCCGGCCATCAACAGCGCGATCGGCGGGATGTTC

GGCAACCCGGTGGACGTCACCACCGGGCGCAAGATCCTGCTGGCGGAAT

CGGAAACCGATTACGTGGTGCCCAGCCCGATGCCGGTGGCGATCCGGCG

CTTCTATTCGAGCGACCTCGACTACGTCGGCACGCTCGGGCGCGGCTGGG

TGCTGCCGTGGGAACTGCGGCTGCACGCGCGCGACGGGCGGCTCTGGTA

CACCGACGCGCAGGGGCGCGAGAGCGGCTTCCCGATGCTCCAGCCGGGC

CATGCCGCGTTCAGCGAGGCCGACCAGCGCTATCTGACCTGCACCCCGG

ATGGCCGCTACATCCTGCACGACCTCGGCGAAACCTATTACGACTTCGGC

CACTACGAGCCGGGCTCGGGCCGCATCGGCTGGGTGCGCCGCATCGAGG

ATCAGGCCGGCCAGTGGTGCCAGTTCGAGCGCGACAGCCGCGGCCGCGT

GCGCGAAATCCAGACCTGCGGCGGCTTGCTGGCCGTGCTCGATTACGAG

CCGGAACACGGGCGGCTCGCCGGGGTGTCGCTCGTCAGCGGGGATCAGC

GCCGCCTCGTGGTGGCTTACGGCTATGACGAGCACGGCCAGATGGCGTC

CGTGACCGATGCGAACGGCGCGCTGGTGCGCCGCTTCACCTATGCCGAC

GGGCGCATGACGAGCCATTCGAACGCGCTCGGCTTCACGTCGGGCTATA

CGTGGCAAGCCGTCGGCGGCGCGCCGCGGGTGGTTGCCACCCACACCAG

CGAGGGCGAGGCCTGGGCCTTCGAGTACGACATTGAAGGACGCCGCACC

CACGTGCGTCACGCCGACGGCCGCCACGCGCAATGGCGCTACGACGCGC

AATTCCAGATCGTCGAGTACCTCGATTTCGACGGCCGGCGCTACGGGCTC

AAGTACAACGACGCCGGCATGCCCGTGATGCTGACGCTGCCCGGCGAAC

GGACCGTGACGTTCGAGTACGACGATGCCGGCCGCATCGTCGCCGAAAC

CGATCCACTCGGCCGCACCACGAAAACGCGCTACGACGGCAACAGCAGG

CGGCCCGTCGAGATCATCGCGCCCGACGGCAGCGCCTGGCACGCCGAAT

ACGACCGGCAAGGCCGGCTGCTCGCCACCCGCGATCCGCTCGACCGGGA

AAACCGCTACGAATACCCGAAGGCGCTCAGCGCGCTGCCGATCGCGCAC

GTCGATGCGCTGGGCGGGCGCAAGACGTTCGAGTGGAACCGGCTCGGCG

AGCTGGTGGCCTATACCGATTGCTCGGGCAAGACCACACGCAATTTTTAC

GACGCATTCGGTCTGCCGCTCGCGCGCGAGAACGCGCTCGGCCACCGCG

TGACGTTCGACCTGCGCCCGACCGGCGAGGCGCGGCGCGTCACCTATCCC

GACGGCAGTACAGAAAGCTACGAATACGACGCCGCCGGGCTGATGATCC

GGCACGTCGGGCTGGGCGGCCGGACGCAGATTGCGCTGCGCAACGCGCG

TGGGCAGATCGTGGAGGCGGTCGATCCGGCCGGACGGCGCACCTGCTAC

CGCTACGACGCCGAGGGGCGGCTGCGCGAGCTGCAACAGGGGCACGCGC

GTTACGCGTTCACCTACAGCGCGGGCGGGCGGCTCACCAGCGAAACCCG

GCCCGACGGCGTGCGGCGCCGCTTCGAATACGGCGAGGCCGGCGATCTG

GCGGCGCTCGACATCGTCGGCGCGGCCGACGACGCCACGGCGAACGATC

GTCCGGTTCGCACCATCCGCTTCGAGCGCGACCGCATGGGCAATCTGTGC

GCGCAGCACACGCCCACCGAGGTGACGCGCTACACGCGCGACACCGGCG

GCCGCCTGCTCGAAGTCGCATGCGTGCCGACCGCGGCCGGGCTGGCGCT

CGGCATCGCGCCCGACACGCTGACCTTCGAATACGACAAGGCCGGGCGG

CTGAGTGCCGAACACGGCGCGAACGGCAGCGTCCGATACACGCTCGACG

CGCTCGACAACGTGATGAAGCTCGCCCTGCCGCACGAGCAGACGCTGCA

GATGCTGCGCTACGGCTCGGGGCACGTGCATCAGATCCGCTGCGGCGAC

CAGGTGGTCAGCGATTTCGAGCGCGACGACCTGCATCGCGAGCTGACGC

GCACTCAGGGCCGCCTGACCGAGCGTACCGCCTACGACCTGCTGGGCCG

CAAGATCTGGCAATCGGCCGGCTTCCAGCCCGACGCGCTTGCGCGCGGG

CAGGGCCAGGTGTGGCGCAACTACGGCTACGACGCCGCCGGCGAACTGG

CCGAGAGCCACGATAGCCTGCGCGGCAGCACGCAGTTCAGCTACGATCC

GGCCGGCTATCTGACGCAGCGCGTCAATACCGCCGACCGGCAGCTCGAA

TCGTTCGCCTGGGATGCCGCCGGCAACCTGCTCGACGATGCGCAGCGCCG

CAGCCGCGGTTATGTCGAGGGCAACCGGCTGCGCATGTGGCAGAACCTG

CGCTTCGAATACGACCCGTTCGGCAATCTCGCGACCAAGCTGCGCGGCGC

GAACCAGCGCCAGCAGTTCACTTACGACGGGCAGGATCGGCTCGTGGCG

GTGCGCACGCAGGACGCGCGCGGCGTGGTGGAGACGCGTTTCGCCTACG

ATCCGCTGGGGCGGCGCATCGCCAAGACGGATATTGTGCGCGACGCGCG

CGGCGTAGCGCTGCGCGAGGAAACGAAGCGGTTCGTGTGGGAGGGGCTG

CGGCTCGCGCAGGAGGTGCGCGACACGGGCGTGAGCAGCTACGTGTACA

GCCCGGACGCGCCCTATACGCCCGCGGCGCGCGTGGATGCCGTGCTGGC

CGAGGCCATGGCCGCCGCTGCCATCGAGCAGGCCAGACAGGCGACGCGG

ATCTATCACTTTCATACCGATGTGTCGGGCGCACCGCAAGAAGCGACGA

ACGAGGCTGGCGACATTGTTTGGGCCGGCCAATACTCAGCCTGGGGCAA

GGTGGCGCCGAACCAGCATGCCCCCGCCCGGATCGATCAGCCGCTCCGC

TACGCCGGACAATATGCCGACGACAGTACCGAGCTGCACTACAACACGT

TTCGTTTCTACGATCCGGACGTCGGCCGGTTTATCAATCAGGATCCAATC

GGGTTGATGGGGGGGCTGAATCTTTACCAATATGCACCCAACTCGATCGC

ATGGACCGACTGGTGGGGGCTGGCCGGCAGCTATACGCTCGGTTCCTATC

AAATTTCTGCGCCTCAACTTCCGGCCTACAATGGACAGACTGTTGGGACC

TTCTACTACGTGAACGGCGCGGGCGGGCTCGAATCGAGGACATTCTCTTC

CGGAGGGCCGACCCCTTATCCAAATTATGCCAATGCCGGGCACGTGGAG

GGCCAGTCCGCGCTGTTCATGAGGGATAACGGAATTTCAGACGGACTGG

TTTTCCACAACAACCCTGAGGGCACTTGCGGATTCTGCGTTAATATGACC

GAAACGCTTTTGCCTGAAAATTCCAAACTTACCGTCGTTCCGCCCGAGGG

CGCGATCCCGGTCAAGCGGGGCGCGACGGGCGAAACGAGAACATTTACG

GGGAACAGCAAGTCTCCGAAGTCCCCTGTCAAAGGAGAATGTTGA (SEQ ID NO: 365)

DddA homolog in Burkholderia glumae BGR1, PROTEIN
>ACR30728.1 Rhs family protein [Burkholderia glumae BGR1]

MYEAARVTDPIEHTSALTGFLVGAVLGIALIAAVAFATFTCGFGVALLAGMA

AGIGAQVLLSLGESIGKMFSSQSGAITLGSPNVYVNGKPTAYAMLSSVTCSK

HNPTPLVAQGSTNIFINGKPAARKDDKITCGATISDGSHDTYFHGGTQTCLPI

DDEVPPWLRTATDWAFALAGLVGGLGGLLKEAGGLSRAVMPCAAKFIGGY

VLGEAASRYVVGPAINSAIGGMFGNPVDVTTGRKILLAESETDYVVPSPMPV

AIRRFYSSDLDYVGTLGRGWVLPWELRLHARDGRLWYTDAQGRESGFPML

QPGHAAFSEADQRYLTCTPDGRYILHDLGETYYDFGHYEPGSGRIGWVRRIE

DQAGQWCQFERDSRGRVREIQTCGGLLAVLDYEPEHGRLAGVSLVSGDQR

RLVVAYGYDEHGQMASVTDANGALVRRFTYADGRMTSHSNALGFTSGYT

WQAVGGAPRVVATHTSEGEAWAFEYDIEGRRTHVRHADGRHAQWRYDAQ

FQIVEYLDFDGRRYGLKYNDAGMPVMLTLPGERTVTFEYDDAGRIVAETDP

LGRTTKTRYDGNSRRPVEIIAPDGSAWHAEYDRQGRLLATRDPLDRENRYE

YPKALSALPIAHVDALGGRKTFEWNRLGELVAYTDCSGKTTRNFYDAFGLP

LARENALGHRVTFDLRPTGEARRVTYPDGSTESYEYDAAGLMIRHVGLGGR

TQIALRNARGQIVEAVDPAGRRTCYRYDAEGRLRELQQGHARYAFTYSAGG

RLTSETRPDGVRRRFEYGEAGDLAALDIVGAADDATANDRPVRTIRFERDR

MGNLCAQHTPTEVTRYTRDTGGRLLEVACVPTAAGLALGIAPDTLTFEYDK

AGRLSAEHGANGSVRYTLDALDNVMKLALPHEQTLQMLRYGSGHVHQIRC

GDQVVSDFERDDLHRELTRTQGRLTERTAYDLLGRKIWQSAGFQPDALARG

QGQVWRNYGYDAAGELAESHDSLRGSTQFSYDPAGYLTQRVNTADRQLES

FAWDAAGNLLDDAQRRSRGYVEGNRLRMWQNLRFEYDPFGNLATKLRGA

NQRQQFTYDGQDRLVAVRTQDARGVVETRFAYDPLGRRIAKTDIVRDARG

VALREETKRFVWEGLRLAQEVRDTGVSSYVYSPDAPYTPAARVDAVLAEA

MAAAAIEQARQATRIYHFHTDVSGAPQEATNEAGDIVWAGQYSAWGKVAP

NQHAPARIDQPLRYAGQYADDSTELHYNTFRFYDPDVGRFINQDPIGLMGG

LNLYQYAPNSIAWTDWWGLAGSYTLGSYQISAPQLPAYNGQTVGTFYYVN

GAGGLESRTFSSGGPTPYPNYANAGHVEGQSALFMRDNGISDGLVFHNNPE

GTCGFCVNMTETLLPENSKLTVVPPEGAIPVKRGATGETRTFTGNSKSPKSPV

KGEC (SEQ ID NO: 364)

DddA homolog in Burkholderia glumae BGR1, DNA
>bglu_2g02600 NC_012721.2:303868-308145 Burkholderia glumae BGR1 chromosome 2, complete sequence

GTGTACGAAGCGGCCCGCGTCACCGACCCGATCGAACACACCAGCGCGC

TGACCGGCTTTCTGGTGGGCGCCGTGCTCGGCATTGCCCTGATCGCCGCG

GTGGCGTTCGCCACCTTCACCTGCGGCTTCGGCGTGGCGCTGCTGGCCGG

CATGGCCGCCGGCATCGGCGCGCAGGTGCTGTTGTCGTTAGGAGAATCG

ATCGGGAAGATGTTCAGTTCGCAATCCGGCGCGATCACGCTCGGCTCGCC

GAACGTCTATGTGAACGGCAAGCCGACCGCCTACGCCATGCTCAGCAGC

GTGACGTGCAGCAAGCACAACCCGACGCCGCTCGTCGCGCAGGGGTCCA

CCAACATCTTCATCAACGGCAAGCCGGCCGCCCGCAAGGACGACAAGAT

CACCTGCGGCGCGACCATCTCCGACGGCTCGCACGACACCTATTTCCACG

GCGGCACCCAGACCTGCCTGCCGATCGACGACGAAGTGCCGCCGTGGCT

GCGCACCGCCACCGACTGGGCGTTCGCGCTGGCCGGGCTGGTGGGCGGG

CTCGGCGGCCTGCTCAAGGAAGCGGGCGGGCTGTCGCGCGCGGTGATGC

CGTGCGCGGCGAAGTTCATCGGCGGCTACGTGCTCGGCGAGGCGGCGAG

CCGCTACGTGGTCGGCCCGGCCATCAACAGCGCGATCGGCGGGATGTTC

GGCAACCCGGTGGACGTCACCACCGGGCGCAAGATCCTGCTGGCGGAAT

CGGAAACCGATTACGTGGTGCCCAGCCCGATGCCGGTGGCGATCCGGCG

CTTCTATTCGAGCGACCTCGACTACGTCGGCACGCTCGGGCGCGGCTGGG

TGCTGCCGTGGGAACTGCGGCTGCACGCGCGCGACGGGCGGCTCTGGTA

CACCGACGCGCAGGGGCGCGAGAGCGGCTTCCCGATGCTCCAGCCGGGC

CATGCCGCGTTCAGCGAGGCCGACCAGCGCTATCTGACCTGCACCCCGG

ATGGCCGCTACATCCTGCACGACCTCGGCGAAACCTATTACGACTTCGGC

CACTACGAGCCGGGCTCGGGCCGCATCGGCTGGGTGCGCCGCATCGAGG

ATCAGGCCGGCCAGTGGTGCCAGTTCGAGCGCGACAGCCGCGGCCGCGT

GCGCGAAATCCAGACCTGCGGCGGCTTGCTGGCCGTGCTCGATTACGAG

CCGGAACACGGGCGGCTCGCCGGGGTGTCGCTCGTCAGCGGGGATCAGC

GCCGCCTCGTGGTGGCTTACGGCTATGACGAGCACGGCCAGATGGCGTC

CGTGACCGATGCGAACGGCGCGCTGGTGCGCCGCTTCACCTATGCCGAC

GGGCGCATGACGAGCCATTCGAACGCGCTCGGCTTCACGTCGGGCTATA

CGTGGCAAGCCGTCGGCGGCGCGCCGCGGGTGGTTGCCACCCACACCAG

CGAGGGCGAGGCCTGGGCCTTCGAGTACGACATTGAAGGACGCCGCACC

CACGTGCGTCACGCCGACGGCCGCCACGCGCAATGGCGCTACGACGCGC

AATTCCAGATCGTCGAGTACCTCGATTTCGACGGCCGGCGCTACGGGCTC

AAGTACAACGACGCCGGCATGCCCGTGATGCTGACGCTGCCCGGCGAAC

GGACCGTGACGTTCGAGTACGACGATGCCGGCCGCATCGTCGCCGAAAC

CGATCCACTCGGCCGCACCACGAAAACGCGCTACGACGGCAACAGCAGG

CGGCCCGTCGAGATCATCGCGCCCGACGGCAGCGCCTGGCACGCCGAAT

ACGACCGGCAAGGCCGGCTGCTCGCCACCCGCGATCCGCTCGACCGGGA

AAACCGCTACGAATACCCGAAGGCGCTCAGCGCGCTGCCGATCGCGCAC

GTCGATGCGCTGGGCGGGCGCAAGACGTTCGAGTGGAACCGGCTCGGCG

AGCTGGTGGCCTATACCGATTGCTCGGGCAAGACCACACGCAATTTTTAC

GACGCATTCGGTCTGCCGCTCGCGCGCGAGAACGCGCTCGGCCACCGCG

TGACGTTCGACCTGCGCCCGACCGGCGAGGCGCGGCGCGTCACCTATCCC

GACGGCAGTACAGAAAGCTACGAATACGACGCCGCCGGGCTGATGATCC

GGCACGTCGGGCTGGGCGGCCGGACGCAGATTGCGCTGCGCAACGCGCG

TGGGCAGATCGTGGAGGCGGTCGATCCGGCCGGACGGCGCACCTGCTAC

CGCTACGACGCCGAGGGGCGGCTGCGCGAGCTGCAACAGGGGCACGCGC

GTTACGCGTTCACCTACAGCGCGGGCGGGCGGCTCACCAGCGAAACCCG

GCCCGACGGCGTGCGGCGCCGCTTCGAATACGGCGAGGCCGGCGATCTG

GCGGCGCTCGACATCGTCGGCGCGGCCGACGACGCCACGGCGAACGATC

GTCCGGTTCGCACCATCCGCTTCGAGCGCGACCGCATGGGCAATCTGTGC

GCGCAGCACACGCCCACCGAGGTGACGCGCTACACGCGCGACACCGGCG

GCCGCCTGCTCGAAGTCGCATGCGTGCCGACCGCGGCCGGGCTGGCGCT

CGGCATCGCGCCCGACACGCTGACCTTCGAATACGACAAGGCCGGGCGG

CTGAGTGCCGAACACGGCGCGAACGGCAGCGTCCGATACACGCTCGACG

CGCTCGACAACGTGATGAAGCTCGCCCTGCCGCACGAGCAGACGCTGCA

GATGCTGCGCTACGGCTCGGGGCACGTGCATCAGATCCGCTGCGGCGAC

CAGGTGGTCAGCGATTTCGAGCGCGACGACCTGCATCGCGAGCTGACGC

GCACTCAGGGCCGCCTGACCGAGCGTACCGCCTACGACCTGCTGGGCCG

CAAGATCTGGCAATCGGCCGGCTTCCAGCCCGACGCGCTTGCGCGCGGG

CAGGGCCAGGTGTGGCGCAACTACGGCTACGACGCCGCCGGCGAACTGG

CCGAGAGCCACGATAGCCTGCGCGGCAGCACGCAGTTCAGCTACGATCC

GGCCGGCTATCTGACGCAGCGCGTCAATACCGCCGACCGGCAGCTCGAA

TCGTTCGCCTGGGATGCCGCCGGCAACCTGCTCGACGATGCGCAGCGCCG

CAGCCGCGGTTATGTCGAGGGCAACCGGCTGCGCATGTGGCAGAACCTG

CGCTTCGAATACGACCCGTTCGGCAATCTCGCGACCAAGCTGCGCGGCGC

GAACCAGCGCCAGCAGTTCACTTACGACGGGCAGGATCGGCTCGTGGCG

GTGCGCACGCAGGACGCGCGCGGCGTGGTGGAGACGCGTTTCGCCTACG

ATCCGCTGGGGCGGCGCATCGCCAAGACGGATATTGTGCGCGACGCGCG

CGGCGTAGCGCTGCGCGAGGAAACGAAGCGGTTCGTGTGGGAGGGGCTG

CGGCTCGCGCAGGAGGTGCGCGACACGGGCGTGAGCAGCTACGTGTACA

GCCCGGACGCGCCCTATACGCCCGCGGCGCGCGTGGATGCCGTGCTGGC

CGAGGCCATGGCCGCCGCTGCCATCGAGCAGGCCAGACAGGCGACGCGG

ATCTATCACTTTCATACCGATGTGTCGGGCGCACCGCAAGAAGCGACGA

ACGAGGCTGGCGACATTGTTTGGGCCGGCCAATACTCAGCCTGGGGCAA

GGTGGCGCCGAACCAGCATGCCCCCGCCCGGATCGATCAGCCGCTCCGC

TACGCCGGACAATATGCCGACGACAGTACCGAGCTGCACTACAACACGT

TTCGTTTCTACGATCCGGACGTCGGCCGGTTTATCAATCAGGATCCAATC

GGGTTGATGGGGGGGCTGAATCTTTACCAATATGCACCCAACTCGATCGC

ATGGACCGACTGGTGGGGGCTGGCCGGCAGCTATACGCTCGGTTCCTATC

AAATTTCTGCGCCTCAACTTCCGGCCTACAATGGACAGACTGTTGGGACC

TTCTACTACGTGAACGGCGCGGGCGGGCTCGAATCGAGGACATTCTCTTC

CGGAGGGCCGACCCCTTATCCAAATTATGCCAATGCCGGGCACGTGGAG

GGCCAGTCCGCGCTGTTCATGAGGGATAACGGAATTTCAGACGGACTGG

TTTTCCACAACAACCCTGAGGGCACTTGCGGATTCTGCGTTAATATGACC

GAAACGCTTTTGCCTGAAAATTCCAAACTTACCGTCGTTCCGCCCGAGGG

CGCGATCCCGGTCAAGCGGGGCGCGACGGGCGAAACGAGAACATTTACG

GGGAACAGCAAGTCTCCGAAGTCCCCTGTCAAAGGAGAATGTTGA (SEQ ID NO: 366)

DddA homolog in Streptomyces rubrolavendulae, PROTEIN
>AOT60363.1 tRNA nuclease WapA precursor [Streptomyces rubrolavendulae]

MSSSDAGRAFGVPENVLARFTRYPGGARRRAGRTARARRLGIVLSAVLSAT

LLPAEAWAIAPPAPRTGPTLDALQQEEEVDPDPAAMEELDDWDGGPVEPPA

DYTPTEVTPPTGGTAPVPLDSAGEELVPAGTLPVRIGQASPTEEDPAPPAPSG

TWDVTVEPRATTEAAAVDGAIIKLTPPASGSTPVDVELDYGRFEDLFGTEWS

SRLKLTQLPECFLTTPELEECGTPITIPTSNDPATGTVRATVDPADGQPQGLA

AQSGGGPAVLAATDSASGAGGTYKATSLSATGSWTAGGSGGGFSWSYPLTI

PDTPAGPAPKISLSYSSQSVDGRTSVANGQASWIGDGWDYHPGFVERRYRSC

NDDRSGTPNNDNSADKEKSDLCWASDNVVMSLGGSTTELVRDDTTGTWVA

QNDTGARIEYKDKDGGALAAQTAGYDGEHWVVTTRDGTRYWFGRNTLPG

RGAPTNSALTVPVFGNHTGEPCHAATYAASSCTQAWRWNLDYVEDVHGNA

MVVDWKKEQNRYAKNEKFKAAVSYDRDAYPTQILYGLRADDLAGPPAGK

VVFHAAPRCLESAATCSEAKFESKNYADKQPWWDTPATLHCKAGDENCYV

TSPTFWSRVRLSAIETQGQRTPGSTALSTVDRWTLHQSFPKQRTDTHPPLWL

ESITRVGFGRPDASGNQSSKALPAVTFLPNKVDMPNRVLKSTTDQTPDFDRL

RVEVIRTETGGETHVTYSAPCPVGGTRPTPASNGTRCFPVHWSPDPAAFSDE

NLDKSGYEPPLEWFNKYVVTKVTEMDLVAEQPSVETVYTYEGDAAWAKNT

DEYGKPALRTYDQWRGYASVVTRTGTTANTGAADATEQSQTRTRYFRGMS

GDAGRAKVHVTLTDVTGTATTVEDLLPYQGMAAETLTYTKAGGDVAAREL

AFPYSRKTASRARPGLPALEAYRTGTTRTDSIQHISGDRTRAAQNHTTYDDA

YGLPTQTYSLTLSPNDSGTLVAGDERCTVTTYVHNTAAHIIGLPDRVRATTG

DCAAAPNATTGQIVSDSRTAYDALGAFGTAPVKGLPVQVDTISGGGTSWITS

ARTEYDALGRATKVTDAAGNSTTTTYSPATGPAFEVTVTNAAGHATTTTLD

PGRGSALTVTDQNGRKTTSTYDELGRATGVWTPSRPVNQDASVRFVYQIED

SKVPAVHTRVLRDAGTYEESIELYDGFLRPRQTQREALGGGRIVTETLYNAN

GSAKEVRDGYLAEGEPARELFVPLSLDQVPSATRTAYDGLGRPVRTTTLHR

GVPRHSATTAYGGDWELSRTGMSPDGTTPLSGSRAVKATTDALGRPARIQH

FTTQNVSAESVDTTYTYDPRGPLAQVTDAQQNTWTYTYDARGRKTSSTDPD

AGAAYFGYNALDQQVWSKDNQGRLQYTTYDVLGRQTELRDDSASGPLVA

KWTFDTLPGAKGHPVASTRYNDGAAFTSEVTGYDTEYRPTGNKVTIPSTPM

TTGLAGTYTYASTYTPTGKVQSVDLPATPGGLAAEKVITRYDGEDSPTTMSG

LAWYTADTFLGPYGEVLRTASGEAPRRVWTTNVYDEDTRRLTRTTAHRET

APHPVSTTTYGYDTVGNITSIADQQPAGTEEQCFSYDPMGRLVHAWTDGNS

AVCPRTSTAPGAGPARADVSAGVDGGGYWHSYAFDAIGNRTKLTVHDRTD

AALDDTYTYTYGKTLPGNPQPVQPHTLTQVDAVLNEPGSRVEPRSTYAYDT

SGNTTQRVIGGDTQTLAWDRRNKLTSVDTNNDGTPDVKYLYDASGNRLVE

DDGTTRTLFLGEAEIVVNTAGQAVDARRYYSSPGAPTTIRTTGGKTTGHKLT

VMLSDHHSTATTAVELTDTQPVTRRRFDPYGNPRGTEPTTWPDRRTYLGVG

IDDPATGLTHIGAREYDASTGRFISVDPVMDLTDPLQMNGYTYANADPINNS

DPTGLLLDARGGGTQKCVGTCVKDVTNRKGIPLPPGEEWKHEGEAQTDFNG

DGFITVFPTVNVPAKWKKAKKYTEAFYKAVDTACFYGRESCADPEYPSRAH

SINNWKGKACKAVGGKCPERLSWGEGPAFAGGFAIAAEEYAGRGGYRGGG

ARRGSPCKCFLAGTEVLMADGSTKSIEDIKLGDEVVATDPVTGEAGAHPVSA

LIATENDKRFNELVIITSEGVERLTATHEHPFWSPSEGEWLEAGELRTGMTLR

SDSGETLVVAGNRAFTQRARTYNLTVADLHTYYVLAGQTPVLVHNANCGP

HLKDLQKDYPRRTVGILDVGTDQLPMISGPGGQSGLLKNLPGRTKANGEHV

ETHAAAFLRMNPGVRKAVLYIDYPTGTCGTCRSTLPDMLPEGVQLWVISPR

RTEKFTGLPD (SEQ ID NO: 367)

DddA homolog in Streptomyces rubrolavendulae, DNA
>A4G23_03234 CP017316.1:3756245-3763321 Streptomyces rubrolavendulae strain MJM4426, complete genome

ATGTCCTCGTCCGATGCGGGACGCGCCTTCGGCGTGCCCGAAAACGTCCT

GGCGCGTTTCACGCGGTATCCCGGCGGGGCGCGACGCCGTGCCGGGCGC

ACGGCGCGCGCCCGGCGCCTGGGCATCGTGCTGTCCGCCGTCCTCTCGGC

GACCCTGCTGCCCGCCGAGGCATGGGCCATCGCGCCCCCGGCGCCGCGC

ACCGGTCCGACCCTGGACGCCCTCCAGCAGGAGGAGGAGGTCGATCCGG

ACCCGGCCGCCATGGAAGAGCTGGACGACTGGGACGGTGGGCCGGTCGA

GCCCCCGGCCGACTACACCCCCACCGAGGTCACGCCTCCCACCGGCGGC

ACCGCCCCGGTGCCGCTGGACAGCGCGGGCGAGGAACTGGTCCCGGCCG

GGACCCTGCCCGTGCGCATCGGCCAGGCGTCCCCCACCGAGGAGGACCC

GGCACCCCCGGCACCCAGCGGCACGTGGGACGTCACCGTGGAGCCCCGC

GCCACCACCGAGGCGGCCGCCGTGGACGGCGCCATCATCAAGCTCACCC

CGCCCGCCAGCGGCTCCACACCGGTCGACGTGGAACTCGACTACGGCCG

GTTCGAGGACCTGTTCGGCACCGAGTGGTCCTCCCGGCTCAAGCTGACGC

AGCTCCCGGAGTGCTTCCTCACGACGCCCGAGCTGGAGGAGTGCGGCAC

CCCCATCACCATCCCGACGAGCAACGACCCGGCCACCGGGACGGTCCGG

GCCACCGTCGACCCGGCCGACGGGCAGCCGCAGGGCCTGGCCGCGCAGT

CGGGCGGCGGTCCCGCCGTCCTCGCCGCGACCGACTCGGCGTCCGGCGC

CGGCGGCACGTACAAGGCGACCTCCCTCTCGGCCACCGGCTCCTGGACG

GCCGGCGGCAGCGGCGGCGGCTTCTCCTGGTCGTATCCGCTCACCATCCC

GGACACCCCGGCCGGCCCCGCGCCGAAGATCTCCCTGTCGTACTCCTCCC

AGTCCGTCGACGGCCGCACCTCCGTCGCCAACGGCCAGGCGTCGTGGAT

AGGCGACGGCTGGGACTACCACCCCGGCTTCGTCGAGCGCCGCTACCGC

TCCTGCAACGACGACCGCTCCGGCACCCCGAACAACGACAACAGTGCGG

ACAAGGAGAAGTCCGACCTGTGCTGGGCGAGCGACAACGTCGTGATGTC

GCTCGGCGGCTCCACCACCGAACTCGTCCGCGACGACACGACCGGCACG

TGGGTCGCGCAGAACGACACCGGTGCCCGGATCGAGTACAAGGACAAGG

ACGGCGGAGCCCTGGCCGCCCAGACCGCCGGCTACGACGGCGAGCACTG

GGTCGTCACCACCCGCGACGGAACCCGCTACTGGTTCGGCCGCAACACC

CTCCCCGGCCGCGGCGCCCCCACGAACTCCGCCCTCACCGTCCCCGTCTT

CGGCAACCACACCGGCGAGCCCTGCCACGCCGCCACCTACGCCGCCTCCT

CCTGCACCCAGGCGTGGCGCTGGAACCTCGACTACGTCGAGGACGTCCA

CGGCAACGCGATGGTCGTCGACTGGAAGAAGGAGCAGAACCGGTACGCG

AAGAACGAGAAGTTCAAGGCGGCTGTCTCCTACGACCGCGACGCGTATC

CGACGCAGATCCTCTACGGCCTGCGCGCCGACGACCTGGCGGGCCCGCC

CGCCGGCAAGGTCGTCTTCCACGCCGCCCCGCGCTGCCTCGAAAGCGCG

GCCACCTGCTCCGAAGCCAAGTTCGAGTCCAAGAACTACGCGGACAAGC

AGCCCTGGTGGGACACACCGGCCACCCTGCACTGCAAGGCCGGTGACGA

GAACTGCTACGTCACCTCGCCGACGTTCTGGAGCCGCGTCCGCCTGTCGG

CGATCGAGACGCAGGGTCAGCGCACGCCCGGCTCGACGGCGCTGTCCAC

GGTCGACCGCTGGACCCTGCACCAGTCGTTCCCGAAGCAGCGCACCGAC

ACCCACCCGCCGCTCTGGCTGGAGTCGATCACCCGCGTGGGCTTCGGCCG

GCCGGACGCCTCCGGCAACCAGTCGAGCAAGGCCCTCCCGGCGGTGACC

TTCCTGCCCAACAAGGTCGACATGCCGAACCGCGTGCTGAAGAGCACGA

CGGACCAGACGCCCGATTTCGACCGCCTGCGCGTCGAGGTCATCCGCAC

GGAGACCGGCGGCGAGACCCATGTGACGTACTCCGCCCCCTGCCCCGTC

GGCGGCACCCGCCCCACCCCGGCCTCCAACGGCACCCGCTGCTTCCCGGT

CCACTGGTCCCCCGACCCGGCGGCCTTCTCCGACGAGAACCTGGACAAG

AGCGGCTACGAGCCGCCCCTCGAGTGGTTCAACAAGTACGTCGTCACCA

AGGTCACCGAGATGGACCTCGTGGCGGAGCAGCCCAGCGTCGAGACCGT

CTACACCTACGAGGGCGACGCCGCCTGGGCGAAGAACACCGACGAGTAC

GGCAAGCCCGCCCTGCGCACCTACGACCAGTGGCGCGGCTACGCGAGCG

TCGTCACCCGCACGGGCACCACGGCCAACACCGGCGCCGCCGACGCCAC

CGAGCAGTCCCAGACCCGCACCCGGTACTTCCGCGGCATGTCCGGCGAC

GCGGGCCGCGCCAAGGTGCACGTCACGCTCACGGACGTGACCGGCACCG

CGACCACCGTCGAGGACCTGCTCCCGTACCAGGGCATGGCCGCCGAGAC

CCTTACCTACACCAAGGCGGGCGGCGACGTCGCCGCCCGCGAGCTGGCC

TTCCCCTACAGCAGGAAGACCGCCTCCCGCGCCCGCCCCGGCCTCCCCGC

CCTGGAGGCGTACCGCACGGGCACGACGCGCACGGACTCCATCCAGCAC

ATCAGCGGCGACCGGACGCGCGCCGCTCAGAACCACACCACATACGACG

ACGCGTACGGCCTGCCCACCCAGACCTACTCGCTGACACTCTCGCCGAAC

GACTCCGGCACCCTTGTCGCCGGTGACGAGCGGTGCACCGTCACGACGT

ACGTCCACAACACCGCCGCGCACATCATCGGCCTCCCCGACCGCGTCCGC

GCCACGACGGGCGACTGCGCCGCCGCGCCGAACGCCACCACCGGCCAGA

TCGTCTCCGACAGCCGCACCGCGTACGACGCGCTCGGCGCCTTCGGCACG

GCCCCGGTCAAGGGCCTGCCGGTCCAGGTGGACACGATCTCCGGAGGCG

GCACGAGCTGGATCACCTCGGCGCGCACGGAGTACGACGCGCTGGGCCG

TGCGACCAAGGTCACCGACGCGGCGGGCAACTCCACCACGACCACGTAC

AGCCCGGCGACCGGCCCCGCGTTCGAGGTCACCGTGACCAACGCGGCTG

GTCATGCCACGACCACCACCCTCGACCCCGGTCGCGGCTCGGCGCTGACC

GTCACCGACCAGAACGGCCGCAAGACCACCAGCACGTACGACGAACTCG

GCCGGGCCACCGGCGTGTGGACGCCCTCCCGCCCGGTGAACCAGGACGC

GTCCGTGCGCTTCGTCTACCAGATCGAGGACAGCAAGGTCCCGGCGGTG

CACACTCGGGTCCTGCGCGACGCCGGTACGTACGAGGAGTCGATCGAGC

TCTACGACGGCTTCCTCCGCCCCCGTCAGACCCAGCGCGAGGCGCTGGGC

GGCGGCCGAATCGTCACCGAGACCCTCTACAACGCCAACGGCTCTGCGA

AGGAAGTGCGCGACGGCTACCTGGCGGAGGGCGAGCCCGCGCGGGAACT

GTTCGTCCCGCTCTCCCTCGACCAGGTGCCGAGCGCGACGAGGACGGCCT

ATGACGGCCTGGGCCGGCCCGTCCGGACGACGACCCTCCACAGGGGAGT

CCCCCGGCACTCCGCCACCACGGCGTACGGCGGCGACTGGGAACTGAGC

CGCACCGGCATGTCGCCCGACGGAACGACGCCGCTCTCTGGCAGCCGCG

CCGTGAAGGCGACGACGGACGCGCTCGGCCGCCCGGCCCGCATCCAGCA

CTTCACCACCCAGAACGTGTCGGCCGAGAGCGTCGACACCACGTACACC

TACGACCCCCGCGGCCCCCTTGCCCAGGTCACCGACGCCCAGCAGAACA

CCTGGACGTACACGTACGACGCCCGTGGGCGCAAGACGTCCTCCACCGA

CCCGGACGCGGGCGCCGCCTACTTCGGCTACAACGCGCTGGACCAGCAG

GTCTGGTCGAAGGACAACCAGGGCCGCCTGCAGTACACGACGTACGACG

TCCTGGGCCGCCAGACCGAGCTGCGCGACGACTCCGCGTCCGGCCCGCT

GGTGGCGAAGTGGACCTTCGACACCCTGCCGGGCGCCAAGGGCCACCCG

GTCGCGTCGACCCGCTACAACGACGGCGCCGCGTTCACCAGCGAGGTGA

CCGGTTACGACACCGAGTACCGTCCGACCGGCAACAAGGTCACCATCCC

CAGCACCCCGATGACCACGGGCCTCGCCGGCACGTACACGTACGCCAGC

ACGTACACCCCGACCGGCAAGGTCCAGTCCGTCGACCTGCCCGCGACGC

CCGGCGGGCTCGCCGCGGAGAAGGTGATCACCCGCTACGACGGCGAGGA

CTCGCCCACCACGATGTCGGGCCTGGCCTGGTACACGGCCGACACCTTCC

TCGGCCCGTACGGGGAAGTGCTGCGCACGGCGTCGGGCGAGGCCCCGCG

CCGCGTGTGGACGACCAACGTCTACGACGAGGACACCCGCCGCCTCACC

AGGACCACCGCGCACCGGGAGACGGCTCCCCACCCGGTCAGCACGACCA

CCTACGGCTACGACACGGTCGGCAACATCACGTCCATCGCCGACCAGCA

GCCGGCGGGTACCGAGGAGCAGTGCTTCTCGTACGACCCGATGGGGCGC

CTCGTCCACGCCTGGACGGACGGCAACAGCGCCGTCTGCCCCAGGACGT

CCACGGCACCGGGCGCCGGCCCGGCCCGCGCCGACGTCTCGGCCGGTGT

CGACGGCGGCGGATACTGGCACTCGTACGCGTTCGACGCGATCGGCAAC

CGGACGAAGCTGACCGTCCACGACCGCACCGACGCGGCCCTGGACGACA

CGTACACCTACACCTACGGCAAGACCCTGCCGGGTAACCCGCAGCCGGT

CCAGCCGCACACCCTCACCCAGGTCGACGCGGTGCTCAACGAGCCCGGA

TCGAGAGTCGAACCGCGCTCCACATACGCCTACGACACCTCCGGCAACA

CCACCCAGCGCGTCATCGGCGGCGACACCCAGACCCTGGCCTGGGACCG

CCGCAACAAGCTGACGTCCGTCGACACGAACAACGACGGCACACCGGAC

GTGAAGTACCTGTACGACGCGTCGGGCAACCGCCTGGTCGAGGACGACG

GCACCACGCGCACCCTCTTCCTCGGCGAGGCCGAGATCGTCGTCAACACG

GCCGGCCAGGCCGTGGACGCGCGCCGCTACTACAGCAGCCCCGGCGCCC

CGACGACGATCCGCACGACCGGCGGCAAGACCACGGGCCACAAGCTGAC

CGTCATGCTGTCGGACCACCACAGCACGGCGACGACCGCGGTCGAGCTG

ACCGACACCCAGCCGGTCACCCGCCGCCGCTTCGACCCGTACGGCAACC

CCCGCGGCACCGAGCCGACCACCTGGCCCGACCGCCGCACCTACCTGGG

CGTCGGCATCGACGACCCCGCCACGGGCCTGACCCACATCGGCGCCCGC

GAATACGACGCATCGACGGGCCGCTTCATCTCCGTCGATCCGGTCATGGA

CCTCACGGACCCGCTCCAGATGAACGGGTACACCTACGCCAACGCGGAC

CCGATCAACAACAGCGACCCCACCGGACTGTTGCTCGACGCCCGAGGCG

GCGGCACTCAGAAGTGCGTGGGAACCTGCGTCAAGGACGTCACGAACCG

AAAGGGAATTCCGCTCCCGCCTGGCGAGGAGTGGAAGCATGAAGGGGAG

GCGCAAACCGATTTCAACGGTGACGGCTTCATCACCGTCTTCCCGACCGT

GAATGTTCCGGCGAAGTGGAAGAAGGCGAAGAAGTACACGGAGGCTTTC

TACAAGGCGGTTGATACTGCTTGCTTCTATGGACGCGAAAGCTGTGCGGA

TCCGGAGTACCCTTCGCGGGCGCATAGCATCAACAACTGGAAGGGAAAG

GCATGCAAAGCCGTAGGGGGAAAATGCCCTGAGAGGTTGTCGTGGGGGG

AGGGTCCGGCGTTCGCTGGTGGCTTCGCGATAGCAGCGGAAGAGTATGC

GGGGAGAGGGGGCTACCGGGGCGGTGGGGCGAGGAGGGGGTCGCCCTG

TAAGTGCTTCCTTGCCGGCACCGAGGTGCTCATGGCGGATGGCAGCACTA

AAAGTATCGAGGACATCAAGCTCGGTGACGAAGTGGTTGCGACTGATCC

GGTAACCGGTGAGGCCGGTGCGCACCCTGTCTCGGCGCTGATCGCCACC

GAGAACGACAAGCGTTTCAACGAGCTGGTCATTATCACCAGCGAGGGTG

TAGAGCGTCTTACCGCAACGCATGAGCACCCCTTCTGGTCGCCATCCGAA

GGGGAGTGGTTGGAGGCGGGTGAGCTGCGCACTGGCATGACGCTGCGCT

CCGACTCTGGCGAAACTCTCGTAGTCGCAGGAAACCGCGCCTTCACCCAG

CGAGCCCGGACCTACAACCTCACGGTTGCAGACCTCCACACGTACTATGT

GCTGGCGGGCCAGACTCCGGTACTGGTTCACAATGCAAACTGTGGACCTC

ACCTGAAGGACCTGCAAAAGGACTACCCCCGGCGCACTGTGGGCATCCT

TGACGTCGGAACTGATCAGCTCCCGATGATTAGCGGCCCAGGTGGCCAG

TCGGGACTTCTCAAGAACCTCCCAGGTCGTACGAAGGCCAACGGGGAGC

ACGTGGAGACTCACGCAGCAGCGTTCTTGCGTATGAACCCGGGTGTCAG

AAAGGCCGTGCTCTACATCGACTACCCGACGGGGACCTGCGGAACATGT

AGAAGTACATTGCCTGACATGCTGCCCGAGGGTGTTCAGTTGTGGGTGAT

CTCGCCGCGTAGGACTGAAAAATTCACGGGACTTCCTGACTGA (SEQ ID NO: 368)

DddA homolog in Plantactinospora sp. BC1 PROTEIN

>AVT32940.1 hypothetical protein C6361_29650 [Plantactinospora sp. BC1]

MGDRLPAFVDGGDTLGIFSRGGIERDLASGVAGPASSLPKGTPGFNGLVKSH

VEGHAAALMRQNGIPNAELYINRVPCGSGNGCAAMLPHMLPEGATLRVYG

PNGYDRTFTGLPD (SEQ ID NO: 369)

DddA homolog in Plantactinospora sp. BC1 DNA

>C6361_29650 CP028158.1:6764267-6764614 Plantactinospora sp. BC1 chromosome, complete genome

CTGGGTGACCGGCTCCCTGCCTTCGTGGACGGTGGAGACACGTTGGGCAT

CTTTTCTCGCGGAGGTATTGAGCGGGACCTCGCCAGCGGAGTTGCGGGTC

CTGCAAGTAGCCTTCCTAAAGGCACGCCTGGCTTCAATGGTCTTGTAAAG

AGTCATGTTGAAGGGCATGCGGCTGCGCTAATGAGACAAAATGGAATTC

CGAACGCTGAGCTGTATATCAACAGAGTGCCGTGCGGTTCAGGTAATGG

CTGCGCAGCGATGTTGCCGCATATGCTTCCGGAAGGTGCCACCCTCCGCG

TATATGGGCCGAACGGGTACGATAGAACCTTCACTGGACTTCCGGACTG

A (SEQ ID NO: 370)

DddA homolog in Kitasatospora setae KM-6054 PROTEIN

>BAJ27137.1 hypothetical protein KSE_13070 [Kitasatospora setae KM-6054]

MAAVPSAEALAAKRARDTIWTPPNTPLGSQTKSVDGENLVPGRLPGPLEPEP

ADWTPGGPASVPAPGSADVTLGFDSAEAAAARKATGGAAPASDGAALRAG

SLPVVIGAAKDAKSGAHRIRVELVDQAKSRAAHLDSPLIALTDTEPDTPPSGR

TTKVSLDLKGIGAQTWADRARLVALPACALETPDRPECQQQTPVQSSVDLR

SGLLTAEVILPAATEGTAPPTKSSLGSGTASGVVQAGLTTAAPAKAAPTVLA

ATAGASGSGGSFSATSLSPSAAWGAGSNVGNFTYSYPIQTPPSLGGTAPSVG

LGYDSSAVDGKTSAQNSQSSWLGEGWGYEAGFIERGYKSCNTAGIANSSDM

CWGGQNATLSLAGHSGTLVRDDTTGVWHLQSDDGTKIEQLTGAPNGLQNG

EHWRITTTDGTQFYFGRNHLPGGDGTDPASNSAFKEPVYSPKSGDPCYNSST

ATGSWCTMGWRWNLDYAVDVHGNLITYTYAQETNYYSRGAGQNSGSGTL

TDYTRAGYLTQIAYGQRLSEQVTAKGAAKAAALITFTAAERCVPSGSITCTE

AQRTTANASYWPDTPLDQVCASTGTCTRAGPTFFTTKRLASLTTQVLVSGA

YRTVDTWTLTHSFKDPGDGNAKSLWLDSIQRTGTNGQTAVTMPPVTFTAV

MKPNRVDGDLTLKDGTKVTVTPFNRPRLQQVTTETGGQINVVYTTSSDAAH

PACSRLAGTMPAAADGNTLACAPVKWYLPGSSSPDPVDDWFNKYLISAVTE

QDAISGTTLIKATNYTYNGDAAWHRNDAEFTDAKTRTWDGFRGYQSVTSTT

GSAYPGEAPRTQQTATYLRGMDGDVKADGSTRSVQVANPLGGPALTDSPW

LAGSSFATQTYDQAGGTVISANGSVAGGQQVTATHAQSGGMPALVARYPA

SQVTTTSKSKLSDGTWRTNTTVSTSDPAHANRPLSSDDKGDGTPGAELCSTN

GYATGTNPMMLNILAERTVTKGACGTPVTSANTVSSARTLYDGKPYGQAG

DLAESTSALTLDHYDTGGNPVYVHTAASTFDAYGRLTSVSEANGATYDAAG

NQLTAPNLTPATTRTAYTPATGAIATTVTQTTPTGWTTTLTQDPGRAEALVS

TDANGRATTQQYDGLGRLTAAWSPERATNLTPSQKFSYAVNGTTGPSVVTS

QWLKEAGGYAYKNELYDGLGRLRQVQRTSDTYSGRLITDTVYDSHGWPVK

TASPYYEKTTAPNSTVYLPQDSQVPAQTWVTFDGIGRTTRSAFVSYGQQQW

ATTTAYPGADRTDVTPPNGKYPTSTFTDGRNQVSALWQYRTATPTGNPADA

TVTTYTYDAANRPATRKDAAGNTWSYGYDLRGRQTTVTDPDTGTTTTAYD

VNSRAVSTTDGKGNTLVVSYDLIGRKTGLYQGSIAPANQLAGWTYDTLPGG

KGKPTSSTRYVGGAGGSAYTQAVTGYDAGYRPTGTSVTIPASEGKLAGTYT

TGLTYNPVLGTLKQTDLPAIGAAPAESVMYTYNISGVLQKSYSDTYYVYDV

QYDAFGRPVRTTTGDAGTQVVSTQLDKTDYTYNQAGDVTSVTDVQNGTAT

DAQCFTYDHLGRLTQAWTDTAGSTSTTSGTWTDTSGTVHNSGSSQSVPALG

ACANANGPASTGSPAKLSVGGPSPYWQSYGYDSTGNRTTLVQHDTTGNTTK

DTTTTQTFGPAGSVNTATGAPNTGGGTGGPHALLTSSTTGPTGTQVTSYQYD

QLGNTTAVTETSGTTTLAWNGEDKLASVTKTGQAQATSYLYDADGNQLIRR

NPGKTTLNLGSDEVTLDTAANSLTDTRYYSAPGGISIARTTGPTGASALAYQ

ASDPHGTANVQINVDAAQTTTRRPTDPFGNPRGTQPAPNTWAGDKGFVGGT

KDDTTGLTNLGAREYQPTTGRFLNPDPLLDAGNPQQWNGYAYSDNDPVNS

SDPSGLITNALADGDTYVARPAAFCVTMSCVEQTSGPGFWEDKRVGDAVFA

AVVQATTQSNGNGSSQTKKEKGIWGQAWDWTKKNGGAILGALVEGAVFST

CFIGAGFAAPATGGITVIAGAAACGAVAGEAGALTTNILTPDADHSVDGITN

DMVVGEITGAAVSAASEGASSLAKPAVRKLLGMEAEEGLEAAGRAATGPC

NSFPAGVTVLLADGTTKPIEQIAQGDQVTATDPQTGTTQAEPVTDTIIGHDDT

EFTDLTLTNDADPRAPPSEITSTTHHPYWNATTSRWTDAGDLKPGDHVRTPD

GTELTVNTVYSYTTQPRTARNLTVADLHTYYVLAGNTPVLVHNTGPGCGEP

GFVSDAANSLSGRRITTGQIFDASGNPIGPEITSGGGSLADRAQSYLADSPNIR

NLPAKARYASADHVEAQYAVWMRENGVTDASVVINQNYVCGLPLGCQAA

VPAILPRGSTMTVWYPGSGSPIVLRGVG (SEQ ID NO: 371)

DddA homolog in Kitasatospora setae KM-6054 DNA

>KSE_13070 NC_016109.1:1451556-1458878 Kitasatospora setae KM-6054 DNA, complete genome

GTGCTGGGGACAGCGGCCGCGCTCGCGGTCATGATGTCCATGGCGGCGG

TGCCGTCCGCCGAGGCACTGGCCGCGAAGCGGGCACGCGACACCATCTG

GACGCCGCCCAACACCCCGCTGGGCAGCCAGACCAAGTCCGTCGACGGC

GAGAACCTCGTCCCGGGCCGCCTGCCCGGCCCCCTGGAGCCGGAACCGG

CCGACTGGACACCCGGCGGACCGGCATCCGTGCCCGCTCCGGGCAGCGC

GGACGTCACCCTCGGCTTCGACTCCGCGGAGGCCGCCGCCGCCCGCAAG

GCCACCGGCGGCGCCGCCCCCGCCTCCGACGGCGCGGCCCTCCGCGCGG

GCTCCCTCCCCGTCGTCATCGGCGCGGCGAAGGACGCCAAGAGCGGCGC

CCACCGGATCCGCGTCGAGCTCGTGGACCAGGCCAAGAGCCGTGCCGCA

CACCTCGACAGCCCGCTGATCGCACTCACCGACACCGAGCCGGACACCC

CGCCCTCCGGTCGGACCACGAAGGTGTCCCTCGACCTGAAGGGCATCGG

CGCCCAGACCTGGGCGGACCGCGCGCGACTCGTCGCCCTGCCCGCCTGC

GCCCTGGAGACGCCCGACAGGCCCGAGTGCCAGCAGCAGACCCCCGTGC

AGAGCTCCGTCGACCTGCGCTCCGGACTGCTGACGGCCGAGGTCATTCTG

CCCGCCGCCACCGAGGGCACCGCCCCGCCCACCAAGAGCTCCCTCGGCT

CGGGCACCGCCTCCGGCGTCGTCCAGGCCGGCCTCACCACGGCGGCGCC

CGCCAAGGCCGCGCCCACGGTGCTCGCCGCGACCGCCGGCGCGTCCGGC

TCGGGCGGCAGCTTCTCGGCGACCTCGCTGTCGCCCTCCGCGGCCTGGGG

CGCCGGCTCCAACGTCGGCAACTTCACCTACTCGTACCCGATCCAGACGC

CTCCCTCGCTCGGCGGGACCGCCCCCTCCGTGGGCCTCGGGTACGACTCG

TCCGCCGTCGACGGGAAGACCTCCGCGCAGAACTCCCAGTCCTCCTGGCT

CGGCGAGGGCTGGGGCTACGAGGCCGGGTTCATCGAGCGCGGCTACAAG

TCCTGCAACACGGCCGGCATCGCGAACTCCTCGGACATGTGCTGGGGCG

GGCAGAACGCCACCCTCTCGCTGGCCGGCCACTCCGGCACCCTGGTGCGC

GACGACACCACCGGCGTCTGGCACCTGCAGAGCGACGACGGCACGAAGA

TCGAACAGCTCACCGGCGCGCCCAACGGCCTGCAGAACGGCGAGCACTG

GCGGATCACCACGACCGACGGCACGCAGTTCTACTTCGGCCGCAACCAC

CTGCCCGGCGGCGACGGCACCGACCCGGCGAGCAACTCCGCCTTCAAGG

AACCGGTGTACTCGCCCAAGAGCGGCGACCCCTGCTACAACTCCTCCACC

GCCACCGGCTCCTGGTGCACGATGGGCTGGCGCTGGAACCTCGACTACG

CCGTCGACGTCCACGGCAACCTGATCACCTACACCTACGCCCAGGAGAC

CAACTACTACAGCCGAGGCGCCGGCCAGAACAGCGGCAGCGGCACCCTG

ACCGACTACACCCGCGCCGGCTACCTCACCCAGATCGCCTACGGCCAGC

GCCTGAGCGAGCAGGTCACCGCCAAGGGCGCGGCCAAGGCCGCTGCCCT

CATCACCTTCACCGCCGCGGAACGCTGCGTCCCGTCCGGCTCGATCACCT

GCACCGAGGCACAGCGCACGACCGCGAACGCCTCGTACTGGCCGGACAC

CCCGCTCGACCAGGTCTGCGCCTCCACCGGCACCTGCACCCGGGCCGGCC

CGACGTTCTTCACCACCAAGCGCCTCGCCTCCCTCACCACCCAGGTCCTG

GTCTCCGGCGCCTACCGCACCGTCGACACCTGGACGCTCACCCATTCCTT

CAAGGACCCGGGCGACGGCAACGCCAAGTCGCTGTGGCTCGACTCGATC

CAGCGCACCGGCACCAACGGGCAGACCGCGGTCACCATGCCGCCCGTCA

CCTTCACGGCGGTGATGAAGCCGAACCGGGTGGACGGGGACCTCACCCT

CAAGGACGGCACCAAGGTCACCGTCACCCCGTTCAACCGGCCCCGCCTC

CAGCAGGTCACCACGGAGACCGGCGGCCAGATCAACGTCGTCTACACCA

CCTCCTCCGACGCCGCGCACCCCGCCTGCTCGCGCCTGGCCGGCACCATG

CCCGCCGCGGCGGACGGCAACACCCTCGCCTGCGCCCCCGTCAAGTGGT

ACCTGCCCGGATCCAGCTCCCCGGACCCGGTCGACGACTGGTTCAACAA

GTACCTGATCAGCGCCGTCACCGAACAGGACGCGATCAGCGGCACCACC

CTGATCAAGGCCACCAACTACACCTACAACGGCGACGCCGCCTGGCACC

GCAACGACGCCGAGTTCACCGACGCCAAGACCCGCACCTGGGACGGCTT

CCGCGGCTACCAGTCCGTCACCAGCACCACCGGCAGCGCCTACCCGGGC

GAGGCCCCCAGGACCCAGCAGACCGCGACCTACCTGCGCGGCATGGACG

GCGACGTCAAGGCCGACGGCTCCACCCGCAGCGTCCAGGTCGCCAACCC

GCTCGGCGGCCCGGCCCTCACCGACAGCCCGTGGCTGGCCGGCTCCAGCT

TCGCCACCCAGACCTACGACCAGGCCGGCGGCACCGTCATCTCCGCCAA

CGGCTCCGTCGCCGGCGGCCAGCAGGTCACCGCCACCCACGCCCAGAGC

GGCGGCATGCCGGCCCTGGTCGCCCGCTACCCCGCCTCCCAGGTCACCAC

CACCTCCAAGTCCAAGCTCTCCGACGGGACCTGGCGCACCAACACCACC

GTCAGCACCAGCGACCCCGCGCACGCCAACCGCCCCCTCAGCAGCGACG

ACAAGGGCGACGGCACCCCCGGCGCCGAACTGTGCAGCACCAACGGCTA

CGCCACCGGCACCAACCCGATGATGCTGAACATCCTCGCCGAGCGGACG

GTCACCAAGGGCGCCTGCGGCACCCCCGTGACCTCGGCCAACACCGTCTC

CTCCGCCCGCACCCTCTACGACGGCAAGCCCTACGGCCAGGCCGGCGAC

CTCGCCGAGTCCACCAGCGCCCTGACCCTGGACCACTACGACACCGGCG

GCAACCCCGTCTACGTCCACACCGCCGCCTCCACCTTCGACGCCTACGGC

CGGCTTACCAGCGTCAGCGAGGCCAACGGCGCCACCTACGACGCCGCGG

GCAACCAGCTCACCGCGCCCAACCTCACCCCCGCCACCACCCGCACCGCC

TACACCCCGGCCACCGGCGCCATCGCCACCACCGTCACCCAGACCACGC

CCACCGGCTGGACCACCACCCTCACCCAGGACCCGGGCCGCGCCGAAGC

TCTGGTCTCCACCGACGCCAACGGCCGCGCCACCACCCAGCAGTACGAC

GGCCTCGGCCGCCTGACCGCCGCCTGGTCACCGGAGCGCGCGACCAACC

TCACCCCCAGCCAGAAGTTCTCCTACGCGGTCAACGGCACCACCGGCCCC

TCCGTCGTCACCTCCCAGTGGCTCAAGGAAGCCGGCGGCTACGCGTACA

AGAACGAGCTGTACGACGGCCTCGGCCGCCTGCGCCAGGTCCAGCGCAC

CAGCGACACCTACTCCGGGCGGCTGATCACCGACACCGTCTACGACTCGC

ACGGCTGGCCCGTCAAGACCGCCAGCCCGTACTACGAGAAGACCACCGC

GCCCAACAGCACCGTCTACCTGCCGCAGGACTCCCAGGTGCCCGCCCAG

ACCTGGGTCACCTTCGACGGCATCGGCCGGACCACCCGCTCCGCGTTCGT

CTCCTACGGACAGCAGCAGTGGGCCACCACCACCGCCTACCCCGGCGCC

GACCGCACCGACGTCACCCCGCCCAACGGCAAATACCCGACCAGCACCT

TCACCGACGGCCGCAACCAGGTCAGCGCCCTGTGGCAGTACCGCACCGC

CACCCCCACCGGCAACCCGGCCGACGCGACCGTCACCACCTACACCTAC

GACGCCGCCAACCGGCCCGCCACCCGCAAGGACGCCGCCGGGAACACCT

GGAGCTACGGCTACGACCTGCGCGGCCGCCAGACCACCGTCACCGACCC

CGACACCGGCACCACCACCACCGCCTACGACGTCAACTCGCGCGCCGTCT

CCACCACCGACGGCAAGGGCAACACCCTCGTCGTCAGCTACGACCTGAT

CGGCCGCAAGACCGGCCTCTACCAGGGCAGCATCGCCCCGGCCAACCAG

CTCGCCGGCTGGACGTACGACACCCTGCCGGGCGGAAAGGGCAAGCCCA

CCTCCTCCACCCGCTACGTCGGGGGCGCCGGCGGCTCGGCCTACACCCAG

GCCGTCACCGGCTACGACGCCGGCTACCGGCCCACCGGCACCTCGGTGA

CGATCCCCGCCAGCGAAGGCAAGCTCGCCGGTACCTACACCACCGGCCT

GACGTACAACCCGGTCCTCGGCACGCTCAAGCAGACCGACCTGCCGGCC

ATCGGCGCGGCGCCCGCCGAGAGCGTCATGTACACCTACAACATCTCCG

GCGTCCTGCAGAAGTCCTACAGCGACACCTACTACGTCTACGACGTGCAG

TACGACGCCTTCGGCCGCCCGGTCCGCACGACCACCGGCGACGCCGGAA

CCCAGGTCGTCTCCACCCAGCTCGACAAGACCGACTACACCTACAACCA

GGCCGGCGACGTCACCTCGGTCACCGACGTCCAGAACGGCACCGCCACC

GACGCCCAGTGCTTCACCTACGACCACCTCGGGCGCCTCACCCAGGCCTG

GACCGACACCGCGGGCTCCACCAGCACCACCAGCGGCACCTGGACCGAC

ACCTCCGGCACCGTCCACAACAGCGGCTCCTCCCAGTCCGTCCCCGCACT

CGGCGCCTGCGCCAACGCCAACGGCCCCGCCAGCACCGGCAGCCCCGCC

AAGCTCTCCGTCGGCGGCCCCTCCCCGTACTGGCAGAGCTACGGCTACGA

CAGCACCGGCAACCGCACCACCCTCGTCCAGCACGACACCACCGGCAAC

ACCACCAAGGACACCACCACCACCCAGACCTTCGGCCCCGCCGGATCGG

TCAACACCGCCACCGGCGCCCCCAACACCGGCGGCGGCACCGGCGGCCC

GCACGCCCTGCTCACCAGCAGCACCACCGGACCCACCGGGACCCAGGTC

ACCAGCTACCAGTACGACCAGCTCGGCAACACCACCGCGGTCACCGAGA

CGTCCGGAACCACCACCCTCGCCTGGAACGGCGAGGACAAGCTCGCCTC

CGTCACCAAGACCGGCCAGGCCCAGGCCACCAGCTACCTCTACGACGCC

GACGGCAACCAGCTCATCCGCCGCAACCCCGGCAAGACCACCCTCAACC

TCGGCAGCGACGAGGTCACCCTCGACACCGCCGCCAACTCCCTCACCGA

CACCCGCTACTACAGCGCCCCCGGCGGCATCAGCATCGCCCGCACCACC

GGACCCACCGGCGCAAGCGCCCTCGCCTACCAGGCCTCCGACCCCCACG

GCACCGCCAACGTCCAGATCAACGTCGACGCCGCCCAGACCACCACCCG

CCGCCCCACCGACCCCTTCGGCAACCCCCGCGGCACCCAGCCCGCCCCCA

ACACCTGGGCCGGCGACAAGGGCTTCGTCGGCGGCACCAAGGACGACAC

CACCGGACTCACCAACCTCGGCGCCCGCGAATACCAACCCACCACCGGC

CGCTTCCTCAACCCCGACCCACTCCTCGACGCCGGCAACCCCCAGCAGTG

GAACGGCTACGCCTACAGCGACAACGACCCCGTCAACAGCTCCGACCCC

AGCGGACTCATCACCAACGCCCTGGCCGACGGCGACACCTACGTCGCCC

GCCCCGCCGCCTTCTGCGTCACCATGTCGTGCGTCGAGCAGACCAGCGGC

CCCGGTTTCTGGGAGGACAAGCGCGTCGGTGACGCCGTCTTCGCCGCCGT

CGTCCAGGCCACCACGCAGAGCAACGGCAACGGGTCATCCCAGACCAAG

AAAGAGAAGGGCATCTGGGGCCAGGCCTGGGACTGGACCAAGAAGAAC

GGCGGCGCCATCCTCGGAGCGCTGGTAGAGGGAGCGGTCTTCAGCACAT

GCTTCATCGGAGCTGGATTCGCCGCACCTGCAACGGGAGGAATCACCGT

CATCGCCGGTGCTGCGGCCTGCGGGGCTGTGGCCGGCGAGGCAGGGGCA

CTGACCACCAATATCCTCACCCCAGATGCCGACCACTCCGTCGACGGCAT

CACCAACGACATGGTCGTTGGTGAAATCACCGGGGCGGCTGTCAGCGCA

GCGAGCGAGGGCGCAAGCTCCCTCGCCAAGCCGGCGGTCCGCAAACTCC

CCGGACCTTGCAACAGTTTCCCGGCCGGCGTCACCGTCCTCCTCGCCGAC

GGCACCACCAAGCCCATCGAACAGATCGCCCAGGGCGACCAGGTAACCG

CCACCGACCCGCAGACAGGCACCACCCAGGCAGAACCCGTCACCGACAC

GATCATCGGCCACGACGACACGGAATTCACCGACCTCACCCTCACCAAC

GACGCAGACCCCCGCGCCCCGCCCAGCGAGATCACCTCCACCACCCACC

ACCCCTACTGGAACGCCACCACCAGCCGCTGGACCGATGCCGGCGACCT

CAAGCCCGGCGACCACGTCCGCACCCCCGACGGCACCGAACTGACCGTC

AACACCGTCTACAGCTACACCACACAACCCCGGACCGCGCGCAACCTCA

CCGTCGCAGACCTCCACACGTACTATGTGCTCGCTGGAAATACGCCGGTC

CTAGTGCATAACACCGGCCCGGGATGTGGTGAGCCGGGATTCGTTAGTG

ACGCTGCTAATTCTCTCTCGGGCAGGCGCATCACCACGGGACAAATATTT

GATGCGAGCGGGAATCCGATCGGGCCTGAGATCACGAGCGGCGGCGGCA

GTCTGGCAGATAGGGCGCAGAGTTATCTTGCCGACTCCCCTAATATTCGA

AATCTGCCCGCTAAGGCGAGATATGCGTCGGCTGACCACGTTGAGGCGC

AATATGCAGTGTGGATGCGAGAAAATGGAGTGACCGACGCCAGTGTGGT

CATCAATCAAAACTATGTATGTGGGCTGCCCCTAGGCTGCCAGGCGGCG

GTGCCCGCTATCCTCCCTCGCGGCTCGACCATGACGGTATGGTATCCAGG

GTCAGGAAGTCCCATCGTATTGCGGGGAGTGGGTTAA (SEQ ID NO: 372)

DddA homolog in Thauera sp. K11 PROTEIN

>ATE59819.1 type IV secretion protein Rhs [Thauera sp. K11]

MRAFRLIACLLAFSAAAAPAAADTSSMLGRLPEASARQLKERLAPRGLASA

AALRQYLDASQRELDTAPEADDVPARSQRFAARAGELTALREQARRDLASL

EDAAKASGSAEATQRIGRIRGQVDARFDRLEGLFTTWRNAPQGSERRQARR

ELRAALATLRHAGTPAPAAIPVPTLGPLQPAGEPAANPPAARLPAYAQADDA

TGDPFTPGGFRLMKVAALPPAVAAEAATDCSATSADLADDGKDVRLTQPIR

DLAASLDYSPARILRWTQQNVAFEPYWGALKGAEGVLQTRAGNSTDQASL

LIALLRASNIPARYVRGTVQLNDTAAQDDAGGRAQRWLGTKRYRASAAVL

AGGGTSAGLQSIDGTVRGIRFSHVWVQACVPHGAYRGARAEAGGYRWLAL

DAAVKDHDYQQGIAVDVPLTDAAFYTPYLAARSDQLPHEHFAQKVAEAAR

ATDANAALADVPYAGTPRPLRYDVLPGSLPYEVEAFTNWPGLGSSETASLP

DAHRHTFTVTVRNGATTLASAALPYPQNAFKRVTLSYQPTAASQAAWNAW

TGDLPAAADGSIQVVPQIKADGTVLAAGAPANALPLAGVHNVILKVSQGER

SGAACINDSGNPADPKDTDGTCLNKTVYTNIKAGAYHALGLNALHTSNAFL

GQRLEALAAGVQAYPVAPTPAAGAGYEATVGELLHLVLQDYLHQTEQADQ

RNAALRGFKSVGPYDLGLTASDLETDYLFDIPVAIKPAGVFVDFKGGLYGFV

KLDTTAETAAARAAENVDLAKLSIYSGSALEHHVWQQALRTDAVSTVRGL

QFAAEQGIPLVTFTAANIGQYDSLMQMSGATSMAAYKSAIQNAVKGSDNGN

HGVVTVPRAQIAYADPVDPASKWTGAVYMSQNPVTGEYGAIINGTIAGGFP

LLNSTPFSNLYNFDSFVPNTLLGTNGGAGAVQTLPGGTQGESSWITKAGDPV

NMLTGNYTLQARDFTIKGRGGLPIVLERWFNAQNATDGPFGFGWTHSFNHQ

LRFYGIESGQSKVGWVDGTGAQRFYAVAAAGSIAPGTTLAAQAGVFTTLSR

LADGRFQVRETNGLTYSFESLTSPTTPPAAGSEPRARLLAIADRHGNTLTLNY

SGSQLASVSDSLGRTVLSFTWNGNRIGKVKDVSGREVNYAYEDGNGNLTRV

TDPLGQATRYSYYTSADGAKLDHALRRHTLPRGNGMEFEYYAGGQVFRHT

PFDTSGNLIPESALTFHYNSYRRESWTVDGRGAEERFLFDTHGNVIQQTAAN

GATHTYAYADPNDPHLRTRMTDPVGRVTQYSYTAEGYLQTLTLPSGAVQA

WRDYDAFGQPRRVKDARGNWTLHHYDTAGTRTDSIRVKSGVVPTVGTAPA

AANVVSWIKYQGDSVGNLTGVKRLRDWTGATLGNFASGSGPVVTTTFDAA

RLNVASVGRSGNRNGSQISETSPIFSHDALGRLTGGVDGRWYPVAFDYDVL

DRVTRATDATGQPRRYAFDVNGNRIGTELIAGGSRIDSSVAAFDVQDRVAH

VLDHAGNRVAYAYDAVGNRVSVESPDGYAIGFDYDLAGRPYSAYDEDGNR

VFSAFDVAGRVRAVIDPNGAATLYDYHGDEQDGRLARVEQPAIPGQNAGR

AAETDYDAGGLPIRVRQVSAGGEAREGYRFHDELGRVVRSVSAPDDVGQRL

QVCYSYDALSNLTQVRAGATTDTTSAACAGSPAVQLTQSWDDFGNLLTRT

DALGRVWKFEYDAHGNLVASQTPEQAKVSTRSTYRYDPALHGLLAGRSVP

GSGSAGQSVSYARNALGQVIRAETRDGAGNLVVAYDYQYDAAHRVVRIVD

SRGGKALDYAWTPAGRLASITLDGHVWRFQYDGVGRLAAIVAPNGATIAM

ARDAAGRLTERRWPDGAKSAFDWLPEGSLAAIEHSAGGSALAQFAYAYDA

WGNRTSATETLAGTSRSLAYGYDALDRLKTVTTDGATETHAFDLFGNRTSK

TTGGVTTDYLFDAAHQLTQVQIAGTPTERLAYDDNGNLRKHCVGSPSGSTS

DCTGTTVLSLAWNGLDQLIQAARTGLPAESYAYDDAGRRVTKAVGSSATHF

AYDGPDILAEYASPAGSPTAVYAHGAGIDEPLLRLTGATSTPAASAHHYAQD

GLGSIVAAYGEIGASGPVSAASVSATHSYSAGSYPPAKLIDGETTGSTGFWA

GSSGNFAADPAVITLELGAEKSVSRVRLHRVASYLPDYVVKDAEVQVRKPD

NSWQTVGTLTNNTSEDSPEIVLTGAPGSALRVLVKGVRNGSLVLMAEVTMS

ADGGAASVATARYDAWGNVTQASGSIPAFGYTGREPDATGLVYYRARYYH

PALGRFASRDPLGLAAGINPYAYAGGNPILYNDPDGLLAQLAWNTAASYWG

QPIVQETVATIRNGAAVAAGNFVPDTVNGATGWFEQFLHQESGSFGRMDSW

VDVRNPVAQDVAQDLRGVAAVGLMMTPLRYGRASNASFNPPVANLPLNTG

GKTSGMLHIPGQESLSLTSGIAGPSQVVRGQGLPGFNGNQLTHVEGHAAAY

MRTHKVSEAVLDINKAPCTAGSGGGCNGLLPRMLPEGAHLTIRHPNGVQVY

IGTPD (SEQ ID NO: 373)

DddA homolog in Thauera sp. K11 DNA

>CCZ27_07525 NZ_CP023439.1:1708666-1716450 Thauera sp. K11 chromosome, complete genome

ATGCGTGCCTTCCGCCTGATCGCCTGCCTTCTCGCCTTTTCGGCGGCAGCC

GCACCTGCTGCGGCTGACACGTCGTCGATGCTGGGGCGTCTGCCTGAAGC

AAGCGCCCGCCAGCTCAAGGAGCGGTTGGCGCCGCGTGGCCTTGCCTCC

GCTGCCGCCTTGCGCCAGTACCTGGACGCCTCGCAACGCGAGCTGGACA

CCGCACCGGAAGCGGACGACGTACCCGCCCGCAGCCAACGCTTTGCCGC

AAGGGCGGGCGAACTCACCGCGCTGCGCGAACAGGCGCGCCGGGATCTC

GCCAGTCTGGAGGACGCCGCGAAGGCGAGCGGCTCGGCCGAGGCGACGC

AGCGCATCGGTCGAATCCGCGGGCAGGTGGACGCACGCTTCGACCGGCT

CGAAGGGCTTTTTACCACTTGGCGCAATGCGCCCCAGGGCAGCGAACGC

CGCCAGGCCCGCCGCGAACTGCGTGCCGCGCTCGCCACGCTCCGCCATGC

CGGCACCCCGGCTCCGGCTGCGATTCCTGTTCCTACCCTCGGCCCCCTGC

AACCGGCCGGCGAGCCGGCTGCCAACCCACCGGCCGCGCGCTTGCCAGC

CTATGCGCAAGCGGATGACGCGACTGGCGACCCCTTTACCCCCGGTGGCT

TCCGGCTGATGAAGGTCGCCGCACTGCCGCCGGCGGTCGCGGCCGAGGC

GGCAACGGACTGCTCCGCCACCAGCGCCGACCTGGCCGACGACGGCAAG

GACGTGCGCCTGACCCAGCCGATCCGCGACCTCGCGGCATCGCTCGACTA

CTCACCGGCACGCATCCTGCGCTGGACGCAGCAGAACGTCGCCTTCGAA

CCCTACTGGGGGGCACTCAAGGGGGCGGAAGGCGTGCTGCAGACGCGCG

CCGGCAACAGCACCGACCAGGCCAGCCTGCTGATCGCACTCTTGCGGGC

CTCCAACATTCCCGCCCGCTACGTACGCGGCACCGTGCAGCTCAACGACA

CTGCCGCGCAGGACGACGCAGGCGGGCGGGCGCAGCGCTGGCTGGGCAC

CAAGCGCTACCGTGCATCGGCCGCGGTACTCGCCGGCGGCGGAACTTCC

GCCGGCCTGCAGTCGATCGACGGCACCGTCCGCGGCATCCGCTTCAGCCA

TGTCTGGGTCCAGGCCTGCGTTCCCCATGGCGCTTACCGCGGTGCCCGCG

CGGAAGCCGGCGGCTATCGCTGGCTGGCGCTGGACGCGGCGGTGAAGGA

CCATGACTACCAGCAGGGCATCGCGGTCGATGTGCCGCTCACCGATGCC

GCGTTCTACACGCCCTATCTGGCGGCGCGCAGCGACCAGTTGCCGCACGA

GCATTTCGCACAGAAGGTGGCGGAGGCGGCGCGTGCGACCGACGCCAAT

GCGGCGCTGGCCGACGTGCCCTACGCCGGTACGCCGCGGCCGCTGCGCT

ACGACGTGCTGCCCGGTTCGCTGCCCTACGAGGTCGAAGCCTTCACCAAC

TGGCCCGGCCTCGGTTCGTCCGAAACCGCAAGCCTGCCGGACGCACACC

GCCACACCTTCACCGTGACGGTCAGGAACGGCGCCACCACGTTGGCGAG

CGCCGCGCTGCCCTATCCGCAGAACGCCTTCAAGCGCGTCACGCTGTCCT

ATCAGCCGACTGCCGCCTCGCAGGCGGCCTGGAACGCCTGGACGGGCGA

TCTGCCCGCCGCGGCCGACGGCAGCATCCAGGTCGTGCCGCAGATCAAG

GCCGACGGTACCGTGCTCGCCGCAGGTGCGCCCGCCAACGCGCTGCCGC

TCGCCGGCGTGCACAACGTCATCCTCAAGGTCTCGCAGGGCGAGCGCAG

CGGTGCCGCGTGCATCAACGACAGCGGCAACCCCGCCGACCCGAAGGAC

ACCGACGGCACCTGCCTCAACAAGACCGTCTACACCAACATCAAGGCCG

GCGCCTACCACGCCCTGGGCCTGAATGCGCTGCACACCTCGAATGCCTTC

CTCGGCCAGCGGCTCGAAGCGCTGGCGGCCGGCGTGCAGGCCTATCCCG

TCGCGCCCACGCCGGCCGCGGGTGCCGGCTACGAGGCCACGGTCGGTGA

ATTGCTGCATCTGGTGCTGCAGGACTACCTGCACCAGACCGAGCAGGCC

GACCAGCGCAACGCCGCGTTGCGCGGCTTCAAGAGCGTGGGGCCGTACG

ACCTCGGGCTGACCGCGTCCGACCTCGAAACCGACTACCTCTTCGACATC

CCGGTCGCGATCAAGCCGGCCGGCGTGTTCGTGGACTTCAAGGGCGGCC

TCTACGGTTTCGTCAAACTCGATACCACGGCCGAGACGGCCGCGGCACG

CGCCGCCGAAAACGTGGATCTGGCCAAGCTCTCGATCTACTCCGGCTCCG

CGCTCGAACACCACGTCTGGCAGCAGGCGCTGCGCACCGATGCGGTGTC

CACCGTGCGTGGGCTGCAGTTCGCCGCCGAGCAGGGCATTCCGCTCGTCA

CCTTCACCGCGGCCAACATCGGCCAGTACGACAGCCTCATGCAGATGAG

CGGCGCCACCAGCATGGCCGCTTACAAGAGCGCGATCCAGAACGCGGTG

AAGGGCTCGGACAACGGCAACCACGGCGTCGTCACCGTGCCGCGCGCCC

AGATCGCCTACGCCGACCCCGTCGATCCGGCGAGCAAATGGACCGGCGC

GGTCTACATGTCTCAGAACCCCGTCACCGGAGAGTACGGGGCGATCATC

AACGGCACCATCGCCGGCGGCTTCCCGCTGCTCAACAGCACGCCCTTCAG

CAATCTCTACAACTTCGATTCCTTCGTGCCCAACACCCTCCTTGGCACGA

ACGGGGGTGCCGGTGCGGTGCAGACCCTGCCCGGCGGCACCCAGGGCGA

GAGTTCCTGGATCACCAAGGCCGGCGACCCGGTGAACATGCTCACCGGC

AACTACACGCTGCAGGCACGCGACTTCACCATCAAGGGCCGGGGCGGAC

TGCCGATCGTGCTGGAGCGCTGGTTCAACGCGCAGAACGCCACCGACGG

GCCGTTCGGCTTCGGCTGGACGCACAGCTTCAACCATCAGTTGCGTTTCT

ACGGCATCGAGAGCGGCCAGTCCAAGGTCGGCTGGGTGGACGGCACTGG

CGCCCAGCGCTTCTACGCCGTGGCCGCCGCCGGCAGCATTGCGCCGGGC

ACGACGCTGGCCGCGCAGGCCGGGGTGTTCACGACGCTGTCGCGTCTGG

CCGACGGCCGCTTCCAGGTGCGCGAGACCAACGGCCTCACCTACAGCTTC

GAATCGCTCACGAGCCCGACCACCCCGCCGGCCGCGGGCAGCGAACCGC

GCGCAAGACTGCTGGCCATCGCCGACCGCCACGGCAACACCCTGACGCT

CAACTACAGCGGCAGCCAGCTTGCCTCGGTGAGCGACAGCCTCGGCCGC

ACGGTGCTCAGCTTCACCTGGAACGGCAACCGCATCGGCAAGGTGAAGG

ACGTCAGCGGACGGGAAGTGAACTACGCCTACGAGGACGGCAACGGCA

ACCTCACGCGCGTCACCGATCCGCTGGGTCAAGCCACGCGCTACAGCTAC

TACACCAGTGCCGACGGTGCCAAGCTCGACCACGCCCTGCGCCGCCACA

CCCTGCCGCGCGGCAACGGCATGGAGTTCGAGTACTACGCCGGTGGCCA

GGTCTTCCGCCACACGCCGTTCGACACCAGCGGCAACCTCATTCCCGAAT

CGGCGCTGACCTTCCACTACAACAGTTATCGGCGCGAGAGCTGGACGGT

CGATGGCCGCGGTGCCGAGGAGCGCTTCCTGTTCGACACGCACGGCAAC

GTGATCCAGCAGACCGCCGCCAACGGTGCCACCCACACCTACGCGTACG

CCGACCCGAACGATCCGCATCTGCGCACGCGCATGACAGACCCGGTCGG

CCGCGTCACCCAGTACAGCTATACCGCCGAAGGCTATCTGCAGACCCTGA

CGCTGCCGTCGGGCGCCGTGCAGGCGTGGCGCGACTACGACGCCTTCGG

CCAGCCCCGCCGCGTCAAGGACGCGCGCGGCAACTGGACGCTCCACCAC

TACGACACCGCCGGGACACGGACCGACTCCATCCGGGTCAAATCGGGCG

TGGTCCCCACCGTCGGCACCGCGCCTGCCGCGGCCAACGTCGTTTCCTGG

ATCAAGTACCAGGGCGACAGCGTGGGCAACCTCACCGGCGTCAAGCGCC

TGCGCGACTGGACGGGCGCGACCCTGGGCAATTTCGCCAGCGGCAGCGG

CCCCGTCGTCACCACCACCTTCGATGCGGCCAGGCTCAACGTCGCCAGCG

TCGGCCGTAGCGGCAACCGCAACGGCAGCCAGATCAGCGAGACCAGCCC

GATCTTCTCCCACGACGCGCTGGGGCGCCTCACCGGCGGGGTGGACGGG

CGCTGGTATCCGGTCGCCTTCGATTACGACGTGCTCGACCGCGTCACCCG

CGCCACCGACGCCACGGGCCAGCCGCGCCGCTACGCGTTCGACGTCAAC

GGCAACCGCATCGGTACGGAGCTGATTGCCGGCGGCAGCCGTATCGATT

CCTCGGTGGCCGCCTTCGACGTGCAGGACCGCGTCGCCCACGTCCTCGAT

CACGCCGGCAACCGCGTGGCCTACGCCTACGATGCGGTGGGCAACCGGG

TGAGCGTGGAAAGCCCCGACGGCTACGCCATCGGCTTCGACTACGACCT

CGCCGGACGGCCCTATTCGGCCTACGACGAAGACGGCAACCGCGTCTTCT

CCGCCTTCGACGTGGCCGGGCGCGTGCGAGCGGTCATCGACCCCAACGG

CGCCGCGACGCTCTACGACTATCACGGCGACGAGCAGGACGGGCGTCTC

GCGCGCGTGGAGCAGCCCGCCATCCCGGGCCAGAACGCGGGCCGCGCCG

CCGAGACCGACTACGATGCGGGTGGGTTGCCCATCCGCGTGCGCCAGGT

CTCGGCCGGCGGCGAAGCGCGCGAAGGCTACCGTTTCCACGACGAGCTT

GGCCGCGTGGTGCGCAGCGTCTCCGCGCCGGACGACGTCGGCCAGCGGC

TGCAGGTCTGCTACAGCTACGATGCACTCTCGAACCTCACCCAGGTGCGC

GCCGGCGCCACCACCGACACCACCAGTGCCGCCTGCGCCGGCAGCCCCG

CGGTGCAGCTCACCCAGAGCTGGGACGACTTTGGCAACCTGCTGACGCG

CACCGACGCGCTGGGCCGGGTGTGGAAGTTCGAGTACGACGCCCACGGC

AACCTCGTCGCCAGCCAGACGCCCGAGCAGGCCAAGGTCTCGACGCGCA

GCACCTACCGCTACGATCCGGCGCTGCACGGCTTGCTGGCCGGGCGCAG

CGTGCCGGGCAGCGGCAGTGCGGGCCAGAGCGTGAGCTATGCGCGCAAC

GCGCTCGGCCAGGTCATCCGCGCCGAGACGCGCGACGGCGCGGGCAACC

TCGTCGTCGCCTACGACTACCAGTACGACGCCGCCCACCGTGTGGTGCGC

ATCGTCGACAGCCGCGGCGGCAAGGCGCTCGACTACGCCTGGACGCCCG

CCGGGCGGCTGGCGAGCATTACCCTGGACGGCCATGTCTGGCGCTTCCAG

TACGACGGCGTCGGCCGGCTCGCCGCGATCGTCGCGCCCAACGGCGCCA

CCATAGCGATGGCACGCGATGCCGCCGGGCGGCTCACCGAGCGGCGCTG

GCCCGACGGCGCGAAGAGCGCCTTCGACTGGCTGCCCGAAGGCAGCCTC

GCCGCCATCGAGCACAGCGCGGGCGGCAGCGCGCTCGCACAGTTCGCCT

ATGCCTACGATGCCTGGGGCAACCGCACGAGCGCCACCGAGACCCTCGC

GGGCACCAGCCGCAGCCTCGCCTACGGCTACGACGCGCTCGACCGCCTG

AAGACCGTCACCACCGACGGTGCGACCGAAACCCATGCCTTCGATCTCTT

CGGCAATCGCACCAGCAAGACCACGGGGGGGTGACCACCGACTATCTC

TTCGACGCGGCGCACCAGCTCACCCAGGTGCAGATCGCCGGCACCCCCA

CCGAGCGGCTCGCCTACGACGACAACGGTAATCTCCGCAAGCACTGCGT

CGGCAGTCCGAGTGGCAGCACCAGCGATTGCACCGGCACCACCGTGCTG

AGCCTCGCCTGGAACGGCCTCGACCAGTTGATCCAGGCCGCCAGGACGG

GCCTGCCCGCCGAGTCCTACGCCTACGACGATGCCGGGCGGCGTGTCACC

AAGGCGGTGGGCAGCAGCGCCACCCACTTCGCCTACGACGGTCCCGACA

TCCTGGCCGAGTACGCCAGCCCGGCCGGCAGCCCCACCGCCGTCTATGCC

CACGGTGCCGGCATCGACGAACCGCTGCTGCGCCTCACCGGCGCGACGA

GCACGCCGGCCGCTTCCGCGCACCACTACGCGCAGGACGGGCTGGGCAG

CATCGTCGCGGCCTATGGCGAGATCGGCGCCAGCGGTCCGGTCAGTGCC

GCGAGCGTATCGGCCACCCACAGTTACAGCGCCGGCAGCTACCCGCCGG

CAAAGCTGATCGACGGCGAGACGACCGGAAGCACCGGGTTCTGGGCTGG

CAGCTCGGGCAACTTCGCTGCCGATCCAGCCGTGATCACGCTGGAACTGG

GTGCGGAGAAAAGCGTGAGCCGCGTGAGGCTGCACCGGGTGGCCAGCTA

CCTGCCCGACTACGTGGTCAAGGATGCCGAGGTGCAGGTCCGAAAACCG

GACAATTCGTGGCAGACGGTCGGCACGCTGACAAACAACACCAGCGAAG

ACAGTCCCGAGATCGTGCTCACCGGCGCCCCCGGCAGCGCGCTGCGCGT

GCTCGTCAAGGGCGTGCGCAACGGCAGCCTGGTGCTGATGGCCGAGGTG

ACGATGAGTGCGGACGGTGGCGCGGCCAGCGTGGCCACCGCCCGCTACG

ACGCCTGGGGCAACGTCACGCAGGCGAGCGGCAGCATCCCGGCCTTCGG

CTACACCGGACGCGAGCCCGATGCCACGGGCCTGGTCTACTACCGCGCC

CGCTACTACCACCCCGCGCTCGGCCGCTTCGCCAGCCGCGACCCGCTGGG

GCTGGCGGCGGGGATCAATCCCTACGCCTACGCGGGCGGCAATCCCATC

CTCTACAACGATCCGGATGGCTTGCTGGCGCAACTGGCGTGGAATACGG

CGGCCAGCTACTGGGGACAGCCGATAGTTCAAGAAACGGTCGCCACGAT

TCGAAATGGGGCCGCAGTGGCCGCTGGCAACTTCGTTCCAGACACGGTC

AACGGTGCAACAGGTTGGTTTGAGCAGTTCCTGCACCAAGAATCGGGCT

CGTTCGGGCGCATGGACTCGTGGGTGGATGTGCGAAACCCCGTTGCGCA

GGACGTAGCCCAGGACCTGCGCGGTGTCGCAGCCGTTGGGTTAATGATG

ACGCCGCTGCGGTATGGTCGTGCCTCCAACGCGTCTTTCAATCCGCCAGT

AGCCAATCTTCCGCTCAACACTGGAGGAAAAACATCTGGCATGTTGCAC

ATTCCAGGGCAAGAATCACTGTCGCTCACGAGCGGAATTGCGGGGCCGT

CTCAAGTCGTTAGAGGTCAAGGTTTGCCAGGATTCAACGGTAATCAGTTG

ACCCATGTGGAAGGTCATGCTGCTGCTTACATGCGGACTCACAAGGTCTC

TGAGGCTGTTCTGGACATAAACAAAGCACCTTGCACCGCTGGTAGTGGTG

GTGGATGTAATGGGTTGCTTCCCCGAATGCTGCCGGAGGGGGCTCATTTA

ACAATTCGACACCCAAATGGTGTTCAAGTTTATATTGGCACTCCTGACTA

A (SEQ ID NO: 374)

Chondromyces crocatus PROTEIN

>AKT41505.1 type IV secretion protein Rhs [Chondromyces crocatus]

MSMSASRSQPAFPFVSASSPRPRRRPPFPRALLLLIAVLLVGACGDAGGPLLW

SSSSQALWEPSPIPPLPPLLCLGPGDGPSPFPPDLTQGTTTAAGTLPGSFSVTST

GEATYTIPVPTLPGRAGIEPSLAITYDSAQGEGLLGIGFHLQGLSSVDRCPRNV

AQDGHIAPVRDAEDDALCLDGQRLVPVDPQPGRAPREYRTFPDSFTRVEAD

FAESEGWPAERGPKRLRAHGKAGLIYEYGGESSGRVLAQGEAVRSWLLTRL

SDRDGNTMAVVYRNDLHAKGYTVEHAPQRITYTRHPTVPASRMVEFTYGP

LEAADVRVHYARGMELRRSLSLRSIQMFGPGHVLARELRFGYGHGPATGRL

RLEAVRECAGDGTCKPPTRFTWHTAGAAGYTQQQTLVEVPLSERGTLMTM

DVSGDGLDDLVTSDMVVEAGTEEPITRWSVALNRSQELTPGFFEAAVTGQE

QPHFIDAEPPYQPELGTPLDYDHDGRMDLFLHDVHGQSMTWEVLLSNGDGR

FTRRDTGVPRPFTMGMTPAGLRSPDASTHLVDVDGDGMVDLLQCYLSAHE

QLWYLHRWTAAAGGFAPHGDRVHALSSYPCHAELHAVDVDADGRVDLVM

QELILVGSQVRAGWQYVAFSYELSDGSWTRALTGLRLTPPGDRVFFLDVNG

DGLPDAVQSSRDDEQLYTSMNIGAGFAAPVPSLATPTLGAARFVRFASVLDH

NADGRQDLLLAMSDGGSESLPAWKVLQATGEVGPGTFEIVHPGLPMGIVLQ

QDELPTPDHPLTPRVTDVNGDGAQDLLYAFNNQVHVFENVLGQEDLLAAV

TDGMNAHAPEDAEYLPNVQIRYDHLIDRARTTEGFEDAPGIPSPEQRTYRPL

EQSDEEPCRYPVRCVVGHRRVVSGYVLNNGADRPRTFQVAYRNGRHHRLG

RGFLGFGTRIVRDLDTGAGTAEFYDNVTFDGAFQAFPFRGQVQRSWRWSPS

LPLDAHSAEPASLELLTTRSYAVVIPTQAGTYFTLSLLEGKSRHQGTFSPGSG

KTLEEAVRALEGDLASRMSDTLRTVSDFDLYGNILAEQTQTEGVDLDLSVTR

SFDNDPLSWRLGELTRETTCSKAGGETQCRVMHRSYDGRGHVRLERVGGEP

FDPEMQLDVWFSRDALGNIHSTRSRDGTGQVRASCTSYDALGLMPYAHRNL

EGHQSYTRYDPAVGVLRASVDPNGLVSRWAYDGFGRVTLESLPGRMPTVIR

RTWTKDGGAAGNAWNLKIRTASVGGQDETVQLDGLGREVRWWWQALDV

GEEQAPRMMQEVAFDARGEHLAWRSLPIVDPAPPGSVQVRETWQYDGMGR

VLRHVTPWGAATTHEYIGRDEVITAPGQAVTRIASDPLGRPTAVGDPEGGVS

RYTYGPFGGLREVTTPAGAVTLTERDAFGRVRRQVSPDRGVSTAHYDGYGQ

KISSLDAAGRAVTTRYDTLGRIFRQVDEDGVTEFRWDDAQHGVGQLALVVS

PDGHRLRYGFDHLGRPATTTLEIGGESFTSRLSYDLSGRLERIEYPSAPGIGSF

AIEREYDPHGRLRALKDAGSGAEFWRATAIDAGNRITGERFGGGTATTLRTF

DAARERVSRIETQTAGGPVQQLSYLWNDRRKLVERSDGLHANVERFRYDLL

DRLTCAQFGLINAALCERPFTYGPDGNLLQKPGVGAYEYDPAQPHAVVRAG

SAFYGYDAVGNQTSRPGATIAYTAFDLPKRIALTSGDTVDFAYDGLQQRVR

KTTATQEIASFGEVYERVTDVVTGAVEHRYHVRNDERVVTLVRRSVAQGTR

TLHVHVDHLGSIDVLTDGVTGSVAERRSYDAFGAPRHPDWGSGQPPSPHEL

SSLGFTGHEADLDLGLVNMKGRIYDPKLGRFLTPDPLVPRPLFGQSWNSYSY

VLNSPLSLVDPSGFQEQPPATEDGCSQGCTIWVFGPPREPKPPAPPKVVEGNL

EDAAGTGSTQAPVDVGTSGVRSGWSPQLPATLQTLGRGDAIARRIMDGVRI

GMARMLLESAKLGILGGTSRVYVAYTNLTAAWNGYKESGLPGALDAVNPA

SQMVQAGVEAYEAAAAEDWEAAGASLFKAGSIGMSILATAVGVGGAITAT

VGSTAGAAGRAAARAPSLPAYAGGKTSGVLRTTAGDTALLSGYKGPSASMP

RGTPGMNGRIKSHVEAHAAAVMREQGMKEGTLYINRVPCSGATGCDAMLP

RMLPPDAHLRVVGPNGYDQVFVGLPD (SEQ ID NO: 375)

Chondromyces crocatus DNA

>CMC5_057130 NZ_CP012159.1:7808731-7815414 Chondromyces crocatus strain Cm c5, complete genome

ATGTCCATGTCGGCCTCACGGAGTCAGCCCGCATTCCCCTTCGTGTCGGC

CTCCTCTCCGCGTCCGCGCCGGCGCCCTCCCTTTCCCCGAGCGCTGCTCCT

CCTCATCGCCGTGCTCCTCGTCGGCGCATGCGGCGACGCTGGCGGCCCGC

TTCTCTGGTCGAGCAGCTCCCAGGCCCTCTGGGAACCCTCCCCGATCCCG

CCGCTCCCCCCGCTCCTGTGCCTCGGCCCCGGCGACGGTCCCTCCCCCTTT

CCGCCTGACCTTACGCAGGGGACCACCACCGCGGCGGGGACCCTGCCAG

GGAGCTTTTCGGTCACGAGCACGGGCGAGGCGACGTACACGATCCCGGT

CCCCACGCTGCCTGGCCGTGCCGGCATCGAGCCCTCGCTGGCGATCACCT

ACGACAGTGCGCAGGGTGAAGGGCTGCTCGGGATCGGCTTCCACTTGCA

GGGCCTCTCGTCGGTCGATCGCTGCCCCCGGAACGTCGCGCAGGATGGTC

ACATCGCGCCGGTCCGGGATGCCGAGGACGACGCCTTGTGCCTCGATGG

GCAGCGGCTCGTCCCCGTGGACCCGCAGCCAGGGCGTGCGCCGCGGGAA

TACCGCACGTTCCCGGACAGCTTCACGCGCGTCGAGGCCGACTTCGCGGA

GAGCGAGGGGTGGCCGGCGGAGCGTGGGCCGAAGCGGCTGCGGGCGCA

TGGCAAAGCGGGGCTGATCTACGAATACGGTGGAGAATCATCGGGCCGG

GTGCTCGCGCAAGGGGAGGCGGTGCGGTCCTGGTTGCTGACGCGGCTCA

GCGACCGGGATGGCAACACGATGGCGGTGGTCTACCGGAATGACCTCCA

CGCGAAGGGCTACACCGTCGAGCACGCGCCGCAGCGGATCACCTACACC

AGGCACCCGACTGTGCCGGCCTCGCGCATGGTGGAGTTCACGTACGGGC

CGCTGGAGGCGGCGGACGTGCGCGTACACTATGCCCGCGGGATGGAGCT

GCGCCGCTCGCTGAGCTTGCGCTCGATCCAGATGTTCGGGCCGGGACACG

TGCTCGCGAGGGAGCTGCGCTTCGGTTACGGGCATGGGCCGGCGACGGG

TCGCTTGCGACTGGAGGCGGTTCGGGAGTGCGCAGGTGACGGGACGTGC

AAGCCGCCGACACGCTTCACCTGGCACACGGCCGGAGCGGCTGGATACA

CGCAGCAGCAGACACTGGTGGAGGTGCCGCTGTCGGAGCGCGGCACGTT

GATGACGATGGACGTCAGCGGCGATGGCCTCGACGACCTGGTGACGTCC

GACATGGTGGTGGAGGCCGGCACGGAAGAGCCGATCACCCGCTGGTCGG

TCGCGCTCAACCGGAGCCAGGAGCTGACGCCGGGGTTCTTCGAGGCGGC

CGTCACTGGGCAGGAGCAGCCGCATTTCATCGACGCAGAGCCGCCGTAC

CAGCCGGAGCTGGGGACGCCGCTCGACTACGACCACGATGGCCGGATGG

ACCTGTTTCTGCACGATGTGCACGGGCAGTCGATGACGTGGGAGGTGCTG

CTGTCGAATGGAGATGGGCGGTTCACGCGGCGGGATACGGGGGTGCCGC

GGCCGTTCACGATGGGCATGACGCCGGCGGGATTGCGCAGCCCGGATGC

GTCGACCCATCTGGTGGATGTTGACGGTGACGGGATGGTGGACCTGCTGC

AGTGCTACCTGAGCGCGCACGAGCAGCTCTGGTACTTGCACCGCTGGAC

GGCAGCGGCGGGGGGCTTCGCGCCGCACGGCGATCGGGTGCATGCGCTG

AGCTCCTACCCGTGCCACGCCGAGCTGCACGCGGTCGATGTCGACGCGG

ATGGGCGGGTGGACCTGGTGATGCAGGAGCTGATCCTCGTCGGGAGCCA

GGTGCGGGCGGGGTGGCAGTACGTGGCGTTCTCGTACGAGCTGTCCGAT

GGATCGTGGACGCGCGCGCTGACGGGGCTGCGGCTCACGCCGCCTGGGG

ACCGGGTGTTCTTCCTCGACGTCAACGGCGATGGGCTGCCCGATGCGGTG

CAGAGCAGCCGGGACGATGAGCAGCTGTACACGTCGATGAATATCGGCG

CGGGATTCGCGGCGCCGGTACCGAGCCTGGCGACGCCGACGCTCGGGGC

TGCGAGGTTCGTTCGGTTTGCGTCGGTGCTCGATCACAACGCGGATGGGC

GACAAGACCTGCTGCTGGCCATGAGCGATGGGGGATCGGAGTCGCTGCC

CGCGTGGAAGGTGCTCCAGGCGACGGGGGAGGTCGGTCCGGGGACGTTC

GAGATCGTCCATCCCGGGCTGCCGATGGGCATCGTGCTCCAGCAGGACG

AGCTGCCCACGCCCGACCATCCGCTCACGCCGCGGGTCACTGACGTGAAT

GGGGATGGGGCGCAGGATCTGCTCTATGCGTTCAACAACCAGGTCCATG

TGTTCGAGAACGTGCTCGGCCAGGAGGACCTGCTCGCGGCCGTGACCGA

CGGCATGAATGCGCACGCTCCGGAGGACGCCGAGTACCTGCCCAACGTG

CAGATCCGGTACGACCACCTGATCGATCGTGCGCGGACGACGGAGGGCT

TCGAGGATGCTCCAGGGATCCCGTCACCCGAGCAGCGCACCTACCGGCC

TCTGGAGCAAAGCGATGAGGAGCCCTGCCGCTATCCGGTGCGGTGCGTG

GTCGGGCATCGGCGGGTGGTGAGCGGCTATGTGCTCAACAATGGCGCGG

ATCGGCCGCGCACCTTCCAGGTGGCCTACCGCAATGGCCGTCACCATCGC

CTGGGCCGAGGGTTTCTGGGGTTCGGGACGCGGATCGTGCGTGACCTCG

ATACCGGCGCGGGGACGGCCGAGTTCTACGACAACGTCACGTTTGATGG

CGCCTTCCAGGCCTTCCCTTTCCGAGGGCAGGTACAGCGCTCGTGGCGCT

GGAGTCCGAGCTTGCCGCTGGACGCGCATAGCGCGGAGCCGGCGTCCCT

CGAGCTGCTGACGACGCGGAGCTACGCGGTGGTGATCCCCACGCAAGCG

GGGACGTACTTCACCCTCTCGCTGCTGGAGGGCAAGAGCCGTCATCAGG

GCACGTTCTCACCGGGGAGTGGGAAAACGCTCGAAGAAGCCGTGCGCGC

TCTGGAAGGAGATCTCGCCTCGCGAATGAGCGACACGCTCCGCACCGTC

AGCGACTTCGACCTCTACGGGAACATCCTCGCCGAGCAAACGCAGACGG

AGGGCGTCGACCTCGACCTCTCGGTGACGCGCAGCTTCGACAACGACCC

GCTCTCCTGGCGCCTTGGCGAGCTGACGCGAGAGACGACGTGCAGCAAA

GCGGGCGGTGAGACGCAGTGCCGGGTGATGCACCGGAGCTATGACGGGC

GCGGCCACGTTCGCCTGGAGCGCGTCGGGGGAGAGCCCTTCGACCCGGA

GATGCAGCTCGATGTCTGGTTCTCGCGGGACGCGCTGGGCAACATCCACA

GCACCCGGTCACGTGATGGGACGGGGCAGGTGCGCGCGAGCTGCACCAG

CTACGACGCGCTGGGCTTGATGCCTTATGCCCACCGCAACCTGGAGGGCC

ACCAGAGCTATACGCGCTACGACCCGGCCGTGGGCGTGCTGCGGGCGTC

GGTGGATCCCAACGGCCTGGTGAGCCGCTGGGCCTACGATGGCTTCGGG

CGGGTGACGCTGGAGAGCCTCCCCGGGCGCATGCCCACCGTCATCCGGC

GGACCTGGACGAAGGACGGCGGAGCGGCTGGCAACGCCTGGAACCTGA

AGATCCGCACCGCCTCGGTGGGGGGCCAGGACGAGACCGTGCAGCTCGA

TGGTCTCGGGCGGGAGGTGCGCTGGTGGTGGCAAGCGCTCGACGTGGGG

GAAGAGCAAGCGCCGCGGATGATGCAGGAGGTCGCCTTCGATGCGCGGG

GCGAGCACCTCGCGTGGCGCTCGCTGCCGATCGTGGATCCCGCGCCACCA

GGCTCGGTGCAGGTGCGAGAGACGTGGCAATACGACGGGATGGGGCGG

GTGCTCCGGCACGTCACGCCGTGGGGGGCGGCGACGACGCACGAGTACA

TCGGGCGGGACGAGGTCATCACCGCGCCTGGGCAGGCCGTCACCCGAAT

CGCCAGCGATCCGCTCGGGAGGCCCACGGCAGTGGGTGATCCCGAAGGT

GGCGTCAGCCGGTACACCTACGGTCCCTTCGGGGGGCTGCGCGAGGTGA

CCACGCCCGCTGGTGCCGTGACGCTGACCGAGCGGGATGCGTTTGGCCG

CGTGCGACGGCAGGTGAGCCCGGACCGGGGAGTCTCTACTGCGCACTAC

GACGGTTACGGGCAGAAGATCTCATCGCTCGACGCGGCAGGACGCGCGG

TCACGACCCGCTACGACACGCTGGGTCGGATTTTCAGGCAGGTCGACGA

AGACGGCGTCACCGAGTTCCGTTGGGATGACGCGCAGCATGGAGTGGGT

CAGCTCGCGCTGGTGGTCAGCCCCGATGGGCATCGGCTGCGCTACGGCTT

CGACCACCTCGGGCGACCAGCGACGACGACGCTGGAGATCGGAGGGGA

AAGCTTCACCAGCCGGCTGTCTTATGATCTGAGCGGCCGGCTCGAGCGGA

TCGAGTACCCGAGCGCGCCGGGGATTGGCAGCTTCGCCATCGAGCGGGA

GTACGATCCTCACGGGCGGCTGCGGGCGCTGAAGGATGCGGGGTCGGGG

GCGGAGTTCTGGCGAGCCACCGCGATCGATGCGGGGAATCGCATCACGG

GGGAGCGCTTCGGTGGGGGGACCGCCACCACGCTCCGCACGTTCGACGC

GGCACGGGAGCGGGTGAGTCGGATCGAGACGCAGACGGCAGGTGGGCC

CGTCCAGCAGCTCTCCTACCTCTGGAACGATCGCCGCAAGCTCGTCGAGC

GCTCCGATGGCCTCCACGCCAACGTCGAGCGCTTTCGTTACGACCTGCTG

GACCGGCTGACGTGCGCGCAGTTCGGGCTGATCAATGCTGCCCTCTGCGA

GCGACCGTTCACCTACGGACCCGACGGCAACCTGCTCCAGAAGCCCGGC

GTCGGTGCCTACGAGTACGACCCCGCGCAGCCCCACGCCGTCGTCCGAG

CTGGTAGCGCGTTCTACGGCTACGACGCCGTCGGCAACCAGACCTCACG

ACCCGGCGCGACCATCGCCTACACCGCGTTCGACCTACCGAAGCGAATC

GCGCTCACCAGCGGCGACACCGTCGACTTCGCGTACGACGGCCTCCAGC

AGCGGGTGCGCAAGACCACGGCGACGCAGGAGATCGCCTCCTTCGGCGA

GGTGTACGAGCGCGTGACCGATGTCGTCACGGGAGCCGTCGAGCATCGC

TACCACGTGCGCAACGACGAGCGCGTCGTCACGCTGGTGCGGCGCTCGG

TCGCGCAAGGCACGCGCACGCTGCATGTCCATGTCGACCACCTCGGGTCG

ATCGATGTGCTCACCGACGGTGTGACCGGCAGCGTCGCCGAGCGCCGCA

GCTACGATGCCTTCGGCGCACCGCGCCATCCCGACTGGGGTTCGGGTCAG

CCTCCGTCACCCCACGAGCTGTCGTCGCTTGGCTTCACCGGGCACGAGGC

CGACCTCGACCTCGGCCTCGTGAACATGAAGGGGCGCATCTACGACCCC

AAGCTCGGACGGTTCCTCACGCCCGATCCGCTCGTGCCGCGGCCTCTCTT

CGGGCAGAGCTGGAATAGCTATTCGTACGTGCTAAACAGCCCGCTGTCG

CTGGTCGATCCCAGTGGGTTTCAAGAGCAGCCACCTGCGACAGAGGACG

GATGCTCGCAGGGCTGCACCATCTGGGTGTTCGGTCCTCCCCGCGAGCCG

AAGCCACCTGCGCCGCCCAAGGTCGTCGAGGGCAACCTGGAGGACGCCG

CTGGCACTGGTTCGACCCAGGCGCCGGTCGATGTCGGGACCTCCGGGGTC

CGTAGCGGATGGAGTCCGCAGCTCCCGGCCACGTTGCAGACCTTGGGCC

GTGGTGACGCCATCGCCAGGCGCATCATGGACGGCGTCCGCATCGGGAT

GGCCAGGATGCTGCTGGAGTCCGCAAAGCTCGGCATCCTGGGCGGCACC

AGCCGCGTCTACGTCGCCTACACCAACCTCACCGCCGCCTGGAATGGCTA

CAAAGAGAGCGGGCTCCCCGGCGCTCTCGACGCCGTCAATCCCGCCAGC

CAGATGGTCCAAGCCGGCGTGGAGGCCTACGAGGCTGCCGCCGCAGAGG

ACTGGGAGGCCGCCGGCGCCAGCTTGTTCAAGGCCGGGTCGATCGGGAT

GTCGATCCTGGCGACGGCTGTTGGCGTCGGGGGAGCGATCACTGCGACA

GTGGGCTCGACGGCAGGAGCGGCGGGGAGGGCAGCCGCAAGAGCCCCC

TCACTCCCTGCATATGCTGGCGGAAAAACGTCGGGAGTACTACGGACCA

CCGCAGGCGATACAGCACTGCTGAGCGGCTACAAGGGGCCGTCCGCATC

GATGCCTCGAGGAACGCCAGGCATGAACGGACGCATCAAGTCGCATGTA

GAAGCTCATGCGGCTGCCGTGATGCGAGAGCAAGGGATGAAGGAAGGA

ACCCTGTACATCAATCGAGTCCCCTGCTCTGGCGCCACCGGATGCGACGC

GATGCTCCCAAGAATGCTCCCACCAGATGCACACCTTCGCGTGGTCGGTC

CGAATGGTTACGATCAAGTTTTTGTCGGGCTGCCCGACTGA

(SEQ ID NO: 376)

Fusion Proteins

In some aspects, the present disclosure provides fusion proteins comprising any of the zinc finger domain-containing proteins provided herein and/or any of the DddA variants provided herein.

In one aspect, the present disclosure provides fusion proteins comprising a zinc finger domain-containing protein disclosed herein and an effector protein. In some embodiments, the effector protein comprises nuclease activity, nickase activity, recombinase activity, deaminase activity, methyltransferase activity, methylase activity, acetylase activity, acetyltransferase activity, transcriptional activation activity, transcriptional repression activity, or polymerase activity. In some embodiments, the effector protein comprises a nucleic acid editing domain. In certain embodiments, the nucleic acid editing domain comprises a deaminase domain (e.g., an adenosine deaminase domain or a cytidine deaminase domain). In certain embodiments, the cytidine deaminase domain is a double-stranded DNA cytidine deaminase (DddA) domain (e.g., a wild-type DddA deaminase domain, or any of the DddA variant deaminase domains disclosed herein).

In this aspect, the structure of a fusion protein may comprise, for example:

- NH₂-[zinc finger domain-containing protein]-[effector protein]-COOH; or
- NH₂-[effector protein]-[zinc finger domain-containing protein]-COOH.
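
By way of illustration only, the two orientations listed above can be written out programmatically as ordered lists of domains read from the N-terminus to the C-terminus. The following is a minimal sketch in Python. The names `Domain` and `fusion_architecture` are hypothetical helpers introduced for this example and are not part of the disclosure. The sketch renders the order of the domains only and does not model linkers, targeting sequences, or amino acid sequences.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Domain:
    """A named protein domain (hypothetical helper for illustration)."""
    name: str


def fusion_architecture(domains: Sequence[Domain]) -> str:
    """Render domains, given in N-to-C terminal order, as a structure string.

    "NH2" is written in plain ASCII here; it corresponds to the amino
    terminus written as NH₂ in the structures above.
    """
    inner = "-".join(f"[{d.name}]" for d in domains)
    return f"NH2-{inner}-COOH"


zinc_finger = Domain("zinc finger domain-containing protein")
effector = Domain("effector protein")

# The two orientations recited above: the effector protein fused
# C-terminal or N-terminal to the zinc finger domain-containing protein.
orientations = [
    fusion_architecture([zinc_finger, effector]),
    fusion_architecture([effector, zinc_finger]),
]

for structure in orientations:
    print(structure)
```

Representing each architecture as an ordered list keeps the N-to-C orientation explicit, so that further exemplary arrangements could be expressed by reordering or extending the list.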