Nucleic Acid Binding Domains and Methods of Use Thereof

Information

  • Patent Application
  • Publication Number
    20220306699
  • Date Filed
    June 26, 2019
  • Date Published
    September 29, 2022
Abstract
Provided herein are polypeptides, compositions comprising the polypeptides and methods for genome editing and gene regulation (e.g., activation and/or repression) using the polypeptides or the compositions comprising the polypeptides, such as, DNA binding domains derived from the genus of Ralstonia. Also disclosed are DNA binding proteins that include a fragment of N-cap sequence of a TALE protein, such as, a Xanthomonas TALE protein. Also disclosed are DNA binding proteins that include a fragment of N-cap sequence of a DNA binding protein derived from bacteria of the genus Ralstonia.
Description
INCORPORATION BY REFERENCE OF SEQUENCE LISTING PROVIDED AS A TEXT FILE

A Sequence Listing is provided herewith as a text file, “ALTI-718WO Seq List_ST25.txt,” created on Jun. 26, 2019 and having a size of 448 KB. The contents of the text file are incorporated by reference herein in their entirety.


INTRODUCTION

Genome editing and gene regulation techniques include the use of nucleic acid binding domains linked to a functional domain. Provided herein are polypeptides and methods for genome editing and gene regulation, wherein the nucleic acid binding domain is derived from DNA binding proteins from bacteria from the genus of Ralstonia or from Xanthomonas.


SUMMARY

In various aspects, the present disclosure provides a polypeptide comprising a modular nucleic acid binding domain comprising a potency for a target site greater than 65% and a specificity ratio for the target site of at least 50:1; and a functional domain; wherein: the modular nucleic acid binding domain comprises a plurality of repeat units; at least one repeat unit of the plurality of repeat units comprises a binding region configured to bind to a target nucleic acid base within the target site; the potency comprises indel percentage at the target site, and wherein the specificity ratio comprises indel percentage at the target site over indel percentage at a top-ranked off-target site of the polypeptide.
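The two metrics defined above can be sketched in a few lines of Python. This is an illustrative sketch only: the class and function names are hypothetical, the indel percentages in the example are made-up numbers rather than data from this disclosure, and treating an off-target indel percentage of zero as an unbounded ratio is an assumption of the sketch.

```python
from dataclasses import dataclass


@dataclass
class EditingResult:
    """Indel percentages on a 0-100 scale (hypothetical container)."""
    on_target_indel_pct: float
    top_off_target_indel_pct: float


def potency(result: EditingResult) -> float:
    """Potency: indel percentage at the target site."""
    return result.on_target_indel_pct


def specificity_ratio(result: EditingResult) -> float:
    """Indel % at the target site over indel % at the top-ranked off-target site."""
    if result.top_off_target_indel_pct == 0:
        # Assumption of this sketch: no detectable off-target indels
        # is treated as an unbounded ratio.
        return float("inf")
    return result.on_target_indel_pct / result.top_off_target_indel_pct


def meets_thresholds(result: EditingResult,
                     min_potency: float = 65.0,
                     min_ratio: float = 50.0) -> bool:
    """True when potency is greater than 65% and the ratio is at least 50:1."""
    return potency(result) > min_potency and specificity_ratio(result) >= min_ratio


# Made-up example values, for illustration only.
example = EditingResult(on_target_indel_pct=80.0, top_off_target_indel_pct=1.0)
print(specificity_ratio(example), meets_thresholds(example))
```

Note that the potency threshold is strict (greater than 65%), while the specificity threshold is inclusive (at least 50:1).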


In some aspects, the at least one repeat unit comprises a sequence of A1-11X1X2B14-35, wherein: each amino acid residue of A1-11 comprises any amino acid residue; X1X2 comprises the binding region; each amino acid residue of B14-35 comprises any amino acid; and a first repeat unit of the plurality of repeat units comprises at least one residue in A1-11, B14-35, or a combination thereof that differs from a corresponding residue in a second repeat unit of the plurality of repeat units.


In various aspects, the present disclosure provides a polypeptide comprising a modular nucleic acid binding domain and a functional domain, wherein: the modular nucleic acid binding domain comprises a plurality of repeat units; at least one repeat unit of the plurality comprises a sequence of A1-11X1X2B14-35; each amino acid residue of A1-11 comprises any amino acid residue; X1X2 comprises a binding region configured to bind to a target nucleic acid base within a target site; each amino acid residue of B14-35 comprises any amino acid; and a first repeat unit of the plurality of repeat units comprises at least one residue in A1-11, B14-35, or a combination thereof that differs from a corresponding residue in a second repeat unit of the plurality of repeat units.


In some aspects, the binding region comprises an amino acid residue at position 13 or an amino acid residue at position 12 and the amino acid residue at position 13. In further aspects, the amino acid residue at position 13 binds to the target nucleic acid base. In some aspects, the amino acid residue at position 12 stabilizes the configuration of the binding region.
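As a minimal sketch of the A1-11X1X2B14-35 layout, the following Python splits a repeat unit into its three regions by position. The function and type names are hypothetical, and the sketch assumes a 35-residue repeat unit written in one-letter amino acid code; the example sequence is SEQ ID NO: 209 as listed in TABLE 1.

```python
from typing import NamedTuple


class RepeatUnitParts(NamedTuple):
    a_region: str        # positions 1-11 (A1-11)
    binding_region: str  # positions 12-13 (X1X2)
    b_region: str        # positions 14-35 (B14-35)


def split_repeat_unit(sequence: str) -> RepeatUnitParts:
    """Split a repeat unit into A1-11, X1X2, and B14-35 by 1-based position."""
    if len(sequence) < 35:
        raise ValueError("this sketch assumes a repeat unit of at least 35 residues")
    # Python slices are 0-based and end-exclusive, so positions 12-13 are [11:13].
    return RepeatUnitParts(sequence[:11], sequence[11:13], sequence[13:35])


parts = split_repeat_unit("LSTEQVVAIASHDGGKQALEAVGAQLVALRAAPYA")  # SEQ ID NO: 209
print(parts.a_region, parts.binding_region, parts.b_region)
```

For this unit, the binding region at positions 12 and 13 is HD.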


In some aspects, the modular nucleic acid binding domain further comprises a potency for the target site greater than 65% and a specificity ratio for the target site of at least 50:1, wherein the potency comprises indel percentage at the target site and the specificity ratio comprises indel percentage at the target site over indel percentage at a top-ranked off-target site of the polypeptide. In further aspects, the indel percentage is measured by deep sequencing. In some aspects, the modular nucleic acid binding domain further comprises one or more properties selected from the following: (a) binds the target site, wherein the target site comprises a 5′ guanine; (b) comprises from 7 repeat units to 25 repeat units; (c) upon binding to the target site, the modular nucleic acid binding domain is separated from a second modular nucleic acid binding domain bound to a second target site by from 2 to 50 base pairs.


In some aspects, the modular nucleic acid binding domain comprises a Ralstonia repeat unit. In further aspects, the Ralstonia repeat unit is a Ralstonia solanacearum repeat unit. In still further aspects, the B14-35 of at least one repeat unit of the plurality of repeat units has at least 92%, at least 93%, at least 94%, at least 95%, at least 96%, at least 97%, at least 98%, at least 99%, or a 100% sequence identity to GGKQALEAVRAQLLDLRAAPYG (SEQ ID NO: 280).
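One simple way to check a B14-35 region against SEQ ID NO: 280 is a position-by-position comparison. The sketch below is an assumption-laden illustration: it computes ungapped identity between two equal-length sequences, whereas sequence identity in practice is usually computed from an alignment, and the function name is hypothetical. The two repeat units are SEQ ID NO: 203 and SEQ ID NO: 209 from TABLE 1.

```python
def percent_identity(a: str, b: str) -> float:
    """Ungapped percent identity between two equal-length sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("this sketch requires non-empty sequences of equal length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


SEQ_ID_NO_280 = "GGKQALEAVRAQLLDLRAAPYG"

# B14-35 is positions 14-35, i.e. the 0-based slice [13:35].
b_203 = "LSTAQVVAIATRSGGKQALEAVRAQLLDLRAAPYG"[13:35]  # SEQ ID NO: 203
b_209 = "LSTEQVVAIASHDGGKQALEAVGAQLVALRAAPYA"[13:35]  # SEQ ID NO: 209

print(percent_identity(b_203, SEQ_ID_NO_280))
print(round(percent_identity(b_209, SEQ_ID_NO_280), 1))
```

Because B14-35 is 22 residues long, a single mismatch gives 21/22 (about 95.5%) and two mismatches give 20/22 (about 90.9%), so under this ungapped measure the "at least 92%" threshold tolerates at most one differing residue.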


In some aspects, the binding region comprises HD binding to cytosine, NG binding to thymidine, NK binding to guanine, SI binding to adenosine, RS binding to adenosine, HN binding to guanine, or NT binding to adenosine. In some aspects, the at least one repeat unit comprises any one of SEQ ID NO: 267-SEQ ID NO: 279.


In further aspects, the at least one repeat unit comprises at least 80%, at least 85%, at least 90%, at least 95%, at least 96%, at least 97%, at least 98%, at least 99%, or a 100% sequence identity with any one of SEQ ID NO: 168-SEQ ID NO: 263. In further aspects, the at least one repeat unit comprises at least 80%, at least 85%, at least 90%, at least 95%, at least 96%, at least 97%, at least 98%, at least 99%, or a 100% sequence identity with SEQ ID NO: 209, SEQ ID NO: 197, SEQ ID NO: 233, SEQ ID NO: 253, SEQ ID NO: 203, or SEQ ID NO: 218. In some aspects, the at least one repeat unit comprises any one of SEQ ID NO: 168-SEQ ID NO: 263. In further aspects, the at least one repeat unit comprises SEQ ID NO: 209, SEQ ID NO: 197, SEQ ID NO: 233, SEQ ID NO: 253, SEQ ID NO: 203, or SEQ ID NO: 218.


In some aspects, the target nucleic acid base is cytosine, guanine, thymidine, adenosine, uracil or a combination thereof. In some aspects, the target site is a nucleic acid sequence within a PDCD1 gene, a CTLA4 gene, a LAG3 gene, a TET2 gene, a BTLA gene, a HAVCR2 gene, a CCR5 gene, a CXCR4 gene, a TRA gene, a TRB gene, a B2M gene, an albumin gene, a HBB gene, a HBA1 gene, a TTR gene, a NR3C1 gene, a CD52 gene, an erythroid specific enhancer of the BCL11A gene, a CBLB gene, a TGFBR1 gene, a SERPINA1 gene, a HBV genomic DNA in infected cells, a CEP290 gene, a DMD gene, a CFTR gene, an IL2RG gene, or a combination thereof.


In other aspects, a nucleic acid sequence encoding a chimeric antigen receptor (CAR), alpha-L iduronidase (IDUA), iduronate-2-sulfatase (IDS), or Factor 9 (F9), is inserted at the target site.


In some aspects, the modular nucleic acid binding domain comprises an N-terminus amino acid sequence, a C-terminus amino acid sequence, or a combination thereof. In further aspects, the N-terminus amino acid sequence is from Xanthomonas spp., Legionella quateirensis, or Ralstonia solanacearum. In still further aspects, the N-terminus amino acid sequence comprises at least 80%, at least 85%, at least 90%, at least 95%, at least 96%, at least 97%, at least 98%, at least 99%, or a 100% sequence identity to SEQ ID NO: 264, SEQ ID NO: 300, SEQ ID NO: 335, SEQ ID NO: 303, SEQ ID NO: 301, SEQ ID NO: 304, SEQ ID NO: 320, SEQ ID NO: 321, or SEQ ID NO: 322. In still further aspects, the N-terminus amino acid sequence comprises SEQ ID NO: 264, SEQ ID NO: 300, SEQ ID NO: 335, SEQ ID NO: 303, SEQ ID NO: 301, SEQ ID NO: 304, SEQ ID NO: 320, SEQ ID NO: 321, or SEQ ID NO: 322.


In some aspects, the C-terminus amino acid sequence is from Xanthomonas spp., Legionella quateirensis, or Ralstonia solanacearum. In further aspects, the C-terminus amino acid sequence comprises at least 80%, at least 85%, at least 90%, at least 95%, at least 96%, at least 97%, at least 98%, at least 99%, or a 100% sequence identity to SEQ ID NO: 266, SEQ ID NO: 298, or SEQ ID NO: 306. In still further aspects, the C-terminus amino acid sequence comprises SEQ ID NO: 266, SEQ ID NO: 298, or SEQ ID NO: 306. In some aspects, the C-terminus amino acid sequence serves as a linker between the modular nucleic acid binding domain and the cleavage domain.


In some aspects, the modular nucleic acid binding domain comprises a half repeat. In further aspects, the half repeat comprises at least 80%, at least 85%, at least 90%, at least 95%, at least 96%, at least 97%, at least 98%, at least 99%, or a 100% sequence identity to SEQ ID NO: 265, SEQ ID NO: 327-SEQ ID NO: 329, or SEQ ID NO: 290. In further aspects, the half repeat comprises SEQ ID NO: 265, SEQ ID NO: 327-SEQ ID NO: 329, or SEQ ID NO: 290.


In still further aspects, the functional domain is a cleavage domain or a repression domain. In some aspects, the cleavage domain comprises at least 33.3% divergence from SEQ ID NO: 163 and is immunologically orthogonal to SEQ ID NO: 163. In further aspects, the polypeptide comprises one or more of the following characteristics: (a) induces greater than 1% indels at a target site; (b) the cleavage domain comprises a molecular weight of less than 23 kDa; (c) the cleavage domain comprises less than 196 amino acids; (d) capable of cleaving across a spacer region greater than 24 base pairs.


In some aspects, the polypeptide induces greater than 5%, greater than 10%, greater than 20%, greater than 30%, greater than 40%, greater than 50%, greater than 60%, greater than 70%, greater than 80%, or greater than 90% indels at the target site. In some aspects, the cleavage domain comprises at least 35%, at least 40%, at least 45%, at least 50%, at least 55%, at least 60%, at least 65%, at least 70%, or at least 75% divergence from SEQ ID NO: 163. In some aspects, the cleavage domain comprises a sequence selected from SEQ ID NO: 316-SEQ ID NO: 319.


In further aspects, the cleavage domain comprises a nucleic acid sequence encoding for a sequence having at least 80% sequence identity with SEQ ID NO: 1-SEQ ID NO: 81. In still further aspects, the cleavage domain comprises a nucleic acid sequence encoding for a sequence selected from SEQ ID NO: 1-SEQ ID NO: 81. In some aspects, the nucleic acid sequence comprises at least 80% sequence identity with SEQ ID NO: 82-SEQ ID NO: 162. In further aspects, the nucleotide sequence encoding for the sequence comprises any one of SEQ ID NO: 82-SEQ ID NO: 162.


In some aspects, the repression domain comprises KRAB, Sin3a, LSD1, SUV39H1, G9A (EHMT2), DNMT1, DNMT3A-DNMT3L, DNMT3B, KOX, TGF-beta-inducible early gene (TIEG), v-erbA, SID, MBD2, MBD3, Rb, or MeCP2.


In some aspects, the at least one repeat unit comprises 1-20 additional amino acid residues at the C-terminus. In some aspects, the at least one repeat unit of the plurality of repeat units is separated from a neighboring repeat unit by a linker. In further aspects, the linker comprises a recognition site. In some aspects, the recognition site is for a small molecule, a protease, or a kinase. In some aspects, the recognition site serves as a localization signal. In some aspects, the plurality of repeat units comprises 3 to 60 repeat units.


In some aspects, a repeat unit of the plurality of repeat units recognizes a target nucleic acid base and wherein the plurality of repeat units has one or more of the following characteristics: (a) at least one repeat unit comprising greater than 39 amino acid residues; (b) at least one repeat unit comprising greater than 35 amino acid residues derived from the genus of Ralstonia; (c) at least one repeat unit comprising less than 32 amino acid residues; and (d) each repeat unit of the plurality of repeat units is separated from a neighboring repeat unit by a linker comprising a recognition site. In some aspects, the at least one repeat unit comprises an amino acid selected from glycine, alanine, threonine or histidine at a position after an amino acid residue at position 35. In some aspects, the at least one repeat unit comprises an amino acid selected from glycine, alanine, threonine or histidine at a position after an amino acid residue at position 39.


Also provided herein is a non-naturally occurring DNA binding polypeptide that includes from N- to C-terminus: an N-terminus region comprising at least residues N+110 to N+1 of a TALE protein, where the N-terminus region does not include residues N+288 to N+116 of the TALE protein; a plurality of TALE repeat units derived from a TALE protein; and a C-terminus region of a TALE protein. The N-terminus region may not include at least amino acids N+288 to N+116 of the TALE protein. The N-terminus region may not include amino acids N+288 to up to N+116 of the TALE protein. The N-terminus region may not include at least amino acids N+288 to up to N+111 of the TALE protein. The N-terminus region may include residues N+1 to up to N+115 of the TALE protein. The N-terminus region may include residues N+1 to up to N+110 of the TALE protein. The C-terminus region may include the full-length C-terminus region of a TALE protein or a fragment thereof, e.g., residues C+1 to C+63 of the TALE protein. The DNA binding polypeptide may be fused to a heterologous functional domain, such as an enzyme, a transcriptional activator, a transcriptional repressor, or a DNA nucleotide modifier. The N-terminus region, the TALE repeat units, and the C-terminus region may be derived from the same TALE protein or from different TALE proteins. The TALE proteins from which the N-terminus region, the TALE repeat units, and the C-terminus region may be derived include Xanthomonas TALE proteins, such as AvrBs3, AVRHAH1, AvrXa7, AVRB6, or AvrXa10.


In various aspects, the present disclosure provides a method of genome editing, the method comprising: administering any of the above polypeptides or compositions thereof and inducing a double stranded break.


In various aspects, the present disclosure provides a method of gene repression, the method comprising administering any of the above polypeptides or compositions thereof and repressing gene expression.


INCORPORATION BY REFERENCE

All publications, patents, and patent applications mentioned in this specification are herein incorporated by reference to the same extent as if each individual publication, patent, or patent application was specifically and individually indicated to be incorporated by reference.





BRIEF DESCRIPTION OF DRAWINGS


FIGS. 1A-1C show schematics of the domain structure of DNA binding proteins (not drawn to scale).



FIG. 2 shows nuclease activity mediated by DNA binding protein dimers that each include from N-terminus to C-terminus: a N-terminus region of a TALE protein, TALE repeat units, C-terminus region of a TALE protein, and a Fok1 endonuclease.





DETAILED DESCRIPTION

The present disclosure provides modular nucleic acid binding domains (NBDs) derived from bacteria. For example, in some embodiments, the present disclosure provides NBDs derived from bacteria that serve as plant pathogens, such as from the genera Xanthomonas and Ralstonia. In particular embodiments, the present disclosure provides NBDs from the genus Ralstonia. Also provided herein are NBDs from the animal pathogen Legionella. Provided herein are sequences of repeat units derived from the genus Ralstonia, which can be linked together to form non-naturally occurring modular nucleic acid binding domains (NBDs) capable of targeting and binding any target nucleic acid sequence (e.g., DNA sequence).


In some embodiments, “derived” indicates that a protein is from a particular source (e.g., Ralstonia), is a variant of a protein from a particular source (e.g., Ralstonia), is a mutated or modified form of the protein from a particular source (e.g., Ralstonia), and shares at least 30% sequence identity with, at least 40% sequence identity with, at least 50% sequence identity with, at least 60% sequence identity with, at least 70% sequence identity with, at least 80% sequence identity with, or at least 90% sequence identity with a protein from a particular source (e.g., Ralstonia).


In some embodiments, “modular” indicates that a particular polypeptide such as a nucleic acid binding domain, comprises a plurality of repeat units that can be switched and replaced with other repeat units. For example, any repeat unit in a modular nucleic acid binding domain can be switched with a different repeat unit. In some embodiments, modularity of the nucleic acid binding domains disclosed herein allows for switching the target nucleic acid base for a particular repeat unit by simply switching it out for another repeat unit. In some embodiments, modularity of the nucleic acid binding domains disclosed herein allows for swapping out a particular repeat unit for another repeat unit to increase the affinity of the repeat unit for a particular target nucleic acid. Overall, the modular nature of the nucleic acid binding domains disclosed herein enables the development of genome editing complexes that can precisely target any nucleic acid sequence of interest.


In particular embodiments, modular nucleic acid binding domains (NBDs), also referred to herein as “DNA binding polypeptides,” are provided herein from Ralstonia solanacearum. In some embodiments, modular nucleic acid binding domains derived from Ralstonia (RNBDs) can be engineered to bind to a target gene of interest for purposes of gene editing or gene regulation. An RNBD can be engineered to target and bind a specific nucleic acid sequence. The nucleic acid sequence can be DNA or RNA.


In some embodiments, the RNBD can comprise a plurality of repeat units, wherein each repeat unit recognizes and binds to a single nucleotide (in DNA or RNA) or base pair. Each repeat unit in the plurality of repeat units can be specifically selected to target and bind to a specific nucleic acid sequence, thus contributing to the modular nature of the DNA binding polypeptide. A non-naturally occurring Ralstonia-derived modular nucleic acid binding domain can comprise a plurality of repeat units, wherein each repeat unit of the plurality of repeat units recognizes a single target nucleotide, base pair, or both.



Ralstonia-Derived DNA Binding Domains

In some embodiments, the repeat unit of a modular nucleic acid binding domain can be derived from a bacterial protein. For example, the bacterial protein can be a transcription activator like effector-like protein (TALE-like protein). The bacterial protein can be derived from Ralstonia solanacearum. Repeat units derived from Ralstonia solanacearum can be 33-35 amino acid residues in length. In some embodiments, the repeat unit can be derived from the naturally occurring Ralstonia solanacearum TALE-like protein.


TABLE 1 below shows exemplary repeat units derived from the genus of Ralstonia, which are capable of binding a target nucleic acid.

TABLE 1
Exemplary Ralstonia-derived Repeat Units

SEQ ID NO        Sequence
SEQ ID NO: 168   LDTEQVVAIASHNGGKQALEAVKADLLDLLGAPYV
SEQ ID NO: 169   LDTEQVVAIASHNGGKQALEAVKADLLDLRGAPYA
SEQ ID NO: 170   LDTEQVVAIASHNGGKQALEAVKADLLELRGAPYA
SEQ ID NO: 171   LDTEQVVAIASHNGGKQALEAVKAHLLDLRGAPYA
SEQ ID NO: 172   LNTEQVVAIASHNGGKQALEAVKADLLDLRGAPYA
SEQ ID NO: 173   LNTEQVVAIASNNGGKQALEAVKTHLLDLRGARYA
SEQ ID NO: 174   LNTEQVVAIASNPGGKQALEAVRALFPDLRAAPYA
SEQ ID NO: 175   LNTEQVVAIASSHGGKQALEAVRALFPDLRAAPYA
SEQ ID NO: 176   LNTEQVVAVASNKGGKQALEAVGAQLLALRAVPYA
SEQ ID NO: 177   LNTEQVVAVASNKGGKQALEAVGAQLLALRAVPYE
SEQ ID NO: 178   LSAAQVVAIASHDGGKQALEAVGTQLVALRAAPYA
SEQ ID NO: 179   LSIAQVVAVASRSGGKQALEAVRAQLLALRAAPYG
SEQ ID NO: 180   LSPEQVVAIASNHGGKQALEAVRALFRGLRAAPYG
SEQ ID NO: 181   LSPEQVVAIASNNGGKQALEAVKAQLLELRAAPYE
SEQ ID NO: 182   LSTAQLVAIASNPGGKQALEAIRALFRELRAAPYA
SEQ ID NO: 183   LSTAQLVAIASNPGGKQALEAVRALFRELRAAPYA
SEQ ID NO: 184   LSTAQLVAIASNPGGKQALEAVRAPFREVRAAPYA
SEQ ID NO: 185   LSTAQLVSIASNPGGKQALEAVRALFRELRAAPYA
SEQ ID NO: 186   LSTAQVAAIASHDGGKQALEAVGTQLVVLRAAPYA
SEQ ID NO: 187   LSTAQVATIASSIGGRQALEALKVQLPVLRAAPYG
SEQ ID NO: 188   LSTAQVATIASSIGGRQALEAVKVQLPVLRAAPYG
SEQ ID NO: 189   LSTAQVVAIAANNGGKQALEAVRALLPVLRVAPYE
SEQ ID NO: 190   LSTAQVVAIAGNGGGKQALEGIGEQLLKLRTAPYG
SEQ ID NO: 191   LSTAQVVAIASHDGGKQALEAAGTQLVALRAAPYA
SEQ ID NO: 192   LSTAQVVAIASHDGGKQALEAVGAQLVELRAAPYA
SEQ ID NO: 193   LSTAQVVAIASHDGGKQALEAVGTQLVALRAAPYA
SEQ ID NO: 194   LSTAQVVAIASHDGGNQALEAVGTQLVALRAAPYA
SEQ ID NO: 195   LSTAQVVAIASHNGGKQALEAVKAQLLDLRGAPYA
SEQ ID NO: 196   LSTAQVVAIASNDGGKQALEEVEAQLLALRAAPYE
SEQ ID NO: 197   LSTAQVVAIASNGGGKQALEGIGEQLLKLRTAPYG
SEQ ID NO: 198   LSTAQVVAIASNGGGKQALEGIGEQLRKLRTAPYG
SEQ ID NO: 199   LSTAQVVAIASNPGGKQALEAVRALFRELRAAPYA
SEQ ID NO: 200   LSTAQVVAIASQNGGKQALEAVKAQLLDLRGAPYA
SEQ ID NO: 201   LSTAQVVAIASSHGGKQALEAVRALFRELRAAPYG
SEQ ID NO: 202   LSTAQVVAIASSNGGKQALEAVWALLPVLRATPYD
SEQ ID NO: 203   LSTAQVVAIATRSGGKQALEAVRAQLLDLRAAPYG
SEQ ID NO: 204   LSTAQVVAVAGRNGGKQALEAVRAQLPALRAAPYG
SEQ ID NO: 205   LSTAQVVAVASSNGGKQALEAVWALLPVLRATPYD
SEQ ID NO: 206   LSTAQVVTIASSNGGKQALEAVWALLPVLRATPYD
SEQ ID NO: 207   LSTEQVVAIAGHDGGKQALEAVGAQLVALRAAPYA
SEQ ID NO: 208   LSTEQVVAIASHDGGKQALEAVGAQLVALLAAPYA
SEQ ID NO: 209   LSTEQVVAIASHDGGKQALEAVGAQLVALRAAPYA
SEQ ID NO: 210   LSTEQVVAIASHDGGKQALEAVGGQLVALRAAPYA
SEQ ID NO: 211   LSTEQVVAIASHDGGKQALEAVGTQLVALRAAPYA
SEQ ID NO: 212   LSTEQVVAIASHDGGKQALEAVGVQLVALRAAPYA
SEQ ID NO: 213   LSTEQVVAIASHDGGKQALEAVVAQLVALRAAPYA
SEQ ID NO: 214   LSTEQVVAIASHDGGKQPLEAVGAQLVALRAAPYA
SEQ ID NO: 215   LSTEQVVAIASHGGGKQVLEGIGEQLLKLRAAPYG
SEQ ID NO: 216   LSTEQVVAIASHKGGKQALEGIGEQLLKLRAAPYG
SEQ ID NO: 217   LSTEQVVAIASHNGGKQALEAVKADLLDLRGAPYA
SEQ ID NO: 218   LSTEQVVAIASHNGGKQALEAVKADLLELRGAPYA
SEQ ID NO: 219   LSTEQVVAIASHNGGKQALEAVKAHLLDLRGAPYA
SEQ ID NO: 220   LSTEQVVAIASHNGGKQALEAVKAHLLDLRGVPYA
SEQ ID NO: 221   LSTEQVVAIASHNGGKQALEAVKAHLLELRGAPYA
SEQ ID NO: 222   LSTEQVVAIASHNGGKQALEAVKAQLLDLRGAPYA
SEQ ID NO: 223   LSTEQVVAIASHNGGKQALEAVKAQLLELRGAPYA
SEQ ID NO: 224   LSTEQVVAIASHNGGKQALEAVKAQLPVLRRAPYG
SEQ ID NO: 225   LSTEQVVAIASHNGGKQALEAVKTQLLELRGAPYA
SEQ ID NO: 226   LSTEQVVAIASHNGGKQALEAVRAQLPALRAAPYG
SEQ ID NO: 227   LSTEQVVAIASHNGSKQALEAVKAQLLDLRGAPYA
SEQ ID NO: 228   LSTEQVVAIASNGGGKQALEGIGKQLQELRAAPHG
SEQ ID NO: 229   LSTEQVVAIASNGGGKQALEGIGKQLQELRAAPYG
SEQ ID NO: 230   LSTEQVVAIASNHGGKQALEAVRALFRELRAAPYA
SEQ ID NO: 231   LSTEQVVAIASNHGGKQALEAVRALFRGLRAAPYG
SEQ ID NO: 232   LSTEQVVAIASNKGGKQALEAVKADLLDLRGAPYV
SEQ ID NO: 233   LSTEQVVAIASNKGGKQALEAVKAHLLDLLGAPYV
SEQ ID NO: 234   LSTEQVVAIASNKGGKQALEAVKAQLLALRAAPYA
SEQ ID NO: 235   LSTEQVVAIASNKGGKQALEAVKAQLLELRGAPYA
SEQ ID NO: 236   LSTEQVVAIASNNGGKQALEAVKALLLELRAAPYE
SEQ ID NO: 237   LSTEQVVAIASNNGGKQALEAVKAQLLALRAAPYE
SEQ ID NO: 238   LSTEQVVAIASNNGGKQALEAVKAQLLDLRGAPYA
SEQ ID NO: 239   LSTEQVVAIASNNGGKQALEAVKAQLLVLRAAPYG
SEQ ID NO: 240   LSTEQVVAIASNNGGKQALEAVKAQLPALRAAPYE
SEQ ID NO: 241   LSTEQVVAIASNNGGKQALEAVKAQLPVLRRAPCG
SEQ ID NO: 242   LSTEQVVAIASNNGGKQALEAVKAQLPVLRRAPYG
SEQ ID NO: 243   LSTEQVVAIASNNGGKQALEAVKARLLDLRGAPYA
SEQ ID NO: 244   LSTEQVVAIASNNGGKQALEAVKTQLLALRTAPYE
SEQ ID NO: 245   LSTEQVVAIASNPGGKQALEAVRALFPDLRAAPYA
SEQ ID NO: 246   LSTEQVVAIASSHGGKQALEAVRALFPDLRAAPYA
SEQ ID NO: 247   LSTEQVVAIASSHGGKQALEAVRALLPVLRATPYD
SEQ ID NO: 248   LSTEQVVAVASHNGGKQALEAVRAQLLDLRAAPYE
SEQ ID NO: 249   LSTEQVVAVASNKGGKQALAAVEAQLLRLRAAPYE
SEQ ID NO: 250   LSTEQVVAVASNKGGKQALEEVEAQLLRLRAAPYE
SEQ ID NO: 251   LSTEQVVAVASNKGGKQVLEAVGAQLLALRAVPYE
SEQ ID NO: 252   LSTEQVVAVASNNGGKQALKAVKAQLLALRAAPYE
SEQ ID NO: 253   LSTEQVVVIANSIGGKQALEAVKVQLPVLRAAPYE
SEQ ID NO: 254   LSTGQVVAIASNGGGRQALEAVREQLLALRAVPYE
SEQ ID NO: 255   LSPEQVVTIASNNGGKQALEAVRAQLLALRAAPYG
SEQ ID NO: 256   LTIAQVVAVASHNGGKQALEAIGAQLLALRAAPYA
SEQ ID NO: 257   LTIAQVVAVASHNGGKQALEVIGAQLLALRAAPYA
SEQ ID NO: 258   LTPQQVVAIAANTGGKQALGAITTQLPILRAAPYE
SEQ ID NO: 259   LTPQQVVAIASNTGGKQALEAVTVQLRVLRGARYG
SEQ ID NO: 260   LTPQQVVAIASNTGGKRALEAVCVQLPVLRAAPYR
SEQ ID NO: 261   LTPQQVVAIASNTGGKRALEAVRVQLPVLRAAPYE
SEQ ID NO: 262   LTTAQVVAIASNDGGKQALEAVGAQLLVLRAVPYE
SEQ ID NO: 263   LTTAQVVAIASNDGGKQTLEVAGAQLLALRAVPYE
SEQ ID NO: 336   LSTAQVVAVASGSGGKPALEAVRAQLLALRAAPYG
SEQ ID NO: 337   LSTAQVVAVASGSGGKPALEAVRAQLLALRAAPYG
SEQ ID NO: 338   LNTAQIVAIASHDGGKPALEAVWAKLPVLRGAPYA
SEQ ID NO: 339   LNTAQVVAIASHDGGKPALEAVRAKLPVLRGVPYA
SEQ ID NO: 340   LNTAQVVAIASHDGGKPALEAVWAKLPVLRGVPYA
SEQ ID NO: 341   LNTAQVVAIASHDGGKPALEAVWAKLPVLRGVPYE
SEQ ID NO: 342   LSTAQVVAIASHDGGKPALEAVWAKLPVLRGAPYA
SEQ ID NO: 343   LSTAQVVAVASHDGGKPALEAVRKQLPVLRGVPHQ
SEQ ID NO: 344   LSTAQVVAVASHDGGKPALEAVRKQLPVLRGVPHQ
SEQ ID NO: 345   LNTAQVVAIASHDGGKPALEAVWAKLPVLRGVPYA
SEQ ID NO: 346   LSTEQVVAIASHNGGKLALEAVKAHLLDLRGAPYA
SEQ ID NO: 347   LSTEQVVAIASHNGGKPALEAVKAHLLALRAAPYA
SEQ ID NO: 348   LNTAQVVAIASHYGGKPALEAVWAKLPVLRGVPYA
SEQ ID NO: 349   LNTEQVVAIASNNGGKPALEAVKAQLLELRAAPYE
SEQ ID NO: 350   LSPEQVVAIASNNGGKPALEAVKALLLALRAAPYE
SEQ ID NO: 351   LSPEQVVAIASNNGGKPALEAVKAQLLELRAAPYE
SEQ ID NO: 352   LSTEQVVAIASNNGGKPALEAVKALLLALRAAPYE
SEQ ID NO: 353   LSTEQVVAIASNNGGKPALEAVKALLLELRAAPYE
SEQ ID NO: 354   LSPEQVVAIASNNGGKPALEAVKALLLALRAAPYE
SEQ ID NO: 355   LSPEQVVAIASNNGGKPALEAVKAQLLELRAAPYE
SEQ ID NO: 356   LSTEQVVAIASNNGGKPALEAVKALLLELRAAPYE

In some embodiments, an RNBD of the present disclosure can comprise between 1 and 50 Ralstonia solanacearum-derived repeat units. In some embodiments, an RNBD of the present disclosure can comprise between 9 and 36 Ralstonia solanacearum-derived repeat units. Preferably, in some embodiments, an RNBD of the present disclosure can comprise between 12 and 30 Ralstonia solanacearum-derived repeat units. An RNBD described herein can comprise between 5 and 10 Ralstonia solanacearum-derived repeat units, between 10 and 15 Ralstonia solanacearum-derived repeat units, between 15 and 20 Ralstonia solanacearum-derived repeat units, between 20 and 25 Ralstonia solanacearum-derived repeat units, between 25 and 30 Ralstonia solanacearum-derived repeat units, between 30 and 35 Ralstonia solanacearum-derived repeat units, or between 35 and 40 Ralstonia solanacearum-derived repeat units. An RNBD described herein can comprise at least 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, or 40, or more Ralstonia solanacearum-derived repeat units.


A Ralstonia solanacearum-derived repeat unit can be derived from a wild-type repeat unit, such as any one of SEQ ID NO: 168-SEQ ID NO: 263 or SEQ ID NO: 336-SEQ ID NO: 356. A Ralstonia solanacearum-derived repeat unit can have at least 80% sequence identity with any one of SEQ ID NO: 168-SEQ ID NO: 263 or SEQ ID NO: 336-SEQ ID NO: 356. A Ralstonia solanacearum-derived repeat unit can also comprise a modified Ralstonia solanacearum-derived repeat unit enhanced for specific recognition of a nucleotide or base pair. An RNBD described herein can comprise one or more wild-type Ralstonia solanacearum-derived repeat units, one or more modified Ralstonia solanacearum-derived repeat units, or a combination thereof. In some embodiments, a modified Ralstonia solanacearum-derived repeat unit can comprise 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, or 29 mutations that can enhance recognition of a specific nucleotide or base pair. In some embodiments, a modified Ralstonia solanacearum-derived repeat unit can comprise more than 1 modification, for example 1 to 5 modifications, 5 to 10 modifications, 10 to 15 modifications, 15 to 20 modifications, 20 to 25 modifications, or 25 to 29 modifications. In some embodiments, an RNBD can comprise more than one modified Ralstonia solanacearum-derived repeat unit, wherein each of the modified Ralstonia solanacearum-derived repeat units can have a different number of modifications.


The Ralstonia solanacearum-derived repeat units comprise amino acid residues at positions 12 and 13, referred to herein as a repeat variable diresidue (RVD). The RVD can modulate binding affinity of the repeat unit for a particular nucleic acid base (e.g., adenosine, guanine, cytosine, thymidine, or uracil (in RNA sequences)). In some embodiments, a single amino acid residue can modulate binding to the target nucleic acid base. In some embodiments, two amino acid residues (RVD) can modulate binding to the target nucleic acid base. In some embodiments, any repeat unit disclosed herein can have an RVD selected from HD, HG, HK, HN, ND, NG, NH, NK, NN, NP, NT, QN, RN, RS, SH, SI, or SN. In some embodiments, an RVD of HD can bind to cytosine. In some embodiments, an RVD of NG can bind to thymidine. In some embodiments, an RVD of NK can bind to guanine. In some embodiments, an RVD of SI can bind to adenosine. In some embodiments, an RVD of RS can bind to adenosine. In some embodiments, an RVD of HN can bind to guanine. In some embodiments, an RVD of NT can bind to adenosine.
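The RVD-to-base pairings listed above can be written as a lookup table. The following Python is a minimal sketch, assuming 35-residue repeat units in one-letter code with the RVD at positions 12 and 13; the names are hypothetical, and reporting "N" for an RVD without a stated pairing is a convention of this sketch, not of the disclosure. The example units are SEQ ID NO: 209, 197, 233, and 253 from TABLE 1.

```python
# RVD-to-base pairings stated in the text (C, T, G, A in one-letter code).
RVD_TO_BASE = {
    "HD": "C",
    "NG": "T",
    "NK": "G",
    "SI": "A",
    "RS": "A",
    "HN": "G",
    "NT": "A",
}


def rvd_of(repeat_unit: str) -> str:
    """Return the residues at positions 12 and 13 (0-based slice [11:13])."""
    return repeat_unit[11:13]


def predicted_target(repeat_units) -> str:
    """Read off one base per repeat unit; 'N' marks an RVD with no listed pairing."""
    return "".join(RVD_TO_BASE.get(rvd_of(unit), "N") for unit in repeat_units)


units = [
    "LSTEQVVAIASHDGGKQALEAVGAQLVALRAAPYA",  # SEQ ID NO: 209, RVD HD
    "LSTAQVVAIASNGGGKQALEGIGEQLLKLRTAPYG",  # SEQ ID NO: 197, RVD NG
    "LSTEQVVAIASNKGGKQALEAVKAHLLDLLGAPYV",  # SEQ ID NO: 233, RVD NK
    "LSTEQVVVIANSIGGKQALEAVKVQLPVLRAAPYE",  # SEQ ID NO: 253, RVD SI
]
print(predicted_target(units))
```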


In some embodiments, a repeat unit having at least 80% sequence identity with SEQ ID NO: 209 can be included in a DNA binding domain of the present disclosure to bind to cytosine. In some embodiments, a repeat unit having at least 80% sequence identity with SEQ ID NO: 197 can be included in a DNA binding domain of the present disclosure to bind to thymidine. In some embodiments, a repeat unit having at least 80% sequence identity with SEQ ID NO: 233 can be included in a DNA binding domain of the present disclosure to bind to guanine. In some embodiments, a repeat unit having at least 80% sequence identity with SEQ ID NO: 253 can be included in a DNA binding domain of the present disclosure to bind to adenosine. In some embodiments, a repeat unit having at least 80% sequence identity with SEQ ID NO: 203 can be included in a DNA binding domain of the present disclosure to bind to adenosine. In some embodiments, a repeat unit having at least 80% sequence identity with SEQ ID NO: 218 can be included in a DNA binding domain of the present disclosure to bind to guanine. In some embodiments, the repeat unit of SEQ ID NO: 209 can be included in a DNA binding domain of the present disclosure to bind to cytosine. In some embodiments, the repeat unit of SEQ ID NO: 197 can be included in a DNA binding domain of the present disclosure to bind to thymidine. In some embodiments, the repeat unit of SEQ ID NO: 233 can be included in a DNA binding domain of the present disclosure to bind to guanine. In some embodiments, the repeat unit of SEQ ID NO: 253 can be included in a DNA binding domain of the present disclosure to bind to adenosine. In some embodiments, the repeat unit of SEQ ID NO: 203 can be included in a DNA binding domain of the present disclosure to bind to adenosine. In some embodiments, the repeat unit of SEQ ID NO: 218 can be included in a DNA binding domain of the present disclosure to bind to guanine.
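Read in the other direction, the pairings in this paragraph suggest how a repeat array might be chosen for a given target sequence. The sketch below is a hedged illustration with hypothetical names: it picks one exemplary repeat unit per base and concatenates them, and it deliberately ignores the N-terminus, half repeat, C-terminus, and functional domain that a full polypeptide would also need. Where the text names two units for a base (SEQ ID NO: 233 or 218 for guanine, SEQ ID NO: 253 or 203 for adenosine), taking the first is an arbitrary choice of the sketch.

```python
from typing import List

# Exemplary repeat units from TABLE 1, keyed by SEQ ID NO.
REPEAT_UNITS = {
    209: "LSTEQVVAIASHDGGKQALEAVGAQLVALRAAPYA",  # binds cytosine
    197: "LSTAQVVAIASNGGGKQALEGIGEQLLKLRTAPYG",  # binds thymidine
    233: "LSTEQVVAIASNKGGKQALEAVKAHLLDLLGAPYV",  # binds guanine
    253: "LSTEQVVVIANSIGGKQALEAVKVQLPVLRAAPYE",  # binds adenosine
    203: "LSTAQVVAIATRSGGKQALEAVRAQLLDLRAAPYG",  # binds adenosine
    218: "LSTEQVVAIASHNGGKQALEAVKADLLELRGAPYA",  # binds guanine
}

# One unit per base; the G and A picks are arbitrary choices for this sketch.
BASE_TO_SEQ_ID = {"C": 209, "T": 197, "G": 233, "A": 253}


def design_repeat_array(target: str) -> List[int]:
    """Return one SEQ ID NO per base of a DNA target, in order."""
    seq_ids = []
    for base in target.upper():
        if base not in BASE_TO_SEQ_ID:
            raise ValueError(f"no exemplary repeat unit listed for base {base!r}")
        seq_ids.append(BASE_TO_SEQ_ID[base])
    return seq_ids


def repeat_array_sequence(seq_ids: List[int]) -> str:
    """Concatenate the chosen repeat units into one amino acid sequence."""
    return "".join(REPEAT_UNITS[i] for i in seq_ids)


ids = design_repeat_array("GATC")
print(ids, len(repeat_array_sequence(ids)))
```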


In some embodiments, the present disclosure provides repeat units as set forth in SEQ ID NO: 267-SEQ ID NO: 279. Unspecified amino acid residues in SEQ ID NO: 267-SEQ ID NO: 279 can be any amino acid residues. In particular embodiments, unspecified amino acid residues in SEQ ID NO: 267-SEQ ID NO: 279 can be those set forth in the Variable Definition column of TABLE 2.


TABLE 2 shows consensus sequences of Ralstonia-derived repeat units.









TABLE 2
Consensus Sequences of Ralstonia-derived Repeat Units

RVD	Consensus Sequence	Variable Definition
HN	LX1X2X3QVVX4X5ASHNGX6KQALEX7X8X9X10X11LX12X13LX14X15X16PYX17 (SEQ ID NO: 267)	X1: D|N|S|T, X2: I|T|V, X3: A|E, X4: A|T, X5: I|V, X6: G|S, X7: A|V, X8: I|V, X9: G|K|R, X10: A|T, X11: D|H|Q, X12: L|P, X13: A|D|E|V, X14: L|R, X15: A|G|R, X16: A|V, X17: A|E|G|V
NN	LX1X2X3QVVAX4AX5NNGGKQALX6AVX7X8X9LX10X11LRX12AX13X14X15 (SEQ ID NO: 268)	X1: N|S, X2: P|T, X3: A|E, X4: I|V, X5: A|S, X6: E|K, X7: K|R, X8: A|T, X9: H|L|Q|R, X10: L|P, X11: A|D|E|V, X12: A|G|R|T|V, X13: P|R, X14: C|Y, X15: A|E|G
NP	LX1TX2QX3VX4IASNPGGKQALEAX5RAX6FX7X8X9RAAPYA (SEQ ID NO: 269)	X1: N|S, X2: A|E, X3: L|V, X4: A|S, X5: I|V, X6: L|P, X7: P|R, X8: D|E, X9: L|V
SH	LX1TX2QVVAIASSHGGKQALEAVRALX3X4X5LRAX6PYX7 (SEQ ID NO: 270)	X1: N|S, X2: A|E, X3: F|L, X4: P|R, X5: D|E|V, X6: A|T, X7: A|D|G
NK	LX1TEQVVAX2ASNKGGKQX3LX4X5VX6AX7LLX8LX9X10X11PYX12 (SEQ ID NO: 271)	X1: N|S, X2: I|V, X3: A|V, X4: A|E, X5: A|E, X6: E|G|K, X7: D|H|Q, X8: A|D|E|R, X9: L|R, X10: A|G, X11: A|V, X12: A|E|V
HD	LSX1X2QVX3AIAX4HDGGX5QX6LEAX7X8X9QLVX10LX11AAPYA (SEQ ID NO: 272)	X1: A|T, X2: A|E, X3: A|V, X4: G|S, X5: K|N, X6: A|P, X7: A|V, X8: G|V, X9: A|G|T|V, X10: A|E|V, X11: L|R
RS	LSX1AQVVAX2AX3RSGGKQALEAVRAQLLX4LRAAPYG (SEQ ID NO: 273)	X1: I|T, X2: I|V, X3: S|T, X4: A|D
NH	LSX1EQVVAIASNHGGKQALEAVRALFRX2LRAAPYX (SEQ ID NO: 274)	X1: P|T, X2: E|G, X3: A|G
SI	LSTX1QVX2X3IAX4SIGGX5QALEAX6KVQLPVLRAAPYX7 (SEQ ID NO: 275)	X1: A|E, X2: A|V, X3: T|V, X4: N|S, X5: K|R, X6: L|V, X7: E|G
ND	LX1TAQVVAIASNDGGKQX2LEX3X4X5AQLLX6LRAX7PYE (SEQ ID NO: 276)	X1: S|T, X2: A|T, X3: A|E|V, X4: A|V, X5: E|G, X6: A|V, X7: A|V
SN	LSTAQVVX1X2ASSNGGKQALEAVWALLPVLRATPYD (SEQ ID NO: 277)	X1: A|T, X2: I|V
NG	LSTX1QVVAIAX2NGGGX3QALEX4X5X6X7QLX8X9LRX10X11PX12X13 (SEQ ID NO: 278)	X1: A|E|G, X2: G|S, X3: K|R, X4: A|G, X5: I|V, X6: G|R, X7: E|K, X8: L|Q|R, X9: A|E|K, X10: A|T, X11: A|V, X12: H|Y, X13: E|G
NT	LTPQQVVAIAX1NTGGKX2ALX3AX4X5X6QLX7X8LRX9AX10YX11 (SEQ ID NO: 279)	X1: A|S, X2: Q|R, X3: E|G, X4: I|V, X5: C|R|T, X6: T|V, X7: P|R, X8: I|V, X9: A|G, X10: P|R, X11: E|G|R
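As a non-limiting illustration of how a consensus sequence and its Variable Definition entry in TABLE 2 can be read together, the sketch below compiles one row into a regular expression and tests a candidate repeat unit against it. The function name consensus_to_regex and the constants are hypothetical. The example repeat unit is one that is consistent with the RS consensus (SEQ ID NO: 273) and with SEQ ID NO: 280 at positions 14 through 35. Placeholders with no definition are assumed here to match any residue.

```python
import re


def consensus_to_regex(consensus, variables):
    """Compile a consensus string with X<n> placeholders into a regex.

    `variables` maps a placeholder such as "X1" to a string of allowed
    one-letter residues, e.g. {"X1": "IT"} for the entry "X1: I|T".
    A placeholder without an entry (including a bare "X") matches any
    single residue. Characters other than A-Z are ignored.
    """
    parts = []
    for token in re.findall(r"X\d*|[A-WYZ]", consensus):
        if token.startswith("X"):
            allowed = variables.get(token)
            parts.append(f"[{allowed}]" if allowed else ".")
        else:
            parts.append(token)
    return re.compile("".join(parts))


# RS row of TABLE 2 (SEQ ID NO: 273).
RS_CONSENSUS = "LSX1AQVVAX2AX3RSGGKQALEAVRAQLLX4LRAAPYG"
RS_VARIABLES = {"X1": "IT", "X2": "IV", "X3": "ST", "X4": "AD"}

rs_pattern = consensus_to_regex(RS_CONSENSUS, RS_VARIABLES)
is_rs_repeat = bool(rs_pattern.fullmatch("LSTAQVVAIATRSGGKQALEAVRAQLLDLRAAPYG"))
```

The use of fullmatch requires the candidate to satisfy the consensus over its entire length, so a repeat unit carrying a residue outside the listed alternatives at any variable position is rejected.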









In some aspects, the at least one repeat unit comprises any one of SEQ ID NO: 267-SEQ ID NO: 279. In some embodiments, the present disclosure provides a modular nucleic acid binding domain (e.g., RNBD or MAP-NBD), wherein the modular nucleic acid binding domain comprises a repeat unit with a sequence of A1-11X1X2B14-35, wherein A1-11 comprises 11 amino acid residues and wherein each amino acid residue of A1-11 can be any amino acid. In some embodiments, A1-11 can be any amino acids in position 1 through position 11 of any one of SEQ ID NO: 168-SEQ ID NO: 263 or SEQ ID NO: 336-SEQ ID NO: 356. X1X2 comprises any repeat variable diresidue (RVD) disclosed herein and comprises at least one amino acid at position 12 or position 13. As described herein, this RVD contacts and binds to a target nucleic acid base of a target site. Said RVD can be the RVD of any repeat unit disclosed herein, such as position 12 and position 13 of any one of SEQ ID NO: 168-SEQ ID NO: 263 or SEQ ID NO: 336-SEQ ID NO: 356. B14-35 can comprise 22 amino acid residues and each amino acid residue of B14-35 can be any amino acid. In some embodiments, B14-35 can be any amino acid in position 14 through position 35 of any one of SEQ ID NO: 168-SEQ ID NO: 263 or SEQ ID NO: 336-SEQ ID NO: 356. In particular embodiments, a modular nucleic acid binding domain (e.g., RNBD or MAP-NBD) having the above sequence of A1-11X1X2B14-35 can have a first repeat unit with at least one residue in A1-11, B14-35, or a combination thereof that differs from a corresponding residue in a second repeat unit in the modular nucleic acid binding domain (e.g., RNBD or MAP-NBD). In other words, at least two repeat units in a modular nucleic acid binding domain (e.g., RNBD or MAP-NBD) described herein can have different amino acid residues with respect to each other, at the same position outside the RVD region. 
Thus, in some embodiments, a modular nucleic acid binding domain (e.g., RNBD or MAP-NBD) described herein can have variant backbones with respect to each repeat unit in the plurality of repeat units that make up the modular nucleic acid binding domain. In some embodiments, an RNBD of the present disclosure can have a sequence of GGKQALEAVRAQLLDLRAAPYG (SEQ ID NO: 280) at B14-35.
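The A1-11X1X2B14-35 layout described above can be sketched as a simple positional split of a repeat unit. The following is illustrative only; the names RepeatParts, split_repeat, and backbones_differ are hypothetical, and the two example repeat units are assumed to be representative RS and HD repeat units consistent with the consensus sequences of TABLE 2.

```python
from typing import NamedTuple


class RepeatParts(NamedTuple):
    a_1_11: str   # positions 1 through 11
    rvd: str      # positions 12 and 13 (X1X2)
    b_14_35: str  # positions 14 through 35


def split_repeat(repeat):
    """Split a repeat unit into A1-11, the RVD, and B14-35."""
    if len(repeat) < 13:
        raise ValueError("repeat unit is too short to contain an RVD")
    return RepeatParts(repeat[:11], repeat[11:13], repeat[13:35])


def backbones_differ(repeat_a, repeat_b):
    """True if two repeat units differ anywhere outside the RVD."""
    a, b = split_repeat(repeat_a), split_repeat(repeat_b)
    return (a.a_1_11, a.b_14_35) != (b.a_1_11, b.b_14_35)


rs_parts = split_repeat("LSTAQVVAIATRSGGKQALEAVRAQLLDLRAAPYG")
hd_parts = split_repeat("LSTEQVVAIASHDGGKQALEAVGAQLVALRAAPYA")
```

Here rs_parts.b_14_35 corresponds to the 22 residues of SEQ ID NO: 280, and backbones_differ returns True for the two example repeat units because they differ at positions outside the RVD.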


In some embodiments, the present disclosure provides a polypeptide comprising a modular nucleic acid binding domain and a functional domain, wherein: the modular nucleic acid binding domain comprises a plurality of repeat units; at least one repeat unit of the plurality comprises a sequence of A1-11X1X2B14-35; each amino acid residue of A1-11 comprises any amino acid residue; X1X2 comprises a binding region configured to bind to a target nucleic acid base within a target site; each amino acid residue of B14-35 comprises any amino acid; and a first repeat unit of the plurality of repeat units comprises at least one residue in A1-11, B14-35, or a combination thereof that differs from a corresponding residue in a second repeat unit of the plurality of repeat units. In some embodiments, the binding region comprises an amino acid residue at position 13 or an amino acid residue at position 12 and the amino acid residue at position 13. In further aspects, the amino acid residue at position 13 binds to the target nucleic acid base. In some aspects, the amino acid residue at position 12 stabilizes the configuration of the binding region.


In some embodiments, the modular nucleic acid binding domain comprises a Ralstonia repeat unit. In further aspects, the Ralstonia repeat unit is a Ralstonia solanacearum repeat unit. In still further aspects, the B14-35 of at least one repeat unit of the plurality of repeat units has at least 92% sequence identity to GGKQALEAVRAQLLDLRAAPYG (SEQ ID NO: 280).
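For a 22-residue region such as B14-35, a threshold of at least 92% sequence identity permits at most one substitution, since 21/22 is approximately 95.5% and 20/22 is approximately 90.9%. The sketch below makes that arithmetic concrete. It assumes a simple ungapped, position-by-position comparison of equal-length sequences, which is an assumption made for illustration; the function name percent_identity and the two variant sequences are hypothetical.

```python
SEQ_ID_NO_280 = "GGKQALEAVRAQLLDLRAAPYG"


def percent_identity(seq_a, seq_b):
    """Ungapped percent identity between two equal-length sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    matches = sum(x == y for x, y in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)


# Hypothetical variants with one and two substitutions, respectively.
one_substitution = "GGKQALEAVRAQLLELRAAPYG"
two_substitutions = "GGKQALEAVKAQLLELRAAPYG"

meets_92_one = percent_identity(SEQ_ID_NO_280, one_substitution) >= 92.0
meets_92_two = percent_identity(SEQ_ID_NO_280, two_substitutions) >= 92.0
```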


In some embodiments, a modular nucleic acid binding domain (e.g., RNBD) can comprise one or more of the following characteristics: the modular nucleic acid binding domain (e.g., RNBD) can bind a target nucleic acid sequence at a target site that comprises a 5′ guanine; the modular nucleic acid binding domain (e.g., RNBD) can comprise 7 repeat units to 25 repeat units; a first modular nucleic acid binding domain (e.g., RNBD) can bind a target nucleic acid sequence and be separated from a second modular nucleic acid binding domain (e.g., RNBD) by 2 to 50 base pairs; or any combination thereof.
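The optional characteristics listed above can be checked for a candidate design as sketched below. This is an illustration under stated assumptions: the function name describe_design is hypothetical, the 5′ guanine is assumed to be the first base of the supplied target site string, and the repeat count and spacing passed in the example are arbitrary values chosen only to exercise the checks.

```python
def describe_design(target_site, n_repeat_units, spacer_bp=None):
    """Report which of the optional characteristics a design satisfies.

    target_site    -- target nucleic acid sequence, written 5' to 3'
    n_repeat_units -- number of repeat units in the binding domain
    spacer_bp      -- base pairs separating a first and a second binding
                      domain, or None if only one domain is used
    """
    return {
        "five_prime_guanine": target_site.upper().startswith("G"),
        "repeat_units_7_to_25": 7 <= n_repeat_units <= 25,
        "spacing_2_to_50_bp": spacer_bp is not None and 2 <= spacer_bp <= 50,
    }


# Arbitrary example values; the target site string is SEQ ID NO: 314.
report = describe_design("GATCTGCATGCCTGGAGC", n_repeat_units=17, spacer_bp=15)
```

Because the characteristics are recited as alternatives ("one or more"), the sketch reports each one separately rather than requiring all of them.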


In some embodiments, an RNBD of the present disclosure can have the full length naturally occurring N-terminus of a naturally occurring Ralstonia solanacearum-derived protein. In some embodiments, any truncation of the full length naturally occurring N-terminus of a naturally occurring Ralstonia solanacearum-derived protein can be used at the N-terminus of an RNBD of the present disclosure. For example, in some embodiments, amino acid residues at positions 1 (H) to position 137 (F) of the naturally occurring Ralstonia solanacearum-derived protein N-terminus can be used. In particular embodiments, said truncated N-terminus from position 1 (H) to position 137 (F) can have a sequence as follows: FGKLVALGYSREQIRKLKQESLSEIAKYHTTLTGQGFTHADICRISRRRQSLRVVARNYPELAAALPELTRAHIVDIARQRSGDLALQALLPVATALTAAPLRLSASQIATVAQYGERPAIQALYRLRRKLTRAPLH (SEQ ID NO: 264). In some embodiments, the naturally occurring N-terminus of Ralstonia solanacearum can be truncated to any length and used at the N-terminus of the engineered DNA binding domain. For example, the naturally occurring N-terminus of Ralstonia solanacearum can be truncated to amino acid residues at position 1 (H) to position 120 (K) as follows: KQESLSEIAKYHTTLTGQGFTHADICRISRRRQSLRVVARNYPELAAALPELTRAHIVDIARQRSGDLALQALLPVATALTAAPLRLSASQIATVAQYGERPAIQALYRLRRKLTRAPLH (SEQ ID NO: 303) and used at the N-terminus of the RNBD. The naturally occurring N-terminus of Ralstonia solanacearum can be truncated such that it includes amino acid residues at positions 1 to 115 and used as the N-terminus of the engineered DNA binding domain. In certain aspects, the truncated N-terminus sequence may be at least 80%, 85%, 90%, 95%, 98%, 99%, or more identical to the amino acid sequence set forth in SEQ ID NO: 320. 
The naturally occurring N-terminus of Ralstonia solanacearum can be truncated to amino acid residues at positions 1 to 50, 1 to 70, 1 to 100, 1 to 120, 1 to 130, 10 to 40, 60 to 100, or 100 to 120 and used at the N-terminus of the engineered DNA binding domain. Truncation of the N-termini can be particularly advantageous for obtaining DNA binding domains, which are smaller in size including number of amino acids and overall molecular weight. A reduced number of amino acids can allow for more efficient packaging into a viral vector and a smaller molecular weight can result in more efficient loading of the DNA binding domains in non-viral vectors for delivery.
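The position labels recited above (position 1 is H, position 120 is K, and position 137 is F) indicate that residues of the N-terminus are counted backward from the residue adjacent to the first repeat unit. Under that reading, a truncation to positions 1 through n keeps the last n residues of the N-terminus. The sketch below illustrates this with SEQ ID NO: 264; the function name n_terminus_fragment is hypothetical.

```python
# SEQ ID NO: 264, positions 1 (H) to 137 (F) of the N-terminus.
SEQ_ID_NO_264 = (
    "FGKLVALGYSREQIRKLKQESLSEIAKYH"
    "TTLTGQGFTHADICRISRRRQSLRVVARN"
    "YPELAAALPELTRAHIVDIARQRSGDLAL"
    "QALLPVATALTAAPLRLSASQIATVAQY"
    "GERPAIQALYRLRRKLTRAPLH"
)


def n_terminus_fragment(n_terminus, first_position, last_position):
    """Return positions first_position..last_position of an N-terminus.

    Positions are counted from the C-terminal end of the N-terminus, so
    position 1 is the final residue of `n_terminus`.
    """
    length = len(n_terminus)
    if not 1 <= first_position <= last_position <= length:
        raise ValueError("positions out of range")
    return n_terminus[length - last_position : length - first_position + 1]


positions_1_to_120 = n_terminus_fragment(SEQ_ID_NO_264, 1, 120)
positions_1_to_115 = n_terminus_fragment(SEQ_ID_NO_264, 1, 115)
```

With this numbering, positions_1_to_120 begins with K and positions_1_to_115 begins with S, in agreement with the residue labels given for SEQ ID NO: 303 and SEQ ID NO: 320. Internal ranges such as positions 10 to 40 can be obtained with the same function.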


In some embodiments, the N-terminus, referred to as the amino terminus or the “NH2” domain, can recognize a guanine. In some embodiments, the N-terminus can be engineered to bind a cytosine, adenosine, thymidine, guanine, or uracil.


In some embodiments, an RNBD of the present disclosure can have a DNA binding domain, in which the final full length repeat unit of 33-35 amino acid residues is followed by a half repeat also derived from Ralstonia solanacearum. The half repeat can have 15 to 23 amino acid residues, for example, the half repeat can have 19 amino acid residues. In particular embodiments, the half repeat can have a sequence as follows: LSTAQVVAIACISGQQALE (SEQ ID NO: 265).


In some embodiments, an RNBD of the present disclosure can have the full length naturally occurring C-terminus of a naturally occurring Ralstonia solanacearum-derived protein. In some embodiments, any truncation of the full length naturally occurring C-terminus of a naturally occurring Ralstonia solanacearum-derived protein can be used at the C-terminus of an RNBD of the present disclosure. For example, in some embodiments, the RNBD can comprise amino acid residues at position 1 (A) to position 63 (S) of the naturally occurring Ralstonia solanacearum-derived protein C-terminus as follows: AIEAHMPTLRQASHSLSPERVAAIACIGGRSAVEAVRQGLPVKAIRRIRREKAPVAGPPPAS (SEQ ID NO: 266). In some embodiments, the naturally occurring C-terminus of Ralstonia solanacearum can be truncated to any length and used at the C-terminus of the RNBD. For example, the naturally occurring C-terminus of Ralstonia solanacearum can be truncated to amino acid residues at positions 1 to 63 and used at the C-terminus of the RNBD. The naturally occurring C-terminus of Ralstonia solanacearum can be truncated to amino acid residues at positions 1 to 50 and used at the C-terminus of the RNBD. The naturally occurring C-terminus of Ralstonia solanacearum can be truncated to amino acid residues at positions 1 to 63, 1 to 50, 1 to 70, 1 to 100, 1 to 120, 1 to 130, 10 to 40, 60 to 100, or 100 to 120 and used at the C-terminus of the RNBD.


TABLE 3 shows N-termini, C-termini, and half repeats derived from Ralstonia.









TABLE 3
Ralstonia-Derived N-terminus, C-terminus, and Half-Repeat

SEQ ID NO	Description	Sequence
SEQ ID NO: 320	Truncated N-terminus; positions 1 (H) to 115 (S) of the naturally occurring Ralstonia solanacearum-derived protein N-terminus	SEIAKYHTTLTGQGFTHADICRISRRRQSLRVVARNYPELAAALPELTRAHIVDIARQRSGDLALQALLPVATALTAAPLRLSASQIATVAQYGERPAIQALYRLRRKLTRAPLH
SEQ ID NO: 264	Truncated N-terminus; positions 1 (H) to 137 (F) of the naturally occurring Ralstonia solanacearum-derived protein N-terminus	FGKLVALGYSREQIRKLKQESLSEIAKYHTTLTGQGFTHADICRISRRRQSLRVVARNYPELAAALPELTRAHIVDIARQRSGDLALQALLPVATALTAAPLRLSASQIATVAQYGERPAIQALYRLRRKLTRAPLH
SEQ ID NO: 303	Truncated N-terminus; positions 1 (H) to 120 (K) of the naturally occurring Ralstonia solanacearum-derived protein N-terminus	KQESLSEIAKYHTTLTGQGFTHADICRISRRRQSLRVVARNYPELAAALPELTRAHIVDIARQRSGDLALQALLPVATALTAAPLRLSASQIATVAQYGERPAIQALYRLRRKLTRAPLH
SEQ ID NO: 265	Half-repeat	LSTAQVVAIACISGQQALE
SEQ ID NO: 266	Truncated C-terminus; positions 1 (A) to 63 (S) of the naturally occurring Ralstonia solanacearum-derived protein C-terminus	AIEAHMPTLRQASHSLSPERVAAIACIGGRSAVEAVRQGLPVKAIRRIRREKAPVAGPPPAS









In some embodiments, an RNBD can be engineered to target and bind to a site in the PDCD1 gene. For example, an RNBD with the sequence FGKLVALGYSREQIRKLKQESLSEIAKYHTTLTGQGFTHADICRISRRRQSLRVVARNYPELA AALPELTRAHIVDIARQRSGDLALQALLPVATALTAAPLRLSASQIATVAQYGERPAIQALY RLRRKLTRAPLHLTPQQVVAIASNTGGKRALEAVCVQLPVLRAAPYRLSTEQVVAIASHDG GKQALEAVGAQLVALRAAPYALSTEQVVAIASHDGGKQALEAVGAQLVALRAAPYALST AQVVAIASNGGGKQALEGIGEQLLKLRTAPYGLSTEQVVAIASNKGGKQALEAVKAHLLDL LGAPYVLSTEQVVAIASNKGGKQALEAVKAHLLDLLGAPYVLSTEQVVAIASNKGGKQAL EAVKAHLLDLLGAPYVLSTEQVVVIANSIGGKQALEAVKVQLPVLRAAPYELSTEQVVAIA SHDGGKQALEAVGAQLVALRAAPYALSTEQVVVIANSIGGKQALEAVKVQLPVLRAAPYE LSTEQVVAIASNKGGKQALEAVKAHLLDLLGAPYVLSTAQVVAIASNGGGKQALEGIGEQL LKLRTAPYGLSTAQVVAIASNGGGKQALEGIGEQLLKLRTAPYGLSTAQVVAIASNGGGKQ ALEGIGEQLLKLRTAPYGLSTEQVVAIASHDGGKQALEAVGAQLVALRAAPYALSTEQVVA IASHDGGKQALEAVGAQLVALRAAPYALSTEQVVAIASHDGGKQALEAVGAQLVALRAAP YALSTAQVVAIASNGGGKQALEGIGEQLLKLRTAPYGLSTAQVVAIASNGGGKQALEGIGE QLLKLRTAPYGLSTAQVVAIACISGQQALEAIEAHMPTLRQASHSLSPERVAAIACIGGRSAV EAVRQGLPVKAIRRIRREKAPVAGPPPAS (SEQ ID NO: 311) can bind to the GACCTGGGACAGTTTCCCTT (SEQ ID NO: 312) nucleic acid sequence in the PDCD1 gene. 
As another example, an RNBD with the sequence FGKLVALGYSREQIRKLKQESLSEIAKYHTTLTGQGFTHADICRISRRRQSLRVVARNYPELA AALPELTRAHIVDIARQRSGDLALQALLPVATALTAAPLRLSASQIATVAQYGERPAIQALY RLRRKLTRAPLHLTPQQVVAIASNTGGKRALEAVCVQLPVLRAAPYRLSTAQVVAIASNGG GKQALEGIGEQLLKLRTAPYGLSTEQVVAIASHDGGKQALEAVGAQLVALRAAPYALSTA QVVAIASNGGGKQALEGIGEQLLKLRTAPYGLSTEQVVAIASHNGGKQALEAVKADLLELR GAPYALSTEQVVAIASHDGGKQALEAVGAQLVALRAAPYALSTEQVVVIANSIGGKQALEA VKVQLPVLRAAPYELSTAQVVAIASNGGGKQALEGIGEQLLKLRTAPYGLSTEQVVAIASH NGGKQALEAVKADLLELRGAPYALSTEQVVAIASHDGGKQALEAVGAQLVALRAAPYALS TEQVVAIASHDGGKQALEAVGAQLVALRAAPYALSTAQVVAIASNGGGKQALEGIGEQLL KLRTAPYGLSTEQVVAIASHNGGKQALEAVKADLLELRGAPYALSTEQVVAIASHNGGKQ ALEAVKADLLELRGAPYALSTEQVVVIANSIGGKQALEAVKVQLPVLRAAPYELSTEQVVA IASHNGGKQALEAVKADLLELRGAPYALSTEQVVAIASHDGGKQALEAVGAQLVALRAAP YALSTAQVVAIACISGQQALEMEAHMPTLRQASHSLSPERVAAIACIGGRSAVEAVRQGLP VKAIRRIRREKAPVAGPPPAS (SEQ ID NO: 313) can bind to the GATCTGCATGCCTGGAGC (SEQ ID NO: 314) nucleic acid sequence in the PDCD1 gene. As yet another example, an RNBD with the sequence FGKLVALGYSREQIRKLKQESLSEIAKYHTTLTGQGFTHADICRISRRRQSLRVVARNYPELA AALPELTRAHIVDIARQRSGDLALQALLPVATALTAAPLRLSASQIATVAQYGERPAIQALY RLRRKLTRAPLHLTPQQVVAIASNTGGKRALEAVCVQLPVLRAAPYRLSTAQVVAIASNGG GKQALEGIGEQLLKLRTAPYGLSTEQVVAIASHDGGKQALEAVGAQLVALRAAPYALSTA QVVAIASNGGGKQALEGIGEQLLKLRTAPYGLSTEQVVAIASHNGGKQALEAVKADLLELR GAPYALSTEQVVAIASHDGGKQALEAVGAQLVALRAAPYALSTAQVVAIATRSGGKQALE AVRAQLLDLRAAPYGLSTAQVVAIASNGGGKQALEGIGEQLLKLRTAPYGLSTEQVVAIAS HNGGKQALEAVKADLLELRGAPYALSTEQVVAIASHDGGKQALEAVGAQLVALRAAPYA LSTEQVVAIASHDGGKQALEAVGAQLVALRAAPYALSTAQVVAIASNGGGKQALEGIGEQ LLKLRTAPYGLSTEQVVAIASHNGGKQALEAVKADLLELRGAPYALSTEQVVAIASHNGGK QALEAVKADLLELRGAPYALSTAQVVAIATRSGGKQALEAVRAQLLDLRAAPYGLSTEQV VAIASHNGGKQALEAVKADLLELRGAPYALSTEQVVAIASHDGGKQALEAVGAQLVALRA APYALSTAQVVAIACISGQQALEMEAHMPTLRQASHSLSPERVAAIACIGGRSAVEAVRQG LPVKAIRRIRREKAPVAGPPPAS (SEQ ID NO: 315) can bind to the GATCTGCATGCCTGGAGC (SEQ ID NO: 314) nucleic acid sequence in the PDCD1 gene. 
Any one of SEQ ID NO: 311, SEQ ID NO: 313, or SEQ ID NO: 315 can be fused to any repression domain described herein (e.g., KRAB) to yield a gene repressor capable of repressing expression of the target gene.



Xanthomonas Derived Transcription Activator Like Effector (TALE)

The present disclosure provides a modular nucleic acid binding domain derived from Xanthomonas spp., also referred to herein as a transcription activator-like effector (TALE) protein, which can comprise a plurality of repeat units. A repeat unit of the plurality of repeat units recognizes a single target nucleotide, base pair, or both. A repeat unit from Xanthomonas spp. can comprise 33-35 amino acid residues. In some embodiments, a repeat unit can be from a Xanthomonas spp. protein having the sequence:









(SEQ ID NO: 299)
MDPIRSRTPSPARELLPGPQPDGVQPTADRGVSPPAGGPLDGLPARRTMS
RTRLPSPPAPSPAFSAGSFSDLLRQFDPSLFNTSLFDSLPPFGAHHTEAA
TGEWDEVQSGLRAADAPPPTMRVAVTAARPPRAKPAPRRRAAQPSDASPA
AQVDLRTLGYSQQQQEKIKPKVRSTVAQHHEALVGHGFTHAHIVALSQHP
AALGTVAVKYQDMIAALPEATHEAIVGVGKQWSGARALEALLTVAGELRG
PPLQLDTGQLLKIAKRGGVTAVEAVHAWRNALTGAPLNLTPEQVVAIASH
DGGKQALETVQRLLPVLCQAHGLTPQQVVAIASNGGGKQALETVQRLLPV
LCQAHGLTPQQVVAIASNSGGKQALETVQRLLPVLCQAHGLTPEQVVAIA
SNGGGKQALETVQRLLPVLCQAHGLTPEQVVAIASNIGGKQALETVQALL
PVLCQAHGLTPEQVVAIASNIGGKQALETVQALLPVLCQAHGLTPEQVVA
IASNIGGKQALETVQALLPVLCQAHGLTPEQVVAIASHDGGKQALETVQR
LLPVLCQAHGLTPEQVVAIASHDGGKQALETVQRLLPVLCQAHGLTPQQV
VAIASNGGGKQALETVQRLLPVLCQAHGLTPEQVVAIASNSGGKQALETV
QALLPVLCQAHGLTPEQVVAIASNSGGKQALETVQRLLPVLCQAHGLTPE
QVVAIASHDGGKQALETVQRLLPVLCQAHGLTPEQVVAIASHDGGKQALE
TVQRLLPVLCQAHGLTPEQVVAIASHDGGKQALETVQRLLPVLCQAHGLT
PQQVVAIASNGGGRPALETVQRLLPVLCQAHGLTPEQVVAIASHDGGKQA
LETVQRLLPVLCQAHGLTPQQVVAIASNGGGRPALESIVAQLSRPDPALA
ALTNDHLVALACLGGRPALDAVKKGLPHAPALIKRTNRRIPERTSHRVAD
HAQVVRVLGFFQCHSHPAQAFDDAMTQFGMSRHGLLQLFRRVGVTELEAR
SGTLPPASQRWDRILQASGMKRAKPSPTSTQTPDQASLHAFADSLERDLD
APSPMHEGDQTRASSRKRSRSDRAVTGPSAQQSFEVRVPEQRDALHLPLS
WRVKRPRTSIGGGLPDPGTPTAADLAASSTVMREQDEDPFAGAADDTPAF
NEEELAWLMELLPQ.







In some embodiments, a TALE of the present disclosure can comprise between 1 and 50 Xanthomonas spp.-derived repeat units. In some embodiments, a TALE of the present disclosure can comprise between 9 and 36 Xanthomonas spp.-derived repeat units. Preferably, in some embodiments, a TALE of the present disclosure can comprise between 12 and 30 Xanthomonas spp.-derived repeat units. A TALE described herein can comprise between 5 and 10 Xanthomonas spp.-derived repeat units, between 10 and 15 Xanthomonas spp.-derived repeat units, between 15 and 20 Xanthomonas spp.-derived repeat units, between 20 and 25 Xanthomonas spp.-derived repeat units, between 25 and 30 Xanthomonas spp.-derived repeat units, between 30 and 35 Xanthomonas spp.-derived repeat units, or between 35 and 40 Xanthomonas spp.-derived repeat units. A TALE described herein can comprise at least 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, or 40, or more Xanthomonas spp.-derived repeat units, such as repeat units derived from a Xanthomonas spp. protein having the amino acid sequence set forth in SEQ ID NO: 299.


A Xanthomonas spp.-derived repeat unit can be derived from a wild-type repeat unit, such as any one of SEQ ID NO: 323-SEQ ID NO: 326. For example, a Xanthomonas spp.-derived repeat unit can have a sequence of LTPDQVVAIASNHGGKQALETVQRLLPVLCQDHG (SEQ ID NO: 323) comprising an RVD of NH, which recognizes guanine. A Xanthomonas spp.-derived repeat unit can have a sequence of LTPDQVVAIASNGGGKQALETVQRLLPVLCQDHG (SEQ ID NO: 324) comprising an RVD of NG, which recognizes thymidine. A Xanthomonas spp.-derived repeat unit can have a sequence of LTPDQVVAIASNIGGKQALETVQRLLPVLCQDHG (SEQ ID NO: 325) comprising an RVD of NI, which recognizes adenosine. A Xanthomonas spp.-derived repeat unit can have a sequence of LTPDQVVAIASHDGGKQALETVQRLLPVLCQDHG (SEQ ID NO: 326) comprising an RVD of HD, which recognizes cytosine.
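The four wild-type repeat units above share a common backbone and differ only at the RVD, so an ordered set of repeat units for a given target can be sketched by substituting the RVD base by base. The following is a simplified illustration and not a complete design procedure: the names BASE_TO_RVD, BACKBONE_1_11, BACKBONE_14_34, and repeats_for_target are hypothetical, and N-terminus, C-terminus, and half-repeat considerations are deliberately left out.

```python
# Base -> RVD, taken from the four wild-type repeat units described above.
BASE_TO_RVD = {"G": "NH", "T": "NG", "A": "NI", "C": "HD"}

# Backbone shared by SEQ ID NO: 323-326, split around the RVD.
BACKBONE_1_11 = "LTPDQVVAIAS"
BACKBONE_14_34 = "GGKQALETVQRLLPVLCQDHG"


def repeats_for_target(target):
    """Return one 34-residue repeat unit per base of `target` (5' to 3')."""
    repeats = []
    for base in target.upper():
        if base not in BASE_TO_RVD:
            raise ValueError(f"unsupported base: {base!r}")
        repeats.append(BACKBONE_1_11 + BASE_TO_RVD[base] + BACKBONE_14_34)
    return repeats


example_repeats = repeats_for_target("GTAC")
```

For the target GTAC, the four repeat units returned are the sequences of SEQ ID NO: 323, 324, 325, and 326, in that order.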


A Xanthomonas spp.-derived repeat unit can also comprise a modified Xanthomonas spp.-derived repeat unit enhanced for specific recognition of a nucleotide or base pair. A TALE described herein can comprise one or more wild-type Xanthomonas spp.-derived repeat units, one or more modified Xanthomonas spp.-derived repeat units, or a combination thereof. In some embodiments, a modified Xanthomonas spp.-derived repeat unit can comprise 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, or 29 mutations that can enhance recognition of a specific nucleotide or base pair. In some embodiments, a modified Xanthomonas spp.-derived repeat unit can comprise more than 1 modification, for example 1 to 5 modifications, 5 to 10 modifications, 10 to 15 modifications, 15 to 20 modifications, 20 to 25 modifications, or 25 to 29 modifications. In some embodiments, a TALE can comprise more than one modified Xanthomonas spp.-derived repeat unit, wherein each of the modified Xanthomonas spp.-derived repeat units can have a different number of modifications.


In some embodiments, a TALE of the present disclosure can have the full length naturally occurring N-terminus of a naturally occurring Xanthomonas spp.-derived protein, such as the N-terminus of SEQ ID NO: 299. The N-terminus sequence in SEQ ID NO:299 is indicated by underlining.


In some embodiments, a TALE of the present disclosure can comprise the amino acid residues at position 1 (N) through position 137 (M) of the naturally occurring Xanthomonas spp.-derived protein as follows:









(SEQ ID NO: 300)
MVDLRTLGYSQQQQEKIKPKVRSTVAQHHEALVGHGFTHAHIVALSQHPA
ALGTVAVKYQDMIAALPEATHEAIVGVGKQWSGARALEALLTVAGELRGP
PLQLDTGQLLKIAKRGGVTAVEAVHAWRNALTGAPLN.






The amino acid sequence set forth in SEQ ID NO: 300 includes an M added to the N-terminus, which is not present in the wild-type N-terminus region of a TALE protein. The N-terminus fragment sequence set out in SEQ ID NO: 300 is generated by deleting amino acids N+288 through N+137 of the N-terminus region of a TALE protein and adding an M, such that amino acids N+136 through N+1 of the N-terminus region of the TALE protein are present.
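The residue counts implied by this numbering can be checked with a short calculation. The sketch below assumes a full-length N-terminus region running from N+1 through N+288, as recited above; the function name n_cap_fragment is hypothetical.

```python
FULL_N_CAP_LENGTH = 288  # N+1 through N+288, as recited above

SEQ_ID_NO_300 = (
    "MVDLRTLGYSQQQQEKIKPKVRSTVAQHHEALVGHGFTHAHIVALSQHPA"
    "ALGTVAVKYQDMIAALPEATHEAIVGVGKQWSGARALEALLTVAGELRGP"
    "PLQLDTGQLLKIAKRGGVTAVEAVHAWRNALTGAPLN"
)


def n_cap_fragment(last_retained, add_met=True):
    """Return (residues deleted, fragment length) for an N-cap truncation.

    Residues N+1 through N+last_retained are kept, residues beyond that
    up to N+288 are deleted, and an M is optionally added at the start.
    """
    if not 1 <= last_retained <= FULL_N_CAP_LENGTH:
        raise ValueError("last_retained out of range")
    deleted = FULL_N_CAP_LENGTH - last_retained
    length = last_retained + (1 if add_met else 0)
    return deleted, length


deleted, length = n_cap_fragment(136)
```

Retaining N+1 through N+136 deletes the 152 residues from N+137 through N+288, and the added M brings the fragment to 137 residues, which is the length of SEQ ID NO: 300.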


In some embodiments, the N-terminus can be truncated such that the fragment of the N-terminus includes amino acids from position 1 (N) through position 120 (K) of the naturally occurring Xanthomonas spp.-derived protein as follows:









(SEQ ID NO: 301)
KPKVRSTVAQHHEALVGHGFTHAHIVALSQHPAALGTVAVKYQDMIAALP
EATHEAIVGVGKQWSGARALEALLTVAGELRGPPLQLDTGQLLKIAKRGG
VTAVEAVHAWRNALTGAPLN.






In some embodiments, the N-terminus can be truncated such that the fragment of the N-terminus includes amino acids from position 1 (N) through position 115 (S) of the naturally occurring Xanthomonas spp.-derived protein as follows:









(SEQ ID NO: 321)
STVAQHHEALVGHGFTHAHIVALSQHPAALGTVAVKYQDMIAALPEATHE
AIVGVGKQWSGARALEALLTVAGELRGPPLQLDTGQLLKIAKRGGVTAVE
AVHAWRNALTGAPLN.






In some embodiments, the N-terminus can be truncated such that the fragment of the N-terminus includes amino acids from position 1 (N) through position 110 (H) of the naturally occurring Xanthomonas spp.-derived protein as follows:









(SEQ ID NO: 447)
HHEALVGHGETHAHIVALSQHPAALGTVAVKYQDMIAALPEATHEAIVGV
GKQWSGARALEALLTVAGELRGPPLQLDTGQLLKIAKRGGVTAVEAVHAW
RNALTGAPLN.






In some embodiments, a truncation of the naturally occurring Xanthomonas spp.-derived protein can be used at the N-terminus of a TALE disclosed herein. In some embodiments, a truncation of the naturally occurring Xanthomonas spp.-derived protein can be used at the N-terminus of a TALE disclosed herein and may include an amino acid sequence at least 85%, 90%, 95%, 96%, 97%, 98%, or 99% identical to the amino acid sequences set forth in one of SEQ ID NOs: 300, 301, 321, and 447. The naturally occurring N-terminus of Xanthomonas spp. can be truncated to amino acid residues at positions 1 to 50, 1 to 70, 1 to 100, 1 to 120, 1 to 130, 10 to 40, 60 to 100, or 100 to 120 and used at the N-terminus of the TALE.



FIGS. 1A-1C show schematics of the domain structure of a TALE protein (not drawn to scale). ‘N’ and ‘C’ indicate the amino and carboxy termini, respectively. The TALE repeat domain comprising TALE repeat units, N-Cap and C-Cap regions are labeled and the residue numbering scheme for the N-Cap and C-Cap regions and the N-terminus and C-terminus fragments are indicated. FIG. 1A includes the full-length N-cap region that extends from amino acid position N+1 to N+288 and full-length C-cap region that extends from amino acid position C+1 through C+278. FIG. 1B provides a schematic of a DNA binding protein comprising TALE repeat units and a truncated N-terminus that extends from amino acid position N+1 to N+136 (the notation N+137 indicates that a methionine added to the N-terminus increases the length to 137) and a truncated C-terminus that extends from amino acid position C+1 through C+63. FIG. 1C provides a schematic of a DNA binding protein comprising TALE repeat units and a truncated N-terminus that extends from amino acid position N+1 to N+115 and a truncated C-terminus that extends from amino acid position C+1 through C+63. In certain cases, the last repeat domain may be a half-repeat or a partial repeat as disclosed herein.


In some embodiments, a TALE of the present disclosure can have a DNA binding domain, in which the final full length repeat unit of 33-35 amino acid residues is followed by a half repeat also derived from Xanthomonas spp. The half repeat can have 15 to 23 amino acid residues, for example, the half repeat can have 19 amino acid residues. In particular embodiments, the half repeat can have a sequence as set forth in LTPQQVVAIASNGGGRPALE (SEQ ID NO: 297). In some embodiments, the half repeat can have a sequence as set forth in SEQ ID NO: 327, 328, 329, 330, 331, 332, 333, or 334.









TABLE 4
Xanthomonas Repeat Sequences

SEQ ID NO	Amino Acid Sequence	Description
323	LTPDQVVAIASNHGGKQALETVQRLLPVLCQDHG	RVD NH recognizing G
324	LTPDQVVAIASNGGGKQALETVQRLLPVLCQDHG	RVD NG recognizing T
325	LTPDQVVAIASNIGGKQALETVQRLLPVLCQDHG	RVD NI recognizing A
326	LTPDQVVAIASHDGGKQALETVQRLLPVLCQDHG	RVD HD recognizing C
297	LTPQQVVAIASNGGGRPALE	Half repeat
327	LTPEQVVAIASNGGGRPALE	Half repeat
328	LTPDQVVAIASNGGGRPALE	Half repeat
329	LTPEQVVAIASNIGGRPALE	Half repeat
330	LTPDQVVAIASNIGGRPALE	Half repeat
331	LTPEQVVAIASHDGGRPALE	Half repeat
332	LTPDQVVAIASHDGGRPALE	Half repeat
333	LTPEQVVAIASNHGGRPALE	Half repeat
334	LTPDQVVAIASNHGGRPALE	Half repeat









In some embodiments, a TALE of the present disclosure can have the full length naturally occurring C-terminus of a naturally occurring Xanthomonas spp.-derived protein, such as the C-terminus of SEQ ID NO: 299. The C-terminus of the TALE protein sequence set forth in SEQ ID NO:299 is italicized. In some embodiments, the C-terminus can be a fragment of the full length naturally occurring C-terminus of a naturally occurring Xanthomonas spp.-derived protein. In some embodiments, the C-terminus can be less than 250 amino acids long. In some embodiments, the C-terminus can be positions 1 (S) through position 278 (Q) of the naturally occurring Xanthomonas spp.-derived protein as follows: SIVAQLSRPDPALAALTNDHLVALACLGGRPALDAVKKGLPHAPALIKRTNRRIPERTSHRVADHAQVVRVLGFFQCHSHPAQAFDDAMTQFGMSRHGLLQLFRRVGVTELEARSGTLPPASQRWDRILQASGMKRAKPSPTSTQTPDQASLHAFADSLERDLDAPSPTHEGDQRRASSRKRSRSDRAVTGPSAQQSFEVRAPEQRDALHLPLSWRVKRPRTSIGGGLPDPGTPTAADLAASSTVMREQDEDPFAGAADDFPAFNEEELAWLMELLPQ (SEQ ID NO: 302). In some embodiments, any truncation of the full length naturally occurring C-terminus of a naturally occurring Xanthomonas spp.-derived protein can be used at the C-terminus of a TALE of the present disclosure. For example, in some embodiments, the naturally occurring C-terminus of Xanthomonas spp. can be truncated to amino acid residues at position 1 (S) to position 63 (A) as follows: SIVAQLSRPDPALAALTNDHLVALACLGGRPALDAVKKGLPHAPALIKRTNRRIPERTSHRVA (SEQ ID NO: 298). The naturally occurring C-terminus of Xanthomonas spp. can be truncated to amino acid residues at positions 1 to 50 and used at the C-terminus of the engineered DNA binding domain. The naturally occurring C-terminus of Xanthomonas spp. can be truncated to amino acid residues at positions 1 to 63, 1 to 50, 1 to 70, 1 to 100, 1 to 120, 1 to 130, 10 to 40, 60 to 100, or 100 to 120 and used at the C-terminus of the engineered DNA binding domain.


The terms “N-cap” polypeptide and “N-terminal sequence” are used to refer to an amino acid sequence (polypeptide) that flanks the N-terminal portion of the first TALE repeat unit. The N-cap sequence can be of any length (including no amino acids), so long as the TALE-repeat unit(s) function to bind DNA. An N-terminal fragment and grammatical equivalents thereof refers to a shortened sequence of an N-terminal sequence which fragment is sufficient for the TALE repeat units to bind to DNA.


The term “C-cap” or “C-terminal region” refers to optionally present amino acid sequences that may be flanking the C-terminal portion of the last TALE repeat unit. The C-cap can also comprise any part of a terminal C-terminal TALE repeat, including 0 residues, truncations of a TALE repeat or a full TALE repeat. A C-terminal fragment and grammatical equivalents thereof refers to a shortened sequence of a C-terminal sequence which fragment is sufficient for the TALE repeat units to bind to DNA.


Animal Pathogen Derived Modular Nucleic Acid Binding Domains

The present disclosure provides a modular nucleic acid binding domain derived from an animal pathogen protein (MAP-NBD), which can comprise a plurality of repeat units, wherein a repeat unit of the plurality of repeat units recognizes a single target nucleotide, base pair, or both.


In some embodiments, the repeat unit can be derived from an animal pathogen, and can be referred to as a non-naturally occurring modular nucleic acid binding domain derived from an animal pathogen protein (MAP-NBD), or “modular animal pathogen-nucleic acid binding domain” (MAP-NBD). For example, in some cases, the animal pathogen can be from the Gram-negative bacterium genus, Legionella. In other cases, the animal pathogen can be from Burkholderia. In some cases, the animal pathogen can be from Paraburkholderia. In other cases, the animal pathogen can be from Francisella.


In particular embodiments, the repeat unit can be derived from a species of the genus of Legionella, such as Legionella quateirensis, the genus of Burkholderia, the genus of Paraburkholderia, or the genus of Francisella. In some embodiments, the repeat unit can comprise from 19 amino acid residues to 35 amino acid residues. In particular embodiments, the repeat unit can comprise 33 amino acid residues. In other embodiments, the repeat unit can comprise 35 amino acid residues. In some embodiments, the MAP-NBD is non-naturally occurring, and comprises a plurality of repeat units and wherein a repeat unit of the plurality of repeat units recognizes a single target nucleic acid.


In some embodiments, a repeat unit can be derived from a Legionella quateirensis protein with the following sequence:









(SEQ ID NO: 281)
MPDLELNFAIPLHLFDDETVFTHDATNDNSQASSSYSSKSSPASANARKR
TSRKEMSGPPSKEPANTKSRRANSQNNKLSLADRLTKYNIDEEFYQTRSD
SLLSLNYTKKQIERLILYKGRTSAVQQLLCKHEELLNLISPDGLGHKELI
KIAARNGGGNNLIAVLSCYAKLKEMGFSSQQIIRMVSHAGGANNLKAVTA
NHDDLQNMGFNVEQIVRMVSHNGGSKNLKAVTDNHDDLKNMGFNAEQIVR
MVSHGGGSKNLKAVTDNHDDLKNMGFNAEQIVSMVSNNGGSKNLKAVTDN
HDDLKNMGFNAEQIVSMVSNGGGSLNLKAVKKYHDALKDRGFNTEQIVRM
VSHDGGSLNLKAVKKYHDALRERKFNVEQIVSIVSHGGGSLNLKAVKKYH
DVLKDREFNAEQIVRMVSHDGGSLNLKAVTDNHDDLKNMGFNAEQIVRMV
SHKGGSKNLALVKEYFPVFSSFHFTADQIVALICQSKQCFRNLKKNHQQW
KNKGLSAEQIVDLILQETPPKPNFNNTSSSTPSPSAPSFFQGPSTPIPTP
VLDNSPAPIFSNPVCFFSSRSENNTEQYLQDSTLDLDSQLGDPTKNFNVN
NFWSLFPFDDVGYHPHSNDVGYHLHSDEESPFFDF.






In some embodiments, a repeat from a Legionella quateirensis protein can comprise a repeat with a canonical RVD or a non-canonical RVD. In some embodiments, a canonical RVD can comprise NN, NG, or HD. In some embodiments, a non-canonical RVD can comprise RN, HA, HN, HG, or HK.


In some embodiments, a repeat of SEQ ID NO: 282 comprises an RVD of HA and primarily recognizes a base of adenine (A). In some embodiments, a repeat of SEQ ID NO: 283 comprises an RVD of HN and recognizes a base comprising guanine (G). In some embodiments, a repeat of SEQ ID NO: 284 comprises an RVD of HG and recognizes a base comprising thymine (T). In some embodiments, a repeat of SEQ ID NO: 285 comprises an RVD of NN and recognizes a base comprising guanine (G). In some embodiments, a repeat of SEQ ID NO: 286 comprises an RVD of NG and recognizes a base comprising thymine (T). In some embodiments, a repeat of SEQ ID NO: 287 comprises an RVD of HD and recognizes a base comprising cytosine (C). In some embodiments, a repeat of SEQ ID NO: 288 comprises an RVD of HG and recognizes a base comprising thymine (T). In some embodiments, a repeat of SEQ ID NO: 289 comprises an RVD of HD and recognizes a base comprising cytosine (C). In some embodiments, a half-repeat of SEQ ID NO: 290 comprises an RVD of HK and recognizes a base comprising guanine (G). In some embodiments, a repeat of SEQ ID NO: 357 comprises an RVD of RN and recognizes a base comprising guanine (G).
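The RVD-to-base assignments listed above can be collected into a lookup table. The following Python sketch is illustrative only: the names `RVD_TO_BASE` and `predicted_target` are hypothetical, and the mapping contains only the assignments stated in this paragraph for the Legionella quateirensis repeats.

```python
# RVD-to-base assignments as stated for SEQ ID NO: 282-290 and SEQ ID NO: 357.
# The dictionary and function names are illustrative, not part of the disclosure.
RVD_TO_BASE = {
    "HA": "A",
    "HN": "G",
    "HG": "T",
    "NN": "G",
    "NG": "T",
    "HD": "C",
    "HK": "G",
    "RN": "G",
}


def predicted_target(rvds):
    """Return the base string that an ordered list of RVDs would be expected to recognize."""
    bases = []
    for rvd in rvds:
        if rvd not in RVD_TO_BASE:
            raise ValueError(f"no base assignment listed for RVD {rvd!r}")
        bases.append(RVD_TO_BASE[rvd])
    return "".join(bases)


# RVDs in the order the repeats occur in SEQ ID NO: 281
# (SEQ ID NO: 357, then SEQ ID NO: 282 through SEQ ID NO: 290).
native_rvds = ["RN", "HA", "HN", "HG", "NN", "NG", "HD", "HG", "HD", "HK"]
print(predicted_target(native_rvds))
```

The output is simply the listed assignments applied in order; it is not a measured binding site.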


TABLE 5 illustrates exemplary repeats from Legionella quateirensis, Burkholderia, Paraburkholderia, or Francisella that can make up a MAP-NBD of the present disclosure and the RVD at positions 12 and 13 of the particular repeat. A MAP-NBD of the present disclosure can comprise at least one of the repeats disclosed in TABLE 5 including any one of SEQ ID NO: 357, SEQ ID NO: 282-SEQ ID NO: 290, or SEQ ID NO: 358-SEQ ID NO: 446. A MAP-NBD of the present disclosure can comprise any combination of repeats disclosed in TABLE 5 including any one of SEQ ID NO: 357, SEQ ID NO: 282-SEQ ID NO: 290, or SEQ ID NO: 358-SEQ ID NO: 446.









TABLE 5
Animal Pathogen Derived Repeat Units

SEQ ID NO | Organism | Repeat Unit Sequence | RVD
357 | L. quateirensis | LGHKELIKIAARNGGGNNLIAVLSCYAKLKEMG | RN
282 | L. quateirensis | FSSQQIIRMVSHAGGANNLKAVTANHDDLQNMG | HA
283 | L. quateirensis | FNVEQIVRMVSHNGGSKNLKAVTDNHDDLKNMG | HN
284 | L. quateirensis | FNAEQIVRMVSHGGGSKNLKAVTDNHDDLKNMG | HG
285 | L. quateirensis | FNAEQIVSMVSNNGGSKNLKAVTDNHDDLKNMG | NN
286 | L. quateirensis | FNAEQIVSMVSNGGGSLNLKAVKKYHDALKDRG | NG
287 | L. quateirensis | FNTEQIVRMVSHDGGSLNLKAVKKYHDALRERK | HD
288 | L. quateirensis | FNVEQIVSIVSHGGGSLNLKAVKKYHDVLKDRE | HG
289 | L. quateirensis | FNAEQIVRMVSHDGGSLNLKAVTDNHDDLKNMG | HD
290 | L. quateirensis | FNAEQIVRMVSHKGGSKNL | HK (half repeat)
358 | L. quateirensis | FSAEQIVRIAAHDGGSRNIEAVQQAQHVLKELG | HD
359 | L. quateirensis | FSAEQIVSIVAHDGGSRNIEAVQQAQHILKELG | HD
360 | L. quateirensis | FSRQQILRIASHDGGSKNIAAVQKFLPKLMNFGFN | HD
361 | L. quateirensis | FSAEQIVRIAAHDGGSLNIDAVQQAQQALKELG | HD
362 | L. quateirensis | FSTEQIVCIAGHGGGSLNIKAVLLAQQALKDLG | HG
363 | L. quateirensis | FSSEQIVRVAAHGGGSLNIKAVLQAHQALKELD | HG
364 | L. quateirensis | FSAEQIVHIAAHGGGSLNIKAILQAHQTLKELN | HG
365 | L. quateirensis | FSAEQIVRIAAHIGGSRNIEAIQQAHHALKELG | HI
366 | L. quateirensis | FSAEQIVRIAAHIGGSHNLKAVLQAQQALKELD | HI
367 | L. quateirensis | FSAKHIVRIAAHIGGSLNIKAVQQAQQALKELG | HI
368 | L. quateirensis | FNAEQIVRMVSHKGGSKNLALVKEYFPVFSSFH | HK
369 | L. quateirensis | FNAEQIVRMVSHKGGSKNLALVKEYFPVFSSFHFT | HK
370 | L. quateirensis | FSADQIVRIAAHKGGSHNIVAVQQAQQALKELD | HK
371 | L. quateirensis | FNVEQIVRMVSHNGGSKNLKAVTDNHDDLKNMGFN | HN
372 | L. quateirensis | FSADQVVKIAGHSGGSNNIAVMLAVFPRLRDFGFK | HS
373 | L. quateirensis | FSAEQIVSIAAHVGGSHNIEAVQKAHQALKELD | HV
374 | L. quateirensis | FNAEQIVSMVSNNGGSKNLKAVTDNHDDLKNMGFN | NN
375 | L. quateirensis | FSHKELIKIAARNGGGNNLIAVLSCYAKLKEMG | RN
376 | L. quateirensis | FSHKELIKIAARNGGGNNLIAVLSCYAKLKEMGFS | RN
377 | Burkholderia | FSSGETVGATVGAGGTETVAQGGTASNTTVSSGGY | GA
378 | Burkholderia | FSGGMATSTTVGSGGTQDVLAGGAAVGGTVGTGGV | GS
379 | Burkholderia | FSAADIVKIAGKIGGAQALQAFITHRAALIQAGFS | KI
380 | Burkholderia | FNPTDIVKIAGNDGGAQALQAVLELEPALRERGFS | ND
381 | Burkholderia | FNPTDIVRMAGNDGGAQALQAVFELEPAFRERSFS | ND
382 | Burkholderia | FNPTDIVRMAGNDGGAQALQAVLELEPAFRERGFS | ND
383 | Burkholderia | FSQVDIVKIASNDGGAQALYSVLDVEPTFRERGFS | ND
384 | Burkholderia | FSRADIVKIAGNDGGAQALYSVLDVEPPLRERGFS | ND
385 | Burkholderia | FSRGDIVKIAGNDGGAQALYSVLDVEPPLRERGFS | ND
386 | Burkholderia | FNRADIVRIAGNGGGAQALYSVRDAGPTLGKRGFS | NG
387 | Burkholderia | FRQADIVKIASNGGSAQALNAVIKLGPTLRQRGFS | NG
388 | Burkholderia | FRQADIVKMASNGGSAQALNAVIKLGPTLRQRGFS | NG
389 | Burkholderia | FSRADIVKIAGNGGGAQALQAVLELEPTFRERGFS | NG
390 | Burkholderia | FSRADIVRIAGNGGGAQALYSVLDVGPTLGKRGFS | NG
391 | Burkholderia | FSRGDIVRIAGNGGGAQALQAVLELEPTLGERGFS | NG
392 | Burkholderia | FSRADIVKIAGNGGGAQALQAVITHRAALTQAGFS | NG
393 | Burkholderia | FSRGDTVKIAGNIGGAQALQAVLELEPTLRERGFS | NI
394 | Burkholderia | FNPTDIVKIAGNIGGAQALQAVLELEPAFRERGFS | NI
395 | Burkholderia | FSAADIVKIAGNIGGAQALQAIFTHRAALIQAGFS | NI
396 | Burkholderia | FSAADIVKIAGNIGGAQALQAVITHRATLTQAGFS | NI
397 | Burkholderia | FSATDIVKIASNIGGAQALQAVISRRAALIQAGFS | NI
398 | Burkholderia | FSQPDIVKIAGNIGGAQALQAVLELEPAFRERGFS | NI
399 | Burkholderia | FSRADIVKIAGNIGGAQALQAVLELESTFRERSFN | NI
400 | Burkholderia | FSRADIVKIAGNIGGAQALQAVLELESTLRERSFN | NI
401 | Burkholderia | FSRGDIVKMAGNIGGAQALQAGLELEPAFRERGFS | NI
402 | Burkholderia | FSRGDIVKMAGNIGGAQALQAVLELEPAFHERSFC | NI
403 | Burkholderia | FTLTDIVKMAGNIGGAQALKAVLEHGPTLRQRDLS | NI
404 | Burkholderia | FTLTDIVKMAGNIGGAQALKVVLEHGPTLRQRDLS | NI
405 | Burkholderia | FNPTDIVKIAGNNGGAQALQAVLELEPALRERGFS | NN
406 | Burkholderia | FNPTDIVKIAGNNGGAQALQAVLELEPALRERSFS | NN
407 | Burkholderia | FNPTDMVKIAGNNGGAQALQAVLELEPALRERGFS | NN
408 | Burkholderia | FSAADIVKIASNNGGAQALQALIDHWSTLSGKTKA | NN
409 | Burkholderia | FSAADIVKIASNNGGAQALQAVISRRAALIQAGFS | NN
410 | Burkholderia | FSAADIVKIASNNGGAQALQAVITHRAALAQAGFS | NN
411 | Burkholderia | FSAADIVKIASNNGGARALQALIDHWSTLSGKTKA | NN
412 | Burkholderia | FTLTDIVEMAGNNGGAQALKAVLEHGSTLDERGFT | NN
413 | Burkholderia | FTLTDIVKMAGNNGGAQALKAVLEHGPTLDERGFT | NN
414 | Burkholderia | FTLTDIVKMAGNNGGAQALKVVLEHGPTLRQRGFS | NN
415 | Burkholderia | FTLTDIVKMASNNGGAQALKAVLEHGPTLDERGFT | NN
416 | Burkholderia | FSAADIVKIAGNSGGAQALQAVISHRAALTQAGFS | NS
417 | Burkholderia | FSGGDAVSTVVRSGGAQSVASGGTASGTTVSAGAT | RS
418 | Burkholderia | FRQTDIVKMAGSGGSAQALNAVIKHGPTLRQRGFS | SG
419 | Burkholderia | FSLIDIVEIASNGGAQALKAVLKYGPVLTQAGRS | SN
420 | Burkholderia | FSGGDAAGTVVSSGGAQNVTGGLASGTTVASGGAA | SS
421 | Paraburkholderia | FNLTDIVEMAANSGGAQALKAVLEHGPTLRQRGLS | NS
422 | Paraburkholderia | FNRASIVKIAGNSGGAQALQAVLKHGPTLDERGFN | NS
423 | Paraburkholderia | FSQANIVKMAGNSGGAQALQAVLDLELVFRERGFS | NS
424 | Paraburkholderia | FSQPDIVKMAGNSGGAQALQAVLDLELAFRERGFS | NS
425 | Paraburkholderia | FSLIDIVEIASNGGAQALKAVLKYGPVLMQAGRS | SN
426 | Francisella | YKSEDIIRLASHDGGSVNLEAVLRLHSQLTRLG | HD
427 | Francisella | YKPEDIIRLASHGGGSVNLEAVLRLNPQLIGLG | HG
428 | Francisella | YKSEDIIRLASHGGGSVNLEAVLRLHSQLTRLG | HG
429 | Francisella | YKSEDIIRLASHGGGSVNLEAVLRLNPQLIGLG | HG
430 | Paraburkholderia | FNLTDIVEMAGKGGGAQALKAVLEHGPTLRQRGFN | KG
431 | Paraburkholderia | FRQADIIKIAGNDGGAQALQAVIEHGPTLRQHGFN | ND
432 | Paraburkholderia | FSQADIVKIAGNDGGTQALHAVLDLERMLGERGFS | ND
433 | Paraburkholderia | FSRADIVKIAGNGGGAQALKAVLEHEATLDERGFS | NG
434 | Paraburkholderia | FSRADIVRIAGNGGGAQALYSVLDVEPTLGKRGFS | NG
435 | Paraburkholderia | FSQPDIVKMASNIGGAQALQAVLELEPALRERGFS | NI
436 | Paraburkholderia | FSQPDIVKMAGNIGGAQALQAVLSLGPALRERGFS | NI
437 | Paraburkholderia | FSQPEIVKIAGNIGGAQALHTVLELEPTLHKRGFN | NI
438 | Paraburkholderia | FSQSDIVKIAGNIGGAQALQAVLDLESMLGKRGFS | NI
439 | Paraburkholderia | FSQSDIVKIAGNIGGAQALQAVLELEPTLRESDFR | NI
440 | Paraburkholderia | FNPTDIVKIAGNKGGAQALQAVLELEPALRERGFN | NK
441 | Paraburkholderia | FSPTDIIKIAGNNGGAQALQAVLDLELMLRERGFS | NN
442 | Paraburkholderia | FSQADIVKIAGNNGGAQALYSVLDVEPTLGKRGFS | NN
443 | Paraburkholderia | FSRGDIVTIAGNNGGAQALQAVLELEPTLRERGFN | NN
444 | Paraburkholderia | FSRIDIVKIAANNGGAQALHAVLDLGPTLRECGFS | NN
445 | Paraburkholderia | FSQADIVKIVGNNGGAQALQAVFELEPTLRERGFN | NN
446 | Paraburkholderia | FSQPDIVRITGNRGGAQALQAVLALELTLRERGFS | NR









In any one of the animal pathogen-derived repeat domains of SEQ ID NO: 357, SEQ ID NO: 282-SEQ ID NO: 290, and SEQ ID NO: 358-SEQ ID NO: 446, there can be considerable sequence divergence between repeats of a MAP-NBD outside of the RVD.
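As an illustration of reading the RVD from a repeat and of comparing repeats outside the RVD, the following Python sketch uses two repeats from TABLE 5. The function names are hypothetical, the comparison is a simple ungapped position-by-position identity (an assumption; the disclosure does not specify an alignment method), and the RVD is assumed to sit at positions 12 and 13 of the repeat as stated for TABLE 5.

```python
def rvd_of(repeat, start=12):
    """Return the two residues at 1-based positions `start` and `start + 1`."""
    return repeat[start - 1:start + 1]


def identity_outside_rvd(a, b, start=12):
    """Fraction of identical residues over shared positions, skipping the two RVD positions.

    This is an ungapped comparison over the shorter of the two repeats.
    """
    shared = min(len(a), len(b))
    positions = [i for i in range(shared) if i not in (start - 1, start)]
    matches = sum(1 for i in positions if a[i] == b[i])
    return matches / len(positions)


seq_282 = "FSSQQIIRMVSHAGGANNLKAVTANHDDLQNMG"  # L. quateirensis, SEQ ID NO: 282
seq_426 = "YKSEDIIRLASHDGGSVNLEAVLRLHSQLTRLG"  # Francisella, SEQ ID NO: 426

print(rvd_of(seq_282), rvd_of(seq_426))
print(round(identity_outside_rvd(seq_282, seq_426), 2))
```

A low identity value between two repeats that both carry a recognizable RVD is one simple way to see the divergence described above.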


In some embodiments, a MAP-NBD of the present disclosure can comprise between 1 and 50 animal pathogen-derived repeat units. In some embodiments, a MAP-NBD of the present disclosure can comprise between 9 and 36 animal pathogen-derived repeat units. In some embodiments, a MAP-NBD of the present disclosure can comprise between 12 and 30 animal pathogen-derived repeat units. A MAP-NBD described herein can comprise from 5 to 10, 10 to 15, 15 to 20, 20 to 25, 25 to 30, 30 to 35, or 35 to 40, e.g., 15 to 25, animal pathogen-derived repeat units. A MAP-NBD described herein can comprise 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39 or 40 animal pathogen-derived repeat units.


A MAP-NBD described herein can comprise 5, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39 or 40 animal pathogen-derived repeat units.


An animal pathogen-derived repeat unit can be derived from a wild-type repeat unit, such as any one of SEQ ID NO: 357, SEQ ID NO: 282-SEQ ID NO: 290, and SEQ ID NO: 358-SEQ ID NO: 446. An animal pathogen-derived repeat unit can also comprise a modified animal pathogen-derived repeat unit enhanced for specific recognition of a nucleotide or base pair. A MAP-NBD described herein can comprise one or more wild-type animal pathogen-derived repeat units, one or more modified animal pathogen-derived repeat units, or a combination thereof. In some embodiments, a modified animal pathogen-derived repeat unit can comprise 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, or 29 mutations that can enhance recognition of a specific nucleotide or base pair. In some embodiments, a modified animal pathogen-derived repeat unit can comprise more than 1 modification, for example 1 to 5 modifications, 5 to 10 modifications, 10 to 15 modifications, 15 to 20 modifications, 20 to 25 modifications, or 25 to 29 modifications. In some embodiments, a MAP-NBD can comprise more than one modified animal pathogen-derived repeat unit, wherein each of the modified animal pathogen-derived repeat units can have a different number of modifications.


In some embodiments, a MAP-NBD of the present disclosure can have the full length naturally occurring N-terminus of a naturally occurring Legionella quateirensis-derived protein, such as the N-terminus of SEQ ID NO: 281. An N-terminus can be the full length N-terminus sequence and can have a sequence of MPDLELNFAIPLHLFDDETVFTHDATNDNSQASSSYSSKSSPASANARKRTSRKEMSGPPSKEPANTKSRRANSQNNKLSLADRLTKYNIDEEFYQTRSDSLLSLNYTKKQIERLILYKGRTSAVQQLLCKHEELLNLISPDG (SEQ ID NO: 291). In some embodiments, any truncation of SEQ ID NO: 291 can be used as the N-terminus in a MAP-NBD of the present disclosure. For example, in some embodiments, a MAP-NBD comprises a truncated N-terminus including amino acid residues at position 1 (G) to position 137 (S) of the naturally occurring Legionella quateirensis N-terminus as follows: NFAIPLHLFDDETVFTHDATNDNSQASSSYSSKSSPASANARKRTSRKEMSGPPSKEPANTKSRRANSQNNKLSLADRLTKYNIDEEFYQTRSDSLLSLNYTKKQIERLILYKGRTSAVQQLLCKHEELLNLISPDG (SEQ ID NO: 335). As another example, in some embodiments, a MAP-NBD comprises a truncated N-terminus including amino acid residues at position 1 (G) to position 120 (S) of the naturally occurring Legionella quateirensis N-terminus as follows: DATNDNSQASSSYSSKSSPASANARKRTSRKEMSGPPSKEPANTKSRRANSQNNKLSLADRLTKYNIDEEFYQTRSDSLLSLNYTKKQIERLILYKGRTSAVQQLLCKHEELLNLISPDG (SEQ ID NO: 304). In some embodiments, a MAP-NBD comprises a truncated N-terminus including amino acid residues at position 1 (G) to position 115 (K) of the naturally occurring Legionella quateirensis N-terminus as follows: NSQASSSYSSKSSPASANARKRTSRKEMSGPPSKEPANTKSRRANSQNNKLSLADRLTKYNIDEEFYQTRSDSLLSLNYTKKQIERLILYKGRTSAVQQLLCKHEELLNLISPDG (SEQ ID NO: 322). In some embodiments, any truncation of the naturally occurring Legionella quateirensis-derived protein can be used at the N-terminus of a DNA binding domain disclosed herein. The naturally occurring N-terminus of Legionella quateirensis can be truncated to amino acid residues at positions 1 to 50, 1 to 70, 1 to 100, 1 to 120, 1 to 130, 10 to 40, 60 to 100, or 100 to 120 and used at the N-terminus of the MAP-NBD.
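The truncations described above can be sketched in Python as follows. The sketch assumes that positions are counted from the repeat-proximal end of the N-terminus, so that position 1 is the final glycine of SEQ ID NO: 291; this reading is inferred from the "position 1 (G)" wording and from the lengths of the listed truncations. The function name is hypothetical.

```python
# Full-length L. quateirensis N-terminus (SEQ ID NO: 291), 143 residues.
N_TERMINUS = (
    "MPDLELNFAIPLHLFDDETVFTHDATNDNSQASSSYSSKSSPASANARKR"
    "TSRKEMSGPPSKEPANTKSRRANSQNNKLSLADRLTKYNIDEEFYQTRSD"
    "SLLSLNYTKKQIERLILYKGRTSAVQQLLCKHEELLNLISPDG"
)


def truncate_n_terminus(seq, first, last):
    """Keep positions first..last, where position 1 is the LAST residue of seq.

    Assumption: numbering runs backward from the residue adjacent to the first
    repeat unit, as suggested by "position 1 (G)" in the text.
    """
    if not 1 <= first <= last <= len(seq):
        raise ValueError("positions must satisfy 1 <= first <= last <= len(seq)")
    return seq[len(seq) - last:len(seq) - first + 1]


for last in (137, 120, 115):
    fragment = truncate_n_terminus(N_TERMINUS, 1, last)
    print(last, len(fragment), fragment[:8])
```

Under this numbering, the three fragments begin with the same residues as SEQ ID NO: 335, SEQ ID NO: 304, and SEQ ID NO: 322, respectively.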


In some embodiments, a MAP-NBD of the present disclosure can have the full length naturally occurring C-terminus of a naturally occurring Legionella quateirensis-derived protein. In some embodiments, a MAP-NBD of the present disclosure can have at its C-terminus amino acid residues at position 1 (A) to position 176 (F) of the naturally occurring Legionella quateirensis-derived protein as follows:









(SEQ ID NO: 305)
ALVKEYFPVFSSFHFTADQIVALICQSKQCFRNLKKNHQQWKNKGLSAEQ
IVDLILQETPPKPNFNNTSSSTPSPSAPSFFQGPSTPIPTPVLDNSPAPI
FSNPVCFFSSRSENNTEQYLQDSTLDLDSQLGDPTKNFNVNNFWSLFPFD
DVGYHPHSNDVGYHLHSDEESPFFDF.






In some embodiments, a MAP-NBD of the present disclosure can have at its C-terminus amino acid residues at position 1 (A) to position 63 (P) of the naturally occurring Legionella quateirensis-derived protein as follows: ALVKEYFPVFSSFHFTADQIVALICQSKQCFRNLKKNHQQWKNKGLSAEQIVDLILQETPPKP (SEQ ID NO: 306).


In some embodiments, the present disclosure provides methods for identifying an animal pathogen-derived repeat unit. For example, a consensus sequence can be defined comprising a first repeat motif, a spacer, and a second repeat motif. The consensus sequence can be 1xxx211x1xxx33x2x1xxxxxxxxx1xxxx1xxx211x1xxx33x2x1xxxxxxxxx1 (SEQ ID NO: 292), 1xxx211x1xxx33x2x1xxxxxxxxx1xxxxx1xxx211x1xxx33x2x1xxxxxxxxx1 (SEQ ID NO: 293), 1xxx211x1xxx33x2x1xxxxxxxxx1xxxxxx1xxx211x1xxx33x2x1xxxxxxxxx1 (SEQ ID NO: 294), 1xxx211x1xxx33x2x1xxxxxxxxx1xxxxxxx1xxx211x1xxx33x2x1xxxxxxxxx1 (SEQ ID NO: 295), or 1xxx211x1xxx33x2x1xxxxxxxxx1xxxxxxxx1xxx211x1xxx33x2x1xxxxxxxxx1 (SEQ ID NO: 296). For any one of SEQ ID NO: 292-SEQ ID NO: 296, x can be any amino acid residue, and 1, 2, and 3 are flexible residues that are defined as follows: 1 can be selected from any one of A, F, I, L, M, T, or V, 2 can be selected from any one of D, E, K, N, M, S, R, or Q, and 3 can be selected from any one of A, G, N, or S. Thus, in some embodiments, a MAP-NBD can be derived from an animal pathogen comprising the consensus sequence of SEQ ID NO: 292, SEQ ID NO: 293, SEQ ID NO: 294, SEQ ID NO: 295, or SEQ ID NO: 296. Any one of the consensus sequences of SEQ ID NO: 292-SEQ ID NO: 296 can be compared against all sequences downloaded from NCBI, MGRast, JGI, and EBI databases to identify matches corresponding to animal pathogen proteins containing DNA-binding repeat units.


In some embodiments, a MAP-NBD repeat unit can itself have a consensus sequence of 1xxx211x1xxx33x2x1xxxxxxxxx1 (SEQ ID NO: 293), wherein x can be any amino acid residue, 1, 2, and 3 are flexible residues that are defined as follows: 1 can be selected from any one of A, F, I, L, M, T, or V, 2 can be selected from any one of D, E, K, N, M, S, R, or Q, and 3 can be selected from any one of A, G, N, or S.
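One way to apply the consensus notation above in a database scan is to translate it into a regular expression. The Python sketch below is a minimal illustration under stated assumptions: protein sequences are uppercase one-letter codes, "x" is treated as any uppercase letter, and the names `paired_consensus`, `consensus_to_regex`, and `find_candidates` are hypothetical. It does not reproduce the search actually performed against the NCBI, MGRast, JGI, or EBI databases.

```python
import re

# Residue classes for the symbols used in SEQ ID NO: 292-296.
CLASSES = {
    "1": "[AFILMTV]",
    "2": "[DEKNMSRQ]",
    "3": "[AGNS]",
    "x": "[A-Z]",  # assumption: any uppercase one-letter residue code
}

REPEAT_MOTIF = "1xxx211x1xxx33x2x1xxxxxxxxx1"


def paired_consensus(spacer_len):
    """First repeat motif, a spacer of `spacer_len` free residues, second repeat motif."""
    return REPEAT_MOTIF + "x" * spacer_len + REPEAT_MOTIF


def consensus_to_regex(consensus):
    """Translate the 1/2/3/x notation into a compiled regular expression."""
    return re.compile("".join(CLASSES[symbol] for symbol in consensus))


def find_candidates(protein, spacer_lengths=range(4, 9)):
    """Return (start_index, spacer_length) for every consensus match in `protein`.

    Spacer lengths 4 through 8 correspond to SEQ ID NO: 292 through SEQ ID NO: 296.
    """
    hits = []
    for spacer_len in spacer_lengths:
        pattern = consensus_to_regex(paired_consensus(spacer_len))
        for match in pattern.finditer(protein):
            hits.append((match.start(), spacer_len))
    return hits
```

For example, `find_candidates(protein_sequence)` returns an empty list when no pair of repeat motifs with a 4 to 8 residue spacer is present.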


Mixed DNA Binding Domains

In some embodiments, the present disclosure provides DNA binding domains in which the repeat units, the N-terminus, and the C-terminus can be derived from any one of Ralstonia solanacearum, Xanthomonas spp., Legionella quateirensis, Burkholderia, Paraburkholderia, or Francisella. For example, the present disclosure provides a DNA binding domain wherein the plurality of repeat units are selected from any one of SEQ ID NO: 168-SEQ ID NO: 263 or SEQ ID NO: 336-SEQ ID NO: 356 and can further comprise an N-terminus and/or C-terminus from Xanthomonas spp. (N-termini: SEQ ID NO: 298, SEQ ID NO: 300, SEQ ID NO: 301, and SEQ ID NO: 321; C-termini: SEQ ID NO: 302 and SEQ ID NO: 298) or Legionella quateirensis (N-termini: SEQ ID NO: 304 or SEQ ID NO: 322; C-termini: SEQ ID NO: 305 and SEQ ID NO: 306). In some embodiments, the present disclosure provides modular DNA binding domains in which the repeat units can be from Ralstonia solanacearum (e.g., any one of SEQ ID NO: 168-SEQ ID NO: 263 or SEQ ID NO: 336-SEQ ID NO: 356), Xanthomonas spp. (e.g., any one of SEQ ID NO: 323-SEQ ID NO: 334), an animal pathogen such as Legionella quateirensis, Burkholderia, Paraburkholderia, or Francisella (e.g., any one of SEQ ID NO: 357, SEQ ID NO: 282-SEQ ID NO: 290, or SEQ ID NO: 358-SEQ ID NO: 446), or any combination thereof.


Nucleases for Genome Editing

Genome editing can include the process of modifying a DNA of a cell in order to introduce or knock out a target gene or a target gene region. In some instances, a subject may have a disease in which a protein is aberrantly expressed or completely lacking. One therapeutic strategy for treating this disease can be introduction of a target gene or a target gene region to correct the aberrant or missing protein. For example, genome editing can be used to modify the DNA of a cell in the subject in order to introduce a functional gene, which gives rise to a functional protein. Introduction of this functional gene and expression of the functional protein can relieve the disease state of the subject.


In other instances, a subject may have a disease in which a protein is overexpressed or is targeted by a virus for infection of a cell. Alternatively, a therapy such as a cell therapy for cancer can be ineffective due to repression of certain processes by tumor cells (e.g., checkpoint inhibition). Still alternatively, it may be desirable to eliminate a particular protein expressed at the surface of a cell in order to generate a universal, off-the-shelf cell therapy for a subject in need thereof (e.g., TCR). In such cases, it can be desirable to partially or completely knock out the gene encoding such a protein. Genome editing can be used to modify the DNA of a cell in the subject in order to partially or completely knock out the target gene, thus reducing or eliminating expression of the protein of interest.


Genome editing can include the use of any nuclease as described herein in combination with any DNA binding domain disclosed herein in order to bind to a target gene or target gene region and induce a double strand break, mediated by the nuclease. Genes can be introduced during this process, or DNA binding domains can be designed to cut at regions of the DNA such that after non-homologous end joining, the target gene or target gene region is removed. Genome editing systems that are further disclosed and described in detail herein can include DNA binding domains from Xanthomonas, Ralstonia, Legionella, Burkholderia, Paraburkholderia, or Francisella fused to nucleases.


The specificity and efficiency of genome editing can be dependent on the nuclease responsible for cleavage. More than 3,000 type II restriction endonucleases have been identified. They recognize short, usually palindromic, sequences of 4-8 bp and, in the presence of Mg2+, cleave the DNA within or in close proximity to the recognition sequence. Naturally, type IIS restriction enzymes themselves have a DNA recognition domain that can be separated from the catalytic, or cleavage, domain. As such, since cleavage occurs at a site adjacent to the DNA sequence bound by the recognition domain, these enzymes can be referred to as exhibiting “shifted” cleavage. These type IIS restriction enzymes having both the recognition domain and the cleavage domain can be 400-600 amino acids. The main criterion for classifying a restriction endonuclease as a type II enzyme is that it cleaves specifically within or close to its recognition site and that it does not require ATP hydrolysis for its nucleolytic activity. An example of a type II restriction endonuclease is FokI, which consists of a DNA recognition domain and a non-specific DNA cleavage domain. FokI cleaves DNA nine and thirteen bases downstream of an asymmetric sequence (recognizing a DNA sequence of GGATG).
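The "shifted" cleavage of wild-type FokI can be illustrated with a short Python sketch that locates GGATG on one strand and reports the two cut positions. The sketch assumes the common GGATG(9/13) convention, in which the strand carrying GGATG is cut nine bases downstream of the site and the complementary strand thirteen bases downstream; it scans only the strand supplied, and the function name is hypothetical.

```python
import re


def foki_cut_sites(top_strand):
    """Return (site_start, top_cut, bottom_cut) for each GGATG found on `top_strand`.

    All values are 0-based indices on the supplied strand. A cut index is the index
    of the first base on the 3' side of the cut. Sites too close to the end of the
    sequence for both cuts to fall inside it are skipped.
    """
    sequence = top_strand.upper()
    results = []
    # A lookahead is used so that overlapping occurrences would also be reported.
    for match in re.finditer("(?=GGATG)", sequence):
        site_end = match.start() + len("GGATG")
        top_cut = site_end + 9      # nine bases downstream on the recognition strand
        bottom_cut = site_end + 13  # thirteen bases downstream on the complementary strand
        if bottom_cut <= len(sequence):
            results.append((match.start(), top_cut, bottom_cut))
    return results


example = "AAGGATG" + "ACGT" * 5
print(foki_cut_sites(example))
```

The four-base difference between the two cut positions corresponds to the staggered ends left by the enzyme.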


In some embodiments, the DNA cleavage domain at the C-terminus of FokI itself can be combined with a variety of DNA-binding domains (e.g., RNBDs, TALEs, MAP-NBDs) of other molecules for genome editing purposes. This cleavage domain can be 180 amino acids in length and can be directly linked to a DNA binding domain (e.g., RNBDs, TALEs, MAP-NBDs). In some embodiments, the FokI cleavage domain only comprises a single catalytic site. Thus, in order to cleave phosphodiester bonds, these enzymes form transient homodimers, providing two catalytic sites capable of cleaving double stranded DNA. In some embodiments, a single DNA-binding domain (e.g., RNBDs, TALEs, MAP-NBDs) linked to a Type IIS cleaving domain may not nick the double stranded DNA at the targeted site. In some embodiments, cleaving of target DNA only occurs when a pair of DNA-binding domains (e.g., RNBDs, TALEs, MAP-NBDs), each linked to a Type IIS cleaving domain (e.g., any one of SEQ ID NO: 1-SEQ ID NO: 81 (nucleotide sequences of SEQ ID NO: 82-SEQ ID NO: 162)) bind to opposing strands of DNA and allow for formation of a transient homodimer in the spacer region (the base pairs between the C-terminus of the DNA binding domain on a top strand of DNA and the C-terminus of the DNA binding domain on a bottom strand of DNA). Said spacer region can be greater than 2 base pairs, greater than 5 base pairs, greater than 10 base pairs, greater than 15 base pairs, greater than 24 base pairs, greater than 25 base pairs, greater than 30 base pairs, greater than 35 base pairs, greater than 40 base pairs, greater than 45 base pairs, or greater than 50 base pairs. In some embodiments, the spacer region can be anywhere from 2 to 50 base pairs, 5 to 40 base pairs, 10 to 30 base pairs, 14 to 40 base pairs, 24 to 30 base pairs, 24 to 40 base pairs, or 24 to 50 base pairs. 
In some embodiments, the nuclease disclosed herein (e.g., any one of SEQ ID NO: 1-SEQ ID NO: 81 (nucleotide sequences of SEQ ID NO: 82-SEQ ID NO: 162)) can be capable of cleaving over a spacer region of greater than 24 base pairs upon formation of a transient homodimer.
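The spacer definition above can be made concrete with a small Python sketch. The coordinate convention is an assumption made for illustration: both binding sites are described in top-strand coordinates, the left site ends at `left_site_end` (exclusive) and the right site begins at `right_site_start`, so the base pairs in between form the spacer. The function names and the default 24 to 40 base pair window (one of the ranges listed above) are illustrative choices.

```python
def spacer_length(left_site_end, right_site_start):
    """Number of base pairs between two half-sites given in top-strand coordinates.

    `left_site_end` is the index just past the last base of the left site and
    `right_site_start` is the index of the first base of the right site.
    """
    if right_site_start < left_site_end:
        raise ValueError("binding sites overlap")
    return right_site_start - left_site_end


def spacer_in_range(length, minimum=24, maximum=40):
    """Check a spacer length against an inclusive window, e.g. 24 to 40 base pairs."""
    return minimum <= length <= maximum


length = spacer_length(left_site_end=120, right_site_start=146)
print(length, spacer_in_range(length))
```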


Comparative analyses showed that FokI phylogenetic groupings can be at least partially explained by a combination of local gene duplication and the whole-genome duplication event that predates their speciation; however, the enzymes vary significantly in their activities. In some aspects, the disclosure provides enzymes identified in phylogenetic, molecular, and comparative analyses of sequences from various proteins related to FokI in various sequenced species. In some instances, such enzymes can comprise one or more mutations relative to SEQ ID NO: 1-SEQ ID NO: 81 (nucleotide sequences of SEQ ID NO: 82-SEQ ID NO: 162). In some cases, the non-naturally occurring enzymes described herein can comprise about 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more mutations. A mutation can be engineered to enhance cleavage efficiency. A mutation can abolish cleavage activity. In some cases, a mutation can enhance homodimerization. For example, FokI can have a mutation at one or more of amino acid residue positions 446, 447, 479, 483, 484, 486, 487, 490, 491, 496, 498, 499, 500, 531, 534, 537, and 538 to modulate homodimerization, and similar mutations can be designed based on the phylogenetic analysis of SEQ ID NO: 1-SEQ ID NO: 81 (nucleotide sequences of SEQ ID NO: 82-SEQ ID NO: 162).


TABLE 6 shows exemplary amino acid sequences (SEQ ID NO: 1-SEQ ID NO: 81) of endonucleases for genome editing and the corresponding back-translated nucleic acid sequences (SEQ ID NO: 82-SEQ ID NO: 162) of the endonucleases, which were obtained using Genius software and selecting for human codon optimization.
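Back-translation of the kind described above can be sketched in Python with a one-codon-per-residue table. The codon choices below were read off the amino acid and nucleic acid pairs shown in TABLE 6 (for example, F as TTC and L as CTG); that the software used exactly this table for every residue is an assumption, and the names `CODON_FOR` and `back_translate` are hypothetical.

```python
# One codon per amino acid, as observed in the SEQ ID NO: 1 / SEQ ID NO: 82 pair
# and neighboring entries of TABLE 6. Treat this table as an inferred assumption.
CODON_FOR = {
    "A": "GCC", "C": "TGC", "D": "GAC", "E": "GAG", "F": "TTC",
    "G": "GGC", "H": "CAC", "I": "ATC", "K": "AAG", "L": "CTG",
    "M": "ATG", "N": "AAC", "P": "CCC", "Q": "CAG", "R": "AGG",
    "S": "AGC", "T": "ACC", "V": "GTG", "W": "TGG", "Y": "TAC",
}


def back_translate(protein):
    """Return a DNA sequence encoding `protein`, using one fixed codon per residue."""
    codons = []
    for residue in protein.upper():
        if residue not in CODON_FOR:
            raise ValueError(f"unsupported residue {residue!r}")
        codons.append(CODON_FOR[residue])
    return "".join(codons)


# First fourteen residues of SEQ ID NO: 1.
print(back_translate("FLVKGAMEIKKSEL"))
```

The printed sequence can be compared with the first 42 nucleotides of SEQ ID NO: 82 in TABLE 6.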









TABLE 6
Amino Acid and Nucleic Acid Sequences of Endonucleases

SEQ ID NO | Amino Acid Sequence | SEQ ID NO | Back Translated Nucleic Acid Sequences













1
FLVKGAMEIKKSEL
82
TTCCTGGTGAAGGGCGCCATGGAGATCAAGAAGAGCGAGCTGA



RHKLRHVPHEYIELI

GGCACAAGCTGAGGCACGTGCCCCACGAGTACATCGAGCTGATC



EIAQDSKQNRLLEFK

GAGATCGCCCAGGACAGCAAGCAGAACAGGCTGCTGGAGTTCA



VVEFFKKIYGYRGK

AGGTGGTGGAGTTCTTCAAGAAGATCTACGGCTACAGGGGCAA



HLGGSRKPDGALFT

GCACCTGGGCGGCAGCAGGAAGCCCGACGGCGCCCTGTTCACC



DGLVLNHGIILDTKA

GACGGCCTGGTGCTGAACCACGGCATCATCCTGGACACCAAGGC



YKDGYRLPISQADE

CTACAAGGACGGCTACAGGCTGCCCATCAGCCAGGCCGACGAG



MQRYVDENNKRSQ

ATGCAGAGGTACGTGGACGAGAACAACAAGAGGAGCCAGGTGA



VINPNEWWEIYPTSI

TCAACCCCAACGAGTGGTGGGAGATCTACCCCACCAGCATCACC



TDFKFLFVSGFFQGD

GACTTCAAGTTCCTGTTCGTGAGCGGCTTCTTCCAGGGCGACTAC



YRKQLERVSHLTKC

AGGAAGCAGCTGGAGAGGGTGAGCCACCTGACCAAGTGCCAGG



QGAVMSVEQLLLGG

GCGCCGTGATGAGCGTGGAGCAGCTGCTGCTGGGCGGCGAGAA



EKIKEGSLTLEEVGK

GATCAAGGAGGGCAGCCTGACCCTGGAGGAGGTGGGCAAGAAG



KFKNDEIVF

TTCAAGAACGACGAGATCGTGTTC





2
QIVKSSIEMSKANM
83
CAGATCGTGAAGAGCAGCATCGAGATGAGCAAGGCCAACATGA



RDNLQMLPHDYIELI

GGGACAACCTGCAGATGCTGCCCCACGACTACATCGAGCTGATC



EISQDPYQNRIFEMK

GAGATCAGCCAGGACCCCTACCAGAACAGGATCTTCGAGATGA



VMDLFINEYGFSGS

AGGTGATGGACCTGTTCATCAACGAGTACGGCTTCAGCGGCAGC



HLGGSRKPDGAMY

CACCTGGGCGGCAGCAGGAAGCCCGACGGCGCCATGTACGCCC



AHGFGVIVDTKAYK

ACGGCTTCGGCGTGATCGTGGACACCAAGGCCTACAAGGACGG



DGYNLPISQADEME

CTACAACCTGCCCATCAGCCAGGCCGACGAGATGGAGAGGTAC



RYVRENIDRNEHVN

GTGAGGGAGAACATCGACAGGAACGAGCACGTGAACAGCAACA



SNRWWNIFPEDTNE

GGTGGTGGAACATCTTCCCCGAGGACACCAACGAGTACAAGTTC



YKFLFVSGFFKGNFE

CTGTTCGTGAGCGGCTTCTTCAAGGGCAACTTCGAGAAGCAGCT



KQLERISIDTGVQGG

GGAGAGGATCAGCATCGACACCGGCGTGCAGGGCGGCGCCCTG



ALSVEHLLLGAEYIK

AGCGTGGAGCACCTGCTGCTGGGCGCCGAGTACATCAAGAGGG



RGILTLYDFKNSFLN

GCATCCTGACCCTGTACGACTTCAAGAACAGCTTCCTGAACAAG



KEIQF

GAGATCCAGTTC





3
QTIKSSIEELKSELRT
84
CAGACCATCAAGAGCAGCATCGAGGAGCTGAAGAGCGAGCTGA



QLNVISHDYLQLVDI

GGACCCAGCTGAACGTGATCAGCCACGACTACCTGCAGCTGGTG



SQDSQQNRLFEMKV

GACATCAGCCAGGACAGCCAGCAGAACAGGCTGTTCGAGATGA



MDLFINEFGYNGSH

AGGTGATGGACCTGTTCATCAACGAGTTCGGCTACAACGGCAGC



LGGSRKPDGILYTEG

CACCTGGGCGGCAGCAGGAAGCCCGACGGCATCCTGTACACCG



LSKDYGIIVDTKAYK

AGGGCCTGAGCAAGGACTACGGCATCATCGTGGACACCAAGGC



DGYNLPIAQADEME

CTACAAGGACGGCTACAACCTGCCCATCGCCCAGGCCGACGAG



RYIRENIDRNEVVNP

ATGGAGAGGTACATCAGGGAGAACATCGACAGGAACGAGGTGG



NRWWEVFPSKINDY

TGAACCCCAACAGGTGGTGGGAGGTGTTCCCCAGCAAGATCAAC



KFLFVSAYFKGNFK

GACTACAAGTTCCTGTTCGTGAGCGCCTACTTCAAGGGCAACTT



EQLERISINTGILGGA

CAAGGAGCAGCTGGAGAGGATCAGCATCAACACCGGCATCCTG



ISVEHLLLGAEYFKR

GGCGGCGCCATCAGCGTGGAGCACCTGCTGCTGGGCGCCGAGTA



GILSLEDVRDKFCNT

CTTCAAGAGGGGCATCCTGAGCCTGGAGGACGTGAGGGACAAG



EIEF

TTCTGCAACACCGAGATCGAGTTC





4
GKSEVETIKEQMRG
85
GGCAAGAGCGAGGTGGAGACCATCAAGGAGCAGATGAGGGGCG



ELTHLSHEYLGLLDL

AGCTGACCCACCTGAGCCACGAGTACCTGGGCCTGCTGGACCTG



AYDSKQNRLFELKT

GCCTACGACAGCAAGCAGAACAGGCTGTTCGAGCTGAAGACCA



MQLLTEECGFEGLH

TGCAGCTGCTGACCGAGGAGTGCGGCTTCGAGGGCCTGCACCTG



LGGSRKPDGIVYTK

GGCGGCAGCAGGAAGCCCGACGGCATCGTGTACACCAAGGACG



DENEQVGKENYGIII

AGAACGAGCAGGTGGGCAAGGAGAACTACGGCATCATCATCGA



DTKAYSGGYSLPISQ

CACCAAGGCCTACAGCGGCGGCTACAGCCTGCCCATCAGCCAGG



ADEMERYIGENQTR

CCGACGAGATGGAGAGGTACATCGGCGAGAACCAGACCAGGGA



DIRINPNEWWKNFG

CATCAGGATCAACCCCAACGAGTGGTGGAAGAACTTCGGCGAC



DGVTEYYYLFVAGH

GGCGTGACCGAGTACTACTACCTGTTCGTGGCCGGCCACTTCAA



FKGKYQEQIDRINCN

GGGCAAGTACCAGGAGCAGATCGACAGGATCAACTGCAACAAG



KNIKGAAVSIQQLLR

AACATCAAGGGCGCCGCCGTGAGCATCCAGCAGCTGCTGAGGA



IVNDYKAGKLTHED

TCGTGAACGACTACAAGGCCGGCAAGCTGACCCACGAGGACAT



MKLKIFHY

GAAGCTGAAGATCTTCCACTAC





5
MKILELLINECGYKG
86
ATGAAGATCCTGGAGCTGCTGATCAACGAGTGCGGCTACAAGG



LHLGGARKPDGIIYT

GCCTGCACCTGGGCGGCGCCAGGAAGCCCGACGGCATCATCTAC



EKEKYNYGVIIDTK

ACCGAGAAGGAGAAGTACAACTACGGCGTGATCATCGACACCA



AYSKGYNLPIGQIDE

AGGCCTACAGCAAGGGCTACAACCTGCCCATCGGCCAGATCGAC



MIRYIIENNERNIKR

GAGATGATCAGGTACATCATCGAGAACAACGAGAGGAACATCA



NTNCWWNNFEKNV

AGAGGAACACCAACTGCTGGTGGAACAACTTCGAGAAGAACGT



NEFYFSFISGEFTGNI

GAACGAGTTCTACTTCAGCTTCATCAGCGGCGAGTTCACCGGCA



EEKLNRIFISTNIKGN

ACATCGAGGAGAAGCTGAACAGGATCTTCATCAGCACCAACATC



AMSVKTLLYLANEI

AAGGGCAACGCCATGAGCGTGAAGACCCTGCTGTACCTGGCCA



KANRISYIELLNYFD

ACGAGATCAAGGCCAACAGGATCAGCTACATCGAGCTGCTGAA



NKV

CTACTTCGACAACAAGGTG





6
AKSSQSETKEKLRE
87
GCCAAGAGCAGCCAGAGCGAGACCAAGGAGAAGCTGAGGGAG



KLRNLPHEYLSLVD

AAGCTGAGGAACCTGCCCCACGAGTACCTGAGCCTGGTGGACCT



LAYDSKQNRLFEMK

GGCCTACGACAGCAAGCAGAACAGGCTGTTCGAGATGAAGGTG



VIELLTEECGFQGLH

ATCGAGCTGCTGACCGAGGAGTGCGGCTTCCAGGGCCTGCACCT



LGGSRRPDGVLYTA

GGGCGGCAGCAGGAGGCCCGACGGCGTGCTGTACACCGCCGGC



GLTDNYGIILDTKAY

CTGACCGACAACTACGGCATCATCCTGGACACCAAGGCCTACAG



SSGYSLPIAQADEME

CAGCGGCTACAGCCTGCCCATCGCCCAGGCCGACGAGATGGAG



RYVRENQTRDELVN

AGGTACGTGAGGGAGAACCAGACCAGGGACGAGCTGGTGAACC



PNQWWENFENGLG

CCAACCAGTGGTGGGAGAACTTCGAGAACGGCCTGGGCACCTTC



TFYFLFVAGHFNGN

TACTTCCTGTTCGTGGCCGGCCACTTCAACGGCAACGTGCAGGC



VQAQLERISRNTGV

CCAGCTGGAGAGGATCAGCAGGAACACCGGCGTGCTGGGCGCC



LGAAASISQLLLLAD

GCCGCCAGCATCAGCCAGCTGCTGCTGCTGGCCGACGCCATCAG



AIRGGRMDRERLRH

GGGCGGCAGGATGGACAGGGAGAGGCTGAGGCACCTGATGTTC



LMFQNEEFL

CAGAACGAGGAGTTCCTG





7
NSEKSEFTQEKDNL
88
AACAGCGAGAAGAGCGAGTTCACCCAGGAGAAGGACAACCTGA



REKLDTLSHEYLSLV

GGGAGAAGCTGGACACCCTGAGCCACGAGTACCTGAGCCTGGT



DLAFDSQQNRLFEM

GGACCTGGCCTTCGACAGCCAGCAGAACAGGCTGTTCGAGATGA



KTVELLTKECNYKG

AGACCGTGGAGCTGCTGACCAAGGAGTGCAACTACAAGGGCGT



VHLGGSRKPDGIIYT

GCACCTGGGCGGCAGCAGGAAGCCCGACGGCATCATCTACACC



ENSTDNYGVIIDTKA

GAGAACAGCACCGACAACTACGGCGTGATCATCGACACCAAGG



YSNGYNLPISQVDE

CCTACAGCAACGGCTACAACCTGCCCATCAGCCAGGTGGACGAG



MVRYVEENNKREK

ATGGTGAGGTACGTGGAGGAGAACAACAAGAGGGAGAAGGAG



ERNSNEWWKEFGD

AGGAACAGCAACGAGTGGTGGAAGGAGTTCGGCGACAACATCA



NINKFYFSFISGKFIG

ACAAGTTCTACTTCAGCTTCATCAGCGGCAAGTTCATCGGCAAC



NIEEKLQRITIFTNVY

ATCGAGGAGAAGCTGCAGAGGATCACCATCTTCACCAACGTGTA



GNAMTIITLLYLANE

CGGCAACGCCATGACCATCATCACCCTGCTGTACCTGGCCAACG



IKANRLKTMEVVKY

AGATCAAGGCCAACAGGCTGAAGACCATGGAGGTGGTGAAGTA



FDNKV

CTTCGACAACAAGGTG





8
NLTCSDLTEIKEEVR
89
AACCTGACCTGCAGCGACCTGACCGAGATCAAGGAGGAGGTGA



NALTHLSHEYLALID

GGAACGCCCTGACCCACCTGAGCCACGAGTACCTGGCCCTGATC



LAYDSTQNRLFEMK

GACCTGGCCTACGACAGCACCCAGAACAGGCTGTTCGAGATGA



TLQLLVEECGYQGT

AGACCCTGCAGCTGCTGGTGGAGGAGTGCGGCTACCAGGGCAC



HLGGSRKPDGICYSE

CCACCTGGGCGGCAGCAGGAAGCCCGACGGCATCTGCTACAGC



EAKSEGLEANYGIII

GAGGAGGCCAAGAGCGAGGGCCTGGAGGCCAACTACGGCATCA



DTKSYSGGYGLPISQ

TCATCGACACCAAGAGCTACAGCGGCGGCTACGGCCTGCCCATC



ADEMERYIRENQTR

AGCCAGGCCGACGAGATGGAGAGGTACATCAGGGAGAACCAGA



DAEVNRNKWWEAF

CCAGGGACGCCGAGGTGAACAGGAACAAGTGGTGGGAGGCCTT



PETIDIFYFMFVAGH

CCCCGAGACCATCGACATCTTCTACTTCATGTTCGTGGCCGGCCA



FKGNYFNQLERLQR

CTTCAAGGGCAACTACTTCAACCAGCTGGAGAGGCTGCAGAGG



STGIKGAAVDIKTLL

AGCACCGGCATCAAGGGCGCCGCCGTGGACATCAAGACCCTGCT



LTANRCKTGELDHA

GCTGACCGCCAACAGGTGCAAGACCGGCGAGCTGGACCACGCC



GIESCFFNNCRL

GGCATCGAGAGCTGCTTCTTCAACAACTGCAGGCTG





9
DNVKSNFNQEKDEL
90
GACAACGTGAAGAGCAACTTCAACCAGGAGAAGGACGAGCTGA



REKLDTLSHEYLYL

GGGAGAAGCTGGACACCCTGAGCCACGAGTACCTGTACCTGCTG



LDLAYDSKQNKLFE

GACCTGGCCTACGACAGCAAGCAGAACAAGCTGTTCGAGATGA



MKILELLINECGYRG

AGATCCTGGAGCTGCTGATCAACGAGTGCGGCTACAGGGGCCTG



LHLGGVRKPDGIIYT

CACCTGGGCGGCGTGAGGAAGCCCGACGGCATCATCTACACCG



EKEKYNYGVIIDTK

AGAAGGAGAAGTACAACTACGGCGTGATCATCGACACCAAGGC



AYSKGYNLPIGQIDE

CTACAGCAAGGGCTACAACCTGCCCATCGGCCAGATCGACGAG



MIRYIIENNERNIKR

ATGATCAGGTACATCATCGAGAACAACGAGAGGAACATCAAGA



NTNCWWNNFEKNV

GGAACACCAACTGCTGGTGGAACAACTTCGAGAAGAACGTGAA



NEFYFSFISGEFTGNI

CGAGTTCTACTTCAGCTTCATCAGCGGCGAGTTCACCGGCAACA



EEKLNRIFISTNIKGN

TCGAGGAGAAGCTGAACAGGATCTTCATCAGCACCAACATCAA



AMSVKTLLYLANEI

GGGCAACGCCATGAGCGTGAAGACCCTGCTGTACCTGGCCAACG



KANRISFLEMEKYF

AGATCAAGGCCAACAGGATCAGCTTCCTGGAGATGGAGAAGTA



DNKV

CTTCGACAACAAGGTG





10
EGIKSNISLLKDELR
91
GAGGGCATCAAGAGCAACATCAGCCTGCTGAAGGACGAGCTGA



GQISHISHEYLSLIDL

GGGGCCAGATCAGCCACATCAGCCACGAGTACCTGAGCCTGATC



AFDSKQNRLFEMKV

GACCTGGCCTTCGACAGCAAGCAGAACAGGCTGTTCGAGATGA



LELLVNEYGFKGRH

AGGTGCTGGAGCTGCTGGTGAACGAGTACGGCTTCAAGGGCAG



LGGSRKPDGIVYSTT

GCACCTGGGCGGCAGCAGGAAGCCCGACGGCATCGTGTACAGC



LEDNFGIIVDTKAYS

ACCACCCTGGAGGACAACTTCGGCATCATCGTGGACACCAAGGC



EGYSLPISQADEMER

CTACAGCGAGGGCTACAGCCTGCCCATCAGCCAGGCCGACGAG



YVRENSNRDEEVNP

ATGGAGAGGTACGTGAGGGAGAACAGCAACAGGGACGAGGAG



NKWWENFSEEVKK

GTGAACCCCAACAAGTGGTGGGAGAACTTCAGCGAGGAGGTGA



YYFVFISGSFKGKFE

AGAAGTACTACTTCGTGTTCATCAGCGGCAGCTTCAAGGGCAAG



EQLRRLSMTTGVNG

TTCGAGGAGCAGCTGAGGAGGCTGAGCATGACCACCGGCGTGA



SAVNVVNLLLGAEK

ACGGCAGCGCCGTGAACGTGGTGAACCTGCTGCTGGGCGCCGA



IRSGEMTIEELERAM

GAAGATCAGGAGCGGCGAGATGACCATCGAGGAGCTGGAGAGG



FNNSEFI

GCCATGTTCAACAACAGCGAGTTCATC





11
ISKTNVLELKDKVR
92
ATCAGCAAGACCAACGTGCTGGAGCTGAAGGACAAGGTGAGGG



DKLKYVDNRYLALI

ACAAGCTGAAGTACGTGGACAACAGGTACCTGGCCCTGATCGAC



DLAYDGTANRDFEI

CTGGCCTACGACGGCACCGCCAACAGGGACTTCGAGATCCAGAC



QTIDLLINELKFKGV

CATCGACCTGCTGATCAACGAGCTGAAGTTCAAGGGCGTGAGGC



RLGESRKPDGIISYDI

TGGGCGAGAGCAGGAAGCCCGACGGCATCATCAGCTACGACAT



NGVIIDNKAYSSGY

CAACGGCGTGATCATCGACAACAAGGCCTACAGCAGCGGCTAC



NLPINQADEMIRYIE

AACCTGCCCATCAACCAGGCCGACGAGATGATCAGGTACATCGA



ENQTRDKKINPNKW

GGAGAACCAGACCAGGGACAAGAAGATCAACCCCAACAAGTGG



WESFDDKVKDFNYL

TGGGAGAGCTTCGACGACAAGGTGAAGGACTTCAACTACCTGTT



FVSSFFKGNFKNNL

CGTGAGCAGCTTCTTCAAGGGCAACTTCAAGAACAACCTGAAGC



KHIANRTGVNGGVI

ACATCGCCAACAGGACCGGCGTGAACGGCGGCGTGATCAACGT



NVENLLYFAEELKS

GGAGAACCTGCTGTACTTCGCCGAGGAGCTGAAGAGCGGCAGG



GRLSYVDLFKMYDN

CTGAGCTACGTGGACCTGTTCAAGATGTACGACAACGACGAGAT



DEINI

CAACATC





12
ISKTNVLELKDKVR
93
ATCAGCAAGACCAACGTGCTGGAGCTGAAGGACAAGGTGAGGG



DKLKYVDHRYLALI

ACAAGCTGAAGTACGTGGACCACAGGTACCTGGCCCTGATCGAC



DLAYDGTANRDFEI

CTGGCCTACGACGGCACCGCCAACAGGGACTTCGAGATCCAGAC



QTIDLLINELKFKGV

CATCGACCTGCTGATCAACGAGCTGAAGTTCAAGGGCGTGAGGC



RLGESRKPDGIISYDI

TGGGCGAGAGCAGGAAGCCCGACGGCATCATCAGCTACGACAT



NGVIIDNKAYSTGY

CAACGGCGTGATCATCGACAACAAGGCCTACAGCACCGGCTAC



NLPINQADEMIRYIE

AACCTGCCCATCAACCAGGCCGACGAGATGATCAGGTACATCGA



ENQTRDKKINSNKW

GGAGAACCAGACCAGGGACAAGAAGATCAACAGCAACAAGTGG



WESFDDKVKNFNYL

TGGGAGAGCTTCGACGACAAGGTGAAGAACTTCAACTACCTGTT



FVSSFFKGNFKNNL

CGTGAGCAGCTTCTTCAAGGGCAACTTCAAGAACAACCTGAAGC



KHIANRTGVNGGAI

ACATCGCCAACAGGACCGGCGTGAACGGCGGCGCCATCAACGT



NVENLLYFAEELKA

GGAGAACCTGCTGTACTTCGCCGAGGAGCTGAAGGCCGGCAGG



GRLSYVDSFTMYDN

CTGAGCTACGTGGACAGCTTCACCATGTACGACAACGACGAGAT



DEIYV

CTACGTG





13
KAEKSEFLIEKDKLR
94
AAGGCCGAGAAGAGCGAGTTCCTGATCGAGAAGGACAAGCTGA



EKLDTLPHDYLSMV

GGGAGAAGCTGGACACCCTGCCCCACGACTACCTGAGCATGGTG



DLAYDSKQNRLFEM

GACCTGGCCTACGACAGCAAGCAGAACAGGCTGTTCGAGATGA



KTIELLINECNYKGL

AGACCATCGAGCTGCTGATCAACGAGTGCAACTACAAGGGCCTG



HLGGTRKPDGIVYT

CACCTGGGCGGCACCAGGAAGCCCGACGGCATCGTGTACACCA



NNEVENYGIIIDTKA

ACAACGAGGTGGAGAACTACGGCATCATCATCGACACCAAGGC



YSKGYNLPISQVDE

CTACAGCAAGGGCTACAACCTGCCCATCAGCCAGGTGGACGAG



MTRYVEENNKREK

ATGACCAGGTACGTGGAGGAGAACAACAAGAGGGAGAAGAAG



KRNPNEWWNNFDS

AGGAACCCCAACGAGTGGTGGAACAACTTCGACAGCAACGTGA



NVKKFYFSFISGKFV

AGAAGTTCTACTTCAGCTTCATCAGCGGCAAGTTCGTGGGCAAC



GNIEEKLQRITLFTEI

ATCGAGGAGAAGCTGCAGAGGATCACCCTGTTCACCGAGATCTA



YGNAITVTTLLYIAN

CGGCAACGCCATCACCGTGACCACCCTGCTGTACATCGCCAACG



EIKANRMKKSDIME

AGATCAAGGCCAACAGGATGAAGAAGAGCGACATCATGGAGTA



YFNDKV

CTTCAACGACAAGGTG





14
ISKTNVLELKDKVR
95
ATCAGCAAGACCAACGTGCTGGAGCTGAAGGACAAGGTGAGGG



DKLKYVDHRYLALI

ACAAGCTGAAGTACGTGGACCACAGGTACCTGGCCCTGATCGAC



DLAYDGTANRDFEI

CTGGCCTACGACGGCACCGCCAACAGGGACTTCGAGATCCAGAC



QTIDLLINELKFKGV

CATCGACCTGCTGATCAACGAGCTGAAGTTCAAGGGCGTGAGGC



RLGESRKPDGIISYNI

TGGGCGAGAGCAGGAAGCCCGACGGCATCATCAGCTACAACAT



NGVIIDNKAYSTGY

CAACGGCGTGATCATCGACAACAAGGCCTACAGCACCGGCTAC



NLPINQADEMIRYIE

AACCTGCCCATCAACCAGGCCGACGAGATGATCAGGTACATCGA



ENQTRDEKINSNKW

GGAGAACCAGACCAGGGACGAGAAGATCAACAGCAACAAGTGG



WESFDDEVKDFNYL

TGGGAGAGCTTCGACGACGAGGTGAAGGACTTCAACTACCTGTT



FVSSFFKGNFKNNL

CGTGAGCAGCTTCTTCAAGGGCAACTTCAAGAACAACCTGAAGC



KHIANRTGVNGGAI

ACATCGCCAACAGGACCGGCGTGAACGGCGGCGCCATCAACGT



NVENLLYFAEELKA

GGAGAACCTGCTGTACTTCGCCGAGGAGCTGAAGGCCGGCAGG



GRLSYVDSFTMYDN

CTGAGCTACGTGGACAGCTTCACCATGTACGACAACGACGAGAT



DEIYV

CTACGTG





15
ISKTNILELKDKVRD
96
ATCAGCAAGACCAACATCCTGGAGCTGAAGGACAAGGTGAGGG



KLKYVDHRYLALID

ACAAGCTGAAGTACGTGGACCACAGGTACCTGGCCCTGATCGAC



LAYDGTANRDFEIQ

CTGGCCTACGACGGCACCGCCAACAGGGACTTCGAGATCCAGAC



TIDLLINELKFKGVR

CATCGACCTGCTGATCAACGAGCTGAAGTTCAAGGGCGTGAGGC



LGESRKPDGIISYNIN

TGGGCGAGAGCAGGAAGCCCGACGGCATCATCAGCTACAACAT



GVIIDNKAYSTGYNL

CAACGGCGTGATCATCGACAACAAGGCCTACAGCACCGGCTAC



PINQADEMIRYIEEN

AACCTGCCCATCAACCAGGCCGACGAGATGATCAGGTACATCGA



QTRDEKINSNKWWE

GGAGAACCAGACCAGGGACGAGAAGATCAACAGCAACAAGTGG



SFDEKVKDFNYLFV

TGGGAGAGCTTCGACGAGAAGGTGAAGGACTTCAACTACCTGTT



SSFFKGNFKNNLKHI

CGTGAGCAGCTTCTTCAAGGGCAACTTCAAGAACAACCTGAAGC



ANRTGVNGGAINVE

ACATCGCCAACAGGACCGGCGTGAACGGCGGCGCCATCAACGT



NLLYFAEELKAGRIS

GGAGAACCTGCTGTACTTCGCCGAGGAGCTGAAGGCCGGCAGG



YLDSFKMYNNDEIY

ATCAGCTACCTGGACAGCTTCAAGATGTACAACAACGACGAGAT



L

CTACCTG





16
ISKTNVLELKDKVR
97
ATCAGCAAGACCAACGTGCTGGAGCTGAAGGACAAGGTGAGGG



DKLKYVDHRYLALI

ACAAGCTGAAGTACGTGGACCACAGGTACCTGGCCCTGATCGAC



DLAYDGTANRDFEI

CTGGCCTACGACGGCACCGCCAACAGGGACTTCGAGATCCAGAC



QTIDLLINELKFKGV

CATCGACCTGCTGATCAACGAGCTGAAGTTCAAGGGCGTGAGGC



RLGESRKPDGIISYNI

TGGGCGAGAGCAGGAAGCCCGACGGCATCATCAGCTACAACAT



NGVIIDNKAYSTGY

CAACGGCGTGATCATCGACAACAAGGCCTACAGCACCGGCTAC



NLPINQADEMIRYIE

AACCTGCCCATCAACCAGGCCGACGAGATGATCAGGTACATCGA



ENQTRDEKINSNKW

GGAGAACCAGACCAGGGACGAGAAGATCAACAGCAACAAGTGG



WESFDDKVKDFNYL

TGGGAGAGCTTCGACGACAAGGTGAAGGACTTCAACTACCTGTT



FVSSFFKGNFKNNL

CGTGAGCAGCTTCTTCAAGGGCAACTTCAAGAACAACCTGAAGC



KHIANRTGVSGGAI

ACATCGCCAACAGGACCGGCGTGAGCGGCGGCGCCATCAACGT



NVENLLYFAEELKA

GGAGAACCTGCTGTACTTCGCCGAGGAGCTGAAGGCCGGCAGG



GRLSYVDSFKMYDN

CTGAGCTACGTGGACAGCTTCAAGATGTACGACAACGACGAGAT



DEIYV

CTACGTG





17
ISKTNVLELKDKVR
98
ATCAGCAAGACCAACGTGCTGGAGCTGAAGGACAAGGTGAGGA



NKLKYVDHRYLALI

ACAAGCTGAAGTACGTGGACCACAGGTACCTGGCCCTGATCGAC



DLAYDGTANRDFEI

CTGGCCTACGACGGCACCGCCAACAGGGACTTCGAGATCCAGAC



QTIDLLINELKFKGV

CATCGACCTGCTGATCAACGAGCTGAAGTTCAAGGGCGTGAGGC



RLGESRKPDGIISYDI

TGGGCGAGAGCAGGAAGCCCGACGGCATCATCAGCTACGACAT



NGVIIDNKSYSTGYN

CAACGGCGTGATCATCGACAACAAGAGCTACAGCACCGGCTAC



LPINQADEMIRYIEE

AACCTGCCCATCAACCAGGCCGACGAGATGATCAGGTACATCGA



NQTRDEKINSNKW

GGAGAACCAGACCAGGGACGAGAAGATCAACAGCAACAAGTGG



WESFDEKVKDFNYL

TGGGAGAGCTTCGACGAGAAGGTGAAGGACTTCAACTACCTGTT



FVSSFFKGNFKNNL

CGTGAGCAGCTTCTTCAAGGGCAACTTCAAGAACAACCTGAAGC



KHIANRTGVNGGAI

ACATCGCCAACAGGACCGGCGTGAACGGCGGCGCCATCAACGT



NVENLLYFAEELKS

GGAGAACCTGCTGTACTTCGCCGAGGAGCTGAAGAGCGGCAGG



GRLSYVDSFTMYDN

CTGAGCTACGTGGACAGCTTCACCATGTACGACAACGACGAGAT



DEIYV

CTACGTG





18
ISKTNVLELKDKVR
99
ATCAGCAAGACCAACGTGCTGGAGCTGAAGGACAAGGTGAGGG



DKLKYVDHRYLSLI

ACAAGCTGAAGTACGTGGACCACAGGTACCTGAGCCTGATCGAC



DLAYDGNANRDFEI

CTGGCCTACGACGGCAACGCCAACAGGGACTTCGAGATCCAGA



QTIDLLINELNFKGV

CCATCGACCTGCTGATCAACGAGCTGAACTTCAAGGGCGTGAGG



RLGESRKPDGIISYNI

CTGGGCGAGAGCAGGAAGCCCGACGGCATCATCAGCTACAACA



NGVIIDNKAYSTGY

TCAACGGCGTGATCATCGACAACAAGGCCTACAGCACCGGCTAC



NLPINQADEMIRYIE

AACCTGCCCATCAACCAGGCCGACGAGATGATCAGGTACATCGA



ENQTRDEKINSNKW

GGAGAACCAGACCAGGGACGAGAAGATCAACAGCAACAAGTGG



WESFDDKVKDFNYL

TGGGAGAGCTTCGACGACAAGGTGAAGGACTTCAACTACCTGTT



FVSSFFKGNFKNNL

CGTGAGCAGCTTCTTCAAGGGCAACTTCAAGAACAACCTGAAGC



KHIANRTGVSGGAI

ACATCGCCAACAGGACCGGCGTGAGCGGCGGCGCCATCAACGT



NVENLLYFAEELKA

GGAGAACCTGCTGTACTTCGCCGAGGAGCTGAAGGCCGGCAGG



GRLSYADSFTMYDN

CTGAGCTACGCCGACAGCTTCACCATGTACGACAACGACGAGAT



DEIYV

CTACGTG





19
IAKTNVLGLKDKVR
100
ATCGCCAAGACCAACGTGCTGGGCCTGAAGGACAAGGTGAGGG



DRLKYVDHRYLALI

ACAGGCTGAAGTACGTGGACCACAGGTACCTGGCCCTGATCGAC



DLAYDGTANRDFEI

CTGGCCTACGACGGCACCGCCAACAGGGACTTCGAGATCCAGAC



QTIDLLINELKFKGV

CATCGACCTGCTGATCAACGAGCTGAAGTTCAAGGGCGTGAGGC



RLGESRKPDGIISYN

TGGGCGAGAGCAGGAAGCCCGACGGCATCATCAGCTACAACGT



VNGVIIDNKAYSKG

GAACGGCGTGATCATCGACAACAAGGCCTACAGCAAGGGCTAC



YNLPINQADEMIRYI

AACCTGCCCATCAACCAGGCCGACGAGATGATCAGGTACATCGA



EENQTRDEKINANK

GGAGAACCAGACCAGGGACGAGAAGATCAACGCCAACAAGTGG



WWESFDDKVEEFSY

TGGGAGAGCTTCGACGACAAGGTGGAGGAGTTCAGCTACCTGTT



LFVSSFFKGNFKNNL

CGTGAGCAGCTTCTTCAAGGGCAACTTCAAGAACAACCTGAAGC



KHIANRTGVNGGAI

ACATCGCCAACAGGACCGGCGTGAACGGCGGCGCCATCAACGT



NVENLLYFAEELKS

GGAGAACCTGCTGTACTTCGCCGAGGAGCTGAAGAGCGGCAGG



GRLSYMDSFSLYDN

CTGAGCTACATGGACAGCTTCAGCCTGTACGACAACGACGAGAT



DEICV

CTGCGTG





20
ELKDEQSEKRKAKF
101
GAGCTGAAGGACGAGCAGAGCGAGAAGAGGAAGGCCAAGTTCC



LKETKLPMKYIELLD

TGAAGGAGACCAAGCTGCCCATGAAGTACATCGAGCTGCTGGA



IAYDGKRNRDFEIVT

CATCGCCTACGACGGCAAGAGGAACAGGGACTTCGAGATCGTG



MELFREVYRLNSKL

ACCATGGAGCTGTTCAGGGAGGTGTACAGGCTGAACAGCAAGC



LGGGRKPDGLIYTD

TGCTGGGCGGCGGCAGGAAGCCCGACGGCCTGATCTACACCGA



DFGVIVDTKAYGEG

CGACTTCGGCGTGATCGTGGACACCAAGGCCTACGGCGAGGGCT



YSKSINQADEMIRYI

ACAGCAAGAGCATCAACCAGGCCGACGAGATGATCAGGTACAT



EDNKRRDEKRNPIK

CGAGGACAACAAGAGGAGGGACGAGAAGAGGAACCCCATCAA



WWESFPSSISQNNFY

GTGGTGGGAGAGCTTCCCCAGCAGCATCAGCCAGAACAACTTCT



FLWVSSKFVGKFQE

ACTTCCTGTGGGTGAGCAGCAAGTTCGTGGGCAAGTTCCAGGAG



QLAYTANETQTKGG

CAGCTGGCCTACACCGCCAACGAGACCCAGACCAAGGGCGGCG



AINVEQILIGADLIM

CCATCAACGTGGAGCAGATCCTGATCGGCGCCGACCTGATCATG



QKMLDINTIPSFFEN

CAGAAGATGCTGGACATCAACACCATCCCCAGCTTCTTCGAGAA



QEIIF

CCAGGAGATCATCTTC





21
IFKTNVLELKDSIRE
102
ATCTTCAAGACCAACGTGCTGGAGCTGAAGGACAGCATCAGGG



KLDYIDHRYLSLVD

AGAAGCTGGACTACATCGACCACAGGTACCTGAGCCTGGTGGAC



LAYDSKANRDFEIQ

CTGGCCTACGACAGCAAGGCCAACAGGGACTTCGAGATCCAGA



TIDLLINELDFKGLR

CCATCGACCTGCTGATCAACGAGCTGGACTTCAAGGGCCTGAGG



LGESRKPDGIISYDIN

CTGGGCGAGAGCAGGAAGCCCGACGGCATCATCAGCTACGACA



GVIIDNKAYSKGYN

TCAACGGCGTGATCATCGACAACAAGGCCTACAGCAAGGGCTA



LPINQADEMIRYIQE

CAACCTGCCCATCAACCAGGCCGACGAGATGATCAGGTACATCC



NQSRNEKINPNKWW

AGGAGAACCAGAGCAGGAACGAGAAGATCAACCCCAACAAGTG



ENFEDKVIKFNYLFI

GTGGGAGAACTTCGAGGACAAGGTGATCAAGTTCAACTACCTGT



SSLFVGGFKKNLQHI

TCATCAGCAGCCTGTTCGTGGGCGGCTTCAAGAAGAACCTGCAG



ANRTGVNGGAIDVE

CACATCGCCAACAGGACCGGCGTGAACGGCGGCGCCATCGACG



NLLYFAEEIKSGRLT

TGGAGAACCTGCTGTACTTCGCCGAGGAGATCAAGAGCGGCAG



YKDSFSRYINDEIKM

GCTGACCTACAAGGACAGCTTCAGCAGGTACATCAACGACGAG





ATCAAGATG





22
LPVKSEVSVFKDYL
103
CTGCCCGTGAAGAGCGAGGTGAGCGTGTTCAAGGACTACCTGAG



RTHLTHVDHRYLIL

GACCCACCTGACCCACGTGGACCACAGGTACCTGATCCTGGTGG



VDLGFDGSSDRDYE

ACCTGGGCTTCGACGGCAGCAGCGACAGGGACTACGAGATGAA



MKTAELFTAELGFM

GACCGCCGAGCTGTTCACCGCCGAGCTGGGCTTCATGGGCGCCA



GARLGDTRKPDVCV

GGCTGGGCGACACCAGGAAGCCCGACGTGTGCGTGTACCACGG



YHGANGLIIDNKAY

CGCCAACGGCCTGATCATCGACAACAAGGCCTACGGCAAGGGC



GKGYSLPIKQADEIY

TACAGCCTGCCCATCAAGCAGGCCGACGAGATCTACAGGTACAT



RYIEENKERDARLNP

CGAGGAGAACAAGGAGAGGGACGCCAGGCTGAACCCCAACCAG



NQWWKVFDESVTH

TGGTGGAAGGTGTTCGACGAGAGCGTGACCCACTTCAGGTTCGC



FRFAFISGSFTGGFK

CTTCATCAGCGGCAGCTTCACCGGCGGCTTCAAGGACAGGATCG



DRIELISMRSGICGA

AGCTGATCAGCATGAGGAGCGGCATCTGCGGCGCCGCCGTGAA



AVNSVNLLLMAEEL

CAGCGTGAACCTGCTGCTGATGGCCGAGGAGCTGAAGAGCGGC



KSGRLDYEEWFQYF

AGGCTGGACTACGAGGAGTGGTTCCAGTACTTCGACTGCAACGA



DCNDEISF

CGAGATCAGCTTC





23
ISVKSDMAVVKDSV
104
ATCAGCGTGAAGAGCGACATGGCCGTGGTGAAGGACAGCGTGA



RERLAHVSHEYLILI

GGGAGAGGCTGGCCCACGTGAGCCACGAGTACCTGATCCTGATC



DLGFDGTSDRDYEI

GACCTGGGCTTCGACGGCACCAGCGACAGGGACTACGAGATCC



QTAELFTRELDFLGG

AGACCGCCGAGCTGTTCACCAGGGAGCTGGACTTCCTGGGCGGC



RLGDTRKPDVCIYY

AGGCTGGGCGACACCAGGAAGCCCGACGTGTGCATCTACTACG



GKDGMIIDNKAYGK

GCAAGGACGGCATGATCATCGACAACAAGGCCTACGGCAAGGG



GYSLPIKQADEMYR

CTACAGCCTGCCCATCAAGCAGGCCGACGAGATGTACAGGTACC



YLEENKERNEKINPN

TGGAGGAGAACAAGGAGAGGAACGAGAAGATCAACCCCAACA



RWWKVFDEGVTDY

GGTGGTGGAAGGTGTTCGACGAGGGCGTGACCGACTACAGGTTC



RFAFVSGSFTGGFKD

GCCTTCGTGAGCGGCAGCTTCACCGGCGGCTTCAAGGACAGGCT



RLENIHMRSGLCGG

GGAGAACATCCACATGAGGAGCGGCCTGTGCGGCGGCGCCATC



AIDSVTLLLLAEELK

GACAGCGTGACCCTGCTGCTGCTGGCCGAGGAGCTGAAGGCCG



AGRMEYSEFFRLFD

GCAGGATGGAGTACAGCGAGTTCTTCAGGCTGTTCGACTGCAAC



CNDEVTF

GACGAGGTGACCTTC





24
ELKDKAADAVKAK
105
GAGCTGAAGGACAAGGCCGCCGACGCCGTGAAGGCCAAGTTCC



FLKLTGLSMKYIELL

TGAAGCTGACCGGCCTGAGCATGAAGTACATCGAGCTGCTGGAC



DIAYDSSRNRDFEIL

ATCGCCTACGACAGCAGCAGGAACAGGGACTTCGAGATCCTGA



TADLFKNVYGLDA

CCGCCGACCTGTTCAAGAACGTGTACGGCCTGGACGCCATGCAC



MHLGGGRKPDAIAQ

CTGGGCGGCGGCAGGAAGCCCGACGCCATCGCCCAGACCAGCC



TSHFGIIIDTKAYGN

ACTTCGGCATCATCATCGACACCAAGGCCTACGGCAACGGCTAC



GYSKSISQEDEMVR

AGCAAGAGCATCAGCCAGGAGGACGAGATGGTGAGGTACATCG



YIEDNQQRSITRNSV

AGGACAACCAGCAGAGGAGCATCACCAGGAACAGCGTGGAGTG



EWWKNFNSSIPSTAF

GTGGAAGAACTTCAACAGCAGCATCCCCAGCACCGCCTTCTACT



YFLWVSSKFVGKFD

TCCTGTGGGTGAGCAGCAAGTTCGTGGGCAAGTTCGACGACCAG



DQLLATYNRTNTCG

CTGCTGGCCACCTACAACAGGACCAACACCTGCGGCGGCGCCCT



GALNVEQLLIGAYK

GAACGTGGAGCAGCTGCTGATCGGCGCCTACAAGGTGAAGGCC



VKAGLLGIGQIPSYF

GGCCTGCTGGGCATCGGCCAGATCCCCAGCTACTTCAAGAACAA



KNKEIAW

GGAGATCGCCTGG


25
ISVKSDMAVVKDSV
106
ATCAGCGTGAAGAGCGACATGGCCGTGGTGAAGGACAGCGTGA



RERLAHVSHEYLLLI

GGGAGAGGCTGGCCCACGTGAGCCACGAGTACCTGCTGCTGATC



DLGFDGTSDRDYEI

GACCTGGGCTTCGACGGCACCAGCGACAGGGACTACGAGATCC



QTAELLTRELDFLG

AGACCGCCGAGCTGCTGACCAGGGAGCTGGACTTCCTGGGCGGC



GRLGDTRKPDVCIY

AGGCTGGGCGACACCAGGAAGCCCGACGTGTGCATCTACTACG



YGKDGMIIDNKAYG

GCAAGGACGGCATGATCATCGACAACAAGGCCTACGGCAAGGG



KGYSLPIKQADEMY

CTACAGCCTGCCCATCAAGCAGGCCGACGAGATGTACAGGTACC



RYLEENKERNEKINP

TGGAGGAGAACAAGGAGAGGAACGAGAAGATCAACCCCAACA



NRWWKVFDEGVTD

GGTGGTGGAAGGTGTTCGACGAGGGCGTGACCGACTACAGGTTC



YRFAFVSGSFTGGFK

GCCTTCGTGAGCGGCAGCTTCACCGGCGGCTTCAAGGACAGGCT



DRLENIHMRSGLCG

GGAGAACATCCACATGAGGAGCGGCCTGTGCGGCGGCGCCATC



GAIDSVTLLLLAEEL

GACAGCGTGACCCTGCTGCTGCTGGCCGAGGAGCTGAAGGCCG



KAGRMEYSEFFRLF

GCAGGATGGAGTACAGCGAGTTCTTCAGGCTGTTCGACTGCAAC



DCNDEVTF

GACGAGGTGACCTTC





26
ELKDEQAEKRKAKF
107
GAGCTGAAGGACGAGCAGGCCGAGAAGAGGAAGGCCAAGTTCC



LKETNLPMKYIELLD

TGAAGGAGACCAACCTGCCCATGAAGTACATCGAGCTGCTGGAC



IAYDGKRNRDFEIVT

ATCGCCTACGACGGCAAGAGGAACAGGGACTTCGAGATCGTGA



MELFRNVYRLHSKL

CCATGGAGCTGTTCAGGAACGTGTACAGGCTGCACAGCAAGCTG



LGGGRKPDGLLYQD

CTGGGCGGCGGCAGGAAGCCCGACGGCCTGCTGTACCAGGACA



RFGVIVDTKAYGKG

GGTTCGGCGTGATCGTGGACACCAAGGCCTACGGCAAGGGCTAC



YSKSINQADEMIRYI

AGCAAGAGCATCAACCAGGCCGACGAGATGATCAGGTACATCG



EDNKRRDENRNPIK

AGGACAACAAGAGGAGGGACGAGAACAGGAACCCCATCAAGTG



WWEAFPDTIPQEEF

GTGGGAGGCCTTCCCCGACACCATCCCCCAGGAGGAGTTCTACT



YFMWVSSKFIGKFQ

TCATGTGGGTGAGCAGCAAGTTCATCGGCAAGTTCCAGGAGCAG



EQLDYTSNETQIKG

CTGGACTACACCAGCAACGAGACCCAGATCAAGGGCGCCGCCC



AALNVEQLLLGADL

TGAACGTGGAGCAGCTGCTGCTGGGCGCCGACCTGGTGCTGAAG



VLKGQLHISDLPSYF

GGCCAGCTGCACATCAGCGACCTGCCCAGCTACTTCCAGAACAA



QNKEIEF

GGAGATCGAGTTC





27
RNLDNVERDNRKAE
108
AGGAACCTGGACAACGTGGAGAGGGACAACAGGAAGGCCGAGT



FLAKTSLPPRFIELLS

TCCTGGCCAAGACCAGCCTGCCCCCCAGGTTCATCGAGCTGCTG



IAYESKSNRDFEMIT

AGCATCGCCTACGAGAGCAAGAGCAACAGGGACTTCGAGATGA



AELFKDVYGLGAVH

TCACCGCCGAGCTGTTCAAGGACGTGTACGGCCTGGGCGCCGTG



LGNAKKPDALAFND

CACCTGGGCAACGCCAAGAAGCCCGACGCCCTGGCCTTCAACGA



DFGIIIDTKAYSNGY

CGACTTCGGCATCATCATCGACACCAAGGCCTACAGCAACGGCT



SKNINQEDEMVRYIE

ACAGCAAGAACATCAACCAGGAGGACGAGATGGTGAGGTACAT



DNQIRSPDRNNNEW

CGAGGACAACCAGATCAGGAGCCCCGACAGGAACAACAACGAG



WLSFPPSIPENDFHF

TGGTGGCTGAGCTTCCCCCCCAGCATCCCCGAGAACGACTTCCA



LWVSSYFTGRFEEQ

CTTCCTGTGGGTGAGCAGCTACTTCACCGGCAGGTTCGAGGAGC



LQETSARTGGTTGG

AGCTGCAGGAGACCAGCGCCAGGACCGGCGGCACCACCGGCGG



ALDVEQLLIGGSLIQ

CGCCCTGGACGTGGAGCAGCTGCTGATCGGCGGCAGCCTGATCC



EGSLAPHEVPAYMQ

AGGAGGGCAGCCTGGCCCCCCACGAGGTGCCCGCCTACATGCA



NRVIHF

GAACAGGGTGATCCACTTC





28
SPVKSEVSVFKDYL
109
AGCCCCGTGAAGAGCGAGGTGAGCGTGTTCAAGGACTACCTGA



RTHLTHVDHRYLIL

GGACCCACCTGACCCACGTGGACCACAGGTACCTGATCCTGGTG



VDLGFDGSSDRDYE

GACCTGGGCTTCGACGGCAGCAGCGACAGGGACTACGAGATGA



MKTAELFTAELGFM

AGACCGCCGAGCTGTTCACCGCCGAGCTGGGCTTCATGGGCGCC



GARLGDTRKPDVCV

AGGCTGGGCGACACCAGGAAGCCCGACGTGTGCGTGTACCACG



YHGAHGLIIDNKAY

GCGCCCACGGCCTGATCATCGACAACAAGGCCTACGGCAAGGG



GKGYSLPIKQADEIY

CTACAGCCTGCCCATCAAGCAGGCCGACGAGATCTACAGGTACA



RYIEENKERAVRLNP

TCGAGGAGAACAAGGAGAGGGCCGTGAGGCTGAACCCCAACCA



NQWWKVFDESVAH

GTGGTGGAAGGTGTTCGACGAGAGCGTGGCCCACTTCAGGTTCG



FRFAFISGSFTGGFK

CCTTCATCAGCGGCAGCTTCACCGGCGGCTTCAAGGACAGGATC



DRIELISMRSGICGA

GAGCTGATCAGCATGAGGAGCGGCATCTGCGGCGCCGCCGTGA



AVNSVNLLLMAEEL

ACAGCGTGAACCTGCTGCTGATGGCCGAGGAGCTGAAGAGCGG



KSGRLNYEEWFQYF

CAGGCTGAACTACGAGGAGTGGTTCCAGTACTTCGACTGCAACG



DCNDEISL

ACGAGATCAGCCTG





29
TLVDIEKERKKAYFL
110
ACCCTGGTGGACATCGAGAAGGAGAGGAAGAAGGCCTACTTCC



KETSLSPRYIELLEIA

TGAAGGAGACCAGCCTGAGCCCCAGGTACATCGAGCTGCTGGA



FDPKRNRDFEVITAE

GATCGCCTTCGACCCCAAGAGGAACAGGGACTTCGAGGTGATCA



LLKAGYGLKAKVLG

CCGCCGAGCTGCTGAAGGCCGGCTACGGCCTGAAGGCCAAGGT



GGRRPDGIAYTKDY

GCTGGGCGGCGGCAGGAGGCCCGACGGCATCGCCTACACCAAG



GLIVDTKAYSNGYG

GACTACGGCCTGATCGTGGACACCAAGGCCTACAGCAACGGCTA



KNIGQADEMIRYIED

CGGCAAGAACATCGGCCAGGCCGACGAGATGATCAGGTACATC



NQKRDNKRNPIEW

GAGGACAACCAGAAGAGGGACAACAAGAGGAACCCCATCGAGT



WREFEVQIPANSYY

GGTGGAGGGAGTTCGAGGTGCAGATCCCCGCCAACAGCTACTAC



YLWVSGRFTGRFDE

TACCTGTGGGTGAGCGGCAGGTTCACCGGCAGGTTCGACGAGCA



QLVYTSSQTNTRGG

GCTGGTGTACACCAGCAGCCAGACCAACACCAGGGGCGGCGCC



ALEVEQLLWGADA

CTGGAGGTGGAGCAGCTGCTGTGGGGCGCCGACGCCGTGATGA



VMKGKLNVSDLPK

AGGGCAAGCTGAACGTGAGCGACCTGCCCAAGTACATGAACAA



YMNNSIIKL

CAGCATCATCAAGCTG





30
ELRDKVIEEQKAIFL
111
GAGCTGAGGGACAAGGTGATCGAGGAGCAGAAGGCCATCTTCC



QKTKLPLSYIELLEIA

TGCAGAAGACCAAGCTGCCCCTGAGCTACATCGAGCTGCTGGAG



RDGKRSRDFELITIE

ATCGCCAGGGACGGCAAGAGGAGCAGGGACTTCGAGCTGATCA



LFKNIYKINARILGG

CCATCGAGCTGTTCAAGAACATCTACAAGATCAACGCCAGGATC



ARKPDGVLYMPEFG

CTGGGCGGCGCCAGGAAGCCCGACGGCGTGCTGTACATGCCCG



VIVDTKAYADGYSK

AGTTCGGCGTGATCGTGGACACCAAGGCCTACGCCGACGGCTAC



SIAQADEMIRYIEDN

AGCAAGAGCATCGCCCAGGCCGACGAGATGATCAGGTACATCG



KRRDPSRNSTKWWE

AGGACAACAAGAGGAGGGACCCCAGCAGGAACAGCACCAAGTG



HFPTSIPANNFYFLW

GTGGGAGCACTTCCCCACCAGCATCCCCGCCAACAACTTCTACT



VSSVFVNKFHEQLS

TCCTGTGGGTGAGCAGCGTGTTCGTGAACAAGTTCCACGAGCAG



YTAQETQTVGAALS

CTGAGCTACACCGCCCAGGAGACCCAGACCGTGGGCGCCGCCCT



VEQLLLGADSVLKG

GAGCGTGGAGCAGCTGCTGCTGGGCGCCGACAGCGTGCTGAAG



NLTTEKFIDSFKNQE

GGCAACCTGACCACCGAGAAGTTCATCGACAGCTTCAAGAACCA



IVF

GGAGATCGTGTTC





31
GATKSDLSLLKDDIR
112
GGCGCCACCAAGAGCGACCTGAGCCTGCTGAAGGACGACATCA



KKLNHINHKYLVLI

GGAAGAAGCTGAACCACATCAACCACAAGTACCTGGTGCTGATC



DLGFDGTADRDYEL

GACCTGGGCTTCGACGGCACCGCCGACAGGGACTACGAGCTGC



QTADLLTSELAFKG

AGACCGCCGACCTGCTGACCAGCGAGCTGGCCTTCAAGGGCGCC



ARLGDSRKPDVCVY

AGGCTGGGCGACAGCAGGAAGCCCGACGTGTGCGTGTACCACG



HDKNGLIIDNKAYG

ACAAGAACGGCCTGATCATCGACAACAAGGCCTACGGCAGCGG



SGYSLPIKQADEML

CTACAGCCTGCCCATCAAGCAGGCCGACGAGATGCTGAGGTACA



RYIEENQKRDKALN

TCGAGGAGAACCAGAAGAGGGACAAGGCCCTGAACCCCAACGA



PNEWWTIFDDAVSK

GTGGTGGACCATCTTCGACGACGCCGTGAGCAAGTTCAACTTCG



FNFAFVSGEFTGGFK

CCTTCGTGAGCGGCGAGTTCACCGGCGGCTTCAAGGACAGGCTG



DRLENISRRSYTNGA

GAGAACATCAGCAGGAGGAGCTACACCAACGGCGCCGCCATCA



AINSVNLLLLAEEIK

ACAGCGTGAACCTGCTGCTGCTGGCCGAGGAGATCAAGAGCGG



SGRISYGDAFTKFEC

CAGGATCAGCTACGGCGACGCCTTCACCAAGTTCGAGTGCAACG



NDEIII

ACGAGATCATCATC





32
ELRNAALDKQKVNF
113
GAGCTGAGGAACGCCGCCCTGGACAAGCAGAAGGTGAACTTCA



INKTGLPMKYIELLE

TCAACAAGACCGGCCTGCCCATGAAGTACATCGAGCTGCTGGAG



IAFDGSRNRDFEMV

ATCGCCTTCGACGGCAGCAGGAACAGGGACTTCGAGATGGTGA



TADLFKNVYGFNSIL

CCGCCGACCTGTTCAAGAACGTGTACGGCTTCAACAGCATCCTG



LGGGRKPDGLIFTDR

CTGGGCGGCGGCAGGAAGCCCGACGGCCTGATCTTCACCGACA



FGVIIDTKAYGNGYS

GGTTCGGCGTGATCATCGACACCAAGGCCTACGGCAACGGCTAC



KSIGQEDEMVRYIED

AGCAAGAGCATCGGCCAGGAGGACGAGATGGTGAGGTACATCG



NQLRDSNRNSVEW

AGGACAACCAGCTGAGGGACAGCAACAGGAACAGCGTGGAGTG



WKNFDEKIESENFYF

GTGGAAGAACTTCGACGAGAAGATCGAGAGCGAGAACTTCTAC



MWISSKFIGQFSDQL

TTCATGTGGATCAGCAGCAAGTTCATCGGCCAGTTCAGCGACCA



QSTSDRTNTKGAAL

GCTGCAGAGCACCAGCGACAGGACCAACACCAAGGGCGCCGCC



NVEQLLLGAAAARD

CTGAACGTGGAGCAGCTGCTGCTGGGCGCCGCCGCCGCCAGGG



GKLDINSLPIYMNNK

ACGGCAAGCTGGACATCAACAGCCTGCCCATCTACATGAACAAC



EILW

AAGGAGATCCTGTGG





33
ELKDEQSEKRKAYF
114
GAGCTGAAGGACGAGCAGAGCGAGAAGAGGAAGGCCTACTTCC



LKETNLPLKYIELLDI

TGAAGGAGACCAACCTGCCCCTGAAGTACATCGAGCTGCTGGAC



AYDGKRNRDFEIVT

ATCGCCTACGACGGCAAGAGGAACAGGGACTTCGAGATCGTGA



MELFRNVYRLQSKL

CCATGGAGCTGTTCAGGAACGTGTACAGGCTGCAGAGCAAGCTG



LGGVRKPDGLLYKH

CTGGGCGGCGTGAGGAAGCCCGACGGCCTGCTGTACAAGCACA



RFGIIVDTKAYGEGY

GGTTCGGCATCATCGTGGACACCAAGGCCTACGGCGAGGGCTAC



SKSISQADEMIRYIE

AGCAAGAGCATCAGCCAGGCCGACGAGATGATCAGGTACATCG



DNKRRDENRNSTK

AGGACAACAAGAGGAGGGACGAGAACAGGAACAGCACCAAGT



WWEHFPDCIPKQSF

GGTGGGAGCACTTCCCCGACTGCATCCCCAAGCAGAGCTTCTAC



YFMWVSSKFVGKFQ

TTCATGTGGGTGAGCAGCAAGTTCGTGGGCAAGTTCCAGGAGCA



EQLDYTANETKTNG

GCTGGACTACACCGCCAACGAGACCAAGACCAACGGCGCCGCC



AALNVEQLLWGAD

CTGAACGTGGAGCAGCTGCTGTGGGGCGCCGACCTGGTGGCCAA



LVAKGKLDISQLPSY

GGGCAAGCTGGACATCAGCCAGCTGCCCAGCTACTTCCAGAACA



FQNKEIEF

AGGAGATCGAGTTC





34
HNNKFKNYLRENSE
115
CACAACAACAAGTTCAAGAACTACCTGAGGGAGAACAGCGAGC



LSFKFIELIDIAYDGN

TGAGCTTCAAGTTCATCGAGCTGATCGACATCGCCTACGACGGC



RNRDMEIITAELLKE

AACAGGAACAGGGACATGGAGATCATCACCGCCGAGCTGCTGA



IYGLNVKLLGGGRK

AGGAGATCTACGGCCTGAACGTGAAGCTGCTGGGCGGCGGCAG



PDILAYTDDIGIIIDT

GAAGCCCGACATCCTGGCCTACACCGACGACATCGGCATCATCA



KAYKDGYGKQINQ

TCGACACCAAGGCCTACAAGGACGGCTACGGCAAGCAGATCAA



ADEMIRYIEDNQRR

CCAGGCCGACGAGATGATCAGGTACATCGAGGACAACCAGAGG



DLIRNPNEWWRYFP

AGGGACCTGATCAGGAACCCCAACGAGTGGTGGAGGTACTTCCC



KSISKEKIYFMWISS

CAAGAGCATCAGCAAGGAGAAGATCTACTTCATGTGGATCAGC



YFKNNFYEQVQYTA

AGCTACTTCAAGAACAACTTCTACGAGCAGGTGCAGTACACCGC



QETKSIGAALNVRQ

CCAGGAGACCAAGAGCATCGGCGCCGCCCTGAACGTGAGGCAG



LLLCADAIQKEVLSL

CTGCTGCTGTGCGCCGACGCCATCCAGAAGGAGGTGCTGAGCCT



DTFLGSFRNEEINL

GGACACCTTCCTGGGCAGCTTCAGGAACGAGGAGATCAACCTG





35
LPVKSEVSILKDYLR
116
CTGCCCGTGAAGAGCGAGGTGAGCATCCTGAAGGACTACCTGA



SHLTHIDHKYLILVD

GGAGCCACCTGACCCACATCGACCACAAGTACCTGATCCTGGTG



LGYDGTSDRDYEIQ

GACCTGGGCTACGACGGCACCAGCGACAGGGACTACGAGATCC



TAQLLTAELSFLGGR

AGACCGCCCAGCTGCTGACCGCCGAGCTGAGCTTCCTGGGCGGC



LGDTRKPDVCIYYE

AGGCTGGGCGACACCAGGAAGCCCGACGTGTGCATCTACTACG



DNGLIIDNKAYGKG

AGGACAACGGCCTGATCATCGACAACAAGGCCTACGGCAAGGG



YSLPMKQADEMYR

CTACAGCCTGCCCATGAAGCAGGCCGACGAGATGTACAGGTAC



YIEENKERSELLNPN

ATCGAGGAGAACAAGGAGAGGAGCGAGCTGCTGAACCCCAACT



CWWNIFDKDVKTFH

GCTGGTGGAACATCTTCGACAAGGACGTGAAGACCTTCCACTTC



FAFLSGEFTGGFRDR

GCCTTCCTGAGCGGCGAGTTCACCGGCGGCTTCAGGGACAGGCT



LNHISMRSGMRGAA

GAACCACATCAGCATGAGGAGCGGCATGAGGGGCGCCGCCGTG



VNSANLLIMAEKLK

AACAGCGCCAACCTGCTGATCATGGCCGAGAAGCTGAAGGCCG



AGTMEYEEFFRLFD

GCACCATGGAGTACGAGGAGTTCTTCAGGCTGTTCGACACCAAC



TNDEILF

GACGAGATCCTGTTC





36
LPVKSQVSILKDYLR
117
CTGCCCGTGAAGAGCCAGGTGAGCATCCTGAAGGACTACCTGAG



SYLSHVDHKYLILLD

GAGCTACCTGAGCCACGTGGACCACAAGTACCTGATCCTGCTGG



LGFDGTSDRDYEIW

ACCTGGGCTTCGACGGCACCAGCGACAGGGACTACGAGATCTG



TAQLLTAELSFLGGR

GACCGCCCAGCTGCTGACCGCCGAGCTGAGCTTCCTGGGCGGCA



LGDTRKPDVCIYYE

GGCTGGGCGACACCAGGAAGCCCGACGTGTGCATCTACTACGA



DNGLIIDNKAYGKG

GGACAACGGCCTGATCATCGACAACAAGGCCTACGGCAAGGGC



YSLPIKQADEMYRYI

TACAGCCTGCCCATCAAGCAGGCCGACGAGATGTACAGGTACAT



EENKERSDLLNPNC

CGAGGAGAACAAGGAGAGGAGCGACCTGCTGAACCCCAACTGC



WWNIFGEGVKTFRF

TGGTGGAACATCTTCGGCGAGGGCGTGAAGACCTTCAGGTTCGC



AFLSGEFTGGFKDRL

CTTCCTGAGCGGCGAGTTCACCGGCGGCTTCAAGGACAGGCTGA



NHISMRSGIKGAAV

ACCACATCAGCATGAGGAGCGGCATCAAGGGCGCCGCCGTGAA



NSANLLIMAEQLKS

CAGCGCCAACCTGCTGATCATGGCCGAGCAGCTGAAGAGCGGC



GTMSYEEFFQLFDY

ACCATGAGCTACGAGGAGTTCTTCCAGCTGTTCGACTACAACGA



NDEIIF

CGAGATCATCTTC





37
VSKTNILELKDNTRE
118
GTGAGCAAGACCAACATCCTGGAGCTGAAGGACAACACCAGGG



KLVYLDHRYLSLFD

AGAAGCTGGTGTACCTGGACCACAGGTACCTGAGCCTGTTCGAC



LAYDDKASRDFEIQ

CTGGCCTACGACGACAAGGCCAGCAGGGACTTCGAGATCCAGA



TIDLLINELQFKGLR

CCATCGACCTGCTGATCAACGAGCTGCAGTTCAAGGGCCTGAGG



LGERRKPDGIISYGV

CTGGGCGAGAGGAGGAAGCCCGACGGCATCATCAGCTACGGCG



NGVIIDNKAYSKGY

TGAACGGCGTGATCATCGACAACAAGGCCTACAGCAAGGGCTA



NLPIRQADEMIRYIQ

CAACCTGCCCATCAGGCAGGCCGACGAGATGATCAGGTACATCC



ENQSRDEKLNPNKW

AGGAGAACCAGAGCAGGGACGAGAAGCTGAACCCCAACAAGTG



WENFEEETSKFNYL

GTGGGAGAACTTCGAGGAGGAGACCAGCAAGTTCAACTACCTG



FISSKFISGFKKNLQY

TTCATCAGCAGCAAGTTCATCAGCGGCTTCAAGAAGAACCTGCA



IADRTGVNGGAINV

GTACATCGCCGACAGGACCGGCGTGAACGGCGGCGCCATCAAC



ENLLCFAEMLKSGK

GTGGAGAACCTGCTGTGCTTCGCCGAGATGCTGAAGAGCGGCAA



LEYNDFFNQYNNDE

GCTGGAGTACAACGACTTCTTCAACCAGTACAACAACGACGAGA



IIM

TCATCATG





38
LPVKSQVSILKDYLR
119
CTGCCCGTGAAGAGCCAGGTGAGCATCCTGAAGGACTACCTGAG



SCLSHVDHKYLILLD

GAGCTGCCTGAGCCACGTGGACCACAAGTACCTGATCCTGCTGG



LGFDGTSDRDYEIQT

ACCTGGGCTTCGACGGCACCAGCGACAGGGACTACGAGATCCA



AQLLTAELSFLGGRL

GACCGCCCAGCTGCTGACCGCCGAGCTGAGCTTCCTGGGCGGCA



GDTRKPDVCIYYED

GGCTGGGCGACACCAGGAAGCCCGACGTGTGCATCTACTACGA



NGLIIDNKAYGKGY

GGACAACGGCCTGATCATCGACAACAAGGCCTACGGCAAGGGC



SLPIKQADEMYRYIE

TACAGCCTGCCCATCAAGCAGGCCGACGAGATGTACAGGTACAT



ENKERSELLNPNCW

CGAGGAGAACAAGGAGAGGAGCGAGCTGCTGAACCCCAACTGC



WNIFDEGVKTFRFA

TGGTGGAACATCTTCGACGAGGGCGTGAAGACCTTCAGGTTCGC



FLSGEFTGGFKDRLN

CTTCCTGAGCGGCGAGTTCACCGGCGGCTTCAAGGACAGGCTGA



HISMRSGIKGAAVNS

ACCACATCAGCATGAGGAGCGGCATCAAGGGCGCCGCCGTGAA



ANLLIIAEQLKSGTM

CAGCGCCAACCTGCTGATCATCGCCGAGCAGCTGAAGAGCGGC



SYEEFFQLFDQNDEI

ACCATGAGCTACGAGGAGTTCTTCCAGCTGTTCGACCAGAACGA



TV

CGAGATCACCGTG





39
MSSKSEISVIKDNIR
120
ATGAGCAGCAAGAGCGAGATCAGCGTGATCAAGGACAACATCA



KRLNHINHKYLVLID

GGAAGAGGCTGAACCACATCAACCACAAGTACCTGGTGCTGATC



LGFDGTADRDYELQ

GACCTGGGCTTCGACGGCACCGCCGACAGGGACTACGAGCTGC



TADLLTSELSFKGAR

AGACCGCCGACCTGCTGACCAGCGAGCTGAGCTTCAAGGGCGCC



LGDTRKPDVCVYHG

AGGCTGGGCGACACCAGGAAGCCCGACGTGTGCGTGTACCACG



TNGLIIDNKAYGKG

GCACCAACGGCCTGATCATCGACAACAAGGCCTACGGCAAGGG



YSLPIKQADEMLRYI

CTACAGCCTGCCCATCAAGCAGGCCGACGAGATGCTGAGGTACA



EENQKRDKSLNPNE

TCGAGGAGAACCAGAAGAGGGACAAGAGCCTGAACCCCAACGA



WWTIFDDAVSKFNF

GTGGTGGACCATCTTCGACGACGCCGTGAGCAAGTTCAACTTCG



AFVSGEFTGGFKDR

CCTTCGTGAGCGGCGAGTTCACCGGCGGCTTCAAGGACAGGCTG



LENISRRSSVNGAAI

GAGAACATCAGCAGGAGGAGCAGCGTGAACGGCGCCGCCATCA



NSVNLLLLAEEIKSG

ACAGCGTGAACCTGCTGCTGCTGGCCGAGGAGATCAAGAGCGG



RMSYSDAFKNFDCN

CAGGATGAGCTACAGCGACGCCTTCAAGAACTTCGACTGCAACA



KEITI

AGGAGATCACCATC





40
RNLDKVERDSRKAE
121
AGGAACCTGGACAAGGTGGAGAGGGACAGCAGGAAGGCCGAGT



FLAKTSLPPRFIELLS

TCCTGGCCAAGACCAGCCTGCCCCCCAGGTTCATCGAGCTGCTG



IAYESKSNRDFEMIT

AGCATCGCCTACGAGAGCAAGAGCAACAGGGACTTCGAGATGA



AEFFKDVYGLGAVH

TCACCGCCGAGTTCTTCAAGGACGTGTACGGCCTGGGCGCCGTG



LGNARKPDALAFTD

CACCTGGGCAACGCCAGGAAGCCCGACGCCCTGGCCTTCACCGA



NFGIVIDTKAYSNGY

CAACTTCGGCATCGTGATCGACACCAAGGCCTACAGCAACGGCT



SKNINQEDEMVRYIE

ACAGCAAGAACATCAACCAGGAGGACGAGATGGTGAGGTACAT



DNQIRSPERNKNEW

CGAGGACAACCAGATCAGGAGCCCCGAGAGGAACAAGAACGAG



WLSFPPSIPENNFHF

TGGTGGCTGAGCTTCCCCCCCAGCATCCCCGAGAACAACTTCCA



LWVSSYFTGYFEEQ

CTTCCTGTGGGTGAGCAGCTACTTCACCGGCTACTTCGAGGAGC



LQETSDRAGGMTGG

AGCTGCAGGAGACCAGCGACAGGGCCGGCGGCATGACCGGCGG



ALDIEQLLIGGSLVQ

CGCCCTGGACATCGAGCAGCTGCTGATCGGCGGCAGCCTGGTGC



EGKLAPHDIPEYMQ

AGGAGGGCAAGCTGGCCCCCCACGACATCCCCGAGTACATGCA



NRVIHF

GAACAGGGTGATCCACTTC





41
APVKSEVSLCKDILR
122
GCCCCCGTGAAGAGCGAGGTGAGCCTGTGCAAGGACATCCTGA



SHLTHVDHKYLILL

GGAGCCACCTGACCCACGTGGACCACAAGTACCTGATCCTGCTG



DLGFDGTSDRDYEI

GACCTGGGCTTCGACGGCACCAGCGACAGGGACTACGAGATCC



QTAQLLTAELDFKG

AGACCGCCCAGCTGCTGACCGCCGAGCTGGACTTCAAGGGCGCC



ARLGDTRKPDVCVY

AGGCTGGGCGACACCAGGAAGCCCGACGTGTGCGTGTACTACG



YGEDGLILDNKAYG

GCGAGGACGGCCTGATCCTGGACAACAAGGCCTACGGCAAGGG



KGYSLPIKQADEMY

CTACAGCCTGCCCATCAAGCAGGCCGACGAGATGTACAGGTACA



RYIEENKERNERLNP

TCGAGGAGAACAAGGAGAGGAACGAGAGGCTGAACCCCAACAA



NKWWEIFDKDVVR

GTGGTGGGAGATCTTCGACAAGGACGTGGTGAGGTACCACTTCG



YHFAFVSGTFTGGF

CCTTCGTGAGCGGCACCTTCACCGGCGGCTTCAAGGAGAGGCTG



KERLDNIRMRSGICG

GACAACATCAGGATGAGGAGCGGCATCTGCGGCGCCGCCGTGA



AAVNSMNLLLMAE

ACAGCATGAACCTGCTGCTGATGGCCGAGGAGCTGAAGAGCGG



ELKSGRLGYKECFA

CAGGCTGGGCTACAAGGAGTGCTTCGCCCTGTTCGACTGCAACG



LFDCNDEIAF

ACGAGATCGCCTTC





42
SCVKDEVNDIVDRV
123
AGCTGCGTGAAGGACGAGGTGAACGACATCGTGGACAGGGTGA



RVKLKNIDHKYLILI

GGGTGAAGCTGAAGAACATCGACCACAAGTACCTGATCCTGATC



SLAYSDETERTKKN

AGCCTGGCCTACAGCGACGAGACCGAGAGGACCAAGAAGAACA



SDARDFEIQTAELFT

GCGACGCCAGGGACTTCGAGATCCAGACCGCCGAGCTGTTCACC



KELGFNGIRLGESNK

AAGGAGCTGGGCTTCAACGGCATCAGGCTGGGCGAGAGCAACA



PDVLISFGANGTIIDN

AGCCCGACGTGCTGATCAGCTTCGGCGCCAACGGCACCATCATC



KSYKDGFNIPRVTSD

GACAACAAGAGCTACAAGGACGGCTTCAACATCCCCAGGGTGA



QMIRYINENNQRTT

CCAGCGACCAGATGATCAGGTACATCAACGAGAACAACCAGAG



QLNPNEWWKNFDSS

GACCACCCAGCTGAACCCCAACGAGTGGTGGAAGAACTTCGAC



VSNYTFLFVTSFLKG

AGCAGCGTGAGCAACTACACCTTCCTGTTCGTGACCAGCTTCCT



SFKNQIEYISNATNG

GAAGGGCAGCTTCAAGAACCAGATCGAGTACATCAGCAACGCC



TRGAAINVESLLYIS

ACCAACGGCACCAGGGGCGCCGCCATCAACGTGGAGAGCCTGC



EDIKSGKIKQSDFYS

TGTACATCAGCGAGGACATCAAGAGCGGCAAGATCAAGCAGAG



EFKNDEIVY

CGACTTCTACAGCGAGTTCAAGAACGACGAGATCGTGTAC





43
SQGDKAREQLKAKF
124
AGCCAGGGCGACAAGGCCAGGGAGCAGCTGAAGGCCAAGTTCC



LAKTNLLPRYVELL

TGGCCAAGACCAACCTGCTGCCCAGGTACGTGGAGCTGCTGGAC



DIAYDSKRNRDFEM

ATCGCCTACGACAGCAAGAGGAACAGGGACTTCGAGATGGTGA



VTAELFNFAYLLPA

CCGCCGAGCTGTTCAACTTCGCCTACCTGCTGCCCGCCGTGCACC



VHLGGVRKPDALVA

TGGGCGGCGTGAGGAAGCCCGACGCCCTGGTGGCCACCAAGAA



TKKFGIIVDTKAYAN

GTTCGGCATCATCGTGGACACCAAGGCCTACGCCAACGGCTACA



GYSRNANQADEMA

GCAGGAACGCCAACCAGGCCGACGAGATGGCCAGGTACATCAC



RYITENQKRDPKTNP

CGAGAACCAGAAGAGGGACCCCAAGACCAACCCCAACAGGTGG



NRWWDNFDARIPPN

TGGGACAACTTCGACGCCAGGATCCCCCCCAACGCCTACTACTT



AYYFLWVSSFFTGQ

CCTGTGGGTGAGCAGCTTCTTCACCGGCCAGTTCGACGACCAGC



FDDQLSYTAHRTNT

TGAGCTACACCGCCCACAGGACCAACACCCACGGCGGCGCCCTG



HGGALNVEQLLIGA

AACGTGGAGCAGCTGCTGATCGGCGCCAACATGATCCAGACCG



NMIQTGQLDRNKLP

GCCAGCTGGACAGGAACAAGCTGCCCGAGTACATGCAGGACAA



EYMQDKEITF

GGAGATCACCTTC





44
KVQKSNILDVIEKCR
125
AAGGTGCAGAAGAGCAACATCCTGGACGTGATCGAGAAGTGCA



EKINNIPHEYLALIP

GGGAGAAGATCAACAACATCCCCCACGAGTACCTGGCCCTGATC



MSFDENESTMFEIKT

CCCATGAGCTTCGACGAGAACGAGAGCACCATGTTCGAGATCAA



IELLTEHCKFDGLHC

GACCATCGAGCTGCTGACCGAGCACTGCAAGTTCGACGGCCTGC



GGASKPDGLIYSED

ACTGCGGCGGCGCCAGCAAGCCCGACGGCCTGATCTACAGCGA



YGVIIDTKSYKDGFN

GGACTACGGCGTGATCATCGACACCAAGAGCTACAAGGACGGC



IQTPERDKMKRYIEE

TTCAACATCCAGACCCCCGAGAGGGACAAGATGAAGAGGTACA



NQNRNPQHNKTRW

TCGAGGAGAACCAGAACAGGAACCCCCAGCACAACAAGACCAG



WDEFPHNISNFLFLF

GTGGTGGGACGAGTTCCCCCACAACATCAGCAACTTCCTGTTCC



VSGKFGGNFKEQLRI

TGTTCGTGAGCGGCAAGTTCGGCGGCAACTTCAAGGAGCAGCTG



LSEQTNNTLGGALSS

AGGATCCTGAGCGAGCAGACCAACAACACCCTGGGCGGCGCCC



YVLLNIAEQIAINKID

TGAGCAGCTACGTGCTGCTGAACATCGCCGAGCAGATCGCCATC



HCDFKTRISCLDEVA

AACAAGATCGACCACTGCGACTTCAAGACCAGGATCAGCTGCCT



GGACGAGGTGGCC







45
VPVKSEVSLCKDYL
126
GTGCCCGTGAAGAGCGAGGTGAGCCTGTGCAAGGACTACCTGA



RSYLTHVDHKYLILL

GGAGCTACCTGACCCACGTGGACCACAAGTACCTGATCCTGCTG



DLGFDGTSDRDYEI

GACCTGGGCTTCGACGGCACCAGCGACAGGGACTACGAGATCC



QTAQLLTAELDFKG

AGACCGCCCAGCTGCTGACCGCCGAGCTGGACTTCAAGGGCGCC



ARLGDTRKPDVCVY

AGGCTGGGCGACACCAGGAAGCCCGACGTGTGCGTGTACTACG



YGEDGLIIDNKAYG

GCGAGGACGGCCTGATCATCGACAACAAGGCCTACGGCAAGGG



KGYSLPIKQADEIYR

CTACAGCCTGCCCATCAAGCAGGCCGACGAGATCTACAGGTACA



YIEENKKRDEKLNP

TCGAGGAGAACAAGAAGAGGGACGAGAAGCTGAACCCCAACAA



NKWWEIFDKGVVR

GTGGTGGGAGATCTTCGACAAGGGCGTGGTGAGGTACCACTTCG



YHFAFVSGAFTGGF

CCTTCGTGAGCGGCGCCTTCACCGGCGGCTTCAAGGAGAGGCTG



KERLDNIRMRSGICG

GACAACATCAGGATGAGGAGCGGCATCTGCGGCGCCGCCATCA



AAINSMNLLLMAEE

ACAGCATGAACCTGCTGCTGATGGCCGAGGAGCTGAAGAGCGG



LKSGRLGYEECFALF

CAGGCTGGGCTACGAGGAGTGCTTCGCCCTGTTCGACTGCAACG



DCNDEITF

ACGAGATCACCTTC





46
VPVKSEVSLCKDYL
127
GTGCCCGTGAAGAGCGAGGTGAGCCTGTGCAAGGACTACCTGA



RSHLNHVDHRYLIL

GGAGCCACCTGAACCACGTGGACCACAGGTACCTGATCCTGCTG



LDLGFDGTSDRDYEI

GACCTGGGCTTCGACGGCACCAGCGACAGGGACTACGAGATCC



QTAQLLTGELNFKG

AGACCGCCCAGCTGCTGACCGGCGAGCTGAACTTCAAGGGCGCC



ARLGDTRKPDVCVY

AGGCTGGGCGACACCAGGAAGCCCGACGTGTGCGTGTACTACG



YGEDGLIIDNKAYG

GCGAGGACGGCCTGATCATCGACAACAAGGCCTACGGCAAGGG



KGYSLPIKQADEMY

CTACAGCCTGCCCATCAAGCAGGCCGACGAGATGTACAGGTACA



RYIEENKERNEKLNP

TCGAGGAGAACAAGGAGAGGAACGAGAAGCTGAACCCCAACAA



NKWWEIFDKDVIHY

GTGGTGGGAGATCTTCGACAAGGACGTGATCCACTACCACTTCG



HFAFVSGAFTGGFK

CCTTCGTGAGCGGCGCCTTCACCGGCGGCTTCAAGGAGAGGCTG



ERLENIRMRSGIYGA

GAGAACATCAGGATGAGGAGCGGCATCTACGGCGCCGCCGTGA



AVNSMNLLLMAEEL

ACAGCATGAACCTGCTGCTGATGGCCGAGGAGCTGAAGAGCGG



KSGRLDYKECFKLF

CAGGCTGGACTACAAGGAGTGCTTCAAGCTGTTCGACTGCAACG



DCNDEIVL

ACGAGATCGTGCTG





47
VPVKSEVSLLKDYL
128
GTGCCCGTGAAGAGCGAGGTGAGCCTGCTGAAGGACTACCTGA



RSHLVHVDHKYLVL

GGAGCCACCTGGTGCACGTGGACCACAAGTACCTGGTGCTGCTG



LDLGFDGTSDRDYEI

GACCTGGGCTTCGACGGCACCAGCGACAGGGACTACGAGATCC



QTAQLLTGELNFKG

AGACCGCCCAGCTGCTGACCGGCGAGCTGAACTTCAAGGGCGCC



ARLGDTRKPDVCVY

AGGCTGGGCGACACCAGGAAGCCCGACGTGTGCGTGTACTACG



YGEDGLIIDNKAYG

GCGAGGACGGCCTGATCATCGACAACAAGGCCTACGGCAAGGG



KGYSLPIKQADEMY

CTACAGCCTGCCCATCAAGCAGGCCGACGAGATGTACAGGTACA



RYIEENKERNEKLNP

TCGAGGAGAACAAGGAGAGGAACGAGAAGCTGAACCCCAACAA



NKWWEIFGNDVIHY

GTGGTGGGAGATCTTCGGCAACGACGTGATCCACTACCACTTCG



HFAFVSGAFTGGFK

CCTTCGTGAGCGGCGCCTTCACCGGCGGCTTCAAGGAGAGGCTG



ERLDNIRMRSGIYGA

GACAACATCAGGATGAGGAGCGGCATCTACGGCGCCGCCGTGA



AVNSMNLLLLAEEL

ACAGCATGAACCTGCTGCTGCTGGCCGAGGAGCTGAAGAGCGG



KSGRLGYKECFKLF

CAGGCTGGGCTACAAGGAGTGCTTCAAGCTGTTCGACTGCAACG



DCNDEIVL

ACGAGATCGTGCTG





48
ECVKDNVVDIKDRV
129
GAGTGCGTGAAGGACAACGTGGTGGACATCAAGGACAGGGTGA



RNKLIHLDHKYLALI

GGAACAAGCTGATCCACCTGGACCACAAGTACCTGGCCCTGATC



DLAYSDAASRAKKN

GACCTGGCCTACAGCGACGCCGCCAGCAGGGCCAAGAAGAACG



ADAREFEIQTADLFT

CCGACGCCAGGGAGTTCGAGATCCAGACCGCCGACCTGTTCACC



KELSFNGQRLGDSR

AAGGAGCTGAGCTTCAACGGCCAGAGGCTGGGCGACAGCAGGA



KPDVIISYGLDGTIV

AGCCCGACGTGATCATCAGCTACGGCCTGGACGGCACCATCGTG



DNKSYKDGFNISRT

GACAACAAGAGCTACAAGGACGGCTTCAACATCAGCAGGACCT



CADEMSRYINENNL

GCGCCGACGAGATGAGCAGGTACATCAACGAGAACAACCTGAG



RQKSLNPNEWWKN

GCAGAAGAGCCTGAACCCCAACGAGTGGTGGAAGAACTTCGAC



FDSTITAYTFLFITSY

AGCACCATCACCGCCTACACCTTCCTGTTCATCACCAGCTACCTG



LKGQFEDQLEYVSN

AAGGGCCAGTTCGAGGACCAGCTGGAGTACGTGAGCAACGCCA



ANGGIKGAAIGVESL

ACGGCGGCATCAAGGGCGCCGCCATCGGCGTGGAGAGCCTGCT



LYLSEGIKAGRISHA

GTACCTGAGCGAGGGCATCAAGGCCGGCAGGATCAGCCACGCC



DFYSNFNNKEMIY

GACTTCTACAGCAACTTCAACAACAAGGAGATGATCTAC





49
IAKSDFSIIKDNIRRK
130
ATCGCCAAGAGCGACTTCAGCATCATCAAGGACAACATCAGGA



LQYVNHKYLLLIDL

GGAAGCTGCAGTACGTGAACCACAAGTACCTGCTGCTGATCGAC



GFDSDSNRDYEIQTA

CTGGGCTTCGACAGCGACAGCAACAGGGACTACGAGATCCAGA



ELLTTELAFKGARL

CCGCCGAGCTGCTGACCACCGAGCTGGCCTTCAAGGGCGCCAGG



GDTRKPDVCVYYGE

CTGGGCGACACCAGGAAGCCCGACGTGTGCGTGTACTACGGCG



NGLIIDNKAYSKGYS

AGAACGGCCTGATCATCGACAACAAGGCCTACAGCAAGGGCTA



LPMSQADEMVRYIE

CAGCCTGCCCATGAGCCAGGCCGACGAGATGGTGAGGTACATC



ENKARQSSINPNQW

GAGGAGAACAAGGCCAGGCAGAGCAGCATCAACCCCAACCAGT



WKIFEDTVCNFNYA

GGTGGAAGATCTTCGAGGACACCGTGTGCAACTTCAACTACGCC



FVSGEFTGGFKDRL

TTCGTGAGCGGCGAGTTCACCGGCGGCTTCAAGGACAGGCTGAA



NNICERTRVSGGAIN

CAACATCTGCGAGAGGACCAGGGTGAGCGGCGGCGCCATCAAC



TINLLLLAEELKSGR

ACCATCAACCTGCTGCTGCTGGCCGAGGAGCTGAAGAGCGGCA



MSYPKCFSYFDTND

GGATGAGCTACCCCAAGTGCTTCAGCTACTTCGACACCAACGAC



EVHI

GAGGTGCACATC





50
LKYLGIKKQNRAFEI
131
CTGAAGTACCTGGGCATCAAGAAGCAGAACAGGGCCTTCGAGA



ITAELFNTSYKLSAT

TCATCACCGCCGAGCTGTTCAACACCAGCTACAAGCTGAGCGCC



HLGGGRRPDVLVYN

ACCCACCTGGGCGGCGGCAGGAGGCCCGACGTGCTGGTGTACA



DNFGIIVDTKAYKD

ACGACAACTTCGGCATCATCGTGGACACCAAGGCCTACAAGGAC



GYGRNVNQEDEMV

GGCTACGGCAGGAACGTGAACCAGGAGGACGAGATGGTGAGGT



RYITENNIRKQDINK

ACATCACCGAGAACAACATCAGGAAGCAGGACATCAACAAGAA



NDWWKYFSKSIPST

CGACTGGTGGAAGTACTTCAGCAAGAGCATCCCCAGCACCAGCT



SYYHLWISSQFVGM

ACTACCACCTGTGGATCAGCAGCCAGTTCGTGGGCATGTTCAGC



FSDQLRETSSRTGEN

GACCAGCTGAGGGAGACCAGCAGCAGGACCGGCGAGAACGGCG



GGAMNVEQLLIGAN

GCGCCATGAACGTGGAGCAGCTGCTGATCGGCGCCAACCAGGT



QVLNNVLDPNCLPK

GCTGAACAACGTGCTGGACCCCAACTGCCTGCCCAAGTACATGG



YMENKEIIF

AGAACAAGGAGATCATCTTC





51
VPVKSEVSLCKDYL
132
GTGCCCGTGAAGAGCGAGGTGAGCCTGTGCAAGGACTACCTGA



RSHLNHVDHKYLIL

GGAGCCACCTGAACCACGTGGACCACAAGTACCTGATCCTGCTG



LDLGFDGTSDRDYEI

GACCTGGGCTTCGACGGCACCAGCGACAGGGACTACGAGATCC



QTAQLLTGELNFKG

AGACCGCCCAGCTGCTGACCGGCGAGCTGAACTTCAAGGGCGCC



ARLGDTRKPDVCVY

AGGCTGGGCGACACCAGGAAGCCCGACGTGTGCGTGTACTACG



YGEDGLIIDNKAYG

GCGAGGACGGCCTGATCATCGACAACAAGGCCTACGGCAAGGG



KGYSLPIKQADEMY

CTACAGCCTGCCCATCAAGCAGGCCGACGAGATGTACAGGTACA



RYIEENKERNEKLNP

TCGAGGAGAACAAGGAGAGGAACGAGAAGCTGAACCCCAACAA



NKWWEIFDKDVIHY

GTGGTGGGAGATCTTCGACAAGGACGTGATCCACTACCACTTCG



HFAFVSGAFTGGFR

CCTTCGTGAGCGGCGCCTTCACCGGCGGCTTCAGGGAGAGGCTG



ERLENIRMRSGIYGA

GAGAACATCAGGATGAGGAGCGGCATCTACGGCGCCGCCGTGA



AVNSMNLLLMAEEL

ACAGCATGAACCTGCTGCTGATGGCCGAGGAGCTGAAGAGCGG



KSGRLGYKECFKLF

CAGGCTGGGCTACAAGGAGTGCTTCAAGCTGTTCGACTGCAACG



DCNDEIVL

ACGAGATCGTGCTG





52
VPVKSEVSLLKDYL
133
GTGCCCGTGAAGAGCGAGGTGAGCCTGCTGAAGGACTACCTGA



RTHLLHVDHRYLILL

GGACCCACCTGCTGCACGTGGACCACAGGTACCTGATCCTGCTG



DLGFDGTSDRDYEI

GACCTGGGCTTCGACGGCACCAGCGACAGGGACTACGAGATCC



QTAQLLTGELNFKG

AGACCGCCCAGCTGCTGACCGGCGAGCTGAACTTCAAGGGCGCC



ARLGDTRKPDVCVY

AGGCTGGGCGACACCAGGAAGCCCGACGTGTGCGTGTACTACG



YGEDGLIIDNKAYG

GCGAGGACGGCCTGATCATCGACAACAAGGCCTACGGCAAGGG



KGYSLPIKQADEMY

CTACAGCCTGCCCATCAAGCAGGCCGACGAGATGTACAGGTACA



RYIEENKERNEKLNP

TCGAGGAGAACAAGGAGAGGAACGAGAAGCTGAACCCCAACAA



NKWWEIFDNDVIHY

GTGGTGGGAGATCTTCGACAACGACGTGATCCACTACCACTTCG



HFAFISGAFTGGFKE

CCTTCATCAGCGGCGCCTTCACCGGCGGCTTCAAGGAGAGGCTG



RLDNIRMRSGIYGA

GACAACATCAGGATGAGGAGCGGCATCTACGGCGCCGCCGTGA



AVNSMNLLLMAEEL

ACAGCATGAACCTGCTGCTGATGGCCGAGGAGCTGAAGAGCGG



KSGRLGYKECFKLF

CAGGCTGGGCTACAAGGAGTGCTTCAAGCTGTTCGACTGCAACG



DCNDEIVL

ACGAGATCGTGCTG





53
VPVKSEVSLCKDYL
134
GTGCCCGTGAAGAGCGAGGTGAGCCTGTGCAAGGACTACCTGA



RSHLNHVDHKYLIL

GGAGCCACCTGAACCACGTGGACCACAAGTACCTGATCCTGCTG



LDLGFDGTSDRDYEI

GACCTGGGCTTCGACGGCACCAGCGACAGGGACTACGAGATCC



QTAQLLTGELNFKG

AGACCGCCCAGCTGCTGACCGGCGAGCTGAACTTCAAGGGCGCC



ARLGDTRKPDVCVY

AGGCTGGGCGACACCAGGAAGCCCGACGTGTGCGTGTACTACG



YGEDGLIIDNKAYG

GCGAGGACGGCCTGATCATCGACAACAAGGCCTACGGCAAGGG



KGYSLPIKQADEMY

CTACAGCCTGCCCATCAAGCAGGCCGACGAGATGTACAGGTACA



RYIEENKERNEKLNP

TCGAGGAGAACAAGGAGAGGAACGAGAAGCTGAACCCCAACAA



NKWWEIFDNDVIHY

GTGGTGGGAGATCTTCGACAACGACGTGATCCACTACCACTTCG



HFAFVSGAFTGGFR

CCTTCGTGAGCGGCGCCTTCACCGGCGGCTTCAGGGAGAGGCTG



ERLENIRMRSGIYGA

GAGAACATCAGGATGAGGAGCGGCATCTACGGCGCCGCCGTGA



AVNSMNLLLMAEEL

ACAGCATGAACCTGCTGCTGATGGCCGAGGAGCTGAAGAGCGG



KSGRLGYKECFKLF

CAGGCTGGGCTACAAGGAGTGCTTCAAGCTGTTCGACTGCAACG



DCNDEIVL

ACGAGATCGTGCTG





54
VPVKSEMSLLKDYL
135
GTGCCCGTGAAGAGCGAGATGAGCCTGCTGAAGGACTACCTGA



RTHLLHVDHRYLILL

GGACCCACCTGCTGCACGTGGACCACAGGTACCTGATCCTGCTG



DLGFDGASDRDYEI

GACCTGGGCTTCGACGGCGCCAGCGACAGGGACTACGAGATCC



QTAQLLTGELNFKG

AGACCGCCCAGCTGCTGACCGGCGAGCTGAACTTCAAGGGCGCC



ARLGDTRKPDVCVY

AGGCTGGGCGACACCAGGAAGCCCGACGTGTGCGTGTACTACG



YGEDGLIIDNKAYG

GCGAGGACGGCCTGATCATCGACAACAAGGCCTACGGCAAGGG



KGYSLPIKQADEMY

CTACAGCCTGCCCATCAAGCAGGCCGACGAGATGTACAGGTACA



RYIEENKERNEKLNP

TCGAGGAGAACAAGGAGAGGAACGAGAAGCTGAACCCCAACAA



NKWWEIFDNDVIHY

GTGGTGGGAGATCTTCGACAACGACGTGATCCACTACCACTTCG



HFAFVSGAFTGGFK

CCTTCGTGAGCGGCGCCTTCACCGGCGGCTTCAAGGAGAGGCTG



ERLDNIRMRSGIYGA

GACAACATCAGGATGAGGAGCGGCATCTACGGCGCCGCCGTGA



AVNSMNLLLMAEEL

ACAGCATGAACCTGCTGCTGATGGCCGAGGAGCTGAAGAGCGG



KSGRLGYKECFKLF

CAGGCTGGGCTACAAGGAGTGCTTCAAGCTGTTCGACTGCAACG



DCNDEIVL

ACGAGATCGTGCTG





55
ILVDKEREMRKAKF
136
ATCCTGGTGGACAAGGAGAGGGAGATGAGGAAGGCCAAGTTCC



LKETVLDSKFISLLD

TGAAGGAGACCGTGCTGGACAGCAAGTTCATCAGCCTGCTGGAC



LAADATKSRDFEIVT

CTGGCCGCCGACGCCACCAAGAGCAGGGACTTCGAGATCGTGA



AELFKEAYNLNSVL

CCGCCGAGCTGTTCAAGGAGGCCTACAACCTGAACAGCGTGCTG



LGGSNKPDGLVFTD

CTGGGCGGCAGCAACAAGCCCGACGGCCTGGTGTTCACCGACG



DFGILLDTKAYKNG

ACTTCGGCATCCTGCTGGACACCAAGGCCTACAAGAACGGCTTC



FSIYAKDRDQMIRY

AGCATCTACGCCAAGGACAGGGACCAGATGATCAGGTACGTGG



VDDNNKRDKIRNPN

ACGACAACAACAAGAGGGACAAGATCAGGAACCCCAACGAGTG



EWWKSFSPLIPNDKF

GTGGAAGAGCTTCAGCCCCCTGATCCCCAACGACAAGTTCTACT



YYLWVSNFFKGQFK

ACCTGTGGGTGAGCAACTTCTTCAAGGGCCAGTTCAAGAACCAG



NQIEYVNRETNTYG

ATCGAGTACGTGAACAGGGAGACCAACACCTACGGCGCCGTGC



AVLNVEQLLYGADA

TGAACGTGGAGCAGCTGCTGTACGGCGCCGACGCCGTGATCAAG



VIKGIINPNKLHEYFS

GGCATCATCAACCCCAACAAGCTGCACGAGTACTTCAGCAACGA



NDEIKF

CGAGATCAAGTTC





56
TVDEKERLELKEYFI
137
ACCGTGGACGAGAAGGAGAGGCTGGAGCTGAAGGAGTACTTCA



SNTRIPSKYITLLDLA

TCAGCAACACCAGGATCCCCAGCAAGTACATCACCCTGCTGGAC



YDGNANRDFEIVTA

CTGGCCTACGACGGCAACGCCAACAGGGACTTCGAGATCGTGAC



ELFKDIFKLQSKHM

CGCCGAGCTGTTCAAGGACATCTTCAAGCTGCAGAGCAAGCACA



GGTRKPDILIWTDKF

TGGGCGGCACCAGGAAGCCCGACATCCTGATCTGGACCGACAA



GVIADTKAYSKGYK

GTTCGGCGTGATCGCCGACACCAAGGCCTACAGCAAGGGCTACA



KNISEADKMVRYVN

AGAAGAACATCAGCGAGGCCGACAAGATGGTGAGGTACGTGAA



ENTNRNKVDNTNE

CGAGAACACCAACAGGAACAAGGTGGACAACACCAACGAGTGG



WWNSFDSRIPKDAY

TGGAACAGCTTCGACAGCAGGATCCCCAAGGACGCCTACTACTT



YFLWISSEFVGKFDE

CCTGTGGATCAGCAGCGAGTTCGTGGGCAAGTTCGACGAGCAGC



QLTETSSRTGRNGAS

TGACCGAGACCAGCAGCAGGACCGGCAGGAACGGCGCCAGCAT



INVYQLLRGADLVQ

CAACGTGTACCAGCTGCTGAGGGGCGCCGACCTGGTGCAGAAG



KSKFNIHDLPNLMQ

AGCAAGTTCAACATCCACGACCTGCCCAACCTGATGCAGAACAA



NNEIKF

CGAGATCAAGTTC





57
TLQKSDIEKFKNQLR
138
ACCCTGCAGAAGAGCGACATCGAGAAGTTCAAGAACCAGCTGA



TELTNIDHSYLKGIDI

GGACCGAGCTGACCAACATCGACCACAGCTACCTGAAGGGCAT



ASKKTTTNVENTEF

CGACATCGCCAGCAAGAAGACCACCACCAACGTGGAGAACACC



EAISTKVFTDELGFF

GAGTTCGAGGCCATCAGCACCAAGGTGTTCACCGACGAGCTGGG



GEHLGGSNKPDGLI

CTTCTTCGGCGAGCACCTGGGCGGCAGCAACAAGCCCGACGGCC



WDNDCAIILDSKAY

TGATCTGGGACAACGACTGCGCCATCATCCTGGACAGCAAGGCC



SEGFPLTASHTDAM

TACAGCGAGGGCTTCCCCCTGACCGCCAGCCACACCGACGCCAT



GRYLRQFKERKEEIK

GGGCAGGTACCTGAGGCAGTTCAAGGAGAGGAAGGAGGAGATC



PTWWDIAPDNLANT

AAGCCCACCTGGTGGGACATCGCCCCCGACAACCTGGCCAACAC



YFAYVSGSFSGNYK

CTACTTCGCCTACGTGAGCGGCAGCTTCAGCGGCAACTACAAGG



AQLQKFRQDTNHM

CCCAGCTGCAGAAGTTCAGGCAGGACACCAACCACATGGGCGG



GGALEFVKLLLLAN

CGCCCTGGAGTTCGTGAAGCTGCTGCTGCTGGCCAACAACTACA



NYKAHKMSINEVKE

AGGCCCACAAGATGAGCATCAACGAGGTGAAGGAGAGCATCCT



SILDYNISY

GGACTACAACATCAGCTAC





58
VKEKTDAALVKERV
139
GTGAAGGAGAAGACCGACGCCGCCCTGGTGAAGGAGAGGGTGA



RLQLHNINHKYLALI

GGCTGCAGCTGCACAACATCAACCACAAGTACCTGGCCCTGATC



DYAFSGKNNSRDFE

GACTACGCCTTCAGCGGCAAGAACAACAGCAGGGACTTCGAGG



VYTIDLLVNELTFGG

TGTACACCATCGACCTGCTGGTGAACGAGCTGACCTTCGGCGGC



LHLGGTRKPDGIFY

CTGCACCTGGGCGGCACCAGGAAGCCCGACGGCATCTTCTACCA



HGSNGIIIDNKAYAK

CGGCAGCAACGGCATCATCATCGACAACAAGGCCTACGCCAAG



GFVITRNMADEMIR

GGCTTCGTGATCACCAGGAACATGGCCGACGAGATGATCAGGTA



YVQENNDRNPERNP

CGTGCAGGAGAACAACGACAGGAACCCCGAGAGGAACCCCAAC



NCWWKGFPHDVTR

TGCTGGTGGAAGGGCTTCCCCCACGACGTGACCAGGTACAACTA



YNYVFISSMFKGEV

CGTGTTCATCAGCAGCATGTTCAAGGGCGAGGTGGAGCACATGC



EHMLDNIRQSTGIDG

TGGACAACATCAGGCAGAGCACCGGCATCGACGGCTGCGTGCT



CVLTIENLLYYADAI

GACCATCGAGAACCTGCTGTACTACGCCGACGCCATCAAGGGCG



KGGTLSKATFINGFN

GCACCCTGAGCAAGGCCACCTTCATCAACGGCTTCAACGCCAAC



ANKEMVF

AAGGAGATGGTGTTC





59
VKETTDSVIIKDRVR
140
GTGAAGGAGACCACCGACAGCGTGATCATCAAGGACAGGGTGA



LKLHHVNHKYLTLI

GGCTGAAGCTGCACCACGTGAACCACAAGTACCTGACCCTGATC



DYAFSGKNNCMDFE

GACTACGCCTTCAGCGGCAAGAACAACTGCATGGACTTCGAGGT



VYTIDLLVNELAFN

GTACACCATCGACCTGCTGGTGAACGAGCTGGCCTTCAACGGCG



GVHLGGTRKPDGIF

TGCACCTGGGCGGCACCAGGAAGCCCGACGGCATCTTCTACCAC



YHNRNGIIIDNKAYS

AACAGGAACGGCATCATCATCGACAACAAGGCCTACAGCCACG



HGFTLSRAMADEMI

GCTTCACCCTGAGCAGGGCCATGGCCGACGAGATGATCAGGTAC



RYIQENNDRNPERN

ATCCAGGAGAACAACGACAGGAACCCCGAGAGGAACCCCAACA



PNKWWENFDKGVN

AGTGGTGGGAGAACTTCGACAAGGGCGTGAACCAGTTCAACTTC



QFNFVFISSLFKGEIE

GTGTTCATCAGCAGCCTGTTCAAGGGCGAGATCGAGCACATGCT



HMLTNIKQSTDGVE

GACCAACATCAAGCAGAGCACCGACGGCGTGGAGGGCTGCGTG



GCVLSAENLLYFAE

CTGAGCGCCGAGAACCTGCTGTACTTCGCCGAGGCCATGAAGAG



AMKSGVMPKTEFIS

CGGCGTGATGCCCAAGACCGAGTTCATCAGCTACTTCGGCGCCG



YFGAGKEIQF

GCAAGGAGATCCAGTTC





60
SACKADITELKDKIR
141
AGCGCCTGCAAGGCCGACATCACCGAGCTGAAGGACAAGATCA



KSLKVLDHKYLVLV

GGAAGAGCCTGAAGGTGCTGGACCACAAGTACCTGGTGCTGGT



DLAYSDASTKSKKN

GGACCTGGCCTACAGCGACGCCAGCACCAAGAGCAAGAAGAAC



SDAREFEIQTADLFT

AGCGACGCCAGGGAGTTCGAGATCCAGACCGCCGACCTGTTCAC



KELKFDGMRLGDSN

CAAGGAGCTGAAGTTCGACGGCATGAGGCTGGGCGACAGCAAC



RPDVIISHDNFGTIID

AGGCCCGACGTGATCATCAGCCACGACAACTTCGGCACCATCAT



NKSYKDGFNIDKKC

CGACAACAAGAGCTACAAGGACGGCTTCAACATCGACAAGAAG



ADEMSRYINENQRRI

TGCGCCGACGAGATGAGCAGGTACATCAACGAGAACCAGAGGA



PELPKNEWWKNFD

GGATCCCCGAGCTGCCCAAGAACGAGTGGTGGAAGAACTTCGA



VNVDIFTFLFITSYLK

CGTGAACGTGGACATCTTCACCTTCCTGTTCATCACCAGCTACCT



GNFKDQLEYISKSQS

GAAGGGCAACTTCAAGGACCAGCTGGAGTACATCAGCAAGAGC



DIKGAAISVEHLLYI

CAGAGCGACATCAAGGGCGCCGCCATCAGCGTGGAGCACCTGC



SEKVKNGSMDKADF

TGTACATCAGCGAGAAGGTGAAGAACGGCAGCATGGACAAGGC



FKLFNNDEIRV

CGACTTCTTCAAGCTGTTCAACAACGACGAGATCAGGGTG





61
VLKDKHLEKIKEKF
142
GTGCTGAAGGACAAGCACCTGGAGAAGATCAAGGAGAAGTTCC



LENTSLDPRFISLIEIS

TGGAGAACACCAGCCTGGACCCCAGGTTCATCAGCCTGATCGAG



RDKKQNRAFEIITAE

ATCAGCAGGGACAAGAAGCAGAACAGGGCCTTCGAGATCATCA



LFNTSYNLSAIHLGG

CCGCCGAGCTGTTCAACACCAGCTACAACCTGAGCGCCATCCAC



GRRPDVLAYNDNFG

CTGGGCGGCGGCAGGAGGCCCGACGTGCTGGCCTACAACGACA



IIVDTKAYKNGYGR

ACTTCGGCATCATCGTGGACACCAAGGCCTACAAGAACGGCTAC



NVNQEDEMVRYITE

GGCAGGAACGTGAACCAGGAGGACGAGATGGTGAGGTACATCA



NKIRKQDISKNNWW

CCGAGAACAAGATCAGGAAGCAGGACATCAGCAAGAACAACTG



KYFSKSIPSTSYYHL

GTGGAAGTACTTCAGCAAGAGCATCCCCAGCACCAGCTACTACC



WISSEFVGMFSDQL

ACCTGTGGATCAGCAGCGAGTTCGTGGGCATGTTCAGCGACCAG



RETSSRTGENGGAM

CTGAGGGAGACCAGCAGCAGGACCGGCGAGAACGGCGGCGCCA



NVEQLLIGANQVLN

TGAACGTGGAGCAGCTGCTGATCGGCGCCAACCAGGTGCTGAAC



NVLDPNRLPEYMEN

AACGTGCTGGACCCCAACAGGCTGCCCGAGTACATGGAGAACA



KEIIF

AGGAGATCATCTTC





62
ALKDKHLEKIKEKF
143
GCCCTGAAGGACAAGCACCTGGAGAAGATCAAGGAGAAGTTCC



LENTSLDPRFISLIEIS

TGGAGAACACCAGCCTGGACCCCAGGTTCATCAGCCTGATCGAG



RDKKQNRAFEIITAE

ATCAGCAGGGACAAGAAGCAGAACAGGGCCTTCGAGATCATCA



LFNTSYKLSATHLG

CCGCCGAGCTGTTCAACACCAGCTACAAGCTGAGCGCCACCCAC



GGRRPDVLVYNDNF

CTGGGCGGCGGCAGGAGGCCCGACGTGCTGGTGTACAACGACA



GIIVDTKAYKDGYG

ACTTCGGCATCATCGTGGACACCAAGGCCTACAAGGACGGCTAC



RNVNQEDEMVRYIT

GGCAGGAACGTGAACCAGGAGGACGAGATGGTGAGGTACATCA



ENNIRKQDINKNDW

CCGAGAACAACATCAGGAAGCAGGACATCAACAAGAACGACTG



WKYFSKSIPSTSYYH

GTGGAAGTACTTCAGCAAGAGCATCCCCAGCACCAGCTACTACC



LWISSQFVGMFSDQ

ACCTGTGGATCAGCAGCCAGTTCGTGGGCATGTTCAGCGACCAG



LRETSSRTGENGGA

CTGAGGGAGACCAGCAGCAGGACCGGCGAGAACGGCGGCGCCA



MNVEQLLIGANQVL

TGAACGTGGAGCAGCTGCTGATCGGCGCCAACCAGGTGCTGAAC



NNVLDPNCLPKYME

AACGTGCTGGACCCCAACTGCCTGCCCAAGTACATGGAGAACAA



NKEIIF

GGAGATCATCTTC





63
VLEKSDIEKFKNQLR
144
GTGCTGGAGAAGAGCGACATCGAGAAGTTCAAGAACCAGCTGA



TELTNIDHSYLKGIDI

GGACCGAGCTGACCAACATCGACCACAGCTACCTGAAGGGCAT



ASKKKTSNVENTEF

CGACATCGCCAGCAAGAAGAAGACCAGCAACGTGGAGAACACC



EAISTKIFTDELGFSG

GAGTTCGAGGCCATCAGCACCAAGATCTTCACCGACGAGCTGGG



KHLGGSNKPDGLLW

CTTCAGCGGCAAGCACCTGGGCGGCAGCAACAAGCCCGACGGC



DDDCAIILDSKAYSE

CTGCTGTGGGACGACGACTGCGCCATCATCCTGGACAGCAAGGC



GFPLTASHTDAMGR

CTACAGCGAGGGCTTCCCCCTGACCGCCAGCCACACCGACGCCA



YLRQFTERKEEIKPT

TGGGCAGGTACCTGAGGCAGTTCACCGAGAGGAAGGAGGAGAT



WWDIAPEHLDNTYF

CAAGCCCACCTGGTGGGACATCGCCCCCGAGCACCTGGACAACA



AYVSGSFSGNYKEQ

CCTACTTCGCCTACGTGAGCGGCAGCTTCAGCGGCAACTACAAG



LQKFRQDTNHLGGA

GAGCAGCTGCAGAAGTTCAGGCAGGACACCAACCACCTGGGCG



LEFVKLLLLANNYK

GCGCCCTGGAGTTCGTGAAGCTGCTGCTGCTGGCCAACAACTAC



TQKMSKKEVKKSIL

AAGACCCAGAAGATGAGCAAGAAGGAGGTGAAGAAGAGCATCC



DYNISY

TGGACTACAACATCAGCTAC





64
AEADVTSEKIKNHF
145
GCCGAGGCCGACGTGACCAGCGAGAAGATCAAGAACCACTTCA



RRVTELPERYLELLD

GGAGGGTGACCGAGCTGCCCGAGAGGTACCTGGAGCTGCTGGA



IAFDHKRNRDFEMV

CATCGCCTTCGACCACAAGAGGAACAGGGACTTCGAGATGGTG



TAGLFKDVYGLESV

ACCGCCGGCCTGTTCAAGGACGTGTACGGCCTGGAGAGCGTGCA



HLGGANKPDGVVY

CCTGGGCGGCGCCAACAAGCCCGACGGCGTGGTGTACAACGAC



NDNFGIILDTKAYEN

AACTTCGGCATCATCCTGGACACCAAGGCCTACGAGAACGGCTA



GYGKHISQIDEMVR

CGGCAAGCACATCAGCCAGATCGACGAGATGGTGAGGTACATC



YIDDNRLRDTTRNP

GACGACAACAGGCTGAGGGACACCACCAGGAACCCCAACAAGT



NKWWENFDADIPSD

GGTGGGAGAACTTCGACGCCGACATCCCCAGCGACCAGTTCTAC



QFYYLWVSGKFLPN

TACCTGTGGGTGAGCGGCAAGTTCCTGCCCAACTTCGCCGAGCA



FAEQLKQTNYRSHA

GCTGAAGCAGACCAACTACAGGAGCCACGCCAACGGCGGCGGC



NGGGLEVQQLLLGA

CTGGAGGTGCAGCAGCTGCTGCTGGGCGCCGACGCCGTGAAGA



DAVKRRKLDVNTIP

GGAGGAAGCTGGACGTGAACACCATCCCCAACTACATGAAGAA



NYMKNEVITL

CGAGGTGATCACCCTG





65
AEADLNSEKIKNHY
146
GCCGAGGCCGACCTGAACAGCGAGAAGATCAAGAACCACTACA



RKITNLPEKYIELLDI

GGAAGATCACCAACCTGCCCGAGAAGTACATCGAGCTGCTGGA



AFDHRRHQDFEIVT

CATCGCCTTCGACCACAGGAGGCACCAGGACTTCGAGATCGTGA



AGLFKDCYGLSSIHL

CCGCCGGCCTGTTCAAGGACTGCTACGGCCTGAGCAGCATCCAC



GGQNKPDGVVFNN

CTGGGCGGCCAGAACAAGCCCGACGGCGTGGTGTTCAACAACA



KFGIILDTKAYEKGY

AGTTCGGCATCATCCTGGACACCAAGGCCTACGAGAAGGGCTAC



GMHIGQIDEMCRYI

GGCATGCACATCGGCCAGATCGACGAGATGTGCAGGTACATCG



DDNKKRDIVRQPNE

ACGACAACAAGAAGAGGGACATCGTGAGGCAGCCCAACGAGTG



WWKNFGDNIPKDQF

GTGGAAGAACTTCGGCGACAACATCCCCAAGGACCAGTTCTACT



YYLWISGKFLPRFNE

ACCTGTGGATCAGCGGCAAGTTCCTGCCCAGGTTCAACGAGCAG



QLKQTHYRTSINGG

CTGAAGCAGACCCACTACAGGACCAGCATCAACGGCGGCGGCC



GLEVSQLLLGANAA

TGGAGGTGAGCCAGCTGCTGCTGGGCGCCAACGCCGCCATGAA



MKGKLDVNTLPKH

GGGCAAGCTGGACGTGAACACCCTGCCCAAGCACATGAACAAC



MNNQVIKL

CAGGTGATCAAGCTG





66
VLKDAALQKTKNTL
147
GTGCTGAAGGACGCCGCCCTGCAGAAGACCAAGAACACCCTGC



LNELTEIDPADIEVIE

TGAACGAGCTGACCGAGATCGACCCCGCCGACATCGAGGTGATC



MSWKKATTRSQNTL

GAGATGAGCTGGAAGAAGGCCACCACCAGGAGCCAGAACACCC



EATLFEVKVVEIFKK

TGGAGGCCACCCTGTTCGAGGTGAAGGTGGTGGAGATCTTCAAG



YFELNGEHLGGQNR

AAGTACTTCGAGCTGAACGGCGAGCACCTGGGCGGCCAGAACA



PDGAVYYNSTYGIIL

GGCCCGACGGCGCCGTGTACTACAACAGCACCTACGGCATCATC



DTKAYSNGYNIPVD

CTGGACACCAAGGCCTACAGCAACGGCTACAACATCCCCGTGGA



QQREMVDYITDVID

CCAGCAGAGGGAGATGGTGGACTACATCACCGACGTGATCGAC



KNQNVTPNRWWEA

AAGAACCAGAACGTGACCCCCAACAGGTGGTGGGAGGCCTTCC



FPATLLKNNIYYLW

CCGCCACCCTGCTGAAGAACAACATCTACTACCTGTGGGTGGCC



VAGGFTGKYLDQLT

GGCGGCTTCACCGGCAAGTACCTGGACCAGCTGACCAGGACCCA



RTHNQTNMDGGAM

CAACCAGACCAACATGGACGGCGGCGCCATGACCACCGAGGTG



TTEVLLRLANKVSS

CTGCTGAGGCTGGCCAACAAGGTGAGCAGCGGCAACCTGAAGA



GNLKTTDIPKLMTN

CCACCGACATCCCCAAGCTGATGACCAACAAGCTGATCCTGAGC



KLILS







67
AEADLDSERIKNHY
148
GCCGAGGCCGACCTGGACAGCGAGAGGATCAAGAACCACTACA



RKITNLPEKYIELLDI

GGAAGATCACCAACCTGCCCGAGAAGTACATCGAGCTGCTGGA



AFDHHRHQDFEIITA

CATCGCCTTCGACCACCACAGGCACCAGGACTTCGAGATCATCA



GLFKDCYGLSSIHLG

CCGCCGGCCTGTTCAAGGACTGCTACGGCCTGAGCAGCATCCAC



GQNKPDGVVFNGKF

CTGGGCGGCCAGAACAAGCCCGACGGCGTGGTGTTCAACGGCA



GIILDTKAYEKGYG

AGTTCGGCATCATCCTGGACACCAAGGCCTACGAGAAGGGCTAC



MHINQIDEMCRYIED

GGCATGCACATCAACCAGATCGACGAGATGTGCAGGTACATCG



NKQRDKIRQPNEW

AGGACAACAAGCAGAGGGACAAGATCAGGCAGCCCAACGAGTG



WNNFGDNIPENKFY

GTGGAACAACTTCGGCGACAACATCCCCGAGAACAAGTTCTACT



YLWVSGKFLPKFNE

ACCTGTGGGTGAGCGGCAAGTTCCTGCCCAAGTTCAACGAGCAG



QLKQTHYRTGINGG

CTGAAGCAGACCCACTACAGGACCGGCATCAACGGCGGCGGCC



GLEVSQLLLGADAV

TGGAGGTGAGCCAGCTGCTGCTGGGCGCCGACGCCGTGATGAA



MKGALNVNILPTYM

GGGCGCCCTGAACGTGAACATCCTGCCCACCTACATGCACAACA



HNNVIQ

ACGTGATCCAG





68
EISDIALQKEKAYFY
149
GAGATCAGCGACATCGCCCTGCAGAAGGAGAAGGCCTACTTCTA



KNTALSKRHISILEIA

CAAGAACACCGCCCTGAGCAAGAGGCACATCAGCATCCTGGAG



FDGSKNRDLEILSAE

ATCGCCTTCGACGGCAGCAAGAACAGGGACCTGGAGATCCTGA



VFKDYYQLESIHLG

GCGCCGAGGTGTTCAAGGACTACTACCAGCTGGAGAGCATCCAC



GGLKPDGIAFNQNF

CTGGGCGGCGGCCTGAAGCCCGACGGCATCGCCTTCAACCAGAA



GIIVDTKAYKGVYS

CTTCGGCATCATCGTGGACACCAAGGCCTACAAGGGCGTGTACA



RSRAEADKMFRYIE

GCAGGAGCAGGGCCGAGGCCGACAAGATGTTCAGGTACATCGA



DNKKRDPKRNQSL

GGACAACAAGAAGAGGGACCCCAAGAGGAACCAGAGCCTGTGG



WWRSFNEHIPANNF

TGGAGGAGCTTCAACGAGCACATCCCCGCCAACAACTTCTACTT



YFLWISGKFQRNFD

CCTGTGGATCAGCGGCAAGTTCCAGAGGAACTTCGACACCCAGA



TQINQLNYETGYRG

TCAACCAGCTGAACTACGAGACCGGCTACAGGGGCGGCGCCCT



GALSARQFLIGADAI

GAGCGCCAGGCAGTTCCTGATCGGCGCCGACGCCATCCAGAAG



QKGKIDINDLPSYFN

GGCAAGATCGACATCAACGACCTGCCCAGCTACTTCAACAACAG



NSVISF

CGTGATCAGCTTC





69
TSREKSRLNLKEYFV
150
ACCAGCAGGGAGAAGAGCAGGCTGAACCTGAAGGAGTACTTCG



SNTNLPNKFITLLDL

TGAGCAACACCAACCTGCCCAACAAGTTCATCACCCTGCTGGAC



AYDGKANRDFELIT

CTGGCCTACGACGGCAAGGCCAACAGGGACTTCGAGCTGATCAC



SELFREIYKLNTRHL

CAGCGAGCTGTTCAGGGAGATCTACAAGCTGAACACCAGGCAC



GGTRKPDILIWNENF

CTGGGCGGCACCAGGAAGCCCGACATCCTGATCTGGAACGAGA



GIIADTKAYSKGYK

ACTTCGGCATCATCGCCGACACCAAGGCCTACAGCAAGGGCTAC



KNISEEDKMVRYIDE

AAGAAGAACATCAGCGAGGAGGACAAGATGGTGAGGTACATCG



NIKRSKDYNPNEWW

ACGAGAACATCAAGAGGAGCAAGGACTACAACCCCAACGAGTG



KVFDNEISSNNYFYL

GTGGAAGGTGTTCGACAACGAGATCAGCAGCAACAACTACTTCT



WISSEFIGKFEEQLQ

ACCTGTGGATCAGCAGCGAGTTCATCGGCAAGTTCGAGGAGCAG



ETAQRTNVKGASIN

CTGCAGGAGACCGCCCAGAGGACCAACGTGAAGGGCGCCAGCA



VYQLLMGAHKVQT

TCAACGTGTACCAGCTGCTGATGGGCGCCCACAAGGTGCAGACC



KELNVNSIPKYMNN

AAGGAGCTGAACGTGAACAGCATCCCCAAGTACATGAACAACA



TEIKF

CCGAGATCAAGTTC





70
NCIKDSIIDIKDRVRT
151
AACTGCATCAAGGACAGCATCATCGACATCAAGGACAGGGTGA



KLVHLDHKYLALID

GGACCAAGCTGGTGCACCTGGACCACAAGTACCTGGCCCTGATC



LAFSDADTRTKKNS

GACCTGGCCTTCAGCGACGCCGACACCAGGACCAAGAAGAACA



DAREFEIQTADLFTK

GCGACGCCAGGGAGTTCGAGATCCAGACCGCCGACCTGTTCACC



ELSFNGQRLGDSRK

AAGGAGCTGAGCTTCAACGGCCAGAGGCTGGGCGACAGCAGGA



PDIIISFDKIGTIIDNK

AGCCCGACATCATCATCAGCTTCGACAAGATCGGCACCATCATC



SYKDGFNISRPCADE

GACAACAAGAGCTACAAGGACGGCTTCAACATCAGCAGGCCCT



MIRYINENNLRKKSL

GCGCCGACGAGATGATCAGGTACATCAACGAGAACAACCTGAG



NANEWWNKFDPTIT

GAAGAAGAGCCTGAACGCCAACGAGTGGTGGAACAAGTTCGAC



AYSFLFITSYLKGQF

CCCACCATCACCGCCTACAGCTTCCTGTTCATCACCAGCTACCTG



QEQLEYISNANGGIK

AAGGGCCAGTTCCAGGAGCAGCTGGAGTACATCAGCAACGCCA



GAAIGIENLLYLSEA

ACGGCGGCATCAAGGGCGCCGCCATCGGCATCGAGAACCTGCT



LKSGKISHKDFYQNF

GTACCTGAGCGAGGCCCTGAAGAGCGGCAAGATCAGCCACAAG



NNKEITY

GACTTCTACCAGAACTTCAACAACAAGGAGATCACCTAC





71
LPQKDQVQQQQDEL
152
CTGCCCCAGAAGGACCAGGTGCAGCAGCAGCAGGACGAGCTGA



RPMLKNVDHRYLQL

GGCCCATGCTGAAGAACGTGGACCACAGGTACCTGCAGCTGGTG



VELALDSDQNSEYS

GAGCTGGCCCTGGACAGCGACCAGAACAGCGAGTACAGCCAGT



QFEQLTMELVLKHL

TCGAGCAGCTGACCATGGAGCTGGTGCTGAAGCACCTGGACTTC



DFDGKPLGGSNKPD

GACGGCAAGCCCCTGGGCGGCAGCAACAAGCCCGACGGCATCG



GIAWDNDGNFIIFDT

CCTGGGACAACGACGGCAACTTCATCATCTTCGACACCAAGGCC



KAYNKGYSLAGNT

TACAACAAGGGCTACAGCCTGGCCGGCAACACCGACAAGGTGA



DKVKRYIDDVRDRD

AGAGGTACATCGACGACGTGAGGGACAGGGACACCAGCAGGAC



TSRTSTWWQLVPKS

CAGCACCTGGTGGCAGCTGGTGCCCAAGAGCATCGACGTGCACA



IDVHNLLRFVYVSG

ACCTGCTGAGGTTCGTGTACGTGAGCGGCAACTTCACCGGCAAC



NFTGNYMKLLDSLR

TACATGAAGCTGCTGGACAGCCTGAGGAGCTGGAGCAACGCCC



SWSNAQGGLASVEK

AGGGCGGCCTGGCCAGCGTGGAGAAGCTGCTGCTGACCAGCGA



LLLTSELYLRNMYS

GCTGTACCTGAGGAACATGTACAGCCACCAGGAGCTGATCGACA



HQELIDSWTDNNVK

GCTGGACCGACAACAACGTGAAGCAC



H







72
TTDAVVVKDRARV
153
ACCACCGACGCCGTGGTGGTGAAGGACAGGGCCAGGGTGAGGC



RLHNINHKYLTLIDY

TGCACAACATCAACCACAAGTACCTGACCCTGATCGACTACGCC



AFSGKNNCTEFEIYT

TTCAGCGGCAAGAACAACTGCACCGAGTTCGAGATCTACACCAT



IDLLVNELAFNGIHL

CGACCTGCTGGTGAACGAGCTGGCCTTCAACGGCATCCACCTGG



GGTRKPDGIFDYNQ

GCGGCACCAGGAAGCCCGACGGCATCTTCGACTACAACCAGCA



QGIIIDNKAYSKGFTI

GGGCATCATCATCGACAACAAGGCCTACAGCAAGGGCTTCACCA



TRSMADEMVRYVQ

TCACCAGGAGCATGGCCGACGAGATGGTGAGGTACGTGCAGGA



ENNDRNPERNKTQ

GAACAACGACAGGAACCCCGAGAGGAACAAGACCCAGTGGTGG



WWLNFGDNVNHFN

CTGAACTTCGGCGACAACGTGAACCACTTCAACTTCGTGTTCAT



FVFISSMFKGEVRH

CAGCAGCATGTTCAAGGGCGAGGTGAGGCACATGCTGAACAAC



MLNNIKQSTGVDGC

ATCAAGCAGAGCACCGGCGTGGACGGCTGCGTGCTGACCGCCG



VLTAENLLYFADAI

AGAACCTGCTGTACTTCGCCGACGCCATCAAGGGCGGCACCGTG



KGGTVKRTDFINLF

AAGAGGACCGACTTCATCAACCTGTTCGGCAAGAACGACGAGCT



GKNDEL

G





73
LPKKDNVQRQQDEL
154
CTGCCCAAGAAGGACAACGTGCAGAGGCAGCAGGACGAGCTGA



RPLLKHVDHRYLQL

GGCCCCTGCTGAAGCACGTGGACCACAGGTACCTGCAGCTGGTG



VELALDSSQNSEYS

GAGCTGGCCCTGGACAGCAGCCAGAACAGCGAGTACAGCATGC



MLESMTMELLLTHL

TGGAGAGCATGACCATGGAGCTGCTGCTGACCCACCTGGACTTC



DFDGASLGGASKPD

GACGGCGCCAGCCTGGGCGGCGCCAGCAAGCCCGACGGCATCG



GIAWDKDGNFLIVD

CCTGGGACAAGGACGGCAACTTCCTGATCGTGGACACCAAGGCC



TKAYDNGYSLAGNT

TACGACAACGGCTACAGCCTGGCCGGCAACACCGACAAGGTGG



DKVARYIDDVRAKD

CCAGGTACATCGACGACGTGAGGGCCAAGGACCCCAACAGGGC



PNRASTWWTQVPES

CAGCACCTGGTGGACCCAGGTGCCCGAGAGCCTGAACGTGGAC



LNVDDNLSFMYVSG

GACAACCTGAGCTTCATGTACGTGAGCGGCAGCTTCACCGGCAA



SFTGNYQRLLKDLR

CTACCAGAGGCTGCTGAAGGACCTGAGGGCCAGGACCAACGCC



ARTNARGGLTTVEK

AGGGGCGGCCTGACCACCGTGGAGAAGCTGCTGCTGACCAGCG



LLLTSEAYLAKSGY

AGGCCTACCTGGCCAAGAGCGGCTACGGCCACACCCAGCTGCTG



GHTQLLNDWTDDNI

AACGACTGGACCGACGACAACATCGACCAC



DH







74
QIKDKYLEDLKLEL
155
CAGATCAAGGACAAGTACCTGGAGGACCTGAAGCTGGAGCTGT



YKKTNLPNKYYEM

ACAAGAAGACCAACCTGCCCAACAAGTACTACGAGATGGTGGA



VDIAYDGKRNREFEI

CATCGCCTACGACGGCAAGAGGAACAGGGAGTTCGAGATCTAC



YTSDLMQEIYGFKT

ACCAGCGACCTGATGCAGGAGATCTACGGCTTCAAGACCACCCT



TLLGGTRKPDVVSY

GCTGGGCGGCACCAGGAAGCCCGACGTGGTGAGCTACAGCGAC



SDAHGYIIDTKAYA

GCCCACGGCTACATCATCGACACCAAGGCCTACGCCAACGGCTA



NGYRKEIKQEDEMV

CAGGAAGGAGATCAAGCAGGAGGACGAGATGGTGAGGTACATC



RYIEDNQLKDVLRN

GAGGACAACCAGCTGAAGGACGTGCTGAGGAACCCCAACAAGT



PNKWWECFDDAEH

GGTGGGAGTGCTTCGACGACGCCGAGCACAAGAAGGAGTACTA



KKEYYFLWISSKFV

CTTCCTGTGGATCAGCAGCAAGTTCGTGGGCGAGTTCAGCAGCC



GEFSSQLQDTSRRTG

AGCTGCAGGACACCAGCAGGAGGACCGGCATCAAGGGCGGCGC



IKGGAVNIVQLLLG

CGTGAACATCGTGCAGCTGCTGCTGGGCGCCCACCTGGTGTACA



AHLVYSGEISKDQF

GCGGCGAGATCAGCAAGGACCAGTTCGCCGCCTACATGAACAA



AAYMNNTEINF

CACCGAGATCAACTTC





75
MNPRNEIVIAKHLSG
156
ATGAACCCCAGGAACGAGATCGTGATCGCCAAGCACCTGAGCG



GNRPEIVCYHPEDKP

GCGGCAACAGGCCCGAGATCGTGTGCTACCACCCCGAGGACAA



DHGLILDSKAYKSG

GCCCGACCACGGCCTGATCCTGGACAGCAAGGCCTACAAGAGC



FTIPSGERDKMVRYI

GGCTTCACCATCCCCAGCGGCGAGAGGGACAAGATGGTGAGGT



EEYITKNQLQNPNE

ACATCGAGGAGTACATCACCAAGAACCAGCTGCAGAACCCCAA



WWKNLKGAEYPGI

CGAGTGGTGGAAGAACCTGAAGGGCGCCGAGTACCCCGGCATC



VGFGFISNSFLGHYR

GTGGGCTTCGGCTTCATCAGCAACAGCTTCCTGGGCCACTACAG



KQLDYIMRRTKIKG

GAAGCAGCTGGACTACATCATGAGGAGGACCAAGATCAAGGGC



SSITTEHLLKTVEDV

AGCAGCATCACCACCGAGCACCTGCTGAAGACCGTGGAGGACG



LSEKGNVIDFFKYFL

TGCTGAGCGAGAAGGGCAACGTGATCGACTTCTTCAAGTACTTC



E

CTGGAG





76
EIKNQEIEELKQIALN
157
GAGATCAAGAACCAGGAGATCGAGGAGCTGAAGCAGATCGCCC



KYTALPSEWVELIEI

TGAACAAGTACACCGCCCTGCCCAGCGAGTGGGTGGAGCTGATC



SRDKDQSTIFEMKV

GAGATCAGCAGGGACAAGGACCAGAGCACCATCTTCGAGATGA



AELFKTCYRIKSLHL

AGGTGGCCGAGCTGTTCAAGACCTGCTACAGGATCAAGAGCCTG



GGASKPDCLLWDDS

CACCTGGGCGGCGCCAGCAAGCCCGACTGCCTGCTGTGGGACGA



FSVIVDAKAYKDGF

CAGCTTCAGCGTGATCGTGGACGCCAAGGCCTACAAGGACGGCT



PFQASEKDKMVRYL

TCCCCTTCCAGGCCAGCGAGAAGGACAAGATGGTGAGGTACCTG



RECERKDKAENATE

AGGGAGTGCGAGAGGAAGGACAAGGCCGAGAACGCCACCGAGT



WWNNFPPELNSNQL

GGTGGAACAACTTCCCCCCCGAGCTGAACAGCAACCAGCTGTTC



FFMFASSFFSSTAEK

TTCATGTTCGCCAGCAGCTTCTTCAGCAGCACCGCCGAGAAGCA



HLESVSIASKFSGCA

CCTGGAGAGCGTGAGCATCGCCAGCAAGTTCAGCGGCTGCGCCT



WDVDNLLSGANFFL

GGGACGTGGACAACCTGCTGAGCGGCGCCAACTTCTTCCTGCAG



QNPQATLQYHLIRV

AACCCCCAGGCCACCCTGCAGTACCACCTGATCAGGGTGTTCAG



FSNKVVD

CAACAAGGTGGTGGAC





77
LPHKDNVIKQQDEL
158
CTGCCCCACAAGGACAACGTGATCAAGCAGCAGGACGAGCTGA



RPMLKHVNHKYLQ

GGCCCATGCTGAAGCACGTGAACCACAAGTACCTGCAGCTGGTG



LVELAFESSRNSEYS

GAGCTGGCCTTCGAGAGCAGCAGGAACAGCGAGTACAGCCAGT



QFETLTMELVLKYL

TCGAGACCCTGACCATGGAGCTGGTGCTGAAGTACCTGGACTTC



DFSGKSLGGANKPD

AGCGGCAAGAGCCTGGGCGGCGCCAACAAGCCCGACGGCATCG



GIAWDPLGNFLIFDT

CCTGGGACCCCCTGGGCAACTTCCTGATCTTCGACACCAAGGCC



KAYKHGYTLSNNTD

TACAAGCACGGCTACACCCTGAGCAACAACACCGACAGGGTGG



RVARYINDVRDKDI

CCAGGTACATCAACGACGTGAGGGACAAGGACATCCAGAGGAT



QRISRWWQSIPTYID

CAGCAGGTGGTGGCAGAGCATCCCCACCTACATCGACGTGAAG



VKNKLQFVYISGSFT

AACAAGCTGCAGTTCGTGTACATCAGCGGCAGCTTCACCGGCCA



GHYLRLLNDLRSRT

CTACCTGAGGCTGCTGAACGACCTGAGGAGCAGGACCAGGGCC



RAKGGLVTVEKLLL

AAGGGCGGCCTGGTGACCGTGGAGAAGCTGCTGCTGACCACCG



TTERYLAEADYTHK

AGAGGTACCTGGCCGAGGCCGACTACACCCACAAGGAGCTGTTC



ELFDDWMDDNIEH

GACGACTGGATGGACGACAACATCGAGCAC





78
RISPSNLEQTKQQLR
159
AGGATCAGCCCCAGCAACCTGGAGCAGACCAAGCAGCAGCTGA



EELINLDHQYLDILD

GGGAGGAGCTGATCAACCTGGACCACCAGTACCTGGACATCCTG



FSIAGNVGARQFEV

GACTTCAGCATCGCCGGCAACGTGGGCGCCAGGCAGTTCGAGGT



RIVELLNEIIIAKHLS

GAGGATCGTGGAGCTGCTGAACGAGATCATCATCGCCAAGCACC



GGNRPEIIGFNPKEN

TGAGCGGCGGCAACAGGCCCGAGATCATCGGCTTCAACCCCAA



PEDCIIMDSKAYKEG

GGAGAACCCCGAGGACTGCATCATCATGGACAGCAAGGCCTAC



FNIPANERDKMIRYV

AAGGAGGGCTTCAACATCCCCGCCAACGAGAGGGACAAGATGA



EEYNAKDNTLNNNK

TCAGGTACGTGGAGGAGTACAACGCCAAGGACAACACCCTGAA



WWKNFESPNYPTNQ

CAACAACAAGTGGTGGAAGAACTTCGAGAGCCCCAACTACCCC



VKFSFVSSSFIGQFT

ACCAACCAGGTGAAGTTCAGCTTCGTGAGCAGCAGCTTCATCGG



NQLTYINNRTNVNG

CCAGTTCACCAACCAGCTGACCTACATCAACAACAGGACCAACG



SAITAETLLRKVENV

TGAACGGCAGCGCCATCACCGCCGAGACCCTGCTGAGGAAGGT



MNVNTEYNLNNFFE

GGAGAACGTGATGAACGTGAACACCGAGTACAACCTGAACAAC



ELGSNTLVA

TTCTTCGAGGAGCTGGGCAGCAACACCCTGGTGGCC





79
TFDSTVADNLKNLIL
160
ACCTTCGACAGCACCGTGGCCGACAACCTGAAGAACCTGATCCT



PKLKELDHKYLQAI

GCCCAAGCTGAAGGAGCTGGACCACAAGTACCTGCAGGCCATC



DIAYKRSNTTNHEN

GACATCGCCTACAAGAGGAGCAACACCACCAACCACGAGAACA



TLLEVLSADLFTKE

CCCTGCTGGAGGTGCTGAGCGCCGACCTGTTCACCAAGGAGATG



MDYHGKHLGGANK

GACTACCACGGCAAGCACCTGGGCGGCGCCAACAAGCCCGACG



PDGFVYDEETGWIL

GCTTCGTGTACGACGAGGAGACCGGCTGGATCCTGGACAGCAA



DSKAYRDGFAVTAH

GGCCTACAGGGACGGCTTCGCCGTGACCGCCCACACCACCGACG



TTDAMGRYIDQYRD

CCATGGGCAGGTACATCGACCAGTACAGGGACAGGGACGACAA



RDDKSTWWEDFPK

GAGCACCTGGTGGGAGGACTTCCCCAAGGACCTGCCCCAGACCT



DLPQTYFAYVSGFYI

ACTTCGCCTACGTGAGCGGCTTCTACATCGGCAAGTACCAGGAG



GKYQEQLQDFENRK

CAGCTGCAGGACTTCGAGAACAGGAAGCACATGAAGGGCGGCC



HMKGGLIEVAKLILL

TGATCGAGGTGGCCAAGCTGATCCTGCTGGCCGAGAAGTACAAG



AEKYKENKITHDQIT

GAGAACAAGATCACCCACGACCAGATCACCCTGCAGATCCTGA



LQILNDHISQ

ACGACCACATCAGCCAG





80
PLDVVEQMKAELRP
161
CCCCTGGACGTGGTGGAGCAGATGAAGGCCGAGCTGAGGCCCC



LLNHVNHRLLAIIDF

TGCTGAACCACGTGAACCACAGGCTGCTGGCCATCATCGACTTC



SYNMSRGDDKRLED

AGCTACAACATGAGCAGGGGCGACGACAAGAGGCTGGAGGACT



YTAQIYKLISHDTHL

ACACCGCCCAGATCTACAAGCTGATCAGCCACGACACCCACCTG



LAGPSRPDVVSVIND

CTGGCCGGCCCCAGCAGGCCCGACGTGGTGAGCGTGATCAACG



LGIIIDSKAYKQGFNI

ACCTGGGCATCATCATCGACAGCAAGGCCTACAAGCAGGGCTTC



PQAEEDKMVRYLDE

AACATCCCCCAGGCCGAGGAGGACAAGATGGTGAGGTACCTGG



SIRRDPAINPTKWWE

ACGAGAGCATCAGGAGGGACCCCGCCATCAACCCCACCAAGTG



YLGASTEYVFQFVSS

GTGGGAGTACCTGGGCGCCAGCACCGAGTACGTGTTCCAGTTCG



SFSSGASAKLRQIHR

TGAGCAGCAGCTTCAGCAGCGGCGCCAGCGCCAAGCTGAGGCA



RSSIEGSIITAKNLLL

GATCCACAGGAGGAGCAGCATCGAGGGCAGCATCATCACCGCC



LAENFLCTNTINIDL

AAGAACCTGCTGCTGCTGGCCGAGAACTTCCTGTGCACCAACAC



FRQNNEI

CATCAACATCGACCTGTTCAGGCAGAACAACGAGATC





81
QLVPSYITQTKLRLS
162
CAGCTGGTGCCCAGCTACATCACCCAGACCAAGCTGAGGCTGAG



GLINYIDHSYFDLID

CGGCCTGATCAACTACATCGACCACAGCTACTTCGACCTGATCG



LGFDGRQNRLYELRI

ACCTGGGCTTCGACGGCAGGCAGAACAGGCTGTACGAGCTGAG



VELLNLINSLKALHL

GATCGTGGAGCTGCTGAACCTGATCAACAGCCTGAAGGCCCTGC



SGGNRPEIIAYSPDV

ACCTGAGCGGCGGCAACAGGCCCGAGATCATCGCCTACAGCCCC



NPINGVIMDSKSYRG

GACGTGAACCCCATCAACGGCGTGATCATGGACAGCAAGAGCT



GFNIPNSERDKMIRY

ACAGGGGCGGCTTCAACATCCCCAACAGCGAGAGGGACAAGAT



INEYNQKNPTLNSN

GATCAGGTACATCAACGAGTACAACCAGAAGAACCCCACCCTG



RWWENFRAPDYPQS

AACAGCAACAGGTGGTGGGAGAACTTCAGGGCCCCCGACTACC



PLKYSFVSGNFIGQF

CCCAGAGCCCCCTGAAGTACAGCTTCGTGAGCGGCAACTTCATC



LNQIQYILTQTGING

GGCCAGTTCCTGAACCAGATCCAGTACATCCTGACCCAGACCGG



GAITSEKLIEKVNAV

CATCAACGGCGGCGCCATCACCAGCGAGAAGCTGATCGAGAAG



LNPNISYTINNFFND

GTGAACGCCGTGCTGAACCCCAACATCAGCTACACCATCAACAA



LGCNRLVQ

CTTCTTCAACGACCTGGGCTGCAACAGGCTGGTGCAG









In some embodiments, an endonuclease of the present disclosure can have a sequence of X1X2X3X4X5X6X7X8X9X10X11X12X13X14X15X16X17X18X19X20X21X22X23X24X25X26X27X28X29X30X31X32X33X34X35X36X37X38X39X40X41X42X43KX44X45X46X47X48X49X50X51X52X53X54X55GX56HLGGX57RX58PDGX59X60X61X62X63X64X65X66X67X68X69X70X71X72X73X74GX751X76DTKX77YX78X79GYX80L PIX81QX82DEMX83RYX84X85ENX86X87RX88X89X90X91NX92NX93WWX94X95X96X97X98X99X100X101X102X103X104X105X106FX107X108X109X110FX111GX112X113X114X115X116X117X118RX119X120X121X122X123X124X125X126GX127X128X129X130X131X132X133LLX134X135X136X137X138X139X140X141X142X143X144X145X146X147X148X149X150X151X152X153FX154X155X156X157X158X159X160 (SEQ ID NO: 316), wherein X1 is F, Q, N, D, or absent, X2 is L, I, T, S, N, or absent, X3 is V, I, G, A, E, T, or absent, X4 is K, C, or absent, X5 is G, S, or absent, X6 is A, S, E, D, N, or absent, X7 is M, I, V, Q, F, L, or absent, X8 is E, S, T, N, or absent, X9 is I, M, E, T, Q, or absent, X10 is K, S, L, I, T, E, or absent, X11 is K or absent, X12 is S, A, E, D, or absent, X13 is E, N, Q, K, or absent, X14 is L, M, V, or absent, X15 is R or absent, X16 is H, D, T, G, E, N, or absent, X17 is K, N, Q, E, A, or absent, X18 is L or absent, X19 is R, Q, N, T, D, or absent, X20 is H, M, V, N, T, or absent, X21 is V, L, I, or absent, X22 is P, S, or absent, X23 is H or absent, X24 is E, D, or absent, X25 is Y or absent, X26 is I, L, or absent, X27 is E, Q, G, S, A, Y, or absent, X28 is L or absent, X29 is I, V, L, or absent, X30 is E, D, or absent, X31 is I, L, or absent, X32 is A, S, or absent, X33 is Q, Y, F, or absent, X34 is D or absent, X35 is S, P, or absent, X36 is K, Y, Q, T, or absent, X37 is Q or absent, X38 is N or absent, X39 is R, K, or absent, X40 is L, I, or absent, X41 is L, F, or absent, X42 is E or absent, X43 is F, M, L, or absent, X44 is V, T, or I, X45 1 S V, M, L, or I, X46 is E, D, or Q, X47 is F or L, X48 is F or L, X40 is K, I, T, or V, X50 is K, N, or E, X51 is I or E, X52 is Y, F, or C, X53 
is G, or N, X54 is Y, or F, X55 is R, S, N, E, K, or Q, X56 is K, S, L, V, or T, X57 is S, A, or V, X58 is K or R, X59 is A, I, or V, X60 is L, M, V, I, or C, X61 is F or Y, X62 is T, A, or S, X63 is K, E, or absent, X64 is D, E, or absent, X65 is E, A, or absent, X66 is N, K, or absent, X67 is E, S, or absent, X68 is D, E, Q, A, or absent, X69 is G, V, K, N, or absent, X70 is L, G, E, S, or absent, X71 is V, S, K, T, E, or absent, X72 is L, H, K, E, Y, D, or A, X73 is N, G, or D, X74 is H, F, or Y, X75 is I, or V, X76 is L, V, or I, X77 is A or S, X78 is K or S, X79 is D, G, K, S, or N, X80 is R, N, S, or G, X81 is S, A, or G, X82 is A, I, or V, X83 is Q, E, I, or V, X84 is V or I, X85 is D, R, G, I, or E, X86 is N, I, or Q, X87 is K, D, T, E, or K, X88 is S, N, D, or E, X89 is Q, E, I, K, or A, X90 is V, H, R, K, L, or E, X91 is I, V, or R, X92 is P, S, T, or R, X93 is E, R, C, Q, or K, X94 is E, N, or K, X95 is I, V, N, E, or A, X96 is Y or F, X97 is P, G, or E, X98 is T, E, S, D, K, or N, X99 is S, D, K, G, N, or T, X100 is I, T, V, or L, X101 is T, N, G, or D, X102 is D, E, T, K, or I, X103 is F or Y, X104 is K or Y, X105 is F or Y, X106 is L, S, or M, X107 is V or I, X108 is S or A, X109 is G or A, X110 is F, Y, H, E, or K, X111 is Q, K, T, N, or I, X112 is D, N, or K, X113 is Y, F, I, or V, X114 is R, E, K, Q, or F, X115 is K, E, A, or N, X116 is Q or K, X117 is L or I, X118 is E, D, N, or Q, X119 is V, I, or L, X120 is S, N, F, T, or Q, X121 is H, I, C, or R, X122 is L, D, N, S, or F, X123 is T or K, X124 is K, G, or N, X125 is C, V, or I, X126 is Q, L, K, or Y, X127 is A, G, or N, X128 is V or A, X129 is M, L, I, V, or A, X130 is S, T, or D, X131 is V or I, X132 is E, Q, K, S, or I, X133 is Q, H, or T, X134 is L, R, or Y, X135 is G, I, L, or T, X136 is G, A, or V, X137 is E, N, or D, X138 is K, Y, D, E, A, or R, X139 is I, F, Y, or C, X140 is K or R, X141 is E, R, A, G, or T, X142 is G or N, X143 is S, K, R, or E, X144 is L, I, or M, X145 is T, S, D, or K, 
X146 is L, H, Y, R, T, or F, X147 is E, Y, M, A, or L, X148 is E, D, R, or G, X149 is V, F, M, L, or I, X150 is G, K, R, L, V, or E, X151 is K, N, D, L, H, or S, X152 is K, L, C, or absent, X153 is K, S, I, Y, M, or F, X154 is K, L, C, H, D, Q, or N, X155 is N or Y, X156 is D, K, T, E, C, or absent, X157 is E, V, R, or absent, X158 is I, F, L, or absent, X159 is V, Q, E, L, or absent, and X160 is F or absent.


In some embodiments, an endonuclease of the present disclosure can have a sequence of X1X2X3X4X5X6X7X8X9X10X11X12X13X14X15X16X17X18X19X20X21X22X23X24X25X26X27X28X29X30X31X32X33X34X35X36X37X38X39X40X41X42X43KX44X45X46X47X48X49X50X51X52X53X54X55GX56HLGGX57RX58PDGX59X60X61X62X63X64X65X66X67X68X69X70X71X72X73X74GX75IX76DTKX77YX78X79GYX80L PIX81QX82DEMX83RYX84X85ENX86X87RX88X89X90X91NX92NX93WWX94X95X96X97X98X99X100X101X102X103X104X105X106FX107X108X109X110FX111GX112X113X114X115X116X117X118RX119X120X121X122X123X124X125X126GX127X128X129X130X131X132X133LLX134X135X136X137X138X139X140X141X142X143X144X145X146X147X148X149X150X151X152X153FX154X155X156X157X158X159X160 (SEQ ID NO: 317), wherein X1 is F, Q, N, or absent, X2 is L, I, T, S, or absent, X3 is V, I, G, A, E, T, or absent, X4 is K, C, or absent, X5 is G, S, or absent, X6 is A, S, E, D, or absent, X7 is M, I, V, Q, F, L, or absent, X8 is E, S, T, or absent, X9 is I, M, E, T, Q, or absent, X10 is K, S, L, I, T, E, or absent, X11 is K or absent, X12 is S, A, E, D, or absent, X13 is E, N, Q, K, or absent, X14 is L, M, V, or absent, X15 is R or absent, X16 is H, D, T, G, E, N, or absent, X17 is K, N, Q, E, A, or absent, X18 is L or absent, X19 is R, Q, N, T, D, or absent, X20 is H, M, V, N, T, or absent, X21 is V, L, I, or absent, X22 is P, S, or absent, X23 is H or absent, X24 is E, D, or absent, X25 is Y or absent, X26 is I, L, or absent, X27 is E, Q, G, S, A, or absent, X28 is L or absent, X29 is I, V, L, or absent, X30 is E, D, or absent, X31 is I, L, or absent, X32 is A, S, or absent, X33 is Q, Y, F, or absent, X34 is D or absent, X35 is S, P, or absent, X36 is K, Y, Q, T, or absent, X37 is Q or absent, X38 is N or absent, X39 is R or absent, X40 is L, I, or absent, X41 is L, F, or absent, X42 is E or absent, X43 is F, M, L, or absent, X44 is V, T, or I, X45 is V, M, L, or I, X46 is E, D, or Q, X47 is F or L, X48 is F or L, X49 is K, I, T, or V, X50 is K, N, or E, X51 is I or E, X52 is Y, F, or C, X53 is G, or N, X54 is 
Y, or F, X55 is R, S, N, E, K, or Q, X56 is K, S, L, V, or T, X57 is S or A, X58 is K or R, X59 is A, I, or V, X60 is L, M, V, I, or C, X61 is F or Y, X62 is T, A, or S, X63 is K, E, or absent, X64 is D, E, or absent, X65 is E, A, or absent, X66 is N, K, or absent, X67 is E, S, or absent, X68 is D, E, Q, A, or absent, X69 is G, V, K, N, or absent, X70 is L, G, E, S, or absent, X71 is V, S, K, T, E, or absent, X72 is L, H, K, E, Y, D, or A, X73 is N, G, or D, X74 is H, F, or Y, X75 is I, or V, X76 is L, V, or I, X77 is A or S, X78 is K or S, X79 is D, G, K, S, or N, X80 is R, N, S, or G, X81 is S, A, or G, X82 is A, I, or V, X83 is Q, E, I, or V, X84 is V or I, X85 is D, R, G, I, or E, X86 is N, I, or Q, X87 is K, D, T, E, or K, X88 is S, N, D, or E, X89 is Q, E, I, K, or A, X90 is V, H, R, K, L, or E, X91 is I, V, or R, X92 is P, S, T, or R, X93 is E, R, C, Q, or K, X94 is E, N, or K, X95 is I, V, N, E, or A, X96 is Y or F, X97 is P, G, or E, X98 is T, E, S, D, K, or N, X99 is S, D, K, G, N, or T, X100 is I, T, V, or L, X101 is T, N, G, or D, X102 is D, E, T, K, or I, X103 is F or Y, X104 is K or Y, X105 is F or Y, X106 is L, S, or M, X107 is V or I, X108 is S or A, X109 is G or A, X110 is F, Y, H, E, or K, X111 is Q, K, T, N, or I, X112 is D, N, or K, X113 is Y, F, I, or V, X114 is R, E, K, Q, or F, X115 is K, E, A, or N, X116 is Q or K, X117 is L or I, X118 is E, D, N, or Q, X119 is V, I, or L, X120 is S, N, F, T, or Q, X121 is H, I, C, or R, X122 is L, D, N, S, or F, X123 is T or K, X124 is K, G, or N, X125 is C, V, or I, X126 is Q, L, K, or Y, X127 is A, G, or N, X128 is V or A, X129 is M, L, I, V, or A, X130 is S, T, or D, X131 is V or I, X132 is E, Q, K, S, or I, X133 is Q, H, or T, X134 is L, R, or Y, X135 is G, I, L, or T, X136 is G, A, or V, X137 is E, N, or D, X138 is K, Y, D, E, A, or R, X139 is I, F, Y, or C, X140 is K or R, X141 is E, R, A, G, or T, X142 is G or N, X143 is S, I, K, R, or E, X144 is L, I, or M, X145 is T, S, D, or K, X146 is L, H, Y, R, 
or T, X147 is E, Y, I, M, or A, X148 is E, D, R, or G, X149 is V, F, M, L, or I, X150 is G, K, R, L, V, or E, X151 is K, N, D, L, H, or S, X152 is K, L, C, or absent, X153 is K, S, I, Y, M, or F, X154 is K, L, C, H, D, Q, or N, X155 is N or Y, X156 is D, K, T, E, C, or absent, X157 is E, V, R, or absent, X158 is I, F, L, or absent, X159 is V, Q, E, L, or absent, and X160 is F or absent.


In some embodiments, an endonuclease of the present disclosure can have a sequence of X1LVKSSX2EEX3KEELREKLX4HLSHEYLX5LX6DLAYDSKQNRLFEMKVX7ELLINECGYX8G LHLGGSRKPDGIX9YTEGLKX10NYGIIIDTKAYSDGYNLPISQADEMERYIRENNTRNX11X12V NPNEWWENFPX13NINEFYFLFVSGHFKGNX14EEQLERISIX15TX16IKGAAMSVX17TLLLLAN EIKAGRLX18LEEVX19KYFDNKEIX20F (SEQ ID NO: 318), wherein X1 is F, Q, N, D, or absent, X2 is M, I, V, Q, F, L, or absent, X3 is K, S, L, I, T, E, or absent, X4 is R, Q, N, T, D, or absent, X5 is E, Q, G, S, A, Y, or absent, X6 is I, V, L, or absent, X7 is V, M, L, or I, X8 is R, S, N, E, K, or Q, X9 is L, M, V, I, or C, X10 is L, H, K, E, Y, D, or A, X11 is Q, E, I, K, or A, X12 is V, H, R, K, L, or E, X13 is T, E, S, D, K, or N, X14 is Y, F, I, or V, X15 is L, D, N, S, or F, X16 is K, G, or N, X17 is E, Q, K, S, or I, X18 is T, S, D, or K, X19 is G, K, R, L, V, or E, and X20 is V, Q, E, L, or absent.


In some embodiments, an endonuclease of the present disclosure can have a sequence of X1LVKSSX2EEX3KEELREKLX4HLSHEYLX5LX6DLAYDSKQNRLFEMKVX7ELLINECGYX8G LHLGGSRKPDGIX9YTEGLKX10NYGIIIDTKAYSDGYNLPISQADEMERYIRENNTRNX11X12V NPNEWWENFPX13NINEFYFLFVSGHFKGNX14EEQLERISIX15TX16IKGAAMSVX17TLLLLAN EIKAGRLX18LEEVX19KYFDNKEIX20F (SEQ ID NO: 319), wherein X1 is F, Q, N, or absent, X2 is M, I, V, Q, F, L, or absent, X3 is K, S, L, I, T, E, or absent, X4 is R, Q, N, T, D, or absent, X5 is E, Q, G, S, A, or absent, X6 is I, V, L, or absent, X7 is V, M, L, or I, X8 is R, S, N, E, K, or Q, X9 is L, M, V, I, or C, X10 is L, H, K, E, Y, D, or A, X11 is Q, E, I, K, or A, X12 is V, H, R, K, L, or E, X13 is T, E, S, D, K, or N, X14 is Y, F, I, or V, X15 is L, D, N, S, or F, X16 is K, G, or N, X17 is E, Q, K, S, or I, X18 is T, S, D, or K, X19 is G, K, R, L, V, or E, and X20 is V, Q, E, L, or absent. In some embodiments, a cleavage domain disclosed herein comprises a sequence selected from SEQ ID NO: 316-SEQ ID NO: 319.
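By way of illustration only, a degenerate consensus of the kind recited above can be checked against a candidate sequence by converting each position into a regular-expression element. The sketch below uses only the first eighteen positions of the SEQ ID NO: 318 consensus (X1LVKSSX2EEX3KEELREKL); the data structure and the function names `consensus_to_regex` and `matches_consensus` are hypothetical and are not part of the disclosure.

```python
import re

# Fragment of the SEQ ID NO: 318 consensus: X1 LVKSS X2 EE X3 KEELREKL.
# A plain string is a fixed residue; a tuple is (allowed residues, may be absent).
CONSENSUS_FRAGMENT = [
    ("FQND", True),    # X1 is F, Q, N, D, or absent
    "L", "V", "K", "S", "S",
    ("MIVQFL", True),  # X2 is M, I, V, Q, F, L, or absent
    "E", "E",
    ("KSLITE", True),  # X3 is K, S, L, I, T, E, or absent
    "K", "E", "E", "L", "R", "E", "K", "L",
]


def consensus_to_regex(positions):
    """Build a compiled regex in which each variable position is a character class."""
    parts = []
    for pos in positions:
        if isinstance(pos, str):
            parts.append(re.escape(pos))
        else:
            allowed, optional = pos
            parts.append(f"[{allowed}]" + ("?" if optional else ""))
    return re.compile("".join(parts))


def matches_consensus(sequence, positions=CONSENSUS_FRAGMENT):
    """Return True if the whole sequence fits the consensus fragment."""
    return consensus_to_regex(positions).fullmatch(sequence) is not None
```

For example, `matches_consensus("FLVKSSMEEKKEELREKL")` evaluates the case in which X1 is F, X2 is M, and X3 is K, while `matches_consensus("LVKSSEEKEELREKL")` evaluates the case in which all three variable positions are absent.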


In some embodiments, an endonuclease of the present disclosure can have conserved amino acid residues at position 76 (D or E), position 98 (D), and position 100 (K), which together preserve catalytic function. In some embodiments, an endonuclease of the present disclosure can have conserved amino acid residues at position 114 (D) and position 118 (R), which together preserve dimerization of two cleavage domains.


In some embodiments, endonucleases disclosed herein (e.g., SEQ ID NO: 1-SEQ ID NO: 81 (nucleic acid sequences of SEQ ID NO: 82-SEQ ID NO: 162)) can have at least 33.3% divergence from SEQ ID NO: 163 (FokI) and can be immunologically orthogonal to SEQ ID NO: 163 (FokI). In some embodiments, an immunologically orthogonal endonuclease (e.g., SEQ ID NO: 1-SEQ ID NO: 81 (nucleic acid sequences of SEQ ID NO: 82-SEQ ID NO: 162)) can be administered to a patient who has already received, and thus can have an adverse immune reaction to, FokI. In some embodiments, endonucleases disclosed herein (e.g., SEQ ID NO: 1-SEQ ID NO: 81 (nucleic acid sequences of SEQ ID NO: 82-SEQ ID NO: 162)) can have at least 35%, at least 40%, at least 45%, at least 50%, at least 55%, at least 60%, at least 65%, at least 70%, or at least 75% divergence from SEQ ID NO: 163 (FokI).
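As a non-limiting sketch, percent divergence between two sequences can be computed as the fraction of aligned positions that differ. The example assumes that the two sequences have already been aligned to equal length with "-" marking gaps, and that a gap opposite a residue counts as a difference; this section does not specify an alignment method or a gap treatment, so both are assumptions, and the function name `percent_divergence` is hypothetical.

```python
def percent_divergence(aligned_a, aligned_b, gap="-"):
    """Percent of aligned columns at which two pre-aligned sequences differ.

    Columns in which both sequences carry a gap are skipped; a gap opposite
    a residue is counted as a difference (an assumption of this sketch).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    mismatched = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap and b == gap:
            continue
        compared += 1
        if a != b:
            mismatched += 1
    if compared == 0:
        raise ValueError("no comparable columns")
    return 100.0 * mismatched / compared


def meets_divergence_threshold(aligned_a, aligned_b, threshold=33.3):
    """True if divergence is at least the threshold (33.3% by default)."""
    return percent_divergence(aligned_a, aligned_b) >= threshold
```

With toy three-residue inputs (not sequences of the disclosure), one differing column out of three gives a divergence of about 33.33%, which satisfies the "at least 33.3%" condition.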


In some embodiments, an endonuclease disclosed herein (e.g., SEQ ID NO: 1-SEQ ID NO: 81 (nucleic acid sequences of SEQ ID NO: 82-SEQ ID NO: 162)) can be fused to any nucleic acid binding domain disclosed herein to form a non-naturally occurring fusion protein. This fusion protein can have one or more of the following characteristics: (a) the fusion protein induces greater than 1% indels (insertions/deletions) at a target site; (b) the cleavage domain comprises a molecular weight of less than 23 kDa; (c) the cleavage domain comprises less than 196 amino acids; and (d) the fusion protein is capable of cleaving across a spacer region greater than 24 base pairs. In some embodiments, the non-naturally occurring fusion protein can induce greater than 5%, greater than 10%, greater than 20%, greater than 30%, greater than 40%, greater than 50%, greater than 60%, greater than 70%, greater than 80%, or greater than 90% indels at the target site. In some embodiments, indels are generated via the non-homologous end joining (NHEJ) pathway upon administration of a genome editing complex disclosed herein to a subject. Indels can be measured using deep sequencing.
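For illustration, characteristics (a) through (d) can be evaluated from measured values as sketched below. The sketch assumes that indel percentage is taken as the fraction of deep-sequencing reads at the target site that carry an insertion or deletion; the read counts, the function names, and the dictionary keys are hypothetical, and no particular sequencing pipeline is implied.

```python
def indel_percentage(reads_with_indel, total_reads):
    """Percent of reads at a site that carry an insertion or deletion."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    if not 0 <= reads_with_indel <= total_reads:
        raise ValueError("reads_with_indel must lie between 0 and total_reads")
    return 100.0 * reads_with_indel / total_reads


def fusion_characteristics(indel_pct, domain_kda, domain_residues, spacer_bp):
    """Evaluate characteristics (a)-(d) for one fusion protein."""
    return {
        "a_indels_gt_1_percent": indel_pct > 1.0,
        "b_domain_lt_23_kda": domain_kda < 23.0,
        "c_domain_lt_196_residues": domain_residues < 196,
        "d_spacer_gt_24_bp": spacer_bp > 24,
    }
```

As a usage example with made-up numbers, 350 indel-bearing reads out of 1,000 total reads gives `indel_percentage(350, 1000)` of 35.0, which can then be passed to `fusion_characteristics` together with the cleavage domain mass, length, and the spacer length tested.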


In still other embodiments, the functional domain can be a cleavage domain or a repression domain. In some aspects, the cleavage domain comprises at least 33.3% divergence from SEQ ID NO: 163 and is immunologically orthogonal to SEQ ID NO: 163. In further aspects, the polypeptide can comprise one or more of the following characteristics: (a) the polypeptide induces greater than 1% indels at a target site; (b) the cleavage domain comprises a molecular weight of less than 23 kDa; (c) the cleavage domain comprises less than 196 amino acids; and (d) the polypeptide is capable of cleaving across a spacer region greater than 24 base pairs.


DNA Binding Domains Fused to SEQ ID NO: 1-SEQ ID NO: 81 (Nucleic Acid Sequences of SEQ ID NO: 82-SEQ ID NO: 162)

The present disclosure provides for novel compositions of endonucleases with modular nucleic acid binding domains (e.g., TALEs, RNBDs, or MAP-NBDs) described herein. In some instances, the novel endonucleases can be fused to a DNA binding domain from Xanthomonas spp. (TALE), Ralstonia (RNBD), or Legionella (MAP-NBD), resulting in genome editing complexes. A TALEN, RNBD-nuclease, or MAP-NBD-nuclease can include multiple components including the DNA binding domain, an optional linker, and a nuclease domain. The genome editing complexes described herein can be used to selectively bind to and cleave a target gene sequence for genome editing purposes. For example, a DNA binding domain from Xanthomonas, Ralstonia, or Legionella of the present disclosure can be used to direct the binding of a genome editing complex to a desired genomic sequence.


The genome editing complexes described herein, comprising a DNA binding domain fused to an endonuclease, can be used to edit genomic loci of interest by binding to a target nucleic acid sequence via the DNA binding domain and cleaving phosphodiester bonds of target double stranded DNA via the endonuclease.


In some aspects, DNA binding domains fused to nucleases can create a site-specific double-stranded DNA break. Such breaks can then be repaired by cellular machinery, through either homology-dependent repair or non-homologous end joining (NHEJ). Genome editing, using DNA binding domains fused to nucleases described herein, can thus be used to delete a sequence of interest (e.g., an aberrantly expressed or mutated gene) or to introduce a nucleic acid sequence of interest (e.g., a functional gene). DNA binding domains of the present disclosure can be programmed to deliver virtually any nuclease, including those disclosed herein, to any target site for therapeutic purposes, including ex vivo engineered cell therapies obtained using the compositions disclosed herein or gene therapy by direct in vivo administration of the compositions disclosed herein. In addition, the DNA binding domain can bind to specific DNA sequences and, in some cases, can activate the expression of host genes. In some instances, the disclosure provides for enzymes, e.g., SEQ ID NO: 1-SEQ ID NO: 81 (or any one of nucleic acid sequences of SEQ ID NO: 82-SEQ ID NO: 162) that can be fused to the DNA binding domains of TALEs, RNBDs, and MAP-NBDs. In some instances, enzymes of the disclosure, including SEQ ID NO: 1 (nucleic acid sequence of SEQ ID NO: 82), SEQ ID NO: 4 (nucleic acid sequence of SEQ ID NO: 85), and SEQ ID NO: 8 (nucleic acid sequence of SEQ ID NO: 89), can achieve greater than 30% indels via the NHEJ pathway on a target gene when fused to a DNA binding domain of a TALE, RNBD, or MAP-NBD.


A non-naturally occurring fusion protein of the disclosure, e.g., any one of SEQ ID NO: 1-SEQ ID NO: 81 (or any one of nucleic acid sequences of SEQ ID NO: 82-SEQ ID NO: 162) fused to a DNA binding domain, can comprise a repeat unit. A repeat unit can be from a wild-type DNA-binding domain (Ralstonia solanacearum, Xanthomonas spp., Legionella quateirensis, Burkholderia, Paraburkholderia, or Francisella) or a modified repeat unit enhanced for specific recognition of a particular nucleic acid base. A modified repeat unit can comprise 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 25 or more mutations that can enhance the repeat module for specific recognition of a particular nucleic acid base. In some embodiments, a modified repeat unit is modified at amino acid position 2, 3, 4, 11, 12, 13, 21, 23, 24, 25, 26, 27, 28, 30, 31, 32, 33, 34, or 35. In some embodiments, a modified repeat unit is modified at amino acid positions 12 or 13.


As described in further detail below, a non-naturally occurring fusion protein of the disclosure, e.g., any one of SEQ ID NO: 1-SEQ ID NO: 81 (or any one of nucleic acid sequences of SEQ ID NO: 82-SEQ ID NO: 162) fused to a plurality of repeat units (e.g., derived from Ralstonia solanacearum, Xanthomonas spp., Legionella quateirensis, Burkholderia, Paraburkholderia, or Francisella), can further comprise a C-terminal truncation, which can serve as a linker between the DNA binding domain and the nuclease.


A non-naturally occurring fusion protein of the disclosure, e.g., any one of SEQ ID NO: 1-SEQ ID NO: 81 (or any one of nucleic acid sequences of SEQ ID NO: 82-SEQ ID NO: 162) fused to a DNA binding domain, can further comprise an N-terminal cap as described in further detail below. An N-terminal cap can be a polypeptide portion flanking the DNA-binding repeat module. An N-terminal cap can be any length and can comprise from 0 to 136 amino acid residues. An N-terminal cap can be 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 60, 70, 80, 90, 100, 110, 120, or 130 amino acid residues in length. In some embodiments, an N-terminal cap can modulate structural stability of the DNA-binding repeat units. In some embodiments, an N-terminal cap can modulate nonspecific interactions. In some cases, an N-terminal cap can decrease nonspecific interactions. In some cases, an N-terminal cap can reduce off-target effects. As used herein, an off-target effect refers to the interaction of a genome editing complex with a sequence that is not the target binding site of interest. An N-terminal cap can further comprise a wild-type N-terminal cap sequence of a protein from Ralstonia solanacearum, Xanthomonas spp., Legionella quateirensis, Burkholderia, Paraburkholderia, or Francisella or can comprise a modified N-terminal cap sequence.


In some embodiments, a DNA binding domain comprises at least one repeat unit having a repeat variable diresidue (RVD), which contacts a target nucleic acid base. In some embodiments, a DNA binding domain comprises more than one repeat unit, each having an RVD, which contacts a target nucleic acid base. In some embodiments, the DNA binding domain comprises 1 to 50 RVDs. In some embodiments, the DNA binding domain components of the fusion proteins can be at least 14 RVDs, at least 15 RVDs, at least 16 RVDs, at least 17 RVDs, at least 18 RVDs, at least 19 RVDs, at least 20 RVDs in length, or at least 21 RVDs in length. In some embodiments, the DNA binding domains can be 16 to 21 RVDs in length.


In some embodiments, any one of the DNA binding domains described herein can bind to a region of interest of any gene. For example, the DNA binding domains described herein can bind upstream of the promoter region, upstream of the gene transcription start site, or downstream of the transcription start site. In certain embodiments, the DNA binding domain binding region is no farther than 50 base pairs downstream of the transcription start site. In some embodiments, the DNA binding domain is designed to bind in proximity to the transcription start site (TSS). In other embodiments, the TALE can be designed to bind in the 5′ UTR region.


A DNA binding domain described herein can comprise from 1 to 50 repeat units. A DNA binding domain described herein can comprise from 5 to 45, from 8 to 45, from 10 to 40, from 12 to 35, from 15 to 30, from 20 to 30, from 8 to 40, from 8 to 35, from 8 to 30, from 10 to 35, from 10 to 30, from 10 to 25, from 10 to 20, or from 15 to 25 repeat units.


A DNA binding domain described herein can comprise at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 45, 50, or more repeat units. A DNA binding domain described herein can comprise 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 45, or 50 repeat units. A DNA binding domain described herein can comprise 5 repeat units. A DNA binding domain described herein can comprise 10 repeat units. A DNA binding domain described herein can comprise 11 repeat units. A DNA binding domain described herein can comprise 12 repeat units, or another suitable number.


A repeat unit of a DNA binding domain can be 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, or 40 residues in length.


In some embodiments, the effector can be a protein secreted from Xanthomonas or Ralstonia bacteria upon plant infection. In some embodiments, the effector can be a protein that is a mutated form of, or otherwise derived from, a protein secreted from Xanthomonas or Ralstonia bacteria. The effector can further comprise a DNA-binding module which includes a variable number of repeat units, each about 33-35 amino acid residues in length. Each repeat unit recognizes one base pair through two adjacent amino acids (e.g., at amino acid positions 12 and 13 of the repeat unit). As such, amino acid positions 12 and 13 of the repeat unit can also be referred to as a repeat variable diresidue (RVD).
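As an illustrative sketch, the RVD of a repeat unit can be read directly from amino acid positions 12 and 13. The repeat units below are 34-residue Xanthomonas-derived units of the form appearing in TABLE 7. The RVD-to-base mapping in `COMMON_RVD_CODE` reflects the commonly reported TALE code and is an assumption of this example rather than a statement taken from this section; the function names are hypothetical.

```python
# 34-residue repeat units of the form listed in TABLE 7.
REPEAT_UNITS = [
    "LTPDQVVAIASHDGGKQALETVQRLLPVLCQDHG",
    "LTPDQVVAIASNGGGKQALETVQRLLPVLCQDHG",
    "LTPDQVVAIASNIGGKQALETVQRLLPVLCQDHG",
    "LTPDQVVAIASNHGGKQALETVQRLLPVLCQDHG",
]

# Commonly reported RVD-to-base preferences (assumption, not from this section).
COMMON_RVD_CODE = {"HD": "C", "NG": "T", "NI": "A", "NH": "G"}


def rvd(repeat_unit, first_position=12):
    """Return the two residues at positions 12 and 13 (1-indexed)."""
    if len(repeat_unit) < first_position + 1:
        raise ValueError("repeat unit is too short to contain an RVD")
    return repeat_unit[first_position - 1:first_position + 1]


def predicted_target(repeat_units, code=COMMON_RVD_CODE):
    """Map each repeat unit to a base; unknown RVDs are reported as 'N'."""
    return "".join(code.get(rvd(unit), "N") for unit in repeat_units)
```

Applied to the four units above, `rvd` returns HD, NG, NI, and NH in turn, and `predicted_target(REPEAT_UNITS)` strings together the corresponding bases under the assumed code.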


Linkers

A nuclease, e.g., any one of SEQ ID NO: 1-SEQ ID NO: 81 (or any one of nucleic acid sequences of SEQ ID NO: 82-SEQ ID NO: 162) fused to a DNA binding domain (e.g., an RNBD, a MAP-NBD, a TALE), can further include a linker connecting SEQ ID NO: 1-SEQ ID NO: 81 (or any one of nucleic acid sequences of SEQ ID NO: 82-SEQ ID NO: 162) to the DNA binding domain. A linker used herein can be a short flexible linker encoded by a nucleic acid sequence of 0 base pairs, 3 to 6 base pairs, 6 to 12 base pairs, 12 to 15 base pairs, 15 to 21 base pairs, 21 to 24 base pairs, 24 to 30 base pairs, 30 to 36 base pairs, 36 to 42 base pairs, 42 to 48 base pairs, or 1 to 48 base pairs. The nucleic acid sequence of the linker can encode an amino acid sequence comprising 0 residues, 1-3 residues, 4-7 residues, 8-10 residues, 10-12 residues, 12-15 residues, or 1-15 residues. Linkers can include, but are not limited to, residues such as glycine, methionine, aspartic acid, alanine, lysine, serine, leucine, threonine, tryptophan, or any combination thereof.


When linking a repressor domain to an RNBD, MAP-NBD, or TALE, the linker can have a nucleic acid sequence of GGCGGTGGCGGAGGGATGGATGCTAAGTCACTAACTGCCTGGTCC (SEQ ID NO: 165) and an amino acid sequence of GGGGGMDAKSLTAWS (SEQ ID NO: 166).
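The correspondence between the linker nucleic acid sequence (SEQ ID NO: 165) and the linker amino acid sequence (SEQ ID NO: 166) can be checked by translating the 45 nucleotides in frame, three nucleotides per residue. The following is a minimal sketch that assumes the standard genetic code; the helper name `translate` is hypothetical.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

LINKER_DNA = "GGCGGTGGCGGAGGGATGGATGCTAAGTCACTAACTGCCTGGTCC"  # SEQ ID NO: 165
LINKER_PROTEIN = "GGGGGMDAKSLTAWS"                            # SEQ ID NO: 166


def translate(dna):
    """Translate an in-frame DNA sequence, one residue per codon."""
    if len(dna) % 3 != 0:
        raise ValueError("sequence length must be a multiple of three")
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))
```

Here `translate(LINKER_DNA)` can be compared with `LINKER_PROTEIN`; the 45-nucleotide sequence corresponds to 15 codons and therefore to a 15-residue linker.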


A nuclease, e.g., any one of SEQ ID NO: 1-SEQ ID NO: 81 (or any one of nucleic acid sequences of SEQ ID NO: 82-SEQ ID NO: 162) can be connected to a DNA binding domain via a linker. A linker can be from 1 to 70 amino acid residues in length. A linker can be from 5 to 45, from 5 to 40, from 5 to 35, from 5 to 30, from 5 to 25, from 5 to 20, from 5 to 15, from 10 to 40, from 10 to 35, from 10 to 30, from 10 to 25, from 10 to 20, from 12 to 40, from 12 to 35, from 12 to 30, from 12 to 25, from 12 to 20, from 14 to 40, from 14 to 35, from 14 to 30, from 14 to 25, from 14 to 20, from 14 to 16, from 15 to 40, from 15 to 35, from 15 to 30, from 15 to 25, from 15 to 20, from 15 to 18, from 18 to 40, from 18 to 35, from 18 to 30, from 18 to 25, from 18 to 24, from 20 to 40, from 20 to 35, from 20 to 30, from 25 to 30, from 25 to 70, from 30 to 70, from 5 to 70, from 35 to 70, from 40 to 70, from 45 to 70, from 50 to 70, from 55 to 70, from 60 to 70, or from 65 to 70 amino acid residues in length.


A linker for linking a nuclease, e.g., any one of SEQ ID NO: 1-SEQ ID NO: 81 (or any one of nucleic acid sequences of SEQ ID NO: 82-SEQ ID NO: 162) to a DNA binding domain can be 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 35, 40, 45, 50, 55, 60, 65, or 70 amino acid residues in length.


In some embodiments, the linker can be the N-terminus of a naturally occurring Ralstonia solanacearum-derived protein, Xanthomonas spp.-derived protein, or animal pathogen-derived protein, wherein any functional domain disclosed herein is fused to the N-terminus of the engineered DNA binding domain. In some embodiments, the linker comprising the N-terminus can comprise the full length naturally occurring N-terminus of a naturally occurring Ralstonia solanacearum-derived protein, Xanthomonas spp.-derived protein, or animal pathogen-derived protein, or a truncation of the naturally occurring N-terminus, such as amino acid residues at positions 1 to 137 of the naturally occurring Ralstonia solanacearum-derived protein N-terminus (e.g., SEQ ID NO: 264), positions 1 (H) to 115 (S) of the naturally occurring Ralstonia solanacearum-derived protein N-terminus (SEQ ID NO: 320), positions 1 (N) to 115 (S) of the naturally occurring Xanthomonas spp.-derived protein N-terminus (SEQ ID NO: 321), or positions 1 (G) to 115 (K) of the naturally occurring Legionella quateirensis-derived protein N-terminus (SEQ ID NO: 322). In some embodiments, the linker can comprise amino acid residues at positions 1 to 120 of the naturally occurring Ralstonia solanacearum-derived protein (SEQ ID NO: 303), Xanthomonas spp.-derived protein (SEQ ID NO: 301), or Legionella quateirensis-derived protein (SEQ ID NO: 304). In some embodiments, the linker can comprise the naturally occurring N-terminus of Ralstonia solanacearum truncated to any length. For example, the naturally occurring N-terminus of Ralstonia solanacearum can be truncated to amino acid residues at positions 1 to 120, 1 to 115, 1 to 50, 1 to 70, 1 to 100, 1 to 130, 10 to 40, 60 to 100, or 100 to 120 and used at the N-terminus of the engineered DNA binding domain as a linker to a nuclease or a repressor.


In other embodiments, the linker can be the C-terminus of a naturally occurring Ralstonia solanacearum-derived protein, Xanthomonas spp.-derived protein, or animal pathogen-derived protein, wherein any functional domain disclosed herein is fused to the C-terminus of the engineered DNA binding domain. In some embodiments, the linker comprising the C-terminus can comprise the full length naturally occurring C-terminus of a naturally occurring Ralstonia solanacearum-derived protein, Xanthomonas spp.-derived protein, or animal pathogen-derived protein, or a truncation of the naturally occurring C-terminus, such as positions 1 to 63 of the naturally occurring Ralstonia solanacearum-derived protein (SEQ ID NO: 266), Xanthomonas spp.-derived protein (SEQ ID NO: 298), or Legionella quateirensis-derived protein (SEQ ID NO: 306). In some embodiments, the naturally occurring C-terminus of Ralstonia solanacearum-derived protein, Xanthomonas spp.-derived protein, or Legionella quateirensis-derived protein can be truncated to any length and used at the C-terminus of the engineered DNA binding domain as a linker to a nuclease or repressor. For example, the naturally occurring C-terminus of Ralstonia solanacearum-derived protein, Xanthomonas spp.-derived protein, or Legionella quateirensis-derived protein can be truncated to amino acid residues at positions 1 to 63, 1 to 50, 1 to 70, 1 to 100, 1 to 120, 1 to 130, 10 to 40, 60 to 100, or 100 to 120 and used at the C-terminus of the engineered DNA binding domain.


Linkers Comprising Recognition Sites

In some embodiments, the present disclosure provides DNA binding domains (e.g., RNBDs, MAP-NBDs, TALEs) with gapped repeat units for use as gene editing complexes. A DNA binding domain (e.g., RNBDs, MAP-NBDs, TALEs) with gapped repeat units can comprise a plurality of repeat units in which each repeat unit of the plurality of repeat units is separated from a neighboring repeat unit by a linker. This linker can comprise a recognition site for additional functionality and activity. For example, the linker can comprise a recognition site for a small molecule. As another example, the linker can serve as a recognition site for a protease. In yet another example, the linker can serve as a recognition site for a kinase. In other embodiments, the recognition site can serve as a localization signal.


Each repeat unit of a DNA binding domain (e.g., RNBDs, MAP-NBDs, TALEs) comprises a secondary structure in which the RVD interfaces with and binds to a target nucleic acid base on double stranded DNA, while the remainder of the repeat unit protrudes from the surface of the DNA. Thus, the linkers comprising a recognition site between each repeat unit are removed from the surface of the DNA and are solvent accessible. In some embodiments, these solvent accessible linkers comprising recognition sites can have extra activity while mediating gene editing. In some embodiments, the at least one repeat unit comprises 1-20 additional amino acid residues at the C-terminus. In some aspects, the at least one repeat unit of the plurality of repeat units is separated from a neighboring repeat unit by a linker. In further aspects, the linker comprises a recognition site. In some aspects, the recognition site is for a small molecule, a protease, or a kinase. In some aspects, the recognition site serves as a localization signal. In some aspects, the plurality of repeat units comprises 3 to 60 repeat units.


Examples of a left and a right DNA binding domain comprising repeat units derived from Xanthomonas spp. are shown below in TABLE 7 for AAVS1 and GA7. “X,” shown in bold and underlining, represents a linker comprising a recognition site and can comprise 1-40 amino acid residues. An amino acid residue of the linker can comprise a glycine, an alanine, a threonine, or a histidine.

TABLE 7
Exemplary Left or Right Gapped DNA Binding Domains

SEQ ID NO: 307
Construct: AAVS1_Left
Sequence:
LTPDQVVAIASHDGGKQALETVQRLLPVLCQDHGXLTPDQVVAIASHDGG
KQALETVQRLLPVLCQDHGXLTPDQVVAIASHDGGKQALETVQRLLPVLC
QDHGXLTPDQVVAIASHDGGKQALETVQRLLPVLCQDHGXLTPDQVVAIA
SNGGGKQALETVQRLLPVLCQDHGXLTPDQVVAIASHDGGKQALETVQRL
LPVLCQDHGXLTPDQVVAIASHDGGKQALETVQRLLPVLCQDHGXLTPDQ
VVAIASNIGGKQALETVQRLLPVLCQDHGXLTPDQVVAIASHDGGKQALET
VQRLLPVLCQDHGXLTPDQVVAIASHDGGKQALETVQRLLPVLCQDHGXL
TPDQVVAIASHDGGKQALETVQRLLPVLCQDHGXLTPDQVVAIASHDGGK
QALETVQRLLPVLCQDHGXLTPDQVVAIASNIGGKQALETVQRLLPVLCQD
HGXLTPDQVVAIASHDGGKQALETVQRLLPVLCQDHGXLTPDQVVAIASNI
GGKQALETVQRLLPVLCQDHGXLTPDQVVAIASNHGGKQALETVQRLLPV
LCQDHGXLTPDQVVAIASNGGG

SEQ ID NO: 308
Construct: AAVS1_Right
Sequence:
LTPDQVVAIASNGGGKQALETVQRLLPVLCQDHGXLTPDQVVAIASNGGG
KQALETVQRLLPVLCQDHGXLTPDQVVAIASNGGGKQALETVQRLLPVLC
QDHGXLTPDQVVAIASHDGGKQALETVQRLLPVLCQDHGXLTPDQVVAIA
SNGGGKQALETVQRLLPVLCQDHGXLTPDQVVAIASNHGGKQALETVQRL
LPVLCQDHGXLTPDQVVAIASNGGGKQALETVQRLLPVLCQDHGXLTPDQ
VVAIASHDGGKQALETVQRLLPVLCQDHGXLTPDQVVAIASNIGGKQALET
VQRLLPVLCQDHGXLTPDQVVAIASHDGGKQALETVQRLLPVLCQDHGXL
TPDQVVAIASHDGGKQALETVQRLLPVLCQDHGXLTPDQVVAIASNIGGK
QALETVQRLLPVLCQDHGXLTPDQVVAIASNIGGKQALETVQRLLPVLCQD
HGXLTPDQVVAIASNGGGKQALETVQRLLPVLCQDHGXLTPDQVVAIASH
DGGKQALETVQRLLPVLCQDHGXLTPDQVVAIASHDGGKQALETVQRLLP
VLCQDHGXLTPDQVVAIASNGGGKQALESIVAQLSRPDPALA

SEQ ID NO: 309
Construct: GA7.2 Left
Sequence:
LTPDQVVAIASNHGGKQALETVQRLLPVLCQDHGXLTPDQVVAIASHDGG
KQALETVQRLLPVLCQDHGXLTPDQVVAIASNGGGKQALETVQRLLPVLC
QDHGXLTPDQVVAIASHDGGKQALETVQRLLPVLCQDHGXLTPDQVVAIA
SNIGGKQALETVQRLLPVLCQDHGXLTPDQVVAIASNHGGKQALETVQRL
LPVLCQDHGXLTPDQVVAIASHDGGKQALETVQRLLPVLCQDHGXLTPDQ
VVAIASHDGGKQALETVQRLLPVLCQDHGXLTPDQVVAIASHDGGKQALE
TVQRLLPVLCQDHGXLTPDQVVAIASNIGGKQALETVQRLLPVLCQDHGX
LTPDQVVAIASNHGGKQALETVQRLLPVLCQDHGXLTPDQVVAIASHDGG
KQALETVQRLLPVLCQDHGXLTPDQVVAIASNGGGKQALETVQRLLPVLC
QDHGXLTPDQVVAIASHDGGKQALETVQRLLPVLCQDHGXLTPDQVVAIA
SNIGGKQALETVQRLLPVLCQDHGXLTPDQVVAIASNHGGKQALETVQRL
LPVLCQDHGXLTPDQVVAIASHDGGKQALETVQRLLPVLCQDHGXLTPDQ
VVAIASHDGGKQALETVQRLLPVLCQDHGXLTPDQVVAIASNGGGK

SEQ ID NO: 310
Construct: GA7.2 Right
Sequence:
LTPDQVVAIASHDGGKQALETVQRLLPVLCQDHGXLTPDQVVAIASHDGG
KQALETVQRLLPVLCQDHGXLTPDQVVAIASHDGGKQALETVQRLLPVLC
QDHGXLTPDQVVAIASHDGGKQALETVQRLLPVLCQDHGXLTPDQVVAIA
SHDGGKQALETVQRLLPVLCQDHGXLTPDQVVAIASNGGGKQALETVQRL
LPVLCQDHGXLTPDQVVAIASHDGGKQALETVQRLLPVLCQDHGXLTPDQ
VVAIASNGGGKQALETVQRLLPVLCQDHGXLTPDQVVAIASHDGGKQALE
TVQRLLPVLCQDHGXLTPDQVVAIASNIGGKQALETVQRLLPVLCQDHGX
LTPDQVVAIASNGGGKQALETVQRLLPVLCQDHGXLTPDQVVAIASNGGG
KQALETVQRLLPVLCQDHGXLTPDQVVAIASHDGGKQALETVQRLLPVLC
QDHGXLTPDQVVAIASNGGGKQALETVQRLLPVLCQDHGXLTPDQVVAIA
SHDGGKQALETVQRLLPVLCQDHGXLTPDQVVAIASNGGGKQALETVQRL
LPVLCQDHGXLTPDQVVAIASNIGGKQALETVQRLLPVLCQDHGXLTPDQ
VVAIASHDGGKQALETVQRLLPVLCQDHGXLTPDQVVAIASHDGGKQALE
TVQRLLPVLCQDHGXLTPDQVVAIASNIGGKQALETVQRLLPVLCQDHGX
LTPDQVVASASNGGGKQALESIVAQLSRPDPALA









Tunable Repeat Units

In some embodiments, the present disclosure provides DNA binding domains (e.g., RNBDs, MAP-NBDs, TALEs) with expanded repeat units. For example, a DNA binding domain (e.g., RNBDs, MAP-NBDs, TALEs) comprises a plurality of repeat units in which each repeat unit is usually 33-35 amino acid residues in length. The present disclosure provides repeat units that are greater than 35 amino acid residues in length. In some embodiments, the present disclosure provides repeat units that are greater than 39 amino acid residues in length. In some embodiments, the present disclosure provides repeat units that are 35 to 40, 39 to 40, 35 to 45, 39 to 45, 35 to 50, 39 to 50, 35 to 60, 39 to 60, 35 to 70, 39 to 70, 35 to 79, or 39 to 79 amino acid residues long.


In other embodiments, the present disclosure provides DNA binding domains (e.g., RNBDs, MAP-NBDs, TALEs) with contracted repeat units. For example, the present disclosure provides repeat units, which are less than 32 amino acid residues in length. In some embodiments, the present disclosure provides repeat units, which are 15 to 32, 16 to 32, 17 to 32, 18 to 32, 19 to 32, 20 to 32, 21 to 32, 22 to 32, 23 to 32, 24 to 32, 25 to 32, 26 to 32, 27 to 32, 28 to 32, 29 to 32, 30 to 32, or 31 to 32 amino acid residues in length.


In some embodiments, expanded repeat units can be tuned to modulate binding of each repeat unit to its target nucleic acid, thereby modulating overall binding of the DNA binding domain (e.g., RNBDs, MAP-NBDs, TALEs) to a target gene of interest. For example, expanding repeat units can improve binding affinity of the repeat unit to its target nucleic acid base and thereby increase binding affinity of the DNA binding domain (e.g., RNBDs, MAP-NBDs, TALEs) to a target gene. In other embodiments, contracting repeat units can improve binding affinity of the repeat unit to its target nucleic acid base and thereby increase binding affinity of the DNA binding domain (e.g., RNBDs, MAP-NBDs, TALEs) for a target gene.


Functional Domains

An RNBD (e.g., Ralstonia solanacearum-derived), or another binding domain (e.g., MAP-NBD or TALE), can be linked to a functional domain. The functional domain can provide different types of activity, such as genome editing, gene regulation (e.g., activation or repression), or visualization of a genomic locus via imaging.


A. Genome Editing Domains


For example, an RNBD (e.g., Ralstonia solanacearum-derived), or another binding domain (e.g., MAP-NBD or TALE), can be linked to a nuclease, wherein the RNBD provides specificity and targeting and the nuclease provides genome editing functionality. In some embodiments, the nuclease can be a cleavage domain, which dimerizes with another copy of the same cleavage domain to form an active full domain capable of cleaving DNA. In other embodiments, the nuclease can be a cleavage domain, which is capable of cleaving DNA without needing to dimerize. For example, a nuclease comprising a cleavage domain can be an endonuclease, such as FokI or BfiI. In some embodiments, two cleavage domains (e.g., FokI or BfiI) can be fused together to form a fully functional single cleavage domain. When cleavage domains are used as the nuclease, two RNBDs can be engineered, the first RNBD binding to a top strand of a target nucleic acid sequence and comprising a first FokI cleavage domain and a second RNBD binding to a bottom strand of a target nucleic acid sequence and comprising a second FokI cleavage domain.


In some embodiments, fully functional cleavage domains capable of cleaving DNA without needing to dimerize include meganucleases, also referred to as homing endonucleases. For example, a meganuclease can include I-AniI or I-OnuI. In some embodiments, the nuclease can be a type IIS restriction enzyme, such as FokI or BfiI.


A nuclease domain fused to an RNBD (e.g., Ralstonia solanacearum-derived), or another binding domain (e.g., MAP-NBD or TALE), can be an endonuclease or an exonuclease. An endonuclease can include restriction endonucleases and homing endonucleases. An endonuclease can also include S1 Nuclease, mung bean nuclease, pancreatic DNase I, micrococcal nuclease, or yeast HO endonuclease. An exonuclease can include a 3′-5′ exonuclease or a 5′-3′ exonuclease. An exonuclease can also include a DNA exonuclease or an RNA exonuclease. Examples of exonucleases include exonucleases I, II, III, IV, V, and VIII; DNA polymerase I; RNA exonuclease 2; and the like.


A nuclease domain fused to an RNBD (e.g., Ralstonia solanacearum-derived), or another binding domain (e.g., MAP-NBD or TALE), can be a restriction endonuclease (or restriction enzyme). In some instances, a restriction enzyme cleaves DNA at a site removed from the recognition site and has separate binding and cleavage domains. In some instances, such a restriction enzyme is a Type IIS restriction enzyme.


A nuclease domain fused to an RNBD (e.g., Ralstonia solanacearum-derived), or another binding domain (e.g., MAP-NBD or TALE), can be a Type IIS nuclease. A Type IIS nuclease can be FokI or BfiI. In some cases, a nuclease domain fused to an RNBD (e.g., Ralstonia solanacearum-derived) is FokI. In other cases, a nuclease domain fused to an RNBD (e.g., Ralstonia solanacearum-derived) is BfiI.


FokI can be a wild-type FokI or can comprise one or more mutations. In some cases, FokI can comprise 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more mutations. A mutation can enhance cleavage efficiency. A mutation can abolish cleavage activity. In some cases, a mutation can modulate homodimerization. For example, FokI can have a mutation at one or more amino acid residue positions 446, 447, 479, 483, 484, 486, 487, 490, 491, 496, 498, 499, 500, 531, 534, 537, and 538 to modulate homodimerization.


In some instances, a FokI cleavage domain is, for example, as described in Kim et al. “Hybrid restriction enzymes: Zinc finger fusions to Fok I cleavage domain,” PNAS 93: 1156-1160 (1996), which is incorporated herein by reference in its entirety. In some cases, a FokI cleavage domain described herein has a sequence as follows: QLVKSELEEKKSELRHKLKYVPHEYIELIEIARNSTQDRILEMKVMEFFMKVYGYRGKHLG GSRKPDGAIYTVGSPIDYGVIVDTKAYSGGYNLPIGQADEMQRYVEENQTRNKHINPNEWW KVYPSSVTEFKFLFVSGHFKGNYKAQLTRLNHITNCNGAVLSVEELLIGGEMIKAGTLTLEE VRRKFNNGEINF (SEQ ID NO: 163). In other instances, a FokI cleavage domain described herein is a FokI, for example, as described in U.S. Pat. No. 8,586,526, which is incorporated herein by reference in its entirety.


An RNBD (e.g., Ralstonia solanacearum-derived) can be linked to a functional group that modifies DNA nucleotides, for example an adenosine deaminase.


In some embodiments, an RNBD (e.g., Ralstonia solanacearum-derived) can be linked to any nuclease as set forth in TABLE 6 showing exemplary amino acid sequences (SEQ ID NO: 1-SEQ ID NO: 81) of endonucleases for genome editing and the corresponding back-translated nucleic acid sequences (SEQ ID NO: 82-SEQ ID NO: 162) of the endonucleases.


For purposes of gene editing, a first DNA binding domain (e.g., of a TALE, RNBD, or MAP-NBD) linked to a cleavage domain and a second DNA binding domain (e.g., of a TALE, RNBD, or MAP-NBD) linked to a cleavage domain can be provided. The first DNA binding domain (e.g., of a TALE, RNBD, or MAP-NBD) linked to a cleavage domain can recognize a top strand of double stranded DNA and bind to said region of double stranded DNA. The second DNA binding domain (e.g., of a TALE, RNBD, or MAP-NBD) linked to a cleavage domain can recognize a separate, non-overlapping bottom strand of double stranded DNA and bind to said region of double stranded DNA. The target nucleic acid sequence on the bottom strand can have its complementary nucleic acid sequence in the top strand positioned 10 to 20 nucleotides towards the 3′ end from the first region. In some embodiments this stretch of 10 to 20 nucleotides can be referred to as the spacer region. In some embodiments, this first DNA binding domain (e.g., of a TALE, RNBD, or MAP-NBD) linked to a cleavage domain and the second DNA binding domain (e.g., of a TALE, RNBD, or MAP-NBD) linked to a cleavage domain both bind at a target site, allowing for dimerization of the two cleavage domains in the spacer region and allowing for catalytic activity and cleaving of the target DNA.
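The paired-site geometry described above can be sketched in a few lines of Python. This is an illustrative sketch only, not part of the disclosed methods: the function names `reverse_complement`, `find_spacer_length`, and `spacer_in_range` and the toy sequences are hypothetical. It assumes the first target is written 5′ to 3′ on the top strand and the second target is written 5′ to 3′ on the bottom strand, so that the reverse complement of the second target appears on the top strand downstream of the first site.

```python
from typing import Optional

# Complement table for unambiguous DNA bases only (an assumption of this sketch).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def find_spacer_length(top_strand: str, left_target: str, right_target: str) -> Optional[int]:
    """Return the number of nucleotides between the two binding sites.

    left_target is read on the top strand; right_target is read on the
    bottom strand, so its reverse complement is searched for on the top
    strand, starting after the end of the left site (non-overlapping).
    Returns None if either site is not found.
    """
    top = top_strand.upper()
    left_start = top.find(left_target.upper())
    if left_start < 0:
        return None
    left_end = left_start + len(left_target)
    right_start = top.find(reverse_complement(right_target), left_end)
    if right_start < 0:
        return None
    return right_start - left_end


def spacer_in_range(spacer: Optional[int], lo: int = 10, hi: int = 20) -> bool:
    """Check the spacer against the 10 to 20 nucleotide range given in the text."""
    return spacer is not None and lo <= spacer <= hi


# Toy example: 10-nt left site, 14-nt spacer, 10-nt right site (hypothetical sequences).
top = "TTGACCTGAA" + "ACGTACGTACGTAC" + "GGATCCTTAG"
spacer = find_spacer_length(top, "TTGACCTGAA", "CTAAGGATCC")
print(spacer, spacer_in_range(spacer))
```

The search uses only the first occurrence of each site; a genome-scale design tool would need to enumerate all occurrences and handle ambiguous bases.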


a. Potency and Specificity of Genome Editing


In some embodiments, the efficiency of genome editing with a genome editing complex of the present disclosure (e.g., any one of an RNBD, MAP-NBD, or TALE fused to any nuclease disclosed herein) can be determined. Specifically, the potency and specificity of the genome editing complex can indicate whether a particular modular nucleic acid binding domain fused to a nuclease provides efficient editing. Potency can be defined as the percent indels (insertions/deletions) that are generated via the non-homologous end joining (NHEJ) pathway at a target site after administering a modular nucleic acid binding domain fused to a nuclease to a subject. A modular nucleic acid binding domain can have a potency of greater than 50%, greater than 55%, greater than 60%, greater than 65%, greater than 70%, greater than 75%, greater than 80%, greater than 85%, greater than 90%, greater than 92%, greater than 95%, greater than 97%, or greater than 99%. A modular nucleic acid binding domain can have a potency of from 50% to 100%, 50% to 60%, 60% to 70%, 70% to 80%, 80% to 90%, or 90% to 100%.


Specificity can be defined as a specificity ratio, wherein the ratio is the percent indels at a target site of interest over the percent indels at the top-ranked off-target site for a particular genome editing complex (e.g., any DNA binding domain linked to a nuclease described herein) of interest. A high specificity ratio would indicate that a modular nucleic acid binding domain fused to a nuclease edits primarily at the desired target site and exhibits fewer instances of undesirable, off-target editing. A low specificity ratio would indicate that a modular nucleic acid binding domain fused to a nuclease does not edit efficiently at the desired target site and/or can indicate that the modular nucleic acid binding domain fused to a nuclease exhibits high off-target activity. A modular nucleic acid binding domain can have a specificity ratio for the target site of at least 50:1, 55:1, 60:1, 65:1, 70:1, 75:1, 80:1, 85:1, 90:1, 92:1, 95:1, 97:1, 99:1, 50:2, 55:2, 60:2, 65:2, 70:2, 75:2, 80:2, 85:2, 90:2, 92:2, 95:2, 97:2, 99:2, 50:3, 55:3, 60:3, 65:3, 70:3, 75:3, 80:3, 85:3, 90:3, 92:3, 95:3, 97:3, 99:3, 50:4, 55:4, 60:4, 65:4, 70:4, 75:4, 80:4, 85:4, 90:4, 92:4, 95:4, 97:4, 99:4, 50:5, 55:5, 60:5, 65:5, 70:5, 75:5, 80:5, 85:5, 90:5, 92:5, 95:5, 97:5, or 99:5. A modular nucleic acid binding domain can have a specificity ratio for the target site from 50:1 to 100:1, 99:5 to 50:1, or 99:5 to 100:1. Percent indels can be measured via deep sequencing techniques.


In some embodiments, the present disclosure provides a polypeptide comprising a modular nucleic acid binding domain comprising a potency for a target site greater than 65% and a specificity ratio for the target site of at least 50:1; and a functional domain; wherein: the modular nucleic acid binding domain comprises a plurality of repeat units; at least one repeat unit of the plurality of repeat units comprises a binding region configured to bind to a target nucleic acid base within the target site; the potency comprises indel percentage at the target site, and wherein the specificity ratio comprises indel percentage at the target site over indel percentage at a top-ranked off-target site of the polypeptide. Indel percentage can be measured by deep sequencing.
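As a worked illustration of these definitions, the following Python sketch computes potency as an indel percentage and the specificity ratio as on-target indel percentage over indel percentage at the top-ranked off-target site. The function names and the read counts are hypothetical and chosen for illustration; only the definitions and the example thresholds (potency greater than 65%, specificity ratio of at least 50:1) come from the text. The handling of an off-target indel percentage of zero is an assumption of this sketch.

```python
def indel_percentage(indel_reads: int, total_reads: int) -> float:
    """Percent of sequencing reads at a site that carry an insertion or deletion."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    if not 0 <= indel_reads <= total_reads:
        raise ValueError("indel_reads must be between 0 and total_reads")
    return 100.0 * indel_reads / total_reads


def specificity_ratio(on_target_pct: float, off_target_pct: float) -> float:
    """On-target indel percentage divided by top-ranked off-target indel percentage.

    Assumption of this sketch: with no detectable off-target indels the ratio
    is reported as infinity (or 0.0 if there is no on-target editing either).
    """
    if off_target_pct == 0:
        return float("inf") if on_target_pct > 0 else 0.0
    return on_target_pct / off_target_pct


def meets_thresholds(potency_pct: float, ratio: float,
                     min_potency: float = 65.0, min_ratio: float = 50.0) -> bool:
    """Potency strictly greater than min_potency and ratio of at least min_ratio:1."""
    return potency_pct > min_potency and ratio >= min_ratio


# Hypothetical deep-sequencing read counts.
potency = indel_percentage(indel_reads=7200, total_reads=10000)
off_target = indel_percentage(indel_reads=12, total_reads=10000)
ratio = specificity_ratio(potency, off_target)
print(f"potency={potency:.1f}% ratio={ratio:.0f}:1 passes={meets_thresholds(potency, ratio)}")
```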


The top-ranked off-target site for a polypeptide (e.g., a modular nucleic acid binding domain linked to a cleavage domain) can be determined using the predicted report of genome-wide nuclease off-target sites (PROGNOS) ranking algorithms as described in Fine et al. (Nucleic Acids Res. 2014 April; 42(6):e42. doi: 10.1093/nar/gkt1326. Epub 2013 Dec. 30.). As described in Fine et al., the PROGNOS algorithm TALEN v2.0 can use the DNA target sequence as input; prior construction and experimental characterization of the specific nucleases are not necessary. Based on the differences between the sequence of a potential off-target site in the genome and the intended target sequence, the algorithm can generate a score that is used to rank potential off-target sites. If two (or more) potential off-target sites have equal scores, they can be further ranked by the type of genomic region annotated for each site with the following order: Exon>Promoter>Intron>Intergenic. A final ranking by chromosomal location can be employed as a tie-breaker to ensure consistency in the ranking order. Thus, a score can be generated for each potential off-target site.
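The tie-breaking order described above can be sketched as a multi-key sort. This Python sketch does not implement PROGNOS scoring; the scores are supplied as hypothetical inputs, and the `OffTargetSite` class and `rank_off_target_sites` function are illustrative names. Two further assumptions are made here: a higher score ranks first, and "chromosomal location" is ordered by chromosome name and then by position.

```python
from dataclasses import dataclass
from typing import List

# Region order from the text: Exon > Promoter > Intron > Intergenic.
REGION_PRIORITY = {"Exon": 0, "Promoter": 1, "Intron": 2, "Intergenic": 3}


@dataclass(frozen=True)
class OffTargetSite:
    chromosome: str  # e.g. "chr1"
    position: int    # coordinate on the chromosome
    score: float     # hypothetical precomputed score
    region: str      # one of the REGION_PRIORITY keys


def rank_off_target_sites(sites: List[OffTargetSite]) -> List[OffTargetSite]:
    """Sort by score (assumed: higher first), then region type, then location."""
    return sorted(
        sites,
        key=lambda s: (-s.score, REGION_PRIORITY[s.region], s.chromosome, s.position),
    )


# Hypothetical candidate sites; three of them tie on score.
candidates = [
    OffTargetSite("chr2", 100, 90.0, "Intron"),
    OffTargetSite("chr1", 500, 90.0, "Exon"),
    OffTargetSite("chr1", 200, 95.0, "Intergenic"),
    OffTargetSite("chr1", 50, 90.0, "Exon"),
]
ranked = rank_off_target_sites(candidates)
top_ranked = ranked[0]  # the site used as the denominator of the specificity ratio
for site in ranked:
    print(site)
```

Sorting on a tuple key applies each criterion only when the earlier ones tie, which matches the sequential tie-breaking the text describes.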


B. Regulatory Domains


As another example, an RNBD (e.g., Ralstonia solanacearum-derived), or another binding domain (e.g., MAP-NBD or TALE), can be linked to a gene regulation domain. A gene regulation domain can be an activator or a repressor. For example, an RNBD (e.g., Ralstonia solanacearum-derived), or another binding domain (e.g., MAP-NBD or TALE), can be linked to an activation domain, such as VP16, VP64, p65, p300 catalytic domain, TET1 catalytic domain, TDG, Ldb1 self-associated domain, SAM activator (VP64, p65, HSF1), or VPR (VP64, p65, Rta). Alternatively, an RNBD (e.g., Ralstonia solanacearum-derived), or another binding domain (e.g., MAP-NBD or TALE), can be linked to a repressor, such as KRAB, Sin3a, LSD1, SUV39H1, G9A (EHMT2), DNMT1, DNMT3A-DNMT3L, DNMT3B, KOX, TGF-beta-inducible early gene (TIEG), v-erbA, SID, MBD2, MBD3, Rb, or MeCP2.


In some embodiments, an RNBD (e.g., Ralstonia solanacearum-derived), or another binding domain (e.g., MAP-NBD or TALE), can be linked to a DNA modifying protein, such as DNMT3a. An RNBD (e.g., Ralstonia solanacearum-derived), or another binding domain (e.g., MAP-NBD or TALE), can be linked to a chromatin-modifying protein, such as lysine-specific histone demethylase 1 (LSD1). An RNBD (e.g., Ralstonia solanacearum-derived), or another binding domain (e.g., MAP-NBD or TALE), can be linked to a protein that is capable of recruiting other proteins, such as KRAB. The DNA modifying protein (e.g., DNMT3a) and proteins capable of recruiting other proteins (e.g., KRAB) can serve as repressors of transcription. Thus, an RNBD (e.g., Ralstonia solanacearum-derived), or another binding domain (e.g., MAP-NBD or TALE), linked to a DNA modifying protein (e.g., DNMT3a) or a domain capable of recruiting other proteins (e.g., KRAB, a domain found in transcriptional repressors, such as Kox1) can provide gene repression functionality and can serve as a transcription factor. In such a fusion, the RNBD (e.g., Ralstonia solanacearum-derived), or another binding domain (e.g., MAP-NBD or TALE), provides specificity and targeting, and the DNA modifying protein or the protein capable of recruiting other proteins provides gene repression functionality. Such a fusion can be referred to as a TALE-transcription factor (TALE-TF), RNBD-transcription factor (RNBD-TF), or MAP-NBD-transcription factor (MAP-NBD-TF).


In some embodiments, expression of the target gene can be reduced by at least 5%, at least 10%, at least 15%, at least 20%, at least 25%, at least 30%, at least 35%, at least 40%, at least 45%, at least 50%, at least 55%, at least 60%, at least 65%, at least 70%, at least 75%, at least 80%, at least 85%, at least 90%, at least 92%, at least 95%, at least 97%, or at least 99% by using a DNA binding domain fused to a repression domain (e.g., an RNBD-TF, a MAP-NBD-TF, or TALE-TF) of the present disclosure as compared to non-treated cells. In some embodiments, expression of the target gene can be reduced by 5% to 10%, 10% to 15%, 15% to 20%, 20% to 25%, 25% to 30%, 30% to 35%, 35% to 40%, 40% to 45%, 45% to 50%, 50% to 55%, 55% to 60%, 60% to 65%, 65% to 70%, 70% to 75%, 75% to 80%, 80% to 85%, 85% to 90%, 90% to 95%, or 95% to 99% by using an RNBD-TF, a MAP-NBD-TF, or TALE-TF of the present disclosure as compared to non-treated cells. In some embodiments, expression of the checkpoint gene can be reduced by over 90% by using an RNBD-TF, a MAP-NBD-TF, or TALE-TF of the present disclosure as compared to non-treated cells.


In some embodiments, repression of the target gene with a DNA binding domain fused to a repression domain (e.g., an RNBD-TF, a MAP-NBD-TF, or TALE-TF) of the present disclosure and subsequent reduced expression of the target gene can last for at least 1 day, at least 2 days, at least 3 days, at least 4 days, at least 5 days, at least 6 days, at least 7 days, at least 8 days, at least 9 days, at least 10 days, at least 11 days, at least 12 days, at least 13 days, at least 14 days, at least 15 days, at least 16 days, at least 17 days, at least 18 days, at least 19 days, at least 20 days, at least 21 days, at least 22 days, at least 23 days, at least 24 days, at least 25 days, at least 26 days, at least 27 days, or at least 28 days. In some embodiments, repression of the target gene with an RNBD-TF, a MAP-NBD-TF, or TALE-TF of the present disclosure and subsequent reduced expression of the target gene can last for 1 day to 3 days, 3 days to 5 days, 5 days to 7 days, 7 days to 9 days, 9 days to 11 days, 11 days to 13 days, 13 days to 15 days, 15 days to 17 days, 17 days to 19 days, 19 days to 21 days, 21 days to 23 days, 23 days to 25 days, or 25 days to 28 days.


In various aspects, the present disclosure provides a method of identifying a target binding site in a target gene of a cell, the method comprising: (a) contacting a cell with an engineered genomic regulatory complex comprising a DNA binding domain, a repressor domain, and a linker; (b) measuring expression of the target gene; and (c) determining expression of the target gene is repressed by at least 50%, at least 60%, at least 70%, at least 80%, at least 85%, at least 90%, at least 92%, at least 95%, at least 97%, or at least 99% for at least 3 days, wherein the target gene is selected from: a checkpoint gene and a T cell surface receptor.
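Steps (b) and (c) of this method can be illustrated with a short Python sketch. The function names, the expression values, and the sampling days are hypothetical. The sketch assumes that expression is measured in treated and non-treated cells on the same days, that percent repression is computed relative to non-treated cells, and that "repressed for at least 3 days" is read as the span between the first and last of a run of consecutive time points that all meet the threshold. Other readings of the duration criterion are possible.

```python
from typing import Dict


def percent_repression(treated: float, untreated: float) -> float:
    """Percent reduction in expression relative to non-treated cells."""
    if untreated <= 0:
        raise ValueError("untreated expression must be positive")
    return 100.0 * (1.0 - treated / untreated)


def longest_repressed_span_days(treated_by_day: Dict[int, float],
                                untreated_by_day: Dict[int, float],
                                min_repression_pct: float = 50.0) -> int:
    """Longest span (in days) covered by consecutive time points meeting the threshold."""
    days = sorted(set(treated_by_day) & set(untreated_by_day))
    best = 0
    run_start = None
    for day in days:
        repression = percent_repression(treated_by_day[day], untreated_by_day[day])
        if repression >= min_repression_pct:
            if run_start is None:
                run_start = day
            best = max(best, day - run_start)
        else:
            run_start = None  # the run is broken; start over at the next passing day
    return best


def is_sustained(span_days: int, min_days: int = 3) -> bool:
    """True if the repressed span is at least min_days long."""
    return span_days >= min_days


# Hypothetical expression values (arbitrary units) keyed by day after treatment.
treated = {1: 20.0, 2: 15.0, 3: 30.0, 5: 40.0, 7: 70.0}
untreated = {1: 100.0, 2: 100.0, 3: 100.0, 5: 100.0, 7: 100.0}
span = longest_repressed_span_days(treated, untreated, min_repression_pct=50.0)
print(span, is_sustained(span))
```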


In some aspects, expression of the target gene is repressed in at least 75%, at least 80%, at least 85%, at least 90%, at least 95%, or at least 99% of a plurality of the cells. In some aspects, the engineered genomic regulatory complex is undetectable after at least 3 days. In some aspects, determining the engineered genomic regulatory complex is undetectable is measured by qPCR, imaging of a FLAG-tag, or a combination thereof. In some aspects, the measuring expression of the target gene comprises flow cytometry quantification of expression of the target gene.


In some embodiments, repression of the target gene with a DNA binding domain fused to a repression domain (e.g., an RNBD-TF, a MAP-NBD-TF, or TALE-TF) of the present disclosure can last even after the DNA binding domain-gene regulator becomes undetectable. The DNA binding domain fused to a repression domain (e.g., an RNBD-TF, a MAP-NBD-TF, or TALE-TF) can become undetectable after at least 3 days. In some embodiments, the DNA binding domain fused to a repression domain (e.g., an RNBD-TF, a MAP-NBD-TF, or TALE-TF) can become undetectable after at least 1 day, at least 2 days, at least 3 days, at least 4 days, at least 5 days, at least 6 days, at least 1 week, at least 2 weeks, at least 3 weeks, or at least 4 weeks. In some embodiments, qPCR or imaging via the FLAG-tag can be used to confirm that the DNA binding domain fused to a repression domain (e.g., an RNBD-TF, a MAP-NBD-TF, or TALE-TF) is no longer detectable.


C. Imaging Moieties


An RNBD (e.g., Ralstonia solanacearum-derived), or another binding domain (e.g., MAP-NBD or TALE), can be linked to a fluorophore, such as Hydroxycoumarin, methoxycoumarin, Alexa fluor, aminocoumarin, Cy2, FAM, Alexa fluor 488, Fluorescein FITC, Alexa fluor 430, Alexa fluor 532, HEX, Cy3, TRITC, Alexa fluor 546, Alexa fluor 555, R-phycoerythrin (PE), Rhodamine Red-X, TAMRA, Cy3.5, Rox, Alexa fluor 568, Red 613, Texas Red, Alexa fluor 594, Alexa fluor 633, Allophycocyanin, Alexa fluor 633, Cy5, Alexa fluor 660, Cy5.5, TruRed, Alexa fluor 680, Cy7, GFP, or mCHERRY. An RNBD (e.g., Ralstonia solanacearum-derived) can be linked to a biotinylation reagent.


Genes and Indications of Interest

In some embodiments, genome editing can be performed by fusing a nuclease of the present disclosure with a DNA binding domain for a particular genomic locus of interest. Genetic modification can involve introducing a functional gene for therapeutic purposes, knocking out a gene for therapeutic purposes, or engineering a cell ex vivo (e.g., HSCs or CAR T cells) to be administered back into a subject in need thereof. For example, the genome editing complex can have a target site within PDCD1, CTLA4, LAG3, TET2, BTLA, HAVCR2, CCR5, CXCR4, TRA, TRB, B2M, albumin, HBB, HBA1, TTR, NR3C1, CD52, erythroid specific enhancer of the BCL11A gene, CBLB, TGFBR1, SERPINA1, HBV genomic DNA in infected cells, CEP290, DMD, CFTR, IL2RG, CS-1, or any combination thereof. In some embodiments, a genome editing complex can cleave double stranded DNA at a target site in order to insert a chimeric antigen receptor (CAR), alpha-L iduronidase (IDUA), iduronate-2-sulfatase (IDS), or Factor 9 (F9). Cells, such as hematopoietic stem cells (HSCs) and T cells, can be engineered ex vivo with the genome editing complex. Alternatively, genome editing complexes can be directly administered to a subject in need thereof.


The subject receiving treatment can be suffering from a disease such as transthyretin amyloidosis (ATTR), HIV, glioblastoma multiforme, cancer, acute lymphoblastic leukemia, acute myeloid leukemia, beta-thalassemia, sickle cell disease, MPSI, MPSII, Hemophilia B, multiple myeloma, melanoma, sarcoma, Leber congenital amaurosis (LCA10), CD19 malignancies, BCMA-related malignancies, Duchenne muscular dystrophy (DMD), cystic fibrosis, alpha-1 antitrypsin deficiency, X-linked severe combined immunodeficiency (X-SCID), or Hepatitis B.


Samples for Analysis

In some aspects, described herein are methods of modifying the genetic material of a target cell utilizing an RNBD described herein. A sample described herein may be a fresh sample. The sample may be a live sample.


The sample may be a cell sample. The cell sample may be obtained from the cells or tissue of an animal. The animal cell may comprise a cell from an invertebrate, fish, amphibian, reptile, or mammal. The mammalian cell may be obtained from a primate, ape, equine, bovine, porcine, canine, feline, or rodent. The mammal may be a primate, ape, dog, cat, rabbit, ferret, or the like. The rodent may be a mouse, rat, hamster, gerbil, chinchilla, or guinea pig. The bird cell may be from a canary, parakeet, or parrot. The reptile cell may be from a turtle, lizard, or snake. The fish cell may be from a tropical fish. For example, the fish cell may be from a zebrafish (such as Danio rerio). The amphibian cell may be from a frog. An invertebrate cell may be from an insect, arthropod, marine invertebrate, or worm. The worm cell may be from a nematode (such as Caenorhabditis elegans). The arthropod cell may be from a tarantula or hermit crab.


The cell sample may be obtained from a mammalian cell. For example, the mammalian cell may be an epithelial cell, connective tissue cell, hormone secreting cell, a nerve cell, a skeletal muscle cell, a blood cell, an immune system cell, or a stem cell. A cell may be a fresh cell, live cell, fixed cell, intact cell, or cell lysate. Cell samples can be any primary cell, such as a hematopoietic stem cell (HSC) or naïve or stimulated T cells (e.g., CD4+ T cells).


Cell samples may be cells derived from a cell line, such as an immortalized cell line. Exemplary cell lines include, but are not limited to, 293A cell line, 293FT cell line, 293F cell line, 293 H cell line, HEK 293 cell line, CHO DG44 cell line, CHO-S cell line, CHO-K1 cell line, Expi293F™ cell line, Flp-In™ T-REx™ 293 cell line, Flp-In™-293 cell line, Flp-In™-3T3 cell line, Flp-In™-BHK cell line, Flp-In™-CHO cell line, Flp-In™-CV-1 cell line, Flp-In™-Jurkat cell line, FreeStyle™ 293-F cell line, FreeStyle™ CHO-S cell line, GripTite™ 293 MSR cell line, GS-CHO cell line, HepaRG™ cell line, T-REx™ Jurkat cell line, Per.C6 cell line, T-REx™-293 cell line, T-REx™-CHO cell line, T-REx™-HeLa cell line, NC-HIMT cell line, PC12 cell line, A549 cells, and K562 cells.


In some embodiments, an RNBD of the present disclosure can be used to modify a target cell. The target cell can itself be unmodified or modified. For example, an unmodified cell can be edited with an RNBD of the present disclosure to introduce an insertion, deletion, or mutation in its genome. In some embodiments, a modified cell already having a mutation can be repaired with an RNBD of the present disclosure.


In some instances, a target cell is a cell comprising one or more single nucleotide polymorphisms (SNPs). In some instances, an RNBD-nuclease described herein is designed to target and edit a target cell comprising a SNP.


In some cases, a target cell is a cell that does not contain a modification. For example, a target cell can comprise a genome without genetic defect (e.g., without genetic mutation) and an RNBD-nuclease described herein can be used to introduce a modification (e.g., a mutation) within the genome.


The cell sample may be obtained from cells of a primate. The primate may be a human, or a non-human primate. The cell sample may be obtained from a human. For example, the cell sample may comprise cells obtained from blood, urine, stool, saliva, lymph fluid, cerebrospinal fluid, synovial fluid, cystic fluid, ascites, pleural effusion, amniotic fluid, chorionic villus sample, vaginal fluid, interstitial fluid, buccal swab sample, sputum, bronchial lavage, Pap smear sample, or ocular fluid. The cell sample may comprise cells obtained from a blood sample, an aspirate sample, or a smear sample.


The cell sample may be a circulating tumor cell sample. A circulating tumor cell sample may comprise lymphoma cells, fetal cells, apoptotic cells, epithelial cells, endothelial cells, stem cells, progenitor cells, mesenchymal cells, osteoblast cells, osteocytes, hematopoietic stem cells (HSC) (e.g., a CD34+ HSC), foam cells, adipose cells, transcervical cells, circulating cardiocytes, circulating fibrocytes, circulating cancer stem cells, circulating myocytes, circulating cells from a kidney, circulating cells from a gastrointestinal tract, circulating cells from a lung, circulating cells from reproductive organs, circulating cells from a central nervous system, circulating hepatic cells, circulating cells from a spleen, circulating cells from a thymus, circulating cells from a thyroid, circulating cells from an endocrine gland, circulating cells from a parathyroid, circulating cells from a pituitary, circulating cells from an adrenal gland, circulating cells from islets of Langerhans, circulating cells from a pancreas, circulating cells from a hypothalamus, circulating cells from prostate tissues, circulating cells from breast tissues, circulating retinal cells, circulating ophthalmic cells, circulating auditory cells, circulating epidermal cells, circulating cells from the urinary tract, or combinations thereof.


The cell can be a T cell. For example, in some embodiments, the T cell can be an engineered T cell transduced to express a chimeric antigen receptor (CAR). The CAR T cell can be engineered to bind to BCMA, CD19, CD22, WT1, L1CAM, MUC16, ROR1, or LeY.


A cell sample may be a peripheral blood mononuclear cell sample.


A cell sample may comprise cancerous cells. The cancerous cells may form a cancer which may be a solid tumor or a hematologic malignancy. The cancerous cell sample may comprise cells obtained from a solid tumor. The solid tumor may include a sarcoma or a carcinoma. Exemplary sarcoma cell samples may include, but are not limited to, cell samples obtained from alveolar rhabdomyosarcoma, alveolar soft part sarcoma, ameloblastoma, angiosarcoma, chondrosarcoma, chordoma, clear cell sarcoma of soft tissue, dedifferentiated liposarcoma, desmoid, desmoplastic small round cell tumor, embryonal rhabdomyosarcoma, epithelioid fibrosarcoma, epithelioid hemangioendothelioma, epithelioid sarcoma, esthesioneuroblastoma, Ewing sarcoma, extrarenal rhabdoid tumor, extraskeletal myxoid chondrosarcoma, extraskeletal osteosarcoma, fibrosarcoma, giant cell tumor, hemangiopericytoma, infantile fibrosarcoma, inflammatory myofibroblastic tumor, Kaposi sarcoma, leiomyosarcoma of bone, liposarcoma, liposarcoma of bone, malignant fibrous histiocytoma (MFH), malignant fibrous histiocytoma (MFH) of bone, malignant mesenchymoma, malignant peripheral nerve sheath tumor, mesenchymal chondrosarcoma, myxofibrosarcoma, myxoid liposarcoma, myxoinflammatory fibroblastic sarcoma, neoplasms with perivascular epithelioid cell differentiation, osteosarcoma, parosteal osteosarcoma, periosteal osteosarcoma, pleomorphic liposarcoma, pleomorphic rhabdomyosarcoma, PNET/extraskeletal Ewing tumor, rhabdomyosarcoma, round cell liposarcoma, small cell osteosarcoma, solitary fibrous tumor, synovial sarcoma, or telangiectatic osteosarcoma.


Exemplary carcinoma cell samples may include, but are not limited to, cell samples obtained from an anal cancer, appendix cancer, bile duct cancer (i.e., cholangiocarcinoma), bladder cancer, brain tumor, breast cancer, cervical cancer, colon cancer, cancer of unknown primary (CUP), esophageal cancer, eye cancer, fallopian tube cancer, gastroenterological cancer, kidney cancer, liver cancer, lung cancer, medulloblastoma, melanoma, oral cancer, ovarian cancer, pancreatic cancer, parathyroid disease, penile cancer, pituitary tumor, prostate cancer, rectal cancer, skin cancer, stomach cancer, testicular cancer, throat cancer, thyroid cancer, uterine cancer, vaginal cancer, or vulvar cancer.


The cancerous cell sample may comprise cells obtained from a hematologic malignancy. Hematologic malignancy may comprise a leukemia, a lymphoma, a myeloma, a non-Hodgkin's lymphoma, or a Hodgkin's lymphoma. The hematologic malignancy may be a T-cell based hematologic malignancy. The hematologic malignancy may be a B-cell based hematologic malignancy. Exemplary B-cell based hematologic malignancies may include, but are not limited to, chronic lymphocytic leukemia (CLL), small lymphocytic lymphoma (SLL), high risk CLL, a non-CLL/SLL lymphoma, prolymphocytic leukemia (PLL), follicular lymphoma (FL), diffuse large B-cell lymphoma (DLBCL), mantle cell lymphoma (MCL), Waldenström's macroglobulinemia, multiple myeloma, extranodal marginal zone B cell lymphoma, nodal marginal zone B cell lymphoma, Burkitt's lymphoma, non-Burkitt high grade B cell lymphoma, primary mediastinal B-cell lymphoma (PMBL), immunoblastic large cell lymphoma, precursor B-lymphoblastic lymphoma, B cell prolymphocytic leukemia, lymphoplasmacytic lymphoma, splenic marginal zone lymphoma, plasma cell myeloma, plasmacytoma, mediastinal (thymic) large B cell lymphoma, intravascular large B cell lymphoma, primary effusion lymphoma, or lymphomatoid granulomatosis. Exemplary T-cell based hematologic malignancies may include, but are not limited to, peripheral T-cell lymphoma not otherwise specified (PTCL-NOS), anaplastic large cell lymphoma, angioimmunoblastic lymphoma, cutaneous T-cell lymphoma, adult T-cell leukemia/lymphoma (ATLL), blastic NK-cell lymphoma, enteropathy-type T-cell lymphoma, hepatosplenic gamma-delta T-cell lymphoma, lymphoblastic lymphoma, nasal NK/T-cell lymphomas, or treatment-related T-cell lymphomas.


A cell sample described herein may comprise a tumor cell line sample. Exemplary tumor cell line sample may include, but are not limited to, cell samples from tumor cell lines such as 600MPE, AU565, BT-20, BT-474, BT-483, BT-549, Evsa-T, Hs578T, MCF-7, MDA-MB-231, SkBr3, T-47D, HeLa, DU145, PC3, LNCaP, A549, H1299, NCI-H460, A2780, SKOV-3/Luc, Neuro2a, RKO, RKO-AS45-1, HT-29, SW1417, SW948, DLD-1, SW480, Capan-1, MC/9, B72.3, B25.2, B6.2, B38.1, DMS 153, SU.86.86, SNU-182, SNU-423, SNU-449, SNU-475, SNU-387, Hs 817.T, LMH, LMH/2A, SNU-398, PLHC-1, HepG2/SF, OCI-Ly1, OCI-Ly2, OCI-Ly3, OCI-Ly4, OCI-Ly6, OCI-Ly7, OCI-Ly10, OCI-Ly18, OCI-Ly19, U2932, DB, HBL-1, RIVA, SUDHL2, TMD8, MEC1, MEC2, 8E5, CCRF-CEM, MOLT-3, TALL-104, AML-193, THP-1, BDCM, HL-60, Jurkat, RPMI 8226, MOLT-4, RS4, K-562, KASUMI-1, Daudi, GA-10, Raji, JeKo-1, NK-92, and Mino.


A cell sample may comprise cells obtained from a biopsy sample, necropsy sample, or autopsy sample.


The cell samples (such as a biopsy sample) may be obtained from an individual by any suitable means of obtaining the sample using well-known and routine clinical methods. Procedures for obtaining tissue samples from an individual are well known. For example, procedures for drawing and processing tissue sample such as from a needle aspiration biopsy are well-known and may be employed to obtain a sample for use in the methods provided. Typically, for collection of such a tissue sample, a thin hollow needle is inserted into a mass such as a tumor mass for sampling of cells that, after being stained, will be examined under a microscope.


A cell may be a live cell. A cell may be a eukaryotic cell. A cell may be a yeast cell. A cell may be a plant cell. A cell may be obtained from an agricultural plant.


EXAMPLES

These examples are provided for illustrative purposes only and not to limit the scope of the claims provided herein.


Example 1
Genome Editing Complexes and Gene Repressors

This example describes genome editing complexes and gene repressors. A Ralstonia-derived modular nucleic acid binding domain (RNBD) is engineered by encoding for a plurality of repeat units, wherein each repeat unit is selected from any combination of SEQ ID NO: 168-SEQ ID NO: 263 or SEQ ID NO: 336-SEQ ID NO: 356. RNBDs are engineered to have an N-terminus as set forth in SEQ ID NO: 264 or SEQ ID NO: 303 and a C-terminus as set forth in SEQ ID NO: 266. The RNBD is engineered to also include a half repeat as set forth in SEQ ID NO: 265, prior to the C-terminus of SEQ ID NO: 266.


Genome Editing. The RNBD is linked to a nuclease, such as FokI or any one of SEQ ID NO: 1-SEQ ID NO: 81 (nucleic acid sequences of SEQ ID NO: 82-SEQ ID NO: 162).


Gene Regulation. The RNBD is linked to an activator (e.g., VP16, VP64, p65, p300 catalytic domain, TET1 catalytic domain, TDG, Ldb1 self-associated domain, SAM activator (VP64, p65, HSF1), or VPR (VP64, p65, Rta)) or a repressor (e.g., KRAB, Sin3a, LSD1, SUV39H1, G9A (EHMT2), DNMT1, DNMT3A-DNMT3L, DNMT3B, KOX, TGF-beta-inducible early gene (TIEG), v-erbA, SID, MBD2, MBD3, Rb, or MeCP2).


Example 2
Mixed DNA Binding Domains

This example illustrates mixed DNA binding domains fused to nucleases to form genome editing complexes or fused to regulation domains to form gene activators or repressors. A Ralstonia-derived modular nucleic acid binding domain (RNBD) is engineered by encoding for a plurality of repeat units, wherein each repeat unit is selected from any combination of SEQ ID NO: 168-SEQ ID NO: 263 or SEQ ID NO: 336-SEQ ID NO: 356. RNBDs are engineered with an N-terminus as set forth in SEQ ID NO: 301 (Xanthomonas) or SEQ ID NO: 304 (Legionella). RNBDs are engineered with a C-terminus as set forth in SEQ ID NO: 298 (Xanthomonas) or SEQ ID NO: 306 (Legionella).


Genome Editing. The RNBD is linked to a nuclease, such as FokI or any one of SEQ ID NO: 1-SEQ ID NO: 81 (nucleic acid sequences of SEQ ID NO: 82-SEQ ID NO: 162).


Gene Regulation. The RNBD is linked to an activator (e.g., VP16, VP64, p65, p300 catalytic domain, TET1 catalytic domain, TDG, Ldb1 self-associated domain, SAM activator (VP64, p65, HSF1), or VPR (VP64, p65, Rta)) or a repressor (e.g., KRAB, Sin3a, LSD1, SUV39H1, G9A (EHMT2), DNMT1, DNMT3A-DNMT3L, DNMT3B, KOX, TGF-beta-inducible early gene (TIEG), v-erbA, SID, MBD2, MBD3, Rb, or MeCP2).


Example 3
Genome Editing with an RNBD Fused to a Nuclease

This example illustrates genome editing with an RNBD fused to a nuclease. A first modular Ralstonia nucleic acid binding domain (RNBD) described herein is fused to a cleavage half domain, such as a nuclease, and a second modular Ralstonia DNA binding domain (RNBD) described herein is fused to another cleavage half domain. The nucleic acid binding domains are fused to the nuclease, optionally, via a naturally occurring linker, a variant or truncation of a naturally occurring linker, or a synthetic linker. The first RNBD-nuclease complex recognizes a target nucleic acid sequence on the top strand of double stranded DNA and binds said region of the double stranded DNA, and the second RNBD-nuclease complex recognizes a target nucleic acid sequence on the bottom strand of double stranded DNA and binds said region of the double stranded DNA. The 3′ end of the target nucleic acid sequence on the top strand and the 3′ end of the target nucleic acid sequence on the bottom strand are spaced 2 to 50 base pairs apart, referred to herein as the “spacer region.” Gene editing is carried out by dimerization of the two cleavage half domains in the spacer region followed by cleaving of the DNA phosphodiester bonds. Gene editing allows for the insertion of a sequence or deletion of a sequence.
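The spacing requirement described above can be illustrated with a short sketch. The helper names, the toy sequences, and the approach of locating both sites on a top-strand string are assumptions made for illustration and are not part of the disclosure. The sketch reports the number of base pairs between the 3′ end of the top-strand site and the 3′ end of the bottom-strand site and checks it against the 2 to 50 base pair range.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(seq.upper()))


def spacer_length(top_strand, top_site, bottom_site):
    """Base pairs between the 3' ends of two target sites, or None if not found.

    top_site is written 5'->3' as it reads on the top strand.
    bottom_site is written 5'->3' as it reads on the bottom strand, so it
    appears on the top strand as its reverse complement.
    """
    top = top_strand.upper()
    start = top.find(top_site.upper())
    if start < 0:
        return None
    top_site_end = start + len(top_site)  # first position after the top-site 3' end
    # The 3' end of the bottom-strand site is the leftmost position of its
    # reverse complement on the top strand.
    bottom_start = top.find(reverse_complement(bottom_site), top_site_end)
    if bottom_start < 0:
        return None
    return bottom_start - top_site_end


def spacer_in_range(length, minimum=2, maximum=50):
    """True if the spacer is within the 2 to 50 base pair range given above."""
    return length is not None and minimum <= length <= maximum


# Toy locus (hypothetical): flank + top site + 15 bp spacer + bottom site + flank.
top_site = "TTCAGGACTA"
bottom_site = "TGGCATCAAG"
locus = "GG" + top_site + "ACGTACGTACGTACG" + reverse_complement(bottom_site) + "CC"
length = spacer_length(locus, top_site, bottom_site)
print(length, spacer_in_range(length))  # prints: 15 True
```

The search for the second site starts after the first site, so overlapping sites are reported as not found rather than as a negative spacer.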


Direct Administration to Introduce a Gene

The genome editing complex is administered directly to a subject in need thereof and is taken up by a cell. The subject has a disease. The DNA binding domain of the genome editing complex binds a region of DNA in a target cell and the cleavage domain induces a double strand break in the DNA of the target cell to introduce a gene. The introduced gene is a mutated gene or a functional gene.


Factor IX. The genome editing complex with a cleavage domain introduces a double strand break into the albumin gene locus (e.g., into intron 1) concomitant with delivery to the cell of an ectopic nucleic acid bearing a cDNA of the factor IX gene. The double strand break leads to the integration of the ectopic nucleic acid into intron 1 of the albumin gene; the factor IX protein is secreted by the cell into the circulation. The target cell is a hepatocyte and the subject in need thereof has Hemophilia B.


Ex Vivo Engineering of a Cell to Introduce a Gene

The genome editing complex is transfected into cells ex vivo along with an ectopic nucleic acid bearing a gene. Upon transfection of cells ex vivo, the DNA binding domain of the genome editing complex binds a region of DNA in a target cell and the cleavage domain induces a double strand break in the DNA of the target cell to introduce an ectopically provided gene (also provided to the cell) into the region cleaved by the genome editing complex. The resulting engineered cells with modified DNA are administered to a subject in need thereof. The subject has a disease.


CAR. The genome editing complex with a cleavage domain introduces a chimeric antigen receptor (CAR) by editing the DNA of a target cell. The target cell is a T cell and the subject has cancer, such as a blood cancer. Upon administration of the engineered cells to a subject, the engineered CAR T cells effectively eliminate cancer in the subject.


Direct Administration to Partially or Completely Knock Out a Gene

The genome editing complex is administered directly to a subject in need thereof and is taken up by a cell. The subject has a disease. The DNA binding domain of the genome editing complex binds a region of DNA in a target cell and the cleavage domain induces a double strand break in the DNA of the target cell to partially or completely knock out a gene.


TTR. The genome editing complex with a cleavage domain partially or completely knocks out the transthyretin (TTR) gene by editing the DNA of a target cell. The target cell is a liver cell and the subject in need thereof has transthyretin amyloidosis (ATTR).


SERPINA1. The genome editing complex with a cleavage domain partially or completely knocks out the SERPINA1 gene by editing the DNA of a target cell. The target cell is a liver cell and the subject in need thereof has alpha-1 antitrypsin deficiency (dA1AT def).


Ex Vivo Engineering of a Cell to Partially or Completely Knock Out a Gene or a Gene Regulatory Region

The genome editing complex is transfected into cells ex vivo. Upon transfection of cells ex vivo, the DNA binding domain of the genome editing complex binds a region of DNA in a target cell and the cleavage domain induces a double strand break in the DNA of the target cell to partially or completely knock out a gene or a gene regulatory region. The subject has a disease.


BCL11A Enhancer. The genome editing complex with a cleavage domain partially or completely knocks out the BCL11A erythroid enhancer by editing the DNA of a target cell. The target cell is a hematopoietic stem cell (HPSC) and the subject in need thereof has β-thalassemia or sickle cell disease.


CCR5. The genome editing complex with a cleavage domain partially or completely knocks out the CCR5 gene by editing the DNA of a target cell, thereby allowing for introduction of a mutated version of CCR5. Target cells, in which mutated versions of CCR5 are introduced via the action of the genome editing complex, are not infected by HIV via the modified CCR5 receptor. The target cell is a T cell or a hematopoietic stem cell (HPSC) and the subject has HIV.


Upon administration of the genome editing complex directly to a subject or upon administration of an engineered cell with DNA that has been modified with the genome editing complex, the disease symptoms are eliminated or reduced.


Example 4
TALE Protein with N-Terminus Fragment

A DNA binding protein engineered to have a shortened N-terminus derived from a TALE protein was generated. U.S. Pat. No. 8,586,526 shows that while the N-terminus region (referred to as N-cap) from a TALE protein can be shortened by deleting amino acids at the N-terminus, deleting amino acids beyond amino acid position N+134 decreased DNA binding affinity, with the decrease in DNA binding apparent even with deletion of amino acids beyond amino acid position N+137. U.S. Pat. No. 8,586,526 concluded that the amino acid sequence from N+1 through N+137 is required for binding to DNA while the first 152 amino acids of the N-cap sequence are dispensable.


However, it has been discovered that further deleting amino acids till position N+116 surprisingly leads to recovery of DNA binding. Even shorter N-terminus regions, such as a fragment having a deletion till position N+111, also retain DNA binding activity. Deleting amino acids till position N+106 significantly decreases DNA binding. Further deletion of the N-terminus region, such as deleting amino acids till position N+101, does not lead to recovery of DNA binding. See FIG. 2.


TALEN monomers recognizing 5′-TTTCTGTCACCAATCCT-3′ and 5′-TCCCCTCCACCCCACAGT-3′ in the human AAVS1 locus were engineered to harbor N-terminus regions that included deletions encompassing residues N137-116, N137-111, N137-106 and N137-101. While these residues are numbered with reference to the N+137 construct in U.S. Pat. No. 8,586,526, N137-116 refers to deletion of amino acids starting at the N-terminus of the N-cap sequence (N+288) and extending through amino acid residue 116 such that the resulting fragment retains amino acid residues from position N+115 to position N+1, and so on. The amino acid sequence of the N-terminal truncation del_N137-116 is set forth in SEQ ID NO:321. The amino acid sequence of the N-terminal truncation del_N137-111 is set forth in SEQ ID NO:447.
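The deletion naming convention described above can be expressed as a small sketch. The function name and the parsing of the construct label are hypothetical. The mapping itself follows the convention stated in the text, in which a deletion through residue X leaves a fragment running from position N+(X-1) to position N+1.

```python
import re


def retained_n_cap_range(construct):
    """Map a label such as 'del_N137-116' to the retained N-cap residues.

    Returns (first, last) as N+ positions, e.g. (115, 1) means the fragment
    retains residues N+115 through N+1. The 'del_' prefix is optional.
    """
    match = re.fullmatch(r"(?:del_)?N137-(\d+)", construct)
    if match is None:
        raise ValueError(f"unrecognized construct label: {construct!r}")
    last_deleted = int(match.group(1))
    if not 1 < last_deleted <= 137:
        raise ValueError(f"deletion endpoint out of range: {last_deleted}")
    # Everything from the N-terminus of the N-cap through 'last_deleted' is
    # removed, so the fragment starts one residue closer to the repeats.
    return (last_deleted - 1, 1)


for label in ("N137-116", "N137-111", "N137-106", "N137-101"):
    first, last = retained_n_cap_range(label)
    print(f"{label}: retains N+{first} to N+{last}")
```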


NK562 cells were transfected with 2 μg plasmid DNA for each TALEN monomer using an AMAXA™ Nucleofector™ 96-well Shuttle™ system as per the manufacturer's recommendations. Full length TALEN monomers were included (“AAVS1 control”), together with N137-116/full length and full length/N137-116 heterodimers. Cells were cold shocked at 30° C. and genomic DNA was harvested at 72 h using QuickExtract™ (Lucigen). Indel rates were determined by amplicon sequencing. The TALE repeats present in the TALE monomers have the sequence LTPDQVVAIAS(RVD)GGKQALETVQRLLPVLCQDHG, with an RVD selected based on the target sequence.
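As an illustration of how an RVD is placed into the repeat scaffold given above, the following sketch builds one repeat per target base. The base-to-RVD table is an assumption: it uses the widely reported TALE code (NI for A, HD for C, NN for G, NG for T), which this example does not state was the code actually used. The sketch also simplifies by emitting a full-length repeat for every base passed in, and it leaves to the caller the choice of whether the 5′ base of a target is covered by a repeat.

```python
# Repeat scaffold from the text: LTPDQVVAIAS(RVD)GGKQALETVQRLLPVLCQDHG
REPEAT_PREFIX = "LTPDQVVAIAS"
REPEAT_SUFFIX = "GGKQALETVQRLLPVLCQDHG"

# Assumption: commonly reported TALE RVD code, not taken from this example.
RVD_FOR_BASE = {"A": "NI", "C": "HD", "G": "NN", "T": "NG"}


def repeats_for_bases(bases):
    """Return one 34-residue repeat per base, in 5'->3' order of the target."""
    repeats = []
    for base in bases.upper():
        if base not in RVD_FOR_BASE:
            raise ValueError(f"unsupported base: {base!r}")
        repeats.append(REPEAT_PREFIX + RVD_FOR_BASE[base] + REPEAT_SUFFIX)
    return repeats


# One of the AAVS1 target sequences named in this example.
target = "TTTCTGTCACCAATCCT"
array = repeats_for_bases(target)
print(len(array), "repeats;", "RVDs:", "-".join(r[11:13] for r in array))
```

Positions 12 and 13 of each repeat (string indices 11 and 12) hold the RVD, which matches the placement of the binding region recited in the claims.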



FIG. 2 represents DNA binding activity assayed by measuring nuclease activity of FokI fused to the C-terminus of the polypeptides. The AAVS1 control data set corresponds to TALENs using the standard full-length N-terminus (N+288 to N+1). N-terminal truncation del_N137-116 (N-terminus extending from N+115 to N+1) showed higher activity than the standard full-length N-terminus (N+288 to N+1). N-terminal truncation del_N137-111 (N-terminus extending from N+110 to N+1) was also active. Further truncation del_N137-106 (N-terminus extending from N+105 to N+1) significantly decreased DNA binding. Further deletion of the N-terminus region del_N137-101 (N-terminus extending from N+100 to N+1) did not lead to recovery of DNA binding. Thus, a fragment of the N-terminus of a TALE protein extending from N+115 to N+1 shows full activity. Mock/GFP is a negative control. The AAVS1/del_N137-116 data shows that an N1-115 TALEN monomer can be combined with a monomer comprising the full-length N-terminus region of a TALE protein.


While preferred embodiments of the present invention have been shown and described herein, it will be apparent to those skilled in the art that such embodiments are provided by way of example only. Numerous variations, changes, and substitutions will now occur to those skilled in the art without departing from the invention. It should be understood that various alternatives to the embodiments of the invention described herein may be employed in practicing the invention. It is intended that the following claims define the scope of the invention and that methods and structures within the scope of these claims and their equivalents be covered thereby.

Claims
  • 1. A polypeptide comprising: a modular nucleic acid binding domain comprising a potency for a target site greater than 65% and a specificity ratio for the target site of at least 50:1; and a functional domain, wherein the functional domain is heterologous to the nucleic acid binding domain;
  • 2. The polypeptide of claim 2, wherein the at least one repeat unit comprises a sequence of A1-11X1X2B14-35, wherein: each amino acid residue of A1-11 comprises any amino acid residue; X1X2 comprises the binding region; each amino acid residue of B14-35 comprises any amino acid; and a first repeat unit of the plurality of repeat units comprises at least one residue in A1-11, B14-35, or a combination thereof that differs from a corresponding residue in a second repeat unit of the plurality of repeat units.
  • 3. A polypeptide comprising a modular nucleic acid binding domain and a functional domain, wherein: the modular nucleic acid binding domain comprises a plurality of repeat units; at least one repeat unit of the plurality comprises a sequence of A1-11X1X2B14-35; each amino acid residue of A1-11 comprises any amino acid residue; X1X2 comprises a binding region configured to bind to a target nucleic acid base within a target site; each amino acid residue of B14-35 comprises any amino acid; and a first repeat unit of the plurality of repeat units comprises at least one residue in A1-11, B14-35, or a combination thereof that differs from a corresponding residue in a second repeat unit of the plurality of repeat units.
  • 4. The polypeptide of any one of claims 1-3, wherein the binding region comprises an amino acid residue at position 13 or an amino acid residue at position 12 and the amino acid residue at position 13.
  • 5. The polypeptide of claim 4, wherein the amino acid residue at position 13 binds to the target nucleic acid base.
  • 6. The polypeptide of any one of claims 4-5, wherein the amino acid residue at position 12 stabilizes the configuration of the binding region.
  • 7. The polypeptide of any one of claims 3-6, wherein the modular nucleic acid binding domain further comprises a potency for the target site greater than 65% and a specificity ratio for the target site of at least 50:1, wherein the potency comprises indel percentage at the target site and the specificity ratio comprises indel percentage at the target site over indel percentage at a top-ranked off-target site of the polypeptide.
  • 8. The polypeptide of any one of claims 1, 2, or 4-7, wherein the indel percentage is measured by deep sequencing.
  • 9. The polypeptide of any one of claims 1-8, wherein the modular nucleic acid binding domain further comprises one or more properties selected from the following: (b) binds the target site, wherein the target site comprises a 5′ guanine; (c) comprises from 7 repeat units to 25 repeat units; (d) upon binding to the target site, the modular nucleic acid binding domain is separated from a second modular nucleic acid binding domain bound to a second target site by from 2 to 50 base pairs.
  • 10. The polypeptide of any one of claims 1-9, wherein the modular nucleic acid binding domain comprises a Ralstonia repeat unit.
  • 11. The polypeptide of claim 10, wherein the Ralstonia repeat unit is a Ralstonia solanacearum repeat unit.
  • 12. The polypeptide of any one of claims 2-11, wherein the B14-35 of at least one repeat unit of the plurality of repeat units has at least 92% sequence identity to GGKQALEAVRAQLLDLRAAPYG (SEQ ID NO: 280).
  • 13. The polypeptide of any one of claims 1-12, wherein the binding region comprises HD binding to cytosine, NG binding to thymidine, NK binding to guanine, SI binding to adenosine, RS binding to adenosine, HN binding to guanine, or NT binding to adenosine.
  • 14. The polypeptide of any one of claims 1-13, wherein the at least one repeat unit comprises any one of SEQ ID NO: 267-SEQ ID NO: 279.
  • 15. The polypeptide of any one of claims 1-14, wherein the at least one repeat unit comprises at least 80% sequence identity with any one of SEQ ID NO: 168-SEQ ID NO: 263.
  • 16. The polypeptide of any one of claims 1-15, wherein the at least one repeat unit comprises at least 80% sequence identity with SEQ ID NO: 209, SEQ ID NO: 197, SEQ ID NO: 233, SEQ ID NO: 253, SEQ ID NO: 203, or SEQ ID NO: 218.
  • 17. The polypeptide of any one of claims 1-16, wherein the at least one repeat unit comprises any one of SEQ ID NO: 168-SEQ ID NO: 263.
  • 18. The polypeptide of any one of claims 1-17, wherein the at least one repeat unit comprises SEQ ID NO: 209, SEQ ID NO: 197, SEQ ID NO: 233, SEQ ID NO: 253, SEQ ID NO: 203, or SEQ ID NO: 218.
  • 19. The polypeptide of any one of claims 1-18, wherein the target nucleic acid base is cytosine, guanine, thymidine, adenosine, uracil or a combination thereof.
  • 20. The polypeptide of any one of claims 1-19, wherein the target site is a nucleic acid sequence within a PDCD1 gene, a CTLA4 gene, a LAG3 gene, a TET2 gene, a BTLA gene, a HAVCR2 gene, a CCR5 gene, a CXCR4 gene, a TRA gene, a TRB gene, a B2M gene, an albumin gene, a HBB gene, a HBA1 gene, a TTR gene, a NR3C1 gene, a CD52 gene, an erythroid specific enhancer of the BCL11A gene, a CBLB gene, a TGFBR1 gene, a SERPINA1 gene, a HBV genomic DNA in infected cells, a CEP290 gene, a DMD gene, a CFTR gene, an IL2RG gene, or a combination thereof.
  • 21. The polypeptide of any one of claims 1-20, wherein a nucleic acid sequence encoding a chimeric antigen receptor (CAR), alpha-L iduronidase (IDUA), iduronate-2-sulfatase (IDS), or Factor 9 (F9), is inserted at the target site.
  • 22. The polypeptide of any one of claims 1-21, wherein the modular nucleic acid binding domain comprises an N-terminus amino acid sequence, a C-terminus amino acid sequence, or a combination thereof.
  • 23. The polypeptide of claim 22, wherein the N-terminus amino acid sequence is from Xanthomonas spp., Legionella quateirensis, or Ralstonia solanacearum.
  • 24. The polypeptide of any one of claims 22-23, wherein the N-terminus amino acid sequence comprises at least 80% sequence identity to SEQ ID NO: 264, SEQ ID NO: 300, SEQ ID NO: 335, SEQ ID NO: 303, SEQ ID NO: 301, SEQ ID NO: 304, SEQ ID NO: 320, SEQ ID NO: 321, or SEQ ID NO: 322.
  • 25. The polypeptide of any one of claims 22-24, wherein the N-terminus amino acid sequence comprises SEQ ID NO: 264, SEQ ID NO: 300, SEQ ID NO: 335, SEQ ID NO: 303, SEQ ID NO: 301, SEQ ID NO: 304, SEQ ID NO: 320, SEQ ID NO: 321, or SEQ ID NO: 322.
  • 26. The polypeptide of any one of claims 22-25, wherein the C-terminus amino acid sequence is from Xanthomonas spp., Legionella quateirensis, or Ralstonia solanacearum.
  • 27. The polypeptide of any one of claims 22-25, wherein the C-terminus amino acid sequence comprises at least 80% sequence identity to SEQ ID NO: 266, SEQ ID NO: 298, or SEQ ID NO: 306.
  • 28. The polypeptide of any one of claims 22-26, wherein the C-terminus amino acid sequence comprises SEQ ID NO: 266, SEQ ID NO: 298, or SEQ ID NO: 306.
  • 29. The polypeptide of any one of claims 22-28, wherein the C-terminus amino acid sequence serves as a linker between the modular nucleic acid binding domain and the cleavage domain.
  • 30. The polypeptide of any one of claims 1-29, wherein the modular nucleic acid binding domain comprises a half repeat.
  • 31. The polypeptide of claim 30, wherein the half repeat comprises at least 80% sequence identity to SEQ ID NO: 265, SEQ ID NO: 327-SEQ ID NO: 329, or SEQ ID NO: 290.
  • 32. The polypeptide of any one of claims 30-31, wherein the half repeat comprises SEQ ID NO: 265, SEQ ID NO: 327-SEQ ID NO: 329, or SEQ ID NO: 290.
  • 33. The polypeptide of any one of claims 1-32, wherein the functional domain is a cleavage domain or a repression domain.
  • 34. The polypeptide of claim 33, wherein the cleavage domain comprises at least 33.3% divergence from SEQ ID NO: 163 and is immunologically orthogonal to SEQ ID NO: 163.
  • 35. The polypeptide of any one of claims 33-34, comprising one or more of the following characteristics: (a) induces greater than 1% indels at a target site; (b) the cleavage domain comprises a molecular weight of less than 23 kDa; (c) the cleavage domain comprises less than 196 amino acids; (d) capable of cleaving across a spacer region greater than 24 base pairs.
  • 36. The polypeptide of any one of claims 33-35, wherein the polypeptide induces greater than 5%, greater than 10%, greater than 20%, greater than 30%, greater than 40%, greater than 50%, greater than 60%, greater than 70%, greater than 80%, or greater than 90% indels at the target site.
  • 37. The polypeptide of any one of claims 33-36, wherein the cleavage domain comprises at least 35%, at least 40%, at least 45%, at least 50%, at least 55%, at least 60%, at least 65%, at least 70%, or at least 75% divergence from SEQ ID NO: 163.
  • 38. The polypeptide of any one of claims 33-37, wherein the cleavage domain comprises a sequence selected from SEQ ID NO: 316-SEQ ID NO: 319.
  • 39. The polypeptide of any one of claims 33-38, wherein the cleavage domain comprises a nucleic acid sequence encoding for a sequence having at least 80% sequence identity with SEQ ID NO: 1-SEQ ID NO: 81.
  • 40. The polypeptide of any one of claims 33-38, wherein the cleavage domain comprises a nucleic acid sequence encoding for a sequence selected from SEQ ID NO: 1-SEQ ID NO: 81.
  • 41. The polypeptide of any one of claims 33-40, wherein the nucleic acid sequence comprises at least 80% sequence identity with SEQ ID NO: 82-SEQ ID NO: 162.
  • 42. The polypeptide of any one of claims 33-41, wherein the nucleotide sequence encoding for the sequence comprises any one of SEQ ID NO: 82-SEQ ID NO: 162.
  • 43. The polypeptide of claim 33, wherein the repression domain comprises KRAB, Sin3a, LSD1, SUV39H1, G9A (EHMT2), DNMT1, DNMT3A-DNMT3L, DNMT3B, KOX, TGF-beta-inducible early gene (TIEG), v-erbA, SID, MBD2, MBD3, Rb, or MeCP2.
  • 44. The polypeptide of any one of claims 1-43, wherein the at least one repeat unit comprises 1-20 additional amino acid residues at the C-terminus.
  • 45. The polypeptide of any one of claims 1-44, wherein the at least one repeat unit of the plurality of repeat units is separated from a neighboring repeat unit by a linker.
  • 46. The polypeptide of claim 45, wherein the linker comprises a recognition site.
  • 47. The polypeptide of claim 46, wherein the recognition site is for a small molecule, a protease, or a kinase.
  • 48. The polypeptide of claim 47, wherein the recognition site serves as a localization signal.
  • 49. The polypeptide of any one of claims 1-48, wherein the plurality of repeat units comprises 3 to 60 repeat units.
  • 50. The polypeptide of any one of claims 1-49, wherein a repeat unit of the plurality of repeat units recognizes a target nucleic acid base and wherein the plurality of repeat units has one or more of the following characteristics: (a) at least one repeat unit comprising greater than 39 amino acid residues; (b) at least one repeat unit comprising greater than 35 amino acid residues derived from the genus of Ralstonia; (c) at least one repeat unit comprising less than 32 amino acid residues; and (d) each repeat unit of the plurality of repeat units is separated from a neighboring repeat unit by a linker comprising a recognition site.
  • 51. The polypeptide of claim 50, wherein the at least one repeat unit comprises an amino acid selected from glycine, alanine, threonine or histidine at a position after an amino acid residue at position 35.
  • 52. The polypeptide of any one of claims 50-51, wherein the at least one repeat unit comprises an amino acid selected from glycine, alanine, threonine or histidine at a position after an amino acid residue at position 39.
  • 53. A method of genome editing, the method comprising: administering the polypeptide of any one of claims 1-42 or 44-52 and inducing a double stranded break.
  • 54. A method of gene repression, the method comprising administering the polypeptide of any one of claims 1-33 or 43-52 and repressing gene expression.
  • 55. A non-naturally occurring DNA-binding polypeptide comprising from N-terminus to C-terminus: an N-terminus region comprising at least residues N+110 to N+1 of a TALE protein, wherein the N-terminus region does not include residues N+288 to N+116 of the TALE protein; a plurality of TALE-repeat units, the TALE repeat units comprising a repeat variable di-residue (RVD); and a C-terminus region of a TALE protein.
  • 56. The DNA binding polypeptide of claim 55, wherein the N-terminus region comprises residues N+1 up to N+115 of the TALE protein.
  • 57. The DNA binding polypeptide of claim 55, wherein the N-terminus region comprises residues N+1 up to N+110 of the TALE protein.
  • 58. The DNA binding polypeptide of any one of claims 55-57, wherein the C-terminus region comprises residues C+1 to C+63 of the TALE protein.
  • 59. The DNA binding polypeptide of any one of claims 55-58, wherein the N-terminus region consists of residues N+1 to N+115 of the TALE protein.
  • 60. The DNA binding polypeptide of any one of claims 55-59, wherein the TALE repeat units are ordered from N-terminus to C-terminus to specifically bind to a target nucleic acid in genomic DNA.
  • 61. The DNA binding polypeptide of any one of claims 55-60, wherein a heterologous functional domain is conjugated to the N-terminus and/or C-terminus.
  • 62. The DNA binding polypeptide of claim 61, wherein the functional domain comprises an enzyme, a transcriptional activator, a transcriptional repressor, or a DNA nucleotide modifier.
  • 63. The DNA binding polypeptide of claim 62, wherein the enzyme is a nuclease, a DNA modifying protein, or a chromatin modifying protein.
  • 64. The DNA binding polypeptide of claim 63, wherein the nuclease is a cleavage domain or a half-cleavage domain.
  • 65. The DNA binding polypeptide of claim 64, wherein the cleavage domain or half-cleavage domain comprises a type IIS restriction enzyme.
  • 66. The DNA binding polypeptide of claim 65, wherein the type IIS restriction enzyme comprises FokI or BfiI.
  • 67. The DNA binding polypeptide of claim 63, wherein the chromatin modifying protein is lysine-specific histone demethylase 1 (LSD1).
  • 68. The DNA binding polypeptide of claim 62, wherein the transcriptional activator comprises VP16, VP64, p65, p300 catalytic domain, TET1 catalytic domain, TDG, Ldb1 self-associated domain, SAM activator (VP64, p65, HSF1), or VPR (VP64, p65, Rta).
  • 69. The DNA binding polypeptide of claim 62, wherein the transcriptional repressor comprises KRAB, Sin3a, LSD1, SUV39H1, G9A (EHMT2), DNMT1, DNMT3A-DNMT3L, DNMT3B, KOX, TGF-beta-inducible early gene (TIEG), v-erbA, SID, MBD2, MBD3, Rb, or MeCP2.
  • 70. The DNA binding polypeptide of claim 62, wherein the DNA nucleotide modifier is adenosine deaminase.
  • 71. The DNA binding polypeptide of any of claims 60-70, wherein the target nucleic acid is within a PDCD1 gene, a CTLA4 gene, a LAG3 gene, a TET2 gene, a BTLA gene, a HAVCR2 gene, a CCR5 gene, a CXCR4 gene, a TRA gene, a TRB gene, a B2M gene, an albumin gene, a HBB gene, a HBA1 gene, a TTR gene, a NR3C1 gene, a CD52 gene, an erythroid specific enhancer of the BCL11A gene, a CBLB gene, a TGFBR1 gene, a SERPINA1 gene, a HBV genomic DNA in infected cells, a CEP290 gene, a DMD gene, a CFTR gene, or an IL2RG gene.
  • 72. The DNA binding polypeptide of any of claims 60-71, wherein the heterologous functional domain comprises a fluorophore or a detectable tag.
  • 73. A nucleic acid encoding the polypeptide of any of claims 55-73.
  • 74. The nucleic acid of claim 73, wherein the nucleic acid is operably linked to a promoter sequence that confers expression of the polypeptide.
  • 75. The nucleic acid of claim 73 or 74, wherein the sequence of the nucleic acid is codon optimized for expression of the polypeptide in a human cell.
  • 76. The nucleic acid of any one of claims 73-75, wherein the nucleic acid is a deoxyribonucleic acid (DNA).
  • 77. The nucleic acid of any one of claims 73-75, wherein the nucleic acid is a ribonucleic acid (RNA).
  • 78. A vector comprising the nucleic acid of any of claims 73-76.
  • 79. The vector of claim 78, wherein the vector is a viral vector.
  • 80. A host cell comprising the nucleic acid of any of claims 73-77 or the vector of claim 78 or 79.
  • 81. A host cell that expresses the polypeptide of any of claims 55-72.
  • 82. A pharmaceutical polypeptide comprising the polypeptide of any of claims 55-72 and a pharmaceutically acceptable excipient.
  • 83. A pharmaceutical polypeptide comprising the nucleic acid of any of claims 73-77 or the vector of claim 78 or 79 and a pharmaceutically acceptable excipient.
  • 84. A method of modulating expression of an endogenous gene in a cell, the method comprising: introducing into the cell the polypeptide of claim 61, wherein the DNA binding polypeptide binds to a target nucleic acid sequence present in the endogenous gene and the heterologous functional domain modulates expression of the endogenous gene.
  • 85. The method of claim 84, wherein the polypeptide is introduced as a nucleic acid encoding the polypeptide.
  • 86. The method of claim 85, wherein the nucleic acid is a deoxyribonucleic acid (DNA).
  • 87. The method of claim 85, wherein the nucleic acid is a ribonucleic acid (RNA).
  • 88. The method of any of claims 85-87, wherein the sequence of the nucleic acid is codon optimized for expression in a human cell.
  • 89. The method of any of claims 84-88, wherein the functional domain is a transcriptional activator and the target nucleic acid sequence is present in an expression control region of the gene, wherein the polypeptide increases expression of the gene.
  • 90. The method of claim 89, wherein the transcriptional activator comprises VP16, VP64, p65, p300 catalytic domain, TET1 catalytic domain, TDG, Ldb1 self-associated domain, SAM activator (VP64, p65, HSF1), or VPR (VP64, p65, Rta).
  • 91. The method of any of claims 84-88, wherein the functional domain is a transcriptional repressor and the target nucleic acid sequence is present in an expression control region of the gene, wherein the polypeptide decreases expression of the gene.
  • 92. The method of claim 91, wherein the transcriptional repressor comprises KRAB, Sin3a, LSD1, SUV39H1, G9A (EHMT2), DNMT1, DNMT3A-DNMT3L, DNMT3B, KOX, TGF-beta-inducible early gene (TIEG), v-erbA, SID, MBD2, MBD3, Rb, or MeCP2.
  • 93. The method of any of claims 84-92, wherein the gene is a PDCD1 gene, a CTLA4 gene, a LAG3 gene, a TET2 gene, a BTLA gene, a HAVCR2 gene, a CCR5 gene, a CXCR4 gene, a TRA gene, a TRB gene, a B2M gene, an albumin gene, a HBB gene, a HBA1 gene, a TTR gene, a NR3C1 gene, a CD52 gene, an erythroid specific enhancer of the BCL11A gene, a CBLB gene, a TGFBR1 gene, a SERPINA1 gene, a HBV genomic DNA in infected cells, a CEP290 gene, a DMD gene, a CFTR gene, or an IL2RG gene.
  • 94. The method of any of claims 91-93, wherein the expression control region of the gene comprises a promoter region of the gene.
  • 95. The method of any of claims 84-88, wherein the functional domain is a nuclease comprising a cleavage domain or a half-cleavage domain and the endogenous gene is inactivated by cleavage.
  • 96. The method of claim 95, wherein inactivation occurs via non-homologous end joining (NHEJ).
  • 97. The method of claim 95 or 96, wherein the DNA binding polypeptide is a first polypeptide that binds to a first target nucleic acid sequence in the gene and comprises a half-cleavage domain and the method comprises introducing a second DNA binding polypeptide that binds to a second target nucleic acid sequence in the gene and comprises a half-cleavage domain.
  • 98. The method of claim 97, wherein the first target nucleic acid sequence and the second target sequence are spaced apart in the gene and the two half-cleavage domains mediate a cleavage of the gene sequence at a location in between the first and second target nucleic acid sequences, wherein the second DNA binding polypeptide comprises from N-terminus to C-terminus: an N-terminus region comprising residues N+1 to up to N+115 of a TALE protein or a full-length N-terminus region of a TALE protein; a plurality of TALE-repeat units, the TALE repeat units comprising a repeat variable di-residue (RVD); and a C-terminus region of a TALE protein.
  • 99. The method of claim 98, wherein the C-terminus region is a full length region of the TALE protein.
  • 100. The method of claim 98, wherein the C-terminus region is a fragment of the C-terminus region of the TALE protein.
  • 101. The method of claim 98, wherein the C-terminus region extends from C+1 to C+63 of the TALE protein.
  • 102. The method of any of claims 95-101, wherein the cleavage domain or the cleavage half domain comprises FokI or BfiI.
  • 103. The method of any of claims 95-101, wherein the cleavage domain comprises a meganuclease.
  • 104. The method of any of claims 95-103, wherein the gene is a PDCD1 gene, a CTLA4 gene, a LAG3 gene, a TET2 gene, a BTLA gene, a HAVCR2 gene, a CCR5 gene, a CXCR4 gene, a TRA gene, a TRB gene, a B2M gene, an albumin gene, a HBB gene, a HBA1 gene, a TTR gene, a NR3C1 gene, a CD52 gene, an erythroid specific enhancer of the BCL11A gene, a CBLB gene, a TGFBR1 gene, a SERPINA1 gene, a HBV genomic DNA in infected cells, a CEP290 gene, a DMD gene, a CFTR gene, or an IL2RG gene.
CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims priority pursuant to 35 U.S.C. § 119(e) to U.S. Provisional Application No. 62/690,888, filed Jun. 27, 2018, U.S. Provisional Application No. 62/694,239, filed Jul. 5, 2018, U.S. Provisional Application No. 62/716,147, filed Aug. 8, 2018 and U.S. Provisional Application No. 62/852,134, filed May 23, 2019, the disclosures of which are incorporated herein by reference in their entirety.

PCT Information
Filing Document Filing Date Country Kind
PCT/US2019/039318 6/26/2019 WO
Provisional Applications (4)
Number Date Country
62690888 Jun 2018 US
62694239 Jul 2018 US
62716147 Aug 2018 US
62852134 May 2019 US