The instant application contains a Sequence Listing which has been submitted electronically in ASCII format and is hereby incorporated by reference in its entirety. Said ASCII copy, created on Feb. 3, 2022, is named NYU_Zinc_Finger_PCT.txt, and is 77,858 bytes in size.
This disclosure generally relates to the field of modulating gene expression responses, and more specifically to modified proteins containing DNA binding domains and transcription activators or transcription repressor domains that function in a natural context.
The precise regulation of gene expression is at the foundation of most biological processes and offers enormous therapeutic potential as the mis-regulation of a gene's expression can be associated with many diseases including cancer, neurodegenerative diseases, and cardiomyopathies. For example, over 660 human genes are estimated to cause diseases due to haploinsufficiency, the effects of which could be corrected by upregulating the functional allele. Conversely, many other diseases are caused by the expression of a gene in the wrong tissue or through gain of function mutations. These diseases could be corrected by downregulating the gene in a tissue specific manner.
Transcription factors (TFs) are endogenous proteins that naturally activate or repress the expression of target genes. These factors modify gene expression by first binding a DNA sequence proximal to the target gene using a DNA-binding domain (DBD) and then recruiting other proteins, through secondary protein interactions, that either modify histones or recruit mediator and/or polymerase components that lead to transcription. These secondary interactions are dictated by other domains within the parent TF, which can be common domains such as KRAB domains that repress gene expression, or less common protein sequences that the TF has evolved. These effector domains are generically referred to as activation or repression domains. In this way, the DNA-binding specificity of the DBD of the TF determines where the protein will bind in the genome and, therefore, which genes will be regulated through the secondary interactions of the effector domains. The most common DBD used by TFs in most metazoans, including humans, is the Cys2His2 zinc finger (ZF), representing nearly 50% of human TFs.
Exceptional CRISPR and TALE-based tools have been developed in recent years to modulate gene expression for both academic and therapeutic applications, but some intrinsic characteristics could limit their therapeutic efficacy and their ability to mimic natural regulatory processes. For example, the size of these protein domains limits applications that require AAV delivery. In addition, pre-existing immune responses have been reported in human and primate models for spCas9, while the immune response to the prokaryotic TALE system is unclear but likely. Thus, immunogenicity makes long-term expression of these proteins in humans a significant therapeutic risk. In addition, these prokaryotic proteins require the addition of an activation or repression domain that will function in humans for human therapeutic purposes. This approach requires the expression of a Cas9 or TALE-effector domain fusion that will present the domain out of its natural context. In some cases, the expression of the effector domain out of its natural context can have a significant impact on efficacy. In other cases, the domains employed are not human in origin, resulting in a second point of potential immunogenicity. Finally, a TALE-based repressor screen has demonstrated that the position of binding in the genome has a sizeable influence on repression potential, with positions modified by even a few bases having a large impact on efficacy. As a result, applications that require single-base resolution will be limited by the PAM requirement of Cas9. Thus, there is an ongoing and unmet need for improved compositions and methods for precise targeting of DNA locations that modulate a gene's expression while using proteins that minimize the risk of immunogenicity. The present disclosure is pertinent to this need.
The present disclosure relates to the use of activating and repressing transcription factors (TFs), and/or the activation or repression domains from these proteins, e.g., effector domains, many of which use zinc fingers (ZFs) to recognize their DNA targets. Among other aspects, the disclosure provides examples of activators and repressors to seamlessly scaffold designed ZFs in place of the ZFs that occur naturally in these proteins.
In various embodiments, the disclosure accordingly provides modified proteins comprising an introduced ZF DNA binding domain. The introduced ZF DNA binding domain comprises one or more changes to a DNA binding domain that may have been present in the DNA binding domain (or other DBD) of the effector protein domain in an unmodified form, or may be a completely new ZF DNA binding domain. In certain examples, the introduced zinc finger binding domain comprises a substitution of an endogenous ZF domain of the protein. The modified protein thus binds to a different location, e.g., a different DNA sequence, relative to the binding location of the transcription activator or repressor protein in its unmodified form.
The DNA locations to which the modified proteins bind can be any DNA binding site that is recognized with specificity by the introduced ZF DNA binding domain. In non-limiting embodiments, the DNA binding location is on a chromosome, organelle DNA, or a plasmid. In embodiments, binding of the modified protein promotes expression of a gene that is operably linked to the DNA binding site. In an alternative embodiment, binding of the modified protein represses or otherwise inhibits expression of a gene that is operably linked to the DNA binding site.
In one representative and non-limiting embodiment, an introduced ZF DNA binding domain is present in a protein that comprises an activator domain that is a Krueppel-like factor 6 (KLF6) protein or functional segment thereof.
In another representative and non-limiting embodiment, an introduced ZF DNA binding domain is present in a protein that comprises a gene expression repressor domain that is a KRAB domain. In one non-limiting example, the KRAB domain is comprised by a Zim3 protein or functional segment thereof.
The disclosure includes modifying the described protein by introducing a plurality of ZF domains. In embodiments, the introduced ZF domains bind with specificity to the same DNA sequence. In alternative embodiments, introduced ZF domains bind to different DNA sequences.
The disclosure includes expression vectors encoding the described modified proteins, as well as cDNAs and RNA, including mRNA, encoding the described modified proteins.
The disclosure also includes pharmaceutical compositions comprising one or more of the modified proteins; one or more mRNAs encoding one or more of the modified proteins; and one or more expression vectors encoding one or more of said modified proteins. The disclosure includes administering the described proteins, expression vectors encoding them, and pharmaceutical formulations, to an individual in need thereof. In embodiments, the modified proteins promote expression of a therapeutic gene, and/or a gene that has a prophylactic effect against any disease, condition, or disorder. In alternative embodiments, the modified proteins inhibit expression of a gene, wherein inhibition of the expression of the gene provides a therapeutic or prophylactic effect against any disease, condition, or disorder. In embodiments, administration of the described protein to an individual does not stimulate an immune response, or does not stimulate a deleterious immune response, directed toward the modified protein.
The disclosure also includes a method of making any of the described, modified proteins, by expressing the proteins recombinantly, and optionally isolating the modified proteins from an expression system. The disclosure thus also comprises cells which are programmed to express any one or combination of the described modified proteins.
Unless defined otherwise herein, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure pertains.
Unless specified to the contrary, it is intended that every maximum numerical limitation given throughout this description includes every lower numerical limitation, as if such lower numerical limitations were expressly written herein. Every minimum numerical limitation given throughout this specification will include every higher numerical limitation, as if such higher numerical limitations were expressly written herein. Every numerical range given throughout this specification will include every narrower numerical range that falls within such broader numerical range, as if such narrower numerical ranges were all expressly written herein.
All proteins described herein include proteins that have from 80.0-99.9% identity across their entire lengths to such proteins. The amino acid or polynucleotide sequence, as the case may be, associated with each GenBank accession number of this disclosure is incorporated herein by reference as presented in the database on the effective filing date of this application or patent. All combinations of specific proteins, and all combinations of types of proteins, are included in the disclosure. Any protein described herein may comprise or consist of the described protein. In embodiments, a described protein may be linked to or a component of another protein, non-limiting examples of which include proteins having nuclease activity, said nucleases including CRISPR nucleases, recombinases, nickases, and transposases.
The present disclosure relates to use of effective activating and repressing TFs, and the activation or repression domains from these proteins, including human proteins, many of which use ZFs to recognize their DNA targets. Among other aspects, the disclosure provides examples of activators and repressors to seamlessly scaffold designed ZFs in place of the ZFs that occur naturally in these proteins. By doing so, the disclosure provides for directing the modified protein to any desired DNA sequence in the genome in order to modify proximal gene expression. In this way, the DNA-binding specificity of the TF is effectively reprogrammed to bind alternative sequences in the genome without altering other functions or compositions of the parent protein. Non-limiting embodiments of the disclosure include seamless scaffolds for the KRAB-containing Zim3 protein (repression) and the activating KLF6 protein, and additional examples are described below. These results demonstrate the approach and efficacy of seamless reprogramming that can be applied to any natural ZF or other DBD-expressing TF protein. In an embodiment, the disclosure replaces a DBD that is not a zinc finger with a zinc finger domain. A representative example is provided in
In embodiments, the disclosure includes the following embodiments, including all embodiments individually and all combinations thereof.
In an embodiment, the disclosure provides a modified protein comprising an introduced ZF DBD, the modified protein having a changed DNA binding specificity relative to a DNA binding specificity of the protein in its unmodified form. In general, the modified protein comprises, in addition to the introduced zinc finger DNA binding domain, a gene expression activator domain or a gene expression repressor domain. An “introduced” zinc finger domain means one or more amino acid changes in an endogenous ZF domain of a protein that changes the DNA binding location of the protein. Thus, an introduced ZF domain comprises a ZF domain that was not present in the protein, prior to modification of the protein as described herein. The introduced ZF domain may include more than one ZF domain. In general, the introduced ZF domain does not change the natural function of the parent protein, e.g., if an activator of transcription includes an introduced ZF domain, the activator of transcription function of the protein is retained, but transcription of a different gene may be promoted. The same rationale applies to a repressor. The activators and repressors may bind to any location that is operably linked to any gene. “Operably linked” means binding of the protein is correlated with a change in gene expression, e.g., activation or repression. Thus, the proteins can bind to elements that are proximal to a gene (e.g., a promoter), or elements that are distal from a gene (e.g., an enhancer). Binding to other elements that influence expression of a gene to which the binding site is operably linked is included in the disclosure. In embodiments, the activator or repressor is a transcription factor. Thus, in one embodiment, the activator promotes transcription of mRNA which is in turn translated into a protein. In one embodiment, the repressor inhibits transcription of mRNA.
The modified protein of the disclosure may bind to a changed DNA binding location on a chromosome, organelle DNA, or a plasmid. In an embodiment, the DNA binding location is present in the genome of a DNA virus.
In embodiments, any ZF domain that is introduced into a protein as described herein may have the same DBD sequence as any of ZNF324, ZNF264, ZNF10, FoxR2, KLF7, or ZXDC. In embodiments, the ZF domain is a novel sequence.
In embodiments, the gene expression activator domain promotes expression of a gene that is operably linked to the changed DNA binding location relative to the DNA binding location of the unmodified effector protein to thereby provide therapeutic expression of the gene. In a non-limiting embodiment, the gene expression activator domain comprises a Krueppel-like factor 6 (KLF6) protein or functional segment thereof. A “functional segment” means a segment of the protein that is sufficient to promote its activation or repression. In an embodiment, the gene expression repressor domain inhibits expression of a gene that is operably linked to the changed DNA binding location to thereby provide therapeutic inhibition of expression of the gene. In a non-limiting embodiment, the gene expression repressor domain comprises a KRAB domain, wherein the KRAB domain is optionally comprised by a Zim3 protein or functional segment thereof. In a non-limiting embodiment, a modified protein of the disclosure comprises a substitution of an endogenous zinc finger domain of the protein. In embodiments, the introduced zinc finger domain is one of a plurality of zinc finger domains that are introduced into the modified protein to thereby provide a modified protein comprising a plurality of introduced zinc finger domains, and wherein the plurality of introduced zinc finger domains optionally comprise the same changed DNA binding domain. The disclosure also includes cDNAs and mRNAs encoding any modified protein described herein.
In embodiments, the modified protein is encoded by an expression vector, such as an expression vector used to make the modified protein, and/or an expression vector that can be used to deliver the coding sequence to cells so that the cells express the modified protein, which may be for a therapeutic purpose. In non-limiting embodiments, the expression vector may comprise a suitable viral vector, non-limiting embodiments of which include modified viral polynucleotide from an adenovirus, a herpesvirus, or a retrovirus, such as a lentiviral vector. Polynucleotides can be used directly, or they may be introduced into cells using any of a variety of polynucleotide insertion reagents, such as transfection agents. In non-limiting embodiments, a recombinant adeno-associated virus (rAAV) vector may be used. In certain embodiments, the expression vector is a self-complementary adeno-associated virus (scAAV). In embodiments, a composition of this disclosure comprises mRNA encoding one or more of the described modified proteins.
In embodiments, a therapeutically effective amount of a described protein is administered to an individual in need thereof. Administration of the protein includes administration by way of polynucleotides that encode the protein. The term “therapeutically effective amount” as used herein refers to an amount of a described protein to achieve, in a single or multiple doses, the intended purpose of treatment. The amount desired or required will vary depending on the particular protein, its mode of administration, patient specifics and the like. Appropriate effective amounts can be determined by one of ordinary skill in the art informed by the instant disclosure using routine experimentation.
The disclosure also provides a pharmaceutical composition comprising one or more of the described modified proteins, one or more mRNAs encoding one or more of said modified proteins, or one or more expression vectors encoding one or more of said modified proteins. Pharmaceutical compositions generally comprise one or more pharmaceutically acceptable buffers, excipients, and the like.
The disclosure also provides administering one or more described proteins to an individual in need thereof. The protein, or a pharmaceutical composition comprising the modified protein, can be administered to the individual using any suitable delivery method.
In embodiments, an individual in need of a described protein is in need of activation or repression of one or more genes. In embodiments, the one or more genes are due to or correlated with a haploinsufficiency.
In embodiments, the described modified proteins do not stimulate an adverse immune response in an individual to which they are introduced. An adverse immune response includes but is not limited to innate immune responses, humoral immune responses, and cell-mediated immune responses, wherein said immune responses are deleterious to the individual. In one embodiment, a described protein does not elicit an increased antibody response that comprises an increase of antibodies that bind to a described protein, relative to pre-existing antibodies that may bind to the effector domain of a described protein.
While the disclosure relates in part to therapeutic approaches in humans, the described modified proteins can also be used for veterinary purposes, e.g., for non-human animals. Further, the described proteins may be suitable for use in other eukaryotic organisms, such as plants and fungi. In embodiments, the described proteins can be used for prokaryotic purposes.
The disclosure also provides a method of making a described modified protein by modifying a protein to comprise an introduced zinc finger DNA binding domain. In embodiments, the modified protein is produced by cells comprising an expression vector encoding the modified protein, from which the modified protein is separated.
In embodiments, the disclosure includes the described library generation, and analysis of the DNA binding properties of members of the library. In embodiments, one or more methods described herein can be performed by a digital processor and/or a computer running software to perform an algorithm and/or to interpret a signal. In embodiments, the processor runs software or implements an algorithm to interpret a detectable signal, and may generate a machine and/or user readable output. In embodiments, the digital processor and/or the computer participates in the ZFDesign aspect of this disclosure, as further described below. In embodiments, information obtained by a device or system used to analyze protein binding as described herein can be monitored in real-time by a computer, and/or by a human operator. In embodiments, the processor runs software or implements an algorithm to interpret an optically detectable signal, such as a signal from a detectably labeled protein. In certain embodiments, the disclosure provides as an embodiment or component of the system a non-transitory computer readable storage media for use in performing an algorithm to interpret and/or record signaling events. In embodiments, a system described herein may operate in a networked environment using logical connections to one or more remote computers. In embodiments, a result obtained using a device/system/method of this disclosure is fixed in a tangible medium of expression. The result may be communicated to, for example, a user who produces and/or tests modified proteins as described herein.
The following Examples are intended to illustrate but not limit the disclosure.
Two general approaches have been used to engineer ZFs with novel specificity (
Multiple side chains from adjacent ZFs bind DNA in close proximity to one another; this is especially true at the binding site “overlap” where position 6 of an N-terminal helix can be within hydrogen bonding distance of the position −1 and 2 side chains of its C-terminal neighbor. At this position the specificity of adjacent ZFs overlaps and in this way the N-terminal helix is presenting a specific interface environment to its C-terminal neighbor that is based on the side chain employed and the base specified (
From these screens we found global and target-specific differences between these library contexts, indicative of the strength of the constraint that each context puts upon the C-terminal ZF. The total number of selected helices ranged from 128,000 to over 1 million helices per library screened (
Since the majority of prior ZF selections have been carried out with an arginine-guanine contact presented at the overlap, the disclosure includes libraries that present adenine and cytosine contacts to enrich novel helical strategies. To measure these differences on a global scale we first calculated the mean Hamming distance between the helices enriched to bind each target across all libraries (
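The mean Hamming distance referred to above can be illustrated with a minimal sketch. The function names and the three example helices are hypothetical and are not taken from the described screens; the sketch assumes equal-length helix sequences and averages over all unordered pairs.

```python
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def mean_hamming(helices: list[str]) -> float:
    """Mean Hamming distance over all unordered pairs of helices."""
    pairs = list(combinations(helices, 2))
    if not pairs:
        raise ValueError("need at least two helices")
    return sum(hamming(a, b) for a, b in pairs) / len(pairs)


# Hypothetical helices used only to exercise the function.
example_helices = ["RSDHLTR", "RSDHLTT", "QSDHLTR"]
example_mean = mean_hamming(example_helices)
```

A larger mean indicates that the helices recovered for a target are more diverse in sequence.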
Global differences between library environments were assayed by the success of selections across targets as well as the mean Hamming distances. To investigate more specific differences, such as the types of binding strategies enabled by one library environment versus another, we compared the clusters generated by MUSI for each target site selection. For most targets we find general strategies that are common to several successful library selections. We also find specialized strategies that are recovered in a small number of selections and in some cases, only recovered with a single library environment (
Data presented in this disclosure demonstrate global and specific differences in ZF function influenced by the adjacent finger environment. While it is believed this data represents the largest screen of ZF function to date, it is still a relatively small number of the potential overlap influences. To test how greater variability at the interface might influence compatibility we created 200 two-finger libraries by assembling pools of helices selected to bind each 3 bp half-site of a 6 bp target. We selected compatible pairs of ZFs from these libraries and analyzed how many starting library environments the helices were enriched from. Most helices enriched in these compatibility assays were only recovered in a minority of the library environments (
Despite considerable effort, it is considered that all previous attempts at generating a general ZF design code have failed. Given the unprecedented depth of the described screening data, the disclosure includes a novel and unique model that explicitly addresses these neighbor influences. In particular, we separately make use of the single-finger library selections that comprehensively describe single-finger specificity in a variety of neighbor finger contexts and the pair selections that show which ZFs are compatible with each other as neighbors. This information is hierarchical and to make use of it, we developed a novel neural network architecture that implements attention modules in a hierarchical manner (
The first layer of this hierarchical architecture contains two modules that are trained on the single-finger selection data sampling a wide range of influences at the interface where adjacent finger specificity can overlap (
The overall model retains a traditional encoder-decoder architecture: an encoder generates a high-dimensional representation for each DNA base, and a decoder then generates predictions for each residue in a ZF helix using self-attention layers and attention layers that relate the nucleotide bases to the helical residues. To train the model, we provide the nucleotide target as well as a partially masked ZF sequence and evaluate the cross-entropy loss given input data. We achieve a reconstruction accuracy (sequence identity to the six masked residues) of 0.62 and 0.69 on the validation and test data respectively, with some positions (such as “−1”) that are strong determinants of binding specificity having higher reconstruction accuracies (
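Reconstruction accuracy as defined above (sequence identity at the masked residues) can be sketched as follows. This is an illustration only and does not reproduce the described model: the function names, the choice of masked indices, and the example sequences are all assumptions.

```python
def reconstruction_accuracy(pairs, masked_positions):
    """Fraction of masked positions where the prediction equals the true residue.

    pairs: list of (true_helix, predicted_helix) strings of equal length.
    masked_positions: indices of the helix that were masked during prediction.
    """
    matches = total = 0
    for true_seq, pred_seq in pairs:
        for i in masked_positions:
            matches += true_seq[i] == pred_seq[i]
            total += 1
    return matches / total


def per_position_accuracy(pairs, masked_positions):
    """Accuracy computed separately for each masked position."""
    return {
        i: sum(t[i] == p[i] for t, p in pairs) / len(pairs)
        for i in masked_positions
    }


# Hypothetical data: six masked indices of a seven-residue helix.
MASKED = (0, 1, 2, 3, 5, 6)
example_pairs = [("RSDHLTR", "RSDHLTR"), ("QSGNLAR", "RSGNLTR")]
overall = reconstruction_accuracy(example_pairs, MASKED)
by_position = per_position_accuracy(example_pairs, MASKED)
```

Reporting accuracy per position makes it possible to see which helix positions are reconstructed more reliably than others.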
The described method (referred to herein as ZFDesign) generates sequences in an incremental fashion: Starting from an empty sequence, the model is run once for each amino acid in the ZF helix pair. At each iteration an amino acid is predicted and this prediction is provided as context in subsequent iterations. For optimal sequence generation we adapted both an A*-based sampling methodology(36), as well as a temperature-dependent sampling procedure(37). We sought to compare ZFDesign to a baseline, but it is believed no previous model has explicitly attempted to perform full ZF-array design for a given target, with only a few collections of ZFs available. We used ZFpred, a recently developed method that outperformed previous models(35). We then used both ZFDesign and ZFpred to generate ZF sequences to target 6-mers from our test dataset. As alternative baseline comparisons, we first used the single-finger models (e.g., only the bottom module in
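The incremental, temperature-dependent generation described above can be sketched in simplified form. The sketch below is not ZFDesign: the `next_logits` callable stands in for a trained model and is hypothetical, as are the toy scoring function and all parameter names. The A*-based sampling variant is not shown.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def sample_with_temperature(logits, temperature, rng):
    """Sample an index from softmax(logits / temperature)."""
    scaled = [x / temperature for x in logits]
    top = max(scaled)  # subtract the maximum for numerical stability
    weights = [math.exp(s - top) for s in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


def generate_helix(next_logits, target, length, temperature=1.0, seed=0):
    """Build a sequence one residue at a time, feeding back each prediction."""
    rng = random.Random(seed)
    prefix = ""
    for _ in range(length):
        logits = next_logits(target, prefix)  # one model call per residue
        prefix += AMINO_ACIDS[sample_with_temperature(logits, temperature, rng)]
    return prefix


def toy_next_logits(target, prefix):
    """Hypothetical stand-in for a model: strongly favours one fixed helix."""
    favoured = "RSDHLTR"[len(prefix)]
    return [50.0 if aa == favoured else 0.0 for aa in AMINO_ACIDS]


low_temperature_sample = generate_helix(toy_next_logits, "GCG", 7, temperature=0.1)
```

Lower temperatures concentrate sampling on the highest-scoring residue at each step, while higher temperatures yield more diverse sequences.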
To validate ZFDesign we used a GFP-disruption assay in a U2OS cell line that has been used to approximate nuclease activity for ZFNs(38), TALENs(39), and spCas9(40) as indels in the coding sequence of GFP lead to frameshifts and loss of fluorescence. For each ZFN, two ZF arrays were designed as ZFNs require dimerization of the FokI catalytic domain presented as C-terminal fusions from each ZF array in a tail-to-tail orientation (
To avoid the presentation of effector domains out of their natural context, the disclosure demonstrates that ZF domains in human TFs can be seamlessly replaced with designed ZFs. This approach presents the designed ZFs in the exact context that ZFs would occur naturally in the parent protein. Such Reprogrammed Transcription Factors (RTFs) maximize secondary interactions of the TF, avoid the use of foreign effector domains, and enable investigation of TF binding events (
To create RTFs that repress target genes we used ZIM3 as our TF scaffold as ZIM3's KRAB domain has proven a potent repressor as an isolated SpCas9 fusion(43). We replaced ZIM3's ZFs with the series of ZF arrays designed to bind the TetO sequence as described for KLF6 (
For any of the RTFs listed above, to seamlessly replace their DBDs without impacting any other part of the parent protein, we use the consensus definition of the DBD of the parent protein to determine which part of the parent protein to replace. For example, the consensus Cys2His2 zinc finger domain begins 2 amino acids before the first cysteine and ends with the second histidine. Therefore, we replaced the natural ZFs of a TF such as Zim3, which naturally has 11 ZFs, by starting 2 amino acids before the first cysteine of the first finger and replacing the sequence all the way through the second histidine of the last (eleventh) finger. This region is replaced with a designed ZF array that again begins 2 amino acids before the first cysteine of the first ZF and follows through to the second histidine of the last ZF in the array. No other modifications are made to the parent protein (See
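The replacement boundaries described above can be illustrated with a minimal sketch. It assumes that each finger can be located with the commonly used Cys2His2 spacing pattern C-X(2-4)-C-X(12)-H-X(3-5)-H, which is an assumption of this sketch rather than a definition given in this disclosure. The function name and the toy sequences are hypothetical.

```python
import re

# Assumed Cys2His2 spacing: C-X(2-4)-C-X(12)-H-X(3-5)-H.
# Each match starts at the first cysteine and ends at the second histidine.
C2H2_FINGER = re.compile(r"C.{2,4}C.{12}H.{3,5}H")


def replace_zf_array(parent: str, designed_array: str) -> str:
    """Swap the native ZF array of a parent protein for a designed array.

    The replaced span runs from 2 residues before the first cysteine of the
    first finger through the second histidine of the last finger. The designed
    array is expected to be delimited in the same way.
    """
    fingers = list(C2H2_FINGER.finditer(parent))
    if not fingers:
        raise ValueError("no Cys2His2 finger found in parent sequence")
    start = fingers[0].start() - 2
    if start < 0:
        raise ValueError("fewer than 2 residues precede the first cysteine")
    end = fingers[-1].end()  # index just past the last second histidine
    return parent[:start] + designed_array + parent[end:]


# Toy parent: N-terminal region, two fingers joined by a linker, C-terminal tail.
toy_parent = (
    "MAAA" + "YKCPECGKSFSRSDHLTRHQRTH" + "TGEKP" + "FACDICGRKFARSDELTRHTKIH" + "LSR"
)
toy_designed = "FQCRICMRNFSRSDHLTTHIRTH"
toy_rtf = replace_zf_array(toy_parent, toy_designed)
```

In this toy case the N-terminal region and the C-terminal tail of the parent are left untouched, and only the span covering the native fingers is exchanged.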
Representative constructs are shown on
DELTRHFRKHTGAKPFKCSHCDRCFSRSDHLALHMKRH]L
Example of designed zinc fingers expressed in KLF6 scaffold:
THTKIHTQRPQIPPKPFACDICGRKFALKHHLLNHTRIHTGEKPFACDI
Designed zinc fingers are between the brackets. Recognition helices for each zinc finger are bold. In the example we are using extended linkers that allow for base-skipping between 2-finger targets. However, engineered zinc fingers that use the consensus linkers (TG(E/Q)(K/R)P) and do not skip bases are also functional. As these zinc fingers naturally occur at the C-terminus, we have left the C-terminal “L” of KLF6; however, a C-terminal extension from EGR1 or another human zinc finger protein may be accommodated without further risk of immunogenicity.
QGETTKPDVILRLEQGKEPWL]EEEEVLGSGRAEKNGDIGGQIWKPKDV
FECHSCGRAFGEKWKLDKHQKTHAEERPYKCENCGNAYKQKSNLFQHQK
MHTKEKPYQCKTCGKAFSWKSSCINHEKIHNAKKSYQCNECEKSFRQNS
TLIQHKKVHTGQKPFQCTDCGKAFIYKSDLVKHQRIHTGEKPYKCSICE
KAFSQKSNVIDHEKIHTGKRAYECDLCGNTFIQKKNLIQHKKIHTGEKP
YECNRCGKAFFQKSNLHSHQKTHSGERTYRCSECGKTFIRKLNLSLHKK
THTGQKPYGCSECGKAFADRSYLVRHQKRIHSR
Example of designed zinc fingers expressed in Zim3 scaffold:
NHTRIHTGEKPFACDICGRKFATSSGLCHHTKIHTQRPQIPPKPFACDI
Designed zinc fingers are between the brackets. Recognition helices for each zinc finger are bold. In the example we are using extended linkers that allow for base-skipping between 2-finger targets. However, engineered zinc fingers that use the consensus linkers (TG(E/Q)(K/R)P) and do not skip bases are also functional. As these zinc fingers naturally occur at the C-terminus, we have left the C-terminal “SR” of Zim3; however, a C-terminal extension from EGR1 or another human zinc finger protein may be accommodated without further risk of immunogenicity.
To test the regulatory potential of endogenous genes with RTFs, we applied the ZIM3 architecture to repress 3 endogenous targets (DPH1, Rab1a, and UBE4A) and designed 4 arrays each to bind sequences close to the transcriptional start site (TSS) of each gene. To maximize the likelihood of function, we designed these and all following ZF arrays to use 8 fingers. HEK293T cells were nucleofected with the RTFs and expression levels assayed by RT-qPCR. For each target gene at least one construct reduced expression levels significantly (
ZFDesign enables the reprogramming of TFs for either activation or repression. To test the precision of the regulation we used RNA-seq to quantify the on and off-target regulation of the RTFs. We focused on the 4 most potent KLF6 RTF regulators of CDKN1C, #125, 150, 172, and 200 (see
The specificity of ZF arrays can be impacted by target content and affinity. As noted, G-rich binding tends to be more promiscuous. Consistent with this observation, the CDKN1C target with the lowest G-content (#200,
It will be recognized from the foregoing Examples that this disclosure presents ZFDesign, a novel hierarchical attention-based AI model trained on comprehensive screens of ZF-DNA interactions that consider the influence of multiple adjacent finger environments. ZFDesign captures these influences to provide the first general design model for ZF arrays. By contrast, previous efforts produced incomplete collections of ZF modules that often fail out of context and produce low on-target activity. The described model, however, consistently produced ZF arrays across a wide range of targets with high efficacy as nucleases, repressors, and activators. Thus, ZFDesign represents a significant advance as the design of ZFs for any given target is suitable for study of many research and therapeutic applications with the advantages of small size and low immunogenicity.
Without intending to be constrained by any particular theory, it is considered the disclosure provides the first generalizable design methodology that allows for the seamless replacement of a TF's natural DNA-binding domain and directs the TF to any target of interest. These RTFs can produce activation and repression activities similar to CRISPR-based tools, supporting utility of these proteins as therapeutics comprised of solely human components. In addition, the described approaches allow for analyzing TF function as they more accurately mimic natural TFs.
The following materials and methods were used to produce the data described herein and in the accompanying figures.
Primary zinc finger libraries: All primary ZF libraries were built as previously described(35, 46) and detailed below. To provide templates for PCR, gBlocks were ordered from IDT that coded for the finger 0 and finger 1 domains of each library (
2-finger libraries: Second round selections were used to select compatible pairs from pre-selected ZF pools generated in the primary ZF library selections. We pooled recovered plasmid DNA from our primary single-finger screens on a binding site basis, resulting in a pool of diverse helices (termed “round 2 pools”) with broad compatibility for each of the 64 different binding sites. To ensure these were enriched for functional helices and not background, a simple cutoff was devised to omit unsuccessful selections. Based on the data filtering metrics described, single-finger pools were omitted if less than 20% of the reads passed these filters as those selections would have added a disproportionate amount of non-functional ZFs to our template pools. This set of 64, round 2 pools was used as a PCR template to create either ‘domain 1’ or ‘domain 2’ amplicons using Expand™ High Fidelity PCR system (Roche) and 15 cycles of PCR to reduce bias. ‘domain 1’ and ‘domain 2’ reactions were gel-purified from a 2% agarose gel, quantified by nanodrop, and stored at −20 C. In order to create a 2-finger library insert, we performed overlapping PCR to stitch appropriate ‘domain 1’ and ‘domain 2’ pools together. Purified single-finger amplicons were combined equimolar as the template for overlap PCR with Phusion® High Fidelity DNA Polymerase (NEB) (25 cycles), PCR-purified, digested with KpnI and NotI, gel-purified, and quantified by Nanodrop (ThermoFisher Scientific). The digested 2-finger library inserts were ligated into our 2-finger library vector (see
Primary ZF Libraries: Libraries were built in a vector that expresses the ZFs as a fusion to the omega subunit of the bacterial polymerase using a strong promoter. In the B1H system, omega simply acts as an activation domain. The binding site reporter vectors were built by placing the binding site of interest 10 bp upstream of the −35 box of the promoter that drives HIS3 and GFP expression in the previously described GHUC vector. For example, for the library 2 TAC selection, the binding site 5′ TAC-ACA-AAG 3′ was built into the GHUC vector 10 bp upstream of the promoter where the library domain will bind TAC and domains 1 and 0 of library 2 will bind ACA and AAG, respectively (
Compatible 2-finger module selections: In order to identify compatible 2-finger modules from our round 2 libraries, we first built a matching set of vectors containing the intended DNA target and then leveraged omega-dependent activation of the HIS3 reporter in our bacterial one-hybrid system. Round 2 libraries were co-transformed with the matching reporter vector in USO-ω cells and recovered and titered as described. Based on cell counts the next day, 1×10^6 cells were added in triplicate to a 96-well deep-well plate containing a sterile bead for efficient agitation. Selections were performed in 1 mL NM+Ura/−His supplemented with 100 μg/mL carbenicillin, 50 μg/mL Kanamycin, 1 μM IPTG, and 5 mM 3AT. These were grown at 37 C in a plate shaker for 18, 24, or 40 hours and harvested upon reaching visible turbidity (typically OD>0.6). Triplicates were pooled, miniprepped, and deep sequenced on an Illumina NextSeq 500. Helices were rank-ordered by sequencing reads, and 2-finger modules within the top 5 highest counts were chosen for follow-up assembly and testing in the EGFP nuclease assay.
Zinc finger nuclease (ZFN) activity was assessed by measuring disruption of an integrated, constitutively-expressed eGFP reporter in a clonal U2OS cell line previously described(39). Cells were cultured in DMEM supplemented with 10% FBS, 2 mM GlutaMAX™ (Life Technologies), 1% penicillin/streptomycin, 1% MEM non-essential amino acids (Life Technologies), 2 mM sodium pyruvate, and 400 μg/mL G418. 1 μg of each ZFN monomer plasmid DNA and 200 ng ptdTomato-N1 plasmid DNA were transfected in duplicate into 5×10^5 cells using a Lonza Nucleofector™ 2b Device (Kit V, Program X-001). In each experiment, 2 μg of the parental empty vector (a modified derivative of the JDS71 vector from Addgene) and 200 ng ptdTomato-N1 were used as a negative control, and 2 μg of a dual spCas9-guide expressing vector (modified Addgene plasmid #41815) and 200 ng ptdTomato-N1 were used as a positive control. Cells were grown in 6-well dishes for 3 days post-transfection, harvested and kept on ice, and analyzed for expression of eGFP and tdTomato on a Sony SH800 cell sorter. In order to restrict analysis to only cells that likely received both ZFN monomer plasmids, populations were first gated on the top 15-25% tdTomato+ cells, and then analyzed for loss of eGFP expression.
Primary libraries: Following selection from ≥5×10^8 library variants, surviving colonies were pooled, miniprepped, and DNA barcoded for sequencing on an Illumina NextSeq® 500. Typically these were performed as a set of 64 3 bp binding sites for a given ‘overlap’ library as follows. 2 μL of pooled plasmid DNA was used as a template for barcoding in a 25 μL reaction with Taq Polymerase (NEB) with the following cycling parameters: 95 C for 5 min, 20 cycles of [95 C:20 s, 52 C:30 s, 68 C:30 s], 68 C for 10 min, and held at 4 C. 5 μL of each reaction was visualized on a 1% agarose gel to confirm apparent equal amplification. All 64 reactions were pooled in equal volumes. These were run out on a 1% agarose gel, gel purified, and submitted to the NYU Genome Technology Center for sequencing on a NextSeq® 500.
2-finger libraries: Following selection of ~3×10^6 2F library variants, plasmid DNA was extracted from surviving cells and barcoded for deep sequencing on an Illumina NextSeq® 500 as follows. 2 μL of pooled plasmid DNA was used as a template for barcoding in a 25 μL reaction with GoTaq® Green 2X Mastermix (Promega) with the following cycling conditions: 95 C for 5 min, 15 cycles of [95 C:30 s, 68 C:30 s, 72 C:60 s], 72 C for 5 min, and held at 4 C. 10 μL of each reaction was visualized on a 1% agarose gel to confirm equal amplification, and all reactions were pooled in equal volumes. These were gel-purified from a 1% agarose gel, and submitted to the NYU Genome Technology Center for sequencing on an Illumina NextSeq® 500.
All paired-end Illumina reads were demultiplexed and trimmed into 21-mers with in-house Unix scripts based on EMBOSS 6.6.0. Trimmed DNA sequences were translated, and amino acid sequences were considered if they had at least two read counts and were coded by at least two different DNA sequences. The invariant leucine at helix position +4 was excluded.
For each selection, helix sequences were clustered using the MUSI software(34). Each sequence was assigned to the cluster associated with the PWM for which it was assigned the highest responsibility. For each cluster generated, the Shannon entropy value was calculated for each helix residue based on the PWM for that cluster. If a selection lacked a cluster with at least one position with an entropy of two or less, that selection was filtered out for downstream analysis.
To compare the helices from two selections, A and B, pairwise normalized Hamming distances were computed between the two sets of filtered sequences based on the number of identical amino acids. The minimum normalized Hamming distance was then computed from each helix in selection A to each helix in selection B as well as from each helix in selection B to each helix in selection A. The overall distance between the two selections was computed as the mean of these distances.
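The comparison described above can be sketched in a few lines of Python. This is a minimal illustration, not the original analysis code: the function names are hypothetical, and it assumes that the "mean of these distances" pools the nearest-neighbour minima from both directions into a single average.

```python
from statistics import mean


def normalized_hamming(a: str, b: str) -> float:
    """Fraction of positions at which two equal-length helices differ."""
    if len(a) != len(b):
        raise ValueError("helices must have equal length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def selection_distance(sel_a, sel_b) -> float:
    """Mean of the minimum distances, taken A-to-B and B-to-A.

    Assumption: minima from both directions are pooled before averaging.
    """
    a_to_b = [min(normalized_hamming(a, b) for b in sel_b) for a in sel_a]
    b_to_a = [min(normalized_hamming(b, a) for a in sel_a) for b in sel_b]
    return mean(a_to_b + b_to_a)
```

Two selections that recover identical helix sets score 0, and the score rises toward 1 as the recovered helices diverge.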
Similar to our previous studies(47, 48), the PDB file 1AAY(49) was used as the template, and the DNA was elongated by 2 bp at each end using X3DNA to avoid the melting end effect so that the binding of zinc fingers is not affected. The DNA and protein sequences were mutated using Chimera (www.cgl.ucsf.edu/chimera/) for each library and test case, and the protonation states were determined by WHATIF (swift.cmbi.umcn.nl/whatif/). The prepared structures were then solvated in a TIP3P water box with a 15-Å buffer of water extending from the protein/DNA complex in each direction, and sodium ions were added to ensure overall charge neutrality. The FF99 Barcelona force field was used for the protein/DNA complex and the zinc Amber force field for zinc ions. The particle mesh Ewald method was used for electrostatics calculations. The SHAKE algorithm was used to constrain the hydrogen-containing bond lengths, which allowed a 2-fs time step for MD simulation. The non-bonded cut-off was set to 12.0 Å. The systems were energy minimized using a combination of steepest descent and conjugate gradient methods. Then the systems were thermalized and equilibrated for 3 ns using a multistage protocol. The first step was a 1.5 ns gradual heating from 100 K to 300 K, followed by 1.5 ns of density equilibration, both at 1-fs step length. Berendsen thermostat and barostat were used for both temperature and pressure regulation for another 6-ns equilibration at 2-fs step length with gradually reduced positional constraints at 300 K. The systems were built with tleap and the simulations were conducted with GPU accelerated Amber18(50). For each system, three 500-ns trajectories were simulated. The hydrogen bond analysis was performed using BioPython. We considered as hydrogen bonds any contacts below 3.5 Å between the atoms O6 and N7 in a guanine and the atoms NH1 and NH2 in an arginine or ND2 and OD1 in an asparagine.
Bifurcated hydrogen bonds between a guanine and an arginine are identified when two pairs O6-NH1/2 and N7-NH1/2 are found, allowing the tautomeric bifurcated hydrogen bond.
To quantify the promiscuity of helices that target each nucleotide three-mer, the Shannon entropy was computed. For each nucleotide three-mer, a position frequency matrix of nucleotide sequences targeted by every set of core residues (−1, 2, 3, 6) was computed. The entropy was calculated in a position wise fashion and then summed to get an overall metric for specificity.
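As an illustration of the position-wise entropy metric, the following Python sketch computes the Shannon entropy of the base frequencies at each target position and sums the results. The function name is hypothetical, and the use of log base 2 (bits) is an assumption, since the text does not state the logarithm base.

```python
from collections import Counter
from math import log2


def summed_positional_entropy(targets) -> float:
    """Sum over positions of the Shannon entropy of base frequencies.

    `targets` lists the equal-length nucleotide sequences bound by one
    set of core residues. Entropy is in bits (assumed base 2).
    """
    if not targets:
        raise ValueError("at least one target sequence is required")
    total = 0.0
    for pos in range(len(targets[0])):
        counts = Counter(t[pos] for t in targets)
        n = sum(counts.values())
        total -= sum((c / n) * log2(c / n) for c in counts.values())
    return total
```

A helix that only ever binds one three-mer scores 0, and each fully degenerate position contributes 2 bits.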
We developed a hierarchical neural network architecture that mimics the B1H experimental setup and captures the modularity of zinc finger proteins. This architecture is composed of three modules (
The architecture of the first two modules is largely based on the Transformer model(1). An encoder generates a high-dimensional representation for each base in a nucleotide four-mer. A decoder then generates predictions for each core residue in a zinc finger helix using self-attention layers and attention layers that relate the nucleotide bases to the helix residues. While the decoder in a conventional Transformer strictly generates sequences from left to right(1), the decoders in this model use bi-directional information. A portion of the residues in a helix are masked and the decoder outputs amino acid predictions at these positions. The third module consists of repeating self-attention layers and feed forward layers that allow the model to update residue embeddings based on inter-helix compatibility (
Variants of the first module with different numbers of attention heads and embedding dimensions were trained and evaluated on the initial task of predicting residues in a single helix (Table A). In the final model, all attention layers were repeated three times and each attention layer had four heads. The model embedding dimension (d_model) was set to 128. The value and key embedding dimensions for computing scaled dot-product attention (d_v and d_k) were both set to 256. The hidden dimension in the feed-forward layers was set to 128. For regularization, dropout layers were included after every feed forward and attention layer with a dropout rate of 0.3.
Table A shows the number of human transcription factors that use five common DNA-binding domains(9) and their comparative size. As many DNA-binding domains require dimerization, their monomeric and multimeric sizes are listed. A comparison of the multimeric size and the domain's common target length allows a calculation of amino acids required per base specified.
The models were trained and evaluated on data derived from B1H selections. B1H screening data was filtered using a previously described approach, where helices were evaluated based on the diversity of encoding nucleotide sequences found in the screen(2-4). The Shannon entropy for each helix (or helix pair) was calculated based on the number of reads associated with each possible encoding nucleotide sequence. Helices were filtered based on previously defined thresholds(3). Specifically, helices with fewer than ten reads or a Shannon entropy of less than 0.07 were removed.
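The read-count and entropy filter can be illustrated with a short Python sketch. The function names are hypothetical, and the logarithm base (2) is an assumption; the thresholds of ten reads and 0.07 are the ones stated above.

```python
from math import log2


def coding_entropy(read_counts) -> float:
    """Shannon entropy (bits, assumed) over the reads per encoding DNA sequence."""
    total = sum(read_counts)
    return -sum((c / total) * log2(c / total) for c in read_counts if c > 0)


def keep_helix(read_counts, min_reads=10, min_entropy=0.07) -> bool:
    """Keep a helix only if it has enough reads and enough coding diversity."""
    total = sum(read_counts)
    # The read-count check runs first, so an empty list never reaches the entropy.
    return total >= min_reads and coding_entropy(read_counts) >= min_entropy
```

A helix supported by 20 reads that all share one DNA sequence has zero entropy and is removed, whereas 12 reads split evenly over two DNA sequences pass both thresholds.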
Modules one and two were pre-trained using data from single-helix B1H selections that were performed against nucleotide four-mers. The data included selections performed with 11 libraries against 192 different nucleotide four-mers. In total, the dataset included 2,071,764 data points. For initial training and hyperparameter tuning, the data points were split into train, test, and validation datasets at proportions of 80%, 10%, and 10% respectively by four-mer sequence. For pre-training, the data was instead split by helix sequence.
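Splitting by four-mer (or by helix) rather than by individual data point keeps every record that shares a key in the same partition. The sketch below is a hypothetical illustration of such a grouped split; the function name, the fixed seed, and the choice to apply the 80/10/10 proportions to the set of keys rather than to the data points are assumptions.

```python
import random


def split_by_key(records, key, fractions=(0.8, 0.1, 0.1), seed=0):
    """Assign whole groups of records to train / test / validation.

    `key` maps a record to its grouping value (e.g. its four-mer target
    or its helix sequence). Proportions are applied to the unique keys.
    """
    keys = sorted({key(r) for r in records})
    random.Random(seed).shuffle(keys)
    n_train = round(len(keys) * fractions[0])
    n_test = round(len(keys) * fractions[1])
    bucket = {}
    for i, k in enumerate(keys):
        if i < n_train:
            bucket[k] = "train"
        elif i < n_train + n_test:
            bucket[k] = "test"
        else:
            bucket[k] = "validation"
    splits = {"train": [], "test": [], "validation": []}
    for r in records:
        splits[bucket[key(r)]].append(r)
    return splits
```

Calling it with `key=lambda r: r.target` gives a split by four-mer, while `key=lambda r: r.helix` gives a split by helix sequence, where `target` and `helix` stand for whatever fields the records carry.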
The full model was trained using data from helix-pair B1H selections that were performed against nucleotide seven-mers. An initial dataset of selections against 189 seven-mers was split into training and validation datasets at proportions of 90% and 10%. This dataset contains a total of 327,792 data points. To ensure that the validation set was sufficiently different from the training dataset, a graph was generated where nucleotide seven-mers were represented as nodes and edges connected seven-mers within two base substitutions from each other. While most of the nodes formed a single connected component, there were separate components that were included in the validation dataset (
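The graph construction described in this paragraph can be sketched as follows. This is an illustrative Python example using a simple union-find, not the original code; the function names are hypothetical.

```python
from itertools import combinations


def substitutions(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def connected_components(seqs, max_subs=2):
    """Group sequences linked by chains of edges of <= max_subs substitutions.

    Returns the components as lists, largest first.
    """
    seqs = list(dict.fromkeys(seqs))  # drop duplicates, keep order
    parent = {s: s for s in seqs}

    def find(s):
        # Follow parent links to the root, shortening the path on the way.
        while parent[s] != s:
            parent[s] = parent[parent[s]]
            s = parent[s]
        return s

    for a, b in combinations(seqs, 2):
        if substitutions(a, b) <= max_subs:
            parent[find(a)] = find(b)

    groups = {}
    for s in seqs:
        groups.setdefault(find(s), []).append(s)
    return sorted(groups.values(), key=len, reverse=True)
```

Under this scheme, the small components that are separate from the largest one are the natural candidates for a validation set, because every sequence in them is at least three substitutions away from anything in the main component.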
In both training steps, a nucleotide target and a sequence of partially masked core residues from either a single zinc finger or a helix pair were provided to the model. 50% of the core residues were masked and the cross-entropy loss was evaluated based on the output probabilities. Training was done using an Adam optimizer with a learning rate of 1e-4, and a minibatch size of 128 was used. Early stopping was done based on the validation loss. Pre-training modules one and two took at most 1.3 million iterations. Training the full model took at most 3.4 million iterations. When training the full model, the parameters for modules one and two were either randomly initialized, transferred from the pre-training step, or transferred from the pre-training step and frozen (
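The masking objective can be illustrated with a small, framework-free Python sketch. The names are hypothetical, and two details are assumptions rather than statements from the text: exactly half of the positions are chosen uniformly without replacement, and the loss is averaged over the masked positions only.

```python
import random
from math import log

MASK = "?"  # hypothetical mask token


def mask_residues(residues, fraction=0.5, rng=None):
    """Replace a fixed fraction of residues with MASK; return (masked, positions)."""
    rng = rng or random.Random()
    n_mask = round(len(residues) * fraction)
    positions = set(rng.sample(range(len(residues)), n_mask))
    masked = [MASK if i in positions else r for i, r in enumerate(residues)]
    return masked, sorted(positions)


def masked_cross_entropy(predicted, true_residues, positions) -> float:
    """Mean negative log-probability of the true residue at masked positions.

    predicted[i] maps each candidate residue to its output probability
    at position i.
    """
    losses = [-log(predicted[i][true_residues[i]]) for i in positions]
    return sum(losses) / len(losses)
```

In an actual training loop the probabilities would come from the network output and the loss would be minimized with the optimizer settings given above.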
When predicting zinc finger residues, the model makes use of context provided by known residues. Helix sequences are generated incrementally where the network is run once for each missing residue. At each iteration, a single residue is added to increase the sequence context. For a pair of helices, there are about 4.1×10^15 possible sequences and about 4.8×10^8 orders in which each sequence can be generated. Enumerating all possibilities to find the sequence with the highest likelihood is thus computationally intractable.
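The two counts in the preceding paragraph are consistent with 12 variable residue positions per helix pair and 20 possible amino acids at each position, giving 20^12 (about 4.1×10^15) sequences and 12! (about 4.8×10^8) generation orders. The figure of 12 positions is inferred from those numbers and is not stated explicitly in the text. The arithmetic can be checked as follows:

```python
from math import factorial

AMINO_ACIDS = 20
RESIDUES_PER_PAIR = 12  # assumption inferred from the quoted counts

n_sequences = AMINO_ACIDS ** RESIDUES_PER_PAIR  # every residue chosen freely
n_orders = factorial(RESIDUES_PER_PAIR)         # every fill-in order

print(f"{n_sequences:.1e} sequences, {n_orders:.1e} orders")
```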
To generate sequences, we adapted the A* search algorithm, as done previously(5, 6). This approach involves iteratively filling in masked residues while maintaining a priority queue of partially masked sequences. At every iteration, the top partially masked sequence is taken from the priority queue and passed through the network. All possible labels for every masked residue are evaluated. Any label with a probability above 0.05 is accepted and the label is added to a copy of the input sequence before it is pushed onto the priority queue. This is repeated until a set number of sequences are completely generated. The following equation is used to assign a priority to each partially masked sequence:
This heuristic approximates the maximum expected probability of a sequence that would be attained by predicting the remaining residues. p_i denotes the probability assigned to the prediction made at iteration i and j denotes the number of predicted residues. p* denotes the expected maximum probability that would be assigned by the network to later predictions. This parameter can be tuned to move the search closer to a greedy search or a breadth-first search. This parameter was set to 0.1 whenever A* was performed in this work.
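The search loop can be sketched in Python with a priority queue. This is a hypothetical illustration rather than the disclosed implementation. In particular, the priority used here (the sum of log p_i over the predictions made so far, plus log p* for each residue still masked) is an assumed form chosen to match the description of p_i, j, and p*, and `predict` is a stand-in for the trained network.

```python
import heapq
from itertools import count
from math import log


def best_first_fill(masked, predict, n_results=3, min_prob=0.05, p_star=0.1):
    """Fill masked positions (None) one residue at a time, best candidates first.

    predict(seq) must return {position: {label: probability}} for every
    masked position of `seq`. Returns a list of (sequence, log_probability).
    """
    start = tuple(masked)
    tie = count()  # tie-breaker so the heap never compares sequences
    heap = [(0.0, next(tie), 0.0, start)]
    seen, done = {start}, []
    while heap and len(done) < n_results:
        _, _, logp, seq = heapq.heappop(heap)
        if None not in seq:
            done.append((seq, logp))
            continue
        for pos, dist in predict(seq).items():
            for label, p in dist.items():
                if p <= min_prob:  # accept only labels above the threshold
                    continue
                child = seq[:pos] + (label,) + seq[pos + 1:]
                if child in seen:  # simplification: first route to a state wins
                    continue
                seen.add(child)
                child_logp = logp + log(p)
                # Assumed priority: known log-probability plus an estimate
                # of log p* for each residue that is still masked.
                priority = child_logp + child.count(None) * log(p_star)
                heapq.heappush(heap, (-priority, next(tie), child_logp, child))
    return done
```

With this assumed form, a value of p* near 1 favors partially filled sequences and makes the search behave more like a breadth-first search, while a small p* favors nearly complete sequences and makes it more greedy.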
We also implemented an alternative biased sampling approach using temperature adjusted distributions, as done previously(7). This approach generally resulted in higher likelihood sequences (
To generate distributions over helix sequences using ZFPred(3), 10^6 helix sequences were randomly sampled. The binding specificities of these helices were predicted using ZFPred. Sequence distributions for a particular nucleotide sequence were then generated by normalizing the predicted scores of the sampled helices for that nucleotide sequence. Predictions for 3-mers were concatenated to generate predictions for 6-mer sequences.
We designed zinc fingers to bind the sequence 5′-CGCCCAGCTGGGGGCGGGGGA-3′, a sequence that is repeated 111 times at the Brf1 locus on chromosome 14 (hg38 chr14:105229626-105240946). The coding sequence for the designed zinc finger array was ordered from IDT (gBlock). An SV40 NLS was added to the C-terminus by PCR. Next, we added GFP as an N-terminal fusion to the zinc fingers using the NT-GFP Fusion TOPO TA Expression Kit (Invitrogen). Successful cloning into the expression vector was confirmed by Sanger sequencing.
The GFP-ZF fusion expression vector was transfected into 293T cells and grown on 0.01% Poly-L-Lysine coated 35 mm MatTek dishes using X-treme-GENE 9 DNA transfection reagent (Sigma Aldrich). Transfected cells were Hoechst stained the next day and then imaged. A titration experiment was conducted to explore optimal plasmid concentration. Clear foci were visible at a range of concentrations, but 333 ng of plasmid yielded the optimal balance of transfection efficiency and signal to noise ratio.
HEK293T cells were transfected with ZF-repressors, ZF-activators, or SpCas9-repressors targeting various endogenous loci and target transcript levels were measured by RT-qPCR as follows. 2 μg of the parental (pKJ-Kan) plasmid DNA or 2 μg of pMMBC_SpCas9 containing a non-targeting guide were used as negative controls for ZF and SpCas9 transfections, respectively. Cells were cultured in DMEM supplemented with 10% FBS, 2 mM GlutaMAX™ (Life Technologies), 1% penicillin/streptomycin, 1% MEM non-essential amino acids (Life Technologies) and 2 mM sodium pyruvate. 18-24 hours prior to transfection, cells were passaged and 7.5e5 cells were added to 2.5 mL media in a 6-well dish. Cells were transfected with 2 μg of plasmid DNA using a 4:1 ratio of DNA: TransIT®-LT1 transfection reagent (Mirus) according to manufacturer's instructions. Media was changed 2 days post-transfection, and cells were harvested for RT-qPCR 3 days post-transfection. Cells were washed once with sterile PBS, 350 μL Buffer RLT Plus (Qiagen) containing 1% β-mercaptoethanol was added, and samples were either stored at −80 C or processed immediately using a RNeasy Plus Mini Kit (Qiagen) according to manufacturer's instructions. Pure RNA was quantified using a NanoDrop™ 2000c (Thermo Scientific™) and stored at −80 C.
1 μg of pure RNA was reverse transcribed using the SuperScript™ IV First-Strand Synthesis System (Invitrogen™) according to manufacturer's instructions except half the recommended reverse transcriptase was used. Random hexamers were used as primers, and cDNA was stored at −20 C or processed immediately. qPCR reactions were set up in technical duplicate or triplicate using the equivalent of 25 ng or 50 ng reverse-transcribed RNA per reaction and the KAPA SYBR FAST qPCR Master Mix (2X) (Roche).
RT-qPCR was performed on a LightCycler® 480 Instrument II (Roche) using the cycling program recommended for KAPA SYBR FAST reagent on the LightCycler® 480 (annealing temperature was 60 C). Ct values were calculated using the on-board “Absolute Quantification/2nd Derivative Max” analysis option. Input was first normalized using the housekeeping gene RPS18, and fold-change in expression for a given gene of interest was calculated relative to the appropriate negative control. A table of RT-qPCR primers used in this study can be found in the supplementary data.
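The normalization described above corresponds to a relative quantification of the following kind. The sketch assumes the common 2^-ΔΔCt calculation with perfect amplification efficiency; the text states only that input was normalized to RPS18 and expressed relative to the negative control, so the exact formula and the function name are assumptions.

```python
def fold_change(ct_target, ct_ref, ct_target_ctrl, ct_ref_ctrl) -> float:
    """Relative expression by the 2^-ddCt method (assumed).

    ct_target / ct_ref: Ct of the gene of interest and of the
    housekeeping gene (e.g. RPS18) in the treated sample.
    ct_target_ctrl / ct_ref_ctrl: the same values in the negative control.
    """
    delta_sample = ct_target - ct_ref
    delta_control = ct_target_ctrl - ct_ref_ctrl
    return 2.0 ** -(delta_sample - delta_control)
```

For example, if the target Ct rises by two cycles relative to the control while the housekeeping Ct is unchanged, the function returns 0.25, that is, a four-fold repression.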
RNA-Seq library preps were constructed using the Illumina TruSeq® Stranded mRNA Library Prep kit (Cat #20020595) using 500-1000 ng of total RNA as input, amplified by 10-12 cycles of PCR, and sequenced paired-end 50 cycles on Illumina sequencers with 2% PhiX spike-in. 25-30 million reads were obtained for each sample. Paired-end reads were aligned to hg38 using the STAR aligner(8). Read counts were computed using FeatureCounts and differential expression analysis was subsequently performed using DESeq2(9).
Two-sided Wilcoxon rank-sum tests were performed using the SciPy Python library. Boxplot centerlines show medians, box limits show upper and lower quartiles, whiskers extend to 1.5 times the interquartile range, and points show outliers.
This reference listing is not an indication that any particular reference is material to patentability:
This application claims priority to U.S. provisional patent application no. 63/145,929, filed Feb. 4, 2021, the entire disclosure of which is hereby incorporated by reference.
This invention was made with government support under grant number R01GM118851 awarded by the National Institutes of Health. The government has certain rights in the invention.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/US2022/015346 | 2/4/2022 | WO |
Number | Date | Country | |
---|---|---|---|
63145929 | Feb 2021 | US |