This application contains a Sequence Listing that has been submitted electronically as an ASCII text file named ‘29539 0527US1 Sequence Listing’. The ASCII text file, created on Nov. 21, 2022, is 891 kilobytes in size. The material in the ASCII text file is hereby incorporated by reference in its entirety.
Described herein are systems, methods, and compositions for the precise editing of DNA sequence(s) at specific loci to alter expression of target gene products at the pre-transcriptional or post-transcriptional level in a durable fashion, termed Stable and Heritable Alteration by Precision Editing (SHAPE). The SHAPE platform utilizes genetic modifiers (e.g., nucleases, (CRISPR guided) transposases, recombinases, base editors, and prime editors) to install specific sequence motifs at target sequences through precision genome engineering.
Precisely controlling gene expression in biosystems has important applications in biotechnology and therapeutic settings (refs. 5-7). Pre-transcriptional strategies for gene regulation include the use of artificial transcription factors (ATFs) where a programmable DNA-binding domain (e.g., zinc fingers, transcription activator-like effectors, CRISPR-Cas) is coupled with an effector domain (e.g., VP64, p65, KRAB) to alter gene transcription (ref. 6), whereas post-transcriptional strategies include targeted protein degradation (TPD) and RNA interference (RNAi). These gene regulation strategies are transient in nature and require re-dosing or constitutive expression of the exogenous biomolecules to have durable and sustained effect(s). There remains a need for durable gene regulation strategies that do not require the constitutive presence of exogenous biomolecules.
Described herein are genome engineering strategies for the precise installation of sequence motifs at specific loci to alter expression of target gene products at the pre-transcriptional or post-transcriptional level, termed Stable and Heritable Alteration by Precision Editing (SHAPE). The SHAPE platform includes the identification of 1) functional sequence motifs with regulatory potential, 2) target regions where the sequence modification is to take place, and 3) genetic modifiers for achieving the precise edit of interest, to ultimately induce targeted gene expression change(s) in a cell type or cell types of interest (Tables 1 and 2).
Thus, provided herein are methods for identifying a method for altering expression of target genes in selected cell types.
Thus, provided herein are methods for identifying a genetic modifier to alter expression of a target gene in a selected cell type. The methods include: providing, optionally from a database, one or more candidate regulatory motif sequences with regulatory potential (as described herein, e.g., binding sites for transcription factors or other factors that affect gene expression and are expressed in the cell, e.g., endogenous factors) in the selected cell type; selecting a sequence of a putative regulatory region of the target gene, preferably wherein the putative regulatory region is in a promoter, enhancer, insulator, untranslated region (UTR), or intron, optionally in a non-coding region of the target gene; comparing the sequence of the putative regulatory region to the candidate regulatory motif sequences, identifying a candidate regulatory motif sequence that has either little to no identity at all (e.g., for the insertion strategy) as a potential insertion site or that has at least 50% identity and at least one mismatch (i.e., not 100% identity, for the substitution strategy) to a (corresponding) portion of the sequence of the regulatory sequence as a potential substitution site;
determining sequence alterations needed to make the putative regulatory region match (e.g., to include or have 100% identity with) the candidate regulatory motif sequence; and
identifying one or more genetic modifiers capable of making the sequence alterations needed to make the putative regulatory region match the candidate regulatory motif sequence. In some embodiments, identifying a genetic modifier comprises using a computer or an algorithm that compares the putative regulatory region and candidate regulatory motif sequences from a database; identifies candidate regulatory motif sequences that differ from the putative regulatory region by at least one nucleotide and up to 100% as a potential insertion site, or that differ from the putative regulatory region by at least one nucleotide and up to a selected amount, optionally at least 50% identity, as a potential substitution site; determines sequence alterations needed to make the putative regulatory region match the candidate regulatory motif sequence; compares the sequence alterations to a database of modifications that could be made by a set of genetic modifiers; and identifies one or more genetic modifiers that can alter the putative regulatory region to match the candidate regulatory motif sequence, to thereby introduce a functional regulatory motif.
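The comparison step described above can be illustrated with a minimal Python sketch. It assumes an ungapped sliding-window comparison and uses the 50% identity threshold stated above; the function names, the label "already_present", and the example sequences are hypothetical and are not taken from the present disclosure.

```python
def percent_identity(a: str, b: str) -> float:
    """Fraction of matching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x == y for x, y in zip(a.upper(), b.upper())) / len(a)


def best_window(region: str, motif: str) -> tuple[int, float]:
    """Return (offset, identity) of the region window most similar to the motif.

    Ties are resolved in favor of the leftmost window.
    """
    if len(motif) > len(region):
        raise ValueError("motif is longer than the region")
    scored = [
        (percent_identity(region[i:i + len(motif)], motif), -i)
        for i in range(len(region) - len(motif) + 1)
    ]
    identity, negative_offset = max(scored)
    return -negative_offset, identity


def classify_site(region: str, motif: str, min_identity: float = 0.5) -> dict:
    """Label the best window as a substitution or insertion candidate.

    Assumption: 100% identity means the motif is already present, identity
    at or above min_identity (with at least one mismatch) suggests the
    substitution strategy, and anything lower suggests the insertion strategy.
    """
    offset, identity = best_window(region, motif)
    if identity == 1.0:
        strategy = "already_present"
    elif identity >= min_identity:
        strategy = "substitution"
    else:
        strategy = "insertion"
    return {"strategy": strategy, "offset": offset, "identity": identity}
```

For example, `classify_site("TTTTGACGTAATTTT", "TGACGTCA")` reports a substitution candidate at offset 3, because the window differs from the motif at a single position. A practical implementation would likely also scan the reverse complement and score against position-weight matrices rather than a single consensus string.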
In some embodiments, the candidate regulatory sequence motif has regulatory potential to affect target gene expression at the pre-transcriptional or post-transcriptional level.
In some embodiments, the candidate regulatory sequence motif is a transcription factor binding sequence that can recruit endogenous transcription factors within a cell type or cell types of interest (e.g., cell type-specific factors), where the sequence motif may or may not exist in the genome of the selected cell type.
In some embodiments, the candidate regulatory sequence motif alters spacing of endogenous transcription factor binding sites in the putative regulatory region.
In some embodiments, the candidate regulatory sequence motif is a response element that is activated by a receptor-ligand complex through binding of an exogenously delivered small molecule, hormone, or drug for inducible target gene activation.
In some embodiments, the candidate regulatory sequence motif either stabilizes or de-stabilizes target gene transcripts, where the candidate regulatory sequence motif may or may not exist in the genome of the selected cell type.
In some embodiments, the candidate regulatory sequence motif is a hybridization target for endogenous non-coding RNAs (e.g., miRNAs, siRNAs, lncRNAs), where the sequence motif may or may not exist in the genome of interest.
In some embodiments, the candidate regulatory sequence motif modifies the translation initiation and/or elongation efficiency for target gene transcripts (e.g., Kozak sequence, optimal codon structure), and wherein the candidate regulatory sequence motif may or may not exist in the genome of the selected cell type.
In some embodiments, the putative regulatory region has the potential to modify expression of the target gene at the pre-transcriptional or post-transcriptional level.
In some embodiments, the putative regulatory region is a non-coding DNA sequence within 1 Mb or more of a target gene of interest, or spatially-proximal as determined by chromosome conformation capture assays.
In some embodiments, the putative regulatory region is a promoter of a target gene of interest, e.g., a proximal region spanning, e.g., 1000 bp upstream and 500 bp downstream of the transcription start site (TSS).
In some embodiments, the putative regulatory region comprises putative enhancer elements of a target gene of interest as defined by histone marks and/or chromatin accessibility features associated with functional enhancer elements (e.g., H3K4me1, H3K27ac); putative insulator elements of the target gene of interest as defined by histone marks and/or chromatin accessibility features associated with functional insulator elements; and/or putative silencer elements of the target gene of interest as defined by histone marks and/or chromatin accessibility features associated with functional silencer elements.
In some embodiments, the putative regulatory region comprises untranslated regions (UTRs) of the target gene transcripts.
In some embodiments, the putative regulatory region comprises an intronic region of the target gene transcripts.
In some embodiments, the putative regulatory region comprises a coding sequence of the target gene transcripts.
In some embodiments, the identified genetic modifier can introduce a specific sequence motif or modification at the target genomic region.
In some embodiments, the genetic modifier comprises a CRISPR-Cas domain, a zinc-finger DNA binding domain, or a transcription activator-like (TAL) effector domain.
In some embodiments, the CRISPR-Cas domain is used with a gRNA, wherein the gRNA comprises a sequence complementary to a sequence of the target cis-regulatory element of interest.
In some embodiments, the genetic modifier is a programmable nuclease (e.g., zinc finger nucleases, transcription activator-like effector nucleases, Cas9, CasX, Cas12), a base editor (e.g., ABE, CBE), or a prime editor (e.g., SpCas9H840A-MMLV-RT).
In some embodiments, the CRISPR-Cas prime editor further comprises a prime editing gRNA (pegRNA) and nicking sgRNA (ngRNA) wherein the pegRNA and ngRNA comprise a sequence complementary to a sequence of the target cis-regulatory element of interest.
Also provided herein are methods for altering expression of a target gene in a selected cell type. The methods include providing, optionally from a database, one or more candidate regulatory motif sequences with regulatory potential in the selected cell type; selecting a sequence of a putative regulatory region of the target gene, preferably wherein the putative regulatory region is in a promoter, enhancer, insulator, untranslated region (UTR), or intron, optionally in a non-coding region of the target gene; comparing the sequence of the putative regulatory region to the candidate regulatory motif sequences, identifying a candidate regulatory motif sequence that has either no identity at all as a potential insertion site or that has at least 50% identity and at least one mismatch to a portion of the sequence of the regulatory sequence as a potential substitution site; determining sequence alterations needed to make the putative regulatory region match the candidate regulatory motif sequence; and identifying one or more genetic modifiers capable of making the sequence alterations needed to make the putative regulatory region match the candidate regulatory motif sequence, and contacting the cell with the one or more genetic modifiers under conditions and for a time sufficient for the one or more genetic modifiers to make the putative regulatory region match the candidate regulatory motif sequence.
In some embodiments, identifying a genetic modifier comprises using a computer or an algorithm that compares the putative regulatory region and candidate regulatory motif sequences from a database; identifies candidate regulatory motif sequences that differ from the putative regulatory region by at least one nucleotide and up to 100% as a potential insertion site, or that differ from the putative regulatory region by at least one nucleotide and up to a selected amount, optionally at least 50% identity, as a potential substitution site; determines sequence alterations needed to make the putative regulatory region match the candidate regulatory motif sequence; compares the sequence alterations to a database of modifications that could be made by a set of genetic modifiers; and identifies one or more genetic modifiers that can alter the putative regulatory region to match the candidate regulatory motif sequence, to thereby introduce a functional regulatory motif.
Also provided herein are methods for altering expression of a target gene in a selected cell, the method comprising contacting the selected cell with a genetic modifier identified using a method described herein, under conditions sufficient to increase the target gene expression in the cell.
Additionally, provided herein are methods for heterotopic activation of a target gene expression in a selected cell, the method comprising contacting the cell with a genetic modifier identified using a method described herein, under conditions sufficient to increase the target gene expression in the cell.
In some embodiments, the candidate regulatory sequence motif is introduced into the putative regulatory region as a single motif or a repetitive sequence with multiple copies of the single motif, optionally with linker sequences therebetween.
In some embodiments, the genetic modifier introduces multiplex edits (e.g., installation of multiple transcription factor binding sites) in order to induce more robust modification of a single target gene expression. For example, a single type of modifier such as a prime editor or CRISPR Cas domain containing protein can be used, wherein that modifier is guided to multiple locations in the genome via multiple guide RNAs to enable multiplex edits.
In some embodiments, the genetic modifier introduces multiplex edits (e.g., installation of multiple transcription factor binding sites) in order to perform multi-gene expression control.
In some embodiments, the cell is a eukaryotic cell, e.g., a mammalian cell, e.g., a human cell.
Also provided herein are methods for treating or reducing risk of a condition or a disease in a subject, wherein the condition or the disease is caused, at least in part, by insufficient expression of the target gene, the method comprising administering to the subject an effective amount of a genetic modifier identified using a method described herein, under conditions sufficient to increase the target gene expression in the cell, thereby treating or reducing risk of the condition or the disease in the subject.
In some embodiments, the condition or the disease is caused, at least in part, by insufficient expression of the target gene on an allele.
In some embodiments, the condition or the disease is related to haploinsufficiency.
In some embodiments, the condition or the disease is caused, at least in part, by a dominant-negative gene.
In some embodiments, the condition or the disease is caused, at least in part, by insufficient expression of a target gene that is under the control of an enhancer, wherein the enhancer controls the expression of a plurality of genes.
In some embodiments, the method causes an increase in the expression of the target gene in the cell or in the cell of the subject by at least 1.1 fold as measured by mRNA expression.
Also provided herein are methods for treating or reducing risk of a condition or a disease in a subject, wherein the condition or the disease is caused, at least in part, by overexpression of the target gene, the method comprising administering to the subject an effective amount of a genetic modifier identified using a method as described herein, under conditions sufficient to decrease the target gene expression in the cell, thereby treating or reducing risk of the condition or the disease in the subject.
In some embodiments, the method causes a decrease in the expression of the target gene in the cell or in the cell of the subject by at least 1.1 fold.
In some embodiments, the subject is a mammal, e.g., a human.
In some embodiments, the present methods include:
The methods can include using an algorithm that compares the target regulatory regions and regulatory motif sequences identified above and identifies candidate regulatory motif sequences that differ from the target gene regulatory region by up to 100% for the insertion strategy and by less than a selected amount, e.g., 95%, for the substitution strategy, and compares the candidate regulatory motifs to the possible modifications that would be made by a set of genetic modifiers (e.g., to predict the modification(s) made by each of a set of genetic modifiers), to identify one or more genetic modifiers that can be used to modify the target regulatory region to introduce a functional regulatory motif.
The methods can include CRISPR-guided multiplex gene editing with guide RNAs targeting a nuclease or base/prime editor to two, three, or more, e.g., up to 25, endogenous target sites to introduce multiple regulatory sequences in a given cell type or tissue to modify gene regulation at one or multiple genes in parallel (Campa et al., Nature Methods 2019, Vol. 16, pp. 887-893). In the referenced paper, Cas12a is used for gene editing. This enzyme has also been shown to work efficiently in the context of second-generation CRISPR tools, such as base editors (Richter et al., Nat Biotechnol (2020). doi.org/10.1038/s41587-020-0453-z).
Also provided herein are methods for identifying a genetic modifier to alter expression of a target gene in a selected cell type, the methods comprising:
Identifying one or more regulatory motif sequences with regulatory potential in the selected cell type;
Identifying a sequence of a putative regulatory region of the target gene, preferably wherein the regulatory region is in a promoter, enhancer, insulator, UTR, or intron, optionally in a non-coding region of the target gene;
Comparing the sequence of the regulatory region to the regulatory motif sequences, and identifying a candidate motif sequence that has either no homology at all in the case of the insertion strategy or that has at least 50% identity and at least one mismatch (i.e., not 100% identity) to a portion of the sequence of the regulatory sequence in the case of the substitution strategy; and
Identifying a genetic modifier capable of altering the regulatory region to match (have 100% identity with) the candidate motif sequence, preferably wherein the genetic modifier is a zinc finger nuclease, CRISPR-Cas9 nuclease, base editor, or prime editor,
optionally comprising using an algorithm that compares the target regulatory regions and regulatory motif sequences and identifies candidate regulatory motif sequences that differ from the target gene regulatory region by 1-100% for the insertion strategy and by less than a selected amount, e.g., 95%, for the substitution strategy, and compares the candidate regulatory motifs to the possible modifications that would be made by a set of genetic modifiers, to identify one or more genetic modifiers that can be used to modify the target regulatory region to introduce a functional regulatory motif.
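As a rough illustration of comparing required sequence alterations to the modifications a set of genetic modifiers can make, the following Python sketch lists the point substitutions needed to convert a region window into a motif and looks them up in a small capability table. The table is an assumption made for illustration: it encodes only the commonly described conversions of adenine base editors (A-to-G, or T-to-C when the opposite strand is targeted) and cytosine base editors (C-to-T, or G-to-A), treats a prime editor as able to install any point substitution, and ignores editing windows, PAM requirements, and bystander edits. All names are hypothetical.

```python
# Illustrative capability table (assumption, not from the disclosure).
BASE_EDITOR_CHANGES = {
    "ABE": {("A", "G"), ("T", "C")},
    "CBE": {("C", "T"), ("G", "A")},
}


def substitutions(window: str, motif: str) -> list[tuple[int, str, str]]:
    """List (position, current base, desired base) for every mismatch."""
    if len(window) != len(motif):
        raise ValueError("window and motif must have equal length")
    return [
        (i, w, m)
        for i, (w, m) in enumerate(zip(window.upper(), motif.upper()))
        if w != m
    ]


def candidate_modifiers(window: str, motif: str) -> list[str]:
    """Return modifier classes whose listed conversions cover every edit."""
    edits = substitutions(window, motif)
    if not edits:
        return []  # window already matches the motif
    changes = {(w, m) for _, w, m in edits}
    names = [
        name
        for name, allowed in BASE_EDITOR_CHANGES.items()
        if changes <= allowed
    ]
    names.append("prime editor")  # assumed to cover any point substitution
    return names
```

Under these assumptions, `candidate_modifiers("TGATGTCA", "TGACGTCA")` returns `["ABE", "prime editor"]`, whereas a transversion such as A-to-C is assigned to the prime editor only.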
In some embodiments, the identified or discovered sequence motif or modification has regulatory potential to affect target gene expression at the pre-transcriptional or post-transcriptional level.
In some embodiments, the identified or discovered sequence motif is a transcription factor binding sequence that can recruit endogenous transcription factors within a cell type or cell types of interest (cell type-specific), where the sequence motif may or may not exist in the genome of interest.
In some embodiments, the identified or discovered sequence modification alters the spacing of endogenous transcription factor binding sites in the genome.
In some embodiments, the sequence motif is a response element that is activated by a receptor-ligand complex through binding of an exogenously delivered small molecule, hormone, or drug for inducible target gene activation.
In some embodiments, the identified or discovered sequence motif or modification either stabilizes or de-stabilizes target gene transcripts, where the sequence motif may or may not exist in the genome of interest.
In some embodiments, the identified or discovered sequence motif is a hybridization target for endogenous non-coding RNAs (e.g., miRNAs, siRNAs, lncRNAs), where the sequence motif may or may not exist in the genome of interest.
In some embodiments, the identified or discovered sequence motif modifies the translation initiation and/or elongation efficiency for target gene transcripts (e.g., Kozak sequence, optimal codon structure), where the sequence motif may or may not exist in the genome of interest.
In some embodiments, the identified or discovered sequence motif or modification is determined by a combination of gene expression (e.g., RNA-seq), chromatin accessibility (e.g., ATAC-seq, DNase-seq), DNA-protein interaction (e.g., ChIP-seq), and/or primary DNA sequence data from a single cell type or set of cell types of interest.
In some embodiments, the identification or discovery of sequence motifs can be performed by integrative analysis of genomics data across different cell types, e.g., using a computational strategy wherein: (i) regions of interest are uncovered based on their cell type-specific activity for a particular class of functional regions and on genomic data, e.g., chromatin marks (e.g., H3K27ac, H3K27me3), chromatin accessibility (e.g., DNase-seq or ATAC-seq), or DNA methylation; (ii) based on the recovered regions and a list of known TF motifs, the regions are searched for enriched patterns and their significance is evaluated; (iii) by integrating gene expression data, a short list of candidate TFs is provided that accounts for their endogenous expression across the different cell types and their expected potency based on genes that are downstream of the regions uncovered in step (ii); and (iv) a ranked list of TF sequences is generated based on this integrative approach for each cell type.
In some embodiments, the discovery of the sequence motifs is done by de novo motif discovery analysis within a single cell type or set of cell types of interest.
In some embodiments, the discovery of sequence motifs is done by analyzing cis-regulatory DNA sequence composition of top-expressing genes (e.g., top 1%, 5%, 20%, 50%) ranked by normalized expression values (e.g., RPKM, FPKM, TPM, fold-change) in a single cell type or set of cell types of interest.
In some embodiments, the discovery of the sequence motifs is done by analyzing cis-regulatory DNA sequence composition of bottom-expressing genes (e.g., bottom 1%, 5%, 20%, 50%) ranked by normalized expression values (e.g., RPKM, FPKM, TPM, fold-change) in a single cell type or set of cell types of interest.
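The selection of top-expressing or bottom-expressing genes can be sketched in Python as follows. This is a minimal illustration that assumes expression values (e.g., TPM) have already been normalized and are supplied as a mapping from gene name to value; the function name and the rounding-up of the gene count are assumptions, and the gene names in any usage are placeholders.

```python
import math


def select_by_expression(
    expression: dict[str, float], fraction: float, top: bool = True
) -> list[str]:
    """Return the top (or bottom) fraction of genes ranked by expression.

    fraction is a value in (0, 1], e.g., 0.05 for the top 5%. At least one
    gene is always returned; ties are broken alphabetically by gene name.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    n = max(1, math.ceil(len(expression) * fraction))
    if top:
        ranked = sorted(expression, key=lambda g: (-expression[g], g))
    else:
        ranked = sorted(expression, key=lambda g: (expression[g], g))
    return ranked[:n]
```

The cis-regulatory sequences of the genes returned by such a function could then be passed to a motif discovery step.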
In some embodiments, the discovery of the sequence motifs is done through frequency-based methods including the construction of position-weight matrices for a single cell type or set of cell types of interest.
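One common frequency-based construction is a log-odds position-weight matrix built from aligned binding sites. The Python sketch below assumes equal-length, uppercase sites, a uniform background of 0.25 per base, and a pseudocount of 1; these parameter choices and the function names are illustrative assumptions rather than part of the present disclosure.

```python
import math
from collections import Counter


def position_weight_matrix(
    sites: list[str], pseudocount: float = 1.0, background: float = 0.25
) -> list[dict[str, float]]:
    """Build a log2-odds position-weight matrix from aligned sites."""
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")
    total = len(sites) + 4 * pseudocount
    pwm = []
    for column in zip(*sites):
        counts = Counter(column)
        pwm.append({
            base: math.log2(((counts[base] + pseudocount) / total) / background)
            for base in "ACGT"
        })
    return pwm


def score(pwm: list[dict[str, float]], sequence: str) -> float:
    """Sum the per-position log-odds scores for a candidate sequence."""
    if len(sequence) != len(pwm):
        raise ValueError("sequence length must match the matrix length")
    return sum(column[base] for column, base in zip(pwm, sequence))
```

A sequence that resembles the aligned sites receives a higher score than an unrelated sequence of the same length, which allows candidate windows in a regulatory region to be ranked.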
In some embodiments, the discovery of the sequence motifs is done through neural network architectures to identify sequence motifs that may or may not exist in a single cell type or set of cell types of interest.
In some embodiments, the discovery of the sequence motifs is done through generation of synthetic DNA sequences using language models—a generative deep learning technique—where the sequence motifs may or may not exist in a single cell type or set of cell types of interest.
In some embodiments, the discovery of the sequence motifs is done through generation of synthetic DNA sequences using deep variational autoencoders—a generative deep learning model—where the sequence motifs may or may not exist in a single cell type or set of cell types of interest.
In some embodiments, the discovery of the sequence motifs is done through generation of synthetic DNA sequences using Generative Adversarial Networks (GANs)—a generative deep learning model—where the sequence motifs may or may not exist in a single cell type or set of cell types of interest.
In some embodiments, the discovery of the sequence motifs is done through the identification of transcription factor binding sequence motifs with dependencies with other sequence motifs (e.g., pairwise, triwise interactions).
In some embodiments, the discovery of the sequence motifs is done through the identification of transcription factor binding sequence motifs that recruit additional transcriptional machinery through protein-protein interactions.
In some embodiments, the identified sequence motif is from an online database for transcription factor binding motifs (e.g., JASPAR, HOCOMOCO).
In some embodiments, the sequence motif is introduced as a single motif or a repetitive sequence with multiple copies of the single motif that may or may not have linker sequences interspaced.
In some embodiments, the sequence motif is introduced as a combination of different sequence motifs with predicted additive or synergistic effects on target gene expression at the pre-transcriptional and/or post-transcriptional level, where the multiple sequence motifs may or may not have linker sequences interspaced.
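The single-motif, multi-copy, and combined-motif designs described above amount to joining motif sequences with optional linkers. A minimal Python sketch follows; the function name and example sequences are hypothetical, and the choice of linker sequence is left to the user.

```python
def motif_array(motifs: list[str], linker: str = "") -> str:
    """Concatenate motifs in order, placing the linker between neighbors."""
    if not motifs:
        raise ValueError("at least one motif is required")
    return linker.join(motifs)


# Three copies of one motif separated by a two-nucleotide linker.
tandem = motif_array(["TGACGTCA"] * 3, linker="AT")

# Two different motifs placed directly next to each other.
combined = motif_array(["TGACGTCA", "GGGCGG"])
```

The resulting string is the sequence that a genetic modifier would need to install, so its length is one practical constraint on which modifier can be used.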
In some embodiments, the target genomic region to introduce the sequence motif or modification is able to or has the potential to modify target gene expression at the pre-transcriptional or post-transcriptional level.
In some embodiments, the target genomic regions are non-coding DNA sequences within 1 Mb or more of the target gene(s) of interest, or spatially-proximal as determined by chromosome conformation capture assays.
In some embodiments, the target genomic regions are promoters of the target gene(s) of interest, defined as proximal regions e.g., 1000 bp upstream and 500 bp downstream of the transcription start site (TSS).
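The promoter definition above can be expressed as a coordinate calculation. The Python sketch below is illustrative only: it assumes a single TSS coordinate, a strand of "+" or "-", and the default window of 1000 bp upstream and 500 bp downstream; the function name and the clipping of the start at zero are assumptions.

```python
def promoter_window(
    tss: int, strand: str, upstream: int = 1000, downstream: int = 500
) -> tuple[int, int]:
    """Return (start, end) genomic coordinates of the promoter window.

    On the minus strand, "upstream" of the TSS lies at higher coordinates,
    so the offsets are mirrored. The start is clipped at 0.
    """
    if strand == "+":
        start, end = tss - upstream, tss + downstream
    elif strand == "-":
        start, end = tss - downstream, tss + upstream
    else:
        raise ValueError("strand must be '+' or '-'")
    return max(start, 0), end
```

For a plus-strand gene with a TSS at position 10000, this yields the window 9000 to 10500; for a minus-strand gene at the same position, it yields 9500 to 11000.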
In some embodiments, the target genomic regions are putative enhancer elements of the target gene(s) of interest defined by histone marks and/or chromatin accessibility features associated with functional enhancer elements (e.g., H3K4me1, H3K27ac).
In some embodiments, the target genomic regions are putative insulator elements of the target gene(s) of interest defined by histone marks and/or chromatin accessibility features associated with functional insulator elements.
In some embodiments, the target genomic regions are putative silencer elements of the target gene(s) of interest defined by histone marks and/or chromatin accessibility features associated with functional silencer elements.
In some embodiments, the target genomic regions are untranslated regions (UTRs) of target gene transcripts.
In some embodiments, the target genomic regions are intronic regions of target gene transcripts.
In some embodiments, the target genomic regions are coding sequences of target gene transcripts.
In some embodiments, the genetic modifier to use is able to introduce the specific sequence motif or modification at the target genomic region with sufficient efficiency and precision.
In some embodiments, the genetic modifier comprises a CRISPR-Cas domain, a zinc-finger DNA binding domain, or a transcription activator-like (TAL) effector domain.
In some embodiments, the genetic modifier is a programmable nuclease (e.g., zinc finger nucleases, transcription activator-like effector nucleases, Cas9, CasX, Cas12).
In some embodiments, the CRISPR-Cas domain further comprises a gRNA wherein the gRNA comprises a sequence complementary to a sequence of the target cis-regulatory element of interest.
In some embodiments, the genetic modifier is a base editor (e.g., ABE, CBE).
In some embodiments, the genetic modifier is a prime editor (e.g., SpCas9H840A-MMLV-RT).
In some embodiments, the CRISPR-Cas prime editor further comprises a prime editing gRNA (pegRNA) and nicking sgRNA (ngRNA) wherein the pegRNA and ngRNA comprise a sequence complementary to a sequence of the target cis-regulatory element of interest.
In some embodiments, the genetic modifier introduces multiplex edits (e.g., installation of multiple transcription factor binding sites) in order to induce more robust modification of a single target gene expression.
In some embodiments, the genetic modifier introduces multiplex edits (e.g., installation of multiple transcription factor binding sites) in order to perform multi-gene expression control.
In some embodiments, the methods include the use of unbiased saturation mutagenesis screening to empirically determine genetic editing modalities (e.g., programmable nucleases, base editors, prime editors) and target sites (e.g., promoters, enhancers, UTRs) for modifying expression of the target gene(s).
The present methods can be used for increasing a target gene expression in a cell, and for heterotopic activation of a target gene expression in a cell, by contacting the cell with a genetic modifier identified using a method described herein.
In some embodiments, the cell is a eukaryotic cell.
In some embodiments, the cell is a mammalian cell.
In some embodiments, the cell is a human cell.
Also provided are methods for treating or preventing a condition or a disease in a subject, the method comprising administering to the subject an effective amount of a genetic modifier identified by a method described herein, e.g., in a pharmaceutical composition, thereby treating or preventing the condition or the disease.
In some embodiments, the condition or the disease is caused, at least in part, by insufficient expression of the target gene.
In some embodiments, the condition or the disease is caused, at least in part, by insufficient expression of the target gene on an allele.
In some embodiments, the condition or the disease is related to haploinsufficiency.
In some embodiments, the condition or the disease is caused, at least in part, by a dominant-negative gene.
In some embodiments, the administration of the pharmaceutical composition increases expression of the target gene, thereby treating the condition or the disease.
In some embodiments, the condition or the disease is caused, at least in part, by insufficient expression of a target gene that is under the control of an enhancer, wherein the enhancer controls the expression of a plurality of genes.
In some embodiments, the method causes an increase in the expression of the target gene in the cell or in the cell of the subject by at least 1.1 fold, at least 2 fold, at least 3 fold, at least 4 fold, at least 5 fold, at least 6 fold, at least 7 fold, at least 8 fold, at least 9 fold, at least 10 fold, at least 15 fold, at least 20 fold, at least 25 fold, at least 30 fold, at least 35 fold, at least 40 fold, at least 45 fold, at least 50 fold, at least 60 fold, at least 70 fold, at least 80 fold, at least 90 fold, at least 100 fold, at least 150 fold, at least 200 fold, at least 300 fold, at least 350 fold, at least 400 fold, at least 450 fold, at least 500 fold, at least 600 fold, at least 700 fold, at least 800 fold, at least 900 fold, at least 1000 fold, at least 1100 fold, at least 1200 fold, at least 1300 fold, at least 1400 fold, at least 1500 fold, at least 1600 fold, at least 1700 fold, at least 1800 fold, at least 1900 fold, at least 2000 fold, at least 2500 fold, or at least 3000 fold, as measured by mRNA expression.
In some embodiments, the method causes a decrease in the expression of the target gene in the cell or in the cell of the subject by at least 1.1 fold, at least 2 fold, at least 3 fold, at least 4 fold, at least 5 fold, at least 6 fold, at least 7 fold, at least 8 fold, at least 9 fold, at least 10 fold, at least 15 fold, at least 20 fold, at least 25 fold, at least 30 fold, at least 35 fold, at least 40 fold, at least 45 fold, at least 50 fold, at least 60 fold, at least 70 fold, at least 80 fold, at least 90 fold, at least 100 fold, at least 150 fold, at least 200 fold, at least 300 fold, at least 350 fold, at least 400 fold, at least 450 fold, at least 500 fold, at least 600 fold, at least 700 fold, at least 800 fold, at least 900 fold, at least 1000 fold, at least 1100 fold, at least 1200 fold, at least 1300 fold, at least 1400 fold, at least 1500 fold, at least 1600 fold, at least 1700 fold, at least 1800 fold, at least 1900 fold, at least 2000 fold, at least 2500 fold, or at least 3000 fold, as measured by mRNA expression.
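The fold-change thresholds recited above can be checked with a simple ratio of normalized mRNA levels in edited versus unedited cells. The Python sketch below assumes both values are positive and already normalized by the same method; the function names and the convention of expressing a decrease as the inverse ratio are assumptions made for illustration.

```python
def fold_change(edited: float, unedited: float) -> float:
    """Ratio of edited to unedited normalized mRNA expression."""
    if edited <= 0 or unedited <= 0:
        raise ValueError("expression values must be positive")
    return edited / unedited


def meets_threshold(
    edited: float,
    unedited: float,
    min_fold: float = 1.1,
    direction: str = "increase",
) -> bool:
    """Check whether the change reaches min_fold in the given direction.

    A decrease is expressed as unedited / edited, so a halving of
    expression counts as a 2-fold decrease.
    """
    ratio = fold_change(edited, unedited)
    if direction == "increase":
        return ratio >= min_fold
    if direction == "decrease":
        return 1 / ratio >= min_fold
    raise ValueError("direction must be 'increase' or 'decrease'")
```

How the normalized mRNA values are obtained (e.g., RT-qPCR or RNA-seq) is outside the scope of this sketch.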
In some embodiments, the methods include providing a cell of the selected cell type and contacting the cell with the genetic modifier to alter the regulatory region to match the candidate motif sequence.
In some embodiments, the methods alter expression of gene products at the pre-transcriptional (e.g., recruitment of endogenous transcription factors, transcription activation or repression) or post-transcriptional level (e.g., sequence motifs that modify transcript stability, sequence motifs that modify translation initiation and/or elongation efficiency), in the context of a single cell type or set of cell types of interest.
In some embodiments, identifying one or more candidate motif sequences with regulatory potential in the selected cell type comprises referring to a database comprising a plurality of regulatory sequence motifs, e.g., endogenous regulatory sequences that are present in the selected cell type (e.g., in the species of the cell) but not in the target gene, or exogenous regulatory sequences (e.g., not present in the cell, from a different species, or artificial regulatory sequences) that bind to a factor present in the cell (e.g., a transcription factor), a target sequence for endogenous non-coding RNAs, a sequence motif in the untranslated regions (UTR) of a transcript that increases or decreases the stability and/or affects the transcription of the RNA molecules, and/or a sequence motif that modifies the translation initiation or elongation efficiency of transcripts, in the selected cell type.
In some embodiments, the candidate sequence motif is a transcription factor binding sequence, e.g., a binding sequence that is endogenous (present in the genome, but not in the gene, of the cell of interest), or exogenous (e.g., not present in the cell of interest, e.g., an artificial TF binding site or a TF binding site from another cell type or species that binds an endogenous TF that is expressed in the cell of interest) (Table 3A).
In some embodiments, the candidate sequence motif is a range for spacing between endogenous transcription factor binding sites that modifies gene expression.
In some embodiments, the candidate sequence motif is a known sequence motif in the untranslated regions (UTR) of transcripts that increases or decreases the stability and/or affects the transcription of these RNA molecules in cells, either endogenous or exogenous.
In some embodiments, the candidate sequence motif is a known target sequence for endogenous non-coding RNAs (e.g., miRNAs, siRNAs, lncRNAs) that affect the target transcript stability.
In some embodiments, the candidate sequence motif is a sequence motif that is exogenous (not present in the genome of interest), but that recruits endogenous non-coding RNAs (e.g., miRNAs, siRNAs, lncRNAs) that affect the target transcript stability (Table G, H, I, and J).
In some embodiments, the candidate sequence motif is an (endogenous or exogenous) sequence motif that modifies the translation initiation (e.g., Kozak sequence) or elongation (e.g., codon optimization) efficiency of transcripts.
In some embodiments, the candidate sequence motif encodes a 2A self-cleaving peptide (e.g., T2A, P2A, E2A, F2A).
In some embodiments, the candidate sequence motif encodes an intein sequence.
In one aspect, the present application includes identifying target genomic regions to modify that, upon the introduction of specific sequence motifs within this genomic region, may alter expression of a target gene or set of target genes at the pre-transcriptional or post-transcriptional level, in the context of a cell type or set of cell types of interest.
In some embodiments, the target genomic regions are non-coding DNA sequences within 1 Mb or more of the target gene(s) of interest.
In some embodiments, the target genomic regions are promoters of the target gene(s) of interest, defined as proximal regions, e.g., 1000 bp upstream and 500 bp downstream of the transcription start site (TSS).
In some embodiments, the target genomic regions are putative enhancer elements of the target gene(s) of interest defined by histone marks and/or chromatin accessibility features associated with functional enhancer elements (e.g., H3K4me1, H3K27ac)17.
In some embodiments, the target genomic regions are putative insulator elements of the target gene(s) of interest defined by histone marks and/or chromatin accessibility features associated with functional insulator elements17,18.
In some embodiments, the target genomic regions are putative silencer elements of the target gene(s) of interest defined by histone marks and/or chromatin accessibility features associated with functional silencer elements19.
In some embodiments, the target genomic regions are untranslated regions (UTRs) of target gene transcripts.
In some embodiments, the target genomic regions are intronic regions of target gene transcripts.
In some embodiments, the target genomic regions are coding sequences of target gene transcripts.
In some embodiments, the endogenous regulatory region of a gene (e.g., the promoter) is targeted to modify or enhance downstream transcription or translation machinery. This can be achieved, e.g., by installing or modifying a TATA box (also known as Goldberg-Hogness box) in archaea or eukaryotes or a Pribnow box in prokaryotes, or by installing enhanced Kozak ((gcc)gccRccAUGG (SEQ ID NO: 478)) in eukaryotes, Shine-Dalgarno (AGGAGGU) in prokaryotes, start codon (AUG and CUG in mammalian cells, AUA and AUU in mitochondria, GUG and UUG in E. coli) or stop codon (UGA, UAG, UAA) sequences20.
In some embodiments, binding sites of non-coding RNAs (ncRNAs), such as microRNAs (miRNAs) or long non-coding RNAs (lncRNAs), are installed or modified to alter the binding of said ncRNAs to DNA or RNA with the result of altered gene expression, and/or RNA abundance, and/or protein expression.
The methods can include producing a list of candidate sequence modifications that can be made to add a candidate motif to the target sequence.
The present methods also include identifying genetic modifiers that can introduce specific candidate sequence motifs into target genomic regions with high predicted precision and efficiency that may alter expression of a target gene or set of genes at the pre-transcriptional or post-transcriptional level, in the context of a cell type or set of cell types of interest. Genetic modifiers can include a programmable nuclease (e.g., zinc finger nucleases, transcription activator-like effector nucleases, and CRISPR-Cas systems, e.g., Cas9, CasX, Cas12); base editors (e.g. ABEs, CBEs, and CGBEs); and prime editors, inter alia (Table 2).
In some embodiments, algorithms can be used to identify a genetic modifier, e.g., based on comparing the desired sequence modifications to be made to the changes that could be made by a number of genetic modifiers.
In some embodiments, the sequence motif is to be introduced as a single motif or a repetitive sequence with multiple copies of the single motif that may or may not have linker sequences interspaced.
In some embodiments, the sequence motif is to be introduced as a combination of different sequence motifs with predicted additive or synergistic effects on target gene expression at the pre-transcriptional and/or post-transcriptional level, where the multiple sequence motifs may or may not have linker sequences interspaced.
In some embodiments, the methods can include predicting a sequence modification that would be caused by one or more genetic modifiers, e.g., using an algorithm, e.g., a computer-based method.
Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. Methods and materials are described herein for use in the present invention; other, suitable methods and materials known in the art can also be used. The materials, methods, and examples are illustrative only and not intended to be limiting. All publications, patent applications, patents, sequences, database entries, and other references mentioned herein are incorporated by reference in their entirety. In case of conflict, the present specification, including definitions, will control.
Other features and advantages of the invention will be apparent from the following detailed description and figures, and from the claims.
Precisely controlling gene expression in biosystems has important applications in biotechnology and therapeutic settings. Pre-transcriptional strategies for gene regulation include the use of artificial transcription factors (ATFs), whereas post-transcriptional strategies include targeted protein degradation (TPD) and RNA interference (RNAi). These gene regulation strategies are transient in nature and require re-dosing or constitutive expression of the exogenous biomolecules to have durable effect(s).
The regulation of endogenous gene expression is not a simple process, with dynamic transcriptional changes occurring across biological processes such as differentiation and disease. The DNA primary sequence plays a role in this gene regulation control, but how it encodes robust or dynamic gene expression modules is not clear. For example, many sequence motifs exist in the genome, however only a subset of these sequence motifs are bound by transcription factors. Previously, it was not clear that the introduction of a sequence motif into a new context can lead to the modification of gene expression.
Installing a single TF motif at an inactive promoter could lead to increased gene expression. It is well known that many TFs work in concert to recruit co-factors and RNA polymerases for gene expression, which might require complex TF motifs at the target promoters or enhancers.
Here we describe a strategy to introduce sequence motifs into non-endogenous contexts, where these sequence motifs recruit endogenous transcription factors that lead to the modification of gene expression. As shown herein, a single TF motif insertion modulates gene expression at a target gene. A single ELF motif of ~12 bp inserted at the MYOD1 promoter increased MYOD1 gene expression ~20 fold in HEK293T cells. These results suggest that gene expression can be modulated by introducing a single TF motif for a TF that is endogenously active in the cell types of interest, and that gene expression can be titrated by inserting multiple copies of this single TF motif.
Described herein are strategies for the targeted modification (e.g., activation or repression) of gene expression in a stable and heritable manner, which we call Stable and Heritable Alteration by Precision Editing (SHAPE). The SHAPE methodology is based on the targeted and precise introduction of sequence motifs at regions in the genome that enable changes in target gene expression at either the pre-transcriptional or post-transcriptional level.
The SHAPE platform utilizes genetic modifiers (e.g., nucleases, (CRISPR guided) transposases, recombinases, base editors, and prime editors) to install specific sequence motifs at target sequences through precision genome engineering. Insertion and deletion (indel) mutations from programmable nucleases (e.g., zinc finger nucleases, transcription activator-like effector nucleases, Cas9, CasX, Cas12), base edits from cytosine and/or adenine base editors (e.g., CBEs, ABEs), and all types of prime edits from prime editors (e.g., SpCas9H840A-MMLV-RTPE3) can be utilized to introduce specific sequence motifs at target non-coding DNA sequences (e.g., promoters, enhancers, insulators), coding sequences (e.g., exons), or untranslated regions (UTRs) to influence the expression of target genes. The newly introduced edits can take the form of different regulatory elements, including: transcription factor binding sites (TFBSs) that recruit endogenous transcription factors to modify target gene transcription, sequence modifications that alter the spacing of endogenous TFBSs, sequence elements that modify the stability (e.g., stabilize, de-stabilize) of target gene transcripts, or sequence compositions that modify target gene translation initiation and/or elongation. Following the introduction of these genetic edits, target gene expression can be modified in a stable and heritable fashion. If the newly introduced motif recruits endogenous transcriptional machinery (e.g., transcription factors), stabilizes the RNA of target genes, or drives expression of novel transcriptional products, the expression of target genes can be altered in a stable and heritable fashion.
SHAPE Platform
The SHAPE platform provides a framework, e.g., a computational (computer-implemented) framework, for the identification or generation of 1) functional DNA sequence motifs with regulatory potential, 2) target regions for sequence modification, and 3) genetic modifiers to use to ultimately achieve the precise installation of sequence motifs that induce targeted gene expression change(s) at the pre-transcriptional or post-transcriptional level. Overall, the SHAPE platform enables genetically-encoded gene regulation through the precise installation of sequence motifs within endogenous DNA sequence, offering a differentiated strategy for stable and heritable transcriptional modulation at the pre-transcriptional or post-transcriptional level with the transient expression of genetic modifiers.
An overview of different exemplary strategies of SHAPE for heritable targeted gene activation is outlined in Table 1, where gene activation strategies are mapped to the possible edit types. These edit types are then mapped to potential genetic modifiers that have the ability to introduce the specific edit in Table 2.
Generation of novel synthetic DNA sequence motifs that can recruit endogenous transcription factors can be done using a generative technique known as language modeling. A language model is typically a probability distribution over sequences of words that can occur in a natural language, e.g., English or French. Such models are typically trained to predict the next word in a sentence. After training is done, the model can be used to generate novel sentences that are semantically meaningful. This type of modeling can also be used to generate de novo DNA sequences. The model typically uses a neural network where the layers consist of Long Short Term Memory (LSTM) units. The neural network can be trained on a large corpus of DNA sequences, and then fine-tuned on known DNA sequence motifs in a conditional manner to generate novel synthetic DNA motifs that are functional in terms of recruiting transcription factors.
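By way of non-limiting illustration, the next-token idea behind language modeling can be sketched with a much simpler model than the LSTM network described above. The sketch below swaps in an n-gram (Markov chain) model that counts which base follows each k-mer and then samples new sequences one base at a time. The function names, the toy training corpus, and the choice of order 3 are assumptions made only for this example, and no claim is made that sequences produced this way recruit transcription factors.

```python
import random
from collections import Counter, defaultdict


def train_ngram_model(sequences, order=3):
    """Count which base follows each k-mer (k = order) in the training sequences."""
    model = defaultdict(Counter)
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - order):
            model[seq[i:i + order]][seq[i + order]] += 1
    return model


def generate_sequence(model, seed, length, rng):
    """Extend seed one base at a time by sampling from the learned next-base counts.

    The seed length is taken as the model order (assumption of this sketch).
    """
    order = len(seed)
    out = seed.upper()
    while len(out) < length:
        counts = model.get(out[-order:])
        if counts:
            bases, weights = zip(*sorted(counts.items()))
            out += rng.choices(bases, weights=weights)[0]
        else:
            out += rng.choice("ACGT")  # unseen context: fall back to a uniform draw
    return out


training = ["ACGTACGTACGT"]  # hypothetical toy corpus
model = train_ngram_model(training, order=3)
candidate = generate_sequence(model, "ACG", 12, random.Random(0))
```

In a neural language model the count table is replaced by learned network weights, but the generation loop (condition on the preceding context, sample the next base, repeat) has the same shape.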
The modification of endogenous transcription factor sequence composition by genetic editing is a viable strategy for gene activation. While transcription factors typically bind a core DNA sequence, there can be minor differences in the totality of sequences transcription factors bind across a genome. In addition to a range of different DNA sequences a transcription factor is capable of binding, the transcription factor may also exhibit different binding strength for each DNA sequence variant. The identification of the optimal binding sequence of transcription factors and downstream regulatory potential of this binding event can be determined through integration of chromatin accessibility (e.g., ATAC-seq, DNase-seq), protein-DNA interaction (e.g., ChIP-seq) and gene expression (e.g., RNA-seq) data. Following the identification of top DNA sequence binding motifs for all transcription factors, the identification of sub-optimal endogenous transcription factor binding motifs in cis-regulatory regions of a target gene can be performed. The utilization of genetic modifiers to introduce substitution edits (e.g., base editors, prime editors) can introduce more optimal transcription factor binding sites to promote heritable target gene activation.
Identification of Candidate Regulatory Sequence Motifs
The identification or generation of regulatory sequence motifs to be introduced into the genome can be determined through integrative analysis of gene expression, chromatin accessibility, and/or DNA-protein interaction data, de novo motif discovery, or generative neural network frameworks within a single cell type or set of cell types of interest. These sequence motifs can take the form of binding motifs for the recruitment of endogenous transcription factors, sequence motifs that promote the stabilization of RNA molecules, or sequence motifs that enable the expression of non-coding RNA for target gene repression.
Exemplary candidate regulatory sequence motifs can include one or more of a transcription factor binding sequence that has been determined experimentally in-vitro (e.g., SELEX)8, experimentally in-vivo (e.g., ChIP-seq, ChIP-qPCR)9, or computationally10 11 based on experimental data; a sequence motif that is not present in the genome of interest, but has been predicted to facilitate transcription factor binding; a known range for spacing between endogenous transcription factor binding sites that has been shown to modify gene expression; a predicted range for spacing between endogenous transcription factor binding sites that has been predicted to modify gene expression; a known sequence motif in the untranslated regions (UTR) of transcripts that increases or decreases the stability and/or affects the transcription of these RNA molecules in cells12; a sequence motif that is not present in the genome of interest, but has been predicted to modify the stability and/or affects the transcription of RNA molecules when placed in 5′ and/or 3′ untranslated regions (UTRs); a known target sequence for endogenous non-coding RNAs (e.g., miRNAs, siRNAs, lncRNAs) that affects the target transcript stability13; a sequence motif that is not present in the genome of interest, but has been predicted to recruit endogenous non-coding RNAs (e.g., miRNAs, siRNAs, lncRNAs) that affects the target transcript stability; a known sequence motif that modifies the translation initiation (e.g., Kozak sequence) or elongation (e.g., codon optimization) efficiency of transcripts14; a sequence motif that is not present in the genome of interest, but has been predicted to modify the translation initiation (e.g., Kozak sequence) or elongation (e.g., codon optimization) efficiency of transcripts; a sequence encoding a 2A self-cleaving peptide (e.g., T2A, P2A, E2A, F2A); and/or a sequence encoding an intein sequence.
A number of methods are known in the art for identifying candidate sequence motifs.
In some embodiments, the identification of candidate DNA sequence motifs is performed by integrative analysis of genomics data across different cell types and with a computational strategy we have recently proposed called Haystack15. First, regions of interest are uncovered based on their cell type specific activity for a particular class of functional regions and on genomic data, e.g., chromatin marks (e.g., H3K27ac, H3K27me3), chromatin accessibility (e.g., DNase-seq or ATAC-seq) or DNA methylation. Second, based on the recovered regions and a list of known TF motifs, regions are searched for enriched patterns and their significance is evaluated. Third, by integrating gene expression data, a short list of candidate TFs is provided to account for their endogenous expression across the different cell types and their expected potency based on genes that are downstream of the regions uncovered in the second step. Finally, a ranked list of TF sequences is also generated based on this integrative approach for each cell type.
As previously described (Pinello et al. Bioinformatics 2018), by exploiting chromatin accessibility (or histone marks) and gene expression variability across cell types, key regulatory regions and regulators that are cell type specific were extracted. By the integration of available data from large consortia (Roadmap Epigenomic and ENCODE projects) and using Haystack (github.com/pinellolab/haystack_bio) we have curated a list of regions and transcription factors for several human primary cells and cell lines. Importantly, this strategy can be applied to any organism for which a reference genome is available.
In some embodiments, the discovery of the sequence motifs is done by de novo motif discovery analysis within a single cell type or set of cell types of interest.
In some embodiments, the discovery of sequence motifs is done by analyzing cis-regulatory DNA sequence composition of top-expressing genes (e.g., top 1%, 5%, 20%, 50%) ranked by normalized expression values (e.g., RPKM, FPKM, TPM, fold-change) in a single cell type or set of cell types of interest.
In some embodiments, the discovery of the sequence motifs is done by analyzing cis-regulatory DNA sequence composition of bottom-expressing genes (e.g., bottom 1%, 5%, 20%, 50%) ranked by normalized expression values (e.g., RPKM, FPKM, TPM, fold-change) in a single cell type or set of cell types of interest.
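By way of non-limiting illustration, selecting the top-expressing and bottom-expressing genes whose cis-regulatory sequences would then be analyzed can be sketched as follows. The gene names and TPM values are hypothetical, and the helper name `expression_tails` is an assumption of this example rather than part of any described tool.

```python
import math


def expression_tails(expression_by_gene, percent):
    """Return (top, bottom) gene lists covering the given percent of ranked genes."""
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    # Rank genes from highest to lowest normalized expression (e.g., TPM).
    ranked = sorted(expression_by_gene, key=expression_by_gene.get, reverse=True)
    n = max(1, math.ceil(len(ranked) * percent / 100))
    return ranked[:n], ranked[-n:]


# Hypothetical normalized expression values for one cell type.
tpm = {
    "GENE_A": 250.0, "GENE_B": 3.2, "GENE_C": 88.0, "GENE_D": 0.1, "GENE_E": 41.5,
    "GENE_F": 12.0, "GENE_G": 0.0, "GENE_H": 7.7, "GENE_I": 530.0, "GENE_J": 1.4,
}
top, bottom = expression_tails(tpm, 20)
```

The promoter or enhancer sequences of the genes in each list could then be passed to a motif discovery step and the two tails compared.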
In some embodiments, the discovery of the sequence motifs is done through frequency-based methods including the construction of position-weight matrices for a single cell type or set of cell types of interest.
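By way of non-limiting illustration, a position-weight matrix can be built from a set of aligned binding sites and used to scan a region for its best match. The sketch below uses log2-odds scores against a uniform background with a pseudocount of 1; the aligned sites, the scanned region, and both parameter defaults are assumptions chosen for the example.

```python
import math


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Return one {base: log2-odds score} dict per motif position."""
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all sites must have the same length")
    pwm = []
    for pos in range(width):
        counts = {base: pseudocount for base in "ACGT"}
        for site in sites:
            counts[site[pos].upper()] += 1
        total = sum(counts.values())
        pwm.append({b: math.log2(counts[b] / total / background) for b in "ACGT"})
    return pwm


def score(pwm, seq):
    """Sum the per-position scores of seq (must be as long as the matrix)."""
    return sum(column[base] for column, base in zip(pwm, seq.upper()))


def best_hit(pwm, region):
    """Return (score, start) of the highest-scoring window in region."""
    width = len(pwm)
    hits = [(score(pwm, region[i:i + width]), i)
            for i in range(len(region) - width + 1)]
    return max(hits)


sites = ["GGAA", "GGAA", "GGAT", "AGAA"]  # hypothetical aligned binding sites
pwm = build_pwm(sites)
top_score, top_pos = best_hit(pwm, "TTTGGAACC")
```

Windows that score below the best observed sites but above background are natural candidates for the sub-optimal endogenous sites discussed elsewhere herein.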
In some embodiments, the discovery of the sequence motifs is done through neural network architectures to identify sequence motifs that may or may not exist in a single cell type or set of cell types of interest.
In some embodiments, the discovery of the sequence motifs is done through generation of synthetic DNA sequences using language models—a generative deep learning technique—where the sequence motifs may or may not exist in a single cell type or set of cell types of interest.
In some embodiments, the discovery of the sequence motifs is done through generation of synthetic DNA sequences using deep variational autoencoders—a generative deep learning model—where the sequence motifs may or may not exist in a single cell type or set of cell types of interest.
In some embodiments, the discovery of the sequence motifs is done through generation of synthetic DNA sequences using Generative Adversarial Networks (GANs)—a generative deep learning model—where the sequence motifs may or may not exist in a single cell type or set of cell types of interest.
In some embodiments, the discovery of the sequence motifs is done through the identification of transcription factor binding sequence motifs with dependencies with other sequence motifs (e.g., pairwise, triwise interactions)16.
In some embodiments, the discovery of the sequence motifs is done through the identification of transcription factor binding sequence motifs that recruit additional transcriptional machinery through protein-protein interactions.
In addition, a number of regulatory sequence motifs, e.g., transcription factors and transcription factor binding sequence motifs, are known, including those that are cell-type specific and those that are hormone responsive. For example, Table 3A provides a list of TFs and their entry number in the JASPAR database, which provides their recognition sequences.
Activating target genes in a cell-type-specific manner can be achieved by recruiting cell-type specific endogenous TFs (many of which are known in the art) at the promoters, enhancers or both of genes of interest. First, a list of cell-type-specific TFs for the target cell lines can be determined based on the expression levels from RNA-seq of native cell lines33. Second, individual or repeats of a cell-type specific TF motif or combination of multiple TF motifs can be introduced at the promoters, enhancers or both of genes of interest in the target cell lines via genetic modifiers. As controls, cell lines that do not express, or only lowly express, the cell-type-specific TFs that are expressed in the target cell lines should be tested to confirm that the target genes are not activated even though the same genomic modifications were installed.
A number of databases provide lists of cell-specific transcription factors, including TRANSFAC and TFregulome®; see also D'Alessio, A. C. et al. A Systematic Approach to Identify Candidate Transcription Factors that Control Cell Identity. Stem Cell Reports 5, 763-
Instead of the recruitment of endogenous transcription factors, SHAPE can also be used to introduce de novo transcription factor binding sites that act as response elements for TFs or TF-like proteins that can be activated (e.g., by expression or reduced degradation) by an exogenous small molecule, hormone, or drug following the subsequent addition of the exogenous small molecule, hormone, or drug. Upon induction by this exogenous ligand, the response element would recruit the receptor-exogenous ligand complexes to the target locus and give rise to target gene activation. Examples include the use of vitamin D, which interacts with the vitamin D receptor (VDR) transcription factor, by installing a VDR response motif at the promoters or enhancers of target genes of interest.
Table 3B provides a list of examples of hormone responsive transcription factors and elements.
The modification of endogenous transcription factor spacing by genetic editing is another viable strategy for gene activation. Transcription factor cooperativity30 31 is a well-known phenomenon where multiple transcription factors bound in proximity lead to greater overall binding stabilization and downstream regulatory effect. First, the analysis of endogenous transcription factor spacing for optimal regulatory potential can be determined based on primary DNA sequence in functional cis-regulatory elements or based on studies of transcription factor pair binding data in vitro (e.g., CAP-SELEX)16. Next, the mapping of endogenous transcription factor binding sites in cis-regulatory elements of a target gene of interest can be performed to determine sub-optimal endogenous transcription factor binding spacing32. The utilization of genetic modifiers that can introduce insertion or deletion edits (e.g., programmable nucleases, prime editors) can then be targeted to modify the endogenous transcription factor binding site distances to promote heritable target gene activation.
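By way of non-limiting illustration, the spacing analysis can be reduced to locating two binding sites, measuring the gap between them, and computing the insertion or deletion size that would bring the gap to a preferred value. The two motifs, the toy sequence, and the preferred gap of 4 bp below are hypothetical, and exact string matching stands in for the matrix-based site mapping that would be used in practice.

```python
def find_sites(seq, motif):
    """Return 0-based start positions of every exact occurrence of motif."""
    seq, motif = seq.upper(), motif.upper()
    return [i for i in range(len(seq) - len(motif) + 1) if seq.startswith(motif, i)]


def spacing_edit(seq, motif_a, motif_b, target_gap):
    """Find the motif_a / downstream motif_b pair needing the smallest indel.

    'change' is the number of bp to insert (positive) or delete (negative)
    between the two sites so that their gap equals target_gap.
    Returns None when no motif_b lies downstream of any motif_a.
    """
    best = None
    for a in find_sites(seq, motif_a):
        end_a = a + len(motif_a)
        downstream = [b for b in find_sites(seq, motif_b) if b >= end_a]
        if not downstream:
            continue
        gap = downstream[0] - end_a
        change = target_gap - gap
        if best is None or abs(change) < abs(best["change"]):
            best = {"a_start": a, "b_start": downstream[0], "gap": gap, "change": change}
    return best


# Hypothetical region: two sites separated by a 7 bp spacer.
region = "AA" + "GGAAGT" + "T" * 7 + "CACGTG" + "AA"
plan = spacing_edit(region, "GGAAGT", "CACGTG", target_gap=4)
```

Here `plan["change"]` is -3, i.e., a 3 bp deletion within the spacer, which is the kind of edit a programmable nuclease or prime editor could be selected to make.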
Selection of Target Genomic Regulatory Regions
The identification and selection of target genomic regulatory regions can be made based on identification or prediction (e.g., bioinformatic or empirical) of cis-regulatory elements (e.g., promoters, enhancers, insulators, silencers/repressors) that regulate a target gene or set of target genes of interest, untranslated regions (UTRs) of target gene transcripts that can modify transcript stability, regions along the transcript that can modify translation initiation and/or elongation efficiency of the target gene transcript, or locations in the genome where expression of a non-coding RNA could lead to targeted gene repression.
In some embodiments, the target genomic regions are non-coding DNA sequences within 1 Mb or more of the target gene(s) of interest.
In some embodiments, the target genomic regions are promoters of the target gene(s) of interest, defined as proximal regions, e.g., 1000 bp upstream and 500 bp downstream of the transcription start site (TSS).
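By way of non-limiting illustration, the exemplary promoter definition can be turned into genomic coordinates in a strand-aware way. The sketch assumes 0-based coordinates, a half-open (start, end) interval, and that the TSS base itself counts as the first downstream base; these conventions, the function name, and the example positions are assumptions of this example.

```python
def promoter_window(tss, strand, upstream=1000, downstream=500, chrom_length=None):
    """Return the 0-based, half-open (start, end) promoter interval around a TSS."""
    if strand == "+":
        # Upstream lies at lower coordinates on the forward strand.
        start, end = tss - upstream, tss + downstream
    elif strand == "-":
        # Upstream lies at higher coordinates on the reverse strand.
        start, end = tss - downstream + 1, tss + upstream + 1
    else:
        raise ValueError("strand must be '+' or '-'")
    start = max(start, 0)  # clip at the chromosome start
    if chrom_length is not None:
        end = min(end, chrom_length)  # clip at the chromosome end
    return start, end


forward = promoter_window(5000, "+")
reverse = promoter_window(5000, "-")
```

Both windows span 1500 bp; the reverse-strand window is mirrored so that the 1000 bp upstream portion sits at higher coordinates than the TSS.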
In some embodiments, the target genomic regions are putative enhancer elements of the target gene(s) of interest defined by histone marks and/or chromatin accessibility features associated with functional enhancer elements (e.g., H3K4me1, H3K27ac)17.
In some embodiments, the target genomic regions are putative insulator elements of the target gene(s) of interest defined by histone marks and/or chromatin accessibility features associated with functional insulator elements17, 18.
In some embodiments, the target genomic regions are putative silencer elements of the target gene(s) of interest defined by histone marks and/or chromatin accessibility features associated with functional silencer elements19.
In some embodiments, the target genomic regions are untranslated regions (UTRs) of target gene transcripts.
In some embodiments, the target genomic regions are intronic regions of target gene transcripts.
In some embodiments, the target genomic regions are coding sequences of target gene transcripts.
In some embodiments, the endogenous regulatory region of a gene (e.g., the promoter) is targeted to modify or enhance downstream transcription or translation machinery. This can be achieved, e.g., by installing or modifying a TATA box (also known as Goldberg-Hogness box) in archaea or eukaryotes or a Pribnow box in prokaryotes, or by installing enhanced Kozak ((gcc)gccRccAUGG (SEQ ID NO: 478)) in eukaryotes, Shine-Dalgarno (AGGAGGU) in prokaryotes, start codon (AUG and CUG in mammalian cells, AUA and AUU in mitochondria, GUG and UUG in E. coli) or stop codon (UGA, UAG, UAA) sequences20.
In some embodiments, binding sites of non-coding RNAs (ncRNAs), such as microRNAs (miRNAs) or long non-coding RNAs (lncRNAs), are installed or modified to alter the binding of said ncRNAs to DNA or RNA with the result of altered gene expression, and/or RNA abundance, and/or protein expression.
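By way of non-limiting illustration, the context of an existing start codon can be compared against the Kozak consensus (gcc)gccRccAUGG to decide whether installing an enhanced Kozak sequence is likely to be useful. The sketch checks only the two positions generally regarded as most influential, the purine (R) at -3 and the G at +4, and labels the context strong, adequate, or weak. That three-level labeling is a commonly used convention adopted here as an assumption, and the example sequences are hypothetical.

```python
def kozak_strength(cdna, atg_index):
    """Classify the start-codon context at cdna[atg_index:atg_index + 3].

    Position -3 is three bases before the A of ATG; position +4 is the base
    immediately after ATG. RNA input (U) is accepted and converted to DNA (T).
    """
    cdna = cdna.upper().replace("U", "T")
    if cdna[atg_index:atg_index + 3] != "ATG":
        raise ValueError("no ATG at the given index")
    if atg_index < 3 or atg_index + 3 >= len(cdna):
        raise ValueError("need at least 3 bases before and 1 base after the ATG")
    purine_at_minus3 = cdna[atg_index - 3] in "AG"
    g_at_plus4 = cdna[atg_index + 3] == "G"
    if purine_at_minus3 and g_at_plus4:
        return "strong"
    if purine_at_minus3 or g_at_plus4:
        return "adequate"
    return "weak"


consensus_like = kozak_strength("GCCGCCACCATGGCG", 9)   # matches gccRccAUGG
poor_context = kozak_strength("TTTTTTTTTATGCAA", 9)     # neither key position matches
```

A start codon classified as weak would be a candidate for a small substitution edit at -3 and/or +4, subject to the effect of the +4 change on the encoded second amino acid.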
Selection of Genetic Modifiers
The identification of genetic modifiers can be determined based on predictions (e.g., bioinformatic or empirical) of ability, efficiency and precision of a given modifier to introduce specific sequence motifs at a target region of interest (Table 1). The methods can include identifying genetic modifiers to introduce specific sequence motifs into target genomic regions with high predicted precision and efficiency to alter expression of a target gene or set of genes at the pre-transcriptional or post-transcriptional level, in the context of a cell type or set of cell types of interest.
In some embodiments, the methods can include using an algorithm that compares the target regulatory regions and regulatory motif sequences identified above and identifies candidate regulatory motif sequences, and compares the candidate regulatory motifs to the possible modifications that would be made by a set of genetic modifiers (e.g., to predict the modification(s) made by each of a set of genetic modifiers), to identify one or more genetic modifiers that can be used to modify the target regulatory region to introduce a functional regulatory motif. In some embodiments, the candidate regulatory motif sequences differ from the target gene regulatory region by less than a selected amount, e.g., by 1-50%, and a base editor or prime editor can be selected to make the changes. In some embodiments, there is no identity between the target regulatory regions and regulatory motif sequences, and a prime editor is selected for inserting a regulatory motif into the target regulatory region.
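By way of non-limiting illustration, the comparison between an existing regulatory sequence and a desired motif can be sketched as a simple rule set: a length change or a large fraction of differing bases points to a prime editor, while a small number of transition substitutions points to a cytosine or adenine base editor. The rule set, the 50% threshold default, and the returned labels are assumptions of this example; the sketch ignores PAM availability, editing windows, bystander edits, and C-to-G base editors.

```python
# Transition substitutions reachable by deaminase base editors on either strand:
# CBE makes C->T (read as G->A on the opposite strand); ABE makes A->G (T->C).
BASE_EDITOR_FOR = {
    ("C", "T"): "CBE", ("G", "A"): "CBE",
    ("A", "G"): "ABE", ("T", "C"): "ABE",
}


def suggest_modifier(region, desired, max_fraction_changed=0.5):
    """Suggest a genetic modifier class for converting region into desired."""
    region, desired = region.upper(), desired.upper()
    if len(region) != len(desired):
        return "prime editor"  # insertion or deletion required
    changes = [(r, d) for r, d in zip(region, desired) if r != d]
    if not changes:
        return "none"  # motif already present
    if len(changes) / len(region) > max_fraction_changed:
        return "prime editor"  # too dissimilar for point substitutions
    editors = {BASE_EDITOR_FOR.get(change) for change in changes}
    if None in editors:
        return "prime editor"  # at least one transversion
    return "+".join(sorted(editors))


choice = suggest_modifier("GGAC", "GGAT")  # single C->T change
```

A fuller implementation would follow this triage with the allele-prediction steps for the chosen modifier class.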
The identification of a genetic modifier (e.g., programmable nuclease, base editor, prime editor) can be performed through the unbiased saturating mutagenesis of regulatory regions (e.g., promoters, enhancers) associated with a target gene. Comprehensive genetic tiling of these regulatory elements and downstream readout of the editing effects on gene activation can be used to empirically identify combinations of genetic modifiers and target sites that yield varying levels of target gene activation. This strategy can be performed in a pooled library format where target gene activation is a selectable or sortable feature, allowing for the high-throughput identification of candidates for SHAPE.
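By way of non-limiting illustration, tiling a regulatory region for a pooled screen begins with enumerating every targetable site. The sketch below lists 20-nt protospacers followed by an NGG PAM on both strands, with the cut position placed 3 bp upstream of the PAM, which are the standard SpCas9 conventions. Other nucleases or editors would need their own PAM and cut rules, and the toy region is hypothetical.

```python
def revcomp(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def spcas9_sites(region):
    """List (strand, cut, protospacer) for every 20-nt protospacer with an NGG PAM.

    'cut' is reported in forward-strand coordinates as the number of bases
    lying to the left of the predicted blunt cut.
    """
    region = region.upper()
    n = len(region)
    sites = []
    for strand, seq in (("+", region), ("-", revcomp(region))):
        for start in range(n - 22):  # 20-nt protospacer + 3-nt PAM must fit
            if seq[start + 21:start + 23] == "GG":
                cut = start + 17  # 3 bp upstream of the PAM
                forward_cut = cut if strand == "+" else n - cut
                sites.append((strand, forward_cut, seq[start:start + 20]))
    return sorted(sites, key=lambda site: site[1])


protospacer = "ACGTACGTACGTACGTACGT"
region = "AAAAA" + protospacer + "TGG" + "AAAAA"  # hypothetical region with one site
library = spcas9_sites(region)
```

Each entry can then be paired with an editor type to form one member of the pooled library.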
In some embodiments, the sequence motif is introduced as a single motif or a repetitive sequence with multiple copies of the single motif that may or may not have linker sequences interspaced.
In some embodiments, the sequence motif is introduced as a combination of different sequence motifs with predicted additive or synergistic effects on target gene expression at the pre-transcriptional and/or post-transcriptional level, where the multiple sequence motifs may or may not have linker sequences interspaced.
In some embodiments, algorithms to predict sequence alleles following nuclease genome editing events are used to identify a programmable nuclease type (e.g., zinc finger nucleases, transcription activator-like effector nucleases, Cas9, CasX, Cas12) and target cleavage position(s) within a DNA sequence of interest to produce an allele that resembles a sequence motif or modification of interest.
In some embodiments, algorithms focused on patterns of microhomology-mediated end joining (MMEJ) and/or non-homologous end joining (NHEJ) following nuclease-induced DNA double-strand breaks (DSBs) can be used for the prediction of alleles following genome editing by a programmable nuclease (e.g., zinc finger nucleases, transcription activator-like effector nucleases, Cas9, CasX, Cas12)21.
In some embodiments, the predominant +1 insertion allele from Cas9 editing can be used to precisely install sequence motifs of interest22.
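By way of non-limiting illustration, candidate MMEJ deletion alleles can be enumerated by looking for short identical stretches (microhomologies) on either side of the break; repair through such a pair retains one copy and deletes the sequence in between. The sketch ranks candidates by longer microhomology and then shorter deletion, which is a simple heuristic assumed for this example and not a trained prediction model. The length limits and the toy sequence are likewise assumptions.

```python
def mmej_deletions(seq, cut, min_mh=2, max_mh=6, max_deletion=30):
    """Enumerate microhomology-mediated deletion products around a break.

    cut is the number of bases to the left of the double-strand break.
    Returns tuples (microhomology_length, deletion_length, microhomology, product).
    """
    seq = seq.upper()
    by_product = {}
    for k in range(min_mh, max_mh + 1):
        for a in range(0, cut - k + 1):            # left copy ends at or before the cut
            left = seq[a:a + k]
            for b in range(cut, len(seq) - k + 1):  # right copy starts at or after the cut
                if b - a > max_deletion:
                    break
                if seq[b:b + k] == left:
                    product = seq[:a] + seq[b:]     # one copy of the microhomology remains
                    # Longer k is visited later, so the longest microhomology wins.
                    by_product[product] = (k, b - a, left, product)
    return sorted(by_product.values(), key=lambda r: (-r[0], r[1]))


alleles = mmej_deletions("TTCAGGTAGTCAGGAA", cut=8)  # hypothetical target
```

Each predicted product can then be searched for a motif of interest to find cut sites whose likely deletion alleles create that motif.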
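By way of non-limiting illustration, the +1 insertion allele can be predicted under the assumption, drawn from published observations of SpCas9 repair outcomes, that the most frequent single-base insertion duplicates the base immediately upstream of the cut (protospacer position 17, i.e., 4 nt from the PAM). Whether that allele actually predominates depends on the target site, so the rule is a stated assumption of this sketch, and the target sequence and motif are hypothetical.

```python
def plus_one_allele(target, protospacer_start):
    """Return the allele with the base 5' of the cut duplicated.

    target is the PAM-containing strand; protospacer_start is the 0-based
    index of the first protospacer base. The cut is placed 17 bases into
    the protospacer (3 bp upstream of the PAM).
    """
    target = target.upper()
    cut = protospacer_start + 17
    if protospacer_start < 0 or cut >= len(target):
        raise ValueError("protospacer does not fit in the target")
    return target[:cut] + target[cut - 1] + target[cut:]


def installs_motif(target, protospacer_start, motif):
    """True if the +1 allele contains motif and the unedited target does not."""
    motif = motif.upper()
    return motif in plus_one_allele(target, protospacer_start) and motif not in target.upper()


target = "TT" + "ACGTACGTACGTACGTACGT" + "AGG" + "TT"  # protospacer + AGG PAM
allele = plus_one_allele(target, 2)
```

Scanning all protospacers in a region with `installs_motif` would flag sites where a single duplicated base completes a motif of interest.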
In some embodiments, algorithms to predict sequence alleles following base editing events can be used to identify the base editor type (e.g., ABEs, CBEs) and base editing window(s) within a DNA sequence of interest to produce an allele that resembles a sequence motif or modification of interest23,24 25.
In some embodiments, algorithms to predict sequence alleles following base editing events can be used to identify the base editor type (e.g., ABEs, CBEs) and base editing window(s) within a DNA sequence of interest to produce an allele that modifies (e.g., strengthens, weakens) the regulatory potential of an existing sequence motif.
In some embodiments, algorithms to predict sequence alleles following prime editing events can be used to identify the prime editor type (e.g., SpCas9H840A-MMLV-RT), prime editing guide RNAs (pegRNAs), and nicking sgRNAs (ngRNAs) to install a specific sequence motif or modification that can affect the expression of the target gene at the pre-transcriptional or post-transcriptional level2.
In some embodiments, prime editing can be used to install specific substitution edits within an endogenous sequence context to introduce specific sequence motifs of interest. In some embodiments, prime editing can be used to install specific insertion edits within an endogenous sequence context to introduce specific sequence motifs of interest.
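For an insertion edit, the 3′ extension of a pegRNA can be sketched as the reverse complement of the primer binding site (PBS) region followed by the edited sequence downstream of the nick. The example below is an illustrative sketch only: the function names, the default PBS length of 13 nt, the default 15 nt of downstream homology, and the assumption that the nick lies 3 bp upstream of the PAM are all choices made for this example.

```python
def revcomp(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def peg_extension_for_insertion(pam_strand, pam_start, insert,
                                pbs_len=13, homology_len=15):
    """Return the pegRNA 3' extension (RT template + PBS), written 5'->3' as DNA.

    pam_strand is the PAM-containing strand; pam_start indexes the first PAM base.
    """
    nick = pam_start - 3                    # assumed nick position
    if nick < pbs_len or nick + homology_len > len(pam_strand):
        raise ValueError("target too short for the requested PBS/homology")
    pbs_target = pam_strand[nick - pbs_len:nick]        # 3' end of the nicked strand
    edited_flap = insert + pam_strand[nick:nick + homology_len]
    rt_template = revcomp(edited_flap)      # templates the new 3' flap
    pbs = revcomp(pbs_target)               # anneals to the nicked strand
    return rt_template + pbs
```

The full pegRNA would then be the spacer, the scaffold, and this extension in that order.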
In some embodiments, prime editing can be used to install specific insertion edits to modify endogenous sequence motifs within untranslated regions (UTRs) of target gene transcripts or insert sequences that contain RNA-stabilizing or de-stabilizing sequence motifs into untranslated regions (UTRs) of target gene transcripts to affect the abundance and/or expression of the target gene products (e.g., RNA, protein).
In some embodiments, prime editing can be used to install specific deletion edits within an endogenous sequence context to introduce specific sequence motifs of interest.
In some embodiments, prime editing can be used to install specific combination edits (e.g., substitution, insertion, and/or deletion edits) within an endogenous sequence context to introduce specific sequence motifs of interest.
In some embodiments, the design of pegRNAs and ngRNAs for prime editing will introduce silent mutations into the protospacer adjacent motif (PAM).
In some embodiments, the design of ngRNAs for prime editing will preferentially target the genome following the editing event (known as PE3b).
In some embodiments, the introduction of a transcription factor (TF) binding site in promoter or enhancer sequences or any other DNA sequence that might affect transient or heritable gene activation can be induced using a base editor (BE). In this case an endogenous sequence would be altered by BEs to allow TF binding by increasing homology of the endogenous sequence to a known TF binding site, thereby enabling a SHAPE event (BE-SHAPE).
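One way to look for candidate BE-SHAPE sites is to scan a region for windows that differ from a known TF binding consensus only by transitions a given base editor can make. The sketch below is a simplified illustration: the function and table names are hypothetical, only CBE and ABE transitions are modeled, and it ignores PAM availability and editing-window position, which a real design would have to check.

```python
# Transitions each editor can install, as seen on the strand being scanned.
ALLOWED = {
    "CBE": {("C", "T"), ("G", "A")},
    "ABE": {("A", "G"), ("T", "C")},
}

def be_convertible_sites(region, motif, editor):
    """Find windows convertible to the motif by one type of editor transition.

    Returns (start, window, changes), where changes lists (position, from, to).
    """
    allowed = ALLOWED[editor]
    hits = []
    for i in range(len(region) - len(motif) + 1):
        window = region[i:i + len(motif)]
        changes = [(i + p, a, b)
                   for p, (a, b) in enumerate(zip(window, motif)) if a != b]
        kinds = {(a, b) for _, a, b in changes}
        # Require a single transition type so one guide on one strand could do it.
        if len(kinds) == 1 and kinds <= allowed:
            hits.append((i, window, changes))
    return hits
```

For example, with the consensus TGACTCA as the motif, a window reading CGACTCA would be reported as a single C-to-T change for a CBE.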
In some embodiments, the unbiased saturation mutagenesis across genomic sequences (e.g., promoter, enhancers, untranslated regions) can be performed to empirically determine the strategy for SHAPE.
In some embodiments, multiplex genetic editing can be utilized to induce more robust modification of target gene expression at the pre-transcriptional and/or post-transcriptional level.
In some embodiments, multiplex genetic editing can be utilized to modify multiple target gene expressions in parallel at the pre-transcriptional and/or post-transcriptional level.
In some embodiments, the installation of sequence motifs that act as response elements (Table 3C) can be used to perform inducible modification of gene expression (e.g., activation, repression) following the introduction of an exogenous small molecule, hormone, or drug.
In some embodiments, the installation of cell-type-specific sequence motifs can be used to achieve cell-type-specific modification of the expression of a target gene or set of target genes.
In some embodiments, the use of homology directed repair (HDR) through programmable nucleases and a donor DNA (e.g., ssODN, dsODN) can be used to introduce sequence motifs of interest into a target site23.
In some embodiments, the use of NHEJ-mediated DNA sequence integration through programmable nucleases and double-stranded oligodeoxynucleotides (dsODN) or circular DNA donors (e.g., plasmids, minicircles) can be used to introduce sequence motifs of interest into a target site26 27.
In some embodiments, a DNA sequence alteration described herein can be performed using CRISPR-guided DNA base editors such as the cytosine base editor (CBE) that allows for the introduction of C-to-T and G-to-A modifications, the adenine base editor (ABE) that allows for the introduction of A-to-G and T-to-C modifications, the cytosine-to-guanine transversion base editor (CGBE) that allows for the introduction of C-to-G and G-to-C modifications, as well as the synchronous programming of adenine and cytosine editor (SPACE) that allows for the simultaneous introduction of A-to-G (T-to-C on opposite strand) and C-to-T (G-to-A on opposite strand) modifications within the same editing window at the ssDNA bubble generated by RNA-guided fusion proteins.
While CBEs and CGBEs are comprised of a cytidine deaminase (e.g., pmCDA1, hAPOBEC3A, hAID, hAPOBEC3G or rAPOBEC1) as well as a CRISPR Cas RGN or a variant thereof, ABEs contain an adenosine deaminase (e.g., E. coli TadA or variants thereof as monomers or dimers) and a CRISPR Cas protein, and SPACE contains both adenine (e.g., E. coli TadA) and cytosine deaminases (e.g., pmCDA1) as well as CRISPR-Cas proteins (e.g., S. pyogenes Cas9).
In one aspect, the present invention relates to the use of an ABE or SPACE comprising: an adenosine deaminase, e.g., a wild type and/or engineered adenosine deaminase (e.g., ABEs 0.1, 0.2, 1.1, 1.2, 2.1, 2.2, 2.3, 2.4, 2.5, 2.6, 2.7, 2.8, 2.9, 2.10, 2.11, 2.12, 3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.7, 3.8, 4.1, 4.2, 4.3, 5.1, 5.2, 5.3, 5.4, 5.5, 5.6, 5.7, 5.8, 5.9, 5.10, 5.11, 5.12, 5.13, 5.14, 6.1, 6.2, 6.3, 6.4, 6.5, 6.6, 7.1, 7.2, 7.3, 7.4, 7.5, 7.6, 7.7, 7.8, 7.9, 7.10, or ABEmax), E. coli TadA monomer, or variations of homo- or heterodimers thereof, bearing one or more mutations in either or both monomers (e.g., the TadA mutant used in miniABEmax-V82G, miniABEmax-K20A/R21A, or miniABEmax-V106W) that decrease RNA editing activity while preserving DNA editing activity; a cytidine deaminase (e.g., pmCDA1, rat APOBEC1, human APOBEC3A, or human AID) or variations thereof with reduced RNA off-target editing; one or multiple uracil-N-glycosylase inhibitors (UGIs); and a programmable DNA binding domain (e.g., Cas9-D10A); and optionally further comprising one or more nuclear localization sequences (e.g., SV40 large T antigen NLS (PKKKRRV, SEQ ID NO: 1), bipartite NLS (KRTADGSEFESPKKKRKV, SEQ ID NO: 2), or nucleoplasmin NLS (KRPAATKKAGQAKKKK, SEQ ID NO: 3)). Other NLSs are known in the art; see, e.g., Cokol et al., EMBO Rep. 2000 Nov. 15; 1(5): 411-415; Freitas and Cunha, Curr Genomics. 2009 December; 10(8): 550-557.
In some embodiments, the dual-deaminase BE SPACE would be used and it would include a heterodimeric combined N-terminal adenosine and cytidine deaminase fusion (e.g., pmCDA1 or rAPOBEC1 or hA3A or AID fused to TadA monomers or dimers with a linker) or a heterodimeric combined C-terminal adenosine and cytidine deaminase fusion (e.g., pmCDA1 or rAPOBEC1 or hA3A or AID fused to TadA monomers or dimers with a linker). In both N- and C-terminal positions of these “hybrid fusion deaminase designs” the deaminases can be fused in either of these two orders: NH2-cytidine deaminase-linker-adenosine deaminase or NH2-adenosine deaminase-linker-cytidine deaminase.
In some embodiments, the programmable DNA binding domain is selected from the group consisting of engineered C2H2 zinc-fingers, transcription activator-like effectors (TALEs), and Clustered Regularly Interspaced Short Palindromic Repeats (CRISPR) Cas RNA-guided nucleases (RGNs) and variants thereof (e.g., Tables F L and MG).
In some embodiments, the CRISPR RGN is an ssDNA nickase or is catalytically inactive, e.g., a Cas9, CasX or Cas12a that has ssDNA nickase activity or is catalytically inactive.
In one aspect, the present invention relates to a base editing system comprising: (i) ABE, CBE, CGBE, or SPACE, wherein the programmable DNA binding domain is a CRISPR Cas RGN or a variant thereof; and (ii) at least one guide RNA compatible with the base editor that directs the base editor to a target sequence, which can then be deaminated in order to generate, e.g., a TF binding site or any other modification that allows for the transcriptional activation of a targeted gene.
In one aspect, the present invention relates to an isolated nucleic acid encoding any of the CBEs, ABEs, CGBEs, and SPACE or other base editing systems described herein to induce a BE-SHAPE event.
In one aspect, the present invention relates to a vector comprising an isolated nucleic acid described herein.
In one aspect, the present invention relates to an isolated host cell, preferably a mammalian host cell, comprising any of the nucleic acids described herein.
In some embodiments, the isolated host cell described herein expresses a base editor and a gRNA for DNA modification to induce a BE-SHAPE event described herein.
In one aspect, the present invention relates to a method of deaminating a selected adenine and/or cytosine in a nucleic acid, the method comprising contacting the nucleic acid with SPACE, a base editing system, an isolated nucleic acid, a vector, or an isolated host cell described herein.
In one aspect, the present invention relates to a composition comprising a purified CBE, ABE, SPACE, or CGBE, a base editing system, an isolated nucleic acid, a vector, or an isolated host cell described herein.
In some embodiments, the composition includes one or more ribonucleoprotein (RNP) complexes.
In some embodiments, CBE, ABE, CGBE, or SPACE that are being used for BE-SHAPE comprise one or more uracil-N-glycosylase inhibitors (UGIs). In some embodiments, the base editors comprise a linker between the adenosine deaminase and the programmable DNA binding domain as well as the cytidine deaminase and the DNA binding domain, or both in the case of SPACE. In some embodiments the TadA domain can be monomeric, homodimeric or heterodimeric and contain all combinations of wild type (WT) E. coli TadA, or mutant variants of TadA.
In some embodiments one or two deaminase domains can be located at the C-terminus (e.g., pmCDA1) and N-terminus (TadA) or vice versa or they can both be located at the C- or N-terminus.
In some embodiments, the programmable DNA binding domain is selected from the group consisting of engineered C2H2 zinc-fingers, transcription activator-like effectors (TALEs), and Clustered Regularly Interspaced Short Palindromic Repeats (CRISPR)-Cas RNA-guided nucleases (RGNs) and variants thereof.
In some embodiments, the CRISPR-Cas RGN is an ssDNA nickase or is catalytically inactive, e.g., a Cas9 or Cas12a that is catalytically inactive or has ssDNA nickase activity (Table 4A).
In some embodiments, the methods include the use of base editing systems comprising (i) the adenine base editors described herein, wherein the programmable DNA binding domain is a CRISPR Cas RGN or a variant thereof (Table 4B); and (ii) at least one guide RNA compatible with the base editor that directs the base editor to a target sequence.
In some embodiments, the methods include the use of nucleic acids encoding ABE, CBE, CGBE or SPACE; vectors comprising the isolated nucleic acids; and isolated host cells, preferably mammalian host cells, comprising the nucleic acids. In some embodiments, the isolated host cell expresses an adenine base editor.
In some embodiments, the methods include the use of methods for deaminating a selected adenine in a nucleic acid, the method comprising contacting the nucleic acid with an adenine base editor or base editing system as described herein.
In some embodiments, the methods include the use of compositions comprising a purified ABE, CBE, CGBE or SPACE or base editing system as described herein. In some embodiments, the composition comprises one or more ribonucleoprotein (RNP) complexes.
In some embodiments, the guide RNAs that are used to target BEs for BE-SHAPE can be truncated by reducing spacer length to <20 nucleotides, e.g., to 17-18 nucleotides, which has been shown to enhance specificity of CRISPR-Cas nucleases (Fu & Sander et al, Nature Biotechnology 2014, 32(3):279-284) and may also affect base editing windows (Kim et al, Nature Biotechnology 2017, 35(4): 371-376).
In some embodiments, multiplex base editing with CBE, ABE, CGBE, and SPACE can be used to enhance the efficiency of BE-SHAPE. In this aspect, one or more gRNAs can be guiding one or more base editors (e.g., CBE and ABE, or ABE and CGBE, or SPACE and CGBE) to one or more genomic target sites to install multiple TF binding sites (or other sequence changes that drive transcriptional activation) at once in one cell or a population of cells, or a tissue, both in vivo or ex vivo.
In some embodiments, anti-CRISPR tools (Bondy-Denomy et al, Nature 2013, 493(7432):429-32 and Nature 2015, 526(7571):136-9; Pawluk et al, Cell 2016, 167(7):1829-1838.e9.) can be used to control the efficiency of CRISPR-based SHAPE platforms by altering the capacity of DNA binding and/or cleavage of the CRISPR-Cas protein when used as a nuclease or within a base or prime editor.
S. pyogenes Cas9 (SpCas9)
S. aureus Cas9 (SaCas9)
S. thermophilus Cas9
S. pasteurianus Cas9
C. jejuni Cas9 (CjCas9)
F. novicida Cas9 (FnCas9)
P. lavamentivorans Cas9
C. lari Cas9 (ClCas9)
Pasteurella multocida Cas9
F. novicida Cpf1 (FnCpf1)
M. bovoculi Cpf1 (MbCpf1)
A. sp. BV3L6 Cpf1 (AsCpf1)
L. bacterium N2006 (LbCpf1)
S. pyogenes Cas9
S. pyogenes Cas9
S. pyogenes Cas9
S. pyogenes Cas9
S. pyogenes Cas9
S. pyogenes Cas9
S. pyogenes Cas9
S. pyogenes Cas9
S. pyogenes Cas9
S. aureus Cas9
S. aureus Cas9 with PAM interaction domain from Streptococcus macacae (Smac)
N. meningitidis
SHAPE Applications
Examples of applications of the SHAPE strategy include SHAPE-TF (transcription factor), SHAPE-UTR (untranslated region), SHAPE-RNAi (RNA interference), and SHAPE-protein (e.g., Kozak sequence, codon optimization).
SHAPE-TF
In some instances, the SHAPE strategy can alter target gene expression at the pre-transcriptional level through the targeted introduction of transcription factor binding motifs (SHAPE-TF) into regions in the genome. Following the introduction of these transcription factor binding motifs (e.g., identified using a method described herein, or from a database such as the JASPAR database (e.g., 8th release; Fornes et al, “JASPAR 2020: update of the open-access database of transcription factor binding profiles”, Nucleic Acids Research Volume 48, Issue D1, 8 Jan. 2020, Pages D87-D92)), the recruitment of endogenous transcription factors to the target locus can enable the activation or repression of target gene expression through the increase or decrease in gene transcription, respectively. In addition to the introduction or modification of transcription factor binding sites, SHAPE-TF can also work through the modification of spacing between endogenous transcription factor binding sites to achieve alteration of target gene expression. To achieve SHAPE-TF, there are several steps that include 1) the identification/discovery of sequence motifs that can actively recruit transcription factors or modify endogenous transcription factor spacing, respectively, 2) the identification of target genomic sequence(s) to introduce the sequence motif or modification to affect target gene expression, and 3) identification of the genetic modifier to use to install the precise edit.
SHAPE-UTR
In some instances, the SHAPE strategy can alter target gene expression at the post-transcriptional level through the targeted introduction of sequence motifs into untranslated regions (SHAPE-UTR) of target gene transcripts to affect transcript stability and downstream protein expression. Following the introduction of these sequence motifs into UTRs, the stabilization or de-stabilization of target gene transcripts can enable the activation or repression of target gene expression through the increase or decrease in gene translation, respectively. To achieve SHAPE-UTR, there are several steps that include 1) the identification or discovery of sequence motifs that can modify transcript stability when placed at the 5′ and/or 3′ UTRs, 2) the identification of specific regions within the 5′ and/or 3′ UTRs to introduce a sequence motif to affect transcript stability, and 3) identification of the genetic modifier to use to install the precise edit.
SHAPE-RNAi
In some instances, the SHAPE strategy can alter target gene expression at the post-transcriptional level through the targeted introduction of sequence motifs (e.g., microRNA binding sites on DNA/RNA, e.g., from the miRTarBase database (e.g., miRTarBase Release 8.0; Chou et al, “miRTarBase update 2018: A resource for experimentally validated microRNA-target interactions” Nucleic Acids Research 2018 Jan. 4; 46(D1):D296-D302)) that are targeted by endogenous non-coding RNAs (e.g., miRNAs, siRNAs, lncRNAs) to achieve RNA interference (SHAPE-RNAi). Following the introduction of these sequence motifs, endogenous non-coding RNAs may bind the introduced sequence and de-stabilize or inhibit translation of the target transcripts. To achieve SHAPE-RNAi, there are several steps that include 1) the identification or discovery of sequence motifs that can be targeted by endogenous non-coding RNA molecules, 2) the identification of specific regions in the target transcript that can promote RNA interference, and 3) identification of the genetic modifier to use to install the precise edit.
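As a small illustration of step 1, a candidate microRNA binding motif can be derived from the seed region of a mature miRNA: the site on the transcript is the reverse complement of miRNA nucleotides 2-8. The sketch below returns that site as DNA so it can be used as an insertion motif. The function name is hypothetical, and the example input in the usage note is illustrative and should be checked against a curated database before use.

```python
def seed_match_site(mirna_rna, start=2, end=8):
    """DNA sequence (transcript sense) complementary to the miRNA seed.

    start and end are 1-based, inclusive positions in the mature miRNA.
    """
    seed = mirna_rna.upper().replace("T", "U")[start - 1:end]
    complement = {"A": "T", "U": "A", "G": "C", "C": "G"}
    # Reverse complement: the site pairs antiparallel to the seed.
    return "".join(complement[base] for base in reversed(seed))
```

For instance, `seed_match_site("UGAGGUAGUAGGUUGUAUAGUU")` returns the 7-mer `CTACCTC`, which could then be installed in a 3′ UTR by an insertion-capable genetic modifier.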
SHAPE-Protein
In some instances, the SHAPE strategy can alter target gene expression at the post-transcriptional level through sequence optimization for target gene transcripts (SHAPE-protein) to increase translation efficiency. The introduction of sequence motifs can be performed through the de novo introduction of new regulatory elements or optimization of endogenous regulatory elements that play a role in transcription or translation initiation and/or elongation. These sequence motifs can be related to elements such as the Kozak sequence or optimal codon structures for the coding regions of target gene transcripts. To achieve SHAPE-protein, there are several steps that include 1) the identification or discovery of sequence motifs that can modify translation initiation and/or elongation efficiency, 2) the identification of specific regions in the genome to introduce a sequence motif for modification of translation initiation and/or elongation, and 3) identification of the genetic modifier to use to install the precise edit.
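To illustrate how an endogenous start-codon context might be assessed before editing, the sketch below classifies the Kozak context using the two positions generally regarded as most influential: a purine at -3 and a G at +4 relative to the A of the ATG. The function name and the three-level classification are assumptions made for this example, not a method stated in this disclosure.

```python
def kozak_strength(seq, atg_index):
    """Classify the start-codon context as 'strong', 'adequate', or 'weak'.

    seq is the coding-strand DNA and atg_index is the index of the A in ATG.
    """
    if seq[atg_index:atg_index + 3] != "ATG":
        raise ValueError("no ATG at the given index")
    if atg_index < 3 or atg_index + 3 >= len(seq):
        raise ValueError("not enough flanking sequence")
    minus3 = seq[atg_index - 3]     # position -3 relative to the A of ATG
    plus4 = seq[atg_index + 3]      # position +4 (base right after ATG)
    score = (minus3 in "AG") + (plus4 == "G")
    return {2: "strong", 1: "adequate", 0: "weak"}[score]
```

A context scored as weak or adequate would be a candidate for a substitution edit at -3 or +4.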
The present methods can include using SHAPE to alter expression of disease related target genes. Thus in some embodiments, the target gene is associated with a disease. Examples of haploinsufficiency diseases (Table 5; from Matharu N, Rattanasopha S, Tamura S, et al. Science. 2019; 363(6424)) and diseases caused by non-coding mutations (Table 6) are listed below. Rather than correcting a mutation that leads to haploinsufficiency, the present methods can be used to rescue haploinsufficiency with synthetic upregulation of a healthy allele—without reverting the “damaged” allele to WT sequence. Table 6 shows disease-related enhancer SNPs that cause other diseases based on down-regulated or upregulated expression. Again, rather than correcting the exact SNPs/genetic variants that cause disease, the present methods can be used to mitigate the effects of those SNPs.
The present methods can include using SHAPE to alter expression of cell fate- and differentiation-related target genes, by introducing or removing a binding site for a cell-reprogramming transcription factor (e.g., adding a site for an activating TF or deleting a site that binds a repressor, activating by de-repression, or adding a site for a repressive TF to down-regulate expression). Installation of sequence motifs can be used to drive programmable cellular differentiation from one cell type to another target cell type.
SHAPE can be used to program cellular differentiation (from iPSCs to specific cell types, or from one specific cell type into another) by modifying the expression of TF genes involved in cell identity33. The introduction of sequence motifs to modify (e.g., increase or decrease) the expression of a target gene product (e.g., RNA, protein) can be performed with a genetic modifier, where the resulting change in transcription of a single target gene or multiple target genes leads to differentiation of a cell type to another target cell type. Examples of cell-reprogramming transcription factors are listed in Table 7 below.
Methylation of CpG dinucleotides at promoters or enhancers is known to cause transcriptional silencing34. This DNA modification is maintained during replication by endogenous DNMT1. Previous studies introducing methylated DNA at the promoter of reporter plasmids showed repression of the reporter genes35-37. SHAPE can be used for the installation of methylated CpGs at regulatory elements (promoters or enhancers) for heritable target gene repression. To achieve CpG methylation at the promoters or enhancers of target genes, first, we will identify genes that are highly expressed in target cells. Second, we will methylate the CpGs of the insertion motifs in vitro using methylases and SAM. The methylated CpGs can be inserted into the target regions via genetic modifiers. The in vivo methylation status can be determined via bisulfite genomic sequencing of target regions, and the repression effect on target genes can be validated by RT-qPCR.
RNA interference mediated by non-coding RNA is a common strategy for target gene repression, where an RNA sequence with complementarity to a target messenger RNA (mRNA) hybridizes to its target and either accelerates degradation of the mRNA or prevents translation of the mRNA, ultimately reducing target gene protein production. Here we describe a strategy within SHAPE to install sequence motifs that induce targeting by endogenous non-coding RNAs (e.g., miRNAs, siRNAs, lncRNAs) for heritable target gene repression via RNA interference. Following the identification of endogenously expressed non-coding RNAs (e.g., miRNAs, siRNAs, lncRNAs) in a cell type or cell types of interest, the specific binding sites of these non-coding RNAs can be introduced into the target gene transcript at either the 5′ or 3′ UTR or in intronic regions via genetic modifiers with the ability to introduce insertion edits (e.g., programmable nucleases+ssODNs/dsODNs, prime editors). This method would be validated by RT-qPCR of the target gene transcript to assess RNA knockdown, or by protein-based assays (e.g., western blot, ELISA) to assess downstream protein knockdown, to determine whether target gene repression is achieved.
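The bisulfite readout can be illustrated with a short simulation: bisulfite treatment converts unmethylated cytosines to uracil (read as T after PCR), while methylated cytosines remain C. The sketch below models only the top strand and assumes complete conversion; the function names are hypothetical.

```python
def cpg_positions(seq):
    """Indices of the C in every CpG dinucleotide of the top strand."""
    return [i for i in range(len(seq) - 1) if seq[i:i + 2] == "CG"]

def bisulfite_convert(seq, methylated_positions=()):
    """Simulate complete bisulfite conversion of the top strand.

    Cytosines at methylated_positions are protected; all others read as T.
    """
    protected = set(methylated_positions)
    return "".join("T" if base == "C" and i not in protected else base
                   for i, base in enumerate(seq))

def methylation_calls(reference, read):
    """Map each reference CpG position to True if the read retains a C there."""
    return {i: read[i] == "C" for i in cpg_positions(reference)}
```

Comparing `methylation_calls` across reads from edited cells over time would indicate whether the installed methylation is maintained.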
SHAPE to Increase or Decrease Transcript Stability, and/or Transcription, and/or Protein Output
The activation of target gene protein output can also be performed through the introduction of sequence motifs in the untranslated region to increase transcript stability. Following the identification or generation of sequence motifs that have the potential to promote RNA stability, genetic modifiers with the ability to introduce insertion edits (e.g., programmable nucleases+ssODNs/dsODNs, prime editors) can be used to target the 5′ and/or 3′ untranslated regions (UTRs) of target genes. These RNA-stabilizing sequence motifs can be inserted at all targetable positions of the UTRs to find the optimal positioning of the editing event. The stabilization of the RNA can be read out through RT-qPCR to quantitate the increase in target RNA abundance that should occur through a decreased rate of RNA degradation.
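One common way to turn RT-qPCR measurements into a relative abundance is the comparative Ct (2^-ΔΔCt) calculation, sketched below. Its use here is an assumption for illustration, since the analysis method is not specified in this passage; it also assumes roughly 100% amplification efficiency and a stably expressed reference gene. The function and parameter names are hypothetical.

```python
def fold_change(ct_target_edited, ct_ref_edited, ct_target_control, ct_ref_control):
    """Relative target RNA abundance (edited vs. control) by the 2^-ddCt method."""
    d_ct_edited = ct_target_edited - ct_ref_edited      # normalize to reference gene
    d_ct_control = ct_target_control - ct_ref_control
    dd_ct = d_ct_edited - d_ct_control
    return 2 ** (-dd_ct)
```

A value above 1 would be consistent with increased transcript abundance in edited cells, and a value below 1 with repression.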
Multiplex genome editing with programmable nucleases, base editors, and prime editors can be utilized to induce more robust activation for a single target gene. The potential to modify more than one genomic location within regulatory elements of a single target gene may enable additive or synergistic effects for target gene activation. Previous studies showed multiplexed genome engineering of up to 25 human endogenous targets, suggesting up to 25 sites can be simultaneously edited for robust activation of target genes (McCarty, N. S., Graham, A. E., Studená, L. et al. Nat Commun 11, 1281 (2020), Campa, C. C., Weisbach, N. R., Santinha, A. J. et al. Nat Methods 16, 887-893 (2019)). Multiplex editing can be combinations of different strategies, including: 1) introduction of a de novo transcription factor binding site 2) modification of an endogenous transcription factor site to either increase or decrease its regulatory potential 3) modification of endogenous spacing of transcription factor binding sites to increase or decrease regulatory potential. Multiplex editing can be performed within a single regulatory element (e.g., promoter, enhancer), or across multiple regulatory elements.
Multiplex genome editing with programmable nucleases, base editors, and prime editors can also be utilized to induce multi-gene activation. Previous studies showed multiplexed genome engineering of up to 25 human endogenous targets, suggesting up to 25 sites can be simultaneously edited for multi-gene activation (McCarty, N. S., Graham, A. E., Studená, L. et al. Nat Commun 11, 1281 (2020), Campa, C. C., Weisbach, N. R., Santinha, A. J. et al. Nat Methods 16, 887-893 (2019)). The potential to modify more than one genomic location may enable the ability to activate multiple genes in parallel with SHAPE. Multiplex editing can be combinations of different strategies, including: 1) introduction of a de novo transcription factor binding site 2) modification of an endogenous transcription factor site to either increase or decrease its regulatory potential 3) modification of endogenous spacing of transcription factor binding sites to increase or decrease regulatory potential. Multiplex editing can be performed within a single regulatory element (e.g., promoter, enhancer), or across multiple regulatory elements.
The invention is further described in the following examples, which do not limit the scope of the invention described in the claims. The examples described herein show gene expression activation by the SHAPE platform using both dual adenine and cytosine base editors and prime editors to install de novo transcription factor binding sites in HEK293T cells. A variety of transcription factor binding motifs were inserted into endogenous genomic contexts, and gene expression changes were measured at the RNA level. Our findings broaden the capabilities for durable and heritable gene activation by precision genome engineering.
Methods and Materials
The Methods and Materials described herein were used in the Examples provided herein.
Molecular Cloning
All base editor (BE) and prime editor (PE) constructs were cloned into a mammalian expression plasmid backbone under the control of a pCMV promoter (AgeI and NotI restriction digest of parental plasmid Addgene No. 112101). Gibson fragments with matching overlaps were PCR-amplified using Phusion High-fidelity polymerase (NEB). Fragments were gel-purified and assembled for 1 hour at 50° C. and transformed into chemically competent E. coli (XL1-Blue, Agilent). All guide RNA (gRNA) constructs were cloned into a BsmBI-digested pUC19-based entry vector (BPK1520, Addgene No. 65777) with a U6 promoter driving gRNA expression. We designed the pegRNAs following the previously described default design rules for designing pegRNAs and ngRNAs (Anzalone et al, Nature 2019, 576, pages 149-157). PegRNAs were cloned into the BsaI-digested pU6-pegRNA-GG-acceptor entry vector (Addgene No. 132777) and ngRNAs were cloned into the BsmBI-digested entry vector BPK1520 that is mentioned above. Oligos containing the spacer, the 5′-phosphorylated pegRNA scaffold, and the 3′ extension sequences were annealed to form dsDNA fragments with compatible overhangs and ligated using T4 ligase (NEB). All plasmids used for transfection experiments were prepared using Qiagen Midi or Maxi Plus kits.
Guide RNAs Used in Nuclease and Base Editor Experiments
All gRNAs for base editors were of the form 5′-NNNNNNNNNNNNNNNNNNNNCGTTTTAGAGCTAGAAATAGCAAGTTAAAATAAGGCTAGTCCGTTATCAACTTGAAAAAGTGGCACCGAGTCGGTGCTTTTTTT-3′ (SEQ ID NO: 4).
Prime Editing Guide RNAs (pegRNAs)
All pegRNAs for prime editors were of the form 5′-NNNNNNNNNNNNNNNNNNNNGTTTTAGAGCTAGAAATAGCAAGTTAAAATAAGGCTAGTCCGTTATCAACTTGAAAAAGTGGCACCGAGTCGGTGCTTTTTTT-3′ (SEQ ID NO: 11).
PE3 nicking guide RNAs (ngRNAs)
All nicking gRNAs for the PE3 system were of the form 5′-NNNNNNNNNNNNNNNNNNNNCGTTTTAGAGCTAGAAATAGCAAGTTAAAATAAGGCTAGTCCGTTATCAACTTGAAAAAGTGGCACCGAGTCGGTGCTTTTTTT-3′ (SEQ ID NO: 49).
Cell Culture and Transfections
STR-authenticated HEK293T (CRL-3216), K562 (CCL-243), HeLa (CCL-2), and U2OS cells (similar match to HTB-96; gain of #8 allele at the D5S818 locus) were used in this study. HEK293T and HeLa cells were grown in Dulbecco's Modified Eagle Medium (DMEM, Gibco) with 10% heat-inactivated fetal bovine serum (FBS, Gibco) supplemented with 1% penicillin-streptomycin (Gibco) antibiotic mix. K562 cells were grown in Roswell Park Memorial Institute (RPMI) 1640 Medium (Gibco) with 10% FBS supplemented with 1% Pen-Strep and 1% GlutaMAX (Gibco). U2OS cells were grown in DMEM with 10% FBS supplemented with 1% Pen-Strep and 1% GlutaMAX. Cells were grown at 37° C. in 5% CO2 incubators and periodically passaged upon reaching around 80% confluency. Cell culture media supernatant was tested for mycoplasma contamination using the MycoAlert mycoplasma detection kit (Lonza) and all tests were negative throughout the experiments.
Transfections
HEK293T cells were seeded at 1.25×10^4 cells per well into 96-well flat bottom cell culture plates (Corning) for DNA on-target experiments or at 6.25×10^4 cells per well into 24-well cell culture plates (Corning). 24 hours post-seeding, cells were transfected with 30 ng of control or base/prime editor plasmid and 10 ng of gRNA plasmid (and 3.3 ng nicking gRNA plasmid for PE3) using 0.3 μL of TransIT-X2 (Mirus) lipofection reagent for experiments in 96-well plates, or with 150 ng control or base editor plasmid and 50 ng gRNA, or 375 ng dCas9-VPR and 125 ng gRNA, and 3 μL TransIT-X2 for experiments in 24-well plates. K562 cells were electroporated using the SF Cell Line Nucleofector X Kit (Lonza) or Kit V (Lonza), according to the manufacturer's protocol with 2×10^5 cells per nucleofection and 800 ng control or base/prime editor plasmid, 200 ng gRNA or pegRNA plasmid, and 83 ng nicking gRNA plasmid (for PE3) or with 1×10^6 cells per nucleofection and 3840 ng control or prime editor plasmid, 960 ng pegRNA plasmid, and 398.4 ng of nicking gRNA plasmid (for PE3), and 3750 ng dCas9-VPR plasmid and 1250 ng of gRNA. When GFP plasmid was co-transfected, half the amount of gRNAs or nicking gRNAs was used. U2OS cells were electroporated using the SE Cell Line Nucleofector X Kit (Lonza) with 2×10^5 cells and 800 ng control or base/prime editor plasmid, 200 ng gRNA or pegRNA, and 83 ng nicking gRNA (for PE3). HeLa cells were electroporated using the SE Cell Line 4D-Nucleofector X Kit (Lonza) with 5×10^5 cells and 800 ng control or base/prime editor, 200 ng gRNA or pegRNA, and 83 ng nicking gRNA (for PE3). 72 hours post-transfection, cells were lysed for extraction of genomic DNA (gDNA).
DNA and RNA Extraction
For DNA on-target experiments in 96-well plates, 72 h post-transfection, cells were washed with PBS and lysed with freshly prepared 43.5 μL DNA lysis buffer (50 mM Tris HCl pH 8.0, 100 mM NaCl, 5 mM EDTA, 0.05% SDS), 5.25 μL Proteinase K (NEB), and 1.25 μL 1M DTT (Sigma). For DNA off-target experiments in 24-well plates, cells were lysed in 174 μL DNA lysis buffer, 21 μL Proteinase K, and 5 μL 1M DTT. For RNA off-target experiments, GFP sorted cells were split 20% for DNA and 80% for RNA extraction. Cells were centrifuged (200 g, 8 min) and lysed as above for DNA or with 350 μL RNA lysis buffer LBP (Macherey-Nagel) for RNA. DNA lysates were incubated at 55° C. on a plate shaker overnight, then gDNA was extracted with 2× paramagnetic beads (as previously described), washed 3 times with 70% EtOH, and eluted in 30-80 μL 0.1×EB buffer (Qiagen). RNA lysates were extracted with the NucleoSpin RNA Plus kit (Macherey-Nagel) following the manufacturer's instructions.
Targeted Amplicon Sequencing
DNA targeted amplicon sequencing was performed as previously described (Grunewald et al., Nature 2019, 569, pages 433-437). Briefly, extracted gDNA was quantified using the Qubit dsDNA HS Assay Kit (Thermo Fisher). Amplicons were constructed in 2 PCR steps. In the first PCR, regions of interest (170-250 bp) were amplified from 5-20 ng of gDNA with primers containing Illumina forward and reverse adapters on both ends. PCR products were quantified on a Synergy HT microplate reader (BioTek) at 485/528 nm using a Quantifluor dsDNA quantification system (Promega), then pooled and cleaned with 0.7× paramagnetic beads, as previously described. In a second PCR step (barcoding), unique pairs of Illumina-compatible indexes (equivalent to TruSeq CD indexes, formerly known as TruSeq HT) were added to the amplicons. The amplified products were cleaned up with 0.7× paramagnetic beads, quantified with the Quantifluor or Qubit systems, and pooled before sequencing. The final library was sequenced on an Illumina MiSeq machine using the MiSeq Reagent Kit v2 (300 cycles, 2×150 bp, paired-end). Demultiplexed FASTQ files were downloaded from BaseSpace (Illumina).
Targeted Amplicon Sequencing Analysis
Amplicon sequencing data were analyzed with CRISPResso2 version 2.0.30 (ref. 16), run in base editor output mode. Allele frequency tables (CRISPResso output) display an editing window that includes the edited As or Cs (CCA motif in positions 3-4-5, with 1 being the most PAM-distal base).
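As an illustration of running CRISPResso2 in base editor output mode, the sketch below assembles a command line for one paired-end sample. It is a minimal example under stated assumptions, not the exact invocation used here: the file names, amplicon, guide, and sample name are hypothetical placeholders, and the flags shown should be checked against the documentation of the installed CRISPResso2 version.

```python
import shlex


def crispresso_command(fastq_r1, fastq_r2, amplicon_seq, guide_seq, name,
                       conv_from="C", conv_to="T"):
    """Build an argument list for one CRISPResso2 run in base editor mode."""
    return [
        "CRISPResso",
        "--fastq_r1", fastq_r1,
        "--fastq_r2", fastq_r2,
        "--amplicon_seq", amplicon_seq,
        "--guide_seq", guide_seq,
        "--name", name,
        "--base_editor_output",
        "--conversion_nuc_from", conv_from,
        "--conversion_nuc_to", conv_to,
    ]


# Hypothetical placeholder inputs; real amplicons here are 170-250 bp.
cmd = crispresso_command(
    fastq_r1="sample_R1.fastq.gz",
    fastq_r2="sample_R2.fastq.gz",
    amplicon_seq="AAGTCCAACGTACGTACGTACGAGGAA",
    guide_seq="GTCCAACGTACGTACGTACG",
    name="example_site",
)

if __name__ == "__main__":
    # Print the command for inspection; pass cmd to subprocess.run to execute.
    print(shlex.join(cmd))
```

Because a dual editor converts both C and A, A-to-G conversions could be summarized with a second run that sets `conv_from="A"` and `conv_to="G"`.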
Flow Cytometry
72 hours post-transfection, cells were washed with cell staining buffer (BioLegend) and incubated with PE-conjugated IL2RA (BioLegend), HER2 (BioLegend), or EpCAM (BioLegend) antibody for 15 minutes, followed by two washes with cell staining buffer. All PE-positive cells were sorted, and the sorted cells were measured on an LSR Fortessa X-20 flow cytometer (BD) to test the durability of target protein expression. Cells transfected with GFP plasmid were resuspended in cell staining buffer 72 hours post-transfection, and the top 50% of GFP-positive cells were sorted.
Analysis of Potential SPACE-Encodable Transcription Factor Binding Sites
Annotated transcription start sites from hg38 RefSeq genes were obtained and filtered to exclude micro-RNAs, small NF90 associated RNAs (SNARs), long non-coding RNAs, small nucleolar RNAs, and anti-sense transcripts. These RNAs were filtered in part due to redundant annotations at the same transcription start sites (TSS) and to focus on protein-coding genes. Next, the remaining TSSes were padded to include the region −500 bp to 0 bp relative to the start site. From this 500 bp per-gene window, we found all matches of the “NNCCA GG” (SEQ ID NO:5) motif on either strand that contain a preferential dual-editing window for SPACE and a canonical SpCas9 PAM (NGG). In total, this defined 53,750 protospacers (45,383 unique). To assess the potential for TF motif creation, we used the reference sequence to create “SPACE-edited” sequences for each of the protospacers by modifying the CCA to TTG. Using a set of 386 transcription factor motifs from the JASPAR2016 motif list (ref. 20), we determined which motifs could be created with the SPACE modification for each transcription factor. Motif matching was performed using the motifmatchr package with default parameters, as part of the chromVAR suite of tools (ref. 21). Created motifs were those that did not occur in the reference sequence but were matches in the SPACE-edited sequence.
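The site search and in silico edit described above can be sketched as follows. This is a simplified illustration rather than the analysis code: it assumes a 20-nt protospacer with C3-C4-A5 (position 1 being PAM-distal) followed directly by an NGG PAM, and it substitutes exact consensus-string matching for the position weight matrix matching performed by motifmatchr. All function names and the toy sequences are hypothetical.

```python
import re

_COMP = str.maketrans("ACGT", "TGCA")
# 20-nt protospacer with C3-C4-A5, then an NGG PAM. The lookahead allows
# overlapping sites to be reported.
_SITE = re.compile(r"(?=([ACGT]{2}CCA[ACGT]{15})[ACGT]GG)")


def revcomp(seq):
    """Reverse complement of an uppercase A/C/G/T sequence."""
    return seq.translate(_COMP)[::-1]


def find_space_sites(window):
    """Yield (strand, start, protospacer) for candidate sites on either strand.

    start is a 0-based offset on the scanned strand (the reverse complement
    of the window for '-' hits).
    """
    window = window.upper()
    for strand, seq in (("+", window), ("-", revcomp(window))):
        for match in _SITE.finditer(seq):
            yield strand, match.start(), match.group(1)


def space_edit(window, strand, start):
    """Return the window, in its original orientation, with CCA changed to TTG."""
    seq = window.upper() if strand == "+" else revcomp(window.upper())
    if seq[start + 2:start + 5] != "CCA":
        raise ValueError("no CCA at protospacer positions 3-5")
    edited = seq[:start + 2] + "TTG" + seq[start + 5:]
    return edited if strand == "+" else revcomp(edited)


def created_motifs(reference, edited, consensus_by_name):
    """Names of motifs found (either strand) in edited but not in reference.

    Simplification: exact consensus strings stand in for PWM matching.
    """
    def has(seq, motif):
        return motif in seq or revcomp(motif) in seq

    return sorted(name for name, motif in consensus_by_name.items()
                  if has(edited, motif) and not has(reference, motif))


if __name__ == "__main__":
    toy_window = "AAGTCCAACGTACGTACGTACGAGGAA"  # hypothetical sequence
    for strand, start, protospacer in find_space_sites(toy_window):
        edited = space_edit(toy_window, strand, start)
        print(strand, start, protospacer, edited)
        print(created_motifs(toy_window, edited, {"toy_motif": "GTTTGA"}))
```

Editing on the scanned strand and reverse-complementing back keeps the minus-strand case identical to the plus-strand case, so the CCA to TTG change appears as TGG to CAA in the forward orientation.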
Measurement of Target Gene Expression for Inserting TF Motifs by PE3
HEK293T cells were transfected by lipofection with PE3 (60 ng), pegRNA (20 ng), and nicking gRNA (6.64 ng). 24 hours prior to transfection, HEK293T cells (625000) were seeded in 24-well plates and then transfected with the plasmids using 3 μL of TransIT-X2 (Mirus Bio, cat #MIR6003). For target gene expression analysis, total RNA was extracted from the cells 72 hours post-transfection using the NucleoSpin RNA Plus Kit (Clontech, cat #740984.250), and 250 ng of purified RNA was used for cDNA synthesis using a High Capacity RNA-to-cDNA kit (ThermoFisher, cat #4387406). 3 μL of 1:20 diluted cDNA was amplified by quantitative PCR (qPCR) using Fast SYBR Green Master Mix (ThermoFisher, cat #4385612) with the primers listed elsewhere in this application. qPCR reactions were performed on a LightCycler 480 (Roche) with the following program: initial denaturation at 95° C. for 20 seconds (s) followed by 45 cycles of 95° C. for 3 s and 60° C. for 30 s. Ct values greater than 35 were set to 35, because Ct values fluctuate for transcripts expressed at very low levels. Gene expression levels were normalized to HPRT1 and calculated relative to that of the negative controls (PE3 with pegRNA cassette).
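The normalization described above can be illustrated with a short calculation. The sketch below assumes the common 2^(−ΔΔCt) approach with 100% amplification efficiency, which the text does not state explicitly. It also applies the Ct cap of 35 to every input, and the function names and example Ct values are hypothetical.

```python
CT_CAP = 35.0  # Ct values above 35 are treated as 35, per the text


def cap_ct(ct):
    """Apply the Ct ceiling used for very lowly expressed transcripts."""
    return min(ct, CT_CAP)


def relative_expression(ct_target, ct_hprt1, ct_target_ctrl, ct_hprt1_ctrl):
    """Fold change of the target gene versus the negative control.

    Assumption: 2^(-ddCt) with HPRT1 as the reference gene.
    """
    d_ct_sample = cap_ct(ct_target) - cap_ct(ct_hprt1)
    d_ct_control = cap_ct(ct_target_ctrl) - cap_ct(ct_hprt1_ctrl)
    return 2.0 ** -(d_ct_sample - d_ct_control)


if __name__ == "__main__":
    # Hypothetical Ct values: the control target Ct of 38 is capped at 35.
    print(relative_expression(28.0, 22.0, 38.0, 22.0))
```

Capping the control Ct makes the reported fold change conservative for genes that are essentially silent in control cells, because the baseline is not allowed to drift above cycle 35.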
First, a bioinformatic analysis was performed to identify candidate genes with endogenous promoter sequences that can be converted into TF binding sites using SPACE. Potential target sites had to lie within −500 to 0 bp relative to the TSS and have a C3C4A5 motif with respect to the protospacer (1 being the most PAM-distal position). Only sites with a canonical NGG PAM were considered. The number of genes with one or more creatable TF binding sites is shown in
We tested the SHAPE system at the MYOD1 locus in HEK293T cells. First, we identified candidate transcription factor motif sequences to insert at varying positions of the MYOD1 locus, ranging from 250 bp upstream to 50 bp downstream of the MYOD1 transcription start site (TSS). We identified ELF, NFY, and SP transcription factors as being actively expressed in HEK293T cells, and additionally included GATA1 and EWS-FLI1 motifs as positive controls, for which GATA1 and EWS-FLI1 would be supplied exogenously (Table D).
Next, we designed prime editing guide RNAs (pegRNAs) and nicking sgRNAs (ngRNAs) targeting sites near the MYOD1 TSS (Tables B and C). We tested prime editing using the PE3 strategy and achieved meaningful insertion editing efficiencies across three different pegRNAs and 10 different motif insertions (
We tested the durability of the SHAPE system at the endogenous promoters of MYOD1, HBB, IL2RA, HER2, and EpCAM in HEK293T and K562 cells. To do so, we first enriched edited cells by sorting for GFP signal or target protein expression. To sort cells for GFP signal, cells were co-transfected with a separate GFP plasmid and a prime editor complex designed to install the ELF(2×) motif at the MYOD1, HBB, IL2RA, HER2, or EpCAM promoter. To compare the durability of SHAPE with CRISPRa, cells co-transfected with dCas9-VPR targeting the same target regions and GFP plasmid were also sorted for GFP. We confirmed stable ELF(2×) insertion at the target promoters and stable target mRNA expression over 30 days post-transfection for GFP-sorted HEK293T and K562 cells (
We explored the insertion of diverse transcription factor cluster motifs via prime editing to achieve target protein expression. To do so, we used the IL2RA promoter due to our previous success in activating IL2RA protein expression and the ability to sort for its expression as a cell surface marker (
We explored the insertion of mutagenized ELF(2×) insertion motifs via prime editing to achieve tunable target protein expression. To do so, we used the IL2RA promoter due to our previous success in activating IL2RA protein expression and the ability to sort for its expression as a cell surface marker (
Taken together, these results show that it is possible to use precision genome engineering via prime editing to install functional transcription factor binding sites, to ultimately promote heritable gene activation.
Genome Biol. 5, 201 (2003).
Biochem. 72, 449-479 (2003).
Streptococcus pyogenes Cas9, SEQ ID NO: 372
Escherichia phage P1 (Bacteriophage P1) Cre
This application is a U.S. National Phase Application under 35 U.S.C. § 371 of International Patent Application No. PCT/US2021/034996, filed on May 28, 2021, which claims the benefit of U.S. Provisional patent application Ser. No. 63/032,486, filed on May 29, 2020. The entire contents of the foregoing are hereby incorporated by reference.
This invention was made with Government support under Grant Nos. GM118158, CA211707, CA204954, and HG010717 awarded by the National Institutes of Health. The Government has certain rights in the invention.
Filing Document | Filing Date | Country | Kind
---|---|---|---
PCT/US2021/034996 | 5/28/2021 | WO |

Number | Date | Country
---|---|---
63032486 | May 2020 | US