The present disclosure relates to systems, methods, and materials for identifying candidate CRISPR-associated proteins.
This application contains a Sequence Listing that has been submitted electronically as an ASCII text file named SequenceListing.txt. The ASCII text file, created on Nov. 22, 2021, is 531 kilobytes in size. The material in the ASCII text file is hereby incorporated by reference in its entirety.
The systematic interrogation of genomes and genetic reprogramming of cells involves targeting sets of genes for expression or repression. Currently the most common approach for targeting arbitrary genes for regulation is to use RNA interference (RNAi). This approach has limitations. For example, RNAi can exhibit significant off-target effects and toxicity.
Clustered Regularly Interspaced Short Palindromic Repeats (CRISPR) and the CRISPR-associated (Cas) genes, collectively known as the CRISPR-Cas or CRISPR/Cas systems, are currently understood to provide immunity to bacteria and archaea against phage infection. The CRISPR-Cas systems of prokaryotic adaptive immunity are an extremely diverse group of protein effectors, non-coding elements, and loci architectures, some examples of which have been engineered and adapted to produce important biotechnologies. The components of the systems involved in host defense include one or more effector proteins capable of modifying DNA or RNA and an RNA guide element that is responsible for targeting these protein activities to a specific sequence on the phage DNA or RNA. CRISPR-Cas systems can be broadly classified into two classes: Class 1 systems are composed of multiple effector proteins that together form a complex around a crRNA, and Class 2 systems consist of a single effector protein that complexes with the crRNA to target DNA or RNA substrates. The single-subunit effector compositions of the Class 2 systems provide a simpler component set for engineering and application translation, and have thus far been important sources of programmable effectors. The discovery, engineering, and optimization of novel Class 2 systems may lead to widespread and powerful programmable technologies for genome engineering and beyond.
There is a need in the field for a technology that allows precise targeting of nuclease activity (or other protein activities) to distinct locations within a target DNA in a manner that does not require the design of a new protein for each new target sequence. In addition, there is a need in the art for methods of controlling gene expression with minimal off-target effects.
This document provides compositions, methods, and materials for identifying Clustered Regularly Interspaced Short Palindromic Repeat (CRISPR)-associated proteins. For example, provided herein are methods including (a) obtaining a set of genomic sequences, wherein a genomic sequence of the set of genomic sequences comprises a CRISPR-associated array; (b) determining coding sequences within a 20 kilobase (kb) sequence flanking either 3′ or 5′ of the CRISPR-associated array; and (c) filtering the coding sequences and using the filtered coding sequences to identify CRISPR-associated proteins. The present disclosure is based on the discovery that methods, including computational methods, can be used to mine prokaryotic genomes and metagenomes for novel CRISPR-associated proteins.
Provided herein are methods of identifying a Clustered Regularly Interspaced Short Palindromic Repeat (CRISPR)-associated protein comprising: (a) obtaining a plurality of genomic sequences, wherein a genomic sequence of the plurality of genomic sequences comprises a CRISPR-associated array; (b) determining a subset of the plurality of genomic sequences comprising a plurality of coding sequences within a 20 kilobase (kb) sequence flanking region either at the 3′ or 5′ end of the CRISPR-associated array; and (c) analyzing a coding sequence of the plurality of coding sequences and thereby identifying the CRISPR-associated protein based on the coding sequence.
In some embodiments, the obtaining step comprises selecting, within the plurality of genomic sequences, a genomic sequence comprising a CRISPR-associated array.
Also provided herein are methods of identifying a CRISPR-associated protein comprising: (a) obtaining a plurality of genomic sequences; (b) selecting, within the plurality of genomic sequences, a genomic sequence comprising a CRISPR-associated array; (c) determining a subset of the plurality of genomic sequences comprising a plurality of coding sequences within a 20 kilobase (kb) sequence flanking region either at the 3′ or 5′ end of the CRISPR-associated array; and (d) analyzing a coding sequence of the plurality of coding sequences and thereby identifying the CRISPR-associated protein based on the coding sequence.
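By way of non-limiting illustration, the flanking-region step recited above can be sketched in Python as follows. The sketch assumes 0-based, half-open contig coordinates, treats the two flanks simply as the lower-coordinate and higher-coordinate sides of the array (strand is ignored), and counts a coding sequence only if it lies entirely within a window. The names Interval, flanking_windows, and cds_in_window are hypothetical and are not part of PILER-CR, CRT, MetaGeneMark, or Prodigal.

```python
from dataclasses import dataclass

FLANK_BP = 20_000  # 20 kb flanking region, as recited in the text


@dataclass(frozen=True)
class Interval:
    """A feature on a contig; 0-based start (inclusive), end (exclusive)."""
    start: int
    end: int


def flanking_windows(array, contig_length, flank=FLANK_BP):
    """Return the two windows of up to `flank` bp on either side of the array.

    Windows are clamped to the contig boundaries, so an array near a contig
    edge yields a shorter (possibly empty) window on that side.
    """
    upstream = Interval(max(0, array.start - flank), array.start)
    downstream = Interval(array.end, min(contig_length, array.end + flank))
    return upstream, downstream


def cds_in_window(cds_list, window):
    """Keep coding sequences that lie entirely inside the window (assumption)."""
    return [c for c in cds_list
            if c.start >= window.start and c.end <= window.end]
```

In this sketch a coding sequence that straddles a window boundary is excluded; a different implementation could reasonably count partial overlaps instead.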
In some embodiments, the plurality of genomic sequences comprises one or more genomes, wherein the one or more genomes are selected from: a prokaryotic genome and a metagenome. In some embodiments, the selecting step comprises using an algorithm selected from the group consisting of PILER-CR, CRISPR Recognition Tool (CRT), and combinations thereof. In some embodiments, the determining step comprises using an algorithm selected from the group consisting of MetaGeneMark, Prodigal, and combinations thereof.
In some embodiments, the analyzing step comprises filtering the coding sequence that comprises more than 500 amino acids. In some embodiments, the analyzing step comprises filtering a coding sequence that comprises more than 800 amino acids. In some embodiments, the analyzing step further comprises classifying the CRISPR-associated array based on having three or more coding sequences present in the 20 kb flanking region. In some embodiments, the analyzing step further comprises determining a relative position of the coding sequence in the 20 kb flanking region relative to the CRISPR-associated array.
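By way of non-limiting illustration, the length filter, the three-or-more coding sequence criterion, and the relative-position determination described above can be sketched in Python as follows. The sketch assumes that "filtering" means retaining coding sequences whose translated length exceeds the threshold, and that positions are 0-based, half-open contig coordinates. All function names are hypothetical.

```python
def filter_by_length(proteins, min_length=500):
    """Keep proteins strictly longer than `min_length` amino acids.

    `proteins` maps an identifier to an amino acid string; a trailing
    stop symbol '*' is not counted toward the length.
    """
    return {name: seq for name, seq in proteins.items()
            if len(seq.rstrip("*")) > min_length}


def has_minimum_cds(cds_count, minimum=3):
    """True if the flanking region contains at least `minimum` coding sequences."""
    return cds_count >= minimum


def relative_position(cds_start, cds_end, array_start, array_end):
    """Return (side, distance in bp) of a coding sequence relative to the array."""
    if cds_end <= array_start:
        return ("upstream", array_start - cds_end)
    if cds_start >= array_end:
        return ("downstream", cds_start - array_end)
    return ("overlapping", 0)
```

The default threshold of 500 can be replaced by 800 to reflect the alternative embodiment recited above.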
In some embodiments, the analyzing of the coding sequence further comprises removing known CRISPR-associated proteins from the identified CRISPR-associated proteins. In some embodiments, the analyzing of the coding sequence comprises using an algorithm selected from the group consisting of HMMSCAN and RPS-BLAST. In some embodiments, the analyzing of the coding sequence further comprises determining the presence of a structural domain. In some embodiments, the analyzing of the coding sequence comprises determining the presence of a functional domain. In some embodiments, the functional domain comprises a DNA binding domain, an RNA binding domain, a nuclease, a helicase, a restriction domain, or a structural maintenance of chromosomes (SMC) domain.
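By way of non-limiting illustration, removal of known CRISPR-associated proteins and a check for a functional domain can be sketched in Python as follows. The sketch assumes that domain or family hits have already been obtained from a profile search and parsed into a mapping from candidate identifier to a set of family names; that mapping, the family names, and the function names are hypothetical, and no search tool is invoked here.

```python
def remove_known_cas(candidates, domain_hits, known_cas_families):
    """Drop candidates that have a hit to any known Cas family.

    `domain_hits` maps candidate id -> set of family names (hypothetical,
    standing in for parsed profile-search output).
    """
    known = set(known_cas_families)
    return [c for c in candidates
            if not (set(domain_hits.get(c, ())) & known)]


def has_functional_domain(candidate, domain_hits, functional_families):
    """True if the candidate has a hit to any of the listed functional domains."""
    return bool(set(domain_hits.get(candidate, ())) & set(functional_families))
```

Candidates with no hits at all are retained by remove_known_cas, which matches the aim of surfacing proteins that are not yet annotated.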
Also provided herein are computer implemented methods comprising: (a) obtaining a plurality of genomic sequences; (b) selecting, within the plurality of genomic sequences, a genomic sequence comprising a CRISPR-associated array; (c) determining a subset of the plurality of genomic sequences comprising a plurality of coding sequences within a 20 kilobase (kb) sequence flanking region either at the 3′ or 5′ end of the CRISPR-associated array; and (d) analyzing a coding sequence of the plurality of coding sequences and thereby identifying a CRISPR-associated protein based on the coding sequence.
In some embodiments, the plurality of genomic sequences comprises one or more genomes, wherein the one or more genomes are selected from: a prokaryotic genome and a metagenome. In some embodiments, the selecting step comprises using an algorithm selected from the group consisting of PILER-CR, CRISPR Recognition Tool (CRT), and combinations thereof. In some embodiments, the determining step comprises using an algorithm selected from the group consisting of MetaGeneMark, Prodigal, and combinations thereof.
In some embodiments, the analyzing step comprises filtering the coding sequence that comprises more than 500 amino acids. In some embodiments, the analyzing step comprises filtering a coding sequence that comprises more than 800 amino acids. In some embodiments, the analyzing step further comprises classifying the CRISPR-associated array based on having three or more coding sequences present in the 20 kb flanking region. In some embodiments, the analyzing step further comprises determining a relative position of the coding sequence in the 20 kb flanking region relative to the CRISPR-associated array.
In some embodiments, the analyzing of the coding sequence further comprises removing known CRISPR-associated proteins from the identified CRISPR-associated proteins. In some embodiments, the analyzing of the coding sequence comprises using an algorithm selected from the group consisting of HMMSCAN and RPS-BLAST. In some embodiments, the analyzing of the coding sequence further comprises determining the presence of a structural domain. In some embodiments, the analyzing of the coding sequence comprises determining the presence of a functional domain. In some embodiments, the functional domain comprises a DNA binding domain, an RNA binding domain, a nuclease, a helicase, a restriction domain, or a structural maintenance of chromosomes (SMC) domain.
Also provided herein are non-naturally occurring CRISPR/Cas systems comprising: (a) a guide RNA, wherein the guide RNA comprises a repeat sequence and a spacer sequence capable of hybridizing to a target nucleic acid; and (b) a CRISPR-associated protein or a nucleic acid encoding the CRISPR-associated protein, wherein the CRISPR-associated protein comprises an amino acid sequence that is at least 80% identical to a sequence selected from SEQ ID NOs: 1-50.
In some embodiments, the CRISPR-associated protein is capable of binding to the guide RNA. In some embodiments, the CRISPR-associated protein comprises an amino acid sequence that is at least 85% identical to a sequence selected from SEQ ID NOs: 1-50. In some embodiments, the CRISPR-associated protein comprises an amino acid sequence that is at least 90% identical to a sequence selected from SEQ ID NOs: 1-50. In some embodiments, the CRISPR-associated protein comprises an amino acid sequence that is at least 95% identical to a sequence selected from SEQ ID NOs: 1-50. In some embodiments, the CRISPR-associated protein comprises an amino acid sequence selected from SEQ ID NOs: 1-50.
In some embodiments, the target nucleic acid is an RNA or DNA. In some embodiments, the targeting of the target nucleic acid results in a modification of the target nucleic acid. In some embodiments, the modification of the target nucleic acid is a cleavage event.
In some embodiments, the guide RNA further comprises a trans-activating CRISPR RNA (tracrRNA). In some embodiments, the system is present in a delivery system. In some embodiments, the delivery system comprises a delivery vehicle selected from the group consisting of an adeno-associated virus, a nanoparticle, and a liposome.
Also provided herein are methods of treating a condition or disease in a subject in need thereof, the method comprising administering to the subject any one of the systems provided herein, wherein the spacer sequence is substantially complementary to a target nucleic acid associated with the condition or disease; wherein the CRISPR-associated protein associates with the guide RNA to form a complex; wherein the complex binds to the target nucleic acid sequence; and wherein upon binding of the complex to the target nucleic acid sequence the CRISPR-associated protein cleaves the target nucleic acid, thereby treating the condition or disease in the subject.
Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure pertains. Although methods and materials similar or equivalent to those described herein can be used in the practice or testing of the present disclosure, suitable methods and materials are described below. All publications, patent applications, patents, and other references mentioned herein are incorporated by reference in their entirety. In case of conflict, the present specification, including definitions, will control. In addition, the materials, methods, and examples are illustrative only and not intended to be limiting.
Other features and advantages of the disclosure will be apparent from the following detailed description, and from the claims.
This document provides methods of identifying Clustered Regularly Interspaced Short Palindromic Repeat (CRISPR)-associated proteins where the method includes computational identification. In some embodiments, these computational methods are directed to identifying CRISPR-associated proteins that co-occur in close proximity to CRISPR arrays. It should be understood that the methods and calculations described herein may be performed on one or more computing devices.
Various non-limiting aspects of these methods and systems are described herein, and can be used in any combination without limitation. Additional aspects of various components of systems and methods for identifying CRISPR-associated proteins are known in the art.
It must be noted that, as used in the specification and the appended claims, the singular forms “a,” “an” and “the” include plural referents unless the context clearly dictates otherwise.
As used herein, the terms “about” and “approximately,” when used to modify an amount specified in a numeric value or range, indicate that the numeric value as well as reasonable deviations from the value known to the skilled person in the art, for example ±20%, ±10%, or ±5%, are within the intended meaning of the recited value.
As used herein, a “cell” can refer to either a prokaryotic or eukaryotic cell, optionally obtained from a subject or a commercially available source.
As used herein, “delivering”, “gene delivery”, “gene transfer”, “transducing” can refer to the introduction of an exogenous polynucleotide into a host cell, irrespective of the method used for the introduction. Such methods include a variety of well-known techniques such as vector-mediated gene transfer (e.g., viral infection/transfection, or various other protein-based or lipid-based gene delivery complexes) as well as techniques facilitating the delivery of “naked” polynucleotides (e.g., electroporation, “gene gun” delivery and various other techniques used for the introduction of polynucleotides). The introduced polynucleotide may be stably or transiently maintained in the host cell. Stable maintenance typically requires that the introduced polynucleotide either contains an origin of replication compatible with the host cell or integrates into a replicon of the host cell such as an extrachromosomal replicon (e.g., a plasmid) or a nuclear or mitochondrial chromosome.
In some embodiments, a polynucleotide can be inserted into a host cell by a gene delivery molecule. Examples of gene delivery molecules can include, but are not limited to, liposomes; micelles; biocompatible polymers, including natural polymers and synthetic polymers; lipoproteins; polypeptides; polysaccharides; lipopolysaccharides; artificial viral envelopes; metal particles; bacteria or viruses, such as baculovirus, adenovirus, and retrovirus; bacteriophage, cosmid, plasmid, and fungal vectors; and other recombination vehicles typically used in the art which have been described for expression in a variety of eukaryotic and prokaryotic hosts, and may be used for gene therapy as well as for simple protein expression.
As used herein, the term “encode” as it is applied to nucleic acid sequences refers to a polynucleotide which is said to “encode” a polypeptide if, in its native state or when manipulated by methods well known to those skilled in the art, it can be transcribed and/or translated to produce the mRNA for the polypeptide and/or a fragment thereof. The antisense strand is the complement of such a nucleic acid, and the encoding sequence can be deduced therefrom.
The term “exogenous” refers to any material introduced from or originating from outside a cell, a tissue or an organism that is not produced by or does not originate from the same cell, tissue, or organism in which it is being introduced.
As used herein, “nucleic acid” is used to include any compound and/or substance that comprises a polymer of nucleotides. In some embodiments, polymers of nucleotides are referred to as polynucleotides. Exemplary nucleic acids or polynucleotides can include, but are not limited to, ribonucleic acids (RNAs), deoxyribonucleic acids (DNAs), threose nucleic acids (TNAs), glycol nucleic acids (GNAs), peptide nucleic acids (PNAs), locked nucleic acids (LNAs, including LNA having a β-D-ribo configuration, α-LNA having an α-L-ribo configuration (a diastereomer of LNA), 2′-amino-LNA having a 2′-amino functionalization, and 2′-amino-α-LNA having a 2′-amino functionalization) or hybrids thereof. Naturally-occurring nucleic acids generally have a deoxyribose sugar (e.g., found in deoxyribonucleic acid (DNA)) or a ribose sugar (e.g., found in ribonucleic acid (RNA)).
A nucleic acid can contain nucleotides having any of a variety of analogs of these sugar moieties that are known in the art. A deoxyribonucleic acid (DNA) can have one or more bases selected from the group consisting of adenine (A), thymine (T), cytosine (C), and guanine (G), and a ribonucleic acid (RNA) can have one or more bases selected from the group consisting of uracil (U), adenine (A), cytosine (C), and guanine (G).
In some embodiments, the term “nucleic acid” refers to a deoxyribonucleic acid (DNA) or ribonucleic acid (RNA), or a combination thereof, in either a single- or double-stranded form. Unless specifically limited, the term encompasses nucleic acids containing known analogues of natural nucleotides that have similar binding properties as the reference nucleotides. Unless otherwise indicated, a particular nucleic acid sequence also implicitly encompasses complementary sequences as well as the sequence explicitly indicated. In some embodiments of any of the isolated nucleic acids described herein, the isolated nucleic acid is DNA. In some embodiments of any of the isolated nucleic acids described herein, the isolated nucleic acid is RNA.
Modifications can be introduced into a nucleotide sequence by standard techniques known in the art, such as site-directed mutagenesis and polymerase chain reaction (PCR)-mediated mutagenesis. Conservative amino acid substitutions are ones in which the amino acid residue is replaced with an amino acid residue having a similar side chain. Families of amino acid residues having similar side chains have been defined in the art. These families include amino acids with basic side chains (e.g., arginine, lysine and histidine), acidic side chains (e.g., aspartic acid and glutamic acid), uncharged polar side chains (e.g., asparagine, cysteine, glutamine, glycine, serine, threonine, tyrosine, and tryptophan), nonpolar side chains (e.g., alanine, isoleucine, leucine, methionine, phenylalanine, proline, and valine), beta-branched side chains (e.g., isoleucine, threonine, and valine), and aromatic side chains (e.g., histidine, phenylalanine, tryptophan, and tyrosine).
Unless otherwise specified, a “nucleotide sequence encoding a protein” includes all nucleotide sequences that are degenerate versions of each other and thus encode the same amino acid sequence.
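By way of non-limiting illustration, degeneracy can be checked in Python by translating two nucleotide sequences and comparing the resulting amino acid sequences. The sketch assumes the standard genetic code, DNA input in a single reading frame starting at the first base, and a length that is a multiple of three. The names translate and encode_same_protein are hypothetical.

```python
from itertools import product

# Standard genetic code, built in the conventional TCAG codon order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate a DNA coding sequence; '*' marks stop, 'X' an unknown codon."""
    dna = dna.upper()
    if len(dna) % 3 != 0:
        raise ValueError("sequence length must be a multiple of three")
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna), 3))


def encode_same_protein(dna_a, dna_b):
    """True if the two sequences are degenerate versions of each other."""
    return translate(dna_a) == translate(dna_b)
```

For example, ATGGCTTAA and ATGGCCTAA differ at one nucleotide but both translate to methionine, alanine, stop, so they encode the same amino acid sequence.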
The term “plurality” can refer to a state of having a plural (e.g., more than one) number of different types of things (e.g., a cell, a genomic sequence, a subject, a system, or a protein). In some embodiments, a plurality of genomic sequences can be more than one genomic sequence wherein each genomic sequence is different from each other.
The term “subject” is intended to include any mammal. In some embodiments, the subject is a cat, a dog, a goat, a human, a non-human primate, a rodent (e.g., a mouse or a rat), a pig, or a sheep.
The term “transduced”, “transfected”, or “transformed” refers to a process by which exogenous nucleic acid is introduced or transferred into a cell. A “transduced,” “transfected,” or “transformed” mammalian cell is one that has been transduced, transfected or transformed with exogenous nucleic acid (e.g., a gene delivery vector that includes an exogenous nucleic acid encoding an RNA-binding zinc finger domain).
The term “treating” means a reduction in the number, frequency, severity, or duration of one or more (e.g., two, three, four, five, or six) symptoms of a disease or disorder in a subject (e.g., any of the subjects described herein), and/or results in a decrease in the development and/or worsening of one or more symptoms of a disease or disorder in a subject.
The term “promoter” means a DNA sequence recognized by enzymes/proteins in a mammalian cell required to initiate the transcription of an operably linked coding sequence (e.g., a nucleic acid encoding a fusion protein (e.g., an RNA-binding zinc finger domain and a fusion partner)). A promoter typically refers to, e.g., a nucleotide sequence to which an RNA polymerase and/or any associated factor binds and at which transcription is initiated. The promoter can be constitutive, inducible, or tissue-specific (e.g., a brain-specific promoter).
The terms “identical” or percent “identity,” in the context of two or more polypeptide sequences, refer to two or more sequences or subsequences that are the same or have a specified percentage of amino acid residues, e.g., at least 60%, at least 65%, at least 70%, at least 75%, at least 80%, at least 85%, at least 90%, or at least 95% or greater, that are identical over a specified region when compared and aligned for maximum correspondence over a comparison window or designated region, as measured using a sequence comparison algorithm or by manual alignment and visual inspection.
For sequence comparison of polypeptides, typically one amino acid sequence acts as a reference sequence, to which a candidate sequence is compared. Alignment can be performed using various methods available to one of skill in the art, e.g., visual alignment or using publicly available software using known algorithms to achieve maximal alignment. Such programs include the BLAST programs, ALIGN, ALIGN-2 (Genentech, South San Francisco, Calif.) or Megalign (DNASTAR). The parameters employed for an alignment to achieve maximal alignment can be determined by one of skill in the art. For sequence comparison of polypeptide sequences for purposes of this application, the BLASTP algorithm (standard protein BLAST) for aligning two protein sequences with the default parameters is used.
As used herein, the term “CRISPR” refers to a technique of sequence specific genetic manipulation relying on the clustered regularly interspaced short palindromic repeats pathway, which unlike RNA interference regulates gene expression at a transcriptional level. The term “gRNA” or “guide RNA” refers to the guide RNA sequences used to target specific genes for correction employing the CRISPR technique. Techniques of designing gRNAs and donor therapeutic polynucleotides for target specificity are well known in the art. See, for example, Doench, J., et al. Nature biotechnology 2014; 32(12):1262-7 and Graham, D., et al. Genome Biol. 2015; 16: 260. The term “single guide RNA” or “sgRNA” refers to a specific type of gRNA that combines tracrRNA (trans-activating CRISPR RNA), which binds to Cas9 to activate the complex to create the necessary strand breaks, and crRNA (CRISPR RNA), comprising nucleotides complementary to the tracrRNA, into a single RNA construct. Exemplary methods of employing the CRISPR technique are described in WO 2017/091630, which is incorporated by reference in its entirety.
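By way of non-limiting illustration, percent identity over an existing pairwise alignment can be computed in Python as follows. The sketch does not perform the alignment and is not BLASTP; it assumes the two sequences have already been aligned (for example by an alignment program), that gaps are written as '-', and that identity is expressed relative to the total number of alignment columns. The function names are hypothetical.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of alignment columns in which both sequences have the same residue.

    Both inputs must be the same length; '-' marks a gap and never counts
    as a match.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_a:
        raise ValueError("alignment is empty")
    matches = sum(1 for x, y in zip(aligned_a, aligned_b)
                  if x == y and x != "-")
    return 100.0 * matches / len(aligned_a)


def meets_identity_threshold(aligned_a, aligned_b, threshold=80.0):
    """True if the aligned pair is at least `threshold` percent identical."""
    return percent_identity(aligned_a, aligned_b) >= threshold
```

Other denominators (for example the length of the reference sequence) give different values for the same alignment, so the choice of denominator should be stated whenever an identity threshold is applied.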
In some embodiments, the single guide RNA can recognize a target RNA, for example, by hybridizing to the target RNA. In some embodiments, the single guide RNA comprises a sequence that is complementary to the target RNA. In some embodiments, the sgRNA can include one or more modified nucleotides. In some embodiments, the sgRNA has a length that is about 10 nt (e.g., about 20 nt, about 30 nt, about 40 nt, about 50 nt, about 60 nt, about 70 nt, about 80 nt, about 90 nt, about 100 nt, about 120 nt, about 140 nt, about 160 nt, about 180 nt, about 200 nt, about 300 nt, about 400 nt, about 500 nt, about 600 nt, about 700 nt, about 800 nt, about 900 nt, about 1000 nt, or about 2000 nt).
In some embodiments, a single guide RNA can recognize a variety of RNA targets. For example, a target RNA can be messenger RNA (mRNA), ribosomal RNA (rRNA), signal recognition particle RNA (SRP RNA), transfer RNA (tRNA), small nuclear RNA (snRNA), small nucleolar RNA (snoRNA), antisense RNA (aRNA), long noncoding RNA (lncRNA), microRNA (miRNA), piwi-interacting RNA (piRNA), small interfering RNA (siRNA), short hairpin RNA (shRNA), retrotransposon RNA, viral genome RNA, or viral noncoding RNA. In some embodiments, a target RNA can be an RNA involved in pathogenesis of conditions such as cancers, neurodegeneration, cutaneous conditions, endocrine conditions, intestinal diseases, infectious conditions, neurological conditions, liver diseases, heart disorders, or autoimmune diseases. In some embodiments, a target RNA can be a therapeutic target for conditions such as cancers, neurodegeneration, cutaneous conditions, endocrine conditions, intestinal diseases, infectious conditions, neurological conditions, liver diseases, heart disorders, or autoimmune diseases.
As used herein, a “CRISPR-associated protein” can refer to an enzyme that uses CRISPR sequences as a guide to recognize and cleave specific nucleic acid strands that are complementary to the CRISPR sequence. A CRISPR-associated protein can associate with a CRISPR RNA sequence to bind to and alter DNA or RNA target sequences. In some embodiments, a CRISPR-associated protein can be a Cas9 endonuclease that makes a double-stranded break in a target DNA sequence. In some embodiments, a CRISPR-associated protein can be a Cas12a nuclease that also makes a double-stranded break in a target DNA sequence. In some embodiments, a CRISPR-associated protein can be a Cas13 nuclease which targets RNA. Additional CRISPR-associated proteins within the scope of the disclosure as identified by the novel method presented herein also include SEQ ID NOs: 1-50.
As used herein, a “CRISPR-associated array” can refer to a component of a CRISPR-Cas system, wherein a CRISPR-associated array can include alternating conserved repeats and spacers that are transcribed into a precursor CRISPR RNA and processed into individual CRISPR RNAs. In some embodiments, a CRISPR-associated array includes between two and several hundred repeating sequences separated by unique spacers. Both the repeats and spacers in an array have interesting features, wherein each DNA repeat is a partial palindrome while spacers all share a common sequence called a Proto-spacer Adjacent Motif (PAM) that Cas9 requires to recognize its DNA target. In some embodiments, a CRISPR-associated array has a 20 kb flanking region either at the 3′ or 5′ end of the CRISPR-associated array. In some embodiments, the CRISPR-associated array has a 20 kb flanking region at both the 3′ and 5′ end of the CRISPR-associated array. In some embodiments, a flanking region can include a coding sequence. In some embodiments, a flanking region can include a plurality of coding sequences. In some embodiments, a flanking region can include three or more coding sequences.
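By way of non-limiting illustration, the alternating repeat and spacer structure of an array can be sketched in Python by splitting an array sequence on its repeat. The sketch assumes that every repeat copy is identical and that the repeat does not occur inside any spacer; naturally occurring arrays often contain degenerate repeats, which this sketch does not handle. The function name and the example sequences are hypothetical.

```python
def extract_spacers(array_seq, repeat):
    """Return the sequences located between consecutive exact repeat copies.

    Sequence before the first repeat and after the last repeat is ignored,
    so fewer than two repeat copies yields an empty list.
    """
    if not repeat:
        raise ValueError("repeat must be a non-empty sequence")
    parts = array_seq.upper().split(repeat.upper())
    return [part for part in parts[1:-1] if part]
```

For an array written as repeat, spacer 1, repeat, spacer 2, repeat, the function returns the two spacers in order.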
Provided herein are non-naturally occurring CRISPR/Cas systems including (a) a guide RNA, wherein the guide RNA comprises a repeat sequence and a spacer sequence capable of hybridizing to a target nucleic acid; and (b) a CRISPR-associated protein or a nucleic acid encoding the CRISPR-associated protein, wherein the CRISPR-associated protein comprises an amino acid sequence that is at least 80%, at least 85%, at least 86%, at least 87%, at least 88%, or at least 89% identical to a sequence selected from SEQ ID NOs: 1-50.
In some embodiments, the CRISPR-associated protein comprises an amino acid sequence that is at least 90%, at least 91%, at least 92%, at least 93%, at least 94%, at least 95%, at least 96%, at least 97%, at least 98%, or at least 99% identical to a sequence selected from SEQ ID NOs: 1-50. In some embodiments, the CRISPR-associated protein comprises an amino acid sequence selected from SEQ ID NOs: 1-50.
In some embodiments, the CRISPR-associated protein is capable of binding to the guide RNA and of targeting the nucleic acid sequence complementary to the guide RNA spacer sequence. In some embodiments, the target nucleic acid is an RNA or DNA. In some embodiments, the targeting of the target nucleic acid results in a modification of the target nucleic acid. In some embodiments, the modification of the target nucleic acid is a cleavage event. In some embodiments, the guide RNA further comprises a trans-activating CRISPR RNA (tracrRNA).
In some embodiments, the system is present in a delivery system. In some embodiments, the delivery system comprises a delivery vehicle selected from the group consisting of an adeno-associated virus, a nanoparticle, and a liposome.
Also provided herein are nucleic acids encoding any of the CRISPR-associated proteins or CRISPR-associated arrays as described herein.
Any of the isolated nucleic acids described herein can be introduced into any cell, e.g., a mammalian cell. Non-limiting examples of a mammalian cell include: a human cell, a rodent cell (e.g., a rat cell or a mouse cell), a rabbit cell, a dog cell, a cat cell, a porcine cell, or a non-human primate cell.
Methods of culturing cells are well known in the art. Cells can be maintained in vitro under conditions that favor cell proliferation, cell growth, and/or cell differentiation. For example, cells can be cultured by contacting a cell (e.g., any of the cells described herein) with a cell culture medium that includes supplemental growth factors to support cell viability and cell growth.
Methods of introducing nucleic acids (e.g., any of the exemplary nucleic acids described herein) and/or gene delivery vectors (e.g., any of the exemplary gene delivery vectors described herein (e.g., an AAV vector)) into cells (e.g., mammalian cells) are known in the art. Non-limiting examples of methods that can be used to introduce a nucleic acid (e.g., any of the exemplary nucleic acids described herein) and/or a gene delivery vector (e.g., any of the exemplary gene delivery vectors described herein (e.g., an AAV vector)) include: electroporation, lipofection, transfection, microinjection, calcium phosphate transfection, dendrimer-based transfection, anionic polymer transfection, cationic polymer transfection, transfection using highly branched organic compounds, cell-squeezing, sonoporation, optical transfection, magnetofection, particle-based transfection (e.g., nanoparticle transfection), transfection using liposomes (e.g., cationic liposomes), and viral transduction (e.g., lentiviral transduction, adenoviral transduction).
In some embodiments of any of the methods described herein, the method further includes formulating the CRISPR-associated protein, CRISPR-associated array, and/or guide RNA into a composition (e.g., a pharmaceutical composition).
Also provided herein are methods and compositions for specificity of transduction and/or infection, e.g., using any of the AAV capsid proteins or AAV virus serotypes. In some embodiments of any of the methods described herein, specificity of gene expression is determined, e.g., using any of the tissue-specific promoters and/or enhancers described herein.
In some embodiments, the gene delivery vector (e.g., any of the exemplary gene delivery vectors described herein) can include a promoter sequence. In some embodiments of any of the gene delivery vectors described herein, the promoter sequence is a tissue-specific promoter. In some embodiments, the promoter is an H1 promoter. In some embodiments, a promoter is a ubiquitous promoter. Non-limiting examples of ubiquitous promoters include CAG, EF1α, UBC, SV40, CMV, or PGK.
In some embodiments, the gene delivery vector (e.g., any of the exemplary gene delivery vectors described herein) can include an enhancer sequence. In some embodiments, an enhancer sequence is a CMV enhancer, a CAG enhancer, or a cHS4 enhancer.
In some embodiments, the gene delivery vector (e.g., any of the exemplary gene delivery vectors described herein) can include a polyadenylation (poly(A)) signal sequence. Poly(A) tails are added to most nascent eukaryotic messenger RNAs (mRNAs) at their 3′ end during a complex process that includes cleavage of the primary transcript and a coupled polyadenylation reaction driven by the poly(A) signal sequence. In some embodiments of any of the gene delivery vectors described herein, the gene delivery vector can include a poly(A) signal sequence at the 3′ end of the isolated nucleic acid encoding a fusion protein (e.g., any of the fusion proteins described herein).
The term “polyadenylation” refers to the covalent linkage of a polyadenylyl moiety, or its modified variant, to the 3′ end of an mRNA molecule. A poly(A) tail is a long sequence of adenine nucleotides (e.g., 40, 50, 100, 200, 500, 1000) added to the pre-mRNA by a polyadenylate polymerase.
The term “poly(A) signal sequence” or “poly(A) signal” refers to a sequence that triggers the endonucleolytic cleavage of an mRNA and the addition of a sequence of adenosines to the 3′ end of the cleaved mRNA. Non-limiting examples of poly(A) signals include the bovine growth hormone (bGH) poly(A) signal and the human growth hormone (hGH) poly(A) signal. In some embodiments of any of the AAV vectors described herein, the AAV vector can include a poly(A) signal sequence that includes the sequence AATAAA or variations thereof. Additional examples of poly(A) signal sequences are known in the art.
In some embodiments, the gene delivery vector (e.g., any of the exemplary gene delivery vectors described herein) can include an internal ribosome entry site (IRES) sequence. An IRES sequence is used to produce more than one polypeptide from a single gene transcript, and forms a complex secondary structure that allows translation initiation to occur from any position within an mRNA immediately downstream from where the IRES is located. Non-limiting examples of IRES sequences include those from, e.g., hepatitis C virus (HCV), poliovirus (PV), hepatitis A virus (HAV), and foot and mouth disease virus (FMDV).
In some embodiments, the gene delivery vector (e.g., any of the exemplary gene delivery vectors described herein) can include a sequence encoding a “self-cleaving” 2A peptide (e.g., T2A, P2A, E2A, or F2A). A self-cleaving 2A-peptide is used to produce more than one polypeptide from a single gene transcript by inducing ribosomal skipping during translation.
In some embodiments, the nucleic acid sequences are operably linked to a promoter or are operably linked to other nucleic acid sequences using a self-cleaving 2A peptide or an IRES sequence.
Also provided herein are compositions (e.g., pharmaceutical compositions) that include any of the delivery systems, CRISPR-associated proteins, CRISPR-associated arrays, and/or guide RNAs described herein. Any of the pharmaceutical compositions can include any of the delivery systems, CRISPR-associated proteins, CRISPR-associated arrays, and/or guide RNAs described herein and one or more (e.g., 1, 2, 3, 4, or 5) pharmaceutically or physiologically acceptable carriers, diluents, or excipients. In some embodiments, any of the pharmaceutical compositions described herein can include one or more buffers (e.g., a neutral-buffered saline, a phosphate-buffered saline (PBS)), one or more carbohydrates (e.g., glucose, mannose, sucrose, dextran, or mannitol), one or more proteins, polypeptides, or amino acids (e.g., glycine), one or more antioxidants, one or more chelating agents (e.g., glutathione or EDTA), one or more preservatives, and/or a pharmaceutically acceptable carrier (e.g., PBS, saline, or bacteriostatic water).
In some embodiments, any of the pharmaceutical compositions described herein can further include one or more (e.g., 1, 2, 3, 4, or 5) agents that promote the entry of any of the gene delivery vectors described herein into a cell (e.g., a mammalian cell) (e.g., a liposome or cationic lipid).
The pharmaceutical compositions provided herein can be, e.g., formulated to be compatible with their intended route of administration. In some embodiments, the compositions are formulated for subcutaneous, intramuscular, intravenous, or intrahepatic administration. In some examples, the compositions include a therapeutically effective amount of any of the gene delivery vectors described herein.
Also provided are kits that include any of the compositions (e.g., pharmaceutical compositions), isolated nucleic acids, gene delivery vectors, or fusion proteins described herein. In some embodiments, a kit can include a solid composition (e.g., a lyophilized composition including any of the gene delivery vectors described herein) and a liquid for solubilizing the lyophilized composition.
In some embodiments, a kit can include a pre-loaded syringe including any of the pharmaceutical compositions described herein.
In some embodiments, the kit includes a vial including any of the pharmaceutical compositions described herein (e.g., formulated as an aqueous pharmaceutical composition).
In some embodiments, the kit can include instructions for performing any of the methods described herein.
Also provided herein is a mammalian cell (e.g., a peripheral mammalian cell, a mammalian neural cell, e.g., a human neural cell) that includes any of the gene delivery vectors, fusion proteins, or isolated nucleic acids described herein. Also provided is a mammalian cell (e.g., a mammalian neural cell, e.g., a human neural cell) that is transduced with any of the gene delivery vectors described herein, edited using lentiviral or CRISPR technologies, or otherwise engineered or modified to express any of the fusion proteins described herein. Skilled practitioners will appreciate that the gene delivery vectors described herein can be introduced into any mammalian cell (e.g., any neural cell), that a variety of technologies can be utilized for modifying the genome of mammalian cells, and that such modified human cells that secrete fusion proteins can be utilized as cell therapies. Non-limiting examples of gene delivery vectors and methods for introducing gene delivery vectors into mammalian cells (e.g., any neural cell, e.g., a human neural cell) are described herein.
In some embodiments, the mammalian cell is a human cell, a rodent cell (e.g., a rat cell or a mouse cell), a rabbit cell, a dog cell, a cat cell, a porcine cell, or a non-human primate cell. In some embodiments, the mammalian cell is present in a subject (e.g., a human subject). In some embodiments, the mammalian cell is an autologous cell obtained from a subject (e.g., a human subject) and cultured ex vivo. In some embodiments, the mammalian cell is in vitro.
Provided herein are methods of identifying a Clustered Regularly Interspaced Short Palindromic Repeat (CRISPR)-associated protein including (a) obtaining a plurality of genomic sequences, wherein a genomic sequence of the plurality of genomic sequences comprises a CRISPR-associated array; (b) determining a subset of the plurality of genomic sequences comprising a plurality of coding sequences within a 20 kilobase (kb) sequence flanking region either at the 3′ or 5′ end of the CRISPR-associated array; and (c) analyzing a coding sequence of the plurality of coding sequences and thereby identifying the CRISPR-associated protein based on the coding sequence.
In some embodiments, the obtaining step comprises identifying, within the plurality of genomic sequences, a genomic sequence comprising a CRISPR-associated array.
Also provided herein are methods of identifying a CRISPR-associated protein including (a) obtaining a plurality of genomic sequences; (b) selecting, within the plurality of genomic sequences, a genomic sequence comprising a CRISPR-associated array; (c) determining a subset of the plurality of genomic sequences comprising a plurality of coding sequences within a 20 kilobase (kb) sequence flanking region either at the 3′ or 5′ end of the CRISPR-associated array; and (d) analyzing a coding sequence of the plurality of coding sequences and thereby identifying the CRISPR-associated protein based on the coding sequence.
In some embodiments, the plurality of genomic sequences comprises one or more genomes, wherein the one or more genomes are selected from a prokaryotic genome and a metagenome. In some embodiments, the selecting step comprises using an algorithm selected from the group consisting of PILER-CR, CRISPR Recognition Tool (CRT), and combinations thereof.
In some embodiments, the determining step includes filtering the genomic sequences according to the location of the genomic sequence relative to the 20 kb sequence flanking region. In some embodiments, the filtering can include selecting a genomic sequence that is located within the 20 kb flanking region. In some embodiments, the determining step also includes filtering the genomic sequences according to the size of the genomic sequence. In some embodiments, the filtering can include selecting a genomic sequence that is longer than 500 amino acids. In some embodiments, the determining step comprises using an algorithm selected from the group consisting of MetaGeneMark, Prodigal, and combinations thereof.
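By way of non-limiting illustration, the location-based filtering described above can be sketched as follows. This is a minimal sketch only: the `Interval` type, the function names, and the 0-based half-open coordinate convention are assumptions made for illustration and are not part of the disclosed methods. Only the 20 kb window size is taken from the description.

```python
from dataclasses import dataclass

FLANK_BP = 20_000  # 20 kb flanking region, as stated in the description


@dataclass(frozen=True)
class Interval:
    """Hypothetical coordinate record: 0-based start (inclusive), end (exclusive)."""
    start: int
    end: int


def flanking_windows(array, contig_len, flank=FLANK_BP):
    """Return the two windows on either side of a CRISPR array, clipped to the contig."""
    left = Interval(max(0, array.start - flank), array.start)
    right = Interval(array.end, min(contig_len, array.end + flank))
    return left, right


def cds_in_flanks(cds_list, array, contig_len, flank=FLANK_BP):
    """Keep coding sequences that lie entirely inside either flanking window."""
    windows = flanking_windows(array, contig_len, flank)
    return [
        cds for cds in cds_list
        if any(w.start <= cds.start and cds.end <= w.end for w in windows)
    ]
```

In this sketch a coding sequence that only partly overlaps a window is excluded; whether partial overlaps should be retained is a design choice that the description does not specify.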
As used herein, the term “analyzing” can refer to a process that includes filtering of a plurality of coding sequences based on the size of each coding sequence. In some embodiments, the filtering comprises selecting a coding sequence that comprises more than 500 amino acids (e.g., 550 amino acids, 600 amino acids, 650 amino acids, 700 amino acids, 750 amino acids, or 800 amino acids). In some embodiments, the filtering comprises selecting a coding sequence that comprises more than 800 amino acids (e.g., 850 amino acids, 900 amino acids, 950 amino acids, 1000 amino acids, 1100 amino acids, 1200 amino acids, 1300 amino acids, 1400 amino acids, or 1500 amino acids).
In some embodiments, the analyzing step further comprises classifying the CRISPR-associated arrays. In some embodiments, the classifying of the CRISPR-associated arrays comprises selecting a CRISPR-associated array comprising three or more coding sequences (e.g., 4 or more, 5 or more, 6 or more, 7 or more, 8 or more, 9 or more, 10 or more, 15 or more, 20 or more, 25 or more, 30 or more, 35 or more, 40 or more, 45 or more, or 50 or more coding sequences) present in the 20 kb flanking regions. In some embodiments, the classifying further comprises determining a relative position of the coding sequence in the 20 kb flanking region relative to the CRISPR-associated array. In some embodiments, the classifying comprises calculating the coding sequence position within the 20 kb flanking region adjacent to the CRISPR-associated array, wherein the coding sequence could be classified based on the position relative to the CRISPR-associated array.
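By way of non-limiting illustration, the position-based classification described above can be sketched as follows. The function names and the labels `"upstream"`, `"downstream"`, and `"overlapping"` are hypothetical. The labels here follow coordinate order on the contig and are an assumption; mapping them to the 5′ or 3′ end of the array would additionally require strand information.

```python
def relative_position(cds_start, cds_end, array_start, array_end):
    """Return (side, distance_bp) of a coding sequence relative to a CRISPR array.

    Coordinates are assumed to be 0-based and half-open on the same contig.
    """
    if cds_end <= array_start:
        return "upstream", array_start - cds_end
    if cds_start >= array_end:
        return "downstream", cds_start - array_end
    return "overlapping", 0


def arrays_with_min_cds(cds_by_array, min_cds=3):
    """Keep arrays whose flanking regions contain at least `min_cds` coding sequences."""
    return {
        array_id: cds_list
        for array_id, cds_list in cds_by_array.items()
        if len(cds_list) >= min_cds
    }
```

The default of three coding sequences mirrors the "three or more coding sequences" embodiment; other thresholds listed in the description can be passed through `min_cds`.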
In some embodiments, the analyzing of the coding sequences comprises removing known CRISPR-associated proteins from the identified CRISPR-associated proteins. In some embodiments, the analyzing of the coding sequence comprises using one or more algorithms selected from HMMSCAN and RPS-BLAST. In some embodiments, the analyzing of the coding sequence further comprises determining the presence of a structural domain. In some embodiments, the analyzing of the coding sequence further comprises determining the presence of a functional domain. In some embodiments, the functional domain comprises a functional domain selected from a DNA binding domain, an RNA binding domain, a nuclease, a helicase, a restriction domain, and/or a structural maintenance of chromosomes (SMC) domain. In some embodiments, the analyzing of the coding sequence further comprises determining whether the coding sequence starts with a methionine.
Also provided herein are computer implemented methods including (a) obtaining a plurality of genomic sequences; (b) selecting, within the plurality of genomic sequences, a genomic sequence comprising a CRISPR-associated array; (c) determining a subset of the plurality of genomic sequences comprising a plurality of coding sequences within a 20 kilobase (kb) sequence flanking region either at the 3′ or 5′ end of the CRISPR-associated array; and (d) analyzing a coding sequence of the plurality of coding sequences and thereby identifying a CRISPR-associated protein based on the coding sequence.
Also provided herein are methods for treating a condition or disease in a subject in need thereof, the method including administering to the subject any of the systems described herein, wherein the spacer sequence is substantially complementary to a target nucleic acid associated with the condition or disease; wherein the CRISPR-associated protein associates with the RNA guide to form a complex; wherein the complex binds to the target nucleic acid sequence; and wherein upon binding of the complex to the target nucleic acid sequence the CRISPR-associated protein cleaves the target nucleic acid, thereby treating the condition or disease in the subject.
In some embodiments of these methods, the method can result in at least a 2.0-fold (e.g., at least a 2.5-fold, at least a 3.0-fold, at least a 3.5-fold, at least a 4.0-fold, at least a 4.5-fold, at least a 5.0-fold, at least a 6.0-fold, at least a 7.0-fold, at least a 8.0-fold, at least a 9.0-fold, at least a 10-fold, at least a 15-fold, at least a 20-fold, at least a 30-fold, at least a 40-fold, at least a 50-fold, at least a 60-fold, at least a 80-fold, at least a 100-fold, at least a 120-fold, or at least a 150-fold) decrease in the level of one or more symptoms associated with the condition or disease as compared to the level of the one or more symptoms associated with the condition in the subject prior to the administering. In some examples of these methods, the method can result from about a 2-fold to about a 150-fold, about a 2-fold to about a 100-fold, about a 2-fold to about a 50-fold, about a 2-fold to about a 25-fold, about a 2-fold to about a 10-fold, about a 2-fold to about a 5-fold, about a 5-fold to about a 150-fold, about a 5-fold to about a 100-fold, about a 5-fold to about a 50-fold, about a 5-fold to about a 25-fold, about a 5-fold to about a 10-fold, about a 10-fold to about a 150-fold, a 10-fold to about a 100-fold, about a 10-fold to about a 50-fold, about a 10-fold to about a 25-fold, about a 25-fold to about a 150-fold, about a 25-fold to about a 100-fold, or about a 25-fold to about a 50-fold, decrease in the level of one or more symptoms associated with the condition or disease as compared to the level of the one or more symptoms associated with the condition in the subject prior to the administering.
In some embodiments, the condition or disease can include conditions such as cancers, neurodegeneration, cutaneous conditions, endocrine conditions, intestinal diseases, infectious conditions, neurological conditions, liver diseases, heart disorders, or autoimmune diseases. In some embodiments, the condition or disease can be a cancer. In some embodiments, the cancer is selected from a bladder cancer, breast cancer, cervical cancer, colon cancer, endometrial cancer, esophageal cancer, fallopian tube cancer, gall bladder cancer, gastrointestinal cancer, head and neck cancer, hematological cancer, Hodgkin lymphoma, laryngeal cancer, liver cancer, lung cancer, lymphoma, melanoma, mesothelioma, ovarian cancer, primary peritoneal cancer, salivary gland cancer, sarcoma, stomach cancer, thyroid cancer, pancreatic cancer, renal cell carcinoma, glioblastoma, and prostate cancer. In some embodiments, the cancer can be a B-cell acute lymphoblastic leukemia, lung cancer, esophageal cancer, multiple myeloma, or cervical cancer.
In some embodiments, the condition or disease can be a neurodegenerative disease. In some embodiments, the neurodegenerative disease can be Alzheimer's disease, Huntington's disease, Duchenne muscular dystrophy (DMD), frontotemporal dementia, ryanodine receptor type I (RYR1)-related myopathies, cystic fibrosis, or autosomal recessive juvenile parkinsonism.
In some embodiments, the condition or disease can be a blood disease or a hemoglobinopathy. In some embodiments, the blood disease can be sickle cell anemia or beta thalassemia. In some embodiments, the condition or disease can be an eye disease. In some embodiments, the eye disease can be retinitis pigmentosa, Leber congenital amaurosis, specific retinal dystrophy, or autosomal dominant cone-rod dystrophy. In some embodiments, the condition or disease can be human immunodeficiency virus (HIV), diabetes, autism spectrum disorder, genetic liver disease, or congenital genetic lung disease.
An exemplary method of identifying candidate CRISPR-associated proteins is shown in
In order to annotate and classify the 10,913 cluster sequences adjacent to the CRISPR arrays, a representative sequence from each cluster was searched against the prokaryotic subset of the non-redundant protein database (bacteria+archaea) using blastp in order to annotate protein sequences and identify known CRISPR genes. Protein sequences matching known CRISPR genes with an e-value cutoff of 1e-10 and a query coverage of 50% were considered orthologous to known CRISPR genes. Furthermore, protein sequences were searched with HMMSCAN against known CRISPR-related profiles (see, e.g., Burstein, D. et al., Nature 542, 237-241 (2017); hereinafter “Burstein”) and with RPS-BLAST against a collection of CRISPR profiles. These protein clusters represent orthologs and were considered known CRISPR-associated proteins, and were thus filtered out or separated for further analysis. From the total 10,913 clusters, 3,465 clusters were considered known CRISPR clusters and 7,642 were considered novel potential CRISPR-associated candidates (
To further annotate the remaining 7,642 protein clusters, for each candidate protein, functional domains were predicted by running RPS-BLAST on CDD database and HMMSCAN against Pfam and associated GO (Gene Ontology) terms were added using Pfam2Go mapping software. Protein clusters were subsequently grouped in subsets based on the presence/absence of characterized and putative domains.
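By way of non-limiting illustration, grouping protein clusters by the presence or absence of characterized and putative domains can be sketched as follows. The `Domain` record, the function name, and the input dictionary are hypothetical; in practice the domain annotations would be derived from the RPS-BLAST and HMMSCAN results, whose parsing is not shown here.

```python
from collections import defaultdict
from typing import NamedTuple


class Domain(NamedTuple):
    """Hypothetical domain annotation for one predicted domain."""
    name: str
    characterized: bool  # False for putative / uncharacterized domains


def group_by_domain_presence(domains_by_cluster):
    """Group cluster IDs by (has characterized domain, has putative domain)."""
    groups = defaultdict(list)
    for cluster_id, domains in domains_by_cluster.items():
        has_characterized = any(d.characterized for d in domains)
        has_putative = any(not d.characterized for d in domains)
        groups[(has_characterized, has_putative)].append(cluster_id)
    return dict(groups)
```

Clusters with no predicted domains fall under the key `(False, False)`, which keeps them separable from clusters carrying only putative domains.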
To identify novel CRISPR-associated proteins, 179,804 prokaryotic genomes and 3,396 metagenomes deposited to Genbank from Jun. 1, 2016-Apr. 21, 2020 were downloaded and analyzed. Using PILER-CR and CRT (CRISPR Recognition Tool), 230,443 CRISPR arrays were identified, with 187,324 derived from prokaryote genomes and 43,119 from metagenomes. Given that most CRISPR class 2 effectors (i.e., single effector proteins such as Cas9, Cas12, and Cas13) are located in close proximity to their arrays (Makarova, et al., Nat. Rev. Microbiology, 18: 67-83 (2020); hereinafter “Makarova”), the search for novel CRISPR-associated proteins was limited to a 20 kb window flanking the arrays. Putative protein sequences within the flanking sequences were predicted using MetaGeneMark (Zhu) and Prodigal (Hyatt), filtering out sequences shorter than 500 amino acids, because novel class 2 effectors are generally large multidomain proteins (Makarova).
To annotate protein sequences and identify known CRISPR proteins, representative sequences for the 10,913 clusters were searched against the prokaryotic subset of the non-redundant protein database (bacteria+archaea) using blastp. Protein sequences matching known CRISPR genes with an e-value cutoff of 1e-10 and a query coverage of 50% were considered orthologous to known CRISPR genes. Additionally, protein sequences were searched with HMMSCAN against known CRISPR-related profiles (Burstein) and with RPS-BLAST against a collection of CRISPR profiles. Hits for both of these searches mostly overlapped blastp-identified CRISPR sequences, with a few exceptions, which were also added to the CRISPR cluster ortholog set. Together, from the 10,913 clusters, 3,465 clusters were considered orthologs to known CRISPR proteins, leaving 7,642 potential cluster candidates to be further characterized. Given that many of the 10,913 clusters were generated with a stringent 90% identity using MMseqs2, these clusters were similar and therefore additional filtering was performed. To further reduce the number of sequences, the 10,913 clusters were further clustered with MMseqs2 using default settings, which require the sequences to overlap by at least 80% (query coverage 0.8). MMseqs2 with default settings generated 4,205 “superclusters”. The supercluster classification reduced the number of known CRISPR-associated clusters to 343 and the number of unknown CRISPR superclusters to 3,862. To narrow down the two lists (clusters and superclusters), proteins were further analyzed and protein domains were predicted by running RPS-BLAST on the CDD database and HMMSCAN against Pfam (
For the 3,465 clusters consisting of 51,094 orthologs of known CRISPR proteins, and the 343 superclusters consisting of 2,614 clusters, we found numerous class 1 systems, which have effector modules composed of multiple Cas proteins (e.g., Cas1-4, 5-8, 10-11), and numerous class 2 systems, which encompass a single multidomain crRNA-binding protein (e.g., Cas9, Cas12, Cas13, etc.).
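By way of non-limiting illustration, the e-value and query-coverage thresholds described above can be applied to search hits as sketched below. The `Hit` record and its field names are hypothetical and do not correspond to any particular blastp output format. Treating both thresholds as inclusive is likewise an assumption, since the description states only the cutoff values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """Hypothetical summary of one search hit for a cluster representative."""
    query_id: str
    subject_is_known_cas: bool
    evalue: float
    query_coverage: float  # percent of the query covered by the alignment


def known_crispr_queries(hits, max_evalue=1e-10, min_query_cov=50.0):
    """Return IDs of queries with at least one qualifying hit to a known CRISPR gene."""
    return {
        h.query_id
        for h in hits
        if h.subject_is_known_cas
        and h.evalue <= max_evalue
        and h.query_coverage >= min_query_cov
    }


def split_clusters(cluster_ids, hits):
    """Split cluster IDs into (known CRISPR orthologs, remaining candidates)."""
    known = known_crispr_queries(hits) & set(cluster_ids)
    candidates = [c for c in cluster_ids if c not in known]
    return sorted(known), candidates
```

Clusters with no hits at all remain in the candidate list, consistent with the description, in which proteins with no blast hits form one of the candidate categories.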
To annotate known candidates, the arrays were classified as class 1, class 2, or unclassified based on the identified CRISPR-related proteins associated with each array. For each array with a flanking region length of at least 3 kb, all of the associated CRISPR-related proteins were collected, and if they consistently fell into class 1 or class 2, the array was classified as such. If an array had no identifiable CRISPR proteins that could distinguish the class, such as arrays flanked only by Cas1/Cas2/Cas4 or by no Cas proteins, it was marked as unclassified. If an array had proteins from both classes, it was marked as ambiguous. Classified arrays were set aside for the following reasons. If an array was classified as class 2, the array already had an effector protein such as Cas9/Cas12/Cas13, since those are the only proteins that can reliably distinguish class 2, and such arrays were unlikely to have yet another effector. If an array was classified as class 1, which was the case for the majority of classified arrays, it could also be discarded since class 2 effectors were of primary importance. As such, the aim was to narrow down the candidate CRISPR-associated proteins by further considering only unclassified or ambiguous arrays.
Further filtering of the candidate clusters produced a list of 50 candidate proteins to be used in functional assays. Candidates were divided into four main categories: proteins with no blast hits; proteins with no predicted domains and blast hits against hypothetical and unknown proteins; proteins with predicted domains and blast hits against hypothetical and unknown proteins only; and proteins with predicted domains and blast hits against characterized proteins. For each category, proteins shorter than 800 amino acids (aa) and proteins not starting with methionine (Met) were filtered out. The first category included 25 candidates, 6 of which were associated with classified arrays and thus not considered for further analysis. Since the majority of the proteins were filtered out because they had predicted domains with a potential structural function or were low complexity proteins including many SR repeats, the protein length threshold for this category was changed to 650 aa and four potential candidates were selected for functional analysis. The second category of proteins with no predicted domains and blast hits against hypothetical and unknown proteins contained 347 candidates, of which 120 were associated with an already classified array and thus filtered out. From the remaining 227 proteins, 175 proteins were excluded for being shorter than 800 aa and 14 candidates were excluded for not starting with Met. In addition, proteins with a high proportion of low complexity/repeat regions were filtered out, and 15 candidates were selected for further analysis. The third category included 1,644 proteins with predicted domains and blast hits against hypothetical and unknown proteins, of which only 552 candidates were longer than 800 aa. Exclusion of 152 proteins already associated with classified arrays and of proteins not starting with Met left 322 candidate proteins. From this shorter list, 15 were selected based on the putative function of the hypothetical domains.
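By way of non-limiting illustration, the array classification logic described above can be sketched as follows. The marker sets are an illustrative and incomplete assumption: the description names Cas9, Cas12, and Cas13 as class 2 markers and treats Cas1, Cas2, and Cas4 as non-distinguishing, but it does not enumerate class 1 markers, so the class 1 set below is a placeholder. The function names are hypothetical.

```python
# Illustrative marker sets (assumption; not an exhaustive or authoritative list).
CLASS_1_MARKERS = frozenset({"Cas3", "Cas5", "Cas7", "Cas8", "Cas10", "Cas11"})
CLASS_2_MARKERS = frozenset({"Cas9", "Cas12", "Cas13"})


def classify_array(cas_proteins):
    """Label an array from the CRISPR-related proteins found in its flanking regions."""
    found = set(cas_proteins)
    has_class_1 = bool(found & CLASS_1_MARKERS)
    has_class_2 = bool(found & CLASS_2_MARKERS)
    if has_class_1 and has_class_2:
        return "ambiguous"
    if has_class_1:
        return "class 1"
    if has_class_2:
        return "class 2"
    # Covers arrays flanked only by Cas1/Cas2/Cas4 or by no Cas proteins.
    return "unclassified"


def keep_for_candidate_search(label):
    """Only unclassified or ambiguous arrays are carried forward."""
    return label in {"unclassified", "ambiguous"}
```

The 3 kb minimum flanking length mentioned above is not checked in this sketch and would be applied before calling `classify_array`.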
Proteins with DNA/RNA binding domains, nucleases, helicases, restriction and SMC domains were included in the final list for further functional analysis. The most abundant category is represented by proteins with predicted domains and blast hits against characterized proteins with 5329 candidates of which 1442 were above 800 aa. After filtering out proteins associated with classified arrays and proteins not starting with Met, the candidate number decreased to 758. SEQ ID NOs: 1-50 represent proteins with DNA/RNA binding domains, nucleases, helicases, restriction and SMC domains that were selected for further analysis. The CRISPR arrays and spacer sequences corresponding to the CRISPR-associated proteins of SEQ ID NOs: 1-50 are listed in Tables 1-5.
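By way of non-limiting illustration, the per-category filters described above (array classification, minimum length, and an N-terminal methionine) can be sketched as a single predicate. The `Candidate` record, its field names, and the array labels are hypothetical. Low-complexity and domain-based selection, which the description also applies, is not modeled here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical record for one candidate protein."""
    protein_id: str
    sequence: str      # amino acid sequence, one-letter code
    array_label: str   # e.g., "class 1", "class 2", "unclassified", "ambiguous"


def passes_filters(candidate, min_length=800):
    """Apply the array, length, and start-residue filters to one candidate."""
    return (
        candidate.array_label in {"unclassified", "ambiguous"}
        and len(candidate.sequence) >= min_length   # drop proteins shorter than the cutoff
        and candidate.sequence.startswith("M")      # drop proteins not starting with Met
    )


def filter_candidates(candidates, min_length=800):
    """Return the candidates that pass all three filters, preserving input order."""
    return [c for c in candidates if passes_filters(c, min_length)]
```

The `min_length` parameter allows the relaxed 650 aa threshold used for the first category, e.g., `filter_candidates(first_category, min_length=650)`, where `first_category` is a hypothetical list of `Candidate` records.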
Streptococcus thermophilus
Laceyella sediminis
Firmicutes bacterium
It is to be understood that while the invention has been described in conjunction with the detailed description thereof, the foregoing description is intended to illustrate and not limit the scope of the invention, which is defined by the scope of the appended claims. Other aspects, advantages, and modifications are within the scope of the following claims.
This application claims priority to U.S. Provisional Patent Application No. 63/117,441, filed on Nov. 23, 2020, and U.S. Provisional Patent Application No. 63/118,307, filed on Nov. 25, 2020. The disclosures of these prior applications are considered part of the disclosure of this application, and are incorporated in their entireties into this application.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/US2021/060547 | 11/23/2021 | WO |
Number | Date | Country | |
---|---|---|---|
63117441 | Nov 2020 | US | |
63118307 | Nov 2020 | US |