Applicant hereby incorporates by reference the Sequence Listing material filed in electronic form herewith. This file is labeled “NYG-LIPP120PCT_ST25.txt”, was created on 25 Nov. 2020, and is 23 KB in size.
Class 2, Type VI clustered regularly interspaced short palindromic repeats (CRISPR) enzymes (for example, Cas13 proteins) have recently been identified as programmable RNA-guided, RNA-directed Cas proteins with nuclease activity that allow for target gene knock-down without altering the genome.
To date, three different Cas13 effector proteins (PguCas13b, PspCas13b, RfxCas13d) have been reported to show high RNA knock-down efficacy with minimal off-target activity 9,11. In addition to target RNA knock-down12-17, Cas13 proteins have been used to enable viral RNA-detection systems18,19, site-directed RNA-editing20, demethylation of m6A-modified transcripts 21, RNA live-imaging and modulation of splice site choice as well as cleavage and polyadenylation site usage 22-24.
Cas13 proteins are guided to their target RNAs by a single CRISPR RNA (crRNA) composed of a direct repeat (DR) stem loop and a spacer sequence (guide RNA) that mediates target recognition by RNA-RNA hybridization. Although Cas13 enzymes exert some non-specific collateral nuclease activity upon activation15,16,18,25,26, they have greatly reduced off-target activity in cultured cells compared to RNA interference. Previous studies have shown that Cas13 guide RNAs have minimal Protospacer Flanking Sequence (PFS) constraints in mammalian cells12,15,20,27 and that RNA target sites should be preferentially accessible for Cas13 binding 12,13,15.
Compared to DNA-targeting CRISPR nucleases like Cas9 or Cas12a/Cpf1, little is known about targeting preferences of RNA-targeting Class 2, Type VI CRISPR enzymes like Cas13d. Despite the notion that Cas13d enzymes have limited protospacer adjacent sequence restrictions and that mRNA target sites should be relatively accessible, there is no guidance for the spacer RNA (guide) design for these novel enzymes.
Beyond these basic parameters, we currently lack information about optimal Cas13 crRNA designs for high target RNA knock-down efficacy. A continuing need in the art exists for new and effective tools and methods for screening, designing, optimizing, ranking, selecting and using CRISPR Cas crRNAs with high specificity and efficiency but low off-target activity.
In one aspect, a non-naturally occurring, synthesized or engineered Class 2, Type VI clustered regularly interspaced short palindromic repeats (CRISPR) RNA (crRNA) is provided which comprises a direct repeat (DR) stem loop sequence and a guide or spacer sequence, said DR selected from one or more of the DR sequences or a modification thereof of Table 9, SEQ ID Nos: 1-46, wherein R represents A or G; Y represents C or T (or U); S represents G or C; W represents A or T (or U); K represents G or T (or U); M represents A or C; B represents C or G or T (or U); D represents A or G or T (or U); H represents A or C or T (or U); V represents A or C or G; N represents any base; and - represents a nucleotide gap.
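For illustration only, the degenerate-base codes recited above can be captured as a lookup table. The following Python sketch is not part of the Sequence Listing; the names IUPAC_CODES, degenerate_to_regex and matches_degenerate are hypothetical, and the handling of the gap symbol (treated as contributing no position) is an assumption made for this example.

```python
import re

# Degenerate-base codes as recited in the text; T and U are treated as equivalent.
IUPAC_CODES = {
    "A": "A", "C": "C", "G": "G", "T": "TU", "U": "TU",
    "R": "AG", "Y": "CTU", "S": "GC", "W": "ATU", "K": "GTU", "M": "AC",
    "B": "CGTU", "D": "AGTU", "H": "ACTU", "V": "ACG", "N": "ACGTU",
}


def degenerate_to_regex(pattern: str) -> str:
    """Translate a degenerate sequence into a regular expression.

    Assumption for this sketch: '-' (nucleotide gap) adds no position
    and is simply skipped.
    """
    parts = []
    for code in pattern.upper():
        if code == "-":
            continue
        parts.append("[" + IUPAC_CODES[code] + "]")
    return "".join(parts)


def matches_degenerate(pattern: str, sequence: str) -> bool:
    """Return True if the concrete sequence is one instance of the pattern."""
    regex = degenerate_to_regex(pattern)
    return re.fullmatch(regex, sequence.upper()) is not None
```

For example, under these assumptions the pattern RYN accepts GUA (G is a purine, U a pyrimidine, A any base) but rejects CUA.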
In another aspect, a nucleic acid molecule is provided that comprises the crRNA identified above. In some embodiments, the crRNA is capable of forming a complex with a Class 2, Type VI effector protein, and directing the complex to bind to the target RNA to cleave or block the target RNA. In some embodiments, the Class 2, Type VI effector protein is a CRISPR-associated protein 13d (Cas13d). In some embodiments, the nucleic acid molecule is a vector or plasmid. In some embodiments, the vector is a viral vector.
In another aspect, a nucleic acid molecule is provided in which the crRNA comprises a DR sequence of Table 9 and guide sequences which mismatch the target and allow the Class 2, Type VI effector protein to bind the target, but not elicit target degradation.
In another aspect, a ribonucleoprotein (RNP) complex comprises a Class 2, Type VI effector protein and a crRNA as described above.
In another aspect, a composition comprises a crRNA or RNP as described herein, or a nucleic acid molecule as described herein in a pharmaceutically acceptable carrier. In certain embodiments, the carrier is a nanoparticle, a lipid complex, a polymer, a quantum dot, a carbon nanotube, a magnetic nanoparticle, or a gold nanoparticle.
Still other aspects include a cell comprising any of the nucleic acid molecules, crRNA, RNP or compositions described herein, or a library comprising a plurality of crRNAs, nucleic acid molecules or viral vectors described herein, wherein each of the crRNAs is capable of directing a Cas13d or a variant thereof to a different target RNA or a different region of one target RNA. Other aspects further include a pharmaceutical composition comprising a crRNA, nucleic acid molecule, RNP, composition, cell or library as described herein.
In still another aspect, a method of treating a disease associated with an abnormal RNA or misregulation of an RNA transcript, comprises administering to a subject in need thereof the crRNA, nucleic acid molecule, RNP and/or pharmaceutical compositions described herein. In yet a further aspect, a method of improving the efficiency of, or stabilizing the targeting of a Class 2, Type VI clustered CRISPR RNA (crRNA) comprising a direct repeat (DR) stem loop and a guide or spacer sequence is provided. An exemplary method entails replacing the DR stem loop sequence of a less efficient crRNA with a DR sequence selected from one or more of the DR sequences of SEQ ID Nos: 1 to 46, or a modification thereof.
In still a further embodiment a method is provided for screening or predicting on-target activity of a clustered regularly interspaced short palindromic repeats (CRISPR) RNA (crRNA), which crRNA is capable of forming a complex with an RNA-targeting CRISPR-associated protein or a variant thereof and directing the complex to the target RNA. The method comprises the steps of (a) characterizing a plurality of crRNAs and their corresponding targets by features comprising the presence of both a seed region located between guide RNA nucleotide bases 15 to 21 relative to the guide RNA 5′ end, characterized by a stabilizing, enriched sequence of G and C bases, and an accessible target region characterized by an enriched sequence of A and U, surrounding the seed region on the 5′ end, 3′ end or both the 5′ and 3′ ends; (b) assessing on-target activity of each of the crRNAs of (a); and (c) applying a machine learning model or deep learning model using the characterization of (a) and the on-target activity of (b). Input of the model comprises characterization(s) of said seed region and target regions of each crRNA and its corresponding target RNA, and output of the model is an on-target score of the crRNA. A higher score indicates a higher-ranked on-target activity. The method also includes (d) applying the model constructed in step (c) to a first crRNA and generating an on-target score of the first crRNA. In one embodiment, the crRNAs are characterized by the DR sequences recited in Table 9.
In another aspect, a method of blocking RNA regulatory elements without degradation of the target nucleic acid is provided. The method includes the step of administering crRNAs to a cell expressing an RNA-targeting CRISPR-associated protein or to a subject. The crRNAs are capable of forming a complex with the RNA-targeting CRISPR-associated protein or a variant thereof and directing the complex to the target RNA. The crRNAs comprise a DR sequence and a guide or spacer sequence, said guide or spacer sequence forming extended mismatches to the target site in the seed region.
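By way of a non-limiting sketch, the characterization of step (a) could be computed from the guide sequence as follows. The Python function below is illustrative only and is not the trained model described herein: the function names, the flank width of 5 nt, and the choice to measure base composition on the guide (which, for a perfectly complementary target site, has the same G/C and A/U fractions as the target) are assumptions made for this example. Because guide and target are antiparallel, the flank 5′ of the seed on the guide faces the flank 3′ of the seed on the target.

```python
SEED_START, SEED_END = 15, 21  # 1-based, inclusive, relative to the guide 5' end


def fraction(seq: str, bases: str) -> float:
    """Fraction of residues in seq that belong to the given base set."""
    return sum(b in bases for b in seq) / len(seq) if seq else 0.0


def seed_features(guide: str, flank: int = 5) -> dict:
    """Compute simple composition features around the seed region.

    Assumption: 'flank' (hypothetical default of 5 nt) sets how many
    residues on each side of the seed are inspected for A/U enrichment.
    """
    g = guide.upper().replace("T", "U")
    seed = g[SEED_START - 1:SEED_END]
    upstream = g[max(0, SEED_START - 1 - flank):SEED_START - 1]
    downstream = g[SEED_END:SEED_END + flank]
    return {
        "seed_gc": fraction(seed, "GC"),
        "guide_5p_flank_au": fraction(upstream, "AU"),
        "guide_3p_flank_au": fraction(downstream, "AU"),
    }
```

Feature dictionaries of this kind, paired with measured on-target activity from step (b), would form the training input for whichever machine learning or deep learning model is chosen in step (c).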
In another aspect, a method is provided for generating and selecting a clustered regularly interspaced short palindromic repeats (CRISPR) RNA (crRNA) composed of a direct repeat (DR) stem loop and a guide. The selected crRNA is capable of forming a complex with a CRISPR-associated protein 13d (Cas13d) or a variant thereof and directing the complex to a target RNA. The method comprises randomly designating a potential hybridization region in the target RNA, designing a guide which is capable of hybridizing to the hybridization region, designing a crRNA sequence comprising the guide and a DR stem loop accordingly, and ranking each crRNA based on features of the crRNA and its corresponding target RNA. In one embodiment, the features comprise one or more of those listed in Tables 2 and 4-7 and
In one embodiment, a crRNA or its corresponding target RNA having a feature within the identified range of a positively-correlated feature ranks higher than those falling out of the range. Additionally, or alternatively, a crRNA or its corresponding target RNA having a feature out of the identified range of a negatively-correlated feature ranks higher than those falling within the range. The ranges may include one or more of those listed in Tables 2, 4, 5 and 7.
In another aspect, provided is a non-naturally occurring, synthesized (chemically or recombinantly) or engineered crRNA composed of a direct repeat (DR) stem loop and a guide capable of hybridizing to a hybridization region of a target RNA. The crRNA is capable of forming a complex with a CRISPR-associated protein 13d (Cas13d) or a variant thereof and directing the complex to the target RNA. In a further embodiment the crRNA or the corresponding target RNA comprises a feature which falls within a certain range of one or more of the positively-correlated features and out of a certain range of one or more of the negatively-correlated features as illustrated in Tables 2, 4 and 5.
Additionally, provided are nucleic acid molecules, vectors, and compositions comprising a nucleic acid sequence of a crRNA as disclosed or a nucleic acid sequence encoding the crRNA, along with a library comprising a plurality of the crRNAs, nucleic acid molecules, or vectors. Uses of the disclosed components or compositions are further provided, for example, treatment of a disease associated with an abnormal RNA, or genome-wide screening for functional RNAs.
Still other aspects and advantages of these compositions and methods are described further in the following detailed description of the preferred embodiments thereof.
Described herein is a method of screening and selecting a clustered regularly interspaced short palindromic repeats (CRISPR) RNA (crRNA) suitable for forming a complex with a CRISPR-associated protein 13d (Cas13d) or a variant thereof and directing the complex to a target RNA. The method comprises randomly designating a potential hybridization region in the target RNA, designing a guide which is capable of hybridizing to the hybridization region, designing a crRNA comprising the guide and a DR stem loop accordingly, and ranking each crRNA based on crRNA-specific features as well as corresponding target RNA features. Using the method, crRNAs with high knock-down efficacy are selected. In one embodiment, the method is in silico.
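The designation and design steps of such an in silico method can be sketched as below. This Python example is a minimal illustration under stated assumptions, not the disclosed implementation: the function names are hypothetical, the DR sequence is supplied by the caller (the value used in the usage note is a placeholder, not a sequence of Table 9), the default guide length of 27 nt follows one embodiment described herein, and the DR is assumed to be placed 5′ of the guide. Ranking is not shown.

```python
import random

COMPLEMENT = str.maketrans("ACGU", "UGCA")


def reverse_complement(rna: str) -> str:
    """Reverse complement of an RNA sequence written 5' to 3'."""
    return rna.translate(COMPLEMENT)[::-1]


def design_crrnas(target_rna: str, dr: str, guide_len: int = 27,
                  n: int = 3, seed: int = 0) -> list:
    """Randomly designate hybridization regions and build matching crRNAs.

    Assumptions: guide_len of 27 nt (one embodiment in the text) and a
    crRNA laid out as DR followed by guide; 'seed' only fixes the random
    choice so that the example is reproducible.
    """
    target = target_rna.upper().replace("T", "U")
    n_sites = len(target) - guide_len + 1
    if n_sites < 1:
        raise ValueError("target RNA is shorter than the guide length")
    rng = random.Random(seed)
    starts = rng.sample(range(n_sites), min(n, n_sites))
    designs = []
    for start in sorted(starts):
        site = target[start:start + guide_len]   # potential hybridization region
        guide = reverse_complement(site)          # guide hybridizes to the site
        designs.append({"start": start, "site": site,
                        "guide": guide, "crrna": dr + guide})
    return designs
```

A call such as design_crrnas(target, dr="ACGU") returns one record per designated region; each record can then be characterized and ranked by the features described herein.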
Also provided are a non-naturally occurring, synthesized or engineered crRNA selected and generated according to the method, along with a vector, a nucleic acid molecule, a library of vectors or nucleic acid molecules, and a composition comprising the crRNA. Methods and uses of the disclosed crRNA(s), vector(s), nucleic acid molecule(s), library(ies) and composition(s) are also provided, for example, in the treatment of a disease associated with an abnormal RNA, in a genome-wide screening of functional RNA, and detecting, knocking-down, editing, or modifying a target RNA. More details are described below.
The methods and compositions described herein provide optimal Cas13 crRNA designs for high target RNA knock-down efficacy. Additionally, such methods and compositions address, among other issues, how mismatches relative to the target site affect Cas13d activity and leverage this aspect for the development of novel biotechnologies.
Technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs and by reference to published texts, which provide one skilled in the art with a general guide to many of the terms used in the present application. The definitions contained in this specification are provided for clarity in describing the components and compositions herein and are not intended to limit the claimed invention.
As used herein, “crRNA” is an abbreviation of clustered regularly interspaced short palindromic repeats (CRISPR) RNA, which is a nucleic acid molecule composed of a direct repeat (DR) stem loop sequence and a guide sequence. The terms “guide RNA”, “guide” or “guide sequence” refer to a nucleic acid sequence which can hybridize to a sequence (hybridization region or target region) of a target RNA. The guide is capable of complexing with Cas13d protein and providing targeting specificity and binding ability for Cas13d. In one embodiment, the guide RNA is about 20 nucleotides (nt) to about 33 nt. In a further embodiment, the guide RNA is about 20 nt, about 21 nt, about 22 nt, about 23 nt, about 24 nt, about 25 nt, about 26 nt, about 27 nt, about 28 nt, about 29 nt, about 30 nt, about 31 nt, about 32 nt, or about 33 nt. In one embodiment, the guide RNA is about 27 nt. Existence of a sequence in a target RNA similar to a protospacer or protospacer adjacent motif (PAM) was not found in the CRISPR-Cas13d system as further detailed in the Detailed Description and Examples.
As used herein, nucleotide residues in a crRNA or a portion of it (for example, a guide, or a stem loop), unless specified, are numbered as illustrated in
In one embodiment, features with either positive or negative correlation are denoted with the subscript ‘max’ or ‘min’, respectively, in Table 7 as well as in
In the embodiments relating to a nucleic acid molecule or a vector or uses thereof, a nucleic acid molecule encoding a crRNA may be in operative association with an RNA pol III promoter. An RNA pol III promoter is a promoter that is sufficient to direct accurate initiation of transcription by the RNA polymerase III machinery, wherein the RNA polymerase III (also referred to as RNAP III or Pol III) is an RNA polymerase transcribing DNA to synthesize 5S ribosomal RNA (rRNA), transfer RNA (tRNA), crRNA, and other small RNAs. A variety of Polymerase III promoters which can be used are publicly or commercially available, for example the U6 promoter, the promoter fragments derived from H1 RNA genes or U6 snRNA genes of human or mouse origin or from any other species. In addition, pol III promoters can be modified/engineered to incorporate other desirable properties such as the ability to be induced by small chemical molecules, either ubiquitously or in a tissue-specific manner. For example, in one embodiment the promoter may be activated by tetracycline. In another embodiment, the promoter may be activated by IPTG (lacI system). See, U.S. Pat. Nos. 5,902,880A and 7,195,916B2. In another embodiment, a Pol III promoter from various species might be utilized, such as human, mouse or rat.
As used herein a “target RNA” refers to an RNA molecule or a nucleic acid molecule to which a guide sequence is designed to target, e.g. have complementarity, where hybridization between a target RNA and a guide promotes the formation of a CRISPR-Cas13d complex. In one embodiment, the target RNA comprises at least 20 nt (or at least 23 nt, or at least 87 nt, or at least 100 nt) RNA residues or a modification thereof. In a further embodiment, the target RNA comprises at least 20 nt contiguous RNA residues or a modification thereof. The region of a target RNA which is capable of hybridizing to a guide of a crRNA is referred to herein as a potential hybridization region. Such target RNA, a hybridization region therein, a crRNA which the hybridization region of the target RNA may hybridize to, and a guide of the crRNA are corresponding to each other.
As used herein, the term “seed” region or any other grammatical variation thereof means a critical region of the target sequence of Class 2, Type VI enzymes (e.g., Cas13d) that must be strictly complementary to the CRISPR RNA guide to ensure knock-down efficacy. Mismatches between the target and CRISPR RNA guide sequence can contribute to off-target activity. In one embodiment, the critical Cas13d seed region is defined as the region located between guide RNA nucleotides 15 to 21. In another embodiment, the seed region is defined as the region located between guide RNA nucleotides 15 to 21, with its center at nucleotide 18 relative to the guide RNA 5′ end. Within the seed region, single mismatches lead to diminished guide enrichment, while mismatches outside the seed region were better tolerated (see
As used herein, the term “match” or any other grammatical variation thereof, when referring to two nucleotide residues in one or two nucleic acid molecule(s), means a nt residue (which may be an RNA or a DNA) is complementary to the other (which may also be an RNA or a DNA). As it is known to one of skill in the art, guanine is the complementary base of cytosine, and adenine is the complementary base of thymine in DNA and of uracil in RNA. The nucleotide residues matching with each other, for example, in a secondary structure (observed or predicted) of the nucleic acid molecule(s) are a pair of nucleotide residues (nt), or paired nt. In a further embodiment, if a nt has no matching nt in the nucleic acid molecule(s), the nt is referred to as an unpaired nt or a mismatch.
Hybridization is the process of complementary base pairs (nucleotide residues) binding to form a double helix. The term “hybridization” or any other grammatical variation thereof refers to at least two regions from one single nucleic acid molecule or from two or more nucleic acid molecules, in which at least one nucleotide residue in one region matches a nucleotide residue in another region. In one embodiment, each of the nt in the first region matches to a nt in the second region. In a further embodiment, each of the nt in the first region matches to each of the nt in the second region. In another embodiment, one or more mismatch(es) may be found between two regions, for example one mismatch, two mismatches, two consecutive mismatches, two nonconsecutive mismatches, three or more mismatches (consecutive or nonconsecutive).
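As a simple illustration of matches and mismatches between two hybridizing regions, the Python sketch below lists the guide positions that do not pair with a target site and reports which of them fall within guide nucleotides 15 to 21, the seed region defined herein. The function names are hypothetical, only the complementary pairs named above (G with C, A with U or T) are counted as matches, and both sequences are assumed to be written 5′ to 3′ and to have equal length.

```python
# Complementary pairs as defined in the text (T is read as U).
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}


def mismatch_positions(guide: str, target_site: str) -> list:
    """Return 1-based guide positions (from the guide 5' end) that are unpaired.

    The target site is reversed so that index i of the reversed target
    faces index i of the guide in the antiparallel duplex.
    """
    g = guide.upper().replace("T", "U")
    t = target_site.upper().replace("T", "U")[::-1]
    if len(g) != len(t):
        raise ValueError("guide and target site must have equal length")
    return [i + 1 for i, pair in enumerate(zip(g, t)) if pair not in PAIRS]


def seed_mismatches(guide: str, target_site: str, seed=(15, 21)) -> list:
    """Subset of mismatch positions lying inside the seed region (inclusive)."""
    first, last = seed
    return [p for p in mismatch_positions(guide, target_site)
            if first <= p <= last]
```

A guide whose only mismatch lies at position 18 would be reported by both functions, whereas a mismatch at position 3 would appear only in the output of mismatch_positions.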
Nucleic acid secondary structure is the set of base pairing interactions within a single nucleic acid polymer or between two polymers. It can be represented as a list of bases which are paired in a nucleic acid molecule. Nucleic acid secondary structure can be determined from atomic coordinates (tertiary structure) obtained by X-ray crystallography, often deposited in the Protein Data Bank. Current methods include 3DNA/DSSR and MC-annotate. Methods for nucleic acid secondary structure prediction are also available, for example those relying on a nearest neighbor thermodynamic model. A common method to determine the most probable structures given a sequence of nucleotides makes use of a dynamic programming algorithm that seeks to find structures with low free energy. The lower the free energy is, the more stable the secondary structure is. Thus, minimum free energy (MFE) has been used in characterizing a secondary structure. In one embodiment, the minimum free energy (MFE) of a crRNA secondary structure was derived using RNAfold [--gquad] on the full-length crRNA sequence. See, ref 7. Also, the MFE of a secondary structure formed by two regions hybridizing to each other (for example a target RNA and its corresponding guide) is referred to as a hybridization MFE. In one embodiment, target RNA unpaired probability (accessibility) was calculated using RNAplfold [-L 40 -W 80 -u 50] as described previously8. In order to calculate the hybridization MFE between regions of a target RNA and its corresponding guide, RNA-RNA hybridization was calculated using RNAhybrid [-s -c] using the di-nucleotide frequency derived from the target sequence9. The RNA-hybridization minimum free energy was calculated for each spacer RNA nucleotide position p over the distance d to the position p+d with its cognate target sequence.
Further, G-quadruplex is a secondary structure formed in nucleic acid by sequences that are rich in guanine. G-quadruplexes are helical structures containing guanine tetrads that can form from one, two or four strands. Four guanine bases can associate through Hoogsteen hydrogen bonding to form a square planar structure called a guanine tetrad (G-tetrad or G-quartet), and two or more guanine tetrads (from G-tracts, continuous runs of guanine) can stack on top of each other to form a G-quadruplex. G-quadruplex structures can be computationally predicted from DNA or RNA sequence motifs or by other methods available publicly or commercially. Generally, a simple pattern match is used for searching for possible intrastrand quadruplex forming sequences: d(G₃₊N₁₋₇G₃₊N₁₋₇G₃₊N₁₋₇G₃₊), that is, four runs of three or more guanines separated by three loops of one to seven nucleotides each, where N is any nucleotide base (including guanine). See, for example, Frank-Kamenetskii M D, Mirkin S M (1995). “Triplex DNA structures”. Annual Review of Biochemistry. 64 (9): 65-95. In one embodiment, RNAfold may be used to determine a presence or absence of a G-quadruplex.
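To illustrate the dynamic programming idea in its simplest form, the Python sketch below implements Nussinov-style base-pair maximization. This is a deliberately simplified stand-in: it counts base pairs rather than computing free energies, it is not the nearest neighbor thermodynamic model, and it does not reproduce RNAfold, RNAplfold or RNAhybrid. The function name, the minimum hairpin loop length of 3, and the inclusion of G-U wobble pairs are assumptions made for this example.

```python
# Pairs allowed in this toy model; G-U wobble pairs are an assumption.
ALLOWED = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"),
           ("G", "U"), ("U", "G")}


def max_base_pairs(seq: str, min_loop: int = 3) -> int:
    """Maximum number of nested base pairs (Nussinov recursion).

    dp[i][j] holds the best pair count for the subsequence i..j.
    min_loop is the minimum number of unpaired residues enclosed by a pair.
    """
    s = seq.upper().replace("T", "U")
    n = len(s)
    if n == 0:
        return 0
    dp = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = dp[i + 1][j]                      # residue i left unpaired
            for k in range(i + min_loop + 1, j + 1):
                if (s[i], s[k]) in ALLOWED:          # residue i paired with k
                    inside = dp[i + 1][k - 1]
                    outside = dp[k + 1][j] if k + 1 <= j else 0
                    best = max(best, 1 + inside + outside)
            dp[i][j] = best
    return dp[0][n - 1]
```

Energy-based tools follow the same fill-the-table strategy but score each substructure with experimentally derived stacking and loop energies rather than a pair count.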
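The simple pattern match described above translates directly into a regular expression. The Python sketch below is illustrative; the names G4_PATTERN and find_g4_motifs are hypothetical, and the loop alphabet is assumed to be A, C, G, T and U. It reports non-overlapping candidate motifs only and makes no claim about whether a quadruplex actually forms.

```python
import re

# Four runs of >= 3 guanines separated by three loops of 1 to 7 nucleotides.
G4_PATTERN = re.compile(
    r"G{3,}[ACGTU]{1,7}G{3,}[ACGTU]{1,7}G{3,}[ACGTU]{1,7}G{3,}"
)


def find_g4_motifs(seq: str) -> list:
    """Return (start index, matched text) for each candidate motif, 0-based."""
    return [(m.start(), m.group()) for m in G4_PATTERN.finditer(seq.upper())]
```

For example, find_g4_motifs("AAGGGAGGGAGGGAGGGAA") returns a single candidate starting at index 2, while a sequence with only three guanine runs returns an empty list.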
A “nucleic acid” or a “nucleotide”, as described herein, can be RNA, DNA, or a modification thereof, and can be selected, for example, from a group including: nucleic acid encoding a protein of interest, oligonucleotides, nucleic acid analogues, for example peptide-nucleic acid (PNA), pseudocomplementary PNA (pc-PNA), locked nucleic acid (LNA) etc. In certain embodiments, the terms “nucleotide”, “nucleic acid”, “nucleotide residue” and “nucleic acid residue” are used interchangeably, referring to a nucleotide in a nucleic acid polymer. In a further embodiment, consecutive nucleotide residues refer to nucleotide residues in a contiguous region of a nucleic acid polymer.
A nucleic acid molecule (RNA or DNA) or a nucleotide therein may be modified or edited. In one embodiment, such modification or editing includes 5′ capping, 3′ polyadenylation, and RNA splicing. In another embodiment, the modification or editing includes methylation (for example on an A residue resulting in a m6A), demethylation (for example, on a m6A, optionally via an RNA demethylase, including but not limited to ALKBH5), deamination (for example, from adenosine (A) to inosine (I), optionally via a tRNA-specific adenosine deaminase (ADAT), or from C to U, optionally via a pentatricopeptide repeat (PPR) protein), or amination (for example, from U to C or from G to A).
Ribonucleic acid (RNA) is a polymeric molecule essential in various biological roles in coding, decoding, regulation and expression of genes. As used herein, RNA may refer to a CRISPR guide RNA, a messenger RNA (mRNA), a mitochondrial RNA, short hairpin RNAi (shRNAi), small interfering RNA (siRNA), a mature mRNA, a primary transcript mRNA (pre-mRNA), a ribosomal RNA (rRNA), a 5.8S rRNA, a 5S rRNA, a transfer RNA (tRNA), a transfer-messenger RNA (tmRNA), an enhancer RNA (eRNA), a microRNA (miRNA), a small nucleolar RNA (snoRNA), a Piwi-interacting RNA (piRNA), a tRNA-derived small RNA (tsRNA), a small rDNA-derived RNA (srRNA), a non-coding RNA (ncRNA), long (intergenic) non-coding RNA (lincRNA/lncRNA), a single-stranded RNA (ssRNA), a circular RNA (circRNA), a vault RNA (vRNA/vtRNA), a SmY RNA, a double-stranded RNA (dsRNA), a small Cajal body-specific RNA (scaRNA), an antisense RNA (aRNA/asRNA), a ribonuclease RNA (e.g. RNase P), a non-coding regulatory RNA (e.g. 7SK RNA), RNA-viruses or single stranded DNA. In one embodiment, the target RNA is an endogenous RNA. Additionally, or alternatively, the target RNA comprises/is a CDS. In another embodiment, the target RNA comprises/is a UTR (including a 5′ UTR or a 3′ UTR). In yet another embodiment, the target RNA comprises/is an intron.
As used herein, deoxyribonucleic acid (DNA) is a polymeric molecule formed by deoxyribonucleotides, including, but not limited to, genomic DNA, double-strand DNA, single-strand DNA, DNA packaged with a histone protein, complementary DNA (cDNA which is reverse-transcribed from an RNA), mitochondrial DNA, and chromosomal DNA.
In one embodiment, the method(s) as disclosed herein is genome-wide. For example, a target RNA may be any RNA from the whole genome. In one embodiment, an off-target RNA may be any other RNA except the target RNA from the whole genome. As used herein, a genome refers to the total genetic material (e.g., DNA and RNA) of an organism.
A “vector” as used herein is a biological or chemical moiety comprising a nucleic acid sequence which can be introduced into an appropriate cell for replication or expression of said nucleic acid sequence. Common vectors include naked DNA, phage, transposon, plasmids, viral vectors, cosmids (Phillip McClean, www.ndsu.edu/pubweb/˜mcclean/plsc731/cloning/cloning4.htm) and artificial chromosomes (Gong, Shiaoching, et al. “A gene expression atlas of the central nervous system based on bacterial artificial chromosomes.” Nature 425.6961 (2003): 917-925). One type of vector is a “plasmid”, which refers to a circular double stranded DNA loop into which additional nucleic acid segments can be ligated. Another type of vector is a viral vector, wherein additional nucleic acid segments can be ligated into the viral genome. Certain vectors are capable of autonomous replication in a cell into which they are introduced (e.g., bacterial vectors having a bacterial origin of replication and episomal mammalian vectors). In certain embodiments, the vector is a lentiviral vector. Other vectors (e.g., non-episomal mammalian vectors) are integrated into the genome of a cell upon introduction into the cell, and thereby are replicated along with the cell genome.
A “viral vector” refers to a synthetic or artificial viral particle in which an expression cassette containing a nucleic acid sequence of interest is packaged in a viral capsid or envelope. Examples of viral vectors include but are not limited to lentivirus, adenoviruses (Ads), retroviruses (γ-retroviruses and lentiviruses), poxviruses, adeno-associated viruses (AAV), baculoviruses, and herpes simplex viruses. In one embodiment, the viral vector is replication defective. A “replication-defective virus” refers to a viral vector, wherein any viral genomic sequences also packaged within the viral capsid or envelope are replication-deficient; i.e., they cannot generate progeny virions but retain the ability to infect cells.
Pooled viral CRISPR “libraries” (in certain embodiments, lentiviral libraries) are a heterogeneous population of viral transfer vectors, each containing an individual crRNA targeting a single gene in a given genome.
As used herein, the term “tag” refers to a peptide or polypeptide whose presence can be readily detected. In certain embodiments, the tag is selected from one or more of the following: a FLAG tag, a poly(His) tag, a chitin binding protein (CBP) tag, a maltose binding protein (MBP) tag, a Strep tag, a glutathione-S-transferase (GST) tag, a thioredoxin (TRX) tag, a poly(NANP) tag, a V5 tag, a HA tag, a Spot tag, a T7 tag, a NE tag, a fluorescence tag, a Green Fluorescent Protein (GFP) tag, and a MYC tag. In one embodiment, the FLAG tag has a sequence of DYKDDDK, SEQ ID NO:47. In certain embodiments, the tag is a fluorescent protein such as Green fluorescent protein (GFP).
A “reporter molecule”, which is used to indicate the presence of a molecule to which it is conjugated (for example, a crRNA, a nucleic acid molecule, a protein, or a Cas13d), is readily known by one of skill in the art. In one embodiment, the reporter molecule may be a tag or a nucleic acid molecule encoding a tag. In another embodiment, the reporter molecule may be an enzyme or a nucleic acid molecule expressing the enzyme, such as an E. coli lacZ enzyme, or a chloramphenicol acetyltransferase (CAT), or a luciferase.
As used herein, the term “selectable marker” refers to a molecule, a peptide or polypeptide whose presence can be readily detected in a target cell when selective pressure is applied to the cell. In certain embodiments, the selectable marker is a puromycin resistance gene, a kanamycin resistance gene, a chloramphenicol resistance gene, a blasticidin S resistance gene, a geneticin resistance gene, a hygromycin resistance gene, an ampicillin resistance gene, a tetracycline resistance gene, or a G418 resistance gene.
The term “target cell” may refer to any cell of interest. Thus, a target cell may refer to a cell having a target RNA or suspected of having a target RNA. In certain embodiments herein, the term “target cell” refers to a cell of various mammalian species. In one embodiment, the target cell is a mammalian cell. In a further embodiment, the target cell might be a eukaryotic cell, a prokaryotic cell, an embryonic stem cell, a cancer cell, a neuronal cell, an epithelial cell, an immune cell, an endocrine cell, a muscle cell, an erythrocyte, or a lymphocyte.
The term “mammal”, or grammatical variations thereof, is intended to encompass a singular “mammal” and plural “mammals,” and includes, but is not limited to, humans; primates such as apes, monkeys, orangutans, and chimpanzees; canids such as dogs and wolves; felids such as cats, lions, and tigers; equids such as horses, donkeys, and zebras; food animals such as cows, pigs, and sheep; ungulates such as deer and giraffes; rodents such as mice, rats, hamsters and guinea pigs; wild animals, such as bears; domesticated animals; livestock; and laboratory animals. In some preferred embodiments, a mammal is a human.
As used herein, the term “subject” includes any mammal in need of these methods or compositions, including particularly humans. The subject may be male or female.
As used herein, the terms “therapy”, “treatment” and any grammatical variations thereof shall mean any of prevention, delay of outbreak, reducing the severity of the disease symptoms, and/or removing the disease symptoms (to cure) in a subject in need.
The Cas13d protein is a Class 2, Type VI CRISPR effector guided by a single RNA (crRNA). Two higher eukaryotes and prokaryotes nucleotide-binding (HEPN) domains have been found in the Cas13d, flanking a helical domain. See, for example, WO 2019/010384 A1, US 2019/0169595A1, Zhang C, et al. (2018). Structural Basis for the RNA-Guided Ribonuclease Activity of CRISPR-Cas13d. Cell 175, 212-223.e217, golden.com/wiki/CRISPR-Cas13d, and zlab.bio/cas13, which publication is incorporated herein by reference in its entirety. While the term Class 2, Type VI is a broader genus, of which Cas13d is exemplary, throughout the Specification, one of skill in the art would appreciate that the use of the terms “Cas13d” or “Cas13d and a variant thereof” also encompass other Class 2, Type VI proteins, and the terms can be interchangeable. Cas13d and a variant thereof includes, e.g., a wild type or naturally occurring Cas13d protein, an ortholog of a Cas13d, a functional variant thereof, or another modified variant as disclosed.
Orthologs are genes in different species that evolved from a common ancestral gene by speciation. Normally, orthologs retain the same function in the course of evolution. In some embodiments, the Cas13d is selected from a RfxCas13d from Ruminococcus flavefaciens strain XPD3002, an AdmCas13d from Anaerobic digester metagenome 15706, EsCas13d from Eubacterium siraeum DSM15702, P1E0Cas13d from Gut metagenome assembly P1E0-k21, UrCas13d from Uncultured Ruminoccocus sp., RffCas13d from Ruminoccocus flavefaciens FD1, and RaCas13d from Ruminoccocus albus. In one embodiment, the Cas13d protein is a RfxCas13d or a variant thereof. The amino acid sequences of the Cas13d orthologs are publically available. In one embodiment, the Cas13d has an amino acid sequence as provided by a Protein Data Bank (PDB) accession number 6OAW_B or 6OAW_A or 6E9F_A or 6E9E_A or 6IV9_A, or an amino acid sequence as provided by the UniProtKB identifier B0MS50 (B0MS50_9FIRM) or A0A1C5SD84 (A0A1C5SD84_9FIRM). Each of the sequences of these references is incorporated by reference herein in its entirety.
In one embodiment, a variant of Cas13d may be a functional variant of the Cas13d protein which is a protein or a polypeptide which shares the same biological function with Cas13d. A functional variant of the Cas13d protein might be a Cas13d protein with 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, about 20, about 25, about 30, about 35, about 40, about 45, about 50, about 60, about 70, about 80, about 90, about 100, about 110, about 120, about 130, about 140, about 150, about 160, about 170, about 180, about 200, about 220, about 240, about 260, about 280, about 300, about 330, about 360, about 390 or more conserved amino acid substitution(s). Identifying an amino acid for a possible conserved substitution, determining a substituted amino acid, as well as the methods and techniques involved in incorporating the amino acid substitution into a protein are well-known to one of skill in the art. See sift.jcvi.org/ and (Ng & Henikoff, Predicting the Effects of Amino Acid Substitutions on Protein Function, 2006; Ng & Henikoff, Predicting the effects of coding non-synonymous variants on protein function using the SIFT algorithm, 2009; Ng P C, 2003; Ng & Henikoff, Accounting for Human Polymorphisms Predicted to Affect Protein Function, 2002; Sim, et al., 2012), each of which is incorporated herein by reference in its entirety.
In another embodiment, a Cas13d variant is a Cas13d protein mutated to increase or decrease or abolish its nuclease activity. Without wishing to be bound by theory, although we specifically sought to define rules for active Cas13d, our model is transferable to inactive (nuclease-null or dead) Cas13d effector proteins, as the main feature is defined by crRNA folding/accessibility. In yet another embodiment, a Cas13d variant is a Cas13d protein conjugated to another molecule, for example, a reporter molecule, a splicing factor, an enzyme editing or modifying an RNA, a polyA factor, a nuclear localization signal (NLS), an organelle specific signal, a cytosolic signal, or a nuclear-export signal (NES).
As used herein, a nuclear localization signal or sequence (NLS) is an amino acid sequence that ‘tags’ a protein for import into the cell nucleus by nuclear transport. Typically, this signal consists of one or more short sequences of positively charged lysines or arginines exposed on the protein surface. Different nuclear localized proteins may share the same NLS. An NLS has the opposite function of a nuclear export signal (NES), which targets proteins out of the nucleus. Further, a cytosolic signal directs a protein into cytosol of the target cell while an organelle specific signal guides a protein into a specific organelle (for example, cytoplasm, ribosome, or mitochondria).
In one embodiment, used herein is a Cas13d variant having a nuclear localization signal (NLS). In a further embodiment, one amino acid sequence of the Cas13d variant is listed below,
In one embodiment, a Cas13d or a variant thereof can further comprise a nuclear localization signal (NLS). In another embodiment, the Class 2, Type VI protein, e.g., Cas13d, can further encompass or be fused to a cytosolic signal or a nuclear-export signal (NES). In still another embodiment, the Cas13d or a variant thereof is fused to an endoplasmic reticulum localization element (see plasmid 79055, labeled ERM-APEX2 by Addgene at www.addgene.org/79055/). In yet a further embodiment, the Cas13d or a variant thereof is fused to an Outer Mitochondrial membrane localization element (see the APEX2-OMM plasmid #79056 described by Addgene at www.addgene.org/79056/). In another embodiment, the Cas13d or a variant thereof is fused to a Mitochondria localizing element (such as plasmid 72480 Mito-V5-APEX2 described by Addgene at www.addgene.org/72480/). In still other embodiments, the Cas13d or a variant thereof is fused to a Nucleolus localizing element (NIK3x), a Nuclear lamina localizing element (LMNA) or a Nuclear pore complex localizing element (SENP2). See, e.g., Fazal, F M et al, 2019 Atlas of Subcellular RNA Localization Revealed by APEX-Seq, Cell, 178:473-490, incorporated by reference herein.
A variety of algorithms and/or computer programs are well known in the art or commercially available for alignment of multiple amino acid sequences (e.g., BLAST, ExPASy; FASTA; using, e.g., Needleman-Wunsch algorithm, Smith-Waterman algorithm). Alignments are performed using any of a variety of publicly or commercially available Multiple Sequence Alignment Programs. Sequence alignment programs are available for amino acid sequences, e.g., the “Clustal Omega”, “Clustal X”, “MAP”, “PIMA”, “MSA”, “BLOCKMAKER”, “MEME”, and “Match-Box” programs. Generally, any of these programs are used at default settings, although one of skill in the art can alter these settings as needed. Alternatively, one of skill in the art can utilize another algorithm or computer program which provides at least the level of identity or alignment as that provided by the referenced algorithms and programs. See, e.g., J. D. Thompson et al, Nucl. Acids. Res., “A comprehensive comparison of multiple sequence alignment programs”, 27(13):2682-2690 (1999).
In some embodiments, the nucleic acid sequence encoding Cas13d or a variant thereof may be codon-optimized for expression in a eukaryotic cell, such as a mammalian cell. Methods of codon-optimization are known and have been described previously (e.g. International patent publication No. WO 96/09378). A sequence is considered codon-optimized if at least one non-preferred codon as compared to a wild type sequence is replaced by a codon that is more preferred. Herein, a non-preferred codon is a codon that is used less frequently in an organism than another codon coding for the same amino acid. A codon that is more preferred is a codon that is used more frequently in a target cell than a non-preferred codon. The frequency of codon usage for a specific organism can be found in codon frequency tables, such as in www.kazusa.jp/codon. Preferably more than one non-preferred codon, preferably most or all non-preferred codons, are replaced by codons that are more preferred. Preferably the most frequently used codons in an organism are used in a codon-optimized sequence. Replacement by preferred codons generally leads to higher expression. Numerous different nucleic acid molecules can encode the same polypeptide as a result of the degeneracy of the genetic code.
Skilled persons may, using routine techniques, make nucleotide substitutions that do not affect the amino acid sequence encoded by the nucleic acid molecules to reflect the codon usage of any particular host organism in which the polypeptides are to be expressed. Therefore, unless otherwise specified, a “nucleic acid sequence encoding an amino acid sequence” includes all nucleotide sequences that are degenerate versions of each other and that encode the same amino acid sequence. Nucleic acid sequences can be cloned using routine molecular biology techniques, or generated de novo by DNA synthesis, which can be performed using routine procedures by service companies having business in the field of DNA synthesis and/or molecular cloning (e.g. GeneArt™, GenScript®, Life Technologies™, Eurofins).
In one embodiment, the Cas13d coding sequence is operably linked to a regulatory element to ensure expression in a target cell. In a further embodiment, the regulatory element comprises an inducible promoter, such as a doxycycline inducible promoter. In a preferred embodiment, the regulatory element(s) comprises an RNA pol II promoter. An RNA pol II promoter is a promoter that is sufficient to direct accurate initiation of transcription by the RNA polymerase II machinery, wherein the RNA polymerase II (RNAP II or Pol II) is an RNA polymerase found in the nucleus of eukaryotic cells, catalyzing the transcription of DNA to synthesize precursors of messenger RNA (mRNA) and most small nuclear RNA (snRNA) and microRNA.
A variety of Polymerase II promoters that can be used within the compositions and methods described herein are publicly or commercially available to a skilled artisan, for example, viral promoters obtained from the genomes of viruses including promoters from polyoma virus, fowlpox virus (UK 2,211,504), adenovirus (such as Adenovirus 2 or 5), herpes simplex virus (thymidine kinase promoter), bovine papilloma virus, avian sarcoma virus, cytomegalovirus (CMV), a retrovirus (e.g., MoMLV, or RSV LTR), Hepatitis-B virus, Myeloproliferative sarcoma virus promoter (MPSV), VISNA, and Simian Virus 40 (SV40); other heterologous mammalian promoters including the actin promoter, β-actin promoter, immunoglobulin promoter, heat-shock protein promoters, human Ubiquitin-C promoter, and PGK promoter. Additional promoters are readily known and available. See, e.g., (Kadonaga, 2012), WO 2014/15134, and WO 2016/054153. In one particular embodiment, the promoter is a CMV promoter. In a further embodiment, the promoter is an EF-1 Alpha Short (EFS) promoter, or a Tet operator (tetO) promoter.
The term “regulatory element” or “regulatory sequence” refers to expression control sequences which are contiguous with the nucleic acid sequence of interest (for example, a Cas13d coding sequence or a sequence for expressing a crRNA) and expression control sequences that act in trans or at a distance to control the nucleic acid sequence of interest. As described herein, regulatory elements comprise, but are not limited to: promoter; enhancer; transcription factor; transcription terminator; efficient RNA processing signals such as splicing and polyadenylation signals (polyA); sequences that stabilize cytoplasmic mRNA, for example Woodchuck Hepatitis Virus (WHP) Posttranscriptional Regulatory Element (WPRE); sequences that enhance translation efficiency (e.g., Kozak consensus sequence); sequences that enhance protein stability; and when desired, sequences that enhance secretion of the encoded product. Also, see Goeddel; Gene Expression Technology: Methods in Enzymology 185, Academic Press, San Diego, Calif. (1990). Regulatory sequences include those which direct constitutive expression of a nucleic acid sequence in many types of target cell and those which direct expression of the nucleic acid sequence only in certain target cells (e.g., tissue-specific regulatory sequences). Furthermore, the Cas13d can be delivered by way of a vector comprising a regulatory sequence to direct synthesis of the Cas13d at specific intervals, or over a specific time period. It will be appreciated by those skilled in the art that the design of the vector can depend on such factors as the choice of the target cell, the level of expression desired, and the like.
As used herein, “operably linked” sequences or sequences “in operative association” include both expression control sequences that are contiguous with the nucleic acid sequence of interest (for example, a Cas13d coding sequence or a sequence for expressing a crRNA) and expression control sequences that act in trans or at a distance to control the nucleic acid sequence of interest.
As used herein, “polyadenylation” is the addition of a poly(A) tail to a messenger RNA, which is important for the nuclear export, translation, and stability of mRNA. Examples of suitable polyA sequences include, e.g., Rabbit globin poly A, SV40, SV50, bovine growth hormone (bGH), human growth hormone, and synthetic polyAs.
Optionally, the nucleic acid sequence encoding a Cas13d protein further comprises a reporter gene or a nucleic acid encoding a selectable marker, which may include sequences encoding geneticin, hygromycin, ampicillin or puromycin resistance, among others. A reporter gene, which is used as an indication of whether the Cas13d coding sequence has been incorporated into and/or expressed as a functional protein in the target cell or not, is readily selected by one of skill in the art, including without limitation, the E. coli lacZ gene, the chloramphenicol acetyltransferase (CAT) gene, or a gene encoding a fluorescent protein such as Green fluorescent protein (GFP).
As used herein, “carrier” includes any and all solvents, dispersion media, vehicles, coatings, diluents, antibacterial and antifungal agents, isotonic and absorption delaying agents, buffers, carrier solutions, suspensions, colloids, and the like. The use of such media and agents for pharmaceutical active substances is well known in the art. Supplementary active ingredients can also be incorporated into the compositions. The phrase “pharmaceutically acceptable” refers to molecular entities and compositions that do not produce an allergic or similar untoward reaction when administered to a subject. Delivery vehicles such as lipid particle, liposomes, nanocapsules, nanosphere, nanoparticle, microparticles, microspheres, lipid particles, vesicles, and the like, may be used for the introduction of the compositions of the present invention into suitable target cells.
By “biological sample” is meant any biological fluid, cell or tissue of a subject that is suitable for use, such as, for example, cell-containing body fluids such as blood, sperm, cerebral spinal fluid, saliva, sputum or urine, leukocyte fractions, buffy coat, feces, swabs, puncture fluids, skin fragments, whole organisms or parts thereof, organs, organ fragments, tissues and tissue parts of a subject. Still other suitable samples are in the form of sections, biopsies, fine needle aspirates or tissue sections, isolated cells, for example in the form of adherent or suspended cell cultures, plants, plant parts, plant tissues or plant cells, bacteria, viruses, yeasts and fungi, without being limited thereto. In one embodiment, the biological sample contains a target RNA. In one embodiment, a suitable biological sample is a tissue section from human tissue, such as a tumor.
The terms “a” or “an” refer to one or more. For example, “an expression cassette” is understood to represent one or more such cassettes. As such, the terms “a” (or “an”), “one or more,” and “at least one” are used interchangeably herein. In certain embodiments, the term “one or more” refers to any integer from one to the maximum including any integer therebetween.
The terms “another”, “first”, “second”, “third”, “fourth”, “fifth” and “sixth” are used throughout this specification as reference terms to distinguish between various forms and components of the compositions and methods, for example, first or second promoter.
As used herein, the term “about” means a variability of plus or minus 10% from the reference given, unless otherwise specified.
The words “comprise”, “comprises”, and “comprising” are to be interpreted inclusively rather than exclusively, i.e., to include other unspecified components or process steps. The words “consist”, “consisting”, and its variants, are to be interpreted exclusively, rather than inclusively, i.e., to exclude components or steps not specifically recited.
As described herein, the terms “reduce”, “decrease”, “alleviate”, “ameliorate”, “improve”, “delay”, “earlier”, “low”, “high”, “mitigate”, any grammatical variation thereof, or any similar terms indicating a change, mean a variation of about 5 fold, about 2 fold, about 1 fold, about 90%, about 80%, about 70%, about 60%, about 50%, about 40%, about 30%, about 20%, about 10%, about 5% compared to a reference (e.g., a guide generated without using the disclosed methods, or a non-targeting control), unless otherwise specified.
Also, it is noted that any range as disclosed herein, (for example, an MFE range, a hybridization MFE range, a nucleotide (nt) range, a percentage range, or a log10 value range) includes the endpoint and every number/nt/percentage/value therebetween, unless specified.
It is also noted that any embodiment listed with respect to a crRNA, a nucleic acid molecule, a vector, a library, a composition, any other component, a method, or a use, may be combined with any other embodiments with respect to a crRNA, a nucleic acid molecule, a vector, a library, a composition, any other component, a method, or a use.
B. Method of Designing, Generating and Ranking crRNA(s)
In one aspect, provided is a method for generating and selecting a crRNA which is capable of forming a complex with a CRISPR-associated protein 13d (Cas13d) or a variant thereof, and directing the complex to a target RNA. The method comprises the following steps:
(1) randomly designating a potential hybridization region in the target RNA;
(2) designing a guide which is capable of hybridizing to the hybridization region, and designing a crRNA sequence comprising the guide and a DR stem loop accordingly;
(3) ranking each crRNA based on features of the crRNA and its corresponding target RNA.
In one embodiment, the designated target RNA is longer than 87 nucleotides (nt). In a further embodiment, the designated target RNA is longer than 100 nt or 200 nt or 300 nt or 400 nt or 500 nt. In certain embodiments, the ranking does not consider a protospacer in the target RNA for directing the complex. In a further embodiment, a crRNA whose nt 15 to nt 21 (or nt 17 to 18, or nt 18) match the corresponding hybridization region of the target RNA without mismatches ranks higher than those with mismatches. In yet a further embodiment, a crRNA having three or more mismatches to its corresponding target RNA ranks lower compared to those having 0, 1 or 2 mismatches.
In one embodiment, a crRNA with a feature falling within the range of a positively-correlated feature and outside the range of a negatively-correlated feature ranks higher. In one embodiment, the features are listed in Tables 2 and 4-7 and
In certain embodiments, (a) minimum free energy (MFE) value of the crRNA is considered in the ranking step. In one embodiment, a crRNA having an MFE value of (a) within the following range ranks higher than those falling out of the following range: from −22.8 to −12.8, or from −20.9 to −14.3, or from −23.4 to −14.5, or from −18.7 to −15.9, or about −17.1, or about −17.3, each of the value ranges including the endpoints and all numbers therebetween. In one embodiment, the MFE is calculated via a publicly available software of predicting RNA secondary structure for single stranded RNAs (such as crRNAs), for example, RNAfold. See, Lorenz, R. et al. ViennaRNA Package 2.0. Algorithms Mol. Biol. 6, 26 (2011).
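The MFE-based ranking described above can be sketched as follows. This is a minimal illustration only: it assumes the MFE of each crRNA has already been computed (for example, with RNAfold, which reports values in kcal/mol), and the function names, crRNA identifiers and example MFE values are hypothetical.

```python
# Widest MFE range recited above; both endpoints are included.
DEFAULT_MFE_RANGE = (-22.8, -12.8)


def mfe_in_range(mfe, mfe_range=DEFAULT_MFE_RANGE):
    """Return True if the crRNA MFE lies within the range, endpoints included."""
    low, high = mfe_range
    return low <= mfe <= high


def rank_by_mfe(crrna_mfes, mfe_range=DEFAULT_MFE_RANGE):
    """Order crRNA names so that those with an in-range MFE come first.

    crrna_mfes maps a crRNA name to its precomputed MFE. The sort is
    stable, so crRNAs within the same group keep their input order.
    """
    return sorted(
        crrna_mfes,
        key=lambda name: not mfe_in_range(crrna_mfes[name], mfe_range),
    )


# Hypothetical example values, for illustration only.
example = {"crRNA_1": -25.0, "crRNA_2": -17.1, "crRNA_3": -12.8}
ranked = rank_by_mfe(example)
```

A narrower range recited above, such as (−18.7, −15.9), can be passed as `mfe_range` to apply a stricter criterion.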
In certain embodiments, a crRNA having a DR stem loop which is about 30 nt long ranks higher. In certain embodiments, a crRNA having a 30 nt or 31 nt long DR stem loop ranks higher compared to those having a DR stem loop of other lengths.
Based on its secondary structure, the DR stem loop is composed of, from the 5′ end to the 3′ end, a 5′ end, a stem loop which is capable of forming a self-hybridizing structure via paired nucleotides matching with each other, and a 3′ end. The 5′ and 3′ ends of the DR stem loop do not match to the target RNA or any nucleotide of the DR stem loop. In one embodiment, a crRNA having a stem loop comprising 4 unpaired nucleotides in the middle of the sequence forming a loop ranks higher. In yet a further embodiment, a crRNA having a stem loop with an additional two unpaired nucleotide residues in the stem loop forming a bulge, ranks higher. In one embodiment, a crRNA ranks higher if the 5′ end of its DR stem loop is one unpaired nucleotide.
In certain embodiments, the ranking is further determined based on the feature of (g): presence of a DR stem loop having a motif selected from the following:
As used herein, “(n” and “)n” represent a pair of nucleotides matching with each other, and “.” represents an unpaired nucleotide in bulge or loop. As defined above, the part from (1 to )1 is the self-hybridizing stem loop of the DR stem loop while the nt at the 5′ or 3′ of the self-hybridizing stem loop are the 5′ end or 3′ end, respectively.
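To illustrate how the structural features of a DR stem loop discussed above could be read from a structure string, the following sketch parses plain dot-bracket notation (“(” and “)” for paired nucleotides, “.” for unpaired ones, without the pair indices used above). It assumes a single hairpin, and the example structure is hypothetical; it is not one of the disclosed motifs.

```python
def parse_dot_bracket(structure):
    """Return the base pairs of a dot-bracket string as sorted (i, j) index pairs."""
    stack, pairs = [], []
    for i, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(i)
        elif symbol == ")":
            if not stack:
                raise ValueError("unbalanced ')' at index %d" % i)
            pairs.append((stack.pop(), i))
        elif symbol != ".":
            raise ValueError("unexpected symbol %r" % symbol)
    if stack:
        raise ValueError("unbalanced '(' in structure")
    return sorted(pairs)


def describe_stem_loop(structure):
    """Summarize a single-hairpin structure (assumption: exactly one hairpin)."""
    pairs = parse_dot_bracket(structure)
    if not pairs:
        raise ValueError("structure has no paired nucleotides")
    outer_open, outer_close = pairs[0]    # outermost base pair
    inner_open, inner_close = pairs[-1]   # innermost base pair, closing the loop
    loop = inner_close - inner_open - 1
    unpaired_inside = structure[outer_open:outer_close + 1].count(".")
    return {
        "five_prime_unpaired": outer_open,
        "three_prime_unpaired": len(structure) - 1 - outer_close,
        "loop_size": loop,
        "bulge_nt": unpaired_inside - loop,  # unpaired nt within the stem itself
    }


# Hypothetical 30-nt structure: one unpaired 5' nt, a 2-nt bulge and a 4-nt loop.
example_structure = "." + "((((" + ".." + "((((" + "...." + "))))" + "))))" + "......."
summary = describe_stem_loop(example_structure)
```

With such a summary, the criteria above (a 4-nt loop, a 2-nt bulge, a single unpaired nucleotide at the 5′ end) reduce to simple comparisons on the returned values.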
In a further embodiment, a crRNA forming an effective guide RNA and having a higher ranking is provided with a DR stem loop sequence as recited in TABLE 9 below. With sequence (I) being the sequence found in Ruminococcus flavefaciens (Rfx), we found that the DR sequence modifications II and III showed improvement relative to sequence I. While we introduced specific sequence changes (e.g., replaced nucleotide 1 from A to U for sequence I to sequence II), we anticipate that any nucleotide replacement with a similar consequential effect likely yields similar benefits. For example, replacing the A nucleotide in position 1 with either of U or C and to some degree G will similarly disrupt base pair capabilities between nucleotide 1 and the U at position 24. Therefore, we indicate nucleotide changes according to IUPAC nomenclature in addition to the conventional abbreviations A for Adenine, C for Cytosine, G for Guanine and T (or U) for Thymine (or Uracil) by use of the abbreviations: R for A or G; Y for C or T (or U); S for G or C; W for A or T (or U); K for G or T (or U); M for A or C; B for C or G or T (or U); D for A or G or T (or U); H for A or C or T (or U); V for A or C or G; N for any base; . or - to represent a nucleotide gap.
In one embodiment, changes in the nucleotide at position 1 or 24 can have the same consequence of base pair disruption. Thus, any change introduced for the five-prime base pair mate can be mirrored for the three-prime mate. For example:
AACCCCUACCAACUGGUCGGGGUAUGAAAC SEQ ID NO: 46 are anticipated to yield the same effect.
In another embodiment, removing nucleotides from the DR 5′ end or the addition of hindering nucleotides 5′ to the sequence is predicted to alter the DR function in the same way. For example,
ACCCCUACCAACUGGUCGGGGUUUGAAAC SEQ ID NO: 20 likely yield the same effect. In still another aspect, nucleotide removal or addition, alone and in conjunction, in sequences I-VII is anticipated to produce effective DR stem loops for effective guides. The use of such DR stem loops is also anticipated to increase the efficacy of binding of even mismatched crRNA.
Thus Table 9 provides exemplary DR stem loops comprising one of the following sequences or a modification thereof.
BACCCCUACCAACUGGUCGGGGUUUGAAAC
NBCCCCUACCAACUGGUCGGGGUUUGAAAC
NNDCCCUACCAACUGGUCGGGGUUUGAAAC
-ACCCCUACCAACUGGUCGGGGUUUGAAAC
--CCCCUACCAACUGGUCGGGGUUUGAAAC
---CCCUACCAACUGGUCGGGGUUUGAAAC
(N)nAACCCCUACCAACUGGUCGGGGUUUGAAAC
BACCCCVACCAACUGGUCGGGGUUUGAAAC
NBCCCCVACCAACUGGUCGGGGUUUGAAAC
NNDCCCVACCAACUGGUCGGGGUUUGAAAC
-ACCCCVACCAACUGGUCGGGGUUUGAAAC
--CCCCVACCAACUGGUCGGGGUUUGAAAC
(N)nAACCCCVACCAACUGGUCGGGGUUUGAAAC
BACCCCUACCAACUGGUDGGGGUUUGAAAC
NBCCCCUACCAACUGGUDGGGGUUUGAAAC
NNDCCCUACCAACUGGUDGGGGUUUGAAAC
-ACCCCUACCAACUGGUDGGGGUUUGAAAC
--CCCCUACCAACUGGUDGGGGUUUGAAAC
---CCCUACCAACUGGUDGGGGUUUGAAAC
(N)nAACCCCUACCAACUGGUDGGGGUUUGAAAC
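As an illustration of how the IUPAC-coded DR stem loop sequences listed above can be compared against a concrete DR sequence, the following sketch expands the ambiguity codes into a regular expression. The function names are hypothetical, gap symbols are treated as "no nucleotide at this position", and the repeated (N)n prefix is not handled.

```python
import re

# IUPAC nucleotide codes as defined above, written for RNA (T is read as U).
IUPAC_RNA = {
    "A": "A", "C": "C", "G": "G", "U": "U", "T": "U",
    "R": "AG", "Y": "CU", "S": "GC", "W": "AU", "K": "GU", "M": "AC",
    "B": "CGU", "D": "AGU", "H": "ACU", "V": "ACG", "N": "ACGU",
}


def iupac_to_regex(pattern):
    """Compile an IUPAC-coded pattern into a regular expression.

    Gap symbols ('-' or '.') are skipped, i.e. no nucleotide is expected
    at that position.
    """
    parts = []
    for code in pattern.upper():
        if code in "-.":
            continue
        bases = IUPAC_RNA[code]
        parts.append(bases if len(bases) == 1 else "[" + bases + "]")
    return re.compile("".join(parts))


def matches_dr_pattern(pattern, sequence):
    """Return True if the whole sequence is an instance of the IUPAC pattern."""
    rna = sequence.upper().replace("T", "U")
    return iupac_to_regex(pattern).fullmatch(rna) is not None


# B at position 1 allows C, G or U, but not A.
hit = matches_dr_pattern("BACCCCUACCAACUGGUCGGGGUUUGAAAC",
                         "UACCCCUACCAACUGGUCGGGGUUUGAAAC")
miss = matches_dr_pattern("BACCCCUACCAACUGGUCGGGGUUUGAAAC",
                          "AACCCCUACCAACUGGUCGGGGUUUGAAAC")
```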
In yet a further embodiment, a crRNA having a DR stem loop composed of a G-residue at the 5′ end followed by one of sequences (I) to (XIII) ranks higher. Additionally or alternatively, the ranking is further determined based on a feature of (h) which is absence of a G-quadruplex in the crRNA. In certain embodiments, the presence of a G-quadruplex is determined by RNAfold. In certain embodiments, a crRNA without a G-quadruplex ranks higher.
In certain embodiments, a crRNA with a more stable hybridization between guide and its target sequence ranks lower. Such hybridization may be assessed via hybridization MFE between a target RNA and its corresponding regions of the crRNA, wherein a lower hybridization MFE indicates a more stable hybridization. Without wishing to be bound by theory, it is believed that the most stable guide-target interactions render the crRNA-Cas13d complex inactive. In one embodiment, a crRNA with a more stable hybridization between regions of the guide (which is not the full length guide) and its target sequence ranks lower.
In one embodiment, the crRNA(s) with the highest ranking is selected for directing a Cas13d-crRNA complex to a target RNA. In certain embodiments, a crRNA having a positively correlated feature as disclosed ranks higher than those without the positively correlated feature(s). In yet another embodiment, a crRNA or its corresponding target RNA having more positively correlated features within the identified ranges ranks higher. In certain embodiments, a crRNA having a negatively correlated feature as disclosed ranks lower than those without the negatively correlated feature(s). In another embodiment, a crRNA or its corresponding target RNA having more negatively-correlated features within the identified ranges ranks lower.
In certain embodiments, a crRNA ranks lower if it has an off-target activity or has a higher off-target activity. In one embodiment, an off-target activity is determined if an RNA other than the target RNA comprises the hybridization region of the target RNA, or if an RNA other than the target RNA comprises the hybridization region of the target RNA with one nucleotide residue difference outside of nt −14 to nt −20 of the target RNA; or if an RNA other than the target RNA comprises the hybridization region of the target RNA with two nonconsecutive nucleotide residue differences outside of nt −14 to nt −20 of the target RNA. In certain embodiments, the RNA other than the target RNA is termed as “off-target RNA”. In certain embodiments, the crRNA and/or the crRNA-Cas13d complex is designed to apply to a target cell. In a further embodiment, the off-target RNA also exists in the target cell. In yet a further embodiment, the off-target RNA is at least 87 nt long, or at least 100 nt long, or at least 200 nt long, or at least 300 nt long, or at least 500 nt long.
In one aspect, provided is a method for predicting on-target activity of a crRNA. The crRNA composed of a DR stem loop and a guide is capable of forming a complex with a Cas13d or a variant thereof and directing the complex to the target RNA. The method comprises characterizing one or more of the features (any one or combination of the features as disclosed herein) of a plurality of crRNAs and their corresponding target RNAs; assessing on-target activity of each of the crRNAs; constructing a model using the characterization data and the on-target activity data by a modeling method. In one embodiment, the modeling method comprises Random Forest modeling. Additionally, or alternatively, the modeling method comprises one or more of methods listed in Table 3. In a further embodiment, input of the model comprises characterization(s) of one or more of features of a crRNA and its corresponding target RNA. In yet a further embodiment, output of the model is an on-target score of the crRNA. As used herein an on-target score is an assigned number (for example, an integer, rational number or irrational number) which positively correlates to on-target activity of a crRNA.
In one embodiment, the predicting method further comprises applying the constructed model to a crRNA and generating an on-target score of the crRNA. In a further embodiment, the predicting method comprises applying the constructed model to two or more crRNAs (such as a first crRNA and a second crRNA), and generating on-target scores of the crRNAs. In yet a further embodiment, the crRNAs share the same target RNA. In one embodiment, the crRNA is capable of hybridizing to a different (overlapping or non-overlapping) hybridization region of the same target RNA. In certain embodiments, the predicting method further comprises comparing the generated on-target scores and selecting the crRNA having the higher/highest score for directing the crRNA-Cas13d complex to the target RNA.
In one embodiment, the features of a crRNA and its corresponding target RNA are one or more of the following or the ones listed in one or more of those listed in Tables 2 and 4-7 and
As used herein, an on-target activity of a crRNA (i.e., efficacy of a crRNA) may refer to one or more of the following: efficacy of the crRNA in forming a complex with a Cas13d protein or a variant thereof; efficacy of the crRNA in hybridizing to the corresponding target RNA; efficacy of the crRNA in directing a Cas13d-crRNA complex to the target RNA; efficacy of the crRNA in reducing the corresponding target RNA; and enrichment or abundance or depletion of the crRNA (or the guide of the crRNA or the target RNA) after applying the crRNA and a Cas13d or a variant thereof to a cell or cell culture. As an illustrative embodiment shown in the Example, the crRNA efficacy was determined by quantifying crRNA abundances in sorted and unsorted cell populations. The value represents the log2 fold change of sorted divided by input (for example, unsorted) counts. Owing to the screen design, higher values indicate higher efficacies/efficiencies for target knockdown. As used herein, an on-target score may be used to quantify the on-target activity. In one embodiment, an on-target score is an efficiency quartile as used herein (Q1 to Q4, also shown as bin1 to bin4). In another embodiment, an on-target score is a measured or calculated efficacy, for example, a fold change of crRNA/guide/target RNA abundance before vs. after applying the crRNA.
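The log2 fold change and quartile scoring described above can be sketched as follows. This is an illustrative sketch only: the pseudocount (added to avoid division by zero), the rank-based binning and the convention that bin 4 holds the highest values are assumptions not taken from the text, and the function names are hypothetical.

```python
import math


def log2_fold_change(sorted_count, input_count, pseudocount=1.0):
    """log2 of sorted over input counts; the pseudocount is an assumption."""
    return math.log2((sorted_count + pseudocount) / (input_count + pseudocount))


def quartile_bins(scores):
    """Assign each score to bin 1-4 by rank (assumed: bin 4 = highest scores)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    bins = [0] * len(scores)
    for rank, index in enumerate(order):
        bins[index] = 1 + (4 * rank) // len(scores)
    return bins


# Hypothetical counts: (sorted, input) per crRNA.
counts = [(7, 1), (1, 7), (3, 3), (15, 3)]
scores = [log2_fold_change(s, i) for s, i in counts]
bins = quartile_bins(scores)
```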
In one aspect, provided is a method for predicting off-target activity of a crRNA. As disclosed herein, the crRNA is composed of a DR stem loop and a guide, and is capable of forming a complex with a Cas13d or a variant thereof and directing the complex to the target RNA. The predicting method comprises characterizing one or more of the features of a plurality of crRNAs and their corresponding target RNAs; assessing off-target activity of each of the crRNAs; and constructing a model using the characterization and the off-target activity acquired by a modeling method. In one embodiment, the modeling method comprises Random Forest modeling. In another embodiment, the modeling method comprises a deep learning model. Additionally, or alternatively, the model-constructing method comprises one or more of methods listed in Table 3. In one embodiment, input of the model comprises characterization(s) of one or more of features of a crRNA and its corresponding target RNA. Additionally or alternatively, output of the model is an off-target score of the crRNA positively correlating to off-target activity of the crRNA.
In one embodiment, the predicting method further comprises applying the constructed model to a crRNA and generating an off-target score of the crRNA. In a further embodiment, the predicting method further comprises applying the constructed model to two or more crRNA (for example, a first crRNA and a second crRNA) and generating off-target scores of the crRNAs. In yet a further embodiment, the crRNAs share the same target RNA. In one embodiment, the crRNA is capable of hybridizing to a different (overlapping or non-overlapping) hybridization region of the same target RNA. In certain embodiments, the predicting method further comprises comparing the generated off-target scores and selecting the crRNA having the lower/lowest score for directing the crRNA-Cas13d complex to the target RNA and avoiding off-target effect(s).
In certain embodiments, the features discussed with respect to the method for predicting off-target activity are any one or any combination of the features disclosed herein. In one embodiment, the features are one or more of the following: presence or absence of an off-target RNA that comprises the hybridization region of the target RNA; presence or absence of an off-target RNA that comprises the hybridization region of the target RNA with one nucleotide residue difference outside of nt −14 to nt −20 of the target RNA; and presence or absence of an off-target RNA that comprises the hybridization region of the target RNA with two nonconsecutive nucleotide residue differences outside of nt −14 to nt −20 of the target RNA. In one embodiment, the nt numbering runs from the 5′ end to the 3′ end of the target RNA, with the nt that pairs with the start of the guide match designated nt 0.
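The three off-target criteria above can be sketched as a simple site comparison. The sketch rests on a stated assumption about the numbering: the 3′-most nucleotide of the hybridization region, written 5′ to 3′, is taken as nt 0, so that upstream positions are negative and nt −14 to nt −20 lie 14 to 20 nucleotides upstream of it. The function names and example sequences are hypothetical.

```python
# Positions nt -20 to nt -14 (inclusive) in which no difference is tolerated.
CRITICAL_OFFSETS = range(-20, -13)


def mismatch_offsets(target_site, candidate_site):
    """Return the offsets (relative to nt 0) at which two equal-length sites differ."""
    if len(target_site) != len(candidate_site):
        raise ValueError("sites must have the same length")
    last = len(target_site) - 1  # index of nt 0 under the assumed numbering
    return [i - last
            for i, (a, b) in enumerate(zip(target_site, candidate_site))
            if a != b]


def is_potential_off_target(target_site, candidate_site):
    """Apply the three criteria: identical site, one difference, or two
    nonconsecutive differences, all outside nt -14 to nt -20."""
    offsets = mismatch_offsets(target_site, candidate_site)
    if any(offset in CRITICAL_OFFSETS for offset in offsets):
        return False
    if len(offsets) <= 1:
        return True
    if len(offsets) == 2:
        return abs(offsets[0] - offsets[1]) > 1
    return False


# Hypothetical 23-nt hybridization region, for illustration only.
site = "ACGUACGUACGUACGUACGUACG"
one_mismatch_at_nt0 = site[:22] + "A"
```

A transcriptome-wide search would apply this comparison to every window of each candidate off-target RNA; that scan is omitted here.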
As used herein, an off-target activity refers to an activity in which a crRNA-Cas13d complex binds to and optionally nicks an RNA which is not the target RNA. An off-target effect refers to binding of a crRNA-Cas13d complex with an RNA which is not the target RNA and any consequence(s) thereof, for example, reduction of a non-target RNA, reduction of a peptide or a protein encoded by the non-target RNA, increase or reduction of a peptide or a protein whose expression is regulated by the non-target RNA, and any physiological change(s) relating thereto.
In a further aspect, provided is a method for selecting a crRNA from two or more crRNAs for directing a complex which comprises the crRNA and a CRISPR-associated protein 13d (Cas13d) or a variant thereof (i.e., Cas13d-crRNA complex or crRNA-Cas13d complex) to a target RNA. As disclosed herein, the crRNA is composed of a DR stem loop and a guide. The method comprises: determining an on-target score of each of the two or more crRNAs using the method as disclosed herein; and determining an off-target score of each of the two or more crRNAs using the method as disclosed herein. In a further embodiment, the method comprises selecting the crRNA with the highest on-target score and the lowest off-target score for directing the Cas13d-crRNA complex to the target RNA. In another embodiment, the method comprises constructing a model for incorporating the on-target score and the off-target score into one selection score via a modeling method. In one embodiment, a selection score equals an on-target score multiplied by a factor, minus the corresponding off-target score, wherein the factor can be any number (for example, an integer, a ratio, a rational number or an irrational number). In a further embodiment, the factor is a positive number. In one embodiment, the factor is any one of the following: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, or 50. In one embodiment, the two or more crRNAs are capable of directing the Cas13d-crRNA complex to the same target RNA. In a further embodiment, the two or more crRNAs are capable of hybridizing to different (overlapping or nonoverlapping) hybridization regions of the same target RNA.
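By way of illustration only, the selection score described above (on-target score multiplied by a factor, minus the off-target score) may be sketched in Python as follows. The names ScoredGuide, selection_score and select_best are hypothetical and are not part of the disclosed software; the scores are assumed to have already been produced by the on-target and off-target prediction methods.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class ScoredGuide:
    """Hypothetical container for one crRNA and its two predicted scores."""
    name: str
    on_target: float
    off_target: float


def selection_score(on_target: float, off_target: float, factor: float = 1.0) -> float:
    """Selection score = factor * on-target score - off-target score."""
    if factor <= 0:
        # The positive-factor embodiment is assumed here for illustration.
        raise ValueError("factor is assumed to be a positive number")
    return factor * on_target - off_target


def select_best(guides: Iterable[ScoredGuide], factor: float = 1.0) -> ScoredGuide:
    """Return the crRNA with the highest selection score."""
    return max(guides, key=lambda g: selection_score(g.on_target, g.off_target, factor))
```

A larger factor weights on-target efficacy more heavily relative to off-target risk; the choice of factor is left to the practitioner.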
As used herein, a modeling method refers to a mathematical or statistical analysis, for example, random forest models, classification and regression tree models, boosting, Bayesian networks, Markov random field, linear and generalized linear models, boosted tree models, neural networks, support vector machines, general chi-squared automatic interaction detector models, interactive tree models, multiadaptive regression spline, machine learning classifiers, a multi hypothesis testing, a principal component analysis, and any combinations thereof. These statistical models are well known to those of skill in the art. Any other suitable algorithm may also be used in performing the characterization process. In variations, the analysis can be characterized by a learning style including any one or more of: supervised learning (e.g., using logistic regression, using back propagation neural networks), unsupervised learning (e.g., using an Apriori algorithm, using K-means clustering), semi-supervised learning, reinforcement learning (e.g., using a Q-learning algorithm, using temporal difference learning), and any other suitable learning style.
Furthermore, the analysis can implement any one or more of: a regression algorithm (e.g., ordinary least squares, logistic regression, stepwise regression, multivariate adaptive regression splines, locally estimated scatterplot smoothing, etc.), an instance-based method (e.g., k-nearest neighbor, learning vector quantization, self-organizing map, etc.), a regularization method (e.g., ridge regression, least absolute shrinkage and selection operator, elastic net, etc.), a decision tree learning method (e.g., classification and regression tree, iterative dichotomiser 3, C4.5, chi-squared automatic interaction detection, decision stump, random forest, multivariate adaptive regression splines, gradient boosting machines, etc.), a Bayesian method (e.g., naive Bayes, averaged one-dependence estimators, Bayesian belief network, etc.), a kernel method (e.g., a support vector machine, a radial basis function, a linear discriminant analysis, etc.), a clustering method (e.g., k-means clustering, expectation maximization, etc.), an associated rule learning algorithm (e.g., an Apriori algorithm, an Eclat algorithm, etc.), an artificial neural network model (e.g., a Perceptron method, a back-propagation method, a Hopfield network method, a self-organizing map method, a learning vector quantization method, etc.), a deep learning algorithm (e.g., a restricted Boltzmann machine, a deep belief network method, a convolution network method, a stacked auto-encoder method, etc.), a dimensionality reduction method (e.g., principal component analysis, partial least squares regression, Sammon mapping, multidimensional scaling, projection pursuit, etc.), an ensemble method (e.g., boosting, bootstrapped aggregation, AdaBoost, stacked generalization, gradient boosting machine method, random forest method, etc.), and any suitable form of analysis.
The machine learning classifier may be a discriminant analysis (DA) machine learning classifier, a nearest neighbor (NN) machine learning classifier, a random forest (RF) machine learning classifier, or a support vector machine (SVM). A DA machine learning classifier may be a linear discriminant analysis (LDA) classifier, or a quadratic discriminant analysis (QDA) classifier. In one embodiment, the SVM classifier may have three kernels, including a linear kernel, a radial basis function (RBF) kernel, and a polynomial kernel. In another embodiment, the machine learning classifier may employ a convolutional neural network (CNN). In one embodiment, a modeling method may be performed on a computer.
As used herein, characterizing a feature or a grammatical variation thereof refers to a qualitative or quantitative manner of describing the feature. For example, it may be presence or absence of the feature, a numeric range of the feature, or a parameter/number/percentage calculated.
In certain embodiments, the ranking and/or any of the predicting methods as disclosed herein are performed in silico in software. Such software is, for example, an R language program, a Python program or similar. Other code performing the same function may also be used.
In another aspect, another method is described for screening or predicting on-target activity of a clustered regularly interspaced short palindromic repeats (CRISPR) RNA (crRNA), whereby the crRNA is capable of forming a complex with an RNA-targeting CRISPR-associated protein or a variant thereof and directing the complex to the target RNA. The method includes the step of (a) characterizing a plurality of crRNAs and their corresponding target by features comprising the presence of both a seed region located between guide RNA nucleotide bases 15 to 21 relative to the guide RNA 5′ end, characterized by a stabilizing, enriched sequence of G and C bases and an accessible target region characterized by an enriched sequence of A and U, surrounding the seed region on the 5′ end, 3′ end or both the 5′ and 3′ ends. In another embodiment an additional step (b) involves assessing on-target activity of each of the crRNAs of (a). In yet a further embodiment, an additional step (c) involves applying a machine learning model or deep learning model using the characterization of (a) and the on-target or off-target activity of (b). In one embodiment, input of the model comprises characterization(s) of the seed region and target regions of each crRNA and its corresponding target RNA, and output of the model is an on-target score of the crRNA, and wherein a higher score indicates a ranked on-target activity. In another embodiment the input and output can involve off-target scores. Still another step of the method includes (d) applying the model constructed in step (c) to a first crRNA and generating an on-target score or off-target score of the first crRNA.
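As a non-limiting sketch of the characterization in step (a), the following Python code computes the G/C fraction of the seed region (guide nt 15 to 21, counted from the guide 5′ end) and the A/U fraction of the guide segments flanking it. The function names, the returned keys, and the choice to treat the entire remainder of the guide as the flanks are assumptions made for illustration; they are not the feature definitions of Table 2 or Table 5. Because A pairs with U, the A/U fraction of a guide flank equals the A/U fraction of the perfectly matched target segment.

```python
def base_fraction(seq: str, bases: str) -> float:
    """Fraction of positions in seq whose base is one of `bases` (0.0 if seq is empty)."""
    if not seq:
        return 0.0
    return sum(b in bases for b in seq) / len(seq)


def characterize_guide(guide: str, seed_start: int = 15, seed_end: int = 21) -> dict:
    """Describe a guide by seed G/C content and flanking A/U content.

    Positions are 1-based and counted from the guide 5' end.
    The flank boundaries used here are illustrative assumptions.
    """
    g = guide.upper().replace("T", "U")  # accept DNA-style input
    seed = g[seed_start - 1:seed_end]
    flank5 = g[:seed_start - 1]
    flank3 = g[seed_end:]
    return {
        "seed_gc": base_fraction(seed, "GC"),
        "flank5_au": base_fraction(flank5, "AU"),
        "flank3_au": base_fraction(flank3, "AU"),
    }
```

Values of this kind could serve as part of the model input in step (c), alongside the measured on-target activity of step (b).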
The features of crRNA(s) and the corresponding target RNA(s) in step (a) are selected from any combination of at least the top 1, 2, 5, 10, 15, 20, 25, 30, 35 or more features of Table 5; any combination of 2 or more of the features of Table 5, at least the top 1, 2, 3, 4, 5, or 6 features of the RFGFP features listed in Table 2, at least the top 1, 2, 5, 10, 15, 20, 25, 30, or 33 features of the RFcombined features listed in Table 2; any combination of 2 or more features listed in Table 2 and having a DR sequence of Table 9.
In another embodiment, the method can include step (d) which further comprises applying the model constructed in step (c) to a second and further additional crRNA having the same target RNA, and generating an on-target score of the second crRNA.
In another embodiment, the on-target activity of step (b) is efficacy of the crRNA in forming a complex with a Cas13d protein or a variant thereof. In another embodiment, the on-target activity of step (b) is efficacy of the crRNA in hybridizing to the corresponding target RNA. In another embodiment, the on-target activity of step (b) is efficacy of the crRNA in directing a Cas13d-crRNA complex to the target RNA. In another embodiment, the on-target activity of step (b) is efficacy of the crRNA in reducing the corresponding target RNA after hybridizing to the target RNA. In another embodiment, the on-target activity of step (b) is enrichment or depletion of the CRISPR pooled screen readout. In another embodiment, the on-target activity of step (b) is efficacy of the guide of the crRNA or the target RNA after applying the crRNA and a Cas13d or a variant thereof to a cell or cell culture, a non-human organism or an in vitro, cell-free assay system. In another embodiment, the on-target activity of step (b) is the efficacy of the crRNA comprising guide sequences which mismatch the target, to allow the Class 2, Type VI effector protein to bind the target, but not elicit target degradation. In yet another embodiment, the method involves identifying on-target activity that includes binding without cleavage.
These methods can be applied to a target RNA which is a messenger RNA (mRNA), a mature mRNA, a primary transcript mRNA (pre-mRNA), a ribosomal RNA (rRNA), a 5.8S rRNA, a 5S rRNA, a transfer RNA (tRNA), a transfer-messenger RNA (tmRNA), an enhancer RNA (eRNA), a small interfering RNA (siRNA), a microRNA (miRNA), a small nucleolar RNA (snoRNA), a Piwi-interacting RNA (piRNA), a tRNA-derived small RNA (tsRNA), a small rDNA-derived RNA (srRNA), a non-coding RNA (ncRNA), a long (intergenic) non-coding RNA (lincRNA/lncRNA), a single-stranded RNA (ssRNA), a circular RNA (circRNA), a vault RNA (vRNA/vtRNA), a SmY RNA, a double-stranded RNA (dsRNA), a small Cajal body-specific RNA (scaRNA), an antisense RNA (aRNA/asRNA), a ribonuclease RNA (e.g. RNase P), a non-coding regulatory RNA (e.g. 7SK RNA), RNA-viruses, single stranded DNA, a coding sequence (CDS) of an RNA, a 5′ untranslated region (UTR) of an RNA, a 3′ UTR of an RNA, or an intron of an RNA, or a satellite repeat sequence embedded in any of said RNA targets.
Yet other embodiments of the methods involve use of crRNA guides characterized by one or more of the DR sequences of Table 9. In other embodiments of the methods, the crRNA comprises a guide sequence which mismatches the target and allows the Class 2, Type VI effector protein to bind the target, but not elicit target degradation.
Also provided is a non-naturally occurring and/or synthesized and/or engineered crRNA ranked and selected by a method as disclosed herein.
C. crRNA
Also provided is a clustered regularly interspaced short palindromic repeat (CRISPR) RNA (crRNA) composed of a direct repeat (DR) stem loop sequence and a guide sequence, which is capable of hybridizing to a hybridization region of a target RNA. In one embodiment, the crRNA is a Class 2, Type VI crRNA which comprises a direct repeat (DR) stem loop sequence and a guide or spacer sequence. In another embodiment, the crRNA is characterized by having a DR sequence selected from one or more of the DR sequences of Table 9 above. In one embodiment the crRNA has a DR of SEQ ID NO: 2. In one embodiment the crRNA has a DR of SEQ ID NO: 14. In one embodiment the crRNA has a DR of SEQ ID NO: 25. In one embodiment the crRNA has a DR of SEQ ID NO: 36. In still other embodiments, the crRNA has a DR of any of SEQ ID NO: 1-46, or a variant thereof.
In one embodiment, the crRNA is non-naturally occurring. In another embodiment, the crRNA is synthesized. In another embodiment, the crRNA is an engineered sequence. The crRNA is capable of forming a complex with a Class 2, Type VI protein, such as Cas13d or a variant identified above. The crRNA is capable of directing the complex to the target RNA. In one embodiment, provided is a crRNA designed, generated, or selected by a method described herein.
In one embodiment, the crRNA does not require a protospacer in the target RNA for directing the complex. In a further embodiment, nt 15 to nt 21 of the crRNA matches with its corresponding hybridization “seed” region of the target RNA. In yet a further embodiment, one or two mismatches to the target RNA may be found outside of nt 15 to nt 21 of the crRNA. However, three or more mismatches are not allowed between the guide of the crRNA and its corresponding hybridization region of the target RNA.
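The mismatch rule stated above can be illustrated with a short Python sketch: no mismatch within guide nt 15 to nt 21, and at most two mismatches in total. The function names are hypothetical, and only Watson-Crick pairing is considered (G-U wobble pairs are counted as mismatches), which is a simplifying assumption rather than a statement about Cas13d behavior.

```python
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def reverse_complement(rna: str) -> str:
    """Reverse complement of an RNA sequence written 5' to 3'."""
    return "".join(COMPLEMENT[b] for b in reversed(rna.upper()))


def mismatch_positions(guide: str, target_site: str) -> list:
    """1-based guide positions (from the guide 5' end) that do not pair with the target site."""
    if len(guide) != len(target_site):
        raise ValueError("guide and target site must have the same length")
    expected = reverse_complement(target_site)  # the perfectly matched guide
    return [i for i, (g, e) in enumerate(zip(guide.upper(), expected), start=1) if g != e]


def guide_tolerated(guide: str, target_site: str,
                    seed: tuple = (15, 21), max_mismatches: int = 2) -> bool:
    """True if the seed is fully matched and total mismatches do not exceed the limit."""
    mismatches = mismatch_positions(guide, target_site)
    if any(seed[0] <= p <= seed[1] for p in mismatches):
        return False
    return len(mismatches) <= max_mismatches
```

The same check, applied to candidate off-target transcripts rather than the intended target, indicates which off-target sites a guide might still tolerate.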
Without wishing to be bound by the theory, the center of the nt 15 to nt 21 of the crRNA is theorized to coincide with conserved contacts between a helical domain in RfxCas13d protein and the backbone of the guide-target hybrid interface. This interaction resides opposite of the nt 17-18 of the guide within the target RNA. The helical domain is placed between both higher eukaryotes and prokaryotes nucleotide-binding (HEPN) domains needed for target cleavage, and mutation of the interacting amino acids in EsCas13d completely abolished target cleavage. See, Ref 28. Mismatches at around nt 18 of the crRNA may likely impair HEPN-domain activity.
In certain embodiments, the crRNA has one or more of the positively correlated features but not the negatively correlated features. In one embodiment, the features are listed in one or more of Tables 2 and 4-7 and
In certain embodiments, the crRNA has a DR stem loop which is about 30 nt long, for example, 29 nt, 30 nt, or 31 nt long. Based on its secondary structure, the DR stem loop is composed of, from the 5′ end to the 3′ end, a 5′ end, a stem loop which is capable of forming a self-hybridizing structure via paired nucleotides matching with each other, and a 3′ end. The 5′ and 3′ ends of the DR stem loop do not match to the target RNA or any nucleotide of the stem loop. In one embodiment, the stem loop comprises unpaired nucleotides. In a further embodiment, the middle 4 nucleotide residues of the stem loop are not paired and form a loop. In yet a further embodiment, there are an additional two unpaired nucleotide residues in the stem loop forming a bulge. One example of loop and bulge can be seen in
In one embodiment, the crRNA has a stem loop with a motif selected from the following:
wherein “(n” and “)n” represent a pair of nucleotides which match with each other, and “.” represents an unpaired nucleotide in a bulge or loop. As defined above, the self-hybridization stem loop of the DR stem loop starts from a nucleotide noted as “(1” and ends at a nucleotide noted “)1” in the motifs of (I) to (V). The “ . . . ” in the center of the motifs represents the unpaired nucleotides in a loop, while a “.” flanked by “(” or “)” on both sides represents an unpaired nucleotide in a bulge.
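For illustration, the loop and bulge definitions above can be applied to a structure written in plain dot-bracket notation (without the pair indices n) using the following Python sketch. The function names and the example string used in the accompanying usage are hypothetical and do not reproduce any of motifs (I) to (V). Indices are 0-based; unpaired nucleotides at the 5′ or 3′ end, outside the stem loop, are not classified.

```python
def pair_table(db: str) -> dict:
    """Map each paired index to its partner for a dot-bracket string."""
    stack, pairs = [], {}
    for i, ch in enumerate(db):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced ')' at index %d" % i)
            j = stack.pop()
            pairs[i], pairs[j] = j, i
        elif ch != ".":
            raise ValueError("unexpected character %r" % ch)
    if stack:
        raise ValueError("unbalanced '('")
    return pairs


def classify_unpaired(db: str) -> tuple:
    """Return (loop_indices, bulge_indices) of unpaired positions inside the stem loop."""
    pairs = pair_table(db)
    loop, bulge = [], []
    for i, ch in enumerate(db):
        if ch != ".":
            continue
        left = next((k for k in range(i - 1, -1, -1) if db[k] != "."), None)
        right = next((k for k in range(i + 1, len(db)) if db[k] != "."), None)
        if left is None or right is None:
            continue  # unpaired 5' or 3' end, outside the stem loop
        if db[left] == "(" and db[right] == ")" and pairs[left] == right:
            loop.append(i)   # enclosed by a single closing base pair
        else:
            bulge.append(i)  # interrupts one strand of the stem
    return loop, bulge
```

For example, classify_unpaired("((.((....)))).." ) reports the four central dots as the loop and the single dot between the opening brackets as a bulge. A dot-bracket string of this kind can be obtained from a structure prediction tool such as RNAfold.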
In one embodiment, the DR stem loop further contains 1 to 8 nucleotides at the 3′ end of the motif and preceding the guide. Additionally, or alternatively, the DR stem loop further contains a G residue at the 5′ end of the motif.
In certain embodiments, the DR stem loop comprises one of the following sequences SEQ ID NO: 1 to 13, or a modification thereof or the related sequences of Table 9, identified above:
In yet a further embodiment, the DR stem loop is composed of a G-residue at the 5′ end followed by one of sequences (I) to (XIII). In certain embodiments, the crRNA does not have a G-quadruplex. In one embodiment, the presence or absence of a G-quadruplex is determined by RNAfold. In certain embodiments, each nt from nt −14 to nt −20 of the target RNA matches its corresponding region of the crRNA. In certain embodiments, the guide is about 23 nt long to about 33 nt long, or about 27 nt to about 30 nt long, or about 27 nt long, or about 23 nt long.
The efficacy of a crRNA in forming a complex with a Cas13d protein or a variant thereof and directing the complex to the target RNA may be measured. In one embodiment, the efficacy is at least about 10%, about 20%, about 30%, about 40%, about 50%, about 60%, about 70%, about 80%, about 90%, about 1 fold, about 1.5 fold, about 2 fold, about 3 fold, about 5 fold, about 10 fold higher than that of another crRNA. In another embodiment, at least about 10%, about 20%, about 30%, about 40%, about 50%, about 60%, about 70%, about 80%, about 90%, about 95%, or about 99% of the target RNA is hybridized to the crRNA. In yet another embodiment, the amount of the target RNA is reduced by at least about 10%, about 20%, about 30%, about 40%, about 50%, about 60%, about 70%, about 80%, about 90%, about 95%, or about 99% after hybridization to the crRNA and being nicked by Cas13d or a variant thereof.
In another embodiment, the crRNA or nucleic acid molecule described herein comprises a guide sequence which mismatches the target and allows the Class 2, Type VI effector protein to bind the target, but not elicit target degradation.
Also provided is a non-naturally occurring, synthesized (chemically or recombinantly) or engineered nucleic acid molecule comprising one or more of the crRNA(s) as disclosed, or a nucleic acid sequence complementary to the crRNA(s), or a nucleic acid sequence encoding the crRNA(s), or a nucleic acid sequence complementary to the crRNA coding sequence. In one embodiment, the nucleic acid molecule is a DNA.
In one embodiment, the nucleic acid molecule is a mature RNA. In one embodiment, the nucleic acid molecule comprises a DNA sequence encoding the crRNA(s). In a further embodiment, the nucleic acid molecule further comprises a first regulatory sequence directing expression of the crRNA(s). For example, the first regulatory sequence may comprise without limitation, a Pol III promoter, for example, a U6 promoter, a H1 promoter, a T7 promoter, and a 7SK promoter.
In another embodiment, the nucleic acid molecule further comprises a DNA sequence encoding a Class 2, Type VI effector protein or a variant thereof. In one embodiment, the encoded protein is any Class 2, Type VI protein. In a further embodiment, the protein is a Cas13d protein. In another embodiment, the effector protein is a RfxCas13d from Ruminococcus flavefaciens strain XPD3002. In another embodiment, other Cas13d proteins may be utilized, for example, an AdmCas13d from Anaerobic digester metagenome 15706, EsCas13d from Eubacterium siraeum DSM15702, P1E0Cas13d from Gut metagenome assembly P1E0-k21, UrCas13d from Uncultured Ruminococcus sp., RffCas13d from Ruminococcus flavefaciens FD1, and RaCas13d from Ruminococcus albus. In a further embodiment, the feature(s), ranges of the feature(s), and any combination thereof may be adjusted according to a Cas13d other than RfxCas13d.
In a further embodiment, the Cas13d or a variant thereof further comprises a nuclear localization signal (NLS) or a cytosolic signal or a nuclear-export signal (NES). In another embodiment, the Cas13d or a variant thereof is fused to an endoplasmic reticulum localization element, an outer mitochondrial membrane localization element, a mitochondria localizing element, a nucleolus localizing element (NIK3x), a nuclear lamina localizing element (LMNA) or a nuclear pore complex localizing element (SENP2). In yet a further embodiment, the Cas13d or a variant thereof is capable of nicking a target RNA. In one embodiment, the Cas13d or a variant thereof has been engineered and does not have a nuclease activity, and is therefore referred to as a dead Cas13d.
In one embodiment, the DNA sequence encoding the effector protein, e.g., Cas13d, is under the control of a regulatory sequence directing expression thereof in a mammalian cell. In yet a further embodiment, the nucleic acid molecule comprises a second regulatory sequence which directs expression of the Cas13d protein or a variant thereof. In one embodiment, the second regulatory sequence comprises an RNA polymerase II (Pol II) promoter, for example, an EF-1 Alpha Short (EFS) promoter, or a Tet operator (tetO) promoter. In a further embodiment, the second regulatory sequence comprises one or more of the following: a polyadenylation (poly(A)) sequence, a selectable marker, a tag, and a Woodchuck Hepatitis Virus (WHV) Posttranscriptional Regulatory Element (WPRE) sequence. In certain embodiments, the tag is selected from one or more of the following: a FLAG tag, a poly(His) tag, a chitin binding protein (CBP) tag, a maltose binding protein (MBP) tag, a Strep tag, a glutathione-S-transferase (GST) tag, a thioredoxin (TRX) tag, a poly(NANP) tag, a V5 tag, a HA tag, a Spot tag, a T7 tag, a NE tag, a fluorescence tag, a Green Fluorescent Protein (GFP) tag, and a MYC tag. In one embodiment, the FLAG tag has a sequence of DYKDDDK, SEQ ID NO:47. In certain embodiments, the selectable marker is a puromycin resistance gene, a kanamycin resistance gene, a chloramphenicol resistance gene, a blasticidin S resistance gene, an ampicillin resistance gene, a tetracycline resistance gene, or a G418 resistance gene.
In certain embodiments, one nucleic acid molecule comprises the sequence for the crRNA and a separate nucleic acid molecule encodes the sequence of the Cas13d protein.
Also provided is a vector comprising a crRNA and/or a nucleic acid molecule as disclosed. In one embodiment, the vector is a viral vector, a retrovirus vector, a lentiviral vector, an adenovirus vector, an adeno-associated virus vector, or a hybrid viral vector. In another embodiment, the vector is a non-viral vector or an analogous carrier, such as a nanoparticle, a lipid complex, a polymer, a quantum dot, a carbon nanotube, a magnetic nanoparticle, or a gold nanoparticle. Further, a vector (for example, a plasmid) for producing the vector is provided.
In still another embodiment, a ribonucleoprotein (RNP) complex as described herein includes a Class 2, Type VI effector protein and a crRNA, as defined herein. In still another embodiment, a cell is provided which contains one or more of the crRNA, nucleic acid molecules, RNP or compositions described herein. The cell may be mammalian, preferably a human cell. In other embodiments, the cell may be bacterial.
Additionally, provided is a library comprising a plurality of crRNAs or nucleic acid molecules or RNPs or vectors or cells as disclosed. In one embodiment, each of the crRNAs is capable of directing a Cas13d or a variant thereof to a different target RNA or a different region of one target RNA. In another embodiment, the library is a lentiviral library.
A composition is also provided comprising a pharmaceutically acceptable carrier and one or more crRNA(s), RNPs, or nucleic acid molecule(s) or vector(s), or cells as disclosed. These compositions may be for pharmaceutical use and thus useful in the treatment of a disease associated with an abnormal RNA or misregulation of an RNA transcript. Some examples of these diseases are the diseases mentioned specifically above.
In yet a further embodiment, the crRNA, RNPs, pharmaceutical compositions, cells, vectors and libraries may also comprise crRNA having guide sequences which mismatch the target and allow the Class 2, Type VI effector protein to bind the target, but not elicit target degradation when used in the methods known to those of skill in the art as well as the methods described and exemplified specifically herein.
One or more of the crRNAs, nucleic acid molecules, RNPs, vectors, cells, and libraries described herein are useful in a variety of methods including without limitation, treating a disease associated with an abnormal RNA; screening functional RNA(s); knocking-down, detecting, or editing a target RNA; or detecting or editing splicing, alternative isoforms, intron retention or differential UTR usage, or binding but not degrading the target.
In one aspect, the crRNA(s), nucleic acid molecule(s), RNP(s), vector(s), cell(s), or composition(s) containing one or more of them are used as a medicament, for example, in the treatment of a disease associated with an abnormal RNA such as by reducing the level of the abnormal RNA. Such disease may be a cancer/tumor, a virus infection, or a genetic disorder. In one embodiment, the treatment comprises contacting a target cell, and/or a biological sample from a subject having or suspected of having the disease with the crRNA(s), nucleic acid molecule(s), RNP(s), or vector(s) described herein. In a further embodiment, the target RNA of the crRNA(s) is/are the abnormal RNA(s) associated with the disease. In yet a further embodiment, the level of the abnormal RNA(s) in the target cell and/or in the biological sample is reduced. In one embodiment, the level of the abnormal RNA(s) after the treatment is reduced to at least about 10%, about 20%, about 30%, about 40%, about 50%, about 60%, about 70%, about 80%, about 90%, or about 95% of the level before the treatment or the level of a subject having this disease. In another embodiment, the level of the abnormal RNA(s) after the treatment is about 10%, about 20%, about 30%, about 40%, about 50%, about 60%, about 70%, about 80%, about 90%, about 100%, about 1.2 fold, about 1.5 fold, about 2 fold, about 3 fold or about 4 fold of a control level of a subject who is free of the disease. In still other methods, the targets are blocked but not degraded. In still other embodiments, the targets are modified temporarily. In another embodiment, the targets are modified permanently.
In one embodiment, a method of treating a disease associated with an abnormal RNA or misregulation of an RNA transcript comprises administering to a subject in need thereof the crRNA, nucleic acid molecules, vectors, RNPs, cells, or pharmaceutical compositions described herein. The administering step involves, in one embodiment, delivering the selected or designed crRNA as a mature RNA to a cell that expresses an RNA-targeting CRISPR-associated protein, e.g., a Class 2, Type VI protein, such as Cas13d or a variant. In one embodiment, the cell has been conditioned or modified to express the Cas13d or variant, and the administering occurs ex vivo. In another embodiment, the administering step involves delivering the crRNA described herein in a vector which co-expresses the RNA-targeting CRISPR-associated protein. In still another embodiment, the administering step involves delivering the crRNA and RNA-targeting CRISPR-associated protein as a ribonucleoprotein complex to the subject. In still a further embodiment, the administering step involves delivering the nucleotide molecule containing the crRNA with a separate nucleotide molecule that expresses the RNA-targeting CRISPR-associated protein.
As used herein, the terms “cancer” and “tumor” are used interchangeably and refer to an abnormal cell growth invading or spreading to other parts of the subject or having a potential of the invading or spreading. In order to achieve the abnormal cell growth, to invade other parts of the subject, and/or to evade an immune reaction, abnormal RNAs may be present in a tumor/cancer cell. The cancer/tumor includes, but is not limited to, a solid tumor (e.g., breast, colon, ovarian, lung, liver and glioma, Mesothelioma, and non-small cell lung cancer), a B cell lymphoma, a Cutaneous T cell lymphoma and a Lymphoid leukemia.
Upon virus infection, a target cell may generate abnormal RNA(s) in order to neutralize the virus. Additionally or alternatively, after the virus enters a target cell, the virus may utilize the RNA producing machinery of the target cell to produce abnormal RNA(s) in order to replicate the virus, or to lyse the target cell, or to perform other function(s) required to complete the virus life cycle. Such virus infection may include HCV infection and related liver diseases, smallpox, the common cold and different types of flu, coronavirus infections, measles, mumps, rubella, chicken pox, and shingles, hepatitis (HCV, HBV, or HAV), HIV, herpes and cold sores, polio, rabies, Ebola and Hanta fever.
Abnormal RNA(s) may also be found in other diseases, including, without limitation, Atherosclerosis, Polycystic Kidney Disease, Cardiac disease, Cardiac stress, Myocardial infarction, Kidney fibrosis, Cardiac fibrosis, diabetes, Diabetes-related kidney complications, type 2 diabetes, non-alcoholic fatty liver diseases, mycosis fungoides, and Scleroderma. Other representative examples of disease-causing defects associated with misregulation or defects in RNA include without limitation Prader Willi syndrome, Spinal muscular atrophy (SMA), Dyskeratosis congenita (X-linked), Dyskeratosis congenita (autosomal dominant), Diamond-Blackfan anemia, Shwachman-Diamond syndrome, Treacher-Collins syndrome, Prostate cancer, Myotonic dystrophy, type 1 (DM1), Myotonic dystrophy, type 2 (DM2), Spinocerebellar ataxia 8 (SCA8), Huntington's disease-like 2 (HDL2), Fragile X-associated tremor ataxia syndrome (FXTAS), Fragile X syndrome, X-linked mental retardation, Oculopharyngeal muscular dystrophy (OPMD), Human pigmentary genodermatosis, Retinitis pigmentosa, Cartilage-hair hypoplasia (recessive), Autism, Beckwith-Wiedemann syndrome (BWS), Charcot-Marie-Tooth (CMT) Disease, Amyotrophic lateral sclerosis (ALS), Leukoencephalopathy with vanishing white matter, Wolcott-Rallison syndrome, Mitochondrial myopathy and sideroblastic anemia (MLASA), Encephalomyopathy and hypertrophic cardiomyopathy, Hereditary spastic paraplegia, Leukoencephalopathy 2, susceptibility to diabetes, deafness, and MELAS syndrome. See, for example, Thomas A. Cooper, et al, RNA and Disease, Cell, 136 (4): 2009, 777-793, ISSN 0092-8674; Scotti, M., Swanson, M. RNA mis-splicing in disease. Nat Rev Genet 17, 19-32 (2016). https://doi.org/10.1038/nrg.2015.3; Matsui, M., Corey, D. Non-coding RNAs as drug targets. Nat Rev Drug Discov 16, 167-179 (2017). https://doi.org/10.1038/nrd.2016.117; Rupaimoole, R., Slack, F. MicroRNA therapeutics: towards a new era for the management of cancer and other diseases. Nat Rev Drug Discov 16, 203-222 (2017). https://doi.org/10.1038/nrd.2016.246; and Rohilla, K. J., Gagnon, K. T. RNA biology of disease-associated microsatellite repeat expansions. Acta Neuropathol Commun 5, 63 (2017). https://doi.org/10.1186/s40478-017-0468-y, incorporated by reference herein.
In certain embodiments, the abnormal RNA(s) is/are presented in a biological sample. In a further embodiment, the abnormal RNA(s) may not be within a cell.
In another aspect, a functional screening method is provided. The method comprises contacting one or more crRNA(s), and/or nucleic acid molecule(s), and/or vector(s), and/or a library as disclosed with a target cell of a cell culture, a tissue, or a subject. In one embodiment, the method comprises amplifying the nucleic acid molecule or the vector in the target cell, and optionally quantifying the nucleic acid molecule or the vector.
In one embodiment, a Cas13d protein is expressed by a nucleic acid molecule or a vector in the target cell. Thus in the target cell, the crRNA forms a complex with a Cas13d or a variation thereof, and directs the complex to a target RNA. In a further embodiment, the nucleic acid molecule or vector is the same nucleic acid molecule or vector which comprises or expresses the crRNA(s). In another embodiment, the nucleic acid molecule or vector expresses the Cas13d protein but not the crRNAs and thus, is referred to as “Cas13d molecule” or “Cas13d vector” as used herein. In one embodiment, the ratio of the Cas13d molecule (or Cas13d vector) to a crRNA (or nucleic acid molecule and/or vectors providing the crRNA) is about 100 to 1 to about 1 to 100, including each ratio therebetween. In one embodiment, the ratio is about 10 to 1, about 5 to 1, about 4 to 1, about 3 to 1, about 2 to 1, about 1 to 1, about 1 to 2, about 1 to 3, about 1 to 4, about 1 to 5, or about 1 to 10. In a further embodiment, the ratio is a molar ratio.
In one embodiment, the encoded Cas13d protein is an RfxCas13d from Ruminococcus flavefaciens strain XPD3002. Other Cas13d proteins may also be utilized, for example, AdmCas13d from Anaerobic digester metagenome 15706, EsCas13d from Eubacterium siraeum DSM15702, P1E0Cas13d from Gut metagenome assembly P1 E0-k21, UrCas13d from Uncultured Ruminococcus sp., RffCas13d from Ruminococcus flavefaciens FD1, and RaCas13d from Ruminococcus albus. In a further embodiment, the Cas13d or a variant thereof further comprises a nuclear localization signal (NLS) or a cytosolic signal or a nuclear-export signal (NES). In yet a further embodiment, the Cas13d or a variant thereof is capable of nicking a target RNA. In one embodiment, the Cas13d or a variant thereof has been engineered and does not have a nuclease activity. In one embodiment, the Cas13d is conjugated to a reporter molecule.
In one embodiment, the method reduces the level of one or more target RNA(s) in a target cell. In a further embodiment, the method functionally knocks down or knocks out one or more gene(s) expressing the target RNA(s). In yet a further embodiment, the method knocks down or knocks out one or more gene(s) in a plurality of target cells in parallel.
In certain embodiments, a selective pressure or a stimulus is applied to the target cells prior to, during or after the contacting step, which is referred to as a perturbation step. Such selective pressure or a stimulus includes, for example, a chemical agent or a biological agent or actively physically disturbing the target cell(s).
The term chemical agent includes various small molecule drugs/compounds, while the term biological agent refers to biological drugs, which are a diverse category of drugs and are generally large, complex molecules. These biological drugs may be produced through biotechnology in a living system, such as a microorganism, plant cell, or animal cell. Types of biological products approved for use in the United States include therapeutic proteins (such as filgrastim), monoclonal antibodies (such as adalimumab), vaccines (such as those for influenza and tetanus), cell therapy drugs (for example, CAR-T), and gene therapy drugs (for example, recombinant AAV vectors). During the perturbation step, the cells may be incubated with the chemical and/or biological agent or any combinations thereof, such as a library of peptides or a library of small molecules or a library of anti-cancer drugs, which are available commercially or publicly. See, for example, www.selleckchem.com/screening/anti-cancer-compound-library.html?gclid=CjwKCAjw0tHoBRBhEiwAvP1GFfLrUWZGJpXyE_QMr_f3NMvn9tC8433K8edIeOYkL08wUNdHzzwgFhoCquQQAvD_BwE, www.genscript.com/peptide-library.html, www.creative-biolabs.com/drug-discovery/therapeutics/whole-peptide-library.htm, phoenixpeptide.com/products/category/Peptide-Libraries/, www.selleckchem.com/screening/express-pick-library-premium-version.html?gclid=CjwKCAjw0tHoBRBhEiwAvP1GFTm7F6ezXNk1pUNajAWqP8Nc4COj2N1MNTes9pEGADe8nMF7UmUgPxoCT9cQAvD_BwE, www.selleckchem.com/screening/fda-approved-drug-library.html and www.chembridge.com/screening_libraries/. In certain embodiments, the cells are contacted with various chemical drugs or biological drugs for large-scale drug screens. In certain embodiments, the cells are treated via a CRISPR-Cas enzyme and various guide RNAs. The term physical disturbance refers to an active mixing, shaking, stretching, or stirring of the target cell(s).
In certain embodiments, a population of cells is treated separately with any one of the perturbations as described herein or with any combinations of the perturbations, resulting in a heterologous population of cells.
In certain embodiments, the method further comprises assessing cell viability, cell proliferation, cell apoptosis, cell death, cell phenotype, existence or concentration of a molecule (for example, the target RNA(s)), protein or cell marker expression, or response to a stimulus of a target cell, or a function which may be achieved by the cell culture, tissue, or subject comprising the target cell(s).
In a further aspect, provided is a method for detecting a target RNA. In one embodiment, the target RNA is an abnormal RNA associated with a disease. Suitable diseases have been discussed in the earlier sections. In a further embodiment, the target RNA is a virus RNA.
The method comprises contacting a biological sample with a crRNA (or a nucleic acid or a vector expressing the crRNA) as disclosed. In one embodiment, the crRNA is conjugated with a reporter molecule. In a further embodiment, the crRNA hybridizes to a mock RNA which is conjugated to a reporter molecule, whereby during the contacting step, the target RNA competitively hybridizes to the crRNA thus releasing the mock RNA with the reporter molecule. In another embodiment, the method further comprises contacting the biological sample with a Cas13d or a variant thereof, prior to, concurrently with, or after the contacting step with the crRNA(s). In a further embodiment, the Cas13d or a variant thereof is expressed by a nucleic acid molecule or a vector as described herein (which may be the same nucleic acid molecule or vector providing a crRNA or a different one) in a target cell of the biological sample. In yet a further embodiment, the Cas13d or a variant thereof comprises (for example, via conjugation to) a reporter molecule. In certain embodiments, the method comprises detecting the presence or the level of a reporter molecule, which is an indication of presence or the level of the abnormal RNA in the biological sample.
In certain embodiments, the abnormal RNA(s) is/are presented in a biological sample. In one embodiment, the abnormal RNA(s) is in a target cell of the biological sample. In another embodiment, the abnormal RNA(s) may not be within a cell. In a further embodiment, the abnormal RNA(s) may be released from a target cell before the contacting step.
In yet a further aspect, a method for editing or modifying a target RNA is provided, comprising contacting a crRNA-Cas13d RNP complex with a target RNA. In one embodiment, this method or any composition used in the method is used for treatment of a disease associated with the target RNA.
In one embodiment, the crRNA of the complex is as disclosed herein. In a further embodiment, the complex is produced by a vector or a nucleic acid sequence disclosed. In one embodiment, the Cas13d nicks the target RNA. In another embodiment, the Cas13d has been engineered to have no nuclease activity. Other suitable Cas13d variants have been discussed in other sections of this application.
In a further embodiment, the Cas13d of the complex is engineered to edit or modify an RNA. For example, the Cas13d may be conjugated to an RNA aminase, deaminase (e.g., ADAR, ADAR1, ADAR2), methylase, or demethylase (e.g., ALKBH5). In another embodiment, the Cas13d is conjugated to a splicing factor, for example an RBFOX1 or RBM38, whereby exon inclusion in the target RNA is induced when the hybridization region is at the downstream intron (i.e., intron at the 3′ side of an exon), and whereby exon exclusion in the target RNA is induced when the hybridization region is within the target exon. In yet another embodiment, the Cas13d is conjugated to a polyadenylation factor, for example, Nudix hydrolase 21 (NUDT21), whereby polyadenylation of RNA is induced at the hybridization region of the target protein.
In still another aspect, a method is provided for improving the efficiency of targeting or stabilization of a Class 2, Type VI crRNA which comprises a direct repeat (DR) stem loop and a guide or spacer sequence. Such a method involves replacing the DR stem loop sequence of a crRNA which targets inefficiently with a DR sequence selected from one or more of the DR sequences of SEQ ID Nos: 1 to 46 of Table 9, or a modification thereof.
In yet another embodiment, a method is provided that can use active Class 2, Type VI enzymes for cleaving a primary target, while using the same enzyme to block another secondary target without cleaving it. Similarly, the method can block multiple targets without cleaving the targets. In one embodiment, the primary target is a disease-causing or disease-related target and the secondary target is an interfering element, e.g., RNA regulatory element(s). The secondary target can be blocked without degradation. It has been observed that Cas13a target RNA binding affinity and HEPN-nuclease activity are differentially affected by the number and the position of mismatches between the guide and the target. See, e.g., Tambe, A et al., 2018 July, RNA Binding and HEPN-Nuclease Activation Are Decoupled in CRISPR-Cas13a, Cell Repts., 24:1025-1036, incorporated by reference herein. Guide RNA and target interaction is needed at the seed region to elicit nuclease function and target degradation. Therefore, mismatches at the seed region of about 4 or more nucleotide bases still lead to pronounced binding but without nuclease activation. This is likely a conserved feature between many Cas13 proteins, which all have an extended RNA-RNA interaction interface, which is long enough for strong binding to the target site.
This method of blocking RNA targets without degradation in one embodiment involves administering to a cell expressing an RNA-targeting CRISPR-associated protein or to a subject crRNAs capable of forming a complex with the RNA-targeting CRISPR-associated protein or a variant thereof and directing the complex to the target RNA, wherein said crRNAs comprise a DR sequence and a guide or spacer sequence. The guide or spacer sequences of the crRNAs are characterized by forming extended mismatches to the target site in the seed region. In one embodiment, the crRNA has a guide sequence with 4 or more mismatches in the seed region located between guide RNA nucleotide bases 15 to 21 relative to the guide RNA 5′ end. In another embodiment, the crRNA and target are characterized by a stabilizing, enriched sequence of G and C bases and an accessible target region characterized by an enriched sequence of A and U, surrounding the seed region on the 5′ end, 3′ end or both the 5′ and 3′ ends. In still another embodiment, the DR sequence of the crRNA having the mismatched sequence is one of the DR sequences of Table 9. In yet a further embodiment, the crRNAs are designed and selected by use of the scoring methods described herein.
Because this method can be used to block RNA regulatory elements without degradation of the target by using guide crRNAs with extended mismatches to the target site in the seed regions, the method can be extended to alternate targets that require blocking. In one embodiment, this method can be employed to permit Cas13d (or another Class 2, Type VI protein) to bind and mask/block a binding site for another RNA binding protein. In one embodiment, a single nucleotide polymorphism may lead to an unwanted binding site. The use of a mismatched crRNA can block that unwanted site using active Cas13d instead of inactive Cas13d. Thus, in one embodiment of a method for treating or modifying a disease target, more than one function with active Cas13 can be accomplished. One embodiment of a method to treat disease or modify genes/proteins causing disease employs a step of administering a perfect match guide to destabilize a first target RNA directly related to disease. Simultaneously with or before treatment with the perfect match crRNA, the method employs a step of administering a mismatched crRNA with active Cas13d (via mature RNA, the nucleic acid molecules expressing the crRNA and encoding the Cas13d, or delivering separate molecules or vectors, or delivering the RNP complex) to the same cell to block another non-desired site, e.g., a regulatory site, without destabilizing the first target.
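By way of illustration only, the seed-region criterion described above may be sketched in code. The sketch assumes that guide positions are numbered 1-based from the guide 5′ end, that the seed spans positions 15 to 21, and that the corresponding perfect-match guide is available for comparison. The function names, the labels, and the use of a fixed threshold are hypothetical simplifications and are not the scoring methods described herein; mismatches outside the seed are ignored.

```python
SEED_START, SEED_END = 15, 21  # 1-based guide positions, as stated in the text


def seed_mismatches(guide: str, perfect_guide: str) -> int:
    """Count seed positions where `guide` differs from the perfect-match guide."""
    if len(guide) != len(perfect_guide):
        raise ValueError("guides must have equal length")
    if len(guide) < SEED_END:
        raise ValueError("guide is shorter than the seed region")
    # Convert 1-based inclusive positions to a 0-based half-open range.
    return sum(guide[i] != perfect_guide[i] for i in range(SEED_START - 1, SEED_END))


def seed_label(guide: str, perfect_guide: str, block_threshold: int = 4) -> str:
    """Hypothetical label: 4 or more seed mismatches marks a blocking candidate."""
    n = seed_mismatches(guide, perfect_guide)
    if n == 0:
        return "seed_intact"
    if n >= block_threshold:
        return "blocking_candidate"
    return "intermediate"
```

For example, a 27-nucleotide guide that differs from its perfect-match counterpart at positions 15 to 18 would be labeled a blocking candidate under these assumptions.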
In yet another embodiment, the method employs the desired effector protein (e.g., active Cas13d) within the same cell to degrade a target RNA based on perfect matching, and protect another target RNA by binding and blocking a target site, such as a cis regulatory element that can serve as a binding site for another RNA-binding protein (RBP).
Even a single mismatch at the center of the seed region (e.g., position 18 relative to the 5′ end of the guide RNA) can lead to partial or complete loss of target cleavage. This is presented in
The following examples disclose both general and specific embodiments of the disclosed compositions and methods described herein, which should be construed to encompass any and all variations that become evident as a result of the teaching provided herein.
Experiments were conducted with respect to in vivo RfxCas13d transcript tiling and permutation screen in mammalian cells. We also evaluated crRNA and target site features for RNA knock-down efficacy using machine learning approaches and have developed an algorithm and easy-to-use webtool for the optimal design of RfxCas13d guides. Using this algorithm, we provided guide predictions for all protein coding transcripts in a human subject. Moreover, we identified a critical seed region within the RfxCas13d guide sequence. This region allows for the discrimination of closely related target transcripts by a single nucleotide distance. We showed that this target sensitivity can be leveraged in vivo for allelic discrimination.
Via these experiments, it was confirmed that RfxCas13d provides robust target RNA knock-down outperforming two other recently identified type VI-B CRISPR proteins PguCas13b and PspCas13b. Nuclear localization/export-tagged nucleases, variable guide lengths, and mutations of the direct repeat were compared in order to develop an optimized RfxCas13d platform.
Previous work on Cas13d did not identify the existence of a critical seed region. Here we showed that a single mismatch between guide and RNA target site within the seed region (nucleotides 15-21) can largely disrupt target knock-down. We show that this feature can be used to discriminate closely related RNA species, such as allele-specific single nucleotide polymorphisms, demonstrating a new application of RNA-targeting CRISPR enzymes in vivo.
We systematically evaluated guide RNA and target RNA sequence, secondary structure and hybridization features for perfect match guide RNAs. We show that the crRNA accessibility has a profound impact on target knock-down efficacy, while target site accessibility, at least in the GFP transcript, has negligible predictive value.
Using a set of guide RNA and target RNA features, we built an on-target model to predict crRNAs with high knock-down efficacy. We showed that our model performs better on our screen test data than previous models developed for Cas9 nuclease guide design. Specifically, our model is able to explain 37% of the variance in our screen data, while a widely-used Cas9 guide model explains only 21% of the screen variance (Doench et al., Nature Biotechnology, 2016).
We confirmed generalizability of our guide design by testing 12 crRNAs on two endogenous genes and show that in 10 out of 12 cases we were able to correctly predict lower or higher guide efficacies.
The largest RNA-targeting screen in mammalian cells to date was performed. In total we gathered information for ~7,000 RNA-targeting guides, with more than 6,500 guide permutations. This increases the number of data points from previous studies in mammalian cells by more than two orders of magnitude.
We developed a simple algorithm that allows the user to predict guide efficacies on target RNAs. We applied this model to all protein coding transcripts in the human genome and provide a resource for the scientific community for optimal guides. We also created an online, web-based repository that allows the user to select a target mRNA based on cell-type specific isoform expression levels and to visually explore predicted guide scores across target mRNAs (similar to what we previously designed for Cas9, c.f. Meier et al., Nature Methods, 2017).
Previous studies hypothesized that “anti-tag” sequences (for self vs. non-self-discrimination in bacteria) would likely be found in Cas13d. Here, we definitively demonstrate the lack of a similar anti-tag for RfxCas13d, confirming the absence of restrictive protospacer flanking sequences.
All of this together defines a state-of-the-art approach to derive both a comprehensive evaluation of RfxCas13d guide design rules as well as a needed model of effective RfxCas13d guides. RNA-targeting Cas proteins have a similarly large impact in molecular biology and medical application; thus, accurate guide prediction is of immense value for this newly developing field.
We tested 21,763 additional guide RNAs and added 5 new pooled CRISPR-Cas13 screens. We used the data to train an improved on-target guide prediction model and show that our model is generalizable across a large number of endogenous genes.
To validate our initial GFP-screen-based guide prediction model, we conducted 2 pooled Cas13d fitness screens targeting 45 essential genes and 65 control genes with a total 4,803 guide RNAs. In these screens, we confirm that predicted high-scoring guide RNAs show better target knockdown compared to low-scoring guide RNAs for the majority of essential genes.
We show that guide RNA depletion in pooled CRISPR-Cas13d fitness screens is specific to essential genes and that gene-level guide depletion scores are in agreement with RNAi-based and CRISPR-Cas9 derived gene essentiality scores. Through this comparison with other functional genomics methods, we show that Cas13d can be utilized for transcriptome-wide forward genetic screens.
We conducted 3 additional tiling screens on endogenous target genes (3 cell surface receptors) with a total of 16,960 additional guide RNAs. We demonstrate efficient targeting of endogenous human transcripts and confirm that mismatches in the critical seed region (discovered in our initial GFP screen) also reduce targeting of endogenous transcripts.
We target complex features of endogenous transcripts beyond coding regions. Through these experiments, we show that Cas13 targeting is most effective in CDS regions, then 5′/3′ UTRs, and least effective in introns. We also show evidence that Cas13 competes with splicing factors in intronic sequences and with the exon-junction-complex. These insights were not possible from the original GFP screen in our initial submission, as the transgene did not contain introns or UTRs. These results demonstrate new applications of Cas13 pooled screens for the study of splicing, alternative isoforms, intron retention and differential UTR usage.
We updated our Cas13d guide design model learning features across all 4 tiling screen datasets. We evaluated the generalizability of both our initial model as well as the updated model on 48 endogenous genes. Importantly, we show that our updated model shows improved prediction accuracy compared to the model in our initial submission. The improved model explains 47% of the variance in the data set with an average Spearman correlation to the held-out data of 0.67.
Code and software are generated to reproduce our entire analyses (data not shown). Moreover, we greatly improved the utility and performance of our web-based repository for guide RNA predictions targeting all protein-coding transcripts in the human genome (cas13design.nygenome.org).
A. Cloning of Cas13 Nuclease, Guide RNAs and Destabilized EGFP Plasmids
Using Gibson cloning, we modified the EF1a-short (EFS) promoter-driven lentiCRISPRv2 (Addgene 52961 ) or lentiCas9-Blast (Addgene 52962 ) plasmids with several different transgenes1. For the destabilized EGFP construct, we introduced a PEST sequence and nuclear localization tag on EGFP to create EFS-EGFPd2PEST-2A-Hygro from lentiCas9-Blast. To test the upstream U-content, we introduced a multiple cloning site (MCS) into EFS-EGFPd2PEST-2A-Hygro right after the stop codon, and used the MCS to introduce oligonucleotide sequences with variable U-content1.
For the CRISPR Type-VI orthologs, we cloned effector proteins (PguCas13b: Addgene 103861, PspCas13b: Addgene 103862, RfxCas13d: Addgene 109049 ) and their direct repeat (DR) sequences (PguCas13b: Addgene 103853, PspCas13b: Addgene 103854, RfxCas13d: Addgene 109053 ) into lentiCRISPRv2. In this manner, we created lentiRNACRISPR constructs: hU6-[Cas13 DR]-EFS-[Cas13 ortholog]-[NLS/NES]-2A-Puro-WPRE, where [Cas13 ortholog] was one of PguCas13b, PspCas13b, or RfxCas13d and [NLS/NES] was either a nuclear localization signal or nuclear export signal. To generate doxycycline-inducible Cas13d cell lines, we cloned NLS-RfxCas13d-NLS (Addgene 109049 ) into TetO-[Cas13]-WPRE-EFS-rtTA3-2A-Blast. For the screens, we changed the DR in the lentiGuide-Puro vector (Addgene 52963 ) to contain the RfxCas13d DR using Gibson cloning to create lentiRfxGuide-Puro1. All plasmids will be made available on Addgene.
Guide cloning was done as described previously1. All constructs were confirmed by Sanger sequencing. All primers used for molecular cloning and guide sequences are not shown.
B. Cell Culture and Monoclonal Cell Line Generation
HEK293FT cells were acquired from Thermo Fisher Scientific (R70007 ) and A375 cells were acquired from ATCC (CRL-1619 ). HEK293FT and A375 cells were maintained at 37° C. with 5% CO2 in D10 media: DMEM with high glucose and stabilized L-glutamine (Caisson DML23 ) supplemented with 10% fetal bovine serum (Serum Plus II Sigma-Aldrich 14009C) and no antibiotics.
To generate doxycycline-inducible RfxCas13d-NLS HEK293FT and A375 cells, we transduced cells with an RfxCas13d-expressing lentivirus at low MOI (<0.1) and selected with 5 μg/mL Blasticidin S (ThermoFisher A1113903). Single-cell colonies were picked after sparse plating. Clones were screened for Cas13d expression by western blot using mouse anti-FLAG M2 antibody (Sigma F1804).
For the GFP tiling screen RfxCas13d-expressing cells were transduced with EFS-EGFPd2PEST-2A-Hygro lentivirus at low MOI (<0.1 ) and selected with 100 μg/ml Hygromycin B (ThermoFisher 10687010 ) for 2 days. Single-cell colonies were grown by sparse plating. Resistant and GFP-positive clonal cells were expanded and screened for homogenous GFP expression by FACS.
C. Transfection and Flow Cytometry
For all transfection experiments, we seeded 2×10⁵ HEK293FT cells per well of a 24-well plate prior to transfection (12-18 hours) and used 500 or 750 ng plasmid together with a 5-to-1 ratio of Lipofectamine 2000 (ThermoFisher 11668019) or 1 mg/mL polyethylenimine (Polysciences 23966) to DNA (e.g. 2.5 μl Lipofectamine2000 or PEI mixed with 0.5 μg plasmid DNA). Flow cytometry or fluorescence-activated cell sorting (FACS) was performed at 48 hrs post-transfection. All transfection experiments were performed in biological triplicate.
We compared Type VI CRISPR Cas protein knock-down efficacy. Lentiviral vectors for combined CRISPR Type VI enzyme and crRNA delivery were created. We generated single-vector effector protein plus crRNA expressing constructs utilizing nuclear localization (NLS) or cytosolic/nuclear-export signals (NES) to compare knock-down efficacies with uniform delivery, promoter and polyadenylation. Cells were co-transfected with plasmids encoding a Cas13 enzyme together with a crRNA, and with a destabilized GFP plasmid. GFP intensity was recorded by fluorescence-activated cell sorting 48 hours after transfection. The percentage of mean fluorescence intensity reduction of cells transfected with one of three different GFP-targeting guide RNA sequences (G1, G2, G3) was determined relative to a non-targeting guide RNA sequence for the same Cas13-fusion protein as a mean of three replicate experiments. We also assessed targeted knock-down with different guide RNA lengths, while maintaining a fixed 5′ or 3′ anchor. RfxCas13d-NLS expressing HEK293 cells were co-transfected with plasmids delivering the crRNA only and a GFP expression plasmid. We cloned the effector proteins (PguCas13b: Addgene 103861, PspCas13b: Addgene 103862, RfxCas13d: Addgene 109049) and their direct repeat sequences (PguCas13b: Addgene 103853, PspCas13b: Addgene 103854, RfxCas13d: Addgene 109053) as described above. We co-transfected the pLentiRNACRISPR constructs together with a GFP expression plasmid in a 2:1 molar ratio. The guide RNA length comparison was done using previously published RfxCas13d constructs (Addgene 109049 and 109053), except that we removed the GFP cassette from the RfxCas13d plasmid. The modified RfxCas13d construct and guide plasmids were co-transfected together with a GFP expression plasmid in a 2:2:1 molar ratio. For the DR modification experiment (
For the model validation flow cytometry (
For the screen result validation (
To assess the upstream U-context (Example 6), we transfected the upstream-U-context-modified EFS-EGFPd2PEST-MCS plasmid together with a crRNA plasmid into RfxCas13d-expressing cells in a 2:1 molar ratio. Each GFP-upstreamU-context plasmid was co-transfected with either a targeting or a non-targeting guide RNA, and both were used for calculating the knock-down, as a change in 3′UTR uridine content could attract RNA-binding proteins that may affect RNA stability independent of Cas13. We selected the zero-uridine oligonucleotide from a set of 10000 in silico randomized 52mers with {A24,C14,G14} with minimal predicted RNA-secondary structure as determined by RNAfold7 with default setting.
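The in silico randomization of zero-uridine 52mers described above may be sketched as follows. This is a minimal illustration assuming that each candidate is a random permutation of 24 A, 14 C and 14 G bases. The secondary-structure score is left as a user-supplied callable (for example, the negative of a minimum free energy predicted by an external folding program); that callable, and all function names, are hypothetical placeholders and no folding software is invoked here.

```python
import random

COMPOSITION = {"A": 24, "C": 14, "G": 14}  # 52 nt in total, no U, as in the text


def random_zero_u_oligos(n, seed=0):
    """Return n random permutations of the fixed A24/C14/G14 composition."""
    rng = random.Random(seed)
    bases = [base for base, count in COMPOSITION.items() for _ in range(count)]
    oligos = []
    for _ in range(n):
        rng.shuffle(bases)
        oligos.append("".join(bases))
    return oligos


def pick_least_structured(oligos, structure_score):
    """Pick the oligo with the lowest score.

    `structure_score` is a hypothetical callable that returns a larger number
    for a more structured sequence.
    """
    return min(oligos, key=structure_score)
```

A call such as `random_zero_u_oligos(10000)` would yield a candidate set of the size stated above, from which one sequence is chosen by the supplied score.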
For flow cytometry analysis, cells were gated by forward and side scatter and signal intensity to remove potential multiplets. If present, cells were additionally gated with a live-dead staining (LIVE/DEAD Fixable Violet Dead Cell Stain Kit, Thermo Fisher L34963 ). For each sample we analyzed at least 5000 cells. If cell numbers varied, we randomly sampled all samples to the same number of cells before calculating the mean fluorescence intensity (MFI). For GFP co-transfection experiments, we only considered the percentage of transfected cells with the highest GFP expression determined by comparing the non-targeting control to wild-type control cells. For the upstream U-context co-transfection experiments, we considered the whole cell populations.
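The knock-down calculation implied by the passages above (down-sampling every sample to the same number of cells, then comparing mean fluorescence intensity against the non-targeting control) may be sketched as follows. This is an illustrative sketch only: gating is assumed to have been applied already, the inputs are assumed to be per-cell fluorescence values, and the function names are hypothetical.

```python
import random
from statistics import mean


def downsample(values, n, rng):
    """Randomly draw n cells without replacement."""
    return rng.sample(list(values), n)


def percent_knockdown(targeting, non_targeting, seed=0):
    """Percent MFI reduction of the targeting sample relative to the non-targeting control."""
    rng = random.Random(seed)
    n = min(len(targeting), len(non_targeting))  # equalize cell numbers first
    mfi_targeting = mean(downsample(targeting, n, rng))
    mfi_control = mean(downsample(non_targeting, n, rng))
    return 100.0 * (1.0 - mfi_targeting / mfi_control)
```

In a triplicate experiment, the function would be applied to each replicate and the resulting percentages averaged.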
For knock-down experiments of endogenous genes (
D. Screen Library Design and Pooled Oligo Cloning
To design the RfxCas13d guide RNA library for GFP, we selected the 714 bp coding sequence (without start codon) to be targeted. In silico, we generated all perfectly matching 27mer guide RNAs with minimal constraints (T-homopolymer <4, V-homopolymer <5, 0.1<GC-content<0.9) and selected 400 by random sampling. From these, we sampled 100 guide RNAs and introduced one random nucleotide conversion at each position (n=2700, SM set). From these 100, we randomly sampled 17 guide RNAs and introduced 26 or 25 consecutive double (n=442, CD set) and triple (n=425, CT set) mismatches, respectively. We sampled an additional 13 guide RNAs from the SM set (in total, 30 guide RNAs) and introduced 100 random double mismatches at any position for each guide RNA if not present already in the set of 17 consecutive double mismatches (n=3000, RD set). In total, we designed 6,967 GFP targeting guides and added 533 non-targeting guides (NT set) of the same length from randomly generated sequences that did not align to the human genome (hg19) with fewer than 3 mismatches.
For CD46, CD55 and CD71 library design, we selected the transcript isoform with highest isoform expression in HEK-TE samples (determined by Cancer Cell Line Encyclopedia CCLE; GENCODE v19) and longest 3′UTR isoform (CD46: ENST00000367042.1, CD55: ENST00000367064.3, CD71: ENST00000360110.4). As described above, we generated all perfectly matching 23mers, and selected ~2000 evenly spaced guide RNAs per target. In addition to PM, SM, RD and NT sets as described above, we included for each target a set of guide length variants (n=450, LV set), guide RNAs targeting intronic sequences near splice-donor and splice-acceptor sites across all 39 annotated introns (n=2122, I set) and an additional negative control set of reverse complementary perfect match sequences (n=300, RC set). Further details are gathered but not shown.
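The enumeration of perfect-match guides and the derivation of the single-mismatch and consecutive-mismatch sets may be sketched as follows. This is a minimal sketch under stated assumptions: sequences are handled in the DNA alphabet of the cloned oligonucleotides, the constraints are applied to the guide sequence, and "V-homopolymer" is read as a run of a single non-T base (A, C or G). All function names are hypothetical.

```python
import random

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def longest_run(seq, base):
    """Length of the longest uninterrupted run of `base` in `seq`."""
    best = current = 0
    for ch in seq:
        current = current + 1 if ch == base else 0
        best = max(best, current)
    return best


def passes_constraints(guide):
    """T-homopolymer < 4, A/C/G homopolymer < 5, and 0.1 < GC content < 0.9."""
    gc = (guide.count("G") + guide.count("C")) / len(guide)
    return (longest_run(guide, "T") < 4
            and all(longest_run(guide, b) < 5 for b in "ACG")
            and 0.1 < gc < 0.9)


def tile_guides(target, k=27):
    """All k-mer guides (reverse complements of target windows) passing the constraints."""
    guides = []
    for start in range(len(target) - k + 1):
        guide = reverse_complement(target[start:start + k])
        if passes_constraints(guide):
            guides.append((start, guide))
    return guides


def _convert(ch, rng):
    """Random conversion of one base to a different base."""
    return rng.choice([b for b in "ACGT" if b != ch])


def single_mismatches(guide, rng):
    """One variant per position, each with one random nucleotide conversion."""
    return [guide[:i] + _convert(ch, rng) + guide[i + 1:] for i, ch in enumerate(guide)]


def consecutive_mismatches(guide, width, rng):
    """One variant per window of `width` consecutive converted positions."""
    variants = []
    for i in range(len(guide) - width + 1):
        block = "".join(_convert(ch, rng) for ch in guide[i:i + width])
        variants.append(guide[:i] + block + guide[i + width:])
    return variants
```

For a 27mer, `single_mismatches` returns 27 variants and `consecutive_mismatches` returns 26 variants for a width of 2 and 25 variants for a width of 3, matching the per-guide counts that underlie the SM, CD and CT set sizes stated above.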
For both targeted essentiality screens, we used the DEMETER2 v537 data set from the Cancer Dependency Map portal (DepMap) to determine essential and control genes. Specifically, we selected essential genes with low log2 fold-change (FC) enrichments across all cell lines and in the respective assay cell line(s). For our HEK293FT cells, we considered data for HEK-TE cells. Furthermore, we selected genes with one transcript isoform constituting more than 75% of the gene expression with expression level less than ~150 transcripts per million (TPM). We predicted guide RNA efficiencies using the minimal RFGFP model and removed all guides with matches or partial matches elsewhere in the transcriptome. We allowed up to 3 mismatches when looking for potential off-targets. From the set of remaining perfect match guide RNA predictions, we manually selected three high-scoring and three low-scoring guides for the HEK293FT cell line screen to ensure that each guide fell into non-overlapping regions of the target transcripts. For the A375 cell line targets, we selected the top 20 high-scoring guide RNAs. For the set of 20 low-scoring guides, we chose among the bottom 60 to reduce the overlap of guide RNAs that fall into the same region. In this way, we assayed 20 genes in HEK293FT cells targeting 10 essential and 10 control genes with three low-scoring and three high-scoring guides, as well as three non-targeting guides (n=123). For the A375 screen, we targeted 100 genes (35 essential and 65 control genes) with 40 guides each (20 high- and 20 low-scoring) and included 680 non-targeting sequences (n=4680).
The guides for the HEK293FT essentiality screen were ordered from IDT, array cloned, confirmed by Sanger sequencing, and subsequently pooled using equal amounts. All other crRNA sequences were synthesized as single-stranded oligonucleotides (Twist Biosciences), PCR amplified using NEBNext High-Fidelity 2×PCR Master Mix (M0541S) (data not shown), and Gibson cloned into pLentiRfxGuide-Puro. Complete library representation with minimal bias (90th percentile/10th percentile crRNA read ratio: 1.68-2.17) was verified by next generation sequencing (Illumina MiSeq).
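The off-target filter described above, in which a guide is discarded if its target site matches elsewhere in the transcriptome with up to 3 mismatches, may be illustrated with a naive scan. This sketch assumes that the transcriptome is supplied as a dictionary of transcript names to sequences and that the intended site is identified by transcript name and start position. It is a brute-force Hamming-distance comparison for illustration only and is not necessarily the alignment approach used in the experiments; the function names are hypothetical.

```python
def hamming(a, b):
    """Number of mismatched positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def has_off_target(site, transcripts, intended, max_mismatches=3):
    """True if `site` matches any window other than the intended one.

    site:        target-site sequence (reverse complement of the guide)
    transcripts: dict mapping transcript name to sequence
    intended:    (transcript name, start) of the on-target site, which is skipped
    """
    k = len(site)
    for name, seq in transcripts.items():
        for start in range(len(seq) - k + 1):
            if (name, start) == intended:
                continue
            if hamming(site, seq[start:start + k]) <= max_mismatches:
                return True
    return False
```

Guides for which `has_off_target` returns True would be removed before selecting high-scoring and low-scoring guides.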
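The library-bias metric quoted above (the ratio of the 90th percentile to the 10th percentile of crRNA read counts) may be computed as in the following sketch. The percentile is calculated here by linear interpolation between order statistics, which is one common convention and is an assumption; the function names are hypothetical.

```python
def percentile(sorted_values, q):
    """Percentile q (0-100) of an ascending list, by linear interpolation."""
    position = (len(sorted_values) - 1) * q / 100.0
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] + fraction * (sorted_values[upper] - sorted_values[lower])


def skew_ratio(read_counts):
    """90th/10th percentile ratio of per-crRNA read counts."""
    values = sorted(read_counts)
    p10 = percentile(values, 10)
    p90 = percentile(values, 90)
    if p10 <= 0:
        raise ValueError("10th percentile is zero; ratio is undefined")
    return p90 / p10
```

A perfectly uniform library gives a ratio of 1.0; larger values indicate a more skewed representation.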
E. Pooled Lentiviral Production and Screening
Lentivirus was produced via transfection of library plasmid with appropriate packaging plasmids (psPAX2: Addgene 12260; pMD2.G: Addgene 12259) using polyethylenimine (PEI) reagent in HEK293FT. At 3 days post-transfection, viral supernatant was collected and passed through a 0.45 μm filter and stored at −80° C. until use.
Doxycycline-inducible RfxCas13d-NLS human HEK293FT, double-transgenic HEK293FT-GFP or A375 cells were transduced with the respective library pooled lentiviruses in separate infection replicates ensuring at least 1000× guide representation in the selected cell pool per infection replicate using a standard spinfection protocol. We generated either 2 or 3 independent replicate experiments. After 24 hours, RfxCas13d expression was induced by addition of 1 μg/ml doxycycline (Sigma D9891) and cells were selected with 1 μg/mL puromycin (ThermoFisher A1113803), resulting in ~30% cell survival. Puromycin selection was complete ~48 hours after puromycin addition. Assuming independent infection events (Poisson), we determined that ~83% of surviving cells received a single sgRNA construct. Cells were passaged every two days maintaining at least the initial cell representation and supplemented with fresh doxycycline.
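The Poisson estimate above can be reproduced with a short calculation. The sketch assumes that the surviving fraction equals the fraction of cells that received at least one integration, so that the probability of zero integrations is one minus the survival fraction; the function name is hypothetical.

```python
import math


def single_integrant_fraction(survival):
    """Fraction of surviving cells carrying exactly one construct.

    Assumes integrations per cell follow a Poisson distribution and that
    survival equals P(at least one integration).
    """
    lam = -math.log(1.0 - survival)   # from P(0) = exp(-lam) = 1 - survival
    p_exactly_one = lam * math.exp(-lam)
    return p_exactly_one / survival   # condition on having at least one


fraction = single_integrant_fraction(0.30)
print(f"{fraction:.3f}")
```

With 30% survival, the mean number of integrations per cell is about 0.36 and the resulting fraction is about 0.83, in agreement with the value stated above.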
The tiling screens were terminated after 5 to 10 days. For all targets we noted maximal knock-down after 2-4 days (data not shown). For cell surface proteins, cells were stained in batches of 1×10⁷ cells for 30 min at 4° C. (BioLegend: CD46 clone TRA-2-10 #352405-30 per 1×10⁶ cells; CD55 clone JS11 #311311-1.5 μg per 1×10⁶ cells; CD71 clone CYIG4 #334105 -40 per 1×10⁶ cells). We collected unsorted samples for input guide RNA representation of approximately 1000× coverage for each sample and sorted at least another 1000× representation into the assigned bins based on their signal intensities (GFP: lowest 20%, 20%, 20% and remaining highest 40%,
The essentiality screens were started (Day 0 ) upon complete puromycin selection, which was at 5 days after transduction. Cells were passaged every two to three days maintaining at least the initial cell representation and supplemented with fresh doxycycline. At Day 0 (=Input) and every 7 days, we collected a >1000× representation from each sample. The HEK293FT cell screen was conducted in triplicate and cultured for 4 weeks. The A375 cell screen was conducted in duplicate and cultured for 2 weeks.
F. Screen Readout and Read Analysis
For each sample, genomic DNA was isolated from sorted cell pellets using the GeneJET Genomic DNA Purification Kit (ThermoFisher K0722) using 2×10⁶ cells or fewer per column. The crRNA readout was performed using two rounds of PCR². For the first PCR step, a region containing the crRNA cassette in the lentiviral genomic integrant was amplified from extracted genomic DNA using the PCR1 primers (available but not shown).
For each sample, we performed PCR1 reactions as follows: 20 μl volume with 2 μg of gDNA in each reaction, with the number of reactions limited by the amount of extracted gDNA (total gDNA ranged from 8 μg to 50 μg per sample, with an estimated representation of 10⁶ diploid cells per ˜6.6 μg gDNA). PCR1: 4 μl 5× Q5 buffer, 0.02 U/μl Q5 enzyme (M0491L), 0.5 μM forward and reverse primers and 100 ng gDNA/μl. PCR conditions: 98° C./30s, 24×[98° C./10s, 55° C./30s, 72° C./45s], 72° C./5 min.
We pooled the unpurified PCR1 products and used the mixture for a single second PCR reaction per sample. This second PCR adds Illumina sequencing adaptors, barcodes and stagger sequences to prevent monotemplate sequencing issues. Complete sequences of the 5 forward and 3 reverse Illumina PCR2 readout primers used are not shown. PCR2: 50 μl 2× Q5 master mix (NEB #M0492S), 10 μl PCR1 product, 0.5 μM forward and reverse PCR2 primers in 100 μl. PCR conditions: 98° C./30s, 17×[98° C./10s, 63° C./30s, 72° C./45s], 72° C./5 min.
Amplicons from the second PCR were pooled by screen experiment (e.g. all GFP-screen samples) in equimolar ratios (by gel-based band densitometry quantification) and then purified using a QiaQuick PCR Purification kit (Qiagen 28104 ). Purified products were loaded onto a 2% E-gel and gel extracted using a QiaQuick Gel Extraction kit (Qiagen 28704 ). The molarity of the gel-extracted PCR product was quantified using KAPA library quant (KK4824 ) and sequenced on an Illumina NextSeq 500—II MidOutput 1×150 v2.5.
Reads were demultiplexed based on Illumina i7 barcodes present in PCR2 reverse primers using bcl2fastq and by their custom in-read i5 barcode using a custom Python script. Reads were trimmed to the expected guide RNA length by searching for known anchor sequences relative to the guide sequence using a custom Python script. For the tiling screens, pre-processed reads were either aligned to the designed crRNA reference using bowtie³ (v.1.1.2) with parameters -v 0 -m 1, or collapsed (FASTX-Toolkit) to count perfect duplicates followed by string-match intersection with the reference to retain only perfectly matching and unique alignments. Pre-processed guide RNA sequences from the essentiality screens were aligned allowing for up to 1 mismatch (-v 1 -m 1). Alignment statistics are available but not shown. The raw guide RNA counts (data not shown) were normalized separately for each screen dataset using a median-of-ratios method as in DESeq2⁴ and underwent batch correction using ComBat as implemented in the SVA R package⁵. Non-reproducible technical outliers were removed by applying pair-wise linear regression for each sample after normalization and batch correction, collecting the residuals and taking the median value for each guide RNA across all sample-centric comparisons. We removed all crRNA counts within the top X% of residuals across all samples (GFP: 2%, CD proteins: 0.5%, essentiality screen: no outlier removal). For the GFP screen, we removed outliers only on a per-sample basis as needed (but not the entire guide RNA). For the CD46, CD55 and CD71 screens, since the number of outliers was small, we removed the entire guide RNA from the analysis. The table below indicates all filtering applied:
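As an illustration of the median-of-ratios normalization mentioned above, the following is a minimal sketch of the general technique in plain Python. It is not the DESeq2 implementation; the function names, the dictionary-based input layout, and the choice to ignore guides with a zero count in any sample are assumptions made for this example.

```python
import math
from statistics import median

def median_of_ratios_size_factors(counts):
    """Estimate one size factor per sample.

    counts: dict mapping sample name -> list of raw guide counts,
            with guides in the same order for every sample.
    """
    samples = list(counts)
    n_guides = len(counts[samples[0]])
    # Assumption: only guides detected in every sample inform the size factors
    usable = [g for g in range(n_guides)
              if all(counts[s][g] > 0 for s in samples)]
    if not usable:
        raise ValueError("no guide has non-zero counts in all samples")
    # Log of the per-guide geometric mean across samples (the pseudo-reference)
    log_reference = {
        g: sum(math.log(counts[s][g]) for s in samples) / len(samples)
        for g in usable
    }
    return {
        s: math.exp(median(math.log(counts[s][g]) - log_reference[g]
                           for g in usable))
        for s in samples
    }

def normalize_counts(counts):
    """Divide each sample's counts by its size factor."""
    factors = median_of_ratios_size_factors(counts)
    return {s: [c / factors[s] for c in counts[s]] for s in counts}

raw = {"input": [10, 20, 30], "bin1": [20, 40, 60]}
print(normalize_counts(raw))
```

In this toy input the second sample is sequenced twice as deeply as the first, so both samples end up with identical normalized counts.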
Processed crRNA counts are available but not shown. Guide RNA enrichments were calculated by building the count ratios between a bin or timepoint and the corresponding input sample, followed by log2-transformation (log2FC). Consistency between replicates was estimated using robust rank aggregation (RRA)⁶. Delta log2FC for mismatching guides was calculated by subtracting the log2FC of the perfectly matching reference guide. For the tiling screens, all plots and analyses were performed using the mean guide RNA enrichments of bin 1 (=bottom 20%) across replicates, unless indicated otherwise. Similarly, we used the mean guide RNA enrichments relative to Day 0 across replicates for the essentiality screen. Guide RNA enrichment scores (log2FC) are not shown here. In all combined analyses across all four tiling screens, we scaled the observed log2FC separately to improve comparability. For the generation of the combined on-target model, we normalized the 2918 selected CDS-targeting guide RNAs across the four tiling screens to the same scale prior to training and testing the model. To do so, for each dataset D, we computed the upper and lower quartiles of the guide log2FC ($UQ_D$ and $LQ_D$, respectively) as well as the corresponding quartiles for the log2FC among all datasets pooled together ($UQ_P$ and $LQ_P$). We then updated each fold change x as follows: $\hat{x} = \frac{x - LQ_D}{UQ_D - LQ_D}\,(UQ_P - LQ_P) + LQ_P$. By centering on quartiles, this procedure normalized the fold-change distributions in a way that was less susceptible to the influence of outliers of a single screen.
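The quartile-based rescaling described above can be sketched as follows. This is an illustrative Python example only: the function names are hypothetical, and the quartile definition (the "inclusive" interpolation method of the standard library) is an assumption, since the exact quantile method is not specified here.

```python
from statistics import quantiles

def lower_upper_quartiles(values):
    """Return (lower quartile, upper quartile) of a list of log2FC values."""
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    return q1, q3

def rescale_by_quartiles(datasets):
    """Map each dataset's quartiles onto the quartiles of the pooled data.

    datasets: dict mapping screen name -> list of guide log2FC values.
    """
    pooled = [x for values in datasets.values() for x in values]
    lq_p, uq_p = lower_upper_quartiles(pooled)
    rescaled = {}
    for name, values in datasets.items():
        lq_d, uq_d = lower_upper_quartiles(values)
        scale = (uq_p - lq_p) / (uq_d - lq_d)
        # x_hat = (x - LQ_D) / (UQ_D - LQ_D) * (UQ_P - LQ_P) + LQ_P
        rescaled[name] = [(x - lq_d) * scale + lq_p for x in values]
    return rescaled

toy = {"GFP": [0, 1, 2, 3, 4], "CD46": [10, 12, 14, 16, 18]}
print(rescale_by_quartiles(toy))
```

After rescaling, every dataset shares the lower and upper quartiles of the pooled distribution, while the rank order of guides within each screen is unchanged.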
G. Predicting RNA Secondary Structures and RNA-RNA Hybridization Energies
crRNA secondary structure and minimum free energies (MFEs) were derived using RNAfold [--gquad] on the full-length crRNA (DR+guide) sequence⁷. For building the combined on-target model and for testing the RFGFP model on the combined data set, we assumed 23mer guide RNAs for all guides in the GFP tiling screen to prevent length-dependent differences in the crRNA MFE. Target RNA unpaired probability (accessibility) was calculated using RNAplfold [-L 40 -W 80 -u 50] as described before⁸. We performed a grid-search calculating the RNA accessibility for each target nucleotide in a window of minus 20 bases downstream of the target site to plus 20 bases upstream of the target site, assessing the unpaired probability of each nucleotide over 1 to 50 bases for all perfectly matching guides. Then, we calculated the Pearson correlation coefficient between the log10-transformed unpaired probabilities and the observed guide RNA log2FC for each point and window relative to the guide RNA. RNA-RNA hybridization between the guide RNA and its target site was calculated using RNAhybrid [-s -c]⁹. For the hybridization calculation, we did not include the direct repeat of the crRNA. We calculated the RNA-hybridization minimum free energy for each guide RNA nucleotide position p over the distance d to the position p+d with its cognate target sequence. All measures were correlated with the observed crRNA log2FC either directly or using partial correlation to account for the crRNA folding MFE. In each case, we computed the Pearson correlation.
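To make the distinction between direct and partial correlation concrete, the following minimal Python sketch computes a Pearson correlation and a first-order partial Pearson correlation that controls for a single covariate (for example, the crRNA folding MFE). The function names and the toy vectors are hypothetical; the formula is the standard first-order partial correlation, not code from our analysis.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def partial_pearson(x, y, z):
    """Correlation of x and y after removing the linear effect of z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Toy data: x and y are related only through the shared covariate z
feature = [2, 1, 2, 5]    # e.g. a target-site feature
log2fc = [2, -1, 6, 3]    # e.g. observed guide enrichment
mfe = [1, 2, 3, 4]        # e.g. crRNA folding MFE (covariate)
print(round(pearson(feature, log2fc), 3))
print(round(partial_pearson(feature, log2fc, mfe), 3))
```

In this constructed example the direct correlation is positive, but it vanishes once the covariate is controlled for, which is the situation the partial correlation is meant to detect.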
H. Assessing Guide RNA Nucleotide Composition
Guide RNA composition was derived by calculating the nucleotide probability within the respective guide RNA sequence length. To assess the presence of sequence constraints similar to a previously described anti-tag¹⁰ or 5′ and 3′ Protospacer Flanking Sequences (PFS), we ranked all perfectly matching guide RNAs by their log2FC enrichment within each screen separately. We selected the top and bottom 20% enriched/depleted guide RNAs and calculated the positional nucleotide probability for the four nucleotides upstream and downstream relative to the guide RNA match. To assess nucleotide preferences at any guide RNA match position in addition to upstream and downstream nucleotides, we selected the top 20% of the log2FC-ranked perfectly matching guides as described above and calculated nucleotide preferences as described before¹¹. In brief, we calculated the probability of each nucleotide at each position for the top guide RNAs and all guide RNAs. The effect size is the difference in nucleotide probability obtained by subtracting the values for all guides from those for the top guides (delta log2FC). p-values were calculated from the binomial distribution with a baseline probability estimated from the full-length GFP mRNA target sequence for all perfectly matching crRNAs. p-values were adjusted using a Bonferroni correction for multiple hypothesis testing.
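The positional nucleotide analysis above can be illustrated with a small, self-contained Python sketch. The function names are hypothetical, and the use of a two-sided exact binomial test is an assumption, since the sidedness of the test is not stated here.

```python
from math import comb

NUCLEOTIDES = "ACGU"

def binomial_two_sided(k, n, p):
    """Exact two-sided binomial p-value: total probability of all
    outcomes that are no more likely than the observed count k."""
    pmf = [comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(n + 1)]
    threshold = pmf[k] * (1 + 1e-9)  # small tolerance for float ties
    return min(1.0, sum(q for q in pmf if q <= threshold))

def nucleotide_effects(top_guides, all_guides, baseline):
    """Return (position, nucleotide, effect size, Bonferroni-adjusted p).

    Effect size = P(nucleotide | top guides) - P(nucleotide | all guides).
    baseline: dict nucleotide -> background probability.
    """
    length = len(top_guides[0])
    n_tests = length * len(NUCLEOTIDES)
    results = []
    for pos in range(length):
        for nuc in NUCLEOTIDES:
            k_top = sum(g[pos] == nuc for g in top_guides)
            k_all = sum(g[pos] == nuc for g in all_guides)
            effect = k_top / len(top_guides) - k_all / len(all_guides)
            p_value = binomial_two_sided(k_top, len(top_guides), baseline[nuc])
            results.append((pos + 1, nuc, effect, min(1.0, p_value * n_tests)))
    return results

uniform = {n: 0.25 for n in NUCLEOTIDES}
for row in nucleotide_effects(["AAC", "AAG"],
                              ["AAC", "AAG", "CUG", "GUU"], uniform)[:4]:
    print(row)
```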
I. Assessing Target RNA Context
To assess the target RNA context, we calculated the nucleotide probability at each position (p) over a window (w) of 1 to 50 nucleotides centered around the position of interest (e.g. p=−18 with w=11 summarizes the nucleotide content in a window from −23 to −13 with +1 being the first base of the crRNA). We evaluated p for all positions within 75 nucleotides upstream and downstream of the guide RNA. The nucleotide context of each point was then correlated with the observed log2FC crRNA enrichments for all perfect match crRNAs, either directly or using partial correlation accounting for crRNA folding MFE. In each case we used Pearson correlation.
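The window convention in the example above (p=−18 with w=11 covering −23 to −13) can be sketched as follows. This is a hypothetical illustration: the function names, the dictionary-based representation of target positions, and the handling of even window widths are assumptions made for this example.

```python
def window_bounds(center, width):
    """Inclusive start and end positions of a window centered on `center`.

    Assumption: for even widths the extra base is placed on the
    downstream side, as even widths are not defined in the text.
    """
    start = center - (width - 1) // 2
    return start, start + width - 1

def window_fraction(context, center, width, nucleotides="G"):
    """Fraction of bases in the window that belong to `nucleotides`.

    context: dict mapping target position -> base. Positions missing
    from the dict (e.g. beyond the transcript ends) are skipped.
    """
    start, end = window_bounds(center, width)
    bases = [context[pos] for pos in range(start, end + 1) if pos in context]
    if not bases:
        raise ValueError("window does not overlap the target context")
    return sum(base in nucleotides for base in bases) / len(bases)

print(window_bounds(-18, 11))
toy_context = {pos: ("G" if pos % 2 == 0 else "A") for pos in range(-30, 0)}
print(window_fraction(toy_context, -18, 11, "G"))
```

Passing a two-letter string such as "GC" or "AU" as `nucleotides` yields the combined content for that nucleotide pair.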
The RNA context around single nucleotide mismatches was assessed accordingly with a slight modification. Here, the nucleotide context was assessed relative to the mismatch position, summarizing the nucleotide probability in a window of 1 to 15 nucleotides to either side (e.g. p=18 with w=5 summarizes the nucleotide content in a window of 11 nucleotides from 23 to 13). The nucleotide context around single nucleotide mismatches influences the mismatch tolerance. We determined the Pearson correlation coefficient (rp) between observed log2FC and delta log2FC for all single mismatch guide RNAs relative to their cognate perfect matching guide RNAs, segregated by all 27 positions. For each mismatch position p relative to the 5′ guide RNA end, the nucleotide density was calculated as a fraction in a window extended by d (1-15 nt) on both sides centered on the mismatch position p.
The nucleotide context of each position and each window size was then correlated with the observed delta log2FC relative to the perfectly matching reference guide RNA, either directly or using partial correlation accounting for crRNA folding MFE. In each case, we used Pearson correlation.
J. On-Target Model Selection
An explanation for all selected features for the RFGFP and RFcombined model can be found in Table 6 and Table 7, respectively. The RFcombined model feature input values are not shown here. All continuous feature scores were scaled to the [0, 1] interval limited to the 5th and 95th percentile, with a mean set to the 5th percentile. Scaled values exceeding the [0, 1] interval were set to 0 or 1, respectively.
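A percentile-limited scaling of this kind can be sketched as below. This is a simplified illustration that maps the 5th percentile to 0 and the 95th percentile to 1 and clips everything outside; the function name and the percentile interpolation method are assumptions, and any additional centering step is not reproduced.

```python
from statistics import quantiles

def scale_to_unit_interval(values, lower_pct=5, upper_pct=95):
    """Scale values so that the lower percentile maps to 0 and the upper
    percentile maps to 1; values outside that range are clipped."""
    cuts = quantiles(values, n=100, method="inclusive")  # 99 cut points
    low, high = cuts[lower_pct - 1], cuts[upper_pct - 1]
    if high == low:
        raise ValueError("feature has no spread between the percentiles")
    return [min(1.0, max(0.0, (v - low) / (high - low))) for v in values]

scaled = scale_to_unit_interval(list(range(101)))
print(scaled[0], scaled[50], scaled[100])
```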
To evaluate and compare model performances, we randomly sampled 1,000 bootstrap datasets from the data of perfect match guide RNA log2FC response values and selected features. We used 399 data points for the initial RFGFP model and 2918 data points for all CDS-annotating perfect match guides across the four tiling screens. For the RFcombined model we normalized the observed log2FC values prior to training and testing as described earlier. Normalized response values showed better generalizability compared to unnormalized or scaled log2FC. For each bootstrap sample, 70% of the data was used for training and the remaining 30% of the data was held out for testing, ensuring a 70/30 split for each screen dataset when testing the RFcombined model. Linear dependencies between features were identified using the function findLinearCombos from the R package caret and removed. The model performance was evaluated by calculating the Spearman correlation coefficient rs and Pearson r2 to the held-out data. We compared a variety of different methods⁸ (Table 3).
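The evaluation scheme above, a per-screen 70/30 split scored by Spearman correlation on the held-out data, can be sketched in plain Python as follows. The function names are hypothetical, the model-fitting step is deliberately omitted, and the original analysis was carried out in R.

```python
import random

def stratified_split(screen_labels, train_fraction=0.7, rng=None):
    """Split indices into train/test so each screen is split 70/30."""
    rng = rng or random.Random()
    by_screen = {}
    for index, screen in enumerate(screen_labels):
        by_screen.setdefault(screen, []).append(index)
    train, test = [], []
    for indices in by_screen.values():
        shuffled = indices[:]
        rng.shuffle(shuffled)
        cut = round(train_fraction * len(shuffled))
        train.extend(shuffled[:cut])
        test.extend(shuffled[cut:])
    return sorted(train), sorted(test)

def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

labels = ["GFP"] * 10 + ["CD46"] * 20
train_idx, test_idx = stratified_split(labels, rng=random.Random(0))
print(len(train_idx), len(test_idx))
print(spearman([1, 2, 3, 4], [10, 20, 30, 25]))
```

Repeating the split with different random seeds and recording the held-out Spearman correlation each time gives a distribution of model performance across resamplings.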
For both models, we tested a variety of feature combinations including crRNA folding energies, RNA-RNA hybridization energies, target site accessibility, overall and positional (di-) nucleotide probabilities, and one-hot encoding for single and di-nucleotide of the guide target-sites and their upstream and downstream flanking four nucleotides. Together, these represented 644 features for the combined on-target model. A full set of features for the combined on-target model can be found below.
For the initial on-target model based on the GFP screen data, we evaluated a set of 15 defined features (Table 6) alongside one-hot encoded positional nucleotide information and GC content. These 15 features were defined based on their positive or negative correlation to the observed response value during the data exploration (see also Example 6). We iteratively reduced the number of features from 15 to 6 for the RFGFP model and monitored the model performance as described above. At each iteration, the Random Forest model performed slightly better than any other learning approach. Reducing the features to fewer than the selected 6 features (RFminimal=RFGFP) reduced the model performance. For the combined on-target model, we did not iteratively reduce the set of 35 selected features. We compared the RFGFP model to an SVM+L1 model similar to one of the first CRISPR-Cas9 on-target models. Specifically, we used one-hot encoding for all 35 nucleotide positions considered (27 guide RNA positions and 8 additional positions with 4 upstream and 4 downstream nucleotides). Considering all positions, the feature space contained 140 single nucleotide features, 544 di-nucleotide features and the GC-content (685 non-all-zero features). Here, we used tuning (see table herein for parameters) to increase model performance for SVM+L1 specifically. Here, but also for the combined model, one-hot encoded features did not lead to a high Spearman correlation coefficient rs to the held-out data.
For further evaluation of the random forest models we used 10-fold cross-validation by randomly partitioning the data into 10 equally-sized partitions, ensuring even contribution from each screen to each partition. We trained the model 10 times on 90% of the data and predicted the held-out 10%. For each data point, we assigned the known guide RNA efficacy quartile based on the log2FC enrichment and compared it to the predicted efficacy quartiles in the held-out data. We also assessed the predicted guide score by calculating the median predicted guide score for the top and bottom ranked crRNAs in the 10% held-out data based on the known log2FC-rank for all 10 cross-validation folds (top/bottom N=2, 4, 8, 16, 32, 64, 128 or 256 guide RNAs). To compute the null distribution, we calculated the median predicted guide scores of randomly selected guide RNAs across 1000 samplings for each N. For the leave-one-out cross-validation we trained on all data from three tiling screens and performed Spearman rank correlation of the predicted guide efficiency of the held-out fourth screen to the observed log2FC enrichments.
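The feature count given above (140 single-nucleotide, 544 di-nucleotide and one GC-content feature) follows directly from one-hot encoding a 35-nucleotide window. A minimal Python sketch, with hypothetical feature names, is:

```python
from itertools import product

NUCLEOTIDES = "ACGT"

def one_hot_features(sequence):
    """One-hot encode single nucleotides and adjacent di-nucleotides
    by position, and append the overall GC content."""
    features = {}
    # 4 indicator features per position
    for pos, base in enumerate(sequence, start=1):
        for nuc in NUCLEOTIDES:
            features[f"pos{pos}_{nuc}"] = int(base == nuc)
    # 16 indicator features per pair of adjacent positions
    for pos in range(len(sequence) - 1):
        pair = sequence[pos:pos + 2]
        for first, second in product(NUCLEOTIDES, repeat=2):
            features[f"pos{pos + 1}_{first}{second}"] = int(pair == first + second)
    features["GC_content"] = sum(b in "GC" for b in sequence) / len(sequence)
    return features

window = "ACGT" * 8 + "ACG"  # 35 nt: 4 upstream + 27 guide-matched + 4 downstream
encoded = one_hot_features(window)
print(len(encoded))  # 35*4 + 34*16 + 1 = 685
```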
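One way to build cross-validation partitions with an even contribution from each screen is sketched below. The fold-assignment strategy (shuffling within each screen and dealing indices round-robin across folds) and the function names are assumptions for illustration, not a description of the original R code.

```python
import random
from collections import Counter

def stratified_folds(screen_labels, n_folds=10, seed=0):
    """Assign each data point to a fold, spreading every screen evenly."""
    rng = random.Random(seed)
    by_screen = {}
    for index, screen in enumerate(screen_labels):
        by_screen.setdefault(screen, []).append(index)
    fold_of = [0] * len(screen_labels)
    offset = 0  # continue the round-robin across screens to balance fold sizes
    for indices in by_screen.values():
        rng.shuffle(indices)
        for i, index in enumerate(indices):
            fold_of[index] = (i + offset) % n_folds
        offset += len(indices)
    return fold_of

def cross_validation_splits(fold_of, n_folds=10):
    """Yield (train indices, held-out indices) for each fold."""
    for fold in range(n_folds):
        train = [i for i, f in enumerate(fold_of) if f != fold]
        held_out = [i for i, f in enumerate(fold_of) if f == fold]
        yield train, held_out

labels = ["GFP"] * 40 + ["CD46"] * 25 + ["CD55"] * 35
folds = stratified_folds(labels)
print(sorted(Counter(folds).items()))
```

Each (train, held-out) pair can then be passed to any regression model; the held-out predictions from the 10 folds together cover every data point exactly once.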
To make the guide score more interpretable, we standardized the guide score to a [0, 1] interval preserving the distribution between 5th and 95th percentile. Normalized values exceeding the [0, 1] interval were set to 0 or 1, respectively. The final RFGFP model was trained on all data points for perfect match guides using the six selected features with 1500 regression trees. The model explains 36.9% of the observed variance with a mean of squared residuals of 0.139. The table below shows the feature contribution for the RFGFP model.
Similarly, the final RFcombined model was trained on 2918 data points using 35 selected features. Tuning the number of trees (ntree) and number of splitting variables per node (mtry) led to insignificant performance improvements compared to default settings. The model (mtry=12, ntree=2000) explains 47.16% of the observed variance with a mean of squared residuals of 0.168, and the feature contributions are indicated below, ranked by importance:
K. RfxCas13d Guide Scoring
We created a user-friendly R script that readily predicts RfxCas13d on-target guide scores. The only user-provided argument is a single-entry FASTA file input of minimally 30 nt that represents the target sequence, such as a transcript isoform sequence. The software first generates all possible 23mer guide RNAs, collects all required features and predicts guide RNA efficacies. The only filter applied removes guide RNAs with homopolymers of 5 or more Ts and 6 or more Vs (V=A, C, G). Such guide RNAs may trigger early transcript termination for PolIII transcription or cause difficulties during oligo synthesis. The software returns a FASTA file with guide RNA sequences ranked by the predicted standardized guide score. In addition, a csv file providing additional information is created. Optionally, the script can be used to plot the guide score distribution along the provided target sequence for visualization.
We used this software to predict guide scores for all transcripts (including all biotypes: protein_coding, nonsense_mediated_decay, non_stop_decay, IG_*_gene, TR_*_gene, polymorphic_pseudogene) of protein coding genes annotated in GENCODE v19 (GRCh37) (n=94,873 of 95,074) and provide the top 10 ranked 5′UTR, coding sequence and 3′UTR annotating guide RNA sequences (data not shown).
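The guide enumeration and homopolymer filter described above can be illustrated with the following minimal sketch. It is written in Python for illustration only and is not the R script itself; the function names are hypothetical, and it assumes that the filter is applied to the guide sequence in its DNA (T) representation and that target windows containing ambiguous bases are skipped. Feature collection and scoring are not shown.

```python
import re

COMPLEMENT = str.maketrans("ACGT", "TGCA")
# 5 or more Ts, or 6 or more of A, C or G in a row
HOMOPOLYMER = re.compile(r"T{5,}|A{6,}|C{6,}|G{6,}")

def reverse_complement(sequence):
    return sequence.translate(COMPLEMENT)[::-1]

def enumerate_guides(target, guide_length=23):
    """Return (1-based target start, guide sequence) for every window of
    the target whose guide passes the homopolymer filter."""
    target = target.upper().replace("U", "T")
    guides = []
    for start in range(len(target) - guide_length + 1):
        site = target[start:start + guide_length]
        if set(site) - set("ACGT"):
            continue  # assumption: skip windows with ambiguous bases
        guide = reverse_complement(site)  # the guide pairs with the target RNA
        if HOMOPOLYMER.search(guide):
            continue
        guides.append((start + 1, guide))
    return guides

target = "ACGT" * 3 + "AAAAA" + "ACGT" * 5
for position, guide in enumerate_guides(target):
    print(position, guide)
```

In the toy target, a run of As in the transcript translates into a run of Ts in the guide, so every window spanning at least five of those As is removed.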
L. RfxCas13d Guide Scoring Validation
To validate that our initial RFGFP model can readily separate poorly performing from well-performing crRNAs, we performed several experiments.
First, we chose two genes that encode cell surface proteins that allow for quantitative assessment of their expression levels by FACS. For each gene we predicted crRNAs for the highest expressed transcript isoform in HEK293FT cells (CD46: ENST00000367042.1, CD71 [TFRC]: ENST00000360110.4). For each gene, we selected 3 guides present in the low scoring quartiles (Q1 and Q2) and 3 guides in the high scoring quartiles (Q3 and Q4). We selected the guides to be non-overlapping and to reside in 3 different regions of the target transcript.
Then, we performed two essentiality screens with a dropout growth phenotype readout in HEK293FT and A375 cells, respectively. We designed two crRNA libraries targeting essential and control genes with a number of predicted low-scoring and high-scoring guide RNAs as described above (see Screen library design and pooled oligo cloning). For the HEK293FT cell screen, we compared the guide depletion of four groups of 30 guides (Essential gene targeted by high-scoring guide or by low-scoring guide, and control genes targeted by high-scoring guide or by low-scoring guide). We expected the greatest depletion for the 30 high-scoring guide RNAs targeting essential genes. Similarly, we compared the relative guide depletion of the same four groups of guide RNAs in the A375 screen, with the expectation that the 20 high-scoring guides per essential gene would be the most depleted.
For gene ranking based on guide depletion, we used robust rank aggregation (RRA)⁶ to assign a p-value based on the log2FC-based rank consistency of the most depleted N guide RNAs per gene (with N ∈ {1, 5, 20}) across the two A375 screen replicates. The −log10 transformed p-values were then compared to other growth screens (RNAi and Cas9) using Spearman rank correlation. Specifically, we compared the RRA-derived −log10 p-value to the log2FC from an RNAi-based DEMETER2 v5 repository³⁷ and the merged STARS scores from a Cas9-based approach²⁹. For the correlation we only used genes with values present in all scores (Essential genes: n=35; Control genes: n=15).
Furthermore, we used the log2FC guide depletion values to compare the predictive value of the RFGFP and RFcombined models. Specifically, for both essentiality screens we used 10 essential genes (all in HEK293FT and the 10 most depleted in A375 cells) and correlated the predicted guide scores from both models to the observed log2FC guide depletion scores (normalized to 0-100% per gene) of all detected guide RNAs (HEK293FT: n=60 with 6 guides per gene; A375: n=398 with up to 40 guides per gene). We made the same comparison on a per-gene level using all 40 guide RNAs per gene in the A375 screen.
M. Data Representation
In all boxplots, boxes indicate the median and interquartile ranges, with whiskers indicating either 1.5 times the interquartile range, or the most extreme data point outside the 1.5-fold interquartile. All transfection experiments show the mean of three replicate experiments with individual replicates plotted as points.
N. Data Availability Statement
Screen data are being deposited to GEO with an accession number pending. All code and software to reproduce our entire analyses are available on our GitLab repository (gitlab.com/sanjanalab/cas13). Moreover, we provide pre-computed guide RNA predictions targeting all protein-coding transcripts in the human genome on our web-based repository (cas13design.nygenome.org). Other data and materials that support the findings of this research are available from the corresponding author upon reasonable request.
O. Code Availability Statement
The predictive on-target model as well as all code for the presented and additional quality control analyses is available on our GitLab repository (gitlab.com/sanjanalab/cas13).
To date, three different Cas13 effector proteins (PguCas13b, PspCas13b, RfxCas13d) have been reported to show high RNA knock-down efficacy with minimal off-target activity 16,20. We compared the ability of these three Cas13 enzymes to knock down GFP mRNA when directed to either the cytosol or the nucleus. RfxCas13d (CasRx) consistently showed the strongest target knock-down, especially when fused to a nuclear localization sequence (NLS) (data not shown). Using Cas13d-NLS, we varied the guide length while maintaining a constant guide RNA 5′ end or 3′ end relative to a 30 nt reference guide. In both experiments, we found the most pronounced target knock-down using guide RNAs with a length of 23-30 nt (
To systematically assess the RfxCas13d target knock-down efficacy of thousands of guide RNAs, we established a monoclonal HEK293 cell line expressing destabilized GFP and doxycycline-inducible Cas13d protein. We lentivirally delivered a library of 7,500 crRNAs that target the GFP coding sequence, containing perfect match and mismatch guide RNAs (
We calculated the log2 fold-change (log2FC) crRNA enrichment between all bins and the unsorted input guide RNA distribution (data not shown). Perfect match guide RNAs were enriched in bin 1, while increasing numbers of mismatches led to a gradual decrease in guide enrichment (
We noticed considerable heterogeneity of guide enrichment within each class (
To examine if Cas13 can tolerate mismatches between the guide RNA and the target RNA, we calculated the relative log2 fold change (Δ log2FC) for each mismatch guide by subtracting the log2FC from the reference (perfect match) guide (
For randomly distributed double mismatches, the largest change in enrichment was observed in cases where both mismatches are in the seed region. Increasing the number of mismatches to three largely abrogated target knock-down. For this reason, the critical region may have been masked in previous studies on RfxCas13d which used four consecutive mismatches28.
Given the heterogeneity in enrichment for guide RNAs that have mismatches in the seed region, we sought to assess the effect of surrounding nucleotide context. We used partial correlation to control for the knock-down efficacy of the cognate perfect match guide (“reference”), as poorly performing crRNAs might not allow for large changes in enrichment. Controlling for the reference crRNA efficacy, mismatches in a ‘U’-context in the target site negatively impact Cas13d activity, whereas mismatches in a ‘GC’-context were better tolerated. We confirmed the presence of the seed region in transfection experiments using guides with single or double nucleotide mismatches to the GFP mRNA (
Importantly, the center of the RfxCas13d seed region coincides with conserved contacts between a helical domain in Cas13d protein and the backbone of the guide RNA-target hybrid interface. This interaction resides opposite of the guide RNA position 17-18 with the target RNA28. The helical domain is located between both HEPN-domains needed for target cleavage, and mutation of the interacting amino acids in EsCas13d completely abolished target cleavage 28. Mismatches at the seed center thus may impair HEPN-domain activity.
Next we sought to assess the features that may affect knock-down efficacy for perfect match guide RNAs (see Example 6 for details). One of the features impacting the observed guide RNA enrichments in the GFP tiling screen was crRNA folding: Predicting secondary structures and corresponding minimum free energy (MFE) of perfect match crRNAs showed a positive correlation between the MFE and guide efficacy (
We defined 15 crRNA and target-RNA features based on their correlation with observed guide enrichment in our exploratory data analysis (Table 6, Example 6 and data not shown). With these features, we sought to derive a generalizable ‘on-target’ model to predict Cas13d target knock-down. We compared the ability of machine learning approaches to predict guide efficiency (observed log2FC) in the held-out data (see Methods) and found that a Random Forest (RF) model had the best prediction accuracy (
To show that our model is generalizable, we predicted guides to target the endogenous transcripts of CD46 and CD71, which encode cell surface proteins, and measured the guide knock-down efficacy by FACS. For each gene, we chose 3 guide RNAs predicted to have high knock-down efficacy (Q3 or Q4 ) and 3 guide RNAs predicted to have low knock-down efficacy (Q1 or Q2 ). On an individual guide level, we found that the majority of guides with higher predicted guide scores suppressed CD46 and CD71 protein expression more robustly than guides with lower guide scores (
In addition, we performed a second targeted essentiality screen in A375 cells targeting 35 essential and 65 control genes with both 20 high-scoring and 20 low-scoring guide RNAs. Similar to the HEK293 screen above, we found that high-scoring guides that target essential genes were progressively depleted over time (
We calculated a significance score of gene depletion based on the guide rank consistency for the 20 high-scoring guides and found strong enrichment of defined essential genes at the top of the list (
Our predictive on-target model based on the GFP tiling screen was largely able to separate guide RNAs with low knock-down efficiency from those with high efficiency. However, given that we observed remaining heterogeneity among the predicted high-scoring guides, we sought to improve our on-target model by enlarging our training dataset. Therefore, we performed three additional Cas13d tiling screens targeting the main transcript isoforms of the cell surface proteins CD46, CD55 and CD71 in HEK293 cells coupled with FACS readout selecting for cells with decreased surface protein expression (
CrRNAs are lentivirally transduced into TetO-RfxCas13d HEK293 cells. Five to ten days after transduction, cells are stained for the targeted cell surface protein and sorted by intensities into 2 bins (Bottom 20%=strongest knock-down, Top 20%=least knock-down). For each screen, perfect match guide RNAs showed the strongest guide enrichment relative to the unsorted input samples, while reverse complement negative control guides and non-targeting guides were depleted (data not shown). In the new screens we reduced the overall guide length to 23 bases and included a set of guide length variants ranging in length from 15 to 36 nucleotides. Starting from a length of 23 nucleotides, guide RNAs exerted full knock-down efficiency, while longer guide 3′ ends did not have any deleterious effects.
Perfect match guides targeting coding regions (CDS) were more strongly enriched compared to guides targeting untranslated regions (UTRs) or introns. UTR-targeting guides may show lower enrichments as each target gene may be represented by multiple transcript isoforms with alternative UTR usage. Hence, guides targeting coding regions have a higher likelihood to find the cognate target site while, for example, 3′ UTR-targeting guide RNAs find their target site only in a fraction of the expressed transcripts isoforms. Accordingly, the low enrichment for intron-targeting guide RNAs may be explained by the short-lived nature of introns. For these guides, the intronic target site is present only for a short period of time, which likely enables the transcript to evade Cas13 targeting. For this reason, guide RNA knock-down efficiency may not be directly comparable between CDS-targeting guides and UTR- or intron-targeting guides.
We also observed a slight decrease in guide efficiency of intron-targeting guides immediately downstream of the 5′-splice-site and within the −50 to 0 nucleotides upstream of the 3′-splice-site when summarizing across all 39 introns present. These sites are typically bound by the spliceosome41, suggesting that guide RNAs targeting these regions may compete with the splice machinery and other splice factors for the target sequences. As transcript maturation in the nucleus seemingly influences the guide RNA targeting efficiency, we wondered if the exon-junction-complex (EJC) would affect knock-down of the matured transcript in the same way. The EJC typically binds ˜20-24 nucleotides 5′ upstream of the exon-exon-junction upon splicing42,43. Indeed, we observed a depletion of high-scoring guide RNAs within a window of −20 to 0 nucleotides 5′ upstream of the exon junction.
To improve our on-target model, we focused on perfect match guide RNAs that target CDS-regions and increased the number of high-confidence model input observations from ˜400 to nearly 3000. We performed a grid-search correlating the observed guide RNA efficacies with the target nucleotide probabilities across a window of 1 nt up to 50 nt at every point 75 nt upstream to 75 nt downstream relative to all 2,918 selected CDS-targeting perfect match target sites. We determined for each nucleotide (A,C,G,U, A|U, G|C) the position and window size of minimal and maximal Pearson correlation coefficient (Table 7). Patterns derived from partial correlation controlling for the crRNA MFE did not deviate from correlations shown. A comparison of machine learning regression approaches to predict target knock-down of held-out data using bootstrapping was also performed. The data of all CDS-targeting perfect match guides (n=2,918) from all four tiling screens and features (Table 7) was randomly split into 70% training data and 30% held-out testing data for 1000 random non-redundant splits. The prediction accuracy (comparing predicted scores to the known log2FC) is computed using the Spearman correlation (rs) to the held-out data. Models were ranked by their median performance.
Similar to the initial GFP-screen, guide RNAs efficiencies were distributed along the coding region in a non-random manner (
The combined Random Forest model (RFcombined ) displayed improved prediction accuracy compared to the initial RFminimal model (referred to herein as RFGFP ) explaining ˜47% of the variance (r2 ) with a Spearman correlation (rs ) of ˜0.67 to the held-out data (
Finally, we compared the ability of both models, the RFGFP and RFcombined model, with respect to their ability to correctly predict the knockdown efficiencies for the two essentiality screens. Both screens were designed based on guide predictions made by the RFGFP model. In both cases, the RFcombined was in better agreement with the observed knock-down efficiencies across all genes (
We applied our model and predicted guide RNAs for all protein-coding transcripts in the human genome (GENCODE v19 ). We made these predictions available through a user-friendly, web-based application (cas13design.nygenome.org). In addition, we report the 10 highest-scoring crRNAs for the 5′ UTR, CDS and 3′ UTR of each transcript (data not shown). We partitioned the predicted guide RNAs according to the efficacy quartiles in our four screens. Only 15.2% of all possible guides fall into the highest scoring (best knock-down) quartile (Q4 ). A large fraction of guide RNAs are predicted to have lower efficacy (36.8% of all guides are in Q1 or Q2 ), which emphasizes the value of optimal guide selection for high knock-down efficacy. However, almost all transcripts have top-scoring guide predictions.
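The partitioning of predicted guides into efficacy quartiles can be illustrated with a short sketch. The cutoff values are hypothetical placeholders: in practice they would be the score boundaries that separate the efficacy quartiles in the four screens. The sketch assumes that a higher score means better predicted knock-down, so that Q4 is the best quartile.

```python
from bisect import bisect_right

def assign_quartile(score, cutoffs):
    """Return 1-4 given three ascending cutoffs (Q1/Q2, Q2/Q3, Q3/Q4)."""
    return bisect_right(cutoffs, score) + 1

def quartile_shares(scores, cutoffs):
    """Fraction of predicted guides that fall into each quartile."""
    counts = {q: 0 for q in (1, 2, 3, 4)}
    for s in scores:
        counts[assign_quartile(s, cutoffs)] += 1
    return {q: c / len(scores) for q, c in counts.items()}

# Hypothetical cutoffs and scores, for illustration only.
example_cutoffs = [0.25, 0.5, 0.75]
shares = quartile_shares([0.1, 0.3, 0.6, 0.9, 0.95], example_cutoffs)
```

Because the cutoffs come from the screens rather than from the predictions themselves, the shares need not be 25% each, which is why only a minority of all possible guides can land in Q4.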
Taken together, we performed a set of pooled screens for CRISPR Type VI Cas13d and defined targeting rules for optimal guide design. We show that crRNA choice and target RNA-context constrain target knock-down efficacy and, using this data, we develop and validate an ‘on-target’ model to predict guides with high efficacy. Although we specifically sought to define rules for active Cas13d, we believe that our model may be transferable to inactive (catalytically dead) Cas13d effector proteins. Beyond our on-target guide design, we identified a critical seed region in the crRNA that is sensitive to target mismatch. We provide evidence that this seed region can be used in living cells to discriminate between target RNAs with high similarity, such as allele-specific single nucleotide polymorphisms.
For the RFGFP model we define features as follows: For guide RNA features (features 4, 5, 6 ), nucleotide 1 defines the guide start site (GSS) being the most 5′ guide RNA base matching the target RNA. Nucleotide 2 relative to GSS is the subsequent base (moving in the 5′ to 3′ direction) in the guide RNA and so on. For target RNA features (features 7-15 ), we denote the target nucleotide opposite to the GSS as nucleotide 0. Moving in 5′ to 3′ direction target RNA nucleotide −1 is upstream (5′) to target RNA nucleotide 0 and base-paired to guide nucleotide 2, while target RNA nucleotide +1 is downstream of the target site and so on. A complete illustration for features 4-15 with a schematic of the guide RNA and target RNA can be found in Example 6 and
For the RFcombined model we define features as follows: For guide RNA features (features 4 and 5 ), nucleotide 1 defines the guide start site (GSS) being the most 5′ guide RNA base matching the target RNA. Nucleotide 2 relative to GSS is the subsequent base (moving in the 5′ to 3′ direction) in the guide RNA and so on. For target RNA features (features 6-18 ), we denote the target nucleotide opposite to the GSS as nucleotide 0. Moving in 5′ to 3′ direction target RNA nucleotide −1 is upstream (5′) to target RNA nucleotide 0 and base-paired to guide nucleotide 2, while target RNA nucleotide +1 is downstream of the target site and so on. A complete illustration for features 4-18 with a schematic of the guide RNA and target RNA can be found in Example 7 and
A. Anti-Tag
Recently, others have found that Cas13a is inhibited by a 4 nt “anti-tag” sequence homology between the end of the DR and the corresponding flanking sequence of the target and have speculated that Cas13d, which has a similarly positioned 5′ DR, might also use an anti-tag for host versus pathogen discrimination10. Using all perfect match guide RNAs, we did not find evidence for the presence of a similar anti-tag sequence for RfxCas13d suggesting that anti-tags may not be found in all Type VI CRISPRs or contribute only marginally compared to other features, data not shown.
B. Nucleotide Preferences
Next, we tested whether position-based nucleotide preferences exist within the guide RNA target sequence or nearby nucleotides by comparing the nucleotide composition of the top 20% to all perfectly matching guides in the GFP screen, similar to previous approaches assessing Cas9 guide preferences11. Although the top enriched guides showed slight nucleotide preferences compared to all guides, most preferences became insignificant after correction for multiple hypothesis testing. However, when we correlated guide RNA nucleotide probabilities with the observed guide enrichment, we saw that high ‘G’-content in the guide RNA had a strong negative impact. Other measures, like guide RNA GC-content indicated a local optimum around 50% with lower guide efficiency when the guide adopts lower or higher GC proportions.
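For concreteness, the ‘G’-content and GC-content measures discussed above can be computed per guide as simple base fractions. The helper name `base_fraction` is an assumption made for illustration; the sketch also converts T to U so that DNA-style library sequences are handled the same way as RNA.

```python
def base_fraction(seq, bases):
    """Fraction of positions in seq occupied by any base in `bases`."""
    seq = seq.upper().replace("T", "U")
    return sum(seq.count(b) for b in bases) / len(seq)

guide = "GGCAUU"                         # toy guide sequence, illustration only
g_content = base_fraction(guide, "G")    # 2 of 6 bases
gc_content = base_fraction(guide, "GC")  # 3 of 6 bases
```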
C. crRNA Folding
The negative correlation to the observed guide RNA enrichments (log2FC) was restricted to high ‘G’-content in the guide RNA, while guide RNA ‘C’-content did not affect targeting in the same way. This suggests that the effect may not be caused by specific guide-target interaction, which should weight ‘C’ and ‘G’ bases interchangeably, but instead may be driven by ‘G’-dependent stable structures within the crRNA that may render the crRNA inaccessible for Cas13d. Indeed, predicting the secondary structure and corresponding minimum free energy (MFE) of perfect match guides showed a positive correlation between the MFE and guide efficacy (
D. Guide RNA—Target RNA Hybridization
We next tested whether guide-target hybridization can contribute to guide RNA efficacy by computing the correlation between hybridization energy and guide RNA efficacy. The Pearson correlation coefficient (rp ) of the observed log2FC and the hybridization minimum free energy (MFE) of guide RNA nucleotide position p over the distance d to the position p+d with its cognate target sequence for all perfectly matching guide RNAs was determined. For the GFP screen we found that more stable hybridization between guide RNAs and their target sequences (lower MFE) was correlated with lower guide RNA efficacy (r=0.31 ). This suggests that the most stable guide-target interactions may render the ribonucleoprotein complex less active. Interestingly, calculating MFEs between smaller regions within the guide RNA indicated multiple sub-structures that contribute to the overall correlation, suggesting that individual parts of the guide-target interaction may serve specific roles during ternary complex formation or nuclease activation. However, these correlative structures were nearly gone when using partial correlation to control for the effect of crRNA folding.
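Controlling a correlation for crRNA folding, as described above, can be done with a first-order partial correlation. The sketch below uses the standard formula r_xy.z = (r_xy − r_xz·r_yz) / sqrt((1 − r_xz²)(1 − r_yz²)); it is an illustration under that assumption, and the variable roles (x: hybridization MFE, y: log2FC, z: crRNA folding MFE) are our labeling rather than a record of the original analysis code.

```python
def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def partial_correlation(x, y, z):
    """Correlation of x and y after removing the linear effect of z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / ((1 - r_xz ** 2) * (1 - r_yz ** 2)) ** 0.5
```

When z is uncorrelated with both x and y, the partial correlation equals the plain correlation; a correlation that shrinks toward zero after this correction is largely explained by z.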
E. Target Site Nucleotide Context
Beyond guide RNA nucleotide composition, we wondered if the context features of the guide RNA target site affected target knock-down. By correlating the observed guide RNA log2FC with the nucleotide probabilities across windows around target sites, we detected a strong negative impact of high ‘C’-context directly at the target site. However, this may be confounded by the high guide RNA ‘G’-content and its role in crRNA folding (
F. Target Site Accessibility
We also assessed whether the target site accessibility influences knock-down by correlating the observed guide RNA efficacies with the target site accessibility. Here, we define target site accessibility as the probability that the target RNA (in this case, GFP mRNA) is unpaired. We found a weak positive correlation with increased target site accessibility centered on the 3′-end of the spacer RNA (
G. On-Target Model Feature Collection
Based on our analyses above, we determined the position and window-size with the best correlation to the observed guide RNA enrichments for each feature (
For the assessment of crRNA and target RNA features, we considered 2,918 perfect match guides that target coding regions across four genes: GFP, CD46, CD55, CD71 (see
A. Anti-Tag
Using all perfect match guide RNAs, we did not find evidence for the presence of an anti-tag sequence27 for RfxCas13d suggesting that anti-tags may not be found in all Type VI CRISPRs or contribute only marginally compared to other features.
B. Nucleotide Preferences
Next, we tested whether position-based nucleotide preferences exist within the guide RNA target sequence or nearby nucleotides by comparing the nucleotide composition of the top 20% to all perfectly matching guides across all four screens, similar to previous approaches assessing Cas9 guide preferences11. The increased number of data points uncovered clear nucleotide preferences. We determined the effect size (delta nucleotide probabilities) and Bonferroni-corrected p-values of observing the conditional probability of a guide in the top 20% under the null distribution at every position, including the 4 nucleotides upstream and downstream of the guide RNA target site. The p-values were calculated from the binomial distribution with a baseline probability estimated from the full-length mRNA target sequences of all perfect match guide RNAs. The top 20% were selected for each screen separately to ensure equal contribution. The top enriched guides showed preferences for G-bases at guide nucleotides 19-21 (with position 1 defined as the most 5′ nucleotide in the guide RNA that matches the target RNA). C-bases were favored at positions 15-16. Interestingly, the enriched G and C bases surround the center of the critical seed region at position 18 (see
We correlated guide RNA nucleotide probabilities with the observed guide enrichment. In the GFP screen data alone we found that high ‘G’-content in the guide RNA had a strong negative impact. This impact was reduced when taking all four screens into account. The guide RNA GC-content indicated a local optimum around 50% with lower guide efficiency when the guide adopts lower or higher GC proportions.
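The position-wise enrichment test described above for the top 20% of guides (binomial p-value against a baseline base probability, followed by Bonferroni correction) can be sketched as below. The one-sided upper tail is an assumption, since the sidedness of the test is not stated here, and the function names are hypothetical.

```python
from math import comb

def binom_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k, n + 1))

def position_enrichment(top_guides, base, baseline, n_tests):
    """Per position: (position, effect size, Bonferroni-corrected p-value).

    top_guides: equal-length sequences of the top-ranked guides.
    baseline:   background probability of `base`.
    n_tests:    number of hypotheses tested (positions x bases).
    """
    n = len(top_guides)
    results = []
    for pos in range(len(top_guides[0])):
        k = sum(1 for g in top_guides if g[pos] == base)
        p_value = binom_upper_tail(k, n, baseline)
        # Bonferroni: multiply by the number of tests and cap at 1.
        results.append((pos + 1, k / n - baseline, min(1.0, p_value * n_tests)))
    return results
```

The effect size reported here is the difference between the observed base frequency among top guides and the baseline, which corresponds to the delta nucleotide probability.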
C. crRNA Folding
Analyzing the GFP screen alone, we found previously that the predicted crRNA folding minimum free energy (MFE) of perfect match guides correlated positively with guide RNA efficacy (see
D. Guide RNA—Target RNA Hybridization
We next tested whether guide-target hybridization can contribute to guide RNA efficacy when integrating data from all four screens. Unlike for the GFP screen alone, we found that the overall hybridization energy between the full-length guide RNA and target sequence correlated less. Instead, the hybridization energies of sub-fragments contributed differentially to the overall guide-target interaction. The hybridization energy between the 12 nucleotides from guide position 3 to 15 and the cognate target site showed a slight positive correlation. Hybridization energies covering the 9 nucleotides from guide position 15 to 23 correlated negatively with the knock-down efficiencies. Unlike for the GFP screen analysis before, these sub-structures were still present when controlled for the crRNA folding energies using partial correlations.
E. Target Site Accessibility
We also assessed the target site accessibility for all screens and correlated observed guide RNA efficacies with the target site accessibility. Here, we define target site accessibility as the probability that the target RNA is unpaired. We did not find a strong relationship between the probability of the target sequence being unpaired and the observed knock-down strengths. Similar to the GFP screen alone, we found a weak positive correlation with increased target site accessibility centered on the 3′-end of the spacer RNA. We also recapitulated the observed nucleotide preferences with higher GC-content centered around seed nucleotide 18, which is surrounded by higher AU-content. Higher AU-content translates to increased accessibility, while higher GC-content suggests local secondary structures.
F. On-Target Model Feature Collection
Based on our analyses across all four tiling screens, we determined the position and window-size with the best correlation to the observed guide RNA enrichments for each feature (
To show the overall distribution of GFP signal in response to the screen, the GFP flow cytometry plot in
A GFP-FSC scatter plot is also presented in
Regarding predictions for Bins 2-4: Bins 2, 3 and 4 did not enrich for high-efficiency guide RNAs, but instead were depleted for high-efficiency guide RNAs. However, to clarify, our predictive model does not try to make predictions specifically for Bins 2-4. Rather, the targeting quartiles are based on the guide RNA enrichments within Bin 1 (presented in
Regarding the outliers, they may have been introduced during the screen (e.g. FACS selection or PCR amplification) as the outliers do not show any consistency between transduction replicates. In contrast, outliers introduced during transduction would display a guide-specific pattern in input and sorted samples of the same biological (transduction) replicate, which would be apparent in the Pearson correlation coefficients in
As this was not the case, we removed single outlier counts for a single biological sample, but not the entire gRNA for the GFP screen (clarified in the methods). Normalization was done including all counts, as all counts contribute to the library size. Moreover, outliers were not enriched for a particular class of guides and are a small minority of the points. Overall, the outlier detection procedure resulted in the removal of ˜2% of data points with the highest residuals across the 15 biological samples. In conclusion, we considered the outliers to be a random confounder and thus masked individual counts only when detected as an outlier. Most importantly, by masking outliers we reduced the number of perfect match guide RNAs used for the initial on-target model by only 1 guide from 400 to 399, and by 4 guide RNAs overall. A table is provided below summarizing reproducibility (correlation) of bins 1, 2, 3, 4 and input counts throughout the normalization steps across the three replicate GFP-screens. A complete set of all pairwise correlations can be found in
We initially attempted to compare our predicted scores to guides that have been used in previous Cas13d work. Given the few papers in this field, we had difficulty making a meaningful comparison, as other RfxCas13d papers used only very small numbers of guide RNAs (e.g. Ref 22) or a non-mammalian context (e.g. bacterial library in EsCas13d from Ref 17).
In light of these issues, we decided to significantly expand our dataset and added three additional tiling screens targeting the human transcripts CD46, CD55 and CD71. This data is in
Indeed, we observed overall high consistency between the samples. One reason for the high consistency may be that the plasmid crRNA libraries showed a very even distribution of guide counts (e.g. comparing the 90th to 10th percentile ratio, we determined a skew of 2.2 or less for all screens present in this work). Compared to previous large-scale Cas9 libraries (e.g. GeCKOv1, Ref 2), this is nearly a 10-fold improvement in library uniformity. As a consequence, guide RNAs are likely represented very evenly in the unsorted input samples. In addition, we sequenced the GFP screen to a high depth (˜450 reads per guide). Taken together, this may explain the overall good agreement of guide RNAs (even ones with low representation/counts) in the GFP screen. No additional filtering (e.g. removal of guide RNAs below a minimum count) was used.
The low read count guides are likely not outliers in other aspects. Specifically, we grouped the perfect match guide RNAs into 5 consecutive bins (20% bins) based on their log2FC enrichments and compared the associated guide counts in Input and Sort Bin 1 samples (=highest knock-down). We find that all guides are evenly represented across the entire range of log2FC enrichments in the input samples (
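The library skew quoted above is the ratio of the 90th to the 10th percentile of guide counts. A minimal sketch follows; the linear-interpolation percentile is an assumption, because the exact percentile definition used for the libraries is not specified here.

```python
def percentile(sorted_values, q):
    """Linear-interpolation percentile of pre-sorted values, q in [0, 100]."""
    pos = (len(sorted_values) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (sorted_values[hi] - sorted_values[lo]) * (pos - lo)

def skew_ratio(guide_counts):
    """Ratio of the 90th to the 10th percentile of guide read counts."""
    s = sorted(guide_counts)
    return percentile(s, 90) / percentile(s, 10)

# Toy counts for illustration only; a perfectly uniform library gives 1.0.
toy_skew = skew_ratio(list(range(1, 12)))
```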
The observed effects are driven by the differential guide RNA enrichments in Sort Bin 1. It is important to note that our conclusion is based on the average base probabilities of the bottom 0-20% guides (n=80 guide RNAs).
We tested the modified crRNA direct repeat (DR) for 6 additional guide RNAs. We tested the same 6 guide RNAs presented in
We found that the modified DR improved GFP knock-down for low knock-down efficiency guide RNAs, but the effect was negligible for high knock-down efficiency guide RNAs. Overall, the modified DR seems to either improve knock-down efficiency or have no effect when knock-down is already strong.
In our initial GFP tiling screen, we noticed that the guide RNA activity correlated positively with the uridine probability ˜50 nucleotides upstream of the guide RNA target site via a grid search over positions 75 nt upstream to 75 nt downstream of the target site. This finding is consistent with previous reports of higher nuclease activity in ‘U’-rich contexts22,38. Specifically, Konermann et al. 2018 tested the EsCas13d cleavage activity in single stranded and structured context varying the upstream nucleotide identities. The authors found that target cleavage showed a significant preference for U-bases. The exact distance to the target site was not addressed. In Freije et al. 2019, the authors found that active guide RNAs had a higher U-probability within the 50 nt upstream and downstream of the target site compared to non-active guide RNAs for a Cas13a variant. However, it is not clear if this finding was statistically significant. Similarly, since the correlation between the upstream U content and the observed guide efficiency in our GFP screen was driven by a group of guides that all fell into the same region of the GFP target transcript, we were concerned the observed upstream U-context might not be generalizable (i.e. targeting position specific).
To address this concern, we generated a GFP reporter plasmid that allowed for changing the nucleotide context upstream of a perfect match target site. We designed a 52mer oligonucleotide lacking uridines and optimized to minimize predicted RNA secondary structures. We cloned 52mer oligonucleotides with 3 or 6 uridines at various positions into the GFP-reporter plasmid and tested the upstream uridine context effect on GFP knock-down. Each reporter transcript was targeted directly downstream of the introduced oligo, or with a non-targeting guide. This was done, because the introduced U-stretches could potentially act as cis-regulatory elements and recruit RNA binding proteins and thus influence target RNA stability44 independent of the Cas13 protein. We did not observe a significant position-dependent effect of upstream Us (data not shown), suggesting that the effect may be target site specific, driven by additional downstream U content, or too weak to be assessed in this experiment. More importantly, we did not find a similar correlation of U-probability and guide efficiency for the CD46, CD55 and CD71 tiling screens (˜6.3× larger dataset).
The linear combination of nucleotide context (which we term herein as “NT-context+”) represents the following model formula:
guide RNA efficiency (log2FC) ~ local A1 + local C + local G + local U + upstream U + crRNA MFE
Each of the listed model parameters is defined in Table 6. This linear model utilizes the same 6 features from the RFGFP model. Although the features are selected (see next paragraph), the model (NT-context+) itself is just a linear (regression) model.
We identified the features most highly correlated with guide RNA efficacy using an exhaustive (grid) search ranging over 75 nt upstream to 75 nt downstream of the target site. At each position, we determined if the nucleotide (A, C, G, U, GC, AU) probability over a window size of 1 to 50 nt correlated with the observed guide RNA efficiencies (log2FC). For A nucleotides in the initial GFP screen, we identified three positions (termed A1, A2, and A3 ) with similar Pearson correlation coefficients.
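The exhaustive search described above can be sketched as a double loop over window start and window size. This is an illustration under stated assumptions: each guide is represented by a context string of equal length covering the same positions relative to its target site, `start` is a 0-based index into that string (the caller converts it to a position relative to the target site), and windows with no variation across guides are skipped. The function names are hypothetical.

```python
def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None  # correlation undefined for a constant feature
    return sxy / (sxx * syy) ** 0.5

def window_fraction(seq, start, size, bases):
    """Fraction of the window occupied by any base in `bases`."""
    window = seq[start:start + size]
    return sum(window.count(b) for b in bases) / len(window)

def grid_search(contexts, log2fc, bases, max_window=50):
    """Return the windows with minimal and maximal Pearson correlation.

    Each result is a tuple (r, start, size).
    """
    length = len(contexts[0])
    best_min = best_max = None
    for size in range(1, min(max_window, length) + 1):
        for start in range(0, length - size + 1):
            feature = [window_fraction(c, start, size, bases) for c in contexts]
            r = pearson(feature, log2fc)
            if r is None:
                continue
            if best_max is None or r > best_max[0]:
                best_max = (r, start, size)
            if best_min is None or r < best_min[0]:
                best_min = (r, start, size)
    return {"min": best_min, "max": best_max}
```

Passing `bases="A"` scores a single nucleotide, while `bases="GC"` or `bases="AU"` scores the combined classes.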
Specifically, these predictive features are:
A1 is the probability of A-bases in a 7 nt window centered at nucleotide 23 relative to the guide sequence start (GSS). We define “nucleotide 1 relative to GSS” as the most 5′ guide RNA base matching the target RNA. “Nucleotide 2 relative to GSS” is the subsequent base (moving in the 5′ to 3′ direction) in the guide RNA and so on.
A2 is the probability of A-bases in a 33 nt window centered at nucleotide 23 relative to the GSS.
A3 is the probability of A-bases in a 20 nt window centered at nucleotide 17 relative to the GSS.
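The windowed features listed above can be computed with a small helper once the GSS-relative numbering is fixed. The sketch below makes several assumptions that are ours and not taken from Tables 6 and 7: the sequence is supplied in the orientation in which the feature is read, `nt1_index` gives the 0-based string index of nucleotide 1, an even-sized window is placed with its extra position on the 5′ side, and windows that run past the supplied sequence are clipped. The helper `paired_target_nt` restates the convention that guide nucleotide 1 pairs with target nucleotide 0 and guide nucleotide 2 with target nucleotide −1.

```python
def paired_target_nt(guide_nt):
    """Target RNA nucleotide index base-paired to a guide nucleotide."""
    return 1 - guide_nt

def window_base_fraction(seq, nt1_index, center_nt, size, base="A"):
    """Fraction of `base` in a window of `size` centered at `center_nt`.

    center_nt is numbered relative to the GSS (nucleotide 1 = GSS).
    """
    center = nt1_index + (center_nt - 1)
    start = center - size // 2
    window = seq[max(start, 0):min(start + size, len(seq))]
    if not window:
        raise ValueError("window lies entirely outside the sequence")
    return window.count(base) / len(window)

# Toy sequence for illustration: A-bases at nucleotides 20-26.
toy = "C" * 19 + "A" * 7 + "C" * 10
a1_like = window_base_fraction(toy, 0, center_nt=23, size=7)   # 7 nt window
a2_like = window_base_fraction(toy, 0, center_nt=23, size=33)  # clipped window
```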
A complete description of all features that have been selected can be found in Tables 6 and 7. We have summarized the selected features for the GFP-screen based model (see
Naïvely, predicted low-scoring guides should not confer any knock-down, while predicted high-scoring guides should confer strong knock-down. However, we show in
The shift in the distribution from CD46 and CD71 knock-down (FSC vs. CD46/CD71 ) shows a unimodal distribution (i.e. cells of all sizes are shifting to less CD46 or CD71 signal, respectively).
In our revised study, we conducted several additional screens (in total, a ˜6.3× larger dataset) and built a new predictive Cas13 guide RNA model. In all of these tiling screens (GFP, CD46, CD55, CD71 ), we do not see a specific impact of target position along the transcript.
For example, if we simply compare the ratio of the average knock-down efficiency of guide RNAs in the first 50% of each transcript versus those in the last 50% of each transcript, we did not observe significant differences (mean±s.d.: 0.97±0.08 ). Moreover, we assessed the presence of a correlation between the guide RNA match position and the observed guide RNA efficiency and found no clear connection (mean rs=−0.02±0.07 ).
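The first-half versus second-half comparison above can be sketched as follows. The function name is hypothetical, and the sketch assumes that efficiencies are on a scale where averaging and taking a ratio is meaningful (for example, values of a single sign).

```python
from statistics import mean

def half_ratio(match_positions, efficiencies, transcript_length):
    """Mean efficiency in the first half divided by that in the second half."""
    midpoint = transcript_length / 2
    first = [e for p, e in zip(match_positions, efficiencies) if p < midpoint]
    second = [e for p, e in zip(match_positions, efficiencies) if p >= midpoint]
    return mean(first) / mean(second)
```

A ratio close to 1 for every transcript, as reported above, indicates no systematic effect of guide position along the transcript.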
In contrast, specific nucleotide contexts are among the most important features in the improved predictive model (see
Our main focus in this work was active Cas13 for knock-down of target RNAs. We believe that assessing nuclease-inactive applications of dCas13 (such as the modulation of alternative splicing) requires a different readout from the FACS-based screens described in our work. (This is in contrast to nuclease-active Cas9 and inactive dCas9-KRAB applications, which can be assessed by the same readout.) Experiments addressing this important point are underway, using an appropriate readout for dCas13d such as an alternative splicing reporter (as shown in Konermann et al. 2018).
We greatly improved the initial start time (˜100× faster, <10 seconds). Moreover, we have improved data visualization to plot all predictions along the target transcript and have added new guide RNAs to target both 5′ and 3′ UTRs. We have also made it easy to download gene-specific visualizations and guide RNAs.
Endogenous Targets
We added five screens with endogenous targets. Two of these screens (HEK293FT and A375 gene essentiality screens) aimed to validate our initial on-target model based on the GFP screen by comparing low-scoring and high-scoring guide RNAs head-to-head with higher throughput. In addition, three tiling screens (CD46, CD55, CD71) aimed to enlarge the dataset for training an on-target model. In total, we added measurements for 21,763 additional guide RNAs to the initial set of 7,500 GFP-targeting guide RNAs. Considering only endogenous genes, we have evaluated our on-target models on 48 endogenous target genes (essential genes plus cell surface proteins) and 3,979 coding sequence targeting perfect match guide RNAs.
Significance Testing
For 47 genes we expected detectable changes when targeted with either low- or high-scoring guide RNAs (2 genes in
To more globally compare gene essentiality between RNAi, CRISPR-Cas9 and Cas13d, we derived a p-value for the relative guide depletion in the A375 screen, under the assumption that more essential genes would show more pronounced guide depletion. Specifically, we calculated a p-value based on the log2FC rank consistency of the most depleted N guides (N in {1, 5, 20}) using robust rank aggregation (RRA)39. RRA assesses the relative rank of each group of selected guides across all 100 genes present in the A375 screen. In this way, RRA represents a multiple-comparisons test in which the consistency of relative guide ranks is compared across genes. The outcome represents a p-value for gene essentiality under the null hypothesis that there are no essential genes (i.e. that there are no guides that rank robustly at the top of the ranked essentiality list). Using all 20 high-scoring guides per gene, we found that essential genes were associated with lower p-values and separate clearly from control genes (
Moreover, we used the derived p-value (Cas13 essentiality score) and compared our score to Cas9 and RNAi derived gene essentiality scores in A375 cells. For this comparison we used non-parametric Spearman rank correlations of all genes in our A375 dataset with essentiality scores in all three approaches. All 35 essential genes were included along with 15 control genes. The -log10p-values (Cas13 essentiality scores) correlated better with the DEMETER2 RNAi40 scores (up to rs=0.71 ) than with the Cas9 STARS scores15 (up to rs=0.61 ) (
Targeting Complex Transcript Features (UTRs and Introns)
Beyond our efforts to validate the initial on-target model based on the GFP screen data, we conducted three additional tiling screens targeting genes that encode for cell surface proteins (CD46, CD55 and CD71 ). These new tiling screens enabled us to assess features we could not assess using the GFP screen alone. For example, we found that guide RNAs targeting coding sequences showed overall stronger enrichments (target depletion) compared to guide RNAs targeting untranslated regions (UTRs) or introns.
This observation may be explained in part by differential target-site availability. Intronic sequences are comparably short-lived and thus can be targeted only for a short period of time during the lifespan of the target transcript. 3′UTRs on the other hand may undergo differential cleavage and polyadenylation, hence only a fraction of transcripts will contain guide RNA target-sites that target longer 3′UTRs. However, data from 3′UTR-end sequencing by Christine Mayr's lab18 suggests that CD55 shows strong evidence for alternative cleavage and polyadenylation in HEK293FT cells, while CD46 and CD71 may only express one 3′UTR isoform. Nevertheless, all three target genes show the same enrichment pattern: CDS>5′ UTR≈3′ UTR>introns, in order of largest median fold-change to smallest median fold-change.
Complex Transcript Features: Splice Junctions
Using a set of ˜2100 intron-targeting guide RNAs across all 39 introns (CD46 n=12 introns, CD55 n=9 introns, CD71 n=18 introns), we found a decrease in guide efficiency of intron-targeting guides immediately downstream of the 5′-splice-site and within nucleotides −50 to 0 upstream of the 3′-splice-site. The Ule lab recently showed that these sites are usually bound by proteins of the spliceosome19, suggesting that Cas13d may compete with splicing factors and other RNA-binding proteins38 for intronic target sequences. Similarly, we found a decrease of strongly enriched guide RNAs −20 to 0 nucleotides upstream of exon-exon-junctions, which may indicate that Cas13 competes with the exon-junction-complex for target-sites directly upstream (20-24 nt) of exon-exon-junctions20,21.
Summary of Expanded Analysis: Tiling Cell Surface Proteins and Improved Model
We repeated the feature exploration for the combined tiling screen (n=2,918 coding sequence targeting perfect match guide RNAs) similar to our initial approach for the GFP screen. Details about the selected features can be found in Example 7, with a full description of features in Table 7. Using these features we derived an updated on-target model (RFcombined) that showed improved correlation to the held-out data during bootstrapping (
Taken together, we show that our model can reasonably predict the guide RNA efficiencies, and provide evidence that Cas13d can be used in forward genetic screens. It is important to note that our on-target model (RFcombined ) has more predictive power (rs=0.67 ) than initial Cas9 on-target models (2014: rs=˜0.45, 2016: rs=˜0.5 ) 14,15.
We have collected a ˜6× larger dataset to build the improved model. Our improved on-target model (RFcombined ) predicts correctly 63% of guides in the highest scoring efficiency quartile, whereas our previous model (RFGFP ) achieved only 46%.
We conducted two targeted fitness screens and included a comparison of depleted genes to previous RNAi screens.
In these screens, we assessed the gene essentiality for 120 endogenous genes. Specifically, we have targeted 10 essential and 10 control genes in human HEK293FT cells and monitored guide depletion for 3 predicted low-scoring and 3 predicted high-scoring guides per gene over four weeks. We found that predicted high-scoring guide RNAs targeting essential genes were the most strongly depleted class of guide RNAs (
In both screens, targeting of control genes (n=75 genes) did not lead to cell dropout (guide RNA depletion) suggesting the observed guide depletions are not mediated by Cas13 off-target activity but are specific to the targeted gene. Accordingly, we found that a gene essentiality score measured by the Cas13 guide RNA depletion correlates to gene essentiality derived by RNAi (DEMETER2, from 712 RNAi screens) with a Spearman rank correlation of up to rs=0.71 (
CRISPR-Cas13 mediates robust transcript knockdown in human cells through direct RNA targeting13,15-17. Compared to DNA-targeting CRISPR enzymes like Cas9, RNA targeting by Cas13 is transcript- and strand-specific: It can distinguish and specifically knock down processed transcripts, alternatively spliced isoforms and overlapping genes, all of which frequently serve different functions. Previously, we have described a set of optimal design rules for RfxCas13d guide RNAs (gRNAs), and developed a computational model to predict gRNA efficacy for all human protein-coding genes46. However, there is growing interest in targeting other types of transcripts, such as noncoding RNAs (ncRNAs)47,48 or viral RNAs49,50, and in targeting transcripts in other commonly-used organisms51,52,40,53. In this example, we predicted relative Cas13-driven knock-down for gRNAs targeting messenger RNAs and ncRNAs in six model organisms (human, mouse, zebrafish, fly, nematode and flowering plants) and four abundant RNA virus families (SARS-CoV-2, HIV-1, H1N1 influenza and MERS). To allow for more flexible gRNA efficacy prediction, we also developed a web-based application to predict optimal Cas13d guide RNAs for any RNA target entered by the user.
To select optimal gRNAs for transcripts produced from the reference genomes of human, mouse, zebrafish, fly, nematode and flowering plants, we created a user-friendly Cas13 online platform (cas13design.nygenome.org/). We previously found that optimal Cas13 gRNAs depend on specific sequence and structural features, including position-based nucleotide preferences in the gRNA and the predicted folding energy (secondary structure) of the combined direct repeat plus gRNA46. Using this algorithm, we pre-computed gRNA efficacies, where possible, for all mRNAs and ncRNAs with varying transcript length for the 6 model organisms (
For the scored gRNAs for each organism, we found that approximately 20% are ranked in the top quartile (Q4 guides) for both mRNAs and ncRNAs. Remarkably, even though the nucleotide composition can vary between RNAs from different species54-56, we find a similar proportion of optimal RfxCas13d gRNAs across all six species.
Next, we examined how many predicted high efficacy gRNAs are present, on average, in different transcripts. To do this, we determined what fraction of the transcripts in each organism include n top-scoring (Q4) gRNAs for values of n between 1 and 25. We found that coding sequences contained a higher number of top-scoring gRNAs per transcript across all organisms, whereas targeting the noncoding transcriptome is more challenging and varies across different organisms. On average, we were able to find at least 25 Q4 gRNAs for >99% of coding exons in mRNAs but only 80% of ncRNAs. Beyond targeting transcripts from the reference genomes of these model organisms, there are also many other applications of Cas13, such as targeting transcripts from non-model organisms, cleavage of synthetic RNAs, and targeting of transcripts carrying genetic variants not found in the reference genome. Therefore, in addition to these pre-scored gRNAs, we have also developed a graphical interface that allows the user to input a custom RNA sequence for scoring and selection of optimal gRNAs.
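The per-transcript tally described above (the fraction of transcripts that contain at least n Q4 gRNAs, for n from 1 to 25) reduces to a cumulative count. The sketch below assumes that the number of Q4 gRNAs has already been counted for each transcript; the function name is hypothetical.

```python
def fraction_with_at_least(q4_counts, max_n=25):
    """Map n -> fraction of transcripts with at least n Q4 gRNAs."""
    total = len(q4_counts)
    return {
        n: sum(1 for count in q4_counts if count >= n) / total
        for n in range(1, max_n + 1)
    }

# Toy per-transcript Q4 counts, for illustration only.
curve = fraction_with_at_least([0, 3, 30, 25])
```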
Recently, several groups have proposed using CRISPR-Cas13 nucleases to directly target viral RNAs8,57, which has become an area of rapid technology development due to the recent COVID-19 pandemic58. However, these approaches do not use optimized Cas13 guide RNAs. Previously, we showed that optimal guide RNAs targeting an EGFP transgene can result in a ~10-fold increase in knock-down efficacy compared to other gRNAs46. Therefore, to speed development of effective CRISPR-based antiviral therapeutics, we applied our design algorithm to target SARS-CoV-2 and other serious viral threats using Cas13d.
To ensure coverage of diverse patient isolates, we collected 7,630 sequenced SARS-CoV-2 genomes submitted to the Global Initiative on Sharing All Influenza Data (GISAID) database from 58 countries/regions19.
Similarly, we designed and scored all gRNAs for the coronavirus MERS and two other RNA viruses: HIV-1, which causes Acquired Immunodeficiency Syndrome (AIDS), and H1N1 pandemic influenza. Unlike SARS-CoV-2, where a single high-efficacy (Q4) gRNA can target all genomes analyzed, we found that at least two gRNAs are needed to target nearly all available genomes for these viruses. For the highly mutagenic virus HIV-161, we found that nine gRNAs are needed to target all available genomes.
RNA-targeting CRISPR-Cas13 has great potential for transcriptome perturbation and antiviral therapeutics. In this example, we have designed and scored Cas13d gRNAs for both mRNAs and ncRNAs in six common model organisms and identified optimized gRNAs to target virtually all sequenced viral RNAs for SARS-CoV-2, HIV-1, H1N1 influenza and MERS. We further expanded our web-based platform to make the Cas13 gRNA design readily accessible for model organisms and created a new application to enable gRNA prediction for user-provided target RNA sequences. Given the current lack of Cas13 guide design tools, we anticipate this resource will greatly facilitate CRISPR-Cas13 RNA targeting in model organisms, emerging viral threats to human health and novel RNA targets.
A. gRNA Design for Model Organisms
Reference transcriptomes and corresponding annotations were obtained for each model organism: H. sapiens (GENCODE v19, GRCh37), M. musculus (GENCODE M24, mm10), D. rerio (Ensembl v99, GRCz11), D. melanogaster (Ensembl v99, BDGP6), C. elegans (Ensembl v99, WBcel235) and A. thaliana (Ensembl Plants v46, TAIR10). For each organism, we performed the on-target efficiency predictions for both mRNAs and ncRNAs using the command-line RfxCas13d designer version 0.2 as previously described46. We scored gRNAs for all RNA targets with a length of at least 80 nucleotides.
B. RNA Virus Genome Collection
All full-length RNA virus genomes were downloaded on Apr. 17, 2020. We downloaded 7,630 complete SARS-CoV-2 viral genomes classified as high coverage and 4,237 Influenza A H1N1 viral genomes with a complete set of eight genomic segments. SARS-CoV-2 and H1N1 genomes were obtained from GISAID (www.gisaid.org/). We also analyzed 522 MERS-CoV and 5,557 full length HIV-1 viral genomes, which were downloaded from NCBI Virus (https://www.ncbi.nlm.nih.gov/labs/virus/).
C. gRNA Design to Target SARS-CoV-2
We split multi-FASTA files into single-entry FASTA files using the UCSC tool faSplit62. All possible 23-mer gRNAs targeting individual genomes were scored with the RfxCas13d on-target model described previously46. All scored guides were classified into four quartiles. Quartile 4 guides (Q4) are designated as the predicted best-performing guides. We used USA/NY1-PV08001/2020 (referred to as the NY1 isolate) for the SARS-CoV-2 reference gRNA design. Compared to the original (Wuhan) isolate, NY1 contains three nucleotide substitutions (G3243A, C25214T, G29027T) resulting in two amino acid mutations (N: A252S, ORF1a: G993S). The SARS-CoV-2 transcript annotation was obtained from NCBI (GenBank: NC_045512.2).
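As an illustration of the enumeration and classification steps, the following minimal sketch lists every 23-nt target window of a sequence, derives the complementary guide for each window, and bins already-computed scores into quartiles. It does not reproduce the on-target model itself. The function names enumerate_guides and assign_quartiles are hypothetical, and assigning quartiles by rank within the scored set is an assumption, since the text does not state how the quartile boundaries were set.

```python
# Watson-Crick complement table for RNA bases.
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def enumerate_guides(target, k=23):
    """Return (start, guide) pairs for every k-nt window of the target.

    Each guide is the reverse complement of its target window, so that the
    guide can hybridize to the target RNA. DNA-style input (T) is read as U.
    """
    rna = target.upper().replace("T", "U")
    return [
        (start, rna[start:start + k].translate(COMPLEMENT)[::-1])
        for start in range(len(rna) - k + 1)
    ]


def assign_quartiles(scores):
    """Map each guide to a quartile from 1 to 4 (4 = highest-scoring quarter).

    ASSUMPTION: boundaries are taken by rank within the supplied scores.
    """
    ranked = sorted(scores, key=scores.get)
    n = len(ranked)
    return {guide: (rank * 4) // n + 1 for rank, guide in enumerate(ranked)}


# Hypothetical 24-nt toy target, which yields two 23-nt windows.
toy_guides = enumerate_guides("A" * 22 + "CG")

# Hypothetical scores standing in for the output of the on-target model.
toy_quartiles = assign_quartiles({"g1": 0.1, "g2": 0.2, "g3": 0.3, "g4": 0.4})
```

A target shorter than 23 nt yields an empty list, because no full-length window fits.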
D. Prediction of Minimal Numbers of gRNAs to Target RNA Viruses
For each RNA virus, we identified a minimal set of high-scoring Q4 gRNAs that could target all genomes collected. We used a greedy algorithm as described previously49: in each iteration, the gRNA that targets the highest number of genomes is added to the set. If multiple gRNAs target the same highest number of genomes in an iteration, we pick one of them for the minimal set and start the next iteration.
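The greedy selection can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used for this example: it counts only genomes not yet covered by previously chosen gRNAs in each iteration (the standard greedy set-cover formulation), it breaks ties by taking the alphabetically first gRNA, and the names greedy_min_guides and guide_to_genomes are hypothetical.

```python
def greedy_min_guides(guide_to_genomes, all_genomes):
    """Greedily pick gRNAs until every genome is targeted or no gRNA adds coverage.

    guide_to_genomes maps a gRNA to the set of genome identifiers it targets.
    Returns (chosen gRNAs in pick order, genomes left untargeted).
    """
    uncovered = set(all_genomes)
    chosen = []
    while uncovered:
        # ASSUMPTION: score each gRNA by the number of still-uncovered genomes
        # it targets; sorting first makes the tie-break deterministic.
        best = max(
            sorted(guide_to_genomes),
            key=lambda guide: len(guide_to_genomes[guide] & uncovered),
        )
        gained = guide_to_genomes[best] & uncovered
        if not gained:
            break  # the remaining genomes cannot be targeted by any gRNA
        chosen.append(best)
        uncovered -= gained
    return chosen, uncovered


# Hypothetical toy input: four gRNAs and five genomes labeled A to E.
toy_map = {
    "g1": {"A", "B", "C"},
    "g2": {"C", "D"},
    "g3": {"D", "E"},
    "g4": {"E"},
}
chosen, missed = greedy_min_guides(toy_map, {"A", "B", "C", "D", "E"})
print(chosen, missed)
```

In the toy input, g1 is picked first because it targets three genomes, and g3 then covers the two that remain. A greedy pass of this kind yields a small set but is not guaranteed to find the smallest possible one.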
E. Code Availability
All designed Cas13 guide RNAs (for model organisms and RNA viruses) and the interactive design tool are available at: cas13design.nygenome.org/. The Cas13 guide design algorithm is available at: gitlab.com/sanjanalab/cas13. For additional reproducibility, we provide UNIX and R code to reproduce all figures at: www.dropbox.com/sh/9mk7jlfzcalhnlx/AACp-zkPZSZt6tcLSWOS2a90a?dl=0. Prior to publication, the code to reproduce figures is included on the Gitlab repository. The contents of the websites including the guide sequences referenced in Example 9 are incorporated herein by reference.
Each and every patent, patent application, and publication, including websites cited throughout the specification, particularly gitlab.com/sanjanalab/cas13, and the appended Sequence Listing are incorporated herein by reference. While the invention has been described with reference to particular embodiments, it will be appreciated that modifications can be made without departing from the spirit of the invention. Such modifications are intended to fall within the scope of the appended claims.
This application claims the benefit of the priority of U.S. Provisional Patent Application No. 62/940,575, filed Nov. 26, 2019; U.S. Provisional Patent Application No. 62/952,922, filed Dec. 23, 2019; and U.S. Provisional Patent Application No. 63/060,757, filed Aug. 4, 2020, which applications are incorporated herein by reference.
This invention was made with government support under grant numbers R00HG008171, DP2HG010099, and R01CA218668 awarded by the National Institutes of Health and grant number D18AP00053 awarded by the Defense Advanced Research Projects Agency. The government has certain rights in the invention.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/US20/62379 | 11/25/2020 | WO |
Number | Date | Country | |
---|---|---|---|
62940575 | Nov 2019 | US | |
62952922 | Dec 2019 | US | |
63060757 | Aug 2020 | US |