Identifying the cause of schizophrenia and bipolar disorder has been challenging. Although these disorders are known in some cases to have a genetic basis, how that genetic basis gives rise to schizophrenia and bipolar disorder pathology remains unclear. Furthermore, an effective therapeutic remains elusive. Accordingly, new methods of diagnosis and treatment are needed that address how these genetic dysregulations cause schizophrenia and bipolar disorder.
In one aspect, the invention features a method of treating schizophrenia or bipolar disorder in a subject by identifying a sequence of a novel open reading frame (nORF) associated with the schizophrenia or bipolar disorder, wherein the sequence of the nORF is distinct from a canonical open reading frame (cORF) of a gene. The nORF is present in (i) an overlapping region of the cORF in an alternate reading frame, (ii) a 5′ untranslated region (UTR) of the cORF, (iii) a 3′ UTR of the cORF, (iv) an intronic region of the cORF, (v) an intergenic region of the cORF, or (vi) a region not associated with the cORF or the gene, and wherein the nORF has increased expression relative to the nORF in a subject without schizophrenia or bipolar disorder. The method further includes administering to the subject an inhibitor that reduces expression of the nORF to treat the schizophrenia or bipolar disorder.
In another aspect, the invention features a method of treating schizophrenia or bipolar disorder in a subject by administering to the subject an inhibitor that reduces expression of a nORF. The subject may have previously been identified with a sequence of the nORF associated with the schizophrenia or bipolar disorder, wherein the sequence of the nORF is distinct from a cORF of a gene, wherein the nORF is present in (i) an overlapping region of the cORF in an alternate reading frame, (ii) a 5′ UTR of the cORF, (iii) a 3′ UTR of the cORF, (iv) an intronic region of the cORF, (v) an intergenic region of the cORF, or (vi) a region not associated with the cORF or the gene, and wherein the nORF has increased expression relative to the nORF in a subject without schizophrenia or bipolar disorder.
In some embodiments of either of the foregoing aspects, the method reduces expression of the nORF, e.g., by at least 5%, 10%, 15%, 20%, 25%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 95%, 97%, or 99%. The nORF may exhibit an increase (e.g., by at least 5%, 10%, 15%, 20%, 25%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 100%, 200%, 300%, 400%, 500%, or more) in expression, e.g., as compared to the nORF in a normal (e.g., without schizophrenia or bipolar disorder) subject.
In some embodiments of either of the above aspects, the inhibitor is a small molecule, a polynucleotide, or a polypeptide. The polynucleotide may include a miRNA, an antisense RNA, an shRNA, or an siRNA. The polypeptide may include an antibody or antigen-binding fragment thereof (e.g., an scFv).
In some embodiments, the inhibitor is encoded by a vector, such as a viral vector. The viral vector may be selected, for example, from the group consisting of a Retroviridae family virus, an adenovirus, a parvovirus, a coronavirus, a rhabdovirus, a paramyxovirus, a picornavirus, an alphavirus, a herpes virus, and a poxvirus. The parvovirus viral vector may be, for example, an adeno-associated virus (AAV) vector.
In some embodiments, the viral vector is a Retroviridae family viral vector (e.g., a lentiviral vector, an alpharetroviral vector, or a gammaretroviral vector). The Retroviridae family viral vector may include one or more of the following: a central polypurine tract, a woodchuck hepatitis virus post-transcriptional regulatory element, a 5′-LTR, HIV signal sequence, HIV Psi signal 5′-splice site, delta-GAG element, 3′-splice site, and a 3′-self inactivating LTR.
In some embodiments, the viral vector is a pseudotyped viral vector. The pseudotyped viral vector may be selected, for example, from the group consisting of a pseudotyped adenovirus, a pseudotyped parvovirus, a pseudotyped coronavirus, a pseudotyped rhabdovirus, a pseudotyped paramyxovirus, a pseudotyped picornavirus, a pseudotyped alphavirus, a pseudotyped herpes virus, a pseudotyped poxvirus, and a pseudotyped Retroviridae family virus. The pseudotyped viral vector may be a lentiviral vector.
In some embodiments, the pseudotyped viral vector includes one or more envelope proteins from a virus selected from vesicular stomatitis virus (VSV), RD114 virus, murine leukemia virus (MLV), feline leukemia virus (FeLV), Venezuelan equine encephalitis virus (VEE), human foamy virus (HFV), walleye dermal sarcoma virus (WDSV), Semliki Forest virus (SFV), Rabies virus, avian leukosis virus (ALV), bovine immunodeficiency virus (BIV), bovine leukemia virus (BLV), Epstein-Barr virus (EBV), Caprine arthritis encephalitis virus (CAEV), Sin Nombre virus (SNV), Cherry Twisted Leaf virus (ChTLV), Simian T-cell leukemia virus (STLV), Mason-Pfizer monkey virus (MPMV), squirrel monkey retrovirus (SMRV), Rous-associated virus (RAV), Fujinami sarcoma virus (FuSV), avian carcinoma virus (MH2), avian encephalomyelitis virus (AEV), Alfa mosaic virus (AMV), avian sarcoma virus CT10, and equine infectious anemia virus (EIAV).
In some embodiments, the pseudotyped viral vector includes a VSV-G envelope protein.
In another aspect, the invention features a method of treating schizophrenia or bipolar disorder in a subject by identifying a sequence of a nORF associated with the schizophrenia or bipolar disorder, wherein the sequence of the nORF is distinct from a cORF of a gene. The nORF is present in (i) an overlapping region of the cORF in an alternate reading frame, (ii) a 5′ UTR of the cORF, (iii) a 3′ UTR of the cORF, (iv) an intronic region of the cORF, (v) an intergenic region of the cORF, or (vi) a region not associated with the cORF or the gene, and wherein the nORF has decreased expression relative to the nORF in a subject without schizophrenia or bipolar disorder. The method further includes administering to the subject an activator that increases expression of nORF to treat the schizophrenia or bipolar disorder.
In another aspect, the invention features a method of treating schizophrenia or bipolar disorder in a subject by administering to the subject an activator that increases expression of a nORF. The subject may have previously been identified with a sequence of the nORF associated with the schizophrenia or bipolar disorder, wherein the sequence of the nORF is distinct from a cORF of a gene, wherein the nORF is present in (i) an overlapping region of the cORF in an alternate reading frame, (ii) a 5′ UTR of the cORF, (iii) a 3′ UTR of the cORF, (iv) an intronic region of the cORF, (v) an intergenic region of the cORF, or (vi) a region not associated with the cORF or the gene, and wherein the nORF has decreased expression relative to the nORF in a subject without schizophrenia or bipolar disorder.
In some embodiments of either of the foregoing aspects, the method increases expression of the nORF, e.g., by at least 5%, 10%, 15%, 20%, 25%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 100%, 200%, 300%, 400%, 500%, or more. The nORF may exhibit a decrease (e.g., by at least 5%, 10%, 15%, 20%, 25%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 95%, 97%, or 99%) in expression, e.g., as compared to the nORF in a normal (e.g., without schizophrenia or bipolar disorder) subject.
In some embodiments, the activator is a small molecule, a polynucleotide, or a polypeptide. The polynucleotide may include an antisense RNA. The polypeptide may include an antibody or antigen-binding fragment thereof (e.g., an scFv).
In some embodiments, the activator is encoded by a vector, such as a viral vector. The viral vector may be selected, for example, from the group consisting of a Retroviridae family virus, an adenovirus, a parvovirus, a coronavirus, a rhabdovirus, a paramyxovirus, a picornavirus, an alphavirus, a herpes virus, and a poxvirus. The parvovirus viral vector may be, for example, an AAV vector.
In some embodiments, the viral vector is a Retroviridae family viral vector (e.g., a lentiviral vector, an alpharetroviral vector, or a gammaretroviral vector). The Retroviridae family viral vector may include one or more of the following: a central polypurine tract, a woodchuck hepatitis virus post-transcriptional regulatory element, a 5′-LTR, HIV signal sequence, HIV Psi signal 5′-splice site, delta-GAG element, 3′-splice site, and a 3′-self inactivating LTR.
In some embodiments, the viral vector is a pseudotyped viral vector. The pseudotyped viral vector may be selected, for example, from the group consisting of a pseudotyped adenovirus, a pseudotyped parvovirus, a pseudotyped coronavirus, a pseudotyped rhabdovirus, a pseudotyped paramyxovirus, a pseudotyped picornavirus, a pseudotyped alphavirus, a pseudotyped herpes virus, a pseudotyped poxvirus, and a pseudotyped Retroviridae family virus. The pseudotyped viral vector may be a lentiviral vector.
In some embodiments, the pseudotyped viral vector includes one or more envelope proteins from a virus selected from VSV, RD114 virus, MLV, FeLV, VEE, HFV, WDSV, SFV, Rabies virus, ALV, BIV, BLV, EBV, CAEV, SNV, ChTLV, STLV, MPMV, SMRV, RAV, FuSV, MH2, AEV, AMV, avian sarcoma virus CT10, and EIAV.
In some embodiments, the pseudotyped viral vector includes a VSV-G envelope protein.
In another aspect, the invention features a method of treating schizophrenia or bipolar disorder in a subject by identifying a sequence of a nORF associated with the schizophrenia or bipolar disorder, wherein the sequence of the nORF is distinct from a cORF of a gene. The nORF is present in (i) an overlapping region of the cORF in an alternate reading frame, (ii) a 5′ UTR of the cORF, (iii) a 3′ UTR of the cORF, (iv) an intronic region of the cORF, (v) an intergenic region of the cORF, or (vi) a region not associated with the cORF or the gene, and wherein the nORF has decreased expression relative to the nORF in a subject without schizophrenia or bipolar disorder. The method further includes providing a protein encoded by the nORF to the subject to treat the schizophrenia or bipolar disorder.
In another aspect, the invention features a method of treating schizophrenia or bipolar disorder in a subject by providing a protein encoded by a nORF to the subject. The subject may have previously been identified with a sequence of the nORF associated with the schizophrenia or bipolar disorder, wherein the sequence of the nORF is distinct from a cORF of a gene, wherein the nORF is present in (i) an overlapping region of the cORF in an alternate reading frame, (ii) a 5′ UTR of the cORF, (iii) a 3′ UTR of the cORF, (iv) an intronic region of the cORF, (v) an intergenic region of the cORF, or (vi) a region not associated with the cORF or the gene, and wherein the nORF has decreased expression relative to the nORF in a subject without schizophrenia or bipolar disorder.
In some embodiments of either of the foregoing aspects, the method includes restoring the encoded protein product of the nORF. The method may include providing the protein product or a polynucleotide encoding the protein product. The method may include providing a vector (e.g., a viral vector) including the polynucleotide encoding the protein product.
In some embodiments, the viral vector may be selected, for example, from the group consisting of a Retroviridae family virus, an adenovirus, a parvovirus, a coronavirus, a rhabdovirus, a paramyxovirus, a picornavirus, an alphavirus, a herpes virus, and a poxvirus. The parvovirus viral vector may be, for example, an adeno-associated virus (AAV) vector.
In some embodiments, the viral vector is a Retroviridae family viral vector (e.g., a lentiviral vector, an alpharetroviral vector, or a gammaretroviral vector). The Retroviridae family viral vector may include one or more of the following: a central polypurine tract, a woodchuck hepatitis virus post-transcriptional regulatory element, a 5′-LTR, HIV signal sequence, HIV Psi signal 5′-splice site, delta-GAG element, 3′-splice site, and a 3′-self inactivating LTR.
In some embodiments, the viral vector is a pseudotyped viral vector. The pseudotyped viral vector may be selected, for example, from the group consisting of a pseudotyped adenovirus, a pseudotyped parvovirus, a pseudotyped coronavirus, a pseudotyped rhabdovirus, a pseudotyped paramyxovirus, a pseudotyped picornavirus, a pseudotyped alphavirus, a pseudotyped herpes virus, a pseudotyped poxvirus, and a pseudotyped Retroviridae family virus. The pseudotyped viral vector may be a lentiviral vector.
In some embodiments, the pseudotyped viral vector includes one or more envelope proteins from a virus selected from VSV, RD114 virus, MLV, FeLV, VEE, HFV, WDSV, SFV, Rabies virus, ALV, BIV, BLV, EBV, CAEV, SNV, ChTLV, STLV, MPMV, SMRV, RAV, FuSV, MH2, AEV, AMV, avian sarcoma virus CT10, and EIAV.
In some embodiments, the pseudotyped viral vector includes a VSV-G envelope protein.
In some embodiments of any of the above aspects, the encoded protein product of the nORF is less than about 100 amino acids.
In some embodiments, the method further includes performing a statistical analysis between the nORF and the schizophrenia or bipolar disorder. The statistical analysis may measure a positive or negative association between the nORF and the schizophrenia or bipolar disorder.
In some embodiments, the nORF is associated with a transposable element. For example, the nORF may have a positive or negative correlation with a transposable element.
In some embodiments, the nORF is associated with a human accelerated region (HAR). For example, the nORF may have a positive or negative correlation with the HAR.
In some embodiments, the nORF is selected from Table 4. In some embodiments, the nORF has at least 85% (e.g., at least 90%, 95%, 97%, 98%, 99%, or 100%) sequence identity to any one of SEQ ID NOs: 1-21 or a fragment thereof. In some embodiments, the nORF has the sequence of any one of SEQ ID NOs: 1-21.
In some embodiments, the disease is bipolar disorder and the nORF is selected from Table 7. In some embodiments, the nORF has at least 85% (e.g., at least 90%, 95%, 97%, 98%, 99%, or 100%) sequence identity to any one of SEQ ID NOs: 124-163 or a fragment thereof. In some embodiments, the nORF has the sequence of any one of SEQ ID NOs: 124-163.
In some embodiments, the disease is bipolar disorder and the nORF is selected from Table 8. In some embodiments, the nORF has at least 85% (e.g., at least 90%, 95%, 97%, 98%, 99%, or 100%) sequence identity to any one of SEQ ID NOs: 164-207 or a fragment thereof. In some embodiments, the nORF has the sequence of any one of SEQ ID NOs: 164-207.
In some embodiments, the disease is schizophrenia and the nORF is selected from Table 9. In some embodiments, the nORF has at least 85% (e.g., at least 90%, 95%, 97%, 98%, 99%, or 100%) sequence identity to any one of SEQ ID NOs: 208-263 or a fragment thereof. In some embodiments, the nORF has the sequence of any one of SEQ ID NOs: 208-263.
In some embodiments, the disease is schizophrenia and the nORF is selected from Table 10. In some embodiments, the nORF has at least 85% (e.g., at least 90%, 95%, 97%, 98%, 99%, or 100%) sequence identity to any one of SEQ ID NOs: 264-324 or a fragment thereof. In some embodiments, the nORF has the sequence of any one of SEQ ID NOs: 264-324.
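The percent sequence identity thresholds recited above can be illustrated with a minimal sketch. It assumes the two sequences have already been aligned to equal length (with "-" marking gaps) and counts identity over all aligned columns; the function names and this particular identity convention are illustrative assumptions, as no alignment method or identity definition is specified here.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over aligned columns; '-' marks a gap.

    Assumption: inputs are pre-aligned to the same length. Gap columns
    count toward the denominator but never count as matches.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must be pre-aligned to equal length")
    if not aligned_a:
        raise ValueError("sequences must be non-empty")
    matches = sum(
        1
        for x, y in zip(aligned_a.upper(), aligned_b.upper())
        if x == y and x != "-"
    )
    return 100.0 * matches / len(aligned_a)


def meets_identity_threshold(aligned_a: str, aligned_b: str,
                             threshold: float = 85.0) -> bool:
    """True if identity is at least `threshold` percent (85% by default)."""
    return percent_identity(aligned_a, aligned_b) >= threshold
```

For example, a 10-residue candidate differing from a reference at one aligned position has 90% identity and would satisfy the 85% and 90% thresholds but not the 95% threshold.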
As used herein, a “novel open reading frame” or “nORF” refers to an open reading frame that is transcribed in a cell and consists of a sequence that is distinct from a canonical open reading frame (cORF) transcribed from a gene. The nORF may be present in (i) an overlapping region of the cORF in an alternate reading frame, (ii) a 5′ untranslated region (UTR) of the cORF, (iii) a 3′ UTR of the cORF, (iv) an intronic region of the cORF, (v) an intergenic region of the cORF, or (vi) a region not associated with the cORF or the gene. The nORF may be any unannotated genetic sequence that is transcribed in a cell. As used herein, a “canonical open reading frame” or “cORF” refers to an open reading frame that is transcribed in a cell and its associated genetic elements, including the 5′ UTR, the 3′ UTR, the intronic regions, the exonic regions, and the intergenic regions flanking the gene comprising the cORF. A cORF includes either the primary open reading frame that is expressed from a gene, the most abundantly expressed open reading frame expressed from a gene, or an ORF that is annotated in a publicly available database as the primary and/or most abundantly expressed open reading frame from a gene.
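The location categories in the definition above can be sketched as a simple interval classification. The sketch below is illustrative only: the `GeneModel` structure, the 0-based half-open coordinates, and the restriction to plus-strand genes are all assumptions made for this example, and the reading-frame comparison uses genomic offsets, which is valid only when no intron lies between the cORF start and the candidate ORF start.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class GeneModel:
    """Hypothetical plus-strand gene model (0-based, half-open coordinates)."""
    tx_start: int
    tx_end: int
    cds_start: int   # first base of the cORF start codon
    cds_end: int     # one past the last base of the cORF stop codon
    exons: Tuple[Tuple[int, int], ...]


def classify_orf(orf_start: int, orf_end: int, gene: GeneModel) -> str:
    """Assign a candidate ORF to one of the location categories."""
    # Entirely outside the transcribed region of the gene.
    if orf_end <= gene.tx_start or orf_start >= gene.tx_end:
        return "intergenic"
    # Inside the transcript but touching no exon.
    overlaps_exon = any(orf_start < e and s < orf_end for s, e in gene.exons)
    if not overlaps_exon:
        return "intronic"
    if orf_end <= gene.cds_start:
        return "5' UTR"
    if orf_start >= gene.cds_end:
        return "3' UTR"
    # Overlaps the cORF: compare frames by genomic offset (simplification,
    # see the assumption about introns stated above).
    same_frame = (orf_start - gene.cds_start) % 3 == 0
    return "in-frame with cORF" if same_frame else "overlapping, alternate frame"
```

Under these assumptions, only the "in-frame with cORF" result falls outside the nORF categories; a production annotation pipeline would also need to handle minus-strand genes and spliced frames.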
The patent or application file contains at least one drawing executed in color. Copies of this patent or patent application with color drawing(s) will be provided by the Office upon request and payment of the necessary fee.
Described herein are methods of diagnosing and treating schizophrenia or bipolar disorder associated with dysregulated novel open reading frames (nORFs). Schizophrenia or bipolar disorder may be caused by dysregulation (e.g., upregulation or downregulation) in a gene or a genetic variant that is associated with the schizophrenia or bipolar disorder. However, it was previously unclear how schizophrenia or bipolar disorder arises in cases in which no dysregulation of a canonical gene or a canonical open reading frame (cORF) associated with the gene is present and no variant is known. The present invention is premised, in part, upon the discovery of dysregulation of certain nORFs that are distinct from canonical open reading frames (cORFs) of genes. In these instances, the dysregulation (e.g., upregulation or downregulation) imparts a deleterious effect on the nORF, in some instances, with or without substantially impacting a protein encoded by a cORF. In particular, the present invention features methods of treating schizophrenia or bipolar disorder associated with a dysregulated nORF in which differential expression (e.g., increased or decreased expression) of the nORF is observed. With increased or decreased expression, the amount of the gene product encoded by the dysregulated nORF is increased or decreased as compared to that of the nORF in, e.g., a subject without schizophrenia or bipolar disorder. The methods of diagnosis and treatment are described in more detail below.
Genetic testing offers one avenue by which a patient may be diagnosed as having, or as being at risk of developing, schizophrenia or bipolar disorder. For example, a genetic analysis can be used to determine whether a patient has a nORF associated with schizophrenia or bipolar disorder. The nORF may be present in any region of a gene, such as within the cORF, a 5′ untranslated region (UTR) of the cORF, a 3′ UTR of the cORF, an intronic region of the cORF, or an intergenic region of the cORF. The nORF may be present within an overlapping region of the cORF in an alternate reading frame, a 5′ UTR of the cORF, a 3′ UTR of the cORF, an intronic region of the cORF, or an intergenic region of the cORF. The nORF may be present in a region that is not associated with the cORF of the gene.
Exemplary genetic tests that can be used to determine whether a patient has such a nORF include polymerase chain reaction (PCR) methods and DNA and RNA sequencing methods known in the art. nORF sequences may be identified de novo, e.g., using computational or statistical methods. Furthermore, nORF sequences may be identified from genomic sequences in publicly available databases in which the nORF was not previously identified and/or annotated as a sequence that was transcribed and/or translated.
nORF sequences may be identified as being linked to schizophrenia or bipolar disorder by using a statistical analysis between the dysregulated nORF and the schizophrenia or bipolar disorder. The statistical analysis may measure a positive or negative association between the dysregulated nORF and the schizophrenia or bipolar disorder (see, e.g., Example 1). The p-value may be, for example, less than 10⁻³, e.g., less than 10⁻⁴, e.g., less than 10⁻⁵.
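One way such an association and its p-value might be computed is sketched below. The particular statistical test is not specified in this passage, so a two-sided Fisher's exact test on a 2x2 contingency table is used purely as an illustrative assumption, and the counts in the usage note are hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test p-value for the 2x2 table

                      nORF dysregulated   nORF not dysregulated
        cases                 a                     b
        controls              c                     d

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more probable than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2
    denom = comb(total, col1)

    def prob(x: int) -> float:
        # Probability that the top-left cell equals x given fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Small relative tolerance guards against floating-point ties.
    p_value = sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-9)
    )
    return min(1.0, p_value)
```

With hypothetical counts of 10 of 10 cases and 0 of 10 controls showing the dysregulated nORF, `fisher_exact_two_sided(10, 0, 0, 10)` gives a p-value below 10⁻⁴, which would pass the first two exemplary thresholds above but not the third.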
nORF sequences may be identified as being linked to schizophrenia or bipolar disorder by using a statistical analysis between the dysregulated nORF and a human accelerated region (HAR), which is a region in the human genome that is conserved throughout vertebrate evolution but is different in humans. nORF sequences may be identified as being linked to schizophrenia or bipolar disorder by using a statistical analysis between the dysregulated nORF and a transposable element (TE), which is a DNA sequence that can change its position within a genome. The statistical analysis may measure a positive or negative association between the dysregulated nORF and the HAR and/or the TE (see, e.g., Example 1). The nORF may have a positive or negative association with the HAR and/or the TE. The p-value may be, for example, less than 10⁻³, e.g., less than 10⁻⁴, e.g., less than 10⁻⁵. To examine the functional importance of a nORF separately from a canonical coding sequence, datasets, such as the Genome Aggregation Database, may be used.
The invention features methods of treating a subject having a dysregulated nORF that has differential expression (e.g., increased or decreased expression). The dysregulated nORF may exhibit an increase (e.g., by at least 5%, 10%, 15%, 20%, 25%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 100%, 200%, 300%, 400%, 500%, or more) in expression, e.g., as compared to the nORF in a normal (e.g., without schizophrenia or bipolar disorder) subject. The dysregulated nORF may exhibit a decrease (e.g., by at least 5%, 10%, 15%, 20%, 25%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 95%, 97%, or 99%) in expression, e.g., as compared to the dysregulated nORF in a normal (e.g., without schizophrenia or bipolar disorder) subject. The subject may be first determined to have the dysregulated nORF and then may subsequently be treated for the schizophrenia or bipolar disorder. The subject may have previously been determined to have the dysregulated nORF and is then treated for the schizophrenia or bipolar disorder. The treatment varies according to the dysregulated nORF associated with the schizophrenia or bipolar disorder. For example, the treatment may include an inhibitor that targets the dysregulated nORF to decrease (e.g., by at least 5%, 10%, 15%, 20%, 25%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 95%, 97%, or 99%) expression of an upregulated nORF. The treatment may include an activator that targets the dysregulated nORF to increase (e.g., by at least 5%, 10%, 15%, 20%, 25%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 100%, 200%, 300%, 400%, 500%, or more) expression of a downregulated nORF. Alternatively, or in addition, the treatment may include providing the nORF or a protein encoded by the nORF to restore levels of the nORF.
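The percent changes in expression recited above are relative to the expression of the nORF in a subject without the disorder. The sketch below works through that arithmetic and maps the direction of change to the treatment categories described above. The function names, the expression units, and the use of 5% as a default cutoff are illustrative assumptions, not a clinical decision rule.

```python
def percent_change(subject_expr: float, reference_expr: float) -> float:
    """Percent change in expression relative to a reference subject.

    Positive values indicate increased expression; negative values
    indicate decreased expression.
    """
    if reference_expr <= 0:
        raise ValueError("reference expression must be positive")
    return 100.0 * (subject_expr - reference_expr) / reference_expr


def suggest_modality(subject_expr: float, reference_expr: float,
                     min_change: float = 5.0) -> str:
    """Map the direction of dysregulation to a treatment category."""
    change = percent_change(subject_expr, reference_expr)
    if change >= min_change:
        return "inhibitor"
    if change <= -min_change:
        return "activator or nORF replacement"
    return "no dysregulation detected"
```

For instance, an expression level of 30 units against a reference of 10 units is a 200% increase, whereas 5 units against the same reference is a 50% decrease.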
The methods of treatment and diagnosis described herein may include providing an inhibitor that targets the dysregulated nORF. The inhibitor may reduce (e.g., by at least 5%, 10%, 15%, 20%, 25%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 95%, 97%, or 99%) an amount or activity of the dysregulated nORF, such as to prevent the deleterious effect of the dysregulated nORF. The inhibitor may target the polynucleotide containing the nORF or the protein encoded by the nORF. The inhibitor may be a small molecule, a polynucleotide, or a polypeptide. Suitable small molecules may be determined or identified by using computational analysis based on the structure of the dysregulated nORF as determined by a protein folding algorithm. The small molecule may target any region of the dysregulated nORF. The small molecule may target the nORF or the protein encoded by the nORF. Suitable polypeptides for reducing an activity or amount of the dysregulated nORF include, for example, an antibody or antigen-binding fragment thereof that binds to the dysregulated nORF (e.g., a single chain antibody or antigen-binding fragment thereof). Suitable polynucleotides that can reduce an amount or activity of the dysregulated nORF include RNA. For example, an RNA for reducing an activity or amount of the dysregulated nORF may be, for example, a miRNA, an antisense RNA, an shRNA, or an siRNA. The miRNA, antisense RNA, shRNA, or siRNA may target a region of RNA (e.g., dysregulated nORF gene) to reduce expression of the dysregulated nORF. The polynucleotide may be an aptamer, e.g., an RNA aptamer that binds to and/or reduces an amount and/or activity of the dysregulated nORF or the protein encoded by the dysregulated nORF. The inhibitor may be provided directly or may be provided by a vector (e.g., a viral vector) encoding the inhibitor. The inhibitor may be formulated, e.g., in a pharmaceutical composition containing a pharmaceutically acceptable carrier. 
The composition can be administered by any suitable method known in the art to the skilled artisan. The composition (e.g., a vector, e.g., a viral vector) may be formulated in a virus or a virus-like particle.
Using the compositions and methods described herein, a patient with schizophrenia or bipolar disorder may be administered an interfering RNA molecule, a composition containing the same, or a vector encoding the same, so as to reduce or suppress the expression of a dysregulated nORF. Exemplary interfering RNA molecules that may be used in conjunction with the compositions and methods described herein are siRNA molecules, miRNA molecules, shRNA molecules, and antisense RNA molecules, among others. In the case of siRNA molecules, the siRNA may be single stranded or double stranded. miRNA molecules, in contrast, are single-stranded molecules that form a hairpin, thereby adopting a hydrogen-bonded structure reminiscent of a nucleic acid duplex. In either case, the interfering RNA may contain an antisense or “guide” strand that anneals (e.g., by way of complementarity) to the RNA target (e.g., a transcript containing the dysregulated nORF). The interfering RNA may also contain a “passenger” strand that is complementary to the guide strand and, thus, may have the same nucleic acid sequence as the RNA target.
siRNA is a class of short (e.g., 20-25 nt) double-stranded non-coding RNA that operates within the RNA interference pathway. siRNA may interfere with expression of a dysregulated nORF gene having a complementary nucleotide sequence by degrading mRNA (via the Dicer and RISC pathways) after transcription, thereby preventing translation. miRNA is another short (e.g., about 22 nucleotides) non-coding RNA molecule that functions in RNA silencing and post-transcriptional regulation of gene expression. miRNAs function via base-pairing with complementary sequences within mRNA molecules, thereby leading to cleavage of the mRNA strand into two pieces and destabilization of the mRNA through shortening of its poly(A) tail. shRNA is an artificial RNA molecule with a tight hairpin turn that can be used to silence target gene expression via RNA interference. Antisense RNAs are also short, single-stranded molecules that hybridize to a target RNA and prevent translation by occluding the translation machinery, thereby reducing expression of the target (e.g., the dysregulated nORF).
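The guide and passenger strand relationship described above can be illustrated with a short sketch: the guide strand is the reverse complement of the targeted site on the RNA, and the passenger strand is the reverse complement of the guide and therefore matches the target site. The function names and the example site are hypothetical, and the sketch ignores practical design considerations such as overhangs, GC content, and off-target screening.

```python
# Watson-Crick complement table for RNA bases.
_RNA_COMPLEMENT = str.maketrans("ACGU", "UGCA")


def guide_strand(target_site: str) -> str:
    """Return the antisense (guide) strand, written 5' to 3', for a target site.

    DNA-style input is accepted and converted to RNA (T -> U).
    """
    site = target_site.upper().replace("T", "U")
    invalid = set(site) - set("ACGU")
    if invalid:
        raise ValueError(f"unexpected characters in sequence: {sorted(invalid)}")
    # Complement each base, then reverse to read 5' to 3'.
    return site.translate(_RNA_COMPLEMENT)[::-1]


def passenger_strand(guide: str) -> str:
    """Return the passenger strand, i.e., the reverse complement of the guide."""
    return guide_strand(guide)
```

For the hypothetical target site `AUGGCC`, the guide strand is `GGCCAU`, and the passenger strand derived from that guide is `AUGGCC`, identical to the target site.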
Using the compositions and methods described herein, a patient with schizophrenia or bipolar disorder may be provided an antibody or antigen-binding fragment thereof, a composition containing the same, a vector encoding the same, or a composition of cells containing a vector encoding the same, so as to suppress or reduce the activity of the dysregulated nORF. In some embodiments of the compositions and methods described herein, an antibody or antigen-binding fragment thereof may be used that binds to and reduces or eliminates the activity of the dysregulated nORF. The antibody may be monoclonal or polyclonal. In some embodiments, the antigen-binding fragment is an antibody that lacks the Fc portion, an F(ab′)2, a Fab, an Fv, or an scFv. The antigen-binding fragment may be an scFv. One of ordinary skill in the art will appreciate that an antibody may include four polypeptides: two identical copies of a heavy chain polypeptide and two copies of a light chain polypeptide. Each of the heavy chains contains one N-terminal variable (VH) region and three C-terminal constant (CH1, CH2 and CH3) regions, and each light chain contains one N-terminal variable (VL) region and one C-terminal constant (CL) region. Thus, one of skill in the art would appreciate that as described herein, a vector that includes a transgene that encodes a polypeptide that is an antibody may be a single transgene that encodes a plurality of polypeptides. Also contemplated is a vector that includes a plurality of transgenes, each transgene encoding a separate polypeptide of the antibody. All variations are contemplated herein. The variable regions of each pair of light and heavy chains form the antigen binding site of an antibody. The transgene which encodes an antibody directed against the dysregulated nORF can include one or more transgene sequences, each of which encodes one or more of the heavy and/or light chain polypeptides of an antibody.
In this respect, the transgene sequence which encodes an antibody directed against the dysregulated nORF can include a single transgene sequence that encodes the two heavy chain polypeptides and the two light chain polypeptides of an antibody. Alternatively, the transgene sequence which encodes an antibody directed against the dysregulated nORF can include a first transgene sequence that encodes both heavy chain polypeptides of an antibody, and a second transgene sequence that encodes both light chain polypeptides of an antibody. In yet another embodiment, the transgene sequence which encodes an antibody can include a first transgene sequence encoding a first heavy chain polypeptide of an antibody, a second transgene sequence encoding a second heavy chain polypeptide of an antibody, a third transgene sequence encoding a first light chain polypeptide of an antibody, and a fourth transgene sequence encoding a second light chain polypeptide of an antibody.
In some embodiments, the transgene that encodes the antibody includes a single open reading frame encoding a heavy chain and a light chain, and each chain is separated by a protease cleavage site.
In some embodiments, the transgene includes a single open reading frame encoding both heavy chains and both light chains, and each chain is separated by a protease cleavage site.
In some embodiments, full-length antibody expression can be achieved from a single transgene cassette using 2A peptides, such as foot-and-mouth disease virus (FMDV), equine rhinitis A virus, porcine teschovirus-1, and Thosea asigna virus 2A peptides, which are used to link two or more genes and allow the translated polypeptide to be self-cleaved into individual polypeptide chains (e.g., heavy chain and light chain, or two heavy chains and two light chains). Thus, in some embodiments, the transgene encodes a 2A peptide in between the heavy and light chains, optionally with a flexible linker flanking the 2A peptide (e.g., GSG linker). The transgene may further include one or more engineered cleavage sequences, e.g., a furin cleavage sequence to remove the 2A peptide residues attached to the heavy chain or light chain. Exemplary 2A peptides are described, e.g., in Chng et al., MAbs 7:403-412, 2015, and Lin et al., Front. Plant Sci. 9:1379, 2018, the disclosures of which are hereby incorporated by reference in their entirety.
In some embodiments, the antibody is a single-chain antibody or antigen-binding fragment thereof expressed from a single transgene.
The methods of treatment and diagnosis described herein may include providing an activator that targets the dysregulated nORF. The activator may increase (e.g., by at least 5%, 10%, 15%, 20%, 25%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 100%, 200%, 300%, 400%, 500%, or more) an amount or activity of the dysregulated nORF, such as to prevent the deleterious effect of the dysregulated nORF. The activator may target the polynucleotide containing the nORF or the protein encoded by the nORF. The activator may be a small molecule, a polynucleotide, or a polypeptide. Suitable small molecules may be determined or identified by using computational analysis based on the structure of the dysregulated nORF as determined by a protein folding algorithm. The small molecule may target any region of the dysregulated nORF. The small molecule may target the nORF or the protein encoded by the nORF. Suitable polypeptides for increasing an activity or amount of the dysregulated nORF include, for example, an antibody or antigen-binding fragment thereof that binds to the dysregulated nORF (e.g., a single chain antibody or antigen-binding fragment thereof). Suitable polynucleotides that can increase an amount or activity of the dysregulated nORF include RNA. For example, an RNA for increasing an activity or amount of the dysregulated nORF may be, for example, an antisense RNA. The antisense RNA may target a region of RNA (e.g., dysregulated nORF gene) upstream of the primary nORF open reading frame to reduce expression of the upstream nORFs, thereby dedicating the translation machinery to the primary nORF in order to increase expression of the primary nORF. The polynucleotide may be an aptamer, e.g., an RNA aptamer that binds to and/or increases an amount and/or activity of the dysregulated nORF or the protein encoded by the dysregulated nORF. The activator may be provided directly or may be provided by a vector (e.g., a viral vector) encoding the activator. 
The activator may be formulated, e.g., in a pharmaceutical composition containing a pharmaceutically acceptable carrier. The composition can be administered by any suitable method known in the art to the skilled artisan. The composition (e.g., a vector, e.g., a viral vector) may be formulated in a virus or a virus-like particle.
nORF Replacement
The present invention also features methods of treating schizophrenia or bipolar disorder by administering or providing a nORF or a protein encoded by the nORF. The therapy may restore the encoded protein product of the nORF, such as to replace the nORF that is no longer present due to downregulation. The therapy may include providing the protein product or a polynucleotide encoding the protein product. The method may include providing a vector (e.g., a viral vector) that encodes the protein product. Alternatively, the protein encoded by the nORF may be administered directly, e.g., as an enzyme replacement therapy. The nORF or a polynucleotide encoding the nORF (e.g., a vector, e.g., a viral vector) may be formulated, e.g., in a pharmaceutical composition containing a pharmaceutically acceptable carrier. The composition can be administered by any suitable method known in the art to the skilled artisan. The composition may be formulated in a virus or a virus-like particle.
In some embodiments, the length of the nORF is less than about 100 amino acids (e.g., from about 50 to 100, 50 to 90, 50 to 80, 60 to 90, 60 to 80, 70 to 100, 70 to 90, 70 to 80, 80 to 100, or 90 to 100 amino acids).
Viral genomes provide a rich source of vectors that can be used for the efficient delivery of exogenous genes into a mammalian cell. The gene to be delivered may include an activator or inhibitor that targets a dysregulated nORF, such as an RNA (e.g., an aptamer, a miRNA, an antisense RNA, an shRNA, or an siRNA). Alternatively, the gene to be delivered may include the nORF for replacement. Viral genomes are particularly useful vectors for gene delivery as the polynucleotides contained within such genomes are typically incorporated into the nuclear genome of a mammalian cell by generalized or specialized transduction. These processes occur as part of the natural viral replication cycle, and do not require added proteins or reagents in order to induce gene integration. Examples of viral vectors are a retrovirus (e.g., Retroviridae family viral vector), adenovirus (e.g., Ad5, Ad26, Ad34, Ad35, and Ad48), parvovirus (e.g., an adeno-associated viral (AAV) vector), coronavirus, negative strand RNA viruses such as orthomyxovirus (e.g., influenza virus), rhabdovirus (e.g., rabies and vesicular stomatitis virus), paramyxovirus (e.g. measles and Sendai), positive strand RNA viruses, such as picornavirus and alphavirus, and double stranded DNA viruses including adenovirus, herpesvirus (e.g., Herpes Simplex virus types 1 and 2, Epstein-Barr virus, cytomegalovirus), and poxvirus (e.g., vaccinia, modified vaccinia Ankara (MVA), fowlpox and canarypox). Other viruses include Norwalk virus, togavirus, flavivirus, reoviruses, papovavirus, hepadnavirus, human papilloma virus, human foamy virus, and hepatitis virus, for example. Examples of retroviruses are: avian leukosis-sarcoma, avian C-type viruses, mammalian C-type, B-type viruses, D-type viruses, oncoretroviruses, HTLV-BLV group, lentivirus, alpharetrovirus, gammaretrovirus, spumavirus (Coffin, J. M., Retroviridae: The viruses and their replication, Virology, Third Edition (Lippincott-Raven, Philadelphia, (1996))). 
Other examples are murine leukemia viruses, murine sarcoma viruses, mouse mammary tumor virus, bovine leukemia virus, feline leukemia virus, feline sarcoma virus, avian leukemia virus, human T-cell leukemia virus, baboon endogenous virus, Gibbon ape leukemia virus, Mason Pfizer monkey virus, simian immunodeficiency virus, simian sarcoma virus, Rous sarcoma virus and lentiviruses. Other examples of vectors are described, for example, in McVey et al., (U.S. Pat. No. 5,801,030), the teachings of which are incorporated herein by reference.
The delivery vector used in the methods described herein may be a retroviral vector. One type of retroviral vector that may be used in the methods and compositions described herein is a lentiviral vector. Lentiviral vectors (LVs), a subset of retroviruses, transduce a wide range of dividing and non-dividing cell types with high efficiency, conferring stable, long-term expression of the transgene encoding the polypeptide or RNA. An overview of optimization strategies for packaging and transducing LVs is provided in Delenda, The Journal of Gene Medicine 6: S125 (2004), the disclosure of which is incorporated herein by reference.
The use of lentivirus-based gene transfer techniques relies on the in vitro production of recombinant lentiviral particles carrying a highly deleted viral genome in which the agent of interest is accommodated. In particular, the recombinant lentiviruses are recovered through the in trans coexpression in a permissive cell line of (1) the packaging constructs, i.e., a vector expressing the Gag-Pol precursors together with Rev (alternatively expressed in trans); (2) a vector expressing an envelope receptor, generally of a heterologous nature; and (3) the transfer vector, consisting of the viral cDNA deprived of all open reading frames, but maintaining the sequences required for replication, encapsidation, and expression, in which the sequences to be expressed are inserted.
A LV used in the methods and compositions described herein may include one or more of a 5′-Long terminal repeat (LTR), HIV signal sequence, HIV Psi signal 5′-splice site (SD), delta-GAG element, Rev Responsive Element (RRE), 3′-splice site (SA), elongation factor (EF) 1-alpha promoter and 3′-self inactivating LTR (SIN-LTR). The lentiviral vector optionally includes a central polypurine tract (cPPT) and a woodchuck hepatitis virus post-transcriptional regulatory element (WPRE), as described in U.S. Pat. No. 6,136,597, the disclosure of which is incorporated herein by reference as it pertains to WPRE. The lentiviral vector may further include a pHR′ backbone, which may include, for example, the elements provided below.
The Lentigen LV described in Lu et al., Journal of Gene Medicine 6:963 (2004) may be used to express the DNA molecules and/or transduce cells. A LV used in the methods and compositions described herein may include a 5′-Long terminal repeat (LTR), HIV signal sequence, HIV Psi signal 5′-splice site (SD), delta-GAG element, Rev Responsive Element (RRE), 3′-splice site (SA), elongation factor (EF) 1-alpha promoter and 3′-self inactivating LTR (SIN-LTR). It will be readily apparent to one skilled in the art that optionally one or more of these regions is substituted with another region performing a similar function.
Enhancer elements can be used to increase expression of modified DNA molecules or increase the lentiviral integration efficiency. The LV used in the methods and compositions described herein may include a nef sequence. The LV used in the methods and compositions described herein may include a cPPT sequence which enhances vector integration. The cPPT acts as a second origin of the (+)-strand DNA synthesis and introduces a partial strand overlap in the middle of the native HIV genome. The introduction of the cPPT sequence in the transfer vector backbone strongly increases the nuclear transport and the total amount of genome integrated into the DNA of target cells. The LV used in the methods and compositions described herein may include a Woodchuck Posttranscriptional Regulatory Element (WPRE). The WPRE acts at the post-transcriptional level, by promoting nuclear export of transcripts and/or by increasing the efficiency of polyadenylation of the nascent transcript, thus increasing the total amount of mRNA in the cells. The addition of the WPRE to LV results in a substantial improvement in the level of expression from several different promoters, both in vitro and in vivo. The LV used in the methods and compositions described herein may include both a cPPT sequence and WPRE sequence. The vector may also include an IRES sequence that permits the expression of multiple polypeptides from a single promoter.
In addition to IRES sequences, other elements which permit expression of multiple polypeptides are useful. The vector used in the methods and compositions described herein may include multiple promoters that permit expression of more than one polypeptide. The vector used in the methods and compositions described herein may include a protein cleavage site that allows expression of more than one polypeptide. Examples of protein cleavage sites that allow expression of more than one polypeptide are described in Klump et al., Gene Ther. 8:811 (2001), Osborn et al., Molecular Therapy 12:569 (2005), Szymczak and Vignali, Expert Opin Biol Ther. 5:627 (2005), and Szymczak et al., Nat Biotechnol. 22:589 (2004), the disclosures of which are incorporated herein by reference as they pertain to protein cleavage sites that allow expression of more than one polypeptide. It will be readily apparent to one skilled in the art that other elements that permit expression of multiple polypeptides identified in the future are useful and may be utilized in the vectors suitable for use with the compositions and methods described herein.
The vector used in the methods and compositions described herein may be a clinical grade vector.
The viral vectors (e.g., retroviral vectors, e.g., lentiviral vectors) may include a promoter operably coupled to the transgene encoding the polypeptide or the polynucleotide encoding the RNA to control expression. The promoter may be a ubiquitous promoter. Alternatively, the promoter may be a tissue specific promoter, such as a myeloid cell-specific or hepatocyte-specific promoter. Suitable promoters that may be used with the compositions described herein include CD11b promoter, sp146/p47 promoter, CD68 promoter, sp146/gp9 promoter, elongation factor 1 α (EF1α) promoter, EF1α short form (EFS) promoter, phosphoglycerate kinase (PGK) promoter, α-globin promoter, and β-globin promoter. Other promoters that may be used include, e.g., DC172 promoter, human serum albumin promoter, alpha1 antitrypsin promoter, and thyroxine binding globulin promoter. The DC172 promoter is described in Jacob et al., Gene Ther. 15:594-603 (2008), hereby incorporated by reference in its entirety.
The viral vectors (e.g., retroviral vectors, e.g., lentiviral vectors) may include an enhancer operably coupled to the transgene encoding the polypeptide or the polynucleotide encoding the RNA to control expression. The enhancer may include a β-globin locus control region (βLCR).
Methods of Measuring nORF Gene Expression
Preferably, the compositions and methods of the disclosure are used to facilitate expression of a nORF at physiologically normal levels in a patient (e.g., a human patient), decrease expression of an upregulated nORF, or increase expression of a downregulated nORF. The therapeutic agents of the disclosure, for example, may reduce the dysregulated nORF expression in a human subject. For example, the therapeutic agents of the disclosure may reduce dysregulated nORF expression, e.g., by about 1%, 2%, 3%, 4%, 5%, 10%, 15%, 20%, 25%, 30%, 35%, 40%, 45%, 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 95%, 97%, 98%, or 99%. Alternatively, the therapeutic agents of the disclosure may increase the dysregulated nORF expression in a human subject. For example, the therapeutic agents of the disclosure may increase dysregulated nORF expression, e.g., by at least 5%, 10%, 15%, 20%, 25%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 100%, 200%, 300%, 400%, 500%, or more.
The expression level of the nORF expressed in a patient can be ascertained, for example, by evaluating the concentration or relative abundance of mRNA transcripts derived from transcription of the nORF. Additionally, or alternatively, expression can be determined by evaluating the concentration or relative abundance of the nORF following transcription and/or translation of an inhibitor that decreases an amount of the dysregulated nORF. Protein concentrations can also be assessed using functional assays, such as MDP detection assays. Expression can be evaluated by a number of methodologies known in the art, including, but not limited to, nucleic acid sequencing, microarray analysis, proteomics, in situ hybridization (e.g., fluorescence in situ hybridization (FISH)), amplification-based assays, fluorescence activated cell sorting (FACS), northern analysis, and/or PCR analysis of mRNAs.
Nucleic acid-based methods for detecting expression (e.g., of an RNA inhibitor or an RNA encoding the nORF) that may be used in conjunction with the compositions and methods described herein include imaging-based techniques (e.g., Northern blotting or Southern blotting). Such techniques may be performed using cells obtained from a patient following administration of the polynucleotide encoding the agent. Northern blot analysis is a conventional technique well known in the art and is described, for example, in Molecular Cloning, a Laboratory Manual, second edition, 1989, Sambrook, Fritsch, Maniatis, Cold Spring Harbor Press, 10 Skyline Drive, Plainview, NY 11803-2500. Typical protocols for evaluating the status of genes and gene products are found, for example, in Ausubel et al., eds., 1995, Current Protocols in Molecular Biology, Units 2 (Northern Blotting), 4 (Southern Blotting), 15 (Immunoblotting) and 18 (PCR Analysis).
Detection techniques that may be used in conjunction with the compositions and methods described herein to evaluate nORF expression further include sequencing experiments (e.g., Sanger sequencing and next-generation sequencing methods, also known as high-throughput sequencing or deep sequencing). Exemplary next generation sequencing technologies include, without limitation, Illumina sequencing, Ion Torrent sequencing, 454 sequencing, SOLiD sequencing, and nanopore sequencing platforms. Additional methods of sequencing known in the art can also be used. For instance, expression at the mRNA level may be determined using RNA-Seq (e.g., as described in Mortazavi et al., Nat. Methods 5:621-628 (2008), the disclosure of which is incorporated herein by reference in its entirety). RNA-Seq is a robust technology for monitoring expression by directly sequencing the RNA molecules in a sample. Briefly, this methodology may involve fragmentation of RNA to an average length of 200 nucleotides, conversion to cDNA by random priming, and synthesis of double-stranded cDNA (e.g., using the Just cDNA DoubleStranded cDNA Synthesis Kit from Agilent Technology). Then, the cDNA is converted into a molecular library for sequencing by addition of sequence adapters for each library (e.g., from Illumina®/Solexa), and the resulting 50-100 nucleotide reads are mapped onto the genome.
Expression levels of the nORF may be determined using microarray-based platforms (e.g., single-nucleotide polymorphism arrays), as microarray technology offers high resolution. Details of various microarray methods can be found in the literature. See, for example, U.S. Pat. No. 6,232,068 and Pollack et al., Nat. Genet. 23:41-46 (1999), the disclosures of each of which are incorporated herein by reference in their entirety. Using nucleic acid microarrays, mRNA samples are reverse transcribed and labeled to generate cDNA. The probes can then hybridize to one or more complementary nucleic acids arrayed and immobilized on a solid support. The array can be configured, for example, such that the sequence and position of each member of the array is known. Hybridization of a labeled probe with a particular array member indicates that the sample from which the probe was derived expresses that gene. Expression level may be quantified according to the amount of signal detected from hybridized probe-sample complexes. A typical microarray experiment involves the following steps: 1) preparation of fluorescently labeled target from RNA isolated from the sample, 2) hybridization of the labeled target to the microarray, 3) washing, staining, and scanning of the array, 4) analysis of the scanned image and 5) generation of gene expression profiles. One example of a microarray processor is the Affymetrix GENECHIP® system, which is commercially available and comprises arrays fabricated by direct synthesis of oligonucleotides on a glass surface. Other systems may be used as known to one skilled in the art.
Amplification-based assays also can be used to measure the expression level of the nORF or RNA in a target cell following delivery to a patient. In such assays, the nucleic acid sequences of the gene act as a template in an amplification reaction (for example, PCR, such as qPCR). In a quantitative amplification, the amount of amplification product is proportional to the amount of template in the original sample. Comparison to appropriate controls provides a measure of the expression level of the gene, corresponding to the specific probe used, according to the principles described herein. Methods of real-time qPCR using TaqMan probes are well known in the art. Detailed protocols for real-time qPCR are provided, for example, in Gibson et al., Genome Res. 6:995-1001 (1996), and in Heid et al., Genome Res. 6:986-994 (1996), the disclosures of each of which are incorporated herein by reference in their entirety. Levels of gene expression as described herein can be determined by RT-PCR technology. Probes used for PCR may be labeled with a detectable marker, such as, for example, a radioisotope, fluorescent compound, bioluminescent compound, a chemiluminescent compound, metal chelator, or enzyme.
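By way of illustration only, the comparison to controls described above can be sketched with the commonly used 2^-ΔΔCt calculation. This particular calculation is not specified in the present disclosure; the function name, the use of a reference transcript for normalization, and the assumption of roughly 100% amplification efficiency are all illustrative assumptions.

```python
def relative_expression(ct_target_sample, ct_ref_sample,
                        ct_target_control, ct_ref_control):
    """Fold change of a target transcript in a sample versus a control.

    Hypothetical helper using the 2^-ddCt calculation. Assumes the amount
    of product doubles each cycle (about 100% amplification efficiency).
    """
    # Normalize the target Ct to a reference transcript within each sample.
    delta_sample = ct_target_sample - ct_ref_sample
    delta_control = ct_target_control - ct_ref_control
    # Compare the normalized values between sample and control.
    delta_delta = delta_sample - delta_control
    return 2.0 ** (-delta_delta)


if __name__ == "__main__":
    # Target crosses threshold two cycles earlier in the sample than in the control.
    print(relative_expression(24.0, 18.0, 26.0, 18.0))
```

A value above 1 would indicate higher expression in the sample than in the control, and a value below 1 would indicate lower expression, under the stated assumptions.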
Expression of the nORF can additionally be determined by measuring the concentration or relative abundance of a corresponding protein product (e.g., as compared to the nORF in a subject without schizophrenia or bipolar disorder or the dysregulated nORF). Protein levels can be assessed using standard detection techniques known in the art. Protein expression assays suitable for use with the compositions and methods described herein include proteomics approaches, immunohistochemical and/or western blot analysis, immunoprecipitation, molecular binding assays, ELISA, enzyme-linked immunofiltration assay (ELIFA), mass spectrometry, mass spectrometric immunoassay, and biochemical enzymatic activity assays. Proteomics methods can be used to generate large-scale protein expression datasets in multiplex. Proteomics methods may utilize mass spectrometry to detect and quantify polypeptides (e.g., proteins) and/or peptide microarrays utilizing capture reagents (e.g., antibodies) specific to a panel of target proteins to identify and measure expression levels of proteins expressed in a sample (e.g., a single cell sample or a multi-cell population).
Exemplary peptide microarrays have a substrate-bound plurality of polypeptides, the binding of an oligonucleotide, a peptide, or a protein to each of the plurality of bound polypeptides being separately detectable. Alternatively, the peptide microarray may include a plurality of binders, including, but not limited to, monoclonal antibodies, polyclonal antibodies, phage display binders, yeast two-hybrid binders, aptamers, which can specifically detect the binding of specific oligonucleotides, peptides, or proteins. Examples of peptide arrays may be found in U.S. Pat. Nos. 6,268,210, 5,766,960, and 5,143,854, the disclosures of each of which are incorporated herein by reference in their entirety.
Mass spectrometry (MS) may be used in conjunction with the methods described herein to identify and characterize expression of the nORF in a cell from a patient (e.g., a human patient) following delivery of the transgene encoding the nORF. Any method of MS known in the art may be used to determine, detect, and/or measure a protein or peptide fragment of interest, e.g., LC-MS, ESI-MS, ESI-MS/MS, MALDI-TOF-MS, MALDI-TOF/TOF-MS, tandem MS, and the like. Mass spectrometers generally contain an ion source and optics, mass analyzer, and data processing electronics. Mass analyzers, including scanning and ion-beam mass spectrometers, such as time-of-flight (TOF) and quadrupole (Q), and trapping mass spectrometers, such as ion trap (IT), Orbitrap, and Fourier transform ion cyclotron resonance (FT-ICR), may be used in the methods described herein. Details of various MS methods can be found in the literature. See, for example, Yates et al., Annu. Rev. Biomed. Eng. 11:49-79, 2009, the disclosure of which is incorporated herein by reference in its entirety.
Prior to MS analysis, proteins in a sample obtained from the patient can be first digested into smaller peptides by chemical (e.g., via cyanogen bromide cleavage) or enzymatic (e.g., trypsin) digestion. Complex peptide samples also benefit from the use of front-end separation techniques, e.g., 2D-PAGE, HPLC, RPLC, and affinity chromatography. The digested, and optionally separated, sample is then ionized using an ion source to create charged molecules for further analysis. Ionization of the sample may be performed, e.g., by electrospray ionization (ESI), atmospheric pressure chemical ionization (APCI), photoionization, electron ionization, fast atom bombardment (FAB)/liquid secondary ionization (LSIMS), matrix assisted laser desorption/ionization (MALDI), field ionization, field desorption, thermospray/plasmaspray ionization, and particle beam ionization. Additional information relating to the choice of ionization method is known to those of skill in the art.
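As a non-limiting illustration of the enzymatic digestion step, the following sketch performs an in silico tryptic digest of a protein sequence. It assumes the simplified and commonly used rule that trypsin cleaves on the C-terminal side of lysine (K) or arginine (R) unless the next residue is proline (P); missed cleavages are not modeled, and the function name is hypothetical.

```python
def tryptic_digest(protein):
    """Split a one-letter amino acid sequence into tryptic peptides.

    Simplified rule (an assumption of this sketch): cleave after K or R,
    except when the following residue is P. Missed cleavages are ignored.
    """
    peptides = []
    start = 0
    for i, residue in enumerate(protein):
        is_last = i == len(protein) - 1
        if residue in "KR" and not is_last and protein[i + 1] != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    # Keep whatever remains after the final cleavage site.
    if start < len(protein):
        peptides.append(protein[start:])
    return peptides


if __name__ == "__main__":
    print(tryptic_digest("MKTAYRPLGKAA"))
```

The resulting peptide list is the kind of input that could be compared against observed spectra; the sequence shown is an arbitrary example and not a sequence of the disclosure.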
After ionization, digested peptides may then be fragmented to generate signature MS/MS spectra. Tandem MS, also known as MS/MS, may be particularly useful for analyzing complex mixtures. Tandem MS involves multiple steps of MS selection, with some form of ion fragmentation occurring in between the stages, which may be accomplished with individual mass spectrometer elements separated in space or using a single mass spectrometer with the MS steps separated in time. In spatially separated tandem MS, the elements are physically separated and distinct, with a physical connection between the elements to maintain high vacuum. In temporally separated tandem MS, separation is accomplished with ions trapped in the same place, with multiple separation steps taking place over time. Signature MS/MS spectra may then be compared against a peptide sequence database (e.g., SEQUEST). Post-translational modifications to peptides may also be determined, for example, by searching spectra against a database while allowing for specific peptide modifications.
The present invention contemplates treatment of schizophrenia or bipolar disorder in which a nORF exhibits increased or decreased expression, e.g., relative to a subject without schizophrenia or bipolar disorder. Schizophrenia is a mental illness that affects how a person thinks, feels, and behaves. Bipolar disorder, also known as manic depression, is a mental illness that brings severe high and low moods and changes in sleep, energy, thinking, and behavior.
The method may decrease or slow (e.g., by at least 5%, 10%, 15%, 20%, 25%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 95%, 97%, or 99%) the progression of schizophrenia or bipolar disorder. The method may decrease (e.g., by at least 5%, 10%, 15%, 20%, 25%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 95%, 97%, or 99%) the risk of developing schizophrenia or bipolar disorder.
In some embodiments, the nORF is selected from Table 4.
In some embodiments, the nORF has at least 85% (e.g., at least 90%, 95%, 97%, 98%, 99%, or 100%) sequence identity to any one of SEQ ID NOs: 1-21 or a fragment thereof.
In some embodiments, the nORF has the sequence of any one of SEQ ID NOs: 1-21.
In some embodiments, the disease is bipolar disorder, and the nORF is selected from Table 7. In some embodiments, the nORF has at least 85% (e.g., at least 90%, 95%, 97%, 98%, 99%, or 100%) sequence identity to any one of SEQ ID NOs: 124-163 or a fragment thereof. In some embodiments, the nORF has the sequence of any one of SEQ ID NOs: 124-163.
In some embodiments, the disease is bipolar disorder, and the nORF is selected from Table 8. In some embodiments, the nORF has at least 85% (e.g., at least 90%, 95%, 97%, 98%, 99%, or 100%) sequence identity to any one of SEQ ID NOs: 164-207 or a fragment thereof. In some embodiments, the nORF has the sequence of any one of SEQ ID NOs: 164-207.
In some embodiments, the disease is schizophrenia, and the nORF is selected from Table 9. In some embodiments, the nORF has at least 85% (e.g., at least 90%, 95%, 97%, 98%, 99%, or 100%) sequence identity to any one of SEQ ID NOs: 208-263 or a fragment thereof. In some embodiments, the nORF has the sequence of any one of SEQ ID NOs: 208-263.
In some embodiments, the disease is schizophrenia, and the nORF is selected from Table 10. In some embodiments, the nORF has at least 85% (e.g., at least 90%, 95%, 97%, 98%, 99%, or 100%) sequence identity to any one of SEQ ID NOs: 264-324 or a fragment thereof. In some embodiments, the nORF has the sequence of any one of SEQ ID NOs: 264-324.
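For illustration, a percent sequence identity threshold such as the 85% recited above could be checked as sketched below. The sketch assumes the two sequences have already been aligned to equal length with gaps written as "-", and it defines identity as matching positions divided by alignment length; other definitions of identity exist, and the function names are hypothetical.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identity between two pre-aligned sequences of equal length.

    Assumption of this sketch: identity = identical non-gap columns divided
    by the total number of alignment columns.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_a:
        raise ValueError("empty alignment")
    matches = sum(1 for a, b in zip(aligned_a, aligned_b)
                  if a == b and a != "-")
    return 100.0 * matches / len(aligned_a)


def meets_identity_threshold(aligned_a, aligned_b, threshold=85.0):
    """Return True if the percent identity is at least the threshold."""
    return percent_identity(aligned_a, aligned_b) >= threshold


if __name__ == "__main__":
    # Arbitrary example sequences, not sequences of the disclosure.
    print(percent_identity("MKTAYIAKQR", "MKTAYIAKQA"))
```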
In some embodiments, the nORF is associated with a transposable element. For example, the nORF may have a positive or negative correlation with a transposable element.
In some embodiments, the nORF is associated with a human accelerated region (HAR). For example, the nORF may have a positive or negative correlation with the HAR.
The following examples further illustrate the invention but should not be construed as in any way limiting its scope.
Although the heritability of both schizophrenia (SCZ) and bipolar disorder (BD) is approximately 70%, placing them among the most heritable mental health disorders, the corresponding polygenic risk scores explain only a fraction of genetic disease liability, for example 7% in SCZ relative to 64-81% heritability derived from family and twin studies. Moreover, putative individual genome-wide association study (GWAS) risk alleles account for only a marginal increase in disease risk, with odds ratios typically under 1.1, and differences in allele frequencies between cases and controls are often less than 2%. SCZ and BD, therefore, pose an evolutionary-genetic paradox because they exhibit strong negative fitness effects and high heritability, yet they persist at a prevalence of approximately 1% across all human cultures.
We set out to investigate whether novel open reading frames (nORFs) that have recently evolved or have been associated with Human Accelerated Regions (HARs) could shed light on the disease mechanism. HARs are genomic segments that are highly conserved among nonhuman species but experienced accelerated substitutions in the human genome. Many HARs are found in the introns of, and adjacent to, genes annotated with gene ontology (GO) terms related to transcription and DNA binding. We curated a list of 4,481 unique HARs split into three groups based on the extent of their conservation and verified that they are present in all chromosomes (
The human-centric nature of HARs led to investigations into their link with SCZ. pHARs were found to be enriched in SCZ-associated loci and pHAR-associated SCZ genes were found to be under stronger selection pressure than other SCZ genes. Additionally, mutations in HARs have been found to contribute to altered cognitive behavior, suggesting importance in neural function. However, HARs have not been systematically examined in any of the psychiatric diseases. The PGC meta-analysis provides a novel opportunity to investigate systematically the role of HARs in SCZ.
Another group of genomic features that regulate gene expression are transposable elements (TEs). TEs come in two classes. Class I are retrotransposons, consisting of long terminal repeats (LTRs), which include human endogenous retroviruses, and non-LTRs, which include long interspersed nuclear elements (LINEs), short interspersed nuclear elements (SINEs) and SINE/VNTR/Alu elements (SVAs). Enhancers can arise through the insertion of TEs; therefore, it is feasible that some HARs arose through TE insertion. TEs can be a source of non-coding RNAs and can act as insulator or boundary elements, splitting the genome into 100 kb-1 Mb domains of active and inactive transcription by preventing the spread of heterochromatin. Indeed, many TEs (especially SINEs) harbor binding sites for factors (CTCF, TFIIIC) that confer insulator activity and organize nuclear architecture. Furthermore, chromatin-based repression of TEs impacts expression of nearby loci; when said repression fails, neighboring loci may be expressed together with the corresponding TEs.
HARs and TEs are two classes of genomic regions that can play a role in the regulation of nearby genes. However, little attention has been placed on non-coding regions, especially nORFs and their transcriptional and translational end products in relation to SCZ and BD. nORFs are present in both coding and non-coding regions of the genome and may be biologically regulated.
We hypothesize that components of the genetic architecture of SCZ and BD are attributable to human lineage-specific evolution, and that these components were not previously discovered because of the use of a conservative definition of a gene and because genomic, transcriptomic, and proteomic data were analyzed in silos. To investigate this, we performed a genome-wide evolutionary assessment of the overlap between nORFs present in HARs and in SCZ and BD-associated loci.
We systematically mined SCZ and BD datasets from the PsychENCODE consortium to detect the expression of nORFs with evidence of translation, which we used to make a nORF database. Following that, we assessed the relationship and association between differentially expressed (DE) nORFs and HARs and TEs, and their enrichment in SCZ and BD associated loci, with the goal of identifying DE nORFs present in pHARs associated with SCZ and BD loci. We also investigated the correlation of HAR or TE transcript expression and nORF transcript expression to identify any potential regulation. In addition, for a smaller subset of samples, we were able to show evidence of translation of nORFs. Finally, we predicted structures for some of the nORFs implicated in both disorders to demonstrate that they may serve as novel drug targets. Thus, this work highlights interesting molecular mechanisms that have been previously missed, and we anticipate that this will lead to novel treatments.
Creation of nORF dataset
We used nORFs obtained from two sources: nORFs.org and RPFdbv2.0; however, nORFs from RPFdbv2.0 were further processed. Briefly, expression of nORFs was compared to that of canonical ORFs (cORFs) from 53 studies (with 353 samples), downloaded from RPFdbv2.0 across 11 human cell lines. The 353 samples were divided into 11 groups based on cell types. Actively translated ORFs with clear sub-codon phasing or triplet periodicity footprints were detected using the RibORF tool for each study. Further, each ORF entry was appended with its corresponding annotations: genomic position, strand, ORF category (one of: canonical, truncated, extended, uORF, overlapping uORF, internal, external, polycistronic, readthrough, non-coding transcripts), length of the encoded amino acid sequence, ribosome profiling abundance (RPKMs, raw read counts) and the transcript to which the ORF maps (probable transcript from which the ORF is translated). Raw read count abundance for each ORF was then converted to transcripts per million (TPM) values for downstream analysis.
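The conversion of raw read counts to TPM can be sketched as follows, using the standard TPM definition (length-normalized counts rescaled to sum to one million). The exact length measure and implementation used in this analysis are not stated here, so the function name and the use of ORF length in nucleotides are assumptions of the sketch.

```python
def counts_to_tpm(counts, lengths):
    """Convert raw read counts to TPM.

    counts:  dict mapping ORF id -> raw read count
    lengths: dict mapping ORF id -> ORF length (assumed here to be in nucleotides)
    """
    # Reads per unit length for each ORF.
    rates = {orf: counts[orf] / lengths[orf] for orf in counts}
    total = sum(rates.values())
    if total == 0:
        return {orf: 0.0 for orf in counts}
    # Rescale so that all values in the sample sum to one million.
    return {orf: rate / total * 1e6 for orf, rate in rates.items()}


if __name__ == "__main__":
    # Hypothetical ORF ids and values for illustration.
    print(counts_to_tpm({"orf_a": 10, "orf_b": 20}, {"orf_a": 100, "orf_b": 100}))
```

Because TPM values sum to the same total in every sample, they can be compared across samples more directly than raw counts.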
Mean and standard deviation (SD) of Ribo-seq expression TPMs for all 353 samples in each of the 11 groups were compared between the canonical and the non-canonical ORFs. Mean values were divided into exactly 4000 quantiles, with every quantile containing the same number of ORFs. Within each quantile, the SDs were compared between nORFs and cORFs, which consequently had similar means. ORFs with SDs less than the median SD of cORFs were termed low noise ORFs. 101,797 such low noise nORF entries were added to nORFdbV1; duplicates were then removed, and classification was performed. Any nORF classified as in-frame to the CDS of a cORF was removed, except when an annotation such as readthrough, extension, uORF or truncation was determined using the RibORF tool, leading to a final total of 248,135 nORF entries. Bedtools getfasta was used to extract the corresponding nucleotide sequence for the new nORF entries using the GRCh38 DNA primary assembly (ftp://ftp.ensembl.org/pub/release-96/fasta/homo_sapiens/dna/) with parameters “name”, “s” and “tab” specified. nORF sequences identified using RibORF were translated using the Biostrings package in R, and the translations were appended to the curated amino acid sequences of nORFs. Results of this analysis are illustrated in
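The low noise selection can be sketched as below. The sketch assumes that the median cORF SD is computed separately within each quantile bin and that bins without any cORF are skipped; the text does not state either detail, and the function name and record layout are hypothetical.

```python
from statistics import median


def low_noise_norfs(orfs, n_bins):
    """Return ids of nORFs whose SD is below the median cORF SD of their bin.

    orfs: list of dicts with keys 'id', 'kind' ('cORF' or 'nORF'),
          'mean' and 'sd' (per-ORF expression statistics).
    """
    ranked = sorted(orfs, key=lambda o: o["mean"])
    n = len(ranked)
    selected = []
    for b in range(n_bins):
        # Equal-sized bins over ORFs ranked by mean expression.
        bin_orfs = ranked[b * n // n_bins:(b + 1) * n // n_bins]
        corf_sds = [o["sd"] for o in bin_orfs if o["kind"] == "cORF"]
        if not corf_sds:
            continue  # assumption: skip bins with no canonical reference
        cutoff = median(corf_sds)
        selected.extend(o["id"] for o in bin_orfs
                        if o["kind"] == "nORF" and o["sd"] < cutoff)
    return selected
```

Binning by mean before comparing SDs keeps the comparison between ORFs of similar expression level, since variability typically scales with expression.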
Identification of nORF Transcripts in PsychENCODE Dataset
We chose three of the eight studies that are part of the PsychENCODE consortium, namely BrainGVEX, CMC and CMC_HBCC, for analysis. These three studies were selected based on availability of total RNA-seq data from SCZ, BD and control (CNT) adult post-mortem brain samples. The total number of samples used in the analysis was 1,340: 731 CNT, 428 SCZ, and 188 BD. The processed BAM files and RNA-Seq by Expectation-Maximization (RSEM) count files are available under freeze 1 and freeze 2 of the PsychENCODE Consortium. Briefly, CNT, SCZ and BD samples were isolated from the DLPFC, primarily BA9 and BA46, as part of 8 different studies. For analysis, RNA-Seq results of three studies (CMC_HBCC, CommonMind and BrainGVEX), with CNT, SCZ and BD brain samples, were used. RNA-Seq reads were aligned to the hg19 reference genome using STAR 2.4.2a. Gene- and isoform-level quantifications were performed using RSEM v1.2.29.
Correlation analysis of gene expression between samples showed higher correlation between samples from the same study group than between samples from different study groups (
Identification of Transcribed nORFs in PsychENCODE Dataset
GRCh37-based transcript and gene coordinates for 1,340 neuropsychiatric samples from the BrainGVEX, CMC and CMC_HBCC studies were obtained from the PsychENCODE consortium. Transcript expression values were filtered to retain transcripts with TPM>0.1 in at least 10% of the samples. Additionally, transcripts from the Y-chromosome pseudoautosomal regions (PAR) were removed. GffCompare (v0.11.5) mapping was performed between the nORF and sample transcript coordinates. The results file was further filtered as specified in github.com/PrabakaranGroup/norfs_in_neuropsychiatric_disorders. Transcripts containing nORFs with biotype not equal to ‘protein-coding’ were retained.
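The expression filter described above (TPM>0.1 in at least 10% of the samples) can be sketched as follows. The function name and the dictionary layout are hypothetical and chosen for illustration.

```python
def filter_expressed(tpm_by_transcript, min_tpm=0.1, min_fraction=0.10):
    """Keep transcripts with TPM > min_tpm in at least min_fraction of samples.

    tpm_by_transcript: dict mapping transcript ID to a list of per-sample TPMs.
    """
    kept = {}
    for transcript_id, values in tpm_by_transcript.items():
        if not values:
            continue
        n_pass = sum(1 for value in values if value > min_tpm)
        if n_pass / len(values) >= min_fraction:
            kept[transcript_id] = values
    return kept
```

Note that the comparison with `min_tpm` is strict, matching "TPM>0.1", whereas the sample fraction uses "at least".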
Identification of Differentially Expressed nORFs
To identify underlying covariates that could affect the DE analysis between SCZ, BD, and CNT, we used multivariate adaptive regression splines (MARS) and surrogate variable analysis (SVA) using the earth and sva packages, respectively, in R. Sample transcript count values generated using RSEM were normalized using the trimmed mean of M-values (TMM) method with edgeR. The earth model, with linpreds set to true, was run 1000 times, and covariates identified in at least half of the runs were retained. seqPC1-3, seqPC5-7, seqPC10-14, seqPC16, seqPC18-25, seqPC27-29, RIN, RIN.squared, age, batch and individualIDSource were identified as covariates, which were then accounted for during differential expression (DE) analysis.
DE analysis was performed through a linear mixed effects model using the nlme package in R, with the above set as fixed effects and individual ID as random effect. EdgeR TMM normalized and log2 (CPM (expression)+0.5) counts were analyzed for DE between CNT and BD and between CNT and SCZ. Transcripts identified as differentially expressed at an FDR<0.05 after Benjamini-Hochberg correction of the associated p-values were further evaluated for nORF presence using the GffCompare workflow, as described.
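The Benjamini-Hochberg correction used to control the FDR can be sketched as follows. This is a generic implementation of the step-up procedure, not the code used in the analysis, and the function name is illustrative.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

A transcript would then be called differentially expressed when its adjusted p-value is below 0.05.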
Potential Functional Inferences of nORFs from Amino Acid Sequence
For the 248,135 curated nORFs, GO terms were obtained from equivalent InterPro IDs generated using InterProScan5 run on the galaxy server. Of the total input, 27,430 nORFs with a total of 62,700 corresponding GO terms were identified. Further analysis revealed that of the 3,103 nORFs identified as transcribed in SCZ and BD samples, 49 nORFs had associated GO terms. Similarly, 2 out of 44 and 13 out of 61 DE nORFs in BD and SCZ, respectively, had corresponding GO terms. For the translated nORFs, 17 out of 21 had GO terms. GO term enrichment for each of these nORF categories was conducted using the GOEnrichment tool on the galaxy server. The required .OBO file for this run was obtained from www.obofoundry.org/ontology/go.html. Analysis was conducted at a p-value cut-off of 0.01 with Benjamini-Hochberg multiple testing correction enabled.
Enrichment Analysis of DE nORFs within SCZ and BD Loci
We evaluated the presence and enrichment of DE transcribed nORFs within SCZ and BD associated loci using the annotation and enrichment tool GLANET, which uses random sampling to calculate enrichment of genomic elements within the input query. In addition, we investigated enrichment of certain DNase I hypersensitive sites (DHS1), histone modifications and transcription factors (TFs) within the transcribed and DE nORF cohort. SCZ associated high confidence regions were obtained from the PsychENCODE resource (resource.psychencode.org/). For BD, associated loci coordinates were taken from www.nature.com/articles/s41588-019-0397-8#Sec2. SCZ CNVs were curated from pubmed.ncbi.nlm.nih.gov/29687944/.
4,481 unique HARs were compiled. The genomic coordinates of HARs were mapped to hg19/GRCh37 genome assembly where required, using the LiftOver tool (
Association of nORFs with HARs
3,103 nORFs were identified as transcribed in the BrainGVEX, CMC and CMC_HBCC neuropsychiatric samples. These nORFs are defined to be associated with a HAR if the HAR overlapped the nORF or regions extending 100 kb upstream or downstream of the nORF. This association distance is in accordance with previous work, although a previous study looked at association within 1 kb and another study found that 52% of non-coding HARs examined in the study are located within 1 Mb of a developmental gene and 59% are within 1 Mb of a gene DE between humans and chimpanzees. A nORF associated with a HAR is referred to as a nORF-HAR.
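The nORF-HAR association rule (overlap with the nORF or with the 100 kb regions flanking it) can be sketched as follows. Coordinates are assumed here to be closed (inclusive) intervals on the same assembly, and the function name is hypothetical.

```python
def is_norf_har(norf, har, flank=100_000):
    """Return True if the HAR overlaps the nORF extended by `flank` bp each side.

    norf, har: (chrom, start, end) tuples with inclusive coordinates.
    """
    norf_chrom, norf_start, norf_end = norf
    har_chrom, har_start, har_end = har
    if norf_chrom != har_chrom:
        return False
    # Two closed intervals overlap when each starts before the other ends.
    return har_start <= norf_end + flank and har_end >= norf_start - flank
```

Extending the nORF by the flank and testing for a plain overlap covers all three cases (HAR inside, upstream or downstream of the nORF) with one comparison.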
SCZ-associated SNPs and BD-associated SNPs were stratified by p-value (p<10−2; p<10−3; p<10−4; p<10−5; p<10−6; p<10−7), as shown in Table 1, below.
To summarize linkage-disequilibrium (LD)-dependent associations between SNPs, these sets of SNPs were clumped in PLINK 1.9 using LD-based clumping and data from the 1000 Genomes Project's EUR population (The 1000 Genomes Project Consortium, 2015). Clumping produces LD-independent sets (‘clumps’) of SNPs, each of which comprises an index SNP with the highest association and SNPs in high LD with that index SNP. Parameters were chosen to retain SNPs in association with index SNPs with p<0.0001 and r2<0.1 within 3 Mb windows. Due to very high LD within the MHC region, only the most median index SNP and its associated clump were kept from the MHC region. The MHC region was defined as chr6:28,477,797-33,448,354 on the hg19 genome assembly.
The genomic coordinates for disorder-associated loci were found using the index SNPs and the ‘LD-calculations’ procedure in PLINK 1.9. Data from the 1000 Genomes Project's EUR population (The 1000 Genomes Project Consortium, 2015) were used to remove index SNPs not in Hardy-Weinberg equilibrium (p<0.0001) or those with a minor allele frequency less than 0.05. Disorder-associated SNP loci were then defined such that SNPs within loci were associated with index SNPs with r2>0.5 and were within 250 kb of an index SNP. The number of SCZ-associated SNP loci was markedly greater than that of BD-associated SNP loci. In both disorders, the number of disorder-associated SNP loci is relatively constant for higher p-value stratifications, decreasing after p<10−5.
Enrichment of nORF-HARs with Disorder-Associated SNP Loci
To determine whether nORFs associated with HARs, especially those differentially expressed, are enriched within disorder-associated loci, an enrichment test was performed using INRICH. This was used as it accounts for SNP density as well as overlapping genes (nORFs in this case). The sets of loci used were those generated in the previous section.
The analysis was carried out for the full set of nORF-HARs, as well as the subsets of nORFs associated with vHARs, mHARs or pHARs. Although INRICH is usually used for analysis of genes, it can also be used for analysis of nORFs. INRICH requires four files: an interval file, which contained the loci-defining genomic coordinates for disorder-associated SNP loci and the ‘rs’ IDs of the loci's index SNPs; an interval map file, which contained the genomic coordinates for and the ‘rs’ IDs of the loci's index SNPs; a target set file, which contained the genomic coordinates and IDs of the nORF-HARs; and a reference gene file, which contained the genomic coordinates of the 3,103 nORFs expressed in the neuropsychiatric samples. Since no SNPs from the genome-wide association studies (GWAS) were present on the sex chromosomes, all nORFs on the sex chromosomes were removed before analysis. INRICH merges any overlapping nORFs before processing. Empirical p-values for enrichment are then calculated through a first round of 5000 permutations. A second round of 5000 permutations corrects for multiple testing and accounts for gene length to give corrected p-values.
3,987,910 TEs throughout the human genome were identified using RepeatMasker (repeatmasker.org/). All coordinates were already based on the hg19 genome assembly. TEs that overlapped were merged, resulting in 3,863,891 unique TEs.
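Merging overlapping TE annotations into unique intervals can be sketched as follows. This is a generic sweep over sorted intervals; it assumes inclusive coordinates and merges only intervals that share at least one base, which may differ from the tool actually used.

```python
def merge_overlapping(intervals):
    """Merge overlapping (chrom, start, end) intervals with inclusive coordinates."""
    merged = []
    # Sorting by chromosome and start lets one pass find all overlaps.
    for chrom, start, end in sorted(intervals):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            prev_chrom, prev_start, prev_end = merged[-1]
            merged[-1] = (prev_chrom, prev_start, max(prev_end, end))
        else:
            merged.append((chrom, start, end))
    return merged
```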
Association of nORFs with TEs
nORFs are defined to be associated with a TE if the TE overlapped the 2 kb region upstream of the nORF, but not the nORF itself. Association between DE nORFs and TEs was investigated to gain insight into the impact of TEs on nearby nORF expression via correlation analysis of expression.
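The nORF-TE association rule can be sketched as follows. The sketch assumes that "upstream" is interpreted with respect to the strand of the nORF and that coordinates are inclusive; neither detail is stated in the text, and the function names are hypothetical.

```python
def upstream_window(start, end, strand, size=2000):
    """Return the inclusive (start, end) of the region upstream of an nORF.

    Assumption: upstream is strand-aware (5' of the nORF on its own strand).
    """
    if strand == "+":
        return (max(1, start - size), start - 1)
    return (end + 1, end + size)


def te_is_associated(norf, te, size=2000):
    """True if the TE overlaps the upstream window but not the nORF itself.

    norf: (chrom, start, end, strand); te: (chrom, start, end).
    """
    chrom, start, end, strand = norf
    te_chrom, te_start, te_end = te
    if chrom != te_chrom:
        return False
    overlaps_norf = te_start <= end and te_end >= start
    win_start, win_end = upstream_window(start, end, strand, size)
    overlaps_window = te_start <= win_end and te_end >= win_start
    return overlaps_window and not overlaps_norf
```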
A set of transcripts DE between SCZ and BD and controls in the PsychENCODE datasets was identified as mentioned above. HARs and TEs that were included in or overlapped with these differentially expressed transcripts were designated DE HARs (differentially expressed HARs) and DE TEs (differentially expressed TEs), respectively. DE TE expression was normalized using the TMM normalization procedure as provided in edgeR v3.30.3.
Correlation of Expression Between DE nORFs and their Associated DE TEs
Spearman and Pearson correlation coefficients and their corresponding p-values were calculated for the normalized counts for each DE nORF-DE TE combination (each DE nORF may be associated with many DE TEs). Expression of a DE nORF and its associated DE TE within a DE nORF-DE TE combination was defined to be significantly correlated if the absolute Spearman and Pearson correlation coefficients were above 0.5 and significant (p<0.05) for the DE nORF-DE TE combination.
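The two correlation coefficients and the decision rule above can be sketched as follows. The coefficients are computed from first principles for illustration; the p-values are taken as inputs because their computation is not specified in the text, and all function names are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)  # undefined if either input is constant


def average_ranks(values):
    """Ranks starting at 1, with ties given the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def significantly_correlated(r_pearson, p_pearson, r_spearman, p_spearman,
                             min_abs=0.5, alpha=0.05):
    """Apply the rule: both |coefficients| > 0.5 and both p-values < 0.05."""
    return (abs(r_pearson) > min_abs and abs(r_spearman) > min_abs
            and p_pearson < alpha and p_spearman < alpha)
```

Requiring both coefficients to pass guards against calling a pair correlated on the basis of a few outlying samples (which can inflate Pearson) or a monotone but weak trend (which can inflate Spearman).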
Proteogenomic Analysis to Demonstrate Translation of Transcribed nORFs
Proteogenomic analysis to demonstrate evidence of translation of the transcribed nORFs was performed using the amino acid sequence of all the 248,135 nORFs or transcripts assembled from a subset of PsychENCODE samples, which are part of Stanley Medical Research Institute (SMRI) Array Collection. For this subset, we had matching raw transcriptomic and proteomic data; however, from different (adjacent) regions of prefrontal cortex (BA46 and BA10 respectively).
Analysis of Transcripts from SMRI Array Collection Samples
RNA-Seq data from BA46 of post-mortem brain samples, classified as Array Collection by SMRI, was obtained upon request. This comprised 23 SCZ, 23 CNT and 16 BD samples after matching with proteomic samples and removing any outliers (
In brief, the RNA extraction was performed as follows. 1 μg of total RNA was poly-A selected using oligo-dT Dynabeads, libraries were prepared using Illumina's TruSeq v1 (Illumina, Hayward, CA) and sequencing was performed using an Illumina HiSeq 2000, giving ~3 Mb of 90 bp paired-end reads for each library. The resultant .FASTQ/.FQ files were processed as described below.
The .FASTQ/.FQ files were assessed using FastQC for quality control. Read alignment was carried out using HISAT2 v2.1.0 with default parameters except that ‘--add-chrname’, ‘--dta’ and ‘--summary-file’ were set to TRUE. Additionally, either Phred+33 or Phred+64 encoding was set to TRUE based on the sample being analyzed. Reads were aligned using the index for the GRCh38 genome available at ccb.jhu.edu/software/hisat2/manual.shtml. The resultant summary file was used to generate counts of percentage read alignment (
Following alignment, transcripts were assembled using StringTie v1.3.3 (
Analysis of Mass Spectra from SMRI Array Collection Samples
Post-mortem anterior prefrontal cortex (BA10) samples were obtained from 23 SCZ, 23 BD and 23 control samples (after matching with RNA-seq data this led to the use of 23 SCZ, 16 BD and 23 CNT samples). 50 mg of tissue slices per sample were collected and processed. Protein samples were analyzed using Waters Q-TOF premier mass spectrometer. The output .RAW files were processed on PLGS and converted to .MGF files. The .MGF files were searched against the human UniProt database using Mascot to identify known proteins that are translated.
Unmapped mass spectra were searched against two databases using Mascot. The first search was carried out against nORF amino acid database that was constructed using 248,135 nORFs that we curated.
The results of mapping unmatched sample spectra to nORF amino acid database were filtered by protein and peptide score >50 and expectation value <0.05. Furthermore, only peptides expressed in at least 30% of each disorder group were evaluated (
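The filtering of peptide matches can be sketched as follows. The sketch assumes that the 30% threshold is applied separately within each sample group and that each hit carries a sample ID, group label, peptide sequence, scores and expectation value; the field names and function names are hypothetical and do not reflect the Mascot export format.

```python
def confident_hits(hits, min_score=50, max_expect=0.05):
    """Keep hits with protein and peptide score > 50 and expectation < 0.05."""
    return [
        hit for hit in hits
        if hit["protein_score"] > min_score
        and hit["peptide_score"] > min_score
        and hit["expect"] < max_expect
    ]


def recurrent_peptides(hits, group_sizes, min_fraction=0.30):
    """Return, per group, peptides confidently seen in >= 30% of its samples.

    group_sizes: dict mapping group label (e.g. "SCZ") to its sample count.
    Assumption: the threshold is evaluated within each group independently.
    """
    samples_seen = {}  # (group, peptide) -> set of sample IDs
    for hit in confident_hits(hits):
        key = (hit["group"], hit["peptide"])
        samples_seen.setdefault(key, set()).add(hit["sample"])
    result = {group: set() for group in group_sizes}
    for (group, peptide), samples in samples_seen.items():
        if len(samples) / group_sizes[group] >= min_fraction:
            result[group].add(peptide)
    return result
```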
Enrichment Analysis to Identify Potential Functions of nORFs
InterProScan was used to identify descriptive GO terms for the nORFs used in this study, and GO enrichment was performed using GOEnrichment tool available via usegalaxy.org. Next, using the GLANET tool for annotation and enrichment analysis, DHS1, TFs and histone modification enrichment was evaluated for nORFs. Default parameters were used, and 10,000 samples were processed across 30 core processors.
Potential Structures of Identified nORFs
Structures for 21 nORFs that were identified using proteogenomic analysis, and DE nORFs identified in BD or SCZ, were generated using I-TASSER and RaptorX. Default parameters were used for the structure prediction run. For I-TASSER, the model with the highest confidence score was chosen as the nORF structure. Models were visualized using Avogadro or Jena3D viewer.
Correlation Analysis of the Translated nORFs with Psychosis, Suicide and Gender
Expression of the 21 translated nORFs was compared for differences (presence/absence evaluated as Yes/No) between gender, incidence of psychosis and suicide. Significance was evaluated using a Chi-squared test within each disorder or between disorders. P-value significance was evaluated at 3 levels: ***<0.001; **<0.01; *<0.05. Similarly, the three new nORFs identified using transcriptomic data were compared for differences between gender, incidence of psychosis and suicide.
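A Chi-squared test on presence/absence counts, with the three significance levels above, can be sketched as follows. The sketch assumes a 2x2 table (for example, nORF present Yes/No against suicide Yes/No) and no continuity correction; the table shape and any correction actually used are not stated in the text, and the function names are illustrative.

```python
from math import erfc, sqrt


def chi_squared_2x2(a, b, c, d):
    """Pearson Chi-squared test for the table [[a, b], [c, d]].

    Returns (statistic, p_value) with 1 degree of freedom and no
    continuity correction (an assumption of this sketch).
    """
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column must have a non-zero total")
    statistic = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # Survival function of the Chi-squared distribution with 1 df.
    p_value = erfc(sqrt(statistic / 2))
    return statistic, p_value


def significance_stars(p_value):
    """Map a p-value to the levels ***<0.001, **<0.01, *<0.05."""
    if p_value < 0.001:
        return "***"
    if p_value < 0.01:
        return "**"
    if p_value < 0.05:
        return "*"
    return ""
```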
Creation of nORF Database and Classification of nORF Entries
We added ‘low-noise’ nORFs, as defined and identified across 353 samples from RPFdbv2.0 using the RibORF tool, to ~194,407 nORF entries from nORFs.org. Briefly, low-noise nORFs were identified as those whose RPKM read counts had a lower standard deviation than the median standard deviation of canonical ORFs or cORFs (the main ORFs within protein-coding genes). This resulted in 248,135 nORF entries (GRCh38; 247,404 entries in nORF hg19) after removal of nORFs that were in-frame with the cORFs as determined by a classification scheme (
Classification of the 248,135 nORF entries with respect to known genes (
Identification of DE nORFs in PsychENCODE Dataset
To investigate whether the 248,135 nORFs that we curated are transcribed in PsychENCODE samples, and whether they are up- or down-regulated compared to the control samples, we performed the following set of analyses. Transcripts from the three sample groups were pre-processed, as described, and their abundance was obtained and filtered to retain those with transcripts per million (TPM) >0.1 in at least 10% of the samples, resulting in 110,003 transcripts. We identified 3,103 nORFs using the workflow illustrated in
Next, we intended to investigate similar relationships between differential expression of nORFs in SCZ and BD and their association with the respective disease pathology. Because there is no equivalent metric to patient survival, we explored whether the identified DE nORFs, in the respective disorders, are associated with already identified genomic ‘hot-spots’ for the respective disorders. To do this, we used GLANET, a program that associates nORFs with genomic loci that are implicated in SCZ and BD and tests for the statistical significance of the enrichments.
nORFs-HARs and their Enrichments within Disorder-Associated SNP Loci
Having demonstrated that some nORFs are indeed associated with SCZ hot spots, we performed the following analysis to investigate whether the nORFs constitute recently evolved vHARs, mHARs, and pHARs genomic regions. Out of 3,103 nORFs, 431 nORFs overlapped with 4,481 unique HARs. Seven nORFs DE in SCZ (3 over-expressed and 4 under-expressed) were found to be associated with HARs (7 DE nORF-HARs); most associated HARs resided within the same characterized region as their nORF, but some were found in intergenic regions or in different genes. Six nORFs DE in BD (4 over-expressed and 2 under-expressed in BD) were found to be associated with HARs (6 DE nORF-HARs); again, most associated HARs resided within the same characterized region as their nORF, but some were found in intergenic regions or in different genes.
The transcript types of the 7 DE nORF-HARs in SCZ are: 2 ‘antisense’, 2 ‘processed transcripts’, 1 ‘nonsense mediated decay’, 1 ‘retained intron’ and 1 ‘lincRNA’. Two DE nORFs contained HARs within them: tracer_65443 and fs1rH2. The transcript types of the 6 DE nORF-HARs in BD are: 3 ‘retained intron’, 1 ‘lincRNA’, 1 ‘processed pseudogene’ and 1 ‘antisense’. No nORFs DE in BD contained HARs within their lengths. The HAR types associated with DE nORFs in SCZ and in BD are displayed in
INRICH analysis revealed that out of the 431 nORF-HARs, 50 are associated with SCZ loci with a GWAS p-value upper bound of 10−2; 13 nORF-pHARs were associated with SCZ loci with a GWAS p-value upper bound of 10−7. Furthermore 11 nORF-HARs are associated with BD loci with a GWAS p-value upper bound of 10−2, and only 4 nORF-pHARs were associated with BD loci with a GWAS p-value upper bound of 10−5 (
HARs and TEs that were included in or overlapped with these DE transcripts were designated as differentially expressed HARs or DE HARs and differentially expressed TEs, or DE TEs, respectively. 160 DE transcripts in SCZ contained HARs resulting in 305 DE HARs in SCZ; 59 DE transcripts in BD contained HARs resulting in 90 DE HARs in BD; 2,638 DE transcripts in SCZ contained TEs resulting in 193,111 DE TEs in SCZ; and 1,522 DE transcripts in BD contained TEs, giving 100,831 DE TEs in BD.
Association of DE nORFs with Differentially Expressed HARs (DE HARs)
While most HARs are considered non-coding genomic regions, they do demonstrate evidence of transcription. RNAs containing HARs fall under various classifications of non-coding RNA—small RNA (sRNA), microRNA (miRNA), long non-coding RNA (lncRNA) or enhancer RNA (eRNA)—or may simply be a part of a known protein-coding region. If a DE HAR associated with a DE nORF is within a known protein-coding region, that could indicate a potential connection between that protein-coding region and the DE nORF. Three DE nORFs were found to be associated with DE HARs in SCZ (3 DE nORF-DE HARs); none were found in BD (
Association of DE nORFs with Differentially Expressed TEs (DE TEs)
Presence of a TE in the 2 kb region upstream of a DE nORF could indicate the presence of an alternative promoter. Therefore, DE nORFs were investigated for association with DE TEs, where a DE nORF is associated with a DE TE if the TE is within the 2 kb region upstream of the nORF. 11 DE nORFs were found to be associated with DE TEs in SCZ (11 DE nORF-DE TEs), and 8 DE nORFs were found to be associated with DE TEs in BD (8 DE nORF-DE TEs). Of the 8 DE nORFs associated with DE TEs in BD, 2 are also associated with HARs: cp2xH1 and eveeH1. DE TEs could allow for different expression of nORFs under different conditions, leading to phenotypes of SCZ or BD. Besides differential expression-based regulation, we also investigated whether there could be other unknown correlations between expression of TEs and nORFs. To this end, we performed Spearman and Pearson correlation analysis of the expression of nORFs and each of their associated DE TEs.
This analysis revealed that 5 DE nORF-DE TE combinations had significantly correlated expression levels in SCZ. Notably, the DE nORF 2vnjH1 had its expression significantly correlated with two DE TEs: one LINE and one SINE. One DE nORF was overexpressed in SCZ; 4 DE nORFs were under-expressed in SCZ. The DE nORFs' biotypes were split into 2 ‘lincRNA’, 2 ‘processed transcripts’ and 1 ‘antisense’. None of the DE nORFs were within SCZ-associated loci. The 5 DE TEs were all unique and were comprised of 2 3′-end-of-a-L2 LINEs, 2 L2-end SINEs and 1 3′-end-of-a-L1 LINE. For BD, 4 DE nORF-DE TE combinations were found to have significantly correlated expressions. Of the 4 DE nORFs, 2 were found to be associated with HARs as well. The 4 DE nORFs' biotypes were split evenly between ‘retained intron’ and ‘lincRNA’. None of the DE nORFs were within BD-associated loci. The 4 DE TEs were also all unique and were comprised of 2 ERV1 LTRs, 1 Alu SINE and 1 3′-end-of-a-L1 LINE. 3 DE nORFs were upregulated in BD; 1 DE nORF was down regulated. The DE nORFs included cp2xH1 and eveeH1, which were also associated with HARs, suggesting that those DE nORFs were under HAR-related selection pressure as well as being regulated by TEs. This association is perhaps most significant for eveeH1 and is interesting given the parent gene of eveeH1 is ZNF84, a zinc finger protein that contains a KRAB/FPB domain that may regulate gene expression through TE regulation. As such, eveeH1 may serve as an initial regulation point from which other TE-associated genes and nORFs may be regulated. Its associated DE TE with correlated expression is an endogenous retrovirus sequence ERV1 conserved in primates; its insertion may have conferred an added layer of regulation that was later selected for along with the associated HAR, perhaps in part due to its far-reaching effects. 
1 DE nORF-DE TE combination with significantly correlated expression was shared between the SCZ and BD datasets: tracer_18675 with its L1MC2 TE. The expression of the DE TE was correlated with the expression of the DE nORF. This was the only DE nORF-DE TE combination with significant overlap between the DE TE and the DE nORF.
Translation Evidence of nORFs in Brain Samples
We aimed to obtain direct evidence of translation of these nORFs in SCZ and BD brain samples. To this end, we used a proteogenomic approach that combines both transcriptomic and proteomic data. We performed the proteogenomics analysis on a subset of samples for which both transcriptomic and proteomic data were available, which were suitable for investigating potential translation of nORFs. Transcriptomic and proteomic data from this subset of 62 samples from the SMRI Array cohort was analyzed using the proteogenomic framework as described in the methods and displayed in
The proteogenomic analysis identified 446, 460 and 434 known proteins that were translated in CNT, SCZ and BD, respectively, among these 408 are common between all three sample sets (
17 of the 21 nORFs identified as translated were common between CNT, SCZ and BD, whereas 2 were unique to BD and 2 to SCZ and BD (
We further evaluated the expression differences of these novel peptides between disorders for metadata categories such as suicide, psychosis and gender and identified significant expression differences as determined using Chi-squared tests (
Gene Ontology Enrichment Analysis for Potential Functional Inferences of nORFs
To infer functions of the translated nORFs from their amino acid sequence we performed gene ontology (GO) analysis. For all the 248,135 nORFs used in this study, GO terms were obtained using InterProScan and GO term enrichment was performed using GOEnrichment tool via the galaxy server (
The GLANET analysis, in addition to associating nORFs with the SCZ and BD disorder associated loci, also identified enrichment of certain DNase I hypersensitive site (DHS1), histone modifications and transcription factors (TFs) within the transcribed and DE nORFs, as shown in Table 6 below.
The nucleotide sequences (transcripts) for the 40 transcripts from BD, and the 44 differentially expressed nORFs (amino acid sequences) mapped to these transcripts, are shown in Table 7 and Table 8, respectively.
The nucleotide sequences (transcripts) for the 56 transcripts from SCZ, and the 61 differentially expressed nORFs (amino acid sequences) mapped to these transcripts, are shown in Table 9 and Table 10, respectively.
In total, 3,022 transcripts (nucleotide sequences) contained the 3,103 nORFs (amino acid sequences) identified as transcribed.
Potential Structures of Identified nORFs
To infer whether these nORFs could form potential structures, we predicted the putative structures of the 21 nORFs identified as translated, as well as DE nORFs, which included nORFs that were associated with pHARs and present in SCZ loci, using I-TASSER and RaptorX. For I-TASSER, the model with the highest confidence score was chosen as the nORF structure.
The lack of adequate and targetable SCZ and BD-specific signatures in protein-coding and noncoding genes led us to investigate nORFs within the human genome. We curated 248,135 nORFs, investigated 1,340 neuropsychiatric samples from the PsychENCODE consortium and identified 3,103 nORFs as transcribed, with 61 and 44 nORFs (within 56 and 40 transcripts) differentially expressed in SCZ and BD, respectively. Additionally, DHS1, TF and histone modification enrichments were found within the transcribed nORFs, and SCZ specific loci were found enriched with transcribed and DE nORFs.
A number of nORFs differentially expressed in SCZ and BD were identified as being associated with HARs and as having their expression correlated with that of associated TEs differentially expressed in SCZ and BD. The association of 13 DE nORFs with HARs, especially those that are also associated with SCZ and BD loci, suggests that HARs may play a role in the pathophysiology of SCZ and BD, and that these DE nORFs may have advantageous functions that they have been selected for either as a result of or in tandem with their associated HARs. This reinforces the idea that susceptible genes for the two disorders may have been positively selected for in human-specific evolution.
The type of HAR associated with each DE nORF gives a glimpse into the evolutionary background of their regulatory relationships and of, by extension, the disorders in question. The depletion of vHARs in DE nORF-HARs with respect to pHARs and mHARs in the SCZ datasets (
The results of the enrichment analysis (
Similar patterns can be observed for nORFs DE in BD. The DE nORF tracer_42939 is within the SLC7A6OS gene, which is highly conserved in vertebrates. Despite the gene's conservation in vertebrates, it is associated with a pHAR, suggesting that some event may have occurred around the divergence of primates that resulted in human-lineage-specific rapid evolution of that locus, possibly resulting in altered CNS development and a susceptibility to BD. As mentioned previously, SLC7A6OS is within a SCZ-associated locus; its detection as a DE nORF-HAR in BD and its relevance to SCZ suggest a genetic commonality and may contribute towards explaining phenotypic similarities between the disorders.
The correlation of expression of DE nORFs and DE TEs indicates the possibility of TE-based regulation of the DE nORFs, especially since the majority of DE TEs found in this analysis are distinct from the DE nORF (the exception is tracer_18675 and its L1MC2 TE). A particularly interesting DE nORF-DE TE combination is that of eveeH1 and its ERV1 LTR TE. The DE nORF eveeH1 is within the ZNF84 gene, which codes for a KRAB/FPB domain-containing protein. The KRAB/FPB domain may regulate gene expression through TE regulation; eveeH1 may thus be an initial regulation point from which a cascade of TE-based regulation occurs. Since it is differentially regulated in BD, it could be responsible, at least in part, for TE-based differential regulation across the genome that contributes to the BD phenotype. Further investigation into the specific function of eveeH1 and other DE nORFs may elucidate more fully their role in SCZ and BD. The exact relationships between HARs, TEs and nORFs remain to be elucidated; further work utilizing ChIP-seq and whole genome bisulfite sequencing data could shed light on them. Furthermore, analysis of more RNA-seq data, from a large number of disorder samples in particular, would help clarify how HARs and TEs regulate nORF expression in these two mental disorders.
We also demonstrated evidence of translation for 21 nORFs from the database, and for three new ones identified from the transcriptome of a smaller subset of neuropsychiatric samples. Of the 21 nORFs, some were found significantly different between disorders for metadata categories such as gender, incidence of psychosis and suicide. We predicted structures for the 21 nORFs and for those that are associated with pHARs and disorder-loci. This approach could offer a new strategy to expedite the identification of novel drug candidates and novel diagnostic signatures for preemptive interventions, for example to prevent suicide or mitigate psychosis.
To summarize, we introduce how novel regions of the genome, nORFs, merit systematic analysis within disease systems to uncover novel targets for the development of diagnostic and therapeutic strategies. From an evolutionary point-of-view, these results indicate that the genomic features responsible for SCZ and BD arose at least after the divergence of mammals from other vertebrates, or that nORFs associated with pHARs may have arisen in primates and then been subject to increased evolution in the human lineage, only to result in SCZ and BD susceptibility in modern humans when dysfunctional.
Code for this work can be obtained from: github.com/PrabakaranGroup/norfs_in_neuropsychiatric_disorders.
While the invention has been described in connection with specific embodiments thereof, it will be understood that it is capable of further modifications and this application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the invention that come within known or customary practice within the art to which the invention pertains and may be applied to the essential features hereinbefore set forth, and follows in the scope of the claims.
Other embodiments are within the claims.
This application claims the benefit of U.S. Provisional Application No. 63/221,821 filed on Jul. 14, 2021, which is incorporated herein by reference in its entirety.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/EP2022/069790 | 7/14/2022 | WO |
Number | Date | Country | |
---|---|---|---|
63221821 | Jul 2021 | US |