The technology relates in part to methods and systems for the identification of disease-associated transcript isoforms and peptides. In certain aspects, the technology relates to methods and systems for the identification of transcript isoforms and peptides preferentially expressed in tumor cells.
Gene expression is the process by which instructions in our DNA are converted into a functional product, such as a protein. One regulatory process of gene expression is alternative splicing, which is a process by which exons of a gene are included or excluded in matured mRNAs, resulting in the production of multiple distinct transcript isoforms, and generating diverse isoforms of proteins. Mechanisms for alternative splicing may include, for example, i) use of alternative promoters, ii) exon skipping, iii) mutually exclusive exons, iv) exon scrambling, v) alternative 5′ and 3′ splice sites, vi) retained introns, and vii) alternative polyadenylation. Aberrant splicing patterns may be found in abnormal cells (e.g., tumor cells). In some instances, dysregulated expression of splicing regulators such as RBFOX2, PTB/PTBP1, and SRSF1, can cause splicing pattern changes in multiple genes.
The advent of whole transcriptome sequencing (RNA-seq) and the development of related bioinformatics analysis tools have enabled researchers to detect and measure not only the expression of genes but also their sequences and structural configurations. When these tools are applied to cancer, disease-specific alternative transcripts may be identified as potential biomarkers for diagnosis and/or targets for treatment.
Provided in certain aspects are computer-implemented methods for identifying a transcript of a gene that is expressed at a level in a defined subpopulation of diseased tissue samples higher than a level the transcript is expressed in non-diseased tissue samples, comprising (a) receiving user input comprising (i) a defined disease selected from a plurality of defined diseases, where each of the defined diseases corresponds to diseased tissue samples, (ii) a defined minimum transcript expression value threshold for diseased tissue samples, (iii) a defined maximum or a defined median transcript expression value threshold for non-diseased tissues, where each of the non-diseased tissues corresponds to non-diseased tissue samples, and (iv) a defined minimum sample sub-population percentage; (b) identifying in a database comprising one or more tables that relate: (i) transcript identifiers to genes, where at least a portion of the transcript identifiers correspond to transcript isoform sets, and each of the transcript isoform sets is encoded by a gene; (ii) transcript identifiers to corresponding transcript expression values in diseased tissue samples, and (iii) transcript identifiers to corresponding transcript expression values in non-diseased tissue samples; one or more transcript identifiers meeting the following criteria (1) and (2): (1) a corresponding transcript expression value in diseased tissue samples corresponding to the input defined disease of (a)(i) is greater than the input defined minimum transcript expression value threshold of (a)(ii), and (2) a corresponding transcript expression value in non-diseased tissue is less than the input defined maximum or defined median transcript expression value threshold of (a)(iii), for a percentage of diseased samples corresponding to the input defined disease of (a)(i) greater than the defined minimum sub-population percentage of (a)(iv); and (c) outputting a list comprising a transcript identifier for each of one or more transcripts identified in (b), thereby identifying a transcript of one or more genes that is expressed at a level in a defined subpopulation of diseased tissue samples higher than a level the transcript is expressed in non-diseased tissue samples.
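By way of non-limiting illustration, the selection of criteria (1) and (2) can be sketched in Python. The sketch makes several assumptions that are not stated above: expression values are held in plain dictionaries rather than database tables, criterion (2) is applied to every non-diseased tissue, and a transcript lacking data on either side is skipped. The function and parameter names are hypothetical.

```python
from statistics import median

def find_elevated_transcripts(disease_expr, normal_expr, disease,
                              min_disease_value, normal_threshold,
                              min_subpop_pct, normal_stat="max"):
    """Return (transcript_id, percent_of_samples) pairs passing criteria (1) and (2).

    disease_expr: {disease: {transcript_id: [value per diseased sample]}}
    normal_expr:  {transcript_id: {tissue: [value per non-diseased sample]}}
    """
    summarize = max if normal_stat == "max" else median
    hits = []
    for tid, values in disease_expr[disease].items():
        tissues = normal_expr.get(tid)
        if not values or not tissues:
            continue  # assumption: skip transcripts lacking data on either side
        # Criterion (2): the maximum (or median) in each non-diseased tissue
        # is below the user-supplied threshold.
        normal_ok = all(summarize(v) < normal_threshold
                        for v in tissues.values() if v)
        # Criterion (1), counted per sample: percentage of diseased samples
        # whose value exceeds the user-supplied minimum.
        pct = 100.0 * sum(v > min_disease_value for v in values) / len(values)
        if normal_ok and pct > min_subpop_pct:
            hits.append((tid, pct))
    # Largest sub-population first.
    return sorted(hits, key=lambda hit: hit[1], reverse=True)
```

In this sketch, switching `normal_stat` from `"max"` to `"median"` corresponds to choosing the defined median rather than the defined maximum threshold of (a)(iii).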
Also provided in certain aspects are methods for identifying a transcript of a gene that is expressed at a level in a defined subpopulation of diseased tissue samples higher than a level the transcript is expressed in non-diseased tissue samples, comprising (a) receiving user input comprising (i) a defined disease selected from a plurality of defined diseases, where each of the defined diseases corresponds to diseased tissue samples, (ii) a defined transcript expression ratio threshold, where the transcript expression ratio is a ratio of a transcript expression value for diseased tissue samples to a transcript expression value for non-diseased tissues, where each of the non-diseased tissues corresponds to non-diseased tissue samples, and (iii) a defined minimum sample sub-population percentage; (b) identifying in a database comprising one or more tables that relate (i) transcript identifiers to genes, where at least a portion of the transcript identifiers correspond to transcript isoform sets, and each of the transcript isoform sets is encoded by a gene; and (ii) transcript identifiers to corresponding transcript expression ratios; one or more transcript identifiers having a corresponding transcript expression ratio for the input defined disease of (a)(i) that is greater than the input defined transcript expression ratio threshold of (a)(ii), for a percentage of diseased samples corresponding to the input defined disease of (a)(i) greater than the defined minimum sub-population percentage of (a)(iii); and (c) outputting a list comprising a transcript identifier for each of one or more transcripts identified in (b), thereby identifying a transcript of one or more genes that is expressed at a level in a defined subpopulation of diseased tissue samples higher than a level the transcript is expressed in non-diseased tissue samples.
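By way of non-limiting illustration, the ratio-based selection can be sketched as follows. The sketch assumes that a ratio is computed per diseased sample against a single non-diseased reference value per transcript, and that a small pseudocount is added to avoid division by zero; neither detail is specified above, and all names are hypothetical.

```python
def find_by_expression_ratio(disease_values, normal_reference,
                             ratio_threshold, min_subpop_pct,
                             pseudocount=1e-5):
    """Return (transcript_id, percent_of_samples) pairs passing the ratio test.

    disease_values:   {transcript_id: [value per diseased sample]}
    normal_reference: {transcript_id: one non-diseased expression value}
    """
    hits = []
    for tid, values in disease_values.items():
        if not values or tid not in normal_reference:
            continue  # assumption: skip transcripts without a reference value
        ref = normal_reference[tid] + pseudocount  # pseudocount is an assumption
        ratios = [(v + pseudocount) / ref for v in values]
        # Percentage of diseased samples whose ratio exceeds the threshold.
        pct = 100.0 * sum(r > ratio_threshold for r in ratios) / len(ratios)
        if pct > min_subpop_pct:
            hits.append((tid, pct))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)
```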
Also provided in certain aspects are systems, machines, or apparatuses comprising one or more microprocessors and memory, the memory comprising a database comprising one or more tables that relate (i) transcript identifiers to genes, where at least a portion of the transcript identifiers correspond to transcript isoform sets, and each of the transcript isoform sets is encoded by a gene; (ii) transcript identifiers to corresponding transcript expression values in diseased tissue samples, and (iii) transcript identifiers to corresponding transcript expression values in non-diseased tissue samples; and instructions executable by the one or more microprocessors configured to perform the following method (a) receiving user input comprising (i) a defined disease selected from a plurality of defined diseases, where each of the defined diseases corresponds to diseased tissue samples, (ii) a defined minimum transcript expression value threshold for diseased tissue samples, (iii) a defined maximum or a defined median transcript expression value threshold for non-diseased tissues, where each of the non-diseased tissues corresponds to non-diseased tissue samples, and (iv) a defined minimum sample sub-population percentage; (b) identifying in the database one or more transcript identifiers meeting the following criteria (1) and (2): (1) a corresponding transcript expression value in diseased tissue samples corresponding to the input defined disease of (a)(i) is greater than the input defined minimum transcript expression value threshold of (a)(ii), and (2) a corresponding transcript expression value in non-diseased tissue is less than the input defined maximum or defined median transcript expression value threshold of (a)(iii), for a percentage of diseased samples corresponding to the input defined disease of (a)(i) greater than the defined minimum sub-population percentage of (a)(iv); and (c) outputting a list comprising a transcript identifier for each of one or more transcripts identified in (b).
Also provided in certain aspects are systems, machines, or apparatuses comprising one or more microprocessors and memory, the memory comprising a database comprising one or more tables that relate (i) transcript identifiers to genes, where at least a portion of the transcript identifiers correspond to transcript isoform sets, and each of the transcript isoform sets is encoded by a gene; (ii) transcript identifiers to corresponding transcript expression values in diseased tissue samples, and (iii) transcript identifiers to corresponding transcript expression values in non-diseased tissue samples; and instructions executable by the one or more microprocessors configured to perform the following method (a) receiving user input comprising (i) a defined disease selected from a plurality of defined diseases, where each of the defined diseases corresponds to diseased tissue samples, (ii) a defined transcript expression ratio threshold, where the transcript expression ratio is a ratio of a transcript expression value for diseased tissue samples to a transcript expression value for non-diseased tissues, where each of the non-diseased tissues corresponds to non-diseased tissue samples, and (iii) a defined minimum sample sub-population percentage; (b) identifying in the database one or more transcript identifiers having a corresponding transcript expression ratio for the input defined disease of (a)(i) that is greater than the input defined transcript expression ratio threshold of (a)(ii), for a percentage of diseased samples corresponding to the input defined disease of (a)(i) greater than the defined minimum sub-population percentage of (a)(iii); and (c) outputting a list comprising a transcript identifier for each of one or more transcripts identified in (b).
Also provided in certain aspects are computer-implemented methods for analyzing a polypeptide comprising (a) identifying one or more transcript identifiers in a database comprising one or more tables that relate (i) transcript identifiers to genes, where at least a portion of the transcript identifiers correspond to transcript isoform sets, and each of the transcript isoform sets is encoded by a gene; (ii) transcript identifiers to corresponding transcript expression values in diseased tissue samples, and (iii) transcript identifiers to corresponding transcript expression values in non-diseased tissue samples; (b) receiving user selection of a transcript identifier; and (c) outputting one or more of (i) a three-dimensional structure corresponding to a polypeptide linear sequence corresponding to the selected transcript identifier, and (ii) functional polypeptide domain information for a polypeptide linear sequence corresponding to the selected transcript identifier, where the one or more tables of the database relate three-dimensional structure coordinates and functional polypeptide domain information to the polypeptide linear sequence.
Also provided in certain aspects are methods for generating a database comprising (i) relating transcript identifiers to genes, where at least a portion of the transcript identifiers correspond to transcript isoform sets, and each of the transcript isoform sets is encoded by a gene; (ii) relating transcript identifiers to corresponding transcript expression values in diseased tissue samples for a plurality of defined diseases, and (iii) relating percentages of diseased tissue samples for each defined disease to corresponding transcript identifiers, where the percentages are based on an amount of diseased tissue samples corresponding to a defined disease for which transcript expression values of a corresponding transcript identifier exceed a defined minimum transcript expression value threshold, or (iii′) relating percentages of diseased tissue samples for each defined disease to corresponding transcript identifiers, where the percentages are based on an amount of diseased tissue samples corresponding to a defined disease for which transcript expression ratios of a corresponding transcript identifier exceed a defined transcript expression ratio threshold.
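By way of non-limiting illustration, steps (i) through (iii) can be sketched with an in-memory SQLite database from the Python standard library. The table and column names are hypothetical, and the use of SQLite for this step is an assumption made for the sketch.

```python
import sqlite3

def build_database(transcript_genes, disease_samples, min_value):
    """Create the three relations and return the open connection.

    transcript_genes: iterable of (transcript_id, gene)
    disease_samples:  iterable of (transcript_id, disease, sample_id, value)
    min_value:        defined minimum transcript expression value threshold
    """
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE transcript_gene (transcript_id TEXT PRIMARY KEY, gene TEXT);
        CREATE TABLE disease_expression (
            transcript_id TEXT, disease TEXT, sample_id TEXT, value REAL);
        CREATE TABLE disease_percentage (
            transcript_id TEXT, disease TEXT, pct_above REAL);
    """)
    # (i) transcript identifiers related to genes
    conn.executemany("INSERT INTO transcript_gene VALUES (?, ?)", transcript_genes)
    # (ii) transcript identifiers related to expression values per disease
    conn.executemany("INSERT INTO disease_expression VALUES (?, ?, ?, ?)",
                     disease_samples)
    # (iii) percentage of samples per disease whose value exceeds the threshold;
    # the comparison evaluates to 0 or 1, so SUM counts the passing samples.
    conn.execute("""
        INSERT INTO disease_percentage
        SELECT transcript_id, disease, 100.0 * SUM(value > ?) / COUNT(*)
        FROM disease_expression
        GROUP BY transcript_id, disease
    """, (min_value,))
    conn.commit()
    return conn
```

Precomputing the percentage table in this way would let a later query compare a stored percentage directly against a user-supplied minimum sub-population percentage. Variant (iii′) could follow the same pattern with expression ratios in place of expression values.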
Certain implementations are described further in the following description, examples and claims, and in the drawings.
The drawings illustrate certain implementations of the technology and are not limiting. For clarity and ease of illustration, the drawings are not made to scale and, in some instances, various aspects may be shown exaggerated or enlarged to facilitate an understanding of particular implementations.
Provided herein are methods and systems for identifying disease-associated transcript isoforms. Also provided herein are methods and systems for analyzing peptides and polypeptides encoded by the disease-associated transcript isoforms. In some aspects, identifying disease-associated transcript isoforms involves a comparison of transcripts from diseased samples to transcripts from a variety of tissues from normal samples. In some aspects, a method herein involves identifying a subpopulation of diseased samples having elevated expression of a disease-associated transcript isoform. In some aspects, a method herein involves identifying a disease-associated transcript isoform having elevated expression across multiple disease types.
A particular gene can give rise to different polypeptide or peptide variants according to one or more alterations at the DNA level (e.g., point mutation event) or mRNA level (e.g., alternative splice event). A particular gene can encode two or more polypeptide variants or peptide variants that are distinguished, for example, by at least one single amino acid substitution, at least one single amino acid insertion, at least one single amino acid deletion, at least one substitution of two or more consecutive amino acids, at least one insertion of two or more consecutive amino acids, at least one deletion of two or more consecutive amino acids, or a combination thereof.
Methods described herein may include an expression analysis of a variant subsequence (e.g., a splice variant subsequence). In some embodiments, a method described herein includes an analysis of differential expression (e.g., expression of a gene, variant (e.g., splice variant), or variant subsequence (e.g., splice variant subsequence) in different organs or tissues; in different subjects; in healthy vs. disease organs or tissues; or in healthy vs. disease subjects). In some embodiments, a differential expression analysis includes comparing gene, variant (e.g., splice variant), or variant subsequence (e.g., splice variant subsequence) expression in tumors vs. surrounding tissue (e.g., in the same subject). In some embodiments, a differential expression analysis includes comparing gene, variant (e.g., splice variant), or variant subsequence (e.g., splice variant subsequence) expression in tumors (e.g., from one or more disease subjects) vs. corresponding tissue (e.g., from one or more healthy subjects).
Corresponding tissue generally refers to the organ or tissue in a healthy subject that is equivalent to the organ or tissue that is cancerous in a disease subject. For example, if a disease subject has prostate cancer, corresponding tissue would refer to prostate tissue from a healthy subject. In some embodiments, a differential expression analysis includes comparing gene, variant (e.g., splice variant), or variant subsequence (e.g., splice variant subsequence) expression in tumors (e.g., from one or more disease subjects) vs. normal tissue (e.g., one or more tissues from one or more healthy subjects).
Any suitable method for determining or measuring levels of gene, variant (e.g., splice variant), or variant subsequence (e.g., splice variant subsequence) expression may be used in a gene, variant (e.g., splice variant), or variant subsequence (e.g., splice variant subsequence) expression analysis and/or a differential expression analysis. Examples of methods for measuring expression levels include qPCR, RT-qPCR, RNA-Seq, microarray, northern blot, differential display, and RNase protection assay.
In some embodiments, expression levels may be measured using a quantifiable amplification method. For example, expression levels may be measured using a quantitative PCR (qPCR) approach (e.g., on cDNA generated from mRNA from a sample), or a reverse transcriptase quantitative PCR (RT-qPCR) approach (e.g., on mRNA from a sample). Quantitative PCR (qPCR), which also may be referred to as real-time PCR, monitors the amplification of a targeted nucleic acid molecule during a PCR reaction (i.e., in real time). This method may be used quantitatively (quantitative real-time PCR) and semi-quantitatively (i.e., above/below a certain amount of nucleic acid molecules; semi-quantitative real-time PCR). Methods for qPCR include use of non-specific fluorescent dyes that intercalate with double-stranded DNA, and sequence-specific DNA probes labelled with a fluorescent reporter, which generally allows detection after hybridization of the probe with its complementary sequence. Quantitative PCR methods typically are performed in a thermal cycler with the capacity to illuminate each sample with a beam of light of at least one specified wavelength and detect the fluorescence emitted by an excited fluorophore.
For non-specific detection, a DNA-binding dye binds to all double-stranded (ds) DNA during PCR. An increase in DNA product during PCR therefore leads to an increase in fluorescence intensity measured at each cycle. For qPCR using dsDNA dyes, the reaction typically is prepared like a basic PCR reaction, with the addition of fluorescent dsDNA dye. Then the reaction is run in a real-time PCR instrument, and after each cycle, the intensity of fluorescence is measured with a detector (the dye only fluoresces when bound to the dsDNA (i.e., the PCR product)). In certain applications, multiple target sequences may be monitored in a tube by using different types of dyes. For specific detection, fluorescent reporter probes detect only the DNA containing the sequence complementary to the probe. Accordingly, use of the reporter probe increases specificity, and enables performing the technique even in the presence of other dsDNA. Using different types of labels, fluorescent probes may be used in multiplex assays for monitoring several target sequences in the same tube. This method typically uses a DNA-based probe with a fluorescent reporter at one end and a quencher of fluorescence at the opposite end of the probe. The close proximity of the reporter to the quencher prevents detection of its fluorescence. During PCR, the probe is broken down by the 5′ to 3′ exonuclease activity of the polymerase, which breaks the reporter-quencher proximity and thus allows unquenched emission of fluorescence, which can be detected after excitation with a laser. An increase in the product targeted by the reporter probe at each PCR cycle therefore causes a proportional increase in fluorescence due to the breakdown of the probe and release of the reporter.
In some embodiments, expression levels may be measured using a sequencing process (e.g., RNA sequencing (RNA-Seq)). RNA-Seq typically uses high-throughput sequencing to detect the presence and/or measure the quantity of RNA in a sample. In certain applications, RNA-Seq allows for detection and/or measurement of alternatively spliced transcripts, post-transcriptional modifications, gene fusions, mutations/SNPs, changes in gene expression over time, and/or differential expression. RNA-Seq can analyze different populations of RNA, which may include mRNA transcripts, total RNA, small RNA (e.g., miRNA), tRNA, and ribosomal RNA. RNA-Seq also may be used to assess exon/intron boundaries.
A disease-associated isoform identification system described herein can identify a disease-associated polypeptide or peptide variant of a particular gene. Stated another way, a disease-associated isoform identification system can identify a disease-associated variant (e.g., splice variant) of a particular gene, and an amino acid sequence of one or more variants of the gene may be aligned and/or outputted by a disease-associated isoform identification system. In certain instances, expression level values are within a dataset described herein. In certain instances, a dataset is received and/or stored in memory, and sometimes a dataset is in a database. A dataset may contain expression level values for transcripts in disease samples and non-disease samples from multiple tissues. Transcripts in a dataset can be virtual RNA transcripts (e.g., mRNA transcripts), and representative polynucleotide sequences (e.g., RNA, DNA and/or cDNA sequences) of transcripts or portions thereof can be included in a dataset. Transcripts in a dataset can correspond to amino acid sequence variants encoded by a gene, and a database can include representative amino acid sequences and/or amino acid subsequences corresponding to (e.g., translated from) transcripts or portions thereof. Transcripts can correspond to variants (e.g., splice variants) encoded by one or more genes, and a database can include expression level values associated with polynucleotide sequences, amino acid sequences and/or amino acid subsequences corresponding to transcripts or portions thereof.
A disease-associated isoform identification system described herein can include or can receive a dataset containing amino acid sequences of polypeptides and peptides encoded by genes and associated expression level information. If there are three polypeptide variants for a particular gene, for example, the system can include or receive (i) an amino acid sequence for each variant, and (ii) associated expression level information for each variant. Amino acid sequence information and associated expression level information can be stored in a system in any suitable format (e.g., a .tar archive). Expression level information can be stored in a system in any suitable manner, and sometimes is stored as normalized expression level information.
Candidate disease-associated polypeptides or peptide variants (e.g., tumor-specific isoforms) may be defined by comparing the isoform expression levels between normal and disease (e.g., tumor) conditions. In one embodiment, candidate tumor-specific isoforms are defined by comparing the isoform expression levels between normal and tumor conditions, by calculating the median TPM-based fold change, and the presence or absence of one candidate tumor-specific isoform. In another embodiment, candidate tumor-specific isoforms are defined by comparing isoform expression levels between normal and tumor conditions, by calculating the median FPKM-based fold change, and the presence or absence of one candidate tumor-specific isoform. In another embodiment, candidate tumor-specific isoforms are defined by comparing isoform expression levels between normal and tumor conditions, by calculating the median RPKM-based fold change, and the presence or absence of one candidate tumor-specific isoform. In another embodiment, candidate tumor-specific isoforms are defined by comparing isoform expression levels between normal and tumor conditions, by calculating the median RSEM-based fold change, and the presence or absence of one candidate tumor-specific isoform.
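By way of non-limiting illustration, a median-based fold change for one isoform can be sketched as follows. The same calculation would apply to TPM, FPKM, or RPKM values. The handling of a zero median in normal samples (reporting presence or absence rather than a ratio) is an assumption of this sketch, and the function name is hypothetical.

```python
from statistics import median

def median_fold_change(tumor_values, normal_values):
    """Median tumor expression divided by median normal expression.

    Returns float("inf") when the isoform is present in tumor samples but
    absent (zero median) in normal samples, and None when both medians are 0.
    """
    tumor_median = median(tumor_values)
    normal_median = median(normal_values)
    if normal_median == 0:
        # Assumption: treat this as a presence/absence call, not a ratio.
        return float("inf") if tumor_median > 0 else None
    return tumor_median / normal_median
```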
A dataset of a disease-associated isoform identification system, or a dataset received by a disease-associated isoform identification system, can include (i) expression level information associated with amino acid sequence variants for disease samples, and (ii) expression level information associated with amino acid sequence variants for non-disease samples. A dataset can include expression levels (e.g., average expression levels) of transcripts associated with particular amino acid sequence variants for disease samples. A dataset can include expression levels (e.g., average expression levels) of transcripts associated with particular amino acid sequence variants for non-disease samples. A dataset can include a composite expression level associated with a particular amino acid sequence variant for non-disease samples (e.g., for all non-disease samples or a subset of non-disease samples). A composite expression level often is an average of the average expression level in a dataset for each tissue of origin of non-disease samples (e.g., all non-diseased samples in a dataset or a subset of non-disease samples in a dataset).
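By way of non-limiting illustration, a composite expression level computed as an average of per-tissue averages can be sketched as follows. The function name and the dictionary layout are hypothetical.

```python
from statistics import mean

def composite_expression(normal_by_tissue):
    """Average of the per-tissue average expression levels.

    normal_by_tissue: {tissue_of_origin: [value per non-disease sample]}
    Each tissue contributes equally, regardless of how many samples it has.
    """
    tissue_means = [mean(values) for values in normal_by_tissue.values() if values]
    return mean(tissue_means)
```

Under this sketch, a tissue with many samples does not outweigh a tissue with few samples, which is the difference between a composite of per-tissue averages and a single average pooled over all non-disease samples.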
Each disease sample in a dataset of a disease-associated isoform identification system can be associated with a tissue of origin. In certain configurations, each disease sample in a dataset of a disease-associated isoform identification system can be associated with (i) a tissue of origin, and (ii) matched expression level values for each transcript derived from non-disease tissue adjacent to the disease tissue. In certain configurations, each disease sample in a dataset of a disease-associated isoform identification system can be associated with a tissue of origin, and is not associated with matched expression level values for each transcript derived from non-disease tissue adjacent to the disease tissue. Sample information can be from any suitable dataset or combined dataset, non-limiting examples of which include datasets from TCGA (World Wide Web Uniform Resource Locator cancer.gov/about-nci/organization/ccg/research/structural-genomics/tcga), TARGET (World Wide Web Uniform Resource Locator ocg.cancer.gov/programs/target), and GTEx (World Wide Web Uniform Resource Locator gtexportal.org/home/).
A disease sample generally is from a portion of a tissue of an organism identified as being diseased, and a non-disease sample generally is from a portion of a tissue of an organism identified as not being diseased. A disease sample and a non-disease sample sometimes are from the same subject, and sometimes are not from the same subject. A disease sample sometimes is from a portion of a tissue of an organism identified as being diseased, and a non-disease sample sometimes is from a portion of an adjacent tissue of an organism identified as not being diseased. For example, a disease sample sometimes is from a cancer tumor and a non-disease sample sometimes is from a non-tumor tissue adjacent to the tumor in the same subject. In some configurations, a disease-associated isoform identification system herein excludes datasets from non-tumor tissue adjacent to tumor tissue, and instead includes datasets from non-disease samples from subjects identified as not having a disease (e.g., datasets from GTEx). In some instances, the datasets from non-disease samples include expression data from multiple non-disease tissues.
A disease sometimes is a condition, and a disease or condition sometimes is diagnosed, inferred or suspected for a subject. Non-limiting examples of disease samples include samples from subjects having or suspected of having Alzheimer's disease, Parkinson's disease, Lupus, IPEX syndrome, diabetes, rheumatoid arthritis, influenza, pneumonia, or tuberculosis. A disease sample can be a cancer sample, and non-limiting examples of cancer samples include samples from subjects having or suspected of having a cancer type described herein.
A disease-associated isoform identification system can compare, for a selected disease: (i) an expression level (e.g., average expression level) of a particular amino acid sequence variant for disease samples, to (ii) a composite expression level of the particular amino acid sequence variant for non-disease samples (e.g., for all non-disease samples or a subset of non-disease samples). A disease-associated isoform identification system can compare, for a selected gene: (i) an expression level (e.g., average expression level) of a particular amino acid sequence variant for disease samples, to (ii) a composite expression level of the particular amino acid sequence variant for non-disease samples (e.g., for all non-disease samples or a subset of non-disease samples).
A disease-associated isoform identification system can perform several computations using expression level information in a dataset. A system can, for example: (a) compute an average expression level value for each transcript for disease samples; and/or (b) compute for each amino acid sequence variant a “related variant” value for disease samples and a “related variant” value for non-disease samples, where the “related variant” value is (i) the average (e.g., mean or median) expression level for the variant, divided by (ii) the sum of average expression level values for each variant of the gene; and/or (c) compute for each amino acid sequence variant a “fold change” value, where the “fold change” value is (i) the average expression level for the amino acid sequence variant in disease samples, divided by (ii) the average expression level for the amino acid sequence variant in non-disease samples. The “related variant” value can be expressed as a percentage referred to as an “expression percentage.” A computation described for part (a) also can include matching each average expression level value for each transcript with (i) a composite average expression level for the transcript for all non-disease samples or a subset of non-disease samples, and/or (ii) a highest tissue expression level identified from all non-disease samples or a subset of non-disease samples for the transcript. While each of the related variant value and the fold change value described in part (b) and part (c) is computed by dividing (i) by (ii), each ratio independently may be computed by dividing (ii) by (i).
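By way of non-limiting illustration, computations (a), (b), and (c) for the variants of a single gene can be sketched as follows. The sketch uses the mean as the average and returns no fold change when the non-disease average is zero; both choices are assumptions, and all names are hypothetical.

```python
from statistics import mean

def summarize_gene_variants(disease, normal):
    """disease, normal: {variant_id: [expression values]} for one gene."""
    d_avg = {v: mean(x) for v, x in disease.items()}  # computation (a)
    n_avg = {v: mean(x) for v, x in normal.items()}

    def related(avgs):
        # computation (b): variant average / sum of averages over all variants
        total = sum(avgs.values())
        return {v: (a / total if total else 0.0) for v, a in avgs.items()}

    d_rel, n_rel = related(d_avg), related(n_avg)
    summary = {}
    for v in d_avg:
        n = n_avg.get(v, 0.0)
        # computation (c): disease average / non-disease average
        # (assumption: undefined, returned as None, when the divisor is zero)
        fold = d_avg[v] / n if n else None
        summary[v] = {
            "disease_avg": d_avg[v],
            "normal_avg": n,
            "disease_related": d_rel[v],
            "normal_related": n_rel.get(v, 0.0),
            "expression_pct": 100.0 * d_rel[v],  # "expression percentage"
            "fold_change": fold,
        }
    return summary
```

For example, with disease values of 6 and 10 for one variant and 2 and 2 for a second variant of the same gene, the disease averages are 8 and 2, so the first variant has a "related variant" value of 0.8 (an expression percentage of 80%).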
After performing computations (a) and/or (b) and/or (c) described in the preceding paragraph, a disease-associated isoform identification system can generate a “disease sample only” variant list. Each variant selected for the list is expressed in disease samples but generally not expressed in non-diseased samples, where “not expressed” is defined as an expression level (e.g., TPM expression level) of less than 0.00001. The resulting list often is sorted by expression level value such that the most highly expressed “disease sample only” variants are at the top of the list. A configurable cutoff value can be applied to expression level in disease samples, whereby only amino acid variants associated with a value greater than or equal to the cutoff value are displayed. For example, a threshold of greater-than-or-equal-to a TPM expression level of 1.0, and an expression percentage of greater than 10% can be applied.
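By way of non-limiting illustration, generation of a “disease sample only” list with the example cutoffs given above can be sketched as follows. The input layout and all names are hypothetical.

```python
def disease_sample_only(summary, not_expressed_below=1e-5,
                        min_disease_level=1.0, min_expression_pct=10.0):
    """Return (variant_id, disease_avg) pairs, most highly expressed first.

    summary: {variant_id: {"disease_avg": float, "normal_avg": float,
                           "expression_pct": float}}
    """
    selected = [
        (vid, row["disease_avg"])
        for vid, row in summary.items()
        if row["normal_avg"] < not_expressed_below      # "not expressed" in non-disease
        and row["disease_avg"] >= min_disease_level     # configurable cutoff
        and row["expression_pct"] > min_expression_pct  # expression percentage filter
    ]
    return sorted(selected, key=lambda item: item[1], reverse=True)
```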
After performing computations (a) and/or (b) and/or (c) described above, a disease-associated isoform identification system can generate a “disease sample specific” variant list. Each variant selected for the list can be (i) the dominant variant in disease samples, and/or (ii) not the dominant variant in non-disease samples.
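By way of non-limiting illustration, generation of a “disease sample specific” list can be sketched as follows. The sketch assumes that the dominant variant of a gene is the variant with the highest average expression level among the variants of that gene, and it requires both conditions (i) and (ii); the text above defines neither choice, and all names are hypothetical.

```python
def disease_sample_specific(disease_avg_by_gene, normal_avg_by_gene):
    """Return (gene, variant_id) pairs where the variant dominates in disease
    samples but is not the dominant variant in non-disease samples.

    Both arguments: {gene: {variant_id: average expression level}}
    """
    selected = []
    for gene, d_avgs in disease_avg_by_gene.items():
        if not d_avgs:
            continue
        n_avgs = normal_avg_by_gene.get(gene, {})
        # Assumption: "dominant" means highest average expression in the gene.
        disease_dominant = max(d_avgs, key=d_avgs.get)
        normal_dominant = max(n_avgs, key=n_avgs.get) if n_avgs else None
        if disease_dominant != normal_dominant:
            selected.append((gene, disease_dominant))
    return selected
```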
After performing computations (a) and/or (b) and/or (c) described above, a disease-associated isoform identification system can generate a “disease upregulated” variant list. Each variant selected for the list exhibits a fold change value equal-to-or-greater-than a threshold value. The fold change expression level threshold can be configurable by a user, and a threshold value can be a two-fold threshold value, for example.
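By way of non-limiting illustration, generation of a “disease upregulated” list with a configurable, two-fold default threshold can be sketched as follows. Sorting the output by fold change and skipping undefined fold changes are assumptions of the sketch, and the names are hypothetical.

```python
def disease_upregulated(fold_changes, threshold=2.0):
    """Return (variant_id, fold_change) pairs with fold change >= threshold.

    fold_changes: {variant_id: fold change, or None if undefined}
    """
    selected = [
        (vid, fc) for vid, fc in fold_changes.items()
        if fc is not None and fc >= threshold
    ]
    return sorted(selected, key=lambda item: item[1], reverse=True)
```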
A disease-associated isoform identification system can generate a multi-sequence alignment (MSA) for each variant included in a list, which facilitates identification of an amino acid subsequence present in a particular variant that is not present in at least one other variant of a particular gene. A MSA often aligns a variant amino acid sequence with an amino acid sequence of at least one other variant encoded by the same gene. A MSA can be generated using any suitable sequence alignment algorithm, non-limiting examples of which include Clustal (e.g., ClustalW, ClustalW2, Clustal Omega), Multiple Alignment using Fast Fourier Transform (MAFFT), T-COFFEE, M-COFFEE, LALIGN, PSAlign, PRRN, PRRP, DIALIGN, MUSCLE, MergeAlign, Partial-Order Alignment (POA), Sequence Alignment and Modeling System (SAM), HMMER, PRANK, PAGAN, ProGraphMSA, MEME, MAST and EDNA. A disease-associated isoform identification system can generate a MSA based on a gene identifier, which can involve synching amino acid sequence databases having disparate gene identifier information. A non-limiting example of such a synching process is described hereafter. A TCGA variant model was based on the hg19 2009 version of the UCSC gene dataset (gene models built by UCSC as part of a genome browser). The UCSC table known as GenePep was downloaded for hg19 to obtain the protein sequence of each of the variants used in the TCGA analysis. Multiple versions of the hg19 USC gene models were released over time; the UCSC gene hg19 version 12 from 2009 was obtained that matched the gene models used in the TCGA analysis. A gene symbol scheme was constructed for transcript mapping from the TCGA reference dataset and the GenePep table was used to build a Sqlite database that supports looking up a gene symbol and returning all the protein sequences of the transcripts of the gene. These were then run through a multi-sequence alignment program and then through a format routine. 
The resulting structure allowed a protein alignment program to receive a gene symbol and produce a multi-sequence alignment that can be used to identify sections specific to a tumor-associated variant.
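The gene symbol lookup and the identification of variant-specific sections can be sketched as follows. This is an illustrative sketch under stated assumptions: the table name `gene_pep`, its column names, the gene symbol "GENE1", the transcript names, and the example sequences are all hypothetical, and the alignment itself is assumed to have been produced already by a separate alignment program (gaps shown as "-").

```python
import sqlite3


def build_sequence_db(records):
    """Build an in-memory SQLite database from (gene_symbol, transcript_id, protein_sequence) rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE gene_pep ("
        "gene_symbol TEXT NOT NULL, "
        "transcript_id TEXT PRIMARY KEY, "
        "protein_sequence TEXT NOT NULL)"
    )
    conn.execute("CREATE INDEX idx_gene_symbol ON gene_pep (gene_symbol)")
    conn.executemany("INSERT INTO gene_pep VALUES (?, ?, ?)", records)
    conn.commit()
    return conn


def sequences_for_gene(conn, gene_symbol):
    """Return {transcript_id: protein_sequence} for every transcript of the gene."""
    rows = conn.execute(
        "SELECT transcript_id, protein_sequence FROM gene_pep "
        "WHERE gene_symbol = ? ORDER BY transcript_id",
        (gene_symbol,),
    ).fetchall()
    return dict(rows)


def variant_specific_segments(aligned_variant, aligned_other):
    """Return runs of residues present in the variant where the other sequence has gaps.

    Both inputs are rows of the same alignment, so they have equal length.
    """
    segments, current = [], []
    for a, b in zip(aligned_variant, aligned_other):
        if a != "-" and b == "-":
            current.append(a)
        elif current:
            segments.append("".join(current))
            current = []
    if current:
        segments.append("".join(current))
    return segments


# Hypothetical example data.
conn = build_sequence_db([
    ("GENE1", "tx1", "MKTAYIAKQR"),
    ("GENE1", "tx2", "MKTAKQR"),
    ("GENE2", "tx3", "MSLLE"),
])
gene1_sequences = sequences_for_gene(conn, "GENE1")
# Rows as they might appear after alignment by an external program.
segments = variant_specific_segments("MKTAYIAKQR", "MKT---AKQR")
```

Here the lookup returns both GENE1 sequences, and the segment "AYI" is reported as present in the first variant but absent from the second.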
A disease-associated isoform identification system can generate a box plot of expression level values for disease samples and/or non-disease samples by tissue for a given variant. Data from any suitable dataset can be utilized for such a box plot, non-limiting examples of which include the TCGA dataset described herein, the TARGET dataset described herein, and the GTEx dataset described herein. A disease-associated isoform identification system can generate a box plot of expression level values of a variant in disease and/or non-disease samples for different tissues, and can generate: (i) upper whisker, lower whisker, upper quartile, lower quartile, and/or an average of the distribution of values for a selected disease type; and/or (ii) a maximum expression level value (e.g., maximum TPM value) for a non-disease sample from a relevant tissue. In some configurations, a disease-associated isoform identification system can generate a box plot of expression level values of a variant in disease and/or non-disease samples for different tissues, and can generate one or more of: average transcript expression value, first quantile of average transcript expression value, third quantile of average transcript expression value, maximum transcript expression value whisker, minimum transcript expression value whisker, and outlier designations for transcripts.
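The box plot statistics listed above can be computed along the following lines. This is a sketch only: the whisker convention (most extreme values within 1.5 times the interquartile range of the quartiles) and the inclusive quantile method are assumptions, since no particular convention is specified above, and the function name `box_plot_stats` is hypothetical.

```python
import statistics


def box_plot_stats(values):
    """Compute box plot statistics for a list of expression values (e.g., TPM).

    Assumes at least two values. Whiskers follow the common 1.5 * IQR
    convention, which is an assumption made for this sketch.
    """
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    inliers = [v for v in data if low_fence <= v <= high_fence]
    return {
        "lower_whisker": min(inliers),
        "lower_quartile": q1,
        "median": median,
        "upper_quartile": q3,
        "upper_whisker": max(inliers),
        "outliers": [v for v in data if v < low_fence or v > high_fence],
        "maximum": data[-1],
    }


# Hypothetical expression values for one variant in one tissue.
stats = box_plot_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 50])
```

For these values, the value 50 falls above the upper fence and is reported as an outlier, so the upper whisker is 9 while the maximum remains 50.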
A disease-associated isoform identification system can include an interface (e.g., a graphic interface) that facilitates selection of one or more features described herein. An interface can facilitate user interaction with a computing device by way of graphical elements, text-based elements, check boxes, drop-down menus, and the like. Graphical elements may include, for example, windows, buttons, icons, widgets, sliders, text boxes, spinner controls, and menus (e.g., drop down menus) that can be selected, highlighted, moved, and the like. Graphical elements may be manipulated using a touch screen, keyboard, stylus, and/or pointing device (e.g., mouse). In certain configurations, two or more screens of a user interface are linked together into a workflow, and navigation between screens may be possible in some instances. User interfaces may be web-based, for example, with parts of each page being loaded from a server and/or a database, or natively-compiled to execute on a user device with or without network availability.
A disease-associated isoform identification system can include an interface (e.g., a graphic interface) that facilitates selection of (i) a disease of interest (e.g., a tumor type of interest), (ii) a gene of interest according to a gene identifier (e.g., gene name, a gene tag (i.e., an abbreviated version of a gene name), accession number), or (iii) a tissue type of interest (e.g., TCGA, TARGET, and/or GTEx tissue type of interest). An interface can facilitate selection of one or more filters. An interface can facilitate selection of one or more filters that enhance identification of variants that are immunologic and/or can be immunologically targeted. An interface can facilitate selection of an expression level threshold filter that permits listing of only transcripts and/or variants associated with a minimum disease sample expression level (e.g., minimum disease sample TPM; minimum tumor TPM). An interface can facilitate input of an expression level threshold filter that permits listing only of transcripts and/or variants associated with a maximum non-disease sample expression level (e.g., maximum non-disease sample TPM; maximum normal TPM) or a median non-disease sample expression level (e.g., median non-disease sample TPM; median normal TPM). An interface can facilitate input of a filter that permits listing only of transcripts and/or variants associated with a selected subpopulation (e.g., a minimum percentage of samples meeting an expression level minimum (e.g., for diseased samples) and/or expression level maximum or median (e.g., for non-diseased samples)). In certain configurations, an interface can facilitate selection of an expression ratio threshold filter that permits listing of only transcripts and/or variants associated with a minimum ratio value.
An interface can facilitate selection of a filter that permits listing of only transcripts and/or variants that encode a cell-surface polypeptide (e.g., based on an annotation in a dataset specifying that a particular gene encodes a cell-surface polypeptide). An interface can facilitate selection of a filter that permits listing of only transcripts and/or variants having at least one insertion of a single amino acid or two or more consecutive amino acids relative to a canonical amino acid sequence (e.g., a unique peptide).
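One way the filters described above could be applied to precomputed per-transcript summaries is sketched below. The class names, field names, and default values are hypothetical and chosen only for illustration; the strict comparisons (greater than a minimum percentage, less than a maximum non-disease value) are one reasonable reading of the filter descriptions.

```python
from dataclasses import dataclass


@dataclass
class TranscriptSummary:
    """Hypothetical precomputed summary for one transcript in one disease."""
    transcript_id: str
    pct_disease_samples_above_min: float  # percent of disease samples above the chosen minimum
    max_non_disease_log2_tpm: float
    cell_surface: bool
    unique_peptide: bool


@dataclass
class Filters:
    """Hypothetical filter settings as they might be collected from an interface."""
    min_sample_pct: float = 10.0
    max_non_disease_log2_tpm: float = 0.0
    cell_surface_only: bool = False
    unique_peptide_only: bool = False


def apply_filters(summaries, filters):
    """Return summaries passing every selected filter, strongest subpopulation first."""
    kept = []
    for s in summaries:
        if s.pct_disease_samples_above_min <= filters.min_sample_pct:
            continue
        if s.max_non_disease_log2_tpm >= filters.max_non_disease_log2_tpm:
            continue
        if filters.cell_surface_only and not s.cell_surface:
            continue
        if filters.unique_peptide_only and not s.unique_peptide:
            continue
        kept.append(s)
    return sorted(kept, key=lambda s: s.pct_disease_samples_above_min, reverse=True)


summaries = [
    TranscriptSummary("tx1", 35.0, -1.0, True, True),
    TranscriptSummary("tx2", 60.0, -2.0, False, True),
    TranscriptSummary("tx3", 80.0, 1.5, True, True),   # too high in non-disease tissue
    TranscriptSummary("tx4", 5.0, -3.0, True, True),   # too few disease samples
]
all_hits = apply_filters(summaries, Filters())
surface_hits = apply_filters(summaries, Filters(cell_surface_only=True))
```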
An interface of a disease-associated isoform identification system can display a list of variants. An interface can output a sorted list of variants with the strongest disease/non-disease differences at the top of the list. In some configurations, two types of lists are available for each tissue: (i) variants expressed in disease samples but not expressed in non-disease samples, and (ii) variants expressed at significantly higher levels in disease samples than non-disease samples. An interface also can allow selection of a particular variant in a list for display of additional output pertaining to the variant, as described in further detail herein. For example, additional output can include (i) a detail panel showing a multi-sequence alignment for the selected variant and all other variants encoded by the gene; (ii) a box plot of expression of the variant in disease and/or non-disease samples for different tissues (e.g., a box plot for TCGA dataset and/or a box plot for GTEx dataset); (iii) three dimensional protein structure information for a selected variant; and/or (iv) protein domain information for a selected variant.
In certain instances, a gene identifier or transcript identifier (e.g., gene name, a gene tag (i.e., an abbreviated version of a gene name), transcript name, a transcript tag (i.e., an abbreviated version of a transcript name), accession number) can be inputted into a disease-associated isoform identification system, and an algorithm associated with the system can identify the gene identifier and/or transcript identifier in a pre-compiled database and retrieve the associated amino acid sequence from the database. An amino acid sequence can be retrieved by an algorithm associated with a disease-associated isoform identification system from a NCBI Protein database (e.g., World Wide Web Uniform Resource Locator ncbi.nlm.nih.gov/protein), for example. In certain instances, an amino acid sequence can be directly inputted into the disease-associated isoform identification system (e.g., copying and pasting an amino acid sequence). An input amino acid sequence can be a polypeptide amino acid sequence or a peptide amino acid sequence (e.g., a polypeptide or a peptide encoded by a gene or by a mRNA), or a portion of a polypeptide amino acid sequence or a peptide amino acid sequence, for example.
Provided herein are methods and systems for identifying a particular transcript of a gene. A system herein may refer to a machine or apparatus, in certain embodiments. A particular transcript of a gene may be referred to as an isoform. A transcript or an isoform may refer to a transcript variant or an isoform variant. A transcript variant or an isoform variant may refer to a splice variant. In some embodiments, a transcript is a disease-associated transcript. In some embodiments, a transcript is expressed at a level in diseased tissue higher than a level the transcript is expressed in non-diseased tissue. In some embodiments, a transcript is a cancer-associated transcript. In some embodiments, a transcript is expressed at a level in cancerous tissue higher than a level the transcript is expressed in non-cancer tissue. In some embodiments, a transcript is a tumor-associated transcript. In some embodiments, a transcript is expressed at a level in tumor tissue higher than a level the transcript is expressed in non-tumor tissue. In some embodiments, a transcript is expressed at a level in diseased, cancerous, and/or tumor tissues at least about 5% higher than a level the transcript is expressed in non-diseased, non-cancer, and/or non-tumor tissues. For example, a transcript may be expressed at a level in diseased, cancerous, and/or tumor tissues at least about 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 100%, 200%, or 500% higher than a level the transcript is expressed in non-diseased, non-cancer, and/or non-tumor tissues. In some embodiments, a transcript is expressed at a level in diseased, cancerous, and/or tumor tissues at least about 1.5 times higher than a level the transcript is expressed in non-diseased, non-cancer, and/or non-tumor tissues.
For example, a transcript may be expressed at a level in diseased, cancerous, and/or tumor tissues at least about 2 times, 3 times, 4 times, 5 times, 6 times, 7 times, 8 times, 9 times, 10 times, 20 times, 30 times, 40 times, 50 times, 60 times, 70 times, 80 times, 90 times, or 100 times higher than a level the transcript is expressed in non-diseased, non-cancer, and/or non-tumor tissues. The term disease-associated transcript may be used interchangeably herein with the terms cancer-associated transcript and/or tumor-associated transcript. The term diseased tissue may be used interchangeably herein with the terms cancerous tissue, cancerous tissue sample, tumor tissue and/or tumor tissue sample. The term non-diseased tissue may be used interchangeably herein with the terms non-cancer tissue, non-cancer tissue sample, non-tumor tissue, and/or non-tumor tissue sample.
In some embodiments, a transcript is expressed at a level in a defined subpopulation of diseased tissue samples higher than a level the transcript is expressed in non-diseased tissue samples. A defined subpopulation of diseased samples generally refers to a selected portion of total diseased samples for which expression data has been collected. A selected portion of diseased samples may be expressed as a fraction or percentage of the total number of diseased samples for which expression data has been collected. A defined subpopulation of diseased samples may refer to a defined subpopulation of samples for a particular disease. A defined subpopulation of diseased samples may refer to a defined subpopulation for multiple diseases.
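The notion of a defined subpopulation expressed as a percentage can be illustrated with the short sketch below. The function names and example values are hypothetical; the sketch assumes expression values are Log 2 TPM values and that the non-diseased comparison uses the maximum non-diseased value.

```python
def percent_samples_above(values, min_threshold):
    """Percentage of samples whose expression value exceeds min_threshold."""
    if not values:
        return 0.0
    above = sum(1 for v in values if v > min_threshold)
    return 100.0 * above / len(values)


def meets_subpopulation_criteria(disease_values, non_disease_values,
                                 min_disease_threshold, max_non_disease_threshold,
                                 min_sample_pct):
    """True when (1) more than min_sample_pct percent of diseased samples exceed the
    minimum threshold and (2) the maximum non-diseased value is below the maximum threshold."""
    pct = percent_samples_above(disease_values, min_disease_threshold)
    return pct > min_sample_pct and max(non_disease_values) < max_non_disease_threshold


# Hypothetical Log 2 TPM values for one transcript.
disease_log2_tpm = [0.5, 2.5, 3.0, 4.2, 1.0]
non_disease_log2_tpm = [-3.0, -1.5, -0.2]

pct = percent_samples_above(disease_log2_tpm, 2)
selected = meets_subpopulation_criteria(
    disease_log2_tpm, non_disease_log2_tpm,
    min_disease_threshold=2, max_non_disease_threshold=0, min_sample_pct=50.0,
)
```

In this example, three of five diseased samples exceed the threshold, giving a subpopulation of 60%, which is greater than the 50% minimum.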
Database
In some embodiments, a method herein includes use of a database. In some embodiments, a system herein includes a database. A database may comprise one or more tables that relate two or more features described herein. Two or more features may be related in any suitable tabular format (e.g., rows and columns, spreadsheet, and the like). In some embodiments, a database comprises one or more tables that relate transcript identifiers (e.g., transcript name, a transcript tag (i.e., an abbreviated version of a transcript name), accession number) to one or more features described herein. For example, one or more tables in a database may relate one or more of: transcript identifiers to genes; transcript identifiers to corresponding transcript expression values in diseased tissue samples; transcript identifiers to corresponding transcript expression values in non-diseased tissue samples; percentages of diseased tissue samples to corresponding transcript identifiers (e.g., for one or more defined diseases); average transcript expression values to corresponding transcript identifiers (e.g., for one or more defined diseases); average transcript expression values in diseased tissue samples, average transcript expression values for all non-diseased samples or a subset of non-disease samples, maximum transcript expression values of non-diseased samples, and non-diseased tissues corresponding to maximum transcript expression values, to corresponding transcript identifiers; average transcript expression values in diseased tissue samples, average transcript expression values for all non-diseased samples or a subset of non-disease samples, median transcript expression values of non-diseased samples, and non-diseased tissues corresponding to median transcript expression values, to corresponding transcript identifiers; transcript expression values and box plot statistics with corresponding transcript identifiers; and polypeptide linear sequences to corresponding transcript identifiers. 
In some embodiments, a database comprises one or more tables that relate three-dimensional structure coordinates and functional polypeptide domain information to the polypeptide linear sequences.
In some embodiments, a database described herein comprises one or more of the following tables: a samples table comprising one record for each sample and including phenotype information for each sample; a tissue table comprising one row for each tissue type and comprising tissues corresponding to diseased tissue samples and tissues corresponding to non-diseased tissue samples; a transcript table relating gene identifiers to corresponding transcript identifiers; a non-diseased sample statistics table relating transcript expression values to corresponding transcript identifiers for non-diseased samples, and comprising one or more of: average transcript expression value, first quantile of average transcript expression value, third quantile of average transcript expression value, maximum transcript expression value whisker, minimum transcript expression value whisker, and outlier designations for transcripts of non-diseased samples, for corresponding transcript identifiers; a non-diseased sample statistics by tissue table relating transcript expression values to corresponding transcript identifiers categorized by tissue, and comprising one or more of: average transcript expression value, first quantile of average transcript expression value, third quantile of average transcript expression value, maximum transcript expression value whisker, minimum transcript expression value whisker, outlier designations for transcripts of samples in each tissue, and the tissue having the highest expression value for each transcript identifier, for corresponding transcript identifiers; a diseased sample statistics by tissue table relating transcript expression values to transcript identifiers categorized by tissue, and comprising one or more of: average transcript expression value, first quantile of average transcript expression value, third quantile of average transcript expression value, maximum transcript expression value whisker, minimum transcript expression value whisker, outlier 
designations for transcripts of samples in each tissue; a diseased sample percentage table relating percentages of diseased tissue samples for each defined disease to corresponding transcript identifiers; an aligned linear sequences table relating transcript identifiers to corresponding linear polypeptide sequences; a transcript table, where the transcript table identifies a subset of gene identifiers each corresponding to a gene encoding a cell surface protein; a transcript table, where the transcript table identifies a subset of transcript identifiers each corresponding to a transcript encoding a unique polypeptide comprising an insertion, deletion or substitution of one or more amino acids relative to polypeptides encoded by other transcript isoforms of the same gene; and a transcript table, where the transcript table identifies a subset of transcript identifiers each as a primary transcript having an expression value (e.g., a general expression value) higher than the expression values corresponding to other transcript isoforms encoded by the same gene.
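A reduced version of such a set of tables can be sketched with SQLite as follows. This is not the schema of the database described herein: every table name, column name, and example row is hypothetical, only four of the listed tables are represented, and the cell-surface, unique-polypeptide, and primary-transcript designations are modeled as flag columns on a single transcript table for brevity.

```python
import sqlite3

# Hypothetical, simplified schema covering a subset of the tables described above.
SCHEMA = """
CREATE TABLE tissue (
    tissue_id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    is_diseased INTEGER NOT NULL
);
CREATE TABLE samples (
    sample_id TEXT PRIMARY KEY,
    tissue_id INTEGER NOT NULL REFERENCES tissue(tissue_id),
    phenotype TEXT
);
CREATE TABLE transcript (
    transcript_id TEXT PRIMARY KEY,
    gene_id TEXT NOT NULL,
    cell_surface INTEGER NOT NULL DEFAULT 0,
    unique_peptide INTEGER NOT NULL DEFAULT 0,
    primary_transcript INTEGER NOT NULL DEFAULT 0
);
CREATE TABLE diseased_sample_percentage (
    transcript_id TEXT NOT NULL REFERENCES transcript(transcript_id),
    tissue_id INTEGER NOT NULL REFERENCES tissue(tissue_id),
    min_log2_tpm INTEGER NOT NULL,
    pct_samples REAL NOT NULL,
    PRIMARY KEY (transcript_id, tissue_id, min_log2_tpm)
);
"""


def create_database():
    """Create an empty in-memory database with the sketch schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


conn = create_database()
conn.execute("INSERT INTO tissue VALUES (1, 'example_disease', 1)")
conn.executemany(
    "INSERT INTO transcript (transcript_id, gene_id, cell_surface) VALUES (?, ?, ?)",
    [("tx1", "GENE1", 1), ("tx2", "GENE1", 0)],
)
conn.executemany(
    "INSERT INTO diseased_sample_percentage VALUES (?, ?, ?, ?)",
    [("tx1", 1, 3, 42.0), ("tx2", 1, 3, 5.0)],
)

# Transcripts for which more than 10% of samples of the disease exceed Log 2 TPM 3.
rows = conn.execute(
    """
    SELECT t.transcript_id, p.pct_samples
    FROM transcript AS t
    JOIN diseased_sample_percentage AS p USING (transcript_id)
    WHERE p.tissue_id = ? AND p.min_log2_tpm = ? AND p.pct_samples > ?
    ORDER BY p.pct_samples DESC
    """,
    (1, 3, 10.0),
).fetchall()
```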
In some embodiments, for a diseased sample percentage table, each percentage is calculated based on transcript expression values for diseased tissue samples for which transcript expression values exceed a defined minimum transcript expression value threshold, and for each defined minimum sample sub-population percentage. In some embodiments, a diseased sample statistics by tissue table relates average transcript expression values for each defined disease to corresponding transcript identifiers. Each average transcript expression value may be calculated based on transcript expression values for diseased tissue samples for which transcript expression values exceed a defined minimum transcript expression value threshold, and for each defined minimum transcript expression value threshold of a plurality of defined minimum transcript expression value thresholds. Average transcript expression values may be mean transcript expression values. Average transcript expression values may be median transcript expression values.
In some embodiments, a database relates transcript identifiers to genes. In some embodiments, at least a portion of the transcript identifiers correspond to transcript isoform sets. Generally, each transcript isoform set is encoded by a gene (i.e., one gene per isoform set). Isoform sets may include one or more of a canonical sequence, a consensus sequence, and a variant sequence for a selected gene. In some embodiments, an isoform set includes a plurality of variant sequences for a selected gene. Variant sequences in an isoform set may include splice variants, for example.
In some embodiments, a database relates transcript identifiers to corresponding transcript expression values in diseased tissue samples. In some embodiments, a database relates transcript identifiers to corresponding transcript expression values in non-diseased tissue samples. Transcript expression value information (e.g., for diseased tissue samples and/or non-diseased tissue samples) can exist, for example, as transcripts per million (TPM) values, fragments per kilobase per million reads mapped (FPKM) values, reads per kilobase per million reads mapped (RPKM) values, RNA-seq by expectation-maximization (RSEM) values, or a combination of such values. In some embodiments, transcript expression values are transcripts per million (TPM) values. TPM values generally are normalized expression level values, and a TPM value for a particular gene/transcript represents the number of RNA molecules of that gene/transcript for every one million RNA molecules in the sample. A TPM value generally is determined for RNA-seq samples. In some embodiments, TPM values are or have been generated by an RNA-Seq by Expectation-Maximization (RSEM) algorithm (Li and Dewey (2011) BMC Bioinformatics 12:323 pages 1-16; incorporated by reference herein). TPM values may be expressed as logarithmic TPM values (e.g., binary logarithmic values). For example, TPM values may include Log 2 TPM values. Minimum Log 2 TPM values and/or TPM value thresholds (e.g., for diseased tissue samples) may include various precomputed tumor expression thresholds (e.g., to improve processing speed, make the database manageable, and make performance of the search tractable). Minimum Log 2 TPM values and/or TPM value thresholds (e.g., for diseased tissue samples) may include various precomputed tumor expression thresholds ranging from 1 to 10. Minimum Log 2 TPM values and/or TPM value thresholds (e.g., for diseased tissue samples) may include various precomputed tumor expression thresholds ranging from 1 to 10, incremented by 1.
Non-limiting examples of minimum Log 2 TPM values and/or TPM value thresholds (e.g., for diseased tissue samples) include 1, 2, 3, 4, 5, 6, 7, 8, 9, and 10. In some embodiments, a minimum Log 2 TPM value and/or TPM value threshold (e.g., for diseased tissue samples) is 2, 3, or 4. Maximum Log 2 TPM values and/or TPM value thresholds (e.g., for non-diseased tissues) can be any number and generally range from about −9.9 (for very low expression) to 18 (for extremely high expression). In certain configurations, a spinner control used in an interface adjusts maximum Log 2 TPM values in increments of 0.25. In certain configurations, any number for maximum Log 2 TPM values can be manually entered by a user. Non-limiting examples of maximum Log 2 TPM values and/or TPM value thresholds (e.g., for non-diseased tissues) include −1, −0.5, 0, 0.5, and 1. Transcript expression level information can exist, for example, as average expression level values. For example, expression level information can exist as average TPM values, average FPKM values, average RPKM values, average RSEM values, or a combination thereof. As used herein, an average value can be a mean, median, or mode value.
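The relationship between read counts, TPM values, and Log 2 TPM values can be sketched as follows. The TPM calculation shown is the standard length-normalized definition rather than the RSEM estimation procedure itself, and the pseudocount of 0.001 is an assumption: it is chosen here because it gives a floor of roughly −9.97 for unexpressed transcripts, which is consistent with the lower bound of about −9.9 noted above. Function names and example values are hypothetical.

```python
import math


def counts_to_tpm(counts, lengths):
    """Convert per-transcript read counts and transcript lengths to TPM values.

    Each count is divided by its transcript length, and the resulting rates are
    scaled so that they sum to one million.
    """
    rates = [count / length for count, length in zip(counts, lengths)]
    total = sum(rates)
    return [rate / total * 1_000_000 for rate in rates]


def log2_tpm(tpm, pseudocount=0.001):
    """Binary logarithm of a TPM value; the pseudocount (an assumption) avoids log of zero."""
    return math.log2(tpm + pseudocount)


# Hypothetical counts and lengths for three transcripts in one sample.
tpm_values = counts_to_tpm([10, 20, 30], [1000, 2000, 3000])
floor_value = log2_tpm(0.0)
```

Because each of the three transcripts has the same count-to-length ratio, each receives one third of the one million total, illustrating that TPM reflects relative abundance rather than raw read count.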
In some embodiments, one or more tables in a database comprise transcript expression values for non-diseased tissue samples. In some embodiments, non-diseased tissue samples were from multiple subjects different than the subjects from which diseased tissue samples were obtained. Non-diseased tissue sample transcript expression values may be obtained from any suitable source or dataset. In some embodiments, non-diseased tissue sample transcript expression values are obtained from a Genotype-Tissue Expression (GTEx) dataset. Without being limited by theory, this database feature allows for accurate identification of sub-populations of disease-associated transcripts due, in part, to a large population of baseline non-diseased sample transcript expression values. In some embodiments, non-diseased tissues comprise tissues chosen from adipose (e.g., subcutaneous, visceral (omentum)), adrenal gland, artery (e.g., aorta, coronary, tibial), brain (e.g., amygdala, anterior cingulate cortex (BA24), caudate (basal ganglia), cerebellar hemisphere, cerebellum, cortex, frontal cortex (BA9), hippocampus, hypothalamus, nucleus accumbens (basal ganglia), putamen (basal ganglia), spinal cord (cervical c-1), substantia nigra), breast (e.g., mammary tissue), colon (e.g., sigmoid, transverse), esophagus (e.g., gastroesophageal junction, mucosa, muscularis), heart (e.g., atrial appendage, left ventricle), kidney (e.g., cortex, medulla), liver, lung, minor salivary gland, muscle (e.g., skeletal), nerve (e.g., tibial), ovary, pancreas, pituitary, prostate, skin (e.g., not sun exposed (e.g., suprapubic), sun exposed (e.g., lower leg)), small intestine (e.g., terminal ileum), spleen, stomach, testis, thyroid, uterus, vagina, whole blood, bladder, cervix (e.g., endocervix, ectocervix), fallopian tube, and cells (e.g., EBV-transformed lymphocytes, cultured fibroblasts). In some embodiments, non-diseased tissues exclude cultured cells.
In some embodiments, one or more tables in a database comprise transcript expression values for diseased tissue samples. In some embodiments, diseased tissue samples were from multiple subjects different than the subjects from which non-diseased tissue samples were obtained. Diseased tissue sample transcript expression values may be obtained from any suitable source or dataset. In some embodiments, one or more tables comprise transcript expression values from diseased tissue from child subjects. In some embodiments, diseased tissue sample transcript expression values are obtained from the Therapeutically Applicable Research to Generate Effective Treatments (TARGET) dataset. In some embodiments, one or more tables comprise transcript expression values from diseased tissue from adult subjects. In some embodiments, diseased tissue sample transcript expression values are obtained from The Cancer Genome Atlas (TCGA) dataset. In some embodiments, one or more tables in a database comprise transcript expression values for diseased tissue samples from a TCGA database, and transcript expression values for non-diseased tissue samples from a GTEx database. In some database configurations, there is intra-dataset and inter-dataset subject independence. That is, subjects within each dataset (e.g., GTEx and TCGA) are independent (non-overlapping), and TCGA and GTEx subjects are different (TCGA subjects are individuals diagnosed with a tumor, GTEx subjects are tumor-free). In some embodiments, one or more tables in a database comprise transcript expression values for non-diseased tissue adjacent to diseased tissue in the same subjects from the TCGA database. Accordingly, in certain database configurations, TCGA diseased tissue (e.g., tumor tissue) is compared to matched TCGA adjacent non-diseased tissue (i.e., normal tissue).
In some embodiments, one or more tables in a database exclude transcript expression values for non-diseased tissue adjacent to diseased tissue in the same subjects from the TCGA database. Accordingly, in certain database configurations, TCGA diseased tissue (e.g., tumor tissue) is not compared to matched TCGA adjacent non-diseased tissue (i.e., normal tissue). TCGA adjacent non-diseased tissue (i.e., normal tissue) typically is sourced from diseased (e.g., cancer) patients. Such tissue may be subject to tumor field effects and/or may not match the tumor tissue of origin. For example, a tumor tissue of origin could be epithelial cells and adjacent normal tissue might include muscle or other tissue. Methods and systems herein generally comprise a highly tissue-specific transcript isoform analysis. In certain configurations, diseased samples are compared to multiple non-diseased samples from multiple tissues (e.g., about 7,400 non-diseased samples from about 47 tissues). Such configurations may provide a robust disease vs. non-disease (e.g., tumor vs. normal) transcript analysis, and may be useful for developing an immunotherapy that specifically targets cancer cells.
In some embodiments, transcript expression values in one or more tables have been processed according to a common gene model. In some embodiments, a common gene model is a single Gencode version (e.g., Gencode v23). A non-limiting example of a transcript expression value processor that uses a common gene model is TOIL. For example, the reprocessing of RNA-seq data from TCGA, TARGET, and GTEx by the Xena group at UCSC using the TOIL pipeline provided a single transcript expression dataset constructed with a common gene model, Gencode v23 (see e.g., World Wide Web Uniform Resource Locator xenabrowser.net/datapages/?hub=https://toil.xenahubs.net:443). Without being limited by theory, this system structure may allow for accurate identification of transcripts in sub-populations as expression values for different transcripts have been normalized. This system structure also may allow for addition of transcript expression level values for diseased tissue samples, processed according to the same gene model already utilized for database generation, without requiring addition of comparative expression values in non-diseased samples. Accordingly, in some embodiments, a method herein comprises entering additional transcript expression level values for diseased tissue samples, where the expression level values have been processed according to the same gene model, without entering an expression level value for a comparative non-diseased sample.
In some embodiments, one or more tables in a database comprise transcript expression values for multiple transcripts. For example, one or more tables in a database may comprise transcript expression values for at least 10,000 transcripts, at least 20,000 transcripts, at least 30,000 transcripts, at least 40,000 transcripts, at least 50,000 transcripts, at least 60,000 transcripts, at least 70,000 transcripts, at least 80,000 transcripts, at least 90,000 transcripts, or at least 100,000 transcripts. In some embodiments, one or more tables in a database comprise transcript expression values for at least 50,000 transcripts.
In some embodiments, one or more tables in a database comprise transcript expression values for multiple genes. For example, one or more tables in a database may comprise transcript expression values for at least 1,000 genes, at least 5,000 genes, at least 10,000 genes, at least 15,000 genes, at least 16,000 genes, at least 17,000 genes, at least 18,000 genes, at least 19,000 genes, at least 20,000 genes, at least 21,000 genes, at least 22,000 genes, at least 23,000 genes, at least 24,000 genes, at least 25,000 genes, or at least 30,000 genes. In some embodiments, one or more tables in a database comprise transcript expression values for at least 19,000 genes.
In some embodiments, one or more tables in a database comprise transcript expression values for multiple diseased samples. For example, one or more tables in a database may comprise transcript expression values for at least 1,000 diseased samples, at least 2,000 diseased samples, at least 3,000 diseased samples, at least 4,000 diseased samples, at least 5,000 diseased samples, at least 6,000 diseased samples, at least 7,000 diseased samples, at least 8,000 diseased samples, at least 9,000 diseased samples, at least 10,000 diseased samples, at least 11,000 diseased samples, at least 12,000 diseased samples, at least 13,000 diseased samples, at least 14,000 diseased samples, at least 15,000 diseased samples, at least 16,000 diseased samples, at least 17,000 diseased samples, at least 18,000 diseased samples, at least 19,000 diseased samples, or at least 20,000 diseased samples. In some embodiments, one or more tables in a database comprise transcript expression values for at least 10,000 diseased samples.
In some embodiments, one or more tables in a database comprise transcript expression values for multiple non-diseased samples. For example, one or more tables in a database may comprise transcript expression values for at least 1,000 non-diseased samples, at least 2,000 non-diseased samples, at least 3,000 non-diseased samples, at least 4,000 non-diseased samples, at least 5,000 non-diseased samples, at least 6,000 non-diseased samples, at least 7,000 non-diseased samples, at least 8,000 non-diseased samples, at least 9,000 non-diseased samples, at least 10,000 non-diseased samples, at least 11,000 non-diseased samples, at least 12,000 non-diseased samples, at least 13,000 non-diseased samples, at least 14,000 non-diseased samples, at least 15,000 non-diseased samples, at least 16,000 non-diseased samples, at least 17,000 non-diseased samples, at least 18,000 non-diseased samples, at least 19,000 non-diseased samples, or at least 20,000 non-diseased samples. In some embodiments, one or more tables in a database comprise transcript expression values for at least 7,000 non-diseased samples.
In some embodiments, one or more tables in a database comprise transcript expression values for multiple defined diseases. For example, one or more tables in a database may comprise transcript expression values for at least 20 defined diseases, at least 25 defined diseases, at least 30 defined diseases, at least 31 defined diseases, at least 32 defined diseases, at least 33 defined diseases, at least 34 defined diseases, at least 35 defined diseases, at least 36 defined diseases, at least 37 defined diseases, at least 38 defined diseases, at least 39 defined diseases, at least 40 defined diseases, at least 45 defined diseases, or at least 50 defined diseases. In some embodiments, one or more tables in a database comprise transcript expression values for at least 35 defined diseases.
In some embodiments, one or more tables in a database comprise transcript expression values for multiple non-diseased tissues. For example, one or more tables in a database may comprise transcript expression values for at least 30 non-diseased tissues, at least 35 non-diseased tissues, at least 40 non-diseased tissues, at least 41 non-diseased tissues, at least 42 non-diseased tissues, at least 43 non-diseased tissues, at least 44 non-diseased tissues, at least 45 non-diseased tissues, at least 46 non-diseased tissues, at least 47 non-diseased tissues, at least 48 non-diseased tissues, at least 49 non-diseased tissues, at least 50 non-diseased tissues, at least 55 non-diseased tissues, or at least 60 non-diseased tissues. In some embodiments, one or more tables in a database comprise transcript expression values for at least 45 non-diseased tissues.
In some embodiments, one or more tables in a database comprise transcript expression values for multiple transcripts for multiple genes, multiple diseased samples, multiple non-diseased samples, multiple defined diseases, and multiple non-diseased tissues. In some embodiments, one or more tables in a database comprise transcript expression values for at least 50,000 transcripts for at least 19,000 genes, at least 10,000 diseased samples, at least 7,000 non-diseased samples, at least 35 defined diseases, and at least 45 non-diseased tissues.
In some embodiments, a database comprises one or more precomputed features or values. A precomputed feature or value may be referred to as a pre-calculated feature or value, a pre-populated feature or value, a data bundle, a data bin, precomputed binned data, precomputed binned threshold, and the like. A feature or value may be precomputed during construction of a database such that when a user accesses the database, the precomputed feature or value is already present in the database. Without being limited by theory, rapid identification of transcripts and/or sub-populations of transcripts across large numbers of transcripts is due, in part, to the presence of precomputed features and/or values. A database that includes precomputed features or values may require less memory and provide faster search results compared to a database without precomputed features or values (e.g., a database comprising raw data). In some embodiments, for each transcript (e.g., each known protein coding transcript), a database comprises a precomputed maximum expression for non-diseased samples. For example, a database may comprise a box plot upper whisker TPM value in non-diseased tissues for each transcript. In some embodiments, for each transcript (e.g., each known protein coding transcript), a database comprises a precomputed median expression (e.g., median TPM expression value) for non-diseased samples. In some embodiments, for each transcript (e.g., each known protein coding transcript), a database comprises a plurality of precomputed minimum transcript expression value thresholds (e.g., TPM expression value thresholds) for diseased tissue samples. In some embodiments, for each transcript (e.g., each known protein coding transcript), a database comprises a percentage of samples exceeding predefined expression levels (e.g., TPM expression levels) in diseased tissue. 
In some embodiments, for each transcript (e.g., each known protein coding transcript), a database comprises an average expression (e.g., average TPM expression) in diseased tissue for each of the predefined expression levels (e.g., TPM expression levels). In some embodiments, for each transcript (e.g., each known protein coding transcript), a database comprises a plurality of precomputed transcript expression ratio thresholds (e.g., for transcript expression ratios generated according to transcript expression values for diseased tissue samples and transcript expression values for non-diseased tissues).
A method herein may comprise generating a database. A database may be generated by importing one or more datasets. A database may be generated by importing one or more datasets comprising transcript expression data, transcript isoform expression data, RNASeq data, and/or other transcript/protein data. Datasets may include, for example, datasets from data sources (e.g., The Cancer Genome Atlas (TCGA), Therapeutically Applicable Research to Generate Effective Treatments (TARGET), Genotype-Tissue Expression (GTEx), TOIL, Gencode (e.g., Gencode v23), a surfaceome dataset, and Matched Annotation from the NCBI and EBI (MANE)). A database may be generated by relating two or more features described above. A database may be generated by importing one or more tables that relate two or more features described herein. For example, a method for generating a database may comprise one or more of (i) relating transcript identifiers to genes, where at least a portion of the transcript identifiers correspond to transcript isoform sets, and each of the transcript isoform sets is encoded by a gene; (ii) relating transcript identifiers to corresponding transcript expression values in diseased tissue samples for a plurality of defined diseases, and (iii) relating percentages of diseased tissue samples for each defined disease to corresponding transcript identifiers, where the percentages are based on an amount of diseased tissue samples corresponding to a defined disease for which transcript expression values of a corresponding transcript identifier exceed a defined minimum transcript expression value threshold. 
In certain configurations, a method for generating a database may comprise one or more of (i) relating transcript identifiers to genes, where at least a portion of the transcript identifiers correspond to transcript isoform sets, and each of the transcript isoform sets is encoded by a gene; (ii) relating transcript identifiers to corresponding transcript expression values in diseased tissue samples for a plurality of defined diseases, and (iii) relating percentages of diseased tissue samples for each defined disease to corresponding transcript identifiers, where the percentages are based on an amount of diseased tissue samples corresponding to a defined disease for which transcript expression ratios of a corresponding transcript identifier exceed a defined transcript expression ratio threshold. An example method for generating a database is provided in Example 1.
Input
In some embodiments, a method herein comprises receiving user input. User input may be received by way of an interface (e.g., a graphical user interface). Examples of a user input interface are provided in
Each of the defined diseases may correspond to diseased tissue samples (e.g., tissue samples from subjects having, or suspected of having, the defined diseases). Non-limiting examples of defined diseases include Alzheimer's disease, Parkinson's disease, Lupus, IPEX syndrome, diabetes, rheumatoid arthritis, influenza, pneumonia, tuberculosis, acute myeloid leukemia, adrenocortical carcinoma, bladder urothelial carcinoma, brain lower grade glioma, breast invasive carcinoma, cervical squamous cell carcinoma and endocervical adenocarcinoma, cholangiocarcinoma, colon adenocarcinoma, esophageal carcinoma, glioblastoma multiforme, head and neck squamous cell carcinoma, kidney chromophobe, kidney renal clear cell carcinoma, kidney renal papillary cell carcinoma, liver hepatocellular carcinoma, lung adenocarcinoma, lung squamous cell carcinoma, mesothelioma, ovarian serous cystadenocarcinoma, pancreatic adenocarcinoma, pheochromocytoma and paraganglioma, prostate adenocarcinoma, rectum adenocarcinoma, sarcoma, skin cutaneous melanoma, stomach adenocarcinoma, testicular germ cell tumors, thymoma, thyroid carcinoma, uterine carcinosarcoma, uterine corpus endometrial carcinoma, uveal melanoma, chronic myeloid leukemia, acute lymphocytic leukemia, multiple myeloma, non-Hodgkin lymphoma, Hodgkin lymphoma, marginal zone lymphoma, follicular lymphoma, small lymphocyte lymphoma, B-cell lymphoma, diffuse large B-cell lymphoma or mantle cell lymphoma, acute lymphoblastic leukemia, adrenocortical cancer, cervical and endocervical cancer, clear cell sarcoma of the kidney, kidney clear cell carcinoma, kidney papillary cell carcinoma, neuroblastoma, testicular germ cell tumor, uterine corpus endometrioid carcinoma, Wilms tumor, carcinoma, lymphoma, blastoma, sarcoma, leukemia, squamous cell cancer, small-cell lung cancer, non-small cell lung cancer, adenocarcinoma of the lung, squamous carcinoma of the lung, cancer of the peritoneum, hepatocellular cancer, gastrointestinal cancer, 
pancreatic cancer, glioblastoma, cervical cancer, ovarian cancer, liver cancer, bladder cancer, hepatoma, breast cancer, colon cancer, colorectal cancer, endometrial or uterine carcinoma, salivary gland carcinoma, kidney cancer, liver cancer, prostate cancer, vulval cancer, thyroid cancer, hepatic carcinoma, various types of head and neck cancer, and the like.
In some embodiments, defined diseases comprise defined cancers. Defined cancers may be chosen from acute lymphoblastic leukemia, acute myeloid leukemia, adrenocortical cancer, bladder urothelial carcinoma, brain lower grade glioma, breast invasive carcinoma, cervical & endocervical cancer, cholangiocarcinoma, clear cell sarcoma of the kidney, colon adenocarcinoma, diffuse large B-cell lymphoma, esophageal carcinoma, glioblastoma multiforme, head & neck squamous cell carcinoma, kidney chromophobe, kidney clear cell carcinoma, kidney papillary cell carcinoma, liver hepatocellular carcinoma, lung adenocarcinoma, lung squamous cell carcinoma, mesothelioma, neuroblastoma, ovarian serous cystadenocarcinoma, pancreatic adenocarcinoma, pheochromocytoma & paraganglioma, prostate adenocarcinoma, rectum adenocarcinoma, sarcoma, skin cutaneous melanoma, stomach adenocarcinoma, testicular germ cell tumor, thymoma, thyroid carcinoma, uterine carcinosarcoma, uterine corpus endometrioid carcinoma, uveal melanoma, and Wilms tumor.
In some embodiments, a defined feature is a gene. Accordingly, in some embodiments, a defined gene of a plurality of defined genes is received as input (see e.g.,
In some embodiments, a defined feature is a defined minimum transcript expression value threshold for diseased tissue samples. In some embodiments, a defined minimum transcript expression value threshold for diseased tissue samples is selected from a plurality of defined minimum transcript expression value thresholds for diseased tissue samples. In some embodiments, a database comprises a plurality of precomputed minimum transcript expression value thresholds for diseased tissue samples, and a user selects one of the precomputed minimum transcript expression value thresholds. Precomputed minimum transcript expression value thresholds may be referred to as precomputed binned thresholds for diseased tissues.
In some embodiments, a defined feature is a defined maximum transcript expression value for non-diseased tissues. For non-diseased or normal tissues, a database may include statistical summarizations of normal tissues such as, for example, high value, low value, 25th percentile, mean, 75th percentile, and outliers. In addition, a database may summarize for each transcript the highest value seen in any normal tissue and identify the tissue. In some embodiments, a database may include for each transcript, a precomputed box plot upper whisker TPM value in non-diseased tissues, sometimes referred to as max normal expression. In some embodiments, a defined maximum transcript expression value threshold for non-diseased tissues is selected from a plurality of defined maximum transcript expression value thresholds for non-diseased tissues. In some embodiments, a database comprises a plurality of precomputed maximum transcript expression value thresholds for non-diseased tissue samples, and a user selects one of the precomputed maximum transcript expression value thresholds. Precomputed maximum transcript expression value thresholds may be referred to as precomputed binned thresholds for non-diseased tissues. In some embodiments, a database does not comprise precomputed maximum transcript expression value thresholds for non-diseased tissue samples, and a user selects or enters a maximum transcript expression value for non-diseased tissues.
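By way of non-limiting illustration, a box plot upper whisker value of the kind mentioned above could be derived as in the following Python sketch. The whisker convention used here (the largest observation not exceeding the third quartile plus 1.5 times the interquartile range) is an assumption, as are the function names `upper_whisker` and `max_normal_expression` and the choice to report the tissue having the highest whisker; other conventions may be used.

```python
import statistics


def upper_whisker(values, k=1.5):
    """Largest observation <= Q3 + k * IQR (assumed Tukey-style convention).

    Requires at least two observations.
    """
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    fence = q3 + k * (q3 - q1)
    return max(v for v in values if v <= fence)


def max_normal_expression(tpm_by_tissue):
    """Return (tissue, whisker) for the tissue with the highest upper whisker.

    tpm_by_tissue maps a tissue name to a list of TPM values for one transcript.
    """
    whiskers = {tissue: upper_whisker(v) for tissue, v in tpm_by_tissue.items()}
    top_tissue = max(whiskers, key=whiskers.get)
    return top_tissue, whiskers[top_tissue]
```

In this sketch an outlying sample (e.g., a single very high TPM value in one tissue) does not raise the summarized value, which is one reason a whisker value may be preferred over a raw maximum.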
In some embodiments, a defined feature is a defined median transcript expression value for non-diseased tissues. A median transcript expression value generally refers to the median (or middle value separating the higher half from the lower half of a data set) expression value for all (or a subset of) non-diseased tissues used in the database. Generally, a median is the middle value when all values are put in sequential order. If there is an even number of values, it is the average of the two values closest to the middle. In some embodiments, the median expression value is a median of the mean values in a non-diseased tissues expression database (e.g., GTEx). In certain configurations, an application with certain user settings compares the median tumor value for a selected tumor type to the highest median value of the normal tissues.
In some embodiments, a defined median transcript expression value threshold for non-diseased tissues is selected from a plurality of defined median transcript expression value thresholds for non-diseased tissues. In some embodiments, a database comprises a plurality of precomputed median transcript expression value thresholds for non-diseased tissue samples, and a user selects one of the precomputed median transcript expression value thresholds. Precomputed median transcript expression value thresholds may be referred to as precomputed binned thresholds for non-diseased tissues. In some embodiments, a database does not comprise precomputed median transcript expression value thresholds for non-diseased tissue samples, and a user selects or enters a median transcript expression value for non-diseased tissues.
In some embodiments, a defined feature is a defined transcript expression ratio threshold. A defined transcript expression ratio threshold may also be referred to as a defined transcript expression ratio value threshold. A defined transcript expression ratio threshold generally is generated according to transcript expression values for diseased tissue samples and transcript expression values for non-diseased tissues. In certain configurations, a defined transcript expression ratio threshold is a ratio of a transcript expression value for diseased tissue samples to a transcript expression value for non-diseased tissues. In some embodiments, a defined transcript expression ratio threshold is selected from a plurality of defined transcript expression ratio thresholds. In some embodiments, a database comprises a plurality of precomputed defined transcript expression ratio thresholds generated according to expression data for diseased and non-diseased tissue samples, and a user selects one of the precomputed transcript expression ratio thresholds. Precomputed transcript expression ratio thresholds may be referred to as precomputed binned ratio thresholds. In certain configurations, a ratio is of mean or median diseased tissue expression to non-diseased tissue expression, where the non-diseased tissue expression is the expression value that shows the highest mean among the tissues of choice. In certain configurations, one or more non-diseased tissues may be excluded from this calculation, for example the testis, and/or other tissues considered non-vital or irrelevant. In certain configurations, a ratio is a TPM tumor value divided by a TPM non-disease value. For example, when a user selects ‘maximum’ for non-diseased tissue, the ratio is of the median of the selected tumor type divided by the maximum TPM value observed in any of the non-diseased tissues. 
When a user selects ‘median’ for non-diseased tissue, the ratio is of the median of the selected tumor type divided by the highest median TPM value observed in any non-diseased tissue. The same logic outlined above is true if a subpopulation is specified except that the numerator is the value from the subset of tumor samples that meet the subpopulation criterion as described herein.
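By way of non-limiting illustration, the ratio calculation described above could be sketched in Python as follows. The function name `expression_ratio`, its parameters, and the handling of a zero denominator are assumptions made for illustration only. When a subpopulation is specified, the list passed as `tumor_tpm` would contain only the values from tumor samples meeting the subpopulation criterion.

```python
from statistics import median


def expression_ratio(tumor_tpm, normal_tpm_by_tissue, mode="maximum", excluded=()):
    """Median tumor TPM divided by a non-diseased reference TPM.

    mode="maximum": reference is the maximum TPM observed in any retained tissue.
    mode="median":  reference is the highest per-tissue median TPM.
    excluded: tissue names left out of the reference (e.g., testis).
    """
    retained = {t: v for t, v in normal_tpm_by_tissue.items() if t not in excluded}
    if not retained:
        raise ValueError("at least one non-diseased tissue must be retained")
    if mode == "maximum":
        reference = max(max(v) for v in retained.values())
    elif mode == "median":
        reference = max(median(v) for v in retained.values())
    else:
        raise ValueError("mode must be 'maximum' or 'median'")
    if reference == 0:
        return float("inf")  # assumed convention for no detectable normal expression
    return median(tumor_tpm) / reference
```

For example, with tumor values of 10, 20 and 30 TPM and a highest retained normal value of 40 TPM, the 'maximum' setting in this sketch yields a ratio of 0.5.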
In some embodiments, average transcript expression levels for diseased tissue samples can be calculated after a defined minimum transcript expression value threshold is received. In some embodiments, average transcript expression levels for diseased tissue samples can be calculated before a defined minimum transcript expression value threshold is received. In some embodiments, a database is pre-populated with average transcript expression levels calculated for each of the defined minimum expression thresholds. Without being limited by theory, rapid identification of sub-populations of transcripts across large numbers of transcripts is due, in part, to pre-population of calculated average transcript expression for an input defined disease for each minimum expression threshold. In some embodiments, average transcript expression levels for diseased tissue samples can be calculated after a defined transcript expression ratio threshold is received. In some embodiments, average transcript expression levels for diseased tissue samples can be calculated before a defined transcript expression ratio threshold is received. In some embodiments, a database is pre-populated with average transcript expression levels calculated for each of the defined transcript expression ratio thresholds. Without being limited by theory, rapid identification of sub-populations of transcripts across large numbers of transcripts is due, in part, to pre-population of calculated average transcript expression for an input defined disease for each transcript expression ratio threshold. In certain configurations, for diseased samples, a precompute is done for each of the possible threshold values (1 to 10, in increments of 1). At each threshold level, the percentage of samples that exceed the threshold is tallied. This supports fast target searches with a reasonable size database including subpopulation searches. 
In certain configurations, high value, low value, 25th percentile, mean, 75th percentile and outliers for each transcript for each disease type are computed and/or precomputed. In certain configurations, the percentage of samples that exceed a threshold for each possible disease TPM threshold for every transcript is computed and/or precomputed.
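By way of non-limiting illustration, the precompute described above for one transcript in one defined disease could be sketched in Python as follows. The threshold values (1 to 10 in increments of 1) follow the description; the function name `precompute_bins`, the row layout, and the use of a mean as the average are assumptions made for illustration only.

```python
from statistics import mean

THRESHOLDS = range(1, 11)  # TPM thresholds 1..10 in increments of 1


def precompute_bins(tumor_tpm):
    """For each threshold, tally the percentage of samples exceeding it and
    the mean TPM of that subpopulation (None when no sample exceeds it)."""
    total = len(tumor_tpm)
    if total == 0:
        raise ValueError("at least one diseased sample is required")
    rows = []
    for threshold in THRESHOLDS:
        above = [v for v in tumor_tpm if v > threshold]
        rows.append({
            "threshold": threshold,
            "pct_above": 100.0 * len(above) / total,
            "avg_above": mean(above) if above else None,
        })
    return rows
```

Storing one such row per transcript, disease, and threshold allows a later search to read the percentage and average directly rather than scanning per-sample values.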
In some embodiments, average transcript expression levels for non-diseased tissue samples can be calculated after a defined maximum transcript expression value threshold is received. In some embodiments, average transcript expression levels for non-diseased tissue samples can be calculated before a defined maximum transcript expression value threshold is received. In some embodiments, a database is pre-populated with average transcript expression levels calculated for each of the defined maximum expression thresholds. In some embodiments, a database is not pre-populated with average transcript expression levels calculated for maximum expression values.
In some embodiments, average transcript expression levels for non-diseased tissue samples can be calculated after a defined median transcript expression value threshold is received. In some embodiments, average transcript expression levels for non-diseased tissue samples can be calculated before a defined median transcript expression value threshold is received. In some embodiments, a database is pre-populated with average transcript expression levels calculated for each of the defined median expression thresholds. In some embodiments, a database is not pre-populated with average transcript expression levels calculated for median expression values.
In some embodiments, one or more tables in a database relate average transcript expression values to corresponding transcript identifiers for each defined disease. Accordingly, for a defined disease, a user can obtain average transcript expression values for one or more transcripts. In some embodiments, each of the average transcript expression values is calculated based on transcript expression values for diseased tissue samples for which the transcript expression values exceed a defined minimum transcript expression value threshold. In some embodiments, each of the average transcript expression values is calculated for each defined minimum transcript expression value threshold of a plurality of defined minimum transcript expression value thresholds. In some embodiments, each of the average transcript expression values is precomputed for each defined minimum transcript expression value threshold of a plurality of defined minimum transcript expression value thresholds. In some embodiments, each of the average transcript expression values is calculated based on transcript expression values for diseased tissue samples for which the transcript expression values exceed a defined transcript expression ratio threshold. In some embodiments, each of the average transcript expression values is calculated for each defined transcript expression ratio threshold of a plurality of defined transcript expression ratio thresholds.
In some embodiments, each of the average transcript expression values is precomputed for each defined transcript expression ratio threshold of a plurality of defined transcript expression ratio thresholds. In some embodiments, a list of outputted transcripts for a defined disease comprises an average transcript expression value for each of the transcript identifiers listed. An average transcript expression value may be a mean transcript expression value. An average transcript expression value may be a median transcript expression value.
In some embodiments, a defined feature is a defined minimum sample sub-population percentage. In some embodiments, a defined minimum sample sub-population percentage is selected from a plurality of defined minimum sample sub-population percentages. In some embodiments, a database comprises a plurality of precomputed minimum sample sub-population percentages, and a user selects one of the precomputed minimum sample sub-population percentages. A defined minimum sample sub-population percentage can range from about 1% to about 100%. For example, a defined minimum sample sub-population percentage may be about 1%, 2%, 3%, 4%, 5%, 6%, 7%, 8%, 9%, 10%, 15%, 20%, 25%, 30%, 35%, 40%, 45%, 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 95%, or 100%. In some embodiments, a defined minimum sample sub-population percentage can range from about 10% to about 40%. For example, a defined minimum sample sub-population percentage may be about 10%, 11%, 12%, 13%, 14%, 15%, 16%, 17%, 18%, 19%, 20%, 21%, 22%, 23%, 24%, 25%, 26%, 27%, 28%, 29%, 30%, 31%, 32%, 33%, 34%, 35%, 36%, 37%, 38%, 39%, or 40%.
In some embodiments, one or more tables in a database relate percentages of diseased tissue samples to corresponding transcript identifiers for each defined disease. Accordingly, for a defined disease, and for a defined minimum expression value in diseased tissue samples, a user can select a percentage of diseased tissue samples and obtain a list of transcripts that have at least the selected minimum expression value in the selected percentage of diseased samples. In some embodiments, for a defined disease, for a defined minimum expression value in diseased tissue samples, and for a defined maximum or defined median expression value in non-diseased tissue, a user can select a percentage of diseased tissue samples and obtain a list of transcripts that have at least the selected minimum expression value (and the selected maximum or selected median expression value in non-diseased tissue) in the selected percentage of diseased samples. In certain configurations, for a defined disease, and for a defined transcript expression ratio threshold, a user can select a percentage of diseased tissue samples and obtain a list of transcripts that have at least the selected transcript expression ratio threshold in the selected percentage of diseased samples. A percentage of diseased samples can range from about 1% to about 100%. For example, about 1%, 2%, 3%, 4%, 5%, 6%, 7%, 8%, 9%, 10%, 15%, 20%, 25%, 30%, 35%, 40%, 45%, 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 95%, or 100% of diseased samples may be selected. In some embodiments, the percentage of diseased samples can range from about 10% to about 40%. For example, about 10%, 11%, 12%, 13%, 14%, 15%, 16%, 17%, 18%, 19%, 20%, 21%, 22%, 23%, 24%, 25%, 26%, 27%, 28%, 29%, 30%, 31%, 32%, 33%, 34%, 35%, 36%, 37%, 38%, 39%, or 40% of diseased samples may be selected.
In some embodiments, each of the percentages is calculated based on an amount of diseased tissue samples for which the transcript expression values of a corresponding transcript identifier exceed the defined minimum transcript expression value threshold. In some embodiments, each of the percentages is calculated for each defined minimum transcript expression value threshold of the plurality of defined minimum transcript expression value thresholds. In some embodiments, each of the percentages equals (i) the number of diseased samples for a defined disease for which transcript expression values for a corresponding transcript identifier exceed the defined minimum transcript expression value threshold, divided by (ii) the total number of diseased samples for the defined disease. In some embodiments, a list of outputted transcript identifiers comprises a percentage of samples exceeding the defined minimum transcript expression value threshold for each of the transcript identifiers listed. In some embodiments, a database is pre-populated with diseased sample percentages calculated for each of the defined minimum expression thresholds. Without being limited by theory, rapid identification of sub-populations of transcripts across large numbers of transcripts is due, in part, to pre-population of calculated percentages for each minimum expression threshold.
In some embodiments, each of the percentages is calculated based on an amount of diseased tissue samples for which the transcript expression ratios of a corresponding transcript identifier exceed the defined transcript expression ratio threshold. In some embodiments, each of the percentages is calculated for each defined transcript expression ratio threshold of the plurality of defined transcript expression ratio thresholds. In some embodiments, each of the percentages equals (i) the number of diseased samples for a defined disease for which transcript expression ratio values for a corresponding transcript identifier exceed the defined transcript expression ratio threshold, divided by (ii) the total number of diseased samples for the defined disease. In some embodiments, a list of outputted transcript identifiers comprises a percentage of samples exceeding a defined transcript expression ratio threshold for each of the transcript identifiers listed. In some embodiments, a database is pre-populated with diseased sample percentages calculated for each of the defined transcript expression ratio thresholds. Without being limited by theory, rapid identification of sub-populations of transcripts across large numbers of transcripts is due, in part, to pre-population of calculated percentages for each transcript expression ratio threshold.
In some embodiments, a method herein comprises identifying one or more transcript identifiers meeting certain criteria. In some embodiments, a method herein comprises identifying one or more transcript identifiers meeting certain criteria for a percentage of diseased samples greater than a defined minimum sub-population percentage. In some embodiments, a method herein comprises identifying one or more transcript identifiers meeting certain criteria for a percentage of diseased samples corresponding to a defined disease greater than a defined minimum sub-population percentage. A percentage of diseased samples can be calculated after the defined minimum sample sub-population percentage is received, or before it is received (e.g., the percentage is precomputed). Without being limited by theory, a system structure described herein allows for rapid identification of sub-populations of transcripts according to a percentage threshold. One or more transcript identifiers may be identified in a database (e.g., a database described herein). Criteria for identifying one or more transcript identifiers may include a corresponding transcript expression value in diseased tissue samples (corresponding to an input defined disease) that is greater than an input defined minimum transcript expression value threshold, and/or a corresponding transcript expression value in non-diseased tissue that is less than an input defined maximum or defined median transcript expression value threshold. In certain configurations, criteria for identifying one or more transcript identifiers may include a corresponding transcript expression ratio value (corresponding to an input defined disease) that is greater than an input defined transcript expression ratio threshold.
In some embodiments, a method herein comprises receiving user input where one or more filters is selected. For example, a user may select a cell surface filter. Selection of a cell surface filter allows for rapid identification of transcripts that can be targeted (e.g., by a binding molecule not requiring cellular uptake, for development of a therapy or diagnostic). In some embodiments, one or more tables in a database described herein comprise a transcript table. A transcript table can identify a subset of gene identifiers each corresponding to a gene encoding a cell surface protein. User input may comprise selecting a filter for outputting transcript identifiers corresponding to genes encoding cell surface proteins. Accordingly, a method herein may comprise identifying one or more transcript identifiers corresponding to genes encoding cell surface proteins.
In some embodiments, a user may select a unique polypeptide filter. Selection of a unique polypeptide filter allows for rapid identification of transcripts that can be targeted with minimized off-target effects (e.g., by a binding molecule with enhanced specificity for a particular target, for development of a therapy or diagnostic). In some embodiments, one or more tables in a database described herein comprise a transcript table. A transcript table can identify a subset of transcript identifiers each corresponding to a transcript encoding a unique polypeptide comprising an insertion, deletion or substitution of one or more amino acids relative to polypeptides encoded by other transcript isoforms of the same gene (e.g., encoded by a canonical isoform). Accordingly, a method herein may comprise identifying one or more transcript identifiers corresponding to transcripts encoding unique polypeptides.
In some embodiments, a user may select a unique or partially unique polypeptide filter. Such a filter would identify transcripts that encode a unique polypeptide as described above and transcripts that encode a partially unique polypeptide. A partially unique polypeptide refers to a polypeptide encoded by more than one, but not all, transcript isoforms for a select gene. Selection of a unique/partially unique polypeptide filter allows for rapid identification of transcripts that can be targeted with some but not all possible off-target effects. In some embodiments, a transcript table can identify a subset of transcript identifiers each corresponding to transcripts encoding a partially unique polypeptide comprising an insertion, deletion or substitution of one or more amino acids relative to polypeptides encoded by some of the other transcript isoforms of the same gene (e.g., encoded by a canonical isoform). Accordingly, a method herein may comprise identifying one or more transcript identifiers corresponding to transcripts encoding partially unique polypeptides. In some embodiments, user input comprises selection of a function for merging the expression values for transcripts encoding partially unique polypeptides, thereby generating merged transcript expression values. In certain configurations, transcript expression values are merged among non-diseased tissues, and are not merged for diseased tissues. In combining expression values for partially unique transcript isoforms, an example approach is to take a high percentile (e.g., the highest 70th, 75th or 80th percentile) and whisker high value, average the medians, and take a low percentile (e.g., the lowest 20th, 25th or 30th percentile) and whisker low value.
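By way of non-limiting illustration, one possible reading of the example merging approach above is sketched below in Python: across the isoforms being merged, the highest 75th percentile and highest upper whisker are kept, the medians are averaged, and the lowest 25th percentile and lowest lower whisker are kept. The `BoxStats` structure, the function name `merge_box_stats`, and the choice of the 25th and 75th percentiles are assumptions made for illustration only.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class BoxStats:
    """Box plot summary of one transcript isoform in one non-diseased tissue."""
    whisker_low: float
    q1: float       # 25th percentile
    median: float
    q3: float       # 75th percentile
    whisker_high: float


def merge_box_stats(stats):
    """Merge summaries of isoforms that encode the same partially unique polypeptide."""
    if not stats:
        raise ValueError("at least one isoform summary is required")
    return BoxStats(
        whisker_low=min(s.whisker_low for s in stats),
        q1=min(s.q1 for s in stats),
        median=mean([s.median for s in stats]),
        q3=max(s.q3 for s in stats),
        whisker_high=max(s.whisker_high for s in stats),
    )
```

Taking the highest upper values is a conservative choice in this sketch, since the merged summary then reflects the isoform with the greatest expression in non-diseased tissue.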
In some embodiments, user input comprises selecting all non-diseased tissues. For example, a user may select all non-diseased tissues (i.e., all non-diseased tissues for which sample data is included in a database described herein) when selecting a defined maximum or defined median transcript expression value threshold. In some embodiments, user input comprises selecting a subset of non-diseased tissues. For example, a user may select a subset of non-diseased tissues (i.e., a subset of non-diseased tissues for which sample data is included in a database described herein) when selecting a defined maximum or defined median transcript expression value threshold. In certain configurations, all non-diseased tissues are selected as a default and a user de-selects one or more non-diseased tissues. For example, a user interested in diseases of the female reproductive tract may de-select male-specific tissues such as testis or prostate. In some embodiments, a defined maximum transcript expression value threshold or a defined median transcript expression value threshold for non-diseased tissues is determined according to a subset of non-diseased tissues. In some embodiments, a defined transcript expression ratio threshold for diseased tissues vs. non-diseased tissues is determined, in part, according to a subset of non-diseased tissues. In some embodiments, a method herein comprises identifying in a database one or more transcript identifiers meeting the following criteria (1) and (2): (1) a corresponding transcript expression value in diseased tissue samples corresponding to an input defined disease is greater than an input defined minimum transcript expression value threshold, and (2) a corresponding transcript expression value in a subset of non-diseased tissues is less than an input defined maximum or defined median transcript expression value threshold.
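By way of non-limiting illustration, criteria (1) and (2) evaluated over a user-selected subset of non-diseased tissues could be sketched in Python as follows. The record layout, the function name `select_transcripts`, and the transcript identifiers used in the usage note are hypothetical; `pct_above_min` is assumed to hold the percentage of diseased samples exceeding the input minimum expression threshold.

```python
def select_transcripts(records, deselected, max_normal_tpm, min_pct):
    """Return identifiers whose normal expression is below max_normal_tpm in every
    retained tissue and whose diseased-sample percentage is greater than min_pct.

    records: {transcript_id: {"pct_above_min": float,
                              "normal_by_tissue": {tissue: TPM value}}}
    deselected: tissue names the user has de-selected (e.g., {"testis"}).
    """
    hits = []
    for transcript_id, rec in records.items():
        retained = [
            value
            for tissue, value in rec["normal_by_tissue"].items()
            if tissue not in deselected
        ]
        normal_ok = all(value < max_normal_tpm for value in retained)
        if normal_ok and rec["pct_above_min"] > min_pct:
            hits.append(transcript_id)
    return sorted(hits)
```

In this sketch, a hypothetical transcript with 8.0 TPM in testis and 0.2 TPM in liver is excluded at a 1.0 TPM limit when all tissues are selected, and is returned once testis is de-selected.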
Output
In some embodiments, a method or system herein comprises outputted information. Outputted information may be provided on a suitable visual medium (e.g., a peripheral or component of a machine, e.g., a printer, monitor, tablet, smart phone, display, and the like). Generally, outputted information can be displayed in a suitable format that facilitates analysis of one or more transcripts and/or expression values. Non-limiting examples of formats suitable for use for displaying outputted information include digital data, a graph, a 2D graph, a 3D graph, a 4D graph, a picture (e.g., a jpg, bitmap (e.g., bmp), pdf, tiff, gif, raw, png, the like or suitable format), a pictograph, a list, a sequence alignment, a chart, a table, a box plot, a bar graph, a pie graph, a diagram, a flow chart, a scatter plot, a map, a histogram, a density chart, a function graph, a circuit diagram, a block diagram, a bubble map, a constellation diagram, a contour diagram, a cartogram, a spider chart, a Venn diagram, a nomogram, and the like, and combinations of the foregoing. Outputted information may include one or more features described herein. An example output display is provided in
In some embodiments, a method herein comprises outputting a list of one or more transcript identifiers. In some embodiments, a list comprises a transcript identifier for each of one or more transcripts identified according to a method described herein (e.g., criteria described herein). In some embodiments, a transcript of one or more genes is identified that is expressed at a level in a defined subpopulation of diseased tissue samples higher than a level the transcript is expressed in non-diseased tissue samples.
An example process for identifying one or more transcript identifiers meeting certain criteria is as follows: an interface (e.g., HTML interface) runs a script (e.g., JavaScript) to send user-specified parameters for disease type (e.g., tumor type), percentage of samples, and expression thresholds to code (e.g., Python code) run by a local webserver. The code (e.g., Python code) performs a relational query (e.g., SQL query) of a database (e.g., a database described herein) using the supplied parameters. The database build process precomputes, for each known protein-coding transcript: the box plot upper whisker TPM value in non-diseased tissues (max normal expression), the % of samples exceeding each predefined expression level in diseased tissue (e.g., tumor tissue), and the average TPM expression in diseased tissue (e.g., tumor tissue) for each predefined TPM expression level. Because of this precomputation, a simple query (e.g., SQL query) retrieves all transcripts which DO NOT have expression in non-diseased tissue above a user threshold but which DO have expression in diseased tissue (e.g., tumor tissue) above a threshold for at least the specified number (percentage) of samples. The code (e.g., Python code) returns a list of transcripts meeting user criteria along with, for each, the percentage of samples and the average expression of the subpopulation above threshold.
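The query step of the example process above may be sketched as follows using Python's built-in `sqlite3` module. This is a non-limiting illustration: the table names (`normal_max`, `disease_precomputed`), column names, transcript identifiers and values are all hypothetical, and the schema is only one way the precomputed values described above might be laid out.

```python
import sqlite3

# Hypothetical schema holding the precomputed values described above.
SCHEMA = """
CREATE TABLE normal_max (
    transcript_id   TEXT PRIMARY KEY,
    max_normal_tpm  REAL NOT NULL      -- upper whisker TPM in non-diseased tissues
);
CREATE TABLE disease_precomputed (
    transcript_id      TEXT NOT NULL,
    disease            TEXT NOT NULL,
    tpm_level          REAL NOT NULL,  -- one of the predefined expression levels
    pct_samples_above  REAL NOT NULL,  -- % of diseased samples exceeding tpm_level
    avg_tpm_above      REAL,           -- average TPM of the samples exceeding tpm_level
    PRIMARY KEY (transcript_id, disease, tpm_level)
);
"""

QUERY = """
SELECT d.transcript_id, d.pct_samples_above, d.avg_tpm_above
FROM disease_precomputed AS d
JOIN normal_max AS n ON n.transcript_id = d.transcript_id
WHERE d.disease = ?
  AND d.tpm_level = ?
  AND d.pct_samples_above >= ?
  AND n.max_normal_tpm <= ?
ORDER BY d.pct_samples_above DESC
"""


def find_transcripts(conn, disease, disease_tpm_level, min_pct, max_normal_tpm):
    """Return (transcript_id, % samples above level, average TPM above level) rows."""
    cur = conn.execute(QUERY, (disease, disease_tpm_level, min_pct, max_normal_tpm))
    return cur.fetchall()


def build_demo_db():
    """Create an in-memory database with made-up rows for illustration."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO normal_max VALUES (?, ?)",
        [("TX_A", 0.5), ("TX_B", 12.0), ("TX_C", 0.9)],
    )
    conn.executemany(
        "INSERT INTO disease_precomputed VALUES (?, ?, ?, ?, ?)",
        [
            ("TX_A", "tumor_1", 5.0, 35.0, 18.2),
            ("TX_B", "tumor_1", 5.0, 80.0, 40.0),  # too high in non-diseased tissue
            ("TX_C", "tumor_1", 5.0, 4.0, 6.1),    # too few diseased samples
        ],
    )
    return conn


if __name__ == "__main__":
    demo = build_demo_db()
    for row in find_transcripts(demo, "tumor_1", 5.0, 10.0, 1.0):
        print(row)
```

Because the percentage of samples above each predefined level is stored per transcript, the query compares precomputed columns and does not scan per-sample expression values at request time.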
In some embodiments, a method herein comprises receiving a user selection of a transcript identifier. For example, a user may select a transcript identifier outputted as described above. In some embodiments, a user selects a transcript identifier from a list of transcript identifiers outputted as described above (see e.g.,
In some embodiments, a method herein comprises, for each transcript identifier outputted, outputting a maximum transcript expression value of non-diseased samples (e.g., for a non-diseased tissue having the highest transcript expression value). In some embodiments, a method herein comprises, for each transcript identifier outputted, outputting a median transcript expression value of non-diseased samples (e.g., for a non-diseased tissue having the median transcript expression value). In some embodiments, a method herein comprises, for each transcript identifier outputted, outputting a non-diseased tissue corresponding to the maximum transcript expression value. In some embodiments, a method herein comprises, for each transcript identifier outputted, outputting a non-diseased tissue corresponding to the median transcript expression value. In some embodiments, one or more tables in a database described herein relate average transcript expression values in diseased tissue samples to corresponding transcript identifiers. In some embodiments, one or more tables in a database described herein relate average transcript expression values for all non-diseased samples or a subset of non-diseased samples to corresponding transcript identifiers. An average transcript expression value may be a median transcript expression value. An average transcript expression value may be a mean transcript expression value. In some embodiments, one or more tables in a database described herein relate maximum or median transcript expression values of non-diseased samples to corresponding transcript identifiers. In some embodiments, one or more tables in a database described herein relate non-diseased tissues corresponding to the maximum or median transcript expression values to corresponding transcript identifiers.
In some embodiments, a method herein comprises outputting an alignment of polypeptide linear sequences corresponding to the outputted transcript identifiers described above (see e.g.,
In some embodiments, a method herein comprises outputting one or more statistical representations of transcript expression values corresponding to a selected transcript identifier. In some embodiments, a method herein comprises outputting one or more box plots for a transcript corresponding to a selected transcript identifier (see e.g.,
In some embodiments, a method herein comprises outputting a box plot for transcripts corresponding to unique polypeptides, where the box plot displays transcript expression values in diseased samples for each of a plurality of defined diseases. In some embodiments, a method herein comprises outputting a box plot for transcripts corresponding to unique polypeptides, where the box plot displays transcript expression values in non-diseased samples for each of a plurality of tissues. In some embodiments, a method herein comprises outputting a box plot for transcripts corresponding to partially unique polypeptides, where the box plot displays merged transcript expression values in diseased samples for each of a plurality of defined diseases. In some embodiments, a method herein comprises outputting a box plot for transcripts corresponding to partially unique polypeptides, where the box plot displays merged transcript expression values in non-diseased samples for each of a plurality of tissues.
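The values displayed in a box plot for one transcript in one disease or tissue may be computed as sketched below. This illustration assumes the common Tukey convention, in which each whisker extends to the most extreme sample within 1.5 times the interquartile range of the box; the description above does not specify a whisker convention, and the function name `box_plot_stats` is hypothetical.

```python
from statistics import quantiles
from typing import Dict, Sequence


def box_plot_stats(values: Sequence[float]) -> Dict[str, float]:
    """Five box plot values for one group of expression values (e.g., TPM)."""
    data = sorted(values)
    if len(data) < 2:
        raise ValueError("need at least two values")
    # Quartiles computed with the inclusive method (the data are treated as the full population).
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return {
        "whisker_low": min(v for v in data if v >= low_fence),
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_high": max(v for v in data if v <= high_fence),
    }


# Hypothetical TPM values for one transcript in one tissue; 100.0 is an outlier.
print(box_plot_stats([1.0, 2.0, 3.0, 4.0, 5.0, 100.0]))
```

Under this convention, an outlying sample lies beyond the whisker and therefore does not set the upper whisker value.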
Certain system structures described herein allow for rapid identification of sub-populations of disease-associated transcripts in a large database. For example, certain precomputed components of a database described herein allow for such rapid identification. In some embodiments, an output (e.g., a list of transcript identifiers and/or other output features described herein) is generated within 60 seconds, within 50 seconds, within 40 seconds, within 30 seconds, within 20 seconds, within 10 seconds, within 5 seconds, or within 2 seconds.
In some embodiments, a method herein includes use of a protein analysis tool. In some embodiments, a system herein includes a protein analysis tool. A protein analysis tool may comprise one or more components useful for analyzing a protein or peptide sequence encoded by a transcript identified by a method or system described herein (see e.g.,
In some embodiments, a method herein comprises analyzing a polypeptide. A polypeptide analysis may comprise identifying one or more transcript identifiers in a database (e.g., according to a method described herein), and selecting a transcript identifier (e.g., from a list of outputted transcript identifiers as described herein). For a selected transcript identifier, a polypeptide analysis tool may output one or more features of a polypeptide or peptide encoded by the transcript corresponding to the selected transcript identifier. For example, a polypeptide analysis tool may output a three-dimensional structure corresponding to a polypeptide linear sequence corresponding to the selected transcript identifier. A polypeptide analysis tool may output functional polypeptide domain information for a polypeptide linear sequence corresponding to the selected transcript identifier. In some embodiments, one or more tables of a database described herein relate three-dimensional structure coordinates and functional polypeptide domain information to a polypeptide linear sequence.
In some embodiments, for a selected transcript identifier, a transcript identifier is output for each of one or more transcript isoforms encoded by the gene that encodes the transcript corresponding to the selected transcript identifier. In some embodiments, for a selected transcript identifier, an alignment of polypeptide linear sequences is outputted. In some embodiments, one or more tables of a database described herein relate polypeptide linear sequences to corresponding transcript identifiers. In some embodiments, an outputted three-dimensional structure corresponds to one or more polypeptide linear sequences. In some embodiments, outputted functional polypeptide domain information is for one or more polypeptide linear sequences. In some embodiments, a three-dimensional structure comprises one or more of the following features: the three-dimensional structure is a user-moveable structure, the three dimensional structure is annotated with functional polypeptide domain information, and linear polypeptide sequence(s) is/are mapped to the three-dimensional structure.
In some embodiments, a method herein comprises receiving a defined portion of a linear polypeptide sequence, and displaying a portion of a corresponding three-dimensional structure corresponding to the defined portion of the linear polypeptide sequence. A corresponding three-dimensional structure is a structure to which a linear polypeptide sequence has been mapped. In some embodiments, a method herein comprises receiving a defined portion of a three-dimensional structure, and displaying a portion of a corresponding linear polypeptide sequence corresponding to the defined portion of the three-dimensional structure. In some embodiments, a method herein comprises receiving a defined portion of a three-dimensional structure, and displaying annotated functional polypeptide domain information corresponding to the defined portion of the three-dimensional structure.
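The correspondence between a linear polypeptide sequence and a three-dimensional structure described above may be sketched as a lookup in both directions. This is a minimal illustration that assumes the mapping is stored as a dictionary from 1-based sequence positions to (chain, residue number) pairs, with `None` for positions absent from the structure; the function names, domain names and values are hypothetical.

```python
from typing import Dict, Iterable, List, Optional, Tuple

Residue = Tuple[str, int]  # (chain identifier, residue number in the structure)
SeqToStruct = Dict[int, Optional[Residue]]


def residues_for_sequence_range(mapping: SeqToStruct, start: int, end: int) -> List[Residue]:
    """Structure residues for 1-based, inclusive sequence positions start..end."""
    hits = []
    for pos in range(start, end + 1):
        residue = mapping.get(pos)
        if residue is not None:  # skip positions not resolved in the structure
            hits.append(residue)
    return hits


def sequence_positions_for_residues(mapping: SeqToStruct, residues: Iterable[Residue]) -> List[int]:
    """Sequence positions corresponding to a selected portion of the structure."""
    wanted = set(residues)
    return sorted(pos for pos, res in mapping.items() if res in wanted)


def domains_for_positions(domains: Dict[str, Tuple[int, int]], positions: Iterable[int]) -> List[str]:
    """Names of annotated domains (name -> inclusive range) overlapping the positions."""
    selected = list(positions)
    return sorted(
        name for name, (start, end) in domains.items()
        if any(start <= pos <= end for pos in selected)
    )


# Hypothetical mapping: positions 1 and 5 are not present in the structure.
mapping = {1: None, 2: ("A", 10), 3: ("A", 11), 4: ("A", 12), 5: None}
domains = {"signal": (1, 1), "kinase-like": (3, 5)}
picked = sequence_positions_for_residues(mapping, [("A", 12), ("A", 10)])
print(picked, domains_for_positions(domains, picked))
```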
In some embodiments, a method herein comprises prior to identifying one or more transcript identifiers, receiving user input comprising one or more of: (i) a defined disease selected from a plurality of defined diseases, where each of the defined diseases corresponds to diseased tissue samples, (ii) a defined minimum transcript expression value threshold for diseased tissue samples, (iii) a defined maximum or a defined median transcript expression value threshold for non-diseased tissues, where each of the non-diseased tissues corresponds to non-diseased tissue samples, and (iv) a defined minimum sample sub-population percentage. In some embodiments, one or more identified transcript identifiers meet one or more of the following criteria (1) and (2): (1) a corresponding transcript expression value in diseased tissue samples corresponding to the input defined disease of (i) is greater than the input defined minimum transcript expression value threshold of (ii), and (2) a corresponding transcript expression value in non-diseased tissue is less than the input defined maximum or defined median transcript expression value threshold of (iii). In some embodiments, one or more identified transcript identifiers meet criteria (1) and/or (2) for a percentage of diseased samples corresponding to an input defined disease greater than a defined minimum sub-population percentage. In some embodiments, a method comprises prior to selecting a transcript identifier, outputting a list comprising a transcript identifier for each of one or more transcripts identified by a method herein, thereby identifying a transcript of one or more genes that is expressed at a level in a defined subpopulation of diseased tissue samples higher than a level the transcript is expressed in non-diseased tissue samples.
In some embodiments, a method herein comprises prior to identifying one or more transcript identifiers, receiving user input comprising one or more of: (i) a defined disease selected from a plurality of defined diseases, where each of the defined diseases corresponds to diseased tissue samples, (ii) a defined transcript expression ratio threshold, where the transcript expression ratio is a ratio of a transcript expression value for diseased tissue samples to a transcript expression value for non-diseased tissues, where each of the non-diseased tissues corresponds to non-diseased tissue samples, and (iii) a defined minimum sample sub-population percentage. In some embodiments, one or more identified transcript identifiers have a corresponding transcript expression ratio for an input defined disease that is greater than an input defined transcript expression ratio threshold. In some embodiments, one or more identified transcript identifiers are identified for a percentage of diseased samples corresponding to an input defined disease greater than a defined minimum sub-population percentage. In some embodiments, a method comprises prior to selecting a transcript identifier, outputting a list comprising a transcript identifier for each of one or more transcripts identified by a method herein, thereby identifying a transcript of one or more genes that is expressed at a level in a defined subpopulation of diseased tissue samples higher than a level the transcript is expressed in non-diseased tissue samples.
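The ratio-based criterion described above may be sketched for a single transcript as follows. This illustration assumes the ratio is computed per diseased sample against one reference value for non-diseased tissues (e.g., a maximum or median value), and adds a small pseudocount so that a reference value of zero does not cause division by zero. The pseudocount, the function names and the values are assumptions for illustration and are not taken from the description above.

```python
from typing import Sequence


def pct_samples_above_ratio(
    disease_tpm: Sequence[float],
    normal_reference_tpm: float,
    ratio_threshold: float,
    pseudocount: float = 0.01,
) -> float:
    """Percentage of diseased samples whose diseased/non-diseased ratio exceeds the threshold."""
    if not disease_tpm:
        return 0.0
    denominator = normal_reference_tpm + pseudocount
    above = sum(
        1 for tpm in disease_tpm
        if (tpm + pseudocount) / denominator > ratio_threshold
    )
    return 100.0 * above / len(disease_tpm)


def meets_ratio_criterion(
    disease_tpm: Sequence[float],
    normal_reference_tpm: float,
    ratio_threshold: float,
    min_pct: float,
    pseudocount: float = 0.01,
) -> bool:
    """True if the sub-population above the ratio threshold exceeds min_pct percent."""
    pct = pct_samples_above_ratio(
        disease_tpm, normal_reference_tpm, ratio_threshold, pseudocount
    )
    return pct > min_pct


# Hypothetical values: two of four diseased samples exceed a 10-fold ratio.
samples = [0.0, 2.0, 50.0, 80.0]
print(pct_samples_above_ratio(samples, 1.0, 10.0))
print(meets_ratio_criterion(samples, 1.0, 10.0, min_pct=25.0))
```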
Provided herein are systems, machines and apparatuses for performing a method described herein. Systems, machines and apparatuses may comprise one or more microprocessors.
Systems, machines and apparatuses may comprise memory. Memory may comprise a database described herein. Memory may comprise a database comprising one or more tables described herein.
Memory may comprise a database comprising one or more tables that relate one or more features described herein. For example, memory may comprise a database comprising one or more tables that relate (i) transcript identifiers to genes, where at least a portion of the transcript identifiers correspond to transcript isoform sets, and each of the transcript isoform sets is encoded by a gene; (ii) transcript identifiers to corresponding transcript expression values in diseased tissue samples, and (iii) transcript identifiers to corresponding transcript expression values in non-diseased tissue samples.
Systems, machines and apparatuses may comprise instructions executable by the one or more microprocessors configured to perform a method described herein. For example, systems, machines and apparatuses may comprise instructions executable by the one or more microprocessors configured to perform the following method: (a) receiving user input comprising: (i) a defined disease selected from a plurality of defined diseases, where each of the defined diseases corresponds to diseased tissue samples, (ii) a defined minimum transcript expression value threshold for diseased tissue samples, (iii) a defined maximum or a defined median transcript expression value threshold for non-diseased tissues, where each of the non-diseased tissues corresponds to non-diseased tissue samples, and (iv) a defined minimum sample sub-population percentage; (b) identifying in the database one or more transcript identifiers meeting the following criteria (1) and (2): (1) a corresponding transcript expression value in diseased tissue samples corresponding to the input defined disease of (a)(i) is greater than the input defined minimum transcript expression value threshold of (a)(ii), and (2) a corresponding transcript expression value in non-diseased tissue is less than the input defined maximum or defined median transcript expression value threshold of (a)(iii), for a percentage of diseased samples corresponding to the input defined disease of (a)(i) greater than the defined minimum sub-population percentage of (a)(iv); and (c) outputting a list comprising a transcript identifier for each of one or more transcripts identified in (b).
In certain configurations, systems, machines and apparatuses may comprise instructions executable by the one or more microprocessors configured to perform the following method: (a) receiving user input comprising: (i) a defined disease selected from a plurality of defined diseases, where each of the defined diseases corresponds to diseased tissue samples, (ii) a defined transcript expression ratio threshold, where the transcript expression ratio is a ratio of a transcript expression value for diseased tissue samples to a transcript expression value for non-diseased tissues, where each of the non-diseased tissues corresponds to non-diseased tissue samples, and (iii) a defined minimum sample sub-population percentage; (b) identifying in the database one or more transcript identifiers having a corresponding transcript expression ratio for the input defined disease of (a)(i) that is greater than the input defined transcript expression ratio threshold of (a)(ii), for a percentage of diseased samples corresponding to the input defined disease of (a)(i) greater than the defined minimum sub-population percentage of (a)(iii); and (c) outputting a list comprising a transcript identifier for each of one or more transcripts identified in (b).
Methods described herein for identifying disease-associated transcript isoforms generally are computer-implemented methods. A disease-associated isoform identification system or one or more portions thereof may be implemented with any type of hardware and/or software, and may be a pre-programmed general purpose computing device. For example, the system may be implemented using a server, a personal computer, a portable computer, a thin client, or any suitable device or devices. The system and/or components thereof may be a single device at a single location, or multiple devices at a single, or multiple, locations that are connected together using any appropriate communication protocols over any communication medium such as electric cable, fiber optic cable, or in a wireless manner.
A disease-associated isoform identification system or one or more portions thereof can include clients and servers. A client and server generally are remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. In some implementations, a server transmits data (e.g., an HTML page) to a client device (e.g., for purposes of displaying data to and receiving user input from a user interacting with the client device). Data generated at the client device (e.g., a result of the user interaction) can be received from the client device at the server.
A disease-associated isoform identification system or one or more portions thereof can include a back-end component (e.g., a data server), and/or a middleware component (e.g., an application server), and/or a front-end component. A front-end component can be a client computer having a graphical user interface or a Web browser through which a user can interact with an implementation described herein. Components of a disease-associated isoform identification system can be interconnected by any form or medium of digital data communication, e.g., a communication network. Examples of communication networks include a local area network (“LAN”) and a wide area network (“WAN”), an inter network (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks).
A disease-associated isoform identification system or one or more portions thereof can be implemented in digital electronic circuitry, or in computer software, firmware, or hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them. A disease-associated isoform identification system or one or more portions thereof can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions, encoded on computer storage medium for execution by, or to control the operation of, data processing apparatus. Alternatively or in addition, program instructions can be encoded on an artificially generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal that is generated to encode information for transmission to suitable receiver apparatus for execution by a data processing apparatus. A computer storage medium can be, or be included in, a computer-readable storage device, a computer-readable storage substrate, a random or serial access memory array or device, or a combination of one or more of them. While a computer storage medium is not a propagated signal, a computer storage medium can be a source or destination of computer program instructions encoded in an artificially generated propagated signal. A computer storage medium can also be, or be included in, one or more separate physical components or media (e.g., multiple CDs, disks, or other storage devices).
Operations performed by a disease-associated isoform identification system or one or more portions thereof can be implemented as operations performed by a “data processing apparatus” on data stored on one or more computer-readable storage devices or received from other sources. A “data processing apparatus” encompasses different types of apparatus, devices, and machines for processing data, non-limiting examples of which include a programmable processor, a computer, a system on a chip, or multiples of, or combinations, of the foregoing. An apparatus can include special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application specific integrated circuit). An apparatus can also include, in addition to hardware, code that creates an execution environment for a computer program, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, a cross-platform runtime environment, a virtual machine, or a combination of one or more of them. An apparatus and execution environment can realize various different computing model infrastructures, such as web services, distributed computing and grid computing infrastructures.
A computer program (also known as a program, software, software application, script, or code) can be written in any form of programming language, including compiled or interpreted languages, declarative or procedural languages, and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, object, or other unit suitable for use in a computing environment. A computer program may, but need not, correspond to a file in a file system. A program can be stored in a portion of a file that holds other programs or data (e.g., one or more scripts stored in a markup language document), in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code). A computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a communication network.
Processes and logic flows described for a disease-associated isoform identification system or one or more portions thereof can be performed by one or more programmable processors executing one or more computer programs to perform actions by operating on input data and generating output. Processes and logic flows can be performed by, and apparatus can also be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit).
Processors suitable for the execution of a computer program include general and special purpose microprocessors, for example, and any one or more processors of any kind of digital computer. Generally, a processor receives instructions and data from a read-only memory or a random access memory or both. Essential elements of a computer are a processor for performing actions in accordance with instructions and one or more memory devices for storing instructions and data. A computer can also include, or be operatively coupled to receive data from or transfer data to (or both), one or more mass storage devices for storing data, e.g., magnetic disks, magneto-optical disks, or optical disks. A computer need not include such devices. A computer can be embedded in a device including but not limited to a mobile telephone, a personal digital assistant (PDA), a mobile audio or video player, a Global Positioning System (GPS) receiver, or a portable storage device (e.g., a universal serial bus (USB) flash drive). Devices suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks. A processor and memory can be supplemented by, or incorporated in, special purpose logic circuitry.
A non-limiting disease-associated isoform identification system implementation is illustrated in
Certain outputs of the methods and systems described herein (e.g., amino acid subsequences, amino acid sequences) may be utilized in a variety of applications. For example, certain applications may include an immunotherapeutic strategy targeting tumor-associated isoforms. In one embodiment, the immunotherapeutic strategy targeting tumor-associated isoforms is a peptide vaccine. In another embodiment, the peptide is encoded by a DNA vector. In another embodiment, the immunotherapeutic strategy targeting tumor-associated isoforms is an adoptive T-cell therapy. In another embodiment, the immunotherapeutic strategy targeting tumor-associated isoforms is monoclonal antibody therapy. In one embodiment, the immunotherapeutic strategy targeting tumor-specific mutations is a peptide vaccine. In another embodiment, the peptide is encoded by a DNA vector. In another embodiment, the immunotherapeutic strategy targeting tumor-specific mutations is an adoptive T-cell therapy. In another embodiment, the immunotherapeutic strategy targeting tumor-specific mutations is monoclonal antibody therapy.
Amino acid subsequences identified by a method or system described herein can be considered immunogenic peptide candidates, which can be utilized in a variety of applications. In certain instances, an immunogenic peptide identified by a method or system described herein can be synthesized. A peptide can be synthesized using any suitable method, including by chemical synthesis, by in vitro translation, or by recombinant translation in host cells. Thus, provided herein is a composition comprising a peptide identified by a method or system described herein and a method for synthesizing a peptide identified by a method or system described herein.
A synthesized peptide can be combined with one or more suitable pharmaceutically acceptable adjuvants and/or one or more suitable pharmaceutically acceptable carriers suitable for a vaccine. Non-limiting examples of pharmaceutically acceptable vaccine adjuvants include aluminum (e.g., amorphous aluminum hydroxyphosphate sulfate (AAHS), aluminum hydroxide, aluminum phosphate, potassium aluminum sulfate); monophosphoryl lipid A (MPL) and aluminum salt (AS04); oil in water emulsion composed of squalene (MF59); monophosphoryl lipid A (MPL) and QS-21, a natural compound extracted from the Chilean soapbark tree, combined in a liposomal formulation (AS01B); and cytosine phosphoguanine (CpG 1018). A pharmaceutically acceptable carrier can be a diluent, excipient, or vehicle included in a composition containing the peptide that is administered. A pharmaceutically acceptable carrier can be a sterile liquid. A pharmaceutically acceptable aqueous carrier, such as a saline solution, aqueous dextrose solution and/or glycerol solution can be included when a vaccine is administered intravenously. A vaccine composition that includes an immunogenic peptide identified by a method or system described herein and a pharmaceutically acceptable adjuvant, and optionally includes a pharmaceutically acceptable carrier, can be administered to a subject in need thereof (e.g., human subject) in an amount sufficient to induce an immune response to the peptide in the subject. Such a composition can be administered to a subject as part of a method for treating a condition (e.g., a disease, a cancer) in which inducing an immune response against the peptide can treat the condition (e.g., ameliorate a symptom associated with the condition).
A polynucleotide encoding an immunogenic peptide identified by a method or system described herein can be prepared. A polynucleotide can include one or more elements from a different type of organism from which the polynucleotide portion encoding the immunogenic peptide originated. In certain instances, a polynucleotide can include a polynucleotide portion from a human gene that encodes a peptide identified by a method or system described herein, and can include one or more polynucleotide portions from a different organism (e.g., from a virus; from a bacterium). A polynucleotide sometimes is an expression vector or expression plasmid. A polynucleotide sometimes is a vector or plasmid suitable for administration to a subject, and can be formulated as a vaccine. A polynucleotide vector sometimes is a DNA vector (e.g., a DNA virus or based on a DNA virus (e.g., double-stranded DNA virus), including a herpesvirus, an adenovirus, and a poxvirus) or an RNA vector (e.g., an RNA virus or based on an RNA virus, including a retrovirus and a ssRNA virus). Non-limiting examples of polynucleotide vectors are described in Deng et al., Vaccine 33(48): 6938-6946 (2015). A vaccine composition that includes a polynucleotide encoding an immunogenic peptide identified by a method or system described herein, and optionally includes a pharmaceutically acceptable carrier and/or adjuvant, can be administered to a subject in need thereof (e.g., human subject) in an amount sufficient to induce an immune response to the peptide in the subject. Such a composition can be administered to a subject as part of a method for treating a condition (e.g., a disease, a cancer) in which inducing an immune response against the peptide can treat the condition (e.g., ameliorate a symptom associated with the condition).
A composition that includes an antigen presenting cell (APC) and a peptide identified by a method or system described herein can be prepared. In certain implementations, a composition that includes an APC and a polynucleotide encoding a peptide identified by a method or system described herein can be prepared. A composition that includes an APC transduced with a polynucleotide encoding a peptide identified by a method or system described herein can be prepared. A polynucleotide encoding a peptide identified by a method or system described herein sometimes is an expression plasmid or expression vector, and an APC can be transduced by the polynucleotide. An APC can be transduced by a polynucleotide in any suitable manner, non-limiting examples of which include transduction by naked polynucleotide and transduction by electroporation. A non-limiting example of an APC is a dendritic cell. In certain implementations, a composition comprising an APC and a peptide identified by a method or system described herein (e.g., a vaccine composition), or a composition comprising an APC transduced with a polynucleotide encoding a peptide identified by a method or system described herein (e.g., a vaccine composition), where the composition optionally includes a pharmaceutically acceptable carrier and/or adjuvant, can be administered to a subject in need thereof (e.g., human subject) in an amount sufficient to induce an immune response to the peptide in the subject. Such a composition can be administered to a subject as part of a method for treating a condition (e.g., a disease, a cancer) in which inducing an immune response against the peptide can treat the condition (e.g., ameliorate a symptom associated with the condition).
A peptide identified by a method or system described herein can be administered to a subject for production of antibodies that immunospecifically bind to the peptide. Antibodies produced can be polyclonal antibodies or monoclonal antibodies, for example. A peptide identified by a method or system described herein can be included in a composition administered to an animal subject (e.g., rabbit subject, camelid subject), antiserum can be obtained, and polyclonal antibodies optionally may be enriched and/or isolated from the antiserum. A peptide identified by a method or system described herein can be included in a composition administered to an animal subject (e.g., murine subject, guinea pig subject, rabbit subject) and spleen cells from the subject can be combined with myeloma cells under conditions that produce monoclonal antibody generating hybridomas. Hybridomas can be screened for those that produce monoclonal antibodies that immunospecifically bind to the peptide administered to the animal subject. Accordingly, compositions containing the peptide can be administered to a subject as part of a method for manufacturing antibodies (e.g., monoclonal antibodies, polyclonal antibodies) that immunospecifically bind to the peptide.
Provided herein are methods for analyzing nucleic acid and/or polypeptides from a sample. Nucleic acid and/or polypeptides may be isolated from a sample obtained from a subject (e.g., a test subject). A subject can be any living or non-living organism, including but not limited to a human, a non-human animal, a plant, a bacterium, a fungus, a protist, or a pathogen. Any human or non-human animal can be selected, and may include, for example, mammal, reptile, avian, amphibian, fish, ungulate, ruminant, bovine (e.g., cattle), equine (e.g., horse), caprine and ovine (e.g., sheep, goat), swine (e.g., pig), camelid (e.g., camel, llama, alpaca), monkey, ape (e.g., gorilla, chimpanzee), ursid (e.g., bear), poultry, dog, cat, mouse, rat, dolphin, whale and shark. A subject may be a male or female. A subject may be any age (e.g., an embryo, a fetus, an infant, a child, an adult). A subject may be a patient having a disease or condition, a patient suspected of having a disease or condition, a patient in remission for a disease or condition, a patient with a family history of a disease or condition, and/or a subject obtaining a screen for a disease or condition. A subject may be a cancer patient, a patient suspected of having cancer, a patient in remission, a patient with a family history of cancer, and/or a subject obtaining a cancer screen. A subject may be a patient having an infection or infectious disease or infected with a pathogen (e.g., bacteria, virus, fungus, protozoa, and the like), a patient suspected of having an infection or infectious disease or being infected with a pathogen, a patient recovering from an infection, infectious disease, or pathogenic infection, a patient with a history of infections, infectious disease, pathogenic infections, and/or a subject obtaining an infectious disease or pathogen screen.
A sample may be isolated or obtained from any type of suitable biological specimen or sample (e.g., a test sample). A nucleic acid sample may be isolated or obtained from a single cell, a plurality of cells (e.g., cultured cells), cell culture media, conditioned media, a tissue, an organ, or an organism (e.g., bacteria, yeast, or the like).
A sample or test sample may be any specimen that is isolated or obtained from a subject or part thereof (e.g., a human subject, a subject having a disease or condition, a cancer patient, a patient having an infection or infectious disease, a tumor, an infected organ or tissue, a diseased organ or tissue). Non-limiting examples of specimens include fluid or tissue from a subject, including, without limitation, blood or a blood product (e.g., serum, plasma, or the like), umbilical cord blood, chorionic villi, amniotic fluid, cerebrospinal fluid, spinal fluid, lavage fluid (e.g., bronchoalveolar, gastric, peritoneal, ductal, ear, arthroscopic), biopsy sample (e.g., cancer biopsy), celocentesis sample, cells (blood cells, normal cells, abnormal cells (e.g., cancer cells)) or parts thereof (e.g., mitochondrial, nucleus, extracts, or the like), washings of female reproductive tract, urine, feces, sputum, saliva, nasal mucous, prostate fluid, lavage, semen, lymphatic fluid, bile, tears, sweat, breast milk, breast fluid, the like or combinations thereof.
In some embodiments, a sample is isolated or obtained from a specimen (e.g., healthy or diseased specimen), cell(s) (e.g., healthy or diseased cell), tissue (e.g., healthy or diseased tissue), organ (e.g., healthy or diseased organ), and/or the like of an animal (e.g., an animal subject). A sample obtained from a healthy specimen, cell, tissue, and/or organ may be referred to as a non-disease or non-diseased sample. A non-disease or non-diseased sample may be obtained from a subject with no diagnosis of a particular disease (e.g., cancer), no history of a particular disease (e.g., cancer), and/or no suspicion of having a particular disease (e.g., cancer). A sample obtained from a diseased specimen, cell, tissue, and/or organ may be referred to as a disease sample or diseased sample. A disease or diseased sample may be obtained from a subject with a diagnosis of a particular disease or condition (e.g., cancer), a history of a particular disease or condition (e.g., cancer), and/or a subject suspected of having a particular disease or condition (e.g., cancer).
A sample can be a liquid sample. Examples of liquid samples include, but are not limited to, blood or a blood product (e.g., serum, plasma, or the like), urine, cerebral spinal fluid, saliva, sputum, biopsy sample (e.g., liquid biopsy for the detection of cancer), a liquid sample described above, the like or combinations thereof. In certain embodiments, a sample is a liquid biopsy, which generally refers to an assessment of a liquid sample from a subject for the presence, absence, progression or remission of a disease (e.g., cancer). A liquid biopsy can be used in conjunction with, or as an alternative to, a solid biopsy (e.g., tumor biopsy).
A sample may be a tumor sample (i.e., a sample isolated from a tumor). The term “tumor” generally refers to neoplastic cell growth and proliferation, whether malignant or benign, and may include pre-cancerous and cancerous cells and tissues. The terms “cancer” and “cancerous” generally refer to the physiological condition in mammals that is typically characterized by unregulated cell growth/proliferation. Examples of cancer include, but are not limited to, acute myeloid leukemia, adrenocortical carcinoma, bladder urothelial carcinoma, brain lower grade glioma, breast invasive carcinoma, cervical squamous cell carcinoma and endocervical adenocarcinoma, cholangiocarcinoma, colon adenocarcinoma, esophageal carcinoma, glioblastoma multiforme, head and neck squamous cell carcinoma, kidney chromophobe, kidney renal clear cell carcinoma, kidney renal papillary cell carcinoma, liver hepatocellular carcinoma, lung adenocarcinoma, lung squamous cell carcinoma, mesothelioma, ovarian serous cystadenocarcinoma, pancreatic adenocarcinoma, pheochromocytoma and paraganglioma, prostate adenocarcinoma, rectum adenocarcinoma, sarcoma, skin cutaneous melanoma, stomach adenocarcinoma, testicular germ cell tumors, thymoma, thyroid carcinoma, uterine carcinosarcoma, uterine corpus endometrial carcinoma, uveal melanoma, chronic myeloid leukemia, acute lymphocytic leukemia, multiple myeloma, non-Hodgkin lymphoma, Hodgkin lymphoma, marginal zone lymphoma, follicular lymphoma, small lymphocyte lymphoma, B-cell lymphoma, diffuse large B-cell lymphoma or mantle cell lymphoma, acute lymphoblastic leukemia, adrenocortical cancer, cervical and endocervical cancer, clear cell sarcoma of the kidney, kidney clear cell carcinoma, kidney papillary cell carcinoma, neuroblastoma, testicular germ cell tumor, uterine corpus endometrioid carcinoma, Wilms tumor, carcinoma, lymphoma, blastoma, sarcoma, leukemia, squamous cell cancer, small-cell lung cancer, non-small cell lung cancer, adenocarcinoma of the 
lung, squamous carcinoma of the lung, cancer of the peritoneum, hepatocellular cancer, gastrointestinal cancer, pancreatic cancer, glioblastoma, cervical cancer, ovarian cancer, liver cancer, bladder cancer, hepatoma, breast cancer, colon cancer, colorectal cancer, endometrial or uterine carcinoma, salivary gland carcinoma, kidney cancer, liver cancer, prostate cancer, vulval cancer, thyroid cancer, hepatic carcinoma, various types of head and neck cancer, and the like.
A sample may be a non-diseased sample (i.e., a sample isolated from a non-diseased tissue). Examples of non-diseased tissue sources include, but are not limited to, adipose—subcutaneous, adipose—visceral (omentum), adrenal gland, artery—aorta, artery—coronary, artery—tibial, brain—amygdala, brain—anterior cingulate cortex (BA24), brain—caudate (basal ganglia), brain—cerebellar hemisphere, brain—cerebellum, brain—cortex, brain—frontal cortex (BA9), brain—hippocampus, brain—hypothalamus, brain—nucleus accumbens (basal ganglia), brain—putamen (basal ganglia), brain—spinal cord (cervical c-1), brain—substantia nigra, breast—mammary tissue, colon—sigmoid, colon—transverse, esophagus—gastroesophageal junction, esophagus—mucosa, esophagus—muscularis, heart—atrial appendage, heart—left ventricle, kidney—cortex, kidney—medulla, liver, lung, minor salivary gland, muscle—skeletal, nerve—tibial, ovary, pancreas, pituitary, prostate, skin—not sun exposed (suprapubic), skin—sun exposed (lower leg), small intestine—terminal ileum, spleen, stomach, testis, thyroid, uterus, vagina, whole blood, bladder, cervix—endocervix, cervix—ectocervix, fallopian tube, cells—EBV-transformed lymphocytes, and cells—cultured fibroblasts.
Nucleic acid may be analyzed using the methods and systems described herein. The terms nucleic acid(s), nucleic acid molecule(s), nucleic acid fragment(s), target nucleic acid(s), nucleic acid template(s), template nucleic acid(s), nucleic acid target(s), polynucleotide(s), polynucleotide fragment(s), target polynucleotide(s), polynucleotide target(s), polynucleotide sequence(s), and the like may be used interchangeably throughout the disclosure. The terms refer to nucleic acids of any composition, such as DNA (e.g., complementary DNA (cDNA; synthesized from any RNA or DNA of interest), genomic DNA (gDNA), genomic DNA fragments, mitochondrial DNA (mtDNA), recombinant DNA (e.g., plasmid DNA), and the like), RNA (e.g., messenger RNA (mRNA), short inhibitory RNA (siRNA), ribosomal RNA (rRNA), transfer RNA (tRNA), microRNA, transacting small interfering RNA (ta-siRNA), natural small interfering RNA (nat-siRNA), small nucleolar RNA (snoRNA), small nuclear RNA (snRNA), long non-coding RNA (lncRNA), non-coding RNA (ncRNA), transfer-messenger RNA (tmRNA), precursor messenger RNA (pre-mRNA), small Cajal body-specific RNA (scaRNA), piwi-interacting RNA (piRNA), endoribonuclease-prepared siRNA (esiRNA), small temporal RNA (stRNA), signal recognition RNA, telomere RNA, RNA highly expressed in a tumor, and the like), and/or DNA or RNA analogs (e.g., containing base analogs, sugar analogs and/or a non-native backbone and the like), RNA/DNA hybrids and polyamide nucleic acids (PNAs), all of which can be in single- or double-stranded form, and unless otherwise limited, can encompass known analogs of natural nucleotides that can function in a similar manner as naturally occurring nucleotides.
A nucleic acid may be, or may be from, a plasmid, phage, virus, bacterium, autonomously replicating sequence (ARS), mitochondria, centromere, artificial chromosome, chromosome, or other nucleic acid able to replicate or be replicated in vitro or in a host cell, a cell, a cell nucleus or cytoplasm of a cell in certain embodiments. Unless specifically limited, the term encompasses nucleic acids containing known analogs of natural nucleotides that have similar binding properties as the reference nucleic acid and are metabolized in a manner similar to naturally occurring nucleotides. Unless otherwise indicated, a particular nucleic acid sequence also implicitly encompasses conservatively modified variants thereof (e.g., degenerate codon substitutions), alleles, orthologs, single nucleotide polymorphisms (SNPs), alternative splice variants, and complementary sequences as well as the sequence explicitly indicated. Specifically, degenerate codon substitutions may be achieved by generating sequences in which the third position of one or more selected (or all) codons is substituted with mixed-base and/or deoxyinosine residues. The term nucleic acid is used interchangeably with locus, gene, cDNA, and mRNA encoded by a gene. The term also may include, as equivalents, derivatives, variants and analogs of RNA or DNA synthesized from nucleotide analogs, single-stranded (“sense” or “antisense,” “plus” strand or “minus” strand, “forward” reading frame or “reverse” reading frame) and double-stranded polynucleotides. The term “gene” refers to a section of DNA involved in producing a polypeptide chain; and generally includes regions preceding and following the coding region (leader and trailer) involved in the transcription/translation of the gene product and the regulation of the transcription/translation, as well as intervening sequences (introns) between individual coding regions (exons). 
A nucleotide or base generally refers to the purine and pyrimidine molecular units of nucleic acid (e.g., adenine (A), thymine (T), guanine (G), and cytosine (C)). For RNA, the base thymine is replaced with uracil. Nucleic acid length or size may be expressed as a number of bases.
Nucleic acid analyzed by the methods and systems described herein may be from a category or collection of nucleic acids. For example, nucleic acid may be from a genome, a transcriptome, a library (e.g., a DNA library (a genomic DNA library, a cDNA library), an RNA library (an mRNA library)), a nucleic acid pool, and the like or combinations thereof. A genome generally refers to a complete list of nucleotides (A, C, G, and T) that make up the chromosomes of an individual or a species, and includes both the genes (coding regions) and noncoding DNA, and may include mitochondrial DNA. A transcriptome generally refers to a set of RNA transcripts, including coding and non-coding, in an individual or a population of cells, and sometimes refers to all RNAs, or just mRNA, depending on the context. Data obtained from a transcriptome may be used to analyze processes such as cellular differentiation, carcinogenesis, transcription regulation, and biomarker discovery, for example. The transcriptome is related to other “omes” such as, for example, the proteome, metabolome, translatome, exome, meiome, and thanatotranscriptome, which describe specific types of RNA transcripts. A nucleic acid library generally refers to a plurality of polynucleotide molecules (e.g., a sample of nucleic acids) that are prepared, assembled and/or modified for a specific process, non-limiting examples of which include immobilization on a solid phase (e.g., a solid support, a flow cell, a bead), enrichment, amplification, cloning, detection and/or for nucleic acid sequencing. In certain embodiments, a nucleic acid library is prepared prior to or during a sequencing process. A nucleic acid library (e.g., sequencing library) can be prepared by a suitable method as known in the art. A nucleic acid library can be prepared by a targeted or a non-targeted preparation process.
Polypeptides may be analyzed using the methods and systems described herein. A polypeptide generally refers to a polymer, linked by peptide bonds, that has a sequence of amino acids encoded by a polynucleotide. A polypeptide may refer to a linear sequence of amino acids and/or may refer to a three-dimensional protein. Proteins or portions thereof (e.g., a subunit of a protein) are generally made up of polypeptides. A peptide generally refers to a portion or fragment of a larger polypeptide. In some instances, a peptide refers to a polymer containing about 2 amino acids to about 10 amino acids, about 2 amino acids to about 20 amino acids, or about 2 amino acids to about 30 amino acids. Peptides may include, for example, dipeptides, tripeptides, tetrapeptides, and oligopeptides. Amino acids that have been incorporated into peptides and/or polypeptides may be referred to as residues. Peptides and polypeptides typically have an N-terminal (amine group) residue at one end and C-terminal (carboxyl group) residue at the opposite end, and amino acid sequences are typically read in the N-terminal to C-terminal direction.
Following are non-limiting examples of certain implementations of the technology.
The examples set forth below illustrate certain implementations and do not limit the technology.
The disease-associated isoform identification system described herein can perform genome-wide searches of thousands of genes in thousands of tumor and normal samples to find transcript isoforms with distinct peptide targets that are specific to cancer. The distinct peptides that are identified may be candidate targets for immunotherapy. One aspect that makes this system possible is the construction of a database. The construction of an example database useful for a disease-associated isoform identification system is described below.
Data from data sources described below was used to construct the database described herein.
The data for construction of the database described herein is RNASeq data and, in particular, transcript isoform expression data, which is finer grained than gene expression data. When a gene has multiple transcripts (isoforms) produced by variation in the transcript splicing patterns, isoform expression data provides expression values for each isoform of each gene. Expression levels were measured as a normalized value (i.e., Transcripts Per Million (TPM), computed using the RSEM algorithm) to support cross-transcript and cross-sample comparison.
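As a rough illustration of the TPM normalization mentioned above, the following sketch converts per-transcript read counts and effective lengths into TPM values. RSEM estimates transcript abundances with an expectation-maximization procedure before reporting TPM; this sketch skips that step and assumes the counts have already been assigned to transcripts. The function name, transcript identifiers, and input values are hypothetical.

```python
def counts_to_tpm(counts, lengths):
    """Convert per-transcript counts to Transcripts Per Million (TPM).

    counts  -- dict: transcript id -> estimated read count (assumed given)
    lengths -- dict: transcript id -> effective length in bases
    """
    # Length-normalized rate for each transcript.
    rates = {t: counts[t] / lengths[t] for t in counts}
    total = sum(rates.values())
    if total == 0:
        return {t: 0.0 for t in counts}
    # Scale so that the values in one sample sum to one million.
    return {t: rate / total * 1_000_000 for t, rate in rates.items()}


# Hypothetical sample with three isoforms of one gene.
tpm = counts_to_tpm(
    {"iso_a": 100, "iso_b": 300, "iso_c": 0},
    {"iso_a": 1000, "iso_b": 1500, "iso_c": 2000},
)
```

Because every sample is scaled to the same total, TPM values for one transcript can be compared across samples, and values for different transcripts can be compared within a sample.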
Differential analysis of isoform expression from all three sources above (TCGA, TARGET, GTEx) can accurately be performed if the isoform expression data was generated in a consistent manner using a common set of gene models (gene models can be thought of as the dictionary of transcript isoforms for all genes). The reprocessing of the RNASeq data from TCGA, TARGET, and GTEx by the Xena group at UCSC using the TOIL pipeline provided a single transcript expression dataset constructed with a common gene model, Gencode v23 (see e.g., World Wide Web Uniform Resource Locator xenabrowser.net/datapages/?hub=https://toil.xenahubs.net:443). The dataset used for the database described herein is located at World Wide Web Uniform Resource Locator xenabrowser.net/datapages/?cohort=TCGA%20TARGET%20GTEx&removeHub=https%3A%2F%2Fxena.treehouse.gi.ucsc.edu%3A443. The following RSEM TPM and TCGA TARGET GTEX selected phenotypes files were used: TcgaTargetGtex_rsem_isoform_tpm and TcgaTargetGTEX_phenotype.
Database Build Process
The following steps were performed to construct a SQLite relational database. SQLite is a relational database management system contained in a C library and embedded into the end program. Accordingly, the database described herein exists as a file on computers where the disease-associated isoform program is installed. The database is queried by Python webserver code using SQL.
See
The disease-associated isoform database described herein includes expression data for 54,000+ transcripts from 19,500+ genes. There are 10,500+ tumor samples and 7,400+ normal samples. These samples are from 38 different cancer types and include 47 normal tissue types. This is an extremely large dataset with 567+ million transcript expression values.
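A minimal sketch of how a SQLite database of this kind might be laid out and queried from Python is shown below. The table names, column names, and sample values are hypothetical and are not taken from the database described herein; an in-memory database is used only so the sketch is self-contained.

```python
import sqlite3

# In-memory database for illustration; the described database is a file on disk.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE transcript (transcript_id TEXT PRIMARY KEY, gene TEXT);
    CREATE TABLE sample (sample_id TEXT PRIMARY KEY, tissue TEXT, is_tumor INTEGER);
    CREATE TABLE expression (transcript_id TEXT, sample_id TEXT, tpm REAL);
    """
)
conn.executemany("INSERT INTO transcript VALUES (?, ?)",
                 [("T1", "GENE_A"), ("T2", "GENE_A")])
conn.executemany("INSERT INTO sample VALUES (?, ?, ?)",
                 [("S1", "lung adenocarcinoma", 1),
                  ("S2", "lung adenocarcinoma", 1),
                  ("N1", "lung", 0)])
conn.executemany("INSERT INTO expression VALUES (?, ?, ?)",
                 [("T1", "S1", 12.0), ("T1", "S2", 8.0), ("T1", "N1", 0.2),
                  ("T2", "S1", 1.0), ("T2", "S2", 0.5), ("T2", "N1", 6.0)])

# Highest expression of each transcript in tumor and in normal samples.
rows = conn.execute(
    """
    SELECT e.transcript_id,
           MAX(CASE WHEN s.is_tumor = 1 THEN e.tpm END) AS max_tumor,
           MAX(CASE WHEN s.is_tumor = 0 THEN e.tpm END) AS max_normal
    FROM expression e JOIN sample s ON s.sample_id = e.sample_id
    GROUP BY e.transcript_id
    ORDER BY e.transcript_id
    """
).fetchall()
```

In this toy data, transcript T1 is high in tumor and low in normal, while T2 shows the opposite pattern, which is the kind of contrast the application searches for.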
Cancer types in the database include: acute lymphoblastic leukemia, breast cancer, acute myeloid leukemia, adrenocortical cancer, bile duct cancer, bladder cancer, cervical cancer, colon and rectal cancer, endometrioid cancer, esophageal cancer, glioblastoma, head and neck cancer, kidney chromophobe, kidney clear cell carcinoma, kidney papillary cell carcinoma, large B-cell lymphoma, liver cancer, lower grade glioma, lower grade glioma and glioblastoma, lung adenocarcinoma, lung cancer, lung squamous cell carcinoma, melanoma, mesothelioma, ocular melanomas, ovarian cancer, pancreatic cancer, prostate cancer, sarcoma, stomach cancer, testicular cancer, thymoma, thyroid cancer, and uterine carcinosarcoma. Specific cancer types in the database include: acute lymphoblastic leukemia, acute myeloid leukemia, adrenocortical cancer, bladder urothelial carcinoma, brain lower grade glioma, breast invasive carcinoma, cervical & endocervical cancer, cholangiocarcinoma, clear cell sarcoma of the kidney, colon adenocarcinoma, diffuse large B-cell lymphoma, esophageal carcinoma, glioblastoma multiforme, head & neck squamous cell carcinoma, kidney chromophobe, kidney clear cell carcinoma, kidney papillary cell carcinoma, liver hepatocellular carcinoma, lung adenocarcinoma, lung squamous cell carcinoma, mesothelioma, neuroblastoma, ovarian serous cystadenocarcinoma, pancreatic adenocarcinoma, pheochromocytoma & paraganglioma, prostate adenocarcinoma, rectum adenocarcinoma, sarcoma, skin cutaneous melanoma, stomach adenocarcinoma, testicular germ cell tumor, thymoma, thyroid carcinoma, uterine carcinosarcoma, uterine corpus endometrioid carcinoma, uveal melanoma, and Wilms tumor.
Normal tissue types in the database include: adipose—subcutaneous, adipose—visceral (omentum), adrenal gland, artery—aorta, artery—coronary, artery—tibial, brain—amygdala, brain—anterior cingulate cortex (BA24), brain—caudate (basal ganglia), brain—cerebellar hemisphere, brain—cerebellum, brain—cortex, brain—frontal cortex (BA9), brain—hippocampus, brain—hypothalamus, brain—nucleus accumbens (basal ganglia), brain—putamen (basal ganglia), brain—spinal cord (cervical c-1), brain—substantia nigra, breast—mammary tissue, colon—sigmoid, colon—transverse, esophagus—gastroesophageal junction, esophagus—mucosa, esophagus—muscularis, heart—atrial appendage, heart—left ventricle, kidney—cortex, kidney—medulla, liver, lung, minor salivary gland, muscle—skeletal, nerve—tibial, ovary, pancreas, pituitary, prostate, skin—not sun exposed (suprapubic), skin—sun exposed (lower leg), small intestine—terminal ileum, spleen, stomach, testis, thyroid, uterus, vagina, whole blood, bladder, cervix—endocervix, cervix—ectocervix, fallopian tube, cells—EBV-transformed lymphocytes, and cells—cultured fibroblasts.
The disease-associated isoform program allows researchers to identify and investigate transcript isoforms that are strongly expressed in a specific tumor type or subset of a given tumor type and not expressed at high levels in any normal tissue. Further, the program can specifically identify isoforms in surface proteins with unique, targetable protein segments. This functionality would not be possible without some or all of the database build processes described above.
The disease-associated isoform database build process described in Example 1 is generally run once and the completed database can be distributed with the application when it is installed on a computer. The disease-associated isoform application is a dynamic, exploratory piece of software that allows researchers to explore the large volume of TCGA and TARGET cancer data to find cancer-specific peptides (e.g., that are candidates for immunotherapy targets). The disease-associated isoform application is implemented in Python, installed locally on a researcher's workstation, and functions as a local webserver that queries the SQLite database. Researchers use their local web browser to interact with the application. Rank-ordered numerical results are combined with visualizations to present high volume, complex data in an intuitive, interactive fashion that supports rapid analysis and discovery.
The disease-associated isoform application searches thousands of transcript isoforms in thousands of individuals (samples) to find isoforms that are strongly expressed in tumor cells but not expressed in normal tissue. Rather than comparing tumor to matched normal (e.g., lung adenocarcinoma to normal lung), the disease-associated isoform application described herein compares the tumor samples in the tissue of choice to ALL normal samples in the large GTEx collection. This allows identification of tumor-specific isoforms that are not expressed in any of the profiled normal tissues (e.g., to identify targets that would not have toxic off-target effects).
Further, the disease-associated isoform application described herein can identify sub-populations of patients that have strong expression of an isoform that is not expressed in normal tissues. If the selected tumor type occurs in large numbers, then a therapy for a subpopulation could be viable. The disease-associated isoform application also can filter out isoforms that do not produce surface proteins or those that do not have unique, targetable peptides. These filters can save a significant amount of investigator time.
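The search logic described above (a minimum tumor expression threshold, a maximum or median normal expression threshold, and a minimum percentage of tumor samples) can be sketched roughly as follows. This is an illustrative sketch over in-memory lists, not the application's actual query code; the function name, parameter names, and data are hypothetical.

```python
from statistics import median


def find_candidates(tumor, normal, min_tumor_tpm, max_normal_tpm,
                    min_pct, normal_stat="max"):
    """Return (transcript, percent) pairs that pass the three filters.

    tumor  -- dict: transcript id -> list of TPM values in the chosen tumor type
    normal -- dict: transcript id -> list of TPM values across ALL normal tissues
    """
    summarize = max if normal_stat == "max" else median
    hits = []
    for tid, tumor_values in tumor.items():
        # Normal expression must stay below the normal threshold.
        if summarize(normal[tid]) >= max_normal_tpm:
            continue
        # Percent of tumor samples above the tumor threshold.
        above = sum(1 for v in tumor_values if v > min_tumor_tpm)
        pct = 100.0 * above / len(tumor_values)
        if pct > min_pct:
            hits.append((tid, pct))
    # Rank by size of the expressing sub-population.
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Hypothetical data: four tumor samples and three normal samples per transcript.
tumor = {"T1": [12, 0.5, 9, 15], "T2": [20, 25, 30, 22], "T3": [11, 0.1, 0.2, 0.3]}
normal = {"T1": [0.1, 0.4, 0.0], "T2": [5.0, 0.1, 0.2], "T3": [0.0, 0.2, 0.1]}
strict = find_candidates(tumor, normal, 10, 1, 40)
relaxed = find_candidates(tumor, normal, 10, 1, 40, normal_stat="median")
```

With the maximum normal value, T2 is rejected because one normal sample reaches 5.0 TPM; with the median normal value, it passes. T3 is rejected in both cases because only one of its four tumor samples exceeds the tumor threshold.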
Query results are returned in just a few seconds on an ordinary laptop computer (e.g., a Mac laptop with 515 GB SSD, 16 GB DDR4 RAM, 8-core CPU running at 3.2 GHz, and 8-core GPU; queries may also be run on a computer with lower specifications). The database described herein generally takes up about 1 GB of hard drive space. An average laptop is typically capable of running the software, and a high-end computer typically is not needed. Without the precomputed database design described above, the type of queries used in the disease-associated isoform application generally would not be possible, because the database size, memory needed, and query response time would be untenable. For example, using just 10% of the data (e.g., about 50 GB with no precomputed database structure), a query took about 20 minutes to run using most of the resources available on a high-end computer. Queries against the full data set (i.e., with no precomputed database structure) would take hours or not finish at all. With the precomputed, binned thresholds (e.g., TPM of 1, 2, 3, . . . 10), the queries ran in less than 2 seconds. Accordingly, the precomputed database structure provides compact data and faster performance, and allows a user to query the database using an ordinary computer.
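One way to picture the precomputed, binned thresholds is sketched below. The sketch assumes that what is precomputed is, for each transcript, the percentage of samples whose TPM exceeds each integer threshold from 1 to 10; the text does not spell out the stored quantity, so this is an assumption, and all names and values are hypothetical.

```python
THRESHOLDS = range(1, 11)  # TPM bins 1, 2, ..., 10, as in the text


def precompute_bins(values):
    """Percent of samples with TPM above each binned threshold."""
    n = len(values)
    return {t: 100.0 * sum(1 for v in values if v > t) / n for t in THRESHOLDS}


# Build step (run once): one small row per transcript instead of one per sample.
tumor_tpm = {"T1": [12.0, 0.5, 9.0, 15.0], "T2": [2.5, 3.5, 0.0, 1.5]}
bins = {tid: precompute_bins(vals) for tid, vals in tumor_tpm.items()}


def query(bins, threshold, min_pct):
    """Query step: a lookup per transcript, with no pass over raw samples."""
    return sorted(tid for tid, b in bins.items() if b[threshold] > min_pct)
```

Under this assumption, a query such as `query(bins, 10, 40)` touches ten numbers per transcript rather than one value per sample, which is consistent with the compact size and short response times reported above.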
Gene Isoforms
The middle panel of
Multi Sequence Alignment
The bottom panel of
Normal Expression Box Plot
Tumor Expression Box Plot
The disease-associated isoform identification system described herein may include a protein analysis tool as a companion application. A protein analysis tool can provide “deep dive” information on a gene and/or transcript isoform. When a potential immunotherapy target is identified using the disease-associated isoform application described herein, and more details are desired, a protein analysis tool may be launched. The protein analysis tool was designed to integrate and automate multiple manually intensive tasks and to establish certain background checks for each candidate target. In the disease-associated isoform application results screen, a user can right click on a selected transcript to launch a protein analysis tool (see e.g.,
The first several sections in the example protein analysis tool described herein provide background information on the gene and linkouts to sources for additional information. Information known about isoforms is presented (see e.g.,
A sequence and structure component generally has three panels: 1) sequence panel, 2) domain panel, 3) and structure panel (see e.g.,
For every gene in the UniProt database, there is one canonical isoform (referred to as UniProt-canonical isoform). This UniProt-canonical isoform is used to link the three panels:
All alignment is done using a global alignment method implemented by pairwise2.align.globalms function of BioPython package, with a match score parameter of 5, the mismatch penalty parameter of −4, the gap opening penalty parameter of −5, and the gap extension penalty parameter of −0.1.
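The scoring scheme above can be illustrated without BioPython. The sketch below is a plain-Python global alignment with affine gap penalties (Gotoh's algorithm) that returns only the best score. It is not the BioPython implementation; it assumes that end gaps are penalized and that a gap of length k costs the opening penalty plus (k − 1) times the extension penalty. The function name and test sequences are hypothetical.

```python
NEG_INF = float("-inf")


def global_affine_score(a, b, match=5.0, mismatch=-4.0,
                        gap_open=-5.0, gap_extend=-0.1):
    """Best global alignment score of a and b with affine gap penalties."""
    n, m = len(a), len(b)
    # M: a[i-1] aligned to b[j-1]
    # X: a[i-1] aligned to a gap; Y: b[j-1] aligned to a gap
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    X = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0.0
    for i in range(1, n + 1):
        X[i][0] = gap_open + (i - 1) * gap_extend
    for j in range(1, m + 1):
        Y[0][j] = gap_open + (j - 1) * gap_extend
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1]) + s
            # Open a new gap from M or from the other gap state, or extend.
            X[i][j] = max(M[i - 1][j] + gap_open, Y[i - 1][j] + gap_open,
                          X[i - 1][j] + gap_extend)
            Y[i][j] = max(M[i][j - 1] + gap_open, X[i][j - 1] + gap_open,
                          Y[i][j - 1] + gap_extend)
    return max(M[n][m], X[n][m], Y[n][m])


# Hypothetical isoform fragments that differ by one deleted residue.
score = global_affine_score("MKTAYIAK", "MKTAIAK")
```

The small extension penalty relative to the opening penalty favors one long gap over several short ones, which suits isoforms that differ by a skipped or inserted exon.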
3D protein structures in the structure panel are visualized by 3Dmol.js (World Wide Web Uniform Resource Locator 3dmol.csb.pitt.edu/) library, which enables rotation, zooming in and out, and translation of 3D protein structures with mouse operations (see e.g.,
Using the mechanism for mapping among isoform sequences, protein domains, and 3D protein structures, the Sequence & Structure section enables users to do the following:
Additionally, the structure panel has the following capabilities:
The disease-associated isoform identification system described herein may include one or more modifications described below.
The disease-associated isoform identification system described herein can identify immunotherapy targets by comparing very large sets of tumor data to very large sets of normal data to identify gene expression isoforms that occur in cancer but not in normal tissues. In certain configurations, this comparison is done using the highest expression value observed in any normal tissue (top of the whisker plot). In certain configurations, the median normal value is used rather than the maximum normal value (see
The ratio method is another modification to the search path. Instead of performing searches using absolute thresholds (normal expression must be below XXX and tumor must be above YYY), a ratio of tumor to normal isoform expression is used (see
When searching for immunotherapy targets, certain configurations of the disease-associated isoform identification system described herein compare tumor expression to expression in all normal tissue. Targets are identified only if expression is very low in all normal tissues. There are cases, however, where some normal tissues can be ignored. For example, a normal tissue may be ignored if the therapy is for females and the expression in normal tissue occurs only in males (e.g., in a male-specific tissue like testis), or if the expression occurs in a relatively unimportant normal tissue. In certain configurations of the disease-associated isoform identification system described herein, researchers can select the normal tissues that are relevant for the therapy being designed. In the example shown in
Immunotherapy targets can be selected from transcript isoforms identified using the disease-associated isoform identification system described herein. As described herein, isoforms are different versions of mRNA sequences produced by the same gene. The differences are caused by alternative splicing of the mRNA prior to leaving the cell nucleus. To be a good immunotherapy target, the isoform typically needs a section of protein that is different from other isoforms. If the isoform is expressed predominantly in cancer and not normal tissue and has a section of its protein sequence that is unique to the isoform, then it could be a good target. In certain instances, isoforms have a completely unique section of protein. In other instances, an isoform differs from the main isoform of the gene but shares its protein sequence with one or more other isoforms. In this case it may be a viable target but a therapy developed for the isoform will also target the other isoforms with the same protein sequence.
A modified feature of the disease-associated isoform identification system described herein calculates an approximation of tumor and normal expression of all isoforms sharing a target sequence by combining the expression from all the other isoforms that share the target sequence.
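The combination of expression across isoforms that share a target sequence can be sketched as below. The source does not state how expression is combined; this sketch assumes a per-sample sum, identifies sharing isoforms by a simple substring match on protein sequence, and uses hypothetical isoform identifiers, sequences, and values.

```python
def combined_expression(expression_by_isoform, isoform_proteins, target_peptide):
    """Sum per-sample expression over all isoforms containing the target.

    expression_by_isoform maps an isoform identifier to a list of
    per-sample expression values (same sample order for every isoform).
    Summation is an assumption of this sketch.
    """
    # Isoforms whose protein sequence contains the target peptide.
    sharing = [iso for iso, protein in isoform_proteins.items()
               if target_peptide in protein]
    n_samples = len(next(iter(expression_by_isoform.values())))
    totals = [0.0] * n_samples
    for iso in sharing:
        for i, value in enumerate(expression_by_isoform[iso]):
            totals[i] += value
    return sharing, totals


# Illustrative data only: hypothetical isoforms and sequences.
proteins = {
    "ISO-1": "MKTAYIAKQR",
    "ISO-2": "MKTAYGGSQR",
    "ISO-3": "MAAYIAKQRL",
}
tumor_expression = {
    "ISO-1": [1.0, 2.0],
    "ISO-2": [5.0, 5.0],
    "ISO-3": [0.5, 1.5],
}

sharing, totals = combined_expression(tumor_expression, proteins, "AYIAKQR")
```

The same function could be applied to normal-tissue expression, so that the tumor and normal comparison is made on the combined values rather than on a single isoform.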
In the example shown in
The entirety of each patent, patent application, publication and document referenced herein is incorporated by reference. Citation of patents, patent applications, publications and documents is not an admission that any of the foregoing is pertinent prior art, nor does it constitute any admission as to the contents or date of these publications or documents. Their citation is not an indication of a search for relevant disclosures. All statements regarding the date(s) or contents of the documents are based on available information and are not an admission as to their accuracy or correctness.
The technology has been described with reference to specific implementations. The terms and expressions that have been utilized herein to describe the technology are descriptive and not necessarily limiting. Certain modifications made to the disclosed implementations can be considered within the scope of the technology. Certain aspects of the disclosed implementations suitably may be practiced in the presence or absence of certain elements not specifically disclosed herein.
Each of the terms “comprising,” “consisting essentially of,” and “consisting of” may be replaced with either of the other two terms. The term “a” or “an” can refer to one of or a plurality of the elements it modifies (e.g., “a reagent” can mean one or more reagents) unless it is contextually clear either one of the elements or more than one of the elements is described. The term “about” as used herein refers to a value within 10% of the underlying parameter (i.e., plus or minus 10%; e.g., a weight of “about 100 grams” can include a weight between 90 grams and 110 grams). Use of the term “about” at the beginning of a listing of values modifies each of the values (e.g., “about 1, 2 and 3” refers to “about 1, about 2 and about 3”). When a listing of values is described the listing includes all intermediate values and all fractional values thereof (e.g., the listing of values “80%, 85% or 90%” includes the intermediate value 86% and the fractional value 86.4%).
Certain implementations of the technology are set forth in the claim(s) that follow(s).
This patent application claims the benefit of U.S. provisional patent application No. 63/115,356 filed on Nov. 18, 2020, naming Leonardo Mirandola et al. as inventors, and designated by attorney docket number KIR-1003-PV. The entire content of the foregoing patent application is incorporated herein by reference for all purposes, including all text, tables and drawings.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/US21/58436 | 11/8/2021 | WO |
Number | Date | Country | |
---|---|---|---|
63115356 | Nov 2020 | US |