SARS-CoV2 is a coronavirus and causative agent of the COVID-19 pandemic. COVID-19 typically causes mild respiratory symptoms, but may escalate to acute respiratory distress syndrome (ARDs) with an increased risk of respiratory failure and death. However, the trajectory of disease progression and the status of affected tissues in COVID-19 patients has not been elucidated. A comprehensive analysis of gene expression data from the blood, lung, and airway of COVID-19 patients is performed to investigate the host response to SARS-CoV2 infection. Results indicate that COVID-19 pathogenesis is driven by populations of myeloid-lineage cells with highly inflammatory but distinct transcriptional signatures, suggesting a progression in activation from the periphery to the lung tissue. In addition, through analysis of immune cells and inflammatory pathways enriched in each compartment, a model of the systemic response to SARS-CoV2 is constructed, and therapeutics targeting key upstream regulators of pathways contributing to COVID-19 pathogenesis are identified.
In an aspect, the present disclosure provides a method of identifying one or more records having a specific phenotype, the method comprising: receiving a plurality of first records, wherein each first record is associated with one or more of a plurality of phenotypes; receiving a plurality of second records, wherein each second record is associated with one or more of the plurality of phenotypes, and wherein the plurality of second records and the plurality of first records are non-overlapping; applying a machine learning algorithm to at least one first record and at least one second record to determine a classifier; receiving a plurality of third records, wherein the third records are distinct from the plurality of first records and the plurality of second records; and applying the classifier to the plurality of third records to identify one or more third records associated with the specific phenotype.
In some embodiments, the first records and the second records comprise nucleic acid sequencing data, transcriptome data, genome data, epigenome data, proteome data, metabolome data, virome data, metabolome data, methylome data, lipidomic data, lineage-ome data, nucleosomal occupancy data, a genetic variant, a gene fusion, an insertion or deletion (indel), or any combination thereof. In some embodiments, the first records and the second records are in different formats. In some embodiments, the first records and the second records are from different sources, different studies, or both. In some embodiments, the phenotype comprises a disease state, an organ involvement, a medication response, or any combination thereof. In some embodiments, the classifier comprises an elastic generalized linear model classifier, a k-nearest neighbors classifier, a random forest classifier, or any combination thereof.
In some embodiments, the elastic generalized linear model classifier employs an elastic penalty of about 0.8 to about 1. In some embodiments, the elastic generalized linear model classifier employs an elastic penalty of at least about 0.8, about 0.825, about 0.85, about 0.875, about 0.9, about 0.925, about 0.95, about 0.975, or about 1. In some embodiments, the elastic generalized linear model classifier employs an elastic penalty of at most about 0.8, about 0.825, about 0.85, about 0.875, about 0.9, about 0.925, about 0.95, about 0.975, or about 1. In some embodiments, the elastic generalized linear model classifier employs an elastic penalty of about 0.8 to about 0.825, about 0.8 to about 0.85, about 0.8 to about 0.875, about 0.8 to about 0.9, about 0.8 to about 0.925, about 0.8 to about 0.95, about 0.8 to about 0.975, about 0.8 to about 1, about 0.825 to about 0.85, about 0.825 to about 0.875, about 0.825 to about 0.9, about 0.825 to about 0.925, about 0.825 to about 0.95, about 0.825 to about 0.975, about 0.825 to about 1, about 0.85 to about 0.875, about 0.85 to about 0.9, about 0.85 to about 0.925, about 0.85 to about 0.95, about 0.85 to about 0.975, about 0.85 to about 1, about 0.875 to about 0.9, about 0.875 to about 0.925, about 0.875 to about 0.95, about 0.875 to about 0.975, about 0.875 to about 1, about 0.9 to about 0.925, about 0.9 to about 0.95, about 0.9 to about 0.975, about 0.9 to about 1, about 0.925 to about 0.95, about 0.925 to about 0.975, about 0.925 to about 1, about 0.95 to about 0.975, about 0.95 to about 1, or about 0.975 to about 1. In some embodiments, the elastic generalized linear model classifier employs an elastic penalty of about 0.8, about 0.825, about 0.85, about 0.875, about 0.9, about 0.925, about 0.95, about 0.975, or about 1.
In some embodiments, the k-nearest neighbors classifier employs a K value of the size of the plurality of distinct first data sets, wherein k is about 1 to about 20. In some embodiments, the k-nearest neighbors classifier employs a K value of the size of the plurality of distinct first data sets, wherein k is at least about 1, about 2, about 3, about 4, about 5, about 6, about 8, about 10, about 12, about 14, about 16, or about 20. In some embodiments, the k-nearest neighbors classifier employs a K value of the size of the plurality of distinct first data sets, wherein k is at most about 1, about 2, about 3, about 4, about 5, about 6, about 8, about 10, about 12, about 14, about 16, or about 20. In some embodiments, the k-nearest neighbors classifier employs a K value of the size of the plurality of distinct first data sets, wherein k is about 1 to about 2, about 1 to about 3, about 1 to about 4, about 1 to about 5, about 1 to about 6, about 1 to about 8, about 1 to about 10, about 1 to about 12, about 1 to about 14, about 1 to about 16, about 1 to about 20, about 2 to about 3, about 2 to about 4, about 2 to about 5, about 2 to about 6, about 2 to about 8, about 2 to about 10, about 2 to about 12, about 2 to about 14, about 2 to about 16, about 2 to about 20, about 3 to about 4, about 3 to about 5, about 3 to about 6, about 3 to about 8, about 3 to about 10, about 3 to about 12, about 3 to about 14, about 3 to about 16, about 3 to about 20, about 4 to about 5, about 4 to about 6, about 4 to about 8, about 4 to about 10, about 4 to about 12, about 4 to about 14, about 4 to about 16, about 4 to about 20, about 5 to about 6, about 5 to about 8, about 5 to about 10, about 5 to about 12, about 5 to about 14, about 5 to about 16, about 5 to about 20, about 6 to about 8, about 6 to about 10, about 6 to about 12, about 6 to about 14, about 6 to about 16, about 6 to about 20, about 8 to about 10, about 8 to about 12, about 8 to about 14, about 8 to about 16, about 8 to about 20, about 10 to about 12, about 10 to about 14, about 10 to about 16, about 10 to about 20, about 12 to about 14, about 12 to about 16, about 12 to about 20, about 14 to about 16, about 14 to about 20, or about 16 to about 20. In some embodiments, the k-nearest neighbors classifier employs a K value of the size of the plurality of distinct first data sets, wherein k is about 1, about 2, about 3, about 4, about 5, about 6, about 8, about 10, about 12, about 14, about 16, or about 20.
In some embodiments, the K-value of the random forest classifier is incremented by 1 if the k-value is an even number. In some embodiments, applying a machine learning algorithm to the third data set comprises applying a machine learning algorithm to a plurality of unique third data sets.
In some embodiments, the classifier identifies said one or more third records associated with the specific phenotype with an accuracy of about 70% to about 100%. In some embodiments, the classifier identifies said one or more third records associated with the specific phenotype with an accuracy of at least about 70%, about 75%, about 80%, about 85%, about 90%, about 95%, or about 100%. In some embodiments, the classifier identifies said one or more third records associated with the specific phenotype with an accuracy of at most about 70%, about 75%, about 80%, about 85%, about 90%, about 95%, or about 100%. In some embodiments, the classifier identifies said one or more third records associated with the specific phenotype with an accuracy of about 70% to about 75%, about 70% to about 80%, about 70% to about 85%, about 70% to about 90%, about 70% to about 95%, about 70% to about 100%, about 75% to about 80%, about 75% to about 85%, about 75% to about 90%, about 75% to about 95%, about 75% to about 100%, about 80% to about 85%, about 80% to about 90%, about 80% to about 95%, about 80% to about 100%, about 85% to about 90%, about 85% to about 95%, about 85% to about 100%, about 90% to about 95%, about 90% to about 100%, or about 95% to about 100%. In some embodiments, the classifier identifies said one or more third records associated with the specific phenotype with an accuracy of about 70%, about 75%, about 80%, about 85%, about 90%, about 95%, or about 100%.
In some embodiments, the classifier identifies said one or more third records associated with the specific phenotype with an accuracy of about 70% to about 100%. In some embodiments, the classifier identifies said one or more third records associated with the specific phenotype with an accuracy of at least about 70%, about 75%, about 80%, about 85%, about 90%, about 95%, or about 100%. In some embodiments, the classifier identifies said one or more third records associated with the specific phenotype with an accuracy of at most about 70%, about 75%, about 80%, about 85%, about 90%, about 95%, or about 100%. In some embodiments, the classifier identifies said one or more third records associated with the specific phenotype with an accuracy of about 70% to about 75%, about 70% to about 80%, about 70% to about 85%, about 70% to about 90%, about 70% to about 95%, about 70% to about 100%, about 75% to about 80%, about 75% to about 85%, about 75% to about 90%, about 75% to about 95%, about 75% to about 100%, about 80% to about 85%, about 80% to about 90%, about 80% to about 95%, about 80% to about 100%, about 85% to about 90%, about 85% to about 95%, about 85% to about 100%, about 90% to about 95%, about 90% to about 100%, or about 95% to about 100%. In some embodiments, the classifier identifies said one or more third records associated with the specific phenotype with an accuracy of about 70%, about 75%, about 80%, about 85%, about 90%, about 95%, or about 100%.
In some embodiments, the classifier herein enables a specific phenotype association sensitivity of about 70% to about 100%. In some embodiments, the classifier herein enables a specific phenotype association sensitivity of at least 70%, about 75%, about 80%, about 85%, about 90%, about 95%, or about 100%. In some embodiments, the classifier herein enables a specific phenotype association sensitivity of at most 70%, about 75%, about 80%, about 85%, about 90%, about 95%, or about 100%. In some embodiments, the classifier herein enables a specific phenotype association sensitivity of about 70% to about 75%, about 70% to about 80%, about 70% to about 85%, about 70% to about 90%, about 70% to about 95%, about 70% to about 100%, about 75% to about 80%, about 75% to about 85%, about 75% to about 90%, about 75% to about 95%, about 75% to about 100%, about 80% to about 85%, about 80% to about 90%, about 80% to about 95%, about 80% to about 100%, about 85% to about 90%, about 85% to about 95%, about 85% to about 100%, about 90% to about 95%, about 90% to about 100%, or about 95% to about 100%. In some embodiments, the classifier herein enables a specific phenotype association sensitivity of about 70%, about 75%, about 80%, about 85%, about 90%, about 95%, or about 100%.
In some embodiments, the classifier herein enables a specific phenotype association specificity of about 70% to about 100%. In some embodiments, the classifier herein enables a specific phenotype association specificity of at least 70%, about 75%, about 80%, about 85%, about 90%, about 95%, or about 100%. In some embodiments, the classifier herein enables a specific phenotype association specificity of at most 70%, about 75%, about 80%, about 85%, about 90%, about 95%, or about 100%. In some embodiments, the classifier herein enables a specific phenotype association specificity of about 70% to about 75%, about 70% to about 80%, about 70% to about 85%, about 70% to about 90%, about 70% to about 95%, about 70% to about 100%, about 75% to about 80%, about 75% to about 85%, about 75% to about 90%, about 75% to about 95%, about 75% to about 100%, about 80% to about 85%, about 80% to about 90%, about 80% to about 95%, about 80% to about 100%, about 85% to about 90%, about 85% to about 95%, about 85% to about 100%, about 90% to about 95%, about 90% to about 100%, or about 95% to about 100%. In some embodiments, the classifier herein enables a specific phenotype association specificity of about 70%, about 75%, about 80%, about 85%, about 90%, about 95%, or about 100%.
In some embodiments, the method further comprises filtering the first records, the second records, or both. In some embodiments, the filtering comprises removing outliers, removing background noise, removing data without annotation data, normalizing, scaling, variance correcting, Weighted Gene Co-expression Network Analysis, enrichment analysis, dimensionality reduction, or any combination thereof. In some embodiments, the normalizing is performed by Robust Multi-Array Analysis (RMA), Guanine Cytosine Robust Multi-Array Analysis (GCRMA), Linear Models for Microarray Data, variance stabilizing transformation (VST), normal-exponential quantile correction (NEQC), or any combination thereof. In some embodiments, the variance correction comprises employing a local empirical Bayesian shrinkage, adjusting the p-values for multiple hypothesis testing using the Benjamini-Hochberg correction, and removing all data with a set false discovery rate
In some embodiments, the false discovery rate is about 0.000001 to about 0.2. In some embodiments, the false discovery rate is at least about 0.000001. In some embodiments, the false discovery rate is at most about 0.2. In some embodiments, the false discovery rate is about 0.000001 to about 0.00005, about 0.000001 to about 0.00001, about 0.000001 to about 0.0005, about 0.000001 to about 0.0001, about 0.000001 to about 0.005, about 0.000001 to about 0.001, about 0.000001 to about 0.05, about 0.000001 to about 0.01, about 0.000001 to about 0.2, about 0.00005 to about 0.00001, about 0.00005 to about 0.0005, about 0.00005 to about 0.0001, about 0.00005 to about 0.005, about 0.00005 to about 0.001, about 0.00005 to about 0.05, about 0.00005 to about 0.01, about 0.00005 to about 0.2, about 0.00001 to about 0.0005, about 0.00001 to about 0.0001, about 0.00001 to about 0.005, about 0.00001 to about 0.001, about 0.00001 to about 0.05, about 0.00001 to about 0.01, about 0.00001 to about 0.2, about 0.0005 to about 0.0001, about 0.0005 to about 0.005, about 0.0005 to about 0.001, about 0.0005 to about 0.05, about 0.0005 to about 0.01, about 0.0005 to about 0.2, about 0.0001 to about 0.005, about 0.0001 to about 0.001, about 0.0001 to about 0.05, about 0.0001 to about 0.01, about 0.0001 to about 0.2, about 0.005 to about 0.001, about 0.005 to about 0.05, about 0.005 to about 0.01, about 0.005 to about 0.2, about 0.001 to about 0.05, about 0.001 to about 0.01, about 0.001 to about 0.2, about 0.05 to about 0.01, about 0.05 to about 0.2, or about 0.01 to about 0.2. In some embodiments, the false discovery rate is about 0.000001, about 0.00005, about 0.00001, about 0.0005, about 0.0001, about 0.005, about 0.001, about 0.05, about 0.01, or about 0.2.
In some embodiments, the Weighted Gene Co-expression Network Analysis comprises calculating a topology matrix, clustering the data based on the topology matrix, and correlating module eigenvalues for traits on a linear scale by Pearson correlation, for nonparametric traits by Spearman correlation, and for dichotomous traits by point-biserial correlation or t-test. The Pearson correlation or the Product Moment Correlation Coefficient (PMCC), is a number between −1 and 1 that indicates the extent to which two variables are linearly related. The Spearman correlation is a nonparametric measure of rank correlation; statistical dependence between the rankings of two variables.
In some embodiments, the one or more records having a specific phenotype correspond to one or more subjects, and the method further comprises identifying the one or more subjects as (i) having a diagnosis of a disease state or condition, (ii) having a prognosis of a disease state or condition, (iii) being suitable or not suitable for enrollment in a clinical trial for a disease state or condition, (iv) being suitable or not suitable for being administered a therapeutic regimen configured to treat a disease state or condition, (v) having an efficacy or not having an efficacy of a therapeutic regimen configured to treat a disease state or condition, based at least in part on the specific phenotype corresponding to the one or more subjects.
In another aspect, the present disclosure provides a non-transitory computer-readable storage media encoded with a computer program including instructions executable by a processor to create an application for identifying one or more records having a specific phenotype, the application comprising: a first receiving module receiving a plurality of first records, wherein each first record is associated with one or more of a plurality of phenotypes; a second receiving module receiving a plurality of second records, wherein each second record is associated with one or more of the plurality of phenotypes, and wherein the plurality of second records and the plurality of first records are non-overlapping; a machine learning module applying a machine learning algorithm to at least one first record and at least one second record to determine a classifier; a third receiving module receiving a plurality of third records, wherein the third records are distinct from the plurality of first records and the plurality of second records; and a classifying module applying the classifier to the plurality of third records to identify one or more third records associated with the specific phenotype.
In some embodiments, the first records and the second records comprise nucleic acid sequencing data, transcriptome data, genome data, epigenome data, proteome data, metabolome data, virome data, metabolome data, methylome data, lipidomic data, lineage-ome data, nucleosomal occupancy data, a genetic variant, a gene fusion, an insertion or deletion (indel), or any combination thereof. In some embodiments, the first records and the second records are in different formats. In some embodiments, the first records and the second records are from different sources, different studies, or both. In some embodiments, the phenotype comprises a disease state, an organ involvement, a medication response, or any combination thereof. In some embodiments, the classifier comprises an elastic generalized linear model classifier, a k-nearest neighbors classifier, a random forest classifier, or any combination thereof. In some embodiments, the elastic generalized linear model classifier employs an elastic penalty of about 0.9. In some embodiments, the k-nearest neighbors classifier employs a K-value of about 5% of the size of the plurality of distinct first data sets. In some embodiments, the K-value of the random forest classifier is incremented by 1 if the k-value is an even number. In some embodiments, applying a machine learning algorithm to the third data set comprises applying a machine learning algorithm to a plurality of unique third data sets. In some embodiments, said classifier identifies said one or more third records associated with the specific phenotype with an accuracy of at least about 70%. In some embodiments, the method further comprises filtering the first records, the second records, or both. In some embodiments, the filtering comprises removing outliers, removing background noise, removing data without annotation data, normalizing, scaling, variance correcting, Weighted Gene Co-expression Network Analysis, enrichment analysis, dimensionality reduction, or any combination thereof. In some embodiments, the normalizing is performed by Robust Multi-Array Analysis (RMA), Guanine Cytosine Robust Multi-Array Analysis (GCRMA), Linear Models for Microarray Data, variance stabilizing transformation (VST), normal-exponential quantile correction (NEQC), or any combination thereof. In some embodiments, the variance correction comprises employing a local empirical Bayesian shrinkage, adjusting the p-values for multiple hypothesis testing using the Benjamini-Hochberg correction, and removing all data with a false discovery rate of less than 0.2. In some embodiments, the Weighted Gene Co-expression Network Analysis comprises calculating a topology matrix, clustering the data based on the topology matrix, and correlating module eigenvalues for traits on a linear scale by Pearson correlation, for nonparametric traits by Spearman correlation, and for dichotomous traits by point-biserial correlation or t-test.
In another aspect, the present disclosure provides a method for identifying a disease state or a susceptibility thereof of a subject, comprising: (a) using an assay to process a biological sample derived from the subject to generate a quantitative measure of each of a plurality of disease-associated genomic loci; (b) processing the dataset to identify the disease state or the susceptibility thereof of the subject; and (c) electronically outputting a report indicative of the disease state or the susceptibility thereof of the subject.
In some embodiments, the plurality of quantitative measures comprises gene expression measurements. In some embodiments, the disease state comprises an active COVID-19 condition or an inactive COVID-19 condition.
In another aspect, the present disclosure provides a computer-implemented method for assessing a disease state or condition of a subject, comprising: (a) receiving a dataset of a biological sample of the subject; (b) selecting one or more data analysis tools, wherein the one or more data analysis tools comprise an analysis tool selected from the group consisting of: a BIG-C™ big data analysis tool, an I-Scope™ big data analysis tool, a T-Scope™ big data analysis tool, a CellScan big data analysis tool, an MS (Molecular Signature) Scoring™ analysis tool, a Gene Set Variation Analysis (GSVA) tool (e.g., P-Scope), and a combination thereof; (c) processing the dataset using the one or more data analysis tools to generate a data signature of the biological sample of the subject; and (d) based at least in part on the data signature generated in (c), assessing the disease state or condition of the subject.
In some embodiments, the dataset comprises gene expression measurements. In some embodiments, the disease state comprises an active COVID-19 condition or an inactive COVID-19 condition.
In some embodiments, the dataset comprises mRNA gene expression or transcriptome data, DNA genomic data, proteomic data, metabolomic data, or a combination thereof. In some embodiments, the biological sample is selected from the group consisting of: a whole blood (WB) sample, a PBMC sample, a tissue sample, and a cell sample. In some embodiments, assessing the condition of the subject comprises identifying a disease or disorder of the subject.
In some embodiments, the method further comprises identifying a disease or disorder of the subject at a sensitivity or specificity of at least about 70%. In some embodiments, the method further comprises determining a likelihood of the identification of the disease or disorder of the subject. In some embodiments, the method further comprises providing a therapeutic intervention for the disease or disorder of the subject. In some embodiments, the method further comprises monitoring the disease or disorder of the subject, wherein the monitoring comprises assessing the disease or disorder of the subject at a plurality of time points, wherein the assessing is based at least on the disease or disorder identified at each of the plurality of time points.
In some embodiments, selecting the one or more data analysis tools comprises receiving a user selection of the one or more data analysis tools. In some embodiments, selecting the one or more data analysis tools is automatically performed by the computer without receiving a user selection of the one or more data analysis tools.
In another aspect, the present disclosure provides a computer system for assessing a disease state or condition of a subject, comprising: a database that is configured to store a dataset of a biological sample of the subject; and one or more computer processors operatively coupled to the database, wherein the one or more computer processors are individually or collectively programmed to: (i) select one or more data analysis tools, wherein the one or more data analysis tools comprise an analysis tool selected from the group consisting of: a BIG-C™ big data analysis tool, an I-Scope™ big data analysis tool, a T-Scope™ big data analysis tool, a CellScan big data analysis tool, an MS (Molecular Signature) Scoring™ analysis tool, a Gene Set Variation Analysis (GSVA) tool (e.g., P-Scope), and a combination thereof; (ii) process the dataset using the one or more data analysis tools to generate a data signature of the biological sample of the subject; and (iii) based at least in part on the data signature generated in (ii), assess the disease state or condition of the subject.
In some embodiments, the dataset comprises gene expression measurements. In some embodiments, the disease state comprises an active COVID-19 condition or an inactive COVID-19 condition.
In another aspect, the present disclosure provides a non-transitory computer-readable medium comprising machine-executable code that, upon execution by one or more computer processors, implements a method for assessing a disease state or condition of a subject, the method comprising: (a) receiving a dataset of a biological sample of the subject; (b) selecting one or more data analysis tools, wherein the one or more data analysis tools comprise an analysis tool selected from the group consisting of: a BIG-C™ big data analysis tool, an I-Scope™ big data analysis tool, a T-Scope™ big data analysis tool, a CellScan big data analysis tool, an MS (Molecular Signature) Scoring™ analysis tool, a Gene Set Variation Analysis (GSVA) tool (e.g., P-Scope), and a combination thereof (c) processing the dataset using the one or more data analysis tools to generate a data signature of the biological sample of the subject; and (d) based at least in part on the data signature generated in (c), assessing the disease state or condition of the subject.
In some embodiments, the dataset comprises gene expression measurements. In some embodiments, the disease state comprises an active COVID-19 condition or an inactive COVID-19 condition.
In any embodiment described herein, the one or more data analysis tools can be a plurality of data analysis tools each independently selected from a BIG-C™ big data analysis tool, an I-Scope™ big data analysis tool, a T-Scope™ big data analysis tool, a CellScan big data analysis tool, an MS (Molecular Signature) Scoring™ analysis tool, a Gene Set Variation Analysis (GSVA) tool (e.g., P-Scope), and a combination thereof.
In another aspect, the present disclosure provides a method for determining a COVID-19 disease state of a subject, comprising: (a) assaying a biological sample obtained or derived from the subject to produce a data set comprising gene expression measurements of the biological sample at each of a plurality of COVID-19 disease-associated genomic loci, wherein the plurality of COVID-19 disease-associated genomic loci comprises at least a portion of a gene selected from the group of genes listed in Table 6 and Tables 7A-7C; (b) computer processing the data set to determine the COVID-19 disease state of the subject; and (c) electronically outputting a report indicative of the COVID-19 disease state of the subject.
In some embodiments, the plurality of COVID-19 disease-associated genomic loci comprises at least a portion of 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165, 170, 175, 180, 185, 190, 195, 200, 205, 210, 215, 220, 225, 230, 235, 240, 245, 250, 255, 260, 265, 270, 275, 280, 285, 290, 295, 300, 350, 400, 450, 500, 550, 600, 650, 700, 750, 800, 850, 900, 950, 1000, 1100, 1200, 1300, 1400, 1500, 1600, 1700, 1800, 1900, 2000, 2100, 2200, 2300, 2400, 2500, 2600, 2700, 2800, 2900, 3000, 3100, 3200, 3300, 3400, 3500, 3600, 3700, 3800, 3900, 4000, 4100, 4200, 4300, 4400, 4500, 4600, 4700, 4800, 4900, 5000, 5100, 5200, 5300, 5400, 5500, 5600, 5700, 5800, 5900, or 6000 genes selected from the group of genes listed in Table 6 and Tables 7A-7C.
In some embodiments, the method further comprises determining the COVID-19 disease state of the subject with an accuracy of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
In some embodiments, the method further comprises determining the COVID-19 disease state of the subject with a sensitivity of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
In some embodiments, the method further comprises determining the COVID-19 disease state of the subject with a specificity of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
In some embodiments, the method further comprises determining the COVID-19 disease state of the subject with a positive predictive value of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
In some embodiments, the method further comprises determining the COVID-19 disease state of the subject with a negative predictive value of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
In some embodiments, the method further comprises determining the COVID-19 disease state of the subject with an Area-Under-Curve (AUC) of at least about 0.50, at least about 0.55, at least about 0.60, at least about 0.65, at least about 0.70, at least about 0.75, at least about 0.80, at least about 0.85, at least about 0.90, at least about 0.91, at least about 0.92, at least about 0.93, at least about 0.94, at least about 0.95, at least about 0.96, at least about 0.97, at least about 0.98, at least about 0.99, or more than about 0.99.
In some embodiments, the subject has received a diagnosis of the COVID-19 disease. In some embodiments, the subject is suspected of having the COVID-19 disease. In some embodiments, the subject is at elevated risk of having the COVID-19 disease or having severe complications from the COVID-19 disease. In some embodiments, the subject is asymptomatic for the COVID-19 disease. In some embodiments, the method further comprises administering a treatment to the subject based at least in part on the determined COVID-19 disease state. In some embodiments, the treatment is configured to treat the COVID-19 disease state of the subject. In some embodiments, the treatment is configured to reduce a severity of the COVID-19 disease state of the subject. In some embodiments, the treatment is configured to reduce a risk of having the COVID-19 disease. In some embodiments, the treatment comprises a drug. In some embodiments, the drug is selected from the group listed in Tables 8A-8B.
In some embodiments, (b) comprises using a trained machine learning classifier to analyze the data set to determine the COVID-19 disease state of the subject. In some embodiments, the trained machine learning classifier is trained using gene expression data obtained by a data analysis tool selected from the group consisting of: a BIG-C™ big data analysis tool, an I-Scope™ big data analysis tool, a T-Scope™ big data analysis tool, a CellScan big data analysis tool, an MS (Molecular Signature) Scoring™ analysis tool, and a Gene Set Variation Analysis (GSVA) tool. In some embodiments, the trained machine learning classifier is selected from the group consisting of a linear regression, a logistic regression, a Ridge regression, a Lasso regression, an elastic net (EN) regression, a support vector machine (SVM), a gradient boosted machine (GBM), a k nearest neighbors (kNN), a generalized linear model (GLM), a naïve Bayes (NB) classifier, a neural network, a Random Forest (RF), a deep learning algorithm, and a combination thereof.
In some embodiments, (b) comprises comparing the data set to a reference data set. In some embodiments, the reference data set comprises gene expression measurements of reference biological samples at each of the plurality of COVID-19 disease-associated genomic loci. In some embodiments, the reference biological samples comprise a first plurality of biological samples obtained or derived from subjects having the COVID-19 disease and a second plurality of biological samples obtained or derived from subjects not having the COVID-19 disease.
In some embodiments, the biological sample is selected from the group consisting of: a blood sample, isolated peripheral blood mononuclear cells (PBMCs), a biopsy sample, and any derivative thereof.
In some embodiments, the method further comprises determining a likelihood of the determined COVID-19 disease state.
In some embodiments, the method further comprises monitoring the COVID-19 disease state of the subject, wherein the monitoring comprises assessing the COVID-19 disease state of the subject at a plurality of time points.
In some embodiments, a difference in the assessment of the COVID-19 disease state of the subject among the plurality of time points is indicative of one or more clinical indications selected from the group consisting of: (i) a diagnosis of the COVID-19 disease state of the subject, (ii) a prognosis of the COVID-19 disease state of the subject, and (iii) an efficacy or non-efficacy of a course of treatment for treating the COVID-19 disease state of the subject.
In another aspect, the present disclosure provides a computer system for determining a COVID-19 disease state of a subject, comprising: a database that is configured to store a dataset comprising gene expression data, wherein the gene expression data is obtained by assaying a biological sample obtained or derived from the subject to produce gene expression measurements of the biological sample at each of a plurality of COVID-19 disease-associated genomic loci, wherein the plurality of COVID-19 disease-associated genomic loci comprises at least a portion of a gene selected from the group of genes listed in Table 6 and Tables 7A-7C; and one or more computer processors operatively coupled to the database, wherein the one or more computer processors are individually or collectively programmed to: (i) computer process the data set to determine the COVID-19 disease state of the subject; (ii) electronically output a report indicative of the COVID-19 disease state of the subject.
In some embodiments, the plurality of COVID-19 disease-associated genomic loci comprises at least a portion of 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165, 170, 175, 180, 185, 190, 195, 200, 205, 210, 215, 220, 225, 230, 235, 240, 245, 250, 255, 260, 265, 270, 275, 280, 285, 290, 295, 300, 350, 400, 450, 500, 550, 600, 650, 700, 750, 800, 850, 900, 950, 1000, 1100, 1200, 1300, 1400, 1500, 1600, 1700, 1800, 1900, 2000, 2100, 2200, 2300, 2400, 2500, 2600, 2700, 2800, 2900, 3000, 3100, 3200, 3300, 3400, 3500, 3600, 3700, 3800, 3900, 4000, 4100, 4200, 4300, 4400, 4500, 4600, 4700, 4800, 4900, 5000, 5100, 5200, 5300, 5400, 5500, 5600, 5700, 5800, 5900, or 6000 genes selected from the group of genes listed in Table 6 and Tables 7A-7C.
In some embodiments, the one or more computer processors are individually or collectively programmed to further determine the COVID-19 disease state of the subject with an accuracy of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
In some embodiments, the one or more computer processors are individually or collectively programmed to further determine the COVID-19 disease state of the subject with a sensitivity of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
In some embodiments, the one or more computer processors are individually or collectively programmed to further determine the COVID-19 disease state of the subject with a specificity of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
In some embodiments, the one or more computer processors are individually or collectively programmed to further determine the COVID-19 disease state of the subject with a positive predictive value of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
In some embodiments, the one or more computer processors are individually or collectively programmed to further determine the COVID-19 disease state of the subject with a negative predictive value of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
In some embodiments, the one or more computer processors are individually or collectively programmed to further determine the COVID-19 disease state of the subject with an Area-Under-Curve (AUC) of at least about 0.50, at least about 0.55, at least about 0.60, at least about 0.65, at least about 0.70, at least about 0.75, at least about 0.80, at least about 0.85, at least about 0.90, at least about 0.91, at least about 0.92, at least about 0.93, at least about 0.94, at least about 0.95, at least about 0.96, at least about 0.97, at least about 0.98, at least about 0.99, or more than about 0.99.
In some embodiments, the subject has received a diagnosis of the COVID-19 disease. In some embodiments, the subject is suspected of having the COVID-19 disease. In some embodiments, the subject is at elevated risk of having the COVID-19 disease or having severe complications from the COVID-19 disease. In some embodiments, the subject is asymptomatic for the COVID-19 disease. In some embodiments, the one or more computer processors are individually or collectively programmed to further direct a treatment to be administered to the subject based at least in part on the determined COVID-19 disease state. In some embodiments, the treatment is configured to treat the COVID-19 disease state of the subject. In some embodiments, the treatment is configured to reduce a severity of the COVID-19 disease state of the subject. In some embodiments, the treatment is configured to reduce a risk of having the COVID-19 disease. In some embodiments, the treatment comprises a drug. In some embodiments, the drug is selected from the group listed in Tables 8A-8B.
In some embodiments, (i) comprises using a trained machine learning classifier to analyze the data set to determine the COVID-19 disease state of the subject. In some embodiments, the trained machine learning classifier is trained using gene expression data obtained by a data analysis tool selected from the group consisting of: a BIG-C™ big data analysis tool, an I-Scope™ big data analysis tool, a T-Scope™ big data analysis tool, a CellScan big data analysis tool, an MS (Molecular Signature) Scoring™ analysis tool, and a Gene Set Variation Analysis (GSVA) tool. In some embodiments, the trained machine learning classifier is selected from the group consisting of a linear regression, a logistic regression, a Ridge regression, a Lasso regression, an elastic net (EN) regression, a support vector machine (SVM), a gradient boosted machine (GBM), a k nearest neighbors (kNN), a generalized linear model (GLM), a naïve Bayes (NB) classifier, a neural network, a Random Forest (RF), a deep learning algorithm, and a combination thereof.
In some embodiments, (i) comprises comparing the data set to a reference data set. In some embodiments, the reference data set comprises gene expression measurements of reference biological samples at each of the plurality of COVID-19 disease-associated genomic loci. In some embodiments, the reference biological samples comprise a first plurality of biological samples obtained or derived from subjects having the COVID-19 disease and a second plurality of biological samples obtained or derived from subjects not having the COVID-19 disease.
In some embodiments, the biological sample is selected from the group consisting of: a blood sample, isolated peripheral blood mononuclear cells (PBMCs), a biopsy sample, and any derivative thereof.
In some embodiments, the one or more computer processors are individually or collectively programmed to further determine a likelihood of the determined COVID-19 disease state.
In some embodiments, the one or more computer processors are individually or collectively programmed to further monitor the COVID-19 disease state of the subject, wherein the monitoring comprises assessing the COVID-19 disease state of the subject at a plurality of time points.
In some embodiments, a difference in the assessment of the COVID-19 disease state of the subject among the plurality of time points is indicative of one or more clinical indications selected from the group consisting of: (i) a diagnosis of the COVID-19 disease state of the subject, (ii) a prognosis of the COVID-19 disease state of the subject, and (iii) an efficacy or non-efficacy of a course of treatment for treating the COVID-19 disease state of the subject.
In another aspect, the present disclosure provides a non-transitory computer readable medium comprising machine-executable code that, upon execution by one or more computer processors, implements a method for determining a COVID-19 disease state of a subject, the method comprising: (a) assaying a biological sample obtained or derived from the subject to produce a data set comprising gene expression measurements of the biological sample at each of a plurality of COVID-19 disease-associated genomic loci, wherein the plurality of COVID-19 disease-associated genomic loci comprises at least a portion of a gene selected from the group of genes listed in Table 6 and Tables 7A-7C; (b) computer processing the data set to determine the COVID-19 disease state of the subject; and (c) electronically outputting a report indicative of the COVID-19 disease state of the subject.
In some embodiments, the plurality of COVID-19 disease-associated genomic loci comprises at least a portion of 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165, 170, 175, 180, 185, 190, 195, 200, 205, 210, 215, 220, 225, 230, 235, 240, 245, 250, 255, 260, 265, 270, 275, 280, 285, 290, 295, 300, 350, 400, 450, 500, 550, 600, 650, 700, 750, 800, 850, 900, 950, 1000, 1100, 1200, 1300, 1400, 1500, 1600, 1700, 1800, 1900, 2000, 2100, 2200, 2300, 2400, 2500, 2600, 2700, 2800, 2900, 3000, 3100, 3200, 3300, 3400, 3500, 3600, 3700, 3800, 3900, 4000, 4100, 4200, 4300, 4400, 4500, 4600, 4700, 4800, 4900, 5000, 5100, 5200, 5300, 5400, 5500, 5600, 5700, 5800, 5900, or 6000 genes selected from the group of genes listed in Table 6 and Tables 7A-7C.
In some embodiments, the method further comprises determining the COVID-19 disease state of the subject with an accuracy of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
In some embodiments, the method further comprises determining the COVID-19 disease state of the subject with a sensitivity of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
In some embodiments, the method further comprises determining the COVID-19 disease state of the subject with a specificity of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
In some embodiments, the method further comprises determining the COVID-19 disease state of the subject with a positive predictive value of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
In some embodiments, the method further comprises determining the COVID-19 disease state of the subject with a negative predictive value of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
In some embodiments, the method further comprises determining the COVID-19 disease state of the subject with an Area-Under-Curve (AUC) of at least about 0.50, at least about 0.55, at least about 0.60, at least about 0.65, at least about 0.70, at least about 0.75, at least about 0.80, at least about 0.85, at least about 0.90, at least about 0.91, at least about 0.92, at least about 0.93, at least about 0.94, at least about 0.95, at least about 0.96, at least about 0.97, at least about 0.98, at least about 0.99, or more than about 0.99.
In some embodiments, the subject has received a diagnosis of the COVID-19 disease. In some embodiments, the subject is suspected of having the COVID-19 disease. In some embodiments, the subject is at elevated risk of having the COVID-19 disease or having severe complications from the COVID-19 disease. In some embodiments, the subject is asymptomatic for the COVID-19 disease. In some embodiments, the method further comprises administering a treatment to the subject based at least in part on the determined COVID-19 disease state. In some embodiments, the treatment is configured to treat the COVID-19 disease state of the subject. In some embodiments, the treatment is configured to reduce a severity of the COVID-19 disease state of the subject. In some embodiments, the treatment is configured to reduce a risk of having the COVID-19 disease. In some embodiments, the treatment comprises a drug. In some embodiments, the drug is selected from the group listed in Tables 8A-8B.
In some embodiments, (b) comprises using a trained machine learning classifier to analyze the data set to determine the COVID-19 disease state of the subject. In some embodiments, the trained machine learning classifier is trained using gene expression data obtained by a data analysis tool selected from the group consisting of: a BIG-C™ big data analysis tool, an I-Scope™ big data analysis tool, a T-Scope™ big data analysis tool, a CellScan big data analysis tool, an MS (Molecular Signature) Scoring™ analysis tool, and a Gene Set Variation Analysis (GSVA) tool. In some embodiments, the trained machine learning classifier is selected from the group consisting of a linear regression, a logistic regression, a Ridge regression, a Lasso regression, an elastic net (EN) regression, a support vector machine (SVM), a gradient boosted machine (GBM), a k nearest neighbors (kNN), a generalized linear model (GLM), a naïve Bayes (NB) classifier, a neural network, a Random Forest (RF), a deep learning algorithm, and a combination thereof.
In some embodiments, (b) comprises comparing the data set to a reference data set. In some embodiments, the reference data set comprises gene expression measurements of reference biological samples at each of the plurality of COVID-19 disease-associated genomic loci. In some embodiments, the reference biological samples comprise a first plurality of biological samples obtained or derived from subjects having the COVID-19 disease and a second plurality of biological samples obtained or derived from subjects not having the COVID-19 disease.
In some embodiments, the biological sample is selected from the group consisting of: a blood sample, isolated peripheral blood mononuclear cells (PBMCs), a biopsy sample, and any derivative thereof.
In some embodiments, the method further comprises determining a likelihood of the determined COVID-19 disease state.
In some embodiments, the method further comprises monitoring the COVID-19 disease state of the subject, wherein the monitoring comprises assessing the COVID-19 disease state of the subject at a plurality of time points.
In some embodiments, a difference in the assessment of the COVID-19 disease state of the subject among the plurality of time points is indicative of one or more clinical indications selected from the group consisting of: (i) a diagnosis of the COVID-19 disease state of the subject, (ii) a prognosis of the COVID-19 disease state of the subject, and (iii) an efficacy or non-efficacy of a course of treatment for treating the COVID-19 disease state of the subject.
Another aspect of the present disclosure provides a non-transitory computer-readable medium comprising machine executable code that, upon execution by one or more computer processors, implements any of the methods above or elsewhere herein.
Another aspect of the present disclosure provides a system comprising one or more computer processors and computer memory coupled thereto. The computer memory comprises machine executable code that, upon execution by the one or more computer processors, implements any of the methods above or elsewhere herein.
Additional aspects and advantages of the present disclosure will become readily apparent to those skilled in this art from the following detailed description, wherein only illustrative embodiments of the present disclosure are shown and described. As will be realized, the present disclosure is capable of other and different embodiments, and its several details are capable of modifications in various obvious respects, all without departing from the disclosure. Accordingly, the drawings and description are to be regarded as illustrative in nature, and not as restrictive.
The patent application file contains at least one drawing executed in color. Copies of this patent application publication with color drawing(s) will be provided by the Office upon request and payment of the necessary fee.
The novel features of the disclosure are set forth with particularity in the appended claims. A better understanding of the features and advantages of the present disclosure will be obtained by reference to the following detailed description that sets forth illustrative embodiments, in which the principles of the disclosure are utilized, and the accompanying drawings of which:
Unless otherwise defined, all technical terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs.
As used herein, the singular forms “a,” “an,” and “the” include plural references unless the context clearly dictates otherwise. Any reference to “or” herein is intended to encompass “and/or” unless otherwise stated.
As used herein, the term “about” refers to an amount that is near the stated amount by 10%, 5%, or 1%, including increments therein.
As used herein, the phrases “at least one”, “one or more”, and “and/or” are open-ended expressions that are both conjunctive and disjunctive in operation. For example, each of the expressions “at least one of A, B and C”, “at least one of A, B, or C”, “one or more of A, B, and C”, “one or more of A, B, or C” and “A, B, and/or C” means A alone, B alone, C alone, A and B together, A and C together, B and C together, or A, B and C together.
As used herein, the singular forms “a,” “an,” and “the” include plural references unless the context clearly dictates otherwise. Any reference to “or” herein is intended to encompass “and/or” unless otherwise stated.
As used herein, the term “subject” refers to an entity or a medium that has testable or detectable genetic information. A subject can be a person, individual, or patient. A subject can be a vertebrate, such as, for example, a mammal. Non-limiting examples of mammals include humans, simians, farm animals, sport animals, rodents, and pets. The subject may be displaying a symptom(s) indicative of a health or physiological state or condition of the subject, such as a disease or disorder of the subject. As an alternative, the subject can be asymptomatic with respect to such health or physiological state or condition.
As used herein, the term “sample,” generally refers to a biological sample obtained from or derived from one or more subjects. Biological samples may be processed or fractionated before further analysis. Biological samples may include a whole blood (WB) sample, a PBMC sample, a tissue sample, a purified cell sample, or derivatives thereof. In some embodiments, a whole blood sample may be purified to obtain the purified cell sample. The term “derived from” used herein refers to an origin or source, and may include naturally occurring, recombinant, unpurified or purified molecules.
To obtain a blood sample, various techniques may be used, e.g., a syringe or other vacuum suction device. A blood sample can be optionally pre-treated or processed prior to use. A sample, such as a blood sample, may be analyzed under any of the methods and systems herein within 4 weeks, 2 weeks, 1 week, 6 days, 5 days, 4 days, 3 days, 2 days, 1 day, 12 hr, 6 hr, 3 hr, 2 hr, or 1 hr from the time the sample is obtained, or longer if frozen. When obtaining a sample from a subject (e.g., blood sample), the amount can vary depending upon subject size and the condition being screened. In some embodiments, at least 10 mL, 5 mL, 1 mL, 0.5 mL, 250, 200, 150, 100, 50, 40, 30, 20, 10, 9, 8, 7, 6, 5, 4, 3, 2, or 1 μL of a sample is obtained. In some embodiments, 1-50, 2-40, 3-30, or 4-20 μL of sample is obtained. In some embodiments, more than 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95 or 100 μL of a sample is obtained.
As used herein the term “diagnose” or “diagnosis” of a status or outcome includes predicting or diagnosing the status or outcome, determining predisposition to a status or outcome, monitoring treatment of patient, diagnosing a therapeutic response of a patient, and prognosis of status or outcome, progression, and response to particular treatment.
The sample may be taken before and/or after treatment of a subject with a disease or disorder. Samples may be obtained from a subject during a treatment or a treatment regime. Multiple samples may be obtained from a subject to monitor the effects of the treatment over time. The sample may be taken from a subject known or suspected of having a disease or disorder for which a definitive positive or negative diagnosis is not available via clinical tests. The sample may be taken from a subject suspected of having a disease or disorder. The sample may be taken from a subject experiencing unexplained symptoms, such as fatigue, nausea, weight loss, aches and pains, weakness, or bleeding. The sample may be taken from a subject having explained symptoms. The sample may be taken from a subject at risk of developing a disease or disorder due to factors such as familial history, age, hypertension or pre-hypertension, diabetes or pre-diabetes, overweight or obesity, environmental exposure, lifestyle risk factors (e.g., smoking, alcohol consumption, or drug use), or presence of other risk factors.
In some embodiments, a sample can be taken at a first time point and assayed, and then another sample can be taken at a subsequent time point and assayed. Such methods can be used, for example, for longitudinal monitoring purposes to track the development or progression of a disease. In some embodiments, the progression of a disease can be tracked before treatment, after treatment, or during the course of treatment, to determine the treatment's effectiveness. For example, a method as described herein can be performed on a subject prior to, and after, treatment with a disease state or condition therapy to measure the disease's progression or regression in response to the disease state or condition therapy.
After obtaining a sample from the subject, the sample may be processed to generate datasets indicative of a disease or disorder of the subject. For example, a presence, absence, or quantitative assessment of nucleic acid molecules of the sample at a panel of disease state or condition-associated or interferon-associated genomic loci or may be indicative of a disease state or condition of the subject. Processing the sample obtained from the subject may comprise (i) subjecting the sample to conditions that are sufficient to isolate, enrich, or extract a plurality of nucleic acid molecules, and (ii) assaying the plurality of nucleic acid molecules to generate the dataset (e.g., microarray data, nucleic acid sequences, or quantitative polymerase chain reaction (qPCR) data). Methods of assaying may include any assay known in the art or described in the literature, for example, a microarray assay, a sequencing assay (e.g., DNA sequencing, RNA sequencing, or RNA-Seq), or a quantitative polymerase chain reaction (qPCR) assay.
In some embodiments, a plurality of nucleic acid molecules is extracted from the sample and subjected to sequencing to generate a plurality of sequencing reads. The nucleic acid molecules may comprise ribonucleic acid (RNA) or deoxyribonucleic acid (DNA). The extraction method may extract all RNA or DNA molecules from a sample. Alternatively, the extraction method may selectively extract a portion of RNA or DNA molecules from a sample. Extracted RNA molecules from a sample may be converted to cDNA molecules by reverse transcription (RT).
The sample may be processed without any nucleic acid extraction. For example, the disease or disorder may be identified or monitored in the subject by using probes configured to selectively enrich nucleic acid (e.g., RNA or DNA) molecules corresponding to a panel of disease state or condition-associated or interferon-associated genomic loci. The probes may be nucleic acid primers. The probes may have sequence complementarity with nucleic acid sequences from one or more of the panel of disease state or condition-associated or interferon-associated genomic loci. The panel of disease state or condition-associated or interferon-associated genomic loci may comprise at least 2, at least 3, at least 4, at least 5, at least 6, at least 7, at least 8, at least 9, at least 10, at least 11, at least 12, at least 13, at least 14, at least 15, at least 16, at least 17, at least 18, at least 19, at least 20, at least about 25, at least about 30, at least about 35, at least about 40, at least about 45, at least about 50, at least about 55, at least about 60, at least about 65, at least about 70, at least about 75, at least about 80, at least about 85, at least about 90, at least about 95, at least about 100, or more disease state or condition-associated or interferon-associated genomic loci.
The probes may be nucleic acid molecules (e.g., RNA or DNA) having sequence complementarity with nucleic acid sequences (e.g., RNA or DNA) of one or more genomic loci (e.g., disease state or condition-associated or interferon-associated genomic loci). These nucleic acid molecules may be primers or enrichment sequences. The assaying of the sample using probes that are selective for the one or more genomic loci (e.g., disease state or condition-associated or interferon-associated genomic loci) may comprise use of array hybridization, polymerase chain reaction (PCR), or nucleic acid sequencing (e.g., RNA sequencing or DNA sequencing, such as RNA-Seq).
The assay readouts may be quantified at one or more genomic loci (e.g., disease state or condition-associated or interferon-associated genomic loci) to generate the data indicative of the disease or disorder. For example, quantification of array hybridization or polymerase chain reaction (PCR) corresponding to a plurality of genomic loci (e.g., disease state or condition-associated or interferon-associated genomic loci) may generate data indicative of the disease or disorder. Assay readouts may comprise quantitative PCR (qPCR) values, digital PCR (dPCR) values, digital droplet PCR (ddPCR) values, fluorescence values, etc., or normalized values thereof.
Big Data Analysis Tools and Drug/Target Scoring Algorithms
The present disclosure provides systems and methods to perform data analysis using drug or target scoring algorithms and/or big data analysis tools. In various aspects, such drug or target scoring algorithms and/or big data analysis tools may be used to perform analysis of data sets including, for example, mRNA gene expression or transcriptome data, DNA genomic data, proteomic data, metabolomic data, other types of “-omic” data, or a combination thereof. Systems and methods of the present disclosure may use one or more of the following: a BIG-C™ big data analysis tool, an I-Scope™ big data analysis tool, a T-Scope™ big data analysis tool, a CellScan big data analysis tool, an MS (Molecular Signature) Scoring™ analysis tool, and a Gene Set Variation Analysis (GSVA) tool (e.g., P-Scope).
A method to assess a condition (e.g., COVID-19) of a subject may comprise using one or more data analysis tools and/or algorithms. The method may comprise receiving a dataset of a biological sample of a subject. Next, the method may comprise selecting one or more data analysis tools and/or algorithms. For example, the data analysis tools and/or algorithms may comprise a BIG-C™ big data analysis tool, an I-Scope™ big data analysis tool, a T-Scope™ big data analysis tool, a CellScan big data analysis tool, an MS (Molecular Signature) Scoring™ analysis tool, a Gene Set Variation Analysis (GSVA) tool (e.g., P-Scope), or a combination thereof. Next, the method may comprise processing the dataset using selected data analysis tools and/or algorithms to generate a data signature of the biological sample of the subject. Next, the method may comprise assessing the condition of the subject based on the data signature.
The BIG-C (Biologically Informed Gene Clustering) tool may be configured to sort large groups of genes into a set of functional groups (e.g., 53 functional groups). The functional groups are created utilizing publicly available information from online tools and databases including UniProtKB/Swiss-Prot, GO Terms, KEGG pathways, NCBI PubMed, and the Interactome. The functional groups may include one or more of: Active RNA, Anti-apoptosis, anti-proliferation, autophagy, chromatin remodeling, cytoplasm and biochemistry, cytoskeleton, DNA repair, endocytosis, endoplasmic reticulum, endosome and vesicles, fatty acid biosynthesis, cell surface, transcription, glycolysis and gluconeogenesis, golgi, immune cell surface, immune secreted, immune signaling, integrin pathway, interferon stimulated genes, intracellular signaling, lysosome, melanosome, MHC class I, MHC class II, microRNA processing, microRNA, mitochondrial transcription, mitochondria, mitochondria oxidative phosphorylation, mitochondrial TCA cycle, mRNA processing, mRNA splicing, non-coding RNA, nuclear receptor, nucleus and nucleolus, palmitoylation, pattern recognition receptors, peroxisomes, pro-apoptosis, pro-cell cycle, proteasome, pseudogenes, RAS superfamily, reactive oxygen species protection, secreted and extracellular matrix, transcription factors, transporters, transposon control, ubiquitylation and sumoylation, unfolded protein and stress, and unknown. Enrichment scores for each group are calculated based on an overlap p value to determine the functional groups over or under-expressed in the gene expression dataset. The BIG-C may be configured such that each gene is sorted into only one of the 53 functional groups, allowing for a quick and relatively simple understanding of types of genes enriched and co-expressed in a big dataset.
The I-Scope™ tool may be configured to identify immune infiltrates. Hematopoietic cells are unique in that they move throughout the body patrolling for threats to the host, and may infiltrate tissue sites not normally home to immune cells. I-Scope™ may be configured to identify hematopoietic cells through an iterative search of more than 17,000 genes identified in more than 50 microarray datasets. From this search, 1226 candidate genes are identified and researched for restriction in hematopoietic cells as determined by the HPA, GTEx and FANTOMS datasets (e.g., available at proteinatlas.org). 926 genes meet the criteria for being mainly restricted to hematopoietic lineages (brain, reproductive organ exclusions were permitted). These genes are researched for immune cell specific expression in 27 hematopoietic sub-categories: alpha beta T cell, T cell, regulatory T Cell, activated T cell, anergic T cell, gamma delta T cells, CD8 T, NK/NKT cell, NK cell, T & B cells, B cells, germinal center B cells, B cell and plasmacytoid dendritic cell, T &B & myeloid, B & myeloid, T & myeloid, MHC Class II expressing cell, monocyte, dendritic cell, plasmacytoid dendritic cells, myeloid cell, plasma cell, erythrocyte, neutrophil, low density granulocyte, granulocyte, and platelet. Transcripts are entered into I-Scope™ and the number of transcripts in each category determined. Odd's ratios are calculated with confidence intervals using the Fisher's exact test in R.
The T-Scope™ tool may be configured to help identify types of non-hematopoietic cells in gene expression datasets. T-Scope™ may be configured by downloading approximately 10,000 tissue enriched and 8,000 cell line enriched genes from the human protein atlas along with their tissue or cell line designation (e.g., available at proteinatlas.org). Genes found in more than four tissues are eliminated. Housekeeping genes described in the gene expression study by She et al. are also removed (e.g., as described by She et al., “Definition, conservation and epigenetics of housekeeping and tissue-enriched genes,” BMC Genomics 2009, 10:269, which is incorporated herein by reference in its entirety). This list is further curated by removing genes differentially expressed in 34 hematopoietic cell gene expression datasets and adding kidney specific genes from datasets downloaded from the GEO repository and processed by Ampel BioSolutions. The resulting categories of genes represent genes enriched in the following 42 tissue/cell specific categories: adrenal gland, breast, cartilage, cerebral cortex, uterine cervix, chondrocyte, colon, duodenum, endometrium, epididymis, esophagus fallopian tube, esophagus, fibroblast, heart muscle, keratinocyte, kidney, liver, lung, melanocyte, ovary pancreas, parathyroid gland, placenta, podocyte, prostrate, rectum, salivary gland, seminal vesicle, skeletal muscle, skin, small intestine, smooth muscle, stomach, synoviocyte, testis, kidney loop of henle, kidney proximal tubule, kidney distal tubule, and kidney collecting duct.
The CellScan tool may be a combination of I-Scope™ and T-Scope™, and may be configured to analyze tissues with suspected immune infiltrations that should also have tissue specific genes. CellScan may potentially be more stringent than either I-Scope™ or T-Scope™ because it may be used to distinguish resident tissue cells from non-resident hematopoietic cells.
The MS (Molecular Signature) Scoring tool may be configured to assess specific pathways in a disease state. Information on genes that encode for proteins that participate in a specific signaling pathway, and whether the gene product promotes or inhibits the pathway, are compiled and curated through literature mining. Curated pathways presented by the company include CD40-CD40ligand, IL-6, IL-12/23, TNF, IL-17, IL-21, S1P1, IL-13 and PDE4, but this method may be used for any known signaling pathway with available data. To determine if a signaling pathway is over or under-expressed in a microarray dataset, the gene list for each signaling pathway may be queried against the limma differentially expressed genes from a disease state compared to healthy controls, and the differentially expressed genes in the signaling pathway may be identified for each set. The fold changes for genes that promoted the pathway may be added together and the fold changes for genes that inhibited the pathway may be subtracted from the score. This total score may be normalized based on the number of genes that could be detected on the specific microarray platform used for the experiment. Activation scores of −100 to +100 may be determined using this method with negative scores indicating an inhibition of the specific pathway in the disease state and positive scores indicating an up-regulation of a specific pathway in the disease state. The Fischer's exact test may be performed to determine if there was sufficient overlap of genes between the experimental differentially expressed genes and the genes in the signaling pathway.
Gene Set Variation Analysis (GSVA) may be performed (for example, as described in Catalina et al. (2019, Communications Biology, “Gene expression analysis delineates the potential roles of multiple interferons in systemic lupus erythematosus”, which is incorporated herein by reference in its entirety) to determine enrichment of signaling pathways in individual patient samples. Gene set variation analysis may be performed using an open source software package for the coding language R available at the R Bioconductor (bioconductor.org), e.g., as described by Hanzelman et al., (“GSVA: gene set variation analysis for microarray and RNA-Seq data,” BMC Bioinformatics, 2013, which is incorporated herein by reference in its entirety). The modules of genes to interrogate the datasets may be developed. Modules of genes determined to represent a specific signaling pathway or process may be identified (e.g., using publicly available datasets). For example, the IFNB1 signaling pathway is taken from a publicly available gene expression dataset of peripheral blood cells treated with IFNB1 in vitro. Genes co-expressed in this dataset (genes either all increased or decreased compared to control treated peripheral blood) are used to create modules of genes representing the IFNB1 signaling pathway, and GSVA is used to determine the enrichment of this set of genes and hence the IFNB1 signaling pathway in individual patient and control samples.
BIG-C™ Big Data Analysis Tool
BIG-C® is a fast and efficient cloud-based tool to functionally categorize gene products. With coverage of over 80% of the genome, BIG-C® leverages publicly available databases such as UniProtKB/Swiss-Prot, GO terms, KEGG pathways, NCBI PubMed and Interactome to place genes into 53 functional categories. The sorting into only one of 53 functional groups allows for a quick and relatively simple understanding of types of genes enriched and co-expressed in a big dataset. This assists in deriving further insights from genes expressed for a given disease state in human or pre-clinical mouse models.
BIG-C® can be used to functionally categorize immunological genes that are not covered in cancer databases such as GO and KEGG (e.g., as described by Grammer et al. 2016, “Drug repositioning in SLE: crowd-sourcing, literature-mining and Big Data analysis,” Lupus, 25 (10), 1150-1170, which is incorporated herein by reference in its entirety). Using a knowledge base of over 5000 patients with systemic lupus erythematosus (SLE), over 16432 genes may be each placed into one of 53 BIG-C® functional categories, and statistical analysis may be performed to identify enriched categories. BIG-C® categories may be cross-examined with the GO and KEGG terms to obtain additional information and insights.
A sample BIG-C® workflow may comprise the following steps. First, SLE genomic datasets are derived from whole blood, peripheral blood mononuclear cells, affected tissues, and purified immune cells. Second, datasets are analyzed using DE analysis or Weighted Gene Coexpression Network Analysis (WGCNA). Third, expressed genes are annotated using publicly available databases (e.g., UniProtKB/Swiss-Prot database, Human Immunodeficiencies database, Mouse MGI database, Entrez Molecular Sequence database, PubMed, and the Human Tissue Atlas). Fourth, signatures are cross-referenced with purified single-cell microarray datasets and RNAseq experiments. Fifth, BIG-C® is leveraged to separate the individual annotated genes into one of 53 functional categories shown in Table 1 (e.g., as described by Labonte et al. 2018, “Identification of alterations in macrophage activation associated with disease activity in systemic lupus erythematosus,” PloS one, 13 (12), e0208132, which is incorporated herein by reference in its entirety). Sixth, chi-squared analysis is used to determine enriched categories of interest from overlap p-values. Seventh, enriched categories are cross-examined with GO and KEGG terms to derive key insights for further analysis.
I-Scope™ Big Data Analysis Tool
I-Scope™ may be a tool configured for cross-examining the presence and activity of varying types of immune cell infiltrates with observed gene expression patterns. It may take annotated gene expression data and analyze it for hematopoietic cell lineage. I-Scope™ can be used downstream of the BIG-C® (Biologically Informed Gene-Clustering) tool in that it helps to provide even more insight into the nature of the genes being expressed after categorization.
I-Scope™ addresses the need to understand the involvement of specific cells for a given disease state. While it is helpful to understand the relative up-regulation and down-regulation at the gene expression level, it is even more informative to understand specifically in which cells this is occurring. I-Scope™ may be configured to identify hematopoietic cells through an iterative search of more than 17,000 genes identified in more than 50 microarray datasets (e.g., as described by Hubbard et al., “Analysis of Lupus Synovitis Gene Expression Reveals Dysregulation of Pathogenic Pathways Activated within Infiltrating Immune Cells,” Arthritis Rheumatol, 2018; 70 (suppl 10), which is incorporated herein by reference in its entirety). I-Scope™ may function by restricting the analysis to genes of hematopoietic cell heritage and allow for cross-checking against purified single-cell experiments or datasets. The cross-check confirms and categorizes specific transcript signatures to the 28 hematopoietic cell sub-categories shown in Table 2, ultimately allowing for cellular activity analysis across multiple samples and disease states. When combined with BIG-C® categories, the cellular activity can be correlated to specific functions within a given cell type.
A sample I-Scope™ workflow may comprise the following steps. First, candidate genes are identified from datasets (associated with a disease state or condition) potentially associated with immune cell expression. Second, using HPA, GTEx, and FANTOMS datasets, expression signatures associated with hematopoietic cell lineage are identified. Third, signatures are cross-referenced with purified single-cell microarray datasets and RNAseq experiments. Fourth, transcripts are categorized into 28 hematopoietic cell sub-categories and assess cellular expression across different samples and disease states. Odd's ratios are calculated with confidence intervals using the Fisher's exact test in R. An I-Scope™ signature analysis for a given sample may lead to the I-Scope™ signature analysis across multiple samples and disease states.
T-Scope™ Big Data Analysis Tool
The T-Scope™ tool may be configured for cross-examining gene expression signatures of a given sample with a database of non-hematopoietic cell types (e.g., as described by Hubbard et al., “Analysis of Gene Expression from Systemic Lupus Erythematosus Synovium Reveals Unique Pathogenic Mechanisms [Abstract], Annual Meeting of the American College of Rheumatology; June 2019; Chicago, Ill., which is incorporated herein by reference in its entirety). T-Scope™ may comprise a database of 704 transcripts allocated to 45 independent categories. Transcripts detected in the sample are matched to one of the cellular categories within the T-Scope™ tool to derive further insights on tissue cell activity. T-Scope™ can be used downstream of the BIG-C® (Biologically Informed Gene-Clustering) tool to understand which tissue cell types are present. In conjunction with I-Scope™ (which provides information related to immune cells), T-Scope™ can be performed to provide a complete view of all possible cell activity in a given sample.
T-Scope™ addresses the need to understand the involvement of specific tissue cells for a given disease state. While it is helpful to understand the relative up-regulation and down-regulation at the gene expression level, it is even more informative to understand specifically in which cells this is occurring. T-Scope™ may be configured by downloading a set of approximately 10,000 tissue enriched and 8,000 cell line enriched genes from the Human Protein Atlas along with their tissue or cell line designation. Genes differentially expressed in hematopoietic cell datasets are removed and kidney specific genes are added from the GEO repository. T-Scope™ may function by restricting the analysis to genes of known tissue cell heritage and allow for cross-checking against purified single-cell experiments or datasets. The cross-check confirms and categorizes specific transcript signatures to the 45 tissue cell sub-categories (as shown in Table 3), ultimately allowing for cellular activity analysis across multiple samples and disease states. When combined with BIG-C® categories, the cellular activity can be correlated to specific functions within a given tissue cell type.
A sample T-Scope™ workflow may comprise the following steps. First, candidate genes are identified from differential expression datasets (associated with a disease state or condition) potentially associated with tissue cell expression. Second, using publicly available databases, expression signatures associated with potential tissue cell activity are identified. Third, signatures are cross-referenced with microarray, scRNAseq or RNAseq experiments. Fourth, transcripts are categorized into 45 tissue cell sub-categories and cellular expression is assessed across different samples and disease states. Results may be obtained using T-Scope™ in combination with I-Scope™ for identification of cells post-DE-analysis.
Cell Scan Big Data Analysis Tool
A cloud-based genomic platform may be configured to provide users with access to CellScan™, which comprises a suite of tools for the identification, analysis, and prioritization of targets for drug development and/or repositioning. This platform is powered by a database containing the genomic information gathered from 5000+ autoimmune patients. The cloud-based genomic platform may leverage results from RNAseq and microarray experiments in conjunction with clinical information, such as medication and lab tests, to provide previously undiscovered insights.
CellScan™ may go beyond typical ‘omics analysis by performing one or more of the following: functionally categorizing genes and their products (e.g., using BIG-C®); deconvolving gene expression data to identify unique immunological cell types from blood or biopsy samples (e.g., using I-Scope™); identifying tissue specific cell from biopsy samples (e.g., using T-Scope™); identifying receptor-ligand interactions and subsequent signaling pathways (e.g., using MS-Scoring™); ranking genes and their products for targeting by drugs and miRNA mimetics; and prioritizing FDA-approved drugs and drugs-in-development for treatment in patients or pre-clinical models.
CellScan™ applications may include one or more of: Biomarker Discovery, Disease Mechanisms, Drug Mechanism of Action, Drug Mechanism of Toxicity, and Target Identification and Validation. Experimental approaches supported by CellScan™ may include one or more of: lncRNA, Metabolomics, MicroArray, miRNA, mRNA, qPCR, Proteomics, and RNAseq.
Data analysis and interpretation with CellScan™ may build on comprehensive, manually curated content of a knowledge base. Powerful, quick, and efficient tools may be used to perform deep analysis of NGS and miRNA data to identify gene function, immunological and tissue cell type, pathways, and target/drug appropriate for a specific disease state.
CellScan™ features may be configured to optimize or maximize the impact of information that surfaces in an analysis so that interpretation of a dataset is comprehensive and elucidates actionable insights. These features may include one or more of: NGS RNAseq data analysis, biomarker scoring, and prioritizing targets and drugs for human clinical trials and/or pre-clinical models. The NGS RNAseq data analysis may comprise interrogating RNA and miRNA data for function, cell-type (immunological or tissue) and pathways. The biomarker scoring may comprise using a knowledge base and gene expression data to assess and prioritize biomarkers associated with a target disease or phenotype. The target/drug prioritization may comprise leveraging objective scoring of targets and drugs based on parameters such as scientific rationale, evidence in mouse/human cells, prior clinical data, overall drug properties, and the risk of adverse events.
The knowledge base may be a repository created from millions of individual pieces of information gathered about genes, cells, tissues, drugs, and diseases, and manually reviewed for accuracy and includes rich contextual details and links to original publications. The knowledge base may enable access to relevant and substantiated knowledge from primary literature as well as public and private databases for comprehensive interpretation of NGS/RNAseq data elucidating function/pathways and prioritize targets/drugs for given disease states. Table 4 shows an example list of reference databases for the content in CellScan™, with both human and mouse species-specific identifiers supported.
MS (Molecular Signature) Scoring™ Analysis Tool
MS-Scoring™ may be configured to identify receptor-ligand interactions and predict ongoing signaling pathways. In addition, MS-Scoring™ may be used to validate molecular pathways as potential targets for new or repurposed drug therapies. The specificity of next-generation drug therapies requires a way to understand the potential of a given therapy to act on the intended biochemical target. Moreover, a potential application of this is the repositioning of drug therapies that may have the correct biochemical targeting to address multiple clinical needs beyond the initial intended therapeutic value.
MS-Scoring™ may be specifically developed to address gaps in the QIAGEN IPA® (Ingenuity Pathway Analysis) tool that does not contain many immunologically relevant pathways. Similar to IPA®, MS-Scoring™ 1 may use log-fold change information to score the target and its signaling pathway to verify the viability of the targets. If the fold-change of the genes of a signaling pathway appears to be upregulated or inhibitors appear to be downregulated, MS-Scoring™ 1 may provide a score of +1. Conversely if the genes of a signaling pathway appear downregulated or the inhibitors upregulated, MS-Scoring™ 1 may provide a score of −1. A score of zero may be provided if no fold-change is observed. The scores may then be summed and normalized across the entire pathway to yield a final % score between −100 (inhibition) and +100 (up-regulation). Higher absolute magnitude scores, scores that are close to −100 or +100, may indicate a high potential for therapeutic targeting. The Fischer's exact test may be performed to determine if there is sufficient overlap of genes between the experimental differentially expressed genes and the genes in the signaling pathway.
A sample MS-Scoring™ 1 workflow may comprise the following steps. First, potential drugs and pathways are identified by LINCS (Library of Integrated Network-Based Cellular Signatures) as candidates for therapeutic intervention. Second, MS-Scoring™ 1 is used to evaluate individual transcript elements of the target pathway. Third, signatures are cross-referenced with purified single-cell microarray datasets and RNAseq experiments. Fourth, scores are compiled and normalized to provide an overall % score for the pathway and higher absolute magnitude scores indicate a higher potential for therapeutic targeting.
MS-Scoring™ 2 may utilize custom-defined gene modules that represent a signaling pathway or process and is particularly useful for gene expression datasets from microarray or RNAseq. The MS-Scoring™ 2 tool may be configured to take a deeper look at signaling pathways analyzed using the MS-Scoring™ 1. The tool may analyze raw gene expression data and assess enrichment by the Gene Set Variation Analysis (as described herein), which assigns an indexed score to the individual co-expressed pathways between −1 and +1 indicating levels of down-regulation and up-regulation respectively.
A sample MS-Scoring™ 2 workflow may comprise the following steps. First, a signaling pathway of interest is selected from the MS-Scoring™ 2 menu. Second, a raw gene expression data is inputted into the MS-Scoring™ 2 tool. Third, enrichment of signaling pathway (s) is assessed on a patient by patient basis. Fourth, the data can then be used to drive insight for the target signaling pathways in individual patient samples.
Gene Set Variation Analysis (GSVA)
Gene Set Variation Analysis (GSVA) algorithms may be performed (for example, as described in Catalina et al. (2019, Communications Biology, “Gene expression analysis delineates the potential roles of multiple interferons in systemic lupus erythematosus”, which is incorporated herein by reference in its entirety) to determine enrichment of signaling pathways in individual patient samples. Gene set variation analysis may be performed using an open source software package for the coding language R available at the R Bioconductor (bioconductor.org), e.g., as described by Hanzelman et al., (“GSVA: gene set variation analysis for microarray and RNA-Seq data,” BMC Bioinformatics, 2013, which is incorporated herein by reference in its entirety) and as described by [R Core Team (2020). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. www.R-project.org/], which is incorporated herein by reference in its entirety. The modules of genes to interrogate the datasets may be developed. Modules of genes determined to represent a specific signaling pathway or process may be identified (e.g., using publicly available datasets). For example, the IFNB1 signaling pathway is taken from a publicly available gene expression dataset of peripheral blood cells treated with IFNB1 in vitro. Genes co-expressed in this dataset (genes either all increased or decreased compared to control treated peripheral blood) are used to create modules of genes representing the IFNB1 signaling pathway, and GSVA is used to determine the enrichment of this set of genes and hence the IFNB1 signaling pathway in individual patient and control samples.
A GSVA-based data analysis tool may be developed for use in analyzing specific sets of gene pathways. The GSVA-based data analysis tool (e.g., P-Scope) may use a GSVA statistical test-based tool using different sets of genes to analyze certain pathways. Such sets of genes may include, for example, human genes, mouse genes, or a combination thereof.
Computer Systems
The present disclosure provides computer systems that are programmed to implement methods of the disclosure.
The computer system 1601 can regulate various aspects of analysis, calculation, and generation of the present disclosure, such as, for example, performing methods of the disclosure. The computer system 1601 can be an electronic device of a user or a computer system that is remotely located with respect to the electronic device. The electronic device can be a mobile electronic device.
The computer system 1601 includes a central processing unit (CPU, also “processor” and “computer processor” herein) 1605, which can be a single core or multi core processor, or a plurality of processors for parallel processing. The computer system 1601 also includes memory or memory location 1610 (e.g., random-access memory, read-only memory, flash memory), electronic storage unit 1615 (e.g., hard disk), communication interface 1620 (e.g., network adapter) for communicating with one or more other systems, and peripheral devices 1625, such as cache, other memory, data storage and/or electronic display adapters. The memory 1610, storage unit 1615, interface 1620 and peripheral devices 1625 are in communication with the CPU 1605 through a communication bus (solid lines), such as a motherboard. The storage unit 1615 can be a data storage unit (or data repository) for storing data. The computer system 1601 can be operatively coupled to a computer network (“network”) 1630 with the aid of the communication interface 1620. The network 1630 can be the Internet, an internet and/or extranet, or an intranet and/or extranet that is in communication with the Internet.
The network 1630 in some cases is a telecommunication and/or data network. The network 1630 can include one or more computer servers, which can enable distributed computing, such as cloud computing. For example, one or more computer servers may enable cloud computing over the network 1630 (“the cloud”) to perform various aspects of analysis, calculation, and generation of the present disclosure, such as, for example, performing methods of the disclosure. Such cloud computing may be provided by cloud computing platforms such as, for example, Amazon Web Services (AWS), Microsoft Azure, Google Cloud Platform, and IBM cloud. The network 1630, in some cases with the aid of the computer system 1601, can implement a peer-to-peer network, which may enable devices coupled to the computer system 1601 to behave as a client or a server.
The CPU 1605 may comprise one or more computer processors and/or one or more graphics processing units (GPUs). The CPU 1605 can execute a sequence of machine-readable instructions, which can be embodied in a program or software. The instructions may be stored in a memory location, such as the memory 1610. The instructions can be directed to the CPU 1605, which can subsequently program or otherwise configure the CPU 1605 to implement methods of the present disclosure. Examples of operations performed by the CPU 1605 can include fetch, decode, execute, and writeback.
The CPU 1605 can be part of a circuit, such as an integrated circuit. One or more other components of the system 1601 can be included in the circuit. In some cases, the circuit is an application specific integrated circuit (ASIC).
The storage unit 1615 can store files, such as drivers, libraries and saved programs. The storage unit 1615 can store user data, e.g., user preferences and user programs. The computer system 1601 in some cases can include one or more additional data storage units that are external to the computer system 1601, such as located on a remote server that is in communication with the computer system 1601 through an intranet or the Internet.
The computer system 1601 can communicate with one or more remote computer systems through the network 1630. For instance, the computer system 1601 can communicate with a remote computer system of a user. Examples of remote computer systems include personal computers (e.g., portable PC), slate or tablet PC's (e.g., Apple® iPad, Samsung® Galaxy Tab), telephones, Smart phones (e.g., Apple® iPhone, Android-enabled device, Blackberry®), or personal digital assistants. The user can access the computer system 1601 via the network 1630.
Methods as described herein can be implemented by way of machine (e.g., computer processor) executable code stored on an electronic storage location of the computer system 1601, such as, for example, on the memory 1610 or electronic storage unit 1615. The machine executable or machine-readable code can be provided in the form of software. During use, the code can be executed by the processor 1605. In some cases, the code can be retrieved from the storage unit 1615 and stored on the memory 1610 for ready access by the processor 1605. In some situations, the electronic storage unit 1615 can be precluded, and machine-executable instructions are stored on memory 1610.
The code can be pre-compiled and configured for use with a machine having a processor adapted to execute the code, or can be compiled during runtime. The code can be supplied in a programming language that can be selected to enable the code to execute in a pre-compiled or as-compiled fashion.
Aspects of the systems and methods provided herein, such as the computer system 1601, can be embodied in programming. Various aspects of the technology may be thought of as “products” or “articles of manufacture” typically in the form of machine (or processor) executable code and/or associated data that is carried on or embodied in a type of machine-readable medium. Machine-executable code can be stored on an electronic storage unit, such as memory (e.g., read-only memory, random-access memory, flash memory) or a hard disk. “Storage” type media can include any or all of the tangible memory of the computers, processors or the like, or associated modules thereof, such as various semiconductor memories, tape drives, disk drives and the like, which may provide non-transitory storage at any time for the software programming. All or portions of the software may at times be communicated through the Internet or various other telecommunication networks. Such communications, for example, may enable loading of the software from one computer or processor into another, for example, from a management server or host computer into the computer platform of an application server. Thus, another type of media that may bear the software elements includes optical, electrical and electromagnetic waves, such as used across physical interfaces between local devices, through wired and optical landline networks and over various air-links. The physical elements that carry such waves, such as wired or wireless links, optical links or the like, also may be considered as media bearing the software. As used herein, unless restricted to non-transitory, tangible “storage” media, terms such as computer or machine “readable medium” refer to any medium that participates in providing instructions to a processor for execution.
Hence, a machine-readable medium, such as computer-executable code, may take many forms, including but not limited to, a tangible storage medium, a carrier wave medium or physical transmission medium. Non-volatile storage media include, for example, optical or magnetic disks, such as any of the storage devices in any computer(s) or the like, such as may be used to implement the databases, etc. shown in the drawings. Volatile storage media include dynamic memory, such as main memory of such a computer platform. Tangible transmission media include coaxial cables; copper wire and fiber optics, including the wires that comprise a bus within a computer system. Carrier-wave transmission media may take the form of electric or electromagnetic signals, or acoustic or light waves such as those generated during radio frequency (RF) and infrared (IR) data communications. Common forms of computer-readable media therefore include for example: a floppy disk, a flexible disk, hard disk, magnetic tape, any other magnetic medium, a CD-ROM, DVD or DVD-ROM, any other optical medium, punch cards paper tape, any other physical storage medium with patterns of holes, a RAM, a ROM, a PROM and EPROM, a FLASH-EPROM, any other memory chip or cartridge, a carrier wave transporting data or instructions, cables or links transporting such a carrier wave, or any other medium from which a computer may read programming code and/or data. Many of these forms of computer-readable media may be involved in carrying one or more sequences of one or more instructions to a processor for execution.
The computer system 1601 can include or be in communication with an electronic display 1635 that comprises a user interface (UI) 1640 for providing, for example, a visual display. Examples of UIs include, without limitation, a graphical user interface (GUI) and web-based user interface.
Methods and systems of the present disclosure can be implemented by way of one or more algorithms. An algorithm can be implemented by way of software upon execution by the central processing unit 1605. The algorithm can, for example, perform methods of the disclosure.
The following illustrative examples are representative of embodiments of the software applications, systems, and methods described herein and are not meant to be limiting in any way.
Abstract
SARS-CoV2 is a previously uncharacterized coronavirus and causative agent of the COVID-19 pandemic. COVID-19 typically causes mild respiratory symptoms, but may escalate to acute respiratory distress syndrome (ARDs) with an increased risk of respiratory failure and death. However, the trajectory of disease progression and the status of affected tissues in COVID-19 patients has not been elucidated. We performed a comprehensive analysis of gene expression data from the blood, lung, and airway of COVID-19 patients to better understand the host response to SARS-CoV2 infection. Our results indicate that COVID-19 pathogenesis is driven by populations of myeloid-lineage cells with highly inflammatory but distinct transcriptional signatures, suggesting a progression in activation from the periphery to the lung tissue. In addition, through analysis of immune cells and inflammatory pathways enriched in each compartment, we have built a model of the systemic response to SARS-CoV2 and identified therapeutics targeting key upstream regulators of pathways contributing to COVID-19 pathogenesis.
Introduction
Coronaviruses (CoV) are a group of enveloped single positive stranded RNA viruses named for the spike proteins on their surface that resemble a crown (Fung and Liu, 2019). Seven coronaviruses have now been found to infect humans, causing mild to severe respiratory and intestinal illnesses including an estimated 15% of common colds (Cui et al., 2019; Greenberg, 2016). In the past two decades, three global pandemics have originated from coronaviruses capable of infecting the lower respiratory tract resulting in heightened pathogenicity and high mortality rates in humans. In 2002-2003, severe acute respiratory syndrome coronavirus (SARS-CoV) lead to greater than 8,000 cases with a mortality rate of nearly 10% (Drosten et al., 2003; Fung and Liu, 2019). Subsequently, Middle East respiratory syndrome coronavirus (MERS-CoV) emerged in 2012, leading to over 2,000 confirmed cases and a mortality rate of approximately 35% (Chafekar and Fielding, 2018; Fung and Liu, 2019). We are currently in the midst of a pandemic stemming from a new coronavirus strain, severe acute respiratory syndrome coronavirus 2 (SARS-CoV2), the causative agent of coronavirus disease 2019 (COVID-19). In the majority of cases, patients exhibit mild symptoms, whereas in more severe cases, patients may develop acute respiratory distress syndrome (ARDS), which leads to lung injury and death from respiratory failure (Chen et al., 2020a; Zhang et al., 2020).
SARS-CoV2 utilizes the SARS-CoV receptor, ACE2, in conjunction with the spike protein activator, TMPRSS2, to infect host cells (Hoffmann et al., 2020). Expression of ACE2 and TMPRSS2 has been detected in multiple tissues including lung epithelium and vascular endothelium, (Lovren et al., 2008; Sungnak et al., 2020) which are likely to be the first cells infected by the virus. However, at this time, there is still incomplete information available regarding the host response to SARS-CoV2 infection and the perturbations resulting in a severe outcome. Despite this, clues can be derived from our knowledge of the immune response to infection by other respiratory viruses, including SARS-CoV and MERS-CoV. After infection, viruses are typically detected by pattern recognition receptors (PRRs) such as the inflammasome sensor NLRP3, which signal the release of interferons and inflammatory cytokines including the IL-1 family, IL-6, and TNF which activate a local and systemic response to infection (Kelley et al., 2019; Lazear et al., 2019). This involves the recruitment, activation, and differentiation of innate and adaptive immune cells including neutrophils, inflammatory myeloid cells, CD8 T cells, and natural killer (NK) cells (Newton et al., 2016). Resolution of infection is largely dependent on the cytotoxic activity of CD8 T cells and NK cells, which enables clearance of virus-infected cells and thus acts to prevent further spread of the virus (Newton et al., 2016). Failure to clear virus-infected cells may facilitate a hyper-inflammatory state termed “cytokine storm”, macrophage activation syndrome (MAS), or haemophagocytic lymphohystocytosis (HLH) and ultimately damage to the infected lung (Crayne et al., 2019; McGonagle et al., 2020). Clearance of virus also coincides with development of neutralizing antibodies, although recent studies have suggested that 30% of subjects who clear SARS-CoV2 do not develop neutralizing antibodies (Tay et al.). Moreover, the role of the anti-SARS-CoV2 antibody is complex, in that in non-human primates, anti-viral antibody may exacerbate the lung pathology, a phenomenon also reported in older subjects infected with the virus (Wang et al., 2020).
The novelty of SARS-CoV2 and thus the lack of comprehensive knowledge regarding the progression of COVID-19 disease, has constrained our ability to develop effective treatments for infected patients. One means to obtain a comprehensive knowledge of the host response to SARS-CoV2 is to examine gene expression in relevant tissues. A limited number of studies have been reported and have suggested that the disease generally induces a dysfunctional immune response involving lymphopoenia, increased inflammatory cell recruitment to the lung, and an increase in local and systemic release of pro-inflammatory cytokines or “cytokine storm”, conditions which are exacerbated in more severe cases (Tay et al.). However, because of the small number of samples and limited analysis, a full picture of the biological state of SARS-CoV2-affected tissues has not emerged. To address this, we have assessed all of the available SARS-CoV2 gene expression datasets using a number of orthogonal bioinformatic tools to better understand the host response to SARS-CoV2 infection. As a result, we have identified immune cell types and inflammatory pathways critical for COVID-19 pathogenesis. Furthermore, we have characterized differences in transcriptional signatures between these compartments to develop a model for the systemic response to SARS-CoV2 infection and identified potential drugs to target harmful inflammatory mediators in COVID-19 patients.
Results
Gene Expression Analysis of Blood, Lung, and Airway of COVID-19 Patients
Gene expression reflects the current state of cellular activity of cells or organs and can be employed to delineate changes in cellular composition and/or function in a sample. A limited number of gene expression profiles are available from patients with COVID-19 and have yielded some insights into the pathogenic processes triggered by infection with SARS-CoV-2 (Blanco-Melo et al., 2020; Wei et al., 2020; Xiong et al., 2020). We reasoned that a more comprehensive analysis of all available gene sets from blood, lung and airway using a number of complementary orthogonal approaches might provide a more complete view of the nature of the COVID-19 inflammatory response and the potential points of intervention that might have therapeutic value. To characterize the pathologic response to SARS-CoV2 infection manifested in the peripheral blood and lung of COVID-19 patients, we analyzed RNA sequencing (RNA-seq) data from peripheral blood mononuclear cells (PBMCs) and postmortem lung tissue of COVID-19 patients and healthy controls as well as bronchoalveolar lavage (BAL) fluid of COVID-19 patients (Blanco-Melo et al., 2020; Xiong et al., 2020). We first determined changes in gene expression in the blood (PBMC-CTL vs PBMC-CoV2) and lung (Lung-CTL vs Lung-CoV2). Because no control BAL fluids were associated with the BAL-CoV2 samples, we compared BAL-CoV2 to PBMC-CoV2 from the same dataset to avoid effects related to batch and methodology. These results must be viewed with caution, understanding that the comparison can only show changes in gene expression profiles between different compartments. Despite that, important differences were observed. Overall, we found 4,277 differentially expressed genes (DEGs) in the blood (2,175 up and 2,102 down), 2,261 DE genes in the lung (711 up and 1550 down), and 9,183 DE genes in the airway (4,263 up and 4,920 down). From these comparisons, we traced transcriptional changes in the progression of the pathologic response to SARS-CoV2 through these three compartments from activation and mobilization of immune cells in the blood, to infiltration into the lung tissue and airway of infected patients (
Conserved and Differential Enrichment of Immune Cells and Pathways in Blood, Lung, and Airway of COVID-19 Patients
To interrogate pathologic pathways in the 3 compartments, we performed Gene Set Variation Analysis (GSVA) utilizing previously reported interferon response gene (IRG) modules (Catalina et al., 2019), gene modules of inflammatory pathways curated from online databases, and previously defined gene modules characterizing immune and inflammatory cells and processes (Cataline et al. unpublished manuscript) (
In the blood, inflammatory pathways including the classical and lectin-induced complement pathways and the NLRP3 inflammasome were enriched (
Although, there is conflicting data on the presence of an Interferon Gene Signature (IGS) and whether SARS-CoV2 infection induces a robust IFN response, (Blanco-Melo et al., 2020) we observed increased expression of Type I interferon genes (IFNA4, IFNA6, IFNA10) and significant enrichment of the Type I and Type II IGS specifically in the lung tissue, but not in the blood or airway of COVID-19 patients (
In the airway, the classical and alternative complement pathways but not the NLRP3 inflammasome signature were enriched in COVID-19 patients as compared to COVID PBMCs (
Increased Expression of Inflammatory Mediators in the Lungs of COVID-19 Patients
To examine the nature of the inflammatory response in the tissue compartments in greater detail, we examined differential expression of specific genes of interest (
Non-Hematopoietic Cells in the BAL Fluid May be Indicative of Viral-Induced Damage in the Lungs
To determine whether viral infection in the lungs and airway resulted in modification of resident tissue populations, we evaluated GSVA enrichment of non-hematopoietic cell type gene signatures including fibroblasts, Type I and Type II alveolar cells, ciliated lung cells, club cells, and a general lung tissue cell signature (
Protein-Protein Interaction Metaclusters Identify Myeloid Cells and Metabolic Pathways in Blood, Lung, and Airway of COVID-19 Patients
We next sought to utilize an unbiased, clustering approach to assess the relationship between immune cell types and the altered biological processes driving SARS-CoV2-mediated pathology within each tissue compartment. Protein-protein interaction (PPI) networks consisting of DE genes were simplified into metastructures defined by the number of genes in each cluster, the number of significant intracluster connections and the number of associations connecting members of different clusters to each other. Overall, upregulated protein networks from blood, lung, and airway of SARS-CoV2-infected patients identified numerous multifunctional clusters many of which included heterogeneous populations of monocyte and myeloid lineage cells. For example, PBMC cluster 8 was dominated by an inflammatory monocyte population defined by C2, C5, CXCL10, CCR2 and multiple interferon-stimulated genes, whereas cluster 3 contained hallmarks of alternatively activated (M2) macrophages and/or myeloid-derived suppressor cells (MDSCs), including CD33, CD36, CD93 and ITGAM (
Lung tissue was heavily inflamed exhibiting infiltration of monocyte/myeloid populations with additional infiltration of LDGs, granulocytes, T and B cells. Although distributed among multiple clusters, we observed upregulation of FCN1 (cluster 15), SELL (cluster 14) and S100A8/A9 (cluster 4) which comprise an inflammatory monocyte signature (G1 population) derived from the BAL fluid of COVID patients recently described by Liao et al. (Liao et al., 2020) (
Finally, the airway was enriched in inflammatory monocytes, mitochondrial function and transcription. Multifunctional cluster 4 was dominated by immune-secreted molecules including numerous chemokine and cytokine receptor-ligand pairs indicative of elevated chemotactic activity, while smaller immune clusters were enriched in classical complement activation (cluster 34), type II interferon (cluster 26) and IL-1 responses (cluster 51). We also observed evidence of recently described alveolar macrophages (AMs; G4 population) specifically in the airway (Liao et al., 2020), although these markers were distributed among multiple clusters, including FABP4 and PPARG in cluster 17, SPP1 and MRC1 in cluster 10, and MARCO and TFRC in clusters 34 and 7, respectively (
Myeloid Cell-Derived Metaclusters Define Functional Myeloid Subpopulations within the Blood, Lung, and Airway of Covid-19 Patients
Given the large number of monocyte and myeloid enriched clusters, we next wanted to examine those clusters in greater detail to identify unique myeloid lineage and/or monocyte populations within each tissue compartment. In PBMCs, metaclusters derived from monocyte-enriched clusters revealed new gene modules that were representative of common macrophage function (chemotaxis, proteolysis, etc) as well as two independent monocyte/myeloid subpopulations (
Similar to the blood, lung-derived monocyte/myeloid genes segregated into clusters associated with common myeloid-lineage cell functions, such as chemotaxis and pattern recognition, as well as multiple subpopulations (
Examination of genes in cluster 4 and 7 reveals a potential second myeloid population characteristic of alveolar macrophages (AMs) defined by CSF2RB, which functions as the receptor for GM-CSF, a cytokine that regulates AM differentiation (Joshi et al., 2006; Newton et al., 2016; Suzuki et al., 2014). Interestingly, further characterization of this population indicates significant involvement of the coagulation system indicated by F5, FGG, FGL1, FGL1, SERPINA and SERPINE2.
BAL fluid was also potentially defined by 2 populations, including cluster 7 which had hallmarks of inflammatory/M1 monocytes (MARCO and multiple members of the complement cascade), and AMs in clusters 3 and 5 (
Co-Expression Further Delineates Differing Myeloid Gene Expression Between Subpopulations within the Blood, Lung, and Airway of COVID-19 Patients
Both GSVA and PPI networks elucidated the presence of increased myeloid cell populations in all SARS-CoV-2 affected tissues, and PPIs revealed the presence of tissue-specific subpopulations, defined by differing biologic functions. The overlap between many Liao et al. signature genes and the tissue-defined PPI clusters motivated us to further evaluate their enrichment in all tissue compartments. Consistent with PPI clusters, the inflammatory-monocyte like population (G1) was found to be increased in both PBMC and lung, but was significantly decreased in the BAL (
Taken together, evaluation of PPIs and previously published COVID myeloid populations revealed nuances in the myeloid cell populations found among the tissue compartments. We wished to further classify these populations by their expression of traditional myeloid-specific genes including cell surface markers and chemokines. DE interrogation of all possible myeloid cell-specific genes demonstrated further heterogeneity in expression of markers, such as CD14, CD300C, and OSCAR between compartments (
To characterize the function and nature of these myeloid populations, we again compared them with previously published myeloid signatures including those identified as alveolar macrophages (Reyfman et al., 2019), M1 and M2 populations (Martinez et al., 2006) (Adam's RnD source), and other signatures observed in murine models of inflammatory arthritis (Culemann et al., 2019). Furthermore, we evaluated the overlap between co-expression derived myeloid clusters and those identified using STRING/MCODE PPI in
Of the co-expression derived myeloid clusters increased in each tissue compartment, we observed a progression of classically activated, proinflammatory monocytes from the blood into the lung tissue and in some cases getting sloughed off into the airway, as observed by significant overlap with the PBMC myeloid cluster 6, Lung myeloid clusters 3 and 6, Liao et al.'s proinflammatory population (G1), and the CCR2+IL1B+ infiltrating population (
Linear Regression Confirms Functional Associations with Myeloid Subpopulations
To confirm the function of these myeloid populations in the pathology of SARS-CoV-2, we performed linear regression analyses between myeloid GSVA scores and functional or signaling pathways. Earlier DE functional enrichment and MCODE/STRING analyses suggested a predominant role for metabolism, particularly glycolysis and the TCA cycle, in PBMCs, as well as OXPHOS in all three tissue compartments. We therefore evaluated metabolism in each compartment using GSVA (
Linear regression analyses between GSVA scores for the myeloid cell populations and metabolic, functional, and signaling pathways in each compartment further demonstrated differences among the myeloid cell populations. The myeloid cell population in the PBMCs was found to be highly glycolytic, whereas there was no significant change to metabolism detected in the lung, and the population in the BAL was reliant on OXPHOS (
IPA Confirms Metabolic Activity and Viral, Innate Immunity as Targetable Pathways in COVID-19 Patients
To further inform the biology of SARS-CoV-2-infected patients, we conducted pathway analysis on DEGs from each of the peripheral blood, lung, and airway compartments using IPA canonical signaling pathway and upstream regulator analysis functions (
Upstream regulators predicted to mediate responses to the virus in each compartment indicated uniform involvement of proinflammatory cytokines with type I interferon regulation dominant in the diseased lung (
Discussion
In the present study, we utilize multiple orthogonal bioinformatics approaches to analyze differential gene expression from the blood, lung, and airway of COVID-19 patients. As a whole, our results are in agreement with the current understanding of SARS-CoV2 pathogenesis, which generally involves lymphopenia accompanied by increased activation of pro-inflammatory pathways and myeloid cells (Tay et al.). In particular, we found that DE genes from COVID-19 patients were enriched in gene signatures of interferon, complement pathways, and the inflammasome, which would be expected to initiate a robust and systemic response to infection. We also utilized various approaches to derive and characterize the expanded myeloid cell subpopulations in COVID-19 patients. These observed increases in inflammatory pathways and myeloid cell subsets are likely reflective of a dysregulated immune response that contributes to lung damage, and in severe cases, to multi-organ failure and overall poor health outcomes in COVID-19 patients. Furthermore, by comparing DE results from multiple compartments (the blood, lung, and airway) in COVID-19 patients, we have developed a model of the systemic pathogenic response to SARS-CoV2 infection (
In our proposed model, after viral exposure, SARS-CoV2 enters the host airway and infects alveolar epithelial cells, which express the virus entry receptors ACE2 and TMPRSS2. In the lung tissue, we observed increased expression of Type I interferon genes, enrichment of Type I and Type II interferon gene signatures, and expression of IFN-stimulated genes including TNF and NOS2. Type I interferons are critical components of the host response to viral infection through inducing the expression of anti-viral genes and direct or indirect immune cell activation and, thus, are targets of immune evasion tactics by coronaviruses including SARS-CoV and MERS-CoV (Newton et al., 2016; Tay et al.). Type I interferons can be produced by virus-infected cells, or cells that have detected viral infection (Goritzka et al., 2015). Therefore, interferon production in the lung of COVID-19 patients is likely initiated by infected alveolar cells and propagated by activated alveolar macrophages, leading to the production of IFNγ and other pro-inflammatory mediators (Darwich et al., 2009; Newton et al., 2016). To date, reports have been divided over whether an interferon response is present in COVID-19 patients in that some reports cite a poor interferon response (Blanco-Melo et al., 2020), some observe a robust interferon response that increases with disease severity (Wei et al., 2020), and still others have found heterogeneity in whether patients exhibit an interferon response to SARS-CoV2 infection (Trouillet-Assant et al., 2020). Therefore, the interferon response to SARS-CoV2 requires further study and consideration when developing treatment strategies for individual patients.
Signals from the lung including interferon or other pro-inflammatory mediators from infected cells disseminate into the peripheral blood to launch a systemic inflammatory response. In the periphery, inflammatory mediators from the lung typically promote activation and migration of myeloid cells, NK cells, and adaptive immune cells including T and B cells, which can differentiate into effector CD8 T cells and antibody-producing plasma cells (Newton et al., 2016). We found significant deficiencies in gene signatures of T cells, NK cells, and cytotoxic cells, which include activated CD8 T cells and NK cells, in the peripheral blood of COVID-19 patients, which is consistent with clinical and analytical evidence of lymphopenia following SARS-CoV and SARS-CoV2 infection (Chen et al., 2020b; He et al., 2005; Qin et al., 2020; Xu et al., 2020). In contrast to T and NK cells, we observed increased evidence of B cell activation through CD40/CD40L and an increased plasma cell signature in PBMCs of COVID-19 patients. This result suggests that while effector CD8 T cells are deficient, that helper T (Th) cells are present and able to promote B cell activation, the generation of antibody secreting plasma cells, and an antibody-mediated immune response. However, whether a virus-specific antibody response is beneficial to recovery from SARS-CoV2 infection is unclear (Wu et al., 2020). Evidence from SARS-CoV infection suggests that the quality of the antibody produced is critical to mounting an effective anti-viral response. Low quality, low affinity antibody responses may have pathological consequences including promoting lung injury in some patients, although it is unknown if this occurs in SARS-CoV2 infected individuals (Iwasaki and Yang, 2020; Liu et al., 2019).
The predominant populations of immune cells we found to be enriched and activated in COVID-19 patients were myeloid cells and, in particular, subsets of inflammatory monocytes and macrophages, which differed between the blood, lung, and airway compartments. In the peripheral blood, we found significant enrichment of monocytes including classically activated inflammatory monocytes as well as another subset characterized by expression of CD33. This CD33+ myeloid subset appeared to be an alternatively activated population reminiscent of previously characterized IFN-activated macrophages and alveolar macrophages (AMs), which may represent an activation state specific to stimuli arising from the SARS-CoV2-infected lung. Myeloid cells enriched in the blood of COVID-19 patients were also correlated with pro-cell cycle and glycolysis gene signatures indicative of a metabolic status associated with pro-inflammatory M1 macrophages (Viola et al., 2019). In support of the presence of increased pro-inflammatory monocytes/macrophages, we also found significant increases in the expression of the pro-inflammatory IL-1 family member IL18, the interferon-induced chemokine CXCL10, and the chemokine receptor CCR2, all of which have cited roles in the response to respiratory virus infection (Tang et al., 2005; Teijaro, 2015; Wang et al., 2004).
In the lung tissue, we observed increased expression of a number of chemokine ligands for CCR2 (CCL2, CCL7, CCL8) and other receptors (CCL3, CXCL10, CXCL16), which would promote the mobilization of activated immune cells to the locus of infection. While in the blood, we found evidence of increased monocyte/macrophage populations, in the lung tissue, we also observed enrichment of gene signatures of other myeloid cells including neutrophils and low density granulocytes (LDGs), which we could not assess in the blood due to the process of isolating PBMCs. The role of neutrophils and LDGs in COVID-19 pathogenesis has been poorly characterized although they have been found in increased numbers and associated with poor disease outcome in COVID-19 patients (Huang et al., 2020). In addition, some have proposed that the formation of neutrophil extracellular traps (NETs) contributes to increased multi-organ damage and risk of death from SARS-CoV2 infection (Barnes et al., 2020; Thierry and Roch, 2020).
Monocyte/macrophage subsets in the lung of COVID-19 patients were characterized as infiltrating inflammatory monocytes and activated AMs, which exhibited a mixed metabolic status suggestive of different states of activation. Infiltrating monocytes from the peripheral blood appeared to be further activated in the lung tissue as evidenced by enhanced expression of alarmins and markers of highly inflammatory monocytes previously characterized in severe COVID-19 cases (Liao et al., 2020). In particular, we observed increased expression of IL-1 family members, most notably ILIA and enrichment of a pro-inflammatory IL-1 signature in the lung of COVID-19 patients. We also found evidence of a population of AMs previously defined in the BAL of COVID-19 patients (Liao et al., 2020). Alveolar macrophages are typically suppressed at steady state, but adopt a pro-inflammatory phenotype after viral infection in order to assist in phagocytosis and clearance of virus (Newton et al., 2016). However, we observed that this AM population also clustered with genes involved in the coagulation system, which is consistent with observations of procoagulant AM activity in ARDs (Tipping et al., 1988). As pulmonary thrombosis has been associated with poor clinical outcomes in COVID-19 patients, this result suggests that AMs in SARS-CoV2-infected lung tissue may be detrimental to host recovery from disease (Wichmann et al., 2020).
While we observed significant decreases in T cells in the blood of COVID-19 patients, they appeared to be increased in the lung tissue. However, upon closer examination, we noted that there was no enrichment of activated T cells or of cytotoxic CD8 T cells. This indicated that while T cells may be able to enter the infected lung tissue, they are not activated and thus lack the ability to clear virus-infected cells and resolve the inflammatory response. In cases of macrophage activation syndrome (MAS) defects in cytotoxic activity of CD8 T cells and NK cells result in increased cell-cell contacts, enhanced immune cell activation, and intensified production of pro-inflammatory cytokines. Many of the pro-inflammatory cytokines associated with MAS were also expressed in our COVID-19 patient samples, predominantly in the lung tissue and airway compartments, including IL1, IL18, and IL33 (Crayne et al., 2019; McGonagle et al., 2020). Thus, it is likely that the lack of activated CD8 T cells and NK cells and subsequent failure to clear virus-infected cells, is a major contributor to the dysregulated, macrophage-driven pathologic response to SARS-CoV2 observed in COVID-19 patients.
As disease progresses, increased infiltration of pro-inflammatory immune cells and release of inflammatory mediators in the lung is emulated by the presence of immune cells in the airway, which can be sampled through the BAL (Heron et al., 2012). Thus, the contents of the BAL fluid acts as a diagnostic marker or sensor for what is occurring in the lung tissue. In the airway, we detected gene signatures of various post-activated macrophage subsets including inflammatory M1 macrophages, alternatively activated M2 macrophages, and activated alveolar macrophages that had either migrated or been shed from the lung tissue. In conjunction with the presence of inflammatory macrophages, our results revealed a prominent IL-1 family signature in the airway of COVID-19 patients above what we observed in the blood or lung tissue and, in particular, an over 10-fold increase in expression of the alarmins ILIA and IL33 (Jing et al., 2016; Schmitz et al., 2005). These alarmins act as danger signals released by dead or dying cells to promote immune cell activation and therefore, their presence in the airway is likely indicative of an increase in virally infected lung cells. As SARS-CoV2 continues to propagate and viral clearance is impaired by a lack of cytotoxic activity, these alarmins promote inflammation and prevent resolution of the immune response in COVID-19 patients. Thus, this result further supports the notion that IL-1-mediated inflammation plays a critical role in COVID-19 pathogenesis. Expression of myeloid cell genes in the airway also correlated with a signature of oxidative metabolism, which is characteristic of M2 macrophages and typically associated with control of tissue damage (Viola et al., 2019). However, in the context of pulmonary infection, polarization of alveolar macrophages toward an anti-inflammatory M2 phenotype was found to promote continued pathogenesis, suggesting that these macrophages may not be effective mediators of anti-viral immunity (Allard et al., 2018).
We found increased enrichment of lung epithelial cells as well as increased expression of the viral entry receptors ACE2 and TMPRSS2, which are expressed by these cells, exclusively in the airway of COVID-19 patients. The presence of non-hematopoietic cells such as lung epithelium in the BAL is indicative of damage to the lung tissue as dead or dying cells are being sloughed off into the airway and suggests that localized inflammation as a result of enhanced immune cell infiltration promotes significant damage to the lung tissue. Furthermore, the lack of cytotoxic cells and thus, the inability to clear these virus-infected lung epithelial cells in the airway likely accounts for the increased presence of post-activated macrophages and high expression of pro-inflammatory IL-1 family members we observed in the BAL of COVID-19 patients. However, in contrast to previous characterizations of both SARS-CoV and SARS-CoV2 infection, our analysis did not indicate the presence of overwhelming pro-inflammatory cytokine production or “cytokine storm” (Tay et al.). Rather our data suggests that the increased numbers, overactivation, and potentially heightened pathogenicity of monocyte/macrophage populations are the main drivers of the dysregulated immune response and resulting tissue damage in COVID-19 patients.
Big Data Analyses Facilitate Drug Prediction
In the absence of effective antiviral treatment and/or a SARS-CoV2-specific vaccine, disease management is reliant upon supportive care and therapeutics capable of limiting the severity of clinical manifestations. Using empirical evidence as a guide, the current approach has been successful in identifying “actionable” points of intervention in an unbiased manner and in spite of formidable patient heterogeneity. Analyses presented here support several recent reports highlighting COVID19 infection-related increases in secreted factors, particularly IL6 and TNF, both of which function as predictors of poor prognosis (Pedersen and Ho, 2020; Voiriot et al., 2017), as well as complement activation (Gao et al., 2020; Huang et al., 2020; Mehta et al., 2020). Accordingly, anti-IL6 therapies including sarilumab, tocilizumab and clazakizumab, as well as biologics targeting terminal components of the complement cascade, such as eculizumab and ravulizumab, are in various clinical trial phases for treating COVID19 associated pneumonia. Candidate TNF blockers such as adalimumab, entanercept and many others, represent additional options for inhibiting deleterious pro-inflammatory signaling. Interestingly, numerous tubulin inhibitors were identified among the complied list of drugs counteracting infection-induced genomic changes; it is therefore of note that clinical trials involving colchicine, an antimitotic drug that binds soluble tubulin, are currently underway, providing further validation for the unbiased drug-prediction methodology presented here. Our analyses also point to the likely involvement of pro-inflammatory IL1 family members especially in the lung, suggesting anti-IL1 interventions, including canakinumab and anakinra, may be effective in preventing acute lung injury.
Overall, this report establishes the predominance of inflammatory monocyte/myeloid lineage cells in driving disease pathology and suggest therapies effective at blocking myeloid cell recruitment or forcing repolarization may prevent disease progression. CCL5 (RANTES) is a potent leukocyte chemoattractant that interacts with multiple receptors, including CCR1 (upregulated in the blood, lung and airway), and CCR5 (upregulated in the airway). Disruption of the CCR5-CCL5 axis was recently tested using the CCR5 blocking monoclonal antibody leronlimab in a small compassionate use trial with promising preliminary results (Pattterson et al.). Similarly, CD74 which functions as the receptor for the pro-inflammatory cytokine macrophage migration inhibitory factor (MIF), was upregulated specifically in the lung and represents an attractive target for intervention by imalumab, an anti-MIF monoclonal antibody.
It has also been observed that COVID19 may predispose patients to thromboembolic disease (Klok et al., 2020; Middeldorp et al., 2020). Indeed, the gene expression analyses presented here showing altered expression of coagulation factors and fibrinogen genes suggests dysfunction within the intrinsic clotting pathway. These findings, together with evidence of excessive inflammation and complement activation, may contribute to the systemic coagulation underlying the remarkably high incidence of thrombotic complications observed in severely ill patients, thereby reinforcing recommendations to apply pharmacological anti-thrombotic medications.
Finally, there has been much recent discussion concerning the use of anti-rheumatic drugs for managing COVID19. In fact, chloroquine (CQ) was one compound predicted as an upstream regulator with potential phenotype-reversing properties. In vitro experiments examining the anti-viral properties of CQ and its derivative hydroxychloroquine (HCQ) were effective in limiting viral load, however the efficacy of these drugs in clinical trials has been less clear. Questions about drug timing, dosage and adverse events have all called into question the use of these drugs for COVID19 patients.
Using systems and methods of the present disclosure, a treatment may be selected for an individual based on data analyses of biological samples obtained from the individual, in order to treat or manage a COVID19 disease state or condition of the individual. For example, a targeted treatment for the blood may be selected for an individual based on data analyses of biological samples obtained from the individual, in order to treat or manage a COVID19 disease state or condition of the individual. For example, a targeted treatment for the lungs may be selected for an individual based on data analyses of biological samples obtained from the individual, in order to treat or manage a COVID19 disease state or condition of the individual. For example, a targeted treatment for the airways may be selected for an individual based on data analyses of biological samples obtained from the individual, in order to treat or manage a COVID19 disease state or condition of the individual. In some embodiments, the treatment may be selected from among a plurality of different treatments for an individual based on data analyses of biological samples obtained from the individual, in order to treat or manage a COVID19 disease state or condition of the individual.
Weighted Gene Co-Expression Network Association
DEseq2 normalized log 2 transformed RNAseq gene counts for CRA2390 and GSE147507 datasets were used as inputs to WGCNA (V1.69). Adjacency co-expression matrices for all genes in a given set were calculated by Pearson's correlation using signed network type specific formulae. Blockwise network construction was performed using a soft threshold power value of 30 in order to preserve maximal scale free topology of the networks. Resultant dendrograms of correlation networks were trimmed to isolate individual modular groups of genes, labeled using semi-random color assignments, based on a detection cut height of 1, with a merging cut height of 0.2, with the additional use of a partitioning around medoids function. The module eigenegene (ME) vector per sample was calculated as the first principle component of the module's gene expression counts. Module correlations to cohort were calculated using Pearson's r against MEs, defining modules as either positively or negatively correlated as a whole by averaging constituent sample ME correlations to cohort. The strength of module representation was established by inspecting the number of members of the disease or healthy samples contributing to the overall average ME correlation to disease state. Majority modules highly representative of their cohort were those where more than half of the cohort constituent MEs were correlated in the same direction and general scale to cohort. Quality majority modules were those with the additional requirement that the opposing cohort correlations were all running in the opposite direction. Minority quality disease state modules were considered as being more representative of genetic expressions unique to patient rather than cohort. Module membership statistics were calculated including kIM, a measurement of intramodular connectivity of each gene's expression values across samples to neighboring module genes, kME, the correlation of each gene to its containing module eigengene, and general correlation of gene expression values to cohort. Hub genes were considered as those with the highest kIMs, and as a general rule also had the highest kMEs. Complete composite module preservation statistics were calculated using WGCNA's modulePreservation function through 200 permutations of the three data sets independently tested against each other as either a reference or test set. The Z summary statistic was selected as the global index of module preservation and is a composite of seven density and connectivity preservation statistics. All constituent statistics were retained for future granular analysis. Modules were considered as moderately preserved between reference-test network pairs as those with a Z summary statistic between 5 and 10, and as highly preserved as those with a Z summary score above 10. Preservation median rank was calculated to scale module preservation to module size. Gene symbols common between moderately and highly preserved modules in each reference-test pair were selected for ensuing GSVA enrichment analysis.
MEGENA Gene Co-Expression Network Association
The MEGENA (Multiscale Embedded Gene Co-expression Network Analysis) package (v1.3.7) was applied to reconstruct co-expression networks, as described by, for example, [Won-Min Song and Bin Zhang (2018). MEGENA: Multiscale Clustering of Geometrical Network. R package version 1.3.7. CRAN.R-project.org/package=MEGENA], which is incorporated by reference herein in its entirety. The DEseq2 normalized log 2 transformed RNAseq gene counts used for WGCNA analysis were used for initial MEGENA correlation calculation. Gene expression rows with a standard deviation of less than 0.2 were discarded. The correlations between all remaining gene pairs was calculated using Pearson's correlation coefficient (φ using MEGENA's calculation correlation function. After calculation of true pairwise ρ values, a permutation procedure determined the necessary ρ value necessary to achieve a false positive rate (FPR) of 0.05. The gene expression matrix was shuffled through 10 permutations and ρ coefficient results were pooled. Gene correlations greater than an FPR of 0.05 were discarded. An FDR threshold of 15% was additionally enforced to reduce type-one errors. Thresholded FDR correlations were submitted to the MEGENA package for generation of a planar filtered network (PFN) of genes mapped to each other through network connectivity strength. Briefly, gene pairs were first ranked based on expression similarity and then iteratively tested for planarity to expand the PFN, while favoring those pairs with larger similarities. Multi-scale module structures were generated through Multiscale Clustering Analysis (MCA) by clustering initial connected components of the PFN as parent clusters, with clustering repeated in an iterative fashion down through module lineages until no meaningful descendent modules remained to split. A minimum module size of 20 genes was enforced throughout. Multiscale Hub Analysis (MHA) was performed to detect significant hubs of individual clusters and across a, characterizing different scales of organizations in the PFN with emphasis given to multiscale hubs. Significant MEGENA modules were selected that showed a significance of compactness by p<0.05. Analogous to WGCNA, Cluster-Trait Association Analysis (CTA) was performed by performing principle component analysis (PCA) on each module and correlated to cohort, with correlation significance set to Fisher's adjust p value <0.05. First-generation modules (founding members of their lineage) were examined for various cell and pathway signature gene set variation analysis (GSVA) enrichments, with significant enrichments as those with a Hedge's corrected adjusted p value <0.05. MEGENA modules were coerced into WGCNA's module preservation function and analyzed as before by considering overlapping genes from reference-test network pairs with a mutual minimum composite Z summary statistic of 5. MEGENA modules were renamed to indicate their pedigree and networks were visualized using sunburst diagrams to depict module lineages. Some sunbursts were colored using majority WGCNA color assignment to elucidate differences in module creation between the MEGENA and WGCNA approaches. Some were colored using majority cell signature enrichment to demonstrate inheritance of cell signature through module lineage, and others colored with gene expression log-fold change. Heatmaps of log 2 transformed module gene expression were inspected to establish how well modules represented their cohort, similar to majority and minority quality WGCNA modules. Heatmaps were also generated of various module GSVA cell signature enrichments to further curate modules of immunological and other interests.
Module Gene Co-Expression Network Visualization
Modules showing high enrichment of cell signatures of interest were selected for additional enrichment analysis, pathway annotation, and network visualization. Module official HGNC gene symbols were imported into Cytoscape (v3.8.0) through its STRING (v11) protein query set to a confidence score cutoff of 0.9 with zero allowed maximum additional interactions. Cytoscope is described by, for example, [Shannon P, Markiel A, Ozier O, Baliga N S, Wang J T, Ramage D, Amin N, Schwikowski B, Ideker T. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res, 13:11 (2498-504). 2003 November PubMed ID: 14597658], which is incorporated by reference herein in its entirety. String is described by, for example, [Szklarczyk D, Gable A L, Lyon D, Junge A, Wyder S, Huerta-Cepas J, Simonovic M, Doncheva N T, Morris J H, Bork P, Jensen L J, von Mering C. STRING v11: protein-protein association networks with increased coverage, supporting functional discovery in genome-wide experimental datasets. Nucleic Acids Res. 2019 January; 47:D607-613], which is incorporated by reference herein in its entirety. Edges (connections) between gene nodes were weighted as the STRING connection strengths. Genes with less than 3 connections to adjacent nodes were discarded. MCODE clustering was calculated on the whole network with a degree cutoff of 2 and clusters found allowing haircut and fluff, and some visualizations colored by MCODE cluster, as described by, for example, [Bader G D, Hogue C W. An automated method for finding molecular complexes in large protein interaction networks. BMC Bioinformatics, 4: (2). 2003 Jan. 13. PubMed ID: 12525261], which is incorporated by reference herein in its entirety. AMPEL cell signature annotations were merged to the nodes data table and used to adjust node color, border color, and sizes to call out genes of interest. Various attribute and network cluster algorithms from the Cytoscape clusterMaker app were applied to help simplify visualizations, including GLay community clustering, as described by, for example, [Morris J H, Apeltsin L, Newman A M, Baumbach J, Wittkop T, Su G, Bader G D, Ferrin T E. clusterMaker: a multi-algorithm clustering plugin for Cytoscape. BMC Bioinformatics, 12: (436). 2011 Nov. 9. PubMed ID: 22070249], which is incorporated by reference herein in its entirety. BiNGO was used to query various GO (gene ontology) pathways to further elucidate network relations to biological processes, as described by, for example, [Maere S, Heymans K, Kuiper M. BiNGO: a Cytoscape plugin to assess overrepresentation of gene ontology categories in biological networks. Bioinformatics, 21:16 (3448-9). 2005 Aug. 15. PubMed ID: 15972284], which is incorporated by reference herein in its entirety. Clusters from various modules were visualized together and examined for interconnectedness. Meta clusters were created by combining genes with similar functional annotation into composite nodes, with edges between them weighted as the total number of MCODE connections between constituent genes.
Derivation of Lung Tissue Cell Populations for GSVA
Nine single-cell RNA-seq lung cell populations (AT1, AT2, Ciliated, Club, Endothelial, Fibroblasts, Immuno Monocytes, Immuno T Cells, and Lymphatic Endothelium) were downloaded from the Eils Lung Tissues set (www.biorxiv.org/content/10.1101/2020.03.13.991455v3) accessed by the UC Santa Cruz Genome Browser (eils-lung.cells.ucsc.edu). Genes occurring in more than one cell type were removed. Additionally, genes known to be expressed by immune cells were removed. The Immuno Monocyte and Immuno T cell categories were not employed in further analyses.
Derivation of Co-Expressed Myeloid Subpopulations in Each Compartment
Co-expression analyses were conducted in R. Sample (control and patient) log 2 expression values for each gene of the 221 identified monocyte/myeloid cell genes in were analyzed for their Pearson correlation coefficient in each tissue compartment (PBMC, lung, BAL) using the Cor function. Of note, only 195 of 221 genes had changes in gene expression in at least one tissue by RNA-seq, and these genes are shown. Resulting correlation matrices were hierarchically clustered by their Euclidian distance into 2 clusters (k=2) using the heatmap.2 function in R. This resulted in 2 monocyte/myeloid co-expressed clusters in each compartment. The co-expressed myeloid populations in each compartment were then evaluated by GSVA.
Co-Expressed Myeloid Subpopulations Venn Diagrams
SARS-CoV2 may refer to an uncharacterized coronavirus and causative agent of the COVID-19 pandemic. The host response to SARS-CoV2 has not yet been fully delineated, hampering a precise approach to therapy. To address this, a comprehensive analysis of gene expression data from the blood, lung, and airway of COVID-19 patients was performed. The results obtained indicate that COVID-19 pathogenesis is driven by populations of myeloid-lineage cells with highly inflammatory but distinct transcriptional signatures in each compartment. The relative absence of cytotoxic cells in the lung indicates a model in which delayed clearance of the virus may permit exaggerated myeloid cell activation that contributes to disease pathogenesis by the production of inflammatory mediators. The gene expression profiles may also identify potential therapeutic targets that may be modified with available drugs. The results and data indicate that transcriptomic profiling can provide an understanding of the pathogenesis of COVID-19 in individual patients.
Coronaviruses (CoV) generally refer to a group of enveloped, single, positive-stranded RNA viruses causing mild to severe respiratory illnesses in humans (Refs. 1-3). In the past two decades, three worldwide outbreaks have originated from CoVs capable of infecting the lower respiratory tract, resulting in heightened pathogenicity and high mortality rates. There is currently a global pandemic stemming from a third CoV strain, severe acute respiratory syndrome coronavirus 2 (SARS-CoV2), the causative agent of coronavirus disease 2019 (COVID-19). In the majority of cases, patients exhibit mild symptoms, whereas in more severe cases, patients may develop severe lung injury and death from respiratory failure (Refs. 4-5).
At this time, there is still incomplete information available regarding the host response to SARS-CoV2 infection and the perturbations resulting in a severe outcome. Despite this, clues can be derived from our knowledge of the immune response to infection by other respiratory viruses, including other CoVs. After infection, viruses are typically detected by pattern recognition receptors (PRRs) such as the inflammasome sensor NLRP3, which signal the release of interferons (IFNs) and inflammatory cytokines including the IL-1 family, IL-6, and TNF, that activate a local and systemic response to infection (Refs. 6-7). This involves the recruitment, activation, and differentiation of innate and adaptive immune cells, including neutrophils, inflammatory myeloid cells, CD8 T cells, and natural killer (NK) cells (Ref 8). Resolution of infection may be largely dependent on the cytotoxic activity of CD8 T cells and NK cells, which enable clearance of virus-infected cells (Ref 8). Failure to clear virus-infected cells may facilitate a hyper-inflammatory state termed Macrophage (M1) activation syndrome (MAS) or “cytokine storm”, and ultimately damage to the infected lung (Refs. 9-10).
The recent emergence of SARS-CoV2 and the relative lack of comprehensive knowledge regarding the progression of COVID-19 disease has constrained the ability to develop effective treatments for infected patients. One approach to obtain a more complete understanding of the host response to SARS-CoV2 is to examine gene expression in relevant tissues. A limited number of gene expression profiles may be available from patients with COVID-19 and have yielded some insights into the pathogenic processes triggered by infection with SARS-CoV2 (Refs. 11-13). However, because of the small number of samples and limited analysis, a comprehensive understanding of the biological state of SARS-CoV2-affected tissues may not be available. Recognizing this need, the present disclosure provides an analysis and assessment of available SARS-CoV2 gene expression datasets from blood, lung, and airway using a number of orthogonal bioinformatic tools to provide a more complete view of the nature of the COVID-19 inflammatory response and the potential points of therapeutic intervention.
Gene expression analysis of blood, lung, and airway of COVID-19 patients was performed as follows. To characterize the pathologic response to SARS-CoV2 infection, transcriptomic data was analyzed from peripheral blood mononuclear cells (PBMCs) and postmortem lung tissue of COVID-19 patients and healthy controls as well as bronchoalveolar lavage (BAL) fluid of COVID-19 patients (CRA002390, GSE147507,
Conserved and differential enrichment of inflammatory cells and pathways in COVID-19 patients was analyzed as follows. To interrogate pathologic pathways in the 3 compartments, Gene Set Variation Analysis (GSVA) was performed utilizing a number of informative gene modules (Refs. 14-15) (
Although there is conflicting data on the presence of an IFN gene signature (IGS) and whether SARS-CoV2 infection induces a robust IFN response (Refs. 12-13 and 20), increased expression was observed of Type I IFN genes (IFNA4, IFNA6, IFNA10), and significant enrichment was observed of the common Type I and Type II IGS, including enrichment of IFNA2, IFNB1 and IFNG gene signatures specifically in the lung tissue (
Increased expression of inflammatory mediators in the lungs of COVID-19 patients was analyzed as follows. To examine the nature of the inflammatory response in the tissue compartments in greater detail, specific DEGs of interest (
It was observed that non-hematopoietic cells in the BAL fluid may be indicative of viral-induced damage, as follows. To determine whether viral infection resulted in modification of resident tissue populations, GSVA was performed with various non-hematopoietic cell gene signatures (
It was observed that protein-protein interactions identify myeloid subsets in COVID-19 patients, as follows. An unbiased, protein-protein interaction (PPI)-based clustering approach was utilized to assess the inflammatory cell types within each tissue compartment. PPI networks predicted from DEGs were simplified into metastructures defined by the number of genes in each cluster, the number of significant intra-cluster connections, and the number of associations connecting members of different clusters to each other (
In addition to the various Mo/myeloid populations, lung tissue was infiltrated by LDGs, granulocytes, T cells, and B cells. Metabolic function in the lung was varied, with Mo-enriched clusters (1 and 7) linked to glycolysis in cluster 18 (interaction scores of 0.74 and 0.87, respectively) potentially reflecting cellular activation, whereas OXPHOS was predominantly downregulated along with other nuclear processes (transcription and mRNA processing) (
Given the large number of clusters including Mo/myeloid/MΦ, next these clusters were examined in greater detail by altering the stringency of PPI clustering to further characterize unique myeloid lineage cells within each tissue compartment (
In the lung (
Characterization of myeloid populations in COVID-19 patients was performed as follows. The overlap between characterized BAL-defined gene signatures from COVID-19 patients (Ref. 27) and tissue-defined PPI clusters motivated an evaluation of these populations in greater detail by GSVA. Consistent with PPI clusters, the inflammatory-MΦ G1 population was increased in the blood (
It was observed that co-expression further delineates Mo/MΦ gene expression profiles of COVID-19 patients, as follows. Next, the biology of the populations of Mo/MΦ in the tissue compartments were examined in greater detail. A set of 196 co-expressed Mo/myeloid genes was derived and used (
To determine the function and nature of these myeloid populations, they were compared to other myeloid signatures (
Also, the overlap between the Mo/MΦ A1-A3 gene clusters and those identified using PPI clustering (
Trajectory analysis was performed to understand potential transitions of Mo/MΦ in various tissue compartments. This was based on the normalized 196 myeloid-cell specific genes, as well as 425 additional normalized genes that may be important in Mo/myeloid/MΦ cell differentiation, reflective of chemotaxis, IFN, and metabolism genes. This analysis suggested a branch point of differentiation of Mo/MΦ between blood and lung, with some blood Mo/MΦ differentiating directly to airway cells and others to lung cells in a more protracted manner as indicated by pseudotime (
Analysis of the biologic activities of myeloid subpopulations was performed as follows. To focus on functional distinctions among the co-expressed myeloid populations in the blood, lung, and airway compartments (A1-A3), linear regression analyses were utilized between GSVA scores for A1-A3 and scores for metabolic, functional, and signaling pathways (
It was observed that pathway and upstream regulator analysis inform tissue-specific drug discovery for treatment of COVID-19, as follows. To understand the biology of SARS-CoV2-infected patients in greater detail, pathway analysis was conducted on DEGs from the 3 compartments using IPA canonical signaling pathway and upstream regulator (UPR) analysis functions (
UPRs predicted to drive the responses in each compartment indicated uniform involvement of inflammatory cytokines, with Type I IFN regulation dominant in the SARS-CoV2-infected lung (
IPA analysis was also employed to predict drugs that might interfere with COVID-19 inflammation (
Another approach to predict possible drug targets is by employing connectivity scoring with drug-related gene expression profiles using the perturbagen CMAP database within CLUE (Tables 8A-8B). Although CLUE-predicted drugs tended to differ from those predicted by IPA or those matched to IPA-predicted UPRs, there were some overlapping mechanisms, including inhibition of AKT, angiogenesis, CDK, EGFR, FLT3, HSP, JAK, and mTOR. IPA-predicted drugs that were unique from connectivity-predicted drugs tended to capture more cytokine and lymphocyte biology, including inhibitors of IL1, IL6, IL17, TNF, type I and II interferon, CD40LG, CD38, and CD19, among other cytokines and immune cell-specific markers. Overall, the gene expression-based analysis of SARS-CoV2-infected blood and tissue compartments indicated several existing treatment options that may be evaluated as candidates to treat COVID-19, and subsequently administered to patients having COVID-19.
Multiple orthogonal bioinformatics approaches were employed to analyze DEG profiles from the blood, lung, and airway of COVID-19 patients, and revealed the dynamic nature of the inflammatory response to SARS-CoV-2 and possible points of therapeutic intervention. In the blood, evidence was seen of myeloid cell activation, lymphopenia, and elevation of plasma cells, as has been shown by both standard cell counts, flow cytometry and gene expression analysis (Ref 34). In the lungs, increased gene signatures were found of additional myeloid cell types including granulocytes, infiltrating inflammatory Mo, and AMs as well as the presence of non-activated T, B, and NK cells. Furthermore, inflammation in the lung tissue was enhanced by the greater presence of IFNs and more pro-inflammatory cytokines than observed in the blood. Finally, in the airway evidence was found of blood and AM-derived inflammatory and regulatory MΦs, and non-hematopoietic lung tissue cells accompanied by expression of SARS-CoV-2 receptors, and alarmins, indicative of viral infection and damage to the lung and consistent with detection of SARS-CoV2 in BAL fluid (Refs. 11 and 35). Together these findings indicate a systemic, but compartmentalized immune and inflammatory response with specific signs of cellular activation in blood, lung, and airway. This has informed a more comprehensive and integrated model of the nature of the local and systemic host response to SARS-CoV2.
The predominant populations of immune cells found to be enriched and activated in COVID-19 patients were myeloid cells and, in particular, subsets of inflammatory Mo and MΦs, which differed between the blood, lung, and airway compartments. In the peripheral blood, significant enrichment of Mo was found, including classically activated inflammatory M1 MΦs as well as a CD33+ myeloid subset, which appeared to be an M2 population reminiscent of characterized IFN-activated MΦs, AMs, and MDSCs, indicative of a potential regulatory population induced by stimuli arising from the SARS-CoV2-infected lung. Myeloid cells enriched in the blood of COVID-19 patients were also highly correlated with gene signatures of metabolic pathways (Glycolysis, Pentose Phosphate Pathway, and TCA cycle) indicative of pro-inflammatory M1 MΦs (Ref. 36).
The lung tissue was enriched in gene signatures of Mo/MΦs as well as other myeloid cells including two populations of granulocytes, neutrophils and LDGs. Increases in blood neutrophils may be associated with poor disease outcome in COVID-19 patients, and the formation of neutrophil extracellular traps (NETs) may contribute to increased risk of death from SARS-CoV2 infection (Refs. 37-39). In addition, populations of dysregulated neutrophils expressing pro-inflammatory or suppressive markers derived from scRNA-seq of COVID-19 patient PBMCs may be characterized and found to be positively correlated with disease severity (Refs. 18-19). These populations were found to be also increased in SARS-CoV2 infected lung tissue and, therefore, indicate that they may contribute to lung pathology. Although LDGs have not been reported in the COVID-19 lung, in comparison to neutrophils, they exhibit an enhanced capacity to produce Type I IFNs and form NETs and therefore, may have an even greater impact on disease progression (Ref 40).
Mo/MΦ subsets in the lung of COVID-19 patients were characterized as infiltrating inflammatory Mo and activated AMs, which exhibited a mixed metabolic status suggestive of different states of activation. Infiltrating Mo from the peripheral blood appeared to be further activated in the lung tissue as evidenced by enhanced expression of markers of highly inflammatory Mo characterized in severe COVID-19 cases (Ref 27). The AM population enriched in COVID lung tissue clustered with genes involved in the coagulation system, which is consistent with observations of procoagulant AM activity in COVID-19 and in ARDS (Ref. 41). As pulmonary thrombosis has been associated with poor clinical outcomes in COVID-19 patients, this result indicates that activated AMs in SARS-CoV2-infected lung tissue may be involved in facilitating a pro-thrombotic status, and thereby, contribute to poor disease outcome (Ref 42). Finally, a trend was noted toward an increase in platelets specifically in the COVID-19 lung, indicating that they may also contribute to thrombosis in some patients.
In the airway, gene signatures were detected of various post-activated MΦ subsets including inflammatory M1 MΦs, alternatively activated M2 MΦs, and activated AMs. Expression of myeloid cell genes in the airway also correlated with a signature of oxidative metabolism, which is characteristic of M2 macrophages and typically associated with control of tissue damage (Ref. 36). However, in the context of pulmonary infection, polarization of AMs toward an anti-inflammatory M2 phenotype was found to promote continued inflammation, suggesting that these MΦs may not be effective at resolving anti-viral immunity (Ref. 43).
In addition to myeloid cells, inflammatory mediators from the virally infected lung typically promote migration and activation of NK cells and adaptive immune cells including T and B cells (Ref 8). Significant deficiencies were found in gene signatures of T cells and cytotoxic CD8 and NK cells, consistent with clinical evidence of lymphopenia in the peripheral blood and airway of COVID-19 patients (Refs. 44-47). In contrast to T and NK cells, increased evidence was observed of B cell activation through CD40/CD40L and an increased plasma cell signature in the blood of COVID-19 patients. This result indicates that COVID-19 patients are able to mount an antibody-mediated immune response. However, whether a virus-specific antibody response is beneficial to recovery from SARS-CoV2 infection may be unclear (Ref. 48). Low quality, low affinity antibody responses to SARS-CoV may promote lung injury in some patients, although it may be unknown if this occurs in SARS-CoV2 infected individuals (Refs. 49-50).
The contents of the airway as assessed through the BAL fluid, act as a window into events in the alveoli and airways and can be used to understand what is happening in the infected tissue that is separate from the interstitium of the lung (Refs. 51-52). Increased enrichment was found of lung epithelial cells in the airway of COVID-19 patients, indicating that SARS-CoV2 infection of alveolar cells together with localized inflammation as a result of enhanced myeloid cell infiltration promote significant damage to the alveoli and result in affected cells being sloughed into the airway. Furthermore, the lack of cytotoxic cells and thus, the inability to clear these virus-infected lung epithelial cells in the airway likely accounts for the increased presence of post-activated MΦs and high expression of pro-inflammatory IL-1 family members observed in the BAL of COVID-19 patients. Importantly, these results indicate that sampling of BAL may provide an important mechanism to evaluate the impact of SARS-CoV2 infection.
DEGs from COVID-19 patients were enriched in IGS, complement pathways, inflammatory cytokines and the inflammasome, which may be expected to activate Mo/MΦ populations in the blood, lung, and airway of COVID-19 patients and initiate a robust and systemic response to infection. In particular, these results indicate that IL-1 family-mediated inflammation plays a critical role in COVID-19 pathogenesis. However, pro-inflammatory genes identified via GWAS as contributing to COVID-19 inflammation, including CCR2, CCR3, CXCR6, and MTA2B, were not significantly different from controls in the lung dataset (Ref 53). Thus, in contrast to characterizations of both SARS-CoV and SARS-CoV2 infection, these analyses did not indicate the presence of overwhelming pro-inflammatory cytokine production or “cytokine storm” in COVID-19 patients (Refs. 34 and 54). Rather, these data indicates that the increased numbers, overactivation, and potentially heightened pathogenicity of monocyte/MΦ populations are the main drivers of the dysregulated inflammatory response and resulting tissue damage in COVID-19 patients.
In the absence of proven antiviral treatment and/or a SARS-CoV2-specific vaccine, disease management may be reliant upon supportive care and therapeutics capable of limiting the severity of clinical manifestations. Using empiric evidence as a guide, the current approach may be successful in identifying “actionable” points of intervention in an unbiased manner and in spite of formidable patient heterogeneity. Analyses presented here support COVID-19 infection-related increases in inflammatory cytokines, particularly IL6 and TNF, both of which function as predictors of poor prognosis (Refs. 55-56), as well as complement activation (Refs. 37 and 57-58). Accordingly, anti-IL6 therapies including sarilumab, tocilizumab and clazakizumab, as well as biologics targeting terminal components of the complement cascade, such as eculizumab and ravulizumab, are in various clinical trial phases for treating COVID associated pneumonia. Candidate TNF blockers such as adalimumab, etanercept and many others, represent additional options for inhibiting deleterious pro-inflammatory signaling. However, most showed patient heterogeneity, indicating a requirement to identify the specific cytokine profile in each patient in order to offer personalized treatment. Our analyses also indicate the likely involvement of pro-inflammatory IL1 family members especially in the lung, suggesting anti-IL1 family interventions, including canakinumab and anakinra, may be effective in preventing acute lung injury.
This analysis also establishes the predominance of inflammatory Mo/myeloid lineage cells in driving disease pathology and indicates therapies effective at blocking myeloid cell recruitment or forcing repolarization may prevent disease progression. CCL5 (RANTES) is a potent leukocyte chemoattractant that interacts with multiple receptors, including CCR1 (upregulated in the blood, lung and airway), and CCR5 (upregulated in the airway). Disruption of the CCR5-CCL5 axis may be tested using the CCR5 neutralizing monoclonal antibody leronlimab in a compassionate use trial (Ref 59).
It may be observed that COVID-19 may predispose patients to thromboembolic disease (Refs. 60-61). Indeed, the gene expression analyses presented here showing altered expression of coagulation factors and fibrinogen genes indicate dysfunction within the intrinsic clotting pathway. These findings, together with evidence of excessive inflammation, complement activation and the involvement of LDGs in lung inflammation, may contribute to the systemic coagulation underlying the remarkably high incidence of thrombotic complications observed in severely ill patients, thereby reinforcing recommendations to apply pharmacological anti-thrombotic medications.
Further, anti-rheumatic drugs may be used for managing COVID-19. For example, CQ is a compound predicted as a UPR with potential phenotype-reversing properties. In vitro experiments examining the anti-viral properties of CQ and its derivative HCQ were effective in limiting viral load; however, the efficacy of these drugs in clinical trials has been less clear (Ref. 62). Questions about drug timing, dosage and adverse events may call into question the use of these drugs for COVID-19 patients. Despite results showing no negative connectivity between the gene signatures of SARS-CoV2 infection and HCQ treatment (Ref. 35), IPA predicted a role of anti-malarials as limiting the function of intracellular TLRs in the lung and also as a direct negative UPR of gene expression abnormalities in the lung, indicating a role in controlling COVID-19 inflammation and not viral replication. Further clinical testing may be performed to establish this possible utility; subsequently, these anti-malarials may be administered to COVID-19 patients to treat disease.
By comparing the transcriptomic profile of the blood, lung, and airway in COVID-19 patients, a model of the systemic pathogenic response to SARS-CoV2 infection has emerged (
As SARS-CoV2 continues to propagate, viral clearance may be impaired by a lack of cytotoxic CD8 T cells and NK cells. This is consistent with MAS occurring in other settings, in which defects in cytotoxic activity of CD8 T cells and NK cells result in enhanced innate immune cell activation and intensified production of pro-inflammatory cytokines, many of which were also expressed in COVID-19 patients (Refs. 9-10). Thus, the results indicated that the lack of activated CD8 T cells and NK cells and subsequent failure to clear virus-infected cells, is a major contributor to the MΦ-driven pathologic response to SARS-CoV2 observed in COVID-19 patients.
In order to develop a model of SARS-CoV2 infection, multiple orthogonal approaches were utilized to analyze gene expression from COVID-19 patients. Heterogeneity among patients in any given cohort may have an increased impact on the overall outcome. One possible reason for this intra-cohort heterogeneity is that the patients may have exhibited varying levels of disease severity. Therefore, additional studies may be performed on more patients, such as by accounting for differences in demographic information and disease status, in order to increase the power of downstream analyses.
As shown, transcriptomic analysis has contributed critical insights into the pathogenesis of COVID-19. Diffuse Mo/MΦ activation is the likely primary driver of clinical pathology. Therefore, this work provides a rationale for placing greater focus on the detrimental effects of exaggerated activation of pathogenic Mo/MΦs and for targeting these populations as an effective treatment strategy for COVID-19 patients.
Ethics statement: publicly available data sets used in this study are listed in Table 5. For each dataset, all patient samples were collected in adherence to local regulations and after obtaining institutional review board approved informed consent. The study corresponding to accession number CRA002309 was approved by the Ethics Committee of the Zhongnan Hospital of Wuhan University. The study corresponding to accession number GSE147505 was approved by the institutional review board at the Icahn School of Medicine at Mount Sinai under protocol HS #12-00145.
Read quality, trimming, mapping and summarization were performed as follows. RNA-seq data were processed using a consistent workflow using FASTQC, Trimmomatic, STAR, Sambamba, and featureCounts. As described below SRA files were downloaded and converted into FASTQ format using SRA toolkit. Read ends and adapters were trimmed with Trimmomatic (version 0.38) using a sliding window, ilmnclip, and headcrop filters. Both datasets were head cropped at 6 bp and adapters were removed before read alignment. Reads were mapped to the human reference genome hg38 using STAR, and the .sam files were converted to sorted .bam files using Sambamba. Read counts were summarized using the featureCounts function of the Subread package (version 1.61).
The RNA-seq tools are all free, open source programs, as follows: SRA toolkit (github.com/ncbi/sra-tools); FastQC (www.bioinformatics.babraham.ac.uk/projects/fastqc/); Trimmomatic (www.usadellab.org/cms/?page=trimmomatic); STAR (github.com/alexdobin/STAR); labshare.cshl.edu/shares/gingeraslab/www-data/dobin/STAR/STAR.posix/doc/STARmanual.pdf; Sambamba (lomereiter.github.io/sambamba/); and FeatureCounts (subread.sourceforge.net).
Differential gene expression and gene set enrichment analysis were performed as follows. The DESeq2 workflow was used for differential expression analysis. Comparisons were made between control PBMCs and PBMCs from COVID-19 patients (PBMC-CTL vs. PBMC-CoV2) and control lung tissue and lung tissue from COVID-19 patients (Lung-CTL vs. Lung-CoV2). Since no corresponding control BAL samples were available for the COVID-19 BAL samples, BAL samples were compared from COVID-19 patients to COVID-19 PBMC (PBMC-CoV2 vs BAL-CoV2). This was possible because these samples were analyzed on the same platform, run at the same time. Also, normal BAL were compared to BAL of asthmatic individuals to identify genes unrelated to COVID-19 (PRJNA434133).
Two technical replicates were included for BAL cohort, and 4 technical replicates were included for postmortem lung samples. The replicates were collapsed and averaged into one using collapsereplicates function from DESeq2 package. The genes with low expression (i.e genes with very few reads) were removed by filtration. The filtered raw counts were normalized using the DESeq method and differentially expressed genes were determined by FDR<0.2 (Ref 63). Counts were then log 2 transformed and used for downstream analyses (Table 6).
Gene Set Variation Analysis (GSVA) was performed as follows. The GSVA (version 1.25.0) software package (Ref. 64) is an open source package available from R/Bioconductor and was used as a non-parametric, unsupervised method for estimating the variation of pre-defined gene sets in patient and control samples of microarray and RNA-seq expression data sets (www.bioconductor.org/packages/release/bioc/html/GSVA.html). The inputs for the GSVA algorithm were a gene expression matrix of log 2 expression values for pre-defined gene sets. All genes within a gene set were evaluated if the interquartile range (IQR) of their expression across the samples was greater than 0. Enrichment scores (GSVA scores) were calculated non-parametrically using a Kolmogorov Smirnoff (KS)-like random walk statistic and a negative value for a particular sample and gene set, indicating that the gene set has a lower expression than the same gene set with a positive value. The enrichment scores (ES) were the largest positive and negative random walk deviations from zero, respectively, for a particular sample and gene set. The positive and negative ES for a particular gene set depend on the expression levels of the genes that form the pre-defined gene set. GSVA calculates enrichment scores using the log 2 expression values for a group of genes in each SARS-CoV2 patient and healthy control and normalizes these scores between −1 (no enrichment) and +1 (enriched). Welch's t-test was used to calculate the significance of the gene sets between the cohorts. Significant enrichment of gene sets was determined by p-value <0.05.
Derivation of GSVA Gene Sets was performed as follows. Cellular and inflammatory modules and IFN-induced gene sets were derived (Refs. 14-15). Additional inflammatory pathways (NLRP3 Inflammasome, Classical Complement, Alternative Complement, and Lectin-induced Complement were curated from the Molecular Signatures Database. Other signatures were derived (Refs. 27, 30-33, and 65).
Additional hematopoietic cellular gene signatures (monocyte, myeloid, and neutrophil) were derived from I-Scope, a tool developed to identify immune cell specific genes in big data gene expression analyses. Non-hematopoietic fibroblast and lung cell gene sets were derived from T-Scope, a tool developed to identify genes specific for 45 non-hematopoietic cell types or tissues in big gene expression datasets. The T-Scope database contains 1,234 transcripts derived initially from 10,000 tissue enriched and 8,000 cell line enriched genes listed in the Human Protein Atlas. From the list of 18,000 potential tissue or cell specific genes, housekeeping genes and genes differentially expressed in 40 hematopoietic cell datasets were removed. The final gene lists were checked against available single cell analyses to confirm cellular specificity.
Nine single-cell RNA-seq lung cell populations (AT1, AT2, Ciliated, Club, Endothelial, Fibroblasts, Immuno Monocytes, Immuno T Cells, and Lymphatic Endothelium) were downloaded from the Eils Lung Tissues set (Ref. 66) accessed by the UC Santa Cruz Genome Browser (eils-lung.cells.ucsc.edu). Genes occurring in more than one cell type were removed. Additionally, genes known to be expressed by immune cells were removed. The Eils Lung Tissues set Immuno Monocyte, Immuno T Cell, Fibroblast, and Lymphatic Endothelium categories were not employed in further analyses.
Apoptosis and NFkB gene signatures were derived and modified from Ingenuity Pathway Analysis pathways Apoptosis Signaling and NFkB Signaling. ROS-protection was derived from Biologically Informed Gene-Clustering (BIG-C).
Network analysis and visualization were performed as follows. Visualization of protein-protein interaction and relationships between genes within datasets was done using Cytoscape (version 3.6.1) software. STRING (version 1.3.2) generated networks were imported into Cytoscape (version 3.6.1) and partitioned with MCODE via the clusterMaker2 (version 1.2.1) plugin. For PPIs in
Functional and cellular enrichment analysis was performed as follows. Functional enrichment of clusters was performed using Biologically Informed Gene-Clustering (BIG-C), which was developed to understand the potential biological meaning of large lists of genes (Ref 67). Genes are clustered into 53 categories based on their most likely biological function and/or cellular localization based on information from multiple on-line tools and databases including UniProtKB/Swiss-Prot, GO terms, KEGG Pathways, MGI database, NCBI PubMed, and the Interactome. Hematopoietic cellular enrichment was performed using I-Scope, a tool developed to identify immune cell specific genes in big data gene expression analyses. Statistically significant enriched types of cell types in DEGs were determined by Fisher's Exact test overlap p-value and then determining an Odds Ratio of enrichment.
Derivation of co-expressed myeloid subpopulations in each compartment was performed as follows. Co-expression analyses were conducted in R. Sample (control and patient) log 2 expression values for each gene of the 221 identified monocyte/myeloid cell genes in were analyzed for their Pearson correlation coefficient in each tissue compartment (blood, lung, and airway) using the Cor function. Of note, only 196 of 221 genes had changes in gene expression in at least one tissue by RNA-seq. Pearson correlations for these 196 genes were hierarchically clustered by their Euclidian distance into 2 clusters (k=2) using the heatmap.2 function in R. This resulted in 2 Mo/myeloid co-expressed clusters in each compartment corresponding to increased and decreased gene sets. The upregulated co-expressed genes were used to define the A1, A2, and A3 myeloid subpopulations from the blood, lung, and airway compartments, respectively (Tables 7A-7C). The co-expressed myeloid populations in each compartment (A1-A3) were then evaluated for enrichment by GSVA.
Inter-compartment myeloid gene comparisons were performed as follows. To compare relative expression of the 196 myeloid-specific genes among compartments, HTS filtered log 2 expression values for each gene were normalized to the average expression of FCGR1A, FCGR2A, and FCGR2C in each sample. Welch's t-test was used to calculate the significant differences in normalized gene expression between cohorts. Effect sizes were computed between cohorts using the cohen.d function with Hedges' correction in R.
Monocle analyses were performed as follows. Trajectory analyses were performed with Monocle (Refs. 68-70) version 2.14.0 in R. Gene expression values for 621 genes related to myeloid cell differentiation and function including cell surface and secreted markers, M1 and M2 markers, metabolism, and IFN genes were selected as a curated input list (Tables 7A-7C). The HTS filtered log 2 expression values for each of these genes in each sample for each tissue type (PBMC-CoV2, Lung-CoV2, and BAL-CoV2) was normalized by the average log 2 expression of FCGR1A, FCGR2A, and FCGR2C in that particular sample as described above. Normalized expression of these genes was used as the input expression data for Monocle. The CellDataSet was created with parameters of lowerDetectionLimit=0.01 and expressionFamily=uninormal( ). Dimensions were reduced using the DDRTree method, and the max_components parameter was set to 2. Cell state was ordered based upon the state corresponding to PBMC-CoV2.
Linear Regression Analysis was performed as follows. Simple linear regression between calculated myeloid subpopulation A1, A2, and A3 GSVA scores and biological functions or signaling pathway GSVA scores was performed in GraphPad Prism Version 8.4.2. In all analyses where pathway genes (e.g. classical complement) were also myeloid cell genes, these genes were removed from the myeloid GSVA score for that comparison and kept in the pathway GSVA score. For each regression analysis, the Goodness of Fit is displayed as the R squared value and the p-value testing the significant of the slope is displayed. All p-values are displayed with 3 digits and rounded-up unless rounding changes significance.
Ingenuity Pathway Analysis (IPA) was performed as follows. The canonical pathway and upstream regulator functions of IPA core expression analysis tool (Qiagen) were used to interrogate DEG lists. Canonical pathways and upstream regulators were considered significant if |Activation Z-Score|≥2 and overlap p-value <0.01. Chemical reagents, chemical toxicants, and endogenous non-mammalian ligands were culled from all upstream regulator analyses.
Drug-Target Matching was performed as follows. IPA-predicted upstream regulators were annotated with respective targeting drugs and compounds to elucidate potential useful therapies in SARS-CoV2. Drugs targeting gene products of interest by both direct and indirect targeting mechanisms were sourced by Combined Lupus Treatment Scoring (CoLTS)-scored drugs (Ref 71), the Connectivity Map via the drug repurposing tool, DrugBank, and literature mining. Similar methods were employed to determine information about drugs and compounds, including mechanism of action and stage of clinical development. The drug repurposing tool was accessed at clue.io/repurposing-app.
Analysis of COVID-19 PBMC, Lung, and BAL DEG profiles via CLUE was performed as follows. DEGs from PBMC-CoV2 vs. PBMC-CTL, Lung-CoV2 vs. Lung-CTL, and BAL-CoV2 vs. PBMC-CoV2 were used as input for the CMaP and LINCS Unified Environment (CLUE) cloud-based connectivity map analysis platform (clue.io/connectopedia/). Top upregulated and downregulated DEGs from each signature as determined by magnitude of log 2 fold change were sequentially entered into CLUE until 150 of each were accepted for analysis to determine drugs, compounds, small molecules, and other perturbagens that mimic or oppose the uploaded COVID-19 gene expression signatures. Resultant drugs and compounds with negative connectivity scores in the [−75, −100] range were analyzed to include results with high confidence of antagonizing COVID-19 gene expression profiles.
Data Availability: the datasets analyzed in this study are available from the China National Center for Bioinformation's National Genomics Data Center, bigd.big.ac.cn/gsa/browse/CRA002390, and in the NCBI GEO repository www.ncbi.nlm.nih.gov/bioproject/PRJNA615032. Additional data generated or analyzed during this study are described by, for example, Daamen et al., “Comprehensive transcriptomic analysis of COVID-19 blood, lung, and airway”, Scientific Reports, Vol. 11, No. 7052 (2021), which is incorporated by reference herein in its entirety.
†: FDA-approved
‡: Drug in development/clinical trials
Further details are described by, for example, Daamen et al., “Comprehensive transcriptomic analysis of COVID-19 blood, lung, and airway”, Scientific Reports, Vol. 11, No. 7052 (2021), which is incorporated by reference herein in its entirety.
While preferred embodiments have been shown and described herein, it will be obvious to those skilled in the art that such embodiments are provided by way of example only. Numerous variations, changes, and substitutions will now occur to those skilled in the art without departing from the scope of the disclosure. It should be understood that various alternatives to the embodiments described herein may be employed in practice. Numerous different combinations of embodiments described herein are possible, and such combinations are considered part of the present disclosure. In addition, all features discussed in connection with any one embodiment herein can be readily adapted for use in other embodiments herein. It is intended that the following claims define the scope of the disclosure and that methods and structures within the scope of these claims and their equivalents be covered thereby.
This application claims the benefit of U.S. Provisional Patent Application No. 63/023,088, filed May 11, 2020, which is incorporated by reference herein in its entirety.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/US2021/031535 | 5/10/2021 | WO |
Number | Date | Country | |
---|---|---|---|
63023088 | May 2020 | US |