Machine Learning Disease Prediction and Treatment Prioritization

BACKGROUND

Machine learning is a computational method capable of harnessing complex data from multiple sources to develop self-trained prediction and analysis tools. When applied to large-scale disease and treatment data, machine learning algorithms may quickly and effectively identify genetic and phenotypic features.

SUMMARY

Analysis by Molecular Endotyping

In an aspect, the present disclosure provides a method of identifying one or more records having a specific phenotype, the method comprising: receiving a plurality of first records, wherein each first record is associated with one or more of a plurality of phenotypes; receiving a plurality of second records, wherein each second record is associated with one or more of the plurality of phenotypes, and wherein the plurality of second records and the plurality of first records are non-overlapping; applying a machine learning algorithm to at least one first record and at least one second record to determine a classifier; receiving a plurality of third records, wherein the third records are distinct from the plurality of first records and the plurality of second records; and applying the classifier to the plurality of third records to identify one or more third records associated with the specific phenotype.
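
By way of non-limiting illustration, the receive, train, and classify steps recited above may be sketched as follows. The sketch assumes a hypothetical record layout of (record identifier, feature vector, phenotype) and substitutes a simple nearest-centroid rule for the machine learning algorithm; neither the layout nor the rule is taken from the disclosure, and the function names are illustrative only.

```python
from collections import defaultdict


def train_centroid_classifier(records):
    """Build a nearest-centroid classifier (a stand-in, not the disclosed algorithm).

    records: list of (record_id, feature_vector, phenotype) tuples.
    """
    sums = {}
    counts = defaultdict(int)
    for _record_id, features, phenotype in records:
        if phenotype not in sums:
            sums[phenotype] = [0.0] * len(features)
        sums[phenotype] = [s + f for s, f in zip(sums[phenotype], features)]
        counts[phenotype] += 1
    centroids = {p: [s / counts[p] for s in sums[p]] for p in sums}

    def classify(features):
        def squared_distance(centroid):
            return sum((a - b) ** 2 for a, b in zip(features, centroid))
        return min(centroids, key=lambda p: squared_distance(centroids[p]))

    return classify


def identify_records(first, second, third, target_phenotype):
    """Train on first + second records, then flag third records with the target phenotype."""
    first_ids = {r[0] for r in first}
    second_ids = {r[0] for r in second}
    third_ids = {r[0] for r in third}
    if first_ids & second_ids:
        raise ValueError("first and second records must be non-overlapping")
    if third_ids & (first_ids | second_ids):
        raise ValueError("third records must be distinct from the training records")
    classify = train_centroid_classifier(first + second)
    return [rid for rid, features in third if classify(features) == target_phenotype]


# Hypothetical toy data: two expression features per record.
first_records = [("a", [5.0, 1.0], "active"), ("b", [4.5, 1.2], "active")]
second_records = [("c", [1.0, 4.0], "inactive"), ("d", [0.8, 4.4], "inactive")]
third_records = [("x", [4.8, 0.9]), ("y", [1.1, 4.1])]

flagged = identify_records(first_records, second_records, third_records, "active")
```

In this toy example only record "x" lies nearer the "active" centroid, so it is the only third record flagged.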

In some embodiments, the elastic generalized linear model classifier employs an elastic penalty of about 0.8 to about 1. In some embodiments, the elastic generalized linear model classifier employs an elastic penalty of at least about 0.8, about 0.825, about 0.85, about 0.875, about 0.9, about 0.925, about 0.95, about 0.975, or about 1. In some embodiments, the elastic generalized linear model classifier employs an elastic penalty of at most about 0.8, about 0.825, about 0.85, about 0.875, about 0.9, about 0.925, about 0.95, about 0.975, or about 1. In some embodiments, the elastic generalized linear model classifier employs an elastic penalty of about 0.8 to about 0.825, about 0.8 to about 0.85, about 0.8 to about 0.875, about 0.8 to about 0.9, about 0.8 to about 0.925, about 0.8 to about 0.95, about 0.8 to about 0.975, about 0.8 to about 1, about 0.825 to about 0.85, about 0.825 to about 0.875, about 0.825 to about 0.9, about 0.825 to about 0.925, about 0.825 to about 0.95, about 0.825 to about 0.975, about 0.825 to about 1, about 0.85 to about 0.875, about 0.85 to about 0.9, about 0.85 to about 0.925, about 0.85 to about 0.95, about 0.85 to about 0.975, about 0.85 to about 1, about 0.875 to about 0.9, about 0.875 to about 0.925, about 0.875 to about 0.95, about 0.875 to about 0.975, about 0.875 to about 1, about 0.9 to about 0.925, about 0.9 to about 0.95, about 0.9 to about 0.975, about 0.9 to about 1, about 0.925 to about 0.95, about 0.925 to about 0.975, about 0.925 to about 1, about 0.95 to about 0.975, about 0.95 to about 1, or about 0.975 to about 1. In some embodiments, the elastic generalized linear model classifier employs an elastic penalty of about 0.8, about 0.825, about 0.85, about 0.875, about 0.9, about 0.925, about 0.95, about 0.975, or about 1.
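
By way of non-limiting illustration, and assuming that the "elastic penalty" corresponds to the mixing parameter of an elastic net (often written alpha, where 1 is a pure L1 or lasso penalty and 0 is a pure L2 or ridge penalty), the penalty term may be sketched as follows. The disclosure does not define the parameter, so this mapping, the overall strength `lam`, and the function name are assumptions.

```python
def elastic_net_penalty(coefficients, alpha, lam=1.0):
    """Elastic net penalty: lam * (alpha * L1 + (1 - alpha) / 2 * L2^2).

    alpha is the assumed "elastic penalty" mixing value (about 0.8 to about 1
    in the ranges above); lam is a hypothetical overall penalty strength.
    """
    l1 = sum(abs(b) for b in coefficients)
    l2_squared = sum(b * b for b in coefficients)
    return lam * (alpha * l1 + (1.0 - alpha) / 2.0 * l2_squared)


example_coefficients = [1.0, -2.0, 0.0]
penalty_lasso = elastic_net_penalty(example_coefficients, alpha=1.0)
penalty_mixed = elastic_net_penalty(example_coefficients, alpha=0.9)
```

Values near 1 weight the penalty toward the L1 term, which tends to drive more coefficients to exactly zero and so favors smaller feature sets.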

In some embodiments, the k-nearest neighbors classifier employs a K value of the size of the plurality of distinct first data sets, wherein k is about 1 to about 20. In some embodiments, the k-nearest neighbors classifier employs a K value of the size of the plurality of distinct first data sets, wherein k is at least about 1, about 2, about 3, about 4, about 5, about 6, about 8, about 10, about 12, about 14, about 16, or about 20. In some embodiments, the k-nearest neighbors classifier employs a K value of the size of the plurality of distinct first data sets, wherein k is at most about 1, about 2, about 3, about 4, about 5, about 6, about 8, about 10, about 12, about 14, about 16, or about 20. In some embodiments, the k-nearest neighbors classifier employs a K value of the size of the plurality of distinct first data sets, wherein k is about 1 to about 2, about 1 to about 3, about 1 to about 4, about 1 to about 5, about 1 to about 6, about 1 to about 8, about 1 to about 10, about 1 to about 12, about 1 to about 14, about 1 to about 16, about 1 to about 20, about 2 to about 3, about 2 to about 4, about 2 to about 5, about 2 to about 6, about 2 to about 8, about 2 to about 10, about 2 to about 12, about 2 to about 14, about 2 to about 16, about 2 to about 20, about 3 to about 4, about 3 to about 5, about 3 to about 6, about 3 to about 8, about 3 to about 10, about 3 to about 12, about 3 to about 14, about 3 to about 16, about 3 to about 20, about 4 to about 5, about 4 to about 6, about 4 to about 8, about 4 to about 10, about 4 to about 12, about 4 to about 14, about 4 to about 16, about 4 to about 20, about 5 to about 6, about 5 to about 8, about 5 to about 10, about 5 to about 12, about 5 to about 14, about 5 to about 16, about 5 to about 20, about 6 to about 8, about 6 to about 10, about 6 to about 12, about 6 to about 14, about 6 to about 16, about 6 to about 20, about 8 to about 10, about 8 to about 12, about 8 to about 14, about 8 to about 16, about 8 to about 20, about 10 to about 12, about 10 to about 14, about 10 to about 16, about 10 to about 20, about 12 to about 14, about 12 to about 16, about 12 to about 20, about 14 to about 16, about 14 to about 20, or about 16 to about 20. In some embodiments, the k-nearest neighbors classifier employs a K value of the size of the plurality of distinct first data sets, wherein k is about 1, about 2, about 3, about 4, about 5, about 6, about 8, about 10, about 12, about 14, about 16, or about 20.

In some embodiments, the K-value of the random forest classifier is incremented by 1 if the K-value is an even number. In some embodiments, applying a machine learning algorithm to the third data set comprises applying a machine learning algorithm to a plurality of unique third data sets.
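
By way of non-limiting illustration, selection of a K-value and its adjustment to an odd number may be sketched as follows. The sketch assumes the K-value is taken as a percentage of the training set size and applies the parity rule to a k-nearest neighbors majority vote, where an odd K avoids tied votes between two classes; the text above recites the rule for the random forest classifier, so this pairing, the 5% default, and the function names are assumptions.

```python
import math
from collections import Counter


def choose_k(n_training, percent=5):
    """Pick K as a percentage of the training set size, forced to be odd."""
    k = max(1, (n_training * percent) // 100)
    if k % 2 == 0:
        k += 1  # increment an even K by 1 so a two-class vote cannot tie
    return k


def knn_predict(training, query, k):
    """Majority vote among the k training points nearest to the query.

    training: list of (feature_vector, label) pairs.
    """
    by_distance = sorted(training, key=lambda item: math.dist(item[0], query))
    votes = Counter(label for _features, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy training set with two well-separated classes.
toy_training = [
    ([0.0, 0.0], "inactive"), ([0.0, 1.0], "inactive"), ([1.0, 0.0], "inactive"),
    ([5.0, 5.0], "active"), ([5.0, 6.0], "active"), ([6.0, 5.0], "active"),
]
toy_label = knn_predict(toy_training, [5.2, 5.1], k=3)
```

With these assumptions, a training set of 80 records gives an initial K of 4, which is incremented to 5.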

In some embodiments, the classifier identifies said one or more third records associated with the specific phenotype with an accuracy of about 70% to about 100%. In some embodiments, the classifier identifies said one or more third records associated with the specific phenotype with an accuracy of at least about 70%, about 75%, about 80%, about 85%, about 90%, about 95%, or about 100%. In some embodiments, the classifier identifies said one or more third records associated with the specific phenotype with an accuracy of at most about 70%, about 75%, about 80%, about 85%, about 90%, about 95%, or about 100%. In some embodiments, the classifier identifies said one or more third records associated with the specific phenotype with an accuracy of about 70% to about 75%, about 70% to about 80%, about 70% to about 85%, about 70% to about 90%, about 70% to about 95%, about 70% to about 100%, about 75% to about 80%, about 75% to about 85%, about 75% to about 90%, about 75% to about 95%, about 75% to about 100%, about 80% to about 85%, about 80% to about 90%, about 80% to about 95%, about 80% to about 100%, about 85% to about 90%, about 85% to about 95%, about 85% to about 100%, about 90% to about 95%, about 90% to about 100%, or about 95% to about 100%. In some embodiments, the classifier identifies said one or more third records associated with the specific phenotype with an accuracy of about 70%, about 75%, about 80%, about 85%, about 90%, about 95%, or about 100%.

In some embodiments, the classifier herein enables a specific phenotype association sensitivity of about 70% to about 100%. In some embodiments, the classifier herein enables a specific phenotype association sensitivity of at least 70%, about 75%, about 80%, about 85%, about 90%, about 95%, or about 100%. In some embodiments, the classifier herein enables a specific phenotype association sensitivity of at most 70%, about 75%, about 80%, about 85%, about 90%, about 95%, or about 100%. In some embodiments, the classifier herein enables a specific phenotype association sensitivity of about 70% to about 75%, about 70% to about 80%, about 70% to about 85%, about 70% to about 90%, about 70% to about 95%, about 70% to about 100%, about 75% to about 80%, about 75% to about 85%, about 75% to about 90%, about 75% to about 95%, about 75% to about 100%, about 80% to about 85%, about 80% to about 90%, about 80% to about 95%, about 80% to about 100%, about 85% to about 90%, about 85% to about 95%, about 85% to about 100%, about 90% to about 95%, about 90% to about 100%, or about 95% to about 100%. In some embodiments, the classifier herein enables a specific phenotype association sensitivity of about 70%, about 75%, about 80%, about 85%, about 90%, about 95%, or about 100%.

In some embodiments, the classifier herein enables a specific phenotype association specificity of about 70% to about 100%. In some embodiments, the classifier herein enables a specific phenotype association specificity of at least 70%, about 75%, about 80%, about 85%, about 90%, about 95%, or about 100%. In some embodiments, the classifier herein enables a specific phenotype association specificity of at most 70%, about 75%, about 80%, about 85%, about 90%, about 95%, or about 100%. In some embodiments, the classifier herein enables a specific phenotype association specificity of about 70% to about 75%, about 70% to about 80%, about 70% to about 85%, about 70% to about 90%, about 70% to about 95%, about 70% to about 100%, about 75% to about 80%, about 75% to about 85%, about 75% to about 90%, about 75% to about 95%, about 75% to about 100%, about 80% to about 85%, about 80% to about 90%, about 80% to about 95%, about 80% to about 100%, about 85% to about 90%, about 85% to about 95%, about 85% to about 100%, about 90% to about 95%, about 90% to about 100%, or about 95% to about 100%. In some embodiments, the classifier herein enables a specific phenotype association specificity of about 70%, about 75%, about 80%, about 85%, about 90%, about 95%, or about 100%.
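
By way of non-limiting illustration, the accuracy, sensitivity, and specificity figures recited above may be computed from predicted and known phenotype labels as sketched below. The label encoding (1 for the specific phenotype, 0 otherwise) and the function names are assumptions made for the example.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true positives, false positives, true negatives, false negatives."""
    tp = fp = tn = fn = 0
    for truth, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def accuracy_sensitivity_specificity(y_true, y_pred, positive=1):
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    sensitivity = tp / (tp + fn)  # fraction of true positives recovered
    specificity = tn / (tn + fp)  # fraction of true negatives recovered
    return accuracy, sensitivity, specificity


# Hypothetical labels for ten third records.
known = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
acc, sens, spec = accuracy_sensitivity_specificity(known, predicted)
```

For these ten hypothetical records the classifier reaches an accuracy of 80%, a sensitivity of 75%, and a specificity of about 83%.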

In some embodiments, the method further comprises filtering the first records, the second records, or both. In some embodiments, the filtering comprises removing outliers, removing background noise, removing data without annotation data, normalizing, scaling, variance correcting, Weighted Gene Co-expression Network Analysis, enrichment analysis, dimensionality reduction, or any combination thereof. In some embodiments, the normalizing is performed by Robust Multi-Array Analysis (RMA), Guanine Cytosine Robust Multi-Array Analysis (GCRMA), Linear Models for Microarray Data, variance stabilizing transformation (VST), normal-exponential quantile correction (NEQC), or any combination thereof. In some embodiments, the variance correction comprises employing a local empirical Bayesian shrinkage, adjusting the p-values for multiple hypothesis testing using the Benjamini-Hochberg correction, and removing all data with a set false discovery rate.

In some embodiments, the false discovery rate is about 0.000001 to about 0.2. In some embodiments, the false discovery rate is at least about 0.000001. In some embodiments, the false discovery rate is at most about 0.2. In some embodiments, the false discovery rate is about 0.000001 to about 0.00001, about 0.000001 to about 0.00005, about 0.000001 to about 0.0001, about 0.000001 to about 0.0005, about 0.000001 to about 0.001, about 0.000001 to about 0.005, about 0.000001 to about 0.01, about 0.000001 to about 0.05, about 0.000001 to about 0.2, about 0.00001 to about 0.00005, about 0.00001 to about 0.0001, about 0.00001 to about 0.0005, about 0.00001 to about 0.001, about 0.00001 to about 0.005, about 0.00001 to about 0.01, about 0.00001 to about 0.05, about 0.00001 to about 0.2, about 0.00005 to about 0.0001, about 0.00005 to about 0.0005, about 0.00005 to about 0.001, about 0.00005 to about 0.005, about 0.00005 to about 0.01, about 0.00005 to about 0.05, about 0.00005 to about 0.2, about 0.0001 to about 0.0005, about 0.0001 to about 0.001, about 0.0001 to about 0.005, about 0.0001 to about 0.01, about 0.0001 to about 0.05, about 0.0001 to about 0.2, about 0.0005 to about 0.001, about 0.0005 to about 0.005, about 0.0005 to about 0.01, about 0.0005 to about 0.05, about 0.0005 to about 0.2, about 0.001 to about 0.005, about 0.001 to about 0.01, about 0.001 to about 0.05, about 0.001 to about 0.2, about 0.005 to about 0.01, about 0.005 to about 0.05, about 0.005 to about 0.2, about 0.01 to about 0.05, about 0.01 to about 0.2, or about 0.05 to about 0.2. In some embodiments, the false discovery rate is about 0.000001, about 0.00001, about 0.00005, about 0.0001, about 0.0005, about 0.001, about 0.005, about 0.01, about 0.05, or about 0.2.
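
By way of non-limiting illustration, a Benjamini-Hochberg adjustment followed by filtering at a set false discovery rate may be sketched as follows. The sketch assumes that features whose adjusted p-value is at or below the threshold are retained; which side of the threshold is kept, the feature names, and the function names are assumptions for the example.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


def filter_by_fdr(features, p_values, fdr=0.2):
    """Keep features whose adjusted p-value is at or below the set FDR (assumed direction)."""
    adjusted = benjamini_hochberg(p_values)
    return [f for f, q in zip(features, adjusted) if q <= fdr]


# Hypothetical features and raw p-values.
raw_p = [0.01, 0.04, 0.03, 0.005]
adjusted_p = benjamini_hochberg(raw_p)
kept = filter_by_fdr(["g1", "g2", "g3", "g4"], raw_p, fdr=0.03)
```

Lowering the set false discovery rate retains fewer features: at a threshold of 0.03 only the two features with the smallest adjusted p-values in the toy data remain.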

In some embodiments, the Weighted Gene Co-expression Network Analysis comprises calculating a topology matrix, clustering the data based on the topology matrix, and correlating module eigenvalues for traits on a linear scale by Pearson correlation, for nonparametric traits by Spearman correlation, and for dichotomous traits by point-biserial correlation or t-test. The Pearson correlation, or Product Moment Correlation Coefficient (PMCC), is a number between −1 and 1 that indicates the extent to which two variables are linearly related. The Spearman correlation is a nonparametric measure of rank correlation, that is, of the statistical dependence between the rankings of two variables.
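
By way of non-limiting illustration, the three correlation measures named above may be sketched as follows. The sketch treats each module as a single summary value per sample and each trait as one value per sample; the Spearman correlation is computed as the Pearson correlation of average ranks, and the point-biserial correlation as the Pearson correlation against a trait coded 0 or 1. The function names and toy values are assumptions for the example.

```python
import math


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


def average_ranks(values):
    """Rank values from 1, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def point_biserial(binary_trait, y):
    """Point-biserial correlation: Pearson correlation with a 0/1 coded trait."""
    return pearson([float(b) for b in binary_trait], y)


# Hypothetical module summary values and traits for five samples.
module_values = [1.0, 2.0, 3.0, 4.0, 5.0]
monotone_trait = [1.0, 4.0, 9.0, 16.0, 25.0]
r_linear = pearson(module_values, monotone_trait)
r_rank = spearman(module_values, monotone_trait)
```

Because the toy trait rises monotonically but not linearly with the module value, the rank correlation equals 1 while the linear correlation is slightly below 1.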

In some embodiments, the one or more records having a specific phenotype correspond to one or more subjects, and the method further comprises identifying the one or more subjects as (i) having a diagnosis of a lupus condition, (ii) having a prognosis of a lupus condition, (iii) being suitable or not suitable for enrollment in a clinical trial for a lupus condition, (iv) being suitable or not suitable for being administered a therapeutic regimen configured to treat a lupus condition, or (v) having an efficacy or not having an efficacy of a therapeutic regimen configured to treat a lupus condition, based at least in part on the specific phenotype corresponding to the one or more subjects.

In another aspect, the present disclosure provides a non-transitory computer-readable storage medium encoded with a computer program including instructions executable by a processor to create an application for identifying one or more records having a specific phenotype, the application comprising: a first receiving module receiving a plurality of first records, wherein each first record is associated with one or more of a plurality of phenotypes; a second receiving module receiving a plurality of second records, wherein each second record is associated with one or more of the plurality of phenotypes, and wherein the plurality of second records and the plurality of first records are non-overlapping; a machine learning module applying a machine learning algorithm to at least one first record and at least one second record to determine a classifier; a third receiving module receiving a plurality of third records, wherein the third records are distinct from the plurality of first records and the plurality of second records; and a classifying module applying the classifier to the plurality of third records to identify one or more third records associated with the specific phenotype.

In some embodiments, the first records and the second records comprise nucleic acid sequencing data, transcriptome data, genome data, epigenome data, proteome data, metabolome data, virome data, methylome data, lipidomic data, lineage-ome data, nucleosomal occupancy data, a genetic variant, a gene fusion, an insertion or deletion (indel), or any combination thereof. In some embodiments, the first records and the second records are in different formats. In some embodiments, the first records and the second records are from different sources, different studies, or both. In some embodiments, the phenotype comprises a disease state, an organ involvement, a medication response, or any combination thereof. In some embodiments, the classifier comprises an elastic generalized linear model classifier, a k-nearest neighbors classifier, a random forest classifier, or any combination thereof. In some embodiments, the elastic generalized linear model classifier employs an elastic penalty of about 0.9. In some embodiments, the k-nearest neighbors classifier employs a K-value of about 5% of the size of the plurality of distinct first data sets. In some embodiments, the K-value of the random forest classifier is incremented by 1 if the K-value is an even number. In some embodiments, applying a machine learning algorithm to the third data set comprises applying a machine learning algorithm to a plurality of unique third data sets. In some embodiments, said classifier identifies said one or more third records associated with the specific phenotype with an accuracy of at least about 70%. In some embodiments, the method further comprises filtering the first records, the second records, or both.
In some embodiments, the filtering comprises removing outliers, removing background noise, removing data without annotation data, normalizing, scaling, variance correcting, Weighted Gene Co-expression Network Analysis, enrichment analysis, dimensionality reduction, or any combination thereof. In some embodiments, the normalizing is performed by Robust Multi-Array Analysis (RMA), Guanine Cytosine Robust Multi-Array Analysis (GCRMA), Linear Models for Microarray Data, variance stabilizing transformation (VST), normal-exponential quantile correction (NEQC), or any combination thereof. In some embodiments, the variance correction comprises employing a local empirical Bayesian shrinkage, adjusting the p-values for multiple hypothesis testing using the Benjamini-Hochberg correction, and removing all data with a false discovery rate of less than 0.2. In some embodiments, the Weighted Gene Co-expression Network Analysis comprises calculating a topology matrix, clustering the data based on the topology matrix, and correlating module eigenvalues for traits on a linear scale by Pearson correlation, for nonparametric traits by Spearman correlation, and for dichotomous traits by point-biserial correlation or t-test.

In another aspect, the present disclosure provides a method for identifying a disease state or a susceptibility thereof of a subject, comprising: (a) using an assay to process a biological sample derived from the subject to generate a quantitative measure of each of a plurality of disease-associated genomic loci, wherein the plurality of disease-associated genomic loci comprises at least 5 genes associated with a module of Table 8; (b) processing the dataset to identify the disease state or the susceptibility thereof of the subject at an accuracy of at least about 70%; and (c) electronically outputting a report indicative of the disease state or the susceptibility thereof of the subject.

In some embodiments, the plurality of quantitative measures comprises gene expression measurements. In some embodiments, the disease state comprises an active lupus condition or an inactive lupus condition. In some embodiments, the lupus condition is SLE. In some embodiments, the plurality of disease-associated genomic loci comprises one or more genes selected from the group consisting of: RAB4B, ADAR, MRPL44, CDCA5, MYD88, SNN, BRD3, C7orf43, CDC20, SP1, POFUT1, SAMD4B, ATP6V1B2, TSPAN9, SP140, STK26, IRF4, LCP1, LMO2, SF3B4, HIST2H2AA3, CITED4, ADAM8, TICAM1, and HSD17B7.

In another aspect, the present disclosure provides a method for identifying an immunological state of a subject, comprising: (a) using an assay to process a biological sample derived from the subject to generate a quantitative measure of each of a plurality of genomic loci, wherein the plurality of genomic loci comprises at least 5 genes associated with a module of Table 8; (b) processing the dataset to identify the immunological state of the subject at an accuracy of at least about 70%; and (c) electronically outputting a report indicative of the immunological state of the subject.

In some embodiments, the plurality of quantitative measures comprises gene expression measurements. In some embodiments, the immunological state comprises an active or inactive state of each of one or more of the plurality of genomic loci. In some embodiments, the plurality of genomic loci comprises one or more genes selected from the group consisting of: RAB4B, ADAR, MRPL44, CDCA5, MYD88, SNN, BRD3, C7orf43, CDC20, SP1, POFUT1, SAMD4B, ATP6V1B2, TSPAN9, SP140, STK26, IRF4, LCP1, LMO2, SF3B4, HIST2H2AA3, CITED4, ADAM8, TICAM1, and HSD17B7.

In another aspect, the present disclosure provides a method for identifying a disease state or a susceptibility thereof of a subject, comprising: (a) using an assay to process a biological sample derived from the subject to generate a quantitative measure of each of a plurality of disease-associated genomic loci, wherein the plurality of disease-associated genomic loci comprises one or more genes associated with a gene cluster of Table 1 to Table 72C; (b) processing the dataset to identify the disease state or the susceptibility thereof of the subject at an accuracy of at least about 70%; and (c) electronically outputting a report indicative of the disease state or the susceptibility thereof of the subject.

In some embodiments, the plurality of quantitative measures comprises gene expression measurements. In some embodiments, the disease state comprises an active lupus condition or an inactive lupus condition. In some embodiments, the lupus condition is systemic lupus erythematosus (SLE), discoid lupus erythematosus (DLE), or lupus nephritis (LN). In some embodiments, the plurality of disease-associated genomic loci comprises 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, or more than 50 genes associated with the gene cluster.

In another aspect, the present disclosure provides a method for identifying an immunological state of a subject, comprising: (a) using an assay to process a biological sample derived from the subject to generate a quantitative measure of each of a plurality of disease-associated genomic loci, wherein the plurality of disease-associated genomic loci comprises one or more genes associated with a gene cluster of Table 1 to Table 72C; (b) processing the dataset to identify the immunological state of the subject at an accuracy of at least about 70%; and (c) electronically outputting a report indicative of the immunological state of the subject.

In some embodiments, the plurality of quantitative measures comprises gene expression measurements. In some embodiments, the immunological state comprises an active lupus condition or an inactive lupus condition. In some embodiments, the lupus condition is systemic lupus erythematosus (SLE), discoid lupus erythematosus (DLE), or lupus nephritis (LN). In some embodiments, the plurality of disease-associated genomic loci comprises 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, or more than 50 genes associated with the gene cluster.

In another aspect, the present disclosure provides a method for identifying an immunological state of a subject, comprising: (a) using an assay to process a biological sample derived from the subject to generate a quantitative measure of each of a plurality of disease-associated genomic loci, wherein the plurality of disease-associated genomic loci comprises one or more genes associated with a pathway of Table 1 to Table 72C; (b) processing the dataset to identify the immunological state of the subject at an accuracy of at least about 70%; and (c) electronically outputting a report indicative of the immunological state of the subject.

Interferon Profiling of Lupus Conditions

In another aspect, the present disclosure provides a method for identifying a lupus condition of a subject, comprising: (a) assaying a biological sample of the subject to generate a dataset comprising gene expression data; (b) processing the dataset at each of a plurality of genes to determine quantitative measures of each of the plurality of genes, wherein the plurality of genes comprises genes induced by a plurality of interferons, thereby producing an interferon signature of the biological sample of the subject; (c) comparing the interferon signature with one or more reference interferon signatures, wherein the comparing comprises, for at least one of the plurality of genes, comparing the quantitative measure of the gene of the interferon signature with corresponding quantitative measures of the gene of the one or more reference interferon signatures; and (d) based at least in part on the comparison in (c), identifying the lupus condition of the subject.

In some embodiments, the lupus condition is selected from the group consisting of: systemic lupus erythematosus (SLE), discoid lupus erythematosus (DLE), and lupus nephritis (LN). In some embodiments, the biological sample is selected from the group consisting of a whole blood (WB) sample, a peripheral blood mononuclear cell (PBMC) sample, a tissue sample, and a purified cell sample. In some embodiments, the tissue sample is selected from the group consisting of skin tissue, synovium tissue, and kidney tissue. In some embodiments, the kidney tissue is selected from the group consisting of glomerulus (Glom) and tubulointerstitium (TI). In some embodiments, the purified cell sample is selected from the group consisting of purified CD4⁺ T cells, purified CD19⁺ B cells, and purified CD14⁺ monocytes.

In some embodiments, the method further comprises purifying a whole blood sample of the subject to obtain the purified cell sample. In some embodiments, assaying the biological sample comprises (i) using a microarray to generate the dataset comprising the gene expression data, (ii) sequencing the biological sample to generate the dataset comprising the gene expression data, or (iii) performing quantitative polymerase chain reaction (qPCR) of the biological sample to generate the dataset comprising the gene expression data.

In some embodiments, the one or more genes induced by in vitro stimulation of PBMC are selected from the genes listed in Table 21. In some embodiments, the one or more genes induced by in vitro stimulation of PBMC are selected from the genes listed in Table 22. In some embodiments, the one or more genes induced by in vitro stimulation of PBMC are selected from the genes listed in Table 23. In some embodiments, the plurality of genes comprises one or more genes induced by in vitro stimulation of PBMC by IL12 treatment or TNF treatment. In some embodiments, the one or more genes induced by in vitro stimulation of PBMC are selected from the genes listed in Table 24. In some embodiments, the one or more genes induced by in vitro stimulation of PBMC are selected from the genes listed in Table 25. In some embodiments, the plurality of genes comprises one or more genes induced in vivo in IFNA2-treated HepC patients and/or IFNB1-treated MS patients. In some embodiments, the one or more genes induced in vivo in IFNA2-treated HepC patients and/or IFNB1-treated MS patients are selected from the genes listed in Table 32.

In some embodiments, (c) further comprises, for the at least one of the plurality of genes, determining a difference between the quantitative measure of the gene of the interferon signature and the corresponding quantitative measures of the gene of the one or more reference interferon signatures. In some embodiments, (d) further comprises identifying the lupus condition of the subject when the difference satisfies a pre-determined criterion. In some embodiments, (c) further comprises, for the at least one of the plurality of genes, determining a Z-score of the quantitative measure of the gene of the interferon signature relative to the corresponding quantitative measures of the gene of the one or more reference interferon signatures. In some embodiments, (d) further comprises identifying the lupus condition of the subject when the Z-score satisfies a pre-determined criterion. In some embodiments, (d) further comprises identifying the lupus condition of the subject when the Z-score is at least 2, and identifying an absence of the lupus condition of the subject when the Z-score is less than 2.
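
By way of non-limiting illustration, the Z-score comparison and the threshold of 2 recited above may be sketched for a single gene as follows. The sketch assumes that the reference interferon signatures supply several measurements of the same gene and that the sample standard deviation is used; how Z-scores are combined across genes is not specified above and is not attempted here. The function names and values are hypothetical.

```python
import statistics


def z_score(value, reference_values):
    """Z-score of one measurement relative to reference measurements of the same gene."""
    mean = statistics.mean(reference_values)
    spread = statistics.stdev(reference_values)  # sample standard deviation (assumed)
    return (value - mean) / spread


def indicates_condition(value, reference_values, threshold=2.0):
    """True when the Z-score is at least the threshold, False otherwise."""
    return z_score(value, reference_values) >= threshold


# Hypothetical expression values of one interferon-induced gene.
reference_expression = [10.0, 12.0, 11.0, 9.0, 13.0]
elevated = indicates_condition(15.0, reference_expression)
typical = indicates_condition(12.0, reference_expression)
```

In the toy data a measurement of 15.0 lies about 2.5 standard deviations above the reference mean and meets the criterion, while a measurement of 12.0 does not.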

In some embodiments, the method further comprises identifying the lupus condition of the subject at a sensitivity of at least about 70%. In some embodiments, the method further comprises identifying the lupus condition of the subject at a sensitivity of at least about 80%. In some embodiments, the method further comprises identifying the lupus condition of the subject at a sensitivity of at least about 90%.

In some embodiments, the method further comprises identifying the lupus condition of the subject at a specificity of at least about 70%. In some embodiments, the method further comprises identifying the lupus condition of the subject at a specificity of at least about 80%. In some embodiments, the method further comprises identifying the lupus condition of the subject at a specificity of at least about 90%.

In some embodiments, the method further comprises identifying the lupus condition of the subject at a positive predictive value (PPV) of at least about 70%. In some embodiments, the method further comprises identifying the lupus condition of the subject at a positive predictive value (PPV) of at least about 80%. In some embodiments, the method further comprises identifying the lupus condition of the subject at a positive predictive value (PPV) of at least about 90%.

In some embodiments, the method further comprises identifying the lupus condition of the subject at a negative predictive value (NPV) of at least about 70%. In some embodiments, the method further comprises identifying the lupus condition of the subject at a negative predictive value (NPV) of at least about 80%. In some embodiments, the method further comprises identifying the lupus condition of the subject at a negative predictive value (NPV) of at least about 90%.

In some embodiments, the method further comprises identifying the lupus condition of the subject with an Area Under Curve (AUC) of at least about 0.70. In some embodiments, the method further comprises identifying the lupus condition of the subject with an Area Under Curve (AUC) of at least about 0.80. In some embodiments, the method further comprises identifying the lupus condition of the subject with an Area Under Curve (AUC) of at least about 0.90.
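
By way of non-limiting illustration, the positive predictive value, negative predictive value, and Area Under Curve recited above may be computed as sketched below. The AUC is computed here as the fraction of (positive, negative) score pairs in which the positive sample scores higher, with ties counted as one half; the scores, counts, and function names are hypothetical.

```python
def ppv_npv(tp, fp, tn, fn):
    """Positive and negative predictive values from confusion-matrix counts."""
    ppv = tp / (tp + fp)  # fraction of positive calls that are correct
    npv = tn / (tn + fn)  # fraction of negative calls that are correct
    return ppv, npv


def area_under_curve(positive_scores, negative_scores):
    """AUC as the probability that a positive sample outscores a negative one."""
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Hypothetical classifier scores for subjects with and without the condition.
scores_with_condition = [0.9, 0.8, 0.4]
scores_without_condition = [0.7, 0.3, 0.2]
auc_value = area_under_curve(scores_with_condition, scores_without_condition)
ppv_value, npv_value = ppv_npv(tp=3, fp=1, tn=5, fn=1)
```

Unlike sensitivity and specificity, the predictive values depend on how common the condition is in the tested population, so the same classifier may yield different PPV and NPV in different cohorts.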

In some embodiments, the method further comprises determining or predicting an active or inactive state of the identified lupus condition of the subject. In some embodiments, (d) further comprises identifying the lupus condition of the subject based at least in part on a SLEDAI (systemic lupus erythematosus disease activity index) score of the subject. In some embodiments, the subject is asymptomatic for one or more lupus conditions selected from the group consisting of: systemic lupus erythematosus (SLE), discoid lupus erythematosus (DLE), and lupus nephritis (LN).

In some embodiments, the method further comprises applying a trained algorithm to the interferon signature to identify the lupus condition of the subject. In some embodiments, the trained algorithm is trained using a first set of independent training samples associated with a presence of the lupus condition and a second set of independent training samples associated with an absence of the lupus condition. In some embodiments, the method further comprises using the trained algorithm to process a set of clinical health data of the subject to identify the lupus condition. In some embodiments, the trained algorithm comprises a supervised machine learning algorithm. In some embodiments, the supervised machine learning algorithm comprises a deep learning algorithm, a support vector machine (SVM), a neural network, or a Random Forest.

In some embodiments, (a) comprises (i) subjecting the biological sample to conditions that are sufficient to isolate, enrich, or extract a plurality of nucleic acid molecules; and (ii) analyzing the plurality of nucleic acid molecules to generate the dataset comprising the gene expression data. In some embodiments, the method further comprises using probes configured to selectively enrich the plurality of nucleic acid molecules corresponding to a panel of one or more genomic loci. In some embodiments, the probes are nucleic acid primers. In some embodiments, the probes have sequence complementarity with nucleic acid sequences of the panel of the one or more genomic loci. In some embodiments, the panel of the one or more genomic loci comprises genomic loci corresponding to the plurality of genes. In some embodiments, the panel of the one or more genomic loci comprises at least 5 distinct genomic loci. In some embodiments, the panel of the one or more genomic loci comprises at least 10 distinct genomic loci.

In some embodiments, the method further comprises (e) assaying a second biological sample of the subject to generate a second dataset comprising gene expression data; (f) processing the second dataset at each of the plurality of genes to determine second quantitative measures of each of the plurality of genes, thereby producing a second interferon signature of the second biological sample of the subject; (g) comparing the second interferon signature with one or more reference interferon signatures, wherein the comparing comprises, for at least one of the plurality of genes, comparing the quantitative measure of the gene of the second interferon signature with corresponding quantitative measures of the gene of the one or more reference interferon signatures; and (h) based at least in part on the comparison in (g), identifying the lupus condition of the subject.

In some embodiments, the biological sample and the second biological sample comprise two different sample types selected from the group consisting of a whole blood (WB) sample, a PBMC sample, a skin tissue sample, a synovium tissue sample, a kidney tissue sample comprising glomerulus (Glom), a kidney tissue sample comprising tubulointerstitium (TI), a purified CD4⁺ T cell sample, a purified CD19⁺ B cell sample, and a purified CD14⁺ monocyte sample.

In some embodiments, the method further comprises determining a likelihood of the identification of the lupus condition of the subject. In some embodiments, the method further comprises providing a therapeutic intervention for the lupus condition of the subject.

In some embodiments, the method further comprises monitoring the lupus condition of the subject, wherein the monitoring comprises assessing the lupus condition of the subject at a plurality of time points, wherein the assessing is based at least on the lupus condition identified in (d) at each of the plurality of time points. In some embodiments, a difference in the assessment of the lupus condition of the subject among the plurality of time points is indicative of one or more clinical indications selected from the group consisting of (i) a diagnosis of the lupus condition of the subject, (ii) a prognosis of the lupus condition of the subject, and (iii) an efficacy or non-efficacy of a course of treatment for treating the lupus condition of the subject.
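A minimal sketch of the monitoring step, assuming each assessment is reduced to a single numeric score recorded with its date. The dates, scores, and the function name are hypothetical, and the sketch only computes the differences between consecutive time points; interpreting a difference as a diagnosis, prognosis, or treatment effect is left to the criteria of the particular embodiment.

```python
from datetime import date

def assessment_changes(assessments):
    """Differences between consecutive assessments, ordered by date.

    `assessments` is a list of (date, score) pairs; the result is a list of
    (earlier_date, later_date, later_score - earlier_score) tuples.
    """
    ordered = sorted(assessments, key=lambda a: a[0])
    return [
        (earlier[0], later[0], later[1] - earlier[1])
        for earlier, later in zip(ordered, ordered[1:])
    ]

# Hypothetical assessment scores at three time points (deliberately unsorted).
visits = [
    (date(2020, 4, 1), 2.1),
    (date(2020, 1, 1), 2.8),
    (date(2020, 7, 1), 1.0),
]

for start, end, delta in assessment_changes(visits):
    print(f"{start} -> {end}: change of {delta:+.2f}")
```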

In some embodiments, the one or more reference interferon signatures are generated by: assaying a biological sample of one or more patients with dermatomyositis to generate a reference dataset comprising gene expression data; and processing the reference dataset at each of the plurality of genes to determine quantitative measures of each of the plurality of genes.

In another aspect, the present disclosure provides a computer system for identifying a lupus condition of a subject, comprising: a database that is configured to store a dataset comprising gene expression data, wherein the gene expression data is obtained by assaying a biological sample of the subject; and one or more computer processors operatively coupled to the database, wherein the one or more computer processors are individually or collectively programmed to: (i) process the dataset at each of a plurality of genes to determine quantitative measures of each of the plurality of genes, wherein the plurality of genes comprises genes induced by a plurality of interferons, thereby producing an interferon signature of the biological sample of the subject; (ii) compare the interferon signature with one or more reference interferon signatures, wherein the comparing comprises, for at least one of the plurality of genes, comparing the quantitative measure of the gene of the interferon signature with corresponding quantitative measures of the gene of the one or more reference interferon signatures; and (iii) based at least in part on the comparison in (ii), identify the lupus condition of the subject.

In some embodiments, the computer system further comprises an electronic display operatively coupled to the one or more computer processors, wherein the electronic display comprises a graphical user interface that is configured to display the report.

In another aspect, the present disclosure provides a non-transitory computer readable medium comprising machine-executable code that, upon execution by one or more computer processors, implements a method for identifying a lupus condition of a subject, the method comprising: (a) assaying a biological sample of the subject to generate a dataset comprising gene expression data; (b) processing the dataset at each of a plurality of genes to determine quantitative measures of each of the plurality of genes, wherein the plurality of genes comprises genes induced by a plurality of interferons, thereby producing an interferon signature of the biological sample of the subject; (c) comparing the interferon signature with one or more reference interferon signatures, wherein the comparing comprises, for at least one of the plurality of genes, comparing the quantitative measure of the gene of the interferon signature with corresponding quantitative measures of the gene of the one or more reference interferon signatures; and (d) based at least in part on the comparison in (c), identifying the lupus condition of the subject.

In another aspect, the present disclosure provides a method for identifying a sepsis condition of a subject, comprising: (a) assaying a biological sample of the subject to generate a dataset comprising gene expression data; (b) processing the dataset at each of a plurality of genes to determine quantitative measures of each of the plurality of genes, wherein the plurality of genes comprises genes induced by TNF, thereby producing a TNF signature of the biological sample of the subject; (c) comparing the TNF signature with one or more reference TNF signatures, wherein the comparing comprises, for at least one of the plurality of genes, comparing the quantitative measure of the gene of the TNF signature with corresponding quantitative measures of the gene of the one or more reference TNF signatures; and (d) based at least in part on the comparison in (c), identifying the sepsis condition of the subject.

Low-Density Granulocyte (LDG) Profiling of Lupus Conditions

In another aspect, the present disclosure provides a method for identifying a lupus condition of a subject, comprising: (a) assaying a biological sample of the subject to generate a dataset comprising gene expression data; (b) processing the dataset at each of a plurality of genes to determine quantitative measures of each of the plurality of genes, wherein the plurality of genes comprises low-density granulocyte (LDG)-associated genes, thereby producing an LDG signature of the biological sample of the subject; (c) comparing the LDG signature with one or more reference LDG signatures, wherein the comparing comprises, for at least one of the plurality of genes, comparing the quantitative measure of the gene of the LDG signature with corresponding quantitative measures of the gene of the one or more reference LDG signatures; and (d) based at least in part on the comparison in (c), identifying the lupus condition of the subject.

In some embodiments, the lupus condition is selected from the group consisting of: systemic lupus erythematosus (SLE), discoid lupus erythematosus (DLE), and lupus nephritis (LN). In some embodiments, the biological sample is selected from the group consisting of a whole blood (WB) sample, a PBMC sample, a tissue sample, and a cell sample. In some embodiments, the tissue sample is selected from the group consisting of skin tissue, synovium tissue, kidney tissue, and bone marrow tissue. In some embodiments, the kidney tissue is selected from the group consisting of glomerulus (Glom) and tubulointerstitium (TI). In some embodiments, the cell sample is selected from the group consisting of: myelocytes (MY), promyelocytes (PM), polymorphonuclear neutrophils (PMN), and peripheral blood mononuclear cells (PBMC).

In some embodiments, the method further comprises enriching or purifying a whole blood sample of the subject to obtain the cell sample. In some embodiments, assaying the biological sample comprises (i) using a microarray to generate the dataset comprising the gene expression data, (ii) sequencing the biological sample to generate the dataset comprising the gene expression data, or (iii) performing quantitative polymerase chain reaction (qPCR) of the biological sample to generate the dataset comprising the gene expression data.

In some embodiments, the plurality of genes comprises LDG-associated genes selected from the genes listed in Table 33. In some embodiments, the plurality of genes comprises LDG-associated genes selected from the genes listed in Table 34. In some embodiments, the plurality of genes comprises LDG-associated genes selected from the genes listed in Table 42A or Table 42B. In some embodiments, the plurality of genes comprises LDG-associated genes selected from the genes listed in Table 43A-43C. In some embodiments, the plurality of genes comprises LDG-associated genes selected from the genes listed in Table 44A. In some embodiments, the plurality of genes comprises LDG-associated genes selected from the genes listed in Table 45A or Table 45B.

In some embodiments, the quantitative measures of each of the plurality of genes comprise enrichment scores of each of the plurality of genes. In some embodiments, the enrichment scores of each of the plurality of genes comprise gene set variation analysis (GSVA) enrichment scores of each of the plurality of genes. In some embodiments, (c) further comprises, for the at least one of the plurality of genes, determining a difference between the quantitative measure of the gene of the LDG signature and the corresponding quantitative measures of the gene of the one or more reference LDG signatures. In some embodiments, (d) further comprises identifying the lupus condition of the subject when the difference satisfies a pre-determined criterion.

In some embodiments, (c) further comprises, for the at least one of the plurality of genes, determining a Z-score of the quantitative measure of the gene of the LDG signature relative to the corresponding quantitative measures of the gene of the one or more reference LDG signatures. In some embodiments, (d) further comprises identifying the lupus condition of the subject when the Z-score satisfies a pre-determined criterion. In some embodiments, (d) further comprises identifying the lupus condition of the subject when the Z-score is at least 2, and identifying an absence of the lupus condition of the subject when the Z-score is less than 2.
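The Z-score criterion above can be sketched as follows. The disclosure does not fix a formula, so this sketch assumes the Z-score is the subject's measure minus the mean of the reference measures for the same gene, divided by their sample standard deviation. The function names and the numeric values are hypothetical.

```python
from statistics import mean, stdev

def z_score(value, reference_values):
    """Z-score of a subject's measure relative to reference measures of one gene."""
    mu = mean(reference_values)
    sigma = stdev(reference_values)  # sample standard deviation (an assumption)
    if sigma == 0:
        raise ValueError("reference values have zero variance")
    return (value - mu) / sigma

def identify_condition(value, reference_values, threshold=2.0):
    """True when the Z-score is at least the threshold, False otherwise."""
    return z_score(value, reference_values) >= threshold

# Hypothetical reference measures for a single gene.
reference = [0.10, 0.12, 0.08, 0.11, 0.09]

print(identify_condition(0.20, reference))  # Z is about 6.3, so the criterion is met
print(identify_condition(0.11, reference))  # Z is about 0.6, so it is not
```

The `threshold` parameter defaults to 2 to match the embodiment above; other cut-offs can be passed in without changing the function.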

In some embodiments, (d) further comprises identifying the lupus condition of the subject based at least in part on a SLEDAI score of the subject. In some embodiments, the subject is asymptomatic for one or more lupus conditions selected from the group consisting of systemic lupus erythematosus (SLE), discoid lupus erythematosus (DLE), and lupus nephritis (LN).

In some embodiments, the method further comprises applying a trained algorithm to the LDG signature to identify the lupus condition of the subject. In some embodiments, the trained algorithm is trained using a first set of independent training samples associated with a presence of the lupus condition and a second set of independent training samples associated with an absence of the lupus condition. In some embodiments, the method further comprises using the trained algorithm to process a set of clinical health data of the subject to identify the lupus condition. In some embodiments, the trained algorithm comprises a supervised machine learning algorithm. In some embodiments, the supervised machine learning algorithm comprises a deep learning algorithm, a support vector machine (SVM), a neural network, or a Random Forest.

In some embodiments, the method further comprises (e) assaying a second biological sample of the subject to generate a second dataset comprising gene expression data; (f) processing the second dataset at each of the plurality of genes to determine second quantitative measures of each of the plurality of genes, thereby producing a second LDG signature of the second biological sample of the subject; (g) comparing the second LDG signature with one or more reference LDG signatures, wherein the comparing comprises, for at least one of the plurality of genes, comparing the quantitative measure of the gene of the second LDG signature with corresponding quantitative measures of the gene of the one or more reference LDG signatures; and (h) based at least in part on the comparison in (g), identifying the lupus condition of the subject.

In some embodiments, the biological sample and the second biological sample comprise two different sample types selected from the group consisting of: a whole blood (WB) sample, a PBMC sample, a skin tissue sample, a synovium tissue sample, a kidney tissue sample comprising glomerulus (Glom), a kidney tissue sample comprising tubulointerstitium (TI), a bone marrow tissue sample, a myelocyte (MY) cell sample, a promyelocyte (PM) cell sample, and a polymorphonuclear neutrophil (PMN) sample.

In some embodiments, a difference in the assessment of the lupus condition of the subject among the plurality of time points is indicative of one or more clinical indications selected from the group consisting of: (i) a diagnosis of the lupus condition of the subject, (ii) a prognosis of the lupus condition of the subject, and (iii) an efficacy or non-efficacy of a course of treatment for treating the lupus condition of the subject.

In some embodiments, the one or more reference LDG signatures are generated by: assaying a biological sample of one or more patients having one or more disease symptoms or being treated with one or more drugs to generate a reference dataset comprising gene expression data; and processing the reference dataset at each of the plurality of genes to determine quantitative measures of each of the plurality of genes.

In some embodiments, the one or more disease symptoms are selected from the group consisting of: alopecia, anti-dsDNA seropositivity, arthritis, fever, hematuria, leukopenia, low serum complement, mucosal ulcer, myositis, pericarditis, pleurisy, proteinuria, pyuria, rash, thrombocytopenia, urinary cast, vasculitis, and visual disturbance.

In some embodiments, the one or more drugs are selected from the group consisting of antimalarials, corticosteroids, immunosuppressants, and nonsteroidal anti-inflammatory drugs (NSAIDs).

In another aspect, the present disclosure provides a computer system for identifying a lupus condition of a subject, comprising: a database that is configured to store a dataset comprising gene expression data, wherein the gene expression data is obtained by assaying a biological sample of the subject; and one or more computer processors operatively coupled to the database, wherein the one or more computer processors are individually or collectively programmed to: (i) process the dataset at each of a plurality of genes to determine quantitative measures of each of the plurality of genes, wherein the plurality of genes comprises low-density granulocyte (LDG)-associated genes, thereby producing an LDG signature of the biological sample of the subject; (ii) compare the LDG signature with one or more reference LDG signatures, wherein the comparing comprises, for at least one of the plurality of genes, comparing the quantitative measure of the gene of the LDG signature with corresponding quantitative measures of the gene of the one or more reference LDG signatures; and (iii) based at least in part on the comparison in (ii), identify the lupus condition of the subject.

In some embodiments, the computer system further comprises an electronic display operatively coupled to the one or more computer processors, wherein the electronic display comprises a graphical user interface that is configured to display the report.

In another aspect, the present disclosure provides a non-transitory computer readable medium comprising machine-executable code that, upon execution by one or more computer processors, implements a method for identifying a lupus condition of a subject, the method comprising: (a) assaying a biological sample of the subject to generate a dataset comprising gene expression data; (b) processing the dataset at each of a plurality of genes to determine quantitative measures of each of the plurality of genes, wherein the plurality of genes comprises low-density granulocyte (LDG)-associated genes, thereby producing an LDG signature of the biological sample of the subject; (c) comparing the LDG signature with one or more reference LDG signatures, wherein the comparing comprises, for at least one of the plurality of genes, comparing the quantitative measure of the gene of the LDG signature with corresponding quantitative measures of the gene of the one or more reference LDG signatures; and (d) based at least in part on the comparison in (c), identifying the lupus condition of the subject.

Primary Immunodeficiency (PID) Profiling of Lupus Conditions

In another aspect, the present disclosure provides a method for identifying a lupus condition of a subject, comprising: (a) assaying a biological sample of the subject to generate a dataset comprising gene expression data; (b) processing the dataset at each of a plurality of genes to determine quantitative measures of each of the plurality of genes, wherein the plurality of genes comprises primary immunodeficiency (PID)-associated genes, thereby producing a PID signature of the biological sample of the subject; (c) processing the PID signature with one or more reference PID signatures, wherein the processing comprises, for at least one of the plurality of genes, comparing the quantitative measure of the gene of the PID signature with corresponding quantitative measures of the gene of the one or more reference PID signatures; and (d) based at least in part on the comparison in (c), identifying the lupus condition of the subject.

In some embodiments, the lupus condition is selected from the group consisting of: systemic lupus erythematosus (SLE), discoid lupus erythematosus (DLE), and lupus nephritis (LN). In some embodiments, the biological sample is selected from the group consisting of: a whole blood (WB) sample, a PBMC sample, a tissue sample, and a cell sample. In some embodiments, the tissue sample is selected from the group consisting of: skin tissue, synovium tissue, kidney tissue, and bone marrow tissue. In some embodiments, the kidney tissue is selected from the group consisting of: glomerulus (Glom) and tubulointerstitium (TI). In some embodiments, the cell sample is selected from the group consisting of: myelocytes (MY), promyelocytes (PM), polymorphonuclear neutrophils (PMN), peripheral blood mononuclear cells (PBMC), and hematopoietic stem cells.

In some embodiments, the plurality of genes comprises PID-associated genes selected from the genes listed in Table 47. In some embodiments, the plurality of genes comprises at least 5 PID-associated genes selected from the genes listed in Table 47. In some embodiments, the plurality of genes comprises at least 10 PID-associated genes selected from the genes listed in Table 47. In some embodiments, the plurality of genes comprises at least 25 PID-associated genes selected from the genes listed in Table 47. In some embodiments, the plurality of genes comprises at least 50 PID-associated genes selected from the genes listed in Table 47. In some embodiments, the plurality of genes comprises at least 100 PID-associated genes selected from the genes listed in Table 47.

In some embodiments, the quantitative measures of each of the plurality of genes comprise enrichment scores of each of the plurality of genes. In some embodiments, the enrichment scores of each of the plurality of genes comprise gene set variation analysis (GSVA) enrichment scores of each of the plurality of genes. In some embodiments, (c) further comprises, for the at least one of the plurality of genes, determining a difference between the quantitative measure of the gene of the PID signature and the corresponding quantitative measures of the gene of the one or more reference PID signatures. In some embodiments, (d) further comprises identifying the lupus condition of the subject when the difference satisfies a pre-determined criterion.
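The per-gene difference and pre-determined criterion described above can be illustrated with a short sketch. It assumes each signature is a mapping from gene name to a precomputed enrichment score (the enrichment scoring itself is not reproduced here) and that the criterion is a minimum difference met by a minimum number of genes. The gene labels, scores, cut-offs, and function names are hypothetical.

```python
def signature_differences(signature, reference):
    """Per-gene difference (subject minus reference) for genes present in both."""
    shared = signature.keys() & reference.keys()
    return {gene: signature[gene] - reference[gene] for gene in shared}

def satisfies_criterion(differences, min_difference=0.5, min_genes=1):
    """Check whether enough genes reach the minimum difference.

    Returns (criterion_met, sorted list of genes that reached the cut-off).
    """
    hits = sorted(g for g, d in differences.items() if d >= min_difference)
    return len(hits) >= min_genes, hits

# Hypothetical enrichment scores; GENE_A..GENE_C are placeholder labels.
subject_signature = {"GENE_A": 0.62, "GENE_B": -0.10, "GENE_C": 0.35}
reference_signature = {"GENE_A": 0.05, "GENE_B": -0.05, "GENE_C": 0.20}

diffs = signature_differences(subject_signature, reference_signature)
met, genes = satisfies_criterion(diffs)
print(met, genes)
```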

In some embodiments, (c) further comprises, for the at least one of the plurality of genes, determining a Z-score of the quantitative measure of the gene of the PID signature relative to the corresponding quantitative measures of the gene of the one or more reference PID signatures. In some embodiments, (d) further comprises identifying the lupus condition of the subject when the Z-score satisfies a pre-determined criterion. In some embodiments, (d) further comprises identifying the lupus condition of the subject when the Z-score is at least about 3, and identifying an absence of the lupus condition of the subject when the Z-score is less than about 3. In some embodiments, (d) further comprises identifying the lupus condition of the subject when the Z-score is at least about 2.5, and identifying an absence of the lupus condition of the subject when the Z-score is less than about 2.5. In some embodiments, (d) further comprises identifying the lupus condition of the subject when the Z-score is at least about 2, and identifying an absence of the lupus condition of the subject when the Z-score is less than about 2. In some embodiments, (d) further comprises identifying the lupus condition of the subject when the Z-score is at least about 1.5, and identifying an absence of the lupus condition of the subject when the Z-score is less than about 1.5. In some embodiments, (d) further comprises identifying the lupus condition of the subject when the Z-score is at least about 1, and identifying an absence of the lupus condition of the subject when the Z-score is less than about 1. In some embodiments, (d) further comprises identifying the lupus condition of the subject when the Z-score is at least about 0.5, and identifying an absence of the lupus condition of the subject when the Z-score is less than about 0.5.

In some embodiments, the method further comprises identifying the lupus condition of the subject at a sensitivity of at least about 60%. In some embodiments, the method further comprises identifying the lupus condition of the subject at a sensitivity of at least about 65%. In some embodiments, the method further comprises identifying the lupus condition of the subject at a sensitivity of at least about 70%. In some embodiments, the method further comprises identifying the lupus condition of the subject at a sensitivity of at least about 75%. In some embodiments, the method further comprises identifying the lupus condition of the subject at a sensitivity of at least about 80%. In some embodiments, the method further comprises identifying the lupus condition of the subject at a sensitivity of at least about 85%. In some embodiments, the method further comprises identifying the lupus condition of the subject at a sensitivity of at least about 90%. In some embodiments, the method further comprises identifying the lupus condition of the subject at a sensitivity of at least about 95%. In some embodiments, the method further comprises identifying the lupus condition of the subject at a sensitivity of at least about 99%.

In some embodiments, the method further comprises identifying the lupus condition of the subject at a specificity of at least about 60%. In some embodiments, the method further comprises identifying the lupus condition of the subject at a specificity of at least about 65%. In some embodiments, the method further comprises identifying the lupus condition of the subject at a specificity of at least about 70%. In some embodiments, the method further comprises identifying the lupus condition of the subject at a specificity of at least about 75%. In some embodiments, the method further comprises identifying the lupus condition of the subject at a specificity of at least about 80%. In some embodiments, the method further comprises identifying the lupus condition of the subject at a specificity of at least about 85%. In some embodiments, the method further comprises identifying the lupus condition of the subject at a specificity of at least about 90%. In some embodiments, the method further comprises identifying the lupus condition of the subject at a specificity of at least about 95%. In some embodiments, the method further comprises identifying the lupus condition of the subject at a specificity of at least about 99%.

In some embodiments, the method further comprises identifying the lupus condition of the subject at a positive predictive value (PPV) of at least about 60%. In some embodiments, the method further comprises identifying the lupus condition of the subject at a positive predictive value (PPV) of at least about 65%. In some embodiments, the method further comprises identifying the lupus condition of the subject at a positive predictive value (PPV) of at least about 70%. In some embodiments, the method further comprises identifying the lupus condition of the subject at a positive predictive value (PPV) of at least about 75%. In some embodiments, the method further comprises identifying the lupus condition of the subject at a positive predictive value (PPV) of at least about 80%. In some embodiments, the method further comprises identifying the lupus condition of the subject at a positive predictive value (PPV) of at least about 85%. In some embodiments, the method further comprises identifying the lupus condition of the subject at a positive predictive value (PPV) of at least about 90%. In some embodiments, the method further comprises identifying the lupus condition of the subject at a positive predictive value (PPV) of at least about 95%. In some embodiments, the method further comprises identifying the lupus condition of the subject at a positive predictive value (PPV) of at least about 99%.

In some embodiments, the method further comprises identifying the lupus condition of the subject at a negative predictive value (NPV) of at least about 60%. In some embodiments, the method further comprises identifying the lupus condition of the subject at a negative predictive value (NPV) of at least about 65%. In some embodiments, the method further comprises identifying the lupus condition of the subject at a negative predictive value (NPV) of at least about 70%. In some embodiments, the method further comprises identifying the lupus condition of the subject at a negative predictive value (NPV) of at least about 75%. In some embodiments, the method further comprises identifying the lupus condition of the subject at a negative predictive value (NPV) of at least about 80%. In some embodiments, the method further comprises identifying the lupus condition of the subject at a negative predictive value (NPV) of at least about 85%. In some embodiments, the method further comprises identifying the lupus condition of the subject at a negative predictive value (NPV) of at least about 90%. In some embodiments, the method further comprises identifying the lupus condition of the subject at a negative predictive value (NPV) of at least about 95%. In some embodiments, the method further comprises identifying the lupus condition of the subject at a negative predictive value (NPV) of at least about 99%.

In some embodiments, the method further comprises identifying the lupus condition of the subject with an Area Under Curve (AUC) of at least about 0.60. In some embodiments, the method further comprises identifying the lupus condition of the subject with an Area Under Curve (AUC) of at least about 0.65. In some embodiments, the method further comprises identifying the lupus condition of the subject with an Area Under Curve (AUC) of at least about 0.70. In some embodiments, the method further comprises identifying the lupus condition of the subject with an Area Under Curve (AUC) of at least about 0.75. In some embodiments, the method further comprises identifying the lupus condition of the subject with an Area Under Curve (AUC) of at least about 0.80. In some embodiments, the method further comprises identifying the lupus condition of the subject with an Area Under Curve (AUC) of at least about 0.85. In some embodiments, the method further comprises identifying the lupus condition of the subject with an Area Under Curve (AUC) of at least about 0.90. In some embodiments, the method further comprises identifying the lupus condition of the subject with an Area Under Curve (AUC) of at least about 0.95. In some embodiments, the method further comprises identifying the lupus condition of the subject with an Area Under Curve (AUC) of at least about 0.99.
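For reference, the performance measures recited above can be computed from labeled outcomes as in the following sketch. It assumes binary labels (1 for presence of the condition, 0 for absence) and takes the AUC to be the area under the receiver operating characteristic curve, computed here through its rank-based equivalent: the fraction of positive/negative pairs in which the positive sample scores higher, with ties counted as one half. The labels, scores, and cut-off of 2 are hypothetical.

```python
def _ratio(numerator, denominator):
    """Safe division; returns NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")

def performance(y_true, y_pred):
    """Sensitivity, specificity, PPV, and NPV from binary labels and calls."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)
    tn = sum(1 for t, p in pairs if not t and not p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    return {
        "sensitivity": _ratio(tp, tp + fn),
        "specificity": _ratio(tn, tn + fp),
        "ppv": _ratio(tp, tp + fp),
        "npv": _ratio(tn, tn + fn),
    }

def auc(y_true, scores):
    """ROC AUC as the probability that a positive outscores a negative."""
    pos = [s for t, s in zip(y_true, scores) if t]
    neg = [s for t, s in zip(y_true, scores) if not t]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical labels and Z-scores for eight subjects.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [3.1, 2.4, 1.8, 2.9, 0.5, 2.1, 1.2, 0.3]
calls = [1 if s >= 2 else 0 for s in scores]

print(performance(labels, calls))  # each measure is 0.75 for this toy data
print(auc(labels, scores))         # 0.9375
```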

In some embodiments, (d) further comprises identifying the lupus condition of the subject based at least in part on a SLEDAI score of the subject. In some embodiments, the subject is asymptomatic for one or more lupus conditions selected from the group consisting of: systemic lupus erythematosus (SLE), discoid lupus erythematosus (DLE), and lupus nephritis (LN).

In some embodiments, the method further comprises applying a trained algorithm to the PID signature to identify the lupus condition of the subject. In some embodiments, the trained algorithm is trained using a first set of independent training samples associated with a presence of the lupus condition and a second set of independent training samples associated with an absence of the lupus condition. In some embodiments, the method further comprises using the trained algorithm to process a set of clinical health data of the subject to identify the lupus condition. In some embodiments, the trained algorithm comprises a supervised machine learning algorithm. In some embodiments, the supervised machine learning algorithm comprises a deep learning algorithm, a support vector machine (SVM), a neural network, or a Random Forest.

In some embodiments, the method further comprises using probes configured to selectively enrich the plurality of nucleic acid molecules corresponding to a panel of one or more genomic loci. In some embodiments, the probes are nucleic acid primers. In some embodiments, the probes have sequence complementarity with nucleic acid sequences of the panel of the one or more genomic loci. In some embodiments, the panel of the one or more genomic loci comprises genomic loci corresponding to the plurality of genes. In some embodiments, the panel of the one or more genomic loci comprises at least 5 distinct genomic loci. In some embodiments, the panel of the one or more genomic loci comprises at least 10 distinct genomic loci. In some embodiments, the panel of the one or more genomic loci comprises at least 25 distinct genomic loci. In some embodiments, the panel of the one or more genomic loci comprises at least 50 distinct genomic loci. In some embodiments, the panel of the one or more genomic loci comprises at least 100 distinct genomic loci. In some embodiments, the panel of the one or more genomic loci comprises at least 150 distinct genomic loci.

In some embodiments, the method further comprises (e) assaying a second biological sample of the subject to generate a second dataset comprising gene expression data; (f) processing the second dataset at each of the plurality of genes to determine second quantitative measures of each of the plurality of genes, thereby producing a second PID signature of the second biological sample of the subject; (g) processing the second PID signature with one or more reference PID signatures, wherein the processing comprises, for at least one of the plurality of genes, comparing the quantitative measure of the gene of the second PID signature with corresponding quantitative measures of the gene of the one or more reference PID signatures; and (h) based at least in part on the comparison in (g), identifying the lupus condition of the subject.

In some embodiments, the one or more reference PID signatures are generated by: assaying a biological sample of one or more patients having one or more disease symptoms or being treated with one or more drugs to generate a reference dataset comprising gene expression data; and processing the reference dataset at each of the plurality of genes to determine quantitative measures of each of the plurality of genes.

In some embodiments, the one or more drugs are selected from the group consisting of antimalarials, corticosteroids, immunosuppressants, and nonsteroidal anti-inflammatory drugs (NSAIDs).

In another aspect, the present disclosure provides a computer system for identifying a lupus condition of a subject, comprising: a database that is configured to store a dataset comprising gene expression data, wherein the gene expression data is obtained by assaying a biological sample of the subject; and one or more computer processors operatively coupled to the database, wherein the one or more computer processors are individually or collectively programmed to: (i) process the dataset at each of a plurality of genes to determine quantitative measures of each of the plurality of genes, wherein the plurality of genes comprises primary immunodeficiency (PID)-associated genes, thereby producing a PID signature of the biological sample of the subject; (ii) process the PID signature with one or more reference PID signatures, wherein the processing comprises, for at least one of the plurality of genes, comparing the quantitative measure of the gene of the PID signature with corresponding quantitative measures of the gene of the one or more reference PID signatures; and (iii) based at least in part on the comparison in (ii), identify the lupus condition of the subject.
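By way of non-limiting illustration, the per-gene comparison of a PID signature with one or more reference PID signatures may be sketched as a z-score calculation. The gene names, the expression values, and the z-score threshold below are hypothetical placeholders chosen for this example; the disclosure does not prescribe this particular statistic.

```python
from statistics import mean, stdev


def compare_to_references(signature, references, z_threshold=2.0):
    """Return genes whose measure deviates from the reference signatures.

    signature:  dict mapping gene -> quantitative measure for the subject
    references: list of dicts with the same structure (reference signatures)
    z_threshold: assumed cutoff on the absolute z-score (illustrative only)
    """
    deviating = {}
    for gene, value in signature.items():
        ref_values = [ref[gene] for ref in references if gene in ref]
        if len(ref_values) < 2:
            continue  # a standard deviation needs at least two reference values
        spread = stdev(ref_values)
        if spread == 0:
            continue  # no variation among references; z-score is undefined
        z = (value - mean(ref_values)) / spread
        if abs(z) >= z_threshold:
            deviating[gene] = z
    return deviating


# Hypothetical values for illustration only.
subject_signature = {"GENE_A": 9.0, "GENE_B": 5.1}
reference_signatures = [
    {"GENE_A": 5.0, "GENE_B": 5.0},
    {"GENE_A": 6.0, "GENE_B": 5.2},
    {"GENE_A": 4.0, "GENE_B": 4.8},
]
flagged = compare_to_references(subject_signature, reference_signatures)
```

In this sketch, the genes returned in `flagged` are those whose quantitative measures differ most from the reference signatures; how such deviations map to a lupus condition is outside the scope of the example.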

Biological Data Analysis

In another aspect, the present disclosure provides a computer-implemented method for assessing a condition of a subject, comprising: (a) receiving a dataset of a biological sample of the subject; (b) selecting one or more data analysis tools, wherein the one or more data analysis tools comprise an analysis tool selected from the group consisting of: a BIG-C™ big data analysis tool, an I-Scope™ big data analysis tool, a T-Scope™ big data analysis tool, a CellScan big data analysis tool, an MS (Molecular Signature) Scoring™ analysis tool, a Gene Set Variation Analysis (GSVA) tool (e.g., P-Scope), a CoLTs® (Combined Lupus Treatment Scoring) analysis tool, and a Target Scoring analysis tool, or a combination thereof; (c) processing the dataset using the one or more data analysis tools to generate a data signature of the biological sample of the subject; and (d) based at least in part on the data signature generated in (c), assessing the condition of the subject.

In some embodiments, the dataset comprises mRNA gene expression or transcriptome data, DNA genomic data, proteomic data, metabolomic data, or a combination thereof. In some embodiments, the biological sample is selected from the group consisting of: a whole blood (WB) sample, a PBMC sample, a tissue sample, and a cell sample. In some embodiments, assessing the condition of the subject comprises identifying a disease or disorder of the subject.

In some embodiments, the method further comprises identifying a disease or disorder of the subject at a sensitivity or specificity of at least about 70%. In some embodiments, the method further comprises determining a likelihood of the identification of the disease or disorder of the subject. In some embodiments, the method further comprises providing a therapeutic intervention for the disease or disorder of the subject. In some embodiments, the method further comprises monitoring the disease or disorder of the subject, wherein the monitoring comprises assessing the disease or disorder of the subject at a plurality of time points, wherein the assessing is based at least on the disease or disorder identified at each of the plurality of time points.

In some embodiments, selecting the one or more data analysis tools comprises receiving a user selection of the one or more data analysis tools. In some embodiments, selecting the one or more data analysis tools is automatically performed by the computer without receiving a user selection of the one or more data analysis tools.
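By way of non-limiting illustration, the two selection modes described above (user selection and automatic selection) may be sketched with a simple registry of callables. The two tools registered below are hypothetical stand-ins written for this example; they do not implement or represent any of the named analysis tools.

```python
def mean_expression_tool(dataset):
    """Hypothetical stand-in tool: average measure across all genes."""
    return sum(dataset.values()) / len(dataset)


def max_gene_tool(dataset):
    """Hypothetical stand-in tool: name of the most highly expressed gene."""
    return max(dataset, key=dataset.get)


# Registry mapping an assumed tool name to its implementation.
TOOL_REGISTRY = {
    "mean_expression": mean_expression_tool,
    "max_gene": max_gene_tool,
}


def select_tools(user_selection=None):
    """Return tool names chosen by the user, or all tools if none are given."""
    if user_selection is None:
        return sorted(TOOL_REGISTRY)  # automatic selection without user input
    unknown = [name for name in user_selection if name not in TOOL_REGISTRY]
    if unknown:
        raise ValueError(f"unknown analysis tools: {unknown}")
    return list(user_selection)


def generate_data_signature(dataset, user_selection=None):
    """Run each selected tool on the dataset and collect the outputs."""
    return {
        name: TOOL_REGISTRY[name](dataset)
        for name in select_tools(user_selection)
    }


# Hypothetical dataset: gene -> quantitative measure.
dataset = {"GENE_A": 2.0, "GENE_B": 4.0}
automatic_signature = generate_data_signature(dataset)
user_signature = generate_data_signature(dataset, ["max_gene"])
```

The automatic rule used here (run every registered tool) is only one possible policy; a rule based on the dataset type would fit the same interface.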

In another aspect, the present disclosure provides a computer system for assessing a condition of a subject, comprising: a database that is configured to store a dataset of a biological sample of the subject; and one or more computer processors operatively coupled to the database, wherein the one or more computer processors are individually or collectively programmed to: (i) select one or more data analysis tools, wherein the one or more data analysis tools comprise an analysis tool selected from the group consisting of: a BIG-C™ big data analysis tool, an I-Scope™ big data analysis tool, a T-Scope™ big data analysis tool, a CellScan big data analysis tool, an MS (Molecular Signature) Scoring™ analysis tool, a Gene Set Variation Analysis (GSVA) tool (e.g., P-Scope), a CoLTs® (Combined Lupus Treatment Scoring) analysis tool, and a Target Scoring analysis tool; (ii) process the dataset using the one or more data analysis tools to generate a data signature of the biological sample of the subject; and (iii) based at least in part on the data signature generated in (ii), assess the condition of the subject.

In another aspect, the present disclosure provides a non-transitory computer readable medium comprising machine-executable code that, upon execution by one or more computer processors, implements a method for assessing a condition of a subject, the method comprising: (a) receiving a dataset of a biological sample of the subject; (b) selecting one or more data analysis tools, wherein the one or more data analysis tools comprise an analysis tool selected from the group consisting of: a BIG-C™ big data analysis tool, an I-Scope™ big data analysis tool, a T-Scope™ big data analysis tool, a CellScan big data analysis tool, an MS (Molecular Signature) Scoring™ analysis tool, a Gene Set Variation Analysis (GSVA) tool (e.g., P-Scope), a CoLTs® (Combined Lupus Treatment Scoring) analysis tool, and a Target Scoring analysis tool; (c) processing the dataset using the one or more data analysis tools to generate a data signature of the biological sample of the subject; and (d) based at least in part on the data signature generated in (c), assessing the condition of the subject. In any embodiment described herein, the one or more data analysis tools can be a plurality of data analysis tools each independently selected from a BIG-C™ big data analysis tool, an I-Scope™ big data analysis tool, a T-Scope™ big data analysis tool, a CellScan big data analysis tool, an MS (Molecular Signature) Scoring™ analysis tool, a Gene Set Variation Analysis (GSVA) tool (e.g., P-Scope), a CoLTs® (Combined Lupus Treatment Scoring) analysis tool, and a Target Scoring analysis tool.

Analysis of Single Nucleotide Polymorphisms (SNPs) Associated with Lupus

In another aspect, the present disclosure provides a computer-implemented method for assessing an SLE condition of a subject, comprising: (a) receiving a dataset of a biological sample of the subject, wherein the dataset comprises quantitative measures of gene expression at each of a plurality of SLE-associated genomic loci, wherein the plurality of SLE-associated genomic loci comprises (i) one or more AA-specific single nucleotide polymorphisms (SNPs) if the subject has an African-Ancestry (AA), or (ii) one or more EA-specific SNPs if the subject has a European-Ancestry (EA); (b) processing the dataset to identify one or more differentially expressed (DE) genomic loci among the plurality of SLE-associated genomic loci; and (c) based at least in part on the one or more DE genomic loci identified in (b) and whether the subject has an African-Ancestry (AA) or a European-Ancestry (EA), assessing the SLE condition of the subject.

In another aspect, the present disclosure provides a computer-implemented method for assessing an SLE condition of a subject, comprising: (a) receiving a dataset of a biological sample of the subject, wherein the dataset comprises quantitative measures of gene expression at each of a plurality of systemic lupus erythematosus (SLE)-associated genomic loci, wherein the plurality of SLE-associated genomic loci comprises one or more African-Ancestry (AA)-specific single nucleotide polymorphisms (SNPs); (b) processing the dataset to identify one or more differentially expressed (DE) genomic loci among the plurality of SLE-associated genomic loci; and (c) based at least in part on the one or more DE genomic loci identified in (b) and whether the subject has an African-Ancestry (AA), assessing the SLE condition of the subject.

In another aspect, the present disclosure provides a computer-implemented method for assessing an SLE condition of a subject, comprising: (a) receiving a dataset of a biological sample of the subject, wherein the dataset comprises quantitative measures of gene expression at each of a plurality of systemic lupus erythematosus (SLE)-associated genomic loci, wherein the plurality of SLE-associated genomic loci comprises one or more European-Ancestry (EA)-specific single nucleotide polymorphisms (SNPs); (b) processing the dataset to identify one or more differentially expressed (DE) genomic loci among the plurality of SLE-associated genomic loci; and (c) based at least in part on the one or more DE genomic loci identified in (b) and whether the subject has a European-Ancestry (EA), assessing the SLE condition of the subject.
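By way of non-limiting illustration, steps (a) and (b) of the foregoing methods may be sketched as choosing the ancestry-specific loci and then flagging loci whose measures differ from a baseline. The locus identifiers, the baseline values, the log2 scale, and the fold-change threshold are all assumptions made for this example; the disclosure does not specify how differential expression is determined here.

```python
def select_loci(ancestry, aa_loci, ea_loci):
    """Return the ancestry-specific loci for the subject ("AA" or "EA")."""
    if ancestry == "AA":
        return set(aa_loci)
    if ancestry == "EA":
        return set(ea_loci)
    raise ValueError(f"unsupported ancestry label: {ancestry!r}")


def differentially_expressed(dataset, baseline, loci, min_abs_log2_fc=1.0):
    """Return loci whose log2 measure differs from baseline by the threshold.

    dataset, baseline: dict mapping locus -> log2 expression (assumed scale)
    min_abs_log2_fc:   assumed cutoff on the absolute log2 difference
    """
    de_loci = {}
    for locus in loci:
        if locus not in dataset or locus not in baseline:
            continue  # skip loci without both measurements
        difference = dataset[locus] - baseline[locus]
        if abs(difference) >= min_abs_log2_fc:
            de_loci[locus] = difference
    return de_loci


# Hypothetical identifiers and values for illustration only.
aa_loci = ["LOCUS_1", "LOCUS_2"]
ea_loci = ["LOCUS_3"]
dataset = {"LOCUS_1": 7.5, "LOCUS_2": 4.2, "LOCUS_3": 9.0}
baseline = {"LOCUS_1": 5.0, "LOCUS_2": 4.0, "LOCUS_3": 6.0}

de_for_aa = differentially_expressed(
    dataset, baseline, select_loci("AA", aa_loci, ea_loci)
)
de_for_ea = differentially_expressed(
    dataset, baseline, select_loci("EA", aa_loci, ea_loci)
)
```

The resulting DE loci, together with the ancestry label, would then feed the assessment in step (c), which this sketch does not attempt to model.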

In some embodiments, the dataset comprises RNA gene expression or transcriptome data, DNA genomic data, or a combination thereof. In some embodiments, the biological sample is selected from the group consisting of a whole blood (WB) sample, a PBMC sample, a tissue sample, and a cell sample. In some embodiments, assessing the SLE condition of the subject comprises determining a diagnosis of the SLE condition, a prognosis of the SLE condition, a susceptibility of the SLE condition, a treatment for the SLE condition, or an efficacy or non-efficacy of a treatment for the SLE condition.

In some embodiments, the method further comprises determining a diagnosis of the SLE condition with a sensitivity of at least about 70%. In some embodiments, the method further comprises determining a diagnosis of the SLE condition with a specificity of at least about 70%. In some embodiments, the method further comprises determining a diagnosis of the SLE condition with a positive predictive value of at least about 70%. In some embodiments, the method further comprises determining a diagnosis of the SLE condition with a negative predictive value of at least about 70%. In some embodiments, the method further comprises determining a diagnosis of the SLE condition with an Area Under Curve (AUC) of at least about 70%. In some embodiments, the method further comprises determining a likelihood of the diagnosis of the SLE condition of the subject.
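By way of non-limiting illustration, the performance measures recited above may be computed from known diagnoses and classifier outputs as follows. The labels, scores, and the 0.5 decision threshold are hypothetical; the AUC is computed with the standard rank-based (Mann-Whitney) formulation.

```python
def _ratio(numerator, denominator):
    """Return numerator / denominator, or None when the ratio is undefined."""
    return numerator / denominator if denominator else None


def confusion_metrics(y_true, y_pred):
    """Sensitivity, specificity, PPV, and NPV for binary labels (1 = SLE)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "sensitivity": _ratio(tp, tp + fn),
        "specificity": _ratio(tn, tn + fp),
        "ppv": _ratio(tp, tp + fp),
        "npv": _ratio(tn, tn + fn),
    }


def auc(y_true, scores):
    """Probability that a positive case scores above a negative case."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return _ratio(wins, len(positives) * len(negatives))


# Hypothetical labels and classifier scores for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.2, 0.1, 0.7]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed decision threshold

metrics = confusion_metrics(y_true, y_pred)
area_under_curve = auc(y_true, scores)
```

Each value can then be compared against the recited threshold of at least about 70% (that is, 0.70 on the scale used in this sketch).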

In some embodiments, the method further comprises generating a plurality of drug candidates for the SLE condition of the subject. In some embodiments, the method further comprises evaluating or predicting a relative efficacy of the plurality of drug candidates for the SLE condition of the subject. In some embodiments, the method further comprises providing a therapeutic intervention comprising one or more of the plurality of drug candidates for the SLE condition of the subject.

In some embodiments, the method further comprises selecting a treatment for the SLE condition of the subject, the treatment comprising an AA-specific drug. In some embodiments, the AA-specific drug is selected from the group consisting of: an HDAC inhibitor, a retinoid, an IRAK4-targeted drug, and a CTLA4-targeted drug. In some embodiments, the method further comprises selecting a treatment for the SLE condition of the subject, the treatment comprising an EA-specific drug. In some embodiments, the EA-specific drug is selected from the group consisting of: hydroxychloroquine, a CD40LG-targeted drug, a CXCR1-targeted drug, and a CXCR2-targeted drug. In some embodiments, the method further comprises selecting a treatment for the SLE condition of the subject, the treatment comprising a drug targeting E-Genes or pathways shared by EA and AA. In some embodiments, the drug targeting E-Genes or pathways shared by EA and AA is selected from the group consisting of: ibrutinib, ruxolitinib, and ustekinumab.

In some embodiments, the method further comprises monitoring the SLE condition of the subject, wherein the monitoring comprises assessing the SLE condition of the subject at each of a plurality of time points, and processing the plurality of assessments of the SLE condition of the subject at each of the plurality of time points.
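By way of non-limiting illustration, processing a plurality of assessments obtained at a plurality of time points may be sketched as computing the change between consecutive assessments. The numeric assessment score, the time units, and the trend labels are assumptions introduced for this example only.

```python
def summarize_monitoring(assessments):
    """Summarize assessments given as (time_point, score) tuples.

    The score is a hypothetical numeric summary of the SLE assessment;
    time points may be supplied in any order.
    """
    if len(assessments) < 2:
        raise ValueError("monitoring requires at least two time points")
    ordered = sorted(assessments)
    # Change in score between each pair of consecutive time points.
    changes = [
        (later_time, later_score - earlier_score)
        for (_, earlier_score), (later_time, later_score)
        in zip(ordered, ordered[1:])
    ]
    net_change = ordered[-1][1] - ordered[0][1]
    if net_change > 0:
        trend = "increasing"
    elif net_change < 0:
        trend = "decreasing"
    else:
        trend = "stable"
    return {"changes": changes, "net_change": net_change, "trend": trend}


# Hypothetical (week, score) assessments for illustration only.
summary = summarize_monitoring([(0, 8.0), (12, 5.0), (4, 6.5)])
```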

In some embodiments, the one or more EA-specific SNPs comprise one or more SNPs of genes selected from the group listed in Table 56. In some embodiments, the one or more AA-specific SNPs comprise one or more SNPs of genes selected from the group listed in Table 57. In some embodiments, the plurality of SLE-associated genomic loci comprises one or more shared SNPs, wherein the one or more shared SNPs are common to both EA and AA. In some embodiments, the one or more shared SNPs comprise one or more SNPs of genes selected from the group listed in Table 58.
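By way of non-limiting illustration, the relationship among EA-specific, AA-specific, and shared SNPs may be expressed with set operations over the SNPs associated with SLE in each ancestry. The SNP identifiers below are hypothetical placeholders and do not correspond to entries in Table 56, Table 57, or Table 58.

```python
def partition_snps(ea_associated, aa_associated):
    """Split SNPs into shared, EA-specific, and AA-specific groups."""
    ea = set(ea_associated)
    aa = set(aa_associated)
    return {
        "shared": ea & aa,        # common to both EA and AA
        "ea_specific": ea - aa,   # associated in EA only
        "aa_specific": aa - ea,   # associated in AA only
    }


# Hypothetical identifiers for illustration only.
groups = partition_snps(
    ea_associated=["snp_1", "snp_2", "snp_3"],
    aa_associated=["snp_3", "snp_4"],
)
```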

In another aspect, the present disclosure provides a computer system for assessing an SLE condition of a subject, comprising: a database that is configured to store an African-Ancestry (AA) status of the subject, a European-Ancestry (EA) status of the subject, and a dataset of a biological sample of the subject, wherein the dataset comprises quantitative measures of gene expression at each of a plurality of SLE-associated genomic loci, wherein the plurality of SLE-associated genomic loci comprises (i) one or more AA-specific single nucleotide polymorphisms (SNPs) if the subject has an African-Ancestry (AA), or (ii) one or more EA-specific SNPs if the subject has a European-Ancestry (EA); and one or more computer processors operatively coupled to the database, wherein the one or more computer processors are individually or collectively programmed to: (i) process the dataset to identify one or more differentially expressed (DE) genomic loci among the plurality of SLE-associated genomic loci; and (ii) based at least in part on the one or more DE genomic loci identified in (i), the AA status of the subject, and the EA status of the subject, assess the SLE condition of the subject.

In another aspect, the present disclosure provides a computer system for assessing an SLE condition of a subject, comprising: a database that is configured to store an African-Ancestry (AA) status of the subject and a dataset of a biological sample of the subject, wherein the dataset comprises quantitative measures of gene expression at each of a plurality of systemic lupus erythematosus (SLE)-associated genomic loci, wherein the plurality of SLE-associated genomic loci comprises one or more African-Ancestry (AA)-specific single nucleotide polymorphisms (SNPs); and one or more computer processors operatively coupled to the database, wherein the one or more computer processors are individually or collectively programmed to: (i) process the dataset to identify one or more differentially expressed (DE) genomic loci among the plurality of SLE-associated genomic loci; and (ii) based at least in part on the one or more DE genomic loci identified in (i) and the AA status of the subject, assess the SLE condition of the subject.

In another aspect, the present disclosure provides a computer system for assessing an SLE condition of a subject, comprising: a database that is configured to store a European-Ancestry (EA) status of the subject and a dataset of a biological sample of the subject, wherein the dataset comprises quantitative measures of gene expression at each of a plurality of systemic lupus erythematosus (SLE)-associated genomic loci, wherein the plurality of SLE-associated genomic loci comprises one or more European-Ancestry (EA)-specific single nucleotide polymorphisms (SNPs); and one or more computer processors operatively coupled to the database, wherein the one or more computer processors are individually or collectively programmed to: (i) process the dataset to identify one or more differentially expressed (DE) genomic loci among the plurality of SLE-associated genomic loci; and (ii) based at least in part on the one or more DE genomic loci identified in (i) and the EA status of the subject, assess the SLE condition of the subject.

In another aspect, the present disclosure provides a non-transitory computer readable medium comprising machine-executable code that, upon execution by one or more computer processors, implements a method for assessing an SLE condition of a subject, comprising: (a) receiving a dataset of a biological sample of the subject, wherein the dataset comprises quantitative measures of gene expression at each of a plurality of SLE-associated genomic loci, wherein the plurality of SLE-associated genomic loci comprises (i) one or more AA-specific single nucleotide polymorphisms (SNPs) if the subject has an African-Ancestry (AA), or (ii) one or more EA-specific SNPs if the subject has a European-Ancestry (EA); (b) processing the dataset to identify one or more differentially expressed (DE) genomic loci among the plurality of SLE-associated genomic loci; and (c) based at least in part on the one or more DE genomic loci identified in (b) and whether the subject has an African-Ancestry (AA) or a European-Ancestry (EA), assessing the SLE condition of the subject.

In another aspect, the present disclosure provides a non-transitory computer readable medium comprising machine-executable code that, upon execution by one or more computer processors, implements a method for assessing an SLE condition of a subject, comprising: (a) receiving a dataset of a biological sample of the subject, wherein the dataset comprises quantitative measures of gene expression at each of a plurality of systemic lupus erythematosus (SLE)-associated genomic loci, wherein the plurality of SLE-associated genomic loci comprises one or more African-Ancestry (AA)-specific single nucleotide polymorphisms (SNPs); (b) processing the dataset to identify one or more differentially expressed (DE) genomic loci among the plurality of SLE-associated genomic loci; and (c) based at least in part on the one or more DE genomic loci identified in (b) and whether the subject has an African-Ancestry (AA), assessing the SLE condition of the subject.

In another aspect, the present disclosure provides a non-transitory computer readable medium comprising machine-executable code that, upon execution by one or more computer processors, implements a method for assessing an SLE condition of a subject, comprising: (a) receiving a dataset of a biological sample of the subject, wherein the dataset comprises quantitative measures of gene expression at each of a plurality of systemic lupus erythematosus (SLE)-associated genomic loci, wherein the plurality of SLE-associated genomic loci comprises one or more European-Ancestry (EA)-specific single nucleotide polymorphisms (SNPs); (b) processing the dataset to identify one or more differentially expressed (DE) genomic loci among the plurality of SLE-associated genomic loci; and (c) based at least in part on the one or more DE genomic loci identified in (b) and whether the subject has a European-Ancestry (EA), assessing the SLE condition of the subject.

Analysis of Single Nucleotide Polymorphisms (SNPs) Associated with Lupus

In another aspect, the present disclosure provides a method for identifying an autoimmune disease drug target, the method comprising: (a) treating an autoimmune disease animal model with a drug configured to inhibit a drug target of the autoimmune disease, thereby producing a treated animal model; (b) assaying an animal biological sample of the treated animal model to obtain gene expression data of the treated animal model; (c) processing the gene expression data to obtain an animal gene signature, wherein the animal gene signature comprises quantitative measures of a first set of genomic loci associated with autoimmune disease pathways of the autoimmune disease animal model; (d) obtaining a set of human gene signatures, wherein the set of human gene signatures comprises quantitative measures of a second set of genomic loci associated with up-regulation or down-regulation of human autoimmune disease pathways in human patients having active autoimmune disease, and wherein the set of human gene signatures is generated by assaying human biological samples from one or more human patients having the autoimmune disease to obtain gene expression data; (e) processing the animal gene signature with the set of human gene signatures to identify (i) an animal genomic locus from among the first set of genomic loci, and (ii) a human genomic locus from among the second set of genomic loci that is associated with up-regulation or down-regulation of one or more human autoimmune disease pathways, wherein the animal genomic locus and the human genomic locus are orthologous and share similarity in expression patterns and function; and (f) identifying the drug target as the autoimmune disease drug target when the quantitative measure of the animal genomic locus of the animal gene signature is indicative of up-regulation or down-regulation of an autoimmune disease pathway of the autoimmune disease animal model.

In some embodiments, the autoimmune disease animal model is selected from: a mouse model, a rat model, a cat model, a dog model, a rabbit model, a guinea pig model, a hamster model, a pig model, a horse model, and a primate model. In some embodiments, the autoimmune disease animal model comprises a mouse model. In some embodiments, the autoimmune disease comprises lupus. In some embodiments, the lupus comprises systemic lupus erythematosus (SLE) or discoid lupus erythematosus (DLE). In some embodiments, the drug target is HDAC6. In some embodiments, the drug target is HDAC6 or a portion thereof. In some embodiments, the drug is an HDAC6 inhibitor. In some embodiments, the HDAC6 inhibitor is ACY-738. In some embodiments, the animal biological sample or the human biological samples comprise one or more of a bodily fluid sample, a blood sample, a cell sample, and a tissue sample. In some embodiments, the one or more human autoimmune disease pathways are selected from the pathways listed in Table 61, Table 62, Table 63, and Table 64. In some embodiments, the human genomic locus that is associated with up-regulation or down-regulation of the one or more human autoimmune disease pathways is selected from the genes listed in Table 59, Table 60, Table 61, Table 62, Table 65, Table 66, and Table 67. In some embodiments, the autoimmune disease pathways of the autoimmune disease animal model are selected from the pathways listed in Table 61, Table 62, Table 63, and Table 64. In some embodiments, the animal genomic locus is selected from the genes listed in Table 59, Table 60, Table 61, Table 62, Table 65, Table 66, and Table 67. 
In some embodiments, (e) comprises identifying (i) a plurality of animal genomic loci from among the first set of genomic loci, and (ii) a plurality of human genomic loci from among the second set of genomic loci that are associated with up-regulation or down-regulation of a plurality of human autoimmune disease pathways, wherein the plurality of animal genomic loci and the plurality of human genomic loci are pairwise orthologous and share similarities in expression patterns and function; and (f) comprises identifying the drug target as the autoimmune disease drug target when the quantitative measures of the plurality of animal genomic loci of the animal gene signature are indicative of up-regulation or down-regulation of a plurality of autoimmune disease pathways of the autoimmune disease animal model. In some embodiments, the plurality of human autoimmune disease pathways comprises between 2 and 5 different human autoimmune disease pathways. In some embodiments, the plurality of human autoimmune disease pathways comprises 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, about 25, about 30, about 35, about 40, about 45, about 50, about 60, about 70, about 80, about 90, about 100, or more than about 100 different human autoimmune disease pathways. In some embodiments, the autoimmune disease pathways of the autoimmune disease animal model comprise 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, about 25, about 30, about 35, about 40, about 45, about 50, about 60, about 70, about 80, about 90, about 100, or more than about 100 different autoimmune disease pathways. In some embodiments, the method further comprises determining the up-regulation or down-regulation of the autoimmune disease pathway of the autoimmune disease animal model based on determining a difference between the quantitative measure of the animal genomic locus of the animal gene signature and a reference quantitative measure of the animal genomic locus.
In some embodiments, the method further comprises obtaining the reference quantitative measure of the animal genomic locus by, prior to (a), assaying an animal biological sample of the autoimmune disease animal model.
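By way of non-limiting illustration, the pairing of animal and human genomic loci and the determination of up-regulation or down-regulation relative to a reference quantitative measure may be sketched as follows. The gene names, the ortholog map, the expression values, and the change threshold are hypothetical, and the sketch checks orthology and direction of change only; it does not evaluate similarity of function.

```python
def regulation(measure, reference, min_abs_change=1.0):
    """Classify a locus as "up", "down", or "unchanged" versus its reference.

    min_abs_change is an assumed cutoff on the difference in measures.
    """
    difference = measure - reference
    if difference >= min_abs_change:
        return "up"
    if difference <= -min_abs_change:
        return "down"
    return "unchanged"


def match_orthologous_loci(animal_signature, animal_reference,
                           human_loci, orthologs):
    """Pair regulated animal loci with their human orthologs.

    animal_signature: animal gene -> measure in the treated animal model
    animal_reference: animal gene -> reference measure (e.g., before treatment)
    human_loci:       human genes associated with disease pathways
    orthologs:        animal gene -> orthologous human gene
    """
    matches = []
    for animal_gene, human_gene in sorted(orthologs.items()):
        if animal_gene not in animal_signature:
            continue
        if animal_gene not in animal_reference or human_gene not in human_loci:
            continue
        direction = regulation(
            animal_signature[animal_gene], animal_reference[animal_gene]
        )
        if direction != "unchanged":
            matches.append((animal_gene, human_gene, direction))
    return matches


# Hypothetical values for illustration only.
orthologs = {"Gene1": "GENE1", "Gene2": "GENE2", "Gene3": "GENE3"}
animal_signature = {"Gene1": 3.0, "Gene2": 5.1, "Gene3": 8.0}
animal_reference = {"Gene1": 5.0, "Gene2": 5.0, "Gene3": 6.0}
human_loci = {"GENE1", "GENE2"}

matches = match_orthologous_loci(
    animal_signature, animal_reference, human_loci, orthologs
)
```

In this sketch, a non-empty `matches` list corresponds to the condition under which the drug target would be identified as an autoimmune disease drug target.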

In another aspect, the present disclosure provides a computer-implemented method for identifying an autoimmune disease drug target, the method comprising: (a) obtaining gene expression data generated by assaying an animal biological sample of a treated animal model, wherein the treated animal model is obtained by treating an autoimmune disease animal model with a drug configured to inhibit a drug target of the autoimmune disease; (b) processing the gene expression data to obtain an animal gene signature, wherein the animal gene signature comprises quantitative measures of a first set of genomic loci associated with autoimmune disease pathways of the autoimmune disease animal model; (c) obtaining a set of human gene signatures, wherein the set of human gene signatures comprises quantitative measures of a second set of genomic loci associated with up-regulation or down-regulation of human autoimmune disease pathways in human patients having active autoimmune disease, and wherein the set of human gene signatures is generated by assaying human biological samples from one or more human patients having the autoimmune disease to obtain gene expression data; (d) processing the animal gene signature with the set of human gene signatures to identify (i) an animal genomic locus from among the first set of genomic loci, and (ii) a human genomic locus from among the second set of genomic loci that is associated with up-regulation or down-regulation of one or more human autoimmune disease pathways, wherein the animal genomic locus and the human genomic locus are orthologous and share similarity in expression patterns and function; and (e) identifying the drug target as the autoimmune disease drug target when the quantitative measure of the animal genomic locus of the animal gene signature is indicative of up-regulation or down-regulation of an autoimmune disease pathway of the autoimmune disease animal model.

In some embodiments, the autoimmune disease animal model is selected from: a mouse model, a rat model, a cat model, a dog model, a rabbit model, a guinea pig model, a hamster model, a pig model, a horse model, and a primate model. In some embodiments, the autoimmune disease animal model comprises a mouse model. In some embodiments, the autoimmune disease comprises lupus. In some embodiments, the lupus comprises systemic lupus erythematosus (SLE) or discoid lupus erythematosus (DLE). In some embodiments, the drug target is HDAC6. In some embodiments, the drug target is HDAC6 or a portion thereof. In some embodiments, the drug is an HDAC6 inhibitor. In some embodiments, the HDAC6 inhibitor is ACY-738. In some embodiments, the animal biological sample or the human biological samples comprise one or more of: a bodily fluid sample, a blood sample, a cell sample, and a tissue sample. In some embodiments, the one or more human autoimmune disease pathways are selected from the pathways listed in Table 61, Table 62, Table 63, and Table 64. In some embodiments, the human genomic locus that is associated with up-regulation or down-regulation of the one or more human autoimmune disease pathways is selected from the genes listed in Table 59, Table 60, Table 61, Table 62, Table 65, Table 66, and Table 67. In some embodiments, the autoimmune disease pathways of the autoimmune disease animal model are selected from the pathways listed in Table 61, Table 62, Table 63, and Table 64. In some embodiments, the animal genomic locus is selected from the genes listed in Table 59, Table 60, Table 61, Table 62, Table 65, Table 66, and Table 67. 
In some embodiments, (d) comprises identifying (i) a plurality of animal genomic loci from among the first set of genomic loci, and (ii) a plurality of human genomic loci from among the second set of genomic loci that are associated with up-regulation or down-regulation of a plurality of human autoimmune disease pathways, wherein the plurality of animal genomic loci and the plurality of human genomic loci are pairwise orthologous and share similarities in expression patterns and function; and (e) comprises identifying the drug target as the autoimmune disease drug target when the quantitative measures of the plurality of animal genomic loci of the animal gene signature are indicative of up-regulation or down-regulation of a plurality of autoimmune disease pathways of the autoimmune disease animal model. In some embodiments, the plurality of human autoimmune disease pathways comprises between 2 and 5 different human autoimmune disease pathways. In some embodiments, the plurality of human autoimmune disease pathways comprises 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, about 25, about 30, about 35, about 40, about 45, about 50, about 60, about 70, about 80, about 90, about 100, or more than about 100 different human autoimmune disease pathways. In some embodiments, the autoimmune disease pathways of the autoimmune disease animal model comprise 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, about 25, about 30, about 35, about 40, about 45, about 50, about 60, about 70, about 80, about 90, about 100, or more than about 100 different autoimmune disease pathways. In some embodiments, the method further comprises determining the up-regulation or down-regulation of the autoimmune disease pathway of the autoimmune disease animal model based on determining a difference between the quantitative measure of the animal genomic locus of the animal gene signature and a reference quantitative measure of the animal genomic locus.
In some embodiments, the method further comprises obtaining the reference quantitative measure of the animal genomic locus by, prior to (a), assaying an animal biological sample of the autoimmune disease animal model.

In another aspect, the present disclosure provides a computer system for identifying an autoimmune disease drug target, comprising: a database that is configured to store gene expression data generated by assaying an animal biological sample of a treated animal model, wherein the treated animal model is obtained by treating an autoimmune disease animal model with a drug configured to inhibit a drug target of the autoimmune disease; and one or more computer processors operatively coupled to the database, wherein the one or more computer processors are individually or collectively programmed to: (i) process the gene expression data to obtain an animal gene signature, wherein the animal gene signature comprises quantitative measures of a first set of genomic loci associated with autoimmune disease pathways of the autoimmune disease animal model; (ii) obtain a set of human gene signatures, wherein the set of human gene signatures comprises quantitative measures of a second set of genomic loci associated with up-regulation or down-regulation of human autoimmune disease pathways in human patients having active autoimmune disease, and wherein the set of human gene signatures is generated by assaying human biological samples from one or more human patients having the autoimmune disease to obtain gene expression data; (iii) process the animal gene signature with the set of human gene signatures to identify (1) an animal genomic locus from among the first set of genomic loci, and (2) a human genomic locus from among the second set of genomic loci that is associated with up-regulation or down-regulation of one or more human autoimmune disease pathways, wherein the animal genomic locus and the human genomic locus are orthologous and share similarity in expression patterns and function; and (iv) identify the drug target as the autoimmune disease drug target when the quantitative measure of the animal genomic locus of the animal gene signature is indicative of up-regulation or down-regulation of an autoimmune disease pathway of the autoimmune disease animal model.

In some embodiments, the autoimmune disease animal model is selected from: a mouse model, a rat model, a cat model, a dog model, a rabbit model, a guinea pig model, a hamster model, a pig model, a horse model, and a primate model. In some embodiments, the autoimmune disease animal model comprises a mouse model. In some embodiments, the autoimmune disease comprises lupus. In some embodiments, the lupus comprises systemic lupus erythematosus (SLE) or discoid lupus erythematosus (DLE). In some embodiments, the drug target is HDAC6. In some embodiments, the drug target is HDAC6 or a portion thereof. In some embodiments, the drug is an HDAC6 inhibitor. In some embodiments, the HDAC6 inhibitor is ACY-738. In some embodiments, the animal biological sample or the human biological samples comprise one or more of a bodily fluid sample, a blood sample, a cell sample, and a tissue sample. In some embodiments, the one or more human autoimmune disease pathways are selected from the pathways listed in Table 61, Table 62, Table 63, and Table 64. In some embodiments, the human genomic locus that is associated with up-regulation or down-regulation of the one or more human autoimmune disease pathways is selected from the genes listed in Table 59, Table 60, Table 61, Table 62, Table 65, Table 66, and Table 67. In some embodiments, the autoimmune disease pathways of the autoimmune disease animal model are selected from the pathways listed in Table 61, Table 62, Table 63, and Table 64. In some embodiments, the animal genomic locus is selected from the genes listed in Table 59, Table 60, Table 61, Table 62, Table 65, Table 66, and Table 67. 
In some embodiments, (iii) comprises identifying (1) a plurality of animal genomic loci from among the first set of genomic loci, and (2) a plurality of human genomic loci from among the second set of genomic loci that are associated with up-regulation or down-regulation of a plurality of human autoimmune disease pathways, wherein the plurality of animal genomic loci and the plurality of human genomic loci are pairwise orthologous and share similarities in expression patterns and function; and (iv) comprises identifying the drug target as the autoimmune disease drug target when the quantitative measures of the plurality of animal genomic loci of the animal gene signature are indicative of up-regulation or down-regulation of a plurality of autoimmune disease pathways of the autoimmune disease animal model. In some embodiments, the plurality of human autoimmune disease pathways comprises between 2 and 5 different human autoimmune disease pathways. In some embodiments, the plurality of human autoimmune disease pathways comprises 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, about 25, about 30, about 35, about 40, about 45, about 50, about 60, about 70, about 80, about 90, about 100, or more than about 100 different human autoimmune disease pathways. In some embodiments, the autoimmune disease pathways of the autoimmune disease animal model comprise 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, about 25, about 30, about 35, about 40, about 45, about 50, about 60, about 70, about 80, about 90, about 100, or more than about 100 different autoimmune disease pathways. 
In some embodiments, the one or more computer processors are individually or collectively programmed to further determine the up-regulation or down-regulation of the autoimmune disease pathway of the autoimmune disease animal model based on determining a difference between the quantitative measure of the animal genomic locus of the animal gene signature and a reference quantitative measure of the animal genomic locus. In some embodiments, the one or more computer processors are individually or collectively programmed to further obtain the reference quantitative measure of the animal genomic locus by, prior to (a), assaying an animal biological sample of the autoimmune disease animal model.

In another aspect, the present disclosure provides a non-transitory computer readable medium comprising machine-executable code that, upon execution by one or more computer processors, implements a method for identifying an autoimmune disease drug target, the method comprising: (a) obtaining gene expression data generated by assaying an animal biological sample of a treated animal model, wherein the treated animal model is obtained by treating an autoimmune disease animal model with a drug configured to inhibit a drug target of the autoimmune disease; (b) processing the gene expression data to obtain an animal gene signature, wherein the animal gene signature comprises quantitative measures of a first set of genomic loci associated with autoimmune disease pathways of the autoimmune disease animal model; (c) obtaining a set of human gene signatures, wherein the set of human gene signatures comprises quantitative measures of a second set of genomic loci associated with up-regulation or down-regulation of human autoimmune disease pathways in human patients having active autoimmune disease, and wherein the set of human gene signatures is generated by assaying human biological samples from one or more human patients having the autoimmune disease to obtain gene expression data; (d) processing the animal gene signature with the set of human gene signatures to identify (i) an animal genomic locus from among the first set of genomic loci, and (ii) a human genomic locus from among the second set of genomic loci that is associated with up-regulation or down-regulation of one or more human autoimmune disease pathways, wherein the animal genomic locus and the human genomic locus are orthologous and share similarity in expression patterns and function; and (e) identifying the drug target as the autoimmune disease drug target when the quantitative measure of the animal genomic locus of the animal gene signature is indicative of up-regulation or down-regulation of an autoimmune disease pathway of the autoimmune disease animal model.

In some embodiments, the autoimmune disease animal model is selected from: a mouse model, a rat model, a cat model, a dog model, a rabbit model, a guinea pig model, a hamster model, a pig model, a horse model, and a primate model. In some embodiments, the autoimmune disease animal model comprises a mouse model. In some embodiments, the autoimmune disease comprises lupus. In some embodiments, the lupus comprises systemic lupus erythematosus (SLE) or discoid lupus erythematosus (DLE). In some embodiments, the drug target is HDAC6. In some embodiments, the drug target is HDAC6 or a portion thereof. In some embodiments, the drug is an HDAC6 inhibitor. In some embodiments, the HDAC6 inhibitor is ACY-738. In some embodiments, the animal biological sample or the human biological samples comprise one or more of a bodily fluid sample, a blood sample, a cell sample, and a tissue sample. In some embodiments, the one or more human autoimmune disease pathways are selected from the pathways listed in Table 61, Table 62, Table 63, and Table 64. In some embodiments, the human genomic locus that is associated with up-regulation or down-regulation of the one or more human autoimmune disease pathways is selected from the genes listed in Table 59, Table 60, Table 61, Table 62, Table 65, Table 66, and Table 67. In some embodiments, the autoimmune disease pathways of the autoimmune disease animal model are selected from the pathways listed in Table 61, Table 62, Table 63, and Table 64. In some embodiments, the animal genomic locus is selected from the genes listed in Table 59, Table 60, Table 61, Table 62, Table 65, Table 66, and Table 67. 
In some embodiments, (d) comprises identifying (i) a plurality of animal genomic loci from among the first set of genomic loci, and (ii) a plurality of human genomic loci from among the second set of genomic loci that are associated with up-regulation or down-regulation of a plurality of human autoimmune disease pathways, wherein the plurality of animal genomic loci and the plurality of human genomic loci are pairwise orthologous and share similarities in expression patterns and function; and (e) comprises identifying the drug target as the autoimmune disease drug target when the quantitative measures of the plurality of animal genomic loci of the animal gene signature are indicative of up-regulation or down-regulation of a plurality of autoimmune disease pathways of the autoimmune disease animal model. In some embodiments, the plurality of human autoimmune disease pathways comprises between 2 and 5 different human autoimmune disease pathways. In some embodiments, the plurality of human autoimmune disease pathways comprises 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, about 25, about 30, about 35, about 40, about 45, about 50, about 60, about 70, about 80, about 90, about 100, or more than about 100 different human autoimmune disease pathways. In some embodiments, the autoimmune disease pathways of the autoimmune disease animal model comprise 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, about 25, about 30, about 35, about 40, about 45, about 50, about 60, about 70, about 80, about 90, about 100, or more than about 100 different autoimmune disease pathways. In some embodiments, the method further comprises determining the up-regulation or down-regulation of the autoimmune disease pathway of the autoimmune disease animal model based on determining a difference between the quantitative measure of the animal genomic locus of the animal gene signature and a reference quantitative measure of the animal genomic locus. 
In some embodiments, the method further comprises obtaining the reference quantitative measure of the animal genomic locus by, prior to (a), assaying an animal biological sample of the autoimmune disease animal model.
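
By way of non-limiting illustration, the determination of up-regulation or down-regulation from a difference between a quantitative measure and a reference quantitative measure may be sketched as follows. The sketch assumes that the quantitative measures are positive expression values and that the difference is expressed as a log2 fold change; the function name `regulation_call` and the cutoff `min_abs_log2fc` are hypothetical and are not values taken from the present disclosure.

```python
import math


def regulation_call(treated, reference, min_abs_log2fc=1.0):
    """Classify one locus as up-regulated, down-regulated, or unchanged.

    `treated` and `reference` are positive expression values for the same
    locus. `min_abs_log2fc` is a hypothetical cutoff chosen for illustration.
    Returns a (call, log2 fold change) tuple.
    """
    log2fc = math.log2(treated / reference)
    if log2fc >= min_abs_log2fc:
        return "up", log2fc
    if log2fc <= -min_abs_log2fc:
        return "down", log2fc
    return "unchanged", log2fc


if __name__ == "__main__":
    # Illustrative numbers only; not measured data.
    print(regulation_call(8.0, 2.0))
    print(regulation_call(1.0, 4.0))
```

Other difference measures (for example, an absolute difference or a z-score against a set of reference samples) may be substituted without changing the overall flow.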

In another aspect, the present disclosure provides a method for evaluating a drug candidate for an autoimmune disease, the method comprising: (a) treating an autoimmune disease animal model with the drug candidate for the autoimmune disease, thereby producing a treated animal model; (b) assaying an animal biological sample of the treated animal model to obtain gene expression data of the treated animal model; (c) processing the gene expression data to obtain an animal gene signature, wherein the animal gene signature comprises quantitative measures of a first set of genomic loci associated with autoimmune disease pathways of the autoimmune disease animal model; (d) obtaining a set of human gene signatures, wherein the set of human gene signatures comprises quantitative measures of a second set of genomic loci associated with up-regulation or down-regulation of human autoimmune disease pathways in human patients having active autoimmune disease, and wherein the set of human gene signatures is generated by assaying human biological samples from one or more human patients having the autoimmune disease to obtain gene expression data; (e) processing the animal gene signature with the set of human gene signatures to identify (i) an animal genomic locus from among the first set of genomic loci, and (ii) a human genomic locus from among the second set of genomic loci that is associated with up-regulation or down-regulation of one or more human autoimmune disease pathways, wherein the animal genomic locus and the human genomic locus are orthologous and share similarity in expression patterns and function; and (f) evaluating the efficacy of the drug candidate for the autoimmune disease based at least in part on whether the quantitative measure of the animal genomic locus of the animal gene signature is indicative of up-regulation or down-regulation of an autoimmune disease pathway of the autoimmune disease animal model.

In another aspect, the present disclosure provides a computer-implemented method for evaluating a drug candidate for an autoimmune disease, the method comprising: (a) obtaining gene expression data generated by assaying an animal biological sample of a treated animal model, wherein the treated animal model is obtained by treating an autoimmune disease animal model with the drug candidate for the autoimmune disease; (b) processing the gene expression data to obtain an animal gene signature, wherein the animal gene signature comprises quantitative measures of a first set of genomic loci associated with autoimmune disease pathways of the autoimmune disease animal model; (c) obtaining a set of human gene signatures, wherein the set of human gene signatures comprises quantitative measures of a second set of genomic loci associated with up-regulation or down-regulation of human autoimmune disease pathways in human patients having active autoimmune disease, and wherein the set of human gene signatures is generated by assaying human biological samples from one or more human patients having the autoimmune disease to obtain gene expression data; (d) processing the animal gene signature with the set of human gene signatures to identify (i) an animal genomic locus from among the first set of genomic loci, and (ii) a human genomic locus from among the second set of genomic loci that is associated with up-regulation or down-regulation of one or more human autoimmune disease pathways, wherein the animal genomic locus and the human genomic locus are orthologous and share similarity in expression patterns and function; and (e) evaluating the efficacy of the drug candidate for the autoimmune disease based at least in part on whether the quantitative measure of the animal genomic locus of the animal gene signature is indicative of up-regulation or down-regulation of an autoimmune disease pathway of the autoimmune disease animal model.
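
By way of non-limiting illustration, the processing of an animal gene signature with a human gene signature to identify orthologous loci may be sketched as follows. The sketch assumes that each signature is a mapping from a locus to a signed quantitative measure (for example, a log fold change), that an ortholog table is available as a mapping from animal loci to human loci, and that "similarity in expression patterns" is read as a change in the same direction. The function name, the locus names, and the values are hypothetical placeholders.

```python
def concordant_orthologs(animal_sig, human_sig, ortholog_map):
    """Return (animal_locus, human_locus) pairs that change in the same direction.

    animal_sig, human_sig: dict mapping locus -> signed quantitative measure.
    ortholog_map: dict mapping animal locus -> orthologous human locus.
    """
    pairs = []
    for animal_locus, animal_value in animal_sig.items():
        human_locus = ortholog_map.get(animal_locus)
        # Skip loci with no known ortholog or no measurement in the human signature.
        if human_locus is None or human_locus not in human_sig:
            continue
        # A positive product means both are up-regulated or both are down-regulated.
        if animal_value * human_sig[human_locus] > 0:
            pairs.append((animal_locus, human_locus))
    return pairs


if __name__ == "__main__":
    # Placeholder loci and values for illustration only.
    animal = {"mGene1": 1.5, "mGene2": -0.7, "mGene3": 2.0, "mGene4": 0.4}
    human = {"hGene1": 0.9, "hGene2": 1.1, "hGene3": 3.0}
    orthologs = {"mGene1": "hGene1", "mGene2": "hGene2", "mGene3": "hGene3"}
    print(concordant_orthologs(animal, human, orthologs))
```

A functional-similarity criterion (for example, shared pathway annotation) could be added as a further filter on the returned pairs.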

In another aspect, the present disclosure provides a computer system for evaluating a drug candidate for an autoimmune disease, comprising: a database that is configured to store gene expression data generated by assaying an animal biological sample of a treated animal model, wherein the treated animal model is obtained by treating an autoimmune disease animal model with the drug candidate for the autoimmune disease; and one or more computer processors operatively coupled to the database, wherein the one or more computer processors are individually or collectively programmed to: (i) process the gene expression data to obtain an animal gene signature, wherein the animal gene signature comprises quantitative measures of a first set of genomic loci associated with autoimmune disease pathways of the autoimmune disease animal model; (ii) obtain a set of human gene signatures, wherein the set of human gene signatures comprises quantitative measures of a second set of genomic loci associated with up-regulation or down-regulation of human autoimmune disease pathways in human patients having active autoimmune disease, and wherein the set of human gene signatures is generated by assaying human biological samples from one or more human patients having the autoimmune disease to obtain gene expression data; (iii) process the animal gene signature with the set of human gene signatures to identify (1) an animal genomic locus from among the first set of genomic loci, and (2) a human genomic locus from among the second set of genomic loci that is associated with up-regulation or down-regulation of one or more human autoimmune disease pathways, wherein the animal genomic locus and the human genomic locus are orthologous and share similarity in expression patterns and function; and (iv) evaluate the efficacy of the drug candidate for the autoimmune disease based at least in part on whether the quantitative measure of the animal genomic locus of the animal gene signature is indicative of up-regulation or down-regulation of an autoimmune disease pathway of the autoimmune disease animal model.

In another aspect, the present disclosure provides a non-transitory computer readable medium comprising machine-executable code that, upon execution by one or more computer processors, implements a method for evaluating a drug candidate for an autoimmune disease, the method comprising: (a) treating an autoimmune disease animal model with the drug candidate for the autoimmune disease, thereby producing a treated animal model; (b) assaying an animal biological sample of the treated animal model to obtain gene expression data of the treated animal model; (c) processing the gene expression data to obtain an animal gene signature, wherein the animal gene signature comprises quantitative measures of a first set of genomic loci associated with autoimmune disease pathways of the autoimmune disease animal model; (d) obtaining a set of human gene signatures, wherein the set of human gene signatures comprises quantitative measures of a second set of genomic loci associated with up-regulation or down-regulation of human autoimmune disease pathways in human patients having active autoimmune disease, and wherein the set of human gene signatures is generated by assaying human biological samples from one or more human patients having the autoimmune disease to obtain gene expression data; (e) processing the animal gene signature with the set of human gene signatures to identify (i) an animal genomic locus from among the first set of genomic loci, and (ii) a human genomic locus from among the second set of genomic loci that is associated with up-regulation or down-regulation of one or more human autoimmune disease pathways, wherein the animal genomic locus and the human genomic locus are orthologous and share similarity in expression patterns and function; and (f) evaluating the efficacy of the drug candidate for the autoimmune disease based at least in part on whether the quantitative measure of the animal genomic locus of the animal gene signature is indicative of up-regulation or down-regulation of an autoimmune disease pathway of the autoimmune disease animal model.

Another aspect of the present disclosure provides a non-transitory computer readable medium comprising machine executable code that, upon execution by one or more computer processors, implements any of the methods above or elsewhere herein.

Another aspect of the present disclosure provides a system comprising one or more computer processors and computer memory coupled thereto. The computer memory comprises machine executable code that, upon execution by the one or more computer processors, implements any of the methods above or elsewhere herein.

Additional aspects and advantages of the present disclosure will become readily apparent to those skilled in this art from the following detailed description, wherein only illustrative embodiments of the present disclosure are shown and described. As will be realized, the present disclosure is capable of other and different embodiments, and its several details are capable of modifications in various obvious respects, all without departing from the disclosure. Accordingly, the drawings and description are to be regarded as illustrative in nature, and not as restrictive.

Certain Embodiments

Provided herein are methods comprising: assaying an isolated biological sample from a subject to generate a dataset comprising gene expression data, the assaying comprising: (a) performing an analysis with a microarray thereby measuring a concentration of a nucleic acid sequence from the biological sample or an amplicon thereof; (b) performing an RNA-Seq analysis to analyze the transcriptome of a biological sample by sequencing a complementary DNA (cDNA) synthesized from a nucleic acid sequence (RNA) from the biological sample or an amplicon thereof; or (c) performing quantitative polymerase chain reaction (qPCR) to measure the enrichment of a nucleic acid sequence in the biological sample or an amplicon thereof; and using a computer comprising a non-transitory computer-readable storage medium encoded with a computer program including instructions executable by a processor to run an application for identifying and comparing (i) the gene expression data generated from assaying the isolated biological sample to (ii) a reference gene expression data set comprising a plurality of disease-associated genomic loci; electronically outputting a report detailing the comparison of (i) the gene expression data generated from assaying the isolated biological sample to (ii) the reference gene expression data set comprising the plurality of disease-associated genomic loci; wherein the report: (i) identifies an immunological state of the subject at an accuracy of at least about 70%; (ii) identifies a disease state or a susceptibility thereof of the subject at an accuracy of at least about 70%; (iii) identifies if the subject is likely to respond to a treatment comprising administration of a drug selected from: an immunoregulator, an immunosuppressant, a steroid, an anti-inflammatory, a JAK inhibitor, a TNF inhibitor, a baricitinib, a corticosteroid, a nonsteroidal anti-inflammatory drug (NSAID), a tofacitinib, a TYK2 inhibitor, a TYK2/JAK inhibitor, a combination inhibitor, a monoclonal antibody, an anti-TNF biologic, an anti-IL-6 biologic, an anti-IL-17 biologic, an anti-IL-12/23 biologic, and an anti-CD28 biologic, or combinations thereof; and/or (iv) identifies an effectiveness of the treatment of the subject as compared to the disease state or disease progression; wherein: the disease state is associated with the plurality of disease-associated genomic loci; the plurality of disease-associated genomic loci comprises one or more genes associated with a gene cluster of Table 1 to Table 72C; or the plurality of disease-associated genomic loci comprises at least 5 genes associated with a module of Table 8; the disease state is selected from: a chronic condition, an inflammatory condition, an autoimmune condition, an arthritis, a rheumatoid arthritis (RA), an early inflammatory arthritis (EIA), an inflammatory arthritis, or combinations thereof; the isolated biological sample is selected from a group consisting of: a whole blood (WB) sample, a peripheral blood mononuclear cell (PBMC) sample, a tissue sample, and a purified cell sample; and optionally wherein the method for assaying a biological sample derived from a subject comprises purifying the biological sample derived from the subject to obtain the purified cell sample. In some embodiments, the disease-associated genomic loci comprise 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, or more than 50 genes associated with the gene cluster. In some embodiments, the disease-associated genomic loci comprise 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, or more than 50 genes associated with a biological pathway. In some embodiments, the disease state is the arthritis. In some embodiments, the disease state is the rheumatoid arthritis. In some embodiments, the disease state is the early inflammatory arthritis. In some embodiments, the disease state is the inflammatory arthritis. In some embodiments, the disease state is the chronic condition. In some embodiments, the disease state is the inflammatory condition. 
In some embodiments, the disease state is the autoimmune condition. In some embodiments, the treatment comprises administration of a drug to the subject. In some embodiments, the treatment comprises parenteral administration of a drug to the subject. In some embodiments, the treatment comprises administration for at least zero weeks, at least 16 weeks, at least 52 weeks, at least 1 year, at least 2 years, at least 3 years, at least 4 years, at least 5 years, at least 6 years, at least 7 years, at least 8 years, at least 9 years, at least 10 years, at least 15 years, at least 20 years, at least 30 years, at least 35 years, at least 40 years, at least 45 years, at least 50 years, or at least the patient lifespan. In some embodiments, the treatment is adjusted as a function of the gene expression data. In some embodiments, the gene expression data is used to identify a drug for the treatment of the disease state. In some embodiments, the report comprises nucleic acid sequencing data, transcriptome data, genome data, epigenetic data, proteome data, metabolome data, virome data, methylome data, lipidomic data, lineage-ome data, nucleosomal occupancy data, a genetic variant, a gene fusion, an indel, or combinations thereof. In some embodiments, the report comprises different formats. In some embodiments, the report comprises data from different sources, different studies, or combinations thereof. In some embodiments, the data is used to define a phenotype. In some embodiments, the phenotype comprises a disease state, an organ involvement, a medication response, or any combination thereof.

BRIEF DESCRIPTION OF THE DRAWINGS

The patent application file contains at least one drawing executed in color. Copies of this patent application publication with color drawing(s) will be provided by the Office upon request and payment of the necessary fee.

The novel features of the disclosure are set forth with particularity in the appended claims. A better understanding of the features and advantages of the present disclosure will be obtained by reference to the following detailed description that sets forth illustrative embodiments, in which the principles of the disclosure are utilized, and the accompanying drawings of which:

FIG. 1 shows an example of a flow chart for a method of identifying one or more records, in accordance with disclosed embodiments.

FIG. 2A shows the z-scores determined by an example of differential expression analysis of disease state compared to status of the 100 most significant records within a first plurality of records, in accordance with disclosed embodiments.

FIG. 2B shows the z-scores determined by an example of differential expression analysis of active disease state compared to status of the 100 most significant records within a second plurality of records, in accordance with disclosed embodiments.

FIG. 2C shows the z-scores determined by an example of differential expression analysis of active disease state compared to status of the 100 most significant records within a third plurality of records, in accordance with disclosed embodiments.

FIG. 2D shows the z-scores determined by an example of differential expression analysis of active disease state compared to the combined records within the first, second, and third pluralities of records, in accordance with disclosed embodiments.

FIG. 2E shows the enrichment scores determined by an example of differential expression analysis of active disease state across a selected set of records compared to the first, second, and third pluralities of records, in accordance with disclosed embodiments.

FIG. 3 shows an example of a Venn diagram of the top 100 records within each of the first, second, and third pluralities of records, in accordance with disclosed embodiments.

FIG. 4A shows an example of Gene Set Variation Analysis (GSVA) enrichment scores and standard deviations for a first plurality of records, in accordance with disclosed embodiments.

FIG. 4B shows an example of GSVA enrichment scores and standard deviations for a second plurality of records, in accordance with disclosed embodiments.

FIG. 5 shows an example of Receiver Operating Characteristic (ROC) curves and the area under each curve for machine learning classifiers under different test conditions, in accordance with disclosed embodiments.
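
By way of non-limiting illustration, the area under a ROC curve may be computed from classifier scores as sketched below. The sketch uses the rank (Mann-Whitney) formulation, in which the area equals the fraction of positive/negative pairs that the classifier orders correctly; the function name `roc_auc` and the example labels and scores are hypothetical.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    labels: sequence of 0/1 values (1 = positive class, e.g. active disease).
    scores: classifier outputs, higher meaning "more likely positive".
    A tie between a positive and a negative score counts as half a win.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Illustrative labels and scores only.
    print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```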

FIG. 6A shows an example of variable importance values of records as determined by mean decrease in Gini impurity, in accordance with disclosed embodiments.

FIG. 6B shows an example of variable importance values of de-duplicated records as determined by mean decrease in Gini impurity, in accordance with disclosed embodiments.

FIG. 6C shows an example of variable importance values of the top 25 individual genes determined by mean decrease in Gini impurity, in accordance with disclosed embodiments.
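
By way of non-limiting illustration, the decrease in Gini impurity produced by a single split of a decision tree may be computed as sketched below. In a random forest, a variable importance of this kind is generally obtained by accumulating such decreases over the nodes that split on a given variable and averaging over the trees; the sketch shows only the per-split quantity, and the function names and class labels are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a collection of class labels: 1 - sum(p_k ** 2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_decrease(parent, left, right):
    """Impurity of the parent node minus the size-weighted impurity of its children."""
    n = len(parent)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted


if __name__ == "__main__":
    parent = ["active", "active", "inactive", "inactive"]
    # A split that separates the classes perfectly removes all impurity.
    print(gini_decrease(parent, ["active", "active"], ["inactive", "inactive"]))
    # A split that leaves each child as mixed as the parent removes none.
    print(gini_decrease(parent, ["active", "inactive"], ["active", "inactive"]))
```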

FIG. 7 shows a non-limiting schematic diagram of a digital processing device; in this case, a device with one or more CPUs, a memory, a communication interface, and a display.

FIG. 8 shows a non-limiting schematic diagram of a web/mobile application provision system; in this case, a system providing browser-based and/or native mobile user interfaces.

FIG. 9 shows a non-limiting schematic diagram of a cloud-based web/mobile application provision system; in this case, a system comprising an elastically load balanced, auto-scaling web server and application server resources as well as synchronously replicated databases.

FIG. 10A shows an example of heatmaps of −log10(overlap p values) from rank-rank hypergeometric overlap (RRHO) analysis, in accordance with disclosed embodiments. Strongest overlaps near the center of each plot indicate weak agreement among the most significantly upregulated and downregulated genes from each data set. Strong agreement between data sets may be indicated by a diagonal from the bottom-left corner to the top-right corner.
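
By way of non-limiting illustration, the value plotted in one cell of such a heatmap may be computed as sketched below. The sketch assumes that each cell corresponds to one pair of rank thresholds, one per data set, and that the overlap p value is the upper tail of a hypergeometric distribution; the function name and the example counts are hypothetical.

```python
import math


def overlap_neg_log10_p(n_total, n_top_a, n_top_b, n_overlap):
    """-log10 of P(X >= n_overlap) under a hypergeometric null.

    n_total: number of genes ranked in both data sets.
    n_top_a, n_top_b: number of genes above the rank threshold in each data set.
    n_overlap: number of genes above the threshold in both.
    Raises ValueError if the p value underflows to zero.
    """
    denom = math.comb(n_total, n_top_b)
    upper = min(n_top_a, n_top_b)
    tail = sum(
        math.comb(n_top_a, k) * math.comb(n_total - n_top_a, n_top_b - k)
        for k in range(n_overlap, upper + 1)
    )
    return -math.log10(tail / denom)


if __name__ == "__main__":
    # Illustrative counts only: 10 genes, top 3 in each list, all 3 shared.
    print(overlap_neg_log10_p(10, 3, 3, 3))
```

Repeating the calculation over a grid of threshold pairs yields the full heatmap.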

FIG. 10B shows an example of clustering all three studies on three consistent DE genes, in accordance with disclosed embodiments. DNAJC13, IRF4, and RPL22 were consistently differentially expressed in each study yet fail to fully separate active from inactive patients. Orange bars denote active patients; black bars denote inactive patients. Blue, yellow, and red bars denote patients from GSE39088, GSE45291, and GSE49454, respectively.

FIG. 11 shows GSVA results of a lupus Illuminate gene set, demonstrating the striking heterogeneity in SLE patient WB by showing patient-specific enrichment of 27 cell-specific and process-specific modules of genes. In order to understand pathogenic mechanisms of SLE, a big data analysis approach may be used on purified cell populations implicated in SLE to help understand aberrant cell-specific mechanisms.

FIG. 12 shows an example of cellular gene modules providing a basis for machine learning predictions of SLE activity, in accordance with disclosed embodiments. GSVA was performed on three SLE WB datasets using 25 WGCNA modules made from purified SLE cells with correlation or published relationship to SLEDAI. Orange: active patient; black: inactive patient. LDG: low-density granulocyte; PC: plasma cell.

FIGS. 13A and 13B show an example of individual WGCNA modules being ineffective at separating active and inactive SLE subjects, in accordance with disclosed embodiments. GSVA enrichment scores for CD4_Floralwhite (FIG. 13A) and CD4_Orangered4 (FIG. 13B) in SLE WB are unable to fully separate active patients from inactive patients. Asterisks denote significant differences by Welch's t-test. Error bars indicate mean±standard deviation.
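
By way of non-limiting illustration, the test statistic and degrees of freedom of Welch's t-test, which compares two group means without assuming equal variances, may be computed as sketched below. The function name and the example values are hypothetical; converting the statistic to a p value requires the cumulative distribution function of Student's t distribution, which is omitted here.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom).

    a, b: sequences of at least two numeric values each, e.g. enrichment
    scores for two patient groups.
    """
    va = variance(a) / len(a)  # squared standard error of the mean, group a
    vb = variance(b) / len(b)  # squared standard error of the mean, group b
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


if __name__ == "__main__":
    # Illustrative values only; not measured enrichment scores.
    print(welch_t([1, 2, 3, 4, 5], [2, 4, 6, 8, 10]))
```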

FIG. 14 shows an example of performance of machine learning classifiers across three independent data sets, in accordance with disclosed embodiments. Classifiers were trained on the data sets listed across the top and evaluated in the data sets listed across the bottom. Data sets are listed by their GEO accession numbers. Expression (black): gene expression data. WGCNA (blue): module enrichment scores.

FIG. 15 shows an example of area under the ROC curve of machine learning classifiers across three independent data sets, in accordance with disclosed embodiments. Classifiers were trained on the data sets listed across the top and tested in the other two data sets. Data sets are listed by their GEO accession numbers. Expression (black): gene expression data. WGCNA (blue): module enrichment scores.

FIGS. 16A-16C show an example of a random forest classifier revealing variable importance of genes and modules, in accordance with disclosed embodiments. FIG. 16A shows variable importance of the top 25 individual genes as determined by mean decrease in Gini impurity. FIG. 16B shows variable importance of cell modules. FIG. 16C shows variable importance of de-duplicated modules: because many modules shared genes, modules were de-duplicated to determine the effects on the random forest classifier. The relative importance of the full modules and de-duplicated modules was strongly correlated (Spearman's rho=0.69, p=1.94E−4). LDG: low-density granulocyte; PC: plasma cell.
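
By way of non-limiting illustration, Spearman's rho between two sets of importance values may be computed as the Pearson correlation of their ranks, as sketched below. The function names and the example values are hypothetical, and tied values receive the average of the ranks they span.

```python
def average_ranks(values):
    """Return 1-based ranks of `values`, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


if __name__ == "__main__":
    # Illustrative values only; not the importance values of FIGS. 16B and 16C.
    print(spearman_rho([1, 2, 3, 4, 5], [5, 6, 7, 8, 7]))
```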

FIG. 17 shows a heat map showing the variation of gene expression in normal controls. Differentially expressed (DE) transcripts pertaining to cell type and process signatures in 10 SLE whole blood and peripheral blood mononuclear cell microarray datasets were used to create modules of genes potentially enriched in SLE patients determined by Gene Set Variation Analysis (GSVA).

FIG. 19 shows PCA and heatmap clustering of AA, EA, and NAA SLE Patients not taking steroids for 9 GSVA enrichment modules negative in healthy controls (HC). The cell cycle and Low Up modules were removed, GSVA enrichment scores for the 9 remaining modules were uploaded to ClustVis, and PCA plots and heatmaps were generated. Heatmaps were generated using correlation clustering distance for both rows and columns.

FIG. 21 shows heatmap clustering of SLE patients by enrichment of 10 immunologically related modules. SLE patients were grouped on the basis of having a negative PC1 loading score (plasma cell, left), a positive PC1 loading score (myeloid, middle), or no enrichment of the 10 modules (No Sig, right). SLE patients within Plasma Cell or Myeloid that also expressed the opposite signature, as defined by either having a Mono GSVA enrichment score of at least 0.1, are identified by black boxes.

FIGS. 23A-23D show the correlation between clinical measures of disease activity and WGCNA modules. Patients were divided into sub-groups based on their expression of positive eigengenes for each category. Significant differences in clinical traits between groups were determined using PRISM v7 Tukey's multiple comparison test, and p values are shown between groups when less than or equal to 0.05.

FIG. 24 shows mean GSVA scores of patients in each cluster defined by GMM. Numbers at the top denote the number of patients in each cluster.

FIG. 25 shows gene expression of subjects in groups defined by GMVAE. GSVA analysis of the patients in these clusters showed that the patients without serological SLE activity (clusters 3 and 5) also did not show immunological activity by gene expression, whereas the other clusters did show immunological activity.

FIGS. 26A-26D show limma differential expression (DE) analysis of AA, EA, and NAA SLE patients to each other, including determining thousands of DE transcripts for each ancestry compared to the others for the ILL1 dataset.

FIG. 27A shows that in EA SLE patients, transcripts for monocytes and low-density granulocytes (LDGs) were enriched in the ILL1 and ILL2 datasets compared to AA SLE patients, whereas T cell and MHC class II transcripts were enriched in EA patients compared to NAA patients. NAA patients had increased myeloid signatures, including transcripts associated with monocytes, LDGs, and neutrophils compared to both AA and EA patients.

FIG. 27B shows that, similar to the results using the ILL1 and ILL2 datasets, EA SLE patients were enriched for transcripts associated with myeloid cells, and AA SLE patients were enriched for transcripts associated with plasma cells, B cells, and T cells.

FIG. 28A shows results of gene set variation analysis (GSVA) employed to compare enrichment of 34 modules of genes corresponding to lymphocytes, myeloid cells, cellular processes, as well as groups of all the T Cell Receptor (TCR) and immunoglobulin (Ig) genes found on the Affymetrix HTA2.0 array.

FIGS. 28B-28C show that the AA and NAA patient groups had significantly more SLE patients with platelet and erythrocyte enrichment than EA patients, and significantly fewer patients with decreased erythrocyte and platelet GSVA scores compared to EA patients.

FIG. 28D shows an orthogonal approach using weighted gene co-expression network analysis (WGCNA) to confirm the association of ancestry with cellular signatures. WGCNA of GSE88884 ILL1 and ILL2 was performed separately, and results demonstrated a significant (p<0.05) positive association by Pearson correlation of AA ancestry to plasma cell, T cell, and FOXP3 T cell modules, as well as a significant negative correlation to granulocyte and myeloid cell WGCNA modules.

FIG. 29 shows a comparison of patients on specific therapies to patients not receiving the therapies for the 34 cell type and process modules, in order to determine the effect of SOC drugs on patient gene expression signatures.

FIGS. 30A-30C show a comparison of LDG, monocyte, and T cell GSVA scores for patients with or without corticosteroids, demonstrating that the corticosteroids were the largest contributor to the differences between patient LDG, monocyte, and T cell scores, but that AA patients still had lower LDG and monocyte scores and NAA patients still had lower T cell scores in the absence of corticosteroids.

FIG. 30D shows that MTX and MMF significantly lowered plasma cell GSVA scores, but did not negate the increased plasma cells determined for AA patients versus EA and NAA patients.

FIG. 30E shows that compensating for AZA treatment also did not offset the increased B cells in AA SLE patients.

FIG. 30F shows that compensating for AZA treatment also did not offset the difference in NK cells between EA and NAA SLE patients.

FIG. 31A shows a comparison of GSVA enrichment scores for the 34 modules for patients with each manifestation individually to all other manifestations, in order to determine the association between different SLE manifestations and gene expression profiles.

FIG. 31B shows that a comparison of the gene expression profiles of patients with anti-dsDNA, anti-RNP, or both to those of the 64 patients in this subset without anti-RNP or anti-dsDNA autoantibodies showed significant increases in GSVA enrichment scores for IFN (anti-dsDNA, p=0.0023; anti-RNP, p=0.0323; both, p<0.0001), plasma cells (anti-dsDNA, p=0.01; anti-RNP and both, p<0.0001), Ig (anti-dsDNA, p=0.0039; anti-RNP and both, p<0.0001), and cell cycle (anti-dsDNA, p=0.0003; anti-RNP and both, p<0.0001).

FIG. 32A shows a comparison of patients positive for both Low C and anti-dsDNA with and without specific drugs or manifestations for cell specific GSVA scores, to determine whether autoantibodies and complement levels or drugs contributed more to the relationship with specific GSVA signatures.

FIG. 32B shows that 90% of patients with both Low C and anti-dsDNA were also receiving corticosteroids, and patients taking corticosteroids had significantly increased LDG GSVA scores, demonstrating that the increase in LDGs observed in patients with anti-dsDNA and Low C was related to concomitant corticosteroid usage, and not the presence of anti-dsDNA and Low C.

FIGS. 32C-32D show that the increase in IFN signature observed in EA and AA SLE patients on corticosteroids was related to the disproportionate numbers of patients with Low C and anti-dsDNA in the corticosteroid population, 39%, versus only 13% of the patients not taking corticosteroids who had both Low C and anti-dsDNA.

FIGS. 32E-32F show that in EA SLE patients, decreased NK cells were detected in those with anti-dsDNA or Low C. The effect was related to 23% of patients with Low C and anti-dsDNA also being on AZA (FIG. 32E) compared to only 15% of patients without low C or anti-dsDNA taking AZA (FIG. 32F) and thus not directly related to having anti-dsDNA and Low C.

FIGS. 32G-32H show that separation of vasculitis patients by anti-dsDNA and Low C demonstrated that the significant increases in plasma cell and IFN GSVA scores were likely related to the patients also having both anti-dsDNA and Low C, as there was a significant increase in GSVA enrichment scores for IFN and plasma cells in vasculitis patients with both anti-dsDNA and Low C (plasma cell mean difference=0.2873, p=0.0013; IFN mean difference=0.3889, p<0.0001).

FIG. 33A shows GSVA enrichment scores calculated for the 34 cell and process modules for 14 AA, 93 EA, and 17 NAA GSE88884 ILL1 and ILL2 male patients and male HC, to determine whether ancestral differences are also observed in male lupus subjects.

FIG. 33B shows that the combination of anti-dsDNA and Low C was associated with positive plasma cell signatures, as was detected for female SLE patients.

FIGS. 33C-33E show results of using EA SLE patients to determine differences between female patients and male patients with SLE. Because of the large number of female patients, the sets of female patients and male patients were able to be balanced for the percentage of patients on corticosteroids, AZA, and MTX/MMF. Further, the female patients were divided into two age groups, 25-49 years and over 50 years, because of the effects of estrogen on immune responses.

FIG. 34A shows gene expression analysis of adult, self-described AA and EA HC subjects carried out on two separate microarray datasets of normal subjects of different ancestries, in order to demonstrate that gene expression differences detected between SLE patients are related to heritable differences manifesting in expressed genes in hematopoietic cells of healthy subjects of different ancestries.

FIG. 34B shows that I-scope analysis of the transcripts increased in healthy AA patients demonstrated an increase in B cell, dendritic, erythrocyte, and platelet associated transcripts compared to EA HC subjects, and an increase in granulocyte, monocyte, and myeloid transcripts in healthy EA subjects compared to AA HC subjects.

FIG. 35 shows a CIRCOS visualization of the odds ratios for each variable significantly (p<0.05) contributing to each GSVA enrichment score. Ancestry significantly influenced 21 of the 34 cell type and process module scores.

FIGS. 38A-38C contain plots showing that GSVA reveals potential pathways for therapeutic targeting in lupus affected tissues. Measures are shown for drug pathways significantly enriched in SLE affected tissue compared to control tissue as determined using the Welch's t-test for B cell activating factor (BAFF) (FIG. 38A), interleukin (IL-6) (FIG. 38B), and CD40 signaling in DLE, LA, and LN Glom (FIG. 38C). ** p<0.01, *** p<0.001.

FIG. 38D shows that genes commonly dysregulated in lupus tissues identified immune processes and cellular metabolism.

FIG. 38E shows that functional grouping and pathway analysis of DE genes expressed in lupus tissues revealed immune and metabolic abnormalities in common.

FIG. 38F shows that similar cellular and metabolic signatures were observed in lupus tissues.

FIG. 38G shows that increased immune/inflammatory cell signatures were observed in lupus tissues.

FIG. 38H shows that decreased tissue stromal cell signatures were observed in lupus tissues.

FIG. 38I shows that decreased metabolic signatures were observed in lupus tissues.

FIG. 38J contains plots showing the correlation between immune/inflammatory or tissue cell signature and metabolic signature in DLE and LN (LN GL and LN TI).

FIGS. 38K-38L show that Classification and Regression Trees (CART) analysis predicted the contributors to metabolic dysfunction.

FIG. 38M shows that Class 2 LN glomerulus demonstrated similar metabolic defects, indicating dysregulation is linked to stromal cells.

FIG. 38N contains plots showing the correlation between tissue or immune/inflammatory cell signature and metabolic signature for Class 2 LN glomerulus.

FIGS. 38O-38P contain plots showing that metabolic changes were not correlated with T cells in LN GL.

FIG. 39 contains plots showing results from mapping a total of 908 Immunochip SNPs to 252 eQTLs and coupling them to 760 E-Genes (207 in EAs, 30 in AAs, 523 shared), including (A) a Venn of E-Gene overlap and (B) a Cytoscape visualization of E-Gene PPI networks using MCODE clustering.

FIGS. 40A-40C show a non-limiting example of using interferon (IFN) subtype signatures to separate SLE patients from healthy controls (HC), using the systems and methods herein. FIG. 40A is a Venn diagram of the overlap of transcripts induced in human PBMC after 24-hour treatment with IFNA2, IFNB1, IFNW1, or IFNG. A 200-gene signature common to the three type I IFNs (IFN Core, 146+54) was determined. Gene symbols for the induced transcripts for each IFN are listed in Tables 19-29. The induced transcripts from IFN or cytokine treatment of PBMC were used as enrichment groups for GSVA analysis of SLE patient PBMC (FDA PBMC) (FIG. 40B), or SLE whole blood (GSE49454) (FIG. 40C). A heatmap visualization uses red (enriched signature) for GSVA values above zero and blue (decreased signature) for GSVA values below zero to show differences between SLE patients and controls. SLE patients were considered positive for a signature if their GSVA enrichment score was greater than the average healthy control (HC) GSVA enrichment score plus two standard deviations. Most SLE patients displayed prominent type I IFN signatures. In patients SLE.9495 and SLE.9491, enriched PBMC-TNF signatures compared to IFN signatures are displayed, and patient SLE.9544* had no PBMC-IFN signature and was grouped with controls (FIG. 40C).
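
By way of non-limiting illustration only, the positivity rule described for FIGS. 40B-40C (a GSVA enrichment score greater than the average healthy control score plus two standard deviations) can be sketched in Python as follows. The function name, the patient identifiers, and the scores are hypothetical. The use of the sample standard deviation is an assumption of this sketch, since the description does not state which estimator was used.

```python
from statistics import mean, stdev

def positive_for_signature(patient_scores, control_scores, n_sd=2.0):
    """Flag patients whose score exceeds mean(controls) + n_sd * SD(controls).

    patient_scores: dict mapping a patient identifier to a GSVA score.
    control_scores: list of healthy control GSVA scores.
    stdev() is the sample standard deviation (an assumption of this sketch).
    """
    threshold = mean(control_scores) + n_sd * stdev(control_scores)
    return {pid: score > threshold for pid, score in patient_scores.items()}

# Hypothetical scores for illustration only.
controls = [-0.2, -0.1, 0.0, 0.1, 0.2]
patients = {"P1": 0.5, "P2": 0.3}
calls = positive_for_signature(patients, controls)
```

With these hypothetical values the threshold is approximately 0.316, so P1 is called positive and P2 is not.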

FIGS. 41A-41D show a non-limiting example of using three interferon subtype signatures (IFNA2, IFNB1, and IFNW1) to separate SLE patients from healthy controls (HC), using the systems and methods herein. GSVA enrichment scores were calculated using the PBMC IFNA2, IFNB1, IFNW1, IFNG, IL12, or TNF induced transcripts, and a random signature (Random Gr1) (Table SD2), for discoid lupus erythematosus (DLE) and healthy control (HC) skin (FIG. 41A), SLE synovium and osteoarthritis synovium (FIG. 41B), lupus nephritis (LN) glomerulus (Glom) class III/IV and HC Glom (FIG. 41C), and LN tubulointerstitium (TI) class III/IV and HC tubulointerstitium (TI) (FIG. 41D). Hedge's G effect size (Effect) measures are shown for cytokine signatures significantly enriched in SLE affected tissues compared to control tissues as determined by a p value<0.05 using the Welch's t-test. For LN tissues, recalculation of effect size values without the five IFN negative tissues roughly doubled the effect size values for the type I IFNs. In particular, the effect size values obtained were: IFNW1 (Glom g=5.5, TI g=3.3), IL12 (Glom g=4.9, TI g=1.9), IFNG (Glom g=5.5, TI g=2.2), IFNB1 (Glom g=6.0, TI g=3.0), and IFNA2 (Glom g=6.6, TI g=3.1), but they were still lower than the effect size values calculated for the DLE and SLE synovium.
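
By way of non-limiting illustration only, the two statistics named above can be sketched in Python using only the standard library. The effect size uses the common small-sample correction factor 1 − 3/(4(n1+n2) − 9), which is an assumption of this sketch, since the description does not state which form of the correction was applied. The Welch function returns only the t statistic and the Welch-Satterthwaite degrees of freedom. Obtaining a p value would additionally require a t-distribution routine, which is not shown. All names and sample values are hypothetical.

```python
import math
from statistics import mean, variance

def hedges_g(a, b):
    """Standardized mean difference (pooled SD) with small-sample correction."""
    n1, n2 = len(a), len(b)
    pooled_sd = math.sqrt(
        ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / (n1 + n2 - 2)
    )
    d = (mean(a) - mean(b)) / pooled_sd
    correction = 1.0 - 3.0 / (4.0 * (n1 + n2) - 9.0)  # assumed approximation
    return d * correction

def welch_t(a, b):
    """Welch's unequal-variance t statistic and its degrees of freedom."""
    n1, n2 = len(a), len(b)
    se1, se2 = variance(a) / n1, variance(b) / n2
    t = (mean(a) - mean(b)) / math.sqrt(se1 + se2)
    df = (se1 + se2) ** 2 / (se1 ** 2 / (n1 - 1) + se2 ** 2 / (n2 - 1))
    return t, df

# Hypothetical GSVA scores for affected and control tissue.
affected = [2.0, 4.0, 6.0]
control = [1.0, 2.0, 3.0]
g = hedges_g(affected, control)
t_stat, dof = welch_t(affected, control)
```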

FIGS. 42A-42E show a non-limiting example of using whole blood (WB) interferon (IFN) signatures induced in IFNA2-treated hepatitis C (HepC) patients and IFNB1-treated multiple sclerosis (MS) patients to separate SLE patients from healthy controls (HC), using the systems and methods herein. FIG. 42A is a Venn diagram of the overlapping increased transcripts from MS-IFNB1, HepC-IFNA2, IFNA2, IFNB1, and IFNW1 signatures. FIGS. 42B-42E show GSVA using the increased transcripts of MS-IFNB1, HepC-IFNA2, and the transcripts from either signature restricted to only genes listed on the Interferome (Ifome; www.interferome.org) for DLE and HC skin (FIG. 42B), SLE synovium and OA (FIG. 42C), LN Glom Class III/IV and HC Glom (FIG. 42D), and LN TI Class III/IV and HC TI (FIG. 42E). Hedge's G effect size measures are shown for IFN signatures significantly enriched in SLE affected tissues compared to control tissues as determined by a p value<0.05 using the Welch's t-test. For LN tissues, removal of the five IFN negative SLE tissues doubled the effect size values for HepC-IFNA2 (Glom g=6.8, TI g=3.1) and MS-IFNB1 (Glom g=7.7, TI g=3.2).

FIG. 43 shows a non-limiting example of measuring a strong IFNB1 signature in cells and tissues from SLE patients, using the systems and methods herein. Z scores were calculated using the differential expression (DE) results from human PBMC treated with IFNA2, IFNB1, IFNW1, IFNG, IL12, TNF, MS patients treated with IFNB1 (MS-IFNB1), sepsis PBMC (control), and dermatomyositis skin (control) for SLE WB, PBMC, and affected tissues. Z scores>2 are considered significant. WB and PBMC datasets from active (SLEDAI≥6) and inactive (SLEDAI<6) SLE patients were divided and compared to the same controls separately before Z scores were calculated.

FIG. 44 shows a non-limiting example that IGS is readily detected in active and inactive SLE patients, using the systems and methods herein. Seven SLE datasets were divided into active SLE patients with SLEDAI≥6 (1722 patients total), or inactive SLE patients with SLEDAI<6 (315 patients total). GSVA enrichment scores were calculated for each patient using the IFN Core signature (such as IFNA2, IFNB1, IFNW1, MS-IFNB1, and HepC-IFNA2 signatures). IFN core positive patients had GSVA enrichment scores greater than 2 standard deviations from the average of the CTL GSVA enrichment scores.

FIGS. 45A-45F show a non-limiting example that SLE patients may lose or gain the IGS over time, using the systems and methods herein. An F test differential expression (DE) analysis of SLE patients on standard of care (SOC) treatment at zero weeks, 16 weeks, and 52 weeks from SLE time course datasets GSE88885 and GSE88886 was carried out, and GSVA enrichment scores were calculated using the IFN core signature. The dotted line represents the average IFN core GSVA score for the controls, but only patients are shown in the graphs. Changes in the IGS score of greater than 0.2 standard deviations were considered significant. For the GSE88885 SLE dataset, 54 SLE patients had minimal changes in their IGS (FIG. 45A), 18 SLE patients changed from negative to positive score (FIG. 45B), and 14 SLE patients changed from positive to negative enrichment score (FIG. 45C). For the GSE88886 SLE dataset, 23 SLE patients had minimal changes in their IFN core GSVA enrichment score (FIG. 45D), five SLE patients changed from negative to positive (FIG. 45E), and five SLE patients changed from positive to negative IGS enrichment score (FIG. 45F).

FIGS. 46A-46F show a non-limiting example that the IGS and SLEDAI do not change synchronously, using the systems and methods herein. Ten SLE LN patients with SLEDAI>6 (GSE72747) and healthy controls (HC) (n=46) from GSE39088 had F test differential expression (DE) analysis using time zero, 12-week, and 24-week WB samples (Treatment with high-dose immunosuppressive was begun after time zero and continued for 12 weeks; at 12 weeks, all patients were switched to lower dose/maintenance therapy). Graphs show the change in SLEDAI versus the change in the IFN core signature GSVA enrichment score (FIGS. 46A-46B). GSVA enrichment signatures corresponding to B cells, T cells, plasma cells, and monocytes were determined at each time-point, and most patients had standard deviations>0.2 between their zero and 12-week time-points (FIGS. 46C-46F). One-way ANOVA p values were <0.05 for comparison of mean GSVA enrichment scores for B cells, T cells, and monocytes between time zero and 12 weeks. Tukey's multiple comparison test between time zero and 12 weeks showed significant differences in mean GSVA enrichment scores for B cells (p=0.02), T cells (p=0.03), and monocytes (p=0.05), but not plasma cells.

FIGS. 47A-47C show a non-limiting example of performing linear regression analysis to demonstrate that the IFN signature is most closely related to monocyte cell surface transcripts, using the systems and methods herein. Linear regression analysis was performed using SLEDAI values from the patients of 5 SLE WB and 2 SLE PBMC datasets and the patient GSVA enrichment scores for cell type-specific signatures. FIG. 47A: Cell types or signatures with significant non-zero slopes (p<0.05) related to SLEDAI by linear regression analysis in at least half of the datasets which had determinable GSVA scores were used to determine overall significance of the regression lines and the r² predictive values for all 7 SLE datasets with available SLEDAI information. FIG. 47B shows a representative plot using the HepC-IFNA2 signature for the linear regression analysis between the IFN signature with overlapping transcripts to the cell type or process signatures removed and the cell type or process GSVA enrichment score for the patients from 10 SLE WB and PBMC datasets. Cell types or signatures significantly (p<0.05) related to HepC-IFNA2 score in at least half of the datasets which had determinable GSVA scores were used to determine overall regression lines for all 10 datasets. r² predictive values are listed after the GSVA enrichment category. Relationships and linear regression analysis can be performed likewise for the other IFN signatures. For time-course dataset GSE72747, linear regression analysis was done for the change in the core IFN GSVA score versus the change in monocyte cell surface score between 0 and 12 weeks and between 12 and 24 weeks (FIG. 47C).
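
By way of non-limiting illustration only, a simple least-squares regression of one score on another, together with the coefficient of determination r², can be sketched in Python as follows. The function name and the example values are hypothetical. The sketch computes the ordinary r² of a single fit and omits the significance test of the slope and the pooling across datasets described above.

```python
from statistics import mean

def linear_regression(x, y):
    """Ordinary least-squares fit y = intercept + slope * x.

    Returns (slope, intercept, r_squared).
    """
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    # r^2 = 1 - (residual sum of squares / total sum of squares)
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - my) ** 2 for b in y)
    return slope, intercept, 1.0 - ss_res / ss_tot

# Hypothetical IFN signature scores (x) and cell type scores (y).
ifn_scores = [1.0, 2.0, 3.0]
cell_scores = [1.0, 3.0, 2.0]
slope, intercept, r2 = linear_regression(ifn_scores, cell_scores)
```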

FIGS. 48A-48G show a non-limiting example that monocytes from inactive SLE patients have an interferon signature and elevated STAT1 transcripts, using the systems and methods herein. WGCNA was performed on datasets GSE38351 CD14+ monocytes (6 active (SLEDAI>6), 6 inactive (SLEDAI<6), and 12 control), GSE10325 CD4+ T cells (8 active, 4 inactive, and 9 control), and GSE10325 CD19+ B cells (10 active, 4 inactive, and 9 control), and individual patient eigengene values are shown for the IFN module from each dataset (FIGS. 48A-48C). The modules were correlated to presence of SLE disease (versus control) or the SLEDAI, and Pearson r values are shown for significant correlations for each WGCNA dataset (p<0.05). “NS” means not significant. SLEDAI values for each patient are listed at the end of the patient number with controls and patients with inactive disease (SLEDAI<6) noted by underlined text. GSVA enrichment scores were calculated using the IFN core signature for SLE and control samples of CD4+ T cells (FIG. 48D), CD19+ B cells (FIG. 48E), and CD14+ monocytes (FIG. 48F). Tukey's multiple comparisons test was used to determine significant differences between mean GSVA scores between controls, inactive and active patients. “*” indicates a p-value of <0.05 between active SLE and control or between inactive SLE and control; “**” indicates a p-value of <0.05 between active SLE and inactive SLE or between active SLE and control. Datasets of SLE WB, PBMC, purified CD14+ monocytes, T cells, and B cells were divided into active (SLEDAI≥6) and inactive (SLEDAI<6) for differential expression (DE) analysis to controls (FIG. 48G). The log fold change (LFC) for STAT1 is reported for each active and inactive dataset.

FIG. 49 shows a non-limiting example of transcripts from the in vitro treatment of PBMC with IFNA2, IFNB1, IFNW1, and IFNG (as described by, for example, Waddell, S. J. et al. Interferon-induced transcriptional programs in human peripheral blood cells. PLoS One 5(3): e9753 (2010), which is hereby incorporated by reference in its entirety). Transcripts shown were increased by a minimum fold change of 2 at a false discovery rate of 0.05 compared to mock treated PBMC. Unique transcripts for IFNA2, IFNB1, IFNW1, and IFNG were determined by comparison of the four signatures. The heatmap scale represents fold change.

FIGS. 50A-50E show a non-limiting example that Chiche-Chaussabel modules do not reflect a specific sub-type of IFN. Shown is the overlap of the three Chiche-Chaussabel interferon modules (IFN-M) with the Waddell transcripts induced by IFNA2 (FIG. 50A), IFNB1 (FIG. 50B), IFNW1 (FIG. 50C), and IFNG (FIG. 50D). Each IFN-M overlapped the IFNA2, IFNB1, and IFNW1 signatures with the same genes, except that IFI44L from M1.2 was found only in IFNA2, and DRAP1, NBN, and IRF9 from M5.12 were found only in the IFNB1-induced transcripts. Overlapping genes were found within the core IFN genes, not the unique IFN signatures (FIG. 50E).

FIG. 51 shows a non-limiting example that GSVA enrichment using random genes does not separate SLE patients from controls. Shown is a heatmap visualization of the GSVA enrichment scores for the Waddell IFNB1 increased transcripts (IFNB1) and two groups of random, not co-expressed transcripts derived from random sorting of dataset GSE49454 differential expression (DE) transcripts. Enrichment scores were calculated using these groups for all patients and controls in dataset GSE49454 (n=46).

FIGS. 52A-52D show a non-limiting example that a DMS-IFNB1 signature in multiple sclerosis (MS) patient whole blood (WB) confirms a strong IFNB1 signature. Shown are linear regression analyses using the MS-IFNB1 signature of increased and decreased transcripts with SLE active (SLEDAI≥6) whole blood (WB) (FIG. 52A), SLE active PBMC (FIG. 52B), DLE (FIG. 52C), and sepsis (FIG. 52D).

FIGS. 53A-53B show a non-limiting example that an MS-IFNB1 signature separates SLE cells and tissues. Shown are GSVA results using the MS-IFNB1 signature. Increased (IFNB UP) and decreased (IFNB DOWN) transcripts (DE to untreated multiple sclerosis patients) separated SLE patients from healthy controls (HC) in WB GSE49454 active (SLEDAI≥6) SLE patients (n=23) (FIG. 53A), and DLE GSE72535 (n=9) (FIG. 53B).

FIGS. 54A-54D show a non-limiting example that the alternative IFNB1 downstream signaling pathway does not predominate in SLE tissues. Murine IFN alpha/beta receptor 2 deficient mice were injected with IFNB1 into the peritoneum, and peritoneal exudate cells (PEC) were isolated for microarray expression analysis to control PEC. Increased transcripts induced by IFNB1 signaling through the IFN alpha/beta receptor 1 only were used as a GSVA enrichment group to determine if the alternative pathway of IFNB1 signaling was contributing to gene regulation in DLE (FIG. 54A), SLE synovium (FIG. 54B), LN Glom class III/IV (FIG. 54C), and LN TI class III/IV (FIG. 54D). Hedge's G effect size measures (Effect) are shown for tissues with significant (p<0.05) differences between the mean GSVA enrichment scores between SLE affected and control tissues by Welch's t-test. “N/A” denotes not applicable due to insignificant Welch's t-test value.

FIGS. 55A-55E show a non-limiting example that the IGS and SLEDAI do not change synchronously. Ten SLE lupus nephritis patients with SLEDAI>6 (GSE72747) had F test differential expression (DE) analysis using time zero, 12-week and 24-week time points. Treatment with high-dose immunosuppressive was begun after time zero and continued for 12 weeks; at 12 weeks, all patients were switched to lower dose/maintenance therapy; healthy controls from the GSE39088 dataset were included in the analysis. Graphs show the change in SLEDAI versus the change in the GSVA enrichment scores for 0 to 12 weeks (top), and for 12 to 24 weeks (bottom) for MS-IFNB1 (FIG. 55A), HepC-IFNA2 (FIG. 55B), IFNA2 (FIG. 55C), IFNB1 (FIG. 55D), and IFNW1 (FIG. 55E).

FIGS. 56A-56E show a non-limiting example that IFN subtypes are most related to monocyte cell surface transcripts by linear regression analysis. Shown are linear regression analysis results between the cell type-specific, nonoverlapping IFN signatures, and the GSVA enrichment cell type score (y-axis) for the patients from 10 SLE WB and PBMC datasets. Cell types or signatures significantly (p<0.05) related to the nonoverlapping IFN score for MS-IFNB1 (FIG. 56A), type I IFN core (FIG. 56B), IFNA2 (FIG. 56C), IFNB1 (FIG. 56D), and IFNW1 (FIG. 56E) in at least half of the datasets which had determinable GSVA scores were used to determine overall regression lines for all 10 datasets. The r² values are listed after the GSVA enrichment category. “PC” indicates plasma cell, “UPR” indicates unfolded protein response, and “LDG” indicates low density granulocyte.

FIGS. 57A-57B show a non-limiting example of using LDG-specific genes to compare low-density granulocyte (LDG) differentially expressed genes (DEGs) relative to SLE neutrophils and healthy control (HC) neutrophils, using the systems and methods herein. Shown is a comparison of LDG upregulated genes versus SLE neutrophils or HC neutrophils by limma analysis. Genes were considered upregulated or downregulated if they had an FDR<0.05. FIG. 57A shows a comparison of LDG genes upregulated versus SLE neutrophils or HC neutrophils.

FIG. 57B shows a comparison of LDG genes downregulated versus SLE neutrophils or HC neutrophils.

FIGS. 58A-58B show a non-limiting example of using weighted gene coexpression network analysis (WGCNA) module eigengene (ME) values to separate LDGs from both SLE neutrophils and HC neutrophils, using the systems and methods herein. Samples from GSE26975 were used in two separate WGCNA analyses to examine LDGs and HC or LDGs and SLE neutrophils. Module colors are assigned by the WGCNA pipeline based on module size. Eigengene values separate LDGs from HC neutrophils (n=9 HC, 10 LDG) (FIG. 58A) and SLE neutrophils (n=10 SLE, 10 LDG) (FIG. 58B) by Welch's t test (*p<0.05).

FIGS. 59A-59D show a non-limiting example of grouping LDG WGCNA modules by eigengene values and constituent genes, using the systems and methods herein. LDG eigengene values for pink and black modules (FIG. 59A) or grey60 and green-yellow modules (FIG. 59B) demonstrate that the four WGCNA modules can be broken into two groups based on the behavior of their eigengenes from patient to patient. Pearson r and p values are shown. WGCNA modules with highly correlated eigengenes have many genes in common. LDG module A was formed from the genes shared between the pink and black modules (FIG. 59C). LDG module B was formed from the genes shared between the grey60 and green-yellow modules (FIG. 59D).

FIGS. 60A-60C show a non-limiting example of performing STRING/MCODE functional analysis of LDG module B to elucidate two major clusters characterized by cell cycle and neutrophil degranulation, using the systems and methods herein. MCODE clustering was used to identify the most strongly connected members of module B's STRING protein-protein interaction network. The top cluster (FIG. 60A) has many genes associated with the cell cycle by GO (diamonds). The bottom cluster (FIG. 60B) is almost entirely composed of genes associated with neutrophil degranulation (squares). Cell cycle and neutrophil degranulation genes not connected to an MCODE cluster are shown on the right. The presence of neutrophil-associated genes in module B led to its selection as the module used to query blood and tissue gene expression data. A gene ontology designation is shown in FIG. 60C, where genes associated with cell cycle are denoted by diamonds, genes associated with neutrophil degranulation are denoted by squares, and genes having other ontologies are denoted by circles.

FIG. 61 shows a non-limiting example of computational and functional analyses to study the relationships between module enrichment and disease manifestations in SLE whole blood, using the systems and methods herein. Shown is a flow chart illustrating the process of generating, filtering, and analyzing WGCNA gene modules. Modules are evaluated by functional analysis and tests of co-expression in blood and tissue data sets. GSVA enrichment scores are used to study the relationships between module enrichment and disease manifestations in SLE whole blood.

FIGS. 62A-62F show a non-limiting example of determining that LDG Modules are associated with platelet counts or neutrophil counts in GSE49454 WB, using the systems and methods herein. Shown are LDG Module A enrichment score versus platelet counts (FIG. 62A), neutrophil counts (FIG. 62B), and neutrophil counts (FIG. 62C) excluding patients with counts less than 1,500/mm³ or greater than 8,000/mm³. FIGS. 62D-62F show an analysis of LDG Module B enrichment scores.

FIG. 63 shows a non-limiting example of a method for identifying a lupus condition of a subject using PID profiling, in accordance with disclosed embodiments.

FIG. 64 shows a non-limiting example of cross-checking primary immunodeficiency (PID) genes in 928 hematopoietic immune cells, in accordance with disclosed embodiments. The expression of the genes must be specific to hematopoietic cells, because if not restricted, then these genes could be targeted in non-immune specific cells and have detrimental effects.

FIG. 65A shows a non-limiting example of a database at large, comprising 432 genes, in accordance with disclosed embodiments. The database of 432 PID-associated genes was compiled through review of the primary literature. Each PID gene includes characteristic information that can be used to identify and describe the gene.

FIGS. 65B-65C show a non-limiting example of a table of the database shown in FIG. 65A, in accordance with disclosed embodiments.

FIG. 66A shows a non-limiting example of results showing that some PID-associated genes are specific to immune hematopoietic stem cells, in accordance with disclosed embodiments. Of the 450 PID-associated genes, 125 genes were determined to be specific to immune hematopoietic cells. Of the 25 immune cell categories specific to hematopoietic cells and various cell types, the 125 genes are concentrated in monocyte, myeloid, B cell, T cell, and B and T cell categories.

FIG. 66B shows a non-limiting example of results showing the cell count per category of various cell types.

FIGS. 67A-67B show a non-limiting example of protein-protein interaction-based clustering of 450 PID-associated genes, in accordance with disclosed embodiments. Protein-protein interaction networks and clusters were generated via Cytoscape using the STRING and MCODE plugins. FIG. 67A shows that of the 450 genes, 430 genes were grouped into 16 clusters, and the BIG-C™ category most representative of the gene list was used to biologically characterize the clusters. The clusters with the most genes include clusters 1, 2, 3, 4, and 5. The BIG-C™ categories represented by these large clusters include immune cell surface, intracellular signaling, pattern recognition receptors, DNA repair, pro-proliferation, secreted immune, and extracellular matrix. The node sizes correlate to the number of genes in each cluster, and the degree of node shading indicates the number of intracluster connections (see gradient at bottom of figure). The edge weight thickness represents the number of intercluster connections. FIG. 67B shows that the 450 genes were grouped into 16 clusters. Data from GSE88884, which includes transcriptomic data of 1,620 patients, was used to determine the differential expression of the genes.

FIG. 68 shows a non-limiting example of endotypes of SLE patients defined by functional groupings of PID-associated genes, in accordance with disclosed embodiments. Differentially expressed (DE) genes from the GSE88884 SLE WB dataset (1,620 patients) were assessed by GSVA for the 17 MCODE clusters, as shown in FIGS. 67A-67B (and on the x-axis of the heatmap). There is a clear distinction between enrichment of the clusters among the patients, thereby demonstrating that these groups of immune-specific genes can be used to differentiate SLE patients based on clinical presentation of disease.

FIG. 69 shows a non-limiting example of performing GSVA to identify the functional role of PID-associated genes expressed in SLE WB microarray datasets, in accordance with disclosed embodiments. DE genes from 14 SLE WB datasets shown on the x-axis were overlapped with the 432 PID-associated genes to assess common genes. SLE WB DE genes that are also PID-associated genes were analyzed by GSVA for function by enrichment with BIG-C functional categories as shown on the y-axis. Welch's t test was used to identify significant BIG-C categories including interferon stimulated genes, MHC class-1 antigen presentation, secreted-immune, secreted extracellular matrix, pattern recognition receptors, proteasome activity, and pro-apoptosis.

FIG. 70 shows a non-limiting example of results demonstrating that PID-associated genes are differentially expressed in a large whole blood dataset composed of distinct patient groups, in accordance with disclosed embodiments.

FIG. 71 shows a non-limiting example of a workflow to assess a condition of a subject using one or more data analysis tools and/or algorithms, in accordance with disclosed embodiments.

FIG. 72 shows a non-limiting example of using BIG-C® to generate a differential expression heatmap, in accordance with disclosed embodiments.

FIG. 73 shows a non-limiting example of using BIG-C® to generate a gene coexpression plot, in accordance with disclosed embodiments.

FIG. 74 shows a non-limiting example of using BIG-C® to cross-examine enriched categories with GO and KEGG terms to derive key insights for further analysis, as shown by the enriched categories identified (left) and cross-referenced to GO terms, in accordance with disclosed embodiments.

FIG. 75 shows a non-limiting example of an I-Scope™ signature analysis for a given sample, in accordance with disclosed embodiments.

FIG. 76 shows a non-limiting example of an I-Scope™ signature analysis for a given sample across multiple samples and disease states, in accordance with disclosed embodiments.

FIG. 77 shows a non-limiting example of results obtained using T-Scope™ in combination with I-Scope™ for identification of cells post-DE-analysis, in accordance with disclosed embodiments.

FIG. 78 shows a non-limiting example of MS-Scoring™ 1 of IL-12 and IL-23 related pathways for targeting using ustekinumab for SLE (systemic lupus erythematosus) drug repositioning, in accordance with disclosed embodiments.

FIG. 79 shows a non-limiting example of results from GSVA Analysis on SLE (systemic lupus erythematosus) signaling pathways, in accordance with disclosed embodiments.

FIG. 80 shows a non-limiting example of the CoLT Scoring® of SOC Therapies in Lupus (Belimumab, HCQ, and Rituximab), in accordance with disclosed embodiments.

FIG. 81 shows a non-limiting example of the Target-Scoring categories and point values, in accordance with disclosed embodiments.

FIGS. 83A-83B show generation of WGCNA gene modules from LN glomerular and tubulointerstitium (TI) differential expression (DE) data and correlation to clinical covariates.

FIGS. 84A-84B show GSVA enrichment and sorting of LN patients against WGCNA module membership.

FIG. 85 shows enrichment of functional categories in LN signatures via BIG-C®. Modules were characterized for patterns of member gene function via comparison to the BIG-C® database.

FIG. 86 shows enrichment of immune and tissue cell populations in LN signatures via I-Scope™ and T-Scope™.

FIG. 87 shows expression of PC and GC indicator genes in LN. To more closely and specifically interrogate LN samples for the presence and role of PCs, DE genes from LN glomeruli and TI across WHO classes were filtered against signatures for core plasma cell function, T follicular helper cells, and germinal center B cells.

FIGS. 88A-88E show patterns of upstream regulator activation in LN. IPA® UR analysis of DE genes from glomerular and TI samples across WHO classes produces five blocks of interest (FIGS. 88A-88E, respectively) for identifying shared and unique immune, inflammatory, and cytokine/chemokine pathways between tissues and levels of LN severity (p<0.01).

FIG. 93 shows an example of expression of PC and GC indicator genes in DLE. To more closely and specifically interrogate DLE samples for the presence and role of PCs, DE genes from each dataset were filtered against signatures for core plasma cell function, T follicular helper cells, and germinal center B cells.

FIGS. 95A-95B show an example of IPA® canonical pathway and upstream regulator (UR) analysis. IPA® canonical pathway and upstream regulator analysis was performed.

FIG. 96 shows a non-limiting example of a workflow to assess a condition of a subject using one or more data analysis tools and/or algorithms, in accordance with disclosed embodiments.

FIG. 97 shows the process of unpacking an SLE-associated SNP, in accordance with disclosed embodiments.

FIGS. 98A-98C show an example of mapping SNP associations to eQTLs and E-Genes, in accordance with disclosed embodiments. FIG. 98A shows a distribution of genomic functional categories for EA and AA SNP sets. “NT-R” is defined as Non-Traditional Regulatory: intronic or intergenic SNPs exhibiting strong regulatory potential, indicated by DNase hypersensitivity, location within protein binding sites and evidence of epigenetic modification. “Other” non-coding regions include introns, intergenic regions, 5 kb upstream of transcription start sites and 5 kb downstream of transcription termination sites. FIG. 98B shows a summary of eQTL analysis. SLE-associated SNPs identify multiple eQTLs linked to E-Genes in the GTEx database. eQTLs and their associated E-Genes were divided into European ancestry (EA) and African ancestry (AA) groups depending on the ancestral origin of the original SLE-associated SNP. Shared E-Genes are derived from SNPs common to both EA and AA ancestries. FIG. 98C shows the number of EA and AA SNPs mapping to single E-Genes, multiple E-Genes or shared E-Genes.

FIGS. 99A-99D show an example of E-Gene functional and pathway analysis, in accordance with disclosed embodiments. PANTHER (v.13.1) was used to classify EA and AA E-Genes according to gene ontology (GO) biological processes and pathways. The number of EA (FIG. 99A) and AA (FIG. 99B) E-Genes assigned to GO biological processes is displayed in each bar graph; GO identifiers are reported to the right of each graph. For pathway analysis, EA (FIG. 99C) and AA (FIG. 99D) E-Gene sequences were assigned to GO pathways. EA E-genes are defined by 78 pathways; several pathways of interest containing 4 or more E-Genes are labeled. AA E-Genes are defined by 15 pathways as shown in the pie chart.

FIGS. 100A-100C show an example of generation of protein-protein interaction (PPI) networks, in accordance with disclosed embodiments. PPI networks and clusters were generated via CytoScape using the STRING and MCODE plugins. Networks were constructed of all EA, AA, and shared (EA+AA) E-Genes. MCODE clusters were determined by the strength of protein-protein interactions, calculated by pooling information from publicly available literature. FIG. 100A shows the cluster metastructure of each network and corresponding BIG-C™ categories, while FIGS. 100B-100C show the specific genes that make up each cluster. FIG. 100D shows EA, AA, and shared (EA+AA) E-Genes that were unclustered.

FIGS. 101A-101D show an example of a comparison of E-Genes predicted from SLE-associated SNPs with SLE differential expression datasets, in accordance with disclosed embodiments. Predicted E-Genes were matched with SLE differential expression (DE) data and organized by ancestry. FIG. 101A shows the fold-change variation of EA-only E-Genes. Due to the large number of DE EA E-Genes, a selection of the most highly upregulated and downregulated genes are presented. FIG. 101B shows AA-only DE E-Genes, and FIG. 101C shows DE E-Genes common to both the AA and EA gene sets. Color for all three heatmaps represents log fold change, as indicated by the legend underneath the central heatmap (FIG. 101D). Red asterisks indicate active SLEDAI datasets.

FIGS. 102-103 show an example of a comparison of E-Genes predicted from SLE-associated SNPs with SLE differential expression datasets, in accordance with disclosed embodiments. Compounds targeting EA, AA, shared tissue E-Genes and associated pathways are shown. Differentially expressed E-Genes from synovium, skin and kidney tissue datasets were first compared to immune-specific gene lists. Overlapping genes were used as input for IPA upstream regulator analysis. PPI networks and clusters were generated via CytoScape using the STRING and MCODE plugins. MCODE clusters were determined by the strength of protein-protein interactions, calculated by pooling information from publicly available literature. Select drugs acting on targets are shown. Where available, CoLT scores (−16 to +11) are depicted in superscript.

FIG. 104 shows a non-limiting example of a workflow to identify autoimmune disease drug targets, in accordance with disclosed embodiments.

FIGS. 105A-105E show a non-limiting example of results showing that inhibition of histone deacetylase HDAC6 reduced Ig and C deposition in NZB/W lupus nephritis. FIGS. 105A-105B show a representative hematoxylin and eosin (H&E) staining image of the kidney glomerular region along with a pathology score that reflects the severity of membranoproliferative changes and distribution. FIG. 105C shows a representative immunohistological staining of a kidney section for IgG and C3. FIGS. 105D-105E show a graphic analysis of mean fluorescent intensity (MFI) of IgG and C3. Data are shown as mean ± standard error of the mean (s.e.m.); n=4 mice for each group; t-test; *P<0.05, **P<0.01, ****P<0.0001.

FIG. 106 shows a non-limiting example of results showing that HDAC6i treatment of NZB/NZW F1 mice induced global gene expression changes in whole splenocytes. Shown is hierarchical clustering of 3911 transcripts (1922 up, 1989 down) that differed significantly (FDR<0.1) between control mice (C1, C3, C4, and C5) and treated mice (T1, T2, T3, and T5).

FIGS. 107A-107D show a non-limiting example of results showing that HDAC6i treatment results in significantly decreased GC activity and PC formation. FIG. 107A shows results of I-Scope hematopoietic cell enrichment demonstrating that HDAC6 inhibition decreased PC, B cells, and inflammatory myeloid cells. The numbers of transcripts corresponding to each cell type increased or decreased after HDAC6 inhibitor treatment are shown. Gene symbols for transcripts for PC, B cells, and inflammatory myeloid cells are shown in Table 54 (increased transcripts) and Table 55 (decreased transcripts). FIG. 107B shows results of GSVA analysis performed to determine the enrichment of PC, Tfh cells, and GC in each HDAC6 inhibitor-treated and control NZB/NZW mouse (Methods lists genes used for GSVA enrichment modules). FIG. 107C shows a representative splenic section stained with anti-CD138, anti-IgM, and PNA. FIG. 107D shows a representative splenic section stained for T cells, follicular B cells, and GC with anti-CD3, anti-IgD, and PNA.

FIG. 108 shows a non-limiting example of results showing that HDAC6 inhibition repressed B cell signaling pathways in NZB/NZW mice. The IPA Canonical Signaling Pathway “B Cell Receptor Signaling” had a Z score of −3.1. Transcripts differentially expressed between HDAC6 inhibitor-treated and untreated NZB/NZW mice were overlaid on genes in the IPA pathway. Decreased transcripts are shown in green, while increased transcripts are shown in pink.

FIGS. 109A-109D show a non-limiting example of results showing that inhibition of HDAC6 altered transcripts associated with cellular metabolism. FIG. 109A shows results of an Ingenuity Pathway Analysis (IPA) performed on the differentially expressed transcripts between HDAC6 inhibitor-treated and untreated NZB/NZW mice. The most significant signaling pathways increased or decreased by Z score analysis with an overlap p value<0.05 are shown. The full list of significantly increased and decreased pathways and the genes used to determine significance are in Table 56 (increased) and Table 57 (decreased). FIG. 109B shows results of a GO biological pathway enrichment analysis of the most increased and decreased pathways, ranked by lowest overlap p value. A full list of enriched GO biological pathways (p<0.01) is in Table 5 (increased) and Table 59 (decreased). FIGS. 109C-109D show results of a BIG-C pathway enrichment performed using increased (FIG. 109C) or decreased (FIG. 109D) transcripts from the DE analysis of HDAC6 inhibitor-treated NZB/NZW mice compared to untreated NZB/NZW mice. The −log (p value) is shown for the enriched categories. Gene symbols corresponding to each category are listed in Table 60 (increased) and Table 61 (decreased).

FIGS. 110A-110C show a non-limiting example of results showing that HDAC6 inhibition decreased citrate synthase activity and cytochrome c oxidase activity in NZB/W mice. Four weeks of treatment of NZB/W mice with the HDAC6 inhibitor ACY-738 led to a significant decrease in the activity of the rate-limiting enzyme of the TCA cycle (p=0.043) (FIG. 110A), and a decrease in cytochrome c oxidase activity (p=0.053) (FIG. 110B), while having minimal effect on beta-hydroxyacyl-CoA dehydrogenase in splenocytes (n=5) (FIG. 110C).

FIGS. 111A-111B show a non-limiting example of results showing that HDAC6 inhibition decreases glucose and fatty acid oxidation in T and B cells from NZB/W mice. T cells and B cells from 12-week-old female NZB/W mice were purified and stimulated with anti-CD3/CD28 or LPS, respectively, for 24 hours with or without the addition of 4 μM ACY-738 (DMSO only was used as control). After 24 hours of culture, CO2 production from the oxidation of glucose (FIG. 111A) and palmitate (FIG. 111B) was determined from three separate experiments in triplicate (n=3).

FIG. 112 shows a non-limiting example of results showing that HDAC6 inhibition decreases lupus gene signature pathways in NZB/W mice that are increased in active human SLE. IPA canonical signaling pathways increased in human SLE microarray tissue datasets were compared to signaling pathways in NZB/W mice decreased by the HDAC6 inhibitor. Z scores greater than 2 or less than −2 are considered significant.

FIGS. 113A-113B show a non-limiting example of quantified germinal center formation in female NZB/W mice at 24 weeks of age treated with ACY-738 (treated, “T”) or without ACY-738 (control, “C”) for four weeks. Five germinal centers were randomly picked from each spleen sample and analyzed using ImageJ software to calculate germinal center size. N=20, * P<0.05, **** P<0.0001.

FIGS. 114A-114D show a non-limiting example of results obtained by flow cytometry of GC B cells (FIGS. 114A and 114C) and TFH cells (FIGS. 114B and 114D) in C57BL/6J mice and C57BL/6J/HDAC6−/− mice. For spleen, n=5 (FIGS. 114A-114B), and for Peyer's patch, n=3 (FIGS. 114C-114D). Germinal center B cells are gated by CD19+, GL7+, IgD−. * P<0.05.

FIGS. 115A-115F show a non-limiting example of results obtained by flow cytometry of sorted B cells from C57BL/6J mice and C57BL/6J/HDAC6−/− mice stimulated with LPS or anti-IgM, anti-CD40 for 24 hours. The results showed reduced expression of activation markers of B cells CD86 (FIG. 115A) and MHCII (FIG. 115B) in C57BL/6J/HDAC6−/− mice compared to C57BL/6J mice with stimulation of anti-IgM and anti-CD40. In addition, MFI of CD69 (FIG. 115C), CD86 (FIG. 115D), MHC-II (FIG. 115E), and CD80 (FIG. 115F) are down-regulated in C57BL/6J/HDAC6−/− mice with stimulation of LPS. N=5. * P<0.05, ** P<0.01.

FIGS. 116A-116F show a non-limiting example of results obtained by flow cytometry of sorted B cells from NZB/W mice stimulated with LPS or anti-IgM, anti-CD40 and then treated with ACY-738 for 24 hours. The results showed reduced expression of activation markers of B cells CD86 (FIG. 116A) and MHCII (FIG. 116B) in ACY-738 treated B cells with stimulation of anti-IgM and anti-CD40. In addition, MFI of CD69 (FIG. 116C), CD86 (FIG. 116D), MHC-II (FIG. 116E), and CD80 (FIG. 116F) are significantly down-regulated in ACY-738 treated B cells with stimulation of LPS. N=5. * P<0.05, ** P<0.01, *** P<0.001, **** P<0.0001.

FIGS. 117A-117C show a non-limiting example of control experiments demonstrating the specificity and lack of cross reactivity of I-scope. Experiments were performed on the DE analysis of healthy control purified CD3+CD4+ T cells (FIGS. 117A and 117C), CD19+CD3− B and plasma cells (FIGS. 117A-117B), and CD33+CD3− myeloid cells (FIGS. 117B-117C) from microarray dataset GSE10325. The genes in each I-scope category (29 categories in total; hematopoietic general was not used) were used as modules for gene set variation analysis to determine the specificity of each module and cross-reactivity to other cell types. For each comparison, only categories with at least three genes above the Interquartile Range threshold were considered for statistical analysis. Significance of GSVA enrichment scores was determined using Sidak's multiple comparisons test. Adjusted p values below 0.05 were considered significant. FIGS. 117D-117E show a non-limiting example of results demonstrating a strong relationship of human B cell/microliter counts to GSVA enrichment scores for the I-scope B cell category on 105 human subjects from microarray dataset GSE88884. Also demonstrated is the strong relationship of mouse flow cytometry values for plasma cells (B220+IgM−CD138+) to the GSVA enrichment scores using the I-scope plasma cell module on BXSB Yaa (points above X-axis) and BXSB MPJ mice (points below X-axis).
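
The multiple-comparison adjustment named above can be sketched as follows. This is a minimal illustration, assuming the standard single-step Sidak adjustment, in which each raw p value p is replaced by 1 − (1 − p)^m for m comparisons; the disclosure does not state the exact software or formula used, and the function name and the raw p values below are hypothetical.

```python
def sidak_adjust(p_values):
    """Single-step Sidak adjustment: p_adj = 1 - (1 - p)**m.

    Assumption: all m comparisons in the family are adjusted together.
    """
    m = len(p_values)
    return [1.0 - (1.0 - p) ** m for p in p_values]


# Hypothetical raw p values for three cell-type comparisons.
raw = [0.001, 0.02, 0.04]
adjusted = sidak_adjust(raw)

# Adjusted p values below 0.05 are treated as significant, as stated above.
significant = [p < 0.05 for p in adjusted]
print(adjusted, significant)
```

Under this adjustment a raw p value that is nominally below 0.05 may no longer be significant once the number of comparisons is taken into account.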

FIG. 118 shows a non-limiting example of a process for translating mouse to human genomic data, which allows a direct comparison of human and mouse genomic data.

FIG. 119 shows a non-limiting example of a process for translating mouse to human genomic data, using a BIG-C comparison of treated mouse lupus and human lupus tissue.

FIG. 120A shows the number of differentially expressed (DE) genes detected by LIMMA analysis in MC, CD4⁺ T cells, and B cells isolated from inactive (SLEDAI<6) and active (SLEDAI≥6) SLE patients when compared to healthy donors. n.s.: no genes found to be significantly differentially expressed (FDR<0.2) when compared to healthy controls. FIG. 120B shows hierarchical clustering of differentially expressed (DE) genes detected by LIMMA analysis in CD14+ MC isolated from inactive (SLEDAI<6) and active (SLEDAI≥6) SLE patients when compared to healthy donors. Arrows highlight M1 (black) or M2 (white) polarization genes. FIG. 120C shows fold change variation of genes found to be upregulated in both active and inactive SLE MC. Polarization-related genes are shown in bold and M1 genes are represented by a black wedge while M2 genes are represented with a white wedge. Genes not associated with M1 or M2 pathways are represented with a gray wedge.

FIG. 121A shows that DE genes from active and inactive CD14+ MC were analyzed by GSVA to determine pathway enrichment using functional definitions provided from the BIG-C (Biologically Informed Gene Clustering) annotation library. Samples were successfully sorted by disease cohort via this method in both active and inactive MC. Starred BIG-C categories only appeared in the active or inactive analysis, respectively. FIG. 121B shows WGCNA of CD14+ and CD33+ MC isolated from SLE patients. Dendrograms show hierarchy of modules formed by unsupervised WGCNA clustering of DE genes from CD14+ and CD33+ MC isolated from active and inactive SLE patients.

FIG. 122 shows a CIRCOS diagram comparing the composition of SLE positively-correlated CD14+ and CD33+ WGCNA modules to genes enriched in M1- or M2-polarized human Mϕ or genes associated with general MC activation (upregulated in both M1 and M2 conditions). Genes found in the yellow module (CD14+) are shown in black, genes found in the violet module (CD33+) are shown in red, and genes found in the sienna3 module (CD33+) are shown in orange. M1-related genes are represented with solid lines, M2-related genes are represented by dashed lines, and general MC activation genes are represented with dotted lines.

FIGS. 123A-123B show protein-protein interaction networks and clusters generated via CytoScape using the STRING and MCODE plugins. Networks were constructed of the gene lists of WGCNA modules positively (FIG. 123A, above) or negatively (FIG. 123B, below) correlated to SLEDAI from CD14⁺ MC (FIG. 123A(a) and FIG. 123B(a)) or CD33⁺ MC (FIG. 123A(b), FIG. 123A(c), FIG. 123B(b), and FIG. 123B(c)). MCODE clusters are determined by the strength of protein-protein interactions, calculated by pooling information from publicly available literature. Top half of diagrams show the cluster metastructure of each network while bottom half shows the specific genes that make up each cluster. M1-related genes are indicated by red arrows and M2-related genes are indicated by blue arrows.

FIG. 124A shows that IPA was used to analyze the CD14⁺ MC dataset and identify putative upstream regulators for active patient monocytes, inactive patient monocytes, and the active-inactive overlap using a p-value cutoff of 0.05. Only genes for which IPA assigned a z-score with an absolute value of at least 2 in at least one of the three sets are shown. FIG. 124B shows representative diagrams of the downstream gene expression changes (outer circles) used to calculate upstream regulators (center).

FIG. 125 shows that gene sets from CD14⁺ MC isolated from active or inactive SLE patients were used as input for the LINCS analysis platform, which reports connectivity scores for individual genes that describe how well the genomic change between the baseline and experimental input sets matches the change observed following the knockdown or overexpression of the individual gene in question. Knockdown and overexpression data were filtered by genes for which LINCS reported connectivity scores for both categories, and genes were identified as BURs for a particular dataset if they received a knockdown connectivity score between −75 and −100 and an overexpression connectivity score between 50 and 100 for that dataset.
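
The connectivity-score filter described for FIG. 125 can be sketched as follows. This is a minimal illustration, assuming the scores have been exported as plain dictionaries keyed by gene symbol; the function name, the gene symbols, and the score values are hypothetical and are not taken from the LINCS platform or from the disclosure.

```python
def select_burs(knockdown, overexpression,
                kd_range=(-100.0, -75.0), oe_range=(50.0, 100.0)):
    """Return genes whose scores satisfy both connectivity criteria.

    knockdown, overexpression: dict mapping gene symbol -> connectivity score.
    Only genes with a reported score in BOTH dictionaries are considered.
    """
    shared = knockdown.keys() & overexpression.keys()
    selected = []
    for gene in sorted(shared):
        kd, oe = knockdown[gene], overexpression[gene]
        kd_ok = kd_range[0] <= kd <= kd_range[1]
        oe_ok = oe_range[0] <= oe <= oe_range[1]
        if kd_ok and oe_ok:
            selected.append(gene)
    return selected


# Hypothetical connectivity scores, for illustration only.
kd_scores = {"GENE_A": -92.1, "GENE_B": -80.0, "GENE_C": -40.0, "GENE_D": -99.0}
oe_scores = {"GENE_A": 77.5, "GENE_B": 12.0, "GENE_C": 95.0}

burs = select_burs(kd_scores, oe_scores)
print(burs)
```

Here GENE_D is dropped because no overexpression score is reported for it, which mirrors the requirement that scores exist for both categories.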

FIG. 126A shows that GSVA was utilized to generate scores to assess enrichment of WGCNA lymphocyte subset gene modules correlated with disease activity in WB or PBMC samples separated into inactive or active SLE patients. Results are shown following unsupervised hierarchical clustering. The expected and observed correlations to disease states of each module and the cell type of their origin are shown on the right (black: positive correlation; gray: negative correlation; white: unknown correlation; x: no significant correlation). FIG. 126B shows that odds ratios (OR) with 95% confidence intervals (CI) were calculated from the GSVA data to determine the strength of association of each cellular module with active disease. FIG. 126C shows ROC curves displaying representative results of disease activity prediction by the generalized linear model algorithm for modules from an individual cell type. Area under the curve is shown on each panel.
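
One way to compute an odds ratio with a 95% confidence interval from a two-by-two table is sketched below. The disclosure does not state how the GSVA data were tabulated or which interval method was used, so this sketch assumes that each patient is classified as module-enriched or not and as active or inactive, and it uses the common Wald (Woolf) interval on the log odds ratio. The function name and the counts are hypothetical.

```python
import math


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and Wald (Woolf) confidence interval for a 2x2 table.

    a: active, module enriched      b: active, not enriched
    c: inactive, module enriched    d: inactive, not enriched
    If any cell is zero, 0.5 is added to every cell (Haldane-Anscombe
    correction) so that the ratio and its standard error are defined.
    """
    if 0 in (a, b, c, d):
        a, b, c, d = (x + 0.5 for x in (a, b, c, d))
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical patient counts, for illustration only.
or_value, ci_low, ci_high = odds_ratio_ci(30, 10, 12, 28)
print(f"OR={or_value:.2f}, 95% CI=({ci_low:.2f}, {ci_high:.2f})")
```

An interval that lies entirely above 1 would indicate a positive association of the module with active disease under these assumptions.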

FIG. 127 shows PC DE profiles isolated from Published Microarray Profiles.

FIGS. 128A-128C show functional characterization of DE PC gene signatures in SLE. FIG. 128A shows a filtered PC dataset containing only PC-specific gene signatures. FIG. 128B shows significantly enriched BIG-C categories found in the common DE gene signature, including ER, Golgi, Immune Cell Surface, and Unfolded Protein and Stress. FIG. 128C shows that among the unique Tonsil PC DE genes, the ER, General Cell Surface, Golgi, Integrin Pathway, Secreted and ECM, and Transporters BIG-C categories were significantly enriched while the Endocytosis, Mitochondrial DNA-to-RNA, Mitochondria General, mRNA Splicing, mRNA Translation, Nuclear Hormone Receptors, and Nucleus and Nucleolus BIG-C categories were significantly underrepresented.

FIGS. 129A-129B show protein interaction-based clustering of SLE PC and SLE/Tonsil Common DE genes. FIG. 129A shows that DE genes common to the SLE PC and Tonsil PC datasets formed four discrete clusters: a large unfolded protein response/secreted protein cluster, an ER cluster, a small unfolded protein response cluster, and a small cluster with undefined function. FIG. 129B shows that the SLE PC DE list produced only two clusters via MCODE analysis: one large cluster centered around pro-proliferation signaling pathways, and one small cluster containing ER- and mitochondria-related genes.

FIGS. 130A-130B show results of tracking a PC DE signature in the periphery and tissues of SLE patients via microarray data. FIG. 130A shows that many of the genes were found to be upregulated most in the skin and synovium, followed by the kidney and B cell datasets, with some expression detected in the PBMC and WB datasets. FIG. 130B shows that using the SLE PC and Common PC DE gene lists revealed enrichment patterns of divergent subsets of the PC signature across different SLE tissue and peripheral cell datasets.

FIGS. 131A-131E show that GSVA was used to determine enrichment of the Tonsil PC, SLE PC, and Common signatures in tissue (FIGS. 131A-131D) and PBMC samples (FIG. 131E) from SLE, DLE, LN, and OA patients. FIGS. 131A-131C show that enrichment of the Common and SLE PC signatures only appeared to successfully identify and sort DLE, SLE, and LN patient samples in the skin, synovium, and kidney glomerulus, respectively. FIG. 131D shows that LN patient samples were less cleanly identified from healthy control samples when these signatures were applied to the kidney tubulointerstitium, but the Common signature tended to be enriched in LN patient samples while the Tonsil PC signature (representing homeostatic/healthy PC gene signaling) tended to be enriched in the control samples. FIG. 131E shows that PBMC samples were not successfully discriminated by cohort according to GSVA enrichment of the Tonsil PC/SLE PC/Common signature paradigm.

FIGS. 132A-132C show the identification of targets of the proteasome inhibitor family of chemotherapy agents (bortezomib, ixazomib, carfilzomib) as members and regulators of the SLE PC signature by multiple methods, including analysis of upstream regulators of SLE PC DE gene signatures, which cluster in proliferation and cell cycle checkpoint pathways. IPA upstream regulator analysis was used to further distill the SLE PC DE signature and identify keystone genes and signaling pathways. High-priority targets were generated via IPA upstream regulator analysis (FIG. 132A) and by cross-reference with the AMPEL Primary Immunodeficiency Gene Database (FIG. 132B), which identifies and catalogs keystone genes that act as checkpoints in the development of autoimmunity and protect against gross failure of immune tolerance.

FIGS. 133A-133D show results obtained by mapping the functional genes predicted by SLE-associated SNPs. FIG. 133A shows a distribution of genomic functional categories for ancestry-specific non-HLA associated SLE SNPs (Tiers 1-3). Non-coding regions include micro (mi)RNAs, long non-coding (lnc)RNAs, introns and intergenic regions. Regulatory regions include transcription factor binding sites (TFBS), promoters, enhancers, repressors, promoter flanking regions and open chromatin. Coding regions were broken down further and include 5′UTRs, 3′UTRs, synonymous and nonsynonymous (missense and nonsense) mutations. FIG. 133B shows that functional genes predicted by SNPs are derived from 4 sources including regulatory elements (T-Genes), eQTL analysis (E-Genes), coding regions (C-Genes) and proximal gene-SNP annotation (P-Genes). FIG. 133C shows a Venn diagram depicting the overlap of all SLE-associated SNPs. FIG. 133D shows a Venn diagram depicting the overlap of all predicted E-, T-, P-, and C-Genes.

FIGS. 134A-134E show the characterization of predicted gene signatures. FIG. 134A shows that ancestry-dependent and independent E-, P-, T-, and C-Genes were analyzed to determine enrichment using functional definitions from the BIG-C (Biologically Informed Gene Clustering) annotation library. Enrichment was defined as any category with an odds ratio (OR)>1 and −log10(p-value)>1.33. FIGS. 134B-134E show heatmap visualizations of the top five significant IPA canonical pathways for each gene list (E-, P-, T-Genes) organized by ancestry. C-Genes were analyzed together. Top pathways with −log10(p-value)>1.33 are listed.

FIGS. 135A-135D show that cluster metastructures were generated based on PPI networks, clustered using MCODE and visualized in CytoScape. Size indicates the number of genes per cluster, edge weight indicates the number of inter-cluster connections and color indicates the number of intra-cluster connections. FIG. 135E shows the quantitation of cluster size, intra- and intercluster connections. Error bars represent the 95% confidence interval; asterisks (*) indicate a p-value<0.05 using Welch's t-test.

FIGS. 136A-136C show that ancestry-specific E-, P-, T-, and C-Genes were matched to differential expression (DE) SLE datasets in various tissues, including whole blood, PBMCs, B-cells, T-cells, synovium, skin and kidney.

FIGS. 137A-137B show that DE predicted genes and UPRs were used as input to build STRING-based PPI networks, visualized in CytoScape, and clustered with MCODE. Individual clusters were then analyzed by BIG-C and IPA to identify those molecules and pathways highly associated with disease. A total of 45 pathways were representative of EA DE genes and UPRs, with the largest clusters 3 and 1 heavily involved in pattern recognition receptor signaling (activation of IRFs by cytosolic PRRs and role of RIG-I in antiviral immunity).

FIGS. 138A-138B show that the AA network was smaller (FIG. 138A), containing fewer predicted genes and associated UPRs, yet shared multiple pathways with EA, including B cell receptor signaling, GPCR signaling, opioid signaling, phagocyte maturation and hepatic cholestasis, a pathway involved in bile acid synthesis (FIG. 138B).

FIGS. 139A-139B show that pathways exemplified by ancestry-independent genes were a blend of both EA and AA pathways. For example, common pathways included IL12 signaling and production by macrophages, TLR signaling and activation of IRFs by cytosolic PRRs, pathways that were predicted by EA genes and UPRs, as well as PRRs in the recognition of bacteria and virus, a pathway shared with AA.

FIGS. 140A-140F depict both the unique and overlapping canonical pathways predicted by the EA and AA gene sets. The pathway categories shared between the EA and AA ancestral groups are those commonly associated with SLE, representing aberrant immune function, altered transcriptional regulation, and abnormal cell cycle control, and provide additional confirmation for the global gene expression analysis presented here (FIG. 140B).

FIGS. 141A-141C show an overview of gene expression in SLE vs OA synovium. FIG. 141A shows that DE analysis was conducted on gene expression data from SLE and OA synovium resulting in 6,496 DE genes, 2,477 upregulated in SLE and 4,019 downregulated in SLE. FIG. 141B shows that increased and decreased transcripts were each characterized by I-Scope and T-Scope (fibroblasts, synoviocytes) for prevalence of specific cell types. FIG. 141C shows that DE transcripts were also characterized by BIG-C for functional enrichment. The heatmaps in FIGS. 141B-141C represent the negative logarithm of the overlap p-value by Fisher's Exact Test when the odds ratio is greater than 1. Gray cells represent non-significant enrichment (p>0.05 or OR<1). A minimum p-value of 2.2e−16 was used.

FIGS. 142A-142C show that WGCNA reveals SLE-associated modules of genes enriched in immune cells. WGCNA of 4 SLE vs 4 OA patients yielded 7 modules of genes associated with SLE after QC and were characterized by I-Scope, T-Scope, and BIG-C. FIG. 142A shows module eigengene plots per sample of the 7 SLE-associated modules; color names are randomly generated as part of WGCNA module assignment. FIG. 142B shows that the negative logarithms of the overlap p-values identify specific immune/inflammatory cell populations or synovium-specific cell populations that may be linked to lupus synovitis, and FIG. 142C shows that they indicate enrichment of functional gene categories. Data shown in FIGS. 142B-142C are significant (p<0.05) by right-sided Fisher's Exact test and have an odds ratio above 1, indicating enrichment.

FIGS. 143A-143B show signaling pathways and upstream regulators operative in lupus synovitis. IPA canonical pathway and upstream regulator analysis was performed. FIG. 143A shows consensus canonical pathways predicted to be significantly activated or inhibited by DE transcripts and at least one SLE-associated WGCNA module. FIG. 143B shows that consensus upstream regulators predicted to be significantly activated or inhibited by both DE transcripts and at least one SLE-associated WGCNA module are displayed and organized by BIG-C category. Canonical pathways and upstream regulators were considered significant if |Activation Z-Score|≥2 and overlap p-value≤0.01.

FIG. 144 shows germinal center B cell and Tfh cell markers in lupus synovitis, including an assessment of germinal center and follicular T helper cell markers in lupus synovium from DE genes or WGCNA. Genes found in SLE-associated WGCNA modules are indicated.

FIG. 145 shows that GSVA enrichment of immune populations in synovia confirms inflammatory infiltrate in SLE. GSVA of relevant immune cell populations, molecular signatures, and signaling pathways was conducted on log2-normalized gene expression values from OA and SLE synovia. Significant differences in enrichment between cohorts were found by Welch's t-test (*p<0.05). Hedges' g effect sizes were calculated (right) with correction for small sample size for each gene set; zeroes represent non-significant differences in enrichment between cohorts. “#” indicates a literature-derived signature. Other gene set signatures were derived from IPA, where noted, PathCards, or are hand-curated lists from lupus gene expression data and literature mining.

FIG. 146 shows LINCS biological upstream regulators, including the top 50 targets from LINCS knockdown and overexpression data matching (overexpressed) and opposing (knocked down) the lupus synovitis gene signature. Knockdown and overexpression data were analyzed for connectivity scores in the −75 to −100 and 50 to 100 ranges, respectively. Drugs and compounds directly or indirectly antagonizing/inhibiting the biological upstream regulators were sourced from LINCS/CLUE, IPA®, literature mining, CoLTS, STITCH, and clinical trials databases. Where applicable, drug annotations are grouped together by target and CoLTS scores are displayed as integers in superscript. Indirect drug matches are displayed in italics. Only drugs with CoLTS scores are shown. “P”: Preclinical; “‡”: Drug in development/clinical trials; “†”: FDA-approved.

FIGS. 147A-147B show a comparison of gene expression between SLE and RA synovitis. A comparison of immune/inflammatory and synovial gene signatures was made between SLE and RA synovium using 7 RA patients from GSE36700. FIG. 147A shows that upregulated DEGs were identified between RA and OA synovium, compared to SLE, and characterized by I-Scope. FIG. 147B shows that GSVA of immune/inflammatory cell populations, molecular signatures, and signaling pathways was carried out on log2-normalized gene expression values from RA and SLE synovia. Significant differences in enrichment between cohorts were found by Welch's t-test (*p<0.05). Hedges' g effect sizes were calculated (right) with correction for small sample size for each gene set; zeroes represent non-significant differences in enrichment between cohorts. “#” indicates a literature-derived signature. Other gene set signatures were derived from IPA, where noted, PathCards, or are hand-curated lists from lupus gene expression data and literature mining.

FIG. 148 shows a model of lupus synovitis. DEGs, molecules co-expressed in SLE correlated WGCNA modules, and IPA® upstream regulator predictions were integrated into a summary model of lupus synovitis. Transcripts listed on the right were either upregulated (red text), co-expressed in SLE correlated WGCNA modules (underlined), or identified as upstream regulators operative in lupus synovitis.

FIG. 149 shows an example of weighted gene co-expression network analysis (WGCNA) to create modules of correlated genes through hierarchical clustering, including constructing a gene co-expression network by gene:gene correlations across samples, identifying co-expression modules by dynamic cutting of hierarchical clustering trees, and correlating module eigengenes with phenotypic information.

FIGS. 150A-150C show that WGCNA identified modules with significant correlations to clinical variables in DLE datasets. WGCNA identified 41 modules for GSE72535, 23 modules for GSE81071, and 30 modules for GSE52471. FIG. 150A shows that in GSE72535, 12 modules were significantly correlated to CLASI.A or cohort (5 positively and 7 negatively). FIGS. 150B-150C show that in GSE81071 (FIG. 150B) and GSE52471 (FIG. 150C), 7 modules were significantly correlated to cohort (GSE81071: 4 positively and 3 negatively; GSE52471: 2 positively and 5 negatively).

FIGS. 151A-151B show WGCNA modules interrogated using BIG-C® functional characterizations as well as I-Scope™ and T-Scope™ for specific cellular subsets. DLE-associated modules identified in WGCNA are characterized by BIG-C® (FIG. 151A) and I-Scope™/T-Scope™ (FIG. 151B). Odds ratios above 1 are shown, and Fisher's exact tests with p-values below 0.05 are indicated with an asterisk. Consistent enrichment of several categories, including immune signaling, pattern recognition receptors, and pro-apoptosis, was seen across all three analyses. Additionally, a clear immune signature, including antigen presenting cells, T cells, and myeloid cells, was observed in positively correlated modules.

FIG. 152 shows WGCNA modules statistically preserved and common DE genes between three analyses. Module preservation was performed for each pairwise combination of datasets. The preservation Zsummary statistic was used to determine significant preservation. A representative example of the WGCNA modules from GSE81071 in the preservation analysis between GSE81071 and GSE52471 is shown. The overlap p-value (Fisher's exact test) was used to determine specific module associations between datasets. Interestingly, the analyses consistently showed the preservation of the two positively correlated modules in each dataset (Turquoise and Plum2 in GSE72535, Brown and Magenta in GSE81071, and Blue and LightGreen in GSE52471).

FIG. 153 shows BIG-C®, I-Scope™ and T-Scope™ analysis results in the preserved modules and common DE genes. The analysis compared DE genes common to all three datasets and the 6 preserved DLE-associated WGCNA modules. BIG-C® (left) and I-Scope™ or T-Scope™ categories (right) found to have an odds ratio above 1 in both DE transcripts and at least one module from each dataset are shown. Fisher's exact tests with p-values below 0.05 are indicated with an asterisk.

FIGS. 154A-154B show results of IPA® canonical pathway and upstream regulator (UR) analysis. The analysis compared DE genes common to all three datasets and the 6 preserved DLE-associated WGCNA modules. FIG. 154A shows canonical pathways predicted to be significantly activated or inhibited in both DE transcripts and at least one module from each dataset. FIG. 154B shows that a total of 224 URs were significantly activated or inhibited in both the DE transcripts and at least one module from each dataset. The 84 URs targeted by existing drugs are shown and organized by BIG-C™ category. Canonical pathways and upstream regulators were considered significant if |Activation Z-Score|≥2.

DETAILED DESCRIPTION
Analysis by Molecular Endotyping

Unless otherwise defined, all technical terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs.

As used herein, the singular forms “a,” “an,” and “the” include plural references unless the context clearly dictates otherwise. Any reference to “or” herein is intended to encompass “and/or” unless otherwise stated.

As used herein, the term “about” refers to an amount that is near the stated amount by 10%, 5%, or 1%, including increments therein.

As used herein, the phrases “at least one”, “one or more”, and “and/or” are open-ended expressions that are both conjunctive and disjunctive in operation. For example, each of the expressions “at least one of A, B and C”, “at least one of A, B, or C”, “one or more of A, B, and C”, “one or more of A, B, or C” and “A, B, and/or C” means A alone, B alone, C alone, A and B together, A and C together, B and C together, or A, B and C together.

As used herein, the term “Gini impurity” refers to a measure of how often a randomly chosen element from the set may be incorrectly labeled if it is randomly labeled according to the distribution of labels in the subset.
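By way of non-limiting illustration, the Gini impurity of a set of labels may be computed as one minus the sum of the squared label proportions. The following is a minimal sketch; the function name `gini_impurity` and the toy labels are hypothetical and are not part of the disclosed implementation.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity: 1 - sum(p_i ** 2) over the label proportions p_i.

    Equals the chance of mislabeling a randomly chosen element when its
    label is drawn at random from the empirical label distribution.
    """
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

# Hypothetical toy label sets
pure = gini_impurity(["active"] * 4)
mixed = gini_impurity(["active", "active", "inactive", "inactive"])
```

A set containing a single label has an impurity of zero, and an evenly split two-label set has the maximum two-class impurity of 0.5.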

Many complex and multi-systemic diseases and conditions currently pose major diagnostic and therapeutic challenges. Despite the wealth of records from, for example, genetic, epigenetic, and gene expression data that has emerged in the past few years, physicians often still rely on clinical evaluation and laboratory tests, including measurement of autoantibodies and complement levels.

Successful relation of records (e.g., gene expression records) to a specific disease phenotype activity has been attempted, including efforts to identify individual genes that predicted subsequent flares, and through the determination of a discrete group of differentially expressed (DE) genes that may be found in a particular record. Despite these advances, however, no such approach is available with sufficient predictive value to utilize in evaluation and treatment.

As such, there is a need for a predictive tool for evaluating patients at both the chemical and cellular levels to advance personalized treatment. Data analytical techniques such as machine learning enable proper correlation between genetic records and phenotypes.

The machine learning models tested here provide the basis of personalized medicine. Integration of the methods herein with emerging high-throughput record sampling technologies may unlock the potential to develop a simple blood test to predict phenotypic activity. The disclosures herein may be generalized to predict other manifestations, such as organ involvement. A better understanding of the cellular processes that drive pathogenesis may eventually lead to customized therapeutic strategies based on records' unique patterns of cellular activation.

Method of Identifying One or More Records Having a Specific Phenotype

One aspect disclosed herein, per FIG. 1, is a method of identifying one or more records (e.g., raw gene expression data, whole gene expression data, blood gene expression data, or informative gene modules). The method may comprise receiving a plurality of first records 101, receiving a plurality of second records 102, receiving a plurality of third records 104, applying a machine learning algorithm to at least one first record and at least one second record to determine a classifier (e.g., a machine learning classifier) 103, and applying the classifier to the plurality of third records 105. Applying the classifier to the plurality of third records 105 may identify one or more third records associated with the specific phenotype. In some embodiments, applying a machine learning algorithm to the third data set 105 comprises applying a machine learning algorithm to a plurality of unique third data sets.
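By way of non-limiting illustration, the flow of FIG. 1 may be sketched as follows: first and second records are pooled to determine a classifier, which is then applied to distinct third records. The sketch assumes a deliberately simple nearest-centroid rule as a stand-in for the machine learning algorithm; the function names, the two-feature records, and the labels are all hypothetical.

```python
from statistics import fmean

def train_centroids(records, labels):
    """Return one mean feature vector (centroid) per phenotype label."""
    by_label = {}
    for rec, lab in zip(records, labels):
        by_label.setdefault(lab, []).append(rec)
    return {lab: [fmean(col) for col in zip(*recs)]
            for lab, recs in by_label.items()}

def classify(centroids, record):
    """Assign the phenotype whose centroid is nearest (squared Euclidean)."""
    def dist2(centroid):
        return sum((x - y) ** 2 for x, y in zip(record, centroid))
    return min(centroids, key=lambda lab: dist2(centroids[lab]))

# Hypothetical toy data: (feature vector, phenotype) pairs
first_records = [([2.0, 0.1], "active"), ([0.2, 1.9], "inactive")]
second_records = [([1.8, 0.3], "active"), ([0.1, 2.2], "inactive")]
third_records = [[1.9, 0.2], [0.3, 2.0], [2.1, 0.0]]  # unlabeled, distinct

# Determine the classifier from the first and second records
training = first_records + second_records
centroids = train_centroids([r for r, _ in training],
                            [lab for _, lab in training])

# Apply the classifier to the third records and keep the specific phenotype
predicted = [classify(centroids, r) for r in third_records]
hits = [r for r, p in zip(third_records, predicted) if p == "active"]
```

Any of the classifiers described herein may take the place of the nearest-centroid rule without changing the overall flow.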

Records

The records may comprise, for example, raw gene expression data, whole gene expression data, blood gene expression data, informative gene modules, or any combination thereof. The records may be generated by Weighted Gene Co-expression Network Analysis (WGCNA). In some embodiments, at least one of the first records and the second records comprise nucleic acid sequencing data, transcriptome data, genome data, epigenome data, proteome data, metabolome data, virome data, methylome data, lipidomic data, lineage-ome data, nucleosomal occupancy data, a genetic variant, a gene fusion, an insertion or deletion (indel), or any combination thereof. In some embodiments, the first records and the second records are in different formats. In some embodiments, the first records and the second records are from different sources, different studies, or both.

In some embodiments each record is associated with a specific phenotype (e.g., a disease state, an organ involvement, or a medication response). Each first record may be associated with one or more of a plurality of phenotypes. The plurality of second records and the plurality of first records may be non-overlapping. The third records may be distinct from the plurality of first records, the plurality of second records, or both. The third records may comprise a plurality of unique third data sets.

The records may be received from the Gene Expression Omnibus. The records may be associated with purified cell populations, whole blood gene expression, or both. The raw Gene Expression Omnibus source may comprise GSE10325 (e.g., from www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE10325), GSE26975 (e.g., from www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE26975), GSE38351 (e.g., from www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE38351), GSE39088 (e.g., from www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE39088), GSE45291 (e.g., from www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE45291), GSE49454 (e.g., from www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE49454), or any combination thereof.

For example, as the most important genes may be involved in a number of functions other than interferon signaling, such as RNA processing, ubiquitylation, and mitochondrial processes, these pathways may play important roles in directing, or at least be indicative of, phenotypic activity. CD4 T cells originally may contribute the most important modules. However, when the modules are de-duplicated, CD14 monocyte-derived modules prove important, as unique genes expressed by CD14 monocytes in tandem with interferon genes may be informative in the study of cell-specific methods of pathogenesis.

Phenotypes

In some embodiments, the phenotype comprises a disease state, an organ involvement, a medication response, or any combination thereof. The disease state may comprise an active disease state, or an inactive disease state. At least one of the active disease state and the inactive disease state may be characterized by standard clinical composite outcome measures. The active disease state may comprise a Disease Activity Index of 6 or greater.

The disease may comprise an acute disease, a chronic disease, a clinical disease, a flare-up disease, a progressive disease, a refractory disease, a subclinical disease, or a terminal disease. The disease may comprise a localized disease, a disseminated disease, or a systemic disease. The disease may comprise an immune disease, a cancer, a genetic disease, a metabolic disease, an endocrine disease, a neurological disease, a musculoskeletal disease, or a psychiatric disease. The active disease state may comprise a Systemic Lupus Erythematosus Disease Activity Index (SLEDAI) of 6 or greater.

The organ involvement may comprise a possibly involved organ. The possibly involved organ may comprise bone, skin, hematopoietic system, spleen, liver, lung, mucosa, eye, ear, pituitary, or any combination thereof. The medication response may comprise an ultra-rapid metabolizer response, an extensive metabolizer response, an intermediate metabolizer response, or a poor metabolizer response. The ultra-rapid metabolizer response may refer to a record with substantially increased metabolic activity. The extensive metabolizer response may refer to a record with normal metabolic activity. The intermediate metabolizer response may refer to a record with reduced metabolic activity. The poor metabolizer response may refer to a record with little to no functional metabolic activity.

Machine Learning and Classifiers

The classifiers described herein may be used in machine learning algorithms. A variety of machine learning classifiers exist, wherein each classifier produces a unique machine learning process and/or output. The machine learning algorithms may comprise a biased algorithm or an unbiased algorithm. The biased algorithm may comprise Gene Set Variation Analysis (GSVA) enrichment of phenotype-associated cell-specific modules. The unbiased algorithm may employ all available phenotypic data. The machine learning algorithm may comprise an elastic generalized linear model (GLM), a k-nearest neighbors classifier (KNN), a random forest (RF) classifier, or any combination thereof. GLM, KNN, and RF machine learning algorithms may be performed using the glmnet, caret, and randomForest R packages, respectively.

The random forest classifier is able to sort through the inherent heterogeneity of the plurality of records to identify one or more third records associated with the specific phenotype. In some embodiments, the classifier identifies said one or more third records associated with the specific phenotype with an accuracy of at least about 70%. The implementation of the random forest classifier herein enables a specific phenotype association sensitivity of 85% and a specific phenotype association specificity of 83%. Further classifier optimization, however, may yield improved results.
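By way of non-limiting illustration, accuracy, sensitivity, and specificity may be computed from known and predicted phenotype labels as sketched below. The function name `confusion_metrics` and the toy labels are hypothetical, and the values produced are not results of the present disclosure.

```python
def confusion_metrics(truth, predicted, positive="active"):
    """Return sensitivity, specificity, and accuracy for a two-class problem."""
    pairs = list(zip(truth, predicted))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)

    def ratio(num, den):
        # Undefined when a class is absent from the truth labels
        return num / den if den else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),   # true positive rate
        "specificity": ratio(tn, tn + fp),   # true negative rate
        "accuracy": ratio(tp + tn, len(pairs)),
    }

# Hypothetical toy labels: 4 active and 4 inactive records
truth = ["active"] * 4 + ["inactive"] * 4
predicted = ["active", "active", "active", "inactive",
             "inactive", "inactive", "inactive", "active"]
metrics = confusion_metrics(truth, predicted)
```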

KNN may classify unknown samples based on their proximity to a set number K of known samples. K may be 5% of the size of the pluralities of first, second, and third records. Alternatively, K may be 1%, 2%, 3%, 4%, 5%, 6%, 7%, 8%, 9%, 10%, or any increment therein. A large K value may enable more precise calculations with less overall noise. Alternatively, the K value may be determined through cross-validation by using an independent set of records to validate the K value. If the initial value of K is even, 1 may be added in order to avoid ties. RF may generate 500 decision trees which vote on the class of each sample. The Gini impurity index, a standard measure of misclassification error, may be used to rank the importance of the variables on which the trees split. In addition, pooled predictions may be assigned based on the average class probabilities across the three classifiers.
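By way of non-limiting illustration, the selection of K and the pooling of predictions may be sketched as follows. The rounding rule, the 0.5 decision threshold, the function names, and the example probabilities are assumptions made for illustration only.

```python
def choose_k(n_records, fraction=0.05):
    """Pick K as a fraction of the record count, bumped to odd to avoid ties."""
    k = max(1, round(n_records * fraction))  # rounding rule is an assumption
    if k % 2 == 0:
        k += 1  # add 1 when the initial value is even
    return k

def pooled_prediction(prob_active, threshold=0.5):
    """Average the 'active' class probability across classifiers.

    prob_active maps a classifier name to its predicted probability that
    the record is active; the threshold is an assumed cut-off.
    """
    mean_p = sum(prob_active.values()) / len(prob_active)
    label = "active" if mean_p >= threshold else "inactive"
    return label, mean_p

k_small = choose_k(100)   # 5% of 100 records
k_bumped = choose_k(156)  # 5% of 156 rounds to an even number, so 1 is added

# Hypothetical class probabilities from the three classifiers for one record
label, mean_p = pooled_prediction({"GLM": 0.62, "KNN": 0.40, "RF": 0.71})
```

Here one classifier alone would have called the record inactive, while the averaged probability across the three classifiers assigns it to the active class.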

The GLM algorithm may carry out logistic regression with a tunable elastic penalty term to find a balance between an L1 (LASSO) penalty and an L2 (ridge) penalty, whereby the penalties facilitate variable selection in order to generate sparse solutions. Least Absolute Shrinkage and Selection Operator (LASSO) is a regularization and feature selection technique used to reduce overfitting in regression problems. Ridge regression employs a penalty term to shrink the coefficient values. In some embodiments, the elastic generalized linear model classifier employs an elastic penalty of about 0.9, wherein the penalty is 90% lasso and 10% ridge. The elastic penalty may be 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, or any increments therein.
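By way of non-limiting illustration, the elastic penalty may be written, following the parameterization used by the glmnet package, as λ[α‖β‖₁ + ((1 − α)/2)‖β‖₂²], where α is the mixing value (0.9 for 90% lasso and 10% ridge). The sketch below only evaluates this penalty for a given coefficient vector and does not fit a model; the function name, the value of λ, and the coefficients are hypothetical.

```python
def elastic_penalty(coefs, alpha=0.9, lam=1.0):
    """Elastic net penalty: lam * (alpha * L1 + (1 - alpha) / 2 * L2^2).

    alpha = 1 gives a pure lasso penalty; alpha = 0 gives a pure ridge penalty.
    """
    l1 = sum(abs(b) for b in coefs)       # sum of absolute coefficients
    l2_sq = sum(b * b for b in coefs)     # sum of squared coefficients
    return lam * (alpha * l1 + (1.0 - alpha) / 2.0 * l2_sq)

coefs = [1.0, -2.0, 0.0]  # hypothetical regression coefficients
mostly_lasso = elastic_penalty(coefs, alpha=0.9)
pure_lasso = elastic_penalty(coefs, alpha=1.0)
pure_ridge = elastic_penalty(coefs, alpha=0.0)
```

Because the L1 term grows linearly even for small coefficients, values of α near 1 tend to drive uninformative coefficients exactly to zero, which is the source of the sparse solutions described above.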

Records may be classified as active or inactive using two different methodologies: (1) a leave-one-study-out cross-validation approach or (2) a 10-fold cross-validation approach. GLM, KNN, and RF classifiers may be tasked with identifying active and inactive state records based on whole blood (WB) gene expression data and module enrichment data.
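By way of non-limiting illustration, the two cross-validation approaches differ only in how records are partitioned into training and test sets, as sketched below. The function names, the random seed, and the assignment of records to studies are hypothetical; the study identifiers are used solely as example labels.

```python
import random

def leave_one_study_out(study_ids):
    """Yield (held_out_study, train_indices, test_indices) per study."""
    for study in sorted(set(study_ids)):
        test = [i for i, s in enumerate(study_ids) if s == study]
        train = [i for i, s in enumerate(study_ids) if s != study]
        yield study, train, test

def k_fold(n_records, k=10, seed=0):
    """Yield (train_indices, test_indices) for k shuffled folds.

    Assumes n_records >= k so that no fold is empty.
    """
    idx = list(range(n_records))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

# Hypothetical assignment of 20 records to three studies
studies = ["GSE39088"] * 8 + ["GSE45291"] * 7 + ["GSE49454"] * 5

loso_splits = list(leave_one_study_out(studies))
tenfold_splits = list(k_fold(len(studies), k=10))
```

In the leave-one-study-out approach, every test record comes from a study the classifier has never seen, whereas in 10-fold cross-validation each test fold may contain records from studies that are also represented in the training folds.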

Supervised classification approaches using elastic generalized linear modeling, k-nearest neighbors, and random forest classifiers may be implemented. The trends in performance when cross-validating by one of the pluralities of records or cross-validating 10-fold display the potential advantages and disadvantages of diagnostic tests incorporating gene expression data or module enrichment. Cross-validating by one of the pluralities of records may be used to generalize 1-fold cross validation as a suboptimal scenario, whereas a 10-fold cross-validation is in fact more optimal. Although classification of active and inactive records from the pluralities of different records with 1-fold cross-validation may be suboptimal, module enrichment may be employed to smooth out much of the technical variation between data sets. 10-fold cross-validation may enable a more standardized diagnostic test. Although the plurality of second records and the plurality of first records are non-overlapping, the test set employs overlapping records to facilitate proper classification.

Furthermore, modules that may be negatively associated with phenotypic activity may be just as important in classification as positively associated modules. Further study of underrepresented categories of transcripts may enhance understanding and correlation of phenotypic activity.

Reduction of technical noise may improve classification. For example, RNA-Seq platforms, which produce transcript count records rather than probe intensity values, may display less technical variation across records if all samples are processed in the same way.

The strong performance of the random forest classifier indicates that nonlinear, decision tree-based methods of classification may be ideal because decision trees ask questions about new records sequentially and adaptively. Random forest does not apply a one-size-fits-all approach to each of the different types of records, which allows for classification of records whose expression patterns make them a minority within their phenotype. As such, active records that do not resemble the majority of active records still have a strong chance of being properly classified by random forest. By contrast, other methods may approach variables from new records all at once.

Filtering

In some embodiments, the method further comprises filtering the first records, the second records, or both. In some embodiments, the filtering comprises normalizing, variance correction, removing outliers, removing background noise, removing data without annotation data, scaling, Weighted Gene Co-expression Network Analysis, enrichment analysis, dimensionality reduction, or any combination thereof.

In some embodiments, the normalizing is performed by Robust Multi-Array Analysis (RMA), Guanine Cytosine Robust Multi-Array Analysis (GCRMA), Linear Models for Microarray Data, variance stabilizing transformation (VST), normal-exponential quantile correction (NEQC), or any combination thereof. RMA may summarize the perfect matches through a median polish algorithm, quantile normalization, or both. Variance-stabilizing transformation may simplify considerations in graphical exploratory data analysis, allow the application of simple regression-based or analysis of variance techniques, or both. Normalized expression values may be variance corrected using local empirical Bayesian shrinkage, and DE may be assessed using the Linear Models for Microarray Data (LIMMA) package. Resulting p-values may be adjusted for multiple hypothesis testing using the Benjamini-Hochberg correction, which yields a false discovery rate (FDR). Significant genes within each study may be filtered to retain DE genes with an FDR<0.2, which may be considered statistically significant. The FDR threshold may be selected a priori to diminish the number of genes that may be excluded as false negatives.
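By way of non-limiting illustration, the Benjamini-Hochberg adjustment and the subsequent FDR<0.2 filter may be sketched as follows. The function name, the gene identifiers, and the p-values are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that the adjusted
    # values never increase as the raw p-values decrease.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical genes and raw p-values from a DE analysis
genes = ["gene1", "gene2", "gene3", "gene4", "gene5"]
pvals = [0.01, 0.04, 0.03, 0.20, 0.50]

fdr = benjamini_hochberg(pvals)
retained = [g for g, q in zip(genes, fdr) if q < 0.2]  # keep DE genes with FDR<0.2
```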

In some embodiments, the variance correction comprises employing a local empirical Bayesian shrinkage, adjusting the p-values for multiple hypothesis testing using the Benjamini-Hochberg correction, removing all data with a false discovery rate of 0.2 or greater, or any combination thereof. The Benjamini-Hochberg procedure may control the false discovery rate that arises from incorrectly rejecting true null hypotheses when many p-values are tested.

In some embodiments, the Weighted Gene Co-expression Network Analysis comprises calculating a topology matrix, clustering the data based on the topology matrix, correlating module eigengenes with traits (for traits on a linear scale by Pearson correlation, for nonparametric traits by Spearman correlation, and for dichotomous traits by point-biserial correlation or t-test), or any combination thereof. A topology matrix may specify the connections between vertices in a directed multigraph.

Log2-normalized microarray expression values from purified CD4, CD14, CD19, CD33, and low density granulocyte (LDG) populations may be used as input to WGCNA to conduct an unsupervised clustering analysis, resulting in co-expression “modules,” or groups of densely interconnected genes which may correspond to comparably regulated biologic pathways. For each experiment, an approximately scale-free topology matrix (TOM) may be first calculated to encode the network strength between probes. Probes may be clustered into WGCNA modules based on TOM distances. Resultant dendrograms of correlation networks may be trimmed to isolate individual modular groups of probes by partitioning around medoids and labeled using color assignments based on module size. Expression profiles of genes within modules may be summarized by a module eigengene (ME), which may be analogous to the module's first principal component. MEs act as characteristic expression values for their respective modules and may be correlated with sample traits such as SLEDAI or cell type by Pearson correlation for continuous or semi-continuous traits and by point-biserial correlation for dichotomous traits.
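By way of non-limiting illustration, the correlation of a module eigengene with a continuous trait (Pearson) or a dichotomous trait (point-biserial) may be sketched as follows. The point-biserial coefficient is computed here as a Pearson correlation against a 0/1 coding of the dichotomous trait, to which it is mathematically equivalent. The function names and all values are hypothetical; computation of the eigengene itself is not shown.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def point_biserial(values, groups):
    """Point-biserial correlation for a trait coded 0/1 in `groups`."""
    return pearson(values, [float(g) for g in groups])

# Hypothetical module eigengene values for four samples
eigengene = [0.1, 0.2, 0.3, 0.4]
sledai = [2, 4, 6, 8]          # continuous or semi-continuous trait
is_active = [0, 0, 1, 1]       # dichotomous trait

r_sledai = pearson(eigengene, sledai)
r_active = point_biserial(eigengene, is_active)
```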

WGCNA modules from CD4, CD14, CD19, and CD33 cells may be tested for correlation to SLEDAI. Plasma cell modules may be generated by differential expression analysis and not WGCNA, but may be included because of the established importance of plasma cells in SLE pathogenesis.

Removing the outliers may be performed by statistical analysis using R and relevant Bioconductor packages. Non-normalized arrays may be inspected for visual artifacts or poor hybridization using Affy QC plots. Principal Component Analysis (PCA) plots may be used to inspect the raw data files for outliers. Data sets culled of outliers may be cleaned of background noise and normalized using RMA, GCRMA, or NEQC where appropriate. Data sets may be then filtered to remove probes with low intensity values and probes without gene annotation data. WB gene expression data sets may be filtered to only include genes that passed quality control in all data sets. Differential expression (DE) analysis and WGCNA may then be carried out on data sets. WB gene expression data sets may then be further processed before machine learning analysis. WB gene expression values may be centered and scaled to have zero-mean and unit-variance within each data set and the standardized expression values from each data set may be joined for classification.
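By way of non-limiting illustration, centering and scaling within each data set before joining may be sketched as follows. The sketch assumes that each data set maps a gene to its expression values across samples, that only genes present in every data set are retained, and that the population standard deviation is used for scaling; the function names and values are hypothetical.

```python
from statistics import fmean, pstdev

def standardize_within(dataset):
    """Center and scale each gene to zero mean and unit variance in one data set."""
    out = {}
    for gene, values in dataset.items():
        mu, sd = fmean(values), pstdev(values)
        # A gene with no variation carries no information; map it to zeros
        out[gene] = [(v - mu) / sd if sd > 0 else 0.0 for v in values]
    return out

def join_datasets(datasets):
    """Standardize each data set separately, then join on the shared genes."""
    standardized = [standardize_within(d) for d in datasets]
    shared = set.intersection(*(set(d) for d in datasets))
    return {g: [v for d in standardized for v in d[g]] for g in sorted(shared)}

# Hypothetical whole blood data sets with different scales and gene coverage
ds1 = {"geneA": [1.0, 2.0, 3.0], "geneB": [5.0, 7.0, 9.0], "geneC": [1.0, 1.0, 2.0]}
ds2 = {"geneA": [10.0, 20.0], "geneB": [0.0, 4.0]}

joined = join_datasets([ds1, ds2])
```

Standardizing before joining places each data set on a common scale, so that platform- or study-level offsets do not dominate the joined expression values used for classification.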

The GSVA-R package may be used as a non-parametric method for estimating the variation of pre-defined gene sets in WB gene expression data sets. Standardized expression values from WB data sets may be used to test for enrichment of cell-specific WGCNA gene modules using the Single-sample Gene Set Enrichment Analysis (ssGSEA) method, which scores single samples in isolation and may thus be shielded from technical variation within and among data sets. Statistical analysis of GSVA enrichment scores may be performed by Spearman correlation or Welch's unequal variances t-test, where appropriate. GSVA may be performed on three WB datasets using 25 WGCNA modules made from purified cells with correlation or published relationship to SLEDAI (Table 1).

Patterns of enrichment of WGCNA modules that are derived from isolated cell populations of WB that are correlated to the phenotype may be more useful than gene expression across the pluralities of records to identify active versus inactive state records. To characterize the relationships between gene signatures from various records and phenotypic activity, WGCNA may be used to generate co-expression gene modules from purified populations of cells from records with an active disease state. Such records may be subsequently tested for enrichment in whole blood of other records. WGCNA analysis of leukocyte subsets may result in several gene modules with significant Pearson correlations to SLEDAI (all |r|>0.47, p<0.05). CD4, CD14, CD19, and CD33 cells yielded 3, 6, 8, and 4 significant modules, respectively (Table 1). Two low-density granulocyte (LDG) modules may be created by performing WGCNA analysis of LDGs along with either neutrophils or HC neutrophils and merging the modules most strongly expressed by LDGs. Two plasma cell (PC) modules may be created by using the most increased and decreased transcripts of isolated plasma cells compared to naïve and memory B cells.

TABLE 1

Gene modules identified as correlating with SLEDAI via WGCNA analysis of leukocytes

| Cell Type | Module Name | Module Size | Correlation with SLEDAI | Top GO Biological Process | Top BIG-C Category |
|---|---|---|---|---|---|
| CD4 | Floralwhite | 237 | 0.81 | type I interferon signaling pathway | Interferon-Stimulated-Genes |
| | Turquoise | 805 | 0.50 | positive reg of ubiquitin-protein ligase | Proteasome |
| | Orangered4 | 237 | −0.77 | translational initiation | mRNA-Translation |
| CD14 | Plum1 | 247 | 0.47 | ubiquitin-dependent protein catabolic process | mRNA-Translation |
| | Yellow | 356 | 0.65 | type I interferon signaling pathway | Interferon-Stimulated-Genes |
| | Greenyellow | 89 | −0.49 | transcription from RNA polymerase II promoter | General-Transcription |
| | Pink | 261 | −0.77 | protein phosphorylation | Endosome-and-Vesicles |
| | Purple | 124 | −0.66 | inositol phosphate metabolic process | Fatty-Acid-Biosynthesis |
| | Sienna3 | 222 | −0.64 | translational initiation | mRNA-Translation |
| CD19 | Darkolivegreen | 591 | 0.78 | cell division | Proteasome |
| | Greenyellow | 251 | 0.66 | Notch signaling pathway | mRNA-Translation |
| | Steelblue | 146 | 0.65 | gluconeogenesis | Glycolysis-Gluconeogenesis |
| | Turquoise | 572 | 0.50 | ER to Golgi vesicle-mediated transport | Unfolded-Protein-and-Stress |
| | Violet | 566 | 0.61 | mitochondrial respiratory chain complex I | Interferon-Stimulated-Genes |
| | Brown | 620 | −0.62 | regulation of transcription, DNA-templated | Chromatin-Remodeling |
| | Green | 541 | −0.49 | transcription, DNA-templated | Transcription-Factors |
| | Skyblue | 755 | −0.74 | viral transcription | mRNA-Translation |
| CD33 | Royalblue | 94 | 0.60 | positive reg of cytosolic calcium ions | Transposon-Control |
| | Sienna3 | 133 | 0.76 | type I interferon signaling pathway | Interferon-Stimulated-Genes |
| | Violet | 177 | 0.79 | defense response to virus | Interferon-Stimulated-Genes |
| | Darkmagenta | 273 | −0.49 | ubiquinone biosynthetic process | MHC-Class-TWO |
| LDG⁺ | LDG_A | 334 | 0.79 | platelet degranulation | Cytoskeleton |
| | LDG_B | 92 | 0.81 | regulation of transcription | Secreted-Immune |
| PC* | PC_Up | 423 | N/A | protein N-linked glycosylation | Endoplasmic-Reticulum |
| | PC_Down | 183 | N/A | antigen processing and presentation MHC II | MHC-Class-TWO |

Gene Ontology (GO) analysis of the genes within each of the modules indicates that some processes, such as those related to interferon signaling, RNA transcription, and protein translation, may be shared among cell types, whereas other processes may be unique to certain cell types (Table 1) and may be used to improve classification of records.

GSVA enrichment may be performed using the 25 cell-specific gene modules in WB from 156 records (82 active, 74 inactive), per Table 4 and FIG. 2E. Of the 25 cell-specific modules, 12 had enrichment scores with significant Spearman correlations to SLEDAI (p<0.05), and 14 had enrichment scores with significant differences between active and inactive state records by Welch's unequal variances t-test (p<0.05), per Table 2. Notably, each cell type produced at least one module with a significant correlation to SLEDAI in WB and at least one module with a significant difference in enrichment scores between active and inactive records, demonstrating a relationship between phenotypic activity in specific cellular subsets and overall phenotypic activity in WB. However, as the Spearman's rho values ranged from −0.40 to +0.36, no one module may have a substantial predictive value. Furthermore, the effect sizes as measured by Cohen's d when testing active versus inactive enrichment scores ranged from −0.85 to +0.79. The CD4 Floralwhite and Orangered4 modules, which had the largest positive and negative effect sizes, respectively, showed a high degree of overlap in the enrichment scores of active and inactive records, per FIGS. 4A and 4B, where error bars indicate mean±standard deviation. WB may be unable to fully separate active records from inactive records.

TABLE 2

Cell-specific modules by Spearman correlation to SLEDAI and active vs. inactive state

Columns 2-3: Spearman correlation to SLEDAI. Columns 4-6: Active vs. Inactive t-test.

Module              | Spearman rho | Spearman p value | t statistic | t-test p value | Cohen's d
CD4_Floralwhite     | 0.360        | 3.90E−06         | 4.90        | 2.40E−06       | 0.788
CD4_Turquoise       | −0.044       | 0.587            | −0.93       | 0.352          | −0.149
CD4_Orangered4      | −0.400       | 2.21E−07         | −5.29       | 4.35E−07       | −0.853
CD14_Plum1          | 0.010        | 0.904            | −0.35       | 0.729          | −0.054
CD14_Yellow         | 0.356        | 4.93E−06         | 4.76        | 4.44E−06       | 0.761
CD14_Greenyellow    | −0.132       | 0.100            | −2.10       | 0.037          | −0.339
CD14_Pink           | −0.026       | 0.751            | 0.13        | 0.894          | 0.021
CD14_Purple         | −0.149       | 0.064            | −1.65       | 0.101          | −0.263
CD14_Sienna3        | −0.368       | 2.27E−06         | −4.99       | 1.62E−06       | −0.799
CD19_Darkolivegreen | 0.020        | 0.809            | −0.06       | 0.953          | −0.010
CD19_Greenyellow    | 0.192        | 0.016            | 2.55        | 0.012          | 0.403
CD19_Steelblue      | 0.016        | 0.838            | 0.55        | 0.580          | 0.089
CD19_Turquoise      | −0.069       | 0.393            | −0.84       | 0.403          | −0.132
CD19_Violet         | −0.087       | 0.282            | −1.48       | 0.141          | −0.236
CD19_Brown          | −0.050       | 0.537            | −1.04       | 0.301          | −0.164
CD19_Green          | −0.150       | 0.062            | −2.07       | 0.040          | −0.330
CD19_Skyblue        | −0.205       | 0.010            | −2.35       | 0.020          | −0.378
CD33_Royalblue      | 0.308        | 8.99E−05         | 3.99        | 1.03E−04       | 0.637
CD33_Sienna3        | 0.362        | 3.41E−06         | 4.69        | 6.15E−06       | 0.753
CD33_Violet         | 0.322        | 4.15E−05         | 4.35        | 2.46E−05       | 0.696
CD33_Darkmagenta    | −0.216       | 6.74E−03         | −2.34       | 0.021          | −0.369
LDG_A               | −0.044       | 0.588            | −0.25       | 0.802          | −0.040
LDG_B               | 0.220        | 5.71E−03         | 2.37        | 0.019          | 0.377
PC_Up               | 0.262        | 9.75E−04         | 3.21        | 1.61E−03       | 0.508
PC_Down             | 0.022        | 0.781            | 0.80        | 0.426          | 0.129
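
By way of non-limiting illustration, the statistics reported in Table 2 (Spearman's rho, the t statistic of Welch's unequal variances t-test, and Cohen's d) may be computed along the lines of the following Python sketch. The function names are hypothetical and are not taken from the disclosure. The sketch assumes Cohen's d is computed with a pooled standard deviation and takes the difference as active minus inactive; p values, which require a t-distribution, are omitted.

```python
import math
from statistics import mean, variance


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


def welch_t(a, b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


def cohens_d(a, b):
    """Cohen's d using the pooled sample standard deviation (an assumption)."""
    na, nb = len(a), len(b)
    pooled = math.sqrt(((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2))
    return (mean(a) - mean(b)) / pooled
```

In such a sketch, `spearman_rho` would be called with a module's enrichment scores and the corresponding SLEDAI values, while `welch_t` and `cohens_d` would be called with the enrichment scores of the active records and of the inactive records, respectively.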

Analysis of individual gene modules derived from peripheral cellular subsets and associated with phenotypic activity may not be sufficient to predict phenotypic activity in unrelated WB data sets, since no single module from any cell type may be able to separate active from inactive state records, per FIG. 2E. Although no single module had a sufficiently high predictive value, many cell-specific gene modules may be combined and optimized to predict phenotypes of active records. Moreover, the results emphasize the need for more advanced analytical approaches that employ gene expression data to predict phenotypic activity.

Performance and Accuracy

When training and testing sets were formed by holding out entire data sets, machine learning algorithms using raw gene expression data had an average classification accuracy of only 53 percent. However, converting this gene expression data to module enrichment improved classification accuracy to 71 percent. When training and testing sets were formed by mixing records from the three data sets, module enrichment remained at a 70 percent classification accuracy, whereas classification accuracy using raw gene expression increased to a mean of 79 percent. The best overall performance came from the random forest classifier, which had a predictive accuracy of 84 percent.

The performance of each machine learning algorithm may be determined by evaluating 2 different forms of cross-validation. A random 10-fold cross-validation may randomly assign each record to one of 10 groups. A leave-one-study-out cross-validation may determine the effects of systematic technical differences among data sets on classification performance. For each pass of cross-validation, one fold or study may be held out as a test set, whereby the classifiers are trained on the remaining data. Accuracy may be assessed as the proportion of records correctly classified across all testing folds. Performance metrics such as sensitivity and specificity may be assessed after cross-validation by agglomerating class probabilities and assignments from each fold or study. Receiver Operating Characteristic (ROC) curves may be generated using the pROC R package.
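
By way of non-limiting illustration, the two forms of cross-validation described above may be sketched in Python as follows. The function names, the `fit` and `predict` callables, and the random seed are hypothetical placeholders for any classifier and are not taken from the disclosure.

```python
import random
from collections import defaultdict


def random_k_fold(n_records, k=10, seed=0):
    """Randomly assign each record index to one of k folds; yield (train, test)."""
    idx = list(range(n_records))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for held_out in folds:
        test = set(held_out)
        train = [i for i in idx if i not in test]
        yield train, sorted(test)


def leave_one_study_out(study_labels):
    """Hold out every record of one study at a time; yield (train, test)."""
    by_study = defaultdict(list)
    for i, study in enumerate(study_labels):
        by_study[study].append(i)
    for study, test in by_study.items():
        train = [i for i, s in enumerate(study_labels) if s != study]
        yield train, test


def cross_validated_accuracy(splits, fit, predict, X, y):
    """Proportion of records correctly classified across all test folds."""
    correct = total = 0
    for train, test in splits:
        # Train only on the records that are not held out in this pass.
        model = fit([X[i] for i in train], [y[i] for i in train])
        for i in test:
            correct += predict(model, X[i]) == y[i]
            total += 1
    return correct / total
```

Because every record appears in exactly one test fold under either scheme, the returned accuracy corresponds to the proportion of records correctly classified across all testing folds.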

The performance of each classifier in each situation is shown in Table 3, and corresponding ROC curves are shown in FIG. 5, where the area under each ROC curve is displayed. In almost all cases, the random forest classifier outperformed the GLM and KNN classifiers, although the results may not be significantly different when assessed by testing for equality of proportions (p>0.05). Pooled predictions based on the class probabilities from the three classifiers may not improve overall performance.
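
By way of non-limiting illustration, one way of pooling class probabilities from several classifiers is sketched below in Python. The disclosure does not specify how the probabilities are pooled; the unweighted average, the 0.5 threshold, and the function name are assumptions made only for this sketch.

```python
def pool_probabilities(prob_by_classifier, threshold=0.5):
    """Average per-record probabilities of the active class across classifiers.

    prob_by_classifier maps a classifier name to a list of probabilities,
    one per record, all lists being in the same record order.
    """
    names = list(prob_by_classifier)
    n_records = len(prob_by_classifier[names[0]])
    pooled = [
        sum(prob_by_classifier[name][i] for name in names) / len(names)
        for i in range(n_records)
    ]
    # Assumed decision rule: call a record active at or above the threshold.
    calls = ["active" if p >= threshold else "inactive" for p in pooled]
    return pooled, calls
```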

TABLE 3

Cross-validation of gene expression and cell modules

Classifier | Study-fold CV: Gene Expression | Study-fold CV: Cell Modules | 10-fold CV: Gene Expression | 10-fold CV: Cell Modules
GLM        | 0.56                           | 0.68                        | 0.80                        | 0.72
KNN        | 0.48                           | 0.68                        | 0.75                        | 0.70
RF         | 0.54                           | 0.74                        | 0.84                        | 0.72
Pooled     | 0.53                           | 0.71                        | 0.78                        | 0.73
Mean (SD)  | 0.53 (0.03)                    | 0.70 (0.03)                 | 0.79 (0.04)                 | 0.72 (0.01)

When cross-validating by study, the use of expression values may achieve an accuracy of only 53 percent, per Table 3, which is consistent with the findings shown in FIGS. 2A-2D that gene expression values may provide less value towards classifying unfamiliar records. When the training records and test records are greatly heterogeneous, the patterns learned by the classifiers may be less helpful for classifying test records. Remarkably, the use of module enrichment scores improved accuracy to approximately 70 percent.

Per Table 3, the 10-fold cross-validation with raw gene expression values may result in better performance compared to the leave-one-study-out cross-validation. This increase in performance may be attributed to the presence of records from each of the pluralities of first, second, and third records in both the training and test sets. In this case, the classifiers may learn patterns inherent to each set of records. In this circumstance, the random forest classifier may be the strongest performer with 84% accuracy (85% sensitivity, 83% specificity), whereby the ROC curve demonstrates an excellent tradeoff between recall and fall-out. The performance of module enrichment, however, may not be substantially different between 10-fold cross-validation and leave-one-study-out cross-validation.
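
By way of non-limiting illustration, the performance metrics referred to above may be computed from agglomerated class assignments and class probabilities as in the following Python sketch. The function names and the label strings are hypothetical; the area under the ROC curve is computed here by its rank interpretation rather than with the pROC R package.

```python
def confusion_metrics(y_true, y_pred, positive="active"):
    """Accuracy, sensitivity (recall), specificity, and fall-out."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),  # true positive rate, i.e. recall
        "specificity": tn / (tn + fp),  # true negative rate
        "fall_out": fp / (fp + tn),     # false positive rate = 1 - specificity
    }


def roc_auc(y_true, scores, positive="active"):
    """Area under the ROC curve: the probability that a randomly chosen
    positive record scores higher than a randomly chosen negative record,
    counting ties as one half."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

The ROC curve plots sensitivity (recall) against fall-out as the decision threshold on the class probability is varied, which is why the two quantities are reported together.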

Overall, in a study-by-study approach (leave-one-study-out cross-validation), module enrichment may be more successful than raw gene expression. Importantly, when using the 10-fold cross-validation approach, raw gene expression may outperform module enrichment. Thus, phenotypic activity classification based on raw gene expression may be sensitive to technical variability, whereas classification based on module enrichment may cope better with variation among data sets.

The variable importance feature of random forest provides insight into the drivers of the identification of phenotypic activity. Random forest classifiers may be trained on all records from each of the pluralities of records in order to identify the most important genes and modules as determined by mean decrease in the Gini impurity, a measure of misclassification error.
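
By way of non-limiting illustration, the Gini impurity and its decrease at a single split may be computed as in the following Python sketch. The function names are hypothetical. In a random forest, the importance of a variable is typically obtained by accumulating such decreases over every split made on that variable and averaging over the trees; that aggregation is not shown here.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of a node: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_decrease(parent, left, right):
    """Decrease in Gini impurity obtained by splitting parent into left and right,
    with each child weighted by its share of the parent's records."""
    n = len(parent)
    weighted_children = (
        len(left) / n * gini_impurity(left)
        + len(right) / n * gini_impurity(right)
    )
    return gini_impurity(parent) - weighted_children
```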

As shown in FIGS. 6A-6C, the most important genes and modules identified a wide array of cell types and biological functions. The most important genes encompass such diverse functions as interferon signaling, pattern recognition receptor signaling, and control of survival and proliferation, per FIG. 6C. Notably, the most influential modules may be skewed away from B cell-derived modules and towards T cell- and myeloid cell-derived modules, per FIG. 6A. As some of these modules had overlapping genes, the variable importance experiment may be repeated with modules that may be first scrubbed of any genes that appeared in more than one module before GSVA enrichment scoring. The relative variable importance scores of the de-duplicated modules correlated strongly with those of the original modules (Spearman's rho=0.73, p=5.18E−5), indicating that module behavior may be partly driven by the overlapping genes but strongly driven by unique genes, per FIG. 6A. The variable importance of the top 25 individual genes is also shown. In the figures, LDG denotes low-density granulocyte and PC denotes plasma cell.
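
By way of non-limiting illustration, scrubbing modules of any genes that appear in more than one module may be sketched in Python as follows. The function name, module names, and gene identifiers are hypothetical placeholders and do not reflect the actual module membership.

```python
from collections import Counter


def scrub_shared_genes(modules):
    """Remove from every module any gene that appears in more than one module.

    modules maps a module name to a list of gene identifiers.
    """
    # Count in how many distinct modules each gene occurs.
    module_count = Counter(g for genes in modules.values() for g in set(genes))
    return {
        name: [g for g in genes if module_count[g] == 1]
        for name, genes in modules.items()
    }


# Hypothetical placeholder data for illustration only.
example_modules = {
    "module_1": ["GENE_A", "GENE_B", "GENE_C"],
    "module_2": ["GENE_B", "GENE_D"],
}
deduplicated = scrub_shared_genes(example_modules)
```

The de-duplicated modules would then be passed to enrichment scoring in place of the original modules.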

CD4_Floralwhite and CD14_Yellow, two interferon-related modules which maintained high importance after deduplication, may be further analyzed to study the effect of unique genes on module importance. Gene lists may be tested for statistical overrepresentation of Gene Ontology biological process terms with FDR correction on pantherdb.org. CD4_Floralwhite did not show any significant enrichment, but CD14_Yellow, which had the highest importance after deduplication, may be highly enriched for genes with the “Immune Effector Process” designation (26/77 genes, FDR=9.38E−11 by Fisher's exact test). This suggests that CD14+ monocytes express unique genes that may play important roles in the initiation of phenotypic activity.
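
By way of non-limiting illustration, a one-sided test for statistical overrepresentation of an annotation term in a gene list may be sketched in Python as follows. The one-sided Fisher's exact test is equivalent to the upper tail of the hypergeometric distribution computed here. The function name and parameter names are hypothetical, the background counts used by pantherdb.org are not given in the disclosure, and the FDR correction across terms is not shown.

```python
from math import comb


def overrepresentation_p(k, n, K, N):
    """Upper-tail hypergeometric probability P(X >= k).

    k: genes in the list that carry the term
    n: size of the gene list
    K: genes in the background that carry the term
    N: size of the background
    """
    upper = min(n, K)
    # Exact integer arithmetic in the numerator avoids floating-point underflow.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / comb(N, n)
```

For the CD14_Yellow example above, `k` would be 26 and `n` would be 77, with `K` and `N` taken from the annotation database used for the test.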

Several important findings on the topic of gene expression heterogeneity within and across data sets have been elucidated by this study. First, DE analysis of active vs inactive records may be insufficient for proper classification of phenotypic activity, as systematic differences between data sets render conventional bioinformatics techniques largely non-generalizable.

Further, WGCNA modules created from the cellular components of WB and correlated to SLEDAI phenotypic activity may improve classification of phenotypic activity in records. The use of cell-specific gene modules based on a priori knowledge about their relevance to disease fared slightly better than raw gene expression, as it generated informative enrichment patterns, and many of the modules maintained significant correlations with SLEDAI in WB. However, these enrichment scores failed to completely separate active records from inactive records by hierarchical clustering.

Method Characterization

Conventional bioinformatics approaches do not satisfactorily identify one or more records having a specific phenotype. DE analysis of a plurality of first records, a plurality of second records, and a plurality of third records having an active disease state and a non-active disease state, per FIGS. 2A-2D, displayed major differences and heterogeneity. First, the 100 most significant DE genes by FDR in each of the pluralities of first, second, and third records may be used to carry out hierarchical clustering of active and inactive disease state records, per FIGS. 2A-2C. Active disease state records are clearly separated from inactive records, per FIG. 2B, but only partially separated from inactive records, per FIGS. 2A and 2C.

Out of 6,640 unique DE genes from the three pluralities of records, 5,170 genes are unique to one of the pluralities of records, 1,234 are shared by two of the pluralities of records, and 36 are shared by all three of the pluralities of records. Per FIG. 3, there is minimal overlap of the 100 most significant genes by FDR in each of the pluralities of records. The only overlaps among the top 100 DE genes in each study by FDR are: TYW3 and EHBP1, shared between the plurality of first records and the plurality of third records; and LZIC, shared between the plurality of first records and the plurality of second records. Furthermore, the fold change distributions of the 100 most significant DE genes in each of the pluralities of records varied considerably. In the plurality of first records, 94 of the 100 most significant genes are downregulated in active disease state records; in the plurality of second records, all of the top 100 genes are upregulated in active disease state records; and in the plurality of third records, the top 100 genes are more evenly distributed (41 up, 59 down). Per FIG. 3, orange bars denote active state records and black bars denote inactive state records.

The plurality of first, second, and third records may represent different populations and may be collected on different microarray platforms per Table 4 below. The lack of commonality among the genes most descriptive of active state records and inactive state records in each of the pluralities of records casts doubt on whether active and inactive states from the different pluralities of records may be easily determined using conventional techniques.

TABLE 4

Accession of records by microarray platform, number of active and inactive records, SLEDAI range, and SLEDAI mean

Accession                   | Microarray Platform                | N Active | N Inactive | SLEDAI Range | SLEDAI Mean (SD)
Plurality of First Records  | GPL570 (Affymetrix HG-U133+ 2.0)   | 24       | 13         | 2-12         | 6.8 (2.7)
Plurality of Second Records | GPL13158 (Affymetrix HG-U133+ PM)  | 35       | 35         | 0-11         | 4.3 (3.5)

Records from the pluralities of first, second, and third records may then be joined to evaluate whether unsupervised techniques may separate active state records from inactive state records. Hierarchical clustering on the 297 unique most significant DE genes by FDR showed considerable heterogeneity, and active records and inactive records did not consistently separate, per the heat map of the top 100 DE genes by FDR from each of the pluralities of records (combined total of 297 unique genes from the plurality of first, second, and third records) expressed in all records in FIG. 2D. As such, conventional techniques failed to identify active records, highlighting the need for more advanced algorithms.

Digital Processing Device

In some embodiments, the platforms, systems, media, and methods described herein include a digital processing device, or use of the same. In further embodiments, the digital processing device includes one or more hardware central processing units (CPUs) or general purpose graphics processing units (GPGPUs) that carry out the device's functions. In still further embodiments, the digital processing device further comprises an operating system configured to perform executable instructions. In some embodiments, the digital processing device is optionally connected to a computer network. In further embodiments, the digital processing device is optionally connected to the Internet such that it accesses the World Wide Web. In still further embodiments, the digital processing device is optionally connected to a cloud computing infrastructure. In other embodiments, the digital processing device is optionally connected to an intranet. In other embodiments, the digital processing device is optionally connected to a data storage device.

In accordance with the description herein, suitable digital processing devices include, by way of non-limiting examples, server computers, desktop computers, laptop computers, notebook computers, sub-notebook computers, netbook computers, netpad computers, set-top computers, media streaming devices, handheld computers, Internet appliances, mobile smartphones, tablet computers, personal digital assistants, video game consoles, and vehicles. Those of skill in the art will recognize that many smartphones are suitable for use in the system described herein. Those of skill in the art will also recognize that select televisions, video players, and digital music players with optional computer network connectivity are suitable for use in the system described herein. Suitable tablet computers include those with booklet, slate, and convertible configurations, known to those of skill in the art.

In some embodiments, the digital processing device includes an operating system configured to perform executable instructions. The operating system is, for example, software, including programs and data, which manages the device's hardware and provides services for execution of applications. Those of skill in the art will recognize that suitable server operating systems include, by way of non-limiting examples, FreeBSD, OpenBSD, NetBSD®, Linux, Apple® Mac OS X Server®, Oracle® Solaris®, Windows Server®, and Novell® NetWare®. Those of skill in the art will recognize that suitable personal computer operating systems include, by way of non-limiting examples, Microsoft® Windows®, Apple® Mac OS X®, UNIX®, and UNIX-like operating systems such as GNU/Linux®. In some embodiments, the operating system is provided by cloud computing. Those of skill in the art will also recognize that suitable mobile smart phone operating systems include, by way of non-limiting examples, Nokia® Symbian® OS, Apple® iOS®, Research In Motion® BlackBerry OS®, Google® Android®, Microsoft® Windows Phone® OS, Microsoft® Windows Mobile® OS, Linux®, and Palm® WebOS®. Those of skill in the art will also recognize that suitable media streaming device operating systems include, by way of non-limiting examples, Apple TV®, Roku®, Boxee®, Google TV®, Google Chromecast®, Amazon Fire®, and Samsung® HomeSync®. Those of skill in the art will also recognize that suitable video game console operating systems include, by way of non-limiting examples, Sony® PS3®, Sony® PS4®, Microsoft® Xbox 360®, Microsoft Xbox One, Nintendo® Wii®, Nintendo® Wii U®, and Ouya®.

In some embodiments, the device includes a storage and/or memory device. The storage and/or memory device is one or more physical apparatuses used to store data or programs on a temporary or permanent basis. In some embodiments, the device is volatile memory and requires power to maintain stored information. In some embodiments, the volatile memory comprises dynamic random-access memory (DRAM). In some embodiments, the device is non-volatile memory and retains stored information when the digital processing device is not powered. In further embodiments, the non-volatile memory comprises flash memory. In some embodiments, the non-volatile memory comprises ferroelectric random access memory (FRAM). In some embodiments, the non-volatile memory comprises phase-change random access memory (PRAM). In other embodiments, the device is a storage device including, by way of non-limiting examples, CD-ROMs, DVDs, flash memory devices, magnetic disk drives, magnetic tape drives, optical disk drives, and cloud computing-based storage. In further embodiments, the storage and/or memory device is a combination of devices such as those disclosed herein.

In some embodiments, the digital processing device includes a display to send visual information to a user. In some embodiments, the display is a liquid crystal display (LCD). In further embodiments, the display is a thin film transistor liquid crystal display (TFT-LCD). In some embodiments, the display is an organic light emitting diode (OLED) display. In various further embodiments, an OLED display is a passive-matrix OLED (PMOLED) or active-matrix OLED (AMOLED) display. In some embodiments, the display is a plasma display. In other embodiments, the display is a video projector. In yet other embodiments, the display is a head-mounted display in communication with the digital processing device, such as a VR headset. In further embodiments, suitable VR headsets include, by way of non-limiting examples, HTC Vive, Oculus Rift, Samsung Gear VR, Microsoft HoloLens, Razer OSVR, FOVE VR, Zeiss VR One, Avegant Glyph, Freefly VR headset, and the like. In still further embodiments, the display is a combination of devices such as those disclosed herein.

In some embodiments, the digital processing device includes an input device to receive information from a user. In some embodiments, the input device is a keyboard. In some embodiments, the input device is a pointing device including, by way of non-limiting examples, a mouse, trackball, track pad, joystick, game controller, or stylus. In some embodiments, the input device is a touch screen or a multi-touch screen. In other embodiments, the input device is a microphone to capture voice or other sound input. In other embodiments, the input device is a video camera or other sensor to capture motion or visual input. In further embodiments, the input device is a Kinect, Leap Motion, or the like. In still further embodiments, the input device is a combination of devices such as those disclosed herein.

Referring to FIG. 7, in a particular embodiment, a digital processing device 701 is programmed or otherwise configured to identify one or more records having a specific phenotype. In this embodiment, the digital processing device 701 includes a central processing unit (CPU, also “processor” and “computer processor” herein) 705, which is optionally a single core, a multi core processor, or a plurality of processors for parallel processing. The digital processing device 701 also includes memory or memory location 710 (e.g., random-access memory, read-only memory, flash memory), electronic storage unit 715 (e.g., hard disk), communication interface 720 (e.g., network adapter) for communicating with one or more other systems, and peripheral devices 725, such as cache, other memory, data storage and/or electronic display adapters. The memory 710, storage unit 715, interface 720 and peripheral devices 725 are in communication with the CPU 705 through a communication bus (solid lines), such as a motherboard. The storage unit 715 comprises a data storage unit (or data repository) for storing data. The digital processing device 701 is optionally operatively coupled to a computer network (“network”) 730 with the aid of the communication interface 720. The network 730, in various cases, is the internet, an internet, and/or extranet, or an intranet and/or extranet that is in communication with the internet. The network 730, in some cases, is a telecommunication and/or data network. The network 730 optionally includes one or more computer servers, which enable distributed computing, such as cloud computing. The network 730, in some cases, with the aid of the device 701, implements a peer-to-peer network, which enables devices coupled to the device 701 to behave as a client or a server.

Continuing to refer to FIG. 7, the CPU 705 is configured to execute a sequence of machine-readable instructions, embodied in a program, application, and/or software. The instructions are optionally stored in a memory location, such as the memory 710. The instructions are directed to the CPU 705, which subsequently program or otherwise configure the CPU 705 to implement methods of the present disclosure. Examples of operations performed by the CPU 705 include fetch, decode, execute, and write back. The CPU 705 is, in some cases, part of a circuit, such as an integrated circuit. One or more other components of the device 701 are optionally included in the circuit. In some cases, the circuit is an application specific integrated circuit (ASIC) or a field programmable gate array (FPGA).

Continuing to refer to FIG. 7, the storage unit 715 optionally stores files, such as drivers, libraries and saved programs. The storage unit 715 optionally stores user data, e.g., user preferences and user programs. The digital processing device 701, in some cases, includes one or more additional data storage units that are external, such as located on a remote server that is in communication through an intranet or the internet.

Continuing to refer to FIG. 7, the digital processing device 701 optionally communicates with one or more remote computer systems through the network 730. For instance, the device 701 optionally communicates with a remote computer system of a user. Examples of remote computer systems include personal computers (e.g., portable PC), slate or tablet PCs (e.g., Apple® iPad, Samsung® Galaxy Tab, etc.), smartphones (e.g., Apple® iPhone, Android-enabled device, Blackberry®, etc.), or personal digital assistants.

Methods as described herein are optionally implemented by way of machine (e.g., computer processor) executable code stored on an electronic storage location of the digital processing device 701, such as, for example, on the memory 710 or electronic storage unit 715. The machine executable or machine readable code is optionally provided in the form of software. During use, the code is executed by the processor 705. In some cases, the code is retrieved from the storage unit 715 and stored on the memory 710 for ready access by the processor 705. In some situations, the electronic storage unit 715 is precluded, and machine-executable instructions are stored on the memory 710.

Non-Transitory Computer Readable Storage Medium

In some embodiments, the platforms, systems, media, and methods disclosed herein include one or more non-transitory computer readable storage media encoded with a program including instructions executable by the operating system of an optionally networked digital processing device. In further embodiments, a computer readable storage medium is a tangible component of a digital processing device. In still further embodiments, a computer readable storage medium is optionally removable from a digital processing device. In some embodiments, a computer readable storage medium includes, by way of non-limiting examples, CD-ROMs, DVDs, flash memory devices, solid state memory, magnetic disk drives, magnetic tape drives, optical disk drives, cloud computing systems and services, and the like. In some cases, the program and instructions are permanently, substantially permanently, semi-permanently, or non-transitorily encoded on the media.

Computer Program

In some embodiments, the platforms, systems, media, and methods disclosed herein include at least one computer program, or use of the same. A computer program includes a sequence of instructions, executable in the digital processing device's CPU, written to perform a specified task. Computer readable instructions may be implemented as program modules, such as functions, objects, Application Programming Interfaces (APIs), data structures, and the like, that perform particular tasks or implement particular abstract data types. In light of the disclosure provided herein, those of skill in the art will recognize that a computer program may be written in various versions of various languages.

The functionality of the computer readable instructions may be combined or distributed as desired in various environments. In some embodiments, a computer program comprises one sequence of instructions. In some embodiments, a computer program comprises a plurality of sequences of instructions. In some embodiments, a computer program is provided from one location. In other embodiments, a computer program is provided from a plurality of locations. In various embodiments, a computer program includes one or more software modules. In various embodiments, a computer program includes, in part or in whole, one or more web applications, one or more mobile applications, one or more standalone applications, one or more web browser plug-ins, extensions, add-ins, or add-ons, or combinations thereof.

Web Application

In some embodiments, a computer program includes a web application. In light of the disclosure provided herein, those of skill in the art will recognize that a web application, in various embodiments, utilizes one or more software frameworks and one or more database systems. In some embodiments, a web application is created upon a software framework such as Microsoft® .NET or Ruby on Rails (RoR). In some embodiments, a web application utilizes one or more database systems including, by way of non-limiting examples, relational, non-relational, object oriented, associative, and XML database systems. In further embodiments, suitable relational database systems include, by way of non-limiting examples, Microsoft® SQL Server, MySQL™, and Oracle®. Those of skill in the art will also recognize that a web application, in various embodiments, is written in one or more versions of one or more languages. A web application may be written in one or more markup languages, presentation definition languages, client-side scripting languages, server-side coding languages, database query languages, or combinations thereof. In some embodiments, a web application is written to some extent in a markup language such as Hypertext Markup Language (HTML), Extensible Hypertext Markup Language (XHTML), or eXtensible Markup Language (XML). In some embodiments, a web application is written to some extent in a presentation definition language such as Cascading Style Sheets (CSS). In some embodiments, a web application is written to some extent in a client-side scripting language such as Asynchronous Javascript and XML (AJAX), Flash® Actionscript, Javascript, or Silverlight®. In some embodiments, a web application is written to some extent in a server-side coding language such as Active Server Pages (ASP), ColdFusion®, Perl, Java™, JavaServer Pages (JSP), Hypertext Preprocessor (PHP), Python™, Ruby, Tcl, Smalltalk, WebDNA®, or Groovy. In some embodiments, a web application is written to some extent in a database query language such as Structured Query Language (SQL). In some embodiments, a web application integrates enterprise server products such as IBM® Lotus Domino®. In some embodiments, a web application includes a media player element. In various further embodiments, a media player element utilizes one or more of many suitable multimedia technologies including, by way of non-limiting examples, Adobe® Flash®, HTML 5, Apple® QuickTime®, Microsoft® Silverlight®, Java™, and Unity®.

Referring to FIG. 8, in a particular embodiment, an application provision system comprises one or more databases 800 accessed by a relational database management system (RDBMS) 810. Suitable RDBMSs include Firebird, MySQL, PostgreSQL, SQLite, Oracle Database, Microsoft SQL Server, IBM DB2, IBM Informix, SAP Sybase, Teradata, and the like. In this embodiment, the application provision system further comprises one or more application servers 820 (such as Java servers, .NET servers, PHP servers, and the like) and one or more web servers 830 (such as Apache, IIS, GWS and the like). The web server(s) optionally expose one or more web services via application programming interfaces (APIs) 840. Via a network, such as the internet, the system provides browser-based and/or mobile native user interfaces.

Referring to FIG. 9, in a particular embodiment, an application provision system alternatively has a distributed, cloud-based architecture 900 and comprises elastically load balanced, auto-scaling web server resources 910 and application server resources 920 as well as synchronously replicated databases 930.

Standalone Application

In some embodiments, a computer program includes a standalone application, which is a program that is run as an independent computer process, not an add-on to an existing process, e.g., not a plug-in. Those of skill in the art will recognize that standalone applications are often compiled. A compiler is a computer program(s) that transforms source code written in a programming language into binary object code such as assembly language or machine code. Suitable compiled programming languages include, by way of non-limiting examples, C, C++, Objective-C, COBOL, Delphi, Eiffel, Java™, Lisp, Python™, Visual Basic, and VB .NET, or combinations thereof. Compilation is often performed, at least in part, to create an executable program. In some embodiments, a computer program includes one or more executable compiled applications.

Web Browser Plug-in

In some embodiments, the computer program includes a web browser plug-in (e.g., extension, etc.). In computing, a plug-in is one or more software components that add specific functionality to a larger software application. Makers of software applications support plug-ins to enable third-party developers to create abilities which extend an application, to support easily adding new features, and to reduce the size of an application. When supported, plug-ins enable customizing the functionality of a software application. For example, plug-ins are commonly used in web browsers to play video, generate interactivity, scan for viruses, and display particular file types. Those of skill in the art will be familiar with several web browser plug-ins including, Adobe® Flash® Player, Microsoft® Silverlight®, and Apple® QuickTime®.

In view of the disclosure provided herein, those of skill in the art will recognize that several plug-in frameworks are available that enable development of plug-ins in various programming languages, including, by way of non-limiting examples, C++, Delphi, Java™, PHP, Python™, and VB .NET, or combinations thereof.

Web browsers (also called Internet browsers) are software applications, designed for use with network-connected digital processing devices, for retrieving, presenting, and traversing information resources on the World Wide Web. Suitable web browsers include, by way of non-limiting examples, Microsoft® Internet Explorer®, Mozilla® Firefox®, Google® Chrome, Apple® Safari®, Opera Software® Opera®, and KDE Konqueror. In some embodiments, the web browser is a mobile web browser. Mobile web browsers (also called microbrowsers, mini-browsers, and wireless browsers) are designed for use on mobile digital processing devices including, by way of non-limiting examples, handheld computers, tablet computers, netbook computers, subnotebook computers, smartphones, music players, personal digital assistants (PDAs), and handheld video game systems. Suitable mobile web browsers include, by way of non-limiting examples, Google® Android® browser, RIM BlackBerry® Browser, Apple® Safari®, Palm® Blazer, Palm® WebOS® Browser, Mozilla® Firefox® for mobile, Microsoft® Internet Explorer® Mobile, Amazon® Kindle® Basic Web, Nokia® Browser, Opera Software® Opera® Mobile, and Sony® PSP™ browser.

Software Modules

In some embodiments, the platforms, systems, media, and methods disclosed herein include software, server, and/or database modules, or use of the same. In view of the disclosure provided herein, software modules are created by techniques known to those of skill in the art using machines, software, and languages known to the art. The software modules disclosed herein are implemented in a multitude of ways. In various embodiments, a software module comprises a file, a section of code, a programming object, a programming structure, or combinations thereof. In further various embodiments, a software module comprises a plurality of files, a plurality of sections of code, a plurality of programming objects, a plurality of programming structures, or combinations thereof. In various embodiments, the one or more software modules comprise, by way of non-limiting examples, a web application, a mobile application, and a standalone application. In some embodiments, software modules are in one computer program or application. In other embodiments, software modules are in more than one computer program or application. In some embodiments, software modules are hosted on one machine. In other embodiments, software modules are hosted on more than one machine. In further embodiments, software modules are hosted on cloud computing platforms. In some embodiments, software modules are hosted on one or more machines in one location. In other embodiments, software modules are hosted on one or more machines in more than one location.

Databases

In some embodiments, the platforms, systems, media, and methods disclosed herein include one or more databases, or use of the same. In view of the disclosure provided herein, those of skill in the art will recognize that many databases are suitable for identifying one or more records having a specific phenotype. In various embodiments, suitable databases include, by way of non-limiting examples, relational databases, non-relational databases, object oriented databases, object databases, entity-relationship model databases, associative databases, and XML databases. Further non-limiting examples include SQL, PostgreSQL, MySQL, Oracle, DB2, and Sybase. In some embodiments, a database is internet-based. In further embodiments, a database is web-based. In still further embodiments, a database is cloud computing-based. In other embodiments, a database is based on one or more local computer storage devices.
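As a non-limiting illustration of how a relational database may be used to identify records having a specific phenotype, the following minimal sketch uses the SQLite engine bundled with Python. The table names, column names, cohort labels, and phenotype labels are hypothetical and are chosen only for illustration; they are not part of the disclosure.

```python
import sqlite3

# In-memory database; the schema below is a hypothetical example.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE records (
        record_id INTEGER PRIMARY KEY,
        source    TEXT NOT NULL
    );
    CREATE TABLE record_phenotypes (
        record_id INTEGER NOT NULL REFERENCES records(record_id),
        phenotype TEXT NOT NULL,
        PRIMARY KEY (record_id, phenotype)
    );
    """
)

# Each record may be associated with one or more phenotypes (made-up rows).
conn.executemany(
    "INSERT INTO records VALUES (?, ?)",
    [(1, "cohort_a"), (2, "cohort_a"), (3, "cohort_b")],
)
conn.executemany(
    "INSERT INTO record_phenotypes VALUES (?, ?)",
    [(1, "phenotype_x"), (1, "phenotype_y"), (2, "phenotype_y"), (3, "phenotype_x")],
)


def records_with_phenotype(connection, phenotype):
    """Return the sorted IDs of all records associated with the given phenotype."""
    rows = connection.execute(
        "SELECT r.record_id FROM records AS r "
        "JOIN record_phenotypes AS p ON p.record_id = r.record_id "
        "WHERE p.phenotype = ? ORDER BY r.record_id",
        (phenotype,),
    )
    return [row[0] for row in rows]


print(records_with_phenotype(conn, "phenotype_x"))
```

A separate association table is used here so that a single record can carry more than one phenotype; other database designs may be equally suitable.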

Interferon Profiling of Lupus Conditions

A role for interferon (IFN) in SLE pathogenesis may be inferred from the prominent IFN gene signature (IGS), but the major IFN species and its relationship to SLE disease activity may be unknown. A bioinformatic approach employing gene signatures specific for individual IFN species to interrogate SLE microarray datasets may demonstrate a putative role for numerous IFN species, with prominent expression of IFNB1 and IFNW induced genes, and concordance between IFN signatures in MS patients treated with IFNB1 and SLE-affected skin and synovium compared to SLE nephritis, suggesting that IFN signaling is less prominent in SLE renal disease. SLE patients with inactive disease have readily detectable IGS, and the IGS changed synchronously with a monocyte signature but not disease activity, and was significantly related to monocyte transcripts. Monocyte over-expression of three times as many IGS transcripts as T cells and B cells and IGS retention in monocytes, but not T cells and B cells from inactive SLE patients contribute to the lack of correlation between the IGS and SLE disease activity.

A role for interferon (IFN) in the pathogenesis of systemic lupus erythematosus (SLE) has been proposed since early experiments showed elevated IFN activity in SLE patients and the advent of gene expression profiling demonstrated a robust IFN gene signature (IGS) in SLE patient peripheral blood, purified B cells, T cells, monocytes, and affected organs. Various IFN responsive genes have been used to define the IGS but little is understood regarding the specific species of IFN underlying the signature. Notably, there remains a lack of consensus concerning the association of the IGS with SLE disease activity. Although some disease metrics have been associated with the IGS in small studies, longitudinal studies may not show correlation between the IGS and disease activity.

Anecdotal accounts of patients developing SLE-like symptoms after treatment with IFNs have been reported, suggesting that IFN might play a role in the induction of SLE. Moreover, standard of care (SOC) drugs used to treat lupus may eliminate the IGS. Two anti-IFNA antibodies have been used to treat SLE in Phase II clinical trials but with only modest effects. In contrast, a trial using the antibody anifrolumab, which blocks binding of all type I IFNs to the shared IFN receptor, provided clinically meaningful benefit in subjects with SLE and with high IGS scores. These trials raise the important question of whether IFNA (IFN-alpha or IFN-α) is the predominant IFN acting in SLE.

An IGS may be induced by type I or type II IFNs. The human type I IFN locus comprises thirteen IFNA genes (A1, A2, A4, A5, A6, A7, A8, A10, A13, A14, A16, A17, and A21), IFNB1 (IFN-beta1 or IFN-β1), IFNW1 (IFN-omega1 or IFN-ω1), and IFNE (IFN-epsilon or IFN-ε). Despite a similarity in structure and a common receptor, these IFNs may induce different downstream signaling events, although mRNA signatures that distinguish the action of a specific subtype of type I IFN have not been developed or employed to delineate the actions of specific type I IFNs. The type II IFN, IFNG (IFN-gamma or IFN-γ), also induces an IGS through its distinct IFNG receptor and has been shown to be important for pathogenesis in lupus mouse models. The role of IFNG in the pathogenesis of human lupus has been inferred largely through in vitro experiments.

Deconvolution of the IGS in SLE may be performed by creating three modules of IFN genes (M1.2, M3.4, M5.12) from SLE microarray datasets clustered using a K-means algorithm on the basis of their expression. Some correlation between module 5.12 with SLE flares may be noted, and characterization of the module using the IFN database, the Interferome, may be done in an attempt to classify the species of IFN. However, the Interferome may not necessarily reflect the downstream microarray signature present in human cells and tissues.
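The clustering of genes into modules on the basis of expression can be sketched as follows. This is a minimal, non-limiting illustration of K-means (Lloyd's algorithm) written in plain Python, not the procedure used to derive modules M1.2, M3.4, and M5.12. The function name `kmeans`, the gene labels, and the expression values are hypothetical, and seeding the centroids with the first k profiles is a simplifying assumption that makes the result depend on input order.

```python
def kmeans(points, k, max_iterations=100):
    """Cluster points into k groups; returns (labels, centroids).

    Assumption: centroids are seeded with the first k points, which is a
    naive choice used here only to keep the example deterministic.
    """
    centroids = [list(p) for p in points[:k]]
    labels = None
    for _ in range(max_iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        new_labels = [
            min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
        if new_labels == labels:
            break  # assignments no longer change
        labels = new_labels
    return labels, centroids


# Made-up expression profiles (one row per gene, one column per sample).
profiles = {
    "gene_1": [0.1, 0.2, 0.1],
    "gene_2": [2.0, 2.1, 1.9],
    "gene_3": [5.0, 5.2, 4.9],
    "gene_4": [0.2, 0.1, 0.2],
    "gene_5": [2.1, 2.0, 2.0],
    "gene_6": [5.1, 5.0, 5.1],
}
labels, centroids = kmeans(list(profiles.values()), k=3)
modules = dict(zip(profiles, labels))
print(modules)
```

Genes that receive the same label form one module; with real data, the number of modules and the seeding strategy would need to be chosen and validated for the dataset at hand.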

In order to delineate the specific types of IFNs present in SLE and the potential role of specific IFNs in SLE disease processes, systems and methods provided herein may employ a systems-level approach by using multiple, publicly available gene expression datasets from SLE patients, and probing them using reference datasets of the downstream IGS induced in vitro in human peripheral blood mononuclear cells (PBMC) or in vivo in whole blood (WB) by administration of specific IFNs to patients. This approach may allow the determination of the relative contributions of different types of IFN in SLE affected cells and tissues as well as a better understanding of the IGS and its relationship to SLE disease processes.

The present disclosure provides systems and methods to interrogate the IGS in SLE microarray datasets using reference datasets. The use of microarray data from unrelated yet relevant datasets as a tool for microarray dataset interrogation is an important advance, since it does not rely on prior characterization or knowledge of any genes, and also focuses the analysis on gene changes that have been shown to be operative in human samples. Using systems and methods described herein, strong enrichment may be demonstrated for IFNB1 in the SLE skin and synovium, and importantly a strong similarity may be shown between signatures in patients treated chronically with IFNB1 and the SLE WB signature. Moreover, the IGS may be related to monocytes in the analyzed samples.

Z score calculations and GSVA enrichment scores may demonstrate the likely role of IFNB1 in SLE pathogenesis, and suggest that targeting these IFNs in lupus skin and synovium may be more beneficial than blocking IFN in SLE patients with proliferative LN. Effect size values for GSVA enrichment scores and Z scores for IFNs are lower in LN tissue, and about 20% of LN patients may lack a type I IGS. The finding that the kidneys differ from skin and synovium may be unexpected and may not be anticipated from the blood analysis, thereby demonstrating the important contributions of tissue samples to results disclosed herein. Single-cell analysis of hematopoietic cells derived from the kidneys of LN patients demonstrates a low IGS in cells from most patients. These results together with our data may suggest that the IFN signaling pathway may not be as prominent in this tissue compared to skin and synovium. Notably, both skin and synovium are rich in fibroblasts, an important IFNB1-producing cell type, so constitutive IFNB1 production may provide a background of IFN in these tissues, whereas the normal kidney has relatively few fibroblasts.

The greater association between the MS-IFNB1 signature and the SLE IGS signature may be of particular note. The much higher Z scores calculated using the MS-IFNB1 signature for all WB, PBMC, and SLE affected tissues in comparison to the calculated GSVA enrichment scores may be related to the increased overlap of decreased transcripts between the MS-IFNB1 signature and the signature in SLE patients. Long-term exposure to IFNB1 in MS patients may lead to a decrease in transcripts such as CD1C, CD160, IGF1R, and TNFRSF9 (4-1BB) that are also seen in SLE patients. All of these molecules participate in cellular activation, and inhibition of them after long-term exposure to IFNB1 may suggest a shared down-regulatory mechanism between MS patients treated with IFNB1 and SLE patients. Little evidence is shown for enrichment of the non-canonical IFNB1 signaling pathway in SLE affected tissues; however, this conclusion may be tempered by the use of a murine signature derived from IFNAR2 deficient peritoneal exudate cells as a comparator.

Although results show strong enrichment of IFNB1 in SLE, they may not preclude a role for the IFNAs. Indeed, IFNB1 itself has been shown to induce the expression of IFNAs. The two-step model of type I IFN induction by viruses, TLR, or other cytosolic pattern recognition receptors may establish that the activation of the constitutively expressed IRF3 in the cytoplasm leads to the initial induction of only IFNB1. The induced IFNB1 acts on the IFNA/B receptor to induce IRF7 expression by activating ISGF3 in the cytoplasm leading to the induction of IFNAs. IFNW1 is among the most induced genes in humans, along with IFNA2 and IFNB1, after pDC treatment with TLR7 agonists.

The IFNG signature has significant effect size and Z scores for all SLE tissues and most peripheral datasets, albeit lower than the three type I signatures. The induction of type I IFNs in response to virus initiates a cascade of events leading to the recruitment and/or activation of CD8 T cells and natural killer (NK) cells. While IFNG is induced in CD8 T cells, NK cells constitutively express IFNG transcripts, and NK cells are not easily discernible from CD8 T cells by microarray expression. In lupus mouse models, IFNG appears to play a more prominent role than in humans, and a hypothesis is proposed that the presence of IFNG may represent a late stage response to the inappropriate induction of type I IFNs in response to sterile inflammatory stimuli.

Using systems and methods disclosed herein, it may be shown that inactive SLE patients have a readily detectable IGS and that some SLE patients over time may change their IGS status. In two longitudinal datasets assessing SLE patients treated with standard of care (SOC) medications (GSE88885, GSE88886), the gain or loss of the IGS is demonstrated in about 30% of subjects. This change in status in the absence of intense immunotherapy may suggest that the IGS is not stable during the disease process in one third of SLE patients. The results disclosed herein, involving more than 2000 patients, may suggest that there is not a relationship between SLEDAI and the IGS. Additionally, about 30% of the 119 SLE patients on standard of care (SOC) treatment significantly changed their IGS over a one-year period. Notably, no predictable relationship may be measured between the SLEDAI and IGS. In ten SLE LN patients (GSE72747), the IGS did not change synchronously with the SLEDAI, and the change in IGS may be shown to be associated with a change in monocytes.
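The gain or loss of the IGS between two time points can be tallied as sketched below. This is a minimal illustration under stated assumptions: the disclosure does not specify here how IGS-positive status is defined, so the score cutoff, the function names, and the example scores are all hypothetical.

```python
def igs_status_change(baseline_score, follow_up_score, cutoff):
    """Classify a subject as 'gained', 'lost', or 'stable' in IGS status.

    Assumption: a subject is IGS-positive when the score is >= cutoff.
    """
    before = baseline_score >= cutoff
    after = follow_up_score >= cutoff
    if before == after:
        return "stable"
    return "gained" if after else "lost"


def fraction_changed(score_pairs, cutoff):
    """Fraction of subjects whose IGS status differs between the two visits."""
    changes = [igs_status_change(b, f, cutoff) for b, f in score_pairs]
    return sum(change != "stable" for change in changes) / len(changes)


# Made-up (baseline, follow-up) signature scores for four subjects.
example_pairs = [(0.5, 0.6), (0.5, -0.2), (-0.3, 0.4), (-0.4, -0.5)]
print(fraction_changed(example_pairs, cutoff=0.0))
```

In practice the cutoff would be derived from control samples or another pre-determined criterion rather than fixed at an arbitrary value.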

Because of the high degree of heterogeneity in both SLE patients and in microarray dataset platforms, processing and controls, a meta-analysis approach can be performed in order to understand and interpret the relationship of gene expression signatures to each other and to disease activity. Linear regression analysis of the SLEDAI and GSVA scores for cell types, cellular processes, or IGS for seven SLE datasets shows that the strongest relationship to the SLEDAI is expression of genes regulating the cell cycle. This may be reassuring, as this cell cycle signature is taken from a WGCNA plasma cell module in SLE CD19 B cells correlated to SLEDAI, and plasma cells have been shown to correlate with SLEDAI. A plasma cell signature comprised of immunoglobulin (Ig) genes as well as other hallmark genes of plasma cells is also correlated to SLEDAI, although this full signature may not be detected in datasets on the Illumina platform because of the absence of Ig genes and may be underestimated on microarray chips in general because of their limited number of Ig genes. The IFN core, IFNW1, and IFNB1 signatures have low positive correlations with SLEDAI, and as was the case for the cell cycle and plasma cell signatures, have low predictive value for the SLEDAI.
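The regression of SLEDAI on a signature score can be sketched as an ordinary least-squares fit with a coefficient of determination. The following is a minimal illustration, not the analysis that produced the results described herein; the function name `linear_fit_r2` and all numeric values are hypothetical.

```python
def linear_fit_r2(x, y):
    """Least-squares fit y = slope * x + intercept; returns (slope, intercept, r2)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("x and y must each vary")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # For simple linear regression, r2 equals the squared Pearson correlation.
    r2 = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r2


# Made-up per-patient enrichment scores and SLEDAI values.
enrichment_scores = [-0.4, -0.2, 0.0, 0.2, 0.4]
sledai_values = [2.0, 4.0, 4.0, 8.0, 6.0]
slope, intercept, r2 = linear_fit_r2(enrichment_scores, sledai_values)
print(f"slope={slope:.2f} intercept={intercept:.2f} r2={r2:.2f}")
```

A low r2 for a given signature would correspond to the low predictive value for the SLEDAI noted above, even when the slope is positive.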

A predictive relationship across ten SLE WB and PBMC datasets (2152 patients) is determined for all the IGS and monocyte cell surface transcripts with a range of r² predictive values of 0.29-0.58. This may suggest that the IGS is most related to the increased presence of monocytes expressing the IGS. Three times as many transcripts from the IFN core signature were enriched in monocytes relative to T cells and B cells. However, whereas some members of the IGS in SLE were highly overexpressed in SLE monocytes (e.g., EIF2AK2, OASL, OAS2, OAS3, PLSCR1, and CXCL10), some of the most overexpressed transcripts when SLE patients were compared to HC, including IFI27, IFI44L, IFIH1, IFIT3, OASL, RSAD2, SPATS2L and USP18, are not over-expressed in SLE monocytes compared with SLE T cells and B cells. Support for monocytes having a greater intensity IGS may be shown in experiments in which the log signal ratios of a 20-gene IGS are compared between purified T cells, B cells, and monocytes in SLE patients.

In addition to monocytes from active SLE patients expressing a greater intensity for 2/3 of the IFN core transcripts, another contributing factor for the strong relationship of monocytes to the IGS may be found by studying the IGS in purified T cells, B cells, and monocytes from subjects with inactive SLE. The T cell and B cell WGCNA-derived IFN modules may correlate significantly to SLEDAI, whereas the CD14 monocyte IFN module may not. The presence of an IGS in CD14 monocytes, but not in CD4 T and CD19 B cells from inactive patients, may support that monocytes are maintaining the IGS in inactive SLE patients. One explanation for this may be the increased STAT1 transcripts found in inactive SLE WB, PBMC, and monocyte datasets, but not the inactive SLE CD4 T or CD19 B cells. A prolonged IGS in monocytes in the absence of IFN may also explain why some patients with IGS signatures have no IFNA detected using an ultrasensitive ELISA.

Another possible explanation for how monocytes may maintain an enhanced IGS derives from experiments treating human monocytes with a combination of TNF and IFN on a background of TLR signaling. IFN treatment in this context leads to epigenetic changes allowing for a much greater IGS than when cells are stimulated with IFN alone. Thus, the presence of inflammatory cytokines such as TNF, along with nucleic acid-containing immune complexes capable of signaling through TLRs, may account for the prolonged IGS seen in monocytes even when disease activity is low. Further work to elucidate the specific relationship between WB signatures and matching signatures from SLE affected tissues may improve understanding of this prominent signature and its association with an increased monocyte gene signature.

IFNB1 presents an intriguing target for SLE therapy because of the predominance of its signature in SLE affected tissues, its unique signaling properties and cellular expression, and its potential role in B cell development and tolerance. However, as shown by the results herein, the IGS may not correlate with the SLEDAI disease measurement, and a prolonged IGS in monocytes may make interpretation of the IGS as a measure of disease activity or the immediate presence of IFN challenging. The potential benefit of targeting IFNB1 may be considered within the practical limitations of disease measurement indices used in SLE clinical trials. It may be of critical importance that disease measurements truly reflect a change in the tissue manifestations of SLE.

In one aspect, the present disclosure provides a method for identifying a lupus condition of a subject, comprising: (a) assaying a biological sample of the subject to generate a dataset comprising gene expression data; (b) processing the dataset at each of a plurality of genes to determine quantitative measures of each of the plurality of genes, wherein the plurality of genes comprises genes induced by a plurality of interferons, thereby producing an interferon signature of the biological sample of the subject; (c) comparing the interferon signature with one or more reference interferon signatures, wherein the comparing comprises, for at least one of the plurality of genes, comparing the quantitative measure of the gene of the interferon signature with corresponding quantitative measures of the gene of the one or more reference interferon signatures; and (d) based at least in part on the comparison in (c), identifying the lupus condition of the subject.

In some embodiments, the lupus condition is selected from the group consisting of: systemic lupus erythematosus (SLE), discoid lupus erythematosus (DLE), and lupus nephritis (LN). In some embodiments, the biological sample is selected from the group consisting of a whole blood (WB) sample, a PBMC sample, a tissue sample, and a purified cell sample. In some embodiments, the tissue sample is selected from the group consisting of: skin tissue, synovium tissue, and kidney tissue. In some embodiments, the kidney tissue is selected from the group consisting of glomerulus (Glom) and tubulointerstitium (TI). In some embodiments, the purified sample is selected from the group consisting of: purified CD4⁺ T cells, purified CD19⁺ B cells, and purified CD14⁺ monocytes.

In some embodiments, the method further comprises purifying a whole blood sample of the subject to obtain the purified cell sample. In some embodiments, assaying the biological sample comprises (i) using a microarray to generate the dataset comprising the gene expression data, (ii) sequencing the biological sample to generate the dataset comprising the gene expression data, or (iii) performing quantitative polymerase chain reaction (qPCR) of the biological sample to generate the dataset comprising the gene expression data.

In some embodiments, the plurality of interferons comprises Type I interferons and/or Type II interferons. In some embodiments, the Type I interferons and/or Type II interferons are selected from the group consisting of IFNA2, IFNB1, IFNW1, and IFNG. In some embodiments, the plurality of genes comprises one or more genes induced by in vitro stimulation of PBMC by the plurality of interferons. In some embodiments, the one or more genes induced by in vitro stimulation of PBMC are selected from the genes listed in Table 13. In some embodiments, the one or more genes induced by in vitro stimulation of PBMC are selected from the genes listed in Table 14. In some embodiments, the one or more genes induced by in vitro stimulation of PBMC are selected from the genes listed in Table 15. In some embodiments, the one or more genes induced by in vitro stimulation of PBMC are selected from the genes listed in Table 16. In some embodiments, the plurality of genes comprises one or more genes induced by in vitro stimulation of PBMC by IL12 treatment or TNF treatment. In some embodiments, the one or more genes induced by in vitro stimulation of PBMC are selected from the genes listed in Table 17. In some embodiments, the one or more genes induced by in vitro stimulation of PBMC are selected from the genes listed in Table 18. In some embodiments, the plurality of genes comprises one or more genes induced in vivo in IFNA2-treated HepC patients and/or IFNB1-treated MS patients. In some embodiments, the one or more genes induced in vivo in IFNA2-treated HepC patients and/or IFNB1-treated MS patients are selected from the genes listed in Table 25.

In some embodiments, (c) further comprises, for the at least one of the plurality of genes, determining a difference between the quantitative measure of the gene of the interferon signature and the corresponding quantitative measures of the gene of the one or more reference interferon signatures. In some embodiments, (d) further comprises identifying the lupus condition of the subject when the difference satisfies a pre-determined criterion. In some embodiments, (c) further comprises, for the at least one of the plurality of genes, determining a Z-score of the quantitative measure of the gene of the interferon signature relative to the corresponding quantitative measures of the gene of the one or more reference interferon signatures. In some embodiments, (d) further comprises identifying the presence of the lupus condition of the subject when the Z-score satisfies a pre-determined criterion. In some embodiments, (d) further comprises identifying the presence of the lupus condition of the subject when the Z-score is at least 2, and identifying the absence of the lupus condition of the subject when the Z-score is less than 2.
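By way of non-limiting illustration, a per-gene Z-score relative to reference measurements, together with the threshold of 2 recited above, can be sketched as follows. The function names and expression values are hypothetical, and the use of the sample standard deviation of the reference values is an assumption of this sketch rather than a requirement of the method.

```python
from statistics import mean, stdev


def gene_z_score(sample_value, reference_values):
    """Z-score of a gene's measure relative to the reference measures.

    Assumption: dispersion is the sample standard deviation (n - 1 divisor).
    """
    sigma = stdev(reference_values)
    if sigma == 0:
        raise ValueError("reference values have zero variance")
    return (sample_value - mean(reference_values)) / sigma


def lupus_condition_present(z_score, threshold=2.0):
    """Presence when the Z-score is at least the threshold, absence otherwise."""
    return z_score >= threshold


# Made-up log-scale expression values for a single gene.
reference = [4.0, 5.0, 6.0, 5.0, 5.0]
z = gene_z_score(8.0, reference)
print(round(z, 3), lupus_condition_present(z))
```

When several genes are evaluated, the per-gene Z-scores may be combined (for example, summed or averaged) before the criterion is applied; the choice of aggregation is left open in this sketch.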

In some embodiments, the method further comprises determining or predicting an active or inactive state of the identified lupus condition of the subject. In some embodiments, (d) further comprises identifying the lupus condition of the subject based at least in part on a SLEDAI score of the subject. In some embodiments, the subject is asymptomatic for one or more lupus conditions selected from the group consisting of: systemic lupus erythematosus (SLE), discoid lupus erythematosus (DLE), and lupus nephritis (LN).

In some embodiments, the biological sample and the second biological sample comprise two different sample types selected from the group consisting of: a whole blood (WB) sample, a PBMC sample, a skin tissue sample, a synovium tissue sample, a kidney tissue sample comprising glomerulus (Glom), a kidney tissue sample comprising tubulointerstitium (TI), a purified CD4⁺ T cell sample, a purified CD19⁺ B cell sample, and a purified CD14⁺ monocyte sample.

In some embodiments, the method further comprises monitoring the lupus condition of the subject, wherein the monitoring comprises assessing the lupus condition of the subject at a plurality of time points, wherein the assessing is based at least on the lupus condition identified in (d) at each of the plurality of time points. In some embodiments, a difference in the assessment of the lupus condition of the subject among the plurality of time points is indicative of one or more clinical indications selected from the group consisting of: (i) a diagnosis of the lupus condition of the subject, (ii) a prognosis of the lupus condition of the subject, and (iii) an efficacy or non-efficacy of a course of treatment for treating the lupus condition of the subject.

In another aspect, the present disclosure provides a non-transitory computer readable medium comprising machine-executable code that, upon execution by one or more computer processors, implements a method for identifying a lupus condition of a subject, the method comprising: (a) assaying a biological sample of the subject to generate a dataset comprising gene expression data; (b) processing the dataset at each of a plurality of genes to determine quantitative measures of each of the plurality of genes, wherein the plurality of genes comprises genes induced by a plurality of interferons, thereby producing an interferon signature of the biological sample of the subject; (c) comparing the interferon signature with one or more reference interferon signatures, wherein the comparing comprises, for at least one of the plurality of genes, comparing the quantitative measure of the gene of the interferon signature with corresponding quantitative measures of the gene of the one or more reference interferon signatures; and (d) based at least in part on the comparison in (c), identifying the lupus condition of the subject.

Certain Terms

As used herein, the term “subject” refers to an entity or a medium that has testable or detectable genetic information. A subject can be a person, individual, or patient. A subject can be a vertebrate, such as, for example, a mammal. Non-limiting examples of mammals include humans, simians, farm animals, sport animals, rodents, and pets. The subject may be displaying a symptom(s) indicative of a health or physiological state or condition of the subject, such as a disease or disorder of the subject. As an alternative, the subject can be asymptomatic with respect to such health or physiological state or condition.

As used herein, the term “sample” generally refers to a biological sample obtained from or derived from one or more subjects. Biological samples may be processed or fractionated before further analysis. Biological samples may include a whole blood (WB) sample, a PBMC sample, a tissue sample, a purified cell sample, or derivatives thereof. For example, a tissue sample may comprise skin tissue, synovium tissue, kidney tissue (e.g., glomerulus (Glom) or tubulointerstitium (TI)), or derivatives thereof. For example, a purified cell sample may comprise purified CD4⁺ T cells, purified CD19⁺ B cells, purified CD14⁺ monocytes, or derivatives thereof. In some embodiments, a whole blood sample may be purified to obtain the purified cell sample. The term “derived from” used herein refers to an origin or source, and may include naturally occurring, recombinant, unpurified or purified molecules.

To obtain a blood sample, various techniques may be used, e.g., a syringe or other vacuum suction device. A blood sample can be optionally pre-treated or processed prior to use. A sample, such as a blood sample, may be analyzed under any of the methods and systems herein within 4 weeks, 2 weeks, 1 week, 6 days, 5 days, 4 days, 3 days, 2 days, 1 day, 12 hr, 6 hr, 3 hr, 2 hr, or 1 hr from the time the sample is obtained, or longer if frozen. When obtaining a sample from a subject (e.g., blood sample), the amount can vary depending upon subject size and the condition being screened. In some embodiments, at least 10 mL, 5 mL, 1 mL, 0.5 mL, 250, 200, 150, 100, 50, 40, 30, 20, 10, 9, 8, 7, 6, 5, 4, 3, 2, or 1 μL of a sample is obtained. In some embodiments, 1-50, 2-40, 3-30, or 4-20 μL of sample is obtained. In some embodiments, more than 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95 or 100 μL of a sample is obtained.

As used herein, the term “diagnose” or “diagnosis” of a status or outcome includes predicting or diagnosing the status or outcome, determining predisposition to a status or outcome, monitoring treatment of a patient, diagnosing a therapeutic response of a patient, and prognosis of status or outcome, progression, and response to particular treatment.

The sample may be taken before and/or after treatment of a subject with a disease or disorder. Samples may be obtained from a subject during a treatment or a treatment regime. Multiple samples may be obtained from a subject to monitor the effects of the treatment over time. The sample may be taken from a subject known or suspected of having a disease or disorder for which a definitive positive or negative diagnosis is not available via clinical tests. The sample may be taken from a subject suspected of having a disease or disorder. The sample may be taken from a subject experiencing unexplained symptoms, such as fatigue, nausea, weight loss, aches and pains, weakness, or bleeding. The sample may be taken from a subject having explained symptoms. The sample may be taken from a subject at risk of developing a disease or disorder due to factors such as familial history, age, hypertension or pre-hypertension, diabetes or pre-diabetes, overweight or obesity, environmental exposure, lifestyle risk factors (e.g., smoking, alcohol consumption, or drug use), or presence of other risk factors.

In some embodiments, a sample can be taken at a first time point and assayed, and then another sample can be taken at a subsequent time point and assayed. Such methods can be used, for example, for longitudinal monitoring purposes to track the development or progression of a disease. In some embodiments, the progression of a disease can be tracked before treatment, after treatment, or during the course of treatment, to determine the treatment's effectiveness. For example, a method as described herein can be performed on a subject prior to, and after, treatment with a lupus condition therapy to measure the disease's progression or regression in response to the lupus condition therapy.

After obtaining a sample from the subject, the sample may be processed to generate datasets indicative of a disease or disorder of the subject. For example, a presence, absence, or quantitative assessment of nucleic acid molecules of the sample at a panel of lupus condition-associated or interferon-associated genomic loci may be indicative of a lupus condition of the subject. Processing the sample obtained from the subject may comprise (i) subjecting the sample to conditions that are sufficient to isolate, enrich, or extract a plurality of nucleic acid molecules, and (ii) assaying the plurality of nucleic acid molecules to generate the dataset (e.g., microarray data, nucleic acid sequences, or quantitative polymerase chain reaction (qPCR) data). Methods of assaying may include any assay known in the art or described in the literature, for example, a microarray assay, a sequencing assay (e.g., DNA sequencing, RNA sequencing, or RNA-Seq), or a quantitative polymerase chain reaction (qPCR) assay.

In some embodiments, a plurality of nucleic acid molecules is extracted from the sample and subjected to sequencing to generate a plurality of sequencing reads. The nucleic acid molecules may comprise ribonucleic acid (RNA) or deoxyribonucleic acid (DNA). The extraction method may extract all RNA or DNA molecules from a sample. Alternatively, the extraction method may selectively extract a portion of RNA or DNA molecules from a sample. Extracted RNA molecules from a sample may be converted to cDNA molecules by reverse transcription (RT).

The sample may be processed without any nucleic acid extraction. For example, the disease or disorder may be identified or monitored in the subject by using probes configured to selectively enrich nucleic acid (e.g., RNA or DNA) molecules corresponding to a panel of lupus condition-associated or interferon-associated genomic loci. The probes may be nucleic acid primers. The probes may have sequence complementarity with nucleic acid sequences from one or more of the panel of lupus condition-associated or interferon-associated genomic loci. The panel of lupus condition-associated or interferon-associated genomic loci may comprise at least 2, at least 3, at least 4, at least 5, at least 6, at least 7, at least 8, at least 9, at least 10, at least 11, at least 12, at least 13, at least 14, at least 15, at least 16, at least 17, at least 18, at least 19, at least 20, at least about 25, at least about 30, at least about 35, at least about 40, at least about 45, at least about 50, at least about 55, at least about 60, at least about 65, at least about 70, at least about 75, at least about 80, at least about 85, at least about 90, at least about 95, at least about 100, or more lupus condition-associated or interferon-associated genomic loci.

The probes may be nucleic acid molecules (e.g., RNA or DNA) having sequence complementarity with nucleic acid sequences (e.g., RNA or DNA) of one or more genomic loci (e.g., lupus condition-associated or interferon-associated genomic loci). These nucleic acid molecules may be primers or enrichment sequences. The assaying of the sample using probes that are selective for the one or more genomic loci (e.g., lupus condition-associated or interferon-associated genomic loci) may comprise use of array hybridization, polymerase chain reaction (PCR), or nucleic acid sequencing (e.g., RNA sequencing or DNA sequencing, such as RNA-Seq).

The assay readouts may be quantified at one or more genomic loci (e.g., lupus condition-associated or interferon-associated genomic loci) to generate the data indicative of the disease or disorder. For example, quantification of array hybridization or polymerase chain reaction (PCR) corresponding to a plurality of genomic loci (e.g., lupus condition-associated or interferon-associated genomic loci) may generate data indicative of the disease or disorder. Assay readouts may comprise quantitative PCR (qPCR) values, digital PCR (dPCR) values, droplet digital PCR (ddPCR) values, fluorescence values, etc., or normalized values thereof.

Methods

Gene expression data may be compiled from SLE patients as follows. Data are derived from publicly available datasets and collaborators (Table 19). Differential gene expression (DE) analysis may be performed for each dataset of SLE patients and controls. GCRMA normalized expression values are variance corrected using local empirical Bayesian shrinkage before calculation of DE using the ebayes function in the open source BioConductor LIMMA package (https://www.bioconductor.org/packages/release/bioc/html/limma.html). Resulting p-values are adjusted for multiple hypothesis testing and filtered to retain DE probes with an FDR<0.2. This cutoff is employed a priori to increase the number of genes that may be subsequently analyzed, with the understanding that even though the number of false positives may be increased, fewer truly DE genes may be excluded from the analysis (i.e., fewer false negatives). SLE patient blood samples may be heterogeneous, and as a practical matter, signatures for LDGs and plasma cells are sometimes not detectable in limma analysis of populations, depending on the specific patient make-up. An FDR of 0.2 may allow detection of cell types and processes which may not be found in all SLE patients, but that contribute significantly to the disease state in subpopulations of patients.
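By way of non-limiting illustration, the adjustment and filtering step may be sketched as follows. The sketch assumes the Benjamini-Hochberg procedure as the multiple-testing adjustment, which the text above does not name; the function names and the example p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def filter_de_probes(probe_p_values, fdr_cutoff=0.2):
    """Keep probes whose adjusted p-value is below the FDR cutoff."""
    probes = list(probe_p_values)
    adjusted = benjamini_hochberg([probe_p_values[p] for p in probes])
    return {p: q for p, q in zip(probes, adjusted) if q < fdr_cutoff}


# Hypothetical probe-level p-values, for illustration only.
example = {"probe_a": 0.001, "probe_b": 0.04, "probe_c": 0.3, "probe_d": 0.6}
retained = filter_de_probes(example)
```

With these illustrative inputs, only probe_a and probe_b pass the 0.2 cutoff.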

TABLE 19

SLE Datasets and SLE Time Course Datasets

SLE Dataset | Sample Type | Sex | SLEDAI | SLE Patients | Healthy Controls
GSE88884 ILL1 | WB | Female | 6 to 27 | 813 | 10^b
GSE88884 ILL2 | WB | Female | 6 to 40 | 807 | 7^b
GSE45291 | WB | Female | 0 to 11 | 266 | 20
GSE22098* | WB | Female | unknown | 24 | 15
GSE61635 | WB | Female | unknown | 64 | 30
GSE29536 | WB | Female | unknown | 27 | 41
GSE39088 | WB | Female | 2 to 10 | 17 | 34
GSE49454* | WB | Female | 0 to 26 | 49 | 10
GSE50772 | PBMC | Female | 0 to 13 | 56 | 20
FDA PBMC | PBMC | Female | 0 to 25 | 30 | 6
GSE38351 | CD14 Monocytes | Female | 0 to 24 | 12 | 12
GSE10325 | CD4 T cells | Female | 2 to 22 | 12 | 9
GSE10325 | CD19 B cells | Female | 2 to 22 | 14 | 9
GSE52471 | DLE | 5 Female, 2 Male | unknown | 7 | 10
GSE72535 | DLE | 8 Female, 1 Male | 2 | 9 | 9
GSE36700^a | Synovium | Female | unknown | 4 | 4
GSE32591 | Kidney Glom Class II, III/IV | Mixed | unknown | 30 | 14
GSE32591 | Kidney TI Class II, III/IV | Mixed | unknown | 30 | 15

SLE Time Course Datasets | Sample Type | Sex | SLEDAI | SLE Patients | Healthy Controls
GSE72747 | WB | 9 Female, 1 Male | (Time 0) >6 | 10 | 46^c
GSE88885 | WB | Female | (Time 0) >6 | 86^d | 16
GSE88886 | WB | Female | (Time 0) >6 | 33^d | 12

*Only adult SLE patients were used

^aOsteoarthritis samples are the control synovial tissue

^bUsed only female controls

^cNo controls were available for this set. GSE39088 Male and Female controls were used for this dataset

^dPatients on standard of care (SOC) therapy who were given placebo in a clinical study

^ewww.ncbi.nlm.nih.gov/geo/

Gene Set Variation Analysis (GSVA) may be performed as follows. The GSVA (V1.25.0) software package, an open source package available from R/Bioconductor, is used as a non-parametric, unsupervised method for estimating the variation of pre-defined gene sets in patient and control samples of microarray expression data sets (www.bioconductor.org/packages/release/bioc/html/GSVA.html). The inputs for the GSVA algorithm may be a gene expression matrix of log2 microarray expression values and pre-defined gene sets co-expressed in SLE datasets. Enrichment scores (GSVA scores) may be calculated non-parametrically using a Kolmogorov-Smirnov (KS)-like random walk statistic. A negative value for a particular sample and gene set means that the gene set has a lower expression than the same gene set with a positive value. The positive and negative enrichment scores (ES) may be the largest positive and largest negative random walk deviations from zero, respectively, for a particular sample and gene set. The positive and negative ES for a particular gene set may depend on the expression levels of the genes that form the pre-defined gene set.
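As a simplified, non-limiting sketch of the random walk described above, the following computes the largest positive and largest negative deviations from zero for one sample and one gene set. It is not the GSVA package implementation: it assumes the genes have already been ranked from highest to lowest expression within the sample and uses unweighted steps, whereas the package applies additional kernel-based and rank-based normalization. The function name and gene labels are hypothetical.

```python
def random_walk_scores(ranked_genes, gene_set):
    """Largest positive and negative deviations of a KS-like random walk.

    ranked_genes: gene symbols ordered from highest to lowest expression
    in one sample. gene_set: the pre-defined gene set to score.
    """
    members = set(gene_set) & set(ranked_genes)
    n_hit = len(members)
    n_miss = len(ranked_genes) - n_hit
    if n_hit == 0 or n_miss == 0:
        raise ValueError("gene set must partly overlap the ranked list")
    position = 0.0
    es_pos = 0.0
    es_neg = 0.0
    for gene in ranked_genes:
        # Step up on a gene-set member, step down otherwise (unweighted).
        position += 1.0 / n_hit if gene in members else -1.0 / n_miss
        es_pos = max(es_pos, position)
        es_neg = min(es_neg, position)
    return es_pos, es_neg


# Hypothetical ranked list for one sample.
ranked = ["G1", "G2", "G3", "G4", "G5", "G6"]
high = random_walk_scores(ranked, {"G1", "G2"})  # set at the top of the ranking
low = random_walk_scores(ranked, {"G5", "G6"})   # set at the bottom of the ranking
```

A gene set concentrated at the top of the ranking yields a large positive deviation and no negative deviation, and the reverse holds for a set concentrated at the bottom. One convention, assumed here rather than taken from the text, combines the two deviations into a single signed score by adding them.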

Random Group (Gr) 1 and Random Group (Gr) 2 signatures may be determined by first assigning random numbers to the list of DE genes (FDR 0.2) from dataset GSE49454 in Microsoft® Excel® using the formula “rand( )”, then sorting the genes in ascending order of the assigned random numbers and taking the first 100 genes. This may be performed twice to generate Random Gr1 and Random Gr2 signatures. Gene symbols for these random signatures are listed in Tables 28-29.
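The same selection may be sketched outside a spreadsheet. The following is a minimal illustration using the Python standard library in place of the spreadsheet formula; the function name, the seeds, and the placeholder gene list are hypothetical and do not reproduce the signatures of Tables 28-29.

```python
import random


def random_signature(de_genes, size=100, seed=None):
    """Assign a random number to each gene, sort ascending, keep the first `size`."""
    rng = random.Random(seed)
    keyed = [(rng.random(), gene) for gene in de_genes]
    keyed.sort()  # ascending by the assigned random number
    return [gene for _, gene in keyed[:size]]


# Placeholder gene list standing in for the DE genes of one dataset.
de_genes = [f"GENE{i}" for i in range(500)]
random_gr1 = random_signature(de_genes, seed=1)
random_gr2 = random_signature(de_genes, seed=2)
```

Running the draw twice with different seeds mirrors performing the spreadsheet procedure twice.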

TABLE 20

Genes with Induced Transcripts in PBMC by IFNA2 Treatment

ACSL1 | CASP10 | CXCL11 | FLNA | IFI16 | ISG20 | MED1 | PDGFRL | SP110 | TOR1B
ADAR | CASP5 | CXCR2 | FOXO1 | IFI27 | ITIH2 | MGLL | PGGT1B | SP140 | TRA2B
AGT | CBR1 | CYP2J2 | FTL | IFI35 | JAK2 | CXCL9 | PKD2 | SPIB | TRD
AIM2 | CBWD1 | DAB2 | FUT4 | IFI44 | JUP | MMP16 | PLSCR1 | ST3GAL5 | TRIM21
AKAP2 | CCL13 | DEFB1 | GADD45B | IFI44L | KCNA3 | MNDA | PMAIP1 | STAP1 | TRIM22
APOBEC3B | CCL7 | DLL1 | GBAP1 | IFI6 | KDELR2 | MRPS15 | PML | STAT1 | TRIM26
APOBEC3G | CCL8 | DSC2 | GBP1 | IFIT1 | KIF20B | MSR1 | PRKRA | STAT2 | TRIM38
APOL3 | CCNA1 | DUSP5 | GBP2 | IFIT5 | KLF6 | MX1 | PSMB9 | STX11 | UBA7
ATF3 | CCND2 | DUSP7 | GCH1 | IFITM1 | KPNB1 | MX2 | PTCH1 | SUPT3H | UBE2L6
ATF5 | CD2AP | DYNLT1 | GCNT1 | IFITM2 | KRT8 | MYD88 | RBCK1 | SYN2 | UBE2S
BAG1 | CD38 | DYSF | GLB1 | IFITM3 | LAG3 | NAMPT | RET | TAF5L | UBE3A
BARD1 | CD4 | ECE1 | GLS | IFNG | LAMP3 | NFE2L3 | RGS1 | TAP1 | UNC93B1
BCL7B | CD69 | EDN1 | GMPR | IFRD1 | LAP3 | NKTR | RGS6 | TAP2 | USP18
BLVRA | CDC42EP1 | EIF2AK2 | GPR161 | IGL | LEPR | NMI | TRIM34 | TARBP1 | VAMP5
BRCA1 | CDK4 | EIF2B1 | GUK1 | IKBKG | LGALS2 | NR3C1 | RPS9 | TCN2 | WARS
BRCA2 | CDKN1A | EIF4ENIF1 | HBG2 | IL15 | LGALS3BP | NUB1 | RTP4 | TFDP2 | WT1
BST2 | CFB | ENPP2 | HCAR3 | IL15RA | LGALS9 | NUPR1 | SAT1 | TGM1 | XAF1
BUB1 | CH25H | EPB41 | HIST2H2AA3 | IL1RN | LGMN | OAS1 | SCARB2 | TLR3
C2 | CHKA | ETV4 | HLA-DOA | IL6 | LMNB1 | OAS2 | SERPING1 | TLR7
CACNA1A | CNTN6 | F8 | HLA-DRB5 | INPPL1 | LMO2 | OAS3 | SIT1 | TNFRSF11A
CAD | COL3A1 | FAF1 | HS6ST1 | IRF2 | LY6E | OSBPL1A | SLAMF1 | TNFSF10
CAMK2A | CTSL | FAS | HSP90AA1 | IRF7 | MAP2K5 | PATJ | SOCS1 | TNFSF6
CASP1 | CXCL10 | FGF1 | IDO1 | ISG15 | MCL1 | PDGFB | SP100 | TNK2

TABLE 21

Genes with Induced Transcripts in PBMC by IFNB1 Treatment

ACLY | CACNA1A | CHKA | ELF1 | HSP90AA1 | JAK2 | MFHAS1 | PKD2 | SFTPB | TNFAIP2
ACSL1 | CAD | CISH | ELF4 | HSPA1A | JCHAIN | MGLL | PLEK | SIDT2 | TNFRSF11A
ADAM19 | CALD1 | CKB | ENPP2 | HSPA1L | JUP | CXCL9 | PLSCR1 | SIT1 | TNFSF10
ADAP2 | CAMK2A | CMAHP | EPB41 | IDO1 | KCNA3 | MNDA | PMAIP1 | SLAMF1 | TNFSF6
ADAR | CAPN2 | CNTN6 | ETV4 | IFI16 | KCNMB1 | MRPS15 | PML | SMO | TNK2
ADGRE2 | CASP1 | CNTRL | ETV6 | IFI27 | KDELR2 | MS4A7 | PMS2 | SNX2 | TOR1B
ADM | CASP10 | COL3A1 | F8 | IFI35 | KIF20B | MSR1 | PPP2R2A | SOCS1 | TRA2B
AFF3 | CASP5 | COX17 | FAF1 | IFI44 | KLF2 | MX1 | PRKAG1 | SOS1 | TRD
AGT | CBR1 | CSF2RB | FAS | IFI6 | KLF6 | MX2 | PRKRA | SP100 | TRG
AIM2 | CBWD1 | CTSL | FBXW2 | IFIT1 | KLRB1 | MYD88 | PRKX | SP110 | TRIM21
AKAP10 | CCL13 | CXCL10 | FCGR1A | IFIT5 | KPNB1 | NAMPT | PSMB8 | SP140 | TRIM22
AKAP2 | CCL3L1 | CXCL11 | FCMR | IFITM1 | KRT8 | NAPSA | PSMB9 | SPIB | TRIM26
ALOX12 | CCL4 | CXCL2 | FGF1 | IFITM2 | LAG3 | NBN | PTCH1 | SPTA1 | TRIM38
ALOX5 | CCL7 | CXCR2 | FLNA | IFITM3 | LAMP3 | NCF1 | PTGER2 | SPTLC2 | TSPAN15
ANXA4 | CCL8 | CYBB | FMR1 | IFNG | LANCL1 | NCOA2 | RALB | SRRM2 | TXK
APOBEC3B | CCNA1 | CYP19A1 | FOXO1 | IFRD1 | LAP3 | NEBL | RASGRP1 | SSB | UBA7
APOBEC3G | CCND2 | CYP2J2 | FPR2 | IGL | LBR | NEK4 | RBBP6 | ST3GAL5 | UBE2L6
APOL3 | CCR1 | DAB2 | FTL | IKBKE | LEPR | NFE2L3 | RBCK1 | STAP1 | UBE2S
ATF3 | CCR5 | DEFA1 | FUT4 | IKBKG | LGALS2 | NKTR | RERE | STAT1 | UBE3A
ATF5 | CCRL2 | DEFB1 | GADD45B | IL15 | LGALS3BP | NMI | RGS1 | STAT2 | UBQLN2
ATM | CD163 | DHFR | GBAP1 | IL15RA | LGALS9 | NOTCH1 | RGS6 | STOML2 | UNC93B1
ATP13A1 | CD164 | DLL1 | GBP1 | IL18BP | LGMN | NR3C1 | RIN1 | STX11 | USP15
B4GAT1 | CD2AP | DMXL1 | GBP2 | IL18R1 | LILRA1 | NR4A3 | RIPK1 | SUPT3H | USP18
BAG1 | CD38 | DNMT1 | GCH1 | IL1RN | LINC00597 | NUB1 | RIPK3 | TANK | USP25
BAK1 | CD4 | DRAP1 | GCNT1 | IL6 | LMNB1 | NUPR1 | RIPOR2 | TAP1 | USPL1
BARD1 | CD59 | DSC2 | GLS | IL7 | LMO2 | OAS1 | RNF114 | TAP2 | UVRAG
BCL11A | CD69 | DUSP5 | GMPR | INPP5D | LTA | OAS2 | TRIM34 | TAPBP | VAMP5
BCL7B | CD72 | DUSP7 | GPI | INPPL1 | LTB4R | OAS3 | RPS6KA5 | TARBP1 | WARS
BGN | CD86 | DYNLT1 | GPR161 | IRF1 | LY6E | PATJ | RPS9 | TBX21 | WIPF1
BLNK | CDK17 | DYSF | GUK1 | IRF2 | LYN | PAX5 | RRBP1 | TCN2 | WT1
BLVRA | CDKN1A | E2F1 | HBG2 | IRF4 | MAP2K5 | PAX8 | RTP4 | TFDP2 | XAF1
BLZF1 | CENPA | ECE1 | HCAR3 | IRF7 | MAP3K8 | PDE4B | SAT1 | TFF1 | ZNF107
BRCA1 | CENPE | EDN1 | HHEX | IRF9 | MARCKS | PDGFB | SCARB2 | TGM1
BRCA2 | CFB | EGR1 | HIST2H2AA3 | ISG15 | MBNL | PDGFRL | SDS | THY1
BST2 | CFLAR | EIF2AK2 | HK2 | ISG20 | MCL1 | PFKFB3 | SELL | TLR1
BUB1 | CH25H | EIF2B1 | HLA-DOA | ITGAL | MED1 | PFKP | SERPIND1 | TLR3
C3AR1 | CHI3L2 | EIF4ENIF1 | HS6ST1 | ITGAX | MEF2A | PIM2 | SERPING1 | TLR7

TABLE 22

Genes with Induced Transcripts in PBMC by IFNW1 Treatment

ABCB10 | CAD | CFB | EIF4ENIF1 | GUK1 | IRF1 | MAP2K5 | OSBPL1A | SERPIND1 | TNFAIP3
ACLY | CALD1 | CFLAR | ENPP2 | HBG2 | IRF2 | MARCKS | PATJ | SERPING1 | TNFRSF11A
ACSL1 | CAMK2A | CHKA | EPB41 | HHEX | IRF7 | MBNL1 | PAX8 | SFT2D2 | TNFSF10
ADAR | CAPN2 | CKB | ERCC4 | HIST2H2AA3 | IRF8 | MCL1 | PDGFB | SIT1 | TNFSF6
ADM | CASK | CMAHP | ETV4 | HLA-DOA | ISG15 | MED1 | PDGFRL | SLC30A4 | TNK2
AGT | CASP1 | CNTN6 | ETV6 | HS6ST1 | ISG20 | MEF2A | PKD2 | SOCS1 | TOR1B
AIM2 | CASP10 | CNTRL | F8 | HSP90AA1 | ITIH2 | MGLL | PLEK | SOS1 | TRA2B
AKAP10 | CASP5 | COL3A1 | FAF1 | HSPA1A | JAK2 | CXCL9 | PLSCR1 | SP100 | TRD
AKAP2 | CBR1 | CSF2RB | FAS | IDO1 | JCHAIN | MLF1 | PMAIP1 | SP110 | TRIM21
ALOX12 | CBWD1 | CTSL | FCER1G | IFI16 | JUP | MMP16 | PML | SP140 | TRIM22
ANXA4 | CCL13 | CXCL10 | FGF1 | IFI27 | KCNA3 | MNDA | PPP2R2A | SPIB | TRIM38
APOBEC3B | CCL3L1 | CXCL11 | FGF13 | IFI35 | KDELR2 | MRPS15 | PRKAG1 | SRRM2 | UBA7
APOBEC3G | CCL7 | CXCR2 | FGL2 | IFI44 | KIF20B | MS4A7 | PRKRA | ST3GAL5 | UBE2C
APOL3 | CCL8 | CYBB | FLNA | IFI6 | KLF6 | MSR1 | PSMB9 | STAP1 | UBE2L6
ATF3 | CCNA1 | CYP19A1 | FMR1 | IFIT1 | KPNB1 | MX1 | PTCH1 | STAT1 | UBE2S
ATF5 | CCND2 | CYP2J2 | FOXO1 | IFIT5 | KRT8 | MX2 | PTGER2 | STAT2 | UNC93B1
ATM | CCR1 | DEFB1 | FTL | IFITM1 | LAG3 | MYD88 | RALB | STX11 | USP18
B4GAT1 | CCR5 | DLL1 | FUT4 | IFITM2 | LAMP3 | NAMPT | RBBP6 | SUPT3H | USP25
BAG1 | CCR7 | DSC2 | GADD45B | IFITM3 | LAP3 | NCF1 | RBCK1 | TAP1 | WARS
BARD1 | CCRL2 | DUSP5 | GBAP1 | IFRD1 | LEPR | NFE2L3 | RERE | TAP2 | WIPF1
BCL11A | CD164 | DUSP7 | GBP1 | IGL | LGALS2 | NKTR | RGS1 | TARBP1 | WT1
BCL7B | CD2AP | DYNLT1 | GBP2 | IKBKG | LGALS3BP | NMI | RGS6 | TBX21 | XAF1
BLVRA | CD38 | DYSF | GCH1 | IL15 | LGALS9 | NPTX1 | TRIM34 | TCN2 | ZNF107
BLZF1 | CD4 | E2F1 | GCNT1 | IL15RA | LGMN | NR3C1 | RPS6KA5 | TFDP2
BRCA1 | CD47 | ECE1 | GLB1 | IL18R1 | LINC00597 | NUB1 | RTP4 | TFF1
BRCA2 | CD59 | EDN1 | GLS | IL1RN | LMNB1 | NUPR1 | SAT1 | TGM1
BRD4 | CD69 | EGR1 | GMPR | IL6 | LMO2 | OAS1 | SCARB2 | THY1
BST2 | CDKN1A | EIF2AK2 | GPR161 | IL7 | LY6E | OAS2 | SDS | TLR3
C3AR1 | CENPE | EIF2B1 | GSTM5 | INPPL1 | LYN | OAS3 | SELL | TLR7

TABLE 23

Genes with Induced Transcripts in PBMC by IFNG Treatment

ACLY | CASP10 | CXCL10 | FLII | IDO1 | KLF2 | NR3C1 | SERPIND1 | TAP1 | VSNL1
ACSL1 | CCL8 | CXCL11 | GADD45B | IFI27 | LAP3 | OAS1 | SERPING1 | TAP2 | WARS
AFF2 | CCND2 | CYBB | GBP1 | IFI44 | LIMK2 | OAS3 | SFTPB | TBX21 | XRN1
AIM2 | CCR5 | EDN1 | GBP2 | IL15 | LMNB1 | P2RY13 | SLAMF1 | TENM1
AKAP10 | CD38 | EPB41 | GCH1 | IL15RA | CXCL9 | PCDH9 | SLC1A5 | TFF1
APOL3 | CDKN1A | ETAA1 | GCNT1 | IL18BP | MMP25 | PLA2G4C | SOCS1 | TNFAIP2
ATF3 | CFB | ETV4 | GLS | IL1A | MRPS15 | PLEK | SP100 | TNFSF10
ATM | CKB | F8 | GSTM5 | IL7 | MSR1 | POLR2B | SPRY4 | UBD
C1QB | CLEC10A | FAS | HBG2 | IRF1 | NET1 | PSMB9 | SRRM2 | UBE2C
C4A | CPT1B | FBLN1 | HHEX | IRF8 | NIN | PTCH1 | STAT1 | UBE2L6
CALD1 | CSF2RB | FBXL2 | HP | JAK2 | NKTR | RALB | STAT2 | UBE3A
CASP1 | CTNND2 | FCGR1A | ICAM1 | JCHAIN | NLRP1 | RGS1 | STX11 | VAMP5

TABLE 24

Genes with Induced Transcripts in PBMC by IL12 Treatment

ACLY | CASK | CYBB | FCGR1A | GZMB | IL18BP | KLF2 | NIN | SOCS1 | TNFAIP3
AKAP10 | CASP1 | DEFA1 | GBP1 | HHEX | IL18R1 | KRT8 | NLRP1 | STAT1 | TNFSF10
APOL3 | CCR5 | ETAA1 | GBP2 | HP | IL1A | LIMK1 | PCDH9 | TAP2 | TXK
BACH2 | CDKN3 | FASLG | GLS | HSPA6 | INPP5D | LINC00597 | SELL | TBX21
BRCA2 | CXCL10 | FBXL2 | GNPDA1 | IFNG | INSIG1 | LY75 | SERPIND1 | TFF1
CALD1 | CXCR3 | FCER2 | GSTM5 | IL16 | IRF1 | MMP25 | SLAMF1 | TNFAIP2

TABLE 25

Genes with Induced Transcripts in PBMC by TNF Treatment

ACLY | BHMT | CDKN3 | EPB41 | GJB2 | IL16 | MAP3K4 | NFKBIA | RPGR | TAP2
ACSL1 | BIRC3 | CKB | EREG | GLS | IL18 | MARCKS | NFKBIZ | RPS9 | TBX3
ADGRE2 | BRCA1 | CR2 | ETAA1 | GMIP | IL1A | MGLL | NKX3-2 | SDC4 | TFF1
AK3 | CALD1 | CTNND2 | F3 | GP1BA | IL1B | MMP19 | NR3C1 | SERPIND1 | TNF
AKAP10 | CASP1 | CXCL1 | FABP1 | GRK3 | IL1RN | MN1 | OAS3 | SFRP1 | TNFAIP2
AMPD3 | CASP10 | CXCL2 | FBXL2 | HCAR3 | IL6 | MRPS15 | PATJ | SH3BP5 | TNFAIP3
APOL3 | CCL15 | CXCL3 | FCER2 | HHEX | INHBA | MSC | PDE4DIP | SLAMF1 | TNFRSF11A
ARID3A | CCL20 | CXCL8 | FCGR2A | HOMER2 | INSIG1 | MTF1 | PDPN | SLC30A4 | TRAF1
ARSE | CCL23 | CYP27B1 | FLJ11129 | HP | ITGA6 | MX1 | PIAS4 | SOD2 | TSC22D1
ASAP1 | CCL3L1 | DAB2 | FLNA | ICAM1 | KITLG | NAMPT | PLAUR | SPI1 | TYROBP
B4GALT5 | CD37 | EBI3 | G0S2 | IDO1 | KLF1 | NELL2 | PTGES | SSPN | UBE2C
BCL2A1 | CD38 | EGR1 | GBP1 | IFI44 | KMO | NFKB1 | PTGS2 | STAT4 | VEGFA
BHLHE41 | CD83 | EGR2 | GCH1 | IKBKG | LGALS3BP | NFKB2 | RELB | TAF15 | WT1

TABLE 26

Genes of IFN Core with Induced Transcripts

ACSL1 | CASP10 | CXCL11 | FAF1 | HS6ST1 | INPPL1 | LGMN | NUPR1 | SERPING1 | TLR7
ADAR | CASP5 | CXCL9 | FAS | HSP90AA1 | IRF2 | LMNB1 | OAS1 | SIT1 | TNFRSF11A
AGT | CBR1 | CXCR2 | FGF1 | IDO1 | IRF7 | LMO2 | OAS2 | SOCS1 | TNFSF10
AIM2 | CBWD1 | CYP2J2 | FLNA | IFI16 | ISG15 | LY6E | OAS3 | SP100 | TNFSF6
AKAP2 | CCL13 | DEFB1 | FOXO1 | IFI27 | ISG20 | MAP2K5 | PATJ | SP110 | TNK2
APOBEC3B | CCL7 | DLL1 | FTL | IFI35 | JAK2 | MCL1 | PDGFB | SP140 | TOR1B
APOBEC3G | CCL8 | DSC2 | FUT4 | IFI44 | JUP | MED1 | PDGFRL | SPIB | TRA2B
APOL3 | CCNA1 | DUSP5 | GADD45B | IFI6 | KCNA3 | MGLL | PKD2 | ST3GAL5 | TRD
ATF3 | CCND2 | DUSP7 | GBAP1 | IFIT1 | KDELR2 | MNDA | PLSCR1 | STAP1 | TRIM21
ATF5 | CD2AP | DYNLT1 | GBP1 | IFIT5 | KIF20B | MRPS15 | PMAIP1 | STAT1 | TRIM22
BAG1 | CD38 | DYSF | GBP2 | IFITM1 | KLF6 | MSR1 | PML | STAT2 | TRIM34
BARD1 | CD4 | ECE1 | GCH1 | IFITM2 | KPNB1 | MX1 | PRKRA | STX11 | TRIM38
BCL7B | CD69 | EDN1 | GCNT1 | IFITM3 | KRT8 | MX2 | PSMB9 | SUPT3H | UBA7
BLVRA | CDKN1A | EIF2AK2 | GLS | IFRD1 | LAG3 | MYD88 | PTCH1 | TAP1 | UBE2L6
BRCA1 | CFB | EIF2B1 | GMPR | IGL | LAMP3 | NAMPT | RBCK1 | TAP2 | UBE2S
BRCA2 | CHKA | EIF4ENIF1 | GPR161 | IKBKG | LAP3 | NFE2L3 | RGS1 | TARBP1 | UNC93B1
BST2 | CNTN6 | ENPP2 | GUK1 | IL15 | LEPR | NKTR | RGS6 | TCN2 | USP18
CAD | COL3A1 | EPB41 | HBG2 | IL15RA | LGALS2 | NMI | RTP4 | TFDP2 | WARS
CAMK2A | CTSL | ETV4 | HIST2H2AA3 | IL1RN | LGALS3BP | NR3C1 | SAT1 | TGM1 | WT1
CASP1 | CXCL10 | F8 | HLA-DOA | IL6 | LGALS9 | NUB1 | SCARB2 | TLR3 | XAF1

TABLE 27

Genes of Type I and Type II IFN Core

ACSL1 | CCL8 | CXCL11 | GBP1 | IDO1 | LAP3 | NR3C1 | SERPING1 | TAP1
AIM2 | CCND2 | EDN1 | GBP2 | IFI27 | LMNB1 | OAS1 | SOCS1 | TAP2
APOL3 | CD38 | EPB41 | GCH1 | IFI44 | CXCL9 | OAS3 | SP100 | FAS
ATF3 | CDKN1A | ETV4 | GCNT1 | IL15 | MRPS15 | PSMB9 | STAT1 | TNFSF10
CASP1 | CFB | F8 | GLS | IL15RA | MSR1 | PTCH1 | STAT2 | UBE2L6
CASP10 | CXCL10 | GADD45B | HBG2 | JAK2 | NKTR | RGS1 | STX11 | WARS

TABLE 28

Genes of Random Gr 1

TYW3 | AASDHPPT | HNRNPC | MS2P1 | FAM50A | PSME3 | RAB13 | SNTB1 | WDR45 | KDM6B
PID1 | LOC284023 | NPC1 | ZC3H8 | EEF2K | PPP1R35 | APH1B | USB1 | SLC2A5 | ST6GALNAC4
MXD4 | EEF2 | ANAPC10 | HNRNPR | FAM175B | AKTIP | SPPL2A | NCOA1 | DGAT2 | APOPT1
ARPC4 | HIC2 | ZNF362 | IDH3B | ZNF485 | RNF4 | BRCA1 | RHOT1 | CYP4F3 | CASP5
CD81 | SSR2 | WDR82 | HPS5 | MCM7 | FAM189B | DOCK8 | DLST | PFKFB4 | CDC34
TPM3 | ZNF830 | PRPF8 | KRT10 | DHX32 | YWHAE | DGKD | KIAA0513 | ABCG1 | EIF5
TBC1D31 | FAM84B | RASSF1 | MIEF1 | NDUFC1 | PAM16 | NFKBIA | ATP6V0B | CARD16 | ACO1
ASF1A | UTP23 | EIF5 | RNF144A | ACO1 | TARBP1 | STAU1 | FCER1G | MARCH2 | RBM4
HMGN2 | MIB2 | MIS12 | NMD3 | FASTKD2 | CCNA2 | RELB | ABCA7 | ACOX1 | RABGAP1L
SNX1 | TMEM177 | RPL15 | SF3B4 | GID8 | SETX | SLC43A3 | GYG1 | PDLIM7 | MGAT1

TABLE 29

Genes of Random Gr 2

SH3YL1 | BRIX1 | FAM159A | SECISBP2L | VDAC3 | ZNF3 | SAP30L | ZNF493 | ACTA2 | PELI1
TARP | FBXO21 | SLC2A4RG | ALKBH2 | SLC30A5 | AUH | MANBAL | TAZ | CTRL | FAM214B
VPS51 | PEBP1 | DDX1 | UBE2N | ZNF275 | ANAPC15 | FAM45A | RAP2C | TMEM170B | SLC2A3
ARL2BP | PAOX | PHF5A | SLC3A2 | PHF10 | TNPO2 | ATP11B | RAB32 | ABCA7 | TRIB1
RPS28 | JADE1 | VKORC1 | CEP41 | ACD | FAM192A | RBM10 | GPR1 | TMEM120A | COLGALT1
NSA2 | POGLUT1 | PSMC5 | LYPLAL1 | HSCB | SCLT1 | SPTBN4 | RNU4-2 | PKM | STAT3
ACYP2 | DENND2D | MAEA | OXCT1 | ZNF485 | PI4KB | SPAG9 | LRWD1 | NAMPT | MPO
SPOUT1 | TMEM8B | KDSR | RANGAP1 | PPP1R11 | CALML4 | PTTG1IP | LATP4B | MSL1 | OLR1
HIVEP2 | EXOSC1 | FKBP4 | SRSF4 | MCM7 | C4orf32 | PRELID1 | LILRB3 | ACSL4 | PSMD1
SDR39U1 | TMEM14B | LINC01278 | NENF | RPUSD1 | CCNA2 | GGA3 | MYADM | ZDHHC19 | MAP3K3

Enrichment modules containing cell type and process specific genes may be created through an iterative process of identifying DE transcripts pertaining to a restricted profile of hematopoietic cells in a majority of the SLE microarray datasets analyzed, and checking their expression in purified T cells, B cells, and monocytes to remove transcripts indicative of multiple cell types. Transcripts may be researched by searching through literature. In the case of the cell cycle, unfolded protein response (UPR), and plasma cell modules, genes may be initially identified through the DE analysis, and WGCNA-created modules from CD19 and CD20 B cells may be correlated to SLEDAI. These genes may be identified as belonging to these categories by searching through literature and by STRING interactome analysis, and their DE may be confirmed in the 13 SLE WB and PBMC datasets used in these studies.

For an overlap to be considered significant, a minimum number of transcripts (e.g., three) for each category may have to be found in each dataset. This minimum may be chosen based on calculating an error rate of 20% for one transcript, an error rate of 4% for two transcripts, and an error rate of 0.8% for three transcripts. GSVA enrichment modules used for linear regression analyses may have overlapping transcripts between the IFN signatures and the cell type specific signatures removed.
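The quoted error rates are consistent with treating each transcript as an independent chance detection at the 0.2 FDR cutoff, so that the rate for k transcripts is 0.2 raised to the power k. This independence reading is an assumption made for illustration, and the function name is hypothetical.

```python
def chance_overlap_rate(n_transcripts, per_transcript_rate=0.2):
    """Probability that n independent transcripts are all chance detections."""
    return per_transcript_rate ** n_transcripts


# Rates for one, two, and three transcripts under the independence assumption.
rates = [chance_overlap_rate(k) for k in (1, 2, 3)]
```

Under this assumption the three rates come to 20%, 4%, and 0.8%, matching the figures above.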

For each group of patients and controls analyzed by GSVA, DE may be performed on active and inactive patients together relative to HC at an FDR of 0.2. Differences between HC and SLE patient GSVA enrichment scores may be determined using Welch's t-test for unequal variances (e.g., in PRISM 7.0 v7.0c). In order to quantitate the difference between the SLE and HC groups, the Hedges' g effect size may be determined (e.g., using the Effect Size Calculator for T-Test at the website Social Science Statistics, www.socscistatistics.com/effectsize/Default3.aspx).
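For illustration only, the two statistics may be computed as sketched below using the Python standard library. The sketch returns the Welch t statistic with its Welch-Satterthwaite degrees of freedom (no p-value) and a Hedges' g that applies the common small-sample correction factor; the cited calculator may use a slightly different formulation. The function names and enrichment scores are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


def hedges_g(a, b):
    """Standardized mean difference with a small-sample bias correction."""
    na, nb = len(a), len(b)
    pooled_sd = sqrt(((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2))
    d = (mean(a) - mean(b)) / pooled_sd
    correction = 1 - 3 / (4 * (na + nb) - 9)
    return d * correction


# Hypothetical GSVA enrichment scores for one signature.
sle_scores = [0.5, 0.6, 0.7, 0.8]
hc_scores = [-0.2, -0.1, 0.0, 0.1]
t_stat, dof = welch_t(sle_scores, hc_scores)
effect = hedges_g(sle_scores, hc_scores)
```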

Z score analysis may be performed as follows. Z score calculations may be employed to identify and compare the enrichment of specific signatures in SLE and control datasets. For each regulator, an activation z-score may be calculated strictly from the experimentally observed information provided for the downstream targets. Reference datasets may be used to determine the identity and direction (increased or decreased) of downstream targets. The formula Z = x/σ_x = (Σ_i w_i x_i)/√(Σ_i w_i²) may be used to calculate Z scores with edge weights set to 1. Z scores above 1.96 or below −1.96 are significant at the 95% confidence level, and Z scores above 2.54 or below −2.54 are significant at the 99% confidence level. SLE WB and PBMC datasets may be divided into patients with SLEDAI≥6 (active) and patients with SLEDAI<6 (inactive).
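A minimal sketch of the Z score formula follows. It assumes that each x_i is +1 when the observed direction of a downstream target agrees with the reference direction and −1 when it disagrees, a convention the text does not spell out; the function name and gene directions are hypothetical.

```python
from math import sqrt


def activation_z_score(observed, reference, weights=None):
    """Z = sum(w_i * x_i) / sqrt(sum(w_i ** 2)) over shared targets.

    observed, reference: dicts mapping gene symbol to +1 (increased)
    or -1 (decreased). Edge weights default to 1.
    """
    shared = [gene for gene in reference if gene in observed]
    if not shared:
        raise ValueError("no downstream targets in common")
    if weights is None:
        weights = {gene: 1.0 for gene in shared}
    # x_i is +1 for agreement with the reference direction, -1 otherwise.
    x = sum(weights[g] * (1 if observed[g] == reference[g] else -1) for g in shared)
    return x / sqrt(sum(weights[g] ** 2 for g in shared))


# Hypothetical directions for four downstream targets.
reference = {"IFI27": 1, "IFI44": 1, "MX1": 1, "OAS1": 1}
all_agree = {"IFI27": 1, "IFI44": 1, "MX1": 1, "OAS1": 1}
one_differs = {"IFI27": 1, "IFI44": 1, "MX1": 1, "OAS1": -1}
z_all = activation_z_score(all_agree, reference)
z_one = activation_z_score(one_differs, reference)
```

With unit weights and four targets, full agreement gives Z = 4/√4 = 2.0, and a single disagreement gives Z = 2/√4 = 1.0.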

Reference and control datasets may be obtained as follows. A first reference dataset used may comprise the transcripts (FDR<0.01, LFC>2) from the in vitro treatment of healthy, human PBMC with 0.6 μM IFNA2b, IFNB1a, IFNW1, IFNG, IL12, or TNF differentially expressed compared to control treated PBMC. To eliminate differences in genetic background, a single donor may be used for these experiments. A second reference dataset used may comprise the IFNB1 (MS-IFNB1) signature induced in vivo in the whole blood of a first plurality of Multiple Sclerosis (MS) patients treated with IFNB1 (Avonex, Betaseron, or Rebif) for one to two years compared to a second plurality of MS patients not treated with IFNB1. A third reference dataset used may comprise the IFNA signature induced in a plurality of HepC patients treated with recombinant IFNA for six hours compared to their PBMC before the injection of recombinant IFNA (as described in Table 2 of [Hoffman, R. W. et al. Gene Expression and Pharmacodynamic Changes in 1,760 Systemic Lupus Erythematosus Patients From Two Phase III Trials of BAFF Blockade With Tabalumab. Arthritis Rheumatol. 69, 643-654 (2017)], which is hereby incorporated by reference in its entirety) for the HepC-IFNA2 signature. Published transcripts of PBMC from patients with sepsis DE to controls, and of skin biopsies from patients with dermatomyositis DE to controls may be used as comparators for Z score calculations. The reference dataset for the alternative IFNB1 signaling pathway may be taken from the IFNB1-induced signatures in IFNAR1-deficient mice. Genes may be translated to human gene symbols, and the increased transcripts may be used to determine GSVA scores.

Weighted Gene Co-expression Network Analysis (WGCNA) may be performed as follows. WGCNA, an open source package for R available at https://horvath.genetics.ucla.edu/html/CoexpressionNetwork/Rpackages/WGCNA/, may be used. Log2 normalized microarray expression values for WB, PBMC, purified T cell, B cell, or monocyte datasets may be filtered using an IQR to remove saturated probes with low variability between samples and used as inputs to WGCNA (V1.51). Adjacency co-expression matrices for all probes in a given set may be calculated by Pearson's correlation using signed network type specific formulae. Blockwise network construction may be performed using soft threshold power values that are manually selected and specific to each dataset in order to preserve maximal scale free topology of the networks. Resultant dendrograms of correlation networks may be trimmed to isolate individual modular groups of probes, labeled using semi-random color assignments, based on a detection cut height of 1, with a merging cut height of 0.2, with the additional use of a partitioning around medoids function. Final membership of probes representing the same gene into modules may be based on selection of greatest scale within module correlation against module eigengene (ME) values. Correlation to the presence of SLE disease (versus control) or the disease measure SLEDAI may be performed using Pearson's r against MEs, defining modules as either positively or negatively correlated with those traits as a whole.
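As a non-limiting sketch of the adjacency step, the following computes a signed-network adjacency from Pearson's correlation using the signed formula a_ij = ((1 + cor_ij)/2)^β, where β is the soft threshold power. It is a stand-alone illustration and not the WGCNA package code; the probe names, expression values, and the power of 6 are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    r = cov / (sx * sy)
    return max(-1.0, min(1.0, r))  # guard against rounding just outside [-1, 1]


def signed_adjacency(profiles, power):
    """Signed adjacency ((1 + cor) / 2) ** power for every probe pair."""
    probes = list(profiles)
    return {
        (p, q): ((1 + pearson(profiles[p], profiles[q])) / 2) ** power
        for p in probes
        for q in probes
        if p != q
    }


# Hypothetical log2 expression profiles across four samples.
profiles = {
    "probe_A": [1.0, 2.0, 3.0, 4.0],
    "probe_B": [2.0, 4.0, 6.0, 8.0],
    "probe_C": [4.0, 3.0, 2.0, 1.0],
}
adjacency = signed_adjacency(profiles, power=6)
```

In a signed network, positively correlated probes receive an adjacency near 1, while anti-correlated probes receive an adjacency near 0 rather than being treated as connected.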

F Test analysis for DE gene expression in SLE patients with multiple time points may be performed as follows. One-way analysis of variance (ANOVA) may be used to compare means of two or more samples (using the F distribution). The statistic fit2$F and the corresponding fit2$F.p.value may be used to combine the pair-wise comparisons into one F-test. This is equivalent to a one-way ANOVA for each gene, except that the residual mean squares have been moderated between genes. For the GSE88885 dataset, a subset of patients on standard of care (SOC) therapy and placebo from the Illuminate 1 clinical trial have time-course microarray expression data; 86 placebo treated SLE patients at t=0, t=16 weeks, and t=52 weeks and 16 HC may be analyzed together. For GSE88886, a subset of placebo patients on SOC from the Illuminate 2 clinical trial with time-course microarray data, 33 placebo treated SLE patients with time points at t=0, t=16 weeks, and t=52 weeks and 12 HC may be analyzed together. For GSE72747, all ten patient values at t=0, t=12 weeks, and t=24 weeks and 46 HC from GSE39088 may be analyzed together. Significant changes in IGS may be determined to be a standard deviation (SD) of 0.2 by calculating the SD of the HC for each signature and using the highest SD as a measure of significance.
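For reference, the ordinary one-way ANOVA F statistic for a single gene may be sketched as follows. Unlike the moderated statistic described above, this sketch does not moderate the residual mean squares between genes; the function name and the per-time-point values are hypothetical.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Ordinary (unmoderated) one-way ANOVA F statistic for one gene."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = mean(v for g in groups for v in g)
    # Between-group and within-group sums of squares.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Hypothetical expression values for one gene at three time points.
time_points = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_stat = one_way_anova_f(time_points)
```

For these illustrative values, the between-group mean square is 13 and the within-group mean square is 1, giving F = 13.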

Other statistical analyses may be performed as follows. GraphPad PRISM 7 version 7.0c may be used to perform linear regression analysis, calculation of r² values, and Tukey's multiple comparison analysis for ANOVA. Average and SD may be calculated using Microsoft® Excel®. The built-in ANOVA function in R may be used to compute two-way ANOVA p-values.

In some embodiments, the systems and methods herein are configured for RNA sequencing (RNA-Seq) data analysis, especially single-cell RNA-Seq (scRNA-Seq) data analysis. In some embodiments, scRNA-Seq data has the potential to increase understanding of cell populations in various diseases, such as lupus and cancer. However, the phenotype of individual cells may not be available or manageable when the cell population is large, e.g., 10,000 cells. In some embodiments, scRNA-Seq data is used to identify cell populations or clusters computationally.

In some embodiments, the RNA-Seq data comprises data entries of gene expression levels. In some embodiments, the RNA-Seq data is generated using unique molecular identifiers (UMIs). In some embodiments, the RNA-Seq data is not generated using UMIs. In some embodiments, the RNA-Seq data is of each single cell of the plurality of cells, e.g., scRNA-Seq data. In some embodiments, the RNA-Seq data of one or more cells of the plurality of cells comprise data entries that are identical to the data entries in other cells of the plurality of cells. In some embodiments, the identical data entries account for more than 50%, 60%, 70%, 80%, 90%, or even more of the RNA-Seq data of the one or more cells. As an example, data sets generated using UMIs can have the vast majority (e.g., 90-95%) of data entries set to zero, which confounds existing bioinformatics techniques, including those designed for use with bulk RNA-Seq data. Such a large number of zero entries tends to make all cells look alike in experiments intended to study cellular heterogeneity.
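The sparsity described above may be quantified per cell as sketched below. The sketch assumes each cell's data is a list of per-gene counts in a fixed gene order; the function names, cell identifiers, and counts are hypothetical.

```python
def zero_fraction(counts):
    """Fraction of gene entries equal to zero in one cell."""
    return sum(1 for c in counts if c == 0) / len(counts)


def identical_fraction(counts_a, counts_b):
    """Fraction of gene entries that are identical between two cells."""
    if len(counts_a) != len(counts_b):
        raise ValueError("cells must be measured over the same genes")
    same = sum(1 for a, b in zip(counts_a, counts_b) if a == b)
    return same / len(counts_a)


# Hypothetical UMI counts for ten genes in two cells.
cells = {
    "cell_0001": [0, 0, 3, 0, 0, 0, 1, 0, 0, 0],
    "cell_0002": [0, 0, 0, 0, 5, 0, 0, 0, 0, 0],
}
zeros = {cell_id: zero_fraction(c) for cell_id, c in cells.items()}
shared = identical_fraction(cells["cell_0001"], cells["cell_0002"])
```

In this toy example the two cells are 80% and 90% zeros and share 70% identical entries, which illustrates how zero-dominated profiles can make distinct cells appear similar.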

In some embodiments, the RNA-Seq data is raw gene expression data. In some embodiments, the RNA-Seq data for each cell includes one data entry for each gene, and the data entry can range from zero to an arbitrary number that is greater than zero, e.g., 10, 100, 1,000, 10,000, etc.

In some embodiments, each cell is associated with a unique cell identification number (ID). In some embodiments, the scRNA-Seq data of a cell is associated with the unique cell ID.

Classifiers

In some embodiments, the present disclosure provides a system, method, or kit having data analysis realized in software application, computing hardware, or both. In various embodiments, the analysis application or system includes at least a data receiving module, a data pre-processing module, a data analysis module, a data interpretation module, or a data visualization module. In one embodiment, the data receiving module can comprise computer systems that connect laboratory hardware or instrumentation with computer systems that process laboratory data. In one embodiment, the data pre-processing module can comprise hardware systems or computer software that performs operations on the data in preparation for analysis. Examples of operations that can be applied to the data in the pre-processing module include affine transformations, denoising operations, data cleaning, reformatting, or subsampling. A data analysis module, which can be specialized for analyzing genomic data from one or more genomic materials, can, for example, take assembled genomic sequences and perform probabilistic and statistical analysis to identify abnormal patterns related to a disease, pathology, state, risk, condition, or phenotype. A data interpretation module can use analysis methods, for example, drawn from statistics, mathematics, or biology, to support understanding of the relation between the identified abnormal patterns and health conditions, functional states, prognoses, or risks. A data visualization module can use methods of mathematical modeling, computer graphics, or rendering to create visual representations of data that can facilitate the understanding or interpretation of results.

Feature sets may be generated from datasets obtained using one or more assays of a biological sample, and a trained algorithm may be used to process one or more of the feature sets to identify or assess the condition (e.g., a disease or disorder, such as a lupus condition). For example, the trained algorithm may be used to apply a machine learning classifier to a plurality of lupus condition-associated or interferon-associated genomic loci that are associated with two or more classes of individuals inputted into a machine learning model, in order to classify a subject into one of the two or more classes of individuals. For example, the trained algorithm may be used to apply a machine learning classifier to a plurality of lupus condition-associated or interferon-associated genomic loci that are associated with individuals with known conditions (e.g., a disease or disorder, such as a lupus condition) and individuals not having the condition (e.g., healthy individuals, or individuals who do not have a lupus condition), in order to classify a subject as having the condition (e.g., positive test outcome) or not having the condition (e.g., negative test outcome).

The trained algorithm may be configured to identify the presence (e.g., positive test result) or absence (e.g., negative test result) of one or more conditions (e.g., a disease or disorder, such as a lupus condition) with an accuracy of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than 99%. This accuracy may be achieved for a set of at least about 25, at least about 50, at least about 100, at least about 150, at least about 200, at least about 250, at least about 300, at least about 350, at least about 400, at least about 450, at least about 500, at least about 1,000, or more than about 1,000 independent samples.
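By way of illustration, checking a trained algorithm against an accuracy target on a set of independent samples may be sketched as follows. The function names, the thresholds chosen as defaults, and the example labels are hypothetical.

```python
def accuracy(predicted, actual):
    """Fraction of samples whose predicted class matches the known class."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need equal-length, non-empty label lists")
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)


def meets_target(predicted, actual, min_accuracy=0.90, min_samples=25):
    """True if accuracy reaches the target on enough independent samples."""
    return len(actual) >= min_samples and accuracy(predicted, actual) >= min_accuracy


# Hypothetical outcomes: True = condition present, False = condition absent.
actual = [True] * 15 + [False] * 15
predicted = list(actual)
predicted[0] = False   # one false negative
predicted[-1] = True   # one false positive
observed_accuracy = accuracy(predicted, actual)
target_met = meets_target(predicted, actual)
```

Here 28 of 30 independent samples are classified correctly, so the illustrative 90% target over at least 25 samples is met.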

The trained algorithm may comprise a machine learning algorithm, such as a supervised machine learning algorithm. The supervised machine learning algorithm may comprise, for example, a Random Forest, a support vector machine (SVM), a neural network, or a deep learning algorithm. The trained algorithm may comprise a classification and regression tree (CART) algorithm. The trained algorithm may comprise an unsupervised machine learning algorithm.

The trained algorithm may comprise a classifier configured to accept as input a plurality of input variables or features (e.g., lupus condition-associated or interferon-associated genomic loci) and to produce or output one or more output values based on the plurality of input variables or features (e.g., lupus condition-associated or interferon-associated genomic loci). The plurality of input variables or features may comprise one or more datasets indicative of the presence (e.g., positive test result) or absence (e.g., negative test result) of one or more conditions (e.g., a disease or disorder, such as a lupus condition). For example, an input variable or feature may comprise a number of sequences corresponding to or aligning to each of the plurality of lupus condition-associated or interferon-associated genomic loci.
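As a non-limiting illustration, the per-locus count feature described above may be sketched as follows. The locus names, the `build_feature_vector` helper, and the representation of alignments as a list of locus identifiers are assumptions made for illustration only and are not part of the disclosure.

```python
from collections import Counter


def build_feature_vector(aligned_loci, panel):
    """Return, in panel order, the number of sequences aligned to each locus.

    aligned_loci: one locus identifier per aligned sequence (hypothetical format).
    panel: ordered list of loci used as classifier input features.
    Sequences aligned to loci outside the panel are ignored.
    """
    counts = Counter(aligned_loci)
    return [counts.get(locus, 0) for locus in panel]


# Hypothetical panel and alignments, for illustration only.
panel = ["LOCUS_A", "LOCUS_B", "LOCUS_C"]
aligned = ["LOCUS_A", "LOCUS_C", "LOCUS_A", "LOCUS_X"]
features = build_feature_vector(aligned, panel)
print(features)
```

In this sketch, a locus with no aligned sequences contributes a count of zero, so every subject yields a feature vector of the same length as the panel.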

The plurality of input variables or features may also include clinical information of a subject, such as health data. For example, the health data of a subject may comprise one or more of: a diagnosis of one or more conditions (e.g., a disease or disorder, such as a lupus condition), a prognosis of one or more conditions (e.g., a disease or disorder, such as a lupus condition), a risk of having one or more conditions (e.g., a disease or disorder, such as a lupus condition), a treatment history of one or more conditions (e.g., a disease or disorder, such as a lupus condition), a history of previous treatment for one or more conditions (e.g., a disease or disorder, such as a lupus condition), a history of prescribed medications, a history of prescribed medical devices, age, height, weight, sex, smoking status, and one or more symptoms of the subject. For example, the disease or disorder may comprise one or more of: systemic lupus erythematosus (SLE), discoid lupus erythematosus (DLE), and lupus nephritis (LN).

The trained algorithm may comprise a classifier (e.g., a linear classifier, a logistic regression classifier, etc.), such that each of the one or more output values comprises one of a fixed number of possible values indicating a classification of the sample by the classifier. The trained algorithm may comprise a binary classifier, such that each of the one or more output values comprises one of two values (e.g., {0, 1}, {positive, negative}, or {high-risk, low-risk}) indicating a classification of the sample by the classifier. The trained algorithm may be another type of classifier, such that each of the one or more output values comprises one of more than two values (e.g., {0, 1, 2}, {positive, negative, or indeterminate}, or {high-risk, intermediate-risk, or low-risk}) indicating a classification of the sample by the classifier.

The classifier may be configured to classify samples by assigning output values, which may comprise descriptive labels, numerical values, or a combination thereof. Some of the output values may comprise descriptive labels. Such descriptive labels may provide an identification or indication of the presence (e.g., positive test result) or absence (e.g., negative test result) of one or more conditions (e.g., a disease or disorder, such as a lupus condition) of the subject, and may comprise, for example, positive, negative, high-risk, intermediate-risk, low-risk, or indeterminate. Such descriptive labels may provide an identification of a treatment for the one or more conditions of the subject, and may comprise, for example, a therapeutic intervention, a duration of the therapeutic intervention, and/or a dosage of the therapeutic intervention suitable to treat the one or more conditions of the subject. Such descriptive labels may provide an identification of secondary clinical tests that may be appropriate to perform on the subject, and may comprise, for example, an imaging test, a blood test, a computed tomography (CT) scan, a magnetic resonance imaging (MRI) scan, an ultrasound scan, a chest X-ray, a positron emission tomography (PET) scan, a PET-CT scan, or any combination thereof. For example, such descriptive labels may provide a prognosis of the one or more conditions of the subject. As another example, such descriptive labels may provide a relative assessment of the one or more conditions of the subject. Some descriptive labels may be mapped to numerical values, for example, by mapping “positive” to 1 and “negative” to 0.

The classifier may be configured to classify samples by assigning output values that comprise numerical values, such as binary, integer, or continuous values. Such binary output values may comprise, for example, {0, 1}, {positive, negative}, or {high-risk, low-risk}. Such integer output values may comprise, for example, {0, 1, 2}. Such continuous output values may comprise, for example, a probability value of at least 0 and no more than 1. Such continuous output values may comprise, for example, an un-normalized probability value of at least 0. Such continuous output values may indicate a prognosis of the one or more conditions (e.g., a disease or disorder, such as a lupus condition) of the subject. Some numerical values may be mapped to descriptive labels, for example, by mapping 1 to “positive” and 0 to “negative.”

The classifier may be configured to classify samples by assigning output values based on one or more cutoff values. For example, a binary classification of samples may assign an output value of “positive” or 1 if the sample indicates that the subject has at least a 50% probability of having one or more conditions (e.g., a disease or disorder, such as a lupus condition), thereby assigning the subject to a class of individuals receiving a positive test result. As another example, a binary classification of samples may assign an output value of “negative” or 0 if the sample indicates that the subject has less than a 50% probability of having one or more conditions (e.g., a disease or disorder), thereby assigning the subject to a class of individuals receiving a negative test result. In this case, a single cutoff value of 50% is used to classify samples into one of the two possible binary output values or classes of individuals (e.g., those receiving a positive test result and those receiving a negative test result). Examples of single cutoff values may include about 1%, about 2%, about 5%, about 10%, about 15%, about 20%, about 25%, about 30%, about 35%, about 40%, about 45%, about 50%, about 55%, about 60%, about 65%, about 70%, about 75%, about 80%, about 85%, about 90%, about 91%, about 92%, about 93%, about 94%, about 95%, about 96%, about 97%, about 98%, and about 99%.

As another example, the classifier may be configured to classify samples by assigning an output value of “positive” or 1 if the sample indicates that the subject has a probability of having one or more conditions (e.g., a disease or disorder, such as a lupus condition) of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more. The classification of samples may assign an output value of “positive” or 1 if the sample indicates that the subject has a probability of having one or more conditions (e.g., a disease or disorder, such as a lupus condition) of more than about 50%, more than about 55%, more than about 60%, more than about 65%, more than about 70%, more than about 75%, more than about 80%, more than about 85%, more than about 90%, more than about 91%, more than about 92%, more than about 93%, more than about 94%, more than about 95%, more than about 96%, more than about 97%, more than about 98%, or more than about 99%.

The classifier may be configured to classify samples by assigning an output value of “negative” or 0 if the sample indicates that the subject has a probability of having one or more conditions (e.g., a disease or disorder, such as a lupus condition) of less than about 50%, less than about 45%, less than about 40%, less than about 35%, less than about 30%, less than about 25%, less than about 20%, less than about 15%, less than about 10%, less than about 9%, less than about 8%, less than about 7%, less than about 6%, less than about 5%, less than about 4%, less than about 3%, less than about 2%, or less than about 1%. The classification of samples may assign an output value of “negative” or 0 if the sample indicates that the subject has a probability of having one or more conditions (e.g., a disease or disorder, such as a lupus condition) of no more than about 50%, no more than about 45%, no more than about 40%, no more than about 35%, no more than about 30%, no more than about 25%, no more than about 20%, no more than about 15%, no more than about 10%, no more than about 9%, no more than about 8%, no more than about 7%, no more than about 6%, no more than about 5%, no more than about 4%, no more than about 3%, no more than about 2%, or no more than about 1%.

The classifier may be configured to classify samples by assigning an output value of “indeterminate” or 2 if the sample is not classified as “positive”, “negative”, 1, or 0. In this case, a set of two cutoff values is used to classify samples into one of the three possible output values or classes of individuals (e.g., corresponding to outcome groups of individuals having “low risk,” “intermediate risk,” and “high risk” of having one or more conditions, such as a disease or disorder). Examples of sets of cutoff values may include {1%, 99%}, {2%, 98%}, {5%, 95%}, {10%, 90%}, {15%, 85%}, {20%, 80%}, {25%, 75%}, {30%, 70%}, {35%, 65%}, {40%, 60%}, and {45%, 55%}. Similarly, sets of n cutoff values may be used to classify samples into one of n+1 possible output values or classes of individuals, where n is any positive integer.
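As a non-limiting illustration, classification by a set of n cutoff values into n+1 classes may be sketched as follows. The function name `classify_by_cutoffs` and the convention that a probability exactly equal to a cutoff falls into the higher class (consistent with "at least a 50% probability" yielding a positive result) are assumptions made for illustration.

```python
from bisect import bisect_right


def classify_by_cutoffs(probability, cutoffs, labels):
    """Map a probability in [0, 1] to one of len(cutoffs) + 1 ordered labels.

    cutoffs must be sorted in ascending order. A probability equal to a
    cutoff is assigned to the class above that cutoff (assumed convention).
    """
    if len(labels) != len(cutoffs) + 1:
        raise ValueError("need exactly one more label than cutoffs")
    if list(cutoffs) != sorted(cutoffs):
        raise ValueError("cutoffs must be in ascending order")
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    return labels[bisect_right(cutoffs, probability)]


# Single cutoff of 50%: binary classification.
binary = classify_by_cutoffs(0.5, [0.5], ["negative", "positive"])

# Two cutoffs {20%, 80%}: three classes including "indeterminate".
three_way = [
    classify_by_cutoffs(p, [0.2, 0.8], ["negative", "indeterminate", "positive"])
    for p in (0.1, 0.5, 0.9)
]
print(binary, three_way)
```

The same function covers the single-cutoff and two-cutoff cases described above, since both are instances of the general n-cutoff scheme.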

The trained algorithm may be trained with a plurality of independent training samples. Each of the independent training samples may comprise a sample from a subject, associated datasets obtained by assaying the sample (as described elsewhere herein), and one or more known output values or classes of individuals corresponding to the sample (e.g., a clinical diagnosis, prognosis, absence, or treatment efficacy of a condition of the subject). Independent training samples may comprise samples and associated datasets and outputs obtained or derived from a plurality of different subjects. Independent training samples may comprise samples and associated datasets and outputs obtained at a plurality of different time points from the same subject (e.g., on a regular basis such as weekly, biweekly, or monthly), as part of a longitudinal monitoring of a subject before, during, and after a course of treatment for one or more conditions of the subject. Independent training samples may be associated with presence of the condition (e.g., training samples comprising samples and associated datasets and outputs obtained or derived from a plurality of subjects known to have the condition). Independent training samples may be associated with absence of the condition (e.g., training samples comprising samples and associated datasets and outputs obtained or derived from a plurality of subjects who are known to not have a previous diagnosis of the condition or who have received a negative test result for the condition).

The trained algorithm may be trained with at least about 5, at least about 10, at least about 15, at least about 20, at least about 25, at least about 30, at least about 35, at least about 40, at least about 45, at least about 50, at least about 100, at least about 150, at least about 200, at least about 250, at least about 300, at least about 350, at least about 400, at least about 450, or at least about 500 independent training samples. The independent training samples may comprise samples associated with presence of the condition and/or samples associated with absence of the condition. The trained algorithm may be trained with no more than about 500, no more than about 450, no more than about 400, no more than about 350, no more than about 300, no more than about 250, no more than about 200, no more than about 150, no more than about 100, or no more than about 50 independent training samples associated with presence of the condition (e.g., a disease or disorder, such as a lupus condition). The trained algorithm may be trained with no more than about 500, no more than about 450, no more than about 400, no more than about 350, no more than about 300, no more than about 250, no more than about 200, no more than about 150, no more than about 100, or no more than about 50 independent training samples associated with absence of the condition (e.g., a disease or disorder, such as a lupus condition). In some embodiments, the sample is independent of samples used to train the trained algorithm.

The trained algorithm may be trained with a first number of independent training samples associated with a presence of the condition (e.g., a disease or disorder, such as a lupus condition) and a second number of independent training samples associated with an absence of the condition (e.g., a disease or disorder, such as a lupus condition). The first number of independent training samples associated with presence of the condition (e.g., a disease or disorder, such as a lupus condition) may be no more than the second number of independent training samples associated with absence of the condition (e.g., a disease or disorder, such as a lupus condition). The first number of independent training samples associated with a presence of the condition (e.g., a disease or disorder) may be equal to the second number of independent training samples associated with an absence of the condition (e.g., a disease or disorder, such as a lupus condition). The first number of independent training samples associated with a presence of the condition (e.g., a disease or disorder, such as a lupus condition) may be greater than the second number of independent training samples associated with an absence of the condition (e.g., a disease or disorder, such as a lupus condition).

The trained algorithm may comprise a classifier configured to identify the presence (e.g., positive test result) or absence (e.g., negative test result) of one or more conditions (e.g., a disease or disorder, such as a lupus condition) at an accuracy of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 81%, at least about 82%, at least about 83%, at least about 84%, at least about 85%, at least about 86%, at least about 87%, at least about 88%, at least about 89%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more; for at least about 5, at least about 10, at least about 15, at least about 20, at least about 25, at least about 30, at least about 35, at least about 40, at least about 45, at least about 50, at least about 100, at least about 150, at least about 200, at least about 250, at least about 300, at least about 350, at least about 400, at least about 450, or at least about 500 independent training samples. The accuracy of identifying the presence (e.g., positive test result) or absence (e.g., negative test result) of the one or more conditions by the trained algorithm may be calculated as the percentage of independent test samples (e.g., subjects known to have the condition or subjects with negative clinical test results for the condition) that are correctly identified or classified as having or not having the condition.

The trained algorithm may comprise a classifier configured to identify one or more conditions (e.g., a disease or disorder, such as a lupus condition) with a positive predictive value (PPV) of at least about 5%, at least about 10%, at least about 15%, at least about 20%, at least about 25%, at least about 30%, at least about 35%, at least about 40%, at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 81%, at least about 82%, at least about 83%, at least about 84%, at least about 85%, at least about 86%, at least about 87%, at least about 88%, at least about 89%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more. The PPV of identifying the condition using the trained algorithm may be calculated as the percentage of samples identified or classified as having the condition that correspond to subjects that truly have the condition.

The trained algorithm may comprise a classifier configured to identify one or more conditions (e.g., a disease or disorder, such as a lupus condition) with a negative predictive value (NPV) of at least about 5%, at least about 10%, at least about 15%, at least about 20%, at least about 25%, at least about 30%, at least about 35%, at least about 40%, at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 81%, at least about 82%, at least about 83%, at least about 84%, at least about 85%, at least about 86%, at least about 87%, at least about 88%, at least about 89%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more. The NPV of identifying the condition using the trained algorithm may be calculated as the percentage of samples identified or classified as not having the condition that correspond to subjects that truly do not have the condition.

The trained algorithm may comprise a classifier configured to identify one or more conditions (e.g., a disease or disorder, such as a lupus condition) with a clinical sensitivity of at least about 5%, at least about 10%, at least about 15%, at least about 20%, at least about 25%, at least about 30%, at least about 35%, at least about 40%, at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 81%, at least about 82%, at least about 83%, at least about 84%, at least about 85%, at least about 86%, at least about 87%, at least about 88%, at least about 89%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, at least about 99.1%, at least about 99.2%, at least about 99.3%, at least about 99.4%, at least about 99.5%, at least about 99.6%, at least about 99.7%, at least about 99.8%, at least about 99.9%, at least about 99.99%, at least about 99.999%, or more. The clinical sensitivity of identifying the condition using the trained algorithm may be calculated as the percentage of independent test samples associated with presence of the condition (e.g., subjects known to have the condition) that are correctly identified or classified as having the condition.

The trained algorithm may comprise a classifier configured to identify one or more conditions (e.g., a disease or disorder, such as a lupus condition) with a clinical specificity of at least about 5%, at least about 10%, at least about 15%, at least about 20%, at least about 25%, at least about 30%, at least about 35%, at least about 40%, at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 81%, at least about 82%, at least about 83%, at least about 84%, at least about 85%, at least about 86%, at least about 87%, at least about 88%, at least about 89%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, at least about 99.1%, at least about 99.2%, at least about 99.3%, at least about 99.4%, at least about 99.5%, at least about 99.6%, at least about 99.7%, at least about 99.8%, at least about 99.9%, at least about 99.99%, at least about 99.999%, or more. The clinical specificity of identifying the condition using the trained algorithm may be calculated as the percentage of independent test samples associated with absence of the condition (e.g., subjects with negative clinical test results for the condition) that are correctly identified or classified as not having the condition.
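As a non-limiting illustration, the accuracy, PPV, NPV, clinical sensitivity, and clinical specificity defined above may be computed from true and predicted binary labels as sketched below. The function name, the encoding of presence as 1 and absence as 0, and the example labels are assumptions made for illustration; the example data are not results of the disclosure.

```python
def performance_metrics(y_true, y_pred):
    """Compute binary classification metrics as fractions in [0, 1].

    y_true, y_pred: sequences of 1 (condition present) or 0 (condition absent).
    A metric whose denominator is zero is reported as None.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    def ratio(numerator, denominator):
        return numerator / denominator if denominator else None

    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "ppv": ratio(tp, tp + fp),          # of predicted positives, truly positive
        "npv": ratio(tn, tn + fn),          # of predicted negatives, truly negative
        "sensitivity": ratio(tp, tp + fn),  # of true positives, correctly identified
        "specificity": ratio(tn, tn + fp),  # of true negatives, correctly identified
    }


# Hypothetical labels for ten independent test samples.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
metrics = performance_metrics(y_true, y_pred)
print(metrics)
```

Multiplying each fraction by 100 gives the percentages recited above.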

Classifiers of the trained algorithm may be adjusted or tuned to improve or optimize one or more performance metrics, such as accuracy, PPV, NPV, clinical sensitivity, clinical specificity, AUC, or a combination thereof (e.g., a performance index incorporating a plurality of such performance metrics, such as by calculating a weighted sum therefrom), of identifying the presence (e.g., positive test result) or absence (e.g., negative test result) of the condition. The classifiers may be adjusted or tuned by adjusting parameters of the classifiers (e.g., a set of cutoff values used to classify a sample as described elsewhere herein, or weights of a neural network) to improve or optimize the performance metrics. The one or more classifiers may be adjusted or tuned so as to reduce an overall classification error (e.g., an “out-of-bag” or oob error rate for a Random Forest classifier). The one or more classifiers may be adjusted or tuned continuously during the training process (e.g., as sample datasets are added to the training set) or after the training process has completed.
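As a non-limiting illustration, tuning a single cutoff value to maximize a performance index formed as a weighted sum of clinical sensitivity and clinical specificity may be sketched as follows. The choice of these two metrics, the equal default weights, the candidate cutoff grid, and the example scores are assumptions made for illustration.

```python
def sensitivity_specificity(y_true, scores, cutoff):
    """Return (sensitivity, specificity) when scores >= cutoff are called positive."""
    tp = fn = tn = fp = 0
    for truth, score in zip(y_true, scores):
        predicted = 1 if score >= cutoff else 0
        if truth == 1:
            tp += predicted
            fn += 1 - predicted
        else:
            fp += predicted
            tn += 1 - predicted
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sensitivity, specificity


def tune_cutoff(y_true, scores, candidate_cutoffs, w_sens=0.5, w_spec=0.5):
    """Pick the candidate cutoff maximizing w_sens*sensitivity + w_spec*specificity.

    Ties are resolved in favor of the earliest candidate.
    """
    best_cutoff, best_index = None, float("-inf")
    for cutoff in candidate_cutoffs:
        sensitivity, specificity = sensitivity_specificity(y_true, scores, cutoff)
        index = w_sens * sensitivity + w_spec * specificity
        if index > best_index:
            best_cutoff, best_index = cutoff, index
    return best_cutoff, best_index


# Hypothetical classifier scores for six samples (1 = condition present).
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.7, 0.6, 0.4, 0.3, 0.1]
best_cutoff, best_index = tune_cutoff(y_true, scores, [0.2, 0.35, 0.5, 0.65, 0.8])
print(best_cutoff, best_index)
```

Different weights shift the selected cutoff toward higher sensitivity or higher specificity, depending on which error is more costly in a given clinical setting.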

The trained algorithm may comprise a plurality of classifiers (e.g., an ensemble) such that the plurality of classifications or outcome values of the plurality of classifiers may be combined to produce a single classification or outcome value for the sample. For example, a sum or a weighted sum of the plurality of classifications or outcome values of the plurality of classifiers may be calculated to produce a single classification or outcome value for the sample. As another example, a majority vote of the plurality of classifications or outcome values of the plurality of classifiers may be identified to produce a single classification or outcome value for the sample. In this manner, a single classification or outcome value may be produced for the sample having greater confidence or statistical significance than the individual classifications or outcome values produced by each of the plurality of classifiers.
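As a non-limiting illustration, combining the outputs of a plurality of classifiers by majority vote or by weighted sum may be sketched as follows. The function names, the handling of tied votes as "indeterminate", and the normalization of the weighted sum by the total weight (so that combined probabilities remain between 0 and 1) are assumptions made for illustration.

```python
from collections import Counter


def majority_vote(classifications):
    """Return the most frequent label, or "indeterminate" on a tie (assumed rule)."""
    if not classifications:
        raise ValueError("at least one classification is required")
    ranked = Counter(classifications).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return "indeterminate"
    return ranked[0][0]


def weighted_average(outcome_values, weights):
    """Return the weighted sum of outcome values, normalized by the total weight."""
    if len(outcome_values) != len(weights):
        raise ValueError("one weight per outcome value is required")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(v * w for v, w in zip(outcome_values, weights)) / total


# Hypothetical outputs from three classifiers for one sample.
vote = majority_vote(["positive", "negative", "positive"])
combined = weighted_average([0.9, 0.6, 0.3], [2, 1, 1])
print(vote, combined)
```

The combined value may then be classified using one or more cutoff values as described elsewhere herein.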

After the trained algorithm is initially trained, a subset of the inputs may be identified as most influential or most important to be included for making high-quality classifications (e.g., having highest permutation feature importance). For example, a subset of the panel of lupus condition-associated or interferon-associated genomic loci may be identified as most influential or most important to be included for making high-quality classifications or identifications of conditions (or sub-types of conditions). The panel of lupus condition-associated or interferon-associated genomic loci, or a subset thereof, may be ranked based on classification metrics indicative of the influence or importance of each individual lupus condition-associated or interferon-associated genomic locus toward making high-quality classifications or identifications of conditions (or sub-types of conditions). Such metrics may be used to reduce, in some cases significantly, the number of input variables (e.g., predictor variables) that may be used to train the one or more classifiers of the trained algorithm to a desired performance level (e.g., based on a desired minimum accuracy, PPV, NPV, clinical sensitivity, clinical specificity, AUC, or a combination thereof).

For example, if training a classifier of the trained algorithm with a plurality comprising several dozen or hundreds of input variables to the classifier results in an accuracy of classification of more than 99%, then training the classifier of the trained algorithm instead with only a selected subset of no more than about 5, no more than about 10, no more than about 15, no more than about 20, no more than about 25, no more than about 30, no more than about 35, no more than about 40, no more than about 45, no more than about 50, or no more than about 100 such most influential or most important input variables among the plurality can yield decreased but still acceptable accuracy of classification (e.g., at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 81%, at least about 82%, at least about 83%, at least about 84%, at least about 85%, at least about 86%, at least about 87%, at least about 88%, at least about 89%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, or at least about 99%).

As another example, if training a classifier of the trained algorithm with a plurality comprising several dozen or hundreds of input variables to the classifier results in a sensitivity or specificity of classification of more than 99%, then training the classifier of the trained algorithm instead with only a selected subset of no more than about 5, no more than about 10, no more than about 15, no more than about 20, no more than about 25, no more than about 30, no more than about 35, no more than about 40, no more than about 45, no more than about 50, or no more than about 100 such most influential or most important input variables among the plurality can yield decreased but still acceptable sensitivity or specificity of classification (e.g., at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 81%, at least about 82%, at least about 83%, at least about 84%, at least about 85%, at least about 86%, at least about 87%, at least about 88%, at least about 89%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, or at least about 99%).

The subset of the plurality of input variables (e.g., the panel of lupus condition-associated or interferon-associated genomic loci) to the classifier of the trained algorithm may be selected by rank-ordering the entire plurality of input variables and selecting a predetermined number (e.g., no more than about 5, no more than about 10, no more than about 15, no more than about 20, no more than about 25, no more than about 30, no more than about 35, no more than about 40, no more than about 45, no more than about 50, or no more than about 100) of input variables with the best classification metrics (e.g., permutation feature importance).
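As a non-limiting illustration, rank-ordering input variables by permutation feature importance and selecting a predetermined number of them may be sketched as follows. The stand-in `predict` function, the use of accuracy as the scored metric, the number of shuffles, and the example data are assumptions made for illustration; any trained classifier exposing a per-sample prediction could be substituted.

```python
import random


def accuracy(predict, rows, labels):
    """Fraction of rows whose predicted label matches the known label."""
    return sum(1 for r, y in zip(rows, labels) if predict(r) == y) / len(labels)


def permutation_importance(predict, rows, labels, n_repeats=20, seed=0):
    """Mean drop in accuracy when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    importances = []
    for j in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)  # break the link between feature j and the label
            permuted = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
            drops.append(baseline - accuracy(predict, permuted, labels))
        importances.append(sum(drops) / n_repeats)
    return importances


def select_top_features(importances, k):
    """Indices of the k features with the highest importance, best first."""
    order = sorted(range(len(importances)), key=lambda j: importances[j], reverse=True)
    return order[:k]


# Stand-in for a trained classifier: uses only feature 0 (hypothetical).
def predict(row):
    return 1 if row[0] > 5 else 0


rows = [[9, 1, 4], [8, 3, 2], [7, 2, 7], [6, 5, 1],
        [4, 4, 6], [3, 2, 3], [2, 1, 8], [1, 3, 5]]
labels = [1, 1, 1, 1, 0, 0, 0, 0]
importances = permutation_importance(predict, rows, labels)
top = select_top_features(importances, k=1)
print(importances, top)
```

Because the stand-in classifier ignores features 1 and 2, shuffling them leaves accuracy unchanged and their importance is zero, so only feature 0 is retained.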

Upon identifying the subject as having one or more conditions (e.g., a disease or disorder, such as a lupus condition), the subject may be optionally provided with a therapeutic intervention (e.g., prescribing an appropriate course of treatment to treat the one or more conditions of the subject). The therapeutic intervention may comprise a prescription of an effective dose of a drug, a further testing or evaluation of the condition, a further monitoring of the condition, or a combination thereof. If the subject is currently being treated for the condition with a course of treatment, the therapeutic intervention may comprise a subsequent different course of treatment (e.g., to increase treatment efficacy due to non-efficacy of the current course of treatment).

The therapeutic intervention may comprise recommending the subject for a secondary clinical test to confirm a diagnosis of the condition. This secondary clinical test may comprise an imaging test, a blood test, a computed tomography (CT) scan, a magnetic resonance imaging (MRI) scan, an ultrasound scan, a chest X-ray, a positron emission tomography (PET) scan, a PET-CT scan, or any combination thereof.

The feature sets (e.g., comprising quantitative measures of a panel of lupus condition-associated or interferon-associated genomic loci) may be analyzed and assessed (e.g., using a trained algorithm comprising one or more classifiers) over a duration of time to monitor a patient (e.g., subject who has a condition or who is being treated for a condition). In such cases, the feature sets of the patient may change during the course of treatment. For example, the quantitative measures of the feature sets of a patient with decreasing risk of the condition due to an effective treatment may shift toward the profile or distribution of a healthy subject (e.g., a subject without the condition). Conversely, for example, the quantitative measures of the feature sets of a patient with increasing risk of the condition due to an ineffective treatment may shift toward the profile or distribution of a subject with higher risk of the condition or a more advanced stage or severity of the condition.

In some embodiments, a difference in the feature sets (e.g., quantitative measures of a panel of lupus condition-associated or interferon-associated genomic loci) determined between the two or more time points may be indicative of the subject having an increased risk of the condition. For example, if the condition was detected in the subject both at an earlier time point and at a later time point, and if the difference is a negative difference (e.g., the quantitative measures of a panel of lupus condition-associated or interferon-associated genomic loci increased from the earlier time point to the later time point), then the difference may be indicative of the subject having an increased risk of the condition. A clinical action or decision may be made based on this indication of the increased risk of the condition, e.g., prescribing a new therapeutic intervention or switching therapeutic interventions (e.g., ending a current treatment and prescribing a new treatment) for the subject. The clinical action or decision may comprise recommending the subject for a secondary clinical test to confirm the increased risk of the condition. This secondary clinical test may comprise an imaging test, a blood test, a computed tomography (CT) scan, a magnetic resonance imaging (MRI) scan, an ultrasound scan, a chest X-ray, a positron emission tomography (PET) scan, a PET-CT scan, or any combination thereof.

In some embodiments, a difference in the feature sets (e.g., quantitative measures of a panel of lupus condition-associated or interferon-associated genomic loci) determined between the two or more time points may be indicative of the subject having a decreased risk of the condition. For example, if the condition was detected in the subject both at an earlier time point and at a later time point, and if the difference is a positive difference (e.g., the quantitative measures of a panel of lupus condition-associated or interferon-associated genomic loci decreased from the earlier time point to the later time point), then the difference may be indicative of the subject having a decreased risk of the condition. A clinical action or decision may be made based on this indication of the decreased risk of the condition (e.g., continuing or ending a current therapeutic intervention) for the subject. The clinical action or decision may comprise recommending the subject for a secondary clinical test to confirm the decreased risk of the condition. This secondary clinical test may comprise an imaging test, a blood test, a computed tomography (CT) scan, a magnetic resonance imaging (MRI) scan, an ultrasound scan, a chest X-ray, a positron emission tomography (PET) scan, a PET-CT scan, or any combination thereof.

In some embodiments, a difference in the feature sets (e.g., quantitative measures of a panel of lupus condition-associated or interferon-associated genomic loci) determined between the two or more time points may be indicative of an efficacy of the course of treatment for treating the condition of the subject. For example, if the condition was detected in the subject at an earlier time point but was not detected in the subject at a later time point, then the difference may be indicative of an efficacy of the course of treatment for treating the condition of the subject. A clinical action or decision may be made based on this indication of the efficacy of the course of treatment for treating the condition of the subject, e.g., continuing or ending a current therapeutic intervention for the subject. The clinical action or decision may comprise recommending the subject for a secondary clinical test to confirm the efficacy of the course of treatment for treating the condition. This secondary clinical test may comprise an imaging test, a blood test, a computed tomography (CT) scan, a magnetic resonance imaging (MRI) scan, an ultrasound scan, a chest X-ray, a positron emission tomography (PET) scan, a PET-CT scan, or any combination thereof.

In some embodiments, a difference in the feature sets (e.g., quantitative measures of a panel of lupus condition-associated or interferon-associated genomic loci) determined between the two or more time points may be indicative of a non-efficacy of the course of treatment for treating the condition of the subject. For example, if the condition was detected in the subject both at an earlier time point and at a later time point, and if the difference is a negative or zero difference (e.g., the quantitative measures of a panel of lupus condition-associated or interferon-associated genomic loci increased or remained at a constant level from the earlier time point to the later time point), and if an efficacious treatment was indicated at an earlier time point, then the difference may be indicative of a non-efficacy of the course of treatment for treating the condition of the subject. A clinical action or decision may be made based on this indication of the non-efficacy of the course of treatment for treating the condition of the subject, e.g., ending a current therapeutic intervention and/or switching to (e.g., prescribing) a different new therapeutic intervention for the subject. The clinical action or decision may comprise recommending the subject for a secondary clinical test to confirm the non-efficacy of the course of treatment for treating the condition. This secondary clinical test may comprise an imaging test, a blood test, a computed tomography (CT) scan, a magnetic resonance imaging (MRI) scan, an ultrasound scan, a chest X-ray, a positron emission tomography (PET) scan, a PET-CT scan, or any combination thereof.
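
The four interpretations above can be sketched as a small decision routine. This is a minimal illustration rather than a definitive implementation. It assumes that the panel has already been reduced to a single summary score per time point, and that the difference is taken as the earlier score minus the later score, so that a negative difference corresponds to an increase over time, as in the examples above. The function name `interpret_change` and its parameters are hypothetical.

```python
from typing import List


def interpret_change(earlier: float, later: float,
                     detected_earlier: bool, detected_later: bool,
                     treated: bool = False) -> List[str]:
    """Return the indications suggested by two time points (illustrative only)."""
    # Assumed sign convention: difference = earlier - later, so a negative
    # difference means the quantitative measure increased over time.
    difference = earlier - later
    indications = []
    if detected_earlier and not detected_later:
        # Condition no longer detected at the later time point.
        indications.append("efficacy")
    if detected_earlier and detected_later:
        if difference < 0:
            indications.append("increased risk")
        elif difference > 0:
            indications.append("decreased risk")
        # Non-efficacy additionally requires that a treatment was indicated.
        if treated and difference <= 0:
            indications.append("non-efficacy")
    return indications
```

More than one indication can apply to the same subject. For example, a rising score under treatment yields both "increased risk" and "non-efficacy" in this sketch.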

In various embodiments, machine learning methods are applied to distinguish samples in a population of samples. In one embodiment, machine learning methods are applied to distinguish samples between healthy and lupus (e.g., SLE or DLE) samples.

Kits

The present disclosure provides kits for identifying or monitoring a disease or disorder (e.g., a lupus condition) of a subject. A kit may comprise probes for identifying a quantitative measure (e.g., indicative of a presence, absence, or relative amount) of sequences at each of a panel of lupus condition-associated or interferon-associated genomic loci in a sample of the subject. A quantitative measure (e.g., indicative of a presence, absence, or relative amount) of sequences at each of a panel of lupus condition-associated or interferon-associated genomic loci in the sample may be indicative of the disease or disorder (e.g., a lupus condition) of the subject. The probes may be selective for the sequences at the panel of lupus condition-associated or interferon-associated genomic loci in the sample. A kit may comprise instructions for using the probes to process the sample to generate datasets indicative of a quantitative measure (e.g., indicative of a presence, absence, or relative amount) of sequences at each of the panel of lupus condition-associated or interferon-associated genomic loci in a sample of the subject.

The probes in the kit may be selective for the sequences at the panel of lupus condition-associated or interferon-associated genomic loci in the sample. The probes in the kit may be configured to selectively enrich nucleic acid (e.g., RNA or DNA) molecules corresponding to the panel of lupus condition-associated or interferon-associated genomic loci. The probes in the kit may be nucleic acid primers. The probes in the kit may have sequence complementarity with nucleic acid sequences from one or more of the panel of lupus condition-associated or interferon-associated genomic loci. The panel of lupus condition-associated or interferon-associated genomic loci or genomic regions may comprise at least 2, at least 3, at least 4, at least 5, at least 6, at least 7, at least 8, at least 9, at least 10, at least 11, at least 12, at least 13, at least 14, at least 15, at least 16, at least 17, at least 18, at least 19, at least 20, or more distinct lupus condition-associated or interferon-associated genomic loci.

The instructions in the kit may comprise instructions to assay the sample using the probes that are selective for the sequences at the panel of lupus condition-associated or interferon-associated genomic loci in the cell-free biological sample. These probes may be nucleic acid molecules (e.g., RNA or DNA) having sequence complementarity with nucleic acid sequences (e.g., RNA or DNA) from one or more of the panel of lupus condition-associated or interferon-associated genomic loci. These nucleic acid molecules may be primers or enrichment sequences. The instructions to assay the cell-free biological sample may comprise instructions to perform array hybridization, polymerase chain reaction (PCR), or nucleic acid sequencing (e.g., DNA sequencing or RNA sequencing) to process the sample to generate datasets indicative of a quantitative measure (e.g., indicative of a presence, absence, or relative amount) of sequences at each of the panel of lupus condition-associated or interferon-associated genomic loci in the sample. A quantitative measure (e.g., indicative of a presence, absence, or relative amount) of sequences at each of a panel of lupus condition-associated or interferon-associated genomic loci in the sample may be indicative of a disease or disorder (e.g., a lupus condition).

The instructions in the kit may comprise instructions to measure and interpret assay readouts, which may be quantified at one or more of the panel of lupus condition-associated or interferon-associated genomic loci to generate the datasets indicative of a quantitative measure (e.g., indicative of a presence, absence, or relative amount) of sequences at each of the panel of lupus condition-associated or interferon-associated genomic loci in the sample. For example, quantification of array hybridization or polymerase chain reaction (PCR) corresponding to the panel of lupus condition-associated or interferon-associated genomic loci may generate the datasets indicative of a quantitative measure (e.g., indicative of a presence, absence, or relative amount) of sequences at each of the panel of lupus condition-associated or interferon-associated genomic loci in the sample. Assay readouts may comprise quantitative PCR (qPCR) values, digital PCR (dPCR) values, digital droplet PCR (ddPCR) values, fluorescence values, etc., or normalized values thereof.

Low-Density Granulocyte (LDG) Profiling of Lupus Conditions

Systemic lupus erythematosus (SLE) is an autoimmune disease characterized by the presence of low-density granulocytes (LDGs) with a heightened capacity for spontaneous NETosis, but the contribution of LDGs to SLE pathogenesis may remain unclear. Systems and methods of the present disclosure may characterize LDGs in human SLE by characterizing gene expression profiles derived from isolated LDGs by weighted gene coexpression network analysis (WGCNA). A multiple-gene module (e.g., a 92-gene module) may be identified in this manner. The LDG gene signature may be enriched in genes related to neutrophil degranulation and cell cycle regulation. This signature may be assessed in gene expression datasets from two large-scale SLE clinical trials to study associations between LDG enrichment, SLE manifestations, and treatment regimens. LDG enrichment in the blood may be found to be associated with corticosteroid treatment as well as anti-dsDNA, low serum complement, renal manifestations, and vasculitis, but the latter two of these associations may be dependent on concomitant corticosteroid treatment. In addition, LDG enrichment may be found to be associated with enrichment of gene signatures induced by type I interferon (IFN) and tumor necrosis factor (TNF) irrespective of corticosteroid treatment. Notably, LDG enrichment may not be found in numerous tissues affected by SLE. Comparison with relevant reference datasets may indicate that LDG enrichment is likely reflective of increased granulopoiesis in the bone marrow and not peripheral neutrophil activation. The results obtained using systems and methods of the present disclosure may uncover important determinants of the appearance of LDGs in SLE and emphasize the likely role of LDGs in specific aspects of lupus pathogenesis.

SLE is an autoimmune disease characterized by autoreactive B cell hyperactivity, autoantibody generation, and the presence of a type I IFN gene expression signature. SLE patients may also manifest an increased population of low-density granulocytes (LDGs) in the peripheral blood that remains in the peripheral blood mononuclear cell (PBMC) fraction after Ficoll density gradient separation rather than sedimenting with normal-density granulocytes. LDGs may appear in the circulation of subjects with a number of diseases, including rheumatoid arthritis, HIV infection, cancer, tuberculosis, and Plasmodium vivax infection. Although the presence of LDGs in these conditions may tend to be associated with more severe disease, the physiologic effects of this population may be mediated by diverse pro-inflammatory and anti-inflammatory mechanisms. For example, LDGs may contribute to rheumatoid arthritis pathogenesis by exposing immunogenic citrullinated histones, whereas LDGs in HIV infection may aggravate disease by inhibiting CD4+ T cells via arginase 1.

In SLE, LDGs have been described as a pro-inflammatory subset of neutrophils with an enhanced capacity to release neutrophil extracellular traps (NETs) compared with autologous SLE neutrophils and healthy control (HC) neutrophils through a process called NETosis. During this process, neutrophils expel chromatin, antimicrobial agents, and immunostimulatory molecules into the extracellular space to trap and kill bacteria, but this process can also induce tissue damage. LDGs expose dsDNA, oxidized mitochondrial DNA, LL-37, elastase, and IL-17, among other molecules, during NETosis, and increased NETosis by LDGs may be an important source of immunostimulatory molecules and autoantigens involved in the pathogenesis of SLE.

The presence of LDGs in pediatric SLE patients may be associated with increased lupus activity as measured by the SLE Disease Activity Index (SLEDAI). LDGs have also been implicated in skin involvement and vascular damage in SLE, and netting neutrophils have been described in the glomeruli and skin of lupus patients, although it may remain unclear whether the infiltrating cells were LDGs or normal-density neutrophils.

Based on nuclear morphology and surface marker expression, LDGs have been hypothesized to be immature neutrophil precursors released from the bone marrow, perhaps related to stimulation by colony stimulating factor (CSF), such as granulocyte CSF (G-CSF) or granulocyte/macrophage CSF (GM-CSF). However, the specific origin of LDGs in SLE and, more importantly, the mechanisms by which they contribute to organ involvement and/or disease activity may remain unclear. To gain more insight into LDGs in SLE, systems and methods of the present disclosure may employ a large-scale bioinformatics approach that combines gene expression data and clinical measurements. Using systems and methods of the present disclosure, a transcriptomic signature may be generated that characterizes LDGs in SLE, to determine whether this signature can be detected in the blood and tissue of SLE patients, and to characterize the relationship between this signature and SLE disease manifestations.

The present disclosure provides systems and methods to perform genomic identification of low-density granulocytes (LDGs) and analysis of their role in the pathogenesis of systemic lupus erythematosus (SLE). Analysis of LDGs, SLE neutrophils, and HC neutrophils may reveal hundreds of genes significantly differentially expressed by LDGs and initially identify granulopoietic and proliferative signatures as potentially descriptive of LDGs. Given that circulating neutrophils do not express granulopoietic genes and that SLE neutrophils did not differentially express any genes relative to HC neutrophils, it has been posited that the detection of these signatures in SLE blood may be attributed to LDGs. However, the DE approach may be challenged by contamination from platelets and lymphocytes. LDGs may be isolated from PBMC by negative selection, using a mixture of biotinylated antibodies (Abs) to human cluster of differentiation (CD) molecules; HC and SLE neutrophils may be isolated by dextran sedimentation of red blood cell (RBC) pellets. Although the purity of LDG and neutrophil isolates may be high, the low baseline levels of transcription in neutrophils may allow even small amounts of contamination to affect microarray results strongly, so further refinement may be needed to extract a robust LDG gene expression signature.

The coexpression-based unsupervised clustering method of WGCNA may be able to dissect the gene expression landscape down into several modules of genes that separate LDG samples and neutrophil samples. One of these modules may capture what may seem to be a pattern of lymphocyte contamination in the original expression data, and another set of modules, which may be merged to form module A, may contain many of the platelet genes identified in the original DE analysis. Functional analysis may be performed to narrow the WGCNA modules down to one final module of genes, which may contain neutrophil granule genes and cell cycle regulation genes. The presence of granule genes may indicate that the module is neutrophil lineage-specific, whereas the presence of cell cycle genes after coexpression network construction may suggest that the cell cycle signature is likely descriptive of LDGs and not an artifact of the isolation protocol. The combination of neutrophil lineage-specific granule genes along with cell cycle genes may appear to identify the unique signature of LDGs. This module of genes may be strongly coexpressed in SLE blood expression data but not in lupus-affected tissue, including lupus nephritis (LN) glomerulus, LN tubulointerstitium (TI), lupus skin, and synovium. This result may indicate that the LDG gene expression signature can be recovered from blood but not from tissue. Although netting neutrophils have been described in SLE-affected glomerulus and skin, the current results may suggest that infiltrating neutrophils are either normal-density neutrophils or LDGs with an altered transcriptional program. More studies may be performed to investigate further, as LDGs may not differentially express any homing receptors or activation markers associated with the ability to infiltrate tissues.

It may be initially surprising not to find transcriptional evidence for LDGs in SLE-affected kidneys or a strong association between LDG enrichment and renal involvement, as a similar group of neutrophil genes may be found to be enriched in the blood of LN patients compared with lupus patients without nephritis. A claim of an association with neutrophils may be based on a gene module, M5.15, derived from modular repertoire analysis and consisting of 24 neutrophil-specific genes, 14 of which overlap with LDG module B. Notably, both LDG module B and M5.15 may contain a core signature of 10 granulopoiesis-related genes that are not part of an endotoxemia-induced neutrophil activation signature (AZU1, CAMP, CEACAM6, CEACAM8, CTSG, DEFA4, ELANE, LTF, MPO, and MS4A3). This may suggest that module M5.15 may not describe neutrophil activation but rather the presence of LDGs. A limitation may be that the presence of rapidly progressive or severe renal disease excludes patients from the ILLUMINATE trials, so an association of active renal disease with enrichment of LDGs may be missed. Therefore, enrichment of LDG genes may not yet be ruled out as a potential biomarker for LN. It may be notable that an association between the LDG signature in the blood and renal involvement in the current study may only be noted in those patients receiving corticosteroids. Whether the usage of corticosteroids is a surrogate for disease activity in this circumstance may not be further delineated, but it may suggest that LDG module B and similar signatures may be of diagnostic use to identify those with LN only in the subset of patients taking corticosteroids.

By taking a large-scale transcriptomics approach to quantify the enrichment of the LDG signature in SLE blood gene expression data, it may be possible to draw associations between LDG enrichment and clinical measurements of disease manifestation by studying both relative enrichment scores and binary LDG enrichment. LDG enrichment may be associated with increased disease activity estimated by SLEDAI, decreased complement levels, and the presence of anti-dsDNA, suggesting that LDGs can act as markers of serological disease activity. Because complement levels and anti-dsDNA are components of the SLEDAI score, it is possible that these measurements account for the association with increased SLEDAI, as the associations with anti-dsDNA and low complement may be stronger than the association with SLEDAI score.
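
The distinction between a relative enrichment score and a binary enrichment call can be illustrated with a short sketch. The disclosure does not specify the scoring method at this point, so the approach below (a mean z-score of the signature genes relative to healthy controls, with a cutoff for the binary call) is an assumption chosen for simplicity, and the names `enrichment_score`, `is_enriched`, and the default threshold are hypothetical.

```python
from statistics import mean, stdev
from typing import Dict, List


def enrichment_score(sample: Dict[str, float],
                     controls: List[Dict[str, float]],
                     genes: List[str]) -> float:
    """Mean z-score of the signature genes relative to control samples."""
    z_scores = []
    for gene in genes:
        control_values = [c[gene] for c in controls]
        mu, sd = mean(control_values), stdev(control_values)
        if sd > 0:  # skip genes with no variation among controls
            z_scores.append((sample[gene] - mu) / sd)
    return mean(z_scores) if z_scores else 0.0


def is_enriched(score: float, threshold: float = 2.0) -> bool:
    """Binary enrichment call; the threshold is an arbitrary assumption."""
    return score > threshold
```

The relative score could then be tested for association with continuous clinical measurements such as SLEDAI, while the binary call could be compared against categorical ones such as the presence of anti-dsDNA.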

The association between corticosteroid use and LDG enrichment may be notable. Patients taking corticosteroids may have significantly higher LDG enrichment than those not taking corticosteroids, and some disease manifestations may only be associated with LDG enrichment in patients taking corticosteroids. It may be unknown at this time whether increased LDG enrichment among patients using corticosteroids is related to increased granulopoiesis in the bone marrow or demargination of LDGs from the endothelium. Other studies may suggest that the major effect of corticosteroids on distribution of cells of the neutrophil lineage relates to demargination, although this may not be known for LDGs. However, the findings may suggest that at least one component of the appearance of increased LDGs in the blood of lupus patients relates to corticosteroid-induced demargination. It may be suggested that LDGs play a role in SLE vascular pathology. It may be possible, therefore, that LDGs home to the endothelium and contribute to local vascular inflammation. In this situation, corticosteroid-induced demargination may be therapeutically useful by dissociating LDGs from the vascular endothelium. The relationship between circulating LDGs and vascular pathology may be complex, and a better understanding of whether corticosteroid use stimulates LDG production or alternatively causes demargination of LDGs may therefore be essential to resolve this conundrum.

The presence of LDG-specific genes in bone marrow myeloid precursors may support the hypothesis that LDGs are related to early neutrophil precursors (PM or MY) released from the bone marrow in response to cytokine challenge. Other studies may suggest that there may be two populations of LDGs in tumor-bearing mice and humans: one originating from the bone marrow and the second from peripheral neutrophils as a result of TGF-β stimulation. Similarly, present results may indicate that LDGs overexpress CD66b (CEACAM8), but no evidence of upregulation of the TGF-β signaling pathway may be found. These results may be most consistent with the conclusion that the LDGs expanded in SLE are most similar to early neutrophil precursors and not TGF-β-stimulated mature neutrophils. Taken together with the strong association between LDG enrichment and TNF response, these results may suggest that another component of the increased appearance of LDGs in the blood of lupus patients may relate to their enhanced release from the bone marrow as a result of chronic TNF-induced production of G-CSF. The associations between LDG enrichment and both low complement levels (indicative of complement consumption, presumably owing to the presence of immune complexes) and a TNF response may suggest that LDGs are part of an acute phase-like response in SLE. Autoantibodies to dsDNA may be found to be present in approximately 73% of patients with positive LDG enrichment, and an IFN signature may be seen in 98% of patients with LDGs. These results may be consistent with a role for autoantibodies and/or autoantibody-containing immune complexes in the appearance of LDGs in the circulation either directly or through the induction of cytokines, such as type I IFN or TNF. Alternatively, LDGs may play a role in the induction of autoantibodies, as LDG NETs may be autoantigenic and interferogenic.

Systems and methods of the present disclosure may comprise analysis of bulk RNA from blood and various lupus-affected tissues and, as a result, may not explore the possible heterogeneity of LDGs at the single-cell level. Single-cell transcriptomic studies of LDGs in SLE may be performed to further elucidate the characteristics of this cell population and whether a related population is present in lupus-affected tissues. A deeper understanding of any subtypes of LDGs and how they differ in composition among SLE patients may offer unique insights into disease processes and therapeutic options for patients with circulating LDGs.

The current results may suggest that LDGs are not directly involved in inflammation in SLE-affected organs, but they may act as biomarkers of processes that can in parallel result in tissue damage or vascular damage. As LDGs are associated with anti-dsDNA, low serum complement, and the presence of an IGS, they may indirectly lead to increasingly severe disease in afflicted patients. However, the possibility that factors such as treatment regimens may contribute to the presence of LDGs may not be dismissed because of their association with increased disease activity, highlighting the complexity of the association of LDGs with disease manifestations in SLE. Further studies of LDGs may be performed to help understand the links between corticosteroid treatment, LDG enrichment, and SLE pathogenesis.

In one aspect, the present disclosure provides a method for identifying a lupus condition of a subject, comprising: (a) assaying a biological sample of the subject to generate a dataset comprising gene expression data; (b) processing the dataset at each of a plurality of genes to determine quantitative measures of each of the plurality of genes, wherein the plurality of genes comprises low-density granulocyte (LDG)-associated genes, thereby producing an LDG signature of the biological sample of the subject; (c) comparing the LDG signature with one or more reference LDG signatures, wherein the comparing comprises, for at least one of the plurality of genes, comparing the quantitative measure of the gene of the LDG signature with corresponding quantitative measures of the gene of the one or more reference LDG signatures; and (d) based at least in part on the comparison in (c), identifying the lupus condition of the subject.
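
Steps (c) and (d) can be sketched as a gene-by-gene comparison against labeled reference signatures. The following is a minimal illustration under stated assumptions, not the claimed method: each signature is represented as a mapping from gene symbol to a quantitative measure, the comparison is a mean absolute difference over shared genes, and the subject is assigned the label of the closest reference. The function name `closest_reference` and the example values in the usage note are hypothetical.

```python
from typing import Dict, Optional, Tuple


def closest_reference(signature: Dict[str, float],
                      references: Dict[str, Dict[str, float]]
                      ) -> Tuple[Optional[str], float]:
    """Return the label and distance of the reference closest to the signature."""
    best_label, best_distance = None, float("inf")
    for label, reference in references.items():
        # Compare only genes measured in both signatures.
        shared = sorted(set(signature) & set(reference))
        if not shared:
            continue
        distance = sum(abs(signature[g] - reference[g]) for g in shared) / len(shared)
        if distance < best_distance:
            best_label, best_distance = label, distance
    return best_label, best_distance
```

For instance, with made-up values, a subject signature of `{"MPO": 8.0, "ELANE": 7.5, "CAMP": 7.0}` lies closer to a reference of `{"MPO": 8.5, "ELANE": 7.0, "CAMP": 7.5}` than to one of `{"MPO": 5.0, "ELANE": 5.0, "CAMP": 5.0}`, so the label attached to the first reference would be returned.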

In some embodiments, the lupus condition is selected from the group consisting of: systemic lupus erythematosus (SLE), discoid lupus erythematosus (DLE), and lupus nephritis (LN). In some embodiments, the biological sample is selected from the group consisting of: a whole blood (WB) sample, a PBMC sample, a tissue sample, and a cell sample. In some embodiments, the tissue sample is selected from the group consisting of: skin tissue, synovium tissue, kidney tissue, and bone marrow tissue. In some embodiments, the kidney tissue is selected from the group consisting of: glomerulus (Glom) and tubulointerstitium (TI). In some embodiments, the cell sample is selected from the group consisting of: myelocytes (MY), promyelocytes (PM), polymorphonuclear neutrophils (PMN), and peripheral blood mononuclear cells (PBMC).

In some embodiments, the method further comprises (e) assaying a second biological sample of the subject to generate a second dataset comprising gene expression data; (f) processing the second dataset at each of the plurality of genes to determine second quantitative measures of each of the plurality of genes, thereby producing a second LDG signature of the second biological sample of the subject; (g) comparing the second LDG signature with one or more reference LDG signatures, wherein the comparing comprises, for at least one of the plurality of genes, comparing the quantitative measure of the gene of the second LDG signature with corresponding quantitative measures of the gene of the one or more reference LDG signatures; and (h) based at least in part on the comparison in (g), identifying the lupus condition of the subject.

In some embodiments, the one or more drugs are selected from the group consisting of antimalarials, corticosteroids, immunosuppressants, and nonsteroidal anti-inflammatory drugs (NSAIDs).

In another aspect, the present disclosure provides a computer system for identifying a lupus condition of a subject, comprising: a database that is configured to store a dataset comprising gene expression data, wherein the gene expression data is obtained by assaying a biological sample of the subject; and one or more computer processors operatively coupled to the database, wherein the one or more computer processors are individually or collectively programmed to: (i) process the dataset at each of a plurality of genes to determine quantitative measures of each of the plurality of genes, wherein the plurality of genes comprises low-density granulocyte (LDG)-associated genes, thereby producing an LDG signature of the biological sample of the subject; (ii) compare the LDG signature with one or more reference LDG signatures, wherein the comparing comprises, for at least one of the plurality of genes, comparing the quantitative measure of the gene of the LDG signature with corresponding quantitative measures of the gene of the one or more reference LDG signatures; and (iii) based at least in part on the comparison in (ii), identify the lupus condition of the subject.

In another aspect, the present disclosure provides a non-transitory computer readable medium comprising machine-executable code that, upon execution by one or more computer processors, implements a method for identifying a lupus condition of a subject, the method comprising: (a) assaying a biological sample of the subject to generate a dataset comprising gene expression data; (b) processing the dataset at each of a plurality of genes to determine quantitative measures of each of the plurality of genes, wherein the plurality of genes comprises low-density granulocyte (LDG)-associated genes, thereby producing an LDG signature of the biological sample of the subject; (c) comparing the LDG signature with one or more reference LDG signatures, wherein the comparing comprises, for at least one of the plurality of genes, comparing the quantitative measure of the gene of the LDG signature with corresponding quantitative measures of the gene of the one or more reference LDG signatures; and (d) based at least in part on the comparison in (c), identifying the lupus condition of the subject.

After obtaining a sample from the subject, the sample may be processed to generate datasets indicative of a disease or disorder of the subject. For example, a presence, absence, or quantitative assessment of nucleic acid molecules of the sample at a panel of lupus condition-associated or LDG-associated genomic loci may be indicative of a lupus condition of the subject. Processing the sample obtained from the subject may comprise (i) subjecting the sample to conditions that are sufficient to isolate, enrich, or extract a plurality of nucleic acid molecules, and (ii) assaying the plurality of nucleic acid molecules to generate the dataset (e.g., microarray data, nucleic acid sequences, or quantitative polymerase chain reaction (qPCR) data). Methods of assaying may include any assay known in the art or described in the literature, for example, a microarray assay, a sequencing assay (e.g., DNA sequencing, RNA sequencing, or RNA-Seq), or a quantitative polymerase chain reaction (qPCR) assay.

The sample may be processed without any nucleic acid extraction. For example, the disease or disorder may be identified or monitored in the subject by using probes configured to selectively enrich nucleic acid (e.g., RNA or DNA) molecules corresponding to a panel of lupus condition-associated or LDG-associated genomic loci. The probes may be nucleic acid primers. The probes may have sequence complementarity with nucleic acid sequences from one or more of the panel of lupus condition-associated or LDG-associated genomic loci. The panel of lupus condition-associated or LDG-associated genomic loci may comprise at least 2, at least 3, at least 4, at least 5, at least 6, at least 7, at least 8, at least 9, at least 10, at least 11, at least 12, at least 13, at least 14, at least 15, at least 16, at least 17, at least 18, at least 19, at least 20, at least about 25, at least about 30, at least about 35, at least about 40, at least about 45, at least about 50, at least about 55, at least about 60, at least about 65, at least about 70, at least about 75, at least about 80, at least about 85, at least about 90, at least about 95, at least about 100, or more lupus condition-associated or LDG-associated genomic loci.

The probes may be nucleic acid molecules (e.g., RNA or DNA) having sequence complementarity with nucleic acid sequences (e.g., RNA or DNA) of one or more genomic loci (e.g., lupus condition-associated or LDG-associated genomic loci). These nucleic acid molecules may be primers or enrichment sequences. The assaying of the sample using probes that are selective for the one or more genomic loci (e.g., lupus condition-associated or LDG-associated genomic loci) may comprise use of array hybridization, polymerase chain reaction (PCR), or nucleic acid sequencing (e.g., RNA sequencing or DNA sequencing, such as RNA-Seq).

The assay readouts may be quantified at one or more genomic loci (e.g., lupus condition-associated or LDG-associated genomic loci) to generate the data indicative of the disease or disorder. For example, quantification of array hybridization or polymerase chain reaction (PCR) corresponding to a plurality of genomic loci (e.g., lupus condition-associated or LDG-associated genomic loci) may generate data indicative of the disease or disorder. Assay readouts may comprise quantitative PCR (qPCR) values, digital PCR (dPCR) values, digital droplet PCR (ddPCR) values, fluorescence values, etc., or normalized values thereof.

Methods

Gene expression data may be compiled from SLE patients as follows. Data are derived from publicly available datasets on Gene Expression Omnibus (<https://www.ncbi.nlm.nih.gov/geo/>) and collaborators. Raw data sources are as follows: LDGs (GSE26975 [9 healthy control (HC) neutrophils, 10 SLE neutrophils, and 10 SLE LDGs]), PBMCs (GSE50772 [20 HC and 59 SLE], GSE81622 [25 HC and 30 SLE], FDABMC3 [6 HC and 43 SLE]), whole blood (WB) (GSE49454 [10 HC and 49 SLE], GSE88884 [17 HC and 1612 SLE]), kidney glomerulus and tubulointerstitium (TI) (GSE32591 [14 HC and 30 lupus nephritis (LN)]), skin (GSE52471 [3 HC and 7 discoid lupus erythematosus (DLE)], GSE72535 [8 HC and 9 DLE]), synovium (GSE36700 [4 osteoarthritis (OA) and 4 SLE]), and bone marrow myeloid lineage cells (GSE19556 [6 promyelocytes (PM), 6 myelocytes (MY), 6 bone marrow polymorphonuclear neutrophils (PMN), and 6 peripheral blood PMN]). Clinical data, when available, including disease activity assessed by SLEDAI, anti-dsDNA titers, and complement levels, may be included in the analysis.

Quality control and normalization of raw data files may be performed as follows. Statistical analysis is conducted using R and relevant Bioconductor packages. Nonnormalized arrays are inspected for visual artifacts or poor RNA hybridization using Affy quality control plots. To inspect the raw data files for outliers, principal component analysis plots are generated for all cell types available for each experiment. Datasets culled of outliers are cleaned of background noise and normalized using GeneChip robust multiarray averaging, resulting in log2 intensity values compiled into R expression set objects (E-sets). To increase the probability of identifying differentially expressed genes (DEGs), analysis is conducted using normalized datasets prepared using the native Affy chip definition files (CDFs), followed by custom BrainArray (BA) Entrez CDFs maintained by the University of Michigan Molecular and Behavioral Neuroscience Institute. The Affy CDFs include multiple probes per gene and almost twice as many probes as BA CDFs. Although Affy CDFs can provide the greatest amount of variance information for Bayesian fitting, the BA CDFs are used to exclude probes with known nonspecific binding and those shown by quarterly BLASTs to no longer fall within the target gene. Illumina CDFs are used for the Illumina datasets (GSE49454, GSE81622).

Differential gene expression (DE) analysis may be performed as follows. The CDF-annotated E-sets are filtered to remove probes with very low-intensity values. This reduces the E-set dimensions and the degree of multiple hypothesis testing correction, which increases the statistical significance of the differential expression (DE) probes. Probes missing gene annotation data are also discarded. GeneChip robust multiarray averaging-normalized expression values are variance corrected using local empirical Bayesian shrinkage before calculation of DE, using the ebayes function in the Bioconductor limma package. Resulting p values are adjusted for multiple hypothesis testing using the Benjamini-Hochberg correction, which results in a false discovery rate (FDR). Significant Affy and BA probes within each study are merged and filtered to retain DE probes with an FDR<0.05, which are considered statistically significant. This list is further filtered to retain only the most significant probe per gene to remove duplicate probes.
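
As an illustration of the multiple-testing and probe-selection steps described above, the following is a minimal Python sketch, not the R/limma workflow itself. It assumes that per-probe p values have already been computed; the function names and the (probe, gene, p value) tuple layout are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p values (FDR), in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value to the smallest so that the adjusted
    # values are monotone in the rank of the raw p values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def most_significant_probe_per_gene(probes, fdr_cutoff=0.05):
    """probes: list of (probe_id, gene, p_value) tuples (hypothetical layout).

    Keeps probes with FDR < fdr_cutoff and, for each gene, only the probe
    with the lowest FDR. Returns {gene: (probe_id, fdr)}.
    """
    fdr = benjamini_hochberg([p for _, _, p in probes])
    best = {}
    for (probe_id, gene, _), q in zip(probes, fdr):
        if q < fdr_cutoff and (gene not in best or q < best[gene][1]):
            best[gene] = (probe_id, q)
    return best
```

Here ties between probes of the same gene are resolved in favor of the probe encountered first, which is an arbitrary choice made for the sketch.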

Weighted gene coexpression network analysis (WGCNA) may be performed as follows. Log2 normalized microarray expression values are used as input to weighted gene coexpression network analysis (WGCNA) to conduct an unsupervised clustering analysis, resulting in coexpression “modules,” or groups of densely interconnected genes, which may correspond to comparably regulated biologic pathways. For each experiment, an approximately scale-free topology matrix is first calculated to encode the network strength between probes. Probes are clustered into WGCNA modules based on topology matrix distances. Resultant dendrograms of correlation networks are trimmed to isolate individual modular groups of probes, labeled using semi-random color assignments, based on a detection cut height of 1, with a merging cut height of 0.2, with the additional use of a partitioning around medoids function. Final membership of probes representing the same gene into modules is based on selection of the greatest within-module correlation with module eigengene (ME) values.

Expression profiles of genes within modules are summarized by an ME, the module's first principal component. MEs act as characteristic expression values for their respective modules and can be associated with sample traits such as cell type, cohort (HC or SLE), or serological measurements. This is done by Welch's t test. The correlation coefficient of each gene in a module with the ME (kME), a metric for module membership, is used to determine the association of individual genes with the expression of the module as a whole. The mean kME of all genes in a module is taken as a metric of overall module quality. If the genes in a module have low kMEs, it is indicative that a few highly variable genes dominate the eigengene calculation. Modules with mean kMEs close to 1 are considered to be high quality, and modules with mean kMEs close to 0 are considered to be low quality. When analyzing multiple datasets, the grand mean is the mean of the mean kMEs for each dataset.
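
The relationship between a module eigengene, per-gene kME values, and the mean kME quality metric may be sketched as follows. This is a simplified pure-Python illustration, not the WGCNA implementation: it assumes a genes-by-samples matrix with no constant rows, obtains the first principal component by power iteration, and uses hypothetical function names.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def standardize(row):
    """Center a gene's expression values and scale them to unit variance."""
    n = len(row)
    mean = sum(row) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in row) / (n - 1))
    return [(v - mean) / sd for v in row]


def module_kme(expr, iterations=200):
    """expr: genes x samples matrix for one module.

    Returns (eigengene, per-gene kME list, mean kME).
    """
    z = [standardize(row) for row in expr]
    n_samples = len(z[0])
    # Start from the mean standardized profile so that the eigengene is
    # oriented in the direction of average expression (an assumption).
    v = [sum(row[j] for row in z) / len(z) for j in range(n_samples)]
    for _ in range(iterations):
        scores = [sum(a * b for a, b in zip(row, v)) for row in z]
        v = [sum(s * row[j] for s, row in zip(scores, z)) for j in range(n_samples)]
        norm = math.sqrt(sum(a * a for a in v))
        v = [a / norm for a in v]
    kme = [pearson(row, v) for row in expr]
    return v, kme, sum(kme) / len(kme)
```

A module whose genes track one another closely yields a mean kME near 1, whereas a module dominated by a few highly variable genes yields a lower mean kME.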

Cytoscape and STRING may be used to create MCODE clusters as follows. STRING (v10.5) is used to score protein-protein interaction networks, which are visualized using the Cytoscape (v3.5.1) software. The clusterMaker2 (v1.1.0) plugin application is used to create MCODE clusters of the most closely related genes.

Gene Set Variation Analysis (GSVA) may be performed as follows. The gene set variation analysis (GSVA) Bioconductor package is used as a nonparametric, unsupervised method for estimating the variation of predefined gene sets in patient and control samples of microarray expression datasets. The GSVA algorithm accepts a gene expression matrix of log2-transformed expression values and a collection of predefined gene sets as inputs. Enrichment scores are calculated nonparametrically using a Kolmogorov-Smirnov-like random walk statistic. The enrichment scores are the largest positive and negative random walk deviations from zero, respectively, for a particular sample and gene set. Individual patient gene expression sets are considered positively enriched for a given signature if they display a z-score of greater than 2 relative to controls. Individual patient gene expression sets are considered negatively enriched for a given signature if they display a z-score of less than -2 relative to controls. Analysis of GSVA scores is carried out using Fisher's exact test or Welch's unequal variances t test, where appropriate.
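
The z-score rule for calling a patient positively or negatively enriched may be illustrated with the following Python sketch. It assumes that GSVA enrichment scores for one gene signature have already been computed, that the z-score is taken against the mean and sample standard deviation of the control scores, and that the negative-enrichment cutoff mirrors the positive one (z below -2). The function name and data layout are hypothetical.

```python
from statistics import mean, stdev


def enrichment_calls(patient_scores, control_scores, threshold=2.0):
    """Classify each patient's GSVA score for one signature.

    patient_scores: {patient_id: GSVA score}
    control_scores: list of GSVA scores from control samples
    Returns {patient_id: (z_score, call)}.
    """
    mu = mean(control_scores)
    sd = stdev(control_scores)  # sample standard deviation of controls
    calls = {}
    for patient_id, score in patient_scores.items():
        z = (score - mu) / sd
        if z > threshold:
            call = "positive"
        elif z < -threshold:  # assumption: symmetric negative cutoff
            call = "negative"
        else:
            call = "not enriched"
        calls[patient_id] = (z, call)
    return calls
```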

Other statistical analyses may be performed as follows. The p values resulting from DE analysis are adjusted by the Benjamini-Hochberg FDR correction. Analysis of parametric data is performed using a two-tailed Welch's t test. Correlation analysis of continuous variables is performed by Pearson correlation, and analysis of noncontinuous variables is performed by Spearman rank correlation. Correlations are reported as Pearson r or Spearman rho, as appropriate. Odds ratio analysis is performed by Fisher's exact test, and odds ratios are accompanied by 95% confidence intervals.
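
For a 2x2 contingency table, the odds ratio analysis may be sketched in Python as follows. This is an illustration under stated assumptions rather than the procedure actually used: the two-sided Fisher's exact p value sums the probabilities of all tables no more likely than the observed one, and the 95% confidence interval uses the log (Woolf) approximation, which is a substituted technique and requires all four cells to be nonzero. The function names are hypothetical.

```python
from math import comb, exp, log, sqrt


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p value for the table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    total = sum(p for p in (prob(x) for x in range(lo, hi + 1))
                if p <= p_obs * (1 + 1e-7))
    return min(1.0, total)


def odds_ratio_with_ci(a, b, c, d, z=1.96):
    """Sample odds ratio with an approximate 95% interval (Woolf method).

    Assumes no cell is zero; otherwise a continuity correction is needed.
    """
    odds_ratio = (a * d) / (b * c)
    se = sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return odds_ratio, exp(log(odds_ratio) - z * se), exp(log(odds_ratio) + z * se)
```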

Classifiers

Feature sets may be generated from datasets obtained using one or more assays of a biological sample, and a trained algorithm may be used to process one or more of the feature sets to identify or assess the condition (e.g., a disease or disorder, such as a lupus condition). For example, the trained algorithm may be used to apply a machine learning classifier to a plurality of lupus condition-associated or LDG-associated genomic loci that are associated with two or more classes of individuals inputted into a machine learning model, in order to classify a subject into one of the two or more classes of individuals. For example, the trained algorithm may be used to apply a machine learning classifier to a plurality of lupus condition-associated or LDG-associated genomic loci that are associated with individuals with known conditions (e.g., a disease or disorder, such as a lupus condition) and individuals not having the condition (e.g., healthy individuals, or individuals who do not have a lupus condition), in order to classify a subject as having the condition (e.g., positive test outcome) or not having the condition (e.g., negative test outcome).

The trained algorithm may comprise a classifier configured to accept as input a plurality of input variables or features (e.g., lupus condition-associated or LDG-associated genomic loci) and to produce or output one or more output values based on the plurality of input variables or features (e.g., lupus condition-associated or LDG-associated genomic loci). The plurality of input variables or features may comprise one or more datasets indicative of the presence (e.g., positive test result) or absence (e.g., negative test result) of one or more conditions (e.g., a disease or disorder, such as a lupus condition). For example, an input variable or feature may comprise a number of sequences corresponding to or aligning to each of the plurality of lupus condition-associated or LDG-associated genomic loci.

For example, the disease or disorder may comprise one or more of: systemic lupus erythematosus (SLE), discoid lupus erythematosus (DLE), and lupus nephritis (LN). As another example, the symptoms may include one or more of: alopecia, anti-dsDNA seropositivity, arthritis, fever, hematuria, leukopenia, low serum complement, mucosal ulcer, myositis, pericarditis, pleurisy, proteinuria, pyuria, rash, thrombocytopenia, urinary cast, vasculitis, visual disturbance, or a combination thereof. As another example, the prescribed medications or drugs may include one or more of: antimalarials, corticosteroids, immunosuppressants, and nonsteroidal anti-inflammatory drugs (NSAIDs).

The classifier may be configured to classify samples by assigning an output value of “indeterminate” or 2 if the sample is not classified as “positive”, “negative”, 1, or 0. In this case, a set of two cutoff values is used to classify samples into one of the three possible output values or classes of individuals (e.g., corresponding to outcome groups of individuals having “low risk,” “intermediate risk,” and “high risk” of having one or more conditions, such as a disease or disorder). Examples of sets of cutoff values may include {1%, 99%}, {2%, 98%}, {5%, 95%}, {10%, 90%}, {15%, 85%}, {20%, 80%}, {25%, 75%}, {30%, 70%}, {35%, 65%}, {40%, 60%}, and {45%, 55%}. Similarly, sets of n cutoff values may be used to classify samples into one of n+1 possible output values or classes of individuals, where n is any positive integer.
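
The use of n cutoff values to assign one of n+1 classes may be sketched in Python as follows. The sketch assumes that the classifier emits a probability between 0 and 1 and that a value exactly equal to a cutoff falls into the higher class; both the function name and the boundary handling are illustrative assumptions.

```python
from bisect import bisect_right


def classify(probability, cutoffs=(0.05, 0.95),
             labels=("negative", "indeterminate", "positive")):
    """Map a classifier probability to one of len(cutoffs) + 1 classes."""
    if len(labels) != len(cutoffs) + 1:
        raise ValueError("n cutoffs require n + 1 labels")
    # bisect_right counts how many cutoffs are <= probability, which is
    # the index of the class the sample falls into.
    return labels[bisect_right(sorted(cutoffs), probability)]
```

With the default {5%, 95%} cutoffs, a probability of 0.5 is reported as indeterminate; passing three cutoffs and four labels yields a four-class output in the same way.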

After the trained algorithm is initially trained, a subset of the inputs may be identified as most influential or most important to be included for making high-quality classifications (e.g., having highest permutation feature importance). For example, a subset of the panel of lupus condition-associated or LDG-associated genomic loci may be identified as most influential or most important to be included for making high-quality classifications or identifications of conditions (or sub-types of conditions). The panel of lupus condition-associated or LDG-associated genomic loci, or a subset thereof, may be ranked based on classification metrics indicative of each influence or importance of each individual lupus condition-associated or LDG-associated genomic locus toward making high-quality classifications or identifications of conditions (or sub-types of conditions). Such metrics may be used to reduce, in some cases significantly, the number of input variables (e.g., predictor variables) that may be used to train the one or more classifiers of the trained algorithm to a desired performance level (e.g., based on a desired minimum accuracy, PPV, NPV, clinical sensitivity, clinical specificity, AUC, or a combination thereof).

The subset of the plurality of input variables (e.g., the panel of lupus condition-associated or LDG-associated genomic loci) to the classifier of the trained algorithm may be selected by rank-ordering the entire plurality of input variables and selecting a predetermined number (e.g., no more than about 5, no more than about 10, no more than about 15, no more than about 20, no more than about 25, no more than about 30, no more than about 35, no more than about 40, no more than about 45, no more than about 50, or no more than about 100) of input variables with the best classification metrics (e.g., permutation feature importance).
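
Rank-ordering input variables by permutation feature importance and retaining a predetermined number of them may be sketched as follows. The sketch assumes an already trained classifier exposed as a `predict` callable, uses accuracy as the classification metric, and uses placeholder locus names; all of these are illustrative choices.

```python
import random


def accuracy(predict, rows, labels):
    """Fraction of rows for which the prediction matches the label."""
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(rows)


def permutation_importance(predict, rows, labels, n_repeats=20, seed=0):
    """Mean drop in accuracy when each feature column is shuffled."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    importances = []
    for j in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)
            permuted = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
            drops.append(baseline - accuracy(predict, permuted, labels))
        importances.append(sum(drops) / n_repeats)
    return importances


def top_k(names, importances, k):
    """Names of the k features with the highest importance."""
    ranked = sorted(zip(names, importances), key=lambda t: t[1], reverse=True)
    return [name for name, _ in ranked[:k]]
```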

The therapeutic intervention may include prescribed medications or drugs, which may include one or more of: antimalarials, corticosteroids, immunosuppressants, and nonsteroidal anti-inflammatory drugs (NSAIDs). The therapeutic intervention may be effective to alleviate or decrease one or more symptoms, which may include one or more of alopecia, anti-dsDNA seropositivity, arthritis, fever, hematuria, leukopenia, low serum complement, mucosal ulcer, myositis, pericarditis, pleurisy, proteinuria, pyuria, rash, thrombocytopenia, urinary cast, vasculitis, visual disturbance, or a combination thereof.

The feature sets (e.g., comprising quantitative measures of a panel of lupus condition-associated or LDG-associated genomic loci) may be analyzed and assessed (e.g., using a trained algorithm comprising one or more classifiers) over a duration of time to monitor a patient (e.g., subject who has a condition or who is being treated for a condition). In such cases, the feature sets of the patient may change during the course of treatment. For example, the quantitative measures of the feature sets of a patient with decreasing risk of the condition due to an effective treatment may shift toward the profile or distribution of a healthy subject (e.g., a subject without the condition). Conversely, for example, the quantitative measures of the feature sets of a patient with increasing risk of the condition due to an ineffective treatment may shift toward the profile or distribution of a subject with higher risk of the condition or a more advanced stage or severity of the condition.

The condition of the subject may be monitored by monitoring a course of treatment for treating the condition of the subject. The monitoring may comprise assessing the condition of the subject at two or more time points. The assessing may be based at least on the feature sets (e.g., quantitative measures of a panel of lupus condition-associated or LDG-associated genomic loci) determined at each of the two or more time points. The therapeutic intervention may include prescribed medications or drugs, which may include one or more of antimalarials, corticosteroids, immunosuppressants, and nonsteroidal anti-inflammatory drugs (NSAIDs). The therapeutic intervention may be effective to alleviate or decrease one or more symptoms, which may include one or more of: alopecia, anti-dsDNA seropositivity, arthritis, fever, hematuria, leukopenia, low serum complement, mucosal ulcer, myositis, pericarditis, pleurisy, proteinuria, pyuria, rash, thrombocytopenia, urinary cast, vasculitis, visual disturbance, or a combination thereof. The assessing may be based at least on the presence, absence, or severity of one or more symptoms, such as alopecia, anti-dsDNA seropositivity, arthritis, fever, hematuria, leukopenia, low serum complement, mucosal ulcer, myositis, pericarditis, pleurisy, proteinuria, pyuria, rash, thrombocytopenia, urinary cast, vasculitis, visual disturbance, or a combination thereof.

In some embodiments, a difference in the feature sets (e.g., quantitative measures of a panel of lupus condition-associated or LDG-associated genomic loci) determined between the two or more time points may be indicative of the subject having an increased risk of the condition. For example, if the condition was detected in the subject both at an earlier time point and at a later time point, and if the difference is a negative difference (e.g., the quantitative measures of a panel of lupus condition-associated or LDG-associated genomic loci increased from the earlier time point to the later time point), then the difference may be indicative of the subject having an increased risk of the condition. A clinical action or decision may be made based on this indication of the increased risk of the condition, e.g., prescribing a new therapeutic intervention or switching therapeutic interventions (e.g., ending a current treatment and prescribing a new treatment) for the subject. The clinical action or decision may comprise recommending the subject for a secondary clinical test to confirm the increased risk of the condition. This secondary clinical test may comprise an imaging test, a blood test, a computed tomography (CT) scan, a magnetic resonance imaging (MRI) scan, an ultrasound scan, a chest X-ray, a positron emission tomography (PET) scan, a PET-CT scan, or any combination thereof.

In some embodiments, a difference in the feature sets (e.g., quantitative measures of a panel of lupus condition-associated or LDG-associated genomic loci) determined between the two or more time points may be indicative of the subject having a decreased risk of the condition. For example, if the condition was detected in the subject both at an earlier time point and at a later time point, and if the difference is a positive difference (e.g., the quantitative measures of a panel of lupus condition-associated or LDG-associated genomic loci decreased from the earlier time point to the later time point), then the difference may be indicative of the subject having a decreased risk of the condition. A clinical action or decision may be made based on this indication of the decreased risk of the condition (e.g., continuing or ending a current therapeutic intervention) for the subject. The clinical action or decision may comprise recommending the subject for a secondary clinical test to confirm the decreased risk of the condition. This secondary clinical test may comprise an imaging test, a blood test, a computed tomography (CT) scan, a magnetic resonance imaging (MRI) scan, an ultrasound scan, a chest X-ray, a positron emission tomography (PET) scan, a PET-CT scan, or any combination thereof.

In some embodiments, a difference in the feature sets (e.g., quantitative measures of a panel of lupus condition-associated or LDG-associated genomic loci) determined between the two or more time points may be indicative of an efficacy of the course of treatment for treating the condition of the subject. For example, if the condition was detected in the subject at an earlier time point but was not detected in the subject at a later time point, then the difference may be indicative of an efficacy of the course of treatment for treating the condition of the subject. A clinical action or decision may be made based on this indication of the efficacy of the course of treatment for treating the condition of the subject, e.g., continuing or ending a current therapeutic intervention for the subject. The clinical action or decision may comprise recommending the subject for a secondary clinical test to confirm the efficacy of the course of treatment for treating the condition. This secondary clinical test may comprise an imaging test, a blood test, a computed tomography (CT) scan, a magnetic resonance imaging (MRI) scan, an ultrasound scan, a chest X-ray, a positron emission tomography (PET) scan, a PET-CT scan, or any combination thereof.

In some embodiments, a difference in the feature sets (e.g., quantitative measures of a panel of lupus condition-associated or LDG-associated genomic loci) determined between the two or more time points may be indicative of a non-efficacy of the course of treatment for treating the condition of the subject. For example, if the condition was detected in the subject both at an earlier time point and at a later time point, and if the difference is a negative or zero difference (e.g., the quantitative measures of a panel of lupus condition-associated or LDG-associated genomic loci increased or remained at a constant level from the earlier time point to the later time point), and if an efficacious treatment was indicated at an earlier time point, then the difference may be indicative of a non-efficacy of the course of treatment for treating the condition of the subject. A clinical action or decision may be made based on this indication of the non-efficacy of the course of treatment for treating the condition of the subject, e.g., ending a current therapeutic intervention and/or switching to (e.g., prescribing) a different new therapeutic intervention for the subject. The clinical action or decision may comprise recommending the subject for a secondary clinical test to confirm the non-efficacy of the course of treatment for treating the condition. This secondary clinical test may comprise an imaging test, a blood test, a computed tomography (CT) scan, a magnetic resonance imaging (MRI) scan, an ultrasound scan, a chest X-ray, a positron emission tomography (PET) scan, a PET-CT scan, or any combination thereof.
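
The four monitoring embodiments above may be summarized in the following Python sketch. It assumes that the difference is computed as the earlier measure minus the later measure, which is the reading consistent with a "negative difference" corresponding to an increase, and that a single aggregate quantitative measure is compared. The function name, the `treated` flag, and the returned strings are hypothetical.

```python
def indications(earlier_measure, later_measure,
                detected_earlier, detected_later, treated=True):
    """Return the indications suggested by a change between two time points.

    treated: whether a course of treatment was indicated at the earlier
    time point (assumption used for the non-efficacy indication).
    """
    # Assumed sign convention: earlier minus later, so an increase in
    # the measure over time gives a negative difference.
    difference = earlier_measure - later_measure
    result = []
    if detected_earlier and not detected_later:
        result.append("efficacy of treatment")
    elif detected_earlier and detected_later:
        if difference < 0:
            result.append("increased risk")
        if difference > 0:
            result.append("decreased risk")
        if difference <= 0 and treated:
            result.append("non-efficacy of treatment")
    return result
```

Any such indication would inform, not replace, the clinical action or secondary clinical test described above.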

Kits

The present disclosure provides kits for identifying or monitoring a disease or disorder (e.g., a lupus condition) of a subject. A kit may comprise probes for identifying a quantitative measure (e.g., indicative of a presence, absence, or relative amount) of sequences at each of a panel of lupus condition-associated or LDG-associated genomic loci in a sample of the subject. A quantitative measure (e.g., indicative of a presence, absence, or relative amount) of sequences at each of a panel of lupus condition-associated or LDG-associated genomic loci in the sample may be indicative of the disease or disorder (e.g., a lupus condition) of the subject. The probes may be selective for the sequences at the panel of lupus condition-associated or LDG-associated genomic loci in the sample. A kit may comprise instructions for using the probes to process the sample to generate datasets indicative of a quantitative measure (e.g., indicative of a presence, absence, or relative amount) of sequences at each of the panel of lupus condition-associated or LDG-associated genomic loci in a sample of the subject.

The probes in the kit may be selective for the sequences at the panel of lupus condition-associated or LDG-associated genomic loci in the sample. The probes in the kit may be configured to selectively enrich nucleic acid (e.g., RNA or DNA) molecules corresponding to the panel of lupus condition-associated or LDG-associated genomic loci. The probes in the kit may be nucleic acid primers. The probes in the kit may have sequence complementarity with nucleic acid sequences from one or more of the panel of lupus condition-associated or LDG-associated genomic loci. The panel of lupus condition-associated or LDG-associated genomic loci or genomic regions may comprise at least 2, at least 3, at least 4, at least 5, at least 6, at least 7, at least 8, at least 9, at least 10, at least 11, at least 12, at least 13, at least 14, at least 15, at least 16, at least 17, at least 18, at least 19, at least 20, or more distinct lupus condition-associated or LDG-associated genomic loci.

The instructions in the kit may comprise instructions to assay the sample using the probes that are selective for the sequences at the panel of lupus condition-associated or LDG-associated genomic loci in the cell-free biological sample. These probes may be nucleic acid molecules (e.g., RNA or DNA) having sequence complementarity with nucleic acid sequences (e.g., RNA or DNA) from one or more of the panel of lupus condition-associated or LDG-associated genomic loci. These nucleic acid molecules may be primers or enrichment sequences. The instructions to assay the cell-free biological sample may comprise instructions to perform array hybridization, polymerase chain reaction (PCR), or nucleic acid sequencing (e.g., DNA sequencing or RNA sequencing) to process the sample to generate datasets indicative of a quantitative measure (e.g., indicative of a presence, absence, or relative amount) of sequences at each of the panel of lupus condition-associated or LDG-associated genomic loci in the sample. A quantitative measure (e.g., indicative of a presence, absence, or relative amount) of sequences at each of a panel of lupus condition-associated or LDG-associated genomic loci in the sample may be indicative of a disease or disorder (e.g., a lupus condition).

The instructions in the kit may comprise instructions to measure and interpret assay readouts, which may be quantified at one or more of the panel of lupus condition-associated or LDG-associated genomic loci to generate the datasets indicative of a quantitative measure (e.g., indicative of a presence, absence, or relative amount) of sequences at each of the panel of lupus condition-associated or LDG-associated genomic loci in the sample. For example, quantification of array hybridization or polymerase chain reaction (PCR) corresponding to the panel of lupus condition-associated or LDG-associated genomic loci may generate the datasets indicative of a quantitative measure (e.g., indicative of a presence, absence, or relative amount) of sequences at each of the panel of lupus condition-associated or LDG-associated genomic loci in the sample. Assay readouts may comprise quantitative PCR (qPCR) values, digital PCR (dPCR) values, digital droplet PCR (ddPCR) values, fluorescence values, etc., or normalized values thereof.

Primary Immunodeficiency (PID) Profiling of Lupus Conditions

To examine checkpoints in the immune system driving autoimmunity in SLE, sets of genes abnormally expressed in SLE cells may be compared to sets of causal genes underlying PID. A hypothesis that genes “knocked out” in PID are overexpressed in lupus, and therefore possibly contributing to the immune over-reactivity, may be tested. After compiling a comprehensive database of genes discovered through this process, some of the PID-associated genes may be observed to be differentially expressed (DE) in SLE. Further, some of the PID-associated genes may be found to be uniquely DE in immune subsets (e.g., myeloid, T cells, NK cells, B cells, plasma cells, and neutrophils). A variety of bioinformatics tools may be employed to elucidate the nature of the PID-associated genes that are over-expressed in SLE. For example, STRING, a protein-protein interaction analytic tool, may be applied to the dataset, and distinct groups (e.g., clusters) of PID-associated genes may be identified. Further, Gene Set Variation Analysis (GSVA) may be applied to the dataset, and distinct gene clusters may be identified to be enriched in a set of SLE patients. Clusters of PID-associated genes may be observed to be consistently enriched (e.g., interferon stimulated genes, MHC class-1 antigen presentation, secreted-immune, secreted extracellular matrix, pattern recognition receptors, proteasome activity, and pro-apoptosis). These results may establish that the non-redundant checkpoint genes underlying PID are over-expressed in SLE patients. These genes and the pathways they identify may be used as unique targets for novel therapies in SLE.

The results obtained may provide a deeper understanding of the relationship between primary immunodeficiency (PID) genes and a specific autoimmune disorder, systemic lupus erythematosus (SLE). SLE is a complex genetically-based autoimmune disease defined by the production of high affinity autoantibodies that cause damage to tissues and may be lethal. SLE may disproportionately affect certain groups of subjects (e.g., patients), such as females of African ancestry, and may include exacerbations and great variability. PID may be considered as essentially the functional inactivation of the immune system, in which the causal genes are biological upstream regulators. If a particular gene is knocked out in a subject, then a severe immune phenotype may persist, and the subject's susceptibility to recurrent infections may increase significantly. On the other hand, autoimmunity generally arises in a subject from the over-activation of the immune system of the subject. Therefore, PID and autoimmunity may be considered as opposite sides of the same coin.

In some cases, PID and autoimmunity may share the loss of regulatory checkpoints in the immune system, and these checkpoints may be governed by the same genes. Rather than examining the entire human genome, the analysis may be restricted to identified PID-associated genes, and the role of these genes in SLE may be elucidated, e.g., by cross-referencing differential expression datasets and utilizing various analytical tools to identify the genes common to SLE and PID. Due to the complexity of SLE, many types of drugs (e.g., antimalarials, corticosteroids, immunosuppressants, biologics, and nonsteroidal anti-inflammatory drugs) may be utilized to treat symptoms. Belimumab (Benlysta®), the only drug approved in 60 years to treat SLE, is a biologic that inhibits the binding of B lymphocyte stimulator (BLyS) to its receptors on B cells. Identified PID-associated genes that are also marker genes for SLE may be explored as potential drug therapy targets for SLE patients.

In an aspect, the present disclosure provides a method for identifying a lupus condition of a subject, comprising: (a) assaying a biological sample of the subject to generate a dataset comprising gene expression data; (b) processing the dataset at each of a plurality of genes to determine quantitative measures of each of the plurality of genes, wherein the plurality of genes comprises primary immunodeficiency (PID)-associated genes, thereby producing a PID signature of the biological sample of the subject; (c) processing the PID signature with one or more reference PID signatures, wherein the processing comprises, for at least one of the plurality of genes, comparing the quantitative measure of the gene of the PID signature with corresponding quantitative measures of the gene of the one or more reference PID signatures; and (d) based at least in part on the comparison in (c), identifying the lupus condition of the subject.

In some embodiments, the lupus condition is selected from the group consisting of: systemic lupus erythematosus (SLE), discoid lupus erythematosus (DLE), and lupus nephritis (LN). In some embodiments, the biological sample is selected from the group consisting of a whole blood (WB) sample, a PBMC sample, a tissue sample, and a cell sample. In some embodiments, the tissue sample is selected from the group consisting of: skin tissue, synovium tissue, kidney tissue, and bone marrow tissue. In some embodiments, the kidney tissue is selected from the group consisting of glomerulus (Glom) and tubulointerstitium (TI). In some embodiments, the cell sample is selected from the group consisting of: myelocytes (MY), promyelocytes (PM), polymorphonuclear neutrophils (PMN), peripheral blood mononuclear cells (PBMC), and hematopoietic stem cells.

In some embodiments, the quantitative measures of each of the plurality of genes comprise enrichment scores of each of the plurality of genes. In some embodiments, the enrichment scores of each of the plurality of genes comprise gene set variation analysis (GSVA) enrichment scores of each of the plurality of genes. In some embodiments, (c) further comprises, for the at least one of the plurality of genes, determining a difference between the quantitative measure of the gene of the PID signature and the corresponding quantitative measures of the gene of the one or more reference PID signatures. In some embodiments, (d) further comprises identifying the lupus condition of the subject when the difference satisfies a pre-determined criterion.
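
The comparison of a PID signature with a reference PID signature may be sketched in Python as follows. The sketch assumes that each signature is a mapping from gene identifier to enrichment score and that the pre-determined criterion is an absolute difference of at least 0.5; the gene identifiers, the threshold value, and the function name are all hypothetical placeholders.

```python
def compare_signature(signature, reference, criterion=lambda d: abs(d) >= 0.5):
    """Compare a subject's PID signature with one reference signature.

    signature, reference: {gene_id: quantitative measure}
    criterion: predicate applied to each per-gene difference
               (the default threshold is an illustrative assumption).
    Returns (per-gene differences, genes whose difference meets the criterion).
    """
    differences = {
        gene: signature[gene] - reference[gene]
        for gene in signature
        if gene in reference  # compare only genes present in both
    }
    flagged = [gene for gene, diff in differences.items() if criterion(diff)]
    return differences, flagged
```

For example, a subject signature of {"gene_A": 1.5, "gene_B": 0.25} compared against a reference of {"gene_A": 0.5, "gene_B": 0.25} flags only gene_A under the default criterion.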

In some embodiments, (d) further comprises identifying the lupus condition of the subject based at least in part on a SLEDAI score of the subject. In some embodiments, the subject is asymptomatic for one or more lupus conditions selected from the group consisting of: systemic lupus erythematosus (SLE), discoid lupus erythematosus (DLE), and lupus nephritis (LN).

In some embodiments, the method further comprises (e) assaying a second biological sample of the subject to generate a second dataset comprising gene expression data; (f) processing the second dataset at each of the plurality of genes to determine second quantitative measures of each of the plurality of genes, thereby producing a second PID signature of the second biological sample of the subject; (g) processing the second PID signature with one or more reference PID signatures, wherein the processing comprises, for at least one of the plurality of genes, comparing the quantitative measure of the gene of the second PID signature with corresponding quantitative measures of the gene of the one or more reference PID signatures; and (h) based at least in part on the comparison in (g), identifying the lupus condition of the subject.

In some embodiments, the biological sample and the second biological sample comprise two different sample types selected from the group consisting of a whole blood (WB) sample, a PBMC sample, a skin tissue sample, a synovium tissue sample, a kidney tissue sample comprising glomerulus (Glom), a kidney tissue sample comprising tubulointerstitium (TI), a bone marrow tissue, a myelocyte (MY) cell sample, a promyelocyte (PM) cell sample, a polymorphonuclear neutrophils (PMN) sample, and a hematopoietic stem cell sample.

In some embodiments, a difference in the assessment of the lupus condition of the subject among the plurality of time points is indicative of one or more clinical indications selected from the group consisting of (i) a diagnosis of the lupus condition of the subject, (ii) a prognosis of the lupus condition of the subject, and (iii) an efficacy or non-efficacy of a course of treatment for treating the lupus condition of the subject.

In some embodiments, the one or more drugs are selected from the group consisting of: antimalarials, corticosteroids, immunosuppressants, and nonsteroidal anti-inflammatory drugs (NSAIDs).

In another aspect, the present disclosure provides a computer system for identifying a lupus condition of a subject, comprising: a database that is configured to store a dataset comprising gene expression data, wherein the gene expression data is obtained by assaying a biological sample of the subject; and one or more computer processors operatively coupled to the database, wherein the one or more computer processors are individually or collectively programmed to: (i) process the dataset at each of a plurality of genes to determine quantitative measures of each of the plurality of genes, wherein the plurality of genes comprises primary immunodeficiency (PID)-associated genes, thereby producing a PID signature of the biological sample of the subject; (ii) process the PID signature with one or more reference PID signatures, wherein the processing comprises, for at least one of the plurality of genes, comparing the quantitative measure of the gene of the PID signature with corresponding quantitative measures of the gene of the one or more reference PID signatures; and (iii) based at least in part on the comparison in (ii), identify the lupus condition of the subject.

In another aspect, the present disclosure provides a non-transitory computer readable medium comprising machine-executable code that, upon execution by one or more computer processors, implements a method for identifying a lupus condition of a subject, the method comprising: (a) assaying a biological sample of the subject to generate a dataset comprising gene expression data; (b) processing the dataset at each of a plurality of genes to determine quantitative measures of each of the plurality of genes, wherein the plurality of genes comprises primary immunodeficiency (PID)-associated genes, thereby producing a PID signature of the biological sample of the subject; (c) processing the PID signature with one or more reference PID signatures, wherein the processing comprises, for at least one of the plurality of genes, comparing the quantitative measure of the gene of the PID signature with corresponding quantitative measures of the gene of the one or more reference PID signatures; and (d) based at least in part on the comparison in (c), identifying the lupus condition of the subject.

After obtaining a sample from the subject, the sample may be processed to generate datasets indicative of a disease or disorder of the subject. For example, a presence, absence, or quantitative assessment of nucleic acid molecules of the sample at a panel of lupus condition-associated or PID-associated genomic loci may be indicative of a lupus condition of the subject. Processing the sample obtained from the subject may comprise (i) subjecting the sample to conditions that are sufficient to isolate, enrich, or extract a plurality of nucleic acid molecules, and (ii) assaying the plurality of nucleic acid molecules to generate the dataset (e.g., microarray data, nucleic acid sequences, or quantitative polymerase chain reaction (qPCR) data). Methods of assaying may include any assay known in the art or described in the literature, for example, a microarray assay, a sequencing assay (e.g., DNA sequencing, RNA sequencing, or RNA-Seq), or a quantitative polymerase chain reaction (qPCR) assay.

The sample may be processed without any nucleic acid extraction. For example, the disease or disorder may be identified or monitored in the subject by using probes configured to selectively enrich nucleic acid (e.g., RNA or DNA) molecules corresponding to a panel of lupus condition-associated or PID-associated genomic loci. The probes may be nucleic acid primers. The probes may have sequence complementarity with nucleic acid sequences from one or more of the panel of lupus condition-associated or PID-associated genomic loci. The panel of lupus condition-associated or PID-associated genomic loci may comprise at least 2, at least 3, at least 4, at least 5, at least 6, at least 7, at least 8, at least 9, at least 10, at least 11, at least 12, at least 13, at least 14, at least 15, at least 16, at least 17, at least 18, at least 19, at least 20, at least about 25, at least about 30, at least about 35, at least about 40, at least about 45, at least about 50, at least about 55, at least about 60, at least about 65, at least about 70, at least about 75, at least about 80, at least about 85, at least about 90, at least about 95, at least about 100, or more lupus condition-associated or PID-associated genomic loci.

The probes may be nucleic acid molecules (e.g., RNA or DNA) having sequence complementarity with nucleic acid sequences (e.g., RNA or DNA) of one or more genomic loci (e.g., lupus condition-associated or PID-associated genomic loci). These nucleic acid molecules may be primers or enrichment sequences. The assaying of the sample using probes that are selective for the one or more genomic loci (e.g., lupus condition-associated or PID-associated genomic loci) may comprise use of array hybridization, polymerase chain reaction (PCR), or nucleic acid sequencing (e.g., RNA sequencing or DNA sequencing, such as RNA-Seq).

The assay readouts may be quantified at one or more genomic loci (e.g., lupus condition-associated or PID-associated genomic loci) to generate the data indicative of the disease or disorder. For example, quantification of array hybridization or polymerase chain reaction (PCR) corresponding to a plurality of genomic loci (e.g., lupus condition-associated or PID-associated genomic loci) may generate data indicative of the disease or disorder. Assay readouts may comprise quantitative PCR (qPCR) values, digital PCR (dPCR) values, digital droplet PCR (ddPCR) values, fluorescence values, etc., or normalized values thereof.

Methods

FIG. 63 shows a non-limiting example of a method 6300 for identifying a lupus condition of a subject using PID profiling, in accordance with disclosed embodiments. The method may comprise assaying a biological sample of a subject to generate a dataset comprising gene expression data (as in 6302). Next, the method may comprise processing the dataset to determine quantitative measures of each of a plurality of PID-associated genes, thereby producing a PID signature of the biological sample (as in 6304). Next, the method may comprise processing the PID signature with a reference PID signature (as in 6306). For example, the processing may be performed by comparing the respective quantitative measures of the genes of the PID signature and the reference PID signature. Next, the method may comprise identifying the lupus condition of the subject based at least in part on the comparison (as in 6308).
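As a non-limiting illustration, the comparison and identification operations of method 6300 (as in 6306 and 6308) may be sketched in Python as follows. The gene names, the per-gene subtraction, the threshold of 0.5, and the requirement that at least two genes deviate are hypothetical choices made only for this sketch; the disclosure does not prescribe a particular criterion.

```python
from typing import Dict


def compare_signatures(signature: Dict[str, float],
                       reference: Dict[str, float]) -> Dict[str, float]:
    """Per-gene difference (subject minus reference) for genes present in both."""
    return {gene: signature[gene] - reference[gene]
            for gene in signature if gene in reference}


def identify_condition(differences: Dict[str, float],
                       threshold: float = 0.5,
                       min_genes: int = 2) -> bool:
    """Hypothetical pre-determined criterion: at least `min_genes` genes
    differ from the reference by more than `threshold` in absolute value."""
    deviating = [g for g, d in differences.items() if abs(d) > threshold]
    return len(deviating) >= min_genes


# Placeholder enrichment scores; GENE_A..GENE_C are not real gene symbols.
subject = {"GENE_A": 0.9, "GENE_B": -0.7, "GENE_C": 0.1}
reference = {"GENE_A": 0.1, "GENE_B": 0.0, "GENE_C": 0.0}

diffs = compare_signatures(subject, reference)
flagged = identify_condition(diffs)
```

In this sketch, GENE_A and GENE_B each deviate from the reference by more than the assumed threshold, so the assumed criterion is satisfied.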

A database of PID-associated genes may be constructed as follows. Once identified via thorough searches of primary scientific literature on PIDs, a plurality of causal genes may be compiled into a database. The database may include one or more of the following information for each gene: Gene Symbol, Official Symbol, Full Name, Functional Category (BIG-C™), Entrez ID, Ensembl ID, Gene Type, Synonyms, Chromosome Number, Cytogenetic Location, Inheritance, Genetic Defect/Pathogenesis, Phenotype, Relevance to SLE, Allelic Mutations (OMIM and Primary literature), Protein Effect (GeneCards), OMIM Gene ID, OMIM Phenotype ID, and Mendelian Genetics ID.

BIG-C™ analysis may be performed on the data as follows. Biologically Informed Gene Clustering (BIG-C™) is a functional aggregating tool (AMPEL BioSolutions, Charlottesville, Virginia) for analyzing and understanding the biological groupings of large lists of genes. Genes are sorted into 45 categories based on their most likely biological function and/or cellular localization based on information from multiple online tools and databases.

I-SCOPE analysis may be performed on the data as follows. PID-associated genes may be cross-referenced with immune genes restrictively expressed in hematopoietic cells using the I-SCOPE tool (AMPEL BioSolutions, Charlottesville, Virginia).

Cytoscape, STRING, and MCODE analyses may be performed on the data as follows. A visualization of protein-protein interactions and relationships between genes within datasets may be performed using the Cytoscape (V3.6.0) software and the MCODE StringApp (V1.3.2) plugin application. The Clustermaker2 App (V1.2.1) plugin may be used to create clusters of the most related genes within a dataset, using a network scoring degree cutoff of 2 and setting a node score cut-off of 0.2, k-Core of 2, and a max depth of 100.

Gene expression data may be compiled from SLE patients as follows. Data may be derived from publicly available datasets and collaborators. Raw data files may be obtained from the GEO repository for SLE whole blood data. The following datasets may be used: GSE22098, GSE39088, GSE88884, GSE45291, and GSE61635.

The data may be analyzed for differential gene expression (e.g., between SLE patients vs. controls) as follows. GCRMA normalized expression values may be variance corrected using local empirical Bayesian shrinkage, followed by calculation of DE using the ebayes function in the BioConductor LIMMA package. Resulting p-values may be adjusted for multiple hypothesis testing and filtered to retain DE probes with an FDR<0.2.
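As a non-limiting illustration, the multiple-testing adjustment and FDR filter may be sketched in Python as follows. The text above does not name the adjustment procedure; the Benjamini-Hochberg procedure is assumed here, and the p-values are placeholders rather than results from any dataset.

```python
from typing import List


def benjamini_hochberg(p_values: List[float]) -> List[float]:
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that the adjusted
    # values are monotone non-decreasing in rank.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Placeholder per-probe p-values (illustrative only).
p = [0.001, 0.04, 0.03, 0.5]
q = benjamini_hochberg(p)

# Retain probes whose adjusted p-value meets the FDR<0.2 filter.
retained = [i for i, qv in enumerate(q) if qv < 0.2]
```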

Gene Set Variation Analysis (GSVA) may be performed on the data as follows. The GSVA (V1.25.0) software package for R/Bioconductor may be used as a non-parametric, unsupervised method for estimating the variation of pre-defined gene sets in patient and control samples of microarray expression data sets. GSVA may be run using GSE88884 and the MCODE Clusters.

Hedges' g values, a measure of effect size, may be calculated from the GSVA enrichment scores, by contrasting K-S scores of all controls against all lupus patient samples. GSVA enrichment scores may be additionally utilized for Welch's t-tests to identify significant (e.g., p<0.05) gene categories contributing to substantial segregation of cohort samples. Results may be visualized by entering a matrix of Hedges' g values as input to the corrplot package of R (dual scale heatmap). Significant categories may be identified (e.g., having a statistically significant degree of DE).
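As a non-limiting illustration, the effect size and the Welch's t statistic described above may be computed for a single gene category as sketched below in Python. The enrichment scores are placeholders, and the direction of the contrast (patients minus controls) is an assumption of this sketch.

```python
import math
from statistics import mean, variance
from typing import Sequence, Tuple


def hedges_g(patients: Sequence[float], controls: Sequence[float]) -> float:
    """Standardized mean difference with the small-sample bias correction."""
    n1, n2 = len(patients), len(controls)
    pooled_var = ((n1 - 1) * variance(patients)
                  + (n2 - 1) * variance(controls)) / (n1 + n2 - 2)
    d = (mean(patients) - mean(controls)) / math.sqrt(pooled_var)
    correction = 1 - 3 / (4 * (n1 + n2) - 9)
    return d * correction


def welch_t(patients: Sequence[float],
            controls: Sequence[float]) -> Tuple[float, float]:
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    n1, n2 = len(patients), len(controls)
    v1, v2 = variance(patients) / n1, variance(controls) / n2
    t = (mean(patients) - mean(controls)) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df


# Placeholder GSVA enrichment scores for one gene category.
patient_scores = [0.42, 0.35, 0.51, 0.47, 0.39]
control_scores = [-0.12, 0.05, -0.20, 0.01, -0.08]

g = hedges_g(patient_scores, control_scores)
t_stat, dof = welch_t(patient_scores, control_scores)
```

A p-value for the same test may be obtained, for example, from `scipy.stats.ttest_ind` with `equal_var=False`; the standard-library version above is shown only to make the arithmetic explicit.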

Classifiers

Feature sets may be generated from datasets obtained using one or more assays of a biological sample, and a trained algorithm may be used to process one or more of the feature sets to identify or assess the condition (e.g., a disease or disorder, such as a lupus condition). For example, the trained algorithm may be used to apply a machine learning classifier to a plurality of lupus condition-associated or PID-associated genomic loci that are associated with two or more classes of individuals inputted into a machine learning model, in order to classify a subject into one of the two or more classes of individuals. For example, the trained algorithm may be used to apply a machine learning classifier to a plurality of lupus condition-associated or PID-associated genomic loci that are associated with individuals with known conditions (e.g., a disease or disorder, such as a lupus condition) and individuals not having the condition (e.g., healthy individuals, or individuals who do not have a lupus condition), in order to classify a subject as having the condition (e.g., positive test outcome) or not having the condition (e.g., negative test outcome).

The trained algorithm may comprise a classifier configured to accept as input a plurality of input variables or features (e.g., lupus condition-associated or PID-associated genomic loci) and to produce or output one or more output values based on the plurality of input variables or features (e.g., lupus condition-associated or PID-associated genomic loci). The plurality of input variables or features may comprise one or more datasets indicative of the presence (e.g., positive test result) or absence (e.g., negative test result) of one or more conditions (e.g., a disease or disorder, such as a lupus condition). For example, an input variable or feature may comprise a number of sequences corresponding to or aligning to each of the plurality of lupus condition-associated or PID-associated genomic loci.

For example, the disease or disorder may comprise one or more of: systemic lupus erythematosus (SLE), discoid lupus erythematosus (DLE), and lupus nephritis (LN). As another example, the symptoms may include one or more of alopecia, anti-dsDNA seropositivity, arthritis, fever, hematuria, leukopenia, low serum complement, mucosal ulcer, myositis, pericarditis, pleurisy, proteinuria, pyuria, rash, thrombocytopenia, urinary cast, vasculitis, visual disturbance, or a combination thereof. As another example, the prescribed medications or drugs may include one or more of: antimalarials, corticosteroids, immunosuppressants, and nonsteroidal anti-inflammatory drugs (NSAIDs).

The classifier may be configured to classify samples by assigning an output value of “indeterminate” or 2 if the sample is not classified as “positive”, “negative”, 1, or 0. In this case, a set of two cutoff values is used to classify samples into one of the three possible output values or classes of individuals (e.g., corresponding to outcome groups of individuals having “low risk,” “intermediate risk,” and “high risk” of having one or more conditions, such as a disease or disorder). Examples of sets of cutoff values may include {1%, 99%}, {2%, 98%}, {5%, 95%}, {10%, 90%}, {15%, 85%}, {20%, 80%}, {25%, 75%}, {30%, 70%}, {35%, 65%}, {40%, 60%}, and {45%, 55%}. Similarly, sets of n cutoff values may be used to classify samples into one of n+1 possible output values or classes of individuals, where n is any positive integer.
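As a non-limiting illustration, classification of a sample into one of n+1 classes using n cutoff values may be sketched in Python as follows. The function name, the label strings, and the handling of a probability exactly equal to a cutoff (assigned to the higher class) are assumptions of this sketch.

```python
from bisect import bisect_right
from typing import Sequence


def classify_by_cutoffs(probability: float,
                        cutoffs: Sequence[float],
                        labels: Sequence[str]) -> str:
    """Map a predicted probability to one of len(cutoffs) + 1 classes."""
    if len(labels) != len(cutoffs) + 1:
        raise ValueError("n cutoffs require n + 1 labels")
    if list(cutoffs) != sorted(cutoffs):
        raise ValueError("cutoffs must be in ascending order")
    # bisect_right counts the cutoffs that are <= probability, which is
    # the index of the class the sample falls into.
    return labels[bisect_right(cutoffs, probability)]


# Two cutoffs, {5%, 95%}, give three classes.
cutoffs = (0.05, 0.95)
labels = ("negative", "indeterminate", "positive")

low = classify_by_cutoffs(0.02, cutoffs, labels)
middle = classify_by_cutoffs(0.50, cutoffs, labels)
high = classify_by_cutoffs(0.97, cutoffs, labels)
```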

After the trained algorithm is initially trained, a subset of the inputs may be identified as most influential or most important to be included for making high-quality classifications (e.g., having highest permutation feature importance). For example, a subset of the panel of lupus condition-associated or PID-associated genomic loci may be identified as most influential or most important to be included for making high-quality classifications or identifications of conditions (or sub-types of conditions). The panel of lupus condition-associated or PID-associated genomic loci, or a subset thereof, may be ranked based on classification metrics indicative of the influence or importance of each individual lupus condition-associated or PID-associated genomic locus toward making high-quality classifications or identifications of conditions (or sub-types of conditions). Such metrics may be used to reduce, in some cases significantly, the number of input variables (e.g., predictor variables) that may be used to train the one or more classifiers of the trained algorithm to a desired performance level (e.g., based on a desired minimum accuracy, PPV, NPV, clinical sensitivity, clinical specificity, AUC, or a combination thereof).

The subset of the plurality of input variables (e.g., the panel of lupus condition-associated or PID-associated genomic loci) to the classifier of the trained algorithm may be selected by rank-ordering the entire plurality of input variables and selecting a predetermined number (e.g., no more than about 5, no more than about 10, no more than about 15, no more than about 20, no more than about 25, no more than about 30, no more than about 35, no more than about 40, no more than about 45, no more than about 50, or no more than about 100) of input variables with the best classification metrics (e.g., permutation feature importance).
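As a non-limiting illustration, rank-ordering input variables by permutation feature importance and selecting a predetermined number of them may be sketched in Python as follows. The toy classifier, the locus names, the use of accuracy as the metric, and the number of repeats are all hypothetical and serve only to make the sketch self-contained.

```python
import random
from typing import Callable, Dict, List

Row = Dict[str, float]


def accuracy(predict: Callable[[Row], int],
             rows: List[Row], labels: List[int]) -> float:
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(rows)


def permutation_importance(predict: Callable[[Row], int],
                           rows: List[Row], labels: List[int],
                           n_repeats: int = 20,
                           seed: int = 0) -> Dict[str, float]:
    """Mean drop in accuracy when one feature's values are shuffled."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    importances = {}
    for feature in rows[0]:
        drops = []
        for _ in range(n_repeats):
            shuffled = [r[feature] for r in rows]
            rng.shuffle(shuffled)
            permuted = [{**r, feature: v} for r, v in zip(rows, shuffled)]
            drops.append(baseline - accuracy(predict, permuted, labels))
        importances[feature] = sum(drops) / n_repeats
    return importances


def select_top_features(importances: Dict[str, float],
                        max_features: int) -> List[str]:
    """Rank-order all features and keep no more than `max_features`."""
    ranked = sorted(importances, key=importances.get, reverse=True)
    return ranked[:max_features]


def toy_predict(row: Row) -> int:
    # Hypothetical trained classifier that relies only on LOCUS_1.
    return int(row["LOCUS_1"] > 0.5)


rows = [{"LOCUS_1": 0.9, "LOCUS_2": 0.1}, {"LOCUS_1": 0.1, "LOCUS_2": 0.8},
        {"LOCUS_1": 0.8, "LOCUS_2": 0.3}, {"LOCUS_1": 0.2, "LOCUS_2": 0.7}]
labels = [1, 0, 1, 0]

scores = permutation_importance(toy_predict, rows, labels)
top = select_top_features(scores, max_features=1)
```

Because the toy classifier ignores LOCUS_2, shuffling that feature leaves accuracy unchanged and its importance is zero, so LOCUS_1 is ranked first.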

The feature sets (e.g., comprising quantitative measures of a panel of lupus condition-associated or PID-associated genomic loci) may be analyzed and assessed (e.g., using a trained algorithm comprising one or more classifiers) over a duration of time to monitor a patient (e.g., subject who has a condition or who is being treated for a condition). In such cases, the feature sets of the patient may change during the course of treatment. For example, the quantitative measures of the feature sets of a patient with decreasing risk of the condition due to an effective treatment may shift toward the profile or distribution of a healthy subject (e.g., a subject without the condition). Conversely, for example, the quantitative measures of the feature sets of a patient with increasing risk of the condition due to an ineffective treatment may shift toward the profile or distribution of a subject with higher risk of the condition or a more advanced stage or severity of the condition.

The condition of the subject may be monitored by monitoring a course of treatment for treating the condition of the subject. The monitoring may comprise assessing the condition of the subject at two or more time points. The assessing may be based at least on the feature sets (e.g., quantitative measures of a panel of lupus condition-associated or PID-associated genomic loci) determined at each of the two or more time points. The therapeutic intervention may include prescribed medications or drugs, which may include one or more of: antimalarials, corticosteroids, immunosuppressants, and nonsteroidal anti-inflammatory drugs (NSAIDs). The therapeutic intervention may be effective to alleviate or decrease one or more symptoms, which may include one or more of: alopecia, anti-dsDNA seropositivity, arthritis, fever, hematuria, leukopenia, low serum complement, mucosal ulcer, myositis, pericarditis, pleurisy, proteinuria, pyuria, rash, thrombocytopenia, urinary cast, vasculitis, visual disturbance, or a combination thereof. The assessing may be based at least on the presence, absence, or severity of one or more symptoms, such as alopecia, anti-dsDNA seropositivity, arthritis, fever, hematuria, leukopenia, low serum complement, mucosal ulcer, myositis, pericarditis, pleurisy, proteinuria, pyuria, rash, thrombocytopenia, urinary cast, vasculitis, visual disturbance, or a combination thereof.

In some embodiments, a difference in the feature sets (e.g., quantitative measures of a panel of lupus condition-associated or PID-associated genomic loci) determined between the two or more time points may be indicative of the subject having an increased risk of the condition. For example, if the condition was detected in the subject both at an earlier time point and at a later time point, and if the difference is a positive difference (e.g., the quantitative measures of a panel of lupus condition-associated or PID-associated genomic loci increased from the earlier time point to the later time point), then the difference may be indicative of the subject having an increased risk of the condition. A clinical action or decision may be made based on this indication of the increased risk of the condition, e.g., prescribing a new therapeutic intervention or switching therapeutic interventions (e.g., ending a current treatment and prescribing a new treatment) for the subject. The clinical action or decision may comprise recommending the subject for a secondary clinical test to confirm the increased risk of the condition. This secondary clinical test may comprise an imaging test, a blood test, a computed tomography (CT) scan, a magnetic resonance imaging (MRI) scan, an ultrasound scan, a chest X-ray, a positron emission tomography (PET) scan, a PET-CT scan, or any combination thereof.

In some embodiments, a difference in the feature sets (e.g., quantitative measures of a panel of lupus condition-associated or PID-associated genomic loci) determined between the two or more time points may be indicative of the subject having a decreased risk of the condition. For example, if the condition was detected in the subject both at an earlier time point and at a later time point, and if the difference is a negative difference (e.g., the quantitative measures of a panel of lupus condition-associated or PID-associated genomic loci decreased from the earlier time point to the later time point), then the difference may be indicative of the subject having a decreased risk of the condition. A clinical action or decision may be made based on this indication of the decreased risk of the condition (e.g., continuing or ending a current therapeutic intervention) for the subject. The clinical action or decision may comprise recommending the subject for a secondary clinical test to confirm the decreased risk of the condition. This secondary clinical test may comprise an imaging test, a blood test, a computed tomography (CT) scan, a magnetic resonance imaging (MRI) scan, an ultrasound scan, a chest X-ray, a positron emission tomography (PET) scan, a PET-CT scan, or any combination thereof.

In some embodiments, a difference in the feature sets (e.g., quantitative measures of a panel of lupus condition-associated or PID-associated genomic loci) determined between the two or more time points may be indicative of an efficacy of the course of treatment for treating the condition of the subject. For example, if the condition was detected in the subject at an earlier time point but was not detected in the subject at a later time point, then the difference may be indicative of an efficacy of the course of treatment for treating the condition of the subject. A clinical action or decision may be made based on this indication of the efficacy of the course of treatment for treating the condition of the subject, e.g., continuing or ending a current therapeutic intervention for the subject. The clinical action or decision may comprise recommending the subject for a secondary clinical test to confirm the efficacy of the course of treatment for treating the condition. This secondary clinical test may comprise an imaging test, a blood test, a computed tomography (CT) scan, a magnetic resonance imaging (MRI) scan, an ultrasound scan, a chest X-ray, a positron emission tomography (PET) scan, a PET-CT scan, or any combination thereof.

In some embodiments, a difference in the feature sets (e.g., quantitative measures of a panel of lupus condition-associated or PID-associated genomic loci) determined between the two or more time points may be indicative of a non-efficacy of the course of treatment for treating the condition of the subject. For example, if the condition was detected in the subject both at an earlier time point and at a later time point, and if the difference is a positive or zero difference (e.g., the quantitative measures of a panel of lupus condition-associated or PID-associated genomic loci increased or remained at a constant level from the earlier time point to the later time point), and if an efficacious treatment was indicated at an earlier time point, then the difference may be indicative of a non-efficacy of the course of treatment for treating the condition of the subject. A clinical action or decision may be made based on this indication of the non-efficacy of the course of treatment for treating the condition of the subject, e.g., ending a current therapeutic intervention and/or switching to (e.g., prescribing) a different new therapeutic intervention for the subject. The clinical action or decision may comprise recommending the subject for a secondary clinical test to confirm the non-efficacy of the course of treatment for treating the condition. This secondary clinical test may comprise an imaging test, a blood test, a computed tomography (CT) scan, a magnetic resonance imaging (MRI) scan, an ultrasound scan, a chest X-ray, a positron emission tomography (PET) scan, a PET-CT scan, or any combination thereof.
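As a non-limiting illustration, the interpretation of a difference between two time points described in the preceding paragraphs may be sketched in Python as follows. Summarizing the feature set as a single score, the class and function names, and the order in which the rules are checked are assumptions of this sketch; a clinical action or decision would not be derived from such a simplified rule set alone.

```python
from dataclasses import dataclass


@dataclass
class Assessment:
    detected: bool  # whether the condition was detected at this time point
    score: float    # hypothetical aggregate of the quantitative measures


def interpret_change(earlier: Assessment, later: Assessment,
                     treatment_indicated_earlier: bool = False) -> str:
    """Map a pair of assessments to one of the indications described above."""
    if earlier.detected and not later.detected:
        return "efficacy of the course of treatment"
    if earlier.detected and later.detected:
        delta = later.score - earlier.score
        # A positive or zero difference after a treatment was indicated
        # suggests non-efficacy of that treatment.
        if delta >= 0 and treatment_indicated_earlier:
            return "non-efficacy of the course of treatment"
        if delta > 0:
            return "increased risk"
        if delta < 0:
            return "decreased risk"
        return "no change indicated"
    return "not covered by these rules"


baseline = Assessment(detected=True, score=0.62)
follow_up = Assessment(detected=True, score=0.41)

indication = interpret_change(baseline, follow_up)
```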

Kits

The present disclosure provides kits for identifying or monitoring a disease or disorder (e.g., a lupus condition) of a subject. A kit may comprise probes for identifying a quantitative measure (e.g., indicative of a presence, absence, or relative amount) of sequences at each of a panel of lupus condition-associated or PID-associated genomic loci in a sample of the subject. A quantitative measure (e.g., indicative of a presence, absence, or relative amount) of sequences at each of a panel of lupus condition-associated or PID-associated genomic loci in the sample may be indicative of the disease or disorder (e.g., a lupus condition) of the subject. The probes may be selective for the sequences at the panel of lupus condition-associated or PID-associated genomic loci in the sample. A kit may comprise instructions for using the probes to process the sample to generate datasets indicative of a quantitative measure (e.g., indicative of a presence, absence, or relative amount) of sequences at each of the panel of lupus condition-associated or PID-associated genomic loci in a sample of the subject.

The probes in the kit may be selective for the sequences at the panel of lupus condition-associated or PID-associated genomic loci in the sample. The probes in the kit may be configured to selectively enrich nucleic acid (e.g., RNA or DNA) molecules corresponding to the panel of lupus condition-associated or PID-associated genomic loci. The probes in the kit may be nucleic acid primers. The probes in the kit may have sequence complementarity with nucleic acid sequences from one or more of the panel of lupus condition-associated or PID-associated genomic loci. The panel of lupus condition-associated or PID-associated genomic loci or genomic regions may comprise at least 2, at least 3, at least 4, at least 5, at least 6, at least 7, at least 8, at least 9, at least 10, at least 11, at least 12, at least 13, at least 14, at least 15, at least 16, at least 17, at least 18, at least 19, at least 20, or more distinct lupus condition-associated or PID-associated genomic loci.

The instructions in the kit may comprise instructions to assay the sample using the probes that are selective for the sequences at the panel of lupus condition-associated or PID-associated genomic loci in the cell-free biological sample. These probes may be nucleic acid molecules (e.g., RNA or DNA) having sequence complementarity with nucleic acid sequences (e.g., RNA or DNA) from one or more of the panel of lupus condition-associated or PID-associated genomic loci. These nucleic acid molecules may be primers or enrichment sequences. The instructions to assay the cell-free biological sample may comprise instructions to perform array hybridization, polymerase chain reaction (PCR), or nucleic acid sequencing (e.g., DNA sequencing or RNA sequencing) to process the sample to generate datasets indicative of a quantitative measure (e.g., indicative of a presence, absence, or relative amount) of sequences at each of the panel of lupus condition-associated or PID-associated genomic loci in the sample. A quantitative measure (e.g., indicative of a presence, absence, or relative amount) of sequences at each of a panel of lupus condition-associated or PID-associated genomic loci in the sample may be indicative of a disease or disorder (e.g., a lupus condition).

The instructions in the kit may comprise instructions to measure and interpret assay readouts, which may be quantified at one or more of the panel of lupus condition-associated or PID-associated genomic loci to generate the datasets indicative of a quantitative measure (e.g., indicative of a presence, absence, or relative amount) of sequences at each of the panel of lupus condition-associated or PID-associated genomic loci in the sample. For example, quantification of array hybridization or polymerase chain reaction (PCR) corresponding to the panel of lupus condition-associated or PID-associated genomic loci may generate the datasets indicative of a quantitative measure (e.g., indicative of a presence, absence, or relative amount) of sequences at each of the panel of lupus condition-associated or PID-associated genomic loci in the sample. Assay readouts may comprise quantitative PCR (qPCR) values, digital PCR (dPCR) values, digital droplet PCR (ddPCR) values, fluorescence values, etc., or normalized values thereof.

Biological Data Analysis

In an aspect, the present disclosure provides a computer-implemented method for assessing a condition of a subject, comprising: (a) receiving a dataset of a biological sample of the subject; (b) selecting one or more data analysis tools, wherein the one or more data analysis tools comprise an analysis tool selected from the group consisting of: a BIG-C™ big data analysis tool, an I-Scope™ big data analysis tool, a T-Scope™ big data analysis tool, a CellScan big data analysis tool, an MS (Molecular Signature) Scoring™ analysis tool, a Gene Set Variation Analysis (GSVA) tool (e.g., P-Scope), a CoLTs® (Combined Lupus Treatment Scoring) analysis tool, and a Target Scoring analysis tool; (c) processing the dataset using the one or more data analysis tools to generate a data signature of the biological sample of the subject; and (d) based at least in part on the data signature generated in (c), assessing the condition of the subject.

In another aspect, the present disclosure provides a computer system for assessing a condition of a subject, comprising: a database that is configured to store a dataset of a biological sample of the subject; and one or more computer processors operatively coupled to the database, wherein the one or more computer processors are individually or collectively programmed to: (i) select one or more data analysis tools comprising: a BIG-C™ big data analysis tool, an I-Scope™ big data analysis tool, a T-Scope™ big data analysis tool, a CellScan big data analysis tool, an MS (Molecular Signature) Scoring™ analysis tool, a Gene Set Variation Analysis (GSVA) tool (e.g., P-Scope), a CoLTs® (Combined Lupus Treatment Scoring) analysis tool, a Target Scoring analysis tool, or a combination thereof; (ii) process the dataset using the one or more data analysis tools to generate a data signature of the biological sample of the subject; and (iii) based at least in part on the data signature generated in (ii), assess the condition of the subject.

In another aspect, the present disclosure provides a non-transitory computer readable medium comprising machine-executable code that, upon execution by one or more computer processors, implements a method for assessing a condition of a subject, the method comprising: (a) receiving a dataset of a biological sample of the subject; (b) selecting one or more data analysis tools, wherein the one or more data analysis tools comprise an analysis tool selected from the group consisting of: a BIG-C™ big data analysis tool, an I-Scope™ big data analysis tool, a T-Scope™ big data analysis tool, a CellScan big data analysis tool, an MS (Molecular Signature) Scoring™ analysis tool, a Gene Set Variation Analysis (GSVA) tool (e.g., P-Scope), a CoLTs® (Combined Lupus Treatment Scoring) analysis tool, and a Target Scoring analysis tool; (c) processing the dataset using the one or more data analysis tools to generate a data signature of the biological sample of the subject; and (d) based at least in part on the data signature generated in (c), assessing the condition of the subject. In any embodiment described herein, the one or more data analysis tools can be a plurality of data analysis tools each independently selected from a BIG-C™ big data analysis tool, an I-Scope™ big data analysis tool, a T-Scope™ big data analysis tool, a CellScan big data analysis tool, an MS (Molecular Signature) Scoring™ analysis tool, a Gene Set Variation Analysis (GSVA) tool (e.g., P-Scope), a CoLTs® (Combined Lupus Treatment Scoring) analysis tool, and a Target Scoring analysis tool.

After obtaining a sample from the subject, the sample may be processed to generate datasets indicative of a disease or disorder of the subject. For example, a presence, absence, or quantitative assessment of nucleic acid molecules of the sample at a panel of condition-associated genomic loci may be indicative of a lupus condition of the subject. Processing the sample obtained from the subject may comprise (i) subjecting the sample to conditions that are sufficient to isolate, enrich, or extract a plurality of nucleic acid molecules, and (ii) assaying the plurality of nucleic acid molecules to generate the dataset (e.g., microarray data, nucleic acid sequences, or quantitative polymerase chain reaction (qPCR) data). Methods of assaying may include any assay known in the art or described in the literature, for example, a microarray assay, a sequencing assay (e.g., DNA sequencing, RNA sequencing, or RNA-Seq), or a quantitative polymerase chain reaction (qPCR) assay.

The sample may be processed without any nucleic acid extraction. For example, the disease or disorder may be identified or monitored in the subject by using probes configured to selectively enrich nucleic acid (e.g., RNA or DNA) molecules corresponding to a panel of condition-associated genomic loci. The probes may be nucleic acid primers. The probes may have sequence complementarity with nucleic acid sequences from one or more of the panel of condition-associated genomic loci. The panel of condition-associated genomic loci may comprise at least 2, at least 3, at least 4, at least 5, at least 6, at least 7, at least 8, at least 9, at least 10, at least 11, at least 12, at least 13, at least 14, at least 15, at least 16, at least 17, at least 18, at least 19, at least 20, at least about 25, at least about 30, at least about 35, at least about 40, at least about 45, at least about 50, at least about 55, at least about 60, at least about 65, at least about 70, at least about 75, at least about 80, at least about 85, at least about 90, at least about 95, at least about 100, or more condition-associated genomic loci.

The probes may be nucleic acid molecules (e.g., RNA or DNA) having sequence complementarity with nucleic acid sequences (e.g., RNA or DNA) of one or more genomic loci (e.g., condition-associated genomic loci). These nucleic acid molecules may be primers or enrichment sequences. The assaying of the sample using probes that are selective for the one or more genomic loci (e.g., condition-associated genomic loci) may comprise use of array hybridization, polymerase chain reaction (PCR), or nucleic acid sequencing (e.g., RNA sequencing or DNA sequencing, such as RNA-Seq).

The assay readouts may be quantified at one or more genomic loci (e.g., condition-associated genomic loci) to generate the data indicative of the disease or disorder. For example, quantification of array hybridization or polymerase chain reaction (PCR) corresponding to a plurality of genomic loci (e.g., condition-associated genomic loci) may generate data indicative of the disease or disorder. Assay readouts may comprise quantitative PCR (qPCR) values, digital PCR (dPCR) values, digital droplet PCR (ddPCR) values, fluorescence values, etc., or normalized values thereof.

Big Data Analysis Tools and Drug/Target Scoring Algorithms

The present disclosure provides systems and methods to perform data analysis using drug or target scoring algorithms and/or big data analysis tools. In various aspects, such drug or target scoring algorithms and/or big data analysis tools may be used to perform analysis of data sets including, for example, mRNA gene expression or transcriptome data, DNA genomic data, proteomic data, metabolomic data, other types of “-omic” data, or a combination thereof. Systems and methods of the present disclosure may use one or more of the following: a BIG-C™ big data analysis tool, an I-Scope™ big data analysis tool, a T-Scope™ big data analysis tool, a CellScan big data analysis tool, an MS (Molecular Signature) Scoring™ analysis tool, a Gene Set Variation Analysis (GSVA) tool (e.g., P-Scope), a CoLTs® (Combined Lupus Treatment Scoring) analysis tool, and a Target Scoring analysis tool.

FIG. 71 shows a non-limiting example of a workflow of a method 7100 to assess a condition of a subject using one or more data analysis tools and/or algorithms. The method may comprise receiving a dataset of a biological sample of a subject (as in 7102). Next, the method may comprise selecting one or more data analysis tools and/or algorithms (as in 7104). For example, the data analysis tools and/or algorithms may comprise a BIG-C™ big data analysis tool, an I-Scope™ big data analysis tool, a T-Scope™ big data analysis tool, a CellScan big data analysis tool, an MS (Molecular Signature) Scoring™ analysis tool, a Gene Set Variation Analysis (GSVA) tool (e.g., P-Scope), a CoLTs® (Combined Lupus Treatment Scoring) analysis tool, a Target Scoring analysis tool, or a combination thereof. Next, the method may comprise processing the dataset using selected data analysis tools and/or algorithms to generate a data signature of the biological sample of the subject (as in 7106). Next, the method may comprise assessing the condition of the subject based on the data signature (as in 7108).

The BIG-C (Biologically Informed Gene Clustering) tool may be configured to sort large groups of genes into a set of functional groups (e.g., 53 functional groups). The functional groups are created utilizing publicly available information from online tools and databases including UniProtKB/Swiss-Prot, GO Terms, KEGG pathways, NCBI PubMed, and the Interactome. The functional groups may include one or more of: active RNA, anti-apoptosis, anti-proliferation, autophagy, chromatin remodeling, cytoplasm and biochemistry, cytoskeleton, DNA repair, endocytosis, endoplasmic reticulum, endosome and vesicles, fatty acid biosynthesis, cell surface, transcription, glycolysis and gluconeogenesis, golgi, immune cell surface, immune secreted, immune signaling, integrin pathway, interferon stimulated genes, intracellular signaling, lysosome, melanosome, MHC class I, MHC class II, microRNA processing, microRNA, mitochondrial transcription, mitochondria, mitochondria oxidative phosphorylation, mitochondrial TCA cycle, mRNA processing, mRNA splicing, non-coding RNA, nuclear receptor, nucleus and nucleolus, palmitoylation, pattern recognition receptors, peroxisomes, pro-apoptosis, pro-cell cycle, proteasome, pseudogenes, RAS superfamily, reactive oxygen species protection, secreted and extracellular matrix, transcription factors, transporters, transposon control, ubiquitylation and sumoylation, unfolded protein and stress, and unknown. Enrichment scores for each group are calculated based on an overlap p value to determine the functional groups over- or under-expressed in the gene expression dataset. The BIG-C tool may be configured such that each gene is sorted into only one of the 53 functional groups, allowing for a quick and relatively simple understanding of types of genes enriched and co-expressed in a big dataset.

The I-Scope™ tool may be configured to identify immune infiltrates. Hematopoietic cells are unique in that they move throughout the body patrolling for threats to the host, and may infiltrate tissue sites not normally home to immune cells. I-Scope™ may be configured to identify hematopoietic cells through an iterative search of more than 17,000 genes identified in more than 50 microarray datasets. From this search, 1226 candidate genes are identified and researched for restriction in hematopoietic cells as determined by the HPA, GTEx and FANTOM5 datasets (e.g., available at proteinatlas.org). 926 genes meet the criteria for being mainly restricted to hematopoietic lineages (brain, reproductive organ exclusions were permitted). These genes are researched for immune cell specific expression in 27 hematopoietic sub-categories: alpha beta T cell, T cell, regulatory T Cell, activated T cell, anergic T cell, gamma delta T cells, CD8 T, NK/NKT cell, NK cell, T & B cells, B cells, germinal center B cells, B cell and plasmacytoid dendritic cell, T & B & myeloid, B & myeloid, T & myeloid, MHC Class II expressing cell, monocyte, dendritic cell, plasmacytoid dendritic cells, myeloid cell, plasma cell, erythrocyte, neutrophil, low density granulocyte, granulocyte, and platelet. Transcripts are entered into I-Scope™ and the number of transcripts in each category is determined. Odds ratios are calculated with confidence intervals using Fisher's exact test in R.
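By way of non-limiting illustration, the per-category odds ratio and Fisher's exact test described above can be sketched as follows. The disclosure performs this calculation in R; the Python function below, its name `fisher_enrichment`, and the layout of the 2x2 table are assumptions introduced for illustration only. The sketch reports the simple sample odds ratio and a one-sided (enrichment) p-value, and omits the confidence interval.

```python
from math import comb


def fisher_enrichment(a, b, c, d):
    """One-sided Fisher's exact test on a 2x2 table (hypothetical layout).

    a: input transcripts that fall in the cell category
    b: input transcripts that do not fall in the category
    c: reference genes in the category that are absent from the input
    d: reference genes in neither the category nor the input
    Returns (sample odds ratio, P(X >= a) under the hypergeometric null).
    """
    n = a + b + c + d
    in_input = a + b
    in_category = a + c
    total_tables = comb(n, in_input)
    upper = min(in_input, in_category)
    # Sum the hypergeometric probabilities of overlaps at least as large as a.
    p_value = sum(
        comb(in_category, k) * comb(n - in_category, in_input - k)
        for k in range(a, upper + 1)
    ) / total_tables
    # Sample odds ratio; infinite when an off-diagonal cell is zero.
    odds_ratio = float("inf") if b * c == 0 else (a * d) / (b * c)
    return odds_ratio, p_value


# Hypothetical counts for a single category.
odds, p = fisher_enrichment(3, 1, 1, 3)
print(f"odds ratio = {odds:.2f}, one-sided p = {p:.4f}")
```

An odds ratio above 1 with a small p-value would suggest that the category is over-represented among the input transcripts. Note that R's `fisher.test` reports a conditional maximum-likelihood odds ratio, which generally differs slightly from the sample odds ratio used in this sketch.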

The T-Scope™ tool may be configured to help identify types of non-hematopoietic cells in gene expression datasets. T-Scope™ may be configured by downloading approximately 10,000 tissue enriched and 8,000 cell line enriched genes from the human protein atlas along with their tissue or cell line designation (e.g., available at proteinatlas.org). Genes found in more than four tissues are eliminated. Housekeeping genes described in the gene expression study by She et al. are also removed (e.g., as described by She et al., “Definition, conservation and epigenetics of housekeeping and tissue-enriched genes,” BMC Genomics 2009, 10:269, which is incorporated herein by reference in its entirety). This list is further curated by removing genes differentially expressed in 34 hematopoietic cell gene expression datasets and adding kidney specific genes from datasets downloaded from the GEO repository and processed by Ampel BioSolutions. The resulting categories of genes represent genes enriched in the following 42 tissue/cell specific categories: adrenal gland, breast, cartilage, cerebral cortex, uterine cervix, chondrocyte, colon, duodenum, endometrium, epididymis, esophagus fallopian tube, esophagus, fibroblast, heart muscle, keratinocyte, kidney, liver, lung, melanocyte, ovary, pancreas, parathyroid gland, placenta, podocyte, prostate, rectum, salivary gland, seminal vesicle, skeletal muscle, skin, small intestine, smooth muscle, stomach, synoviocyte, testis, kidney loop of henle, kidney proximal tubule, kidney distal tubule, and kidney collecting duct.
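By way of non-limiting illustration, the first two curation steps described above (eliminating genes found in more than four tissues and removing housekeeping genes) can be sketched as a simple filter. The function name `curate_tissue_genes`, the dictionary layout, and the gene and tissue labels are hypothetical and are not taken from the disclosure.

```python
def curate_tissue_genes(gene_tissues, housekeeping, max_tissues=4):
    """Keep genes enriched in at most `max_tissues` tissues and not housekeeping.

    gene_tissues: dict mapping gene symbol -> set of tissues where it is enriched
    housekeeping: set of gene symbols to exclude
    """
    return {
        gene: tissues
        for gene, tissues in gene_tissues.items()
        if len(tissues) <= max_tissues and gene not in housekeeping
    }


# Hypothetical inputs for illustration only.
gene_tissues = {
    "GENE_A": {"kidney"},
    "GENE_B": {"liver", "lung", "colon", "skin", "testis"},  # five tissues: dropped
    "GENE_C": {"heart muscle", "skeletal muscle"},
    "GENE_D": {"stomach"},  # listed as housekeeping below: dropped
}
housekeeping = {"GENE_D"}

curated = curate_tissue_genes(gene_tissues, housekeeping)
print(sorted(curated))
```

The remaining curation steps (removing genes differentially expressed in hematopoietic datasets and adding kidney-specific genes) would be further set operations on the same dictionary.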

The CellScan tool may be a combination of I-Scope™ and T-Scope™, and may be configured to analyse tissues with suspected immune infiltrations that should also have tissue specific genes. CellScan may potentially be more stringent than either I-Scope™ or T-Scope™ because it may be used to distinguish resident tissue cells from non-resident hematopoietic cells.

The MS (Molecular Signature) Scoring tool may be configured to assess specific pathways in a disease state. Information on genes that encode for proteins that participate in a specific signaling pathway, and whether the gene product promotes or inhibits the pathway, is compiled and curated through literature mining. Curated pathways presented by the company include CD40-CD40ligand, IL-6, IL-12/23, TNF, IL-17, IL-21, S1P1, IL-13 and PDE4, but this method may be used for any known signaling pathway with available data. To determine if a signaling pathway is over- or under-expressed in a microarray dataset, the gene list for each signaling pathway may be queried against the limma differentially expressed genes from a disease state compared to healthy controls, and the differentially expressed genes in the signaling pathway may be identified for each set. The fold changes for genes that promote the pathway may be added together and the fold changes for genes that inhibit the pathway may be subtracted from the score. This total score may be normalized based on the number of genes that could be detected on the specific microarray platform used for the experiment. Activation scores of −100 to +100 may be determined using this method, with negative scores indicating an inhibition of the specific pathway in the disease state and positive scores indicating an up-regulation of a specific pathway in the disease state. Fisher's exact test may be performed to determine if there is sufficient overlap of genes between the experimental differentially expressed genes and the genes in the signaling pathway.
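By way of non-limiting illustration, the fold-change arithmetic described above can be sketched as follows. The function name `pathway_activation` and the gene labels are hypothetical, and the sketch assumes that normalization means dividing by the number of pathway genes detectable on the platform; the disclosure does not spell out the exact normalization or how the result is mapped onto the −100 to +100 range.

```python
def pathway_activation(fold_changes, promoters, inhibitors, n_detectable):
    """Sum promoter fold changes, subtract inhibitor fold changes, and normalize.

    fold_changes: dict gene -> fold change for differentially expressed genes
    promoters / inhibitors: gene lists curated for the pathway
    n_detectable: number of pathway genes detectable on the platform (assumed divisor)
    """
    if n_detectable <= 0:
        raise ValueError("n_detectable must be positive")
    # Genes that are not differentially expressed contribute nothing.
    promoted = sum(fold_changes.get(g, 0.0) for g in promoters)
    inhibited = sum(fold_changes.get(g, 0.0) for g in inhibitors)
    return (promoted - inhibited) / n_detectable


# Hypothetical pathway with two promoters and one inhibitor.
fold_changes = {"GENE_A": 2.0, "GENE_B": 1.0, "GENE_C": 0.5}
score = pathway_activation(fold_changes, ["GENE_A", "GENE_B"], ["GENE_C"], 5)
print(score)
```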

Gene Set Variation Analysis (GSVA) may be performed (for example, as described in Catalina et al., 2019, Communications Biology, “Gene expression analysis delineates the potential roles of multiple interferons in systemic lupus erythematosus”, which is incorporated herein by reference in its entirety) to determine enrichment of signaling pathways in individual patient samples. Gene set variation analysis may be performed using an open source software package for the coding language R available at the R Bioconductor (bioconductor.org), e.g., as described by Hänzelmann et al. (“GSVA: gene set variation analysis for microarray and RNA-Seq data,” BMC Bioinformatics, 2013, which is incorporated herein by reference in its entirety). The modules of genes to interrogate the datasets may be developed. Modules of genes determined to represent a specific signaling pathway or process may be identified (e.g., using publicly available datasets). For example, the IFNB1 signaling pathway is taken from a publicly available gene expression dataset of peripheral blood cells treated with IFNB1 in vitro. Genes co-expressed in this dataset (genes either all increased or decreased compared to control treated peripheral blood) are used to create modules of genes representing the IFNB1 signaling pathway, and GSVA is used to determine the enrichment of this set of genes and hence the IFNB1 signaling pathway in individual patient and control samples.

The CoLTs®, or Combined Lupus Treatment Scoring, may be configured to rank identified drugs or therapies by a number of essential characteristics, including scientific rationale, experience in lupus mice/human cells (preclinical), previous clinical experience in autoimmunity, drug properties, and safety profile, including adverse events. Face and test validities may be established by scoring SOC medications and confirming the scores with a panel of lupus clinicians. The final result may be the CoLTs® score. A CoLTs® algorithm may also be configured for drugs in development (DID), which typically do not have drug metabolism and adverse event information available.

The target scoring algorithm may be configured to prioritize a specific gene or protein that is potentially a good choice to target with a drug in lupus patients. It may be utilized even if there is currently no drug available to the target gene or protein. The algorithm may be based on the addition of 18 data based determinations plus the overall scientific rationale and generates scores from −13 (not a good target in SLE) to 27 (very promising target in SLE).

BIG-C™ Big Data Analysis Tool

BIG-C® is a fast and efficient cloud-based tool to functionally categorize gene products. With coverage of over 80% of the genome, BIG-C® leverages publicly available databases such as UniProtKB/Swiss-Prot, GO terms, KEGG pathways, NCBI PubMed and Interactome to place genes into 53 functional categories. The sorting into only one of 53 functional groups allows for a quick and relatively simple understanding of types of genes enriched and co-expressed in a big dataset. This assists in deriving further insights from genes expressed for a given disease state in human or pre-clinical mouse models.

BIG-C® can be used to functionally categorize immunological genes that are not covered in cancer databases such as GO and KEGG (e.g., as described by Grammer et al. 2016, “Drug repositioning in SLE: crowd-sourcing, literature-mining and Big Data analysis,” Lupus, 25(10), 1150-1170, which is incorporated herein by reference in its entirety). Using a knowledge base of over 5000 patients with systemic lupus erythematosus (SLE), over 16432 genes are each placed into one of 53 BIG-C® functional categories, and statistical analysis is performed to identify enriched categories. BIG-C® categories are cross-examined with the GO and KEGG terms to obtain additional information and insights.

A sample BIG-C® workflow may comprise the following steps. First, SLE genomic datasets are derived from whole blood, peripheral blood mononuclear cells, affected tissues, and purified immune cells. Second, datasets are analyzed using DE analysis (as shown by differential expression heatmap in FIG. 72) or Weighted Gene Coexpression Network Analysis (WGCNA) (as shown by the gene coexpression plot in FIG. 73). Third, expressed genes are annotated using publicly available databases (e.g., UniProtKB/Swiss-Prot database, Human Immunodeficiencies database, Mouse MGI database, Entrez Molecular Sequence database, PubMed, and the Human Tissue Atlas). Fourth, signatures are cross-referenced with purified single-cell microarray datasets and RNAseq experiments. Fifth, BIG-C® is leveraged to separate the individual annotated genes into one of 53 functional categories shown in Table 50 (e.g., as described by Labonte et al. 2018, “Identification of alterations in macrophage activation associated with disease activity in systemic lupus erythematosus,” PloS one, 13(12), e0208132, which is incorporated herein by reference in its entirety). Sixth, chi-squared analysis is used to determine enriched categories of interest from overlap p-values. Seventh, enriched categories are cross-examined with GO and KEGG terms to derive key insights for further analysis (as shown by the enriched categories identified (left) and cross-referenced to GO terms (right) in FIG. 74).
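By way of non-limiting illustration, the fifth and sixth steps of the workflow above (assigning each gene to exactly one functional category and testing each category for enrichment with a chi-squared analysis) can be sketched as follows. The function names, the 2x2 table construction, and the gene and category labels are assumptions introduced for illustration; the p-value uses the closed form for one degree of freedom.

```python
from math import erfc, sqrt


def chi2_2x2(a, b, c, d):
    """Pearson chi-squared statistic and p-value (1 degree of freedom) for a 2x2 table."""
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column total must be non-zero")
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # Survival function of the chi-squared distribution with 1 degree of freedom.
    return chi2, erfc(sqrt(chi2 / 2))


def category_enrichment(category_of, de_genes):
    """Test each category for association with the differentially expressed set.

    category_of: dict gene -> its single functional category
    de_genes: set of differentially expressed genes
    """
    universe = set(category_of)
    de = de_genes & universe
    results = {}
    for category in set(category_of.values()):
        members = {g for g, cat in category_of.items() if cat == category}
        a = len(de & members)             # DE and in category
        b = len(de - members)             # DE, other categories
        c = len(members - de)             # in category, not DE
        d = len(universe - de - members)  # neither
        results[category] = chi2_2x2(a, b, c, d)
    return results


# Hypothetical eight-gene example with two categories.
category_of = {f"G{i}": ("CAT_X" if i <= 4 else "CAT_Y") for i in range(1, 9)}
print(category_enrichment(category_of, {"G1", "G2", "G3", "G5"}))
```

The chi-squared statistic is two-sided, so the direction of the effect (over- or under-representation) would have to be read from the observed versus expected counts.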

TABLE 50

BIG-C Categories

Immune Cell Surface | General Cell Surface | Immune Signaling | Intracellular Signaling | MHC Class I | MHC Class II | Secreted Immune | Secreted ECM | Pat. Recog. Receptors
Interferon Gene Sig | PRO-Cell Cycle | Anti-Cell Cycle | PRO Apoptosis | Anti Apoptosis | Unfold Prot. Stress | Proteasome | Autophagy | Ubiquitylation
General Transcript. | Transcript. Factors | Nuc. Horm. Receptors | Chromatin Remodel | DNA Repair | mRNA Translation | mRNA Splicing | MicroRNA Processing | Cytoskeleton
Integrin Pathway | RAS Superfamily | WNT Signaling | Lysosome | Endocytosis | Endosome & Vesicles | Endoplas. Retic. | Oxidative Phosphor. | TCA Cycle
Mito. DNA to RNA | Mito | FA Biosynth | Transporters | Cytoplasm Biochem | Peroxisomes | ROS Protection | Nuclear & Nucleolus | Active RNA
MicroRNA | Melanosome | Unknown | Pseudogenes | Transposon Control | Golgi | Glycolysis | Palmitoylation

I-Scope™ Big Data Analysis Tool

I-Scope™ may be a tool configured for cross-examining the presence and activity of varying types of immune cell infiltrates with observed gene expression patterns. It may take annotated gene expression data and analyze it for hematopoietic cell lineage. I-Scope™ can be used downstream of the BIG-C® (Biologically Informed Gene-Clustering) tool in that it helps to provide even more insight into the nature of the genes being expressed after categorization.

I-Scope™ addresses the need to understand the involvement of specific cells for a given disease state. While it is helpful to understand the relative up-regulation and down-regulation at the gene expression level, it is even more informative to understand specifically in which cells this is occurring. I-Scope™ may be configured to identify hematopoietic cells through an iterative search of more than 17,000 genes identified in more than 50 microarray datasets (e.g., as described by Hubbard et al., “Analysis of Lupus Synovitis Gene Expression Reveals Dysregulation of Pathogenic Pathways Activated within Infiltrating Immune Cells,” Arthritis Rheumatol, 2018; 70 (suppl 10), which is incorporated herein by reference in its entirety). I-Scope™ may function by restricting the analysis to genes of hematopoietic cell heritage and allow for cross-checking against purified single-cell experiments or datasets. The cross-check confirms and categorizes specific transcript signatures to the 28 hematopoietic cell sub-categories shown in Table 51, ultimately allowing for cellular activity analysis across multiple samples and disease states. When combined with BIG-C® categories, the cellular activity can be correlated to specific functions within a given cell type.

TABLE 51

I-Scope ™ Cell Sub-Categories

Monos/Macs
Plasma
T-Cells
B-Cells
Dendritic
T&B Cells
CD8 T

Cells

Myeloid
Tact
LDG
Hematopoietic
Neutrophil
Ag
Granulocytes

Cells

Presentation

Platelets
pDC
T, B, Mono
Langerhans
Bact
Mono and B
Erythrocytes

Mast
T reg
Gd T
T anergic
FDC
CD4T
T/NK/NKT

Cell

Cells

A sample I-Scope™ workflow may comprise the following steps. First, candidate genes potentially associated with immune cell expression are identified from SLE (systemic lupus erythematosus) datasets. Second, using HPA, GTEx, and FANTOM5 datasets, expression signatures associated with hematopoietic cell lineage are identified. Third, signatures are cross-referenced with purified single-cell microarray datasets and RNAseq experiments. Fourth, transcripts are categorized into 28 hematopoietic cell sub-categories and cellular expression is assessed across different samples and disease states. Odds ratios are calculated with confidence intervals using Fisher's exact test in R. FIG. 75 shows an I-Scope™ signature analysis for a given sample, which leads to the I-Scope™ signature analysis across multiple samples and disease states (as shown in FIG. 76).

T-Scope™ Big Data Analysis Tool

The T-Scope™ tool may be configured for cross-examining gene expression signatures of a given sample with a database of non-hematopoietic cell types (e.g., as described by Hubbard et al., “Analysis of Gene Expression from Systemic Lupus Erythematosus Synovium Reveals Unique Pathogenic Mechanisms” [Abstract], Annual Meeting of the American College of Rheumatology; June 2019; Chicago, IL, which is incorporated herein by reference in its entirety). T-Scope™ may comprise a database of 704 transcripts allocated to 45 independent categories, shown in Table 52. Transcripts detected in the sample are matched to one of the cellular categories within the T-Scope™ tool to derive further insights on tissue cell activity. T-Scope™ can be used downstream of the BIG-C® (Biologically Informed Gene-Clustering) tool to understand which tissue cell types are present. In conjunction with I-Scope™ (which provides information related to immune cells), T-Scope™ can be performed to provide a complete view of all possible cell activity in a given sample.

TABLE 52

T-Scope ™ 45 Categories of Tissue Cells

Adipose Tissue | Adrenal Gland | Breast | Cartilage | Cerebral Cortex | Cervix, Uterine | Chondrocyte | Colon | Dendritic
Duodenum | Endometrium | Endothelial | Epididymis | Erythrocytes | Esophagus | Fallopian Tube | Fibroblast | Gallbladder
Heart Muscle | Keratinocyte | Keratinocyte Skin | Kidney | Kidney Distal Tubules | Kidney Loop | Kidney Proximal Tubules | Kidney Tubule Duct | Kidney Tubule
Langerhans | Liver | Lung | Melanocyte | Podocyte | Prostate | Rectum | Salivary Gland | Seminal Vesicle
Skeletal Muscle | Skin | Small Intestine | Smooth Muscle | Stomach | Synoviocyte | Testis | Thyroid Gland | Urinary Bladder

A sample T-Scope™ workflow may comprise the following steps. First, candidate genes are identified from SLE (systemic lupus erythematosus) differential expression datasets potentially associated with tissue cell expression. Second, using publicly available databases, expression signatures associated with potential tissue cell activity are identified. Third, signatures are cross-referenced with microarray, scRNAseq or RNAseq experiments. Fourth, transcripts are categorized into 45 tissue cell sub-categories and cellular expression is assessed across different samples and disease states. FIG. 77 shows results obtained using T-Scope™ in combination with I-Scope™ for identification of cells post-DE-analysis.

CellScan Big Data Analysis Tool

A cloud-based genomic platform may be configured to provide users with access to CellScan™, which comprises a suite of tools for the identification, analysis, and prioritization of targets for drug development and/or repositioning. This platform is powered by a database containing the genomic information gathered from 5000+ autoimmune patients. The cloud-based genomic platform may leverage results from RNAseq and microarray experiments in conjunction with clinical information, such as medication and lab tests, to provide previously undiscovered insights.

CellScan™ may go beyond typical 'omics analysis by performing one or more of the following: functionally categorizing genes and their products (e.g., using BIG-C®); deconvolving gene expression data to identify unique immunological cell types from blood or biopsy samples (e.g., using I-Scope™); identifying tissue-specific cells from biopsy samples (e.g., using T-Scope™); identifying receptor-ligand interactions and subsequent signaling pathways (e.g., using MS-Scoring™); ranking genes and their products for targeting by drugs and miRNA mimetics (e.g., using Target-Scoring™); and prioritizing FDA-approved drugs and drugs-in-development for treatment in patients or pre-clinical models (e.g., using CoLTs®).

CellScan™ applications may include one or more of: Biomarker Discovery, Disease Mechanisms, Drug Mechanism of Action, Drug Mechanism of Toxicity, and Target Identification and Validation. Experimental approaches supported by CellScan™ may include one or more of: lncRNA, Metabolomics, MicroArray, miRNA, mRNA, qPCR, Proteomics, and RNAseq.

Data analysis and interpretation with CellScan™ may build on comprehensive, manually curated content of a knowledge base. Powerful, quick, and efficient tools may be used to perform deep analysis of NGS and miRNA data to identify gene function, immunological and tissue cell type, pathways, and target/drug appropriate for a specific disease state.

CellScan™ features may be configured to optimize or maximize the impact of information that surfaces in an analysis so that interpretation of a dataset is comprehensive and elucidates actionable insights. These features may include one or more of: NGS RNAseq data analysis, biomarker scoring, and prioritizing targets and drugs for human clinical trials and/or pre-clinical models. The NGS RNAseq data analysis may comprise interrogating RNA and miRNA data for function, cell-type (immunological or tissue) and pathways. The biomarker scoring may comprise using a knowledge base and gene expression data to assess and prioritize biomarkers associated with a target disease or phenotype. The target/drug prioritization may comprise leveraging objective scoring of targets and drugs based on parameters such as scientific rationale, evidence in mouse/human cells, prior clinical data, overall drug properties, and the risk of adverse events.

The knowledge base may be a repository created from millions of individual pieces of information gathered about genes, cells, tissues, drugs, and diseases; the information is manually reviewed for accuracy and includes rich contextual details and links to original publications. The knowledge base may enable access to relevant and substantiated knowledge from primary literature as well as public and private databases for comprehensive interpretation of NGS/RNAseq data, elucidating function/pathways and prioritizing targets/drugs for given disease states. Table 53 shows an example list of reference databases for the content in CellScan™, with both human and mouse species-specific identifiers supported.

TABLE 53

Reference Databases for Content in CellScan ™

Affymetrix | Entrez Gene | HPA | scRNAseq
Agilent | FANTOM5 | Illumina | STITCH
BrainArray | GenBank | Interactome | Mouse Genome Database (MGD)
CAS Registry Number | Gene Symbol - human (Hugo/HGNC) | KEGG | UCSC (hg18)
Clinicaltrials.gov | Gene Symbol - mouse (Entrez Gene) | LINCS/CLUE | UCSC (hg19)
CodeLink | GNF Tissue Expression Body Atlas | Mosby's Drug Consult | Unigene
DrugBank | GO terms | NCBI PubMed | Uniprot/Swiss-Prot Accession
Drugs@FDA | Goodman & Gilman's Pharmacological Basis of Therapeutics | NCI-60 Cell Line Expression Atlas |
Ensembl | GTEx | Refseq |

MS (Molecular Signature) Scoring™ Analysis Tool

MS-Scoring™ may be configured to identify receptor-ligand interactions and predict ongoing signaling pathways. In addition, MS-Scoring™ may be used to validate molecular pathways as potential targets for new or repurposed drug therapies. The specificity of next-generation drug therapies requires a way to understand the potential of a given therapy to act on the intended biochemical target. Moreover, a potential application of this is the repositioning of drug therapies that may have the correct biochemical targeting to address multiple clinical needs beyond the initial intended therapeutic value.

MS-Scoring™ may be specifically developed to address gaps in the QIAGEN IPA® (Ingenuity Pathway Analysis) tool, which does not contain many immunologically relevant pathways. Similar to IPA®, MS-Scoring™ 1 may use log-fold change information to score the target and its signaling pathway to verify the viability of the targets. If the genes of a signaling pathway appear to be upregulated or its inhibitors appear to be downregulated, MS-Scoring™ 1 may provide a score of +1. Conversely, if the genes of a signaling pathway appear downregulated or the inhibitors upregulated, MS-Scoring™ 1 may provide a score of −1. A score of zero may be provided if no fold-change is observed. The scores may then be summed and normalized across the entire pathway to yield a final % score between −100 (inhibition) and +100 (up-regulation). Scores of higher absolute magnitude (i.e., scores close to −100 or +100) may indicate a high potential for therapeutic targeting. Fisher's exact test may be performed to determine if there is sufficient overlap of genes between the experimental differentially expressed genes and the genes in the signaling pathway.
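By way of non-limiting illustration, the +1/−1/0 scoring and percentage normalization described above can be sketched as follows. The function name `ms_score_percent`, the role labels, and the gene names are hypothetical, and the sketch assumes that normalization divides the summed score by the total number of genes curated for the pathway.

```python
def ms_score_percent(log_fold_changes, roles):
    """Return a pathway score between -100 and +100.

    log_fold_changes: dict gene -> log fold change (missing or 0 means no change)
    roles: dict gene -> "promoter" or "inhibitor" for every gene in the pathway
    """
    if not roles:
        raise ValueError("the pathway must contain at least one gene")
    total = 0
    for gene, role in roles.items():
        lfc = log_fold_changes.get(gene, 0.0)
        direction = (lfc > 0) - (lfc < 0)  # +1 up, -1 down, 0 unchanged
        # An upregulated promoter or a downregulated inhibitor counts as +1.
        total += direction if role == "promoter" else -direction
    return 100.0 * total / len(roles)


# Hypothetical four-gene pathway.
roles = {
    "GENE_A": "promoter",
    "GENE_B": "promoter",
    "GENE_C": "inhibitor",
    "GENE_D": "promoter",
}
log_fold_changes = {"GENE_A": 1.2, "GENE_B": -0.5, "GENE_C": -2.0}
print(ms_score_percent(log_fold_changes, roles))
```

In this hypothetical case GENE_A contributes +1, GENE_B contributes −1, the downregulated inhibitor GENE_C contributes +1, and GENE_D (no observed change) contributes 0.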

A sample MS-Scoring™ 1 workflow may comprise the following steps. First, potential drugs and pathways are identified by LINCS (Library of Integrated Network-Based Cellular Signatures) as candidates for therapeutic intervention. Second, MS-Scoring™ 1 is used to evaluate individual transcript elements of the target pathway. Third, signatures are cross-referenced with purified single-cell microarray datasets and RNAseq experiments. Fourth, scores are compiled and normalized to provide an overall % score for the pathway and higher absolute magnitude scores indicate a higher potential for therapeutic targeting.

FIG. 78 shows MS-Scoring™ 1 of IL-12 and IL-23 related pathways for targeting using ustekinumab for SLE (systemic lupus erythematosus) drug repositioning (e.g., as described by Grammer et al., 2016, “Drug repositioning in SLE: crowd-sourcing, literature-mining and Big Data analysis,” Lupus, 25(10), 1150-1170, which is incorporated herein by reference in its entirety).

MS-Scoring™ 2 may utilize custom-defined gene modules that represent a signaling pathway or process and is particularly useful for gene expression datasets from microarray or RNAseq. The MS-Scoring™ 2 tool may be configured to take a deeper look at signaling pathways analyzed using the MS-Scoring™ 1. The tool may analyze raw gene expression data and assess enrichment by the Gene Set Variation Analysis (as described herein), which assigns an indexed score to the individual co-expressed pathways between −1 and +1 indicating levels of down-regulation and up-regulation respectively.

A sample MS-Scoring™ 2 workflow may comprise the following steps. First, a signaling pathway of interest is selected from the MS-Scoring™ 2 menu. Second, raw gene expression data are input into the MS-Scoring™ 2 tool. Third, enrichment of signaling pathway(s) is assessed on a patient-by-patient basis. Fourth, the data can then be used to drive insight for the target signaling pathways in individual patient samples.

FIG. 79 shows results from GSVA Analysis on SLE (systemic lupus erythematosus) signaling pathways, e.g., as described by Hänzelmann et al., “GSVA: Gene Set Variation Analysis for Microarray and RNA-Seq Data,” BMC Bioinformatics, vol. 14, no. 1, 2013, p. 7., which is incorporated herein by reference in its entirety.

CoLTs® (Combined Lupus Treatment Scoring) Analysis Tool

A scoring method called CoLTs®, or Combined Lupus Treatment Scoring, may be configured to assess and prioritize the repositioning potential of drug therapies. CoLTs® may rank identified drugs/therapies by a number of essential characteristics, including scientific rationale, experience in lupus mice/human cells (preclinical), previous clinical experience in autoimmunity, drug properties, and safety profile, including adverse events. Face and test validities may be established by scoring standard of care (SOC) medications and confirming the scores with a panel of lupus clinicians. The final result may be the CoLTs® score. A CoLTs® algorithm may also be configured for drugs in development (DID) since they typically do not have drug metabolism and adverse event information available. The algorithms for CoLTs® scoring are shown in Table 54.

TABLE 54

Algorithms for CoLTs® Scoring

| Score Category | CoLTs FDA-Approved Algorithm Points | Question | CoLTs DID Algorithm Points |
| --- | --- | --- | --- |
| Rationale | 0 to +3 | Does the mechanism have a role in lupus pathogenesis? (0) no role in lupus, (+1) possible role in lupus, (+2) likely role in lupus, (+3) demonstrated role in lupus | 0 to +3 |
| Lupus Mice | −1 to +1 | Has the drug been used to treat lupus in mice? (−1) no benefit, (0) not tried/conflicting results, (+1) efficacious in lupus mice | −1 to +1 |
| Lupus Cells in vitro | −1 to +1 | Has the drug been used in in vitro experiments with human cells? (−1) no benefit, (0) not studied/conflicting results, (+1) reduced lupus abnormalities in vitro with lupus derived cells | −1 to +1 |
| Lupus Abnormalities | −1 to +1 | Is the target of the drug abnormal in lupus? (−1) studied but not present, (0) not studied/conflicting results, (+1) drug target is active and/or present in lupus | −1 to +1 |
| Drug Clinical Experience in Autoimmunity | −1 to +1 | Has the drug been used to treat autoimmune disease? (−1) tried but no benefit, (0) not tried, (+1) trial or case report with benefit | −1 to +1 |
| Drug Clinical Experience in Lupus | −1 to +1 | Has the drug been used to treat lupus? (−1) tried but no benefit (failed primary endpoint in Phase 2b), (0) not tried/ongoing/failed primary Phase 2b endpoint with some positive result, (+1) trial with benefit in Phase 2b clinical trials | −1 to +1 |
| Drug Properties | −3 to +3 | Does the drug interact with current SLE drugs? (−1) if it interacts with corticosteroids, NSAIDs, MMF, MTX, AZA, statins, chloroquines, cyclophosphamide, or ACE inhibitors. Is binding reversible? (−1) covalent inhibition, (0) noncovalent inhibition. How is the drug administered? (+1) SC, (0) IV. How frequently is the drug administered? (+1) by mouth once daily, (0) more than one time per day. Is the drug a human/humanized antibody? (+1) human/humanized, (0) not/chimeric. Is this drug specific? (+1) one target/specific, (0) effective but not targeted, i.e., downstream, (−1) many targets/nonspecific | N/A |
| Induces Lupus | −1 to 0 | Does the drug induce lupus? (−1) induces lupus, (0) no reports of drug-induced lupus | N/A |
| Drug Metabolism | −2 to 0 | Is the drug metabolized using p450 and/or through the kidneys? (−2) if p450 and kidney excretion >20%, (−1) if p450 issues or kidney excretion >20%, (0) if neither | N/A |
| Adverse Events | −5 to 0 | Reported adverse events and Black Box Warnings from Medscape and DailyMed for each drug are compared to the 150 scored adverse events (each event is scored from −1 to −5 based upon severity). The individual adverse event scores for each drug are summed to create the tox score, which is multiplied by the number of adverse events to create the tox product. The tox product is then normalized to produce a score ranging from −5 to 0 | N/A |
| Range | −16 to +11 | | −5 to +8 |

CoLTs® may be configured to perform objective scoring of drug molecules based on a hypothesis-based literature search of publicly available databases. The tool can rank drug molecules from both FDA-approved and non-approved classes based upon parameters such as scientific rationale, evidence in mouse/human cells, prior clinical data, overall drug properties, and the risk of adverse events. The parameters are used within five independent drug therapy categories: small molecules, biologics, complementary and alternative therapies, and drugs in development.

CoLTs® may address the need for a systematic and objective way to evaluate the potential of drug therapies to be repositioned for treatment of autoimmune diseases, initially within SLE (systemic lupus erythematosus). The composite score may embody all the accessible information in literature databases, inclusive of efficacy and adverse reactions, to assist in the prioritization of drug development. While the composite score takes into account many aspects of a drug, it may weigh the risk of adverse events heavily, and it ranges from −16 to +11. CoLT Scoring® may be validated through repeated scoring of 215 potential therapies using a total of over 5000 reference data points as well as by clinicians specializing in the field of rheumatology. Specifically, the CoLTs® prediction of Stelara/Ustekinumab as a top priority biologic for lupus drug repositioning is validated by a successful Phase 2 clinical trial (e.g., as described by Vollenhoven et al., “Efficacy and Safety of Ustekinumab, an IL-12 and IL-23 Inhibitor, in Patients with Active Systemic Lupus Erythematosus: Results of a Multicentre, Double-Blind, Phase 2, Randomised, Controlled Study,” The Lancet, vol. 392, no. 10155, 2018, pp. 1330-1339, which is incorporated herein by reference in its entirety). CoLTs® may be calibrated on SOC (standard of care) therapies for the individual autoimmune disease being assessed.
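As an illustration of how the composite score follows from Table 54, the sketch below sums per-category points after checking each against its allowed range. The point ranges are transcribed from Table 54. The dictionary keys, function names, and example drug scores are hypothetical, and simple addition of category points is an assumption consistent with the stated overall ranges.

```python
# Allowed point ranges per scoring category, transcribed from Table 54.
FDA_APPROVED_RANGES = {
    "rationale": (0, 3),
    "lupus_mice": (-1, 1),
    "lupus_cells_in_vitro": (-1, 1),
    "lupus_abnormalities": (-1, 1),
    "clinical_experience_autoimmunity": (-1, 1),
    "clinical_experience_lupus": (-1, 1),
    "drug_properties": (-3, 3),
    "induces_lupus": (-1, 0),
    "drug_metabolism": (-2, 0),
    "adverse_events": (-5, 0),
}

# Drugs in development (DID) are scored only on the first six categories;
# the remaining four are marked N/A in Table 54.
DID_RANGES = {
    name: FDA_APPROVED_RANGES[name]
    for name in (
        "rationale",
        "lupus_mice",
        "lupus_cells_in_vitro",
        "lupus_abnormalities",
        "clinical_experience_autoimmunity",
        "clinical_experience_lupus",
    )
}


def score_range(ranges):
    """Return the (minimum, maximum) attainable composite score."""
    lowest = sum(low for low, _ in ranges.values())
    highest = sum(high for _, high in ranges.values())
    return lowest, highest


def colts_score(category_scores, ranges):
    """Sum category scores after checking each against its allowed range."""
    if set(category_scores) != set(ranges):
        raise ValueError("scores must cover exactly the scored categories")
    for name, (low, high) in ranges.items():
        value = category_scores[name]
        if not low <= value <= high:
            raise ValueError(f"{name} score {value} is outside [{low}, {high}]")
    return sum(category_scores.values())


# Hypothetical FDA-approved drug, for illustration only.
example_drug = {
    "rationale": 3,
    "lupus_mice": 1,
    "lupus_cells_in_vitro": 1,
    "lupus_abnormalities": 1,
    "clinical_experience_autoimmunity": 1,
    "clinical_experience_lupus": 0,
    "drug_properties": 2,
    "induces_lupus": 0,
    "drug_metabolism": -1,
    "adverse_events": -2,
}
example_total = colts_score(example_drug, FDA_APPROVED_RANGES)
```

Summing the range endpoints reproduces the overall ranges given in Table 54: −16 to +11 for FDA-approved drugs and −5 to +8 for drugs in development.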

Within the ten major categories, rationale ranges from 0 to +3, mouse/human in vitro experience ranges from −1 to +1, clinical properties are on a scale of −3 to +3, the adverse effect of inducing lupus ranges from −1 to 0, metabolic properties range from −2 to 0, and finally adverse events (such as toxicity, infection, carcinogenicity, etc.) are given a score of −5 to 0 (e.g., as described by Grammer et al., 2016, “Drug repositioning in SLE: crowd-sourcing, literature-mining and Big Data analysis,” Lupus, 25(10), 1150-1170, which is incorporated herein by reference in its entirety). FIG. 80 shows the CoLT Scoring® of SOC Therapies in Lupus (Belimumab, HCQ, and Rituximab).
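The adverse events score described in Table 54 (tox score, tox product, then normalization to the range −5 to 0) might be computed along the following lines. The disclosure does not state how the tox product is normalized, so the sketch assumes linear scaling against a reference worst-case tox product, capped at −5. The parameter worst_tox_product and the example values are hypothetical.

```python
def tox_product(event_severities):
    """Tox score (sum of severities) multiplied by the number of events.

    Each adverse event severity is scored from -1 (mild) to -5 (severe).
    """
    for severity in event_severities:
        if not -5 <= severity <= -1:
            raise ValueError(f"severity {severity} is outside [-5, -1]")
    tox_score = sum(event_severities)
    return tox_score * len(event_severities)


def normalized_adverse_event_score(event_severities, worst_tox_product):
    """Scale the tox product linearly into the range -5 to 0.

    Assumption: worst_tox_product is a negative reference value (for example,
    the most negative tox product among the scored drugs) that maps to -5.
    """
    if worst_tox_product >= 0:
        raise ValueError("worst_tox_product must be negative")
    if not event_severities:
        return 0.0  # no reported adverse events
    fraction = min(1.0, tox_product(event_severities) / worst_tox_product)
    return -5.0 * fraction


# Hypothetical drug with three adverse events of increasing severity.
example_product = tox_product([-1, -3, -5])
example_score = normalized_adverse_event_score([-1, -3, -5], worst_tox_product=-270)
```

Multiplying the summed severities by the event count penalizes drugs that have many adverse events more heavily than the sum alone would.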

Target Scoring Analysis Tool

The Target scoring algorithm may be configured to prioritize a specific gene or protein that would potentially be a good choice to target with a drug in lupus patients. It may be utilized even if there is currently no drug available for the target gene or protein. The algorithm may be based on the addition of 18 data-based determinations plus the overall scientific rationale, and generates scores from −13 (not a good target in SLE) to +27 (very promising target in SLE). The scoring system is shown in Table 55.

TABLE 55

Target Scoring Algorithm

| Scoring Category | Points | Question |
| --- | --- | --- |
| Genetically Alt Mice | −1 to +3 | Has the gene been studied in genetically altered mice? (−1 to +3): (−1) not viable, (0) no mouse, (+1) immunological phenotype, (+2) immunological phenotype with autoimmunity, (+3) immunological phenotype with lupus |
| Human Deletion | 0 to +2 | Is the gene associated with a human genetic deficiency? (0 to +2): (0) none, (+2) immunological/inflammatory/immunodeficiency disease |
| Lupus Mouse Express | −1 to +1 | Do lupus mice have mRNA or protein expression? (−1 to +1) |
| Gene Cross Lupus Mice | −1 to +1 | Lupus mice genetics (cross into lupus strain) (−1 to +1): (−1) no known genetic component or makes lupus mouse worse, (0) no involvement or no impact, (+1) known genetic component or genetic manipulation makes lupus mouse better |
| Assoc Func PW in Mice | −1 to +1 | Does the gene associate with a functional pathway known to be abnormal in lupus mice? (−1 to +1) |
| Assoc Func PW in Humans | −1 to +1 | Does the gene associate with a functional pathway known to be abnormal in human SLE? (−1 to +1) |
| GWAS | 0 to +1 | Identified as associated with lupus by GWAS or deep sequencing: (0 = no, 1 = yes) |
| Gene Methylation | −1 to +1 | Identified as associated with lupus by methylation data (0 = no, 1 = yes) |
| In vitro data | −1 to +1 | Gene is implicated in in vitro experiments with lupus cells (−1 to +1) |
| Change after human SOC | −1 to +1 | Protein or mRNA (or pathway) changed in lupus mouse by treating with a drug? (−1 to +1) |
| Drug target in lupus mouse | −1 to +1 | Protein or mRNA (or pathway) changed in lupus mouse by treating with a drug? (−1 to +1) |
| CLUE | −1 to +1 | Does CLUE analysis support the pathway as potentially involved in lupus? (−1 to +1) |
| Lupus Biomarker | 0 to +1 | Can the target be used as a biomarker in lupus? (0 = no, 1 = yes) |
| Redundancy | −3 to +1 | Is the target non-redundant, with no multiple ligand-receptor interactions? (−3 to +1) |
| WGCNA | 0 to +3 | Is the gene associated with one disease parameter (+1), two (+2), or three (+3)? |
| Tissue Consensus | 0 to +2 | Is the gene overexpressed in SLE tissues? (0) none, (+1) one tissue, (+2) two or more SLE tissues |
| Upstream Regulator | 0 to +1 | Is the gene an UPR in IPA with significant z-score (>3)? (0 = no, +1 = yes) |
| Hematopoietic Restricted | 0 to +1 | Is the gene hematopoietically restricted? (0 = no, +1 = yes) |
| Biologic Rationale | 0 to +3 | Rationale/Mechanism: no role (or no information) to demonstrated role in lupus pathogenesis (0 to +3) |
| Target Data Score | −13 to +27 | |

Target-Scoring™ may be configured to assess and prioritize the potential of molecular targets for further development of drug therapies. The Target-Scoring™ tool is very similar to CoLTs® except that it approaches the need for new SLE therapies from a different angle. Target Scoring may be configured to perform an objective assessment of molecular targets for the development of new or repurposed drug therapies. Like CoLTs®, it also derives data from a hypothesis-based literature search and generates a composite score based on the publicly available information. Leveraging the composite score, researchers can better prioritize the development of novel drug therapies addressing the assessed targets of interest.

Target-Scoring™ may utilize 19 different scoring categories (as shown by the Target-Scoring categories and point values in FIG. 81) to derive a composite score that ranges from −13 to +27 for the suitability of a gene target for SLE therapy development. Target-Scoring™ may be validated through repeated scoring of potential therapies as well as by clinicians (e.g., clinicians specializing in the field of immunology).
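A minimal sketch of ranking candidate targets by composite score is shown below. The 19 point ranges are transcribed from Table 55. The dictionary keys, the treatment of unscored categories as contributing 0, and the candidate targets are hypothetical assumptions made for illustration.

```python
# Allowed point ranges for the 19 Target-Scoring categories (Table 55).
TARGET_RANGES = {
    "genetically_altered_mice": (-1, 3),
    "human_deletion": (0, 2),
    "lupus_mouse_expression": (-1, 1),
    "gene_cross_lupus_mice": (-1, 1),
    "assoc_func_pathway_mice": (-1, 1),
    "assoc_func_pathway_humans": (-1, 1),
    "gwas": (0, 1),
    "gene_methylation": (-1, 1),
    "in_vitro_data": (-1, 1),
    "change_after_human_soc": (-1, 1),
    "drug_target_in_lupus_mouse": (-1, 1),
    "clue": (-1, 1),
    "lupus_biomarker": (0, 1),
    "redundancy": (-3, 1),
    "wgcna": (0, 3),
    "tissue_consensus": (0, 2),
    "upstream_regulator": (0, 1),
    "hematopoietic_restricted": (0, 1),
    "biologic_rationale": (0, 3),
}


def target_score(scores):
    """Sum category points for one target, validating each against Table 55."""
    unknown = set(scores) - set(TARGET_RANGES)
    if unknown:
        raise ValueError(f"unknown categories: {sorted(unknown)}")
    total = 0
    for name, (low, high) in TARGET_RANGES.items():
        value = scores.get(name, 0)  # assumption: unscored categories add 0
        if not low <= value <= high:
            raise ValueError(f"{name} score {value} is outside [{low}, {high}]")
        total += value
    return total


# Hypothetical candidate targets with partial scoring, for illustration only.
candidates = {
    "target_X": {"genetically_altered_mice": 3, "gwas": 1,
                 "redundancy": -3, "biologic_rationale": 2},
    "target_Y": {"genetically_altered_mice": 1, "wgcna": 2,
                 "tissue_consensus": 2, "biologic_rationale": 3},
}
ranked_targets = sorted(candidates, key=lambda name: target_score(candidates[name]),
                        reverse=True)
```

Because every category range includes 0, treating an unscored category as 0 keeps the total within the −13 to +27 range.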

Classifiers

Feature sets may be generated from datasets obtained using one or more assays of a biological sample obtained or derived from a subject, and a trained algorithm may be used to process one or more of the feature sets to identify or assess a condition (e.g., a disease or disorder, such as a lupus condition) of a subject. For example, the trained algorithm may be used to apply a machine learning classifier to a plurality of condition-associated genomic loci that are associated with two or more classes of individuals inputted into a machine learning model, in order to classify a subject into one of the two or more classes of individuals. For example, the trained algorithm may be used to apply a machine learning classifier to a plurality of condition-associated genomic loci that are associated with individuals with known conditions (e.g., a disease or disorder, such as a lupus condition) and individuals not having the condition (e.g., healthy individuals, or individuals who do not have a lupus condition), in order to classify a subject as having the condition (e.g., positive test outcome) or not having the condition (e.g., negative test outcome).

The trained algorithm may comprise a classifier configured to accept as input a plurality of input variables or features (e.g., condition-associated genomic loci) and to produce or output one or more output values based on the plurality of input variables or features (e.g., condition-associated genomic loci). The plurality of input variables or features may comprise one or more datasets indicative of the presence (e.g., positive test result) or absence (e.g., negative test result) of one or more conditions (e.g., a disease or disorder, such as a lupus condition). For example, an input variable or feature may comprise a number of sequences corresponding to or aligning to each of the plurality of condition-associated genomic loci.

For example, the disease or disorder may comprise one or more of: systemic lupus erythematosus (SLE), discoid lupus erythematosus (DLE), and lupus nephritis (LN). As another example, the symptoms may include one or more of alopecia, anti-dsDNA seropositivity, arthritis, fever, hematuria, leukopenia, low serum complement, mucosal ulcer, myositis, pericarditis, pleurisy, proteinuria, pyuria, rash, thrombocytopenia, urinary cast, vasculitis, visual disturbance, or a combination thereof. As another example, the prescribed medications or drugs may include one or more of: antimalarials, corticosteroids, immunosuppressants, and nonsteroidal anti-inflammatory drugs (NSAIDs).

The classifier may be configured to classify samples by assigning an output value of “indeterminate” or 2 if the sample is not classified as “positive”, “negative”, 1, or 0. In this case, a set of two cutoff values is used to classify samples into one of the three possible output values or classes of individuals (e.g., corresponding to outcome groups of individuals having “low risk,” “intermediate risk,” and “high risk” of having one or more conditions, such as a disease or disorder). Examples of sets of cutoff values may include {1%, 99%}, {2%, 98%}, {5%, 95%}, {10%, 90%}, {15%, 85%}, {20%, 80%}, {25%, 75%}, {30%, 70%}, {35%, 65%}, {40%, 60%}, and {45%, 55%}. Similarly, sets of n cutoff values may be used to classify samples into one of n+1 possible output values or classes of individuals, where n is any positive integer.
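The use of n cutoff values to produce n+1 output classes can be sketched as follows. The sketch assumes the classifier outputs a probability between 0 and 1 and that a value exactly equal to a cutoff falls into the higher class; neither convention is specified in the disclosure, and the function name is hypothetical.

```python
from bisect import bisect_right


def classify_by_cutoffs(probability, cutoffs, labels):
    """Map a classifier output to one of len(cutoffs) + 1 classes.

    cutoffs must be sorted in ascending order; labels are ordered from the
    lowest-probability class to the highest-probability class.
    """
    if list(cutoffs) != sorted(cutoffs):
        raise ValueError("cutoffs must be in ascending order")
    if len(labels) != len(cutoffs) + 1:
        raise ValueError("n cutoffs require exactly n + 1 labels")
    # bisect_right counts how many cutoffs are <= probability.
    return labels[bisect_right(cutoffs, probability)]


# Two cutoffs {5%, 95%} give three classes.
three_class_labels = ["negative", "indeterminate", "positive"]
low = classify_by_cutoffs(0.01, [0.05, 0.95], three_class_labels)
middle = classify_by_cutoffs(0.50, [0.05, 0.95], three_class_labels)
high = classify_by_cutoffs(0.99, [0.05, 0.95], three_class_labels)
```

The same function handles any positive number of cutoffs, for example three cutoffs with four risk labels.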

After the trained algorithm is initially trained, a subset of the inputs may be identified as most influential or most important to be included for making high-quality classifications (e.g., having highest permutation feature importance). For example, a subset of the panel of condition-associated genomic loci may be identified as most influential or most important to be included for making high-quality classifications or identifications of conditions (or sub-types of conditions). The panel of condition-associated genomic loci, or a subset thereof, may be ranked based on classification metrics indicative of the influence or importance of each individual condition-associated genomic locus toward making high-quality classifications or identifications of conditions (or sub-types of conditions). Such metrics may be used to reduce, in some cases significantly, the number of input variables (e.g., predictor variables) that may be used to train the one or more classifiers of the trained algorithm to a desired performance level (e.g., based on a desired minimum accuracy, PPV, NPV, clinical sensitivity, clinical specificity, AUC, or a combination thereof).

The subset of the plurality of input variables (e.g., the panel of condition-associated genomic loci) to the classifier of the trained algorithm may be selected by rank-ordering the entire plurality of input variables and selecting a predetermined number (e.g., no more than about 5, no more than about 10, no more than about 15, no more than about 20, no more than about 25, no more than about 30, no more than about 35, no more than about 40, no more than about 45, no more than about 50, or no more than about 100) of input variables with the best classification metrics (e.g., permutation feature importance).
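The rank-ordering and selection step can be sketched as below, assuming a classification metric (e.g., permutation feature importance) has already been computed for each input variable. The locus names and importance values are hypothetical.

```python
def select_top_features(importances, max_features):
    """Rank input variables by importance and keep at most max_features.

    importances maps each input variable name to a precomputed classification
    metric, where a larger value indicates a more influential variable.
    """
    if max_features < 1:
        raise ValueError("max_features must be at least 1")
    ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:max_features]]


# Hypothetical importance values for a small panel of loci.
panel_importances = {
    "locus_1": 0.02,
    "locus_2": 0.31,
    "locus_3": 0.11,
    "locus_4": 0.00,
    "locus_5": 0.19,
}
selected = select_top_features(panel_importances, max_features=3)
```

A classifier retrained on only the selected variables can then be compared against the desired performance level before the reduced panel is adopted.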

The feature sets (e.g., comprising quantitative measures of a panel of condition-associated genomic loci) may be analyzed and assessed (e.g., using a trained algorithm comprising one or more classifiers) over a duration of time to monitor a patient (e.g., subject who has a condition or who is being treated for a condition). In such cases, the feature sets of the patient may change during the course of treatment. For example, the quantitative measures of the feature sets of a patient with decreasing risk of the condition due to an effective treatment may shift toward the profile or distribution of a healthy subject (e.g., a subject without the condition). Conversely, for example, the quantitative measures of the feature sets of a patient with increasing risk of the condition due to an ineffective treatment may shift toward the profile or distribution of a subject with higher risk of the condition or a more advanced stage or severity of the condition.

The condition of the subject may be monitored by monitoring a course of treatment for treating the condition of the subject. The monitoring may comprise assessing the condition of the subject at two or more time points. The assessing may be based at least on the feature sets (e.g., quantitative measures of a panel of condition-associated genomic loci) determined at each of the two or more time points. The therapeutic intervention may include prescribed medications or drugs, which may include one or more of: antimalarials, corticosteroids, immunosuppressants, and nonsteroidal anti-inflammatory drugs (NSAIDs). The therapeutic intervention may be effective to alleviate or decrease one or more symptoms, which may include one or more of: alopecia, anti-dsDNA seropositivity, arthritis, fever, hematuria, leukopenia, low serum complement, mucosal ulcer, myositis, pericarditis, pleurisy, proteinuria, pyuria, rash, thrombocytopenia, urinary cast, vasculitis, visual disturbance, or a combination thereof. The assessing may be based at least on the presence, absence, or severity of one or more symptoms, such as alopecia, anti-dsDNA seropositivity, arthritis, fever, hematuria, leukopenia, low serum complement, mucosal ulcer, myositis, pericarditis, pleurisy, proteinuria, pyuria, rash, thrombocytopenia, urinary cast, vasculitis, visual disturbance, or a combination thereof.

In some embodiments, a difference in the feature sets (e.g., quantitative measures of a panel of condition-associated genomic loci) determined between the two or more time points may be indicative of one or more clinical indications, such as (i) a diagnosis of the condition of the subject, (ii) a prognosis of the condition of the subject, (iii) an increased risk of the condition of the subject, (iv) a decreased risk of the condition of the subject, (v) an efficacy of the course of treatment for treating the condition of the subject, and (vi) a non-efficacy of the course of treatment for treating the condition of the subject.

In some embodiments, a difference in the feature sets (e.g., quantitative measures of a panel of condition-associated genomic loci) determined between the two or more time points may be indicative of a diagnosis of the condition of the subject. For example, if the condition was not detected in the subject at an earlier time point but was detected in the subject at a later time point, then the difference is indicative of a diagnosis of the condition of the subject. A clinical action or decision may be made based on this indication of diagnosis of the condition of the subject, such as, for example, prescribing a new therapeutic intervention for the subject. The clinical action or decision may comprise recommending the subject for a secondary clinical test to confirm the diagnosis of the condition. This secondary clinical test may comprise an imaging test, a blood test, a computed tomography (CT) scan, a magnetic resonance imaging (MRI) scan, an ultrasound scan, a chest X-ray, a positron emission tomography (PET) scan, a PET-CT scan, or any combination thereof.

In some embodiments, a difference in the feature sets (e.g., quantitative measures of a panel of condition-associated genomic loci) determined between the two or more time points may be indicative of the subject having an increased risk of the condition. For example, if the condition was detected in the subject both at an earlier time point and at a later time point, and if the difference is a negative difference (e.g., the quantitative measures of a panel of condition-associated genomic loci increased from the earlier time point to the later time point), then the difference may be indicative of the subject having an increased risk of the condition. A clinical action or decision may be made based on this indication of the increased risk of the condition, e.g., prescribing a new therapeutic intervention or switching therapeutic interventions (e.g., ending a current treatment and prescribing a new treatment) for the subject. The clinical action or decision may comprise recommending the subject for a secondary clinical test to confirm the increased risk of the condition. This secondary clinical test may comprise an imaging test, a blood test, a computed tomography (CT) scan, a magnetic resonance imaging (MRI) scan, an ultrasound scan, a chest X-ray, a positron emission tomography (PET) scan, a PET-CT scan, or any combination thereof.

In some embodiments, a difference in the feature sets (e.g., quantitative measures of a panel of condition-associated genomic loci) determined between the two or more time points may be indicative of the subject having a decreased risk of the condition. For example, if the condition was detected in the subject both at an earlier time point and at a later time point, and if the difference is a positive difference (e.g., the quantitative measures of a panel of condition-associated genomic loci decreased from the earlier time point to the later time point), then the difference may be indicative of the subject having a decreased risk of the condition. A clinical action or decision may be made based on this indication of the decreased risk of the condition (e.g., continuing or ending a current therapeutic intervention) for the subject. The clinical action or decision may comprise recommending the subject for a secondary clinical test to confirm the decreased risk of the condition. This secondary clinical test may comprise an imaging test, a blood test, a computed tomography (CT) scan, a magnetic resonance imaging (MRI) scan, an ultrasound scan, a chest X-ray, a positron emission tomography (PET) scan, a PET-CT scan, or any combination thereof.

In some embodiments, a difference in the feature sets (e.g., quantitative measures of a panel of condition-associated genomic loci) determined between the two or more time points may be indicative of an efficacy of the course of treatment for treating the condition of the subject. For example, if the condition was detected in the subject at an earlier time point but was not detected in the subject at a later time point, then the difference may be indicative of an efficacy of the course of treatment for treating the condition of the subject. A clinical action or decision may be made based on this indication of the efficacy of the course of treatment for treating the condition of the subject, e.g., continuing or ending a current therapeutic intervention for the subject. The clinical action or decision may comprise recommending the subject for a secondary clinical test to confirm the efficacy of the course of treatment for treating the condition. This secondary clinical test may comprise an imaging test, a blood test, a computed tomography (CT) scan, a magnetic resonance imaging (MRI) scan, an ultrasound scan, a chest X-ray, a positron emission tomography (PET) scan, a PET-CT scan, or any combination thereof.

In some embodiments, a difference in the feature sets (e.g., quantitative measures of a panel of condition-associated genomic loci) determined between the two or more time points may be indicative of a non-efficacy of the course of treatment for treating the condition of the subject. For example, if the condition was detected in the subject both at an earlier time point and at a later time point, and if the difference is a negative or zero difference (e.g., the quantitative measures of a panel of condition-associated genomic loci increased or remained at a constant level from the earlier time point to the later time point), and if an efficacious treatment was indicated at an earlier time point, then the difference may be indicative of a non-efficacy of the course of treatment for treating the condition of the subject. A clinical action or decision may be made based on this indication of the non-efficacy of the course of treatment for treating the condition of the subject, e.g., ending a current therapeutic intervention and/or switching to (e.g., prescribing) a different new therapeutic intervention for the subject. The clinical action or decision may comprise recommending the subject for a secondary clinical test to confirm the non-efficacy of the course of treatment for treating the condition. This secondary clinical test may comprise an imaging test, a blood test, a computed tomography (CT) scan, a magnetic resonance imaging (MRI) scan, an ultrasound scan, a chest X-ray, a positron emission tomography (PET) scan, a PET-CT scan, or any combination thereof.
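The rules described in the preceding paragraphs for interpreting a difference between two time points can be summarized in a short sketch. It assumes, for illustration, that each assessment reduces to a detected/not-detected call plus a single summary quantitative measure of the panel; the class, function, and label names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Assessment:
    condition_detected: bool
    measure: float  # assumption: one summary quantitative measure of the panel


def clinical_indications(earlier, later, treatment_indicated_earlier=False):
    """Return the clinical indications suggested by two assessments."""
    indications = []
    if not earlier.condition_detected and later.condition_detected:
        indications.append("diagnosis")
    elif earlier.condition_detected and not later.condition_detected:
        indications.append("efficacy of treatment")
    elif earlier.condition_detected and later.condition_detected:
        if later.measure > earlier.measure:
            indications.append("increased risk")
        elif later.measure < earlier.measure:
            indications.append("decreased risk")
        # Measures that rose or stayed constant despite an indicated treatment.
        if treatment_indicated_earlier and later.measure >= earlier.measure:
            indications.append("non-efficacy of treatment")
    return indications


worsening = clinical_indications(
    Assessment(True, 2.0), Assessment(True, 3.5), treatment_indicated_earlier=True
)
resolved = clinical_indications(Assessment(True, 2.0), Assessment(False, 0.4))
```

More than one indication may apply to the same pair of assessments, which is why the function returns a list; any resulting clinical action, such as a confirmatory secondary test, remains a separate decision.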

Kits

The present disclosure provides kits for identifying or monitoring a disease or disorder (e.g., a lupus condition) of a subject. A kit may comprise probes for identifying a quantitative measure (e.g., indicative of a presence, absence, or relative amount) of sequences at each of a panel of condition-associated genomic loci in a sample of the subject. A quantitative measure (e.g., indicative of a presence, absence, or relative amount) of sequences at each of a panel of condition-associated genomic loci in the sample may be indicative of the disease or disorder (e.g., a lupus condition) of the subject. The probes may be selective for the sequences at the panel of condition-associated genomic loci in the sample. A kit may comprise instructions for using the probes to process the sample to generate datasets indicative of a quantitative measure (e.g., indicative of a presence, absence, or relative amount) of sequences at each of the panel of condition-associated genomic loci in a sample of the subject.

The probes in the kit may be selective for the sequences at the panel of condition-associated genomic loci in the sample. The probes in the kit may be configured to selectively enrich nucleic acid (e.g., RNA or DNA) molecules corresponding to the panel of condition-associated genomic loci. The probes in the kit may be nucleic acid primers. The probes in the kit may have sequence complementarity with nucleic acid sequences from one or more of the panel of condition-associated genomic loci. The panel of condition-associated genomic loci or genomic regions may comprise at least 2, at least 3, at least 4, at least 5, at least 6, at least 7, at least 8, at least 9, at least 10, at least 11, at least 12, at least 13, at least 14, at least 15, at least 16, at least 17, at least 18, at least 19, at least 20, or more distinct condition-associated genomic loci.

The instructions in the kit may comprise instructions to assay the sample using the probes that are selective for the sequences at the panel of condition-associated genomic loci in the cell-free biological sample. These probes may be nucleic acid molecules (e.g., RNA or DNA) having sequence complementarity with nucleic acid sequences (e.g., RNA or DNA) from one or more of the panel of condition-associated genomic loci. These nucleic acid molecules may be primers or enrichment sequences. The instructions to assay the cell-free biological sample may comprise instructions to perform array hybridization, polymerase chain reaction (PCR), or nucleic acid sequencing (e.g., DNA sequencing or RNA sequencing) to process the sample to generate datasets indicative of a quantitative measure (e.g., indicative of a presence, absence, or relative amount) of sequences at each of the panel of condition-associated genomic loci in the sample. A quantitative measure (e.g., indicative of a presence, absence, or relative amount) of sequences at each of a panel of condition-associated genomic loci in the sample may be indicative of a disease or disorder (e.g., a lupus condition).

The instructions in the kit may comprise instructions to measure and interpret assay readouts, which may be quantified at one or more of the panel of condition-associated genomic loci to generate the datasets indicative of a quantitative measure (e.g., indicative of a presence, absence, or relative amount) of sequences at each of the panel of condition-associated genomic loci in the sample. For example, quantification of array hybridization or polymerase chain reaction (PCR) corresponding to the panel of condition-associated genomic loci may generate the datasets indicative of a quantitative measure (e.g., indicative of a presence, absence, or relative amount) of sequences at each of the panel of condition-associated genomic loci in the sample. Assay readouts may comprise quantitative PCR (qPCR) values, digital PCR (dPCR) values, digital droplet PCR (ddPCR) values, fluorescence values, etc., or normalized values thereof.

Analysis of Single Nucleotide Polymorphisms (SNPs) Associated with Lupus

The present disclosure provides systems and methods to assess an SLE condition of a subject via analysis of data sets based on one or more ancestral groups of the subject. In various aspects, such systems and methods may be used to perform analysis of data sets including, for example, RNA gene expression or transcriptome data, or DNA genomic data.

In an aspect, the present disclosure provides a computer-implemented method for assessing an SLE condition of a subject, comprising: (a) receiving a dataset of a biological sample of the subject, wherein the dataset comprises quantitative measures of gene expression at each of a plurality of SLE-associated genomic loci, wherein the plurality of SLE-associated genomic loci comprises (i) one or more AA-specific single nucleotide polymorphisms (SNPs) if the subject has an African-Ancestry (AA), or (ii) one or more EA-specific SNPs if the subject has a European-Ancestry (EA); (b) processing the dataset to identify one or more differentially expressed (DE) genomic loci among the plurality of SLE-associated genomic loci; and (c) based at least in part on the one or more DE genomic loci identified in (b) and whether the subject has an African-Ancestry (AA) or a European-Ancestry (EA), assessing the SLE condition of the subject.

In another aspect, the present disclosure provides a computer-implemented method for assessing an SLE condition of a subject, comprising: (a) receiving a dataset of a biological sample of the subject, wherein the dataset comprises quantitative measures of gene expression at each of a plurality of systemic lupus erythematosus (SLE)-associated genomic loci, wherein the plurality of SLE-associated genomic loci comprises one or more African-Ancestry (AA)-specific single nucleotide polymorphisms (SNPs); (b) processing the dataset to identify one or more differentially expressed (DE) genomic loci among the plurality of SLE-associated genomic loci; and (c) based at least in part on the one or more DE genomic loci identified in (b) and whether the subject has an African-Ancestry (AA), assessing the SLE condition of the subject.

In another aspect, the present disclosure provides a computer-implemented method for assessing an SLE condition of a subject, comprising: (a) receiving a dataset of a biological sample of the subject, wherein the dataset comprises quantitative measures of gene expression at each of a plurality of systemic lupus erythematosus (SLE)-associated genomic loci, wherein the plurality of SLE-associated genomic loci comprises one or more European-Ancestry (EA)-specific single nucleotide polymorphisms (SNPs); (b) processing the dataset to identify one or more differentially expressed (DE) genomic loci among the plurality of SLE-associated genomic loci; and (c) based at least in part on the one or more DE genomic loci identified in (b) and whether the subject has a European-Ancestry (EA), assessing the SLE condition of the subject.

In some embodiments, the dataset comprises RNA gene expression or transcriptome data, DNA genomic data, or a combination thereof. In some embodiments, the biological sample is selected from the group consisting of: a whole blood (WB) sample, a PBMC sample, a tissue sample, and a cell sample. In some embodiments, assessing the SLE condition of the subject comprises determining a diagnosis of the SLE condition, a prognosis of the SLE condition, a susceptibility of the SLE condition, a treatment for the SLE condition, or an efficacy or non-efficacy of a treatment for the SLE condition.

In another aspect, the present disclosure provides a computer system for assessing an SLE condition of a subject, comprising: a database that is configured to store an African-Ancestry (AA) status of the subject and a dataset of a biological sample of the subject, wherein the dataset comprises quantitative measures of gene expression at each of a plurality of systemic lupus erythematosus (SLE)-associated genomic loci, wherein the plurality of SLE-associated genomic loci comprises one or more African-Ancestry (AA)-specific single nucleotide polymorphisms (SNPs); and one or more computer processors operatively coupled to the database, wherein the one or more computer processors are individually or collectively programmed to: (i) process the dataset to identify one or more differentially expressed (DE) genomic loci among the plurality of SLE-associated genomic loci; and (ii) based at least in part on the one or more DE genomic loci identified in (i) and the AA status of the subject, assess the SLE condition of the subject.

In another aspect, the present disclosure provides a computer system for assessing an SLE condition of a subject, comprising: a database that is configured to store a European-Ancestry (EA) status of the subject and a dataset of a biological sample of the subject, wherein the dataset comprises quantitative measures of gene expression at each of a plurality of systemic lupus erythematosus (SLE)-associated genomic loci, wherein the plurality of SLE-associated genomic loci comprises one or more European-Ancestry (EA)-specific single nucleotide polymorphisms (SNPs); and one or more computer processors operatively coupled to the database, wherein the one or more computer processors are individually or collectively programmed to: (i) process the dataset to identify one or more differentially expressed (DE) genomic loci among the plurality of SLE-associated genomic loci; and (ii) based at least in part on the one or more DE genomic loci identified in (i) and the EA status of the subject, assess the SLE condition of the subject.

In another aspect, the present disclosure provides a non-transitory computer readable medium comprising machine-executable code that, upon execution by one or more computer processors, implements a method for assessing an SLE condition of a subject, comprising: (a) receiving a dataset of a biological sample of the subject, wherein the dataset comprises quantitative measures of gene expression at each of a plurality of SLE-associated genomic loci, wherein the plurality of SLE-associated genomic loci comprises (i) one or more AA-specific single nucleotide polymorphisms (SNPs) if the subject has an African-Ancestry (AA), or (ii) one or more EA-specific SNPs if the subject has a European-Ancestry (EA); (b) processing the dataset to identify one or more differentially expressed (DE) genomic loci among the plurality of SLE-associated genomic loci; and (c) based at least in part on the one or more DE genomic loci identified in (b) and whether the subject has an African-Ancestry (AA) or a European-Ancestry (EA), assessing the SLE condition of the subject.

In another aspect, the present disclosure provides a non-transitory computer readable medium comprising machine-executable code that, upon execution by one or more computer processors, implements a method for assessing an SLE condition of a subject, comprising: (a) receiving a dataset of a biological sample of the subject, wherein the dataset comprises quantitative measures of gene expression at each of a plurality of systemic lupus erythematosus (SLE)-associated genomic loci, wherein the plurality of SLE-associated genomic loci comprises one or more African-Ancestry (AA)-specific single nucleotide polymorphisms (SNPs); (b) processing the dataset to identify one or more differentially expressed (DE) genomic loci among the plurality of SLE-associated genomic loci; and (c) based at least in part on the one or more DE genomic loci identified in (b) and whether the subject has an African-Ancestry (AA), assessing the SLE condition of the subject.

In another aspect, the present disclosure provides a non-transitory computer readable medium comprising machine-executable code that, upon execution by one or more computer processors, implements a method for assessing an SLE condition of a subject, comprising: (a) receiving a dataset of a biological sample of the subject, wherein the dataset comprises quantitative measures of gene expression at each of a plurality of systemic lupus erythematosus (SLE)-associated genomic loci, wherein the plurality of SLE-associated genomic loci comprises one or more European-Ancestry (EA)-specific single nucleotide polymorphisms (SNPs); (b) processing the dataset to identify one or more differentially expressed (DE) genomic loci among the plurality of SLE-associated genomic loci; and (c) based at least in part on the one or more DE genomic loci identified in (b) and whether the subject has a European-Ancestry (EA), assessing the SLE condition of the subject.

Assessment of SLE Conditions

FIG. 96 shows a non-limiting example of a method 9600 to assess an SLE condition of a subject, in accordance with disclosed embodiments. In operation 9602, a dataset of a biological sample of a subject is received. The dataset may comprise quantitative measures of gene expression at each of a plurality of SLE-associated genomic loci. The plurality of SLE-associated genomic loci may comprise (i) SNPs specific to African-Ancestry (AA) if the subject has an African ancestry, or (ii) SNPs specific to European-Ancestry (EA) if the subject has a European ancestry. In operation 9604, the dataset is processed to identify one or more differentially expressed (DE) genomic loci among the plurality of SLE-associated genomic loci. In operation 9606, the SLE condition of the subject is assessed based on the DE genomic loci and whether the subject has an African ancestry or a European ancestry.
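As an illustration only, the three operations of method 9600 can be sketched in Python. The disclosure does not fix how differential expression is determined, so this sketch assumes a simple absolute log2 fold-change threshold against a reference profile. The panel contents, locus names, the function names, and the threshold value are all hypothetical placeholders, and expression values are assumed to be strictly positive.

```python
import math

# Hypothetical ancestry-specific panels; the locus names are placeholders,
# not loci named in the disclosure.
PANELS = {
    "AA": ["locus_A1", "locus_A2", "locus_A3"],
    "EA": ["locus_E1", "locus_E2", "locus_E3"],
}


def identify_de_loci(expression, reference, panel, min_abs_log2_fc=1.0):
    """Operation 9604 (sketch): flag panel loci whose absolute log2 fold
    change relative to the reference meets an assumed threshold."""
    de_loci = []
    for locus in panel:
        if locus not in expression or locus not in reference:
            continue  # skip loci that were not measured
        log2_fc = math.log2(expression[locus] / reference[locus])
        if abs(log2_fc) >= min_abs_log2_fc:
            de_loci.append(locus)
    return de_loci


def assess(expression, reference, ancestry):
    """Operations 9602-9606 (sketch): select the panel by ancestry,
    find DE loci, and return a summary for downstream assessment."""
    panel = PANELS[ancestry]  # ancestry determines which loci are used
    de_loci = identify_de_loci(expression, reference, panel)
    return {
        "ancestry": ancestry,
        "de_loci": de_loci,
        "de_fraction": len(de_loci) / len(panel),
    }
```

In practice the summary returned here would be one input to a trained classifier rather than an assessment in itself.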

In some embodiments, a sample can be taken at a first time point and assayed, and then another sample can be taken at a subsequent time point and assayed. Such methods can be used, for example, for longitudinal monitoring purposes to track the development or progression of a disease or disorder (e.g., an SLE condition). In some embodiments, the progression of a disease can be tracked before treatment, after treatment, or during the course of treatment, to determine the treatment's effectiveness. For example, a method as described herein can be performed on a subject prior to, and after, treatment with an SLE therapy to measure the disease's progression or regression in response to the SLE therapy.

After obtaining a sample from the subject, the sample may be processed to generate datasets indicative of a condition (e.g., an SLE condition) of the subject. For example, a presence, absence, or quantitative assessment of nucleic acid molecules of the sample at a panel of condition-associated (e.g., SLE-associated) genomic loci may be indicative of a condition (e.g., an SLE condition) of the subject. Processing the sample obtained from the subject may comprise (i) subjecting the sample to conditions that are sufficient to isolate, enrich, or extract a plurality of nucleic acid molecules, and (ii) assaying the plurality of nucleic acid molecules to generate the dataset (e.g., microarray data, nucleic acid sequences, or quantitative polymerase chain reaction (qPCR) data). Methods of assaying may include any assay known in the art or described in the literature, for example, a microarray assay, a sequencing assay (e.g., DNA sequencing, RNA sequencing, or RNA-Seq), or a quantitative polymerase chain reaction (qPCR) assay.

The sample may be processed without any nucleic acid extraction. For example, the disease or disorder may be identified or monitored in the subject by using probes configured to selectively enrich nucleic acid (e.g., RNA or DNA) molecules corresponding to a panel of SLE-associated genomic loci. The probes may be nucleic acid primers. The probes may have sequence complementarity with nucleic acid sequences from one or more of the panel of condition-associated (e.g., SLE-associated) genomic loci. The panel of condition-associated genomic loci may comprise at least 2, at least 3, at least 4, at least 5, at least 6, at least 7, at least 8, at least 9, at least 10, at least 11, at least 12, at least 13, at least 14, at least 15, at least 16, at least 17, at least 18, at least 19, at least 20, at least about 25, at least about 30, at least about 35, at least about 40, at least about 45, at least about 50, at least about 55, at least about 60, at least about 65, at least about 70, at least about 75, at least about 80, at least about 85, at least about 90, at least about 95, at least about 100, or more condition-associated genomic loci.

Classifiers

Feature sets may be generated from datasets obtained using one or more assays of a biological sample obtained or derived from a subject, and a trained algorithm may be used to process one or more of the feature sets to identify or assess a condition (e.g., a disease or disorder, such as an SLE condition) of a subject. For example, the trained algorithm may be used to apply a machine learning classifier to a plurality of condition-associated genomic loci that are associated with two or more classes of individuals inputted into a machine learning model, in order to classify a subject into one of the two or more classes of individuals. For example, the trained algorithm may be used to apply a machine learning classifier to a plurality of condition-associated (e.g., SLE-associated) genomic loci that are associated with individuals with known conditions (e.g., a disease or disorder, such as an SLE condition) and individuals not having the condition (e.g., healthy individuals, or individuals who do not have an SLE condition), in order to classify a subject as having the condition (e.g., positive test outcome) or not having the condition (e.g., negative test outcome).

The trained algorithm may be configured to identify the presence (e.g., positive test result) or absence (e.g., negative test result) of one or more conditions (e.g., a disease or disorder, such as an SLE condition) with an accuracy of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than 99%. This accuracy may be achieved for a set of at least about 25, at least about 50, at least about 100, at least about 150, at least about 200, at least about 250, at least about 300, at least about 350, at least about 400, at least about 450, at least about 500, at least about 1,000, or more than about 1,000 independent samples.

The trained algorithm may comprise a classifier configured to accept as input a plurality of input variables or features (e.g., condition-associated (e.g., SLE-associated) genomic loci) and to produce or output one or more output values based on the plurality of input variables or features (e.g., condition-associated genomic loci). The plurality of input variables or features may comprise one or more datasets indicative of the presence (e.g., positive test result) or absence (e.g., negative test result) of one or more conditions (e.g., a disease or disorder, such as an SLE condition). For example, an input variable or feature may comprise a number of sequences corresponding to or aligning to each of the plurality of condition-associated genomic loci.
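For illustration, an input feature of the kind described above (a number of sequences aligning to each condition-associated genomic locus) might be assembled as follows. This is a minimal sketch that assumes each aligned read has already been assigned a locus identifier; the function name and the locus names are hypothetical.

```python
from collections import Counter


def build_feature_vector(aligned_loci, panel):
    """Count reads per panel locus and return the counts in panel order.

    aligned_loci: iterable of locus identifiers, one per aligned read.
    panel: ordered list of condition-associated loci (hypothetical names).
    Reads aligning outside the panel are ignored; unseen loci count as 0.
    """
    counts = Counter(aligned_loci)
    return [counts[locus] for locus in panel]
```

Keeping the counts in a fixed panel order means every sample yields a vector of the same length, which is what most classifiers expect as input.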

The plurality of input variables or features may also include clinical information of a subject, such as health data. For example, the health data of a subject may comprise one or more of: a diagnosis of one or more conditions (e.g., a disease or disorder, such as an SLE condition), a prognosis of one or more conditions (e.g., a disease or disorder, such as an SLE condition), a risk of having one or more conditions (e.g., a disease or disorder, such as an SLE condition), a treatment history of one or more conditions (e.g., a disease or disorder, such as an SLE condition), a history of previous treatment for one or more conditions (e.g., a disease or disorder, such as an SLE condition), a history of prescribed medications, a history of prescribed medical devices, smoking status, age, height, weight, sex, race, ethnicity, nationality, African-Ancestry (AA) status, European-Ancestry (EA) status, and one or more symptoms of the subject.

For example, the disease or disorder may comprise one or more of: systemic lupus erythematosus (SLE), discoid lupus erythematosus (DLE), and lupus nephritis (LN). As another example, the symptoms may include one or more of alopecia, anti-dsDNA seropositivity, arthritis, fever, hematuria, leukopenia, low serum complement, mucosal ulcer, myositis, pericarditis, pleurisy, proteinuria, pyuria, rash, thrombocytopenia, urinary cast, vasculitis, visual disturbance, or a combination thereof. As another example, the prescribed medications or drugs may include one or more of: antimalarials, corticosteroids, immunosuppressants, and nonsteroidal anti-inflammatory drugs (NSAIDs).

The classifier may be configured to classify samples by assigning output values, which may comprise descriptive labels, numerical values, or a combination thereof. Some of the output values may comprise descriptive labels. Such descriptive labels may provide an identification or indication of the presence (e.g., positive test result) or absence (e.g., negative test result) of one or more conditions (e.g., a disease or disorder, such as an SLE condition) of the subject, and may comprise, for example, positive, negative, high-risk, intermediate-risk, low-risk, or indeterminate. Such descriptive labels may provide an identification of a treatment for the one or more conditions of the subject, and may comprise, for example, a therapeutic intervention, a duration of the therapeutic intervention, and/or a dosage of the therapeutic intervention suitable to treat the one or more conditions of the subject. Such descriptive labels may provide an identification of secondary clinical tests that may be appropriate to perform on the subject, and may comprise, for example, an imaging test, a blood test, a computed tomography (CT) scan, a magnetic resonance imaging (MRI) scan, an ultrasound scan, a chest X-ray, a positron emission tomography (PET) scan, a PET-CT scan, or any combination thereof. For example, such descriptive labels may provide a prognosis of the one or more conditions of the subject. As another example, such descriptive labels may provide a relative assessment of the one or more conditions of the subject. Some descriptive labels may be mapped to numerical values, for example, by mapping “positive” to 1 and “negative” to 0.

The classifier may be configured to classify samples by assigning output values that comprise numerical values, such as binary, integer, or continuous values. Such binary output values may comprise, for example, {0, 1}, {positive, negative}, or {high-risk, low-risk}. Such integer output values may comprise, for example, {0, 1, 2}. Such continuous output values may comprise, for example, a probability value of at least 0 and no more than 1. Such continuous output values may comprise, for example, an un-normalized probability value of at least 0. Such continuous output values may indicate a prognosis of the one or more conditions (e.g., a disease or disorder, such as an SLE condition) of the subject. Some numerical values may be mapped to descriptive labels, for example, by mapping 1 to “positive” and 0 to “negative.”

The classifier may be configured to classify samples by assigning output values based on one or more cutoff values. For example, a binary classification of samples may assign an output value of “positive” or 1 if the sample indicates that the subject has at least a 50% probability of having one or more conditions (e.g., a disease or disorder, such as an SLE condition), thereby assigning the subject to a class of individuals receiving a positive test result. As another example, a binary classification of samples may assign an output value of “negative” or 0 if the sample indicates that the subject has less than a 50% probability of having one or more conditions (e.g., a disease or disorder), thereby assigning the subject to a class of individuals receiving a negative test result. In this case, a single cutoff value of 50% is used to classify samples into one of the two possible binary output values or classes of individuals (e.g., those receiving a positive test result and those receiving a negative test result). Examples of single cutoff values may include about 1%, about 2%, about 5%, about 10%, about 15%, about 20%, about 25%, about 30%, about 35%, about 40%, about 45%, about 50%, about 55%, about 60%, about 65%, about 70%, about 75%, about 80%, about 85%, about 90%, about 91%, about 92%, about 93%, about 94%, about 95%, about 96%, about 97%, about 98%, and about 99%.
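The single-cutoff rule described above can be sketched as a short Python function. The function name and the default cutoff are illustrative assumptions; the probability is assumed to be expressed on a 0-to-1 scale, so that a 50% cutoff corresponds to 0.5.

```python
def classify_binary(probability, cutoff=0.5):
    """Return (label, value) for a predicted probability.

    A probability of at least the cutoff maps to ("positive", 1);
    anything below it maps to ("negative", 0).
    """
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    if probability >= cutoff:
        return ("positive", 1)
    return ("negative", 0)
```

Raising the cutoff (for example, to 0.9) makes a positive call more conservative, while lowering it makes a negative call more conservative.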

As another example, the classifier may be configured to classify samples by assigning an output value of “positive” or 1 if the sample indicates that the subject has a probability of having one or more conditions (e.g., a disease or disorder, such as an SLE condition) of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more. The classification of samples may assign an output value of “positive” or 1 if the sample indicates that the subject has a probability of having one or more conditions (e.g., a disease or disorder, such as an SLE condition) of more than about 50%, more than about 55%, more than about 60%, more than about 65%, more than about 70%, more than about 75%, more than about 80%, more than about 85%, more than about 90%, more than about 91%, more than about 92%, more than about 93%, more than about 94%, more than about 95%, more than about 96%, more than about 97%, more than about 98%, or more than about 99%.

The classifier may be configured to classify samples by assigning an output value of “negative” or 0 if the sample indicates that the subject has a probability of having one or more conditions (e.g., a disease or disorder, such as an SLE condition) of less than about 50%, less than about 45%, less than about 40%, less than about 35%, less than about 30%, less than about 25%, less than about 20%, less than about 15%, less than about 10%, less than about 9%, less than about 8%, less than about 7%, less than about 6%, less than about 5%, less than about 4%, less than about 3%, less than about 2%, or less than about 1%. The classification of samples may assign an output value of “negative” or 0 if the sample indicates that the subject has a probability of having one or more conditions (e.g., a disease or disorder, such as an SLE condition) of no more than about 50%, no more than about 45%, no more than about 40%, no more than about 35%, no more than about 30%, no more than about 25%, no more than about 20%, no more than about 15%, no more than about 10%, no more than about 9%, no more than about 8%, no more than about 7%, no more than about 6%, no more than about 5%, no more than about 4%, no more than about 3%, no more than about 2%, or no more than about 1%.

The classifier may be configured to classify samples by assigning an output value of “indeterminate” or 2 if the sample is not classified as “positive”, “negative”, 1, or 0. In this case, a set of two cutoff values is used to classify samples into one of the three possible output values or classes of individuals (e.g., corresponding to outcome groups of individuals having “low risk,” “intermediate risk,” and “high risk” of having one or more conditions, such as a disease or disorder). Examples of sets of cutoff values may include {1%, 99%}, {2%, 98%}, {5%, 95%}, {10%, 90%}, {15%, 85%}, {20%, 80%}, {25%, 75%}, {30%, 70%}, {35%, 65%}, {40%, 60%}, and {45%, 55%}. Similarly, sets of n cutoff values may be used to classify samples into one of n+1 possible output values or classes of individuals, where n is any positive integer.
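The general case of n cutoff values yielding n+1 classes can be sketched with a binary search over the sorted cutoffs. This is an illustrative sketch only: the function name is hypothetical, probabilities are assumed to be on a 0-to-1 scale, and a probability exactly equal to a cutoff is assumed to fall into the higher class (matching the "at least" convention used for positive results).

```python
from bisect import bisect_right


def classify_with_cutoffs(probability, cutoffs, labels):
    """Assign one of len(cutoffs) + 1 labels to a probability.

    cutoffs: ascending list of n cutoff values.
    labels: n + 1 labels, ordered from lowest to highest probability band.
    """
    if len(labels) != len(cutoffs) + 1:
        raise ValueError("need exactly one more label than cutoffs")
    if list(cutoffs) != sorted(cutoffs):
        raise ValueError("cutoffs must be in ascending order")
    # bisect_right counts how many cutoffs are <= probability,
    # which is the index of the band the probability falls in.
    return labels[bisect_right(cutoffs, probability)]
```

With cutoffs of [0.05, 0.95] and labels ["negative", "indeterminate", "positive"], a probability of 0.5 is reported as "indeterminate".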

The trained algorithm may be trained with at least about 5, at least about 10, at least about 15, at least about 20, at least about 25, at least about 30, at least about 35, at least about 40, at least about 45, at least about 50, at least about 100, at least about 150, at least about 200, at least about 250, at least about 300, at least about 350, at least about 400, at least about 450, or at least about 500 independent training samples. The independent training samples may comprise samples associated with presence of the condition and/or samples associated with absence of the condition. The trained algorithm may be trained with no more than about 500, no more than about 450, no more than about 400, no more than about 350, no more than about 300, no more than about 250, no more than about 200, no more than about 150, no more than about 100, or no more than about 50 independent training samples associated with presence of the condition (e.g., a disease or disorder, such as an SLE condition). The trained algorithm may be trained with no more than about 500, no more than about 450, no more than about 400, no more than about 350, no more than about 300, no more than about 250, no more than about 200, no more than about 150, no more than about 100, or no more than about 50 independent training samples associated with absence of the condition (e.g., a disease or disorder, such as an SLE condition). In some embodiments, the sample is independent of samples used to train the trained algorithm.

The trained algorithm may be trained with a first number of independent training samples associated with a presence of the condition (e.g., a disease or disorder, such as an SLE condition) and a second number of independent training samples associated with an absence of the condition (e.g., a disease or disorder, such as an SLE condition). The first number of independent training samples associated with presence of the condition (e.g., a disease or disorder, such as an SLE condition) may be no more than the second number of independent training samples associated with absence of the condition (e.g., a disease or disorder, such as an SLE condition). The first number of independent training samples associated with a presence of the condition (e.g., a disease or disorder) may be equal to the second number of independent training samples associated with an absence of the condition (e.g., a disease or disorder, such as an SLE condition). The first number of independent training samples associated with a presence of the condition (e.g., a disease or disorder, such as an SLE condition) may be greater than the second number of independent training samples associated with an absence of the condition (e.g., a disease or disorder, such as an SLE condition).

The trained algorithm may comprise a classifier configured to identify the presence (e.g., positive test result) or absence (e.g., negative test result) of one or more conditions (e.g., a disease or disorder, such as an SLE condition) with an accuracy of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 81%, at least about 82%, at least about 83%, at least about 84%, at least about 85%, at least about 86%, at least about 87%, at least about 88%, at least about 89%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more; for at least about 5, at least about 10, at least about 15, at least about 20, at least about 25, at least about 30, at least about 35, at least about 40, at least about 45, at least about 50, at least about 100, at least about 150, at least about 200, at least about 250, at least about 300, at least about 350, at least about 400, at least about 450, or at least about 500 independent training samples. The accuracy of identifying the presence (e.g., positive test result) or absence (e.g., negative test result) of the one or more conditions by the trained algorithm may be calculated as the percentage of independent test samples (e.g., subjects known to have the condition or subjects with negative clinical test results for the condition) that are correctly identified or classified as having or not having the condition.

The trained algorithm may comprise a classifier configured to identify one or more conditions (e.g., a disease or disorder, such as an SLE condition) with a positive predictive value (PPV) of at least about 5%, at least about 10%, at least about 15%, at least about 20%, at least about 25%, at least about 30%, at least about 35%, at least about 40%, at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 81%, at least about 82%, at least about 83%, at least about 84%, at least about 85%, at least about 86%, at least about 87%, at least about 88%, at least about 89%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more. The PPV of identifying the condition using the trained algorithm may be calculated as the percentage of samples identified or classified as having the condition that correspond to subjects that truly have the condition.

The trained algorithm may comprise a classifier configured to identify one or more conditions (e.g., a disease or disorder, such as an SLE condition) with a negative predictive value (NPV) of at least about 5%, at least about 10%, at least about 15%, at least about 20%, at least about 25%, at least about 30%, at least about 35%, at least about 40%, at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 81%, at least about 82%, at least about 83%, at least about 84%, at least about 85%, at least about 86%, at least about 87%, at least about 88%, at least about 89%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more. The NPV of identifying the condition using the trained algorithm may be calculated as the percentage of samples identified or classified as not having the condition that correspond to subjects that truly do not have the condition.

The trained algorithm may comprise a classifier configured to identify one or more conditions (e.g., a disease or disorder, such as an SLE condition) with a clinical sensitivity of at least about 5%, at least about 10%, at least about 15%, at least about 20%, at least about 25%, at least about 30%, at least about 35%, at least about 40%, at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 81%, at least about 82%, at least about 83%, at least about 84%, at least about 85%, at least about 86%, at least about 87%, at least about 88%, at least about 89%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, at least about 99.1%, at least about 99.2%, at least about 99.3%, at least about 99.4%, at least about 99.5%, at least about 99.6%, at least about 99.7%, at least about 99.8%, at least about 99.9%, at least about 99.99%, at least about 99.999%, or more. The clinical sensitivity of identifying the condition using the trained algorithm may be calculated as the percentage of independent test samples associated with presence of the condition (e.g., subjects known to have the condition) that are correctly identified or classified as having the condition.

The trained algorithm may comprise a classifier configured to identify one or more conditions (e.g., a disease or disorder, such as an SLE condition) with a clinical specificity of at least about 5%, at least about 10%, at least about 15%, at least about 20%, at least about 25%, at least about 30%, at least about 35%, at least about 40%, at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 81%, at least about 82%, at least about 83%, at least about 84%, at least about 85%, at least about 86%, at least about 87%, at least about 88%, at least about 89%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, at least about 99.1%, at least about 99.2%, at least about 99.3%, at least about 99.4%, at least about 99.5%, at least about 99.6%, at least about 99.7%, at least about 99.8%, at least about 99.9%, at least about 99.99%, at least about 99.999%, or more. The clinical specificity of identifying the condition using the trained algorithm may be calculated as the percentage of independent test samples associated with absence of the condition (e.g., subjects with negative clinical test results for the condition) that are correctly identified or classified as not having the condition.
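The accuracy, PPV, NPV, clinical sensitivity, and clinical specificity defined above all follow from the four counts of a binary confusion matrix. The following Python sketch computes them as fractions between 0 and 1 (multiply by 100 for percentages). The function names are illustrative, labels are assumed to be encoded as 1 for presence and 0 for absence of the condition, and a metric whose denominator is zero is returned as None rather than guessed.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels (1 = present, 0 = absent)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def _ratio(numerator, denominator):
    """Divide, returning None when the metric is undefined."""
    return numerator / denominator if denominator else None


def performance_metrics(y_true, y_pred):
    """Compute the five metrics from true and predicted binary labels."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return {
        "accuracy": _ratio(tp + tn, tp + fp + tn + fn),
        "ppv": _ratio(tp, tp + fp),          # correct among predicted positive
        "npv": _ratio(tn, tn + fn),          # correct among predicted negative
        "sensitivity": _ratio(tp, tp + fn),  # detected among truly positive
        "specificity": _ratio(tn, tn + fp),  # cleared among truly negative
    }
```

Note that PPV and NPV depend on the prevalence of the condition in the test set, whereas sensitivity and specificity do not.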

After the trained algorithm is initially trained, a subset of the inputs may be identified as most influential or most important to be included for making high-quality classifications (e.g., having highest permutation feature importance). For example, a subset of the panel of condition-associated genomic loci may be identified as most influential or most important to be included for making high-quality classifications or identifications of conditions (or sub-types of conditions). The panel of condition-associated genomic loci, or a subset thereof, may be ranked based on classification metrics indicative of the influence or importance of each individual condition-associated genomic locus toward making high-quality classifications or identifications of conditions (or sub-types of conditions). Such metrics may be used to reduce, in some cases significantly, the number of input variables (e.g., predictor variables) that may be used to train the one or more classifiers of the trained algorithm to a desired performance level (e.g., based on a desired minimum accuracy, PPV, NPV, clinical sensitivity, clinical specificity, AUC, or a combination thereof).
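Permutation feature importance, mentioned above as one possible ranking metric, can be sketched in plain Python. The idea is to shuffle one input column at a time and measure how much the classifier's accuracy drops. This sketch assumes an already-trained classifier exposed as a hypothetical `predict(row)` callable and uses accuracy as the score; the function names, the number of repeats, and the seed are illustrative choices.

```python
import random


def accuracy(predict, rows, labels):
    """Fraction of rows for which predict(row) matches the label."""
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(labels)


def permutation_importance(predict, rows, labels, n_repeats=20, seed=0):
    """Mean drop in accuracy when each feature column is shuffled.

    rows: list of feature lists (one list per sample).
    Returns one importance value per feature; larger means more influential.
    """
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    importances = []
    for j in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)  # break the link between feature j and labels
            permuted = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
            drops.append(baseline - accuracy(predict, permuted, labels))
        importances.append(sum(drops) / n_repeats)
    return importances


def rank_features(importances):
    """Feature indices ordered from most to least important."""
    return sorted(range(len(importances)), key=lambda j: -importances[j])
```

Features whose shuffling leaves accuracy unchanged receive an importance of zero and are natural candidates for removal from the panel.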

Upon identifying the subject as having one or more conditions (e.g., a disease or disorder, such as an SLE condition), the subject may be optionally provided with a therapeutic intervention (e.g., prescribing an appropriate course of treatment to treat the one or more conditions of the subject). The therapeutic intervention may comprise a prescription of an effective dose of a drug, a further testing or evaluation of the condition, a further monitoring of the condition, or a combination thereof. If the subject is currently being treated for the condition with a course of treatment, the therapeutic intervention may comprise a subsequent different course of treatment (e.g., to increase treatment efficacy due to non-efficacy of the current course of treatment).

The condition of the subject may be monitored by monitoring a course of treatment for treating the condition of the subject. The monitoring may comprise assessing the condition of the subject at two or more time points. The assessing may be based at least on the feature sets (e.g., quantitative measures of a panel of condition-associated genomic loci) determined at each of the two or more time points. The therapeutic intervention may include prescribed medications or drugs, which may include one or more of: antimalarials, corticosteroids, immunosuppressants, and nonsteroidal anti-inflammatory drugs (NSAIDs). The therapeutic intervention may be effective to alleviate or decrease one or more symptoms, which may include one or more of alopecia, anti-dsDNA seropositivity, arthritis, fever, hematuria, leukopenia, low serum complement, mucosal ulcer, myositis, pericarditis, pleurisy, proteinuria, pyuria, rash, thrombocytopenia, urinary cast, vasculitis, visual disturbance, or a combination thereof. The assessing may be based at least on the presence, absence, or severity of one or more symptoms, such as alopecia, anti-dsDNA seropositivity, arthritis, fever, hematuria, leukopenia, low serum complement, mucosal ulcer, myositis, pericarditis, pleurisy, proteinuria, pyuria, rash, thrombocytopenia, urinary cast, vasculitis, visual disturbance, or a combination thereof.

In some embodiments, a difference in the feature sets (e.g., quantitative measures of a panel of condition-associated genomic loci) determined between the two or more time points may be indicative of one or more clinical indications, such as (i) a diagnosis of the condition of the subject, (ii) a prognosis of the condition of the subject, (iii) an increased risk of the condition of the subject, (iv) a decreased risk of the condition of the subject, (v) an efficacy of the course of treatment for treating the condition of the subject, and (vi) a non-efficacy of the course of treatment for treating the condition of the subject.

In some embodiments, a difference in the feature sets (e.g., quantitative measures of a panel of condition-associated genomic loci) determined between the two or more time points may be indicative of a diagnosis of the condition of the subject. For example, if the condition was not detected in the subject at an earlier time point but was detected in the subject at a later time point, then the difference is indicative of a diagnosis of the condition of the subject. A clinical action or decision may be made based on this indication of diagnosis of the condition of the subject, such as, for example, prescribing a new therapeutic intervention for the subject. The clinical action or decision may comprise recommending the subject for a secondary clinical test to confirm the diagnosis of the condition. This secondary clinical test may comprise an imaging test, a blood test, a computed tomography (CT) scan, a magnetic resonance imaging (MRI) scan, an ultrasound scan, a chest X-ray, a positron emission tomography (PET) scan, a PET-CT scan, or any combination thereof.

In some embodiments, a difference in the feature sets (e.g., quantitative measures of a panel of condition-associated genomic loci) determined between the two or more time points may be indicative of the subject having an increased risk of the condition. For example, if the condition was detected in the subject both at an earlier time point and at a later time point, and if the difference is a negative difference (e.g., the quantitative measures of a panel of condition-associated genomic loci increased from the earlier time point to the later time point), then the difference may be indicative of the subject having an increased risk of the condition. A clinical action or decision may be made based on this indication of the increased risk of the condition, e.g., prescribing a new therapeutic intervention or switching therapeutic interventions (e.g., ending a current treatment and prescribing a new treatment) for the subject. The clinical action or decision may comprise recommending the subject for a secondary clinical test to confirm the increased risk of the condition. This secondary clinical test may comprise an imaging test, a blood test, a computed tomography (CT) scan, a magnetic resonance imaging (MRI) scan, an ultrasound scan, a chest X-ray, a positron emission tomography (PET) scan, a PET-CT scan, or any combination thereof.

In some embodiments, a difference in the feature sets (e.g., quantitative measures of a panel of condition-associated genomic loci) determined between the two or more time points may be indicative of the subject having a decreased risk of the condition. For example, if the condition was detected in the subject both at an earlier time point and at a later time point, and if the difference is a positive difference (e.g., the quantitative measures of a panel of condition-associated genomic loci decreased from the earlier time point to the later time point), then the difference may be indicative of the subject having a decreased risk of the condition. A clinical action or decision may be made based on this indication of the decreased risk of the condition, e.g., continuing or ending a current therapeutic intervention for the subject. The clinical action or decision may comprise recommending the subject for a secondary clinical test to confirm the decreased risk of the condition. This secondary clinical test may comprise an imaging test, a blood test, a computed tomography (CT) scan, a magnetic resonance imaging (MRI) scan, an ultrasound scan, a chest X-ray, a positron emission tomography (PET) scan, a PET-CT scan, or any combination thereof.

In some embodiments, a difference in the feature sets (e.g., quantitative measures of a panel of condition-associated genomic loci) determined between the two or more time points may be indicative of an efficacy of the course of treatment for treating the condition of the subject. For example, if the condition was detected in the subject at an earlier time point but was not detected in the subject at a later time point, then the difference may be indicative of an efficacy of the course of treatment for treating the condition of the subject. A clinical action or decision may be made based on this indication of the efficacy of the course of treatment for treating the condition of the subject, e.g., continuing or ending a current therapeutic intervention for the subject. The clinical action or decision may comprise recommending the subject for a secondary clinical test to confirm the efficacy of the course of treatment for treating the condition. This secondary clinical test may comprise an imaging test, a blood test, a computed tomography (CT) scan, a magnetic resonance imaging (MRI) scan, an ultrasound scan, a chest X-ray, a positron emission tomography (PET) scan, a PET-CT scan, or any combination thereof.

In some embodiments, a difference in the feature sets (e.g., quantitative measures of a panel of condition-associated genomic loci) determined between the two or more time points may be indicative of a non-efficacy of the course of treatment for treating the condition of the subject. For example, if the condition was detected in the subject both at an earlier time point and at a later time point, and if the difference is a negative or zero difference (e.g., the quantitative measures of a panel of condition-associated genomic loci increased or remained at a constant level from the earlier time point to the later time point), and if an efficacious treatment was indicated at an earlier time point, then the difference may be indicative of a non-efficacy of the course of treatment for treating the condition of the subject. A clinical action or decision may be made based on this indication of the non-efficacy of the course of treatment for treating the condition of the subject, e.g., ending a current therapeutic intervention and/or switching to (e.g., prescribing) a different new therapeutic intervention for the subject. The clinical action or decision may comprise recommending the subject for a secondary clinical test to confirm the non-efficacy of the course of treatment for treating the condition. This secondary clinical test may comprise an imaging test, a blood test, a computed tomography (CT) scan, a magnetic resonance imaging (MRI) scan, an ultrasound scan, a chest X-ray, a positron emission tomography (PET) scan, a PET-CT scan, or any combination thereof.
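
The two-time-point logic of the preceding paragraphs may be summarized, by way of a minimal sketch, as follows. The sketch assumes that the quantitative measures are collapsed to a single hypothetical summary score per time point and that the difference is computed as the earlier score minus the later score, so that an increase over time yields a negative difference, consistent with the examples above. The function and parameter names are illustrative only.

```python
def clinical_indications(detected_earlier, detected_later,
                         measure_earlier, measure_later,
                         treatment_indicated_earlier=False):
    """Map a two-time-point comparison to the indications described above."""
    # Assumed sign convention: earlier minus later.
    difference = measure_earlier - measure_later
    indications = []
    if not detected_earlier and detected_later:
        indications.append("diagnosis")
    elif detected_earlier and not detected_later:
        indications.append("treatment efficacy")
    elif detected_earlier and detected_later:
        if difference < 0:
            indications.append("increased risk")
        elif difference > 0:
            indications.append("decreased risk")
        # Increase or no change despite a treatment indicated earlier.
        if treatment_indicated_earlier and difference <= 0:
            indications.append("treatment non-efficacy")
    return indications


# Hypothetical subject: condition detected at both time points,
# score rose from 2.0 to 3.0 while on an indicated treatment.
result = clinical_indications(True, True, 2.0, 3.0,
                              treatment_indicated_earlier=True)
```

For the hypothetical subject shown, the function returns both "increased risk" and "treatment non-efficacy", either of which may prompt the clinical actions or secondary clinical tests described above.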

Kits

The present disclosure provides kits for identifying or monitoring a disease or disorder (e.g., an SLE condition) of a subject. A kit may comprise probes for identifying a quantitative measure (e.g., indicative of a presence, absence, or relative amount) of sequences at each of a panel of condition-associated (e.g., SLE-associated) genomic loci in a sample of the subject. A quantitative measure (e.g., indicative of a presence, absence, or relative amount) of sequences at each of a panel of condition-associated genomic loci in the sample may be indicative of the disease or disorder (e.g., an SLE condition) of the subject. The probes may be selective for the sequences at the panel of condition-associated genomic loci in the sample. A kit may comprise instructions for using the probes to process the sample to generate datasets indicative of a quantitative measure (e.g., indicative of a presence, absence, or relative amount) of sequences at each of the panel of condition-associated genomic loci in a sample of the subject.

The probes in the kit may be selective for the sequences at the panel of condition-associated (e.g., SLE-associated) genomic loci in the sample. The probes in the kit may be configured to selectively enrich nucleic acid (e.g., RNA or DNA) molecules corresponding to the panel of condition-associated genomic loci. The probes in the kit may be nucleic acid primers. The probes in the kit may have sequence complementarity with nucleic acid sequences from one or more of the panel of condition-associated genomic loci. The panel of condition-associated genomic loci or genomic regions may comprise at least 2, at least 3, at least 4, at least 5, at least 6, at least 7, at least 8, at least 9, at least 10, at least 11, at least 12, at least 13, at least 14, at least 15, at least 16, at least 17, at least 18, at least 19, at least 20, or more distinct condition-associated genomic loci.

The instructions in the kit may comprise instructions to assay the sample using the probes that are selective for the sequences at the panel of condition-associated (e.g., SLE-associated) genomic loci in the cell-free biological sample. These probes may be nucleic acid molecules (e.g., RNA or DNA) having sequence complementarity with nucleic acid sequences (e.g., RNA or DNA) from one or more of the panel of condition-associated genomic loci. These nucleic acid molecules may be primers or enrichment sequences. The instructions to assay the cell-free biological sample may comprise instructions to perform array hybridization, polymerase chain reaction (PCR), or nucleic acid sequencing (e.g., DNA sequencing or RNA sequencing) to process the sample to generate datasets indicative of a quantitative measure (e.g., indicative of a presence, absence, or relative amount) of sequences at each of the panel of condition-associated genomic loci in the sample. A quantitative measure (e.g., indicative of a presence, absence, or relative amount) of sequences at each of a panel of condition-associated genomic loci in the sample may be indicative of a disease or disorder (e.g., an SLE condition).

Analysis of Single Nucleotide Polymorphisms (SNPs) Associated with Lupus

In an aspect, the present disclosure provides a method for identifying an autoimmune disease drug target, the method comprising: (a) treating an autoimmune disease animal model with a drug configured to inhibit a drug target of the autoimmune disease, thereby producing a treated animal model; (b) assaying an animal biological sample of the treated animal model to obtain gene expression data of the treated animal model; (c) processing the gene expression data to obtain an animal gene signature, wherein the animal gene signature comprises quantitative measures of a first set of genomic loci associated with autoimmune disease pathways of the autoimmune disease animal model; (d) obtaining a set of human gene signatures, wherein the set of human gene signatures comprises quantitative measures of a second set of genomic loci associated with up-regulation or down-regulation of human autoimmune disease pathways in human patients having active autoimmune disease, and wherein the set of human gene signatures is generated by assaying human biological samples from one or more human patients having the autoimmune disease to obtain gene expression data; (e) processing the animal gene signature with the set of human gene signatures to identify (i) an animal genomic locus from among the first set of genomic loci, and (ii) a human genomic locus from among the second set of genomic loci that is associated with up-regulation or down-regulation of one or more human autoimmune disease pathways, wherein the animal genomic locus and the human genomic locus are orthologous and share similarity in expression patterns and function; and (f) identifying the drug target as the autoimmune disease drug target when the quantitative measure of the animal genomic locus of the animal gene signature is indicative of up-regulation or down-regulation of an autoimmune disease pathway of the autoimmune disease animal model.

In some embodiments, the autoimmune disease animal model is selected from: a mouse model, a rat model, a cat model, a dog model, a rabbit model, a guinea pig model, a hamster model, a pig model, a horse model, and a primate model. In some embodiments, the autoimmune disease animal model comprises a mouse model. In some embodiments, the autoimmune disease comprises lupus. In some embodiments, the lupus comprises systemic lupus erythematosus (SLE) or discoid lupus erythematosus (DLE). In some embodiments, the drug target is HDAC6. In some embodiments, the drug target is HDAC6 or a portion thereof. In some embodiments, the drug is an HDAC6 inhibitor. In some embodiments, the HDAC6 inhibitor is ACY-738. In some embodiments, the animal biological sample or the human biological samples comprise one or more of a bodily fluid sample, a blood sample, a cell sample, and a tissue sample. In some embodiments, the one or more human autoimmune disease pathways are selected from the pathways listed in Table 61, Table 62, Table 63, and Table 64. In some embodiments, the human genomic locus that is associated with up-regulation or down-regulation of the one or more human autoimmune disease pathways is selected from the genes listed in Table 59, Table 60, Table 61, Table 62, Table 63, Table 64, Table 65, Table 66, Table 67, Table 68, and Table 69. In some embodiments, the autoimmune disease pathways of the autoimmune disease animal model are selected from the pathways listed in Table 61, Table 62, Table 63, and Table 64. In some embodiments, the animal genomic locus is selected from the genes listed in Table 59, Table 60, Table 61, Table 62, Table 63, Table 64, Table 65, Table 66, Table 67, Table 68, and Table 69. 
In some embodiments, (e) comprises identifying (i) a plurality of animal genomic loci from among the first set of genomic loci, and (ii) a plurality of human genomic loci from among the second set of genomic loci that is associated with up-regulation or down-regulation of a plurality of human autoimmune disease pathways, wherein the plurality of animal genomic loci and the plurality of human genomic loci are pairwise orthologous and share similarities in expression patterns and function; and (f) comprises identifying the drug target as the autoimmune disease drug target when the quantitative measures of the plurality of animal genomic loci of the animal gene signature are indicative of up-regulation or down-regulation of a plurality of autoimmune disease pathways of the autoimmune disease animal model. In some embodiments, the plurality of human autoimmune disease pathways comprises between 2 and 5 different human autoimmune disease pathways. In some embodiments, the plurality of human autoimmune disease pathways comprises 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, about 25, about 30, about 35, about 40, about 45, about 50, about 60, about 70, about 80, about 90, about 100, or more than about 100 different human autoimmune disease pathways. In some embodiments, the autoimmune disease pathways of the autoimmune disease animal model comprise 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, about 25, about 30, about 35, about 40, about 45, about 50, about 60, about 70, about 80, about 90, about 100, or more than about 100 different autoimmune disease pathways. In some embodiments, the method further comprises determining the up-regulation or down-regulation of the autoimmune disease pathway of the autoimmune disease animal model based on determining a difference between the quantitative measure of the animal genomic locus of the animal gene signature and a reference quantitative measure of the animal genomic locus.
In some embodiments, the method further comprises obtaining the reference quantitative measure of the animal genomic locus by, prior to (a), assaying an animal biological sample of the autoimmune disease animal model.
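
By way of illustration, determining up-regulation or down-regulation from the difference between a treated quantitative measure and a reference quantitative measure may be sketched as follows. The sketch assumes that both measures are already on a log2 expression scale and that a call requires an absolute difference of at least 1.0; both the scale and the threshold are hypothetical choices and are not specified by the disclosure.

```python
def regulation_call(treated_measure, reference_measure, min_abs_difference=1.0):
    """Classify a locus from the treated-minus-reference difference."""
    difference = treated_measure - reference_measure
    if difference >= min_abs_difference:
        return "up-regulated"
    if difference <= -min_abs_difference:
        return "down-regulated"
    return "unchanged"


# Hypothetical log2 expression values: (after treatment, reference before treatment).
measures = {
    "locus_A": (8.5, 6.0),
    "locus_B": (5.0, 6.0),
    "locus_C": (6.2, 6.0),
}
calls = {name: regulation_call(t, ref) for name, (t, ref) in measures.items()}
```

Here the reference measure corresponds to the sample assayed prior to (a), and the treated measure corresponds to the animal gene signature obtained after treatment.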

In some embodiments, the autoimmune disease animal model is selected from: a mouse model, a rat model, a cat model, a dog model, a rabbit model, a guinea pig model, a hamster model, a pig model, a horse model, and a primate model. In some embodiments, the autoimmune disease animal model comprises a mouse model. In some embodiments, the autoimmune disease comprises lupus. In some embodiments, the lupus comprises systemic lupus erythematosus (SLE) or discoid lupus erythematosus (DLE). In some embodiments, the drug target is HDAC6. In some embodiments, the drug target is HDAC6 or a portion thereof. In some embodiments, the drug is an HDAC6 inhibitor. In some embodiments, the HDAC6 inhibitor is ACY-738. In some embodiments, the animal biological sample or the human biological samples comprise one or more of: a bodily fluid sample, a blood sample, a cell sample, and a tissue sample. In some embodiments, the one or more human autoimmune disease pathways are selected from the pathways listed in Table 61, Table 62, Table 63, and Table 64. In some embodiments, the human genomic locus that is associated with up-regulation or down-regulation of the one or more human autoimmune disease pathways is selected from the genes listed in Table 59, Table 60, Table 61, Table 62, Table 63, Table 64, Table 65, Table 66, Table 67, Table 68, and Table 69. In some embodiments, the autoimmune disease pathways of the autoimmune disease animal model are selected from the pathways listed in Table 61, Table 62, Table 63, and Table 64. In some embodiments, the animal genomic locus is selected from the genes listed in Table 59, Table 60, Table 61, Table 62, Table 63, Table 64, Table 65, Table 66, Table 67, Table 68, and Table 69. 
In some embodiments, (d) comprises identifying (i) a plurality of animal genomic loci from among the first set of genomic loci, and (ii) a plurality of human genomic loci from among the second set of genomic loci that is associated with up-regulation or down-regulation of a plurality of human autoimmune disease pathways, wherein the plurality of animal genomic loci and the plurality of human genomic loci are pairwise orthologous and share similarities in expression patterns and function; and (e) comprises identifying the drug target as the autoimmune disease drug target when the quantitative measures of the plurality of animal genomic loci of the animal gene signature are indicative of up-regulation or down-regulation of a plurality of autoimmune disease pathways of the autoimmune disease animal model. In some embodiments, the plurality of human autoimmune disease pathways comprises between 2 and 5 different human autoimmune disease pathways. In some embodiments, the plurality of human autoimmune disease pathways comprises 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, about 25, about 30, about 35, about 40, about 45, about 50, about 60, about 70, about 80, about 90, about 100, or more than about 100 different human autoimmune disease pathways. In some embodiments, the autoimmune disease pathways of the autoimmune disease animal model comprise 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, about 25, about 30, about 35, about 40, about 45, about 50, about 60, about 70, about 80, about 90, about 100, or more than about 100 different autoimmune disease pathways. In some embodiments, the method further comprises determining the up-regulation or down-regulation of the autoimmune disease pathway of the autoimmune disease animal model based on determining a difference between the quantitative measure of the animal genomic locus of the animal gene signature and a reference quantitative measure of the animal genomic locus.
In some embodiments, the method further comprises obtaining the reference quantitative measure of the animal genomic locus by, prior to (a), assaying an animal biological sample of the autoimmune disease animal model.

In some embodiments, the autoimmune disease animal model is selected from: a mouse model, a rat model, a cat model, a dog model, a rabbit model, a guinea pig model, a hamster model, a pig model, a horse model, and a primate model. In some embodiments, the autoimmune disease animal model comprises a mouse model. In some embodiments, the autoimmune disease comprises lupus. In some embodiments, the lupus comprises systemic lupus erythematosus (SLE) or discoid lupus erythematosus (DLE). In some embodiments, the drug target is HDAC6. In some embodiments, the drug target is HDAC6 or a portion thereof. In some embodiments, the drug is an HDAC6 inhibitor. In some embodiments, the HDAC6 inhibitor is ACY-738. In some embodiments, the animal biological sample or the human biological samples comprise one or more of: a bodily fluid sample, a blood sample, a cell sample, and a tissue sample. In some embodiments, the one or more human autoimmune disease pathways are selected from the pathways listed in Table 61, Table 62, Table 63, and Table 64. In some embodiments, the human genomic locus that is associated with up-regulation or down-regulation of the one or more human autoimmune disease pathways is selected from the genes listed in Table 59, Table 60, Table 61, Table 62, Table 63, Table 64, Table 65, Table 66, Table 67, Table 68, and Table 69. In some embodiments, the autoimmune disease pathways of the autoimmune disease animal model are selected from the pathways listed in Table 61, Table 62, Table 63, and Table 64. In some embodiments, the animal genomic locus is selected from the genes listed in Table 59, Table 60, Table 61, Table 62, Table 63, Table 64, Table 65, Table 66, Table 67, Table 68, and Table 69. 
In some embodiments, (iii) comprises identifying (1) a plurality of animal genomic loci from among the first set of genomic loci, and (2) a plurality of human genomic loci from among the second set of genomic loci that is associated with up-regulation or down-regulation of a plurality of human autoimmune disease pathways, wherein the plurality of animal genomic loci and the plurality of human genomic loci are pairwise orthologous and share similarities in expression patterns and function; and (iv) comprises identifying the drug target as the autoimmune disease drug target when the quantitative measures of the plurality of animal genomic loci of the animal gene signature are indicative of up-regulation or down-regulation of a plurality of autoimmune disease pathways of the autoimmune disease animal model. In some embodiments, the plurality of human autoimmune disease pathways comprises between 2 and 5 different human autoimmune disease pathways. In some embodiments, the plurality of human autoimmune disease pathways comprises 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, about 25, about 30, about 35, about 40, about 45, about 50, about 60, about 70, about 80, about 90, about 100, or more than about 100 different human autoimmune disease pathways. In some embodiments, the autoimmune disease pathways of the autoimmune disease animal model comprise 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, about 25, about 30, about 35, about 40, about 45, about 50, about 60, about 70, about 80, about 90, about 100, or more than about 100 different autoimmune disease pathways.
In some embodiments, the one or more computer processors are individually or collectively programmed to further determine the up-regulation or down-regulation of the autoimmune disease pathway of the autoimmune disease animal model based on determining a difference between the quantitative measure of the animal genomic locus of the animal gene signature and a reference quantitative measure of the animal genomic locus. In some embodiments, the one or more computer processors are individually or collectively programmed to further obtain the reference quantitative measure of the animal genomic locus by, prior to (a), assaying an animal biological sample of the autoimmune disease animal model.

In another aspect, the present disclosure provides a non-transitory computer readable medium comprising machine-executable code that, upon execution by one or more computer processors, implements a method for identifying an autoimmune disease drug target, the method comprising: (a) obtaining gene expression data generated by assaying an animal biological sample of a treated animal model, wherein the treated animal model is obtained by treating an autoimmune disease animal model with a drug configured to inhibit a drug target of the autoimmune disease; (b) processing the gene expression data to obtain an animal gene signature, wherein the animal gene signature comprises quantitative measures of a first set of genomic loci associated with autoimmune disease pathways of the autoimmune disease animal model; (c) obtaining a set of human gene signatures, wherein the set of human gene signatures comprises quantitative measures of a second set of genomic loci associated with up-regulation or down-regulation of human autoimmune disease pathways in human patients having active autoimmune disease, and wherein the set of human gene signatures is generated by assaying human biological samples from one or more human patients having the autoimmune disease to obtain gene expression data; (d) processing the animal gene signature with the set of human gene signatures to identify (i) an animal genomic locus from among the first set of genomic loci, and (ii) a human genomic locus from among the second set of genomic loci that is associated with up-regulation or down-regulation of one or more human autoimmune disease pathways, wherein the animal genomic locus and the human genomic locus are orthologous and share similarity in expression patterns and function; and (e) identifying the drug target as the autoimmune disease drug target when the quantitative measure of the animal genomic locus of the animal gene signature is indicative of up-regulation or down-regulation of an autoimmune disease pathway of the autoimmune disease animal model.
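
As a simplified, non-limiting sketch, the step of processing the animal gene signature with a human gene signature to identify orthologous loci may use an ortholog lookup table, as shown below. The locus identifiers, the signature values, and the lookup table are hypothetical placeholders. The sketch pairs loci by orthology only; the further criterion that the paired loci share similarity in expression patterns and function is not implemented here.

```python
def orthologous_pairs(animal_signature, human_signature, ortholog_map):
    """Pair animal loci with human orthologs present in the human signature."""
    pairs = []
    for animal_locus, animal_value in animal_signature.items():
        human_locus = ortholog_map.get(animal_locus)
        # Skip loci with no known ortholog or absent from the human signature.
        if human_locus is None or human_locus not in human_signature:
            continue
        pairs.append((animal_locus, human_locus,
                      animal_value, human_signature[human_locus]))
    return pairs


# Hypothetical quantitative measures (e.g., differences from a reference).
animal_signature = {"animal_locus_1": -1.8, "animal_locus_2": 0.4, "animal_locus_3": 2.1}
human_signature = {"HUMAN_LOCUS_1": 2.3, "HUMAN_LOCUS_3": 1.9}
# Hypothetical animal-to-human ortholog lookup table.
ortholog_map = {
    "animal_locus_1": "HUMAN_LOCUS_1",
    "animal_locus_2": "HUMAN_LOCUS_2",
    "animal_locus_3": "HUMAN_LOCUS_3",
}
pairs = orthologous_pairs(animal_signature, human_signature, ortholog_map)
```

Each returned tuple carries the animal locus, its human ortholog, and the two quantitative measures, which may then be evaluated for up-regulation or down-regulation of the associated pathways.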

In some embodiments, the autoimmune disease animal model is selected from: a mouse model, a rat model, a cat model, a dog model, a rabbit model, a guinea pig model, a hamster model, a pig model, a horse model, and a primate model. In some embodiments, the autoimmune disease animal model comprises a mouse model. In some embodiments, the autoimmune disease comprises lupus. In some embodiments, the lupus comprises systemic lupus erythematosus (SLE) or discoid lupus erythematosus (DLE). In some embodiments, the drug target is HDAC6. In some embodiments, the drug target is HDAC6 or a portion thereof. In some embodiments, the drug is an HDAC6 inhibitor. In some embodiments, the HDAC6 inhibitor is ACY-738. In some embodiments, the animal biological sample or the human biological samples comprise one or more of a bodily fluid sample, a blood sample, a cell sample, and a tissue sample. In some embodiments, the one or more human autoimmune disease pathways are selected from the pathways listed in Table 61, Table 62, Table 63, and Table 64. In some embodiments, the human genomic locus that is associated with up-regulation or down-regulation of the one or more human autoimmune disease pathways is selected from the genes listed in Table 59, Table 60, Table 61, Table 62, Table 63, Table 64, Table 65, Table 66, Table 67, Table 68, and Table 69. In some embodiments, the autoimmune disease pathways of the autoimmune disease animal model are selected from the pathways listed in Table 61, Table 62, Table 63, and Table 64. In some embodiments, the animal genomic locus is selected from the genes listed in Table 59, Table 60, Table 61, Table 62, Table 63, Table 64, Table 65, Table 66, Table 67, Table 68, and Table 69. 
In some embodiments, (d) comprises identifying (i) a plurality of animal genomic loci from among the first set of genomic loci, and (ii) a plurality of human genomic loci from among the second set of genomic loci that is associated with up-regulation or down-regulation of a plurality of human autoimmune disease pathways, wherein the plurality of animal genomic loci and the plurality of human genomic loci are pairwise orthologous and share similarities in expression patterns and function; and (e) comprises identifying the drug target as the autoimmune disease drug target when the quantitative measures of the plurality of animal genomic loci of the animal gene signature are indicative of up-regulation or down-regulation of a plurality of autoimmune disease pathways of the autoimmune disease animal model. In some embodiments, the plurality of human autoimmune disease pathways comprises between 2 and 5 different human autoimmune disease pathways. In some embodiments, the plurality of human autoimmune disease pathways comprises 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, about 25, about 30, about 35, about 40, about 45, about 50, about 60, about 70, about 80, about 90, about 100, or more than about 100 different human autoimmune disease pathways. In some embodiments, the autoimmune disease pathways of the autoimmune disease animal model comprise 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, about 25, about 30, about 35, about 40, about 45, about 50, about 60, about 70, about 80, about 90, about 100, or more than about 100 different autoimmune disease pathways. In some embodiments, the method further comprises determining the up-regulation or down-regulation of the autoimmune disease pathway of the autoimmune disease animal model based on determining a difference between the quantitative measure of the animal genomic locus of the animal gene signature and a reference quantitative measure of the animal genomic locus.
In some embodiments, the method further comprises obtaining the reference quantitative measure of the animal genomic locus by, prior to (a), assaying an animal biological sample of the autoimmune disease animal model.

In another aspect, the present disclosure provides a non-transitory computer readable medium comprising machine-executable code that, upon execution by one or more computer processors, implements a method for evaluating a drug candidate for an autoimmune disease, the method comprising: (a) treating an autoimmune disease animal model with the drug candidate for the autoimmune disease, thereby producing a treated animal model; (b) assaying an animal biological sample of the treated animal model to obtain gene expression data of the treated animal model; (c) processing the gene expression data to obtain an animal gene signature, wherein the animal gene signature comprises quantitative measures of a first set of genomic loci associated with autoimmune disease pathways of the autoimmune disease animal model; (d) obtaining a set of human gene signatures, wherein the set of human gene signatures comprises quantitative measures of a second set of genomic loci associated with up-regulation or down-regulation of human autoimmune disease pathways in human patients having active autoimmune disease, and wherein the set of human gene signatures is generated by assaying human biological samples from one or more human patients having the autoimmune disease to obtain gene expression data; (e) processing the animal gene signature with the set of human gene signatures to identify (i) an animal genomic locus from among the first set of genomic loci, and (ii) a human genomic locus from among the second set of genomic loci that is associated with up-regulation or down-regulation of one or more human autoimmune disease pathways, wherein the animal genomic locus and the human genomic locus are orthologous and share similarity in expression patterns and function; and (f) evaluating the efficacy of the drug candidate for the autoimmune disease based at least in part on whether the quantitative measure of the animal genomic locus of the animal gene signature is indicative of up-regulation or down-regulation of an autoimmune disease pathway of the autoimmune disease animal model.

To obtain a blood sample, various techniques may be used, e.g., a syringe or other vacuum suction device. A blood sample may be optionally pre-treated or processed prior to use. A sample, such as a blood sample, may be analyzed under any of the methods and systems herein within 4 weeks, 2 weeks, 1 week, 6 days, 5 days, 4 days, 3 days, 2 days, 1 day, 12 hr, 6 hr, 3 hr, 2 hr, or 1 hr from the time the sample is obtained, or longer if frozen. When obtaining a sample from a subject (e.g., blood sample), the amount may vary depending upon subject size and the condition being screened. In some embodiments, at least 10 mL, 5 mL, 1 mL, 0.5 mL, 250, 200, 150, 100, 50, 40, 30, 20, 10, 9, 8, 7, 6, 5, 4, 3, 2, or 1 μL of a sample is obtained. In some embodiments, 1-50, 2-40, 3-30, or 4-20 μL of sample is obtained. In some embodiments, more than 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95 or 100 μL of a sample is obtained.

In some embodiments, a sample may be taken at a first time point and assayed, and then another sample may be taken at a subsequent time point and assayed. Such methods may be used, for example, for longitudinal monitoring purposes to track the development or progression of a disease. In some embodiments, the progression of a disease may be tracked before treatment, after treatment, or during the course of treatment, to determine the treatment's effectiveness. For example, a method as described herein may be performed on a subject prior to, and after, treatment with a lupus condition therapy to measure the disease's progression or regression in response to the lupus condition therapy.

After obtaining a sample from the subject, the sample may be processed or assayed to generate datasets of the subject. The datasets may be indicative of a disease, disorder, or abnormal condition (e.g., lupus) of the subject. For example, a presence, absence, or quantitative assessment of nucleic acid molecules of the sample at a panel of condition-associated genomic loci may comprise a gene signature of a subject (e.g., a mouse or human). The gene signature may be indicative of an autoimmune disease (e.g., lupus) of the subject or of suitable disease targets of the autoimmune disease. Processing or assaying the sample obtained from the subject may comprise (i) subjecting the sample to conditions that are sufficient to isolate, enrich, or extract a plurality of nucleic acid molecules, and (ii) assaying the plurality of nucleic acid molecules to generate the dataset (e.g., microarray data, nucleic acid sequences, or quantitative polymerase chain reaction (qPCR) data). Methods of assaying may include the use of a variety of suitable assays, for example, a microarray assay, a sequencing assay (e.g., DNA sequencing, RNA sequencing, or RNA-Seq), a single-cell assay, or a quantitative polymerase chain reaction (qPCR) assay.

In some embodiments, single-cell RNA-Seq data may be obtained from biological samples and then analyzed by a clustering approach such as spherical transformation and recursive splitting for heuristic identification of partitions (Starship), which is adapted for single-cell RNA-Seq data. Generally, bulk cell analysis methods may fail to account for the zero-inflated nature of single-cell RNA-Seq data. For example, Euclidean-based methods may be confounded by the vast number of zeros, which tends to make all cells look similar. In addition, density-based methods may fail to adapt to different levels of heterogeneity among leukocytes (e.g., the differences between myeloid populations may be more prominent than those between B cells and T cells). For example, conventional methods may be unable to cluster all of the cells in one pass, and may need to be re-run manually on sub-clusters to fully partition the cells. Single-cell RNA-Seq data, particularly those gathered with Unique Molecular Identifier (UMI) barcodes, may tend to resemble bag-of-words text data in several ways, such as: 1) each observation takes an integer value, and 2) most genes may not appear in a given cell, much like most words may not appear in a given document. Clustering of this sparse data may be performed by mapping samples onto the surface of a unit n-dimensional sphere, where n is the number of genes. Rather than clustering with a set number of clusters (k), Starship recursively clusters data with k=2 until pre-defined stop criteria are met. Once the clustering is complete, several functions can be run to further analyze and/or visualize the resulting clusters of cells. The Starship algorithm may be performed as described in, for example, PCT Appl. No. PCT/US2019/049129, entitled “Systems and Methods for Single-Cell RNA-Seq Data Analysis,” filed Aug. 30, 2019, which is incorporated herein by reference in its entirety.
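
The two ideas in the paragraph above, projecting each cell onto the unit sphere and recursively splitting with k=2, can be illustrated with a minimal sketch. This is not the Starship implementation from the referenced application; the stop criteria used here (`min_size` and `min_cohesion`) and all function names are hypothetical stand-ins for the pre-defined stop criteria, and the toy bisection step is a plain spherical 2-means.

```python
import math
import random


def to_unit_sphere(vector):
    """Scale a vector to unit L2 norm so it lies on the unit n-sphere."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


def cosine(u, v):
    """Dot product; equals cosine similarity for unit vectors."""
    return sum(a * b for a, b in zip(u, v))


def mean_direction(points):
    """Unit vector pointing along the sum of the given points."""
    dims = len(points[0])
    return to_unit_sphere([sum(p[i] for p in points) for i in range(dims)])


def split_in_two(points, rng, iterations=20):
    """Spherical 2-means: returns two lists of point indices."""
    centroids = rng.sample(points, 2)
    groups = ([], [])
    for _ in range(iterations):
        groups = ([], [])
        for index, point in enumerate(points):
            closer_to_first = cosine(point, centroids[0]) >= cosine(point, centroids[1])
            groups[0 if closer_to_first else 1].append(index)
        if not groups[0] or not groups[1]:
            break
        centroids = [mean_direction([points[i] for i in g]) for g in groups]
    return groups


def recursive_split(points, ids, rng, min_size=2, min_cohesion=0.9):
    """Recursively bisect until a (hypothetical) stop criterion is met."""
    center = mean_direction(points)
    cohesion = sum(cosine(p, center) for p in points) / len(points)
    if len(points) < 2 * min_size or cohesion >= min_cohesion:
        return [ids]
    left, right = split_in_two(points, rng)
    if len(left) < min_size or len(right) < min_size:
        return [ids]
    clusters = []
    for group in (left, right):
        clusters += recursive_split(
            [points[i] for i in group], [ids[i] for i in group],
            rng, min_size, min_cohesion)
    return clusters


def cluster_cells(counts_by_cell, seed=0):
    """counts_by_cell: {cell_id: [UMI count per gene]} -> list of clusters."""
    ids = sorted(counts_by_cell)
    points = [to_unit_sphere(counts_by_cell[c]) for c in ids]
    return recursive_split(points, ids, random.Random(seed))
```

Because every cell is normalized to unit length, similarity depends on which genes are expressed relative to one another rather than on total counts, which is one reason a cosine-based split may tolerate sparse, zero-inflated data better than a Euclidean one.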

The sample may be processed without any nucleic acid extraction. For example, the disease, disorder, or abnormal condition (e.g., an autoimmune disease such as lupus) may be identified or monitored in the subject by using probes configured to selectively enrich nucleic acid (e.g., RNA or DNA) molecules corresponding to a panel of condition-associated genomic loci. The probes may be nucleic acid primers. The probes may have sequence complementarity with nucleic acid sequences from one or more of the panel of condition-associated genomic loci. The panel of condition-associated genomic loci may comprise at least 2, at least 3, at least 4, at least 5, at least 6, at least 7, at least 8, at least 9, at least 10, at least 11, at least 12, at least 13, at least 14, at least 15, at least 16, at least 17, at least 18, at least 19, at least 20, at least about 25, at least about 30, at least about 35, at least about 40, at least about 45, at least about 50, at least about 55, at least about 60, at least about 65, at least about 70, at least about 75, at least about 80, at least about 85, at least about 90, at least about 95, at least about 100, or more condition-associated genomic loci.

Data Analysis

The present disclosure provides systems and methods to identify autoimmune disease drug targets using data analysis tools or algorithms. In various aspects, such data analysis tools or algorithms may be used to perform analysis of data sets including, for example, mRNA gene expression or transcriptome data, DNA genomic data, proteomic data, metabolomic data, other types of “-omic” data, or a combination thereof. Methods and systems of the present disclosure may use one or more of the following: a BIG-C™ data analysis algorithm, an I-Scope™ data analysis algorithm, a T-Scope™ data analysis algorithm, a P-Scope™ data analysis algorithm, and a Gene Set Variation Analysis (GSVA) algorithm.

FIG. 104 shows a non-limiting example of a workflow of a method 1040 to identify an autoimmune disease drug target, using one or more data analysis algorithms or tools. The method may comprise treating an autoimmune disease animal model with a drug configured to inhibit a drug target of the autoimmune disease, thereby producing a treated animal model (as in operation 1041). Next, the method may comprise assaying an animal biological sample of the treated animal model to obtain gene expression data of the treated animal model (as in operation 1042). Next, the method may comprise processing the gene expression data to obtain an animal gene signature (as in operation 1043). In some embodiments, the animal gene signature comprises quantitative measures of a first set of genomic loci associated with autoimmune disease pathways of the autoimmune disease animal model. Next, the method may comprise obtaining a set of human gene signatures (as in 1044). In some embodiments, the set of human gene signatures comprises quantitative measures of a second set of genomic loci associated with up-regulation or down-regulation of human autoimmune disease pathways in human patients having active autoimmune disease. In some embodiments, the set of human gene signatures is generated by assaying human biological samples from one or more human patients having the autoimmune disease to obtain gene expression data. Next, the method may comprise processing the animal gene signature with the set of human gene signatures to identify (i) an animal genomic locus from among the first set of genomic loci, and (ii) a human genomic locus from among the second set of genomic loci that is associated with up-regulation or down-regulation of one or more human autoimmune disease pathways (as in operation 1045). In some embodiments, the animal genomic locus and the human genomic locus are orthologous and share similarity in expression patterns and function. 
Next, the method may comprise identifying the drug target as the autoimmune disease drug target based on the quantitative measure of the animal genomic locus of the animal gene signature (e.g., when the quantitative measure of the animal genomic locus of the animal gene signature is indicative of up-regulation or down-regulation of an autoimmune disease pathway of the autoimmune disease animal model) (as in operation 1046).
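
As a non-limiting illustration of operation 1045, the sketch below pairs loci from an animal gene signature with loci from a human gene signature through an ortholog lookup and keeps the pairs that change in the same direction. The data layout (signed log fold changes keyed by gene symbol), the use of matching sign as a proxy for similar expression patterns, and the function name `shared_loci` are assumptions made for illustration only.

```python
def shared_loci(animal_signature, human_signature, orthologs):
    """Return (animal_gene, human_gene, direction) for orthologous loci
    whose signed expression changes agree in both signatures.

    animal_signature / human_signature: {gene_symbol: signed log fold change}
    orthologs: {animal_gene_symbol: human_gene_symbol}
    """
    matches = []
    for animal_gene, animal_value in animal_signature.items():
        human_gene = orthologs.get(animal_gene)
        if human_gene is None or human_gene not in human_signature:
            continue  # no ortholog, or ortholog absent from the human signature
        human_value = human_signature[human_gene]
        if animal_value * human_value > 0:  # same sign in both species
            direction = "up" if animal_value > 0 else "down"
            matches.append((animal_gene, human_gene, direction))
    return matches


# Toy values, made up for illustration only.
animal = {"Ifit1": 2.1, "Il10": -0.4}
human = {"IFIT1": 1.7, "IL10": 0.9}
pairs = shared_loci(animal, human, {"Ifit1": "IFIT1", "Il10": "IL10"})
```

In this toy case only the Ifit1/IFIT1 pair is returned, because the Il10/IL10 pair changes in opposite directions in the two signatures.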

BIG-C™ Big Data Analysis Algorithm

BIG-C® may be a fast and efficient cloud-based algorithm to functionally categorize gene products. With coverage of over 80% of the genome, BIG-C® leverages publicly available databases such as UniProtKB/Swiss-Prot, GO terms, KEGG pathways, NCBI PubMed and Interactome to place genes into 53 functional categories. The sorting into only one of 53 functional groups allows for a quick and relatively simple understanding of types of genes enriched and co-expressed in a big dataset. This assists in deriving further insights from genes expressed for a given disease state in human or pre-clinical mouse models.

BIG-C® may be used to functionally categorize immunological genes that are not covered in cancer databases such as GO and KEGG (e.g., as described by Grammer et al. 2016, “Drug repositioning in SLE: crowd-sourcing, literature-mining and Big Data analysis,” Lupus, 25(10), 1150-1170, which is incorporated herein by reference in its entirety). Using a knowledge base of over 5000 patients with systemic lupus erythematosus (SLE), over 16432 genes are each placed into one of 53 BIG-C® functional categories, and statistical analysis is performed to identify enriched categories. BIG-C® categories are cross-examined with the GO and KEGG terms to obtain additional information and insights.

The BIG-C (Biologically Informed Gene Clustering) algorithm may be configured to sort large groups of genes into a set of functional groups (e.g., 53 functional groups). The functional groups are created utilizing publicly available information from online tools and databases including UniProtKB/Swiss-Prot, GO Terms, KEGG pathways, NCBI PubMed, and the Interactome. The functional groups may include one or more of: active RNA, anti-apoptosis, anti-proliferation, autophagy, chromatin remodeling, cytoplasm and biochemistry, cytoskeleton, DNA repair, endocytosis, endoplasmic reticulum, endosome and vesicles, fatty acid biosynthesis, cell surface, transcription, glycolysis and gluconeogenesis, Golgi, immune cell surface, immune secreted, immune signaling, integrin pathway, interferon stimulated genes, intracellular signaling, lysosome, melanosome, MHC class I, MHC class II, microRNA processing, microRNA, mitochondrial transcription, mitochondria, mitochondria oxidative phosphorylation, mitochondrial TCA cycle, mRNA processing, mRNA splicing, non-coding RNA, nuclear receptor, nucleus and nucleolus, palmitoylation, pattern recognition receptors, peroxisomes, pro-apoptosis, pro-cell cycle, proteasome, pseudogenes, RAS superfamily, reactive oxygen species protection, secreted and extracellular matrix, transcription factors, transporters, transposon control, ubiquitylation and sumoylation, unfolded protein and stress, and unknown. Enrichment scores for each group are calculated based on an overlap p-value to determine the functional groups over- or under-expressed in the gene expression dataset. BIG-C may be configured such that each gene is sorted into only one of the 53 functional groups, allowing for a quick and relatively simple understanding of types of genes enriched and co-expressed in a big dataset.

A sample BIG-C® workflow may comprise the following steps. First, SLE genomic datasets are derived from whole blood, peripheral blood mononuclear cells, affected tissues, and purified immune cells. Second, datasets are analyzed using differential expression analysis or Weighted Gene Coexpression Network Analysis (WGCNA). Third, expressed genes are annotated using publicly available databases (e.g., UniProtKB/Swiss-Prot database, Human Immunodeficiencies database, Mouse MGI database, Entrez Molecular Sequence database, PubMed, and the Human Tissue Atlas). Fourth, signatures are cross-referenced with purified single-cell microarray datasets and RNAseq experiments. Fifth, BIG-C® is leveraged to separate the individual annotated genes into one of 53 functional categories (e.g., as described by Labonte et al. 2018, “Identification of alterations in macrophage activation associated with disease activity in systemic lupus erythematosus,” PloS one, 13(12), e0208132, which is incorporated herein by reference in its entirety). Sixth, chi-squared analysis is used to determine enriched categories of interest from overlap p-values. Seventh, enriched categories are cross-examined with GO and KEGG terms to derive key insights for further analysis.
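
The sixth step can be illustrated with a minimal sketch of a chi-squared test on a 2x2 table (gene in list or not, gene in category or not). This is one plausible reading of the chi-squared analysis, not the BIG-C® implementation; the function name and argument layout are hypothetical, and no continuity correction is applied. With one degree of freedom, the p-value equals erfc(sqrt(chi2 / 2)).

```python
import math


def chi_squared_enrichment(in_list_in_cat, list_size, cat_size, universe_size):
    """Chi-squared test (1 degree of freedom, no continuity correction).

    in_list_in_cat: genes of interest that fall in the functional category
    list_size:      total genes of interest
    cat_size:       total genes assigned to the category
    universe_size:  total categorized genes
    Returns (chi2, p_value, enriched), where enriched is True when the
    observed overlap exceeds the overlap expected by chance.
    """
    a = in_list_in_cat                   # in list, in category
    b = list_size - a                    # in list, not in category
    c = cat_size - a                     # not in list, in category
    d = universe_size - list_size - c    # in neither
    n = a + b + c + d
    expected_a = (a + b) * (a + c) / n
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value, a > expected_a
```

For example, 30 of 100 differentially expressed genes falling in a category that holds 200 of 1,000 categorized genes gives an expected overlap of 20 and a chi-squared statistic of about 6.94.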

I-Scope™ Big Data Analysis

I-Scope™ may be a big data analysis algorithm configured for cross-examining the presence and activity of varying types of immune cell infiltrates with observed gene expression patterns. It may take annotated gene expression data and analyze it for hematopoietic cell lineage. I-Scope™ may be used downstream of the BIG-C® (Biologically Informed Gene-Clustering) tool in that it helps to provide even more insight into the nature of the genes being expressed after categorization.

I-Scope™ addresses the need to understand the involvement of specific cells for a given disease state. While it is helpful to understand the relative up-regulation and down-regulation at the gene expression level, it is even more informative to understand specifically in which cells this is occurring. I-Scope™ may be configured to identify hematopoietic cells through an iterative search of more than 17,000 genes identified in more than 50 microarray datasets (e.g., as described by Hubbard et al., “Analysis of Lupus Synovitis Gene Expression Reveals Dysregulation of Pathogenic Pathways Activated within Infiltrating Immune Cells,” Arthritis Rheumatol, 2018; 70 (suppl 10), which is incorporated herein by reference in its entirety). I-Scope™ may function by restricting the analysis to genes of hematopoietic cell heritage and allow for cross-checking against purified single-cell experiments or datasets. The cross-check confirms and categorizes specific transcript signatures to the 28 hematopoietic cell sub-categories, ultimately allowing for cellular activity analysis across multiple samples and disease states. The hematopoietic cell sub-categories may include, for example, Monos/Macs, Plasma Cells, T-Cells, B-Cells, Dendritic, T&B Cells, CD8 T, Myeloid Cells, Tact, LDG, Hematopoietic, Neutrophil, Ag Presentation, Granulocytes, Platelets, pDC, “T, B, Mono”, Langerhans, Bact, Mono and B, Erythrocytes, Mast Cell, T reg, Gd T, T anergic, FDC, CD4T, and T/NK/NKT Cells. When combined with BIG-C® categories, the cellular activity may be correlated to specific functions within a given cell type.

The I-Scope™ algorithm may be configured to identify immune infiltrates. Hematopoietic cells are unique in that they move throughout the body patrolling for threats to the host, and may infiltrate tissue sites not normally home to immune cells. I-Scope™ may be configured to identify hematopoietic cells through an iterative search of more than 17,000 genes identified in more than 50 microarray datasets. From this search, 1226 candidate genes are identified and researched for restriction in hematopoietic cells as determined by the HPA, GTEx and FANTOM5 datasets (e.g., available at proteinatlas.org). 926 genes meet the criteria for being mainly restricted to hematopoietic lineages (brain, reproductive organ exclusions were permitted). These genes are researched for immune cell specific expression in 27 hematopoietic sub-categories: alpha beta T cell, T cell, regulatory T Cell, activated T cell, anergic T cell, gamma delta T cells, CD8 T, NK/NKT cell, NK cell, T & B cells, B cells, germinal center B cells, B cell and plasmacytoid dendritic cell, T & B & myeloid, B & myeloid, T & myeloid, MHC Class II expressing cell, monocyte, dendritic cell, plasmacytoid dendritic cells, myeloid cell, plasma cell, erythrocyte, neutrophil, low density granulocyte, granulocyte, and platelet. Transcripts are entered into I-Scope™ and the number of transcripts in each category is determined. Odds ratios are calculated with confidence intervals using the Fisher's exact test in R.

A sample I-Scope™ workflow may comprise the following steps. First, candidate genes are identified from SLE (systemic lupus erythematosus) datasets potentially associated with immune cell expression. Second, using HPA, GTEx, and FANTOM5 datasets, expression signatures associated with hematopoietic cell lineage are identified. Third, signatures are cross-referenced with purified single-cell microarray datasets and RNAseq experiments. Fourth, transcripts are categorized into 28 hematopoietic cell sub-categories and cellular expression is assessed across different samples and disease states. Odds ratios are calculated with confidence intervals using the Fisher's exact test in R. An I-Scope™ signature analysis for a given sample may generate an I-Scope™ signature analysis across multiple samples and disease states.
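
The odds ratio and Fisher's exact test mentioned above can be sketched as follows. The disclosure performs this step in R; the Python version here is only an illustration of the arithmetic on a 2x2 table. Two deliberate simplifications are assumed: the sketch reports the sample odds ratio (a*d)/(b*c) rather than the conditional maximum-likelihood estimate that R's `fisher.test` returns, and it omits confidence intervals. The function name is hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test on the 2x2 table [[a, b], [c, d]].

    For a cell-type category, one possible layout is:
      a = transcripts of interest in the category
      b = transcripts of interest outside the category
      c = remaining transcripts in the category
      d = remaining transcripts outside the category
    Returns (sample_odds_ratio, p_value).
    """
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def prob(x):
        # Hypergeometric probability of seeing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    p_value = sum(
        prob(x) for x in range(lowest, highest + 1)
        if prob(x) <= observed * (1 + 1e-9)
    )
    odds_ratio = (a * d) / (b * c) if b * c else float("inf")
    return odds_ratio, min(1.0, p_value)
```

An odds ratio above 1 with a small p-value would suggest that a cell-type category is over-represented among the transcripts of interest.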

T-Scope™ Big Data Analysis

The T-Scope™ algorithm may be configured for cross-examining gene expression signatures of a given sample with a database of non-hematopoietic cell types (e.g., as described by Hubbard et al., “Analysis of Gene Expression from Systemic Lupus Erythematosus Synovium Reveals Unique Pathogenic Mechanisms” [Abstract], Annual Meeting of the American College of Rheumatology; June 2019; Chicago, IL, which is incorporated herein by reference in its entirety). T-Scope™ may comprise a database of 704 transcripts allocated to 45 independent categories. Transcripts detected in the sample are matched to one of the cellular categories within the T-Scope™ tool to derive further insights on tissue cell activity. T-Scope™ can be used downstream of the BIG-C® (Biologically Informed Gene-Clustering) tool to understand which tissue cell types are present. In conjunction with I-Scope™ (which provides information related to immune cells), T-Scope™ can be performed to provide a complete view of all possible cell activity in a given sample.

T-Scope™ addresses the need to understand the involvement of specific tissue cells for a given disease state. While it is helpful to understand the relative up-regulation and down-regulation at the gene expression level, it is even more informative to understand specifically in which cells this is occurring. T-Scope™ may be configured by downloading a set of approximately 10,000 tissue enriched and 8,000 cell line enriched genes from the Human Protein Atlas along with their tissue or cell line designation. Genes differentially expressed in hematopoietic cell datasets are removed and kidney specific genes are added from the GEO repository. T-Scope™ may function by restricting the analysis to genes of known tissue cell heritage and allow for cross-checking against purified single-cell experiments or datasets. The cross-check confirms and categorizes specific transcript signatures to the 45 tissue cell sub-categories (Adipose Tissue, Adrenal Gland, Breast, Cartilage, Cerebral Cortex, “Cervix, Uterine”, Chondrocyte, Colon, Dendritic, Duodenum, Endometrium, Endothelial, Epididymis, Erythrocytes, Esophagus, Fallopian Tube, Fibroblast, Gallbladder, Heart Muscle, Keratinocyte, Keratinocyte Skin, Kidney, Kidney Distal Tubules, Kidney Loop, Kidney Proximal Tubules, Kidney Tubule Duct, Kidney Tubule, Langerhans, Liver, Lung, Melanocyte, Podocyte, Prostate, Rectum, Salivary Gland, Seminal Vesicle, Skeletal Muscle, Skin, Small Intestine, Smooth Muscle, Stomach, Synoviocyte, Testis, Thyroid Gland, and Urinary Bladder), ultimately allowing for cellular activity analysis across multiple samples and disease states. When combined with BIG-C® categories, the cellular activity can be correlated to specific functions within a given tissue cell type.

The T-Scope™ algorithm may be configured to help identify types of non-hematopoietic cells in gene expression datasets. T-Scope™ may be configured by downloading approximately 10,000 tissue enriched and 8,000 cell line enriched genes from the Human Protein Atlas along with their tissue or cell line designation (e.g., available at proteinatlas.org). Genes found in more than four tissues are eliminated. Housekeeping genes described in the gene expression study by She et al. are also removed (e.g., as described by She et al., “Definition, conservation and epigenetics of housekeeping and tissue-enriched genes,” BMC Genomics 2009, 10:269, which is incorporated herein by reference in its entirety). This list is further curated by removing genes differentially expressed in 34 hematopoietic cell gene expression datasets and adding kidney specific genes from datasets downloaded from the GEO repository and processed by Ampel BioSolutions. The resulting categories of genes represent genes enriched in the following 42 tissue or cell-specific categories: adrenal gland, breast, cartilage, cerebral cortex, uterine cervix, chondrocyte, colon, duodenum, endometrium, epididymis, fallopian tube, esophagus, fibroblast, heart muscle, keratinocyte, kidney, liver, lung, melanocyte, ovary, pancreas, parathyroid gland, placenta, podocyte, prostate, rectum, salivary gland, seminal vesicle, skeletal muscle, skin, small intestine, smooth muscle, stomach, synoviocyte, testis, kidney loop of Henle, kidney proximal tubule, kidney distal tubule, and kidney collecting duct.
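
The curation rules in the paragraph above (drop genes found in more than four tissues, drop housekeeping genes, drop genes differentially expressed in hematopoietic datasets) can be expressed as a simple filter. The sketch below is illustrative only; the data layout, the function name, and the placeholder gene symbols in the usage example are hypothetical, and the addition of kidney-specific genes from GEO is not shown.

```python
def curate_tissue_genes(gene_tissues, housekeeping, hematopoietic, max_tissues=4):
    """Apply the three exclusion rules to a gene-to-tissue annotation.

    gene_tissues:  {gene: set of tissues or cell lines in which it is enriched}
    housekeeping:  set of housekeeping genes to remove
    hematopoietic: set of genes differentially expressed in hematopoietic cells
    Returns {gene: sorted list of tissues} for the genes that survive.
    """
    curated = {}
    for gene, tissues in gene_tissues.items():
        if len(tissues) > max_tissues:
            continue  # found in more than four tissues: not tissue-specific
        if gene in housekeeping or gene in hematopoietic:
            continue  # broadly expressed or immune-cell associated
        curated[gene] = sorted(tissues)
    return curated


# Placeholder gene symbols, for illustration only.
annotations = {
    "GENE_A": {"liver"},
    "GENE_B": {"liver", "lung", "skin", "colon", "kidney"},
    "GENE_C": {"kidney"},
    "GENE_D": {"skin", "lung"},
}
kept = curate_tissue_genes(annotations, housekeeping={"GENE_C"}, hematopoietic={"GENE_D"})
```

In this example only GENE_A is retained.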

A sample T-Scope™ workflow may comprise the following steps. First, candidate genes are identified from SLE (systemic lupus erythematosus) differential expression datasets potentially associated with tissue cell expression. Second, using publicly available databases, expression signatures associated with potential tissue cell activity are identified. Third, signatures are cross-referenced with microarray, scRNAseq or RNAseq experiments. Fourth, transcripts are categorized into 45 tissue cell sub-categories and cellular expression is assessed across different samples and disease states. Using T-Scope™ in combination with I-Scope™, identification of cells post-DE-analysis may be performed.

Gene Set Variation Analysis (GSVA)

Gene Set Variation Analysis (GSVA) algorithms may be performed (for example, as described in Catalina et al., 2019, Communications Biology, “Gene expression analysis delineates the potential roles of multiple interferons in systemic lupus erythematosus,” which is incorporated herein by reference in its entirety) to determine enrichment of signaling pathways in individual patient samples. Gene set variation analysis may be performed using an open source software package for the coding language R available at the R Bioconductor (bioconductor.org), e.g., as described by Hänzelmann et al. (“GSVA: gene set variation analysis for microarray and RNA-Seq data,” BMC Bioinformatics, 2013, which is incorporated herein by reference in its entirety). Modules of genes with which to interrogate the datasets may be developed. Modules of genes determined to represent a specific signaling pathway or process may be identified (e.g., using publicly available datasets). For example, the IFNB1 signaling pathway is taken from a publicly available gene expression dataset of peripheral blood cells treated with IFNB1 in vitro. Genes co-expressed in this dataset (genes either all increased or decreased compared to control treated peripheral blood) are used to create modules of genes representing the IFNB1 signaling pathway, and GSVA is used to determine the enrichment of this set of genes and hence the IFNB1 signaling pathway in individual patient and control samples.
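
To make the notion of per-sample module enrichment concrete, the sketch below scores each sample by the mean z-score of the genes in a module. This is a deliberately simplified stand-in and is not the GSVA statistic, which uses a rank-based, kernel-smoothed random-walk score as implemented in the Bioconductor package. The function name, data layout, and placeholder gene names are hypothetical.

```python
from statistics import mean, pstdev


def module_scores(expression, module):
    """Mean z-score of module genes per sample (simplified, not GSVA).

    expression: {gene: {sample: expression value}}
    module:     iterable of gene names representing a pathway
    Returns {sample: score}; higher scores suggest relative enrichment.
    """
    samples = sorted(next(iter(expression.values())))
    z_by_gene = {}
    for gene in module:
        if gene not in expression:
            continue  # module gene not measured in this dataset
        values = [expression[gene][s] for s in samples]
        mu, sd = mean(values), pstdev(values)
        if sd == 0:
            continue  # a constant gene carries no information
        z_by_gene[gene] = {s: (expression[gene][s] - mu) / sd for s in samples}
    if not z_by_gene:
        return {s: 0.0 for s in samples}
    return {s: mean(z_by_gene[g][s] for g in z_by_gene) for s in samples}


# Placeholder genes and values, for illustration only.
expression = {
    "GENE1": {"control": 1.0, "patient": 3.0},
    "GENE2": {"control": 2.0, "patient": 6.0},
    "GENE3": {"control": 5.0, "patient": 5.0},
}
scores = module_scores(expression, ["GENE1", "GENE2", "GENE3"])
```

Here the patient sample scores higher than the control sample for the module, which mirrors how an enrichment score would be read for a pathway such as IFNB1 signaling.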

A GSVA-based data analysis tool may be developed for use in analyzing specific sets of gene pathways. The GSVA-based data analysis tool (e.g., P-Scope) may use a GSVA statistical test-based tool using different sets of genes to analyze certain pathways. Such sets of genes may include, for example, human genes, mouse genes, or a combination thereof.

EXAMPLES

The following illustrative examples are representative of embodiments of the software applications, systems, and methods described herein and are not meant to be limiting in any way.

Example 1: Identification of Active Vs. Inactive SLE by Applying a Random Forest Classifier to SLE Gene Expression Data

Random forest, a high-performing classifier, may be used to perform analysis to sort through the inherent heterogeneity in raw SLE gene expression data and may be able to identify records with active versus inactive disease with a sensitivity of 85 percent and a specificity of 83 percent. Fine tuning the algorithms may be able to generate sufficient accuracy to be informative as a stand-alone estimate of disease activity. Accuracy may be assessed as the proportion of patients correctly classified across all testing folds.
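
The performance measures named above can be computed from pooled predictions across all testing folds. The following minimal sketch assumes two label values, "active" and "inactive", with "active" treated as the positive class; the function name is hypothetical.

```python
def classification_metrics(true_labels, predicted_labels, positive="active"):
    """Sensitivity, specificity, and accuracy from paired label lists."""
    tp = fp = tn = fn = 0
    for truth, predicted in zip(true_labels, predicted_labels):
        if truth == positive:
            if predicted == positive:
                tp += 1   # active patient correctly called active
            else:
                fn += 1   # active patient missed
        else:
            if predicted == positive:
                fp += 1   # inactive patient called active
            else:
                tn += 1   # inactive patient correctly called inactive
    return {
        "sensitivity": tp / (tp + fn) if tp + fn else float("nan"),
        "specificity": tn / (tn + fp) if tn + fp else float("nan"),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
    }
```

Sensitivity is the fraction of active patients correctly identified, specificity is the fraction of inactive patients correctly identified, and accuracy is the proportion of all patients correctly classified.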

SLE is a complex, multisystem autoimmune disease that continues to be a major diagnostic as well as therapeutic challenge. There are no definitive diagnostic tools available to determine whether a patient has SLE, and diagnostic approaches in SLE have not changed in decades. Physicians still rely on clinical evaluation and a few laboratory tests, including measurement of autoantibodies and complement levels. Despite the wealth of genetic, epigenetic, and gene expression data that has emerged in the past few years at both the patient and cellular levels, none has been integrated to produce a predictive tool that can be used to evaluate an individual SLE patient.

In SLE, defects in central and peripheral tolerance allow for activation of self-reactive B cell clones and differentiation into plasmablasts/plasma cells (PCs) that secrete autoantibodies, which in turn mediate tissue damage. Genome wide association studies (GWAS) have identified numerous polymorphisms in regions encoding genes or regulatory regions that may influence B cell function, suggesting that a general state of B cell hyper-responsiveness may contribute to SLE pathogenesis. Autoantibody-containing immune complexes stimulate production of type 1 interferon, a hallmark of infection that is also observed in SLE patients, regardless of disease activity. In addition to B cells and PCs, various T cell populations also exert differential effects on SLE pathogenesis. T follicular helper cell subsets contribute to B cell activation and differentiation, and abnormal T cell receptor signaling is also thought to lead to hyper-responsive autoreactive T cell activity. Furthermore, defects in regulatory T cells, partially secondary to deficient IL-2 production, result in faulty modulation of immune activity and inflammation.

Myeloid cells (MC) also play a role in SLE pathogenesis. Factors present in the local microenvironment may cause macrophages (Mϕ) to undergo extreme changes in transcriptional regulation in a process called Mϕ polarization. Overabundance of proinflammatory M1 Mϕ and decreased expression of markers for anti-inflammatory M2 Mϕ are detected in both lupus-prone mice and SLE patients, and therapeutic stimulation of M2 polarization significantly decreases disease severity in murine SLE. Experimental intervention in M2 polarization as well as microRNA array profiling suggest that abnormalities in M2 Mϕ may contribute to SLE severity. Low-density granulocytes (LDGs) are abnormal neutrophil-like cells that appear in the blood of lupus patients as well as in many other disease states. Although their involvement in SLE has not been studied as extensively as that of other cell types, LDGs have already been linked to kidney disease, vascular disease, and other manifestations in lupus patients. LDG modules may be generated by WGCNA meta-analysis (manuscript in preparation), and r values indicate separation from control and SLE neutrophils.

To date, however, it has been difficult to relate gene expression profiles to SLE disease activity successfully. Many attempts have been made to characterize SLE patients by gene expression, including efforts to identify individual genes that predicted subsequent flares, and the determination of a discrete group of differentially expressed (DE) genes that may be found in subjects with SLE renal disease. Other studies have extensively analyzed pediatric lupus samples and attempted to associate modules of expressed genes with disease manifestations in children. Despite these advances, none of the data has yet provided an approach with sufficient predictive value to utilize in decision making about individual subjects with SLE, nor has any cellular phenotype been independently verified to be able to distinguish a patient with active SLE from one with inactive disease. This distinction is critical both for patient evaluation and for clinical trials, as most SLE trials are aimed at controlling disease activity.

Therefore, in order to advance personalized treatment of SLE patients, the use of big data analytical techniques, including machine learning, may be useful to understand the relationships between cell subsets, gene expression, and disease activity. Machine learning describes a wide range of computational methods which allow researchers to harness complex data and develop self-trained strategies to predict the characteristics of new samples, such as whether a given SLE patient has active or inactive disease. When applied to high-throughput bioinformatics data, machine learning algorithms may identify the gene expression features with the most utility for the task at hand and may thereby provide insights into disease pathogenesis.

Conventional bioinformatics methods may be used in conjunction with unsupervised and supervised machine learning techniques to: (1) test the potential of raw gene expression data and modules of genes to classify subjects with active and inactive SLE, (2) determine the optimum classifier or classifiers, and (3) understand the combinations of variables that best facilitate classification.

Provided herein are machine learning approaches that integrate gene expression data from multiple SLE data sets and use it to predict active disease. Both raw whole blood gene expression data and informative gene modules generated by Weighted Gene Co-expression Network Analysis from purified leukocyte populations are employed by classification algorithms. SLE whole blood gene expression data from 156 patients across three data sets are used to classify patients as having active or inactive disease as characterized by standard clinical composite outcome measures. When training and testing sets are formed by holding out entire data sets, machine learning algorithms using raw gene expression data have an average classification accuracy of only 53 percent. However, converting this gene expression data to module enrichment improves classification accuracy to 71 percent. When training and testing sets are formed by mixing patients from the three data sets, module enrichment remains at 70 percent classification accuracy, whereas classification accuracy using raw gene expression increases to a mean of 79 percent. The best overall performance comes from the random forest classifier, which has a predictive accuracy of 84 percent.
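
The two ways of forming training and testing sets described above can be sketched as follows: holding out an entire data set at a time, and mixing patients from all data sets into folds. The function names are hypothetical, and the number of folds `k` in the mixed scheme is an assumption, since this passage does not state it.

```python
import random


def leave_one_dataset_out(records):
    """records: list of (patient_id, dataset_id) pairs.
    Yields (held_out_dataset, train_ids, test_ids), one split per data set."""
    for held_out in sorted({dataset for _, dataset in records}):
        train = [p for p, d in records if d != held_out]
        test = [p for p, d in records if d == held_out]
        yield held_out, train, test


def mixed_folds(records, k, seed=0):
    """Shuffle patients from all data sets together and yield k
    (train_ids, test_ids) splits; k is an assumed parameter."""
    ids = [p for p, _ in records]
    random.Random(seed).shuffle(ids)
    folds = [ids[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [p for j, fold in enumerate(folds) if j != i for p in fold]
        yield train, test
```

Holding out a whole data set tests whether a classifier generalizes across studies and platforms, whereas mixed folds let the classifier see examples from every study during training, which is consistent with the higher raw-expression accuracy reported for the mixed scheme.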

Gene expression data may be compiled as follows. Publicly available gene expression data and corresponding phenotypic data may be mined from the Gene Expression Omnibus. Raw data sources for purified cell populations are as follows: GSE10325 (CD4: 8 SLE, 9 HC; CD19: 10 SLE, 8 HC; CD33: 9 SLE, 9 HC); GSE26975 (10 SLE LDG, 10 SLE Neutrophil, 9 HC Neutrophil); GSE38351 (CD14: 8 SLE, 12 HC). Raw data sources for SLE whole blood gene expression are as follows: GSE39088 (24 active, 13 inactive); GSE45291 (35 active, 257 inactive); GSE49454 (23 active, 26 inactive). 35 randomly sampled inactive patients may be taken from GSE45291 to avoid a major imbalance between active and inactive SLE patients. Active SLE may be defined as having an SLE Disease Activity Index (SLEDAI) of 6 or greater.

Quality control and normalization may be performed as follows. Statistical analysis may be conducted using R and relevant Bioconductor packages. Non-normalized arrays may be inspected for visual artifacts or poor hybridization using Affy QC plots. PCA plots may be used to inspect the raw data files for outliers. Data sets culled of outliers may be cleaned of background noise and normalized using RMA, GCRMA, or NEQC where appropriate. Data sets may then be filtered to remove probes with low intensity values and probes without gene annotation data. WB gene expression data sets may be filtered to only include genes that passed quality control in all data sets. At this juncture, differential expression (DE) analysis and Weighted Gene Co-expression Network Analysis (WGCNA) may be carried out on data sets. WB gene expression data sets may then be further processed before machine learning analysis. WB gene expression values may be centered and scaled to have zero-mean and unit-variance within each data set, and the standardized expression values from each data set may be joined for classification.
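By way of non-limiting illustration, the within-data-set centering and scaling described above may be sketched as follows. The sketch is written in Python rather than R, and the function names, the use of the population standard deviation, and the toy values are assumptions made for illustration only.

```python
from statistics import mean, pstdev


def standardize(values):
    """Center and scale one gene's values within a single data set."""
    mu = mean(values)
    sd = pstdev(values)  # population SD is an assumption; the text only says "unit-variance"
    if sd == 0:
        return [0.0] * len(values)  # constant gene: nothing to scale
    return [(v - mu) / sd for v in values]


def standardize_and_join(datasets):
    """datasets maps study name -> {gene: [expression value per patient]}.

    Returns {gene: standardized values from every study, concatenated}.
    Only genes present in all studies are kept.
    """
    shared = set.intersection(*(set(genes) for genes in datasets.values()))
    joined = {gene: [] for gene in sorted(shared)}
    for study in sorted(datasets):
        for gene in joined:
            joined[gene].extend(standardize(datasets[study][gene]))
    return joined


# Hypothetical toy values, not data from the cited studies.
toy = {
    "study_a": {"GENE1": [1.0, 2.0, 3.0], "GENE2": [10.0, 10.0, 13.0]},
    "study_b": {"GENE1": [100.0, 300.0], "GENE2": [5.0, 7.0], "GENE3": [0.0, 1.0]},
}
joined = standardize_and_join(toy)
```

Because each study is standardized separately before joining, a gene measured on very different intensity scales in two studies contributes comparable values to the joined matrix.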

Differential expression (DE) analysis may be performed as follows. Normalized expression values may be variance corrected using local empirical Bayesian shrinkage, and DE may be assessed using the LIMMA package. Resulting p-values may be adjusted for multiple hypothesis testing using the Benjamini-Hochberg correction, which resulted in a false discovery rate (FDR). Significant genes within each study may be filtered to retain DE genes with an FDR<0.2, which may be considered statistically significant. The FDR may be selected a priori to diminish the number of genes that may be excluded as false negatives.
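By way of non-limiting illustration, the Benjamini-Hochberg adjustment and the FDR<0.2 filter described above may be sketched as follows. The function name and the toy p-values are assumptions for illustration; the analysis described herein relies on R packages rather than this code.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that the adjusted
    # values stay monotone in the rank of the raw p-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values for five genes.
pvals = [0.001, 0.008, 0.039, 0.041, 0.6]
fdr = benjamini_hochberg(pvals)
# Retain genes with FDR < 0.2, the threshold stated in the text.
significant = [i for i, q in enumerate(fdr) if q < 0.2]
```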

Weighted Gene Co-expression Network Analysis (WGCNA) may be performed as follows. Log2-normalized microarray expression values from purified CD4, CD14, CD19, CD33, and low density granulocyte (LDG) populations may be used as input to WGCNA to conduct an unsupervised clustering analysis, resulting in co-expression “modules,” or groups of densely interconnected genes which may correspond to comparably regulated biologic pathways. For each experiment, an approximately scale-free topology matrix (TOM) may be first calculated to encode the network strength between probes. Probes may be clustered into WGCNA modules based on TOM distances. Resultant dendrograms of correlation networks may be trimmed to isolate individual modular groups of probes by partitioning around medoids and labeled using color assignments based on module size. Expression profiles of genes within modules may be summarized by a module eigengene (ME), which is analogous to the module's first principal component. MEs act as characteristic expression values for their respective modules and may be correlated with sample traits such as SLEDAI or cell type. This may be done by Pearson correlation for continuous or semi-continuous traits and by point-biserial correlation for dichotomous traits.
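By way of non-limiting illustration, correlating a module eigengene with a sample trait may be sketched as follows. The sketch relies on the fact that a point-biserial correlation equals a Pearson correlation computed with the dichotomous trait coded as 0 and 1. The function names and toy values are assumptions for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def point_biserial(values, in_group):
    """Point-biserial correlation: Pearson with the trait coded 0/1."""
    return pearson(values, [1.0 if flag else 0.0 for flag in in_group])


# Hypothetical module eigengene values for four samples.
eigengene = [0.1, 0.2, 0.3, 0.4]
sledai = [2, 4, 6, 8]                   # continuous or semi-continuous trait
is_case = [False, False, True, True]    # dichotomous trait

r_sledai = pearson(eigengene, sledai)
r_case = point_biserial(eigengene, is_case)
```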

WGCNA modules from CD4, CD14, CD19, and CD33 cells may be tested for correlation to SLEDAI. SLEDAI information may not be available for the LDG modules, so the two modules provided are descriptive of LDGs compared to SLE neutrophils and HC neutrophils. Plasma cell modules may be generated by differential expression analysis and not WGCNA, but may be included because of the established importance of plasma cells in SLE pathogenesis.

Gene Set Variation Analysis (GSVA)-based enrichment of expression data may be performed as follows. The GSVA R package may be used as a non-parametric method for estimating the variation of pre-defined gene sets in SLE WB gene expression data sets. Standardized expression values from WB data sets may be used to test for enrichment of cell-specific WGCNA gene modules using the Single-sample Gene Set Enrichment Analysis (ssGSEA) method, which scores single samples in isolation and is thus shielded from technical variation within and among data sets. Statistical analysis of GSVA enrichment scores may be done by Spearman correlation or Welch's unequal variances t-test, where appropriate. GSVA may be performed on three SLE WB datasets using 25 WGCNA modules made from purified SLE cells with correlation or published relationship to SLEDAI, per Table 1. In the top line, orange: active patient; black: inactive patient. LDG: low-density granulocyte; PC: plasma cell.

Machine learning algorithms and parameters may be developed as follows. Three distinct machine learning algorithms may be employed to test biased and unbiased approaches to microarray data analysis. The biased approach involved GSVA enrichment of disease-associated, cell-specific modules, and the unbiased approach employed all available gene expression data in the WB. An elastic generalized linear model (GLM), k-nearest neighbors classifier (KNN), and random forest (RF) classifier may be deployed to classify active and inactive SLE patients and determine whether gene expression may serve as a general predictor of disease activity. GLM, KNN, and RF may be deployed using the glmnet, caret, and randomForest R packages, respectively.

GLM carries out logistic regression with a tunable elastic penalty term to find a balance between the L1 (lasso) and L2 (ridge) penalties and thereby facilitate variable selection. For our predictions, the elastic penalty may be set to 0.9, specifying a penalty that is 90% lasso and 10% ridge in order to generate sparse solutions. KNN classifies unknown samples based on their proximity to a set number k of known samples. K may be set to 5% of the size of the training set. If the initial value of k is even, 1 may be added in order to avoid ties. RF generates 500 decision trees which vote on the class of each sample. The Gini impurity index, a measure of misclassification error, may be used to evaluate the importance of variables. In addition to these three approaches, pooled predictions may be assigned based on the average class probabilities across the three classifiers.
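By way of non-limiting illustration, three of the parameter choices described above may be sketched as follows: the elastic penalty with a mixing value of 0.9, the rule for choosing an odd k from the training set size, and the pooling of class probabilities. The penalty is written in the form used by the glmnet package; the rounding rule for k, the 0.5 decision threshold, and all function names are assumptions for illustration.

```python
def elastic_net_penalty(coefs, lam, alpha=0.9):
    """glmnet-style penalty: lam * (alpha * L1 + (1 - alpha) / 2 * squared L2).

    alpha=1 is pure lasso, alpha=0 is pure ridge; 0.9 is the value in the text.
    """
    l1 = sum(abs(b) for b in coefs)
    l2_squared = sum(b * b for b in coefs)
    return lam * (alpha * l1 + (1.0 - alpha) / 2.0 * l2_squared)


def choose_k(n_train, fraction=0.05):
    """k is about 5% of the training set size, bumped to odd to avoid ties."""
    k = max(1, round(n_train * fraction))  # rounding to nearest is an assumption
    if k % 2 == 0:
        k += 1
    return k


def pooled_probability(probabilities):
    """Average the per-classifier probabilities of the 'active' class."""
    return sum(probabilities) / len(probabilities)


penalty = elastic_net_penalty([1.0, -2.0, 0.0], lam=1.0)
k_small = choose_k(140)   # 5% of 140 is 7, already odd
k_large = choose_k(156)   # 5% of 156 rounds to 8, so 1 is added
# Hypothetical probabilities from GLM, KNN, and RF for one patient.
pooled = pooled_probability([0.9, 0.4, 0.8])
pooled_call = "active" if pooled >= 0.5 else "inactive"  # threshold is an assumption
```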

Validation approaches may be performed as follows. The performance of each machine learning algorithm may be evaluated by 2 different forms of cross-validation. First, a random 10-fold cross-validation may be carried out by randomly assigning each patient to one of 10 groups. Next, as the data came from three separate studies, leave-one-study-out cross-validation may also be performed to determine the effects of systematic technical differences among data sets on classification performance. For each pass of cross-validation, one fold or study may be held out as a test set, and the classifiers may be trained on the remaining data. Accuracy may be assessed as the proportion of patients correctly classified across all testing folds. Performance metrics such as sensitivity and specificity may be assessed after cross-validation by agglomerating class probabilities and assignments from each fold or study. Receiver Operating Characteristic (ROC) curves may be generated using the pROC R package.
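By way of non-limiting illustration, the two cross-validation schemes and the summary metrics described above may be sketched as follows. The patient counts per study are taken from the data sources listed herein (37, 70, and 49 patients); the patient identifiers, the shuffle-and-deal fold assignment (which yields near-equal fold sizes), the fixed seed, and the toy predictions are assumptions for illustration.

```python
import random
from collections import Counter


def random_folds(patient_ids, n_folds=10, seed=0):
    """Randomly assign each patient to one of n_folds groups."""
    ids = list(patient_ids)
    random.Random(seed).shuffle(ids)
    return {pid: i % n_folds for i, pid in enumerate(ids)}


def leave_one_study_out(study_of):
    """Yield (held_out_study, train_ids, test_ids), one pass per study."""
    for held_out in sorted(set(study_of.values())):
        train = [p for p, s in study_of.items() if s != held_out]
        test = [p for p, s in study_of.items() if s == held_out]
        yield held_out, train, test


def summarize(truth, predicted):
    """Accuracy, sensitivity, and specificity; True means active disease."""
    tp = sum(1 for t, p in zip(truth, predicted) if t and p)
    tn = sum(1 for t, p in zip(truth, predicted) if not t and not p)
    fp = sum(1 for t, p in zip(truth, predicted) if not t and p)
    fn = sum(1 for t, p in zip(truth, predicted) if t and not p)
    return {
        "accuracy": (tp + tn) / len(truth),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Hypothetical patient identifiers; only the counts come from the text.
sizes = {"GSE39088": 37, "GSE45291": 70, "GSE49454": 49}
study_of = {f"{s}_{i}": s for s, n in sizes.items() for i in range(n)}

folds = random_folds(study_of)
fold_sizes = Counter(folds.values())
splits = list(leave_one_study_out(study_of))

# Hypothetical pooled test-fold labels and predictions.
metrics = summarize([True, True, True, False, False],
                    [True, True, False, False, True])
```

In the leave-one-study-out scheme no patient from the held-out study is seen during training, which is why that scheme exposes technical differences among data sets.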

Gene expression results may be obtained and analyzed as follows. Before employing machine learning techniques, it may be necessary to first assess whether conventional bioinformatics approaches may satisfactorily separate active SLE patient samples from those from inactive patients. DE analysis of active patient samples versus inactive patients in each whole blood study revealed major differences among data sets and considerable heterogeneity within data sets. First, the 100 most significant DE genes by FDR in each study may be used to carry out hierarchical clustering of active and inactive patient samples. Active patients separated from inactive patients in GSE45291, but separated with mixed results in GSE39088 and GSE49454.

Next, the lists of genes may be compared for commonalities. Out of 6,440 unique DE genes from the three studies, 5,170 genes are unique to one study, 1,234 are shared by two studies, and 36 are shared by all three studies, with a minimal overlap of the 100 most significant genes by FDR in each study. The only overlaps among the top 100 DE genes in each study by FDR are: TWY3 and EHBP1, shared between GSE39088 and GSE49454; and LZIC, shared between GSE39088 and GSE45291.

Furthermore, the fold change distributions of the 100 most significant DE genes in each study varied considerably. In GSE39088, 94 of the 100 most significant genes may be downregulated in active patients; in GSE45291, all of the top 100 genes may be upregulated in active patients; and in GSE49454, the top 100 genes may be more evenly distributed (41 up, 59 down). The three data sets are comprised of different patient populations and may be collected on different microarray platforms per Table 4. Still, the heterogeneity is striking. The lack of commonality among the genes most descriptive of active and inactive patients in each data set already casts doubt on whether active and inactive patients from different data sets may separate cleanly.

Patients from each study may then be joined to evaluate whether unsupervised techniques may separate active patients from inactive patients. Hierarchical clustering on the 297 unique most significant DE genes by FDR showed considerable heterogeneity, and active patients and inactive patients did not consistently separate, per the map of the top 100 DE genes by FDR from each study (combined total of 297 unique genes from the three studies) expressed in all patients. If gene expression has the potential to identify active SLE patients, conventional bioinformatics techniques failed to harness that potential, highlighting the need for more advanced algorithms.

Patterns of enrichment of WGCNA modules that are derived from isolated cell populations of WB and correlated to the SLEDAI disease activity measure may be more useful than gene expression across studies to identify active versus inactive lupus patients. To characterize the relationships between SLE gene signatures from various peripheral cellular subsets and disease activity, WGCNA may be used to generate co-expression gene modules from purified populations of cells from subjects with active SLE, which may subsequently be tested for enrichment in whole blood of other SLE subjects. WGCNA analysis of leukocyte subsets resulted in several gene modules with significant Pearson correlations to SLEDAI (all |r| >0.47, p<0.05). CD4, CD14, CD19, and CD33 cells had 3, 6, 8, and 4 significant modules, respectively, per Table 1. Two low-density granulocyte (LDG) modules may be created by performing WGCNA analysis of LDGs along with either SLE neutrophils or HC neutrophils and merging the modules most strongly expressed by LDGs. Two plasma cell (PC) modules may be created by using the most increased and decreased transcripts of isolated SLE plasma cells compared to SLE naïve and memory B cells.

Gene Ontology (GO) analysis of the genes within each module showed that some processes, such as those related to interferon signaling, RNA transcription, and protein translation, are shared among cell types, whereas other processes may be unique to certain cell types (Table 1) and may be used to better classify patients.

To characterize the relationships between SLE gene modules from cell subsets and disease activity in greater detail, GSVA enrichment may be performed using the 25 cell-specific gene modules in WB from 156 SLE patients (82 active, 74 inactive), per Table 4. Of the 25 cell-specific modules, 12 had enrichment scores with significant Spearman correlations to SLEDAI (p<0.05), and 14 had enrichment scores with significant differences between active and inactive patients by Welch's unequal variances t-test (p<0.05) (Table 2). Notably, each cell type produced at least one module with a significant correlation to SLEDAI in WB and at least one module with a significant difference in enrichment scores between active and inactive patients, demonstrating a relationship between disease activity in specific cellular subsets and overall disease activity in WB. However, the Spearman's rho values ranged from −0.40 to +0.36, suggesting that no one module had substantial predictive value. Furthermore, the effect sizes as measured by Cohen's d when testing active versus inactive enrichment scores ranged from −0.85 to +0.79. The CD4 Floralwhite and Orangered4 modules, which had the largest positive and negative effect sizes, respectively, showed a high degree of overlap in the enrichment scores of active and inactive patients (error bars indicate mean±standard deviation). Module enrichment scores in WB may thus be unable to fully separate active patients from inactive patients.
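By way of non-limiting illustration, an effect size of the kind reported above may be computed as sketched below. The pooled-standard-deviation form of Cohen's d is an assumption, as the variant used is not stated herein; the function name and toy enrichment scores are likewise hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Cohen's d using the pooled sample standard deviation (an assumption)."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = ((n_a - 1) * variance(group_a)
                  + (n_b - 1) * variance(group_b)) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


# Hypothetical enrichment scores for one module.
active_scores = [2.0, 4.0, 6.0]
inactive_scores = [1.0, 3.0, 5.0]
d = cohens_d(active_scores, inactive_scores)
```

In this toy case the group means differ by half of a pooled standard deviation, so the two score distributions overlap heavily even though the means differ.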

Analysis of individual disease activity-associated peripheral cellular subset gene modules may not be sufficient to predict disease activity in unrelated WB data sets, since no single module from any cell type may be able to separate active from inactive SLE patients. Although no single module had a sufficiently high predictive value, many cell-specific gene modules may be combined and optimized to predict disease activity in SLE patients. Moreover, the results emphasized the need for more advanced analysis to employ gene expression analysis to predict disease activity.

Machine learning results may be obtained and analyzed as follows. To assess the effectiveness of either raw gene expression or module-based enrichment techniques, SLE patients may be classified as active or inactive using two different methodologies: (1) a leave-one-study-out cross-validation approach or (2) a 10-fold cross-validation approach. GLM, KNN, and RF classifiers may be tasked with identifying active and inactive SLE patients based on WB gene expression data and module enrichment data. The performance of each classifier in each situation is shown in Table 2 and in the corresponding ROC curves, with the area under the curve shown in each plot. In almost all cases, the random forest classifier outperformed the GLM and KNN classifiers, although the results may not be significantly different when assessed by testing for equality of proportions (p>0.05). Pooled predictions based on the class probabilities from the three classifiers did not improve overall performance.

When cross-validating by study, the use of expression values achieved an accuracy of only 53 percent, per Table 3. This is in line with the findings that gene expression values have little to no utility when attempting to classify unfamiliar samples. When the training data and test data show little similarity to one another (e.g., they come from different data sets), the classifiers learn patterns that are unhelpful for classifying test samples. Remarkably, the use of module enrichment scores improved accuracy to approximately 70 percent.

When doing 10-fold cross-validation (Table 3), the use of raw gene expression values resulted in better performance compared to module enrichment in contrast to leave-one-study-out cross-validation. This increase in performance may be attributed to the presence of data from all three studies in both the training and test sets. In this case, the classifiers have the opportunity to learn patterns inherent to each data set, which proves useful during testing. In this circumstance, the random forest classifier may be the strongest performer with 84% accuracy (85% sensitivity, 83% specificity). The ROC curve demonstrated an excellent tradeoff between recall and fall-out.

The performance of module enrichment may not be substantially different between 10-fold cross-validation and leave-one-study-out cross-validation.

Overall, in a study-by-study approach (leave-one-study-out cross-validation), module enrichment outperformed raw gene expression. Importantly, when using the 10-fold cross-validation approach, raw gene expression outperformed module enrichment. These results indicate that disease activity classification based on raw gene expression is sensitive to technical variability, whereas classification based on module enrichment better copes with variation among data sets.

Random forest had the highest accuracy in three out of four testing scenarios. To determine whether its assessments of variable importance may be used to gain insight into directors of the identification of SLE activity, random forest classifiers may be trained on all patients from all data sets in order to identify the most important genes and modules as determined by mean decrease in the Gini impurity, a measure of misclassification error.
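By way of non-limiting illustration, the Gini impurity and the decrease in impurity produced by a single decision-tree split may be sketched as follows. A random forest's variable importance aggregates such decreases over the splits and trees that use a given variable; that aggregation is omitted here, and the function names and toy labels are assumptions for illustration.

```python
def gini_impurity(labels):
    """1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def gini_decrease(parent, left, right):
    """Impurity of the parent node minus the size-weighted child impurity."""
    n = len(parent)
    weighted = (len(left) / n * gini_impurity(left)
                + len(right) / n * gini_impurity(right))
    return gini_impurity(parent) - weighted


# Hypothetical node with four active and four inactive patients.
parent = ["active"] * 4 + ["inactive"] * 4
# A split on a weakly informative variable.
weak = gini_decrease(parent,
                     ["active"] * 3 + ["inactive"],
                     ["active"] + ["inactive"] * 3)
# A split on a perfectly informative variable.
perfect = gini_decrease(parent, ["active"] * 4, ["inactive"] * 4)
```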

The most important genes and modules identified a wide array of cell types and biological functions. The most important genes encompass such diverse functions as interferon signaling, pattern recognition receptor signaling, and control of survival and proliferation. Notably, the most influential modules skewed away from B cell-derived modules and towards T cell- and myeloid cell-derived modules. As some of these modules had overlapping genes, the variable importance experiment may be repeated with modules that may be first scrubbed of any genes that appeared in more than one module before GSVA enrichment scoring. The relative variable importance scores of the de-duplicated modules correlated strongly with those of the original modules (Spearman's rho=0.73, p=5.18E−5), indicating that module behavior may be partly driven by the overlapping genes but strongly driven by unique genes. Variable importance of top 25 individual genes. LDG: low-density granulocyte; PC: plasma cell.

CD4_Floralwhite and CD14_Yellow, two interferon-related modules which maintained high importance after deduplication, may be further analyzed to study the effect of unique genes on module importance. Gene lists may be tested for statistical overrepresentation of Gene Ontology biological process terms with FDR correction on pantherdb.org. CD4_Floralwhite did not show any significant enrichment, but CD14_Yellow, which had the highest importance after deduplication, is highly enriched for genes with the “Immune Effector Process” designation (26/77 genes, FDR=9.38E−11 by Fisher's exact test). This suggests that CD14+ monocytes express unique genes that may play important roles in the initiation of SLE activity.

Several important findings on the topic of SLE gene expression heterogeneity within and across data sets have been elucidated by this study. First, DE analysis of active vs inactive patients may be insufficient for proper classification of SLE disease activity, as systematic differences between data sets may render conventional bioinformatics techniques largely non-generalizable.

Further, WGCNA modules created from the cellular components of WB and correlated to SLEDAI disease activity may improve classification of disease activity in SLE patients. The use of cell-specific gene modules based on a priori knowledge about their relevance to disease fared slightly better than raw gene expression, as it generated informative enrichment patterns, and many of the modules maintained significant correlations with SLEDAI in WB. However, these enrichment scores failed to completely separate active patients from inactive patients by hierarchical clustering.

A comparison may then be performed between the raw expression data and the WGCNA generated modules of genes in machine learning applications. Supervised classification approaches using elastic generalized linear modeling, k-nearest neighbors, and random forest classifiers may be implemented. The trends in performance when cross-validating by study or cross-validating 10-fold speak to the potential advantages and disadvantages of diagnostic tests incorporating gene expression data or module enrichment. Cross-validating by study serves as a kind of “worst-case” scenario, whereas 10-fold cross-validation serves as a “best-case.” Attempting to classify active and inactive SLE patients from different data sets and different microarray platforms during cross-validation by study may encounter challenges, but module enrichment may be able to smooth out much of the technical variation between data sets. 10-fold cross-validation simulated a more standardized diagnostic test. Although the data may be sourced from three different microarray platforms, each cohort in the test set had many similar patients in the training set to facilitate classification by gene expression. If such a test may be reliably free from technical noise, it is likely that raw gene expression may perform very well. RNA-Seq platforms, which produce transcript counts rather than probe intensity values, may display less technical variation across data sets if all samples are processed in the same way. An optimal panel of genes may be constructed that is similar to that identified by the random forest classifier, which may result in a simple, focused test to determine disease activity by gene expression data alone.

The strong performance of the random forest classifier indicates that nonlinear, decision tree-based methods of classification may be well suited to SLE diagnostics. This may be because decision trees ask questions about new samples sequentially and adaptively in contrast to other methods that approach variables from new samples all at once. Random forest is able to “understand” to an extent that different types of patients exist and that a one-size-fits-all approach may tend to misclassify those patients whose expression patterns make them a minority within their phenotype. In other words, active patients that do not resemble the majority of active patients may still have a strong chance of being properly classified by random forest.

The random forest classifier may be used to assess the importance of each gene and module in patient classification. The most important genes may be involved in a number of functions other than interferon signaling, such as RNA processing, ubiquitylation, and mitochondrial processes. These pathways may play important roles in directing, or at least be indicative of, SLE disease activity. CD4 T cells originally contributed the most important modules, but when the modules are de-duplicated, CD14 monocyte-derived modules gained importance. This suggests that unique genes expressed by CD14 monocytes in tandem with interferon genes may prove to be informative in the study of cell-specific methods of SLE pathogenesis. Furthermore, it is important to note that modules that may be negatively associated with disease activity may be just as important in classification as positively associated modules. Further study of underrepresented categories of transcripts may enhance our understanding of SLE activity.

While creating dedicated training and test sets may be preferable to cross-validation, this approach may require a large number of samples. Although there are large numbers of publicly available gene expression profiles of SLE patients, many of these profiles are not annotated with SLEDAI data. Furthermore, some data sets which include SLEDAI data show heavy class imbalance, which impedes classification. Cross-platform expression data may be integrated toward expanding the ability to classify active and inactive SLE patients.

The machine learning models developed provide the basis of personalized medicine for SLE patients. Integration of these approaches with high-throughput patient sampling technologies may unlock the potential to develop a simple blood test to predict SLE disease activity. These approaches may also be generalized to predict other SLE manifestations, such as organ involvement. A better understanding of the cellular processes that drive SLE pathogenesis may eventually lead to customized therapeutic strategies based on patients' unique patterns of cellular activation.

Example 2: Prediction of Lupus Disease Activity by Applying Machine Learning Approaches to SLE Gene Expression Data

The integration of gene expression data to predict systemic lupus erythematosus (SLE) disease activity may be a significant challenge because of the high degree of heterogeneity among patients and study cohorts, especially those collected on different microarray platforms. Machine learning approaches may be deployed to integrate gene expression data from three SLE data sets, and may be used to classify patients as having active or inactive disease (e.g., as characterized by standard clinical composite outcome measures). Both raw whole blood gene expression data and informative gene modules generated by Weighted Gene Co-expression Network Analysis from purified leukocyte populations were employed with various classification algorithms. Classifiers were evaluated by 10-fold cross-validation across three combined data sets or by training and testing in independent data sets, the latter of which amplified the effects of technical variation. A random forest classifier achieved a peak classification accuracy of 83 percent under 10-fold cross-validation, but its performance may be severely affected by technical variation among data sets. The use of gene modules rather than raw gene expression was more robust, achieving classification accuracies of approximately 70 percent regardless of how the training and testing sets were formed. Fine tuning the algorithms and parameter sets may generate sufficient accuracy to be informative as a standalone estimate of disease activity.

SLE is a complex, multisystem autoimmune disease that continues to be a major diagnostic as well as therapeutic challenge. There may be no definitive, specific diagnostic tools available to determine whether a patient has SLE, and diagnostic approaches in SLE have not changed in decades. Physicians still rely on clinical evaluation and a few laboratory tests, including measurement of autoantibodies and complement levels. Despite the wealth of genetic, epigenetic, and gene expression data that has emerged in the past few years at both the patient and cellular levels, none has been integrated to produce a predictive tool that may be used to evaluate an individual SLE patient.

Myeloid cells (MC) also play a role in SLE pathogenesis. Factors present in the local microenvironment may cause macrophages (Mϕ) to undergo extreme changes in transcriptional regulation in a process called Mϕ polarization. Overabundance of proinflammatory M1 Mϕ and decreased expression of markers for anti-inflammatory M2 Mϕ are detected in both lupus-prone mice and SLE patients, and therapeutic stimulation of M2 polarization significantly decreases disease severity in murine SLE. Experimental intervention in M2 polarization as well as microRNA array profiling suggest that abnormalities in M2 Mϕ may contribute to SLE severity. Low-density granulocytes (LDGs) are abnormal neutrophil-like cells that appear in the blood of lupus patients as well as in many other disease states. Although their involvement in SLE has not been studied as extensively as that of other cell types, LDGs have already been linked to kidney disease, vascular disease, and other manifestations in lupus patients.

To date, however, it has been difficult to relate gene expression profiles to SLE disease activity successfully. Gene expression data analysis approaches may have challenges with producing sufficient predictive value to utilize in decision making about individual subjects with SLE. Furthermore, no cellular phenotype has been independently verified to be able to distinguish a patient with active SLE from one with inactive disease. This distinction is critical both for patient evaluation and for clinical trials, as most SLE trials are aimed at controlling disease activity.

Therefore, in order to advance personalized treatment of SLE patients, the use of big data analytical techniques, including machine learning, may be useful to understand the relationships between cell subsets, gene expression, and disease activity. Machine learning describes a wide range of computational methods to harness complex data and develop self-trained strategies to predict the characteristics of new samples, such as whether a given SLE patient has active or inactive disease. Machine learning techniques may be used, for example, to characterize lupus disease risk and identify new biomarkers based on genotypic data or urine tests. When applied to high-throughput transcriptomic data, machine learning algorithms may be used to identify the gene expression features with the most utility to identify subjects with higher degrees of disease activity and may also provide insights into disease pathogenesis.

Bioinformatics methods may be applied in conjunction with unsupervised and supervised machine learning techniques to: (1) test the potential of raw gene expression data and modules of genes to classify subjects with active and inactive SLE, (2) determine the optimum classifier or classifiers, and (3) understand the combinations of variables that best facilitate classification.

Gene expression data may be analyzed to assess SLE disease activity as follows. Before employing machine learning techniques, first an assessment was made regarding whether bioinformatics approaches may accurately separate active SLE patient samples from those obtained from inactive patients. First, three whole blood (WB) data sets (Table 5) were filtered to include only those genes which passed quality control and filtering in all three studies. Table 5 shows data sources for active (SLEDAI≥6) and inactive (SLEDAI<6) SLE WB gene expression. Data sets are listed by Gene Expression Omnibus (GEO) accession numbers. N Active/Inactive: number of active/inactive patients in data set. Range, mean, and standard deviation of SLEDAI values in each data set are provided.

TABLE 5

Accession of records by microarray platform, number of active and inactive records, SLEDAI range, and SLEDAI mean

Accession | Microarray Platform | N Active | N Inactive | SLEDAI Range | SLEDAI Mean (SD)
GSE39088 | GPL570 (Affymetrix HG-U133 + 2.0) | 24 | 13 | 2-12 | 6.8 (2.7)
GSE45291 | GPL13158 (Affymetrix HG-U133 + PM) | 35 | 35 | 0-11 | 4.3 (3.5)
GSE49454 | GPL10558 (Illumina HumanHT-12 v4.0) | 23 | 26 | 0-26 | 7.7 (7.2)

Differential expression (DE) analysis of active versus inactive patient samples with the remaining filtered 7,848 genes revealed major differences among data sets and considerable heterogeneity within data sets. GSE39088 had only 176 DE genes with a false discovery rate (FDR) less than 0.2 and none with FDR<0.05; GSE45291 had 5850 DE genes with FDR<0.2 and 4837 with FDR<0.05; GSE49454 had 1710 DE genes with FDR<0.2 and 72 with FDR<0.05 (Data S1).

Hierarchical clustering was carried out on each study with all genes, DE genes with FDR<0.2, and DE genes with FDR<0.05 to determine whether active and inactive patients may separate into two clusters. The Adjusted Rand Index (ARI) was used to compare these clusterings to the known status of the patients. When using all genes, all three studies had ARIs near zero, indicating that clustering separated active and inactive patients no better than random chance (Table 6). Table 6 shows Adjusted Rand Index of Unsupervised Hierarchical Clustering Compared to Known Disease Activity. Data sets are listed by GEO accession numbers. GSE39088 had no genes with FDR<0.05. The “Three Consistent DE Genes” are DNAJC13, IRF4, and RPL22.

TABLE 6

Adjusted Rand Index of Unsupervised Hierarchical Clustering Compared to Known Disease Activity

Data Set and Gene Filter | Adjusted Rand Index
GSE39088 | −0.04
GSE39088; FDR <0.2 | 0.19
GSE39088; FDR <0.05 | N/A
GSE45291 | 0.03
GSE45291; FDR <0.2 | −0.01
GSE45291; FDR <0.05 | 0.94
GSE49454 | 0.04
GSE49454; FDR <0.2 | 0.14
GSE49454; FDR <0.05 | 0.14
All Studies | 0.03
All Studies; Three Consistent DE Genes | 0.05

GSE39088 and GSE49454 showed only mild improvement after filtering genes, whereas GSE45291 attained an ARI of 0.94 when using genes with FDR<0.05.
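The comparison of a clustering to known patient status can be sketched as follows. This is a minimal, assumption-laden illustration of the Adjusted Rand Index only: the cluster labels are taken as given (the hierarchical clustering step itself is not shown), and the function name and example labels are hypothetical.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index between two labelings of the same samples."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("labelings must have the same length")
    n = len(labels_true)
    if n < 2:
        # With fewer than two samples there are no pairs to compare.
        return 1.0
    # Contingency counts: samples in each (known status, cluster) pair.
    pair_counts = Counter(zip(labels_true, labels_pred))
    true_counts = Counter(labels_true)
    pred_counts = Counter(labels_pred)

    sum_pairs = sum(comb(c, 2) for c in pair_counts.values())
    sum_true = sum(comb(c, 2) for c in true_counts.values())
    sum_pred = sum(comb(c, 2) for c in pred_counts.values())

    expected = sum_true * sum_pred / comb(n, 2)
    maximum = (sum_true + sum_pred) / 2
    if maximum == expected:
        # Degenerate case, e.g. a single group in both labelings.
        return 1.0
    return (sum_pairs - expected) / (maximum - expected)


# Hypothetical known status (A = active, I = inactive) and cluster labels.
status = ["A", "A", "A", "I", "I", "I"]
perfect_clusters = [0, 0, 0, 1, 1, 1]
mixed_clusters = [0, 0, 1, 1, 0, 1]
ari_perfect = adjusted_rand_index(status, perfect_clusters)
ari_mixed = adjusted_rand_index(status, mixed_clusters)
```

A value near 1 indicates that the clusters reproduce the known status, whereas a value near zero indicates agreement no better than chance.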

Next, the lists of genes were compared for commonalities. Out of 6,440 unique DE genes from the three studies, 5,170 genes were unique to one study, 1,234 were shared by two studies, and 36 were shared by all three studies. Of these 36 genes, only three had consistent fold changes across all studies (DNAJC13 and IRF4 upregulated; RPL22 downregulated). Rank-rank Hypergeometric Overlap (RRHO) was next applied as a threshold-free comparison of the studies (as described by, for example, Plaisier et al., “Rank-rank hypergeometric overlap: identification of statistically significant overlap between gene-expression signatures,” Nucleic Acids Res. 38, e169, which is incorporated by reference herein in its entirety). All genes that were tested for differential expression were sorted by FDR from most significantly overexpressed to most significantly underexpressed and divided into 36 groups of 218 genes each. Among the three studies, the ranked gene lists failed to demonstrate significant overlap of the most overexpressed and underexpressed genes (FIG. 10A). The three data sets comprised different patient populations and were collected on different microarray platforms (Table 5); still, the heterogeneity is striking. The lack of commonality among the genes most descriptive of active and inactive patients in each data set casts doubt on whether active and inactive patients from different data sets can be separated cleanly.
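The bookkeeping behind the overlap counts and the rank grouping can be sketched as follows. This is an illustrative sketch only: it does not implement the hypergeometric testing of RRHO, and the function names, toy gene identifiers, and placeholder ranked list are hypothetical.

```python
from collections import Counter


def sharing_counts(de_gene_sets):
    """Map 'number of studies in which a gene is DE' to 'number of such genes'."""
    membership = Counter(gene for genes in de_gene_sets for gene in set(genes))
    return Counter(membership.values())


def split_into_bins(ranked_genes, n_bins):
    """Split a ranked gene list into n_bins consecutive groups of equal size."""
    if n_bins <= 0 or len(ranked_genes) % n_bins != 0:
        raise ValueError("ranked list length must be divisible by n_bins")
    size = len(ranked_genes) // n_bins
    return [ranked_genes[i:i + size] for i in range(0, len(ranked_genes), size)]


# Toy DE gene sets for three studies (hypothetical identifiers).
study_a = {"G1", "G2", "G3"}
study_b = {"G2", "G3", "G4"}
study_c = {"G3", "G5"}
counts = sharing_counts([study_a, study_b, study_c])

# Placeholder ranked list with one entry per tested gene (7,848 in the text).
ranked = list(range(7848))
bins = split_into_bins(ranked, 36)
```

With 7,848 tested genes, dividing the ranked list into 36 groups yields 218 genes per group, matching the grouping described above.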

Patients from each study were then joined to evaluate whether unsupervised techniques can separate active patients from inactive patients. Expression profiles from each study were first normalized to have zero mean and unit variance. FIG. 10B shows that even the three consistent DE genes (DNAJC13, IRF4, and RPL22) failed to separate active patients from inactive patients precisely. Hierarchical clustering on all genes had an ARI of 0.03 when compared to the known status of the patients, and clustering on the three consistent DE genes shared among the studies had an ARI of 0.05 (Table 6). If gene expression has the potential to identify active SLE patients robustly, conventional bioinformatics techniques may fail to harness that potential, thereby highlighting the need for more advanced algorithms.
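The per-study normalization described above can be sketched as follows. The disclosure does not specify whether scaling was applied per gene or per sample, nor which variance estimator was used; this sketch assumes per-gene scaling within each study with the population standard deviation, and all names and values are hypothetical.

```python
from statistics import fmean, pstdev


def standardize(values):
    """Scale one gene's expression values to zero mean and unit variance."""
    mean = fmean(values)
    sd = pstdev(values)
    if sd == 0:
        # A constant gene carries no information; map it to all zeros.
        return [0.0 for _ in values]
    return [(v - mean) / sd for v in values]


def standardize_study(expression):
    """Standardize each gene (dict key) across the samples of one study."""
    return {gene: standardize(vals) for gene, vals in expression.items()}


def merge_studies(studies):
    """Standardize each study separately, then concatenate samples per gene.

    Only genes present in every study are kept.
    """
    shared = set.intersection(*(set(study) for study in studies))
    scaled = [standardize_study(study) for study in studies]
    return {g: [x for s in scaled for x in s[g]] for g in sorted(shared)}


# Hypothetical expression values: gene -> one value per patient.
study_1 = {"IRF4": [1.0, 2.0, 3.0], "RPL22": [5.0, 5.0, 5.0]}
study_2 = {"IRF4": [10.0, 20.0], "DNAJC13": [1.0, 2.0]}
merged = merge_studies([study_1, study_2])
```

Standardizing each study before joining reduces the influence of platform-specific scale differences on the combined clustering.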

Thus far, bulk analysis of many WB and PBMC datasets on multiple platforms may show increased transcripts for IFN signature genes, granulocytes, monocytes, and plasma cells and decreased transcripts for lymphocytes. However, because many transcripts are expressed in common by different cell populations, such analysis may yield little information on mechanisms of pathogenesis other than IFN and pattern recognition receptor signaling. Patient-specific transcriptomic “fingerprints” using readily accessible WB may be advantageously generated and analyzed to determine the relative contribution of cells, therapy, and ancestral effects, thereby providing valuable information that potentially may be used in determining entry into a clinical trial or personalized medicine strategies. FIG. 11 shows GSVA results of a lupus Illuminate gene set, demonstrating the striking heterogeneity in SLE patient WB by showing patient-specific enrichment of 27 cell- and process-specific modules of genes. Distinct groups of lupus patients defined by GSVA groups or clusters or genes can be visually identified via the GSVA analysis. In order to understand pathogenic mechanisms of SLE, a big data analysis approach may be used on purified cell populations implicated in SLE to help understand aberrant cellular-specific mechanisms.

Patterns of enrichment of Weighted Gene Co-expression Network Analysis (WGCNA) modules derived from isolated cell populations that are correlated to the SLEDAI SLE disease activity index may be more useful than gene expression across studies to identify active versus inactive lupus patients. To characterize the relationships between SLE gene signatures from various peripheral cellular subsets and disease activity, WGCNA was used to generate co-expression gene modules from purified populations of cells from subjects with active SLE, which may subsequently be tested for enrichment in whole blood of other SLE subjects. WGCNA analysis of leukocyte subsets resulted in several gene modules with significant Pearson correlations to SLEDAI (all |r|>0.47, p<0.05). CD4, CD14, CD19, and CD33 cells yielded 3, 6, 8, and 4 modules significantly correlated to disease activity, respectively (Table 7). Table 7 shows cell module correlations to disease activity and functional analysis. Information on cell modules includes the number of genes, the Pearson correlation coefficient to SLEDAI, and functional analysis. +: LDG modules were generated by WGCNA meta-analysis, and r values indicate separation from control and SLE neutrophils as SLEDAI was unavailable. *: PC modules are based solely on differential expression. LDG: low-density granulocyte; PC: plasma cell.

Two low-density granulocyte (LDG) modules were created by performing WGCNA analysis of LDGs along with either SLE neutrophils or HC neutrophils and merging the modules most strongly expressed by LDGs. Two plasma cell (PC) modules were created by using the most increased and decreased transcripts of isolated SLE plasma cells compared to SLE naïve and memory B cells.
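The correlation of a module to SLEDAI can be illustrated with a minimal sketch of the Pearson coefficient. The per-patient module summary (for example, a module eigengene) is assumed to have been computed elsewhere; the function name and all values below are hypothetical and are not data from the studies.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Centered cross-product and sums of squares.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(ss_x * ss_y)


# Hypothetical module summary value per patient and matching SLEDAI scores.
module_summary = [-0.4, -0.1, 0.0, 0.2, 0.3]
sledai = [0, 2, 4, 8, 12]
r_module = pearson_r(module_summary, sledai)
```

A positive coefficient indicates a module whose expression rises with disease activity, and a negative coefficient indicates one that falls.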

TABLE 7

Cell module correlations to disease activity and functional analysis

Cell Type | Module Name | Module Size | Correlation with SLEDAI | Top GO Biological Process | Top BIG-C Category
CD4 | Floralwhite | 237 | 0.81 | type I interferon signaling pathway | Interferon-Stimulated-Genes
CD4 | Turquoise | 805 | 0.50 | positive reg of ubiquitin-protein ligase | Proteasome
CD4 | Orangered4 | 237 | −0.77 | translational initiation | mRNA-Translation
CD14 | Plum1 | 247 | 0.47 | ubiquitin-dependent protein catabolic process | mRNA-Translation
CD14 | Yellow | 356 | 0.65 | type I interferon signaling pathway | Interferon-Stimulated-Genes
CD14 | Greenyellow | 89 | −0.49 | transcription from RNA polymerase II promoter | General-Transcription
CD14 | Pink | 261 | −0.77 | protein phosphorylation | Endosome-and-Vesicles
CD14 | Purple | 124 | −0.66 | inositol phosphate metabolic process | Fatty-Acid-Biosynthesis
CD14 | Sienna3 | 222 | −0.64 | translational initiation | mRNA-Translation
CD19 | Darkolivegreen | 591 | 0.78 | cell division | Proteasome
CD19 | Greenyellow | 251 | 0.66 | Notch signaling pathway | mRNA-Translation
CD19 | Steelblue | 146 | 0.65 | gluconeogenesis | Glycolysis-Gluconeogenesis
CD19 | Turquoise | 572 | 0.50 | ER to Golgi vesicle-mediated transport | Unfolded-Protein-and-Stress
CD19 | Violet | 566 | 0.61 | mitochondrial respiratory chain complex I | Interferon-Stimulated-Genes
CD19 | Brown | 620 | −0.62 | regulation of transcription, DNA-templated | Chromatin-Remodeling
CD19 | Green | 541 | −0.49 | transcription, DNA-templated | Transcription-Factors
CD19 | Skyblue | 756 | −0.74 | viral transcription | mRNA-Translation
CD33 | Royalblue | 94 | 0.60 | positive reg of cytosolic calcium ions | Transposon-Control
CD33 | Sienna3 | 133 | 0.76 | type I interferon signaling pathway | Interferon-Stimulated-Genes
CD33 | Violet | 177 | 0.79 | defense response to virus | Interferon-Stimulated-Genes
CD33 | Darkmagenta | 273 | −0.49 | ubiquinone biosynthetic process | MHC-Class-TWO
LDG⁺ | LDG_A | 334 | 0.79 | platelet degranulation | Cytoskeleton
LDG⁺ | LDG_B | 92 | 0.81 | regulation of transcription | Secreted-Immune
LDG⁺ | LDG_C | 82 | −0.39 | viral process | Nucleus-and-Nucleolus
PC* | PC_Up | 423 | N/A | protein N-linked glycosylation | Endoplasmic-Reticulum
PC* | PC_Down | 183 | N/A | antigen processing and presentation MHC II | MHC-Class-TWO

Gene Ontology (GO) analysis of the genes within each module showed that some processes, such as those related to interferon signaling, RNA transcription, and protein translation, were shared among cell types, whereas other processes were unique to certain cell types (Table 7) and may be used to classify patients more effectively. The genes in each module are listed in Table 8.

TABLE 8

Genes in modules identified via Gene Ontology (GO) analysis

Cell Type
Module Name
Genes

CD4
Floralwhite
AARS, ABCA1, ABR, ADAM10, ADAR, AEN, AHR, AIMP1, ALOX5, ALOX5AP, APBA3,

APOL1, ARHGEF3, ARID5B, ARMCX2, ASB6, ATG4B, ATOX1, ATP1B3, ATP5J2, ATP6V1E1,

BATF, BCCIP, BCL2, C19orf66, C3orf14, CAPN2, CAPN3, CASP1, CD164, CD55, CFLAR,

CGGBP1, CHMP5, CISH, CLP1, CMTR1, CNP, CREM, CYTIP, DCAF11, DDX60, DHX58,

DNAJA1, DR1, DUSP5, EIF2AK2, EIF2S1, EIF3J, ELAC2, ENO1, ERCC1, ETV7, FAM13A,

FAM46A, FAR2, FBXL8, FCHSD2, FEM1B, GADD45B, GALNS, GCH1, GPKOW, GPR171,

GPRC5B, GSN, GTPBP1, H2BFS, HDAC9, HEMK1, HERC5, HERC6, HIST1H1C, HIST1H2BD,

HIST1H2BH, HIST1H2BK, HLA-B, HN1, ICA1, IDI1, IFI16, IFI27, IFI35, IFI44, IFI44L, IFI6,

IFIH1, IFIT1, IFIT3, IFIT5, IFITM1, IGHMBP2, IKBKE, INSL3, IPO4, IPO7, IRF4, IRF7, IRS1,

ISG15, ISG20, JUN, LAMP2, LAMP3, LAP3, LARP1, LARP7, LDHA, LGALS3BP, LGALS9,

LIMK2, LTA, LY6E, MAP4, ME3, MRPL42, MT1E, MT1F, MT1G, MT1H, MT1HL1, MT1X,

MT2A, MTM1, MTMR1, MX1, MX2, MYD88, N4BP1, NLRP2, NMI, NOP14, NPDC1, NPEPPS,

NQO2, NUP188, OAS1, OAS2, OAS3, OASL, OGFOD3, P2RX5, PARP12, PARP3, PCK2,

PDCD10, PDCD6, PDXK, PFKP, PGAM1, PGAP1, PHF11, PIGV, PIM1, PIP4K2C, PLSCR1,

PNO1, POMP, PSMA1, PSMA5, PSMB10, PSMB9, PSME1, PSME2, PTGER2, RAB11FIP1,

RASGRP3, RBCK1, RCL1, RCN1, REC8, RELB, REXO2, RMDN3, RSAD2, RTP4, RUVBL2,

SAMD9, SCO2, SELP, SEMA3G, SIPA1L1, SIRT5, SLC25A15, SNRPG, SOCS1, SOCS2, SP100,

SP110, SPATS2L, SPCS3, SQRDL, STAT1, STAT5A, STX17, SUB1, SUSD4, TAP1, TBK1,

TDRD7, TFDP2, TLR5, TLR7, TMEM140, TMSB10, TMX2, TNIP2, TRADD, TRAFD1, TRAK2,

TRANK1, TRBC1, TRIM21, TRIM22, TRIM26, TRIOBP, TSPAN13, TUBB2A, TULP4, TXNL4A,

TYMP, UBAP2, UBAP2L, UBE2L6, UCHL3, UPP1, USP11, USP18, USP46, WARS, XAF1,

YBX3, ZBP1, ZCCHC2, ZMIZ2, ZNF207, ZNF273

CD4
Turquoise
AAMDC, AASDHPPT, ABCC1, ABCC10, ACOT13, ACOT9, ACP1, ACSL1, ACTA2, ACTR3,

ACVR1, ADIPOR2, ADK, AIFM1, AIM2, AIMP2, AKAP1, ALAS1, ANAPC5, ANP32E, ANXA2,

ANXA2P2, ANXA2P3, ANXA4, APOL3, APPBP2, APTX, ARL3, ARPC1A, ARPC2, ARPC3,

ARPP19, ASCC1, ATF7IP, ATG4A, ATG5, ATIC, ATMIN, ATP2C1, ATP5G1, ATP5G3, ATP5I,

ATP5J, ATP5S, ATP5SL, ATP6V0E1, ATP6V1A, ATP6V1C1, ATP6V1D, ATP6V1H, ATPIF1,

B3GNT2, B4GALT5, BAG1, BAG5, BAK1, BAZ1A, BHLHE40, BID, BIRC3, BLVRA, BLZF1,

BORA, BTG3, BTN2A2, BUD31, BZW2, C10orf2, C11orf48, C11orf73, C14orf159, C14orf166,

C1GALT1, C1GALT1C1, C1orf50, C1QBP, C21orf59, C21orf91, C2CD3, C2orf43, C2orf44,

C2orf47, C6orf106, C8orf60, CALU, CAPZA1, CARS, CASK, CASP3, CASP4, CCDC53,

CCDC69, CCNA2, CCNB1IP1, CCNH, CCR5, CCT2, CCT3, CD28, CD38, CD59, CDC123,

CDC27, CDC73, CDK2AP1, CDK7, CDS2, CDV3, CEACAM5, CEBPG, CHCHD3, CHMP2A,

CHMP4A, CHN1, CHP1, CHST11, CHST12, CHST7, CISD1, CKS2, CLN8, CLTA, CLUAP1,

CMC2, CNDP2, CNPY2, COA3, COMMD3, COPS2, COPS5, COPS6, COQ2, COX17, COX5A,

COX5B, COX6B1, COX7A2, COX7B, CPSF6, CPT2, CRIPT, CSNK1A1, CSNK2A1, CSTF1,

CSTF2, CSTF3, CTDSP2, CTNNBL1, CTPS1, CTSK, CUL1, CUL3, CUL5, CYB5R4, CYCS,

CYLD, DBI, DCLRE1A, DCPS, DCTN5, DCTN6, DCTPP1, DDB2, DDX10, DDX19A, DDX24,

DDX27, DDX52, DDX54, DDX58, DECR1, DEF8, DERL2, DGCR2, DGUOK, DHTKD1,

DIABLO, DIMT1, DNAJC15, DNAJC2, DNAJC9, DNPEP, DNTTIP2, DOK1, DYNC1H1,

DYNC1LI1, DYNLL1, DYNLT1, EBNA1BP2, EBP, EEF1E1, EFR3A, EIF2B2, EIF2S2, EIF4A3,

EIF4E2, EIF4ENIF1, EIF5B, ELOVL6, ELP3, EMC3, EMC7, EMC8, ENDOD1, ENY2, EPS8L2,

ERAP2, ERO1L, ETF1, ETFB, ETNK1, ETS1, EZH2, EZR, F5, FABP5, FAM105A, FAM32A,

FAM50B, FAM63B, FAM69A, FANCL, FARS2, FARSA, FAS, FASTKD5, FBXO5, FBXO7,

FBXW2, FDPS, FDX1, FEN1, FH, FIBP, FIG4, FLNB, FOXK2, FRAT2, GADD45A, GALK2,

GARS, GART, GBP1, GBP2, GEMIN4, GEMIN6, GGCT, GGCX, GIGYF2, GLA, GLB1, GLG1,

GLRX2, GLRX3, GM2A, GMFG, GMNN, GMPS, GNAI3, GNPDA1, GORASP2, GOT2, GPR107,

GRAMD3, GRPEL1, GRSF1, GSTO1, GTF2A2, GTF2B, GTF2E2, GTF2H2, GTF2H5, GTPBP4,

H2AFZ, HAUS7, HCCS, HCFC2, HCP5, HDGFRP3, HDHD1, HDLBP, HEATR6, HERPUD1,

HEXIM1, HIF1AN, HIGD1A, HINFP, HIRIP3, HIST1H2AC, HMGB2, HMGCS1, HNRNPAB,

HNRNPC, HNRNPDL, HNRNPR, HPRT1, HRSP12, HSP90AA1, HSPA4, HSPA5, HSPD1,

HSPE1, HTATIP2, HTRA2, IARS, ICOS, ICT1, IDH1, IDH2, IDH3A, IDS, IER3IP1, IFT27,

IL10RB, IL13RA1, IL18R1, IL27RA, IMMT, ING2, INPP1, INPP5B, INTS12, IP6K2, IPPK,

ITFG1, ITGAE, ITGB1BP1, ITPA, JAK2, JAM2, JARID2, JMJD6, KATNA1, KCMF1, KCTD2,

KDM5A, KEAP1, KHNYN, KIAA0040, KIAA0101, KIAA0391, KIAA0586, KIAA0922, KIF22,

KLC1, KLF10, KLF12, KLHDC4, KLHL7, KPNA2, KPNA4, KPNB1, LAGE3, LAMTOR2,

LAMTOR5, LARP4, LARS2, LCMT1, LDLR, LDLRAD4, LETM1, LOC100289097, LPCAT1,

LRRC59, LRRC8D, LSM3, LXN, LYST, MAD2L1BP, MADD, MAF, MAGEF1, MANF,

MAPK13, MAPK1IP1L, MAPK9, MAPKAPK5, MBD2, MCL1, MCM3, MCM6, MCTS1, MCUR1,

MDH2, ME2, MED27, MED8, MEOX1, METTL1, METTL22, MFAP1, MFSD5, MICA,

MICALL1, MICB, MIOS, MMD, MOB1A, MPC2, MPHOSPH9, MR1, MREG, MRGBP, MRPL15,

MRPL17, MRPL20, MRPL22, MRPL3, MRPL33, MRPL46, MRPL57, MRPS11, MRPS14,

MRPS16, MRPS17, MRPS18A, MRPS18B, MRPS28, MRPS30, MRPS33, MRPS35, MTAP,

MTCH2, MTG1, MTHFD2, MTMR12, MTMR2, MTX2, MYL6, MYO5A, N4BP2L2, NAB1,

NADK, NBN, NCAPD2, NCAPH2, NCF4, NCK1, NCKAP1L, NDC80, NDUFA1, NDUFA6,

NDUFA8, NDUFA9, NDUFAB1, NDUFAF1, NDUFB3, NDUFB4, NDUFB7, NDUFB8, NDUFS2,

NDUFS3, NDUFS6, NEU1, NFE2L1, NFIL3, NFKBIE, NGFRAP1, NIPBL, NME1, NME7, NMT1,

NOD2, NOP16, NPM1, NRAS, NRBF2, NSD1, NSDHL, NSUN3, NTHL1, NUDC, NUDT21,

NUP155, NUP37, NUP93, NUP98, OBFC1, ODC1, OPTN, OSBPL3, PAF1, PAFAH1B1, PAICS,

PAK1IP1, PAK2, PAM, PANK2, PANX1, PARK7, PARN, PCIF1, PCMT1, PCTP, PDCD11,

PDCD5, PDE4B, PDE6D, PDIA6, PDPK1, PDSS1, PEX13, PEX26, PFDN2, PGD, PGM1,

PHF21A, PIGT, PIK3C3, PIK3R4, PIN4, PIP4K2A, PITPNA, PLAGL1, PLXNC1, PMAIP1,

PMS2P3, POLB, POLDIP2, POLR2I, POLR3C, POLR3K, POP4, POP7, POU2AF1, PPAP2A,

PPIE, PPM1G, PPP1R16B, PPP1R7, PPP2CA, PPRC1, PRDX3, PRDX4, PRIM1, PRKX, PRMT5,

PRPF18, PRPS2, PSEN1, PSMA2, PSMA3, PSMA4, PSMA7, PSMB1, PSMB3, PSMB7, PSMC1,

PSMC2, PSMC3IP, PSMC5, PSMD1, PSMD12, PSMD13, PSMD14, PSMD2, PSMD4, PSMD6,

PSMD9, PSMG1, PTPN2, PTRH2, PTTG1, PUS1, PWP1, QKI, QRSL1, RAB22A, RAB27A,

RAB29, RABGAP1L, RABGGTA, RABIF, RAC1, RACGAP1, RAD23B, RAD50, RAN,

RAP1GDS1, RBX1, RER1, RFC2, RFC3, RFC4, RFK, RFX5, RGS1, RHEB, RIPK1, RIT1,

RMDN1, RMND5A, RNASEH1, RNASEH2B, RNF34, RNF8, RNMTL1, RPA3, RPF1, RPL26L1,

RPL28, RPN2, RPP30, RPP40, RPS6KB1, RPUSD2, RRAS2, RRP12, RRP9, RRS1, RTCA, RTCB,

RTFDC1, RUVBL1, RWDD2B, RYBP, SAMHD1, SAP18, SAP30, SAP30BP, SAP30L, SAR1A,

SAT1, SDHA, SEC11A, SEC14L1, SEC16A, SENP5, SEPHS1, SERBP1, SERPINB1, SERPINI1,

SF3B3, SF3B5, SFPQ, SGK1, SH2D2A, SHFM1, SKAP2, SLBP, SLC16A1, SLC25A12, SLC25A4,

SLC2A3, SLC35B1, SLC35D2, SLC35F2, SLC3A2, SLC5A6, SLC7A5, SMAD3, SMAP1,

SMARCA4, SMC4, SMC6, SMCHD1, SMCO4, SMS, SNAPC3, SNF8, SNRNP25, SNRNP35,

SNRPB2, SNRPC, SNRPD1, SNRPD3, SNUPN, SNW1, SNX1, SOD1, SOS1, SP140, SP140L,

SPCS2, SPTLC2, SRD5A1, SRI, SRP19, SRSF4, STAM, STARD7, STAU1, STK17B, STK4,

STOML1, STOML2, STRAP, STX4, STX7, STX8, SUCLG1, SUMO1, SYNCRIP, SYT11, TACO1,

TAF12, TAF9, TALDO1, TARBP1, TARS, TARS2, TBC1D1, TBC1D22A, TBL2, TBXAS1,

TCEB3, TCOF1, TDP1, TESC, TFG, TFPT, TFRC, THADA, THG1L, THOC5, TIMM23, TINF2,

TIPARP, TJP2, TMCO1, TMEM11, TMEM126B, TMEM135, TMEM156, TMEM186, TMEM2,

TMEM5, TMEM62, TMEM70, TMSB4X, TNFRSF1B, TNFSF10, TNFSF8, TOM1, TOX,

TP53TG1, TPK1, TRAF3, TRAK1, TRAPPC12, TRIB1, TRIM14, TRIM38, TRIM5, TRIM68,

TSR1, TSR3, TTC1, TTC17, TUBG1, TXN, TXNL1, TXNRD1, UBAC1, UBE2A, UBE2D1,

UBE2K, UBL5, UBQLN2, UBR2, UBXN8, UCHL5, UGGT1, UMPS, UQCR10, UQCRC2,

UQCRQ, USP15, USP25, USP39, UTP11L, UTP18, UTP3, VAMP4, VAV3, VCP, VDAC1,

VDAC2, VOPP1, VRK1, VRK2, VTI1B, WBP1L, WDYHV1, WIPF2, WIPI1, WRAP53, WSB2,

WWP2, XRCC4, YARS, YEATS2, YIPF1, YLPM1, YWHAH, YWHAQ, ZBED1, ZC2HC1A,

ZDHHC4, ZMIZ1, ZNF226, ZNF536, ZNF593, ZNF710, ZPR1

CD4
Orangered4
ABCB1, ABLIM1, ACVR1B, ADARB1, ADNP2, ALDH6A1, ALDOC, ANGEL1, ANXA1,

AP1S2, APBA2, APP, APRT, AQP3, ARCN1, ARL2BP, ARRB1, ASB8, ATXN2, ATXN7L3B,

B4GALT4, BACH2, BAG3, BNIP3L, C12orf10, C14orf1, CACNA1A, CBX7, CCDC101, CCNG1,

CCNI, CCR2, CD44, CDC37, CDIPT, CDK5R1, CERK, CHPT1, CKAP4, CMPK1, COX4I1,

COX7A2L, COX7C, CRIP1, CRK, CUTA, CUX1, DDAH2, DDOST, DIAPH1, DNAJB1, DPEP2,

DPH5, DVL1, EDEM1, EEF1D, EEF2, EIF2D, EIF3F, EIF3G, EIF3H, EIF3K, EIF3L, EIF4B,

ENO2, EP400, EPHA1, ERN2, ESD, FAM168B, FAM20B, FAM8A1, FBL, FCGRT, FGFR1,

FHL1, FOXO3, FTL, GGA1, GLO1, GLS, GPR183, GPR27, GPX4, GSS, GTF2F1, GTPBP3,

HADHA, HIP1R, HLA-F-AS1, HMCES, HNRNPA0, HOPX, HSD17B11, HSD17B8, HSF2,

HSPA1L, IGF2R, IGHD, IMPDH2, INPP5A, IRS2, ITFG2, ITPKB, KCNQ1, KLHDC2, KLRB1,

KLRG1, KPNA1, LAIR1, LAMP1, LAPTM5, LINC00623, LITAF, LSM14A, LTA4H, MAGED2,

MAN1B1, MAN1C1, MED21, METTL9, MGA, MID2, MMP24-AS1, MOB3B, NAP1L1, NCOA1,

NDRG3, NFATC2IP, NPC2, ORAI2, P4HB, PABPC1, PABPC3, PABPC4, PACSIN2, PAFAH2,

PCBP2, PDCD4-AS1, PEBP1, PFDN5, PIK3R1, PLEKHB1, PMM1, POLR1E, POU6F1, PPM1F,

PPP1R2, PPP2R5D, PRKCA, PRKD3, PRMT2, PRNP, PRUNE, PSAP, PTDSS1, PURA, QARS,

RAB11FIP3, RCC1, RCOR3, REPIN1, RGCC, RNF130, RPL11, RPL15, RPL18, RPL19, RPL22,

RPL29, RPL3, RPL35, RPL35A, RPL6, RPL8, RPLP0, RPS14, RPS16, RPS19, RPS21, RPS28,

RPS3, RPS5, RPS7, RPS9, RSL1D1, RUFY3, SCPEP1, SDHAF1, SEMA4C, SERINC5, SESN1,

SF3A3, SGSM3, SLC25A6, SLC35C2, SND1, SORL1, SPAG8, SPOCK2, SPSB3, SRSF8, SSBP2,

SSR2, SSR4, ST13, SVIL, TAF7, TBC1D5, TGFBR2, TKTL1, TMEM134, TMEM230, TOMM20,

TRAPPC6A, TRIM27, TRIM44, TRMT112, TSC22D3, TSPO, TTC9, TXN2, TXNIP, UBA52,

UBE2E3, UXT, VEGFB, VGLL4, VIPR1, VPS51, WDR41, YIPF2, ZBTB18, ZC3HAV1, ZFAND3,

ZMAT3, ZSCAN18

CD14
Plum1
ABCD3, ADO, AKAP7, AMD1, ANKRA2, ANP32A, ANXA1, ARAP2, ARL6IP1, ARMCX1,

ARMCX3, ARPC2, ARPC3, ATP2C1, ATP6AP2, ATP6V1C1, AUH, BECN1, C1D, C5AR1,

C5orf22, C6orf62, CAPZA1, CAPZA2, CBX3, CCDC91, CCNC, CD55, CD9, CDC5L, CDC73,

CEBPB, CEBPD, CHMP2B, CHUK, CISH, CLIP1, CLPX, CNOT2, COMMD8, CPEB3,

CSGALNACT2, CTBS, CUL2, CYB5B, CYP1B1, DEK, DENR, DERA, DNTTIP2, DR1, DRAM1,

DTWD1, DUSP11, DYNLT3, E2F3, EBAG9, EDEM3, EID1, EIF3J, EIF4E, EP300, EPS15,

EWSR1, FAM216A, FOXN3, FUBP3, FUCA1, GLIPR1, GLTSCR1L, GLUL, GNPTAB, GRSF1,

HBS1L, HMGN4, HSD17B11, HUS1, IBTK, IMPACT, ISCA1, ITM2B, IVD, KCTD9, KIAA0226,

KIN, KLHL20, KTN1, KYNU, LAMP2, LAPTM4A, LARP4, LARP4B, LEPROT, LILRB2, LIN7C,

LSM5, LYN, LYPLA1, MAK16, MAP3K8, MAP4K3, MARCH7, MARS, MCM9, MEAF6, MED7,

MEF2A, MFF, MICU2, MKNK2, MTHFD2, MYO5A, NAA50, NDUFA4, NDUFA5, NDUFB1,

NFE2L2, NPTN, NUMB, NUP88, NXT2, OGFRL1, ORC4, PAIP1, PAK2, PCNP, PDHX, PDLIM5,

PDS5A, PFDN4, PICALM, PLAA, PPM1B, PPP1CB, PPP2CB, PPP2R3C, PRNP, PRRG4,

PSMD10, PSME4, PSMF1, PSPC1, PTEN, PTP4A1, QKI, RAB11FIP2, RAB27A, RAB29, RAB2A,

RAB7A, RALA, RAP2C, RBMS1, RCN2, RDH11, REST, REV3L, RFK, RMND5A, RNF103,

RNF11, RNF170, RP2, RPL37, RPL39, RTN4, SAR1B, SARAF, SAT1, SCP2, SEC23A, SEC23B,

SEMA3C, SEP15, SERBP1, SERPINB1, SHOC2, SKP1, SLC25A24, SLC35A3, SLMO2, SMA4,

SNRPA1, SNTB1, SNX10, SOCS5, SP2, SRGN, SRP9, SRSF10, ST3GAL6, STXBP3, SUB1,

SUCLA2, SUCLG2, SUMO1, SYPL1, TAF11, TBL2, TCEAL4, TCEB1, TERF1, THAP1, THOC7,

TM2D3, TMEM115, TMEM165, TMEM70, TMSB4X, TMX1, TOB1, TRAPPC13, TRIM8,

TSNAX, TSPAN31, TSPYL4, TTC37, TXNRD1, U2SURP, UBE2A, UBE2B, UBE2E1, UBE2K,

UBXN8, UFM1, UHRF1BP1L, ULK2, USP16, USP4, USP8, USP9X, UTP3, VCAN, VPS54,

WBP11, WIPI1, WWP1, XPOT, YTHDF3, YWHAB, YWHAQ, ZEB2, ZFAND6, ZFP36L1,

ZNF292, ZNF468, ZSCAN16

CD14
Yellow
ABCA1, ACSL1, ACVR1B, ADAM17, ADAP2, ADAR, ADD3, AGRN, AIM2, AIMP1, ALAS1,

ANKRD49, ARHGAP26, ARID3B, ARL4A, ATP10A, ATP11B, ATP5J, ATP6V0E1, ATP6V1E1,

ATP8B4, ATXN7, B2M, B3GNTL1, BACH1, BARD1, BCAS2, BCL10, BLVRA, BST2, BTG3,

C11orf24, C12orf5, C19orf66, C1GALT1C1, C1QA, C2orf47, C3AR1, CALM1, CALML4, CAPN2,

CASP3, CASP7, CCR1, CD2AP, CD300A, CD38, CDC40, CHIC2, CHMP5, CHPT1, CIR1, CLN5,

CMTR1, CNIH4, CNP, COA1, COX17, CREG1, CTSC, CTSL, CTSS, CUL1, CXCL10, CYLD,

DAB2, DBR1, DCTN6, DDIT4, DDX58, DDX60, DECR1, DENND1B, DHRS7B, DIAPH1,

DICER1, DNAJC15, DNASE2, DPM1, DRAP1, DYNLT1, DYSF, EIF2AK2, ENPP4, EPHB2,

EXT1, FADD, FAM175B, FAM46A, FAM65B, FAM8A1, FAS, FCGR1B, FCGR3B, FFAR2,

FKBPL, FMR1, FPR2, FYCO1, GALNT3, GBP1, GBP2, GCH1, GCLM, GCNT1, GHITM,

GLRX2, GNG5, GPN2, GPR137B, GPR65, HBP1, HEG1, HELZ, HERC5, HERC6, HIST2H2BE,

HLA-A, HLA-B, HLA-C, HLA-F, HLA-J, HNRNPA2B1, HPRT1, IFI16, IFI27, IFI35, IFI44,

IFI44L, IFI6, IFIH1, IFIT1, IFIT2, IFIT3, IFIT5, IFITM1, IFITM2, IFITM3, IFNGR1, IL15, IL1RN,

IL6ST, IQGAP2, IRF7, IRF9, ISG15, ISG20, ITFG1, JUP, KAT2B, KCNJ2, KDM5B, KDM6A,

KLF9, KLHL9, KMO, LAP3, LARP7, LGALS3BP, LIPT1, LMO2, LRRFIP1, LXN, LY6E, LY96,

MAFB, MAGOH, MAML1, MAP2K6, MARCKS, MBD2, MED28, MERTK, METTL18, METTL5,

MGAM, MILR1, MRPL16, MRPL18, MRPL19, MRPS14, MRPS22, MS4A4A, MSL2, MSMO1,

MT1E, MT1F, MT1G, MT1H, MT1HL1, MT1X, MT2A, MX1, MX2, MYC, MYD88, MYL12A,

MYL4, MYOF, N4BP1, NAB1, NAPA, NAT1, NDUFB3, NDUFS1, NECAP1, NFE2, NGRN, NMI,

NPC1, NRIP1, NT5C2, OAS1, OAS2, OAS3, OASL, PANX1, PARP12, PCMT1, PELO, PER2,

PFKP, PGK1, PHF11, PHF3, PHTF2, PIGB, PIK3CA, PIN4, PLAC8, PLAGL2, PLIN2, PLSCR1,

PML, PNO1, POLB, PPM1D, PPP2R1B, PRKAG2, PSMA4, PSMB9, PSMC2, PSMD12, PSME2,

PTPN12, PTPRO, RAB11A, RAB1A, RAB8A, RAB9A, RABGAP1L, RAPGEF2, RBM7, RBX1,

RC3H2, REC8, RGL1, RHEB, RHOA, RIN2, RNASE1, RNASE2, RNF122, RPP38, RPS27,

RPS27L, RSAD2, RTCB, RTP4, S100A11, S100A8, SAMD9, SAMSN1, SC5D, SCFD1, SEC22B,

SERPING1, SH3GLB1, SIGLEC1, SKAP2, SLA, SLC25A46, SLC30A1, SLC31A2, SLCO4C1,

SMCHD1, SNRK, SNX1, SP100, SP110, SPATS2L, SPTLC2, SQLE, SRP19, SSB, ST3GAL5,

STAT1, STAT2, STOM, STS, SWAP70, TANK, TAOK3, TAP1, TBK1, TCF4, TCF7L2, TCN2,

TDP2, TDRD7, TFEC, TFG, TFIP11, TIMP1, TLR2, TMED5, TMEM110, TMEM123, TMEM131,

TMEM255A, TMEM50A, TMPO, TNFSF10, TNS3, TOR1B, TRAF6, TRIM14, TRIM21, TRIM22,

TRIM38, TSG101, TYROBP, UBE2J1, UBE2L6, UCHL3, USP18, USP25, VAV3, VDR, VEZF1,

VRK2, VWA5A, WDFY3, WDR41, WDR5B, WDYHV1, XAF1, YME1L1, ZBTB1, ZC3HAV1,

ZCCHC2, ZNF267, ZNF322, ZNF350, ZNF443, ZNF701

CD14
Greenyellow
ACVR2A, AGTPBP1, APOD, APOL1, ARHGAP10, ASTE1, ASXL2, ATP5C1, BLM, BTBD7,

C1orf216, CAST, CCDC51, CCL5, CD27, CD3D, CEMP1, CHD4, CROT, ENSA, EP400,

EPM2AIP1, ERP44, FAM114A1, FAM208A, FBXO9, FGFR1, FLCN, FUT6, GAB1, GNA11,

HAP1, HYAL2, ITFG2, ITGAL, KANSL3, KIF21B, KLF12, KMT2A, KPNB1, KSR1, LMF1,

LOC100272216, LOC100505915, LOC647070, LPAR1, MACF1, MASP1, MICAL2, MLH3,

MMP9, MUC5AC, MYB, MYO1C, N4BP2L2, NCALD, NDST1, OCA2, PAX8, PGGT1B,

POLR1C, POLR2C, PRDM14, PRODH, RNGTT, RRP15, S1PR4, SCAF4, SEPT6, SFI1, SLC12A4,

SPN, STK39, SYT11, TBP, TCAF1, TMEM212, TMEM59L, TNNI3, TNPO3, TRAF3, TUG1,

UNC45A, USP34, VWA9, ZHX3, ZNF665, ZNF76, ZNRF4

CD14
Pink
ACAN, ACOT11, ADGRB1, AGER, AKAP8L, AKT3, ALDH2, ALDOB, ALS2CL, AMT,

ANKRD2, ARMC7, ARPP19, ATP8B2, ATXN10, BACE1, BAIAP2, BARX2, BAZ2A, BBS1,

BIN3-IT1, BNIP3L, BRAP, BRE, BTNL3, C5, C9orf9, CA1, CA14, CAD, CAMK2B, CARS,

CBX5, CBX6, CCDC71, CCDC86, CCDC9, CD1A, CDC42BPB, CDKN2A, CEACAM6,

CHRNA2, CHRNG, CISD1, CKLF, CLTA, COA7, COL1A1, COL6A2, CPD, CREBZF, CRIP1,

CSNK1G1, CTNNA1, CTSK, CYFIP2, DAXX, DGCR11, DHFR, DHX32, DNAJA3, DNPH1,

DOCK1, DPH2, DST, DYRK3, DYRK4, EIF3M, ENGASE, EPHB4, EPHB6, FAM189B,

FAM192A, FBXL5, FBXO42, FKBP4, FUT7, FXYD3, GABBR2, GAS8, GBF1, GCNT4, GDPD5,

GIPC1, GLS, GOLGA3, GPR107, GSTA1, H2AFY2, HDAC6, HDHD1, HECTD4, HFE, HMGA1,

HMGB1, HNRNPD, IKBKE, INTS5, IQCC, IQSEC2, ITPK1, JRK, KDM4C, KDM5C, KIR2DL2,

KLHDC10, LAMC1, LDB3, LDLRAD4, LGALS2, LGALS8, LINC00894, LMNA, LRCH4,

LRRN2, LUZP1, LYRM9, MAPK8IP2, MAPK8IP3, MARK4, MBP, MDK, MED12, MINK1,

MPPE1, MPPED1, MRE11A, MTOR, MUC3B, MUTYH, MYO19, MYO7A, NAA10, NACA,

NECAB3, NENF, NF2, NFATC4, NIPAL2, NKTR, NNAT, NOP14, NPEPL1, NPR2, NPTXR,

NR4A1, NSUN5P1, NTM, NUP188, OCEL1, ONECUT2, OPHN1, OPN3, PAPPA2, PCYOX1L,

PCYT2, PDCD4-AS1, PDCD6, PDGFB, PEAK1, PIGO, PIP4K2C, PIPOX, PKD2L1, PKM,

PLA2G6, PLCB3, PLCD1, PLEKHG3, POLR1D, PPIA, PPP2R4, PPP6R2, PRAF2, PRINS,

PRRC2B, PSMD4, PTCRA, PTGES, R3HDM1, RAB31, RANBP10, RAP1GAP, RAPGEF3,

RBM17, REXO2, RHO, RNASEH2B, RPGR, RPH3A, RPL35A, S100A13, SAFB2, SEC31A,

SERINC2, SF3A1, SFN, SFTPB, SHQ1, SIGMAR1, SLC15A2, SLC28A1, SLC44A1, SLC46A3,

SLC7A6, SMARCD1, SMC1A, SMPD2, SNCA, SNX11, SNX3, SORBS3, SSBP1, SSBP3,

ST6GALNAC2, STK24, SUPT20H, SUPT6H, SYT13, TARBP1, TARBP2, TBX1, TCOF1,

THUMPD2, THY1, TMEM109, TMEM147-AS1, TMPRSS15, TNK1, TNS1, TOMM34, TOP3A,

TOPORS-AS1, TPM1, TPT1, TRIT1, TRO, TTC17, TTLL12, UBAP2L, UBE3B, UBL3, UGDH,

UNC119, UNC13A, USE1, VAC14, VPRBP, VPS13D, WDTC1, WWC3, ZBTB22, ZBTB40,

ZMYM3, ZMYND11, ZNF337, ZNF592, ZNF629, ZNF839, ZSWIM8, ZZEF1

CD14
Purple
AATK, ACSL5, ADGRE3, AEBP1, AIMP2, ANXA2P1, AQP6, ARMC6, ATG4B, AVPI1, BEST1,

C14orf93, C1orf54, C22orf31, C2CD2, CASP10, CBFB, CCDC130, CDX1, CEACAM3, CKAP5,

COL8A2, CXorf56, DCUN1D4, DIMT1, DYNC1H1, EIF5B, EMID1, FAM102A, FAM206A,

FARS2, FASTK, FXYD2, GABRR2, GALT, GLP1R, GLT8D1, GPATCH8, HEATR1, HMGXB3,

HSPB6, HUWE1, IFT88, INPP5E, IPPK, ITPKC, KIAA0586, KLK3, KRT31, LAMP1, LLGL1,

LMBR1L, LRRC14, MAGT1, MAP3K10, MAP3K7, MARC2, MAST2, MECP2, MLX, MTMR9,

MYBL2, MYNN, MYO9A, NFATC1, NIT2, NSMAF, NTRK3, NUP210, OR2H2, OXSM, PBX2,

PCDH12, PCK2, PHKA2, PHLPP2, PLCG1, PLEK2, POFUT1, POU6F1, PPIG, PPP1R26, PRCP,

PRUNE, PVR, PYCR1, RAB3IL1, RAD1, RBM19, RIN1, RMDN3, RPL38, RPS11, RRAGA,

SCML2, SDHAF1, SECISBP2L, SEL1L, SLAMF8, SLC1A4, ST3GAL4, STARD8, SUPV3L1,

TBX5, TCF3, THRA, TIMELESS, TMEM2, TRIM26, TRIM45, TRIO, TRMT12, TRPM6, TUB,

UBAP2, UBE2D4, VAMP3, VPS33B, WDR70, WNT10B, ZC3H13, ZMIZ2, ZNF419, ZNF862

CD14
Sienna3
ABCC5, ACIN1, ACP1, ACYP2, AFG3L2, AHCYL1, AHNAK, AKR7A2, ALOX5, ANAPC5,

AP2B1, APEX1, ARIH2, ARL1, ARMCX6, ASB8, ATIC, ATP5I, ATP5L, ATRN, AUP1, BTF3,

C14orf159, C2orf68, CAMLG, CAPN3, CASC3, CCDC69, CCNB1IP1, CCT3, CD244, CDC16,

CDK10, CDK19, CES2, CIITA, CKAP4, COIL, COPZ1, COX4I1, COX7C, CRTAP, CTNS,

CYP27A1, DCTD, DHX9, DUS1L, DVL1, ECHS1, EEF2, EIF1, EIF2B3, EIF2B4, EIF2B5, EIF2D,

EIF3A, EIF3D, EIF3E, EIF3F, EIF3H, EIF3K, EIF3L, EIF4B, EIF4EBP2, ENG, EPRS, FAM162A,

FAM35A, FAM49A, FBL, FBRS, FBXO21, FCER1A, FKBP11, FLII, FOLR2, FTSJ3, FUBP1,

FXN, FYN, GARS, GAS2L1, GATAD1, GLG1, GOLGB1, GOT2, GRWD1, GSS, HADHA,

HDLBP, HEBP1, HEMK1, HINT1, HLA-DMA, HLA-DQA1, HNRNPA1, HNRNPDL, IARS, ILF3,

IMPDH2, INTS3, IPO5, ISG20L2, ITPA, IVNS1ABP, KAT2A, KATNB1, KDM6B, LAS1L,

LDHB, LETMD1, LRRC47, LSG1, LSM4, LY86, LYRM4, LZTFL1, MAN1C1, MAP4, MAP4K1,

MAPK7, MBD1, MDH2, MGST2, MMS19, MPRIP, MPST, MRPS35, MXI1, NAE1, NAP1L1,

NONO, NPEPPS, NPM1, NUP93, OSBP, OXA1L, PABPC4, PAM, PCBP2, PDCD11, PFKM,

PHB2, PHF20, PMM2, PMS2P1, POLD2, POLR2H, POLR2I, PON2, PPOX, PRKCB, PRKDC,

PSKH1, PTAFR, PTCD3, QARS, RAE1, RCC1, RCN1, REPIN1, RPA1, RPL15, RPL19, RPL22,

RPL3, RPLP0, RPLP1, RPS10, RPS16, RPS17, RPS23, RPS27A, RPS3, RPS4X, RPS6, RPS7,

RPS9, RRNAD1, SDR39U1, SEC11A, SET, SFPQ, SGPL1, SGSM2, SH3YL1, SIVA1, SKP2,

SLC11A2, SLC25A5, SLC25A6, SLC5A3, SLC9A3R1, SND1, SORL1, SPCS2, SPG7, SPINT2,

SPSB3, SRPRB, SRSF4, SRSF5, ST13, STARD7, SUGP2, SYK, TAF15, TARDBP, TBC1D12,

THAP11, TPCN1, TPT1P8, TSEN34, TST, TUBG1, TXN2, UBE2I, UQCRC2, VENTX, VPS4A,

ZNF32, ZNF395

CD19
Darkolivegreen
AACS, ABCB9, ABCC4, ABCF2, ACOT7, ACOX1, ACTA2, ACTG1, ACTR1A, ADA, ADIPOR1,

AEN, AGK, AGPS, AK2, AKR1A1, ALDH18A1, ALDH3A2, ANAPC15, ANG, AP2B1, AP2S1,

APH1B, APIP, APOBEC3G, APOL1, APOO, AQP3, ARL3, ARPC2, ASF1B, ASPM, ATAD2,

ATF6, ATOX1, ATP1B3, ATP5B, ATP5C1, ATP5G1, ATP5G3, ATP5H, ATP5J, ATP5J2,

ATP8B2, AUNIP, AURKA, AURKB, B2M, B4GALT1, B9D1, BATF, BCAR3, BCCIP, BIRC5,

BLMH, BMP8B, BRAP, BSG, BUB1, BUB1B, C14orf1, C15orf39, C19orf10, C1orf216, C21orf91,

C22orf29, C2orf49, C3orf14, C6orf106, CADM1, CALML4, CALR, CALU, CAMKK2, CARHSP1,

CAV1, CCDC51, CCNA2, CCNB2, CCND2, CCNE1, CCNE2, CCR2, CCT5, CD320, CDC20,

CDC25A, CDC45, CDC6, CDCA3, CDCA4, CDCA8, CDK1, CDK2, CDK4, CDK5, CDKN2A,

CDKN2C, CDKN3, CDS2, CENPA, CENPE, CENPF, CENPM, CENPN, CEP55, CFLAR,

CHAF1A, CHCHD2, CHEK1, CHP1, CHST2, CINP, CKAP5, CLIC1, CLIC4, CLPB, CNIH1, CNP,

CNPY2, COA4, COX6A1, COX6B1, COX7A2L, COX7B, COX8A, CRADD, CREB3, CRELD2,

CSNK1E, CSNK2A1, CSRP1, CTNNAL1, CUL5, CUTA, CYC1, DARS2, DAZAP1, DCPS, DDB1,

DDX19A, DESI1, DHFR, DLGAP5, DNA2, DNAAF1, DNAJC1, DNAJC15, DNAJC3, DNMT1,

DONSON, DPP3, DTL, E2F8, EBP, EDC3, EDEM2, EEF1E1, EFCAB11, EIF2S1, EIF4A3,

EIF4G1, EIF4H, ELAVL1, ELL, EMC1, EMC6, EMC9, ERCC6L, ERGIC2, ERO1L, ESPL1, F11R,

FA2H, FADD, FANCG, FANCI, FARSA, FBXW2, FKBP1A, FKBP2, FLAD1, FOXM1, GABPA,

GABPB1, GADD45A, GADD45GIP1, GALE, GALNT14, GAR1, GARS, GART, GATB, GCLM,

GDE1, GEMIN4, GGCX, GINS1, GINS2, GINS3, GLRX, GLRX5, GMDS, GMPPA, GNAI3,

GNAS, GNB1, GORASP2, GOSR2, GOT1, GOT2, GPN2, GRB2, GTF2A2, GTF2F2, GTPBP8,

GTSE1, GUF1, H2AFV, H2AFX, HBS1L, HDLBP, HES1, HEXB, HIRIP3, HIST1H1C, HJURP,

HMBS, HMGB1, HMGB3, HMMR, HMOX2, HNRNPAB, HOXB7, HSD17B10, HSF2, HSPD1,

HYI, IARS, IDE, IDH2, IFNAR2, IGF1, IGF2BP3, IGHG1, IL12A, IL2RB, IL6ST, IMPAD1,

INPP4A, ISOC2, ITCH, ITGAX, ITGB7, KCNA3, KDM4A, KIF14, KIF15, KIF18B, KIF20A,

KIF22, KIF23, KIF2C, KIF4A, KIFC1, KLHL5, KPTN, LAMP5, LANCL2, LAP3, LARP1, LDHA,

LGALS1, LGALS3, LMNB1, LMNB2, LOC730101, LRRC42, LRRC59, LSM1, LSM12, LTB4R,

MAGEH1, MAGT1, MAPK6, MCCC2, MCFD2, MCM10, MCM4, MCM6, MDH1, MDH2, MELK,

MET, METTL1, MGAT1, MGST2, MIS18A, MKI67, MMADHC, MPDU1, MRPL12, MRPL15,

MRPL23, MRPL24, MRPL3, MRPL33, MRPL40, MRPL42, MRPL44, MRPS11, MRPS16,

MRPS17, MRPS18B, MRPS2, MRPS34, MRPS7, MRTO4, MSRB2, MTFR1, MTRR, MTX1,

MYBL2, NAA35, NAPA, NAPG, NASP, NBN, NCAPG, NCAPG2, NCAPH, NCLN, NDC1,

NDUFA1, NDUFA13, NDUFA2, NDUFA4, NDUFA6, NDUFA7, NDUFA9, NDUFAB1,

NDUFAF3, NDUFB8, NDUFS7, NEK2, NET1, NEU1, NFE2L1, NME1, NOP10, NPM1, NRBP1,

NSDHL, NTHL1, NUDT21, OGDH, OIP5, OPTN, OR7E12P, ORMDL2, OSBP, PAFAH1B3,

PAGR1, PAICS, PAK1IP1, PAK2, PARP2, PARPBP, PBK, PCCB, PDE6D, PDK1, PDXK, PGD,

PGM3, PGRMC1, PHB, PHGDH, PIM2, PKMYT1, PLA2G12A, PLAGL2, PLK4, PLOD1, PMM2,

PNO1, PNPLA4, POLA1, POLA2, POLDIP3, POLE2, POLR2D, POMP, POP7, PPA1, PPAT,

PPIA, PPIF, PPP2R1B, PPP2R2A, PPP6C, PRC1, PRCC, PRDM1, PRDX1, PRIM1, PRIM2,

PRKAG1, PRMT5, PROSER1, PRRC1, PSAT1, PSMA2, PSMA5, PSMA7, PSMB1, PSMB2,

PSMB5, PSMB6, PSMB8, PSMC1, PSMC3, PSMD11, PSMD12, PSMD14, PSMD8, PSMD9,

PSME2, PSME3, PSMG2, PTPLAD1, PXMP2, PXMP4, R3HDM1, RAB27A, RAB2A, RAB6A,

RAB8A, RABAC1, RABL6, RAD1, RAD51, RAD54B, RALA, RANBP1, RAP1A, RCC1,

RECQL4, RER1, REXO2, RFC5, RFK, RGS13, RMND5A, RNASEH1, RNASEH2A, RRAGD,

RRBP1, RRM1, RRS1, RUVBL1, S100A4, SAE1, SAMHD1, SBNO1, SCAMP2, SDF2L1, SDHB,

SEC13, SEC23IP, SEL1L, SEPHS1, SF3B5, SFXN1, SH3GLB1, SHMT1, SIL1, SKA1, SLBP,

SLC12A2, SLC16A1, SLC19A1, SLC25A11, SLC25A3, SLC25A4, SLC25A5, SLC35A2,

SLC39A14, SLC39A7, SLC7A5, SLC9A3R1, SLCO3A1, SLIRP, SMC2, SMOX, SNRPC,

SNRPD1, SNRPF, SNRPG, SP100, SPAG5, SPC25, SRM, SRPR, SRPRB, SRSF10, SSR3,

SSSCA1, STAM2, STARD7, STIL, STIP1, STRAP, SUMO3, SUPT4H1, SZRD1, TBL2, TCEB2,

TDP1, TECR, TGOLN2, THEMIS2, TIMM13, TIMM8B, TIMP2, TIPIN, TK1, TLE3, TM9SF4,

TMA16, TMED9, TMEM106C, TMEM110, TMEM147, TMEM184B, TMEM194A, TMEM248,

TMEM258, TMEM5, TMEM59, TMEM97, TMPO, TMSB10, TOP1, TOP2A, TPGS2, TPX2,

TRAPPC2L, TRAPPC3, TRIP13, TSHR, TST, TTK, TUBB2B, TUSC2, TXLNA, TXN2, TXNL4A,

UBE2C, UBE2D3, UBE2H, UBE2L3, UBE2S, UBFD1, UCHL1, UFD1L, UGGT1, UQCRC1,

UQCRFS1, UQCRQ, UROS, USP14, VAPA, VKORC1, WBSCR22, WDR1, WDR12, WDR76,

WHSC1, XRCC4, XRCC5, YARS, YIF1A, YKT6, ZDHHC3, ZNF207, ZNF35, ZNF593, ZNHIT1,

ZWILCH, ZWINT

CD19
Greenyellow
ABCC1, ABHD14A, ACLY, ACO2, ACP2, ACTB, ADAP1, ADCY3, ADRA2C, AKIRIN1,

ALDH3A1, ALG3, ANO10, AP1S1, APEH, ARF3, ARFIP1, ARHGDIA, ARSA, ARTN, ATG13,

ATP13A1, ATP6V0B, AURKAIP1, BAZ1B, BOP1, BTG2, BYSL, C11orf24, C17orf53, CAD,

CCDC186, CCNF, CD99, CDK16, CHN2, CHPF, CHPF2, CLDN14, CLPP, CLSPN, CNTD2,

COMMD4, COMT, COPE, CRMP1, CSNK1D, CTPS2, CXCR3, CYP4F12, DBP, DCSTAMP,

DCTPP1, DIAPH1, DLEC1, DNAJB12, DNASE2, DOK4, DPM2, DTX3, E2F1, EHD3, EIF2AK1,

EIF3B, EIF6, ELMO1, ERLIN1, ERV9-1, EXOSC4, FAM214B, FLNB, FLNC, FN1, FOXRED2,

FTSJ2, G3BP1, GANAB, GAS6, GCDH, GGA3, GNB1L, GPR144, GPR25, GRWD1, HAPLN2,

HAX1, HDGF, HEATR2, HHLA3, HMOX1, HNRNPF, HOXC4, HSPA6, HSPBP1, IFRD2, IGH,

IGHD, IGHM, IGK, IGLL3P, IKBKE, IL13, IL1RAPL2, INTS5, IQCE, JAG1, KATNB1, KCNN3,

KCNQ4, KCTD5, KDM8, KNOP1, KPNA6, LDHC, LDLR, LEPRE1, LILRB4, LPCAT4, LRRC41,

LTK, LYPLA2, MAPKAPK3, MAST2, MBD1, MCAT, MCL1, MEF2D, MEG3, MICU1, MROH7,

MRPL34, MSMB, MSRB1, MST1L, MUC3A, NABP2, NDUFB2, NDUFB7, NEUROD4, NF2,

NFYA, NHP2, NIPAL3, NKX3-1, NOC2L, NOL3, NOLC1, NPAS1, NQO1, NUBP2, NUCB1,

OXCT2, PAFAH2, PAM16, PCDHGB6, PCYT1B, PEA15, PEPD, PEX19, PFDN1, PHTF1, PIGO,

PLA2G2D, PLIN3, PLK1, PLOD3, PNMA2, POLE, POLR2L, PPIC, PPP1R14B, PPP2R3A,

PPP5C, PRKCD, PRR5, PSEN2, PSENEN, PSMD3, PSMD5, PTBP1, PTGES2, PTP4A3, PTPN12,

PTPN18, PTPN9, PYGB, RAB5A, RAC1, RANGAP1, RASGRF1, RBM12B, REC8, RITA1,

RNF130, RRP9, RSPH6A, RXRA, SAMD14, SAP30, SARS, SARS2, SCAMP3, SEC24C, SGTA,

SIGLEC6, SIVA1, SLC12A8, SLC16A6, SLC1A5, SLC35C1, SLC4A10, SLC52A2, SLC6A2,

SPAG11A, SPN, STXBP6, STYXL1, SUMO2, SYP, TADA2A, TARBP2, TCEA1, TCF25,

TDRD12, TEX261, THAP3, THOC5, TIMM10, TKT, TMEM223, TMEM230, TOMM22, TOR3A,

TOX4, TRIP6, TSFM, UBE2N, UBE2NL, UBE2Q1, UPF1, VAC14, VARS, VAV1, VCX2,

VPREB1, WDR18, WDR62, WTAP, YIPF2, ZNF282, ZNF609

CD19
Steelblue
ABCF1, ADI1, ADSL, AHCY, ALDOA, AP3D1, ATF4, ATF5, ATP1A1, ATP2C1, B4GALT5,

BCKDK, BCL2L11, BID, C12orf43, C21orf59, C4orf27, CCDC86, CCT3, CCT7, CD58, CDK7,

CHST11, CKLF, CLP1, COG7, COPA, CYTIP, DAP, DDX39A, DENND3, DYNLRB1, ECHS1,

EDF1, EIF2B4, EIF3I, ELAC2, ENO1, ERGIC3, FAF2, FASN, FASTKD5, FTSJ1, GALNT2,

GAPDH, GCN1L1, GLO1, GPAA1, GPI, GPN1, GSS, GSTO1, GSTZ1, GTPBP4, GUK1,

HNRNPC, IMMT, IMP4, IPO4, IRAK1, KARS, LAGE3, LCMT1, LRP8, LRPAP1, LSM4,

MAD2L1BP, MAGED1, MAPKAP1, MCM5, MCM7, MCOLN1, MECR, MIF, MPHOSPH10,

MRPL11, MTMR12, NCL, NDUFS2, NDUFV2, NOL7, NQO2, NRD1, NUDC, NUDT1, NUDT15,

NUP205, NUP93, ODC1, PA2G4, PARP4, PCK2, PGAM1, PGK1, PIGT, POLD3, POLDIP2, PPIH,

PPP1R7, PSMA1, PSMB3, PSMB4, PSMC5, PSMD2, PSMD4, PUS3, RGCC, RHOB, RNF114,

RPP30, SDF4, SF3B2, SKP2, SLC2A5, SLC38A2, SLC39A8, SLC3A2, SLC43A3, SLCO4A1,

SOD1, SPTLC2, SSR2, SSRP1, ST6GALNAC4, TACO1, TBC1D15, TCEB3, TDRD7, TIMM44,

TNPO3, TRAP1, TSSC1, TTLL12, TUBA1B, TUBA1C, TUBB, TUBB3, TUBB4B, TUFM,

UBL4A, VDAC2, WARS, WDR45, XRCC6, YBX1, ZNF410

CD19
Turquoise
AARS, AASDHPPT, ACADM, ACAT1, ACSL4, ACTR2, ACVR1, ADAM10, ADSS, AKAP1,

ALG5, ALG6, AMD1, ANKRD12, ANKRD17, ANKRD36, ANP32B, ANP32E, ANXA5, ANXA7,

API5, APOBEC3B, ARF4, ARFGAP3, ARHGAP6, ARHGEF12, ARID3A, ARL1, ARL4C,

ARL5A, ARL6IP1, ARMC1, ARMCX3, ARPP19, ASNS, ATG5, ATP2A2, ATP6AP2, ATXN1,

AZIN1, B3GNT2, B4GALT3, BAG2, BARD1, BBIP1, BBS7, BCKDHB, BECN1, BIK, BRCC3,

BTN1A1, BUB3, BZW1, C1D, C1orf27, C1QBP, C2orf43, C6orf62, CAAP1, CANX, CAPN2,

CAPN7, CAPRIN1, CAPZA2, CASP10, CASP3, CBFB, CBX3, CBX6, CCNB1, CCNC, CCP110,

CD164, CD27, CD38, CD59, CD86, CDC27, CDC42, CDK14, CDK17, CDK2AP2, CDV3, CENPQ,

CENPU, CEP57, CEP97, CHST12, CHST15, CHUK, CITED2, CKAP4, CLASP2, CLCC1,

CLDND1, CLINT1, CMAHP, CNKSR1, COBLL1, COL13A1, COPB1, COPG1, CORO1C, CPOX,

CREB3L2, CRIP1, CSF2RB, CSNK1G3, CSPP1, CTBP1, CTBS, CUL2, CUL4B, CYB5B,

DAAM1, DAD1, DAPK1, DCTD, DCTN4, DCTN5, DDOST, DDX18, DDX3X, DENND1B,

DENND5B, DERL1, DERL2, DMC1, DNAAF2, DNAJA2, DNAJB9, DNAJC10, DNM1L,

DNMT3B, DSTN, DUSP5, EBAG9, ECHDC1, EDEM1, EDEM3, EED, EGLN1, EID1, EIF1AX,

EIF3A, EIF3J, EIF4E, EIF5, ELL2, ENPP3, ENTPD1, EPHA4, EPRS, ERAP1, ETFA, ETNK1,

ETS1, EXOC5, EZH2, FAIM, FAM114A1, FAM129A, FAM46C, FBXO46, FBXW7, FDX1,

FEM1B, FEM1C, FKBP11, FLI1, FNDC3A, FNDC3B, FPGT, FUBP3, FUT6, FUT8, FXR1,

G3BP2, GALK2, GALNT1, GALNT3, GBAS, GCLC, GDI2, GFPT1, GGH, GHITM, GLDC,

GLE1, GLG1, GLS, GLUD1, GLUD2, GOLPH3, GOLT1B, GPNMB, GPR15, GPRC5D, GPX7,

GSN, GSPT1, GUSBP11, H2AFY, HCFC2, HERPUD1, HIBCH, HIF1AN, HIGD1A, HIRA,

HMGB2, HMGCR, HN1, HNRNPR, HNRNPU, HRASLS2, HS2ST1, HSD17B8, HSP90B1,

HSPA13, HSPA4, HSPA5, HSPA9, HSPH1, HYOU1, IDH3A, IFT52, IGKC, IGLC1, IGLJ3,

IGLV1-44, IKZF5, IL12B, IL6R, ILF2, ILF3, IMPA1, INSIG1, IPO5, IPO7, IQCB1, IQCG,

IQGAP1, IQGAP2, IRF4, ISCA1, ISOC1, ITGA4, ITM2A, ITM2C, IVD, JUN, KCNJ13, KCTD3,

KDELR2, KDM5A, KDM6A, KIAA0101, KIF11, KLF10, KRR1, L2HGDH, LARP4, LAX1,

LIMS1, LIN7C, LINS, LITAF, LMAN1, LMAN2, LMO4, LTN1, LYPLA1, M6PR, MAD2L1,

MAN1A1, MAN1A2, MAN2A1, MANEA, MANF, MAP2K6, MAP4K3, MAPRE1, MARCH7,

MBNL2, ME2, MED13L, MED17, MFN1, MGAT2, MGLL, MLEC, MLLT10, MLX, MOB1A,

MORF4L1, MORF4L2, MRPL35, MTDH, MTF2, MTHFD2, MYO1D, MZB1, NAA50, NAB1,

NAGA, NANS, NBR1, NCOA3, NFE2L2, NFIL3, NFX1, NMD3, NNT, NONO, NRAS, NT5DC2,

NUCB2, NUDT4, NUP50, NUP98, NUS1P3, NUSAP1, NXPE3, OAT, OGT, ORC2, OSBPL3,

OSBPL9, OXR1, P4HB, PABPC4, PAPOLA, PAPSS1, PAQR3, PARM1, PDIA3, PDIA4, PDIA5,

PDIA6, PDLIM5, PEBP1, PELI1, PERP, PGPEP1, PHF7, PHYH, PIAS2, PICALM, PIGK,

PLA2G16, PLEKHA6, PLK2, POTEKP, POU2AF1, POU4F1, PPCDC, PPIB, PPP1CB, PPP1R2,

PPP3R1, PRDX3, PRDX4, PREB, PRKAG2, PRKAR1A, PRKCI, PROSC, PRPS1, PSEN1,

PSMD13, PTGES3, PTP4A1, PTP4A2, PTPN11, PTPN22, PYCR1, RAB1A, RACGAP1, RAD17,

RAD23B, RAP2B, RB1CC1, RBBP4, RBM3, RBM47, RCBTB2, RCN2, RDX, RECQL, REEP5,

RHOA, RHOQ, RIF1, RIPK1, RNF115, RNF19A, RNPEP, ROCK1, ROCK2, RPA1, RPL36AL,

RPN1, RPN2, RPRD1A, RRM2, RSRC2, RTN3, RUFY3, S100A10, SAMSN1, SCARB2, SCYL2,

SEC11A, SEC14L1, SEC22B, SEC23A, SEC24A, SEC24D, SEC31A, SEC61A1, SEC61B,

SEC61G, SEC63, SEL1L3, SELT, SEMA4A, SEPT2, SERBP1, SERP1, SGK1, SGPP1, SHCBP1,

SLAMF7, SLC1A4, SLC25A17, SLC25A46, SLC30A5, SLC33A1, SLC35A3, SLC35B1,

SLC39A6, SLC7A1, SLMO2, SMARCC1, SMC4, SMCHD1, SND1, SNX13, SNX4, SORT1, SP3,

SPAG1, SPATS2, SPCS1, SPCS2, SPCS3, SPOP, SPTLC1, SPTSSA, SRGN, SRI, SRP54, SRP72,

SRPK1, SRSF1, SRSF3, SS18, SSB, SSR1, SSR4, STEAP3, STK38L, STRN3, STT3A, SUB1,

SUCLG2, SUMO1, SUMO4, TAF2, TES, TESC, TFAM, TFB2M, TFCP2, TFDP1, TFRC, TGDS,

TLK2, TM9SF1, TM9SF2, TMBIM6, TMED10, TMED2, TMED3, TMED5, TMEM135,

TMEM165, TMEM208, TMEM39A, TMEM50B, TMEM57, TMX1, TOMM70A, TOPORS,

TOR1A, TOR1AIP1, TOX, TP53I3, TP63, TPD52, TPP2, TRA2A, TRAM1, TRAM2, TRIB1,

TRIM23, TRRAP, TSPAN31, TTC37, TUBGCP3, TWSG1, TXNDC15, TXNRD2, TYMS,

U2SURP, UAP1, UBA5, UBA6, UBE2A, UBE2E1, UBE2G1, UBE2J1, UBE3A, UBE4B, UBR5,

UBXN4, UCHL5, UFL1, UFM1, UGDH, URI1, USO1, USP46, USP8, VAMP3, VCP, VDAC1,

VDR, VIM, VOPP1, VWA9, WDR44, WDYHV1, WIPF1, WIPI1, XAF1, XBP1, XPNPEP1, XPOT,

YAF2, YIPF5, YIPF6, YTHDF2, YWHAE, YWHAH, ZBP1, ZBTB32, ZC3H13, ZDHHC13,

ZFAND1, ZFR, ZNF706

CD19
Violet
ABCE1, ACAA2, ACN9, ACOT13, ACP1, ACSL1, ADAR, AGA, AGPAT4, AIFM1, AIMP2,

ALG8, ALG9, ALKBH1, ANAPC5, ANXA2, ANXA2P1, ANXA2P2, APOL3, ARMCX5,

ARPC5L, ASAH1, ASCC3, ASUN, ATG3, ATIC, ATP13A3, ATP1B1, ATP5A1, ATP5E, ATP5L,

ATP6V1A, ATP6V1C1, AVEN, AZI2, B4GALT4, BAG1, BAK1, BLVRA, BLZF1, BORA, BPGM,

BRCA1, BTG3, BZW2, C11orf48, C11orf58, C11orf73, C12orf4, C14orf166, C14orf2, C16orf62,

C1GALT1, C2orf47, CACYBP, CAND1, CARS, CASP1, CASP6, CASP7, CBR1, CBX5, CCBL2,

CCDC53, CCDC88C, CCNH, CCR1, CCT2, CCT4, CCT6A, CCT8, CD2AP, CDC123, CDC25B,

CDC37L1, CDC5L, CDC73, CDK12, CDK2AP1, CDKN1A, CDR2, CEBPG, CEP63, CEP76,

CERS6, CETN2, CHCHD3, CHMP2A, CHMP5, CIAPIN1, CKS1B, CKS2, CLCN3, CLEC2D,

CLN5, CLTA, CMC2, CNIH4, CNOT6, COA3, COL9A3, COMMD3, COPS2, COPS3, COPS4,

COPS6, COPS8, COX17, COX5A, COX5B, COX6C, COX7A2, CPSF6, CRIPT, CSE1L, CSTF2,

CSTF3, CTPS1, CYCS, CYP11B1, DBF4, DBI, DCAF17, DCTN6, DDRGK1, DDX1, DDX10,

DDX24, DDX46, DDX49, DDX60, DERA, DHRS9, DHX15, DHX29, DIABLO, DIMT1, DLAT,

DLD, DLEU2, DNAJA1, DNAJC2, DNAJC9, DPM1, DR1, DRG1, DUT, DYNLT1, E2F3, EHD4,

EI24, EIF2AK2, EIF2B1, EIF2B2, EIF2B3, EIF2S2, EIF4E2, EIF5B, EMC3, EMC7, EMC8,

ENOPH1, ENOSF1, ENY2, ETF1, ETFDH, EXOSC2, EXOSC9, FABP5, FAHD2A, FAM206A,

FAM49A, FARS2, FASTKD2, FASTKD3, FBXO5, FECH, FEN1, FGFR1OP, FGL2, FH, FOCAD,

FOXK2, GALC, GBP1, GEMIN2, GIGYF2, GLA, GLMN, GLRX3, GLT8D1, GMNN, GNG5,

GOLGA5, GPKOW, GPR137B, GRPEL1, GRSF1, GTF2E2, GTF2H2, GTF3C3, GUSB, GYG1,

H2AFZ, HADH, HADHB, HARS, HAT1, HCCS, HDAC2, HDHD1, HEATR1, HEATR3, HEG1,

HERC5, HERC6, HIST1H2BH, HMGXB4, HNRNPD, HPRT1, HRSP12, HSD17B12, HSP90AA1,

HSPA14, HSPB11, HSPE1, HYPK, ICT1, IFI27, IFI35, IFI44, IFI44L, IFI6, IFIH1, IFIT1, IFIT3,

IFIT5, IFITM1, IFT27, INPP1, INTS12, INTS6, INTS7, ISG15, ISG20, ITFG1, ITGB1BP1,

ITGB3BP, JAK2, JMJD6, KEAP1, KIAA0020, KIAA0196, KIAA1279, KIF20B, KLC1, KLF12,

KLHL7, KPNA2, LAMP2, LAMTOR2, LCP2, LGALS8, LSM3, LSM5, MAP2K4, MAPK1IP1L,

MBIP, MCM2, MCM3, MCTS1, MCUR1, MED27, MED6, MED8, METAP2, METTL22,

METTL5, MICB, MIEF1, MLH1, MOSPD1, MPC1, MPC2, MPHOSPH9, MPP6, MRPL13,

MRPL17, MRPL18, MRPL19, MRPL20, MRPL22, MRPL46, MRPL48, MRPL57, MRPS14,

MRPS15, MRPS18A, MRPS22, MRPS27, MRPS31, MRPS33, MRPS35, MSH2, MSH6, MT1X,

MTAP, MTCH2, MTHFD1, MTX2, MX1, MX2, MYO5A, NAMPT, NARS, NARS2, NCAPD2,

NCBP1, NDC80, NDUFA8, NDUFAF1, NDUFAF4, NDUFB1, NDUFB3, NDUFB4, NDUFB5,

NDUFB6, NDUFC1, NDUFS1, NDUFS3, NDUFS4, NDUFS5, NDUFS6, NECAP1, NFYB,

NIF3L1, NINJ2, NMI, NMT1, NOD2, NPTN, NTAN1, NUP153, NUP37, NUPL1, OAS1, OAS2,

OAS3, OASL, OGFOD3, ORC5, OXCT1, PAAF1, PAIP1, PARK7, PAXIP1, PCMT1, PCNA,

PCNX, PDCD2, PDCD5, PDHA1, PDHB, PDHX, PDS5B, PDXDC1, PELO, PFDN6, PI4K2A,

PIGF, PIK3CG, PIP4K2C, PLAA, PLIN2, PLSCR1, POLE3, POLR2K, POP4, POP5, PPA2, PPID,

PPM1G, PPP1CC, PPP2CB, PPP2R5C, PPT1, PRDX6, PRPF18, PRPF4, PSMA3, PSMA4, PSMB7,

PSMB9, PSMC2, PSMC3IP, PSMD1, PSMD6, PSMD7, PSME1, PSMG1, PSPH, PSRC1, PTENP1,

PTPN2, PTRH2, PTS, PTTG1, QDPR, QKI, RAB22A, RAB40B, RAB7A, RABEPK, RAD51AP1,

RAD51C, RAE1, RAN, RANBP9, RBBP8, RBCK1, RBM15, RBMX2, RBX1, RCN1, RFC2, RFC3,

RFC4, RHEB, RIOK2, RMDN1, RMDN3, RNF103, RNF11, RPA3, RPF1, RPL26L1, RPP40,

RPS27L, RPS6KB1, RPS6KC1, RSAD2, RTF1, RTP4, RUVBL2, RWDD1, RWDD2B, SAC3D1,

SAMD9, SAR1A, SAT1, SCAMP1, SCFD1, SCO2, SCP2, SDHC, SEC23B, SHFM1, SLAMF1,

SLC20A1, SLC25A12, SLC25A20, SLC30A9, SLC35F2, SLFN12, SMAD2, SMAP1, SMARCA4,

SMARCA5, SMC3, SMCO4, SNAP29, SNF8, SNRPB2, SNRPD3, SNRPE, SOAT1, SOS1,

SPATA5L1, SPATS2L, SQLE, SQRDL, SRBD1, SRP19, SRR, SSBP1, STAG1, STAT1, STAU1,

STK17B, STMN1, STOM, STOML2, STX18, SUCLG1, SYNCRIP, SYT11, TAF1B, TAF5, TAF9,

TALDO1, TAP1, TARS, TBC1D31, TBCA, TBX21, TCEB1, TCTN3, TEX30, TFG, THG1L,

THOC7, TIMM17A, TIMM23, TIMM9, TLE4, TMCO1, TMEM126B, TMEM70, TNFSF10, TPM4,

TPRKB, TRDMT1, TRIM14, TRIM26, TSC22D1, TSG101, TSN, TTC1, TTF2, TUBG1, TWF1,

TXN, TXNL1, TXNRD1, UBAC1, UBAP2, UBE2B, UBE2K, UBE2L6, UBE2V2, UBE3C,

UBXN8, UCHL3, UFC1, UFSP2, UMPS, UQCC1, UQCR10, UQCRB, UQCRC2, USP10, USP16,

USP18, UTP11L, UTP18, VDAC3, VRK2, VTI1B, WDR61, WSB2, YIPF1, YME1L1, YWHAQ,

ZC3H15, ZDHHC4, ZFYVE21

CD19
Brown
ABCA1, ABCC5, ABCG1, ABI1, ACAP2, ACSL3, ACYP1, ADAM17, ADARB1, ADAT1, ADD3,

ADRBK2, AGL, AGPAT5, AHCYL1, AHNAK, AIDA, AIM1, AIMP1, AKAP11, AKAP9,

ALCAM, ALDH1L1, ALMS1, ALPK1, AMMECR1, ANK3, ANKRA2, ANKRD10, ANKRD10-IT1,

ANKRD36B, ANXA11, AP1AR, APAF1, APOOL, APP, APPBP2, APPL1, ARFGEF2,

ARGLU1, ARHGAP10, ARHGAP12, ARHGAP26, ARHGAP5, ARHGEF18, ARHGEF6, ARID1A,

ARID4B, ARNT, ARNTL, ARPC1B, ASAP1, ASPH, ATAD2B, ATF7IP, ATF7IP2, ATP10D,

ATP2B1, ATP8A1, ATP8B1, ATRX, ATXN10, ATXN7, AVIL, BACE2, BANK1, BAZ2B, BBS10,

BICD2, BIRC3, BLCAP, BLNK, BMP2K, BRD4, BTBD1, BTN2A2, C11orf21, C11orf80, C18orf8,

C5orf28, C9orf156, C9orf91, CA5B, CALCOCO1, CAPN3, CASP8AP2, CAT, CBFA2T3, CBR4,

CCNG2, CCNT2, CCR6, CCSER2, CD180, CD1C, CD24, CD46, CD47, CDC40, CDC42EP3,

CDK13, CEP104, CEP135, CEP83, CHD1, CHD9, CIAO1, CIR1, CLCN4, CLEC4A, CNNM3,

CNOT8, COIL, COL5A3, CR1, CR2, CRBN, CREB1, CREBZF, CRK, CRY2, CRYL1, CSAD,

CSNK1A1, CTAGE5, CTNNB1, CTSS, CWC25, CXorf21, CYBB, CYP2E1, DAPP1, DBT, DCK,

DCLRE1C, DCP2, DCUN1D2, DCUN1D4, DDX52, DENND4A, DIAPH2, DIP2A, DIS3,

DKFZP586I1420, DLG1, DLGAP4, DNAJB14, DNAJC16, DOPEY2, DSCR3, DSE, DSERG1,

DSP, DUS2, DUSP22, DYM, DZANK1, E2F5, EAPP, EFR3A, EGR3, EIF3M, EIF4G3, ELOVL5,

ENTPD4, EPS15, ERBB2IP, ERP44, ETAA1, EVI5, EXOSC5, EXOSC7, FAM134A, FAM13B,

FAM178A, FAM179B, FAM192A, FAM49B, FAM53C, FAM63B, FAM65B, FBXO28, FBXO3,

FBXO41, FBXO42, FBXW12, FCGR2B, FCGR2C, FCRL2, FGFR1, FKBP9, FLJ42627, FMR1,

FOXN3, FRAT1, FUBP1, GALNT10, GALNT7, GATAD1, GFOD1, GLIPR1, GNE, GNG7,

GOLGA4, GPATCH8, GPR153, GPR18, GPR183, GSTA4, GTF2H3, HAUS2, HCG26, HCK,

HDAC4, HDAC9, HECTD4, HERC4, HEXA, HEXIM1, HMG20A, HMGN4, HNRNPH1,

HNRNPM, HRK, ICK, IDO1, IFNGR1, IGHV5-78, IKZF1, IL13RA1, IL15, IL6, IL7, INADL,

INPP5B, IRAK4, IRGQ, ITPR1, ITSN2, JADE3, JAG2, JRKL, KAT2B, KCNMB3, KDM3B,

KDM4C, KIAA0040, KIAA0355, KIAA0754, KIAA1033, KIAA1109, KIAA1551, KIF16B,

KLHL20, KLHL24, KMO, KMT2A, KPNA1, KPNB1, KRCC1, KRIT1, LANCL1, LAPTM4A,

LARP4B, LARS, LBH, LEMD3, LIAS, LINC00597, LOC100272216, LOC100505915,

LOC157562, LOC728093, LONRF1, LPGAT1, LPP, LRRFIP1, LRRFIP2, LSM14A, LUC7L3,

LYN, LZTFL1, MACF1, MALT1, MAP3K5, MAP3K7, MAP3K8, MAP4, MAP4K5, MARCH1,

MARCH3, MARCH6, MARCKS, MAT2B, MAVS, MBD4, MED14, MEF2A, MEF2C, METAP1,

METTL3, METTL4, METTL8, MEX3C, MFSD11, MGC12488, MGEA5, MINOS1P1, MOB3B,

MPZL1, MR1, MRPS30, MSANTD2, MSL2, MSL3, MTMR4, MVK, MYO1C, MYO1F, MZT2B,

N4BP2L1, N4BP2L2, N4BP2L2-IT2, NAA40, NAAA, NACAP1, NCOR1, NDRG2, NDUFAF7,

NEMF, NFATC2IP, NFYC, NHLRC2, NOTCH2, NOTCH2NL, NPEPPS, NR2C1, NRCAM,

NSFL1C, NUP43, OPN3, OSBPL10, OSBPL8, OSGEP, OSGEPL1, OTUD4, P2RY10, PAIP2B,

PARP12, PAXBP1, PCF11, PCMTD2, PDCD6, PDCL, PDLIM1, PDS5A, PEX12, PFKM, PGF,

PHF2, PHF20, PHKB, PHLDA1, PHTF2, PIAS1, PIKFYVE, PITPNA, PKN2, PKNOX1,

PLA2G4C, PLAG1, PLAGL1, PLEKHF2, PLEKHM1, PODNL1, POGZ, POLR1B, PPAP2A,

PPFIA1, PPIP5K1, PPP1R12A, PPP3CA, PPP6R3, PRDM10, PRDX2, PREPL, PRKAA1,

PRKACB, PRKAR2A, PRKD3, PRPF39, PRPF4B, PRR11, PRRC2C, PSME4, PSMF1, PTBP2,

PTEN, PTGER4, PTK2, PTPN6, PTPRC, PTPRK, PUM1, PYROXD1, QRSL1, RAB11FIP2,

RAB14, RAB3GAP1, RABGAP1, RABGAP1L, RAD52, RALB, RALGAPA1, RALGAPB,

RALGPS1, RALGPS2, RAP2C, RBL2, RBM25, RBM39, RBM48, RBM5, RBMS1, REL, REPS1,

REST, RFWD3, RFX5, RFX7, RGP1, RIOK3, RNF219, RNF38, RPL28, RPS15A, RPS6KA5,

RRAS2, RREB1, RSF1, RSRP1, RUFY2, RUNX1-IT1, SACS, SCAF4, SCRN1, SDCCAG3,

SEC24B, SECISBP2, SECISBP2L, SEH1L, SERGEF, SETBP1, SFTPB, SH3BP5, SHMT2,

SHOC2, SIAH1, SIRT5, SKAP2, SLC15A2, SLC25A24, SLC2A3, SLC2A6, SLC35D2, SLC35E1,

SLC35E3, SLC38A6, SLC46A3, SLC4A7, SLK, SMA4, SMARCA2, SMC6, SMYD2, SNAP23,

SNAPC3, SND1-IT1, SNX10, SNX3, SPAG16, SPG11, SPG21, SPIDR, SPTBN1, SRPK2, SRSF11,

ST13, ST8SIA4, STAP1, STEAP1, STK38, STRN, STX7, SUN1, SUPT20H, SUV420H1, SWAP70,

SYK, SYNRG, TAB2, TAF9B, TANK, TAOK1, TARBP1, TASP1, TBC1D5, TBC1D9, TCF12,

TCF4, TCL1B, THAP9-AS1, THOC1, TIA1, TIPRL, TLK1, TM2D1, TMBIM4, TMCC2,

TMEM168, TMEM212, TMEM41B, TMEM63A, TMEM9B, TNFAIP8, TNFRSF10C, TNKS2,

TOB2, TPR, TRAPPC2, TRIB2, TRIM38, TRIM52, TRIO, TRMT13, TSC22D2, TSNAX,

TSPAN13, TSPAN3, TSPYL1, TSPYL4, TSPYL5, TSR1, TTC13, TTN, TTR, UBE2D4, UBE3B,

UBP1, UBQLN4, UBXN7, USP15, USP22, USP33, USP34, USP4, USP47, USP6, USP6NL,

USP9X, UST, UTP6, UTRN, UVRAG, VAV3, VPS13A, VPS13B, VPS13C, WAPAL, WDR11,

WDR60, WDR77, WDR82, WNK1, WWC3, WWOX, XIST, YTHDC2, YWHAB, ZBED2, ZBTB1,

ZBTB20, ZBTB24, ZC3H7B, ZCCHC11, ZFC3H1, ZFX, ZKSCAN7, ZMYM6, ZMYND11,

ZNF107, ZNF142, ZNF146, ZNF154, ZNF160, ZNF26, ZNF273, ZNF280D, ZNF33B, ZNF354A,

ZNF43, ZNF468, ZNF506, ZNF510, ZNF518A, ZNF529, ZNF532, ZNF562, ZNF573, ZNF587,

ZNF611, ZNF665, ZNF669, ZNF675, ZNF701, ZNF721, ZNF75D, ZNF764, ZNF85, ZNHIT6

CD19
Green
ABCB7, ABHD10, ABHD6, ACAA1, ACO1, ACTR3B, ACYP2, ADD2, ADIPOR2, ADK, ADO,

ADPGK, ADPRM, ADRB2, AGO1, AGO4, AKAP10, AMIGO2, ANGEL2, ANKH, ANKRD26,

ANKRD27, ANKRD6, AP4S1, APC, AQR, ARFGEF1, ARHGAP19, ARHGAP24, ARHGAP32,

ARHGEF10, ARHGEF5, ARHGEF9, ARIH1, ARL8B, ARMCX2, ATG12, ATG7, ATP5S,

ATP6V0E1, ATP6V1H, ATP7A, ATP9B, ATRN, AVL9, BAG5, BBS4, BBS9, BDH2, BPHL, BRE,

BRWD1, BTBD3, BTBD7, C10orf2, C11orf30, C11orf95, C1orf109, C2CD2, C2orf42, C2orf44,

C9orf78, CACFD1, CALCOCO2, CAMSAP1, CARKD, CARS2, CASP4, CBX1, CCDC25,

CCDC28A, CDC16, CDYL, CELSR1, CENPI, CEP162, CEPT1, CFAP44, CFDP1, CGRRF1,

CHD1L, CLNS1A, CNTNAP2, COA1, COQ7, COX11, COX7C, CPQ, CRCP, CREBL2, CRKL,

CROT, CRYBG3, CRYZL1, CTNS, CTR9, CTSK, CUL3, CUL4A, CUZD1, CXorf57, CYP2C8,

DAPK2, DAZAP2, DAZL, DCLK2, DDX27, DDX28, DDX42, DEGS1, DENND2D, DFNA5,

DHX40, DHX57, DIEXF, DLG3, DNAJA3, DNAJC8, DNASE1L1, DOPEY1, DPF2, DPH5, DPP8,

DST, DUS4L, DYNC1H1, EBLN2, EIF2B5, ENAH, ENTPD1-AS1, EPM2A, EPM2AIP1, ERCC3,

ERCC5, EXOC1, EXOC2, EXT2, FAM149B1, FAM172A, FAM50A, FAM50B, FAN1, FANCF,

FBXL14, FBXL4, FEZ2, FGGY, FHL1, FIG4, FKTN, FMO5, FNBP1L, FNTA, FOXJ3, FRAT2,

FTH1, FTSJ3, GALNT11, GAPVD1, GAS2, GCC1, GCNT1, GGNBP2, GIN1, GLTSCR1L,

GNG11, GNL3L, GNPAT, GOLGA1, GPATCH1, GPM6A, GPM6B, GPR65, GSTA1, HARS2,

HAUS5, HCP5, HDDC2, HEATR6, HEMK1, HILPDA, HLCS, HMGCS1, HN1L, HNRNPH3,

HOMER1, HPS1, HPS4, HS3ST1, HSD17B7, HSDL2, IDI1, IFFO1, IFNAR1, IFT74, IL18, IL24,

IL27RA, IMPA2, ING1, INTS9, INVS, IP6K2, IRAK3, ITGAE, ITGB5, ITPR2, IVNS1ABP, JAM3,

KANSL2, KATNA1, KCTD2, KIAA0586, KIAA0753, KIDINS220, KIZ, KLF3-AS1, KLHDC10,

KRT18, KYNU, L3MBTL1, LAMP1, LARGE, LARS2, LASP1, LCMT2, LEP, LETM1, LGR4,

LINC00667, LMO2, LOC100129361, LOC389906, LPCAT3, LRRC1, LRRC47, LRRC8B, LSG1,

LUC7L, LYRM1, MAEA, MAGEF1, MAMLD1, MAP2K1, MAP2K7, MAP3K7CL, MAP3K9,

MAPK14, MARS, MAT2A, MCCC1, MCF2L, MDC1, METTL13, MICALL1, MID1, MIPEP,

MKRN2, MLH3, MORC4, MPPE1, MPPED2, MRFAP1L1, MRS2, MTERF1, MTMR2, MTMR3,

MTOR, MTRF1, MTUS1, MYBL1, MYH3, MYO1B, MYO1E, MYOM1, NACA, NAIP, NBEA,

NCBP2, NCKAP1L, NFS1, NHP2L1, NIPBL, NKRF, NOP14-AS1, NPAT, NPC1, NPFF, NSMAF,

NSMCE4A, NUBP1, NUDCD3, NUDT13, NUP160, OARD1, OCRL, OPA1, OR10H1, OSBPL2,

OVGP1, PAPD7, PCM1, PCYT1A, PDCD11, PDZD8, PEX3, PEX5, PFAS, PHACTR1, PHF3,

PHIP, PIAS3, PIAS4, PIGB, PIGV, PJA1, PLCXD1, PLEKHA8P1, POLI, POLR1C, POLR2J4,

PON2, POU2F1, PPARD, PPARGC1A, PPCS, PPFIBP2, PPM1B, PPP1R12B, PRPF6, PRPSAP1,

PRR5L, PSPC1, PTCD3, PTPRN2, PUM2, PUS1, QTRTD1, RAB9A, RABEP1, RBM41, REPS2,

REV1, RFPL3S, RGS7, RHOH, RMND5B, RNASEH2B, RNFT2, RNMT, RPA4, RPL10L,

RPL23AP32, RPL37, RPL37A, RPP38, RPS6KA2, RRAGB, RRN3P1, RRP12, RSL1D1, RTN1,

RUFY1, RWDD3, SAMM50, SAYSD1, SCAF8, SCAPER, SCD, SCN2B, SCRIB, SDHAF1,

SEC14L1P1, SEC16A, SEC22A, SEC62, SEMA3F, SEPHS2, SEPT7, SERPINB6, SETD4, SETX,

SF3B3, SIK2, SLA, SLC24A1, SLC25A13, SLC25A37, SLC25A38, SLC36A1, SMARCAL1,

SMEK2, SMIM14, SNAPC4, SNRNP200, SNRNP35, SNX5, SOBP, SON, SP140, SPATA2,

SPRY1, SRD5A1, SS18L1, ST3GAL6, ST6GAL1, STK17A, STRADA, STX12, STX2, SUPT16H,

SUPT7L, SYF2, SYT17, TACC1, TBCC, TBL1X, TBRG4, TBX19, TBXA2R, TCEAL1, TCEAL4,

TCF7L2, TCL6, TDRD3, TECPR2, THNSL2, THOC2, TIMM22, TIMM8A, TJAP1, TLDC1, TLE1,

TLR1, TM6SF1, TMA7, TMEM127, TMEM186, TMEM231, TMEM251, TMEM62, TNFSF4,

TOMM20, TOMM34, TOP2B, TOPORS-AS1, TP53TG1, TPH1, TPST1, TPT1P8, TRAF3IP3,

TRAK2, TREML2, TRIM32, TSEN2, TTC19, TTLL5, TUBBP5, UBAP2L, UBE2D2, UBE4A,

UCN, UGT2B28, UIMC1, UNC119B, UPF3A, URB2, URGCP, UROD, USPL1, VCL, VIPAS39,

VPRBP, VPS26A, VPS33B, VPS37C, VPS41, WBP1L, WDR48, WDR73, WRAP73, WWP2, XPA,

XYLT2, YARS2, YLPM1, YPEL1, YWHAZ, ZBTB3, ZCCHC24, ZCWPW1, ZFYVE26, ZHX3,

ZKSCAN4, ZMYM1, ZMYND8, ZNF10, ZNF112, ZNF133, ZNF135, ZNF140, ZNF16, ZNF165,

ZNF180, ZNF189, ZNF200, ZNF202, ZNF213-AS1, ZNF223, ZNF224, ZNF225, ZNF227, ZNF23,

ZNF236, ZNF239, ZNF254, ZNF271, ZNF337, ZNF350, ZNF394, ZNF415, ZNF432, ZNF45,

ZNF473, ZNF493, ZNF516, ZNF544, ZNF571, ZNF587B, ZNF614, ZNF623, ZNF638, ZNF671,

ZNF696, ZNF7, ZNF710, ZNF74, ZNF813, ZNF93, ZSCAN26, ZSCAN9

CD19
Skyblue
ABAT, ABCA11P, ABCB4, ABCD4, ABLIM1, ACACB, ACSL5, ACTR5, ADAM28, ADAP2,

ADCK2, ADCK3, ADCY7, ADD1, ADNP2, AEBP1, AGBL2, AHCYL2, AKAP8, AKR7A2,

AKT3, ALAD, ALDH2, ALG13, ALOX5, AMPD3, AMT, ANKEF1, ANKMY1, ANKRD11,

ANKRD49, ANKZF1, AP1S2, AP2A2, APLP2, APPL2, ARAP2, ARHGAP15, ARHGAP17,

ARHGAP25, ARHGEF7, ARID5B, ARIH2, ARL4A, ARL6IP5, ASB1, ASB13, ASMTL, ASTE1,

ASXL1, ATG14, ATG4B, ATM, ATXN7L3B, AUTS2, B3GALT4, B3GNTL1, BACH2, BANP,

BCL2, BCL6, BCLAF1, BCORL1, BEND5, BEX4, BIN1, BNIP3L, BPTF, BRD1, BRD3, BTAF1,

BTG1, BTN2A1, C10orf76, C12orf29, C14orf93, C21orf33, C2orf68, C3orf18, C5orf45, C6orf120,

CAMLG, CAMTA2, CAPRIN2, CASD1, CAST, CBFA2T2, CBLB, CBLL1, CBR3, CBX7,

CCBL1, CCDC101, CCDC109B, CCDC22, CCDC93, CCNB1IP1, CCNG1, CCNI, CCNL1,

CCNL2, CCR7, CD1A, CD1D, CD200, CD22, CD244, CD2BP2, CD40, CD55, CD69, CD96,

CDC14A, CDK10, CDK19, CDK5RAP1, CDK5RAP3, CDKN1C, CECR7, CELF1, CEP164,

CEP170, CEP68, CHCHD7, CHD7, CHI3L2, CHMP1B, CHTOP, CLASRP, CLCN6, CLEC11A,

CLK1, CLK4, CLMN, CMPK1, CNBP, CNOT2, CNPPD1, CNTRL, COL4A3, CREBBP, CRELD1,

CRLF3, CSDE1, CSRNP2, CTDSP2, CTNNBL1, CTSB, CUX1, CXCR4, CYFIP2, CYHR1, CYLD,

DAG1, DCAF10, DCAF8, DDHD2, DDX17, DEK, DENND4C, DEPDC5, DFFB, DGKD,

DHRS12, DICER1, DIDO1, DIP2C, DMTF1, DMXL1, DNAJC11, DOK1, DPEP2, DPYD, DSTYK,

DVL1, DYNLT3, DYRK1A, DYRK2, ECD, ECHDC2, EEF1A1, EEF1D, EFCAB14, EGR1,

EIF1B, EIF3E, EIF3F, EIF3L, EIF4B, ELL3, ENGASE, EP400, EPB41L2, ESD, EVL, EXOC3,

EXOSC10, EZH1, FAM111A, FAM134C, FAM160B2, FAM168A, FAM168B, FAM193A,

FAM208A, FAM32A, FAM46A, FAM60A, FBXL12, FBXL15, FBXL5, FBXO11, FBXO21,

FBXO9, FKBP15, FLJ10038, FNBP1, FNBP4, FOXJ2, FOXO1, FRYL, FUCA1, FYCO1,

GABBR1, GCC2, GDPD3, GGA2, GGPS1, GIT2, GLOD4, GMEB2, GMFB, GNA11, GNA12,

GNB5, GOLGA7, GOLGA8A, GON4L, GOSR1, GPBP1L1, GPR107, GPRASP1, GRAMD1B,

GSAP, GSDMB, GSE1, GSTM4, GTF3C2, GVINP1, H2AFJ, HAGH, HBP1, HEBP1, HECA,

HERC1, HFE, HIVEP2, HLA-DQB1, HLA-E, HLA-F, HLA-F-AS1, HNRNPA0, HNRNPA1,

HNRNPA3, HNRNPDL, HNRNPL, HPS6, HSBP1, HSD17B11, HSPBAP1, HTATIP2, HTRA2,

HUWE1, ICAM3, ID3, IER5, IFT57, IFT88, IKBKB, IL11RA, IL16, IL4R, ING4, INPP5D, IRS2,

IST1, ITM2B, JADE1, JADE2, JAK1, JARID2, JMJD1C, JRK, KAT2A, KAT6A, KBTBD2,

KCNQ1, KDM2A, KDM3A, KDM4B, KDM5B, KDM7A, KIAA0141, KIAA0226L, KIAA0247,

KIAA0430, KIAA0907, KIAA0922, KIAA0930, KIAA1467, KLF11, KLHDC2, KLHL22, KPNA4,

KRBOX4, LAIR1, LAMC1, LDOC1, LETMD1, LHFPL2, LIN37, LINC00094, LINC00341, LIPT1,

LMAN2L, LMBR1L, LMF1, LOC202181, LOC647070, LOC728392, LPIN2, LRIG1, LRRC37A2,

LRRC40, LTA4H, LTB, LY75, LYRM9, LYST, MADD, MAML1, MAN1B1, MAN1C1,

MAN2A2, MAN2B2, MANBA, MAP2K5, MAP3K4, MAP4K4, MAPK1, MAPK9, MAPKAPK5-AS1,

MAPRE2, MARCH8, MARCKSL1, MAX, MBNL1, MCM3AP, MEAF6, MECP2, MED13,

MED23, METRN, METTL17, MGA, MGAT5, MIA3, MICA, MICAL3, MKKS, MKNK1, MKRN1,

MLYCD, MOAP1, MPRIP, MTCH1, MTERF4, MTF1, MTMR1, MTMR9, MYCBP2, MYL12B,

MYO9A, MZF1, NADSYN1, NAP1L1, NAT10, NBPF1, NCK2, NCR3, NDRG3, NDST2,

NECAP2, NEK7, NEK9, NFATC1, NFATC3, NGRN, NISCH, NKTR, NLRP1, NOD1, NR3C1,

NR3C2, NREP, NRF1, NRIP1, NSUN5P1, NT5E, NUP210, NUP214, NUP88, OAZ2, OFD1,

OGG1, OSER1, OTUD3, P2RX5, P2RY14, PAFAH1B1, PAN2, PANK4, PAPOLG, PARP6,

PARP8, PASK, PATZ1, PCBP2, PCGF3, PCNT, PCNXL2, PDE8A, PECAM1, PEG10, PFDN5,

PGAP3, PGS1, PHC1, PHF11, PHF21A, PHKA2, PHLPP2, PIBF1, PIGA, PIGG, PIK3C2B,

PIK3CD, PIK3R1, PIK3R4, PILRB, PIN4, PISD, PKI55, PKIA, PLCL2, PLEKHJ1, PNISR, PNN,

PNRC1, POLG2, POLR2G, PPM1F, PPOX, PPP1R16B, PPP6R2, PRDM2, PRDM4, PRKCB,

PRKCZ, PRKRIR, PRKX, PRMT2, PRNP, PRPF3, PRRC2B, PRUNE, PSIP1, PTPLB, PTPRE,

PTPRO, PTTG1IP, PURA, PWP2, QRICH1, RAB33B, RANBP10, RANBP6, RASA1, RBBP6,

RBM10, RBM12, RBM19, RBM4, RBM4B, RBM6, RECK, RGL2, RGS14, RIN3, RLF, RNASET2,

RNF111, RNF125, RNF141, RNF41, RPARP-AS1, RPL11, RPL14, RPL18, RPL22, RPL24, RPL27,

RPL31, RPL34, RPL35, RPL35A, RPL39, RPL6, RPL7, RPS12, RPS16, RPS17, RPS21, RPS23,

RPS25, RPS27A, RPS28, RPS29, RPS3, RPS6, RPS6KA3, RPS7, RRAGA, RRNAD1, RSAD1,

RSBN1, RXRB, RYK, S1PR1, SAFB, SAP18, SARAF, SAV1, SC5D, SDCBP, SDR39U1, SEC31B,

SELL, SENP6, SEPT6, SERTAD2, SESN1, SET, SETD2, SETDB1, SF3B1, SFPQ, SFSWAP,

SGPL1, SH2B3, SH3BGRL, SH3YL1, SIK3, SIN3B, SIRT1, SKP1, SLC23A2, SLC25A36,

SLC25A44, SLC37A1, SLC6A16, SLC7A6, SLC9A8, SMAD3, SMARCE1, SMIM7, SNN, SNRK,

SNUPN, SNX1, SNX11, SNX2, SNX27, SOCS5, SORL1, SOS2, SP140L, SPECC1L, SPG7,

SPSB3, SRRM2, SRSF5, SRSF6, SRSF7, SRSF8, SSBP2, SSH1, ST3GAL1, STAG2, STAT5A,

STAT5B, STK19, STK26, STX16, STX4, STX6, SUGP2, SYNE1, SYPL1, TAF1A, TAF1C,

TAF1D, TAF7, TAOK3, TARDBP, TAZ, TBC1D13, TBP, TCF3, TCF7, TCTN1, TGFBR2,

TGFBRAP1, TGIF1, TGIF2, TM2D3, TMCC1, TMCO6, TMEM134, TMEM164, TMEM2,

TMEM243, TMF1, TMUB2, TMX4, TNFAIP3, TNFRSF10B, TNKS, TNS3, TOM1, TP73-AS1,

TPCN1, TPT1, TRAF3, TRAF5, TRAK1, TRAPPC10, TRAPPC12, TRAPPC8, TRIM13, TRIM27,

TRIM33, TRIM44, TRIM68, TRMT61B, TSC1, TSC22D3, TSEN34, TTC12, TTC31, TTC9, TTF1,

TUBA1A, TUG1, TXNIP, TXNL4B, UBE2G2, UBE2I, UBQLN2, UBR2, UBR4, UBXN1,

UNC45A, USP12, USP24, USP7, UXS1, VAMP4, VEZF1, VILL, VPS11, VPS13D, VPS35, VPS39,

WASF1, WDR19, WDR37, WDR55, WDR59, WDR5B, WIPF2, WIPI2, WSB1, XPC, XPO6,

XRCC2, XYLT1, YPEL5, YTHDC1, YY1AP1, ZBED5, ZBTB11, ZBTB14, ZBTB40, ZBTB5,

ZBTB7A, ZC3HAV1, ZDHHC17, ZDHHC6, ZHX2, ZKSCAN5, ZMYM4, ZMYM5, ZNF106,

ZNF134, ZNF136, ZNF137P, ZNF14, ZNF195, ZNF204P, ZNF211, ZNF212, ZNF217, ZNF222,

ZNF230, ZNF232, ZNF235, ZNF24, ZNF248, ZNF266, ZNF268, ZNF274, ZNF277, ZNF302,

ZNF304, ZNF318, ZNF32, ZNF329, ZNF37BP, ZNF395, ZNF419, ZNF443, ZNF451, ZNF551,

ZNF592, ZNF606, ZNF672, ZNF767P, ZNF83, ZNF839, ZNF84, ZRSR2, ZSCAN18, ZSCAN32,

ZXDC, ZZEF1

CD33
Royalblue
ACTC1, ADAMTS2, ADD1, AGGF1, AGPAT1, ANXA9, APBB1IP, APLP2, ARHGDIA,

ARHGEF11, BCL2L1, C10orf76, C1orf115, CA5BP1, CACNB2, CFAP70, CLDN4, COCH,

COL8A2, CPSF6, CSNK1G1, CYB5R2, CYP1A2, DACH1, DBT, DEDD, DOCK5, DUSP7, E2F4,

ECE2, EPN1, ERCC8, ERV9-1, FADS2, FAM124B, FAM13A, FASN, FUT6, GNAQ, GTPBP1,

HFE, HIST1H2AM, HMGCS2, HTATSF1, IGLJ3, KCNJ3, KCNMB3, KLHL26, LDLRAD4,

LPAR1, MAP7, MATN1, MED14, MOB4, MVB12B, NDN, NREP, OPRL1, OR7E12P, PANK3,

PDLIM7, PDX1, PEX14, PLAA, PLCE1, PLSCR2, PNPLA2, PPARG, PPP2R1B, PRKCA, PSEN2,

PTCRA, PTGFR, PTK2B, PTPN1, PXMP4, RBKS, RERE, RITA1, SCN1B, SIGLEC7, SP1,

SPATA2, STAP2, STRN4, STYXL1, TBC1D10B, TDRD12, TMEM45A, ZHX3, ZNF155, ZNF235,

ZNF536, ZNF556

CD33
Sienna3
ACSL1, ACVR1, ADM, AP2B1, APOL6, AQP9, ARID5B, ATP2A2, B3GALT4, BCAS2, C2,

CALU, CCL2, CCR1, CD38, CD63, CDS2, CHD7, CHST11, CLIC4, CSF2RB, CSNK1A1, CUL1,

CXCL8, DCTN5, DDX60, DESI1, DNASE2, DRAM1, EGR1, EMC1, EMC7, EPHB2, ERO1L,

ERP44, FCAR, FCGR1B, FFAR2, FLVCR2, GALNT2, GAS7, GK, GNAI3, GNPDA1, HPSE,

IER2, IFI27, IFI35, IFI6, IFIT2, IFIT3, IGSF6, IL15RA, IL1RN, IRF7, ISG20, JMJD6, JUN,

KCNJ15, KHNYN, KLF9, KMO, KYNU, LAMP1, LAP3, LDHA, LDLR, LEPROT, LGALS3BP,

LGALS8, LIMK2, LMNB1, LXN, MAPK1IP1L, MED6, MR1, MT1HL1, MT1X, MT2A, N4BP1,

NAIP, NAMPT, NAPA, NMI, NRAS, NUDT15, OASL, PANX1, PEF1, PLIN3, PLSCR1, POP4,

PPP1R2, PRPF18, PSMD12, PSMD14, QKI, RAB5A, RAB8B, RIPK2, RIT1, RNF19B, RSAD2,

SC5D, SEC24A, SERPING1, SIRPA, SLC12A6, SLC2A3, SLC31A2, SLC38A2, SLC39A8,

SPATS2L, SQLE, SREBF2, SRP54, STAT1, SUMO3, SYNJ2, TAP1, TCEB1, TFEC, TMCO1,

TMEM180, TMSB10, TNFAIP6, TOR1B, TRAFD1, TRIP4, TXN, XBP1, YME1L1, ZDHHC3

CD33
Violet
ACOT13, ACOX1, ACPP, ACTA2, ADAR, ADARB1, AIM2, APOBEC3G, ATP10A, BCL2L13,

BLM, BLVRA, BLZF1, C15orf39, C19orf66, C1QA, C2orf47, CACNA1A, CBR1, CD2AP, CDK7,

CEBPG, CHMP5, CHST12, CLU, CMC2, CMTR1, CNIH4, CNP, COA3, COX17, CR1, CSNK1D,

CYB5A, DBI, DDA1, DDX58, DHX58, DHX8, DNAJA1, DPM1, DYNLT1, DYSF, EIF2AK2,

ENOPH1, ERGIC2, ETFDH, ETNK1, EXT1, EXT2, F2RL1, FAM49A, FAM69A, FAM8A1, FAS,

FKBP15, FOLR3, FXYD6, GCH1, GLE1, GMPR, GRK6, GRPEL1, GTPBP2, HERC5, HERC6,

HGF, HINFP, HIST2H2BE, HSPB11, IFI44, IFI44L, IFIH1, IFIT1, IFIT5, IFITM1, IFITM3, IGJ,

IGKC, IGLC1, IK, ING1, IRF2, ISG15, JUP, KEAP1, KIAA0040, KIAA0226, KLHL7, LAIR2,

LARP7, LEPROTL1, LILRA2, LILRA5, LOC100996756, LY6E, LY96, MAD2L1BP, MCTS1,

MICU1, MRPL22, MSRB2, MX1, MX2, NCALD, NDFIP1, NDUFS6, NFYA, NRN1, NSUN3,

NUCB1, NUDT9, OAS2, OAS3, OSBPL1A, PAK2, PDK3, PGGT1B, PHF11, PML, POLB,

PPP1R3D, PPP2R2A, PSMA4, PSME3, QRSL1, RBMS2, RIN2, RNF34, RPS6KC1, RTP4,

SAMD4A, SAMD9, SAP30L, SCCPDH, SEC22B, SIGLEC1, SLC25A37, SMAD3, SMCHD1,

SNRPG, SNTB1, SORT1, SP100, SP110, SP140, SPATA5L1, SPTLC2, SRD5A1, SRP19, STAP1,

STEAP4, STX17, SULT1B1, TBPL1, TCF4, TCN2, TDRD7, TLE3, TMEM140, TMEM2,

TMEM255A, TMOD3, TNFSF10, TRIM14, TRIM21, TRIM22, TROVE2, UAP1L1, UBE2K,

USP18, VAMP1, VTI1B, WDFY3, WWC3, XRCC4, ZNF322

CD33
Darkmagenta
ABCC4, ACTR1B, AHI1, AK1, AKAP13, AKAP8L, ALG12, ANGEL1, ANK3, ANKEF1, APOM,

AQP3, ARFIP2, ARFRP1, ARHGEF16, ARRB1, ARTN, ASAP3, ATAD3A, ATN1, B3GAT1,

B4GALT1, BAHD1, BATF3, BCL2, BCL7A, BIN1, BOP1, BTBD2, C10orf2, C14orf1, C16orf45,

C16orf58, C5orf45, CA11, CABP1, CACNA1I, CACNA2D2, CASP10, CD5, CD74, CD79A,

CDHR1, CDK16, CEP170B, CES2, CFHR2, CIC, CLEC10A, CLOCK, CLUAP1, CMTM6,

CNTFR, COL1A2, COMT, COQ3, COQ4, COQ7, COX11, CRELD1, CRTAC1, CRY2, CRYGD,

CRYM, CSF1R, CTSF, CWF19L1, CXADR, CYLD, DAO, DDX31, DECR2, DIEXF, DMPK,

DNPEP, DNPH1, DOCK6, DOLPP1, DPH2, DSPP, DTNA, DZIP3, ECHS1, EHD2, EIF3A, EIF5A,

ENPP1, EPOR, ERBB2, ESR2, EVX1, EXOSC4, FAM153A, FANCC, FBXO2, FBXO31, FCF1,

FGF4, FGF6, FHL1, FKBP4, FUBP1, FZR1, GAMT, GDF5, GFER, GJB3, GLP1R, GNB1L,

GOLGA2P5, GOLGA8A, GON4L, GRHL2, HABP4, HADH, HEMK1, HGH1, HIP1R, HIST1H1T,

HIST3H2A, HLA-DPA1, HLA-DQB1, HMGA1, HMOX2, HNRNPA0, HRK, ICAM4, IGH,

IGSF9B, IPO9, ISYNA1, IZUMO4, KCNA3, KDM8, KLF12, KLHDC3, LIMD2, LIMS2,

LINC00260, LINC01278, LRRC16A, LTA, LTBP4, MACROD1, MAZ, MCM3AP, MCM3AP-AS1,

MEGF6, MGAT4B, MID2, MMP19, MRM1, MRPL12, MRTO4, MUC5AC, MUC8, MYO1C,

NAA40, NECAB3, NF2, NFATC1, NFATC2IP, NIPAL3, NIPSNAP1, NPAT, NPM3, NPRL3,

NPTXR, NR3C2, NRF1, NRL, NSG1, NUBPL, NUFIP1, OSBP2, PCBP4, PCYT2, PDLIM4,

PGAP2, PHC1, PHGDH, PHLPP2, PIK3IP1, PLCG1, PLCH2, PLXNB2, PNPLA4, POF1B,

POLD4, POLR3G, POU6F1, PPDPF, PPIP5K1, PPP1R13B, PPP2R5D, PREPL, PSAT1, PTGIR,

RAB11FIP3, RAB2A, RAB40B, RBM19, RCL1, REXO4, RFTN1, RGS12, RNPS1, ROBO3,

RPS12, RRP1B, SAFB2, SBF1, SEMA3G, SF3B3, SGSM2, SH2D3A, SIVA1, SLC12A2,

SLC25A22, SLC25A4, SLC2A6, SLC5A5, SLC7A8, SMPD2, SNAP25, SOX12, SPDEF, SPTBN1,

SREK1IP1, SSBP3, STAG3, STRA13, SURF2, SYNGR3, TAC3, TCEA2, TCEB2, TCL6, TEAD3,

TFAM, TJP3, TLE2, TM7SF2, TMEM177, TMEM63A, TOP3B, TPT1P8, TRAF3IP3, TRAF4,

TRAK1, TRIM2, TSKU, TSPAN5, TTC28, TUBD1, TULP3, UBE2D4, UBE2O, USP13, USP5,

UTP20, VASH1, VPS13D, WDR59, WDR61, WDR73, WDR74, YPEL1, ZBTB38, ZFP36L2,

ZNF510, ZNF76, ZSCAN18

LDG
LDG_A
ABCC3, ABCC4, ABHD15, ABI2, ABLIM3, ACER3, ACRBP, ACSBG1, ACVR1, ADCY3,

ADRA2A, AFAP1, AFAP1L2, AFF3, AGBL5, AGPAT5, AIG1, AKIP1, ALDH1A1, ALOX12,

ANKRD28, ANO6, AP1S2, APP, AQP10, AR, ARHGAP18, ARHGAP21, ARHGAP32,

ARHGAP6, ARHGEF12, ARMCX3, ASAP2, ATP5E, ATP5S, ATP9A, AVPR1A, B4GALT6,

BACE1, BCL11A, BCL2L1, BCL2L2, BEND2, BET1, BEX3, BICD1, BLNK, BMP6, BMP8B,

C12orf75, C12orf76, C15orf52, C15orf54, C19orf33, C1orf198, C2orf88, C7orf73, CA13, CA2,

CALD1, CAMTA1, CANX, CASP6, CCDC88A, CD151, CD226, CD36, CDC14B, CDIP1,

CDK2AP1, CDK6, CDKL1, CDYL, CHD9, CLCN3, CLDN5, CLEC1B, CLIC4, CLU, CMTM5,

CNRIP1, CNST, COMT, CPED1, CPNE5, CRAT, CRLS1, CTC-338M12.4, CTDSPL, CTTN,

CXCL5, DAAM1, DAB2, DCLRE1A, DDX11L2, DENND2C, DIMT1, DMTN, DNAJC6, DNM3,

DPPA4, DPYSL2, DST, EGF, EGLN3, EHD3, ELOVL7, ENDOD1, ENKUR, EPB41L3, ERG,

ERV3-1, ESAM, F13A1, F2R, FAM20B, FAM212B-AS1, FAM65C, FAM69B, FAM81B,

FAXDC2, FHL1, FHL2, FKBP1B, FNBP1L, FRMD3, FSTL1, GADD45A, GAS2L1, GGTA1P,

GLCE, GMPR, GNA12, GNAZ, GNG11, GNG8, GP1BA, GP5, GP6, GPX1, GRAP2, GRB14,

GSTP1, GUCY1A3, GUCY1B3, H1F0, H2AFJ, HEMGN, HEXIM2, HGD, HIST1H2AE,

HIST1H2BJ, HIST1H2BO, HIST1H4I, HMGB1, HMGN1, HRASLS, IGF2BP3, IGKC, IGLC1,

IRS1, ITGA2B, ITGA9, ITGB1, ITGB3, ITGB5, JAM3, KALRN, KCND3, KIF2A, KLHL5,

LAPTM4B, LGALSL, LINC00853, LINC00938, LIPH, LMNA, LOC101928419, LOC105371967,

LOC105377276, LOC283194, LPAR5, LRBA, LTBP1, LYPLAL1, LZTS2, M1AP, MAGI2-AS3,

MAGOHB, MAP1A, MAP1B, MAP3K7CL, MAST4, MAX, MBTD1, MCM6, MCUR1, MEIS1,

MEST, MFAP3L, MGLL, MINPP1, MITF, MLH3, MMD, MMRN1, MOB1B, MPL, MSANTD3,

MSN, MTHFD2L, MTMR2, MTURN, MYB, MYCT1, MYL9, MYLK, MYNN, NAP1L1, NAT8B,

NCAPG2, NCK1-AS1, NCKAP1, ND4, NENF, NEXN, NIPA1, NLK, NORAD, NPRL3, NREP,

NRGN, NT5M, NUTM2A-AS1, OPN3, P2RY12, PANX1, PARD3, PARVB, PAWR, PBX1,

PCYT1B, PDE2A, PDE3A, PDE5A, PDGFA, PDGFC, PDLIM1, PDZD2, PDZK1IP1, PEAR1, PF4,

PF4V1, PGRMC1, PITPNM2, PKHD1L1, PKIG, PLA2G12A, PLEKHA8P1, PLOD2, PNMA1,

PPBP, PPM1L, PRDX6, PRG2, PRKAR2B, PROS1, PROSER2, PRTFDC1, PRUNE1, PSD3,

PSPH, PTCRA, PTGIR, PTGS1, PTK2, PTPN18, PTPRS, PXDC1, PYGB, RAB13, RAB27B,

RAB30, RAP1B, RAP2B, RBPMS2, RCC2, RDH11, RGS10, RHBDD1, RHOBTB1, RNF11,

RNF217, RSU1, SAV1, SCFD2, SCN9A, SDC4, SDPR, SEC14L5, SEPT11, SERPINE2,

SH3BGRL2, SH3TC2, SHTN1, SIAE, SLA2, SLC25A43, SLC35D2, SLC35D3, SLC44A1,

SLC8A3, SMAD1, SMIM24, SMIM5, SMOX, SNAPC3, SNCA, SNPH, SOX4, SPARC, SPOCD1,

SPSB1, SPX, SSX2IP, ST3GAL3, STMN1, STON2, STRADB, SYNM, SYTL4, TAL1, TARBP1,

TBXA2R, TCEAL8, TCF4, TCL1A, TDRP, TEX2, TFB1M, TFPI, TGFB1I1, TGFBI, THBS1,

THRB, TLK1, TLR7, TMCC2, TMEM158, TMEM40, TMEM45A, TMEM64, TNFSF4, TNIK,

TNS1, TNS3, TPM1, TPSAB1, TPSB2, TPST2, TPTEP1, TRBV27, TREML1, TRIM10, TRIM13,

TRIM58, TSC22D1, TSPAN18, TSPAN33, TSPAN9, TTC7B, TUBB, TUBB1, TWSG1, UBE2E2,

UBE2O, UBL4A, UGCG, USP12, USP31, UXS1, VCL, VEPH1, VIL1, VSIG2, VWA5A, VWF,

WASF1, WASF3, WDR11-AS1, WHAMMP2, WRB, WWC1, XK, XPNPEP1, YIF1B, YWHAE,

YWHAH, ZBTB16, ZC3HAV1L, ZNF175, ZNF271P, ZNF367, ZNF431, ZNF521, ZNF529-AS1,

ZNF542P, ZNF677, ZNF718

LDG
LDG_B
ABCA13, ARG1, ATP8B4, AZU1, CAMP, CEACAM6, CEACAM8, CHIT1, CLEC12A, CLEC5A,

CPNE3, CRISP3, CTSG, CYBB, DEFA4, ELANE, HP, LCN2, LTF, MGST1, MMP8, MPO,

MS4A3, OLFM4, OLR1, RNASE3, SERPINB10, SLC2A5, STOM, TCN1, ANLN, BIRC5, BUB1B,

CCNA1, CDK1, CDKN2B, DHFR, GFI1, INHBA, IQGAP3, KIAA0101, KIF11, KIF14, KNL1,

MIS18BP1, NCAPG, RGCC, RRM2, SKA2, TOP2A, TYMS, AGPS, ANXA4, ATP23, BCL2L15,

BEX1, CD24, CTBP2, CTC1, DCBLD2, ECRP, ERG, FBXO9, GALNT10, GCLM, GLOD5,

GVINP1, HMGB2, HMGN2, KBTBD6, LINC00323, LMO4, MED7, NFYC, NUCB2, PCOLCE2,

PDLIM5, PLEKHA3, PPFIA4, RPE, SCD, SENP1, SLC28A3, SMIM8, TACSTD2, TCTEX1D1,

THBS4, TMEM234, TMEM50B, TMLHE, TRMT5, ZNF788

PC
PC_Up
AAK1, ADA, ADCYAP1, ADGRB1, AGK, AHCYL2, ALG5, ALG9, AMOTL2, ANG, ANKS1B,

APOA4, AQP3, ARF4, ARHGEF40, ARL1, ASIC1, ASPM, ATF5, ATP11A, ATP1A2, ATP2A2,

AURKA, B4GALT3, B9D1, BAZ1B, BCAN, BIK, BIRC5, BMP8B, BSCL2, BUB1, BUB1B,

C11orf80, C1GALT1C1, C1orf27, CA6, CADM1, CADM3, CALML4, CALR, CALU, CASP3,

CAV1, CCNA2, CCNB1, CCNB2, CCNC, CCND2, CCNE2, CCR10, CD27, CD300A, CD320,

CD38, CD59, CD6, CDC20, CDC25A, CDC42BPA, CDC6, CDCA3, CDKN2C, CDKN3, CDR2,

CENPE, CENPN, CENPU, CEP55, CEP97, CFLAR, CHAC1, CHEK1, CHPF, CHST12, CHST2,

CITED2, CKAP4, CLIC3, CLINT1, CNKSR1, CNPY2, COL9A3, COPA, COPB2, COX11,

COX7A2, CRB1, CRELD2, CSF2RB, CSHL1, CSNK1E, CTNNAL1, CYP11B2, CYP26A1,

CYP2E1, DCPS, DDOST, DENND1B, DERL1, DERL2, DLGAP5, DNAJC1, DNAJC3, DOK4,

DRD4, DSTN, E2F8, EDEM2, EDEM3, EFS, ELL2, ERAP1, ERCC6L, ESPL1, ESR1, EXOSC4,

EXT1, FAAH, FABP5, FAM149A, FAR2, FAXDC2, FBXO5, FDX1, FEN1, FKBP11, FKBP2,

FNDC3B, FOLH1B, FUT8, FZD7, GAB1, GADD45A, GALNT2, GARS, GAS6, GC, GCSH, GFI1,

GGH, GLRX5, GMNN, GMPPA, GMPPB, GNAS, GOLT1B, GPLD1, GPR15, GPRC5D, GRIK1,

GSC2, GSPT1, H2AFX, HDLBP, HIBCH, HIST1H2AM, HIST1H2BB, HIST1H2BC, HIST1H2BG,

HIST1H3D, HIST1H4B, HIST1H4L, HJURP, HMGN5, HMMR, HPGD, HPX, HRH1, HSD11B2,

HSD17B8, HSP90B1, HSPA13, HSPA5, HYOU1, IDH2, IFNAR2, IGF1, IGHD, IGHG1, IGHG3,

IGHM, IGK, IGKC, IGKV1-5, IGKV1D-13, IGKV1D-8, IGL, IGLJ3, IGLL1, IGLL3P, IGLV3-19,

IGLV4-60, IL1R1, IL6R, IL6ST, INPP4A, IQGAP2, IQSEC2, IRF4, ITGA6, ITGB1BP1, ITM2C,

JCHAIN, KCNJ5, KCNK12, KCNN3, KDELC1, KDELR2, KIAA0101, KIF20A, KIFC1, KIR2DL4,

KLF10, KLK11, KLKB1, LAP3, LAX1, LDLRAD4, LGALS3, LIME1, LMAN1, LMAN2,

LRRC59, LSR, LZTS1, MAN1A1, MANEA, MANF, MAP2K6, MAPKAPK5, MAST1, MBNL2,

MCM10, MCM3AP, MCUR1, MELK, MGAT2, MIF, MKI67, MLEC, MORF4L2, MPHOSPH9,

MRPL22, MTNR1A, MTRR, MUC5B, MYCBP, MYDGF, MYO1D, NANS, NAT2, NAT8,

NCAPG, NCOA3, NDUFB6, NEK2, NES, NEU1, NEUROG3, NME1, NPIP, NPIPB15, NPM1,

NT5DC2, NUCB2, NUS1P3, NUSAP1, OGFOD3, OGT, P4HB, PAK5, PAM, PARP2, PCDHGA3,

PCSK4, PDE1A, PDIA2, PDIA4, PDIA6, PDK1, PDXK, PERP, PGM3, PHGDH, PIK3CG, PKP4,

PMM2, POU6F2, PPA1, PPCDC, PPIB, PRDM1, PRDX4, PREB, PROSC, PSMA3, PSMC2,

PTPRD, PTTG1, PTTG3P, PYCR1, PYCRL, R3HCC1, RAB27A, RAPGEF2, RBM47, RGS13,

RGS16, RPN1, RPN2, RRBP1, RRM2, RS1, RWDD2A, SAR1A, SAR1B, SCUBE3, SDC1,

SDF2L1, SEC13, SEC14L1, SEC23B, SEC24A, SEC24D, SEC61A1, SEC61B, SEC61G, SEL1L,

SELPLG, SEMA4A, SEPT10, SEPT4, SERPINF1, SGK1, SIL1, SLAMF7, SLC16A1, SLC16A6,

SLC19A1, SLC1A4, SLC1A7, SLC27A2, SLC31A2, SLC35B1, SLC7A11, SLC7A5, SLC9A3R1,

SLCO2B1, SLCO3A1, SLCO4A1, SLFN12, SMAD6, SPATS2, SPCS1, SPCS2, SPCS3, SPINK5,

SPRR1A, SRM, SRP19, SSR1, SSR3, SSR4, ST3GAL6, ST6GALNAC4, STARD5, STT3A,

SULT1C2, TAZ, TBL2, TECR, TIMM17A, TIMM44, TIMM8B, TIMP2, TIMP4, TK1, TLX3,

TM9SF1, TMBIM6, TMED10, TMED2, TMED5, TMEM184B, TMEM208, TMEM258,

TNFRSF17, TP63, TPP2, TPST2, TRA, TRAM1, TRAM2, TRAT1, TRD, TRIB1, TRIP13, TRIP6,

TSHR, TST, TUBG1, TXN, TXNDC15, TXNDC5, TYMS, UAP1, UBE2C, UCHL1, UCK2,

UGGT2, UQCRB, VDR, VEGFA, WARS, WHSC1, WIPI1, XBP1, XCL1, YIPF2, ZMYM2,

ZNF593, ZWINT

PC
PC_Down
ABLIM1, ABR, ADARB1, AKAP1, AKT3, ALOX5AP, ANKZF1, ARHGAP17, ARPC4, BANK1,

BCL11A, BIN1, BLK, BMP2K, C7orf26, CACNA1A, CAPN3, CBR3, CBX7, CCND3, CCR6,

CD19, CD1C, CD1D, CD22, CD37, CD72, CDK5R1, CEP170, CERS4, CIITA, CLCN4, CLIP2,

CNPPD1, COA1, CPQ, CSGALNACT1, CYLD, DCUN1D4, DDX24, DDX60, DEK, DENND5A,

DHX58, DPEP2, DYRK2, ELF4, FAM208A, FAM20B, FAM46A, FAM65A, FCGR2A, FCMR,

FGR, FOXO4, FYN, GAS7, GCNT1, GGA1, GGA2, GPD1L, GPR18, GRAP, GSAP, HCK, HHEX,

HIP1R, HLA-DMA, HLA-DMB, HLA-DOB, HLA-DPA1, HLA-DQB1, HLA-DRB1, HLA-DRB3,

HS3ST1, ID3, INPP5D, IRF5, IRF8, ITPKB, KDM4B, KIAA0141, KIF21B, KLF9, KLHDC10,

KMO, LAIR1, LAPTM5, LAT2, LBH, LINC00472, LIPA, LPGAT1, LYL1, LYST, MAPRE2,

MFHAS1, MNDA, MS4A1, MTSS1, MZF1, NAIP, NCR3, NLRP1, NOTCH2, NOTCH2NL, NT5E,

OPN3, P2RX5, PAX5, PCDH9, PDE4DIP, PDLIM2, PHC1, PIK3CD, PIKFYVE, PKIG, PLAC8,

PLCB2, PLEKHA1, PLEKHO1, POLD4, PPM1F, PRPF6, PRRC2B, PTK2, PTK2B, PTPN12,

PTPN6, PTPRCAP, RASGRP2, RBMS1, RIN3, RNF130, RNF141, RTL1, SAMD4A, SH3BP2,

SIDT2, SIPA1L1, SLC15A3, SMG1, SNAP23, SNN, SNX1, SNX2, SNX6, SPIB, SSBP2, STAT6,

STX7, SUSD5, SWAP70, SYNPO, SYPL1, TBC1D22A, TBL1X, TGFBR2, TMEM127, TNFSF12,

TNFSF13, TRAK1, TRAK2, TRIM34, TRIM38, TSPAN-3, TTC9, UNC119, UNKL, USF2, VAV3,

VEGFB, WASF2, XIST, ZBTB18, ZEB2, ZNF236, ZNF318, ZNF395, ZNF443, ZNF83, ZSCAN18,

ZXDC

To characterize the relationships between SLE gene modules from cell subsets and disease activity in greater detail, Gene Set Variation Analysis (GSVA) enrichment was carried out using the 25 cell-specific gene modules (FIG. 12). Of the 25 cell-specific modules, 12 had enrichment scores with significant Spearman correlations to SLEDAI (p<0.05), and 14 had enrichment scores with significant differences between active and inactive patients (Welch's t-test, p<0.05) (Table 9). Table 9 shows assessment of WGCNA module relationships with SLE disease activity in WB, including statistics on WGCNA module relationships with SLEDAI and active disease. Correlation to SLEDAI was done by Spearman rank correlation, and the relationship with active versus inactive disease was assessed by Welch's unequal variances t-test and Cohen's d. Significant results are bolded (p<0.05). LDG: low-density granulocyte; PC: plasma cell.

TABLE 9

Cell-specific modules by Spearman correlation to SLEDAI and active vs. inactive

                      Spearman correlation
                      to SLEDAI                Active vs. Inactive t-test
Module                rho       p value        t statistic   p value     d

CD4_Floralwhite       0.360     3.90E−06       4.90          2.40E−06    0.788
CD4_Turquoise         −0.044    0.587          −0.93         0.352       −0.149
CD4_Orangered4        −0.400    2.21E−07       −5.29         4.35E−07    −0.853
CD14_Plum1            0.010     0.904          −0.35         0.729       −0.054
CD14_Yellow           0.356     4.93E−06       4.76          4.44E−06    0.761
CD14_Greenyellow      −0.132    0.100          −2.10         0.037       −0.339
CD14_Pink             −0.026    0.751          0.13          0.894       0.021
CD14_Purple           −0.149    0.064          −1.65         0.101       −0.263
CD14_Sienna3          −0.368    2.27E−06       −4.99         1.62E−06    −0.799
CD19_Darkolivegreen   0.020     0.809          −0.06         0.953       −0.010
CD19_Greenyellow      0.192     0.016          2.55          0.012       0.403
CD19_Steelblue        0.016     0.838          0.55          0.580       0.089
CD19_Turquoise        −0.069    0.393          −0.84         0.403       −0.132
CD19_Violet           −0.087    0.282          −1.48         0.141       −0.236
CD19_Brown            −0.050    0.537          −1.04         0.301       −0.164
CD19_Green            −0.150    0.062          −2.07         0.040       −0.330
CD19_Skyblue          −0.205    0.010          −2.35         0.020       −0.378
CD33_Royalblue        0.308     8.99E−05       3.99          1.03E−04    0.637
CD33_Sienna3          0.362     3.41E−06       4.69          6.15E−06    0.753
CD33_Violet           0.322     4.15E−05       4.35          2.46E−05    0.696
CD33_Darkmagenta      −0.216    6.74E−03       −2.34         0.021       −0.369
LDG_A                 −0.044    0.588          −0.25         0.802       −0.040
LDG_B                 0.220     5.71E−03       2.37          0.019       0.377
PC_Up                 0.262     9.75E−04       3.21          1.61E−03    0.508
PC_Down               0.022     0.781          0.80          0.426       0.129

Notably, each cell type produced at least one module with a significant correlation to SLEDAI in WB and at least one module with a significant difference in enrichment scores between active and inactive patients, demonstrating a relationship between disease activity in specific cellular subsets and overall disease activity in WB. However, the Spearman's rho values ranged from −0.40 to +0.36, suggesting that no one module had substantial predictive value. Furthermore, the effect sizes as measured by Cohen's d when testing active versus inactive enrichment scores ranged from −0.85 to +0.79. The CD4 Floralwhite and Orangered4 modules, which had the largest positive and negative effect sizes, respectively, showed a high degree of overlap in the enrichment scores of active and inactive patients (FIG. 4).

Analysis of individual disease activity-associated peripheral cellular subset gene modules was not sufficient to predict disease activity in unrelated WB data sets, since no single module from any cell type was able to separate active from inactive SLE patients (FIGS. 13A and 13B). These results emphasized the need for more advanced analytical approaches if gene expression data are to be used to predict disease activity.

Machine learning may be applied to analyze and assess disease activity as follows. To assess the effectiveness of either raw gene expression or module-based enrichment techniques, SLE patients were classified as active or inactive using generalized linear models (GLM), k-nearest neighbors (KNN), and random forest (RF) classifiers. Classifiers were validated using two different methodologies: (1) 10-fold cross-validation or (2) study-based cross-validation, in which classifiers were trained on each data set independently and tested in the other two data sets. When evaluating the performance of classifiers on the data set on which they were trained, GLM accuracy was defined as one minus the cross-validated classification error from the cv.glmnet( ) function, and RF accuracy was determined based on out-of-bag predictions. The accuracy of each classifier trained with either gene expression or module enrichment is shown in FIG. 14, and receiver operating characteristic (ROC) curves are plotted in FIG. 15. Classification metrics for each classifier are shown in Table 10.

TABLE 10

Classification metrics for GLM, KNN, and RF classifiers

                     10-fold CV            Trained on GSE39088   Trained on GSE45291   Trained on GSE49454
                     Expression  WGCNA     Expression  WGCNA     Expression  WGCNA     Expression  WGCNA

GLM   Accuracy       0.80        0.72      0.51        0.56      0.57        0.56      0.63        0.63
      Sensitivity    0.78        0.73      0.86        0.79      0.51        0.60      0.54        0.59
      Specificity    0.82        0.70      0.18        0.34      0.64        0.51      0.73        0.67
      AUC            0.84        0.73      0.62        0.65      0.68        0.55      0.63        0.69
      Kappa          0.60        0.43      0.04        0.14      0.15        0.11      0.26        0.26
      PPV            0.83        0.73      0.50        0.53      0.63        0.60      0.71        0.69
      NPV            0.77        0.70      0.58        0.64      0.52        0.51      0.56        0.57

KNN   Accuracy       0.75        0.70      0.50        0.70      0.49        0.70      0.51        0.72
      Sensitivity    0.66        0.72      0.59        0.83      0.23        0.68      0.31        0.68
      Specificity    0.85        0.68      0.41        0.57      0.79        0.72      0.77        0.77
      AUC            0.82        0.74      0.54        0.71      0.58        0.75      0.63        0.70
      Kappa          0.50        0.40      0.00        0.40      0.03        0.40      0.07        0.44
      PPV            0.83        0.71      0.49        0.65      0.58        0.74      0.62        0.78
      NPV            0.69        0.68      0.51        0.78      0.46        0.65      0.47        0.66

RF    Accuracy       0.83        0.72      0.45        0.63      0.47        0.63      0.61        0.66
      Sensitivity    0.83        0.77      0.86        0.91      0.53        0.62      0.54        0.61
      Specificity    0.82        0.68      0.07        0.36      0.38        0.64      0.69        0.73
      AUC            0.89        0.77      0.69        0.73      0.58        0.68      0.65        0.74
      Kappa          0.65        0.45      −0.07       0.27      −0.08       0.26      0.22        0.33
      PPV            0.84        0.72      0.47        0.58      0.51        0.67      0.68        0.73
      NPV            0.81        0.72      0.33        0.81      0.41        0.58      0.55        0.60

When performing 10-fold cross-validation, the use of gene expression values resulted in better performance from all three classifiers compared to module enrichment scores. The random forest classifier was the strongest performer with 83 percent accuracy, and its corresponding ROC curve demonstrated an excellent tradeoff between recall and fall-out (AUC of 0.89). This high accuracy is likely attributable to the presence of data from all three studies in both the training and test sets. In this case, the classifiers have the opportunity to learn patterns inherent to each data set, which proves useful during testing. To ensure that the classifiers were not disproportionately learning patterns from certain data sets at the expense of others, the classification results from the 10-fold cross-validation approach were subdivided by data set. All classifiers exhibited good performance with small differences between their highest and lowest accuracies in individual data sets, with the exception of the WGCNA-based KNN classifier (Table 11).

Table 11 shows classification metrics of 10-fold CV machine learning classifiers with results subdivided by data set. Data sets are listed by their GEO accession numbers. Range: difference between maximum and minimum values for each metric. Expression: gene expression data. WGCNA: module enrichment scores. AUC: area under the receiver operating characteristic curve. Kappa: Cohen's kappa coefficient. PPV: positive predictive value. NPV: negative predictive value.

TABLE 11

Classification metrics of 10-fold CV machine learning classifiers with results subdivided by data set

                     Subset: GSE39088      Subset: GSE45291      Subset: GSE49454      Range
                     Expression  WGCNA     Expression  WGCNA     Expression  WGCNA     Expression  WGCNA

GLM   Accuracy       0.81        0.70      0.83        0.74      0.76        0.69      0.07        0.05
      Sensitivity    0.73        0.73      0.83        0.71      0.76        0.76      0.10        0.05
      Specificity    0.93        0.67      0.83        0.77      0.75        0.63      0.18        0.14
      AUC            0.85        0.74      0.84        0.75      0.84        0.70      0.01        0.05
      Kappa          0.63        0.39      0.66        0.49      0.51        0.39      0.15        0.10
      PPV            0.94        0.76      0.83        0.76      0.76        0.68      0.18        0.08
      NPV            0.70        0.63      0.83        0.73      0.75        0.71      0.13        0.10

KNN   Accuracy       0.78        0.84      0.76        0.70      0.71        0.59      0.07        0.25
      Sensitivity    0.68        0.86      0.71        0.71      0.56        0.60      0.15        0.26
      Specificity    0.93        0.80      0.80        0.69      0.88        0.58      0.13        0.22
      AUC            0.85        0.84      0.79        0.75      0.84        0.65      0.06        0.19
      Kappa          0.58        0.66      0.51        0.40      0.43        0.18      0.15        0.48
      PPV            0.94        0.86      0.78        0.69      0.83        0.60      0.16        0.26
      NPV            0.67        0.80      0.74        0.71      0.66        0.58      0.08        0.22

RF    Accuracy       0.81        0.81      0.83        0.71      0.84        0.67      0.03        0.14
      Sensitivity    0.82        0.82      0.86        0.74      0.80        0.76      0.06        0.08
      Specificity    0.80        0.80      0.80        0.69      0.88        0.58      0.08        0.22
      AUC            0.87        0.86      0.90        0.78      0.88        0.72      0.03        0.14
      Kappa          0.61        0.61      0.66        0.43      0.67        0.34      0.06        0.27
      PPV            0.86        0.86      0.81        0.70      0.87        0.66      0.06        0.20
      NPV            0.75        0.75      0.85        0.73      0.81        0.70      0.10        0.05
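By way of non-limiting illustration, the threshold-based metrics reported in Tables 10 and 11 (accuracy, sensitivity, specificity, PPV, NPV, and Cohen's kappa) may be derived from the counts of a two-class confusion matrix, and the "Range" columns of Table 11 may be derived as the maximum minus the minimum of a metric across data sets. The following Python sketch is an assumption-laden example rather than the code used in the study: the function names and the confusion-matrix counts are hypothetical, "active" is treated as the positive class, and AUC is omitted because it requires class probabilities rather than counts.

```python
def classification_metrics(tp, fp, tn, fn):
    """Metrics from confusion-matrix counts, with 'active' as the positive class."""
    n = tp + fp + tn + fn
    accuracy = (tp + tn) / n
    # Agreement expected by chance, used by Cohen's kappa.
    p_chance = ((tp + fp) / n) * ((tp + fn) / n) + ((tn + fn) / n) * ((tn + fp) / n)
    return {
        "accuracy": accuracy,
        "sensitivity": tp / (tp + fn),  # recall for active patients
        "specificity": tn / (tn + fp),  # recall for inactive patients
        "ppv": tp / (tp + fp),          # positive predictive value
        "npv": tn / (tn + fn),          # negative predictive value
        "kappa": (accuracy - p_chance) / (1 - p_chance),
    }


def metric_range(values):
    """Difference between the maximum and minimum of one metric across data sets."""
    return max(values) - min(values)


# Hypothetical counts, for illustration only.
metrics = classification_metrics(tp=40, fp=15, tn=35, fn=10)

# GLM expression-based accuracies in the three subsets of Table 11.
glm_accuracy_range = metric_range([0.81, 0.83, 0.76])
```

Applied to the GLM expression-based accuracies in Table 11 (0.81, 0.83, and 0.76), `metric_range` gives 0.07 after rounding to two decimals, which agrees with the tabulated range.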

When performing study-based cross-validation, classifiers trained on expression data performed better on their respective training sets than those trained on module enrichment scores in nearly all cases (FIG. 14). However, the accuracy of classifiers trained on expression values in the test sets was approximately 50 percent. This is in line with the findings of the initial bioinformatic analysis (Table 6), namely, that gene expression values may have little utility when attempting to classify unfamiliar samples. When the training and test data come from different data sets, the classifiers learn patterns that are unhelpful for classifying test samples. Although classifiers trained on module enrichment scores did not achieve high accuracies in their training sets, they did not experience as sharp a drop in accuracy when tested on unfamiliar data sets. Remarkably, the use of module enrichment scores improved RF test accuracy to approximately 65 percent and improved KNN test accuracy to approximately 70 percent.

Overall, gene expression values provide high accuracy when performing 10-fold cross-validation but are rendered nearly useless when performing study-based cross-validation. These results indicate that disease activity classification based on raw gene expression, while more accurate, is sensitive to technical variability, whereas classification based on module enrichment better copes with variation among data sets.

Random forest consistently achieved high performance, and its assessments of variable importance may be used to gain insight into directors of the identification of SLE activity. To this end, random forest classifiers were trained on all patients from all data sets in order to identify the most important genes and modules as determined by mean decrease in the Gini impurity, a measure of misclassification error. The classifier trained with gene expression data achieved an out-of-bag accuracy of 81 percent, with a sensitivity of 83 percent and a specificity of 78 percent. The classifier trained with module enrichment scores achieved an out-of-bag accuracy of 73 percent, with a sensitivity of 78 percent and a specificity of 68 percent.
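By way of non-limiting illustration, the Gini impurity of a node and the decrease in impurity produced by a candidate split may be computed as in the following Python sketch. In a random forest, the importance of a variable is generally obtained by accumulating such decreases over every split made on that variable and averaging over the trees. The function names and the toy class labels are hypothetical and are not taken from the study.

```python
def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def gini_decrease(parent, left, right):
    """Impurity of the parent node minus the size-weighted impurity of its children."""
    n = len(parent)
    weighted_children = (len(left) / n) * gini_impurity(left) \
        + (len(right) / n) * gini_impurity(right)
    return gini_impurity(parent) - weighted_children


# Toy node with four active and four inactive patients (illustrative only).
parent = ["active"] * 4 + ["inactive"] * 4

# A partially informative split and a perfectly separating split.
partial = gini_decrease(parent,
                        ["active"] * 3 + ["inactive"],
                        ["active"] + ["inactive"] * 3)
perfect = gini_decrease(parent, ["active"] * 4, ["inactive"] * 4)
```

In this toy example the perfectly separating split removes all of the parent node's impurity (0.5), whereas the partially informative split removes only 0.125, so a variable that produces cleaner splits accumulates a larger importance score.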

The most important genes and modules identified a wide array of cell types and biological functions (FIGS. 16A-16C). The most important genes encompass such diverse functions as interferon signaling, pattern recognition receptor signaling, and control of survival and proliferation (FIG. 16A). These most important genes include RAB4B, ADAR, MRPL44, CDCA5, MYD88, SNN, BRD3, C7orf43, CDC20, SP1, POFUT1, SAMD4B, ATP6V1B2, TSPAN9, SP140, STK26, IRF4, LCP1, LMO2, SF3B4, HIST2H2AA3, CITED4, ADAM8, TICAM1, and HSD17B7. Notably, the most influential modules skewed away from B cell-derived modules and towards T cell- and myeloid cell-derived modules (FIG. 16B). As some of these modules had overlapping genes, the variable importance experiment was repeated with modules that were de-duplicated by removing any genes that appeared in more than one module before GSVA enrichment scoring. The relative variable importance scores of the de-duplicated modules correlated strongly with those of the original modules (Spearman's rho=0.69, p=1.94E−4), indicating that module behavior was partly driven by the overlapping genes but strongly driven by unique genes (FIG. 16C).
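By way of non-limiting illustration, the de-duplication step described above (removing any gene that appears in more than one module before enrichment scoring) may be sketched in Python as follows. The function name is hypothetical, and the two module entries are short excerpts of the LDG_A and LDG_B gene lists used only to show the behavior; the full modules contain many more genes.

```python
from collections import Counter


def deduplicate_modules(modules):
    """Drop every gene that occurs in more than one module.

    `modules` maps a module name to its list of gene symbols.
    """
    # Count in how many modules each gene occurs (once per module at most).
    counts = Counter(g for genes in modules.values() for g in set(genes))
    return {
        name: [g for g in genes if counts[g] == 1]
        for name, genes in modules.items()
    }


# Short excerpts of two modules; ERG is listed in both.
modules = {
    "LDG_A": ["ERG", "GP6", "VWF"],
    "LDG_B": ["ERG", "MPO", "ELANE"],
}
unique_modules = deduplicate_modules(modules)
```

Here ERG is removed from both modules because it is shared, while the genes unique to each module are retained in their original order.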

CD4_Floralwhite and CD14_Yellow, two interferon-related modules which maintained high importance after deduplication, were further analyzed to study the effect of unique genes on module importance. Gene lists were tested for statistical overrepresentation of Gene Ontology biological process terms with FDR correction on pantherdb.org. CD4_Floralwhite did not show any significant enrichment, but CD14_Yellow, which had the highest importance after deduplication, was highly enriched for genes with the “Immune Effector Process” designation (26/77 genes, FDR=9.38E−11 by Fisher's exact test). This suggests that CD14+ monocytes express unique genes that may play important roles in the initiation of SLE activity.

Several important findings related to SLE gene expression heterogeneity within and across data sets have been elucidated by this study. First, DE analysis of active vs. inactive patients may be insufficient for proper classification of SLE disease activity, as systematic differences between data sets render conventional bioinformatics techniques largely non-generalizable.

Next, it was hypothesized that WGCNA modules created from the cellular components of WB and correlated to SLEDAI disease activity may improve classification of disease activity in SLE patients. The use of cell-specific gene modules based on a priori knowledge about their relevance to disease fared slightly better than raw gene expression, as it generated informative enrichment patterns, and many of the modules maintained significant correlations with SLEDAI in WB. However, these enrichment scores failed to separate active patients from inactive patients completely by hierarchical clustering.

Raw expression data were then compared with the WGCNA-generated gene modules in machine learning applications. A supervised classification approach was applied using elastic generalized linear modeling, k-nearest neighbors, and random forest classifiers. The trends in performance when cross-validating by study or cross-validating 10-fold indicate the potential advantages and disadvantages of diagnostic tests incorporating gene expression data or module enrichment. Cross-validating by study serves as a kind of “worst-case” scenario, whereas 10-fold cross-validation serves as a “best-case” scenario. Attempting to classify active and inactive SLE patients from different data sets and different microarray platforms during cross-validation by study proved difficult, but module enrichment was able to smooth out much of the technical variation between data sets. In contrast, 10-fold cross-validation simulated a more standardized diagnostic test. Although the data were sourced from three different microarray platforms, each cohort in the test set had many similar patients in the training set to facilitate classification by gene expression. If such a test can be kept reliably free from technical noise, raw gene expression is likely to perform very well.

RNA-Seq platforms, which produce transcript counts rather than probe intensity values, may display less technical variation across data sets because they are not dependent on the binding characteristics of pre-defined probes that differ among arrays. On the other hand, comparison of RNA-Seq and microarray samples may show that the two methods may deliver highly consistent results, so a microarray-based test may be feasible if it were only conducted on one platform. Constructing an optimal panel of genes similar to that identified by the random forest classifier may result in a simple, focused test to determine disease activity by gene expression data alone. Interestingly, module enrichment scores, which show little variation across platforms, may be used to develop diagnostic tests that leverage existing data sets, even if they are sourced from different platforms.

The strong performance of the random forest classifier indicates that nonlinear, decision tree-based methods of classification may be well suited to SLE diagnostics. This may be because decision trees ask questions about new samples sequentially and adaptively in contrast to other methods that approach variables from new samples all at once. Random forest is able to “understand” to an extent that different types of patients exist and that a one-size-fits-all approach may tend to misclassify those patients whose expression patterns make them a minority within their phenotype. To put it more simply, active patients that do not resemble the majority of active patients still have a strong chance of being properly classified by random forest.

The random forest classifier was used to assess the importance of each gene and module in patient classification. The most important genes were involved in a number of functions other than interferon signaling, such as RNA processing, ubiquitylation, and mitochondrial processes. These pathways may play important roles in directing, or at least be indicative of, SLE disease activity. CD4 T cells originally contributed the most important modules, but when the modules were de-duplicated, CD14 monocyte-derived modules gained importance. This suggests that unique genes expressed by CD14 monocytes in tandem with interferon genes may prove to be informative in the study of cell-specific methods of SLE pathogenesis. Furthermore, it is important to note that modules that were negatively associated with disease activity were just as important in classification as positively associated modules. Study of underrepresented categories of transcripts may enhance an understanding of SLE activity.

Gene expression data may be compiled from SLE patients as follows. Publicly available gene expression data and corresponding phenotypic data were mined from the Gene Expression Omnibus. Raw data sources for purified cell populations are as follows: GSE10325 (CD4: 8 SLE, 9 HC; CD19: 10 SLE, 8 HC; CD33: 9 SLE, 9 HC); GSE26975 (10 SLE LDG, 10 SLE Neutrophil, 9 HC Neutrophil); GSE38351 (CD14: 8 SLE, 12 HC). Raw data sources for SLE whole blood gene expression are as follows: GSE39088 (24 active, 13 inactive); GSE45291 (35 active, 257 inactive); GSE49454 (23 active, 26 inactive). 35 randomly sampled inactive patients were taken from GSE45291 to avoid a major imbalance between active and inactive SLE patients. Active SLE was defined as having an SLE Disease Activity Index (SLEDAI) of 6 or greater.
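By way of non-limiting illustration, the activity labeling rule (SLEDAI of 6 or greater) and the random down-sampling of inactive patients may be sketched in Python as follows. The function names, the record layout, the fixed random seed, and the SLEDAI values in the toy cohort are all hypothetical; only the threshold and the class sizes (35 active and 257 inactive patients in GSE45291, with 35 inactive patients retained) come from the description above.

```python
import random

ACTIVE_SLEDAI_THRESHOLD = 6  # active SLE is defined as SLEDAI >= 6


def label_activity(sledai):
    """Return 'active' if SLEDAI is 6 or greater, otherwise 'inactive'."""
    return "active" if sledai >= ACTIVE_SLEDAI_THRESHOLD else "inactive"


def downsample_inactive(patients, n_inactive, seed=0):
    """Keep all active patients plus a random sample of inactive patients.

    `patients` is a list of (patient_id, sledai) tuples (hypothetical layout).
    """
    active = [p for p in patients if label_activity(p[1]) == "active"]
    inactive = [p for p in patients if label_activity(p[1]) == "inactive"]
    rng = random.Random(seed)  # fixed seed only so the sketch is repeatable
    return active + rng.sample(inactive, n_inactive)


# Toy cohort with the GSE45291 class sizes; SLEDAI values are made up.
cohort = [(f"A{i}", 8) for i in range(35)] + [(f"I{i}", 2) for i in range(257)]
balanced = downsample_inactive(cohort, n_inactive=35)
```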

Quality control and normalization of raw data files may be performed as follows. Statistical analysis was conducted using R and relevant Bioconductor packages. Non-normalized arrays were inspected for visual artifacts or poor hybridization using Affy QC plots. PCA plots were used to inspect the raw data files for outliers. Data sets culled of outliers were cleaned of background noise and normalized using RMA, GCRMA, or NEQC where appropriate. Data sets were then filtered to remove probes with low intensity values and probes without gene annotation data. WB gene expression data sets were filtered to only include genes that passed quality control in all data sets. At this juncture, differential expression (DE) analysis and Weighted Gene Co-expression Network Analysis (WGCNA) were carried out on data sets. WB gene expression data sets were then further processed before machine learning analysis. WB gene expression values were centered and scaled to have zero-mean and unit-variance within each data set, and the standardized expression values from each data set were joined for classification.
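By way of non-limiting illustration, the final standardization step (centering and scaling each gene to zero mean and unit variance within each data set, then joining the data sets) may be sketched in Python as follows. The function names, the dictionary layout, and the toy expression values are hypothetical. The sketch assumes the sample standard deviation is used and does not handle genes with zero variance.

```python
from statistics import mean, stdev


def standardize_within_dataset(expr):
    """Center and scale each gene to zero mean and unit variance.

    `expr` maps a gene symbol to its expression values in ONE data set.
    """
    scaled = {}
    for gene, values in expr.items():
        mu, sd = mean(values), stdev(values)  # sample standard deviation
        scaled[gene] = [(v - mu) / sd for v in values]
    return scaled


def join_datasets(datasets):
    """Standardize each data set separately, then concatenate gene by gene.

    Only genes present in every data set are retained.
    """
    shared = set.intersection(*(set(d) for d in datasets))
    joined = {g: [] for g in sorted(shared)}
    for d in datasets:
        z = standardize_within_dataset({g: d[g] for g in shared})
        for g in joined:
            joined[g].extend(z[g])
    return joined


# Two toy data sets on very different intensity scales (values are made up).
set_a = {"IRF4": [1.0, 2.0, 3.0], "MYD88": [10.0, 12.0, 14.0]}
set_b = {"IRF4": [100.0, 200.0, 300.0], "MYD88": [5.0, 6.0, 7.0], "SP1": [1.0, 2.0, 4.0]}
joined = join_datasets([set_a, set_b])
```

Because each data set is standardized before joining, the two toy data sets contribute identical standardized IRF4 values despite a hundredfold difference in raw scale, which illustrates why the step reduces platform-level offsets.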

Differential Expression analysis may be performed as follows. Normalized expression values were variance corrected using local empirical Bayesian shrinkage, and DE was assessed using the LIMMA R package. Resulting p-values were adjusted for multiple hypothesis testing using the Benjamini-Hochberg correction, which resulted in a false discovery rate (FDR). Significant genes within each study were filtered to retain DE genes with an FDR<0.2, which were considered statistically significant. The FDR was selected a priori to diminish the number of genes that may be excluded as false negatives. Rank-rank hypergeometric overlap between data sets was assessed using the RRHO R package. Additional analyses examined differentially expressed genes with an FDR<0.05.
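By way of non-limiting illustration, the Benjamini-Hochberg adjustment and the FDR<0.2 filter may be sketched in Python as follows. This is a generic implementation of the standard procedure, not the LIMMA code used in the study; the function name, gene labels, and p-values are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest to the smallest p-value so that adjusted
    # values never increase as the raw p-value decreases.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values for four genes.
pvals = {"GENE_A": 0.001, "GENE_B": 0.04, "GENE_C": 0.03, "GENE_D": 0.5}
fdr = dict(zip(pvals, benjamini_hochberg(list(pvals.values()))))

# Retain genes passing the FDR < 0.2 threshold described above.
significant = [gene for gene, q in fdr.items() if q < 0.2]
```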

Weighted Gene Co-expression Network Analysis (WGCNA) of purified cell populations may be performed as follows. Log2-normalized microarray expression values from purified CD4, CD14, CD19, CD33, and low density granulocyte (LDG) populations were used as input to WGCNA to conduct an unsupervised clustering analysis, resulting in co-expression “modules,” or groups of densely interconnected genes which may correspond to comparably regulated biologic pathways. For each experiment, an approximately scale-free topology matrix (TOM) was first calculated to encode the network strength between probes. Probes were clustered into WGCNA modules based on TOM distances. Resultant dendrograms of correlation networks were trimmed to isolate individual modular groups of probes by partitioning around medoids and labeled using color assignments based on module size. Expression profiles of genes within modules were summarized by a module eigengene (ME), which is analogous to the module's first principal component. MEs act as characteristic expression values for their respective modules and may be correlated with sample traits such as SLEDAI or cell type. This was done by Pearson correlation for continuous or semi-continuous traits and by point-biserial correlation for dichotomous traits.

WGCNA modules from CD4, CD14, CD19, and CD33 cells were tested for correlation to SLEDAI. SLEDAI information was not available for the LDG modules, so the two modules provided are descriptive of LDGs compared to SLE neutrophils and HC neutrophils.

Plasma cell modules were generated by differential expression analysis and not WGCNA, but were included because of the established importance of plasma cells in SLE pathogenesis and their increase in active disease.

Gene Set Variation Analysis (GSVA)-based enrichment of expression data may be performed as follows. The GSVA R package was used as a non-parametric method for estimating the variation of pre-defined gene sets in SLE WB gene expression data sets. Standardized expression values from WB data sets were used to test for enrichment of cell-specific WGCNA gene modules using the Single-sample Gene Set Enrichment Analysis (ssGSEA) method, which scores single samples in isolation and is thus shielded from technical variation within and among data sets. Statistical analysis of GSVA enrichment scores was done by Spearman correlation or Welch's unequal variances t-test, where appropriate. Effect sizes were assessed by Cohen's d.

Machine learning algorithms and parameters may be developed as follows. Three distinct machine learning algorithms were employed to test biased and unbiased approaches to microarray data analysis. The biased approach involved GSVA enrichment of disease-associated, cell-specific modules, and the unbiased approach employed all available gene expression data in the WB. An elastic generalized linear model (GLM), k-nearest neighbors classifier (KNN), and random forest (RF) classifier were deployed to classify active and inactive SLE patients and determine whether gene expression may serve as a general predictor of disease activity. GLM, KNN, and RF were deployed using the glmnet, caret, and randomForest R packages, respectively.

GLM carries out logistic regression with a tunable elastic penalty term to find a balance between the L1 (lasso) and L2 (ridge) penalties and thereby facilitate variable selection. For our predictions, the elastic penalty was set to 0.9, specifying a penalty that is 90% lasso and 10% ridge in order to generate sparse solutions. KNN classifies unknown samples based on their proximity to a set number k of known samples. K was set to 5% of the size of the training set. If the initial value of k was even, 1 was added in order to avoid ties. RF generates 500 decision trees which vote on the class of each sample. The Gini impurity index, a measure of misclassification error, was used to evaluate the importance of variables. In addition to these three approaches, pooled predictions were assigned based on the average class probabilities across the three classifiers.
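By way of non-limiting illustration, three of the parameter rules described above may be sketched in Python as follows: the elastic penalty with a mixing value of 0.9, the selection of k for KNN, and the pooling of class probabilities across classifiers. The function names and example probabilities are hypothetical. The penalty is written in the form used by glmnet, the rounding of 5% of the training-set size to an integer is an assumption, and the 0.5 decision threshold for the pooled probability is likewise an assumption.

```python
def elastic_net_penalty(coefs, alpha=0.9):
    """Elastic penalty: alpha * L1 norm + (1 - alpha) / 2 * squared L2 norm.

    alpha = 1 is pure lasso and alpha = 0 is pure ridge.
    """
    l1 = sum(abs(b) for b in coefs)
    l2_squared = sum(b * b for b in coefs)
    return alpha * l1 + (1 - alpha) / 2 * l2_squared


def choose_k(n_train):
    """k = 5% of the training-set size, increased by 1 if even to avoid ties."""
    k = max(1, round(0.05 * n_train))  # rounding rule is an assumption
    return k + 1 if k % 2 == 0 else k


def pooled_prediction(prob_active_by_model, threshold=0.5):
    """Average P(active) across classifiers and threshold the mean."""
    avg = sum(prob_active_by_model.values()) / len(prob_active_by_model)
    label = "active" if avg >= threshold else "inactive"
    return label, avg


# Hypothetical class probabilities for one patient.
label, avg = pooled_prediction({"GLM": 0.62, "KNN": 0.40, "RF": 0.71})
```

In this example the KNN classifier alone would call the patient inactive, but the pooled probability exceeds the assumed threshold, so the pooled call is active.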

Validation approaches may be performed as follows. The performance of each machine learning algorithm was evaluated by 2 different forms of cross-validation. First, a random 10-fold cross-validation was carried out by randomly assigning each patient to one of 10 groups. For each pass of cross-validation, one group was held out as a test set, and the classifiers were trained on the remaining data. Next, as the data came from three separate studies, study-based cross-validation was also done to determine the effects of systematic technical differences among data sets on classification performance. In this circumstance, the classifiers were trained on one data set and tested in the other two data sets. Accuracy was assessed as the proportion of patients correctly classified across all testing folds. Performance metrics such as sensitivity and specificity were assessed after cross-validation by agglomerating class probabilities and assignments from each fold or study. Receiver Operating Characteristic (ROC) curves were generated using the pROC R package.
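By way of non-limiting illustration, the two validation schemes may be sketched in Python as follows: random assignment of patients to 10 folds with one fold held out per pass, and study-based splitting in which classifiers are trained on one data set and tested on the other two. The function names, patient identifiers, and fixed random seed are hypothetical, and the near-equal fold sizes produced here are an assumption. The per-study patient counts follow the data set sizes given above after down-sampling (37, 70, and 49).

```python
import random


def ten_fold_assignments(patient_ids, n_folds=10, seed=0):
    """Randomly assign each patient to one of n_folds groups of near-equal size."""
    ids = list(patient_ids)
    random.Random(seed).shuffle(ids)  # fixed seed only for repeatability
    return {pid: i % n_folds for i, pid in enumerate(ids)}


def fold_splits(assignments):
    """Yield (train_ids, test_ids), holding out one fold per pass."""
    for fold in sorted(set(assignments.values())):
        test = [p for p, f in assignments.items() if f == fold]
        train = [p for p, f in assignments.items() if f != fold]
        yield train, test


def study_splits(study_of):
    """Yield (train_study, train_ids, test_ids): train on one study, test on the rest."""
    for study in sorted(set(study_of.values())):
        train = [p for p, s in study_of.items() if s == study]
        test = [p for p, s in study_of.items() if s != study]
        yield study, train, test


# Map hypothetical patient identifiers to their data set of origin.
study_of = {}
for study, n in [("GSE39088", 37), ("GSE45291", 70), ("GSE49454", 49)]:
    for i in range(n):
        study_of[f"{study}_{i}"] = study

assignments = ten_fold_assignments(study_of)
random_splits = list(fold_splits(assignments))
by_study = {s: (train, test) for s, train, test in study_splits(study_of)}
```

Under the random scheme every training set contains patients from all three studies, whereas under the study-based scheme the test patients always come from studies the classifier has not seen.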

Example 3: Molecular Endotyping Analysis for Identifying Subsets of Patients with Systemic Lupus Erythematosus Who are Candidates to be Enrolled in Clinical Trials and have a Propensity to Respond to Specific Drugs

Using methods and systems of the present disclosure, molecular endotyping analysis may be performed for identifying subsets of patients with Systemic Lupus Erythematosus who are candidates to be enrolled in clinical trials and have a propensity to respond to specific drugs. In precision medicine, identifying patients who may be appropriate candidates for entry into a clinical trial and/or who have a propensity to respond to a specific therapy is crucial, for example, to de-risk clinical trials. In trials of complex diseases, such as Systemic Lupus Erythematosus (SLE), with current approaches, it may be difficult to identify significant phenotypic and transcriptomic differences between subjects who may be responders and non-responders to specific therapies. For example, post-hoc analysis of the ILLUMINATE trials of tabalumab in SLE by Lilly was unable to identify any genes that were differentially expressed between responders and non-responders.

A hypothesis may be that SLE in particular is a common clinical manifestation of several molecular abnormalities or endotypes, each driven by a distinct combination of cell types and immune or inflammatory mechanisms. Incorporating knowledge of endotypes of individual subjects (e.g., SLE patients) may be a crucial step in the identification of subjects appropriate to enter a clinical trial and/or to benefit from a specific therapy (e.g., targeted therapy to treat SLE).

Methods and systems of the present disclosure can be used to determine whether distinct phenotypic and/or transcriptomic subsets of subjects exist and, subsequently, whether each group is likely to respond to specific therapies. The appropriate or inappropriate entry of such patients into trials may inflate or deflate the efficacy of a clinically tested treatment. Moreover, an investigational product that fails in a clinical trial may later be documented to be highly efficacious when tested on a patient subset with an appropriate molecular endotype.

The ability to stratify SLE patients into different groups associated with different types of disease or disease activity by transcriptomic signatures provides significant advantages toward determining appropriate patient care and enrollment in clinical trials. Using methods and systems disclosed herein, immunologically active SLE patients can be distinguished for entry into SLE clinical trials or to change patients to a more appropriate drug regimen. Results demonstrated that SLE patients can be grouped (e.g., clustered or distinguished) by their transcriptomic signatures. For example, FIG. 17 is a heat map showing the variation of gene expression in normal controls. Differentially expressed (DE) transcripts pertaining to cell type and process signatures in 10 SLE whole blood and peripheral blood mononuclear cell microarray datasets were used to create modules of genes potentially enriched in SLE patients, as determined by Gene Set Variation Analysis (GSVA). Although significant differences in transcripts pertaining to B cells, T cells, erythrocytes, and platelets may be observed among SLE patients, it is notable that, at the level of RNA transcription, these signatures may not be uniformly expressed in the healthy controls (HC) from several SLE datasets (FIG. 17), indicating that differences in these signatures may reflect heterogeneity in controls that is unrelated to SLE.

A suite of clustering techniques may be used to partition clinical trial enrollees at baseline based on gene expression data and/or clinical parameters. These methods may be used to drastically reduce the dimensionality of transcriptomic-scale data, even for cases in which Principal Component Analysis (PCA) fails to generate an informative set of variables.

Furthermore, extensive analysis of the contribution of subject demographic and clinical variables revealed that many of the differences between datasets and patients were related not to the disease but to the patient's ancestry, gender, or drug regimen, each of which may independently influence the transcriptomic signature. Thus, in order to determine whether there were different types of SLE molecular endotypes common amongst patients of different ancestral backgrounds, different SLE standard of care treatments, and different manifestations, 11 transcriptomic signatures negative in controls were used for principal component analysis (PCA) of 1,566 female SLE patients divided into three ancestry sub-groups: African ancestry (AA, n=216), European ancestry (EA, n=1,118), and Native Southern American ancestry (NAA, n=232). An 11-dimension PCA was performed, and results established that principal component 1 (PC1) was determined by whether the patient had circulating plasma cells (PC1−) or myeloid cells (PC1+); in other words, the greatest separation between patients was determined by whether they had a plasma cell dominated or a myeloid cell dominated transcriptomic signature. As another example, PC2 had roughly half the contribution of PC1 and was related to the difference between the presence of a low-density granulocyte (LDG)/neutrophil signature and the interferon (IFN) signature. As shown in FIG. 17, heatmap clustering of the PCA results demonstrated two prominent divisions between the 11 immunologically related modules in the SLE patients. Plasma cell, Immunoglobulins, Mature PC, and cell cycle grouped together (FIG. 17, left), and all the other signatures, including IFN and anti-inflammation, grouped together. PCA and heatmap divisions were the same between ancestries, except that more AA SLE patients were PC1− (plasma cell) than PC1+ (myeloid) and more NAA SLE patients were PC1+ (myeloid) than PC1− (plasma cell).

FIG. 18 shows PCA and heatmap clustering of AA, EA, and NAA SLE patients for 11 GSVA enrichment modules negative in healthy controls (HC). GSVA enrichment scores were uploaded to ClustVis, and PCA plots were generated. Low Up, a signature derived from SLE patients with no enrichment for IFN, PC, or myeloid cells (FCGR1A, SNORD80, SNORD44, SNORD47, SNORD24, CEACAM1, and LGALS1), changed where it grouped depending on ancestry. Heatmaps were generated using correlation clustering distance for both rows and columns. The heatmap clustering of the 11 modules revealed a dichotomy in SLE patient transcriptomic signatures: SLE patients with strong PC signatures were less likely to have strong myeloid signatures, especially in patients of AA ancestry, and in SLE patients with strong myeloid signatures, there were fewer contributing plasma cell signatures. Interferon signatures occurred with either myeloid or plasma cell signatures but were more often paired with strong monocyte signatures. Low-density granulocytes/neutrophils were associated with the myeloid signature as well. Importantly, within each ancestral background, there were both plasma cell and myeloid SLE patients (FIG. 18). Steroid use may be associated with low-density granulocyte enrichment, and low-density granulocytes were important both in PC1, as part of the myeloid signature, and in the signature that dominated PC2; therefore, PCA plots and heatmaps were generated for SLE patients not taking steroids. Among AA SLE patients not taking steroids, few had myeloid SLE signatures. The proportion of EA and NAA SLE patients with myeloid signatures decreased, although, since most NAA SLE patients were on steroids, there were very few patients in this analysis (FIG. 19).

SLE microarray datasets have wide heterogeneity related to the disease but also to the different platforms used to measure transcripts and to variability; therefore, it was important to establish that the divisions found in the 1,566 female ILLUMINATE patients (GSE88884) are applicable to SLE patients assayed on a different array platform. AA and EA SLE patients with low disease activity (SLEDAI range 2-11) from dataset GSE45291 had PC1 and PC2 components similar to GSE88884 patients and demonstrated the same dichotomy in having either a plasma cell or myeloid cell type of SLE. As was shown for dataset GSE88884, there was a higher percentage of SLE patients with AA ancestry and plasma cell SLE, and a higher percentage of SLE patients with EA ancestry and myeloid SLE (FIG. 20).

FIG. 20 shows PCA and heatmap clustering of a second, independent microarray dataset, demonstrating that SLE patients divided into plasma cell or myeloid lupus. GSVA scores were calculated for 10 signatures in 73 AA and 71 EA patients from GSE45291 with SLEDAI in the range of 2-11. ClustVis was used to determine PC1 and PC2 for AA (top left) and EA (top right). Heatmaps show the patient distribution for the plasma cell related GSVA enrichment categories (Cell cycle, Mature plasma cell, plasma cell, and immunoglobulin chains) versus the myeloid cell enrichment categories (Interferon, Anti-Inflammation, Mono Surface, Mono Secrete, LDG, and Act Neut). Dataset GSE45291 was assayed on Affymetrix chip HT HG-U133+ PM, which does not have probes for the small nucleolar RNAs that make up most of the Low Up signature.

209 female SLE patients (13.3%) enrolled in the ILLUMINATE clinical trial (GSE88884) had GSVA scores for the 10 immunologically related modules indistinguishable from HC (not including Low Up, which was based on patients who were difficult to distinguish from HC). These immunologically inactive SLE patients represented all three ancestry sets studied: 161 EA (14.4%), 25 AA (11.6%), and 23 NAA (10.3%); they were categorized as having no immunologically related signature (No Sig). PCA was performed using the 10 immunologically related GSVA modules, and the PC1 loadings for each patient were used to determine the classification of either plasma cell or myeloid SLE based on whether they were PC1− (enriched for plasma cell and Ig modules) or PC1+ (enriched for myeloid modules) (FIG. 21).

SLE disease measures were compared for each ancestry between PC1−, PC1+, and No Sig SLE patients. Although the average SLEDAI was generally higher for SLE patients expressing either PC or myeloid modules compared to the No Sig group of patients, there was no discernible SLEDAI cut-off suitable for defining a patient with no transcriptional sign of immunological perturbation. The mean SLEDAI was significantly higher (p<0.05 by Tukey's multiple comparisons test) for myeloid among AA patients, plasma cell and myeloid among EA patients, and plasma cell for NAA patients, as compared to the No Sig category within each ancestry. No significant difference in SLEDAI was found between SLE patients with myeloid versus plasma cell SLE. Steroid usage was significantly higher (p<0.05) for the myeloid signature for all three ancestries (Table 12).

TABLE 12

Disease differences between PC1−, PC1+, and No Sig categories

Ancestry groups: AA (n = 216), EA (n = 1118), NaAm (n = 232)

Measure | AA PC1− | AA PC1+ | AA No Sig | EA PC1− | EA PC1+ | EA No Sig | NaAm PC1− | NaAm PC1+ | NaAm No Sig
n | 125 | 66 | 25 | 449 | 508 | 161 | 80 | 129 | 23
Average SLEDAI | 10.73 | 10.97^ | 8.8 | 10.74^# | 10.21^## | 9.35 | 11.66* | 11.124 | 9.04
Median SLEDAI | 10 | 10 | 8 | 10 | 10 | 8 | 11 | 10 | 8
Mode SLEDAI | 8 | 8 | 8 | 10 | 8 | 8 | 12 | 10 | 8
# Manifest | 3.6 | 3.8 | 3.2 | 3.8 | 3.6 | 3.2 | 4 | 4 | 3.5
Average steroid | 7.99 | 9.83^^ | 4.2 | 9.05^$ | 9.47^$$ | 4.13 | 10.76 | 12.98** | 6.52
Median steroid | 5 | 10 | 0 | 7.5 | 10 | 0 | 10 | 10 | 5
Mode steroid | 0 | 10 | 0 | 0 | 10 | 0 | 10 | 10 | 5
MMF or MTX (n) | 16.8% (21) | 41% (27) | 16% (4) | 12.2% (55) | 22% (113) | 19% (31) | 24% (19) | 36% (47) | 22% (5)
dsDNA (n) | 40% (50) | 32% (21) | 20% (5) | 22% (98) | 26% (133) | 16% (25) | 23% (18) | 30% (39) | 17% (4)
lowC (n) | 3% (4) | 11% (7) | 0% | 8% (37) | 7% (38) | 11% (18) | 8% (6) | 8% (10) | 4% (1)
dsDNA + lowC (n) | 27% (34) | 24% (16) | 8% (2) | 45% (200) | 30% (152) | 7% (12) | 51% (41) | 28% (46) | 13% (3)

ANOVA & Tukey's Multiple Comparison
^ AA SLEDAI PC1+ to No Sig p = .05
^^ AA SLEDAI PC1+ Steroid to No Sig p = .02
^# EA SLEDAI PC1− to No Sig p = .0001
^## EA SLEDAI PC1+ to No Sig p = .03
^$ EA Steroid PC1− to No Sig p < .0001
^$$ EA Steroid PC1+ to No Sig p < .0001
* NaAm SLEDAI PC1− to No Sig p = .02
** NaAm Steroid PC1+ to No Sig p = .001

A heatmap visualization of the different ancestral SLE patients together as plasma cell, myeloid, or No Sig was generated; it revealed SLE patients with both plasma cell and myeloid signatures. Patients with both signatures (as determined by having a GSVA enrichment score 2 standard deviations above healthy control GSVA scores for both the myeloid and the plasma cell signatures) were combined to form a new group, “Both” (FIGS. 22A-22B).
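The criterion described above for assigning a patient to the "Both" group may be sketched as follows. This is a minimal illustration, assuming that the threshold for each signature is the mean of the healthy control GSVA scores plus 2 sample standard deviations; the function names and the use of the sample (rather than population) standard deviation are assumptions of the sketch.

```python
from statistics import mean, stdev


def hc_cutoff(hc_scores, n_sd=2.0):
    """Cutoff lying n_sd standard deviations above the healthy control mean."""
    return mean(hc_scores) + n_sd * stdev(hc_scores)


def has_both_signatures(pc_score, myeloid_score, hc_pc_scores, hc_myeloid_scores):
    """True when both GSVA scores exceed their healthy control cutoffs."""
    return (pc_score > hc_cutoff(hc_pc_scores)
            and myeloid_score > hc_cutoff(hc_myeloid_scores))
```

A patient exceeding only one of the two cutoffs would remain in the plasma cell or myeloid category under this sketch.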

FIGS. 22A-22B show heatmap clustering of SLE patients by enrichment of 10 immunologically related modules. Four divisions were found for the 1,566 female SLE patients enrolled in the ILLUMINATE clinical trials. Based on PC1 loadings for PCA of patients, PC and myeloid SLE patients were sorted by the opposite GSVA enrichment signature: monocyte cell surface for the PC signature (PCA PC1−) and Ig for the myeloid signature (PCA PC1+). SLE patients with GSVA enrichment scores of at least 0.1 for the opposite signature were removed and reclassified as having both signatures (FIG. 22A). SLE patients of all ancestries were grouped based on the four classifications. ANOVA and Tukey's multiple comparisons test were performed between the four groupings (FIG. 22B). For SLEDAI, No Sig* was significantly lower than PC, Myeloid, and Both (p<0.05), and Both** was significantly higher (p<0.05) than PC and Myeloid. For steroid usage, No Sig* was significantly lower (p<0.0001) than all other groups, and PC was significantly lower than Both (p=0.0053). For anti-dsDNA, No Sig* was significantly lower (p<0.0001) than all other groups and Both** was significantly higher (p<0.0001) than all other groups. For complement C3 and C4, all groups were significantly different (p<0.01) from each other; No Sig* had the highest values, followed by Myeloid. PC had lower values than No Sig and Myeloid, but Both** had the lowest C3 and C4 values.
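The four-way classification described for FIG. 22A may be sketched as follows. The function and argument names are hypothetical, and the handling of a PC1 value of exactly zero is an assumption; only the 0.1 cutoff on the opposite signature and the choice of opposite signatures are taken from the description above.

```python
def classify_four_groups(pc1_score, no_sig, mono_surface_gsva, ig_gsva, cutoff=0.1):
    """Assign a patient to No Sig, PC, Myeloid, or Both.

    pc1_score         -- patient's PC1 value (negative: plasma cell side)
    no_sig            -- True if GSVA scores are indistinguishable from HC
    mono_surface_gsva -- monocyte cell surface GSVA enrichment score
    ig_gsva           -- immunoglobulin (Ig) GSVA enrichment score
    """
    if no_sig:
        return "No Sig"
    if pc1_score < 0:
        # Plasma cell patients are checked against the opposite (monocyte) signature.
        return "Both" if mono_surface_gsva >= cutoff else "PC"
    # Myeloid patients are checked against the opposite (Ig) signature.
    return "Both" if ig_gsva >= cutoff else "Myeloid"
```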

Heatmap clustering of the four groups demonstrated that similar percentages of AA, EA, and NAA patients were found in the No Sig (AA 12%, NAA 12%, EA 13%) and Both (AA 25%, NAA 26%, EA 22%) groups, but there was a higher percentage of AA patients in the plasma cell only group (p<0.05, Fisher's Exact Test; AA 42%, NAA 20%, EA 29%) and of NAA patients in the myeloid only group (p<0.05, Fisher's Exact Test; AA 21%, NAA 44%, EA 35%) (FIG. 22A). Comparison of the SLEDAI, steroid dose, anti-double stranded DNA levels, C3, and C4 serum measurements by ANOVA revealed significant differences between the groups. The No Sig classification with no immunologic transcriptomic signatures had the lowest SLEDAI and anti-double stranded DNA levels, and the highest C3 and C4 levels. Interestingly, this group was also taking the least amount of corticosteroids. SLE patients with both a myeloid and a plasma cell transcriptomic signature had the highest SLEDAI and highest percentage of anti-double stranded DNA values, and the lowest C3 and C4 values. This group was taking a steroid dose similar to that of the myeloid only group and significantly more steroids than the No Sig or plasma cell only group. The plasma cell only and myeloid only groups were similar for SLEDAI and anti-double stranded DNA levels, but the plasma cell group had significantly lower C3 and C4 levels and was taking a lower steroid dose (FIG. 22B).
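For reference, a two-sided Fisher's exact test on a 2x2 table, such as ancestry (e.g., AA versus other) by group membership (e.g., plasma cell only versus other), may be sketched as follows. This is a generic implementation under the common convention of summing the probabilities of all tables no more likely than the observed one; it is not asserted to be the exact procedure or software used for the comparisons above, and the counts in the usage line are toy values, not trial data.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x counts in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is no more probable than the observed table.
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))


if __name__ == "__main__":
    print(fisher_exact_two_sided(3, 1, 1, 3))  # 34/70, about 0.486
```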

The Low Up category was derived from the most highly overexpressed transcripts by log fold change (FDR<0.05) between patients not separated from healthy controls after initial PCA of all the GSE88884 dataset log2 expression values. This signature was expressed in 30% of the No Sig SLE patients and was more frequent in patients with immunologically related transcriptomic signatures: plasma cell only, 39% (180/456); myeloid only, 55% (298/544); and Both, 71% (254/357).

Example 4: Molecular Endotyping Analysis for Identifying Subsets of Patients with Systemic Lupus Erythematosus Who are Candidates to be Enrolled in Clinical Trials and have a Propensity to Respond to Specific Drugs

Weighted gene co-expression network analysis (WGCNA) was performed using a computer program in R that takes a microarray or RNAseq dataset and identifies modules (groups) of genes that are co-expressed in a similar manner in the samples and/or controls. Each individual sample is designated with a positive or negative value for each module, indicating whether or not the individual sample co-expresses the genes in the module. The number of groups or modules that WGCNA identifies is unbiased in that there is no preconceived number of modules in a data set. The gene expression value of a module (eigengene) is used to determine whether an individual patient expresses a module or modules, whether groups of patients can be identified who express a similar constellation of modules, and whether there are patterns to the groupings. This approach can also be employed to determine whether positivity of specific WGCNA modules is correlated to SLE disease measures, such as disease activity, autoantibodies, and complement abnormalities, and to other confounding factors such as patient ancestry.

WGCNA was performed on a set of 810 female systemic lupus erythematosus (SLE) patients and 11 healthy control whole blood samples. Patients were mainly of European ancestry (EA), African ancestry (AA), or Southern Native American ancestry (NAA; Guatemala, Peru, Ecuador). The WGCNA results identified 13 discrete modules. Characterization of the modules was performed using multiple programs, such as CellScan and I-scope to determine whether a module was enriched in cellular markers corresponding to a specific cell type, and BIG-C to determine whether modules were enriched in a specific cellular function or process. This analysis revealed prominent signatures related to cell types and processes, IFN signaling, and MicroRNA in 12 of the 13 modules. One module, turquoise (modules are randomly designated with colors for convenience), had more than 5,000 genes and no discernible cell type or function. This module also had the lowest percentage of genes that were differentially expressed (DE) between SLE patients and controls in a separate limma analysis (for example, in the AA to CTL comparison, only 1.67% of the turquoise genes were DE).

Table 13 shows WGCNA modules identified in SLE patients.

TABLE 13

WGCNA modules identified in SLE patients

Percent positive of DE transcripts in module

Module (color) | AA to ctl | EA to ctl | NaAm to ctl
Number of genes DE to control | 1591 | 1906 | 6580
IFN (black) | 70.03 | 71.18 | 85.59
PC (magenta) | 15.58 | 6.49 | 20.35
Lymphocytes (blue) | 16.32 | 18.25 | 74.38
IL1, Inflamm myeloid (brown) | 10.13 | 25.11 | 64.76
Unknown (turquoise) | 1.67 | 2.62 | 9.82
T cell, SNORAs (pink) | 6.55 | 17.86 | 32.14
Platelets (purple) | 14.04 | 3.51 | 23.98
Erythrocytes (green) | 6.95 | 4.63 | 26.77
MicroRNA (cyan) | 2.80 | 0.93 | 37.58
Myeloid (light cyan) | 4.71 | 7.58 | 25.42
NKTR, IL16 (red) | 7.79 | 15.15 | 45.45
Basophils (midnight blue) | 7.49 | 10.08 | 42.64
Granulocytes/CD14+ TGFB1+ (yellow) | 3.56 | 3.21 | 25.19

Modules with negative eigengene values in healthy human controls were the IFN PRR module (black), plasma cell module (magenta), inflammatory myeloid module (brown), MicroRNA module (cyan) and platelet module (purple). Modules with positive expression in healthy controls were NKTR (red), lymphocytes (blue) and T cells (pink) (Table 14).

TABLE 14

WGCNA modules and their eigengene values in healthy controls

Decreased in controls:
black: IFN
magenta: Plasma cells
brown: Inflammatory myeloid cells, IL1, tons of secreted protein genes
cyan: MicroRNA, TNFSF4
purple: Platelets

Modules with variable expression in controls:
light cyan: Myeloid, SELL, TBK1, CD16, SYK, TANK, IRAK4, AOAH, not activated
midnight blue: Basophils - VEGFA, METRNL, OSM, LCAT, LTBR, LILRB5, LCE1F, S1PR4
yellow: CD14 monocytes, ox phos and TCA cycle, peroxisomes, proteasome, TGFB1, TNFSF8, IKLYZ, FCN2, HBB, HAVCR2, CCR2+, MS4A6A, BTN3A3
green: Erythrocytes, GYPAE, GYPAB, KEL, RHD, BSG
turquoise: No discernible cell type or function, module >5000

Increased in controls:
red: NKTR, IL16, T cell receptor J chains
blue: Lymphocytes, T cells, B cells
pink: T cells

Control | black | magenta | brown | cyan | purple | light cyan | midnight blue
CTL.0073.NA | −0.06 | −0.03 | −0.07 | −0.03 | −0.01 | −0.06 | 0.00
CTL.0106.NA | −0.04 | −0.02 | −0.04 | −0.02 | −0.04 | −0.03 | 0.00
CTL0256.NA | −0.05 | −0.01 | −0.04 | −0.01 | 0.00 | −0.04 | −0.01
CTL.0343.NA | −0.04 | −0.04 | 0.02 | 0.04 | −0.02 | 0.05 | −0.03
CTL.0388.NA | −0.05 | −0.02 | −0.03 | 0.00 | −0.01 | 0.01 | −0.03
CTL.0581.NA | −0.06 | −0.03 | −0.03 | −0.01 | −0.01 | −0.04 | 0.00
CTL.0812.NA | −0.05 | −0.03 | 0.00 | 0.01 | 0.04 | −0.01 | 0.01
CTL.0879.NA | −0.06 | −0.02 | 0.00 | 0.00 | −0.02 | −0.02 | 0.02
CTL.1403.NA | −0.03 | −0.02 | −0.02 | 0.00 | 0.00 | 0.00 | −0.02
CTL.1406.NA | −0.04 | −0.01 | −0.01 | 0.03 | −0.01 | 0.02 | −0.03
CTL.1703.NA | −0.04 | 0.00 | −0.03 | −0.02 | 0.00 | 0.00 | −0.05

Control | yellow | green | turquoise | red | blue | pink
CTL.0073.NA | −0.03 | 0.03 | 0.04 | 0.02 | 0.01 | 0.05
CTL.0106.NA | −0.02 | −0.05 | 0.02 | 0.00 | 0.02 | 0.03
CTL0256.NA | −0.01 | 0.05 | 0.01 | 0.02 | 0.01 | 0.04
CTL.0343.NA | 0.05 | 0.01 | −0.04 | 0.06 | 0.02 | 0.02
CTL.0388.NA | 0.03 | −0.03 | −0.01 | 0.04 | 0.04 | 0.01
CTL.0581.NA | −0.02 | 0.06 | 0.02 | 0.00 | 0.00 | 0.02
CTL.0812.NA | −0.02 | 0.05 | 0.03 | −0.01 | −0.02 | 0.00
CTL.0879.NA | −0.02 | 0.03 | 0.04 | −0.01 | 0.00 | 0.02
CTL.1403.NA | 0.01 | −0.01 | −0.01 | 0.03 | 0.03 | 0.02
CTL.1406.NA | 0.04 | −0.02 | −0.04 | 0.0 | 0.04 | 0.04
CTL.1703.NA | 0.04 | −0.02 | −0.02 | 0.04 | 0.06 | 0.01
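The designation of a sample as positive or negative for a module from the sign of its eigengene value may be sketched as follows, using the healthy control values reported in Table 14 for the IFN (black) and T cell (pink) modules. Treating a value of exactly 0.00 as neither positive nor negative is an assumption of this sketch, as are the function names.

```python
def fraction_positive(eigengenes):
    """Fraction of samples with an eigengene value above zero."""
    return sum(1 for v in eigengenes if v > 0) / len(eigengenes)


def fraction_negative(eigengenes):
    """Fraction of samples with an eigengene value below zero."""
    return sum(1 for v in eigengenes if v < 0) / len(eigengenes)


# Healthy control eigengene values from Table 14 (11 controls, in table order).
black_ifn = [-0.06, -0.04, -0.05, -0.04, -0.05, -0.06,
             -0.05, -0.06, -0.03, -0.04, -0.04]
pink_t_cells = [0.05, 0.03, 0.04, 0.02, 0.01, 0.02,
                0.00, 0.02, 0.02, 0.04, 0.01]

if __name__ == "__main__":
    print(fraction_negative(black_ifn))     # every control is negative for IFN
    print(fraction_positive(pink_t_cells))  # 10 of 11 controls are positive
```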

As shown in Table 15, WGCNA identified four modules with correlation to the presence of SLE: IFN signaling and pattern recognition receptors (black), plasma cells (magenta), inflammatory myeloid cells (brown) and T cells (pink). The IFN and plasma cell modules had a relationship to the lupus disease activity measure SLEDAI and also to anti-double stranded DNA antibodies (dsDNA) and a negative relationship to complement protein C3 and C4 levels, important serum components associated with active SLE disease. Inflammatory myeloid cells were significantly correlated to anti-double stranded DNA, but not to low complement or the SLEDAI. T cells (pink) had a negative correlation to the SLE cohort and a negative relationship to the presence of anti-double stranded DNA autoantibodies and a positive relationship to complement C3 and C4 levels.

TABLE 15

WGCNA module correlations in 810 female SLE patients

Module | assigned color | module Count | Cohort | Cohort p | SLEDAI | SLEDAI.p | dsDNA IU | dsDNA IU.p | C3 GperL
IFN and PRR | black | 347 | 0.16 | 3.6E−06 | 0.25 | 9.9E−13 | 0.30 | 4.9E−19 | −0.32
Plasma Cells | magenta | 231 | 0.07 | 0.0577† | 0.22 | 9.7E−11 | 0.29 | 1.8E−17 | −0.32
Inflammatory Myeloid Cells | brown | 908 | 0.07 | 0.03332 | 0.05 | 0.18802 | 0.10 | 0.0054 | 0.00
Micro RNA | cyan | 322 | 0.00 | 0.9426 | 0.04 | 0.20196 | 0.00 | 0.99021 | 0.10
Platelets | purple | 171 | 0.02 | 0.50223 | −0.03 | 0.33369 | 0.02 | 0.48014 | 0.20
Myeloid. Not activated. | lightcyan | 594 | 0.04 | 0.3008 | 0.05 | 0.16035 | 0.10 | 0.00622 | −0.05
Basophils | midnightblue | 387 | 0.04 | 0.28478 | 0.03 | 0.43885 | −0.02 | 0.59274 | 0.10
T cells | pink | 336 | −0.08 | 0.01916 | −0.04 | 0.20566 | −0.16 | 5.1E−06 | 0.18
Lymphocytes T and B cells, mRNA translation | blue | 3365 | −0.06 | 0.06677 | −0.03 | 0.39424 | −0.03 | 0.40269 | −0.08
NKTR, IL16 | red | 462 | −0.07 | 0.05007 | 0.01 | 0.87992 | −0.06 | 0.09848 | 0.05
Unknown | turquoise | 5569 | −0.01 | 0.74594 | −0.04 | 0.23356 | −0.07 | 0.03969 | 0.09
Monocyte TGFB1 CCR2+ | yellow | 1433 | −0.01 | 0.68829 | 0.05 | 0.19486 | 0.07 | 0.05707 | −0.12
Erythrocytes | green | 691 | −0.03 | 0.4157 | −0.06 | 0.10228 | −0.11 | 0.00246 | 0.21

Module | C3 GperL.p | C4 GperL | C4 GperL.p | Race AA | Race AA.p | Race NaAm | Race NaAm.p | Race EA | Race EA.p
IFN and PRR | 2.5E−21 | −0.28 | 1.2E−16 | 0.04 | 0.2396 | 0.08 | 0.03115 | −0.08 | 0.02481
Plasma Cells | 1.4E−21 | −0.30 | 2E−18 | 0.12 | 0.00088 | −0.06 | 0.07102 | −0.06 | 0.10054
Inflammatory Myeloid Cells | 0.91912 | −0.01 | 0.80436 | −0.11 | 0.00162 | 0.11 | 0.00202 | 0.02 | 0.4957
Micro RNA | 0.00379 | 0.09 | 0.01468 | 0.00 | 0.96314 | 0.13 | 0.00022 | −0.08 | 0.02075
Platelets | 3.3E−09 | 0.16 | 2.8E−06 | 0.10 | 0.00622 | 0.07 | 0.04613 | −0.12 | 0.0004
Myeloid. Not activated. | 0.16016 | −0.05 | 0.14122 | −0.07 | 0.05153 | 0.01 | 0.70275 | 0.07 | 0.0369
Basophils | 0.00341 | 0.09 | 0.01196 | 0.01 | 0.6846 | 0.17 | 9.5E−07 | −0.14 | 3.6E−05
T cells | 1.1E−07 | 0.17 | 1.5E−06 | 0.12 | 0.0007 | 0.03 | 0.35564 | −0.11 | 0.00114
Lymphocytes T and B cells, mRNA translation | 0.0149 | −0.08 | 0.0223 | 0.03 | 0.42534 | −0.19 | 3.4E−08 | 0.14 | 9.8E−05
NKTR, IL16 | 0.17912 | 0.05 | 0.1911 | 0.06 | 0.08086 | −0.05 | 0.13721 | 0.01 | 0.77421
Unknown | 0.01398 | 0.08 | 0.02564 | 0.02 | 0.55253 | 0.08 | 0.02007 | −0.09 | 0.00694
Monocyte TGFB1 CCR2+ | 0.00085 | −0.11 | 0.00137 | −0.01 | 0.76358 | −0.11 | 0.00199 | 0.12 | 0.00077
Erythrocytes | 8.4E−10 | 0.15 | 1E−05 | 0.06 | 0.08531 | 0.09 | 0.00851 | −0.12 | 0.00036

† indicates data missing or illegible when filed

In order to understand whether the three modules with positive correlation to the SLE cohort were related to other modules, the categories IFN PRR (black), plasma cell (magenta), and inflammatory myeloid (brown) were investigated further. The percentage of patients with positive eigengenes for each category was determined, and whether or not patients with positive eigengenes for one of these three gene modules were also positive for the other gene modules was determined. Table 16 demonstrates that patients positive for the IFN module were evenly split with regard to positivity of all other modules, except for the (myeloid not activated) (66%) and the (CD14 monocyte, TGFB1) modules (63%). Patients with positive eigengene values for the plasma cell module were also more likely to be IFN positive (72%), (CD14 TGFB1) positive (68%) and lymphocyte module positive (72%). Patients with inflammatory myeloid cell modules were likely to have positive eigengenes for the MicroRNA module (75%), (myeloid not activated) module (78%), basophils or granulocytes (67%), and negative eigengenes for lymphocytes (35%).
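The co-positivity calculation described above, in which patients positive for one module are tallied for positivity of each other module, may be sketched as follows. The data layout (a mapping from patient to module eigengene values), the function name, and the toy values are assumptions made for illustration and are not trial data.

```python
def co_positivity(eigengenes, anchor_module):
    """Among patients positive for anchor_module, percent positive for each other module.

    eigengenes -- dict mapping patient id to a dict of module name to eigengene value
    """
    anchor_positive = [p for p, mods in eigengenes.items() if mods[anchor_module] > 0]
    if not anchor_positive:
        return {}
    modules = next(iter(eigengenes.values())).keys()
    return {
        m: 100.0 * sum(1 for p in anchor_positive if eigengenes[p][m] > 0)
        / len(anchor_positive)
        for m in modules
        if m != anchor_module
    }


# Toy eigengene values for four hypothetical patients.
toy = {
    "p1": {"black": 0.10, "magenta": 0.20},
    "p2": {"black": 0.30, "magenta": -0.10},
    "p3": {"black": -0.20, "magenta": 0.40},
    "p4": {"black": 0.05, "magenta": 0.01},
}

if __name__ == "__main__":
    print(co_positivity(toy, "black"))  # 2 of 3 black-positive patients are magenta-positive
```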

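TABLE 16

Percentage of patients in each category with positive eigengene values

Column key: n = number of patients; % Pts = percent of patients positive; IFN = IFN PRR; PC = Plasma Cell; InflM = Myeloid Inflam.; miRNA = Micro RNA; Plt = Platelets; MyNA = Myeloid not activated; Baso = Basophils; CD14 = CD14, TGFB1; Ery = Erythrocyte; NoID = No Identity; NKTR = NKTR-IL16; Lymph = Lymphocyte; T = T cells. A dash marks a cell left blank in the table.

Category | n | % Pts | IFN | PC | InflM | miRNA | Plt | MyNA | Baso | CD14 | Ery | NoID | NKTR | Lymph | T
IFN PRR Module Positive | 430 | 53% | - | 57% | 55% | 53% | 47% | 66% | 48% | 63% | 40% | 37% | 53% | 54% | 39%
Plasma Cell Module Positive | 337 | 42% | 72% | - | 37% | 35% | 37% | 53% | 36% | 68% | 34% | 38% | 54% | 72% | 39%
Inflammatory Myeloid Module | 384 | 47% | 61% | 33% | - | 75% | 57% | 78% | 67% | 53% | 53% | 41% | 50% | 35% | 44%
IFN Plus Myeloid Plus Plasma Cell | 104 | 13% | - | - | - | 70% | 42% | 87% | 57% | 72% | 32% | 22% | 51% | 50% | 29%
IFN PRR Plus Myeloid | 132 | 16% | - | - | - | 78% | 62% | 81% | 76% | 45% | 51% | 45% | 45% | 22% | 42%
IFN PRR Plus Plasma Cell | 140 | 17% | - | - | - | 18% | 32% | 46% | 21% | 76% | 33% | 34% | 59% | 84% | 37%
Plasma Cell Plus Myeloid | 22 | 3% | - | - | - | 55% | 50% | 68% | 36% | 68% | 64% | 45% | 64% | 68% | 32%
IFN Only | 53 | 7% | - | - | - | 53% | 57% | 43% | 36% | 60% | 51% | 45% | 64% | 62% | 57%
PC Only | 71 | 9% | - | - | - | 11% | 37% | 11% | 35% | 48% | 32% | 59% | 45% | 82% | 58%
Inflam Myeloid Only | 126 | 16% | - | - | - | 80% | 63% | 68% | 72% | 44% | 71% | 48% | 52% | 31% | 60%
No IFN, PC or Myeloid | 162 | 20% | - | - | - | 26% | 47% | 12% | 51% | 30% | 62% | 72% | 48% | 51% | 67%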

Further breakdown of the three categories with positive relationships to having SLE disease (versus control) demonstrated that patients who had positive eigengene values for all three categories were also likely to be positive for MicroRNA (70%), (Myeloid not activated) (87%), (CD14, TGFB1) (72%), and to have less positive eigengenes for erythrocytes (32%) and the T cell module (29%). Consideration of patients with positive eigengenes for two of the three modules showed that myeloid cells generally stayed together with the exception of the (CD14+TGFB1) module that seemed to sort with the IFN signature. Patients with positive eigengenes for inflammatory myeloid cells were generally positive for the MicroRNA signature, (myeloid not activated), basophils, and erythrocytes. Patients with positive eigengene values for plasma cells were likely to also be positive for lymphocytes (B and T cells) unless also positive for inflammatory myeloid cells. Perhaps most striking were the patients without positive eigengenes for any of the three modules positively correlated to SLE. These patients likely had positive eigengenes for the no identity module (72%) and T cells (67%). They were also likely negative for the MicroRNA module (26%+), myeloid not activated module (12%+), and CD14+TGFB1 monocyte (30%+). Whereas plasma cell and myeloid positive eigengenes were not mutually exclusive, they were unlikely to come together without also having an IFN signature (3%) and it was more common for these signatures to be alone (plasma cell+IFN 17% of patients, myeloid+IFN 16% of patients) than together with the IFN signature (13% of patients). These three patterns of signatures comprised 46% of the total patients (Table 16).
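The breakdown of patients into signature patterns, and the percentages quoted above, may be checked with the following sketch. The pattern counts are the n values reported in Table 16; the function name and the use of eigengene sign to call a module positive are assumptions of the sketch.

```python
def signature_pattern(ifn, plasma_cell, myeloid):
    """Label a patient by which of the three module eigengenes are positive."""
    named = (("IFN", ifn), ("plasma cell", plasma_cell), ("myeloid", myeloid))
    positive = [name for name, value in named if value > 0]
    return " + ".join(positive) if positive else "none"


# Patient counts per pattern, taken from the n column of Table 16.
pattern_counts = {
    "IFN + plasma cell + myeloid": 104,
    "IFN + myeloid": 132,
    "IFN + plasma cell": 140,
    "plasma cell + myeloid": 22,
    "IFN": 53,
    "plasma cell": 71,
    "myeloid": 126,
    "none": 162,
}

total = sum(pattern_counts.values())  # 810 patients
percent = {k: round(100 * n / total) for k, n in pattern_counts.items()}

# The three IFN-containing combinations together cover 376 of 810 patients.
ifn_combined = round(100 * (104 + 132 + 140) / total)  # 46
```

The counts sum to the 810 patients analyzed, and the three IFN-containing combination patterns account for 46% of patients, in agreement with the text.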

Next, the relationship between these modules and SLE disease activity was determined. The four disease measures considered were the SLEDAI, IU of anti-double stranded autoantibodies, g per L complement C3 and C4. As shown in FIGS. 23A-23D, for all disease measures, categories with plasma cells had higher measures of disease activity (increased SLEDAI, autoantibodies, Low C3, C4) than categories without, but the highest disease measures were when patients had positive eigengene values for both PC and the IFN signature.

The pink module had a negative correlation to the SLE cohort and included many T Cell Receptor J region chains and SNORAs and SNORDs. Its negative correlation with the presence of SLE may be used to help subdivide the patients further.

WGCNA was used to divide patients into distinct subsets based on whether they had expression of plasma cell transcripts, IFN, PRR, and myeloid transcripts, or inflammatory myeloid transcripts. It also revealed that 20% of patients were negative for these transcripts, demonstrating that a significant proportion of patients entered into this clinical trial may have a type of non-immune cell mediated lupus. For example, these patients may be eliminated or excluded from lupus clinical trials for immune modulating drugs. Additionally, WGCNA clearly identified patients with only plasma cells but no inflammatory myeloid cells, and vice versa. Both of these signatures were likely to be accompanied by an IFN signature as well. These signatures or endotypes may also allow for immune modulating drugs, which target plasma cells or myeloid cells, to be properly administered to patients with the matching blood signatures.

Example 5: Molecular Endotyping Analysis for Identifying Subsets of Patients with Systemic Lupus Erythematosus Who are Candidates to be Enrolled in Clinical Trials and have a Propensity to Respond to Specific Drugs

Methods of molecular endotyping analysis may comprise performing Gene Set Variation Analysis (GSVA) on gene expression data with predefined gene sets, which may include genes descriptive of inflammatory or immune pathways or immune cell types. This yields a relatively small number of variables which are amenable to standard clustering methods such as k-means, k-medoids, or Gaussian mixture modeling (GMM). GMM may be advantageous over k-means because it considers the variance of each variable separately and is therefore less likely to be adversely affected by clusters of varying shapes and sizes. For each of these methods, clustering algorithms were applied with a range of possible numbers of clusters. Metrics such as the clustering silhouette and Bayesian Information Criterion (BIC) were used to select an optimal number of clusters. GMM analysis of GSVA scores from immunologically related modules in patients from the ILLUMINATE-1 and ILLUMINATE-2 trials indicated that the data was best fitted by four clusters.
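Selection of the number of clusters by the Bayesian Information Criterion may be sketched as follows. This is a minimal illustration, assuming a full-covariance Gaussian mixture and the convention BIC = k ln(n) − 2 ln(L), where lower is better; the function names are hypothetical, and the log-likelihood values in the usage example are invented toy numbers, not results from the ILLUMINATE data.

```python
from math import log


def gmm_param_count(n_components, n_features):
    """Free parameters of a full-covariance Gaussian mixture model."""
    weights = n_components - 1
    means = n_components * n_features
    covariances = n_components * n_features * (n_features + 1) // 2
    return weights + means + covariances


def bic(log_likelihood, n_params, n_samples):
    """Bayesian Information Criterion; lower values indicate a better trade-off."""
    return n_params * log(n_samples) - 2.0 * log_likelihood


def select_n_clusters(log_likelihoods, n_features, n_samples):
    """Return (best number of clusters, BIC per candidate number of clusters)."""
    scores = {
        k: bic(ll, gmm_param_count(k, n_features), n_samples)
        for k, ll in log_likelihoods.items()
    }
    return min(scores, key=scores.get), scores


if __name__ == "__main__":
    # Toy maximized log-likelihoods for 1 to 5 components (illustrative only).
    toy_ll = {1: -3000.0, 2: -2800.0, 3: -2700.0, 4: -2690.0, 5: -2688.0}
    best, scores = select_n_clusters(toy_ll, n_features=2, n_samples=1000)
    print(best)
```

The penalty term grows with the number of components, so a small gain in likelihood from an additional cluster does not by itself justify that cluster.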

The first cluster of patients was highly immunologically active, the second cluster was immunologically inactive, and the other two clusters displayed heterogeneous activation of immune cells and pathways. Patients in these clusters differed in their demographics, concomitant medications, and SLE manifestations. They also showed promising differences in their responses to tabalumab versus placebo. The cluster defined by myeloid cell activation showed little benefit from tabalumab, whereas the cluster defined by lymphoid cell activation trended toward a positive response to tabalumab. Interestingly, the immunologically inactive cluster also trended towards a positive response, partly because this group was the least responsive to placebo.

FIG. 24 shows mean GSVA scores of patients in each cluster defined by GMM. Numbers at the top denote the number of patients in each cluster.

The unbiased gene expression methods do not take prior knowledge of gene sets into account. In some embodiments, the method comprises unsupervised clustering of gene sets generated by WGCNA, as described above. The modules generated by WGCNA can then be used to perform k-means, k-medoids, or GMM clustering of patients. In some embodiments, a search is performed for genes whose expression values are bimodally distributed (preliminary analysis of ILLUMINATE data indicates there are roughly 40 of these genes, mostly IFN-related). These genes are then investigated with clustering methods. In some embodiments, non-linear dimensionality reduction is performed on gene expression data with an autoencoder neural network, and then subjects are clustered based on the resulting latent variables. A particular kind of autoencoder, termed a Gaussian mixture variational autoencoder (GMVAE), constrains the latent variables to be generated by Gaussian mixtures. The gene expression data activates the components of the Gaussian mixtures, which in turn activate the latent variables, which are decoded to reconstruct the gene expression input. A GMM may then be fitted to the latent space to perform clustering; alternatively, subjects may be assigned to clusters based directly on the mixture probabilities.

Clustering methods based on the subjects' clinical parameters also may be used to generate meaningful subsets. Combinations of factors such as age, ancestry, SLE manifestations, and concomitant medications allow for clustering of trial subjects. Methods such as k-medoids may be applicable to categorical data sets. GMVAEs, which are often employed to cluster image data, may be used to process binary clinical variables because these variables are analogous to activated or deactivated pixels in an image.

GMVAE clustering of clinical variables from patients in the ILLUMINATE trials was performed, and five clusters of patients were identified (Table 17). A GMVAE with two latent dimensions was trained on 13 clinical variables. The model correctly reconstructed an average of 10 of the 13 traits, indicating strong performance even with a relatively low number of samples by neural network standards. There is a very similar cluster of young patients with aggressive disease who respond poorly to placebo (Chi-square p value=0.16).

TABLE 17

Average patients in each cluster

Cluster  Size  SLEDAI  Age  Alopecia  Anti-dsDNA  Low Comp.  Ulcers  Antimal.  Cortico.  Immuno.  NSAID  Q2W  Q4W   Placebo
1        218   11      42   62%       67%         35%        35%     65%       82%       39%      30%    41%  44%   38%
2        405   12      37   59%       98%         94%        35%     63%       98%       50%      13%    41%  51%*  30%
3        242   8       45   76%       11%         2%         25%     81%       74%       31%      23%    46%  47%   41%
4        110   11      39   49%       92%         51%        22%     57%       80%       57%      26%    47%  33%   31%
5        228   9       46   50%       18%         14%        52%     59%       21%       25%      71%    41%  40%   38%

The patients in clusters 3 and 5 did not have anti-dsDNA or low complement, and were treated with antimalarials and either corticosteroids or NSAIDs. These patients did not show significant benefit from tabalumab compared to placebo. The other three clusters were more likely to have anti-dsDNA and low complement. Cluster 4, which included 171 patients treated with corticosteroids and immunosuppressives, showed a trend toward positive response to tabalumab (SRI-5 response rates: Q2W 47%, Q4W 33%, Placebo 31%). Cluster 2, which was treated with antimalarials and corticosteroids, achieved significant results (SRI-5 response rates: Q2W 41%, Q4W 51%, Placebo 30%). FIG. 25 shows gene expression of subjects in groups defined by GMVAE. GSVA analysis of the patients in these clusters showed that the patients without serological SLE activity (clusters 3 and 5) also did not show immunological activity by gene expression, whereas the other clusters did show immunological activity.

These approaches demonstrate that patients can be automatically distinguished or stratified into distinct groups, clusters, or subsets, via analysis of their gene expression data, based on factors such as whether a given clinical trial (e.g., for a lupus drug) is more or less likely to succeed for a particular patient. Certain subsets of subjects were shown to respond to treatment at substantially different rates from the other subjects in the study. However, small deviations toward better response to active treatment and worse response to placebo can be combined to produce significant results. Subsets have been successfully identified which are a fraction of the size of the original trials yet still see significant improvement from active treatment compared to placebo. Also, subsets of patients may be identified who achieve little to no benefit from active treatment and ought to be excluded from enrollment in clinical trials. In the ILLUMINATE trials, subsets were identified based on characteristics beyond those that were originally tested for an effect on the outcome. For example, it may seem intuitive to divide subjects in an anti-B-cell activating factor trial on the basis of anti-dsDNA seropositivity, but this failed to explain the failure of the trial. In the analysis results presented herein, the trial succeeded in a cluster of patients with anti-dsDNA, low complement, and concomitant corticosteroids but failed in clusters of patients that were more defined by concomitant use of immunosuppressives. These results demonstrate that complex combinations of factors may be used to more effectively and successfully subdivide patients (e.g., into responder and non-responder groups).

Example 6: Ancestry Influences the Gene Expression Profile in Systemic Lupus Erythematosus (SLE) and Contributes to Gene Expression Heterogeneity in Lupus Patients

Systemic Lupus Erythematosus (SLE) generally refers to a complex autoimmune disease, which has both sex and ancestral bias in affected patients. Gene expression analysis may reveal complex heterogeneity between SLE patients, and the contribution of ancestry, drugs, and SLE manifestations to this heterogeneity was determined. Gene expression analysis between female disease-matched SLE patients of African, European, and Native American ancestry revealed thousands of differentially expressed (DE) transcripts between ancestries, but none within a single ancestry. African, European, and Native American ancestry SLE patients had significantly different cellular contributions to gene expression, and these differences were found to be related to significantly different percentages of patients in each ancestry with specific signatures. Gene Set Variation Analysis (GSVA) showed an increase in plasma cells, B cells, and T cells in the majority of African ancestry patients and an increase in myeloid cell transcripts in most European and Native American ancestry patients. The treatment of SLE patients with drugs, such as corticosteroids and immunosuppressives, significantly changed their gene expression and contributed to the disparate signatures between and within ancestries. Autoantibodies and low complement, but not other clinical features of SLE, were also significantly associated with the gene expression in European and Native American ancestry SLE patients and to a lesser degree in African ancestry SLE patients. Further, differences between African and European ancestry SLE patients were found to be similar to those between healthy people of these ancestries. These ancestry-specific gene expression profiles provide a specific transcriptomic background upon which the SLE patient gene expression pattern can be built.

Systemic Lupus Erythematosus (SLE) generally refers to a complex autoimmune disease affecting mostly women (9:1) and characterized by autoantibodies to DNA and nuclear proteins leading to immune complex formation, complement deposition, and immune damage in multiple organ systems. Heterogeneity in ancestral prevalence, disease severity, organ involvement, and response to treatment can be observed; however, an explanation had not been fully delineated. Whereas the disease may be most prevalent in Asians and people of African-Ancestry (AA), a disproportionate number of clinical trials may be focused on the European Ancestry (EA) population. Further, Native people of North American ancestry may have earlier onset of disease and more organ involvement. In some cases, increased active disease, organ involvement, and autoantibody levels may be observed for AA compared to EA patients, and increased mortality may be observed for AA patients. At the cellular level, the AA population may have more activated B cells and B cell receptor signaling than the EA population. There may be differences in responses of both innate immune cells as well as lymphocytes, suggesting that ancestral differences in immune cells may contribute to the different disease course and incidence between populations. Also, there may be ancestry-related differences in response to therapy across individual patients. For example, AA SLE patients may respond better to B cell depletion therapies than Caucasian patients, but they may display lower responses to anti-BAFF treatment in Phase III clinical trials. Higher serum levels of BAFF in AA SLE patients may suggest that higher doses of the biologic may be necessary in AA patients, and that underlying genetic differences between AA and EA SLE patients may be accounted for in determining treatment decisions. There may be different genetic components contributing to disease development and progression in different ancestral populations. 
For example, transancestral genetic mapping may demonstrate a multigenic effect in SLE that differs according to ancestral background, suggesting a heterogeneous genetic component to disease activity. Unfortunately, many multigenic Genome Wide Association Study (GWAS) differences between AA and EA may be present in non-coding regions, thereby making extrapolation to differences in disease severity challenging.

Heterogeneity in SLE gene expression signatures may be observed for the IFN-stimulated genes. SLE patient gene expression differences may be investigated by creating modules of genes over-represented in pediatric SLE patients. Although expression of some modules may be correlated with changes in disease activity, it may be difficult to reconcile disease activity as measured by SLE Disease Activity Index (SLEDAI) and gene expression signatures in patients. For example, an attempt to group 158 pediatric SLE patients by gene expression may suggest as many as seven different types of lupus. Increased plasmablasts may be detected in AA and increased myeloid signatures may be observed in some EA and Hispanic SLE patients, suggesting that there may be an ancestral basis to explain some of the heterogeneity in SLE gene expression signatures. The many different SLE organ manifestations may also contribute to the heterogeneity in gene expression signatures. The low-density granulocyte (LDG) signature observed in SLE PBMC may correlate with skin and vasculitis manifestations. Further, neutrophil signatures may correlate with progression to active lupus nephritis in pediatric SLE patients. An association between the IFN signature and skin involvement, anti-double-stranded DNA autoantibodies (anti-dsDNA), low complement (Low C) and musculoskeletal SLEDAI manifestations may also be observed.

Whole blood transcriptomes and gene expression analysis may be performed to assess the pattern of abnormal representation of thousands of genes simultaneously, thereby deducing the underlying abnormalities. Moreover, this approach can be used to develop an understanding of the association of ancestry, standard of care (SOC) therapy, and SLE manifestations. Here, the contribution of ancestry, SOC drug therapy, and SLE manifestations to the blood gene expression profile of subjects with SLE was determined. Although some studies may assume that the transcriptomic differences between SLE patients and healthy controls (HC) are related to the disease, these results provide strong evidence that much of the gene expression signature measured between SLE patients and HC is related to patient ancestry and SOC drug regimens, thereby resulting in alterations in the proportions of hematopoietic cells, cellular processes, and signaling pathways detected.

Significant Ancestral Gene Expression Differences in SLE Patients

In order to determine ancestral contributions to gene expression signatures in whole blood (WB), two large phase 3 clinical trial databases with microarray analysis at baseline were analyzed (GSE88884, as described by Hoffman, 2017, which is incorporated by reference herein in its entirety). The Illuminate 1 (ILL1) and Illuminate 2 (ILL2) clinical trials had microarray expression data for 1,566 female patients of self-described ancestry as follows: AA (n=216), EA (n=1,118), and Native American Ancestry (NAA; mostly from South America, n=232; top three countries of origin Peru (n=81), Ecuador (n=30), and Guatemala (n=27)); male patients and patients of multiple, Asian, and other ancestries were removed to avoid contributions of gender differences and low numbers of patients, respectively. Ancestral backgrounds were split evenly between the ILL1 and ILL2 datasets, allowing for a training and test set to determine bulk gene expression differences. Entry criteria for the trials required a positive anti-nuclear autoantibody (ANA) titer and a minimum disease activity of 6, as determined by the SLE Disease Activity Index (SLEDAI). Disease activity was similar among ancestries, as was percentage of patients with anti-dsDNA (Table S1). The trials excluded patients with progressive lupus nephritis and entered only one patient with central nervous system manifestations. Most female patients recruited had a mixture of six SLE manifestations: arthritis (86.4%), anti-dsDNA (57.5%), low complement (Low C, 40.0%), alopecia (58.9%), rash (68.3%), and mucosal ulcers (31.7%) (Table S2). Gene expression differences were first determined by carrying out limma differential expression (DE) analysis of AA, EA, and NAA SLE patients to each other. At a false discovery rate (FDR) of 0.05, thousands of DE transcripts were determined for each ancestry compared to the others for the ILL1 dataset (FIGS. 26A-26D). 
As a control, each ancestral background was randomized into two separate groups five separate times, and DE between the two groups of the same ancestral background was assessed. No DE transcripts were found, even at a less stringent FDR of 0.2. DE analysis comparing ILL2 SLE patients of AA, EA, and NAA ancestry to each other yielded similar results to ILL1, indicating thousands of DE transcripts between ancestries at an FDR of 0.05 (FIGS. 26A-26D). Importantly, the patterns of ancestry-related DE genes were comparable in ILL1 and ILL2 (FIGS. 26A-26D).
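By way of non-limiting illustration, the application of an FDR threshold to per-transcript p-values may be sketched as follows. The sketch assumes the Benjamini-Hochberg procedure, which the text above does not name explicitly; the p-values and the function name `benjamini_hochberg` are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical raw p-values for five transcripts.
raw_p = [0.001, 0.008, 0.039, 0.041, 0.6]
adj_p = benjamini_hochberg(raw_p)
n_de_strict = sum(q <= 0.05 for q in adj_p)  # transcripts called DE at FDR 0.05
n_de_loose = sum(q <= 0.2 for q in adj_p)    # transcripts called DE at FDR 0.2
```

With these hypothetical values, two transcripts pass at an FDR of 0.05 and four pass at an FDR of 0.2. In the randomized within-ancestry control described above, the corresponding count was zero even at the looser threshold.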

In order to interpret the biological meaning of the ancestral gene expression differences, I-scope, a tool for determining the likely hematopoietic cell type in bulk datasets, was used to determine whether there were cellular differences between SLE patients of different ancestral backgrounds. I-scope demonstrated a relative predominance of plasma cells and B cells in AA patients, and of myeloid cells in EA and NAA patients. In EA SLE patients, transcripts for monocytes and low-density granulocytes (LDGs) were enriched compared to AA SLE patients, whereas T cell and MHC class II transcripts were enriched in EA patients compared to NAA patients. NAA patients had increased myeloid signatures, including transcripts associated with monocytes, LDGs, and neutrophils compared to both AA and EA patients (FIG. 27A). Thus, the same ancestral-based cellular enrichments were found for the ILL1 and ILL2 datasets, and the transcripts signifying these cellular categories were remarkably similar between the two datasets. These results indicated a meaningful difference in gene expression profiles of SLE subjects with similar disease severity but of different ancestries.

Next, Gene ontology (GO) biological pathway and Biologically Informed Gene Clustering (BIG-C) (Labonte et al., 2018) enrichment of molecular pathways (Fisher's Exact p<0.05) in AA, EA, or NAA patients was performed, and results supported the conclusions of the I-scope analysis. GO biological pathways demonstrated increased innate immune response and neutrophil chemotaxis in EA and NAA SLE patients compared to AA patients, and increased immunoglobulin transcripts (in GO categories complement activation and regulation of immune response) in AA compared to EA and NAA. There were no GO biological pathways enriched in EA patients compared to both AA and NAA patients. BIG-C analysis revealed that AA patients had increased immune cell surface, immune signaling, and MHC II compared to both NAA and EA patients. AA patients also manifested increased IFN stimulated genes, chromatin remodeling, fatty acid biosynthesis, and the unfolded protein response compared to EA patients. NAA patients had increased immune cell surface, immune signaling, MHC I, autophagy, inflammasome and pattern recognition receptors, anti-apoptosis, and ROS protection compared to both AA and EA patients. NAA patients had increased IFN stimulated genes, transporters, unfolded protein response and integrin pathway compared to EA patients. Similar to GO biological pathways, there were no increased BIG-C categories for EA patients compared to both AA and NAA patients. Gene categories up-regulated in EA patients compared to AA patients included immune cell surface, autophagy, ROS protection, lysosome, and glycolysis. AA and EA patients shared increases in a number of categories compared to NAA patients indicating these processes were likely decreased in NAA patients compared to both AA and EA patients; these included mitochondrial DNA to RNA, mRNA translation, mRNA splicing, MicroRNA processing, TCA cycle, oxidative phosphorylation, and proteasome.
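By way of non-limiting illustration, the enrichment of a pathway or category among DE transcripts by Fisher's exact test may be sketched as follows. The sketch assumes a one-sided (over-representation) test computed from the hypergeometric distribution; the sidedness, the gene universe, and all counts are assumptions and are not taken from the GO or BIG-C analyses.

```python
from math import comb

def enrichment_pvalue(n_universe, n_category, n_de, n_overlap):
    """One-sided Fisher's exact p-value: P(overlap >= n_overlap) under the
    hypergeometric null, for n_de DE transcripts drawn from n_universe
    transcripts of which n_category belong to the category."""
    upper = min(n_category, n_de)
    total = comb(n_universe, n_de)
    tail = sum(
        comb(n_category, k) * comb(n_universe - n_category, n_de - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total

# Hypothetical counts: 20 measured transcripts, 5 in the category,
# 6 DE transcripts, 4 of which fall in the category.
p_enriched = enrichment_pvalue(20, 5, 6, 4)
is_enriched = p_enriched < 0.05
```

Here the category would be called enriched at the p<0.05 threshold used above. A two-sided test, or a different background set of transcripts, would give a different p-value.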

The 798 ILL1 and 768 ILL2 SLE patients were analyzed separately and yielded similar results, even at the individual gene level. To rule out the possibility that these findings could not be extrapolated to other SLE datasets, and to confirm the finding that ancestral differences were significantly contributing to the heterogeneity in gene expression signatures, SLE dataset GSE45291 was also analyzed. 73 AA and 71 EA SLE patients with the same range of SLEDAI scores (2-11), similar mean SLEDAI (AA 3.78+/−2.46; EA 3.53+/−2.08), and mode of SLEDAI (2), were analyzed by Linear Models for Microarray Data (limma) DE analysis, and results indicated that 859 transcripts were increased in AA patients compared to EA patients, and 955 transcripts were increased in EA patients compared to AA patients (FDR 0.05).

Similar to the results using the ILL1 and ILL2 datasets, EA SLE patients were enriched for transcripts associated with myeloid cells (FIG. 27B), and AA SLE patients were enriched for transcripts associated with plasma cells, B cells, and T cells (FIG. 27B).

GO biological pathway analysis demonstrated increased transcripts associated with chemotaxis, TLR signaling, and proteins which may be phosphorylated in EA, and increased transcripts for regulation of immune response, translation, T cell co-stimulation, complement activation, and BCR signaling in AA SLE patients.

BIG-C analysis showed increased immune cell surface, immune signaling, oxidative phosphorylation, mRNA translation, ubiquitylation and ER in AA and increased autophagy, inflammasome, glycolysis, lysosome, endosome, immune cell surface, and intracellular signaling in EA patients. DE analysis of SLE patients with inactive disease (SLEDAI of zero), including 25 AA and 75 EA patients, also revealed significant DE transcripts: 470 increased transcripts in EA patients and 258 increased transcripts in AA SLE patients (FDR of 0.05).

I-scope analysis showed a similar pattern of increased transcripts related to myeloid cells in EA patients, including CLEC4D, CXCL1, CXCL8, FCGR3B, FGL2, LTB4R, BPI, CAMP, IL17RA, MMP9, SIGLEC9, BMX, ITGAM, FPR1, and to plasma cells and B cells in AA patients, including transcripts for IGKC, IGKV4-1, IGLC1, IGLJ3, and JAKMIP1, even though the number of these cell-specific transcripts was decreased compared to patients with higher SLEDAI values (FIGS. 27A-27B). GO biological pathway analysis demonstrated increased glucose metabolism, small GTPase signal transduction, and vesicle fusion in EA patients, and increased membrane components, heme biosynthesis, microtubule, and secreted protein transcripts in AA patients with very low disease activity. Further, BIG-C analysis demonstrated immune cell surface, cytoskeleton, MHC II, and mitochondria increased in AA patients, and TCR cycle, lysosome, endosome, and ubiquitylation upregulated in EA patients. Thus, DE analysis of 4 SLE datasets comprising 1,810 female SLE patients demonstrated significant ancestral components to the whole blood gene expression profile, and some of these gene expression differences were observed to be independent of disease activity.

Differences in Gene Expression Between Ancestries were Associated with Significantly Different Percentages of Patients with Particular Signatures

Population-level gene expression analysis was useful for finding signatures that were significantly different for groups of patients of a specific ancestry. Further, a possibility that features of individual subjects, such as therapy and/or specific disease manifestations, may have contributed to such DE was ruled out, which may be important since ancestral groups may differ in these features. To address this, gene set variation analysis (GSVA) was employed to compare enrichment of 34 modules of genes corresponding to lymphocytes, myeloid cells, cellular processes, as well as groups of all the T Cell Receptor (TCR) and immunoglobulin (Ig) genes found on the Affymetrix HTA2.0 array. GSVA calculates enrichment scores using the log2 expression values for a group of genes in each SLE patient and healthy control and normalizes these scores between −1 (no enrichment) and +1 (enriched). When many genes of a particular cell type or process are co-expressed, GSVA roughly reflects cell counts (FIG. S2). GSVA enrichment scores were calculated for the set of 1,566 female SLE patients and 17 female HC from the ILL1 and ILL2 datasets (GSE88884). The average plus or minus 1 standard deviation (SD) for the healthy controls was used to determine whether a patient had an increased, decreased, or similar signature compared to HC (FIG. 28A).
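By way of non-limiting illustration, the comparison of a patient's GSVA score against the healthy control mean plus or minus 1 SD may be sketched as follows. The sketch assumes the sample standard deviation, since the text above does not state which estimator was used; the scores and the function name `classify_against_controls` are hypothetical.

```python
from statistics import mean, stdev

def classify_against_controls(patient_score, control_scores):
    """Label a GSVA score relative to the healthy-control mean +/- 1 SD."""
    mu = mean(control_scores)
    sd = stdev(control_scores)  # sample SD; an assumption of this sketch
    if patient_score > mu + sd:
        return "increased"
    if patient_score < mu - sd:
        return "decreased"
    return "similar"

# Hypothetical GSVA scores for one module in five healthy controls.
hc_scores = [-0.2, -0.1, 0.0, 0.1, 0.2]
calls = [classify_against_controls(s, hc_scores) for s in (0.5, -0.4, 0.1)]
```

Applying such a rule to every patient and module gives, for each ancestry group, the percentage of patients with an increased or decreased signature, which can then be compared between groups.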

GSVA results demonstrated that the differences between the ancestry groups were related to the significantly different percentages of patients with particular signatures. All three ancestry groups had significantly different frequencies of patients (p<0.01, Fisher's Exact Test) with enrichment of the LDG, granulocyte, IL1 cytokine, and inflammasome signatures. NAA patients had the highest percentage of patients with these signatures, followed by EA patients, and AA patients had the lowest. NAA patients also had significantly more patients with monocyte cell surface and monocyte signatures than AA patients; however, interestingly, signatures for myeloid secreted proteins, which included complement components, TNF, and CXCL10, were not different between the three ancestry groups. The AA patient group had significantly more patients with B cell, Ig, plasma cell, and T regulatory (IKZF2, FOXP3) signatures compared to EA and NAA patients. The NAA patient group had significantly fewer patients with T cell associated signatures compared to both EA and AA patients. The EA patient group had significantly fewer patients whose dendritic cell and pDC signatures were decreased compared to controls. The percentage of AA patients with IFN signatures was higher than that of EA patients (Fisher's exact p=0.04), but differences in overall percentages only ranged from 79% positive (EA) to 85% positive (AA). The AA and NAA patient groups had significantly more SLE patients with platelet and erythrocyte enrichment than EA patients, and significantly fewer patients with decreased erythrocyte and platelet GSVA scores compared to EA patients (FIGS. 28B-28C).

An orthogonal approach using weighted gene co-expression network analysis (WGCNA) was used to confirm the association of ancestry with cellular signatures. WGCNA of GSE88884 ILL1 and ILL2 was performed separately, and results demonstrated a significant (p<0.05) positive association by Pearson correlation of AA ancestry to plasma cell, T cell, and FOXP3 T cell modules, as well as a significant negative correlation to granulocyte and myeloid cell WGCNA modules. NAA ancestry had positive correlations to IFN, granulocyte, platelet, and erythrocyte modules, and negative correlations to T cell and lymphocyte modules. EA ancestry was positively correlated to one myeloid cell module and negatively correlated to IFN, plasma cell, platelet, and erythrocyte modules (FIG. 28D). These analyses confirmed the findings from the DE and GSVA analysis.
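By way of non-limiting illustration, the Pearson correlation between a module summary value and a binary ancestry indicator may be sketched as follows. The sketch covers only the correlation coefficient and omits both the module construction performed by WGCNA and the associated p-value; the module values, the indicator, and the function name `pearson_r` are hypothetical.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical per-patient summary values for one module (e.g., a plasma
# cell module) and an indicator that is 1 for AA patients and 0 otherwise.
module_values = [0.30, 0.25, 0.35, -0.30, -0.25, -0.35]
is_aa = [1, 1, 1, 0, 0, 0]
r_module_ancestry = pearson_r(module_values, is_aa)
```

A positive coefficient of this kind corresponds to the positive association of a module with an ancestry reported above, and a negative coefficient corresponds to a negative association.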

SOC Therapy is Associated with Changes in Gene Expression Profiles

All SLE patients in these analyses were on SOC drug therapy, and the heterogeneity observed in gene expression signatures between ancestral backgrounds may have been influenced by different drug regimens. In order to determine the effect of SOC drugs on patient gene expression signatures, patients on specific therapies were compared to patients not receiving the therapies for the 34 cell type and process modules. Within ancestral groupings, patients taking corticosteroids had significantly (Sidak's multiple comparisons test) increased LDG (AA, EA, and NAA, with p<0.0001) and anti-inflammation (AA, EA, and NAA, with p<0.0001) GSVA scores compared to patients of the same ancestry not taking the drugs, demonstrating that these signatures were strongly influenced by corticosteroid usage. Additionally, both AA and EA patients receiving corticosteroids had significant enrichment for granulocytes (AA, p=0.0009; EA, p=0.005), myeloid secreted (AA, p=0.0001; EA, p<0.0001), monocyte cell surface (AA and EA, p<0.0001), monocytes (AA and EA, p<0.0001), cell cycle (AA, p=0.04; EA, p<0.0001) and the IFN signature (AA, p=0.001; EA, p<0.0001). The effect of corticosteroids on myeloid signatures was further amplified at corticosteroid doses greater than 15 mg/day. Immunosuppressive (IS) therapy (e.g., azathioprine (AZA), mycophenolate mofetil (MMF), or methotrexate (MTX)) did not have a consistent effect on all three ancestry groups. However, IS increased monocyte cell surface (EA, p=0.0013; AA, p=0.0103) and IL1 (EA, p=0.03; AA, p=0.0168) in AA and EA patients. When IS therapy was restricted to just MMF and MTX, there was a consistent decrease across all three ancestry groups in plasma cell (AA, p=0.0087; EA, p<0.0001; NAA, p=0.0130) and immunoglobulin (AA, p=0.0026; EA, p<0.0001; NAA, p=0.0168) GSVA scores.
AZA treatment yielded significantly decreased NK cell GSVA scores (AA, p=0.0004; EA, p<0.0001; NAA, p=0.002) in all three ancestry groups and also significantly decreased T cytotoxic (EA and NAA, p<0.0001) and B cells (EA and NAA, p<0.0001) in NAA and EA ancestries. EA patients receiving NSAIDs compared to all other treatments had decreased LDG (p<0.0001) and anti-inflammation signatures (p=0.0053), whereas anti-malarial drugs had no significant effect on enrichment scores of the 34 cell type and process modules (FIG. 29).
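By way of non-limiting illustration, a Sidak adjustment of p-values for multiple comparisons may be sketched as follows. The sketch assumes the single-step form, in which each p-value is adjusted for the number of comparisons in the family; the statistical software used for the analysis above may implement the test differently, and the p-values and the function name `sidak_adjust` are hypothetical.

```python
def sidak_adjust(pvalues):
    """Single-step Sidak adjustment: p_adj = 1 - (1 - p) ** m for m comparisons."""
    m = len(pvalues)
    return [1.0 - (1.0 - p) ** m for p in pvalues]

# Hypothetical unadjusted p-values for one signature compared between
# treated and untreated patients within each of three ancestry groups.
unadjusted = [0.01, 0.02, 0.5]
sidak_p = sidak_adjust(unadjusted)
significant = [p < 0.05 for p in sidak_p]
```

With these hypothetical values, only the first comparison remains below 0.05 after adjustment, since the second rises from 0.02 to approximately 0.059.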

To determine whether these treatment differences were sufficient to account for the ancestral gene expression differences, signatures were compared between patients on the same drug regimens. Almost all NAA SLE patients were receiving corticosteroids (92%; n=214/232) compared to 70% of AA (n=152 out of 216) and EA (n=787 out of 1,118) patients, and NAA patients were also more frequently taking immunosuppressive drugs (58%) compared to AA (39%) and EA (39%) patients. Comparison of LDG, monocyte, and T cell GSVA scores for patients with or without corticosteroids demonstrated that the corticosteroids were the largest contributor to the differences between patient LDG, monocyte, and T cell scores, but that AA patients still had lower LDG and monocyte scores and NAA patients still had lower T cell scores in the absence of corticosteroids (FIGS. 30A-30C). MTX and MMF significantly lowered plasma cell GSVA scores, but did not negate the increased plasma cells determined for AA patients versus EA and NAA patients (FIG. 30D). Compensating for AZA treatment also did not offset the increased B cells in AA SLE patients (FIG. 30E) or the difference in NK cells between EA and NAA SLE patients (FIG. 30F).

Dataset GSE45291 also had current drug information available for the gene expression data; therefore, GSVA enrichment scores were determined for the 34 cell and process modules, and differences between different drug treatments were determined. Corticosteroids increased LDG, monocyte, and anti-inflammation GSVA enrichment scores, MTX and MMF decreased plasma cell GSVA enrichment scores, and AZA decreased NK and B cell enrichment scores (FIG. S3), in support of the data generated from dataset GSE88884.

Autoantibodies and Complement Levels, but not Clinical Features were Associated with Gene Expression Profiles

Variation in SLE disease manifestations may be a cause for cellular and gene expression heterogeneity in SLE WB. In order to determine the association between different SLE manifestations and gene expression profiles, GSVA enrichment scores for the 34 modules were compared for patients with each manifestation individually to all other manifestations. The presence of arthritis, rash, alopecia, mucosal ulcers, or vasculitis had no consistent effect on GSVA scores of the 34 modules across the ancestries. Patients of all ancestries with both anti-dsDNA and Low C had significantly higher (Sidak's multiple comparisons test, p<0.01) GSVA enrichment scores for anti-inflammation (AA, p=0.0277; EA and NAA, p<0.0001), IFN (AA, p<0.0001; EA and NAA, p<0.0001), plasma cells (AA, p=0.0032; EA and NAA, p<0.0001), immunoglobulins (AA, p=0.0044; EA and NAA, p<0.0001), monocyte cell surface (AA, p=0.03; EA, p<0.0001; NAA, p=0.04) and LDGs (AA, p=0.0008; EA, p<0.0001; NAA, p=0.0103) compared to patients without anti-dsDNA and Low C. For AA and EA SLE patients, increased GSVA scores for plasma cells (AA, p=0.02; EA, p=0.0002) and Ig (AA, p=0.04; EA, p=0.0001) were also found for SLE patients with anti-dsDNA, but not Low C (FIG. 31A).

All patients in the ILL1 and ILL2 datasets were ANA positive, and 255 SLE patients also had anti-ribonucleoprotein (RNP) autoantibody titers measured. Of these 255 SLE patients (19 AA, 54 NAA, and 182 EA), 86 were positive for anti-dsDNA, 37 were positive for anti-RNP, and 68 were positive for both. Comparison of the gene expression profiles of patients positive for anti-dsDNA, anti-RNP, or both to those of the 64 patients in this subset without anti-RNP or anti-dsDNA autoantibodies showed significant increases in GSVA enrichment scores for IFN (anti-dsDNA, p=0.0023; anti-RNP, p=0.0323; both, p<0.0001), plasma cells (anti-dsDNA, p=0.01; anti-RNP and both, p<0.0001), Ig (anti-dsDNA, p=0.0039; anti-RNP and both, p<0.0001) and cell cycle (anti-dsDNA, p=0.0003; anti-RNP and both, p<0.0001). There was a significant decrease in dendritic cells for anti-dsDNA (p=0.03) and a significant increase in T regulatory GSVA scores for both (p<0.0001) (FIG. 31B).

The significant increase in plasma cell signatures detected in AA patients may not be explained by AA patients having an increased incidence of anti-dsDNA and Low C; the AA patient group had the lowest number and percentage of patients with both anti-dsDNA and Low C, 23% (n=50), whereas 29% (n=320) of EA patients and 37% (n=86) of NAA patients had both anti-dsDNA and Low C. To determine whether autoantibodies and complement levels or drugs contributed more to the relationship with specific GSVA signatures, patients positive for both Low C and anti-dsDNA were compared with and without specific drugs or manifestations for cell specific GSVA scores. Patients having both Low C and anti-dsDNA had significantly lower plasma cell GSVA scores if they were also taking either MTX or MMF (FIG. 32A). 90% of patients with both Low C and anti-dsDNA were also receiving corticosteroids, and patients taking corticosteroids had significantly increased LDG GSVA scores, demonstrating that the increase in LDGs observed in patients with anti-dsDNA and Low C was related to concomitant corticosteroid usage, and not the presence of anti-dsDNA and Low C (FIG. 32B).

The increase in monocyte cell surface and IFN signature GSVA scores in patients with both Low C and anti-dsDNA was not explained by corticosteroid usage, as GSVA scores were similar between patients taking or not taking corticosteroids. The increase in IFN signature observed in EA and AA SLE patients on corticosteroids was related to the disproportionate numbers of patients with Low C and anti-dsDNA in the corticosteroid population, 39%, versus only 13% of the patients not taking corticosteroids who had both Low C and anti-dsDNA (FIGS. 32C-32D). In EA SLE patients, decreased NK cells were detected in those with anti-dsDNA or Low C. The effect was related to 23% of patients with Low C and anti-dsDNA also being on AZA (FIG. 32E) compared to only 15% of patients without low C or anti-dsDNA taking AZA (FIG. 32F) and thus not directly related to having anti-dsDNA and Low C. Vasculitis patients had a higher incidence of both anti-dsDNA and Low C, 41%, compared to 22% overall. Separation of vasculitis patients by anti-dsDNA and Low C demonstrated that the significant increase in plasma cells and IFN GSVA scores were likely related to the patients also having both anti-dsDNA and Low C, as there was a significant increase in GSVA enrichment scores for IFN and plasma cells in vasculitis patients with both anti-dsDNA and Low C (FIGS. 32G-32H; plasma cell mean difference=0.2873, p=0.0013, IFN mean difference=0.3889, p<0.0001). Thus, SLE serum components significantly contribute to individual gene expression signatures, but still may not explain the differences observed between AA, EA, and NAA patients.

Male SLE Patients Demonstrated Similar Ancestral Differences as Female SLE Patients

Since the frequency and severity of SLE in male and female patients with SLE is different, initially only female lupus subjects were examined. However, to determine whether ancestral differences are also observed in male lupus subjects, GSVA enrichment scores were calculated for the 34 cell and process modules for 14 AA, 93 EA, and 17 NAA GSE88884 ILL1 and ILL2 male patients and male HC. As shown in FIG. 33A, the pattern of enrichment was similar to that seen between the results obtained for female patients in FIG. 27B, with increased plasma cells, Ig, and T regulatory signatures in AA SLE patients and increased LDG and myeloid signatures in NAA and EA SLE patients. The statistical significance between the groups may not be apparent because of the low numbers of patients examined, except for the LDG and granulocyte signature in NAA compared to AA patients (p=0.0261, p=0.013), the T regulatory signature in AA compared to NAA patients (p=0.0008), and a lack of decreased platelet signatures in NAA compared to AA (p=0.0365) and EA (p=0.0001) patients. AA male patients were also less likely to have decreased TCR alpha and TCR beta signatures compared to EA (p=0.0257, p=0.0141) and NAA (p=0.0013, p=0.0017) male patients. The combination of anti-dsDNA and Low C was associated with positive plasma cell signatures, as was detected for female SLE patients (FIG. 33B).

EA SLE patients were used to determine differences between female patients and male patients with SLE. Because of the large number of female patients, the sets of female patients and male patients were able to be balanced for the percentage of patients on corticosteroids, AZA, and MTX/MMF. Further, the female patients were divided into two age groups, 25-49 years and over 50 years, because of the effects of estrogen on immune responses. For comparison of females 25-49 years old to males, there were 261 DE transcripts from the ILL1 dataset and 74 DE transcripts from the ILL2 dataset (FDR=0.05); 35 of these transcripts were in common between the two datasets, and of these, 26 were encoded on the X or Y chromosome. For comparison to females over 50 years of age, there were 32 DE transcripts from ILL1 and 97 DE transcripts from ILL2; 26 of these transcripts were in common between the two datasets, and of these, 23 were encoded on the X or Y chromosome (FIGS. 33C-33E). For comparison of females age 25-49, there were several increased TCR alpha J region chains, but no increased expression of previously reported estrogen induced genes. There were no DE genes associated with plasma cells or interferon signatures. There were a few transcripts associated with granulocytes (CSF2RA, CEACAM8, DEFA4, CLEC4D, BPI) increased in ILL2 males compared to females over age 50 and ILL1 males compared to females 25-49 years, but no consistent pattern based on age of the female patients.

Ancestry Provides the Gene Expression Backbone, but SOC Drugs Greatly Modify Gene Expression

Analyses of the DE transcripts between different ancestries have shown that EA and NAA populations overexpressed the Duffy blood group antigen ACKR1, the platelet and monocyte receptor CD36, and G6PD, in comparison to all AA populations, and that all of these genes have risk alleles resulting in decreased expression in the AA population. Therefore, gene expression differences detected between SLE patients were shown to be related to heritable differences manifesting in expressed genes in hematopoietic cells of healthy subjects of different ancestries. In order to demonstrate this, gene expression analysis of adult, self-described AA and EA HC subjects was carried out on two separate microarray datasets of normal subjects of different ancestries. Both datasets had thousands of DE transcripts for healthy AA subjects compared to healthy EA subjects; GSE111386 (10 AA, 57 EA) had 3,295 DE transcripts and GSE35846 (22 AA, 55 EA) had 2,476 DE transcripts (FDR of 0.2) with 1,234 transcripts in common between the two datasets. Significant odds ratios (overlap p value<0.0001) were documented between transcripts increased in HC AA subjects compared to HC EA subjects, and transcripts increased in AA SLE patients compared to EA SLE patients in all four SLE datasets (GSE88884 ILL1, GSE88884 ILL2, GSE45291 with SLEDAI of 0, and GSE45291 with SLEDAI of 2-11), and significant odds ratios (Fisher's exact p value<0.0001) were demonstrated between transcripts increased in EA HC subjects and those increased in EA SLE patients, but no significant overlap was observed between AA HC subjects and EA SLE patients, or between EA HC subjects and AA SLE patients (FIG. 34A).

I-scope analysis of the transcripts increased in healthy AA subjects demonstrated an increase in B cell, dendritic, erythrocyte, and platelet associated transcripts compared to EA HC subjects, and an increase in granulocyte, monocyte, and myeloid transcripts in healthy EA subjects compared to AA HC subjects (FIG. 34B). IFI27, a gene commonly used to monitor the IFN signature, was increased in healthy AA subjects in both datasets, and IFITM2, another IFN signature gene, was increased in healthy EA subjects in both datasets. CXCL5, IL32, and TNFSF4 were increased in healthy AA subjects in both datasets, and CXCL8, CXCL1, GRN, MMP9, TNFSF14, and CXCL6 were increased in healthy EA subjects in both datasets. There were no genes associated with plasma cells or LDGs DE between AA and EA HC subjects, and the majority of the IFN signature genes and inflammatory secreted genes were not differentially expressed between AA and EA subjects, including IFI44, IFI44L, C1QA, C1QB, C1QC, CCL2, CXCL10, CXCL2, IL1B, TNF, and THBD.

In order to determine the relative importance of ancestry, SOC drugs, and SLE manifestations to gene expression signatures, stepwise logistic regression analysis was performed for each of the 34 cell type and process signatures using the variables of ancestry (AA, EA, NAA), SOC drugs (MTX, MMF, AZA, corticosteroid drugs, NSAID drugs, and anti-malarial drugs), SLE serum components (anti-dsDNA, Low C3, Low C4) and SLE manifestations (arthritis, rash, mucosal ulcers, vasculitis, thrombocytopenia). FIG. 35 shows a CIRCOS visualization of the odds ratios for each variable significantly (p<0.05) contributing to each GSVA enrichment score. Ancestry significantly influenced 21 of the 34 cell type and process module scores. For AA patients, there was a negative relationship to LDG, granulocytes, IL1 cytokines, and inflammasome and a positive relationship to low pDC, Treg, IFN, plasma cells, Ig, and B cells. Low MHC II and the low SNOR up were negatively associated with NAA patients, and NAA status was positively associated with inflammasome, low T cells, and platelets. For EA patients, there was a negative association to low NK cells, granulocytes, UPR, low SNOR down, and the cell cycle and a positive association to the inflammasome, low platelets, and Treg. SLE serum components significantly influenced 19 of the 34 modules with the most significant odds ratios and confidence intervals for the IFN signature, cell cycle, plasma cells, and Ig. SLE manifestations influenced the transcriptome the least, with significant relationships to 14 signatures, but with confidence intervals very close to 1. SOC drugs influenced every cell and process module GSVA enrichment score, with the most profound effects by AZA on NK and B cells, MTX/MMF on plasma cells, Ig, and T cells, and corticosteroids on myeloid cells (based on Spearman correlation coefficients between variables, confidence intervals, p values, and odds ratios).

Based on these data, it was hypothesized that balancing SOC drugs in SLE patients may significantly reduce the number of DE transcripts between AA and EA SLE patients. The DE analysis was repeated on GSE88884 ILL1 and ILL2 AA to EA SLE patients from FIGS. 26A-26D, but this time with selected AA and EA SLE patients of similar daily steroid usage (mean, median, and mode), no immunosuppressive drugs, and similar percentages receiving anti-malarial drugs and NSAID drugs. There were 606 DE transcripts from the ILL1 dataset AA (n=41) to EA (n=144), and 535 DE transcripts for ILL2 dataset AA (n=44) to EA (n=154) (FDR=0.05); a loss of 83 percent and 82 percent of the DE transcripts, respectively, compared to DE analysis of all ILL1 and ILL2 AA to EA SLE patients with non-matched SOC drugs in FIGS. 26A-26D. Thus, the combination of different drug regimens and ancestry significantly changed patient gene expression, with profound implications for interpretation of gene expression analyses.

DISCUSSION

The analysis and results herein provide a significant understanding of the contributions of SLE patient ancestry and SOC drugs to the subject's gene expression profile. Furthermore, the results demonstrate important ancestry-based gene expression differences present in healthy controls of AA, NAA, and EA ancestry that serve as the background for the heterogeneous transcriptomic signatures detected in SLE patients. Thousands of DE transcripts were identified when AA, EA, and NAA SLE patients were compared to each other. There were no detectable DE transcripts when SLE patients of the same ancestry were randomized and compared, demonstrating that the differential expression between ancestral groups was determined by genetic ancestral make-up to a significant extent.

The ancestry-related differences in gene expression profiles highlight an important issue of using appropriate numbers of controls with matching ancestry to determine meaningful changes in a disease state. A striking overlap was observed between unrelated AA HC subjects and EA DE analyses and the separate AA SLE and EA DE analyses of 1,810 patients. Somewhat surprisingly, the AA HC subjects overlapped with AA SLE patients better than the EA HC subjects to EA SLE patients, since the AA subjects may be expected to contain more admixture than the EA subjects. These data demonstrate that ancestral gene expression differences serve as a backdrop on which the transcriptomic signature is built and account for much of the heterogeneity in blood gene signatures. Ancestral SNPs in HC may be estimated to account for about 17-28% of variation in gene expression, and these results demonstrated these gene expression differences readily contribute to an SLE patient's transcriptomic signature. Additionally, several ancestral-related genes divergent between AA and EA populations that are also involved in immune responses were differentially expressed between SLE patients and HC subjects of different ancestries: IL8, CXCL1, CXCL5, STAT1, CEPBP, ITGAM, and CD58, demonstrating that ancestral SNPs contribute to the gene expression profile. It may be shown that AA is associated with increased responses to infection and increased expression of inflammatory response genes. While generally, an increased inflammatory response may be associated with an increase in innate immune response cells, the results actually showed a depletion, or less of an increase, in myeloid cells in AA patients compared to EA and NAA patients. Interestingly, there was no significant difference in expression of transcripts for inflammatory mediators such as complement, TNF, and CXCL10, despite the difference in detection of cell types that generally produce these inflammatory mediators.
This result indicates that individual innate immune cells from AA patients produce more inflammatory mediators.

The ramifications of these results toward interpretation of gene expression analysis are important. HC of AA and EA ancestries were reproducibly shown to be disparate in erythrocyte, platelet, B cell, T cell, NK cell, granulocyte, and monocyte transcripts; furthermore, this transcript data agrees with cell counts and genetic differences between ancestries. Platelet counts may be shown to be higher in AA than EA patients, and the Duffy Null Polymorphism (ACKR1 gene) may be shown to be a cause of decreased neutrophil counts in AA patients. CD19+ B cell counts may be shown to be increased in AA patients compared to EA patients, and CD3+ T cells may be shown to be increased in EA patients versus AA patients, although overall lymphocyte counts may not be different. The erythrocyte transcripts increased in AA patients may be related to increased reticulocytes in the circulation, and this may be explained by AA patients more frequently possessing X-linked G6PD alleles responsible for the African ancestry-associated G6PD deficiency prominent in AA males. Reticulocytosis may be augmented in AA patients with SLE, as persons with G6PD deficiency may have induced hemolysis secondary to infection and leukocyte phagocytosis. G6PD was decreased in both AA SLE patients and AA HC subjects compared to EA SLE patients and EA HC subjects. The ancestral transcriptomic backbone may be emphasized depending on HC comparators, and as a result, many DE transcripts may be inappropriately attributed to the disease instead of the ancestry, whether or not the allelic differences play an actual role in the pathogenesis of SLE. Analysis of purified cell types from AA and EA SLE patients may show only about 10% similar transcripts, indicating disparate constitutive pathways and metabolism operating in AA and EA SLE patient hematopoietic cells.
Although these data and results described herein confirmed strong ancestral contributions to the SLE signature, there were patients within all ancestries with disparate signatures from the prevailing ancestral type, demonstrating that personalized medicine strategies to determine the type of lupus may be helpful, instead of relying on ancestral background or group statistics (e.g., median or mean). Additionally, drugs and their effect on cell populations and signaling pathways may be taken into account to help focus attention onto pathways and cells involved in disease and not the treatment. The IL-1, inflammasome, and LDG increased signatures detected in NAA patients appeared to be related to corticosteroid drugs. This signature may be further deciphered by performing studies of healthy NAA subjects. Single-cell technology may be used to elucidate and observe effects of ancestry and SOC drugs, and to distinguish cell populations that are prominent in ancestries and induced or repressed by concomitant drugs from cell populations actively participating in disease processes.

The results demonstrate a strong relationship between SLE serum components and circulating Ig, plasma cell, cell cycle, and IFN GSVA scores; further, this association was more pronounced in EA and NAA patients than AA patients. These data also demonstrated that observed increases in plasma cell signatures in pediatric AA SLE patients are likely related to ancestry, and not disease activity. Increased Ig production is associated with plasma cells, and Ig genes have been used as a proxy for plasma cell measurements in microarray datasets. Both healthy control AA and EA datasets were on Illumina chips that harbor only a few Ig genes, so although Ig genes were not detected as different between healthy AA and EA, in some cases, this signature may derive from healthy B cells, which may explain why AA plasma cell GSVA scores did not correlate as well with serum component measurements. Single-cell RNAseq analysis of isolated hematopoietic cell types in healthy subjects may demonstrate that B cells have increased Ig transcripts compared to all cell types except plasma cells. Lupus in the AA population may be strongly biased towards generation of plasma cells. Since healthy AA subjects, in two separate datasets, also showed increased transcripts associated with B cells, the increase in plasma cells may have an origin in the inherent differences in the healthy AA population.

Further, the results herein demonstrated that increased IFN signatures were associated with anti-dsDNA and Low C in all ancestry groups. AA SLE patients may be shown to be more likely to have an IFN signature than EA SLE patients; the results obtained also detected significantly more AA than EA SLE patients with an IFN signature, but the percentages of IFN-positive patients were greater than 75% for both ancestry groups and less useful for distinguishing AA from EA SLE patients. Corticosteroids may be demonstrated to decrease IFN signaling, but this effect was not seen in this study and may be a result of the large number of patients on corticosteroids also having both anti-dsDNA and Low C. In some cases, monocytes appear to retain the IFN signature in inactive lupus patients, confounding usage of this signature to determine disease activity, and the increased IFN signature in SLE patients with anti-dsDNA and Low C may be accompanied with increased signatures for monocyte cell surface transcripts.

Besides the effect of ancestry and SLE serum components, the results and data demonstrated the profound effect SOC therapies have on SLE patient gene expression profiles, and indicate a method of accounting for these effects using the change in GSVA enrichment score associated with drug administration. When the SOC drugs were matched between AA and EA SLE patients, more than 80% of the DE transcripts were lost between AA and EA SLE patients from ILL1, and this was repeated in ILL2. Patients with increased GSVA scores compared to controls for the inflammasome, IL-1, and myeloid signatures were significantly increased in the NAA population, and the number of DE transcripts between AA and EA patients was almost twice the difference between AA and EA patients, indicating at first that this population was the most different from AA and EA patients. However, further analysis determined that NAA were also receiving more corticosteroids and immunosuppressive therapy, and that this therapy was likely accounting for much of their increased myeloid and decreased lymphocyte signatures.

Further, the results showed increased signatures for myeloid cells in pediatric EA and NAA (Hispanic) SLE compared to AA patients, although this difference may be related to the benign neutropenia common in people of African ancestry and to the increased corticosteroids taken by NAA patients, rather than to lupus. By using more than 1,500 SLE patients, it was shown that AA SLE patients did not have significantly enriched plasma cell signatures compared to EA and NAA ancestry groups, if all patients had both anti-dsDNA and Low C, or if all patients were receiving MTX or MMF. Although AA patients also had the lowest number of patients on AZA, and AZA therapy was related to decreased B cell GSVA scores, there were not enough patients receiving this therapy for this drug to account for the differences noted between ancestry groups. In confirmation of the methodology used, AZA treatment significantly decreased NK cell GSVA scores in all three ancestry groups in the GSE88884 and GSE45291 datasets, consistent with an effect of AZA on NK cells. EA patients had significantly higher NK cell GSVA scores compared to NAA patients, when both were not receiving AZA treatment; however, there was no significant difference when both ancestry groups were receiving AZA treatment.

The association of neutrophil granule protein transcripts (LDG signature) with corticosteroid usage may be observed. Corticosteroid usage also had a significant effect on most myeloid signatures including monocyte cell surface transcripts, myeloid secreted protein transcripts, and IL1 transcripts. This may be a result of increasing this population in the periphery as steroids may be shown to increase demargination of mature neutrophils. The LDG signature was also prominently detected in EA SLE patients with SLEDAI values of zero on corticosteroids. LDGs in autoimmunity may be described as being inflammatory and contributing to SLE pathogenesis from data obtained from in vitro experiments demonstrating an increased capacity for production of inflammatory cytokines. However, corticosteroids may be demonstrated to induce human monocytes to secrete G-CSF, and G-CSF may mobilize neutrophils from the bone marrow, indicating a mechanism where chronic corticosteroid use may promote the release of immature neutrophils. G-CSF therapy for neutropenia in lupus patients may induce flares and vasculitis, indicating a pathologic role for G-CSF. G-CSF also may be shown to increase a glycosylated, membrane form of MPO on mature neutrophils and monocytes, and this form of MPO may bind to E-selectin on human endothelium and induce cytotoxicity. The strong relationship between LDGs and corticosteroid usage, and yet the presence of transcripts for granule proteins in patients reportedly not taking corticosteroids, may be indicative that there may be two or more different populations of granule expressing cell populations. The relative contribution to microarray signatures of genes related to neutrophils may be disparate between AA and other populations and may not reflect differences in lupus. Therefore, different neutrophil signatures may arise because of ancestry-related rather than lupus-related differences.

The observed lack of difference in GSVA scores for inflammatory cell populations, inflammatory cytokines, IFN signatures, and the TNF pathway for patients treated with anti-malarial drugs (e.g., hydroxychloroquine (Plaquenil), chloroquine (Aralen), and quinacrine (Atabrine)) compared to all other treatments was surprising, as chloroquine may decrease anti-inflammatory cytokine production. Experiments may demonstrate that hydroxychloroquine blocks TLR 9/7 stimulation and subsequent IFN production in vitro. As plasmacytoid dendritic cells were generally decreased in the periphery of SLE patients, perhaps the target cells for anti-malarial drugs are found in tissues, but this data demonstrated no significant changes in cell populations or processes associated with anti-malarial usage in the periphery. Surprisingly, NSAID drugs had more of an effect on gene expression profiles than anti-malarial drugs. Although commonly known as cyclooxygenase isoenzyme inhibitors, NSAID drugs may be shown to block caspases and inflammation; although the change in GSVA score was not greater than 0.2, there did appear to be a decrease in LDGs and the anti-inflammation signature, at least in EA SLE patients.

Major differences may be reported in lupus cohorts between male and female SLE patients with respect to renal involvement and serological manifestations. While renal patients were excluded from the ILL1 and ILL2 clinical trials, among patients with non-renal manifestations, there did not appear to be consistent differences in gene expression other than the expected transcripts encoded on the X and Y chromosomes. Gene expression differences attributable to estrogen in female patients under 50 may be expected; however, analysis of the DE transcripts did not reveal an obvious link to effects on the immune system. The ancestral differences between males also appeared similar to the ancestral differences between females, indicating that the ancestral component of gene expression is more important to take into consideration than male-vs.-female differences.

Self-identified ancestry gave useful information for the genetic background of an individual; further, pairing studies with genetic data may be performed to determine specific ancestry admixtures. The current results provide a framework for determining the meaningful contributions to the SLE disease transcriptome and to separate these contributions from the effects of SOC therapy and ancestry.

In summary, ancestry plays an important role in the gene expression profiles of individual SLE patients and by implication contributes to the molecular pathways operative in each subject. Understanding, for example, that some self-described AA patients may have higher levels of transcripts for B cells, erythrocytes, and platelets compared to EA SLE patients may help explain differences in gene expression data that do not manifest from the SLE disease, but from the patient's ancestral background. The relationship of corticosteroid drugs to LDGs has implications against using this signature as a measure of disease severity or interpreting LDGs as playing a role in worsening disease, as worsening disease may prompt an increase in corticosteroid doses. Combinations of different ancestry, SOC therapy, and autoantibody production associated with gene expression profiles make datasets comprised of different populations from around the world difficult to compare. Understanding the contributions of the gene expression signature components may permit a better understanding and interpretation of the signatures and their relationship to disease status.

Methods

Gene expression datasets were obtained as follows. Data were derived from publicly available datasets on Gene Expression Omnibus (GEO, www.ncbi.nlm.nih.gov/geo/). Raw data sources were used as follows: GSE88884 female whole blood Illuminate 1 (ILL1; 10 female HC, 798 female SLE (540 EA, 101 AA, and 157 NAA); all with SLEDAI≥6), GSE88884 female whole blood Illuminate 2 (ILL2; 7 female HC, 767 female SLE (577 EA, 115 AA, and 75 NAA); all with SLEDAI≥6), GSE88884 male whole blood Illuminate 1 (ILL1; 5 male HC, 59 male SLE (6 AA, 42 EA, and 11 NAA)), GSE88884 male whole blood Illuminate 2 (ILL2; 4 male HC, 65 male SLE (8 AA, 51 EA, and 6 NAA)), GSE45291 whole blood (9 female HC, female SLE: 73 AA, 71 EA with SLEDAI of 2-11), GSE45291 whole blood (9 female HC, female SLE: 25 AA, 75 EA; all with SLEDAI of 0), GSE35846 whole blood from healthy females (55 EA, 22 AA), and GSE111386 whole blood from healthy females (10 AA, 57 EA). Clinical data including disease activity assessed by SLEDAI, anti-dsDNA titers, complement levels, disease manifestations, and standard of care drugs were provided by Eli Lilly (GSE88884 Illuminate 1 and Illuminate 2).

Quality control and normalization of raw data files were performed as follows. Statistical analysis was conducted using R and relevant Bioconductor packages. For datasets GSE88884 and GSE45291, non-normalized arrays were inspected for visual artifacts or poor RNA hybridization using Affy QC plots. To increase the probability of identifying differentially expressed genes (DEGs), analysis was conducted using normalized datasets prepared using the native Affy chip definition files (CDFs), followed by custom Brain Array (BA) Entrez CDFs maintained by the University of Michigan Molecular and Behavioral Neuroscience Institute. The Affy CDFs include multiple probes per gene and almost twice as many probes as BA CDFs. Whereas Affy CDFs can provide the greatest amount of variance information for Bayesian fitting, the BA CDFs are used to exclude probes with known non-specific binding and those shown by quarterly BLASTs to no longer fall within the target gene. Illumina CDFs were used for the Illumina datasets (GSE35846, GSE111386).

Differential gene expression (DE) analysis was performed as follows. GCRMA normalized expression values were variance-corrected using local empirical Bayesian shrinkage, followed by calculation of DE using the ebayes function in the open source BioConductor LIMMA package (www.bioconductor.org/packages/release/bioc/html/limma.html). Resulting p-values were adjusted for multiple hypothesis testing and filtered to retain DE probes with a False Discovery Rate (FDR) of less than 0.05.
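The multiple-testing adjustment and FDR filter described above can be illustrated with a minimal Python sketch. The source does not name the adjustment method; the Benjamini-Hochberg procedure is assumed here, and the function names and probe identifiers are hypothetical. The sketch is not the LIMMA implementation and covers only the adjustment and filtering step, not the empirical Bayes moderation.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order.

    Assumption: BH is the adjustment method; the source only states that
    p-values were adjusted for multiple hypothesis testing.
    """
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def filter_de_probes(probe_p_values, fdr_cutoff=0.05):
    """Keep probes whose adjusted p-value is below the FDR cutoff."""
    probes = list(probe_p_values)
    adjusted = benjamini_hochberg([probe_p_values[p] for p in probes])
    return {p: q for p, q in zip(probes, adjusted) if q < fdr_cutoff}


if __name__ == "__main__":
    # Hypothetical probe identifiers and raw p-values for illustration only.
    raw = {"probe_1": 0.01, "probe_2": 0.2, "probe_3": 0.03, "probe_4": 0.005}
    print(filter_de_probes(raw))
```

The cutoff of 0.05 mirrors the FDR threshold stated above; other analyses in this example used a different threshold (e.g., an FDR of 0.2 for the healthy control datasets), which would be passed as `fdr_cutoff`.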

Determination of female and male controls was performed as follows. Log2 expression values were used to determine sex of unknown healthy controls and to compute sex module scores using the formula below:

Sex module = XIST log2 expression + TSIX log2 expression − (UTY log2 expression + USP9Y log2 expression).

Female controls scored above zero and male controls scored below zero.
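As an illustration, the sex module score and the sign-based assignment can be sketched in Python as follows. The function names and the example expression values are hypothetical, and the handling of a score of exactly zero is an assumption, since that case is not addressed above.

```python
def sex_module_score(log2_expr):
    """Sex module = XIST + TSIX - (UTY + USP9Y), using log2 expression values."""
    female_part = log2_expr["XIST"] + log2_expr["TSIX"]
    male_part = log2_expr["UTY"] + log2_expr["USP9Y"]
    return female_part - male_part


def infer_sex(log2_expr):
    """Assign sex from the sign of the sex module score."""
    score = sex_module_score(log2_expr)
    if score > 0:
        return "female"
    if score < 0:
        return "male"
    # Assumption: a score of exactly zero is left unassigned.
    return "undetermined"


if __name__ == "__main__":
    # Hypothetical log2 expression values for two healthy controls.
    control_a = {"XIST": 11.2, "TSIX": 6.1, "UTY": 3.0, "USP9Y": 3.4}
    control_b = {"XIST": 3.1, "TSIX": 3.3, "UTY": 8.7, "USP9Y": 9.2}
    print(infer_sex(control_a), infer_sex(control_b))
```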

I-Scope

I-scope is a tool developed to identify immune infiltrates. I-scope was created through an iterative search of more than 17,000 genes identified in more than 50 microarray datasets. From this search, 1,226 candidate genes were identified and researched for restriction in hematopoietic cells as determined by the HPA, GTEx, and FANTOM5 datasets (www.proteinatlas.org). A set of 926 genes met a set of criteria for being mainly restricted to hematopoietic lineages (brain, reproductive organ exclusions were permitted). These genes were researched for immune cell specific expression in hematopoietic sub-categories: T cells, Regulatory T Cells (Treg), Activated T cells (Tactivated), Anergic/Activated cells (Tanergic), Alpha/Beta T cells (abTcells), Gamma delta T cells (gdTcells), CD8 T, NK/NKT cells, NK cells, T or B cells, B cells, B or pDC cells, GC B cells, T or B or Myeloid cells, B or Myeloid cells, Antigen Presenting Cells or MHC Class II expressing cells (MHC II), Dendritic cells (Dendritic), Plasmacytoid dendritic cells (pDC), Myeloid cells (Myeloid), Monocytes, Plasma Cells (Plasma), Erythrocytes (Erythro), Granulocytes (Neut), Low density granulocytes (LDG), and Platelets. Transcripts were entered into I-scope, and the number of transcripts in each category was determined. Odds ratios were calculated with confidence intervals using Fisher's exact test in R.
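The category counting and enrichment test can be illustrated with a minimal Python sketch. This is not the I-scope implementation: the choice of background gene set, the function names, and the toy gene identifiers are assumptions. The sketch reports the sample (cross-product) odds ratio, which differs from the conditional maximum likelihood estimate reported by R's fisher.test, and it omits the confidence interval.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of observing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    # Sum the probabilities of all tables no more likely than the observed one.
    total = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-7))
    return min(1.0, total)


def category_enrichment(de_genes, category_genes, background_genes):
    """Count DE transcripts in a category and test for enrichment.

    Assumption: the background is a user-supplied gene universe; the source
    does not state which background was used.
    """
    background = set(background_genes)
    de = set(de_genes) & background
    cat = set(category_genes) & background
    a = len(de & cat)                 # DE and in category
    b = len(de - cat)                 # DE, not in category
    c = len(cat - de)                 # in category, not DE
    d = len(background) - a - b - c   # neither
    odds_ratio = (a * d) / (b * c) if b * c else float("inf")
    return a, odds_ratio, fisher_exact_two_sided(a, b, c, d)


if __name__ == "__main__":
    # Toy identifiers for illustration only.
    background = [f"g{i}" for i in range(20)]
    de = ["g0", "g1", "g2", "g3", "g4"]
    plasma_cell_category = ["g0", "g1", "g2", "g3", "g10"]
    print(category_enrichment(de, plasma_cell_category, background))
```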

Gene ontology (GO) biological pathways were determined as follows. The database for annotation, visualization and integrated discovery (DAVID) (david.abcc.ncifcrf.gov/) was used to determine enriched GO biological pathways.

Gene Set Variation Analysis (GSVA) was performed as follows. The GSVA (V1.25.0) software package is an open source package available from R/Bioconductor, and was used as a non-parametric, unsupervised method for estimating the variation of pre-defined gene sets in patient and control samples of microarray expression data sets (www.bioconductor.org/packages/release/bioc/html/GSVA.html). The inputs for the GSVA algorithm were a gene expression matrix of log2 microarray expression values (Brain Array chip definitions) and pre-defined gene sets co-expressed in SLE datasets. Enrichment scores (GSVA scores) were calculated non-parametrically using a Kolmogorov-Smirnov (KS)-like random walk statistic. A negative value for a particular sample and gene set means that the gene set has lower expression than the same gene set with a positive value. The enrichment scores (ES) were the largest positive and negative random walk deviations from zero, respectively, for a particular sample and gene set. The positive and negative ES for a particular gene set depend on the expression levels of the genes that form the pre-defined gene set.
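The KS-like random walk can be illustrated with a simplified Python sketch. This is not the GSVA package implementation: it ranks genes within a single sample by raw expression and uses unweighted steps, whereas GSVA first estimates a per-gene cumulative density across samples and weights steps by rank. Combining the two deviations into a single score by their difference in magnitude is also an assumption. All names and values are hypothetical.

```python
def random_walk_enrichment(sample_expr, gene_set):
    """Simplified KS-like random walk for one sample and one gene set.

    Returns (largest positive deviation, largest negative deviation, combined
    score). Assumption: unweighted steps on a within-sample ranking.
    """
    # Rank genes from highest to lowest expression in this sample.
    ranked = sorted(sample_expr, key=sample_expr.get, reverse=True)
    in_set = [gene in gene_set for gene in ranked]
    n_hits = sum(in_set)
    n_miss = len(ranked) - n_hits
    if n_hits == 0 or n_miss == 0:
        raise ValueError("gene set must contain some, but not all, ranked genes")

    running = 0.0
    max_pos = 0.0
    max_neg = 0.0
    for hit in in_set:
        # Step up for a gene in the set, step down otherwise.
        running += 1.0 / n_hits if hit else -1.0 / n_miss
        max_pos = max(max_pos, running)
        max_neg = min(max_neg, running)
    # max_neg is <= 0, so the sum equals |ES+| - |ES-|.
    return max_pos, max_neg, max_pos + max_neg


if __name__ == "__main__":
    # Hypothetical log2 expression values for one sample.
    sample = {"A": 6.0, "B": 5.0, "C": 4.0, "D": 3.0, "E": 2.0, "F": 1.0}
    print(random_walk_enrichment(sample, {"A", "B"}))  # set at top of ranking
    print(random_walk_enrichment(sample, {"E", "F"}))  # set at bottom of ranking
```

In this sketch, a gene set concentrated among the most highly expressed genes yields a positive combined score and a set concentrated among the least expressed genes yields a negative one, matching the interpretation of positive and negative values given above.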

Enrichment modules containing cell type and process-specific genes were created through an iterative process of identifying DE transcripts pertaining to a restricted profile of hematopoietic cells in 13 SLE microarray datasets, and checked for expression in purified T cells, B cells, and Monocytes to remove transcripts indicative of multiple cell types. Genes were identified through literature mining, GO biological pathways, and STRING interactome analysis as belonging to specific categories. The Low Disease (Signature) Up and Low Disease (Signature) Down are the seven most over-expressed and seven most under-expressed transcripts by log fold change for 348 female patients from dataset GSE88884 (ILL1 and ILL2) that were not separated from healthy controls by principal component analysis (PCA), compared by limma DE analysis to HC (FDR=0.05). The LDG signature was taken from purified LDGs DE to HC and SLE neutrophils (Villanueva, 2011) and consists mainly of neutrophil granule proteins from Module B as described in Kegerreis et al. (2019). The overlap in genes between some signatures was intentional and used to check that signatures were behaving cohesively between patients.

Weighted Gene Coexpression Network Analysis (WGCNA) was performed as follows. WGCNA is an open source package for R available at horvath.genetics.ucla.edu/html/CoexpressionNetwork/Rpackages/WGCNA/.

Log2 normalized microarray expression values for the GSE88884 ILL1 and ILL2 datasets were filtered using an IQR to remove saturated probes with low variability between samples and used as inputs to WGCNA (V1.51). Adjacency co-expression matrices for all probes in a given set were calculated by Pearson's correlation using signed network type specific formulae. Blockwise network construction was performed using soft threshold power values that were manually selected and specific to each dataset in order to preserve maximal scale free topology of the networks.
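As a non-limiting illustration, a signed co-expression adjacency may be sketched in Python as follows. The sketch assumes the signed network formula commonly used with WGCNA, in which the adjacency is ((1 + r) / 2) raised to the soft threshold power, where r is Pearson's correlation. The probe names, expression values, and power value are hypothetical, and blockwise construction is not shown.

```python
import math


def pearson(x, y):
    """Pearson's correlation between two equal-length expression profiles.

    Assumes neither profile is constant (low-variability probes are taken
    to have been removed by the IQR filter beforehand).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def signed_adjacency(profiles, power):
    """Signed adjacency for every probe pair: ((1 + r) / 2) ** power."""
    probes = list(profiles)
    return {
        (p, q): (0.5 * (1.0 + pearson(profiles[p], profiles[q]))) ** power
        for p in probes
        for q in probes
    }


# Hypothetical log2 expression values across four samples.
profiles = {
    "probe_a": [1.0, 2.0, 3.0, 4.0],
    "probe_b": [2.0, 4.0, 6.0, 8.0],  # perfectly correlated with probe_a
    "probe_c": [4.0, 3.0, 2.0, 1.0],  # perfectly anti-correlated with probe_a
}
adjacency = signed_adjacency(profiles, power=12)  # power chosen for illustration only
```

In a signed network of this form, strongly anti-correlated probes receive an adjacency near zero rather than near one, so that positively and negatively correlated probes are not grouped into the same module.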

Resultant dendrograms of correlation networks were trimmed to isolate individual modular groups of probes, labeled using semi-random color assignments, based on a detection cut height of 1, with a merging cut height of 0.2, with the additional use of a partitioning around medoids function. Final membership of probes representing the same gene into modules was based on selection of greatest scale within module correlation against module eigengene (ME) values. Correlation to ancestry was performed using Pearson's r against MEs, defining modules as either positively or negatively correlated with those traits as a whole.

Gene Overlap analysis was performed as follows. GeneOverlap is an R/Bioconductor package (www.bioconductor.org/packages/release/bioc/html/GeneOverlap.html), which was used to test the significance of overlap between two gene lists. It uses Fisher's exact test to compute both an odds ratio and an overlap p value. For comparison of datasets on different array platforms (Illumina versus Affymetrix), an FDR of 0.2 was used.
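As a non-limiting illustration, an overlap test between two gene lists may be sketched in Python as follows. The sketch builds the 2x2 contingency table from the two lists and a background gene count, and computes a sample odds ratio and a one-sided Fisher's exact (hypergeometric) p value. The function name, gene identifiers, and background size are hypothetical. The R implementation reports a conditional maximum likelihood estimate of the odds ratio, so its values may differ from the sample odds ratio computed here.

```python
from math import comb


def overlap_test(list_a, list_b, genome_size):
    """Return (sample odds ratio, one-sided p value) for the overlap of two gene lists."""
    a, b = set(list_a), set(list_b)
    both = len(a & b)
    only_a = len(a - b)
    only_b = len(b - a)
    neither = genome_size - both - only_a - only_b
    if neither < 0:
        raise ValueError("genome_size is smaller than the union of the two lists")

    # Sample odds ratio of the 2x2 table; reported as infinite when an
    # off-diagonal cell is empty (a simplification for this sketch).
    if only_a * only_b == 0:
        odds_ratio = float("inf")
    else:
        odds_ratio = (both * neither) / (only_a * only_b)

    # P(overlap >= observed) when len(b) genes are drawn at random from the background.
    n_a, n_b = len(a), len(b)
    tail = sum(
        comb(n_a, k) * comb(genome_size - n_a, n_b - k)
        for k in range(both, min(n_a, n_b) + 1)
    )
    p_value = tail / comb(genome_size, n_b)
    return odds_ratio, p_value


# Hypothetical gene lists against a background of 20 genes.
genes_a = ["g1", "g2", "g3", "g4", "g5"]
genes_b = ["g3", "g4", "g5", "g6", "g7", "g8"]
odds_ratio, p_value = overlap_test(genes_a, genes_b, genome_size=20)
```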

Logistic regression modeling was performed as follows. SAS 9.4 (Cary, NC) was used for stepwise logistic regression. GSVA enrichment scores greater or less than healthy control averages plus or minus one standard deviation were determined, and SLE patients were assigned a 1 or 0 based on having a signature greater than or less (Low) than HC, respectively. These scores were used as 34 dependent binary variables to be modeled individually as the outcome variable to 17 independent categorical (e.g., binary) variables, including ancestry (AA, EA, and NAA), drugs (corticosteroid drugs, antimalarial drugs, NSAID drugs, Azathioprine, Methotrexate, Mycophenolate mofetil), and SLE manifestations (rash, arthritis, mucosal ulcers, vasculitis, thrombocytopenia, anti-dsDNA, Low C3, and Low C4). Spearman correlation coefficients were determined between variables, followed by stepwise linear regression, in order to determine if groups were too similar to give independent information to the model. Further, odds ratios, p values, and confidence intervals were determined. Immunosuppressive as a general category was removed since it had a Spearman correlation greater than 0.4 compared to MTX and MMF. The stepwise approach was used to produce the statistically significant model. The results of any model that violated the Hosmer-Lemeshow test were discarded.
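As a non-limiting illustration, the thresholding of GSVA enrichment scores against healthy controls may be sketched in Python as follows. The exact encoding of the binary outcome variables is not detailed above; the sketch assumes that each signature yields a "high" flag (score above the control mean plus one standard deviation) and a "low" flag (score below the control mean minus one standard deviation), and that the sample standard deviation is used. The function name and score values are hypothetical.

```python
from statistics import mean, stdev


def signature_flags(score, control_scores):
    """Return (high, low) binary flags for one patient and one gene signature.

    Assumption: high = 1 when the score exceeds the healthy control mean
    plus one sample standard deviation; low = 1 when it falls below the
    mean minus one sample standard deviation.
    """
    hc_mean = mean(control_scores)
    hc_sd = stdev(control_scores)
    high = 1 if score > hc_mean + hc_sd else 0
    low = 1 if score < hc_mean - hc_sd else 0
    return high, low


# Hypothetical GSVA scores for healthy controls and three patients.
controls = [-0.2, -0.1, 0.0, 0.1, 0.2]
flags_high = signature_flags(0.5, controls)
flags_low = signature_flags(-0.5, controls)
flags_within = signature_flags(0.05, controls)
```

Each flag could then serve as the binary outcome of a separate logistic regression model.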

CIRCOS analysis was performed as follows. CIRCOS (V0.69.3) software was used to visualize the odds ratios determined by stepwise logistic regression analysis. Odds ratio values are non-negative, and a change from an odds ratio of 0.5 to 0.25 is the same relative change as that between 2.0 and 4.0. For representative visualization, odds ratios between 0 and 1 were converted to the 1/X value, where X is an odds ratio between 0 and 1.
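As a non-limiting illustration, the conversion of odds ratios for visualization may be sketched in Python as follows. The function name is hypothetical, and the handling of an odds ratio of exactly 0, whose reciprocal is undefined, is an assumption of the sketch.

```python
def display_odds_ratio(odds_ratio):
    """Convert an odds ratio between 0 and 1 to its reciprocal (1/X).

    Odds ratios of 1 or more are returned unchanged. An odds ratio of
    exactly 0 is also returned unchanged (assumption of this sketch).
    """
    if odds_ratio < 0:
        raise ValueError("odds ratios are non-negative")
    if 0 < odds_ratio < 1:
        return 1.0 / odds_ratio
    return odds_ratio


# After conversion, 0.5 and 0.25 sit on the same scale as 2.0 and 4.0.
converted = [display_odds_ratio(x) for x in (0.5, 0.25, 2.0, 4.0)]
```

Because the converted value no longer indicates the direction of the association, the direction would need to be tracked separately, for example by color in the plot.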

Statistical analysis was performed as follows. GraphPad PRISM 7 version 7.0c was used to calculate or perform mean, median, mode, standard deviation, ANOVA, Tukey's multiple comparisons test, Sidak's multiple comparisons test, linear regression analysis, and unpaired t-test with Welch's correction. Fisher's exact test was performed in R.

Data availability was as follows. All microarray datasets in this publication are available on the NCBI's database Gene Expression Omnibus (GEO) (www.ncbi.nlm.nih.gov/geo/).

Code availability was as follows. All software used to produce results described in this example is open source and freely available for R. Additionally, example code used to produce results described in this example for LIMMA, GSVA, and WGCNA is available at figshare (www.figshare.com). File names are “AMPEL BioSolutions LIMMA Differential Expression Analysis Code”, “AMPEL BioSolutions Gene Set Variation Analysis Code”, and “AMPEL BioSolutions Weighted Correlation Network Analysis WGCNA Code”.

Example 7: Ancestry Influences the Gene Expression Profile in Systemic Lupus Erythematosus (SLE) and Contributes to Gene Expression Heterogeneity in Lupus Patients

Systemic Lupus Erythematosus (SLE) is a complex autoimmune disease with both sex and ancestral bias. Gene expression analysis has revealed complex heterogeneity between SLE patients, making deconvolution of the data difficult and delineation of the impact of different disease drivers uncertain. Therefore, the individual contributions of ancestry, gender, and medications to gene expression heterogeneity were assessed. Further, the association of gene expression profiles with various SLE manifestations was determined.

Bulk Differential Expression (DE) analysis and Gene Set Variation Analysis (GSVA) were carried out on 1,903 SLE patients of African (AA), European (EA), and Native American (NAA) ancestry. Modules of genes defined by co-expression in patients and representing either functional or cell specific groups were used to determine the relationship between drugs, SLE manifestations and individual patient gene expression. Logistic regression analysis was used to understand the relative contribution of ancestry, drugs and SLE manifestations to gene expression signatures.

Gene expression analysis between female disease-matched SLE patients of AA, EA, and NAA ancestry revealed thousands of DE transcripts between ancestries, but none within a single ancestry. AA, EA, and NAA SLE patients had significantly different cellular contributions to gene expression, and these differences were related to significantly different percentages of patients in each ancestry with specific signatures. GSVA showed an increase in plasma cells, B cells, and T cells in the majority of AA SLE patients, and an increase in myeloid cells in most EA and NAA SLE patients. Corticosteroid drugs and immunosuppressive drugs significantly changed gene expression and contributed to the disparate signatures between and within ancestry groups. Anti-dsDNA autoantibodies and low complement, but not other clinical features of SLE, were significantly associated with gene expression in AA, EA, and NAA SLE patients. Despite the impact of medications, ancestry made a significant contribution to gene expression profiles. Notably, differences between AA and EA SLE patients were observed to be similar to those between healthy people of these ancestry groups, and there were fewer differences between males and females of the same ancestry than between ancestry groups.

FIG. 36 shows that gene expression is affected by ancestry, SLE autoantibodies, and standard-of-care (SOC) drugs. Average differences in GSVA enrichment scores are shown for healthy subjects. Average GSVA enrichment scores are shown for lupus (SLE) patients. Combinations of different ancestries, specific medications, and autoantibody production are associated with gene expression profiles (FIG. 36). Importantly, ancestry contributes unique features of gene expression, indicating differences in the molecular basis of SLE in these populations. Understanding the contributions of the gene expression signature components may permit a better interpretation of the signatures and their relationship to disease status.

Example 8: Analysis of Discoid Lupus Erythematosus (DLE) Gene Expression Reveals Dysregulation of Pathogenic Pathways Associated with Infiltrating Immune/Inflammatory Cells

Discoid lupus erythematosus (DLE) is a chronic, scarring inflammatory autoimmune disease of the skin. The precise molecular pathways underlying DLE pathogenesis have not been fully delineated. To obtain a more complete view of the pathologic processes involved in DLE, a comprehensive analysis of gene expression profiles from DLE affected skin was performed.

Microarray gene expression data was obtained from skin biopsy samples of three studies (GSE81071, GSE72535, and GSE52471). Differentially expressed genes (DEGs) between DLE and control were identified by LIMMA analysis. Weighted gene co-expression network analysis (WGCNA) yielded modules of co-expressed genes. Modules correlating to clinical data were prioritized. Correlated modules were interrogated for statistical enrichment of immune and non-immune cell type specific gene signatures. Genes were functionally characterized using a curated immune-specific gene functional category database (BIG-C) and pathways elucidated using IPA®. Queries of a perturbation database (LINCS, Library of Integrated Network-Based Cellular Signatures) were used to identify drugs that could reverse the altered gene expression patterns in DLE.

For each dataset, between 7 and 12 WGCNA modules had significant correlations to disease. Significant WGCNA module preservation was observed between all three datasets. Non-immune cell types (fibroblasts, keratinocytes, melanocytes) and also Langerhans cells were represented in WGCNA modules negatively correlated with disease. An immune cell signature was observed in WGCNA modules positively correlated to DLE, including DCs, myeloid cells, CD4+ and CD8+ T cells, NK cells, B cells as well as pre- and post-switch plasma cells (PCs). The presence of both Ig−κ and −λ as well as multiple VL genes suggests the presence of polyclonal PCs. Chemokines that mediate lymphocyte organization and/or recruitment into the skin were identified, including CCL5, CCL7, CCL8, CXCL9, CXCL10, and CXCL13. Cytokines (TNF, IFNγ, IFNα, IL1β, IL2, IL6, IL12, IL17, IL23, and IL27), signaling molecules (CD40L, PI3K, and mTOR) and transcription factors (NF-κB, NF-AT), as well as cellular proliferation, were evident. IPA® UPR analysis indicated that many of the expressed genes may be secondary to signaling by TNF, IFNγ, IFNα, CD40L, IL1β, IL2, IL6, IL12, IL17, IL23, and IL27. Interestingly, connectivity analysis using LINCS/CLUE identified high-priority drug targets, such as IKZF1/3 (lenalidomide, CC-220), JAK1/2 (ruxolitinib), and HDAC6 (ricolinostat), that may be viable options for therapeutic intervention.

Bioinformatic analysis of DLE gene expression has elucidated many dysregulated signaling pathways potentially involved in the pathogenesis of DLE that may be targeted by novel therapeutic strategies. Further investigation of these signatures may provide an enhanced understanding of the pathogenesis of DLE.

Example 9: Analysis of Gene Expression from Systemic Lupus Erythematosus (SLE) Synovium Reveals Unique Pathogenic Mechanisms

Arthritis is a common manifestation of systemic lupus erythematosus (SLE), and the efficacy of a new lupus therapy for a given SLE patient often depends on its ability to suppress joint inflammation. Despite this, an understanding of the underlying pathogenic mechanisms driving lupus synovitis remains incomplete. Therefore, gene expression profiles of SLE synovium were interrogated to gain insight into the nature of joint inflammation in lupus arthritis.

Biopsied knee synovia from SLE and osteoarthritis (OA) patients were analyzed for differentially expressed genes (DEGs) and also by Weighted Gene Co-expression Network Analysis (WGCNA) to determine similarities and differences between gene profiles and to identify modules of highly co-expressed genes that correlated with clinical features of lupus arthritis. DEGs and correlated modules were interrogated for statistical enrichment of immune and non-immune cell type-specific signatures and validated by Gene Set Variation Analysis (GSVA). Genes were functionally characterized using BIG-C and canonical pathways and upstream regulators operative in lupus synovitis were predicted by IPA®.

DEGs upregulated in lupus arthritis revealed enrichment of numerous immune and inflammatory cell types dominated by a myeloid phenotype, whereas downregulated genes were characteristic of fibroblasts. WGCNA revealed 7 modules of co-expressed genes significantly correlated to lupus arthritis or disease activity (e.g., as indicated by SLEDAI or anti-dsDNA titer). Functional characterization of both DEGs and WGCNA modules by BIG-C analysis revealed consistent co-expression of immune signaling molecules and immune cell surface markers, pattern recognition receptors (PRRs), antigen presentation, and interferon stimulated genes. Although DEGs were predominantly enriched in myeloid cell transcripts, WGCNA also revealed enrichment of activated T cells, B cells, CD8 T cells, NK cells, and plasma cells/plasmablasts, indicating an adaptive immune response in lupus arthritis. Th1, Th2, and Th17 cells were not identified by transcriptomic analysis, although IPA® analysis predicted signaling by the Th1 pathway and numerous innate immune signaling pathways were verified by GSVA. IPA® additionally predicted inflammatory cytokines TNF, CD40L, IFNα, IFNβ, IFNγ, IL27, IL1, IL12, and IL15 as active upstream regulators of the lupus arthritis gene expression profile, in addition to the PRRs IRF7, IRF3, TLR7, TICAM1, IRF4, IRF5, TLR9, TLR4, and TLR3. Analysis of chemokine receptor-ligand pairs, adhesion molecules, germinal center (GC) markers, and T follicular helper (Tfh) cell markers indicated trafficking of immune cell populations into the synovium by chemokine signaling, but not in situ generation of fully-formed GCs. GSVA confirmed activation of both myeloid and lymphoid cell types and inflammatory signaling pathways in lupus arthritis, whereas OA was characterized by tissue repair and damage.

Bioinformatic analysis of lupus arthritis revealed a pattern of immunopathogenesis in which myeloid cell-mediated inflammation dominates, leading to further recruitment of adaptive immune cells that contribute to the ongoing inflammatory synovitis.

Example 10: Transcriptomic Meta-Analysis of Lupus-Affected Tissues Reveals Shared Immune, Metabolic, and Biochemical Dysregulation

Systemic lupus erythematosus (SLE) affects various organs and tissues, but whether pathologic processes in each organ are distinct or whether dysregulated molecular functions are found in common in all tissues may be unknown. Therefore, a meta-analysis of gene expression profiles in four affected SLE tissues was performed to identify commonly dysregulated pathways.

Gene expression datasets for discoid lupus erythematosus (DLE), lupus arthritis (LA), lupus nephritis (LN) glomerulus (Glom), and LN tubulointerstitium (TI) were obtained from GEO. Differentially expressed genes (DEGs) were identified by LIMMA analysis for each dataset. DEGs from each tissue were analyzed with a multi-pronged bioinformatics approach to elucidate common immune cell infiltrates and common functional categories. These findings were then utilized to form modules of co-expressed genes to determine their enrichment using Gene Set Variation Analysis (GSVA).

All tissues demonstrated the presence of immune cells with the fewest immune cell transcripts in LN TI. Analysis of bulk gene expression revealed enrichment of antigen presenting cells (APCs), monocytes, and myeloid cells in all four tissues. Notably, enrichment of B cells, plasma cells, germinal center (GC) B cells, and CD8 T cells was only detected in DLE and LA. All four tissues demonstrated upregulated immune activity, including interferon-stimulated genes, pattern recognition receptors (PRRs), and antigen presentation (MHC Class II). Pro-apoptosis genes were also found enriched in DLE, LN Glom, and LN TI. A generalized decrease in biochemical processes was found in all four tissues, and a specific decrease in both fatty acid biosynthesis and the tricarboxylic acid cycle was found in DLE and LN. Ingenuity Pathway Analysis (IPA®) further confirmed the activation of Dendritic Cell Maturation, Interferon, NFAT Regulation of Immune Response, PRRs, and TH1 signaling pathways in all four tissues. Additionally, IPA demonstrated cholesterol biosynthesis was decreased in all tissues except LA.

To confirm that the aforementioned cellular infiltrates and aberrant pathways, as well as additional pathways, were operative in individual SLE tissues, GSVA was used to analyze enrichment of gene modules in patient samples. As shown in Table 18 and FIGS. 37-38, specific abnormalities were found in the majority of tissues, including enrichment of myeloid cells/monocytes, APCs, and GC B cells, whereas others were observed in some but not all tissues.

TABLE 18

Percentages of SLE tissue samples with GSVA enrichment of specific immune cell modules

Immune cell module | DLE | LA | LN Glom | LN TI
Antigen Presenting Cell | 66.67% | 75.00% | 77.27% | 63.64%
Monocyte | 88.89% | 100.00% | 95.45% | 59.09%
Myeloid Cell | 77.78% | 100.00% | 81.82% | 68.18%
Germinal Center B Cell | 77.78% | 100.00% | 54.55% | 77.27%
Plasma Cell | 88.89% | 75.00% | 50.00% | 45.45%

FIG. 37 contains plots showing that GSVA demonstrates metabolic dysregulation in individual SLE affected tissues. GSVA enrichment scores were calculated for (A) glycolysis, (B) pentose phosphate, (C) tricarboxylic acid cycle (TCA), (D) oxidative phosphorylation, (E) fatty acid beta oxidation, and (F) cholesterol biosynthesis modules in DLE, LA, LN Glom, and LN TI. Significant enrichment of tissue control to SLE affected tissue or SLE affected tissue to tissue control was determined using Welch's t-test. The red bar represents enrichment of SLE tissue over control, and the blue bar represents enrichment of tissue control over SLE tissue. # p<0.1, * p<0.05, ** p<0.01, *** p<0.001, **** p<0.0001.
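As a non-limiting illustration, the Welch's t statistic and the significance annotation used in FIG. 37 may be sketched in Python as follows. The sketch computes the t statistic and the Welch-Satterthwaite degrees of freedom only; obtaining the p value requires the t distribution and is left to a statistics package. The function names and the sample values are hypothetical, and the "ns" label for p values of 0.1 or more is an assumption not stated in the figure legend.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard error of each group mean (unequal variances allowed).
    se2_a = variance(sample_a) / n_a
    se2_b = variance(sample_b) / n_b
    t_stat = (mean(sample_a) - mean(sample_b)) / sqrt(se2_a + se2_b)
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t_stat, df


def significance_label(p_value):
    """Map a p value to the annotation symbols listed in the figure legend."""
    cutoffs = ((0.0001, "****"), (0.001, "***"), (0.01, "**"), (0.05, "*"), (0.1, "#"))
    for cutoff, label in cutoffs:
        if p_value < cutoff:
            return label
    return "ns"  # assumption: label for results that meet no cutoff


# Hypothetical GSVA enrichment scores for control and SLE affected tissue.
control_scores = [1.0, 2.0, 3.0, 4.0, 5.0]
sle_scores = [2.0, 4.0, 6.0, 8.0, 10.0]
t_stat, df = welch_t(control_scores, sle_scores)
```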