In certain medical fields, for example the areas of cancer research and treatment, voluminous amounts of data may be generated and collected for each patient. This data may include demographic information, such as the patient's age, gender, height, weight, smoking history, geographic location, and other, non-medical information. The data also may include clinical components, such as tumor type, location, size, and stage, as well as treatment data including medications, dosages, treatment therapies, mortality rates, and other outcome/response data. Moreover, more advanced analysis also may include genomic information about the patient and/or tumor, including genetic markers, mutations, as well as other information from fields including proteome, transcriptome, epigenome, metabolome, microbiome, and other multi-omic fields.
Despite this wealth of data, there is a dearth of meaningful ways to compile and analyze the data quickly, efficiently, and comprehensively.
Thus what are needed are a user interface, system, and method that overcome one or more of these challenges.
In one aspect, a system and user interface are provided to predict an expected response of a particular patient population or cohort when provided with a certain treatment. In order to accomplish those predictions, the system uses a pre-existing dataset to define a sample patient population, or “cohort,” and identifies one or more key inflection points in the distribution of patients exhibiting each attribute of interest in the cohort, relative to a general patient population distribution, thereby targeting the prediction of expected survival and/or response for a particular patient population.
The system described herein facilitates the discovery of insights of therapeutic significance through the automated analysis of patterns occurring in patient clinical, molecular, phenotypic, and response data, and enables further exploration via a fully integrated, reactive user interface.
In one aspect, a method is provided, comprising: receiving, from a user, a patient cohort definition, the patient cohort definition including at least one selection criterion for a feature or combination of features; generating, based on the patient cohort definition, a first patient record query, the first patient record query being configured to identify, within a first data source including a first plurality of patient data records, patient data records meeting the patient cohort definition, the first data source being stored at a first storage, the first data source being inaccessible to the user; querying, using the first patient record query, the first data source to identify a first one or more patient data records of the first plurality of patient data records, the first one or more patient data records meeting the patient cohort definition; generating, based on the patient cohort definition, a second patient record query, the second patient record query being configured to identify, within a second data source including a second plurality of patient data records, patient data records meeting the patient cohort definition, the second data source being stored at a second storage; querying, using the second patient record query, the second data source to identify a second one or more patient data records of the second plurality of patient data records, the second one or more patient data records meeting the patient cohort definition; generating, by a computer including a processor, an interactive user interface, the interactive user interface including information about the first and second one or more patient data records, the information including a total number of patient data records included in the first and second one or more patient data records; receiving, from the user via the interactive user interface, a provisioning input, the provisioning input including a provisioning number that is less than or equal to the total number of patient data records 
included in the first and second one or more patient data records; provisioning, from the one or more patient data records, a set of patient data records, the number of patient data records in the set of patient data records being equal to the provisioning number; writing each patient data record of the set of patient data records into a patient data store; and providing the user access to the patient data store.
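The method recited above can be illustrated with a minimal sketch. The names `DataSource`, `build_query`, and `provision` are hypothetical, the in-memory lists merely stand in for the first and second storages, and a single predicate is reused for both sources although the method generates a separate query for each. Random selection of the provisioned records is an assumption, since the text does not specify how the set is chosen.

```python
import random
from dataclasses import dataclass


@dataclass
class DataSource:
    """Hypothetical stand-in for one storage holding patient data records."""
    name: str
    records: list

    def query(self, predicate):
        # Return only the records that satisfy the cohort predicate.
        return [r for r in self.records if predicate(r)]


def build_query(cohort_definition):
    """Turn {feature: allowed values} selection criteria into a record predicate."""
    def predicate(record):
        return all(record.get(feature) in allowed
                   for feature, allowed in cohort_definition.items())
    return predicate


def provision(cohort_definition, sources, provisioning_number, seed=0):
    """Query every source, then copy `provisioning_number` matches into a new store."""
    predicate = build_query(cohort_definition)
    matches = [r for source in sources for r in source.query(predicate)]
    total = len(matches)  # the total reported to the user before provisioning
    if not 0 <= provisioning_number <= total:
        raise ValueError("provisioning number must not exceed the total match count")
    # Assumption: records are chosen at random; the text does not say how.
    chosen = random.Random(seed).sample(matches, provisioning_number)
    patient_data_store = [dict(r) for r in chosen]  # copies the user may access
    return total, patient_data_store


first = DataSource("first", [{"id": 1, "cancer": "lung", "stage": "IV"},
                             {"id": 2, "cancer": "breast", "stage": "II"}])
second = DataSource("second", [{"id": 3, "cancer": "lung", "stage": "IV"},
                               {"id": 4, "cancer": "lung", "stage": "I"}])
total, store = provision({"cancer": {"lung"}, "stage": {"IV"}}, [first, second], 1)
```

In this sketch the user only ever receives the match count and the provisioned copies, which mirrors the recited arrangement in which the first data source itself remains inaccessible to the user.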
In one embodiment the invention provides a system, comprising: a computer including a processing device, the processing device configured to: receive, from a user, a patient cohort definition, the patient cohort definition including at least one selection criterion for a feature or combination of features; generate, based on the patient cohort definition, a first patient record query, the first patient record query being configured to identify, within a first data source including a first plurality of patient data records, patient data records meeting the patient cohort definition, the first data source being stored at a first storage that is inaccessible to the user; query, using the first patient record query, the first data source to identify a first one or more patient data records of the first plurality of patient data records, the first one or more patient data records meeting the patient cohort definition; generate, based on the patient cohort definition, a second patient record query, the second patient record query being configured to identify, within a second data source including a second plurality of patient data records, patient data records meeting the patient cohort definition; query, using the second patient record query, the second data source to identify a second one or more patient data records of the second plurality of patient data records, the second one or more patient data records meeting the patient cohort definition; output, to a display, an interactive user interface, the interactive user interface including information about the first and second one or more patient data records, the information including a total number of patient data records included in the first and second one or more patient data records; receive, via the interactive user interface, a provisioning input, the provisioning input including a provisioning number that is less than or equal to the total number of patient data records included in the first and second one or 
more patient data records; provision, from the one or more patient data records, a set of patient data records, the number of patient data records in the set of patient data records being equal to the provisioning number; write each patient data record of the set of patient data records into a patient data store; and provide the user access to the patient data store.
In one embodiment the invention provides a non-transitory computer-readable storage medium having stored thereon program code instructions that, when executed by a processor, cause the processor to: receive, from a user, a patient cohort definition, the patient cohort definition including at least one selection criterion for a feature or combination of features; generate, based on the patient cohort definition, a first patient record query, the first patient record query being configured to identify, within a first data source including a first plurality of patient data records, patient data records meeting the patient cohort definition, the first data source being stored on a first storage that is inaccessible to the user; query, using the first patient record query, the first data source to identify a first one or more patient data records of the first plurality of patient data records, the first one or more patient data records meeting the patient cohort definition; generate, based on the patient cohort definition, a second patient record query, the second patient record query being configured to identify, within a second data source including a second plurality of patient data records, patient data records meeting the patient cohort definition; query, using the second patient record query, the second data source to identify a second one or more patient data records of the second plurality of patient data records, the second one or more patient data records meeting the patient cohort definition; output, to a display, an interactive user interface, the interactive user interface including information about the first and second one or more patient data records, the information including a total number of patient data records included in the first and second one or more patient data records; receive, via the interactive user interface, a provisioning input, the provisioning input including a provisioning number that is less than or equal to the total number of 
patient data records included in the first and second one or more patient data records; provision, from the one or more patient data records, a set of patient data records, the number of patient data records in the set of patient data records being equal to the provisioning number; write each patient data record of the set of patient data records into a patient data store; and provide the user access to the patient data store.
The foregoing and other aspects and advantages of the invention will appear from the following description. In the description, reference is made to the accompanying drawings which form a part hereof, and in which there is shown by way of illustration preferred embodiments of the invention. Such embodiments do not necessarily represent the full scope of the invention, however, and reference is made therefore to the claims herein for interpreting the scope of the invention.
Further objects, features and advantages of the present disclosure will become apparent from the following detailed description taken in conjunction with the accompanying figures showing illustrative embodiments of the present disclosure, in which:
With reference to the accompanying figures, and particularly with reference to
The interactive analysis portal 22 may include a plurality of user interfaces including an interactive cohort selection filtering interface 24 that, as discussed in greater detail below, permits a user to query and filter elements of the data store 14. As discussed in greater detail below, the portal 22 also may include a cohort funnel and population analysis interface 26, a patient timeline analysis user interface 28, a patient survival analysis user interface 30, and a patient event likelihood analysis user interface 32. The portal 22 further may include a patient next analysis user interface 34 and one or more patient future analysis user interfaces 36.
Returning to
The patient data store 14 may be a pre-existing dataset that includes patient clinical history, such as demographics, comorbidities, diagnoses and recurrences, medications, surgeries, and other treatments along with their response and adverse effects details. The patient data store 14 may also include patient genetic/molecular sequencing and genetic mutation details relating to the patient, as well as organoid modeling results. In one aspect, these datasets may be generated from one or more sources. For example, institutions implementing the system may be able to draw from all of their records, such that all records from all doctors and/or patients connected with the institution may be available to the institution's agents, physicians, researchers, or other authorized members. Similarly, doctors may be able to draw from all of their records, such as records for all of their patients. Alternatively, certain system users may be able to buy or license access to the datasets, such as when those users do not have immediate access to a sufficiently robust dataset, when those users are looking for even more records, and/or when those users are looking for specific data types, such as data reflecting patients having certain primary cancers, metastases by origin site and/or diagnosis site, recurrences by origin, metastases, or diagnosis sites, etc.
Features and Feature Modules
A patient data store may include one or more feature modules which may comprise a collection of features available for every patient in the system 10. These features may be used to generate and model the artificial intelligence classifiers in the system 10. While feature scope across all patients is informationally dense, a patient's feature set may be sparsely populated across the entirety of the collective feature scope of all features across all patients. For example, the feature scope across all patients may expand into the tens of thousands of features while a patient's unique feature set may only include a subset of hundreds or thousands of the collective feature scope based upon the records available for that patient.
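The sparsity described above can be sketched as follows. This is a minimal illustration, assuming each patient's features are held as a dictionary keyed by feature name; the feature names, the scope size of 20,000, and both helper functions are hypothetical.

```python
def feature_density(patient_features, feature_scope):
    """Fraction of the collective feature scope that is populated for one patient."""
    populated = set(patient_features) & set(feature_scope)
    return len(populated) / len(feature_scope)


def to_dense_row(patient_features, feature_order, missing=None):
    """Expand a sparse patient feature set into a fixed-order row for a classifier."""
    return [patient_features.get(name, missing) for name in feature_order]


# Hypothetical scope of 20,000 features shared across all patients.
feature_scope = [f"feature_{i}" for i in range(20000)]
# One patient with only 500 of those features populated.
patient = {f"feature_{i}": 1.0 for i in range(0, 20000, 40)}

density = feature_density(patient, feature_scope)
row = to_dense_row(patient, feature_scope)
```

Storing only the populated features keeps per-patient records small, while the dense row is materialized only when a model requires a fixed-width input.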
Feature collections may include a diverse set of fields available within patient health records. Clinical information may be based upon fields which have been entered into an electronic medical record (EMR) or an electronic health record (EHR) by a physician, nurse, or other medical professional or representative. Other clinical information may be curated from other sources, such as molecular fields from genetic sequencing reports. Sequencing may include next-generation sequencing (NGS) and may be long-read, short-read, or other forms of sequencing a patient's somatic and/or normal genome. A comprehensive collection of features in additional feature modules may combine a variety of features together across varying fields of medicine which may include diagnoses, responses to treatment regimens, genetic profiles, clinical and phenotypic characteristics, and/or other medical, geographic, demographic, clinical, molecular, or genetic features. For example, a subset of features may comprise molecular data features, such as features derived from sequencing in an RNA feature module or a DNA feature module.
Another subset of features, imaging features from an imaging feature module, may comprise features identified through pathologist review of a specimen, such as a review of stained H&E or IHC slides. As another example, a subset of features may comprise derivative features obtained from the analysis of the individual and combined results of such feature sets. Features derived from DNA and RNA sequencing may include genetic variants from a variant science module which are present in the sequenced tissue. Further analysis of the genetic variants may include additional steps such as identifying single or multiple nucleotide polymorphisms, identifying whether a variation is an insertion or deletion event, identifying loss or gain of function, identifying fusions, calculating copy number variation, calculating microsatellite instability, calculating tumor mutational burden, or identifying other structural variations within the DNA and RNA. Analysis of slides for H&E staining or IHC staining may reveal features such as tumor infiltration, programmed death-ligand 1 (PD-L1) status, human leukocyte antigen (HLA) status, or other immunology features.
Features derived from structured, curated, or electronic medical or health records may include clinical features such as diagnosis, symptoms, therapies, and outcomes; patient demographics such as patient name, date of birth, gender, ethnicity, date of death, address, smoking status, diagnosis dates for cancer, illness, disease, diabetes, depression, or other physical or mental maladies, personal medical history, and family medical history; clinical diagnoses such as date of initial diagnosis, date of metastatic diagnosis, cancer staging, tumor characterization, and tissue of origin; treatments and outcomes such as line of therapy, therapy groups, clinical trials, medications prescribed or taken, surgeries, radiotherapy, imaging, adverse effects, and associated outcomes; genetic testing and laboratory information such as performance scores, lab tests, pathology results, prognostic indicators, date of genetic testing, testing provider used, testing method used (such as genetic sequencing method or gene panel), and gene results (such as included genes, variants, and expression levels/statuses); or corresponding dates to any of the above.
Features may be derived from information from additional medical or research-based omics fields including proteome, transcriptome, epigenome, metabolome, microbiome, and other multi-omic fields. Features derived from an organoid modeling lab may include the DNA and RNA sequencing information germane to each organoid and results from treatments applied to those organoids. Features derived from imaging data may further include reports associated with a stained slide, size of tumor, tumor size differentials over time including treatments during the period of change, as well as machine learning approaches for classifying PD-L1 status, HLA status, or other characteristics from imaging data. Other features may include additional derivative feature sets from other machine learning approaches based at least in part on combinations of any new features and/or those listed above. For example, imaging results may need to be combined with MSI calculations derived from RNA expressions to determine additional imaging features. In another example, a machine learning model may generate a likelihood that a patient's cancer will metastasize to a particular organ or a patient's future probability of metastasis to yet another organ in the body. Other features that may be extracted from medical information may also be used. There are many thousands of features, and the above listing of types of features is merely representative and should not be construed as a complete listing of features.
An alteration module may be one or more microservices, servers, scripts, or other executable algorithms which generate alteration features associated with de-identified patient features from the feature collection. Alteration modules may retrieve inputs from the feature collection and may provide alterations for storage. Exemplary alteration modules may include one or more of the following as a collection of alteration modules.
A SNP (single-nucleotide polymorphism) module may identify a substitution of a single nucleotide that occurs at a specific position in the genome, where each variation is present to some appreciable degree within a population (e.g., >1%). For example, at a specific base position, or locus, in the human genome, the C nucleotide may appear in most individuals, but in a minority of individuals, the position is occupied by an A. This means that there is a SNP at this specific position and the two possible nucleotide variations, C or A, are said to be alleles for this position. SNPs underlie differences in our susceptibility to a wide range of diseases (e.g., sickle-cell anemia, β-thalassemia, and cystic fibrosis result from SNPs). The severity of illness and the way the body responds to treatments are also manifestations of genetic variations. For example, a single-base mutation in the APOE (apolipoprotein E) gene is associated with a lower risk for Alzheimer's disease. A single-nucleotide variant (SNV) is a variation in a single nucleotide without any limitations of frequency and may arise in somatic cells. A somatic single-nucleotide variation (e.g., caused by cancer) may also be called a single-nucleotide alteration.
An MNP (multiple-nucleotide polymorphism) module may identify the substitution of consecutive nucleotides at a specific position in the genome.
An InDels module may identify an insertion or deletion of bases in the genome of an organism classified among small genetic variations. While indels usually measure from 1 to 10,000 base pairs in length, a microindel is defined as an indel that results in a net change of 1 to 50 nucleotides. Indels can be contrasted with a SNP or point mutation. An indel inserts or deletes nucleotides from a sequence, while a point mutation is a form of substitution that replaces one of the nucleotides without changing the overall number in the DNA. Indels, being either insertions or deletions, can be used as genetic markers in natural populations, especially in phylogenetic studies. Indel frequency tends to be markedly lower than that of single nucleotide polymorphisms (SNP), except near highly repetitive regions, including homopolymers and microsatellites.
An MSI (microsatellite instability) module may identify genetic hypermutability (predisposition to mutation) that results from impaired DNA mismatch repair (MMR). The presence of MSI represents phenotypic evidence that MMR is not functioning normally. MMR corrects errors that spontaneously occur during DNA replication, such as single base mismatches or short insertions and deletions. The proteins involved in MMR correct polymerase errors by forming a complex that binds to the mismatched section of DNA, excises the error, and inserts the correct sequence in its place. Cells with abnormally functioning MMR are unable to correct errors that occur during DNA replication and consequently accumulate errors. This causes the creation of novel microsatellite fragments. Polymerase chain reaction-based assays can reveal these novel microsatellites and provide evidence for the presence of MSI. Microsatellites are repeated sequences of DNA. These sequences can be made of repeating units of one to six base pairs in length. Although the length of these microsatellites is highly variable from person to person and contributes to the individual DNA “fingerprint”, each individual has microsatellites of a set length. The most common microsatellite in humans is a dinucleotide repeat of the nucleotides C and A, which occurs tens of thousands of times across the genome. Microsatellites are also known as simple sequence repeats (SSRs).
A TMB (tumor mutational burden) module may identify a measurement of mutations carried by tumor cells. TMB is a predictive biomarker being studied to evaluate its association with response to immuno-oncology (I-O) therapy. Tumor cells with high TMB may have more neoantigens, with an associated increase in cancer-fighting T cells in the tumor microenvironment and periphery. These neoantigens can be recognized by T cells, inciting an anti-tumor response. TMB has emerged more recently as a quantitative marker that can help predict potential responses to immunotherapies across different cancers, including melanoma, lung cancer, and bladder cancer. TMB is defined as the total number of mutations per coding area of a tumor genome. Importantly, TMB is consistently reproducible. It provides a quantitative measure that can be used to better inform treatment decisions, such as selection of targeted or immunotherapies or enrollment in clinical trials.
A CNV (copy number variation) module may identify deviations from the normal genome and any subsequent implications from analyzing genes, variants, alleles, or sequences of nucleotides. Copy number variation is the phenomenon in which structural variations may occur in sections of nucleotides, or base pairs, that include repetitions, deletions, or inversions.
A Fusions module may identify hybrid genes formed from two previously separate genes. A fusion can occur as a result of translocation, interstitial deletion, or chromosomal inversion. Gene fusion plays an important role in tumorigenesis. Fusion genes can contribute to tumor formation because fusion genes can produce much more active abnormal protein than non-fusion genes. Often, fusion genes are oncogenes that cause cancer; these include BCR-ABL, TEL-AML1 (ALL with t(12;21)), AML1-ETO (M2 AML with t(8;21)), and TMPRSS2-ERG with an interstitial deletion on chromosome 21, often occurring in prostate cancer. In the case of TMPRSS2-ERG, by disrupting androgen receptor (AR) signaling and inhibiting AR expression by oncogenic ETS transcription factor, the fusion product regulates the prostate cancer. Most fusion genes are found in hematological cancers, sarcomas, and prostate cancer. BCAM-AKT2 is a fusion gene that is specific and unique to high-grade serous ovarian cancer. Oncogenic fusion genes may lead to a gene product with a new or different function from the two fusion partners. Alternatively, a proto-oncogene is fused to a strong promoter, and thereby the oncogenic function is set to function by an upregulation caused by the strong promoter of the upstream fusion partner. The latter is common in lymphomas, where oncogenes are juxtaposed to the promoters of the immunoglobulin genes. Oncogenic fusion transcripts may also be caused by trans-splicing or read-through events. Since chromosomal translocations play such a significant role in neoplasia, a specialized database of chromosomal aberrations and gene fusions in cancer has been created. This database is called the Mitelman Database of Chromosome Aberrations and Gene Fusions in Cancer.
An IHC (immunohistochemistry) module may identify antigens (proteins) in cells of a tissue section by exploiting the principle of antibodies binding specifically to antigens in biological tissues. IHC staining is widely used in the diagnosis of abnormal cells such as those found in cancerous tumors. Specific molecular markers are characteristic of particular cellular events such as proliferation or cell death (apoptosis). IHC is also widely used in basic research to understand the distribution and localization of biomarkers and differentially expressed proteins in different parts of a biological tissue. Visualizing an antibody-antigen interaction can be accomplished in a number of ways. In the most common instance, an antibody is conjugated to an enzyme, such as peroxidase, that can catalyze a color-producing reaction in immunoperoxidase staining. Alternatively, the antibody can also be tagged to a fluorophore, such as fluorescein or rhodamine, in immunofluorescence. Approximations from RNA expression data, H&E slide imaging data, or other data may be generated.
A Therapies module may identify differences in cancer cells (or other cells near them) that help them grow and thrive and drugs that “target” these differences. Treatment with these drugs is called targeted therapy. For example, many targeted drugs go after the cancer cells' inner “programming” that makes them different from normal, healthy cells, while leaving most healthy cells alone. Targeted drugs may block or turn off chemical signals that tell the cancer cell to grow and divide; change proteins within the cancer cells so the cells die; stop making new blood vessels to feed the cancer cells; trigger the patient's immune system to kill the cancer cells; or carry toxins to the cancer cells to kill them, but not normal cells. Some targeted drugs are more “targeted” than others. Some might target only a single change in cancer cells, while others can affect several different changes. Others boost the way the body fights the cancer cells. This can affect where these drugs work and what side effects they cause. Matching targeted therapies may include identifying the therapy targets in the patients and satisfying any other inclusion or exclusion criteria.
A VUS (variant of unknown significance) module may identify variants which are called but cannot be classified as pathogenic or benign at the time of calling. VUS may be catalogued from publications regarding a VUS to identify if they may be classified as benign or pathogenic.
A Trial module may identify and test hypotheses for treating cancers having specific characteristics by matching features of a patient to clinical trials. These trials have inclusion and exclusion criteria that must be matched to enroll, which may be ingested and structured from publications, trial reports, or other documentation.
An Amplifications module may identify genes which increase in count disproportionately to other genes. Amplifications may cause a gene having the increased count to go dormant, become overactive, or operate in another unexpected fashion. Amplifications may be detected at a gene level, variant level, RNA transcript or expression level, or even a protein level. Detections may be performed across all the different detection mechanisms or levels and validated against one another.
An Isoforms module may identify alternative splicing (AS), the biological process in which more than one mRNA (isoforms) is generated from the transcript of the same gene through different combinations of exons and introns. It is estimated by large-scale genomics studies that 30-60% of mammalian genes are alternatively spliced. The possible patterns of alternative splicing for a gene can be very complicated and the complexity increases rapidly as the number of introns in a gene increases. In silico alternative splicing prediction may find large insertions or deletions within a set of mRNA sharing a large portion of aligned sequences by identifying genomic loci through searches of mRNA sequences against genomic sequences, extracting sequences for genomic loci and extending the sequences at both ends up to 20 kb, searching the genomic sequences (repeat sequences have been masked), extracting splicing pairs (two boundaries of alignment gap with GT-AG consensus or with more than two expressed sequence tags aligned at both ends of the gap), assembling splicing pairs according to their coordinates, determining gene boundaries (splicing pair predictions are generated to this point), generating predicted gene structures by aligning mRNA sequences to genomic templates, and comparing splicing pair predictions and gene structure predictions to find alternative spliced isoforms.
A Pathways module may identify defects in DNA repair pathways which enable cancer cells to accumulate genomic alterations that contribute to their aggressive phenotype. Cancerous tumors rely on residual DNA repair capacities to survive the damage induced by genotoxic stress, which leads to isolated DNA repair pathways being inactivated in cancer cells. DNA repair pathways are generally thought of as mutually exclusive mechanistic units handling different types of lesions in distinct cell cycle phases. Recent preclinical studies, however, provide strong evidence that multifunctional DNA repair hubs, which are involved in multiple conventional DNA repair pathways, are frequently altered in cancer. Identifying pathways which may be affected may lead to important patient treatment considerations.
A Raw Counts module may identify a count of the variants that are detected from the sequencing data. For DNA, this may be the number of reads from sequencing which correspond to a particular variant in a gene. For RNA, this may be the gene expression counts or the transcriptome counts from sequencing.
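Two of the definitions given for the alteration modules lend themselves to a short sketch: distinguishing substitutions from insertions and deletions, and expressing TMB as a mutation count over the coding area. The function names are hypothetical, alleles are assumed to be given as reference/alternate strings, and normalizing TMB per megabase is an assumption, since the text only says "per coding area".

```python
def classify_small_variant(ref, alt):
    """Label a variant by comparing its reference and alternate alleles."""
    if not ref or not alt or ref == alt:
        raise ValueError("ref and alt must be non-empty and different")
    if len(ref) == len(alt):
        # Equal length means a substitution: one base (SNV) or several (MNP).
        # Simplification: assumes every base in a multi-base allele differs.
        return "SNV" if len(ref) == 1 else "MNP"
    return "insertion" if len(alt) > len(ref) else "deletion"


def is_microindel(ref, alt):
    """True when the indel's net length change is 1 to 50 nucleotides."""
    net_change = abs(len(alt) - len(ref))
    return 1 <= net_change <= 50


def tumor_mutational_burden(mutation_count, coding_bases_covered):
    """Mutations per megabase of covered coding sequence (unit is an assumption)."""
    if coding_bases_covered <= 0:
        raise ValueError("coding_bases_covered must be positive")
    return mutation_count / (coding_bases_covered / 1_000_000)


labels = [classify_small_variant(r, a)
          for r, a in [("C", "A"), ("CA", "TG"), ("A", "ATG"), ("ATG", "A")]]
tmb = tumor_mutational_burden(30, 1_500_000)
```

Whether a single-nucleotide substitution counts as a SNP rather than an SNV depends on its population frequency, which this sketch does not model.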
Structural variant classification may include evaluating features from the feature collection, alterations from the alteration module, and other classifications from within itself from one or more classification modules. Structural variant classification may provide classifications to a stored classifications storage. In an exemplary classification module, a classification of a CNV as “Reportable” may mean that the CNV has been identified in one or more reference databases as influencing the tumor cancer characterization, disease state, or pharmacogenomics; “Not Reportable” may mean that the CNV has not been identified as such; and “Conflicting Evidence” may mean that the CNV has evidence suggesting both “Reportable” and “Not Reportable.” Furthermore, a classification of therapeutic relevance is similarly ascertained from any reference dataset's mention of a therapy which may be impacted by the detection (or non-detection) of the CNV. Other classifications may include applications of machine learning algorithms, neural networks, regression techniques, graphing techniques, inductive reasoning approaches, or other artificial intelligence evaluations within modules. A classifier for clinical trials may include evaluation of variants identified from the alteration module which have been identified as significant or reportable, evaluation of all clinical trials available to identify inclusion and exclusion criteria, mapping the patient's variants and other information to the inclusion and exclusion criteria, and classifying clinical trials as applicable to the patient or as not applicable to the patient. Similar classifications may be performed for therapies, loss-of-function, gain-of-function, diagnosis, microsatellite instability, tumor mutational burden, indels, SNP, MNP, fusions, and other alterations which may be classified based upon the results of the alteration modules.
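The CNV classification and the clinical trial classifier described above might be sketched as follows. The evidence labels, trial identifiers, record layout, and both function names are hypothetical, and criteria are modeled as simple predicates over a patient record; a real classifier would draw on curated reference databases and structured trial criteria.

```python
def classify_cnv(evidence):
    """Classify a CNV from reference database findings.

    `evidence` is a list of strings, each "reportable" or "not_reportable",
    one per reference database entry that mentions the CNV.
    """
    findings = set(evidence)
    if {"reportable", "not_reportable"} <= findings:
        return "Conflicting Evidence"
    if "reportable" in findings:
        return "Reportable"
    return "Not Reportable"  # includes the case of no evidence at all


def classify_trials(patient, trials):
    """Label each trial as applicable or not applicable to the patient."""
    result = {}
    for trial in trials:
        meets_inclusion = all(check(patient) for check in trial["inclusion"])
        hits_exclusion = any(check(patient) for check in trial["exclusion"])
        applicable = meets_inclusion and not hits_exclusion
        result[trial["id"]] = "applicable" if applicable else "not applicable"
    return result


# Hypothetical patient record and trial criteria, for illustration only.
patient = {"variants": {"EGFR L858R"}, "age": 64, "prior_therapies": {"cisplatin"}}
trials = [
    {"id": "TRIAL-A",
     "inclusion": [lambda p: "EGFR L858R" in p["variants"], lambda p: p["age"] >= 18],
     "exclusion": [lambda p: "osimertinib" in p["prior_therapies"]]},
    {"id": "TRIAL-B",
     "inclusion": [lambda p: "KRAS G12C" in p["variants"]],
     "exclusion": []},
]
trial_labels = classify_trials(patient, trials)
```

Treating exclusion criteria as a veto over otherwise satisfied inclusion criteria is one plausible reading of the mapping step; the text does not prescribe a particular rule.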
Each of the feature collection, alteration module(s), structural variant and feature store may be communicatively coupled to a data bus to transfer data between each module for processing and/or storage. In another embodiment, each of the feature collection, alteration module(s), structural variant and feature store may be communicatively coupled to each other for independent communication without sharing the data bus.
In addition to the above features and enumerated modules, feature modules may further include one or more of the following modules within their respective modules as a sub-module or as a standalone module.
A germline/somatic DNA feature module may comprise a feature collection associated with the DNA-derived information of a patient or a patient's tumor. These features may include raw sequencing results, such as those stored in FASTQ, BAM, VCF, or other sequencing file types known in the art; genes; mutations; variant calls; and variant characterizations. Genomic information from a patient's normal sample may be stored as germline and genomic information from a patient's tumor sample may be stored as somatic.
An RNA feature module may comprise a feature collection associated with the RNA-derived information of a patient, such as transcriptome information. These features may include raw sequencing results, transcriptome expressions, genes, mutations, variant calls, and variant characterizations.
A metadata module may comprise a feature collection associated with the human genome, protein structures and their effects, such as changes in energy stability based on a protein structure.
A clinical module may comprise a feature collection associated with information derived from clinical records of a patient and records from family members of the patient. These may be abstracted from unstructured clinical documents, EMR, EHR, or other sources of patient history. Information may include patient symptoms, diagnosis, treatments, medications, therapies, hospice, responses to treatments, laboratory testing results, medical history, geographic locations of each, demographics, or other features of the patient which may be found in the patient's medical record. Information about treatments, medications, therapies, and the like may be ingested as a recommendation or prescription and/or as a confirmation that such treatments, medications, therapies, and the like were administered or taken.
An imaging module may comprise a feature collection associated with information derived from imaging records of a patient. Imaging records may include H&E slides, IHC slides, radiology images, and other medical imaging which may be ordered by a physician during the course of diagnosis and treatment of various illnesses and diseases. These features may include TMB, ploidy, purity, nuclear-cytoplasmic ratio, large nuclei, cell state alterations, biological pathway activations, hormone receptor alterations, immune cell infiltration, immune biomarkers of MMR, MSI, PDL1, CD3, FOXP3, HRD, PTEN, PIK3CA; collagen or stroma composition, appearance, density, or characteristics; tumor budding, size, aggressiveness, metastasis, immune state, chromatin morphology; and other characteristics of cells, tissues, or tumors for prognostic predictions.
An epigenome module, such as epigenome module from Omics, may comprise a feature collection associated with information derived from DNA modifications which are not changes to the DNA sequence and regulate the gene expression. These modifications are frequently the result of environmental factors based on what the patient may breathe, eat, or drink. These features may include DNA methylation, histone modification, or other factors which deactivate a gene or cause alterations to gene function without altering the sequence of nucleotides in the gene.
A microbiome module, such as microbiome module from Omics, may comprise a feature collection associated with information derived from the viruses and bacteria of a patient. These features may include viral infections which may affect treatment and diagnosis of certain illnesses as well as the bacteria present in the patient's gastrointestinal tract which may affect the efficacy of medicines ingested by the patient.
A proteome module, such as proteome module from Omics, may comprise a feature collection associated with information derived from the proteins produced in the patient. These features may include protein composition, structure, and activity; when and where proteins are expressed; rates of protein production, degradation, and steady-state abundance; how proteins are modified, for example, post-translational modifications such as phosphorylation; the movement of proteins between subcellular compartments; the involvement of proteins in metabolic pathways; how proteins interact with one another; or modifications to the protein after translation from the RNA such as phosphorylation, ubiquitination, methylation, acetylation, glycosylation, oxidation, or nitrosylation.
Additional Omics module(s) may also be included in Omics, such as a feature collection associated with all the different field of omics, including: cognitive genomics, a collection of features comprising the study of the changes in cognitive processes associated with genetic profiles; comparative genomics, a collection of features comprising the study of the relationship of genome structure and function across different biological species or strains; functional genomics, a collection of features comprising the study of gene and protein functions and interactions including transcriptomics; interactomics, a collection of features comprising the study relating to large-scale analyses of gene-gene, protein-protein, or protein-ligand interactions; metagenomics, a collection of features comprising the study of metagenomes such as genetic material recovered directly from environmental samples; neurogenomics, a collection of features comprising the study of genetic influences on the development and function of the nervous system; pangenomics, a collection of features comprising the study of the entire collection of gene families found within a given species; personal genomics, a collection of features comprising the study of genomics concerned with the sequencing and analysis of the genome of an individual such that once the genotypes are known, the individual's genotype can be compared with the published literature to determine likelihood of trait expression and disease risk to enhance personalized medicine suggestions; epigenomics, a collection of features comprising the study of supporting the structure of genome, including protein and RNA binders, alternative DNA structures, and chemical modifications on DNA; nucleomics, a collection of features comprising the study of the complete set of genomic components which form the cell nucleus as a complex, dynamic biological system; lipidomics, a collection of features comprising the study of cellular lipids, including the 
modifications made to any particular set of lipids produced by a patient; proteomics, a collection of features comprising the study of proteins, including the modifications made to any particular set of proteins produced by a patient; immunoproteomics, a collection of features comprising the study of large sets of proteins involved in the immune response; nutriproteomics, a collection of features comprising the study of identifying molecular targets of nutritive and non-nutritive components of the diet including the use of proteomics mass spectrometry data for protein expression studies; proteogenomics, a collection of features comprising the study of biological research at the intersection of proteomics and genomics including data which identifies gene annotations; structural genomics, a collection of features comprising the study of 3-dimensional structure of every protein encoded by a given genome using a combination of modeling approaches; glycomics, a collection of features comprising the study of sugars and carbohydrates and their effects in the patient; foodomics, a collection of features comprising the study of the intersection between the food and nutrition domains through the application and integration of technologies to improve consumer's well-being, health, and knowledge; transcriptomics, a collection of features comprising the study of RNA molecules, including mRNA, rRNA, tRNA, and other non-coding RNA, produced in cells; metabolomics, a collection of features comprising the study of chemical processes involving metabolites, or unique chemical fingerprints that specific cellular processes leave behind, and their small-molecule metabolite profiles; metabonomics, a collection of features comprising the study of the quantitative measurement of the dynamic multiparametric metabolic response of cells to pathophysiological stimuli or genetic modification; nutrigenetics, a collection of features comprising the study of genetic variations on the interaction 
between diet and health with implications to susceptible subgroups; cognitive genomics, a collection of features comprising the study of the changes in cognitive processes associated with genetic profiles; pharmacogenomics, a collection of features comprising the study of the effect of the sum of variations within the human genome on drugs; pharmacomicrobiomics, a collection of features comprising the study of the effect of variations within the human microbiome on drugs; toxicogenomics, a collection of features comprising the study of gene and protein activity within particular cell or tissue of an organism in response to toxic substances; mitointeractome, a collection of features comprising the study of the process by which the mitochondria proteins interact; psychogenomics, a collection of features comprising the study of the process of applying the powerful tools of genomics and proteomics to achieve a better understanding of the biological substrates of normal behavior and of diseases of the brain that manifest themselves as behavioral abnormalities, including applying psychogenomics to the study of drug addiction to develop more effective treatments for these disorders as well as objective diagnostic tools, preventive measures, and cures; stem cell genomics, a collection of features comprising the study of stem cell biology to establish stem cells as a model system for understanding human biology and disease states; connectomics, a collection of features comprising the study of the neural connections in the brain; microbiomics, a collection of features comprising the study of the genomes of the communities of microorganisms that live in the digestive tract; cellomics, a collection of features comprising the study of the quantitative cell analysis and study using bioimaging methods and bioinformatics; tomomics, a collection of features comprising the study of tomography and omics methods to understand tissue or cell biochemistry at high spatial resolution from 
imaging mass spectrometry data; ethomics, a collection of features comprising the study of high-throughput machine measurement of patient behavior; and videomics, a collection of features comprising the study of a video analysis paradigm inspired by genomics principles, where a continuous image sequence, or video, can be interpreted as the capture of a single image evolving through time of mutations revealing patient insights.
A feature set for DNA related (molecular) features may include a proprietary calculation of the maximum effect a gene may have from sequencing results for the following genes: ABCB1-somatic, ACTA2-germline, ACTC1-germline, ALK-fluorescence_in_situ_hybridization_(fish), ALK-immunohistochemistry_(ihc), ALK-md_dictated, ALK-somatic, AMER1-somatic, APC-gene_mutation_analysis, APC-germline, APC-somatic, APOB-germline, APOB-somatic, AR-somatic, ARHGAP35-somatic, ARID1A-somatic, ARID1B-somatic, ARID2-somatic, ASXL1-somatic, ATM-gene_mutation_analysis, ATM-germline, ATM-somatic, ATP7B-germline, ATR-somatic, ATRX-somatic, AXIN2-germline, BACH1-germline, BCL11B-somatic, BCLAF1-somatic, BCOR-somatic, BCORL1-somatic, BCR-somatic, BMPR1A-germline, BRAF-gene_mutation_analysis, BRAF-md_dictated, BRAF-somatic, BRCA1-germline, BRCA1-somatic, BRCA2-germline, BRCA2-somatic, BRD4-somatic, BRIP1-germline, CACNA1S-germline, CARD11-somatic, CASR-somatic, CD274-immunohistochemistry_(ihc), CD274-md_dictated, CDH1-germline, CDH1-somatic, CDK12-germline, CDKN2A-immunohistochemistry_(ihc), CDKN2A-germline, CDKN2A-somatic, CEBPA-germline, CEBPA-somatic, CFTR-somatic, CHD2-somatic, CHD4-somatic, CHEK2-germline, CIC-somatic, COL3A1-germline, CREBBP-somatic, CTNNB1-somatic, CUX1-somatic, DICER1-somatic, DOT1L-somatic, DPYD-somatic, DSC2-germline, DSG2-germline, DSP-germline, DYNC2H1-somatic, EGFR-gene_mutation_analysis, EGFR-immunohistochemistry_(ihc), EGFR-md_dictated, EGFR-germline, EGFR-somatic, EP300-somatic, EPCAM-germline, EPHA2-somatic, EPHA7-somatic, EPHB1-somatic, ERBB2-fluorescence_in_situ_hybridization_(fish), ERBB2-immunohistochemistry_(ihc), ERBB2-md_dictated, ERBB2-somatic, ERBB3-somatic, ERBB4-somatic, ESR1-immunohistochemistry_(ihc), ESR1-somatic, ETV6-germline, FANCA-germline, FANCA-somatic, FANCD2-germline, FANCI-germline, FANCL-germline, FANCM-somatic, FAT1-somatic, FBN1-germline, FBXW7-somatic, FGFR3-somatic, FH-germline, FLCN-germline, FLG-somatic, FLT1-somatic, FLT4-somatic,
GATA2-germline, GATA3-somatic, GATA4-somatic, GATA6-somatic, GLA-germline, GNAS-somatic, GRIN2A-somatic, GRM3-somatic, HDAC4-somatic, HGF-somatic, IDH1-somatic, IKZF1-somatic, IRS2-somatic, JAK3-somatic, KCNH2-germline, KCNQ1-germline, KDM5A-somatic, KDM5C-somatic, KDM6A-somatic, KDR-somatic, KEAP1-somatic, KEL-somatic, KIF1B-somatic, KMT2A-fluorescence_in_situ_hybridization_(fish), KMT2A-somatic, KMT2B-somatic, KMT2C-somatic, KMT2D-somatic, KRAS-gene_mutation_analysis, KRAS-md_dictated, KRAS-somatic, LDLR-germline, LMNA-germline, LRP1B-somatic, MAP3K1-somatic, MED12-somatic, MEN1-germline, MET-fluorescence_in_situ_hybridization_(fish), MET-somatic, MKI67-immunohistochemistry_(ihc), MKI67-somatic, MLH1-germline, MSH2-germline, MSH3-germline, MSH6-germline, MSH6-somatic, MTOR-somatic, MUTYH-germline, MYBPC3-germline, MYCN-somatic, MYH11-germline, MYH11-somatic, MYH7-germline, MYL2-germline, MYL3-germline, NBN-germline, NCOR1-somatic, NCOR2-somatic, NF1-somatic, NF2-germline, NOTCH1-somatic, NOTCH2-somatic, NOTCH3-somatic, NRG1-somatic, NSD1-somatic, NTRK1-somatic, NTRK3-somatic, NUP98-somatic, OTC-germline, PALB2-germline, PALLD-somatic, PBRM1-somatic, PCSK9-germline, PDGFRA-somatic, PDGFRB-somatic, PGR-immunohistochemistry_(ihc), PIK3C2B-somatic, PIK3CA-somatic, PIK3CG-somatic, PIK3R1-somatic, PIK3R2-somatic, PKP2-germline, PLCG2-somatic, PML-somatic, PMS2-germline, POLD1-germline, POLD1-somatic, POLE-germline, POLE-somatic, PREX2-somatic, PRKAG2-germline, PTCH1-somatic, PTEN-fluorescence_in_situ_hybridization_(fish), PTEN-gene_mutation_analysis, PTEN-germline, PTEN-somatic, PTPN13-somatic, PTPRD-somatic, RAD51B-germline, RAD51C-germline, RAD51D-germline, RAD52-germline, RAD54L-germline, RANBP2-somatic, RB1-germline, RB1-somatic, RBM10-somatic, RECQL4-somatic, RET-fluorescence_in_situ_hybridization_(fish), RET-germline, RET-somatic, RICTOR-somatic, RNF43-somatic, ROS1-fluorescence_in_situ_hybridization_(fish), ROS1-md_dictated, ROS1-somatic, RPTOR-somatic,
RUNX1-germline, RUNX1T1-somatic, RYR1-germline, RYR2-germline, SCN5A-germline, SDHAF2-germline, SDHB-germline, SDHC-germline, SDHD-germline, SETBP1-somatic, SETD2-somatic, SH2B3-somatic, SLIT2-somatic, SLX4-somatic, SMAD3-germline, SMAD4-germline, SMAD4-somatic, SMARCA4-somatic, SOX9-somatic, SPEN-somatic, STAG2-somatic, STK11-gene_mutation_analysis, STK11-germline, STK11-somatic, TAF1-somatic, TBX3-somatic, TCF7L2-somatic, TERT-somatic, TET2-somatic, TGFBR1-germline, TGFBR2-germline, TGFBR2-somatic, TMEM43-germline, TNNI3-germline, TNNT2-germline, TP53-gene_mutation_analysis, TP53-immunohistochemistry_(ihc), TP53-md_dictated, TP53-germline, TP53-somatic, TPM1-germline, TSC1-germline, TSC1-somatic, TSC2-germline, TSC2-somatic, VHL-germline, WT1-germline, WT1-somatic, XRCC3-germline, and ZFHX3-somatic.
A sufficiently robust collection of features may include all of the features disclosed above; however, models and predictions based on the available features may include models which are optimized and trained from a selection of features that is much more limited than the exhaustive feature set. Such a constrained feature set may include as few as tens to hundreds of features. For example, a model's constrained feature set may include the genomic results of a sequencing of the patient's tumor, derivative features based upon the genomic results, the patient's tumor origin, the patient's age at diagnosis, the patient's gender and race, and symptoms that the patient brought to their physician's attention during a routine checkup.
A feature store may enhance a patient's feature set through the application of machine learning and analytics by selecting from any features, alterations, or calculated output derived from the patient's features or alterations to those features. Such a feature store may generate new features from the original features found in the feature module or may identify and store important insights or analysis based upon the features. The selections of features may be based upon an alteration or calculation to be generated, and may include the calculation of single or multiple nucleotide polymorphisms, insertions or deletions of the genome, a tumor mutational burden, a microsatellite instability, a copy number variation, a fusion, or other such calculations. An exemplary output of an alteration or calculation generated which may inform future alterations or calculations includes a finding of hypertrophic cardiomyopathy (HCM) and variants in MYH7, wherein previously classified variants may be identified in the patient's genome which may inform the classification of novel variants or indicate a further risk of disease. An exemplary approach may include the enrichment of variants and their respective classifications to identify a region in MYH7 that is associated with HCM. Any novel variants detected from a patient's sequencing localized to this region would increase the patient's risk for HCM. Features which may be utilized in such an alteration detection include the structure of MYH7 and classification of variants therein. A model which focuses on enrichment may isolate such variants.
Artificial Intelligence Models
Artificial intelligence models referenced herein may be gradient boosting models, random forest models, neural networks (NN), regression models, Naïve Bayes models, or machine learning algorithms (MLA). An MLA or a NN may be trained from a training data set. In an exemplary prediction profile, a training data set may include imaging, pathology, clinical, and/or molecular reports and details of a patient, such as those curated from an EHR or genetic sequencing reports. MLAs include supervised algorithms (such as algorithms where the features/classifications in the data set are annotated) using linear regression, logistic regression, decision trees, classification and regression trees, Naïve Bayes, nearest neighbor clustering; unsupervised algorithms (such as algorithms where no features/classification in the data set are annotated) using Apriori, k-means clustering, principal component analysis, random forest, adaptive boosting; and semi-supervised algorithms (such as algorithms where an incomplete number of features/classifications in the data set are annotated) using generative approach (such as a mixture of Gaussian distributions, mixture of multinomial distributions, hidden Markov models), low density separation, graph-based approaches (such as mincut, harmonic function, manifold regularization), heuristic approaches, or support vector machines. NNs include conditional random fields, convolutional neural networks, attention-based neural networks, deep learning, long short-term memory networks, or other neural models where the training data set includes a plurality of tumor samples, RNA expression data for each sample, and pathology reports covering imaging data for each sample. While MLA and neural networks identify distinct approaches to machine learning, the terms may be used interchangeably herein. Thus, a mention of MLA may include a corresponding NN or a mention of NN may include a corresponding MLA unless explicitly stated otherwise.
Training may include providing optimized datasets, labeling these traits as they occur in patient records, and training the MLA to predict or classify based on new inputs. Artificial NNs are efficient computing models which have shown their strengths in solving hard problems in artificial intelligence. They have also been shown to be universal approximators (they can represent a wide variety of functions when given appropriate parameters). Some MLA may identify features of importance and assign a coefficient, or weight, to them. The coefficient may be multiplied with the occurrence frequency of the feature to generate a score, and once the scores of one or more features exceed a threshold, certain classifications may be predicted by the MLA. A coefficient schema may be combined with a rule based schema to generate more complicated predictions, such as predictions based upon multiple features. For example, ten key features may be identified across different classifications. A list of coefficients may exist for the key features, and a rule set may exist for the classification. A rule set may be based upon the number of occurrences of the feature, the scaled weights of the features, or other qualitative and quantitative assessments of features encoded in logic known to those of ordinary skill in the art. In other MLA, features may be organized in a binary tree structure. For example, key features which distinguish between the most classifications may exist at the root of the binary tree and at each subsequent branch in the tree, until a classification may be awarded upon reaching a terminal node of the tree. For example, a binary tree may have a root node which tests for a first feature. The feature either occurs or does not occur (the binary decision), and the logic may traverse the branch which is true for the item being classified. Additional rules may be based upon thresholds, ranges, or other qualitative and quantitative tests.
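The coefficient-based scoring and the binary tree traversal described above may be sketched as follows. This is an illustrative sketch only: the names `weighted_score`, `predict_by_threshold`, `Node`, and `classify`, along with the feature names and class labels, are hypothetical and do not describe any particular trained model.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Set


def weighted_score(coefficients: Dict[str, float], counts: Dict[str, int]) -> float:
    """Sum of (coefficient x occurrence frequency) over the weighted features."""
    return sum(weight * counts.get(feature, 0) for feature, weight in coefficients.items())


def predict_by_threshold(coefficients: Dict[str, float], counts: Dict[str, int],
                         threshold: float) -> bool:
    """Predict the classification once the combined score exceeds the threshold."""
    return weighted_score(coefficients, counts) > threshold


@dataclass
class Node:
    """Binary tree node: internal nodes test a feature, terminal nodes carry a label."""
    feature: Optional[str] = None
    present: Optional["Node"] = None   # branch taken when the feature occurs
    absent: Optional["Node"] = None    # branch taken when the feature does not occur
    label: Optional[str] = None


def classify(node: Node, features: Set[str]) -> str:
    """Traverse from the root until a terminal node awards a classification."""
    while node.label is None:
        node = node.present if node.feature in features else node.absent
    return node.label


# Hypothetical tree: the root tests "feature_a", then "feature_b".
tree = Node(
    feature="feature_a",
    present=Node(label="class_1"),
    absent=Node(feature="feature_b",
                present=Node(label="class_2"),
                absent=Node(label="class_3")),
)
```

A rule set of the kind described above could be layered on top of either function, for example by requiring a minimum occurrence count before a feature contributes to the score.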
While supervised methods are useful when the training dataset has many known values or annotations, the nature of EMR/EHR documents is that there may not be many annotations provided. When exploring large amounts of unlabeled data, unsupervised methods are useful for binning/bucketing instances in the data set. A single instance of the above models, or two or more such instances in combination, may constitute a model for the purposes of models, artificial intelligence, neural networks, or machine learning algorithms, herein.
A set of transformation steps may be performed to convert the data from the Patient Data Store into a format suitable for analysis. Various modern machine learning algorithms may be utilized to train models targeting the prediction of expected survival and/or response for a particular patient population. An exemplary data store 14 is described in further detail in U.S. Provisional Patent Application No. 62/746,997, titled “Data Based Cancer Research and Treatment Systems and Methods,” filed Oct. 17, 2018; U.S. patent application Ser. No. 16/289,027, titled “Mobile Supplementation, Extraction, and Analysis of Health Records” and filed Feb. 28, 2019, and issued Aug. 27, 2019, as U.S. Pat. No. 10,395,772; and PCT International Application No. PCT/US19/56713 filed Oct. 17, 2019 and titled “Data Based Cancer Research and Treatment Systems and Methods,” each of which is incorporated herein by reference in its entirety.
The system may include a data delivery pipeline to transmit clinical and molecular de-identified records in bulk. The system also may include separate storage for de-identified and identified data to maintain data privacy and compliance with applicable laws or guidelines, such as the Health Insurance Portability and Accountability Act.
The raw input data and/or any transformed, normalized, and/or predictive data may be stored in one or more relational databases for further access by the system in order to carry out one or more comparative or analytical functions, as described in greater detail herein. The data model used to construct the relational database(s) may be used to store, organize, display, and/or interpret a significant amount and variety of data, e.g., dozens of tables that comprise hundreds of different columns. Unlike standard data models such as OMOP or QDM, the data model may generate unique linkages within a table or across tables to directly relate various clinical attributes, thereby making complex clinical attributes easier to ingest, interpret and analyze.
Once the relevant data has been received, transformed, and manipulated, as discussed above, the system may include a plurality of modules in order to generate the desired dynamic user interfaces, as discussed above with regard to the system diagram of
Patient Cohort Filtering User Interface
Turning to
Additionally, or alternatively, the system may recognize one or more attributes defined for tumor data stored by the system, where those attributes may be, for example, genotypic, phenotypic, genealogical, or demographic. The various selectable attribute criteria may reflect patient-related metadata stored in the patient data store 14, where exemplary metadata may include, for instance: Project Name (which may reflect a database storing a list of patients) 204, Gender 206, Race 208; Cancer, Cancer Site 210, Cancer Name 212; Metastasis, Cancer Name 214; Tumor Site 216 (which may reflect where the tumor was located), Stage 218 (such as I, II, III, IV, and unknown), M Stage 220 (such as m0, m1, m2, m3, and unknown); Medication (such as by Name 222 or Ingredient 224); Sequencing 226 (such as gene name or variant), MSI (Microsatellite Instability) status 228, TMB (Tumor Mutational Burden) status (not shown); Procedure 230 (such as, by Name); or Death (such as, by Event Name 232 or Cause of Death 234).
The system also may permit a user to filter patient data according to any of the criteria listed herein including those listed under the heading “Features and Feature Modules,” and include one or more of the following additional criteria: institution, demographics, molecular data, assessments, diagnosis site, tumor characterization, treatment, or one or more internal criteria. The institution option may permit a user to filter according to a specific facility. The demographics option may permit a user to sort, for example, by one or more of gender, death status, age at initial diagnosis, or race. The molecular data option may permit a user to filter according to variant calls (for example, when there is molecular data available for the patient, what the particular gene name, mutation, mutation effect, and/or sample type is), abstracted variants (including, for example, gene name and/or sequencing method), MSI status (for example, stable, low, or high), or TMB status (for example, selectable within or outside of a user-defined range). Assessments may permit a user to filter according to various system-defined criteria such as smoking status and/or menopausal status. Diagnosis site may permit a user to filter according to primary and/or metastatic sites. Tumor characterization may permit the user to filter according to one or more tumor-related criteria, for example, grade, histology, stage, TNM Classification of Malignant Tumours (TNM) and/or each respective T value, N value, and/or M value. Treatment may permit the user to select from among various treatment-related options, including, for instance, an ingredient, a regimen, a treatment type, etc.
Certain criteria may permit the user to select from a plurality of sub-criteria that may be indicated once the initial criterion is selected. Other criteria may present the user with a binary option, for example, deceased or not. Still other criteria may present the user with slider or range-type options, for example, age at initial diagnosis may be presented as a slider with user-selectable lower and upper bounds. Still further, for any of these options, the system may present the user with a radio button or slider to alternate between whether the system should include or exclude patients based on the selected criterion. It should be understood that the examples described herein do not limit the scope of the types of information that may be used as criteria. Any type of medical information capable of being stored in a structured format may be used as a criterion.
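The binary, range-type, and include/exclude options described above may be modeled as in the following sketch. The `Criterion` class, the helper functions, and the record fields `age_at_diagnosis` and `deceased` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Record = Dict[str, object]


@dataclass
class Criterion:
    """A selection criterion plus an include/exclude toggle."""
    test: Callable[[Record], bool]
    include: bool = True

    def keeps(self, record: Record) -> bool:
        matched = self.test(record)
        # An exclusion criterion keeps only the records that do NOT match.
        return matched if self.include else not matched


def age_range(lower: int, upper: int) -> Callable[[Record], bool]:
    """Range-type criterion with user-selectable lower and upper bounds (inclusive)."""
    return lambda r: lower <= r["age_at_diagnosis"] <= upper


def is_deceased(record: Record) -> bool:
    """Binary criterion: deceased or not."""
    return bool(record["deceased"])


def apply_criteria(records: List[Record], criteria: List[Criterion]) -> List[Record]:
    """Keep the records that satisfy every criterion."""
    return [r for r in records if all(c.keeps(r) for c in criteria)]


# Hypothetical records for illustration only.
patients = [
    {"id": "p1", "age_at_diagnosis": 45, "deceased": False},
    {"id": "p2", "age_at_diagnosis": 62, "deceased": True},
    {"id": "p3", "age_at_diagnosis": 71, "deceased": False},
]
selected = apply_criteria(
    patients,
    [Criterion(age_range(40, 65)), Criterion(is_deceased, include=False)],
)
```

Here the age criterion is applied as an inclusion filter and the deceased criterion as an exclusion filter, so only the first hypothetical patient remains.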
In another embodiment, the user interface may include a natural language search style bar to facilitate filter criteria definition for the cohort, for example, in the “Ask Gene” tab 236 of the user interface or via a text input of the filtering interface. In one aspect, an ability to specify a query, either via keyboard-type input or via machine-interpreted dictation, may define one or more of the subsequent layers of a cohort funnel (described in greater detail in the next section). Thus, for example, when employing traditional natural language processing software or techniques, an input of “breast cancer patients” would cause the system to recognize a filter of “cancer_site==breast cancer” and add that as the next layer of filtering. Similarly, the system would recognize an input of “pancreatic patients with adverse reactions to gemcitabine” and translate it into multiple successive layers of filtering, for example, “cancer_site==pancreatic cancer” AND “medication==gemcitabine” AND “adverse reaction==not null.”
In a second aspect, the natural language processing may permit a user to use the system to query for general insights directly, thereby both narrowing down a cohort of patients via one or more funnel levels and also causing the system to display an appropriate summary panel in the user interface. Thus, when the system receives the query “What is the 5 years progression-free survival rate for stage III colorectal cancer patients, after radiotherapy?,” it would translate the query into a series of filters such as “cancer_site==colorectal” AND “stage==III” AND “treatment==radiotherapy” and then display five-year progression-free survival rates using, for example, the patient survival analysis user interface 30. Similarly, the system would translate the query “What percentage of female lung cancer patients are post-menopausal at a time of diagnosis?” into a series of filters such as “gender==female,” “cancer_site==lung,” and “temporal==at diagnosis,” determine how many of the resulting patients had data reflecting a post-menopause situation, and then determine the relevant percentage, for example, displaying the results through one or more statistical summary charts.
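Once a natural language query has been translated into filter expressions of the form shown above, the successive filtering and the percentage calculation may be sketched as follows. The natural language translation step itself is not shown; the function names, the `menopausal_status` field, and the sample records are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Record = Dict[str, str]


def parse_filter(expr: str) -> Tuple[str, str]:
    """Split an expression such as 'cancer_site==lung' into (field, value)."""
    field, value = expr.split("==", 1)
    return field.strip(), value.strip()


def apply_filters(records: List[Record], exprs: List[str]) -> List[Record]:
    """Apply each filter expression in turn, as successive layers joined by AND."""
    remaining = records
    for expr in exprs:
        field, value = parse_filter(expr)
        remaining = [r for r in remaining if r.get(field) == value]
    return remaining


def percentage_with(records: List[Record], field: str, value: str) -> float:
    """Percentage of the records whose field equals value (0.0 for an empty cohort)."""
    if not records:
        return 0.0
    matching = sum(1 for r in records if r.get(field) == value)
    return 100.0 * matching / len(records)


# Hypothetical records for illustration only.
patients = [
    {"gender": "female", "cancer_site": "lung", "menopausal_status": "post"},
    {"gender": "female", "cancer_site": "lung", "menopausal_status": "pre"},
    {"gender": "male", "cancer_site": "lung", "menopausal_status": ""},
    {"gender": "female", "cancer_site": "breast", "menopausal_status": "post"},
]
cohort = apply_filters(patients, ["gender==female", "cancer_site==lung"])
share = percentage_with(cohort, "menopausal_status", "post")
```

The resulting percentage could then be passed to whichever statistical summary chart the user interface selects for display.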
Cohort Funnel and Population Analysis User Interface
Turning now to
In another embodiment, the system may include a selectable button or icon that opens a dialogue box 238 which shows a plurality of selectable tabs, each tab representing the same or similar filtering criteria discussed above (Demographics, Molecular Data, Assessments, Diagnosis Site, Tumor Characterization, and Treatment). Selection of each tab may present the user with the same or similar options for each respective filter as discussed above (for example, selecting “Demographics” may present the user with further options relating to: Gender, Death Status, Age at Initial Diagnosis, or Race). The user then may select one or more options, select “next,” and then select whether it is an inclusion or exclusion filter, and the corresponding selection is added to the funnel (discussed in greater detail below), with an icon moving to be below a next successively narrower portion of the funnel.
Additionally, or alternatively, looking at the cohort, or set of patients in a database, the system permits filtering by a plurality of clinical and molecular factors via a menu 240. For example, and with regard to clinical factors, the system may include filters based on patient demographics 242, cancer site 244, tumor characterization 246, or molecular data 248 which further may include their own subsets of filterable options 242, such as histology 250, stage 252, and/or grade-based options 254 (see
Although the examples discussed herein provide analysis with regard to various cancer types, in other embodiments, it will be appreciated that the system may be used to indicate filtered display of other disease conditions, and it should be understood that the selection items will differ in those situations to focus particularly on the relevant conditions for the other disease.
The cohort funnel and population analysis user interface 26 visually may depict the number of patients in the data set, either all at once or progressively upon receiving a user's selection of multiple filtering criteria. In one aspect, the display of patient frequencies by filter attribute may be provided using an interactive funnel chart 264. As seen in
The above filtering can be performed upon receiving each user selection of a filter criterion, the funnel 264 updating to show the narrowing span of the dataset upon each filter selection. In that situation a filtering menu 240 such as the one discussed above may remain visible in each tab as they are toggled, or may be collapsed to the side, or may be represented as a summary 266 of the selected filtered options to keep the user apprised of the reduced data set/size.
With regard to each filtering method discussed above, the combination of factors may be based on Boolean-style combinations. Exemplary Boolean-style combinations may include, for filtering factors A and B, permitting the user to select whether to search for patients with “A AND B,” “A OR B,” “A AND NOT B,” “B AND NOT A,” etc.
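The Boolean-style combinations described above can be illustrated with a minimal Python sketch. The record layout, the field names, and the helper names (`equals`, `and_`, `or_`, `not_`, `apply_filter`) are hypothetical and chosen only for illustration; they are not part of the described system.

```python
from typing import Callable, Dict, List

Patient = Dict[str, str]               # hypothetical flat patient record
Predicate = Callable[[Patient], bool]  # a filter is a function from record to bool


def equals(field: str, value: str) -> Predicate:
    """Filter factor of the form field==value."""
    return lambda p: p.get(field) == value


def and_(a: Predicate, b: Predicate) -> Predicate:
    return lambda p: a(p) and b(p)


def or_(a: Predicate, b: Predicate) -> Predicate:
    return lambda p: a(p) or b(p)


def not_(a: Predicate) -> Predicate:
    return lambda p: not a(p)


def apply_filter(patients: List[Patient], pred: Predicate) -> List[Patient]:
    """Return the patients that satisfy the combined filter."""
    return [p for p in patients if pred(p)]
```

With factor A defined as `equals("gender", "female")` and factor B as `equals("cancer_site", "lung")`, the combination “A AND NOT B” would be written `and_(A, not_(B))`.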
The final filtered cohort of interest may form the basis for further detailed analysis in the modules or other user interfaces described below. The population of interest is called a “cohort”. The user interface can provide fixed functional attribute selectors pre-populated appropriately based on the available data attributes in a Patient Data Store.
The display may further indicate a geographic location clustering plot of patients and/or demographic distribution comparisons with publicly reported statistics and/or privately curated statistics.
Patient Timeline Analysis Module
Additionally, the system may include a patient timeline analysis module 28 that permits a user to review the sequence of events in the clinical life of each patient. It will be appreciated that this data may be anonymized, as discussed above, in order to protect confidentiality of the patient data.
Once a user has provided all of his or her desired filter criteria, e.g., via the cohort funnel & population analysis user interface 26, the system permits the user to analyze the filtered subset of patients. With respect to the user interface depicted in the figures, this procedure may be accomplished by selecting the “Analyze Cohort” option 268 presented in the upper right-hand corner of the interface 26.
Turning now to
The user interface 28 also permits a user to query the data summary information presented in the data summary window or region 300 in order to sort that data further, e.g., using a control panel 312. For example, as seen in
Turning now to
In one embodiment, an event timeline Gantt style chart is provided for a high-level overview, coupled with a tabular detail panel. The display may also enable the visualization and comparison of multiple patients concurrently on a normalized timeline, for the purposes of identifying both areas of overlap, and potential discontinuity across a patient subset.
Patient “Survival” Analysis Module
The system further may provide survival analysis for the subset of patients through use of the patient survival analysis user interface 30, as seen in
In order to provide the user with flexibility to define the metes and bounds of that analysis, the system may permit the user to select one or both of the starting and ending events upon which that analysis is based. Exemplary starting events include an initial primary disease diagnosis, progression, metastasis, regression, identification of a first primary cancer, an initial prescription of medication, etc. Conversely, exemplary ending events may include progression, metastasis, recurrence, death, a period of time, and treatment start/end dates. Selecting a starting event sets an anchor point for all patients from which the curve begins, and selecting an end event sets a horizon for which the curve is predicting.
As seen in
Additionally, the system may be configured to permit the user to focus or zoom in on a particular time span within the plot, as seen in
Turning now to
As shown in
As seen in
As will be appreciated from the previous discussion, underpinning the utility of the system is the ability to highlight features and interaction pathways of high importance driving these predictions, and the ability to further pinpoint cohorts of patients exhibiting levels of response that significantly deviate from expected norms. In this context, high importance may be understood to be based upon feature importance to an outcome of a prediction. In particular, features that provide the greatest weight to the prediction may be designated as those of high importance. The present system and user interface provide an intuitive, efficient method for patient selection and cohort definition given specific inclusion and/or exclusion criteria. The system also provides a robust user interface to facilitate internal research and analysis, including research and analysis into the impact of specific clinical and/or molecular attributes, as well as drug dosages, combinations, and/or other treatment protocols on therapeutic outcomes and patient survival for potentially large, otherwise unwieldy patient sample sizes.
The modeling and visualization framework set forth herein may enable users to interactively explore auto-detected patterns in the clinical and genomic data of their filtered patient cohort, and to analyze the relationship of those patterns to therapeutic response and/or survival likelihood. That analysis may lead a user to more informed treatment decisions for patients, earlier in the cycle than may be the case without the present system and user interface. The analysis also may be useful in the context of clinical trials, providing robust, data-backed clinical trial inclusion and/or exclusion analysis. Backed by an extensive library of clinical and molecular data, the present system unifies and applies various algorithms and concepts relating to clinical analysis and machine learning to generate a fully integrated, interactive user interface.
Outlier Analysis Module
Turning now to
Additionally, the user interface may include a second region 410 including a control panel 412 for filtering, selecting, or otherwise highlighting in the first region a subset of the patients as outliers. Setting a value or range in the control panel may generate an overlay 414 on the radar plot (see
In another aspect, as seen in
As with the previously described user interface, the interface of
With regard to either outlier user interface described above, the interface further may include a third region 440 providing information specific to a selected node when the system receives a user input corresponding to a given indicator, for example, by clicking on that indicator 436 in the first region of the interface, as seen in
Additionally, with regard to either outlier user interface described above, the algorithm to determine the existence of an outlier may be based on a binary tree 500 such as the one seen in
In some instances, data in a branch may be lost when the system fully extrapolates out to a leaf. In such instances, the system may scan features that a current patient has in common with outlier patients, and suggest changes to clinical process that may place them in a new bucket (leaf/node) of patients that have a higher outlier. For example, if a branch has a high PFS in a node, but loses the distinction by the time the branch resolves in a leaf, the system may identify the node with the highest PFS as a leaf.
In order to generate an expected survival rate for a population, the system may rely upon a predictive algorithm built on the survival rates of the patients in the data set 14. Alternatively, the system may use an external source for a PFS prediction, such as an FDA published PFS for certain cancers or treatments. The system then may compare the expected survival rate with an observed PFS rate for a population in order to determine outliers.
In one particular embodiment, a method for identifying one or more outlier groups of patients is provided. The method includes steps of selecting a cohort of patients, where the cohort includes a plurality of patients. Selection of the cohort may be based on identifying a group of patients having a particular condition such as a particular disease. In one particular embodiment, the cohort may include a group of patients (e.g. several tens, hundreds, thousands, or more) who have non-small cell lung cancer or breast cancer. Other groupings based on other criteria are also possible.
In various embodiments, a next step of the method may include calculating an average survival rate for the cohort of patients. For example, based on available data it may be determined that these patients on average survive for a particular time (e.g. a number of months such as 63 months).
In certain embodiments, another step of the method may include selecting a plurality of clinical or molecular characteristics associated with the cohort of patients. The clinical or molecular characteristics associated with the cohort of patients may include one or more of a genetic marker, a procedure performed on a patient, a pharmaceutical treatment given to a patient, an age at which a patient receives a diagnosis, an age at which a patient receives a treatment, or a lifestyle indicator. In particular embodiments, the clinical or molecular characteristics for a patient may include a smoking status of the patient (e.g. yes, no, unknown), a DNA mutation associated with the patient (e.g. KRAS, BRAF, EGFR, etc.), an age of the patient at a time of diagnosis or treatment (e.g. one or more integers in a particular age range such as 18-115 years old), or one or more treatment procedures or pharmaceuticals received by the patient.
In some embodiments, information regarding the cohort of patients may be used to generate a tree structure, where a node of the tree structure may contain one or more patients who are outliers, that is, patients who have shown a significantly different survival (shorter or longer) for a given set of conditions. Thus to generate the tree structure, for each characteristic of the plurality of characteristics the method may include identifying a plurality of data values associated with the characteristic. For each data value of the plurality of data values associated with the characteristic, the method may include: dividing the cohort of patients into a first subgroup and a second subgroup of the plurality of patients based on a criterion such as whether each patient of the plurality of patients survived during an outlier time period; determining a difference between a number of patients in the first subgroup and the second subgroup; and selecting a data value that results in the difference that is a largest difference between a number of patients in the first subgroup and the second subgroup.
This procedure may be repeated for each data value of each characteristic. For example, for embodiments in which the characteristic relates to an age, the data values include a range of ages, beginning with a lower limit such as age 18 and continuing (19, 20, 21, . . .) to an upper limit such as age 115 (or another suitable value). In one particular example, if age=20 and the time period is x years (e.g. 5 years), then a first subgroup of patients may be those who died within x years of an age 20 diagnosis and a second subgroup of patients may be those who did not die within x years of an age 20 diagnosis.
To determine the difference, the patients who did not survive within the particular time are considered a first subgroup of patients and the patients who did survive during the particular time are considered a second subgroup of patients. A difference is then determined between the number of patients in the first and second subgroups for each data value associated with each characteristic. The difference may be divided by the total number of patients in the first and second subgroups and expressed as a decimal value between 0 and 1 (e.g. if 400 patients died within x years of an age 20 diagnosis and 100 patients did not, then the difference is 400−100=300, which is divided by the total number in the two groups, 500, to get a normalized difference of 0.6). The particular data value having the largest such difference may be retained while the procedure is being performed in order to determine a node for the tree structure (e.g. the largest difference may be a difference of 0.7 at age=44).
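The normalized difference and the selection of the data value with the largest difference can be sketched in Python as follows. This is an illustrative sketch only: the function names and the `(data_value, died_within_period)` input layout are assumptions, and the use of an absolute value (so that the result always falls between 0 and 1) is also an assumption.

```python
from collections import defaultdict
from typing import Hashable, Iterable, Tuple


def normalized_difference(n_first: int, n_second: int) -> float:
    """Difference between two subgroup sizes divided by their combined size."""
    total = n_first + n_second
    if total == 0:
        return 0.0
    return abs(n_first - n_second) / total  # abs() is an assumption


def best_data_value(records: Iterable[Tuple[Hashable, bool]]) -> Tuple[Hashable, float]:
    """Return the data value with the largest normalized difference.

    records holds (data_value, died_within_period) pairs for one
    characteristic, e.g. (20, True) for a patient diagnosed at age 20
    who died within the time period. Raises ValueError if records is empty.
    """
    counts = defaultdict(lambda: [0, 0])  # value -> [died, survived]
    for value, died in records:
        counts[value][0 if died else 1] += 1
    scored = ((v, normalized_difference(d, s)) for v, (d, s) in counts.items())
    return max(scored, key=lambda pair: pair[1])
```

For the worked example in the text, `normalized_difference(400, 100)` evaluates to 0.6.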
The method may further include creating a new node of the tree structure based on the data value that results in the largest difference between the number of patients in the first subgroup and the second subgroup (e.g. a node may be created for age=44). Once the particular data value has been identified as having the largest difference, the method may then include creating branches from the node, including creating a first branch from the new node based on the first subgroup, and creating a second branch from the new node based on the second subgroup. Several examples of potential nodes may include the following: Smoking=Yes, Difference=0.8; DNA mutation=KRAS, Difference=0.78; Age=82, Difference=0.9; Gender=Male, Difference=0.6. Based on this information, the “Age” characteristic has the greatest difference and is selected, where branches may be created that are based on Age greater than or equal to 82 and Age less than 82.
The tree structure may continue to be built by repeating steps above, including steps of dividing the cohort into subgroups for each characteristic and each data value of each characteristic. The starting cohort in each subsequent repeated step is the group of patients in the particular node that is the starting point. This procedure is repeated at each node based on the patients in the first subgroup and the second subgroup, respectively. The procedure continues until one or both of the following conditions are met: (1) a maximum number of nodes or branches has been created, or (2) a node contains fewer than a minimum number of patients. When the procedure is complete, the method may include identifying at least one node from the tree structure which contains an outlier group of patients.
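A compact Python sketch of the repeated node-building procedure and its two stopping conditions is given below. It rests on several assumptions that the text does not fix: each patient is a dictionary with a boolean `died` entry, numeric data values split the group by “greater than or equal to” (as in the Age example) while other values split by equality, and nodes are expanded breadth-first. All names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Tuple

Patient = Dict[str, Any]  # hypothetical record; "died" = died within the time period


@dataclass
class Node:
    patients: List[Patient]
    split: Optional[Tuple[str, Any]] = None  # (characteristic, data value)
    children: List["Node"] = field(default_factory=list)


def matches(patient: Patient, characteristic: str, value: Any) -> bool:
    """Assumed split rule: >= for numbers, equality for everything else."""
    observed = patient[characteristic]
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        return observed >= value
    return observed == value


def difference(group: List[Patient]) -> float:
    """Normalized difference between patients who died and who survived."""
    died = sum(1 for p in group if p["died"])
    return abs(died - (len(group) - died)) / len(group)


def best_split(patients: List[Patient], characteristics: List[str]):
    """Scan every data value of every characteristic; keep the largest difference."""
    best = None  # (characteristic, value, difference)
    for characteristic in characteristics:
        for value in sorted({p[characteristic] for p in patients}, key=str):
            selected = [p for p in patients if matches(p, characteristic, value)]
            if len(selected) == len(patients):
                continue  # value does not divide the group
            score = difference(selected)
            if best is None or score > best[2]:
                best = (characteristic, value, score)
    return best


def build_tree(patients: List[Patient], characteristics: List[str],
               max_nodes: int = 15, min_patients: int = 2) -> Node:
    root = Node(patients)
    queue, n_nodes = [root], 1
    while queue and n_nodes + 2 <= max_nodes:      # condition (1): node budget
        node = queue.pop(0)
        if len(node.patients) < min_patients:      # condition (2): too few patients
            continue
        found = best_split(node.patients, characteristics)
        if found is None:
            continue
        characteristic, value, _ = found
        first = [p for p in node.patients if matches(p, characteristic, value)]
        second = [p for p in node.patients if not matches(p, characteristic, value)]
        node.split = (characteristic, value)
        node.children = [Node(first), Node(second)]
        queue.extend(node.children)
        n_nodes += 2
    return root
```

The default values for `max_nodes` and `min_patients` are placeholders; suitable limits would depend on the size of the cohort.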
Smart Cohorts
In various embodiments, a prediction model may be developed which facilitates identification of one or more cohorts of patients whose disease progression and/or likelihood of survival is substantially different from expectation, for example significantly longer or shorter than would be expected. Information from these cohorts may then be examined to identify one or more primary factors that could potentially contribute to the survival profile of the cohorts. Identification of smart cohorts may be used to provide precision medicine results for a particular patient, aid in the identification of potential areas of interest to target medication research, and/or identification of unexpected potential to expand medication patient targeting.
Given a set of patient timelines, in various embodiments the objective of the smart cohorts module will be three-fold, attempting to answer one or more of the following questions:
1. What is the likelihood of each patient surviving longer than Y years (or living progression-free for at least Y years) (i.e. “Survival”), measured at each event point in the patient's timeline?
2. What are the primary factors that most influence the expected survival outcome?
3. Which subsets of patients exhibit combinations of these factors such that they stand out as an outlier cohort in terms of their survival profile, relative to expectation, at a user specified anchor timeline event (e.g. at stage IV diagnosis), and what are these patients' characteristics?
This problem may be approached from a time series modeling perspective, with point in time snapshots of feature states, and a binary classification objective. In certain embodiments a tree-based supervised-clustering approach may be used to help identify patient groups of interest, although in other embodiments other analysis and visualization methods are also included.
The inherent temporal nature of the problem is complicated by the fact that target survival at anchor point T may be just as dependent on what happens to the patient after point T as it is on what happened prior to point T. As such, expected future survival cannot simply be modeled using event history alone and future events cannot be included in the model without invalidating the model as a recommender or accidentally introducing information leakage into the features, which could result in overfitting.
In certain embodiments a hybrid two-model approach may be taken. In one part of the approach, a historic only model is trained to derive “expectation” at each time point, and in another part of the approach a forward-looking clustering model is developed to isolate divergences between expected and observed survival, along with associated features.
Thus, in certain embodiments, the hybrid approach may include:
1. Building a dataset that only utilizes backward-looking features, derived at each event point on the timeline;
2. Training a model on such a dataset, to derive predictions for expected future survival at each time point;
3. Tagging these expected survival predictions at each time point to act as best-guess priors using all historic information content;
4. Building a “forward looking” feature set at each time point, taking care that implicit survival duration information is not incorporated into the features (in some cases the historic priors may be included as features in this set as well); and
5. Training a “Summarization/Clustering” model using the forward looking feature set.
At this point, following the “training” step, a determination may be made regarding whether to limit how forward-looking the features for this part may be. For example, it may not make sense to include a feature that is observed 2 years in the future if you are trying to predict 1 year survival likelihood. In addition, one could also consider giving less importance to features that happen further away from the anchor event. Finally, one may consider excluding event points that are observed after the outcome event of interest, even if such events occur within the X-year boundary. For example, if the first progression event observed is within 6 months, and we are predicting 2 year PFS, then for that patient all events between 6 months and 2 years should be excluded.
6. Comparing the expected survival predictions to the actual survival based on the forward looking model, for each of the forward-looking clusters, and identifying clusters of high divergence from the expected survival predictions, along with their constituent forward-looking feature set.
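The comparison in step 6 can be sketched in Python as a per-cluster difference between the observed survival rate and the mean expected survival prediction. The input layout, the function names, and the 0.25 threshold are assumptions made for illustration; the text does not specify how divergence is measured.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Hashable, Iterable, Tuple

# (cluster id, expected survival probability, observed outcome as 0 or 1)
Row = Tuple[Hashable, float, int]


def cluster_divergence(rows: Iterable[Row]) -> Dict[Hashable, float]:
    """Observed survival rate minus mean expected survival, per cluster."""
    grouped = defaultdict(list)
    for cluster_id, expected, observed in rows:
        grouped[cluster_id].append((expected, observed))
    return {
        cluster_id: mean(o for _, o in pairs) - mean(e for e, _ in pairs)
        for cluster_id, pairs in grouped.items()
    }


def high_divergence(rows: Iterable[Row], threshold: float = 0.25) -> Dict[Hashable, float]:
    """Keep clusters whose divergence magnitude reaches the (assumed) threshold."""
    return {c: d for c, d in cluster_divergence(rows).items() if abs(d) >= threshold}
```

A positive value indicates a cluster that survived more often than expected, and a negative value indicates a cluster that survived less often than expected.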
Thus the model is directed to determining how future events may impact an expected survival that is predicted by prior events, agnostic to whether the expected survival prediction for a particular sub-cluster is higher than the expected survival prediction for a different cluster (although the root cause of a divergence in expected survival predictions would also be of interest). That is, it is of interest to know whether the next actions have an impact on the patient's survival, or whether patient survival is mainly determined by their already-experienced events.
The prediction model may be implemented based on data from a large number of patients, using information about the patients' medical history and treatments along with information about their survival. In order to chronologically align the data from numerous patients, one or more anchor points (also referred to as “patient timepoints”) may be identified within the data (
There may be some imprecision with regard to the time of certain anchor point events. For example, a date of first diagnosis may occur several weeks earlier or later for a given patient (e.g. relative to when the disease began), depending on when the patient first notices symptoms or sees a clinician to receive the diagnosis. Therefore, to account for this lack of precision, in certain embodiments the anchor points may include a tolerance window before and/or after the date of the anchor point, which can provide flexibility in the modeling procedure. In various embodiments, the tolerance window may be +/−1 day, 3 days, 1 week, 2 weeks, 1 month, 2 months, 3 months, or other suitable time period.
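A tolerance window of this kind can be sketched in Python as a symmetric date comparison. The function name and the two-week default are assumptions; a window that differs before and after the anchor would need separate bounds.

```python
from datetime import date, timedelta


def within_tolerance(event_date: date, anchor_date: date,
                     tolerance: timedelta = timedelta(weeks=2)) -> bool:
    """True if the event falls within +/- tolerance of the anchor point."""
    return abs(event_date - anchor_date) <= tolerance
```

For example, with an anchor of Jul. 1, 2018 and the two-week default, an event dated Jul. 10, 2018 is treated as coinciding with the anchor, while an event dated Aug. 1, 2018 is not.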
With regard to the predictive model, in various embodiments a plurality of data is obtained or received for a plurality of patients, covering a period of time (e.g. a time span covering each of the patients' medical history from the time of their diagnosis until the current time or a time of death; the medical history may also begin before diagnosis).
The data may be processed to identify a plurality of patient timepoints (anchor points) that occur within the period of time covered by each patient's data. As discussed above, the anchor points or patient timepoints may include timepoints associated with any patient interaction with the medical system, including any interaction with an individual or facility that provides medical care or obtains medical information such as a care provider, a genetic sequencing organization, a hospital outpatient or inpatient facility, etc. The patient timepoints may be identified by a date attached to or associated with each piece of data in the received set of patient data.
In general both temporal and static features may be derived from the patient data but the analysis at this stage is purely backward-looking to avoid leaking future information. Different categories or classes of features include: “time since last/first XXX”; “number of XXX”; or “demographics.” Extracting features may include multiple lookback horizons, for example features may be bounded to the trailing 12 months or may be based on continuous historic analysis.
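A minimal Python sketch of purely backward-looking feature extraction follows. The event layout (a date paired with an event kind), the feature names, and the choice to count events dated on the timepoint itself are assumptions for illustration.

```python
from datetime import date, timedelta
from typing import Dict, List, Optional, Tuple

Event = Tuple[date, str]  # hypothetical (event date, event kind) pair


def prior_features(events: List[Event], timepoint: date, kind: str,
                   lookback: Optional[timedelta] = None) -> Dict[str, Optional[int]]:
    """Derive 'number of' and 'time since first/last' features for one event kind.

    Only events dated on or before the timepoint are used, so no future
    information leaks into the features. If lookback is given, events
    older than timepoint - lookback are ignored (e.g. trailing 12 months).
    """
    start = timepoint - lookback if lookback is not None else date.min
    past = sorted(d for d, k in events if k == kind and start <= d <= timepoint)
    return {
        f"number_of_{kind}": len(past),
        f"days_since_first_{kind}": (timepoint - past[0]).days if past else None,
        f"days_since_last_{kind}": (timepoint - past[-1]).days if past else None,
    }
```

Calling the function once per patient timepoint and per event kind yields one feature row per timepoint; static features such as demographics would be appended unchanged.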
In one particular example, four timepoints may be identified for a hypothetical patient A: date of biopsy collection, Jul. 1, 2018 (KRAS PL1S147GLU mutation with high SNP effect identified); start anastrozal and lotinib administration, Aug. 1, 2018; radiation therapy performed, Nov. 1, 2018; therapy outcome reported: progression of disease from stage 1 to stage 2, Jan. 1, 2019; imaging performed, Jul. 1, 2018 and Nov. 1, 2018. Other patients B, C, D . . . will each have their own sets of timepoints which may correspond to some of the same events (e.g. diagnosis, start medication, imaging, etc.) or to different events, or to a combination of some of the same events and some different events.
Based on the data for each of the patients and for each patient timepoint, an outcome target for an outcome event may be calculated within a horizon time window; a plurality of prior features may be identified; and a state of each of the plurality of prior features at the patient timepoint may be determined. An outcome event may include a state of the patient and/or the disease, such as progression or death, and the outcome target may be described with a target label such as a yes or no indication of whether the outcome will occur within a particular horizon time window from the patient timepoint/anchor point, along with a date of the endpoint. The horizon time window may include any suitable periods of time such as 3 months, 6 months, 9 months, 12 months, 24 months, 36 months, 48 months, or 60 months, or other periods of time.
In the case of hypothetical patient A, the analysis of a progression event occurring within 12 months of a timepoint is as follows:
Patient A: Jul. 1, 2018—Progression within 12 mo.—Yes, Jan. 1, 2019
Patient A: Aug. 1, 2018—Progression within 12 mo.—Yes, Jan. 1, 2019
Patient A: Nov. 1, 2018—Progression within 12 mo.—Yes, Jan. 1, 2019
Patient A: Jan. 1, 2019—Progression within 12 mo.—null
Since the data for patient A included information of a report of progression from stage 1 to stage 2 on Jan. 1, 2019, there is a valid outcome target for “progression within 12 months” for each of the first three time points: “yes.” However, the analysis for the final time point is indicated as “null” because no patient information is available after this date from which to inform the model. Although progression was reported on this date, no further information is available for patient A after this date.
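The labeling logic for patient A can be sketched in Python as follows. The function name, the approximation of 12 months as 365 days, and the rule that a “no” label requires follow-up covering the whole horizon window are assumptions; the text only specifies the “yes” and “null” cases.

```python
from datetime import date, timedelta
from typing import List, Optional, Tuple


def outcome_target(timepoint: date, outcome_dates: List[date], last_followup: date,
                   horizon_days: int = 365) -> Tuple[Optional[bool], Optional[date]]:
    """Label one patient timepoint for 'outcome within the horizon window'.

    Returns (True, date of outcome) if the outcome occurs after the timepoint
    and within the window, (False, None) if follow-up covers the whole window
    without an outcome (assumed rule), and (None, None) otherwise.
    """
    window_end = timepoint + timedelta(days=horizon_days)
    later = sorted(d for d in outcome_dates if timepoint < d <= window_end)
    if later:
        return True, later[0]
    if last_followup >= window_end:
        return False, None
    return None, None
```

Applied to the four timepoints of patient A with a single progression date of Jan. 1, 2019 and no later follow-up, the first three timepoints are labeled “yes” and the last is labeled null.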
The prior features may include various features related to a patient's medical condition and/or treatment. In various embodiments the prior features may include temporal/time-based events or features, structural or biological features, or molecular/genetic features, among other categories. In particular embodiments the prior features may include one or more of: time since starting a particular medication; time since taking a particular medication; time since last progressive therapy outcome (e.g. patient response to drug); time since metastasis; largest tumor size to date/last recorded tumor size; most severe effect of identified SNP (e.g. low effect, high effect); or RNA features (e.g. expression level per gene/transcript). In some embodiments the data may require additional processing, such as using an autoencoder, to reduce dimensionality of the feature space.
A state of each prior feature may be determined at each of the patient timepoints. For hypothetical patient A, the state of three features (time since starting medication A, time since last imaging, and highest SNP effect as identified by lab A) for each of the four patient timepoints is shown below (note that the value for “time since starting medication A” at the first patient timepoint is “null” since patient A did not start medication A until the next timepoint):
Patient A: Jul. 1, 2018
Patient A: Aug. 1, 2018
Patient A: Nov. 1, 2018
Patient A: Jan. 1, 2019
Next, a plurality of forward features may be identified for each patient timepoint of the plurality of timepoints which has a valid outcome target and for each combination of horizon time window and outcome event. The combinations of horizon time windows and outcome events may include “progression within 6 months,” “progression within 12 months,” “progression within 24 months,” “progression within 60 months,” “death within 6 months,” “death within 12 months,” “death within 24 months,” “death within 60 months,” etc.
For patient A, using a horizon time window/outcome event combination of “progression within 12 months,” the forward features may include:
Patient A: Jul. 1, 2018—
Patient A: Aug. 1, 2018—
Patient A: Nov. 1, 2018—
At this point a plurality of sets of predictions for the plurality of patients may be generated based on the plurality of prior features and the plurality of forward features, and a prediction model may be generated based on the sets of predictions using machine learning. In some embodiments the prediction model may be generated using gradient boosting.
The plurality of sets of predictions may be divided into several folds, where each fold includes data corresponding to a subset or subgroup of the plurality of patients such that the data for each patient is kept within the same fold (
Having generated the plurality of predictions, this information may be used to identify one or more “smart cohorts,” that is, one or more cohorts of patients whose disease progression and/or likelihood of survival is substantially different from expectation, for example significantly longer or shorter than would be expected. In general, a decision tree may be constructed using the prediction information to identify various potential smart cohorts, which end up being grouped in various leaf nodes of the decision tree. Disclosed herein are two approaches for constructing decision trees which are referred to as Offline Smart Cohorts and Online Smart Cohorts.
Offline Smart Cohorts
In certain embodiments, a method for identifying a cohort of patients may be developed. The method may include selecting a cohort of patients including a plurality of patients, for example a cohort of 500 breast cancer patients. In general, the cohort may be selected based on the patients having a particular condition in common, e.g. a particular disease.
The method may also include identifying a common anchor point in time from a set of anchor points associated with each of the group of patients, where the common anchor point is shared by each of the group of patients in the cohort. Selecting a common point between all patients facilitates visualization of the data and also makes it possible to prevent the same patient from appearing in the model multiple times at each of the patient's available anchors. The possible anchor points include time of diagnosis, times of treatments, time of metastasis, and others. In one particular embodiment, the time of diagnosis may be selected as the anchor point.
For each patient in the group of patients, a timeline associated with each of the group of patients may be aligned to the common anchor point. Next an outcome target may be identified, such as disease progression within 12 months. Subsequently, the plurality of sets of predictions that were previously generated, each of which includes a predicted target value, may be retrieved for each patient of the group of patients and for each of the plurality of forward features and the plurality of prior features. The predictions may include information such as that shown in Table 1:
More generally, the “target prediction” may take the form of: “Probability of Progression-Free Survival (PFS) in X months,” “Death in X months,” “Likelihood of taking medication in X months,” “Likelihood of other targets in X months,” etc. and may be in the form of a decimal value between 0 and 1. The “target actual” value is a binary, yes/no value that is shown as a 1 or a 0 and represents the occurrence or non-occurrence of the event within X months. In various embodiments the feature sets may include prior features and/or forward features, for example any of the features disclosed herein including those listed under the heading of “Features and Feature Models.” The prior features may include one or more of Age, Gender, Treatments (e.g. medications, procedures, therapies, etc.), Sequencing/Lab/Imaging results. The forward features, which are discussed further below, may include events, treatments, etc. that happen in the future between the anchor point and the observed target.
In various embodiments, hundreds or thousands (or other, greater numbers) of decision trees may be generated using this information, for example using a procedure similar to that described above for the Outliers procedure. For each of the decision trees that is constructed, for each feature of the plurality of forward features and the plurality of prior features, the following steps may be carried out.
A new node of the tree structure may be created based on the feature that results in the largest difference between the number of patients in the first subgroup and the second subgroup. A first branch may be created from the new node based on the first subgroup, and a second branch may be created from the new node based on the second subgroup. The steps of building the decision tree may then be repeated for each of the first branch and the second branch based on patients in the first subgroup and the second subgroup, respectively. This may continue until the tree is complete, as defined by either of the following: a maximum number of nodes or branches has been created, or every remaining node or branch contains fewer than a minimum number of patients.
The goal of constructing the decision trees is, for each patient and based on the features in the feature set, to predict the difference between the prediction and the actual outcome for the target by clustering the patients based on which features most accurately predict the difference between the prediction and the actual outcomes.
In certain embodiments, the method may include determining a similarity metric by determining how often a given patient ends up in a same leaf node of the trees with other patients across the hundreds or thousands of decision trees. Thus, for each patient of the group of patients, the method may include identifying a co-incidence of the given patient occurring within each of the plurality of leaf nodes, across the hundreds or thousands of decision trees, with each of the other of the plurality of patients. The similarity metric may be determined for the given patient based on a sum of the co-incidence divided by a total number of nodes the given patient is in across all of the hundreds or thousands of decision trees that are constructed and analyzed. In some embodiments a database of patient-patient similarity metrics may be generated based on determining the similarity metric for each of the plurality of patients. In other embodiments the similarity metric may be displayed, e.g. as a cohort radar plot. Further, data may be displayed in association with one or more of the steps outlined above to identify at least one of the plurality of features.
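By way of non-limiting illustration, the co-incidence calculation described above may be sketched as follows. The sketch assumes that each patient falls into exactly one leaf node per tree, so that the total number of leaf nodes a patient occupies equals the number of trees; the function name `leaf_similarity`, the patient identifiers, and the leaf identifiers are hypothetical and chosen only for this example.

```python
from itertools import combinations


def leaf_similarity(leaf_assignments):
    """Pairwise similarity from shared leaf membership.

    leaf_assignments maps a patient id to a list of leaf ids, one per tree.
    The similarity of two patients is the number of trees in which they
    share a leaf, divided by the number of leaves a patient occupies
    (assumed here to be one leaf per tree).
    """
    patients = sorted(leaf_assignments)
    n_trees = len(leaf_assignments[patients[0]])
    similarity = {}
    for a, b in combinations(patients, 2):
        shared = sum(
            1
            for leaf_a, leaf_b in zip(leaf_assignments[a], leaf_assignments[b])
            if leaf_a == leaf_b
        )
        similarity[(a, b)] = shared / n_trees
    return similarity


# Illustrative leaf ids for three patients across four trees.
leaves = {
    "A": [3, 1, 4, 2],
    "B": [3, 1, 5, 2],
    "C": [7, 1, 6, 9],
}
similarity = leaf_similarity(leaves)
```

In this example patients A and B share a leaf in three of the four trees, whereas patient C shares a leaf with each of them in only one tree. The resulting dictionary is one possible form for the database of patient-patient similarity metrics.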
The method may further include determining a similarity metric for a new patient, i.e. a patient different from the initial group of patients. The new patient may be matched with a subgroup of patients corresponding to a particular leaf node of the plurality of leaf nodes based on determining the similarity metric. A treatment may then be identified for the new patient based on matching the new patient with the subgroup of patients. Further, the database of patient-patient similarity metrics may be processed using a dimensionality reducing algorithm to identify a particular cohort of patients having a shared feature such as a shared prior feature or a shared forward feature. In general, dimensionality reduction identifies a certain subgrouping (such as K subgroups) where each of the subgroups 1-k has certain characteristics in common across the grouping that is identified from the entire patient cohort (standard population grouping).
Online Smart Cohorts
In addition to the plurality of predictions, the system may receive an outcome target, a subset of the plurality of forward features corresponding to the outcome target, and a cohort of patients including a subset of the plurality of patients. The cohort may be a group that shares a condition or trait of interest, for example the cohort may be a group of 20,000 breast cancer patients. This group will then be subdivided using the decision tree to find one or more particular subgroups of interest for further investigation.
Table 2 shows an example of the type of prediction data that might be received:
The forward features may include various future actions or conditions that relate to the patients and in certain embodiments could be used to advise patients who have a particular condition. Some of the forward features may be “actionable,” that is, they may include things that a given patient could do to possibly change their prognosis or outcome. For example, a doctor or other clinician could take certain steps or actions (e.g. prescribe a medication or combination of medications; prescribe a particular treatment such as surgery, chemotherapy, or radiation; or send a tumor sample for sequencing to receive molecular information such as a test for a DNA marker) to improve the patient's prognosis. Certain molecular features may or may not be considered actionable, based on whether the molecular information that is obtained is associated with a subsequent action or step. In various embodiments, features such as lab results, imaging results, tumor characterization (e.g. histology, grade, TNM stage, etc.) may not be included as forward features in order to avoid making a suggestion to a patient to take an action that is not within their control such as “lower N stage”, “increase hemoglobin density”, etc.
In various embodiments, this information could be used to counsel a particular patient group, e.g. for N Stage patients with X mutation, treatment A and B taken together improve probability for survival (PFS) within 12 months. For example, Stage 4 Breast cancer patients with the KRAS mutation are expected to progress based on their placement in a cohort (90% progression prediction) and should take anastrozal and lotinib together as an intervening therapy to improve PFS within 12 months (60% progression prediction) based on predictions after the selected anchor point of time of first metastasis. Other specific courses of action could be determined based on the data.
Examples of predictions include predictions of probability for survival within 12 months, for Patient A and B and timepoints T1 (Jan. 1, 2018) and T2 (May 1, 2018), expressed as a probability value between 0 and 1, as shown in Table 3:
The outcome target may be a probability for survival within 12 months, given as a 0 or 1, as shown in Table 4:
Below is an example of a subset of the plurality of forward features (FD1, FD2, FD3, each indicated below) corresponding to the outcome target including forward data corresponding to probability for survival within 12 months:
Jan. 1, 2018:
May 1, 2018:
The system may also receive an anchor point or patient timepoint, e.g. a time of first diagnosis, a time of first metastasis, a time of first treatment, etc.
A subset of the plurality of forward features may be selected. These features may include medications (future and historic) as well as sequencing (somatic sequencing (future or historic), germline sequencing, etc.). For each patient in the cohort having the anchor point, the prediction model may be provided with the selected subset of the plurality of forward features and a difference may be determined between each of the plurality of predictions and the outcome target.
For example, the model may receive data such as:
Patient A: [0.95-1], [Medications and sequencing data sets]
Patient B: [0.92-1], [Medications and sequencing data sets]
Patient C: [0.63-0], [Medications and sequencing data sets]
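By way of non-limiting illustration, the bracketed values above may be read as a prediction minus a target. The following sketch computes that difference for the three example patients; the helper name `prediction_minus_target` and the rounding to two decimal places are assumptions made for readability.

```python
# Each record holds (prediction, target actual) as in the bracketed values above.
records = {
    "Patient A": (0.95, 1),
    "Patient B": (0.92, 1),
    "Patient C": (0.63, 0),
}


def prediction_minus_target(prediction, target):
    """Difference between the model prediction and the observed 0/1 outcome."""
    return round(prediction - target, 2)


differences = {
    name: prediction_minus_target(prediction, target)
    for name, (prediction, target) in records.items()
}
```

Patients A and B have differences close to zero, while Patient C has a large positive difference, meaning the event was predicted as likely but did not occur.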
The data may include information such as “medications and sequencing data sets at the anchor point” which may include an N×M table of patients and respective features. The respective features may include information such as:
Patient A: Jul. 1, 2018 (date of anchor point)—
Col. 1: Will patient take medication A after timepoint and before date of endpoint (YES)
Col. 2: Did patient take medication A before timepoint (NO)
Col. 3: Highest SNP Effect As Identified by Lab A: Germline: KRAS: High (5)
Subsequently, for each feature of the selected subset of the plurality of forward features, a decision tree may be generated based on determining a greatest difference between each of the plurality of predictions and the outcome target. The decision tree may include a plurality of leaf nodes and one or more branch nodes, and each of the one or more branch nodes may include a pair of branches each of which includes a leaf node or a branch node, where the branches are formed based on a feature selected from the subset of the plurality of forward features.
Each of the plurality of leaf nodes of the decision tree may include a number of patients from the cohort of patients. In some embodiments, the decision tree may continue to split based on the difference between each of the plurality of predictions and the outcome target until the number of patients in a particular leaf node of the plurality of leaf nodes is less than a minimum number of patients. In other embodiments, the decision tree may continue to split based on the difference between each of the plurality of predictions and the outcome target until the number of levels of the decision tree has reached a particular number, that is, is equal to a maximum number of levels. In one specific example, each patient's status with regard to a feature “KRAS Somatic: Historical >3” may be used to split a branch node to two branches based on whether each patient's historical importance value for this marker is greater than 3 (high importance).
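By way of non-limiting illustration, the splitting and stopping behavior described above may be sketched as follows. The disclosure does not fix a particular split criterion; this sketch assumes, for illustration only, that a node splits on the binary feature giving the largest gap in mean prediction-minus-target value between the two resulting subgroups. The function names, feature names, and data values are hypothetical.

```python
from statistics import mean


def make_leaf(rows):
    """Summarize a leaf by its patient count and mean prediction-minus-target."""
    values = [r["residual"] for r in rows]
    return {"leaf": True, "n": len(rows),
            "mean_residual": mean(values) if values else 0.0}


def best_split(rows, features):
    """Return (feature, gap) with the largest gap in mean residual, or None."""
    best = None
    for feature in features:
        yes = [r["residual"] for r in rows if r[feature]]
        no = [r["residual"] for r in rows if not r[feature]]
        if not yes or not no:
            continue  # feature does not separate the patients in this node
        gap = abs(mean(yes) - mean(no))
        if best is None or gap > best[1]:
            best = (feature, gap)
    return best


def build_tree(rows, features, min_patients, max_levels, level=0):
    """Split until a node is below min_patients or max_levels is reached."""
    if len(rows) < min_patients or level >= max_levels:
        return make_leaf(rows)
    split = best_split(rows, features)
    if split is None:
        return make_leaf(rows)
    feature = split[0]
    remaining = [f for f in features if f != feature]
    yes_rows = [r for r in rows if r[feature]]
    no_rows = [r for r in rows if not r[feature]]
    return {
        "leaf": False,
        "feature": feature,
        "yes": build_tree(yes_rows, remaining, min_patients, max_levels, level + 1),
        "no": build_tree(no_rows, remaining, min_patients, max_levels, level + 1),
    }


# Hypothetical cohort: two binary forward features and a residual per patient.
KRAS = "kras_somatic_historical_gt3"
MED = "will_take_medication_a"
cohort = [
    {KRAS: True, MED: True, "residual": 0.63},
    {KRAS: True, MED: False, "residual": 0.70},
    {KRAS: True, MED: True, "residual": 0.55},
    {KRAS: False, MED: True, "residual": -0.05},
    {KRAS: False, MED: False, "residual": -0.08},
    {KRAS: False, MED: False, "residual": 0.02},
]
tree = build_tree(cohort, [KRAS, MED], min_patients=4, max_levels=3)
```

With these values the root splits on the KRAS feature, and both resulting nodes hold three patients, fewer than the minimum of four, so they become leaf nodes.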
The leaf nodes of the decision tree provide information that may be used to identify cohorts of interest. In some cases leaf nodes may have high values for the prediction target since prediction values are on average much higher than target values. For patient C in the examples above, the prediction indicated that it was likely that patient C's condition would progress but in fact it did not. In other cases leaf nodes may also generate low negative values for the difference of “prediction minus target”; for example, a prediction minus target may be [0.05-1]=−0.95, which would indicate that the patient's condition was predicted to be unlikely to progress but in fact did progress. However, in certain cases the leaf nodes may have a value of approximately zero, which indicates that the model has made an accurate prediction. The Smart Cohorts procedure focuses on the instances where patients' actual outcomes have greatly deviated from the expected result because these groups of patients can provide information as to what can be done to change the trajectory of a disease progression, whereas the cohorts where the prediction-target differences are closest to zero inform the model on what features are most important to a reliable prediction.
In some embodiments, analytics may be performed on one or more of the leaf nodes of the decision tree, where the analytics parse the branches of the leaf to render them meaningful. Only subsets of features that are sent to the model will be considered for creating splits. In one embodiment in which the subset of features includes “medication” and “molecular,” a particular leaf may show “Variant effect on KRAS (somatic) protein (post-anchor): >1” (a molecular feature) and “Will not take medication: Pembrolizumab” (a medication feature). Thus, analytics may be performed on the data to improve the overall quality and to improve the accuracy of the splitting and the resulting leaf nodes. In a particular case (although not relevant to the case in which medication and molecular features are used for splitting), analytics may be used to parse branching information to make otherwise ambiguous information meaningful: information indicating “Gender not male” may be set to “gender female.”
In another instance, which relates to the model in which splitting is based on medication and molecular features, the analytics may be used to map data to particular categories and/or ranges to render the data meaningful. For example, a range may be presented as:
which may map to:
where the term ‘negative’ indicates ‘tested and confirmed not to be mutated’ (as opposed to unknown status).
In certain embodiments the analysis which leads to generating branches from a node requires that all of the patients in the resulting leaf nodes meet the particular requirements, that is, the procedure may require 100% cohort participation to form branches. In some cases, however, features derived from the tree may miss statistically relevant cohort features due to this requirement for 100% cohort participation. Therefore in certain embodiments a Subset Aware Feature Effect (SAFE) algorithm may be run to allow features which are shared by most, but fewer than all, of the patients of the leaf cohort (e.g. shared by 95%), and which are not shared to the same extent (e.g. 95%) by patients in the whole cohort, to be included in a particular leaf.
In various embodiments the smart cohorts algorithm may be run in an observational mode (which does not use predictions and uses targets only, e.g. 0 or 1) or an algorithmic mode (which uses predictions, e.g. prediction minus target, [0.95-1]).
The SAFE algorithm has been developed to return viable feature importance ranks based on the selected sub-population of patients, without a need for re-training of the underlying models. Given the predictions from a pre-trained global multi cancer type model on the patient population, the SAFE algorithm may derive approximate high level importance ranks interactively and quickly. In addition, the feature importance ranks may be intelligently and dynamically adjusted to be relevant given a selected subset cohort of the population, without needing to re-train the global model. To optimize interpretability, in certain embodiments the SAFE feature importance algorithm may be agnostic of the underlying machine learning model that was used and may be made to cleanly handle assigning appropriate importance to correlated features. The SAFE algorithm may also provide the ability to explore feature importance on “feature+prediction” datasets for which targets may not necessarily have been defined. Finally, for more continuous features, the SAFE algorithm may enable deeper exploration of the change in feature importance with varying feature value.
In one embodiment, the SAFE algorithm may include calculating a population mean prediction. The algorithm may then include encoding categorical feature levels as the delta between the predicted value and the population mean prediction, where infrequent levels may be grouped together. The algorithm may further include clustering or bucketing of continuous features and processing these features as in the previous step. Next the algorithm may include, for each feature, aggregating an average (p−E(p)) per categorical level. Finally, the algorithm may include, for each feature, assigning an overall feature importance as the frequency-weighted sum of an absolute value of all values.
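By way of non-limiting illustration, the steps of this embodiment may be sketched as follows. The sketch is an interpretation under stated assumptions rather than a definitive implementation: the function names `bucket` and `safe_importance`, the `min_count` threshold for grouping infrequent levels, the single age boundary of 50, and all data values are hypothetical.

```python
from bisect import bisect_right
from collections import Counter, defaultdict
from statistics import mean


def bucket(value, edges):
    """Bucket a continuous value so it can be treated as a categorical level."""
    return f"bucket_{bisect_right(edges, value)}"


def safe_importance(rows, predictions, features, min_count=2):
    """Frequency-weighted sum of |average (p - E(p))| per feature level."""
    population_mean = mean(predictions)  # E(p)
    n = len(rows)
    importance = {}
    for feature in features:
        counts = Counter(row[feature] for row in rows)
        deltas = defaultdict(list)
        for row, p in zip(rows, predictions):
            # Infrequent levels are grouped together under one label.
            level = row[feature] if counts[row[feature]] >= min_count else "OTHER"
            deltas[level].append(p - population_mean)
        importance[feature] = sum(
            len(d) / n * abs(mean(d)) for d in deltas.values()
        )
    return importance


# Hypothetical sub-population with model predictions; no targets are needed.
ages = [70, 65, 72, 40, 45, 38]
kras = ["mutated", "mutated", "mutated", "wild type", "wild type", "wild type"]
gender = ["F", "M", "F", "M", "F", "M"]
rows = [
    {"age": bucket(a, [50]), "kras": k, "gender": g}
    for a, k, g in zip(ages, kras, gender)
]
predictions = [0.9, 0.8, 0.9, 0.2, 0.3, 0.2]

importance = safe_importance(rows, predictions, ["age", "kras", "gender"])
ranking = sorted(importance, key=importance.get, reverse=True)
```

In this example the age bucket and the KRAS status separate high predictions from low predictions and therefore rank above gender. Because the calculation uses only features and predictions, it can be re-run on any selected subset of rows without re-training the underlying model.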
As can be seen using the above-described approach, the algorithm does not rely explicitly on the presence of a target variable for deriving an importance ranking and instead only requires features and predictions. As such, it can effectively be applied to predictions made on unlabeled datasets, as well as generalizing to predictions obtained from different types of machine learning (ML) algorithms.
Although the SAFE algorithm does not directly factor in feature interactions, these values may be derived from manually constructed composite features. In addition, the SAFE algorithm is geared towards conveying how each feature impacts the predicted values from the underlying model, which is used as an indirect proxy for feature importance to predicting the target, although this will be subject to the efficacy of the model.
Notebooks
In various embodiments, one or more statistical models and analyses may be combined to accommodate a particular purpose and, through a variation of the initial analysis, may be used to solve a number of problems. Such a combination of statistical models and analyses may be stored as a notebook in the Interactive Analysis Portal 22. Notebook is a feature in the Interactive Analysis Portal 22 which provides an easily accessible framework for building statistical models and analyses. Once the statistical models and analyses have been developed, they may then be shared with different users to analyze and find answers to scientific and business questions other than those for which they were initially developed.
1) The Interactive Analysis Portal 22 allows input customization through a simple, intuitive point-and-click/drag-and-drop interface to narrow down the cohort for analysis. Cohorts which have been selected, either through the Interactive Analysis Portal 22, Outliers, Smart Cohorts, or other portals of the Interactive Analysis Portal 22, may be provided to a notebook for processing.
2) A custom application programming interface (API) having a library of function calls which interface with the Interactive Analysis Portal 22, underlying authorized databases, and any supported statistical models, visualizations, arithmetic models, and other provided operations may be provided to the user to integrate a notebook or workbook with the Interactive Analysis Portal 22 data, function calls, and other resources. Exemplary function calls may include listing authorized sources of data, selecting a datasource, filtering the datasource, listing clinical events of the patients in the current filtered cohort, identification of fusions from RNA or DNA, identification of genes from RNA or DNA, identifying matching clinical trials, DNA variants, identifying immunohistochemistry (IHC), identifying RNA expressions, identifying therapies in the cohort, identifying potential therapies that are applicable to treat patients in the cohort, and other cohort or dataset processing.
3) The Interactive Analysis Portal 22 allows the Notebook generation to perform one or more statistical models, analysis, and visualization or reporting of results to the narrowed down cohort without having the user code anything in the notebook as the selected models, analysis, visualizations, or reports of the notebook itself are configured to accept the cohort from the Interactive Analysis Portal 22 and provide the analysis on the cohort as is, without user intervention at the code level. Some models may have hyperparameters or tuning parameters which may be selected, or the models themselves may identify the optimal parameters to be applied based on the cohort and/or other models, analysis, visualizations, or reports during run-time.
4) The Interactive Analysis Portal 22 displays the prepared results to the user based on the selected notebook.
5) An associated user may then select a previously generated notebook which applies selected analysis to the narrowed down cohort without having the user code or recode anything in the notebook as the notebook itself is configured to accept the cohort from the Interactive Analysis Portal 22 and provide the notebook results without user intervention.
6) Users may track the computation resources used by their notebooks for understanding the costs for cloud computing or hardware resources over the network and may track the popularity of their notebook to judge the effectiveness of the statistical analysis that they provide through the notebook.
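By way of non-limiting illustration, a notebook cell using function calls of the kind listed in item 2) above might resemble the following sketch. The class `PortalClient` and all of its method names are hypothetical and are not part of any actual library; the class is an in-memory stand-in written only so that the example runs on its own, and the patient records are invented for the example.

```python
class PortalClient:
    """Hypothetical in-memory stand-in for a portal API (not a real library)."""

    def __init__(self, sources):
        self._sources = sources  # maps a source name to a list of patient records
        self._cohort = []

    def list_sources(self):
        """List the data sources the user is authorized to use."""
        return sorted(self._sources)

    def select_source(self, name):
        """Make one data source the active cohort."""
        self._cohort = list(self._sources[name])
        return self

    def filter(self, **criteria):
        """Keep only patients whose fields equal every given criterion."""
        self._cohort = [
            p for p in self._cohort
            if all(p.get(key) == value for key, value in criteria.items())
        ]
        return self

    def cohort_size(self):
        return len(self._cohort)

    def list_therapies(self):
        """List the distinct therapies recorded in the current cohort."""
        return sorted({t for p in self._cohort for t in p.get("therapies", [])})


# Invented example records.
sources = {
    "demo_db": [
        {"id": "P1", "diagnosis": "breast cancer", "therapies": ["pembrolizumab"]},
        {"id": "P2", "diagnosis": "breast cancer", "therapies": ["medication A"]},
        {"id": "P3", "diagnosis": "lung cancer", "therapies": ["medication A"]},
    ]
}

client = PortalClient(sources)
client.select_source("demo_db").filter(diagnosis="breast cancer")
therapies = client.list_therapies()
```

Chaining the selection and filtering calls mirrors the point-and-click narrowing of item 1), so that the same cohort can be handed to a notebook without further user code.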
In certain embodiments, notebooks provide a benefit to users by allowing the Interactive Analysis Portal 22 to provide custom templates to their selected data and leverage pre-built healthcare statistical models to provide results to users who are not sophisticated in programming. Internal teams may analyze curated data in order to support new healthcare insights that both help improve patient care and improve life science research. Similarly, external users have easy access to this proprietary real-world data for analysis and access to proprietary statistical models.
A billing model for a user may be provided on a subscription basis or an on-demand basis. For example, a user may subscribe to one or more data sets for a period of time, such as a monthly or yearly subscription, or the user may pay on a per-access basis for data and notebook usage, such as for loading a specific cohort with corresponding notebook and paying a fee to generate the instant results for consumption. Users may desire a benchmarking and optimization portal through which they may view and optimize their storage and computing resource usage.
Generating a notebook may be performed with a GUI for notebook editing. A user may configure a reporting page for a notebook. A reporting page may include text, images, and graphs as selected and populated by the users. Preconfigured elements may be selected from a list, such as a dropdown list or a drag-and-drop menu. Preconfigured elements include statistical analysis modules and machine learning models. For example, a user may wish to perform linear regression on the data with respect to specific features. A user may select linear regression, and a menu with checkboxes may appear with features from their data set which should be supplied to the linear regression model. Once filled out, a template for reporting the linear regression results with respect to the selected features may be added to the reporting page at a location identified by the active cursor or the drop location for a drag-and-drop element. If a user wishes to solve a problem using a machine learning model, the model may be added to the sheet. A header may be populated identifying the model, the hyperparameters or tuning parameters, and the reported results. In some instances, a model that was previously trained may then be applied to the current cohort. In other instances, the model may be trained on the fly, for example by selecting annotated features and associated outcomes for which the model should be trained. In an unsupervised machine learning model, the model may not require selection of annotated features as the features will be identified during training. In some embodiments, if a selected statistical model requires results from a trained model which are not computed in the template, the template may automatically add the trained model to generate the required results prior to inserting the selected statistical model to the notebook.
Statistical analysis models may be predesigned for calculating the arithmetic mean of the cohort with respect to a selected feature, the standard deviation/distribution of the cohort for a selected feature, regression relationships between variables for selected features, sample size determining models for subsetting the cohort into the optimal sub-population for analysis, or t-testing modules for identifying statistically significant features and correlations in the cohort. Other precomputed statistical analysis modules may perform cohort analysis to identify significant correlations and/or features in the cohort, data mining to identify meaningful patterns, or data dredging to match statistical models to the data and report out which models may be applicable and add those models to the notebook.
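By way of non-limiting illustration, a predesigned module for the mean, standard deviation, and t-testing operations mentioned above might be sketched as follows. The sketch assumes Welch's unequal-variance form of the t statistic, which the disclosure does not specify, and stops short of converting the statistic to a p-value; the feature (months of follow-up) and all values are invented for the example.

```python
from math import sqrt
from statistics import mean, stdev, variance


def welch_t(sample_a, sample_b):
    """Welch's t statistic for two samples (an assumed choice of t-test)."""
    standard_error = sqrt(
        variance(sample_a) / len(sample_a) + variance(sample_b) / len(sample_b)
    )
    return (mean(sample_a) - mean(sample_b)) / standard_error


# Invented values of one selected feature for two sub-populations of a cohort.
treated = [14, 16, 15, 17, 18]
control = [10, 12, 11, 13, 14]

summary = {
    "treated_mean": mean(treated),
    "treated_sd": stdev(treated),
    "control_mean": mean(control),
    "control_sd": stdev(control),
    "t_statistic": welch_t(treated, control),
}
```

A full module would additionally derive degrees of freedom and a p-value before reporting a feature as statistically significant.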
Machine learning models may apply linear regression algorithms, non-linear regression, logistic regression algorithms, classification models, bootstrap resampling models, subset selection models, dimensionality reduction models, tree-based models (such as bagging, boosting, and random forest), and other supervised or unsupervised models. As each model is selected, a target output may be requested from the user specifying which feature(s) the model should identify, classify, and/or report. For example, a user may select for the model to identify which features most closely correlate to patient survival in the cohort, or which features most closely correlate with a positive treatment outcome in the cohort. The user may also select which classification labels from the classification labels of the model that they wish the model to classify. In an example where the model may classify the cohort according to five labels, the user may specify one or more labels as a binary classification (patient has label, patient does not have label) such as whether a patient with a tumor of unknown origin originated from the breast, lung, or brain. The user may select only breast to identify for any tumors of unknown origin whether the tumor may be classified as coming from the breast or not from the breast.
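By way of non-limiting illustration, reducing a multi-label classification to the binary breast / not-breast question described above may be sketched as follows. The function name `to_binary` and the list of predicted labels are hypothetical; the labels stand in for the output of any classifier.

```python
def to_binary(predicted_labels, selected_label):
    """Map each predicted label to True (has label) or False (does not)."""
    return [label == selected_label for label in predicted_labels]


# Invented model outputs for five tumors of unknown origin.
predicted_origin = ["breast", "lung", "brain", "breast", "lung"]

is_breast = to_binary(predicted_origin, "breast")
breast_count = sum(is_breast)
```

The underlying model is unchanged; only the reporting of its labels is narrowed to the label or labels the user selected.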
The notebook user interface 2900 may be accessed by selecting Notebook from the Interactive Analysis Portal 22, such as via a sidebar menu 2910 either before or after filtering a database of patients to a desired cohort of patients via Interactive Cohort Selection Filtering 24.
Notebooks, or workbooks, may be internally curated at the company level by team members proficient in the fields of data science, machine learning, or other fields that routinely perform analytics on patient data and presented to the user via a custom workbooks widget 2920. The custom workbooks widget may be presented as a searchable list, searchable icons, a scrolling window which may scroll horizontally or vertically to display additional workbooks, or an expandable window which expands to provide access to all workbooks that the user is authorized to access. A workbook may be represented by an icon and associated text, such as illustrated for workbook 2960. The user may also generate personalized workbooks which may be accessed via the my workbooks widget 2930. A workbook viewing window 2950 may be provided to view a workbook selected from widgets 2920 or 2930. New workbooks may be created by the user by selecting a blank workbook 2940. Upon selection of the blank workbook 2940, a workbook generation interface may open.
Workbook generation interface 3000 may be provided to the user upon selection of a blank workbook from the notebook user interface. A text entry user interface element (UIE) 3010 may be provided to name the workbook for identification, searching, and indexing after generation. A series of button and drop down menu UIEs 3020 may be provided to compartmentalize grouped elements of the user interface. UIEs 3020 may assist the user in building and structuring the workbook's presentation. A cell UIE may provide selections pertaining to the currently selected cell of window 3040 having a block of code, such as commands for running the currently selected cell, terminating the currently selected cell, adding a cell, deleting a cell, running all cells, running all cells above, running all cells below, or terminating all cells. A kernel UIE may provide selections pertaining to one or more programming languages and/or kernels available to the user, such as Python, Structured Query Language (SQL), R, Spark, Haskell, Ruby, Typescript, Javascript, Perl, Lua, C, C++, Matlab, Java, Emu86, and other kernels. Selecting a kernel from the kernel UIE reloads the workbook so that the cells execute commands from the respective language. A widget UIE may provide selections pertaining to one or more supported code snippets for the active kernel. Code snippets may include code for creating visualizations such as a graph or a plot, code for simple arithmetic operations such as calculating a mean or a standard deviation, or code for more complex operations such as calculating a distribution and displaying a respective curve. A series of icon UIEs 3030 may be provided where each icon represents a popular command executed from the UIE 3020. Exemplary popular commands may include saving the document, adding a new cell, cutting or pasting code or cells, rearranging cells by moving them up or down the page in relation to any other cells, or running/terminating the code in the active cell(s).
One or more cells may be present in window 3040 for a user to insert one or more lines of code for the active kernel. A user may enter code or commands into a cell which may operate on an active database or cohort of patients. Running the cell will execute the entered code or command. Outputs, such as stdout, error messages, or print statements may be displayed directly below the cell upon running. Additionally, a text widget may be inserted which will provide formatting and associated text based upon the code from one or more cells. Such a text widget may provide a simple, readable format for results from executed code. In one embodiment, a text widget may be presented as a markdown cell supporting HTML, indented lists, text formatting, TeX/LaTeX equations, and inline tables.
In one example, a code block may perform arithmetic on a matrix of values. An associated output, such as printing the matrix, would result in a difficult-to-understand series of brackets, parentheticals, and commas. A visualization widget may receive a variable containing the matrix, and provide an image having the matrix values arranged in a table format that represents a matrix instead of a potentially confusing text output. Cells accept all commands associated with each supported kernel and programming language. A cell may import a module or library from another source (such as dask, fastparquet, pandas, or other libraries), support data structures, support conditional statements and logic loops, as well as establish and call functions. Cell output is generated asynchronously as the code runs so that the user may view the instantaneous output from the active code. If the output exceeds a preconfigured limit on the number of lines to display, the output may become scrollable text which may autoscroll with new entries or scroll upon user input.
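By way of non-limiting illustration, the contrast between raw printed output and a table-style rendering of a matrix may be sketched in a Python cell as follows. The sketch produces aligned text rather than an image, and the function name `format_matrix` and the matrix values are hypothetical.

```python
def format_matrix(matrix, precision=2):
    """Render a matrix as aligned rows instead of nested brackets."""
    cells = [[f"{value:.{precision}f}" for value in row] for row in matrix]
    width = max(len(cell) for row in cells for cell in row)
    return "\n".join(
        " | ".join(cell.rjust(width) for cell in row) for row in cells
    )


matrix = [[1.0, 2.5], [10.25, 3.0]]
doubled = [[2 * value for value in row] for row in matrix]  # simple arithmetic

raw_output = str(doubled)        # brackets and commas
table = format_matrix(doubled)   # one aligned line per matrix row
print(table)
```

Right-aligning every cell to a common width keeps the columns lined up regardless of the magnitude of the values.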
One or more templates may be provided in template window 3050 for the user's convenience. Templates may include one or more cells preconfigured to operate on an input data such as the filtered patient cohort, run one or more cells of code to generate logical results, and run one or more cells of text or visualizations to report out the results of the performed logic on the input data in a convenient manner. Templates may exist for charts, graphs, regressions, dimension reductions, classifications, RNA or DNA normalization, and other commonly used features across templates available to the user. Templates may be provided with the dataset or custom created by a user to be shared with other users.
Returning to notebook user interface 2900, the user may populate workbook viewing window 2950 with a custom workbook from the custom workbook widget 2920 by clicking and dragging the desired workbook from the widget to the viewing window. In one example, the user may select workbook 2960 with the mouse cursor and drag the workbook to viewing window 2950 as illustrated at 3120. Other intuitive mouse, keyboard, or gesture commands may be implemented in place of, or in addition to, clicking and dragging.
Notebook editor 3200 may auto-populate with Title 3210 and one or more cells 3240A-D based upon the user selected workbook. The user may rename the workbook or edit the workbook further using a text entry UIE 3220. The user may alter the configuration of the workbook via a series of button and drop down menu UIEs 3220, which may be provided to compartmentalize grouped elements of the user interface. UIEs 3220 may assist the user in building and structuring the workbook's presentation. A cell UIE may provide selections pertaining to the currently selected cell 3240A-D having a block of code, such as commands for running the currently selected cell, terminating the currently selected cell, adding a cell, deleting a cell, running all cells, running all cells above, running all cells below, or terminating all cells. A kernel UIE may provide selections pertaining to one or more programming languages and/or kernels available to the user, such as Python, Structured Query Language (SQL), R, Spark, Haskell, Ruby, Typescript, Javascript, Perl, Lua, C, C++, Matlab, Java, Emu86, and other kernels. Selecting a kernel from the kernel UIE reloads the workbook so that the cells execute commands from the respective language. A widget UIE may provide selections pertaining to one or more supported code snippets for the active kernel. Code snippets may include code for creating visualizations such as a graph or a plot, code for simple arithmetic operations such as calculating a mean or a standard deviation, or code for more complex operations such as calculating a distribution and displaying a respective curve. The user may further alter the configuration of the workbook via a series of icon UIEs 3230, where each icon represents a popular command executed from the UIE 3220.
Exemplary popular commands may include saving the document, adding a new cell, cutting or pasting code or cells, rearranging cells by moving them up or down the page in relation to any other cells, or running/terminating the code in the active cell(s).
The user may also edit the source code for each of cells 3240A-D by selecting the cell and selecting the cell UIE option for edit or pressing an associated keyboard shortcut.
Cells 3310A and 3310B become visible (3310C-D not shown) upon entering an edit cell view of the workbook having cells 3240A-D. Cell 3310A displays the code that generates a survival curve 3240A based on a propensity difference between a control cohort and a treatment cohort of patients. Cell 3310B displays the code that generates a scatterplot 3240B (not shown) based on normalized RNA expressions for two selected RNA transcriptomes in the filtered cohort of patients. Similar cells 3310C-D (not shown) may be generated for scatter and box plots 3240C-D (not shown) respectively.
The user may edit the code to modify the workbook for their purposes as well as add or remove additional cells to create a new customized workbook.
During edit cell view, one or more templates may also be provided in template window 3050 for the user's convenience. Templates may include one or more cells preconfigured to operate on input data such as the filtered patient cohort, run one or more cells of code to generate logical results, and run one or more cells of text or visualizations to report out the results of the performed logic on the input data in a convenient manner. Templates may exist for charts, graphs, regressions, dimension reductions, classifications, RNA or DNA normalization, and other commonly used features across templates available to the user. Templates may be provided with the dataset or custom created by a user to be shared with other users.
The user may drag any template into a cell to populate that cell with the code for generating the template's associated visualization or arithmetic.
Users may access the user interface for databases of patients which have been provisioned to the user by association with an institution or medical facility with a subscription to each patient database. Custom workbooks may also be provided on a database-by-database basis where workbooks are selected for their applicability to the patients within each database. Accessing the user interface may spawn resources in a cloud computing environment with access to any authorized databases and/or workbooks. User resource usage in the cloud computing environment may be monitored and tracked to supplement accurate billing for resources consumed by the user. Users may request and purchase other databases of patients. Databases of patients may be purchased based on characteristics of the patients within them. For example, a user may desire a database of patients who have been diagnosed with breast cancer. A look-up table (LUT) or cancer ontology may be referenced to provide alternative matchings for breast cancer, such as ductal carcinoma of the breast, cancer of the breast, mammary carcinoma, breast carcinoma, or other relevant terminology. Patients satisfying the requested diagnosis and any of the alternative terminologies from the LUT or cancer ontology may be combined into a database and delivered to the user. The user may then perform statistical analysis and research on the data in accordance with the disclosure herein.
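The alternative-terminology matching described above can be sketched as follows. This is a minimal illustration only: the table contents beyond the terms listed above, the `diagnosis` field name, and the function name `select_patients` are hypothetical and are not taken from the disclosure.

```python
# Hypothetical look-up table (LUT) mapping a requested diagnosis to alternative terms.
DIAGNOSIS_LUT = {
    "breast cancer": {
        "ductal carcinoma of the breast",
        "cancer of the breast",
        "mammary carcinoma",
        "breast carcinoma",
    },
}


def select_patients(patients, requested_diagnosis):
    """Return patients whose diagnosis matches the request or any LUT alternative."""
    requested = requested_diagnosis.strip().lower()
    accepted = {requested} | DIAGNOSIS_LUT.get(requested, set())
    return [p for p in patients if p["diagnosis"].strip().lower() in accepted]


# Illustrative records; the identifiers and diagnoses are made up for the example.
patients = [
    {"id": "p1", "diagnosis": "Mammary carcinoma"},
    {"id": "p2", "diagnosis": "Lung adenocarcinoma"},
    {"id": "p3", "diagnosis": "Breast cancer"},
]
matched = select_patients(patients, "breast cancer")
```

In this sketch, `matched` holds the records that could be combined into a database for delivery to the user; a cancer ontology could replace the flat table without changing the calling code.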
Other web interfaces may be incorporated into the Interactive Analysis Portal 22 similar to the Outliers, Smart Cohorts, and Notebook portals above. One such other web interface may include identifying effects of a therapy, procedure, clinical trial, or other medical event on a disease state of a patient using propensity scoring. Propensity scoring and associated web interface is described in further detail in U.S. patent application Ser. No. 16/679,054, titled “Evaluating Effect of Event on Condition Using Propensity Scoring,” filed Nov. 8, 2019, which is incorporated herein by reference in its entirety.
The machine may operate in the capacity of a server or a client machine in a client-server network environment, as a peer machine in a peer-to-peer (or distributed) network environment, or as a server or a client machine in a cloud computing infrastructure or environment. The machine may be a personal computer (PC), a tablet PC, a set-top box (STB), a Personal Digital Assistant (PDA), a cellular telephone, a web appliance, a server, a network router, a switch or bridge, or any machine capable of executing a set of instructions (sequential or otherwise) that specify actions to be taken by that machine. Further, while a single machine is illustrated, the term “machine” shall also be taken to include any collection of machines that individually or jointly execute a set (or multiple sets) of instructions to perform any one or more of the methodologies discussed herein.
The example computer system 3400 includes a processing device 3402, a main memory 3404 (such as read-only memory (ROM), flash memory, dynamic random access memory (DRAM) such as synchronous DRAM (SDRAM) or DRAM, etc.), a static memory 3406 (such as flash memory, static random access memory (SRAM), etc.), and a data storage device 3418, which communicate with each other via a bus 3430.
Processing device 3402 represents one or more general-purpose processing devices such as a microprocessor, a central processing unit, or the like. More particularly, the processing device may be a complex instruction set computing (CISC) microprocessor, a reduced instruction set computing (RISC) microprocessor, a very long instruction word (VLIW) microprocessor, a processor implementing other instruction sets, or processors implementing a combination of instruction sets. Processing device 3402 may also be one or more special-purpose processing devices such as an application specific integrated circuit (ASIC), a field programmable gate array (FPGA), a digital signal processor (DSP), a network processor, or the like. The processing device 3402 is configured to execute instructions 3422 for performing the operations and steps discussed herein.
The computer system 3400 may further include a network interface device 3408 for connecting to the LAN, intranet, internet, and/or the extranet. The computer system 3400 also may include a video display unit 3410 (such as a liquid crystal display (LCD) or a cathode ray tube (CRT)), an alphanumeric input device 3412 (such as a keyboard), a cursor control device 3414 (such as a mouse), a signal generation device 3416 (such as a speaker), and a graphic processing unit 3424 (such as a graphics card).
The data storage device 3418 may be a machine-readable storage medium (also known as a computer-readable medium) on which is stored one or more sets of instructions or software 3422 embodying any one or more of the methodologies or functions described herein. The instructions 3422 may also reside, completely or at least partially, within the main memory 3404 and/or within the processing device 3402 during execution thereof by the computer system 3400, the main memory 3404 and the processing device 3402 also constituting machine-readable storage media.
In one implementation, the instructions 3422 include instructions for an interactive analysis portal (such as interactive analysis portal 22 of
In another implementation, a virtual machine 3440 may include a module for executing instructions for a patient filtering module 3426 (such as the interactive cohort selection filtering interface 24 of
In some contexts, it can be useful for researchers to provision patient data records matching certain selection criteria across multiple different data sources. The data sources can be stored in various formats and, in some cases, owned by different entities. In some cases, the data sources are stored on different compute resources. For example, a first data source can be at least partially hosted at a storage or memory of a virtual or physical server distinct from a physical or virtual server hosting another data source. In some cases, a researcher may require the data of a patient data record to include certain characteristics (e.g., a cancer type) that may be included in some data sources and not in others. In some conventional systems, provisioning patient data records may require performing different queries across different data sources. For example, in some cases, a research user can acquire patient data records by individually requesting patient data records from each data source. A research user can formulate the request as an abstract request to an owner of a data source (e.g., through an email, a phone call, an input form, etc.) and the research user can provide the selection criteria to the owner of the given data source. The owner or administrator of the data source can query the data source in accordance with instructions from the research user to identify patient data records in the data source meeting the research user's criteria. In some cases, the process of provisioning patient data records can include manual elements, can be labor intensive, and can require a provisioning time that can be on the order of hours, days, or months.
There is therefore a need to provide systems and methods for provisioning patient data records from among different data sources that can partially or completely eliminate manual provisioning or human intervention to identify patient data records in the individual data sources, and can reduce a time to provision (e.g., on the order of minutes, seconds, etc.) the patient data records.
Researchers may need access to specific subsets of patient data within the system to conduct research projects. These projects can include, for example, running machine learning processes against a subset of the patient data or defining a specific patient study, which may include identifying a cohort of patients that is relevant to the study, identifying the therapies that may be applicable, and/or identifying a desired outcome for the study. In some embodiments, therefore, the interactive analysis portal 22 can include a self-service provisioning module for provisioning an environment (e.g., a workspace) in which the researcher can perform analysis on patient data for defined cohorts of patients. However, such cohort identification and provisioning may be more complex than the cohort identification procedures described above, such as the population filtering techniques discussed using the disclosed cohort funnel & population analysis user interface. For example, in some cases, research requirements can exceed the computational capacity of a distributed computing layer associated with the interactive analysis portal, and therefore additional resources must be provisioned.
In some embodiments, research users need access to full patient data records of patients meeting one or more desired criteria. For example, a research user may desire to perform research on patient data records for patients having a certain type of cancer and meeting certain demographic conditions. The patient data records can be sourced from multiple data sources, which can each include patient data records for a given subset of patients. For example, a research user may search for patient data records meeting desired criteria within databases of data records for patients having stage IV ovarian cancer, stage IV pancreatic cancer, and stage IV liver cancer (see, e.g.,
In some cases, a research user does not have direct access to at least one of the data sources hosting the patient data records and cannot perform queries directly against the data sources to identify patient data records. Additionally, a research user may not have access to view attributes associated with individual patient data records in at least one of the data sources. In some cases, a research user may have partial access to a data source, to access a subset of patient data records on the data source, but does not have access to other records in the data source. In some embodiments, multiple data sources can be stored on a common storage (e.g., a storage associated with the same computing device, physical server, virtual server, or cloud environment), and the research user may have access to one of the data sources, but not others. For example, the research user may desire to perform a query on multiple data sources, including data sources owned by the research user and other data sources not owned by the research user. In some cases, the disclosed processes, techniques, and user interfaces can provide a benefit by providing a research user information about a composition of patient data records to be provisioned before provisioning. Thus, access to the interactive analysis portal 22 and associated UIs can mitigate a research user's lack of direct access to the patient data records, and can be useful in allowing the research user to further define criteria for patient data records to be provisioned. Once provisioned, the patient data records can be copied or populated into a patient data store that is accessible to the research user. Thus, the disclosed systems beneficially allow a research user to perform preliminary analysis of patient data records of a defined patient cohort before provisioning the patient data records, which can incur a cost to the research user.
Thus, according to some embodiments, systems and methods can be provided to provision patient data records from multiple data sources according to criteria provided by a research user. For example, as described below, the disclosed systems and methods can receive a research user input indicating selection criteria for patient data records, and the system can query one or more data sources to identify patient data records within the respective data source meeting the selection criteria. The patient data records from the respective data sources can be processed to produce transformed patient data records in a format that is consumable by the research user, and the transformed patient data records can be combined (e.g., in a database), and provided to the research user for analysis of the patient data records.
Therefore, according to some embodiments, a user of the system can elect to provision a compute environment, which can be referred to as a workspace, for the purpose of running research workloads against a subset of the patient data. The user can interact with the workspace separately from the interactive analysis portal. In some embodiments, a workspace can comprise a logically partitioned technology environment in which compute resources (e.g., servers, virtual machines, storage, GPUs, CPUs, memory, networking elements, etc.) and other technological services (e.g., platform services, application services, database services, messaging, logging services, etc.) can be provisioned. In some embodiments, logically partitioning a workspace can involve role-based access controls, with specified users having access within the environment to perform specific tasks (e.g., reading data, provisioning resources, running machine learning workloads, etc.). Additionally or alternatively, logical partitioning can be implemented on a network level, with at least a portion of the compute resources of a workspace being provisioned into a dedicated subnet. In some embodiments, certain IP addresses can be whitelisted, allowing a user of whitelisted devices to access the workspace, and blocking access from any device that is not specifically allowed by the whitelist.
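As a non-limiting illustration, the combination of role-based access control and IP whitelisting described above might be checked as in the following sketch. The network ranges, role names, action names, and the function `is_request_allowed` are all assumptions made for the example and do not describe any particular implementation.

```python
import ipaddress

# Hypothetical whitelist of networks and hypothetical role-to-permission mapping.
ALLOWED_NETWORKS = [
    ipaddress.ip_network("10.20.0.0/24"),
    ipaddress.ip_network("192.0.2.15/32"),
]
ROLE_PERMISSIONS = {
    "analyst": {"read_data", "run_workload"},
    "admin": {"read_data", "run_workload", "provision_resources"},
}


def is_request_allowed(source_ip, role, action):
    """Allow a request only if the IP is whitelisted and the role grants the action."""
    address = ipaddress.ip_address(source_ip)
    ip_ok = any(address in network for network in ALLOWED_NETWORKS)
    return ip_ok and action in ROLE_PERMISSIONS.get(role, set())
```

Under these assumptions, a request from a device outside the whitelisted networks is refused regardless of role, and a whitelisted device is still limited to the tasks its role permits.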
Patient data records to be analyzed within a workspace can be subject to data governance requirements and can further be licensed to researchers for a limited duration of time, or for a limited scope. Accordingly, it can be advantageous for patient data records to be modified before entering a workspace. For example, information identifying a given patient that is contained in a patient data record, an image file, a database entry, or metadata can be removed before the records are imported (e.g., seeded) into a workspace for analysis by research users of the workspace. Further, rules can be implemented for the workspace to limit or prohibit egress of data from the workspace. In some embodiments, data within the workspace can only be egressed to certain resources that can make that data available within the interactive analysis portal 22.
Within a workspace, then, research users can provision compute resources and other services to analyze patient data records for certain cohorts of patients whose records are within the workspace. In some cases, a workspace can include pre-provisioned resources, or can include preconfigured services for certain workloads that may be commonly performed in research. For example, machine learning services may be provided in containers, which can contain modules of the services and can be deployed on various server platforms. Providing containerized services can provide an advantage as it can reduce dependencies on services from specific service providers. For example, in some embodiments, a workspace can be provisioned from one of several cloud service providers (e.g., Amazon Web Services, Google Cloud Platform, Microsoft Azure, etc.), or alternatively, from on-premise systems. In some cases, one choice of service provider may provide certain advantages for certain workloads, or may, for example, be more cost-effective than another service provider. Deploying containers with modular services that can be deployed on multiple cloud service providers or computing platforms can thus prevent a dependency on any given cloud service provider or computing platform. In some embodiments, the system 10 can include a workspace provisioning engine that can select an environment in which to provision a workspace, based on information which can include projected costs, workload suitability, capacity, or other metrics that can be relevant.
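One possible selection rule for such a workspace provisioning engine is sketched below. The disclosure does not specify how the metrics are weighed, so the rule shown (cheapest environment that has enough capacity and meets a suitability threshold) is an assumption, as are the `Environment` fields, the threshold value, and the environment names.

```python
from dataclasses import dataclass


@dataclass
class Environment:
    name: str
    projected_cost: float  # projected cost of the workload, arbitrary units
    free_capacity: int     # e.g., number of available GPUs
    suitability: float     # 0.0 to 1.0, higher means better suited to the workload


def select_environment(environments, required_capacity, min_suitability=0.5):
    """Pick the cheapest environment with enough capacity and adequate suitability."""
    candidates = [
        env for env in environments
        if env.free_capacity >= required_capacity and env.suitability >= min_suitability
    ]
    if not candidates:
        return None  # no environment can host the workspace
    return min(candidates, key=lambda env: env.projected_cost)


# Illustrative candidates; names and numbers are invented for the example.
environments = [
    Environment("provider_a", projected_cost=120.0, free_capacity=8, suitability=0.9),
    Environment("provider_b", projected_cost=80.0, free_capacity=2, suitability=0.8),
    Environment("on_premise", projected_cost=95.0, free_capacity=4, suitability=0.6),
]
chosen = select_environment(environments, required_capacity=4)
```

Here `provider_b` is cheapest but lacks capacity, so the sketch falls back to the next cheapest qualifying environment.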
In some embodiments, a workspace can include a provisioned set of patient data to be analyzed, as well as services and computing elements to analyze the data. Further, a workspace can include access control to allow only select users (e.g., research users) to access the patient data within the workspace and provision and utilize computing resources within the workspace. A workspace may include monitoring elements for resource usage within the workspace. In some embodiments, the workspace can be accessed independently of accessing the interactive analysis portal 22, such as through a workspace UI, an API, or a command line interface (CLI) that can allow users to interact with the workspace.
Referring now to
Still referring to
In some embodiments, a research project can include tools and functionality for analyzing or performing research experiments on patient records within the patient data cohorts of the research project. At block 3508, a determination can be made if the modeling capabilities of the system 10 can be used to conduct the given research project. This determination can be made based on any number of factors, such as, for example, a capacity of the system 10 to perform the workloads, the amount of patient data to be analyzed, the cost of running workloads on the system 10, and/or the level of control needed by research users to define the modeling to be performed. In some embodiments, this determination can be made by a user of the research project. In other embodiments, the decision can be made automatically by the system 10. If the system capabilities are sufficient, at block 3510, research users can utilize the capabilities provided by the system 10 (e.g., notebooks, decision trees, or any other modeling capabilities supported by the distributed compute and modeling layer 38) to model and analyze the data as desired.
If, at block 3508, the user or the system determines that the capabilities of the system 10 are insufficient to perform the desired research experiment, and a workspace has not been provisioned for the research project, a workspace can be provisioned at block 3512 and associated with the research project. In some embodiments, the workspace can be provisioned in a compute environment (e.g., within data centers or logical computing environments of cloud service providers such as AWS, GCP, Azure, etc.). Users may be provided the option to create a workspace at any time from the research project. For example, as shown in
At block 3514 data can be provisioned (e.g., seeded) into the workspace. For example, the patient data records belonging to the patients of the patient data cohorts in the research project can be copied into the workspace (e.g., the patient data records can be copied to a storage of the workspace distinct from the storage of the data sources from which the patient data records were sourced). In some embodiments, the patient data records can be deidentified before being provisioned into the workspace, which can include removing metadata or characteristics of the patient data records that could identify a patient. In some embodiments, the patient data store 14 can be upstream of healthcare operations that would require identification of patients, and thus, the records in the patient data store 14 can already be deidentified or otherwise anonymized, with no additional de-identification process being necessary for provisioning the records into the workspace. As discussed further below, patient data records can be provisioned into a workspace as entries in a database containing patient data records, or as individual files stored in file systems or object storage systems in the workspace.
The patient cohort can be comprised of patient data records from a plurality of data sources meeting the patient cohort definition. In some embodiments, the plurality of data sources can be included in the patient data store 14, and can comprise databases, object storage systems, file systems, or a combination thereof. In some embodiments, the patient data records can be obtained from a third party through making an API call to the third party data source based on the patient cohort definition. At block 3524, the process 3514 can check if all data sources have been queried for patient data records meeting the patient cohort definition. The process 3514 can query all data sources independently (e.g., simultaneously or without being dependent on a query of another data source being completed) or can iterate through the data sources to identify patient data records meeting the patient cohort definition. If a data source of the data sources has not yet been queried, the process 3514 can proceed to perform operations against that data source.
In some embodiments, as described above, a type of the data source (e.g., a relational database, non-relational database, object storage system, file storage system, etc.) of data sources to be queried can vary between data sources. Further, data can be differently arranged between different data sources. For example, column names may vary between data sources for similar attributes, or the values provided for given attributes can vary between data sources. At block 3526, the patient cohort definition (e.g., the patient cohort definition defined at block 3504) may be translated into a data source query for each data source. For example, where a data source is a relational database, the patient cohort definition can be translated into a SQL query to identify entries in the data source matching the definition. In some examples, including when the data source is an object storage system, the patient cohort definition can be translated into an object storage API call including a query for objects or metadata of objects meeting the patient cohort definition, the query being in a format that is consumable by the object storage API. In some examples, a machine learning model can be used to identify patient data records in a data source that meet the patient cohort definition. A machine learning model can be advantageous, as it can be trained to identify patient data records in different types of data source, and data sources storing patient data records in different formats. Thus, a single machine learning model can be used to translate the patient cohort definition to identify patient data records meeting the patient cohort definition in multiple data sources.
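The translation at block 3526 for relational data sources can be illustrated with the following sketch, which turns one cohort definition into a parameterized SQL query per data source using a per-source column mapping. The definition format, the schema dictionary, the table and column names, and the function `translate_to_sql` are hypothetical; an in-memory SQLite database stands in for a real data source.

```python
import sqlite3

# Hypothetical cohort definition: canonical attribute name -> required value.
cohort_definition = {"cancer_type": "ovarian", "stage": "IV"}

# Hypothetical per-source schemas: table name plus canonical-to-local column names.
SOURCE_SCHEMAS = {
    "source_a": {"table": "patients", "columns": {"cancer_type": "dx", "stage": "dx_stage"}},
    "source_b": {"table": "records", "columns": {"cancer_type": "cancer", "stage": "stage"}},
}


def translate_to_sql(definition, schema):
    """Build a parameterized SELECT for one data source from the cohort definition."""
    clauses, params = [], []
    for attribute, value in definition.items():
        # Identifiers come from trusted schema configuration; values are bound as parameters.
        clauses.append(f'"{schema["columns"][attribute]}" = ?')
        params.append(value)
    sql = f'SELECT * FROM "{schema["table"]}" WHERE ' + " AND ".join(clauses)
    return sql, params


# Stand-in for the first data source, with invented rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patients (id TEXT, dx TEXT, dx_stage TEXT)")
conn.executemany(
    "INSERT INTO patients VALUES (?, ?, ?)",
    [("a1", "ovarian", "IV"), ("a2", "ovarian", "II"), ("a3", "liver", "IV")],
)
sql_a, params_a = translate_to_sql(cohort_definition, SOURCE_SCHEMAS["source_a"])
rows_a = conn.execute(sql_a, params_a).fetchall()
sql_b, params_b = translate_to_sql(cohort_definition, SOURCE_SCHEMAS["source_b"])
```

The same definition yields differently worded queries for `source_a` and `source_b`, which is the point of the translation step; differences in attribute values between sources would need an additional value mapping that this sketch omits.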
At block 3528, the process 3514 can identify patient data records in the respective data source matching the patient cohort definition. The patient data records can be identified using the translated query generated at block 3526. Identifying the patient data records can include obtaining information associated with each patient data record. In some embodiments, the process 3514 can add the patient data records to a table or newly generated data source upon determining that the patient data record meets the patient cohort definition. In some embodiments, data about the patient data records is obtained to produce a summary of the identified patient data records. For example, a number of the patient data records matching the patient cohort definition can be obtained from each data source and can be summed into a total number of patient data records across all data sources that meet the patient cohort definition.
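The summary described above, in which per-source counts are summed into a total, might look like the following sketch. The in-memory lists stand in for real data sources, and the names `count_matches` and `summarize` are hypothetical. A thread pool is used only to show that each source can be queried without waiting on another.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory stand-ins for two data sources.
DATA_SOURCES = {
    "ovarian_stage_iv": [{"age": 61}, {"age": 47}, {"age": 70}],
    "liver_stage_iv": [{"age": 55}, {"age": 66}],
}


def count_matches(records, predicate):
    """Count the records in one data source that satisfy the cohort predicate."""
    return sum(1 for record in records if predicate(record))


def summarize(data_sources, predicate):
    """Query each source independently and total the matching record counts."""
    with ThreadPoolExecutor() as pool:
        futures = {
            name: pool.submit(count_matches, records, predicate)
            for name, records in data_sources.items()
        }
        per_source = {name: future.result() for name, future in futures.items()}
    return {"per_source": per_source, "total": sum(per_source.values())}


summary = summarize(DATA_SOURCES, lambda record: record["age"] >= 55)
```

Only counts leave each source in this sketch, which is consistent with presenting a summary to a user who cannot view the underlying records.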
At block 3530, the process 3514 can provide a user information (e.g., at a UI) about the identified patient data records. For example, as shown at least in
At block 3532, the patient cohort definition can be refined further, which can include providing a new patient cohort definition at block 3522, translating the patient cohort definition into a query for each data source at block 3526, and identifying patient data records in the respective data sources meeting the new patient cohort definition at block 3528. Information about the patient data records matching the new patient cohort definition can be provided to the user at block 3530 for the user to determine if further redefinition is required, or whether the patient cohort definition is satisfactory.
If, at block 3532, the patient cohort does not require redefinition (e.g., the patient cohort is acceptable to the research user), the process can proceed to block 3534. In some embodiments, the user can provide an indication that the patient cohort definition is acceptable (e.g., by clicking the “License Data” button 3926 shown at least in
At block 3536, a number of the patient data records can be provisioned into a patient cohort database within the workspace provisioned at block 3512 of process 3500. The number of patient data records can be the number selected by the research user at block 3534. The patient data records can be transformed before being provisioned into the patient cohort database. For example, a full or partial deidentification, anonymization, pseudonymization, or other anonymization techniques can be performed on the patient data records before they are provisioned into the patient cohort database in the workspace. Further, the patient data records from different data sources can require that the data therein be standardized before being provisioned into the patient cohort database. As an example, patient data records of a first data source can include abbreviations for cancer types while patient data records in a second data source can include the full name of the cancer type, and the full name of the cancer type can be transformed into an abbreviated name before the patient data record is provisioned into the patient cohort database to facilitate analysis of the data. In another example, attributes of a patient data record may be unstructured, and can be structured (e.g., columns can be populated for the patient data record) according to the structure of the patient cohort database. In another example, the patient data records can be provisioned into a data store of the workspace as unstructured files, and the data of the patient data records can be unchanged when the records are provisioned into the workspace.
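The standardization example above (full cancer-type names converted to abbreviations) and the limit imposed by the user-selected number can be sketched together as follows. The abbreviation table, the `cancer_type` field, and the function names are assumptions for illustration; which records are chosen when fewer than the total are provisioned is not specified here, so the sketch simply takes the first ones.

```python
# Hypothetical mapping from full cancer-type names to abbreviations.
CANCER_ABBREVIATIONS = {
    "non-small cell lung cancer": "NSCLC",
    "hepatocellular carcinoma": "HCC",
}


def standardize_cancer_type(record):
    """Return a copy of the record with the cancer type in abbreviated form."""
    value = record["cancer_type"]
    abbreviation = CANCER_ABBREVIATIONS.get(value.strip().lower(), value)
    return {**record, "cancer_type": abbreviation}


def provision(records, provisioning_number):
    """Standardize and return the first `provisioning_number` records."""
    if not 0 <= provisioning_number <= len(records):
        raise ValueError("provisioning number must not exceed the total number of records")
    return [standardize_cancer_type(r) for r in records[:provisioning_number]]


# Records as they might arrive from two sources with different conventions.
identified = [
    {"id": "r1", "cancer_type": "NSCLC"},
    {"id": "r2", "cancer_type": "Non-small cell lung cancer"},
    {"id": "r3", "cancer_type": "Hepatocellular carcinoma"},
]
provisioned = provision(identified, 2)
```

Values already in abbreviated form pass through unchanged, so records from both kinds of source end up consistent in the patient cohort database.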
Returning now to
At block 3518, technology resources can be provisioned within the workspace to analyze the patient data, perform research experiments, or run machine learning workloads against the patient data records in the environment. In some embodiments, research users can provision individual compute resources such as servers or virtual machines, having predefined memory, processing (e.g., CPUs, GPUs), storage, and networking aspects. In some embodiments, storage, memory, compute, and networking elements can be provisioned separately. A workspace can further include service offerings which can include, for example, database services, or machine learning services that may be provisioned independently of, or in conjunction with compute resources. In some embodiments, the system 10 can seed containers with defined modular services into the workspace for use in analyzing, modeling, and transforming patient data records in the workspace.
At block 3520, research workloads can be run against the provisioned patient data records in the workspace. An extract, transform, load (“ETL”) process may be required to provide the patient data records to machine learning services or workloads for the records to be utilized in research. The ETL operations can be performed using compute resources or services provisioned at block 3514. Upon being prepared, the data of the patient data records can be provided to train machine learning models. For example, the patient data records can be images of patient samples, and can include information in metadata of the image including pathology, demographics, site, etc. The model to be trained can be a model that can run against images to identify a pathology of a sample based on image data. The output of the workloads can be a machine learning model. Alternatively, the workload can transform the patient data records or enrich the patient data records, and the end result can be a transformed patient data store of the workspace including the transformed or enriched patient data records. In some embodiments, data outputs from research workloads in a workspace can be provided to the interactive analysis portal 22, and can be usable within the research project UI. In some cases, the data outputs can be added to the patient data store 14, and can be made available to other users of the interactive analysis portal 22.
Referring now to
In some cases, as illustrated in
The workspace 3600 can include a workspace patient data store 3606, including data for patient data cohorts on which research is to be performed. Data to be included in the workspace patient data store 3606 can be the patient data records of the cohorts defined in the research project (e.g., the data selected at block 3504 in process 3500, as shown in
Before populating the workspace patient data store 3606 (e.g., seeding the patient data), the patient records of the patient data store 14 can be processed at processing module 3608. For example, the patient records to be seeded in the workspace 3600 can be a subset of patient data, and the processing module 3608 can filter the patient data records for only those records within a specific cohort or cohorts. As well, the format and content of the patient data records can be adjusted at processing module 3608 to match a format that can be compatible with the format of the workspace patient data store 3606. In some cases, personally identifiable information must be removed from patient records before the records can be seeded into the workspace 3600. Thus, at processing module 3608, identifying information can be removed from records, which can, for example, include removing some metadata from the record, or copying the data without certain attributes.
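A minimal sketch of the filtering and attribute-removal steps attributed to processing module 3608 follows. The field names, the set of identifying attributes, and the function `process_for_seeding` are hypothetical, and dropping a fixed list of fields is only one simplistic approach; it is not presented as a sufficient de-identification method.

```python
# Hypothetical set of identifying attributes to omit when copying a record.
IDENTIFYING_FIELDS = {"name", "mrn", "date_of_birth", "address"}


def process_for_seeding(records, cohort_ids):
    """Keep only records in the cohort and copy them without identifying attributes."""
    return [
        {key: value for key, value in record.items() if key not in IDENTIFYING_FIELDS}
        for record in records
        if record["patient_id"] in cohort_ids
    ]


# Invented records for illustration.
store_records = [
    {"patient_id": "p1", "name": "Example One", "cancer_type": "ovarian", "stage": "IV"},
    {"patient_id": "p2", "name": "Example Two", "cancer_type": "liver", "stage": "IV"},
]
seeded = process_for_seeding(store_records, cohort_ids={"p1"})
```

The source records are left untouched; only the copies destined for the workspace omit the listed attributes.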
The workspace 3600 can include compute resources 3610 and services 3612 for processing and analyzing patient data. In some embodiments, a workspace can include standard compute resources and services upon provisioning. In other embodiments, research users 3602 can provision resources within the workspace 3600 after it has been created.
The compute resources 3610 can be virtual or physical servers, and can have memory, storage, processing (e.g., CPUs, vCPUs, GPUs or vGPUs), and networking components. The compute resources 3610 can be provisioned according to specifications of a user, and the user can specify a quantity of storage and memory to be included with the compute resources 3610, and a number of CPUs or GPUs. In some embodiments, compute resources 3610 can be standard compute resources, and research users can select compute resources with a standardized specification. In some embodiments, users can provision services 3612, and the compute resources 3610 can be provisioned automatically within the workspace 3600 according to the computing requirements of the services. Compute resources 3610 within a workspace 3600 can scale as can be necessary to perform computing workloads. Services 3612 can be services available for provisioning within a workspace to perform tasks such as ETL, training machine learning models, etc. In some embodiments, containerized applications can be provided, which can be deployed on commoditized compute resources, so that the workspaces can be independent of specific technology environments or cloud service providers.
As further illustrated in
Monitoring services 3620 can be provided for resources in a workspace 3600. These monitoring services can track usage of compute resources, storage, database, and other services within the workspace 3600. The monitoring can provide useful insights into performance of resources within the workspace 3600. Alerts can be provided, for example, when research users exceed an allowable usage target, or alternatively, the monitoring can be used to determine a cost of running resources within the workspace 3600.
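By way of non-limiting illustration, the usage tracking, alerting, and cost determination described above can be sketched as follows. The `UsageSample` structure, the unit rates, and the function name `summarize_usage` are hypothetical and are introduced only for this example; the disclosure does not specify how usage is metered or priced.

```python
from dataclasses import dataclass


@dataclass
class UsageSample:
    user: str
    cpu_hours: float
    storage_gb_hours: float


# Hypothetical unit rates, assumed for illustration only.
CPU_HOUR_RATE = 0.05
STORAGE_GB_HOUR_RATE = 0.0001


def summarize_usage(samples, cpu_hour_target):
    """Return (total cost, users whose CPU usage exceeds the allowable target)."""
    cpu_by_user = {}
    cost = 0.0
    for s in samples:
        # Accumulate per-user CPU usage and the running cost of the workspace.
        cpu_by_user[s.user] = cpu_by_user.get(s.user, 0.0) + s.cpu_hours
        cost += s.cpu_hours * CPU_HOUR_RATE + s.storage_gb_hours * STORAGE_GB_HOUR_RATE
    # An alert is raised for each user over the allowable usage target.
    alerts = sorted(u for u, hours in cpu_by_user.items() if hours > cpu_hour_target)
    return cost, alerts


samples = [
    UsageSample("alice", 30.0, 1000.0),
    UsageSample("bob", 5.0, 0.0),
    UsageSample("alice", 25.0, 0.0),
]
cost, alerts = summarize_usage(samples, cpu_hour_target=50.0)
```

In this sketch, a single pass over the usage samples serves both purposes named above: the same data yields the alert list and the cost figure.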
In some embodiments, it can be useful to provide an access layer 3622 through which research users 3602 can access the workspace 3600. For example, an API can be provided at access layer 3622 to allow provisioning of compute resources and services within a workspace, and to access and run workloads against patient data records of the workspace patient data store 3606. An API at access layer 3622 can provide an abstraction layer so that research users 3602 can interact with workspaces 3600 in a standardized way, without producing a dependency on a specific cloud service provider. This can allow for cost savings, as it can facilitate the selection of technology platforms and environments with the lowest cost transparently to research users 3602. In some embodiments, a research user 3602 can be an application or virtual identity and can rely on standardized APIs to interact with a workspace and run machine learning workloads against patient data records. Providing an API access layer 3622 can thus allow for automation of research tasks that could otherwise require manual steps on the part of a research user. In some embodiments, the access layer 3622 could be a workspace UI from which resources 3610 and services 3612 can be provisioned within the workspace 3600. In other embodiments, the access layer 3622 can be a CLI.
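By way of non-limiting illustration, the abstraction provided by an API at the access layer 3622 can be sketched as follows. The class names `ProviderBackend`, `InMemoryBackend`, and `AccessLayer`, and the method signatures, are assumptions made for this example and do not correspond to any particular cloud service provider's API. The in-memory backend is a stand-in for a provider-specific implementation.

```python
from abc import ABC, abstractmethod


class ProviderBackend(ABC):
    """Provider-specific provisioning hidden behind a common interface."""

    @abstractmethod
    def provision(self, cpus: int, memory_gb: int) -> str:
        """Provision one compute instance and return its identifier."""


class InMemoryBackend(ProviderBackend):
    """Stand-in backend for illustration; a real backend would call a provider SDK."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.instances = []

    def provision(self, cpus: int, memory_gb: int) -> str:
        instance_id = f"{self.name}-{len(self.instances) + 1}"
        self.instances.append((instance_id, cpus, memory_gb))
        return instance_id


class AccessLayer:
    """Standardized entry point used by research users, human or automated."""

    def __init__(self, backend: ProviderBackend) -> None:
        self._backend = backend

    def provision_compute(self, cpus: int, memory_gb: int) -> str:
        if cpus < 1 or memory_gb < 1:
            raise ValueError("cpus and memory_gb must be positive")
        return self._backend.provision(cpus, memory_gb)


layer = AccessLayer(InMemoryBackend("provider-a"))
instance_id = layer.provision_compute(cpus=4, memory_gb=16)
```

Because callers depend only on `AccessLayer`, the backend could be exchanged for another provider without changing the calling code, which is the provider independence described above.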
Referring now to
As shown in the “data source” column of patient cohort data table 3706, patient data cohorts 3708 can be sourced from multiple sources. For example, as shown, a first patient data cohort 3708a includes patient data records from an “Ovarian Stage IV” database, while a second patient data cohort 3708b includes patient data records from a “Liver Stage IV” database. Further, cohorts of a research project can be snapshots, including data from a given cohort at a particular point in time, as is the case, for example, for the first patient data cohort 3708a. Alternatively, a patient data cohort 3708, including the second patient data cohort 3708b as shown, can be a live cohort, which can be updated with records matching the filter criteria of the given cohort as records are added or updated in the data source for the cohort.
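By way of non-limiting illustration, the distinction between a snapshot cohort and a live cohort can be sketched as follows. The `Cohort` class, its constructor parameters, and the record layout are hypothetical; the sketch assumes a data source modeled as an in-memory list and filter criteria modeled as a predicate function.

```python
import copy


class Cohort:
    """A cohort over a data source, either frozen (snapshot) or re-evaluated (live)."""

    def __init__(self, source, criteria, live):
        self._source = source      # shared reference to the data source
        self._criteria = criteria  # predicate implementing the filter criteria
        self._live = live
        # A snapshot copies the matching records at creation time.
        self._frozen = None if live else [copy.deepcopy(r) for r in source if criteria(r)]

    def records(self):
        if self._live:
            # A live cohort re-applies the criteria to the current data source.
            return [r for r in self._source if self._criteria(r)]
        return self._frozen


def is_stage_iv(record):
    return record["stage"] == "IV"


source = [{"id": 1, "stage": "IV"}]
snapshot = Cohort(source, is_stage_iv, live=False)
live = Cohort(source, is_stage_iv, live=True)

# A matching record added to the data source later appears only in the live cohort.
source.append({"id": 2, "stage": "IV"})
```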
Research users can use research projects and workspaces of an interactive user portal to analyze patient data records, and patient data records to be analyzed can originate from within the interactive analysis portal, or externally. In some non-limiting examples, the patient data cohorts (e.g., cohorts 3708 represented in rows of the data table 3706) can include patient data records that are entirely sourced from the patient data store 14. In other embodiments, data of a patient data cohort can be imported into the interactive analysis portal 22 at an interface (e.g., through a GUI, API, or CLI) through user input, and a user can provide patient data records directly to the research project (e.g., the user can upload the patient data records in a format consumable by the interactive analysis portal, such as csv, Excel spreadsheet, yml, xml, html, etc.). In other embodiments, a data source for a patient data cohort 3708 can be a data source external to the interactive analysis portal 22 (e.g., an online data source accessible from a web page, database, API). In some embodiments, each individual patient data cohort 3708 can include ownership information that can indicate an originator of the patient data cohort. The owner of the patient data cohort can have greater functional control over the patient data cohort than other users of a workspace. For example, the owner can redefine the patient data cohort to include more or fewer patient data records, while other users in the workspace can only use the patient data records within the patient data cohort.
Other tabs 3704b, 3704c of the top region 3702 can be selected to display other parameters of a research project. For example, tab 3704b can be a “Files” tab, and when selected can display files of the research project in the display region 3702. The files can include saved notebooks of the research project, or artifacts of the research project, or reports, or any other file that can be associated with a research project. Tab 3704c can be a “People” tab, and when selected can display the research users (e.g., research users 3602) associated with the research project. When tab 3704c is selected, the display region 3702 may also include a button, dropdown, or other input allowing a research user to add other research users, remove research users, or amend a level of access or a role of a given research user.
The heading region 3701 can further include additional inputs for interacting with and modifying aspects of the research project. For example, a button 3710 can be provided to perform queries on patient data records of the system. These queries can include filtering patient data records as described above to generate a patient data cohort. The results of a query generated upon selection of the button 3710 can be a patient data cohort that can be added to the patient data cohorts of the research project and can be displayed as an additional row in the patient cohort data table 3706. An additional button 3714 can be provided, which, when selected, can display a dropdown menu 3716. The dropdown menu 3716 can include the option 3718 for creating a workspace (e.g., workspace 3600). In some embodiments, if a workspace is already associated with the research project, the option 3718 to create a workspace can be greyed out or otherwise unavailable for selection. In some embodiments, upon selection of option 3718, a researcher can be presented with a form or can otherwise input preferences regarding how the workspace is to be provisioned.
When a workspace is provisioned for a research project, a link can be provided on the research project UI 3700 to access the workspace. In this regard,
Further, the sidebar region 3703 can include additional functionality for a research project. In the illustrated embodiment, the sidebar region 3703 is located on the right side of the research project UI 3700, but in other embodiments, the sidebar region 3703 can be on a left side of the UI 3700 or could alternatively be oriented horizontally and be located along a top or a bottom of the research project UI 3700. In some embodiments, including as shown, the sidebar region can include filtering functionality to filter patient data records to generate additional patient data cohorts for the research project.
As stated above, patient data cohorts can be defined for use with research projects and workspaces, and can include patient data records provisioned through an interactive analysis portal, or alternatively could include patient data records provided by a user and imported into the research project. In some examples, interfaces can be provided for a research user to define patient data cohorts, which can be a subset of patient data records in a patient record database having common characteristics, aspects, or attributes. The interfaces can be any interface which can be usable to select a subset of patient data records, including a graphical user interface, an API, or a command line interface. It should be understood that any functionality described with respect to one interface can be performed using any other interface (e.g., a filtering function available through a GUI can be performed using an API).
According to some embodiments,
Upon selection of an option for a selection criterion, a user may elect to continue filtering patient data records to further define a patient data cohort, or, alternatively, could choose to provision the defined cohort into the research workspace (e.g., by licensing the data records, as further described below). In the illustrated example, an additional selection criterion 3802b, which corresponds to a modality of the patient data records, is applied to the patient data cohort to further narrow the patient data cohort. In some cases, the selection criterion 3802b can be selected from a plurality of options available for selection criteria (e.g., filters). For example, the GUI 3800 includes a filter selection section 3812, from which selection criteria 3802 can be selected and applied to narrow or filter a patient data cohort to include patient data records having desired characteristics. As shown, the filter selection section 3812 is located in a panel on a left side of the GUI 3800, but a filter selection section can be positioned at any location on the GUI 3800, including a right sidebar, a top or a bottom bar, etc. Further, the filter selection section 3812 can be collapsed or shown as desired by a research user. The filter selection section 3812 can include heading elements 3814, which can be dropdown menus, accordion menus, or expandable sections that list the available selection criteria included in each grouping. For example, under an “outcomes” filter grouping 3814d, filters can be provided based on survival rates of patients in the defined patient data cohort, or responses to treatments, etc. Individual selection criteria 3802 can be dragged from the filter selection section 3812 into the cohort definition section 3801 and can then comprise a cascading row 3804 containing options for the corresponding selection criterion 3802.
The filter selection section 3812 can also include a search bar 3816, which can allow a user to search for a desired selection criterion by typing into the search bar 3816 the name of the selection criterion which the user desires to apply to define a patient data cohort. Additionally or alternatively, a search bar 3817 can be provided below cascading rows 3804, which can be a natural location for a user to search for a next filter as the user works downwardly through the GUI 3800 in filtering patient data records to define a patient data cohort. In some embodiments, a subsequent selection criterion 3802b can be selected automatically, or by default, upon selection of one of the options 3808 of the previous selection criterion 3802a. For example, upon selecting the “all data” option 3808a including patient data records for a corresponding data source, a Modality selection criterion 3802b can be automatically presented to the user for selection of a modality by which to define patient data records.
As shown, in some cases, a selection criterion can provide non-exclusive options which may be selected individually or in combination. In the example shown, the Modality selection criterion 3802b includes options 3820 which can be selected individually or in combination with one another, as visually communicated to the user through presenting the user with check boxes available for each option 3820. In the illustrated example, no options 3820 are selected for the modality selection criterion 3802b. However, the corresponding numerical indicator 3806b shows a decrease in the number of available patient data records from the numerical indicator 3806a. In some cases, some patient data records may not be filterable for a given selection criterion, and thus, selecting a selection criterion can exclude from the cohort patient data records which cannot be filtered using that selection criterion. Accordingly, in the illustrated example, the difference between the number displayed by the numerical indicator 3806a and the number displayed by the numerical indicator 3806b can be the number of patient data records within the “Unlimited” data source (i.e., the source selected through option 3808a) that do not have associated data for modality of the patient data record. Options 3820 for modality of patient data records can include a clinical modality option (e.g., patient data records including clinical data), a DNA modality option (e.g., patient data records including genetic data), an RNA modality option (e.g., patient data records including transcriptome data), and an imaging modality option (e.g., patient data records with which imaging data is associated). Further, in the illustrated embodiment, selection of multiple options 3820 filters the patient data records using a logical AND operation.
In some embodiments, however, a logical OR can be used to select patient data records, and in yet other embodiments, a user can choose whether to filter patient data records using options 3820 with either a logical AND or a logical OR.
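By way of non-limiting illustration, filtering by modality with a logical AND or a logical OR, including the exclusion of records that have no modality data, can be sketched as follows. The function name `filter_by_modality`, the `modalities` key, and the modality labels are assumptions made for this example.

```python
def filter_by_modality(records, selected, mode="AND"):
    """Filter records by selected modality options using a logical AND or OR."""
    if mode not in ("AND", "OR"):
        raise ValueError("mode must be 'AND' or 'OR'")
    selected = set(selected)
    kept = []
    for record in records:
        modalities = record.get("modalities")
        if modalities is None:
            # Records without modality data cannot be filtered by this criterion
            # and are excluded as soon as the criterion is applied.
            continue
        modalities = set(modalities)
        if not selected:
            kept.append(record)  # criterion applied, but no option chosen yet
        elif mode == "AND" and selected <= modalities:
            kept.append(record)  # record has every selected modality
        elif mode == "OR" and selected & modalities:
            kept.append(record)  # record has at least one selected modality
    return kept


# Hypothetical records; the third has no modality data.
modality_records = [
    {"id": 1, "modalities": ["clinical", "DNA"]},
    {"id": 2, "modalities": ["clinical"]},
    {"id": 3},
]
```

With no options selected, the sketch returns records 1 and 2 only, which mirrors the decrease between numerical indicators 3806a and 3806b described above. Selecting both "clinical" and "DNA" returns record 1 under AND and records 1 and 2 under OR.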
Referring now to
In some embodiments, selection of patient data records to define a patient data cohort can include selection of only two selection criteria (e.g., the data criterion 3802a and the modality criterion 3802b), but in other embodiments additional or different selection criteria can be used to define a patient data cohort. Thus, further selection criteria can be added through GUI 3800 to define a cohort, including through use of the search bar 3816.
Additionally or alternatively to using the search bar 3817 to define selection criteria, selection criteria (e.g., filters) can be selected through the filter selection section 3812. As shown in
As shown in
It can be advantageous for a research user to view a profile of the patient data cohort meeting set selection criteria before provisioning or licensing the patient data records. In some cases, for example, the patient data records selected can include a bias which can negatively affect any analysis of the patient data records. In these cases, providing a research user analytics or a profile on the selected patient data records can allow the research user to determine if further filtering is necessary, or if selection criteria need to be removed or adjusted to obtain a more useful patient data cohort. For example, a research user may desire a substantially equal balance of the sexes of patients in a data cohort, and a significant imbalance in the sexes of patients included in the data cohort could prompt the research user to further refine the selection criteria for the patient data cohort. Analysis can be provided in visualizations which can visually communicate (e.g., through graphs and charts) to the user statistical information about the defined patient data cohort. In this regard, as illustrated in each of
In some examples, users can provision (e.g., license) patient data records of a defined patient data cohort into a research project to perform further analysis thereon. In this regard, the cohort definition GUI 3800 can include a button 3826, as shown in
Upon selecting the cohort preview element 3825 (e.g., by clicking, tapping, hovering over, sliding, etc.), the user can view a cohort preview GUI 3900, as shown in
As the data completeness tab 3904a is selected in
The data completeness panel 3906a can include additional visualizations 3908 including progress bars for data completeness of different attributes associated with patient data records of the defined cohort. For example,
Additional summary visualizations 3914 can be provided in the data completeness panel 3906a to provide a research user with further information about the patient data records in the defined patient data cohort. For example, summary visualizations 3914 can include a modality Venn diagram 3914a, a “most complete fields” visualization 3914b, and a “Least complete Fields” visualization 3914c. The modalities Venn diagram 3914a can provide a view of the modalities selected for the patient data cohort (e.g., as defined in selection criteria 3802b shown in
The GUI 3900 can include a filter summary section 3920, which can display the selection criteria which define the patient data cohort. For example, the filter summary section 3920 can include a visual indicator 3922 corresponding to each selection criterion (e.g., selection criteria 3802 shown in
In some examples, additional filters or selection criteria can be applied to the patient data cohort directly from the GUI 3900 to achieve a patient data cohort having desired characteristics or a desired profile for future analysis. For example, from the data completeness panel 3906a, a research user can select a given attribute and apply additional filters or selection criteria to the patient data cohort based on the selected attribute. In this regard,
Further, upon selection of an attribute through selection of the progress bar 3910a, a filter modal 3928 can be displayed to the user including options available for filtering or applying selection criteria based on the selected attribute. For example, within the data completeness panel 3906a, a filter modal 3928 can present a user an option to further filter the patient data records of the patient data cohort to include only patient data records having the selected attribute populated. The filter modal 3928 can include a numerical indicator 3930 indicating the number of patient data records matching the filter or selection criteria to be applied, which, in the illustrated embodiment is a number of patient data records for which histology information is populated. The filter modal 3928 can include an implementation element 3932, which, as displayed is a button, and selection of the implementation element 3932 can apply the filter to the patient data cohort. Thus, when the filter represented by the filter modal 3928 corresponding to the histology attribute is applied, the data completeness panel 3906a can be updated, as shown in
As noted with respect to
As illustrated in
For example, visualization 3940a shows a number of somatic variants present for different genes 3944. A bar graph 3946 can be provided for each gene showing the number of patient data records in each respective source including somatic variants of the identified gene 3944. As shown in the bar graph 3946a for the KRAS gene 3944a, each data source (e.g., as identified by options 3942) can include patient data records including somatic KRAS variants, and the majority of patient data records including a somatic KRAS variant can originate from the main application (e.g., corresponding to the main application option 3942a). The bar graph 3946b for somatic variants of the SDKN2B gene 3944b, however, may communicate that the uploaded data (e.g., corresponding to the uploaded data option 3942c) does not include any somatic variants for the SDKN2B gene 3944b. A research user could thus decide to exclude patient data records with a somatic variant in the SDKN2B gene 3944b from the patient data cohort, to increase a quality of the data in the cohort. In some cases, the data may be excluded by selecting the SDKN2B gene 3944b or corresponding bar graph 3946b in the GUI 3900 and filtering as desired in a filter modal that can be provided (e.g., similar to filter modal 3928).
Turning now to
A user can select additional services and functionalities to be included with provisioned patient data cohorts, and accordingly, the modal 4000 can include an “Add-ons” section 4008c, within which add-on options 4014 can be displayed for the user to select or decline to select. As shown, add-on options 4014 can include an option 4014a to download data (e.g., de-identified or otherwise anonymized patient data records) from the research project. In the illustrated embodiment, add-on options 4014 further include an enhanced curation option, a scientific professional services option (e.g., support for running analysis against the provisioned patient data records), and a custom research compute option, which can provide the user the ability to define compute resources to be used to analyze patient data records within the research project. A “Terms and Pricing” section 4008d can allow the user to select a billing frequency 4016. As shown, the user can select a weekly billing frequency, a quarterly billing frequency, or an annual billing frequency. In other embodiments, a user can be presented additional frequency options for billing, including, for example, monthly, bi-monthly, bi-weekly, etc.
In some embodiments, a user can be provided a cost of licensing patient data records before provisioning the patient data records of the patient data cohort into a research project. In some cases, the cost of licensing the selected cohort or data records can be displayed within the licensing modal 4000. The cost can be dependent on selected options (e.g., any or all of options 4010, 4012, 4014) and a number of patient data records to be included in the patient data cohort. For example, a cost to license the patient data records can be greater if an add-on is selected than if no add-ons are selected. Further, the cost can be presented to the user as a cost per unit time, which can correspond to the selected billing frequency 4016.
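By way of non-limiting illustration, a cost that depends on the number of patient data records, the selected options, and the selected billing frequency can be sketched as follows. The pricing model shown (a per-record annual rate plus flat annual fees for selected options, divided evenly across billing periods), the rate values, and the function name `license_cost_per_period` are hypothetical; the disclosure does not specify a pricing formula.

```python
# Billing periods per year for the frequencies shown in the illustrated embodiment.
PERIODS_PER_YEAR = {"weekly": 52, "quarterly": 4, "annual": 1}


def license_cost_per_period(n_records, annual_rate_per_record, option_fees_annual, frequency):
    """Return the cost per billing period under an assumed additive pricing model."""
    if frequency not in PERIODS_PER_YEAR:
        raise ValueError(f"unsupported billing frequency: {frequency}")
    # Assumed model: per-record licensing plus a flat annual fee for each selected option.
    annual_cost = n_records * annual_rate_per_record + sum(option_fees_annual)
    # Present the cost per unit time corresponding to the selected billing frequency.
    return annual_cost / PERIODS_PER_YEAR[frequency]


# Hypothetical figures: 1,000 records at 2.00 per record per year.
without_addons = license_cost_per_period(1000, 2.0, [], "quarterly")
with_addon = license_cost_per_period(1000, 2.0, [400.0], "quarterly")
```

Under these assumed figures, the quarterly cost is 500.0 with no add-ons and 600.0 with one add-on, consistent with the cost being greater when an add-on is selected.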
In some examples the user can opt for fewer file options 4010, or could deselect tool options 4012, including as necessary to reduce a cost of licensing to a desired amount. Additionally, as shown in
As shown in
Referring now to
Displaying filters and a number of patient data records of a patient data cohort to a user, as described, can beneficially allow the user to make appropriate decisions in analyzing the data. For example, a user could determine that a sample size of patient data records is too small, or that too many records are included which may be costly and increase a time required to run workloads against the patient data records. In some cases, the selected patient data cohort can be a patient data cohort different than a cohort which the user may intend to analyze, and providing the information about the patient data cohort can allow the user to change a data cohort to be analyzed before running workloads on the incorrect or undesired patient data records. In some embodiments, the research project UI 4100 can include elements which can allow a researcher to select a different patient data cohort on which to run workloads and analyses. For example, a cohort definition button 4111 can be provided on the research project UI 4100, and when selected, can allow the user to define a new cohort. In some cases, upon selecting the cohort definition button 4111, the user is navigated to the GUI 3800 described above, or a similar GUI allowing the user to define a patient data cohort by applying selection criteria to a set of patient data records to filter the patient data records as desired. In some embodiments, the user can navigate to a data panel of the research project UI 4100 by selecting the data tab 4104a, and the data panel can include a table of patient data cohorts that have been provisioned into the research project (e.g., similar to table 3706 shown in
The research project UI 4100 can include a workspace settings panel 4112 which can be displayed to the user when the workspace settings tab is active, as shown in
In the illustrated embodiment, the workspace settings panel 4112 includes an environments section 4114, and a usage section 4116 corresponding to technological environments associated with a workspace and usage of compute resources of a workspace respectively. In other embodiments, a workspace settings panel can include additional sections including, for example, a section including a cumulative cost of the workspace.
The environments section 4114, as shown, can display information about one or multiple technological environments 4118 associated with the workspace. Each of the one or more environments 4118 can be defined by technological resources, including computing infrastructure of the environment, tools or services associated with the environment, and an integrated development environment (IDE) through which the user can program workloads within the workspace. In the illustrated embodiment, two technological environments 4118 are associated with the workspace: an R Basic environment 4118a, and a Python Basic environment 4118b. Computing resources 4120 associated with an environment can be displayed for each environment. The computing resources 4120 can include information regarding a type of instance (e.g., of a virtual or physical server, or a container, or a cluster of servers or containers), and specifications associated therewith. In some embodiments, the computing resources 4120 can be hosted on a cloud service provider (e.g., GCP, AWS, Azure, etc.) and a name of the instance can correspond to a service offering provided by the cloud service provider. For example, as shown, the computing resources 4120a, 4120b for the technological environments 4118a, 4118b, respectively, each include a “small instance,” which, in each environment, is a virtual server having 4 CPUs and 16 GB of RAM. The “small instance” can be a provisionable unit of compute on a cloud service provider, and other units of compute can include instances including a greater or lesser amount of RAM or CPUs. In some cases, instances can also have GPUs associated therewith.
The technological environments 4118 can further be defined by technological resources associated therewith. For example, software packages can be installed on compute resources for a workspace, which can allow a user to analyze patient data records using the software running on the provisioned compute resources. Further, in some examples, including when the provisioned compute resources of a workspace are provisioned through a cloud service provider, other services may be accessible from the compute resources (e.g., database services, object storage, machine learning services, data analysis services, etc.), and can thus be usable with the workspace without installation of corresponding software on the compute resources. In the illustrated embodiment, technological resources 4122 are shown for each technological environment 4118a, 4118b. The technological resources 4122a associated with the R Basic Environment 4118a can include a version of the R programming language (e.g., R 4.1 as shown) and libraries (e.g., modules Bioconductor 2.0, tidyverse 1.7, etc.), which can be installed on the computing resource 4120a. The libraries can provide additional predefined functionality within the R programming language to the user for analysis of the provisioned data. Correspondingly, the technological resources 4122b associated with the Python Basic environment 4118b can include the Python language (e.g., Python 3.4) and associated libraries (e.g., pandas 3.1, survival 1.7, etc.).
IDEs can be provided with technological environments of a research project and can allow users to program against and interact with resources of the technological environment to analyze and run workloads on patient data cohorts. As further shown in
The exemplary technological environments 4118a, 4118b are provided for illustration, and are not intended to be limiting. A technological environment, according to some embodiments, can include computing resources having any specifications and can be hosted on a cloud service provider or, alternatively, in a data center of the provider of the interactive analysis portal 22. Additionally, any programming language can be used to analyze patient data records of a patient data cohort, including, but not limited to, Python, Structured Query Language (SQL), R, Spark, Haskell, Ruby, TypeScript, JavaScript, Perl, Lua, C, C++, MATLAB, and Java.
In some embodiments, the illustrated technological environments 4118a, 4118b are automatically generated and provisioned along with the research project, or, alternatively, along with the individual patient data cohorts. In some embodiments, a user may define technological environments in addition to or instead of automatically provisioned technological environments 4118a, 4118b. For example, the user may require that a technological environment include services that are not provided by default environments, or, in other examples, workloads for a patient data cohort may require greater compute resources (e.g., more memory, CPUs, Storage, GPUs, etc.) than provided in default technological environments. Thus, in some cases, the environments section 4114 of the workspace settings panel 4112 can include an environment addition element 4126 (e.g., a button, hyperlink, clickable image, etc.) which, when selected, can provide the research user with GUI elements (e.g., a form, a modal, etc.) to allow the user to define and provision additional environments.
In this regard,
Turning now to
As shown in
The usage section 4116 of a research project UI 4100 can include graphical or other visual representations of usage parameters of the research project. For example, as shown in
In some embodiments, users can preserve artifacts (e.g., files) generated or used in the course of running workloads for the research project. For example, a user may develop a notebook that can be useful for multiple input data sets, or that provides output data in a standardized way, and it may thus be advantageous to the user to have the ability to persist that notebook for future workloads. Additionally, a user may desire to save and export machine learning models output by AI/ML training workloads. Further, results of a machine learning model or an analysis may be saved in files of a system for future reference. Thus, as illustrated in
The files panel 4142 can include curated notebooks 4144, which have predefined code for performing certain desired workloads. As shown, notebooks can be provided for multiple programming languages (e.g., Python and R as shown), and can be visually grouped according to certain parameters of the notebooks 4144. In some embodiments, including as shown, the curated notebooks can be divided into a starter section 4146 (e.g., containing notebooks 4144a and 4144b as shown) and a premium section 4148 (e.g., containing notebooks 4144c, 4144d, and 4144e as shown). Notebooks 4144 in the premium section 4148 can require additional payment to access, as opposed to notebooks 4144 in the starter section. In other embodiments, notebooks can be grouped by function, or by language, or by any other common characteristic. In some embodiments, research users may make custom notebooks available for access by others in the research project, and notebooks displayed on a files panel 4142 can include these notebooks. Notebooks 4144 can each include an open option 4150, which can be a button that, when clicked, allows a user to select a program in which to open the notebook 4144 and view or edit code thereof. In some cases, the user can have an option to open the notebook 4144 in the browser or in an IDE (e.g., one of the IDEs defined for the environment).
Still referring to
A user can access files of a research project through a research project UI. For example, as further shown in
In some embodiments, the files panel 4142 can include an edit button 4180 which can allow the user to open and edit the displayed file 4168. In some embodiments, the file 4168 can be opened in a notebook upon selection of the edit button 4180, and a user can update code or display elements of the file as desired and save the file back into the corresponding folder.
A navigation bar can be provided in GUIs of the interactive analysis portal 22 to allow users of the interactive analysis portal or GUIs thereof to navigate between GUIs and perform functions within the interactive analysis portal.
Some portions of the preceding detailed descriptions have been presented in terms of algorithms and symbolic representations of operations on data bits within a computer memory. These algorithmic descriptions and representations are the ways used by those skilled in the data processing arts to most effectively convey the substance of their work to others skilled in the art. An algorithm is here, and generally, conceived to be a self-consistent sequence of operations leading to a desired result. The operations are those requiring physical manipulations of physical quantities. Usually, though not necessarily, these quantities take the form of electrical or magnetic signals capable of being stored, combined, compared, and otherwise manipulated. It has proven convenient at times, principally for reasons of common usage, to refer to these signals as bits, values, elements, symbols, characters, terms, numbers, or the like.
It should be borne in mind, however, that all of these and similar terms are to be associated with the appropriate physical quantities and are merely convenient labels applied to these quantities. Unless specifically stated otherwise as apparent from the above discussion, it is appreciated that throughout the description, discussions utilizing terms such as “identifying” or “providing” or “calculating” or “determining” or the like, refer to the action and processes of a computer system, or similar electronic computing device, that manipulates and transforms data represented as physical (electronic) quantities within the computer system's registers and memories into other data similarly represented as physical quantities within the computer system memories or registers or other such information storage devices.
The present disclosure also relates to an apparatus for performing the operations herein. This apparatus may be specially constructed for the intended purposes, or it may comprise a general purpose computer selectively activated or reconfigured by a computer program stored in the computer. Such a computer program may be stored in a computer readable storage medium, such as, but not limited to, any type of disk including floppy disks, optical disks, CD-ROMs, and magneto-optical disks, read-only memories (ROMs), random access memories (RAMs), EPROMs, EEPROMs, magnetic or optical cards, or any type of media suitable for storing electronic instructions, each coupled to a computer system bus.
The algorithms and displays presented herein are not inherently related to any particular computer or other apparatus. Various general purpose systems may be used with programs in accordance with the teachings herein, or it may prove convenient to construct a more specialized apparatus to perform the method. The structure for a variety of these systems will appear as set forth in the description below. In addition, the present disclosure is not described with reference to any particular programming language. It will be appreciated that a variety of programming languages may be used to implement the teachings of the disclosure as described herein.
The present disclosure may be provided as a computer program product, or software, that may include a machine-readable medium having stored thereon instructions, which may be used to program a computer system (or other electronic devices) to perform a process according to the present disclosure. A machine-readable medium includes any mechanism for storing information in a form readable by a machine (such as a computer). For example, a machine-readable (such as computer-readable) medium includes a machine (such as a computer) readable storage medium such as a read only memory (“ROM”), random access memory (“RAM”), magnetic disk storage media, optical storage media, flash memory devices, etc.
In the foregoing specification, implementations of the disclosure have been described with reference to specific example implementations thereof. It will be evident that various modifications may be made thereto without departing from the broader spirit and scope of implementations of the disclosure as set forth in the following claims. The specification and drawings are, accordingly, to be regarded in an illustrative sense rather than a restrictive sense.
It will be apparent to those skilled in the art that numerous changes and modifications can be made in the specific embodiments of the invention described above without departing from the scope of the invention. Accordingly, the whole of the foregoing description is to be interpreted in an illustrative and not in a limitative sense.
This application is a continuation-in-part of U.S. patent application Ser. No. 16/732,168, filed Dec. 31, 2019.
| | Number | Date | Country |
|---|---|---|---|
| Parent | 16732168 | Dec 2019 | US |
| Child | 18167812 | | US |