The following relates generally to the clinical testing arts, genomic testing arts, proteomic testing arts, and related arts.
Genomic and proteomic testing is increasingly applied as a tool for diagnosing and typing cancers, determining pathogen strains, and other clinical tasks. These techniques are capable of producing vast quantities of data.
Genomic testing may employ next-generation sequencing (NGS) to acquire a whole genome sequence (WGS), a whole exome sequence (WES, including only protein-encoding exons), RNA sequences, or so forth. In a typical NGS workflow, a tissue sample from a cancerous tumor or other tissue of interest is drawn via a biopsy or other interventional procedure. Wet lab processing is used to extract, purify, or otherwise prepare deoxyribonucleic acid (DNA) from the sample, followed by target enrichment (e.g. for WES), polymerase chain reaction (PCR) amplification, and/or other sample processing. The prepared sample is loaded into an NGS genetic sequencer that generates unaligned DNA sequence fragment reads (data representations of base sequences of DNA fragments), which may for example be stored as FASTQ data files. The unaligned reads are aligned with a reference DNA sequence using suitable data processing, such as a Burrows-Wheeler Alignment (BWA) tool, with SAMtools used for subsequent sorting and processing of the alignments. The aligned DNA sequence (e.g. WGS or WES sequence) is stored as a Sequence Alignment/Map (SAM) or Binary Alignment Map (BAM) or similar-type file. Variant calling software may be applied to identify genetic variants such as single nucleotide polymorphisms (SNPs) or single nucleotide variants (SNVs), base modification variants (e.g. methylation), extra or missing bases (insertions or deletions, i.e. indels), copy number variations (CNVs), or so forth. A list of genetic variants may be stored as a standard Variant Call Format (VCF) file or the like.
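A minimal sketch of such a workflow is given below. It assumes BWA, SAMtools, and BCFtools are installed, and the file names (ref.fa, sample_R1.fastq, sample_R2.fastq) are hypothetical placeholders for an actual reference genome and paired-end reads; BCFtools serves here as just one example of the variant calling software mentioned above.

```python
import subprocess

# Hypothetical input names; a real pipeline would parameterize these.
REF = "ref.fa"                                   # reference genome (FASTA)
R1, R2 = "sample_R1.fastq", "sample_R2.fastq"    # paired-end NGS reads

def run(cmd):
    """Run one pipeline stage in the shell, raising on failure."""
    subprocess.run(cmd, shell=True, check=True)

# Align unaligned FASTQ reads against the reference with BWA-MEM (SAM output).
run(f"bwa mem {REF} {R1} {R2} > sample.sam")

# Sort into a BAM file and index it with SAMtools.
run("samtools sort -o sample.sorted.bam sample.sam")
run("samtools index sample.sorted.bam")

# Call variants (SNVs/indels) with BCFtools, producing a VCF file.
run(f"bcftools mpileup -f {REF} sample.sorted.bam | bcftools call -mv -o sample.vcf")
```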
Proteomic data may be acquired from a tissue sample using laboratory tools such as mass spectrometry or microarray or protein chip analysis. For example, cells of a microarray are designed to interrogate specific proteins, and the outputs of the cells represent protein concentrations quantifying gene expression levels for corresponding genes. Mass spectrometry similarly quantifies concentrations of resolved proteins in the sample. As with NGS, large quantities of data can be generated. Combining genomic and proteomic analyses can in principle provide synergistic information.
However, extracting clinically useful information from genomic or proteomic data sets is challenging. In a supervised learning approach, samples in the form of WGS data, gene expression data, or the like for various patients are analyzed, with each sample (i.e. patient) labeled as to whether it exhibits the clinical condition of interest (e.g. the type of cancer). In such cases, the analysis amounts to identifying correlations between various features of the genomic/proteomic data (where a feature may be a genetic variant, a certain expression level bin, or so forth) and the presence/absence of the clinical condition of interest. This can be challenging when the genomic/proteomic data set contains tens of thousands of features.
Supervised learning is restricted to samples that are labeled as to the clinical condition of interest, and cannot leverage unsupervised data, that is, samples which are not labeled as to presence/absence of the clinical condition of interest. Thus, supervised learning of genomic and/or proteomic tests cannot leverage data sets without the appropriate clinical labeling. On the other hand, unsupervised learning techniques employ clustering or the like to group together similar samples, without regard to clinical labeling. These clusters can then be compared with any available labeled data to derive useful information from the unlabeled data. However, unsupervised learning of useful clinical tests in the absence of clinical labeling of (at least most) samples is even more challenging than supervised learning.
To address the dimensionality challenge and associated issues, techniques such as deep learning auto-encoders have been used to reduce the dimensions of the feature space and compress the data structure while minimizing the data content loss. However, the structure of the auto-encoder needs to be defined in advance, and optimization results as well as data compression depend strongly on this pre-defined structure; yet, there is little guidance available to the test developer as to how to optimally pick such a structure.
The following discloses new and improved systems and methods.
In one disclosed aspect, a genomic/proteomic test synthesis device comprises a computer and a non-transitory storage medium that stores instructions readable and executable by the computer to perform a genomic/proteomic test synthesis method. That method includes: receiving a genomic/proteomic data set comprising samples corresponding to persons with each sample including values of features of a set of features derived from genomic/proteomic data for the corresponding person; for each feature, generating a kernel density estimate (KDE) of sample density versus feature value for the feature; and performing multivariate analysis on the features using the KDEs to generate a set of discriminative features.
In another disclosed aspect, a non-transitory storage medium stores instructions readable and executable by an electronic processor to perform a genomic/proteomic test synthesis method comprising: receiving a genomic/proteomic data set comprising samples corresponding to persons with each sample including values of features of a set of features derived from genomic/proteomic data for the corresponding person; for each feature, performing univariate analysis on the values of the feature for the samples of the genomic/proteomic data set to generate a sample density versus feature value data set for the feature; and performing multivariate analysis on the features using the sample density versus feature value data sets to generate at least one set of discriminative features.
In another disclosed aspect, a genomic/proteomic test synthesis method is disclosed. A genomic/proteomic data set is received at a computer. The data set comprises samples corresponding to persons with each sample including values of features of a set of features derived from genomic/proteomic data for the corresponding person. For each feature and using the computer, univariate analysis is performed on the values of the feature for the samples of the genomic/proteomic data set to generate a sample density versus feature value data set for the feature. Using the computer, multivariate analysis is performed on the features using the sample density versus feature value data sets to generate at least one set of discriminative features.
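A minimal sketch of the disclosed method is given below, under illustrative assumptions: the data set is held as an N×M NumPy array (samples by features), a Gaussian KDE is computed per feature on a fixed grid over [0,1], and the multivariate step is left as a hypothetical placeholder elaborated in the sketches further below.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

def per_feature_kdes(data, grid_points=256, bandwidth=0.05):
    """Compute a KDE of sample density versus feature value for each
    feature of an N x M data set, evaluated on a grid over [0, 1]."""
    n_samples, n_features = data.shape
    x = np.linspace(0.0, 1.0, grid_points)[:, None]
    kdes = []
    for j in range(n_features):
        v = data[:, j].astype(float)
        v = (v - v.min()) / (v.max() - v.min())        # min-max normalize to [0, 1]
        kde = KernelDensity(kernel="gaussian", bandwidth=bandwidth).fit(v[:, None])
        kdes.append(np.exp(kde.score_samples(x)))      # density values on the grid
    return np.asarray(kdes)                            # shape (M, grid_points)

# kdes = per_feature_kdes(data)
# features = multivariate_analysis(kdes)   # hypothetical; e.g. ESD- or peak-based
```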
One advantage resides in providing more robust feature selection for synthesis of a genomic/proteomic test.
Another advantage resides in providing more efficient synthesis of a genomic/proteomic test.
Another advantage resides in providing more computationally efficient detection of the most discriminative features for use in synthesis of a genomic/proteomic test.
Another advantage resides in providing selection of the most discriminative features for use in synthesis of a genomic/proteomic test that is effective to detect single features that are highly discriminative.
Another advantage resides in providing one or more of the foregoing benefits without the need for a labeled (or fully labeled) sample data set.
A given embodiment may provide none, one, two, more, or all of the foregoing advantages, and/or may provide other advantages as will become apparent to one of ordinary skill in the art upon reading and understanding the present disclosure.
The invention may take form in various components and arrangements of components, and in various steps and arrangements of steps. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention.
Some approaches for genomic/proteomic test synthesis disclosed herein proceed in two stages. First, univariate feature pre-selection is performed, since even a single feature can provide important characterization of a data set. Next, the process iterates over the features ranked by the results of the first stage and detects associated sample clustering while performing forward selection and non-linear transformation of features. Clustering characteristics such as connectedness, homogeneity, and/or so forth may be assessed to include or exclude certain features from further iterations. This yields one or more sets of discriminative features, along with associated sample clusters that characterize the data set based on the chosen criteria. For clinical applications, the discriminative features are linked with sample groups defined by clinical variables to provide analytic solutions for predictive diagnostics and biomarker detection.
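One plausible reading of this forward-selection loop is sketched below, assuming a NumPy data matrix and pre-ranked feature indices. The silhouette score and the min_quality threshold are stand-ins for the connectedness/homogeneity criteria, which the disclosure leaves open, and k-means with k = 2 is an assumed clustering choice.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def forward_select(data, ranked_features, n_clusters=2, min_quality=0.2):
    """Iterate over features in ranked order, keeping a feature only if
    adding it improves the clustering quality of the selected subset.
    Silhouette score stands in for the connectedness/homogeneity criteria."""
    selected, best = [], -np.inf
    for j in ranked_features:
        trial = selected + [j]
        labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(data[:, trial])
        score = silhouette_score(data[:, trial], labels)
        if score > best and score >= min_quality:
            selected, best = trial, score   # keep the feature for further iterations
    return selected, best
```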
The disclosed approaches provide efficient feature selection by way of unsupervised learning, and various embodiments exhibit advantages such as one or more of the following: improved characterization of an arbitrary dataset; improved capturing of important features; and/or improved performance of predictive modelling schemes.
With reference to the drawings, an illustrative genomic/proteomic test synthesis device 10 comprises a computer and a non-transitory storage medium storing instructions readable and executable by the computer to perform the genomic/proteomic test synthesis methods disclosed herein.
As diagrammatically indicated in the drawings, the device 10 receives a genomic/proteomic data set 12 comprising N samples corresponding to persons, with each sample including values of M features derived from genomic/proteomic data for the corresponding person.
As diagrammatically indicated in the drawings, univariate analyses 30 are performed on each of the M features; in the illustrative embodiment, each univariate analysis generates a kernel density estimate (KDE) of sample density versus feature value for the feature.
The univariate analyses 30 are followed by one or more multivariate analyses 32, 34, which in the illustrative embodiment include: (1) a multivariate energy spectral density (ESD) analysis 32 producing a top-ranked set of features 36, e.g. ranked above some nth percentile of the M features; and (2) a multivariate peak locations analysis 34 producing a top-ranked set of features 38, e.g. ranked above some nth percentile of the M features (where a different percentile n is optionally used versus the ESD ranking 36). In the illustrative approaches, clustering of samples is used to assess and rank the features, and clustering performance metrics can then be used in an operation 40 to evaluate performance of the features in discriminating samples from one another. Further, if two or more top-ranked sets of features 36, 38 are generated, then the operation 40 can also include a consistency cross-check, e.g. using a Rand index comparison.
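The consistency cross-check might be realized as follows. This is a sketch assuming k-means clustering and scikit-learn's adjusted Rand score, a chance-corrected variant of the Rand index named above.

```python
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def consistency_check(data, esd_features, peak_features, n_clusters=2):
    """Compare the sample clusterings induced by the ESD-ranked and the
    peak-location-ranked feature sets; values near 1 indicate agreement."""
    labels_esd = KMeans(n_clusters, n_init=10).fit_predict(data[:, esd_features])
    labels_peak = KMeans(n_clusters, n_init=10).fit_predict(data[:, peak_features])
    return adjusted_rand_score(labels_esd, labels_peak)
```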
If clinical data of interest are available in the form of labels annotated to the samples of the data set 12, then the top-ranked features 36, 38 are also mapped to the clinical data of interest in an operation 42. This allows for identification of the most discriminative features (or combinations of features) from the list(s) of top-ranked features 36, 38. For example, the most discriminative feature(s) specifically for distinguishing whether a patient has a particular form of cancer may be more effectively identified using the mapped labeling for this cancer type.
The list(s) of top-ranked features 36, 38, along with the statistical information from the statistical performance evaluation(s) 40 and the clinical data mapping 42, are used to generate a clinical diagnostic test 44 with a statistical strength metric indicating how strongly the identified feature or set of features correlates with the test output (which may, for example, be an indication of whether the clinical patient has a certain type of cancer). The genomic/proteomic test 44 is thus synthesized using the device 10.
In a normalization operation 50, the values of each feature are normalized to the range [0,1] according to:

$$\{V_{ij}\}_{\mathrm{norm}} = \frac{V_{ij} - V_j^{\min}}{V_j^{\max} - V_j^{\min}} \tag{1}$$

where $V_j^{\max}=\max\{V_{1j},\ldots,V_{Nj}\}$ is the largest value of the feature, and $V_j^{\min}=\min\{V_{1j},\ldots,V_{Nj}\}$ is the smallest value of the feature. For all operations subsequent to the normalization operation 50, the normalized value $\{V_{ij}\}_{\mathrm{norm}}$ is denoted simply as $V_{ij}$ for simplicity of notation herein.
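Equation (1) corresponds to a per-column min-max scaling. A minimal sketch, assuming the data set is an N×M NumPy array with no constant-valued features, follows.

```python
import numpy as np

def normalize_features(V):
    """Min-max normalize each feature (column) of the N x M matrix V
    to the range [0, 1], per Equation (1)."""
    vmin = V.min(axis=0)            # V_j^min for each feature j
    vmax = V.max(axis=0)            # V_j^max for each feature j
    # Constant-valued features (vmax == vmin) would need special handling.
    return (V - vmin) / (vmax - vmin)
```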
The kernel density estimate (KDE) 52 is then computed according to:

$$\mathrm{KDE}_j(x) = \frac{1}{Nh}\sum_{i=1}^{N} K\!\left(\frac{x - V_{ij}}{h}\right) \tag{2}$$

where $\mathrm{KDE}_j(x)$ is the KDE for (normalized) feature $F_j$ and is defined over the interval [0,1], $K(\cdot)$ is the kernel function (e.g. a Gaussian kernel may be used in some embodiments), and $h$ is the kernel bandwidth, chosen to be sufficiently small to provide the desired resolution along the interval [0,1] and sufficiently large to provide smoothing. The kernel density estimate $\mathrm{KDE}_j(x)$ of Equation (2) is merely one illustrative embodiment of a suitable smoothed sample density versus feature value data set, and other formulations are contemplated.
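A direct NumPy transcription of Equation (2) with a Gaussian kernel follows; the grid size and the bandwidth h = 0.05 are illustrative choices, not values prescribed by the disclosure.

```python
import numpy as np

def kde(values, x, h=0.05):
    """Gaussian-kernel density estimate per Equation (2): values holds the
    N normalized samples V_ij of one feature; x is a grid over [0, 1]."""
    K = lambda u: np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)  # Gaussian kernel
    # KDE_j(x) = (1 / (N h)) * sum_i K((x - V_ij) / h)
    return K((x[:, None] - values[None, :]) / h).sum(axis=1) / (len(values) * h)

x = np.linspace(0.0, 1.0, 256)   # evaluation grid over [0, 1]
# density = kde(normalized_feature_values, x)
```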
The sample density versus feature value data set for each feature $F_j$ quantitatively captures the distribution of the values of the feature over the N samples. This can be further summarized in various ways. For example, the (preferably normalized) energy spectral density (ESD) 54 of each KDE 52 may be used. In computing the ESD, the kernel density estimate $\mathrm{KDE}_j(x)$ is treated as a finite-energy time-series signal, and the ESD may be computed as:

$$E_j(f) = \left|\sum_{x} \mathrm{KDE}_j(x)\, e^{-ifx}\right|^2 \tag{3}$$

where $f$ denotes frequency in the range $[-\pi,\pi]$. The ESD is binned into Q frequency ranges, denoted here as $D_1,\ldots,D_Q$, over the range $[\omega_{\min},\omega_{\max}]$ where $\omega_{\min}=-\pi$ and $\omega_{\max}=\pi$. Here, Q is a method parameter allowing flexible evaluation of feature characteristics at various frequency ranges and tuneable resolutions, including major regions such as the low, high, and intermediate frequency ranges. In each of the regions $D_1,\ldots,D_Q$, the associated energy content is computed from the values of $E_j(f)$ in the given frequency region, yielding $E_{1j},\ldots,E_{Qj}$. These values are normalized to the range [0,1] similarly to Equation (1), i.e.:

$$\{E_{qj}\}_{\mathrm{norm}} = \frac{E_{qj} - E_j^{\min}}{E_j^{\max} - E_j^{\min}} \tag{4}$$

where $E_j^{\max}=\max\{E_{1j},\ldots,E_{Qj}\}$ and $E_j^{\min}=\min\{E_{1j},\ldots,E_{Qj}\}$. For all operations subsequent to the ESD computation operation 54, the normalized value $\{E_{qj}\}_{\mathrm{norm}}$ is denoted simply as $E_{qj}$ for simplicity of notation herein.
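Equations (3) and (4) might be realized as below, treating the sampled KDE as a discrete signal. The use of numpy.fft and equal-width frequency bins is an assumption, as is the choice Q = 8; in practice Q is tuned to resolve the frequency ranges of interest.

```python
import numpy as np

def binned_esd(kde_values, Q=8):
    """Treat the sampled KDE as a finite-energy signal, compute its energy
    spectral density per Equation (3), and bin it into Q frequency regions
    D_1..D_Q whose energies are then min-max normalized per Equation (4)."""
    spectrum = np.fft.fft(kde_values)            # DFT of the KDE "signal"
    esd = np.abs(spectrum) ** 2                  # E_j(f) = |FT{KDE_j(x)}|^2
    esd = np.fft.fftshift(esd)                   # order frequencies over [-pi, pi)
    bins = np.array_split(esd, Q)                # equal-width regions D_1 .. D_Q
    E = np.array([b.sum() for b in bins])        # energy contents E_1j .. E_Qj
    return (E - E.min()) / (E.max() - E.min())   # Equation (4)
```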
With continuing reference to the univariate analyses 30, the locations of the peaks of each kernel density estimate $\mathrm{KDE}_j(x)$ are also detected, for use in the subsequent multivariate peak locations analysis 34.
With reference now to the multivariate ESD analysis 32: in an operation 60, features having similar binned ESD characteristics are grouped into feature groups.
Optionally, in an operation 62, kernel principal component analysis (KPCA) is applied to each of the feature groups to nonlinearly transform the features and to identify the number of major principal components capturing variance above a chosen threshold (e.g., at or above the 75th percentile).
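A sketch of operation 62 using scikit-learn's KernelPCA is given below. The RBF kernel is an assumed choice, and eigenvalues_ is the scikit-learn (≥1.0) attribute name for the kernel eigenvalues (older releases call it lambdas_); the 75% cumulative-eigenvalue cut mirrors the percentile threshold named above.

```python
import numpy as np
from sklearn.decomposition import KernelPCA

def kpca_transform(X, var_threshold=0.75):
    """Nonlinearly transform one feature group X (N x m) with kernel PCA
    and keep the leading components whose cumulative eigenvalue share
    reaches the chosen variance threshold."""
    kpca = KernelPCA(kernel="rbf")               # keeps all non-zero components
    Z = kpca.fit_transform(X)
    share = np.cumsum(kpca.eigenvalues_) / kpca.eigenvalues_.sum()
    k = int(np.searchsorted(share, var_threshold)) + 1   # number of major components
    return Z[:, :k]
```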
In an operation 64, clustering of samples is performed separately using the (optionally KPCA-transformed) features of each of the feature groups defined in operation 60, and sample clustering scores are computed for the features as a weighted average of the within-cluster pairwise distances normalized by the corresponding cluster sizes. In other words, in the operation 64, for each feature group, clustering of the samples of the data set 12 is performed using the features of that feature group to generate sample clusters for the feature group, and a score is computed for each discriminative feature of the feature group (either original features $F_j$ or KPCA-transformed features, depending on whether operation 62 is performed) on the basis of pairwise distances between samples in the same sample cluster, where the pairwise distances are computed using the values of the discriminative feature for the samples. In an operation 66, the features are ranked by the cluster scores computed in operation 64. The highest-ranked discriminative features 36 are selected using a chosen threshold (e.g., the 75th percentile, or more generally above an nth percentile).
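The scoring of operation 64 admits several readings; below is one sketch in which each feature's score is the cluster-size-weighted average of within-cluster pairwise distances along that feature, with k-means and k = 2 as assumed choices.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_scores(X, n_clusters=2):
    """Cluster the samples on a feature group X (N x m), then score each
    feature by the weighted average of within-cluster pairwise distances,
    normalized by cluster size."""
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(X)
    scores = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        total, weight = 0.0, 0
        for c in range(n_clusters):
            v = X[labels == c, j]
            if len(v) < 2:
                continue                             # skip singleton clusters
            d = np.abs(v[:, None] - v[None, :])      # pairwise distances on feature j
            total += d.sum() / (2 * len(v))          # normalize by cluster size
            weight += len(v)
        scores[j] = total / max(weight, 1)
    return scores
```

Under this particular score, lower values indicate tighter clustering along the feature, so the "highest-ranked" features would be those with the smallest scores; the ranking direction depends on the metric chosen and is an assumption here.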
With reference now to the multivariate peak locations analysis 34: an analogous grouping, clustering, and ranking procedure is applied using the peak locations of the KDEs 52 as the feature characteristics, yielding the top-ranked set of features 38.
The multivariate analyses 32, 34, using ESD and peak location characteristics, respectively, of the sample density versus feature value data sets 52, are merely illustrative examples. While using both ESD and peak locations in the multivariate analyses 32, 34 is expected to provide synergistic benefits, it is alternatively contemplated to employ only the multivariate analysis 32 using ESD characteristics, or only the multivariate analysis 34 using peak location characteristics. Additional or other multivariate analyses using other characteristics of the sample density versus feature value data sets are also contemplated, such as analyses using discrete Fourier transform characteristics.
With reference to the clinical data mapping operation 42, in an illustrative example, the one or more sets of discriminative features 36, 38 are mapped in mapping operations 100, 102 to clinical labels annotated to samples of the data set 12, and in operations 104, 106 the most diagnostic feature or features for the clinical condition of interest are identified from the mapped features.
It should be noted that the clinical context labeling of the data set 12 is not used except at the point of performing the mapping operations 100, 102. That is, the selection of the one or more sets of discriminative features 36, 38 entails unsupervised learning that does not rely upon clinical context labeling. Moreover, the mapping operations 100, 102 can map incomplete labeling and perform the diagnostic feature(s) identification 104, 106 with incompletely labeled samples. For example, if only 10% of the samples of the data set 12 are labeled as to a particular cancer type, the labeled 10% of the data can be used to perform the diagnostic feature(s) identification 104, 106, leveraging the unsupervised learning of the one or more sets of discriminative features 36, 38 operating on all 100% of the data set 12 to substantially improve computational efficiency.
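The following sketch illustrates this mapping with incomplete labels, using NaN to mark unlabeled samples. The function name map_to_clinical_labels is hypothetical, and per-feature AUC is an assumed stand-in for the discriminative-power statistic, which the disclosure does not prescribe.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def map_to_clinical_labels(data, features, labels):
    """Given discriminative features found without supervision and a
    (possibly small) labeled subset, score each feature by how well it
    separates the labeled classes; AUC stands in for the mapping statistic."""
    mask = ~np.isnan(labels)                    # e.g. only 10% of samples labeled
    y = labels[mask]
    aucs = {j: roc_auc_score(y, data[mask, j]) for j in features}
    # The most diagnostic feature maximizes separation in either direction.
    return max(aucs, key=lambda j: max(aucs[j], 1 - aucs[j]))
```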
The invention has been described with reference to the preferred embodiments. Modifications and alterations may occur to others upon reading and understanding the preceding detailed description. It is intended that the invention be construed as including all such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.
Filing Document | Filing Date | Country | Kind
---|---|---|---
PCT/EP2018/078941 | 10/23/2018 | WO | 00

Number | Date | Country
---|---|---
62583034 | Nov 2017 | US