A VIRAL EXPOSURE SIGNATURE FOR DETECTION OF EARLY STAGE HEPATOCELLULAR CARCINOMA

Information

  • Patent Application
  • 20230003730
  • Publication Number
    20230003730
  • Date Filed
    October 09, 2020
  • Date Published
    January 05, 2023
Abstract
A viral exposure signature (VES) that can identify early stage, pre-symptomatic hepatocellular carcinoma (HCC) among at-risk patients is described. The VES was developed using serological profiling and synthetic virome technology to identify unique viral peptide epitopes corresponding to 61 viral species. Methods of identifying a subject with early stage (pre-symptomatic) HCC using the VES are described.
Description
FIELD

This disclosure concerns a viral exposure signature and its use for identifying a subject with early stage (pre-symptomatic) hepatocellular carcinoma.


BACKGROUND

Hepatocellular carcinoma (HCC) is considered a virus-related malignancy in which hepatitis B and C viruses (HBV and HCV) are major etiological factors (Farazi et al., Nat Rev Cancer 2006; 6:674-687). Viral hepatitis causes inflammation and chronic liver diseases (CLD), which may lead to fibrosis, cirrhosis and eventually, HCC. While HBV or HCV chronic carriers have an increased risk of developing HCC, the risk varies among individuals and not all patients with liver disease develop liver cancer (Arzumanyan et al., Nat Rev Cancer 2013; 13:123-135). An effective strategy to prevent HCC is to eliminate causative factors. However, while direct-acting antiviral (DAA) treatment is remarkably effective in eliminating HCV infection, it reduces but does not completely eliminate HCC risk (Janjua et al., J Hepatol 2017; 66:504-513; Carrat et al., Lancet 2019; 393:1453-1464). Similarly, HBV vaccination, introduced in the early 1980s, has been successful in significantly reducing HBV carriers but only modestly reduces HCC burden in HBV-prevalent areas (Chang et al., Gastroenterology 2016; 151:472-480). It is puzzling that the control of HBV infection in HBV-prevalent areas as well as HCV infection has been remarkably successful for decades, while the global HCC incidence and mortality rates have continued to increase since the 1990s (Liu et al., J Hepatol 2019; 70:674-683). Changing trends of etiological factors such as alcohol and non-alcohol/non-viral related liver diseases may contribute to the observed increase. Thus, in addition to cancer prevention, early detection is a key research area to stop HCC-inflicted mortality. Currently, medical guidelines recommend biannual surveillance using ultrasound with or without alpha-fetoprotein (AFP) for individuals with chronic liver disease such as cirrhosis (Sherman et al., Hepatology 2012; 56:793-796).
However, these practices have yielded mixed results as to whether they are effective in detecting HCC at an early stage and can provide a survival benefit (Tzartzeva et al., Gastroenterology 2018; 154:1706-1718; Moon et al., Gastroenterology 2018; 155:1128-1139; Sherman et al., Hepatology 1995; 22:432-438). Notably, a majority of HCC patients are still diagnosed at an advanced stage, which precludes them from receiving potentially curative therapies and consequently leads to poor survival. Thus, there is an unmet need to implement an effective biomarker-guided surveillance program for early cancer detection.


SUMMARY

Described herein is a viral exposure signature (VES) that can be used to identify a subject with early stage HCC, particularly pre-symptomatic HCC. The VES is based on the presence or absence of antibodies to specific viral strains in a subject. Detection of the VES in a subject can be used, for example, to guide treatment and disease monitoring decisions.


Provided herein are methods of identifying a subject with early stage HCC. In some embodiments, the method includes detecting the presence or absence of antibodies to a plurality of viruses in a sample obtained from the subject; determining the presence of a viral exposure signature (VES) in the sample obtained from the subject; and identifying the subject as being at risk for developing HCC when the VES is present. In some embodiments, the plurality of viruses comprises at least 10, at least 20, at least 30, at least 40, at least 50 or at least 60 of the viruses listed in Table 5A. In some examples, the plurality of viruses comprises or consists of the 61 viruses listed in Table 5A or the 31 viruses listed in Table 6.


In some embodiments, the presence of the VES is determined by identifying antibodies to one or more of hepatitis C virus (HCV) genotype 3b, isolate Tr-Kj; HCV genotype 1b, isolate Taiwan; HCV genotype 1a, isolate 1; human cytomegalovirus, strain AD169; HCV genotype 6g, isolate JK046; HCV genotype 1b, isolate BK; HCV genotype 1c, isolate HC-G9; HCV genotype 1b, strain HC-J4; HCV genotype 4a, isolate ED43; hepatitis delta virus; HCV genotype 5a, isolate EUH1480; human cytomegalovirus; Crimean-Congo hemorrhagic fever virus, strain Nigeria/IbAr10200/1970; HCV genotype 1b, isolate HC-J1; influenza A virus, strain A/USSR/90/1977 H1N1; influenza A virus, strain A/Bangkok/1/1979 H3N2; HCV genotype 1c, isolate India; and Chapare virus, isolate Human/Bolivia/810419/2003.


In some embodiments, the presence of the VES is determined by not detecting antibodies to one or more of Epstein-Barr virus, strain B95-8; human rhinovirus 23; HCMV, strain Towne; human herpesvirus 2 (HHV-2), strain HG52; human herpesvirus 3; varicella-zoster virus, strain Dumas; Cercopithecine herpesvirus 16; human adenovirus C serotype 2; human astrovirus-1; human respiratory syncytial virus; human herpesvirus 6B, strain Z29; human herpesvirus 7, strain JI; human rhinovirus 14; Lordsdale virus, strain GII/Human/United Kingdom/Lordsdale/1993; human herpesvirus 1, strain KOS; human metapneumovirus, strain CAN97-83; coxsackievirus A16, strain G-10; Epstein-Barr virus, strain AG876; cowpox virus; human herpesvirus 1, strain 17; human adenovirus E serotype 4; human adenovirus F serotype 40; tanapox virus; human adenovirus C serotype 5; rhinovirus B; human herpesvirus 8; human herpesvirus 6A, strain Uganda-1102; human rhinovirus A serotype 89, strain 41467-Gallo; norovirus MD145, isolate GII/Human/United States/MD145-12/1987; molluscum contagiosum virus subtype 1; vaccinia virus, strain Copenhagen; poliovirus type 1, strain Sabin; orf virus; HHV-2, strain 333; hepatitis B virus; Epstein-Barr virus, strain GD1; human parainfluenza 3 virus, strain Wash/47885/57; HHV-2; human enterovirus 71, strain BrCr; human herpesvirus 6A, strain GS; Cercopithecine herpesvirus 1; influenza B virus, strain B/Yamagata/16/1988; and influenza A virus, strain A/Philippines/2/1982 H3N2.


In other embodiments, the method of identifying a subject with early stage HCC includes (i) detecting the presence or absence of antibodies specific for a plurality of viruses in a sample obtained from the subject, wherein the plurality of viruses comprises hepatitis C virus (HCV) genotype 3b, isolate Tr-Kj; HCV genotype 1b, isolate Taiwan; HCV genotype 1a, isolate 1; human cytomegalovirus (HCMV) strain AD169; HCV genotype 6g, isolate JK046; Epstein-Barr virus (EBV), strain B95-8; human rhinovirus 23; HCMV strain Towne; HCV genotype 1b, isolate BK; and human herpesvirus 2 (HHV-2), strain HG52; and (ii) identifying the subject as being at risk for developing HCC if: (a) antibodies specific for HCV genotype 3b, isolate Tr-Kj; HCV genotype 1b, isolate Taiwan; HCV genotype 1a, isolate 1; HCMV strain AD169; HCV genotype 6g, isolate JK046; and/or HCV genotype 1b, isolate BK, are detected in the sample; and/or (b) antibodies specific for EBV, strain B95-8; human rhinovirus 23; HCMV strain Towne; and/or HHV-2, strain HG52, are not detected in the sample.
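
By way of non-limiting illustration only, the rule recited in step (ii) above can be sketched in Python. The function name, the dictionary-based input format, and the treatment of a missing entry as "not detected" are assumptions made for this sketch and are not part of the disclosure; the virus labels are copied from the embodiment above. The sketch takes the most permissive reading of "and/or", in which any single listed condition is sufficient.

```python
# Hypothetical sketch of the ten-virus rule; names and input format are assumptions.
DETECTED_MARKERS = (
    "HCV genotype 3b, isolate Tr-Kj",
    "HCV genotype 1b, isolate Taiwan",
    "HCV genotype 1a, isolate 1",
    "HCMV strain AD169",
    "HCV genotype 6g, isolate JK046",
    "HCV genotype 1b, isolate BK",
)
NOT_DETECTED_MARKERS = (
    "EBV, strain B95-8",
    "human rhinovirus 23",
    "HCMV strain Towne",
    "HHV-2, strain HG52",
)


def at_risk_for_hcc(antibody_detected):
    """Return True if condition (a) and/or condition (b) holds.

    antibody_detected maps a virus label to True (antibodies detected) or
    False (not detected). A missing label is treated as not detected,
    which is an assumption of this sketch.
    """
    # (a) antibodies to at least one of the listed HCV/HCMV AD169 strains are detected
    cond_a = any(antibody_detected.get(v, False) for v in DETECTED_MARKERS)
    # (b) antibodies to at least one of the listed EBV/rhinovirus/HCMV Towne/HHV-2 strains are absent
    cond_b = any(not antibody_detected.get(v, False) for v in NOT_DETECTED_MARKERS)
    return cond_a or cond_b
```

A stricter reading (for example, requiring several conditions at once, or using the scored signature rather than a boolean rule) would change only the two `any(...)` expressions.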


In some embodiments, the sample is a blood or serum sample.


In some embodiments, the antibodies are detected by phage immunoprecipitation, immunoblot or enzyme-linked immunosorbent assay.


In some embodiments, the method further includes administering an appropriate therapy or providing an appropriate procedure (such as surgery) for the treatment of HCC. In some examples, the method further includes performing a liver transplant in the subject with early stage HCC. In other examples, the method further includes liver resection of the subject with early stage HCC, with or without radiofrequency ablation (RFA). In some examples, if the subject is also positive for HBV or HCV, the subject is administered an anti-viral drug.


In some embodiments, the method further includes active diagnostic monitoring of the subject with early stage HCC. For example, the subject can be monitored on a regular schedule, such as every 3 months or every 6 months, using ultrasound, contrast enhanced computerized tomography (CT) and/or magnetic resonance imaging (MRI).


Also provided is a phage display library expressing unique peptide epitopes from each of the viruses listed in Table 5A or Table 6. In some embodiments, the phage display library expresses the peptides of SEQ ID NOs: 1-61, or a subset thereof. In some examples, the phage display library expresses the peptides of SEQ ID NOs: 1-102, or a subset thereof. In other examples, the phage display library expresses the peptides of SEQ ID NOs: 62-102, or a subset thereof.


Further provided is an array comprising unique peptide epitopes from each of the viruses listed in Table 5A or Table 6. In some examples the unique peptide epitopes comprise the peptides of SEQ ID NOs: 1-61 (shown in Table 5B), the peptides of SEQ ID NOs: 62-102 (shown in Table 3B), or the peptides of SEQ ID NOs: 1-102.


The foregoing and other objects and features of the disclosure will become more apparent from the following detailed description, which proceeds with reference to the accompanying figures.





BRIEF DESCRIPTION OF THE DRAWINGS


FIGS. 1A-1E: Viral richness and frequency of infection spectrum in serum. (FIG. 1A) Schema of screening of NCI-UMD cohort including 899 serum samples by VirScan and 849 matching buffy coat or cheek swab samples by genome-wide association study (GWAS), with integrated analysis among population groups: population controls (PC, n=412), high risk chronic liver disease cases (HR, n=337), and hepatocellular carcinoma cases (HCC, n=150); the VES is validated in a prospective NIDDK cohort with NIDDK-HR (n=129) and NIDDK-HCC (n=44). (FIG. 1B) Histogram showing the sequencing reads of VirScan with the mean coverage accuracy of 0.93. (FIG. 1C) Rarefaction plot showing the viral species richness detected in PC, HR and HCC groups. (FIG. 1D) Raincloud plot showing the viral species in each individual across populations. From left to right, each integrated boxplot illustrates: minimum, the first quantile, mean, the third quantile, and maximum, respectively. (FIG. 1E) Left: Bar plot showing the percentage of the prevalent viral infection among all samples. Right: Dot plot showing the number of the corresponding unique epitopes in each sample. Each dot represents the unique epitope number of one individual. The blue bars on the dot plot represent the mean.



FIGS. 2A-2C: Comparison of VirScan with medical charts, antigenicity of HCV1b and HIV coinfection viruses. (FIG. 2A) Contingency matrices comparing HCV, HBV, and HIV detection with VirScan against viral detection laboratory tests reported in the patient medical charts. For the purpose of computing binary classification test statistics, clinical results were considered true values and VirScan results were considered predicted values. (FIG. 2B) Left: Heatmap showing HCV proteomic enrichment among PC, HR and HCC groups. Each row represents the significant peptide tiling. Each column is a sample. The colored bar on the left of the panel indicates proteomic location of the tiling peptides (green). The first colored bar at the top of the panel indicates the groups of the samples among PC, HR, and HCC groups. The second bar at the top is HCV species positive (HCV species+) based on VirScan data. The intensity of each cell corresponds to the scaled −log10(p-value) measure of significance of enrichment for a peptide in a sample (greater values indicate stronger antibody response). Right: Bar plot showing the B-cell epitope prediction score for each peptide. (FIG. 2C) Bar chart representing the coinfection viral status in HIV positive (HIV +) versus HIV negative (HIV −) cases. Asterisks denote the false discovery rate less than 0.05.
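
By way of non-limiting illustration only, the comparison described for FIG. 2A, in which clinical results are taken as true values and VirScan results as predicted values, can be sketched in Python. The function name and the particular statistics reported (sensitivity, specificity and accuracy) are assumptions of this sketch; the disclosure states only that binary classification test statistics were computed.

```python
def binary_test_statistics(clinical, virscan):
    """Build a 2x2 contingency matrix from paired boolean results.

    clinical: chart-reported results, treated as the true values.
    virscan:  VirScan calls for the same subjects, treated as predictions.
    """
    if len(clinical) != len(virscan):
        raise ValueError("inputs must be paired, one entry per subject")
    tp = sum(1 for c, v in zip(clinical, virscan) if c and v)
    fn = sum(1 for c, v in zip(clinical, virscan) if c and not v)
    fp = sum(1 for c, v in zip(clinical, virscan) if not c and v)
    tn = sum(1 for c, v in zip(clinical, virscan) if not c and not v)

    def ratio(num, den):
        # Undefined ratios (empty denominator) are reported as NaN.
        return num / den if den else float("nan")

    return {
        "tp": tp, "fn": fn, "fp": fp, "tn": tn,
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "accuracy": ratio(tp + tn, tp + fn + fp + tn),
    }
```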



FIGS. 3A-3E: Composition of VES associated with HCC. (FIG. 3A) The VES is identified using the XGBoost machine learning method. Flow chart showing training set and 10× cross validation sets to compare the viral profiles in HCC versus PC. The scored results show the predictive VES score of each sample among PC, HR and HCC. (FIG. 3B) Gradient boosting plot showing the area under the curve (AUC) value of training sets and 10× cross validation sets. The vertical line indicates that gradient boosting stops at the 108th round of testing to avoid overfitting. (FIG. 3C) Bar plot showing the 61-VES identified by comparing HCC with PC using XGBoost in the NCI-UMD cohort. (FIG. 3D) Violin plot showing the predictive VES score among PC, HR and HCC groups. (**** P<0.0001, two-tailed p-value in Mann-Whitney test). (FIG. 3E) Phylogenetic analysis of the 61 viral strains, which results in eight well-defined branches.



FIGS. 4A-4H: Determination of VES predictive accuracy and association with clinical outcomes. (FIG. 4A) Estimate of receiver operating characteristic curves (ROC) of NCI-UMD cohort at HCC diagnosis. Plots display AUC estimation for 61-VES at HCC diagnosis (PC, n=412; HR, n=337; HCC, n=150). (FIG. 4B) VES levels of the NCI-UMD cohort, classified as below, low and high. The dashed line marks 0.5; a VES level of less than 0.5 is classified as below. Low and high VES levels are defined among VES levels of more than 0.5, with the median of those levels used as the separation. (FIG. 4C) Kaplan-Meier (KM) plot survival curve for the NCI-UMD cohort with either 61-VES. (FIGS. 4D, 4E) Estimate of receiver operating characteristic curves (ROC) in predicting NIDDK validation cohort at HCC diagnosis and baseline. Plots display area under the curve estimation for 61-VES and clinical variable AFP at HCC diagnosis (NIDDK-HR2, n=106; NIDDK-HCC, n=44) and at baseline (NIDDK-HR1, n=129; NIDDK-HCC, n=44). (FIG. 4F) Time-dependent AUC showing the landmark time points performance of VES from 1 to 10 years relative to baseline. (FIG. 4G) The boxplots show the relationships between 61-VES and the clinical diagnosis in the NIDDK validation cohort at different follow-up (F/U) time points. (FIG. 4H) AUC values corresponding to predictions based on clinical indicators from patient charts compared with those based on VES, as well as those based on the combination of clinical indicators and VES for the NIDDK cohort at baseline.
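
By way of non-limiting illustration only, the three-level grouping described for FIG. 4B can be sketched in Python. The function name, the handling of a score exactly equal to 0.5 (grouped here with the scores above 0.5) and the handling of a score exactly equal to the median (assigned here to the low group) are assumptions of this sketch; the figure description does not specify them.

```python
from statistics import median


def ves_levels(scores):
    """Label each VES score as 'below', 'low' or 'high'.

    Scores under 0.5 are 'below'. The remaining scores are split at their
    own median: at or under the median is 'low', over it is 'high'.
    """
    upper = [s for s in scores if s >= 0.5]
    if not upper:
        return ["below"] * len(scores)
    cut = median(upper)  # separation between low and high
    labels = []
    for s in scores:
        if s < 0.5:
            labels.append("below")
        elif s > cut:
            labels.append("high")
        else:
            labels.append("low")
    return labels
```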



FIGS. 5A-5E: VirScan reproducibility and viral composition at DNA, RNA virus level and viral family level. (FIG. 5A) Distribution of reproducibility threshold −log10(p-values) is shown. Histogram of the frequency of the reproducibility threshold −log10(p-values). The mode of the distribution is approximately 2.358. (FIG. 5B) Examples of the experimental repeats in VirScan showing the background signals of the blank PBS samples at the bottom and the hits with significant −log10(p-value) greater than 2.358 of serum samples (top panel). (FIG. 5C) Pie charts showing the DNA and RNA viral compositions before and after immunoprecipitation in VirScan, as library input and Phage-IP, respectively. (FIG. 5D) Stacked bar plot showing phylogenetic composition of common viral taxa (0.1% abundance) at the viral family level among PC, HR and HCC. (FIG. 5E) The diagram includes detailed information on the excluded participants from initial enrollment, sample allocation with indicated criteria, QC and final data analysis.
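
By way of non-limiting illustration only, the thresholding described for FIGS. 5A and 5B can be sketched in Python. The threshold value 2.358 is taken from the figure description; the function names and the requirement that every experimental repeat exceed the threshold are assumptions of this sketch, since the description does not state how repeats are combined.

```python
import math

REPRODUCIBILITY_THRESHOLD = 2.358  # -log10(p-value) threshold from FIG. 5A


def neg_log10(p_value):
    """Return -log10(p); a p-value of zero maps to infinity."""
    return math.inf if p_value == 0 else -math.log10(p_value)


def is_hit(repeat_p_values, threshold=REPRODUCIBILITY_THRESHOLD):
    """Call a peptide a hit if -log10(p) exceeds the threshold in all repeats.

    repeat_p_values holds one enrichment p-value per experimental repeat
    for a single peptide in a single sample (assumed input format).
    """
    return all(neg_log10(p) > threshold for p in repeat_p_values)
```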



FIGS. 6A-6B: Extended information of composition of viral features in the investigated population. (FIG. 6A) Heatmap showing the hierarchical clustering (hCluster) of the samples among PC, HR and HCC with the differential viral features. The listed 17 viruses exhibit a fold change greater than 2 with FDR<0.05 in PC and HCC ANOVA test. Bottom bar shows the scaled density signal. (FIG. 6B) Histogram showing the most differential viral species (sp) and strains in HCC versus PC.



FIGS. 7A-7C: Quality control of the GWAS study. (FIG. 7A) QQ-plot for all 729,000 variants represented in the GWAS. (FIG. 7B) Principal component analysis (PCA) of all samples after quality control (QC) in different racial groups. (FIG. 7C) SNP rs12979860 was significantly associated with epitopes in Core and NS5B regions of HCV. Left panel: Heatmap showing the significance of SNP associated with 375 epitope abundances of HCV genotypes 2 and 3. Core and NS5B regions were highly associated with the genotypes. Right panel: Boxplots represent the difference of the epitope abundance between the genotypes in the Core region and NS5B region.



FIGS. 8A-8E: CONSORT flow diagrams for NIDDK cohort and assessment of the association of clinical outcomes with VES in NIDDK Cohort. (FIG. 8A) The diagram includes detailed information on the excluded participants from initial enrollment, sample allocation with indicated criteria, follow-up, QC and final data analysis. (FIG. 8B) Kaplan-Meier survival curves for NIDDK cohorts grouped by VES level. (FIG. 8C) Time-dependent ROC curve analysis of VES performance for landmark time points 1-10 years relative to baseline. (FIG. 8D) AUC prediction performance based on univariate and multivariate clinical indicators compared to VES (vertical band) for the NIDDK cohort at diagnosis. (FIG. 8E) AUC prediction performance based on univariate and multivariate clinical indicators compared to VES (vertical band) for the NCI-UMD cohort.



FIGS. 9A-9G: Genome-wide scan identifies specific genetic variants linked to VES. (FIG. 9A) Manhattan plot showing the detected genetic variants from GWAS associated with the viral featural phenotype of NCI-UMD cohort. Annotated names of gene loci with P-value less than 10^−7. (FIG. 9B) Locus Zoom plot showing the LD structure of one of the lead SNPs, rs16960234, around the region of CDH13 and RP11-543N12.1. (FIG. 9C) Heatmap showing the high linkage disequilibrium (LD) SNPs of rs16960234 from 1000 Genomes database (R^2>0.6). The density of the heatmap indicates the r^2 value of the correlation. The labeled SNPs are the ones with eQTL available. (FIG. 9D) The eQTL of CDH13 in tibial artery tissue across genotypes of SNP rs1690234 from GTEx database. (FIG. 9E) The genotypic odds ratios (OR) of rs1690234 among HR and HCC relative to PC. (FIGS. 9F, 9G) VES score fold changes (FD) in genotypes AA, AG and GG of rs1690234 based on 61-VES and 31-VES among HCC relative to PC.



FIG. 10: Viral infection prevalence and unique viral epitope count across population control (PC), at risk group (AR), and HCC group. The viral infection prevalence across all PC, AR and HCC samples is shown on the bar plots. The count of unique epitopes per sample is shown on the dot plot and the vertical lines represent the mean values of the count of unique epitopes.



FIGS. 11A-11G: Further validation of robustness of the 61-VES. (FIG. 11A) XGBoost performance evaluated by AUC on HCC versus AR with 10× cross-validation. (FIG. 11B) ROC curves for PC versus HCC prediction, as well as for AR versus HCC prediction, using features from HCC versus AR prediction. (FIG. 11C) Features selected by HCC versus AR prediction overlapped highly with the VES signature. (FIG. 11D) XGBoost performance evaluated by AUC on HCC versus PC with 60/40 train-test split. (FIG. 11E) ROC curves showing the performance on the train and test datasets. (FIG. 11F) 1000 permutations with the 60/40 train-test split. (FIG. 11G) The selected features and feature importance after the 1000-permutation test.
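
By way of non-limiting illustration only, one common form of permutation testing on an AUC value can be sketched in Python. The figure description does not state which quantity was permuted or how the 1000 permutations were evaluated; the label-shuffling scheme, the function names and the fixed random seed below are all assumptions of this sketch. The AUC is computed as the fraction of case-control pairs in which the case has the higher score, with ties counted as one half.

```python
import random


def auc(labels, scores):
    """AUC as the probability that a random case outscores a random control."""
    cases = [s for l, s in zip(labels, scores) if l == 1]
    controls = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(
        1.0 if c > k else 0.5 if c == k else 0.0
        for c in cases
        for k in controls
    )
    return wins / (len(cases) * len(controls))


def permutation_p_value(labels, scores, n_perm=1000, seed=0):
    """Return the observed AUC and a one-sided permutation p-value.

    Labels are shuffled n_perm times; the p-value is the add-one-corrected
    fraction of shuffles whose AUC is at least the observed AUC.
    """
    rng = random.Random(seed)
    observed = auc(labels, scores)
    shuffled = list(labels)
    at_least = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if auc(shuffled, scores) >= observed:
            at_least += 1
    return observed, (at_least + 1) / (n_perm + 1)
```

Shuffling labels leaves the number of cases and controls unchanged, so each permuted AUC is always defined.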





SEQUENCE LISTING

The amino acid sequences listed in the accompanying sequence listing are shown using standard three letter code for amino acids, as defined in 37 C.F.R. 1.822. The Sequence Listing is submitted as an ASCII text file, created on Oct. 8, 2020, 58.3 KB, which is incorporated by reference herein. In the accompanying sequence listing:


SEQ ID NOs: 1-102 are amino acid sequences of unique peptide epitopes from human viruses.


DETAILED DESCRIPTION
I. Terms and Methods

Unless otherwise noted, technical terms are used according to conventional usage. Definitions of common terms in molecular biology may be found in Benjamin Lewin, Genes VII, published by Oxford University Press, 2000 (ISBN 019879276X); Kendrew et al. (eds.), The Encyclopedia of Molecular Biology, published by Blackwell Publishers, 1994 (ISBN 0632021829); Robert A. Meyers (ed.), Molecular Biology and Biotechnology: a Comprehensive Desk Reference, published by Wiley, John & Sons, Inc., 1995 (ISBN 0471186341); and George P. Rédei, Encyclopedic Dictionary of Genetics, Genomics, and Proteomics, 2nd Edition, 2003 (ISBN: 0-471-26821-6).


The singular forms “a,” “an,” and “the” refer to one or more than one, unless the context clearly dictates otherwise. For example, the term “comprising a probe” includes single or plural probes and is considered equivalent to the phrase “comprising at least one probe.” The term “or” refers to a single element of stated alternative elements or a combination of two or more elements, unless the context clearly indicates otherwise. As used herein, “comprises” means “includes.” Thus, “comprising A or B,” means “including A, B, or A and B,” without excluding additional elements.


Although methods and materials similar or equivalent to those described herein can be used in the practice or testing of the present disclosure, suitable methods and materials are described below. All publications, patent applications, patents, and other references mentioned herein are incorporated by reference in their entirety, as are the GenBank® Accession numbers (for the sequence present on Feb. 8, 2016). In case of conflict, the present specification, including explanations of terms, will control. In addition, the materials, methods, and examples are illustrative only and not intended to be limiting.


Except as otherwise noted, the methods and techniques of the present disclosure are generally performed according to conventional methods well known in the art and as described in various general and more specific references that are cited and discussed throughout the present specification. See, e.g., Sambrook et al., Molecular Cloning: A Laboratory Manual, 2d ed., Cold Spring Harbor Laboratory Press, 1989; Sambrook et al., Molecular Cloning: A Laboratory Manual, 3d ed., Cold Spring Harbor Press, 2001; Ausubel et al., Current Protocols in Molecular Biology, Greene Publishing Associates, 1992 (and Supplements to 2000); Ausubel et al., Short Protocols in Molecular Biology: A Compendium of Methods from Current Protocols in Molecular Biology, 4th ed., Wiley & Sons, 1999; Harlow and Lane, Antibodies: A Laboratory Manual, Cold Spring Harbor Laboratory Press, 1990; and Harlow and Lane, Using Antibodies: A Laboratory Manual, Cold Spring Harbor Laboratory Press, 1999.


In order to facilitate review of the various embodiments of the disclosure, the following explanations of specific terms are provided:


Administration: The introduction of an agent, such as an anti-viral therapeutic, into a subject by a chosen route. Administration can be local or systemic. For example, if the chosen route is intravascular, the agent is administered by introducing the composition into a blood vessel of the subject. Exemplary routes of administration include, but are not limited to, oral, injection (such as subcutaneous, intramuscular, intradermal, intraperitoneal, and intravenous), sublingual, rectal, transdermal (for example, topical), intranasal, vaginal, and inhalation routes.


Antibody: A polypeptide ligand comprising at least one variable region that recognizes and binds (such as specifically recognizes and specifically binds) an epitope of an antigen, such as a viral antigen. Mammalian immunoglobulin molecules are composed of a heavy (H) chain and a light (L) chain, each of which has a variable region, termed the variable heavy (VH) region and the variable light (VL) region, respectively. Together, the VH region and the VL region are responsible for binding the antigen recognized by the antibody. There are five main heavy chain classes (or isotypes) of mammalian immunoglobulin, which determine the functional activity of an antibody molecule: IgM, IgD, IgG, IgA and IgE. Antibody isotypes not found in mammals include IgX, IgY, IgW and IgNAR. IgY is the primary antibody produced by birds and reptiles, and is functionally similar in some respects to mammalian IgG and IgE. IgW and IgNAR antibodies are produced by cartilaginous fish, while IgX antibodies are found in amphibians.


Array: An arrangement of molecules, such as biological macromolecules (such as peptides or nucleic acid molecules) or biological samples (such as tissue sections), in addressable locations on or in a substrate. In some embodiments herein, the array comprises at least 10, at least 20, at least 30, at least 40, at least 50, at least 60 (such as 61) addressable locations. In particular examples, the array comprises peptide epitopes from each of the viruses listed in Table 5A or Table 6.


Control: A “control” refers to a sample or standard used for comparison with an experimental sample, such as a serum sample obtained from a subject suspected of having or at risk for HCC. In some embodiments, the control is a sample obtained from a healthy patient (e.g., one not having HCC or cirrhosis). In some embodiments, the control is a historical control or standard reference value or range of values (e.g., a previously tested control sample, such as a group of samples that represent baseline or normal values).


Diagnosis: The process of identifying a disease by its signs, symptoms and results of various tests. The conclusion reached through that process is also called “a diagnosis.” Forms of testing commonly performed include blood tests, medical imaging, and biopsy.


Early stage: In the context of the present disclosure, detecting “early stage” HCC refers to identifying HCC in a subject prior to the onset of symptoms and/or prior to standard clinical diagnosis. “Early stage” in this context is not synonymous with stage 0 or stage I cancer. In some embodiments, early stage HCC is characterized by the presence of a single lesion less than 3 cm in diameter (such as 0.1 to 2.9 cm in diameter, such as 0.5 to 2.5 cm, 0.5 to 1 cm or 1 to 2.9 cm in diameter) without detectable local or distant metastatic lesions (such as detectable by CT or MRI).


Epitope: An antigenic determinant. These are particular chemical groups or peptide sequences on a molecule that are antigenic, i.e. that elicit a specific immune response. An antibody specifically binds a particular antigenic epitope on a polypeptide, such as a viral polypeptide.


Hepatocellular carcinoma (HCC): A primary malignancy of the liver, which in some cases occurs in patients with inflammatory livers resulting from viral hepatitis, liver toxins or hepatic cirrhosis (often caused by alcoholism). Exemplary therapies for HCC include but are not limited to, one or more of surgery, transarterial chemoembolization (TACE), ablative therapies (including both thermal and cryoablation), radio embolization, and percutaneous alcohol injection.


Isolated: An “isolated” biological component (such as a nucleic acid molecule, protein, or cell) has been substantially separated or purified away from other biological components, such as other chromosomal and extra-chromosomal DNA and RNA, proteins and cells. Nucleic acid molecules and proteins that have been “isolated” include nucleic acid molecules and proteins purified by standard purification methods. The term also embraces nucleic acid molecules and proteins prepared by recombinant expression in a host cell as well as chemically synthesized nucleic acid molecules and proteins.


Sample (or biological sample): A biological specimen containing genomic DNA, RNA (including mRNA), protein (such as antibodies), or combinations thereof, obtained from a subject. Examples include, but are not limited to, peripheral blood, plasma, urine, saliva, tissue biopsy, fine needle aspirate, punch biopsy, surgical specimen, and autopsy material. In specific embodiments herein, the sample is a blood or serum sample.


Sequence identity: The identity or similarity between two or more nucleic acid sequences, or two or more amino acid sequences, is expressed in terms of the identity or similarity between the sequences. Sequence identity can be measured in terms of percentage identity; the higher the percentage, the more identical the sequences are. Sequence similarity can be measured in terms of percentage similarity (which takes into account conservative amino acid substitutions); the higher the percentage, the more similar the sequences are.


Methods of alignment of sequences for comparison are well known in the art. Various programs and alignment algorithms are described in: Smith & Waterman, Adv. Appl. Math. 2:482, 1981; Needleman & Wunsch, J. Mol. Biol. 48:443, 1970; Pearson & Lipman, Proc. Natl. Acad. Sci. USA 85:2444, 1988; Higgins & Sharp, Gene, 73:237-44, 1988; Higgins & Sharp, CABIOS 5:151-3, 1989; Corpet et al., Nuc. Acids Res. 16:10881-90, 1988; Huang et al. Computer Appls. in the Biosciences 8, 155-65, 1992; and Pearson et al., Meth. Mol. Bio. 24:307-31, 1994. Altschul et al., J. Mol. Biol. 215:403-10, 1990, presents a detailed consideration of sequence alignment methods and homology calculations.


The NCBI Basic Local Alignment Search Tool (BLAST) (Altschul et al., J. Mol. Biol. 215:403-10, 1990) is available from several sources, including the National Center for Biotechnology Information (NCBI) and on the internet, for use in connection with the sequence analysis programs blastp, blastn, blastx, tblastn and tblastx. Additional information can be found at the NCBI web site.
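
By way of non-limiting illustration only, percentage identity as defined above can be sketched in Python for two sequences that have already been aligned. The function name, the use of "-" as the gap character and the choice of denominator (all alignment columns except those where both sequences have a gap) are assumptions of this sketch; alignment programs such as BLAST apply their own conventions, and this sketch does not perform the alignment itself.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percentage of alignment columns in which both residues are identical.

    Both inputs must be pre-aligned strings of equal length. Columns where
    both sequences carry a gap are ignored; a residue opposite a gap counts
    as a mismatch.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    columns = [
        (a, b) for a, b in zip(aligned_a, aligned_b)
        if not (a == gap and b == gap)
    ]
    if not columns:
        return 0.0
    matches = sum(1 for a, b in columns if a == b)
    return 100.0 * matches / len(columns)
```

For example, two six-residue peptides that differ at one position share five of six columns, or about 83.3% identity under this convention.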


Subject: Living multi-cellular vertebrate organisms, a category that includes human and non-human mammals. In some examples herein, the subject is suspected of having or at risk for having HCC.


Tumor: All neoplastic cell growth and proliferation, whether malignant or benign, and all pre-cancerous and cancerous cells and tissues. In some examples, the tumor is an HCC tumor.


II. Viral Exposure Signature and Methods of Use

Viruses are known to affect human health by altering host immunity, which makes the interplay between the virome and the host crucial in the pathogenesis of human chronic diseases, including cancer (Foxman et al., Nat Rev Microbiol 2011; 9:254-64; Cadwell, Immunity 2015; 42:805-813). Diverse pathogenic and non-pathogenic viruses may interact with one another as well as their host to shape host immunity, which may alter its response to new infections. Consequently, viruses that persist or are cleared in the host may leave unique molecular footprints that can alter disease susceptibility to cancer and may serve as an excellent window of early onset disease (Cadwell, Immunity 2015; 42:805-813). It was hypothesized that unique post-viral exposure signatures resulting from virus-host interactions could reflect a cascade of events that may alter the risk of developing HCC. Such signatures could serve as early detection biomarkers and offer knowledge about potentially modifiable factors for early onset HCC. In the study disclosed herein, serological samples from 899 individuals enrolled in a case-control study of liver cancer (NCT00913757; clinicaltrials.gov) were profiled using a synthetic virome technology, VirScan, based on a high-throughput sequencing method, to detect exposure history to all known human viruses (Xu et al., Science 2015; 348:aaa0698). A unique viral exposure signature (VES) that can discriminate HCC cases from CLD and healthy volunteers matched by age and sex is disclosed herein. The VES was validated in a prospective National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK) at-risk cohort for HCC.


Provided herein are methods of identifying a subject as being at risk for developing HCC. In some embodiments, the method includes detecting the presence or absence of antibodies to a plurality of viruses in a sample obtained from the subject; determining the presence of a viral exposure signature (VES) in the sample obtained from the subject; and identifying the subject as being at risk for developing HCC when the VES is present.


In some embodiments, the presence of the VES is determined by identifying antibodies to one or more of hepatitis C virus (HCV) genotype 3b, isolate Tr-Kj; HCV genotype 1b, isolate Taiwan; HCV genotype 1a, isolate 1; human cytomegalovirus, strain AD169; HCV genotype 6g, isolate JK046; HCV genotype 1b, isolate BK; HCV genotype 1c, isolate HC-G9; HCV genotype 1b, strain HC-J4; HCV genotype 4a, isolate ED43; hepatitis delta virus; HCV genotype 5a, isolate EUH1480; human cytomegalovirus; Crimean-Congo hemorrhagic fever virus, strain Nigeria/IbAr10200/1970; HCV genotype 1b, isolate HC-J1; influenza A virus, strain A/USSR/90/1977 H1N1; influenza A virus, strain A/Bangkok/1/1979 H3N2; HCV genotype 1c, isolate India; and Chapare virus, isolate Human/Bolivia/810419/2003.


In some embodiments, the presence of the VES is determined by not detecting antibodies to one or more of Epstein-Barr virus, strain B95-8; human rhinovirus 23; HCMV, strain Towne; human herpesvirus 2 (HHV-2), strain HG52; human herpesvirus 3; varicella-zoster virus, strain Dumas; Cercopithecine herpesvirus 16; human adenovirus C serotype 2; human astrovirus-1; human respiratory syncytial virus; human herpesvirus 6B, strain Z29; human herpesvirus 7, strain JI; human rhinovirus 14; Lordsdale virus, strain GII/Human/United Kingdom/Lordsdale/1993; human herpesvirus 1, strain KOS; human metapneumovirus, strain CAN97-83; coxsackievirus A16, strain G-10; Epstein-Barr virus, strain AG876; cowpox virus; human herpesvirus 1, strain 17; human adenovirus E serotype 4; human adenovirus F serotype 40; tanapox virus; human adenovirus C serotype 5; rhinovirus B; human herpesvirus 8; human herpesvirus 6A, strain Uganda-1102; human rhinovirus A serotype 89, strain 41467-Gallo; norovirus MD145, isolate GII/Human/United States/MD145-12/1987; molluscum contagiosum virus subtype 1; vaccinia virus, strain Copenhagen; poliovirus type 1, strain Sabin; orf virus; HHV-2, strain 333; hepatitis B virus; Epstein-Barr virus, strain GD1; human parainfluenza 3 virus, strain Wash/47885/57; HHV-2; human enterovirus 71, strain BrCr; human herpesvirus 6A, strain GS; Cercopithecine herpesvirus 1; influenza B virus, strain B/Yamagata/16/1988; and influenza A virus, strain A/Philippines/2/1982 H3N2.


In some embodiments, the plurality of viruses includes at least 10, at least 20, at least 30, at least 40, at least 50 or at least 60 of the viruses listed in Table 5A. In some examples, the plurality of viruses comprises or consists of the 61 viruses listed in Table 5A. In some examples, the plurality of viruses comprises or consists of the 31 viruses listed in Table 6.


In particular embodiments, step (ii) includes determining the presence of the VES in the sample obtained from the subject if (a) antibodies specific for three or more, four or more, five or more, six or more, or seven or more of hepatitis C virus (HCV) genotype 3b, isolate Tr-Kj; HCV genotype 1b, isolate Taiwan; HCV genotype 1a, isolate 1; human cytomegalovirus, strain AD169; HCV genotype 6g, isolate JK046; HCV genotype 1b, isolate BK; HCV genotype 1c, isolate HC-G9; HCV genotype 1b, strain HC-J4; HCV genotype 4a, isolate ED43; hepatitis delta virus; HCV genotype 5a, isolate EUH1480; human cytomegalovirus; Crimean-Congo hemorrhagic fever virus, strain Nigeria/IbAr10200/1970; HCV genotype 1b, isolate HC-J1; influenza A virus, strain A/USSR/90/1977 H1N1; influenza A virus, strain A/Bangkok/1/1979 H3N2; HCV genotype 1c, isolate India; and Chapare virus, isolate Human/Bolivia/810419/2003 are detected in the sample; and/or (b) antibodies specific for three or more, four or more, five or more, six or more, or seven or more of Epstein-Barr virus, strain B95-8; human rhinovirus 23; HCMV, strain Towne; human herpesvirus 2 (HHV-2), strain HG52; human herpesvirus 3; varicella-zoster virus, strain Dumas; Cercopithecine herpesvirus 16; human adenovirus C serotype 2; human astrovirus-1; human respiratory syncytial virus; human herpesvirus 6B, strain Z29; human herpesvirus 7, strain JI; human rhinovirus 14; Lordsdale virus, strain GII/Human/United Kingdom/Lordsdale/1993; human herpesvirus 1, strain KOS; human metapneumovirus, strain CAN97-83; coxsackievirus A16, strain G-10; Epstein-Barr virus, strain AG876; cowpox virus; human herpesvirus 1, strain 17; human adenovirus E serotype 4; human adenovirus F serotype 40; tanapox virus; human adenovirus C serotype 5; rhinovirus B; human herpesvirus 8; human herpesvirus 6A, strain Uganda-1102; human rhinovirus A serotype 89, strain 41467-Gallo; norovirus MD145, isolate GII/Human/United States/MD145-12/1987; molluscum contagiosum virus subtype 1; vaccinia virus, strain Copenhagen; poliovirus type 1, strain Sabin; orf virus; HHV-2, strain 333; hepatitis B virus; Epstein-Barr virus, strain GD1; human parainfluenza 3 virus, strain Wash/47885/57; HHV-2; human enterovirus 71, strain BrCr; human herpesvirus 6A, strain GS; Cercopithecine herpesvirus 1; influenza B virus, strain B/Yamagata/16/1988; and influenza A virus, strain A/Philippines/2/1982 H3N2 are not detected in the sample.


In some embodiments, the sample is a blood or serum sample. In some examples, the method further includes obtaining the biological sample from the subject. In some examples, the subject is a human subject.


The presence of antibodies can be detected using any immunoassay. In some embodiments, the antibodies are detected by phage immunoprecipitation, immunoblot or enzyme-linked immunosorbent assay.


Also provided is a phage display library expressing unique peptide epitopes from each of the viruses listed in Table 5A or Table 6. The phage display library can be used to determine the presence of the VES. In some embodiments, the phage display library expresses the peptides of SEQ ID NOs: 1-61 (see Table 5B). In other examples, the phage display library expresses the peptides of SEQ ID NOs: 62-102 (see Table 3B). In some examples, the phage display library expresses the peptides of SEQ ID NOs: 1-102. In some examples, the phage display library expresses peptides at least 90%, at least 95%, at least 96%, at least 97%, at least 98%, at least 99% or 100% identical to any of SEQ ID NOs: 1-61, SEQ ID NOs: 62-102 and SEQ ID NOs: 1-102.


Further provided is an array including unique peptide epitopes from each of the viruses listed in Table 5A or Table 6. The array can be used to determine the presence of the VES. In some examples the unique peptide epitopes comprise the peptides of SEQ ID NOs: 1-61 (shown in Table 5B), the peptides of SEQ ID NOs: 62-102 (shown in Table 3B), or the peptides of SEQ ID NOs: 1-102. In some examples, the peptides have amino acid sequences at least 90%, at least 95%, at least 96%, at least 97%, at least 98%, at least 99% or 100% identical to any of SEQ ID NOs: 1-61, SEQ ID NOs: 62-102 and SEQ ID NOs: 1-102.


In other embodiments provided herein, the method of identifying a subject as being at risk for developing HCC includes (i) detecting the presence or absence of antibodies specific for a plurality of viruses in a sample obtained from the subject, wherein the plurality of viruses includes hepatitis C virus (HCV) genotype 3b, isolate Tr-Kj; HCV genotype 1b, isolate Taiwan; HCV genotype 1a, isolate 1; human cytomegalovirus (HCMV) strain AD169; HCV genotype 6g, isolate JK046; Epstein-Barr virus (EBV), strain B95-8; human rhinovirus 23; HCMV strain Towne; HCV genotype 1b, isolate BK; and human herpesvirus 2 (HHV-2), strain HG52; and (ii) identifying the subject as being at risk for developing HCC if: (a) antibodies specific for HCV genotype 3b, isolate Tr-Kj; HCV genotype 1b, isolate Taiwan; HCV genotype 1a, isolate 1; HCMV strain AD169; HCV genotype 6g, isolate JK046; and/or HCV genotype 1b, isolate BK, are detected in the sample; and/or (b) antibodies specific for EBV, strain B95-8; human rhinovirus 23; HCMV strain Towne; and/or HHV-2, strain HG52, are not detected in the sample.


In some examples, step (ii) includes identifying the subject as being at risk for developing HCC if (a) antibodies specific for at least two, at least three, at least four, at least five or all six of HCV genotype 3b, isolate Tr-Kj; HCV genotype 1b, isolate Taiwan; HCV genotype 1a, isolate 1; HCMV strain AD169; HCV genotype 6g, isolate JK046; and HCV genotype 1b, isolate BK, are detected in the sample; and/or (b) antibodies specific for at least one, at least two, at least three or all four of EBV strain B95-8; human rhinovirus 23; HCMV strain Towne; and/or HHV-2 strain HG52, are not detected in the sample.
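By way of non-limiting illustration, the ten-virus decision rule described above can be sketched in Python. The marker labels, the function name `at_risk`, the default thresholds and the `require_both` switch are hypothetical conveniences introduced only for this sketch; the disclosure permits several threshold combinations and either the "and" or the "or" reading of conditions (a) and (b).

```python
# Hypothetical short labels for the ten viruses named in the text above.
POSITIVE_MARKERS = frozenset({
    "HCV 3b Tr-Kj", "HCV 1b Taiwan", "HCV 1a isolate 1",
    "HCMV AD169", "HCV 6g JK046", "HCV 1b BK",
})
NEGATIVE_MARKERS = frozenset({
    "EBV B95-8", "Human rhinovirus 23", "HCMV Towne", "HHV-2 HG52",
})


def at_risk(detected, min_positive=2, min_negative_absent=1, require_both=False):
    """Return True if the antibody calls satisfy the panel rule.

    detected: iterable of marker labels whose antibodies were detected.
    The thresholds and the and/or choice are parameters because the text
    allows several alternatives; the defaults are illustrative only.
    """
    detected = set(detected)
    positive_hits = len(POSITIVE_MARKERS & detected)      # condition (a)
    negative_absent = len(NEGATIVE_MARKERS - detected)    # condition (b)
    cond_a = positive_hits >= min_positive
    cond_b = negative_absent >= min_negative_absent
    return (cond_a and cond_b) if require_both else (cond_a or cond_b)
```

For instance, a sample with antibodies to two of the six HCC-enriched markers and to all four HCC-depleted markers satisfies condition (a) only, so it is flagged under the "or" reading but not when both conditions are required.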


In some examples, the sample is a blood or serum sample. In specific examples, the method further includes obtaining the biological sample from the subject.


In some examples, the antibodies are detected by phage immunoprecipitation, immunoblot or enzyme-linked immunosorbent assay.


In some embodiments of the disclosed methods, the method further includes treating a subject with an appropriate therapy to aid in the prevention or treatment of HCC. In some examples, the appropriate therapy includes vaccination against hepatitis B virus (HBV) (such as administration of Engerix-B®, Recombivax HB®, or Heplisav-B®), anti-viral treatment against HBV (such as administration of PEG-IFN, entecavir, tenofovir, lamivudine, adefovir, and/or telbivudine) and/or anti-viral treatment against HCV (such as administration of one or more of glecaprevir, sofosbuvir, daclatasvir, grazoprevir, and ombitasvir). Anti-viral drugs include, for example, nucleoside/nucleotide analogs (e.g., entecavir and tenofovir disoproxil fumarate), interferon, and lamivudine. In some examples, the method further includes performing a liver transplant in the subject with early stage HCC. In other examples, the method further includes liver resection of the subject with early stage HCC, with or without radiofrequency ablation (RFA).


In some embodiments, the method further includes active diagnostic monitoring of the subject with early stage HCC. For example, the subject can be monitored on a regular schedule, such as every 2 months, every 3 months, every 4 months, every 5 months or every 6 months, using ultrasound, contrast enhanced computerized tomography (CT) and/or magnetic resonance imaging (MRI).


In some examples, the additional treatment includes lifestyle or diet changes, including programs to reduce intravenous drug use, needle exchange programs, prevention of sexually-transmitted diseases, reducing or eliminating alcohol consumption, reducing obesity-related inflammation (such as by improving diet and increasing exercise), improving insulin resistance, increasing consumption of vegetables, consuming branched-chain amino acids and/or taking vitamin D. For some patients, such as those with hereditary hemochromatosis, iron overload can increase the risk of developing HCC. Thus, in some examples, the appropriate therapy includes treating iron overload. Aflatoxin B1, a known carcinogen produced by fungi of the Aspergillus species, is commonly found as a contaminant of grains, nuts, and vegetables in regions such as Asia and Africa. Thus, reducing aflatoxin exposure can also be used to prevent or treat HCC. Additional preventative therapies and treatments are described in Schutte et al., Gastrointest Tumors 3(1): 37-43, 2016 and Schutte et al., Gastrointest Tumors 2(4): 188-194, 2016.


III. Phage Immunoprecipitation Sequencing

In some embodiments of the present disclosure, the methods of detecting the presence or absence of specific antibodies in patient samples, and thereby determining the presence of the VES, can be performed using phage immunoprecipitation sequencing (PhIP-Seq). This method is a high-throughput method that allows for a comprehensive analysis of a subject's antibody repertoire (see U.S. Publication No. 2016/0320406; Larman et al., Nat. Biotechnol 29: 535-541, 2011; and Mohan et al., Nat Protoc 13:1958-1978, 2018; each of which is incorporated by reference herein).


PhIP-Seq is one method that can be used to rapidly detect the presence or absence of a plurality of virus-specific antibodies in a patient sample. Briefly, this method includes designing a peptide library that is representative of the viruses that are to be detected. In the context of the present disclosure, the library includes, for example, the 61 or 31 unique peptide epitopes of the 61-VES or 31-VES, respectively (see Tables 5A and 6). An oligonucleotide library encoding the peptides is constructed and PCR-amplified with adapters for cloning into a selected phage display vector to produce the phage display library. A patient sample, such as a blood or serum sample, is contacted with the phage display library to allow for phage-antibody complex formation and subsequent immunoprecipitation. The library of peptide-encoding oligonucleotide sequences is amplified by PCR directly from the immunoprecipitate, bar-coded and subjected to deep sequencing. Additional details of this method can be found in U.S. Publication No. 2016/0320406; Larman et al. (Nat. Biotechnol 29: 535-541, 2011); Mohan et al. (Nat Protoc 13:1958-1978, 2018); and the Novagen T7Select System Manual (available online).


The following examples are provided to illustrate certain particular features and/or embodiments. These examples should not be construed to limit the disclosure to the particular features or embodiments described.


EXAMPLES
Example 1: Methods

This example describes the materials and experimental procedures used for the studies described in Example 2.


Participants and VirScan Analysis

The patient cohort consisted of 899 sequentially enrolled participants (clinicaltrials.gov number: NCT00913757), including 150 HCC cases, 337 individuals with CLD as the at-risk group (HR or AR, used interchangeably) and 412 healthy volunteers as a population control (PC) matched by age and sex (FIG. 1A).


Study Cohorts

UMD cohort. To measure virome-host interplay, 899 participants were recruited. Participants were grouped as (1) population control (PC, n=412) if they were relatively healthy without any diagnosis of liver disease; (2) high-risk (HR, n=337) if they were diagnosed with chronic liver diseases (hepatitis B virus (HBV), hepatitis C virus (HCV), hepatitis delta virus (HDV), aflatoxins from fungal contamination, alcohol, nonalcoholic fatty-liver disease (NAFLD) and nonalcoholic steatohepatitis (NASH)); or (3) hepatocellular carcinoma (HCC, n=150) if they were diagnosed with HCC. All clinical measurements were covered by NCT00913757 (clinicaltrials.gov), with liver disease status as the enrollment criterion. Serum, matching buffy coat and cheek swab samples were collected from each individual.


NIDDK cohort. This cohort consisted of 173 patients with chronic liver disease that included 44 HCC cases with 129 controls matched by liver disease etiology, age and sex. Patients were enrolled in a natural history protocol (clinicaltrials.gov number: NCT0001971) with longitudinal follow-up, at least annually with serologic testing and imaging, for up to 20 years. Only cases with complete clinical and laboratory data and available longitudinal serologic samples were selected for analysis. The 44 HCC cases were sequentially identified out of 3,067 patients followed in this natural history study on chronic liver disease, and the controls were matched on a 2:1 basis as described above. HCC was diagnosed by radiologic imaging and/or liver biopsy as described by the American Association for the Study of Liver Diseases (AASLD) practice guidelines (see Marrero et al., “Diagnosis, Staging and Management of Hepatocellular Carcinoma: 2018 Practice Guidance by the American Association for the Study of Liver Diseases,” Hepatology 68(2): 723-750, 2018). For the purposes of this analysis, stored serum samples (−80° C.) were analyzed at study entry (baseline) and at recurrent time points until the time of HCC diagnosis.


Sample Collection

Blood samples were collected and stored at −80° C. (n=899 from UMD, n=488 from NIDDK). Buffy coat and cheek swab samples also were collected and stored at −80° C. (n=849 from UMD).


VirScan PhIP-Seq

Phage immunoprecipitation and sequencing were performed using a slightly modified version of previously published PhIP-Seq protocols. First, 96-deep-well plates were blocked with bovine serum albumin in TBST overnight on a rotator at 4° C. The diluted bacteriophage library (1 ml) was added to each blocked well. Serum samples, containing 2 µg IgG, were mixed with the bacteriophage library. Two technical replicates were set up for each sample. After an overnight rotation, protein A and protein G Dynabeads were added to each well. After another 4-hour incubation on a rotator at 4° C., the beads were washed three times with 400 µl of PhIP-Seq wash buffer using a 96-well magnetic stand. Next, the beads were resuspended in water and lysed at 95° C. for 10 minutes. Blank PBS samples (instead of serum) were also set up as negative controls on each plate. Two rounds of PCR were performed to amplify and multiplex the lysed bacteriophage DNA product. After the second round of PCR, PCR products from all 192 samples were pooled in equimolar amounts for gel extraction. After gel extraction, the size and quality of the libraries were assessed on an Agilent Bioanalyzer instrument. The DNA samples were aliquoted and stored at −80° C. until sequencing. Sequencing was performed using a 50 bp single-read protocol (1×50 bp) on the Illumina HiSeq 4000 platform, which yielded ~100 million to 200 million reads per lane (around 1,000,000 reads per sample in the current setting).


Raw data from the Illumina HiSeq 4000 platform were processed with BCL2FASTQ2 for demultiplexing and for converting binary base calls and qualities to fastq format. The fastq files were mapped to the original virome peptide reference sequences using the Bowtie program. Two sequencing samples were excluded from downstream analysis because each had fewer than 30,000 reads. The initial informatics and statistical analyses were performed using a slightly modified version of the previously published technique and in-house scripts. Briefly, scatter plots of the log10 of the −log10 (P values) and a sliding window of width 0.005 from 0 to 2 across the axis of one replicate were used. It was determined that the distribution of the threshold −log10 (P value) was centered around a mode of ~2.358 (FIG. 5B). The 593 hits that appeared in at least 3 of the 22 blank immunoprecipitations (PBS with beads alone) were eliminated. Also, any peptides that were not enriched in at least two of the samples were filtered out. A threshold number of hits per virus was set based on the size of the virus. If a hit shared a subsequence of at least 7 amino acids with any hit previously observed in any of the viruses from that sample, that hit was considered to be from a cross-reactive antibody and was ignored for that virus. The peptide hits that did not share any linear epitopes were summed to give the strain and species score data. The final score for each virus was compared to the threshold for that virus to determine whether the sample was positive for exposure to that viral species. The raw count data were calculated based on the −log10 (P value) cutoff of 2.358.
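By way of non-limiting illustration, the cross-reactivity filter and per-virus scoring described above can be sketched in Python. This is not the in-house script used in the study. The function names `shares_kmer` and `score_viruses`, the order in which hits are evaluated, and the default threshold of one hit for viruses absent from the threshold table are assumptions made only for this sketch.

```python
def shares_kmer(a, b, k=7):
    """True if peptides a and b share any identical stretch of k residues."""
    kmers = {a[i:i + k] for i in range(len(a) - k + 1)}
    return any(b[i:i + k] in kmers for i in range(len(b) - k + 1))


def score_viruses(hits, thresholds, k=7):
    """Count non-cross-reactive peptide hits per virus and call exposure.

    hits: list of (virus, peptide) pairs for one sample, in the order in
          which they are evaluated (the ordering is an assumption).
    thresholds: dict mapping virus -> minimum hit count for a positive call.
    """
    accepted = []   # peptides already counted for any virus in this sample
    scores = {}
    for virus, peptide in hits:
        scores.setdefault(virus, 0)
        if any(shares_kmer(peptide, seen, k) for seen in accepted):
            continue  # treated as a hit from a cross-reactive antibody
        accepted.append(peptide)
        scores[virus] += 1
    # Assumed default threshold of 1 when a virus has no table entry.
    calls = {v: scores[v] >= thresholds.get(v, 1) for v in scores}
    return scores, calls
```

In this sketch a peptide that shares seven contiguous residues with an already counted peptide contributes nothing to its own virus score, which mirrors the rule that such hits are ignored for that virus.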


DNA Sample Extraction

DNA extraction from buffy coat or lymphocyte samples was performed following the manufacturer's instruction (DNeasy Blood & Tissue Kit from Qiagen). The eluted DNA was stored at −20° C. for further analysis.


GWAS Platform

Illumina OmniExpress was applied for the SNP array. Genotyping was performed on 200 ng of genomic DNA using Illumina Infinium HTS Global Screening Arrays on an Illumina iScan system. The raw genotyping data were processed with Illumina GenomeStudio software 2.0. Quality control was performed using PLINK version 2.0 (available online). Samples with a genotyping call rate <95% were removed. SNPs with a minor allele frequency (MAF) <0.05, a Hardy-Weinberg equilibrium (HWE) p value <10⁻⁴, or a call rate <95% were excluded.
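By way of non-limiting illustration, the variant-level exclusion criteria above can be expressed as a simple filter. The study performed quality control with PLINK; the Python sketch below only restates the three thresholds, and the `SnpStats` record and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SnpStats:
    snp_id: str
    maf: float        # minor allele frequency
    hwe_p: float      # Hardy-Weinberg equilibrium test p value
    call_rate: float  # fraction of samples with a genotype call


def passes_qc(s, min_maf=0.05, min_hwe_p=1e-4, min_call_rate=0.95):
    """Keep a SNP only if it clears all three thresholds from the text."""
    return (s.maf >= min_maf
            and s.hwe_p >= min_hwe_p
            and s.call_rate >= min_call_rate)


def filter_snps(snps):
    """Return the SNPs that survive quality control."""
    return [s for s in snps if passes_qc(s)]
```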


GWAS Analysis

Variant quality control was performed. After filtering, 849 individuals and 713,111 SNPs remained for further analysis, with a total genotyping rate of 99.79%. Hardy-Weinberg equilibrium deviation was flagged at a p value <0.0001. Independent loci were identified for SNPs associated with the virus feature phenotype at P<5×10⁻⁷ using PLINK. LocusZoom was used to plot regional signals associated with the phenotype, with LD and recombination rate calculated from the 1000 Genomes Project. The LD structure of the signals was further investigated with Haploview. A linear regression with an additive model was applied to estimate the genotypic effect that each SNP contributed to the disease or phenotype.


ELISA Assay

IgG, IgA and IgG4 levels in serum were measured using human ELISA kits (Bethyl and Thermo Fisher) according to the manufacturers' instructions. ELISA results were read on a Bio-Rad instrument.


Statistical Methods

To identify differences between populations, XGBoost and LEfSe were used to calculate the significance of the association of virus exposure traits with HCC versus PC.


XGBoost

XGBoost (available online) is software for a machine learning method of regression and classification using ensemble learning with gradient tree boosting. It is designed to increase the scalability and acceleration of optimized computation for practical use. XGBoost includes three types of parameters: general, booster and task. Each type has several hyperparameters, such as the maximum depth of the regression trees, the number of weak learners, the learning rate, and regularization, that need to be tuned. These parameters were tuned using a grid search to maximize the mean AUC value computed from 5-fold cross validation on the training data. After finding the optimal values of the hyperparameters, the model was constructed using the following main parameter settings: max_depth=3, eta=0.1, subsample=1, colsample_bytree=0.5, and min_child_weight=1. XGBoost was then applied to the entire data set with 200 boosting iterations. To avoid over-fitting, model training was set to stop when no improvement in the AUC value was observed for 20 consecutive rounds (early_stopping_rounds=20). The best iteration model was used as the final model. XGBoost automatically conducts feature selection and calculates an importance for each feature. Multiple subsets of the features were tested to achieve the highest AUC, and a decision was made to take all of the output features for further analysis. For each training and testing sample, a virus feature score was also generated based on the features selected and implemented in the XGBoost classification prediction.
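By way of non-limiting illustration, the parameter settings listed above can be passed to the XGBoost Python package as sketched below. The five tuned values, the 200 boosting rounds, the 20-round early stopping and the 5-fold cross validation come from the text. The objective, the evaluation metric, the random seed, the function name `fit_ves_model` and the use of cross validation to choose the final round count are assumptions made only for this sketch, which further assumes that pandas is installed so that `xgb.cv` returns a data frame.

```python
# Parameter values quoted in the text; "objective" and "eval_metric" are
# assumptions for a binary HCC-versus-control classifier scored by AUC.
VES_PARAMS = {
    "max_depth": 3,
    "eta": 0.1,
    "subsample": 1,
    "colsample_bytree": 0.5,
    "min_child_weight": 1,
    "objective": "binary:logistic",
    "eval_metric": "auc",
}


def fit_ves_model(X, y, params=VES_PARAMS, rounds=200, patience=20, folds=5):
    """Cross-validate with early stopping, then refit at the best round count.

    X: 2-D feature matrix (samples x viral features); y: 0/1 labels.
    Requires the third-party xgboost package.
    """
    import xgboost as xgb  # lazy import: parameters stay inspectable without it

    dtrain = xgb.DMatrix(X, label=y)
    cv = xgb.cv(params, dtrain, num_boost_round=rounds, nfold=folds,
                early_stopping_rounds=patience, seed=0)
    best_rounds = len(cv)  # xgb.cv truncates its result at the best iteration
    model = xgb.train(params, dtrain, num_boost_round=best_rounds)
    importance = model.get_score(importance_type="gain")  # per-feature importance
    return model, cv["test-auc-mean"].iloc[-1], importance
```

The returned importance dictionary corresponds to the automatic feature ranking mentioned above; features that are never used in a split do not appear in it.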


LEfSe

The LEfSe method of analysis first compares the abundance of all viral clades (in this case between PC and HCC) by the Kruskal-Wallis test at a pre-defined α of 0.05. Significantly different vectors resulting from the comparison of relative abundances between PC and HCC are used as input for linear discriminant analysis (LDA), which produces an effect size and a p-value. The threshold on the logarithmic LDA score for discriminative features was set at 2.0. LEfSe also calculated the hierarchically organized viral taxa. The relative abundance data for the LEfSe test were prepared based on the strain and species score data.
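By way of non-limiting illustration, the first LEfSe step, a Kruskal-Wallis comparison of one feature between two groups, can be sketched in plain Python. This is not the LEfSe implementation. The function name is hypothetical, and the p value uses the chi-square approximation with one degree of freedom, which applies only to the two-group case shown here.

```python
import math


def kruskal_wallis_two_groups(x, y):
    """Kruskal-Wallis H test for two groups, with tie correction.

    Returns (H, p). With two groups, H has 1 degree of freedom under the
    chi-square approximation, so p = erfc(sqrt(H / 2)).
    """
    pooled = sorted((v, g) for g, vals in enumerate((x, y)) for v in vals)
    n = len(pooled)
    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0       # mean of 1-based ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = avg_rank
        t = j - i
        tie_term += t ** 3 - t
        i = j
    rank_sums = [0.0, 0.0]
    for (_, g), r in zip(pooled, ranks):
        rank_sums[g] += r
    sizes = (len(x), len(y))
    h = (12.0 / (n * (n + 1))
         * sum(rs ** 2 / m for rs, m in zip(rank_sums, sizes))
         - 3 * (n + 1))
    correction = 1.0 - tie_term / (n ** 3 - n)
    if correction == 0:
        return 0.0, 1.0                    # all values identical
    h /= correction
    return h, math.erfc(math.sqrt(max(h, 0.0) / 2.0))
```

A feature would be passed on to the LDA step in this sketch only if the returned p value is below the pre-defined α of 0.05.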


Additional Statistical Methods

All statistical analyses were conducted in R and GraphPad Prism 7 (La Jolla, Calif.). Data are presented either as means ± s.e.m. or as medians of continuous values and were analyzed by a two-sided Student's t-test or Mann-Whitney test, respectively, for comparison of two groups. Fisher's exact test or the χ2 test was used to calculate the statistical significance of categorical values between groups. Two-tailed P values of no more than 0.05 were considered significant. Linear regression was used to determine the correlation between two different variables.


Viral Feature Level, Clinical Outcome and ROC Curve

All HCC patients were classified into high, low or below-detection-limit viral feature score groups based on viral feature levels (FIGS. 4B and 6A). Kaplan-Meier estimates of overall survival were calculated for each group and compared using the log rank test. Hazard ratios and 95% confidence intervals were calculated using univariate and multivariate Cox proportional hazards models to assess associations between different viral feature levels and several clinical factors. The ability of clinical and viral features to predict HCC was assessed by computing receiver operating characteristic (ROC) curves using logistic regression in R. Area under the curve (AUC) values were calculated for these variables.
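By way of non-limiting illustration, an AUC value can be computed directly from per-sample scores and case/control labels. The study derived ROC curves from logistic regression in R; the Python sketch below instead uses the equivalent rank interpretation of the AUC, and the function name `roc_auc` is hypothetical.

```python
def roc_auc(scores, labels):
    """AUC as the probability that a random case outscores a random control.

    Ties count as one half. labels: 1 for case, 0 for control.
    """
    cases = [s for s, l in zip(scores, labels) if l == 1]
    controls = [s for s, l in zip(scores, labels) if l == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for c in cases:
        for d in controls:
            if c > d:
                wins += 1.0
            elif c == d:
                wins += 0.5
    return wins / (len(cases) * len(controls))
```

An AUC of 0.5 in this formulation means the score ranks cases and controls no better than chance, and 1.0 means every case scores above every control.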


Example 2: Viral Exposure Signature (VES) for Diagnosis of Hepatocellular Carcinoma (HCC)

This example describes the development of two viral exposure signatures (a first VES based on detection of 61 viral strains and a second VES based on detection of 31 viral strains) to identify subjects at risk for developing HCC.


The Landscape of Viral Exposure Profiles

VirScan applies a phage display library that covers 93,904 viral epitopes, representing 206 human viral species and over 1000 viral strains, to screen for previous exposure history (Xu et al., Science 2015; 348:aaa0698). A phage particle with an epitope that was recognized by a participant's antibody was immunoprecipitated (Phage-IP), and the encoding DNA barcode was then sequenced (FIG. 1A). A case-control design of the Maryland (NCI-UMD) cohort was used for the discovery of viral exposure profiles. The inclusion and enrollment of the study subjects are outlined in FIG. 5E, following the CONSORT guideline (Schulz et al., BMJ 340:c332, 2010) (Table 8). For the NCI-UMD cohort, VirScan Phage-IP products yielded 0.5-5 million single-end reads per serum sample, with a mean mapped read rate of 0.93 (FIG. 1B). A total of 30,033 viral epitopes were significantly enriched, with a −log10 p-value greater than the reproducibility threshold of 2.358 based on both replicates (FIGS. 5A-5B). It was noted that the composition of the viral types at the viral taxonomic level showed small yet noticeable differences between the obtained Phage-IP products and the library input (FIGS. 5D-5E), indicating a measurable difference between patient-derived data and the original input. When assessing viral richness among PC, HR and HCC, it was determined that the number of viral infections increased along with the sample size and reached saturation at a sample size over 200 (FIG. 1C). An average of 7 species of virus per sample was detected and more than 20 out of 206 viral species were found in four individuals (FIG. 1D). Overall, the distribution of viral species was similar among PC, HR and HCC (FIG. 1D), indicating no bias in the landscape of overall viral exposure profiles between different groups.
The abundance of the most prevalent viral species among all volunteers, such as human herpesvirus 4 (EBV) and human herpesvirus 5 (HCMV), was similar to that in a prior population study (FIG. 1E, Table 2A) (Xu et al., Science 2015; 348:aaa0698), and was consistent with previous epidemiology reports (Straus et al., Ann Intern Med 1993; 118:45-58; Ho, Rev Infect Dis 1990; 12 Suppl 7:S701-S710). However, the HCV infection rate (26.4%) in this study was relatively high, driven mainly by the AR (48.4%) and HCC (39.3%) groups (Table 2A; FIG. 10). A wide range of unique viral epitopes for each viral species that were recognized among different participants was detected, indicating that B-cell antigenicity to the same viral species is diverse among the participants (FIG. 1E, right panel; FIG. 10). Moreover, global compositions of the viral types at the viral taxonomic level showed small but noticeable differences between Phage-IP products and the library input (FIGS. 5C-5D).


To further assess the quality of VirScan, VirScan results were compared to available medical chart entries for HCV, HBV and HIV testing. VirScan had 45%, 47% and 70% specificity in detecting HCV, HBV and HIV, respectively, when compared to these medical record data (FIG. 2A). In contrast, its sensitivity was 84% for HCV, 48% for HBV and 73% for HIV. A majority of the viral status data from medical charts was unknown or missing (Table 2B), which makes this comparison suboptimal. Epitope enrichment of HCV1b, a major type associated with HCC (Bruno et al., Hepatology 2007; 46:1350-1356), was also examined. Consistently, an increase in peptide enrichment, corresponding mainly to the core, NS4 and NS5A of HCV1b, was observed among AR and HCC compared to PC, and these regions were consistent with the prediction score of B-cell antigenicity (FIG. 2B). The presence of HIV and other viruses known to co-infect with HIV (Xu et al., Science 2015; 348:aaa0698; Chang et al., Immunol Rev 2013; 254:114-142; Echavarria, Clin Microbiol Rev 2008; 21:704-715; Stover et al., J Infect Dis 2003; 187:1388-1396) was also examined. A significant increase of co-infection between HIV and human herpesvirus 5, human adenovirus C, human adenovirus D, human herpesvirus B or HBV was found, with a false discovery rate (FDR)<0.05 (FIG. 2C). Taken together, the above results revealed that VirScan is a reliable method to capture a broad spectrum of viral exposures with a serological test.
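By way of non-limiting illustration, sensitivity and specificity of an assay against a reference such as medical chart status can be computed as sketched below. The function name and the convention of marking unknown or missing chart entries as `None` are assumptions made only for this sketch; no study data are reproduced.

```python
def sensitivity_specificity(assay_positive, chart_positive):
    """Compare assay calls against a reference (e.g. medical chart) status.

    Both arguments are equal-length sequences. Assay entries are booleans;
    reference entries are booleans, or None when the chart status is
    unknown or missing, in which case the sample is skipped.
    """
    tp = fp = tn = fn = 0
    for assay, ref in zip(assay_positive, chart_positive):
        if ref is None:
            continue
        if ref and assay:
            tp += 1
        elif ref and not assay:
            fn += 1
        elif not ref and assay:
            fp += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity
```

Because samples with missing reference status are dropped, the estimates rest on a smaller subset when most chart entries are unknown, which is the limitation noted above.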


HCC-Associated VES

A gradient boosting approach was applied to search for the best-fit virus composition that can discriminate HCC from PC (FIG. 3A). Using 10-fold cross validation and 1,000 random permutations, it was found that a VES can significantly discriminate HCC from PC with an AUC value of 0.9 and 0.7 for training and cross validation, respectively (FIG. 3B). This signature consisted of unique peptides corresponding to 61 viral strains (FIG. 3C). Among them, 18 viruses were positively associated, while the remaining viruses were negatively associated, with HCC. HCV, including 11 unique variants such as 3b or Taiwan 1b among others, was the main contributing virus in the signature. This was not surprising since 39.3% of HCC cases from this cohort were HCV+. It was also found that herpesvirus 5, HDV, influenza virus H1N1 and influenza virus H3N2 were enriched in the HCC group. In contrast, 43 viruses, such as human respiratory syncytial virus and human rhinovirus 23, were preferentially depleted in the HCC group (Table 5A, FIG. 3C). Weighted VES scores of the 61 viruses differed significantly between HCC and PC (p<0.0001), between HCC and HR (p<0.0001), and between HR and PC (p<0.0001) (FIG. 3D). There was a significant increasing trend from PC to HR to HCC (p-trend<0.0001), suggesting that the VES was positively linked to hepatocarcinogenesis.


A phylogenetic analysis of the reactive epitopes of the 61 viral strains was performed to determine similarity among these HCC-related viruses (FIG. 3E). To search for common reactive viral epitopes either enriched or depleted in HCC, the analysis was restricted to the viral epitopes that ranked at the top for their association with HCC. These viruses can be divided into eight main branches where different HCV epitopes are clustered together with other viral epitopes, with the exception of cluster #6, which contains six HCV variants (out of 12 viruses) (FIG. 3E; Table 5B). In general, there was no clear enrichment within each branch for increased or decreased viruses, suggesting that varying viral epitopes involved in immunoreactivity are commonly shared among HCC cases. Since a majority of HCC patients have evidence of CLD, AR was also compared to HCC using the same gradient-boosting approach to avoid this confounding variable. It was found that an AR versus HCC VES can significantly discriminate HCC from AR or PC with AUC values similar to those of the VES for training and cross validation (FIGS. 11A-11B). A majority of these VES-related viral strains overlap (FIG. 11C). To further test the robustness of the VES, a 60/40 split was performed in which 60% of cases were used for VES discovery while the remaining 40% of cases were used for an independent prediction. In total, 1,000 permutations of the split were performed to establish the confidence interval (CI). Again, a similar VES was found, with a mean AUC of 0.7 for prediction (FIGS. 11D-11G).


Another statistically conservative method, the linear discriminant analysis of effect size (LEfSe, or LDA) (Segata et al., Genome Biol 2011; 12:R60), was used to search for HCC-associated viruses. Furthermore, pairwise comparisons between HCC and PC were performed for viral taxa at all levels, including DNA/RNA viruses, viral families, viral species and viral strains. In addition to VES at the strain level, this analysis also identified viral taxonomic differences by viral families, such as Flaviviridae of positive single-strand RNA viruses, Pneumoviridae of negative single-strand RNA viruses and Circoviridae of single-strand DNA viruses. These analyses resulted in 341 viruses that can significantly distinguish HCC from PC. Among them, several HCV variants, herpesvirus 5 variants, Norwalk virus variants, cytomegalovirus, an adenovirus variant and astrovirus-1 were uniquely different between PC and HCC (FIG. 6B). A total of 31 viruses overlapped between XGBoost and LEfSe (Table 6) and were different between PC and HCC. Unsupervised hierarchical clustering of the abundances of the top-ranking viruses revealed that HCC cases were more closely related to HR than to PC, consistent with the VES prediction score (FIG. 6A, FIG. 3D). Collectively, these results indicate that a unique set of VES is robust in defining HCC.


Validation of the VES in HCC

To further validate the two VES identified above for their clinical utility, VirScan profiles in the at-risk NIDDK cohort for HCC were analyzed. This cohort consisted of 173 CLD patients (NIDDK-HR) who were enrolled in a natural history study of liver disease with a follow-up of up to 20 years (Table 1; FIG. 8A). Among them, 44 individuals developed HCC. This cohort contained serum samples collected at enrollment (baseline) and at various follow-up time points until a diagnosis of HCC (diagnosis). Logistic regression analysis was performed using the VES from either all 61 viruses (FIG. 4) or the overlapping 31 viruses (FIG. 7), and receiver-operating characteristic (ROC) curves were generated for the Maryland cohort and the NIDDK-HR cohort. The areas under the curve (AUC) were 0.89, 95% CI (0.86-0.92) for 61-VES (FIG. 4A) and 0.85, 95% CI (0.81-0.88) for 31-VES in the Maryland cohort (FIG. 7A). It was observed that levels of 61-VES scores varied among HCC cases in the Maryland cohort, with some below the detection limit and others having either low or high levels (FIG. 4B). Patients with a high level had significantly worse survival compared to patients with a low level or a level below the detection limit (log rank p=0.026, and p-trend=0.033) (FIG. 4C). Similar results were observed with the 31-VES. Among patients from the NIDDK cohort, VirScan data were available for 40 HCC cases at baseline, 129 controls at baseline, 44 HCC cases at diagnosis and 106 controls at diagnosis. The average number of viral species in each case of the NIDDK cohort was 6.


Table 9A shows the results from univariable and multivariable Cox model survival analysis on several clinicopathologic variables to clarify the independent and additional prognostic value of VES. Among patients from the NIDDK cohort, VirScan data were available for 40 HCC cases at baseline, 129 controls at baseline, 44 HCC cases at diagnosis and 106 controls at diagnosis. It was found that the AUC values were 0.98, 95% CI (0.97-1.00) at diagnosis (FIG. 4D) and 0.91, 95% CI (0.87-0.96) at baseline (FIG. 4E) with 61-VES. Similar results were obtained with 31-VES. The performance of the VES was superior to alpha-fetoprotein (AFP), a known HCC diagnostic marker used in the clinic. The 31-VES yielded AUC values of 0.92, 95% CI (0.87-0.97) and 0.81, 95% CI (0.74-0.89) at diagnosis and at baseline, respectively, when combined with AFP. The DeLong test showed a significant improvement between VES and AFP (p values 4×10−12 and 8×10−10 at baseline and diagnosis, respectively) (FIGS. 4D and 4E). Similar trends (p-trend=0.19) were also found between the levels of VES and overall survival among 44 patients in the NIDDK cohort (FIG. 8B). In order to assess the time-dependent performance of VES to predict the onset of HCC, 104 cancer-free controls and 40 HCC cases (from the NIDDK validation cohort) for which at least two time points were available were analyzed. In the context of survival modeling, an event was defined as the occurrence of an HCC diagnosis. Under this interpretation, censoring time was defined as the time difference between baseline and follow-up within the cancer-free control group, whereas event time was defined as the time difference between baseline and HCC diagnosis within the HCC group. Table 9B shows results from a multivariable Cox regression model generated to predict the occurrence of HCC diagnosis based on VES scores at baseline, adjusted for clinical prognostic variables. 
Moreover, a time-dependent ROC curve analysis (Bansal and Heagerty, Diagn Progn Res 3:14, 2019; Blanche et al., Stat Med 32: 5381-5397, 2013) was performed to assess the performance of VES over a range of landmark time points from 1 to 10 years relative to baseline (FIGS. 4F and 8C), which appears very robust and stable across this range. It was found that patients who developed HCC had, on average, much higher VES scores at baseline and at different times of follow-up until HCC diagnosis, when compared to cancer-free at-risk patients who were followed up at a similar time interval without developing HCC (FIG. 4G). A statistically significant increase in viral exposures (p<0.05) was observed only for patients who developed HCC over time during the surveillance period in the NIDDK cohort. It appears that HCC cases with a high viral exposure had a more aggressive disease than those with a low viral exposure, and that VES was a robust indicator of early onset of HCC in this prospective cohort. Furthermore, the prediction performance of AR versus HCC based on VES was superior to other clinical indicators from the patient charts, such as AFP, alanine transaminase (ALT), cirrhosis and platelet counts, as well as the combination of all key clinical variables, as shown by analyses of the NIDDK cohort at baseline (FIG. 4H), which agree qualitatively with those of NIDDK at diagnosis (FIG. 8D) and the NCI-UMD cohort (FIG. 8E). An association of VES and HCC was similarly found in both HCV-positive and HCV-negative patients (Table 9C).









TABLE 1

Clinical Characteristics of the Patients*

| Variable | Without HCC (N = 129) | With HCC (N = 44) | P Value |
|---|---|---|---|
| Age-year | | | 0.12 |
| Median (Range) | 51 (23-79) | 54 (23-79) | |
| Missing data | 1 | 0 | |
| Sex-no. (%) | | | |
| Female | 40 (31.0) | 14 (31.8) | 1.00 |
| Male | 89 (69.0) | 30 (68.2) | |
| Missing data | 0 | 0 | |
| Race-no. (%) | | | 0.69 |
| European American | 63 (48.8) | 22 (50.0) | |
| African American | 29 (22.5) | 12 (27.3) | |
| Asian American | 26 (20.2) | 8 (18.2) | |
| Other | 2 (1.6) | 0 | |
| Missing data | 9 (7.0) | 2 (4.6) | |
| HCV only-no. (%) | 98 (76.0) | 27 (61.4) | 0.61 |
| HBV only-no. (%) | 18 (14.0) | 7 (15.9) | |
| HBV + HCV-no. | 2 (1.6) | 1 (2.3) | |
| HBV + HDV-no. | 4 (3.1) | 3 (6.8) | |
| Others not hepatitis | 7 (5.4) | 6 (13.6) | |
| Cirrhosis-no. (%) | 15 (11.6) | 28 (63.6) | <0.001 |
| Missing data | 2 (1.6) | 4 (9.1) | |
| Alanine aminotransferase-no. (%) | | | <0.01 |
| Elevated (>50 U/L) | 84 (65.1) | 28 (63.6) | |
| Normal (≤50 U/L) | 45 (34.9) | 16 (36.4) | |
| Alpha-fetoprotein-no. (%) | | | <0.001 |
| >20 ng/mL | 9 (7.0) | 21 (47.7) | |
| ≤20 ng/mL | 120 (93.0) | 23 (52.3) | |
| Missing data | 0 | 0 | |
| Survival (months) | | | |
| Median | NA | 15.2 | |
| Range | NA | 0.07-131.8 | |
| Missing data (%) | NA | 1 (2.3) | |

*The clinical characteristics of the 173 at-risk patients in the prospective NIDDK cohort.







Phenotype-Genotype Association with VES


To determine if host genetic background may be linked to VES, a genome-wide association study (GWAS) was performed in the Maryland cohort, as this approach may help identify susceptibility variants related to viral infection and cancer (McKay J et al., Nat Genet 2017; 49:1126-1132; Pharoah et al., Nat Genet 2013; 45:362-370; Fumagalli et al., PLoS Genet 2010; 6:e1000849). After assessment using the genetic quality control measures, 849 participants (PC, n=402; HR, n=323; HCC, n=124) were included in the analysis. Following the removal of monoallelic SNPs and those that deviated from Hardy-Weinberg equilibrium, an association test was performed for all the remaining SNPs. To further assess the quality of the GWAS data, it was determined whether there was an association between a SNP, rs12979860 in IL28B, and HCV infection, because its favorable genotype, CC, has been shown to be associated with better HCV treatment response or natural clearance. It was found that rs12979860-CC was significantly associated with HCV genotype 3, with an odds ratio (OR) of 2.74 (95% CI 1.14-7.97) under a dominant model (Table 3A). Furthermore, the association of this SNP with the abundances of 375 epitopes of HCV genotypes 2 and 3 was evaluated. The CC genotype was found to be associated with a decreased abundance of core epitopes but an increased abundance of NS5B epitopes in the HCV genome (FIG. 7C; Table 3B), consistent with a recent study (Ansari et al., Nat Genet 49:666-673, 2017). To assess VES-associated SNPs, HCC and PC groups were combined and then divided into two groups based on dichotomization of VES scores. In the associated quantile-quantile plots (FIG. 8B), a wider spread with small differences in allele frequencies was evident with increased slope of the line. Principal-component analysis based on genotyping revealed differences in ethnicity (FIG. 7B).
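
The quality-control step of removing monoallelic SNPs and SNPs that deviate from Hardy-Weinberg equilibrium can be illustrated with a simple chi-square check. The following Python sketch is an assumption-laden illustration only: the text above does not state which test or cutoff was used, so the one-degree-of-freedom chi-square statistic, the 3.841 cutoff (alpha = 0.05), and the names `hwe_chi_square` and `passes_qc` are all hypothetical choices.

```python
def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square statistic (1 degree of freedom) for deviation from HWE.

    Returns None for a monoallelic SNP, where the test is undefined.
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotyped samples")
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p                      # frequency of allele B
    if p == 0.0 or q == 0.0:
        return None
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


CHI2_CRITICAL_1DF_05 = 3.841  # assumed cutoff: alpha = 0.05, 1 df


def passes_qc(n_aa, n_ab, n_bb, cutoff=CHI2_CRITICAL_1DF_05):
    """True if the SNP is polymorphic and does not exceed the cutoff."""
    stat = hwe_chi_square(n_aa, n_ab, n_bb)
    return stat is not None and stat <= cutoff


# PC genotype counts for rs16960234 as listed in Table 4 (A/A, A/G, G/G).
print(passes_qc(354, 46, 2))
```

In practice, GWAS pipelines often use an exact test and a much stricter threshold; the chi-square form is shown here only because it makes the comparison of observed and expected genotype counts explicit.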


Manhattan plot analysis revealed several SNPs with much larger differences between high and low VES scores, with p-values <10−5 (FIG. 9A). Three SNPs, rs34725101, rs4483229, and rs16960234, in three different genomic regions corresponding to RHOA, EPB41L4B and CDH13, respectively, were linked to VES with p-values <10−7, an acceptable standard for common-variant GWAS (Table 3C and FIG. 9A). Among them, rs16960234 was further analyzed because both major and minor alleles of this variant could be detected in this cohort. High linkage disequilibrium (LD) SNPs (r2>0.6) were also found for rs16960234, but not for rs34725101 or rs4483229 (FIGS. 9B-9C; Table 7). Seven of the high-LD SNPs of rs16960234 were expression quantitative trait loci (eQTL) for CDH13 in the genotype-tissue expression (GTEx) database (McKay J et al., Nat Genet 2017; 49:1126-1132). The CDH13 expression levels in tibial artery tissue from carriers of the risk/protective G/G genotype of rs16960234 were significantly higher than those from carriers of the protective/risk A/A genotype (FIG. 9D). To obtain the genotypic effects of rs16960234 in HCC or HR, a logistic regression model was constructed and the genotypic odds ratio of this SNP in HR or HCC was calculated relative to PC (FIG. 9E). The rs16960234 G/G genotype showed an increased, although not statistically significant, risk in HR vs. PC, OR: 1.89 (0.30-11.4), and a significantly higher risk in HCC vs. PC, OR: 7.22 (1.30-40.0) (FIG. 9E; Table 4). Consistent with the genotypic effect in HCC, the VES score also showed gradual increases in heterozygous A/G and homozygous G/G carriers compared with A/A carriers (FIGS. 9F-9G). Thus, rs16960234 and its linked gene CDH13 may be associated with VES and contribute to the disease risk.
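
For illustration, a genotypic odds ratio of the kind reported above can be computed directly from the genotype counts in Table 4. The following Python sketch uses the unadjusted cross-product odds ratio with a Wald 95% confidence interval on the log scale. The interval method is an assumption (the text states only that logistic regression was used), and the function name `odds_ratio_wald` is hypothetical; for the G/G versus A/A comparison in HCC vs. PC the result should land close to the tabulated 7.22 (1.30-40.0).

```python
import math


def odds_ratio_wald(case_exposed, case_ref, control_exposed, control_ref, z=1.96):
    """Unadjusted odds ratio with a Wald confidence interval.

    All four counts must be non-zero; a zero cell (for example the A/A
    genotype of rs34725101 in Table 4) makes the estimate undefined.
    """
    counts = (case_exposed, case_ref, control_exposed, control_ref)
    if min(counts) <= 0:
        raise ValueError("all four cell counts must be positive")
    odds_ratio = (case_exposed * control_ref) / (case_ref * control_exposed)
    se_log_or = math.sqrt(sum(1.0 / c for c in counts))
    log_or = math.log(odds_ratio)
    lower = math.exp(log_or - z * se_log_or)
    upper = math.exp(log_or + z * se_log_or)
    return odds_ratio, lower, upper


# rs16960234, G/G vs. A/A, HCC vs. PC (counts from Table 4):
# HCC G/G = 4, HCC A/A = 98, PC G/G = 2, PC A/A = 354
or_gg, lo_gg, hi_gg = odds_ratio_wald(4, 98, 2, 354)
print(round(or_gg, 2), round(lo_gg, 2), round(hi_gg, 1))
```

The wide interval reflects the very small number of G/G carriers (four HCC cases and two population controls), which is worth keeping in mind when interpreting the point estimate.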


Diagnostic Applications

Detecting cancer at an early stage, preferably before it is symptomatic, may provide an opportunity to achieve a cure and reduce cancer-related mortality. Evidence suggests that earlier detection of cancer improves survival for some cancer types, such as cervical and colon cancers. A conventional approach is to develop biomarkers specific for cancer cells to aid in early cancer diagnosis. CancerSEEK is an emerging platform that has achieved good sensitivity and specificity for multiple clinically detected cancer types by profiling circulating cell-free DNA (ctDNA) presumably shed from tumor cells (Cohen et al., Science 2018; 359:926-930). A recent study offers a cautionary note for measuring cancer gene panels using ctDNA because of its high false positivity among healthy individuals (Liu et al., Ann Oncol 2019; 30:464-470). The molecular and biological heterogeneity of cancer cells, which arises from a complex etiological landscape, creates a dilemma as to how best to design cancer-specific diagnostic panels that are effective for early cancer detection. As such, there has been continuous debate in recent decades, for many malignant diseases including HCC, as to whether available methods are adequate to achieve this goal (Sherman et al., Hepatology 2012; 56:793-796; Shieh et al., Nat Rev Clin Oncol 2016; 13:550-56).


HCC is a unique malignancy for which most major causative etiologies are known (Wang and Thorgeirsson, Oncology 2014; 1:5). However, defining biomarkers specific for HCC cells has been challenging because of its complex genomic landscape with extensive intratumor and intertumor heterogeneities. Are there common features shared among HCC patients that can be used as a surrogate for early detection? An emerging concept is that an interplay between viral infection and host genetic background is crucial for maintaining virome homeostasis or causing human disease (Virgin, Cell 2014; 157:142-150). The study disclosed herein assessed how an individual's history of viral exposures is associated with their risk of developing HCC. Using a synthetic viral scan technology (VirScan) with a simple blood test (Xu et al., Science 2015; 348:aaa0698), a VES was identified that could discriminate HCC with high confidence from individuals with chronic liver diseases or from healthy volunteers. Remarkably, this signature was able to identify individuals at a median follow-up of 8.8 years prior to a clinical diagnosis of HCC. Thus, these results offer a sensitive tool applicable to the HCC surveillance program to improve early diagnosis.


The current study took advantage of a simple tool to profile serological samples to link an individual's history of viral infection and corresponding response to early onset HCC. The strategy was first to search for a VES using a case-control design that included HCC cases as well as at-risk individuals with chronic liver diseases and healthy volunteers matched by age and sex. A VES that can discriminate HCC from at-risk and healthy individuals was then validated using a prospective cohort of sequentially enrolled at-risk patients who were followed up for the development of HCC. The VES consists of known HCC etiologies such as HCV, HBV and HDV, but also includes other viruses such as herpesviruses 4 and 5, Crimean-Congo hemorrhagic fever virus, cytomegalovirus, and influenza A virus, among others. A few features are noted. First, HCV appears to be a major etiology driving the VES, but extensive heterogeneity in HCV subtypes is noted in both the Maryland and NIDDK cohorts. Second, a set of viruses is enriched while many others, including HBV, are depleted in HCC patients.


The current method of VirScan is based on the phage immunoprecipitation sequencing (PhIP-Seq) technology that provides a powerful approach for analyzing antibody-repertoire binding specificities with high throughput and at low cost to all known human viruses (Mohan et al., Nat Protoc 2018; 13:1958-1978). Comparing VirScan results with HCV and HBV status from medical chart of the UMD cohort, it was found that VirScan shows great specificity for both HCV and HBV, and good sensitivity for HCV but to a lesser extent for HBV. HCV encodes a large polyprotein consisting of ˜3,000 amino acids, which is cleaved co- and post-translationally into ten different proteins associated with intracellular membranes (Bartenschlager et al., Nat Rev Microbiol 2013; 11:482-496). Consistently, HCV antigen reactivity largely overlapped with the predicted antigenicity score by the B-cell epitope prediction method coinciding with peptides to be presented at the surface of the cellular membrane. Consistent with early reports for the likelihood of coinfection of HIV and other viruses associated with AIDS and non-AIDS diseases (Xu et al., Science 2015; 348:aaa0698; Slyker et al., J Infect Dis 2013; 207:1798-1806; Lichtner et al., J Infect Dis 2015; 211:178-186), evidence of coinfection between HIV and viruses such as HBV, herpesvirus 8 and adenovirus D, influenza B virus, adenovirus C, and herpesvirus 5 was found in patients enrolled in the Maryland cohort. History of HCV infection is prevalent among at-risk (48%), HCC patients (39%) and healthy volunteers (4%) who reside in Maryland. This is in contrast to an estimated prevalence of about 4.6 million persons (˜1.5%) infected with HCV in the U.S. (Edlin et al., Hepatology 2015; 62:1353-1363). It should be noted that 7.5%-44% of incarcerated individuals and 4%-38% of hospitalized patients tested positive for HCV (Edlin et al., Hepatology 2015; 62:1353-1363), suggesting that the current surveys underestimate the prevalence of HCV infection. 
In contrast, while 2.6% of the Maryland healthy individuals showed evidence of HBV infection, more than 800,000 chronic HBV carriers were detected during 2011-2012 in the noninstitutionalized U.S. population (Roberts et al., Hepatology 2016; 63:388-397). The current survey methods may underestimate the prevalence of HBV and HCV. This is important as both HBV and HCV are major causative factors for HCC. Collectively, VirScan is a reliable method for profiling viral exposure that is scalable with respect to sample throughput and has a relatively low cost per analysis, making it amenable to surveillance and early detection of HCC.









TABLE 2A

Viral Frequency in 899 patients and volunteers from NCI-UMD cohort

| Viral Species | Total (N = 899) | PC¹ (N = 412) | HR² (N = 337) | HCC³ (N = 150) |
|---|---|---|---|---|
| Human herpesvirus 4 | 97.00% | 97.82% | 96.74% | 95.33% |
| Human herpesvirus 5 | 67.30% | 60.92% | 72.11% | 74.00% |
| Human herpesvirus 1 | 62.85% | 60.44% | 65.58% | 63.33% |
| Human respiratory syncytial virus | 50.06% | 55.10% | 45.70% | 46.00% |
| Influenza A virus | 48.16% | 50.00% | 48.07% | 43.33% |
| Human adenovirus C | 41.60% | 46.36% | 35.91% | 41.33% |
| Human herpesvirus 6B | 37.82% | 43.93% | 32.64% | 32.67% |
| Human herpesvirus 3 | 32.59% | 37.86% | 27.00% | 30.67% |
| Human herpesvirus 2 | 30.59% | 31.55% | 31.16% | 26.67% |
| Influenza B virus | 27.47% | 30.10% | 27.30% | 20.67% |
| Hepatitis C virus | 26.36% | 3.64% | 48.37% | 39.33% |
| Rhinovirus A | 22.80% | 27.91% | 19.88% | 15.33% |
| Rhinovirus B | 20.47% | 23.06% | 18.99% | 16.67% |
| Human herpesvirus 7 | 12.35% | 15.53% | 10.09% | 8.67% |
| Enterovirus C | 10.79% | 11.65% | 11.57% | 6.67% |
| Human adenovirus B | 8.23% | 8.01% | 9.50% | 6.00% |
| Human immunodeficiency virus 1 | 8.01% | 7.28% | 9.79% | 6.00% |
| Human herpesvirus 6A | 6.90% | 8.25% | 5.34% | 6.67% |
| Human adenovirus D | 5.23% | 5.34% | 5.64% | 4.00% |
| Vaccinia virus | 4.89% | 5.83% | 3.26% | 6.00% |
| Human herpesvirus 8 | 4.34% | 4.61% | 3.86% | 4.67% |
| Cowpox virus | 2.78% | 4.37% | 1.78% | 0.67% |
| Papiine herpesvirus 2 | 2.78% | 2.91% | 2.37% | 3.33% |
| Hepatitis B virus | 2.56% | 2.67% | 2.67% | 2.00% |
| Mamastrovirus 1 | 2.45% | 3.40% | 0.89% | 3.33% |
| Human adenovirus F | 2.11% | 2.43% | 1.19% | 3.33% |
| Orf virus | 1.89% | 1.94% | 1.48% | 2.67% |
| Human parainfluenza virus 3 | 1.78% | 2.67% | 0.59% | 2.00% |
| Macacine herpesvirus 1 | 1.78% | 1.70% | 2.37% | 0.67% |
| Molluscum contagiosum virus | 1.67% | 1.94% | 1.48% | 1.33% |
| Human metapneumovirus | 1.56% | 2.18% | 1.19% | 0.67% |
| Human adenovirus A | 1.45% | 0.97% | 2.08% | 1.33% |
| Rotavirus A | 1.45% | 1.70% | 1.48% | 0.67% |
| Torque teno virus | 1.45% | 0.49% | 2.67% | 1.33% |
| Influenza C virus | 1.11% | 1.21% | 0.59% | 2.00% |
| Enterovirus A | 1.00% | 1.46% | 0.89% | 0.00% |
| Norwalk virus | 1.00% | 1.46% | 0.89% | 0.00% |
| Alphapapillomavirus 9 | 0.89% | 0.73% | 1.19% | 0.67% |
| Betapapillomavirus 1 | 0.78% | 0.73% | 1.19% | 0.00% |
| Tanapox virus | 0.78% | 1.21% | 0.59% | 0.00% |
| Aichivirus A | 0.67% | 1.21% | 0.00% | 0.67% |
| Enterovirus B | 0.67% | 0.97% | 0.30% | 0.67% |
| Human coronavirus HKU1 | 0.56% | 0.73% | 0.30% | 0.67% |
| Human parainfluenza virus 2 | 0.44% | 0.73% | 0.30% | 0.00% |
| Variola virus | 0.44% | 0.24% | 0.59% | 0.67% |
| Betapapillomavirus 2 | 0.33% | 0.49% | 0.30% | 0.00% |
| Human immunodeficiency virus 2 | 0.33% | 0.49% | 0.30% | 0.00% |
| Ross River virus | 0.33% | 0.49% | 0.30% | 0.00% |
| Venezuelan equine encephalitis virus | 0.33% | 0.24% | 0.30% | 0.67% |
| Betacoronavirus 1 | 0.22% | 0.00% | 0.59% | 0.00% |
| Hepatitis E virus | 0.22% | 0.24% | 0.30% | 0.00% |
| Human adenovirus E | 0.22% | 0.49% | 0.00% | 0.00% |
| Measles virus | 0.22% | 0.24% | 0.00% | 0.67% |
| Rubella virus | 0.22% | 0.00% | 0.30% | 0.67% |
| Yaba monkey tumor virus | 0.22% | 0.24% | 0.30% | 0.00% |
| Alphapapillomavirus 10 | 0.11% | 0.00% | 0.30% | 0.00% |
| Crimean-Congo hemorrhagic fever virus | 0.11% | 0.24% | 0.00% | 0.00% |
| Hendra virus | 0.11% | 0.24% | 0.00% | 0.00% |
| Hepatitis delta virus | 0.11% | 0.00% | 0.30% | 0.00% |
| Human parainfluenza virus 1 | 0.11% | 0.24% | 0.00% | 0.00% |
| Human parainfluenza virus 4 | 0.11% | 0.00% | 0.30% | 0.00% |
| Marburg marburgvirus | 0.11% | 0.24% | 0.00% | 0.00% |
| Primate T-lymphotropic virus 1 | 0.11% | 0.00% | 0.30% | 0.00% |
| Pseudocowpox virus | 0.11% | 0.24% | 0.00% | 0.00% |
| Rotavirus B | 0.11% | 0.00% | 0.30% | 0.00% |

¹PC: population control
²HR: high risk group
³HCC: hepatocellular carcinoma














TABLE 2B

Comparison of VirScan with HBV, HCV, and HIV from medical charts

| Clinical Variable | VirScan Negative | VirScan Positive |
|---|---|---|
| Hepatitis B Virus (HBV) | | |
| HBV surface antibody | | |
| Negative | 88 | 75 |
| Positive | 48 | 27 |
| Unknown/Missing | 132 | 117 |
| HBV core antibody | | |
| Negative | 80 | 18 |
| Positive | 82 | 10 |
| Unknown/Missing | 250 | 47 |
| Hepatitis C Virus (HCV) | | |
| HCV IgG antibody | | |
| Not detected | 27 | 34 |
| Detected | 24 | 123 |
| Unknown/Missing | 95 | 184 |
| HCV RNA PCR | | |
| Negative | 2 | 2 |
| Positive | 2 | 18 |
| Unknown/Missing | 142 | 321 |
| Human Immunodeficiency Virus (HIV) | | |
| Negative | 268 | 116 |
| Positive | 7 | 19 |
| Unknown/Missing | 58 | 19 |

















TABLE 3A

The association between SNP in IL28B gene (rs12979860) and HCV genotype 2 & genotype 3

| Genotype | NO | YES | ORᵃ (95% CI) | P |
|---|---|---|---|---|
| HCV-2 status | | | | |
| TT | 206 | 13 | ref | |
| CT + CC | 575 | 53 | 1.46 (0.77-2.98) | 0.30 |
| HCV-3 status | | | | |
| TT | 213 | 6 | ref | |
| CT + CC | 583 | 45 | 2.74 (1.14-7.97) | 0.020 |
| HCV-2 & 3 status | | | | |
| TT | 201 | 18 | ref | |
| CT + CC | 539 | 89 | 1.84 (1.07-3.34) | 0.024 |

ᵃOR: Odds Ratio














TABLE 3B







SNP in IL28B gene (rs12979860) association with the epitopes of HCV genotype 2 & 3













HCV genotype 2 & 3 | HCV amino acid position (start to end) | Viral protein | -log10 (p-value) | Epitope sequence | SEQ ID NO:















HCV-2c (isolate BEBE1)
   1-56
Core
2.496
MSTNPKPQRKTKRNTNRRPQDV
62






KFPGGGQIVGGVYLLPRRGPRL







GVRAARKTSERS






HCV-2b (isolate HC-J8)
   1-56
Core
2.453
MSTNPKPQRKTKRNTNRRPQDV
63






KFPGGGQIVGGVYLLPRRGPRL







GVRATRKTSERS






HCV-2k (isolate VAT96)
   1-56
Core
2.334
MSTNPKPQRKTKRNTNRRPQDV
64






KFPGGGQIVGGVYLLPRRGPRL







GVRATRKTSERS






HCV-2a (isolate HC-J6)
   1-56
Core
2.218
MSTNPKPQRKTKRNTNRRPQDV
65






KFPGGGQIVGGVYLLPRRGPRL







GVRATRKTSERS






HCV-2a (isolate JFH-1)
   1-56
Core
2.105
MSTNPKPQRKTKRNTNRRPEDV
66






KFPGGGQIVGGVYLLPRRGPRL







GVRTTRKTSERS






HCV-3a (isolate NZL1)
   1-56
Core
1.776
MSTLPKPQRKTKRNTIRRPQDV
67






KFPGGGQIVGGVYVLPRRGPRL







GVRATRKTSERS






HCV-3k (isolate JK049)
   1-56
Core
1.771
MSTLPKPQRITKRNINRRPQDV
68






KFPGGGQIVGGVYVLPRRGPKL







GVRAVRKTSERS






HCV-3b (isolate Tr-Kj)
   1-56
Core
1.483
MSTLPKPKRQTKRNTLRRPKNV
69






KFPAGGQIVGEVYVLPRRGPQL







GVREVRKTSERS






HCV-2c (isolate BEBE1)
  29-84
Core
2.470
QIVGGVYLLPRRGPRLGVRAAR
70






KTSERSQPRGRRQPIPKDRRST







GKSWGRPGYPWP






HCV-2b (isolate HC-J8)
  29-84
Core
2.438
QIVGGVYLLPRRGPRLGVRATR
71






KTSERSQPRGRRQPIPKDRRST







GKSWGKPGYPWP






HCV-2a (isolate HC-J6)
  29-84
Core
2.116
QIVGGVYLLPRRGPRLGVRATR
72






KTSERSQPRGRRQPIPKDRRST







GKSWGKPGYPWP






HCV-3b (isolate Tr-Kj)
  29-84
Core
1.548
QIVGEVYVLPRRGPQLGVREVR
73






KTSERSQPRGRRQPTPKARPRE







GRSWAQPGYPWP






HCV-2a (isolate JFH-1)
  29-84
Core
1.370
QIVGGVYLLPRRGPRLGVRTTR
74






KTSERSQPRGRRQPIPKDRRST







GKAWGKPGRPWP






HCV-3b (isolate Tr-Kj)
  57-112
Core
2.551
QPRGRRQPTPKARPREGRSWAQ
75






PGYPWPLYGNEGCGWAGWLLPP







RGSRPSWGQNDP






HCV-2a (isolate HC-J6)
  57-112
Core
1.902
QPRGRRQPIPKDRRSTGKSWGK
76






PGYPWPLYGNEGLGWAGWLLSP







RGSRPSWGPNDP






HCV-3a (isolate NZL1)
  57-112
Core
1.798
QPRGRRQPIPKARRSEGRSWAQ
77






PGYPWPLYGNEGCGWAGWLLSP







RGSRPSWGPNDP






HCV-2a (isolate JFH-1)
  85-140
Core
2.926
LYGNEGLGWAGWLLSPRGSRPS
78






WGPTDPRHRSRNVGKVIDTLTC







GFADLMGYIPVV






HCV-2a (isolate HC-J6)
  85-140
Core
2.829
LYGNEGLGWAGWLLSPRGSRPS
79






WGPNDPRHRSRNVGKVIDTLTC







GFADLMGYIPVV






HCV-2b (isolate
  85-140
Core
2.733
LYGNEGCGWAGWLLSPRGSRPT
80


JPUT971017)



WGPSDPRHRSRNLGRVIDTITC







GFADLMGYIPVV






HCV-2k (isolate VAT96)
  85-140
Core
1.321
LYGNEGLGWAGWLLSPRGSRPS
81






WGPTDPRHRSRNLGKVIDTLTC







GFADLMGYIPVV






HCV-2c (isolate BEBE1)
 421-476
E2
1.596
HINRTALNCNDSLETGFLAALF
82






YTSSFNSSGCPERLAACRSIES







FRIGWGSLEYEE






HCV-2c (isolate BEBE1)
 645-700
E2
2.197
QAACNFTRGDRCNLEDRDRSQL
83






SPLLHSTTEWAILPCSYTDLPA







LSTGLLHLHQNI






HCV-2c (isolate BEBE1)
 505-560
E2
1.908
CGPVYCFTPSPVVVGTTDRAGA
84






PTYNWGENETDVFLLNSTRPPK







GAWFGCTWMNGT






HCV-2k (isolate VAT96)
 505-560
E2
1.879
CGPVYCFTPSPVVVGTTDRRGV
85






PTYTWGENDTDVFLLNSTRPPR







GAWFGCTWMNST






HCV-2b (isolate
 505-560
E2
1.563
CGPVYCFTPSPVVVGTTDRQGV
86


JPUT971017)



PTYNWGDNETDVFLLNSTRPPR







GAWFGCTWMNGT






HCV-2b (isolate HC-J8)
 561-616
E2
1.735
GFTKTCGAPPCRIRKDYNSTID
87






LLCPTDCFRKHPDATYLKCGAG







PWLTPRCLVDYP






HCV-2b (isolate
1681-1736
NS4A
2.035
TGCISIIGRIHLNDQVVVAPDK
88


JPUT971017)



EILYEAFDEMEECASKAALIEE







GQRMAEMLKSKI






HCV-2b (isolate HC-J8)
2101-2156
NS5A
1.726
GSFSYVTGLTSDNLKVPCQVPA
89






PEFFSWVDGVQIHRFAPVPGPF







FRDEVTFTVGLN






HCV-2b (isolate
2157-2212
NS5A
1.518
SLVVGSQLPCDPEPDTEVLASM
90


JPUT971017)



LTDPSHITAETAARRLARGSPP







SQASSSASQLSA






HCV-3a (isolate k3a)
2437-2492
NS5B
2.676
TGALITPCSAEEEKLPISPLSN
91






SLLRHHNLVYSTSSRSASQRQK







KVTFDRLQVLDD






HCV-3k (isolate JK049)
2437-2492
NS5B
2.533
ALITPCAAEEEKLPISPLSNSL
92






LRHHNLVYSTSSRSAAQRQKKV







TFDRLQVLDDHY






HCV-2b (isolate HC-J8)
2465-2520
NS5B
1.586
INPLSNSLMRFHNKVYSTTSRS
93






ASLRAKKVTFDRVQVLDAHYDS







VLQDVKRAASKV






HCV-3k (isolate JK049)
2493-2548
NS5B
2.245
NTTLKEIKELASGVKAELLSVE
94






EACRLVPSHSARSKFGYGAKEV







RSLSSKAINHIN






HCV-3a (isolate k3a)
2521-2576
NS5B
3.065
LVPPHSARSKFGYSAKDVRSLS
95






SKAINQIRSVWEDLLEDTTTPI







PTTIMAKNEVFC






HCV-2b (isolate HC-J8)
2521-2576
NS5B
1.878
SARLLTVEEACALTPPHSAKSR
96






YGFGAKEVRSLSRRAVNHIRSV







WEDLLEDQHTPI






HCV-2a (isolate JFH-1)
2521-2576
NS5B
1.422
SARLLTLEEACQLTPPHSARSK
97






YGFGAKEVRSLSGRAVNHIKSV







WKDLLEDPQTPI






HCV-3a (isolate k3a)
2549-2604
NS5B
2.591
IRSVWEDLLEDTTTPIPTTIMA
98






KNEVFCVDPAKGGRKAARLIVY







PDLGVRVCEKRA






HCV-3a (isolate NZL1)
2549-2604
NS5B
2.062
IRSVWEDLLEDTTTPIPTTIMA
99






KNEVFCVDPAKGGRKPARLIVY







PDLGVRVCEKRA






HCV-2b (isolate HC-J8)
2549-2604
NS5B
1.767
EVRSLSRRAVNHIRSVWEDLLE
100






DQHTPIDTTIMAKNEVFCIDPT







KGGKKPARLIVY






HCV-2b (isolate HC-J8)
2661-2716
NS5B
1.409
YDTRCFDSTVTERDIRTEESIY
101






QACSLPQEARTVIHSLTERLYV







GGPMTNSKGQSC






HCV-2a (isolate HC-J6)
2745-2800
NS5B
1.565
CKAAGIIAPTMLVCGDDLVVIS
102






ESQGTEEDERNLRAFTEAMTRY







SAPPGDPPRPEY
















TABLE 3C

The top significant SNPs associated with VES with p-value < 10−7

| Chromosome | SNP | Position (hg19) | Gene Symbol | Allele¹ | VES | nonVES | P value | OR² |
|---|---|---|---|---|---|---|---|---|
| 3 | rs34725101 | 49401262 | RHOA | A/C | 0.15 | 0.03 | 1.23E−10 | 6.03 |
| 9 | rs4483229 | 111980234 | EPB41L4B | A/G | 0.32 | 0.14 | 9.14E−08 | 2.96 |
| 16 | rs16960234 | 83356008 | CDH13 | G/A | 0.19 | 0.06 | 9.20E−08 | 3.75 |

¹Alleles: Minor/Major allele
²OR: Odds Ratio














TABLE 4

VES linked SNPs were associated with disease risk

| SNP | Genotype | PC¹ (%) | HR² (%) | HCC³ (%) | OR⁴ (95% CI), HCC vs. PC | P-value, HCC vs. PC | OR⁴ (95% CI), HR vs. PC | P-value, HR vs. PC |
|---|---|---|---|---|---|---|---|---|
| rs34725101 | CC | 391 (97.3%) | 269 (83.3%) | 88 (71.0%) | Ref | Ref | Ref | Ref |
| | CA | 11 (2.7%) | 54 (16.7%) | 36 (29.0%) | 14.5 (7.12-29.7) | 6.32E−16 | 7.14 (3.70-13.9) | 2.64E−11 |
| | AA | 0 (0.0%) | 0 (0.0%) | 0 (0.0%) | | | | |
| | P-trend⁵ | | | | | 0.44 | | 0.72 |
| rs4483229 | GG | 292 (72.6%) | 217 (67.2%) | 75 (60.5%) | Ref | Ref | Ref | Ref |
| | GA | 102 (25.4%) | 96 (29.7%) | 46 (37.1%) | 1.76 (1.14-2.70) | 0.01 | 1.27 (0.90-1.80) | 0.18 |
| | AA | 8 (2.0%) | 10 (3.1%) | 3 (2.4%) | 1.46 (0.38-5.64) | 0.70 | 1.68 (0.70-4.30) | 0.34 |
| | P-trend | | | | | 0.02 | | 0.09 |
| rs16960234 | AA | 354 (88.1%) | 281 (87.0%) | 98 (79.0%) | Ref | Ref | Ref | Ref |
| | AG | 46 (11.4%) | 39 (12.1%) | 22 (17.7%) | 1.73 (0.99-3.01) | 0.06 | 1.07 (0.70-1.70) | 0.82 |
| | GG | 2 (0.5%) | 3 (0.9%) | 4 (3.2%) | 7.22 (1.30-40.0) | 0.02 | 1.89 (0.30-11.4) | 0.66 |
| | P-trend | | | | | 0.003 | | 0.58 |

¹PC: population control;
²HR: high risk group;
³HCC: hepatocellular carcinoma;
⁴OR: Odds Ratio
⁵P-trend: Calculated by Wald test from logistic regression














TABLE 5A







Viruses in the 61-VES










Rank
Feature
Regulation1
Importance score2





 1
Hepatitis C virus genotype 3b (isolate Tr-Kj) (HCV)
increased
9.33%


 2
Hepatitis C virus genotype 1b (isolate Taiwan) (HCV)
increased
8.13%


 3
Hepatitis C virus genotype 1a (isolate 1) (HCV)
increased
7.64%


 4
Human cytomegalovirus (strain AD 169) (HHV-5) (Human herpesvirus 5)
increased
5.01%


 5
Hepatitis C virus genotype 6g (isolate JK046) (HCV)
increased
3.91%


 6
Epstein-Barr virus (strain B95-8) (HHV-4) (Human herpesvirus 4)
decreased
3.70%


 7
Human rhinovirus 23 (HRV-23)
decreased
3.38%


 8
Human cytomegalovirus (strain Towne) (HHV-5) (Human herpesvirus 5)
decreased
3.04%


 9
Hepatitis C virus genotype 1b (isolate BK) (HCV)
increased
3.04%


10
Human herpesvirus 2 (strain HG52) (HHV-2) (Human herpes simplex
decreased
3.01%



virus 2)




11
Hepatitis C virus genotype 1c (isolate HC-G9) (HCV)
increased
2.81%


12
Human herpesvirus 3 (HHV-3) (Varicella-zoster virus)
decreased
2.53%


13
Varicella-zoster virus (strain Dumas) (HHV-3) (Human herpesvirus 3)
decreased
2.46%


14
Cercopithecine herpesvirus 16 (CeHV-16) (Herpesvirus papio 2)
decreased
2.45%


15
Human adenovirus C serotype 2 (HAdV-2) (Human adenovirus 2)
decreased
2.34%


16
Human astrovirus-1 (HAstV-1)
decreased
2.19%


17
Human respiratory syncytial virus
decreased
2.10%


18
Hepatitis C virus genotype 1b (strain HC-J4) (HCV)
increased
1.92%


19
Human herpesvirus 6B (strain Z29) (HHV-6 variant B) (Human B
decreased
1.86%



lymphotropic virus)




20
Hepatitis C virus genotype 4a (isolate ED43) (HCV)
increased
1.77%


21
Human herpesvirus 7 (strain JI) (HHV-7) (Human T lymphotropic virus)
decreased
1.72%


22
Hepatitis delta virus (HDV)
increased
1.52%


23
Human rhinovirus 14 (HRV-14)
decreased
1.48%


24
Lordsdale virus (strain GII/Human/United Kingdom/Lordsdale/1993)
decreased
1.44%



(Human enteric calicivirus) (Hu/NV/LD/1993/UK)




25
Human herpesvirus 1 (strain KOS) (HHV-1) (Human herpes simplex
decreased
1.42%



virus 1)




26
Human metapneumovirus (strain CAN97-83) (HMPV)
decreased
1.40%


27
Coxsackievirus A16 (strain G-10)
decreased
1.31%


28
Epstein-Barr virus (strain AG876) (HHV-4) (Human herpesvirus 4)
decreased
1.17%


29
Cowpox virus (CPV)
decreased
1.14%


30
Hepatitis C virus genotype 5a (isolate EUH1480) (HCV)
increased
1.12%


31
Human cytomegalovirus (HHV-5) (Human herpesvirus 5)
increased
1.11%


32
Human herpesvirus 1 (strain 17) (HHV-1) (Human herpes simplex virus 1)
decreased
0.96%


33
Human adenovirus E serotype 4 (HAdV-4) (Human adenovirus 4)
decreased
0.94%


34
Human adenovirus F serotype 40 (HAdV-40) (Human adenovirus 40)
decreased
0.87%


35
Crimean-Congo hemorrhagic fever virus (strain Nigeria/IbAr10200/1970) (CCHFV)
increased
0.77%


36
Tanapox virus
decreased
0.76%


37
Human adenovirus C serotype 5 (HAdV-5) (Human adenovirus 5)
decreased
0.72%


38
Rhinovirus B
decreased
0.70%


39
Human herpesvirus 8 (HHV-8) (Kaposi's sarcoma-associated herpesvirus)
decreased
0.61%


40
Human herpesvirus 6A (strain Uganda-1102) (HHV-6 variant A) (Human B lymphotropic virus)
decreased
0.59%


41
Hepatitis C virus genotype 1b (isolate HC-J1) (HCV)
increased
0.57%


42
Influenza A virus (strain A/USSR/90/1977 H1N1)
increased
0.50%


43
Human rhinovirus A serotype 89 (strain 41467-Gallo) (HRV-89)
decreased
0.50%


44
Norovirus MD145 (isolate GII/Human/United States/MD145-12/1987) (Hu/NLV/GII/MD145-12/1987/US)
decreased
0.42%


45
Molluscum contagiosum virus subtype 1 (MOCV) (MCVI)
decreased
0.41%


46
Vaccinia virus (strain Copenhagen) (VACV)
decreased
0.36%


47
Poliovirus type 1 (strain Sabin)
decreased
0.33%


48
Orf virus (ORFV)
decreased
0.32%


49
Human herpesvirus 2 (strain 333) (HHV-2) (Human herpes simplex virus 2)
decreased
0.32%


50
Influenza A virus (strain A/Bangkok/1/1979 H3N2)
increased
0.30%


51
Hepatitis C virus genotype 1c (isolate India) (HCV)
increased
0.26%


52
Hepatitis B virus (HBV)
decreased
0.26%


53
Epstein-Barr virus (strain GD1) (HHV-4) (Human herpesvirus 4)
decreased
0.25%


54
Human parainfluenza 3 virus (strain Wash/47885/57) (HPIV-3) (Human parainfluenza 3 virus (strain NIH 47885))
decreased
0.21%


55
Human herpesvirus 2 (HHV-2) (Human herpes simplex virus 2)
decreased
0.15%


56
Human enterovirus 71 (strain BrCr) (Ev 71)
decreased
0.13%


57
Human herpesvirus 6A (strain GS) (HHV-6 variant A) (Human B lymphotropic virus)
decreased
0.12%


58
Chapare virus (isolate Human/Bolivia/810419/2003)
increased
0.10%


59
Cercopithecine herpesvirus 1 (CeHV-1) (Simian herpes B virus)
decreased
0.05%


60
Influenza B virus (strain B/Yamagata/16/1988)
decreased
0.04%


61
Influenza A virus (strain A/Philippines/2/1982 H3N2)
decreased
0.02%





[1] Regulation: the frequency of the virus feature is higher in the disease population (increased) or lower (decreased).



[2] Importance score: the improvement in accuracy brought by a feature to the decision tree branches it is on. The higher the score, the more important the feature is to the model prediction.
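The "Regulation" label defined in the footnote above can be illustrated with a minimal sketch. It assumes each virus feature is recorded per sample as detected (1) or not detected (0); the function name and the toy data are hypothetical and are not taken from the disclosure.

```python
from typing import Sequence


def regulation(case_hits: Sequence[int], control_hits: Sequence[int]) -> str:
    """Label a virus feature by comparing its detection frequency in the
    disease population (cases) with its frequency in the controls."""
    if not case_hits or not control_hits:
        raise ValueError("both groups need at least one sample")
    freq_cases = sum(case_hits) / len(case_hits)
    freq_controls = sum(control_hits) / len(control_hits)
    if freq_cases > freq_controls:
        return "increased"
    if freq_cases < freq_controls:
        return "decreased"
    return "unchanged"


# Hypothetical toy data: 1 = antibodies to the feature detected, 0 = not detected.
print(regulation([1, 1, 0, 1], [0, 1, 0, 0]))  # feature more frequent in cases
print(regulation([0, 0, 1, 0], [1, 1, 0, 1]))  # feature less frequent in cases
```

The sketch only reproduces the direction label; the importance score itself comes from the fitted tree model and is not recomputed here.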














TABLE 5B

Most frequent epitopes from the 61-VES

| Virus Name | Epitope sequence | SEQ ID NO: |
| --- | --- | --- |
| Herpesvirus 2 (strain 333) (HHV-2) (Human herpes simplex virus 2) | ELSDTTNATQPELVPEDPEDSALLEDPAGTVSSQIPPNWHIPSIQDVAPHHAPAAP | 1 |
| Cercopithecine herpesvirus 1 (CeHV-1) (Simian herpes B virus) | EVVETANVTRPELAPEERGTSRTPGDEPAPAVAAQLPPNWHVPEASDVTIQGPAPA | 2 |
| Herpesvirus 3 (strain Dumas) (HHV-3) | EKPNATDTPIEEIGDSQNTEPSVNSGFDPDKFREAQEMIKYMTLVSAAERQESKAR | 3 |
| Human respiratory syncytial virus | TPSAESTPQSTTVKTKNTTTTQIQPSKPTTKQRQNKPQNKPNNDFHFEVFNFVPCS | 4 |
| Cowpox virus (CPV) | IDYDDNKDDDKDDDKDDNKDDDKDDNKDDDKDDKDDNKDDDSDSDSDSDSDSDSDD | 5 |
| Rhinovirus A (serotype 89 strain 41467-Gallo) (HRV-89) | LYSHIKEEDRRRSSAAQAMEAIFQGIDLQSPPPPAIADLLRSVKTPEIIKYCQDNN | 6 |
| Influenza B virus (strain B/Yamagata/16/1988) | MSNMDIDGINTGTIDKTPEEITSGTSGTTRPIIRPATLAPPSNKRTRNPSPERATT | 7 |
| HCV 3b (isolate Tr-Kj) | LYQQYDEMEECSQSAPYIEQAQAIAQQFKDKVLGLLQRASQQEAEIRPIVQSQWQK | 8 |
| HCV 1b (isolate BK) | YPGHVSGHRMAWDMMMNWSPTTALVVSQLLRIPQAVVDMVAGAHWGVLAGLAYYSM | 9 |
| HCV 4a (isolate ED43) | RRKRTVQLTESVVSTALAELAAKTFGQSEPSSDRDTDLTTPTETTDSGPIVVDDAS | 10 |
| Cytomegalovirus (HHV-5) | LTVTYSSHTTSAAHSRSGSVSQRVTSSQTVSHGVNETIYNTTLKYGDVVGVNTTKY | 11 |
| Influenza A virus (Bangkok H3N2) | CITPNGSIPNDKPFQNVNKITYGACPKYVKQNTLKLATGMRNVPEKQTR | 12 |
| EBV (strain B95-8) (HHV-4) | TSGATAAASAAAAVDTGSGGGGQPHDTAPRGARKKQ | 13 |
| Herpesvirus 2 (strain HG52) (HHV-2) | MTSRPADQDSVRSSASVPLYPAASPVPAEAYYSESEDEAANDFLVRMGRQQSVLRR | 14 |
| Cercopithecine herpesvirus 16 (CeHV-16) | MEPPRPPDADSLLSDATSVIPLTPPAQGAEAYYTESDDETAADFLMRMGRQQTALR | 15 |
| Human herpesvirus 6B (strain Z29) (HHV-6 variant B) | NFIKISLGETMGITPKEPTNPTQLLNVKNQTEYANETHSTEVQTVKTFKEDRFQRT | 16 |
| Herpesvirus 1 (strain KOS) (HHV-1) | EDEYLSEEMMELTARALERGNGEWSTDAALEVAHEAEALVSQLGNAGEVFNFGDFG | 17 |
| Herpesvirus 1 (strain 17) (HHV-1) | RRHTQKAPKRIRLPHIREDDQPSSHQPLFY | 18 |
| Adenovirus E serotype 4 (HAdV-4) | QPPLEAPYVPPRYLAPTEGRNSIRYSELTPLYDTTRLYLVDNKSADIASLNYQNDH | 19 |
| Orf virus (ORFV) | SGSRESGSRESGSRESGSREVRESGVRETEVQVVRVRQESGGRVTAPSESRKKFLD | 20 |
| Influenza A virus (Philippines H3N2) | QNLPGNDNSTATLCLGHHAVPNGTLVKTITNDQIEVTNATELVQSSSTGRICDSPH | 21 |
| Herpesvirus 3 (HHV-3) (Varicella-zoster virus) | TELYTSAASRKPDPAVAPTSAASRKPDPAVAPTSAATRKPDPAVAPTSAATRKPDP | 22 |
| Lordsdale virus (Human enteric calicivirus) (Hu/NV/LD/1993/UK) | LSSMAVTFKRALGGRAKQPPPRETPQRPPRPPTPELVKKIPPPPPNGEDELVVSYS | 23 |
| Norovirus MD145 | MKMASNDASAAAVANSNNDTAKSSSDGVLSSMAITFKRALGARPKQPPPREILQRP | 24 |
| Vaccinia virus (strain Copenhagen) (VACV) | MDGTLFPGDDDLAIPATEFFSTKADKKPEAKREAIVKADEDDNEETLKQRLTNLEK | 25 |
| Herpesvirus 2 (HHV-2) | DADDHAASFGGLAAAAAGAAGVARKRAFHGDDPFGEGPPEKKDLTLDML | 26 |
| Human rhinovirus 23 (HRV-23) | KGIIAQNPIENYVDEVLNEVLVVPNINSSHPTTSNSAPALDAAETGHTSNVQPEDV | 27 |
| Herpesvirus 7 (strain JI) (HHV-7) | MGSKCCKTIHGGIFSKAEDTLVDYKGKYINLEKEFSALSDTESEEELQLEKPLLNK | 28 |
| Tanapox virus | MDFMSKYSKELVLTAKNIKDEEPNLNKKETSFDLSTYLKTKETHYQKKIRDQLAEK | 29 |
| Poliovirus type 1 (strain Sabin) | IDNTVRETVGAATSRDALPNTEASGPAHSKEIPALTAVETGATNPLVPSDTVQTRH | 30 |
| Human parainfluenza 3 virus (HPIV-3) (Human parainfluenza 3 virus (strain NIH 47885)) | RLNKRLNDKKKQGSQPSTNPTNRTNQDEIDDLFNAFGSN | 31 |
| Human herpesvirus 6A (strain GS) (HHV-6 variant A) | TTNATQKIESTTFTTIGIKEINGNTYSSPKNSIYLKSKSQQSTTKFTDAEHTTPIL | 32 |
| HCV 1b (isolate Taiwan) | MSTNGKPQRKTKRNTNRRPQDVKFPGGGQIVGGVYLLPRRGPRLGVRATRKTWERS | 33 |
| HCV 1a (isolate 1) | MSTNPKPQKKNKRNTNRRPQDVKFPGGGQIVGGVYLLPRRGPRLGVRATRKTSERS | 34 |
| HCV 6g (isolate JK046) (HCV) | MSTNPKPQRQTKRNTNRRPQDVKFPGGGQIVGGVYLLPRRGPRLGVRATRKTSERS | 35 |
| HCV 5a (isolate EUH1480) | NITRVEAENKVEILDCFKPLKEEEDDREISVSADCFKKGPAFPPALPVWARPGYDP | 36 |
| Crimean-Congo hemorrhagic fever virus (strain Nigeria/IbAr10200/1970) (CCHFV) | VRLPHIYHEGVFIPGTYKIVIDKKNKLNDRCTLFTDCVIKGREVRKGQSVLRQYKT | 37 |
| HCV 1b (isolate HC-J1) | MSTIPKPQRKTKRNTNRRPQDVKFPGGGQIVGGVYLLPRRGPRLGVRATRKTSERS | 38 |
| Influenza A virus (strain A/USSR/90/1977 H1N1) | SSAGLKNDLLENLQAYQKRMGVQMQRFK | 39 |
| HCV 1c (isolate India) | ITRVESENKIVVLDSFDPLVAEEDDREISIPAEILRKFKQFPPAMPIWARPDYNPP | 40 |
| Chapare virus (isolate Human/Bolivia/810419/2003) | VKKRENMFIDERPGNRNPYENLLYKLCLSGEGWPYIGSRSQVKGRSWENTTVDLSL | 41 |
| Astrovirus-1 (HAstV-1) | EDIETDTDIESTEDEDEADRFDIIDTSDEEDENETDRVTLLSTLVNQGMTMTRATR | 42 |
| Adenovirus C serotype 5 (HAdV-5) | MAPKKKLQLPPPPTDEEEYWDSQAEEVLDEEEEDMMEDWESLDEEASEVEEVSDET | 43 |
| Human herpesvirus 6A (strain Uganda-1102) (HHV-6 variant A) | EPPAGILAGPQVKPQEKPPAEPPAGLPAGPQAKPPVKPQAKPPAEPPVGILAGPQA | 44 |
| Cytomegalovirus (strain AD169) (HHV-5) | TASGEEVAVLSHHDSLESRRLREEEDDDDDEDFEDA | 45 |
| HCV 1c (isolate HC-G9) | GSSTTSGVTSGEAAESSPAPSCDGELDSEAESYSSMPPLEGEPGDPDLSDGSWSTV | 46 |
| Cytomegalovirus (strain Towne) (HHV-5) | LDGQTGTQDKGQKPNLLDRLRHRKNGYRHLKDSDEEENV | 47 |
| EBV (strain AG876) (HHV-4) | TGSSQAAPSSSSVAPVASLSGDLEEEEEGSRESPSLPSSKKGADEFEAWLEAQDAN | 48 |
| HBV | QHFRKLLLLDEEAGPLEEELPRLADEGLNRRVAEDLNLGNLNVSIPWTHKVGNFTG | 49 |
| HDV | PSMQGIPESRFTRTGEGLDVRGSRGFPQDILFPSDPPFSPQSCRPQ | 50 |
| HCV 1b (strain HC-J4) | VIVGRIILSGKPAVVPDREVLYQEFDEMEECASQLPYIEQGMQLAEQFKQKALGLL | 51 |
| Adenovirus C serotype 2 (HAdV-2) | GGNNSGSGAEENSNAAAAAMQPVEDMNDHAIRGDTFATRAEEKRAEAEAAAEAAAP | 52 |
| Rhinovirus 14 (HRV-14) | TGQVYLLSFISACPDFKLRLMKDTQTISQTVALTEGLGDELEEVIVEKTKQTVASI | 53 |
| Metapneumovirus (strain CAN97-83) (HMPV) | NFSSLGLTDEEKEAAEHFLNVSDDSQNDYE | 54 |
| Coxsackievirus A16 (strain G-10) | QVEPTAANTNASEHRLGTGLVPALQAAETGASSNAQDENLIETRCVLNHHSTQETT | 55 |
| Adenovirus F serotype 40 (HAdV-40) | EGVLRCYHGLEMIQKEQLVEMDVASENAQRALKEHPSRAKVVQNRWGRSVVQLKND | 56 |
| Rhinovirus B | TGQVHLLSFISACPDFKLRLMKDTQTISQTDALTEGLGDELEEVIVEKTKQTLASV | 57 |
| Herpesvirus 8 (HHV-8) (Kaposi's sarcoma-associated herpesvirus) | EEQEQELEEQEQELEEQEQELEEQEQELEEQEQELEEQEQELEEQEQELEEQEQEL | 58 |
| Molluscum contagiosum virus subtype 1 (MOCV) (MCVI) | AQAQQAQQAQAQQAQQAQQAQQAQQAQQAQQAQAQQAQQAQQAQAQQAQAQQAQAQ | 59 |
| EBV (strain GD1) (HHV-4) | SGSGPRHRDGARRPPKRPSCIGC | 60 |
| Enterovirus 71 (strain BrCr) (Ev 71) | SAIGNTIEALFQGPPKFRPIRISLEEKPAPDAISDLLASVDSEEVRQYCREQGWII | 61 |
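Several HCV epitopes in Table 5B differ at only a few residues, and the claims refer to peptides with a stated percent sequence identity to the listed SEQ ID NOs. As a rough illustration, the sketch below computes ungapped, position-by-position identity between SEQ ID NO: 34 and SEQ ID NO: 35. This is a simplifying assumption: identity is usually computed from an alignment, and the disclosure does not say which method applies. The function name is hypothetical.

```python
def percent_identity(seq_a: str, seq_b: str) -> float:
    """Ungapped percent identity between two peptides of equal length."""
    if len(seq_a) != len(seq_b):
        raise ValueError("this sketch only handles equal-length peptides")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)


# Sequences as listed in Table 5B (the two table lines of each entry joined).
SEQ_ID_34 = "MSTNPKPQKKNKRNTNRRPQDVKFPGGGQIVGGVYLLPRRGPRLGVRATRKTSERS"  # HCV 1a (isolate 1)
SEQ_ID_35 = "MSTNPKPQRQTKRNTNRRPQDVKFPGGGQIVGGVYLLPRRGPRLGVRATRKTSERS"  # HCV 6g (isolate JK046)

identity = percent_identity(SEQ_ID_34, SEQ_ID_35)
print(f"{identity:.1f}% identity")  # 53 of 56 positions match
```

Under this ungapped measure the two peptides would meet a 90% identity threshold but not a 95% one.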
















TABLE 6

Viruses in the 31-VES

| Viral Strains | Group [1] | LDA score [2] | P-value |
| --- | --- | --- | --- |
| Human cytomegalovirus (strain AD169) (HHV-5) (Human herpesvirus 5) | HCC | −448.17% | 0.00% |
| Hepatitis C virus genotype 1b (isolate Taiwan) (HCV) | HCC | −391.18% | 0.00% |
| Human cytomegalovirus (HHV-5) (Human herpesvirus 5) | HCC | −366.42% | 0.59% |
| Hepatitis C virus genotype 1a (isolate 1) (HCV) | HCC | −342.29% | 0.00% |
| Hepatitis C virus genotype 1c (isolate HC-G9) (HCV) | HCC | −335.99% | 0.01% |
| Hepatitis C virus genotype 3b (isolate Tr-Kj) (HCV) | HCC | −332.30% | 0.00% |
| Hepatitis C virus genotype 1b (isolate HC-J1) (HCV) | HCC | −320.54% | 0.17% |
| Hepatitis C virus genotype 6g (isolate JK046) (HCV) | HCC | −318.87% | 0.00% |
| Hepatitis C virus genotype 1b (isolate BK) (HCV) | HCC | −314.50% | 0.00% |
| Hepatitis C virus genotype 1c (isolate India) (HCV) | HCC | −311.55% | 4.67% |
| Hepatitis C virus genotype 1b (strain HC-J4) (HCV) | HCC | −305.44% | 0.04% |
| Hepatitis C virus genotype 4a (isolate ED43) (HCV) | HCC | −300.05% | 0.04% |
| Hepatitis delta virus (HDV) | HCC | −294.22% | 4.48% |
| Hepatitis C virus genotype 5a (isolate EUH1480) (HCV) | HCC | −290.80% | 0.00% |
| Norovirus MD145 (isolate GII/Human/United States/MD145-12/1987) (Hu/NLV/GII/MD145-12/1987/US) | PC | 265.06% | 3.37% |
| Human astrovirus-1 (HAstV-1) | PC | 294.42% | 0.25% |
| Human adenovirus F serotype 40 (HAdV-40) (Human adenovirus 40) | PC | 300.70% | 0.93% |
| Coxsackievirus A16 (strain G-10) | PC | 302.42% | 0.73% |
| Human metapneumovirus (strain CAN97-83) (HMPV) | PC | 302.53% | 0.67% |
| Lordsdale virus (strain GII/Human/United Kingdom/Lordsdale/1993) (Human enteric calicivirus) (Hu/NV/LD/1993/UK) | PC | 305.18% | 0.16% |
| Human adenovirus C serotype 5 (HAdV-5) (Human adenovirus 5) | PC | 312.39% | 2.52% |
| Cowpox virus (CPV) | PC | 316.74% | 1.48% |
| Cercopithecine herpesvirus 16 (CeHV-16) (Herpesvirus papio 2) | PC | 317.31% | 0.95% |
| Influenza B virus (strain B/Yamagata/16/1988) | PC | 320.08% | 3.51% |
| Poliovirus type 1 (strain Sabin) | PC | 320.92% | 2.12% |
| Human herpesvirus 3 (HHV-3) (Varicella-zoster virus) | PC | 321.38% | 0.91% |
| Human herpesvirus 7 (strain JI) (HHV-7) (Human T lymphotropic virus) | PC | 348.86% | 0.70% |
| Rhinovirus B | PC | 353.43% | 0.60% |
| Human respiratory syncytial virus | PC | 360.56% | 0.87% |
| Human rhinovirus 23 (HRV-23) | PC | 360.78% | 0.05% |
| Human herpesvirus 2 (strain HG52) (HHV-2) (Human herpes simplex virus 2) | PC | 392.74% | 1.88% |

[1] PC: population control; HCC: hepatocellular carcinoma.

[2] LDA score: linear discriminant analysis (LDA) effect size, the degree of consistent difference in relative abundance between features in the two groups.














TABLE 7







High Linkage Disequilibrium (LD) SNPs















SNP
Populations [1]
chr [2]
pos (hg38) [3]
R²
Linked variants
Ref [4]
Alt [5]
eQTL [6]


















rs16960234
EUR
16
83373197
0.61
rs79266488
C
T
Yes


rs16960234
AFR
16
83365352
0.61
rs74034199
A
G



rs16960234
EUR
16
83367052
0.63
rs11643358
A
G
Yes


rs16960234
AFR
16
83282008
0.63
rs60199380
C
T



rs16960234
AFR
16
83284440
0.63
rs74031966
C
T



rs16960234
EUR
16
83364582
0.64
rs17212165
G
C



rs16960234
EUR
16
83365038
0.64
rs113440209
C
G



rs16960234
EUR
16
83365095
0.64
rs78895225
G
A



rs16960234
EUR
16
83369047
0.64
rs150300998
G
A



rs16960234
EUR
16
83369190
0.64
rs76744497
C
G
Yes


rs16960234
EUR
16
83370348
0.64
rs11647809
A
G
Yes


rs16960234
EUR
16
83276951
0.65
rs113648214
A
C



rs16960234
EUR
16
83356473
0.65
rs17284098
A
G



rs16960234
EUR
16
83361752
0.65
rs75150179
C
T



rs16960234
EUR
16
83366520
0.65
rs79901975
T
C
Yes


rs16960234
EUR
16
83366981
0.65
rs11643322
A
C
Yes


rs16960234
EUR
16
83276270
0.66
rs79704908
T
C



rs16960234
EUR
16
83284591
0.66
rs76176371
G
A



rs16960234
EUR
16
83285349
0.66
rs75047464
G
C



rs16960234
EUR
16
83356730
0.66
rs78862640
G
A



rs16960234
EUR
16
83358688
0.66
rs17284265
A
G



rs16960234
EUR
16
83363682
0.66
rs17284390
A
G



rs16960234
EUR
16
83365496
0.66
rs75620104
A
T



rs16960234
EUR
16
83368938
0.66
rs111838458
A
G



rs16960234
EUR
16
83278400
0.67
rs76995467
T
C



rs16960234
EUR
16
83361910
0.67
rs74318409
G
A



rs16960234
EUR
16
83363403
0.67
rs80149860
G
A



rs16960234
EUR
16
83365990
0.67
rs79730292
A
G



rs16960234
EUR
16
83366159
0.67
rs75523793
A
T



rs16960234
EUR
16
83279855
0.68
rs75376892
C
T



rs16960234
EUR
16
83361234
0.68
rs118185902
C
T



rs16960234
AFR
16
83323628
0.69
rs77449464
A
T



rs16960234
AFR
16
83324428
0.69
rs74034143
A
G



rs16960234
EUR
16
83281774
0.70
rs117294751
G
A



rs16960234
EUR
16
83359841
0.71
rs77436711
C
T
Yes


rs16960234
EUR
16
83360949
0.71
rs117352383
T
G



rs16960234
EUR
16
83348547
0.74
rs74541123
C
A



rs16960234
EUR
16
83302079
0.75
rs889729
C
G



rs16960234
EUR
16
83303584
0.75
rs74031990
A
T



rs16960234
EUR
16
83303787
0.75
rs74031991
G
T



rs16960234
EUR
16
83313776
0.77
rs71402061
A
G



rs16960234
EUR
16
83317816
0.77
rs76225392
G
A



rs16960234
AFR
16
83310864
0.78
rs57810667
A
C



rs16960234
AFR
16
83313310
0.78
rs74034109
T
G



rs16960234
EUR
16
83289953
0.80
rs929893
A
T



rs16960234
EUR
16
83290218
0.80
rs929895
T
C



rs16960234
EUR
16
83295757
0.80
rs76816724
T
G



rs16960234
EUR
16
83289363
0.81
rs77652642
G
C



rs16960234
EUR
16
83299164
0.81
rs79282218
T
G



rs16960234
AFR
16
83321413
0.81
rs74034139
C
A



rs16960234
AFR
16
83321685
0.81
rs113201349
T
C



rs16960234
EUR
16
83290746
0.83
rs79858538
G
A



rs16960234
EUR
16
83300167
0.83
rs76296650
A
G



rs16960234
EUR
16
83326962
0.83
rs76158834
C
A



rs16960234
EUR
16
83326986
0.83
rs74538806
G
T



rs16960234
AFR
16
83308296
0.83
rs74034105
G
T



rs16960234
AFR
16
83329699
0.83
rs57066373
T
G



rs16960234
AFR
16
83332460
0.83
rs74034164
A
G



rs16960234
EUR
16
83310319
0.84
rs77635880
G
C



rs16960234
EUR
16
83314717
0.84
rs17282232
G
A,C



rs16960234
EUR
16
83317522
0.84
rs17210046
C
A,T



rs16960234
EUR
16
83326143
0.84
rs12325503
T
C



rs16960234
EUR
16
83326206
0.84
rs10514578
G
A



rs16960234
EUR
16
83326800
0.84
rs76742309
G
A



rs16960234
EUR
16
83329485
0.84
rs75225088
C
T



rs16960234
EUR
16
83330172
0.84
rs76719419
T
C



rs16960234
EUR
16
83311620
0.85
rs77055246
T
C



rs16960234
EUR
16
83321413
0.85
rs74034139
C
A



rs16960234
EUR
16
83325540
0.85
rs76664463
G
A



rs16960234
EUR
16
83325680
0.85
rs76598341
C
A



rs16960234
EUR
16
83324428
0.87
rs74034143
A
G



rs16960234
EUR
16
83291577
0.88
rs77378326
C
G



rs16960234
EUR
16
83323628
0.88
rs77449464
A
T



rs16960234
AFR
16
83303938
0.89
rs74031992
C
T



rs16960234
AFR
16
83304034
0.89
rs74034104
G
T



rs16960234
AFR
16
83313830
0.89
rs74034111
T
C



rs16960234
AFR
16
83315207
0.89
rs16960229
G
A



rs16960234
AFR
16
83315877
0.89
rs74034113
G
C



rs16960234
AFR
16
83315970
0.89
rs74034114
G
T



rs16960234
AFR
16
83323462
0.89
rs78340799
G
A



rs16960234
AFR
16
83325793
0.89
rs74034145
G
A



rs16960234
AFR
16
83325990
0.89
rs74034146
A
G



rs16960234
AFR
16
83326600
0.89
rs57413765
G
A



rs16960234
AFR
16
83326886
0.89
rs74034149
G
C



rs16960234
AFR
16
83327240
0.89
rs74034150
A
G



rs16960234
AFR
16
83327785
0.89
rs74034151
G
C



rs16960234
AFR
16
83341951
0.89
rs74034173
G
C



rs16960234
EUR
16
83308892
0.91
rs2325934
A
C



rs16960234
EUR
16
83309121
0.91
rs80088527
C
A



rs16960234
EUR
16
83309271
0.91
rs75586590
C
G



rs16960234
EUR
16
83309487
0.91
rs111918530
C
T



rs16960234
EUR
16
83309753
0.91
rs76343373
C
T



rs16960234
EUR
16
83297254
0.93
rs78794145
G
C



rs16960234
EUR
16
83307307
0.93
rs75001885
A
G



rs16960234
EUR
16
83319327
0.93
rs10514582
G
A



rs16960234
EUR
16
83335615
0.93
rs112288081
G
A



rs16960234
EUR
16
83346132
0.93
rs17211581
G
A



rs16960234
EUR
16
83302117
0.94
rs79028139
G
C



rs16960234
EUR
16
83318106
0.94
rs75473666
T
C



rs16960234
EUR
16
83319768
0.94
rs10514580
G
T



rs16960234
EUR
16
83320534
0.94
rs79784474
G
A



rs16960234
EUR
16
83336458
0.94
rs76161362
G
C



rs16960234
EUR
16
83339005
0.94
rs79780526
A
G



rs16960234
EUR
16
83341936
0.94
rs17211371
T
C



rs16960234
EUR
16
83342784
0.94
rs78860402
A
C



rs16960234
AFR
16
83319867
0.94
rs74034115
C
G



rs16960234
EUR
16
83313411
0.96
rs79842380
A
T



rs16960234
EUR
16
83327178
0.96
rs79131725
A
T



rs16960234
EUR
16
83327459
0.96
rs17210599
A
G



rs16960234
EUR
16
83330141
0.96
rs77866289
G
A



rs16960234
EUR
16
83340212
0.96
rs10514575
T
C



rs16960234
EUR
16
83344491
0.96
rs1424168
A
G



rs16960234
EUR
16
83325526
0.97
rs77980290
C
T



rs16960234
EUR
16
83321587
0.99
rs112285137
G
T



rs16960234
EUR
16
83321611
0.99
rs80170986
C
T



rs16960234
EUR
16
83321620
0.99
rs75636201
C
G



rs16960234
EUR
16
83321685
0.99
rs113201349
T
C



rs16960234
EUR
16
83322403
1.00
rs16960234
T
C



rs16960234
EUR
16
83322838
1.00
rs17210298
A
G



rs16960234
AFR
16
83321587
1.00
rs112285137
G
T



rs16960234
AFR
16
83321611
1.00
rs80170986
C
T



rs16960234
AFR
16
83321620
1.00
rs75636201
C
G



rs16960234
AFR
16
83322403
1.00
rs16960234
T
C



rs34725101
EUR
3
49363829
1.00
rs34725101
C
A



rs34725101
AFR
3
49363829
1.00
rs34725101
C
A



rs4483229
EUR
9
109217954
1.00
rs4483229
G
A



rs4483229
AFR
9
109217954
1.00
rs4483229
G
A






[1] EUR is European and AFR is African, from the 1000G Phase 1 population.

[2] chr is chromosome.

[3] pos (hg38) is the position on human reference genome version 38.

[4] Ref stands for reference sequence.

[5] Alt stands for alternative sequence.

[6] eQTL is expression quantitative trait locus; eQTL information is from gtexportal.org/home/














TABLE 8

Clinical characteristics of 899 patients and volunteers from NCI-UMD cohort

| Variable | PC [1] (N = 412) | HR [2] (N = 337) | HCC [3] (N = 150) | P-value [4] (PC vs. HCC) | P-value (PC vs. HR) | P-value (HR vs. HCC) |
| --- | --- | --- | --- | --- | --- | --- |
| Age-year | | | | 0.12 | 0.01 | 0.86 |
| Median (Range) | 61 (46-79) | 58 (41-80) | 61 (19-87) | | | |
| Missing data | 0 (0.0) | 0 (0.0) | 2 (1.3) | | | |
| Sex-no. (%) | | | | 1.00 | 0.46 | 0.58 |
| Female | 74 (18.0) | 68 (20.2) | 27 (18.0) | | | |
| Male | 338 (82.0) | 269 (79.8) | 123 (82.0) | | | |
| Missing data | 0 (0.0) | 0 (0.0) | 0 (0.0) | | | |
| Race-no. (%) | | | | <0.0001 | 0.22 | <0.0001 |
| European American | 141 (34.2) | 130 (38.6) | 77 (51.3) | | | |
| African American | 271 (65.8) | 206 (61.1) | 57 (38.0) | | | |
| Asian American | 0 (0.0) | 1 (0.3) | 7 (4.7) | | | |
| Other | 0 (0.0) | 0 (0.0) | 2 (1.3) | | | |
| Missing data | 0 (0.0) | 0 (0.0) | 7 (4.7) | | | |
| HCV only-no. (%) (diagnosed positive) | | 272 (80.7) | 68 (45.3) | | | 0.08 |
| HBV only-no. (%) (diagnosed positive) | | 8 (2.4) | 6 (4.0) | | | |
| HBV + HCV-no. (%) (diagnosed positive) | | 14 (4.2) | 6 (4.0) | | | |
| HBV + HDV-no. (%) (diagnosed positive) | | 0 (0.0) | 0 (0.0) | | | |
| Others not hepatitis infection | | 0 (0.0) | 0 (0.0) | | | |
| Cirrhosis-no. (%) (diagnosed positive) | | 163 (48.4) | 80 (53.3) | | | <0.00001 |
| Missing data | | 2 (0.6) | 47 (31.3) | | | |
| Alanine aminotransferase-no. (%) | | | | | | <0.001 |
| Elevated (>50 U/L) | | 108 (32.0) | 57 (38.0) | | | |
| Normal (<50 U/L) | | 210 (62.3) | 51 (34.0) | | | |
| Alpha-fetoprotein-no. (%) | | | | | | <0.00001 |
| >20 ng/mL | | 15 (4.5) | 38 (25.3) | | | |
| ≤20 ng/mL | | 99 (29.4) | 34 (22.7) | | | |
| Missing data | | | | | | |
| Survival (months) | | | | | | |
| Median | | | 25.4 | | | |
| Range | | | 0.5 to >40 | | | |
| Missing data (%) | | | 12 (8.0) | | | |

[1] PC: population control.

[2] HR: high risk group.

[3] HCC: hepatocellular carcinoma.

[4] P-value: calculated by t-test or chi-square test, two-tailed.
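The footnote names the tests but not the calculation. As a hedged illustration, the sketch below applies an uncorrected Pearson chi-square test to the sex counts reported in Table 8. Whether the authors used a continuity correction, an exact test, or particular software is not stated, so this is an assumption and small differences from the tabulated values are expected. The function name is hypothetical.

```python
import math


def chi_square_2x2(a: int, b: int, c: int, d: int) -> tuple[float, float]:
    """Pearson chi-square test (no continuity correction) for the 2x2 table
    [[a, b], [c, d]]; returns (statistic, two-sided p-value)."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Sex counts from Table 8, each row given as (female, male).
_, p_hr_vs_hcc = chi_square_2x2(68, 269, 27, 123)   # HR vs. HCC
_, p_pc_vs_hcc = chi_square_2x2(74, 338, 27, 123)   # PC vs. HCC

print(f"HR vs. HCC: p = {p_hr_vs_hcc:.2f}")
print(f"PC vs. HCC: p = {p_pc_vs_hcc:.2f}")
```

Under these assumptions the HR vs. HCC comparison comes out close to the tabulated 0.58, and the PC vs. HCC comparison is close to 1, consistent with the near-identical sex distribution in those two groups.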














TABLE 9A

Univariable and multivariable analyses of factors associated with survival of NCI-UMD cohort

| Clinical variable | Univariable analysis [a]: hazard ratio (95% CI) | P-value | Multivariable analysis [b]: hazard ratio (95% CI) | P-value |
| --- | --- | --- | --- | --- |
| VES | 1.40 (1.04-1.82) | 0.025 | 2.17 (0.39-12.08) | 0.377 |
| Age | 0.86 (0.54-1.38) | 0.528 | 1.02 (0.97-1.08) | 0.359 |
| HCV (diagnosis positive versus negative) | 1.40 (0.80-2.58) | 0.221 | 0.70 (0.26-1.90) | 0.479 |
| HBV (diagnosis positive versus negative) | 0.79 (0.31-1.98) | 0.614 | 1.19 (0.35-4.04) | 0.777 |
| Cirrhosis (1 versus 0) | 1.20 (0.64-2.30) | 0.554 | 0.90 (0.36-2.24) | 0.824 |
| AFP (>=20 ng/ml versus <20 ng/ml) | 2.70 (1.34-5.44) | 0.005 | 2.92 (1.21-7.08) | 0.018 |
| ALT (>=50 U/L versus <50 U/L) | 1.40 (0.83-2.40) | 0.210 | 1.76 (0.76-4.10) | 0.189 |

[a] Univariable Cox regression.

[b] Multivariable Cox regression.














TABLE 9B

Multivariable Cox regression of HCC diagnosis on the NIDDK cohort predicted with the VES signature at baseline (adjusted for clinical prognostic variables)

Number of events: 36

| Variable | Regression coefficient (standard error) |
| --- | --- |
| VES score | 1.06 (0.69) |
| Hep B | 3.21 (1.41) |
| Hep C | 2.35 (1.37) |
| NAFLD | 0.40 (1.63) |
| Cirrhosis | 1.55 (0.64) |
| Diabetes | 0.80 (0.69) |
| ALT | −0.0004 (0.002) |
| Creatinine | −0.55 (0.85) |
| Albumin | −1.41 (0.67) |
| Bilirubin Tot | −0.35 (0.89) |
| PLT | −0.006 (0.004) |
| Prothromb T | −0.03 (0.23) |
| AFP | −0.04 (0.54) |

| Model statistic | Value |
| --- | --- |
| Concordance (standard error) | 0.817 (0.053) |
| Likelihood ratio test p-value | 0.000002 |
| Wald test p-value | 0.0003 |
| Logrank test p-value | 0.0000004 |
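Table 9B reports Cox regression coefficients rather than hazard ratios. Assuming the entries are log-hazard coefficients with their standard errors, a hazard ratio and an approximate 95% Wald interval can be obtained by exponentiation. The sketch below applies this to two rows of the table. The helper name is hypothetical, and the normal-approximation interval is an assumption rather than a calculation described in the disclosure.

```python
import math


def hazard_ratio(coef: float, se: float, z: float = 1.96) -> tuple[float, float, float]:
    """Convert a Cox log-hazard coefficient and its standard error into
    (hazard ratio, lower 95% bound, upper 95% bound) using a Wald interval."""
    return math.exp(coef), math.exp(coef - z * se), math.exp(coef + z * se)


# Coefficient (standard error) pairs taken from Table 9B.
rows = {"VES score": (1.06, 0.69), "Cirrhosis": (1.55, 0.64)}

for name, (coef, se) in rows.items():
    hr, lower, upper = hazard_ratio(coef, se)
    print(f"{name}: HR = {hr:.2f} (95% CI {lower:.2f}-{upper:.2f})")
```

Under these assumptions the interval for the VES score spans 1, while the interval for cirrhosis lies above 1.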

















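TABLE 9C

Prediction performance within HCV+ and HCV− subcohorts

| Cohort | Subcohort | AUC (95% CI lower) | AUC | AUC (95% CI upper) |
| --- | --- | --- | --- | --- |
| NIDDK at baseline | HCV− | 1 | 1 | 1 |
| NIDDK at baseline | HCV+ | 1 | 1 | 1 |
| NIDDK at diagnosis | HCV− | 1 | 1 | 1 |
| NIDDK at diagnosis | HCV+ | 1 | 1 | 1 |
| NCI-UMD | HCV− | 1 | 1 | 1 |
| NCI-UMD | HCV+ | 0.91 | 0.95 | 0.99 |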


NCI-UMD
HCV−
1
1
1


NCI-UMD
HCV+
0.91
0.95
0.99









In view of the many possible embodiments to which the principles of the disclosed subject matter may be applied, it should be recognized that the illustrated embodiments are only examples of the disclosure and should not be taken as limiting the scope of the disclosure. Rather, the scope of the disclosure is defined by the following claims. We therefore claim all that comes within the scope and spirit of these claims.

Claims
  • 1. A method of identifying a subject with early stage hepatocellular carcinoma (HCC), comprising: (i) detecting the presence or absence of antibodies to a plurality of viruses in a sample obtained from the subject, wherein the plurality of viruses comprises at least 10, at least 20, at least 30, at least 40, at least 50, or at least 60 of the viruses listed in Table 5A; (ii) determining the presence of a viral exposure signature (VES) in the sample obtained from the subject if: (a) antibodies specific for one or more of hepatitis C virus (HCV) genotype 3b, isolate Tr-Kj; HCV genotype 1b, isolate Taiwan; HCV genotype 1a, isolate 1; human cytomegalovirus, strain AD169; HCV genotype 6g, isolate JK046; HCV genotype 1b, isolate BK; HCV genotype 1c, isolate HC-G9; HCV genotype 1b, strain HC-J4; HCV genotype 4a, isolate ED43; hepatitis delta virus; HCV genotype 5a, isolate EUH1480; human cytomegalovirus; Crimean-Congo hemorrhagic fever virus, strain Nigeria/IbAr10200/1970; HCV genotype 1b, isolate HC-J1; influenza A virus, strain A/USSR/90/1977 H1N1; influenza A virus, strain A/Bangkok/1/1979 H3N2; HCV genotype 1c, isolate India; and Chapare virus, isolate Human/Bolivia/810419/2003 are detected in the sample; and/or (b) antibodies specific for one or more of Epstein-Barr virus, strain B95-8; human rhinovirus 23; HCMV, strain Towne; human herpesvirus 2 (HHV-2), strain HG52; human herpesvirus 3; varicella-zoster virus, strain Dumas; Cercopithecine herpesvirus 16; human adenovirus C serotype 2; human astrovirus-1; human respiratory syncytial virus; human herpesvirus 6B, strain Z29; human herpesvirus 7, strain JI; human rhinovirus 14; Lordsdale virus, strain GII/Human/United Kingdom/Lordsdale/1993; human herpesvirus 1, strain KOS; human metapneumovirus, strain CAN97-83; coxsackievirus A16, strain G-10; Epstein-Barr virus, strain AG876; cowpox virus; human herpesvirus 1, strain 17; human adenovirus E serotype 4; human adenovirus F serotype 40; tanapox virus; human adenovirus C serotype 5; rhinovirus B; human herpesvirus 8; human herpesvirus 6A, strain Uganda-1102; human rhinovirus A serotype 89, strain 41467-Gallo; norovirus MD145, isolate GII/Human/United States/MD145-12/1987; molluscum contagiosum virus subtype 1; vaccinia virus, strain Copenhagen; poliovirus type 1, strain Sabin; orf virus; HHV-2, strain 333; hepatitis B virus; Epstein-Barr virus, strain GD1; human parainfluenza 3 virus, strain Wash/47885/57; HHV-2; human enterovirus 71, strain BrCr; human herpesvirus 6A, strain GS; Cercopithecine herpesvirus 1; influenza B virus, strain B/Yamagata/16/1988; and influenza A virus, strain A/Philippines/2/1982 H3N2 are not detected in the sample; and (iii) identifying the subject as having early stage HCC when the VES is present.
  • 2. The method of claim 1, wherein the plurality of viruses comprises the 61 viruses listed in Table 5A.
  • 3. The method of claim 1, wherein the plurality of viruses consists of the 61 viruses listed in Table 5A.
  • 4. The method of claim 1, wherein the plurality of viruses comprises the 31 viruses listed in Table 6.
  • 5. The method of claim 1, wherein the plurality of viruses consists of the 31 viruses listed in Table 6.
  • 6. The method of claim 1, wherein step (ii) comprises determining the presence of the VES in the sample obtained from the subject if: (a) antibodies specific for three or more, five or more, or seven or more of hepatitis C virus (HCV) genotype 3b, isolate Tr-Kj; HCV genotype 1b, isolate Taiwan; HCV genotype 1a, isolate 1; human cytomegalovirus, strain AD169; HCV genotype 6g, isolate JK046; HCV genotype 1b, isolate BK; HCV genotype 1c, isolate HC-G9; HCV genotype 1b, strain HC-J4; HCV genotype 4a, isolate ED43; hepatitis delta virus; HCV genotype 5a, isolate EUH1480; human cytomegalovirus; Crimean-Congo hemorrhagic fever virus, strain Nigeria/IbAr10200/1970; HCV genotype 1b, isolate HC-J1; influenza A virus, strain A/USSR/90/1977 H1N1; influenza A virus, strain A/Bangkok/1/1979 H3N2; HCV genotype 1c, isolate India; and Chapare virus, isolate Human/Bolivia/810419/2003 are detected in the sample; and/or (b) antibodies specific for three or more, five or more, or seven or more of Epstein-Barr virus, strain B95-8; human rhinovirus 23; HCMV, strain Towne; human herpesvirus 2 (HHV-2), strain HG52; human herpesvirus 3; varicella-zoster virus, strain Dumas; Cercopithecine herpesvirus 16; human adenovirus C serotype 2; human astrovirus-1; human respiratory syncytial virus; human herpesvirus 6B, strain Z29; human herpesvirus 7, strain JI; human rhinovirus 14; Lordsdale virus, strain GII/Human/United Kingdom/Lordsdale/1993; human herpesvirus 1, strain KOS; human metapneumovirus, strain CAN97-83; coxsackievirus A16, strain G-10; Epstein-Barr virus, strain AG876; cowpox virus; human herpesvirus 1, strain 17; human adenovirus E serotype 4; human adenovirus F serotype 40; tanapox virus; human adenovirus C serotype 5; rhinovirus B; human herpesvirus 8; human herpesvirus 6A, strain Uganda-1102; human rhinovirus A serotype 89, strain 41467-Gallo; norovirus MD145, isolate GII/Human/United States/MD145-12/1987; molluscum contagiosum virus subtype 1; vaccinia virus, strain Copenhagen; poliovirus type 1, strain Sabin; orf virus; HHV-2, strain 333; hepatitis B virus; Epstein-Barr virus, strain GD1; human parainfluenza 3 virus, strain Wash/47885/57; HHV-2; human enterovirus 71, strain BrCr; human herpesvirus 6A, strain GS; Cercopithecine herpesvirus 1; influenza B virus, strain B/Yamagata/16/1988; and influenza A virus, strain A/Philippines/2/1982 H3N2 are not detected in the sample.
  • 7. A method of identifying a subject as having early stage hepatocellular carcinoma (HCC), comprising: (i) detecting the presence or absence of antibodies specific for a plurality of viruses in a sample obtained from the subject, wherein the plurality of viruses comprises hepatitis C virus (HCV) genotype 3b, isolate Tr-Kj; HCV genotype 1b, isolate Taiwan; HCV genotype 1a, isolate 1; human cytomegalovirus (HCMV) strain AD169; HCV genotype 6g, isolate JK046; Epstein-Barr virus (EBV), strain B95-8; human rhinovirus 23; HCMV strain Towne; HCV genotype 1b, isolate BK; and human herpesvirus 2 (HHV-2), strain HG52; and (ii) identifying the subject as having early stage HCC if: (a) antibodies specific for HCV genotype 3b, isolate Tr-Kj; HCV genotype 1b, isolate Taiwan; HCV genotype 1a, isolate 1; HCMV strain AD169; HCV genotype 6g, isolate JK046; and/or HCV genotype 1b, isolate BK, are detected in the sample; and/or (b) antibodies specific for EBV, strain B95-8; human rhinovirus 23; HCMV strain Towne; and/or HHV-2, strain HG52, are not detected in the sample.
  • 8. The method of claim 7, wherein step (ii) comprises identifying the subject as having early stage HCC if: (a) antibodies specific for at least two, at least three, at least four, at least five or all six of HCV genotype 3b, isolate Tr-Kj; HCV genotype 1b, isolate Taiwan; HCV genotype 1a, isolate 1; HCMV strain AD169; HCV genotype 6g, isolate JK046; and HCV genotype 1b, isolate BK, are detected in the sample; and/or (b) antibodies specific for at least one, at least two, at least three or all four of EBV strain B95-8; human rhinovirus 23; HCMV strain Towne; and/or HHV-2 strain HG52, are not detected in the sample.
  • 9. The method of claim 1, wherein the sample is a blood or serum sample.
  • 10. The method of claim 1, wherein the antibodies are detected by phage immunoprecipitation, immunoblot or enzyme-linked immunosorbent assay.
  • 11. The method of claim 1, further comprising administering an appropriate therapy for the prevention or treatment of HCC.
  • 12. The method of claim 11, wherein the appropriate therapy comprises vaccination against HBV, vaccination against HCV, administration of an anti-viral drug, a lifestyle change or a dietary modification.
  • 13. The method of claim 12, wherein the anti-viral drug is a nucleoside analog, interferon, or lamivudine.
  • 14. The method of claim 12, wherein the lifestyle or diet change includes reducing or eliminating intravenous drug use, reducing or eliminating alcohol consumption, reducing exposure to aflatoxin, or reducing iron overload.
  • 15. The method of claim 11, wherein the appropriate therapy comprises a liver transplant or liver resection.
  • 16. The method of claim 15, further comprising radiofrequency ablation.
  • 17. The method of claim 1, further comprising diagnostic monitoring every 3 months or every 6 months of the subject with early stage HCC.
  • 18. The method of claim 17, wherein diagnostic monitoring comprises ultrasound, computerized tomography (CT), magnetic resonance imaging (MRI), or a combination thereof.
  • 19. The method of claim 1, wherein the subject has not previously had a diagnosis of one or more of liver disease, hepatitis B virus (HBV) infection, hepatitis C virus (HCV) infection, hepatitis delta virus (HDV) infection, nonalcoholic fatty-liver disease (NAFLD), nonalcoholic steatohepatitis (NASH) and hepatocellular carcinoma (HCC).
  • 20. A phage display library expressing unique peptide epitopes from each of the viruses listed in Table 5A or Table 6, or an array comprising unique peptide epitopes from each of the viruses listed in Table 5A or Table 6.
  • 21. The phage display library of claim 20, wherein the peptide epitopes comprise: the peptides of SEQ ID NOs: 1-61; peptides comprising at least 90%, at least 95%, at least 96%, at least 97%, at least 98%, or at least 99% sequence identity to each of SEQ ID NOs: 1-61; the peptides of SEQ ID NOs: 62-102; peptides comprising at least 90%, at least 95%, at least 96%, at least 97%, at least 98%, or at least 99% sequence identity to each of SEQ ID NOs: 62-102; or combinations thereof.
  • 22. (canceled)
  • 23. The array of claim 20, wherein the peptide epitopes comprise: the peptides of SEQ ID NOs: 1-61; peptides comprising at least 90%, at least 95%, at least 96%, at least 97%, at least 98%, or at least 99% sequence identity to each of SEQ ID NOs: 1-61; the peptides of SEQ ID NOs: 62-102; peptides comprising at least 90%, at least 95%, at least 96%, at least 97%, at least 98%, or at least 99% sequence identity to each of SEQ ID NOs: 62-102; or combinations thereof.
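The detected/not-detected logic recited in claims 7 and 8 can be sketched as a simple rule over the set of viruses for which antibodies were detected. This is an illustrative reading of the claim language only, not the model-based VES score described elsewhere in the disclosure. The function name, the parameter names, and the option to require both conditions are hypothetical.

```python
# Marker panels as named in claim 7.
POSITIVE_MARKERS = frozenset({
    "HCV genotype 3b, isolate Tr-Kj",
    "HCV genotype 1b, isolate Taiwan",
    "HCV genotype 1a, isolate 1",
    "HCMV strain AD169",
    "HCV genotype 6g, isolate JK046",
    "HCV genotype 1b, isolate BK",
})
NEGATIVE_MARKERS = frozenset({
    "EBV strain B95-8",
    "human rhinovirus 23",
    "HCMV strain Towne",
    "HHV-2 strain HG52",
})


def meets_claim8_rule(detected, min_detected=2, min_not_detected=1, require_both=False):
    """Apply thresholds in the style of claim 8 to a set of detected viruses.

    (a) at least `min_detected` of the six positive markers are detected;
    (b) at least `min_not_detected` of the four negative markers are NOT detected.
    The claim joins (a) and (b) with "and/or", so either condition suffices
    unless `require_both` is set.
    """
    detected = set(detected)
    a_ok = len(POSITIVE_MARKERS & detected) >= min_detected
    b_ok = len(NEGATIVE_MARKERS - detected) >= min_not_detected
    return (a_ok and b_ok) if require_both else (a_ok or b_ok)


# Hypothetical sample: two positive markers detected, all four negative markers detected.
sample = {"HCV genotype 1b, isolate BK", "HCV genotype 1a, isolate 1"} | NEGATIVE_MARKERS
print(meets_claim8_rule(sample))                     # condition (a) alone is met
print(meets_claim8_rule(sample, require_both=True))  # condition (b) is not met
```

Read literally, the "or" branch is permissive, since the absence of a single negative marker satisfies (b); the `require_both` flag is included only to show the stricter conjunctive reading.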
CROSS REFERENCE TO RELATED APPLICATIONS

This application claims the benefit of U.S. Provisional Application No. 62/914,138, filed Oct. 11, 2019, which is herein incorporated by reference in its entirety.

ACKNOWLEDGMENT OF GOVERNMENT SUPPORT

This invention was made with government support under project number Z01-BC010313 awarded by the National Institutes of Health. The government has certain rights in the invention.

PCT Information
Filing Document Filing Date Country Kind
PCT/US2020/055077 10/9/2020 WO
Provisional Applications (1)
Number Date Country
62914138 Oct 2019 US